Greenplum 4.1 Community Edition mapreduce demo problem and fix
%YAML 1.1
---
VERSION: 1.0.0.1
DEFINE:
  - INPUT:
      NAME: access_logs
      FILE:
        # change seghostname1, seghostname2 and file_path to reflect
        # your runtime file locations
        - gp-single-host:/home/gpadmin/gpmapreduce/data/access_log
        - gp-single-host:/home/gpadmin/gpmapreduce/data/access_log2
  - MAP:
      NAME: grep_map
      LANGUAGE: perl
      FUNCTION: |
        # 0: name the input parameters
        my ($key, $value) = @_;
        # 1: extract the URL portion of the access log
        $value =~ /"GET (.*) HTTP/;
        my $url = $1;
        return [{"key" => $key, "value" => $value}] if ($value =~/$key/);
        return [];
EXECUTE:
  - RUN:
      SOURCE: access_logs
      MAP: grep_map
      REDUCE: IDENTITY
Here is a sample of the access_log files, which are typical Apache access logs:
$ cat data/access_log
10.254.0.52 - - [28/Aug/2008:16:52:13 -0700] "GET / HTTP/1.1" 200 1456
10.254.0.52 - - [28/Aug/2008:16:52:13 -0700] "GET /apache_pb.gif HTTP/1.1" 200 2326
10.254.0.52 - - [28/Aug/2008:16:52:13 -0700] "GET /favicon.ico HTTP/1.1" 404 209
10.254.0.52 - - [28/Aug/2008:16:52:16 -0700] "GET /favicon.ico HTTP/1.1" 404 209
10.254.0.52 - - [28/Aug/2008:16:52:21 -0700] "GET /~mapreduce HTTP/1.1" 301 236
Here is the output of the mapreduce:
$ gpmapreduce -f 1_grep.yml gpmrdemo
WARNING: unset parameter - grep_map(key => NULL)
mapreduce_11694_run_1
key|value
---+------------------------------------------------------------------------------------------------
   |
   |10.254.0.52 - - [28/Aug/2008:16:52:13 -0700] "GET /apache_pb.gif HTTP/1.1" 200 2326
   |10.254.0.52 - - [28/Aug/2008:16:52:13 -0700] "GET /favicon.ico HTTP/1.1" 404 209
   |10.254.0.52 - - [28/Aug/2008:16:52:13 -0700] "GET / HTTP/1.1" 200 1456
You can see there is an issue right away. All the mapreduce does is take each access_log line whole, with no reduction, and place it in the database "value" column, leaving the "key" column empty. The map function even extracts the URL into $url but never uses it. What one would expect is for some key to be extracted, perhaps the IP address, and stored in the database along with a value, which could be the URL. My guess is that this was not the intended YAML file for the demo, or that it was only partially complete.
I have written one fix for this, to show actual mapreduce happening:
%YAML 1.1
---
VERSION: 1.0.0.1
DEFINE:
  - INPUT:
      NAME: access_logs
      FILE:
        # change seghostname1, seghostname2 and file_path to reflect
        # your runtime file locations
        - gp-single-host:/home/gpadmin/gpmapreduce/data/access_log
        - gp-single-host:/home/gpadmin/gpmapreduce/data/access_log2
  - MAP:
      NAME: grep_map
      LANGUAGE: perl
      FUNCTION: |
        # 0: name the input parameters
        my ($key, $value) = @_;
        # 1: extract the IP and URL portion of the access log
        $value =~ m/^([0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}).*"GET (.*) HTTP/;
        my $ip = $1;
        my $url = $2;
        return [{"key" => $ip, "value" => $url}] if ($value =~/$key/);
        return [];
EXECUTE:
  - RUN:
      SOURCE: access_logs
      MAP: grep_map
      REDUCE: IDENTITY
Here is what the output now looks like:
$ gpmapreduce -f 1_grep.yml gpmrdemo
WARNING: unset parameter - grep_map(key => NULL)
mapreduce_17558_run_1
key|value
-----------+----------------------------------------
10.254.0.52|/
10.254.0.52|/apache_pb.gif
10.254.0.52|/favicon.ico
10.254.0.52|/favicon.ico
10.254.0.52|/icons/back.gif
10.254.0.52|/icons/blank.gif
10.254.0.52|/icons/folder.gif
10.254.0.52|/icons/text.gif
10.254.0.52|/icons/unknown.gif
...
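If you want to check the extraction regex without a Greenplum instance, here is a minimal standalone sketch in Python. It is not part of the demo: `grep_map` here is a hypothetical Python stand-in for the Perl map function, and it makes two assumptions that differ from the YAML above. It ignores the `key` parameter (which was unset in the run shown), and it returns no rows when the pattern does not match a line.

```python
import re

# Same pattern as the Perl map function, in Python syntax:
# group 1 is the client IP, group 2 is the requested URL.
LOG_PATTERN = re.compile(
    r'^([0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}).*"GET (.*) HTTP'
)


def grep_map(value):
    """Hypothetical stand-in for the Perl map: one log line in, a list of rows out."""
    match = LOG_PATTERN.search(value)
    if match is None:
        # Assumption: lines that do not look like a GET request are skipped.
        return []
    return [{"key": match.group(1), "value": match.group(2)}]


# Sample lines taken from the access_log excerpt in this post.
sample_lines = [
    '10.254.0.52 - - [28/Aug/2008:16:52:13 -0700] "GET / HTTP/1.1" 200 1456',
    '10.254.0.52 - - [28/Aug/2008:16:52:13 -0700] "GET /apache_pb.gif HTTP/1.1" 200 2326',
    '10.254.0.52 - - [28/Aug/2008:16:52:21 -0700] "GET /~mapreduce HTTP/1.1" 301 236',
]

if __name__ == "__main__":
    for line in sample_lines:
        for row in grep_map(line):
            # Print in the same key|value layout as the gpmapreduce output.
            print(f'{row["key"]}|{row["value"]}')
```

Running it on a few lines from your own logs is a quick way to see whether the pattern picks up the IP and URL you expect before loading anything into the database.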
If anyone has other YAML files with mapreduce functions in them, please let me know. I am always looking for more ways to learn about YAML and MR.