The Greenplum 4.1 Community Edition comes with a MapReduce demo that has two parts.
Part 1 uses Perl and parses multiple Apache access_log files.
Part 2 uses Python and does a word count on the Google whitepaper on MapReduce.
There is a lot of information in access_log files, so it makes sense that someone would want to run a MapReduce job over them and extract meaningful fields such as IP address, date, and URL. The included demo, however, seems to have a bug in it. The demo executes:
gpmapreduce -f $RUNDIR/1_grep.yml gpmrdemo
The YAML file looks like this:
[gpadmin@gp-single-host gpmapreduce]$ cat 1_grep.yml
Here is a sample of the access_log files, which are typical Apache access logs:
[gpadmin@gp-single-host gpmapreduce]$ cat data/access_log
10.254.0.52 - - [28/Aug/2008:16:52:13 -0700] "GET / HTTP/1.1" 200 1456
10.254.0.52 - - [28/Aug/2008:16:52:13 -0700] "GET /apache_pb.gif HTTP/1.1" 200 2326
10.254.0.52 - - [28/Aug/2008:16:52:13 -0700] "GET /favicon.ico HTTP/1.1" 404 209
10.254.0.52 - - [28/Aug/2008:16:52:16 -0700] "GET /favicon.ico HTTP/1.1" 404 209
10.254.0.52 - - [28/Aug/2008:16:52:21 -0700] "GET /~mapreduce HTTP/1.1" 301 236
Here is the output of the MapReduce job:
[gpadmin@gp-single-host gpmapreduce]$ gpmapreduce -f 1_grep.yml gpmrdemo
WARNING: unset parameter - grep_map(key => NULL)
mapreduce_11694_run_1
key|value
---+------------------------------------------------------------------------------------
|
|10.254.0.52 - - [28/Aug/2008:16:52:13 -0700] "GET /apache_pb.gif HTTP/1.1" 200 2326
|10.254.0.52 - - [28/Aug/2008:16:52:13 -0700] "GET /favicon.ico HTTP/1.1" 404 209
|10.254.0.52 - - [28/Aug/2008:16:52:13 -0700] "GET / HTTP/1.1" 200 1456
You can see there is an issue right away. All the MapReduce job does is take each entire access_log line, with no reduction, and place it in the "value" column, leaving the "key" column empty. What one would expect is for some key to be extracted, perhaps the IP address, and stored together with a value such as the URL. My guess is that this was not the intended YAML file for the demo, or that it was only partially complete.
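As a rough illustration of the key/value extraction described above, here is a standalone Python sketch. It is not the demo's YAML file and does not use any Greenplum API. The regular expression, the function names map_line and reduce_counts, and the choice to count requests per (IP, URL) pair are all assumptions made for this example.

```python
import re
from collections import Counter

# Assumed pattern for the access_log lines shown above:
# client IP, two identity fields, [timestamp], "METHOD URL PROTOCOL", status, size.
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<date>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) [^"]*" (?P<status>\d{3}) (?P<size>\S+)'
)

def map_line(line):
    """Hypothetical map step: emit (ip, url) for a parsable line, else nothing."""
    match = LOG_LINE.match(line)
    if match is None:
        return []
    return [(match.group("ip"), match.group("url"))]

def reduce_counts(pairs):
    """Hypothetical reduce step: count how often each (ip, url) pair occurs."""
    return Counter(pairs)

# The five sample lines from data/access_log quoted earlier.
sample = [
    '10.254.0.52 - - [28/Aug/2008:16:52:13 -0700] "GET / HTTP/1.1" 200 1456',
    '10.254.0.52 - - [28/Aug/2008:16:52:13 -0700] "GET /apache_pb.gif HTTP/1.1" 200 2326',
    '10.254.0.52 - - [28/Aug/2008:16:52:13 -0700] "GET /favicon.ico HTTP/1.1" 404 209',
    '10.254.0.52 - - [28/Aug/2008:16:52:16 -0700] "GET /favicon.ico HTTP/1.1" 404 209',
    '10.254.0.52 - - [28/Aug/2008:16:52:21 -0700] "GET /~mapreduce HTTP/1.1" 301 236',
]

# Run the map step over every line, then reduce the emitted pairs.
pairs = [pair for line in sample for pair in map_line(line)]
counts = reduce_counts(pairs)

for (ip, url), n in sorted(counts.items()):
    print(f"{ip}|{url}|{n}")
```

The point of the sketch is only the shape of the result: each output row has a real key (the IP address) and a reduced value, rather than the whole log line copied into the value column.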
I have written one fix for this, to show an actual MapReduce happening:




