Greenplum 4.1 Community Edition mapreduce demo problem and fix

The Greenplum 4.1 Community Edition comes with a mapreduce demo that has two parts.

Part 1 uses Perl to parse multiple Apache access_log files.

Part 2 uses Python to do a word count on the Google whitepaper on mapreduce.

There is a lot of information in access_log files, so it makes sense that someone would want to run a mapreduce job on them and extract meaningful information such as the IP address, date, URL, etc. The included mapreduce demo, however, seems to have a bug in it. The demo executes:

gpmapreduce -f $RUNDIR/1_grep.yml gpmrdemo

The YAML file looks like this:

[gpadmin@gp-single-host gpmapreduce]$ cat 1_grep.yml

%YAML 1.1
VERSION:         1.0.0.1
 
DEFINE:
  - INPUT:
      NAME:      access_logs
      FILE:
         # change seghostname1, seghostname2 and file_path to reflect
         # your runtime file locations
         - gp-single-host:/home/gpadmin/gpmapreduce/data/access_log
         - gp-single-host:/home/gpadmin/gpmapreduce/data/access_log2

  - MAP:
      NAME:      grep_map
      LANGUAGE:  perl
      FUNCTION:  |
        # 0: name the input parameters
        my ($key, $value) = @_;

        # 1: extract the URL portion of the access log
        $value =~ /"GET (.*) HTTP/;
        my $url = $1;

        return [{"key" => $key, "value" => $value}] if ($value =~/$key/);
        return [];

EXECUTE:
  - RUN:
      SOURCE:    access_logs
      MAP:       grep_map
      REDUCE:    IDENTITY

Here is a sample of the access_log files, which are typical Apache access logs:

[gpadmin@gp-single-host gpmapreduce]$ cat data/access_log
10.254.0.52 - - [28/Aug/2008:16:52:13 -0700] "GET / HTTP/1.1" 200 1456
10.254.0.52 - - [28/Aug/2008:16:52:13 -0700] "GET /apache_pb.gif HTTP/1.1" 200 2326
10.254.0.52 - - [28/Aug/2008:16:52:13 -0700] "GET /favicon.ico HTTP/1.1" 404 209
10.254.0.52 - - [28/Aug/2008:16:52:16 -0700] "GET /favicon.ico HTTP/1.1" 404 209
10.254.0.52 - - [28/Aug/2008:16:52:21 -0700] "GET /~mapreduce HTTP/1.1" 301 236
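As a rough illustration of the fields mentioned earlier (IP address, date, URL), here is a minimal Python sketch that splits one of these lines into named parts. The pattern `LOG_PATTERN` and the helper `parse_access_log_line` are hypothetical names made up for this example and are not part of the Greenplum demo. The pattern assumes the log layout shown in the sample above.

```python
import re

# Hypothetical pattern, assuming the log layout shown in the sample lines.
LOG_PATTERN = re.compile(
    r'^(?P<ip>\d{1,3}(?:\.\d{1,3}){3}) \S+ \S+ '
    r'\[(?P<date>[^\]]+)\] '
    r'"(?P<method>[A-Z]+) (?P<url>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)$'
)

def parse_access_log_line(line):
    """Return a dict of named fields, or None if the line does not match."""
    match = LOG_PATTERN.match(line.strip())
    return match.groupdict() if match else None

sample = '10.254.0.52 - - [28/Aug/2008:16:52:13 -0700] "GET /apache_pb.gif HTTP/1.1" 200 2326'
fields = parse_access_log_line(sample)
print(fields["ip"], fields["date"], fields["url"], fields["status"])
```

Any of these fields could serve as the key or the value of a map function, depending on the question being asked of the logs.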

Here is the output of the mapreduce:

[gpadmin@gp-single-host gpmapreduce]$ gpmapreduce -f 1_grep.yml gpmrdemo
WARNING: unset parameter - grep_map(key => NULL)
mapreduce_11694_run_1
key|value
---+------------------------------------------------------------------------------------
   |
   |10.254.0.52 - - [28/Aug/2008:16:52:13 -0700] "GET /apache_pb.gif HTTP/1.1" 200 2326
   |10.254.0.52 - - [28/Aug/2008:16:52:13 -0700] "GET /favicon.ico HTTP/1.1" 404 209
   |10.254.0.52 - - [28/Aug/2008:16:52:13 -0700] "GET / HTTP/1.1" 200 1456

You can see there is an issue right away. All the mapreduce job does is take each entire access_log line, with no reduction, and place it into the "value" column. The map function does extract the URL into $url, but it never uses it, and the warning shows that the key parameter was never set, which is why the key column is empty. One would expect some key to be extracted, perhaps the IP address, and stored along with a value such as the URL. My guess is that this was not the intended YAML file for the demo, or that it was only partially complete.
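To make the behavior easier to see outside the database, here is a small Python approximation of the shipped map function. It is only a sketch: `original_grep_map` is a hypothetical name, and treating an unset key as "match every line" is an assumption based on the output above, not a statement about how Perl or gpmapreduce handle the NULL parameter internally.

```python
import re

def original_grep_map(key, value):
    """Approximate Python rendering of the shipped Perl map function."""
    # The URL is captured, as in the Perl code, but never used afterwards.
    match = re.search(r'"GET (.*) HTTP', value)
    url = match.group(1) if match else None
    # Assumption: an unset key behaves like a filter that matches every line.
    if key is None or re.search(key, value):
        return [{"key": key, "value": value}]
    return []

line = '10.254.0.52 - - [28/Aug/2008:16:52:13 -0700] "GET / HTTP/1.1" 200 1456'
rows = original_grep_map(None, line)
print(rows)                               # whole line passes through, key is empty
print(original_grep_map("favicon", line)) # a real key would act as a grep filter
```

With no key, the whole line comes back unchanged under an empty key, which matches the shape of the output above.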

I have written one fix for this that shows an actual mapreduce happening:

%YAML 1.1
VERSION:         1.0.0.1
 
DEFINE:
  - INPUT:
      NAME:      access_logs
      FILE:
         # change seghostname1, seghostname2 and file_path to reflect
         # your runtime file locations
         - gp-single-host:/home/gpadmin/gpmapreduce/data/access_log
         - gp-single-host:/home/gpadmin/gpmapreduce/data/access_log2

  - MAP:
      NAME:      grep_map
      LANGUAGE:  perl
      FUNCTION:  |
        # 0: name the input parameters
        my ($key, $value) = @_;

        # 1: extract the IP and URL portion of the access log
        $value =~ m/^([0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}).*"GET (.*) HTTP/;
        my $ip  = $1;
        my $url = $2;

        return [{"key" => $ip, "value" => $url}] if ($value =~/$key/);
        return [];
 
EXECUTE:
  - RUN:
      SOURCE:    access_logs
      MAP:       grep_map
      REDUCE:    IDENTITY

Here is what the output now looks like:

[gpadmin@gp-single-host gpmapreduce]$ gpmapreduce -f 1_grep.yml gpmrdemo
WARNING: unset parameter - grep_map(key => NULL)
mapreduce_17558_run_1
        key|value                                   
-----------+----------------------------------------
10.254.0.52|/                                       
10.254.0.52|/apache_pb.gif                          
10.254.0.52|/favicon.ico                            
10.254.0.52|/favicon.ico                            
10.254.0.52|/icons/back.gif                         
10.254.0.52|/icons/blank.gif                        
10.254.0.52|/icons/folder.gif                       
10.254.0.52|/icons/text.gif                         
10.254.0.52|/icons/unknown.gif      
…               
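For anyone who wants to experiment with the same idea outside Greenplum, here is a hedged Python sketch of the fixed map step, run locally over a few of the sample lines. The names `IP_URL` and `fixed_grep_map` are made up for this example. The counting step at the end is a hypothetical reducer added for illustration, since the demo itself uses REDUCE: IDENTITY.

```python
import re
from collections import Counter

# Same idea as the Perl regex: capture the leading IP and the requested URL.
IP_URL = re.compile(r'^(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}).*"GET (.*) HTTP')

def fixed_grep_map(value):
    """Emit one {key: ip, value: url} row per matching line, else nothing."""
    match = IP_URL.search(value)
    if not match:
        return []
    return [{"key": match.group(1), "value": match.group(2)}]

lines = [
    '10.254.0.52 - - [28/Aug/2008:16:52:13 -0700] "GET / HTTP/1.1" 200 1456',
    '10.254.0.52 - - [28/Aug/2008:16:52:13 -0700] "GET /favicon.ico HTTP/1.1" 404 209',
    '10.254.0.52 - - [28/Aug/2008:16:52:16 -0700] "GET /favicon.ico HTTP/1.1" 404 209',
]

# Map step: flatten the rows emitted for every input line.
rows = [row for line in lines for row in fixed_grep_map(line)]

# Hypothetical reduce step (not in the demo): count requests per (ip, url).
counts = Counter((row["key"], row["value"]) for row in rows)
for (ip, url), hits in sorted(counts.items()):
    print(f"{ip}|{url}|{hits}")
```

One design difference from the Perl version: this sketch skips lines that do not match the pattern, rather than reading capture variables after a match that may have failed.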
If anyone has other YAML files with mapreduce functions in them, please let me know. I am always looking for more ways to learn about YAML and MR.