
Greenplum 4.1 Community Edition mapreduce demo problem and fix

%YAML 1.1

---

VERSION:         1.0.0.1

DEFINE:

  - INPUT:

      NAME:      access_logs

      FILE:

         # change seghostname1, seghostname2 and file_path to reflect

         # your runtime file locations

         - gp-single-host:/home/gpadmin/gpmapreduce/data/access_log

         - gp-single-host:/home/gpadmin/gpmapreduce/data/access_log2

  - MAP:

      NAME:      grep_map

      LANGUAGE:  perl

      FUNCTION:  |

        # 0: name the input parameters

        my ($key, $value) = @_;

        # 1: extract the URL portion of the access log

        $value =~ /"GET (.*) HTTP/;

        my $url = $1;

        return [{"key" => $key, "value" => $value}] if ($value =~/$key/);

        return [];

EXECUTE:

  - RUN:

      SOURCE:    access_logs

      MAP:       grep_map

      REDUCE:    IDENTITY

Here is a sample of the access_log files, which are typical Apache access logs:

$ cat data/access_log

10.254.0.52 - - [28/Aug/2008:16:52:13 -0700] "GET / HTTP/1.1" 200 1456

10.254.0.52 - - [28/Aug/2008:16:52:13 -0700] "GET /apache_pb.gif HTTP/1.1" 200 2326

10.254.0.52 - - [28/Aug/2008:16:52:13 -0700] "GET /favicon.ico HTTP/1.1" 404 209

10.254.0.52 - - [28/Aug/2008:16:52:16 -0700] "GET /favicon.ico HTTP/1.1" 404 209

10.254.0.52 - - [28/Aug/2008:16:52:21 -0700] "GET /~mapreduce HTTP/1.1" 301 236

Here is the output of the mapreduce:

$ gpmapreduce -f 1_grep.yml gpmrdemo

WARNING: unset parameter - grep_map(key => NULL)

mapreduce_11694_run_1

key|value

---+------------------------------------------------------------------------------------------------

   |

   |10.254.0.52 - - [28/Aug/2008:16:52:13 -0700] "GET /apache_pb.gif HTTP/1.1" 200 2326

   |10.254.0.52 - - [28/Aug/2008:16:52:13 -0700] "GET /favicon.ico HTTP/1.1" 404 209

   |10.254.0.52 - - [28/Aug/2008:16:52:13 -0700] "GET / HTTP/1.1" 200 1456

The problem is visible right away. The mapreduce takes each entire line of access_log, with no reduction, and places it into the database "value" column, while the "key" column stays empty. One would expect the map function to extract some key, perhaps the IP address, and store it together with a value such as the URL. My guess is that this was not the intended YAML file for the demo, or that it was only partially complete. I have written one fix to show actual mapreduce happening:

%YAML 1.1

---

VERSION:         1.0.0.1

DEFINE:

  - INPUT:

      NAME:      access_logs

      FILE:

         # change seghostname1, seghostname2 and file_path to reflect

         # your runtime file locations

         - gp-single-host:/home/gpadmin/gpmapreduce/data/access_log

         - gp-single-host:/home/gpadmin/gpmapreduce/data/access_log2

  - MAP:

      NAME:      grep_map

      LANGUAGE:  perl

      FUNCTION:  |

        # 0: name the input parameters

        my ($key, $value) = @_;

        # 1: extract the IP and URL portion of the access log

        $value =~ m/^([0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}).*"GET (.*) HTTP/;

        my $ip  = $1;

        my $url = $2;

        return [{"key" => $ip, "value" => $url}] if ($value =~/$key/);

        return [];

EXECUTE:

  - RUN:

      SOURCE:    access_logs

      MAP:       grep_map

      REDUCE:    IDENTITY

Here is what the output now looks like:


$ gpmapreduce -f 1_grep.yml gpmrdemo

WARNING: unset parameter - grep_map(key => NULL)

mapreduce_17558_run_1

        key|value                                   

-----------+----------------------------------------

10.254.0.52|/                                       

10.254.0.52|/apache_pb.gif                          

10.254.0.52|/favicon.ico                            

10.254.0.52|/favicon.ico                            

10.254.0.52|/icons/back.gif                         

10.254.0.52|/icons/blank.gif                        

10.254.0.52|/icons/folder.gif                       

10.254.0.52|/icons/text.gif                         

10.254.0.52|/icons/unknown.gif      

...               
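
As a quick check outside Greenplum, the extraction step can be sketched in Python. This is only a transliteration of the Perl regular expression in the fixed grep_map, not part of the demo itself. A few things here are my own assumptions: the function name grep_map is reused just for readability, the key filter ($value =~ /$key/) is left out because the key is NULL in the runs shown, and, unlike the Perl version, the sketch returns an empty list when a line does not match the pattern.

```python
import re

# Python transliteration of the Perl pattern used in the fixed grep_map.
# Group 1 is the client IP, group 2 is the requested URL.
LOG_PATTERN = re.compile(
    r'^([0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}).*"GET (.*) HTTP'
)


def grep_map(value):
    """Return a list with one {"key": ip, "value": url} row, or [] if no match."""
    match = LOG_PATTERN.match(value)
    if match is None:
        # Assumption: skip lines that are not GET requests in the expected format.
        return []
    return [{"key": match.group(1), "value": match.group(2)}]


# Sample lines taken from the access_log excerpt earlier in the post.
sample_lines = [
    '10.254.0.52 - - [28/Aug/2008:16:52:13 -0700] "GET / HTTP/1.1" 200 1456',
    '10.254.0.52 - - [28/Aug/2008:16:52:13 -0700] "GET /apache_pb.gif HTTP/1.1" 200 2326',
    '10.254.0.52 - - [28/Aug/2008:16:52:21 -0700] "GET /~mapreduce HTTP/1.1" 301 236',
]

# Flatten the per-line results, the way the map output becomes rows.
rows = [row for line in sample_lines for row in grep_map(line)]
for row in rows:
    print(f'{row["key"]}|{row["value"]}')
```

Running this against a handful of log lines before loading the YAML file into gpmapreduce makes it easier to tell a regex problem from a Greenplum configuration problem.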

If anyone has other YAML files with mapreduce functions in them, please let me know. I am always looking for more ways to learn about YAML and MR.
