Archive for the ‘EMC’ Category

The Greenplum 4.1 Community Edition comes with a mapreduce demo that has two parts.

Part 1 uses Perl to parse multiple Apache access_log files.

Part 2 uses Python to do a word count on the Google whitepaper on MapReduce.

There is a lot of information in access_log files, so it makes sense that someone would want to run a mapreduce over them and extract meaningful fields such as IP address, date, and URL. The included mapreduce demo seems to have a bug in it. The demo executes:

gpmapreduce -f $RUNDIR/1_grep.yml gpmrdemo

The YAML file looks like this:

[gpadmin@gp-single-host gpmapreduce]$ cat 1_grep.yml

%YAML 1.1
VERSION:         1.0.0.1
 
DEFINE:
  - INPUT:
      NAME:      access_logs
      FILE:
         # change seghostname1, seghostname2 and file_path to reflect
         # your runtime file locations
         - gp-single-host:/home/gpadmin/gpmapreduce/data/access_log
         - gp-single-host:/home/gpadmin/gpmapreduce/data/access_log2
         
  - MAP:
      NAME:      grep_map
      LANGUAGE:  perl
      FUNCTION:  |
        # 0: name the input parameters
        my ($key, $value) = @_;
        
        # 1: extract the URL portion of the access log
        $value =~ /"GET (.*) HTTP/;
        my $url = $1;
        
        return [{"key" => $key, "value" => $value}] if ($value =~/$key/);
        return [];
      
EXECUTE:
  - RUN:
      SOURCE:    access_logs
      MAP:       grep_map
      REDUCE:    IDENTITY

Here is a sample of the access_log files, which are typical Apache access logs:

[gpadmin@gp-single-host gpmapreduce]$ cat data/access_log
10.254.0.52 - - [28/Aug/2008:16:52:13 -0700] "GET / HTTP/1.1" 200 1456
10.254.0.52 - - [28/Aug/2008:16:52:13 -0700] "GET /apache_pb.gif HTTP/1.1" 200 2326
10.254.0.52 - - [28/Aug/2008:16:52:13 -0700] "GET /favicon.ico HTTP/1.1" 404 209
10.254.0.52 - - [28/Aug/2008:16:52:16 -0700] "GET /favicon.ico HTTP/1.1" 404 209
10.254.0.52 - - [28/Aug/2008:16:52:21 -0700] "GET /~mapreduce HTTP/1.1" 301 236

Here is the output of the mapreduce:

[gpadmin@gp-single-host gpmapreduce]$ gpmapreduce -f 1_grep.yml gpmrdemo
WARNING: unset parameter - grep_map(key => NULL)
mapreduce_11694_run_1
key|value
---+------------------------------------------------------------------------------------
   |
   |10.254.0.52 - - [28/Aug/2008:16:52:13 -0700] "GET /apache_pb.gif HTTP/1.1" 200 2326
   |10.254.0.52 - - [28/Aug/2008:16:52:13 -0700] "GET /favicon.ico HTTP/1.1" 404 209
   |10.254.0.52 - - [28/Aug/2008:16:52:13 -0700] "GET / HTTP/1.1" 200 1456

You can see there is an issue right away. All the mapreduce does is take each entire access_log line, with no reduction, and place it into the “value” column, while the “key” column stays empty. What one would expect is for some key to be extracted, perhaps an IP address, and stored along with a value, which could be the URL. My guess is this was not the intended YAML file for the demo, or that it was only partially complete.

I have written one fix for this, to show the map step actually extracting a key and a value:

%YAML 1.1
VERSION:         1.0.0.1
 
DEFINE:
  - INPUT:
      NAME:      access_logs
      FILE:
         # change seghostname1, seghostname2 and file_path to reflect
         # your runtime file locations
         - gp-single-host:/home/gpadmin/gpmapreduce/data/access_log
         - gp-single-host:/home/gpadmin/gpmapreduce/data/access_log2
 
  - MAP:
      NAME:      grep_map
      LANGUAGE:  perl
      FUNCTION:  |
        # 0: name the input parameters
        my ($key, $value) = @_;
 
        # 1: extract the IP and URL portion of the access log
        $value =~ m/^([0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}).*"GET (.*) HTTP/;
        my $ip  = $1;
        my $url = $2;
 
        return [{"key" => $ip, "value" => $url}] if ($value =~/$key/);   
        return [];
 
EXECUTE:
  - RUN:
      SOURCE:    access_logs
      MAP:       grep_map
      REDUCE:    IDENTITY

Here is what the output now looks like:

[gpadmin@gp-single-host gpmapreduce]$ gpmapreduce -f 1_grep.yml gpmrdemo
WARNING: unset parameter - grep_map(key => NULL)
mapreduce_17558_run_1
        key|value                                   
-----------+----------------------------------------
10.254.0.52|/                                       
10.254.0.52|/apache_pb.gif                          
10.254.0.52|/favicon.ico                            
10.254.0.52|/favicon.ico                            
10.254.0.52|/icons/back.gif                         
10.254.0.52|/icons/blank.gif                        
10.254.0.52|/icons/folder.gif                       
10.254.0.52|/icons/text.gif                         
10.254.0.52|/icons/unknown.gif      
…

If anyone has other YAML files with mapreduce functions in them, please let me know. I am always looking for more ways to learn about YAML and MR.
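
The IP/URL pattern from the fixed map function can be sanity-checked outside of gpmapreduce. The following is only a rough shell sketch, not part of the Greenplum demo: extract_ip_url and count_pairs are hypothetical helper names, the sed expression mirrors the Perl regex above, and the counting step is an assumption about what a non-IDENTITY reduce might do (it also assumes URLs contain no spaces).

```shell
# Map step: read access_log lines on stdin and emit "ip|url" for each GET request.
# Lines that do not match the pattern are dropped.
extract_ip_url() {
    sed -nE 's/^([0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}).*"GET (.*) HTTP.*/\1|\2/p'
}

# Reduce step (illustrative only; the demo itself uses REDUCE: IDENTITY):
# count how many times each "ip|url" pair occurs and print "pair<TAB>count".
count_pairs() {
    sort | uniq -c | awk '{print $2 "\t" $1}'
}
```

For example, `cat data/access_log | extract_ip_url | count_pairs` would print one line per distinct IP/URL pair with its hit count.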

I have recently been playing with the Greenplum 4.1 Community Edition VM available from Greenplum. EMC has an internal initiative to deliver all of its demos as VMs, and I generally agree with this. I would say: make the VMs available, but also make the software available so people can build their own environments as well. Not sure if that will be the case.

Even though the VM is designed to work on VMware Player or Fusion, I got it working under Parallels with a few tweaks.

The VM itself fires up fine, but there are a few problems you may encounter. I have made an attempt to catalog some of them here, along with how to get around them. I will add to this article as I find new things.

When the system boots up, you are presented with a nice desktop of icons. One of the first things you will likely do is click the Start Greenplum DB icon.

[Screenshot: the VM desktop]

You will be presented with an output that should show all has gone well, and at the end it directs you to fire up a browser to view the Performance Monitor User Interface:

Note: you can now use the GP monitor if you want monitor query and system performance Connect to the GUI by opening this link in a browser (outside of the VM): https://gp-single-host:28080/ Login using the user/pass: gpmon/password

The output directs you to connect from “outside of the VM”, that is, from the laptop hosting the VM, by hitting https://gp-single-host:28080. You will need to add an entry to your laptop’s hosts file with the name gp-single-host and the IP address of the VM. You can get the IP address of the VM by opening a terminal window and running ifconfig eth0. A good first test is to connect to the Performance Monitor UI from within the VM itself. Out of the box, this will fail. First you must re-run the installer:

[gpadmin@gp-single-host gpquery]$ su - gpadmin
Password:
[gpadmin@gp-single-host ~]$ gpperfmon_install --enable --password password --port 5432
[...]

After you run this, you should be able to connect locally from within the VM. You will notice, however, that you still cannot connect from outside of the VM. This is due to the firewall rules that are in effect on the CentOS VM:

[root@gp-single-host gpadmin]# /sbin/iptables -L
Chain INPUT (policy ACCEPT)
target prot opt source destination
RH-Firewall-1-INPUT all -- anywhere anywhere

Chain FORWARD (policy ACCEPT)
target prot opt source destination
RH-Firewall-1-INPUT all -- anywhere anywhere

Chain OUTPUT (policy ACCEPT)
target prot opt source destination

Chain RH-Firewall-1-INPUT (2 references)
target prot opt source destination
ACCEPT all -- anywhere anywhere
ACCEPT icmp -- anywhere anywhere icmp any
ACCEPT esp -- anywhere anywhere
ACCEPT ah -- anywhere anywhere
ACCEPT udp -- anywhere 224.0.0.251 udp dpt:mdns
ACCEPT udp -- anywhere anywhere udp dpt:ipp
ACCEPT tcp -- anywhere anywhere tcp dpt:ipp
ACCEPT all -- anywhere anywhere state RELATED,ESTABLISHED
ACCEPT tcp -- anywhere anywhere state NEW tcp dpt:ssh
REJECT all -- anywhere anywhere reject-with icmp-host-prohibited

The easiest thing to do here is just change the security level to disable the firewall. If you are knowledgeable in iptables, you can modify the rules to suit your needs instead. To change the security level, execute:

[gpadmin@gp-single-host ~]$ su
Password:
[root@gp-single-host gpadmin]# system-config-securitylevel

Set the Security Level to “disabled”; you can leave the SELinux setting at “Enforcing”. Now your web browser should be able to connect to the Performance Monitor UI from outside the VM.
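
If you would rather keep the firewall on, a narrower option is to accept only the Performance Monitor port. This is a minimal sketch, assuming the stock RH-Firewall-1-INPUT chain shown above and port 28080 from the startup note. The allow_port helper and the IPTABLES override variable are hypothetical names used here for illustration, and persisting the rule with the iptables service script is an assumption about this CentOS image.

```shell
# Hypothetical helper: accept new TCP connections on a single port.
# -I inserts the rule at the top of the chain, ahead of the final REJECT rule.
# Set IPTABLES=echo to preview the arguments instead of changing the firewall.
allow_port() {
    port="$1"
    "${IPTABLES:-/sbin/iptables}" -I RH-Firewall-1-INPUT \
        -m state --state NEW -m tcp -p tcp --dport "$port" -j ACCEPT
}

# Usage (as root inside the VM):
#   allow_port 28080
#   /sbin/service iptables save   # assumption: keeps the rule across reboots
```

Running `IPTABLES=echo allow_port 28080` first is a cheap way to check the rule before applying it.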

The next issue you will encounter is when you click on the “Run Queries Demo” icon on the Desktop.  You will encounter the following error:

Running Query demo
This demo will create and load data for 8 tables, then will run 22 queries

Press enter key to continue...
Executing command: ./reload.sh
Running command "psql -d gpadmin -c 'drop database if exists gpdemo'" ...

Error running command psql -d gpadmin -c 'drop database if exists gpdemo'
Exiting...
Output is in file /home/gpadmin/gpquery/sysout
There was an error running the command.

Press enter key to continue...

The issue is that the gpadmin database does not exist.

The script reload.sh in the demo tries to run:

runCmd "psql -d gpadmin -c 'drop database if exists $PGDATABASE'"

yet there is no gpadmin database as installed by default in the Greenplum CE VM. So the script fails.

We must create it:

[gpadmin@gp-single-host ~]$ psql -d template1
psql (8.2.15)
Type "help" for help.

template1=# CREATE DATABASE gpadmin;
CREATE DATABASE
template1=# \q

Now you can re-run the “Run Queries Demo” script and it should succeed with no errors.
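
The same fix can be scripted so it is safe to run more than once. This is only a sketch: ensure_db is a hypothetical helper name, the PSQL override variable exists purely so the function can be exercised without a database, and it assumes you are logged in as gpadmin with template1 reachable, as in the session above.

```shell
# Hypothetical helper: create a database only if it does not already exist.
# Uses template1 as the connection database, as in the manual fix above.
ensure_db() {
    db="$1"
    psql_bin="${PSQL:-psql}"
    # -A (unaligned) and -t (tuples only) make psql print just "1" or nothing.
    exists=$("$psql_bin" -d template1 -Atc "SELECT 1 FROM pg_database WHERE datname = '$db'")
    if [ "$exists" = "1" ]; then
        echo "database $db already exists"
    else
        "$psql_bin" -d template1 -c "CREATE DATABASE $db" >/dev/null \
            && echo "database $db created"
    fi
}

# Usage (as gpadmin inside the VM):
#   ensure_db gpadmin
```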

As a note, you should definitely read the documentation provided on the Desktop of the VM. The Installation Guide and Administration Guide have a lot of useful information in them. For example, if you want to connect to the database externally, you will need to add entries to the pg_hba.conf file. The correct pg_hba.conf file lives in ${MASTER_DATA_DIRECTORY}. I just added a wildcard entry to allow all connections as gpadmin, like so:

host     all         gpadmin         0.0.0.0/0      trust
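
Here is a rough sketch of making that change from the shell without adding the same line twice. add_hba_rule is a hypothetical helper name, not a Greenplum tool. The reload step assumes gpstop -u, which asks Greenplum to re-read pg_hba.conf without restarting the database. A wildcard trust rule is only reasonable on a throwaway demo VM; a narrower CIDR for your own network would be the safer assumption elsewhere.

```shell
# Hypothetical helper: append a rule to a pg_hba.conf file unless the exact
# line is already present (-x whole line, -F fixed string, -q quiet).
add_hba_rule() {
    hba_file="$1"
    rule="$2"
    grep -qxF "$rule" "$hba_file" 2>/dev/null || printf '%s\n' "$rule" >> "$hba_file"
}

# Usage (as gpadmin inside the VM):
#   add_hba_rule "$MASTER_DATA_DIRECTORY/pg_hba.conf" \
#       'host     all         gpadmin         0.0.0.0/0      trust'
#   gpstop -u    # reload configuration files without a restart
```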

As I run into any other caveats with the Greenplum 4.1 Community Edition VM I will update this article.

Greenplum released Community Edition 4.1, which is a great free VM appliance you can run to get your feet wet with Greenplum and gain an understanding of what it can offer. Unfortunately, it was only released to work on VMware Workstation or Fusion. Personally I like to run Parallels Desktop on my MacBook Air, so I worked to figure out how to make this work.

First off, the VM includes its vmdk broken up into 2GB chunks.  Parallels may or may not be able to deal with this.  For good measure, I just converted it to a single file.  To do this I used VMware Fusion, unchecked the box to split the vmdk into 2GB chunks and let it do its thing.  Another way you can accomplish this would be to use the QEMU tools:

qemu-img convert -O raw file.vmdk file.hdd

You can then take file.hdd and add it to a VM you create in Parallels. Greenplum 4.1 CE does not require many resources; by default the stock vmx file is set for only 1536MB of memory and a single processor. So you could create a Parallels VM with a single processor and 1536MB of memory, and simply add file.hdd as its hard disk.

When creating the VM in Parallels, make sure you select IDE0:0 as the hard disk location; it will fail otherwise.

If you try to import the VMX file into Parallels, it may also leave you with a CD-ROM trying to connect to an image that doesn’t exist, so you will need to correct that. There is really no benefit to importing the VMX. The best approach is to just create a new VM as CentOS 4/5 64-bit with a single processor and 1536MB of memory, and connect the raw .hdd file. Parallels should glue all of this together and successfully build a .pvm file.

You will want to install the Parallels Tools once the system boots.

My journey to Cloud Architect ITaaS Expert has taken longer than many of my other EMC certifications. After all, it is an Expert-level EMC certification, and my first to date. I was one of the first people outside of EMC to achieve the Cloud Architect (Virtual Data Center) certification, just a little over a year ago. It was a very vendor-agnostic certification, as is the Cloud Architect ITaaS offering. EMC has led the way in preparing users for an interdisciplinary focus, as we build a workforce with the skills needed to deploy cloud technologies across a wide range of systems.

In completing Cloud Architect ITaaS, I definitely feel better prepared than our customers. There is so much to learn, from designing an organization that is ready to tackle the task of cloud to understanding the security implications, compliance, governance and trade-offs. It helps cut through the marketing of what is “Cloud” to what the cloud really is and promises. We are definitely in the early days of this journey. Everyone knows at a high level what they want. How they will get there is a lot fuzzier. The reality is that the perfect world does not exist today for Cloud, and not all the systems and hooks exist to deploy Cloud in the most ideal way. It will take some time for the marketplace to deliver systems that get us to where we all know we want to be. That said, there are advances made every day. Companies cannot sit on the sidelines with indecision, waiting for the perfect product set to roll off the assembly line. It is very similar to the earlier days of virtualization. Companies that make the investments now will be in a much better place than the competition, and they will have a competitive advantage. Cloud is more than a technology play; in fact, technology is the easy part. Cloud is more of a re-tooling of the IT organization, a shift in how it does business internally with other lines of business, and a change in the way we think of people’s roles and responsibilities.

In the old days, the subject matter expert was King. We had people focus on systems, storage, networking, etc. The New World has a more interdisciplinary focus. Systems are evolving to where we don’t need as many “experts” in a single discipline. What we need is people who have knowledge across multiple disciplines. Things like storage provisioning are getting easier. There will always be a need for experts in a given area, but the holistic view of the IT organization is going to need more focused generalists than deep-seated subject matter experts. And this is a good thing for everyone.

I was fortunate to be involved during the shaping of the EMC Cloud Architect ITaaS course, and was invited to Massachusetts to participate in the Beta class.  I was able to get a preview of what the eventual public course would look like, and provide some input.  The reality was the class was always ideal the way it was, with very little tweaking needed.  I also had the advantage of going through the fully released Video ILT after the course was published.  So I have spent considerable time in this journey, having a solid background in storage, virtualization and networking, and devoting a good bit of time to study for Cloud Architect ITaaS.  I have been very impressed with the amount of resources and effort that EMC put into developing this course.  From the top down there was focus from the EMC education team, and it was obvious a lot of work was put into this.

As far as tips for studying for this certification, I would say that the ILT or VILT are really your only good options. There is so much interdisciplinary information that the course materials are the only good source I can think of. That said, it’s good information and it will be challenging. The reason it will be challenging is that no one I have met really knows everything in ITaaS or has a background in all areas. For example, I was and am weak in the governance and security aspects. I don’t do security every day. Even though I am a CISSP and have a good grounding in security and compliance issues, the perspective on these issues with regard to the Cloud is different from how we may approach things in a pre-Cloud world.

Things will be much easier on the industry as a whole, once we can move everyone past the buzz of the word Cloud and can truly have an understanding of what Cloud and ITaaS really mean.  I find this land grab for all things Cloud is really putting confusion into the marketplace of where the true benefits are and what organizations should be focusing on.  It makes sense that consulting companies will help lead the charge to the real benefits, but our job will be much easier once the customers are all educated.  I look at how much time I have spent in understanding ITaaS, and I have had the benefit of working across many disciplines as a subject matter expert.  Most customers do not have this luxury.  Over the next 2-5 years we will see the real shift of organizations adopting ITaaS, with much support from the manufacturers to truly enable them to do so.  EMC and its alliances with VMware, VCE, Cisco, and others are definitely leading the way with developing the promise of orchestration, automation, chargeback / showback, and all other facets the field is looking for in ITaaS.

When EMC re-did their group/role mappings for Celerra Administrative Roles, back in 2008 or so (when DART 5.6 was released), they had a chance to create a new set of groups and roles that totally make sense. And for the most part they do, but does anyone else see something wrong with this picture?

So with the security in Celerra, Roles and Groups have a one-to-one relationship. You can see that the fullnas group is mapped to the Nasadmin role. The nasadmin group is mapped to the Operator role. ?!?!?!??! To me, it would have made a lot more sense to create an operator group and map the Operator role to that. Maybe I am just being a bit OCD about this, but it bothers me that the entire scheme looks relatively clean, and they had an opportunity to make it just so perfect, but left in this confusing point.

Now, why are some of the Role Names capitalized and others not? I have no idea. But I must say this: EMC Education does a hell of a job cranking out a great amount of material. So sometimes typos exist and things are actually correct(ed) in the OS, and other times they are just the messenger and have nothing to do with the design of the system (actually, that’s probably most cases).

I have been impressed in watching the advancements of the Celerra from a few years ago until now morphing into the VNX.  Things have always improved greatly.  I am not a heavy user of RBAC, simply because I look at it more like there are two options:  Those that should have access and those who should not :) .  Obviously we design things for customers based on their requirements but I like to have an educated group who have access, and then not have to worry about those that don’t.  When I say educated, I don’t mean they are the Grand Master at all things, Celerra in this case, but that they understand enough to know there are things they should touch and things they should not.

If you don’t know much about Celerra, you shouldn’t be doing something like following commands that start off with you doing "export NAS_DB_DEBUG=1".