With two versions of MapReduce available for Hadoop, the older MRv1 and the newer YARN, you sometimes need to move between the two.  Using RPMs or other packages with the Cloudera CDH installation makes this mostly easy, but there is still some work to do for a successful downgrade from YARN to MRv1.  For going from MRv1 to YARN, the Cloudera installation guide walks you through the process.  The instructions here are for going the other direction, from YARN to MRv1.

I recently had to go through the exercise of making this downgrade, and I have documented my steps below.  I am using CentOS with yum/RPMs; other distributions may be similar.  Note that these steps remove the HDFS cache directory and reformat the namenode, so any data in HDFS is lost.  Please let me know if you have any recommendations for changes to these steps:

# remove YARN configuration
sudo yum remove hadoop-conf-pseudo
 
# stop YARN
sudo service hadoop-yarn-resourcemanager stop 
sudo service hadoop-yarn-nodemanager stop
sudo service hadoop-mapreduce-historyserver stop
 
# stop HDFS
for x in `cd /etc/init.d ; ls hadoop-hdfs-*` ; do sudo service $x stop ; done
 
# Install MRv1
sudo yum install hadoop-0.20-conf-pseudo
 
# Remove cache dir
sudo rm -rf /var/lib/hadoop-hdfs/cache/
 
# format namenode
sudo -u hdfs hdfs namenode -format 
 
# start HDFS
for x in `cd /etc/init.d ; ls hadoop-hdfs-*` ; do sudo service $x start ; done
 
# make /tmp directories and set permissions/ownership
sudo -u hdfs hadoop fs -mkdir /tmp
sudo -u hdfs hadoop fs -chmod -R 1777 /tmp 
 
sudo -u hdfs hadoop fs -mkdir -p /var/lib/hadoop-hdfs/cache/mapred/mapred/staging
sudo -u hdfs hadoop fs -chmod 1777 /var/lib/hadoop-hdfs/cache/mapred/mapred/staging
sudo -u hdfs hadoop fs -chown -R mapred /var/lib/hadoop-hdfs/cache/mapred
 
sudo -u hdfs mkdir -p /var/lib/hadoop-hdfs/cache/mapred/mapred/local/
sudo chown -R mapred  /var/lib/hadoop-hdfs/cache/mapred
 
# check dir structure
sudo -u hdfs hadoop fs -ls -R / 
 
# start MRv1
for x in `cd /etc/init.d ; ls hadoop-0.20-mapreduce-*` ; do sudo service $x start ; done
 
# make user directory for your username
sudo -u hdfs hadoop fs -mkdir /user/cloudera
sudo -u hdfs hadoop fs -chown cloudera /user/cloudera
 
# test
hadoop fs -mkdir input
hadoop fs -put /etc/hadoop/conf/*.xml input
hadoop jar /usr/lib/hadoop-0.20-mapreduce/hadoop-examples.jar grep input output 'dfs[a-z.]+'
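
The start/stop loops above all follow the same pattern, so it can be handy to wrap them in a small helper that can also print what it would do before touching any service. This is only a sketch: the function name run_services and the INIT_DIR and DRY_RUN variables are my own inventions for illustration, not part of CDH.

```shell
# Directory holding the init scripts; override INIT_DIR to try this elsewhere.
INIT_DIR=${INIT_DIR:-/etc/init.d}

# run_services PREFIX ACTION
# Calls "sudo service NAME ACTION" for every init script whose name starts
# with PREFIX. With DRY_RUN=1 the commands are printed instead of executed.
run_services() {
    for path in "$INIT_DIR"/"$1"*; do
        [ -e "$path" ] || continue      # the glob matched nothing
        name=${path##*/}                # strip the directory part
        if [ "${DRY_RUN:-0}" = 1 ]; then
            echo "sudo service $name $2"
        else
            sudo service "$name" "$2"
        fi
    done
}

# Example: show what stopping HDFS would run, without doing it.
DRY_RUN=1 run_services hadoop-hdfs- stop
```

Once the dry run lists the services you expect, unset DRY_RUN and call run_services hadoop-hdfs- stop (or start) for real.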
 

Great article on MOOCs

I wanted to include a link to a great article entitled "Putting a MOOC on the Resume".  I use MOOCs all the time to learn new things, to keep my knowledge current in existing areas, and often as an additional resource alongside university study.

You can find a list of deprecated properties in Hadoop 0.20.2 here:

http://hadoop.apache.org/docs/r2.0.3-alpha/hadoop-project-dist/hadoop-common/DeprecatedProperties.html

After reading this you may think that you need to set mapreduce.input.keyvaluelinerecordreader.key.value.separator in order to change the delimiter for the KeyValueTextInputFormat.  However, what I have noticed from experience is that this is one of many areas that differ from what the documentation would lead you to believe.

What you must do is continue to use key.value.separator.in.input.line.  You will do this like so:

public int run(String[] args) throws Exception {

     Configuration conf = getConf();

     conf.set("key.value.separator.in.input.line", ",");
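
To make the effect of the separator concrete, here is a rough shell imitation of how KeyValueTextInputFormat divides a line: everything before the first separator becomes the key, the rest becomes the value, and a line with no separator becomes a key with an empty value. The function name split_kv and the sample lines are made up for illustration; this is not Hadoop code.

```shell
# split_kv LINE SEP: print "key<TAB>value", splitting at the first SEP only
split_kv() {
    case $1 in
        *"$2"*) printf '%s\t%s\n' "${1%%"$2"*}" "${1#*"$2"}" ;;
        *)      printf '%s\t%s\n' "$1" "" ;;   # no separator: empty value
    esac
}

split_kv 'user1,click,2013' ','     # key "user1", value "click,2013"
split_kv 'no separator here' ','    # whole line is the key
```

Only the first comma matters, so later commas stay in the value. Without the conf.set() call the separator defaults to a tab.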

At the present time the API for Hadoop can be quite confusing, as there are many areas where things have changed, from the simple spelling of methods, to entire syntaxes changing. The documentation doesn’t always lead you to success, so you must experiment.  

In earlier releases of Hadoop you could change the number of mappers by setting:

setNumMapTasks()

You did this using JobConf.  In Hadoop 0.20.2, things have migrated to the Job class instead of JobConf.  Although setNumReduceTasks() is still valid, setNumMapTasks() has been deprecated.  How then do you set the number of mappers on a MapReduce job?  You must adjust the split size.  Much has been written on this, but it can be difficult to find at times.  The split size is determined by the InputFormat being used.  I typically use KeyValueTextInputFormat.  To adjust my split size, I simply pass the mapred.max.split.size parameter like so:

-D mapred.max.split.size=2500

In this example my input file size was 24451 bytes.  By setting the parameter -D mapred.max.split.size=2500, I was able to configure 10 map tasks.
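
The arithmetic behind those 10 map tasks can be sketched in a few lines of shell. This is an estimate under several assumptions: a single uncompressed input file, an HDFS block size larger than the maximum split size, the default minimum split size, and the 10% "slop" that FileInputFormat applies to the last split as far as I recall. The function name estimate_splits is my own.

```shell
# estimate_splits FILE_BYTES MAX_SPLIT_BYTES
# Prints the number of input splits (and so map tasks) to expect.
estimate_splits() {
    remaining=$1
    split=$2
    count=0
    # carve off full-size splits while more than 1.1 splits' worth remains
    while [ $((remaining * 10)) -gt $((split * 11)) ]; do
        remaining=$((remaining - split))
        count=$((count + 1))
    done
    # whatever is left over becomes the final split
    if [ "$remaining" -gt 0 ]; then
        count=$((count + 1))
    fi
    echo "$count"
}

estimate_splits 24451 2500    # prints 10
```

Nine full splits of 2500 bytes leave 1951 bytes, which becomes the tenth split. Because of the slop, a 2600-byte file with the same setting would still get a single split rather than two.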

Apache Mahout has gone through some changes recently, and one of the things you will notice no longer works is the old prepare20newsgroups classifier routine.  This has been replaced, and the new syntax is much different.  This page will walk you through how to use the classifier:

Download some SPAM:

curl -O http://spamassassin.apache.org/publiccorpus/20021010_spam.tar.bz2

Download some HAM:

curl -O http://spamassassin.apache.org/publiccorpus/20021010_easy_ham.tar.bz2 

Extract the files:

$ tar xjf 20021010_spam.tar.bz2
$ tar xjf 20021010_easy_ham.tar.bz2 

Copy easy_ham and spam directories into 20news-all:
 mkdir -p 20news-all
 cp -R easy_ham/ spam/ 20news-all/
 
Copy 20news-all to HDFS:
hadoop fs -put 20news-all
 
Prepare data by sequencing into vectors:
 mahout seqdirectory -i 20news-all -o 20news-seq
 mahout seq2sparse -i 20news-seq -o 20news-vectors  -lnorm -nv  -wt tfidf
 
Split data into train and test sets with 20% of the data being used for test and 80% for train:
mahout split -i 20news-vectors/tfidf-vectors --trainingOutput 20news-train-vectors --testOutput 20news-test-vectors --randomSelectionPct 20 --overwrite --sequenceFiles -xm sequential
 
Build the model:
mahout trainnb -i 20news-train-vectors -el -o model -li labelindex -ow -c
 
You can test the model against the training set:
mahout testnb -i 20news-train-vectors -m model -l labelindex -ow -o 20news-testing-train -c
 
Now test against the test set:
mahout testnb -i 20news-test-vectors -m model -l labelindex -ow -o 20news-testing-test -c
 
You can view how the messages were classified, although it is a bit cryptic:
mahout seqdumper -i 20news-testing-test/part-m-00000 
 
The insights I used for figuring this out come from examples/bin/classify-20newsgroups.sh, which ships with Mahout and can sometimes be found in a package called mahout-docs (for example, if you are using the Cloudera CDH repositories).