<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Maximum Entropy</title>
	<atom:link href="http://www.feeny.org/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.feeny.org</link>
	<description>Maintaining sanity, if just barely</description>
	<lastBuildDate>Mon, 29 Apr 2013 18:57:41 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.5.1</generator>
		<item>
		<title>Hadoop &#8211; Downgrading from YARN to MRv1 (Cloudera CDH4)</title>
		<link>http://www.feeny.org/hadoop-downgrading-from-yarn-to-mrv1-cloudera-cdh4/</link>
		<comments>http://www.feeny.org/hadoop-downgrading-from-yarn-to-mrv1-cloudera-cdh4/#comments</comments>
		<pubDate>Mon, 29 Apr 2013 13:16:34 +0000</pubDate>
		<dc:creator>brian</dc:creator>
				<category><![CDATA[Data Analytics]]></category>
		<category><![CDATA[Cloudera]]></category>
		<category><![CDATA[hadoop]]></category>
		<category><![CDATA[MRv1]]></category>
		<category><![CDATA[YARN]]></category>

		<guid isPermaLink="false">http://www.feeny.org/?p=2195</guid>
		<description><![CDATA[<p>With two versions of MapReduce available for Hadoop, the older MRv1 and the newer YARN, sometimes you need to move between the two. &#160;Using RPM&#8217;s or other packages with the Cloudera CDH installation makes this mostly easy, however there is still some work to do for a successful downgrade from YARN to MRv1. &#160;For going [...]</p><p>The post <a href="http://www.feeny.org/hadoop-downgrading-from-yarn-to-mrv1-cloudera-cdh4/">Hadoop &#8211; Downgrading from YARN to MRv1 (Cloudera CDH4)</a> appeared first on <a href="http://www.feeny.org">Maximum Entropy</a>.</p>]]></description>
				<content:encoded><![CDATA[<p>With two versions of MapReduce available for Hadoop, the older MRv1 and the newer YARN, you sometimes need to move between the two. Installing Cloudera CDH from RPMs or other packages makes this mostly easy, but a successful downgrade from YARN to MRv1 still takes some work. The Cloudera installation guide walks you through going from MRv1 to YARN. The instructions here go the other direction, from YARN to MRv1.</p>
<p>I recently had to make this downgrade and have documented my steps below. I am using CentOS with yum and RPMs; other distributions may be similar. Please let me know if you have any recommended changes to these steps:</p>
<div><span style="font-family: 'courier new', courier;"># remove YARN configuration</span></div>
<div><span style="font-family: 'courier new', courier;">sudo yum remove hadoop-conf-pseudo</span></div>
<div><span style="font-family: 'courier new', courier;">&nbsp;</span></div>
<div><span style="font-family: 'courier new', courier;"># stop YARN</span></div>
<div><span style="font-family: 'courier new', courier;">sudo service hadoop-yarn-resourcemanager stop&nbsp;</span></div>
<div><span style="font-family: 'courier new', courier;">sudo service hadoop-yarn-nodemanager stop</span></div>
<div><span style="font-family: 'courier new', courier;">sudo service hadoop-mapreduce-historyserver stop</span></div>
<div><span style="font-family: 'courier new', courier;">&nbsp;</span></div>
<div><span style="font-family: 'courier new', courier;"># stop HDFS</span></div>
<div><span style="font-family: 'courier new', courier;">for x in `cd /etc/init.d ; ls hadoop-hdfs-*` ; do sudo service $x stop ; done</span></div>
<div><span style="font-family: 'courier new', courier;">&nbsp;</span></div>
<div><span style="font-family: 'courier new', courier;"># Install MRv1</span></div>
<div><span style="font-family: 'courier new', courier;">sudo yum install hadoop-0.20-conf-pseudo</span></div>
<div><span style="font-family: 'courier new', courier;">&nbsp;</span></div>
<div><span style="font-family: 'courier new', courier;"># Remove cache dir</span></div>
<div><span style="font-family: 'courier new', courier;">sudo rm -rf /var/lib/hadoop-hdfs/cache/</span></div>
<div><span style="font-family: 'courier new', courier;">&nbsp;</span></div>
<div><span style="font-family: 'courier new', courier;"># format namenode</span></div>
<div><span style="font-family: 'courier new', courier;">sudo -u hdfs hdfs namenode -format&nbsp;</span></div>
<div><span style="font-family: 'courier new', courier;">&nbsp;</span></div>
<div><span style="font-family: 'courier new', courier;"># start HDFS</span></div>
<div><span style="font-family: 'courier new', courier;">for x in `cd /etc/init.d ; ls hadoop-hdfs-*` ; do sudo service $x start ; done</span></div>
<div><span style="font-family: 'courier new', courier;">&nbsp;</span></div>
<div><span style="font-family: 'courier new', courier;"># make /tmp directories and set permissions/ownership</span></div>
<div><span style="font-family: 'courier new', courier;">sudo -u hdfs hadoop fs -mkdir /tmp</span></div>
<div><span style="font-family: 'courier new', courier;">sudo -u hdfs hadoop fs -chmod -R 1777 /tmp&nbsp;</span></div>
<div><span style="font-family: 'courier new', courier;">&nbsp;</span></div>
<div><span style="font-family: 'courier new', courier;">sudo -u hdfs hadoop fs -mkdir -p /var/lib/hadoop-hdfs/cache/mapred/mapred/staging</span></div>
<div><span style="font-family: 'courier new', courier;">sudo -u hdfs hadoop fs -chmod 1777 /var/lib/hadoop-hdfs/cache/mapred/mapred/staging</span></div>
<div><span style="font-family: 'courier new', courier;">sudo -u hdfs hadoop fs -chown -R mapred /var/lib/hadoop-hdfs/cache/mapred</span></div>
<div><span style="font-family: 'courier new', courier;">&nbsp;</span></div>
<div><span style="font-family: 'courier new', courier;">sudo -u hdfs mkdir -p /var/lib/hadoop-hdfs/cache/mapred/mapred/local/</span></div>
<div><span style="font-family: 'courier new', courier;">sudo chown -R mapred /var/lib/hadoop-hdfs/cache/mapred</span></div>
<div><span style="font-family: 'courier new', courier;">&nbsp;</span></div>
<div><span style="font-family: 'courier new', courier;"># check dir structure</span></div>
<div><span style="font-family: 'courier new', courier;">sudo -u hdfs hadoop fs -ls -R /&nbsp;</span></div>
<div><span style="font-family: 'courier new', courier;">&nbsp;</span></div>
<div><span style="font-family: 'courier new', courier;"># start MRv1</span></div>
<div><span style="font-family: 'courier new', courier;">for x in `cd /etc/init.d ; ls hadoop-0.20-mapreduce-*` ; do sudo service $x start ; done</span></div>
<div><span style="font-family: 'courier new', courier;">&nbsp;</span></div>
<div><span style="font-family: 'courier new', courier;"># make user directory for your username (cloudera in this example)</span></div>
<div><span style="font-family: 'courier new', courier;">sudo -u hdfs hadoop fs -mkdir /user/cloudera</span></div>
<div><span style="font-family: 'courier new', courier;">sudo -u hdfs hadoop fs -chown cloudera /user/cloudera</span></div>
<div><span style="font-family: 'courier new', courier;">&nbsp;</span></div>
<div><span style="font-family: 'courier new', courier;"># test</span></div>
<div><span style="font-family: 'courier new', courier;">hadoop fs -mkdir input</span></div>
<div><span style="font-family: 'courier new', courier;">hadoop fs -put /etc/hadoop/conf/*.xml input</span></div>
<div><span style="font-family: 'courier new', courier;">hadoop jar /usr/lib/hadoop-0.20-mapreduce/hadoop-examples.jar grep input output 'dfs[a-z.]+'</span></div>
<div>&nbsp;</div>
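<p>The stop and start loops above all follow the same pattern, so here is a minimal sketch of a helper that wraps it. The names <span style="font-family: 'courier new', courier;">service_loop</span>, <span style="font-family: 'courier new', courier;">INIT_DIR</span> and <span style="font-family: 'courier new', courier;">DRY_RUN</span> are my own illustration and are not part of CDH. By default it only prints the commands it would run, which is a safer way to check what will be touched before stopping anything.</p>

```shell
#!/usr/bin/env bash
# Sketch: run one action against every init script matching a prefix.
# service_loop, INIT_DIR and DRY_RUN are illustrative names, not part of CDH.
INIT_DIR="${INIT_DIR:-/etc/init.d}"
DRY_RUN="${DRY_RUN:-1}"

service_loop() {
    local prefix="$1" action="$2" script name
    for script in "$INIT_DIR"/"$prefix"*; do
        # An unmatched glob stays literal, so skip anything that does not exist.
        [ -e "$script" ] || continue
        name="$(basename "$script")"
        if [ "$DRY_RUN" = "1" ]; then
            # Dry run: only show the command that would be executed.
            echo "sudo service $name $action"
        else
            sudo service "$name" "$action"
        fi
    done
}
```

<p>For example, <span style="font-family: 'courier new', courier;">service_loop hadoop-hdfs- stop</span> prints one <span style="font-family: 'courier new', courier;">sudo service ... stop</span> line per HDFS init script. Setting <span style="font-family: 'courier new', courier;">DRY_RUN=0</span> first makes it actually run them.</p>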
<p>The post <a href="http://www.feeny.org/hadoop-downgrading-from-yarn-to-mrv1-cloudera-cdh4/">Hadoop &#8211; Downgrading from YARN to MRv1 (Cloudera CDH4)</a> appeared first on <a href="http://www.feeny.org">Maximum Entropy</a>.</p>]]></content:encoded>
			<wfw:commentRss>http://www.feeny.org/hadoop-downgrading-from-yarn-to-mrv1-cloudera-cdh4/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Great article on MOOC&#8217;s</title>
		<link>http://www.feeny.org/great-article-on-moocs/</link>
		<comments>http://www.feeny.org/great-article-on-moocs/#comments</comments>
		<pubDate>Tue, 16 Apr 2013 19:28:57 +0000</pubDate>
		<dc:creator>brian</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://www.feeny.org/?p=2191</guid>
		<description><![CDATA[<p>Wanted to include a link to a great Article entitled Putting a MOOC on the Resume. &#160;I use MOOCs all the time to learn new stuff, keep my knowledge up in existing areas, and often times to use in conjunction with University study for an additional resource.&#160;</p><p>The post <a href="http://www.feeny.org/great-article-on-moocs/">Great article on MOOC&#8217;s</a> appeared first on <a href="http://www.feeny.org">Maximum Entropy</a>.</p>]]></description>
				<content:encoded><![CDATA[<p>I wanted to share a link to a great article entitled <a href="http://www.onlinecollegecourses.com/2013/03/07/putting-a-mooc-on-the-resume/">Putting a MOOC on the Resume</a>. I use MOOCs all the time to learn new material, to keep my knowledge current in existing areas, and often as an additional resource alongside university study.</p>
<p>The post <a href="http://www.feeny.org/great-article-on-moocs/">Great article on MOOC&#8217;s</a> appeared first on <a href="http://www.feeny.org">Maximum Entropy</a>.</p>]]></content:encoded>
			<wfw:commentRss>http://www.feeny.org/great-article-on-moocs/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Changing key/value split delimiter in Hadoop .20.2</title>
		<link>http://www.feeny.org/changing-keyvalue-split-delimeter-in-hadoop-20-2/</link>
		<comments>http://www.feeny.org/changing-keyvalue-split-delimeter-in-hadoop-20-2/#comments</comments>
		<pubDate>Mon, 15 Apr 2013 19:23:04 +0000</pubDate>
		<dc:creator>brian</dc:creator>
				<category><![CDATA[Data Analytics]]></category>
		<category><![CDATA[hadoop]]></category>

		<guid isPermaLink="false">http://www.feeny.org/?p=2186</guid>
		<description><![CDATA[<p>You can find a list of deprecated properties in Hadoop .20.2 here: http://hadoop.apache.org/docs/r2.0.3-alpha/hadoop-project-dist/hadoop-common/DeprecatedProperties.html After reading this you may think that you need to set&#160;mapreduce.input.keyvaluelinerecordreader.key.value.separator in order to change the delimiter for the KeyValueTextInputFormat. &#160;However, what I have noticed from experience is that this is one of many areas that differ from what the documentation would [...]</p><p>The post <a href="http://www.feeny.org/changing-keyvalue-split-delimeter-in-hadoop-20-2/">Changing key/value split delimiter in Hadoop .20.2</a> appeared first on <a href="http://www.feeny.org">Maximum Entropy</a>.</p>]]></description>
				<content:encoded><![CDATA[<p>You can find a list of deprecated properties in Hadoop .20.2 here:</p>
<p>http://hadoop.apache.org/docs/r2.0.3-alpha/hadoop-project-dist/hadoop-common/DeprecatedProperties.html</p>
<p>After reading this you may think that you need to set&nbsp;<span style="font-family: 'courier new', courier;">mapreduce.input.keyvaluelinerecordreader.key.value.separator</span> in order to change the delimiter for the <span style="font-family: 'courier new', courier;">KeyValueTextInputFormat</span>. &nbsp;However, what I have noticed from experience is that this is one of many areas that differ from what the documentation would lead you to believe.</p>
<p>Instead, you must continue to use <span style="font-family: 'courier new', courier;">key.value.separator.in.input.line</span>, like so:</p>
<p><span style="font-family: 'courier new', courier;">public int run(String[] args) throws Exception {</span></p>
<p><span style="font-family: 'courier new', courier;">&nbsp; &nbsp; &nbsp;Configuration conf = getConf();</span></p>
<p><span style="font-family: 'courier new', courier;">&nbsp; &nbsp; &nbsp;conf.set("key.value.separator.in.input.line", ",");</span></p>
<p>At present the Hadoop API can be quite confusing, because much has changed, from the spelling of method names to entire syntaxes. The documentation doesn&#8217;t always lead you to success, so you must experiment.</p>
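<p>To make the effect of the separator concrete, here is a small sketch of how a line gets divided into key and value: everything before the first separator is the key, and the rest is the value. If the separator is missing, the whole line becomes the key and the value is empty. The function <span style="font-family: 'courier new', courier;">split_key_value</span> is a hypothetical helper for illustration only, not a Hadoop API.</p>

```shell
#!/usr/bin/env bash
# Sketch of how a key/value line is split on the first separator.
# split_key_value is a hypothetical helper, not a Hadoop API.
# Results are returned in the global variables KEY and VALUE.
split_key_value() {
    local line="$1" sep="$2"
    case "$line" in
        *"$sep"*)
            # Key is everything before the first separator, value is the rest.
            KEY="${line%%"$sep"*}"
            VALUE="${line#*"$sep"}"
            ;;
        *)
            # No separator: the whole line becomes the key, the value is empty.
            KEY="$line"
            VALUE=""
            ;;
    esac
}
```

<p>With a comma as the separator, the line <span style="font-family: 'courier new', courier;">a,b,c</span> gives the key <span style="font-family: 'courier new', courier;">a</span> and the value <span style="font-family: 'courier new', courier;">b,c</span>, since only the first comma counts.</p>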
<p>The post <a href="http://www.feeny.org/changing-keyvalue-split-delimeter-in-hadoop-20-2/">Changing key/value split delimiter in Hadoop .20.2</a> appeared first on <a href="http://www.feeny.org">Maximum Entropy</a>.</p>]]></content:encoded>
			<wfw:commentRss>http://www.feeny.org/changing-keyvalue-split-delimeter-in-hadoop-20-2/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Changing MapReduce number of Mappers in Hadoop .20.2</title>
		<link>http://www.feeny.org/changing-mapreduce-number-of-mappers-in-hadoop-20-2/</link>
		<comments>http://www.feeny.org/changing-mapreduce-number-of-mappers-in-hadoop-20-2/#comments</comments>
		<pubDate>Mon, 15 Apr 2013 19:14:37 +0000</pubDate>
		<dc:creator>brian</dc:creator>
				<category><![CDATA[Data Analytics]]></category>
		<category><![CDATA[hadoop]]></category>
		<category><![CDATA[setNumMapTasks()]]></category>

		<guid isPermaLink="false">http://www.feeny.org/?p=2184</guid>
		<description><![CDATA[<p>In earlier releases of Hadoop you could change the number of mappers by setting: setNumMapTasks() You did this using JobConf. &#160;Things in Hadoop .20.2 have migrated to using the Job class instead of JobConf. &#160;Although setNumReduceTasks() is still valid, setNumMapTasks() has been deprecated. &#160;How then do you set the number of Mappers on a MapReduce [...]</p><p>The post <a href="http://www.feeny.org/changing-mapreduce-number-of-mappers-in-hadoop-20-2/">Changing MapReduce number of Mappers in Hadoop .20.2</a> appeared first on <a href="http://www.feeny.org">Maximum Entropy</a>.</p>]]></description>
				<content:encoded><![CDATA[<p>In earlier releases of Hadoop you could change the number of mappers by calling:</p>
<p><span style="font-family: 'courier new', courier;">setNumMapTasks()</span></p>
<p>You did this using <span style="font-family: 'courier new', courier;">JobConf</span>. &nbsp;Things in Hadoop .20.2 have migrated to using the <span style="font-family: 'courier new', courier;">Job</span> class instead of <span style="font-family: 'courier new', courier;">JobConf</span>. &nbsp;Although <span style="font-family: 'courier new', courier;">setNumReduceTasks()</span> is still valid, <span style="font-family: 'courier new', courier;">setNumMapTasks()</span> has been deprecated. &nbsp;How then do you set the number of Mappers on a MapReduce job? &nbsp;You must adjust the split size. &nbsp;There is much written on this, but it can be difficult to find at times. &nbsp;The split size is determined by the <span style="font-family: 'courier new', courier;">InputFormat</span> being used. &nbsp;I typically use <span style="font-family: 'courier new', courier;">KeyValueTextInputFormat</span>. &nbsp;To adjust my split size, I simply pass the <span style="font-family: 'courier new', courier;">mapred.max.split.size</span> parameter like so:</p>
<p><span style="font-family: 'courier new', courier;">-D mapred.max.split.size=2500</span></p>
<p>In this example my input file size was 24451 bytes. By setting the parameter <span style="font-family: 'courier new', courier;">-D mapred.max.split.size=2500</span>, I ended up with 10 map tasks (24451 / 2500, rounded up).</p>
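<p>The arithmetic behind that number can be sketched as a ceiling division of the input size by the maximum split size. This is only a rough estimate under my own assumptions: it treats the input as a single file, assumes the HDFS block size is larger than the split size, and ignores the small slack Hadoop allows on the last split. The function <span style="font-family: 'courier new', courier;">estimate_splits</span> is a hypothetical helper, not part of Hadoop.</p>

```shell
#!/usr/bin/env bash
# Sketch: estimate map tasks as ceil(input bytes / max split bytes).
# estimate_splits is a hypothetical helper, not part of Hadoop. It assumes
# one input file and a block size larger than the maximum split size.
estimate_splits() {
    local input_bytes="$1" max_split_bytes="$2"
    # Integer ceiling division: add (divisor - 1) before dividing.
    echo $(( (input_bytes + max_split_bytes - 1) / max_split_bytes ))
}
```

<p>Running <span style="font-family: 'courier new', courier;">estimate_splits 24451 2500</span> prints 10, which matches the number of map tasks I saw.</p>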
<p>The post <a href="http://www.feeny.org/changing-mapreduce-number-of-mappers-in-hadoop-20-2/">Changing MapReduce number of Mappers in Hadoop .20.2</a> appeared first on <a href="http://www.feeny.org">Maximum Entropy</a>.</p>]]></content:encoded>
			<wfw:commentRss>http://www.feeny.org/changing-mapreduce-number-of-mappers-in-hadoop-20-2/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Apache Mahout prepare20newsgroups in version .7</title>
		<link>http://www.feeny.org/apache-mahout-prepare20newsgroups-in-version-7/</link>
		<comments>http://www.feeny.org/apache-mahout-prepare20newsgroups-in-version-7/#comments</comments>
		<pubDate>Mon, 15 Apr 2013 19:05:19 +0000</pubDate>
		<dc:creator>brian</dc:creator>
				<category><![CDATA[Data Analytics]]></category>
		<category><![CDATA[hadoop]]></category>
		<category><![CDATA[mahout]]></category>

		<guid isPermaLink="false">http://www.feeny.org/?p=2182</guid>
		<description><![CDATA[<p>Apache Mahout has gone through some changes recently, and one of the things you will notice no longer works, is the old prepare20newsgroups classifier routine. &#160;This has been replaced, and the new syntax is much different. &#160;This page will walk you though how to use the classifier: Download some SPAM: curl -O http://spamassassin.apache.org/publiccorpus/20021010_spam.tar.bz2 Download some [...]</p><p>The post <a href="http://www.feeny.org/apache-mahout-prepare20newsgroups-in-version-7/">Apache Mahout prepare20newsgroups in version .7</a> appeared first on <a href="http://www.feeny.org">Maximum Entropy</a>.</p>]]></description>
				<content:encoded><![CDATA[<p>Apache Mahout has gone through some changes recently, and one thing you will notice no longer works is the old prepare20newsgroups classifier routine. It has been replaced, and the new syntax is quite different. This page will walk you through how to use the classifier:</p>
<p><strong>Download some SPAM:</strong></p>
<div title="Page 22">
<div>
<div>
<div>
<p>curl -O http://spamassassin.apache.org/publiccorpus/20021010_spam.tar.bz2</p>
<p><strong>Download some HAM:</strong></p>
<div title="Page 22">
<div>
<div>
<div>
<p>curl -O http://spamassassin.apache.org/publiccorpus/20021010_easy_ham.tar.bz2&nbsp;</p>
<p><strong>Extract the files:</strong></p>
<div title="Page 22">
<div>
<div>
<div>
<p>$ tar xjf 20021010_spam.tar.bz2<br />
$ tar xjf 20021010_easy_ham.tar.bz2&nbsp;</p>
<div><b>Copy easy_ham and spam directories into 20news-all:</b></div>
<div>mkdir -p 20news-all</div>
<div>cp -R easy_ham/ spam/ 20news-all/</div>
<div>&nbsp;</div>
<div><b>Copy 20news-all to HDFS:</b></div>
<div>hadoop fs -put 20news-all 20news-all</div>
<div>&nbsp;</div>
<div><b>Prepare data by sequencing into vectors:</b></div>
<div>
<div>mahout seqdirectory -i 20news-all -o 20news-seq</div>
<div>mahout seq2sparse -i 20news-seq -o 20news-vectors -lnorm -nv -wt tfidf</div>
</div>
<div>&nbsp;</div>
<div><b>Split data into train and test sets with 20% of the data being used for test and 80% for train:</b></div>
<div>
<div>mahout split -i 20news-vectors/tfidf-vectors --trainingOutput 20news-train-vectors --testOutput 20news-test-vectors --randomSelectionPct 20 --overwrite --sequenceFiles -xm sequential</div>
<div>&nbsp;</div>
</div>
<div><b>Build the model:</b></div>
<div>mahout trainnb -i 20news-train-vectors -el -o model -li labelindex -ow -c</div>
<div>&nbsp;</div>
<div><b>You can test the model against the training set:</b></div>
<div>mahout testnb -i 20news-train-vectors -m model -l labelindex -ow -o 20news-testing-train -c</div>
<div>&nbsp;</div>
<div><b>Now test against the test set:</b></div>
<div>mahout testnb -i 20news-test-vectors -m model -l labelindex -ow -o 20news-testing-test -c</div>
<div>&nbsp;</div>
<div><b>You can view how the messages were classified, although it is a bit cryptic:</b></div>
<div>mahout seqdumper -i 20news-testing-test/part-m-00000</div>
<div>&nbsp;</div>
<div>I worked this out from the examples/bin/classify-20newsgroups.sh script that comes with Mahout. It can sometimes be found in a package called mahout-docs (for example, if using the Cloudera CDH repositories).</div>
<div>&nbsp;</div>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
<p>The post <a href="http://www.feeny.org/apache-mahout-prepare20newsgroups-in-version-7/">Apache Mahout prepare20newsgroups in version .7</a> appeared first on <a href="http://www.feeny.org">Maximum Entropy</a>.</p>]]></content:encoded>
			<wfw:commentRss>http://www.feeny.org/apache-mahout-prepare20newsgroups-in-version-7/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>