Apache Mahout prepare20newsgroups in version 0.7

Apache Mahout has gone through some changes recently, and one thing you will notice is that the old prepare20newsgroups classifier routine no longer works. It has been replaced, and the new syntax is quite different. This page walks you through how to use the classifier:

Download some SPAM:

curl -O http://spamassassin.apache.org/publiccorpus/20021010_spam.tar.bz2

Download some HAM:

curl -O http://spamassassin.apache.org/publiccorpus/20021010_easy_ham.tar.bz2 

Extract the files:

tar xjf 20021010_spam.tar.bz2
tar xjf 20021010_easy_ham.tar.bz2

Create 20news-all and copy the easy_ham and spam directories into it:
mkdir -p 20news-all
cp -R easy_ham spam 20news-all/
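Before uploading, it can be worth checking the layout, since the later steps appear to take each message's label from the name of its parent directory (that is my reading of the -el option to trainnb, so treat it as an assumption). A minimal sketch that prints one line per class directory with its file count; count_per_label is a hypothetical helper name, not part of Mahout:

```shell
# Hypothetical helper (not part of Mahout): print "<label> <file count>"
# for every immediate subdirectory of the given root directory.
count_per_label() {
  root="$1"
  for dir in "$root"/*/; do
    # Skip the unexpanded pattern when the root has no subdirectories.
    [ -d "$dir" ] || continue
    # tr strips the padding some wc implementations add around the number.
    n=$(find "$dir" -type f | wc -l | tr -d ' ')
    printf '%s %s\n' "$(basename "$dir")" "$n"
  done
}
```

Running count_per_label 20news-all should list exactly two labels, easy_ham and spam. Any other entry means something extra was copied into the directory.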
Copy 20news-all to HDFS (the second argument is the destination path, relative to your HDFS home directory):
hadoop fs -put 20news-all 20news-all
Prepare the data by converting it to sequence files and then to TF-IDF vectors:
mahout seqdirectory -i 20news-all -o 20news-seq
mahout seq2sparse -i 20news-seq -o 20news-vectors -lnorm -nv -wt tfidf
Split the data into training and test sets, with 20% of the data used for testing and 80% for training:
mahout split -i 20news-vectors/tfidf-vectors --trainingOutput 20news-train-vectors --testOutput 20news-test-vectors --randomSelectionPct 20 --overwrite --sequenceFiles -xm sequential
Build the model:
mahout trainnb -i 20news-train-vectors -el -o model -li labelindex -ow -c
You can test the model against the training set:
mahout testnb -i 20news-train-vectors -m model -l labelindex -ow -o 20news-testing-train -c
Now test against the test set:
mahout testnb -i 20news-test-vectors -m model -l labelindex -ow -o 20news-testing-test -c
You can view how the messages were classified, although it is a bit cryptic:
mahout seqdumper -i 20news-testing-test/part-m-00000 
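To make the dump a little less cryptic, one option is to count how many records it contains per key. This is only a sketch: it assumes seqdumper prints each record on one line in the form "Key: <key>: Value: <value>" and that the key holds the label, which you should confirm against your own output. tally_keys is a hypothetical helper name:

```shell
# Hypothetical helper: read seqdumper text on stdin and print "<key> <count>".
# Assumes records look like "Key: <key>: Value: <value>"; other lines
# (for example the header and the trailing count) are ignored.
tally_keys() {
  sed -n 's/^Key: \(.*\): Value: .*$/\1/p' | sort | uniq -c | awk '{print $2, $1}'
}
```

It would be used as: mahout seqdumper -i 20news-testing-test/part-m-00000 | tally_keys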
I worked this out from examples/bin/classify-20newsgroups.sh, which ships with Mahout and can sometimes be found in a package called mahout-docs (for example, if you are using the Cloudera CDH repositories).