Apache Mahout prepare20newsgroups in version 0.7
Download some SPAM:
curl -O http://spamassassin.apache.org/publiccorpus/20021010_spam.tar.bz2
Download some HAM:
curl -O http://spamassassin.apache.org/publiccorpus/20021010_easy_ham.tar.bz2
Extract the files:
tar xjf 20021010_spam.tar.bz2
tar xjf 20021010_easy_ham.tar.bz2
Create a 20news-all directory and copy the easy_ham and spam directories into it:
mkdir 20news-all
cp -R easy_ham/ spam/ 20news-all/
Copy 20news-all to HDFS:
hadoop fs -put 20news-all 20news-all
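You can confirm the upload with:
hadoop fs -ls 20news-all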
Prepare the data by converting it to SequenceFiles and then to TF-IDF vectors:
mahout seqdirectory -i 20news-all -o 20news-seq
mahout seq2sparse -i 20news-seq -o 20news-vectors -lnorm -nv -wt tfidf
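If you want to inspect the generated vectors programmatically instead of with the CLI, a minimal Java sketch like the following should work; it assumes the vectors landed in a part file named part-r-00000 under 20news-vectors/tfidf-vectors (adjust to whatever seq2sparse actually produced) and uses Mahout's standard SequenceFile iterator classes:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.mahout.common.Pair;
import org.apache.mahout.common.iterator.sequencefile.SequenceFileIterable;
import org.apache.mahout.math.VectorWritable;

public class DumpVectors {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // Assumed path: the seq2sparse output from above; adjust the part file name if needed
    Path path = new Path("20news-vectors/tfidf-vectors/part-r-00000");
    for (Pair<Text, VectorWritable> record :
        new SequenceFileIterable<Text, VectorWritable>(path, true, conf)) {
      // Key is the document name, value is its sparse TF-IDF vector
      System.out.println(record.getFirst() + " => "
          + record.getSecond().get().getNumNondefaultElements() + " terms");
    }
  }
}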
Split the data into train and test sets, using 20% of the data for testing and 80% for training:
mahout split -i 20news-vectors/tfidf-vectors --trainingOutput 20news-train-vectors --testOutput 20news-test-vectors --randomSelectionPct 20 --overwrite --sequenceFiles -xm sequential
Build the model:
mahout trainnb -i 20news-train-vectors -el -o model -li labelindex -ow -c
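The trained model can also be used directly from Java rather than through testnb. Here is a minimal sketch against the Mahout 0.7 naive Bayes API, assuming the model and labelindex paths from the trainnb command above; producing a TF-IDF Vector for a new document is left out:

import java.util.Map;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.mahout.classifier.naivebayes.BayesUtils;
import org.apache.mahout.classifier.naivebayes.NaiveBayesModel;
import org.apache.mahout.classifier.naivebayes.StandardNaiveBayesClassifier;
import org.apache.mahout.math.Vector;

public class Classify {
  public static String classify(Vector tfidf) throws Exception {
    Configuration conf = new Configuration();
    // Load the model and label index written by trainnb (paths from the command above)
    NaiveBayesModel model = NaiveBayesModel.materialize(new Path("model"), conf);
    Map<Integer, String> labels = BayesUtils.readLabelIndex(conf, new Path("labelindex"));
    StandardNaiveBayesClassifier classifier = new StandardNaiveBayesClassifier(model);
    // classifyFull returns one score per label; pick the best-scoring one
    Vector scores = classifier.classifyFull(tfidf);
    return labels.get(scores.maxValueIndex());  // e.g. "spam" or "easy_ham"
  }
}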
You can test the model against the training set:
mahout testnb -i 20news-train-vectors -m model -l labelindex -ow -o 20news-testing-train -c
Now test against the test set:
mahout testnb -i 20news-test-vectors -m model -l labelindex -ow -o 20news-testing-test -c
You can view how the messages were classified, although the output is a bit cryptic:
mahout seqdumper -i 20news-testing-test/part-m-00000
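To make the dump less cryptic, you can join the scores back to the label names yourself. A sketch, assuming (as I believe is the case for testnb output) that each record is a Text key holding the true label and a VectorWritable of per-label scores:

import java.util.Map;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.mahout.classifier.naivebayes.BayesUtils;
import org.apache.mahout.common.Pair;
import org.apache.mahout.common.iterator.sequencefile.SequenceFileIterable;
import org.apache.mahout.math.VectorWritable;

public class DumpResults {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Map<Integer, String> labels = BayesUtils.readLabelIndex(conf, new Path("labelindex"));
    Path path = new Path("20news-testing-test/part-m-00000");
    for (Pair<Text, VectorWritable> record :
        new SequenceFileIterable<Text, VectorWritable>(path, true, conf)) {
      // The highest-scoring index in the score vector is the predicted label
      int predicted = record.getSecond().get().maxValueIndex();
      System.out.println("actual=" + record.getFirst()
          + " predicted=" + labels.get(predicted));
    }
  }
}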
The insights I used to figure this out came from examples/bin/classify-20newsgroups.sh, which ships with Mahout and can sometimes be found in a package called mahout-docs (for example, when using the Cloudera CDH repositories).