Apache Mahout prepare20newsgroups in version 0.7




Download some SPAM:

curl -O http://spamassassin.apache.org/publiccorpus/20021010_spam.tar.bz2

Download some HAM:

curl -O http://spamassassin.apache.org/publiccorpus/20021010_easy_ham.tar.bz2

Extract the files:

tar xjf 20021010_spam.tar.bz2
tar xjf 20021010_easy_ham.tar.bz2

Copy the easy_ham and spam directories into 20news-all (create it first if it does not exist):

mkdir -p 20news-all
cp -R easy_ham spam 20news-all/
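Before uploading anything, it can be worth checking that each class ended up in its own subdirectory and that the file counts look plausible. The helper below is a minimal sketch for that check; the function name count_per_class is my own and is not part of Mahout or Hadoop.

```shell
# count_per_class DIR
# Print "<subdirectory name> <number of regular files>" for each
# subdirectory of DIR. Hypothetical helper, not a Mahout command.
count_per_class() {
  for d in "$1"/*/; do
    # Skip the unexpanded pattern when DIR has no subdirectories.
    [ -d "$d" ] || continue
    printf '%s %s\n' "$(basename "$d")" "$(find "$d" -type f | wc -l | tr -d ' ')"
  done
}
```

Running count_per_class 20news-all should then print one line for easy_ham and one for spam.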

Copy 20news-all to HDFS:

hadoop fs -put 20news-all 20news-all

Prepare the data by converting it to sequence files and then to TF-IDF vectors:


mahout seqdirectory -i 20news-all -o 20news-seq

mahout seq2sparse -i 20news-seq -o 20news-vectors -lnorm -nv -wt tfidf

Split data into train and test sets with 20% of the data being used for test and 80% for train:


mahout split -i 20news-vectors/tfidf-vectors --trainingOutput 20news-train-vectors --testOutput 20news-test-vectors --randomSelectionPct 20 --overwrite --sequenceFiles -xm sequential
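To make the idea of the random split concrete, here is a small local sketch that sends roughly a given percentage of the lines of a text file to a test file and the rest to a training file. It only illustrates the concept: the function name split_lines, its arguments, and the per-line random draw are my own assumptions, and Mahout's split command may select its test records differently.

```shell
# split_lines INPUT TRAIN_OUT TEST_OUT PCT [SEED]
# Send each input line to TEST_OUT with probability PCT/100,
# otherwise to TRAIN_OUT. Hypothetical helper, not Mahout's logic.
split_lines() {
  # Make sure both outputs exist even if one receives no lines.
  : > "$2"
  : > "$3"
  awk -v pct="$4" -v seed="${5:-42}" -v trainf="$2" -v testf="$3" '
    BEGIN { srand(seed) }
    { if (rand() * 100 < pct) print > testf; else print > trainf }
  ' "$1"
}
```

For example, split_lines files.txt train.txt test.txt 20 puts about 20% of the lines of files.txt into test.txt. Every line lands in exactly one of the two outputs, which is the property that matters for an honest evaluation.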

Build the model:

mahout trainnb -i 20news-train-vectors -el -o model -li labelindex -ow -c
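As far as I can tell, the -el option makes trainnb derive each document's label from its sequence-file key, and seqdirectory builds those keys from the path below the input directory, for example /spam/<filename>. That is why easy_ham and spam have to be separate subdirectories of 20news-all. The sketch below shows that mapping under this assumption; label_from_key is a hypothetical helper of mine, and the key layout should be confirmed with seqdumper on your own data.

```shell
# label_from_key KEY
# Print the first path component of a key such as /spam/0001.
# Assumes keys have the form /<label>/<filename>.
label_from_key() {
  printf '%s\n' "$1" | cut -d/ -f2
}
```

With that layout, label_from_key /easy_ham/0001 prints easy_ham.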

You can test the model against the training set:

mahout testnb -i 20news-train-vectors -m model -l labelindex -ow -o 20news-testing-train -c

Now test against the test set:

mahout testnb -i 20news-test-vectors -m model -l labelindex -ow -o 20news-testing-test -c
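The number to compare between the two runs is the accuracy: correctly classified instances divided by total instances. If the training-set accuracy is much higher than the test-set accuracy, the model is probably overfitting. Here is a minimal sketch of that arithmetic; the accuracy function is my own helper, and the counts you pass in have to be read off the testnb summary by hand.

```shell
# accuracy CORRECT TOTAL
# Print the percentage of correctly classified instances with two
# decimals, or "n/a" when TOTAL is zero. Hypothetical helper.
accuracy() {
  awk -v c="$1" -v t="$2" 'BEGIN {
    if (t == 0) { print "n/a"; exit }
    printf "%.2f\n", 100 * c / t
  }'
}
```

Call it as accuracy CORRECT TOTAL, once with the counts from the training-set run and once with the counts from the test-set run.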

You can view how the messages were classified, although it is a bit cryptic:

mahout seqdumper -i 20news-testing-test/part-m-00000

I worked this out from examples/bin/classify-20newsgroups.sh, which ships with Mahout. Depending on how Mahout was installed, the script may instead be in a separate package called mahout-docs (for example, when using the Cloudera CDH repositories).
