Apache Mahout prepare20newsgroups in version 0.7
Download some SPAM:
curl -O http://spamassassin.apache.org/publiccorpus/20021010_spam.tar.bz2
Download some HAM:
curl -O http://spamassassin.apache.org/publiccorpus/20021010_easy_ham.tar.bz2
Extract the files:
tar xjf 20021010_spam.tar.bz2
tar xjf 20021010_easy_ham.tar.bz2
Create a 20news-all directory and copy the easy_ham and spam directories into it:
mkdir 20news-all
cp -R easy_ham/ spam/ 20news-all/
Copy 20news-all to HDFS:
hadoop fs -put 20news-all 20news-all
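You can confirm the upload with:
hadoop fs -ls 20news-all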
Prepare the data by converting it to SequenceFiles and then to TF-IDF vectors:
mahout seqdirectory -i 20news-all -o 20news-seq
mahout seq2sparse -i 20news-seq -o 20news-vectors -lnorm -nv -wt tfidf
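If you want to inspect the generated vectors programmatically instead of with the CLI, a minimal Java sketch like the following should work; it assumes the vectors landed in a part file named part-r-00000 under 20news-vectors/tfidf-vectors (adjust to whatever seq2sparse actually produced) and uses Mahout's standard SequenceFile iterator classes:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.mahout.common.Pair;
import org.apache.mahout.common.iterator.sequencefile.SequenceFileIterable;
import org.apache.mahout.math.VectorWritable;

public class DumpVectors {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // Assumed path: the seq2sparse output from above; adjust the part file name if needed
    Path path = new Path("20news-vectors/tfidf-vectors/part-r-00000");
    for (Pair<Text, VectorWritable> record :
        new SequenceFileIterable<Text, VectorWritable>(path, true, conf)) {
      // Key is the document name, value is its sparse TF-IDF vector
      System.out.println(record.getFirst() + " => "
          + record.getSecond().get().getNumNondefaultElements() + " terms");
    }
  }
}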
Split the data into train and test sets, using 20% of the data for testing and 80% for training:
mahout split -i 20news-vectors/tfidf-vectors --trainingOutput 20news-train-vectors --testOutput 20news-test-vectors --randomSelectionPct 20 --overwrite --sequenceFiles -xm sequential
Build the model:
mahout trainnb -i 20news-train-vectors -el -o model -li labelindex -ow -c
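The trained model can also be used directly from Java rather than through testnb. Here is a minimal sketch against the Mahout 0.7 naive Bayes API, assuming the model and labelindex paths from the trainnb command above; producing a TF-IDF Vector for a new document is left out:

import java.util.Map;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.mahout.classifier.naivebayes.BayesUtils;
import org.apache.mahout.classifier.naivebayes.NaiveBayesModel;
import org.apache.mahout.classifier.naivebayes.StandardNaiveBayesClassifier;
import org.apache.mahout.math.Vector;

public class Classify {
  public static String classify(Vector tfidf) throws Exception {
    Configuration conf = new Configuration();
    // Load the model and label index written by trainnb (paths from the command above)
    NaiveBayesModel model = NaiveBayesModel.materialize(new Path("model"), conf);
    Map<Integer, String> labels = BayesUtils.readLabelIndex(conf, new Path("labelindex"));
    StandardNaiveBayesClassifier classifier = new StandardNaiveBayesClassifier(model);
    // classifyFull returns one score per label; pick the best-scoring one
    Vector scores = classifier.classifyFull(tfidf);
    return labels.get(scores.maxValueIndex());  // e.g. "spam" or "easy_ham"
  }
}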
You can test the model against the training set:
mahout testnb -i 20news-train-vectors -m model -l labelindex -ow -o 20news-testing-train -c
Now test against the test set:
mahout testnb -i 20news-test-vectors -m model -l labelindex -ow -o 20news-testing-test -c
You can view how the messages were classified, although the output is a bit cryptic:
mahout seqdumper -i 20news-testing-test/part-m-00000
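To make the dump less cryptic, you can join the scores back to the label names yourself. A sketch, assuming (as I believe is the case for testnb output) that each record is a Text key holding the true label and a VectorWritable of per-label scores:

import java.util.Map;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.mahout.classifier.naivebayes.BayesUtils;
import org.apache.mahout.common.Pair;
import org.apache.mahout.common.iterator.sequencefile.SequenceFileIterable;
import org.apache.mahout.math.VectorWritable;

public class DumpResults {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Map<Integer, String> labels = BayesUtils.readLabelIndex(conf, new Path("labelindex"));
    Path path = new Path("20news-testing-test/part-m-00000");
    for (Pair<Text, VectorWritable> record :
        new SequenceFileIterable<Text, VectorWritable>(path, true, conf)) {
      // The highest-scoring index in the score vector is the predicted label
      int predicted = record.getSecond().get().maxValueIndex();
      System.out.println("actual=" + record.getFirst()
          + " predicted=" + labels.get(predicted));
    }
  }
}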
The insights I used to figure this out came from examples/bin/classify-20newsgroups.sh, which ships with Mahout and can sometimes be found in a package called mahout-docs (for example, when using the Cloudera CDH repositories).