Apache Mahout prepare20newsgroups in version 0.7
- bfeeny
- Apr 15, 2013
Download some SPAM:
curl -O http://spamassassin.apache.org/publiccorpus/20021010_spam.tar.bz2
Download some HAM:
curl -O http://spamassassin.apache.org/publiccorpus/20021010_easy_ham.tar.bz2
Extract the files:
tar xjf 20021010_spam.tar.bz2
tar xjf 20021010_easy_ham.tar.bz2
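Create a 20news-all directory and copy the easy_ham and spam directories into it. The directory names become the class labels later on, so copy the directories themselves rather than their contents:
mkdir -p 20news-all
cp -R easy_ham spam 20news-all/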
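Before uploading anything, it can help to confirm that each label directory actually holds messages. The following is a minimal sketch, not part of Mahout; the function name `count_per_label` is made up for illustration, and it assumes one message per file.

```shell
# count_per_label ROOT: print "<label><TAB><file count>" for every
# subdirectory of ROOT. Hypothetical helper, not a Mahout command.
count_per_label() {
  root="$1"
  for dir in "$root"/*/; do
    # Skip the unexpanded pattern when ROOT has no subdirectories.
    [ -d "$dir" ] || continue
    label=$(basename "$dir")
    # tr strips the padding that some wc implementations add.
    n=$(find "$dir" -type f | wc -l | tr -d ' ')
    printf '%s\t%s\n' "$label" "$n"
  done
}
```

Calling `count_per_label 20news-all` should list exactly two labels, easy_ham and spam, each with a non-zero count.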
Copy 20news-all to HDFS (the second argument is the destination path, relative to your HDFS home directory):
hadoop fs -put 20news-all 20news-all
Prepare the data by converting the files to sequence files, then the sequence files to TF-IDF vectors:
mahout seqdirectory -i 20news-all -o 20news-seq
mahout seq2sparse -i 20news-seq -o 20news-vectors -lnorm -nv -wt tfidf
Split the vectors into training and test sets, holding out 20% of the data for testing and keeping 80% for training:
mahout split -i 20news-vectors/tfidf-vectors --trainingOutput 20news-train-vectors --testOutput 20news-test-vectors --randomSelectionPct 20 --overwrite --sequenceFiles -xm sequential
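To make the idea of a random 20% hold-out concrete, here is a small stand-alone sketch that tags lines of input as "test" or "train". It is an illustration only and does not reproduce how Mahout's split selects records; the function name `holdout_split` and its two arguments are assumptions introduced here. Each line is assigned independently, so the test share is only approximately the requested percentage.

```shell
# holdout_split PCT SEED: read one record per line on stdin and prefix
# each line with "test" or "train". Roughly PCT percent land in "test".
# Illustration only; this is not Mahout's implementation.
holdout_split() {
  awk -v pct="$1" -v seed="$2" '
    BEGIN { srand(seed) }   # fixed seed makes the split repeatable
    {
      tag = (rand() * 100 < pct) ? "test" : "train"
      printf "%s\t%s\n", tag, $0
    }
  '
}
```

For example, `ls 20news-all/spam | holdout_split 20 42` tags about one file in five as test data.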
Build the model. The -el flag extracts the labels from the input, and -c trains a complementary naive Bayes model:
mahout trainnb -i 20news-train-vectors -el -o model -li labelindex -ow -c
You can test the model against the training set:
mahout testnb -i 20news-train-vectors -m model -l labelindex -ow -o 20news-testing-train -c
Now test against the test set:
mahout testnb -i 20news-test-vectors -m model -l labelindex -ow -o 20news-testing-test -c
You can view how the messages were classified, although the output is a bit cryptic:
mahout seqdumper -i 20news-testing-test/part-m-00000
I worked this out from examples/bin/classify-20newsgroups.sh, which ships with Mahout. The script is sometimes packaged separately in mahout-docs (for example, when using the Cloudera CDH repositories).
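The Mahout steps above can be collected into one shell function so they run in order and stop at the first failure. This is a sketch under assumptions: `WORK`, `DRY_RUN`, `run`, and `pipeline` are names introduced here, the download, copy, and HDFS upload steps are expected to have been done already, and the optional test against the training set is left out.

```shell
#!/bin/sh
# Sketch: the walkthrough's Mahout commands as one function.
# WORK and DRY_RUN are hypothetical variables, not Mahout settings.
WORK="${WORK:-20news}"

run() {
  # Print the command, then execute it unless DRY_RUN is 1.
  echo "+ $*"
  if [ "${DRY_RUN:-0}" != "1" ]; then
    "$@"
  fi
}

pipeline() {
  # Each step runs only if the previous one succeeded.
  run mahout seqdirectory -i "$WORK-all" -o "$WORK-seq" &&
  run mahout seq2sparse -i "$WORK-seq" -o "$WORK-vectors" -lnorm -nv -wt tfidf &&
  run mahout split -i "$WORK-vectors/tfidf-vectors" \
      --trainingOutput "$WORK-train-vectors" \
      --testOutput "$WORK-test-vectors" \
      --randomSelectionPct 20 --overwrite --sequenceFiles -xm sequential &&
  run mahout trainnb -i "$WORK-train-vectors" -el -o model -li labelindex -ow -c &&
  run mahout testnb -i "$WORK-test-vectors" -m model -l labelindex -ow \
      -o "$WORK-testing-test" -c
}
```

Setting `DRY_RUN=1` before calling `pipeline` prints the five commands without executing them, which is a cheap way to review the paths before starting any Hadoop jobs.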