Step By Step
- Install Weka
- Download and install Weka and LibSVM
- Weka is an open-source machine learning toolkit. The most recent versions (3-5-x) are platform independent, and the same .zip file can be downloaded for Windows, Linux, or Mac machines. After unpacking the compressed file, put the resulting directory in your preferred place (/Applications/weka-3-5-2 on a Mac, for instance); we will call this directory $weka_home from now on. You can download the latest version from:
http://www.cs.waikato.ac.nz/ml/weka/
- LibSVM is a publicly available SVM classifier. The default SVM classifier in Weka has been SMO; since weka-3-5-2 the toolkit includes a wrapper class which allows users to run LibSVM like any other Weka built-in classifier. You can download libsvm.jar from http://www.cs.iastate.edu/~yasser/wlsvm/
- Update CLASSPATH
- Running Weka usually requires adding weka.jar to the CLASSPATH variable of the host machine. We want to add libsvm.jar and the Java home as well.
On Linux and Mac machines, update .bash_profile or .profile by adding the following line:
export CLASSPATH=$CLASSPATH:$weka_home/weka.jar:$weka_home/libsvm.jar:$JAVA_HOME/bin
On Windows, set the variable via Control Panel -> System -> Environment Variables.
- Completing this step lets you run Weka with much more flexibility, but you can also run Weka directly with java -jar weka.jar.
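- As a quick sanity check of the classpath (a suggestion, not part of the original setup), recent 3.5.x versions of Weka ship a SystemInfo class that prints the JVM settings, including java.class.path:
java weka.core.SystemInfo
If this fails with a ClassNotFoundException, weka.jar is not yet on the CLASSPATH.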
- Data preparation (generating .arff file)
- Create .arff file from text files
- Convert string attribute to numeric attributes
The .arff file we get from the previous step contains only two attributes. One is 'class'; the other is 'text', a string holding the content of a document. We have to convert this preliminary file into a format from which we can extract features (attributes), with a numeric value for each feature (attribute).
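For reference, the preliminary file has roughly the shape sketched below; the relation name, class labels, and document texts are made up for illustration. The class attribute comes first and the text attribute second, matching the -R 2 option used below:
% hypothetical two-attribute file produced from raw text
@relation corn_raw

@attribute class {corn,noncorn}
@attribute text string

@data
corn,'u.s. corn exports rose sharply last week ...'
noncorn,'the central bank left interest rates unchanged ...'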
- Weka has a built-in filter called StringToWordVector for this purpose:
java -Xmx1024m weka.filters.unsupervised.attribute.StringToWordVector -b -i str_corn_training.arff -o corn_training.arff -r str_corn_test.arff -s corn_test.arff -R 2 -W 5000 -C -T -I -N 1 -L -M 2
-b: batch mode. This is useful for filtering two datasets at once: the first dataset is used to initialize the filter, and the second one is then filtered according to that setup, i.e., the test set will then contain the same attributes as the training set.
-i: training input file
-o: training output file
-r: test input file
-s: test output file
-R 2: process the second attribute, which is the string attribute (this is the default)
-W 5000: keep the top 5,000 features; the ranking is done per class, which is helpful for binary problems. You can use the -O option to rank features over all classes combined.
-C: output word counts rather than boolean word presence
-T: transform the term frequency (tf) into log(1+tf)
-I: transform the word frequency into tf*log(total # of docs / # of docs containing this word); this is the tf*idf weight without normalization
-N: 0 = don't normalize / 1 = normalize all data / 2 = normalize test data only, to the average length of the training documents (default 0 = don't normalize). (Detailed explanation on the Wekalist)
-L: convert all tokens to lowercase before adding them to the dictionary
-A: only form tokens from contiguous alphabetic sequences (turn this off when working with phrases! After Weka 3-5-6 this option is no longer available; it is replaced by weka.core.tokenizers.AlphabeticTokenizer)
-S: ignore words that are in the stoplist (we don't use this one since we have already applied our own stop list)
-M 2: minimum term frequency; here it is 2.
The -Xmx option simply defines the heap size assigned to Weka (e.g., -Xmx1024m sets the maximum heap size of the Java VM to 1 GB).
(more options)
- Extracting ngrams
In Weka 3-5-6, a new tokenizer was added for extracting ngrams. Here is the same example as above, changed to extract bigrams and trigrams (-min 2 -max 3):
java -Xmx1024m weka.filters.unsupervised.attribute.StringToWordVector -b -i str_corn_training.arff -o corn_training.arff -r str_corn_test.arff -s corn_test.arff -R 2 -W 5000 -C -T -I -N 1 -L -M 2 -tokenizer "weka.core.tokenizers.NGramTokenizer -min 2 -max 3"
Valid options of the tokenizer are:
-delimiters <value>: the delimiters to use (default ` \n\t.,;:'"()?!`)
-max <int>: the maximum size of the ngrams (default: 3)
-min <int>: the minimum size of the ngrams (default: 1)
- Feature Selection
- FS is conducted by calling the weka.filters.supervised.attribute.AttributeSelection filter:
java -Xmx1024m weka.filters.supervised.attribute.AttributeSelection -S "weka.attributeSelection.Ranker -N 100" -E "weka.attributeSelection.InfoGainAttributeEval" -b -i corn_training.arff -o corn_chi100_training.arff -r corn_test.arff -s corn_chi100_test.arff -c 1
-S <"name of search class [search options]">: sets the search method for subset evaluators
-E <"name of attribute/subset evaluation class [evaluator options]">: we can understand this as the feature selection algorithm; weka.attributeSelection.ChiSquaredAttributeEval and weka.attributeSelection.InfoGainAttributeEval are the most effective FS methods according to the literature
-N: keep the N top-ranked features (if using -T instead, one can specify a threshold value)
-i: training input .arff file
-o: training output .arff file
-r: test input .arff file
-s: test output .arff file
-c: define which attribute is the class attribute (acq, noacq); it is a "nominal" attribute, not a "numerical" one, and cannot be ranked
-b: batch mode
- Notice that since the feature selection step takes a lot of time, it is recommended to select the top N features first, and then run some code like /u2/home/nyu/Classification/Code/getTopFeature.pl to select the top n (< N) features from this file; a Weka-only alternative is sketched after this list.
- Notice that after feature selection the class attribute is moved from the top to the end, so there is no need to use the option -c 1 when using the feature-selection .arff file for later training or classification.
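- If the Perl script is not at hand, the same trimming can be sketched with Weka's own weka.filters.unsupervised.attribute.Remove filter. The sketch below assumes the input files hold the top N = 1000 ranked features followed by the class attribute (see the notice above) and keeps the top n = 100 plus the class; the file names are illustrative:
java -Xmx1024m weka.filters.unsupervised.attribute.Remove -V -R 1-100,last -b -i corn_chi1000_training.arff -o corn_chi100_training.arff -r corn_chi1000_test.arff -s corn_chi100_test.arff
Here -V inverts the selection, so -V -R 1-100,last means "keep attributes 1-100 and the last (class) attribute, remove everything else".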
- Train a Classifier with Cross Validation
- Cross-validation is a commonly used method to tune a classifier. Below is an example of how to run cross-validation on an NB classifier in Weka:
java -Xmx1024m weka.classifiers.bayes.NaiveBayesMultinomial -t corn_chi100_training.arff -d corn-ig100.model -x 10 -o -i > corn_chi100_training.results
-t <name of training file>: training data file
-x <number of folds>: sets the number of folds for cross-validation (default: 10)
-o: outputs statistics only, not the classifier
-i: outputs IR measures, like precision, recall, and the F1 measure
-d <name of output file>: model output file
-l <name of input file>: model input file
(more options)
- Notice: Weka always outputs the model built from the full training set, even if the performance of this model is estimated using cross-validation.
- Test a Classifier
- Use a Classifier to Make Predictions
- Once a classifier is trained and tested, you can use it to predict the class label of new objects:
java -Xmx1024m weka.classifiers.bayes.NaiveBayesMultinomial -l corn-ig100.model -T corn_chi100_test.arff -p 0 > corn_chi100_test.results
-p <attribute range>: only outputs predictions for the test instances (or the training instances if no test instances are provided), along with the attribute values in the given range (0 for none)
(more options)

Evaluations
- Overview
- Evaluations were conducted against the Reuters-21578 collection (ModApte split), which has been used in most of the text classification literature. The experiments were designed to test the robustness of the Weka toolkit as well as to examine the impact, if any, of a variety of factors on the classification task.
- Another set of experiments was designed and evaluated against the TREC Blog06 data. The purpose here is to find out whether the automatic classification approach is suitable for opinionated/polarity classification in addition to subjectivity classification, and if so, how to design a classifier for such a non-content classification task.
- Reuters Data Preparation
- Collection splitting
- Motivation: to check the ready-to-use split data after initially finding some bugs in the Reuters-21578 collection.
- Download the full collection, Reuters21578.tar.gz, directly from the UCI KDD Archive, then split the data by running $home/Classification/Code/reutersSplit.pl. Remove the unknown category. (Located: $home/Classification/Corpora/Reuters21578_apte_top90cat)
- Download the pre-split Reuters21578-Apte-90Cat.tar.gz. Remove the unknown category. (Located: $home/Classification/Corpora/Reuters21578-Apte-90Cat)
- Take the top 10 categories only.
- Negative example generation
- Motivation: when the training data distribution is skewed, one could argue either for using a large negative example set to increase the classifier's precision, or for using a smaller negative example set to increase its recall.
- When building a classifier for one category, use all data in the remaining 9 categories as negative examples. ($home/Classification/Corpora/Prepared0)
- Randomly select negative examples from the remaining 9 categories and keep their number the same as in the target category. ($home/Classification/Corpora/Prepared1)
- Use all the examples in categories which share no example(s) with the target category. ($home/Classification/Corpora/Prepared2)
- Randomly select negative examples from the non-overlapping categories and keep their number the same as in the target category. ($home/Classification/Corpora/Prepared3)
- Pre-processing
- Motivation: to see the effect of traditional information retrieval pre-processing methods: stopping, stemming, and phrase (n-gram) extraction.
- Non-processed data ($home/Classification/Corpora/Prepared?/Reuters21578-Apte-90Cat|Reuters21578_apte_top90cat)
- Run $home/Classification/Code/sstem.pl to get the stopped and stemmed collection (combo stemmer, stoplist1) ($home/Classification/Corpora/Prepared?/Reuters21578-Apte-90Cat|Reuters21578_apte_top90cat-sstem)
- Run $home/Classification/Code/gnrPhrase.pl (combo stemmer, stoplist1) to extract noun phrases ($home/Classification/Corpora/Prepared?/Reuters21578-Apte-90Cat|Reuters21578_apte_top90cat-phrase) (notice: the noun phrase extraction algorithm is from the old NLPsub.pl)
- Feature Selection
- Motivation: to test and compare the two most effective feature selection methods, chi-square (chi) and information gain (ig).
- Feature sets of 100, 500, 1000, 1500, and 2000 features were selected and tested respectively.
- Classifier configuration
- NB: NaiveBayesMultinomial is applied with its default settings in Weka.
- KNN: weka.classifiers.lazy.IBk is applied with k = 3, 5, or 7.
- SVM: weka.classifiers.functions.LibSVM with grid search over the cost and gamma values (a pending task). Notice that due to efficiency constraints, SVM will not be tested against all the parameter combinations of the previous settings; only the best parameter combinations according to the NB and KNN classifiers will be applied to SVM training, and analysis of the training results will then suggest parameters for testing. (A single training run is sketched after this list.)
- 10-fold cross-validation was conducted during training.
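For reference, a single LibSVM training run for one (cost, gamma) pair might look like the sketch below; -C and -G are the wrapper's cost and gamma options, and run labels such as "c2 g-5" in the tables below presumably encode log2(cost) = 2 and log2(gamma) = -5 (the usual LibSVM grid convention), so "c2 g-5" corresponds to -C 4.0 -G 0.03125. The file names are illustrative; a grid search simply loops over such pairs:
java -Xmx1024m weka.classifiers.functions.LibSVM -t corn_chi1000_training.arff -x 10 -o -i -C 4.0 -G 0.03125 > corn_c2_g-5.results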
- Reuters results analysis
- Overall top 5 performance based on macro F-score over the 10 categories

| NB run | F | KNN run | F | SVM run | F |
|---|---|---|---|---|---|
| 3,A,ss,chi1500 | 0.9483 | 3,a,ss,ig100,k3 | 0.9227 | 3,A,null,chi1000,c2 g-5 | 0.9659 |
| 3,A,ss,ig1500 | 0.9481 | 3,a,ss,chi100,k3 | 0.9209 | 3,A,null,chi1000,c5 g-8 | 0.9657 |
| 3,A,ss,chi1000 | 0.9469 | 3,a,ss,ig100,k5 | 0.9198 | 3,A,null,chi1000,c11 g-14 | 0.9656 |
| 3,A,ss,chi2000 | 0.9469 | 3,A,ss,ig100,k3 | 0.9159 | 3,A,null,chi1000,c8 g-11 | 0.9656 |
| 3,A,ss,ig1000 | 0.9469 | 3,a,ss,chi100,k7 | 0.9152 | 3,A,ss,chi100,c2 g-5 | 0.9626 |
- Top performance of individual category classifiers (F-score) (the (..) marks linked to the top 10 runs; "/" in a run label means either setting tied)

| Category | NB run | F | KNN run | F | SVM run | F |
|---|---|---|---|---|---|---|
| earn | 3/2,A,ss,ig500 (..) | 0.982 | 1,A,ss,chi100,k5 (..) | 0.985 | 3,A,ss,chi1000,c5 g-2 (..) | 0.989 |
| acq | 3,A,null,chi/ig2k (..) | 0.981 | 3,A,null,ig100,k5 (..) | 0.959 | 3,A,null,chi500,c14 g-2 (..) | 0.985 |
| money-fx | 3,a,ss/null,ig1k (..) | 0.986 | 3,A,ss,ig2k,k7 (..) | 0.951 | 3,A,null,chi1000,c2 g-5 (..) | 0.992 |
| grain | 3,A,ss,chi/ig1k (..) | 0.914 | 3,a,ss,ig100,k7 (..) | 0.931 | 3,A,ss,chi100,c5 g-5 (..) | 0.980 |
| crude | 3,A,null,ig2k (..) | 0.977 | 3,a,ss,chi100,k5/7 (..) | 0.954 | 3,A,ss,chi2000,c14 g-14 (..) | 0.992 |
| trade | 3,a,ss,ig100 (..) | 0.912 | 3,A,ss,chi/ig1k,k7 (..) | 0.886 | 3,A,null,chi1500,c14 g-8 (..) | 0.940 |
| interest | 2,A,phrase,chi100 (..) | 0.990 | 2,A,ss,ig100,k3 (..) | 0.993 | 3,A,null,chi1000,c5 g-8 (..) | 0.988 |
| wheat | 3,a,null/ss,ig/chi1k (..) | 0.94 | 3,A,ss,ig/chi1500,k3 (..) | 0.928 | 3,A,null,chi100,c11 g-8 (..) | 0.972 |
| ship | 1,a,ss,chi100 (..) | 0.972 | 3,a,ss,chi100,k7 (..) | 0.930 | 3,A,null,chi500,c2 g-5 (..) | 0.960 |
| corn | 3,A,ss,chi/ig1500/2k (..) | 0.946 | 3,A,ss,ig100,k3 (..) | 0.906 | 3,A,null,chi100,c11 g-8 (..) | 0.973 |
- Analysis by factors (F-score)
- Collection splitting: (manually generated - auto downloaded) / auto downloaded

| Top10 category (apte - Apte) | NB Micro Avg | NB Macro Avg | KNN Micro Avg | KNN Macro Avg |
|---|---|---|---|---|
| 1#earn | 0.93% | 0.71% | 3.44% | 3.16% |
| 2#acq | -0.16% | -0.17% | -3.35% | -3.13% |
| 3#money-fx | 0.53% | 0.32% | -2.82% | -3.01% |
| 4#grain | 0.29% | 0.44% | -3.70% | -4.36% |
| 5#crude | -0.54% | -0.49% | -2.72% | -1.02% |
| 6#trade | -0.14% | 0.06% | 1.23% | -6.53% |
| 7#interest | -0.26% | -0.25% | -1.34% | -1.73% |
| 8#wheat | -0.27% | -0.60% | -6.12% | -6.30% |
| 9#ship | 0.82% | 0.37% | 3.31% | 4.39% |
| 10#corn | 0.03% | 0.10% | -3.97% | -6.16% |
| Average | 0.12% | 0.05% | -1.60% | -2.47% |

Observation:
NB: no big difference between the two data sets, as expected; the manually split one is slightly better than the downloaded pack.
KNN: the manually split collection performs worse than the downloaded pack.
Suggestion: to make the results comparable to others, we might want to stick to the downloaded one.
- Negative example generation
- Legend (in the tables below, diffXY presumably denotes the relative change from PreparedY to PreparedX, i.e. (PreparedX - PreparedY)/PreparedY):

| Code | Data set |
|---|---|
| 0 | Prepared0: all negative examples |
| 1 | Prepared1: randomly selected negative examples, balancing the numbers of positive and negative examples |
| 2 | Prepared2: negative examples only from non-overlapping categories |
| 3 | Prepared3: non-overlapping categories plus random sampling to balance the numbers of positive and negative examples |
- NB classifier

| Top10 category | Micro diff10 | Micro diff20 | Micro diff32 | Micro diff31 | Macro diff10 | Macro diff20 | Macro diff32 | Macro diff31 |
|---|---|---|---|---|---|---|---|---|
| 1#earn | 1.64% | 2.38% | 0.01% | 0.74% | 1.45% | 2.20% | 0.01% | 0.74% |
| 2#acq | 7.68% | 7.69% | 0.73% | 0.73% | 7.30% | 7.34% | 0.68% | 0.72% |
| 3#money-fx | 56.16% | 14.18% | 43.73% | 3.80% | 55.01% | 14.15% | 40.99% | 3.83% |
| 4#grain | 47.25% | -9.86% | 72.20% | 2.47% | 45.55% | -9.26% | 64.32% | 2.44% |
| 5#crude | 33.62% | 26.98% | 7.54% | 2.58% | 32.70% | 26.46% | 7.57% | 2.51% |
| 6#trade | 110.68% | 17.46% | 83.43% | 1.75% | 101.47% | 16.71% | 75.62% | 1.73% |
| 7#interest | -6.15% | 0.47% | -5.13% | 1.56% | -6.17% | 0.47% | -5.15% | 1.56% |
| 8#wheat | 119.18% | 8.22% | 115.03% | 4.79% | 101.88% | 7.69% | 96.17% | 4.65% |
| 9#ship | 64.22% | -10.06% | 82.46% | -0.22% | 60.38% | -10.47% | 78.54% | -0.33% |
| 10#corn | 140.41% | 17.81% | 137.19% | 11.08% | 122.84% | 17.13% | 111.31% | 11.07% |
| AVG | 57.47% | 7.53% | 53.72% | 2.93% | 52.24% | 7.24% | 47.01% | 2.89% |
- KNN classifier

| Top10 category | Micro diff10 | Micro diff20 | Micro diff32 | Micro diff31 | Macro diff10 | Macro diff20 | Macro diff32 | Macro diff31 |
|---|---|---|---|---|---|---|---|---|
| 1#earn | -3.23% | -7.20% | 0.24% | -4.10% | -3.55% | -7.69% | 0.23% | -4.07% |
| 2#acq | 28.71% | 25.94% | 3.65% | 1.36% | 24.68% | 22.08% | 3.47% | 1.32% |
| 3#money-fx | 121.25% | 16.05% | 102.15% | 5.39% | 108.14% | 15.65% | 87.98% | 4.45% |
| 4#grain | 61.92% | 3.39% | 69.35% | 9.45% | 52.34% | 5.73% | 56.02% | 8.28% |
| 5#crude | 58.83% | 6.30% | 54.94% | 4.02% | 52.04% | 8.60% | 45.07% | 3.62% |
| 6#trade | 403.81% | 37.93% | 338.24% | 18.16% | 205.34% | 43.35% | 139.25% | 12.32% |
| 7#interest | -17.04% | 0.05% | -21.06% | -4.21% | -17.03% | 0.05% | -21.04% | -4.79% |
| 8#wheat | 95.23% | 13.67% | 97.45% | 9.96% | 97.68% | 19.69% | 69.64% | 2.71% |
| 9#ship | 376.21% | -30.67% | 902.26% | 8.70% | 243.55% | -10.99% | 303.34% | 4.50% |
| 10#corn | 54.57% | 3.43% | 67.83% | 14.45% | 56.83% | 4.90% | 70.33% | 13.92% |
| AVG | 118.03% | 6.89% | 161.51% | 6.32% | 82.00% | 10.14% | 75.43% | 4.23% |
- SVM classifier
(Note: this is based on the 10-fold cross-validation results on the training set; only Prepared3 is used for testing.)

| Top10 category | Micro diff31 | Macro diff31 |
|---|---|---|
| 1#earn | 7.38% | 4.50% |
| 2#acq | 2.00% | 2.05% |
| 3#money-fx | 4.19% | 3.98% |
| 4#grain | 18.69% | 8.22% |
| 5#crude | 4.11% | 3.02% |
| 6#trade | 4.30% | 3.09% |
| 7#interest | 1.63% | 1.76% |
| 8#wheat | -0.94% | 2.18% |
| 9#ship | 4.02% | 2.65% |
| 10#corn | 10.63% | 8.77% |
| AVG | 5.60% | 4.02% |

- Observation:
For both NB and KNN, balancing the numbers of positive and negative examples largely improves performance, and selecting negative examples only from non-overlapping categories helps improve it further.
The trend stays the same for the SVM classifiers: Prepared3 steadily outperforms Prepared1, so in the testing stage we only use Prepared3.
Suggestion: we should always be careful when generating the negative examples, since they largely affect performance.
- Pre-processing Type
- NB classifier

| Top10 category | Micro diff(phrase-null) | Micro diff(sstem-null) | Micro diff(sstem-phrase) | Macro diff(phrase-null) | Macro diff(sstem-null) | Macro diff(sstem-phrase) |
|---|---|---|---|---|---|---|
| 1#earn | -8.13% | 0.41% | 9.55% | -8.11% | 0.41% | 9.27% |
| 2#acq | -11.05% | 0.47% | 13.09% | -11.03% | 0.38% | 12.82% |
| 3#money-fx | -1.71% | 2.68% | 6.95% | -5.10% | 2.12% | 7.62% |
| 4#grain | 10.73% | 2.33% | -3.42% | 3.87% | 1.71% | -2.08% |
| 5#crude | -5.23% | 0.30% | 6.92% | -6.44% | 0.24% | 7.14% |
| 6#trade | 15.26% | 0.68% | -7.65% | 5.17% | 0.31% | -4.62% |
| 7#interest | -1.70% | -0.13% | 1.72% | -1.66% | -0.12% | 1.57% |
| 8#wheat | 17.99% | 1.53% | -8.74% | -4.18% | 0.93% | 5.34% |
| 9#ship | -2.22% | 0.80% | 6.94% | -13.48% | 0.51% | 16.18% |
| 10#corn | 18.77% | -1.02% | -10.64% | -5.31% | -0.46% | 5.12% |
| AVG | 3.27% | 0.81% | 1.47% | -4.63% | 0.60% | 5.83% |
- KNN classifier

| Top10 category | Micro diff(phrase-null) | Micro diff(sstem-null) | Micro diff(sstem-phrase) | Macro diff(phrase-null) | Macro diff(sstem-null) | Macro diff(sstem-phrase) |
|---|---|---|---|---|---|---|
| 1#earn | -2.73% | -1.29% | 3.04% | -3.72% | -1.27% | 2.54% |
| 2#acq | -3.59% | 2.07% | 7.09% | -5.03% | 0.66% | 5.99% |
| 3#money-fx | -28.69% | 3.48% | 56.48% | -28.69% | 2.39% | 43.59% |
| 4#grain | -40.41% | 3.42% | 145.73% | -38.70% | 1.94% | 66.29% |
| 5#crude | -31.48% | 4.13% | 105.59% | -31.72% | 2.92% | 50.72% |
| 6#trade | -15.16% | 30.54% | 560.02% | -33.37% | 18.42% | 77.71% |
| 7#interest | -3.61% | -3.16% | 1.52% | -3.97% | -2.44% | 1.59% |
| 8#wheat | -49.70% | 6.72% | 216.06% | -49.34% | 5.98% | 109.19% |
| 9#ship | -48.02% | 11.99% | 456.67% | -66.50% | 15.55% | 244.93% |
| 10#corn | -49.70% | 6.72% | 216.06% | -49.45% | 7.26% | 112.20% |
| AVG | -27.31% | 6.46% | 176.83% | -31.05% | 5.14% | 71.48% |
- SVM

| Top10 category | Micro diff(sstem-null) | Macro diff(sstem-null) |
|---|---|---|
| 1#earn | 0.21% | 0.21% |
| 2#acq | -0.11% | -0.11% |
| 3#money-fx | 0.39% | 0.38% |
| 4#grain | 0.69% | 0.62% |
| 5#crude | 0.52% | 0.46% |
| 6#trade | -1.37% | -1.46% |
| 7#interest | 0.05% | 0.01% |
| 8#wheat | 1.36% | 1.17% |
| 9#ship | 1.39% | 1.31% |
| 10#corn | 2.79% | 2.63% |
| AVG | 0.59% | 0.52% |

- Observation:
Stopping and stemming help all three classifiers.
Phrases always hurt performance for KNN, and only help in certain categories for the NB classifier.
Suggestion: try the following in the future:
1. only stemming, or only stopping
2. use the updated noun-phrase method (this experiment was based on the 2006 version of NLPsub)
3. phrases plus single-word features
- Feature Selection Type: (chi - ig) / ig

| Top10 category | NB Micro Avg | NB Macro Avg | KNN Micro Avg | KNN Macro Avg |
|---|---|---|---|---|
| 1#earn | -0.05% | -0.05% | -0.01% | 0.00% |
| 2#acq | -0.29% | -0.27% | -2.28% | -2.06% |
| 3#money-fx | 0.21% | 0.00% | -2.09% | -1.69% |
| 4#grain | 0.37% | 0.29% | -1.39% | -1.30% |
| 5#crude | 0.40% | 0.27% | -1.10% | -1.21% |
| 6#trade | 2.99% | 1.80% | -0.26% | -2.19% |
| 7#interest | 0.15% | 0.15% | -0.02% | -0.03% |
| 8#wheat | 7.29% | 4.57% | -0.11% | -0.35% |
| 9#ship | 0.70% | 1.17% | -2.10% | -2.27% |
| 10#corn | 7.37% | 4.10% | -0.21% | -0.38% |
| Average | 1.91% | 1.20% | -0.96% | -1.15% |

Observation:
NB: chi-square slightly outperforms ig.
KNN: ig slightly outperforms chi-square.
Suggestion: apply different feature selection methods to different classifiers.
- Feature Number
- NB classifier

| Top10 category | Micro 2k-1.5k | Micro 1.5k-1k | Micro 1k-500 | Micro 500-100 | Micro 100-2k | Macro 2k-1.5k | Macro 1.5k-1k | Macro 1k-500 | Macro 500-100 | Macro 100-2k |
|---|---|---|---|---|---|---|---|---|---|---|
| 1#earn | 0.12% | 0.27% | 0.70% | 3.04% | -3.85% | 0.12% | 0.26% | 0.65% | 2.85% | -3.77% |
| 2#acq | 0.15% | 0.28% | 1.38% | 6.05% | -7.14% | 0.15% | 0.27% | 1.32% | 5.63% | -6.95% |
| 3#money-fx | 1.08% | 2.57% | 5.42% | 3.59% | -10.05% | 0.95% | 2.17% | 4.19% | 3.15% | -9.79% |
| 4#grain | 1.23% | 2.08% | 2.15% | -5.01% | 0.92% | 0.90% | 1.48% | 1.57% | -3.96% | 0.11% |
| 5#crude | 0.67% | 1.43% | 1.30% | -0.37% | -2.24% | 0.64% | 1.34% | 1.03% | -0.14% | -2.82% |
| 6#trade | 1.63% | 2.18% | 0.42% | -5.08% | 3.47% | 0.92% | 1.66% | 0.65% | -3.65% | 0.52% |
| 7#interest | 0.31% | 0.33% | 0.30% | 0.29% | -1.18% | 0.29% | 0.32% | 0.30% | 0.27% | -1.17% |
| 8#wheat | 2.80% | 3.25% | 3.65% | -13.56% | 26.96% | 1.45% | -4.82% | 2.77% | -12.23% | 14.81% |
| 9#ship | 1.60% | 1.68% | 4.41% | -7.31% | 9.28% | 1.27% | -2.05% | 2.76% | -6.97% | 5.45% |
| 10#corn | 0.45% | 6.51% | 3.04% | -13.03% | 19.53% | 0.02% | -3.17% | 1.48% | -10.54% | 13.72% |
| AVG | 1.00% | 2.06% | 2.28% | -3.14% | 3.57% | 0.67% | -0.25% | 1.67% | -2.56% | 1.01% |
- KNN classifier

| Top10 category | Micro 2k-1.5k | Micro 1.5k-1k | Micro 1k-500 | Micro 500-100 | Micro 100-2k | Macro 2k-1.5k | Macro 1.5k-1k | Macro 1k-500 | Macro 500-100 | Macro 100-2k |
|---|---|---|---|---|---|---|---|---|---|---|
| 1#earn | -3.52% | -3.51% | -5.45% | -0.16% | 15.19% | -3.51% | -3.40% | -5.48% | -0.46% | 14.03% |
| 2#acq | 1.94% | -1.36% | -9.65% | -5.59% | 19.24% | 0.80% | -2.37% | -9.23% | -5.76% | 18.79% |
| 3#money-fx | -0.06% | -6.07% | -7.23% | -24.64% | 74.51% | -1.59% | -7.32% | -7.30% | -21.29% | 50.27% |
| 4#grain | -3.71% | -8.95% | -15.43% | -26.71% | 172.39% | -2.91% | -7.87% | -14.37% | -24.70% | 73.37% |
| 5#crude | -4.48% | -11.03% | -14.54% | -32.78% | 181.86% | -1.01% | -11.39% | -14.68% | -31.38% | 94.70% |
| 6#trade | -7.50% | -6.64% | -33.86% | -50.32% | 884.67% | 4.01% | -2.26% | -28.16% | -46.50% | 155.91% |
| 7#interest | -7.65% | -3.83% | -4.07% | -1.03% | 29.77% | -6.65% | -3.32% | -3.85% | -0.98% | 16.37% |
| 8#wheat | 3.31% | 3.28% | -7.82% | -35.04% | 83.46% | -1.24% | 16.50% | -5.27% | -31.29% | 33.53% |
| 9#ship | 1.26% | 5.62% | 27.05% | -49.05% | 281.22% | -4.76% | 7.42% | 4.83% | -37.53% | 49.28% |
| 10#corn | 5.40% | 3.87% | -8.30% | -30.01% | 47.99% | 2.31% | 15.32% | -8.76% | -27.82% | 28.71% |
| AVG | -1.50% | -2.86% | -7.93% | -25.53% | 179.03% | -1.46% | 0.13% | -9.23% | -22.77% | 53.50% |
- SVM classifier

| Top10 category | Micro 2k-1.5k | Micro 1.5k-1k | Micro 1k-500 | Micro 500-100 | Micro 100-2k | Macro 2k-1.5k | Macro 1.5k-1k | Macro 1k-500 | Macro 500-100 | Macro 100-2k |
|---|---|---|---|---|---|---|---|---|---|---|
| 1#earn | 0.07% | -0.24% | 0.20% | 0.47% | -0.49% | 0.07% | -0.24% | 0.19% | 0.46% | -0.49% |
| 2#acq | -0.41% | -0.16% | -0.06% | 0.78% | -0.14% | -0.40% | -0.16% | -0.06% | 0.78% | -0.15% |
| 3#money-fx | -0.90% | -0.08% | 1.03% | 1.14% | -1.09% | -0.88% | -0.08% | 1.03% | 1.14% | -1.18% |
| 4#grain | -1.32% | -1.08% | -0.96% | -2.10% | 5.97% | -1.25% | -1.07% | -0.96% | -2.12% | 5.59% |
| 5#crude | -1.76% | -0.89% | -0.15% | -0.11% | 3.37% | -1.65% | -0.88% | -0.15% | -0.12% | 2.86% |
| 6#trade | -0.96% | 0.84% | -0.61% | -2.36% | 3.53% | -0.87% | 0.85% | -0.60% | -2.38% | 3.07% |
| 7#interest | -1.62% | -2.30% | 0.45% | 2.41% | 1.62% | -1.50% | -2.28% | 0.43% | 2.40% | 1.03% |
| 8#wheat | -2.53% | -2.16% | -0.25% | -2.06% | 8.04% | -2.41% | -2.05% | -0.24% | -2.09% | 7.11% |
| 9#ship | -0.86% | -3.58% | -0.21% | 0.88% | 4.58% | -0.78% | -3.49% | -0.21% | 0.83% | 3.80% |
| 10#corn | -0.55% | -3.07% | -2.35% | -1.10% | 8.28% | -0.50% | -2.90% | -2.32% | -1.10% | 7.14% |
| AVG | -1.08% | -1.27% | -0.29% | -0.20% | 3.37% | -1.02% | -1.23% | -0.29% | -0.22% | 2.88% |

- Observation:
NB and SVM: 100 features give the best overall performance.
KNN: 100 features give the best performance across categories.
Suggestion: it seems that 100 features should be enough for both small and large categories. Later we should use a percentage instead of an absolute number of features (I am not sure whether Weka provides this option), or maybe we should try the -T option to set a score threshold based on a certain precision score.
- Classifier specific factor
- KNN: number of nearest neighbors

| Top10 category | Micro diff(5-3) | Micro diff(7-5) | Macro diff(5-3) | Macro diff(7-5) |
|---|---|---|---|---|
| 1#earn | -0.71% | -0.81% | -0.64% | -0.74% |
| 2#acq | -2.46% | -2.50% | -2.24% | -2.21% |
| 3#money-fx | -7.43% | -4.63% | -5.08% | -3.01% |
| 4#grain | -16.30% | -16.99% | -12.18% | -9.88% |
| 5#crude | -11.26% | -11.38% | -8.06% | -7.62% |
| 6#trade | -18.48% | -16.18% | -7.67% | -2.56% |
| 7#interest | -0.52% | -1.18% | -0.33% | -0.70% |
| 8#wheat | -16.15% | -18.43% | -11.51% | -8.61% |
| 9#ship | -32.84% | -20.15% | -17.33% | -2.84% |
| 10#corn | -11.46% | -17.48% | -9.18% | -11.63% |
| AVG | -11.76% | -10.97% | -7.42% | -4.98% |

- Observation:
3 seems to be the optimal number of neighbors for KNN.
- SVM: Cost and Gamma (details are on a separate page)