Guide to Using the Weka Toolkit



Step By Step | Classifiers | Evaluations | Resources

Step By Step

  1. Install Weka
    • Download and install Weka and LibSVM
      • Weka is an open-source machine learning toolkit. The recent versions (3-5-x) are platform independent, and the same .zip file can be downloaded for Windows, Linux, or Mac machines. After unpacking the compressed file, put the resulting directory in your preferred place (/Applications/weka-3-5-2 on a Mac, for instance); we will call this directory $weka_home from now on. You can download the latest version from: http://www.cs.waikato.ac.nz/ml/weka/
      • LibSVM is a publicly available SVM implementation. The default SVM classifier in Weka is SMO; since weka-3-5-2 the toolkit has included a wrapper that lets users run LibSVM like any other built-in Weka classifier. You can download libsvm.jar from http://www.cs.iastate.edu/~yasser/wlsvm/
    • Update CLASSPATH
      • Running Weka usually requires adding weka.jar to the CLASSPATH variable of the host machine; we want to add libsvm.jar and the Java home as well. On Linux and Mac machines, update .bash_profile or .profile by adding the following line:

        export CLASSPATH=$CLASSPATH:$weka_home/weka.jar:$weka_home/libsvm.jar:$JAVA_HOME/bin

        On Windows, set the variable through Control Panel -> System -> Environment Variables.
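
        For the current console session only, an equivalent command would be the following (assuming Weka was unpacked to C:\weka-3-5-2; adjust the paths to your install):

            set CLASSPATH=%CLASSPATH%;C:\weka-3-5-2\weka.jar;C:\weka-3-5-2\libsvm.jar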

      • Completing this step lets you run Weka with much more flexibility, but you can also run Weka directly with java -jar weka.jar.
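
      • To check that the CLASSPATH is set up correctly, a quick smoke test is to run a trivial classifier on one of the sample data sets shipped with Weka (assuming the bundled iris data is present under $weka_home/data):

            java weka.classifiers.rules.ZeroR -t $weka_home/data/iris.arff

        If the setup is correct, this prints evaluation statistics rather than a class-not-found error.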

  2. Data preparation (generating .arff file)
    • Create .arff file from text files
      • A database (text corpus) is represented in ARFF (Attribute-Relation File Format) in Weka. There are example .arff files under the "data" directory of $weka_home. (A detailed explanation of ARFF is available here: http://weka.sourceforge.net/wekadoc/index.php/en:ARFF_%283.5.1%29 ) Each document (row) is called an instance and each feature (term) is called an attribute.
      • A Java program called TextDirectoriesToArffFile can convert a free-text collection into an .arff file with a "string" attribute. To use this program, each category in the collection must have its own directory. For binary classification, each category has two folders storing the positive and negative documents separately.

                    java TextDirectoriesToArffFile $cat_dir > str_corn_training.arff

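      • The resulting file should look roughly like the following (a minimal sketch; the relation name, class labels, and text are made up and will depend on your collection):

                    @relation corn_training

                    @attribute class {pos,neg}
                    @attribute text string

                    @data
                    pos,'grain stocks rose sharply this quarter ...'
                    neg,'the board announced a new chief executive ...'
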
    • Convert string attribute to numeric attributes
      • The .arff file we get from the previous step contains only two attributes. One is 'class'; the other is 'text', a string holding the content of each document. We need to convert this preliminary file into a format where features (attributes) are extracted and each feature (attribute) has a numeric value.
      • Weka has a built-in filter called StringToWordVector for this purpose:

                    java -Xmx1024m weka.filters.unsupervised.attribute.StringToWordVector
                    -b -i str_corn_training.arff -o corn_training.arff -r str_corn_test.arff -s corn_test.arff
                    -R 2 -W 5000 -C -T -I -N 1 -L -M 2

        -b: batch mode. This is useful for filtering two datasets at once: the first dataset initializes the filter and the second is then filtered according to that setup, so the test set will contain the same attributes as the training set.
        -i: training input file
        -o: training output file
        -r: test input file
        -s: test output file
        -R 2: process the second attribute, which is the string attribute; this is the default
        -W 5000: keep the top 5,000 words per class, which is helpful for binary problems; use the -O option to rank words over all classes instead
        -C: output word counts rather than boolean word presence
        -T: transform the term frequency into log(1+tf)
        -I: transform the word frequency into tf*log(total # of docs / # of docs containing this word); this is the tf*idf weight without normalization
        -N: 0 = do not normalize / 1 = normalize all data / 2 = normalize test data only, to the average length of the training documents (default 0 = don't normalize; a detailed explanation is on the Weka mailing list)
        -L: convert all tokens to lowercase before adding them to the dictionary
        -A: only form tokens from contiguous alphabetic sequences (turn this off when working with phrases! As of Weka 3-5-6 this option is no longer available; it is replaced by weka.core.tokenizers.AlphabeticTokenizer)
        -S: ignore words that are in the stoplist (we do not use this one since we have already applied our own stop list)
        -M: minimum term frequency; here 2
        -Xmx simply sets the heap memory size for the Java VM (e.g., -Xmx1024m sets the maximum heap size to 1 GB).
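
        As a worked example of the -T and -I weights (assuming the two transforms compose in the order listed): in a corpus of 1,000 documents, a term occurring 4 times in a document and appearing in 100 documents gets log(1+4) ≈ 1.61 under -T, and multiplying by the idf factor log(1000/100) ≈ 2.30 under -I gives a final weight of about 3.7.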
      • (more options)
      • Extracting ngrams
        In Weka 3-5-6, a new tokenizer was added for extracting ngrams. Using the same example as above to extract bigrams and trigrams (-min 2 -max 3):

                    java -Xmx1024m weka.filters.unsupervised.attribute.StringToWordVector
                    -b -i str_corn_training.arff -o corn_training.arff -r str_corn_test.arff -s corn_test.arff
                    -R 2 -W 5000 -C -T -I -N 1 -L -M 2 -tokenizer "weka.core.tokenizers.NGramTokenizer -min 2 -max 3"

        Valid options are:
        -delimiters <value> The delimiters to use (default: space, newline, tab, and the characters .,;:'"()?!).
        -max <int> The max size of the Ngram (default = 3).
        -min <int> The min size of the Ngram (default = 1).

  3. Feature Selection
    • FS is conducted by calling the weka.filters.supervised.attribute.AttributeSelection filter:

                  java -Xmx1024m weka.filters.supervised.attribute.AttributeSelection
                  -S "weka.attributeSelection.Ranker -N 100"
                  -E "weka.attributeSelection.InfoGainAttributeEval"
                  -b -i corn_training.arff -o corn_chi100_training.arff -r corn_test.arff -s corn_chi100_test.arff -c 1

      -S <"Name of search class [search options]">: sets the search method for subset evaluators.
      -E <"Name of attribute/subset evaluation class [evaluator options]">: the feature selection (attribute evaluation) algorithm. According to the literature, weka.attributeSelection.ChiSquaredAttributeEval and weka.attributeSelection.InfoGainAttributeEval are among the most effective FS methods.
      -N: keep the N top-ranked features (with -T one can specify a threshold value instead)
      -i: training input arff file
      -o: training output arff file
      -r: test input arff file
      -s: test output arff file
      -c: defines which attribute is the class attribute (acq, noacq); it is a "nominal" attribute, not a "numeric" one, and cannot be ranked
      -b: batch mode

    • Notice that since the feature selection step takes a lot of time, it is recommended to select the top N features first and then run some code like this (/u2/home/nyu/Classification/Code/getTopFeature.pl) to select the top n (< N) features from that file; alternatively, use Weka's Remove filter as sketched below.

    • Notice that after feature selection the class attribute is moved from the top to the end, so there is no need to use the option -c 1 when using the feature-selected arff file for later training or classification.
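
    • As an alternative to the Perl script, Weka's own Remove filter can trim an already-selected file down to the top n attributes; this assumes the AttributeSelection output lists attributes in ranked order, with the class attribute last as noted above. A sketch keeping the top 50 of the 100 selected features:

                  java -Xmx1024m weka.filters.unsupervised.attribute.Remove
                  -V -R 1-50,last -i corn_chi100_training.arff -o corn_chi50_training.arff

      Here -V inverts the selection so that -R lists the attributes to keep, and "last" keeps the class attribute.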

  4. Train a Classifier with Cross Validation
    • Cross-validation is a commonly used method for tuning a classifier. Below is an example of running cross-validation with an NB classifier in Weka:

                  java -Xmx1024m weka.classifiers.bayes.NaiveBayesMultinomial
                  -t corn_chi100_training.arff -d corn-ig100.model -x 10 -o -i > corn_chi100_training.results

      -t <name of training file>: training data file
      -x <number of folds>: sets the number of folds for cross-validation (default: 10)
      -o : outputs statistics only, not the classifier
      -i : output IR measures such as precision, recall, and F1
      -d <name of output file>: model output file
      -l <name of input file>: model input file
      (more options)

    • Notice: Weka always outputs the model built from the full training set, even when the model's performance is estimated using cross-validation.

  5. Test a Classifier
    • After finding the best parameters and building the final model, apply the same classifier to the test set:

                  java -Xmx1024m weka.classifiers.bayes.NaiveBayesMultinomial
                  -l corn-ig100.model -T corn_chi100_test.arff -o -i > corn_chi100_test.results

      -T <name of test file>
      The other options are the same as in step 4.

    • Notice: when loading a saved model, there is no need to set any training options.
  6. Use a Classifier to Make Predictions
    • Once a classifier is trained and tested, you can use it to predict the class labels of new objects:

                  java -Xmx1024m weka.classifiers.bayes.NaiveBayesMultinomial
                  -l corn-ig100.model -T corn_chi100_test.arff -p 0 > corn_chi100_test.results

      -p <attribute range>: only output predictions for the test instances (or the training instances if no test instances are provided), along with the values of the listed attributes (0 for none).
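
      Note that genuinely new documents must pass through the same StringToWordVector and AttributeSelection filters as the training data (use batch mode with the training file, as in steps 2 and 3) and must still declare the class attribute; in the raw .arff file the unknown label is simply written as a missing value, e.g. a @data line such as (hypothetical text):

                  ?,'oil futures climbed in early trading ...'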



Classifiers

  1. SVM
    • LibSVM

      The default kernel of LibSVM is the RBF (Gaussian) kernel; two parameters are important: -G and -C.

      Training:

                java -Xmx1024m weka.classifiers.functions.LibSVM
                -d corn_chi100-1000-0001.model -t corn_chi100_training.arff
               -G 0.001 -C 1000.0 -M 100.0 -Z -x 10 -o -i > corn_chi100-1000-0001_training.result

      Test:

                java -Xmx1024m weka.classifiers.functions.LibSVM
                -l corn_chi100-1000-0001.model -T corn_chi100_test.arff
               -o -i > corn_chi100-1000-0001_test.result

      -C: cost or penalty of training errors (default 0)

      -G: gamma, which controls the width of the RBF kernel (default 1/k)

      -Z: whether to normalize the input data; off by default

      -M: set the cache memory size in MB (default: 40)

      For recommendations on how to obtain acceptable results quickly and easily, please read the article "A Practical Guide to Support Vector Classification".
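
      That guide recommends a grid search over cost and gamma. A minimal sketch of such a search as a shell loop, using 10-fold cross-validation on the training file (the value ranges here are illustrative, not tuned):

                # try every (cost, gamma) pair and save one result file per combination
                for c in 1 10 100 1000; do
                  for g in 0.0001 0.001 0.01 0.1; do
                    java -Xmx1024m weka.classifiers.functions.LibSVM \
                      -t corn_chi100_training.arff -C $c -G $g -x 10 -o -i \
                      > corn_chi100-c${c}-g${g}_training.result
                  done
                done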

      More options


    • SMO

      We use LibSVM instead of SMO since the former runs faster. If using the RBF kernel, we should tune gamma and cost as well; if using the polynomial kernel, we should tune the cost and the exponent, which controls the degree of the polynomial.

                  java -Xmx1200m weka.classifiers.functions.SMO
                   -d corn_chi100-1000-0001.model -t corn_chi100_train.arff
                  -C 100.0 -E 2.0 -O -o -i > corn_chi100-1000-0001_train.result

      -C: cost; 100 here
      -E: exponent; we use 2.0 here for a quadratic kernel.
      Note that the parameter -C has the same meaning but two very different values for LibSVM and SMO. This is likely caused by differences in the classifiers' implementations and can only be tuned empirically.

      More options


  2. NB
    • As shown in step 4

  3. KNN
    • Examples:

      Training:

                java -Xmx1024m weka.classifiers.lazy.IBk
                -d corn_chi100-k3.model -t corn_chi100_training.arff
               -K 3 -x 10 -o -i > corn_chi100-k3_training.result

      Test:

                java -Xmx1024m weka.classifiers.lazy.IBk
                -l corn_chi100-k3.model -T corn_chi100_test.arff
                -o -i > corn_chi100-k3_test.result

      -K num: Set the number of nearest neighbors (default 1)

      More options


Evaluations

  1. Overview
    • Evaluations were conducted on Reuters-21578 (ModApte split), which has been used in most of the text classification literature. The experiments were designed to test the robustness of the Weka toolkit as well as to examine the impact, if any, of a variety of factors on the classification task.
    • Another set of experiments was designed and evaluated on the TREC Blog06 data. The purpose here is to find out whether the automatic classification approach is suitable for opinionated/polarity classification in addition to subjectivity classification and, if so, how to design a classifier for such a non-content classification task.
  2. Reuters Data Preparation
    • Collection splitting
      • Motivation: to check the ready-to-use split data after initially finding some bugs in the Reuters-21578 collection.
      • Download the full collection, Reuters21578.tar.gz, directly from the UCI KDD Archive. Then split the data by running $home/Classification/Code/reutersSplit.pl. Remove the unknown category. (Located at: $home/Classification/Corpora/Reuters21578_apte_top90cat)
      • Download the pre-split Reuters21578-Apte-90Cat.tar.gz. Remove the unknown category. (Located at: $home/Classification/Corpora/Reuters21578-Apte-90Cat)
      • Take the top 10 categories only.
    • Negative example generation
      • Motivation: when the training data distribution is skewed, one could argue either for using a large negative example set to increase the classifier's precision or for using a smaller negative example set to increase its recall.
      • When building a classifier for one category, use all the data in the remaining 9 categories as negative examples. ($home/Classification/Corpora/Prepared0)
      • Randomly select negative examples from the remaining 9 categories, keeping the amount the same as in the target category. ($home/Classification/Corpora/Prepared1)
      • Use all the examples in categories that share no example(s) with the target category. ($home/Classification/Corpora/Prepared2)
      • Randomly select negative examples from the non-overlapping categories, keeping the amount the same as in the target category. ($home/Classification/Corpora/Prepared3)
    • Pre-processing
      • Motivation: to see the effect of traditional information retrieval pre-processing methods: stopping, stemming, and phrase (ngram) extraction.
      • Non-processed data
        ($home/Classification/Corpora/Prepared?/Reuters21578-Apte-90Cat|Reuters21578_apte_top90cat)
      • Run $home/Classification/Code/sstem.pl to get the stopped and stemmed collection (combo stemmer, stoplist1). ($home/Classification/Corpora/Prepared?/Reuters21578-Apte-90Cat|Reuters21578_apte_top90cat-sstem)
      • Run $home/Classification/Code/gnrPhrase.pl (combo stemmer, stoplist1) to extract noun phrases. ($home/Classification/Corpora/Prepared?/Reuters21578-Apte-90Cat|Reuters21578_apte_top90cat-phrase) (Notice: the noun phrase extraction algorithm is from the old NLPsub.pl.)
    • Feature Selection
      • Motivation: to test and compare the two most effective feature selection methods, chi-square (chi) and information gain (ig).
      • Feature sets of 100, 500, 1000, 1500, and 2000 features were selected and tested respectively.
    • Classifier configuration
      • NB: NaiveBayesMultinomial is applied with its default setting in Weka.
      • KNN: weka.classifiers.lazy.IBk is applied with k = 3, 5 or 7.
      • SVM: weka.classifiers.functions.LibSVM with grid searching on the cost and gamma values. Notice that, due to efficiency constraints, SVM was not tested against all the parameter combinations in the previous settings; only the best parameter combinations according to the NB and KNN classifiers were applied to SVM training. Analysis of the training results then suggests parameters for testing.
      • 10-fold cross-validation was conducted during training.
  3. Reuters Results Analysis

    • Overall top 5 performance based on macro F-score over 10 categories

      Rank   NB                         KNN                         SVM
      1      3,A,ss,chi1500    0.9483   3,a,ss,ig100,k3    0.9227   3,A,null,chi1000,c2g-5    0.9659
      2      3,A,ss,ig1500     0.9481   3,a,ss,chi100,k3   0.9209   3,A,null,chi1000,c5g-8    0.9657
      3      3,A,ss,chi1000    0.9469   3,a,ss,ig100,k5    0.9198   3,A,null,chi1000,c11g-14  0.9656
      4      3,A,ss,chi2000    0.9469   3,A,ss,ig100,k3    0.9159   3,A,null,chi1000,c8g-11   0.9656
      5      3,A,ss,ig1000     0.9469   3,a,ss,chi100,k7   0.9152   3,A,ss,chi100,c2g-5       0.9626

    • Top performance of individual category classifier (F-score)

      Category   NB                            KNN                           SVM
      earn       3|2,A,ss,ig500        0.982   1,A,ss,chi100,k5      0.985   3,A,ss,chi1000,c5g-2     0.989
      acq        3,A,null,chi|ig2k     0.981   3,A,null,ig100,k5     0.959   3,A,null,chi500,c14g-2   0.985
      money-fx   3,a,ss|null,ig1k      0.986   3,A,ss,ig2k,k7        0.951   3,A,null,chi1000,c2g-5   0.992
      grain      3,A,ss,chi|ig1k       0.914   3,a,ss,ig100,k7       0.931   3,A,ss,chi100,c5g-5      0.980
      crude      3,A,null,ig2k         0.977   3,a,ss,chi100,k5|7    0.954   3,A,ss,chi2000,c14g-14   0.992
      trade      3,a,ss,ig100          0.912   3,A,ss,chi|ig1k,k7    0.886   3,A,null,chi1500,c14g-8  0.940
      interest   2,A,phrase,chi100     0.990   2,A,ss,ig100,k3       0.993   3,A,null,chi1000,c5g-8   0.988
      wheat      3,a,null|ss,ig|chi1k  0.94    3,A,ss,ig|chi1500,k3  0.928   3,A,null,chi100,c11g-8   0.972
      ship       1,a,ss,chi100         0.972   3,a,ss,chi100,k7      0.930   3,A,null,chi500,c2g-5    0.960
      corn       3,A,ss,chi|ig1500|2k  0.946   3,A,ss,ig100,k3       0.906   3,A,null,chi100,c11g-8   0.973

    • Analysis by factors (F-score)
      • Collection splitting: (manually generated - auto downloaded)/auto downloaded, i.e. (apte - Apte)/Apte

         Category     NB Micro Avg   NB Macro Avg   KNN Micro Avg   KNN Macro Avg
        1#earn 0.93% 0.71% 3.44% 3.16%
        2#acq -0.16% -0.17% -3.35% -3.13%
        3#money-fx 0.53% 0.32% -2.82% -3.01%
        4#grain 0.29% 0.44% -3.70% -4.36%
        5#crude -0.54% -0.49% -2.72% -1.02%
        6#trade -0.14% 0.06% 1.23% -6.53%
        7#interest -0.26% -0.25% -1.34% -1.73%
        8#wheat -0.27% -0.60% -6.12% -6.30%
        9#ship 0.82% 0.37% 3.31% 4.39%
        10#corn 0.03% 0.10% -3.97% -6.16%
        Average 0.12% 0.05% -1.60% -2.47%
        Observation:

        NB: No big difference between these two data sets, as expected. The manually split one is slightly better than the downloaded pack.

        KNN: The manually split collection performs worse than the downloaded pack.

        Suggestion: to make the results comparable to others', we might want to stick to the downloaded one.


      • Negative example generation
        • Legend:
          0 = Prepared0: all negative examples
          1 = Prepared1: randomly selected negative examples to balance the numbers of positive and negative examples
          2 = Prepared2: negative examples only from non-overlapping categories
          3 = Prepared3: non-overlapping categories plus random sampling to balance the numbers of positive and negative examples
        • NB classifier
          Category   Micro: diff10 diff20 diff32 diff31   Macro: diff10 diff20 diff32 diff31
          (diffNM = (PreparedN - PreparedM)/PreparedM)
          1#earn 1.64% 2.38% 0.01% 0.74% 1.45% 2.20% 0.01% 0.74%
          2#acq 7.68% 7.69% 0.73% 0.73% 7.30% 7.34% 0.68% 0.72%
          3#money-fx 56.16% 14.18% 43.73% 3.80% 55.01% 14.15% 40.99% 3.83%
          4#grain 47.25% -9.86% 72.20% 2.47% 45.55% -9.26% 64.32% 2.44%
          5#crude 33.62% 26.98% 7.54% 2.58% 32.70% 26.46% 7.57% 2.51%
          6#trade 110.68% 17.46% 83.43% 1.75% 101.47% 16.71% 75.62% 1.73%
          7#interest -6.15% 0.47% -5.13% 1.56% -6.17% 0.47% -5.15% 1.56%
          8#wheat 119.18% 8.22% 115.03% 4.79% 101.88% 7.69% 96.17% 4.65%
          9#ship 64.22% -10.06% 82.46% -0.22% 60.38% -10.47% 78.54% -0.33%
          10#corn 140.41% 17.81% 137.19% 11.08% 122.84% 17.13% 111.31% 11.07%
          AVG 57.47% 7.53% 53.72% 2.93% 52.24% 7.24% 47.01% 2.89%
        • KNN classifier
          Category   Micro: diff10 diff20 diff32 diff31   Macro: diff10 diff20 diff32 diff31
          1#earn -3.23% -7.20% 0.24% -4.10% -3.55% -7.69% 0.23% -4.07%
          2#acq 28.71% 25.94% 3.65% 1.36% 24.68% 22.08% 3.47% 1.32%
          3#money-fx 121.25% 16.05% 102.15% 5.39% 108.14% 15.65% 87.98% 4.45%
          4#grain 61.92% 3.39% 69.35% 9.45% 52.34% 5.73% 56.02% 8.28%
          5#crude 58.83% 6.30% 54.94% 4.02% 52.04% 8.60% 45.07% 3.62%
          6#trade 403.81% 37.93% 338.24% 18.16% 205.34% 43.35% 139.25% 12.32%
          7#interest -17.04% 0.05% -21.06% -4.21% -17.03% 0.05% -21.04% -4.79%
          8#wheat 95.23% 13.67% 97.45% 9.96% 97.68% 19.69% 69.64% 2.71%
          9#ship 376.21% -30.67% 902.26% 8.70% 243.55% -10.99% 303.34% 4.50%
          10#corn 54.57% 3.43% 67.83% 14.45% 56.83% 4.90% 70.33% 13.92%
          AVG 118.03% 6.89% 161.51% 6.32% 82.00% 10.14% 75.43% 4.23%
        • SVM classifier (Note: this is based on the 10-fold cross-validation results on the training set; only prepared3 is used for testing.)
          Category   Micro: diff31   Macro: diff31
          1#earn 7.38% 4.50%
          2#acq 2.00% 2.05%
          3#money-fx 4.19% 3.98%
          4#grain 18.69% 8.22%
          5#crude 4.11% 3.02%
          6#trade 4.30% 3.09%
          7#interest 1.63% 1.76%
          8#wheat -0.94% 2.18%
          9#ship 4.02% 2.65%
          10#corn 10.63% 8.77%
          AVG 5.60% 4.02%
        • Observation:

          For both NB and KNN, balancing the number of positive and negative examples largely improves performance. Selecting negative examples only from non-overlapping categories improves performance further.

          The trend stays the same for the SVM classifiers: prepared3 steadily outperforms prepared1, so in the testing stage we only use prepared3.

          Suggestion: we should always be careful when generating the negative examples, since it largely affects performance.

      • Pre-processing Type
        • NB classifier
          Category   Micro: diff(phrase-null) diff(sstem-null) diff(sstem-phrase)   Macro: diff(phrase-null) diff(sstem-null) diff(sstem-phrase)
          1#earn -8.13% 0.41% 9.55% -8.11% 0.41% 9.27%
          2#acq -11.05% 0.47% 13.09% -11.03% 0.38% 12.82%
          3#money-fx -1.71% 2.68% 6.95% -5.10% 2.12% 7.62%
          4#grain 10.73% 2.33% -3.42% 3.87% 1.71% -2.08%
          5#crude -5.23% 0.30% 6.92% -6.44% 0.24% 7.14%
          6#trade 15.26% 0.68% -7.65% 5.17% 0.31% -4.62%
          7#interest -1.70% -0.13% 1.72% -1.66% -0.12% 1.57%
          8#wheat 17.99% 1.53% -8.74% -4.18% 0.93% 5.34%
          9#ship -2.22% 0.80% 6.94% -13.48% 0.51% 16.18%
          10#corn 18.77% -1.02% -10.64% -5.31% -0.46% 5.12%
          AVG 3.27% 0.81% 1.47% -4.63% 0.60% 5.83%
        • KNN classifier
          Category   Micro: diff(phrase-null) diff(sstem-null) diff(sstem-phrase)   Macro: diff(phrase-null) diff(sstem-null) diff(sstem-phrase)
          1#earn -2.73% -1.29% 3.04% -3.72% -1.27% 2.54%
          2#acq -3.59% 2.07% 7.09% -5.03% 0.66% 5.99%
          3#money-fx -28.69% 3.48% 56.48% -28.69% 2.39% 43.59%
          4#grain -40.41% 3.42% 145.73% -38.70% 1.94% 66.29%
          5#crude -31.48% 4.13% 105.59% -31.72% 2.92% 50.72%
          6#trade -15.16% 30.54% 560.02% -33.37% 18.42% 77.71%
          7#interest -3.61% -3.16% 1.52% -3.97% -2.44% 1.59%
          8#wheat -49.70% 6.72% 216.06% -49.34% 5.98% 109.19%
          9#ship -48.02% 11.99% 456.67% -66.50% 15.55% 244.93%
          10#corn -49.70% 6.72% 216.06% -49.45% 7.26% 112.20%
          AVG -27.31% 6.46% 176.83% -31.05% 5.14% 71.48%
        • SVM
          Category   Micro: diff(sstem-null)   Macro: diff(sstem-null)
          1#earn 0.21% 0.21%
          2#acq -0.11% -0.11%
          3#money-fx 0.39% 0.38%
          4#grain 0.69% 0.62%
          5#crude 0.52% 0.46%
          6#trade -1.37% -1.46%
          7#interest 0.05% 0.01%
          8#wheat 1.36% 1.17%
          9#ship 1.39% 1.31%
          10#corn 2.79% 2.63%
          AVG 0.59% 0.52%
        • Observation:

          Stopping and stemming help all three classifiers.

          Phrases always hurt performance in the case of KNN and only help in certain categories in the case of NB.

          Suggestion: try the following in the future:

          1. only stemming or only stopping

          2. use the updated noun-phrase method (this experiment was based on the 2006 version of NLPsub)

          3. phrases + single-word features


      • Feature Selection Type: (chi - ig)/ig

        Category   NB Micro   NB Macro   KNN Micro   KNN Macro
        1#earn -0.05% -0.05% -0.01% 0.00%
        2#acq -0.29% -0.27% -2.28% -2.06%
        3#money-fx 0.21% 0.00% -2.09% -1.69%
        4#grain 0.37% 0.29% -1.39% -1.30%
        5#crude 0.40% 0.27% -1.10% -1.21%
        6#trade 2.99% 1.80% -0.26% -2.19%
        7#interest 0.15% 0.15% -0.02% -0.03%
        8#wheat 7.29% 4.57% -0.11% -0.35%
        9#ship 0.70% 1.17% -2.10% -2.27%
        10#corn 7.37% 4.10% -0.21% -0.38%
        Average 1.91% 1.20% -0.96% -1.15%
        Observation:

        NB: chi-square slightly outperforms ig.

        KNN: ig slightly outperforms chi-square.

        Suggestion: apply different feature selection methods to different classifiers.

      • Feature Number
        • NB classifier
          Category   Micro: 2k-1.5k 1.5k-1k 1k-500 500-100 100-2k   Macro: 2k-1.5k 1.5k-1k 1k-500 500-100 100-2k
          1#earn 0.12% 0.27% 0.70% 3.04% -3.85% 0.12% 0.26% 0.65% 2.85% -3.77%
          2#acq 0.15% 0.28% 1.38% 6.05% -7.14% 0.15% 0.27% 1.32% 5.63% -6.95%
          3#money-fx 1.08% 2.57% 5.42% 3.59% -10.05% 0.95% 2.17% 4.19% 3.15% -9.79%
          4#grain 1.23% 2.08% 2.15% -5.01% 0.92% 0.90% 1.48% 1.57% -3.96% 0.11%
          5#crude 0.67% 1.43% 1.30% -0.37% -2.24% 0.64% 1.34% 1.03% -0.14% -2.82%
          6#trade 1.63% 2.18% 0.42% -5.08% 3.47% 0.92% 1.66% 0.65% -3.65% 0.52%
          7#interest 0.31% 0.33% 0.30% 0.29% -1.18% 0.29% 0.32% 0.30% 0.27% -1.17%
          8#wheat 2.80% 3.25% 3.65% -13.56% 26.96% 1.45% -4.82% 2.77% -12.23% 14.81%
          9#ship 1.60% 1.68% 4.41% -7.31% 9.28% 1.27% -2.05% 2.76% -6.97% 5.45%
          10#corn 0.45% 6.51% 3.04% -13.03% 19.53% 0.02% -3.17% 1.48% -10.54% 13.72%
          AVG 1.00% 2.06% 2.28% -3.14% 3.57% 0.67% -0.25% 1.67% -2.56% 1.01%
        • KNN classifier
          Category   Micro: 2k-1.5k 1.5k-1k 1k-500 500-100 100-2k   Macro: 2k-1.5k 1.5k-1k 1k-500 500-100 100-2k
          1#earn -3.52% -3.51% -5.45% -0.16% 15.19% -3.51% -3.40% -5.48% -0.46% 14.03%
          2#acq 1.94% -1.36% -9.65% -5.59% 19.24% 0.80% -2.37% -9.23% -5.76% 18.79%
          3#money-fx -0.06% -6.07% -7.23% -24.64% 74.51% -1.59% -7.32% -7.30% -21.29% 50.27%
          4#grain -3.71% -8.95% -15.43% -26.71% 172.39% -2.91% -7.87% -14.37% -24.70% 73.37%
          5#crude -4.48% -11.03% -14.54% -32.78% 181.86% -1.01% -11.39% -14.68% -31.38% 94.70%
          6#trade -7.50% -6.64% -33.86% -50.32% 884.67% 4.01% -2.26% -28.16% -46.50% 155.91%
          7#interest -7.65% -3.83% -4.07% -1.03% 29.77% -6.65% -3.32% -3.85% -0.98% 16.37%
          8#wheat 3.31% 3.28% -7.82% -35.04% 83.46% -1.24% 16.50% -5.27% -31.29% 33.53%
          9#ship 1.26% 5.62% 27.05% -49.05% 281.22% -4.76% 7.42% 4.83% -37.53% 49.28%
          10#corn 5.40% 3.87% -8.30% -30.01% 47.99% 2.31% 15.32% -8.76% -27.82% 28.71%
          AVG -1.50% -2.86% -7.93% -25.53% 179.03% -1.46% 0.13% -9.23% -22.77% 53.50%
        • SVM classifier
          Category   Micro: 2k-1.5k 1.5k-1k 1k-500 500-100 100-2k   Macro: 2k-1.5k 1.5k-1k 1k-500 500-100 100-2k
          1#earn 0.07% -0.24% 0.20% 0.47% -0.49% 0.07% -0.24% 0.19% 0.46% -0.49%
          2#acq -0.41% -0.16% -0.06% 0.78% -0.14% -0.40% -0.16% -0.06% 0.78% -0.15%
          3#money-fx -0.90% -0.08% 1.03% 1.14% -1.09% -0.88% -0.08% 1.03% 1.14% -1.18%
          4#grain -1.32% -1.08% -0.96% -2.10% 5.97% -1.25% -1.07% -0.96% -2.12% 5.59%
          5#crude -1.76% -0.89% -0.15% -0.11% 3.37% -1.65% -0.88% -0.15% -0.12% 2.86%
          6#trade -0.96% 0.84% -0.61% -2.36% 3.53% -0.87% 0.85% -0.60% -2.38% 3.07%
          7#interest -1.62% -2.30% 0.45% 2.41% 1.62% -1.50% -2.28% 0.43% 2.40% 1.03%
          8#wheat -2.53% -2.16% -0.25% -2.06% 8.04% -2.41% -2.05% -0.24% -2.09% 7.11%
          9#ship -0.86% -3.58% -0.21% 0.88% 4.58% -0.78% -3.49% -0.21% 0.83% 3.80%
          10#corn -0.55% -3.07% -2.35% -1.10% 8.28% -0.50% -2.90% -2.32% -1.10% 7.14%
          AVG -1.08% -1.27% -0.29% -0.20% 3.37% -1.02% -1.23% -0.29% -0.22% 2.88%
        • Observation:

          NB and SVM: 100 features give the best performance overall.

          KNN: 100 features give the best performance across categories.

          Suggestion: it seems that 100 features are enough for both small and large categories. Later we should use a percentage instead of an absolute number of features (I am not sure whether Weka provides this option), or perhaps try the -T option to set a score threshold based on a certain precision score.

      • Classifier-specific factors
        • KNN: number of nearest neighbors
          Category   Micro: diff(5-3) diff(7-5)   Macro: diff(5-3) diff(7-5)
          1#earn -0.71% -0.81% -0.64% -0.74%
          2#acq -2.46% -2.50% -2.24% -2.21%
          3#money-fx -7.43% -4.63% -5.08% -3.01%
          4#grain -16.30% -16.99% -12.18% -9.88%
          5#crude -11.26% -11.38% -8.06% -7.62%
          6#trade -18.48% -16.18% -7.67% -2.56%
          7#interest -0.52% -1.18% -0.33% -0.70%
          8#wheat -16.15% -18.43% -11.51% -8.61%
          9#ship -32.84% -20.15% -17.33% -2.84%
          10#corn -11.46% -17.48% -9.18% -11.63%
          AVG -11.76% -10.97% -7.42% -4.98%
        • Observation:

          3 seems to be the optimal number of neighbors for KNN

        • SVM: Cost and Gamma (detailed results are kept on a separate page)


Resources

  1. Weka Doc http://weka.sourceforge.net/wekadoc/index.php/en:Weka_3.5.5
  2. Command line tutorial http://weka.sourceforge.net/wekadoc/index.php/en%3APrimer
  3. Weka 3.5 API http://weka.sourceforge.net/doc.dev/
  4. Weka mailing list http://www.nabble.com/WEKA-f435.html
  5. More Weka documentation
  6. My log on l710 and the Blog track (password required)
  7. My script folder on elvis: /u2/home/nyu/Classification/Code