Documentation

JBoost has several boosting algorithms, and each algorithm has its own set of parameters. Several forms of documentation are available and are summarized on this page.
What are ADTrees?

ADTrees are a generalization of decision trees that utilize the boosting model for machine learning. There is a Wikipedia ADTree page that describes how to read the data structure and how the construction algorithm works. There is also a short presentation (PDF) describing the data structure with many graphical examples. For more information on boosting, see the links below.

Input Files

JBoost requires three data files: the data specification, training data, and test data. Several example datasets are included with the JBoost distribution.
The data specification file contains definitions for the configurable parameters, the types of each example attribute, and the possible set of labels for each example. A sample data specification looks like this:

    exampleTerminator=;
    attributeTerminator=,
    maxBadExa=0
    INDEX           number
    age             number
    income          number
    sex             (male, female)
    highest-degree  (none, high-school, bachelors, masters, phd)
    goal-in-life    text
    labels          (rich, smart, happy, none)

This format lists the configurable parameters, followed by the attribute definitions and the possible labels. The parameters are: exampleTerminator (the character that ends each example), attributeTerminator (the character that separates attribute values), and maxBadExa (the maximum number of badly formatted examples that will be tolerated).
The attribute definitions include the attribute name and the type. The attribute names must be listed in the order they appear in the data. A few attribute names (such as INDEX and labels in the sample above) are reserved keywords with special meaning; there is no other special relationship between the data and the attribute names.
There are currently only three attribute types supported in JBoost: number, text, and a finite set of values (written as a parenthesized, comma-separated list, as for sex and highest-degree above). Attributes of type number take numeric values, attributes of type text take free text, and finite attributes take one of the listed values.
After you have completed the data specification you have to update the test data and training data to follow the definition in the specification file. Lines starting with '//' in the data files are ignored. Each line should contain one example, with attributes separated by the attributeTerminator and terminated by the exampleTerminator, for example:

    51, 52000, male, phd, be with my family, 1.0, happy;
    24, 1000000, male, bachelors, retire at 25, 0.5, rich;

After you have completed creating your train and test files, you can use JBoost to learn an alternating decision tree.

Output Files

There are many options that create different types of output files. To turn particular output on or off, see the Usage Details section. This section provides an overview of all the options.
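As a hypothetical illustration of a run that reads the input files above and produces output files, a command such as the following can be used. This is only a sketch: it assumes the jar has been built as jboost.jar and that the command-line entry point is the jboost.controller.Controller class (which matches the package paths used in the examples below); adjust names and paths to your installation.

    java -cp jboost.jar jboost.controller.Controller -S data -b AdaBoost -numRounds 100

Here -S data tells JBoost to look for data.spec, data.train and data.test, and the remaining options are described under Usage Details.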
Usage Details

There are a large number of usage options. While there are many options, the most common runs are shown in the examples below. The global configuration options are:

    -p N      Specify number of threads (default: 1)
    -CONFIG   The name of the configuration file (default: "jboost.config").
              All options can be specified in this file instead of on the command line.
    -V        Print version and exit
If you have a multiprocessing environment, the -p option can be used to run JBoost with several threads. An example set of command-line arguments (as might be appended to the command shown above) is:

    -n src/jboost/controller/data.spec -T src/jboost/controller/data.test -t src/jboost/controller/data.train -a -1 -serialTreeOutput src/jboost/controller/atree.serialized

Data/Test/Train file options are:

    -S stem                     Base (stem) name for the files (default: "data")
    -n file.spec                Specfile name (default: stem+".spec")
    -t file.train               Training file name (default: stem+".train")
    -T file.test                Test file name (default: stem+".test")
    -serialTreeInput file.tree  Java object of adtree (default: stem+".output.tree")
    -weightThreshold T          Set threshold for accepting an example
The boosting options are:

    -b type             The type of booster to use (default: AdaBoost).
                          AdaBoost      Loss function: exp(-margin)
                          LogLossBoost  Loss function: log(1 + exp(-margin))
                          RobustBoost   Loss function: min(1, 1 - erf((margin - mu(time)) / sigma(time)))
    -numRounds N        The number of rounds of boosting to be executed.
                        This option should be used with AdaBoost and LogLossBoost.
    -ATreeType type     The type of ATree to create. There are several options:
                          ADD_ALL              Create a full ADTree (default)
                          ADD_ROOT             Add splits only at the root, producing a flat tree.
                                               Equivalent to boosting decision stumps.
                          ADD_SINGLES          Create a decision tree
                          ADD_ROOT_OR_SINGLES  Create an ensemble combination of decision trees.
    -BoosTexter         Only make a zero prediction at the root node.
    -booster_smooth sf  Smoothing factor for prediction computation (default: 0.5).
                        Described in Schapire & Singer 1999 (smoothing the predictions);
                        epsilon = sf / total_num_examples.
    -booster_paranoid   Use debugging version of booster (default: false)
The most common usage specifies only the booster type, the number of rounds, and the type of tree. The type of alternating decision tree can drastically change the classifier. Depending on the dataset, one type of tree may be better than the others; however, the ADD_ALL type is the most flexible. BoosTexter is an implementation of Schapire and Singer's bag-of-words method, where zero predictions are allowed at the root node.

RobustBoost options:

    -rb_time NUM       See documentation.
    -rb_epsilon NUM    See documentation.
    -rb_theta NUM      See documentation.
    -rb_theta_0 NUM    See documentation.
    -rb_theta_1 NUM    See documentation.
    -rb_sigma_f NUM    See documentation.
    -rb_sigma_f_0 NUM  See documentation.
    -rb_sigma_f_1 NUM  See documentation.
    -rb_cost_0 NUM     See documentation.
    -rb_cost_1 NUM     See documentation.
    -rb_conf_rated     The RobustBoost option ... TODO

Code output options:

    -serialTreeOutput file.tree  Java object output of adtree
    -O file.tree                 Output tree file name (default: stem+".output.tree")
    -P filename                  Output python code file name
    -j filename                  Output java code file name
    -c filename                  Output C code file name
    -m filename                  Output matlab code file name
    -cOutputProc name            Name of procedure for output C code (default: 'predict')
    -javaStandAlone              Output java code that can stand alone, but cannot read jboost-format data
    -javaOutputClass name        Name of class for output java code (default: 'Predict')
    -javaOutputMethod name       Name of method for output java code (default: 'predict')

There are two ways to output the ADTree: a text format meant for visualization, or a code format meant for classification. The code can be integrated with JBoost or with a custom classification system.
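For instance, a hypothetical run that writes the learned tree as Python, C and Java code might look like the following (again assuming the jboost.controller.Controller entry point; the file names are only illustrative):

    java -cp jboost.jar jboost.controller.Controller -S spambase -b LogLossBoost -numRounds 200 -ATreeType ADD_ALL -P spambase.py -c spambase.c -j Predict.java

The generated spambase.py, spambase.c and Predict.java files are described in the Using JBoost Code Output section below.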
Make sure not to forget to specify the output options you need.

Logging options:

    -info filename   High-level log file name (default: stem+".info")
    -log filename    Debugging log (default: stem+".log")
    -loglevel N      Amount of information to be output to the log.
                     The larger N is, the more information will be output.
                     This is meant to be used as a debugging tool.
    -a iter          Generate margin (score) logs:
                       iter > 0   log only on iteration iter
                       iter = -1  log on iterations 1,2,...,9,10,20,...,90,100,200,...
                       iter = -2  log on all iterations
                       iter = -3  log only on the last iteration of boosting
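For example, a hypothetical run that logs margins on every iteration (the data later consumed by margin.py, producing files such as the spambase.train.boosting.info file used in the margin.py example below) would add the -a switch:

    java -cp jboost.jar jboost.controller.Controller -S spambase -numRounds 200 -a -2

As above, the entry point class name is an assumption; adjust it to your installation.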
To obtain the margin distribution, the -a option must be specified.

Using JBoost Code Output

JBoost generates several files containing information about the run. These files can be categorized into two groups: log files and classifier files.
There are two log files: the .info file, which records high-level information about the run, and the .log file, which contains debugging output (see the logging options above).
There can be numerous classifier files. The generated ADTree will be stored in the files requested by the code output options (-O, -serialTreeOutput, -P, -j, -c, -m).

The Python code contains a single class (ATree) and a main function demonstrating how to use the class. The classifier file (for example spambase.py) can be used on a new dataset (spambase.data), whose format is specified in a specfile (spambase.spec), via

    $> python spambase.py spambase.data spambase.spec
To write your own code, see the generated main function for an example of how the class is used.

The C code contains a single procedure, named 'predict' by default (see the -cOutputProc option). It takes two arguments; like the Python and Java output, it maps the attribute values of a single example to a score for each class.

The .java file contains a class called 'Predict'. To use this file, it must be moved to a file called Predict.java. This file can be compiled with javac (with the JBoost classes on the classpath unless -javaStandAlone was used).
Unless the javaStandAlone command line option is invoked, the compiled class can be run with the java launcher (assuming you are in the same directory as Predict.java). If you want to call Predict from another directory, add the directory containing Predict to the Java classpath, for example through the CLASSPATH environment variable, as sketched below.
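A minimal sketch of the compile-and-run steps; the paths and the location of jboost.jar are placeholders, and the exact command-line arguments expected by the generated main depend on the generated code:

    javac -cp /path/to/jboost.jar Predict.java
    export CLASSPATH=/path/to/jboost.jar:/path/to/directory/containing/Predict
    java Predict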
In this mode, the program reads examples from standard input in exactly the same form as the examples contained in the .train and .test files. After each example is read, a vector of scores (one for each class) is output to standard output. Alternatively, the prediction method (named 'predict' by default; see -javaOutputMethod)
can be called from another java program. The argument attr is an array of Objects corresponding to the attributes specified in the spec file. Thus, if attribute i is text, then attr[i] must be a String; if attribute i is a number, then attr[i] must be a Double; and if attribute i is finite, then attr[i] must be an Integer containing the index of the chosen value. The return value of this method is an array of doubles containing the scores for each of the classes. A sketch of such a call is shown below.
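The following is a minimal sketch, not taken from the JBoost sources: it assumes the default class and method names (Predict, predict), assumes the method can be called statically, and uses made-up attribute values loosely matching the toy spec file from the Input Files section.

    // Hypothetical caller of the generated classifier.
    // Assumptions: the generated class is named Predict, its prediction
    // method is named predict, it accepts an Object[] of attribute values
    // (String for text, Double for number, Integer index for finite
    // attributes) and returns a double[] of per-class scores.
    public class PredictExample {
        public static void main(String[] args) {
            Object[] attr = new Object[] {
                Double.valueOf(51),       // age (number)
                Double.valueOf(52000),    // income (number)
                Integer.valueOf(0),       // sex (finite): index of "male"
                Integer.valueOf(4),       // highest-degree (finite): index of "phd"
                "be with my family"       // goal-in-life (text)
            };
            double[] scores = Predict.predict(attr);  // assumed static; adapt if it is an instance method
            for (int i = 0; i < scores.length; i++) {
                System.out.println("class " + i + " score: " + scores[i]);
            }
        }
    }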
A third option is a variant of this method that takes all attribute values as Strings; it also can be called from another java program. As before, the argument attr is an array corresponding to the attributes specified in the spec file. Now, however, all of the values of this array are specified by Strings. Thus, if attribute i is text, then attr[i] is the text string itself; if attribute i is a number, then attr[i] is a String representing this number; and if attribute i is finite, then attr[i] must be the chosen value itself (not its index) represented as a String. Note that removing punctuation, converting to lower case, removing leading or trailing blank spaces, etc. are the responsibility of the calling procedure. As before, the return value of this method is an array of doubles containing the scores for each of the classes.

Other Visualization Tools

There are currently a few visualization tools. All of these programs depend on files output by JBoost. There are currently tools for visualizing three aspects of JBoost: 1) the error, 2) the margin, and 3) the alternating decision tree. The tools can be found in the scripts directory and, when run without arguments, will print their usage. We assume that the programs these scripts depend on (for example perl, python, and the Graphviz dot utility) are installed and accessible in your executable path.
Visualizing the ADTree is important for understanding the intuition for why the classifier works. The program atree2dot2ps.pl converts the text-format tree into a picture:

    ./atree2dot2ps.pl usage:
      -i (--info) file.info   File containing runtime information (required)
      -t (--tree) file.tree   File containing the ADTree in text format (required)
      -d (--dir) directory    Directory to use (optional, defaults to '.')
      -l (--labels)           Flip the labels (-1 becomes +1) (optional)
      --truncate              Truncate threshold values to increase readability (optional)
      --threshold num         A given depth to stop the tree from becoming too large (optional)
      -h (--help)             Print this usage information

An example (showing the first 10 nodes of the spambase ADTree) is

    ./atree2dot2ps.pl --info spambase.info --tree spambase.output.tree --threshold 10

which produces a PostScript rendering of the tree.

Boosting doesn't require many parameters; in fact it only really requires one: the number of iterations. Specifying this number is non-trivial; however, there is somewhat of a consensus on how to determine an appropriate number of iterations: the test error curve should have flattened out, and the margin distribution should have stopped changing.
These two conditions combined with cross validation experiments (and
boosting's proclivity to not overfit) are typically sufficient
evidence for choosing the number of iterations. There are two scripts for examining these quantities: error.py and margin.py.
The error.py script can be run while JBoost is still executing. It reads the info file, which is created incrementally. Here is the usage for the program:

    Usage: error.py
      --info=info_file   scores file as output by jboost
      --logaxis          should the axis be log-scaled (default: false)
      --bound            show the bound on training error

An example (1000 iterations on the spambase ADTree) is

    ./error.py --info=spambase.info --logaxis

which results in a plot of the error curves.

margin.py can also be run while JBoost is still executing. It reads the given train/test file and the boosting.info file (the margin log produced by the -a switch). Here is the usage for the program:

    Usage: margin.py
      --boost-info=filename   margins file as output by jboost (-a -2 switch)
      --data=filename         train/test filename
      --spec=filename         spec filename
      --iteration=i,j,k,...   the iterations to inspect, no spaces between commas
      --sample                are we interested in a sample of false pos/neg

An example (spambase margins for 5, 50, 300, 1000 iterations) is

    ./margin.py --boost-info=spambase.train.boosting.info --data=spambase.train \
        --spec=spambase.spec --iteration=5,50,300,1000

which results in a plot of the margin distributions at those iterations. Note that after 300 iterations, it seems that the margin distribution has mostly converged. This, in combination with the error curve, indicates that 300 iterations is likely to be sufficient. However, in both cases, boosting does not overfit, so the additional iterations can only help. This is not always the case, though.

Cross Validation

Cross validation is a vital part of any machine learning experiment. It demonstrates that the classifier is robust and has not been overfitted to specific training/test data. While CV is not 100% conclusive, it certainly is an important piece of evidence for the robustness of a classifier.
The nfold.py script sets up and runs N-fold cross validation with JBoost. Here is the usage for the program:

    Usage: nfold.py
      --folds=N        create N files for N fold cross validation (required)
      --data=DATAFILE  DATAFILE contains the data to be divided into test/train sets (required)
      --spec=SPECFILE  SPECFILE is the same as used in JBoost typically (required)
      --rounds=N       The number of rounds to run the booster (required)
      --tree=TREETYPE  The type of tree nodes to use; possible options are ADD_ROOT, ADD_ALL,
                       ADD_SINGLES, ADD_ROOT_OR_SINGLES. If unspecified, all types will be
                       executed sequentially
      --generate       Creates directories for the cross validation sets
      --dir=DIR        Specifies a name for the directory to be generated

An example (5-fold CV, spambase dataset, 500 iterations, 1 tree type, test/train files and directory generated) is

    cat spambase.test spambase.train > spambase.data
    ./nfold.py --folds=5 --data=spambase.data --spec=spambase.spec \
        --rounds=500 --tree=ADD_ALL --generate
This will create the directory cvdata-MONTH-DAY-HOUR-MINUTE-SECOND
containing a directory ADD_ALL, files trialK.train and
trialK.test where K ranges from 1 to N, and
trial.spec. The ADD_ALL directory will contain all the runtime
information for each trial. To change other runtime parameters of JBoost (e.g. memory size, number of threads, amount of output via the -a switch, etc.), edit the script before running it.

Dealing with Multiclass Problems

In version 2.0 JBoost stopped internally supporting multiclass problems. Instead, a python script is now offered that splits a multiclass problem into many binary problems. It can be found in the scripts folder in the jboost root directory. First, it estimates via cross-validation the test error of a one-vs-all set of classifiers (the label of a test example is given by the one-vs-all classifier that gives it the largest score). As a side product, a confusion matrix C can be built, where C(i,j) is the number of test examples with label i that were classified as j. From this information, it attempts to split the classes into two groups via spectral clustering on the symmetrized confusion matrix. A classifier that discriminates between these two groups is learned, and the process is repeated on the resulting two child nodes. This produces a tree of classifiers with singleton labels at the leaves. The test error of this classifier is also reported.

The usage of the script is as follows. First, simply alter your spec file to include all the labels for your multiclass problem. Then call MultiClassHierarchical.py like the following:

    ./MultiClassHierarchical.py --S=./datasets/iris/iris --numRounds=10 --ATreeType=ADD_ROOT_OR_SINGLES

Here is an example output from the script on the iris UCI dataset:

    *********New Node**********
    ClassList: ['Iris-versicolor', 'Iris-virginica', 'Iris-setosa']
    Confusion Matrix:
    46  4  0
     6 44  0
     0  0 50
    ZeroLabels: ['Iris-versicolor', 'Iris-virginica']
    OneLabels: ['Iris-setosa']
    LLB: acc +/- std, (iters) = 0.000000 +/- 0.000000, (10)
    ADA: acc +/- std, (iters) = 0.000000 +/- 0.000000, (10)
    *********New Node**********
    ClassList: ['Iris-versicolor', 'Iris-virginica']
    Confusion Matrix:
    47  3
     4 46
    ZeroLabels: ['Iris-versicolor']
    OneLabels: ['Iris-virginica']
    LLB: acc +/- std, (iters) = 0.070000 +/- 0.050990, (10)
    ADA: acc +/- std, (iters) = 0.070000 +/- 0.050990, (10)
    ************************SUMMARY************************
    LLB 1 vs All Classification Error: 0.0667
    ADA 1 vs All Classification Error: 0.0667
    LLB Spectral Cluster Tree Classification Error: 0.0467
    ADA Spectral Cluster Tree Classification Error: 0.0467

Here the tree structure of classifiers is shown. Shown are: the confusion matrix built by finding one-vs-all classifiers, the two clusterings of labels, and the resulting classifier built by AdaBoost and LogLossBoost to separate the two clusters. Child nodes are indicated by indentation. Finally, at the end, the overall accuracy of each classifier on the cross validated test sets is shown. The full list of options is given by:

    Usage: MultiClassHierarchical.py
      --S=<stem>          path to stem name -- expecting <stem>.data and <stem>.spec to be there
      --numRounds=N       number of rounds to run each classifier
      --folds=N           number of folds to use in cross-validation. Default: 5
      --rb                RobustBoost will be run. By default this is off.
                          WARNING: Can take a very long time.
      --ATreeType=<tree>  ADD_ROOT, ADD_SINGLES, ADD_ROOT_OR_SINGLES, or ADD_ALL
                          ADTree types (ADD_ROOT by default)

This script is just for testing purposes. Unfinished features include using the various output functions that JBoost offers (e.g. in Python or Matlab) and combining them into a single multiclass classifier.

Searching for Good RobustBoost Parameter Settings

We have developed a script that we recommend for searching for the optimal setting of parameters in RobustBoost. It can be found in the scripts/MultiClass folder from the JBoost root directory.
It first fixes sigma_f at 1.0, which is typically a good setting for most problems in our experience. Then, for each theta in {0, 1, 2, 4, 8, 16, 32, 64}, it tries to find the smallest epsilon for which RobustBoost can still get to t = 1.0 within the specified number of boosting iterations, using a binary search on epsilon for each theta. The usage of the script for a binary problem is as follows:

    ./RobustBoostSearch.py --S=./datasets/breast-w/breast-w --numRounds=200 --ATreeType=ADD_ROOT_OR_SINGLES

Here is an example output from the script on the breast-w UCI dataset:

    ******FINAL RESULTS*****
    RB: acc +/- std, (epsilon,theta,sigma_f) = 0.038680 +/- 0.024394, (0.00,2.00,1.00)

Shown are the best setting of sigma_f, theta and epsilon, with the test accuracy according to a cross-validation. The full list of options is given by:

    Usage: RobustBoostSearch.py
      --S=<stem>          path to stem name -- expecting <stem>.data and <stem>.spec to be there
      --numRounds=N       number of rounds to run each classifier
      --folds=N           number of folds to use in cross-validation. Default: 5
      --ATreeType=<tree>  ADD_ROOT, ADD_SINGLES, ADD_ROOT_OR_SINGLES, or ADD_ALL
                          ADTree types (ADD_ROOT by default)