

JBoost has several boosting algorithms, and each algorithm has its own set of parameters. This page documents the input and output file formats, the usage options, and the supporting scripts.

What are ADTrees?

ADTrees are a generalization of decision trees that utilize the boosting model for machine learning. There is a Wikipedia ADTree page that describes how to read the data structure and how the construction algorithm works. There is also a short presentation (PDF) describing the data structure with many graphical examples. For more information on boosting, see the links below.

Input Files

JBoost requires three data files for the data specification, training data, and test data. There are several examples in the demo directory of the distribution.

The data specification file contains definitions for the configurable parameters, the types of each example attribute and the possible set of labels for each example. A sample data specification looks like this,

 INDEX            number
 age              number
 income           number
 sex              (male, female)
 highest-degree   (none, high-school, bachelors, masters, phd)
 goal-in-life     text
 labels           (rich, smart, happy, none)

This format lists the configurable parameters, followed by the attribute definitions and the possible labels. The spec file parameters are:

  • exampleTerminator (required) - the character used to signify the end of each example; a typical value is ';'.
  • attributeTerminator (required) - the character used to separate the attributes of each example; a typical value is ','.
  • maxBadExa (optional, default 20) - the maximum number of bad examples that will be tolerated.
  • maxBadAtt (optional, default 0) - the maximum number of bad attributes that will be tolerated.

The special keyword attribute names are:

  • INDEX (optional) - the unique index number of each example.
  • weight (optional, default 1.0) - an initial weighting of the data; higher weight implies greater importance to classify correctly.
  • labels (required) - the list of possible labels, given as a finite type (see below). Order matters! See below.

The attribute definitions consist of the attribute name and its type, and must be listed in the order the attributes appear in the data. Two keyword attribute names deserve mention: weight ascribes an importance to the example, and labels defines the label for the example. JBoost treats the label as an enumeration of values and does not take into account any numeric value associated with those values. However, the order in which the labels are defined affects post-processing, so define your labels as text (e.g. labels (rich, poor)) or in increasing numeric value (e.g. labels (-1, 1)) to get the most intuitive post-processing results. A specific ordering of labels isn't strictly necessary, but it may reduce the post-processing changes required.

There are no other special relationships between the data and the attribute names.

There are currently only three attribute types supported in JBoost. Attributes of finite type must include the set of possible values. This set is a parenthesized comma-separated list.

  • number - the attribute can be any number, represented as a floating point value.
  • finite - the attribute is a member of a defined set of strings.
  • text - the attribute is a sequence of words delimited by whitespace.
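As an illustration of this syntax, a minimal parser for attribute lines might look like the following sketch (JBoost's actual spec parser may differ):

```python
import re

def parse_spec_line(line):
    """Parse one attribute line into (name, type, values).

    type is 'number', 'text', or 'finite'; values is the tuple of
    allowed strings for a finite type, else None.
    """
    name, rest = line.strip().split(None, 1)
    rest = rest.strip()
    if rest == "number":
        return name, "number", None
    if rest == "text":
        return name, "text", None
    match = re.fullmatch(r"\((.*)\)", rest)
    if match:  # finite type: parenthesized comma-separated list
        values = tuple(v.strip() for v in match.group(1).split(","))
        return name, "finite", values
    raise ValueError("unrecognized attribute type: " + rest)
```

For example, parse_spec_line("sex (male, female)") returns ("sex", "finite", ("male", "female")).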

After you have completed the data specification you have to update the test data and training data to follow the definition in the specification file. Lines starting with '//' in the data files are ignored. Each line should contain an example with attributes separated by the attributeTerminator. Each line should end with the exampleTerminator. If one of the terminators needs to be used as part of the data, it can be 'escaped' using the '\' character. An example test file, using the example specification above, looks like this,

 51, 52000, male, phd, be with my family, 1.0, happy;
 24, 1000000, male, bachelors, retire at 25, 0.5, rich; 
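When a terminator must appear inside an attribute value, a data writer has to apply the '\' escaping described above; a sketch (the helper below is illustrative, not part of JBoost):

```python
def escape_field(value, attribute_terminator=",", example_terminator=";"):
    """Prefix terminator characters (and backslash itself, a common
    convention) with '\\' so they survive parsing."""
    out = []
    for ch in value:
        if ch in (attribute_terminator, example_terminator, "\\"):
            out.append("\\")
        out.append(ch)
    return "".join(out)
```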

After you have completed creating your train and test files, you can use JBoost to learn an alternating decision tree.

Output Files

There are many options that will create different types of output files. To turn on/off certain output, see the Usage Details section. This section provides an overview of all the options.

  • Option and error information. A high-level file gives all options used to learn the ATree and the error rate at every iteration.
  • Boosting information. Boosting methods have weights, margins, and potential loss associated with each example on each iteration. These files are useful for visualizing the score/margin distribution, but can grow very large (hundreds of MB) with large datasets and many iterations. Use the -a switch to change the amount of output.
  • Classifier code. For many of the boosters, a classifier can be output as a Java, C, Python, or Matlab code file. This can then be integrated into your own project or used from the command line.
  • ATree. A human-readable text file that is PERL-parsable (see atree2dot2ps.pl). It is useful for visualization, but cannot be used to classify future inputs (unless you want to implement that feature!).
  • Serial ATree. These trees can be read back into Java at a later date. This option is useful if you have a large and complicated classifier that takes a long time to learn.

Usage Details

There are a large number of usage options. While the list is long, the most common invocations are shown in the examples below.

The global configuration options are:

        -p N       Specify number of threads (default: 1)
        -CONFIG    The name of the configuration file (default "jboost.config")
                   All options can be specified in this file instead of on
                   the command line.
        -V         Print version and exit

If you have a multiprocessing environment, the -p N switch tells JBoost how many threads to use; by default, JBoost uses one thread. JBoost spends most of its time searching for good weak hypotheses, and this search time decreases roughly linearly with the number of threads.
The -CONFIG filename switch specifies a configuration file to use instead of the command line parameters. If you find yourself using many parameters, it may be easier to specify all of them in a file. If you use a Windows system and don't want to use the command line, you can create jboost.config with all the parameters specified inside. An example configuration file could be:

-n src/jboost/controller/data.spec
-T src/jboost/controller/data.test
-t src/jboost/controller/data.train
-a -1
-serialTreeOutput src/jboost/controller/atree.serialized

Data/Test/Train file options are:

        -S stem        Base (stem) name for the files (default: "data")
        -n file.spec   Specfile name (default: stem+".spec")
        -t file.train  Training file name (default: stem+".train")
        -T file.test   Test file name (default: stem+".test")
        -serialTreeInput file.tree   Java object of adtree (default: stem+".output.tree")
        -weightThreshold T    Set threshold for accepting an example

The -S stem switch is the most common way to specify the train/test/spec files. If you use non-standard filenames, you can specify the spec, train, and test files separately. The serialized input tree is a Java object of an ADTree output by a previous run of JBoost. When a serialized input tree is given, you can also specify a weight threshold with -weightThreshold T, so that any example with weight smaller than T is ignored. This can be very useful when the dataset is large.

Boosting options:

    -b type   The type of booster to use (default: AdaBoost).
              AdaBoost     Loss function: exp(-margin)
              LogLossBoost Loss: log(1 + exp(-margin))
              RobustBoost  Loss: min(1,1-erf((margin - mu(time))/sigma(time)))
    -numRounds N  The number of rounds of boosting that are to be executed.
                  This option should be used with AdaBoost and LogLossBoost
    -ATreeType type   The type of ATree to create.  There are several options:
                      ADD_ALL               Create a full ADTree (default)
                      ADD_ROOT              Add splits only at the root, producing a flat tree.
                                            Equivalent to boosting decision stumps
                      ADD_SINGLES           Create a decision tree
                      ADD_ROOT_OR_SINGLES   Create an ensemble combination of decision trees.
    -BoosTexter        Only make a zero prediction at the root node.
    -booster_smooth sf   Smoothing factor for prediction computation (default: 0.5)
                         Described in Schapire & Singer 1999 (smoothing the predictions),
                         $epsilon = sf / total_num_examples$
    -booster_paranoid Use debugging version of booster (default: false)

The most common usage is -b AdaBoost -numRounds 100. If the test error doesn't asymptote (see the visualizations), run for more rounds. If AdaBoost is overfitting, use LogLossBoost and/or RobustBoost. Descriptions of the LogLossBoost (LogitBoost) and RobustBoost algorithms can be found on Wikipedia.
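The loss functions listed above can be written down directly. Here is a sketch (RobustBoost's mu(t) and sigma(t) are internal time-dependent schedules, so they are frozen at illustrative values here):

```python
import math

def adaboost_loss(margin):
    return math.exp(-margin)

def loglossboost_loss(margin):
    return math.log(1 + math.exp(-margin))

def robustboost_loss(margin, mu=0.5, sigma=1.0):
    # mu and sigma stand in for the schedules mu(time), sigma(time)
    return min(1.0, 1 - math.erf((margin - mu) / sigma))
```

All three penalize negative margins (misclassified examples); AdaBoost's loss grows exponentially, while RobustBoost's is bounded, which limits the influence of outliers.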

The type of alternating decision tree can drastically change the classifier. Depending on the dataset, one type of tree may be better than the others; however, the ADD_ALL type is the most flexible.

BoosTexter is an implementation of Schapire and Singer's bag of words method where zero predictions are allowed at the root node.

RobustBoost options:

    -rb_time       NUM          See documentation.
    -rb_epsilon    NUM          See documentation.
    -rb_theta      NUM          See documentation.
    -rb_theta_0    NUM          See documentation.
    -rb_theta_1    NUM          See documentation.
    -rb_sigma_f    NUM          See documentation.
    -rb_sigma_f_0  NUM          See documentation.
    -rb_sigma_f_1  NUM          See documentation.
    -rb_cost_0     NUM          See documentation.
    -rb_cost_1     NUM          See documentation.
    -rb_conf_rated  See documentation.
    -rb_potentialSlack   NUM    See documentation.

The RobustBoost option ... TODO

Code Output Options:

        -serialTreeOutput file.tree    Java object output of adtree
        -O file.tree   Output tree file name (default: stem+".output.tree")
        -P filename    Output python code file name 
        -j filename    Output java code file name 
        -c filename    Output C code file name 
        -m filename    Output matlab code file name 
        -cOutputProc name  Name of procedure for output C code (default: 'predict')
        -javaStandAlone    Output java code that can stand alone, but
                           cannot read jboost-format data
        -javaOutputClass name     Name of class for output java code (default: 'Predict')
        -javaOutputMethod name    Name of method for output java code (default: 'predict')

There are two ways to output the ADTree: a text format meant for visualization or a code format meant for classification. The code can be integrated with JBoost or a custom classification system.

Don't forget to specify the -j, -c, or -m flag if you wish to have code output. Also remember that the Java class name must match the filename (a Java constraint).

Logging Options

        -info filename      High-level log file name (default: stem+".info")
        -log  filename      Debugging log (default stem+".log")
        -loglevel N   Amount of information to be output to log
                      The larger N is, the more information will be output.
                      This is meant to be used as a debugging tool.
        -a iter      Generate margin (score) logs
                     iter>0   log only on iteration iter,
                     iter=-1  log on iters 1,2,...,9,10,20,...,90,100,200,...
                     iter=-2  log on all iterations
                     iter=-3  log only on the last iteration of boosting

To obtain the margin distribution, the -a iter option must be used. The loglevel is meant to be used by developers for debugging purposes. The default values for -info and -log are typically sufficient.
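One plausible reading of the -a -1 schedule (iterations 1..9, 10, 20, ..., 90, 100, 200, ...) is "iterations with a single significant digit". As a sketch (not JBoost's actual code):

```python
def logged_under_a_minus_1(iteration):
    """True if iteration has one significant digit: 1..9, 10, 20, ...,
    90, 100, 200, ... -- the iterations logged under -a -1."""
    if iteration < 1:
        return False
    while iteration % 10 == 0:
        iteration //= 10
    return iteration < 10
```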

Using JBoost Code Output

JBoost generates several files containing information about the run. These files can be categorized into two groups: log files and classifier files.

There are two log files. The file stem.info is generated as the program runs. It contains a high level log, containing information such as error rate for given iteration and command line arguments. The other log file is stem.log, which contains a detailed log. The granularity of the log is controlled by the -logLevel switch. Large values of N should only be used by developers trying to debug.

There can be numerous classifier files. The generated ADTree will be stored in stem.output.tree. If you use JBoost to create source code for the classifier (via one of the switches -j, -c, -m), then code representing this tree will be generated in files with those names. We assume here that the defaults are used for the Java and C code output.

The Python code contains a single class (ATree) and a main function demonstrating how to use the class. The classifier file (for example spambase.py) can be used on a new dataset (spambase.data) using a format specified in a specfile (spambase.spec) via

    $> python spambase.py spambase.data spambase.spec

To write your own code, see the main function at the bottom of the file.

The C code contains a single procedure:

double predict(void **attr, double *ret)

The first argument attr is an array of pointers corresponding to the attributes specified in the spec file. Thus, if attribute i is text, then attr[i] must be a char array; if attribute i is a number, then *attr[i] must be a double; and if attribute i is finite, then *attr[i] must be an int containing the index of the chosen value. If attribute i is undefined then attr[i] should be set to NULL.

The second argument ret is a pointer to an array of k doubles, where k is the number of classes. The scores for each of the k classes will be stored in this array. If ret is NULL, then no scores are stored. In any case, predict returns the score for class 0 (ret[0]).

The .java file contains a class called 'Predict'. To use this file, it must be moved to a file called Predict.java. This file can be compiled using

javac Predict.java

Unless the javaStandAlone command line option is invoked, this file can be run (assuming you are in the same directory as Predict.java) using

java Predict < datafile

If you want to call Predict from an alternative directory, set an environment variable to the directory of Predict.java as such,

java -cp "$CLASSPATH:$PREDICTPATH" Predict < datafile

In this mode, the program reads examples from standard input having the identical form of examples contained in the .train and .test files. After each example is read, a vector of scores (one for each class) is output to standard output.

Alternatively, the method

static public double[] predict(Object[] attr)

can be called from another java program. The argument attr is an array of Objects corresponding to the attributes specified in the spec file. Thus, if attribute i is text, then attr[i] must be a String; if attribute i is a number, then attr[i] must be a Double; and if attribute i is finite, then attr[i] must be an Integer containing the index of the chosen value.

The return value of this procedure is an array of doubles containing the scores for each of the classes.

A third option is to call the method

static public double[] predict(String[] attr)

which also can be called from another java program. As before, the argument attr is an array corresponding to the attributes specified in the spec file. Now, however, all of the values of this array are specified by Strings. Thus, if attribute i is text, then attr[i] is the text string itself; if attribute i is a number, then attr[i] is a String representing this number; and if attribute i is finite, then attr[i] must be the chosen value itself (not its index) represented as a String. Note that removing punctuation, converting to lower case, removing leading or trailing blank spaces, etc. are the responsibility of the calling procedure.
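The caller-side normalization mentioned above might look like this (shown in Python for brevity; a Java caller would do the equivalent):

```python
import string

def normalize_attribute(raw):
    """Prepare a raw value for predict(String[]): trim leading/trailing
    whitespace, lower-case, and strip punctuation."""
    cleaned = raw.strip().lower()
    return cleaned.translate(str.maketrans("", "", string.punctuation))
```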

As before, the return value of this procedure is an array of doubles containing the scores for each of the classes.

Other Visualization Tools

There are currently a few visualization tools, all of which depend on files output by JBoost. There are tools for visualizing three aspects of JBoost: 1) the error, 2) the margin, and 3) the alternating decision tree. The tools can be found in the scripts directory and, when run without arguments, will print their usage.

We assume that the following programs are installed and are accessible in your executable path:

  • Python - Used in the *.py scripts. Available for nearly all platforms and comes installed by default on most Linux and Mac OS X machines.
  • PERL - Used by the *.pl scripts. Available for nearly all platforms and comes installed by default on most Linux and Mac OS X machines.
  • Graphviz - A package from AT&T labs for displaying graphs. Used by atree2dot2ps.pl.
  • gnuplot - A multi-platform data plotting utility. Used by all the *.py scripts.
  • (Optional) cygwin - An environment similar to Linux that can run on Windows machines. JBoost (and associated scripts) can run on Windows, but cygwin makes it much simpler for people used to a Linux environment. You can also use cygwin to download all of the above programs (except for Graphviz, which you'll have to install through Windows).

Visualizing the ADTree is important for understanding why the classifier works. The program atree2dot2ps.pl parses an info file and a tree file to produce ps, pdf, and png representations of the ADTree. This program is written in PERL and uses the Graphviz package from AT&T labs. Usage is:

./atree2dot2ps.pl usage: 
         -i (--info) file.info   File containing runtime information (required) 
         -t (--tree) file.tree   File containing the ADTree in text format (required) 
         -d (--dir)  directory   Directory to use (optional, defaults to '.')
         -l (--labels)           Flip the labels (-1 becomes +1) (optional)
         --truncate              Truncate threshold values to increase readability (optional)
         --threshold num         Stop the tree at a given depth to keep it from becoming too large (optional)
         -h (--help)             Print this usage information 

An example (shows the first 10 nodes of the spambase ADTree) is

./atree2dot2ps.pl  --info spambase.info --tree spambase.output.tree --threshold 10 

which produces an image of the spambase ADTree.

Boosting doesn't require many parameters; in fact, it only really requires one: the number of iterations. Specifying this number is non-trivial, but there is rough consensus on a few ways to determine an appropriate value:

  • When test error asymptotes or begins to increase, the number of iterations is sufficient.
  • When the margin distribution "converges"

These two conditions, combined with cross-validation experiments (and boosting's proclivity not to overfit), are typically sufficient evidence for the choice of the number of iterations. There are two scripts: error.py and margin.py.

The error.py script can be run while JBoost is still executing; it reads the info file, which is created incrementally. Here is the usage for the program:

Usage: error.py    
         --info=info_file    info file as output by jboost
         --logaxis           should the axis be log-scaled (default: false)
         --bound             show the bound on training error

An example (1000 iterations on the spambase ADTree) is

./error.py --info=spambase.info --logaxis

which produces a plot of the spambase error curves.

margin.py can also be run while JBoost is still executing. It reads the given train/test file and the boosting.info file (the -a NUM switch must be used while boosting), which is created incrementally and flushed after every iteration. Here is the usage for the program:

Usage: margin.py    
        --boost-info=filename  margins file as output by jboost (-a -2 switch)
        --data=filename        train/test filename
        --spec=filename        spec filename
        --iteration=i,j,k,...  the iterations to inspect, no spaces between commas
        --sample               are we interested in a sample of false pos/neg

An example (spambase margins for 5,50,300,1000 iterations) is

./margin.py --boost-info=spambase.train.boosting.info --data=spambase.train \
            --spec=spambase.spec --iteration=5,50,300,1000 

which produces a plot of the spambase margin distributions.

Note that after 300 iterations, the margin distribution has mostly converged. This, in combination with the error curve, indicates that 300 iterations is likely to be sufficient. In both cases boosting does not overfit, so the additional iterations can only help; this is not always the case, though.

Cross Validation

Cross validation is a vital part of any machine learning experiment. It demonstrates that the classifier is robust and has not been overfitted to specific training/test data. While CV is not 100% conclusive, it certainly is an important piece of evidence for the robustness of a classifier.

The script nfold.py can be used to partition a dataset into k folds and run JBoost on each fold. The user gives the script a single dataset (i.e. not separate train/test files), and the script creates a directory containing k training files and k test files. The test files are pairwise disjoint, and the ith training file contains all the data points not found in the ith test file.
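The partition nfold.py produces can be sketched as follows (the actual script also handles file naming and may shuffle the data):

```python
def kfold_partition(examples, k):
    """Split examples into k pairwise-disjoint test folds; training
    fold i contains every example not in test fold i."""
    tests = [examples[i::k] for i in range(k)]
    trains = []
    for i in range(k):
        held_out = set(tests[i])
        trains.append([e for e in examples if e not in held_out])
    return trains, tests
```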

Here is the usage for the program:

Usage: nfold.py 
        --folds=N          create N files for N fold cross validation (required)
        --data=DATAFILE    DATAFILE contains the data to be divided into test/train sets (required)
        --spec=SPECFILE    SPECFILE is the same as used in JBoost typically (required)
        --rounds=N         The number of rounds to run the booster (required)
        --tree=TREETYPE    The type of tree nodes to use possible options are ADD_ROOT, 
                           ADD_ALL, ADD_SINGLES, ADD_ROOT_OR_SINGLES.  If unspecified, all
                           types will be executed sequentially
        --generate         Creates directories for the cross validation sets
        --dir=DIR          Specifies a name for the directory to be generated

An example (5-fold CV, spambase dataset, 500 iterations, 1 tree type, test/train files and directory is generated) is

cat spambase.test spambase.train > spambase.data
./nfold.py --folds=5 --data=spambase.data --spec=spambase.spec \
           --rounds=500 --tree=ADD_ALL --generate

This will create the directory cvdata-MONTH-DAY-HOUR-MINUTE-SECOND containing a directory ADD_ALL, files trialK.train and trialK.test where K ranges from 1 to N, and trial.spec. The ADD_ALL directory will contain all the runtime information for each trial. To edit other runtime parameters of JBoost (e.g. memory size, number of threads, amount of output via the -a switch), edit the learner function in nfold.py.

Dealing with Multiclass Problems

In version 2.0, JBoost stopped internally supporting multiclass problems. Instead, a Python script is now offered that splits a multiclass problem into many binary problems. It can be found in the scripts folder in the JBoost root directory.

First, it estimates via cross-validation the test error of a one-vs-all set of classifiers (the label of a test example is given by the one-vs-all classifier that assigns it the largest score). As a side product, a confusion matrix C is built, where C(i,j) is the number of test examples with label i that were classified as j. From this information, the script attempts to split the classes into two groups via spectral clustering on the symmetrized confusion matrix. A classifier that discriminates between these two groups is learned, and the process is repeated on the resulting two child nodes. This produces a tree of classifiers with singleton labels at the leaves. The test error of this classifier is also reported.
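The confusion matrix and its symmetrization can be sketched as follows (the spectral split itself then operates on S = C + C^T):

```python
def confusion_matrix(true_labels, predicted_labels, classes):
    """C[i][j] = number of test examples with label i classified as j."""
    index = {c: i for i, c in enumerate(classes)}
    k = len(classes)
    C = [[0] * k for _ in range(k)]
    for t, p in zip(true_labels, predicted_labels):
        C[index[t]][index[p]] += 1
    return C

def symmetrize(C):
    """The symmetrized matrix fed to spectral clustering."""
    k = len(C)
    return [[C[i][j] + C[j][i] for j in range(k)] for i in range(k)]
```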

The usage of the script is as follows. First, alter your spec file to include all the labels for your multiclass problem. Then call MultiClassHierarchical.py like the following:

./MultiClassHierarchical.py --S=./datasets/iris/iris --numRounds=10 --ATreeType=ADD_ROOT_OR_SINGLES

Here is an example output from the script on the iris UCI dataset:

*********New Node**********
 ['Iris-versicolor', 'Iris-virginica', 'Iris-setosa']
Confusion Matrix:
 46    4    0
  6   44    0
  0    0   50
 ['Iris-versicolor', 'Iris-virginica']
LLB: acc +/- std, (iters) = 0.000000 +/- 0.000000, (10)
ADA: acc +/- std, (iters) = 0.000000 +/- 0.000000, (10)
   *********New Node**********
    ['Iris-versicolor', 'Iris-virginica']
   Confusion Matrix:
    47    3
     4   46
   LLB: acc +/- std, (iters) = 0.070000 +/- 0.050990, (10)
   ADA: acc +/- std, (iters) = 0.070000 +/- 0.050990, (10)

LLB 1 vs All Classification Error:               0.0667
ADA 1 vs All Classification Error:               0.0667
LLB Spectral Cluster Tree Classification Error:  0.0467
ADA Spectral Cluster Tree Classification Error:  0.0467

Here the tree structure of classifiers is shown: the confusion matrix built from the one-vs-all classifiers, the two clusterings of labels, and the accuracy of the classifiers built by AdaBoost and LogLossBoost to separate the two clusters. Child nodes are indicated by indentation. Finally, the overall accuracy of each classifier on the cross-validated test sets is shown.

The full list of options is given by:

Usage: MultiClassHierarchical.py
	--S=<stem>           path to stem name -- expecting <stem>.data and <stem>.spec to be there
	--numRounds=N        number of rounds to run each classifier by
	--folds=N            number of folds to use in cross-validation. 
                             Default: 5
	--rb                 RobustBoost will be run. By default this is off.
                             WARNING: Can take a very long time.
	--ATreeType=TYPE     ADTree type to use (default: ADD_ROOT)

This script is just for testing purposes. Unfinished features include using the various output functions that JBoost offers (e.g. in Python or Matlab) and combining them into a single multiclass classifier.

Searching for Good RobustBoost Parameter Settings

We have developed a script that we recommend for searching for the optimal setting of parameters in RobustBoost. It can be found in the scripts/MultiClass folder from the JBoost root directory.

It first fixes sigma_f at 1.0, which in our experience is a good setting for most problems. Then, for each theta in {0,1,2,4,8,16,32,64}, it tries to find the smallest epsilon for which RobustBoost can still get to t=1.0 within the specified number of boosting iterations, using a binary search on epsilon for each theta.
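The search over epsilon can be sketched as a standard binary search (reaches_t1 below is a hypothetical stand-in for running RobustBoost and checking whether it reached t = 1.0):

```python
def smallest_feasible_epsilon(reaches_t1, lo=0.0, hi=1.0, tol=1e-3):
    """Binary search for the smallest epsilon for which RobustBoost
    still reaches t = 1.0.  reaches_t1(epsilon) must be monotone:
    False below some threshold, True at and above it."""
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if reaches_t1(mid):
            hi = mid
        else:
            lo = mid
    return hi
```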

The usage of the script for a binary problem is as follows:

./RobustBoostSearch.py --S=./datasets/breast-w/breast-w --numRounds=200 --ATreeType=ADD_ROOT_OR_SINGLES

Here is an example output from the script on the breast-w UCI dataset:

******FINAL RESULTS*****

RB: acc +/- std, (epsilon,theta,sigma_f) = 0.038680 +/- 0.024394, (0.00,2.00,1.00)

Shown are the best settings of epsilon, theta, and sigma_f, with the cross-validated test accuracy.

The full list of options is given by:

Usage: RobustBoostSearch.py
	--S=<stem>           path to stem name -- expecting <stem>.data and <stem>.spec to be there
	--numRounds=N        number of rounds to run each classifier by
	--folds=N            number of folds to use in cross-validation. 
                             Default: 5
	--ATreeType=TYPE     ADTree type to use (default: ADD_ROOT)

boosting.info Output Format

This is the file that is parsed by most of the visualization scripts. We provide this brief description of the fields so that people can write their own parsing scripts.

The file has as much information as specified by the -a switch (see JBoost usage). For each iteration that is output, there is a preamble line describing the parameters associated with the booster, followed by an enumeration of all examples with their ID, margin, score, weight, potential, and label. This list of fields may expand in the future.

The format is most easily understood via a demonstration:

iteration=5: elements=3681: boosting_params=None (AdaBoost):
0: -0.12695: -0.12695: 0.18725: 0.0: +1: 
1: 0.53953: -0.53953: 0.03180: 0.0: -1: 
. . .

There are two examples shown, separated by the system-dependent newline character(s). We go over the first example:

  • All fields are separated by a colon ':' character.
  • The example has an internal identifier of 0.
  • The margin is -0.12695. This is because it has a +1 label but its current score is -0.12695 (the margin is the label times the score).
  • The weight of the example is 0.18725.
  • The potential loss of the example is 0.0. This field is only used by BrownBoost, NormalBoost, and other potential-based boosting algorithms; the LogLossBoost and AdaBoost potential functions can be derived from the weights and are not stored internally.
  • The label is +1. This is the internal label given by JBoost, your label may be "-1", "Alpha Helix", etc.
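A minimal parser for a binary-case example line might look like the following (multiclass lines have comma-separated per-class fields; this sketch handles only the binary format shown above):

```python
def parse_margin_line(line):
    """Parse 'id: margin: score: weight: potential: label:' into a dict."""
    fields = [f.strip() for f in line.strip().split(":") if f.strip()]
    ident, margin, score, weight, potential, label = fields
    return {
        "id": int(ident),
        "margin": float(margin),
        "score": float(score),
        "weight": float(weight),
        "potential": float(potential),
        "label": label,
    }
```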

The above example is for binary classification tasks. For multiclass classification, JBoost currently uses a 1-vs-all approach: if there are 20 possible classes for an instance, then each instance is expanded into 20 instances of binary classification.

iteration=0: elements=6: boosting_params=BrownBoost r=1.4000 s=1.3927 : 
0: -0.174: -0.174,-0.174,-0.174: 0.173,0.346,0.173: 0.030,0.072,0.030: -1,+1,-1:
. . .

In the above example, we see that the ID is 0 and the margin is -0.174. However, each class now has its own score, weight, potential, and label, separated by commas. Also, since we used BrownBoost, we can now see the parameters passed to the booster. Note that the weight (and potential) are higher for the second class: the other two classes are correctly scored as "-1," whereas the second class has label "+1" in this example.
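The 1-vs-all expansion of a multiclass label into per-class binary labels can be sketched as:

```python
def expand_one_vs_all(label, classes):
    """One binary label per class: +1 for the true class, -1 otherwise,
    matching the comma-separated label field shown above."""
    return ["+1" if c == label else "-1" for c in classes]
```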

Links on Boosting

To learn more about general boosting techniques, there are a variety of sources; the Publications page is a good place to start.


This page last modified Thursday, 18-Jun-2009 03:10:28 UTC