Machine learning classifiers can be very sensitive to the data set and to
parameter settings. The boosting methods in JBoost have proven robust
compared with many other methods, but even JBoost has limitations. To
minimize the effect of these limitations, we have compiled a set of useful
tips.
- Perform Cross Validation. The script nfold.py has been provided for this
purpose. It will give you a better idea of the generalization error of the
classifier on your dataset. A minimal sketch of the underlying idea appears
after this list.
- Boost until test error asymptotes. It has been shown empirically and
theoretically that boosting for more rounds is frequently better and does not
result in overfitting. This is not universally true, so use the script
error.py to see how many rounds seem appropriate. Note that you can do this
while the program is running. To learn more, see the error visualization
tools. A sketch of monitoring test error by round appears after this list.
- Look at the margin distribution. A script has been provided for visualizing
margin distributions. See visualizing the margin for more details, and the
margin sketch after this list.
- Consider asymmetric cost. If your dataset resembles needles in a haystack,
consider using an asymmetric cost. That is, if it is acceptable to pick up a
lot of hay in order to find a single needle, then be sure to tell JBoost this
fact. More sophisticated methods are currently under development. A sketch of
the weighting idea appears after this list.
- Margin is more important than classification. The margin can be considered
the amount of confidence the boosting algorithm has in a prediction. Not all
classifications are made with equal confidence, and the user can take
advantage of this knowledge.
- Use BrownBoost for noisy data. If your dataset has known outliers or the
generalization error of AdaBoost is poor, then BrownBoost may be able to give
better results. For documentation, see the BrownBoost algorithm
and using BrownBoost in JBoost.
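
The nfold.py script handles cross validation for JBoost directly; its exact
arguments are not repeated here. As a rough illustration of the idea behind
it, the sketch below runs 5-fold cross validation with scikit-learn's
AdaBoostClassifier standing in for JBoost. The synthetic dataset, the fold
count, and the number of rounds are all assumptions made for the example.

```python
# Sketch: estimating generalization error with k-fold cross validation.
# AdaBoostClassifier stands in for JBoost; substitute your own data and learner.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
clf = AdaBoostClassifier(n_estimators=100, random_state=0)

# 5-fold CV: train on 4/5 of the data, test on the held-out 1/5,
# rotating so every example is held out exactly once.
scores = cross_val_score(clf, X, y, cv=5)
print("per-fold accuracy:", scores)
print("estimated generalization accuracy: %.3f +/- %.3f"
      % (scores.mean(), scores.std()))
```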
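
error.py visualizes JBoost's own round-by-round output; as a stand-in, the
sketch below tracks test error after each boosting round using scikit-learn's
staged predictions, which is the quantity you would watch for an asymptote
before choosing a round count. The train/test split and the 300-round cap are
assumptions for the example.

```python
# Sketch: watch test error as a function of boosting rounds and stop adding
# rounds once the curve flattens out.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

clf = AdaBoostClassifier(n_estimators=300, random_state=0).fit(X_train, y_train)

# staged_predict yields the ensemble's predictions after each boosting round.
for rounds, y_pred in enumerate(clf.staged_predict(X_test), start=1):
    if rounds % 50 == 0:
        print("rounds=%3d  test error=%.4f" % (rounds, (y_pred != y_test).mean()))
```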
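
For the two margin-related tips above, the sketch below shows what a margin
is in concrete terms: the real-valued boosted score f(x) multiplied by the
true label y in {-1, +1}. Large positive margins are confident, correct
predictions; margins near zero are low-confidence and can be handled
separately. The learner, the dataset, and the 0.05 confidence threshold are
assumptions for illustration, not JBoost's own output format.

```python
# Sketch: margin = y * f(x), where f(x) is the boosted real-valued score.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

clf = AdaBoostClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)

scores = clf.decision_function(X_test)             # real-valued f(x)
margins = np.where(y_test == 1, scores, -scores)   # y * f(x), mapping y to {-1, +1}

print("fraction of negative margins (errors): %.3f" % (margins < 0).mean())
print("margin quantiles:", np.percentile(margins, [5, 25, 50, 75, 95]))

# Small |f(x)| means low confidence; treat those predictions differently
# instead of forcing a hard label on them.
low_confidence = np.abs(scores) < 0.05             # assumed threshold
print("low-confidence fraction: %.3f" % low_confidence.mean())
```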
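
How to specify asymmetric costs to JBoost itself is covered in its own
documentation; as a generic illustration of the idea, the sketch below
up-weights the rare "needle" class during training so that missing a needle
costs more than collecting extra hay. The 10:1 cost ratio and the class
imbalance are assumed values for the example.

```python
# Sketch: asymmetric cost via per-example weights -- missing a rare positive
# ("needle") is penalized 10x more than a false alarm on the common class.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

# Imbalanced data: roughly 5% needles, 95% hay.
X, y = make_classification(n_samples=2000, n_features=20,
                           weights=[0.95, 0.05], random_state=0)

needle_cost = 10.0                                   # assumed cost ratio
sample_weight = np.where(y == 1, needle_cost, 1.0)

clf = AdaBoostClassifier(n_estimators=100, random_state=0)
clf.fit(X, y, sample_weight=sample_weight)

pred = clf.predict(X)
print("recall on the needle class: %.3f" % (pred[y == 1] == 1).mean())
```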