public class BFTree extends RandomizableClassifier implements AdditionalMeasureProducer, TechnicalInformationHandler
Haijian Shi (2007). Best-first decision tree learning. Master's thesis (COMP594), University of Waikato, Hamilton, NZ.
Jerome Friedman, Trevor Hastie, Robert Tibshirani (2000). Additive logistic regression: a statistical view of boosting. Annals of Statistics. 28(2):337-407. ISSN 0090-5364.
Valid options are (see the configuration sketch after this list):
-S <num> Random number seed. (default 1)
-D If set, classifier is run in debug mode and may output additional info to the console.
-P <UNPRUNED|POSTPRUNED|PREPRUNED> The pruning strategy. (default: POSTPRUNED)
-M <min no> The minimal number of instances at the terminal nodes. (default 2)
-N <num folds> The number of folds used in the pruning. (default 5)
-H Don't use heuristic search for nominal attributes in multi-class problems (default: use it).
-G Don't use the Gini index for splitting (default: use it); if not, information gain is used.
-R Don't use the error rate in internal cross-validation (default: use it); root mean squared error is used instead.
-A Use the 1 SE rule to make the pruning decision. (default no)
-C Percentage of the training data to use, in (0, 1] (default 1).
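For illustration, a minimal sketch of setting these options programmatically, assuming the class is available as weka.classifiers.trees.BFTree (as in the Weka distribution this page documents); the ARFF path is hypothetical:

```java
import weka.classifiers.trees.BFTree;
import weka.core.Instances;
import weka.core.Utils;
import weka.core.converters.ConverterUtils.DataSource;

public class BFTreeOptionsDemo {
  public static void main(String[] args) throws Exception {
    // Load a dataset and use the last attribute as the class (path is hypothetical).
    Instances data = DataSource.read("/path/to/train.arff");
    data.setClassIndex(data.numAttributes() - 1);

    // Configure the tree via the option string documented above:
    // post-pruning, 5 folds, at least 2 instances per leaf, 1 SE rule.
    BFTree tree = new BFTree();
    tree.setOptions(Utils.splitOptions("-P POSTPRUNED -N 5 -M 2 -A"));

    tree.buildClassifier(data);
    System.out.println(tree);
  }
}
```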
Modifier and Type | Field and Description |
---|---|
protected Attribute | m_Attribute - Attribute used for splitting. |
protected Attribute | m_ClassAttribute - Class attribute of a dataset. |
protected double[] | m_ClassProbs - Class probabilities. |
protected double | m_ClassValue - Class value for a node. |
protected double[] | m_Distribution - Class distributions. |
protected double[][][] | m_Dists - Distributions of each attribute for the two successor nodes. |
protected static int | m_Expansion - Number of expansions. |
protected int | m_FixedExpansion - Fixed number of expansions (-1 if no pruning method is used). |
protected boolean | m_Heuristic - Whether to use heuristic search for binary splits (default true). |
protected boolean | m_isLeaf - Whether the node is a leaf node. |
protected int | m_minNumObj - Minimum number of instances at leaf nodes. |
protected int | m_numFoldsPruning - Number of folds for pruning. |
protected double[] | m_Props - Branch proportions. |
protected int | m_PruningStrategy - The pruning strategy. |
protected double | m_SizePer - The training data size (0-1). |
protected int[][] | m_SortedIndices - Sorted indices. |
protected String | m_SplitString - Split subset (for nominal attributes). |
protected double | m_SplitValue - Split point (for numeric attributes). |
protected BFTree[] | m_Successors - Successor nodes. |
protected double | m_TotalWeight - Total weights. |
protected boolean | m_UseErrorRate - Whether to use the error rate in internal cross-validation to fix the number of expansions (default); if not, root mean squared error is used. |
protected boolean | m_UseGini - Whether to use the Gini index as the splitting criterion (default); if not, information gain is used. |
protected boolean | m_UseOneSE - Whether to use the 1SE rule to make the pruning decision. |
protected double[][] | m_Weights - Sorted weights. |
static int | PRUNING_POSTPRUNING - Pruning strategy: post-pruning. |
static int | PRUNING_PREPRUNING - Pruning strategy: pre-pruning. |
static int | PRUNING_UNPRUNED - Pruning strategy: un-pruned. |
static Tag[] | TAGS_PRUNING - The pruning strategies. |
Fields inherited from class weka.classifiers.RandomizableClassifier: m_Seed
Fields inherited from class weka.classifiers.Classifier: m_Debug
Constructor and Description |
---|
BFTree() |
Modifier and Type | Method and Description |
---|---|
void | buildClassifier(Instances data) - Method for building a best-first decision tree classifier. |
protected double | computeEntropy(double[] dist, double total) - Compute and return the entropy for a given distribution of a node. |
protected double | computeGini(double[] dist, double total) - Compute and return the Gini index for a given distribution of a node. |
protected double | computeGiniGain(double[] parentDist, double[][] childDist) - Compute and return the Gini gain for given distributions of a node and its successor nodes. |
protected double | computeInfoGain(double[] parentDist, double[][] childDist) - Compute and return the information gain for given distributions of a node and its successor nodes. |
protected double | computeSortedInfo(Instances data, int[][] sortedIndices, double[][] weights, double[] classProbs) - Compute sorted indices, weights and class probabilities for a given dataset. |
protected FastVector | computeSplitInfo(BFTree node, Instances data, int[][] sortedIndices, double[][] weights, double[][][] dists, double[][] props, double[][] totalSubsetWeights, boolean useHeuristic, boolean useGini) - Compute the best splitting attribute, the split point or subset, and the best Gini gain or information gain for a given dataset. |
double[] | distributionForInstance(Instance instance) - Computes class probabilities for an instance using the decision tree. |
Enumeration | enumerateMeasures() - Return an enumeration of the measure names. |
Capabilities | getCapabilities() - Returns default capabilities of the classifier. |
boolean | getHeuristic() - Get whether heuristic search is used for nominal attributes in multi-class problems. |
double | getMeasure(String additionalMeasureName) - Returns the value of the named measure. |
int | getMinNumObj() - Get the minimal number of instances at the terminal nodes. |
int | getNumFoldsPruning() - Get the number of folds used in internal cross-validation. |
String[] | getOptions() - Gets the current settings of the classifier. |
SelectedTag | getPruningStrategy() - Gets the pruning strategy. |
String | getRevision() - Returns the revision string. |
double | getSizePer() - Get the training set size. |
TechnicalInformation | getTechnicalInformation() - Returns an instance of a TechnicalInformation object, containing detailed information about the technical background of this class, e.g., paper reference or book this class is based on. |
boolean | getUseErrorRate() - Get whether the error rate is used in internal cross-validation. |
boolean | getUseGini() - Get whether the Gini index is used as the splitting criterion. |
boolean | getUseOneSE() - Get whether the 1SE rule is used to choose the final model. |
String | globalInfo() - Returns a string describing the classifier. |
String | heuristicTipText() - Returns the tip text for this property. |
Enumeration | listOptions() - Returns an enumeration describing the available options. |
static void | main(String[] args) - Main method. |
protected void | makeLeaf(Instances data) - Make the node a leaf node. |
protected void | makeSuccessors(FastVector BestFirstElements, Instances data, int[][][] subsetSortedIndices, double[][][] subsetWeights, double[][][] dists, Attribute att, boolean useHeuristic, boolean useGini) - Generate successor nodes for a node and put them into BestFirstElements according to Gini gain or information gain, in descending order. |
protected void | makeTree(FastVector BestFirstElements, BFTree root, Instances train, Instances test, FastVector modelError, int[][] sortedIndices, double[][] weights, double[][][] dists, double[] classProbs, double totalWeight, double[] branchProps, int minNumObj, boolean useHeuristic, boolean useGini, boolean useErrorRate) - Finds the number of expansions based on internal cross-validation; used for post-pruning only. |
protected boolean | makeTree(FastVector BestFirstElements, BFTree root, Instances train, int[][] sortedIndices, double[][] weights, double[][][] dists, double[] classProbs, double totalWeight, double[] branchProps, int minNumObj, boolean useHeuristic, boolean useGini) - Finds the number of expansions based on internal cross-validation; used for pre-pruning only. |
protected void | makeTree(FastVector BestFirstElements, Instances data, int[][] sortedIndices, double[][] weights, double[][][] dists, double[] classProbs, double totalWeight, double[] branchProps, int minNumObj, boolean useHeuristic, boolean useGini, int preExpansion) - Recursively build a best-first decision tree. |
double | measureTreeSize() - Return the tree size. |
String | minNumObjTipText() - Returns the tip text for this property. |
protected String | nominalDistribution(double[][] props, double[][][] dists, Attribute att, int[] sortedIndices, double[] weights, double[][] subsetWeights, double[] gains, Instances data, boolean useHeuristic, boolean useGini) - Compute distributions, proportions and total weights of the two successor nodes for a given nominal attribute. |
protected double | numericDistribution(double[][] props, double[][][] dists, Attribute att, int[] sortedIndices, double[] weights, double[][] subsetWeights, double[] gains, Instances data, boolean useGini) - Compute distributions, proportions and total weights of the two successor nodes for a given numeric attribute. |
String | numFoldsPruningTipText() - Returns the tip text for this property. |
int | numLeaves() - Compute the number of leaf nodes. |
int | numNodes() - Compute the size of the tree. |
String | pruningStrategyTipText() - Returns the tip text for this property. |
void | setHeuristic(boolean value) - Set whether heuristic search is used for nominal attributes in multi-class problems. |
void | setMinNumObj(int value) - Set the minimal number of instances at the terminal nodes. |
void | setNumFoldsPruning(int value) - Set the number of folds used in internal cross-validation. |
void | setOptions(String[] options) - Parses the options for this object. |
void | setPruningStrategy(SelectedTag value) - Sets the pruning strategy. |
void | setSizePer(double value) - Set the training set size. |
void | setUseErrorRate(boolean value) - Set whether the error rate is used in internal cross-validation. |
void | setUseGini(boolean value) - Set whether the Gini index is used as the splitting criterion. |
void | setUseOneSE(boolean value) - Set whether the 1SE rule is used to choose the final model. |
String | sizePerTipText() - Returns the tip text for this property. |
protected void | splitData(int[][][] subsetIndices, double[][][] subsetWeights, Attribute att, double splitPoint, String splitStr, int[][] sortedIndices, double[][] weights, Instances data) - Split the data into two subsets and store the sorted indices and weights for the two successor nodes. |
String | toString() - Prints the decision tree using the protected toString(int) method below. |
protected String | toString(int level) - Outputs a tree at a certain level. |
String | useErrorRateTipText() - Returns the tip text for this property. |
String | useGiniTipText() - Returns the tip text for this property. |
String | useOneSETipText() - Returns the tip text for this property. |
Methods inherited from class weka.classifiers.RandomizableClassifier: getSeed, seedTipText, setSeed
Methods inherited from class weka.classifiers.Classifier: classifyInstance, debugTipText, forName, getDebug, makeCopies, makeCopy, runClassifier, setDebug
public static final int PRUNING_UNPRUNED
public static final int PRUNING_POSTPRUNING
public static final int PRUNING_PREPRUNING
public static final Tag[] TAGS_PRUNING
protected int m_PruningStrategy
protected BFTree[] m_Successors
protected Attribute m_Attribute
protected double m_SplitValue
protected String m_SplitString
protected double m_ClassValue
protected Attribute m_ClassAttribute
protected int m_minNumObj
protected int m_numFoldsPruning
protected boolean m_isLeaf
protected static int m_Expansion
protected int m_FixedExpansion
protected boolean m_Heuristic
protected boolean m_UseGini
protected boolean m_UseErrorRate
protected boolean m_UseOneSE
protected double[] m_Distribution
protected double[] m_Props
protected int[][] m_SortedIndices
protected double[][] m_Weights
protected double[][][] m_Dists
protected double[] m_ClassProbs
protected double m_TotalWeight
protected double m_SizePer
public String globalInfo()
public TechnicalInformation getTechnicalInformation()
getTechnicalInformation in interface TechnicalInformationHandler

public Capabilities getCapabilities()
getCapabilities in interface CapabilitiesHandler
getCapabilities in class Classifier
See Also: Capabilities
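For illustration, a minimal sketch of checking whether a dataset satisfies these capabilities before training; testWithFail is part of the standard weka.core.Capabilities API, and the dataset path is hypothetical:

```java
import weka.classifiers.trees.BFTree;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class BFTreeCapabilitiesDemo {
  public static void main(String[] args) throws Exception {
    Instances data = DataSource.read("/path/to/train.arff"); // hypothetical path
    data.setClassIndex(data.numAttributes() - 1);

    BFTree tree = new BFTree();
    // Throws an exception describing the problem if the data
    // (e.g. a numeric class) is not supported by the classifier.
    tree.getCapabilities().testWithFail(data);

    tree.buildClassifier(data);
  }
}
```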
public void buildClassifier(Instances data) throws Exception
buildClassifier in class Classifier
Parameters:
data - set of instances serving as training data
Throws:
Exception - if decision tree cannot be built successfully
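As a usage illustration (not part of the API itself), a minimal sketch that trains the classifier and estimates its accuracy with 10-fold cross-validation via weka.classifiers.Evaluation; the dataset path is hypothetical:

```java
import java.util.Random;

import weka.classifiers.Evaluation;
import weka.classifiers.trees.BFTree;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class BFTreeBuildDemo {
  public static void main(String[] args) throws Exception {
    Instances data = DataSource.read("/path/to/train.arff"); // hypothetical path
    data.setClassIndex(data.numAttributes() - 1);

    BFTree tree = new BFTree();
    tree.buildClassifier(data); // grows the best-first tree

    // Estimate performance with 10-fold cross-validation.
    Evaluation eval = new Evaluation(data);
    eval.crossValidateModel(new BFTree(), data, 10, new Random(1));
    System.out.println(eval.toSummaryString());
    System.out.println("Leaves: " + tree.numLeaves() + ", nodes: " + tree.numNodes());
  }
}
```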
protected void makeTree(FastVector BestFirstElements, Instances data, int[][] sortedIndices, double[][] weights, double[][][] dists, double[] classProbs, double totalWeight, double[] branchProps, int minNumObj, boolean useHeuristic, boolean useGini, int preExpansion) throws Exception
Parameters:
BestFirstElements - list to store BFTree nodes
data - training data
sortedIndices - sorted indices of the instances
weights - weights of the instances
dists - class distributions for each attribute
classProbs - class probabilities of this node
totalWeight - total weight of this node (note: if the node cannot split, this value is not calculated)
branchProps - proportions of the two subbranches
minNumObj - minimal number of instances at leaf nodes
useHeuristic - whether to use heuristic search for nominal attributes in multi-class problems
useGini - whether to use the Gini index as the splitting criterion
preExpansion - the number of expansions the tree is to be expanded to
Throws:
Exception - if something goes wrong

protected boolean makeTree(FastVector BestFirstElements, BFTree root, Instances train, int[][] sortedIndices, double[][] weights, double[][][] dists, double[] classProbs, double totalWeight, double[] branchProps, int minNumObj, boolean useHeuristic, boolean useGini) throws Exception
Parameters:
BestFirstElements - list to store BFTree nodes
root - root node of the tree in each fold
train - training data
sortedIndices - sorted indices of the instances
weights - weights of the instances
dists - class distributions for each attribute
classProbs - class probabilities of this node
totalWeight - total weight of this node (note: if the node cannot split, this value is not calculated)
branchProps - proportions of the two subbranches
minNumObj - minimal number of instances at leaf nodes
useHeuristic - whether to use heuristic search for nominal attributes in multi-class problems
useGini - whether to use the Gini index as the splitting criterion
Throws:
Exception - if something goes wrong

protected void makeTree(FastVector BestFirstElements, BFTree root, Instances train, Instances test, FastVector modelError, int[][] sortedIndices, double[][] weights, double[][][] dists, double[] classProbs, double totalWeight, double[] branchProps, int minNumObj, boolean useHeuristic, boolean useGini, boolean useErrorRate) throws Exception
Parameters:
BestFirstElements - list to store BFTree nodes
root - root node of the tree in each fold
train - training data in each fold
test - test data in each fold
modelError - list to store the error for each expansion in each fold
sortedIndices - sorted indices of the instances
weights - weights of the instances
dists - class distributions for each attribute
classProbs - class probabilities of this node
totalWeight - total weight of this node (note: if the node cannot split, this value is not calculated)
branchProps - proportions of the two subbranches
minNumObj - minimal number of instances at leaf nodes
useHeuristic - whether to use heuristic search for nominal attributes in multi-class problems
useGini - whether to use the Gini index as the splitting criterion
useErrorRate - whether to use the error rate in internal cross-validation
Throws:
Exception - if something goes wrong
protected void makeSuccessors(FastVector BestFirstElements, Instances data, int[][][] subsetSortedIndices, double[][][] subsetWeights, double[][][] dists, Attribute att, boolean useHeuristic, boolean useGini) throws Exception
Parameters:
BestFirstElements - list to store BestFirst nodes
data - training instances
subsetSortedIndices - sorted indices of the instances of the successor nodes
subsetWeights - weights of the instances of the successor nodes
dists - class distributions of the successor nodes
att - attribute used to split the node
useHeuristic - whether to use heuristic search for nominal attributes in multi-class problems
useGini - whether to use the Gini index as the splitting criterion
Throws:
Exception - if something goes wrong

protected double computeSortedInfo(Instances data, int[][] sortedIndices, double[][] weights, double[] classProbs) throws Exception
Parameters:
data - training data
sortedIndices - sorted indices of the instances at the node
weights - weights of the instances at the node
classProbs - class probabilities at the node
Throws:
Exception - if something goes wrong

protected FastVector computeSplitInfo(BFTree node, Instances data, int[][] sortedIndices, double[][] weights, double[][][] dists, double[][] props, double[][] totalSubsetWeights, boolean useHeuristic, boolean useGini) throws Exception
Parameters:
node - the node to be split
data - training data
sortedIndices - sorted indices of the instances
weights - weights of the instances
dists - class distributions for each attribute
props - proportions of the two branches
totalSubsetWeights - total weights of the two subsets
useHeuristic - whether to use heuristic search for nominal attributes in multi-class problems
useGini - whether to use the Gini index as the splitting criterion
Throws:
Exception - if something is wrong

protected double numericDistribution(double[][] props, double[][][] dists, Attribute att, int[] sortedIndices, double[] weights, double[][] subsetWeights, double[] gains, Instances data, boolean useGini) throws Exception
Parameters:
props - proportions of the two branches for each attribute
dists - class distributions of the two branches for each attribute
att - the numeric attribute to split on
sortedIndices - sorted indices of the instances for the attribute
weights - weights of the instances for the attribute
subsetWeights - total weights of the two branches split based on the attribute
gains - Gini gains or information gains for each attribute
data - training instances
useGini - whether to use the Gini index as the splitting criterion
Throws:
Exception - if something goes wrong

protected String nominalDistribution(double[][] props, double[][][] dists, Attribute att, int[] sortedIndices, double[] weights, double[][] subsetWeights, double[] gains, Instances data, boolean useHeuristic, boolean useGini) throws Exception
Parameters:
props - proportions of the two branches for each attribute
dists - class distributions of the two branches for each attribute
att - the nominal attribute to split on
sortedIndices - sorted indices of the instances for the attribute
weights - weights of the instances for the attribute
subsetWeights - total weights of the two branches split based on the attribute
gains - Gini gains for each attribute
data - training instances
useHeuristic - whether to use heuristic search
useGini - whether to use the Gini index as the splitting criterion
Throws:
Exception - if something goes wrong
protected void splitData(int[][][] subsetIndices, double[][][] subsetWeights, Attribute att, double splitPoint, String splitStr, int[][] sortedIndices, double[][] weights, Instances data) throws Exception
Parameters:
subsetIndices - sorted indices of the instances for each attribute for the two successor nodes
subsetWeights - weights of the instances for each attribute for the two successor nodes
att - the attribute the split is based on
splitPoint - the split point the split is based on, if att is numeric
splitStr - the split subset the split is based on, if att is nominal
sortedIndices - sorted indices of the instances to be split
weights - weights of the instances to be split
data - training data
Throws:
Exception - if something goes wrong

protected double computeGiniGain(double[] parentDist, double[][] childDist)
Parameters:
parentDist - class distributions of the parent node
childDist - class distributions of the successor nodes

protected double computeGini(double[] dist, double total)
Parameters:
dist - class distributions
total - the total of the class distribution

protected double computeInfoGain(double[] parentDist, double[][] childDist)
Parameters:
parentDist - class distributions of the parent node
childDist - class distributions of the successor nodes

protected double computeEntropy(double[] dist, double total)
Parameters:
dist - class distributions
total - the total of the class distribution
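For reference, a small sketch of the standard (unweighted) forms of these impurity measures and of a binary-split gain. This illustrates the formulas only; the exact Weka implementation may scale or weight the terms differently:

```java
// Illustrative impurity computations over a class-count distribution.
// dist[i] is the (weighted) count of class i at a node; total is their sum.
public final class ImpuritySketch {

  // Gini index: 1 - sum_i p_i^2, with p_i = dist[i] / total.
  public static double gini(double[] dist, double total) {
    if (total <= 0) return 0.0;
    double sumSquaredP = 0.0;
    for (double d : dist) {
      double p = d / total;
      sumSquaredP += p * p;
    }
    return 1.0 - sumSquaredP;
  }

  // Entropy: -sum_i p_i * log2(p_i), skipping empty classes.
  public static double entropy(double[] dist, double total) {
    if (total <= 0) return 0.0;
    double e = 0.0;
    for (double d : dist) {
      if (d > 0) {
        double p = d / total;
        e -= p * (Math.log(p) / Math.log(2));
      }
    }
    return e;
  }

  // Gini gain of a binary split: parent impurity minus the weighted
  // average impurity of the two children.
  public static double giniGain(double[] parentDist, double[][] childDist) {
    double parentTotal = sum(parentDist);
    double leftTotal = sum(childDist[0]);
    double rightTotal = sum(childDist[1]);
    double weighted = (leftTotal / parentTotal) * gini(childDist[0], leftTotal)
        + (rightTotal / parentTotal) * gini(childDist[1], rightTotal);
    return gini(parentDist, parentTotal) - weighted;
  }

  private static double sum(double[] a) {
    double s = 0.0;
    for (double v : a) s += v;
    return s;
  }
}
```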
protected void makeLeaf(Instances data)
Parameters:
data - training data

public double[] distributionForInstance(Instance instance) throws Exception
distributionForInstance in class Classifier
Parameters:
instance - the instance for which class probabilities are to be computed
Throws:
Exception - if something goes wrong
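A brief sketch of querying class probabilities for a single instance after training; the dataset path is hypothetical:

```java
import weka.classifiers.trees.BFTree;
import weka.core.Instance;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class BFTreeDistributionDemo {
  public static void main(String[] args) throws Exception {
    Instances data = DataSource.read("/path/to/train.arff"); // hypothetical path
    data.setClassIndex(data.numAttributes() - 1);

    BFTree tree = new BFTree();
    tree.buildClassifier(data);

    // Probability estimate for each class value of the first instance.
    Instance inst = data.instance(0);
    double[] dist = tree.distributionForInstance(inst);
    for (int i = 0; i < dist.length; i++) {
      System.out.println(data.classAttribute().value(i) + ": " + dist[i]);
    }
  }
}
```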
public String toString()

protected String toString(int level)
Parameters:
level - the level at which the tree is to be printed

public int numNodes()

public int numLeaves()

public Enumeration listOptions()
listOptions in interface OptionHandler
listOptions in class RandomizableClassifier

public void setOptions(String[] options) throws Exception
Valid options are:
-S <num> Random number seed. (default 1)
-D If set, classifier is run in debug mode and may output additional info to the console.
-P <UNPRUNED|POSTPRUNED|PREPRUNED> The pruning strategy. (default: POSTPRUNED)
-M <min no> The minimal number of instances at the terminal nodes. (default 2)
-N <num folds> The number of folds used in the pruning. (default 5)
-H Don't use heuristic search for nominal attributes in multi-class problems (default: use it).
-G Don't use the Gini index for splitting (default: use it); if not, information gain is used.
-R Don't use the error rate in internal cross-validation (default: use it); root mean squared error is used instead.
-A Use the 1 SE rule to make the pruning decision. (default no)
-C Percentage of the training data to use, in (0, 1] (default 1).
setOptions in interface OptionHandler
setOptions in class RandomizableClassifier
Parameters:
options - the options to use
Throws:
Exception - if setting of options fails
public String[] getOptions()
getOptions in interface OptionHandler
getOptions in class RandomizableClassifier
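As an aside, the current settings can be rendered back into a command-line style option string; a small sketch, assuming weka.core.Utils is available:

```java
import weka.classifiers.trees.BFTree;
import weka.core.Utils;

public class BFTreeGetOptionsDemo {
  public static void main(String[] args) {
    BFTree tree = new BFTree();
    tree.setMinNumObj(5);
    tree.setUseOneSE(true);

    // Round-trip the configuration: getOptions() returns the flags
    // documented above, joinOptions() turns them into a single string.
    String[] opts = tree.getOptions();
    System.out.println(Utils.joinOptions(opts));
  }
}
```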
public Enumeration enumerateMeasures()
enumerateMeasures in interface AdditionalMeasureProducer

public double measureTreeSize()

public double getMeasure(String additionalMeasureName)
getMeasure in interface AdditionalMeasureProducer
Parameters:
additionalMeasureName - the name of the measure to query for its value
Throws:
IllegalArgumentException - if the named measure is not supported
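A short sketch of listing the additional measures exposed by this interface and querying each one after training; the dataset path is hypothetical:

```java
import java.util.Enumeration;

import weka.classifiers.trees.BFTree;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class BFTreeMeasuresDemo {
  public static void main(String[] args) throws Exception {
    Instances data = DataSource.read("/path/to/train.arff"); // hypothetical path
    data.setClassIndex(data.numAttributes() - 1);

    BFTree tree = new BFTree();
    tree.buildClassifier(data);

    // Enumerate every additional measure the classifier exposes and query its value.
    Enumeration names = tree.enumerateMeasures();
    while (names.hasMoreElements()) {
      String name = (String) names.nextElement();
      System.out.println(name + " = " + tree.getMeasure(name));
    }
  }
}
```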
public String pruningStrategyTipText()

public void setPruningStrategy(SelectedTag value)
Parameters:
value - the strategy

public SelectedTag getPruningStrategy()
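To illustrate, a minimal sketch of configuring the tree through its setters rather than option strings, using the TAGS_PRUNING constants documented above:

```java
import weka.classifiers.trees.BFTree;
import weka.core.SelectedTag;

public class BFTreeSettersDemo {
  public static void main(String[] args) {
    BFTree tree = new BFTree();

    // Pre-pruning with 10 internal folds, at least 5 instances per leaf.
    tree.setPruningStrategy(new SelectedTag(BFTree.PRUNING_PREPRUNING, BFTree.TAGS_PRUNING));
    tree.setNumFoldsPruning(10);
    tree.setMinNumObj(5);

    // Split on information gain instead of the Gini index,
    // and apply the 1 SE rule when choosing the final model.
    tree.setUseGini(false);
    tree.setUseOneSE(true);

    System.out.println(String.join(" ", tree.getOptions()));
  }
}
```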
public String minNumObjTipText()

public void setMinNumObj(int value)
Parameters:
value - minimal number of instances at the terminal nodes

public int getMinNumObj()

public String numFoldsPruningTipText()

public void setNumFoldsPruning(int value)
Parameters:
value - the number of folds

public int getNumFoldsPruning()

public String heuristicTipText()

public void setHeuristic(boolean value)
Parameters:
value - whether to use heuristic search for nominal attributes in multi-class problems

public boolean getHeuristic()

public String useGiniTipText()

public void setUseGini(boolean value)
Parameters:
value - whether to use the Gini index as the splitting criterion

public boolean getUseGini()

public String useErrorRateTipText()

public void setUseErrorRate(boolean value)
Parameters:
value - whether to use the error rate in internal cross-validation

public boolean getUseErrorRate()

public String useOneSETipText()

public void setUseOneSE(boolean value)
Parameters:
value - whether to use the 1SE rule to choose the final model

public boolean getUseOneSE()

public String sizePerTipText()

public void setSizePer(double value)
Parameters:
value - the training set size

public double getSizePer()
public String getRevision()
getRevision in interface RevisionHandler
getRevision in class Classifier

public static void main(String[] args)
Parameters:
args - the options for the classifier

Copyright © 2015 University of Waikato, Hamilton, NZ. All rights reserved.