Class DatasetSplitter


  • public class DatasetSplitter
    extends java.lang.Object
    Utility class for creating training / test / cross validation indexes from the original index.
    • Constructor Summary

      Constructors 
      Constructor Description
      DatasetSplitter​(double testRatio, double crossValidationRatio)
      Creates a DatasetSplitter from the ratios of the original index to use for the test and cross validation indexes
    • Method Summary

      All Methods Instance Methods Concrete Methods 
      Modifier and Type Method Description
      private Document createNewDoc​(IndexReader originalIndex, FieldType ft, ScoreDoc scoreDoc, java.lang.String[] fieldNames)  
      void split​(IndexReader originalIndex, Directory trainingIndex, Directory testIndex, Directory crossValidationIndex, Analyzer analyzer, boolean termVectors, java.lang.String classFieldName, java.lang.String... fieldNames)
      Split a given index into 3 indexes for training, test and cross validation tasks respectively
      • Methods inherited from class java.lang.Object

        clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
    • Field Detail

      • crossValidationRatio

        private final double crossValidationRatio
      • testRatio

        private final double testRatio
    • Constructor Detail

      • DatasetSplitter

        public DatasetSplitter​(double testRatio,
                               double crossValidationRatio)
        Creates a DatasetSplitter from the ratios of the original index to use for the test and cross validation indexes
        Parameters:
        testRatio - the ratio of the original index to be used for the test index, as a double between 0.0 and 1.0
        crossValidationRatio - the ratio of the original index to be used for the cross validation index, as a double between 0.0 and 1.0
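        The arithmetic behind the two ratios can be illustrated with a minimal sketch. It assumes the simplest reading of the parameters: each ratio is the fraction of documents that goes to the corresponding index, and whatever is left goes to the training index. SplitSizes and its of method are hypothetical names used only for illustration; they are not part of Lucene, and the actual rounding and validation inside DatasetSplitter may differ.

```java
/** Hypothetical helper (not part of Lucene): turns the two ratios into document counts. */
final class SplitSizes {
    final int training;
    final int test;
    final int crossValidation;

    private SplitSizes(int training, int test, int crossValidation) {
        this.training = training;
        this.test = test;
        this.crossValidation = crossValidation;
    }

    /**
     * Assumption: test and cross validation each take floor(totalDocs * ratio)
     * documents, and the remainder is used for training.
     */
    static SplitSizes of(int totalDocs, double testRatio, double crossValidationRatio) {
        if (totalDocs < 0) {
            throw new IllegalArgumentException("totalDocs must not be negative");
        }
        // The negated form also rejects NaN.
        if (!(testRatio >= 0.0 && testRatio <= 1.0)
                || !(crossValidationRatio >= 0.0 && crossValidationRatio <= 1.0)) {
            throw new IllegalArgumentException("ratios must be between 0.0 and 1.0");
        }
        if (testRatio + crossValidationRatio > 1.0) {
            throw new IllegalArgumentException("ratios must sum to at most 1.0");
        }
        int test = (int) Math.floor(totalDocs * testRatio);
        int crossValidation = (int) Math.floor(totalDocs * crossValidationRatio);
        return new SplitSizes(totalDocs - test - crossValidation, test, crossValidation);
    }
}
```

        Under these assumptions, SplitSizes.of(200, 0.25, 0.25) yields 50 test documents, 50 cross validation documents and 100 training documents.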
    • Method Detail

      • split

        public void split​(IndexReader originalIndex,
                          Directory trainingIndex,
                          Directory testIndex,
                          Directory crossValidationIndex,
                          Analyzer analyzer,
                          boolean termVectors,
                          java.lang.String classFieldName,
                          java.lang.String... fieldNames)
                   throws java.io.IOException
        Split a given index into 3 indexes for training, test and cross validation tasks respectively
        Parameters:
        originalIndex - an IndexReader on the source index
        trainingIndex - a Directory used to write the training index
        testIndex - a Directory used to write the test index
        crossValidationIndex - a Directory used to write the cross validation index
        analyzer - Analyzer used to create the new docs
        termVectors - true if term vectors should be kept
        classFieldName - name of the field used as the label for classification; this must be indexed with sorted doc values
        fieldNames - names of fields that need to be put in the new indexes or null if all should be used
        Throws:
        java.io.IOException - if any writing operation fails on any of the indexes
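        The three-way split can be modelled in memory without any index, as a rough sketch. It assumes that documents are grouped by their class label and that the ratios are applied within each group, so every output keeps roughly the original class proportions. This page does not state that; the assumption is only suggested by the requirement that classFieldName be indexed with sorted doc values. ThreeWaySplit is a hypothetical class, document ids stand in for documents, and none of this is Lucene's implementation.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical in-memory model of a three-way split; not Lucene code. */
final class ThreeWaySplit {
    final List<Integer> training = new ArrayList<>();
    final List<Integer> test = new ArrayList<>();
    final List<Integer> crossValidation = new ArrayList<>();

    /**
     * Splits document ids 0..labels.length-1, where labels[i] is the class
     * label of document i. Assumption: ratios are applied per class label.
     */
    static ThreeWaySplit split(String[] labels, double testRatio, double crossValidationRatio) {
        if (!(testRatio >= 0.0) || !(crossValidationRatio >= 0.0)
                || testRatio + crossValidationRatio > 1.0) {
            throw new IllegalArgumentException("ratios must be >= 0.0 and sum to at most 1.0");
        }
        // Group document ids by class label, keeping first-seen label order.
        Map<String, List<Integer>> byLabel = new LinkedHashMap<>();
        for (int docId = 0; docId < labels.length; docId++) {
            byLabel.computeIfAbsent(labels[docId], k -> new ArrayList<>()).add(docId);
        }
        ThreeWaySplit out = new ThreeWaySplit();
        for (List<Integer> group : byLabel.values()) {
            int testCount = (int) Math.floor(group.size() * testRatio);
            int cvCount = (int) Math.floor(group.size() * crossValidationRatio);
            for (int i = 0; i < group.size(); i++) {
                int docId = group.get(i);
                if (i < testCount) {
                    out.test.add(docId);             // first slice of the group
                } else if (i < testCount + cvCount) {
                    out.crossValidation.add(docId);  // second slice of the group
                } else {
                    out.training.add(docId);         // everything that remains
                }
            }
        }
        return out;
    }
}
```

        In this model, eight documents with four "a" labels followed by four "b" labels and both ratios set to 0.25 give one test and one cross validation document per class, with the remaining two documents of each class used for training.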
      • createNewDoc

        private Document createNewDoc​(IndexReader originalIndex,
                                      FieldType ft,
                                      ScoreDoc scoreDoc,
                                      java.lang.String[] fieldNames)
                               throws java.io.IOException
        Throws:
        java.io.IOException