Class TextExtractor

    • Constructor Detail

      • TextExtractor

        public TextExtractor​(Segment segment)
        Constructs a new TextExtractor based on the specified Segment.
        Parameters:
        segment - the segment from which the text will be extracted.
        See Also:
        Segment.getTextExtractor()
    • Method Detail

      • writeTo

        public void writeTo​(java.io.Writer writer)
                     throws java.io.IOException
        Description copied from interface: CharStreamSource
        Writes the output to the specified Writer.
        Specified by:
        writeTo in interface CharStreamSource
        Parameters:
        writer - the destination java.io.Writer for the output.
        Throws:
        java.io.IOException - if an I/O exception occurs.
      • appendTo

        public void appendTo​(java.lang.Appendable appendable)
                      throws java.io.IOException
        Description copied from interface: CharStreamSource
        Appends the output to the specified Appendable object.
        Specified by:
        appendTo in interface CharStreamSource
        Parameters:
        appendable - the destination java.lang.Appendable object for the output.
        Throws:
        java.io.IOException - if an I/O exception occurs.
      • getEstimatedMaximumOutputLength

        public long getEstimatedMaximumOutputLength()
        Description copied from interface: CharStreamSource
        Returns the estimated maximum number of characters in the output, or -1 if no estimate is available.

        The returned value should be used as a guide for efficiency purposes only, for example to set an initial StringBuilder capacity. There is no guarantee that the length of the output is indeed less than this value, as classes implementing this method often use assumptions based on typical usage to calculate the estimate.

        Although implementations of this method should never return a value less than -1, users of this method must not assume that this will always be the case. Standard practice is to interpret any negative value as meaning that no estimate is available.

        Specified by:
        getEstimatedMaximumOutputLength in interface CharStreamSource
        Returns:
        the estimated maximum number of characters in the output, or -1 if no estimate is available.
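        The guidance above can be sketched as follows. This is a minimal illustration, not library code: OutputSource is a hypothetical stand-in that mirrors only the appendTo and getEstimatedMaximumOutputLength methods described here, and the DEFAULT_CAPACITY and MAX_INITIAL_CAPACITY values are assumptions chosen for the example.

```java
import java.io.IOException;

// Hypothetical stand-in mirroring two of the documented CharStreamSource methods.
// It is not the library's interface.
interface OutputSource {
    void appendTo(Appendable appendable) throws IOException;
    long getEstimatedMaximumOutputLength();
}

class EstimatedCapacityDemo {
    static final int DEFAULT_CAPACITY = 16;           // assumed fallback when no estimate is available
    static final int MAX_INITIAL_CAPACITY = 1 << 20;  // assumed cap to avoid huge up-front allocations

    // Treats every negative estimate as "no estimate available" and caps large estimates.
    static int initialCapacity(long estimate) {
        if (estimate < 0) return DEFAULT_CAPACITY;
        return (int) Math.min(estimate, (long) MAX_INITIAL_CAPACITY);
    }

    static String render(OutputSource source) throws IOException {
        StringBuilder sb = new StringBuilder(initialCapacity(source.getEstimatedMaximumOutputLength()));
        source.appendTo(sb); // the builder still grows if the estimate turns out to be too low
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        OutputSource source = new OutputSource() {
            public void appendTo(Appendable appendable) throws IOException {
                appendable.append("Hello world");
            }
            public long getEstimatedMaximumOutputLength() {
                return -1; // no estimate available
            }
        };
        System.out.println(render(source));
    }
}
```

        Because the estimate is only a guide, the sketch uses it solely to size the buffer and never to truncate or validate the output.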
      • toString

        public java.lang.String toString()
        Description copied from interface: CharStreamSource
        Returns the output as a string.
        Specified by:
        toString in interface CharStreamSource
        Overrides:
        toString in class java.lang.Object
        Returns:
        the output as a string.
      • setConvertNonBreakingSpaces

        public TextExtractor setConvertNonBreakingSpaces​(boolean convertNonBreakingSpaces)
        Sets whether non-breaking space (&nbsp;) character entity references are converted to spaces.

        The default value is that of the static Config.ConvertNonBreakingSpaces property at the time the TextExtractor is instantiated.

        Parameters:
        convertNonBreakingSpaces - specifies whether non-breaking space (&nbsp;) character entity references are converted to spaces.
        Returns:
        this TextExtractor instance, allowing multiple property setting methods to be chained in a single statement.
        See Also:
        getConvertNonBreakingSpaces()
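        The effect of this property can be modelled with a small stand-alone sketch. It is not the library's implementation: it assumes the entity reference has already been decoded to the character U+00A0, and the render method and its parameters are hypothetical names used only for illustration.

```java
class NonBreakingSpaceDemo {
    static final char NBSP = '\u00A0';

    // Models the documented behaviour: when the flag is true, every
    // non-breaking space becomes an ordinary space; otherwise it is kept.
    static String render(String decodedText, boolean convertNonBreakingSpaces) {
        return convertNonBreakingSpaces ? decodedText.replace(NBSP, ' ') : decodedText;
    }

    public static void main(String[] args) {
        String decoded = "10" + NBSP + "km";
        System.out.println(render(decoded, true).indexOf(' '));   // position of the ordinary space
        System.out.println(render(decoded, false).indexOf(NBSP)); // position of the preserved U+00A0
    }
}
```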
      • getConvertNonBreakingSpaces

        public boolean getConvertNonBreakingSpaces()
        Indicates whether non-breaking space (&nbsp;) character entity references are converted to spaces.

        See the setConvertNonBreakingSpaces(boolean) method for a full description of this property.

        Returns:
        true if non-breaking space (&nbsp;) character entity references are converted to spaces, otherwise false.
      • setIncludeAttributes

        public TextExtractor setIncludeAttributes​(boolean includeAttributes)
        Sets whether any attribute values are included in the output.

        If the value of this property is true, then each attribute still has to match the conditions implemented in the includeAttribute(StartTag,Attribute) method in order for its value to be included in the output.

        The default value is false.

        Parameters:
        includeAttributes - specifies whether any attribute values are included in the output.
        Returns:
        this TextExtractor instance, allowing multiple property setting methods to be chained in a single statement.
        See Also:
        getIncludeAttributes()
      • getIncludeAttributes

        public boolean getIncludeAttributes()
        Indicates whether any attribute values are included in the output.

        See the setIncludeAttributes(boolean) method for a full description of this property.

        Returns:
        true if any attribute values are included in the output, otherwise false.
      • includeAttribute

        public boolean includeAttribute​(StartTag startTag,
                                        Attribute attribute)
        Indicates whether the value of the specified attribute in the specified start tag is included in the output.

        This method is ignored if the IncludeAttributes property is set to false, in which case no attribute values are included in the output.

        If the IncludeAttributes property is set to true, every attribute of every start tag encountered in the segment is checked using this method to determine whether the value of the attribute should be included in the output.

        The default implementation of this method returns true if the name of the specified attribute is one of title, alt, label, summary, content*, or href, but the method can be overridden in a subclass to perform a check of arbitrary complexity on each attribute.

        * The value of a content attribute is only included if a name attribute is also present in the specified start tag, as the content attribute of a META tag only contains human readable text if the name attribute is used as opposed to an http-equiv attribute.

        Example:
        To include only the value of title and alt attributes:

        // requires java.util.Arrays, java.util.HashSet and java.util.Set
        final Set<String> includeAttributeNames=new HashSet<String>(Arrays.asList("title","alt"));
        TextExtractor textExtractor=new TextExtractor(segment) {
            @Override
            public boolean includeAttribute(StartTag startTag, Attribute attribute) {
                return includeAttributeNames.contains(attribute.getKey());
            }
        };
        textExtractor.setIncludeAttributes(true);
        String extractedText=textExtractor.toString();
        Parameters:
        startTag - the start tag containing the attribute.
        attribute - the attribute to check for inclusion.
        Returns:
        true if the value of the specified attribute in the specified start tag is included in the output, otherwise false.
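        The default rule described above, including the special case for the content attribute, can be sketched without the library. This is an illustration of the documented rule only, not the actual implementation: the Map of attribute names to values is a hypothetical stand-in for StartTag, and attribute names are assumed to be compared case-insensitively.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

class DefaultAttributeRuleSketch {
    // Attribute names whose values are included unconditionally under the documented default.
    private static final Set<String> ALWAYS_INCLUDED =
            new HashSet<>(Arrays.asList("title", "alt", "label", "summary", "href"));

    // tagAttributes: lower-case attribute names mapped to values for one start tag
    // (hypothetical stand-in for StartTag).
    static boolean includeAttribute(Map<String, String> tagAttributes, String attributeName) {
        String name = attributeName.toLowerCase(Locale.ROOT);
        if (ALWAYS_INCLUDED.contains(name)) return true;
        // A content attribute counts only when a name attribute is also present.
        return name.equals("content") && tagAttributes.containsKey("name");
    }

    public static void main(String[] args) {
        Map<String, String> meta = new HashMap<>();
        meta.put("name", "description");
        meta.put("content", "A short summary of the page");
        System.out.println(includeAttribute(meta, "content"));
    }
}
```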
      • setExcludeNonHTMLElements

        public TextExtractor setExcludeNonHTMLElements​(boolean excludeNonHTMLElements)
        Sets whether the content of non-HTML elements is excluded from the output.

        The default value is false, meaning that content from all elements meeting the other criteria is included.

        Parameters:
        excludeNonHTMLElements - specifies whether the content of non-HTML elements is excluded from the output.
        Returns:
        this TextExtractor instance, allowing multiple property setting methods to be chained in a single statement.
        See Also:
        getExcludeNonHTMLElements()
      • excludeElement

        public boolean excludeElement​(StartTag startTag)
        Indicates whether the text inside the Element of the specified start tag should be excluded from the output.

        During the text extraction process, every start tag encountered in the segment is checked using this method to determine whether the text inside its associated element should be excluded from the output.

        The default implementation of this method is to always return false, so that every element is included, but the method can be overridden in a subclass to perform a check of arbitrary complexity on each start tag.

        All elements nested inside an excluded element are also implicitly excluded, as are all SCRIPT and STYLE elements. Such elements are skipped over without calling this method, so there is no way to include them by overriding the method.

        Example:
        To extract the text from a segment, excluding any text inside elements with the attribute class="NotIndexed":

        TextExtractor textExtractor=new TextExtractor(segment) {
            public boolean excludeElement(StartTag startTag) {
                return "NotIndexed".equalsIgnoreCase(startTag.getAttributeValue("class"));
            }
        };
        String extractedText=textExtractor.toString();
        Parameters:
        startTag - the start tag of the element to check for exclusion.
        Returns:
        true if the text inside the Element of the specified start tag should be excluded from the output, otherwise false.
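        The example above compares the whole value of the class attribute, so it would not match an element such as class="sidebar NotIndexed". If that case matters, an overridden excludeElement could delegate to a helper like the one sketched below. The hasClassToken method is a hypothetical helper written for this illustration and is not part of the library.

```java
class ClassTokenDemo {
    // Returns true if the whitespace-separated class attribute value contains the token,
    // ignoring case. A null value (attribute absent) is treated as no match.
    static boolean hasClassToken(String classAttributeValue, String token) {
        if (classAttributeValue == null) return false;
        for (String candidate : classAttributeValue.trim().split("\\s+")) {
            if (candidate.equalsIgnoreCase(token)) return true;
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(hasClassToken("sidebar NotIndexed", "NotIndexed"));
        System.out.println(hasClassToken("NotIndexedExtra", "NotIndexed"));
    }
}
```

        Inside an overridden excludeElement, the call would then take the form hasClassToken(startTag.getAttributeValue("class"), "NotIndexed").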