Class StreamedSource

  • All Implemented Interfaces:
    java.io.Closeable, java.lang.AutoCloseable, java.lang.Iterable<Segment>

    public final class StreamedSource
    extends java.lang.Object
    implements java.lang.Iterable<Segment>, java.io.Closeable
    Represents a streamed source HTML document.

    This class provides a means, via the iterator() method, of sequentially parsing every tag, character reference and plain text segment contained within the source document using a minimum amount of memory.

    In contrast, the standard Source class stores the entire source text in memory and caches every tag parsed, resulting in memory problems when attempting to parse very large files.

    The iterator parses and returns each segment as the source text is streamed in. Previous segments are discarded for garbage collection. Source documents up to 2GB in size can be processed, a limit imposed by the Java language because it uses the int data type to index string operations.

    There is, however, a significant trade-off in functionality when using the StreamedSource class as opposed to the Source class. The Tag.getElement() method is not supported on tags that are returned by the iterator, nor are any methods that use the Element class in any way. The Segment.getSource() method is also not supported.

    Most of the methods and constructors in this class mirror similarly named methods in the Source class where the same functionality is available.

    See the description of the iterator() method for a typical usage example of this class.

    In contrast to a Source object, the Reader or InputStream specified in the constructor or created implicitly by the constructor remains open for the life of the StreamedSource object. If the stream is created internally, it is automatically closed when the end of the stream is reached or the StreamedSource object is finalized. However, a Reader or InputStream that is specified directly in a constructor is never closed automatically, as it cannot be assumed that the application has no further use for it. In this case it is the user's responsibility to ensure that it is closed. Explicitly calling the close() method on the StreamedSource object ensures that all resources used by it are closed, regardless of whether they were created internally or supplied externally.

    The functionality provided by StreamedSource is similar to a StAX parser, but with some important benefits:

    • The source document does not have to be valid XML. It can be plain HTML, can contain invalid syntax, undefined entities, incorrectly nested elements, server tags, or anything else that is commonly found in "tag soup".
    • Every single syntactical construct in the source document's original text is included in the iterator, including the XML declaration, character references, comments, CDATA sections and server tags, each providing the segment's begin and end position in the source document. This allows an exact copy of the original document to be generated, with modifications made only where they are explicitly required. This is not possible with either SAX or StAX, which to some extent provide interpretations of the content of the XML instead of the syntactical structures used in the original source document.

    The following table summarises the differences between the StreamedSource, StAX and SAX interfaces. Note that some of the available features are documented as optional and may not be supported by all implementations of StAX and SAX.

    Feature    StreamedSource    StAX    SAX
    Parse XML
    Parse entities without DTD
    Automatically validate XML
    Parse HTML
    Tolerant of syntax or nesting errors
    Provide begin and end character positions of each event (see note 1)
    Provide source text of each event
    Handle server tag events
    Handle XML declaration event
    Handle comment events
    Handle CDATA section events
    Handle document type declaration event
    Handle character reference events
    Allow chunking of plain text
    Allow chunking of comment text
    Allow chunking of CDATA section text
    Allow specification of maximum buffer size
    1 StAX optionally reports the "offset" of each event but this could be either byte or character position depending on the source.

    Note that the OutputDocument class can not be used to create a modified version of a streamed source document. Instead, the output document must be constructed manually from the segments provided by the iterator.

    StreamedSource objects are not thread safe.

    • Constructor Summary

      Constructors 
      Constructor Description
      StreamedSource(java.io.InputStream inputStream)
      Constructs a new StreamedSource object by loading the content from the specified InputStream.
      StreamedSource(java.io.Reader reader)
      Constructs a new StreamedSource object by loading the content from the specified Reader.
      StreamedSource(java.lang.CharSequence text)
      Constructs a new StreamedSource object from the specified text.
      StreamedSource(java.net.URL url)
      Constructs a new StreamedSource object by loading the content from the specified URL.
      StreamedSource(java.net.URLConnection urlConnection)
      Constructs a new StreamedSource object by loading the content from the specified URLConnection.
    • Method Summary

      Modifier and Type Method Description
      void close()
      Closes the underlying Reader or InputStream and releases any system resources associated with it.
      protected void finalize()
      Called by the garbage collector on an object when garbage collection determines that there are no more references to the object.
      int getBufferSize()
      Returns the current size of the internal character buffer.
      Segment getCurrentSegment()
      Returns the current Segment from the iterator().
      java.nio.CharBuffer getCurrentSegmentCharBuffer()
      Returns a CharBuffer containing the source text of the current segment.
      java.lang.String getEncoding()
      Returns the character encoding scheme of the source byte stream used to create this object.
      java.lang.String getEncodingSpecificationInfo()
      Returns a concise description of how the encoding of the source document was determined.
      Logger getLogger()
      Returns the Logger that handles log messages.
      java.lang.String getPreliminaryEncodingInfo()
      Returns the preliminary encoding of the source document together with a concise description of how it was determined.
      boolean isXML()
      Indicates whether the source document is likely to be XML.
      java.util.Iterator<Segment> iterator()
      Returns an iterator over every tag, character reference and plain text segment contained within the source document.
      StreamedSource setBuffer(char[] buffer)
      Specifies an existing character array to use for buffering the incoming character stream.
      StreamedSource setCoalescing(boolean coalescing)
      Specifies whether an unbroken section of plain text in the source document should always be coalesced into a single Segment by the iterator.
      void setLogger(Logger logger)
      Sets the Logger that handles log messages.
      java.lang.String toString()
      Returns a string representation of the object as generated by the default Object.toString() implementation.
      • Methods inherited from class java.lang.Object

        clone, equals, getClass, hashCode, notify, notifyAll, wait, wait, wait
      • Methods inherited from interface java.lang.Iterable

        forEach, spliterator
    • Constructor Detail

      • StreamedSource

        public StreamedSource(java.io.Reader reader)
                       throws java.io.IOException
        Constructs a new StreamedSource object by loading the content from the specified Reader.

        If the specified reader is an instance of InputStreamReader, the getEncoding() method of the created StreamedSource object returns the encoding from InputStreamReader.getEncoding().
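
        As a small illustration of the value involved, the following sketch uses only JDK classes (no library classes) to show what InputStreamReader.getEncoding() reports. The class EncodingNameDemo and its methods are hypothetical names introduced here. The JDK documents that a historical name may be returned (for example "UTF8" rather than "UTF-8"), so comparing through Charset is likely safer than comparing strings.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;

// Hypothetical helper class for illustration only; not part of the library.
class EncodingNameDemo {

    // Returns the encoding name reported by an InputStreamReader
    // that was created with the given charset.
    static String reportedEncoding(Charset charset) {
        try (InputStreamReader reader =
                 new InputStreamReader(new ByteArrayInputStream(new byte[0]), charset)) {
            // Must be read while the reader is still open; it is null after close().
            return reader.getEncoding();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Compares a reported encoding name with an expected charset,
    // tolerating historical names and aliases.
    static boolean sameCharset(String reportedName, Charset expected) {
        return reportedName != null && Charset.forName(reportedName).equals(expected);
    }
}
```

        A caller that needs to test the encoding reported for a reader-based source could pass the returned name through a check like sameCharset instead of using String.equals.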

        Parameters:
        reader - the java.io.Reader from which to load the source text.
        Throws:
        java.io.IOException - if an I/O error occurs.
      • StreamedSource

        public StreamedSource(java.io.InputStream inputStream)
                       throws java.io.IOException
        Constructs a new StreamedSource object by loading the content from the specified InputStream.

        The algorithm for detecting the character encoding of the source document from the raw bytes of the specified input stream is the same as that for the Source(URLConnection) constructor of the Source class, except that the first step is not possible as there is no Content-Type header to check.

        If the specified InputStream does not support the mark method, the algorithm that determines the encoding may have to wrap it in a BufferedInputStream in order to look ahead at the encoding meta data. This extra layer of buffering will then remain in place for the life of the StreamedSource, possibly impacting memory usage and/or degrading performance. It is always preferable to use the StreamedSource(Reader) constructor if the encoding is known in advance.
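
        The wrapping described above can be pictured with plain JDK classes. This is only a sketch of the general technique, not the library's actual code; MarkSupportDemo and its method names are hypothetical. The second method shows the preferred route when the encoding is known: decode the bytes yourself and hand the resulting Reader to the StreamedSource(Reader) constructor.

```java
import java.io.BufferedInputStream;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;

// Hypothetical helper class for illustration only; not part of the library.
class MarkSupportDemo {

    // Adds a buffering layer only when the stream cannot mark/reset by itself,
    // which is what makes look-ahead at encoding meta data possible.
    static InputStream ensureMarkSupported(InputStream in) {
        return in.markSupported() ? in : new BufferedInputStream(in);
    }

    // When the encoding is known in advance, no look-ahead is needed:
    // wrap the stream in a Reader with the known charset.
    static Reader readerFor(InputStream in, Charset knownCharset) {
        return new InputStreamReader(in, knownCharset);
    }
}
```

        A ByteArrayInputStream already supports mark and is returned unchanged, whereas a stream that reports markSupported() as false receives the extra BufferedInputStream layer.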

        Parameters:
        inputStream - the java.io.InputStream from which to load the source text.
        Throws:
        java.io.IOException - if an I/O error occurs.
        See Also:
        getEncoding()
      • StreamedSource

        public StreamedSource(java.net.URL url)
                       throws java.io.IOException
        Constructs a new StreamedSource object by loading the content from the specified URL.

        This is equivalent to StreamedSource(url.openConnection()).

        Parameters:
        url - the URL from which to load the source text.
        Throws:
        java.io.IOException - if an I/O error occurs.
        See Also:
        getEncoding()
      • StreamedSource

        public StreamedSource(java.net.URLConnection urlConnection)
                       throws java.io.IOException
        Constructs a new StreamedSource object by loading the content from the specified URLConnection.

        The algorithm for detecting the character encoding of the source document is identical to that described in the Source(URLConnection) constructor of the Source class.

        The algorithm that determines the encoding may have to wrap the input stream in a BufferedInputStream in order to look ahead at the encoding meta data if the encoding is not specified in the HTTP headers. This extra layer of buffering will then remain in place for the life of the StreamedSource, possibly impacting memory usage and/or degrading performance. It is always preferable to use the StreamedSource(Reader) constructor if the encoding is known in advance.

        Parameters:
        urlConnection - the URL connection from which to load the source text.
        Throws:
        java.io.IOException - if an I/O error occurs.
        See Also:
        getEncoding()
      • StreamedSource

        public StreamedSource(java.lang.CharSequence text)
        Constructs a new StreamedSource object from the specified text.

        Although the CharSequence argument of this constructor apparently contradicts the notion of streaming in the source text, it can still provide benefits over the equivalent use of the standard Source class.

        Firstly, using the StreamedSource class to iterate the nodes of an in-memory CharSequence source document still requires much less memory than the equivalent operation using the standard Source class.

        Secondly, the specified CharSequence object could possibly implement its own paging mechanism to minimise memory usage.

        If the specified CharSequence is mutable, its state must not be modified while the StreamedSource is in use.
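
        To make the idea of a paging CharSequence more concrete, here is a minimal sketch using only JDK classes. PagedCharSequence, its constructor parameters and the pageLoader callback are all hypothetical, and how well any particular paging scheme performs with this constructor is not something the documentation states. The sketch keeps a single fixed-size page in memory and reloads on demand.

```java
import java.util.function.IntFunction;

// Hypothetical CharSequence that holds only one fixed-size page of text in memory.
// Every page except possibly the last is assumed to contain exactly pageSize characters.
class PagedCharSequence implements CharSequence {
    private final int length;
    private final int pageSize;
    private final IntFunction<String> pageLoader; // loads page n, e.g. from disk
    private int cachedPageIndex = -1;
    private String cachedPage;

    PagedCharSequence(int length, int pageSize, IntFunction<String> pageLoader) {
        this.length = length;
        this.pageSize = pageSize;
        this.pageLoader = pageLoader;
    }

    @Override
    public int length() {
        return length;
    }

    @Override
    public char charAt(int index) {
        if (index < 0 || index >= length) {
            throw new IndexOutOfBoundsException("index: " + index);
        }
        int pageIndex = index / pageSize;
        if (pageIndex != cachedPageIndex) {
            // Replace the cached page; the previous one becomes garbage.
            cachedPage = pageLoader.apply(pageIndex);
            cachedPageIndex = pageIndex;
        }
        return cachedPage.charAt(index % pageSize);
    }

    @Override
    public CharSequence subSequence(int start, int end) {
        if (start < 0 || end > length || start > end) {
            throw new IndexOutOfBoundsException("start: " + start + ", end: " + end);
        }
        StringBuilder sb = new StringBuilder(end - start);
        for (int i = start; i < end; i++) {
            sb.append(charAt(i));
        }
        return sb;
    }

    // Materialises the whole text, so it defeats the purpose for very large inputs.
    @Override
    public String toString() {
        return subSequence(0, length).toString();
    }
}
```

        Sequential access, which is what a streaming parser mostly performs, loads each page once; random access back and forth across page boundaries would reload pages repeatedly.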

        Parameters:
        text - the source text.
    • Method Detail

      • setBuffer

        public StreamedSource setBuffer(char[] buffer)
        Specifies an existing character array to use for buffering the incoming character stream.

        The specified buffer is fixed for the life of the StreamedSource object, in contrast to the default buffer which can be automatically replaced by a larger buffer as needed. This means that if a tag (including a comment or CDATA section) is encountered that is larger than the specified buffer, an unrecoverable BufferOverflowException is thrown. This exception is also thrown if coalescing has been enabled and a plain text segment is encountered that is larger than the specified buffer.

        In general this method should only be used if there needs to be an absolute maximum memory limit imposed on the parser, where that requirement is more important than the ability to parse any source document successfully.

        This method can only be called before the iterator() method has been called.

        Parameters:
        buffer - an existing character array to use for buffering the incoming character stream, must not be null.
        Returns:
        this StreamedSource instance, allowing multiple property setting methods to be chained in a single statement.
        Throws:
        java.lang.IllegalStateException - if the iterator() method has already been called.
      • setCoalescing

        public StreamedSource setCoalescing(boolean coalescing)
        Specifies whether an unbroken section of plain text in the source document should always be coalesced into a single Segment by the iterator.

        If this property is set to the default value of false, and a section of plain text is encountered in the document that is larger than the current buffer size, the text is chunked into multiple consecutive plain text segments in order to minimise memory usage.

        If this property is set to true then chunking is disabled, ensuring that consecutive plain text segments are never generated, but instead forcing the internal buffer to expand to fit the largest section of plain text.

        Note that CharacterReference segments are always handled separately from plain text, regardless of the value of this property. For this reason, algorithms that process element content almost always have to be designed to expect the text in multiple segments in order to handle character references, so there is usually no advantage in coalescing plain text segments.

        Parameters:
        coalescing - the new value of the coalescing property.
        Returns:
        this StreamedSource instance, allowing multiple property setting methods to be chained in a single statement.
        Throws:
        java.lang.IllegalStateException - if the iterator() method has already been called.
      • close

        public void close()
                   throws java.io.IOException
        Closes the underlying Reader or InputStream and releases any system resources associated with it.

        If the stream is already closed then invoking this method has no effect.

        Specified by:
        close in interface java.lang.AutoCloseable
        Specified by:
        close in interface java.io.Closeable
        Throws:
        java.io.IOException - if an I/O error occurs.
      • getEncoding

        public java.lang.String getEncoding()
        Returns the character encoding scheme of the source byte stream used to create this object.

        This method works in essentially the same way as the Source.getEncoding() method.

        If the byte stream used to create this object does not support the mark method, the algorithm that determines the encoding may have to wrap it in a BufferedInputStream in order to look ahead at the encoding meta data. This extra layer of buffering will then remain in place for the life of the StreamedSource, possibly impacting memory usage and/or degrading performance. It is always preferable to use the StreamedSource(Reader) constructor if the encoding is known in advance.

        The getEncodingSpecificationInfo() method returns a simple description of how the value of this method was determined.

        Returns:
        the character encoding scheme of the source byte stream used to create this object, or null if the encoding is not known.
        See Also:
        getEncodingSpecificationInfo()
      • getEncodingSpecificationInfo

        public java.lang.String getEncodingSpecificationInfo()
        Returns a concise description of how the encoding of the source document was determined.

        The description is intended for informational purposes only. It is not guaranteed to have any particular format and can not be reliably parsed.

        Returns:
        a concise description of how the encoding of the source document was determined.
        See Also:
        getEncoding()
      • getPreliminaryEncodingInfo

        public java.lang.String getPreliminaryEncodingInfo()
        Returns the preliminary encoding of the source document together with a concise description of how it was determined.

        This method works in essentially the same way as the Source.getPreliminaryEncodingInfo() method.

        The description returned by this method is intended for informational purposes only. It is not guaranteed to have any particular format and can not be reliably parsed.

        Returns:
        the preliminary encoding of the source document together with a concise description of how it was determined, or null if no preliminary encoding was required.
        See Also:
        getEncoding()
      • iterator

        public java.util.Iterator<Segment> iterator()
        Returns an iterator over every tag, character reference and plain text segment contained within the source document.

        Plain text is defined as all text that is not part of a Tag or CharacterReference.

        This results in a sequential walk-through of the entire source document. The end position of each segment should correspond with the begin position of the subsequent segment, unless any of the tags are enclosed by other tags. This could happen if there are server tags present in the document, or in rare circumstances where the document type declaration contains markup declarations.

        Each segment generated by the iterator is parsed as the source text is streamed in. Previous segments are discarded for garbage collection.

        If a section of plain text is encountered in the document that is larger than the current buffer size, the text is chunked into multiple consecutive plain text segments in order to minimise memory usage. Setting the Coalescing property to true disables chunking, ensuring that consecutive plain text segments are never generated, but instead forcing the internal buffer to expand to fit the largest section of plain text. Note that CharacterReference segments are always handled separately from plain text, regardless of whether coalescing is enabled. For this reason, algorithms that process element content almost always have to be designed to expect the text in multiple segments in order to handle character references, so there is usually no advantage in coalescing plain text segments.

        Character references that are found inside tags, such as those present inside attribute values, do not generate separate segments from the iterator.

        This method may only be called once on any particular StreamedSource instance.

        Example:

        The following code demonstrates the typical (implied) usage of this method through the Iterable interface to make an exact copy of the document from reader to writer (assuming no server tags are present):

         StreamedSource streamedSource=new StreamedSource(reader);
         for (Segment segment : streamedSource) {
           if (segment instanceof Tag) {
             Tag tag=(Tag)segment;
             // HANDLE TAG
             // Uncomment the following line to ensure each tag is valid XML:
             // writer.write(tag.tidy()); continue;
           } else if (segment instanceof CharacterReference) {
             CharacterReference characterReference=(CharacterReference)segment;
             // HANDLE CHARACTER REFERENCE
             // Uncomment the following line to decode all character references instead of copying them verbatim:
             // characterReference.appendCharTo(writer); continue;
           } else {
             // HANDLE PLAIN TEXT
           }
           // unless specific handling has prevented getting to here, simply output the segment as is:
           writer.write(segment.toString());
         }

        Note that the last line writer.write(segment.toString()) in the above code can be replaced with the following for improved performance:

         CharBuffer charBuffer=streamedSource.getCurrentSegmentCharBuffer();
         writer.write(charBuffer.array(),charBuffer.position(),charBuffer.length());

        The following code demonstrates how to process the plain text content of a specific element, in this case to print the content of every paragraph element:

         StreamedSource streamedSource=new StreamedSource(reader);
         StringBuilder sb=new StringBuilder();
         boolean insideParagraphElement=false;
         for (Segment segment : streamedSource) {
           if (segment instanceof Tag) {
             Tag tag=(Tag)segment;
             if (tag.getName().equals("p")) {
               if (tag instanceof StartTag) {
                 insideParagraphElement=true;
                 sb.setLength(0);
               } else { // tag instanceof EndTag
                 insideParagraphElement=false;
                 System.out.println(sb.toString());
               }
             }
           } else if (insideParagraphElement) {
             if (segment instanceof CharacterReference) {
               ((CharacterReference)segment).appendCharTo(sb);
             } else {
               sb.append(segment);
             }
           }
         }
        Specified by:
        iterator in interface java.lang.Iterable<Segment>
        Returns:
        an iterator over every tag, character reference and plain text segment contained within the source document.
      • getCurrentSegment

        public Segment getCurrentSegment()
        Returns the current Segment from the iterator().

        This is defined as the last Segment returned from the iterator's next() method.

        This method returns null if the iterator's next() method has never been called, or its hasNext() method has returned the value false.

        Returns:
        the current Segment from the iterator().
      • getCurrentSegmentCharBuffer

        public java.nio.CharBuffer getCurrentSegmentCharBuffer()
        Returns a CharBuffer containing the source text of the current segment.

        The returned CharBuffer provides a window into the internal char[] buffer including the position and length that spans the current segment.

        For example, the following code writes the source text of the current segment to writer:

        CharBuffer charBuffer=streamedSource.getCurrentSegmentCharBuffer();
        writer.write(charBuffer.array(),charBuffer.position(),charBuffer.length());

        This may provide a performance benefit over the standard way of accessing the source text of the current segment, which is to use the CharSequence interface of the segment directly, or to call Segment.toString().

        Because this CharBuffer is a direct window into the internal buffer of the StreamedSource, the contents of the CharBuffer.array() must not be modified, and the array is only guaranteed to hold the segment source text until the iterator's hasNext() or next() method is next called.
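
        The window semantics can be reproduced with the JDK alone. The sketch below assumes the buffer behaves like one created by CharBuffer.wrap(char[], int, int), which is an assumption about the internals rather than something stated here; CharBufferWindowDemo and snapshot are hypothetical names. It shows how to take an independent copy when the text must outlive the next call to hasNext() or next().

```java
import java.nio.CharBuffer;

// Hypothetical helper class for illustration only; not part of the library.
class CharBufferWindowDemo {

    // Copies the characters currently visible through the window into a new String,
    // so the result is unaffected by later changes to the backing array.
    static String snapshot(CharBuffer window) {
        return new String(window.array(),
                          window.arrayOffset() + window.position(),
                          window.length());
    }
}
```

        For a buffer created with CharBuffer.wrap(array, 6, 5), position() is 6, length() is 5 and array() is the whole backing array, which is why the three-argument write call above selects exactly the windowed characters. A snapshot taken this way keeps its value even after the backing array is overwritten.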

        Returns:
        a CharBuffer containing the source text of the current segment.
      • isXML

        public boolean isXML()
        Indicates whether the source document is likely to be XML.

        The algorithm used to determine this is designed to be relatively inexpensive and to provide an accurate result in most normal situations. An exact determination of whether the source document is XML would require a much more complex analysis of the text.

        The algorithm is as follows:

        1. If the document begins with an XML declaration, it is an XML document.
        2. If the document begins with a document type declaration that contains the text "xhtml", it is an XHTML document, and hence also an XML document.
        3. If none of the above conditions are met, assume the document is normal HTML, and therefore not an XML document.
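
        The three steps above can be approximated on a plain string as follows. This is a simplified sketch, not the library's implementation: LikelyXmlHeuristic is a hypothetical class, leading whitespace is not skipped, the "xhtml" match is assumed to be case-insensitive, and the document type declaration is assumed to end at the first '>' character.

```java
import java.util.Locale;

// Hypothetical helper class for illustration only; not part of the library.
class LikelyXmlHeuristic {

    static boolean isLikelyXml(String text) {
        // Step 1: the document begins with an XML declaration.
        if (text.startsWith("<?xml") && text.length() > 5
                && Character.isWhitespace(text.charAt(5))) {
            return true;
        }
        // Step 2: the document begins with a document type declaration containing "xhtml".
        String doctype = "<!DOCTYPE";
        if (text.regionMatches(true, 0, doctype, 0, doctype.length())) {
            int end = text.indexOf('>');
            String declaration = end >= 0 ? text.substring(0, end + 1) : text;
            return declaration.toLowerCase(Locale.ROOT).contains("xhtml");
        }
        // Step 3: otherwise assume normal HTML.
        return false;
    }
}
```

        The check is deliberately cheap: it inspects only the start of the document, so an XML document that lacks both an XML declaration and an XHTML document type declaration is reported as not XML.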

        This method can only be called after the iterator() method has been called.

        Returns:
        true if the source document is likely to be XML, otherwise false.
        Throws:
        java.lang.IllegalStateException - if the iterator() method has not yet been called.
      • setLogger

        public void setLogger(Logger logger)
        Sets the Logger that handles log messages.

        Specifying a null argument disables logging completely for operations performed on this StreamedSource object.

        A logger instance is created automatically for each StreamedSource object in the same way as is described in the Source.setLogger(Logger) method.

        Parameters:
        logger - the logger that will handle log messages, or null to disable logging.
        See Also:
        Config.LoggerProvider
      • getLogger

        public Logger getLogger()
        Returns the Logger that handles log messages.

        A logger instance is created automatically for each StreamedSource object using the LoggerProvider specified by the static Config.LoggerProvider property. This can be overridden by calling the setLogger(Logger) method. The name used for all automatically created logger instances is "net.htmlparser.jericho".

        Returns:
        the Logger that handles log messages, or null if logging is disabled.
      • getBufferSize

        public int getBufferSize()
        Returns the current size of the internal character buffer.

        This information is generally useful only for investigating memory and performance issues.

        Returns:
        the current size of the internal character buffer.
      • toString

        public java.lang.String toString()
        Returns a string representation of the object as generated by the default Object.toString() implementation.

        In contrast to the Source.toString() implementation, it is generally not possible for this method to return the entire source text.

        Overrides:
        toString in class java.lang.Object
        Returns:
        a string representation of the object as generated by the default Object.toString() implementation.
      • finalize

        protected void finalize()
        Called by the garbage collector on an object when garbage collection determines that there are no more references to the object.

        This implementation calls the close() method if the underlying Reader or InputStream was created internally.

        Overrides:
        finalize in class java.lang.Object