Class OutputDocument

  • All Implemented Interfaces:
    CharStreamSource

    public final class OutputDocument
    extends java.lang.Object
    implements CharStreamSource
    Represents a modified version of an original Source document or Segment.

    An OutputDocument represents an original Source document or Segment that has been modified by substituting segments of it with other text. Each of these substitutions must be registered in the output document, which is most commonly done using the various replace, remove or insert methods in this class. These methods internally register one or more OutputSegment objects to define each substitution.

    If a Segment is used to construct the output document, all character positions are relative to the source document of the specified segment.

    After all of the substitutions have been registered, the modified text can be retrieved using the writeTo(Writer) or toString() methods.

    The registered output segments may be adjacent and may also overlap. An output segment that is completely enclosed by another output segment is not included in the output.
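
    As a rough illustration of the rules above, the following self-contained sketch models registered substitutions as (begin, end, text) triples over the source text. SubstitutionSketch and its methods are hypothetical names used only for illustration. This is not the library's implementation, and its handling of partially overlapping segments is an assumption.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical, simplified model of an output document (not the library's code).
class SubstitutionSketch {
    // One registered substitution: source span [begin, end) is replaced by text.
    static final class Sub {
        final int begin, end;
        final String text;
        Sub(int begin, int end, String text) {
            this.begin = begin;
            this.end = end;
            this.text = text;
        }
    }

    private final String source;
    private final List<Sub> subs = new ArrayList<>();

    SubstitutionSketch(String source) {
        this.source = source;
    }

    // A null text is treated as an empty string, which removes the span.
    void replace(int begin, int end, String text) {
        subs.add(new Sub(begin, end, text == null ? "" : text));
    }

    void remove(int begin, int end) {
        replace(begin, end, null);
    }

    // An insertion is modelled as a zero-length replacement.
    void insert(int pos, String text) {
        replace(pos, pos, text);
    }

    String render() {
        List<Sub> sorted = new ArrayList<>(subs);
        sorted.sort(Comparator.comparingInt((Sub s) -> s.begin)); // stable sort by start
        StringBuilder out = new StringBuilder();
        int pos = 0; // next source position not yet written
        for (Sub s : sorted) {
            // Skip a substitution that lies entirely inside one already written.
            if (s.end < pos || (s.end == pos && s.begin < pos)) continue;
            if (s.begin > pos) out.append(source, pos, s.begin); // untouched source text
            out.append(s.text);
            pos = s.end;
        }
        out.append(source, pos, source.length()); // remainder of the source
        return out.toString();
    }
}
```

    In this model, registering replace(1,5,"X") and then replace(2,3,"Y") on the source text "abcdef" renders "aXf", because the second span is completely enclosed by the first and is therefore dropped.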

    If an OutputDocument generates unexpected results, the getDebugInfo() method reports the details of each registered output segment, which should be enough to determine the cause of the problem. In most cases the problem is caused by overlapping output segments.

    The following example converts all externally referenced style sheets to internal style sheets:

      // Load the source document from the URL.
      URL sourceUrl=new URL(sourceUrlString);
      String htmlText=Util.getString(new InputStreamReader(sourceUrl.openStream()));
      Source source=new Source(htmlText);
      OutputDocument outputDocument=new OutputDocument(source);
      StringBuilder sb=new StringBuilder();
      for (StartTag startTag : source.getAllStartTags(HTMLElementName.LINK)) {
        Attributes attributes=startTag.getAttributes();
        String rel=attributes.getValue("rel");
        if (!"stylesheet".equalsIgnoreCase(rel)) continue; // only convert style sheet links
        String href=attributes.getValue("href");
        if (href==null) continue;
        String styleSheetContent;
        try {
          // Resolve href relative to the document URL and fetch the style sheet.
          styleSheetContent=Util.getString(new InputStreamReader(new URL(sourceUrl,href).openStream()));
        } catch (Exception ex) {
          continue; // don't convert if the URL is invalid or cannot be read
        }
        // Build the replacement <style> element, keeping the original type attribute.
        sb.setLength(0);
        sb.append("<style");
        Attribute typeAttribute=attributes.get("type");
        if (typeAttribute!=null) sb.append(' ').append(typeAttribute);
        sb.append(">\n").append(styleSheetContent).append("\n</style>");
        // Register a String snapshot, because sb is reused on the next iteration.
        outputDocument.replace(startTag,sb.toString());
      }
      String convertedHtmlText=outputDocument.toString();
     
    See Also:
    OutputSegment
    • Constructor Summary

      Constructors 
      Constructor Description
      OutputDocument​(Segment segment)
      Constructs a new output document based on the specified Segment.
      OutputDocument​(Source source)
      Constructs a new output document based on the specified source document.
    • Method Summary

      Modifier and Type Method Description
      void appendTo​(java.lang.Appendable appendable)
      Appends the final content of this output document to the specified Appendable object.
      void appendTo​(java.lang.Appendable appendable, int begin, int end)
      Appends the specified portion of the final content of this output document to the specified Appendable object.
      java.lang.String getDebugInfo()
      Returns a string representation of this object useful for debugging purposes.
      long getEstimatedMaximumOutputLength()
      Returns the estimated maximum number of characters in the output, or -1 if no estimate is available.
      java.util.List<OutputSegment> getRegisteredOutputSegments()
      Returns a list of all the registered OutputSegment objects in this output document.
      Segment getSegment()
      Returns the original segment upon which this output document is based.
      java.lang.CharSequence getSourceText()
      Returns the original source text upon which this output document is based.
      void insert​(int pos, java.lang.CharSequence text)
      Inserts the specified text at the specified character position in this output document.
      void register​(OutputSegment outputSegment)
      Registers the specified output segment in this output document.
      void remove​(int begin, int end)
      Removes the specified segment of this output document.
      void remove​(java.util.Collection<? extends Segment> segments)
      Removes all the segments from this output document represented by the specified source Segment objects.
      void remove​(Segment segment)
      Removes the specified segment from this output document.
      void replace​(int begin, int end, char ch)
      Replaces the specified segment of this output document with the specified character.
      void replace​(int begin, int end, java.lang.CharSequence text)
      Replaces the specified segment of this output document with the specified text.
      java.util.Map<java.lang.String,​java.lang.String> replace​(Attributes attributes, boolean convertNamesToLowerCase)
      Replaces the specified Attributes segment in this output document with the name/value entries in the returned Map.
      void replace​(Attributes attributes, java.util.Map<java.lang.String,​java.lang.String> map)
      Replaces the specified attributes segment in this output document with the name/value entries in the specified Map.
      void replace​(FormControl formControl)
      Replaces the specified FormControl in this output document.
      void replace​(FormFields formFields)
      Replaces all the constituent form controls from the specified FormFields in this output document.
      void replace​(Segment segment, java.lang.CharSequence text)
      Replaces the specified segment in this output document with the specified text.
      void replaceWithSpaces​(int begin, int end)
      Replaces the specified segment of this output document with a string of spaces of the same length.
      java.lang.String toString()
      Returns the final content of this output document as a String.
      void writeTo​(java.io.Writer writer)
      Writes the final content of this output document to the specified Writer.
      void writeTo​(java.io.Writer writer, int begin, int end)
      Writes the specified portion of the final content of this output document to the specified Writer.
      • Methods inherited from class java.lang.Object

        clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait
    • Constructor Detail

      • OutputDocument

        public OutputDocument​(Source source)
        Constructs a new output document based on the specified source document.
        Parameters:
        source - the source document.
      • OutputDocument

        public OutputDocument​(Segment segment)
        Constructs a new output document based on the specified Segment.
        Parameters:
        segment - the original Segment.
    • Method Detail

      • getSegment

        public Segment getSegment()
        Returns the original segment upon which this output document is based.

        If a Source was used to construct the output document, this returns the Source object.

        Returns:
        the original segment upon which this output document is based.
      • getSourceText

        public java.lang.CharSequence getSourceText()
        Returns the original source text upon which this output document is based.

        If a Segment was used to construct the output document, this returns the text of the entire source document rather than just the segment.

        Returns:
        the original source text upon which this output document is based.
      • remove

        public void remove​(int begin,
                           int end)
        Removes the specified segment of this output document.

        This is equivalent to replace(begin,end,null).

        Parameters:
        begin - the character position at which to begin the removal.
        end - the character position at which to end the removal.
      • remove

        public void remove​(Segment segment)
        Removes the specified segment from this output document.

        This is equivalent to replace(segment,null).

        Parameters:
        segment - the segment to remove.
      • remove

        public void remove​(java.util.Collection<? extends Segment> segments)
        Removes all the segments from this output document represented by the specified source Segment objects.

        This is equivalent to the following code:

          for (Segment segment : segments)
            remove(segment);
        Parameters:
        segments - a collection of segments to remove, represented by source Segment objects.
      • insert

        public void insert​(int pos,
                           java.lang.CharSequence text)
        Inserts the specified text at the specified character position in this output document.
        Parameters:
        pos - the character position at which to insert the text.
        text - the text to insert.
      • replace

        public void replace​(Segment segment,
                            java.lang.CharSequence text)
        Replaces the specified segment in this output document with the specified text.

        Specifying a null argument to the text parameter is exactly equivalent to specifying an empty string, and results in the segment being completely removed from the output document.

        Parameters:
        segment - the segment to replace.
        text - the replacement text, or null to remove the segment.
      • replace

        public void replace​(int begin,
                            int end,
                            java.lang.CharSequence text)
        Replaces the specified segment of this output document with the specified text.

        Specifying a null argument to the text parameter is exactly equivalent to specifying an empty string, and results in the segment being completely removed from the output document.

        Parameters:
        begin - the character position at which to begin the replacement.
        end - the character position at which to end the replacement.
        text - the replacement text, or null to remove the segment.
      • replace

        public void replace​(int begin,
                            int end,
                            char ch)
        Replaces the specified segment of this output document with the specified character.
        Parameters:
        begin - the character position at which to begin the replacement.
        end - the character position at which to end the replacement.
        ch - the replacement character.
      • replace

        public void replace​(FormControl formControl)
        Replaces the specified FormControl in this output document.

        The effect of this method is to register zero or more output segments in the output document as required to reflect previous modifications to the control's state. The state of a control includes its submission value, output style, and whether it has been disabled.

        The state of the form control should not be modified after this method is called, as there is no guarantee that subsequent changes either will or will not be reflected in the final output. A second call to this method with the same parameter is not allowed. It is therefore recommended to call this method as the last action before the output is generated.

        Although the number and nature of the output segments added in any particular circumstance are not defined in the specification, it can generally be assumed that only the minimum changes necessary are made to the original document. If the state of the control has not been modified, calling this method has no effect.

        Parameters:
        formControl - the form control to replace.
        See Also:
        replace(FormFields)
      • replace

        public void replace​(FormFields formFields)
        Replaces all the constituent form controls from the specified FormFields in this output document.

        This is equivalent to the following code:

        for (FormControl formControl : formFields.getFormControls())
           replace(formControl);

        The state of any of the form controls in the specified form fields should not be modified after this method is called, as there is no guarantee that subsequent changes either will or will not be reflected in the final output. A second call to this method with the same parameter is not allowed. It is therefore recommended to call this method as the last action before the output is generated.

        Parameters:
        formFields - the form fields to replace.
        See Also:
        replace(FormControl)
      • replace

        public java.util.Map<java.lang.String,​java.lang.String> replace​(Attributes attributes,
                                                                              boolean convertNamesToLowerCase)
        Replaces the specified Attributes segment in this output document with the name/value entries in the returned Map. The returned map initially contains entries representing the attributes from the source document, which can be modified before output.

        The documentation of the replace(Attributes,Map) method contains more information about the requirements of the map entries.

        Specifying a value of true as an argument to the convertNamesToLowerCase parameter causes all original attribute names to be converted to lower case in the map. This simplifies the process of finding/updating specific attributes since map keys are case sensitive.

        Attribute values are automatically decoded before being loaded into the map.

        This method is logically equivalent to:
        replace(attributes, attributes.populateMap(new LinkedHashMap<String,String>(),convertNamesToLowerCase))

        The use of LinkedHashMap to implement the map ensures (probably unnecessarily) that existing attributes are output in the same order as they appear in the source document, and new attributes are output in the same order as they are added.

        Example:
          Source source=new Source(htmlDocument);
          Attributes bodyAttributes
            =source.getNextStartTag(0,HTMLElementName.BODY).getAttributes();
          OutputDocument outputDocument=new OutputDocument(source);
          Map<String,String> attributesMap=outputDocument.replace(bodyAttributes,true);
          attributesMap.put("bgcolor","green");
          String htmlDocumentWithGreenBackground=outputDocument.toString();
        Parameters:
        attributes - the Attributes segment defining the span of the segment and initial name/value entries of the returned map.
        convertNamesToLowerCase - specifies whether all attribute names are converted to lower case in the map.
        Returns:
        a Map containing the name/value entries to be output.
        See Also:
        replace(Attributes,Map)
      • replace

        public void replace​(Attributes attributes,
                            java.util.Map<java.lang.String,​java.lang.String> map)
        Replaces the specified attributes segment in this output document with the name/value entries in the specified Map.

        This method might be used if the Map containing the new attribute values should not be preloaded with the same entries as the source attributes, or a map implementation other than LinkedHashMap is required. Otherwise, the replace(Attributes, boolean convertNamesToLowerCase) method is generally more useful.

        An attribute with no value is represented by a map entry with a null value.

        Attribute values are stored unencoded in the map, and are automatically encoded if necessary during output.

        The use of invalid characters in attribute names results in unspecified behaviour.

        Note that methods in the Attributes class treat attribute names as case insensitive, whereas the Map treats them as case sensitive.

        Parameters:
        attributes - the Attributes object defining the span of the segment to replace.
        map - the Map containing the name/value entries.
        See Also:
        replace(Attributes, boolean convertNamesToLowerCase)
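
        The map conventions described for the two replace(Attributes, ...) methods can be illustrated without the library. The following sketch uses only java.util. AttributeMapSketch, lowerCaseKeys, render and encode are hypothetical helpers, and the encoding shown is a simplified assumption rather than the library's actual output format.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helpers illustrating the map conventions (not the library's code).
class AttributeMapSketch {
    // Map keys are case sensitive, so normalising names to lower case
    // makes lookups and updates predictable.
    static Map<String, String> lowerCaseKeys(Map<String, String> attrs) {
        Map<String, String> result = new LinkedHashMap<>(); // keeps insertion order
        for (Map.Entry<String, String> e : attrs.entrySet()) {
            result.put(e.getKey().toLowerCase(Locale.ROOT), e.getValue());
        }
        return result;
    }

    // Values are stored unencoded and encoded only when rendered.
    // This minimal encoding is an assumption made for the sketch.
    static String encode(String value) {
        return value.replace("&", "&amp;")
                    .replace("\"", "&quot;")
                    .replace("<", "&lt;")
                    .replace(">", "&gt;");
    }

    static String render(Map<String, String> attrs) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, String> e : attrs.entrySet()) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(e.getKey());
            // A null value represents an attribute with no value.
            if (e.getValue() != null) {
                sb.append("=\"").append(encode(e.getValue())).append('"');
            }
        }
        return sb.toString();
    }
}
```

        Because a LinkedHashMap keeps the position of a key that is overwritten, updating "bgcolor" after lower-casing leaves it in its original place, while newly added names are appended at the end.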
      • replaceWithSpaces

        public void replaceWithSpaces​(int begin,
                                      int end)
        Replaces the specified segment of this output document with a string of spaces of the same length.

        This method is most commonly used to remove segments of the document without affecting the character positions of the remaining elements.

        It is used internally to implement the functionality available through the Segment.ignoreWhenParsing() method.

        To remove a segment from the output document completely, use the remove(Segment) method instead.

        Parameters:
        begin - the character position at which to begin the replacement.
        end - the character position at which to end the replacement.
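
        The position-preserving effect described for replaceWithSpaces can be shown with a plain string. This is a minimal sketch using only the standard library. BlankOutSketch is a hypothetical name and the code does not call the library.

```java
// Hypothetical illustration of blanking out a span without shifting positions.
class BlankOutSketch {
    // Returns a copy of text in which the characters in [begin, end) are spaces.
    static String replaceWithSpaces(String text, int begin, int end) {
        StringBuilder sb = new StringBuilder(text);
        for (int i = begin; i < end; i++) {
            sb.setCharAt(i, ' ');
        }
        return sb.toString();
    }
}
```

        The result has the same length as the input, so every character after the blanked span keeps its original index. Removing the span instead would shift all following positions to the left.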
      • register

        public void register​(OutputSegment outputSegment)
        Registers the specified output segment in this output document.

        Use this method if you want to use a customised OutputSegment class.

        Parameters:
        outputSegment - the output segment to register.
      • writeTo

        public void writeTo​(java.io.Writer writer)
                     throws java.io.IOException
        Writes the final content of this output document to the specified Writer.

        The writeTo(Writer, int begin, int end) method allows the output of a portion of the output document.

        If the output is required in the form of a Reader, use CharStreamSourceUtil.getReader(this) instead.

        Specified by:
        writeTo in interface CharStreamSource
        Parameters:
        writer - the destination java.io.Writer for the output.
        Throws:
        java.io.IOException - if an I/O exception occurs.
        See Also:
        toString()
      • writeTo

        public void writeTo​(java.io.Writer writer,
                            int begin,
                            int end)
                     throws java.io.IOException
        Writes the specified portion of the final content of this output document to the specified Writer.

        Any zero-length output segments located at begin or end are included in the output.

        Parameters:
        writer - the destination java.io.Writer for the output.
        begin - the character position at which to start the output, inclusive.
        end - the character position at which to end the output, exclusive.
        Throws:
        java.io.IOException - if an I/O exception occurs.
        See Also:
        writeTo(Writer)
      • appendTo

        public void appendTo​(java.lang.Appendable appendable)
                      throws java.io.IOException
        Appends the final content of this output document to the specified Appendable object.

        The appendTo(Appendable, int begin, int end) method allows the output of a portion of the output document.

        Specified by:
        appendTo in interface CharStreamSource
        Parameters:
        appendable - the destination java.lang.Appendable object for the output.
        Throws:
        java.io.IOException - if an I/O exception occurs.
        See Also:
        toString()
      • appendTo

        public void appendTo​(java.lang.Appendable appendable,
                             int begin,
                             int end)
                      throws java.io.IOException
        Appends the specified portion of the final content of this output document to the specified Appendable object.

        Any zero-length output segments located at begin or end are included in the output.

        Parameters:
        appendable - the destination java.lang.Appendable object for the output.
        begin - the character position at which to start the output, inclusive.
        end - the character position at which to end the output, exclusive.
        Throws:
        java.io.IOException - if an I/O exception occurs.
        See Also:
        appendTo(Appendable)
      • getEstimatedMaximumOutputLength

        public long getEstimatedMaximumOutputLength()
        Description copied from interface: CharStreamSource
        Returns the estimated maximum number of characters in the output, or -1 if no estimate is available.

        The returned value should be used as a guide for efficiency purposes only, for example to set an initial StringBuilder capacity. There is no guarantee that the length of the output is indeed less than this value, as classes implementing this method often use assumptions based on typical usage to calculate the estimate.

        Although implementations of this method should never return a value less than -1, users of this method must not assume that this will always be the case. Standard practice is to interpret any negative value as meaning that no estimate is available.

        Specified by:
        getEstimatedMaximumOutputLength in interface CharStreamSource
        Returns:
        the estimated maximum number of characters in the output, or -1 if no estimate is available.
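
        One way to follow the advice above when sizing a buffer is sketched below. CapacitySketch and initialCapacity are hypothetical names, and the fallback and maxCapacity parameters are assumptions added for the example, not part of the interface.

```java
// Hypothetical helper for turning a length estimate into a buffer capacity.
class CapacitySketch {
    // Treats any negative estimate as "no estimate available" and caps large
    // estimates so that an optimistic guess cannot trigger a huge allocation.
    static int initialCapacity(long estimate, int fallback, int maxCapacity) {
        if (estimate < 0) return fallback;
        return (int) Math.min(estimate, (long) maxCapacity);
    }
}
```

        A caller might then write new StringBuilder(CapacitySketch.initialCapacity(estimate, 16, 1 << 20)), where estimate is the value returned by getEstimatedMaximumOutputLength(). The builder still grows as needed if the real output is longer than the estimate.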
      • toString

        public java.lang.String toString()
        Returns the final content of this output document as a String.
        Specified by:
        toString in interface CharStreamSource
        Overrides:
        toString in class java.lang.Object
        Returns:
        the final content of this output document as a String.
        See Also:
        writeTo(Writer)
      • getDebugInfo

        public java.lang.String getDebugInfo()
        Returns a string representation of this object useful for debugging purposes.

        The output includes details of all the registered output segments.

        Returns:
        a string representation of this object useful for debugging purposes.
      • getRegisteredOutputSegments

        public java.util.List<OutputSegment> getRegisteredOutputSegments()
        Returns a list of all the registered OutputSegment objects in this output document.

        The output segments are sorted in order of their starting position in the document.

        The returned list is modifiable and any changes will affect the output generated by this OutputDocument.

        Returns:
        a list of all the registered OutputSegment objects in this output document.