Class HTMLStripCharFilter

All Implemented Interfaces:
Closeable, AutoCloseable, Readable

public final class HTMLStripCharFilter extends BaseCharFilter
A CharFilter that wraps another Reader and attempts to strip out HTML constructs.
  • Field Details

    • YYEOF

      private static final int YYEOF
      This character denotes the end of file.
    • ZZ_BUFFERSIZE

      private static final int ZZ_BUFFERSIZE
      Initial size of the lookahead buffer.
    • YYINITIAL

      private static final int YYINITIAL
      Lexical states.
    • AMPERSAND

      private static final int AMPERSAND
    • NUMERIC_CHARACTER

      private static final int NUMERIC_CHARACTER
    • CHARACTER_REFERENCE_TAIL

      private static final int CHARACTER_REFERENCE_TAIL
    • LEFT_ANGLE_BRACKET

      private static final int LEFT_ANGLE_BRACKET
    • BANG

      private static final int BANG
    • COMMENT

      private static final int COMMENT
    • SCRIPT

      private static final int SCRIPT
    • SCRIPT_COMMENT

      private static final int SCRIPT_COMMENT
    • LEFT_ANGLE_BRACKET_SLASH

      private static final int LEFT_ANGLE_BRACKET_SLASH
    • LEFT_ANGLE_BRACKET_SPACE

      private static final int LEFT_ANGLE_BRACKET_SPACE
    • CDATA

      private static final int CDATA
    • SERVER_SIDE_INCLUDE

      private static final int SERVER_SIDE_INCLUDE
    • SINGLE_QUOTED_STRING

      private static final int SINGLE_QUOTED_STRING
    • DOUBLE_QUOTED_STRING

      private static final int DOUBLE_QUOTED_STRING
    • END_TAG_TAIL_INCLUDE

      private static final int END_TAG_TAIL_INCLUDE
    • END_TAG_TAIL_EXCLUDE

      private static final int END_TAG_TAIL_EXCLUDE
    • END_TAG_TAIL_SUBSTITUTE

      private static final int END_TAG_TAIL_SUBSTITUTE
    • START_TAG_TAIL_INCLUDE

      private static final int START_TAG_TAIL_INCLUDE
    • START_TAG_TAIL_EXCLUDE

      private static final int START_TAG_TAIL_EXCLUDE
    • START_TAG_TAIL_SUBSTITUTE

      private static final int START_TAG_TAIL_SUBSTITUTE
    • STYLE

      private static final int STYLE
    • STYLE_COMMENT

      private static final int STYLE_COMMENT
    • ZZ_LEXSTATE

      private static final int[] ZZ_LEXSTATE
      ZZ_LEXSTATE[l] is the state in the DFA for the lexical state l.
      ZZ_LEXSTATE[l+1] is the state in the DFA for the lexical state l at the beginning of a line.
      l is of the form l = 2*k, where k is a non-negative integer.
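
      The indexing rule above can be sketched as follows. This is a minimal illustration, not the generated code: the class name, the two state constants, and the table contents are hypothetical and chosen only to show how l and l+1 are used.

```java
final class LexStateLookup {
    // Hypothetical lexical-state constants; each is an even number l = 2*k.
    static final int STATE_A = 0;
    static final int STATE_B = 2;

    // Hypothetical table: for each lexical state l, the entry at l is the normal
    // DFA start state and the entry at l + 1 is the start state at beginning of line.
    static final int[] LEXSTATE = {0, 0, 1, 2};

    /** Returns the DFA start state for lexical state l. */
    static int dfaStartState(int l, boolean atBeginningOfLine) {
        return atBeginningOfLine ? LEXSTATE[l + 1] : LEXSTATE[l];
    }

    public static void main(String[] args) {
        System.out.println(dfaStartState(STATE_B, false));
        System.out.println(dfaStartState(STATE_B, true));
    }
}
```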
    • ZZ_CMAP_TOP

      private static final int[] ZZ_CMAP_TOP
      Top-level table for translating characters to character classes.
    • ZZ_CMAP_TOP_PACKED_0

      private static final String ZZ_CMAP_TOP_PACKED_0
    • ZZ_CMAP_BLOCKS

      private static final int[] ZZ_CMAP_BLOCKS
      Second-level tables for translating characters to character classes.
    • ZZ_CMAP_BLOCKS_PACKED_0

      private static final String ZZ_CMAP_BLOCKS_PACKED_0
    • ZZ_ACTION

      private static final int[] ZZ_ACTION
      Translates DFA states to action switch labels.
    • ZZ_ACTION_PACKED_0

      private static final String ZZ_ACTION_PACKED_0
    • ZZ_ROWMAP

      private static final int[] ZZ_ROWMAP
      Translates a state to a row index in the transition table.
    • ZZ_ROWMAP_PACKED_0

      private static final String ZZ_ROWMAP_PACKED_0
    • ZZ_TRANS

      private static final int[] ZZ_TRANS
      The transition table of the DFA.
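
      A row-map table and a flattened transition table are usually combined as "row offset plus character class". The sketch below assumes that layout; the two-state automaton, the three character classes, and all names are hypothetical and far smaller than the real tables.

```java
final class TransitionTableSketch {
    // Hypothetical character classes: 0 = any other character, 1 = '<', 2 = '>'.
    static int charClass(char c) {
        return c == '<' ? 1 : c == '>' ? 2 : 0;
    }

    // Row start offsets into TRANS, one entry per DFA state.
    static final int[] ROWMAP = {0, 3};

    // Flattened rows. State 0 = outside a tag, state 1 = inside a tag.
    static final int[] TRANS = {
        0, 1, 0,   // transitions from state 0
        1, 1, 0    // transitions from state 1
    };

    /** Looks up the successor state: row offset of the state plus the character class. */
    static int next(int state, int charClass) {
        return TRANS[ROWMAP[state] + charClass];
    }

    /** Runs the toy automaton over the input and returns the final state. */
    static int run(String input) {
        int state = 0;
        for (int i = 0; i < input.length(); i++) {
            state = next(state, charClass(input.charAt(i)));
        }
        return state;
    }
}
```

      Here run("a<b") ends inside a tag and run("a<b>c") ends outside one.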
    • ZZ_TRANS_PACKED_0

      private static final String ZZ_TRANS_PACKED_0
    • ZZ_TRANS_PACKED_1

      private static final String ZZ_TRANS_PACKED_1
    • ZZ_TRANS_PACKED_2

      private static final String ZZ_TRANS_PACKED_2
    • ZZ_TRANS_PACKED_3

      private static final String ZZ_TRANS_PACKED_3
    • ZZ_TRANS_PACKED_4

      private static final String ZZ_TRANS_PACKED_4
    • ZZ_TRANS_PACKED_5

      private static final String ZZ_TRANS_PACKED_5
    • ZZ_TRANS_PACKED_6

      private static final String ZZ_TRANS_PACKED_6
    • ZZ_TRANS_PACKED_7

      private static final String ZZ_TRANS_PACKED_7
    • ZZ_TRANS_PACKED_8

      private static final String ZZ_TRANS_PACKED_8
    • ZZ_TRANS_PACKED_9

      private static final String ZZ_TRANS_PACKED_9
    • ZZ_TRANS_PACKED_10

      private static final String ZZ_TRANS_PACKED_10
    • ZZ_TRANS_PACKED_11

      private static final String ZZ_TRANS_PACKED_11
    • ZZ_TRANS_PACKED_12

      private static final String ZZ_TRANS_PACKED_12
    • ZZ_TRANS_PACKED_13

      private static final String ZZ_TRANS_PACKED_13
    • ZZ_TRANS_PACKED_14

      private static final String ZZ_TRANS_PACKED_14
    • ZZ_UNKNOWN_ERROR

      private static final int ZZ_UNKNOWN_ERROR
      Error code for "Unknown internal scanner error".
    • ZZ_NO_MATCH

      private static final int ZZ_NO_MATCH
      Error code for "could not match input".
    • ZZ_PUSHBACK_2BIG

      private static final int ZZ_PUSHBACK_2BIG
      Error code for "pushback value was too large".
    • ZZ_ERROR_MSG

      private static final String[] ZZ_ERROR_MSG
      Error messages for ZZ_UNKNOWN_ERROR, ZZ_NO_MATCH, and ZZ_PUSHBACK_2BIG respectively.
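
      Because the messages are stored in the order of the three error codes, an error code can serve directly as an array index. The sketch below assumes the codes are 0, 1, and 2 (their values are not shown here) and that out-of-range codes fall back to the unknown-error message; both are assumptions, as are the class and method names.

```java
final class ScanErrors {
    // Assumed numeric values; the text above only states the order of the messages.
    static final int UNKNOWN_ERROR = 0;
    static final int NO_MATCH = 1;
    static final int PUSHBACK_2BIG = 2;

    // Messages in the same order as the codes above.
    static final String[] ERROR_MSG = {
        "Unknown internal scanner error",
        "could not match input",
        "pushback value was too large"
    };

    /** Maps an error code to its message, falling back to the unknown-error message. */
    static String messageFor(int errorCode) {
        if (errorCode < 0 || errorCode >= ERROR_MSG.length) {
            return ERROR_MSG[UNKNOWN_ERROR];
        }
        return ERROR_MSG[errorCode];
    }
}
```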
    • ZZ_ATTRIBUTE

      private static final int[] ZZ_ATTRIBUTE
      ZZ_ATTRIBUTE[aState] contains the attributes of state aState.
    • ZZ_ATTRIBUTE_PACKED_0

      private static final String ZZ_ATTRIBUTE_PACKED_0
    • zzReader

      private Reader zzReader
      Input device.
    • zzState

      private int zzState
      Current state of the DFA.
    • zzLexicalState

      private int zzLexicalState
      Current lexical state.
    • zzBuffer

      private char[] zzBuffer
      This buffer contains the current text to be matched and is the source of the yytext() string.
    • zzMarkedPos

      private int zzMarkedPos
      Text position at the last accepting state.
    • zzCurrentPos

      private int zzCurrentPos
      Current text position in the buffer.
    • zzStartRead

      private int zzStartRead
      Marks the beginning of the yytext() string in the buffer.
    • zzEndRead

      private int zzEndRead
      Marks the last character in the buffer that has been read from input.
    • zzAtEOF

      private boolean zzAtEOF
      Whether the scanner is at the end of file.
    • zzFinalHighSurrogate

      private int zzFinalHighSurrogate
      The number of occupied positions in zzBuffer beyond zzEndRead.

      When a lead/high surrogate has been read from the input stream into the final zzBuffer position, this will have a value of 1; otherwise, it will have a value of 0.
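
      The situation described above arises when a buffer fill ends in the middle of a surrogate pair. A minimal sketch of the check, using only java.lang.Character; the class and method names are hypothetical, and the real refill logic is more involved.

```java
final class SurrogateBoundarySketch {
    /** Returns 1 if the last character read is a lone lead/high surrogate, otherwise 0. */
    static int finalHighSurrogate(char[] buffer, int charsRead) {
        return charsRead > 0 && Character.isHighSurrogate(buffer[charsRead - 1]) ? 1 : 0;
    }

    /** End of the region that is safe to scan: a trailing high surrogate is held back. */
    static int usableEnd(char[] buffer, int charsRead) {
        return charsRead - finalHighSurrogate(buffer, charsRead);
    }
}
```

      Holding back the trailing high surrogate lets the scanner see the full code point once the matching low surrogate arrives with the next fill.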

    • yyline

      private int yyline
      Number of newlines encountered up to the start of the matched text.
    • yycolumn

      private int yycolumn
      Number of characters from the last newline up to the start of the matched text.
    • yychar

      private long yychar
      Number of characters up to the start of the matched text.
    • zzAtBOL

      private boolean zzAtBOL
      Whether the scanner is currently at the beginning of a line.
    • zzEOFDone

      private boolean zzEOFDone
      Whether the user-EOF-code has already been executed.
    • upperCaseVariantsAccepted

      private static final Map<String,String> upperCaseVariantsAccepted
    • entityValues

      private static final CharArrayMap<Character> entityValues
    • INITIAL_INPUT_SEGMENT_SIZE

      private static final int INITIAL_INPUT_SEGMENT_SIZE
    • BLOCK_LEVEL_START_TAG_REPLACEMENT

      private static final char BLOCK_LEVEL_START_TAG_REPLACEMENT
    • BLOCK_LEVEL_END_TAG_REPLACEMENT

      private static final char BLOCK_LEVEL_END_TAG_REPLACEMENT
    • BR_START_TAG_REPLACEMENT

      private static final char BR_START_TAG_REPLACEMENT
    • BR_END_TAG_REPLACEMENT

      private static final char BR_END_TAG_REPLACEMENT
    • SCRIPT_REPLACEMENT

      private static final char SCRIPT_REPLACEMENT
    • STYLE_REPLACEMENT

      private static final char STYLE_REPLACEMENT
    • REPLACEMENT_CHARACTER

      private static final char REPLACEMENT_CHARACTER
    • escapedTags

      private CharArraySet escapedTags
    • inputStart

      private long inputStart
    • cumulativeDiff

      private int cumulativeDiff
    • escapeBR

      private boolean escapeBR
    • escapeSCRIPT

      private boolean escapeSCRIPT
    • escapeSTYLE

      private boolean escapeSTYLE
    • restoreState

      private int restoreState
    • previousRestoreState

      private int previousRestoreState
    • outputCharCount

      private int outputCharCount
    • eofReturnValue

      private int eofReturnValue
    • inputSegment

      private HTMLStripCharFilter.TextSegment inputSegment
    • outputSegment

      private HTMLStripCharFilter.TextSegment outputSegment
    • entitySegment

      private HTMLStripCharFilter.TextSegment entitySegment
  • Constructor Details

    • HTMLStripCharFilter

      public HTMLStripCharFilter(Reader in, Set<String> escapedTags)
      Creates a new HTMLStripCharFilter over the provided Reader with the specified set of escaped tags.
      Parameters:
      in - Reader to strip HTML tags from.
      escapedTags - Tags in this set (both start and end tags) will not be filtered out.
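
      To illustrate what "escaped" means for the escapedTags parameter, here is a deliberately naive, regex-based sketch. It is not how HTMLStripCharFilter works internally (the real class is a generated scanner that handles many more HTML constructs); the class name, the pattern, and the case-insensitive tag matching are assumptions made for the example.

```java
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class EscapedTagsSketch {
    // Matches a simple start or end tag and captures the tag name.
    private static final Pattern TAG = Pattern.compile("</?([A-Za-z][A-Za-z0-9]*)[^>]*>");

    /** Removes tags from html, except tags whose lowercased name is in escapedTags. */
    static String strip(String html, Set<String> escapedTags) {
        StringBuilder out = new StringBuilder();
        Matcher m = TAG.matcher(html);
        int last = 0;
        while (m.find()) {
            out.append(html, last, m.start());            // copy text before the tag
            String name = m.group(1).toLowerCase(Locale.ROOT);
            if (escapedTags.contains(name)) {
                out.append(m.group());                    // escaped tag: keep it unchanged
            }
            last = m.end();
        }
        out.append(html, last, html.length());            // copy trailing text
        return out.toString();
    }
}
```

      With Set.of("b"), the input "<p>Hello <b>world</b></p>" keeps both the start and end b tags and loses the p tags.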
    • HTMLStripCharFilter

      public HTMLStripCharFilter(Reader in)
      Creates a new HTMLStripCharFilter over the provided Reader.
      Parameters:
      in - the java.io.Reader to read input from.
  • Method Details

    • zzUnpackcmap_top

      private static int[] zzUnpackcmap_top()
    • zzUnpackcmap_top

      private static int zzUnpackcmap_top(String packed, int offset, int[] result)
    • zzUnpackcmap_blocks

      private static int[] zzUnpackcmap_blocks()
    • zzUnpackcmap_blocks

      private static int zzUnpackcmap_blocks(String packed, int offset, int[] result)
    • zzUnpackAction

      private static int[] zzUnpackAction()
    • zzUnpackAction

      private static int zzUnpackAction(String packed, int offset, int[] result)
    • zzUnpackRowMap

      private static int[] zzUnpackRowMap()
    • zzUnpackRowMap

      private static int zzUnpackRowMap(String packed, int offset, int[] result)
    • zzUnpackTrans

      private static int[] zzUnpackTrans()
    • zzUnpackTrans

      private static int zzUnpackTrans(String packed, int offset, int[] result)
    • zzUnpackAttribute

      private static int[] zzUnpackAttribute()
    • zzUnpackAttribute

      private static int zzUnpackAttribute(String packed, int offset, int[] result)
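
      The (String packed, int offset, int[] result) signatures suggest that each table is decoded from a packed string into an int array. One common scheme for this is run-length encoding with (count, value) character pairs, sketched below. The encoding is an assumption, since the packed format is not documented here, and individual tables may use different schemes; the class and method names are hypothetical.

```java
final class PackedTableSketch {
    /**
     * Expands (count, value) character pairs from packed into result,
     * starting at offset, and returns the index after the last value written.
     */
    static int unpack(String packed, int offset, int[] result) {
        int i = 0;
        int j = offset;
        while (i + 1 < packed.length()) {
            int count = packed.charAt(i++);   // how many times the value repeats
            int value = packed.charAt(i++);   // the value itself
            for (int n = 0; n < count; n++) {
                result[j++] = value;
            }
        }
        return j;
    }
}
```

      Returning the next free index lets several packed strings (such as the numbered _PACKED_ segments) be decoded one after another into the same array.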
    • read

      public int read() throws IOException
      Overrides:
      read in class Reader
      Throws:
      IOException
    • read

      public int read(char[] cbuf, int off, int len) throws IOException
      Specified by:
      read in class Reader
      Throws:
      IOException
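
      Since the class is a Reader, callers consume it with the ordinary read loop. The helper below is a sketch that works for any Reader; the class and method names are hypothetical, and StringReader stands in for the filter so that the example needs only the JDK.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

final class ReadAllSketch {
    /** Drains a Reader via read(char[], int, int) until it reports end of stream (-1). */
    static String readAll(Reader reader) {
        StringBuilder out = new StringBuilder();
        char[] buf = new char[1024];
        try {
            int n;
            while ((n = reader.read(buf, 0, buf.length)) != -1) {
                out.append(buf, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // StringReader is a stand-in; any Reader, including a char filter, can be passed.
        System.out.println(readAll(new StringReader("plain text")));
    }
}
```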
    • close

      public void close() throws IOException
      Description copied from class: CharFilter
      Closes the underlying input stream.

      NOTE: The default implementation closes the input Reader, so be sure to call super.close() when overriding this method.

      Specified by:
      close in interface AutoCloseable
      Specified by:
      close in interface Closeable
      Overrides:
      close in class CharFilter
      Throws:
      IOException
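
      The note about calling super.close() can be illustrated with the JDK's FilterReader, which also closes its wrapped reader by default. This is an analogy under that assumption, not Lucene code; all class names below are hypothetical.

```java
import java.io.FilterReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

final class CloseSketch {
    /** Innermost reader that records whether close() reached it. */
    static final class TrackingReader extends StringReader {
        boolean closed;
        TrackingReader(String s) { super(s); }
        @Override public void close() { closed = true; super.close(); }
    }

    /** Wrapper that does its own cleanup and then delegates to super.close(). */
    static final class Wrapper extends FilterReader {
        boolean cleanedUp;
        Wrapper(Reader in) { super(in); }
        @Override public void close() throws IOException {
            cleanedUp = true;   // subclass-specific cleanup
            super.close();      // closes the wrapped reader; omitting this would leak it
        }
    }

    /** Returns true if closing the wrapper also closed the wrapped reader. */
    static boolean closesBoth() {
        TrackingReader inner = new TrackingReader("text");
        Wrapper outer = new Wrapper(inner);
        try (Reader r = outer) {
            r.read();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return outer.cleanedUp && inner.closed;
    }
}
```

      The try-with-resources form works because the reader implements AutoCloseable.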
    • getInitialBufferSize

      static int getInitialBufferSize()
    • zzCMap

      private static int zzCMap(int input)
      Translates raw input code points to DFA table row.
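
      A two-level character map, as suggested by the top-level and second-level tables, can be sketched like this. The page size of 256, the table contents, and every name are assumptions for illustration; the real tables cover the whole Unicode range.

```java
final class TwoLevelCharMap {
    static final int PAGE_SIZE = 256;

    // Hypothetical top-level table: block offset for each 256-code-point page.
    // Pages 1 and 2 share one block, which is what makes the scheme compact.
    static final int[] TOP = {0, 256, 256};

    // Hypothetical second-level blocks, stored back to back.
    static final int[] BLOCKS = new int[2 * PAGE_SIZE];
    static {
        for (char c = 'a'; c <= 'z'; c++) {
            BLOCKS[c] = 1;                    // block 0: lowercase ASCII letters are class 1
        }
        for (int i = PAGE_SIZE; i < 2 * PAGE_SIZE; i++) {
            BLOCKS[i] = 2;                    // block 1: everything is class 2
        }
    }

    /** Maps a code point (0..767 in this toy) to its character class. */
    static int classOf(int codePoint) {
        int page = codePoint >> 8;            // selects the top-level entry
        int offset = codePoint & 0xFF;        // position within the block
        return BLOCKS[TOP[page] + offset];
    }
}
```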
    • zzRefill

      private boolean zzRefill() throws IOException
      Refills the input buffer.
      Returns:
      false if and only if there was new input.
      Throws:
      IOException - if any I/O-Error occurs
    • yyclose

      private final void yyclose() throws IOException
      Closes the input reader.
      Throws:
      IOException - if the reader could not be closed.
    • yyreset

      private final void yyreset(Reader reader)
      Resets the scanner to read from a new input stream.

      Does not close the old reader.

      All internal variables are reset; the old input stream cannot be reused (the internal buffer is discarded and lost). The lexical state is set to YYINITIAL.

      Internal scan buffer is resized down to its initial length, if it has grown.

      Parameters:
      reader - The new input stream.
    • yyResetPosition

      private final void yyResetPosition()
      Resets the input position.
    • yyatEOF

      private final boolean yyatEOF()
      Returns whether the scanner has reached the end of the reader it reads from.
      Returns:
      whether the scanner has reached EOF.
    • yystate

      private final int yystate()
      Returns the current lexical state.
      Returns:
      the current lexical state.
    • yybegin

      private final void yybegin(int newState)
      Enters a new lexical state.
      Parameters:
      newState - the new lexical state
    • yytext

      private final String yytext()
      Returns the text matched by the current regular expression.
      Returns:
      the matched text.
    • yycharat

      private final char yycharat(int position)
      Returns the character at the given position from the matched text.

      It is equivalent to yytext().charAt(position), but faster.

      Parameters:
      position - the position of the character to fetch. A value from 0 to yylength()-1.
      Returns:
      the character at position.
    • yylength

      private final int yylength()
      How many characters were matched.
      Returns:
      the length of the matched text region.
    • zzScanError

      private static void zzScanError(int errorCode)
      Reports an error that occurred while scanning.

      In a well-formed scanner (no or only correct usage of yypushback(int) and a match-all fallback rule) this method will only be called with things that "Can't Possibly Happen".

      If this method is called, something is seriously wrong (e.g. a JFlex bug producing a faulty scanner etc.).

      Usual syntax/scanner level error handling should be done in error fallback rules.

      Parameters:
      errorCode - the code of the error message to display.
    • yypushback

      private void yypushback(int number)
      Pushes the specified amount of characters back into the input stream.

      They will be read again by the next call of the scanning method.

      Parameters:
      number - the number of characters to be read again. This number must not be greater than yylength().
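
      The relationship between yytext(), yylength(), yycharat(int), and yypushback(int) can be sketched with a buffer and two positions. The sketch assumes that the matched text spans from the start-of-match position up to the marked position, as the field descriptions suggest; the class, its constructor, and the exception type are hypothetical.

```java
final class MatchWindow {
    private final char[] buffer;
    private final int startRead;   // beginning of the matched text in the buffer
    private int markedPos;         // end (exclusive) of the matched text

    MatchWindow(String input, int startRead, int markedPos) {
        this.buffer = input.toCharArray();
        this.startRead = startRead;
        this.markedPos = markedPos;
    }

    String yytext() {
        return new String(buffer, startRead, markedPos - startRead);
    }

    int yylength() {
        return markedPos - startRead;
    }

    /** Same result as yytext().charAt(position), without allocating a String. */
    char yycharat(int position) {
        return buffer[startRead + position];
    }

    /** Shrinks the match by number characters so they are scanned again next time. */
    void yypushback(int number) {
        if (number > yylength()) {
            throw new IllegalArgumentException("pushback value was too large");
        }
        markedPos -= number;
    }
}
```

      For example, a window over "<b>bold" covering the first five characters has yytext() "<b>bo"; after yypushback(2) it is "<b>".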
    • zzDoEOF

      private void zzDoEOF()
      Contains user EOF-code, which will be executed exactly once, when the end of file is reached.
    • nextChar

      private int nextChar() throws IOException
      Resumes scanning until the next regular expression is matched, the end of input is encountered or an I/O-Error occurs.
      Returns:
      the next token.
      Throws:
      IOException - if any I/O-Error occurs.