StandardTokenizer
implements the Word Break rules from the Unicode Text Segmentation algorithm, as specified in
Unicode Standard Annex #29.
Class | Description |
---|---|
ClassicAnalyzer | Filters ClassicTokenizer with ClassicFilter, LowerCaseFilter and StopFilter, using a list of English stop words. |
ClassicFilter | Normalizes tokens extracted with ClassicTokenizer. |
ClassicFilterFactory | Factory for ClassicFilter. |
ClassicTokenizer | A grammar-based tokenizer constructed with JFlex. |
ClassicTokenizerFactory | Factory for ClassicTokenizer. |
StandardAnalyzer | Filters StandardTokenizer with StandardFilter, LowerCaseFilter and StopFilter, using a list of English stop words. |
StandardFilter | Normalizes tokens extracted with StandardTokenizer. |
StandardFilterFactory | Factory for StandardFilter. |
StandardTokenizer | A grammar-based tokenizer constructed with JFlex. |
StandardTokenizerFactory | Factory for StandardTokenizer. |
StandardTokenizerImpl | This class implements Word Break rules from the Unicode Text Segmentation algorithm, as specified in Unicode Standard Annex #29. |
UAX29URLEmailAnalyzer | Filters UAX29URLEmailTokenizer with StandardFilter, LowerCaseFilter and StopFilter, using a list of English stop words. |
UAX29URLEmailTokenizer | This class implements Word Break rules from the Unicode Text Segmentation algorithm, as specified in Unicode Standard Annex #29. URLs and email addresses are also tokenized according to the relevant RFCs. |
UAX29URLEmailTokenizerFactory | Factory for UAX29URLEmailTokenizer. |
UAX29URLEmailTokenizerImpl | This class implements Word Break rules from the Unicode Text Segmentation algorithm, as specified in Unicode Standard Annex #29. URLs and email addresses are also tokenized according to the relevant RFCs. |
WordBreakTestUnicode_6_3_0 | This class was automatically generated by generateJavaUnicodeWordBreakTest.pl from: http://www.unicode.org/Public/6.3.0/ucd/auxiliary/WordBreakTest.txt. WordBreakTest.txt indicates the points in the provided character sequences at which conforming implementations must and must not break words. |
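
The analyzer rows in the table above all describe the same shape: a tokenizer whose output passes through a normalizing filter, LowerCaseFilter and StopFilter. The following is a minimal plain-Java sketch of that pipeline idea, not Lucene's implementation. The class `AnalyzerChainSketch`, its method names, the regex-based tokenizer stand-in and the tiny stop list are all hypothetical, and the normalizing filter stage (StandardFilter or ClassicFilter) is left out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Illustrative only: mimics the tokenizer -> lowercase -> stop-word chain
// with plain collections instead of Lucene token streams.
class AnalyzerChainSketch {

    // Hypothetical stop list for this sketch; not Lucene's English stop set.
    private static final Set<String> STOP_WORDS =
        new HashSet<>(Arrays.asList("a", "an", "and", "of", "the", "to"));

    // Stage 1 (tokenizer stand-in): split on runs of non-letter, non-digit characters.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String piece : text.split("[^\\p{L}\\p{Nd}]+")) {
            if (!piece.isEmpty()) {
                tokens.add(piece);
            }
        }
        return tokens;
    }

    // Stage 2 (LowerCaseFilter stand-in): lowercase every token.
    static List<String> lowerCase(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String token : tokens) {
            out.add(token.toLowerCase(Locale.ROOT));
        }
        return out;
    }

    // Stage 3 (StopFilter stand-in): drop tokens found in the stop list.
    static List<String> removeStopWords(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String token : tokens) {
            if (!STOP_WORDS.contains(token)) {
                out.add(token);
            }
        }
        return out;
    }

    // Runs the three stages in the order the table describes.
    static List<String> analyze(String text) {
        return removeStopWords(lowerCase(tokenize(text)));
    }

    public static void main(String[] args) {
        System.out.println(analyze("The Quick Brown Fox and the Lazy Dog"));
    }
}
```

Lowercasing runs before the stop-word stage, so "The" and "the" are both removed by a single lowercase stop list.
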
StandardTokenizer implements the Word Break rules from the Unicode Text Segmentation algorithm, as specified in Unicode Standard Annex #29.
Unlike UAX29URLEmailTokenizer from the analysis module, StandardTokenizer does not keep URLs and email addresses as single tokens; it splits them into tokens according to the UAX#29 word break rules.
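
To make the splitting of URLs and email addresses concrete, here is a toy plain-Java approximation of a small part of the word break rules. It is not StandardTokenizer and covers only one idea from UAX#29: a `.` or `'` does not cause a break when it sits between two letters or between two digits, while other punctuation such as `@`, `:` and `/` does. The class `WordBreakSketch` and its helpers are hypothetical names for this sketch; the real tokenizer applies the full rule set, so its output can differ.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: a drastically simplified word breaker, not the UAX#29 algorithm.
class WordBreakSketch {

    // Toy stand-in for the "mid-word" punctuation class: only '.' and '\'' are handled.
    private static boolean isJoiner(char c) {
        return c == '.' || c == '\'';
    }

    // True when text[i] is a joiner sitting between two letters or between two digits.
    private static boolean joins(String text, int i) {
        if (i == 0 || i + 1 >= text.length() || !isJoiner(text.charAt(i))) {
            return false;
        }
        char prev = text.charAt(i - 1);
        char next = text.charAt(i + 1);
        return (Character.isLetter(prev) && Character.isLetter(next))
            || (Character.isDigit(prev) && Character.isDigit(next));
    }

    // Collects runs of letters and digits, keeping a joiner inside a run when joins() allows it.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isLetterOrDigit(c) || joins(text, i)) {
                current.append(c);
            } else if (current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("Mail user@example.com or visit http://example.com/docs"));
    }
}
```

With this sketch, `user@example.com` comes out as `user` and `example.com`, and `http://example.com/docs` as `http`, `example.com` and `docs`: the address and the URL are broken into word-like pieces rather than kept whole.
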
StandardAnalyzer includes StandardTokenizer, StandardFilter, LowerCaseFilter and StopFilter.

Copyright © 2000–2018 The Apache Software Foundation. All rights reserved.