Class Dictionary
java.lang.Object
org.apache.lucene.analysis.hunspell.Dictionary
In-memory structure for the dictionary (.dic) and affix (.aff) data of a hunspell dictionary.
-
Nested Class Summary
Nested Classes
Modifier and Type, Class, Description
(package private) static class
Possible word breaks according to BREAK directives.
private static class
Used to read flags as UTF-8 even if the rest of the file is in the default (8-bit) encoding.
private static class
Implementation of Dictionary.FlagParsingStrategy that assumes each flag is encoded as two ASCII characters whose codes must be combined into a single character.
(package private) static class
Abstraction of the process of parsing flags taken from the affix and dic files.
private static class
Implementation of Dictionary.FlagParsingStrategy that assumes each flag is encoded in its numerical form.
private static class
Simple implementation of Dictionary.FlagParsingStrategy that treats the chars in each String as individual flags.
-
Field Summary
Fields
Modifier and Type, Field, Description
(package private) static final int
private static final int
(package private) static final int
(package private) static final int
(package private) char[]
private int
private String[]
private boolean
private static final byte[]
(package private) Dictionary.Breaks
(package private) boolean
(package private) boolean
(package private) List<CheckCompoundPattern>
(package private) boolean
(package private) boolean
(package private) boolean
(package private) char
(package private) boolean
(package private) char
(package private) char
(package private) char
(package private) char
(package private) int
(package private) char
(package private) int
(package private) char
(package private) List<CompoundRule>
private int
(package private) CharsetDecoder
(package private) static final Charset
private static final int
(package private) boolean
private static final char
(package private) static final char
(package private) final FlagEnumerator.Lookup
The list of unique flagsets (wordforms).
(package private) Dictionary.FlagParsingStrategy
(package private) char
(package private) char
(package private) boolean
(package private) boolean
Set during sorting, so we know to add an extra int (index in morphData) to FST output.
(package private) static final char
(package private) ConvTable
private char[]
(package private) boolean
(package private) char
(package private) String
(package private) static final int
(package private) int
(package private) int
private static final char
private int
private String[]
(package private) char
(package private) String[]
(package private) static final char[]
(package private) char
(package private) ConvTable
(package private) char
(package private) boolean
(package private) ArrayList<AffixCondition>
All condition checks used by prefixes and suffixes.
private char[]
All flags used in affix continuation classes.
private char[]
All flags used in affix continuation classes.
(package private) boolean
(package private) char[]
(package private) int[]
(package private) char
(package private) String
(package private) String
(package private) WordStorage
The entries in the .dic file, mapping to their set of flags -
Constructor Summary
Constructors
Constructor, Description
Dictionary(Directory tempDir, String tempFileNamePrefix, InputStream affix, InputStream dictionary)
Creates a new Dictionary containing the information read from the provided InputStreams to hunspell affix and dictionary files.
Dictionary(Directory tempDir, String tempFileNamePrefix, InputStream affix, List<InputStream> dictionaries, boolean ignoreCase)
Creates a new Dictionary containing the information read from the provided InputStreams to hunspell affix and dictionary files.
-
Method Summary
Modifier and Type, Method, Description
private void
addHiddenCapitalizedWord
(StringBuilder reuse, OfflineSorter.ByteSequencesWriter writer, String word, String afterSep) private int
addMorphFields
(Map<String, Integer> indices, String morphFields) private void
addPhoneticRepEntries
(String word, String ph) (package private) char
affixData
(int affixIndex, int offset) (package private) char
caseFold
(char c) Folds a single character (according to LANG, if present).
private void
checkCriticalDirectiveSame
(String directive, LineNumberReader reader, Object expected, Object actual) (package private) CharSequence
cleanInput
(CharSequence input, StringBuilder reuse) (package private) static String
extractLanguageCode
(String isoCode) private String
firstArgument
(LineNumberReader reader, String line) (package private) int
formStep()
(package private) int
getAffixCondition
(int affix) private String
getAliasValue
(int id) private CharsetDecoder
getDecoder
(String encoding) Retrieves the CharsetDecoder for the given encoding.
(package private) static Path
getDefaultTempDir
Returns the default temporary directory pointed to by java.io.tmpdir.
(package private) static Dictionary.FlagParsingStrategy
getFlagParsingStrategy
(String flagLine, Charset charset) Determines the appropriate Dictionary.FlagParsingStrategy based on the FLAG definition line taken from the affix file.
boolean
getIgnoreCase()
Returns true if this dictionary was constructed with the ignoreCase option.
(package private) boolean
hasFlag
(int entryId, char flag) (package private) boolean
(package private) boolean
hasLanguage
(String... langCodes) (package private) static int
indexOfSpaceOrTab
(String text, int start) (package private) boolean
isCrossProduct
(int affix) (package private) boolean
isDotICaseChangeDisallowed
(char[] word) (package private) boolean
isSecondStagePrefix
(char flag) (package private) boolean
isSecondStageSuffix
(char flag) private IntsRef
lookupEntries
(String root) (package private) IntsRef
lookupPrefix
(char[] word) (package private) IntsRef
lookupSuffix
(char[] word) (package private) IntsRef
lookupWord
(char[] word, int offset, int length) Looks up Hunspell word forms from the dictionary.
private static boolean
maybeConsume
(BufferedInputStream stream, byte[] bytes) Consume the provided byte sequence in full, if present.
(package private) boolean
mayNeedInputCleaning()
private int
mergeDictionaries
(List<InputStream> dictionaries, CharsetDecoder decoder, IndexOutput output) private static int
morphBoundary
(String line) (package private) boolean
needsInputCleaning
(CharSequence input) (package private) static IntsRef
private void
parseAffix
(TreeMap<String, List<Integer>> affixes, Set<Character> secondStageFlags, String header, LineNumberReader reader, AffixKind kind, Map<String, Integer> seenPatterns, Map<String, Integer> seenStrips, FlagEnumerator flags) Parses a specific affix rule, putting the result into the provided affix map.
private void
parseAlias
(String line) private Dictionary.Breaks
parseBreaks
(LineNumberReader reader, String line) private List<CompoundRule>
parseCompoundRules
(LineNumberReader reader, int num) private ConvTable
parseConversions
(LineNumberReader reader, int num) parseMapEntry
(LineNumberReader reader, String line) private void
parseMorphAlias
(String line) private int
parseNum
(LineNumberReader reader, String line) private void
readAffixFile
(InputStream affixStream, CharsetDecoder decoder, FlagEnumerator flags) Reads the affix file through the provided InputStream, building up the prefix and suffix maps.
private void
readConfig
(InputStream stream, Charset streamCharset) Parses the encoding and flag format specified in the provided InputStream.
readMorphFields
(String word, String unparsed) private WordStorage
readSortedDictionaries
(Directory tempDir, String sorted, FlagEnumerator flags, int wordCount) private static CharsetDecoder
replacingDecoder
(Charset charset) private static boolean
shouldSkipEscapedChar
(char ch) private String
singleArgument
(LineNumberReader reader, String line) private String
sortWordsOffline
(Directory tempDir, String tempFileNamePrefix, IndexOutput unsorted) private String[]
splitBySpace
(LineNumberReader reader, String line, int expectedParts) private String[]
splitBySpace
(LineNumberReader reader, String line, int minParts, int maxParts) splitMorphData
(String morphData) (package private) String
toLowerCase
(String word) (package private) static char[]
toSortedCharArray
(Set<Character> set) (package private) String
toTitleCase
(String word) private String
unescapeEntry
(String entry) private int
writeNormalizedWordEntry
(StringBuilder reuse, OfflineSorter.ByteSequencesWriter writer, String line)
-
Field Details
-
MAX_PROLOGUE_SCAN_WINDOW
static final int MAX_PROLOGUE_SCAN_WINDOW
-
NOFLAGS
static final char[] NOFLAGS -
FLAG_UNSET
static final char FLAG_UNSET
-
DEFAULT_FLAGS
private static final int DEFAULT_FLAGS
-
HIDDEN_FLAG
static final char HIDDEN_FLAG
-
DEFAULT_CHARSET
-
decoder
CharsetDecoder decoder -
prefixes
-
suffixes
-
breaks
Dictionary.Breaks breaks -
patterns
ArrayList<AffixCondition> patterns
All condition checks used by prefixes and suffixes. These are typically reused across many affix stripping rules, so they are deduplicated to save RAM.
-
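The deduplication described for patterns can be pictured as an index map, in the spirit of the seenPatterns parameter of parseAffix (a map from condition to index of patterns). The following is a minimal sketch only, not Lucene's code: the class PatternPool and its intern method are hypothetical names, and plain Strings stand in for AffixCondition objects.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper (not part of Lucene): stores each distinct condition
// once and hands out its index, so many affix rules can share one entry.
final class PatternPool {
  // Unique conditions, in first-seen order.
  private final List<String> patterns = new ArrayList<>();
  // Condition text -> index into the patterns list.
  private final Map<String, Integer> seenPatterns = new HashMap<>();

  /** Returns the index of the condition, adding it only if it is new. */
  int intern(String condition) {
    Integer existing = seenPatterns.get(condition);
    if (existing != null) {
      return existing; // already stored: reuse the index
    }
    int index = patterns.size();
    patterns.add(condition);
    seenPatterns.put(condition, index);
    return index;
  }

  /** Number of distinct conditions stored so far. */
  int size() {
    return patterns.size();
  }

  /** Returns the condition stored at the given index. */
  String get(int index) {
    return patterns.get(index);
  }
}
```

With this layout each affix rule only needs to keep a small integer, and the memory cost of a condition is paid once no matter how many rules use it.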
words
WordStorage words
The entries in the .dic file, mapping to their set of flags.
-
flagLookup
The list of unique flagsets (wordforms). Theoretically huge, but practically small (for Polish this is 756); otherwise humans wouldn't be able to deal with it either.
-
stripData
char[] stripData -
stripOffsets
int[] stripOffsets -
wordChars
String wordChars -
affixData
char[] affixData -
currentAffix
private int currentAffix -
AFFIX_FLAG
static final int AFFIX_FLAG
-
AFFIX_STRIP_ORD
static final int AFFIX_STRIP_ORD
-
AFFIX_CONDITION
private static final int AFFIX_CONDITION
-
AFFIX_APPEND
static final int AFFIX_APPEND
-
flagParsingStrategy
Dictionary.FlagParsingStrategy flagParsingStrategy -
aliases
-
aliasCount
private int aliasCount -
morphAliases
-
morphAliasCount
private int morphAliasCount -
morphData
-
hasCustomMorphData
boolean hasCustomMorphData
Set during sorting, so we know to add an extra int (index in morphData) to FST output.
-
ignoreCase
boolean ignoreCase -
checkSharpS
boolean checkSharpS -
complexPrefixes
boolean complexPrefixes -
secondStagePrefixFlags
private char[] secondStagePrefixFlags
All flags used in affix continuation classes. If an outer affix's flag isn't here, there's no need to do 2-level affix stripping with it.
-
secondStageSuffixFlags
private char[] secondStageSuffixFlags
All flags used in affix continuation classes. If an outer affix's flag isn't here, there's no need to do 2-level affix stripping with it.
-
circumfix
char circumfix -
keepcase
char keepcase -
forceUCase
char forceUCase -
needaffix
char needaffix -
forbiddenword
char forbiddenword -
onlyincompound
char onlyincompound -
compoundBegin
char compoundBegin -
compoundMiddle
char compoundMiddle -
compoundEnd
char compoundEnd -
compoundFlag
char compoundFlag -
compoundPermit
char compoundPermit -
compoundForbid
char compoundForbid -
checkCompoundCase
boolean checkCompoundCase -
checkCompoundDup
boolean checkCompoundDup -
checkCompoundRep
boolean checkCompoundRep -
checkCompoundTriple
boolean checkCompoundTriple -
simplifiedTriple
boolean simplifiedTriple -
compoundMin
int compoundMin -
compoundMax
int compoundMax -
compoundRules
List<CompoundRule> compoundRules -
checkCompoundPatterns
List<CheckCompoundPattern> checkCompoundPatterns -
ignore
private char[] ignore -
tryChars
String tryChars -
neighborKeyGroups
String[] neighborKeyGroups -
enableSplitSuggestions
boolean enableSplitSuggestions -
repTable
-
mapTable
-
maxDiff
int maxDiff -
maxNGramSuggestions
int maxNGramSuggestions -
onlyMaxDiff
boolean onlyMaxDiff -
noSuggest
char noSuggest -
subStandard
char subStandard -
iconv
ConvTable iconv -
oconv
ConvTable oconv -
fullStrip
boolean fullStrip -
language
String language -
alternateCasing
private boolean alternateCasing -
BOM_UTF8
private static final byte[] BOM_UTF8 -
CHARSET_ALIASES
-
FLAG_SEPARATOR
private static final char FLAG_SEPARATOR
-
MORPH_SEPARATOR
private static final char MORPH_SEPARATOR
-
-
Constructor Details
-
Dictionary
public Dictionary(Directory tempDir, String tempFileNamePrefix, InputStream affix, InputStream dictionary) throws IOException, ParseException
Creates a new Dictionary containing the information read from the provided InputStreams to hunspell affix and dictionary files. You have to close the provided InputStreams yourself.
- Parameters:
tempDir - Directory to use for offline sorting
tempFileNamePrefix - prefix to use to generate temp file names
affix - InputStream for reading the hunspell affix file (won't be closed).
dictionary - InputStream for reading the hunspell dictionary file (won't be closed).
- Throws:
IOException - Can be thrown while reading from the InputStreams
ParseException - Can be thrown if the content of the files does not meet expected formats
-
Dictionary
public Dictionary(Directory tempDir, String tempFileNamePrefix, InputStream affix, List<InputStream> dictionaries, boolean ignoreCase) throws IOException, ParseException
Creates a new Dictionary containing the information read from the provided InputStreams to hunspell affix and dictionary files. You have to close the provided InputStreams yourself.
- Parameters:
tempDir - Directory to use for offline sorting
tempFileNamePrefix - prefix to use to generate temp file names
affix - InputStream for reading the hunspell affix file (won't be closed).
dictionaries - InputStreams for reading the hunspell dictionary files (won't be closed).
- Throws:
IOException - Can be thrown while reading from the InputStreams
ParseException - Can be thrown if the content of the files does not meet expected formats
-
-
Method Details
-
formStep
int formStep() -
lookupWord
Looks up Hunspell word forms from the dictionary -
lookupPrefix
-
lookupSuffix
-
lookup
-
nextArc
-
readAffixFile
private void readAffixFile(InputStream affixStream, CharsetDecoder decoder, FlagEnumerator flags) throws IOException, ParseException
Reads the affix file through the provided InputStream, building up the prefix and suffix maps.
- Parameters:
affixStream - InputStream to read the content of the affix file from
decoder - CharsetDecoder to decode the content of the file
- Throws:
IOException - Can be thrown while reading from the InputStream
ParseException
-
checkCriticalDirectiveSame
private void checkCriticalDirectiveSame(String directive, LineNumberReader reader, Object expected, Object actual) throws ParseException - Throws:
ParseException
-
parseMapEntry
- Throws:
ParseException
-
hasLanguage
-
lookupEntries
- Parameters:
root - a string to look up in the dictionary. No case conversion or affix removal is performed. To get the possible roots of any word, you may call Hunspell.getRoots(String)
- Returns:
the dictionary entries for the given root, or null if there's none
-
extractLanguageCode
-
parseNum
- Throws:
ParseException
-
singleArgument
- Throws:
ParseException
-
firstArgument
- Throws:
ParseException
-
splitBySpace
private String[] splitBySpace(LineNumberReader reader, String line, int expectedParts) throws ParseException - Throws:
ParseException
-
splitBySpace
private String[] splitBySpace(LineNumberReader reader, String line, int minParts, int maxParts) throws ParseException - Throws:
ParseException
-
parseCompoundRules
private List<CompoundRule> parseCompoundRules(LineNumberReader reader, int num) throws IOException, ParseException - Throws:
IOException
ParseException
-
parseBreaks
private Dictionary.Breaks parseBreaks(LineNumberReader reader, String line) throws IOException, ParseException - Throws:
IOException
ParseException
-
affixFST
- Throws:
IOException
-
parseAffix
private void parseAffix(TreeMap<String, List<Integer>> affixes, Set<Character> secondStageFlags, String header, LineNumberReader reader, AffixKind kind, Map<String, Integer> seenPatterns, Map<String, Integer> seenStrips, FlagEnumerator flags) throws IOException, ParseException
Parses a specific affix rule, putting the result into the provided affix map.
- Parameters:
affixes - Map where the result of the parsing will be put
header - Header line of the affix rule
reader - BufferedReader to read the content of the rule from
seenPatterns - map from condition -> index of patterns, for deduplication.
- Throws:
IOException - Can be thrown while reading the rule
ParseException
-
affixData
char affixData(int affixIndex, int offset) -
isCrossProduct
boolean isCrossProduct(int affix) -
getAffixCondition
int getAffixCondition(int affix) -
parseConversions
private ConvTable parseConversions(LineNumberReader reader, int num) throws IOException, ParseException - Throws:
IOException
ParseException
-
readConfig
private void readConfig(InputStream stream, Charset streamCharset) throws IOException, ParseException
Parses the encoding and flag format specified in the provided InputStream.
- Throws:
IOException
ParseException
-
maybeConsume
Consume the provided byte sequence in full, if present. Otherwise leave the input stream intact.
- Returns:
true if the sequence matched and has been consumed.
- Throws:
IOException
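The contract of maybeConsume (consume the sequence if it is there, otherwise leave the stream untouched) can be met with the mark/reset support of BufferedInputStream. The sketch below shows one way to do that, offered as an illustration rather than as the method's actual body. The class name MaybeConsumeSketch is hypothetical, and the BOM_UTF8 value is the standard three-byte UTF-8 byte order mark.

```java
import java.io.BufferedInputStream;
import java.io.IOException;

// Illustrative sketch only; not Lucene's implementation.
final class MaybeConsumeSketch {
  // Standard UTF-8 byte order mark: EF BB BF.
  static final byte[] BOM_UTF8 = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF};

  /**
   * Reads the given bytes from the stream if they are the next bytes in it.
   * On any mismatch (or end of stream) the stream is rewound to where it was.
   */
  static boolean maybeConsume(BufferedInputStream stream, byte[] bytes) throws IOException {
    stream.mark(bytes.length); // remember the position; we read at most bytes.length bytes
    for (byte expected : bytes) {
      int next = stream.read(); // -1 at end of stream, otherwise 0..255
      if (next != (expected & 0xFF)) {
        stream.reset(); // undo the partial read
        return false;
      }
    }
    return true;
  }
}
```

A typical use would be skipping an optional byte order mark before handing the stream to a decoder: call maybeConsume(stream, BOM_UTF8) once and continue reading either way.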
-
getDecoder
Retrieves the CharsetDecoder for the given encoding. Note: this isn't perfect, as I think ISCII-DEVANAGARI, MICROSOFT-CP1251, etc. are allowed...
- Parameters:
encoding - Encoding to retrieve the CharsetDecoder for
- Returns:
CharsetDecoder for the given encoding
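For orientation, here is a rough sketch of how a decoder can be obtained from an encoding name with the standard java.nio.charset API. It rests on assumptions: the class DecoderSketch and its methods are hypothetical, the choice to replace undecodable bytes is only suggested by the name of the replacingDecoder method, and the handling of Hunspell-specific names (see CHARSET_ALIASES) is not modelled.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;

// Illustrative sketch only; not Lucene's implementation.
final class DecoderSketch {
  /** Builds a decoder that substitutes U+FFFD for bytes it cannot decode. */
  static CharsetDecoder decoderFor(String encoding) {
    // Charset.forName throws an unchecked exception for unknown or illegal names.
    Charset charset = Charset.forName(encoding);
    return charset
        .newDecoder()
        .onMalformedInput(CodingErrorAction.REPLACE)
        .onUnmappableCharacter(CodingErrorAction.REPLACE);
  }

  /** Decodes a whole byte array with a fresh decoder for the given encoding. */
  static String decode(String encoding, byte[] bytes) throws CharacterCodingException {
    return decoderFor(encoding).decode(ByteBuffer.wrap(bytes)).toString();
  }
}
```

A lenient decoder of this kind keeps a single bad byte in a dictionary file from aborting the whole load, at the cost of silently producing replacement characters.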
-
replacingDecoder
-
getFlagParsingStrategy
Determines the appropriate Dictionary.FlagParsingStrategy based on the FLAG definition line taken from the affix file.
- Parameters:
flagLine - Line containing the flag information
- Returns:
FlagParsingStrategy that handles parsing flags in the way specified in the FLAG definition
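The three flag encodings named in the nested class descriptions (one char per flag, two ASCII characters per flag, and numeric flags) can be illustrated as follows. This is a sketch under stated assumptions, not the library's code: the class FlagParsingSketch is hypothetical, the exact way two character codes are packed into one char is an assumption, and the "long" and "num" keywords and the comma separator follow the Hunspell affix file format.

```java
// Illustrative sketch only; not Lucene's implementation.
final class FlagParsingSketch {
  /** Simple form: every char of the raw string is one flag. */
  static char[] parseSimple(String raw) {
    return raw.toCharArray();
  }

  /** Numeric form: comma-separated decimal numbers, each stored as one char. */
  static char[] parseNumeric(String raw) {
    if (raw.isEmpty()) {
      return new char[0];
    }
    String[] parts = raw.split(",");
    char[] flags = new char[parts.length];
    for (int i = 0; i < parts.length; i++) {
      flags[i] = (char) Integer.parseInt(parts[i].trim());
    }
    return flags;
  }

  /** Two-ASCII-char form: the two codes are packed into one 16-bit char (assumed layout). */
  static char[] parseDoubleAscii(String raw) {
    if (raw.length() % 2 != 0) {
      throw new IllegalArgumentException("Expected an even number of characters: " + raw);
    }
    char[] flags = new char[raw.length() / 2];
    for (int i = 0; i < flags.length; i++) {
      char high = raw.charAt(2 * i);
      char low = raw.charAt(2 * i + 1);
      flags[i] = (char) ((high << 8) | low);
    }
    return flags;
  }

  /** Picks a parser from the value of the FLAG directive; anything else is treated as simple. */
  static char[] parse(String flagType, String raw) {
    switch (flagType) {
      case "num":
        return parseNumeric(raw);
      case "long":
        return parseDoubleAscii(raw);
      default:
        return parseSimple(raw);
    }
  }
}
```

Whatever the on-disk encoding, every flag ends up as a single char, which is why the rest of the class can store flags in plain char fields and char[] arrays.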
-
unescapeEntry
-
shouldSkipEscapedChar
private static boolean shouldSkipEscapedChar(char ch) -
morphBoundary
-
indexOfSpaceOrTab
-
mergeDictionaries
private int mergeDictionaries(List<InputStream> dictionaries, CharsetDecoder decoder, IndexOutput output) throws IOException - Throws:
IOException
-
writeNormalizedWordEntry
private int writeNormalizedWordEntry(StringBuilder reuse, OfflineSorter.ByteSequencesWriter writer, String line) throws IOException - Returns:
- the number of word entries written
- Throws:
IOException
-
addHiddenCapitalizedWord
private void addHiddenCapitalizedWord(StringBuilder reuse, OfflineSorter.ByteSequencesWriter writer, String word, String afterSep) throws IOException - Throws:
IOException
-
toLowerCase
-
toTitleCase
-
sortWordsOffline
private String sortWordsOffline(Directory tempDir, String tempFileNamePrefix, IndexOutput unsorted) throws IOException - Throws:
IOException
-
readSortedDictionaries
private WordStorage readSortedDictionaries(Directory tempDir, String sorted, FlagEnumerator flags, int wordCount) throws IOException - Throws:
IOException
-
readMorphFields
-
addMorphFields
-
addPhoneticRepEntries
-
isDotICaseChangeDisallowed
boolean isDotICaseChangeDisallowed(char[] word) -
parseAlias
-
getAliasValue
-
parseMorphAlias
-
splitMorphData
-
hasFlag
-
hasFlag
boolean hasFlag(int entryId, char flag) -
mayNeedInputCleaning
boolean mayNeedInputCleaning() -
needsInputCleaning
-
cleanInput
-
toSortedCharArray
-
isSecondStagePrefix
boolean isSecondStagePrefix(char flag) -
isSecondStageSuffix
boolean isSecondStageSuffix(char flag) -
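The signatures of toSortedCharArray, isSecondStagePrefix and isSecondStageSuffix, together with the char[] fields secondStagePrefixFlags and secondStageSuffixFlags, suggest a sorted array that is probed per flag. The sketch below shows that idea with a binary search. Both the class name SecondStageFlagsSketch and the use of binary search are assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.Set;

// Illustrative sketch only; not Lucene's implementation.
final class SecondStageFlagsSketch {
  /** Copies the set into a char array and sorts it, so it can be binary-searched. */
  static char[] toSortedCharArray(Set<Character> set) {
    char[] chars = new char[set.size()];
    int i = 0;
    for (Character c : set) {
      chars[i++] = c;
    }
    Arrays.sort(chars);
    return chars;
  }

  /** Returns true if the flag occurs in the sorted array. */
  static boolean contains(char[] sortedFlags, char flag) {
    return Arrays.binarySearch(sortedFlags, flag) >= 0;
  }
}
```

A check like this lets a stemmer skip two-level affix stripping early when the outer affix's flag never appears in any continuation class.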
caseFold
char caseFold(char c)
Folds a single character (according to LANG, if present).
-
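As an illustration of what language-dependent folding in caseFold can mean, the sketch below special-cases the dotted and dotless I of Turkic alphabets. It is a guess at the general shape, not the actual implementation: the class CaseFoldSketch is hypothetical, and the assumption that the alternateCasing field corresponds to Turkish-style rules is not stated on this page.

```java
// Illustrative sketch only; not Lucene's implementation.
final class CaseFoldSketch {
  /**
   * Lower-cases one character. With alternateCasing enabled (assumed here to
   * mean Turkic rules), 'I' maps to dotless i and capital dotted I maps to 'i'.
   */
  static char caseFold(char c, boolean alternateCasing) {
    if (alternateCasing) {
      if (c == 'I') {
        return '\u0131'; // LATIN SMALL LETTER DOTLESS I
      }
      if (c == '\u0130') { // LATIN CAPITAL LETTER I WITH DOT ABOVE
        return 'i';
      }
    }
    return Character.toLowerCase(c);
  }
}
```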
getIgnoreCase
public boolean getIgnoreCase()
Returns true if this dictionary was constructed with the ignoreCase option.
-
getDefaultTempDir
Returns the default temporary directory pointed to by java.io.tmpdir. If not accessible or not available, an IOException is thrown.
- Throws:
IOException
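The behaviour described for getDefaultTempDir (read java.io.tmpdir, fail with an IOException if the directory cannot be used) might look roughly like the following. This is a sketch with a hypothetical class name, TempDirSketch. The specific checks (directory exists and is writable) are assumptions about what "not accessible or not available" covers.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Illustrative sketch only; not Lucene's implementation.
final class TempDirSketch {
  /** Resolves java.io.tmpdir, or throws IOException if it is unset or unusable. */
  static Path defaultTempDir() throws IOException {
    String property = System.getProperty("java.io.tmpdir");
    if (property == null) {
      throw new IOException("Java has no temporary folder property (java.io.tmpdir)");
    }
    Path dir = Paths.get(property);
    if (!Files.isDirectory(dir) || !Files.isWritable(dir)) {
      throw new IOException("Temporary folder is not a writable directory: " + dir);
    }
    return dir;
  }
}
```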
-