The Zorba XQuery engine implements the XQuery and XPath Full Text 1.0 specification that, among other things, adds the ability to use stemming for text-matching via the stemming option. For example, the query:
let $x := <msg>Self Improvement</msg> return $x contains text "improve" using stemming
returns true
because $x
contains "Improvment" that has the same stem as "improve".
The initial implementation of the stemming option uses the Snowball stemmers and therefore can stem words in the following languages: Danish, Dutch, English, Finnish, French, German, Hungarian, Italian, Norwegian, Portuguese, Romanian, Russian, Spanish, Swedish, and Turkish.
Using the Zorba C++ API, you can provide your own stemmer by deriving from two classes: Stemmer
and StemmerProvider
.
The Stemmer
class is:
class Stemmer { public: typedef /* implementation-defined */ ptr; virtual void destroy() const = 0; virtual void stem( String const &word, locale::iso639_1::type lang, String *result ) const = 0; protected: virtual ~Stemmer(); };
For details about the ptr
type, the destroy()
function, and why the destructor is protected
, see the Memory Management document.
To implement the Stemmer
, you need to implement the stem()
function where:
word | The word to be stemmed. |
lang | The language of the word. |
result | The stemmed word goes here. |
Note that result
should always be set to something. If your stemmer doesn't know how to stem the given word, you should set result
to word
.
A very simple stemmer that stems the word "foobar" to "foo" can be implemented as:
class MyStemmer : public Stemmer { public: void destroy() const; void stem( String const &word, locale::iso639_1::type lang, String *result ) const; private: MyStemmer(); friend class MyStemmerProvider; // only it can create instances }; void MyStemmer::destroy() const { // Do nothing since we statically allocate a singleton instance of our stemmer. } void MyStemmer::stem( String const &word, locale::iso639_1::type lang, String *result ) const { if ( word == "foobar" ) *result = "foo"; else *result = word; // Don't know how to stem word: set result to word as-is. }
A real stemmer would either use a stemming algorithm or a dictionary look-up to stem many words, of course.
Although not used in this simple example, lang
can be used to allow a single stemmer instance to stem words in more than one language.
In addition to a Stemmer
, you must also implement a StemmerProvider
that, given a language, provides a Stemmer
for that language:
class StemmerProvider { public: virtual ~StemmerProvider(); virtual Stemmer::ptr getStemmer( locale::iso639_1::type lang ) const = 0; };
A simple StemmerProvider
for our simple stemmer can be implemented as:
class MyStemmerProvider : public StemmerProvider { public: Stemmer::ptr getStemmer( locale::iso639_1::type lang ) const; }; Stemmer::ptr MyStemmerProvider::getStemmer( locale::iso639_1::type lang ) const { static MyStemmer stemmer; Stemmer::ptr result; switch ( lang ) { case iso639_1::en: case iso639_1::unknown: // Handle "unknown" language since, in many cases, the language is not known. result.reset( &stemmer ); break; default: // // We have no stemmer for the given language: leave the result as null to indicate this. // Zorba will then use the built-in stemmer for the given language. // break; } resturn std::move( result ); }
To enable your stemmer to be used, you need to register it with the XmlDataManager:
void *const store = StoreManager::getStore(); Zorba *const zorba = Zorba::getInstance( store ); MyStemmerProvider provider; zorba->getXmlDataManager()->registerStemmerProvider( &provider );