Originally posted by William Brogden:
If you try for a search using a word that Lucene does not have in the index, is there some way you can get a list of words that are "close" in some sense? Alphabetic or phonetic for example. When I was working with full text searching in legal documents, being able to find "close" words was very important, given variations in spelling of people's names for example.
Bill
Here are two different types of live examples:
http://www.lucenebook.com/search?query=stemming - look at the highlighted words and compare them to the query expression. I'm using the Snowball stemmer (part of the Lucene Sandbox) to accomplish this.
http://www.lucenebook.com/search?query=eric%7E - this one uses a FuzzyQuery, which relies on the Levenshtein distance algorithm to find words that are close enough. (For future reference, I spell my name with a "k"!)
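To give a feel for what "close enough" means here, this is a minimal sketch of the Levenshtein (edit) distance calculation in plain Java. It is not Lucene's own code: the class and method names (EditDistance, levenshtein) are made up for illustration, and how FuzzyQuery turns a distance into a match threshold is not shown.

```java
// Illustrative only: a plain-Java Levenshtein distance, not Lucene's implementation.
// EditDistance and levenshtein are hypothetical names chosen for this sketch.
public class EditDistance {

    // Returns the minimum number of single-character insertions,
    // deletions, or substitutions needed to turn a into b.
    public static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];

        // Turning the empty string into the first j characters of b takes j insertions.
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }

        for (int i = 1; i <= a.length(); i++) {
            // Turning the first i characters of a into the empty string takes i deletions.
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = (a.charAt(i - 1) == b.charAt(j - 1)) ? 0 : 1;
                int insertion = curr[j - 1] + 1;
                int deletion = prev[j] + 1;
                int substitution = prev[j - 1] + cost;
                curr[j] = Math.min(Math.min(insertion, deletion), substitution);
            }
            // Reuse the two rows instead of allocating a full matrix.
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        // One substitution (c -> k) separates the two spellings.
        System.out.println(levenshtein("eric", "erik")); // prints 1
    }
}
```

A distance of 1 on a four-letter word is why a fuzzy search for "eric" can still find "erik".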
There are other techniques that can be employed for seeing through transliterations and misspellings. In fact, Bob Carpenter contributed a wonderful case study to Chapter 10 describing this in detail using his LingPipe project.
This brings up another great selling point of the book: the Case Studies chapter. It has case studies of Nutch, Searchblox, Michaels.com, TheServerSide, jGuru, and Alias-i (LingPipe). Read it to see how Lucene is leveraged in some heavy-duty systems. I learned a lot by reading what they contributed, that's for sure!