Close words in vocabulary

 
Author and all-around good cowpoke
Posts: 13078
If you search using a word that Lucene does not have in the index, is there some way to get a list of words that are "close" in some sense? Alphabetically or phonetically, for example. When I was working with full-text searching in legal documents, being able to find "close" words was very important, given the variations in the spelling of people's names, for example.
Bill
 
Author
Posts: 111

Here are two different types of live examples:

http://www.lucenebook.com/search?query=stemming - look at the highlighted words and compare them to the query expression. I'm using the Snowball stemmer (part of the Lucene Sandbox) to accomplish this.

http://www.lucenebook.com/search?query=eric%7E - this one uses a FuzzyQuery, which relies on the Levenshtein distance algorithm to find words that are close enough. (For future reference, I spell my name with a "k"!)
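
To make the "close enough" idea concrete, here is a minimal plain-Java sketch of Levenshtein distance used to pick near matches out of a small word list. It is not Lucene's FuzzyQuery implementation; the class name `CloseWords`, the method names, the sample vocabulary, and the `maxEdits` cutoff are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration only; not part of Lucene.
class CloseWords {

    /** Dynamic-programming Levenshtein (edit) distance between two strings. */
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // building b[0..j) from "" takes j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // reducing a[0..i) to "" takes i deletions
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                int insertOrDelete = Math.min(curr[j - 1], prev[j]) + 1;
                curr[j] = Math.min(insertOrDelete, prev[j - 1] + cost);
            }
            int[] tmp = prev; // reuse the two rows instead of a full matrix
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    /** Returns the vocabulary terms within maxEdits of the query, in vocabulary order. */
    static List<String> closeTerms(String query, List<String> vocabulary, int maxEdits) {
        List<String> hits = new ArrayList<>();
        for (String term : vocabulary) {
            if (levenshtein(query, term) <= maxEdits) {
                hits.add(term);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // Assumed sample vocabulary, standing in for the terms of an index.
        List<String> vocabulary = Arrays.asList("erik", "erika", "eric", "derek", "lucene");
        System.out.println(closeTerms("eric", vocabulary, 1)); // prints [erik, eric]
    }
}
```

With a cutoff of one edit, "eric" matches both "eric" and "erik", while "erika" (two edits) is excluded. Scanning every term like this is only practical for a small vocabulary.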

There are other techniques that can be employed for seeing through transliterations and misspellings. In fact, Bob Carpenter contributed a wonderful case study to Chapter 10 describing this in detail using his LingPipe project.

This brings up another great selling point of the book: the Case Studies chapter. It has case studies of Nutch, Searchblox, Michaels.com, TheServerSide, jGuru, and Alias-i (LingPipe). Read it to see how Lucene is leveraged in some heavy-duty systems. I learned a lot by reading what they contributed, that's for sure!
 
Author
Posts: 23
There is another relevant solution: synonym injection via the Analyzer. Here is some context: http://www.lucenebook.com/search?query=synonym

The code that comes with the book includes a synonym engine.
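
As a rough illustration of what synonym injection means, the following plain-Java sketch expands each token with its synonyms while "analyzing" text. It is not the synonym engine from the book and does not use Lucene's Analyzer API; the class `SynonymInjector`, its tokenizing rule, and the sample synonym map are assumptions. A real analyzer would typically emit the synonyms at the same token position as the original word, whereas this sketch simply places them right after it.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical class for illustration; it does not use Lucene's Analyzer API.
class SynonymInjector {
    private final Map<String, List<String>> synonyms;

    SynonymInjector(Map<String, List<String>> synonyms) {
        this.synonyms = synonyms;
    }

    /** Lowercases the text, splits on non-alphanumerics, and adds each token's synonyms after it. */
    List<String> analyze(String text) {
        List<String> out = new ArrayList<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (token.isEmpty()) {
                continue; // split() yields an empty first token when the text starts with a separator
            }
            out.add(token);
            out.addAll(synonyms.getOrDefault(token, Collections.<String>emptyList()));
        }
        return out;
    }

    public static void main(String[] args) {
        // Assumed sample synonym map.
        Map<String, List<String>> map = new HashMap<>();
        map.put("quick", Arrays.asList("fast", "speedy"));
        SynonymInjector injector = new SynonymInjector(map);
        System.out.println(injector.analyze("The quick fox")); // prints [the, quick, fast, speedy, fox]
    }
}
```

Because the extra terms go into the index at analysis time, a later search for "fast" can match a document that only ever said "quick".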

Otis
 
William Brogden
Author and all-around good cowpoke
Posts: 13078
So if I understand you, there is no phonetic "sounds like" mechanism right now, but it looks like it would be easy to add one. The Jakarta Commons Codec toolkit has some implementations of phonetic coding, including Metaphone, which I have used in the legal document searcher. Of course, "sounds like" is different for different languages, and probably even for regional dialects within a language.
 
Erik Hatcher
Author
Posts: 111

In fact, Metaphone from Jakarta Commons Codec is an example I wrote about in the Analysis chapter! Yes, it is very easy to integrate into an analyzer. Check out the book's source code (the lia.analysis package), freely available here, to see for yourself.
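
For readers who want to see the general shape of a "sounds like" match without pulling in any library, here is a small plain-Java sketch. Note the substitution: it implements a simplified American Soundex rather than Metaphone, because Soundex fits in a few lines, and it does not use Commons Codec or Lucene. The class `SoundsLike`, its methods, and the sample names are assumptions for illustration, not the book's lia.analysis code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical illustration: simplified American Soundex, not Metaphone and not Commons Codec.
class SoundsLike {
    // Soundex digit for each letter A..Z; '0' marks letters that are not coded.
    private static final String CODES = "01230120022455012623010202";

    /** Returns a 4-character Soundex key, or "" if the word contains no ASCII letters. */
    static String soundex(String word) {
        StringBuilder out = new StringBuilder();
        char last = '0';
        for (char c : word.toUpperCase(Locale.ROOT).toCharArray()) {
            if (c < 'A' || c > 'Z') {
                continue; // ignore anything that is not an ASCII letter
            }
            char code = CODES.charAt(c - 'A');
            if (out.length() == 0) {
                out.append(c); // the first letter is kept as-is
                last = code;
                continue;
            }
            if (c == 'H' || c == 'W') {
                continue; // H and W do not separate letters with the same code
            }
            if (code != '0' && code != last) {
                out.append(code);
            }
            last = code; // a vowel resets this, so a repeated code after a vowel is kept
            if (out.length() == 4) {
                break;
            }
        }
        if (out.length() == 0) {
            return "";
        }
        while (out.length() < 4) {
            out.append('0'); // pad short keys with zeros
        }
        return out.toString();
    }

    /** Returns the vocabulary terms whose Soundex key equals the query's key. */
    static List<String> soundsLike(String query, List<String> vocabulary) {
        String key = soundex(query);
        List<String> hits = new ArrayList<>();
        for (String term : vocabulary) {
            if (!key.isEmpty() && soundex(term).equals(key)) {
                hits.add(term);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // Assumed sample names, standing in for terms found in an index.
        List<String> names = Arrays.asList("Erik", "Eric", "Erica", "Derek", "Smith", "Smyth");
        System.out.println(soundsLike("Erick", names));  // prints [Erik, Eric, Erica]
        System.out.println(soundsLike("Smithe", names)); // prints [Smith, Smyth]
    }
}
```

In an analyzer-based setup, the phonetic key would be indexed in place of (or alongside) the original term, and the query text would be encoded the same way before searching. As noted above, any such encoding is tuned to one language's pronunciation; Soundex in particular targets English surnames.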
 