Lucene in Action: i18n

 
Gian Franco
blacksmith
Posts: 979
Hi,

...how does Lucene cope with internationalisation?

Cheers,

Gian
 
author
Posts: 20
In general, Lucene's core handles i18n just fine: all APIs support full Unicode strings, the index stores Unicode terms, etc.

But there are certain issues. For example, the core QueryParser is biased toward English (well, toward whitespace-delimited languages): it pre-splits all text on whitespace, and by default it creates a PhraseQuery whenever a chunk of text analyzes to more than one token, which of course happens all the time for languages written without whitespace between words, such as CJK. You have to sidestep these land mines. Lucene is also improving in this regard: in the next releases (3.1/4.0), this QueryParser trap is disabled by default.
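To make the whitespace trap concrete, here is a minimal plain-Java sketch. It is not Lucene code: the class `WhitespaceSplitDemo` and its methods are hypothetical stand-ins. `preSplit` imitates the parser's whitespace split, `analyze` imitates an analyzer that emits one token per Han character, and `describe` imitates the old default of turning any multi-token chunk into a phrase.

```java
import java.util.ArrayList;
import java.util.List;

class WhitespaceSplitDemo {
    // Stand-in for the parser's pre-split step: break the raw query on whitespace.
    static String[] preSplit(String query) {
        return query.trim().split("\\s+");
    }

    // Hypothetical analyzer: one token per code point if the chunk contains
    // Han characters, otherwise the whole chunk as a single token.
    static List<String> analyze(String chunk) {
        List<String> tokens = new ArrayList<>();
        boolean hasHan = chunk.codePoints()
                .anyMatch(cp -> Character.UnicodeScript.of(cp) == Character.UnicodeScript.HAN);
        if (!hasHan) {
            tokens.add(chunk);
            return tokens;
        }
        chunk.codePoints().forEach(cp -> tokens.add(new String(Character.toChars(cp))));
        return tokens;
    }

    // Imitates the old default: more than one token from a chunk means a phrase.
    static String describe(String chunk) {
        List<String> tokens = analyze(chunk);
        return (tokens.size() > 1 ? "PhraseQuery" : "TermQuery") + tokens;
    }

    public static void main(String[] args) {
        String english = "lucene in action";
        String han = "\u6211\u7231\u641c\u7d22"; // four Han characters, no spaces
        for (String query : new String[] {english, han}) {
            for (String chunk : preSplit(query)) {
                System.out.println(describe(chunk));
            }
        }
    }
}
```

The English query splits into three single-token chunks, while the Han query stays one chunk that analyzes to several tokens and so ends up as a phrase. In the real QueryParser, the switch for this behavior from 3.1 on is, as far as I know, `setAutoGeneratePhraseQueries(boolean)`; check the Javadoc for your version.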

The biggest challenge is finding the right analyzer, including decompounding, stemming, and stop-word removal as appropriate for your language(s). Lucene's contrib/analyzers has a number of language-specific analyzers that you can use as is or as a starting point.
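As a rough illustration of what "stopping" and "stemming" mean per language, here is a toy analysis chain in plain Java. Everything in it is an assumption made for the example: `ToyAnalyzer`, its tiny stop-word lists, and its suffix-stripping "stemmer" are hypothetical and far cruder than the analyzers in contrib/analyzers.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

class ToyAnalyzer {
    private final Set<String> stopWords;
    private final List<String> suffixes; // tried in order; a crude stand-in for a real stemmer

    ToyAnalyzer(Set<String> stopWords, List<String> suffixes) {
        this.stopWords = stopWords;
        this.suffixes = suffixes;
    }

    // Lowercase, split on non-letters, drop stop words, strip one suffix.
    List<String> analyze(String text) {
        List<String> out = new ArrayList<>();
        for (String raw : text.toLowerCase(Locale.ROOT).split("[^\\p{L}]+")) {
            if (raw.isEmpty() || stopWords.contains(raw)) {
                continue;
            }
            out.add(stem(raw));
        }
        return out;
    }

    private String stem(String token) {
        for (String suffix : suffixes) {
            // Keep at least three characters of the stem.
            if (token.length() > suffix.length() + 2 && token.endsWith(suffix)) {
                return token.substring(0, token.length() - suffix.length());
            }
        }
        return token;
    }

    // Illustrative configurations only; real stop lists and stemmers are much larger.
    static ToyAnalyzer english() {
        return new ToyAnalyzer(new HashSet<>(Arrays.asList("the", "a", "of", "in")),
                Arrays.asList("ing", "s"));
    }

    static ToyAnalyzer french() {
        return new ToyAnalyzer(new HashSet<>(Arrays.asList("le", "la", "les", "de")),
                Arrays.asList("es", "s"));
    }

    public static void main(String[] args) {
        System.out.println(english().analyze("The indexing of documents"));
        System.out.println(french().analyze("Les documents de la ville"));
    }
}
```

The point is only that the same pipeline needs different stop lists and different stemming rules for each language, which is why picking (or adapting) a language-specific analyzer matters.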
 
Pradeep bhatt
Ranch Hand
Posts: 8946
Suppose I want to search for the word "run" and my documents are in different languages (English, French, etc.). Would the search results return documents containing words that mean "run"? Can we write some custom code to do this?
 
Gian Franco
blacksmith
Posts: 979

Pradeep bhatt wrote: Suppose I want to search for the word "run" and my documents are in different languages (English, French, etc.). Would the search results return documents containing words that mean "run"? Can we write some custom code to do this?



You could create a custom Analyzer that takes care of that...

Cheers,

Gian
 
Otis Gospodnetic
Author
Posts: 23

Pradeep bhatt wrote: Suppose I want to search for the word "run" and my documents are in different languages (English, French, etc.). Would the search results return documents containing words that mean "run"? Can we write some custom code to do this?



Lucene can't understand the meaning of the word - it simply matches tokens. Which tokens get indexed depends on what the analysis does with the input text. This is well covered in LIA.
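A tiny plain-Java sketch of that point, with no Lucene involved: `ToyIndex` is a hypothetical inverted index that matches exact tokens only. The `canonical` map is an assumed, hand-built translation table standing in for what a custom analyzer could do at index and query time; nothing like it ships with Lucene as far as this thread says.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class ToyIndex {
    // term -> ids of documents containing that exact term
    private final Map<String, Set<Integer>> postings = new HashMap<>();
    // Hypothetical table mapping foreign-language terms to one shared term.
    private final Map<String, String> canonical;

    ToyIndex(Map<String, String> canonical) {
        this.canonical = canonical;
    }

    // The "analysis" step: lowercase, split on non-letters, apply the mapping.
    private List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        for (String raw : text.toLowerCase(Locale.ROOT).split("[^\\p{L}]+")) {
            if (!raw.isEmpty()) {
                tokens.add(canonical.getOrDefault(raw, raw));
            }
        }
        return tokens;
    }

    void add(int docId, String text) {
        for (String token : analyze(text)) {
            postings.computeIfAbsent(token, k -> new TreeSet<>()).add(docId);
        }
    }

    // Returns ids of documents containing any of the analyzed query tokens.
    Set<Integer> search(String query) {
        Set<Integer> hits = new TreeSet<>();
        for (String token : analyze(query)) {
            hits.addAll(postings.getOrDefault(token, Collections.emptySet()));
        }
        return hits;
    }

    public static void main(String[] args) {
        Map<String, String> table = new HashMap<>();
        table.put("courir", "run"); // assumed mapping, for illustration only
        for (ToyIndex index : Arrays.asList(new ToyIndex(new HashMap<>()), new ToyIndex(table))) {
            index.add(1, "I run every day");
            index.add(2, "Je vais courir demain");
            System.out.println(index.search("run"));
        }
    }
}
```

Without the mapping, a search for "run" can only find the English document, because "courir" is a different token. With the mapping applied during analysis, both documents are indexed under the same token, so the cross-language match comes from the analysis step and not from any understanding of meaning.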

Otis
 
Pradeep bhatt
Ranch Hand
Posts: 8946

Otis Gospodnetic wrote:

Pradeep bhatt wrote: Suppose I want to search for the word "run" and my documents are in different languages (English, French, etc.). Would the search results return documents containing words that mean "run"? Can we write some custom code to do this?



Lucene can't understand the meaning of the word - it simply matches tokens. Which tokens get indexed depends on what the analysis does with the input text. This is well covered in LIA.

Otis



Dear author, thanks. Can you tell me what LIA means?
 
Ankit Garg
Sheriff
Posts: 9708

Can you tell me what LIA means?


I suppose Lucene In Action :-)
 
Pradeep bhatt
Ranch Hand
Posts: 8946

Ankit Garg wrote:

Can you tell me what LIA means?


I suppose Lucene In Action :-)



Sorry authors.
 
Gian Franco
blacksmith
Posts: 979

Gian Franco wrote:
You could create a custom Analyzer that takes care of that...



...here is an example in Lucene.net regarding the same topic...

Cheers,

Gian
 