Lucene in Action: i18n

 
blacksmith
Posts: 979
Hi,

...how does Lucene cope with internationalisation?

Cheers,

Gian
 
author
Posts: 20
In general, Lucene's core handles i18n just fine: all APIs support full Unicode strings, the index stores Unicode terms, etc.

But there are certain issues. For example, the core QueryParser is biased toward English (well, Latin-script languages): it pre-splits all text on whitespace, and by default it creates a PhraseQuery whenever a chunk of text analyzes to more than one token (which of course happens all the time for languages written without whitespace, like CJK). You have to sidestep these land mines. Lucene is also improving in this regard: e.g., in the next (3.1/4.0) releases of Lucene, this QueryParser trap is disabled by default.
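To make the whitespace issue concrete, here is a minimal sketch in plain Java. It does not use Lucene at all: `WhitespaceSplitDemo`, `splitOnWhitespace`, and `bigrams` are hypothetical names, and overlapping bigrams are only assumed here as one common way to tokenize CJK text. It shows that an English query splits into one chunk per word, while a CJK query stays one chunk that then analyzes to several tokens, which is the case where a phrase query would be created by default.

```java
import java.util.ArrayList;
import java.util.List;

// Toy illustration only: this is NOT Lucene's QueryParser or a Lucene analyzer.
class WhitespaceSplitDemo {

    // Pre-split query text on runs of whitespace, as a whitespace-based parser would.
    static List<String> splitOnWhitespace(String text) {
        List<String> chunks = new ArrayList<>();
        for (String chunk : text.trim().split("\\s+")) {
            if (!chunk.isEmpty()) {
                chunks.add(chunk);
            }
        }
        return chunks;
    }

    // Hypothetical stand-in for CJK analysis: emit overlapping two-code-point tokens.
    static List<String> bigrams(String chunk) {
        int[] codePoints = chunk.codePoints().toArray();
        List<String> tokens = new ArrayList<>();
        if (codePoints.length == 1) {
            tokens.add(chunk); // a single character is its own token
        }
        for (int i = 0; i + 1 < codePoints.length; i++) {
            tokens.add(new String(codePoints, i, 2));
        }
        return tokens;
    }

    public static void main(String[] args) {
        // English: whitespace separates the words, so each chunk is one word.
        System.out.println(splitOnWhitespace("quick brown fox"));

        // Japanese: no whitespace, so the whole query is a single chunk...
        List<String> chunks = splitOnWhitespace("全文検索");
        System.out.println(chunks);

        // ...which analyzes to several tokens.
        System.out.println(bigrams(chunks.get(0)));
    }
}
```

With real Lucene you would get the tokens from the analyzer for your language rather than from a helper like `bigrams`.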

The biggest challenge is finding the right analyzer for your language(s), including decompounding, stemming, and stopping as appropriate. Lucene's contrib/analyzers has a number of language-specific analyzers that you can use as-is or build on.
 
Ranch Hand
Posts: 8934
Suppose I want to search for the word "run" and my documents are in different languages (English, French, etc.). Would the search results include documents containing words that mean "run"? Can we write some custom code to do this?
 
Gian Franco
blacksmith
Posts: 979

Pradeep bhatt wrote: Suppose I want to search for the word "run" and my documents are in different languages (English, French, etc.). Would the search results include documents containing words that mean "run"? Can we write some custom code to do this?



You could create a custom Analyzer that takes care of that...

Cheers,

Gian
 
Author
Posts: 23

Pradeep bhatt wrote: Suppose I want to search for the word "run" and my documents are in different languages (English, French, etc.). Would the search results include documents containing words that mean "run"? Can we write some custom code to do this?



Lucene can't understand the meaning of the word - it simply matches tokens. Which tokens get indexed depends on what the analysis does with the input text. This is well covered in LIA.
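As a rough illustration of "it simply matches tokens", here is a toy inverted index in plain Java. It is not Lucene code: `TokenMatchDemo`, `analyze`, `index`, `search`, and the `CONCEPTS` dictionary are all hypothetical, and the hand-built dictionary is only an assumption about what a custom analysis step could do. The same analysis runs at index time and at query time, so a search for "run" finds the French document only when analysis maps both words to a shared token.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Toy inverted index for illustration; not Lucene's API.
class TokenMatchDemo {

    // Hypothetical hand-built dictionary mapping words to a shared concept token.
    static final Map<String, String> CONCEPTS = Map.of(
            "run", "concept_run",
            "courir", "concept_run");

    // Lowercase, split on non-letters, and optionally map words to concept tokens.
    static List<String> analyze(String text, boolean mapConcepts) {
        List<String> tokens = new ArrayList<>();
        for (String word : text.toLowerCase(Locale.ROOT).split("[^\\p{L}]+")) {
            if (word.isEmpty()) {
                continue;
            }
            tokens.add(mapConcepts ? CONCEPTS.getOrDefault(word, word) : word);
        }
        return tokens;
    }

    // Build a map from each token to the ids of the documents that contain it.
    static Map<String, Set<Integer>> index(List<String> docs, boolean mapConcepts) {
        Map<String, Set<Integer>> postings = new HashMap<>();
        for (int id = 0; id < docs.size(); id++) {
            for (String token : analyze(docs.get(id), mapConcepts)) {
                postings.computeIfAbsent(token, k -> new TreeSet<>()).add(id);
            }
        }
        return postings;
    }

    // Analyze the query the same way and collect documents with a matching token.
    static Set<Integer> search(Map<String, Set<Integer>> postings, String query, boolean mapConcepts) {
        Set<Integer> hits = new TreeSet<>();
        for (String token : analyze(query, mapConcepts)) {
            hits.addAll(postings.getOrDefault(token, Set.of()));
        }
        return hits;
    }

    public static void main(String[] args) {
        List<String> docs = List.of("I run every morning", "J'aime courir le matin");

        // Plain token matching: only the English document contains the token "run".
        System.out.println(search(index(docs, false), "run", false)); // prints [0]

        // With the concept mapping applied on both sides, both documents match.
        System.out.println(search(index(docs, true), "run", true)); // prints [0, 1]
    }
}
```

The dictionary is the hard part: the index has no notion of meaning, so any cross-language matching has to be put into the tokens by the analysis step.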

Otis
 
Pradeep bhatt
Ranch Hand
Posts: 8934

Otis Gospodnetic wrote:

Pradeep bhatt wrote: Suppose I want to search for the word "run" and my documents are in different languages (English, French, etc.). Would the search results include documents containing words that mean "run"? Can we write some custom code to do this?



Lucene can't understand the meaning of the word - it simply matches tokens. Which tokens get indexed depends on what the analysis does with the input text. This is well covered in LIA.

Otis



Dear author, thanks. Can you tell me what LIA means?
 
Sheriff
Posts: 9643

Can you tell me what LIA means?


I suppose Lucene In Action :-)
 
Pradeep bhatt
Ranch Hand
Posts: 8934

Ankit Garg wrote:

Can you tell me what LIA means?


I suppose Lucene In Action :-)



Sorry authors.
 
Gian Franco
blacksmith
Posts: 979

Gian Franco wrote:
You could create a custom Analyzer that takes care of that...



...here is an example in Lucene.net regarding the same topic...

Cheers,

Gian
 