Michael McCandless

Recent posts by Michael McCandless

The book shows how to take a hot backup of the index (ie backing up in the background even while an IndexWriter is still making changes to the index). The resulting backup is still a point-in-time copy of the index, as of when the backup began.
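Roughly, the approach (using SnapshotDeletionPolicy) looks like this -- an untested sketch against the 3.0-era API, where dir, analyzer and copyFile are placeholders you'd fill in:

SnapshotDeletionPolicy snapshotter =
    new SnapshotDeletionPolicy(new KeepOnlyLastCommitDeletionPolicy());
IndexWriter writer = new IndexWriter(dir, analyzer, snapshotter,
                                     IndexWriter.MaxFieldLength.UNLIMITED);

// ... indexing continues on other threads ...

try {
  IndexCommit commit = snapshotter.snapshot();  // point-in-time view of the index
  for (String fileName : commit.getFileNames())
    copyFile(fileName);                         // copy each file to your backup location
} finally {
  snapshotter.release();  // allow the snapshotted files to be deleted again
}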
You're welcome!

Near-real-time search was added in 2.9.
Those restrictions sound fine w/ Lucene. I guess the biggest thing is not using the filesystem, ie we'd need a Directory impl backed by the App Engine datastore (and, if I remember right, there's a max file size restriction -- maybe 10 MB? -- in the datastore).

You'd have to switch to SerialMergeScheduler, too, since the default ConcurrentMergeScheduler does merges with its own background threads.
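Ie, something like this (untested sketch, 3.0-era API):

IndexWriter writer = new IndexWriter(dir, analyzer,
                                     IndexWriter.MaxFieldLength.UNLIMITED);
// GAE won't let Lucene spawn its own merge threads, so run merges
// serially on the thread that's making the index changes:
writer.setMergeScheduler(new SerialMergeScheduler());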
Yes, JavaRanch's forums use Lucene ;)

That PoweredBy page is actually a vast undercount -- users of Lucene are not required to post their usage there. But Lucene is used by all kinds of products/companies. LinkedIn, Twitter, Netflix, HotJobs are some examples that come to mind...
In general, Lucene's core handles i18n just fine -- all APIs support full Unicode strings, the index stores Unicode terms, etc.

But there are certain issues. For example, the core QueryParser is biased towards English (well, Latin languages): it pre-splits all text on whitespace, and by default it creates a PhraseQuery whenever a chunk of text analyzes to more than one token (which of course happens all the time for non-whitespace languages like CJK). You have to sidestep these land mines... (Lucene is improving in this regard, though -- eg in the next (3.1/4.0) releases, this QueryParser trap is disabled by default).

The biggest challenge is finding the right analyzer, including decompounding, stemming, stopping, as appropriate, for your language(s). Lucene's contrib/analyzers has a number of language-specific analyzers that you can use/iterate from.
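For example, with German content you might do something like this (untested sketch; GermanAnalyzer is in contrib/analyzers, and the "contents" field name is made up):

Analyzer analyzer = new GermanAnalyzer(Version.LUCENE_30);

// Use the same analyzer at indexing time...
IndexWriter writer = new IndexWriter(dir, analyzer,
                                     IndexWriter.MaxFieldLength.UNLIMITED);

// ... and at query time, so stemming & stopping line up:
QueryParser parser = new QueryParser(Version.LUCENE_30, "contents", analyzer);
Query query = parser.parse("Häuser");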
prepareCommit lets you do a 2-phased commit with Lucene and some other transactional resource(s).

Ie, you first call .prepareCommit in Lucene (and similarly in your other resources), which does nearly all the work required for a commit but does not in fact make the changes visible in the index.

If any resources hit an error during this phase, you can then call .rollback to remove all the changes.

Else, you then call .commit to make the changes visible.

If you don't call .prepareCommit yourself externally, ie just call .commit, then internally Lucene will call .prepareCommit and then .commit.
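Put together, the 2-phased pattern looks roughly like this (sketch; otherResource stands in for whatever other transactional resource you're coordinating with):

try {
  writer.prepareCommit();   // nearly all the commit work; changes not yet visible
  otherResource.prepare();  // hypothetical: prepare your other resource(s) too

  writer.commit();          // now the changes become visible in the index
  otherResource.commit();
} catch (Exception e) {
  writer.rollback();        // removes all changes since the last commit (and closes the writer)
  otherResource.rollback();
}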
One of the nice recent additions to Lucene is a feature called near-real-time search (it's covered in the book -- the IndexWriter.getReader method).

This makes the turnaround time between making changes (adds/deletes/updates) to the index, and opening a new searcher that can see those changes, much faster, because you no longer have to .commit or .close the IndexWriter in order to see the changes.
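In code it's roughly this (untested sketch, 2.9/3.0 API):

writer.addDocument(doc);                   // not committed yet

IndexReader reader = writer.getReader();   // NRT reader: sees the uncommitted changes
IndexSearcher searcher = new IndexSearcher(reader);
// ... run your searches ...
searcher.close();
reader.close();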
Or, you can help push towards a more complete and mature solution for Lucene running on GAE ;)

Do you have a sense of what challenges/limitations GAE imposes?
Lucene itself does no caching of queries or query terms, but Solr does.

That said, the OS does: it caches recently read pages from the IO system, in RAM in its IO cache.

This means the first search for "foo bar" will take longer (it must go to the IO system, eg a local hard drive or SSD), but a subsequent identical search will be fast.
I explained all the new features as of Lucene 3.0.2; there's nothing missing as far as I know (though I could very well have missed something!).

But Lucene is a very active project, so development continues; once 3.1/4.0 come out, they will have new & interesting features.
IO errors (disk full, permission problems, etc.). It's remotely possible you've hit a bug (and then I'll be very interested in the infoStream output!). Also, if you're just curious about the inner workings of IndexWriter (when it flushes, what the RAM efficiency for the segment was, when & what merges are running, etc.), it's fun to turn on.
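Turning it on is a one-liner (3.0-era API):

writer.setInfoStream(System.out);  // or any PrintStream, eg one writing to a log file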
Lucene is very approachable -- you just need Chapter 1 to get "started". It shows you the basics of indexing text & searching for it. But if you want to do more interesting things, explore different query types, tune your analysis, etc., then you should read the 3 dedicated chapters on indexing, searching and analysis. The two further chapters on searching (advanced techniques and extending) are for even deeper use cases / customization.
Wildcard matching on any analyzer that does stemming will be problematic because stemming alters the original words. And, there's no general way to stem the term you are using for the wildcard search.
There's this project:

http://code.google.com/p/gaelucene/

But I haven't used it personally.
Sorry, Lucene in Action 2 only briefly touches on this [good] topic.

A single Lucene index can scale quite large, depending on your performance requirements.

Beyond that you'll have to break the index across multiple machines. Solr has support for this out of the box. Katta does as well.
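At the raw Lucene level, the building block for searching several shards as one is MultiSearcher/ParallelMultiSearcher -- here's an untested single-JVM sketch (across machines you'd want Solr's distributed search, or Katta, instead):

IndexSearcher shard1 = new IndexSearcher(IndexReader.open(dir1));
IndexSearcher shard2 = new IndexSearcher(IndexReader.open(dir2));

// Searches both shards concurrently and merges the hits:
Searcher searcher = new ParallelMultiSearcher(
    new Searchable[] { shard1, shard2 });
TopDocs hits = searcher.search(query, 10);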