Originally posted by Arjun Shastry:
IndexSearcher in Lucene accepts the query and returns the Hits object. As stated in one tutorial, Lucene is an IR library rather than a search engine. Does the implementor need to build a cache or crawler for even faster searching and indexing?
Also, how are the results returned? According to the tutorials on the net, it uses a score for a page (a Document in general). How different is this from Google's PageRank? To my knowledge, PageRank calculates the score based not only on how frequently the page is accessed but also on its backlinks (the total number of pages pointing to that page). How is the score of a Document calculated in Lucene?
Does Hit stand for Hypertext Induced Topic Selection, the algorithm used to rank the documents?
[ January 06, 2005: Message edited by: Arjun Shastry ]
I call Lucene a "search engine" because it's a convenient and recognizable term. Technically it is an API that has no user interface, no crawler, and no parsers. To me, it is the "engine", whereas Google is a search "application". Semantics and word games aside, it is not necessary to implement caching around Lucene. The Hits object itself has some built-in caching for the most recently accessed (or soon-to-be-accessed) documents.
Hits from Lucene are ordered by score, a sophisticated calculation which puts more relevant documents (to the query) at the top, and less relevant documents below.
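To give a rough feel for what a content-based score looks like, here is a minimal sketch of TF-IDF style ranking in plain Java. It is not Lucene's code and leaves out much of what Lucene does (boosts, query normalization, coordination factors). The class name TfIdfSketch and its methods are made up for this illustration, and the particular weighting (square root of term frequency, a log-based inverse document frequency, division by the square root of document length) is only a simplified assumption in the spirit of classic vector-space scoring.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical illustration class; not part of Lucene.
public class TfIdfSketch {

    // Split text into lowercase word tokens (a crude stand-in for an analyzer).
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase().split("\\W+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Score one tokenized document against the query terms.
    static double score(List<String> queryTerms, List<String> doc, List<List<String>> corpus) {
        double total = 0.0;
        for (String term : queryTerms) {
            int freq = Collections.frequency(doc, term);
            if (freq == 0) {
                continue; // term does not occur in this document
            }
            int docFreq = 0; // number of documents containing the term
            for (List<String> d : corpus) {
                if (d.contains(term)) {
                    docFreq++;
                }
            }
            double tf = Math.sqrt(freq);
            // Rare terms get a higher weight than common ones.
            double idf = 1.0 + Math.log((double) corpus.size() / (docFreq + 1));
            total += tf * idf;
        }
        // Shorter documents get a mild advantage over longer ones.
        return doc.isEmpty() ? 0.0 : total / Math.sqrt(doc.size());
    }

    // Return the indexes of matching documents, best score first.
    static List<Integer> rank(String query, List<String> texts) {
        List<List<String>> corpus = new ArrayList<>();
        for (String t : texts) {
            corpus.add(tokenize(t));
        }
        List<String> queryTerms = tokenize(query);
        final double[] scores = new double[texts.size()];
        List<Integer> hits = new ArrayList<>();
        for (int i = 0; i < texts.size(); i++) {
            scores[i] = score(queryTerms, corpus.get(i), corpus);
            if (scores[i] > 0) {
                hits.add(i); // only documents that match at least one term
            }
        }
        hits.sort((a, b) -> Double.compare(scores[b], scores[a]));
        return hits;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList(
            "Lucene is a search library",
            "Lucene scoring ranks matching documents by relevance score",
            "Cooking pasta at home");
        System.out.println(rank("lucene scoring", docs));
    }
}
```

Note that the score depends only on the words in the query and in the documents. Nothing about links between documents enters into it, and the document that matches no query term is left out of the results altogether.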
Google's PageRank is comparable to how Nutch, a system built around Lucene, ranks its documents. It does lots of Lucene trickery to weight documents in a PageRank-like fashion. Most of us, however, are not building web crawlers where PageRank works decently. In intranet or other domains of use, the built-in Lucene scoring mechanism works amazingly well.
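For contrast, here is a small sketch of the textbook PageRank iteration, which looks only at the link graph and never at the query. This is the simplified published form of the algorithm, not what Google or Nutch actually run. The class PageRankSketch, the tiny three-page graph, and the damping value of 0.85 are assumptions chosen for illustration.

```java
import java.util.Arrays;

// Hypothetical illustration class; not taken from Nutch or Lucene.
public class PageRankSketch {

    // links[i] lists the pages that page i links to.
    static double[] pageRank(int[][] links, double damping, int iterations) {
        int n = links.length;
        double[] rank = new double[n];
        Arrays.fill(rank, 1.0 / n); // start with equal rank everywhere
        for (int it = 0; it < iterations; it++) {
            double[] next = new double[n];
            // Every page receives a small baseline share.
            Arrays.fill(next, (1.0 - damping) / n);
            for (int i = 0; i < n; i++) {
                if (links[i].length == 0) {
                    // A page with no outgoing links spreads its rank evenly.
                    for (int j = 0; j < n; j++) {
                        next[j] += damping * rank[i] / n;
                    }
                } else {
                    // Otherwise it splits its rank among the pages it links to.
                    for (int target : links[i]) {
                        next[target] += damping * rank[i] / links[i].length;
                    }
                }
            }
            rank = next;
        }
        return rank;
    }

    public static void main(String[] args) {
        // Page 0 links to 2, page 1 links to 2, page 2 links to 0.
        int[][] links = { {2}, {2}, {0} };
        System.out.println(Arrays.toString(pageRank(links, 0.85, 50)));
    }
}
```

In this graph page 2 has two backlinks and ends up with the highest rank, while page 1, which nothing links to, ends up with the lowest. A scheme like this needs a link graph to work with, which a crawled web has and a typical intranet document collection often does not.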
I have never heard that acronym for HIT, and I do not think it applies to Lucene's concept of a Hit. A "hit" is synonymous with "match".