
Hibernate Search in Action: Is indexing documents simplified with Hibernate?

 
Chandra Bhatt
Ranch Hand
Posts: 1710
How much does it simplify indexing MS doc format files, PDFs,
and other text content so that it is usable in full-text search?

Does it contain any wrapper over the indexing APIs used by Apache Lucene,
to make this indexing part handier?

What about documents that contain non-textual content as well?
[ December 09, 2008: Message edited by: Chandra Bhatt ]
 
John Griffin
author
Greenhorn
Posts: 22
Chandra,

In chapter 13 we discuss using 3rd party libraries to extract text from various formats (MS documents, XML via SAX and DOM, plain text, PDF) and place that text into indexes using Hibernate Search. It is very easy to do with the constructs of Hibernate Search, and the chapter is filled with examples of how to do it.

Hope this helps.

John G
 
John Griffin
author
Greenhorn
Posts: 22
Chandra,

I forgot to add that I'm not quite sure what you mean by non-text, but you have to remember this is a full TEXT search engine. If you wanted to be able to search for, let's say, a particular jpg file, then the searchable data would not be the jpg file itself but text-based metadata that was entered about that jpg file.

Hope this helps.

John G
 
Chandra Bhatt
Ranch Hand
Posts: 1710
Thanks John,

It's full-text search, so for images we keep metadata describing them
(what each image is about).

I agree there is no real image search so far (I mean where you give Google
your image and it returns duplicates or similar-looking faces from
around the globe).


About indexing PDFs and docs:
I found it a bit tedious to index PDF and doc files, so I wanted to know
how easy it will be to do so with Hibernate.
 
Emmanuel Bernard
author
Ranch Hand
Posts: 62
As John said, we have examples in the book that index MS Doc and PDF files, and you can download the book source code at book.emmanuelbernard.com/hsia

Generally, Hibernate Search provides the concept of a bridge, which lets you index an otherwise unknown data type into Lucene. Bridges are pretty much like Hibernate user types, but for Hibernate Search.

Here are a few bridge examples people can implement:
- read a URL (on your entity) where a PDF is, extract the data from the PDF and index it in the Lucene Document
- read the byte[] (on your entity) representing a MS Document, extract the data and index it in the Lucene document
- store and index a Map in a particular way not natively supported by Hibernate Search

Also, Hibernate Search can index all the basic JDK types (URL, Date, numbers, etc.).
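
To make the bridge idea a bit more concrete, here is a rough, self-contained sketch of the third example above (indexing a Map in a custom way). It is not the Hibernate Search API: `FakeDocument` and `Bridge` are hypothetical stand-ins for the Lucene Document and for the library's bridge interface, so that the snippet runs without any dependency. The "one field per map entry" naming scheme is only an assumption about what "a particular way" could mean.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class MapBridgeSketch {

    /** Hypothetical stand-in for a Lucene Document: field name -> indexed values. */
    static final class FakeDocument {
        final Map<String, List<String>> fields = new LinkedHashMap<>();

        void add(String fieldName, String value) {
            fields.computeIfAbsent(fieldName, k -> new ArrayList<>()).add(value);
        }
    }

    /** Hypothetical bridge contract: turn an entity property into index fields. */
    interface Bridge {
        void set(String name, Object value, FakeDocument document);
    }

    /** Indexes each map entry as its own field named "<property>.<key>". */
    static final class MapBridge implements Bridge {
        @Override
        public void set(String name, Object value, FakeDocument document) {
            if (value == null) {
                return; // nothing to index for a null property
            }
            Map<?, ?> map = (Map<?, ?>) value;
            for (Map.Entry<?, ?> entry : map.entrySet()) {
                if (entry.getValue() == null) {
                    continue; // skip entries without a value
                }
                document.add(name + "." + entry.getKey(), String.valueOf(entry.getValue()));
            }
        }
    }

    /** Convenience helper: runs the bridge on one property and returns the resulting fields. */
    static Map<String, List<String>> index(String name, Map<String, String> attributes) {
        FakeDocument document = new FakeDocument();
        new MapBridge().set(name, attributes, document);
        return document.fields;
    }

    public static void main(String[] args) {
        Map<String, String> attributes = new LinkedHashMap<>();
        attributes.put("author", "Chandra");
        attributes.put("format", "pdf");
        // Prints the field names and values the bridge produced.
        System.out.println(index("attributes", attributes));
    }
}
```

A PDF or MS Document bridge would follow the same shape: the bridge receives the property value (a URL or a byte[]), runs a text extraction library over it, and adds the extracted text to the document as a field. In a real application the bridge would implement the interface Hibernate Search defines and write into an actual Lucene Document.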
 