
Create meta cards of documents using POI: Lucene or regular expressions?

 
Aaron Williams
Greenhorn
Posts: 2
Android Eclipse IDE Tomcat Server
All,

Thanks in advance. I am indexing documents from multiple data sources. I am creating a meta card for each document and storing it in an Oracle DB. I only store the meta card and a link to the document, not the document itself.

I started using POI and PDFBox to read Word, Excel, PowerPoint, and similar files.

If I want to create structured, intelligible phrases and summaries from, say, an expense report, would you recommend using Lucene or regular expressions? I've considered creating a library class that maps keywords to phrases and just allowing it to grow. I know there has to be a more powerful and efficient way to do this than regular expressions.

So, back to the expense report example: I want to find words that match Mr. or Mrs., Unilever, 2012 conference, etc., and store those in the meta card.
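For illustration, the keyword-library idea could be sketched roughly as below, using only `java.util.regex`. This is a minimal sketch, not a recommendation over Lucene: the class name `MetaCardExtractor`, the rule labels, and the patterns themselves are all hypothetical, and the input is assumed to be plain text already extracted from the document (for example by POI or PDFBox).

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: maps rule labels to regex patterns and collects matches.
class MetaCardExtractor {

    // The "library" of rules; LinkedHashMap keeps them in insertion order.
    // All labels and patterns here are illustrative assumptions.
    private static final Map<String, Pattern> RULES = new LinkedHashMap<>();
    static {
        // An honorific followed by a capitalized surname, e.g. "Mr. Smith"
        RULES.put("honorific", Pattern.compile("\\b(?:Mr|Mrs|Ms|Dr)\\.?\\s+[A-Z][a-z]+"));
        // A fixed company name
        RULES.put("company", Pattern.compile("\\bUnilever\\b"));
        // A four-digit year followed by the word "conference"
        RULES.put("event", Pattern.compile("\\b(?:19|20)\\d{2}\\s+[Cc]onference\\b"));
    }

    // Returns, for each rule label, every substring of the text that matched.
    static Map<String, List<String>> extract(String text) {
        Map<String, List<String>> card = new LinkedHashMap<>();
        for (Map.Entry<String, Pattern> rule : RULES.entrySet()) {
            List<String> hits = new ArrayList<>();
            Matcher m = rule.getValue().matcher(text);
            while (m.find()) {
                hits.add(m.group());
            }
            card.put(rule.getKey(), hits);
        }
        return card;
    }

    public static void main(String[] args) {
        String text = "Mr. Smith and Mrs Jones attended the 2012 conference hosted by Unilever.";
        System.out.println(extract(text));
    }
}
```

The returned map could then be written to the meta card row. The limitation is the one raised above: every new phrase needs a new hand-written pattern, which is where an indexing library tends to scale better.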

Thanks,
AD
 
Tim Moores
Bartender
Posts: 2894
Instead of handling all those document formats yourself, you may want to look into the Apache Tika project. It has all of that built in (it uses POI and PDFBox under the hood) and integrates well with Lucene. For semantic text handling I definitely recommend Lucene.
 