Highlight search words in files (DOC, Excel, PDF, etc.) using Java

 
Greenhorn
Posts: 8
I have a web application with a search feature. The search results can be files of different types.

When the user opens a file, I want the search words to be highlighted.

Can anybody please help me?
 
Bartender
Posts: 1210
25
Android Python PHP C++ Java Linux
Depends on what you've used to implement search inside files. How have you implemented it? What frameworks are you using?
 
mangala shenoy
Greenhorn
Posts: 8
I am using JSF and SDO, and the search uses my own logic, which works fine. Once the search results are displayed, I want to open a file with the search words highlighted.
I tried Lucene, JACOB, and inserting HTML tags into the file.

Nothing seems to work.
 
Bartender
Posts: 7645
178
How are you displaying the search results? What do you mean by "inserting html tags in the file"?
 
mangala shenoy
Greenhorn
Posts: 8
I am putting the search results in session scope and displaying them in a JSF page using a JSF data table. Each result has a link to open the file.

Clicking the file name calls a function that reads the file and writes it to the ServletOutputStream.

I tried inserting HTML tags before writing to the output stream. For example, for .doc I used WordExtractor and then did a replaceAll on the search word to wrap it in HTML tags.
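For readers following along, here is a minimal sketch of what the "extract text, then wrap the word in HTML tags" approach could look like if the response is served as HTML rather than as the original .doc. It assumes the plain text has already been extracted (for example with WordExtractor); the class and method names are hypothetical, and `<mark>` is just one possible highlight tag. The response would also need an HTML content type, otherwise the browser will not interpret the tags, and all of the original document formatting is lost.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Sketch only: turns already-extracted plain text into HTML with search hits marked. */
class HtmlHighlighter {

    /** Escapes the characters that have a special meaning in HTML text. */
    static String escapeHtml(String s) {
        StringBuilder out = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '&': out.append("&amp;"); break;
                case '<': out.append("&lt;"); break;
                case '>': out.append("&gt;"); break;
                case '"': out.append("&quot;"); break;
                default: out.append(c);
            }
        }
        return out.toString();
    }

    /** Wraps every case-insensitive occurrence of term in mark tags. */
    static String highlight(String plainText, String term) {
        if (term.isEmpty()) {
            return escapeHtml(plainText); // nothing to highlight
        }
        // Pattern.quote treats the search term literally, not as a regex.
        Pattern p = Pattern.compile(Pattern.quote(term),
                Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE);
        Matcher m = p.matcher(plainText);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (m.find()) {
            // Escape the text between hits, then emit the hit inside the tag.
            out.append(escapeHtml(plainText.substring(last, m.start())));
            out.append("<mark>").append(escapeHtml(m.group())).append("</mark>");
            last = m.end();
        }
        out.append(escapeHtml(plainText.substring(last)));
        return out.toString();
    }

    /** Builds a bare-bones HTML page around the highlighted text. */
    static String toHtmlPage(String plainText, String term) {
        return "<!DOCTYPE html><html><body><pre>"
                + highlight(plainText, term)
                + "</pre></body></html>";
    }
}
```

Escaping before inserting the tags matters: a plain replaceAll on unescaped text can break the page if the document itself contains characters like < or &.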
 
Tim Moores
Bartender
Posts: 7645
178
So the output is HTML? Because if you stream the actual file contents then it's far from trivial to alter the file contents so that arbitrary words will be highlighted, especially for the structured document formats you mention. I predict you will end up not doing this.
 
Karthik Shiraly
Bartender
Posts: 1210
25
Android Python PHP C++ Java Linux
I agree with Tim. Extracting the contents of all hits, making a copy for every search query, and inserting HTML tags seems error-prone, inefficient and time-consuming. If you extract just the words and display them, you lose all the formatting. Even if you use something like toHtml to convert to formatted HTML, it'll still look quite messed up in a lot of cases and not resemble the formatting of the original files at all.

An ideal solution would be a document viewer that is capable of displaying multiple formats with formatting, searching them and highlighting hits.

I don't know of any perfect solution, but one I can think of is the Google Docs viewer. Here's an example document with a search query.
I would think your users expect a readable, formatted rendering of a document that feels like viewing it in a native viewer.
Using the Google Documents API with appropriate sharing permissions and some kind of use-once-then-throwaway URLs for your documents, I think it's possible to integrate Google Docs even for private documents.

Now, if highlighting and viewing the whole document is not a critical use case, and just highlighting the area around the search hit is enough, then you can look into the snippet highlighting capabilities provided by Lucene and Solr.
Here's an example of what it can look like.
You mention that search is your own logic (I'm not sure what logic that is), but if it's feasible for you to move to Lucene or Solr, they can provide this kind of snippet highlighting.
Solr gives you document extraction and highlighting out of the box, through the Apache Tika framework (which is itself built over POI, PDFBox and other file-format-specific handlers).
With Lucene, you'd have to roll your own implementation that integrates Tika and then use the Highlighter API.
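To make "snippet highlighting" concrete, here is a rough plain-Java illustration of the idea: show a little context around each hit and mark the hit itself. This is not the Lucene Highlighter API or Solr's implementation, just a stand-in using java.util.regex; the class name, the `<em>` tag and the `context` parameter are assumptions made for the example, and HTML escaping is left out for brevity.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Sketch only: builds short snippets around each hit in already-extracted text. */
class SnippetSketch {

    /**
     * Returns one snippet per hit: up to `context` characters on each side,
     * with the hit wrapped in em tags and "..." where the text was cut.
     */
    static List<String> snippets(String text, String term, int context) {
        List<String> result = new ArrayList<>();
        if (term.isEmpty()) {
            return result; // an empty term would match everywhere
        }
        Matcher m = Pattern.compile(Pattern.quote(term), Pattern.CASE_INSENSITIVE)
                .matcher(text);
        while (m.find()) {
            int from = Math.max(0, m.start() - context);
            int to = Math.min(text.length(), m.end() + context);
            StringBuilder sb = new StringBuilder();
            if (from > 0) {
                sb.append("...");
            }
            sb.append(text, from, m.start());                  // context before the hit
            sb.append("<em>").append(m.group()).append("</em>"); // the hit itself
            sb.append(text, m.end(), to);                      // context after the hit
            if (to < text.length()) {
                sb.append("...");
            }
            result.add(sb.toString());
        }
        return result;
    }
}
```

For example, `SnippetSketch.snippets("The quick brown fox jumps over the lazy dog", "fox", 6)` yields a single snippet, `...brown <em>fox</em> jumps...`. A real highlighter also works with the analyzed query (stemming, phrases, multiple terms), which this sketch does not attempt.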
 
Karthik Shiraly
Bartender
Posts: 1210
25
Android Python PHP C++ Java Linux
Just found this after typing my reply: You might also be interested in Aspose's capabilities for this. It's a commercial product.
 
mangala shenoy
Greenhorn
Posts: 8
No, the output need not be HTML. My requirement is that when a user opens a file, the search terms are highlighted.
I just tried putting in HTML tags. It didn't work.

I tried Lucene. With Lucene it is possible to highlight the words in the search results, but I don't think it has an option for highlighting the terms in a file.

 
Tim Moores
Bartender
Posts: 7645
178

mangala shenoy wrote:I just tried putting in HTML tags. It didn't work.


Of course not. You can't mix and match structured document formats and HTML.

But I don't think [Lucene] has an option for highlighting the terms in a file.


Correct. It can help you find the stuff that you want to highlight, but your code needs to do the highlighting, and that is different for each kind of document format. For PDF and DOC(X) in particular this will be hard, if not impossible. XLS(X) is a bit easier if you use the Apache POI library.
 
mangala shenoy
Greenhorn
Posts: 8
Can you please tell me how to integrate the Google Docs API in Java?
 
Karthik Shiraly
Bartender
Posts: 1210
25
Android Python PHP C++ Java Linux
If your documents are (or can be made) available at public URLs, then it's a simple matter of showing a link like this in your search results table:
<a href="https://docs.google.com/viewer?url=[your-document-url]&q=[search query]" target="_blank" rel="nofollow">View document</a>
It'll open the viewer in another window/tab.
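The link above can be built in Java roughly as follows. This is a small sketch: the class and method names are made up for illustration, and the only thing taken from the link format is the `url` and `q` parameters. The important part is URL-encoding both values, since a document URL or search query containing spaces, `&` or `?` would otherwise break the link.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

/** Sketch only: builds a viewer link from a document URL and a search query. */
class ViewerLink {

    static final String VIEWER_BASE = "https://docs.google.com/viewer";

    /** Returns the viewer URL with both parameters URL-encoded. */
    static String viewerUrl(String documentUrl, String searchQuery) {
        return VIEWER_BASE + "?url=" + encode(documentUrl) + "&q=" + encode(searchQuery);
    }

    private static String encode(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is required to be supported by every Java runtime.
            throw new IllegalStateException(e);
        }
    }
}
```

For example, `ViewerLink.viewerUrl("http://example.com/docs/a b.pdf", "search words")` returns `https://docs.google.com/viewer?url=http%3A%2F%2Fexample.com%2Fdocs%2Fa+b.pdf&q=search+words`. When the result is written into an href attribute in a page, the `&` should additionally be HTML-escaped as `&amp;`.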

But if your documents have security requirements (for example, they can be opened only by selected people), then you should go through Google's guides and prototype with their Java client library before integrating it. The viewer certainly seems to solve your problem, but you should check whether there are any security incompatibilities with your system. I have not integrated Google Docs myself so far, so I'm basing my answer only on a shallow review of their documentation.
 
mangala shenoy
Greenhorn
Posts: 8
The documents will be stored in a location on the server, and I need to open them from there. For the Google Docs viewer, the document has to be at some URL, right?

Do you know about JACOB? For doc and docx I could do it using JACOB, but I am not able to do it for ppt and xls.
 
Karthik Shiraly
Bartender
Posts: 1210
25
Android Python PHP C++ Java Linux
By JACOB, are you referring to this project, the Java COM bridge?
I have not used it. As I understand it, it uses the COM interfaces exposed by Word, Excel, and other MS Office components via JNI. I guess MS Office has to be installed on your web server and both have to be running on Windows for this solution to work. What problem are you facing exactly? Is there any error information?
 
mangala shenoy
Greenhorn
Posts: 8
I didn't know that, thanks.

I tried the Google Docs viewer. The files have to be at some URL, right? So I think I can't use it.

Is there any other solution for my requirement?
 
Karthik Shiraly
Bartender
Posts: 1210
25
Android Python PHP C++ Java Linux

mangala shenoy wrote:I tried the Google Docs viewer. The files have to be at some URL, right? So I think I can't use it.
Is there any other solution for my requirement?



Yes, the Google Docs viewer requires some URL to get the contents. But this URL need not be a direct URL to a document.
It could be a plain servlet URL which reads the document from wherever it's stored and writes the contents to the output stream.
Or the document could be uploaded to a Google Docs account (perhaps temporarily, then deleted). That uploaded document will have a URL which can be sent to the viewer.

If these are private docs and there are security constraints, then I can think of a bunch of options to make it secure (or at least as obscure as possible), though I have no idea which one would suit your system. Perhaps examine the request headers and allow the request only if it comes from the Google Docs viewer (hopefully the headers contain such information). Or check whether the Google Docs API's ACL permissions and the viewer can be integrated. Or authenticate first and then redirect to the viewer. If I were you, I'd prototype these approaches to see which one(s) give fairly foolproof security.
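As one illustration of the "use-once-then-throwaway URLs" idea mentioned earlier in the thread, here is a small sketch of a one-time token registry. Everything in it is hypothetical (class name, method names, token length): the application would issue a token when it renders the search results, put it in the document URL handed to the viewer, and the servlet serving the document would redeem it. It has no expiry, and it assumes the viewer fetches the document only once, which would need to be verified in a prototype.

```java
import java.security.SecureRandom;
import java.util.Base64;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

/** Sketch only: maps random one-time tokens to document paths. */
class OneTimeLinks {

    private final SecureRandom random = new SecureRandom();
    private final ConcurrentMap<String, String> pending = new ConcurrentHashMap<>();

    /** Registers a document path and returns a hard-to-guess, URL-safe token for it. */
    public String issue(String documentPath) {
        byte[] bytes = new byte[24];
        random.nextBytes(bytes);
        String token = Base64.getUrlEncoder().withoutPadding().encodeToString(bytes);
        pending.put(token, documentPath);
        return token;
    }

    /**
     * Returns the document path for a token and invalidates the token.
     * A second call with the same token, or an unknown token, yields empty.
     */
    public Optional<String> redeem(String token) {
        if (token == null) {
            return Optional.empty();
        }
        // remove() is atomic, so two concurrent requests cannot both succeed.
        return Optional.ofNullable(pending.remove(token));
    }
}
```

A serving servlet would call `redeem` with the token from the request and return a 404 when the result is empty. Unredeemed tokens stay in memory here, so a real version would also need a timeout.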

I'm not aware of any other solution, but that doesn't mean one doesn't exist. You'll probably have to look at commercial offerings, like Aspose or box.com.

Perhaps other forum members know some good viewers.
You can try posting another question that asks specifically about web-based document viewers instead of highlighting, maybe in the HTML forum (because the solution is likely to be a Flash implementation), and then evaluate the suggestions you get.
 