I agree with Tim. Extracting the contents of all hits, making a copy for every search query, and inserting HTML tags seems error-prone, inefficient and time-consuming. If you extract just the words and display them, you lose all the formatting. Even if you use something like toHtml to convert to formatted HTML, the result will still look quite messed up in a lot of cases and won't resemble the formatting of the original files at all.
An ideal solution would be a document viewer that is capable of displaying multiple formats with formatting, searching them and highlighting hits.
I don't know of any perfect solution, but one option I can think of is the Google Docs viewer. Here's an example document with a search query.
I would think your users expect a readable, formatted rendering of a document, one that feels like viewing it in a native viewer.
Using their Google documents API with appropriate sharing permissions and some kind of use-once-then-throwaway URLs for your documents, I think it's possible to integrate Google Docs even for private documents.
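To give an idea of what I mean by use-once-then-throwaway URLs, here is a rough sketch in plain Python. It has nothing to do with Google's API: the `OneTimeLinks` class, the in-memory dict and the 60-second lifetime are all my own assumptions for illustration. A real deployment would need shared storage and a web handler around it.

```python
import secrets
import time


class OneTimeLinks:
    """In-memory registry of single-use, expiring document tokens (illustrative only)."""

    def __init__(self, ttl_seconds=60):
        self.ttl_seconds = ttl_seconds
        self._tokens = {}  # token -> (document_id, expiry timestamp)

    def issue(self, document_id, now=None):
        """Create a random token that maps to document_id for a short time."""
        now = time.time() if now is None else now
        token = secrets.token_urlsafe(32)
        self._tokens[token] = (document_id, now + self.ttl_seconds)
        return token

    def redeem(self, token, now=None):
        """Return the document id once; None if unknown, already used, or expired."""
        now = time.time() if now is None else now
        entry = self._tokens.pop(token, None)  # pop makes the token single-use
        if entry is None:
            return None
        document_id, expires_at = entry
        return document_id if now <= expires_at else None
```

The token would go into the URL you hand to the viewer, and your server would call `redeem` when the viewer fetches the file, so the same URL stops working after the first fetch.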
Now, if viewing the whole document with highlights is not a critical use case - just highlighting the area around the search hit is enough - then you can look into the snippet highlighting capabilities provided by Lucene and Solr.
Here's an example of what it can look like.
You mention search is your own logic - not sure what logic that is - but if it's feasible for you to move to Lucene or Solr, then they can provide this kind of snippet highlighting.
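To make the idea of snippet highlighting concrete, here is a minimal sketch in plain Python. It is not how Lucene or Solr implement it (they work on analyzed tokens and score fragments); it just does a case-insensitive substring match and cuts a window of characters around each hit. The function name, the `context` size and the `<em>` tags are my own choices.

```python
import html
import re


def highlight_snippets(text, query, context=40, pre="<em>", post="</em>"):
    """Return short HTML-escaped excerpts around each case-insensitive hit of query."""
    if not query:
        return []
    snippets = []
    for match in re.finditer(re.escape(query), text, flags=re.IGNORECASE):
        # Cut a window of `context` characters on each side of the hit.
        start = max(0, match.start() - context)
        end = min(len(text), match.end() + context)
        # Escape the extracted text so only our own tags end up as markup.
        before = html.escape(text[start:match.start()])
        hit = html.escape(match.group(0))
        after = html.escape(text[match.end():end])
        prefix = "..." if start > 0 else ""
        suffix = "..." if end < len(text) else ""
        snippets.append(f"{prefix}{before}{pre}{hit}{post}{after}{suffix}")
    return snippets
```

Note that this only needs the extracted plain text, so losing the original formatting matters much less here than it does for a full document view.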
Solr gives you document extraction, through the Apache Tika framework (which is itself built over POI, PDFBox and other file-format-specific handlers), and highlighting out of the box.
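As a rough illustration of the Solr side, highlighting is switched on with request parameters on a normal search query. The sketch below only builds such a URL; the base URL, the core name and the `content` field are assumptions about your setup, and the exact highlighting parameters available depend on your Solr version.

```python
from urllib.parse import urlencode


def solr_highlight_url(base_url, query, field="content", snippets=3, fragsize=150):
    """Build a Solr /select URL that asks for highlighted snippets on one field."""
    params = {
        "q": query,
        "df": field,              # default field to search (assumed field name)
        "hl": "true",             # turn highlighting on
        "hl.fl": field,           # field(s) to generate snippets from
        "hl.snippets": snippets,  # maximum snippets per field
        "hl.fragsize": fragsize,  # approximate snippet size in characters
        "wt": "json",
    }
    return f"{base_url}/select?{urlencode(params)}"


# Hypothetical host and core name, for illustration only.
url = solr_highlight_url("http://localhost:8983/solr/docs", "search hit")
```

The response then carries a separate `highlighting` section, keyed by document id, alongside the regular results.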
With Lucene, you'd have to roll your own implementation that integrates Tika and then use the Highlighter API.