Apache POI - Extract/Identify Highlighted Text?

 
Mike London
Bartender
Has anyone figured out how to use the Apache POI libraries to identify highlighted text in Word DOCX files?

I have a client who marks up documents with text highlighted in various colors and would like to be able to search/extract text using highlights as a search mechanism.

I don't see any method in the Apache POI API for working with highlighted text but perhaps I'm looking at the problem incorrectly?

Thanks in advance,

- mike
 
Tim Moores
Saloon Keeper
You'll have to break the text down into runs, specifically org.apache.poi.xwpf.usermodel.XWPFRun objects, which have methods to identify whether the run is highlighted, in which color if it is, and to get the text of the run.

The way to get at the text runs is shown in the section "Reading Styles from Word Document" in https://www.devglan.com/corejava/parsing-word-document-example
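
For reference, here is a minimal sketch of the run-based approach described above, assuming POI 4.x or later on the classpath. The names `HighlightedTextExtractor` and `findHighlighted` are made up for illustration. The color getter is spelled `getTextHightlightColor()` (sic) in POI 4.x; I believe newer releases also add a correctly spelled variant, so check the Javadoc for your version. Only body paragraphs are visited here; tables, headers, and footers would need their own loops.

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;

import org.apache.poi.xwpf.usermodel.XWPFDocument;
import org.apache.poi.xwpf.usermodel.XWPFParagraph;
import org.apache.poi.xwpf.usermodel.XWPFRun;

// Hypothetical helper class, not part of POI.
class HighlightedTextExtractor {

    /** Returns one "color: text" entry per highlighted run in the body paragraphs. */
    static List<String> findHighlighted(XWPFDocument doc) {
        List<String> hits = new ArrayList<>();
        // getParagraphs() returns List<XWPFParagraph>, so no cast is needed.
        for (XWPFParagraph paragraph : doc.getParagraphs()) {
            for (XWPFRun run : paragraph.getRuns()) {
                if (!run.isHighlighted()) {
                    continue; // no highlight, or highlight set to "none"
                }
                String text = run.getText(0); // first text element of the run
                if (text == null || text.isEmpty()) {
                    continue; // e.g. a run holding only a picture or a break
                }
                // Misspelled method name as found in POI 4.x (assumption: your
                // version still has it; check the Javadoc).
                hits.add(run.getTextHightlightColor() + ": " + text);
            }
        }
        return hits;
    }

    public static void main(String[] args) throws IOException {
        try (InputStream in = new FileInputStream(args[0]);
             XWPFDocument doc = new XWPFDocument(in)) {
            findHighlighted(doc).forEach(System.out::println);
        }
    }
}
```

Run it with the path to a .docx file as the only argument. Word may split one highlighted phrase across several runs, so adjacent hits of the same color may need to be joined afterwards.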
 
Mike London
Bartender

Tim Moores wrote: You'll have to break the text down into runs, specifically org.apache.poi.xwpf.usermodel.XWPFRun objects, which have methods to identify whether the run is highlighted, in which color if it is, and to get the text of the run.

The way to get at the text runs is shown in the section "Reading Styles from Word Document" in https://www.devglan.com/corejava/parsing-word-document-example



That's cool, thanks Tim! This is great. There also appears to be a method to tell me what the color of the highlighted text is. Not sure yet as the code doesn't quite work.

----

I noticed that, using the exact code you referenced (thanks for the link), I get: "Error:(28, 44) java: incompatible types: java.lang.Object cannot be converted to org.apache.poi.xwpf.usermodel.XWPFParagraph".

If I change the type inside the for loop to Object, I lose the getRuns() method.

Should I contact the author of this code or is there a quick fix you see?

See code snippet screenshot below:

Thanks,

--mike
[Attachment: convert-error.png]
 
Dave Tolls
Master Rancher
getParagraphs returns a List<XWPFParagraph>.
You have declared your paragraphs as just a List.
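
The mismatch is an ordinary Java generics issue, so it can be reproduced without POI. In the sketch below, `Para` and `getParagraphs()` are hypothetical stand-ins for `XWPFParagraph` and `XWPFDocument.getParagraphs()`; the failing raw-type variant is shown only in a comment because it does not compile.

```java
import java.util.Arrays;
import java.util.List;

class RawListDemo {

    /** Hypothetical stand-in for XWPFParagraph. */
    static class Para {
        private final List<String> runs;

        Para(String... runs) {
            this.runs = Arrays.asList(runs);
        }

        List<String> getRuns() {
            return runs;
        }
    }

    /** Hypothetical stand-in for XWPFDocument.getParagraphs(). */
    static List<Para> getParagraphs() {
        return Arrays.asList(new Para("Hello, ", "world"), new Para("Second paragraph"));
    }

    static int countRuns() {
        // Raw type: elements are treated as Object, so this variant is rejected:
        //   List paragraphs = getParagraphs();
        //   for (Para p : paragraphs) { ... }
        //   -> incompatible types: Object cannot be converted to Para

        // Parameterized type: the loop variable can be Para and getRuns() is visible.
        List<Para> paragraphs = getParagraphs();
        int count = 0;
        for (Para p : paragraphs) {
            count += p.getRuns().size();
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(countRuns()); // prints 3
    }
}
```

Declaring the variable as `List<XWPFParagraph>` in the original snippet is the equivalent fix.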
 
Mike London
Bartender

Dave Tolls wrote: getParagraphs returns a List<XWPFParagraph>.
You have declared your paragraphs as just a List.



I didn't write the code (the sample was taken directly from the linked website), but yes, I fixed that after posting above. I was implicitly wondering how the code could have worked for the author (see link above) in the first place.

All good now.

-- mike
 
Dave Tolls
Master Rancher
Ah!
Sorry.

I'm going to guess they didn't actually compile the code?
Though that seems a strange thing to miss out.
 
Mike London
Bartender

Dave Tolls wrote: Ah!
Sorry.

I'm going to guess they didn't actually compile the code?
Though that seems a strange thing to miss out.



LOL.

I do find that the "runs" don't always exactly match the paragraphs, but they're pretty close.

This is pretty cool functionality.

I'm assuming, but haven't yet tested, that the same code would work for DOC files -- assuming the correct POI document type (DOC vs. DOCX) instantiation.

Thanks for your reply!

- mike
 
Tim Moores
Saloon Keeper

Mike London wrote: I'm assuming, but haven't yet tested, that the same code would work for DOC files -- assuming the correct POI document type (DOC vs. DOCX) instantiation


Well... something similar, in any case. For the HWPF and XWPF APIs there isn't yet a unifying API like there is for HSSF and XSSF, so classes and methods will not always have the same names or work the same way. But it shouldn't be hard to derive from the XWPF code.
 
Mike London
Bartender

Tim Moores wrote:

Mike London wrote: I'm assuming, but haven't yet tested, that the same code would work for DOC files -- assuming the correct POI document type (DOC vs. DOCX) instantiation


Well... something similar, in any case. For the HWPF and XWPF APIs there isn't yet a unifying API like there is for HSSF and XSSF, so classes and methods will not always have the same names or work the same way. But it shouldn't be hard to derive from the XWPF code.



Sounds good ... That's what I figured.

Thanks much.

- mike
 