Apache POI - Extract/Identify Highlighted Text?

 
Bartender
Has anyone figured out how to use the Apache POI libraries to identify highlighted text in Word DOCX files?

I have a client who marks up documents with text highlighted in various colors and would like to be able to search/extract text using highlights as a search mechanism.

I don't see any method in the Apache POI API for working with highlighted text but perhaps I'm looking at the problem incorrectly?

Thanks in advance,

- mike
 
Saloon Keeper
You'll have to break down the text into runs, specifically org.apache.poi.xwpf.usermodel.XWPFRun objects, which have methods to identify whether the run is highlighted, in which color if it is, and to get the text of the run.

The way to get at the text runs is shown in the section "Reading Styles from Word Document" in https://www.devglan.com/corejava/parsing-word-document-example
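
Once the runs are in hand, the search-by-highlight part is plain collection work. Below is a minimal sketch of that step, not the linked tutorial's code. The `Run` record is a hypothetical stand-in (Java 16+) for the data read from each XWPFRun, so the example compiles without POI on the classpath. With POI you would presumably fill it while looping over `getParagraphs()` and `getRuns()`, using `isHighlighted()` and `text()`; the color accessor appears to be spelled `getTextHightlightColor()` in the 4.x Javadoc, so check the Javadoc for your version.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class HighlightIndex {

    // Hypothetical stand-in for the data read from one XWPFRun:
    // its text and its highlight color name, or null when not highlighted.
    record Run(String text, String highlight) {}

    // Groups the text of highlighted runs by color, preserving document order.
    static Map<String, List<String>> byColor(List<Run> runs) {
        Map<String, List<String>> index = new LinkedHashMap<>();
        for (Run run : runs) {
            if (run.highlight() == null) {
                continue; // skip text that is not highlighted
            }
            index.computeIfAbsent(run.highlight(), k -> new ArrayList<>()).add(run.text());
        }
        return index;
    }

    public static void main(String[] args) {
        List<Run> runs = List.of(
                new Run("Plain intro. ", null),
                new Run("Key term", "yellow"),
                new Run(" more text ", null),
                new Run("Follow up", "green"),
                new Run("Another key term", "yellow"));
        System.out.println(byColor(runs));
        // prints {yellow=[Key term, Another key term], green=[Follow up]}
    }
}
```

Looking up `byColor(runs).get("yellow")` then gives every yellow snippet in document order.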
 
Mike London
Bartender

Tim Moores wrote: You'll have to break down the text into runs, specifically org.apache.poi.xwpf.usermodel.XWPFRun objects, which have methods to identify whether the run is highlighted, in which color if it is, and to get the text of the run.

The way to get at the text runs is shown in the section "Reading Styles from Word Document" in https://www.devglan.com/corejava/parsing-word-document-example



That's cool, thanks Tim! This is great. There also appears to be a method to tell me what the color of the highlighted text is. Not sure yet as the code doesn't quite work.

----

I noticed that, using the exact code you referenced (thanks for the link), I get an "Error:(28, 44) java: incompatible types: java.lang.Object cannot be converted to org.apache.poi.xwpf.usermodel.XWPFParagraph".

If I change the type inside the for loop to Object, then I lose the getRuns() method.

Should I contact the author of this code or is there a quick fix you see?

See code snippet screenshot below:

Thanks,

--mike
[Screenshot: convert-error.png]
 
Rancher
getParagraphs returns a List<XWPFParagraph>.
You have declared your paragraphs as just a List.
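
To illustrate with plain JDK types, here is a minimal sketch. `getNames()` stands in for `getParagraphs()` and `String` for `XWPFParagraph`; both substitutions are assumptions made so the example compiles without POI.

```java
import java.util.ArrayList;
import java.util.List;

final class RawListDemo {

    // Stand-in for a method such as getParagraphs() that returns a typed list.
    static List<String> getNames() {
        List<String> names = new ArrayList<>();
        names.add("first");
        names.add("second");
        return names;
    }

    static int totalLength() {
        // Raw declaration: the element type degrades to Object, so
        //     List raw = getNames();
        //     for (String s : raw) { ... }
        // is rejected with "incompatible types: Object cannot be converted to String".

        // Parameterized declaration: the enhanced for loop can use String directly.
        List<String> typed = getNames();
        int total = 0;
        for (String s : typed) {
            total += s.length();
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println(totalLength()); // prints 11
    }
}
```

Declaring the variable as `List<XWPFParagraph>` in the original snippet is the same fix.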
 
Mike London
Bartender

Dave Tolls wrote:getParagraphs returns a List<XWPFParagraph>.
You have declared your paragraphs as just a List.



I didn't write the code (the sample was taken directly from the linked website), but yes, I fixed that after posting above. I was implicitly wondering how the code could have worked for the author (see link above) in the first place.

All good now.

-- mike
 
Dave Tolls
Rancher
Ah!
Sorry.

I'm going to guess they didn't actually compile the code?
Though that seems a strange thing to miss out.
 
Mike London
Bartender

Dave Tolls wrote:Ah!
Sorry.

I'm going to guess they didn't actually compile the code?
Though that seems a strange thing to miss out.



LOL.

I do find that the "runs" don't always exactly match the paragraphs, but they're pretty close.
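
One likely reason: a run is a stretch of uniformly formatted text, and Word can split a single highlighted phrase across several runs. If that is what is happening, merging neighbouring runs with the same highlight gives cleaner results. A minimal sketch, assuming a hypothetical `Span` record (Java 16+) that holds the text and highlight color name copied out of each run:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Objects;

final class RunMerger {

    // Hypothetical stand-in for one run: text plus highlight color (null = none).
    record Span(String text, String highlight) {}

    // Joins neighbouring runs that share the same highlight state into one span.
    static List<Span> mergeAdjacent(List<Span> runs) {
        List<Span> merged = new ArrayList<>();
        for (Span run : runs) {
            int last = merged.size() - 1;
            if (last >= 0 && Objects.equals(merged.get(last).highlight(), run.highlight())) {
                // Same highlight as the previous span: append the text to it.
                Span prev = merged.get(last);
                merged.set(last, new Span(prev.text() + run.text(), prev.highlight()));
            } else {
                merged.add(run);
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        List<Span> runs = List.of(
                new Span("High", "yellow"),
                new Span("lighted phrase", "yellow"),
                new Span(" and plain text", null));
        for (Span s : mergeAdjacent(runs)) {
            System.out.println(s.highlight() + ": " + s.text());
        }
    }
}
```

Merging should be done per paragraph, so that highlighted text at the end of one paragraph is not glued to the start of the next.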

This is pretty cool functionality.

I'm assuming, but haven't yet tested, that the same code would work for DOC files, given the correct POI document type (DOC vs. DOCX) instantiation.

Thanks for your reply!

- mike
 
Tim Moores
Saloon Keeper

Mike London wrote:I'm assuming, but haven't yet tested, that the same code would work for DOC files -- assuming the correct POI document type (DOC vs. DOCX) instantiation


Well... something similar, in any case. For the HWPF and XWPF APIs there isn't yet a unifying API like there is for HSSF and XSSF, so classes and methods will not always have the same names or work the same way. But it shouldn't be hard to derive from the XWPF code.
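
One way to keep the extraction logic shared despite the missing common interface is to pass the accessors in as functions. This is only a sketch of that idea: `HighlightFilter` and `FakeRun` are hypothetical names, and the POI wiring shown in the comments is an assumption that is not compiled here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;
import java.util.function.Predicate;

final class HighlightFilter {

    // Format-agnostic filter: the caller supplies how to test and read a run.
    static <R> List<String> highlightedText(List<R> runs,
                                            Predicate<R> isHighlighted,
                                            Function<R, String> text) {
        List<String> result = new ArrayList<>();
        for (R run : runs) {
            if (isHighlighted.test(run)) {
                result.add(text.apply(run));
            }
        }
        return result;
    }

    // Hypothetical stand-in so the sketch runs without POI on the classpath.
    record FakeRun(String text, boolean highlighted) {}

    public static void main(String[] args) {
        List<FakeRun> runs = List.of(new FakeRun("keep", true), new FakeRun("skip", false));
        System.out.println(highlightedText(runs, FakeRun::highlighted, FakeRun::text));
        // prints [keep]

        // Assumed wiring with POI (not compiled here; verify against your POI version):
        //   highlightedText(paragraph.getRuns(), XWPFRun::isHighlighted, XWPFRun::text);
        //   highlightedText(characterRuns, CharacterRun::isHighlighted, CharacterRun::text);
    }
}
```

Only the code that opens the document and collects the runs would then differ between the DOC and DOCX paths.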
 
Mike London
Bartender

Tim Moores wrote:

Mike London wrote:I'm assuming, but haven't yet tested, that the same code would work for DOC files -- assuming the correct POI document type (DOC vs. DOCX) instantiation


Well... something similar, in any case. For the HWPF and XWPF APIs there isn't yet a unifying API like there is for HSSF and XSSF, so classes and methods will not always have the same names or work the same way. But it shouldn't be hard to derive from the XWPF code.



Sounds good ... That's what I figured.

Thanks much.

- mike
 