
Extracting Text from Word Doc

 
Mike London
Ranch Hand
Hello,

I downloaded the latest POI from Apache (poi_3.15), but getting even the most basic Word code working is not straightforward.

Two examples I tried, among lots of web searches:



Generates this error:



Then, looking around, I see that the netbeans/XMLException is deprecated and actually no longer in use, and I can't find any links on how to refactor existing code.

----

Trying another example...


Gives this error stack:




XSSF seems to deal with Excel, but I couldn't find any examples that worked using XSSF.

--

So, how do I just read a simple Word (XML) 2011 document in Java if the Apache stuff doesn't handle it?

I'm sure it's simple, but, so far, I can't find a single example that works.

Thanks in advance.

- mike

 
Rob Spoor
Sheriff
You should start by using XWPFDocument and XWPFWordExtractor instead of HWPFDocument and WordExtractor.
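
HWPF is POI's API for the older binary .doc format, while XWPF is for the XML-based .docx format. If you're not sure which one a given file needs, you can look at its first few bytes. This is only a rough sketch using nothing but the JDK; the class and method names (WordFormatSniffer, guessPoiApi) are made up for illustration, and a ZIP signature only tells you the file is some ZIP-based format, not that it is definitely a .docx.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

public class WordFormatSniffer {
    // Leading bytes of an OLE2 compound file, the container used by binary .doc files.
    private static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };
    // Leading bytes of a ZIP archive, the container used by .docx files.
    private static final byte[] ZIP_MAGIC = { 'P', 'K', 0x03, 0x04 };

    /** Returns "XWPF", "HWPF", or "UNKNOWN" based only on the leading bytes. */
    public static String guessPoiApi(byte[] header) {
        if (startsWith(header, ZIP_MAGIC)) {
            return "XWPF";
        }
        if (startsWith(header, OLE2_MAGIC)) {
            return "HWPF";
        }
        return "UNKNOWN";
    }

    /** Reads up to the first 8 bytes of the file and classifies them. */
    public static String guessPoiApi(Path file) throws IOException {
        byte[] header = new byte[8];
        int filled = 0;
        try (InputStream in = Files.newInputStream(file)) {
            int n;
            // read() may return fewer bytes than requested, so loop until full or EOF.
            while (filled < header.length
                    && (n = in.read(header, filled, header.length - filled)) > 0) {
                filled += n;
            }
        }
        return guessPoiApi(Arrays.copyOf(header, filled));
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }
}
```

If this reports "XWPF" for your file, then HWPFDocument and WordExtractor are the wrong classes for it.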
 
Mike London
Ranch Hand
Rob Spoor wrote:You should start by using XWPFDocument and XWPFWordExtractor instead of HWPFDocument and WordExtractor.


It looks like you missed the first part of my posting above where I did just what you suggested. Please note that particular error stack.

No simple working example that I can find.

Thanks in advance.

- mike
 
Knute Snortum
Sheriff
Hmm... this worked fine for me:

The only thing I did differently was create an XWPFWordExtractor object and then close it when I was done.  I also used POI v3.13.
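
For reference, a minimal sketch of that approach looks roughly like the following. It assumes the poi, poi-ooxml, and xmlbeans jars are on the classpath; the class name DocxTextReader and the readText helper are just illustrative names, not anything from POI itself.

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

import org.apache.poi.xwpf.extractor.XWPFWordExtractor;
import org.apache.poi.xwpf.usermodel.XWPFDocument;

public class DocxTextReader {
    /** Returns the plain text of a .docx file. */
    public static String readText(String path) throws IOException {
        // try-with-resources closes the extractor (and the stream) when done.
        try (InputStream in = new FileInputStream(path);
             XWPFWordExtractor extractor = new XWPFWordExtractor(new XWPFDocument(in))) {
            return extractor.getText();
        }
    }

    public static void main(String[] args) throws IOException {
        // Pass the path to a .docx file as the first argument.
        System.out.println(readText(args[0]));
    }
}
```

Note that this only works for .docx files; an old-style .doc file would still need HWPFDocument and WordExtractor.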
 
Tony Docherty
Saloon Keeper
Then, looking around, I see that the netbeans/XMLException is deprecated

It's not the netbeans.XMLException class you appear to need; it's the xmlbeans.XmlException class, which is in the Apache XMLBeans bundle and can be downloaded from https://xmlbeans.apache.org.

I'm not sure why POI needs this yet doesn't include or reference it in the download notes, but from the stack trace you have shown, it would appear it does.
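
If you want to confirm which jar is missing before digging further, a quick check like this might help. It is a sketch only: ClasspathCheck and isOnClasspath are hypothetical names, and the two class names listed are the ones I would expect the XWPF code to need, assuming the error is a missing-class problem.

```java
public class ClasspathCheck {
    /** Returns true if the named class can be located by this class's loader. */
    public static boolean isOnClasspath(String className) {
        try {
            // initialize=false: only locate the class, do not run its static initializers.
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String[] required = {
            "org.apache.poi.xwpf.usermodel.XWPFDocument", // expected in the poi-ooxml jar
            "org.apache.xmlbeans.XmlException"            // expected in the xmlbeans jar
        };
        for (String name : required) {
            System.out.println(name + " -> " + (isOnClasspath(name) ? "found" : "MISSING"));
        }
    }
}
```

Run it with the same classpath as your failing program; anything reported as MISSING points at a jar that still needs to be added.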
 
Mike London
Ranch Hand
Tony Docherty wrote:
Then, looking around, I see that the netbeans/XMLException is deprecated

It's not the netbeans.XMLException class you appear to need; it's the xmlbeans.XmlException class, which is in the Apache XMLBeans bundle and can be downloaded from https://xmlbeans.apache.org.

I'm not sure why POI needs this yet doesn't include or reference it in the download notes, but from the stack trace you have shown, it would appear it does.


Yep, as I also noted in my original posting, this download is now extinct with no clear replacement.

If you visit this site: http://attic.apache.org/projects/xmlbeans.html

You'll see what I mean.

Hence, my posting here.

Still baffled.

Thanks for your reply.

- mike
 
Tony Docherty
Saloon Keeper
That page gives a link to an archived version of XmlBeans (http://archive.apache.org/dist/xml/xmlbeans/) which can be downloaded.
 
Mike London
Ranch Hand
Tony Docherty wrote:That page gives a link to an archived version of XmlBeans (http://archive.apache.org/dist/xml/xmlbeans/) which can be downloaded.


Yes, I understand how to download this code, but I was hoping to find working code that doesn't require extinct and mothballed projects.

There doesn't seem to be a single current example of how to read a Word document in Java, like the one I posted about (that is, using currently supported APIs).

Strange.

OK, I guess that's my answer.  Good to know.

Thanks!

- mike
 
Mike London
Ranch Hand
Knute Snortum wrote:Hmm... this worked fine for me:

The only thing I did differently was create an XWPFWordExtractor object and then close it when I was done.  I also used POI v3.13.


Thanks.

Yeah, I finally got this to work, too. In my case, I used:

1. org.apache.xmlbeans:xmlbeans:2.6.0 and
2. poi-3 (latest version).

It seems odd that there isn't a currently supported API to read Word docs, or at least that's my interpretation of this mini project.

Appreciate all the help.

- mike
 