Problems reading in .docx files in java

 
Greenhorn
Posts: 6
Hi,

I'm helping to write an application that needs to read in a Word document. (The text of the document will be processed by some language-processing software; I'm working on the frontend for the project.) At the moment, I can read Word documents with the .doc extension using the Apache POI library (POIFSFileSystem, HWPFDocument and WordExtractor).
Now I want to be able to read .docx files. I've tried using XWPFDocument and XWPFWordExtractor, passing OPCPackage.create(filename) as the argument to XWPFDocument, but it's not working. The code compiles, but at runtime it throws an org.apache.xmlbeans.XmlException. I thought I had set the classpath for the relevant jar files.
I'm using Apache POI 3.5 beta6. If anyone can shed some light on this, that would be great!
 
Rancher
Posts: 43081
Welcome to JavaRanch.

Instead of OPCPackage.create, try POIXMLDocument.openPackage. Here's sample code that shows the XWPFExtractor in action.

Note that the change notes for the trunk code (post-beta 6) list various improvements in the XWPF extractor. So you may want to grab the latest source from the repository and use that to build the jar files.
 
D Slevin
Greenhorn
Posts: 6
Hi,
Cheers for that. I got it working last night, just after I posted my problem. I implemented mine a little differently.



I don't know if it's the proper way to do it, but it reads in the file perfectly. I'll have a go at the code you linked me to (no harm in knowing two ways). I also kept getting ClassNotFoundExceptions. I put the jar files it was looking for (such as xmlbeans and dom4j) on the classpath, and then it worked.

Again, thanks.

PS: If anyone needs any help reading .doc or .docx files, I'll be happy to post code here.
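
For anyone hitting the same ClassNotFoundExceptions: a small probe along these lines can report which classes are still missing before any document is opened. This is only a sketch. The class name ClasspathProbe and its helper methods are made up for illustration, and the three class names it checks are assumptions about what an XWPF setup needs (the thread itself only names the xmlbeans and dom4j jars), so adjust the list to whatever your own exception mentions.

```java
import java.util.ArrayList;
import java.util.List;

public class ClasspathProbe {

    /** Returns true if the named class can be located (without initializing it). */
    public static boolean isOnClasspath(String className) {
        try {
            Class.forName(className, false, ClasspathProbe.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            // LinkageError covers NoClassDefFoundError from half-present dependencies.
            return false;
        }
    }

    /** Returns the subset of the given class names that could not be located. */
    public static List<String> missing(String... classNames) {
        List<String> notFound = new ArrayList<>();
        for (String name : classNames) {
            if (!isOnClasspath(name)) {
                notFound.add(name);
            }
        }
        return notFound;
    }

    public static void main(String[] args) {
        // Assumed list: one representative class per library; edit as needed.
        List<String> notFound = missing(
                "org.apache.poi.xwpf.usermodel.XWPFDocument", // POI OOXML classes
                "org.apache.xmlbeans.XmlObject",              // xmlbeans
                "org.dom4j.Document");                        // dom4j
        if (notFound.isEmpty()) {
            System.out.println("All probed classes were found.");
        } else {
            System.out.println("Not on the classpath: " + notFound);
        }
    }
}
```

Run it with the same -cp setting as the real application; a class reported as missing points at the jar that still has to be added.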
 
Greenhorn
Posts: 1
Hi,
We are facing a similar problem using POI to read Word 2007 documents. Can you please tell me where you got version 3.5?
Please share the sample code as well.
Thanks and regards
 
Ulf Dittmer
Rancher
Posts: 43081

Megha Ad wrote: can you please tell me from where you get 3.5 version?


Searching for "download apache poi" should find it real quick.
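
On the request for sample code: if you just want to see what is inside a .docx before setting up POI, here is a rough sketch that uses only the JDK. To be clear, this is not POI and not what XWPFWordExtractor does internally. It relies on the fact that a .docx is a ZIP archive whose main body is stored in word/document.xml, and it simply collects the text of the w:t elements paragraph by paragraph. The names DocxTextSketch, readText and extractText are invented for this example, and headers, footers, footnotes and table layout are all ignored.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.SAXException;

public class DocxTextSketch {

    // Namespace of the WordprocessingML elements (w:p, w:r, w:t).
    static final String W_NS =
            "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    /** Opens the .docx as a ZIP archive and extracts the text of word/document.xml. */
    public static String readText(String path) throws IOException {
        try (ZipFile zip = new ZipFile(path)) {
            ZipEntry entry = zip.getEntry("word/document.xml");
            if (entry == null) {
                throw new IOException("No word/document.xml in " + path);
            }
            try (InputStream in = zip.getInputStream(entry)) {
                return extractText(in);
            }
        }
    }

    /** Concatenates the w:t text of each w:p paragraph, one paragraph per line. */
    public static String extractText(InputStream documentXml) {
        Document xml;
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            // Refuse DOCTYPE declarations so untrusted files cannot pull in external entities.
            factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
            xml = factory.newDocumentBuilder().parse(documentXml);
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("document.xml could not be parsed", e);
        }
        StringBuilder out = new StringBuilder();
        NodeList paragraphs = xml.getElementsByTagNameNS(W_NS, "p");
        for (int i = 0; i < paragraphs.getLength(); i++) {
            NodeList texts = ((Element) paragraphs.item(i)).getElementsByTagNameNS(W_NS, "t");
            for (int j = 0; j < texts.getLength(); j++) {
                out.append(texts.item(j).getTextContent());
            }
            out.append('\n');
        }
        return out.toString();
    }

    public static void main(String[] args) throws IOException {
        System.out.print(readText(args[0]));
    }
}
```

For real use the POI extractor is the better choice; a sketch like this is mainly handy for checking that a file is a readable .docx at all.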
 
Greenhorn
Posts: 1
Could you please tell me how to read a .docx file with POI? When I try to open a .docx file using XWPF, it throws this exception:


Exception in thread "main" org.apache.poi.openxml4j.exceptions.InvalidFormatException: Package should contain a content type part [M1.13]
at org.apache.poi.openxml4j.opc.ZipPackage.getPartsImpl(ZipPackage.java:148)
at org.apache.poi.openxml4j.opc.OPCPackage.getParts(OPCPackage.java:623)
at org.apache.poi.openxml4j.opc.OPCPackage.open(OPCPackage.java:209)
at org.apache.poi.openxml4j.opc.OPCPackage.open(OPCPackage.java:186)
at org.apache.poi.openxml4j.opc.OPCPackage.openOrCreate(OPCPackage.java:248)
at view.Document_XWPF_Sample.main(Document_XWPF_Sample.java:28)

Please let me know as soon as possible.
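
One thing worth checking first: this message typically means the file POI was given is not a valid OOXML package, for example an old binary .doc renamed to .docx or some other ZIP archive that has no [Content_Types].xml part. Whether that is the cause here cannot be told from the trace alone. The sketch below uses only java.util.zip to run that check without POI; the class OoxmlPackageCheck and its Result values are hypothetical names for this example.

```java
import java.io.File;
import java.io.IOException;
import java.util.zip.ZipFile;

public class OoxmlPackageCheck {

    public enum Result { MISSING_OR_EMPTY, NOT_A_ZIP, NO_CONTENT_TYPES, LOOKS_OK }

    /** Checks the minimum an OOXML package needs: a ZIP with a [Content_Types].xml part. */
    public static Result check(File file) {
        if (!file.isFile() || file.length() == 0) {
            return Result.MISSING_OR_EMPTY;
        }
        try (ZipFile zip = new ZipFile(file)) {
            return zip.getEntry("[Content_Types].xml") != null
                    ? Result.LOOKS_OK
                    : Result.NO_CONTENT_TYPES;
        } catch (IOException e) {
            // ZipFile could not read the file as an archive (e.g. a binary .doc).
            return Result.NOT_A_ZIP;
        }
    }

    public static void main(String[] args) {
        for (String arg : args) {
            System.out.println(arg + ": " + check(new File(arg)));
        }
    }
}
```

A result of NOT_A_ZIP for a file that Word opens fine usually points at the older binary format, which the HWPF classes mentioned at the top of this thread handle instead.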
 
Greenhorn
Posts: 5
I have a problem parsing a ZIP file with the Tika parser.
When I parse the file, I get this error:

java.lang.InternalError: jzentry == 0, jzfile = 139750727169136, total = 235, name = /tmp/apache-tika-8076182698055047262.tmp, i = 176, message = null
at java.util.zip.ZipFile$2.nextElement(ZipFile.java:322)
at java.util.zip.ZipFile$2.nextElement(ZipFile.java:304)
at org.apache.poi.openxml4j.opc.ZipPackage.getPartsImpl(ZipPackage.java:158)
at org.apache.poi.openxml4j.opc.OPCPackage.getParts(OPCPackage.java:615)
at org.apache.poi.openxml4j.opc.OPCPackage.open(OPCPackage.java:208)
at org.apache.tika.parser.pkg.ZipContainerDetector.detectOfficeOpenXML(ZipContainerDetector.java:118)
at org.apache.tika.parser.pkg.ZipContainerDetector.detect(ZipContainerDetector.java:74)
at org.apache.tika.detect.CompositeDetector.detect(CompositeDetector.java:61)
at org.apache.tika.Tika.detect(Tika.java:134)
at org.apache.tika.Tika.detect(Tika.java:181)
at org.apache.tika.Tika.detect(Tika.java:228)
at java.lang.Thread.run(Thread.java:619)

Please respond to my error.
Thanks
 