Ulf Dittmer wrote:Apache POI can do lots of things with such files, but its API is not particularly intuitive. I'm not aware of more capable free libraries, though.
Wim Vanni wrote:Docx4j could be a solution.
If you can arrange for the Word documents to arrive as XML, you might be better off using XML tools such as XPath to do the searches.
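A minimal sketch of the XPath idea, using only the JDK's built-in XML APIs. It assumes the document text sits in elements whose local name is "t" (as in w:t), and it matches on the local name so the same query works whatever namespace the file declares. The class and method names are made up for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class WordXmlSearch {
    /** Returns the content of every text run (local name "t") that contains the needle. */
    static List<String> findRuns(String wordXml, String needle) {
        try {
            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            dbf.setNamespaceAware(true); // needed so local-name() sees "t" rather than "w:t"
            Document doc = dbf.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(wordXml)));
            NodeList runs = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate("//*[local-name()='t']", doc, XPathConstants.NODESET);
            List<String> hits = new ArrayList<>();
            for (int i = 0; i < runs.getLength(); i++) {
                String text = runs.item(i).getTextContent();
                if (text.contains(needle)) {
                    hits.add(text);
                }
            }
            return hits;
        } catch (Exception e) {
            throw new IllegalStateException("could not parse or query the XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<w:document xmlns:w=\"http://schemas.openxmlformats.org/wordprocessingml/2006/main\">"
                + "<w:body><w:p><w:r><w:t>Hello world</w:t></w:r></w:p>"
                + "<w:p><w:r><w:t>Goodbye</w:t></w:r></w:p></w:body></w:document>";
        System.out.println(findRuns(xml, "world")); // prints [Hello world]
    }
}
```

One caveat: Word often splits a sentence across several runs, so a phrase can straddle two "t" elements. A real search would probably join the runs of each paragraph before matching.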
Cheers,
Wim
Wim Vanni wrote:Remember that Office 2003 came with XML support, meaning you could save Word (and Excel, etc.) documents in an XML format.
Wim
Ulf Dittmer wrote:The XML supported by Office 2003 is very different from the Office 2007 formats.
Wim Vanni wrote:I'm pretty sure Docx4j can open and convert the old formats to the newer 2007 (2010?) formats. I haven't used this myself, but making a POC for it shouldn't be too hard. If that is successful, you basically have XML at that point. There are plenty of Java libraries around to handle (and search in) XML, and if needed, adding regexp searching to that shouldn't be difficult either.
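To make the "once you have XML, regexp searching is easy" point concrete, here is a rough sketch that uses only the JDK. It relies on a .docx file being a ZIP archive whose main text lives in the entry word/document.xml, and it assumes that entry is UTF-8. The tag stripping is deliberately crude (it does not unescape XML entities), and tinyDocx is a hypothetical helper that builds a stand-in archive just for the demo.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

class DocxRegexSearch {
    /** Pulls the raw XML of word/document.xml out of a .docx (ZIP) stream. */
    static String readMainPart(InputStream docx) {
        try (ZipInputStream zip = new ZipInputStream(docx)) {
            for (ZipEntry e = zip.getNextEntry(); e != null; e = zip.getNextEntry()) {
                if (e.getName().equals("word/document.xml")) {
                    return new String(zip.readAllBytes(), StandardCharsets.UTF_8);
                }
            }
            throw new IllegalArgumentException("no word/document.xml entry found");
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
    }

    /** Turns paragraph ends into newlines, drops all other tags, then applies the regex. */
    static List<String> grep(String documentXml, String regex) {
        String text = documentXml.replaceAll("</w:p>", "\n").replaceAll("<[^>]+>", "");
        List<String> hits = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }

    /** Hypothetical demo helper: wraps some XML in a one-entry ZIP shaped like a .docx. */
    static byte[] tinyDocx(String documentXml) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(buf)) {
            zip.putNextEntry(new ZipEntry("word/document.xml"));
            zip.write(documentXml.getBytes(StandardCharsets.UTF_8));
            zip.closeEntry();
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
        return buf.toByteArray();
    }

    public static void main(String[] args) {
        String xml = "<w:document><w:body>"
                + "<w:p><w:r><w:t>Due 2011-03-04</w:t></w:r></w:p>"
                + "<w:p><w:r><w:t>Paid 2012-05-06</w:t></w:r></w:p>"
                + "</w:body></w:document>";
        String mainPart = readMainPart(new ByteArrayInputStream(tinyDocx(xml)));
        System.out.println(grep(mainPart, "\\d{4}-\\d{2}-\\d{2}")); // prints [2011-03-04, 2012-05-06]
    }
}
```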
You don't have to be a guru. Just learn new ingredients now and then and learn to combine them into extraordinary dishes
Chef Wim
So now, for efficiency, a program should handle .doc (COM), the 2003 XML format (is that docx too?) and .docx formats.
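If one program has to cope with all three kinds of file, a first step might be to look at the leading bytes rather than trusting the extension. The sketch below assumes that binary .doc files are OLE2 compound files (which start with the signature D0 CF 11 E0 A1 B1 1A E1), that .docx files are ZIP archives (which start with "PK" 03 04), and that the 2003 XML format is plain XML text. The enum and method names are invented for illustration.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class WordFormatSniffer {
    enum Format { OLE2_CONTAINER, ZIP_CONTAINER, XML, UNKNOWN }

    // Signature of an OLE2 compound file, the container used by binary .doc
    private static final byte[] OLE2 = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0, (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };
    // Local file header signature of a ZIP archive, the container used by .docx
    private static final byte[] ZIP = { 'P', 'K', 3, 4 };

    /** Classifies a file from its first few bytes. */
    static Format sniff(byte[] head) {
        if (startsWith(head, OLE2)) return Format.OLE2_CONTAINER;
        if (startsWith(head, ZIP)) return Format.ZIP_CONTAINER;
        // Drop a byte order mark and leading whitespace before looking for markup
        String text = new String(head, StandardCharsets.UTF_8).replace("\uFEFF", "").trim();
        if (text.startsWith("<")) return Format.XML;
        return Format.UNKNOWN;
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) return false;
        }
        return true;
    }

    public static void main(String[] args) throws IOException {
        try (InputStream in = Files.newInputStream(Path.of(args[0]))) {
            System.out.println(sniff(in.readNBytes(64)));
        }
    }
}
```

The signatures only identify the container: an .xls file is also OLE2 and any ZIP archive starts with "PK", so a fuller check would look inside (for example, for a word/document.xml entry) before deciding how to process the file.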
I'm pretty sure Docx4j can open and convert the old formats to the newer 2007 (2010?) formats.
Does Docx4j need MS Word installed?
I'm pretty sure Docx4j can open and convert the old formats to the newer 2007 (2010?) formats.
I've seen no indication that docx4j can handle the old binary Office formats. Or did you mean the 2003 XML formats?
Handling legacy binary .doc files
Apache POI's HWPF can read .doc files, and docx4j could use this for basic conversion of .doc to .docx. The problem with this approach is that POI's HWPF code fails on many .doc files.
An effective approach is to use OpenOffice (via jodconverter) to convert the doc to docx, which docx4j can then process. If you need to return a binary .doc, OpenOffice/jodconverter can convert the docx back to .doc.
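As a rough illustration of the convert-first approach, the sketch below shells out to an office suite from Java. Note that it swaps in LibreOffice's command-line converter instead of the jodconverter API, purely to keep the example free of third-party dependencies. It assumes a LibreOffice install whose soffice executable is on the PATH and accepts the --headless, --convert-to and --outdir options. The class and method names are hypothetical.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.util.List;

class SofficeConvert {
    /** Builds the command line for a headless conversion (assumes soffice is on the PATH). */
    static List<String> command(Path input, String targetFormat, Path outDir) {
        return List.of("soffice", "--headless",
                "--convert-to", targetFormat,
                "--outdir", outDir.toString(),
                input.toString());
    }

    /** Runs the conversion and returns the exit code of the soffice process. */
    static int convert(Path input, String targetFormat, Path outDir)
            throws IOException, InterruptedException {
        Process process = new ProcessBuilder(command(input, targetFormat, outDir))
                .inheritIO() // show the converter's own messages on our console
                .start();
        return process.waitFor();
    }

    public static void main(String[] args) throws Exception {
        // Usage: java SofficeConvert <input.doc> <output-directory>
        int exit = convert(Path.of(args[0]), "docx", Path.of(args[1]));
        System.out.println("soffice exited with code " + exit);
    }
}
```

Passing "doc" instead of "docx" as the target format would cover the return trip mentioned above. Starting a new office process per file is slow, which is one reason a pooled connection such as the one jodconverter manages may be preferable for batch work.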
There is also http://b2xtranslator.sourceforge.net/. If a pure Java approach were required, this could be ported to Java.
Miltos Deligiannis wrote:
Actually, I need my app to be able to handle both .doc (Word 2003, XP, etc.) and .docx files. Aspose.Words for Java is a very good API that can do many things, but unfortunately I cannot afford it (especially since its license is valid only for a given period)... On the other hand, OO UNO and Apache POI have a pretty steep learning curve. The good thing about Aspose is that it does not require MS Word to be installed, and that's very important, I think.