Convert .doc file to .txt file

 
Ranch Hand
Posts: 40
Hi ,

I have a template in .doc format. In my application I want to read this .doc file and write its text to a .txt file. But when I do this, some special ASCII characters also end up in the text file. I don't want these special characters, only the text (words) present in the Word file.



Please help me in this regard

thanks
Smriti
 
Rancher
Posts: 43081
77
DOC files contain plenty of control characters (and other text) that are not part of the main text. You'll need to use a library that understands the DOC format, like Apache POI (formerly Jakarta POI). Have a look at "Basic Text Extraction" in the POI documentation.
 
Greenhorn
Posts: 3
I have a similar problem. When I extract the text plainly, the format of the strings is not the same as in a text file. I need to match strings between a text file and a .doc file. It gives 2 same strings as unequal. Any suggestions?
 
Ulf Dittmer
Rancher
Posts: 43081
77

It gives 2 same strings as unequal.


What does that mean? Can you post a short code section that illustrates the problem?
 
Kartik Lunkad
Greenhorn
Posts: 3
Consider two words, string1 and string 2, extracted from a .txt file and a .doc file respectively. (The string from the .doc file is extracted using the method described in the posts above.)
When we compare them, they come out as unequal.
 
Ulf Dittmer
Rancher
Posts: 43081
77
Why should "string1" and "string 2" be considered equal?
 
Kartik Lunkad
Greenhorn
Posts: 3
Suppose they are equal in a certain case, for example string1 = "sample" and string2 = "sample". But when they are extracted from their respective file formats and compared, the compiler shows them as unequal. I hope that explains my problem.
 
Sheriff
Posts: 28346
97
The compiler isn't doing any of that comparing. It's the runtime which is comparing. And if it says the two strings are unequal, then they are unequal. If you say they are equal, then you are using a non-standard definition of equal; or more likely, you have overlooked something. Often people overlook things like trailing blanks, for example, because they aren't easy to see in debugging output.
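
To make that kind of difference visible, it can help to dump the code points of both strings and then compare them after stripping the usual suspects. The sketch below is only an illustration: the names `StringDiffDemo`, `describe` and `normalize` are made up for this example, and the assumption that the stray characters are carriage returns, non-breaking spaces or surrounding whitespace may not hold for your files.

```java
import java.util.StringJoiner;

// Hypothetical helper, not part of POI or of the code posted in this thread.
final class StringDiffDemo {

    // Lists every character as a U+XXXX code so invisible ones show up.
    static String describe(String s) {
        StringJoiner out = new StringJoiner(" ");
        s.codePoints().forEach(cp -> out.add(String.format("U+%04X", cp)));
        return out.toString();
    }

    // Assumption: the only differences are carriage returns, non-breaking
    // spaces and leading/trailing whitespace. Adjust to what describe() reveals.
    static String normalize(String s) {
        return s.replace("\r", "")
                .replace('\u00A0', ' ')
                .trim();
    }

    public static void main(String[] args) {
        String fromTxt = "sample";
        String fromDoc = "sample\r\n"; // simulated extractor output
        System.out.println(fromTxt.equals(fromDoc));
        System.out.println(describe(fromDoc));
        System.out.println(normalize(fromTxt).equals(normalize(fromDoc)));
    }
}
```

If `describe` shows characters other than the ones handled in `normalize`, extend the method accordingly rather than trusting this list.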
 
Greenhorn
Posts: 1
The program below will convert a .doc file to a .txt file:

import java.io.*;
import org.apache.poi.xwpf.extractor.XWPFWordExtractor;
import org.apache.poi.xwpf.usermodel.XWPFDocument;

public class ReadDocFile {
    public static void main(String[] args) {
        File file = null;

        try {
            // Read the Doc/DOCx file
            file = new File("D:\\New.docx");
            FileInputStream fis = new FileInputStream(file.getAbsolutePath());
            XWPFDocument doc = new XWPFDocument(fis);
            XWPFWordExtractor ex = new XWPFWordExtractor(doc);
            String text = ex.getText();

            // Write the text to the txt file
            File fil = new File("D:\\New.txt");
            Writer output = new BufferedWriter(new FileWriter(fil));
            output.write(text);
            output.close();
        } catch (Exception exep) {
        }
    }
}


Also add the xmlbeans-2.3.0, dom4j-1.6.1 and stax-api-1.0.1 jars to the classpath.

Download the Apache POI jar as well.
 
Sheriff
Posts: 22818
132
Welcome to the Ranch! While technically that doesn't convert .doc to .txt but .docx (there's a difference), you can use org.apache.poi.hwpf.extractor.WordExtractor and org.apache.poi.hwpf.HWPFDocument instead of org.apache.poi.xwpf.extractor.XWPFWordExtractor and org.apache.poi.xwpf.usermodel.XWPFDocument. The rest of the code should be the same.

And please UseCodeTags next time.
 
Greenhorn
Posts: 21
Hello Amit,

The code works for some files only.
For other files I get this exception: org.apache.poi.openxml4j.exceptions.InvalidFormatException: Package should contain a content type part [M1.13]

How can I resolve this?

Thanks.
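
One possible cause, offered as an assumption rather than a diagnosis: this exception tends to appear when the file handed to XWPFDocument is not a zip-based .docx package, for example an older binary .doc saved with a .docx extension. A rough way to check is to look at the first bytes of the file. In the sketch below the class, enum and method names are hypothetical; only the two file signatures (OLE2 and zip) are standard.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;

// Hypothetical helper: guesses the container format from the file signature.
final class WordFormatSniffer {

    enum Kind { OLE2_DOC, ZIP_DOCX, UNKNOWN }

    // OLE2 compound files (binary .doc) start with D0 CF 11 E0 A1 B1 1A E1.
    private static final int[] OLE2 = {0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1};
    // Zip archives start with 50 4B 03 04; a .docx is a zip package,
    // but so are .xlsx files and ordinary zip files.
    private static final int[] ZIP = {0x50, 0x4B, 0x03, 0x04};

    static Kind detect(byte[] header) {
        if (startsWith(header, OLE2)) return Kind.OLE2_DOC;
        if (startsWith(header, ZIP)) return Kind.ZIP_DOCX;
        return Kind.UNKNOWN;
    }

    private static boolean startsWith(byte[] data, int[] signature) {
        if (data.length < signature.length) return false;
        for (int i = 0; i < signature.length; i++) {
            if ((data[i] & 0xFF) != signature[i]) return false;
        }
        return true;
    }

    public static void main(String[] args) throws IOException {
        // Pass the path of the file to inspect as the first argument.
        byte[] header = new byte[8];
        int read;
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            read = in.readNBytes(header, 0, header.length); // Java 9+
        }
        System.out.println(detect(Arrays.copyOf(header, read)));
    }
}
```

A result of OLE2_DOC for a file named .docx would suggest trying the HWPF classes for that file instead of the XWPF ones.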
 
Ranch Hand
Posts: 43

Amit kumarJha wrote: The program below will convert a .doc file to a .txt file:

import java.io.*;
import org.apache.poi.xwpf.extractor.XWPFWordExtractor;
import org.apache.poi.xwpf.usermodel.XWPFDocument;

public class ReadDocFile {
    public static void main(String[] args) {
        File file = null;

        try {
            // Read the Doc/DOCx file
            file = new File("D:\\New.docx");
            FileInputStream fis = new FileInputStream(file.getAbsolutePath());
            XWPFDocument doc = new XWPFDocument(fis);
            XWPFWordExtractor ex = new XWPFWordExtractor(doc);
            String text = ex.getText();

            // Write the text to the txt file
            File fil = new File("D:\\New.txt");
            Writer output = new BufferedWriter(new FileWriter(fil));
            output.write(text);
            output.close();
        } catch (Exception exep) {
        }
    }
}


Also add the xmlbeans-2.3.0, dom4j-1.6.1 and stax-api-1.0.1 jars to the classpath.

Download the Apache POI jar as well.



Hello, I tried this code but it does not work for me. At the line XWPFDocument doc = new XWPFDocument(fis); everything stops, nothing happens. I inserted a print statement after this line but nothing is printed (to the console). Also no exception is mentioned. Could you give me some suggestions as to what may be the cause of this behavior?

Thanks a bunch,
Monica
 
Rancher
Posts: 1776
Welcome to Javaranch, Monica

You need to print the Exception in the catch block if you need to see it. Also check Rob Spoor's earlier message on some changes to that code.
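
As a minimal illustration of that advice (this is not the original poster's code), the sketch below writes text to a file with try-with-resources and prints the stack trace instead of swallowing the exception. The class name `SaveText`, the method `save` and the file name are made up for the example.

```java
import java.io.IOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical example class; only the exception handling pattern matters here.
final class SaveText {

    // Writes the text to the target file. Returns true on success,
    // false after reporting the failure on standard error.
    static boolean save(Path target, String text) {
        // try-with-resources closes the writer even when write() fails.
        try (Writer out = Files.newBufferedWriter(target, StandardCharsets.UTF_8)) {
            out.write(text);
            return true;
        } catch (IOException e) {
            // An empty catch block would hide this failure completely.
            e.printStackTrace();
            return false;
        }
    }

    public static void main(String[] args) {
        Path target = Paths.get(System.getProperty("java.io.tmpdir"), "New.txt");
        System.out.println(save(target, "extracted text goes here"));
    }
}
```

The same pattern applies to the catch block around the POI calls: with the stack trace printed, a failure in the document constructor shows up on the console instead of passing silently.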
 
Monica Marcus
Ranch Hand
Posts: 43
Hi John, thanks for the message. The fact is that I had noticed Rob Spoor's message and the code works for doc files. My problem is with docx files only. And I print the exceptions in the catch blocks. Any other suggestion? I am completely stuck.
 
Ulf Dittmer
Rancher
Posts: 43081
77

Monica Marcus wrote:the code works for doc files. My problem is with docx files only.


Really? I would have thought it would be the other way around, because the XWPF classes can handle .docx files, but not .doc files. For .doc files you'd need to use the corresponding HWPF classes.
 
Monica Marcus
Ranch Hand
Posts: 43
Hi Ulf. Well, yes, of course: for doc files I used the HWPF classes.
 
Ulf Dittmer
Rancher
Posts: 43081
77
So your problem is solved now, and you're able to extract text from both file types?
 
Monica Marcus
Ranch Hand
Posts: 43
No, I can extract text only from the doc files, but not from the docx files. I explained what happens with docx files in my first message of this thread. I would appreciate it very much if you (or someone else) can help, because now I am really stuck.
 
Monica Marcus
Ranch Hand
Posts: 43
Hi guys, I was able to write code that works for docx and doc files (different classes, of course), but I cannot get both to work as part of a larger application. The problem is that I need to use two jar files: poi-3-0-alpha3.jar (for the doc files) and poi-3.9-20121203.jar (for the docx files). Both jar files contain two classes with identical names, but the contents of the classes are different. One of the classes lacks a function required for doc files and the other lacks a function required for docx files. So the order in which the jar files are added to my NetBeans project determines which of the two file types (doc or docx) can be converted to text. Is there a way to make the program look in both jar files?
 
Ulf Dittmer
Rancher
Posts: 43081
77

Monica Marcus wrote:The problem is that I need to use two jar files: poi-3-0-alpha3.jar (for the doc files) and poi-3.9-20121203.jar (for the docx files)


Why is that? I'm fairly certain that the current POI version does everything 3.0 did, so you should not need any of the old jar files. In any case, I strongly advise against mixing jars from different versions; problems like the one you're experiencing are bound to happen.
 
Monica Marcus
Ranch Hand
Posts: 43
No, the current version does not contain the classes mentioned below:

import org.apache.poi.hwpf.extractor.WordExtractor;
import org.apache.poi.hwpf.HWPFDocument;

I need the older version for them. I tried some intermediate versions too, but they do not work either.
What can I do to have my Java tool work for both doc and docx files?

 
Ulf Dittmer
Rancher
Posts: 43081
77
It sure does - both classes are in the scratchpad jar file. If they weren't part of POI, why would they be in the javadocs?

org.apache.poi.hwpf.extractor.WordExtractor and org.apache.poi.hwpf.HWPFDocument
 
Monica Marcus
Ranch Hand
Posts: 43
Well, they are in the javadoc but my compiler (NetBeans) says otherwise. What can I do? Perhaps it is a mistake and the people at apache.org should know about it. I don't know how to contact them.
 
Ulf Dittmer
Rancher
Posts: 43081
77
How are you adding the scratchpad jar file to the classpath? That's separate from the main jar file - you need both.
 
Monica Marcus
Ranch Hand
Posts: 43
I work with NetBeans, and I did not set my classpath myself. I just added the jar files (both poi jar files) to the project. If I add the older version first and then the new version, it works for doc files only. If I add the new version first and then the older version, it works for docx files only. NetBeans builds a jar file for my whole application to run.
 
Ulf Dittmer
Rancher
Posts: 43081
77
I guess I need to be explicit about it: the jar files you need to add are named poi-3.9-20121203.jar and poi-scratchpad-3.9-20121203.jar. Don't add any files that are not part of the POI 3.9 download (like from older POI versions) - it simply does not work, nor is it necessary.

(You may also have to add poi-ooxml-3.9-20121203.jar and poi-ooxml-schemas-3.9-20121203.jar, and some of the jars in the "ooxml-lib" directory; I'm not sure in which circumstances those are needed.)
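
If it is unclear which jar a class is actually being loaded from, a small check like the following can help. It is only a sketch: `ClasspathCheck` and `locate` are hypothetical names, and the two POI class names are the ones mentioned earlier in this thread.

```java
import java.security.CodeSource;

// Hypothetical diagnostic helper; it uses only standard JDK calls.
final class ClasspathCheck {

    // Returns the jar or directory a class was loaded from,
    // or a short note if the class or its location is unavailable.
    static String locate(String className) {
        try {
            // 'false' loads the class without running its static initializers.
            Class<?> type = Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            CodeSource source = type.getProtectionDomain().getCodeSource();
            // JDK classes typically report no code source.
            return source == null ? "(no code source)" : String.valueOf(source.getLocation());
        } catch (ClassNotFoundException | LinkageError e) {
            return "not found: " + e;
        }
    }

    public static void main(String[] args) {
        String[] names = {
            "org.apache.poi.hwpf.HWPFDocument",
            "org.apache.poi.xwpf.usermodel.XWPFDocument"
        };
        for (String name : names) {
            System.out.println(name + " -> " + locate(name));
        }
    }
}
```

Run from within the project, this prints one location per class. A location that points at an old POI jar, or a "not found" line, indicates which jar is missing or shadowing another.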
 
Monica Marcus
Ranch Hand
Posts: 43
Thank you, Ulf. I did not know what the scratchpad jar file is.
 