convert a txt file into Word document

 
megha punj
Greenhorn
Posts: 2
I need to convert a text file into a Word document. The text file is already formatted correctly with spaces and tabs; I just need to copy its contents into the Word document. I am using the following code for this, but the problem is that lines which fit on a single line in the text file wrap onto the next line in the document.

For example, the first stanza in the text file looks like the one below. I am using a line of dashes (-------------------------------------------------) to show the leading spaces, so please read it as spaces. Please read <Page break> and <line break> as the page-break and line-break characters in the text file.


<Page break>
<line break>
<line break>
<line break>
-------------------------------------------------COCRAC RANKING CORPORATION
-------------------------------------------------XXX XXX XXX XXX
------------------------------------------------- XXX XXX - Xxxx xxxxx
------------------------------------------------- IBM XX, 1 Queen Street
------------------------------------------------- TONCORD EAST, RMW XXXX

But when written into the document, it appears as in the attached screenshot. It seems to understand the page breaks and the line breaks, but not the spaces in the text file. If I count the spaces before the text, there are 62 in TextPad, and the Word document also shows 62 spaces, but the content still goes onto the next line.

I am using the following code to read the text file and write it into the Word document. Please suggest how I can keep the formatting in the Word document the same as in the text file.

[Attachment: screenshot.png, a screenshot of the document produced.]
 
Tim Moores
Saloon Keeper
Posts: 3261
54
.doc and .docx (you didn't say which one) are structured file formats that you will have a hard time creating from scratch with the Java API. Instead, check out the Apache POI library, or alternatively create an RTF file, which can be opened by Word and any other word processor.
 
Tim Holloway
Bartender
Posts: 18414
58
Android Eclipse IDE Linux
The .doc file format is a rather horrible binary file format, so using StringBuffer operations won't work. The docx file format is a rather horrible XML file format based on a Microsoft-designed "International" standard that's so horrible that even Microsoft can't always get it right. If I'm not mistaken, in some cases, they cheat by embedding raw binary stuff in it. Plus, the actual XML is binary compressed and mingled with other files to produce the actual docx file, I believe (or maybe that's just the Open/Libre Office format that does that).

Definitely. The quick-and-dirty easy way is to produce RTF. If you have a more complex document, use something like POI.
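
A minimal sketch of the RTF route, using only the standard Java library. Several things here are assumptions rather than anything from the original post: the class name `TextToRtf`, that a form feed (`\f`) is the page-break character in the text file, that the input is ISO-8859-1, and the layout values (Courier New, 8 pt, half-inch side margins). The idea is that a monospaced font keeps leading spaces the same width as the text, while font size and margins decide whether a long line still fits; those values will probably need tuning for the real file.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

class TextToRtf {

    /** Wraps plain text in a minimal RTF document set in a monospaced font. */
    static String toRtf(String text) {
        StringBuilder rtf = new StringBuilder();
        // Header: one font (Courier New), 720-twip (0.5 in) side margins,
        // 8 pt text (\fs is in half-points). All of these are assumed values.
        rtf.append("{\\rtf1\\ansi\\deff0{\\fonttbl{\\f0\\fmodern Courier New;}}");
        rtf.append("\\margl720\\margr720\\f0\\fs16 ");
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '\\': case '{': case '}':
                    rtf.append('\\').append(c);   // escape RTF syntax characters
                    break;
                case '\r':
                    break;                        // drop CR; LF ends the line
                case '\n':
                    rtf.append("\\par\n");        // line break -> new paragraph
                    break;
                case '\f':
                    rtf.append("\\page\n");       // form feed -> page break (assumption)
                    break;
                case '\t':
                    rtf.append("\\tab ");
                    break;
                default:
                    if (c < 128) {
                        rtf.append(c);            // spaces and ASCII pass through as-is
                    } else {
                        // Non-ASCII: signed 16-bit \u escape with '?' as fallback
                        rtf.append("\\u").append((int) (short) c).append('?');
                    }
            }
        }
        return rtf.append('}').toString();
    }

    public static void main(String[] args) throws IOException {
        Path in = Paths.get(args[0]);
        Path out = Paths.get(args[1]);
        // Input encoding is an assumption; change it to match the text file.
        String text = new String(Files.readAllBytes(in), StandardCharsets.ISO_8859_1);
        Files.write(out, toRtf(text).getBytes(StandardCharsets.US_ASCII));
    }
}
```

Run it as `java TextToRtf input.txt output.rtf` and open the result in Word. If long lines still wrap, a smaller `\fs` value or narrower margins are the first things to try.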
 
Campbell Ritchie
Marshal
Posts: 52581
119
I unzipped an open office text document file and couldn't see any sign of non‑text, except possibly for a .cache file.
 
Tim Holloway
Bartender
Posts: 18414
58
Android Eclipse IDE Linux
Campbell Ritchie wrote: I unzipped an open office text document file and couldn't see any sign of non‑text, except possibly for a .cache file.


There should have been a "content.xml" file, if you're looking at an ODS document.
 
Campbell Ritchie
Marshal
Posts: 52581
119
It was odt but yes, there was a content.xml file and everything in it seemed to open with a text editor (Pluma). It looked like a combination of XML and plain text. Pluma would complain if there is anything it cannot understand, e.g. incorrect encoding.
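
The same inspection can be done from Java. The following is a rough sketch, assuming only that the document is an ordinary ZIP archive (true of both .odt and .docx files); the class name `PackageLister` and the method `listParts` are made up for illustration.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.Enumeration;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

class PackageLister {

    /** Returns the names of all entries inside a ZIP-based document. */
    static List<String> listParts(String path) throws IOException {
        List<String> names = new ArrayList<>();
        try (ZipFile zip = new ZipFile(path)) {
            Enumeration<? extends ZipEntry> entries = zip.entries();
            while (entries.hasMoreElements()) {
                names.add(entries.nextElement().getName());
            }
        }
        return names;
    }

    public static void main(String[] args) throws IOException {
        // Prints one entry name per line for the file given on the command line.
        for (String name : listParts(args[0])) {
            System.out.println(name);
        }
    }
}
```

For an .odt file the listing should include `content.xml`; for a .docx file the main body normally lives in `word/document.xml`.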
 
Tim Holloway
Bartender
Posts: 18414
58
Android Eclipse IDE Linux
I'm not sure what you were expecting. XML is plain text. The document text is marked up (after all, that's where XML got its name). There's no need for special encoding of the text, since encryption and compression are done at higher levels and word-processing documents aren't pixel-precise to the character like absolute page layout formats like PDFs are. In short, the same basic structure as an XML word document, although the tags and files vary.

A text editor is not, after all, a word processor, so you'd see the raw XML in a text editor.
 
Campbell Ritchie
Marshal
Posts: 52581
119
Sorry for being loose with the terminology. That is what I expected to see: text and no control characters or anything.
 
Tim Holloway
Bartender
Posts: 18414
58
Android Eclipse IDE Linux
Shouldn't be any control characters. Things like paragraph boundaries and other line breaks are supposed to be XML tags. Same thing (again) with MS-Word. Even when people use it in "typewriter mode" and insert "new line" characters (which are actually just empty paragraphs).

RTF has some quirks about control characters, I think, but they're significant to the RTF syntax, not the actual document text. I could reach out a handspan and check my RTF Pocket Guide, but too much trouble.
 