
Reading/Writing Foreign Text

 
Mike Watts
Greenhorn
Posts: 5
Is there a way to read in a file that contains English and German text and then write it out to another file?

Here is my problem:

I am reading in a file that contains mostly English text with a little German. I wrap a FileReader in a BufferedReader and read the file line by line, searching for particular strings (English text) and storing them in an ArrayList. When I'm done, I use a BufferedWriter to write all the Strings in my ArrayList out to a file. The problem is that German or other foreign text, for example WÄHRUNG, comes out as W?HRUNG. The special characters are not being converted.

I would like to thank you in advance for any advice given.

~Mike
 
Jeff Albertson
Ranch Hand
Posts: 1780
The problem has to do with the decoding that happens when an InputStream of bytes is converted into a Reader of characters, and with the corresponding encoding of characters into an OutputStream of bytes. FileReader and FileWriter use the platform's default charset unless told otherwise, which may not match your file. The key classes are InputStreamReader and OutputStreamWriter, and you need to specify your charset, say ISO-8859-1 or UTF-8.
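
A minimal sketch of that idea, assuming the input is ISO-8859-1 and the output should be UTF-8 (both are guesses; substitute whatever your files really use). The class and method names here are made up for illustration:

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper, not part of any library.
class CharsetCopy {
    // Copies a text file line by line, decoding with inCs and encoding with outCs.
    static void copy(Path in, Charset inCs, Path out, Charset outCs) throws IOException {
        try (BufferedReader reader = new BufferedReader(
                 new InputStreamReader(new FileInputStream(in.toFile()), inCs));
             BufferedWriter writer = new BufferedWriter(
                 new OutputStreamWriter(new FileOutputStream(out.toFile()), outCs))) {
            String line;
            while ((line = reader.readLine()) != null) {
                writer.write(line);
                writer.newLine(); // platform line separator
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // Assumption: args[0] is ISO-8859-1 input, args[1] is the UTF-8 output path.
        copy(Paths.get(args[0]), StandardCharsets.ISO_8859_1,
             Paths.get(args[1]), StandardCharsets.UTF_8);
    }
}
```

If the output has to be read by the same tools as the input, pass the same charset for both sides instead of converting.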
 
Layne Lund
Ranch Hand
Posts: 3061
How are you verifying the contents of the output file? I suspect that your Java program works perfectly, but there may be a problem with the font and/or character set that is used when you display the contents of the file. Are you using the command line (such as the "more" command) or a text editor (such as Notepad) to view the file? In either case, you need to be sure that it supports the character set that you are using. It is highly likely that the contents of the output file are correct but that the characters are not displayed correctly when you try to verify them.
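
One way to check the file itself rather than the display is to compare its raw bytes (from any hex dump tool) with what each encoding should produce. The sketch below is only an illustration with made-up names; it prints the expected bytes for the example word in three charsets:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of any library.
class EncodingCheck {
    // Returns the bytes of text in the given charset as space-separated hex.
    static String hex(String text, Charset cs) {
        StringBuilder sb = new StringBuilder();
        for (byte b : text.getBytes(cs)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The \u00C4 escape is A-umlaut; it avoids depending on the source file's encoding.
        String word = "W\u00C4HRUNG";
        System.out.println(hex(word, StandardCharsets.ISO_8859_1)); // 57 C4 48 52 55 4E 47
        System.out.println(hex(word, StandardCharsets.UTF_8));      // 57 C3 84 48 52 55 4E 47
        System.out.println(hex(word, StandardCharsets.US_ASCII));   // 57 3F 48 52 55 4E 47
    }
}
```

A 3F byte (a literal question mark) in the file would mean the character was lost when Java encoded it, not when the terminal displayed it.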

Layne
 
Mike Watts
Greenhorn
Posts: 5
I am using the command line to view the file, with either "more" or the vi editor. The file originally comes from a C program that prints English and German text to a file. I can view the output of the C program fine from the command line, but the output of my Java program does not come out correctly.

I will also try and specify the charsets to see if that fixes the problem.

Another question:

Is there a way to insert text into a file without reading through the whole file, inserting the text, and then writing it out again?

I deal with thousands of text files with over 100,000 lines. It takes quite some time to do such a thing.

The way I have been doing it is not very efficient. I read through the whole file, storing each line in a Collection while checking for conditions, and if a condition is true, I insert some text. Once readLine() returns null, I write out my Collection line by line to a new file.

Thanks in advance!
 
Jeff Albertson
Ranch Hand
Posts: 1780
Originally posted by Mike Watts:

Is there a way to insert text into a file without reading through the whole file, inserting the text, and then writing it out again?

I deal with thousands of text files with over 100,000 lines. It takes quite some time to do such a thing.

The way I have been doing it is not very efficient. I read through the whole file, storing each line in a Collection while checking for conditions, and if a condition is true, I insert some text. Once readLine() returns null, I write out my Collection line by line to a new file.

1. The only alternative to rewriting a file to "edit" it is to use a RandomAccessFile, but that hardly ever works for text, because editing in the middle of the file is a matter of replacing N old bytes with N new bytes. Unless your file format has fixed-length lines and your character encoding uses a fixed number of bytes per char (say 1 or 2), this will fail.

2. It sounds like you are reading the entire file into memory before you start to rewrite it. Is it possible to hold fewer lines in memory and interleave reading and writing? This would keep your process from bloating and burdening your machine. Doing this means writing to a temp file (since you are not done reading!) and renaming the temp file at the end if you want to "overwrite" the file contents. If your program is supposed to overwrite contents, you should be doing this in any case, so that if it crashes, only a temp file is left in an incomplete state.

3. 100,000 lines? Perhaps it's time to rethink the design. With that much data, why not keep it in a database? You could write code that generates text file reports when needed. Another processing approach, even if input and output were *required* to be text files, would be to use a database as an intermediate data structure. It'll be slower than holding data in memory, but it will give you a lot of options.
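
Point 2 above could be sketched roughly like this. The names and the matching rule (insert a line after every line containing a marker string) are hypothetical stand-ins for whatever conditions you actually check, and the charset is a parameter because only you know what your files use:

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical helper, not part of any library.
class StreamingInsert {
    // Streams `file` through a temp file, adding `extra` after each line that
    // contains `marker`, then replaces the original with the temp file.
    static void insertAfterMatches(Path file, Charset cs, String marker, String extra)
            throws IOException {
        // Create the temp file next to the original so the final move stays on one file system.
        Path dir = file.toAbsolutePath().getParent();
        Path temp = Files.createTempFile(dir, "edit", ".tmp");
        try {
            try (BufferedReader reader = Files.newBufferedReader(file, cs);
                 BufferedWriter writer = Files.newBufferedWriter(temp, cs)) {
                String line;
                while ((line = reader.readLine()) != null) {
                    // Only the current line is held in memory.
                    writer.write(line);
                    writer.newLine();
                    if (line.contains(marker)) {
                        writer.write(extra);
                        writer.newLine();
                    }
                }
            }
            // Swap in the finished copy only after it was written completely.
            Files.move(temp, file, StandardCopyOption.REPLACE_EXISTING);
        } finally {
            // Cleans up the temp file if something failed before the move.
            Files.deleteIfExists(temp);
        }
    }
}
```

A call might look like StreamingInsert.insertAfterMatches(path, StandardCharsets.ISO_8859_1, "TOTAL", "-- checked --"), where the marker and inserted text are invented for the example. If the program crashes partway through, the original file is untouched and only the temp file is incomplete.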
 