Japanese character not read or written correctly

 
Kevin Tysen
Ranch Hand
Posts: 255
My program reads lines from a text file with the method BufferedReader.readLine() and writes to another text file using BufferedWriter.write().
It works without any problems, usually, but when it encountered a certain Japanese character, I got some unexpected results. There was no problem with any other Japanese characters in the file; only this one character caused a problem.
(By "problem", I mean an unexpected result. What I wanted the program to do was read the text in one file and write it to another file, along with some other textual stuff.)
The Japanese character that caused the problem was the "no" in the word "nojo", which means "farm". Here it is in Japanese: 農場
According to what I know, the .readLine() method reads text, 2 bytes for each character, and when it comes to a carriage return or a linefeed character, it considers that to be the end of the line, and stops reading characters. So what I think is that perhaps one of the two bytes of the Japanese character was considered to be a carriage return or a linefeed, or maybe even a null. I don't know.
I ran this program on my Mac, which is OS10.0 and running Java 3.1. Ancient, right? The characters at the end of lines are different in Windows, I know, so on Windows there might be different results.
Anyone have any ideas about what is going on here?
 
Alan Moore
Ranch Hand
Posts: 262
First, some corrections:

The number of bytes that make up a character is not fixed; it depends on the encoding that's being used to convert between bytes and characters. (Read this.) You're probably thinking of the way Java strings are stored; they use the UTF-16 encoding, which uses two bytes per character (usually--see the article), but that has nothing to do with how the text is stored on disk.
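
As a rough illustration of that point, here is a minimal sketch that counts the bytes the same two-character word occupies under different encodings. The class and method names (EncodingSizes, byteCount) are made up for this example, and Shift_JIS and EUC-JP are only tried if the running JVM happens to support them.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingSizes {
    // Returns how many bytes the text occupies once encoded with the given charset.
    static int byteCount(String text, Charset charset) {
        return text.getBytes(charset).length;
    }

    public static void main(String[] args) {
        // The word "nojo" (farm), written with Unicode escapes so that the
        // encoding of this source file does not matter.
        String nojo = "\u8FB2\u5834";

        System.out.println("chars in the String: " + nojo.length());
        System.out.println("UTF-8 bytes:    " + byteCount(nojo, StandardCharsets.UTF_8));
        System.out.println("UTF-16BE bytes: " + byteCount(nojo, StandardCharsets.UTF_16BE));

        // These two charsets are optional in a Java runtime, so check first.
        for (String name : new String[] {"Shift_JIS", "EUC-JP"}) {
            if (Charset.isSupported(name)) {
                System.out.println(name + " bytes: " + byteCount(nojo, Charset.forName(name)));
            }
        }
    }
}
```

The String always holds two chars, but the number of bytes on disk changes with the encoding (six in UTF-8, four in UTF-16BE).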

It used to be the case that Macs favored the carriage-return ('\r' or '\u000D') for line separators, but as of version 10 (OS X), Mac OS is built on a Unix (BSD) foundation, which, like Linux, prefers the linefeed ('\n' or '\u000A').

However, it doesn't matter that much what the operating system thinks a line separator should be, because it's the application (in this case, your Java program) that has to read and write the files. Virtually every modern application will accept any of the three major styles of line separator ("\n", "\r", or "\r\n"). BufferedReader is no exception; you can use a different separator at the end of every line, and BufferedReader will handle them correctly.

There is, unfortunately, one very important exception: Windows Notepad. It refuses to recognize anything except the DOS/Windows-style carriage-return+linefeed ("\r\n") line separator. If it encounters a linefeed or carriage-return by itself, Notepad renders it as a rectangle instead of a line break. That's probably not the cause of your problem, since you're using a Mac, but it's useful to know about (not to mention infuriating).
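
The claim about BufferedReader accepting mixed line separators is easy to check. The sketch below is only an illustration (the class name MixedLineEndings and the sample words are assumptions, not anything from the original program); it feeds readLine() a string that uses all three separator styles.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

public class MixedLineEndings {
    // Splits text into lines exactly the way BufferedReader.readLine() does.
    static List<String> readLines(String text) {
        List<String> lines = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(new StringReader(text))) {
            String line;
            while ((line = reader.readLine()) != null) {
                lines.add(line);
            }
        } catch (IOException e) {
            // Not expected when reading from an in-memory string.
            throw new UncheckedIOException(e);
        }
        return lines;
    }

    public static void main(String[] args) {
        // One "\n", one "\r", and one "\r\n" in the same input.
        System.out.println(readLines("farm\nfan\rfail\r\nfield"));
    }
}
```

All four words come back as separate lines, with the separators stripped.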

Now that all that's out of the way, we'll need some more info before we can help you. Like, how exactly are you reading and writing the files? What's the exact code you use to construct the Reader and Writer? How do you write the line separators? Do you use BufferedWriter#newLine(), or do you explicitly write a "\r"? And how do you view the contents of the files?
 
Kevin Tysen
Ranch Hand
Posts: 255
Thank you for the reference to the explanation of text encoding. It was very helpful.
My program reads in text from a text file.
This is what part of the text looks like:

農場
farm
fan
fail
field

望む
hope
hold
hour
hop

Then, my program parses the text and makes a few more text files using the text.
This is how the program reads in the text:

BufferedReader br = new BufferedReader(new FileReader(block + ".txt"));
String line;
while ((line = br.readLine()) != null){
    // Puts the line in a String[] and does some other processing
}

This is how the program writes to one of the output files:

for (int i = 0; i < osTotal; i++){
    bw.write("<TR BGCOLOR=");
    bw.write(34); // 34 is the double-quote character
    if (/* condition missing from the post */){
        bw.write("#FFFFFF");
    }else{
        bw.write("#EEEEEE");
    }
    bw.write(34);
    bw.write("><TD>");
    bw.newLine();
    bw.write(osAra[i].befUnd);
    // osAra[i].befUnd is the Japanese words 農場 and 望む
    bw.newLine();
    bw.write("<FONT COLOR=");
    bw.write(34);
    bw.write("#00FF00");
    bw.write(34);
    bw.write(">");
    bw.write(osAra[i].rightAns);
    // osAra[i].rightAns is the English translation of the Japanese,
    // specifically, farm and hope
    bw.write("</FONT>");
    bw.newLine();
    bw.newLine();
    bw.write("</TD></TR>");
    bw.newLine();
}

This is what part of the output file looks like. The first four lines are what happened when the program processed the words 農場 and farm. The last five lines are how the program processed 望む and hope, which is the same way the rest of the text was processed, and which is the way I expected the program to work.

<TR ><TD>
・<FONT COLOR="#00FF00">farm</FONT>

</TD></TR>
<TR ><TD>
望む
<FONT COLOR="#00FF00">hope</FONT>

</TD></TR>

As you can see, the 農 (no) of 農場 (nojo) is rendered unreadable, and the elements are switched around.
Instead of
nojo [linebreak] <FONT COLOR="#00FF00">farm</FONT> [linebreak] [linebreak]
I have
[unreadable] <FONT COLOR="#00FF00">farm</FONT> [linebreak] jo [linebreak]
 
Kevin Tysen
Ranch Hand
Posts: 255
Sorry, the HTML in the message that I sent seems to have been interpreted literally by the web browser, so I will send the last part of the message again.

This is how the program writes to one of the output files:

This is what part of the output file looks like. The first four lines are what happened when the program processed the words 農場 and farm. The last five lines are how the program processed 望む and hope, which is the same way the rest of the text was processed, and which is the way I expected the program to work.

<TR ><TD>
・<FONT COLOR="#00FF00">farm</FONT>

</TD></TR>
<TR ><TD>
望む
<FONT COLOR="#00FF00">hope</FONT>

</TD></TR>

As you can see, the 農 (no) of 農場 (nojo) is rendered unreadable, and the elements are switched around.
Instead of
nojo [linebreak] <FONT COLOR="#00FF00">farm</FONT> [linebreak] [linebreak]
I have
[unreadable] <FONT COLOR="#00FF00">farm</FONT> [linebreak] jo [linebreak]
 
Kevin Tysen
Ranch Hand
Posts: 255
OK, here is the last part of the message, one more time. I'll change the lessthan and greaterthan signs to brackets so that the computer won't interpret them as HTML.

[TR BGCOLOR="#EEEEEE"][TD]
・[FONT COLOR="#00FF00"]farm[/FONT]

[/TD][/TR]
[TR BGCOLOR="#FFFFFF"][TD]
望む
[FONT COLOR="#00FF00"]hope[/FONT]

[/TD][/TR]

As you can see, the 農 (no) of 農場 (nojo) is rendered unreadable, and the elements are switched around.
Instead of
nojo [linebreak] [FONT COLOR="#00FF00"]farm[/FONT] [linebreak] [linebreak]
I have
[unreadable] [FONT COLOR="#00FF00"]farm[/FONT] [linebreak] jo [linebreak]
 
author and iconoclast
Posts: 24207
There's a checkbox below the text box where you enter your posts called "Disable HTML in the message." Very handy if you want to show HTML code in your message!
 
Alan Moore
Ranch Hand
Posts: 262
FileReader decodes the contents of the file using the system default encoding. What encoding that is depends on the operating system and the locale of whatever computer the code is running on. That means it will be different on different machines, so you shouldn't use the default if you're dealing with anything other than pure ASCII.

Your file contains both ASCII characters and Japanese ideograms, so the encoding has to be one that supports both character sets: the most likely candidates are Shift_JIS and UTF-8. I would try UTF-8 first: build the BufferedReader on an InputStreamReader wrapped around a FileInputStream, and specify "UTF-8" as the encoding. When you create the BufferedWriter, use an OutputStreamWriter and specify "UTF-8" again. If that doesn't work, try "Shift_JIS" for the Reader (but leave the Writer set to UTF-8).
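
For reference, a minimal sketch of a reader and writer with explicit encodings might look like the following. The class name ExplicitEncodingCopy, the method copyLines, and the file names passed on the command line are all hypothetical; the sketch assumes a Java version with try-with-resources and StandardCharsets, and it always writes UTF-8 as suggested above.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ExplicitEncodingCopy {
    // Reads inPath using the given charset and writes every line to outPath as UTF-8.
    static void copyLines(String inPath, Charset inCharset, String outPath) throws IOException {
        try (BufferedReader br = new BufferedReader(
                 new InputStreamReader(new FileInputStream(inPath), inCharset));
             BufferedWriter bw = new BufferedWriter(
                 new OutputStreamWriter(new FileOutputStream(outPath), StandardCharsets.UTF_8))) {
            String line;
            while ((line = br.readLine()) != null) {
                bw.write(line);
                bw.newLine(); // uses the platform's line separator
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // args[0] is the input file, args[1] the output file (hypothetical names).
        // Swap in Charset.forName("Shift_JIS") if the input turns out not to be UTF-8.
        copyLines(args[0], StandardCharsets.UTF_8, args[1]);
    }
}
```

Neither stream depends on the system default encoding here, so the result should be the same on a Mac and on Windows, apart from the line separator that newLine() writes.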

This is just my best guess, based on experience; I can't get enough out of your posts to be more definite. If you still have problems, remember to check "Disable HTML" and "Disable smilies" when you post again (in fact, do what I did and set them to be disabled by default in your "My Profile" page).
 
Carey Evans
Ranch Hand
Posts: 225
If you can produce very short examples of the correct and incorrect text when run through hexdump -C from the terminal, it should give us (and you) a better idea of what is wrong with the encoding of the generated data. Can you give this a try?
 
Kevin Tysen
Ranch Hand
Posts: 255
Thanks for the advice. Actually, I went to the library yesterday, and I found a book that says that the default character set for UNIX based (MacOSX is UNIX, I believe) computers is EUC-JP, so I will try that, too.

About hexdump, is this how I should use it? For example, if the text I want to display is 農場 then on the command line I should type in

hexdump -C -e "農場"

or rather, not type it all in, but use copy and paste for the 農場 text?
 
Carey Evans
Ranch Hand
Posts: 225
You can pipe the text into hexdump, for example with echo 農場 | hexdump -C. But that just shows the text in the encoding Terminal is using (UTF-8 in this case), which should be the same as what ‘locale’ prints.

If you create a text file containing the expected text, and one containing the output from Java, then you can compare the output of hexdump -C on each file and work out what encoding Java and your editor are using.
 
Kevin Tysen
Ranch Hand
Posts: 255
I tried to copy and paste 農場 in the command line window, but I think the command line does not accept anything more than 8 bits. When I pasted 農場 and the other character, I got _ and ? respectively.
I think I'll try making a text file and do hexdump -C on it. Should I do it like this?

% echo myfile.txt hexdump -C
 
Carey Evans
Ranch Hand
Posts: 225
It's just hexdump -C myfile.txt. You can read the manual page by typing man hexdump, or on the web.
 
Greenhorn
Posts: 24
I believe Alan is right. If you use only FileReader, you may run into problems with characters outside standard ASCII (we have the same problem with Central European files). I use OutputStreamWriter and InputStreamReader to specify the encoding.

There is another approach: you can use NIO and java.nio.charset.CharsetDecoder / CharsetEncoder, which convert between a ByteBuffer and a CharBuffer in any supported charset.
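
A minimal sketch of the NIO route, assuming you want a wrong encoding guess to fail loudly instead of silently producing garbage. The class name StrictDecode and the method decodeStrict are made up for illustration.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class StrictDecode {
    // Decodes the bytes with the given charset and throws if any byte
    // sequence is not valid in that charset.
    static String decodeStrict(byte[] bytes, Charset charset) throws CharacterCodingException {
        CharsetDecoder decoder = charset.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        CharBuffer chars = decoder.decode(ByteBuffer.wrap(bytes));
        return chars.toString();
    }

    public static void main(String[] args) {
        // The word "nojo" encoded as UTF-8, written with Unicode escapes.
        byte[] utf8 = "\u8FB2\u5834".getBytes(StandardCharsets.UTF_8);
        try {
            System.out.println("as UTF-8: " + decodeStrict(utf8, StandardCharsets.UTF_8).length() + " chars");
            decodeStrict(utf8, StandardCharsets.US_ASCII);
            System.out.println("as US-ASCII: decoded without error");
        } catch (CharacterCodingException e) {
            System.out.println("wrong charset detected: " + e);
        }
    }
}
```

Decoding the UTF-8 bytes as US-ASCII throws, which is a quick way to rule out a candidate encoding when you are not sure how a file was saved.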
 
Kevin Tysen
Ranch Hand
Posts: 255
By the way, is there a way to look at the bytes of a file in Windows, too? Do you do
hexdump -C myfile.txt
in Windows, too, or is there some other command?
 
Carey Evans
Ranch Hand
Posts: 225
There doesn’t seem to be anything that comes with Windows. The GnuWin32 project provides a package for Windows based on GNU CoreUtils, which includes od, and you can use GNU od like hexdump: od -t x1z filename
 
Carey Evans
Ranch Hand
Posts: 225
It wasn't that hard, so:
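
A minimal sketch of a hex dump written in Java, which would run on Windows as well as on a Mac. This is an illustration only and not necessarily the code this post originally contained; the class name HexDump and the output layout (an offset followed by up to 16 hex bytes per row) are assumptions.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

public class HexDump {
    // Formats the bytes as rows of up to 16 two-digit hex values,
    // each row prefixed by the offset of its first byte.
    static String dump(byte[] data) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < data.length; i++) {
            if (i % 16 == 0) {
                if (i > 0) {
                    sb.append('\n');
                }
                sb.append(String.format("%08x ", i));
            }
            // Mask with 0xff so that bytes above 0x7f are not printed as negative numbers.
            sb.append(String.format(" %02x", data[i] & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // Usage: java HexDump myfile.txt
        System.out.println(dump(Files.readAllBytes(Paths.get(args[0]))));
    }
}
```

Run on a file that contains only 農場 saved as UTF-8, it would show the six bytes e8 be b2 e5 a0 b4; a different byte sequence means the file was saved in some other encoding.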
 