
Basics: How to read Chinese characters from an ANSI-encoded file

 
Lasse Koskela
author
Sheriff
Posts: 11962
So, my situation is this: I have a text file containing sequences of "\uXXXX" escapes (Chinese characters in a file whose encoding is ANSI). What I need to do is read these sequences into String objects as the characters they represent.
Now, being a newbie in everything related to character sets, I tried to "just read it" using a FileReader, but
"\u78ba\u8a8d\u624b\u6a5f\u865f\u78bc" in the file came through as exactly the same text in the String. In other words, my String object was 36 characters long instead of the 6 characters I expected.
I would appreciate any pointers regarding this problem.
 
Ilja Preuss
author
Sheriff
Posts: 14112
I don't have much encoding knowledge either, but here is what I would do:
- parse the strings for the escape codes, for example by using a regular expression
- convert each hex code to an int using Integer.parseInt with radix 16, and cast it to char
- replace the escape code with that char
I would expect it to work, but I am not sure, and there might even be better approaches...
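The three steps above could be sketched roughly as follows. This is only a minimal illustration, not tested against the real file: the class name UnicodeEscapeDecoder and the method name decode are made up for the example, and it assumes every escape is a backslash, a lowercase u, and exactly four hex digits.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class UnicodeEscapeDecoder {
    // Matches a literal backslash, the letter u, and exactly four hex digits.
    private static final Pattern ESCAPE = Pattern.compile("\\\\u([0-9a-fA-F]{4})");

    // Hypothetical helper: replaces each escape with the char it encodes.
    static String decode(String input) {
        Matcher m = ESCAPE.matcher(input);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            // Step 2: hex digits -> int (radix 16) -> char.
            char c = (char) Integer.parseInt(m.group(1), 16);
            // Step 3: quoteReplacement guards against a decoded backslash or dollar sign.
            m.appendReplacement(out, Matcher.quoteReplacement(String.valueOf(c)));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        // Doubled backslashes so the compiler keeps the escapes as plain text.
        String raw = "\\u78ba\\u8a8d\\u624b\\u6a5f\\u865f\\u78bc";
        String decoded = decode(raw);
        System.out.println(raw.length() + " -> " + decoded.length()); // prints 36 -> 6
    }
}
```

Text without any escapes passes through unchanged, so it should be safe to run over whole lines.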
 
Lasse Koskela
author
Sheriff
Posts: 11962
5
Thanks, Ilja. I know this would work but I'm hoping I don't have to start writing the decoding logic myself. If there isn't any standard API for this type of work, I'll have to write one myself.
 
Jim Yingst
Wanderer
Sheriff
Posts: 18671
I don't know of a standard API, but java.util.Properties does this conversion as part of its load(InputStream) method. JDK 1.4.2 has a private loadConvert(String) which may well be exactly what you need, and doesn't seem to depend on outside code - you can probably cut & paste it for your own version, as a nice jump start.
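If copying the private method feels too hacky, one possible workaround is to let Properties do the conversion through its public load(InputStream) method. The sketch below is an assumption-laden illustration: the class name, the decode method, and the dummy key "text" are invented for the example. Note that Properties also interprets other backslash sequences and line continuations, and trims leading whitespace from the value, so this only fits input that is otherwise plain ASCII text on a single line.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Properties;

final class PropertiesEscapeDecoder {
    // Hypothetical helper: wraps the escaped text as the value of a dummy key
    // and lets Properties.load(InputStream) convert the escapes.
    static String decode(String escaped) {
        // load(InputStream) expects ISO-8859-1 bytes.
        byte[] bytes = ("text=" + escaped).getBytes(StandardCharsets.ISO_8859_1);
        Properties props = new Properties();
        try {
            props.load(new ByteArrayInputStream(bytes));
        } catch (IOException e) {
            // Not expected for an in-memory stream, but load declares it.
            throw new UncheckedIOException(e);
        }
        return props.getProperty("text");
    }

    public static void main(String[] args) {
        // Doubled backslashes so the compiler keeps the escapes as plain text.
        String decoded = decode("\\u78ba\\u8a8d\\u624b\\u6a5f\\u865f\\u78bc");
        System.out.println(decoded.length()); // prints 6
    }
}
```

A malformed escape makes load throw an IllegalArgumentException, which may or may not be the behaviour you want for dirty input.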
 
Ernest Friedman-Hill
author and iconoclast
Sheriff
Posts: 24215
Well, if this were a standard named encoding (and I'm not sure it is; UTF-16 or a specific Chinese encoding would be much more appropriate than this escape-sequence stuff) then the thing to do would be to use FileInputStream and InputStreamReader explicitly, rather than using the convenience class FileReader. InputStreamReader has a constructor that lets you specify the encoding, and if the JVM knows about it, then it will translate the data for you automatically into Unicode.
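For a file that really is in a named encoding, the FileInputStream plus InputStreamReader combination might look like the sketch below. The class and method names are hypothetical, and UTF-16 is used only because every JVM is required to support it; the roundTrip helper exists purely to make the example self-contained by writing a temporary file first.

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

final class EncodedFileReader {
    // Reads the whole file, decoding its bytes with the named charset.
    static String readAll(Path file, String charsetName) {
        try (Reader reader = new InputStreamReader(
                new FileInputStream(file.toFile()), charsetName)) {
            StringBuilder text = new StringBuilder();
            char[] buffer = new char[1024];
            int n;
            while ((n = reader.read(buffer)) != -1) {
                text.append(buffer, 0, n);
            }
            return text.toString();
        } catch (IOException e) {
            // Also covers UnsupportedEncodingException for unknown charset names.
            throw new UncheckedIOException(e);
        }
    }

    // Demo helper: writes text to a temp file in the charset, then reads it back.
    static String roundTrip(String text, String charsetName) {
        try {
            Path file = Files.createTempFile("encoding-demo", ".txt");
            try {
                Files.write(file, text.getBytes(charsetName));
                return readAll(file, charsetName);
            } finally {
                Files.deleteIfExists(file);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Real Unicode escapes here, so the literal holds two Chinese characters.
        String decoded = roundTrip("\u78ba\u8a8d", "UTF-16");
        System.out.println(decoded.length()); // prints 2
    }
}
```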
 
Lasse Koskela
author
Sheriff
Posts: 11962
5
"InputStreamReader has a constructor that lets you specify the encoding, and if the JVM knows about it, then it will translate the data for you automatically into Unicode."
Could you pleeease share with me the encoding I should specify? I already tried "UTF-8", "ANSI", "ASCII", "ISO-8859-1" before posting here and all of them seemed to result in a string of length 36 instead of 6...
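A likely explanation for the constant length of 36 is that the escapes themselves consist only of ASCII characters (backslash, u, hex digits), and UTF-8, US-ASCII and ISO-8859-1 all decode ASCII bytes to the same characters. The small sketch below illustrates that point; the class and method names are made up for the example.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class AsciiCompatibleDemo {
    // Decodes the bytes with the given charset and reports the String length.
    static int decodedLength(byte[] bytes, Charset charset) {
        return new String(bytes, charset).length();
    }

    public static void main(String[] args) {
        // Doubled backslashes: this is the 36-character text as stored in the file.
        String raw = "\\u78ba\\u8a8d\\u624b\\u6a5f\\u865f\\u78bc";
        byte[] fileBytes = raw.getBytes(StandardCharsets.US_ASCII);

        Charset[] candidates = {
            StandardCharsets.UTF_8,
            StandardCharsets.US_ASCII,
            StandardCharsets.ISO_8859_1
        };
        for (Charset cs : candidates) {
            // Each line reports 36: no charset decoder interprets the escapes.
            System.out.println(cs.name() + ": " + decodedLength(fileBytes, cs));
        }
    }
}
```

So picking a different charset name cannot help here; the escapes have to be decoded as a separate step after reading.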
 
Ernest Friedman-Hill
author and iconoclast
Sheriff
Posts: 24215
37
Chrome Eclipse IDE Mac OS X
Well, like I said,

... if this were a standard named encoding (and I'm not sure it is...

I've never seen a file like this. Where did you get it, anyway?
 
Jim Yingst
Wanderer
Sheriff
Posts: 18671
EFH's suggestion sounds good, but I'm pretty sure this is not a standard IANA named encoding. (Though it's certainly based on Unicode.) I iterated through the available Charsets in JDK 1.4.2, and none have an averageCharsPerByte() lower than .5 (whereas what we're looking for would be about .1667 chars per byte, or 6 bytes per char).
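For anyone who wants to repeat that scan, something along these lines should do it. The class and method names are invented for the example, and the numbers will depend on which charsets your JVM ships, so don't expect to reproduce the 1.4.2 figures exactly.

```java
import java.nio.charset.Charset;
import java.util.Map;

final class CharsetScan {
    // Returns the smallest averageCharsPerByte() among all installed charsets.
    static float lowestAverageCharsPerByte() {
        float lowest = Float.MAX_VALUE;
        for (Map.Entry<String, Charset> entry : Charset.availableCharsets().entrySet()) {
            try {
                float average = entry.getValue().newDecoder().averageCharsPerByte();
                lowest = Math.min(lowest, average);
            } catch (UnsupportedOperationException e) {
                // Defensive: skip any charset that cannot create a decoder.
            }
        }
        return lowest;
    }

    public static void main(String[] args) {
        // An encoding that spends 6 bytes per char would need roughly 1/6 = 0.1667.
        System.out.println("needed: " + (1.0f / 6));
        System.out.println("lowest available: " + lowestAverageCharsPerByte());
    }
}
```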
 
Lasse Koskela
author
Sheriff
Posts: 11962
5
I got it from some pipeline including Excel and VB macros... And this character decoding stuff is not even the worst part. The stuff I'm getting is sort-of comma-separated but not quite. It has rules like "the file format is csv up to the 4th column, after that everything up to the newline character is one column, including any commas etc."
Darn. I was already itching to reuse the CSV parser I wrote a couple of weeks ago.
Anyway, I think I'm good to go with the loadConvert() method. Thanks for the tips, guys.
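For the "CSV up to the 4th column, then one big column" rule, String.split with a limit might be enough. This is only a sketch under my own reading of the format: it assumes four plain comma-separated columns followed by a fifth that swallows the rest of the line, and it assumes the first four columns never contain quoted commas. The class and method names are hypothetical.

```java
final class LooseCsvLine {
    // Assumed layout: 4 regular columns, then everything else as column 5.
    private static final int COLUMN_COUNT = 5;

    // Splits on the first four commas only; later commas stay in the last column.
    static String[] parse(String line) {
        return line.split(",", COLUMN_COUNT);
    }

    public static void main(String[] args) {
        String[] columns = parse("a,b,c,d,rest, with, commas");
        System.out.println(columns.length); // prints 5
        System.out.println(columns[4]);     // prints rest, with, commas
    }
}
```

The last column could then be passed through whatever escape decoding ends up being used.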
 