
Producing output files in UTF-8

 
T Dahl
Ranch Hand
Posts: 35
Not sure if this is a Java question or an Eclipse question. Perhaps a little of both.

What I want to do is to create an XML output file in UTF-8 format. The serializer allows me to specify UTF-8, so that part is taken care of. Some of the element names are constant literals in my program. My Java source file is stored in ISO-8859-1. For that reason some characters come out with the wrong representation. I believe that if my source code were in UTF-8 too, my output would also be correct. It is not as easy as changing to UTF-8 in the file properties settings in Eclipse. If I do, I get a lot of "invalid character" messages. So my questions are:
- Will the Java compiler be happy with UTF-8 input?
- Is there an easy way to convert my ISO-8859-1 source file to UTF-8?
 
Rob Spoor
Sheriff
Posts: 21135
Unless you specify values inside the Java code, its encoding doesn't matter. What does matter is the encoding of the output stream you use.

If you use a FileReader / FileWriter, that uses the platform default encoding. On Windows that's typically CP-1252, which is very close to ISO-8859-1. The trick is to ignore FileReader / FileWriter and use FileInputStream / FileOutputStream instead. Wrap an InputStreamReader / OutputStreamWriter around the FileInputStream / FileOutputStream, so you can create a Reader / Writer with an encoding you choose.
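A minimal sketch of that wrapping for the output side, assuming a made-up class name `Utf8WriterDemo` and a made-up file name `output.xml` (neither comes from the original post):

```java
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

class Utf8WriterDemo {
    // Writes the text as UTF-8, whatever the platform default encoding is.
    static void writeUtf8(File file, String text) throws IOException {
        try (Writer out = new OutputStreamWriter(
                new FileOutputStream(file), StandardCharsets.UTF_8)) {
            out.write(text);
        }
    }

    public static void main(String[] args) throws IOException {
        // "output.xml" is a hypothetical file name used only for this sketch.
        writeUtf8(new File("output.xml"),
                "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<root/>\n");
    }
}
```

Reading works the same way with an InputStreamReader around a FileInputStream and the charset the file was actually written in.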

If you do specify values inside the Java code, you can use \uXXXX escapes inside the code to represent the Unicode characters. These can appear anywhere, not just in string literals. For instance, the following is valid Java source code:
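A minimal sketch of such a file (the class name `EscapeDemo` and the `keyword` method are made up for illustration, so the snippet originally posted may have looked different):

```java
cl\u0061ss EscapeDemo {
    // The escape inside the literal is translated as well,
    // so this returns the plain five-letter keyword.
    static String keyword() {
        return "cl\u0061ss";
    }

    public static void main(String[] args) {
        System.out.println(keyword()); // prints: class
    }
}
```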
That's because \u0061 is the Unicode escape for the letter a, and the Unicode translation occurs before the actual compiling. In other words, the compiler takes that cl\u0061ss and replaces it with class.
 
T Dahl
Ranch Hand
Posts: 35
Rob Spoor wrote: Unless you specify values inside the Java code...

But that's what I did. I used literals containing non-English characters (æ, for example) as XML element names.


Rob Spoor wrote: If you do specify values inside the Java code, you can use \uXXXX inside the code to represent the Unicode characters. These can appear anywhere... the Unicode translation occurs before compiling.

cool!

I have experimented more and I think I found a solution that works for me, though it may not be a general solution for similar cases. I changed the source file to UTF-8. As expected, I got a bunch of complaints. I then selected one of the offending characters and invoked find/replace (Ctrl-F in Eclipse). The funny-looking character appeared in the find field. I put the right character (e.g. æ) in the replace field and replaced it throughout the file. Repeat for each different character. Apparently this solved my problem. It is probably not the "right" way to do it...
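For anyone who would rather not do the replacements by hand, the same re-encoding can be sketched in a few lines: decode the file's bytes as ISO-8859-1 and write the text back as UTF-8. The class and method names here are made up, and the sketch assumes the file really is ISO-8859-1 to begin with:

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

class ReencodeDemo {
    // Reads the source file as ISO-8859-1 and writes the same text as UTF-8.
    static void latin1ToUtf8(Path source, Path target) throws IOException {
        byte[] raw = Files.readAllBytes(source);
        String text = new String(raw, StandardCharsets.ISO_8859_1);
        Files.write(target, text.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) throws IOException {
        // Usage: java ReencodeDemo <input file> <output file>
        latin1ToUtf8(Paths.get(args[0]), Paths.get(args[1]));
    }
}
```

Writing to a separate target file keeps the original around in case the guess about the source encoding turns out to be wrong.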

Thank you for the reply! I actually learned something.
 
Rob Spoor
Sheriff
Posts: 21135
You're welcome
 
Mike Simmons
Ranch Hand
Posts: 3090
The javac compiler accepts an -encoding option if you want to tell it which encoding your source files use. If you don't specify this value, the platform default is used. It's likely that your platform default is neither ISO-8859-1 nor UTF-8, and that's why you find yourself needing to do the search-replace stuff.
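To see why the assumed encoding matters, here is a small sketch (class and method names are made up) that decodes the same two bytes under two different charsets. It only illustrates the mismatch; it is not what javac does internally:

```java
import java.nio.charset.StandardCharsets;

class DecodeMismatchDemo {
    // The ISO-8859-1 bytes for the ae ligature followed by '>'.
    static final byte[] LATIN1_BYTES = { (byte) 0xE6, (byte) 0x3E };

    // Decoding with the charset the bytes were written in gives the ligature back.
    static String decodeAsLatin1() {
        return new String(LATIN1_BYTES, StandardCharsets.ISO_8859_1);
    }

    // Decoding as UTF-8 hits a malformed sequence, which Java replaces
    // with the Unicode replacement character.
    static String decodeAsUtf8() {
        return new String(LATIN1_BYTES, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(Integer.toHexString(decodeAsLatin1().charAt(0)));
        System.out.println(Integer.toHexString(decodeAsUtf8().charAt(0)));
    }
}
```

The same thing happens to a source file when the editor or compiler assumes the wrong charset, which is where the "invalid character" complaints come from.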

Alternatively, replacing all those non-ASCII characters with Unicode escapes (like \u0061) before you compile will solve the problem too. It's a pain, as you've found. But once you've done it, the file is pure ASCII, so you can pass it off to other people without having to tell them to set the -encoding option, which makes life easier for them. And you.
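The replacement itself is mechanical, so it can be scripted. A rough sketch, with a made-up class name `UnicodeEscaper`, that turns every character above 7-bit ASCII into its \uXXXX form:

```java
class UnicodeEscaper {
    // Replaces each char outside 7-bit ASCII with a backslash, the letter u
    // and the char's value as four lowercase hex digits.
    static String escapeNonAscii(String source) {
        StringBuilder out = new StringBuilder(source.length());
        for (int i = 0; i < source.length(); i++) {
            char c = source.charAt(i);
            if (c < 128) {
                out.append(c);
            } else {
                out.append('\\').append('u').append(String.format("%04x", (int) c));
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The input contains two non-ASCII letters; both come out escaped.
        System.out.println(escapeNonAscii("bl\u00e5b\u00e6r"));
    }
}
```

Characters outside the Basic Multilingual Plane are stored as two chars in a Java String, so they come out as two escapes, which is also how Java source expects them.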
 
Rob Spoor
Sheriff
Posts: 21135
You can save yourself the work of manually encoding the file. The JDK's bin folder contains the native3ascii tool which can do the hard work for you.
 
T Dahl
Ranch Hand
Posts: 35
Rob Spoor wrote: You can save yourself the work of manually encoding the file. The JDK's bin folder contains the native3ascii tool which can do the hard work for you.

Thank you! That seems to be the proper tool for a situation like the one I had. By the way, it is spelled native2ascii.
 
Rob Spoor
Sheriff
Posts: 21135
Uhm yeah. I seem to have mistyped it. native3ascii makes no sense at all...
 