UTF-16 encoding

 
Dave Mulligan
Greenhorn
Posts: 18
Hi all,
I need to convert incoming String data into UTF-16 so that I can send it as an XML stream to a listener. I've found this algorithm for translating into UTF-8 (W3C open-source code), but I can't find one for UTF-16. I want UTF-16 because that is what the receiving application expects.
Here are my questions:
1. Does it matter if I send my text encoded as UTF-8 and say that it is "UTF-16"? It seems to work, but I'm very nervous about this.
2. Does anyone have a simple UTF-16 converter, similar to the one above?
Many thanks
Dave
 
Jim Yingst
Wanderer
Sheriff
Posts: 18671
1. Risky. It depends on how the data is being read and written. It's possible that whatever is doing the reading is really expecting UTF-8, or that it's flexible enough to adapt. But don't send UTF-8 and label it UTF-16; that's just evil. Some maintenance programmer will track you down and shoot you, and justifiably so.
2.

[ March 20, 2003: Message edited by: Jim Yingst ]
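To illustrate the risk in point 1, here is a minimal sketch of what a reader that strictly honors the "UTF-16" label would make of UTF-8 bytes. The class name MislabelDemo and the method misread are made up for this example, and it assumes the runtime recognizes the encoding names "UTF-8" and "UTF-16BE"; older JDKs may use different names.

```java
import java.io.UnsupportedEncodingException;

public class MislabelDemo {
    // Encodes text as UTF-8, then decodes those same bytes as if they were
    // big-endian UTF-16, which is what a strict UTF-16 reader would do.
    static String misread(String text) {
        try {
            byte[] utf8 = text.getBytes("UTF-8");
            return new String(utf8, "UTF-16BE");
        } catch (UnsupportedEncodingException e) {
            throw new RuntimeException("Encoding not supported here: " + e.getMessage());
        }
    }

    public static void main(String[] args) {
        String garbled = misread("Hi");
        // The two ASCII bytes 0x48 0x69 are read as the single character U+4869.
        System.out.println("length = " + garbled.length());
        System.out.println("code point = " + Integer.toHexString(garbled.charAt(0)));
    }
}
```

If the mislabeled stream "seems to work", the receiver is probably not decoding it as UTF-16 at all.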
 
Dave Mulligan
Greenhorn
Posts: 18
Thanks Jim,
1. I agree. I'm learning to trust that nagging "I really shouldn't be doing this" feeling.
2. I see what you're suggesting, but I don't think this solves my problem, unless I've misunderstood the problem. When I create the XML I want to send, doesn't only the content of the tags need to be encoded, and not the tags themselves? If, for example, I want to send in my XML, don't I only have to run "+44 121-534-8707" through the encoder?
If this is right, then I run into a new problem with using the OutputStreamWriter - how do I append the encoded byte array that it produces to the StringBuffer that is building my XML?
Oh, and just to add to the fun, I'm restricted to JDK 1.1.8, so I can't use the java.nio.charset.CharsetEncoder class and related methods.
I've come up with this as a way to do this:

but it has the disadvantage of encoding every character, even the low ASCII values that are valid, which destroys the readability of my XML. On the other hand, it will be (should be) reconstructed by the receiving machine, so this shouldn't be a problem.
Am I on the right track?
Thanks
Dave
 
Dave Mulligan
Greenhorn
Posts: 18
Just re-read my last post & realized I could have been clearer about why I want to encode only the content of the XML tags, and not the entire output stream.
The very start of my XML is:

So, if I encode the entire stream, the instruction on how to decode the content will itself be encoded!
Dave
 
Jim Yingst
Wanderer
Sheriff
Posts: 18671
Not much time right now, so briefly: your method for encoding UTF-16 isn't correct. The Integer.toHexString() is not called for. Part of the confusion is that you're converting a String to bytes and then back to a String; forget that last part. Try writing the encoded bytes to a file using a FileOutputStream, for example, then look at the file with various other tools like a text editor. It should be very readable, at least for all the regular ASCII values. Some tools that don't understand UTF-16 will show spaces for every other char. This is because every other byte is 0 in a two-byte encoding of simple ASCII values.
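As a rough illustration of the point about zero bytes, here is a minimal sketch that encodes a short ASCII string and shows the resulting bytes in hex. The class name Utf16Bytes and its methods are made up for this example, and it assumes the runtime recognizes the encoding name "UTF-16BE" (big-endian, no byte order mark); older JDKs may use different names, so check the supported-encodings list for yours.

```java
import java.io.UnsupportedEncodingException;

public class Utf16Bytes {
    // Encodes text as big-endian UTF-16 without a byte order mark.
    static byte[] encode(String text) {
        try {
            return text.getBytes("UTF-16BE");
        } catch (UnsupportedEncodingException e) {
            throw new RuntimeException("UTF-16BE is not supported here: " + e.getMessage());
        }
    }

    // Renders bytes as two-digit hex values separated by spaces.
    // This is for inspection only; it is not part of the encoding itself.
    static String toHex(byte[] bytes) {
        String digits = "0123456789abcdef";
        StringBuffer sb = new StringBuffer();
        for (int i = 0; i < bytes.length; i++) {
            if (i > 0) {
                sb.append(' ');
            }
            sb.append(digits.charAt((bytes[i] >> 4) & 0x0F));
            sb.append(digits.charAt(bytes[i] & 0x0F));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Each ASCII character becomes a 0x00 byte followed by its ASCII value.
        System.out.println(toHex(encode("+44"))); // prints "00 2b 00 34 00 34"
    }
}
```

The bytes would go straight to an OutputStream; the hex string exists only so the zero bytes are visible.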
When you encode a document in, say, UTF-16, you generally encode the entire doc the same way, including all tags and the encoding declaration itself. This seems a bit weird, I know, but it works, mostly because almost all encodings use the same values for the simple ASCII stuff you need for the encoding declaration. And all XML parsers are required to understand UTF-8 and UTF-16 anyway, so they can quickly figure out what's going on. I'm not sure how it works if you want to use more exotic encodings, but for UTF-16, just encode everything the same way. You may want to ask on the XML forum to learn more about this.
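A minimal sketch of encoding the whole document the same way, assuming the encoding name "UTF-16" is available on your JDK (older releases may need a different name). The class name Utf16XmlWriter, the DECLARATION text, and the phone element are hypothetical, since the actual XML from the thread is not shown.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.Writer;

public class Utf16XmlWriter {
    // Hypothetical declaration; the real one from the thread is not shown.
    static final String DECLARATION = "<?xml version=\"1.0\" encoding=\"UTF-16\"?>";

    // Writes the declaration and the body through one UTF-16 writer,
    // so tags, declaration, and content are all encoded the same way.
    static void writeXml(OutputStream out, String body) throws IOException {
        Writer writer = new OutputStreamWriter(out, "UTF-16");
        writer.write(DECLARATION);
        writer.write(body);
        writer.flush(); // not closed here, so the caller keeps control of the stream
    }

    // Convenience wrapper that returns the encoded document as a byte array.
    static byte[] encodeDocument(String body) {
        try {
            ByteArrayOutputStream sink = new ByteArrayOutputStream();
            writeXml(sink, body);
            return sink.toByteArray();
        } catch (IOException e) {
            throw new RuntimeException("Could not encode document: " + e.getMessage());
        }
    }

    public static void main(String[] args) {
        byte[] bytes = encodeDocument("<phone>+44 121-534-8707</phone>");
        // Java's "UTF-16" encoder starts the output with a byte order mark (FE FF).
        System.out.println("first byte is 0xFE: " + ((bytes[0] & 0xFF) == 0xFE));
        System.out.println(bytes.length + " bytes written");
    }
}
```

In real use the OutputStream would be the socket or file stream going to the listener, and no String is ever rebuilt from the encoded bytes.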
 
Dave Mulligan
Greenhorn
Posts: 18
Thanks Jim,
Now that I know I can safely encode the entire stream, including the tags and the encoding declaration, it works just fine.
Thanks again for the help
Dave
 