
B&S: My thoughts and final decision on character encoding

Michal Charemza
Ranch Hand
Posts: 86
Hi all,

Thanks to all who have asked questions about character encoding and responded to posts about it (both mine and others'). They have been very helpful. I have created a very rough draft of what will go in my choice.txt file about this. I know it is far, far too verbose, and I will be cutting it right down. I wrote it at this stage mainly to organise my thoughts about the whole issue and to come to a good decision.

I post it here for two reasons: to help others (hopefully!) as others have helped me, and to get some feedback, especially if people think I've done something really wrong or written something that simply isn't true.

The Javadoc does not list "8 bit US ASCII" among the charsets guaranteed to be available on all platforms.

The options are to use "7 bit US ASCII", "ISO-8859-1", or a UTF encoding.

Writing to the data file using a UTF encoding of more than 8 bits would contravene the specified database file format, in which each character is 8 bits long, and would make the file incompatible with the mentioned legacy software. Trying to read the database file using a UTF encoding of more than 8 bits would also fail to read the data correctly: simply looking at the supplied database file, one can see that all the characters are 8 bits long.

Using UTF-8 may cause problems, because it does not guarantee that every character is written as one byte. Although it would be able to read the database file well, it may end up writing characters as more than one byte, which would not follow the specified database format requirement that all characters are 8 bits long.
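To make the UTF-8 concern concrete, here is a minimal sketch that counts how many bytes the same text takes in each candidate encoding. The class name EncodedWidthDemo, the method encodedLength, and the sample text are made up for illustration and are not part of the assignment.

```java
import java.nio.charset.Charset;

final class EncodedWidthDemo {

    /** Returns how many bytes the given text occupies in the named charset. */
    static int encodedLength(String text, String charsetName) {
        return text.getBytes(Charset.forName(charsetName)).length;
    }

    public static void main(String[] args) {
        // Hypothetical sample value: 4 characters, the last one is e-acute (U+00E9).
        String name = "Caf\u00e9";

        System.out.println(encodedLength(name, "ISO-8859-1")); // prints 4
        System.out.println(encodedLength(name, "UTF-8"));      // prints 5
        System.out.println(encodedLength(name, "US-ASCII"));   // prints 4
    }
}
```

With UTF-8 the accented character takes two bytes, so a 4-character value no longer fits in 4 bytes. With US-ASCII the length stays at 4 only because the unmappable character is replaced by a single substitute byte.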

Therefore it is a choice between "7 bit US ASCII" and "ISO-8859-1".

If "ISO-8859-1" is used and the older software is expecting "7 bit US ASCII", the result is unknown, and in fact may be irrecoverable if the program was not coded to handle this. Indeed, in Java, the behavior of the String constructor that decodes byte arrays is unspecified if it is presented with bytes that are not valid in the specified charset, and CharsetDecoder.decode() will throw an exception if it encounters such bytes.
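The exception case can be sketched as follows, assuming a decoder obtained from Charset.newDecoder() and left at its default settings (which report errors rather than replace). The class and method names are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;

final class StrictDecodeDemo {

    /**
     * Returns true if every byte decodes in the named charset.
     * A fresh CharsetDecoder reports bad input by default, so its
     * decode(ByteBuffer) method throws CharacterCodingException.
     */
    static boolean decodesCleanly(byte[] raw, String charsetName) {
        try {
            Charset.forName(charsetName).newDecoder().decode(ByteBuffer.wrap(raw));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // "Caf" followed by 0xE9, which is e-acute in ISO-8859-1 but not valid US-ASCII.
        byte[] raw = {0x43, 0x61, 0x66, (byte) 0xE9};

        System.out.println(decodesCleanly(raw, "ISO-8859-1")); // prints true
        System.out.println(decodesCleanly(raw, "US-ASCII"));   // prints false
    }
}
```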

If "7 bit US ASCII" is used and the older software is expecting "ISO-8859-1", I can foresee no fatal problems in the older software, as "7 bit US ASCII" is completely contained in the "ISO-8859-1" charset. Any "7 bit US ASCII" text will be displayed accurately by software that expects characters in "ISO-8859-1". The only possible problem would be if my Java program encounters any "ISO-8859-1" characters that are not in the "7 bit US ASCII" set. However, this can be coded for by using only the Charset decode method, which replaces any unsupported characters with a replacement value without throwing an exception. Testing on one Java platform, this appears as "?". This would result in accented and other extended characters being replaced, but would never result in an unforeseen exception in either my software or the legacy software.
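A minimal sketch of this replacing behavior, using Charset.decode() and Charset.encode() directly. The names LenientDecodeDemo, readField and writeField are made up for illustration. Note that on decoding the replacement is the Unicode replacement character U+FFFD, which many consoles display as "?", while on encoding an unmappable character is written as the byte for "?".

```java
import java.nio.ByteBuffer;
import java.nio.charset.Charset;

final class LenientDecodeDemo {

    /** Decodes bytes without ever throwing: bad bytes become U+FFFD. */
    static String readField(byte[] raw, String charsetName) {
        return Charset.forName(charsetName).decode(ByteBuffer.wrap(raw)).toString();
    }

    /** Encodes text without ever throwing: unmappable characters become the charset's replacement bytes. */
    static byte[] writeField(String value, String charsetName) {
        ByteBuffer encoded = Charset.forName(charsetName).encode(value);
        byte[] out = new byte[encoded.remaining()];
        encoded.get(out);
        return out;
    }

    public static void main(String[] args) {
        // "Caf" followed by 0xE9 (e-acute in ISO-8859-1, invalid in US-ASCII).
        byte[] raw = {0x43, 0x61, 0x66, (byte) 0xE9};

        String decoded = readField(raw, "US-ASCII");
        System.out.println(decoded.charAt(3) == '\uFFFD'); // prints true

        byte[] written = writeField("Caf\u00e9", "US-ASCII");
        System.out.println(written.length);                // prints 4
        System.out.println((char) written[3]);             // prints ?
    }
}
```

Either way the field keeps its length of one byte per character, which is what matters for the file format.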

Also, using the decode method of a Charset instance would allow for an easy change to another 1 byte encoding, should the client decide to change.
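One way this could look, as a sketch only: a small wrapper that takes the charset name in one place and refuses anything that might need more than one byte per character. FieldCodec and its methods are hypothetical names, and the one-byte check via CharsetEncoder.maxBytesPerChar() is my own assumption about how to enforce the file format, not something the assignment prescribes.

```java
import java.nio.ByteBuffer;
import java.nio.charset.Charset;

final class FieldCodec {

    private final Charset charset;

    /** Accepts only charsets whose encoder never needs more than one byte per character. */
    FieldCodec(String charsetName) {
        Charset candidate = Charset.forName(charsetName);
        if (candidate.newEncoder().maxBytesPerChar() > 1.0f) {
            throw new IllegalArgumentException(charsetName + " is not a 1 byte encoding");
        }
        this.charset = candidate;
    }

    /** Decodes a field; bytes the charset cannot decode are replaced, never reported. */
    String decode(byte[] raw) {
        return charset.decode(ByteBuffer.wrap(raw)).toString();
    }

    /** Encodes a field; characters the charset cannot encode are replaced, never reported. */
    byte[] encode(String value) {
        ByteBuffer encoded = charset.encode(value);
        byte[] out = new byte[encoded.remaining()];
        encoded.get(out);
        return out;
    }
}
```

Switching from "US-ASCII" to "ISO-8859-1" is then a one-word change at the place where the codec is constructed, for example new FieldCodec("ISO-8859-1"), while new FieldCodec("UTF-8") is rejected up front.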

Decision: To use "7 bit US ASCII", using decode methods from a Charset instance. This is the safest approach that guarantees that the older software will run properly, although extended characters may be unrecognizable. Essentially my reasoning is that I can control and know what my code does, whereas I cannot control and am not sure what the older software does. I have also judged that allowing their software to definitely run with the database file edited by my software, but with the possibility of names being slightly malformed, is better than a possibility that it will not run at all.

[ September 04, 2004: Message edited by: Michal Charemza ]