
8 bit US-ASCII

 
Patrick McDonogh
Ranch Hand
Posts: 89
Hi all,
I was wondering whether this method from the RandomAccessFile class:
writeBytes(new String("the string","US-ASCII"))
would write a string in 8-bit US-ASCII.

Thanks, and sorry about this; it is just really worrying me that I'm not writing the data correctly. Also, if I use the writeShort() method of RandomAccessFile, does that get written as text?

Thanks a lot everyone.
 
Dustin Tosh
Greenhorn
Posts: 9
Patrick,

I am a little confused as to what you are trying to do with new String("the string","US-ASCII"). Looking at the String API, I don't see a constructor of the form String(String, String), so this is not legal. Also, in Java each character in a string is a 16-bit Unicode character, not ASCII.

So maybe I am just unclear as to what you are trying to do?

As for the writeShort() method of RandomAccessFile, it writes a short to the file as two bytes, high byte first. So it gets written as binary data, not as text.
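To see what writeShort() actually puts in the file, here is a minimal sketch that writes one short to a scratch file and dumps the raw bytes. The class and method names are made up for illustration, and it assumes Java 7 or later (try-with-resources and java.nio.file.Files).

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.io.UncheckedIOException;
import java.nio.file.Files;

class WriteShortDemo {

    // Writes one short to a scratch file and returns the raw bytes that landed on disk.
    static byte[] bytesWrittenByWriteShort(short value) {
        try {
            File file = File.createTempFile("writeshort-demo", ".bin");
            try {
                try (RandomAccessFile raf = new RandomAccessFile(file, "rw")) {
                    raf.writeShort(value);
                }
                return Files.readAllBytes(file.toPath());
            } finally {
                file.delete(); // clean up the scratch file
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] raw = bytesWrittenByWriteShort((short) 0x1234);
        // Two binary bytes, high byte first; the decimal text "4660" never appears in the file.
        System.out.printf("%d bytes: %02x %02x%n", raw.length, raw[0], raw[1]); // prints "2 bytes: 12 34"
    }
}
```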

I would suggest looking at the String and RandomAccessFile API more closely.

Hope this helps!
 
Lara McCarver
Ranch Hand
Posts: 118
I think he means this constructor:

String(byte[] bytes, String charsetName)

This requirement was a bit confusing to me, too, because technically speaking, US-ASCII is 7-bit, not 8-bit... the eighth bit is for all those icky European characters. So the alternative is to use a charset that is actually 8-bit but encompasses US-ASCII, which would be ISO-8859-1, and which is Euro-friendly. I can't remember which one I actually did... whichever you choose, of course you should document it in your issues.txt!
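A small sketch of the difference between the two charsets, assuming Java 7 or later for StandardCharsets (on older JVMs the charset names "US-ASCII" and "ISO-8859-1" can be passed as strings instead). The class name and the fitsInAscii helper are only illustrative.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class AsciiVsLatin1 {

    // True if every character of the text can be encoded as 7-bit US-ASCII.
    static boolean fitsInAscii(String text) {
        return StandardCharsets.US_ASCII.newEncoder().canEncode(text);
    }

    public static void main(String[] args) {
        String plain = "Hello";
        String accented = "caf\u00e9"; // "cafe" with an acute accent on the e

        // For plain ASCII text both charsets give identical bytes, one per character.
        System.out.println(Arrays.equals(
                plain.getBytes(StandardCharsets.US_ASCII),
                plain.getBytes(StandardCharsets.ISO_8859_1))); // prints true

        // US-ASCII has no accented e, so getBytes substitutes '?' (63).
        System.out.println(Arrays.toString(
                accented.getBytes(StandardCharsets.US_ASCII))); // prints [99, 97, 102, 63]

        // ISO-8859-1 stores it as the single byte 0xE9 (-23 as a signed Java byte).
        System.out.println(Arrays.toString(
                accented.getBytes(StandardCharsets.ISO_8859_1))); // prints [99, 97, 102, -23]

        System.out.println(fitsInAscii(plain));    // prints true
        System.out.println(fitsInAscii(accented)); // prints false
    }
}
```

If the data is guaranteed to be plain English text, the two choices produce the same file, so the decision only matters once accented characters show up.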
[ November 16, 2005: Message edited by: Lara McCarver ]
 
Patrick McDonogh
Ranch Hand
Posts: 89
Yes, I'm trying to write the text as 8-bit US-ASCII using the String(String string, Charset charset) constructor, and it's really confusing me what I need to pass as the charset. I'm not necessarily looking for the code itself, just an explanation of how charsets work, as I am only half understanding this topic even after reading the API documentation.
 
Patrick McDonogh
Ranch Hand
Posts: 89
Sorry, I mean the String(byte[] bytes, String charsetName) constructor.
 
Lara McCarver
Ranch Hand
Posts: 118
Here is what I know (depending on others to correct me!)...

I was told that in Java, strings are internally stored as Unicode, where each character takes up 2 bytes. But most text files are stored using ISO-Latin characters, which take up 1 byte each. If you look at the Java documentation on all the possible charsets, you can see all the different standard ways of encoding characters. Why there are so many... well just for example, ISO-Latin doesn't work with a language like Japanese or Chinese which has a *LARGE* number of characters, but it is great for English and most European languages, because the files get stored very compactly (1 byte per character vs. 2 bytes per character makes the file half the size). Each charset has a story behind it.

Actually I am a little confused about how Java stores its files, because the Properties files that it creates look like normal files to me (unless they are Unicode files, and Notepad and TextPad are just really great apps that can read Unicode files too...).

Lara
 
Dustin Tosh
Greenhorn
Posts: 9
When using the Properties class, the files actually get stored using the ISO 8859-1 encoding scheme.

Directly from the Properties API:
"When saving properties to a stream or loading them from a stream, the ISO 8859-1 character encoding is used. For characters that cannot be directly represented in this encoding, Unicode escapes are used; however, only a single 'u' character is allowed in an escape sequence. The native2ascii tool can be used to convert property files to and from other character encodings."
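The behaviour described in that quote can be checked with a short sketch like the one below. It stores a property whose value ISO 8859-1 cannot represent, looks at the text that was written, and loads it back. The class and helper names are hypothetical, and it assumes Java 7 or later for StandardCharsets.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Properties;

class PropertiesEncodingDemo {

    // Saves the properties to memory and returns the stored text (the bytes are ISO 8859-1).
    static String storeToString(Properties props) {
        try {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            props.store(out, null);
            return new String(out.toByteArray(), StandardCharsets.ISO_8859_1);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Loads properties back from text produced by storeToString.
    static Properties loadFromString(String stored) {
        try {
            Properties props = new Properties();
            props.load(new ByteArrayInputStream(stored.getBytes(StandardCharsets.ISO_8859_1)));
            return props;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty("greeting", "\u3042"); // a Hiragana letter, outside ISO 8859-1

        String stored = storeToString(props);
        // The value appears as a six-character backslash-u escape in plain ASCII,
        // which is why the file looks "normal" in any text editor.
        System.out.println(stored);

        // Loading turns the escape back into the original character.
        System.out.println(loadFromString(stored).getProperty("greeting").equals("\u3042")); // prints true
    }
}
```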

As for TextPad, it will read both ASCII and Unicode.
 
Paul Clapham
Sheriff
Posts: 21576
Well, you don't "write" a String by creating another String anyway.

Here's the deal: a String in Java is a sequence of Unicode characters. You can read about Unicode at its website (http://www.unicode.org) but suffice it to say that Unicode is intended to encompass all the world's writing systems. A Java char is 16 bits, so there are 65,536 possible char values. (Unicode itself now defines more characters than that, and Java represents the extra ones as pairs of chars. Also, the first 128 Unicode characters are the familiar ASCII character set.)

Normally people don't store their text in 16-bit units, though. 8-bit units ("bytes") are preferred because they take less space for most texts, plus there are historical reasons, i.e. there are billions of documents where the text is stored in bytes.

Now, when people discovered that ASCII didn't satisfy all their requirements (this was about 35 minutes after ASCII was defined and somebody wanted to put the word resumé into a document), there were "extended ASCII" codes designed to represent characters. Lots of them. Windows defined a bunch, Apple defined a bunch to run on Macs, the ISO got into the act and defined ISO-8859-1, ISO-8859-2... the last I heard they were up to ISO-8859-15. All of these encodings or charsets were incompatible with each other in some way. Not to mention that the Japanese and Chinese designed their own mess in the same way.

So, back to Java. As I said, a String contains Unicode chars. So as such, it doesn't have an encoding or a charset. Those only come into play when you convert a String to bytes, or when you convert bytes to a String. And commonly the bytes in question are in a file. If you don't specify the charset you want to use when this conversion takes place, then Java will use the system's default charset, which is chosen based on your locale and your operating system. And if you speak English, it's going to be one of those extended ASCII charsets.
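Applied to the original question, the two conversion points look roughly like this. It is only a sketch: the class name, method name and scratch file are invented for the example, and it assumes Java 7 or later for StandardCharsets and try-with-resources. The charset is named explicitly in both directions, so the platform default never comes into play.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

class AsciiRecordDemo {

    // Encodes the text as US-ASCII, writes the bytes to a scratch file,
    // then reads the same bytes back and decodes them into a new String.
    static String roundTrip(String text) {
        try {
            File file = File.createTempFile("ascii-record", ".db");
            try (RandomAccessFile raf = new RandomAccessFile(file, "rw")) {
                // String -> bytes: this is where the charset applies.
                byte[] encoded = text.getBytes(StandardCharsets.US_ASCII);
                raf.write(encoded);

                // bytes -> String: the charset applies again, and it must match.
                byte[] buffer = new byte[encoded.length];
                raf.seek(0);
                raf.readFully(buffer);
                return new String(buffer, StandardCharsets.US_ASCII);
            } finally {
                file.delete(); // clean up the scratch file
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(roundTrip("the string")); // prints "the string"
    }
}
```

Note that write(byte[]) is used here rather than writeBytes(String): writeBytes simply drops the high eight bits of each char, so it gives you no say over the encoding.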

Generally you don't have to hack about in your program converting bytes from one charset to another. Just be aware that if you start seeing ??? where you expected text, then charset A has been used to decode bytes that were really encoded in charset B. The remedy for that, if it happens, is to find out what B really is and use B.
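Here is a hedged sketch of how that kind of damage arises, with UTF-8 standing in for "charset B" and ISO-8859-1 for "charset A" (those two are just picked for the example, as are the class and method names). It also shows the other common source of question marks: encoding into a charset that lacks the character. Assumes Java 7 or later for StandardCharsets.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class WrongCharsetDemo {

    // Encodes text with one charset and decodes the resulting bytes with another.
    static String encodeThenDecode(String text, Charset encodeWith, Charset decodeWith) {
        byte[] bytes = text.getBytes(encodeWith);
        return new String(bytes, decodeWith);
    }

    public static void main(String[] args) {
        String original = "r\u00e9sum\u00e9"; // "resume" with both accents

        // Wrong charset on the way back in: each accented letter was two bytes in UTF-8,
        // and ISO-8859-1 turns every byte into its own character, so 6 chars become 8.
        String garbled = encodeThenDecode(original, StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1);
        System.out.println(garbled.length()); // prints 8

        // The remedy: decode with the charset the bytes were really encoded in.
        String fixed = encodeThenDecode(original, StandardCharsets.UTF_8, StandardCharsets.UTF_8);
        System.out.println(fixed.equals(original)); // prints true

        // Encoding into a charset that lacks the character loses it for good.
        String lossy = encodeThenDecode(original, StandardCharsets.US_ASCII, StandardCharsets.US_ASCII);
        System.out.println(lossy); // prints r?sum?
    }
}
```

The first kind of damage is reversible once you know the real charset; the last kind is not, because the original bytes were never written.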

Is that enough?
 