
8-bit US ASCII encoding???

 
Jimmy Ho
Ranch Hand
Posts: 61
I'm working on the SCJD assignment and it states that the data in my database file is "8 bit US ASCII". They also imply that DataInputStream is my class of choice for parsing the data.

Anyway, I receive the data via a DataInputStream as a byte array and transform it into a String. Can I just get away with:

String s = new String(myByteArray); ???

Alternatively, I can also use

String s = new String(myByteArray, [encoding]);

but the closest available encodings according to the 1.5 JavaDocs are US-ASCII (which is 7-bit ASCII) or UTF-8. Is UTF-8 the same as "8 bit US ASCII"? Or can I just say that 8-bit ASCII will translate to Unicode without serious issues and just use the first line above?

Am I being too fussy, or is this a legitimate issue?
 
Edwin Dalorzo
Ranch Hand
Posts: 961
You can read about Java's supported encodings here.

The following encodings are 8-bit encodings:

ISO-8859-1
ISO-8859-2
ISO-8859-4
ISO-8859-9
ISO-8859-13
ISO-8859-15
windows-1252

Any of those would be valid to use.

US-ASCII is not capable of representing everything that UTF-8 can represent. Also, UTF-8 could be misinterpreted: a character in UTF-8 may occupy more than one byte.
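
To see how the same bytes can be misinterpreted, here's a small self-contained sketch (byte values chosen just for illustration):

public class EncodingDemo {
    public static void main(String[] args) throws java.io.UnsupportedEncodingException {
        // 0xC3 0xA9 is the two-byte UTF-8 encoding of 'é'
        byte[] bytes = { (byte) 0xC3, (byte) 0xA9 };
        System.out.println(new String(bytes, "UTF-8"));      // one character: é
        System.out.println(new String(bytes, "ISO-8859-1")); // two characters: Ã©
    }
}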

I hope this helps!
 
Ulf Dittmer
Rancher
Posts: 42968
My interpretation would be that the characters are all encoded in US-ASCII (so you can expect the high-order bit of each byte to be 0), but that each character does indeed take up 8 bits (i.e., 1 byte).

Theoretically, with a 7-bit encoding, one could store 8 characters into 7 bytes to save space. Calling it "8-bit US-ASCII" indicates that this is not the case here. So one byte equals one character.
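
A quick way to convince yourself: decoding with US-ASCII turns any byte whose high-order bit is set into the replacement character, so clean data should come through unchanged. A minimal sketch (byte values chosen just for illustration):

public class AsciiDemo {
    public static void main(String[] args) throws java.io.UnsupportedEncodingException {
        byte[] clean = { 'O', 'K' };              // high-order bits are 0
        byte[] dirty = { 'O', 'K', (byte) 0xFF }; // last byte has its high-order bit set
        System.out.println(new String(clean, "US-ASCII")); // "OK"
        System.out.println(new String(dirty, "US-ASCII")); // "OK" plus the replacement character \uFFFD
    }
}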
 
Pawel Solarski
Greenhorn
Posts: 2
Hi,
I'm having the same problem with my URLyBird assignment, where 8-bit US-ASCII is to be used. I think the best approach is to assume that only 7-bit US-ASCII characters can be stored in the database file, so the "8 bit" requirement is confusing. I think it is better NOT to use any of the ISO-8859-x characters at all, even though those charsets are supersets of US-ASCII. Why?

Because if I choose to use, say, the ISO-8859-2 charset, then I can type my locale-specific strings like "żółź" and it will work for me, but on another machine with a different locale, reading a record with that saved string may lead to strange behavior, resulting in "???" strings.
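
A simple guard along those lines, rejecting anything outside 7-bit US-ASCII before it gets written (method name and message are just a sketch):

public class AsciiGuard {
    // Rejects any byte with the high-order bit set, i.e. outside 7-bit US-ASCII.
    static void checkUsAscii(byte[] data) {
        for (int i = 0; i < data.length; i++) {
            if ((data[i] & 0x80) != 0) {
                throw new IllegalArgumentException("non-US-ASCII byte at position " + i);
            }
        }
    }

    public static void main(String[] args) throws java.io.UnsupportedEncodingException {
        checkUsAscii("plain text".getBytes("US-ASCII")); // passes
        checkUsAscii("żółź".getBytes("ISO-8859-2"));     // throws: these bytes have the high bit set
    }
}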

I hope the ranch sheriffs agree with me.
 
Roel De Nijs
Sheriff
Posts: 10603
Hi Pawel,

Welcome to the JavaRanch!

Not (yet) a ranch sheriff, but I used the ISO-8859-1 charset. I don't think it matters which one you use, as long as you document your decision in choices.txt.
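
One nice property of that choice: ISO-8859-1 maps every byte value 0x00-0xFF to a character, so decoding can never fail or lose data. A tiny illustration:

public class Latin1Demo {
    public static void main(String[] args) throws java.io.UnsupportedEncodingException {
        byte[] bytes = { (byte) 0x41, (byte) 0xFF }; // 'A' and an arbitrary high byte
        String s = new String(bytes, "ISO-8859-1");
        System.out.println(s.length()); // 2 - every byte maps to exactly one character
    }
}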

Kind regards,
Roel
 
Roberto Perillo
Bartender
Posts: 2271
Roel De Nijs wrote:Not (yet) a ranch sheriff...


Ah... that's my good buddy Roel, the pride of Belgium!
 
Roberto Perillo
Bartender
Posts: 2271
Howdy, Pawel. Welcome to JavaRanch!

Yeah, they mention the US-ASCII charset as being 8-bit, but it is in fact 7-bit. In my choices.txt file, I said that I used it, even though the instructions refer to it as being 8-bit. In the small piece of code I used to read the database, ENCODING is a String constant = "US-ASCII".
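
A minimal sketch of that kind of read loop (record length and file name are hypothetical, not from the actual assignment):

import java.io.BufferedInputStream;
import java.io.DataInputStream;
import java.io.FileInputStream;
import java.io.IOException;

public class DataFileReader {
    private static final String ENCODING = "US-ASCII";
    private static final int RECORD_LENGTH = 64; // hypothetical - the real length comes from the data file header

    public static void main(String[] args) throws IOException {
        DataInputStream in = new DataInputStream(
                new BufferedInputStream(new FileInputStream("db-1x1.db"))); // file name hypothetical
        try {
            byte[] record = new byte[RECORD_LENGTH];
            in.readFully(record);                    // one byte per character
            String s = new String(record, ENCODING); // decode as US-ASCII
            System.out.println(s);
        } finally {
            in.close();
        }
    }
}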
 