
Why the Discrepancy for chars?

 
Kevin Simonson
Ranch Hand
Posts: 191
This has puzzled me for a long time. If I have a text file of any kind (the source code for a Java program, for example), use a {Scanner} object to read it into {String} objects in a Java program, and then write those {String} objects to a file using a {PrintWriter} object, each of the {char} components of the {String} objects takes up sixteen bits in the Java program, but only eight bits in the source file the {Scanner} reads from, and only eight bits in the destination file the {PrintWriter} writes to. Why store each {char} with sixteen bits of memory in the Java program, when there were only eight bits of memory where the {char} originated and there will be only eight bits where it will ultimately be stored?

Furthermore, in my own particular application I've discovered that if a {String} has individual characters that correspond to numbers higher than 127, and I do a {println()} on the {PrintWriter} object for that particular {String}, those characters get written as {63}s, and a bunch of information appears to be getting lost. Is there some other way to write {String}s like this to files so that information doesn't get lost?
 
Kevin Simonson
Ranch Hand
Posts: 191
Kevin Simonson wrote:This has puzzled me for a long time. If I have a text file of any kind (the source code for a Java program, for example), use a {Scanner} object to read it into {String} objects in a Java program, and then write those {String} objects to a file using a {PrintWriter} object, each of the {char} components of the {String} objects takes up sixteen bits in the Java program, but only eight bits in the source file the {Scanner} reads from, and only eight bits in the destination file the {PrintWriter} writes to. Why store each {char} with sixteen bits of memory in the Java program, when there were only eight bits of memory where the {char} originated and there will be only eight bits where it will ultimately be stored?

Furthermore, in my own particular application I've discovered that if a {String} has individual characters that correspond to numbers higher than 127, and I do a {println()} on the {PrintWriter} object for that particular {String}, those characters get written as {63}s, and a bunch of information appears to be getting lost. Is there some other way to write {String}s like this to files so that information doesn't get lost?

To illustrate this, try these two programs:

and

and run them with "java Write 120 135 Hmm.Txt" and "java Read 120 135 Hmm.Txt". You'll see that everything higher than 127 got written as a {63}.
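The two code listings didn't come through above; here is a hypothetical reconstruction consistent with the commands described, assuming {Write} prints the characters whose codes run from the first argument through the second into the named file, and {Read} dumps the numeric code of each character it finds there. (The class and variable names are made up; on a JVM whose default charset is 8-bit, codes above 127 can come back as 63.)

```java
import java.io.FileNotFoundException;
import java.io.FileReader;
import java.io.IOException;
import java.io.PrintWriter;

// Hypothetical Write.java: prints chars lo..hi to the named file via a PrintWriter.
class Write {
    public static void main(String[] args) throws FileNotFoundException {
        int lo = Integer.parseInt(args[0]);
        int hi = Integer.parseInt(args[1]);
        try (PrintWriter pw = new PrintWriter(args[2])) {
            for (int c = lo; c <= hi; c++) {
                pw.println((char) c);  // with an 8-bit default charset, codes > 127 can degrade
            }
        }
    }
}

// Hypothetical Read.java: dumps the numeric code of each char read back from the file.
class Read {
    public static void main(String[] args) throws IOException {
        try (FileReader fr = new FileReader(args[2])) {
            int c;
            while ((c = fr.read()) != -1) {
                System.out.print(c + " ");
            }
            System.out.println();
        }
    }
}
```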
 
Rob Spoor
Sheriff
Posts: 20822
Because of your example, you assume that every character fits in one byte. That would leave only 256 characters, including control characters like \r, \n, and even \0. If you check the Unicode blocks you'll see that's not nearly enough. In fact, char gives you 65536 possible values, which only reach up to 0xFFFF; even char is actually not enough. That's why you should actually use code points instead of chars. Check out java.lang.Character for some methods that work with code points.
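A small sketch of that distinction (the musical G clef, U+1D11E, stands in for any character above 0xFFFF): one code point outside the BMP occupies two chars.

```java
public class CodePointDemo {
    public static void main(String[] args) {
        String clef = "\uD834\uDD1E";  // MUSICAL SYMBOL G CLEF, U+1D11E, as a surrogate pair
        System.out.println(clef.length());                         // 2 chars (UTF-16 code units)
        System.out.println(clef.codePointCount(0, clef.length())); // 1 code point
        System.out.println(Character.charCount(0x1D11E));          // 2: needs a surrogate pair
        System.out.println(Character.isBmpCodePoint(0x1D11E));     // false: beyond 0xFFFF
    }
}
```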
 
Tim Holloway
Bartender
Posts: 18418
It's part of the Java language spec. Character inside Java code means "Unicode character", and Java's char is 16 bits. So inside a Java app, all characters and Strings are stored as Unicode (UTF-16).

When you read or write ASCII, which is an 8-bit code (actually originally 7 bits!), the I/O service routines perform code page translations to convert to/from Unicode. The reason this isn't more obvious is that the first 256 characters (give or take) of Unicode have the same values (less leading zeroes) as ASCII. You'd notice the difference more if you were IBM mainframe-oriented (EBCDIC). For example, ASCII "0" is 0x30, Unicode is \u0030, but EBCDIC is 0xF0. Likewise, space is 0x20, \u0020, 0x40.

If you really want to treat 8-bit characters as 8 bits, then you can read them as Byte/byte instead of char, but then you lose out on the Java String and Character functions (or at least have to convert to/from Unicode!)
 
Rob Spoor
Sheriff
Posts: 20822
Well, String does have constructors that take a byte[], and can return a byte[] as well. However, you need to specify the encoding/charset to use (like StandardCharsets.US_ASCII).
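A quick sketch of that round trip (0xE9, which is 'é' in ISO-8859-1, is just an arbitrary example byte):

```java
import java.nio.charset.StandardCharsets;

public class RoundTrip {
    public static void main(String[] args) {
        byte[] latin1 = { (byte) 0xE9 };  // 'é' in ISO-8859-1: one byte on disk
        String s = new String(latin1, StandardCharsets.ISO_8859_1);  // decode bytes -> Unicode
        System.out.println(s);                                          // é (\u00E9)
        System.out.println(s.getBytes(StandardCharsets.UTF_8).length); // 2: same char, two bytes in UTF-8
    }
}
```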
 
Tim Holloway
Bartender
Posts: 18418
Rob Spoor wrote:Well, String does have constructors that take a byte[], and can return a byte[] as well. However, you need to specify the encoding/charset to use (like StandardCharsets.US_ASCII).


That's because the "constructor" is actually also a converter, and it requires information on how to convert the argument to Unicode. Although IIRC, there are still some deprecated methods that assume ASCII.

Incidentally, security-conscious code likes these constructors. A String is immutable, so if you create a password string, it remains floating in JVM object space until its final destruction. But an array of characters (or bytes) containing a password can have its individual elements wiped clean immediately after use!
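A minimal sketch of that wipe (it's also why javax.swing.JPasswordField.getPassword() returns a char[] rather than a String):

```java
import java.util.Arrays;

public class WipeDemo {
    public static void main(String[] args) {
        char[] password = { 's', 'e', 'c', 'r', 'e', 't' };
        // ... authenticate with the password here ...
        Arrays.fill(password, '\0');  // overwrite every element; a String offers no such option
        System.out.println(Arrays.toString(password));  // all NULs: the secret is gone from this array
    }
}
```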

 
Kevin Simonson
Ranch Hand
Posts: 191
Rob Spoor and Tim Holloway have provided a lot of interesting information, and they've detailed to some extent the history of {String} objects and how each of their {char}s got to be sixteen bits long in memory, but it looks to me like they haven't answered my main question: why does each {char} in a {String} object take up sixteen bits in memory when, stored on disk, each takes up only eight bits? What good does it do to store a very diverse set of characters in a Java program, when they can't be stored anywhere on disk with their diversity intact?
 
Paul Clapham
Sheriff
Posts: 21892
Sure they can. You just haven't heard about encodings, or charsets, yet. Generally (but not always), system designers don't need to deal with everything in Unicode, but only a subset. So they choose an encoding which maps that subset of Unicode characters into bytes, to save space in the output. Here's the relevant tutorial from Oracle's series of Unicode tutorials: Character and Byte Streams; you should probably read some of the other tutorial pages related to that one.
 
Tim Holloway
Bartender
Posts: 18418
Kevin Simonson wrote:Rob Spoor and Tim Holloway have provided a lot of interesting information, and they've detailed to some extent the history of {String} objects and how each of their {char}s got to be sixteen bits long in memory, but it looks to me like they haven't answered my main question: why does each {char} in a {String} object take up sixteen bits in memory when, stored on disk, each takes up only eight bits? What good does it do to store a very diverse set of characters in a Java program, when they can't be stored anywhere on disk with their diversity intact?


Who said that the text file had to be composed of 8-bit characters? Sure, it's what you're used to, but there were lots of platforms where characters came in other sizes. Which, incidentally, is why I get very pedantic about the difference between characters and bytes. A byte is often the same size as a character, but not always, which is why network standards and such use the more precise word "octet" (8 bits) in their definitions. The classic definition of a machine's "byte" size was supposed to be the smallest amount of memory that could be directly physically accessed, although people do tend to think of it as a character size these days, since most modern machines have 8-bit granularity.

Characters over the years have come in many different sizes and configurations. On really early computers, the "character code" was often the raw data code for the associated I/O device. For example, the circa-1960 IBM mainframe terminals used a 6-bit code, which is the origin of the "holes" in the later 8-bit EBCDIC character code set. "ASCIIZ" for IBM PCs was an 8-bit extension of the original 7-bit ASCII standard (the 8th bit was reserved as a parity bit). The old Teletype machines used a 5-bit Baudot code (no lower case, though). At the opposite extreme, I understand that there was a popular supercomputer line whose "characters" were technically about 64 bits long. Punch-card devices used a 12-bit raw code; the IBM 1130 mini-computer contained a library for converting it to EBCDIC. And the IBM 1620 didn't hold with all this binary nonsense at all! It offered memory in one of two sizes: 5,000 or 10,000 words of decimal numbers. Character codes for that machine required pairs of digits, I believe (someone pulled a manual out of a recycling stack, which is how I learned about that machine). The Prime Series minicomputers had a 16-bit word and fixed disk sectors that were also some even multiple of 16 bits. It was an ASCII-based machine, and one with relatively tight resources, so I worked with characters both packed two to a 16-bit word (effectively, that machine's "byte" size) and one per word, depending on whether capacity or access speed was more important to me.

So the idea that text files on disk are 8 bits but Java text is 16 bits isn't really accurate. There are products that use Unicode in disk files (I think maybe even Windows WordPad!). Some DBMSs have options for that in their table setup. And in addition to full-on 8/16-bit systems, a lot of software intended for I18N use supports multi-byte and variable-width encodings; UTF-8, for example.
 
Kevin Simonson
Ranch Hand
Posts: 191
Paul Clapham wrote:Sure they can. You just haven't heard about encodings, or charsets, yet. Generally (but not always), system designers don't need to deal with everything in Unicode, but only a subset. So they choose an encoding which maps that subset of Unicode characters into bytes, to save space in the output. Here's the relevant tutorial from Oracle's series of Unicode tutorials: Character and Byte Streams; you should probably read some of the other tutorial pages related to that one.

The bottom line for me is that I have been thinking about writing a Java program that implements an editor (possibly based on Emacs). Since it works with {String} objects, the possibility exists that the document it stores may have characters that won't get preserved, should I just use a {PrintWriter} object and its {println()} method to write those characters to disk. So should my editor use a {PrintWriter} object to write to disk in most cases, but perhaps accept a flag telling it to use some other object to write to disk and read from disk? And if so, what object should I use? People have been expending a lot of energy telling me there are other ways to do disk I/O that store sixteen bits to disk for each {char} in a {String}, but so far nobody's told me what one of those ways is.
 
Tim Holloway
Bartender
Posts: 18418
Depending on how you design a text editor, it may use 2 (or 3!) different file formats: input text file, working data file (optional) and output text file. If you use a work file, it can have anything - text or otherwise - in it that you find useful to store. Some editors make the work and output file formats be the same so that by creative renaming, you're ensured against data loss in case the system crashes in the middle of a save operation (something that used to be a lot more common).

A PrintWriter writes characters, Strings, and the results of the String.valueOf() method (which returns a String) applied to objects. Whether the PrintWriter writes 8-bit text or 16-bit text (or for that matter ASCII, Unicode, or EBCDIC) is determined by the encoding you indicated for the output File, output Stream, or Writer (depending on which constructor you used). If you did not explicitly select a character encoding, the one defined as the default for your JVM's locale will be used. Don't assume, therefore, that the same code will always output 8-bit ASCII unless you specifically requested an 8-bit ASCII encoding.
My desktop machine's default character set is UTF-8, so unless otherwise directed, it will expect text input readers to read UTF-8 and print writers to write UTF-8. Which corresponds (more or less) to the standard text file encodings for my particular locale, OS, and hardware.

As the reader reads UTF-8, it converts to the internal character/string encoding (Unicode). When Strings and characters are written, they're converted from Unicode back to UTF-8. And, incidentally, I'm pretty sure that in cases where there's no mapped conversion for a character, you'll get an exception from the character set converter. I'm also fairly sure that that's where I ran into issues once upon a time. Since "ñ" isn't a strict ASCII character, I think we got an "n~" decoding out of it, which really upset the recipient, since it was expecting a fixed-length character field and the two-character conversion gave it indigestion.
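Whether you get an exception or a substitute actually depends on the API: String.getBytes() silently substitutes '?' (decimal 63, which matches the 63s reported at the top of this thread), while a CharsetEncoder reports the problem by default. A small sketch:

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class Unmappable {
    public static void main(String[] args) {
        // String.getBytes() replaces unmappable characters with '?' (code 63)...
        byte[] ascii = "\u00F1".getBytes(StandardCharsets.US_ASCII);  // ñ, U+00F1
        System.out.println(Arrays.toString(ascii));                   // [63]
        // ...whereas a CharsetEncoder can tell you up front that the char won't fit.
        System.out.println(StandardCharsets.US_ASCII.newEncoder().canEncode('\u00F1')); // false
    }
}
```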
 
Paul Clapham
Sheriff
Posts: 21892
Kevin Simonson wrote:I have been thinking about writing a Java program that implements an editor (possibly based on Emacs), and since it works with {String} objects the possibility exists that the document it's going to store may have characters in it that won't get preserved, should I just use a {PrintWriter} object and its {println()} method to write those characters to disk.


No, you shouldn't use a PrintWriter. Its documentation explains that its purpose is "Prints formatted representations of objects to a text-output stream". So you'd use that if you wanted to format objects automatically while writing them -- but that isn't what you want to do. You don't want to print anything, you just want to write text to a file. Which means you should use a FileWriter.

People have been expending a lot of energy telling me there are other ways to do disk I/O that stores sixteen bits to disk for each {char} in a {String}, but so far nobody's told me what one of those ways is.


First of all I'd recommend you use UTF-8 as your encoding, so you don't have to concern yourself with which characters you want/need to support. It supports the entire Unicode repertoire. As for what one of those ways is... I did post a link to a tutorial with just such an example, but I guess you weren't primed to take in that explanation yet. Here's that link again: Character and Byte Streams.

Note that the tutorial also explains the other question, namely how to read characters in from a file. You'll notice there's no PrintReader class...
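A sketch of writing and re-reading text losslessly with an explicit UTF-8 charset (the file name is made up); note that a plain FileWriter constructed without a charset argument uses the platform default instead:

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class SaveUtf8 {
    public static void main(String[] args) throws IOException {
        Path file = Paths.get("document.txt");  // hypothetical file name
        try (BufferedWriter w = Files.newBufferedWriter(file, StandardCharsets.UTF_8)) {
            w.write("ni\u00F1o");  // "niño": the ñ survives because UTF-8 can encode it
            w.newLine();
        }
        // Reading back with the same charset recovers the text exactly.
        System.out.println(Files.readAllLines(file, StandardCharsets.UTF_8).get(0));
    }
}
```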
 
Tim Holloway
Bartender
Posts: 18418
PrintWriter has the significant advantage of its println() methods, which automatically write out text lines terminated with whatever newline character(s) the current OS demands. As of Java 5 it also has printf() methods, but those are probably not essential for the task at hand.

If, on the other hand, your editor wants the option to write in a variety of OS text file formats where line terminators vary (e.g., Linux/Unix '\n', DOS '\r\n'), then the better option is to use an OutputStreamWriter and handle line termination manually. Or you could subclass one of these classes and implement/override your own println() methods. FileWriter is a convenience subclass of OutputStreamWriter in case you are only interested in reading/writing Files.
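A sketch of that manual-termination approach, writing DOS-style line endings regardless of the host OS (the file name is made up):

```java
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

public class DosLineWriter {
    public static void main(String[] args) throws IOException {
        String eol = "\r\n";  // chosen explicitly, instead of System.lineSeparator()
        try (Writer w = new OutputStreamWriter(
                new FileOutputStream("dos.txt"), StandardCharsets.UTF_8)) {
            for (String line : new String[] { "first line", "second line" }) {
                w.write(line);
                w.write(eol);  // manual line termination, identical on every platform
            }
        }
    }
}
```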

 