NX: (Contractors) Character encoding

 
Jeff Wisard
Ranch Hand
Posts: 89
Hello everyone,
In the contractors assignment, I return the trimmed version of the fields of my records from the readRecord method. That is, any trailing whitespace on each field is removed.
This means, of course, that when I create a new record or update a record, I have to put that whitespace back in the field before writing it. That is, I have to pad each field with spaces so that it is the correct length.
A problem I encountered is that the data file is encoded in 8-bit US-ASCII. When I write a space character to a ByteBuffer (I am using NIO classes), that character is two bytes long. This caused a few problems, including buffer overflow exceptions.
So, I simply cast each space character to a single byte before writing it. This works fine.
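For illustration, here is a minimal sketch of the pad-and-cast approach described above. The class name FieldPadding and the helpers pad and writeField are hypothetical, not part of the assignment, and the cast is only safe as long as every character is 7-bit ASCII.

```java
import java.nio.ByteBuffer;

class FieldPadding {
    /** Pads value with trailing spaces to exactly length chars (hypothetical helper). */
    static String pad(String value, int length) {
        if (value.length() > length) {
            throw new IllegalArgumentException("value longer than field: " + value);
        }
        StringBuilder sb = new StringBuilder(value);
        while (sb.length() < length) {
            sb.append(' ');
        }
        return sb.toString();
    }

    /** Writes the padded field one byte per char; assumes 7-bit ASCII content. */
    static void writeField(ByteBuffer buffer, String value, int length) {
        String padded = pad(value, length);
        for (int i = 0; i < padded.length(); i++) {
            buffer.put((byte) padded.charAt(i)); // cast drops the high byte
        }
    }

    public static void main(String[] args) {
        ByteBuffer asChars = ByteBuffer.allocate(16);
        asChars.putChar(' ');                    // putChar writes a 2-byte char
        System.out.println(asChars.position());  // prints 2

        ByteBuffer asBytes = ByteBuffer.allocate(16);
        writeField(asBytes, "foo", 8);           // 3 chars + 5 spaces of padding
        System.out.println(asBytes.position());  // prints 8
    }
}
```

The first buffer shows where the overflow comes from: each putChar call consumes two bytes, so a field written that way needs twice the space the file format allows.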
However, I just came across the Charset, CharsetEncoder, and CharsetDecoder classes that are new with NIO and JDK 1.4.x. Has anyone used these classes for converting data between Unicode and 8-bit US-ASCII for data files? I am thinking that I should use these classes instead of casting to a byte...but I'm not sure to what extent I need to use them. Should I run all the data that I read and write through the encoder or decoder classes first?
How important is this issue?
Thanks!
 
Jim Yingst
Wanderer
Sheriff
Posts: 18671
Jeff Wisard wrote: "How important is this issue?"
Well, it's probably not that important - the way the assignment is set up to use US-ASCII, you can get away with converting chars to bytes simply by casting. That may well be fine as far as the exam's graders are concerned; dunno. However it's something that sets off alarm bells for me from personal experience; I find when people say their data is "ASCII" it means it might be Cp-1252, or ISO-8859-1, or UTF-8, or something else (possibly all of these, if they're collecting data from multiple clueless sources) and at some point in the future someone's going to wonder why some of the characters come out funny-looking. At which point I get to discuss with the client what encoding is really being used in the data. :roll: So I tend to assume that it's important to minimize assumptions about character encoding, and to make it as easy as possible to switch encodings later by changing just one line of code. So yes, I'm using the NIO charset stuff. I just save a private Charset instance, and whenever I need to convert a String to bytes or vice versa, I use the encode() or decode() method. Seems to work just fine. Note that the constant-length format of the data means that it would be a bit more difficult to change the encoding to something like UTF-8 - but any single-byte encoding would be easy to substitute in place of US-ASCII.
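A minimal sketch of what that might look like, assuming a single private Charset constant. The class name FieldCodec and the helpers toBytes and fromBytes are hypothetical names chosen for illustration, not code from this thread.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;

class FieldCodec {
    // The one line to change when switching to another single-byte encoding.
    private static final Charset CHARSET = Charset.forName("US-ASCII");

    /** Encodes a String into bytes using the configured charset. */
    static byte[] toBytes(String s) {
        ByteBuffer encoded = CHARSET.encode(s);
        byte[] bytes = new byte[encoded.remaining()];
        encoded.get(bytes);
        return bytes;
    }

    /** Decodes bytes back into a String using the configured charset. */
    static String fromBytes(byte[] bytes) {
        CharBuffer decoded = CHARSET.decode(ByteBuffer.wrap(bytes));
        return decoded.toString();
    }

    public static void main(String[] args) {
        byte[] bytes = toBytes("foo ");
        System.out.println(bytes.length);                   // prints 4
        System.out.println(fromBytes(bytes).equals("foo ")); // prints true
    }
}
```

One thing to keep in mind with this sketch: Charset.encode() silently substitutes a replacement byte for characters the charset cannot represent, so a stricter version would use a CharsetEncoder configured to report such input instead.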
Jeff Wisard wrote: "In the contractors assignment, I return the trimmed version of the fields of my records from the readRecord method. That is, any trailing whitespace on each field is removed."
Yeah, I'm taking a slightly different approach on this issue. The file format specs say fields are null terminated if shorter than the maximum length - even though the data file always terminates with spaces rather than nulls. And I see no mention of trimming fields anywhere. So I'm writing my Data class to follow the API exactly, which (to me) means there's a difference between updating a field with value "foo" and updating it with "foo ". I just use a simple for loop to find the first null, as trim() would trim other whitespace as well.
Now common sense tells me that there's a good chance this isn't really what real-world users would want - but it is what they explicitly asked for. So I'm (probably) going to offer a few configuration options, allowing the user to choose between "save fields exactly", "trim fields on save", "append spaces on save". The first is what they get by default; the last is what they probably really want. Yeah, this is probably more effort than is necessary - but when I see Sun's absolutist policy on things we "must" do, I respond by taking their instructions very literally. In the real world I'd just call up the client and say hey, you probably really wanted those fields to be padded with spaces, not nulls, right? But lacking that option here, I refuse to get burned by crappy customer specs.
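One way the three options could be modelled is sketched below. The enum SavePolicy, its constant names, and the prepare method are all hypothetical; the sketch only shows how the string would be adjusted before encoding, and leaves out how any remaining bytes of the field get filled.

```java
class FieldPolicyDemo {
    /** Hypothetical names for the three save options. */
    enum SavePolicy { EXACT, TRIM, PAD_WITH_SPACES }

    /** Returns the string that would be stored for a field of the given length. */
    static String prepare(String value, int length, SavePolicy policy) {
        switch (policy) {
            case TRIM: {
                return value.trim();
            }
            case PAD_WITH_SPACES: {
                StringBuilder sb = new StringBuilder(value);
                while (sb.length() < length) {
                    sb.append(' ');
                }
                return sb.toString();
            }
            default: {
                return value; // EXACT: store what the caller passed in
            }
        }
    }

    public static void main(String[] args) {
        // Brackets make the trailing whitespace visible in the output.
        for (SavePolicy policy : SavePolicy.values()) {
            System.out.println(policy + " -> [" + prepare("foo ", 6, policy) + "]");
        }
    }
}
```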
[ May 22, 2003: Message edited by: Jim Yingst ]
 
S. Ganapathy
Ranch Hand
Posts: 194
Hi Jim Yingst,
What do you mean by a null-terminated string in the data representation (the db.db file)?
How are you reading null strings from db.db? I found only blank spaces.
Thanks,
Ganapathy.
 
Jim Yingst
Wanderer
Sheriff
Posts: 18671
Hi, Ganapathy. True, the current file has only spaces, not nulls. However the instructions indicate that each string field is null-terminated if it's shorter than the maximum length. Null-terminated means terminated with one or more NULL characters (Unicode value 0). As I interpret it, none of the fields in the existing file are shorter than max length, because they're padded with spaces. But it's entirely possible that shorter field values might be inserted in the future - either from our application (which we may write to avoid this possibility if we desire) or by other applications which access the db file (which we have no control over).
The NULL char was used all the time in C programs - it's how the end of a string is indicated. It's pretty rare to intentionally make use of them in Java - but they're part of the Unicode standard, and considered "characters" just like other chars. Check this out:
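A small sketch, written for illustration, of the point that NULL counts as an ordinary char in a Java String. The class name NullCharDemo and the helper describe are hypothetical.

```java
class NullCharDemo {
    /** Returns the numeric value of each char in s, separated by spaces. */
    static String describe(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            if (i > 0) {
                sb.append(' ');
            }
            sb.append((int) s.charAt(i));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String withNull = "foo\0";                   // '\0' is the NULL char, value 0
        System.out.println(withNull.length());       // prints 4: the NULL is counted
        System.out.println(describe(withNull));      // prints 102 111 111 0
        System.out.println(withNull.equals("foo"));  // prints false
    }
}
```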

The more I think about it though, the more I think maybe I should always trim() strings when they're read and pad with spaces when storing, just as Jeff plans to do. I still think that's a violation of the API, but it's a case where the API is ill-conceived, and enforcing some consistency on how strings are stored is a good thing. Our app should definitely be able to deal with NULL chars if they do appear (e.g. if inserted by another program) - but it's evident from the file provided that the de facto standard way to store strings is by padding with spaces - so it's probably best to continue this policy. Depending on how ornery I'm feeling about poorly-conceived requirements. Either way, I'll document my reasoning of course.
[ May 22, 2003: Message edited by: Jim Yingst ]
 
S. Ganapathy
Ranch Hand
Posts: 194
Thank you very much, Jim Yingst.
 
Vitaly Zhuravlyov
Greenhorn
Posts: 16
Hi Jim,

Jim Yingst wrote: "The more I think about it though, the more I think maybe I should always trim() strings when they're read and pad with spaces when storing, just as Jeff plans to do."

Consider trimming the following

Do you also think that if we encounter a null-terminated string, possibly inserted by some other program, we must preserve its format?
Thanks
Vitaly
[ May 23, 2003: Message edited by: Vitaly Zhuravlyov ]
 
Jim Yingst
Wanderer
Sheriff
Posts: 18671
Oooo, good question, thanks. I'd say that if we see something like that, we should assume that everything after the first 0 is invalid, perhaps left over from a previous record that was overwritten. Unfortunately trim() does not have the desired effect in this case - it trims only the 1 final char rather than 3 as we require. So it seems necessary to use a for loop or indexOf() to find the first occurrence of 0, and use only the subsequence/substring prior to that point. I will probably do a trim() in addition to this, for the reasons given in my last post - but thanks for pointing out a case where the trim() by itself is insufficient.
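A minimal sketch of the indexOf() plus trim() combination, using a made-up field value "foo\0b " (a null followed by a leftover char and a space) rather than the example under discussion. The class name FieldCleaner and the method clean are hypothetical.

```java
class FieldCleaner {
    /** Keeps only the text before the first NULL char, then trims whitespace. */
    static String clean(String raw) {
        int firstNull = raw.indexOf('\0');
        String valid = (firstNull >= 0) ? raw.substring(0, firstNull) : raw;
        return valid.trim();
    }

    public static void main(String[] args) {
        String raw = "foo\0b ";                   // hypothetical field contents, 6 chars
        System.out.println(raw.trim().length());  // prints 5: only the final space is gone
        System.out.println(clean(raw));           // prints foo
    }
}
```

Note that trim() also strips leading whitespace and control characters, so a version that must preserve leading spaces would need its own loop over the trailing end only.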
[ May 23, 2003: Message edited by: Jim Yingst ]
 