I've read most of the posts here relating to reading and writing bytes to/from the data file. This is what I've come up with and I want to make sure that I'm not doing anything blatantly idiotic. First, I'll post the data file format and then my assumptions.
**** Data File Format Start ****
Start of file
4 byte numeric, magic cookie value identifies this as a data file
4 byte numeric, offset to start of record zero
2 byte numeric, number of fields in each record
Schema description section.
Repeated for each field in a record:
2 byte numeric, length in bytes of field name
n bytes (defined by previous entry), field name
2 byte numeric, field length in bytes
end of repeating block
Data section. (offset into file equal to "offset to start of record zero" value)
Repeat to end of file:
2 byte flag. 00 implies valid record, 0x8000 implies deleted record
Record containing fields in order specified in schema section, no separators between fields, each field fixed length at maximum specified in schema information
End of file
All numeric values stored in the header information use the formats of the DataInputStream and DataOutputStream classes. All text values, and all fields (which are text only), contain only 8 bit characters, null terminated if less than the maximum length for the field. The character encoding is 8 bit US ASCII.
**** Data File Format End ****
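For reference, the header and schema layout described above could be read roughly as follows. This is only a sketch under assumptions: the class names FieldInfo, FileHeader and HeaderReader are hypothetical, the sample values (magic cookie, field names, lengths) are made up, and StandardCharsets assumes Java 7 or later. It takes a DataInput, which both RandomAccessFile and DataInputStream implement, and uses readUnsignedShort for the 2 byte lengths since they should never be negative.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical holder for one schema entry. */
final class FieldInfo {
    final String name;
    final int length; // fixed field length in bytes

    FieldInfo(String name, int length) {
        this.name = name;
        this.length = length;
    }
}

/** Hypothetical holder for the parsed header and schema. */
final class FileHeader {
    final int magicCookie;
    final int recordZeroOffset;
    final List<FieldInfo> fields = new ArrayList<>();

    FileHeader(int magicCookie, int recordZeroOffset) {
        this.magicCookie = magicCookie;
        this.recordZeroOffset = recordZeroOffset;
    }
}

final class HeaderReader {
    /** Reads the header fields in the order the format lists them. */
    static FileHeader read(DataInput in) throws IOException {
        // Arguments are evaluated left to right: magic cookie, then offset.
        FileHeader header = new FileHeader(in.readInt(), in.readInt());
        int fieldCount = in.readUnsignedShort();
        for (int i = 0; i < fieldCount; i++) {
            byte[] nameBytes = new byte[in.readUnsignedShort()];
            in.readFully(nameBytes); // fills the whole array or throws EOFException
            int fieldLength = in.readUnsignedShort();
            String name = new String(nameBytes, StandardCharsets.US_ASCII);
            header.fields.add(new FieldInfo(name, fieldLength));
        }
        return header;
    }

    /** Builds a made-up two-field header in memory and parses it back. */
    static FileHeader readSample() {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bytes);
            out.writeInt(0x1234); // made-up magic cookie
            out.writeInt(30);     // 10 fixed header bytes + 8 + 12 schema bytes
            out.writeShort(2);    // two fields per record
            writeField(out, "name", 32);
            writeField(out, "location", 64);
            return read(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    private static void writeField(DataOutputStream out, String name, int length)
            throws IOException {
        byte[] nameBytes = name.getBytes(StandardCharsets.US_ASCII);
        out.writeShort(nameBytes.length);
        out.write(nameBytes);
        out.writeShort(length);
    }
}
```

With a real data file, the same read method could be handed a RandomAccessFile positioned at offset zero.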
- for the numeric values, I should be using RandomAccessFile#readInt and #readShort
- the valid record flag should equal a string of "\u0000\u0000" and the delete field flag should equal a string of "\u8000"
- I should be using RandomAccessFile#readFully instead of #read when loading my byte objects
- When I convert the bytes I read into a String, I should do a new String(bytes,"US-ASCII") and a strObj.getBytes("US-ASCII") on writes
- "US-ASCII" is really 7 bit and I need 8 bit. Am I missing something here or do I need another encoding?
- I'm not sure of the best way to handle my delete flag writes, RandomAccessFile#writeChars("\u8000")???
- Even though it's been highly debated, I think I'll refrain from trimming the trailing spaces that follow many of the values in the data file when I read them into memory.
- When reading in the field values, I'll have to loop through the chars and find the first null, everything before that will be my field value.
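The readFully assumption in the list above can be illustrated with a small sketch of reading one record. The class name RecordReader and the choice to return null for a deleted record are assumptions made for illustration, not something the format requires. The field lengths would come from the schema section.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Arrays;

final class RecordReader {
    static final int DELETED = 0x8000;

    /**
     * Reads the 2 byte flag and the fixed-length record that follows it.
     * Returns the raw bytes of each field, or null for a deleted record
     * (returning null is a hypothetical convention for this sketch).
     */
    static byte[][] readRecord(DataInput in, int[] fieldLengths) throws IOException {
        int flag = in.readUnsignedShort();
        int recordLength = 0;
        for (int len : fieldLengths) {
            recordLength += len;
        }
        byte[] record = new byte[recordLength];
        // Unlike read(), readFully() never returns a partially filled array:
        // it either fills it completely or throws EOFException.
        in.readFully(record);
        if (flag == DELETED) {
            return null;
        }
        byte[][] fields = new byte[fieldLengths.length][];
        int pos = 0;
        for (int i = 0; i < fieldLengths.length; i++) {
            fields[i] = Arrays.copyOfRange(record, pos, pos + fieldLengths[i]);
            pos += fieldLengths[i];
        }
        return fields;
    }

    /** Convenience wrapper for trying the method on an in-memory byte array. */
    static byte[][] readFromBytes(byte[] data, int[] fieldLengths) {
        try {
            return readRecord(new DataInputStream(new ByteArrayInputStream(data)), fieldLengths);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Even for a deleted record the record bytes are consumed, so the stream or file pointer ends up at the start of the next record either way.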
Thanks a lot gang.
Nice summation of so many discussions.
"US-ASCII" is really 7 bit and I need 8 bit. Am I missing something here or do I need another encoding?
Welcome to the wonderful world of clueless user specifications. :roll:
You have to make a design decision. Is it likely the user wants US-ASCII or an 8-bit format?
By the way: UTF-8 "uses all bits of an octet, but has the quality of preserving the full US-ASCII range: US-ASCII characters are encoded in one octet having the normal US-ASCII value, and any octet with such a value can only stand for an US-ASCII character, and nothing else." (from the UTF-8 RFC).
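To make the trade-off concrete, here is a small sketch comparing how three charsets treat a byte above 0x7F. ISO-8859-1 is brought in here purely as an example of a single-byte 8 bit encoding; nothing in the format names it, so treat it as an assumption. EncodingDemo is a hypothetical class name, and StandardCharsets assumes Java 7 or later.

```java
import java.nio.charset.StandardCharsets;

final class EncodingDemo {
    /** Decodes as 7 bit US-ASCII; bytes above 0x7F become U+FFFD. */
    static String decodeAscii(byte[] raw) {
        return new String(raw, StandardCharsets.US_ASCII);
    }

    /** Decodes as ISO-8859-1; every byte maps to exactly one char. */
    static String decodeLatin1(byte[] raw) {
        return new String(raw, StandardCharsets.ISO_8859_1);
    }

    /** Number of bytes the text needs when written as UTF-8. */
    static int utf8Length(String text) {
        return text.getBytes(StandardCharsets.UTF_8).length;
    }

    /** Number of bytes the text needs when written as ISO-8859-1. */
    static int latin1Length(String text) {
        return text.getBytes(StandardCharsets.ISO_8859_1).length;
    }

    public static void main(String[] args) {
        byte[] raw = { 'c', 'a', 'f', (byte) 0xE9 }; // last byte is outside 7 bit ASCII
        System.out.println(decodeAscii(raw));
        System.out.println(decodeLatin1(raw));
        System.out.println(utf8Length("caf\u00E9") + " bytes as UTF-8");
        System.out.println(latin1Length("caf\u00E9") + " bytes as ISO-8859-1");
    }
}
```

For data that really is pure 7 bit ASCII all three give the same bytes. The difference only shows up for values above 0x7F, where UTF-8 needs more than one byte per character and so could overflow a fixed-length field.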
I'm not sure of the best way to handle my delete flag writes
Have you considered converting 0x8000 into the equivalent short value, and reading and writing it that way?
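A minimal sketch of treating the flag as a numeric value, assuming hypothetical names (DeleteFlag, VALID, DELETED). One thing worth watching: readShort returns a signed short, so 0x8000 comes back as -32768 and a comparison against the int literal 0x8000 would fail. Either compare against (short) 0x8000 or use readUnsignedShort.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

final class DeleteFlag {
    static final int VALID = 0x0000;
    static final int DELETED = 0x8000;

    /** Writes the flag as two bytes, high byte first. */
    static byte[] encode(int flag) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            new DataOutputStream(bytes).writeShort(flag); // only the low 16 bits are written
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Reads the flag as an int in the range 0..65535. */
    static int readUnsigned(byte[] twoBytes) {
        try {
            return new DataInputStream(new ByteArrayInputStream(twoBytes)).readUnsignedShort();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Reads the flag as a signed short, as readShort does. */
    static short readSigned(byte[] twoBytes) {
        try {
            return new DataInputStream(new ByteArrayInputStream(twoBytes)).readShort();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

RandomAccessFile offers the same writeShort, readShort and readUnsignedShort methods, so the comparison logic carries over unchanged.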
When reading in the field values, I'll have to loop through the chars and find the first null, everything before that will be my field value.
Presumably stopping if you reach the end of a field length without finding a null.
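The bounded search for the terminator could look something like this. It is a sketch only: FieldDecoder is a hypothetical name, US-ASCII is used because that is what the format names, and padding short values with nulls on write is an assumption (trailing spaces are left untouched on read).

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class FieldDecoder {
    /**
     * Returns the text before the first null byte, or the whole field
     * if no null is found within the field length.
     */
    static String decode(byte[] field) {
        int end = 0;
        while (end < field.length && field[end] != 0) {
            end++;
        }
        return new String(field, 0, end, StandardCharsets.US_ASCII);
    }

    /**
     * Encodes a value into a fixed-length field, padding with null bytes.
     * Null padding is an assumption made for this sketch.
     */
    static byte[] encode(String value, int fieldLength) {
        byte[] raw = value.getBytes(StandardCharsets.US_ASCII);
        if (raw.length > fieldLength) {
            throw new IllegalArgumentException("value longer than field: " + value);
        }
        return Arrays.copyOf(raw, fieldLength); // copyOf pads with zero bytes
    }
}
```

A value that exactly fills the field has no terminator at all, which is why the loop has to check the length as well as the null.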
This all sounds pretty good. It sounds like you have made a few design decisions to get to what you have written. Have you documented them?
Thanks for your input. Yeah, my choices.txt is growing by leaps and bounds but I'm learning a lot in the process.
I'm still a little stumped on the encoding though... I saw one individual claim to have gotten a 91% using the default encoding, Philippe M. is a proponent of "US-ASCII", and Andrew seems to imply that "UTF-8" is the way to fly. Hmmmm....
As far as the delete flag is concerned... I could use RandomAccessFile.writeShort(Character.getNumericValue('\u8000')) but what's the real advantage over using RandomAccessFile.writeChars(new String("\u8000","UTF-8")) or even RandomAccessFile.writeChar(Character.getNumericValue('\u8000'))?
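For comparison, here is a sketch that captures the bytes produced by each write call. FlagWriteVariants is a hypothetical name and the lambdas assume Java 8 or later. Two caveats about the snippets in the question, as far as I can tell: String has no (String, String) constructor, so that form would not compile as written, and Character.getNumericValue is meant for digit-like characters, so it is not a way to get the code point 0x8000. A plain int literal or a cast such as (int) '\u8000' gives the value directly.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

final class FlagWriteVariants {
    /** Small callback type so each variant can share the capture code. */
    private interface FlagWriter {
        void write(DataOutputStream out) throws IOException;
    }

    /** Runs the writer against an in-memory stream and returns the bytes. */
    private static byte[] capture(FlagWriter writer) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            writer.write(new DataOutputStream(bytes));
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    static byte[] viaWriteShort() {
        return capture(out -> out.writeShort(0x8000));
    }

    static byte[] viaWriteChar() {
        return capture(out -> out.writeChar('\u8000'));
    }

    static byte[] viaWriteChars() {
        return capture(out -> out.writeChars("\u8000")); // two bytes per char
    }

    /** The code point of the char, obtained with a plain widening cast. */
    static int flagCharAsInt() {
        return (int) '\u8000';
    }
}
```

All three calls write the same two bytes, 0x80 followed by 0x00, so the choice is mostly about readability. Treating the flag as a number with writeShort arguably matches the "2 byte flag" wording of the format most directly.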