
NX: contractor - file encoding

 
Mike Southgate
Ranch Hand
Posts: 183
I haven't done anything to specify the file encoding on my project, I just used a RandomAccessFile and it worked fine. Should I be doing something and if so, what?
ms
 
Philippe Maquet
Bartender
Posts: 1872
Hi Mike,

I haven't done anything to specify the file encoding on my project, I just used a RandomAccessFile and it worked fine. Should I be doing something and if so, what?

I'll reply first with a few questions :
  • What do your instructions say about encoding ?
  • Which methods do you use to read/write records in your RAF ?


    I implemented support for multiple encodings, but since I use NIO, I got the benefit of the Charset and ByteBuffer classes to do it. The standard encoding in URLyBird 1.2.1 is "US-ASCII".
    I'll come back to this thread after your reply.
    Cheers,
    Phil.
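For readers who do use NIO, the Charset and ByteBuffer approach mentioned above might look roughly like this. It is only a minimal sketch, not Phil's actual code: the class name AsciiCodec and its two methods are made up for illustration.

```java
import java.nio.ByteBuffer;
import java.nio.charset.Charset;

// Hypothetical helper: the class and method names are illustrative only.
class AsciiCodec {
    // "US-ASCII" is one of the charsets every Java platform must support.
    static final Charset US_ASCII = Charset.forName("US-ASCII");

    // Decode the raw bytes of one field into a String.
    static String decode(byte[] fieldBytes) {
        return US_ASCII.decode(ByteBuffer.wrap(fieldBytes)).toString();
    }

    // Encode a String into bytes; characters outside US-ASCII become '?'.
    static byte[] encode(String value) {
        ByteBuffer buf = US_ASCII.encode(value);
        byte[] out = new byte[buf.remaining()];
        buf.get(out);
        return out;
    }

    public static void main(String[] args) {
        byte[] raw = encode("Palace");
        System.out.println(raw.length + " bytes -> " + decode(raw)); // prints "6 bytes -> Palace"
    }
}
```

Supporting another encoding would then only mean swapping the Charset instance.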
     
    Mike Southgate
    Ranch Hand
    Posts: 183
    "The character encoding is 8 bit US ASCII". I'm using RandomAccessFile.read(byte[] b) to read the data and writeBytes(String) to write.
    ms
     
    Philippe Maquet
    Bartender
    Posts: 1872
    Hi Mike,

    I'm using RandomAccessFile.read(byte[] b) to read the data and writeBytes(String) to write.

    After you have read your byte array, how do you convert it into a String ? If you use the special String constructor String(byte[] bytes, String charsetName), it's OK (the constructor String(byte[] bytes) uses the default platform's charset which may be wrong).
    Now for write, I would first use the String method byte[] getBytes(String charsetName) and then void write(byte[] b) in RandomAccessFile, to fix the encoding.
    I recall that it's not the solution I chose to implement, but if you don't use NIO, I think it's the easiest one.
    Cheers,
    Phil.
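The non-NIO approach described above can be sketched as follows. This is an illustration under stated assumptions, not code from anyone's assignment: the class name RafEncodingDemo, the helper names and the 6-byte field are invented, and the try-with-resources block needs Java 7 or later.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;

// Hypothetical example class: names and field size are invented for illustration.
class RafEncodingDemo {
    static final String CHARSET = "US-ASCII";

    // Encode explicitly, then write the raw bytes.
    static void writeText(RandomAccessFile raf, String text) throws IOException {
        raf.write(text.getBytes(CHARSET));
    }

    // Read the raw bytes, then decode them with the same explicit charset.
    static String readText(RandomAccessFile raf, int length) throws IOException {
        byte[] buf = new byte[length];
        raf.readFully(buf); // fills the whole buffer or throws EOFException
        return new String(buf, CHARSET);
    }

    public static void main(String[] args) throws IOException {
        File tmp = File.createTempFile("raf-demo", ".db");
        tmp.deleteOnExit();
        try (RandomAccessFile raf = new RandomAccessFile(tmp, "rw")) {
            writeText(raf, "Palace");
            raf.seek(0);
            System.out.println(readText(raf, 6)); // prints "Palace"
        }
    }
}
```

Because both directions name the charset, the result no longer depends on the platform default.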
     
    Mike Southgate
    Ranch Hand
    Posts: 183
    Originally posted by Philippe Maquet:
    Hi Mike,

    After you have read your byte array, how do you convert it into a String ? If you use the special String constructor String(byte[] bytes, String charsetName), it's OK (the constructor String(byte[] bytes) uses the default platform's charset which may be wrong).
    Now for write, I would first use the String method byte[] getBytes(String charsetName) and then void write(byte[] b) in RandomAccessFile, to fix the encoding.
    I recall that it's not the solution I chose to implement, but if you don't use NIO, I think it's the easiest one.
    Cheers,
    Phil.

    I was using String(byte[] bytes) to convert to a String, so I'll have to change this. I gather the charset name should be "US-ASCII".
    I also use readByte, readShort and readInt but Byte doesn't have an equivalent constructor - what should I do for these?
    ms
    [ July 25, 2003: Message edited by: Mike Southgate ]
     
    Mike Southgate
    Ranch Hand
    Posts: 183
    I think I figured out part of the answer to my last question. I shouldn't have to worry about character conversions for the numeric type since they're not really codes the way characters are. It's only char or strings that will need the conversions. Correct?
     
    Philippe Maquet
    Bartender
    Posts: 1872
    Hi Mike,

    I was using String(byte[] bytes) to convert to a String, so I'll have to change this. I gather the charset name should be "US-ASCII".

    Yes.

    I think I figured out part of the answer to my last question. I shouldn't have to worry about character conversions for the numeric type since they're not really codes the way characters are. It's only char or strings that will need the conversions. Correct?

    Exactly !
    Best,
    Phil.
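To illustrate why numeric fields need no charset: readInt() and readShort() assemble fixed-width big-endian binary values straight from the bytes. The sketch below uses DataInputStream over a byte array so it runs without a file; RandomAccessFile implements the same DataInput methods. The header layout and the names magic and fieldCount are assumptions made up for this example.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;

// Hypothetical example: the two-field "header" layout is invented.
class NumericFieldsDemo {
    static int[] readHeader(byte[] raw) throws IOException {
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(raw));
        int magic = in.readInt();        // 4 bytes, big-endian, no charset involved
        int fieldCount = in.readShort(); // 2 bytes, big-endian, no charset involved
        return new int[] { magic, fieldCount };
    }

    public static void main(String[] args) throws IOException {
        byte[] raw = { 0, 0, 1, 1, 0, 7 };
        int[] header = readHeader(raw);
        System.out.println(header[0] + " " + header[1]); // prints "257 7"
    }
}
```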
     
    Bharat Ruparel
    Ranch Hand
    Posts: 493
    Hello Phil,
    I am doing the URLyBird assignment. The instructions are as follows:
    "All text values, and all fields(which are text only), contain only 8 bit characters, null terminated if less than the maximum length for the field. The character encoding is 8 bit US ASCII."
    Here is the method that I am using to read from a RandomAccessFile:
    static String readFixedString(RandomAccessFile p_in, short p_size) throws IOException {
        StringBuffer b = new StringBuffer(p_size);
        int i = 0;
        boolean more = true;
        while (more && i < p_size) {
            char ch = (char) p_in.readByte();
            i++;
            if (ch == 0)
                more = false;
            else
                b.append(ch);
        }
        p_in.skipBytes(p_size - i);
        return b.toString();
    }
    Is this a problem?
    Best Regards.
    Bharat
     
    Philippe Maquet
    Bartender
    Posts: 1872
    Hi Bharat,
    Well, what you are doing does not use any encoding at all: by reading one byte at a time and casting it to a char, you don't take any encoding into account. Since US-ASCII is a subset of most common 8-bit character sets, it should be OK, at least for reading. But if you take the same approach for writing, you run the risk of putting characters (I mean bytes) in your file which are not part of US-ASCII (mainly the European accented characters).
    So I think it's better to use "US-ASCII" expressly as the instructions state.
    See the String constructor String(byte[] bytes, String charsetName), and the String method byte[] getBytes(String charsetName).
    Best,
    Phil.
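A small experiment makes the difference between casting and decoding visible. This is only a sketch: the class name CastVersusDecode and its helpers are invented, and StandardCharsets assumes Java 7 or later (on older versions the "US-ASCII" charset name works the same way).

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class: names are illustrative only.
class CastVersusDecode {
    // What a plain cast does: the byte is sign-extended, no charset is consulted.
    static char castToChar(byte b) {
        return (char) b;
    }

    // What an explicit decode does: bytes outside US-ASCII become U+FFFD.
    static String decodeAscii(byte b) {
        return new String(new byte[] { b }, StandardCharsets.US_ASCII);
    }

    // What an explicit encode does: characters outside US-ASCII become '?'.
    static byte[] encodeAscii(String s) {
        return s.getBytes(StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        byte high = (byte) 0xE9; // not a US-ASCII byte
        System.out.println(Integer.toHexString(castToChar(high)));            // prints "ffe9"
        System.out.println(Integer.toHexString(decodeAscii(high).charAt(0))); // prints "fffd"
        System.out.println((char) encodeAscii("\u00e9")[0]);                  // prints "?"
        // For genuine ASCII bytes the two approaches agree.
        System.out.println(castToChar((byte) 'P') == decodeAscii((byte) 'P').charAt(0)); // prints "true"
    }
}
```

So the cast is harmless while the file really holds only 7-bit ASCII, and silently wrong as soon as it does not.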
     
    Bharat Ruparel
    Ranch Hand
    Posts: 493
    Hello Phil,
    Thanks for the reply. I see your point. However, I am trying to stay within the guidelines given by Sun in the assignment. The instructions say explicitly:
    "All text values, and all fields (which are text only), contain only 8 bit characters, null terminated if less than the maximum length for the field. The character encoding is 8 bit US ASCII."
    Since readByte() and writeByte() default to 8 bit US ASCII (at least that is my assumption) I feel I am OK.
    Let me know your thoughts.
    Thanks.
    Bharat
     
    S Bala
    Ranch Hand
    Posts: 49
    Don't forget, you need to use the encoding while writing back to the file too.
    I am encoding it, and then padding with spaces, if required.
    SB
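Encoding first and then padding with spaces could be sketched like this. It is an assumption-laden illustration, not SB's code: the class name SpacePaddedField, the method encodeField and the choice to reject over-long values are all invented here.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper: names and the overflow policy are assumptions.
class SpacePaddedField {
    // Encode the value as US-ASCII and pad the rest of the field with spaces.
    static byte[] encodeField(String value, int fieldLength) {
        byte[] encoded = value.getBytes(StandardCharsets.US_ASCII);
        if (encoded.length > fieldLength) {
            throw new IllegalArgumentException("value longer than field: " + value);
        }
        byte[] field = new byte[fieldLength];
        Arrays.fill(field, (byte) ' ');
        System.arraycopy(encoded, 0, field, 0, encoded.length);
        return field;
    }

    public static void main(String[] args) {
        byte[] field = encodeField("Palace", 10);
        // prints "[Palace    ]"
        System.out.println("[" + new String(field, StandardCharsets.US_ASCII) + "]");
    }
}
```

The resulting array can be handed to RandomAccessFile.write(byte[]) as one fixed-length field.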
     
    Philippe Maquet
    Bartender
    Posts: 1872
    Hi Bharat,
    Since readByte() and writeByte() default to 8 bit US ASCII (at least that is my assumption) I feel I am OK.

    Not correct! readByte() and writeByte() do not default to some "8 bit US ASCII" encoding: as their names suggest, they just handle bytes, and "encoding" is a notion which only comes to life at the character level.
    Why take any risk regarding the instructions?!
    Cheers,
    Phil.
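The same point applies to RandomAccessFile.writeBytes(String), which writes the low eight bits of each char without consulting any charset. The sketch below compares it with an explicit encode; the class name WriteBytesDemo and its helper are invented for illustration, and Java 7 or later is assumed.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class: names are illustrative only.
class WriteBytesDemo {
    // Writes a one-character string both ways and returns the two bytes that land in the file.
    static int[] writeBothWays(String oneChar) throws IOException {
        File tmp = File.createTempFile("write-demo", ".bin");
        tmp.deleteOnExit();
        try (RandomAccessFile raf = new RandomAccessFile(tmp, "rw")) {
            raf.writeBytes(oneChar);                                // keeps the low 8 bits of each char
            raf.write(oneChar.getBytes(StandardCharsets.US_ASCII)); // explicit encoding step
            raf.seek(0);
            return new int[] { raf.readUnsignedByte(), raf.readUnsignedByte() };
        }
    }

    public static void main(String[] args) throws IOException {
        int[] written = writeBothWays("\u00e9"); // e with acute accent, not in US-ASCII
        // prints "e9 3f": writeBytes stored a non-ASCII byte, getBytes substituted '?'
        System.out.println(Integer.toHexString(written[0]) + " " + Integer.toHexString(written[1]));
    }
}
```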
     
    Bharat Ruparel
    Ranch Hand
    Posts: 493
    Hello Phil,
    Thanks for the quick response. Bala, you too.
    Based on your input, I have modified the readFixedString method as follows:
    static String readFixedStringTest(RandomAccessFile p_in, short p_size) throws IOException {
        byte[] bArray = new byte[p_size];
        int i = 0;
        boolean more = true;
        while (more && i < p_size) {
            bArray[i] = p_in.readByte();
            if ((char) bArray[i] == 0)
                more = false;
            i++;
        }
        String tempStr = new String(bArray, "US-ASCII");
        p_in.skipBytes(p_size - i);
        return tempStr;
    }
    What do you think?
    Let me know.
    Bharat
     
    Philippe Maquet
    Bartender
    Posts: 1872
    Hi Bharat,
    It's OK IMO as far as encoding is concerned. But as your bArray variable has the right size, why not use the RandomAccessFile method "int read(byte[] b)"? It should be faster with the same result.
    Best,
    Phil.
     
    Bharat Ruparel
    Ranch Hand
    Posts: 493
    Did that.
    Thanks Phil.
    Regards.
    Bharat
     
    Jim Yingst
    Wanderer
    Sheriff
    Posts: 18671
    [Mike S]: I'm using RandomAccessFile.read(byte[] b) to read the data
    I think you guys really want readFully(byte[] b). The read() method will usually work OK, but can't be relied on to always read all the bytes you expect. The readFully() method is an easy way to fix this.
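Roughly speaking, readFully() has to do what the loop below does: keep calling read() until the buffer is full. This is a sketch, not the JDK's implementation. It works on an InputStream so that a short read can be forced for demonstration; the class ReadFullyDemo and the deliberately stingy trickle() stream are invented for this example.

```java
import java.io.ByteArrayInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class: names are illustrative only.
class ReadFullyDemo {
    // Keep calling read() until the buffer is full or the stream ends.
    static void readFully(InputStream in, byte[] buf) throws IOException {
        int filled = 0;
        while (filled < buf.length) {
            int n = in.read(buf, filled, buf.length - filled);
            if (n < 0) {
                throw new EOFException("stream ended after " + filled + " bytes");
            }
            filled += n;
        }
    }

    // Test stream that deliberately hands back at most 2 bytes per call.
    static InputStream trickle(byte[] data) {
        return new ByteArrayInputStream(data) {
            @Override
            public synchronized int read(byte[] b, int off, int len) {
                return super.read(b, off, Math.min(len, 2));
            }
        };
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "Palace".getBytes(StandardCharsets.US_ASCII);

        byte[] once = new byte[6];
        System.out.println(trickle(data).read(once, 0, 6)); // prints "2": a single read() came up short

        byte[] all = new byte[6];
        readFully(trickle(data), all);
        System.out.println(new String(all, StandardCharsets.US_ASCII)); // prints "Palace"
    }
}
```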
     
    Philippe Maquet
    Bartender
    Posts: 1872
    Hi Jim,
    Thank you for pointing that potential issue out. I only use RAF to read the file header (I use NIO for the records), but (even Sun's doc is quite unclear about read() here as so often), yes, it seems safer to do as you suggest.
    Thanks again,
    Phil.
     
    Bharat Ruparel
    Ranch Hand
    Posts: 493
    Hello Jim,
    I took your advice and changed read(byte []) to readFully(byte []). My code now looks as follows:
    static String readFixedStringTest(RandomAccessFile p_in, short p_size) throws IOException {
        byte[] bArray = new byte[p_size];
        p_in.readFully(bArray);
        int i = 0;
        boolean more = true;
        while (more && i < p_size) {
            if ((char) bArray[i] == 0)
                more = false;
            i++;
        }
        String tempStr = new String(bArray, 0, p_size, "US-ASCII");
        return tempStr;
    }
    My question to you and Phil would be: I printed out the JavaDoc for RandomAccessFile for Java 2 Platform SE 1.4.1 from Sun's site. I couldn't tell the difference between read and readFully after going through it. How can I begin to use the online resources more effectively? Is there something that I am missing here?
    Thanks.
    Bharat
     
    Philippe Maquet
    Bartender
    Posts: 1872
    Hi Bharat,
    My question to you and Phil would be: I printed out the JavaDoc for RandomAccessFile for Java 2 Platform SE 1.4.1 from Sun's site. I couldn't tell the difference between read and readFully after going through it. How can I begin to use the online resources more effectively? Is there something that I am missing here?

    That's why I wrote "even Sun's doc is quite unclear about read() here as so often". The readFully() doc is very clear, the RAF.read() one is not, and it's even worse IMO when you extend your reading to InputStream.read(). "How can I begin to use the online resources more effectively?": just try to figure out what those people tried to explain, more or less successfully. I remember a memorable discussion between Jim and Max about the (guaranteed or not) atomicity of FileChannels, each of them taking some part of the doc as an opposing argument.
    Best,
    Phil.
    PS: About your code : a solution based on indexOf((char)0) is - probably - more efficient.
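One way to act on that PS is to find the first null byte and decode only what comes before it. The sketch below is a byte-level variant of the indexOf idea, not code from the thread; the class name NullTerminatedField and the method decodeField are invented, and Java 7 or later is assumed for StandardCharsets.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper: names are illustrative only.
class NullTerminatedField {
    // Decode the field up to (not including) the first null byte, or the whole field if there is none.
    static String decodeField(byte[] field) {
        int end = 0;
        while (end < field.length && field[end] != 0) {
            end++;
        }
        return new String(field, 0, end, StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        byte[] terminated = { 'P', 'a', 'l', 0, 0, 0 };
        byte[] full = { 'P', 'a', 'l', 'a', 'c', 'e' };
        System.out.println(decodeField(terminated)); // prints "Pal"
        System.out.println(decodeField(full));       // prints "Palace"
    }
}
```

Unlike the loop in the post above, the position of the null is actually used when the String is built.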
     
    Jim Yingst
    Wanderer
    Sheriff
    Posts: 18671
    [Philippe]: That's why I wrote "even Sun's doc is quite unclear about read() here as so often". readFully() doc is very clear, RAF.read() one is not, and it's even worse IMO when you extend reading to InputStream.read().
    RAF.read() is a bit clearer if you use readFully() as a point of comparison. If read() were not capable of incomplete reads, there would be no need for the readFully() method. However it's still not very clear unless you analyze it carefully. And yes InputStream is worse in this respect because there is no readFully() method to compare it to. This lack of clarity is why I'm so paranoid about FileChannel - I don't see any clear indication that the same problem will not occur there.
    [Philippe]: I remember a memorable discussion between Jim and Max about the (guaranteed or not) atomicity of FileChannels, each of them taking some part of the doc as opposite argument
    Memorable? Hey, it's still going on - we just moved it to the I/O forum, here. I'm winning.
    [ August 06, 2003: Message edited by: Jim Yingst ]
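For anyone sharing that paranoia about FileChannel, a defensive loop is cheap. The following is a sketch under the assumption that a single FileChannel.read() might return fewer bytes than requested; whether that can actually happen for a local file is exactly the open question above. The class name ChannelReadDemo and its helper are invented for illustration.

```java
import java.io.EOFException;
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class: names are illustrative only.
class ChannelReadDemo {
    // Keep reading at an absolute position until the buffer is full or the file ends.
    static void readFully(FileChannel channel, ByteBuffer dst, long position) throws IOException {
        while (dst.hasRemaining()) {
            int n = channel.read(dst, position);
            if (n < 0) {
                throw new EOFException("end of file at position " + position);
            }
            position += n;
        }
    }

    public static void main(String[] args) throws IOException {
        File tmp = File.createTempFile("channel-demo", ".db");
        tmp.deleteOnExit();
        try (RandomAccessFile raf = new RandomAccessFile(tmp, "rw")) {
            raf.write("Palace".getBytes(StandardCharsets.US_ASCII));
            ByteBuffer buf = ByteBuffer.allocate(6);
            readFully(raf.getChannel(), buf, 0L);
            buf.flip();
            System.out.println(StandardCharsets.US_ASCII.decode(buf)); // prints "Palace"
        }
    }
}
```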
     
    Philippe Maquet
    Bartender
    Posts: 1872
    Hey, great ! But I knew already that you were winning (in that thread anyway)
    Now I'll read the next part of it (I stopped the serial with this forum ).
    Best,
    Philippe.
     
    Bharat Ruparel
    Ranch Hand
    Posts: 493
    Hello Phil/Jim,
    Regardless of what the assignment says (URLyBird 1.31), the data file db-1x1.db contains fixed-length String data. There is no null termination character at the end. For example, the name field is 64 characters wide and contains the string value "Palace". Stepping through the program in debug mode, I discovered that every string value has a space (" ") character immediately after its last alphabetic character (in the example above, the character "e"). I am wondering whether the following would then be the most efficient implementation of reading data from this file?
    static String readFixedStringTest(RandomAccessFile p_in, short p_size) throws IOException {
        byte[] bArray = new byte[p_size];
        p_in.readFully(bArray);
        String tempStr = new String(bArray, 0, p_size, "US-ASCII");
        tempStr = tempStr.trim();
        return tempStr;
    }
    I am a bit concerned about the instructions, which explicitly state:
    "All text values, and all fields (which are text only), contain only 8 bit characters, null terminated if less than the maximum length for the field. The character encoding is 8 bit ASCII."
    I suppose I can throw in the indexOf check as follows in the above code:
    static String readFixedStringTest(RandomAccessFile p_in, short p_size) throws IOException {
        byte[] bArray = new byte[p_size];
        p_in.readFully(bArray);
        String tempStr = new String(bArray, 0, p_size, "US-ASCII");
        int nullPos = tempStr.indexOf(0);
        tempStr = (nullPos < 0) ? tempStr : tempStr.substring(0, nullPos);
        tempStr = tempStr.trim();
        return tempStr;
    }
    What are your thoughts?
     
    Philippe Maquet
    Bartender
    Posts: 1872
    Hi Bharat,
    This "issue" has been discussed in detail recently, but as my opinion is quite different from that of the majority of people here, I'll answer once more just for you.
    Because all field values are padded with spaces, many people see a "de facto" standard which would go against our instructions (which tell us to pad field values with zeros ((byte) 0)). And from there, there are a few variations on what they think they must do.
    I don't see any contradiction. If a user of Data pads field values with spaces up to the maximum length, there is no null byte to add, and Data is not concerned. The space character is a "normal" one, so I wouldn't trim() field values after reading or before writing; I would just do what the Data instructions tell us to do.
    Best,
    Phil.
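The practical difference between the two positions shows up in a few lines. String.trim() strips every leading and trailing character up to and including U+0020, which covers both spaces and null characters, whereas cutting at the first null leaves spaces alone. The sketch below is only an illustration; the class name TrimVersusNullCut and the helper cutAtNull are invented.

```java
// Hypothetical demo class: names are illustrative only.
class TrimVersusNullCut {
    // Keep everything before the first null character; spaces are treated as ordinary data.
    static String cutAtNull(String s) {
        int nul = s.indexOf('\0');
        return nul < 0 ? s : s.substring(0, nul);
    }

    public static void main(String[] args) {
        String spacePadded = "Palace  ";
        String nullPadded = "Palace\0\0";
        String leadingSpace = " Palace\0";

        System.out.println("[" + cutAtNull(spacePadded) + "]");  // prints "[Palace  ]"
        System.out.println("[" + cutAtNull(nullPadded) + "]");   // prints "[Palace]"
        System.out.println("[" + cutAtNull(leadingSpace) + "]"); // prints "[ Palace]"

        System.out.println("[" + spacePadded.trim() + "]");      // prints "[Palace]"
        System.out.println("[" + nullPadded.trim() + "]");       // prints "[Palace]"
        System.out.println("[" + leadingSpace.trim() + "]");     // prints "[Palace]"
    }
}
```

So trim() changes values that contain meaningful leading or trailing spaces, while cutting at the null preserves exactly what was stored.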
     