• Post Reply
  • Bookmark Topic Watch Topic
  • New Topic

character encoding or not

 
Garry Kalra
Ranch Hand
Posts: 111
I am having trouble deciding whether to use a character encoding in the deprecated methods or not. Please guide me.
Garry
 
Trevor Dunn
Ranch Hand
Posts: 84
I didn't
Trevor
 
Garry Kalra
Ranch Hand
Posts: 111
What if I use the UTF-8 encoding? I am sure it is available on all the Java 2 platforms.
 
Mark Spritzler
ranger
Sheriff
Posts: 17278
That's the default if you don't put in an encoding.
Mark
 
Garry Kalra
Ranch Hand
Posts: 111
Thanks for the info
 
Kalichar Rangantittu
Ranch Hand
Posts: 240
I am not sure that the default encoding, if not specified, is UTF-8 for the String construction and getBytes, Mark. From what I saw, the default encoding will be whatever the system specifies. I am not really sure which one to use, UTF-8 or ISO-8859-1. Anyone with suggestions justifying one or the other, please let me know. Thanks in advance.
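Kalichar's observation is easy to verify: the no-argument String/getBytes conversions use the platform default charset, which varies by OS and locale, while passing a charset name makes the result the same everywhere. A minimal sketch (the class name is made up; the printed defaults differ per machine):

```java
import java.nio.charset.Charset;

public class DefaultCharsetCheck {
    public static void main(String[] args) throws Exception {
        // The no-argument conversions use the platform default charset,
        // not UTF-8 -- this is what varies between Windows and Linux.
        System.out.println("file.encoding  = " + System.getProperty("file.encoding"));
        System.out.println("defaultCharset = " + Charset.defaultCharset());

        String s = "caf\u00e9"; // "café", 4 characters
        // With an explicit charset the byte count is platform-independent.
        byte[] utf8   = s.getBytes("UTF-8");      // é -> 2 bytes, total 5
        byte[] latin1 = s.getBytes("ISO-8859-1"); // é -> 1 byte,  total 4
        System.out.println(utf8.length + " vs " + latin1.length);
    }
}
```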
 
Kalichar Rangantittu
Ranch Hand
Posts: 240
I'm really confused. When the Data class is used directly on Windows without modification, there are trailing spaces on some strings. If I try the same on Linux, no spaces. I have heard that this has to do with how individual platforms store bytes. So how can one get by without calling trim() after reading in the buffer on Windows?
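The trailing spaces are usually not a platform quirk but fixed-width padding: each field is padded to its full width with spaces when written, so a reader sees the padding unless it trims. A minimal sketch, assuming space-padded fields as in the FBN data file (the class name, field width, and pad helper are made up for illustration):

```java
public class FixedWidthField {
    static final int WIDTH = 10; // hypothetical field width in bytes

    // Pad a value to the fixed width with trailing spaces, one byte per char.
    static byte[] pad(String value) throws Exception {
        StringBuilder sb = new StringBuilder(value);
        while (sb.length() < WIDTH) sb.append(' ');
        return sb.toString().getBytes("ISO-8859-1");
    }

    public static void main(String[] args) throws Exception {
        byte[] field = pad("SWISSAIR");
        String raw = new String(field, "ISO-8859-1");
        System.out.println("[" + raw + "]");        // padding visible
        System.out.println("[" + raw.trim() + "]"); // padding stripped
    }
}
```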
 
Kalichar Rangantittu
Ranch Hand
Posts: 240
If someone has used the system encoding, could you please justify your reason for the same?
Thanks
 
Mark Spritzler
ranger
Sheriff
Posts: 17278
I used trim(), as most people here have used trim(); there are a lot of posts about it here.
Mark
 
Gennady Shapiro
Ranch Hand
Posts: 196
Originally posted by Kalichar Rangantittu:
If someone has used the system encoding, could you please justify your reason for the same?
Thanks

This is really simple, guys.
If you aren't sure about what encoding to use for writing, just look at what encoding the code uses for reading. It screams loud and clear.
 
Kalichar Rangantittu
Ranch Hand
Posts: 240
Gennady sir... I am not getting it. This is one area where I am hitting a dead end. I see that when the buffer is read in, it does not do a readUTF or anything, it just parses from the buffer? Please expand... Thanks in advance.
 
Kalichar Rangantittu
Ranch Hand
Posts: 240
Additionally, the constructor that creates a database uses writeUTF for creating the header names. This is confusing me endlessly.
 
Gennady Shapiro
Ranch Hand
Posts: 196
I am assuming your question is what encoding scheme to use for reading/writing when replacing the deprecated methods that handle encoding incorrectly.
1. The FBN spec tells you to modify only the deprecated methods.
2. That means you should not (although you could) modify other methods of Data.
3. That means when you modify the deprecated methods, you should be consistent with whatever encoding the rest of the class uses.
4. Let's look at the constructors... They use readUTF/writeUTF... Doesn't that mean you should use UTF-8 to encode your records? Yes. Must you use UTF-8? No.
Common sense suggests that you should not have a file with mixed encodings, and why should you?
There are cases where mixed encoding makes sense... say your headers are UTF-16 ("Unicode") encoded; should you be consistent and use UTF-16 for your data too? Not necessarily, if you can use UTF-8, which for mostly-ASCII text consumes only about half the space of the UTF-16-encoded data. In that case mixed encoding would make sense -- you are saving megabytes of disk space (in the real world, of course).
In this particular case, you should use UTF-8 because (1) it's consistent with the rest of Data's methods and (2) UTF-8 is the optimal encoding for this particular task.
Of course, it's all up to you; you can say 'screw it, I wanna be consistent with Sun's philosophy and use Unicode for everything'. You can do that too, as long as you mention it in your design document. I don't think they'll mind much.
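The point about the constructors can be seen in a tiny round trip: writeUTF prefixes the string with a 2-byte length, so readUTF knows exactly how many bytes to consume and the header survives regardless of how long its encoded form is. A sketch (the class name and header string "Carrier" are made up, not from the FBN spec):

```java
import java.io.*;

public class UtfHeaderDemo {
    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bos);
        // writeUTF stores a 2-byte length prefix followed by the
        // (modified) UTF-8 bytes of the string.
        out.writeUTF("Carrier");
        out.close();

        DataInputStream in = new DataInputStream(
                new ByteArrayInputStream(bos.toByteArray()));
        // readUTF reads the length prefix first, so it consumes
        // exactly the right number of bytes.
        System.out.println(in.readUTF());
    }
}
```

This self-delimiting format is exactly what the fixed-width record fields lack, which is the root of the confusion discussed below.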
 
Kalichar Rangantittu
Ranch Hand
Posts: 240
Thanks for your response. I still have doubts. Say we use UTF-8 for the encoding, as is done with the headers, to maintain consistency. Methods like writeUTF used for the header convert Unicode to bytes using UTF-8 encoding, and that is fine, because while reading in the header they use readUTF again. Now say we use UTF-8 for the read method of the program. The read method creates a buffer whose length is the specified record length, for example 200. This length assumes that each byte of the file maps to one byte of the buffer, i.e.,
byte[] buffer = new byte[200];
Suppose our data was written using UTF-8. Then there could be a problem, as an unusual character like \u0a9e may not be just one byte but in fact 3 bytes, although all ASCII values still map to 1 byte.
So if I write the field for AIRLINE, which is assumed to be 10 characters long, as s = "TESTAIRS\u0a9e", this would in fact write out 8 bytes for TESTAIRS + 3 bytes for \u0a9e, therefore 11 bytes in total. Note that the length of the string is still 9, as obtained by s.length().
Consider a change in the read method:
rv[i] = new String(buffer, offset, description[i].getLength(), "UTF-8").trim();
This would not work, as it would take only description[i].getLength() bytes from the buffer (i.e., in the example above, 10 bytes, not the extra ones needed for \u0a9e), causing loss of data???
This is what is making me go crazy!
I am quite a confused rat at this point :(
Thanks for all the assistance.
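The byte-count arithmetic above can be checked directly: a character in the U+0800 to U+FFFF range (such as \u0a9e) takes 3 bytes in UTF-8, so the string's char length and its encoded byte length diverge. A minimal sketch:

```java
public class ByteLengthDemo {
    public static void main(String[] args) throws Exception {
        String s = "TESTAIRS\u0a9e"; // 9 characters
        byte[] utf8 = s.getBytes("UTF-8");
        // 8 single-byte ASCII chars + 3 bytes for \u0a9e = 11 bytes,
        // while s.length() still reports 9.
        System.out.println(s.length());
        System.out.println(utf8.length);
    }
}
```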
 
Kalichar Rangantittu
Ranch Hand
Posts: 240
I actually tried the above. For characters that take 3 bytes to encode, the read does not read in the fields correctly. Any suggestions, please?
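The thread never settles this, but one common way out, offered here only as a hedged suggestion, is to use a single-byte charset such as ISO-8859-1 for the fixed-width record fields, so that N characters always occupy exactly N bytes and the fixed offsets into the record buffer stay valid. A sketch (the 10-character field value is hypothetical):

```java
public class SingleByteFieldDemo {
    public static void main(String[] args) throws Exception {
        // With a single-byte charset, char count == byte count,
        // so a fixed-width field never spills past its slot.
        String field = "TESTAIRS Z"; // hypothetical 10-char field value
        byte[] bytes = field.getBytes("ISO-8859-1");
        System.out.println(bytes.length == field.length()); // always true

        // Decoding a fixed slice of the buffer recovers the field intact.
        String back = new String(bytes, 0, 10, "ISO-8859-1").trim();
        System.out.println(back);
    }
}
```

The trade-off is that ISO-8859-1 cannot represent characters outside Latin-1, so this only works if the record data is restricted to that repertoire.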
 
Kalichar Rangantittu
Ranch Hand
Posts: 240
Gennady, please provide your valuable input sir....
 