java.io.UTFDataFormatException: malformed input around byte 0

 
Praveen Babu
Ranch Hand
Posts: 138
Hello,

I am having trouble running the program below with strings longer than 127 characters.
I get the following exception when a string is longer than 127 characters:

java.io.UTFDataFormatException: malformed input around byte 0
at java.io.DataInputStream.readUTF(DataInputStream.java:639)
at java.io.DataInputStream.readUTF(DataInputStream.java:547)
at com.TestMalformedInput.unpackString(TestMalformedInput.java:61)
at com.TestMalformedInput.main(TestMalformedInput.java:20)
An exception during unpacking

After analyzing the code, I found that the issue is caused by converting bytes to a String and back.
The program runs successfully when I convert the String arguments to byte[]; see the uncommented lines.
The biggest problem is that I receive the variable as a String and cannot change the method signature.
Could someone guide me on how to solve this issue?
Thank you in advance.

 
Paul Clapham
Sheriff
Posts: 22185
I don't understand the point of that code. To me it seems like a waste of typing, but maybe there's something there which I don't understand.

You have a method called "packString" which writes String data in UTF-8 format to an array of bytes and then converts those bytes back to a String using some unknown encoding (your system's default charset, whatever that is). Converting binary bytes to a String is rarely a good idea, as String is not meant to be a container for arbitrary bytes. In the best case your system's default charset would be UTF-8, in which case packString would just give you back your original String. In other cases you get a mangled version of your original String.

And then you have "unpackString" which takes a String, converts it to bytes using that same unknown encoding, and then tries to read those bytes assuming they were encoded in UTF-8. But since you get that error message, that means they weren't encoded in UTF-8.

You can make the error message go away by specifying the UTF-8 encoding in your toString() and getBytes() method, but since that results in packString() returning the original String, there doesn't seem to be any point in that. But as I said, there's a good chance I don't understand what that code is supposed to be for.

I also notice that you aren't closing your output stream properly (the variable "dos"). So perhaps you're losing the last part of the data as well.
 
John Jai
Rancher
Posts: 1776
Praveen,

Please re-read UseCodeTags and keep the number of characters in a line restricted. It helps your post look way better.
 
Praveen Babu
Ranch Hand
Posts: 138
Hello Paul,

Thank you for the reply. First of all, I forgot to mention that I am running this with the default encoding set to UTF-8, so the system default will be UTF-8.
Let me explain my use case. I receive a large number of strings as a String array, which will be stored as bytes in a BLOB
field of a database. Since I cannot change the method signature, I have to return the result as a String. This is what packString does.
The String is converted to bytes and stored in the DB by a different module, which I cannot access.
The unpackString method does the reverse, i.e. it takes a String and returns a String[]. For simplicity's sake, I have omitted the conversion to String[]
and just printed out the results.
One more thing to note is that this code works fine for fewer than 127 characters.
Regarding the closing of streams: I just omitted it while reproducing this issue, although having it did not make any difference.
I have made the changes you suggested (specifying UTF-8 in toString() and getBytes()) but I am still getting the same exception.
I wonder why the String return type works correctly for fewer than 127 characters.

Thank you.
 
John Jai
Rancher
Posts: 1776
Quick pointers: try writing the String as bytes using getBytes("UTF-8") while packing. I read in the API documentation that the writeUTF() method writes modified UTF-8. Not sure if that's causing the problem.

As Paul suggested I used getBytes with UTF-8 while trying to write the String like below -


Praveen Babu wrote:First of all i forgot to mention that i am running this with default encoding as UTF-8 so system default will be UTF-8

Also, you can double-check your system's default encoding using the line of code below. Mine is Cp1252.
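
The line of code referred to here did not survive in this copy of the thread. As a minimal sketch of one common way to check the default encoding (the class and method names are hypothetical, and this is not necessarily what was originally posted):

```java
import java.nio.charset.Charset;

final class DefaultCharsetCheck {
    // Name of the charset used by new String(byte[]) and String.getBytes()
    // when no charset is specified explicitly.
    static String defaultCharsetName() {
        return Charset.defaultCharset().name();
    }

    public static void main(String[] args) {
        System.out.println("Default charset: " + defaultCharsetName());
        // The related system property, which can be set at JVM startup
        // with -Dfile.encoding=...
        System.out.println("file.encoding:   " + System.getProperty("file.encoding"));
    }
}
```

Note that Charset.defaultCharset().name() reports the canonical charset name, which may be spelled differently from the value of the file.encoding property.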
 
Praveen Babu
Ranch Hand
Posts: 138
Hi John,

Thank you for the reply. I can write the bytes as you suggest, but when reading, how can I read the name and value parts separately? The read() method
takes a byte[] as an argument, and we need to specify its size.

Regarding encoding: I have explicitly set it to UTF-8, so I have no problems with that. With the Cp1252 encoding this code works properly.
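
One possible answer to the question of how to read the parts back separately is to write each part's byte length in front of it. The following is a minimal sketch under that assumption; the class name and the pack/unpack methods are hypothetical and are not taken from the original program. It keeps the packed data as byte[] rather than String.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

final class LengthPrefixedFields {
    // Writes the field count, then each field as a 4-byte length
    // followed by its UTF-8 encoded bytes.
    static byte[] pack(String... fields) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            out.writeInt(fields.length);
            for (String field : fields) {
                byte[] encoded = field.getBytes(StandardCharsets.UTF_8);
                out.writeInt(encoded.length);
                out.write(encoded);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    // Reads the field count, then for each field reads its length and
    // exactly that many bytes, so the fields come back separately.
    static String[] unpack(byte[] packed) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(packed))) {
            String[] fields = new String[in.readInt()];
            for (int i = 0; i < fields.length; i++) {
                byte[] encoded = new byte[in.readInt()];
                in.readFully(encoded);
                fields[i] = new String(encoded, StandardCharsets.UTF_8);
            }
            return fields;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Because the length is read first, the size of the byte[] passed to readFully() is known before each field is read.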
 
John Jai
Rancher
Posts: 1776
Praveen wrote:I have done the changes you suggested( by specifying UTF-8 in toString() and getBytes() ) but still getting the same exception.

Praveen,
It would be better if you post your modified code, so that if Paul comes back he can verify if you have done the changes in a right way.
 
Praveen Babu
Ranch Hand
Posts: 138
John Jai wrote:
Praveen wrote:I have done the changes you suggested( by specifying UTF-8 in toString() and getBytes() ) but still getting the same exception.

Praveen,
It would be better if you post your modified code, so that if Paul comes back he can verify if you have done the changes in a right way.


I have edited the code in my first post as per Paul's suggestions. I do not think that will make much difference as the default encoding is "UTF-8".
 
Paul Clapham
Sheriff
Posts: 22185
So I had a look at the API documentation for DataOutputStream.writeUTF.

And what it does is this: it first writes a two-byte binary value containing the number of bytes which will be written out. Then it writes the String in bytes, encoded in (modified) UTF-8.

So your code does that. Your rather peculiar code at lines 36 to 39 does it four times, it writes the first String parameter then the second, then the first again, then the second again. So you have four groups of the binary value followed by the encoded String.

And then you convert those bytes back to a String. This is where you make your error. That two-byte binary value cannot necessarily be converted to a String, since it isn't meant to represent text. And you have found some situations where indeed converting it to a String damages the data.
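
To illustrate this point, here is a minimal sketch that checks whether the bytes produced by writeUTF are unchanged after being decoded to a String and encoded again, assuming UTF-8 is used in both directions as described earlier in the thread. The class and method names are hypothetical and this is not the original poster's code.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class LengthPrefixRoundTrip {
    // Bytes produced by writeUTF: a 2-byte length, then the encoded text.
    static byte[] writeUtfBytes(String text) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            out.writeUTF(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    // True if the bytes are unchanged after bytes -> String -> bytes in UTF-8.
    static boolean survivesStringRoundTrip(byte[] original) {
        String asText = new String(original, StandardCharsets.UTF_8);
        return Arrays.equals(original, asText.getBytes(StandardCharsets.UTF_8));
    }

    // Helper: a string made of 'count' copies of the character c.
    static String repeat(char c, int count) {
        char[] chars = new char[count];
        Arrays.fill(chars, c);
        return new String(chars);
    }

    public static void main(String[] args) {
        // prints true: the length prefix is 0x00 0x7F
        System.out.println(survivesStringRoundTrip(writeUtfBytes(repeat('a', 127))));
        // prints false: the length prefix is 0x00 0x80
        System.out.println(survivesStringRoundTrip(writeUtfBytes(repeat('a', 128))));
    }
}
```

For an ASCII string of 128 characters the length prefix is 0x00 0x80, and a lone 0x80 byte is not valid UTF-8. Decoding substitutes the replacement character U+FFFD, which encodes back as three bytes, so the length that readUTF later sees is wrong. That would be consistent with the 127-character threshold reported above.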

In other words this scheme of yours to cram two Strings into one String in such a way that you can separate the two Strings later is not going to work. You'll have to think of something else. But frankly, to me the idea of using a String to represent a BLOB which is supposed to represent an array of Strings is a design failure.
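
If the String-based method signature really has to stay, one option sometimes used for carrying binary data in a String is Base64. The sketch below is only an illustration under that assumption: the method names packString and unpackString come from the thread, but their signatures here are hypothetical, and java.util.Base64 requires Java 8 or later.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Base64;

final class Base64StringPacking {
    // Packs two strings with writeUTF, then encodes the raw bytes as
    // Base64 text so the result contains only ASCII characters.
    static String packString(String name, String value) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            out.writeUTF(name);
            out.writeUTF(value);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return Base64.getEncoder().encodeToString(buffer.toByteArray());
    }

    // Decodes the Base64 text back to the original bytes and reads the
    // two strings in the order they were written.
    static String[] unpackString(String packed) {
        byte[] raw = Base64.getDecoder().decode(packed);
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(raw))) {
            return new String[] { in.readUTF(), in.readUTF() };
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The trade-offs: Base64 grows the data by roughly a third, the BLOB would then hold Base64 text rather than the raw bytes (which the other module would have to tolerate), and writeUTF itself accepts at most 65535 encoded bytes per string.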
 
Praveen Babu
Ranch Hand
Posts: 138

Hi Paul,

Thank you for the input. I do realize that converting byte[] to String is a bad idea and that it is the root cause. The only thing restricting me is that I cannot change the
method signature, because that would lead to a huge code change. But since there seems to be no other option, I will try to use byte[] and avoid the String data type.

Regards.
 