
Encoding UTF-8 and MalformedInput Exception

 
Nagendra Prasad
Ranch Hand
Posts: 219
All,
I have a rather curious problem that's bothering me.
The following code snippet

works OK for a few pages and bombs out for others. As far as I am aware, all the content being served is UTF-8, so I am a bit bemused.
I tried removing the UTF-8 setting and still got a malformed-input exception. Beats me. I am using WebSphere 4.0.6, if that's of any help.
 
Jim Yingst
Wanderer
Sheriff
Posts: 18671
As far as i am aware, all the content that is being attempted to be served is UTF-8.
Well, maybe that's not the case. You probably need to gather more info. How about using getContentType() to find out whether whatever you're connecting to has given you any more information about the encoding? Typically this is something like
"text/html; charset=UTF-8"
So you can try parsing the charset out of this field. If none is provided, the default is supposed to be ISO-8859-1. In practice, sadly, it is not unusual for servers to fail to specify these fields correctly. The next line of defense is to assume initially that the encoding is ISO-8859-1, use that to interpret the HTML that follows, and look for a meta tag that carries the real encoding. E.g.
<meta http-equiv="Content-Type" content="text/html; charset=Shift_JIS">
Naturally, servers that fail to properly specify their encodings are EVIL. But we may have to deal with them nonetheless...
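The two-step lookup described above (charset from the Content-Type value first, then a meta tag read through ISO-8859-1, then the ISO-8859-1 default) could be sketched roughly like this. The class and method names (CharsetSniffer, findCharsetName, detect) are made up for illustration, the 4096-byte sniffing window is an arbitrary assumption, and the code assumes Java 7 or later for StandardCharsets. Treat it as a starting point, not a complete HTML encoding detector.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CharsetSniffer {
    // charset=value in a Content-Type header value, quotes optional.
    private static final Pattern HEADER_CHARSET =
            Pattern.compile("charset\\s*=\\s*[\"']?([\\w.:-]+)", Pattern.CASE_INSENSITIVE);

    // The same parameter, but only when it appears inside a <meta ...> tag.
    private static final Pattern META_CHARSET =
            Pattern.compile("<meta[^>]*?charset\\s*=\\s*[\"']?([\\w.:-]+)", Pattern.CASE_INSENSITIVE);

    /** Returns the first charset name the pattern finds in text, or null. */
    static String findCharsetName(Pattern pattern, String text) {
        if (text == null) {
            return null;
        }
        Matcher m = pattern.matcher(text);
        return m.find() ? m.group(1) : null;
    }

    /** Header first, then meta tag, then the ISO-8859-1 default. */
    static Charset detect(String contentType, byte[] body) {
        String name = findCharsetName(HEADER_CHARSET, contentType);
        if (name == null) {
            // ISO-8859-1 maps every byte to a character, so this decode cannot fail.
            int len = Math.min(body.length, 4096); // assumed sniffing window
            String head = new String(body, 0, len, StandardCharsets.ISO_8859_1);
            name = findCharsetName(META_CHARSET, head);
        }
        if (name == null) {
            return StandardCharsets.ISO_8859_1;
        }
        try {
            return Charset.forName(name);
        } catch (IllegalArgumentException e) {
            // Illegal or unsupported charset name: fall back to the default.
            return StandardCharsets.ISO_8859_1;
        }
    }

    public static void main(String[] args) {
        byte[] page = ("<html><head><meta http-equiv=\"Content-Type\" "
                + "content=\"text/html; charset=UTF-8\"></head></html>")
                .getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(detect("text/html; charset=UTF-8", page)); // UTF-8 (from header)
        System.out.println(detect("text/html", page));                // UTF-8 (from meta tag)
        System.out.println(detect(null, new byte[0]));                // ISO-8859-1 (default)
    }
}
```

Once you have the Charset, wrap the stream in an InputStreamReader built with it instead of hard-coding UTF-8.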
 
Nagendra Prasad
Ranch Hand
Posts: 219
Jim,
Many thanks for your response. I am going to try this at work tomorrow.
For completeness, the application server is WebSphere.
But the approach you have outlined could be useful. Let me try it and
get back to you.
 
Nagendra Prasad
Ranch Hand
Posts: 219
Jim...
I realised I have some sample data here that I can try this with.
This is what the content type comes out as:
"text/html;charset=Cp1252"
So there is a reason why it is failing.
But I think I need to go into a bit more detail to describe the problem meaningfully:
The data that I am trying to retrieve is an HTML document stored in the database. The HTML was stored using the following setter method:

The database (which is DB2) has a codeset of UTF-8. I know that this
gets reflected as Cp1252. So the question is: is there a confusion or conflict between the way the data is stored and the way it is retrieved,
or
is this still an issue that the application server should be capable of handling?
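
For anyone wanting to reproduce the mismatch in isolation: a minimal sketch, assuming Java 7 or later, of what happens when single-byte text is decoded strictly as UTF-8. StrictDecodeDemo and decodesCleanly are hypothetical names, and ISO-8859-1 stands in for Cp1252 here because both encode é as the single byte 0xE9. Pure-ASCII text decodes cleanly either way, which might be why only some pages fail.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class StrictDecodeDemo {
    /** Returns true if the bytes are a well-formed sequence in the given charset. */
    static boolean decodesCleanly(byte[] bytes, Charset charset) {
        try {
            charset.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)      // throw instead of substituting
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            // MalformedInputException is a subclass of CharacterCodingException.
            return false;
        }
    }

    public static void main(String[] args) {
        // "cafe au lait" with an accented e, written as a single byte 0xE9.
        byte[] accented = "caf\u00e9 au lait".getBytes(StandardCharsets.ISO_8859_1);
        byte[] ascii = "plain text".getBytes(StandardCharsets.ISO_8859_1);

        System.out.println(decodesCleanly(ascii, StandardCharsets.UTF_8));         // true
        System.out.println(decodesCleanly(accented, StandardCharsets.UTF_8));      // false
        System.out.println(decodesCleanly(accented, StandardCharsets.ISO_8859_1)); // true
    }
}
```

In UTF-8, 0xE9 announces a three-byte sequence, so a following space (0x20) makes the input malformed.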
 