Encoding UTF-8 and MalformedInput Exception

 
Ranch Hand
Posts: 219
All,
I have a rather curious problem that's bothering me...
The following code snippet

works OK for a few pages and bombs out for others. As far as I am aware, all the content being served is UTF-8, so I am a bit bemused.
I tried removing the UTF-8 setting, and I still got a MalformedInputException. Beats me... I am using WebSphere 4.0.6, if that's of any help.
 
Wanderer
Posts: 18671
"As far as I am aware, all the content that is being attempted to be served is UTF-8."
Well, maybe that's not the case. You probably need to gather more info. How about using getContentType() to find out whether whatever you're connecting to has given you any more information about the encoding? Typically this may be something like
"text/html; charset=UTF-8"
So you can try parsing the charset out of this field. If none is provided, the default is supposed to be ISO-8859-1. In practice it is sadly not that unusual for servers to fail to specify these fields correctly. The next line of defense is to initially assume the encoding is ISO-8859-1, use that to interpret the HTML that follows, and look for a meta tag that has the real encoding in it. E.g.
<meta http-equiv="Content-Type" content="text/html; charset=Shift_JIS">
Naturally, servers that fail to properly specify their encodings are EVIL. But we may have to deal with them nonetheless...
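The two-step lookup described above (charset from the Content-Type value first, then a meta tag read as ISO-8859-1) could be sketched roughly like this. It is only an illustration, not a full HTML sniffing algorithm: the class name CharsetSniffer, its method names, and the 1024-byte scan limit are all made up for this example, and it assumes Java 7 or later for StandardCharsets.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; none of these names come from the thread.
final class CharsetSniffer {
    // Matches charset=UTF-8, charset="UTF-8", etc. in a Content-Type value.
    private static final Pattern HEADER_CHARSET = Pattern.compile(
            "charset\\s*=\\s*[\"']?([^\\s;\"'>]+)", Pattern.CASE_INSENSITIVE);
    // Same parameter, but only when it appears inside a <meta ...> tag.
    private static final Pattern META_CHARSET = Pattern.compile(
            "<meta[^>]*?charset\\s*=\\s*[\"']?([^\\s;\"'>]+)", Pattern.CASE_INSENSITIVE);
    // Assumption: the meta tag, if present, sits in the first 1024 bytes.
    private static final int SCAN_LIMIT = 1024;

    /** Returns the charset label from a Content-Type value, or null if absent. */
    static String labelFromHeader(String contentType) {
        if (contentType == null) return null;
        Matcher m = HEADER_CHARSET.matcher(contentType);
        return m.find() ? m.group(1) : null;
    }

    /** Returns the charset label from the first meta tag that carries one, or null. */
    static String labelFromMeta(String htmlHead) {
        Matcher m = META_CHARSET.matcher(htmlHead);
        return m.find() ? m.group(1) : null;
    }

    /** Header first, then meta tag, then ISO-8859-1 as the last resort. */
    static Charset detect(String contentType, byte[] body) {
        String label = labelFromHeader(contentType);
        if (label == null) {
            // ISO-8859-1 maps every byte to a char, so this decode cannot fail.
            int n = Math.min(body.length, SCAN_LIMIT);
            label = labelFromMeta(new String(body, 0, n, StandardCharsets.ISO_8859_1));
        }
        if (label == null) return StandardCharsets.ISO_8859_1;
        try {
            return Charset.forName(label);
        } catch (IllegalArgumentException e) {
            // Illegal or unsupported charset name: fall back to the default.
            return StandardCharsets.ISO_8859_1;
        }
    }
}
```

With a URLConnection, the first argument would be whatever getContentType() returns, and the resulting Charset would go to the InputStreamReader that wraps the response stream.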
 
Nagendra Prasad
Ranch Hand
Posts: 219
Jim,
Many thanks for your response. I am going to try this at work tomorrow.
For completeness, the application server is WebSphere.
The approach you have outlined could be useful. Let me try it and get back to you.
 
Nagendra Prasad
Ranch Hand
Posts: 219
Jim...
I realised I have some sample data with which I could try this from where I am now...
This is what the content type comes out as:
"text/html;charset=Cp1252"
So there is a reason why it is failing...
But I think I need to go into a bit more detail to describe the problem properly:
The data that I am trying to retrieve is an HTML document that is stored in the database. The HTML was stored using the following setter method:

The database (which is DB2) has a codeset of UTF-8. I know that this
gets reflected as Cp1252. So the question is: is there a confusion or conflict in the way the data is stored and retrieved,
or
is this still an issue which the application server should be capable of handling?
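One way to check whether the stored bytes really are UTF-8 is to decode them strictly and see whether the decoder objects. The sketch below assumes a JDK with java.nio.charset (1.4 or later, plus Java 7 for StandardCharsets), which is newer than what WebSphere 4.0.6 shipped with; the class StrictDecodeDemo and its method names are made up for illustration, and the sample bytes are invented rather than taken from the database in question.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; names are illustrative only.
final class StrictDecodeDemo {
    /** Decodes bytes and throws on bad input instead of silently substituting. */
    static String decodeStrict(byte[] bytes, Charset charset) throws CharacterCodingException {
        return charset.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(bytes))
                .toString();
    }

    /** True if the bytes form a valid sequence in the given charset. */
    static boolean isValid(byte[] bytes, Charset charset) {
        try {
            decodeStrict(bytes, charset);
            return true;
        } catch (CharacterCodingException e) {
            // java.nio.charset.MalformedInputException is a subclass of this.
            return false;
        }
    }

    public static void main(String[] args) {
        // "cafés" as single-byte text: 0xE9 is 'é' in both Cp1252 and ISO-8859-1.
        byte[] singleByte = { 'c', 'a', 'f', (byte) 0xE9, 's' };
        // The same text encoded as UTF-8, where 'é' takes two bytes.
        byte[] utf8 = "caf\u00e9s".getBytes(StandardCharsets.UTF_8);

        System.out.println(isValid(singleByte, StandardCharsets.UTF_8));      // false
        System.out.println(isValid(utf8, StandardCharsets.UTF_8));            // true
        System.out.println(isValid(singleByte, StandardCharsets.ISO_8859_1)); // true
    }
}
```

If pages that contain only ASCII work and pages with accented characters fail, that pattern would be consistent with Cp1252 bytes being read as UTF-8, since the two encodings agree on the ASCII range.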
 