
Weird Characters in UTF-8 web pages

 
Andy Lileston
Greenhorn
Posts: 10
To make a long story short, I've been generating HTML web pages from various sources on the net using some mashup libraries. The data comes mostly from XML feeds and the like, and gets returned to me as a regular String. I've been running into issues that seem to come from characters the browser doesn't recognize, which get rendered either as question marks or as black diamonds with a question mark inside them. Sound familiar?

My question is, how do I filter out non-conforming characters before they reach my web page? The pages are declared as UTF-8.
 
Paul Clapham
Sheriff
Posts: 22490
There aren't any characters that fail to conform to UTF-8. It can encode any character in the entire Unicode inventory. No, the problem you're having is most likely that you aren't reading your input documents with the correct encoding.

Or perhaps the problem is that when you generate the HTML pages, you aren't writing them with the UTF-8 encoding. That would be a lot simpler to solve if it were the problem, but failing to use the correct encoding for your inputs is far more likely.
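
To see why a wrong input encoding produces exactly these symptoms, here is a minimal sketch. The class and method names are hypothetical, chosen only for illustration. It decodes the same text with the right and the wrong charset: UTF-8 bytes read as ISO-8859-1 turn one accented letter into two junk characters, while ISO-8859-1 bytes read as UTF-8 produce U+FFFD, the replacement character that browsers typically draw as a black diamond with a question mark.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of any library.
final class EncodingMismatchDemo {

    /** Decodes raw bytes with the given charset; malformed input becomes U+FFFD. */
    static String decodeAs(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        String original = "caf\u00e9 au lait"; // contains e-acute, U+00E9

        byte[] utf8 = original.getBytes(StandardCharsets.UTF_8);
        byte[] latin1 = original.getBytes(StandardCharsets.ISO_8859_1);

        // Correct pairing: the text survives the round trip.
        System.out.println(decodeAs(utf8, StandardCharsets.UTF_8));

        // UTF-8 bytes read as ISO-8859-1: the two bytes of e-acute
        // become two separate characters (U+00C3 and U+00A9).
        System.out.println(decodeAs(utf8, StandardCharsets.ISO_8859_1));

        // ISO-8859-1 bytes read as UTF-8: the lone 0xE9 byte is not valid
        // UTF-8, so the decoder substitutes the replacement character.
        String damaged = decodeAs(latin1, StandardCharsets.UTF_8);
        System.out.println(damaged.indexOf('\uFFFD') >= 0
                ? "contains U+FFFD replacement character"
                : "decoded cleanly");
    }
}
```

Once the replacement character is in the string, the original byte is gone, so the fix has to happen where the bytes are first decoded, not when the page is written.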
 
Andy Lileston
Greenhorn
Posts: 10
Well, it seems like there should be some sort of methodology for converting XXXX to UTF-8, no matter what the input is.
 
Carey Evans
Ranch Hand
Posts: 225
If you wrap your FileOutputStream with an OutputStreamWriter with the correct encoding, and your input string is correct, you cannot write invalid UTF-8 to the file.

You should check that your input string actually contains the characters you expect.

Finally, make sure the browser interprets the HTML as UTF-8, with an HTTP header or HTML meta tag; see http://www.w3.org/International/O-charset.
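
The code from this post did not survive in the thread, so what follows is only a sketch of the two steps it describes, not the original example. The class name, method names, and the file name `page.html` are hypothetical. `writeUtf8` wraps any OutputStream (a FileOutputStream in `main`) in an OutputStreamWriter with an explicit UTF-8 charset, and the two helper methods are one possible way to inspect the input string before writing it.

```java
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of any library.
final class Utf8PageWriter {

    /** Writes the text as UTF-8 and closes the stream. */
    static void writeUtf8(OutputStream out, String html) throws IOException {
        // Naming the charset explicitly avoids the platform default encoding.
        try (Writer writer = new OutputStreamWriter(out, StandardCharsets.UTF_8)) {
            writer.write(html);
        }
    }

    /** True if the string already contains U+FFFD, i.e. it was damaged while being read. */
    static boolean looksDamaged(String text) {
        return text.indexOf('\uFFFD') >= 0;
    }

    /** Lists the non-ASCII code points in the string, e.g. "U+00E9", for eyeballing. */
    static String dumpNonAscii(String text) {
        StringBuilder sb = new StringBuilder();
        text.codePoints()
            .filter(cp -> cp > 0x7F)
            .forEach(cp -> sb.append(String.format("U+%04X ", cp)));
        return sb.toString().trim();
    }

    public static void main(String[] args) throws IOException {
        String html = "<html><head><meta charset=\"utf-8\"></head>"
                + "<body>caf\u00e9</body></html>";

        System.out.println("Damaged input? " + looksDamaged(html));
        System.out.println("Non-ASCII code points: " + dumpNonAscii(html));

        // "page.html" is a placeholder output path.
        writeUtf8(new FileOutputStream("page.html"), html);
    }
}
```

If `dumpNonAscii` shows code points such as U+00C3 followed by U+00A9 where a single accented letter was expected, the string was decoded with the wrong charset upstream, and no output encoding will repair it.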
 
Paul Clapham
Sheriff
Posts: 22490
Originally posted by Andy Lileston:
Well, it seems like there should be some sort of methodology for converting XXXX to UTF-8, no matter what the input is.
If you're reading from a URL, then you have to use an InputStreamReader with the correct encoding, analogous to the OutputStreamWriter that Carey described there. There are ways of determining the correct encoding by looking at the HTTP headers, assuming the site that's serving the files does things correctly, which is definitely not a given. I don't see any code to comment on, so I can't really say more than that.
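
As a rough illustration of the header-based approach, the sketch below reads the `charset` parameter from the HTTP Content-Type header and passes it to an InputStreamReader. The class name, method names, and the fallback parameter are assumptions made for this example. It trusts the server's header, which, as noted, is not always correct.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.net.URL;
import java.net.URLConnection;
import java.nio.charset.Charset;
import java.util.Locale;

// Hypothetical helper class, not part of any library.
final class FeedReader {

    /** Extracts the charset from a Content-Type value such as "text/xml; charset=ISO-8859-1". */
    static Charset charsetFromContentType(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback; // server sent no Content-Type header
        }
        for (String part : contentType.split(";")) {
            String p = part.trim();
            if (p.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String name = p.substring("charset=".length()).trim().replace("\"", "");
                try {
                    return Charset.forName(name);
                } catch (IllegalArgumentException e) {
                    return fallback; // malformed or unsupported charset name
                }
            }
        }
        return fallback; // header present but without a charset parameter
    }

    /** Downloads the resource and decodes it with the charset the server declared. */
    static String fetch(URL url, Charset fallback) throws IOException {
        URLConnection conn = url.openConnection();
        Charset charset = charsetFromContentType(conn.getContentType(), fallback);
        StringBuilder sb = new StringBuilder();
        try (InputStream in = conn.getInputStream();
             Reader reader = new InputStreamReader(in, charset)) {
            char[] buf = new char[4096];
            int n;
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        }
        return sb.toString();
    }
}
```

For XML feeds specifically, handing the raw InputStream to an XML parser instead of decoding it yourself lets the parser honor the encoding named in the XML declaration.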

Don't expect to find anything simple that automagically determines the encoding that was used to produce a document if you only have the document, though. Browsers try to do that with some success, but their algorithms are quite complex and still don't always get it right. Be prepared to do some manual overrides when you get bad data.
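
One way to make a manual override practical is to decode strictly, so bad data fails loudly instead of silently turning into replacement characters. The sketch below is an assumption about how that could look, with hypothetical names. It tries UTF-8 with error reporting switched on and, if the bytes are not valid UTF-8, falls back to a charset the caller has chosen by hand for that source. It does not detect encodings; it only checks one guess.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of any library.
final class StrictDecoder {

    /**
     * Decodes the bytes as UTF-8 if they are valid UTF-8; otherwise decodes
     * them with the manually supplied override charset.
     */
    static String decodeUtf8OrOverride(byte[] bytes, Charset override) {
        try {
            return StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)      // throw instead of inserting U+FFFD
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes))
                    .toString();
        } catch (CharacterCodingException e) {
            // Not valid UTF-8: use the encoding chosen by hand for this source.
            return new String(bytes, override);
        }
    }

    public static void main(String[] args) {
        byte[] latin1 = "caf\u00e9 au lait".getBytes(StandardCharsets.ISO_8859_1);
        String text = decodeUtf8OrOverride(latin1, StandardCharsets.ISO_8859_1);
        System.out.println(text.equals("caf\u00e9 au lait")); // prints "true"
    }
}
```

Text in a single-byte encoding that happens to be pure ASCII also passes the UTF-8 check, which is harmless because ASCII decodes identically in both cases.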
 