
Cannot read remote XML correctly, due to encoding issue

 
gang lee
Greenhorn
Posts: 12
Hi guys,

I am trying to read a URL that returns weather information as XML in Chinese: "http://www.google.com/ig/api?weather=dalian&hl=zh-CN"
I am using a Japanese Windows XP.

The main source code involved is:

In fact, whether or not I specify the encoding in new InputStreamReader() makes no difference: I always get garbage content (except for the English part).

Also, the following conversion does not work:


Could anybody give me some help?

thanks a lot!

lee

[edit]Add code tags. CR[/edit]
[ July 19, 2008: Message edited by: Campbell Ritchie ]
 
gang lee
Greenhorn
Posts: 12
Sorry, the code for the conversion is as follows:

I tried this because the source XML seems to be encoded in MS932.
When I do not specify an encoding in new InputStreamReader(), calling InputStreamReader.getEncoding() reports MS932.
[edit]Add code tags. CR[/edit]
[ July 19, 2008: Message edited by: Campbell Ritchie ]
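A note on the observation above: when no charset is passed to InputStreamReader, getEncoding() reports the JVM's default charset (MS932 on a Japanese Windows of that era), which says nothing about how the remote bytes are actually encoded. The following is a minimal sketch, not the poster's code, showing that the same bytes decode correctly or as garbage depending only on the charset given to the reader. The class and method names are hypothetical, and ISO-8859-1 merely stands in for "some wrong charset".

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;

// Hypothetical helper class, for illustration only.
class ReaderCharsetDemo {

    // Reads the whole stream as text, decoding with the charset the caller names
    // instead of relying on the platform default.
    static String readAll(InputStream in, Charset charset) {
        try (Reader reader = new InputStreamReader(in, charset)) {
            StringBuilder text = new StringBuilder();
            char[] buffer = new char[1024];
            int count;
            while ((count = reader.read(buffer)) != -1) {
                text.append(buffer, 0, count);
            }
            return text.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Decoding UTF-8 bytes of a Chinese string with readAll(stream, StandardCharsets.UTF_8) returns the original text, while decoding the same bytes with a mismatched charset yields unreadable characters, which matches the symptom described.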
 
Ilja Preuss
author
Sheriff
Posts: 14112
If this is a valid XML source, the encoding is mentioned in the header. If you'd simply use an XML parser (I like Dom4J), you wouldn't need to worry about the encoding at all, because it would take care of it automatically.
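As a rough sketch of that suggestion, the snippet below hands the raw byte stream to an XML parser so that the parser, not an InputStreamReader, works out the encoding from the byte order mark or the XML declaration. It uses the JDK's built-in DOM parser rather than Dom4J, purely to stay self-contained. The class and method names are hypothetical, and the "city" element and "data" attribute are taken from the XML excerpt quoted in this thread. It assumes the bytes really match what the BOM or declaration says; it does not look at HTTP headers.

```java
import java.io.InputStream;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical helper class, for illustration only.
class XmlBytesDemo {

    // Parses XML from raw bytes; the parser detects the encoding itself
    // (BOM or XML declaration), so no Reader and no charset name is needed here.
    static Document parse(InputStream in) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        return builder.parse(in);
    }

    // Returns the "data" attribute of the first <city> element, or null if there is none.
    static String cityData(Document doc) {
        Element city = (Element) doc.getElementsByTagName("city").item(0);
        return city == null ? null : city.getAttribute("data");
    }
}
```

With a URLConnection, this would be called as XmlBytesDemo.parse(connection.getInputStream()), passing the stream directly rather than wrapping it in a Reader first.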
 
gang lee
Greenhorn
Posts: 12
thanks Preuss,
the XML is not valid, only well-formed.

<?xml version="1.0" ?>
<xml_api_reply version="1">
  <weather module_id="0" tab_id="0">
    <forecast_information>
      <city data="Dalian, Liaoning" />

This XML comes from a Google service.
 
gang lee
Greenhorn
Posts: 12
The only way I can get readable content is with the following code:

with the URL:
"http://www.google.com/ig/api?weather=dalian&hl=ja"

But what I get is in Japanese, and I want the content in Chinese.
I tried many combinations of changes, but none seems to work.

please help.
[ July 19, 2008: Message edited by: gang lee ]
 
Rob Spoor
Sheriff
Posts: 20669
Well, your URL does ask for the Japanese version: hl=ja
If Google supports Chinese, then you should change it to Chinese.
 
gang lee
Greenhorn
Posts: 12
Thanks, Prime.

Google does have a Chinese version of that weather information,
but when I specify "zh-CN" in the URL, I get garbage content (the ASCII part is OK).
Also, the content still seems to be encoded in MS932, which is a Japanese character set.

I suspect the issue is due to my Japanese version of Windows XP.
So I tried setting the Accept-Language and Accept-Encoding headers, among others, on my HTTP request, but I still failed to get the correct content.

Can anybody else help?

The source code is not complex; can anybody give it a try?


Of course, it is better if you have a Japanese XP, or else you may not see the issue.
[ July 19, 2008: Message edited by: gang lee ]
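For readers following along, request headers of the kind mentioned in this post can be set on a URLConnection before it connects, roughly as sketched below. This is not the poster's lost code; the class and method names are hypothetical. Note that Accept-Encoding negotiates content codings such as gzip rather than the character set, so the sketch sets Accept-Charset instead. Whether this particular service honours either header is not known from the thread.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URI;
import java.net.URLConnection;

// Hypothetical helper class, for illustration only.
class RequestHeaderDemo {

    // Opens (but does not yet connect) a connection and sets language and
    // charset preferences. Headers must be set before the connection is used.
    static URLConnection prepare(String url, String language) {
        try {
            URLConnection connection = URI.create(url).toURL().openConnection();
            connection.setRequestProperty("Accept-Language", language);
            // Assumption: asking for UTF-8; the server is free to ignore this.
            connection.setRequestProperty("Accept-Charset", "UTF-8");
            return connection;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```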
 
Paul Clapham
Sheriff
Posts: 21416
I've had this issue with XML documents from Google myself. Here's what I had to do:

1. Get the URLConnection and call its getContentEncoding() method.

2. Use the value you get from that in your InputStreamReader, instead of UTF-8 as in your original post.

As Ilja Preuss said earlier (I think), that's the rule for XML documents sent over HTTP: the encoding given in the HTTP headers overrides the encoding stated or implied by the document's prolog.
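One caveat when trying these steps: URLConnection.getContentEncoding() returns the value of the Content-Encoding header (for example gzip) and is null when that header is absent. The character set is normally carried as the charset parameter of the Content-Type header, which getContentType() exposes. The following is a minimal sketch of extracting it. The class and method names are hypothetical, and the parsing is deliberately simple rather than a full implementation of the header grammar.

```java
import java.nio.charset.Charset;

// Hypothetical helper class, for illustration only.
class ContentTypeCharset {

    // Extracts the charset parameter from a Content-Type value such as
    // "text/xml; charset=UTF-8". Returns the fallback when the header is
    // missing, has no charset parameter, or names an unknown charset.
    static Charset charsetOf(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback;
        }
        String prefix = "charset=";
        for (String part : contentType.split(";")) {
            String parameter = part.trim();
            if (parameter.regionMatches(true, 0, prefix, 0, prefix.length())) {
                String name = parameter.substring(prefix.length()).trim().replace("\"", "");
                try {
                    return Charset.forName(name);
                } catch (IllegalArgumentException e) {
                    // Covers both illegal and unsupported charset names.
                    return fallback;
                }
            }
        }
        return fallback;
    }
}
```

A typical use would be charsetOf(connection.getContentType(), StandardCharsets.UTF_8), with the result passed to new InputStreamReader(connection.getInputStream(), charset). The UTF-8 fallback there is an assumption, not something the thread establishes.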
 
gang lee
Greenhorn
Posts: 12
Thanks, Clapham, but unfortunately
I get null when I call getContentEncoding().
 
gang lee
Greenhorn
Posts: 12
One more thing:
when I fetch and save the output from www.google.com/ig/api?weather=dalian&hl=zh-CN,

I find that the BOM is FF FE, i.e. UTF-16LE.

But the encoding does not seem to be UTF-16,
because the browser (Firefox 2) says it is UTF-8 under View -> Character Encoding.

A discouraging issue for a beginner!

With the same URL in Internet Explorer, the View -> Encoding pop-up menu shows:
GB2312
Unicode(UTF-8)
X Unicode
Other
and the menu is greyed out, so the user cannot choose a different encoding.
[ July 19, 2008: Message edited by: gang lee ]
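The BOM check described in this post can also be done in code instead of by inspecting a saved file. The sketch below maps the leading bytes of a response to a charset. The class and method names are hypothetical, and it covers only the UTF-8 and UTF-16 marks (it does not distinguish UTF-32LE, whose mark also begins FF FE). If the bytes were saved by a program that had already decoded and re-encoded the text, the BOM in the file may reflect that program's output encoding rather than what the server sent.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, for illustration only.
class BomSniffer {

    // Returns the charset implied by a leading byte order mark,
    // or null when the data does not start with a recognised BOM.
    static Charset fromBom(byte[] head) {
        if (head.length >= 3
                && (head[0] & 0xFF) == 0xEF
                && (head[1] & 0xFF) == 0xBB
                && (head[2] & 0xFF) == 0xBF) {
            return StandardCharsets.UTF_8;
        }
        if (head.length >= 2) {
            int first = head[0] & 0xFF;
            int second = head[1] & 0xFF;
            if (first == 0xFF && second == 0xFE) {
                return StandardCharsets.UTF_16LE;
            }
            if (first == 0xFE && second == 0xFF) {
                return StandardCharsets.UTF_16BE;
            }
        }
        return null;
    }
}
```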
 