
Is this Unicode? Are there any tricks?

 
Christopher Whu
Ranch Hand
Posts: 80
I am reading XML from the web, basically going through the document line by line with a BufferedReader.

One of my records contains:
name="Róók".

This name is also used in the URL, but there it is percent-encoded:
url="r=Misha&n=R%C3%B3%C3%B3k"

The url to the xml doc is:
http://www.wowarmory.com/character-sheet.xml?r=Misha&n=Róók

I had to set a user-agent on my URLConnection to get to the XML; the site shows browsers something different.

I want to write this data to an XML document. How should I store it in the XML? Should I store it as R%C3%B3%C3%B3k?

Thanks in advance.
 
Campbell Ritchie
Marshal
Posts: 56578
Welcome to JavaRanch
Don't know much XML, but I do know where to find the Unicode charts. I would have expected to find B3 and C3 in the U+0080 chart, which appears if you click Latin-1; B3 is a superscript three (a cube sign), not a square root, and C3 is an upper-case A with a tilde. So whatever you have got, it isn't Unicode.
Sorry I can't help any more.
 
Christopher Whu
Ranch Hand
Posts: 80
This is the link I was talking about:

http://www.wowarmory.com/character-sheet.xml?r=Misha&n=Róók
 
Campbell Ritchie
Marshal
Posts: 56578
Don't know. Sorry. I looked up c3b3 and b3c3 in Unicode and that doesn't help either.
 
Alan Moore
Ranch Hand
Posts: 262
The URL appears to be percent-encoded (URL-encoded), with UTF-8 as the underlying character encoding. "%C3%B3" is the two-byte UTF-8 encoding of the character "ó". To get "R√≥√≥k" from that, you would have to have decoded the bytes with the Macintosh encoding (MacRoman) instead of UTF-8. See the comparison table at:

http://en.wikipedia.org/wiki/Western_Latin_character_sets_(computing)
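To make the decoding above concrete, here is a minimal sketch using only the standard library. The class and method names (PercentDecodeDemo, decodeUtf8, misdecode) are made up for illustration. It decodes the query value as UTF-8, then deliberately re-reads the same bytes in the wrong charset to show where the odd characters come from. The MacRoman step assumes your JDK ships the "x-MacRoman" charset, so it is guarded.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class PercentDecodeDemo {

    // Decode a percent-encoded value, treating the escaped bytes as UTF-8.
    static String decodeUtf8(String encoded) {
        try {
            return URLDecoder.decode(encoded, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is a required charset on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    // Take the UTF-8 bytes of a string and read them back in another
    // charset, reproducing the kind of mix-up seen in this thread.
    static String misdecode(String text, Charset wrong) {
        return new String(text.getBytes(StandardCharsets.UTF_8), wrong);
    }

    public static void main(String[] args) {
        String name = decodeUtf8("R%C3%B3%C3%B3k");
        // \u00f3 is the letter o with an acute accent.
        System.out.println(name.equals("R\u00f3\u00f3k")); // true

        // Read as Latin-1, each byte becomes one character: C3 and B3.
        String latin1 = misdecode(name, StandardCharsets.ISO_8859_1);
        System.out.println(latin1.equals("R\u00c3\u00b3\u00c3\u00b3k")); // true

        // MacRoman is an optional charset, so check before using it.
        if (Charset.isSupported("x-MacRoman")) {
            System.out.println(misdecode(name, Charset.forName("x-MacRoman")));
        }
    }
}
```

The Latin-1 result is the pair of characters found in the Latin-1 chart (C3 and B3); the MacRoman result, where available, should be the square-root form quoted above.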
 
Carey Evans
Ranch Hand
Posts: 225
Alan's worked out what your characters are. In answer to your original question, I would store the URL with the percent-encoding in place, since that's what the browser has to send.
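As a rough sketch of what storing it could look like, the following builds a tiny document with the standard DOM and Transformer APIs and serializes it. The element name "character" and the class name CharacterXmlWriter are invented for the example; only the name and url values come from this thread. The name is stored as real characters and the URL keeps its percent-encoding. Note that the "&" in the URL must appear as "&amp;" inside an XML attribute; the serializer takes care of that.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

class CharacterXmlWriter {

    // Build <character name="..." url="..."/> and return it as XML text.
    // "character" is a hypothetical element name chosen for this sketch.
    static String toXml(String name, String urlQuery) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().newDocument();
            Element character = doc.createElement("character");
            character.setAttribute("name", name);    // plain Unicode text
            character.setAttribute("url", urlQuery); // percent-encoding kept
            doc.appendChild(character);

            Transformer t = TransformerFactory.newInstance().newTransformer();
            t.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
            StringWriter out = new StringWriter();
            // The serializer escapes '&' in attribute values as "&amp;".
            t.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("could not serialize XML", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(toXml("R\u00f3\u00f3k", "r=Misha&n=R%C3%B3%C3%B3k"));
    }
}
```

When writing to a file rather than a String, passing an OutputStream to StreamResult lets the serializer produce the bytes itself, so the declared encoding and the actual bytes agree.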

To read the Unicode characters in the XML correctly, you should specify UTF-8 as the encoding when you create your InputStreamReader; or better, you should use a real XML parser, with a StreamSource for instance, and give it the raw byte stream so that the parser detects the encoding of the input itself.
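A small sketch of both options, under a few assumptions: it uses DocumentBuilder rather than a StreamSource to keep the example short, it reads from an in-memory byte array instead of the real URL, and the "character" element and the class name CharacterXmlReader are invented for illustration.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

class CharacterXmlReader {

    // Option 1: keep reading line by line, but name the charset explicitly
    // instead of relying on the platform default.
    static String firstLine(InputStream in) throws IOException {
        BufferedReader reader = new BufferedReader(
                new InputStreamReader(in, StandardCharsets.UTF_8));
        return reader.readLine();
    }

    // Option 2: hand the raw bytes to an XML parser, which works out the
    // encoding from the XML declaration (or byte-order mark) on its own.
    static String readName(InputStream in) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().parse(in);
            return doc.getDocumentElement().getAttribute("name");
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
    }

    public static void main(String[] args) throws IOException {
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
                + "<character name=\"R\u00f3\u00f3k\"/>";
        byte[] bytes = xml.getBytes(StandardCharsets.UTF_8);

        String line = firstLine(new ByteArrayInputStream(bytes));
        System.out.println(line.contains("R\u00f3\u00f3k")); // true

        String name = readName(new ByteArrayInputStream(bytes));
        System.out.println(name.equals("R\u00f3\u00f3k")); // true
    }
}
```

With a live connection, the InputStream from URLConnection.getInputStream() would take the place of the ByteArrayInputStream.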
 