Why is UTF-16 needed for XML?

 
Ranch Hand
Posts: 3852
Almost all the XML files I have seen start with this line:

<?xml version='1.0' encoding='UTF-16'?>

And as far as I know, UTF-16 is the most memory-consuming encoding scheme because it supports the largest number of languages.

But we usually have only English characters in XML files (localized text lives in properties files), so why do we store XML files in UTF-16 format?

Thanks.
 
Marshal
Posts: 28193
You don't have to use UTF-16 to store XML. You can use any encoding at all, provided the parser can deal with it. It is extremely common to use UTF-8. So your question is asked in the wrong place. If you want to know why a certain XML file was encoded in UTF-16, you should ask the person who chose that encoding.
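
To illustrate the point about the parser dealing with the encoding, here is a minimal sketch using the JDK's built-in DOM parser. The class name EncodingDeclarationDemo and the helper parseGreeting are made up for this example. It writes the same tiny document as UTF-8 bytes and as UTF-16 bytes and hands the raw bytes to the parser, which works out the encoding from the byte stream and the declaration.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

class EncodingDeclarationDemo {

    // Hypothetical helper: encodes a small document in the given charset,
    // declares that charset in the prolog, and parses the raw bytes.
    static String parseGreeting(Charset charset) {
        String xml = "<?xml version='1.0' encoding='" + charset.name() + "'?>"
                + "<greeting>hello</greeting>";
        byte[] bytes = xml.getBytes(charset);
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(bytes));
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            // Wrapped only to keep the sketch short.
            throw new IllegalStateException("parse failed for " + charset, e);
        }
    }

    public static void main(String[] args) {
        // Same content, two different byte representations.
        System.out.println(parseGreeting(StandardCharsets.UTF_8));
        System.out.println(parseGreeting(StandardCharsets.UTF_16));
    }
}
```

Both calls should give back the same text, so the choice of encoding is a storage decision rather than something the XML itself requires.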
 
ankur rathi
Ranch Hand
Posts: 3852

Originally posted by Paul Clapham:
You don't have to use UTF-16 to store XML. You can use any encoding at all, provided the parser can deal with it. It is extremely common to use UTF-8. So your question is asked in the wrong place. If you want to know why a certain XML file was encoded in UTF-16, you should ask the person who chose that encoding.



Okay, but why even UTF-8?
Why not use ASCII if the XML file contains only English characters?

Thanks.
 
Ranch Hand
Posts: 1241

Originally posted by rathi ji:
Why not use ASCII if the XML file contains only English characters?

The trouble with limiting yourself to a small character set is that it leaves you no room if you need to add extra characters in the future. OK, so you may be able to just change the encoding declaration at the top of the XML file, but that may have knock-on effects on other parts of the system. It may be better to start off with something reasonable like UTF-8 or UTF-16, and make the system work with that.

I guess it depends on what you're storing in your XML file, and how likely that kind of data is to change in the future.
 
Paul Clapham
Marshal
Posts: 28193

Originally posted by rathi ji:
Okay, but why even UTF-8?
Why not use ASCII if the XML file contains only English characters?

ASCII is a subset of UTF-8. So if your file really contains only ASCII characters (unaccented Latin letters, digits and basic punctuation), its bytes are going to be identical whether it's encoded in ASCII or UTF-8. (Except for the prolog where you declare the encoding, of course.)
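
A quick way to check the subset claim is to encode the same string both ways and compare the bytes. This is only a sketch; the class name AsciiSubsetDemo and the helper sameBytes are invented for the example.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class AsciiSubsetDemo {

    // Hypothetical helper: true when the ASCII and UTF-8 encodings
    // of the text are byte-for-byte identical.
    static boolean sameBytes(String text) {
        byte[] ascii = text.getBytes(StandardCharsets.US_ASCII);
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return Arrays.equals(ascii, utf8);
    }

    public static void main(String[] args) {
        // Pure ASCII content: both encodings produce the same bytes.
        System.out.println(sameBytes("<name>Smith</name>"));      // true
        // \u00e9 is an accented e: ASCII cannot represent it, so the bytes differ.
        System.out.println(sameBytes("<name>Jos\u00e9</name>"));  // false
    }
}
```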

And as soon as somebody uses an accented letter in their data, the code that writes the ASCII version has to know to change it to a numeric character reference (for example &#233; for é) in the output. The standard Java XML classes do know this, of course, but many people don't use the built-in classes and prefer to write their own code that may not know it.
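
To see the built-in classes doing that escaping, here is a small sketch that builds a one-element document and serializes it with the JDK's default Transformer, asking for US-ASCII output. The class name AsciiOutputDemo and the helper writeAsAscii are assumptions made for this example, and the exact text of the output may vary between serializer implementations.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

class AsciiOutputDemo {

    // Hypothetical helper: wraps the text in a <name> element and
    // serializes the document with US-ASCII as the requested encoding.
    static byte[] writeAsAscii(String text) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .newDocument();
            Element root = doc.createElement("name");
            root.setTextContent(text);
            doc.appendChild(root);

            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.ENCODING, "US-ASCII");

            ByteArrayOutputStream out = new ByteArrayOutputStream();
            transformer.transform(new DOMSource(doc), new StreamResult(out));
            return out.toByteArray();
        } catch (Exception e) {
            // Wrapped only to keep the sketch short.
            throw new IllegalStateException("serialization failed", e);
        }
    }

    public static void main(String[] args) {
        // \u00e9 is an accented e, which has no ASCII representation.
        byte[] bytes = writeAsAscii("Jos\u00e9");
        // The serializer should replace it with a character reference.
        System.out.println(new String(bytes, StandardCharsets.US_ASCII));
    }
}
```

Hand-rolled code that just calls getBytes with an ASCII charset would silently turn the accented letter into a question mark instead.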

Basically UTF-8 can represent any Unicode character at all, including ASCII characters, and it doesn't cost anything extra to use it for ASCII characters. So it just makes sense to use UTF-8. (Or UTF-16 if your data contains a large percentage of CJK characters, most of which take three bytes each in UTF-8 but only two in UTF-16.)
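
The size trade-off is easy to measure. The sketch below counts encoded bytes for an English string and a Japanese string; the class name EncodingSizeDemo and its helpers are made up for the example, and UTF-16BE is used so that no byte order mark is added to the count.

```java
import java.nio.charset.StandardCharsets;

class EncodingSizeDemo {

    // Number of bytes the text occupies in UTF-8.
    static int utf8Length(String text) {
        return text.getBytes(StandardCharsets.UTF_8).length;
    }

    // Number of bytes the text occupies in UTF-16 (big-endian, no BOM).
    static int utf16Length(String text) {
        return text.getBytes(StandardCharsets.UTF_16BE).length;
    }

    public static void main(String[] args) {
        String english = "hello";
        String japanese = "\u65e5\u672c\u8a9e"; // three CJK characters

        // ASCII text: 1 byte per character in UTF-8, 2 in UTF-16.
        System.out.println(utf8Length(english) + " vs " + utf16Length(english));   // 5 vs 10
        // CJK text: 3 bytes per character in UTF-8, 2 in UTF-16.
        System.out.println(utf8Length(japanese) + " vs " + utf16Length(japanese)); // 9 vs 6
    }
}
```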
 