XML Validation

 
Suresh Kanagalingam
Ranch Hand
Posts: 82
Hello,

I am trying to validate an XML file using Apache xerces-2_7_1. The encoding declared in the XML file is UTF-8. When the file contains French characters, I get an "Invalid byte 2 of 2-byte UTF-8 sequence" error message. If I change the encoding to "ISO-8859-1", validation works fine, but the customer wants to use UTF-8.

When I tested the same file with XMLSpy, it validated fine with UTF-8 encoding.

Can anyone tell me what I can do or what the cause is?

Here is the snippet of the code:
=====================================
<?xml version="1.0" encoding="UTF-8"?>
<Submission xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="layout.xsd">
=====================================

Thanks
Suresh
 
Paul Clapham
Marshal
Posts: 28177
Putting <?xml version="1.0" encoding="UTF-8"?> at the start of your file says it's encoded in UTF-8, but that doesn't actually cause it to be encoded in UTF-8. The process that creates the file has to write the file in that encoding. If it produces some other encoding, it should specify that encoding in the prolog. That isn't happening in your case.
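The mismatch between the declared and the actual encoding can be checked without an XML parser. A minimal sketch, assuming the file was really written as ISO-8859-1; the class name `DeclaredVsActual`, the helper `isValidUtf8`, and the sample content are illustrative and not from the original post:

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class DeclaredVsActual {
    // Returns true only if the bytes are a well-formed UTF-8 sequence.
    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // The prolog claims UTF-8 in both cases; only the bytes differ.
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
                + "<Submission>\u00C9</Submission>";
        byte[] latin1 = xml.getBytes(StandardCharsets.ISO_8859_1);
        byte[] utf8 = xml.getBytes(StandardCharsets.UTF_8);

        System.out.println(isValidUtf8(latin1)); // false: 0xC9 followed by '<'
        System.out.println(isValidUtf8(utf8));   // true
    }
}
```

In the ISO-8859-1 bytes, 0xC9 looks like the first byte of a 2-byte UTF-8 sequence, but the byte after it is not a continuation byte, which matches the kind of error the parser reports.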
 
Suresh Kanagalingam
Ranch Hand
Posts: 82
Hi Paul,

Thanks for your quick reply. The character it is complaining about is "É" (I checked the ASCII value for this and it is 201). Also, XMLSpy validates this character correctly.

Any thoughts?

Thanks
Suresh
 
Paul Clapham
Marshal
Posts: 28177
Any thoughts beyond the thoughts I already posted? No. What about you? Have you reviewed the process that produces that file?
 
Suresh Kanagalingam
Ranch Hand
Posts: 82
Hi Paul,

I checked the program to make sure it is writing standard character set to the file. I even used TextPad to type French characters using TextPad "ANSI Character" listing.

Can you please confirm that for letter "É" to be validated with UTF-8, it has to have hex value of 201?

Thanks
Suresh
 
Reid M. Pinchback
Ranch Hand
Posts: 775
Yes and no. The Unicode code point is decimal 201 according to:

this table

but the UTF-8 version is two bytes.

This is something people continually get wrong
with XML. How the file is written and read
matters. The first line of the XML file
containing the declaration/version/charset is
strictly ASCII. I don't remember the exact
spec wording, but basically you are limited
to the 7-bit hunk of ASCII. All bytes after
that first line are strictly in the desired
character set. That means if you are dealing
with richer character sets you have to create
a file in the way required by the spec, or
things will break.

I suspect that right now what you may have is
a file with a single byte for either the
ASCII (131) or Unicode (201) encoding of
acute capital E.
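The byte counts discussed above can be checked directly. A small sketch, assuming only the standard `java.nio.charset` classes; the class name `EAcuteBytes` and the `hex` helper are illustrative:

```java
import java.nio.charset.StandardCharsets;

class EAcuteBytes {
    // Formats bytes as space-separated two-digit hex values.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String eAcute = "\u00C9"; // capital E with acute accent

        // Unicode code point, in decimal.
        System.out.println((int) eAcute.charAt(0)); // 201

        // One byte in ISO-8859-1, two bytes in UTF-8.
        System.out.println(hex(eAcute.getBytes(StandardCharsets.ISO_8859_1))); // C9
        System.out.println(hex(eAcute.getBytes(StandardCharsets.UTF_8)));      // C3 89
    }
}
```

A file that holds a lone byte 0xC9 (decimal 201) for this character is therefore not UTF-8, whatever its prolog says.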
 
Paul Clapham
Marshal
Posts: 28177

Originally posted by Suresh Kanagalingam:
Hi Paul,

I checked the program to make sure it is writing standard character set to the file. I even used TextPad to type French characters using TextPad "ANSI Character" listing.

Can you please confirm that for letter "É" to be validated with UTF-8, it has to have hex value of 201?

Thanks
Suresh



To reiterate what Reid said, if you're seeing a single byte with the value 201 (hex C9) for that character in your file, then the file isn't encoded in UTF-8. And if you used the "standard character set" to write to the file, that almost certainly wouldn't be UTF-8 anyway.

The easiest way to get your XML encoding right in Java is to use the standard XML software (whatever's built in to your JRE, or Xerces or Xalan or Saxon or some other open-source product) and to provide an output stream (not a Writer) for it to write to. The software will take care of the encoding.
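A minimal sketch of that approach, assuming the JAXP classes bundled with the JRE. The class name `SerializeToStream`, the `serialize` helper, and the sample element content are illustrative, not from the thread:

```java
import java.io.ByteArrayOutputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

class SerializeToStream {
    // Builds a one-element document and serializes it as UTF-8 bytes.
    static byte[] serialize(String text) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .newDocument();
        Element root = doc.createElement("Submission");
        root.setTextContent(text);
        doc.appendChild(root);

        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");

        // An OutputStream (not a Writer) lets the serializer choose the bytes,
        // so the prolog and the content agree on the encoding.
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        transformer.transform(new DOMSource(doc), new StreamResult(out));
        return out.toByteArray();
    }

    public static void main(String[] args) throws Exception {
        byte[] bytes = serialize("\u00C9");
        System.out.println(bytes.length + " bytes written");
    }
}
```

In real code the `ByteArrayOutputStream` would typically be a `FileOutputStream` for the target file.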

Or if you're writing XML to a file with your own ad-hoc code, then encode it in UTF-8 like this:
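A minimal sketch of such ad-hoc code, assuming an `OutputStreamWriter` constructed with an explicit UTF-8 charset. The class name `WriteUtf8`, the `write` helper, and the sample content are hypothetical and are not the original poster's code:

```java
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class WriteUtf8 {
    // Writes a tiny XML document, encoding every character as UTF-8.
    // The content is assumed to be already escaped for XML.
    static void write(Path file, String rootContent) throws IOException {
        try (Writer writer = new OutputStreamWriter(
                Files.newOutputStream(file), StandardCharsets.UTF_8)) {
            writer.write("<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n");
            writer.write("<Submission>" + rootContent + "</Submission>\n");
        }
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("submission", ".xml");
        write(file, "\u00C9");
        System.out.println(Files.readAllBytes(file).length + " bytes written to " + file);
    }
}
```

The important part is naming the charset explicitly. A `FileWriter` created without one uses the platform default encoding, which may not be UTF-8.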
 
Reid M. Pinchback
Ranch Hand
Posts: 775
And although not an issue for UTF-8, for any character set that doesn't
include 7-bit ASCII as a single-byte subset, you have to deal with
both encodings, not just a single encoding as shown above. First you
output the first line in the required encoding, then everything else
in the other encoding. Not something I've had to do, but suspect it
comes up with Asian character sets, maybe UTF-16?

Like Paul said, doing something in a tool that understands this,
like serializing DOM, is generally just much safer.

[ January 12, 2006: Message edited by: Reid M. Pinchback ]
 