Invalid Character inside CDATA

 
Donny Wi
Greenhorn
Posts: 13
I'm parsing an XML file that contains some Japanese text (UTF-8 encoded). During parsing, I received an error that says
"An invalid XML character (Unicode: 0xb4) was found in the CDATA section."
Can someone explain how it is possible to have an invalid XML character inside a CDATA section? I believed the only restriction inside a CDATA section is that it cannot contain "]]>".
Thank you
 
Mapraputa Is
Leverager of our synergies
Sheriff
Posts: 10065
Everybody believes so, yet it is a mistake. I think the confusion stems from the many sources of XML wisdom that define a CDATA section as "data that is ignored by the parser". If CDATA is ignored, we can put anything there, including binary data, right?
Nothing in the XML specification suggests this. "CDATA sections may occur anywhere character data may occur; they are used to escape blocks of text containing characters which would otherwise be recognized as markup." And if you look at how CDATA is defined, you'll see:
[18] CDSect ::= CDStart CData CDEnd
[19] CDStart ::= '<![CDATA['
[20] CData ::= (Char* - (Char* ']]>' Char*))
[21] CDEnd ::= ']]>'
Where "Char" has the same range as in any other part of an XML document:
Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF] /* any Unicode character, excluding the surrogate blocks, FFFE, and FFFF. */
This means that a CDATA section differs from parsed character data only in that markup inside it is not recognized as such, i.e. it is not parsed. The character-level restrictions still apply.
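To make the Char production concrete, here is a minimal Java sketch of the same range check. The class and method names (XmlCharCheck, isXmlChar, firstIllegalIndex) are made up for illustration and do not come from any parser API. It only mirrors the production quoted above, so treat it as a way to pre-screen text before putting it into a document, not as a description of what any particular parser does internally.

```java
public class XmlCharCheck {

    /** Returns true if the code point is allowed by the XML 1.0 Char production. */
    public static boolean isXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /**
     * Returns the index (in UTF-16 code units) of the first character that
     * the Char production rejects, or -1 if every character is allowed.
     */
    public static int firstIllegalIndex(String text) {
        for (int i = 0; i < text.length(); ) {
            // codePointAt joins valid surrogate pairs; a lone surrogate comes
            // back as a value in D800-DFFF, which isXmlChar rejects.
            int cp = text.codePointAt(i);
            if (!isXmlChar(cp)) {
                return i;
            }
            i += Character.charCount(cp);
        }
        return -1;
    }

    public static void main(String[] args) {
        String sample = "abc\u0001def"; // U+0001 is a control character outside Char
        System.out.println(firstIllegalIndex(sample)); // prints 3
    }
}
```

Running a check like this over the text that is going into the CDATA section shows where the offending character sits, whether or not the section delimiters are present.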
My understanding is that an XML document can "physically" consist of legal characters only; this layer has the highest priority, and higher-level constructs like CDATA have to obey its rules. One way to circumvent this rule and include illegal characters would be to encode your data in base64, but this will increase the document's size, violate all good design rules, etc.
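For what it's worth, the base64 workaround could look roughly like this in Java, using java.util.Base64 from the standard library (Java 8 and later). The class name Base64Payload and its two helper methods are hypothetical, chosen only for this sketch.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class Base64Payload {

    /** Encodes arbitrary bytes as base64 text that is safe to place in XML. */
    public static String encode(byte[] raw) {
        return Base64.getEncoder().encodeToString(raw);
    }

    /** Reverses encode(): turns the base64 text back into the original bytes. */
    public static byte[] decode(String text) {
        return Base64.getDecoder().decode(text);
    }

    public static void main(String[] args) {
        // Bytes 0x00-0x02 are not legal XML characters if written directly.
        byte[] raw = {0x00, 0x01, 0x02};
        String encoded = encode(raw);
        System.out.println("<payload>" + encoded + "</payload>");

        // Ordinary text survives the round trip as well.
        byte[] textBytes = "abc".getBytes(StandardCharsets.UTF_8);
        System.out.println(new String(decode(encode(textBytes)), StandardCharsets.UTF_8));
    }
}
```

The standard base64 alphabet uses only A-Z, a-z, 0-9, "+", "/" and "=", all of which are legal XML characters, so the encoded text needs no CDATA section at all. The cost is the size increase mentioned above: every 3 bytes of input become 4 characters of output.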
[ May 09, 2002: Message edited by: Mapraputa Is ]
 