
XPath Encoding Problem

 
Paulo Carvalho
Ranch Hand
Posts: 57
Hello,

I don't have much experience in this area, which is why I'm asking for your help.

I'll simplify my problem to explain it better:

I have an XML file with the following structure:



<?xml version="1.0" encoding="UTF-8"?>
...
<variable>
<value>France</value>
</variable>
<variable>
<value>Grèce</value>
</variable>
...
</xml>



With a Java class, I want to read the text of the "value" elements. Here is my Java method for that:


...
final XPath xpath = XPATHFACTORY.newXPath();
String result = "";
final XPathExpression nodesXpath = xpath.compile(xpathQuery);

// Gets the first element matched by the query
final Element nd =
        (Element) nodesXpath.evaluate(doc, XPathConstants.NODE);

// Reads the element's text, if a match was found
if (nd != null) {
    result = nd.getTextContent();
}
...


The values I get are the following:

Value1: France
Value2: Grèce


As you can see, the second one is garbled. What can I do to get it correctly?
(My XML file is already UTF-8 encoded, so I don't know what the problem is.)

Thanks in advance.
Best regards
 
Paul Clapham
Sheriff
Posts: 21416
When you say you "get" that value, exactly what do you mean by that? Show us how you are looking at it; it's possible you are using an incorrect charset in the process of looking.
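One way to take the console or editor out of the picture is to print the Unicode code points of the string instead of the characters themselves. The following is only a minimal sketch: the class name `CodePointDump` and the method `dump` are made up for illustration, and the literal stands in for the `result` string returned by the XPath code.

```java
public class CodePointDump {

    // Returns the code points of s as "U+XXXX" tokens separated by spaces.
    // The output is pure ASCII, so it looks the same whatever charset the
    // console or editor uses.
    static String dump(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", cp));
            i += Character.charCount(cp);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Stand-in for the value read from the XML document ("Gr\u00e8ce" is Grèce).
        String result = "Gr\u00e8ce";
        System.out.println(dump(result));
    }
}
```

If the dump shows U+00E8 for the third character, the string inside Java is correct and only the display is wrong. If it shows U+00C3 followed by U+00A8, the document was decoded with the wrong charset when it was read.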
 
g tsuji
Ranch Hand
Posts: 669
That looks very symptomatic of the application writing a character stream encoded in UTF-8 (e with grave accent is 0xC3 0xA8) that is then read on a cp1252 console or in a text editor such as Notepad using "ANSI" encoding. If that is the case, the parsing and the output streaming are both working correctly. If the output stream had been using something other than UTF-8, such as cp1252, and you still got that on the console or in Notepad, that would be a problem: it would mean the original XML document was not read properly or was badly encoded. As I suspect the former, I would say it is good news, and you simply need to use a console or text editor that supports UTF-8 to see the characters as they should look.
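The effect described above can be reproduced in a few lines. This is just a sketch under assumptions: the class name `MojibakeDemo` and the method `misread` are hypothetical, and it assumes the JDK provides the `windows-1252` charset (the usual Java name for cp1252). Unicode escapes are used in the literals so that the source file's own encoding does not matter.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class MojibakeDemo {

    // Encodes text as UTF-8, then decodes those bytes with a different
    // charset, which is what a cp1252 console does to UTF-8 output.
    static String misread(String text, Charset wrongCharset) {
        byte[] utf8Bytes = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8Bytes, wrongCharset);
    }

    public static void main(String[] args) {
        String value = "Gr\u00e8ce"; // Grèce
        // Assumption: windows-1252 is available in this JDK.
        Charset cp1252 = Charset.forName("windows-1252");

        String garbled = misread(value, cp1252);

        // è is the two bytes 0xC3 0xA8 in UTF-8; cp1252 shows them as Ã and ¨.
        System.out.println(garbled.equals("Gr\u00c3\u00a8ce")); // prints "true"
    }
}
```

The string held by the program is unchanged throughout; only the interpretation of its UTF-8 bytes differs, which is why switching the viewer to UTF-8 is enough to fix the display.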
 