How To Retain Entity Reference Character While Reading XML

 
Hui Zhou
Greenhorn
Posts: 2
I have an XML file, input.xml, as below:


I use the dom4j library to parse the XML file and append the email address postfix "@sa.com" to the username element's value:


After running the code, I get the result below:


From the result, you can see the summary element value changes from
<summary>&quot;summary&quot;</summary>
to
<summary>"summary"</summary>
But the name attribute value, which also contains an entity reference, doesn't change.

Why does it work like this? If I want to retain the original summary element value <summary>&quot;summary&quot;</summary>, how can I achieve that?
 
Sheriff
Posts: 22849
Your &quot;s weren't being displayed properly, so I fixed that for you. I simply replaced all &s with &amp;, so that leads to &amp;quot;.

I'll move this to our XML forum, where I think you will get better replies.
 
Paul Clapham
Sheriff
Posts: 28395
It's working like that because the quote character and the escaped version of the quote character are exactly equivalent in that context. There are no semantic differences between XML documents which use one or the other. So there is no reason for you to require one or the other.
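To make the equivalence concrete, here is a minimal sketch using the JDK's built-in DOM parser rather than dom4j, so that it runs without extra libraries. The class and method names (`QuoteEquivalenceDemo`, `rootText`) are made up for illustration. It parses both spellings of the summary element and compares the text the parser hands back.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical demo class: not part of the original poster's code.
class QuoteEquivalenceDemo {

    // Parse an XML string and return the text content of its root element.
    static String rootText(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        return doc.getDocumentElement().getTextContent();
    }

    public static void main(String[] args) throws Exception {
        String escaped = rootText("<summary>&quot;summary&quot;</summary>");
        String literal = rootText("<summary>\"summary\"</summary>");
        // Both forms yield the same characters once parsed.
        System.out.println(escaped.equals(literal)); // prints "true"
    }
}
```

Once the document has been parsed, nothing in the in-memory tree records which spelling the input used, which is why the serializer is free to pick either one.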
 
Hui Zhou
Greenhorn
Posts: 2
I do need the escaped version. We are using an issue tracking system, and it has a filter function that reads the XML. If we use the plain quote character, the search function doesn't work, but if we use the escaped version, it does.

That's why I need to retain the entity reference.
 
g tsuji
Ranch Hand
Posts: 734
[0] I agree that, as far as XML parsing is concerned, escaped and unescaped quotes in text nodes are indistinguishable. Attributes are of course a different matter when the quote collides with the attribute delimiter. So I agree with Paul Clapham's post: it is better to fix the filter function than to tamper with semantically equivalent XML by favouring one form over the other.

[0.1] Having said that, haven't dom4j's authors also made a choice here? So far I have not found any setting on the OutputFormat object that controls whether quotes or apostrophes are escaped in text nodes or attributes (the ampersand and the opening angle bracket aside, since those two must always be escaped). So it seems fair to raise the question from a user's point of view.

[1] With that in mind, I looked into the issue, and it is indeed quite difficult to bend the output one way rather than the other. The main problem is that the two easy outcomes are writing the quote unescaped or double-escaping the ampersand (&amp;quot;). Full control seems to require either context-aware handling or string-level operations, which in effect means post-processing the output as a plain text file rather than as an XML document.

[2] Here is one concrete way to achieve the objective: double-escape on write-out, then unescape once. The trick is to double-escape all the leaves first (text nodes, attributes, comments and processing instructions). Done in this general way, a broad range of documents can be output in the desired format. This is the implementation, worked into the existing code.
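As a rough illustration of the double-escape idea, here is a minimal sketch that uses the JDK's built-in DOM parser and identity Transformer instead of dom4j, so it is self-contained. It is not the code from this post: the class and method names (`DoubleEscapeSketch`, `escapeOnce`, `preEscape`, `rewrite`) are invented, and it only handles text nodes and attribute values. It assumes the document has no comments, processing instructions or CDATA sections containing `&amp;`, since the final string replacement would alter those.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical sketch: escape the leaves by hand, let the serializer escape
// the resulting ampersands a second time, then undo one level of escaping.
class DoubleEscapeSketch {

    // Escape a value in the form we want to see in the final output.
    // The ampersand must be replaced first.
    static String escapeOnce(String s) {
        return s.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("\"", "&quot;");
    }

    // Walk the tree and pre-escape every text node and attribute value.
    static void preEscape(Node node) {
        if (node.getNodeType() == Node.TEXT_NODE) {
            node.setNodeValue(escapeOnce(node.getNodeValue()));
        }
        NamedNodeMap attrs = node.getAttributes(); // null unless node is an element
        if (attrs != null) {
            for (int i = 0; i < attrs.getLength(); i++) {
                Node attr = attrs.item(i);
                attr.setNodeValue(escapeOnce(attr.getNodeValue()));
            }
        }
        for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
            preEscape(c);
        }
    }

    static String rewrite(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        preEscape(doc);

        Transformer t = TransformerFactory.newInstance().newTransformer();
        t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        t.transform(new DOMSource(doc), new StreamResult(out));

        // The serializer wrote each of our '&' as "&amp;"; undo that one level.
        return out.toString().replace("&amp;", "&");
    }
}
```

Note that this writes every quote in text and attribute values as `&quot;`, whether or not it was escaped in the input. It forces one spelling on output; it does not remember how each character was written originally.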
 
Paul Clapham
Sheriff
Posts: 28395
Generally parsers just parse the data into a standard internal form, mapping all of the possible escaped forms into one. Because of course the XML Recommendation says they are all the same.

So anybody who wants them treated differently should realize that they have a requirement which is outside of XML. And therefore that using software which follows the XML standards is perhaps a bad choice. If you just want to treat the document as undifferentiated text and modify it in various ways, then just do that. Don't give it to an XML parser.
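For the "just treat it as text" route, a minimal sketch might look like the following. Since input.xml is not reproduced in this thread, the `<username>` element with plain text content and no attributes is an assumption, and the class and method names (`TextOnlyEdit`, `addPostfix`) are made up. Because no XML parser is involved, every other character in the file, including `&quot;`, is left exactly as written.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: edit the file as plain text so existing escapes survive.
class TextOnlyEdit {

    // Assumes <username> holds plain text only (no attributes, no nested markup).
    private static final Pattern USERNAME =
            Pattern.compile("(<username>)([^<]*)(</username>)");

    // Append the postfix to the content of every <username> element.
    static String addPostfix(String xml, String postfix) {
        return USERNAME.matcher(xml)
                .replaceAll("$1$2" + Matcher.quoteReplacement(postfix) + "$3");
    }

    public static void main(String[] args) {
        String in = "<user><username>bob</username>"
                + "<summary>&quot;summary&quot;</summary></user>";
        System.out.println(addPostfix(in, "@sa.com"));
        // prints <user><username>bob@sa.com</username><summary>&quot;summary&quot;</summary></user>
    }
}
```

The trade-off is the usual one for regex over markup: this is only as robust as the assumption about how the username element is written.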

To "g tsuji": It's true that XML serializers give you certain choices about how to output certain things. But the problem here is that none of them work together with the parser to keep track of how each individual character was escaped (or not) in the original document, so there is no way for the serializer to support the requirement "Preserve the escaping from the input document".
 
g tsuji
Ranch Hand
Posts: 734
To Paul Clapham: Greetings! I actually agree with what you have posted all along; I went a long way in my post just to explain how much sense it makes. I think it is not stressed often enough that, from a bird's-eye view, XML is an abstract notion, whereas in concrete use it often requires compromise. Not every piece of software has the luxury of operating purely at the infoset level, without regard to the concrete serialization of the XML.

To cut to the issue at hand: the compromise can be seen in the OutputFormat object's "pretty print" support. It is not quite correct to say that OutputFormat ignores the distinctions that XML treats as equivalent. Take one case to illustrate. dom4j, at least, lets you set the attribute delimiter: you can choose a quote or an apostrophe. The parsing engine, all the same, discards that piece of information as irrelevant to the message the XML carries. Yet dom4j and some other libraries go the extra mile to help users, because from the beginning XML has had to win over its community and make life livable without sacrificing its founding principles. They have never treated text-file processing as completely foreign to their mission of building a good parsing engine. And when something becomes too much of an annoyance, they compromise as well, by dropping it.

The issue at hand is of the same kind. Maybe at some point during dom4j's development that helper function was too much of an annoyance and was given lower priority, or a decision was made not to support it at all. I have no objection. But that is the point: they made a choice. The equivalence of these forms is built into the spec. That is why, when that freedom is judged to be too much, one must make a further choice to restrict it. Canonicalization (c14n) is the clearest example. Everything I have written above is in fact meant to support your reasoning rather than contradict it.