How to compare two huge xml files

 
Greenhorn
Posts: 29
Hi friends,

I need to compare two huge XML files. I can't just read them line by line and compare them with plain I/O, so I need to parse them and compare each object. Doing it that way will take a lot of coding effort.

Has anybody faced this problem before and found a solution, perhaps using a particular algorithm? Also, after generating the JAXB POJO classes I see that they have no equals methods. Is there a way to generate those equals methods?

Joy
 
Ranch Hand
Posts: 2187
Write a UNIX Korn shell program to compare the two files using sed, awk, grep, etc.
 
Author and all-around good cowpoke
Posts: 13078
Are you testing for identity of the two files in the Infoset sense, or trying to spit out all content differences?

Note that for things like attributes in an Element, two documents may have a different text order but be considered identical XML. And of course an empty-element tag is considered identical to an open and close tag pair with no content.

Exactly what you have to detect makes a big difference.
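
To illustrate the point about attribute order and empty-element tags, here is a minimal sketch using the DOM Level 3 method Node.isEqualNode. The class name InfosetEquality and the helpers parse and sameXml are made up for this example, and it assumes both documents are small enough to parse into memory, which may not hold for the files in the original question.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

final class InfosetEquality {

    // Parses an XML string into a DOM tree (hypothetical helper for this sketch).
    static Document parse(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            return factory.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    // Compares the two root elements structurally: names, attributes (in any order), children.
    static boolean sameXml(String a, String b) {
        return parse(a).getDocumentElement().isEqualNode(parse(b).getDocumentElement());
    }

    public static void main(String[] args) {
        String first = "<Book isbn=\"1\" lang=\"en\"/>";
        String second = "<Book lang=\"en\" isbn=\"1\"></Book>";
        System.out.println(sameXml(first, second));                       // same attributes, different order
        System.out.println(sameXml(first, "<Book isbn=\"2\" lang=\"en\"/>")); // different attribute value
    }
}
```

The first comparison should report the documents as equal and the second as different. Whitespace-only text nodes still count as children here, so two documents that differ only in indentation would compare as unequal unless that whitespace is stripped first.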

Bill
 
Ranch Hand
Posts: 62
public Document getDocument ( Object object ) {

try {
DocumentBuilder documentBuilder = DocumentBuilderFactory.newInstance ().newDocumentBuilder ();
if ( object != null && object instanceof String ) {
document = documentBuilder.parse ( new DataInputStream ( new ByteArrayInputStream ( ( (String) object ).getBytes () ) ) );
} else if ( object != null && object instanceof InputSource ) {
document = documentBuilder.parse ( (InputSource) object );
}

} catch ( SAXException e ) {
LOG.severeStackTrace ( e );
} catch ( ParserConfigurationException e ) {
LOG.severeStackTrace ( e );
} catch ( IOException e ) {
LOG.severeStackTrace ( e );
}
return document;
}
public boolean isXmlREqual(String xml1,String xml2) {

boolean flag = false ;
Document doc1= getDocument ( xml1 ));
Document doc2= getDocument ( xml2 );
NodeList NodeList1 = (NodeList) doc1.getElementsByTagName ( "*" );
Element element2 = (Element) doc2.getFirstChild ();
int size = NodeList1 .getLength ();
for ( int i = 0; i < size; i++ ) {
Node node1 = (Node) NodeList1 .item ( i );
String tagName 1= node1 != null ? node1 .getNodeName( ) : "";
String nodeValue1 = node1 != null && node1 .getFirstChild() != null ? node1 .getFirstChild() : "" ;
NodeList nodeList2 = (NodeList) element2 .getElementsByTagName ( tagName );
if ( nodeList != null && nodeList.getLength () > 0 ) {
Node node2 = (Node) nodeList.item ( 0 ).getFirstChild ();
if ( node != null && node.getNodeValue () != null ) {
String nodeValue12 = node2 != null && node2 .getFirstChild() != null ? node2 .getFirstChil() : "" ;
flag = nodeValue1 .equals(nodeValue12) ? true : false :
}

}
return flag ;
}

Use the above code to check whether the XML documents passed in are equal or not.

If flag is true, the documents are equal; otherwise they are not.

 
krishna bala
Ranch Hand
Posts: 62
Sorry, use the code below instead.

import java.io.ByteArrayInputStream;
import java.io.IOException;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Parses a String or an InputSource into a DOM Document; returns null on failure.
public Document getDocument(Object object) {
    Document document = null;
    try {
        DocumentBuilder documentBuilder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        if (object instanceof String) {
            // getBytes() uses the platform default charset
            document = documentBuilder.parse(new ByteArrayInputStream(((String) object).getBytes()));
        } else if (object instanceof InputSource) {
            document = documentBuilder.parse((InputSource) object);
        }
    } catch (SAXException | ParserConfigurationException | IOException e) {
        LOG.severeStackTrace(e);
    }
    return document;
}

// For every element in xml1, compares its first text child with the first text
// child of the first element of the same name found under the root of xml2.
public boolean isXmlREqual(String xml1, String xml2) {
    boolean flag = false;
    Document doc1 = getDocument(xml1);
    Document doc2 = getDocument(xml2);
    if (doc1 == null || doc2 == null) {
        return false; // one of the documents could not be parsed
    }
    NodeList nodeList1 = doc1.getElementsByTagName("*");
    Element element2 = doc2.getDocumentElement();
    int size = nodeList1.getLength();
    for (int i = 0; i < size; i++) {
        Node node1 = nodeList1.item(i);
        String tagName1 = node1.getNodeName();
        Node child1 = node1.getFirstChild();
        String nodeValue1 = child1 != null && child1.getNodeValue() != null ? child1.getNodeValue() : "";
        NodeList nodeList2 = element2.getElementsByTagName(tagName1);
        if (nodeList2.getLength() > 0) {
            Node node2 = nodeList2.item(0).getFirstChild();
            if (node2 != null && node2.getNodeValue() != null) {
                flag = nodeValue1.equals(node2.getNodeValue());
                if (!flag) {
                    return false; // stop at the first mismatch so a later match cannot hide it
                }
            }
        }
    }
    return flag;
}

Use the above code to check whether the XML documents passed in are equal or not.

If flag is true, the documents are equal; otherwise they are not.
 
Sheriff
Posts: 28394
Krishna, if you're going to post code it would be better if you posted formatted code. (Notice the "Code" button above the box where you post?) Also, it's possible to edit your posts, so there was no reason to post twice.

Now about that code. The original post said the XML documents were "huge" so it's possible they are too large to fit into memory. So using a DocumentBuilder might be a bad idea.
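
One way to avoid holding either document in memory is to read both with a streaming parser and compare them event by event. The following is only a rough sketch using StAX (javax.xml.stream). The class StreamingXmlCompare and its methods are invented for this example, and it assumes elements appear in the same order in both files, which may not match what the original poster needs.

```java
import java.io.Reader;
import java.util.Map;
import java.util.TreeMap;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

final class StreamingXmlCompare {

    // Skips comments, processing instructions and whitespace-only text.
    private static int nextSignificant(XMLStreamReader r) throws XMLStreamException {
        while (r.hasNext()) {
            int event = r.next();
            if (event == XMLStreamConstants.START_ELEMENT || event == XMLStreamConstants.END_ELEMENT) {
                return event;
            }
            if (event == XMLStreamConstants.CHARACTERS && !r.isWhiteSpace()) {
                return event;
            }
        }
        return XMLStreamConstants.END_DOCUMENT;
    }

    // Collects attributes into a sorted map so that attribute order is ignored.
    private static Map<String, String> attributes(XMLStreamReader r) {
        Map<String, String> attrs = new TreeMap<>();
        for (int i = 0; i < r.getAttributeCount(); i++) {
            attrs.put(r.getAttributeName(i).toString(), r.getAttributeValue(i));
        }
        return attrs;
    }

    // Returns a description of the first difference found, or null if none was found.
    static String firstDifference(Reader a, Reader b) {
        try {
            XMLInputFactory factory = XMLInputFactory.newInstance();
            factory.setProperty(XMLInputFactory.IS_COALESCING, Boolean.TRUE);
            factory.setProperty(XMLInputFactory.SUPPORT_DTD, Boolean.FALSE);
            XMLStreamReader ra = factory.createXMLStreamReader(a);
            XMLStreamReader rb = factory.createXMLStreamReader(b);
            while (true) {
                int ea = nextSignificant(ra);
                int eb = nextSignificant(rb);
                if (ea != eb) {
                    return "structure differs near line " + ra.getLocation().getLineNumber();
                }
                if (ea == XMLStreamConstants.END_DOCUMENT) {
                    return null;
                }
                if (ea == XMLStreamConstants.START_ELEMENT) {
                    if (!ra.getName().equals(rb.getName())) {
                        return "element <" + ra.getLocalName() + "> vs <" + rb.getLocalName() + ">";
                    }
                    if (!attributes(ra).equals(attributes(rb))) {
                        return "attributes differ on <" + ra.getLocalName() + ">";
                    }
                } else if (ea == XMLStreamConstants.CHARACTERS
                        && !ra.getText().trim().equals(rb.getText().trim())) {
                    return "text differs near line " + ra.getLocation().getLineNumber();
                }
            }
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Could not read XML", e);
        }
    }
}
```

For real files the two Reader arguments would typically wrap file streams, so only the current event of each document is held in memory at a time.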

Also there are several problems with the code. For example you start by getting all the elements from the first document. You lose the structure (nesting) when you do that, so you can't possibly have an accurate comparison. I don't see where you are comparing text nodes, and I'm sure you aren't comparing attributes.
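
A small sketch of the nesting problem: the two documents below are nested differently, yet getElementsByTagName("*") returns the same flat sequence of element names for both. FlattenedElements and elementNames are names made up for this illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class FlattenedElements {

    // Returns the names of all elements in document order, with nesting discarded.
    static List<String> elementNames(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            NodeList all = doc.getElementsByTagName("*");
            List<String> names = new ArrayList<>();
            for (int i = 0; i < all.getLength(); i++) {
                names.add(all.item(i).getNodeName());
            }
            return names;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        // <c> is a sibling of <b> in the first document and a child of <b> in the second.
        System.out.println(elementNames("<a><b/><c/></a>"));
        System.out.println(elementNames("<a><b><c/></b></a>"));
    }
}
```

Both calls produce the same list of names, so a comparison built only on that list cannot tell the two documents apart.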

There are also some minor glitches, like possibly converting the document to bytes using an encoding different from its actual encoding.
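
When the XML is already in a String, one way to sidestep the byte-encoding question is to hand the parser a character stream instead of bytes. A minimal sketch, with ParseFromString as a made-up class name:

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

final class ParseFromString {

    // Parses directly from characters, so no String-to-bytes conversion is involved.
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        Document doc = parse("<a>\u00e9</a>");
        System.out.println(doc.getDocumentElement().getTextContent());
    }
}
```

When the XML lives in a file, passing the file's InputStream to the parser lets it detect the encoding from the XML declaration itself.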

It's nice that you tried to help, but that code isn't really helpful.
 
Joybrata Chakraborty
Greenhorn
Posts: 29
Thanks, friends, for replying.

I have a few points to make after reading your replies.

1. If we try to solve the problem with a UNIX shell script (using sed or awk), we would need to write the logic for parsing the XML and then for comparing the data, so I think it's better to use Java for that. If you have a different view, we can discuss it.

2. As I mentioned, the XML files will be huge, which means I have a memory constraint, so the logic needs to be written efficiently.
3. Also, after comparing the data we need to log the differences.


Please keep replying so that the discussion stays alive and everybody benefits from it.

Thanks
Joy





 
Joybrata Chakraborty
Greenhorn
Posts: 29

William Brogden wrote:Are you testing for identity of the two files in the Infoset sense, or trying to spit out all content differences?

Note that for things like attributes in an Element, two documents may have a different text order but be considered identical XML. And of course an empty-element tag is considered identical to an open and close tag pair with no content.

Exactly what you have to detect makes a big difference.

Bill



I need to find the logical differences of the content(data part) between the files.
 
William Brogden
Author and all-around good cowpoke
Posts: 13078

I need to find the logical differences of the content(data part) between the files.



OK, next questions:

Are the differences only in the content of a single Element - say the Text Node content of a "FirstName" Element or are the differences potentially more complex?

If the order of elements changes, but not the content, is that a change you want to detect?

How complex is the hierarchy of XML within the files? A relatively "flat" hierarchy should be easier to examine. The most extreme example would be a document with nothing but elements like:

<Book isbn="xxxx">Some text description</Book>

In a more complex hierarchy, <Book> might be inside <Section> inside <Library> inside <City> inside <State> ... well, you get the idea.

Bill
 
Paul Clapham
Sheriff
Posts: 28394

Joybrata Chakraborty wrote:I need to find the logical differences of the content(data part) between the files.



It would be helpful if you used XML terminology. For example, what is this "data part" you are asking about? The entire XML document is data, so you seem to be using some non-XML idea where you divide your document into "data" and something else.
 
Joybrata Chakraborty
Greenhorn
Posts: 29

William Brogden wrote:

I need to find the logical differences of the content(data part) between the files.



OK, next questions:

Are the differences only in the content of a single Element - say the Text Node content of a "FirstName" Element or are the differences potentially more complex?

If the order of elements changes, but not the content, is that a change you want to detect?

How complex is the hierarchy of XML within the files? A relatively "flat" hierarchy should be easier to examine. The most extreme example would be a document with nothing but elements like:

<Book isbn="xxxx">Some text description</Book>

In a more complex hierarchy, <Book> might be inside <Section> inside <Library> inside <City> inside <State> ... well, you get the idea.

Bill



If only the order changes, the two XML files should be reported as the same; nothing needs to be flagged. The hierarchies are more complex, though, and that's the worrying part.

I am trying to write the code using Java reflection. Any suggestions on that are welcome.
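
For the "order should not matter" requirement, one possible approach (a sketch, not a tested solution) is to reduce each element to a canonical string in which attributes and child elements are sorted, and then compare the strings. UnorderedXmlCompare, canonical and sameIgnoringOrder are names invented here. The sketch assumes the subtree being compared fits in memory, and it concatenates and trims an element's own text, so it is not suitable for mixed content.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.w3c.dom.Text;
import org.xml.sax.InputSource;

final class UnorderedXmlCompare {

    // Builds a string for the element with attributes and child elements sorted.
    static String canonical(Element element) {
        List<String> attrs = new ArrayList<>();
        NamedNodeMap map = element.getAttributes();
        for (int i = 0; i < map.getLength(); i++) {
            attrs.add(map.item(i).getNodeName() + "=" + map.item(i).getNodeValue());
        }
        Collections.sort(attrs);

        List<String> children = new ArrayList<>();
        StringBuilder text = new StringBuilder();
        for (Node c = element.getFirstChild(); c != null; c = c.getNextSibling()) {
            if (c instanceof Element) {
                children.add(canonical((Element) c));
            } else if (c instanceof Text) {
                text.append(c.getNodeValue());
            }
        }
        Collections.sort(children);

        StringBuilder sb = new StringBuilder("<").append(element.getNodeName());
        for (String a : attrs) {
            sb.append(' ').append(a);
        }
        sb.append('>').append(text.toString().trim());
        for (String child : children) {
            sb.append(child);
        }
        return sb.append("</>").toString();
    }

    static Element root(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml))).getDocumentElement();
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    static boolean sameIgnoringOrder(String a, String b) {
        return canonical(root(a)).equals(canonical(root(b)));
    }
}
```

With very large files this would have to be applied per record (for example per Book element) rather than to the whole document, and attribute values are not escaped, so unusual values could in principle collide.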

Thanks
Joy
 
Joybrata Chakraborty
Greenhorn
Posts: 29

Paul Clapham wrote:

Joybrata Chakraborty wrote:I need to find the logical differences of the content(data part) between the files.



It would be helpful if you used XML terminology. For example, what is this "data part" you are asking about? The entire XML document is data, so you seem to be using some non-XML idea where you divide your document into "data" and something else.



Thanks for correcting me, values of these tags need to be compared.
 
William Brogden
Author and all-around good cowpoke
Posts: 13078

I am trying to write the code using java reflection. Any suggestion is welcomed on that.



Back off from thinking about reflection or any other cute tools. At this point you should be concentrating on the algorithm. Lots of flow charts considering all the possible complications before writing much code.

OK, so you have a complex hierarchy - At what level do you identify a difference?

Using my example, do you identify the Book that has a difference, the Section that contains a Book that has a difference, etc.?

Is there some unique identifier that will be enough to identify the location of a change or do you have to say something that depends on the order of elements - like:
there is a difference at the 32nd City, 3rd Library, 2nd section in file A.

Being able to say "element with id=124C41 is different" would be realllllly convenient.
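
If such an identifier exists, the comparison can be keyed on it. Below is a rough sketch based on the flat Book example from earlier in the thread. It assumes every Book has a unique isbn attribute and text-only content; IdKeyedDiff, index and diff are hypothetical names. It streams each file once and keeps only id-to-text entries, which is much smaller than a DOM but still grows with the number of records.

```java
import java.io.Reader;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

final class IdKeyedDiff {

    // Streams the document and records the text of each <Book> under its isbn.
    static Map<String, String> index(Reader xml) {
        Map<String, String> byId = new TreeMap<>();
        try {
            XMLStreamReader r = XMLInputFactory.newInstance().createXMLStreamReader(xml);
            while (r.hasNext()) {
                if (r.next() == XMLStreamConstants.START_ELEMENT && "Book".equals(r.getLocalName())) {
                    String isbn = r.getAttributeValue(null, "isbn");
                    if (isbn != null) {
                        byId.put(isbn, r.getElementText().trim());
                    }
                }
            }
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Could not read XML", e);
        }
        return byId;
    }

    // Lists ids that are missing from one side or whose text differs.
    static List<String> diff(Map<String, String> a, Map<String, String> b) {
        Set<String> ids = new TreeSet<>(a.keySet());
        ids.addAll(b.keySet());
        List<String> out = new ArrayList<>();
        for (String id : ids) {
            if (!b.containsKey(id)) {
                out.add("only in A: " + id);
            } else if (!a.containsKey(id)) {
                out.add("only in B: " + id);
            } else if (!a.get(id).equals(b.get(id))) {
                out.add("changed: " + id);
            }
        }
        return out;
    }
}
```

The resulting list could be written to a log, and because the lookup is by identifier the order of the Book elements in either file does not affect the result.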

How are you planning on comparing content once you have found a pair of elements to compare?

thanks for bringing us an interesting problem.

Bill
 
Paul Clapham
Sheriff
Posts: 28394

Joybrata Chakraborty wrote:

Paul Clapham wrote:It would be helpful if you used XML terminology.



Thanks for correcting me, values of these tags need to be compared.



Ah, I see. In XML a "tag" is either a start tag of an element or an end tag of an element or an empty-element tag. So the "value" of a tag, if that means something, would be the name of the element.

But I suspect you might instead want to compare text nodes -- people often lazily use "value of a tag" to refer to a single text node which is the child of an element. This terminology falls down when you have mixed content, of course, but more importantly it gets in the way of defining your problem clearly.
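
A short DOM sketch of that distinction, using a made-up class TextNodeDemo and a small mixed-content document: the element itself has no node value, its first child is a text node holding only part of the text, and getTextContent concatenates all descendant text.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

final class TextNodeDemo {

    static Element root(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml))).getDocumentElement();
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        Element p = root("<p>Hello <b>big</b> world</p>");
        System.out.println(p.getNodeValue());                  // an Element has no node value
        System.out.println(p.getFirstChild().getNodeValue());  // only the first text node
        System.out.println(p.getTextContent());                // all descendant text joined
    }
}
```

With mixed content like this, "the value of the tag" has no single obvious meaning, which is why it helps to say exactly which text nodes should be compared.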
 