
XML - CSV ... Performance related question.

 
Greenhorn
Posts: 18
Hi,

I have a requirement where I need to read a huge XML file (100+ MB), parse it with StAX record by record (record boundaries are defined by the business logic) into an intermediate data structure (a HashMap), and then write the contents of that structure to a file.

What would be the optimum solution to this?

Should I use an array of HashMaps as the intermediate structure, with one thread parsing a record from the XML and putting it into the structure, and another thread reading from the structure and writing to the file? The constraint is that the method doing this work can return only once all the data in the XML has been written to the file. This method is invoked from a web application, and I cannot move it to the background for the time being.

Further, should I use memory mapped files (java.nio) when reading the XML file and writing to the output file?

Is there a way I can monitor the memory usage before invoking this method, midway through the method and at the end of the method?

Thanks
.J.
 
Johnny Augustus
Greenhorn
Posts: 18
Anyone? I am waiting for some of your views on this before I can proceed.
 
Sheriff
Posts: 23496
I would say, if you have a situation where you can write data to some kind of structure and simultaneously read from that structure and write data to a file, you should dispense with the intermediate structure and write directly to the file. The intermediate structure just uses more memory and the processes accessing it just use more processor time. Not to mention more programming complexity.

And I don't see the point of monitoring memory usage. Yeah, you can do that, but what good would it do?
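For anyone who does want numbers at the three checkpoints asked about (before, midway, after), here is a minimal sketch using the standard `java.lang.Runtime` methods. The class name `MemoryProbe`, the labels, and the 16 MB array standing in for the real work are made up for illustration. The figures are approximate, because garbage that has not been collected yet still counts as used.

```java
final class MemoryProbe {
    private static final Runtime RT = Runtime.getRuntime();

    /** Heap currently in use, in bytes. Approximate: uncollected garbage is included. */
    static long usedHeapBytes() {
        return RT.totalMemory() - RT.freeMemory();
    }

    /** Builds one log line for a named checkpoint. */
    static String snapshot(String label) {
        long usedMb = usedHeapBytes() / (1024 * 1024);
        long maxMb = RT.maxMemory() / (1024 * 1024);
        return label + ": used=" + usedMb + " MB, max=" + maxMb + " MB";
    }

    public static void main(String[] args) {
        System.out.println(snapshot("before"));
        // Hypothetical stand-in for the real XML-to-CSV conversion work.
        byte[] work = new byte[16 * 1024 * 1024];
        System.out.println(snapshot("midway"));
        work = null; // drop the reference so the array becomes collectable
        System.out.println(snapshot("after"));
    }
}
```

Calling `snapshot(...)` at the start, middle, and end of the conversion method and logging the result gives a rough trend, not an exact measurement.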

The whole question smells of premature optimization. Write the simplest possible code to start with. Then improve the things that need improving, whatever they are.
 
Johnny Augustus
Greenhorn
Posts: 18
Hi Paul,

The intermediate structure is necessary because there is no clear way that I can sequentially extract the necessary information from the XML and write it to the file. Our business requirement warrants the use of such a data holder.

Memory is an important constraint here because we are not working with high-end servers. Besides, I need the benchmark to keep some higher-ups satisfied.

For now, I am going ahead with the producer-consumer approach. I will keep track of this thread to see if anyone has any better suggestions.

Many thanks
.J.
 
Author and all-around good cowpoke
Rancher
Posts: 13078
How much XML input do you have to parse before you can write a line of CSV?

So far, the Producer Thread / Consumer Thread pattern looks fine to me. Is there anything to be gained by making the intermediate structure a custom Java object?

Bill
 
Johnny Augustus
Greenhorn
Posts: 18
Here's an example of the conversion that needs to happen:

XML


needs to be converted into...

CSV
countries_country_id;countries_country_name;countries_country_states_state_id;countries_country_states_state_name
1;India;1;Maharashtra
1;India;2;Karnataka
2;USA;3;Nevada

Notice the repetition of country id and name in the second line of the CSV.
 
William Brogden
Author and all-around good cowpoke
Rancher
Posts: 13078
So basically, because your hierarchy is shallow, all you have to do is hold on to the current country id, country name, state id and state name, and write a CSV line every time you hit a </state>, that is, the end-element event for "state". There is no need for any complicated intermediate object here.
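A rough sketch of that idea with the StAX cursor API follows. The XML sample did not survive in the post above, so the element layout here is an assumption inferred from the CSV header: `<countries><country><id/><name/><states><state><id/><name/></state></states></country></countries>`, with ids and names as child elements instead of attributes. The class name `CountryCsvSketch` is made up, and values are written without any CSV escaping.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.Writer;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

final class CountryCsvSketch {
    /** Streams the XML and writes one CSV line per </state>; only four strings are held in memory. */
    static void convert(Reader xml, Writer csv) throws XMLStreamException, IOException {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        factory.setProperty(XMLInputFactory.SUPPORT_DTD, false); // no DTDs expected in this input
        XMLStreamReader r = factory.createXMLStreamReader(xml);

        String countryId = "", countryName = "", stateId = "", stateName = "";
        boolean inState = false;                 // tells us whose <id>/<name> we are reading
        StringBuilder text = new StringBuilder(); // text of the element currently open

        csv.write("countries_country_id;countries_country_name;"
                + "countries_country_states_state_id;countries_country_states_state_name\n");
        while (r.hasNext()) {
            switch (r.next()) {
                case XMLStreamConstants.START_ELEMENT: {
                    text.setLength(0);
                    if (r.getLocalName().equals("state")) inState = true;
                    break;
                }
                case XMLStreamConstants.CHARACTERS:
                case XMLStreamConstants.CDATA: {
                    text.append(r.getText()); // text may arrive in several chunks
                    break;
                }
                case XMLStreamConstants.END_ELEMENT: {
                    String element = r.getLocalName();
                    String value = text.toString().trim();
                    if (element.equals("id")) {
                        if (inState) stateId = value; else countryId = value;
                    } else if (element.equals("name")) {
                        if (inState) stateName = value; else countryName = value;
                    } else if (element.equals("state")) {
                        // Country fields are repeated on every state line, as in the sample CSV.
                        csv.write(countryId + ";" + countryName + ";" + stateId + ";" + stateName + "\n");
                        inState = false;
                    }
                    text.setLength(0);
                    break;
                }
                default:
                    break;
            }
        }
        r.close();
    }
}
```

If the real file uses attributes, the same structure applies, with `getAttributeValue` calls in the START_ELEMENT branch replacing the text buffering. Memory use stays flat regardless of file size because nothing beyond the current record is retained.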

You might hand off each CSV line to a queue for a file writing thread so your XML parser can continue full speed.
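One way that hand-off could look, as a sketch only: a bounded `BlockingQueue` between the parser thread and a single writer thread, with a sentinel object to signal the end. The names `QueuedCsvWriter`, `submit` and `finish`, and the capacity of 1024, are all made up for illustration. `finish()` blocks until every queued line has been written, which fits the constraint that the calling method may return only after the file is complete.

```java
import java.io.IOException;
import java.io.Writer;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

final class QueuedCsvWriter {
    // Sentinel compared by identity, so a real line reading "END" is never mistaken for it.
    private static final String END = new String("END");

    // Bounded queue: the parser blocks if the writer falls too far behind, capping memory use.
    private final BlockingQueue<String> queue = new ArrayBlockingQueue<>(1024);
    private final Thread worker;
    private volatile IOException failure;

    QueuedCsvWriter(Writer out) {
        worker = new Thread(() -> drain(out), "csv-writer");
        worker.start();
    }

    /** Runs on the writer thread: takes lines until the sentinel arrives. */
    private void drain(Writer out) {
        try {
            for (String line = queue.take(); line != END; line = queue.take()) {
                if (failure != null) continue; // keep draining so the producer never blocks forever
                try {
                    out.write(line);
                    out.write('\n');
                } catch (IOException e) {
                    failure = e;
                }
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    /** Called by the parser thread for each finished CSV line (without a trailing newline). */
    void submit(String line) throws InterruptedException {
        queue.put(line);
    }

    /** Signals the end, waits for the writer thread, and rethrows the first write error, if any. */
    void finish() throws IOException, InterruptedException {
        queue.put(END);
        worker.join();
        if (failure != null) throw failure;
    }
}
```

The caller still owns the `Writer` and should flush and close it after `finish()` returns. Whether the extra thread actually helps depends on whether parsing or disk output is the bottleneck, so it is worth timing the single-threaded version first.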

Bill
 