XML - CSV ... Performance related question.

 
Greenhorn
Posts: 18
Hi,

I have a requirement to read a huge XML file (100+ MB), parse it with StAX record by record (the record boundaries are defined by the business logic) into an intermediate data structure (a HashMap), and then write the contents of that structure to a file.

What would be the optimum solution to this?

Should I use an array of HashMaps as the intermediate structure, with one thread parsing a record from the XML into the structure while another thread reads from the structure and writes to the file? The catch is that the method doing this work may return only once all the data in the XML has been written to the file. The method is invoked from a web application, and I cannot move it to the background for the time being.

Further, should I use memory mapped files (java.nio) when reading the XML file and writing to the output file?

Is there a way I can monitor the memory usage before invoking this method, midway through the method and at the end of the method?
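On the memory question, here is a minimal sketch of sampling heap usage with the standard Runtime API before, midway through and after a piece of work. The class name MemoryProbe, the helper names and the 16 MB array standing in for the conversion are illustrative assumptions, not part of the original requirement. The figures are approximate, because garbage that has not been collected yet still counts as used.

```java
public class MemoryProbe {

    /** Approximate heap in use, in bytes: heap claimed by the JVM minus the part still free. */
    public static long usedHeapBytes() {
        Runtime rt = Runtime.getRuntime();
        return rt.totalMemory() - rt.freeMemory();
    }

    /** One-line summary suitable for a log; the label says where the sample was taken. */
    public static String report(String label) {
        long mb = 1024L * 1024L;
        long usedMb = usedHeapBytes() / mb;
        long maxMb = Runtime.getRuntime().maxMemory() / mb; // upper limit the heap may grow to
        return label + ": used=" + usedMb + " MB, max=" + maxMb + " MB";
    }

    public static void main(String[] args) {
        System.out.println(report("before"));
        byte[] work = new byte[16 * 1024 * 1024]; // hypothetical stand-in for the XML-to-CSV work
        System.out.println(report("midway"));
        work = null; // drop the reference; the memory is reclaimed only when the GC runs
        System.out.println(report("after"));
    }
}
```

For a benchmark that has to convince someone, attaching a profiler or a tool such as VisualVM to the running server is likely to give more trustworthy numbers than in-code sampling.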

Thanks
.J.
 
Johnny Augustus
Greenhorn
Posts: 18
Anyone? I am waiting for some of your views on this before I can proceed.
 
Marshal
Posts: 28193
I would say, if you have a situation where you can write data to some kind of structure and simultaneously read from that structure and write data to a file, you should dispense with the intermediate structure and write directly to the file. The intermediate structure just uses more memory and the processes accessing it just use more processor time. Not to mention more programming complexity.

And I don't see the point of monitoring memory usage. Yeah, you can do that, but what good would it do?

The whole question smells of premature optimization. Write the simplest possible code to start with. Then improve the things that need improving, whatever they are.
 
Johnny Augustus
Greenhorn
Posts: 18
Hi Paul,

The intermediate structure is necessary because there is no clear way that I can sequentially extract the necessary information from the XML and write it to the file. Our business requirement warrants the use of such a data holder.

Memory is an important constraint here because we are not working with high-end servers. Besides, I need the benchmark to keep some higher-ups satisfied.

For now, I am going ahead with the producer-consumer approach. I will keep an eye on this thread to see if anyone has a better suggestion.

Many thanks
.J.
 
Author and all-around good cowpoke
Posts: 13078
How much XML input do you have to parse before you can write a line of CSV?

So far, the Producer Thread / Consumer Thread pattern looks fine to me. Is there anything to be gained by making the intermediate structure a custom Java object?

Bill
 
Johnny Augustus
Greenhorn
Posts: 18
Here's an example of the conversion that needs to happen:

XML


needs to be converted into...

CSV
countries_country_id;countries_country_name;countries_country_states_state_id;countries_country_states_state_name
1;India;1;Maharashtra
1;India;2;Karnataka
2;USA;3;Nevada

Notice the repetition of country id and name in the second line of the CSV.
 
William Brogden
Author and all-around good cowpoke
Posts: 13078
So basically, because your hierarchy is shallow, all you have to do is hold on to the current country id, country name, state id and state name, and write a CSV line every time you hit the end-element event for "state" (that is, </state>). There is no need for any complicated intermediate object here.
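A rough sketch of that idea with the standard StAX cursor API follows. The XML sample did not survive in the post above, so the element layout used here (countries > country > id, name, states > state > id, name, all as child elements) is an assumption inferred from the CSV column names; the class name CountriesToCsv and the method convert are likewise made up for illustration.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.Writer;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

public class CountriesToCsv {

    /** Streams the XML once and writes one CSV line per closing state element. */
    public static void convert(Reader xml, Writer csv) throws XMLStreamException, IOException {
        XMLStreamReader r = XMLInputFactory.newInstance().createXMLStreamReader(xml);
        String countryId = "", countryName = "", stateId = "", stateName = "";
        boolean inState = false;                 // tells a state's id/name from the country's
        StringBuilder text = new StringBuilder();

        csv.write("countries_country_id;countries_country_name;"
                + "countries_country_states_state_id;countries_country_states_state_name\n");

        while (r.hasNext()) {
            switch (r.next()) {
                case XMLStreamConstants.START_ELEMENT:
                    text.setLength(0);           // start collecting this element's text
                    if (r.getLocalName().equals("state")) {
                        inState = true;
                        stateId = "";
                        stateName = "";
                    }
                    break;
                case XMLStreamConstants.CHARACTERS:
                case XMLStreamConstants.CDATA:
                    text.append(r.getText());    // text may arrive in several chunks
                    break;
                case XMLStreamConstants.END_ELEMENT: {
                    String value = text.toString().trim();
                    switch (r.getLocalName()) {
                        case "id":
                            if (inState) stateId = value; else countryId = value;
                            break;
                        case "name":
                            if (inState) stateName = value; else countryName = value;
                            break;
                        case "state":
                            // Values are written as-is: no escaping of ';' or quotes.
                            csv.write(countryId + ";" + countryName + ";"
                                    + stateId + ";" + stateName + "\n");
                            inState = false;
                            break;
                        default:
                            break;
                    }
                    text.setLength(0);
                    break;
                }
                default:
                    break;
            }
        }
        r.close();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<countries><country><id>1</id><name>India</name><states>"
                + "<state><id>1</id><name>Maharashtra</name></state>"
                + "<state><id>2</id><name>Karnataka</name></state>"
                + "</states></country></countries>";
        StringWriter out = new StringWriter();
        convert(new StringReader(xml), out);
        System.out.print(out);
    }
}
```

Only four strings and a flag are held at any time, so memory use stays flat however large the input file is. If the real file carries the ids and names as attributes instead, the same loop works with getAttributeValue on the start-element event.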

You might hand off each CSV line to a queue for a file-writing thread so that your XML parser can continue at full speed.
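The hand-off could look something like the sketch below, built on java.util.concurrent.BlockingQueue. The class name CsvLineWriter, the queue capacity of 1024 and the end-of-input sentinel are assumptions made for the example. The finish() method blocks until the writer thread has drained the queue, which fits the earlier constraint that the calling method may return only after everything has been written.

```java
import java.io.IOException;
import java.io.StringWriter;
import java.io.Writer;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class CsvLineWriter {
    // Sentinel compared by identity (==), so no real line can ever be mistaken for it.
    private static final String END = new String("end-of-input");

    // Bounded queue: a fast parser blocks in put() instead of filling the heap.
    private final BlockingQueue<String> queue = new ArrayBlockingQueue<>(1024);
    private final Thread worker;
    private volatile IOException failure;

    public CsvLineWriter(Writer out) {
        worker = new Thread(() -> {
            try {
                while (true) {
                    String line = queue.take();
                    if (line == END) {
                        return;
                    }
                    if (failure != null) {
                        continue; // keep draining after an error so the producer never blocks forever
                    }
                    try {
                        out.write(line);
                        out.write('\n');
                    } catch (IOException e) {
                        failure = e; // reported to the caller in finish()
                    }
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "csv-writer");
        worker.start();
    }

    /** Called by the parsing thread for each finished CSV line. */
    public void put(String line) throws InterruptedException {
        queue.put(line);
    }

    /** Signals end of input, waits for the writer thread, and rethrows any write error. */
    public void finish() throws InterruptedException, IOException {
        queue.put(END);
        worker.join();
        if (failure != null) {
            throw failure;
        }
    }

    public static void main(String[] args) throws Exception {
        StringWriter out = new StringWriter(); // a BufferedWriter on a file in real use
        CsvLineWriter writer = new CsvLineWriter(out);
        writer.put("1;India;1;Maharashtra");
        writer.put("1;India;2;Karnataka");
        writer.finish();
        System.out.print(out);
    }
}
```

Whether the extra thread pays off depends on where the time actually goes. With a buffered writer on local disk, the single-threaded version may already be fast enough, so it is worth measuring both before keeping the added complexity.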

Bill
 