
Converting XML files to text files

 
Bhasker Reddy
Ranch Hand
Posts: 176
I am processing XML files and converting them to text files (with specific record types). It currently takes around an hour to process a one-gig file.
I am using a PrintWriter, calling println to write each line of the text file. Do you have any suggestions for improving this? I need to get it down to
about 10 minutes (instead of an hour), as we are going to process around 100 gigs of data every day.
Please let me know if you have any suggestions.
 
Joe Ess
Bartender
Posts: 9319
First and foremost, make sure your hardware is up to the task. If your computer's CPU is at 100%, your memory use is at 100%, and your disk is thrashing, the machine is more occupied with swapping than with running your program. When running an enterprise application, there is no substitute for enterprise-class hardware. Since you want to process gigs of information, you probably won't be able to run many other processes on this computer, especially not CPU-intensive things like web servers or databases.
Next, since you are working with XML you may be using DOM to parse it. I've found DOM to be a performance bottleneck and was able to get an order of magnitude better performance by manually parsing XML files. DOM is primarily for editing XML, so SAX may be a better alternative in your case (I haven't tried it). Are you loading the file into memory, processing it, then writing it out? BAD IDEA for a one-gig file unless you have MANY, MANY gigs of physical RAM. DOM loads the entire document into memory and lets you manipulate it. SAX is event-driven: it processes a subset of the document at a time and generates events, so your program can process those subsets and write them out to disk as it goes, saving valuable resources.
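To make the SAX idea concrete, here is a minimal sketch of streaming a document and writing each record out as soon as its closing tag is seen, so only a small window of the file is ever in memory. The element names (`account`, `id`), the sample document, and the output file name are made up for illustration; they are not taken from Bhasker's actual files.

```java
import java.io.*;
import javax.xml.parsers.*;
import org.xml.sax.*;
import org.xml.sax.helpers.DefaultHandler;

public class SaxToText {
    public static void main(String[] args) throws Exception {
        // Tiny stand-in document; in practice you would parse the gig-sized file.
        String xml = "<accounts><account><id>1</id></account>"
                   + "<account><id>2</id></account></accounts>";
        final PrintWriter out =
                new PrintWriter(new BufferedWriter(new FileWriter("out.txt")));
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        parser.parse(new InputSource(new StringReader(xml)), new DefaultHandler() {
            private final StringBuilder text = new StringBuilder();
            @Override
            public void startElement(String uri, String local, String qName,
                                     Attributes atts) {
                text.setLength(0);        // reset the character buffer per element
            }
            @Override
            public void characters(char[] ch, int start, int len) {
                text.append(ch, start, len);
            }
            @Override
            public void endElement(String uri, String local, String qName) {
                if (qName.equals("id")) { // hypothetical record field
                    out.println(text);    // write the record out immediately
                }
            }
        });
        out.close();
    }
}
```

The key point is that nothing beyond the current element's text is held in memory, no matter how large the input is.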
The online book from Sun, Java Platform Performance, has some good general information on getting the most out of the Java API.
 
Bhasker Reddy
Ranch Hand
Posts: 176
We are using enterprise-class hardware with probably 18 GB of RAM. We have our own parsing routine that's pretty fast, so parsing the XML is quick; I suspect writing to the text file is what takes most of the time. I am using PrintWriter's println method. Is there something else that saves time? Is FileWriter better than PrintWriter? Do you have any other ideas to make it faster?
Thanks,
Bhasker
 
Abhik Sarkar
Ranch Hand
Posts: 61
Hi Bhasker,

I hope you have a BufferedWriter between your PrintWriter and the FileWriter. If not, putting one in would help immediately.

Also, if you have constructed the PrintWriter with autoFlush set to true, then each time you call println() the stream will be flushed. I have written a small program to demonstrate the difference it makes. Please ignore the bad exception handling... I just wanted to demonstrate the difference.
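Abhik's original program did not survive in the archived page. A minimal sketch of that kind of comparison (the file names and line count here are my own, not his) might look like this: the first writer flushes on every println, the second buffers and flushes only when the buffer fills or the stream is closed.

```java
import java.io.*;

public class FlushBenchmark {
    // Writes `lines` records through the given PrintWriter and returns elapsed nanos.
    static long write(PrintWriter out, int lines) {
        long start = System.nanoTime();
        for (int i = 0; i < lines; i++) {
            out.println("record|" + i);
        }
        out.close();
        return System.nanoTime() - start;
    }

    public static void main(String[] args) throws IOException {
        int lines = 10_000;  // made-up size; the real files are far larger
        // autoFlush=true: every println() pushes the data out to the file
        long flushed = write(
                new PrintWriter(new FileWriter("flushed.txt"), true), lines);
        // buffered, autoFlush=false: data goes to disk in large chunks
        long buffered = write(
                new PrintWriter(new BufferedWriter(new FileWriter("buffered.txt"))), lines);
        System.out.println("autoFlush: " + flushed / 1_000_000 + " ms, "
                         + "buffered: " + buffered / 1_000_000 + " ms");
    }
}
```

Both runs produce identical output files; only the flushing behavior, and therefore the elapsed time, differs.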



Here is the output from some test runs...


As you can see, the process speeds up around 3 times! In your case, that could mean the time taken drops to around 20 minutes.

Whether or not you want the output to be flushed immediately depends on the nature of your application. If it is doing batch processing, you can do away with frequent flushing... if it needs to display data in real time, you need to flush frequently.

Hope this helps,
Abhik.
 
Bhasker Reddy
Ranch Hand
Posts: 176
I am not displaying any data; I am writing it out to a text file. I am taking an XML file and converting it to a text file, using PrintWriter and println to do it. Do you mean to say that if I flush frequently, it will be faster? Or do I need to store all the output and then write it to the file at once,
instead of writing it one line at a time?
 
Bhasker Reddy
Ranch Hand
Posts: 176
I parse the XML file; at the end of parsing, I have an object and an ArrayList that contain all the parsed information. I read the object and the ArrayList, apply business logic, and output the data to a pipe-delimited text file using a PrintWriter. I output in this way:
out.println(str + "\r"); I use this line 114 times to output all the information in the object and ArrayList. Instead of doing it that way, could I store all the "str" values (each is a String) into an ArrayList (say, strArrayList)
and, at the end of applying the business logic, output it to the PrintWriter like this:

ListIterator stringList = strArrayList.listIterator();
int i = 0;
while (stringList.hasNext()) {
    String str = (String) stringList.next();
    out.println(str);
    i++;
    if (i == 20) {   // flush every 20 lines
        out.flush();
        i = 0;
    }
}

Do you think it would be faster if I did it this way?
 
Abhik Sarkar
Ranch Hand
Posts: 61
Hi Bhasker,

My point was that using a BufferedWriter improves performance and that flushing too often slows things down. So, if you aren't using a BufferedWriter, you should definitely consider it. Also, if you don't have a lot of content to write to the file, you could consider putting everything into a StringBuffer and writing the entire StringBuffer to the file in one go.

You could look around on the internet for articles on improving performance. Here is one I came across: Performance Tuning on Sun's Java site.

Hope this helps,
Abhik.
 
Bhasker Reddy
Ranch Hand
Posts: 176
Basically I am converting XML files to record-type-based text files. My XML files are organized by account, and every account has hundreds of records. When I parse, I have an account object; I read multiple tags and inner tags in the account and write them to the text file. Whenever I read a record or
tag, I output it to the text file using PrintWriter and its println methods. I am talking about huge data (on the order of gigs).
Instead of writing every line, is there a way to store it all in an object (serialization?) and write it at once? How would I do this, and will it be much
faster than outputting one line at a time?
 
Abhik Sarkar
Ranch Hand
Posts: 61
Hi Bhasker,

If it is only text, you can use a StringBuffer. Here is a modified version of my earlier example... you can see from the output that the execution time has come down further.
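The modified example itself is also missing from the archived page. A sketch of the approach Abhik describes (file name, record format, and record count are my own stand-ins) would be: accumulate everything in a StringBuffer, then write it to the file in a single call.

```java
import java.io.*;

public class BufferThenWrite {
    public static void main(String[] args) throws IOException {
        // Accumulate all records in memory first, then write once.
        // Only safe when the whole output comfortably fits in the heap --
        // not an option for Bhasker's gig-sized files.
        StringBuffer buf = new StringBuffer();
        for (int i = 0; i < 5; i++) {
            buf.append("record|").append(i).append('\n');
        }
        Writer out = new BufferedWriter(new FileWriter("records.txt"));
        out.write(buf.toString());   // single write instead of one per record
        out.close();
    }
}
```

For output that is too large for memory, the BufferedWriter from the earlier post gives much of the same benefit without holding everything at once.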


 