
Problem with processing data files of size larger than 350 MB

 
amit bose
Greenhorn
Posts: 25
Hi All,

Please find below the details of my query.

Problem: I need to process a huge (350 MB) data file in Java. The data file is basically a concatenation of multiple XML documents.

What I need to do is:
(a) check whether there are any unwanted characters in between the XML documents
(b) if yes, remove those characters
After the validation stage above, I need to write the file back to disk.

E.g. Input Data file sample (D1)
<?xml version="1.0" encoding="UTF-8"?><books><!--- Books1. xml - some more tags go here --></books>some junk here
<?xml version="1.0" encoding="UTF-8"?><books><!--- Books2. xml - some more tags go here --></books>
<?xml version="1.0" encoding="UTF-8"?><books><!--- Books3. xml - some more tags go here --></books>more junk
<?xml version="1.0" encoding="UTF-8"?><books><!--- Books4. xml - some more tags go here --></books>

(Please note that the content of the input data file above actually appears on a single line; for better readability I have shown each XML on its own line.)


E.g. Output Data file sample (D2)
<?xml version="1.0" encoding="UTF-8"?><books><!--- Books1. xml - some more tags go here --></books>
<?xml version="1.0" encoding="UTF-8"?><books><!--- Books2. xml - some more tags go here --></books>
<?xml version="1.0" encoding="UTF-8"?><books><!--- Books3. xml - some more tags go here --></books>
<?xml version="1.0" encoding="UTF-8"?><books><!--- Books4. xml - some more tags go here --></books>

(The text 'some junk here' and 'more junk' have been removed in D2)


Earlier Solution:

I have shared my code below:
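Roughly like this (a simplified sketch, not my exact code; the file names are placeholders):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;

public class ValidateDataFile {
    public static void main(String[] args) throws IOException {
        // Read the whole file line by line and accumulate everything in memory.
        StringBuilder sbfContent = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(new FileReader("input.dat"))) {
            String line;
            while ((line = reader.readLine()) != null) {
                sbfContent.append(line);
            }
        }

        // Drop any junk that appears between a closing </books> and the next <?xml.
        String result = sbfContent.toString().replaceAll("</books>[^<]*", "</books>");
        StringBuilder sbfValidatedContent = new StringBuilder(result);

        // Write the validated content back to disk.
        try (FileWriter writer = new FileWriter("output.dat")) {
            writer.write(sbfValidatedContent.toString());
        }
    }
}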


The above code throws an OutOfMemoryError for file sizes above 100 MB, and it happens while I am trying to read the file.

The next thing that I tried was reading the file in buffered chunks rather than line by line:
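The read loop changed to roughly this (again a simplified sketch, not my exact code):

char[] buffer = new char[8192];
StringBuilder sbfContent = new StringBuilder();
try (BufferedReader reader = new BufferedReader(new FileReader("input.dat"))) {
    int charsRead;
    while ((charsRead = reader.read(buffer)) != -1) {
        // Still accumulating the entire file contents in memory.
        sbfContent.append(buffer, 0, charsRead);
    }
}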



The above code worked fine as long as the data file size was 200 MB or less. However, I now have a 350 MB data file and it keeps giving the out-of-memory error.
Increasing the buffer size does not sound like a good option.

Let me know if there are any pointers for this problem.


Thanks,
Amit
 
Somnath Mallick
Ranch Hand
Posts: 483
I think, since you are getting an out of memory error, it would help if you increase your JVM heap size.
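For example, the maximum heap can be raised with the -Xmx flag when launching the program (the class name here is just a placeholder):

java -Xmx1024m MyFileProcessor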

 
William Brogden
Author and all-around good cowpoke
Rancher
Posts: 13071
It looks to me like there is only one pass through the file.

Why don't you write chunks of valid data as they are accumulated?
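Something along these lines, as a rough sketch (it assumes the junk only ever appears between documents, that each document ends with </books> as in your sample, and the file names are placeholders):

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
import java.io.Reader;
import java.io.Writer;

public class FilterJunk {
    public static void main(String[] args) throws IOException {
        try (Reader in = new BufferedReader(new FileReader("input.dat"));
             Writer out = new BufferedWriter(new FileWriter("output.dat"))) {
            char[] buffer = new char[64 * 1024];
            StringBuilder pending = new StringBuilder();
            int charsRead;
            while ((charsRead = in.read(buffer)) != -1) {
                pending.append(buffer, 0, charsRead);
                int end;
                // As soon as a complete document is available, write it out and
                // drop it (plus any junk in front of it) from memory.
                while ((end = pending.indexOf("</books>")) != -1) {
                    int start = pending.indexOf("<?xml");
                    if (start >= 0 && start < end) {
                        out.write(pending.substring(start, end + "</books>".length()));
                    }
                    pending.delete(0, end + "</books>".length());
                }
            }
        }
    }
}

That way only the current (partial) document is ever held in memory, so the total file size stops mattering.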

Bill
 
Carey Brown
Bartender
Posts: 1635
Your best bet is to process the XML in a serial fashion; that way you have no memory problems. You could use either the STAX or SAX libraries for this.
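For example, a minimal StAX read loop looks roughly like this (it assumes a single well-formed document per parse, so the concatenation would still have to be dealt with separately; the file name is a placeholder):

import java.io.FileInputStream;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

public class StaxRead {
    public static void main(String[] args) throws Exception {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        try (FileInputStream in = new FileInputStream("books1.xml")) {
            // Events are pulled one at a time, so the whole document never sits in memory.
            XMLStreamReader reader = factory.createXMLStreamReader(in);
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                    System.out.println("element: " + reader.getLocalName());
                }
            }
            reader.close();
        }
    }
}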


Secondly, you are keeping four copies of the data in memory: sbfContent (twice), result, and sbfValidatedContent.


sbfContent should be emptied before trying to append to it again.


result and sbfValidatedContent should be released before trying to re-read the file.

 
amit bose
Greenhorn
Posts: 25
Somnath Mallick wrote:I think, since you are getting an out of memory error, it would help if you increase your JVM heap size.



Thanks for the pointer Somnath.

However, I am already using a large heap size as below:

 
amit bose
Greenhorn
Posts: 25
William Brogden wrote:It looks to me like there is only one pass through the file.

Why don't you write chunks of valid data as they are accumulated?

Bill


Thanks for the pointer Bill.

Actually, I wanted to write chunks of valid data as they are accumulated, but first I need to read the input data file, and that is where the code fails. The input is also not a single XML file that could be processed easily, but rather multiple XMLs concatenated together.
 
amit bose
Greenhorn
Posts: 25
Carey Brown wrote:Your best bet is to process the XML in a serial fashion; that way you have no memory problems. You could use either the STAX or SAX libraries for this.


Secondly, you are keeping four copies of the data in memory: sbfContent (twice), result, and sbfValidatedContent.


sbfContent should be emptied before trying to append to it again.


result and sbfValidatedContent should be released before trying to re-read the file.




Thanks for the pointer Carey.

I was going through the webpage, and it seems the STAX API allows streaming of XML data. As my input is not a single XML file but rather multiple XMLs concatenated together, I am not sure if I can use this. Please correct me if I am wrong.

Also, regarding the duplication of data in memory: I will remove the duplication, but the code fails prior to reaching the duplicated content (i.e. sbfValidatedContent, etc.).
 
Somnath Mallick
Ranch Hand
Posts: 483
Since you say that the code is failing at the reading part, I think sbfContent is becoming too big for the JVM to handle! Could you debug the code and tell us exactly where (which line) the code is failing?
 