
Problem with processing data files of size larger than 350 MB

 
Greenhorn
Posts: 25
Hi All,

Please find below the details of my query.

Problem: I need to process a huge (350 MB) data file in Java. The data file is basically a concatenation of multiple XML documents.

What I need to do is:
(a) check whether there are unwanted characters between the XML documents
(b) if yes, remove those characters
After the validation stage above, I need to write the file back to disk.

E.g. Input Data file sample (D1)
<?xml version="1.0" encoding="UTF-8"?><books><!-- Books1.xml - some more tags go here --></books>some junk here
<?xml version="1.0" encoding="UTF-8"?><books><!-- Books2.xml - some more tags go here --></books>
<?xml version="1.0" encoding="UTF-8"?><books><!-- Books3.xml - some more tags go here --></books>more junk
<?xml version="1.0" encoding="UTF-8"?><books><!-- Books4.xml - some more tags go here --></books>

(Please note that the content of the input data file appears on a single line; the XML documents are shown one per line here for better readability.)


E.g. Output Data file sample (D2)
<?xml version="1.0" encoding="UTF-8"?><books><!-- Books1.xml - some more tags go here --></books>
<?xml version="1.0" encoding="UTF-8"?><books><!-- Books2.xml - some more tags go here --></books>
<?xml version="1.0" encoding="UTF-8"?><books><!-- Books3.xml - some more tags go here --></books>
<?xml version="1.0" encoding="UTF-8"?><books><!-- Books4.xml - some more tags go here --></books>

(The text 'some junk here' and 'more junk' have been removed in D2)
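For what it's worth, the D1-to-D2 filtering step could be sketched roughly as below. The class name and the start/end tokens are placeholders, and it assumes each individual document (though not the whole file) fits in memory: the stream is read in chunks, spans running from `<?xml` through `</books>` are written out, and anything between documents is dropped.

```java
import java.io.*;

public class JunkFilter {
    static final String DOC_START = "<?xml";
    static final String DOC_END = "</books>";

    // Streams 'in' to 'out', keeping only spans from DOC_START through
    // DOC_END and dropping the junk between documents.
    public static void filter(Reader in, Writer out) throws IOException {
        StringBuilder carry = new StringBuilder();
        char[] buf = new char[64 * 1024];
        int n;
        while ((n = in.read(buf)) != -1) {
            carry.append(buf, 0, n);
            drain(carry, out);
        }
        drain(carry, out); // a trailing incomplete document is discarded
    }

    private static void drain(StringBuilder carry, Writer out) throws IOException {
        int start;
        while ((start = carry.indexOf(DOC_START)) != -1) {
            int end = carry.indexOf(DOC_END, start);
            if (end == -1) {
                // Incomplete document: drop junk before it, keep the rest.
                carry.delete(0, start);
                return;
            }
            end += DOC_END.length();
            out.write(carry.substring(start, end));
            out.write(System.lineSeparator()); // one document per line, as in D2
            carry.delete(0, end);
        }
        // No document start found: it is all junk, except possibly a partial
        // "<?xml" prefix at the very end, which we keep for the next chunk.
        carry.delete(0, Math.max(0, carry.length() - DOC_START.length() + 1));
    }
}
```

Because only one chunk plus the current document sits in memory at a time, the 350 MB total size shouldn't matter; a character-level state machine would be needed only if a single document were itself too big to hold.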


Earlier Solution:

I have shared my code below:


The above code threw an OutOfMemoryError for file sizes above 100 MB, and it happens while reading the file.

The next thing I tried was reading the file in buffered chunks rather than line by line:



The above code worked fine while the data file was 200 MB or smaller. However, I now have a 350 MB data file, and it keeps throwing the out-of-memory error.
Increasing the buffer size does not sound like a good option.

Let me know if there are any pointers for this problem.


Thanks,
Amit
 
Ranch Hand
Posts: 483
Since you are getting an out-of-memory error, I think it would help to increase your JVM heap size.
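For example, heap size is set with the `-Xms`/`-Xmx` launcher options; the jar and file names below are just placeholders:

```shell
# -Xms sets the initial heap, -Xmx the maximum (here 2 GB).
java -Xms512m -Xmx2g -jar datafile-processor.jar input.dat
```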

 
Author and all-around good cowpoke
Posts: 13078
6
It looks to me like there is only one pass through the file.

Why don't you write chunks of valid data as they are accumulated?

Bill
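In outline, writing data out as it is accumulated might look like this (the class name is a placeholder, and the actual validation step is elided as a comment): each chunk is handed to the writer as soon as it is read, so memory use stays bounded by the buffer size rather than the file size.

```java
import java.io.*;

public class ChunkedCopy {
    // Reads 'in' in fixed-size chunks and writes each chunk out immediately,
    // so at most one buffer's worth of data is held in memory.
    public static long copyInChunks(Reader in, Writer out) throws IOException {
        char[] buf = new char[32 * 1024];
        long total = 0;
        int n;
        while ((n = in.read(buf)) != -1) {
            out.write(buf, 0, n); // validation/filtering of the chunk goes here
            total += n;
        }
        out.flush();
        return total;
    }
}
```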
 
Bartender
Posts: 10978
87
Your best bet is to process the XML serially; that way you have no memory problems. You could use either the StAX or SAX libraries for this.


Secondly, you are keeping four copies of the data in memory: sbfContent (twice), result, and sbfValidatedContent.


sbfContent should be emptied before trying to append to it again.


result and sbfValidatedContent should be released before trying to re-read the file.
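On the buffer-emptying point, a tiny illustration (the class and method names here are made up): `setLength(0)` clears a `StringBuffer` in place, so the contents of one pass don't pile up under the next.

```java
public class BufferHygiene {
    // Returns the buffer contents after a reset-and-reuse cycle, showing that
    // setLength(0) discards the old characters before the next append.
    public static String resetAndReuse() {
        StringBuffer sbf = new StringBuffer();
        sbf.append("contents of the first read");
        sbf.setLength(0);          // empty the buffer before reusing it
        sbf.append("second read");
        return sbf.toString();
    }
}
```

Releasing a buffer entirely (e.g. setting the reference to null before re-reading the file) additionally lets the garbage collector reclaim it.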

 
amit bose
Greenhorn
Posts: 25

Somnath Mallick wrote:Since you are getting an out-of-memory error, I think it would help to increase your JVM heap size.



Thanks for the pointer Somnath.

However, I am already using a large heap size, as below:

 
amit bose
Greenhorn
Posts: 25

William Brogden wrote:It looks to me like there is only one pass through the file.

Why don't you write chunks of valid data as they are accumulated?

Bill



Thanks for the pointer Bill.

Actually, I wanted to write chunks of valid data as they are accumulated, but first I need to read the input data file, which is where the code fails. The input is also not an XML file that could be processed easily, but rather multiple XMLs concatenated together.
 
amit bose
Greenhorn
Posts: 25

Carey Brown wrote:Your best bet is to process the XML serially; that way you have no memory problems. You could use either the StAX or SAX libraries for this.


Secondly, you are keeping four copies of the data in memory: sbfContent (twice), result, and sbfValidatedContent.


sbfContent should be emptied before trying to append to it again.


result and sbfValidatedContent should be released before trying to re-read the file.




Thanks for the pointer Carey.

I was going through the webpage, but it seems the StAX API lets you stream a single XML document. As my input is not an XML file but rather multiple XMLs concatenated together, I am not sure I can use it. Please correct me if I am wrong.

Also, regarding the duplication of data in memory: I will remove the duplication, but the code fails before it ever reaches the duplicated content (i.e. sbfValidatedContent etc.).
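One hedged way around the "not a single XML" issue is to split the concatenated stream at each XML declaration first, and then hand each piece to a StAX/SAX parser on its own. The sketch below works on an in-memory string purely to show the boundary idea (the class name is invented; for a 350 MB file this would have to be combined with chunked reading):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DocSplitter {
    // Splits a string of concatenated XML documents at each XML declaration,
    // so each piece can be parsed individually. Any junk after a document is
    // left attached to it; a separate filtering pass would strip it.
    public static List<String> split(String concatenated) {
        List<String> docs = new ArrayList<>();
        Matcher m = Pattern.compile("<\\?xml").matcher(concatenated);
        int prev = -1;
        while (m.find()) {
            if (prev != -1) docs.add(concatenated.substring(prev, m.start()));
            prev = m.start();
        }
        if (prev != -1) docs.add(concatenated.substring(prev));
        return docs;
    }
}
```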
 
Somnath Mallick
Ranch Hand
Posts: 483
Since you say the code is failing at the reading part, I think sbfContent is becoming too big for the JVM to handle. Could you debug the code and tell us exactly where (which line) it fails?
 