
OutOfMemoryError when trying to read large data

 
Nancy Joe
Greenhorn
Posts: 18
Hi all,

When trying to read a large gzip response from a BufferedReader, I get this exception on this line: while ((line = bf.readLine()) != null) {

I tried increasing the heap size (-Xmx), but it still doesn't work. Any thoughts on how to fix this?

Thanks.

This is the exception:

java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOf(Arrays.java:3332)
at java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:124)
at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:596)
at java.lang.StringBuilder.append(StringBuilder.java:190)
at java.io.BufferedReader.readLine(BufferedReader.java:370)
at java.io.BufferedReader.readLine(BufferedReader.java:389)


This is the code:

 
Stephan van Hulst
Saloon Keeper
Posts: 13261
How big is the uncompressed JSON message that you're trying to reconstruct?

Do you really need the entire JSON message to be in memory?
 
Nancy Joe
Greenhorn
Posts: 18
Hi Stephan,

Thanks for your response. It is huge; it can run into MBs and GBs. After reading this data, it needs to be loaded into a DB, so it needs to be in memory.
 
Campbell Ritchie
Marshal
Posts: 74004

Nancy Joe wrote:. . . it needs to be loaded into a DB, so it needs to be in memory.

No, it doesn't need to be in memory. You will have to work out how to load the data into the DB as you read them.
 
Campbell Ritchie
Marshal
Posts: 74004
If you have enough memory, you may be able to avoid the increase in the array size in the StringBuilder by setting its capacity before you start.
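Something along these lines, with a completely made-up figure; it only helps if you can roughly predict the final size and it actually fits in your heap:

// 512 * 1024 * 1024 chars is roughly 1 GB of heap, since each char takes two bytes
StringBuilder response = new StringBuilder(512 * 1024 * 1024);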
 
Stephan van Hulst
Saloon Keeper
Posts: 13261
I suggest you use an event-based/streaming API for JSON processing. Such an API processes the JSON as it comes in, without keeping the entire thing in memory.

The Jackson Streaming API is a library you might be able to use for this.
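Roughly like this, just as a sketch: responseStream here stands for whatever InputStream your REST client gives you for the gzipped body, and the classes come from com.fasterxml.jackson.core and java.util.zip.

// decompress and parse as the bytes arrive; no giant intermediate String is ever built
JsonParser parser = new JsonFactory()
        .createParser(new GZIPInputStream(responseStream));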
 
Tim Holloway
Saloon Keeper
Posts: 24295

Campbell Ritchie wrote:If you have enough memory, you may be able to avoid the increase in the array size in the StringBuilder by setting its capacity before you start.



If the incoming data is in the gigabytes, then either you'll have to construct the StringBuilder with a gigabyte-sized buffer, or you'll get killed performance-wise as the smaller buffers get upsized and data gets copied back and forth. And since, last I looked, the standard approach when a StringBuilder outgrows its buffer was to allocate the next buffer at double the current size, you could potentially OOM even if you actually only needed another 100 MB.

Stephan's idea is better. Why load the whole thing into RAM only to decant it afterwards? Aside from the obscene memory footprint and potential fragility, you can't get any I/O overlap doing it that way, so on top of everything else it would take longer.

Of course, I could go one step further and use a pre-written/pre-debugged solution. Surely at least one ETL tool out there can eat JSON and load databases.
 
Rancher
Posts: 157

Nancy Joe wrote:It is huge; it can run into MBs and GBs. After reading this data, it needs to be loaded into a DB, so it needs to be in memory.


I smell a very bad design flaw here:

Why is the data THAT big? Where does it come from and how is it generated? Why JSON? Is the GZIP transport even necessary?
If the goal is to read the data into a database, the source format should be appropriate.

I guess the JSON is generated while the data are streamed, so the best way would be to read and parse it in the same way.
 
Paul Clapham
Sheriff
Posts: 26773

Campbell Ritchie wrote:No, it doesn't need to be in memory. You will have to work out how to load the data into the DB as you read them.



It's not as complicated as that. The PreparedStatement interface has a setCharacterStream(int index, Reader reader) method which automatically tells the SQL UPDATE command to read the contents of the Reader and put them into the database. All of the billions of characters are supposed to go into a particular column of a single row in the database, correct?
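Something like this, as a rough sketch; the table and column names are invented, and connection and reader stand for your JDBC connection and the decompressed response:

import java.io.Reader;
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

public class ClobLoader {
    // the driver pulls characters from the Reader while it writes the row; we never hold the whole text
    static void store(Connection connection, Reader reader) throws SQLException {
        try (PreparedStatement ps = connection.prepareStatement(
                "INSERT INTO payloads (content) VALUES (?)")) {
            ps.setCharacterStream(1, reader);
            ps.executeUpdate();
        }
    }
}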
 
Campbell Ritchie
Marshal
Posts: 74004

Paul Clapham wrote:. . . It's not as complicated as that. . . .

I thought there would be an easy way to do it.
 
Nancy Joe
Greenhorn
Posts: 18
Thanks for all the responses. Basically, we are using an ETL tool to load the data into the DB, but the intermediate step is the REST client, which uses a Java library, and that is where I am having the problem. Once I send the JSON string to Pentaho, it does the rest.
 
Tim Holloway
Saloon Keeper
Posts: 24295
Well, as it happens, I know Pentaho DI from the inside. I think they still have some source code with my name on it. But I'm not clear on the workflow here.

Are you saying that you've got a web service that's receiving massive amounts of JSON data that then gets fed to Pentaho? Because if you are, it sounds like you need to consider modifying the API. There are all sorts of reasons why it's better to break a stream like that into batches instead of handling it as a single service transaction.
 
Nancy Joe
Greenhorn
Posts: 18
Hi Tim,

Yeah, that's what our workflow looks like, in short. Interesting that you worked on some of the Pentaho source code. Basically, in our case Pentaho makes one REST call to the service to pull tons of data (in GBs). The built-in REST client step in Pentaho did not work for us, since we get a gzipped JSON response back from the service and Pentaho did not know how to decompress it. So we created a Java library (which calls the REST service and decompresses the gzip into a JSON string) that we use in the user-defined Java step, and we pass the result back to a JSON input step in Pentaho, where it decodes the fields.

But the user-defined Java step is not able to handle so much data in memory. We are using com.fasterxml.jackson.core.JsonParser, but that also stores everything in memory when reading the JSON tree from the mapper. Is there any option other than calling the REST service in batches?

Thanks.
 
Marshal
Posts: 22449
With JsonParser you can also do streaming parsing. It works like an iterator, and only the current token needs to be in memory (well, for the parser).

The possible tokens reflect what's possible in JSON: START_OBJECT and END_OBJECT, START_ARRAY and END_ARRAY, FIELD_NAME, and the possible values: VALUE_STRING, VALUE_NUMBER_INT, VALUE_NUMBER_FLOAT, VALUE_TRUE, VALUE_FALSE and VALUE_NULL (I don't know why they didn't create VALUE_BOOLEAN instead of VALUE_TRUE and VALUE_FALSE).
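A rough sketch of the loop; what you do with each token (here just pulling out names and values) is up to you:

import com.fasterxml.jackson.core.JsonFactory;
import com.fasterxml.jackson.core.JsonParser;
import com.fasterxml.jackson.core.JsonToken;
import java.io.IOException;
import java.io.Reader;

public class TokenWalk {
    static void walk(Reader json) throws IOException {
        try (JsonParser parser = new JsonFactory().createParser(json)) {
            JsonToken token;
            while ((token = parser.nextToken()) != null) {  // one token at a time, like an iterator
                switch (token) {
                    case FIELD_NAME:
                        String field = parser.getCurrentName();
                        break;
                    case VALUE_STRING:
                        String text = parser.getText();     // hand the value off (e.g. to the DB) right away
                        break;
                    case VALUE_NUMBER_INT:
                        long number = parser.getLongValue();
                        break;
                    default:
                        break;                              // START_OBJECT, END_ARRAY, and so on
                }
            }
        }
    }
}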

However, if you can't do a similar streaming write, then this is obviously irrelevant.
 
Nancy Joe
Greenhorn
Posts: 18
Thank you. I am exploring the streaming option with the JsonToken.
 
Tim Holloway
Saloon Keeper
Posts: 24295

Nancy Joe wrote:. . . Is there any option other than calling the REST service in batches?


I worked on Pentaho because the Excel reader annoyed me. And since it was open-source, I fixed it.

A word about gzip. Unlike ZIP, a gzip "file" is not a collection of "files"; it's a single stream. If memory serves, downloading a ZIP requires waiting until the end, because the directory is located at the end of the file, but gzip is a simple stream, so you can start uncompressing from the very first byte received.

As it happens, many web clients handle gzip transparently and automatically, so you may have an issue with your web service not setting the proper MIME type and/or Content-Encoding header. As to why it's preferable to break the stream into chunks: you wouldn't wonder if you'd grown up in the days of dial-up modems, which could suddenly lose the connection 100 MB into a data stream. And even the Internet breaks occasionally.
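For what it's worth, if you do end up handling it by hand, a bare-bones sketch with java.net.HttpURLConnection might look like this (the URL is a placeholder):

import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.zip.GZIPInputStream;

public class GzipFetch {
    // ask for gzip explicitly; only wrap the stream if the client didn't decompress it for us
    static InputStream open() throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL("https://example.com/data").openConnection();
        conn.setRequestProperty("Accept-Encoding", "gzip");
        InputStream body = conn.getInputStream();
        if ("gzip".equalsIgnoreCase(conn.getContentEncoding())) {
            body = new GZIPInputStream(body);  // decompress on the fly, straight off the socket
        }
        return body;
    }
}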

Assuming, however, that you can't address flaws in the server, the next best bet is to handle them in the ETL process.

When you design a Task in Pentaho DI, you string together processing blocks, and each block in the stream operates asynchronously. So if you had, worst case, three blocks - one to make the JSON request, one to decompress the data, and one to extract fields and records out of the JSON stream - plus a fourth step or so to send the data to the appropriate database tables, then those blocks would work as much as possible in parallel, each one waiting not for the previous block to finish but only for the next record from it. And that's not counting the fact that Pentaho DI can actually run multiple instances of a block on multiple processors.

The hard part would be the head, since there are no obvious fields or records in a compressed data stream. I think their HTTP client can do the decompression internally, but I don't have the resources handy to confirm it. Worst case, spool the response to a disk file and decompress it from there. Better yet, pass it through the gunzip command-line utility (for Unix/Linux clients).

You definitely shouldn't have to write your own code to do this, though. Pentaho DI can handle everything - either internally or by calling on external system resources. And you definitely shouldn't have to store the whole ball of wax in RAM.

 
Tim Holloway
Saloon Keeper
Posts: 24295
Just a note here: as good Unix utilities do, gunzip can operate as a pipe filter, so it also doesn't have to wait for all the input before it emits the decompressed output.
 