De-serialize files too big for memory

 
Mark Jame
Greenhorn
Posts: 28
Java Netbeans IDE VI Editor
Assume an external party provides hundreds/thousands of very large files, each containing millions of serialized objects. Assume I have no control over these files but I must de-serialize them and do not have (and never will have) enough memory.

Doing something like the following for example would run out of memory:



What can I do to safely process all the files?
 
William Brogden
Author and all-around good cowpoke
Rancher
Posts: 13078
6
Are these objects written as members of a single collection for the whole file or serialized one at a time?

What has to happen to an object once de-serialized?

Bill
 
Mark Jame
Greenhorn
Posts: 28
Java Netbeans IDE VI Editor
Assume I don't know. Why, does it make a difference?
 
Ulf Dittmer
Rancher
Posts: 42972
73
It doesn't make sense to assume that you won't know - if you don't, you can't deserialize them anyway, at least not using Java's built-in classes. You could possibly reimplement Java's serialization format in a way that doesn't need the entire file in memory. My guess is that would be a lot of work without guarantee of success.

What is the use case for deserializing large files without using an adequate machine? Or rather, the circumstances that gave rise to it?
 
Mark Jame
Greenhorn
Posts: 28
Java Netbeans IDE VI Editor
I'm just trying to learn more by thinking about problems that are not easy to solve.

At the moment I am looking at how to deal with huge numbers of objects, too many to fit into memory (pretend my memory will never be big enough) and files that are too large to fit into memory.

Perhaps I need to turn the question around.

Let's say I have a loop to create a large number (say millions) of objects (that will never fit into memory at the same time) and I have to serialize them into a single new file in each iteration (say file1, file2, file3, ...).

How can this be done if there is not enough memory to hold all the objects at once?

Then the next step (assume another program) would be my original question of how to read the files (and millions of objects) that are too large to fit into memory. Once deserialized assume I get some values from each object to use in a calculation.

Any ideas?
 
Winston Gutkowski
Bartender
Posts: 10575
66
Eclipse IDE Hibernate Ubuntu
Mark Jame wrote:At the moment I am looking at how to deal with huge numbers of objects, too many to fit into memory (pretend my memory will never be big enough) and files that are too large to fit into memory.

These sorts of problems crop up all the time; and the answer is usually simple: break the problem down.

If there's no more "natural" solution, then simply ploughing them into an array is a start - because an array has a finite size, and it's blisteringly fast.
I call it "chunking", but I suspect there's a more technical term. Another alternative is buffering, but it may or may not solve the basic issue.

If there is a "transactional" element to the problem then it's more complex; but it's still achievable. And in many cases a database may help - since it's one of the things they were designed for.

But perhaps the major question you should be asking yourself is: WHY is this data so enormous?

It sounds to me like bad design, so perhaps the best place to start is there, so you don't have to write anything to start with.

On the other hand, if it's a simple "stream feed" (eg, stock market transactions) that you need to process, then in general, the "elements" are pretty small.

I guess what I'm trying to say is: Don't get too hung up on the size (especially if it's something you can't change). Deal with the components of that size.

HIH (Don't know whether it does)

Winston
 
Mark Jame
Greenhorn
Posts: 28
Java Netbeans IDE VI Editor
Thanks for the feedback so far everyone.

I understand what you say about using a database, etc.; and of course I agree.

But my question is more about how to solve the problem assuming I cannot change certain things, so in this example the assumption is that the file I want to read and/or write will NEVER fit into memory.

Similar problems do exist; sorting very large files too big for memory, solved for example using some kind of external merge sorting algorithm.

What I am looking for is advice and/or examples of how to process (read and/or write) very large files of serialized objects that are too large for memory; this will always be a possibility that is out of my (program's) control.

Perhaps another example may help. I may have a program that polls a data directory looking for files (of serialized objects) that are created by an external program. My program has access to the API (classes only, that define the objects, a Widget class for example) so I know what the objects look like and how to get information about them (getters for example). The job of my program is to deserialize the data files, recording some information about each object then discard them. My program has no idea how big the files will be or how many objects they may contain.

So how can I design my program (and process the files and objects) to read the files and deserialize safely without running into memory issues?

 
Winston Gutkowski
Bartender
Posts: 10575
66
Eclipse IDE Hibernate Ubuntu
Mark Jame wrote:Perhaps another example may help. I may have a program that polls a data directory looking for files (of serialized objects) that are created by an external program. My program has access to the API (classes only, that define the objects, a Widget class for example) so I know what the objects look like and how to get information about them (getters for example). The job of my program is to deserialize the data files, recording some information about each object then discard them. My program has no idea how big the files will be or how many objects they may contain.

Ah, well that's much more specific, and the fact is that whether it's possible or not will likely have everything to do with whether the information you need from your de-serialized object can be determined in isolation, or needs some sort of context.

There's a big difference between, for example, deserializing a NEW Customer that was created on another server for the purposes of adding them to a centralized database, and deserializing a set of grades for an existing Student for the purposes of updating their GPA.

If the things you're deserializing are fairly simple and don't need much in the way of context in order to "record" what you need, the chances are that you can just deal with them individually in batches; but if there's anything more needed...

And this is where I stop, because the question, as it stands, is simply too vague. It could be as simple as reading lines from a file:
read line→record information→read next line... (for "line", read "object")
but it could be MUCH more complex.

Winston
 
Mark Jame
Greenhorn
Posts: 28
Java Netbeans IDE VI Editor
Don't stop yet - please!
Winston Gutkowski wrote:If the things you're deserializing are fairly simple and don't need much in the way of context in order to "record" what you need, the chances are that you can just deal with them individually in batches; but if there's anything more needed...

Eureka! That's what I want, batches! The objects are simple, I read one or two fields then discard them.

So the question is how to read a large file (of serialized objects) in batches?
 
Jesper de Jong
Java Cowboy
Sheriff
Posts: 16059
88
Android IntelliJ IDE Java Scala Spring
Just use a loop: read a few objects, do whatever needs to be done with them, repeat, until there are no more objects left to read.
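
A minimal sketch of such a loop, assuming the file was produced by repeated `writeObject` calls on a single `ObjectOutputStream`. The class name `StreamingReader` and the `handler` callback are made up for illustration; `readObject` signals the end of the file by throwing `EOFException`.

```java
import java.io.BufferedInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.io.ObjectInputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.function.Consumer;

// Hypothetical helper: reads one top-level object at a time and passes it on.
class StreamingReader {
    /** Hands every object in the file to the handler and returns how many were read. */
    static long readAll(Path file, Consumer<Object> handler)
            throws IOException, ClassNotFoundException {
        long count = 0;
        try (InputStream raw = new BufferedInputStream(Files.newInputStream(file));
             ObjectInputStream in = new ObjectInputStream(raw)) {
            while (true) {
                Object obj;
                try {
                    obj = in.readObject();   // reads exactly one top-level object
                } catch (EOFException endOfFile) {
                    break;                   // nothing left to read
                }
                handler.accept(obj);         // record what is needed, then let it go
                count++;
            }
        }
        return count;
    }
}
```

One caveat, as far as I know: an `ObjectInputStream` keeps a reference to every object it has returned until it meets a reset marker written by the producer (`ObjectOutputStream.reset()`), so memory use only stays flat on the reading side if the writer reset its stream from time to time.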
 
Junilu Lacar
Sheriff
Posts: 11481
180
Android Debian Eclipse IDE IntelliJ IDE Java Linux Mac Spring Ubuntu
  • Likes 1
If you're wanting to think outside the box, then literally think more than just one box. Distributed systems can have virtually an infinite amount of memory to work with. If you break up the task into smaller chunks and distribute the work amongst many computers you can handle huge amounts of input. Companies like Google and Amazon do this all the time. But if you're going to limit your solution to just one box, then I think your options are limited as well.
 
Mark Jame
Greenhorn
Posts: 28
Java Netbeans IDE VI Editor
Thanks again for the feedback.

Distributed processing, clustered servers, etc. obviously improve the situation, but again there will always be a case where it is not possible to fit all of the data that needs to be processed into the available memory, whether it is 1 machine or 1 million machines.

I now know I can read (or write) one object at a time, the object in question could be a single object or a larger object that is itself a collection of objects like a list.

So in terms of performance is it more efficient to read (or write) one object at a time or several?

I would assume it is better to read/write several objects at a time, in which case how do I know how many I can read/write without running out of memory?

In pseudocode for example, is there a way of doing something like the following:

 
Junilu Lacar
Sheriff
Posts: 11481
180
Android Debian Eclipse IDE IntelliJ IDE Java Linux Mac Spring Ubuntu
To be honest, I really don't see where you're going with this. While there's certainly some use in thinking about theoretical limits of computation, that's something I leave to computer scientists. Not to be snarky or anything but as a programmer, I deal with practical problems and finding pragmatic solutions to them. I just don't think about "infinitely huge objects that can't fit in memory" much nor do I think I'll ever encounter them in this lifetime so your suppositions are really just moot to me. Is there a practical basis for your line of questioning or are you just embarking on an experimental thought exercise?
 
Mark Jame
Greenhorn
Posts: 28
Java Netbeans IDE VI Editor
There are certainly cases where the preconditions to certain problems are maintained.

For example you may work on a system where one part creates files and another part reads and processes those files. In this case you know before you start to design the reading part that the size and/or number of files created are within certain limits.

Now suppose you have to design a program to read files that are created by an external system. You do not know the maximum size of the files so you want to design defensively to ensure your program will never break.

The actual size of the externally created files is irrelevant; they could be 1K or 1000TB. The point is you want a program that can process them safely (OK, it may take hours or even weeks) but it will process them without breaking.

This is the main point of my question, how to design an effective solution to this type of problem.
 
Junilu Lacar
Sheriff
Posts: 11481
180
Android Debian Eclipse IDE IntelliJ IDE Java Linux Mac Spring Ubuntu
Well then, off the top of my head, I would say that in general, streams, caching (to some kind of storage other than main memory, that is), and again, distributed computing, will come into play in a practical solution. Your pseudocode looks like something that could lead to subtasks implemented using these strategies.
 
Paul Clapham
Sheriff
Posts: 22828
43
Eclipse IDE Firefox Browser MySQL Database
Mark Jame wrote:For example you may work on a system where one part creates files and another part reads and processes those files. In this case you know before you start to design the reading part that the size and/or number of files created are within certain limits.


This is certainly the case for XML, for example. When you're processing an XML file it's often convenient to parse the whole thing into memory using a DOM parser, and then work with an in-memory tree representation of the document. But obviously this fails when the file becomes too large, and if you check out the posts in our XML forum you'll see that questions like "How can I process a 1 GB XML file?" are not uncommon.

Naturally there are things that you can do, like using a parser which streams the data rather than reading it all into memory at once, or contacting the creator and saying "1 GB? Are you @#$ kidding me?", but those are less convenient.
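
As an illustration of the streaming option, here is a small sketch using the JDK's StAX API (`javax.xml.stream`). It walks the document event by event instead of building a tree, so memory use does not grow with the size of the input. The class name `StreamingXmlCount` and the element names are invented for the example.

```java
import java.io.Reader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical example: counts elements with a given name without loading the document.
class StreamingXmlCount {
    static long countElements(Reader xml, String elementName) throws XMLStreamException {
        XMLStreamReader reader = XMLInputFactory.newInstance().createXMLStreamReader(xml);
        long count = 0;
        try {
            while (reader.hasNext()) {
                // next() advances to the next parsing event; only that event is held in memory
                if (reader.next() == XMLStreamConstants.START_ELEMENT
                        && elementName.equals(reader.getLocalName())) {
                    count++;
                }
            }
        } finally {
            reader.close();
        }
        return count;
    }
}
```

For a real file you would pass something like a `FileReader` or a buffered reader over the file rather than an in-memory string.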
 
Mark Jame
Greenhorn
Posts: 28
Java Netbeans IDE VI Editor
Thanks for the hints about parsing Junilu and Paul.

I have worked on programs that read a continuous stream of data (from several servers) every second, 24 hours a day, but this question is more about serialization.

Perhaps I am getting myself confused with the details and need to read Javadoc more but I would like to get a head start if anyone can help.

Is there a difference between (for example) serializing 3 objects (say Widgets) one at a time and serializing a list containing 3 objects, in terms of how they can be deserialized?

Are the same methods used (and which ones?) to read a file with 3 single Widget objects and a file with a list of 3 Widget objects?

 
Paul Clapham
Sheriff
Posts: 22828
43
Eclipse IDE Firefox Browser MySQL Database
  • Likes 1
An ObjectInputStream contains a series of serialized Java objects. And no, a serialized List is not the same as a series of serializations of the objects in that List. That's because in the first case there is only one serialized object and in the second there is one serialized object per list entry.
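
To make the difference concrete, here is a small sketch. `Widget` is a stand-in for the class mentioned earlier in the thread, and the other names are invented for the example. Three `writeObject` calls give a stream that needs three `readObject` calls; one `writeObject` of a list gives a stream that is consumed by a single `readObject`.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.ArrayList;
import java.util.List;

class ListVersusSeries {
    /** Hypothetical stand-in for the Widget class discussed in the thread. */
    static class Widget implements Serializable {
        private static final long serialVersionUID = 1L;
        final int id;
        Widget(int id) { this.id = id; }
    }

    /** Three separate writeObject calls: three top-level objects in the stream. */
    static byte[] writeOneAtATime() throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            for (int i = 0; i < 3; i++) {
                out.writeObject(new Widget(i));
            }
        }
        return bytes.toByteArray();
    }

    /** One writeObject call: a single top-level object, the list itself. */
    static byte[] writeAsList() throws IOException {
        List<Widget> widgets = new ArrayList<>();
        for (int i = 0; i < 3; i++) {
            widgets.add(new Widget(i));
        }
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(widgets);
        }
        return bytes.toByteArray();
    }

    /** Counts how many readObject calls succeed before the stream ends. */
    static int countTopLevelObjects(byte[] data) throws IOException, ClassNotFoundException {
        int n = 0;
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(data))) {
            while (true) {
                try {
                    in.readObject();
                    n++;
                } catch (EOFException endOfStream) {
                    return n;
                }
            }
        }
    }
}
```

With these helpers, `countTopLevelObjects(writeOneAtATime())` comes out as 3 and `countTopLevelObjects(writeAsList())` as 1: the same `readObject` method is used in both cases, but it is called a different number of times.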

But for me the bottom line is that a serialized object was once contained in a Java JVM somewhere, in memory. So your idea of there being a serialized object which can't ever be deserialized into memory doesn't hold any water. (Unless somebody writes code to simulate the Java serialization protocol and produces valid serialized objects which don't correspond to an actual object, which I don't think was part of your question.)
 
Mark Jame
Greenhorn
Posts: 28
Java Netbeans IDE VI Editor
Once again thank you.

It is not easy (for me at least) to express true intent in words alone (without tone of voice and facial expressions), I really am grateful to everyone and do not mean to be difficult, I am just trying to understand the details.

I know for example I cannot create a simple object, add it to a list and repeat until there are billions of simple objects in the list because I would run out of memory before the list is complete and before I can serialize it.

But could I create a simple object, serialize it and then repeat (billions of times) to end up with a huge file? If so, how do I then deserialize it, I assume I just read it one simple object at a time? Using readObject?

Similarly, another external machine with say ten times the amount of memory that I have may create a massive list with billions of objects (that do fit into memory) and then serialize it.

In this case, is it only possible to deserialize the whole object (a list of billions, which I do not have memory for) or can I read x number of objects from the list at a time? If I can what method(s) should I be looking at?

Thanks again all.
 
William Brogden
Author and all-around good cowpoke
Rancher
Posts: 13078
6
I know for example I cannot create a simple object, add it to a list and repeat until there are billions of simple objects in the list because I would run out of memory before the list is complete and before I can serialize it.


Your hypothetical list does not contain billions of simple objects, it contains references to billions of simple objects. There may be many or zero other references to these objects. Use of the correct terminology will help prevent conceptual errors. Seriously!

Bill
 
Paul Clapham
Sheriff
Posts: 22828
43
Eclipse IDE Firefox Browser MySQL Database
  • Likes 3
Come on, Mark, you're making this much harder than it actually is. If you serialize one object to a file, you have to deserialize that one object. If you serialize 10,000 objects to a file one at a time, you have to deserialize those 10,000 objects one at a time. In no case can you ever find yourself deserializing an object which can't fit into memory, because that object must have been in memory when it was serialized.

I think the problem is that you don't really understand what happens when an object is serialized. I had a look at the tutorials which came up in the first page of my Google search and none of them clearly said that when you serialize an object, you write out the state of that object and all the objects it contains references to, and all the objects that those objects reference, recursively forever. So what you have inside that serialized object is a tree structure of objects, or more exactly a network structure since the references don't have to form a tree. And when you deserialize that object what you get is the same network of objects that were in the original object. The tutorials all implicitly sort of say that because they show you serializing an object which contains a reference to a String, but they don't clearly point out the relevance of that fact.

So if you serialize a List containing 23 objects, let's say, it's a single object as far as serialization is concerned. But in fact it consists of a network of 24 objects because the List contains references to each of the 23 objects in the List -- and there may be more objects depending on what references those 23 objects contain. And when you deserialize that single object, you find it's a List which contains 23 objects.
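
Here is a small sketch of the "network, not tree" point. The `Item` class and the method names are made up for illustration. A list that refers to the same object twice is written with one `writeObject` call and read back with one `readObject` call, and the copy still refers to a single shared object rather than two separate ones.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.ArrayList;
import java.util.List;

class GraphRoundTrip {
    /** Hypothetical element type; any Serializable class would do. */
    static class Item implements Serializable {
        private static final long serialVersionUID = 1L;
        final String name;
        Item(String name) { this.name = name; }
    }

    /** Serializes an object graph to bytes and reads it back as a fresh copy. */
    static Object roundTrip(Object root) throws IOException, ClassNotFoundException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(root);   // one call writes the list and everything it references
        }
        try (ObjectInputStream in = new ObjectInputStream(
                new ByteArrayInputStream(bytes.toByteArray()))) {
            return in.readObject();  // one call rebuilds the whole graph
        }
    }

    /** True if two references to one Item still point at one Item after the round trip. */
    @SuppressWarnings("unchecked")
    static boolean sharedReferenceSurvives() throws IOException, ClassNotFoundException {
        Item shared = new Item("shared");
        List<Item> original = new ArrayList<>();
        original.add(shared);
        original.add(shared);        // the same object, referenced twice
        List<Item> copy = (List<Item>) roundTrip(original);
        return copy.size() == 2
                && copy.get(0) == copy.get(1)   // still one shared object in the copy
                && copy.get(0) != shared;       // but a new object, not the original
    }
}
```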
 
Mark Jame
Greenhorn
Posts: 28
Java Netbeans IDE VI Editor
My apologies for incorrect terminology.

I have looked at this a little more but have to stop for a few days, when I have some free time again I will post some example programs (mine are mixed into larger apps with several threads).

The examples I mentioned above assume I need to deserialize a very large object (that was created by an external source). I do not have time to set up my different machines at the moment, but I can simulate machines with different resources by tweaking the JVM options for maximum heap.

I can create (and have created) a very large object (a list) that I successfully serialized and then deserialized on a (simulated) machine with lots of memory.

I have then tried to deserialize that large object on a (simulated) machine with much less memory, and as I thought it ran out of memory. This was one of the examples I was trying to look into. In this case I do not believe there is any way to deserialize the file?

The other example was writing and reading many objects, this I can now do successfully; writing Integer.MAX_VALUE objects one at a time (on my machine the file was about 32GB), then reading these objects one at a time, also with a simple check to confirm they were deserialized correctly.
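
The program itself is not shown here, so the following is only a sketch of what the write-then-read test might look like. `Sample`, `writeMany`, `sumValues` and the `resetEvery` parameter are invented names. The periodic `reset()` is an assumption on my part: an `ObjectOutputStream` remembers every object it has written so that it can emit back-references, and calling `reset()` clears that memory, at the cost of writing the class description again afterwards.

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.nio.file.Files;
import java.nio.file.Path;

class OneAtATime {
    /** Hypothetical payload: a small object carrying one value. */
    static class Sample implements Serializable {
        private static final long serialVersionUID = 1L;
        final long value;
        Sample(long value) { this.value = value; }
    }

    /** Writes count objects one at a time, clearing the stream's memory every resetEvery objects. */
    static void writeMany(Path file, long count, int resetEvery) throws IOException {
        try (ObjectOutputStream out = new ObjectOutputStream(
                new BufferedOutputStream(Files.newOutputStream(file)))) {
            for (long i = 0; i < count; i++) {
                out.writeObject(new Sample(i));
                if ((i + 1) % resetEvery == 0) {
                    out.reset();   // forget previously written objects on both ends of the stream
                }
            }
        }
    }

    /** Reads the objects back one at a time and sums their values as a simple check. */
    static long sumValues(Path file) throws IOException, ClassNotFoundException {
        long sum = 0;
        try (ObjectInputStream in = new ObjectInputStream(
                new BufferedInputStream(Files.newInputStream(file)))) {
            while (true) {
                try {
                    sum += ((Sample) in.readObject()).value;
                } catch (EOFException endOfFile) {
                    return sum;    // end of file reached
                }
            }
        }
    }
}
```

For `count` objects with values 0 to `count - 1`, the sum read back should equal `count * (count - 1) / 2`, which gives a cheap way to confirm that nothing was lost.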

Thanks again for all the feedback; I'll be asking more questions soon ;)
 
Jesper de Jong
Java Cowboy
Sheriff
Posts: 16059
88
Android IntelliJ IDE Java Scala Spring
  • Likes 1
Mark Jame wrote:I have then tried to deserialize that large object on a (simulated) machine with much less memory, and as I thought it ran out of memory. This was one of the examples I was trying to look into. In this case I do not believe there is any way to deserialize the file?

In that case, there is indeed no easy way to deserialize the file - the standard Java serialization mechanism has no way to partially deserialize an object.

If you really need this, then the only way to do it would be to read the file and write your own deserialization code, but that would be a big project and you would be replicating the functionality of Java's built-in serialization framework. The stream format itself is documented in the Java Object Serialization Specification, so you would not have to reverse-engineer it, but you would also have to handle the custom serialized forms of classes such as the standard collections, and your code would break if any of those forms changed.

It would be an enormous amount of effort just to support some corner case that probably won't happen in practice, so it is most likely not worth it.
 