• Post Reply Bookmark Topic Watch Topic
  • New Topic

Reading and Writing from Large File, OutOfMemory  RSS feed

 
Anant Jagania
Ranch Hand
Posts: 49
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
Hi All,

I have a directory which contains large files ~1GB to ~5GB. Recursively i pick up the files and from each file I look for a particular string or sequence of char and create files. The string is the starting and ending point of the sub file which i want to create.

I read some amount of bytes at a time and work on those bytes. However, I am getting OutOfMemory exception after some time. I have set Xmx to 2048. I try to read less number of bytes but then program gets very slow. I have added System.gc() and System.runFinalization() in between the object creation.

I am using RandomFileAccess.

Please let me know how I can over come the problem of OutOfMemory and program to run little faster.

Thanks
Anant
 
Jesper de Jong
Java Cowboy
Sheriff
Posts: 16060
88
Android IntelliJ IDE Java Scala Spring
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
Can you show us the source code? You probably have a memory leak somehwere, but it's very hard (or actually, impossible) to debug and point out the exact problem without seeing the source code.

You could try using a profiler, for example JVisualVM (which is included with JDK 6) to find out what is eating up all the memory.
 
Anant Jagania
Ranch Hand
Posts: 49
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
Hi Jasper,

Thanks for the reply.

Here is the code.


Please let me know if anything required.
[edit[Break // comments and change to /* */ because the lines in code are too long for screen width. CR[/edit]
 
David Newton
Author
Rancher
Posts: 12617
IntelliJ IDE Ruby
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
I find that code rather difficult to read, and the comments do more harm than good :(
 
Anant Jagania
Ranch Hand
Posts: 49
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
Hi David,

I have modified the comments.

I apologize for inconvenience.

If the code or comments still do not make sense, please let me know.. i will refactor it again.

Thanks
Anant
 
Wouter Oet
Bartender
Posts: 2700
IntelliJ IDE Opera
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
I agree with David. The code is hard to read. I would refactor/rewrite it to use a more object oriented approach and introduce a lot of new methods. Because a single method of 190 lines of code is hard to comprehend. The calls to System.gc() are unnecessary because the garbage collector is guaranteed to run before a OutOfMemoryError is thrown by the JVM.
 
Anant Jagania
Ranch Hand
Posts: 49
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
Hey Guys,

Sorry for being late. Here is the complete code. It has got more comments as well. I have refactored little. Please let me know if this is still not readable.


 
Paul Clapham
Sheriff
Posts: 22832
43
Eclipse IDE Firefox Browser MySQL Database
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
I would use the profiler too. But I don't see why you need a 100 MB buffer to work in. In most situations a buffer size of somewhere around 8 KB is quite sufficient.

And I wouldn't do my own buffering like that. I'm not quite sure what rules you're using to filter the data but it seems to me that you're looking for the string "ABCD" as some kind of sentinel. I would rewrite the code to read the data sequentially using an ordinary FileInputStream. Set up a buffer which always contains the last four bytes read. Then you can compare that buffer to "ABCD" and act accordingly with respect to whether to write data.
 
David Newton
Author
Rancher
Posts: 12617
IntelliJ IDE Ruby
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
<offtopic>
Way, *way* too many comments. If single lines of code require a paragraph of comments, something has gone wrong.
</offtopic>
 
Campbell Ritchie
Marshal
Posts: 56546
172
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
Also the comments (don't use // on more than one line) and some of the code is in lines too long to read on screen.
 
Jesper de Jong
Java Cowboy
Sheriff
Posts: 16060
88
Android IntelliJ IDE Java Scala Spring
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
Some remarks:

Line 57, 59: You never need to call new String("literal"). Just write "literal" - it makes no sense to create a separate new String object.

If I see this right, when the program doesn't find the sequence ABCD anywhere in the file, it will try to append the whole file to your ByteArrayOutputStream. That could easily cause an OutOfMemoryError if the file is very large.
 
Ernest Friedman-Hill
author and iconoclast
Sheriff
Posts: 24217
38
Chrome Eclipse IDE Mac OS X
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
Also, 131072 * 100 is not 100MB, it's about 1.25 MB. Just an example of what was meant by "the comments hurt more than they help".
 
Anant Jagania
Ranch Hand
Posts: 49
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
This is what I have to achieve.

There are big files and each file is made with joining many other files. The starting of nested file is ABCD. So I need to go through each byte and see if ABCD comes in the sequence. Once i find it, i need to start writing the bytes to files until i find another ABCD.


Paul, I can do the way you have mentioned but that makes my program very slow. And i have files with the size of 5GB as well. Reading 8/4 bytes at time will be a very slow progress. I will try to use profiler and see what they have to say.

David, Campbell,

I have written comments to get better understanding of the algorithm which i have implemented in the program. I can still shorten it. I will do it and post the code again.

Jasper,

Line 57, 59 are just for testing I have kept. Later on these strings will be passed as arguments to the program from the console.
The program doesn't append the whole file to ByteArrayOutputStream when we do not find ABCD. It will only append the current byte which is read and if any bytes A,AB,ABC where found before the current byte (given that current byte is not D).

File will be written/appended only when we find ABCD and RandomFileAccess object is not null. All the bytes written in BinaryArrayOutputStream will be dumped into RandomAccessFile.

Thanks everybody, I really appreciate your valuable time spending on this post.

I will redo the code/comments and post it again.

Regards,
Anant
 
Anant Jagania
Ranch Hand
Posts: 49
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
Ernest , Also, 131072 * 100 is not 100MB, it's about 1.25 MB. Just an example of what was meant by "the comments hurt more than they help".


My apology.

Thanks for pointing this out.

I have corrected back the comments and number of bytes to read (MAXREAD) in the code.

Please have a look at it.

 
Campbell Ritchie
Marshal
Posts: 56546
172
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
Please don't post the code again. My point was not that the comments were too long, but that they were
allwrittenasasinglelinewhi . . . wthescrollbarsappear.

See next post for what it ought to have said
 
Campbell Ritchie
Marshal
Posts: 56546
172
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
I myself wrote: . . .
all written as a single line which nobody could read because it is too wide for the screen even here at work where I have a screen with 1440px from left to right so it is really difficult to read on an ordinary screen or with large writing in fact it is a good idea to keep all line lengths below 80 keystrokes to make it easier to read and I look forward to Maneesh Godbole posting UseRealWords but you will see how the scroll bars appear.
. . . or would have had I UsedRealWords.
 
David Newton
Author
Rancher
Posts: 12617
IntelliJ IDE Ruby
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
Well, I think the comments are too long, and don't add to our understanding. They might add to *yours*, but then I'd recommend taking them out so they don't confuse *us*, the folks trying to understand your code.The method is named wrong. With the proper name, no comment would be necessary. It also has a side effect, which makes the naming somewhat more difficult, but if it was clear that all file pointer manipulation methods also updated the numberOfData (which is also not a great name), it wouldn't be necessary to include it.

Rather than rename the method, why not just pass in the actual pointer manipulation you want?Otherwise it'd have to be called moveFilePointerBack or something, and it just seems redundant.Here you've described much of what's already known about RandomFileAccess, and again, you are forced to make a comment that would be unnecessary if the method was named appropriately.Here the comment describes precisely what the code does, but takes 2.5x as much text as just the code--this code is self-explanatory. The comment adds unnecessary cognitive overhead.

These same meta-comments apply to much of the code: the code speaks louder than the comments. The code is the ultimate arbiter of what's happening. Comments are difficult to keep in sync with code, because they cannot be (trivially) mechanically checked for correctness. Naming, refactoring, and common sense do more to aid readability than a paragraph for each line of code.

(This is obviously just my opinion, and other folks may feel differently.)
 
Andrew Monkhouse
author and jackaroo
Marshal Commander
Posts: 12156
256
C++ Firefox Browser IntelliJ IDE Java Mac Oracle
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
How many times do you create a new RandomAccessFile and assign it to fos? And how many times do you close the old streams?

You would probably benefit greatly by moving most of your instance variables to method variables. It will then become more obvious as to when they are not being used correctly.

You would probably also benefit from renaming methods and variables (normally if I see a variable named fos, I expect it to be a FileOutputStream - not a RandomAccessFile (often named raf)).
Neither fos nor raf are good descriptions of what the file is being used for though.
 
  • Post Reply Bookmark Topic Watch Topic
  • New Topic
Boost this thread!