# Strategies for processing files that exceed memory

Emilio Gagliardi

Greenhorn

Posts: 16

posted 12 years ago

Greetings Java Gurus!

This is my first post and is intended to help me get an appreciation for what I must learn to solve a problem that I have.

I must create a program that can process 2.5 billion integers (6 gigs) stored in a file. Conceptually, the file is a 2D matrix of 50,000 rows and 50,000 columns. My program needs to perform simple linear algebra on the data, and write out the results to a file. I can do this already one "row" at a time, but it will take some time to churn through the data. I am currently using FileChannels and ByteBuffers to write and read the data. But my understanding of these classes is minimal at best. (E.g., is it possible to put more than one ByteBuffer in a FileChannel? If so, can I write from N number of ByteBuffers simultaneously??)
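
Row-at-a-time reading with a FileChannel can be done with one positioned read per row. Here is a minimal sketch (the class and method names are my own, and it assumes the file stores 50,000 4-byte ints per row, row-major, with complete rows):

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;

public class RowReader {
    static final int COLS = 50_000;
    static final int ROW_BYTES = COLS * Integer.BYTES; // 200,000 bytes per row

    // Reads row 'rowIndex' into a reusable buffer and unpacks it to an int[].
    // A positioned read doesn't disturb the channel's position, so several
    // readers could share one channel.
    static int[] readRow(FileChannel channel, ByteBuffer rowBuf, long rowIndex)
            throws IOException {
        rowBuf.clear();
        long base = rowIndex * ROW_BYTES;
        while (rowBuf.hasRemaining()) {
            int n = channel.read(rowBuf, base + rowBuf.position());
            if (n < 0) break; // past end of file
        }
        rowBuf.flip();
        int[] row = new int[COLS];
        rowBuf.asIntBuffer().get(row);
        return row;
    }
}
```

On the side question: one FileChannel can happily serve many ByteBuffers, but writing from several buffers "simultaneously" won't speed up a single disk; the writes still serialize at the device.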

My two main questions are:

1) what are some common strategies for taking a large dataset and performing many small operations on it so you don't have to load the whole thing in memory? (E.g., normalize the matrix)

2) I've found some math libraries that seem to contain the methods I need, but I can't find any sample code or get in contact with the library creator. What resources, if any, are out there to get advice/help with specific programming problems? Which forums at JavaRanch should I lurk in =)

Any suggestions or advice are greatly appreciated.

Sincerely,

Ps. The math library I found is called Colt and contains many packages and classes for doing what I think I need to do, but it is way above me in terms of both the math and Java programming. I've included a link in case you're interested: Colt Libraries

[ June 02, 2004: Message edited by: Emilio Gagliardi ]


Emilio

posted 12 years ago

Well - divide et impera - as the Romans said (or was it the Greeks?).

The I/O part of the question is well placed here, but the math question could be split off and asked in the Intermediate forum.

How do you want to write simultaneously?

With different threads?

You wouldn't know the sequence in which the data gets written.

If that's acceptable (usually it isn't, but perhaps it is for you), I would run a small test with a few million bytes.

By the time it reaches the hard drive, at the latest, the file isn't written simultaneously anyway.

If the order doesn't matter, or can be restored later (by a key), you could write to different files and merge them later.
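
The write-to-several-files-then-merge idea can be sketched like this (the class name and the use of transferTo are my own choices; it assumes the part files are handed over already in key order):

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.List;

public class FileMerger {
    // Concatenates the part files, in the given order, into one output file.
    static void merge(List<Path> parts, Path out) throws IOException {
        try (FileChannel dst = FileChannel.open(out,
                StandardOpenOption.CREATE, StandardOpenOption.WRITE)) {
            for (Path part : parts) {
                try (FileChannel src = FileChannel.open(part, StandardOpenOption.READ)) {
                    long pos = 0, size = src.size();
                    while (pos < size) {
                        // transferTo lets the OS copy file-to-file without
                        // dragging every byte through a Java array.
                        pos += src.transferTo(pos, size - pos, dst);
                    }
                }
            }
        }
    }
}
```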

Perhaps you can give a few more hints about what kind of linear algebra you perform.

Is only one row involved each time?


Tim West

Ranch Hand

Posts: 539

posted 12 years ago

Hi Emilio,

My thoughts...because of the size of the data set, and the fact that many matrix operations are O(n^2) when implemented naively, I'd say you'll have to coordinate your I/O, data structure and algorithm.

The way I would approach this is to find an algorithm (well, all algorithms you intend to perform on the matrix) and then determine what data structure is best (1) for the algorithms themselves, and (2), for disk I/O and virtual memory paging.

On the second point, you'll want to find a structure whereby you load a chunk of the matrix from disk and can process that chunk completely, so you're not reading and writing different chunks to access only a fraction of the data they contain. Some algorithms may progress best row-by-row, others across the diagonals. You will find several orders of magnitude difference in speed here between implementations with well-suited and poorly-suited data organisation. And performance is one of your primary concerns - I'd warrant naive implementations will be vastly too slow.
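
As a concrete illustration of "load a chunk, finish with it, move on", here is a sketch of a sequential pass that streams the whole file through one fixed-size buffer (the names and the summing operation are mine, just to have something to compute; it assumes the file length is a multiple of 4 bytes):

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.IntBuffer;
import java.nio.channels.FileChannel;

public class ChunkedScan {
    // Streams the matrix 'rowsPerChunk' rows at a time; each chunk is fully
    // consumed before the next read, so no byte is fetched twice.
    static long sumAll(FileChannel ch, int cols, int rowsPerChunk) throws IOException {
        ByteBuffer chunk = ByteBuffer.allocateDirect(rowsPerChunk * cols * Integer.BYTES);
        long total = 0;
        while (true) {
            chunk.clear();
            int n = ch.read(chunk); // sequential read advances the channel
            if (n <= 0) break;      // end of file
            chunk.flip();
            IntBuffer ints = chunk.asIntBuffer();
            while (ints.hasRemaining()) total += ints.get();
        }
        return total;
    }
}
```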

OTOH, can you convince your boss to give you 6+ Gigs of RAM? This could be a very valid option, given the cost of your time vs. the up-front cost of the RAM.

A final note, many matrix algorithms have variants that perform much better if the matrix is "sparse". That may or may not be relevant to your case.

HTH,

--Tim

PS: I make it 10 Gigs - 5e4 x 5e4 x 4 bytes = 10^10 bytes.

[ June 02, 2004: Message edited by: Tim West ]


Emilio Gagliardi

Greenhorn

Posts: 16

posted 12 years ago

Hi Stefan and Tim; Thank you for your insights!

In my limited computing knowledge, I really don't know how to do much. Simultaneous writing to disk? I naively thought that IFF I knew that each logical chunk consisted of 50000 integers, I would know ahead of time which byte range I could allocate to an object in charge of writing and so long as the data was put into the correct region it wouldn't matter *when*. But order does matter as we need to know which 50000 integers belong to what key.
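
That byte-range idea works precisely because the records are fixed size: row k owns bytes [k * 200000, (k + 1) * 200000), so a positioned write can target that range directly and the completion order stops mattering. A sketch (class and method names are my own):

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;

public class KeyedRowWriter {
    static final int COLS = 50_000;
    static final long ROW_BYTES = (long) COLS * Integer.BYTES; // 200,000

    // Writes one row's bytes into the disjoint region owned by 'rowKey'.
    // Writers of different rows never collide, so *when* each row lands
    // is irrelevant; the key-to-offset arithmetic restores the order.
    static void writeRow(FileChannel ch, long rowKey, ByteBuffer rowData)
            throws IOException {
        long pos = rowKey * ROW_BYTES;
        while (rowData.hasRemaining()) {
            pos += ch.write(rowData, pos); // positioned write, no seek state
        }
    }
}
```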

Now, on the topic of keys, I hadn't even considered storing a key with the data. I just presumed that if the order was known ahead of time I didn't need to store a key. In terms of performance, can I do better if I don't restrict myself to a sequential I/O scheme? I have to admit that the notion of creating 50000 files on my computer scares the bajeebaz out of me :/

Tim, I'm not sure how to coordinate my I/O, data structure, and algorithm at this point because the math is still open. The first math I need to do is simple: I need to aggregate 50000 2D matrices into 50000 1D matrices. Currently, I'm just summing the contents of the cells as I go. I also need to normalize each 1D matrix, but I don't know if that means against itself or against the entire 50000 aggregated matrices. For example, one method I can use is to simply divide each cell in a 1D matrix by the largest value in that 1D matrix, or alternatively by the largest number from the 50000 aggregated matrices. Finally, I need to calculate the Euclidean distance between each pair of 1D matrices. So, I was going to do some of my bookkeeping on the ByteBuffer, as there are some utility classes that I found to operate on ByteBuffers directly, then use the underlying array as the container to pass the data to my file writing method. That's it!
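
Both normalization choices and the distance step described above fit in a few lines (the names are mine; the per-row variant divides by that row's own maximum, the overload takes a global maximum, and non-negative values are assumed):

```java
public class RowMath {
    // Per-row variant: divide each cell by this row's own maximum.
    static double[] normalize(int[] row) {
        int max = 0;
        for (int v : row) if (v > max) max = v;
        return normalize(row, max == 0 ? 1 : max); // all-zero row stays zeros
    }

    // Global variant: divide by a maximum taken over all aggregated rows.
    static double[] normalize(int[] row, int max) {
        double[] out = new double[row.length];
        for (int i = 0; i < row.length; i++) out[i] = row[i] / (double) max;
        return out;
    }

    // Euclidean distance between two equal-length rows.
    static double distance(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }
}
```

Since each of these touches a row exactly once, they combine cleanly with a row-at-a-time read loop.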

Can you explain how one can implement non-naively?

The matrices will be mostly sparse, so I guess that's good? Finally, I ran a test last night and it took 181 seconds to write 10000 1D matrices with 50000 integers each. The resulting file was almost 2 gigs. Truth be known, I have no idea if that's bad, or how bad that is...and I imagine the time will change dramatically between machines and JVMs. The computer this will ultimately run on is a G5 with dual processors and 160 gigs of hard drive space. But I have no idea about Macs at all.

Thanks again for your advice!

Cheers,

Emilio


Emilio

Tim West

Ranch Hand

Posts: 539

posted 12 years ago

Hi Emilio,

I'm not an expert on this stuff, but some more points...

First up, sparse is only an advantage if your algorithm is smart enough to utilise the sparseness. That involves some fairly heavy linear algebra though ("sparse matrix algebra" or something was a 3rd year hons-level course at uni), so I'll revoke my previous thought and recommend not considering this until you have to.

What I mean about coordinating that stuff is, you want to arrange things so that the way you access data coincides with the way the O/S will load it. So, if you're going to iterate across each row, you want to organise the data in memory and on the HDD so that you can easily load an entire row into memory at once. This way, you minimise the number of reads/writes. If you were to store data in columns, then rows, you'd be loading an entire column but only reading one value out of it each time. If your algorithm navigates the matrix in a different order, organise the data to match that order instead.
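
The stride difference is easy to see from the index arithmetic. Assuming a row-major layout of 4-byte ints (the helper below is my own):

```java
public class RowMajor {
    static final int COLS = 50_000;

    // Byte offset of element (row, col) in a row-major file of 4-byte ints.
    static long offset(long row, long col) {
        return (row * COLS + col) * Integer.BYTES;
    }
}
```

Scanning along a row advances 4 bytes per element, which is perfectly sequential; scanning down a column jumps 200,000 bytes per element, which defeats both the disk and the page cache.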

This has advantages for virtual memory and paging too - when a page (containing your data) is paged in, the whole page is used at once, so you don't have to swap continuously. It is probably best to avoid virtual memory altogether - work out how much memory your program uses, and make sure you have more RAM than that.

It sounds like the operations you perform can all be written to navigate the matrix the same way, and they're all O(n). So I'd say you'll get acceptable performance out of the whole thing...your numbers tend this way too.

I'm not sure whether you already knew everything I just wrote, but there it is anyway

Cheers

--Tim

(Edited - they're O(n) not O(1), fool!)

[ June 06, 2004: Message edited by: Tim West ]



Stan James

(instanceof Sidekick)

Ranch Hand

Posts: 8791

posted 12 years ago

One row at a time sounds like you can just read, transform, write without much trouble. If you had to relate a row to the one or two before and after, I'd look into the "circular buffer" concept: use a fixed-size array to hold a "sliding window" view of the larger file.
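
A circular buffer over rows might look like this (a sketch with my own names; row i lives at slot i mod capacity, so the last few rows stay in memory without holding the whole matrix):

```java
// Fixed-size ring of rows giving a sliding window over a larger file.
public class RowWindow {
    private final int[][] slots;

    RowWindow(int capacity) {
        slots = new int[capacity][];
    }

    // Store the freshly read row; silently evicts the row 'capacity' back.
    void push(int rowIndex, int[] row) {
        slots[rowIndex % slots.length] = row;
    }

    // Valid only while rowIndex is within 'capacity' rows of the newest push.
    int[] get(int rowIndex) {
        return slots[rowIndex % slots.length];
    }
}
```

The read loop pushes each new row, and the transform step can still reach the previous capacity-1 rows by index.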

A good question is never answered. It is not a bolt to be tightened into place but a seed to be planted and to bear more seed toward the hope of greening the landscape of the idea. John Ciardi