# architectural advice on iterating over large datasets

Emilio Gagliardi

Greenhorn

Posts: 16

posted 12 years ago

I am working on a project where one of the goals is to allow us to compare arrays of doubles. The data is currently stored in a file as 50000*50000 doubles; think of it as 50000 vectors with 50000 elements in each.

I need to process each vector to every other vector but I don't need to process a vector against itself or repeats. Futhermore, I can't store all the vectors in memory. What I need to do is this:

OUTER LOOP

read vector[i]//all from same file

INNER LOOP

read vector[j] //all from same file

compare and store value

i + j would look like this:[0,1][0,2][0,3][0,4][1,2][1,3][1,4][2,3][2,4]...etc, progressively iterating over the file until no more unique comparisons are possible.

Now the part I don't really understand is how I can keep track of which pairs I have compared so I'm not comparing [0,0][1,0][1,1][2,0][2,1][2,2] because either they already exist (0,1 is the same as 1,0) or I don't want to make the comparison (1,1). At first i thought I would simply load all the vectors into memory, but then realized I can't. But that leaves the problem of keeping track of which ones I have compared. Part of the problem is that I need to keep track of the values I produce when I compare the vectors so that I can sort them later.

That leads me to a whole new problem of how to store the vector comparisons, but that is further down the road.

SO I called this architectural advice because at this point I'm not sure how to think about processing the data and which data structures I should investigate utilizing. One glimor of an idea is that I could use a MappedByteBuffer but I'm not really sure how to implement that to solve my problem.

Any advice or suggestions are greatly appreciated.

[ June 10, 2004: Message edited by: Emilio Gagliardi ]

I need to process each vector to every other vector but I don't need to process a vector against itself or repeats. Futhermore, I can't store all the vectors in memory. What I need to do is this:

OUTER LOOP

read vector[i]//all from same file

INNER LOOP

read vector[j] //all from same file

compare and store value

i + j would look like this:[0,1][0,2][0,3][0,4][1,2][1,3][1,4][2,3][2,4]...etc, progressively iterating over the file until no more unique comparisons are possible.

Now the part I don't really understand is how I can keep track of which pairs I have compared so I'm not comparing [0,0][1,0][1,1][2,0][2,1][2,2] because either they already exist (0,1 is the same as 1,0) or I don't want to make the comparison (1,1). At first i thought I would simply load all the vectors into memory, but then realized I can't. But that leaves the problem of keeping track of which ones I have compared. Part of the problem is that I need to keep track of the values I produce when I compare the vectors so that I can sort them later.

That leads me to a whole new problem of how to store the vector comparisons, but that is further down the road.

SO I called this architectural advice because at this point I'm not sure how to think about processing the data and which data structures I should investigate utilizing. One glimor of an idea is that I could use a MappedByteBuffer but I'm not really sure how to implement that to solve my problem.

Any advice or suggestions are greatly appreciated.

[ June 10, 2004: Message edited by: Emilio Gagliardi ]

Emilio

Tim West

Ranch Hand

Posts: 539

posted 12 years ago

I'm not sure from your description exactly what you're computing, but one idea - you could determine whether or not two of the vectors are identical by computing their hashcodes and storing those in memory.

Could you be a bit more specific about what you do when you compare two arrays? I couldn't work out what was supposed to be unique...but it is early in the morning here, maybe I'll wake up soon =)

--Tim

Could you be a bit more specific about what you do when you compare two arrays? I couldn't work out what was supposed to be unique...but it is early in the morning here, maybe I'll wake up soon =)

--Tim

Emilio Gagliardi

Greenhorn

Posts: 16

posted 12 years ago

HI Tim;

I think after this project I'll owe you a beer or three

I need to calculate the Euclidian distance between each vector. The unique part comes from the fact that the distance between vector m and n is the same as the distance between vector n and m. Meaning, once I've computed the distance between any two vectors (a compared to b), I don't need to calculate the distance from the perspective of the other vector (b compared to a) nor do I need to calculate the distance between a and a!

How exactly do you compute the hashcode of an array?? That sounds promising! I'm curious about how to compute hashcodes but I don't really get what is hashed.

Cheers,

I think after this project I'll owe you a beer or three

I need to calculate the Euclidian distance between each vector. The unique part comes from the fact that the distance between vector m and n is the same as the distance between vector n and m. Meaning, once I've computed the distance between any two vectors (a compared to b), I don't need to calculate the distance from the perspective of the other vector (b compared to a) nor do I need to calculate the distance between a and a!

How exactly do you compute the hashcode of an array?? That sounds promising! I'm curious about how to compute hashcodes but I don't really get what is hashed.

Cheers,

Emilio