This week's book giveaway is in the Agile forum.
We're giving away four copies of Fixing your Scrum and have Ryan Ripley & Todd Miller on-line!
See this thread for details.
Win a copy of Fixing your Scrum this week in the Agile forum!
  • Post Reply Bookmark Topic Watch Topic
  • New Topic
programming forums Java Mobile Certification Databases Caching Books Engineering Micro Controllers OS Languages Paradigms IDEs Build Tools Frameworks Application Servers Open Source This Site Careers Other Pie Elite all forums
this forum made possible by our volunteer staff, including ...
Marshals:
  • Campbell Ritchie
  • Ron McLeod
  • Paul Clapham
  • Rob Spoor
  • Liutauras Vilda
Sheriffs:
  • Jeanne Boyarsky
  • Junilu Lacar
  • Tim Cooke
Saloon Keepers:
  • Tim Holloway
  • Piet Souris
  • Stephan van Hulst
  • Tim Moores
  • Carey Brown
Bartenders:
  • Frits Walraven
  • Himai Minh

architectural advice on iterating over large datasets

 
Greenhorn
Posts: 16
  • Mark post as helpful
  • send pies
    Number of slices to send:
    Optional 'thank-you' note:
  • Quote
  • Report post to moderator
I am working on a project where one of the goals is to allow us to compare arrays of doubles. The data is currently stored in a file as 50000*50000 doubles; think of it as 50000 vectors with 50000 elements in each.
I need to process each vector to every other vector but I don't need to process a vector against itself or repeats. Futhermore, I can't store all the vectors in memory. What I need to do is this:
OUTER LOOP
read vector[i]//all from same file
INNER LOOP
read vector[j] //all from same file
compare and store value
i + j would look like this:[0,1][0,2][0,3][0,4][1,2][1,3][1,4][2,3][2,4]...etc, progressively iterating over the file until no more unique comparisons are possible.

Now the part I don't really understand is how I can keep track of which pairs I have compared so I'm not comparing [0,0][1,0][1,1][2,0][2,1][2,2] because either they already exist (0,1 is the same as 1,0) or I don't want to make the comparison (1,1). At first i thought I would simply load all the vectors into memory, but then realized I can't. But that leaves the problem of keeping track of which ones I have compared. Part of the problem is that I need to keep track of the values I produce when I compare the vectors so that I can sort them later.

That leads me to a whole new problem of how to store the vector comparisons, but that is further down the road.

SO I called this architectural advice because at this point I'm not sure how to think about processing the data and which data structures I should investigate utilizing. One glimor of an idea is that I could use a MappedByteBuffer but I'm not really sure how to implement that to solve my problem.

Any advice or suggestions are greatly appreciated.
[ June 10, 2004: Message edited by: Emilio Gagliardi ]
 
Ranch Hand
Posts: 539
  • Mark post as helpful
  • send pies
    Number of slices to send:
    Optional 'thank-you' note:
  • Quote
  • Report post to moderator
I'm not sure from your description exactly what you're computing, but one idea - you could determine whether or not two of the vectors are identical by computing their hashcodes and storing those in memory.

Could you be a bit more specific about what you do when you compare two arrays? I couldn't work out what was supposed to be unique...but it is early in the morning here, maybe I'll wake up soon =)


--Tim
 
Emilio Gagliardi
Greenhorn
Posts: 16
  • Mark post as helpful
  • send pies
    Number of slices to send:
    Optional 'thank-you' note:
  • Quote
  • Report post to moderator
HI Tim;
I think after this project I'll owe you a beer or three

I need to calculate the Euclidian distance between each vector. The unique part comes from the fact that the distance between vector m and n is the same as the distance between vector n and m. Meaning, once I've computed the distance between any two vectors (a compared to b), I don't need to calculate the distance from the perspective of the other vector (b compared to a) nor do I need to calculate the distance between a and a!

How exactly do you compute the hashcode of an array?? That sounds promising! I'm curious about how to compute hashcodes but I don't really get what is hashed.

Cheers,
 
You showed up just in time for the waffles! And this tiny ad:
Building a Better World in your Backyard by Paul Wheaton and Shawn Klassen-Koop
https://coderanch.com/wiki/718759/books/Building-World-Backyard-Paul-Wheaton
reply
    Bookmark Topic Watch Topic
  • New Topic