# Using a file as a set.

Steven Bell

Ranch Hand

Posts: 1071

posted 10 years ago

Hi all.

I need a Set implementation that can hold roughly 2.4 trillion items. I'm guessing the best way to do that is write a set that is back by a File (which is why I'm posting in this forum).

I'll basically be inserting all the items into the Set, and then reading them out and populating either another of the same kind of Set (for long term storage, a master list) or a database.

I'm looking for advise as the to best way to back the Set with a File. I'm also looking for sorted order, the data has a natural sort order I'll use, so it will wind up being some kind of tree set.

Any help would be appreciated, or it you think I'm on the wrong track let me know.

I need a Set implementation that can hold roughly 2.4 trillion items. I'm guessing the best way to do that is write a set that is back by a File (which is why I'm posting in this forum).

I'll basically be inserting all the items into the Set, and then reading them out and populating either another of the same kind of Set (for long term storage, a master list) or a database.

I'm looking for advise as the to best way to back the Set with a File. I'm also looking for sorted order, the data has a natural sort order I'll use, so it will wind up being some kind of tree set.

Any help would be appreciated, or it you think I'm on the wrong track let me know.

Jeff Albertson

Ranch Hand

Posts: 1780

posted 10 years ago

You do realize that even if you can read and process a million items a second, that processing 2.4 trillion items will take

*667 hours?*And that even at one byte per item, that one file would be 2.4 terabytes in size, pushing the limits of both consumer disk storage and some modern operating systems? I'm not saying this is not possible, I'm just saying that this is a very serious problem that will need some cutting-edge development to get working.
Stan James

(instanceof Sidekick)

Ranch Hand

Ranch Hand

Posts: 8791

posted 10 years ago

I think a file based system would just end up reinventing the database, or at least a very complicated indexed file. I remember enough mainframe VSAM internals to talk through some of the issues and challenges, but I consider it just enough to know better than to try it. The book for the Turbo Pascal Database Toolbox is a great overview of B-Tree indexed files.

A good question is never answered. It is not a bolt to be tightened into place but a seed to be planted and to bear more seed toward the hope of greening the landscape of the idea. John Ciardi

Edwin Dalorzo

Ranch Hand

Posts: 961

Steven Bell

Ranch Hand

Posts: 1071

posted 10 years ago

I do understand the size of the problem, although I didn't think to calculate the disk space needed, I stopped at the 'This just isn't going to fit in memory' stage. Maybe a little more information would help.

I'm working on a program that generates Sudoku solutions (it doesn't generate puzzles or solve puzzles, just solutions). Each valid solution can be modified to create a new valid solution. The possible permutations run just over 2.4 trillion. Once I get the algorithms all worked out (which I'm about half way there), I plan to break things up and have the work done in a grid system. Currently I can generate around 1 million permutations in 5 seconds.

I need the set restraints as a validation that the permutations have run correctly. I know exactly how many unique permutations there are, and if the numbers don't match I have a problem.

This all started when I had the idea that there are only a small number of unique patterns in the solution of sudoku puzzles. Not sure if that's true yet, but that's part of what I want to find out.

Considering the disk size issue, maybe I need to look at some sort of compressed storage.

P.S. For the interested the permutations run as follows:

Each set of row (1-3, 4-6, 7-9) can be swaped around without invalidating the puzzle, same with columns. Also each section can be swaped around, same with each column section. Then you can swap any two (or more) symbols and still have a valid puzzle (swap all the 2's and 4's for example). Also the puzzle can be rotated

This leads to 3!*3!*3!*3! = row permutations, same for column permutations

4 roational positions.

9! symbol swaps.

1296*1296*4*362880 = 2,437,996,216,320

I'm working on a program that generates Sudoku solutions (it doesn't generate puzzles or solve puzzles, just solutions). Each valid solution can be modified to create a new valid solution. The possible permutations run just over 2.4 trillion. Once I get the algorithms all worked out (which I'm about half way there), I plan to break things up and have the work done in a grid system. Currently I can generate around 1 million permutations in 5 seconds.

I need the set restraints as a validation that the permutations have run correctly. I know exactly how many unique permutations there are, and if the numbers don't match I have a problem.

This all started when I had the idea that there are only a small number of unique patterns in the solution of sudoku puzzles. Not sure if that's true yet, but that's part of what I want to find out.

Considering the disk size issue, maybe I need to look at some sort of compressed storage.

P.S. For the interested the permutations run as follows:

Each set of row (1-3, 4-6, 7-9) can be swaped around without invalidating the puzzle, same with columns. Also each section can be swaped around, same with each column section. Then you can swap any two (or more) symbols and still have a valid puzzle (swap all the 2's and 4's for example). Also the puzzle can be rotated

This leads to 3!*3!*3!*3! = row permutations, same for column permutations

4 roational positions.

9! symbol swaps.

1296*1296*4*362880 = 2,437,996,216,320

posted 10 years ago

[ May 10, 2006: Message edited by: Paul Clapham ]

Time to bring out the envelope again so we can do some calculations on the back of it. So 2.4 trillion permutations is 2.4 million times that, or 12 million seconds. That's somewhere over 3,000 hours -- close to 20 weeks in fact.Currently I can generate around 1 million permutations in 5 seconds.

[ May 10, 2006: Message edited by: Paul Clapham ]

Steven Bell

Ranch Hand

Posts: 1071

posted 10 years ago

That's why I said I'll be moving the processing to a grid computing system. This project is born part of the sudoku habit and part of my desire to play with grid computing.

And I'm not overly concerned with the time issue here. This is just a fun project for me.

P.S. The 2.4 trillion is just the permutations from one valid solution. There will many valid solutions. That number is actully one of my goals here.

[ May 10, 2006: Message edited by: Steven Bell ]

Originally posted by Paul Clapham:

Time to bring out the envelope again so we can do some calculations on the back of it. So 2.4 trillion permutations is 2.4 million times that, or 12 million seconds. That's somewhere over 3,000 hours -- close to 20 weeks in fact.

[ May 10, 2006: Message edited by: Paul Clapham ]

That's why I said I'll be moving the processing to a grid computing system. This project is born part of the sudoku habit and part of my desire to play with grid computing.

And I'm not overly concerned with the time issue here. This is just a fun project for me.

P.S. The 2.4 trillion is just the permutations from one valid solution. There will many valid solutions. That number is actully one of my goals here.

[ May 10, 2006: Message edited by: Steven Bell ]

Steven Bell

Ranch Hand

Posts: 1071

posted 10 years ago

I'm thinking I'm solving the wrong problem here.

Rather than finding a solution, computing all it's permutations and storing them, what I need to do is search for solutions and test them against previously found solutions to see if they are a permuataion or a new solution. It'll be much easier to manage memory resources, and there is no need to store permutations, only the unique solutions. I think this will also make it easier to break the work up into small units.

The only downside is there would be a fair amount of duplicate work done processing permutations to test new solutions.

Rather than finding a solution, computing all it's permutations and storing them, what I need to do is search for solutions and test them against previously found solutions to see if they are a permuataion or a new solution. It'll be much easier to manage memory resources, and there is no need to store permutations, only the unique solutions. I think this will also make it easier to break the work up into small units.

The only downside is there would be a fair amount of duplicate work done processing permutations to test new solutions.

posted 10 years ago

Well, do have fun. But keep the back of that envelope handy. From what I've seen, there does seem to be a small number of patterns in published Sudoku puzzles, but maybe that's more because it's easy to generate them that way. But to get there, you're going to have to go past enumerating all possible solutions and start enumerating all solvable problems.And I'm not overly concerned with the time issue here. This is just a fun project for me.