Antonio Rodriguez

Greenhorn

Posts: 3

posted 13 years ago

I'm learning some advanced Java by trying to recreate a fortune-like program. I plan to have a GUI interface (somewhat like the freeware SmartBee). My main dilemma is lanning how to access a fortune quickly from a set of files. One solution I've encountered is creating an index file that rides alongside the fortune file, but I was hoping that there would be a more elegant solution that doesn't involve that.

The fortunes are just text files, with "%" and a carriage return used as separators. The idea is to choose randomly from all the fortunes contained in that file, or potentially choose a random fortune from a list of files. The main dilemma is that the number of fortunes is very large. One file I'm using has more than 110,000 different fortunes, so reading them all into some sort of collection isn't feasible.

Any help is deeply appreciated. A pointer to where I can find a good solution would be excellent.

[ June 24, 2004: Message edited by: Antonio Rodriguez ]

The fortunes are just text files, with "%" and a carriage return used as separators. The idea is to choose randomly from all the fortunes contained in that file, or potentially choose a random fortune from a list of files. The main dilemma is that the number of fortunes is very large. One file I'm using has more than 110,000 different fortunes, so reading them all into some sort of collection isn't feasible.

Any help is deeply appreciated. A pointer to where I can find a good solution would be excellent.

[ June 24, 2004: Message edited by: Antonio Rodriguez ]

David Moose

Greenhorn

Posts: 4

posted 13 years ago

I suggest download some free java database, load all the data into that database and use jdbc to select a fortune from that.

Tim West

Ranch Hand

Posts: 539

Darin Niard

Ranch Hand

Posts: 118

posted 13 years ago

Well, the classic "fortune" program did it this way:

* Read in fortune #1, store in variable F.

* Read in fortune #2. With probabillity 1/2, assign it to F.

* Read in #3, and with probability 1/3, assign it to F.

* In general, for fortune #N, assign it to F with probability 1/N.

If you add up the fractions you'll see that each fortune has an equal chance of being selected.

This runs in time proportional to the number of fortunes, but constant space -- you never have more than two fortunes in memory. Actually, if you compute the probability first, you can null out F before reading in the new fortune, so actually you never need more than one. This was great for early UNIX back when memory was really scarce.

The more modern GNU fortune program uses the precompiled "index" files that Antonio originally postulated; this runs in both constant time and space, clearly preferable!

* Read in fortune #1, store in variable F.

* Read in fortune #2. With probabillity 1/2, assign it to F.

* Read in #3, and with probability 1/3, assign it to F.

* In general, for fortune #N, assign it to F with probability 1/N.

If you add up the fractions you'll see that each fortune has an equal chance of being selected.

This runs in time proportional to the number of fortunes, but constant space -- you never have more than two fortunes in memory. Actually, if you compute the probability first, you can null out F before reading in the new fortune, so actually you never need more than one. This was great for early UNIX back when memory was really scarce.

The more modern GNU fortune program uses the precompiled "index" files that Antonio originally postulated; this runs in both constant time and space, clearly preferable!

Ko Ko Naing

Ranch Hand

Posts: 3178

posted 13 years ago

Nice calculation!!! Mr.Ernest is correct... How come should we put such big chunck of characters in a string object like this? I have met such kind of issue in the past before. For more info about the maximum byte that a string can contain, this thread made an interesting point for us here....

Originally posted by Ernest Friedman-Hill:

At a 100 characters or so each, two bytes per character, plus say a dozen bytes of overhead per object, that's about 23 megabytes. Funny how the definition of "not that much" has changed!

Nice calculation!!! Mr.Ernest is correct... How come should we put such big chunck of characters in a string object like this? I have met such kind of issue in the past before. For more info about the maximum byte that a string can contain, this thread made an interesting point for us here....

Co-author of SCMAD Exam Guide, Author of JMADPlus

SCJP1.2, CCNA, SCWCD1.4, SCBCD1.3, SCMAD1.0, SCJA1.0, SCJP6.0

Darin Niard

Ranch Hand

Posts: 118

posted 13 years ago

Interesting, but... if the first fortune has a 50% probability then why would it show up any less than 1/2 the time? Or, am I not understanding you correctly? As for the index files, even if it picks the 5000th entry, does it not have to run through the first 4999 in order to find it? This is something about file i/o that's never been clear to me...

posted 13 years ago

You always read the whole file, and you perform this procedure for every fortune. The first fortune thus has up to N-1 chances to be replaced -- if it's not replaced by the second one, it may be replaced by the third one, or fourth one, and so on. And once the first one is replaced, the replacement can itself be replaced, as can that replacement. So the probability of the first fortune being selected is 1/2 - 1/3 - 1/4 - 1/5 - 1/6 ... - 1/N, which is equal to, as it turns out, 1/N.

No, you don't have to. The index file would have the file offsets of beginning of each fortune. You could use java.io.RandomAccessFile, seek() to the beginning of the chosen fortune (a more or less constant-time operation) and then read the correct number of bytes (computed by subtracting two successive start indices.)

Originally posted by Darin Niard:

Interesting, but... if the first fortune has a 50% probability then why would it show up any less than 1/2 the time? Or, am I not understanding you correctly?

You always read the whole file, and you perform this procedure for every fortune. The first fortune thus has up to N-1 chances to be replaced -- if it's not replaced by the second one, it may be replaced by the third one, or fourth one, and so on. And once the first one is replaced, the replacement can itself be replaced, as can that replacement. So the probability of the first fortune being selected is 1/2 - 1/3 - 1/4 - 1/5 - 1/6 ... - 1/N, which is equal to, as it turns out, 1/N.

As for the index files, even if it picks the 5000th entry, does it not have to run through the first 4999 in order to find it? This is something about file i/o that's never been clear to me...

No, you don't have to. The index file would have the file offsets of beginning of each fortune. You could use java.io.RandomAccessFile, seek() to the beginning of the chosen fortune (a more or less constant-time operation) and then read the correct number of bytes (computed by subtracting two successive start indices.)

Darin Niard

Ranch Hand

Posts: 118

posted 13 years ago

Clever, but I think the way I've always picked a random line from a file can be slightly more efficient (n/2 average instead of n). Basically, I'd just count up to a random number between 0 and the fileLength and stop on that entry. That's what I thought of originally when he asked this question, but since it was running in a GUI, I figured he would be reusing the fortune list a lot, so speed seemed most important to him. The indexing way sounds like the best method though

[ June 24, 2004: Message edited by: Darin Niard ]

Originally posted by Ernest Friedman-Hill:

You always read the whole file, and you perform this procedure for every fortune. The first fortune thus has up to N-1 chances to be replaced -- if it's not replaced by the second one, it may be replaced by the third one, or fourth one, and so on. And once the first one is replaced, the replacement can itself be replaced, as can that replacement. So the probability of the first fortune being selected is 1/2 - 1/3 - 1/4 - 1/5 - 1/6 ... - 1/N, which is equal to, as it turns out, 1/N.

Clever, but I think the way I've always picked a random line from a file can be slightly more efficient (n/2 average instead of n). Basically, I'd just count up to a random number between 0 and the fileLength and stop on that entry. That's what I thought of originally when he asked this question, but since it was running in a GUI, I figured he would be reusing the fortune list a lot, so speed seemed most important to him. The indexing way sounds like the best method though

[ June 24, 2004: Message edited by: Darin Niard ]

posted 13 years ago

This only works

Originally posted by Darin Niard:

Clever, but I think the way I've always picked a random line from a file can be slightly more efficient. Basically, I'd just count up to a random number between 0 and the fileLength and stop on that entry.

This only works

*if*you know how many fortunes are in the file. The neat thing about the original fortune algorithm is that it doesn't require this knowledge -- it works in a single pass over never-before-seen data.

Darin Niard

Ranch Hand

Posts: 118

Antonio Rodriguez

Greenhorn

Posts: 3

posted 13 years ago

I get it. The idea is that for the first fortune to survive, it would have to be outside of the probabilities that we calculate. Consider the case in which we have 5 fortunes.

1 always gets placed in F.

2 will be placed in F 1/2 of the time. Therefore, 1 survives 1/2 of the time.

3 will be placed in F 1/3 of the time. Therefore, if 1 survived the last replacement, it would survive this replacement 2/3 of the time.

4 will be placed in F 1/4 of the time. Therefore, if 1 survived the previous replacements, it would survive this replacement 3/4 of the time.

5 will be placed in F 1/5 of the time, Therefore, if 1 survived the previous replacements, it would survive 4/5 of the time.

So the probability of 1 surviving is 1/2 * 2/3 * 3/4 * 4/5 = 1/5

Very nice programming. Kudos to the original fortune programmers, and thanks to those who responded so quickly to my question. I apologize for the late reply, but I have really sporadic Internet access.

I probably will go towards the index methodology, to get the constant space/time factor, but this was a great insight. Thank you, Mr. F-H!

Originally posted by Ernest Friedman-Hill:

[QB]

You always read the whole file, and you perform this procedure for every fortune. The first fortune thus has up to N-1 chances to be replaced -- if it's not replaced by the second one, it may be replaced by the third one, or fourth one, and so on. And once the first one is replaced, the replacement can itself be replaced, as can that replacement. So the probability of the first fortune being selected is 1/2 - 1/3 - 1/4 - 1/5 - 1/6 ... - 1/N, which is equal to, as it turns out, 1/N.QB]

I get it. The idea is that for the first fortune to survive, it would have to be outside of the probabilities that we calculate. Consider the case in which we have 5 fortunes.

1 always gets placed in F.

2 will be placed in F 1/2 of the time. Therefore, 1 survives 1/2 of the time.

3 will be placed in F 1/3 of the time. Therefore, if 1 survived the last replacement, it would survive this replacement 2/3 of the time.

4 will be placed in F 1/4 of the time. Therefore, if 1 survived the previous replacements, it would survive this replacement 3/4 of the time.

5 will be placed in F 1/5 of the time, Therefore, if 1 survived the previous replacements, it would survive 4/5 of the time.

So the probability of 1 surviving is 1/2 * 2/3 * 3/4 * 4/5 = 1/5

Very nice programming. Kudos to the original fortune programmers, and thanks to those who responded so quickly to my question. I apologize for the late reply, but I have really sporadic Internet access.

I probably will go towards the index methodology, to get the constant space/time factor, but this was a great insight. Thank you, Mr. F-H!

It is sorta covered in the JavaRanch Style Guide. |