Win a copy of Spring in Action (5th edition) this week in the Spring forum!

- Post Reply Bookmark Topic Watch Topic
- New Topic

programming forums
Java
Mobile
Certification
Databases
Caching
Books
Engineering
Micro Controllers
OS
Languages
Paradigms
IDEs
Build Tools
Frameworks
Application Servers
Open Source
This Site
Careers
Other
all forums

this forum made possible by our volunteer staff, including ...

Marshals:

- Campbell Ritchie
- Bear Bibeault
- Devaka Cooray
- Liutauras Vilda
- Jeanne Boyarsky

Sheriffs:

- Knute Snortum
- Junilu Lacar
- paul wheaton

Saloon Keepers:

- Ganesh Patekar
- Frits Walraven
- Tim Moores
- Ron McLeod
- Carey Brown

Bartenders:

- Stephan van Hulst
- salvin francis
- Tim Holloway

posted 12 years ago

I want to generate random strings, of random length (up to some arbitrary maximum), where the probability of any possible string is the same.

Sounds simple, huh? Well, maybe it is, but I can't think of an easy way.

My current algorithm is: -

1. Randomly choose the length, N, between zero and the maximum, M.

2. Initialise storage for N characters.

3. Randomly choose each of the N characters, from the C possible characters.

This does *not* meet the requirements, because a particular short string is much more likely to come up than a long string.

I can think of a most-bogus solution, along the lines of Bogosort. That is: compute all the possible strings, then choose one at random. However, the memory and computational costs of this are ridiculous.

Any ideas?

Sounds simple, huh? Well, maybe it is, but I can't think of an easy way.

My current algorithm is: -

1. Randomly choose the length, N, between zero and the maximum, M.

2. Initialise storage for N characters.

3. Randomly choose each of the N characters, from the C possible characters.

This does *not* meet the requirements, because a particular short string is much more likely to come up than a long string.

I can think of a most-bogus solution, along the lines of Bogosort. That is: compute all the possible strings, then choose one at random. However, the memory and computational costs of this are ridiculous.

Any ideas?

Betty Rubble? Well, I would go with Betty... but I'd be thinking of Wilma.

posted 12 years ago

Sounds like you can't exceed the number of possible strings of the minimum length. So if we only use lower case alpha characters and the minimum length is 1 we have to choose from a set of 26 strings per possible length. If your upper length is 2 we have 42 strings to choose from.

get the number n of possible strings at minimum length

seed the RNG using n - must be repeatable to maintain the odds, no?

pick a random string index in range of 1, n

pick a random length

generate n strings

return the nth string

Does that makes sense?

get the number n of possible strings at minimum length

seed the RNG using n - must be repeatable to maintain the odds, no?

pick a random string index in range of 1, n

pick a random length

generate n strings

return the nth string

Does that makes sense?

A good question is never answered. It is not a bolt to be tightened into place but a seed to be planted and to bear more seed toward the hope of greening the landscape of the idea. John Ciardi

Peter Chase

Ranch Hand

Posts: 1970

posted 12 years ago

Thanks for your interest.

Well, not if I understand correctly what you're saying.

The number of possible strings of length L, where each character is one of C characters is C^L. That's C to the power of L, not C times L as you seem to be saying. So, even with only 26 characters, the number of possible strings rapidly becomes comically huge, even with relatively modest values of C and L. For instance, 26^6 is 308915776; that's a lot of strings.

In my actual application (this is not just a puzzle for fun), C is 94 and Lmax is 300!

If I have misunderstood what you're saying, then perhaps you could post some more-detailed pseudo-code (or real code) to illustrate your solution.

[ January 23, 2006: Message edited by: Peter Chase ]

Originally posted by Stan James:

Does that makes sense?

Well, not if I understand correctly what you're saying.

The number of possible strings of length L, where each character is one of C characters is C^L. That's C to the power of L, not C times L as you seem to be saying. So, even with only 26 characters, the number of possible strings rapidly becomes comically huge, even with relatively modest values of C and L. For instance, 26^6 is 308915776; that's a lot of strings.

In my actual application (this is not just a puzzle for fun), C is 94 and Lmax is 300!

If I have misunderstood what you're saying, then perhaps you could post some more-detailed pseudo-code (or real code) to illustrate your solution.

[ January 23, 2006: Message edited by: Peter Chase ]

Betty Rubble? Well, I would go with Betty... but I'd be thinking of Wilma.

posted 12 years ago

You must choose length N with probability C^N * (C-1) / (C^(M+1) - 1).

Stan James

(instanceof Sidekick)

Posts: 8791

posted 12 years ago
A good question is never answered. It is not a bolt to be tightened into place but a seed to be planted and to bear more seed toward the hope of greening the landscape of the idea. John Ciardi

My C*(LMax-LMin+1) (26*2) was to allow 26 1-character strings, and 26 2-character strings. Seeding the RNG for repeatability and starting from the first random every time ensures that we will create only 26 unique 2-character strings, not C^L random possibilities. That keeps the probability of getting a particular string a consistent 1/42 whether length is 1 or 2. Did I correctly read that as the goal?

LMax goes out of the equation because LMin is the constraint, but for your numbers I'd have to be prepared to get 94^LMin random numbers which would probably run too long at 100% CPU.

Keep in mind: Music major here. This was fun but probably just passed my math skills.

LMax goes out of the equation because LMin is the constraint, but for your numbers I'd have to be prepared to get 94^LMin random numbers which would probably run too long at 100% CPU.

Keep in mind: Music major here. This was fun but probably just passed my math skills.

posted 12 years ago

If you would like a truely random string... I would suggest that you ask someone like to to spell a number of words and what is produced can guaranteed to be nothing but random letters of an arbituary length.

posted 12 years ago

Actually, you don't have to compute all the possible strings. You can choose a random number and compute the string based on that number.

For example, if you had to generate a random lower case string of max length 2, then there are 702 possible strings. Strings 0-25 would be single character string a,b,c,d,....z. String 26 - 701 would be 2 character strings aa,ab,ac,ad....zz. So, you select a random number from 0-701, and use number to figure out the string.

For example, if your random number generator gave you 523, you follow these steps

a) 523 mod 26 = 3

b) character 3 is d, so, next character is d, random string = d

c) 523/26 = 20

d) 20 mod 26 = 20

e) 20 character is t, so next character is t, random string = td

f) 20/26 = 0, so stop processing

If you want random strings of max length 3, you use similar logic except that you generate a random number between 0-18271, because the number of combinations are 18272. You don't have to limit yourself to 26 characters. You can use any number of characters. You will have to compute the number of combinations based on the number of characters

Originally posted by Peter Chase:

I can think of a most-bogus solution, along the lines of Bogosort. That is: compute all the possible strings, then choose one at random. However, the memory and computational costs of this are ridiculous.

Any ideas?

Actually, you don't have to compute all the possible strings. You can choose a random number and compute the string based on that number.

For example, if you had to generate a random lower case string of max length 2, then there are 702 possible strings. Strings 0-25 would be single character string a,b,c,d,....z. String 26 - 701 would be 2 character strings aa,ab,ac,ad....zz. So, you select a random number from 0-701, and use number to figure out the string.

For example, if your random number generator gave you 523, you follow these steps

a) 523 mod 26 = 3

b) character 3 is d, so, next character is d, random string = d

c) 523/26 = 20

d) 20 mod 26 = 20

e) 20 character is t, so next character is t, random string = td

f) 20/26 = 0, so stop processing

If you want random strings of max length 3, you use similar logic except that you generate a random number between 0-18271, because the number of combinations are 18272. You don't have to limit yourself to 26 characters. You can use any number of characters. You will have to compute the number of combinations based on the number of characters

posted 12 years ago

you're gonna have a problem here... a 20 character string is going to require a random number between 0 and 2,432,902,008,176,640,000. since the OP said "arbitrary length", i would assume 20, 50 or even 100 it not out of the question.

or, wait... i think i did 20!, i should have done 26^20, which gives us

19,928,148,895,209,409,152,340,197,376

[ January 25, 2006: Message edited by: fred rosenberger ]

or, wait... i think i did 20!, i should have done 26^20, which gives us

19,928,148,895,209,409,152,340,197,376

[ January 25, 2006: Message edited by: fred rosenberger ]

There are only two hard things in computer science: cache invalidation, naming things, and off-by-one errors

Jayesh Lalwani

Ranch Hand

Posts: 502

posted 12 years ago

Use multiple random numbers then. Alternatively, generate two-10 character segments then. Or four-5 character segments. Introduce the possibility of segments of zero length.

Fred, I agree with you at a theoritical level. At some point, a randomly generated dictionary would be more cost-effective than a true random string generator.

Originally posted by fred rosenberger:

you're gonna have a problem here... a 20 character string is going to require a random number between 0 and 2,432,902,008,176,640,000. since the OP said "arbitrary length", i would assume 20, 50 or even 100 it not out of the question.

or, wait... i think i did 20!, i should have done 26^20, which gives us

19,928,148,895,209,409,152,340,197,376

[ January 25, 2006: Message edited by: fred rosenberger ]

Use multiple random numbers then. Alternatively, generate two-10 character segments then. Or four-5 character segments. Introduce the possibility of segments of zero length.

Fred, I agree with you at a theoritical level. At some point, a randomly generated dictionary would be more cost-effective than a true random string generator.

Peter Chase

Ranch Hand

Posts: 1970

posted 12 years ago

For the record, this one sounds the most feasible so far. I'm not sure that I quite see what N, M and C are, but guess I could work it out.

Originally posted by Paul Clapham:

You must choose length N with probability C^N * (C-1) / (C^(M+1) - 1).

For the record, this one sounds the most feasible so far. I'm not sure that I quite see what N, M and C are, but guess I could work it out.

Betty Rubble? Well, I would go with Betty... but I'd be thinking of Wilma.

George Harris

Ranch Hand

Posts: 84

posted 12 years ago

How about if you were to loop though max size C. You then get a random number (0 to number of possible characters) where 0 would mean you skip that position, and then give the result. For example in for a max string length of 4...

2 . 16 . 1 . 0 would give the string 'bpa'

5 . 12 . 8 . 24 would give 'elhx'

It would also require that if the string had an embedded 0,

2 . 0 . 16 . 1 the string would be rejected and a restart was required.

2 . 0 . 0 . 0 would be valid however.

[ January 25, 2006: Message edited by: George Harris ]

2 . 16 . 1 . 0 would give the string 'bpa'

5 . 12 . 8 . 24 would give 'elhx'

It would also require that if the string had an embedded 0,

2 . 0 . 16 . 1 the string would be rejected and a restart was required.

2 . 0 . 0 . 0 would be valid however.

[ January 25, 2006: Message edited by: George Harris ]

Peter Chase

Ranch Hand

Posts: 1970

posted 12 years ago

George H: I think you've got a reasonable solution there. Takes a bit of thinking through, but I think it works. Not sure whether I prefer this solution, or the probability distribution one given earlier. Yours is more ingenious, certainly.

Betty Rubble? Well, I would go with Betty... but I'd be thinking of Wilma.

posted 12 years ago

However if C is 94, then the probability of choosing a string of anything less than the maximum length M (or Lmax as you called it later) is almost exactly 1/94 (regardless of what M is) and the probability of choosing a string of the maximum length is 93/94. So you might simplify the problem to only choose strings of length M -- a major simplification.

They are what you said they were in your original question.Originally posted by Peter Chase:

For the record, this one sounds the most feasible so far. I'm not sure that I quite see what N, M and C are, but guess I could work it out.

However if C is 94, then the probability of choosing a string of anything less than the maximum length M (or Lmax as you called it later) is almost exactly 1/94 (regardless of what M is) and the probability of choosing a string of the maximum length is 93/94. So you might simplify the problem to only choose strings of length M -- a major simplification.

posted 12 years ago
There are only two hard things in computer science: cache invalidation, naming things, and off-by-one errors

I have not done any math, but you mayhave a problem with George's solution...

if i undertand it correctly, he's saying whatever your max length is, generate a random string the full length. then, scan it to see if there is an imbedded 0.

for large max lengths, there are going to be a LOT of strings you'll have to reject, i think. it may take a while to find a valid one.

i'm not SURE of this, it's just my intuition (whatever that's worth...).

if i undertand it correctly, he's saying whatever your max length is, generate a random string the full length. then, scan it to see if there is an imbedded 0.

for large max lengths, there are going to be a LOT of strings you'll have to reject, i think. it may take a while to find a valid one.

i'm not SURE of this, it's just my intuition (whatever that's worth...).

Jayesh Lalwani

Ranch Hand

Posts: 502

posted 12 years ago

I beleive you are right

For a string of length 3

Total number of combinations = 27^3 = 19683

Total number of valid combintaions = 18278

Total number of invalid combos = 1405 = 7%

For string of length 5

Total = 14348907

Valid = 12356630

Invalid = 1992277 = 13%

It keeps on growing. For string of length 20, 51.11% of the combos will be invalid. For string of length 63, 90.35% of the combos will be invalid.

Originally posted by fred rosenberger:

I have not done any math, but you mayhave a problem with George's solution...

if i undertand it correctly, he's saying whatever your max length is, generate a random string the full length. then, scan it to see if there is an imbedded 0.

for large max lengths, there are going to be a LOT of strings you'll have to reject, i think. it may take a while to find a valid one.

i'm not SURE of this, it's just my intuition (whatever that's worth...).

I beleive you are right

For a string of length 3

Total number of combinations = 27^3 = 19683

Total number of valid combintaions = 18278

Total number of invalid combos = 1405 = 7%

For string of length 5

Total = 14348907

Valid = 12356630

Invalid = 1992277 = 13%

It keeps on growing. For string of length 20, 51.11% of the combos will be invalid. For string of length 63, 90.35% of the combos will be invalid.

posted 12 years ago

The time required for George's solution does indeed increase dramatically with string length. However even at 100 chars, I can generate 10000 of them in about 5 seconds on my ancient 500 MHz Pentium III machine. A length of 150 takes 30 sec, and length 200 takes 3 minutes. Obviously it's going to get worse and worse, but remember, those numbers are all for 10000 of the things. And how long do the strings really need to be, anyway? This is using uppercase chars only. Time goes down considerably if we include uppercase as well, since that decreases the chance of an invalid string at each character position. 10000 200-char strings take under 10 seconds. The number of strings shorter than max length stays around 1/27 for uppercase-only, and 1/53 for mixed case. As expected.

"I'm not back." - Bill Harding, *Twister*

Peter Chase

Ranch Hand

Posts: 1970

posted 12 years ago

In the real application, there are 94 characters to choose from and a maximum length of 300. Having 94, rather than 26, characters, makes things a lot better.

I reckon that George's method could be used for the application, but would eat up a lot more CPU than the probability distribution method, which just adds a little bit of maths to my original algorithm.

I reckon that George's method could be used for the application, but would eat up a lot more CPU than the probability distribution method, which just adds a little bit of maths to my original algorithm.

Betty Rubble? Well, I would go with Betty... but I'd be thinking of Wilma.

posted 12 years ago

Well - why simplify this?

Originally posted by Paul Clapham:

They are what you said they were in your original question.

However if C is 94, then the probability of choosing a string of anything less than the maximum length M (or Lmax as you called it later) is almost exactly 1/94 (regardless of what M is) and the probability of choosing a string of the maximum length is 93/94. So you might simplify the problem to only choose strings of length M -- a major simplification.

Well - why simplify this?

Peter Chase

Ranch Hand

Posts: 1970

posted 12 years ago

I reckon I can do better, with the same basic algorithm: -

I'm finessing away the test for len becoming less than one, because in practice it will never happen. The probability (1 in 94^300) is so insanely minute that one can say that with total confidence.

If there were fewer characters to choose from or a smaller maximum length, then such a test would be necessary.

[ January 28, 2006: Message edited by: Peter Chase ]

I'm finessing away the test for len becoming less than one, because in practice it will never happen. The probability (1 in 94^300) is so insanely minute that one can say that with total confidence.

If there were fewer characters to choose from or a smaller maximum length, then such a test would be necessary.

[ January 28, 2006: Message edited by: Peter Chase ]

Betty Rubble? Well, I would go with Betty... but I'd be thinking of Wilma.

posted 12 years ago

... at least until random.nextInt (94) isn't broken

[ January 28, 2006: Message edited by: Stefan Wagner ]

[ January 28, 2006: Message edited by: Stefan Wagner ]

[ January 28, 2006: Message edited by: Stefan Wagner ]

[ January 28, 2006: Message edited by: Stefan Wagner ]

Peter Chase

Ranch Hand

Posts: 1970

posted 12 years ago

Did I make a mistake? I admit I didn't compile or run my code. I am assuming that "random" is an instance of java.util.Random.

Originally posted by Stefan Wagner:

... at least until random.nextInt (94) isn't broken

Did I make a mistake? I admit I didn't compile or run my code. I am assuming that "random" is an instance of java.util.Random.

Betty Rubble? Well, I would go with Betty... but I'd be thinking of Wilma.

George Harris

Ranch Hand

Posts: 84

posted 12 years ago

Ok, a slight modification to my previous algorithm. Instead of rejecting embedded zeros, I would suggest shifting the new non-zero to the left and continuing from there. For example for a string length 10 we would start and get the following

23 . 16 . 0 . 0 . 0 . 19 ...

Instead of rejecting this string, we would move the 19 to the left and continue.

23 . 16 . 19 . 7 . etc.

I believe this will still result in the correct distribution of string lengths, with a significant improvement in processing.

23 . 16 . 0 . 0 . 0 . 19 ...

Instead of rejecting this string, we would move the 19 to the left and continue.

23 . 16 . 19 . 7 . etc.

I believe this will still result in the correct distribution of string lengths, with a significant improvement in processing.

Jim Yingst

Wanderer

Posts: 18671

posted 12 years ago

George, I don't think that works. Consider an extreme case: we want to generate strings of max length 2, using an alphabet with only two characters, A, and B. The possible strings (excluding zero-length) are:

Length 1: A, B

Length 2: AA, AB, BA, BB

So the chance of generating a 1-char string should be 2/6 = 1/3.

Now try your algorithm (assuming I've understood it correctly):

0.0 : [reject]

0.1 : A

0.2 : B

1.0 : A

1.1 : AA

1.2 : AB

2.0 : B

2.1 : BA

2.2 : BB

Looks to me like there's a 4/8 = 1/2 chance of getting length 1, which is not correct. On the other hand if we reject 1.0 and 2.0, the remaining options given the required 2/6 = 1/3 chance for length 1.

[ February 01, 2006: Message edited by: Jim Yingst ]

Length 1: A, B

Length 2: AA, AB, BA, BB

So the chance of generating a 1-char string should be 2/6 = 1/3.

Now try your algorithm (assuming I've understood it correctly):

0.0 : [reject]

0.1 : A

0.2 : B

1.0 : A

1.1 : AA

1.2 : AB

2.0 : B

2.1 : BA

2.2 : BB

Looks to me like there's a 4/8 = 1/2 chance of getting length 1, which is not correct. On the other hand if we reject 1.0 and 2.0, the remaining options given the required 2/6 = 1/3 chance for length 1.

[ February 01, 2006: Message edited by: Jim Yingst ]

"I'm not back." - Bill Harding, *Twister*

George Harris

Ranch Hand

Posts: 84

posted 12 years ago

Jim, I think you missed a step. In your example

0.1 : A

this would not necessarly be A, the first char would be A, but the second character would still have to be processed by a random number generator so,

0.1 could end up

1.0 - A

1.1 - A.A

1.2 - A.B

All that I was suggesting is that we don't have to start from scratch, only that we keep the 'good' number and still get the rest.

0.1 : A

this would not necessarly be A, the first char would be A, but the second character would still have to be processed by a random number generator so,

0.1 could end up

1.0 - A

1.1 - A.A

1.2 - A.B

All that I was suggesting is that we don't have to start from scratch, only that we keep the 'good' number and still get the rest.

posted 12 years ago

Unless I'm missing something here, we should be able to generate each random string in time O(n) where n is the length of the string, and stil have a uniform probability distribution across all string lengths and values, yes? Don't see doing it in less that O(n), but no reason for it to be more than O(n). For multiple strings k that suggests O(kn), although you might be able to play mod games to keep it at O(n).

[ February 01, 2006: Message edited by: Reid M. Pinchback ]

[ February 01, 2006: Message edited by: Reid M. Pinchback ]

Reid - SCJP2 (April 2002)

Jim Yingst

Wanderer

Posts: 18671

posted 12 years ago

Reid - yes, it's definitely possible to do this in O(n). I'd probably prefer to choose the string length first (probability for each length given in Paul Clapham's formula) and then randomly generate a string of that spsecific length. There are various ways to implement this, depending on how precise we want to be with the probabilities - but O(n) is definitely possible.

George's original algorithm is not O(n) certainly - I'm not sure what it is offhand, but it's notably worse than O(n). However I was intrigued because it is pretty simple to understand, and I was wondering just how fast it actually was. So I tested it a bit to see, and found it wasn't nearly as slow as I might have guessed. Given that it was pretty easy to understand, I wouldn't completely rule it out of consideration. However for a high-performance solution, I'd definitely go with something else.

As for George's revised algorithm - um, maybe. I'm still not sure I understand just how it works. The "and continuing from there" when he first described it was a bit too vague for me. I made a guess what it might mean, which was apparently wrong. However I think the whole thing is still a bit fuzzy. Which of the following contain "embedded zeros" which justify shifting and rerolling, and which are non-embedded zeros that actually reduce the length of the string?

0.1.1

1.0.1

1.1.0

0.0.0

Or more to the point, can you show code which implements the algorithm you're thinking of? There are several more choices to make in various special cases, and so far I can't see a version which will really give equal probabilities of all strings, but I could well be missing something.

George's original algorithm is not O(n) certainly - I'm not sure what it is offhand, but it's notably worse than O(n). However I was intrigued because it is pretty simple to understand, and I was wondering just how fast it actually was. So I tested it a bit to see, and found it wasn't nearly as slow as I might have guessed. Given that it was pretty easy to understand, I wouldn't completely rule it out of consideration. However for a high-performance solution, I'd definitely go with something else.

As for George's revised algorithm - um, maybe. I'm still not sure I understand just how it works. The "and continuing from there" when he first described it was a bit too vague for me. I made a guess what it might mean, which was apparently wrong. However I think the whole thing is still a bit fuzzy. Which of the following contain "embedded zeros" which justify shifting and rerolling, and which are non-embedded zeros that actually reduce the length of the string?

0.1.1

1.0.1

1.1.0

0.0.0

Or more to the point, can you show code which implements the algorithm you're thinking of? There are several more choices to make in various special cases, and so far I can't see a version which will really give equal probabilities of all strings, but I could well be missing something.

"I'm not back." - Bill Harding, *Twister*

George Harris

Ranch Hand

Posts: 84

posted 12 years ago

A quick version of the code that I was discussing in the 'shift' approach:

maxLength is the max string size and numChars is the number of available characters.

[ February 02, 2006: Message edited by: George Harris ]

[ February 03, 2006: Message edited by: George Harris ]

maxLength is the max string size and numChars is the number of available characters.

[ February 02, 2006: Message edited by: George Harris ]

[ February 03, 2006: Message edited by: George Harris ]

Stan James

(instanceof Sidekick)

Posts: 8791

posted 12 years ago
A good question is never answered. It is not a bolt to be tightened into place but a seed to be planted and to bear more seed toward the hope of greening the landscape of the idea. John Ciardi

Step way back and help me out. (Music major, remember?)

First, the original goal of making all strings equally likely. If I have a length range of 1 or 2 and 2 characters I can get

A

B

AA

AB

BA

BB

If I choose length at random I have an even chance of length=1->A|B or length=2->AA|AB|BA|BB. When I make 1000 strings I'll get A 25% of the time and AA 12.5% of the time. Fails the requirement, right?

I tried to solve this by saying there are only 2 strings possible at length=1, so I made only 2 strings possible at length=2. Now I have an even chance of A|B or AA|AB. AB and BB are right out, but who cares.

Was that even close? How does the latest algorithm do the trick?

First, the original goal of making all strings equally likely. If I have a length range of 1 or 2 and 2 characters I can get

A

B

AA

AB

BA

BB

If I choose length at random I have an even chance of length=1->A|B or length=2->AA|AB|BA|BB. When I make 1000 strings I'll get A 25% of the time and AA 12.5% of the time. Fails the requirement, right?

I tried to solve this by saying there are only 2 strings possible at length=1, so I made only 2 strings possible at length=2. Now I have an even chance of A|B or AA|AB. AB and BB are right out, but who cares.

Was that even close? How does the latest algorithm do the trick?

George Harris

Ranch Hand

Posts: 84

posted 12 years ago

If you run the code that I provided, 1,000,000 runs, with a two character string, you get the following results:

time = 2471 ms

a = 16%

b = 16%

aa = 16%

ab = 16%

ba = 16%

bb = 16%

I also tried to get 1,000,000 strings of maxLength 200 from and using any character from " abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ;:'[]{},<.>/?-_=+`~1234567890!@#$%^&*()\\\"|";

It took:

time = 29627 ms

using only lowercase characters it took:

time = 30198 ms

[ February 02, 2006: Message edited by: George Harris ]

time = 2471 ms

a = 16%

b = 16%

aa = 16%

ab = 16%

ba = 16%

bb = 16%

I also tried to get 1,000,000 strings of maxLength 200 from and using any character from " abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ;:'[]{},<.>/?-_=+`~1234567890!@#$%^&*()\\\"|";

It took:

time = 29627 ms

using only lowercase characters it took:

time = 30198 ms

[ February 02, 2006: Message edited by: George Harris ]

Stan James

(instanceof Sidekick)

Posts: 8791

posted 12 years ago
A good question is never answered. It is not a bolt to be tightened into place but a seed to be planted and to bear more seed toward the hope of greening the landscape of the idea. John Ciardi

Cool. I had to modify Jim Y's example from way up the thread to show the shifted possibilities:

Now there are two ways to get each possible value. I woonta believed it.

[ February 02, 2006: Message edited by: Stan James ]

Now there are two ways to get each possible value. I woonta believed it.

[ February 02, 2006: Message edited by: Stan James ]

Reid M. Pinchback

Ranch Hand

Posts: 775

posted 12 years ago

Ok, I've got an O(n) solution. The basic idea is that you can set up a recurrence relationship to solve the problem. To set up a bit of notation:

R(i) is a random string of length at most 'i'.

M(i) is a random string of exactly length 'i'

Pm(i) is the size of the population of M

Pr(i) is the size of the population of R

Then R(i+i) will be randomly chosen between R(i) and M(i+1),

and the relative probabilities of the choice are determined by

Pm(i+1) and Pr(i).

The code is below. I used longs for the population sizes; you could probably muck around with using doubles instead for the probabilities. In fact you'd have to do that if the max string length gets too long because the integer-valued population coefficients would get too large to represent as longs.

R(i) is a random string of length at most 'i'.

M(i) is a random string of exactly length 'i'

Pm(i) is the size of the population of M

Pr(i) is the size of the population of R

Then R(i+i) will be randomly chosen between R(i) and M(i+1),

and the relative probabilities of the choice are determined by

Pm(i+1) and Pr(i).

The code is below. I used longs for the population sizes; you could probably muck around with using doubles instead for the probabilities. In fact you'd have to do that if the max string length gets too long because the integer-valued population coefficients would get too large to represent as longs.

Reid - SCJP2 (April 2002)

Jim Yingst

Wanderer

Posts: 18671

posted 12 years ago

Hmmm... Reid, I tried adjusting the inputs a bit to get a smaller range of strings to look at:

This gives me output:

That doesn't look like a very even distribution to me. Am I misunderstanding how you intend this to be used?

[ February 02, 2006: Message edited by: Jim Yingst ]

This gives me output:

That doesn't look like a very even distribution to me. Am I misunderstanding how you intend this to be used?

[ February 02, 2006: Message edited by: Jim Yingst ]

"I'm not back." - Bill Harding, *Twister*

Reid M. Pinchback

Ranch Hand

Posts: 775

posted 12 years ago

That's bizarre. I've used smaller and larger alphabets, shorter and longer strings, fewer or greater number of iterations, and except for the bit of random noise you'd expect I get a uniform distribution. Definitely for a million iterations it should be pretty flat. I'll sleep on it, see if any light bulbs turn on.

Oh, and I thought of a different problem - it isn't O(n) because of the cost of generating each new string in an iteration - those are proportional in cost to the length. Easy solution though; you just use the recurrence for the probabilities, generate one max-length string, and use one random number to decide on the length (weighting the lengths by the probabilities).

Oh, and I thought of a different problem - it isn't O(n) because of the cost of generating each new string in an iteration - those are proportional in cost to the length. Easy solution though; you just use the recurrence for the probabilities, generate one max-length string, and use one random number to decide on the length (weighting the lengths by the probabilities).

Reid - SCJP2 (April 2002)

Reid M. Pinchback

Ranch Hand

Posts: 775

posted 12 years ago

Well, I found what was causing the wierdness in the distribution, but I can't see *why* it is causing the wierdness. The last change I made was to shift to using long instead of int for the coefficients to allow for bigger strings and alphabets, which meant I needed to generate a random long instead of a random int. Unfortunately there isn't a Random.nextLong(range) method, so I just modded by the range - and that seems to be torqueing the distribution. If I shove the int version back in:

Then it works perfectly, although now the maximum string length you can handle will be around 16 or 17 characters. Modding long version by the range is twisting the distribution. Odd; for small ranges I'd have expected the residues to still be pretty uniformly distributed, but clearly that isn't the case.

Then it works perfectly, although now the maximum string length you can handle will be around 16 or 17 characters. Modding long version by the range is twisting the distribution. Odd; for small ranges I'd have expected the residues to still be pretty uniformly distributed, but clearly that isn't the case.

Reid - SCJP2 (April 2002)

posted 12 years ago

I think the API documentation for Random's nextInt() method sheds some light on why that is, and possibly suggests how you could write your own nextLong() method.Originally posted by Reid M. Pinchback:

Unfortunately there isn't a Random.nextLong(range) method, so I just modded by the range - and that seems to be torqueing the distribution.

posted 12 years ago

hi

choose a random number between 0 and N (Math.random()*N)

also lets assume that character set frm which characters will be chosen is frm ascii values 0 to 199

let k be the no choosen

then choose 3k random nos (between 0 and 9)(Math.random()*10) and then make pairs of 3 nos each and select character corresponding to that ascii character

if no chosen is more than 199 then use the character with ascii value of no%200

choose a random number between 0 and N (Math.random()*N)

also lets assume that character set frm which characters will be chosen is frm ascii values 0 to 199

let k be the no choosen

then choose 3k random nos (between 0 and 9)(Math.random()*10) and then make pairs of 3 nos each and select character corresponding to that ascii character

if no chosen is more than 199 then use the character with ascii value of no%200

posted 12 years ago

You remember my suggestion to evaluate the length of the String with this code:

This is wrong.

The len > 1 often terminates the loop for small String-lengths (M) or small numChars, which leeds to too much Strings of len=1.

It was reduced by Peter Chase (having numChars=94 in mind) this way:

Well - this will work for numChars=2, if it is adjusted:

and repeated until (len != 0)

And using that, I generated a complete solution, which generates

20 000 - 30 000 Strings of maxlen=300 and numChars=94 per second on my 2Ghz-Laptop.

Here it is (with a little debug-code included):

Statistic for 2Ghz Pentium:

time java -Xmx384M 300 94 tuple

Test with small values for Stringlength and character-set-size:

My favorite solution would have been:

Pseudocode, of course.

[ February 05, 2006: Message edited by: Stefan Wagner ]

[ February 05, 2006: Message edited by: Stefan Wagner ]

This is wrong.

The len > 1 often terminates the loop for small String-lengths (M) or small numChars, which leeds to too much Strings of len=1.

It was reduced by Peter Chase (having numChars=94 in mind) this way:

Well - this will work for numChars=2, if it is adjusted:

and repeated until (len != 0)

And using that, I generated a complete solution, which generates

20 000 - 30 000 Strings of maxlen=300 and numChars=94 per second on my 2Ghz-Laptop.

Here it is (with a little debug-code included):

Statistic for 2Ghz Pentium:

time java -Xmx384M 300 94 tuple

Test with small values for Stringlength and character-set-size:

My favorite solution would have been:

Pseudocode, of course.

[ February 05, 2006: Message edited by: Stefan Wagner ]

[ February 05, 2006: Message edited by: Stefan Wagner ]

Don't get me started about those stupid light bulbs. |

- Post Reply Bookmark Topic Watch Topic
- New Topic

Boost this thread!