# Generate random strings

Sounds simple, huh? Well, maybe it is, but I can't think of an easy way.

My current algorithm is: -

1. Randomly choose the length, N, between zero and the maximum, M.

2. Initialise storage for N characters.

3. Randomly choose each of the N characters, from the C possible characters.

This does *not* meet the requirements, because a particular short string is much more likely to come up than a long string.

I can think of a most-bogus solution, along the lines of Bogosort. That is: compute all the possible strings, then choose one at random. However, the memory and computational costs of this are ridiculous.

Any ideas?

Betty Rubble? Well, I would go with Betty... but I'd be thinking of Wilma.

Ranch Hand

get the number n of possible strings at minimum length

seed the RNG using n - must be repeatable to maintain the odds, no?

pick a random string index in range of 1, n

pick a random length

generate n strings

return the nth string

Does that makes sense?

A good question is never answered. It is not a bolt to be tightened into place but a seed to be planted and to bear more seed toward the hope of greening the landscape of the idea. John Ciardi

Originally posted by Stan James:

Does that makes sense?

Well, not if I understand correctly what you're saying.

The number of possible strings of length L, where each character is one of C characters is C^L. That's C to the power of L, not C times L as you seem to be saying. So, even with only 26 characters, the number of possible strings rapidly becomes comically huge, even with relatively modest values of C and L. For instance, 26^6 is 308915776; that's a lot of strings.

In my actual application (this is not just a puzzle for fun), C is 94 and Lmax is 300!

If I have misunderstood what you're saying, then perhaps you could post some more-detailed pseudo-code (or real code) to illustrate your solution.

[ January 23, 2006: Message edited by: Peter Chase ]

Betty Rubble? Well, I would go with Betty... but I'd be thinking of Wilma.

Ranch Hand

LMax goes out of the equation because LMin is the constraint, but for your numbers I'd have to be prepared to get 94^LMin random numbers which would probably run too long at 100% CPU.

Keep in mind: Music major here. This was fun but probably just passed my math skills.

A good question is never answered. It is not a bolt to be tightened into place but a seed to be planted and to bear more seed toward the hope of greening the landscape of the idea. John Ciardi

Originally posted by Peter Chase:

I can think of a most-bogus solution, along the lines of Bogosort. That is: compute all the possible strings, then choose one at random. However, the memory and computational costs of this are ridiculous.

Any ideas?

Actually, you don't have to compute all the possible strings. You can choose a random number and compute the string based on that number.

For example, if you had to generate a random lower case string of max length 2, then there are 702 possible strings. Strings 0-25 would be single character string a,b,c,d,....z. String 26 - 701 would be 2 character strings aa,ab,ac,ad....zz. So, you select a random number from 0-701, and use number to figure out the string.

For example, if your random number generator gave you 523, you follow these steps

a) 523 mod 26 = 3

b) character 3 is d, so, next character is d, random string = d

c) 523/26 = 20

d) 20 mod 26 = 20

e) 20 character is t, so next character is t, random string = td

f) 20/26 = 0, so stop processing

If you want random strings of max length 3, you use similar logic except that you generate a random number between 0-18271, because the number of combinations are 18272. You don't have to limit yourself to 26 characters. You can use any number of characters. You will have to compute the number of combinations based on the number of characters

or, wait... i think i did 20!, i should have done 26^20, which gives us

19,928,148,895,209,409,152,340,197,376

[ January 25, 2006: Message edited by: fred rosenberger ]

There are only two hard things in computer science: cache invalidation, naming things, and off-by-one errors

Originally posted by fred rosenberger:

you're gonna have a problem here... a 20 character string is going to require a random number between 0 and 2,432,902,008,176,640,000. since the OP said "arbitrary length", i would assume 20, 50 or even 100 it not out of the question.

or, wait... i think i did 20!, i should have done 26^20, which gives us

19,928,148,895,209,409,152,340,197,376

[ January 25, 2006: Message edited by: fred rosenberger ]

Use multiple random numbers then. Alternatively, generate two-10 character segments then. Or four-5 character segments. Introduce the possibility of segments of zero length.

Fred, I agree with you at a theoritical level. At some point, a randomly generated dictionary would be more cost-effective than a true random string generator.

Originally posted by Paul Clapham:

You must choose length N with probability C^N * (C-1) / (C^(M+1) - 1).

For the record, this one sounds the most feasible so far. I'm not sure that I quite see what N, M and C are, but guess I could work it out.

Betty Rubble? Well, I would go with Betty... but I'd be thinking of Wilma.

2 . 16 . 1 . 0 would give the string 'bpa'

5 . 12 . 8 . 24 would give 'elhx'

It would also require that if the string had an embedded 0,

2 . 0 . 16 . 1 the string would be rejected and a restart was required.

2 . 0 . 0 . 0 would be valid however.

[ January 25, 2006: Message edited by: George Harris ]

Betty Rubble? Well, I would go with Betty... but I'd be thinking of Wilma.

They are what you said they were in your original question.Originally posted by Peter Chase:

For the record, this one sounds the most feasible so far. I'm not sure that I quite see what N, M and C are, but guess I could work it out.

However if C is 94, then the probability of choosing a string of anything less than the maximum length M (or Lmax as you called it later) is almost exactly 1/94 (regardless of what M is) and the probability of choosing a string of the maximum length is 93/94. So you might simplify the problem to only choose strings of length M -- a major simplification.

if i undertand it correctly, he's saying whatever your max length is, generate a random string the full length. then, scan it to see if there is an imbedded 0.

for large max lengths, there are going to be a LOT of strings you'll have to reject, i think. it may take a while to find a valid one.

i'm not SURE of this, it's just my intuition (whatever that's worth...).

There are only two hard things in computer science: cache invalidation, naming things, and off-by-one errors

Originally posted by fred rosenberger:

I have not done any math, but you mayhave a problem with George's solution...

if i undertand it correctly, he's saying whatever your max length is, generate a random string the full length. then, scan it to see if there is an imbedded 0.

for large max lengths, there are going to be a LOT of strings you'll have to reject, i think. it may take a while to find a valid one.

i'm not SURE of this, it's just my intuition (whatever that's worth...).

I beleive you are right

For a string of length 3

Total number of combinations = 27^3 = 19683

Total number of valid combintaions = 18278

Total number of invalid combos = 1405 = 7%

For string of length 5

Total = 14348907

Valid = 12356630

Invalid = 1992277 = 13%

It keeps on growing. For string of length 20, 51.11% of the combos will be invalid. For string of length 63, 90.35% of the combos will be invalid.

Sheriff

"I'm not back." - Bill Harding, *Twister*

I reckon that George's method could be used for the application, but would eat up a lot more CPU than the probability distribution method, which just adds a little bit of maths to my original algorithm.

Betty Rubble? Well, I would go with Betty... but I'd be thinking of Wilma.

Originally posted by Paul Clapham:

They are what you said they were in your original question.

However if C is 94, then the probability of choosing a string of anything less than the maximum length M (or Lmax as you called it later) is almost exactly 1/94 (regardless of what M is) and the probability of choosing a string of the maximum length is 93/94. So you might simplify the problem to only choose strings of length M -- a major simplification.

Well - why simplify this?

I'm finessing away the test for len becoming less than one, because in practice it will never happen. The probability (1 in 94^300) is so insanely minute that one can say that with total confidence.

If there were fewer characters to choose from or a smaller maximum length, then such a test would be necessary.

[ January 28, 2006: Message edited by: Peter Chase ]

Betty Rubble? Well, I would go with Betty... but I'd be thinking of Wilma.

Originally posted by Stefan Wagner:

... at least until random.nextInt (94) isn't broken

Did I make a mistake? I admit I didn't compile or run my code. I am assuming that "random" is an instance of java.util.Random.

Betty Rubble? Well, I would go with Betty... but I'd be thinking of Wilma.

23 . 16 . 0 . 0 . 0 . 19 ...

Instead of rejecting this string, we would move the 19 to the left and continue.

23 . 16 . 19 . 7 . etc.

I believe this will still result in the correct distribution of string lengths, with a significant improvement in processing.

Sheriff

Length 1: A, B

Length 2: AA, AB, BA, BB

So the chance of generating a 1-char string should be 2/6 = 1/3.

Now try your algorithm (assuming I've understood it correctly):

0.0 : [reject]

0.1 : A

0.2 : B

1.0 : A

1.1 : AA

1.2 : AB

2.0 : B

2.1 : BA

2.2 : BB

Looks to me like there's a 4/8 = 1/2 chance of getting length 1, which is not correct. On the other hand if we reject 1.0 and 2.0, the remaining options given the required 2/6 = 1/3 chance for length 1.

[ February 01, 2006: Message edited by: Jim Yingst ]

"I'm not back." - Bill Harding, *Twister*

0.1 : A

this would not necessarly be A, the first char would be A, but the second character would still have to be processed by a random number generator so,

0.1 could end up

1.0 - A

1.1 - A.A

1.2 - A.B

All that I was suggesting is that we don't have to start from scratch, only that we keep the 'good' number and still get the rest.

[ February 01, 2006: Message edited by: Reid M. Pinchback ]

Reid - SCJP2 (April 2002)

Sheriff

George's original algorithm is not O(n) certainly - I'm not sure what it is offhand, but it's notably worse than O(n). However I was intrigued because it is pretty simple to understand, and I was wondering just how fast it actually was. So I tested it a bit to see, and found it wasn't nearly as slow as I might have guessed. Given that it was pretty easy to understand, I wouldn't completely rule it out of consideration. However for a high-performance solution, I'd definitely go with something else.

As for George's revised algorithm - um, maybe. I'm still not sure I understand just how it works. The "and continuing from there" when he first described it was a bit too vague for me. I made a guess what it might mean, which was apparently wrong. However I think the whole thing is still a bit fuzzy. Which of the following contain "embedded zeros" which justify shifting and rerolling, and which are non-embedded zeros that actually reduce the length of the string?

0.1.1

1.0.1

1.1.0

0.0.0

Or more to the point, can you show code which implements the algorithm you're thinking of? There are several more choices to make in various special cases, and so far I can't see a version which will really give equal probabilities of all strings, but I could well be missing something.

"I'm not back." - Bill Harding, *Twister*

Ranch Hand

First, the original goal of making all strings equally likely. If I have a length range of 1 or 2 and 2 characters I can get

A

B

AA

AB

BA

BB

If I choose length at random I have an even chance of length=1->A|B or length=2->AA|AB|BA|BB. When I make 1000 strings I'll get A 25% of the time and AA 12.5% of the time. Fails the requirement, right?

I tried to solve this by saying there are only 2 strings possible at length=1, so I made only 2 strings possible at length=2. Now I have an even chance of A|B or AA|AB. AB and BB are right out, but who cares.

Was that even close? How does the latest algorithm do the trick?

A good question is never answered. It is not a bolt to be tightened into place but a seed to be planted and to bear more seed toward the hope of greening the landscape of the idea. John Ciardi

time = 2471 ms

a = 16%

b = 16%

aa = 16%

ab = 16%

ba = 16%

bb = 16%

I also tried to get 1,000,000 strings of maxLength 200 from and using any character from " abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ;:'[]{},<.>/?-_=+`~1234567890!@#$%^&*()\\\"|";

It took:

time = 29627 ms

using only lowercase characters it took:

time = 30198 ms

[ February 02, 2006: Message edited by: George Harris ]

Ranch Hand

Now there are two ways to get each possible value. I woonta believed it.

[ February 02, 2006: Message edited by: Stan James ]

R(i) is a random string of length at most 'i'.

M(i) is a random string of exactly length 'i'

Pm(i) is the size of the population of M

Pr(i) is the size of the population of R

Then R(i+i) will be randomly chosen between R(i) and M(i+1),

and the relative probabilities of the choice are determined by

Pm(i+1) and Pr(i).

The code is below. I used longs for the population sizes; you could probably muck around with using doubles instead for the probabilities. In fact you'd have to do that if the max string length gets too long because the integer-valued population coefficients would get too large to represent as longs.

Reid - SCJP2 (April 2002)

Sheriff

This gives me output:

That doesn't look like a very even distribution to me. Am I misunderstanding how you intend this to be used?

[ February 02, 2006: Message edited by: Jim Yingst ]

"I'm not back." - Bill Harding, *Twister*

Oh, and I thought of a different problem - it isn't O(n) because of the cost of generating each new string in an iteration - those are proportional in cost to the length. Easy solution though; you just use the recurrence for the probabilities, generate one max-length string, and use one random number to decide on the length (weighting the lengths by the probabilities).

Reid - SCJP2 (April 2002)

Then it works perfectly, although now the maximum string length you can handle will be around 16 or 17 characters. Modding long version by the range is twisting the distribution. Odd; for small ranges I'd have expected the residues to still be pretty uniformly distributed, but clearly that isn't the case.

Reid - SCJP2 (April 2002)

I think the API documentation for Random's nextInt() method sheds some light on why that is, and possibly suggests how you could write your own nextLong() method.Originally posted by Reid M. Pinchback:

Unfortunately there isn't a Random.nextLong(range) method, so I just modded by the range - and that seems to be torqueing the distribution.

choose a random number between 0 and N (Math.random()*N)

also lets assume that character set frm which characters will be chosen is frm ascii values 0 to 199

let k be the no choosen

then choose 3k random nos (between 0 and 9)(Math.random()*10) and then make pairs of 3 nos each and select character corresponding to that ascii character

if no chosen is more than 199 then use the character with ascii value of no%200

This is wrong.

The len > 1 often terminates the loop for small String-lengths (M) or small numChars, which leeds to too much Strings of len=1.

It was reduced by Peter Chase (having numChars=94 in mind) this way:

Well - this will work for numChars=2, if it is adjusted:

and repeated until (len != 0)

And using that, I generated a complete solution, which generates

20 000 - 30 000 Strings of maxlen=300 and numChars=94 per second on my 2Ghz-Laptop.

Here it is (with a little debug-code included):

Statistic for 2Ghz Pentium:

time java -Xmx384M 300 94 tuple

Test with small values for Stringlength and character-set-size:

My favorite solution would have been:

Pseudocode, of course.

[ February 05, 2006: Message edited by: Stefan Wagner ]

[ February 05, 2006: Message edited by: Stefan Wagner ]