# Probability of words in a random string

Geoffrey Falk

Ranch Hand

Posts: 171

1

posted 6 years ago

Suppose I give you a 4-letter word with letters chosen from an alphabet of 26 letters. What is the probability that a random 13-letter string contains the word as a substring?

Alternatively, write a Java program to calculate the probability.

Geoffrey

Alternatively, write a Java program to calculate the probability.

Geoffrey

Sun Certified Programmer for the Java 2 Platform

Ryan McGuire

Ranch Hand

Posts: 1078

4

posted 6 years ago

Notation:

Pn = prob that word occurs at position n of string.

For 0 <= n <= 9, Pn = (1/26) ^ 4 = 1/456976

For 9 < n, Pn = 0.

P(word does NOT appear in string) = [(1 - P1) * (1-P2) * ... (1 - P12)]

= (1- (1/456976)) ^ 13

= (456975/456976) ^ 13

P(word DOES exist in string) = 1 - P(word does NOT appear in string)

= 1 - (456975/456976)^13

approx= .0000284475

...which is about 1 in 35,000

According to this formula, you reach the 50-50 point around with a string around 300,000 letters long. That seems long.

Geoffrey Falk wrote:Suppose I give you a 4-letter word with letters chosen from an alphabet of 26 letters. What is the probability that a random 13-letter string contains the word as a substring?

Alternatively, write a Java program to calculate the probability.

Geoffrey

Notation:

Pn = prob that word occurs at position n of string.

For 0 <= n <= 9, Pn = (1/26) ^ 4 = 1/456976

For 9 < n, Pn = 0.

P(word does NOT appear in string) = [(1 - P1) * (1-P2) * ... (1 - P12)]

= (1- (1/456976)) ^ 13

= (456975/456976) ^ 13

P(word DOES exist in string) = 1 - P(word does NOT appear in string)

= 1 - (456975/456976)^13

approx= .0000284475

...which is about 1 in 35,000

According to this formula, you reach the 50-50 point around with a string around 300,000 letters long. That seems long.

Geoffrey Falk

Ranch Hand

Posts: 171

1

Ryan McGuire

Ranch Hand

Posts: 1078

4

posted 6 years ago

Loophole: In your problem statement, you used the word "chosen". In a nit-picky sense, that implies that are no repeated letters in the word. See the Wikipedia page on the Binomial Coefficient function, aka the Choose Function. Yeah, that's what I was thinking.

What if the word were 3 letters long and the string was 3? Is "AA" more likely than "AB" to show up in a string of 3 random letters? "AA" will match if the string is either "AA." (using regular expressions), or ".AA". That matches only 51 possible different strings. "AB" would match "AB." or ".AB", which covers 52 strings (since there's no overlap).

So it would seem that you're correct. ...assuming the "word" can have repeated letters.

Geoffrey Falk wrote:Are you sure? I think the probability depends on the word that I give you. For instance, "AAAA" has a different probability than "ABAB". (Hint: take account of double counting.)

Geoffrey

Loophole: In your problem statement, you used the word "chosen". In a nit-picky sense, that implies that are no repeated letters in the word. See the Wikipedia page on the Binomial Coefficient function, aka the Choose Function. Yeah, that's what I was thinking.

What if the word were 3 letters long and the string was 3? Is "AA" more likely than "AB" to show up in a string of 3 random letters? "AA" will match if the string is either "AA." (using regular expressions), or ".AA". That matches only 51 possible different strings. "AB" would match "AB." or ".AB", which covers 52 strings (since there's no overlap).

So it would seem that you're correct. ...assuming the "word" can have repeated letters.