• Post Reply Bookmark Topic Watch Topic
  • New Topic

cyclic similarity in words  RSS feed

 
Rish Gupta
Greenhorn
Posts: 29
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
Hi, I am not sure what these are called but here's what I'm trying to do.. please bear with me, i'm not good at explaining.. Say we have a string S='abcabc' now we need to figure out how many similar strings we will get if we go cyclic. for example using the above case:
0th cycle : abcabc //this is the oroginal string
1st cycle : bcabca //here i th character = i+1 (so we are shifting by one) and putting the first character in the end
2nd cycle : cabcab
3rd cycle : abcabc //here we get the same string as the original string so counter is incremented by 1
4th cycle : bcabca
5th cycle : cabcab
6th cycle : abcabc //here we are back to where we started from so counter is again incremented by 1 and program exits returning 2

I hope you get what I'm trying to do and below is the code that I came up with:


The problem is that the above code is time consuming (O of N^2), I am trying to achieve it in O of N..
any ideas would be appreciated. TIA
 
Jayesh A Lalwani
Rancher
Posts: 2762
32
Eclipse IDE Spring Tomcat Server
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
What's N? Is N the number of strings or is N the number of characters. If it's number of strings, you are already O(n) . If it's number of characters, You cannot get O(n) using a brute force method. Equals method of string is O(n). So any method that cycles the string and then compares it with the original string will be O(n^2)


 
Winston Gutkowski
Bartender
Posts: 10575
66
Eclipse IDE Hibernate Ubuntu
  • Likes 1
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
Rish Gupta wrote:The problem is that the above code is time consuming (O of N^2), I am trying to achieve it in O of N..
any ideas would be appreciated. TIA

As Jayesh says, a brute force approach will always be O(n²). I suspect you can get that down quite a bit by eliminating strings that can't possibly rotate by understanding some of the properties that such a string must possess: one of which is that unless ALL characters are the same, its length cannot be prime (because it must be made up of subsequences of length > 1 that repeat). Or possibly easier: if the length is prime, then all letters MUST be the same.

So, if you factorize the length, you'll then be able to work out the lengths of subsequences that you need to check for, which still may take some time, but NOT O(n²), and if you start with the smallest first (* - see below), you'll hopefully be able to eliminate them the quickest.

That said, I've never tried the problem; but that's how I'd start going about it if I was looking for an efficient method (as opposed to a simple one ).

Winston

(*) I'm not by any means sure that you should start out with the shortest first but, just off the top of my head, it "feels" right.
 
Maxim Karvonen
Ranch Hand
Posts: 121
12
  • Likes 2
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
Winston already offered one idea. You don't need to find all offsets, you may get a result using a smallest one.

Another idea is that dealing with "looping" strings is not always efficient. We may try to "unroll" your string into an infinite sequence. Or maybe finite, try to figure it yourself. Not a very new idea, some tasks on programming contests are also based on it. For your example you will get

I think, you should get the idea. Your original string is substring of some other string. And at this point the only thing you need is some efficient (O(N)) substring algorithm. I prefer Knuth–Morris–Pratt algorithm for such tasks (it's very simple in implementation), but you may use any other substring algorithm with same complexity.
 
Jayesh A Lalwani
Rancher
Posts: 2762
32
Eclipse IDE Spring Tomcat Server
  • Likes 1
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
I was thinking ( and I think that's what WInston is going for) is that if you split the string using the factors of the length, you don;t need to roll it

FOr example, if you have abcabc The length is 6. So, you can split it into 2 parts of 3 characters or 3 parts of 2 chanracters, or 6 parts of 1 characters. So, let's spit into 2:- abc and abc. Now, since both these parts are equal, you know that if you would have rolled the string 6 times, it would come up to be the same as the original string on the 3rd rol. Splitting into 3 parts of 2 characters give you ab, ca and bc. They are not equal. So, you will not get it on the 2nd and 4th roll. if you split it into 6 parts of 1 characters, you again don't find that they are not equal, so you don't get the same string on 1st, and 5th roll. You always get the same string on the last 6th roll. So, you get the string on the 3rd and 6th roll. So, the answer is 2

If you had ababab
2 parts of 3 characters aba bab no match
3 parts of 2 characters ab ab ab match on 2th and 4th roll
6 parts of 1 character a b a b a b no match
always a match on 6th roll


So, total matches = 3

If you had aaaaaa
2 parts of 3 characters aaa aaa match on 3rd toll
3 parts of 2 characters ab ab ab match on 2th and 4th roll
6 parts of 1 character a a a a a a match on 1st and 5th roll
always a match on 6th roll


So, total matches = 6

You can figure out an algorithm. Trying to estimate the cost:- for each f that is a factor of N, you will have to run a string comparison N/f times. So, if number of factors is F, you will run string comparison F*N/F times = N. However, the strings will be small. So, the cost will be like O(N*N/F). SInce, F = N/Log N, you should get O(N log N).

I am not good at this math, so I might be wrong. Just from gut feel I think it would be N Log N.
 
Rish Gupta
Greenhorn
Posts: 29
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
K, So I did what you guys suggested (big thanks for all your inputs) and here is what I came up with.. could you confirm if this was at least Oof(N log N) if not OofN..


@Maxim - I din't really get what you suggested.. actually I have never heard of Knuth–Morris–Pratt algo.. I'll try to do some reading online on it and see if I can come up with something better..
thanks a lot for everyone's input
 
Rish Gupta
Greenhorn
Posts: 29
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
Is there some software that could compute the time it takes to run?? in terms of N..

in the code above I could cut down further on run time by going for(int i=1; i<=x*/2*; i++) while finding factors in line 7 and changing to i<fact.size() instead of i<fact.size()*-1* in line 14.
 
Winston Gutkowski
Bartender
Posts: 10575
66
Eclipse IDE Hibernate Ubuntu
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
Rish Gupta wrote:K, So I did what you guys suggested (big thanks for all your inputs) and here is what I came up with.. could you confirm if this was at least Oof(N log N) if not OofN..

Whoa hoss. You're going too fast too soon. For starters, that code isn't right, because it doesn't take into account the fact that your "string" is cyclic.

Deal with ONE thing at a time - and personally, I'd say that the first thing to deal with is the comparison itself. And next you need to deal with the context of your problem.

Question 1: If I have a "cyclic" string (CS), and I want to compare it with a substring (SS) to see if SS is repeated in CS, what is the absolute maximum number of characters that I have to compare?

Question 2: If remove an arbitrary number of characters from the start of my CS string, and place them at the end, in the same sequence, should my program return a different result or not?

Answering those two questions correctly will give you a good head start on your problem (but it's only the start; if you want efficient, this is NOT a simple problem). And if you're into amateur magic, Question 2 covers exactly the same ground as cutting (but NOT shuffling) a deck of cards.

Winston
 
Rish Gupta
Greenhorn
Posts: 29
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
For starters, that code isn't right, because it doesn't take into account the fact that your "string" is cyclic.


Hi Winston,
I tested the code with about 20 different inputs (S string) and it has given me the right answer each time, could you give me an example that the code might not work for.. Maybe I'm thinking in just one direction. I'm not shifting the string to check if it is cyclic or not, I'm doing something similar to what you asked in question 1 to which I think the answer would be N-1 worst case scenario.

Basically here's what i designed the code to do:
It gets all the factors of string length (let's take 'abcabc' as an example, it'll return 1, 2, 3 and 6)
Now we start with 1, the first factor, so the code compares 'a' to 'b' and since they are different 1 is eliminated.
Next factor is 2, so the code compares 'ab' to 'ca' and since they are different 2 is eliminated.
Next factor is 3, so we compare 'abc' to 'abc', match is found and since we have reached the end of the string we have an answer which is string length divided by 3.. So final answer is 2, which is the right answer .
(If there were 3 more characters say 'abcabcxxx', 'abc' would have been compared again to xxx and answer printed only if it were a match otherwise the code would have moved on to the next factor)

I hope it makes sense, it would be a big help if you gave an example for which the code might not work except empty string, I haven't tested for it.
 
Winston Gutkowski
Bartender
Posts: 10575
66
Eclipse IDE Hibernate Ubuntu
  • Likes 1
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
Rish Gupta wrote:I tested the code with about 20 different inputs (S string) and it has given me the right answer each time, could you give me an example that the code might not work for.

No, but I'd definitely try it with different lengths of string and different patterns; especially ones that "almost" cycle. And sorry for saying your code was wrong; I must have misread your compare() method. However, the advice is sound: do one thing at a time; and don't move on until you know that what you've got is working.

Just FYI, you can use the modulus operator for cyclic ops, which saves a bit of code (probably not time though), viz:although once you've done the factoring, you shouldn't need to "wrap" your comparisons at all (I left out the "ignorecase" stuff because you can deal with that by just converting the whole string to lower/upper case)

doing something similar to what you asked in question 1 to which I think the answer would be N-1 worst case scenario.

Actually, it will be length(CS) - length(SS) because you already know that SS is equal to itself.

Basically here's what i designed the code to do:
It gets all the factors of string length (let's take 'abcabc' as an example, it'll return 1, 2, 3 and 6)
Now we start with 1, the first factor, so the code compares 'a' to 'b' and since they are different 1 is eliminated.
Next factor is 2, so the code compares 'ab' to 'ca' and since they are different 2 is eliminated.
Next factor is 3, so we compare 'abc' to 'abc', match is found and since we have reached the end of the string we have an answer which is string length divided by 3.. So final answer is 2, which is the right answer .
(If there were 3 more characters say 'abcabcxxx', 'abc' would have been compared again to xxx and answer printed only if it were a match otherwise the code would have moved on to the next factor).

Yup that's about right. I presume that you don't bother with 6, since any string must "rotate" to itself. And notice that it doesn't matter whether you have 'abcabc' or 'bcabca' (which is a "rotation") because the first three characters still repeat. That's what I was referring to in Question 2.

You just have to be a bit careful with string lengths like 20, which is 2 x 2 x 5, to make sure that you get ALL factors (ie, 1, 2, 4, 5 and 10).

You could also shave a bit of time off by "smart" checking. For example, for a string of length 20, when checking your 'length 2' substring, you could check position 10 first, because if it doesn't match, then you know that both 2 AND 10 are eliminated; and if you then check positions 4 , 8, 12 and 16 and find a mismatch, you know that 4 is eliminated too. However, keeping track of all the stuff so that you don't repeat checks may be more trouble than it's worth.

I'm afraid my maths isn't good enough to give you the exact O() formula for time, but it will be something like f * O(n), where f is the number of factors in n - and I suspect there's a general formula for that too.

HIH

Winston
 
Winston Gutkowski
Bartender
Posts: 10575
66
Eclipse IDE Hibernate Ubuntu
  • Likes 2
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
Winston Gutkowski wrote:You could also shave a bit of time off by "smart" checking...

It also occurs to me that there might also be cases where you don't want to go in strict order of size. For a case like length 40 for example, it might be better to check 2, then 5 (rather than 4) since if 5 repeats even once, you've just eliminated 4 as a possibility.

However, without a mathematician to explain the actual process (to me as well ), I suspect you could tie yourself in knots with that sort of stuff.

Winston
 
Winston Gutkowski
Bartender
Posts: 10575
66
Eclipse IDE Hibernate Ubuntu
  • Likes 3
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
Winston Gutkowski wrote:You could also shave a bit of time off by "smart" checking...

Got it. (I think )

Once you've eliminated '1' (all characters the same), check prime factors in descending order, keeping a count of repetitions even if you don't find a complete set of them, because if you find even one, you've just eliminated all factors below that number.

Furthermore, if you find more than one, the total number of characters taken up with repetitions, R = p * (r+1) (p=prime, r=# of repetitions), is important to know.

If R is more than twice the value of any other factor, you can eliminate those too; and if it's more than half the length of the entire string, you can eliminate any other possibility of a repetition.

So, taking your example of 'abcabc', let's double it up and introduce a mismatch: 'abcabcabcxbc'.

Length is now 12, so factors are 1, 2, 3, 4 and 6.

Going by the procedure above and having eliminated '1', we next start with 3 (the highest prime factor), and discover that there are 2 "repetitions" of 'abc' (but not 3, so it's not a full set).
Since r is > 0, we know that '2' can't possibly repeat, and since p * (r + 1) is 9, we also know that 4 can't repeat either. Finally, 9 is greater than 12 / 2, so we also know that no other length can possibly repeat.

But if the string is 'abcxbcabcxbc', there are no repetitions of 3, so we next check 2 (if there were any other prime factors > 2, we'd do those first, in descending order), then 4, and finally 6 where we find our cycle.

Phew. Fun exercise.

And if anyone can shoot down my logic, please let me know.

Winston
 
  • Post Reply Bookmark Topic Watch Topic
  • New Topic
Boost this thread!