Red Ronin

Greenhorn

Posts: 1

posted 5 years ago

Hi all,

this is my 1st post here, so sorry if i missed forum to post this question.

I was solving this code question (like for interview):

"Given a large document and a short pattern consisting of a few words (eg. W1 W2 W3), find the shortest string that has all the words in any order (for eg. W2 foo bar dog W1 cat W3 -- is a valid pattern)"

this is from here: http://www.glassdoor.com/Interview/Given-a-large-document-and-a-short-pattern-consisting-of-a-few-words-eg-W1-W2-W3-find-the-shortest-string-that-has-all-QTN_216408.htm

Here is my code. I tested this using Peter Norvig's big.txt (~1mil words) and it runs just under 3sec for my test case.

Is this fast enough? Other comments? Thankx in advance!

this is my 1st post here, so sorry if i missed forum to post this question.

I was solving this code question (like for interview):

"Given a large document and a short pattern consisting of a few words (eg. W1 W2 W3), find the shortest string that has all the words in any order (for eg. W2 foo bar dog W1 cat W3 -- is a valid pattern)"

this is from here: http://www.glassdoor.com/Interview/Given-a-large-document-and-a-short-pattern-consisting-of-a-few-words-eg-W1-W2-W3-find-the-shortest-string-that-has-all-QTN_216408.htm

Here is my code. I tested this using Peter Norvig's big.txt (~1mil words) and it runs just under 3sec for my test case.

Is this fast enough? Other comments? Thankx in advance!

Ireneusz Kordal

Ranch Hand

Posts: 423

posted 5 years ago

If performance satisfies you (or your customer) ..... then yes, it is fast enough.

In other case there are some improvements possible.

First, you don't need 6 different regex patterns to check this condition,

only one is sufficient:

Follow this link to study regex patterns, it's really cool: http://www.regular-expressions.info/lookaround.html

Second, don't use regex at all,

simple 3 calls to a function String.contains() would be much more faster:

http://docs.oracle.com/javase/6/docs/api/java/lang/String.html#contains%28java.lang.CharSequence%29

Third, the given problem is: "find the shortest string".

In your case, a "string" is equivalent to one line from the file.

So, if you have found a line that matches the condition, and the length of this line is X,

then you need to check only lines that have length < X.

For each new line check if it size is < X, if no - then just skip this line and take the next one.

If you find a new line shorter than X, that meet the condition too, then set X = length on this new line.

In this way you avoid searching substrings within the string probably for 50% or even more lines.

Red Ronin wrote:

Is this fast enough? Other comments? Thankx in advance!

If performance satisfies you (or your customer) ..... then yes, it is fast enough.

In other case there are some improvements possible.

First, you don't need 6 different regex patterns to check this condition,

only one is sufficient:

Follow this link to study regex patterns, it's really cool: http://www.regular-expressions.info/lookaround.html

Second, don't use regex at all,

simple 3 calls to a function String.contains() would be much more faster:

http://docs.oracle.com/javase/6/docs/api/java/lang/String.html#contains%28java.lang.CharSequence%29

Third, the given problem is: "find the shortest string".

In your case, a "string" is equivalent to one line from the file.

So, if you have found a line that matches the condition, and the length of this line is X,

then you need to check only lines that have length < X.

For each new line check if it size is < X, if no - then just skip this line and take the next one.

If you find a new line shorter than X, that meet the condition too, then set X = length on this new line.

In this way you avoid searching substrings within the string probably for 50% or even more lines.

posted 5 years ago

Hmmm. Not so sure about that (although I totally agree that

If you used

Winston

BTW: The above could be optimized along the lines that Ireneusz suggested to eliminate the third search if the first two produce a span that is greater than the one you already have; but personally I suspect it'll make the code a lot messier for not a lot of gain.

Ireneusz Kordal wrote:Third, the given problem is: "find the shortest string".

In your case, a "string" is equivalent to one line from the file.

So, if you have found a line that matches the condition, and the length of this line is X,

then you need to check only lines that have length < X.

Hmmm. Not so sure about that (although I totally agree that

`contains()`is better than a regex).

If you used

`indexOf()`and stored the results, you could then check the span between the first and last occurrence to work out your "shortest string", maybe something like:but the wording of the problem is very open to interpretation.

Winston

BTW: The above could be optimized along the lines that Ireneusz suggested to eliminate the third search if the first two produce a span that is greater than the one you already have; but personally I suspect it'll make the code a lot messier for not a lot of gain.

"Leadership is nature's way of removing morons from the productive flow" - Dogbert

Articles by Winston can be found here

Victor M. Pereira

Ranch Hand

Posts: 50

posted 5 years ago

What if the shortest string is between lines? What if the shortest string is a sub string in a line? ex: a w1 w2 w1 w3 x

I believe the shortest Time possible is O(n) for this problem, since divide and conquer would give O(n log n).

Your algorithm would be the almost correct one since checking for 6 patterns is still consider constant order, you would only have to change your line reader and evaluate more like the pseudo code from Ireneusz Kordal.

Performance means the fastest and right one.

To be completely honest this is a limited example of maximum or minimum sub sequence where S is only of

I believe the shortest Time possible is O(n) for this problem, since divide and conquer would give O(n log n).

Your algorithm would be the almost correct one since checking for 6 patterns is still consider constant order, you would only have to change your line reader and evaluate more like the pseudo code from Ireneusz Kordal.

Performance means the fastest and right one.

To be completely honest this is a limited example of maximum or minimum sub sequence where S is only of

**w1**,**w2**,**w3**and**a**that represents any other word.regards,

Victor M. Pereira