
[regex] select words of 3 letters or more between other words

 
mark smith
Ranch Hand
Posts: 258
Hi,

With this regex, I can tell whether a text contains the words ice and snow but not tree and ski.



I want to get the words between ice and snow (the text must not contain tree and ski) that have more than 3 letters.

Is there a way to do it with a regex?

Thanks
 
Stephan van Hulst
Saloon Keeper
Posts: 7993
Why do you want to use a regex? Is it not allowed for "tree" or "ski" to be anywhere in the input, or just not in the part between ice and snow? What if there are multiple instances of the words ice and snow? Do you also want text between "ice" and "ice"? Can the order of "ice" and "snow" be reversed?

Please give us more information on the requirements and circumstances you're working with, and why you're trying to achieve this in the first place.
 
mark smith
Ranch Hand
Posts: 258
Stephan van Hulst wrote:Why do you want to use a regex? Is it not allowed for "tree" or "ski" to be anywhere in the input, or just not in the part between ice and snow? What if there are multiple instances of the words ice and snow? Do you also want text between "ice" and "ice"? Can the order of "ice" and "snow" be reversed?

Please give us more information on the requirements and circumstances you're working with, and why you're trying to achieve this in the first place.


I'm not a regex expert, but I think it could take less time to write a regex than to write a function that does the same thing.

As in the regex I posted:

The order is not important.
ice, snow, tree and ski can be anywhere.


I don't need to handle multiple instances of the words ice and snow, but that would be a plus...

I don't want the text between ice and ice, only between ice and snow or between snow and ice.
 
Stephan van Hulst
Saloon Keeper
Posts: 7993
Well, you definitely don't want to do this with one regex. Even when you break it up it's going to look really ugly. Take a look:
 
mark smith
Ranch Hand
Posts: 258
Stephan van Hulst wrote:Well, you definitely don't want to do this with one regex. Even when you break it up it's going to look really ugly. Take a look:


It doesn't seem to work correctly: with the input "ice hello house snow tree ski"
it matches, but it should not, because tree and ski are both present.

Also,

on the result, I can loop over every word and display only the words that have more than 3 letters, but is there a way to do it directly in the regex?
 
Henry Wong
author
Sheriff
Posts: 23295
mark smith wrote:On the result, I can loop over every word and display only the words that have more than 3 letters, but is there a way to do it directly in the regex?



Ignore me if I seem to be the only one ... but having read the topic posts, I am still not clear what is being asked for here. Could you show us a bunch of examples? Input, and expected output?

Henry
 
mark smith
Ranch Hand
Posts: 258
Henry Wong wrote:
mark smith wrote:On the result, I can loop over every word and display only the words that have more than 3 letters, but is there a way to do it directly in the regex?



Ignore me if I seem to be the only one ... but having read the topic posts, I am still not clear what is being asked for here. Could you show us a bunch of examples? Input, and expected output?

Henry


ski and tree both need to be present for the input to be bad.

snow hello house ice ski tree -> bad
snow the hello house ice ski -> return the words between snow and ice that have more than 3 letters, so hello and house are returned
snow the hello house ice -> return the words between snow and ice that have more than 3 letters, so hello and house are returned

It needs to work fine even when only one of the two bad words is present.


To get only the words of 3 letters or more, I tried this without success:

 
Henry Wong
author
Sheriff
Posts: 23295
mark smith wrote:
ski and tree both need to be present for the input to be bad.

snow hello house ice ski tree -> bad
snow the hello house ice ski -> return the words between snow and ice that have more than 3 letters, so hello and house are returned
snow the hello house ice -> return the words between snow and ice that have more than 3 letters, so hello and house are returned

It needs to work fine even when only one of the two bad words is present.


Oh, I see now.

mark smith wrote:
It doesn't seem to work correctly: with the input "ice hello house snow tree ski"
it matches, but it should not, because tree and ski are both present.


You will need to modify the bad regex to match only when both bad words are present. Currently, it is one or the other.

mark smith wrote:
On the result, I can loop over every word and display only the words that have more than 3 letters, but is there a way to do it directly in the regex?


Not really. Regexes are not really good at returning an unknown number of matches with a single match. You have to rewrite it to loop yourself -- and probably to check the edge tokens yourself too.

mark smith wrote:
I'm not a regex expert, but I think it could take less time to write a regex than to write a function that does the same thing.


I guess you are starting to realize that this may not be true.

Henry
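
To illustrate the remark above about the bad regex needing both bad words: one possible way is to chain two lookaheads from the start of the input, so that the pattern only succeeds when "tree" and "ski" both occur somewhere. This is a minimal sketch, not the pattern posted earlier in the thread; the class name BadWordCheck and the method isBad are made up for illustration.

```java
import java.util.regex.Pattern;

final class BadWordCheck {
    // Two independent lookaheads anchored at the start of the input.
    // Both must succeed, so the input is "bad" only when it contains
    // the whole word "tree" AND the whole word "ski", in any order.
    private static final Pattern BOTH_BAD =
            Pattern.compile("^(?=.*\\btree\\b)(?=.*\\bski\\b)", Pattern.DOTALL);

    static boolean isBad(String input) {
        return BOTH_BAD.matcher(input).find();
    }

    public static void main(String[] args) {
        System.out.println(isBad("snow hello house ice ski tree")); // true
        System.out.println(isBad("snow the hello house ice ski"));  // false
        System.out.println(isBad("snow the hello house ice"));      // false
    }
}
```

The three inputs in main are the examples given in the thread. Two plain String.contains calls would arguably be more readable if whole-word matching is not required.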
 
Henry Wong
author
Sheriff
Posts: 23295
mark smith wrote:
To get only the words of 3 letters or more, I tried this without success:




"^{0,3}\\w" -- means zero to three of the beginning of input marker followed by a single word character. Of course, this makes no sense, since there is no way that the beginning of input marker can appear after an edge token, especially if you want more than one of it.

Henry
 
Henry Wong
author
Sheriff
Posts: 23295
Henry Wong wrote:
mark smith wrote:
On the result, I can loop over every word and display only the words that have more than 3 letters, but is there a way to do it directly in the regex?


Not really. Regexes are not really good at returning an unknown number of matches with a single match. You have to rewrite it to loop yourself -- and probably to check the edge tokens yourself too.


I guess another way to do this is... use regex to capture the phrase between the two edges, then use regex on the phrase to get all words greater than three letters.

Henry
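
The two-step idea above could be sketched roughly as follows. It assumes the edges are the whole words "ice" and "snow" in either order, that only the first pair needs handling, and that "more than three letters" means four or more word characters (which is why "the" is dropped in the thread's examples). The class and method names are hypothetical, and the tree/ski check is deliberately left out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class WordsBetweenEdges {
    // Step 1: capture the phrase between the two edges, in either order.
    // Group 1 is filled for "ice ... snow", group 2 for "snow ... ice".
    private static final Pattern BETWEEN = Pattern.compile(
            "\\bice\\b(.*?)\\bsnow\\b|\\bsnow\\b(.*?)\\bice\\b");

    // Step 2: a run of four or more word characters.
    // Note that \w also accepts digits and underscores.
    private static final Pattern LONG_WORD = Pattern.compile("\\w{4,}");

    static List<String> longWordsBetween(String input) {
        List<String> words = new ArrayList<>();
        Matcher between = BETWEEN.matcher(input);
        if (!between.find()) {
            return words; // no edge pair found
        }
        String phrase = between.group(1) != null ? between.group(1) : between.group(2);
        Matcher word = LONG_WORD.matcher(phrase);
        while (word.find()) { // loop, since the number of words is unknown
            words.add(word.group());
        }
        return words;
    }

    public static void main(String[] args) {
        System.out.println(longWordsBetween("snow the hello house ice ski")); // [hello, house]
        System.out.println(longWordsBetween("ice the hello house test snow")); // [hello, house, test]
    }
}
```

Splitting the work this way keeps each pattern short, and the loop over find() handles the indeterminate number of words that a single match cannot return.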
 
mark smith
Ranch Hand
Posts: 258
Henry Wong wrote:

I guess another way to do this is... use regex to capture the phrase between the two edges, then use regex on the phrase to get all words greater than three letters.

Henry


I thought I could replace (.*) in the code below



with (?=\\w{4,}\\b)

(.*) is the part of the regex that captures the sentence between the two edges, no?
 
Henry Wong
author
Sheriff
Posts: 23295
mark smith wrote:
(.*) is the part of the regex that captures the sentence between the two edges, no?


Yes, it captures the phrase between the two edges. With it, you can use another regex to get the words of four or more letters. You will not be able to capture the words in the same pass, because you have an indeterminate number of words.

Henry
 
Stephan van Hulst
Saloon Keeper
Posts: 7993
Why do you need to do all of this with as few (horrible, long, unreadable) regular expressions as possible?

Just write readable code. In the example I have given (which apparently doesn't exactly work correctly, but I'm sure you can change that), you can simply perform some operations on the capture returned by the 'good' pattern, as Henry already implied. This would be a *much* more preferable solution to doing it in a horrible, long, unreadable regex, even *if* you could do it with one regex in the first place.
 
mark smith
Ranch Hand
Posts: 258
Henry Wong wrote:
mark smith wrote:
(.*) is the part of the regex that captures the sentence between the two edges, no?


Yes, it captures the phrase between the two edges. With it, you can use another regex to get the words of four or more letters. You will not be able to capture the words in the same pass, because you have an indeterminate number of words.

Henry


I added another pattern: splittedWord

Then I tried to run a matcher on the value returned by the good matcher...




splitedMatcher.matches() always returns false...
I don't understand why.

My input was: ice the hello house test snow
 
Rob Spoor
Sheriff
Posts: 21135
matcher.group(1) is " the hello house test " (including leading and trailing spaces). That certainly does not match your splittedWords pattern. It could find a few results ("hello", "house", "test"), but that's not what you're doing right now.

Edit: I misread the splittedWords pattern. It wouldn't cause "hello", "house" and "test" to be found, but instead empty strings just before those words. After all, you're using a positive lookahead.
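
The zero-length behaviour described in the edit above can be checked with a small experiment. This sketch applies the lookahead-only pattern quoted earlier in the thread, (?=\\w{4,}\\b), to the captured phrase, and compares it with a pattern that actually consumes the word characters. The class name LookaheadDemo and the helper findAll are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class LookaheadDemo {
    // Collects the text of every match that find() reports for the regex.
    static List<String> findAll(String regex, String input) {
        List<String> matches = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            matches.add(m.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        String phrase = " the hello house test ";
        // Lookahead only: nothing is consumed, so every match is the empty string.
        System.out.println(findAll("(?=\\w{4,}\\b)", phrase));
        // Consuming pattern: the words themselves are the matches.
        System.out.println(findAll("\\w{4,}", phrase)); // [hello, house, test]
    }
}
```

A lookahead only asserts that something follows; to get the words back, the characters have to be matched outside the lookahead.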
 
Henry Wong
author
Sheriff
Posts: 23295
Rob Spoor wrote:Edit: I misread the splittedWords pattern. It wouldn't cause "hello", "house" and "test" to be found, but instead empty strings just before those words. After all, you're using a positive lookahead.



Yeah. The original post had a regex that contains both positive and negative look-aheads. I am surprised that the OP doesn't know (or forgot) that look-aheads (and look-behinds) are non-capturing.

Henry
 
mark smith
Ranch Hand
Posts: 258
Rob Spoor wrote:matcher.group(1) is " the hello house test " (including leading and trailing spaces). That certainly does not match your splittedWords pattern. It could find a few results ("hello", "house", "test"), but that's not what you're doing right now.

Edit: I misread the splittedWords pattern. It wouldn't cause "hello", "house" and "test" to be found, but instead empty strings just before those words. After all, you're using a positive lookahead.


This code should split the sentence and get all the words, no?


I'm lost.

I tried a couple of solutions on http://www.regexplanet.com/ but they always fail.
 
Henry Wong
author
Sheriff
Posts: 23295
mark smith wrote:This code should split the sentence and get all the words, no?



no

mark smith wrote:I'm lost.

I tried a couple of solutions on http://www.regexplanet.com/ but they always fail.


It may be a good idea to start with a good tutorial on regular expressions. Regex is not something that can be learned by trial and error.

Henry
 
mark smith
Ranch Hand
Posts: 258
Henry Wong wrote:
mark smith wrote:This code should split the sentence and get all the words, no?



no

mark smith wrote:I'm lost.

I tried a couple of solutions on http://www.regexplanet.com/ but they always fail.


It may be a good idea to start with a good tutorial on regular expressions. Regex is not something that can be learned by trial and error.

Henry


I used:


That works, though surely there is a better way to do it.

I will buy a book.
 
Rob Spoor
Sheriff
Posts: 21135
That regex looks pretty good to me. It's exactly what you want: words that contain 3 or more letters.
 
mark smith
Ranch Hand
Posts: 258
Rob Spoor wrote:That regex looks pretty good to me. It's exactly what you want: words that contain 3 or more letters.


Why does it return false when I check with splitedMatcher.matches()?
 
Henry Wong
author
Sheriff
Posts: 23295
mark smith wrote:
Rob Spoor wrote:That regex looks pretty good to me. It's exactly what you want: words that contain 3 or more letters.


Why does it return false when I check with splitedMatcher.matches()?


Because the matches() and find() methods are not the same thing. The matches() method is used to determine if the regex matches the whole input string. The find() method searches for the next substring in the input that matches, and makes it available as group zero.

Henry
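
The difference between the two methods can be seen in a short sketch. The pattern below, \\b\\w{3,}\\b, is an assumption based on the description "words that contain 3 or more letters" and may not be exactly the one that was posted; the class and method names are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class MatchesVersusFind {
    // Assumed pattern: a whole word of three or more word characters.
    static final Pattern WORD = Pattern.compile("\\b\\w{3,}\\b");

    // matches() succeeds only if the pattern covers the entire input.
    static boolean matchesWhole(String input) {
        return WORD.matcher(input).matches();
    }

    // find() scans for the next matching substring, so it is called in a loop.
    static int countFound(String input) {
        Matcher m = WORD.matcher(input);
        int count = 0;
        while (m.find()) {
            count++; // m.group() (group zero) holds the current word here
        }
        return count;
    }

    public static void main(String[] args) {
        String phrase = " the hello house test ";
        System.out.println(matchesWhole(phrase));  // false: spaces and several words
        System.out.println(matchesWhole("hello")); // true: the input is a single word
        System.out.println(countFound(phrase));    // 4: the, hello, house, test
    }
}
```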
 
Henry Wong
author
Sheriff
Posts: 23295
mark smith wrote:That works, though surely there is a better way to do it.


There is always room for improvement. For example, since the find() goes from left to right, and the regex is greedy, you really don't need the two word boundary specifiers.

Henry
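
The point about the word boundaries can be tried out with a quick comparison. As before, \\b\\w{3,}\\b is only an assumed stand-in for the pattern that was posted, and BoundaryDemo and findAll are hypothetical names. When find() scans the input from the start, the greedy \\w{3,} already consumes each word in full, so both patterns should return the same words.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class BoundaryDemo {
    // Collects the text of every match that find() reports for the regex.
    static List<String> findAll(String regex, String input) {
        List<String> matches = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            matches.add(m.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        String input = "ice the hello house to be test snow";
        List<String> withBoundaries = findAll("\\b\\w{3,}\\b", input);
        List<String> withoutBoundaries = findAll("\\w{3,}", input);
        System.out.println(withBoundaries);    // [ice, the, hello, house, test, snow]
        System.out.println(withoutBoundaries); // same list
        System.out.println(withBoundaries.equals(withoutBoundaries)); // true
    }
}
```

The boundaries would still matter if the search started in the middle of a word, for example after setting a region on the matcher.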
 