RegEx to negate a set of words

 
Bucsie Dusca
Ranch Hand
Posts: 31
Hi
I'm trying to concoct a regular expression that does the following:
given a set of input words, I want it to retrieve the words that are not in another given set of words.
For example, if I have the input:
all
great
minds
think
alike
and the constraint: (great|think)
the output would be:
all
minds
alike


So, roughly: give me * unless * is great or think.

Thanks
 
Alan Moore
Ranch Hand
Posts: 262

"\\b(?!(?:great|think)\\b)\\w+\\b"

All the "\\b"s are there to make sure you're only comparing whole words to whole words; lookaheads are slippery that way.
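
For anyone who wants to try it, here is a minimal sketch of how the pattern quoted in this thread might be applied to the original poster's input. The class and method names (NegatedWordsDemo, wordsNotExcluded) are made up for illustration, and the input is assumed to be a single whitespace-separated String.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NegatedWordsDemo {
    // Matches any whole word that is not exactly "great" or "think".
    // The \\b inside the lookahead keeps longer words such as "greatness" from being rejected.
    static final Pattern NOT_EXCLUDED =
            Pattern.compile("\\b(?!(?:great|think)\\b)\\w+\\b");

    // Hypothetical helper: collects every word in the input that the pattern accepts.
    static List<String> wordsNotExcluded(String input) {
        List<String> result = new ArrayList<>();
        Matcher m = NOT_EXCLUDED.matcher(input);
        while (m.find()) {
            result.add(m.group());
        }
        return result;
    }

    public static void main(String[] args) {
        // prints [all, minds, alike]
        System.out.println(wordsNotExcluded("all great minds think alike"));
    }
}
```

Without the "\\b" inside the lookahead, a word like "greatness" would be rejected too, because the lookahead would succeed on its first five letters.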
 
Tad Dicks
Ranch Hand
Posts: 264
"\\b(?!(?:great|think)\\b)\\w+\\b"


What does the "?:" mean?

Does that mean remove?
If the pipe between great and think were changed to a comma, what effect would that have on the regex?
 
Alan Moore
Ranch Hand
Posts: 262
The "?:" means the enclosing parentheses form a non-capturing group. It's good practice to always use non-capturing parentheses for grouping if you don't actually need to capture that part of the match.

If you replaced the pipe with a comma, the lookahead would match the literal sequence "great,think" instead of "great" OR "think"; the only place a comma has special meaning is in the "{m,n}" quantifier.
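
Both points can be checked with a small sketch like the one below. The class and helper names (GroupingDemo, groupCount, found) are invented for this example; only Pattern and Matcher come from java.util.regex.

```java
import java.util.regex.Pattern;

class GroupingDemo {
    // Number of capturing groups the regex defines.
    static int groupCount(String regex) {
        return Pattern.compile(regex).matcher("").groupCount();
    }

    // True if the regex matches somewhere in the input.
    static boolean found(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        System.out.println(groupCount("(great|think)"));   // 1: plain parentheses capture
        System.out.println(groupCount("(?:great|think)")); // 0: "?:" groups without capturing

        // With a comma instead of a pipe, the group is one literal sequence.
        System.out.println(found("(?:great,think)", "great"));       // false
        System.out.println(found("(?:great,think)", "great,think")); // true
    }
}
```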


http://www.regular-expressions.info/tutorial.html
 
Tad Dicks
Ranch Hand
Posts: 264
If you replaced the pipe with a comma, the lookahead would match the literal sequence "great,think" instead of "great" OR "think"; the only place a comma has special meaning is in the "{m,n}" quantifier.


I've been staring at too many DTDs, which use a lot of regex-like syntax, and was thinking the comma might be akin to an "and" (like a sequence in an element declaration, versus the pipe being an "or" in a choice).

-Tad
 
Layne Lund
Ranch Hand
Posts: 3061
This might be easier to do without regexes, depending on what the purpose is. In particular, every class that implements the Collection interface from the Collections framework has removeAll() and retainAll() methods that act like the mathematical set difference and set intersection operations, respectively. You could add each "word" to a Collection of your choice (perhaps a Set?) and use these operations to get a Collection with the words you want.
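
A minimal sketch of that idea, assuming the words are already available as Collections of Strings. The class and method names (SetDifferenceDemo, difference, intersection) are hypothetical; removeAll() and retainAll() are the standard Collection methods. A LinkedHashSet is used here only so the result keeps the input order; note that any Set also drops duplicate words.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.LinkedHashSet;
import java.util.Set;

class SetDifferenceDemo {
    // Words that are in 'words' but not in 'excluded' (set difference).
    static Set<String> difference(Collection<String> words, Collection<String> excluded) {
        Set<String> result = new LinkedHashSet<>(words); // copy, so the caller's data is untouched
        result.removeAll(excluded);
        return result;
    }

    // Words that appear in both collections (set intersection).
    static Set<String> intersection(Collection<String> words, Collection<String> other) {
        Set<String> result = new LinkedHashSet<>(words);
        result.retainAll(other);
        return result;
    }

    public static void main(String[] args) {
        Collection<String> words = Arrays.asList("all", "great", "minds", "think", "alike");
        Collection<String> excluded = Arrays.asList("great", "think");
        System.out.println(difference(words, excluded));   // prints [all, minds, alike]
        System.out.println(intersection(words, excluded)); // prints [great, think]
    }
}
```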

Let me know what you think.

Layne
 
Akshay Kiran
Ranch Hand
Posts: 220
Wow, that's such a refreshingly simple solution.
Spectacular, I must say, but doesn't it trade off efficiency?
Wouldn't Collections hog more memory than plain String arrays?
 
Layne Lund
Ranch Hand
Posts: 3061
I think you will have to implement both approaches and measure how much memory overhead there is for using a Collection over an array of Strings. I think the overhead will be negligible. If implemented correctly, I think my idea will be much easier to understand and maintain, which outweighs the costs for the extra memory overhead.
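
As a starting point for that kind of comparison, here is a rough sketch that runs both approaches on the same generated data and checks that they agree. Everything in it is an assumption made for illustration: the class and method names, the generated words "w0", "w1", ..., and the choice of a HashSet for lookups. It makes no claim about speed or memory; you would still have to measure those yourself.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BothApproachesDemo {
    // Regex approach: one negative lookahead built from all excluded words.
    static List<String> viaRegex(String text, List<String> excluded) {
        StringBuilder alternation = new StringBuilder();
        for (String word : excluded) {
            if (alternation.length() > 0) alternation.append('|');
            alternation.append(Pattern.quote(word)); // quote in case a word contains metacharacters
        }
        // With nothing to exclude, leave the lookahead out entirely.
        String lookahead = excluded.isEmpty() ? "" : "(?!(?:" + alternation + ")\\b)";
        Matcher m = Pattern.compile("\\b" + lookahead + "\\w+\\b").matcher(text);
        List<String> result = new ArrayList<>();
        while (m.find()) result.add(m.group());
        return result;
    }

    // Collection approach: split on whitespace and look each word up in a HashSet.
    static List<String> viaSet(String text, List<String> excluded) {
        Set<String> lookup = new HashSet<>(excluded);
        List<String> result = new ArrayList<>();
        for (String word : text.split("\\s+")) {
            if (!word.isEmpty() && !lookup.contains(word)) result.add(word);
        }
        return result;
    }

    // Hypothetical test data: "w<from>" up to "w<to - 1>".
    static List<String> sampleWords(int from, int to) {
        List<String> words = new ArrayList<>();
        for (int i = from; i < to; i++) words.add("w" + i);
        return words;
    }

    public static void main(String[] args) {
        String text = String.join(" ", sampleWords(0, 1000)); // 1000 input words
        List<String> excluded = sampleWords(500, 1500);       // 1000 excluded words
        List<String> a = viaRegex(text, excluded);
        List<String> b = viaSet(text, excluded);
        System.out.println(a.size());    // prints 500
        System.out.println(a.equals(b)); // prints true
    }
}
```

The two methods only agree when the input consists of plain \w-style words separated by whitespace; punctuation would be handled differently by each.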

In addition, you need to consider what the purpose for this is. At least, I assume that this is a small part of a larger project. Which approach will provide a data structure that other code can interface with more easily? Since you haven't provided much in the way of context, I can't even provide a suggestion along these lines. Even if I did, this boils down to a design decision on your part.

Layne
 
Akshay Kiran
Ranch Hand
Posts: 220
The problem at hand wasn't mine, so I can't say much about the context either.
But imagine there were 1000 words in the list and a String of 1000 words under consideration: how would the JVM go about executing the two approaches?
On grounds of readability, certainly, your approach would be far better than a regex.
The only moot point may be: will there be an objectionable memory overhead, and if so, is it worth the trade-off?
I think these questions are best answered by those who dirty their hands with such stuff. I can't keep my hands clean here and provide answers!
[ October 24, 2005: Message edited by: Akshay Kiran ]
 