RegEx to negate a set of words

 
Ranch Hand
Posts: 31
Hi,
I'm trying to concoct a regular expression that does the following:
given a data set, retrieve the words that are not in another given set of words.
For example, if I have the input:
all
great
minds
think
alike
and the constraint (great|think),
the output would be:
all
minds
alike


So, sort of: give me * but not if * = great or think.

Thanks
 
Ranch Hand
Posts: 262

All the "\\b"s are there to make sure you're only comparing whole words to whole words; lookaheads are slippery that way.
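
For readers who want to try it, here is a minimal sketch that applies the pattern quoted in the next reply, "\\b(?!(?:great|think)\\b)\\w+\\b", to the sample input from the first post. The class name NegatedWords and the method keptWords are made up for illustration; only the pattern itself comes from this thread.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NegatedWords {
    // Pattern quoted in this thread: a whole word that is not exactly "great" or "think".
    private static final Pattern NOT_EXCLUDED =
            Pattern.compile("\\b(?!(?:great|think)\\b)\\w+\\b");

    // Hypothetical helper: returns every word the pattern accepts, in input order.
    static List<String> keptWords(String input) {
        List<String> result = new ArrayList<>();
        Matcher m = NOT_EXCLUDED.matcher(input);
        while (m.find()) {
            result.add(m.group());
        }
        return result;
    }

    public static void main(String[] args) {
        // Sample input from the first post, one word per line.
        System.out.println(keptWords("all\ngreat\nminds\nthink\nalike"));
        // Longer words that merely start with an excluded word are still kept,
        // because of the "\\b" inside the lookahead.
        System.out.println(keptWords("greatness thinker great"));
    }
}
```

The first call should print [all, minds, alike], and the second [greatness, thinker]. Without the "\\b" at the end of the lookahead, "greatness" and "thinker" would be rejected as well.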
 
Ranch Hand
Posts: 264

"\\b(?!(?:great|think)\\b)\\w+\\b"



What does the "?:" mean?

Does that mean remove?
If the pipe between great and think were changed to a comma, what effect would that have on the regex?
 
Alan Moore
Ranch Hand
Posts: 262
The "?:" means the enclosing parentheses form a non-capturing group. It's good practice to always use non-capturing parentheses for grouping if you don't actually need to capture that part of the match.

If you replaced the pipe with a comma, the lookahead would match the literal sequence "great,think" instead of "great" OR "think"; the only place a comma has special meaning is in the "{m,n}" quantifier.


http://www.regular-expressions.info/tutorial.html
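
Both points above can be checked with a small sketch. The class GroupingDemo and its two helper methods are hypothetical names used only for this illustration; Pattern.matches and Matcher.groupCount are standard java.util.regex calls.

```java
import java.util.regex.Pattern;

public class GroupingDemo {
    // Number of capturing groups that the compiled pattern defines.
    static int groupCount(String regex) {
        return Pattern.compile(regex).matcher("").groupCount();
    }

    // True if the regex matches the entire input string.
    static boolean matchesWhole(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        // Plain parentheses capture; "?:" groups without capturing.
        System.out.println(groupCount("(great|think)"));    // 1
        System.out.println(groupCount("(?:great|think)"));  // 0

        // The pipe means alternation; the comma is just a literal character.
        System.out.println(matchesWhole("(?:great|think)", "great"));       // true
        System.out.println(matchesWhole("(?:great,think)", "great"));       // false
        System.out.println(matchesWhole("(?:great,think)", "great,think")); // true
    }
}
```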
 
Tad Dicks
Ranch Hand
Posts: 264

If you replaced the pipe with a comma, the lookahead would match the literal sequence "great,think" instead of "great" OR "think"; the only place a comma has special meaning is in the "{m,n}" quantifier.



I've been staring at too many DTDs, which use a lot of regex-like syntax, and was thinking the comma might be akin to an "and" (like a sequence in an element declaration, versus the pipes being an "or" in a choice).

-Tad
 
Ranch Hand
Posts: 3061
This might be easier to do without regexes, depending on what the purpose is. In particular, every class that implements the Collection interface from the Collections framework has removeAll() and retainAll(), which act like the mathematical set difference and set intersection operations. You could add each "word" to a Collection of your choice (perhaps a Set?) and use these operations to get a Collection with the words you want.
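
A minimal sketch of that idea, assuming the words are already split into a list. The class SetDifferenceDemo and its method names are hypothetical; a LinkedHashSet is chosen here only so the result keeps the input order, and note that a Set drops duplicate words.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class SetDifferenceDemo {
    // Words that are NOT in the excluded list (set difference), in input order.
    static Set<String> difference(List<String> words, List<String> excluded) {
        Set<String> result = new LinkedHashSet<>(words); // copy, so the input is untouched
        result.removeAll(excluded);
        return result;
    }

    // Words that ARE in the other list (set intersection), in input order.
    static Set<String> intersection(List<String> words, List<String> other) {
        Set<String> result = new LinkedHashSet<>(words);
        result.retainAll(other);
        return result;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("all", "great", "minds", "think", "alike");
        List<String> excluded = Arrays.asList("great", "think");

        System.out.println(difference(words, excluded));   // [all, minds, alike]
        System.out.println(intersection(words, excluded)); // [great, think]
    }
}
```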

Let me know what you think.

Layne
 
Ranch Hand
Posts: 220
Wow, that's such a refreshingly simple solution.
Spectacular, I must say, but doesn't it trade off efficiency?
Wouldn't Collections hog more memory than plain long String arrays?
 
Layne Lund
Ranch Hand
Posts: 3061
I think you will have to implement both approaches and measure how much memory overhead there is for using a Collection over an array of Strings. I expect the overhead to be negligible. If implemented correctly, I think my idea will be much easier to understand and maintain, which outweighs the cost of the extra memory.

In addition, you need to consider what the purpose for this is. At least, I assume that this is a small part of a larger project. Which approach will provide a data structure that other code can interface with more easily? Since you haven't provided much in the way of context, I can't even provide a suggestion along these lines. Even if I did, this boils down to a design decision on your part.

Layne
 
Akshay Kiran
Ranch Hand
Posts: 220
The problem at hand wasn't mine, so I can't say much about the context either.
But imagine:
if there were 1000 words in the list and a String of 1000 words under consideration, how would the two approaches play out at run time?
On grounds of readability, certainly yes, your approach would be far better than the regex...
The only moot point may be: "will there be an objectionable memory overhead, and if yes, is it worth the trade-off?"
I think these questions are best answered by those who get their hands dirty with such stuff. I can't keep my hands clean here and provide answers!
[ October 24, 2005: Message edited by: Akshay Kiran ]