
regex

 
Randall Twede
Ranch Hand
Posts: 4481
3
Java Python Scala
as is my nature, before reading more about them, i post a question. this is just because i don't get a lot of internet time.
ok i need, mainly, how to match whole words only. this might seem easy. just check if there is a space before and a space after. first of all i don't know how to do that, and worst of all what about the cases where the match is at the very beginning or very end of the string?

there are many ways to search, i have learned: indexOf(), contains(), matches(); the classes Pattern and Matcher. it looks like regex is the way to go.
 
fred rosenberger
lowercase baba
Bartender
Posts: 12202
35
Chrome Java Linux
As always, you have to first define what you consider a word.

But...regex does have special symbols for beginning and end of a string. IIRC, ^ is the start, and $ is the end. Combined with groupings you can do something like

(^| +)

I think that gives you "beginning of line or one or more spaces". Have not tested at all, so it's probably not 100% correct.
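A minimal sketch of that idea in Java, assuming words are separated by spaces only: the search term is wrapped in "start of string or one or more spaces" on the left and "one or more spaces or end of string" on the right. The class and method names are made up for illustration, and Pattern.quote() is used so the search term is taken literally.

```java
import java.util.regex.Pattern;

final class SpaceDelimited {
    // True if `word` occurs in `text` with start-of-string or spaces before it
    // and spaces or end-of-string after it.
    static boolean containsWord(String text, String word) {
        Pattern p = Pattern.compile("(^| +)" + Pattern.quote(word) + "( +|$)");
        return p.matcher(text).find();
    }

    public static void main(String[] args) {
        System.out.println(containsWord("the cat sat", "cat")); // true
        System.out.println(containsWord("cat", "cat"));         // true
        System.out.println(containsWord("concatenate", "cat")); // false
        System.out.println(containsWord("a cat.", "cat"));      // false: '.' is not a space
    }
}
```

The last case shows the limit of the space-only definition: a word followed by punctuation is not found.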
 
Greg Charles
Sheriff
Posts: 2993
12
Firefox Browser IntelliJ IDE Java Mac Ruby
Depending on the regex implementation, you may have the character class \w (Word) available. So "\w+" matches whole words. As Fred points out though, it depends how you define a word. I don't think that pattern matches contractions, like "don't".
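To see what "\w+" does with a contraction in Java, a short sketch like the following can help (the class name is illustrative; in a Java string literal the pattern is written "\\w+"):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class WordRuns {
    // Collects every run of word characters (letters, digits, underscore).
    static List<String> words(String text) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile("\\w+").matcher(text);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        // The apostrophe is not a word character, so "don't" comes back in two pieces.
        System.out.println(words("I don't know")); // [I, don, t, know]
    }
}
```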

It's a good idea to have a go-to site for testing regular expressions, because they're tricky to get right. Mine is rubular.com, but there are many others.
 
Richard Tookey
Bartender
Posts: 1166
17
Java Linux Netbeans IDE
"\b" represents a word boundary (see the Javadoc for Pattern) but there are a couple of 'gotchas'. First, it does not handle contractions, and second, in a Java string literal you need to escape the backslash, i.e. use "\\b". As always with regular expressions the devil is in the detail, but you have not yet provided any.
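A small sketch of a whole-word test built on "\\b". The class and method names are hypothetical; Pattern.quote() keeps any regex metacharacters in the search term from being interpreted.

```java
import java.util.regex.Pattern;

final class BoundaryMatch {
    // The Java string "\\b" is the regex \b (word boundary).
    static boolean containsWholeWord(String text, String word) {
        Pattern p = Pattern.compile("\\b" + Pattern.quote(word) + "\\b");
        return p.matcher(text).find();
    }

    public static void main(String[] args) {
        System.out.println(containsWholeWord("the cat.", "cat"));    // true: '.' is a non-word character
        System.out.println(containsWholeWord("cat", "cat"));         // true: string start and end count as boundaries
        System.out.println(containsWholeWord("concatenate", "cat")); // false
    }
}
```

Unlike the "space before and space after" approach, this handles the beginning and end of the string and trailing punctuation without special cases.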

My standard references are "Mastering Regular Expressions" by Jeffrey Friedl (regarded as the regex bible), http://www.regular-expressions.info/tutorial.html and http://docs.oracle.com/javase/tutorial/essential/regex/ .
 
Randall Twede
Ranch Hand
Posts: 4481
3
Java Python Scala
thanks for the answers. i didn't know about a go-to site for testing. i still need to read about regex more, i only glanced at it so far. if you have microsoft's wordpad, i am recreating the find and replace dialogs, with checkboxes for Match Case and Whole Word Only. i also realized a word can end in punctuation (.,;:)

i see what you mean by escaping. the : and the ) got interpreted as a smiley face

i have part of it already. Replace All: Match Case was easy i just used the String method replaceAll() using the contents of the two text fields.
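One thing to watch with String.replaceAll(): its first argument is a regular expression and its second may contain group references such as $1, so text typed into a dialog field can behave unexpectedly. A hedged sketch of a literal replace with a Match Case switch follows; the class, method and parameter names are assumptions for illustration, not the code from the dialog.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ReplaceAllLiteral {
    // Replaces every occurrence of `find` with `replacement`, treating both as plain text.
    static String replaceAll(String text, String find, String replacement, boolean matchCase) {
        // LITERAL switches off regex metacharacters in the search term;
        // CASE_INSENSITIVE is added only when Match Case is unchecked.
        int flags = Pattern.LITERAL | (matchCase ? 0 : Pattern.CASE_INSENSITIVE);
        Pattern p = Pattern.compile(find, flags);
        // quoteReplacement stops '$' and '\' in the replacement from being special.
        return p.matcher(text).replaceAll(Matcher.quoteReplacement(replacement));
    }

    public static void main(String[] args) {
        System.out.println(replaceAll("a.b A.B axb", "a.b", "$1", true));  // $1 A.B axb
        System.out.println(replaceAll("a.b A.B axb", "a.b", "$1", false)); // $1 $1 axb
    }
}
```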
 
naved momin
Ranch Hand
Posts: 692
Eclipse IDE Java Linux
to divide words using REGEX all you need to write is
([a-zA-Z]*\S)


this is just for demonstration purpose hope it helps..
 
fred rosenberger
lowercase baba
Bartender
Posts: 12202
35
Chrome Java Linux
naved momin wrote:to divide words using REGEX all you need to write is
([a-zA-Z]*\S)


this is just for demonstration purpose hope it helps..

But as stated, that won't find words like:

Can't
O'Malley
It won't find the word at the end of this sentence.
It won't find a single word on a line by itself.
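One way to check a pattern like this is to run it over a sample and list what each match contains. A small sketch for the contraction case (the class name is illustrative):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class QuotedPatternCheck {
    // Lists the successive matches of the quoted pattern ([a-zA-Z]*\S).
    static List<String> matches(String text) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile("([a-zA-Z]*\\S)").matcher(text);
        while (m.find()) {
            out.add(m.group(1));
        }
        return out;
    }

    public static void main(String[] args) {
        // "Can't" is not returned as one word: the match stops at the apostrophe.
        System.out.println(matches("Can't stop")); // [Can', t, stop]
    }
}
```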
 
alex gorn
Greenhorn
Posts: 8
I think you need to know what text you are analyzing. If it contains locale-specific symbols, you cannot use something like [a-zA-Z], which covers Latin letters only. But Java has something like this:



which reads as: the Unicode property "letter" (for all languages), OR a certain symbol of your own choosing, which must sit between word boundaries (\b), OR the symbol '
so we can catch words like O'Reilly or well-done
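The pattern from this post is not shown here, so the following is not it. It is only a sketch of the same idea under my own assumptions: \p{L} matches a letter in any script, and a single apostrophe or hyphen is accepted when it sits between letters. The class name and the exact pattern are illustrative.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class UnicodeWords {
    // One or more letters, optionally continued by ' or - followed by more letters.
    private static final Pattern WORD = Pattern.compile("\\p{L}+(?:['-]\\p{L}+)*");

    static List<String> words(String text) {
        List<String> out = new ArrayList<>();
        Matcher m = WORD.matcher(text);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(words("O'Reilly is well-done")); // [O'Reilly, is, well-done]
    }
}
```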
 
Randall Twede
Ranch Hand
Posts: 4481
3
Java Python Scala
actually \b does seem to handle contractions. i just tried it. i am far from finished though.
this code is from the replace dialog class

the goal is for it to act more or less like Wordpad. i decided not to worry about unicode(non western stuff)
 
Richard Tookey
Bartender
Posts: 1166
17
Java Linux Netbeans IDE
Randall Twede wrote:actually \b does seem to handle contractions.


Interesting, since my simple example shows that it doesn't! The code below produces an array containing all the words, but with all the apostrophes separated from their adjoining words. If the contractions were being handled as being part of the word, then they would be included in the word in the split. Am I missing something?


Using find() as in below does just the same !
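The snippets referred to in this post are not shown here. As a stand-in, this is a minimal sketch of the kind of experiment described, written from scratch rather than recovered from the original: split a contraction on "\\b" and look at the pieces. The class name is made up.

```java
import java.util.Arrays;

final class BoundarySplit {
    // Splits the text at every word boundary.
    static String[] pieces(String text) {
        return text.split("\\b");
    }

    public static void main(String[] args) {
        // The apostrophe ends up as its own element, separate from "isn" and "t".
        // Output shown is for Java 8 and later; older versions may add a leading empty string.
        System.out.println(Arrays.toString(pieces("isn't"))); // [isn, ', t]
    }
}
```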
 
Randall Twede
Ranch Hand
Posts: 4481
3
Java Python Scala
in the JTextArea i typed
isn't isn't isn'ta
in the dialog i told it to replace all isn't with is
i got
is is isn'ta
i'll try it again
same result
maybe it is because i have java 7
i don't know
 
Richard Tookey
Bartender
Posts: 1166
17
Java Linux Netbeans IDE
Randall Twede wrote:
maybe it is because i have java 7


Sorry, but that is not the reason. It has been the same since the regex package was introduced into Java and, though I cannot claim to have tried all regex implementations, I have never met a regex flavour in any language that is different.
 
Randall Twede
Ranch Hand
Posts: 4481
3
Java Python Scala
i see the problem now. if i type
can can't candy
then i say replace all can with dan, whole word only
i get
dan dan't candy
clearly a problem, but when i tried it in Wordpad and Open Office i got the same results
so i guess i won't worry about it
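For anyone who does want to worry about it, both behaviours from this thread can be reproduced in a few lines. The first method is the plain "\\b" approach; the second is one possible stricter variant that uses lookarounds to reject a match touching a word character or an apostrophe. The names are hypothetical and the stricter rule is an assumption about what "whole word" should mean, not what Wordpad does.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ApostropheAware {
    // Plain \b: the apostrophe is a non-word character, so "can" inside "can't" qualifies.
    static String replaceWithBoundary(String text, String word, String replacement) {
        String regex = "\\b" + Pattern.quote(word) + "\\b";
        return text.replaceAll(regex, Matcher.quoteReplacement(replacement));
    }

    // Stricter: no word character or apostrophe directly before or after the match.
    static String replaceStrict(String text, String word, String replacement) {
        String regex = "(?<![\\w'])" + Pattern.quote(word) + "(?![\\w'])";
        return text.replaceAll(regex, Matcher.quoteReplacement(replacement));
    }

    public static void main(String[] args) {
        System.out.println(replaceWithBoundary("can can't candy", "can", "dan"));      // dan dan't candy
        System.out.println(replaceStrict("can can't candy", "can", "dan"));            // dan can't candy
        System.out.println(replaceWithBoundary("isn't isn't isn'ta", "isn't", "is"));  // is is isn'ta
    }
}
```

The third call shows why the earlier isn't test looked fine: when the search term itself contains the apostrophe, the boundaries that matter are at the outer ends of the term. The stricter variant has its own trade-off, since it would also skip a word wrapped in single quotes.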
 