regex

Randall Twede
Ranch Hand

Joined: Oct 21, 2000
Posts: 4347
    

as is my nature, before reading more about them, i post a question. this is just because i don't get a lot of internet time.
ok i need, mainly, how to match whole words only. this might seem easy. just check if there is a space before and a space after. first of all i don't know how to do that, and worst of all what about the cases where the match is at the very beginning or very end of the string?

there are many ways to find, i have learned: indexOf(), contains(), matches(); the classes Pattern and Matcher. it looks like regex is the way to go.
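
A minimal sketch of how the options listed above differ for the whole-word case, assuming only java.util.regex. The class name FindOptions, the helper containsWholeWord and the sample text are made up for illustration.

```java
import java.util.regex.Pattern;

final class FindOptions {
    // True if 'word' occurs in 'text' delimited by word boundaries (\b),
    // i.e. not as part of a longer run of word characters.
    static boolean containsWholeWord(String text, String word) {
        // Pattern.quote() keeps regex metacharacters in 'word' literal.
        Pattern p = Pattern.compile("\\b" + Pattern.quote(word) + "\\b");
        return p.matcher(text).find();
    }

    public static void main(String[] args) {
        String text = "candy is sweet";
        System.out.println(text.indexOf("can"));              // 0: plain substring position
        System.out.println(text.contains("can"));             // true: plain substring test
        System.out.println(text.matches("can"));              // false: matches() must cover the whole string
        System.out.println(containsWholeWord(text, "can"));   // false
        System.out.println(containsWholeWord(text, "sweet")); // true
    }
}
```

Note that \b only makes sense here when the search term itself starts and ends with a word character.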


SCJP
Visit my download page
fred rosenberger
lowercase baba
Bartender

Joined: Oct 02, 2003
Posts: 11497
    

As always, you have to first define what you consider a word.

But...regex does have special symbols for beginning and end of a string. IIRC, ^ is the start, and $ is the end. Combined with groupings you can do something like

(^| +)

I think that gives you "beginning of line or one or more spaces". Have not tested at all, so it's probably not 100% correct.
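
A rough sketch of the "start of string or one or more spaces" idea in Java, assuming the leading part is written as (^| +). The class and method names are hypothetical. The trailing side uses a lookahead, (?= |$), rather than consuming the space, so that two adjacent occurrences can share the space between them.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class AnchorAlternation {
    // Counts occurrences of 'word' that are preceded by the start of the
    // string or spaces, and followed by a space or the end of the string.
    static int countSpaceDelimited(String text, String word) {
        Pattern p = Pattern.compile("(^| +)" + Pattern.quote(word) + "(?= |$)");
        Matcher m = p.matcher(text);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(countSpaceDelimited("cat concat cat", "cat")); // 2
        System.out.println(countSpaceDelimited("cat cat", "cat"));        // 2
        System.out.println(countSpaceDelimited("cat, dog", "cat"));       // 0: the comma is not a space
    }
}
```

The last call shows the limit of a spaces-only definition: a word followed by punctuation is not found.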


There are only two hard things in computer science: cache invalidation, naming things, and off-by-one errors
Greg Charles
Sheriff

Joined: Oct 01, 2001
Posts: 2864
    

Depending on the regex implementation, you may have the character class \w (Word) available. So "\w+" matches whole words. As Fred points out though, it depends how you define a word. I don't think that pattern matches contractions, like "don't".
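
A small sketch of what \w+ does with a contraction in Java. The class name WordCharDemo and the sample sentence are illustrative only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class WordCharDemo {
    // Collects every maximal run of word characters ([a-zA-Z_0-9] by default).
    static List<String> wordRuns(String text) {
        List<String> runs = new ArrayList<>();
        Matcher m = Pattern.compile("\\w+").matcher(text);
        while (m.find()) {
            runs.add(m.group());
        }
        return runs;
    }

    public static void main(String[] args) {
        // The apostrophe is not a word character, so "don't" comes out in two pieces.
        System.out.println(wordRuns("I don't know")); // [I, don, t, know]
    }
}
```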

It's a good idea to have a go-to site for testing regular expressions, because they're tricky to get right. Mine is rubular.com, but there are many others.
Richard Tookey
Ranch Hand

Joined: Aug 27, 2012
Posts: 1110
    

"\b" represents a word boundary (see the Javadoc for Pattern) but there are a couple of 'gotchas'. First, it does not handle contractions, and second, in a Java string literal you need to escape the backslash, i.e. write "\\b". As always with regular expressions the devil is in the detail, but you have not yet provided any.

My standard references are "Mastering Regular Expressions" by Jeffrey Friedl (regarded as the regex bible), http://www.regular-expressions.info/tutorial.html and http://docs.oracle.com/javase/tutorial/essential/regex/ .
Randall Twede
Ranch Hand

Joined: Oct 21, 2000
Posts: 4347
    

thanks for the answers. i didn't know about a go-to site for testing. i still need to read about regex more, i have only glanced at it so far. if you have microsoft's wordpad: i am recreating its find and replace dialogs, with checkboxes for Match Case and Whole Word Only. i also realized a word can end in punctuation (.,;:)

i see what you mean by escaping. the : plus the ) got interpreted as a smiley face

i have part of it already. Replace All: Match Case was easy i just used the String method replaceAll() using the contents of the two text fields.
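
One thing to watch when feeding text fields straight into replaceAll(): the first argument is compiled as a regex, and $ and \ are special in the replacement. A hedged sketch of a literal replace follows; the class LiteralReplace, the method replaceAllLiteral and the matchCase parameter are hypothetical names, not taken from the dialog code discussed here.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class LiteralReplace {
    // Replaces every occurrence of 'find' taken literally, optionally ignoring case.
    static String replaceAllLiteral(String text, String find, String replacement, boolean matchCase) {
        int flags = matchCase ? 0 : Pattern.CASE_INSENSITIVE;
        // quote() makes the search text literal; quoteReplacement() does the
        // same for '$' and '\' in the replacement text.
        Pattern p = Pattern.compile(Pattern.quote(find), flags);
        return p.matcher(text).replaceAll(Matcher.quoteReplacement(replacement));
    }

    public static void main(String[] args) {
        System.out.println("a.b axb".replaceAll(".", "-"));               // -------  ('.' matched every character)
        System.out.println(replaceAllLiteral("a.b axb", ".", "-", true)); // a-b axb
        System.out.println(replaceAllLiteral("Cat cat", "cat", "$1", false)); // $1 $1
    }
}
```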
naved momin
Ranch Hand

Joined: Jul 03, 2011
Posts: 692

to divide words using REGEX all you need to write is
([a-zA-Z]*\S)


this is just for demonstration purpose hope it helps..


The Only way to learn is ...........do!
Visit my blog http://inaved-momin.blogspot.com/
fred rosenberger
lowercase baba
Bartender

Joined: Oct 02, 2003
Posts: 11497
    

naved momin wrote:to divide words using REGEX all you need to write is
([a-zA-Z]*\S)


this is just for demonstration purpose hope it helps..

But as stated, that won't find words like:

Can't
O'Malley
It won't find the word at the end of this sentence.
It won't find a single word on a line by itself.
alex gorn
Greenhorn

Joined: Nov 15, 2012
Posts: 8
I think you need to know what text you are analyzing. If it contains locale-specific characters, you cannot use something like [a-zA-Z], which covers Latin letters only, but Java has something like this:



which means: the Unicode property "letter" (covers all languages), OR a symbol of your own choosing (but it must sit between word boundaries, \b), OR the apostrophe '.
So we can catch all the letters in words like O'Reilly or well-done.
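
A minimal sketch of that idea, not the poster's exact pattern: it assumes a word is one or more Unicode letters (\p{L}), optionally joined by a single apostrophe or hyphen to further letters. The class name UnicodeWords and the pattern itself are assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class UnicodeWords {
    // Letters of any script, with ' or - allowed only between letters.
    private static final Pattern WORD = Pattern.compile("\\p{L}+(?:['-]\\p{L}+)*");

    static List<String> words(String text) {
        List<String> found = new ArrayList<>();
        Matcher m = WORD.matcher(text);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        // \u00eb is a precomposed e with diaeresis, written as an escape to avoid encoding issues.
        System.out.println(words("O'Reilly said: well-done, Zo\u00eb!"));
    }
}
```

Because the joiner must be followed by a letter, a trailing apostrophe or hyphen is left out of the word.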
Randall Twede
Ranch Hand

Joined: Oct 21, 2000
Posts: 4347
    

actually \b does seem to handle contractions. i just tried it. i am far from finished though.
this code is from the replace dialog class

the goal is for it to act more or less like Wordpad. i decided not to worry about unicode (non-western stuff)
Richard Tookey
Ranch Hand

Joined: Aug 27, 2012
Posts: 1110
    

Randall Twede wrote:actually \b does seem to handle contractions.


Interesting, since my simple example shows that it doesn't! The code below produces an array containing all the words but with all the apostrophes separated from their adjoining words. If the contractions were being handled as being part of the word then they would be included in the word in the split. Am I missing something?


Using find(), as below, does just the same!
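
A sketch of the kind of test described in this post, written from the description only; the class name BoundaryDemo, the sample text and both helper methods are assumptions, not the code that was originally posted.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class BoundaryDemo {
    // Splits at every word boundary; the apostrophe ends up as its own element.
    static String[] splitOnBoundaries(String text) {
        return text.split("\\b");
    }

    // Finds runs of word characters delimited by word boundaries.
    static List<String> findWords(String text) {
        List<String> found = new ArrayList<>();
        Matcher m = Pattern.compile("\\b\\w+\\b").matcher(text);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(splitOnBoundaries("can't stop")));
        System.out.println(findWords("can't stop")); // [can, t, stop]
    }
}
```

In both cases "can" and "t" come out as separate pieces, because there is a word boundary on each side of the apostrophe.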
Randall Twede
Ranch Hand

Joined: Oct 21, 2000
Posts: 4347
    

in the JTextArea i typed
isn't isn't isn'ta
in the dialog i told it to replace all isn't with is
i got
is is isn'ta
i'll try it again
same result
maybe it is because i have java 7
i don't know
Richard Tookey
Ranch Hand

Joined: Aug 27, 2012
Posts: 1110
    

Randall Twede wrote:
maybe it is because i have java 7


Sorry, but that is not the reason. It has been the same since the regex package was introduced into Java and, though I cannot claim to have tried all regex implementations, I have never met a regex flavour in any language that is different.
Randall Twede
Ranch Hand

Joined: Oct 21, 2000
Posts: 4347
    

i see the problem now. if i type
can can't candy
then i say replace all can with dan, whole word only
i get
dan dan't candy
clearly a problem, but when i tried it in Wordpad and Open Office i got the same results
so i guess i won't worry about it
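
For anyone who does want to avoid the dan't result, here is a hedged sketch comparing plain \b with lookarounds that also treat the apostrophe as part of a word. The class WholeWordReplace and both method names are hypothetical, and the choice of [\w'] as "word characters" is an assumption.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class WholeWordReplace {
    // Whole-word replace using \b, which sees a boundary between n and '.
    static String replaceWithBoundary(String text, String find, String replacement) {
        Pattern p = Pattern.compile("\\b" + Pattern.quote(find) + "\\b");
        return p.matcher(text).replaceAll(Matcher.quoteReplacement(replacement));
    }

    // Whole-word replace that refuses a match touching a word character or an apostrophe.
    static String replaceApostropheAware(String text, String find, String replacement) {
        Pattern p = Pattern.compile("(?<![\\w'])" + Pattern.quote(find) + "(?![\\w'])");
        return p.matcher(text).replaceAll(Matcher.quoteReplacement(replacement));
    }

    public static void main(String[] args) {
        System.out.println(replaceWithBoundary("can can't candy", "can", "dan"));      // dan dan't candy
        System.out.println(replaceApostropheAware("can can't candy", "can", "dan"));   // dan can't candy
        System.out.println(replaceWithBoundary("isn't isn't isn'ta", "isn't", "is"));  // is is isn'ta
    }
}
```

The trade-off is that a word wrapped in single quotes, such as 'can', would no longer count as a whole word with the second method.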
 