
questions regarding regular expressions

 
adithi gudipudi
Greenhorn
Posts: 25
Hi,
If I write replaceAll("[^ab]", "X") then it replaces all the characters except a or b. If I want to replace all the characters except the sequence ab, how should I do it in a simple way?
For example, if I have "ab test hello all ab test1", the output should be "ab XXXX XXXXX XXX ab XXXXX".
Thanks
Adithi.
 
Max Habibi
town drunk
( and author)
Sheriff
Posts: 4118

you know, there's a good book on regex in my signature
M
[ EJFH: Fix typo ]
[ April 24, 2004: Message edited by: Ernest Friedman-Hill ]
 
adithi gudipudi
Greenhorn
Posts: 25
Max,
String tmp = "ab test hello all ab test1" ;
String s = tmp.replaceAll("[^a^b^ ]","X");
Why do we need the ending ^ in the regular expression? The ^a would be for not a, the ^b would be for not b, so what is the ending ^ for?
thanks for your help!!
Adithi.
 
Max Habibi
town drunk
( and author)
Sheriff
Posts: 4118
Originally posted by adithi gudipudi:
Max,
String tmp = "ab test hello all ab test1" ;
String s = tmp.replaceAll("[^a^b^ ]","X");
Why do we need the ending ^ in the regular expression? The ^a would be for not a, the ^b would be for not b, so what is the ending ^ for?
thanks for your help!!
Adithi.

not space
 
adithi gudipudi
Greenhorn
Posts: 25
Thanks Max. Actually, I am using this as part of a regular expression to remove all the customer-entered tags from the text in a string. This text comes from a textarea in an HTML form.
For example:
String text =" <P>sample 1 <a href =hello.gif > hello image </a>.sample 2 <b> this text is bold </b> sample 3 ....</P>";
After using regular expression, the output should be
"sample 1 hello image .sample 2 this text is bold sample 3 ...."

So I am writing a regular expression to remove the text from < to >.
My regular expression is
text = text.replaceAll("/?>","#"); // 1 statement
text = text.replaceAll("</?;","@"); // 2 statement
text =text.replaceAll("@[^#]*/?\\s*#",""); // 3 statement
My logic should cover all types of tags, like <b>, </b> and also <b /> (for example).
Now, instead of running statements 1 and 2 separately, I want to fold them into statement 3 itself.
Can you check whether the logic is right?
Hope I am clear.
Adithi.
 
Stefan Wagner
Ranch Hand
Posts: 1923
Linux Postgres Database Scala
So you are trying to get around the problem that regex searches always try to find the biggest match.
I don't understand your tricks in detail.
But isn't it fairly simple to specify an HTML tag?
It starts with an opening '<'.
Then it is followed by anything but a '>'.
And it ends with a '>'.
"<[^>]*>"
 
adithi gudipudi
Greenhorn
Posts: 25
Stefan,
That is right. I don't know why I did that, even though all the while I was thinking of the same thing as you. Actually, if the customer enters a tag, the delimiters would look like &lt; and &gt;. I did not know how to negate a character with ^ (like you did with [^>]*) until Max Habibi gave an example. So thanks for bringing me back on track.
Also, I wrote a regular expression for removing all the links the customer enters. Is this the right way to remove links like http://yahoo.com, ftp://msn.com or C:\test.jpg from a string of text? I did it like this:
text =text.replaceAll("\\S*://\\S*\\s","");
The regular expression I wrote does not work for C:\test.jpg, but do you think it is right for the other cases?
Thanks again!!
Adithi.
 
adithi gudipudi
Greenhorn
Posts: 25
I am sorry the code came out like that. Actually, if the customer enters a tag, it would be encoded as &gt; and &lt;, not as > and <.
 
Max Habibi
town drunk
( and author)
Sheriff
Posts: 4118
Hi adithi,
I'm still having a little bit of a hard time following exactly what you need. Can you provide, say, 5 before and after examples? If so, I'm pretty sure I can help you with this.
All best,
M
 
Jim Yingst
Wanderer
Sheriff
Posts: 18671
[adithi]:
If I write replaceAll("[^ab]", "X") then it replaces all the characters
except a or b. If I want to replace all the characters except the sequence ab,
how should I do it in a simple way?
For example, if I have "ab test hello all ab test1", the output should be
"ab XXXX XXXXX XXX ab XXXXX".

I can't imagine how this would be useful for anything, but a solution to this is:
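The code from this post is not preserved here. As a hedged sketch of one pattern that meets the stated requirement (not necessarily what Jim posted), lookarounds can replace an 'a' that is not followed by 'b', a 'b' that is not preceded by 'a', and any other non-space character. The class and method names are made up for illustration.

```java
final class KeepAb {
    // a(?!b)   : an 'a' that is NOT followed by 'b'
    // (?<!a)b  : a 'b' that is NOT preceded by 'a'
    // [^ab ]   : any character other than 'a', 'b' or space
    static String maskExceptAb(String s) {
        return s.replaceAll("a(?!b)|(?<!a)b|[^ab ]", "X");
    }

    public static void main(String[] args) {
        // prints: ab XXXX XXXXX XXX ab XXXXX
        System.out.println(maskExceptAb("ab test hello all ab test1"));
    }
}
```

The lookarounds inspect the original input, not the text already replaced, so "aab" becomes "Xab".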

Note that this technique is not really very elegant, and increases in complexity
for longer strings. E.g. if you want to replace everything but spaces and "abcd"
then you'd need something like:

Really, it would be much easier at this point to make a pattern that searches
for "abcd" rather than everything but "abcd".
[Max]:
That doesn't seem to fit the requirements. Note that "all" becomes "aXX",
contrary to adithi's example.
Also, does [^a^b^ ] do anything different from [^ab ]? I haven't seen
this usage before, and it doesn't seem to be documented in the java.util.regex API.
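For what it's worth, in java.util.regex a ^ only negates a character class when it is the first character after the [. Anywhere else inside the brackets it is a literal caret. So [^a^b^ ] means "anything except a, ^, b or space", and it differs from [^ab ] only when the input contains a caret. A minimal check (the class name is made up for illustration):

```java
final class CaretInClass {
    public static void main(String[] args) {
        String s = "a^b c!";
        // [^ab ] : everything except 'a', 'b' and space is replaced, including '^'
        System.out.println(s.replaceAll("[^ab ]", "X"));    // prints: aXb XX
        // [^a^b^ ] : the later carets are literals, so '^' is also left alone
        System.out.println(s.replaceAll("[^a^b^ ]", "X"));  // prints: a^b XX
    }
}
```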
---
From subsequent discussion, it seems Stefan's approach is appropriate to what
adithi really wanted. I can't figure out how the original problem
statement would have anything to do with removing tags from text - rather
it would be useful for removing everything but tags. Anyway,
I believe what you're looking for is
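The code that followed is not preserved here. As a rough sketch of the two cases discussed in this thread: Stefan's <[^>]*> works when the delimiters are the single characters < and >, but when they arrive encoded as &lt; and &gt; the closing delimiter is several characters long, so a negated character class no longer fits and a reluctant .*? is one way to stop at the first &gt;. The class and method names are made up, and the second pattern is an assumption based on the later discussion.

```java
final class TagStripper {
    // Raw tags: '<', then anything that is not '>', then '>'.
    static String stripTags(String text) {
        return text.replaceAll("<[^>]*>", "");
    }

    // Encoded tags: "&lt;", then as little as possible, then "&gt;".
    static String stripEncodedTags(String text) {
        return text.replaceAll("&lt;.*?&gt;", "");
    }

    public static void main(String[] args) {
        System.out.println(stripTags("<b>bold</b> and <br /> plain"));    // prints: bold and  plain
        System.out.println(stripEncodedTags("&lt;b&gt;bold&lt;/b&gt; text")); // prints: bold text
    }
}
```

Note that . does not match line terminators by default, so a tag that spans several lines would need Pattern.DOTALL (or the (?s) flag) in the second pattern.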

----
[adithi ]:
Also, I wrote a regular expression for removing all the links the customer enters.
Is this the right way to remove links like
http://yahoo.com, ftp://msn.com
or C:\test.jpg from a string of text? I did it like this:
text =text.replaceAll("\\S*://\\S*\\s","");
The regular expression I wrote does not work for C:\test.jpg, but do you think it
is right for the other cases?

Not if your text contains any other colons outside of URLs - it will remove those too.
However this is a much more complex problem, as there are many types of structures
you might encounter representing links, and many others which are not links.
You might try something like this though:

I'm assuming that some links might omit http: entirely,
like www.yahoo.com.
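The pattern from this post is not preserved here. The following is only a sketch of the general idea, under the assumption that a "link" is a scheme followed by ://, a www. prefix, or a Windows drive path; it is not Jim's actual pattern and will miss other link forms. The class and method names are made up for illustration.

```java
final class LinkStripper {
    // (?: ... ) groups the three alternatives without capturing them:
    //   \w++://        a scheme such as http:// or ftp://
    //   www\.          a bare www. prefix
    //   [A-Za-z]:\\    a drive letter, colon and backslash
    // \S++ then consumes the rest of the link up to the next whitespace.
    private static final String LINK = "(?:\\w++://|www\\.|[A-Za-z]:\\\\)\\S++";

    static String stripLinks(String text) {
        return text.replaceAll(LINK, "");
    }

    public static void main(String[] args) {
        String in = "see http://yahoo.com and ftp://msn.com or www.yahoo.com or C:\\test.jpg at 10:30";
        // prints: see  and  or  or  at 10:30
        System.out.println(stripLinks(in));
    }
}
```

Unlike the pattern in the question, this one does not require whitespace after the link, so a link at the very end of the string is removed too, and the surrounding spaces are left in place.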
[edited for formatting-MH]
[ April 24, 2004: Message edited by: Max Habibi ]
 
Max Habibi
town drunk
( and author)
Sheriff
Posts: 4118
Originally posted by Jim Yingst:
That doesn't seem to fit the requirements. Note that "all"
becomes "aXX",
contrary to adithi's example.

You're right, I misread her example.

Also, does [^a^b^ ] do anything different from [^ab ]?

Not really, though it's a hair more efficient.
I'll give the original pattern some thought: it seems like
there should be a simpler solution.
All best,
M
[ April 25, 2004: Message edited by: Max Habibi ]
 
Stefan Wagner
Ranch Hand
Posts: 1923
Linux Postgres Database Scala
The ab challenge:
If we first replace every 'a' which isn't followed by a 'b':

then every 'b' which isn't preceded by an 'a':

Now the only 'a' and 'b' remaining should be in 'ab'. Therefore we may replace everything besides 'a' or 'b'.

But of course you may have aab, which gets replaced with XXb in the first step, so we need to get the first character only:

Unfortunately, the solution doesn't work. (Why?)
The first 'replaceAll' generates:
ab test hello X1l ab test1
Does backreference only work with 'pattern' and 'matcher'?
Another starting point could be

which needs care at the beginning and the end of the string.
Jim:
in your example:

I struggle with the ".*?". Isn't it the same as ".*"?
And don't we inherit the 'biggest match' problem in a much more complicated way, since we now search for a sequence, not a character?
Conclusion: it's amusing to see how fast you can explain a pattern-matching question to a person, and how long it takes to tell it to a computer, even if the input is in machine-friendly form (ASCII text).
[ April 24, 2004: Message edited by: Stefan Wagner ]
 
Jim Yingst
Wanderer
Sheriff
Posts: 18671
[Stefan]: Does backreference only work with 'pattern' and 'matcher'?
No, those work in other places too, such as replaceAll() or split().
Anything that takes a java.util.regex pattern can use these.
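One detail that may explain the "X1l" output: in java.util.regex a backreference inside the pattern is written \1, but in the replacement string of replaceAll() a captured group is referenced as $1. A backslash in the replacement just escapes the next character. The sketch below assumes Stefan's replacement used \\1 (his code is not shown here); the class name is made up for illustration.

```java
final class BackrefDemo {
    public static void main(String[] args) {
        // In the replacement string, $1 inserts what group 1 captured.
        System.out.println("all".replaceAll("a([^b])", "X$1"));   // prints: Xll

        // "\\1" in the replacement is an escaped literal '1', not a group reference.
        System.out.println("all".replaceAll("a([^b])", "X\\1"));  // prints: X1l

        // Inside the pattern, \1 is a real backreference: (l)\1 matches "ll".
        System.out.println("hello".replaceAll("(l)\\1", "L"));    // prints: heLo
    }
}
```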
[Stefan]: Another starting point could be

Yeah, that's promising. You'll need further processing, probably comparable to what I do below.
[Stefan]: Jim:
in your example:

I struggle with the ".*?". Isn't it the same as ".*"?
And don't we inherit the 'biggest match' problem in a much more
complicated way, since we now search for a sequence, not a character?
No, the ? makes a big difference - it tells the matcher to always find the
shortest match (starting at wherever the first "&lt;" is). This
is exactly what we need here, ensuring that if we process something like

&lt; foo &gt; bar &lt; baz &gt;
we'll get "&lt; foo &gt;" as the first match, and "&lt; baz &gt;" as the second.
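A small sketch that lists the matches for both forms on that sample text. The helper name and the exact patterns are assumptions for illustration, since the original code is not shown.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ReluctantDemo {
    // Collects every match of regex in input, in order.
    static List<String> allMatches(String regex, String input) {
        List<String> found = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        String input = "&lt; foo &gt; bar &lt; baz &gt;";
        // Reluctant: two matches, "&lt; foo &gt;" and "&lt; baz &gt;"
        System.out.println(allMatches("&lt;.*?&gt;", input));
        // Greedy: one match that runs from the first &lt; to the last &gt;
        System.out.println(allMatches("&lt;.*&gt;", input));
    }
}
```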
[Max]: I'll give the original pattern some thought: it seems like there should be a simpler solution
I agree that it seems like there should be. I'm not seeing it though. Not
for a longer string anyway. The method I gave for "ab" will work fine,
but there are some subtle bugs in "abcd". And even if there weren't, it's
too complex for my taste. It's sorta OK now, but imagine if we were
looking for "abcdefghijklm". Ugh.
I think it's simpler to look for the target "ab" or "abcd", rather than
look for everything but that string. Then use some good old-fashioned
programming to do the rest:

Of course at this point the Pattern and Matcher aren't really doing
anything we couldn't do just as easily with String's indexOf() - but this
is supposedly a regex problem (according to the thread title), so
what the heck...
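The code from this post is not preserved here. A sketch of the approach as described, searching for the target and masking whatever lies between matches, might look like the following. It is not Jim's original code; the class and method names are made up, and spaces are kept as in adithi's example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class KeepTarget {
    // Keeps every occurrence of target, keeps spaces, replaces all else with 'X'.
    static String maskExcept(String input, String target) {
        // Pattern.quote treats the target as literal text, not as a regex.
        Matcher m = Pattern.compile(Pattern.quote(target)).matcher(input);
        StringBuilder out = new StringBuilder(input.length());
        int pos = 0;
        while (m.find()) {
            mask(input, pos, m.start(), out); // text before this match
            out.append(m.group());            // the match itself, unchanged
            pos = m.end();
        }
        mask(input, pos, input.length(), out); // text after the last match
        return out.toString();
    }

    private static void mask(String s, int from, int to, StringBuilder out) {
        for (int i = from; i < to; i++) {
            out.append(s.charAt(i) == ' ' ? ' ' : 'X');
        }
    }

    public static void main(String[] args) {
        // prints: ab XXXX XXXXX XXX ab XXXXX
        System.out.println(maskExcept("ab test hello all ab test1", "ab"));
        // prints: abcd XabcdX XXX
        System.out.println(maskExcept("abcd xabcdx abc", "abcd"));
    }
}
```

The same loop works for a target of any length, which is the point of searching for the string rather than for everything but the string.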
[ April 25, 2004: Message edited by: Jim Yingst ]
 
Stefan Wagner
Ranch Hand
Posts: 1923
Linux Postgres Database Scala

I struggle with the ".*?". Isn't it the same as ".*"?
No, the ? makes a big difference - it tells the matcher to always find the
shortest match (starting at wherever the first "<" is). This
is exactly what we need here, ensuring that if we process something like

Thanks.
I had wondered about that but didn't find it in the API docs.
(Yes - I know, there are books in signatures...).
 
Jim Yingst
Wanderer
Sheriff
Posts: 18671
The Pattern API lists "reluctant quantifiers". They indicate that "X*?" means "X, zero or more times". But they don't bother explaining the difference between reluctant, greedy, and possessive quantifiers. However, there is a (relatively new) regex section in the Java Tutorial. You want the section on Quantifiers.
[ April 25, 2004: Message edited by: Jim Yingst ]
 
Max Habibi
town drunk
( and author)
Sheriff
Posts: 4118
But they don't bother explaining the difference between reluctant, greedy, and possessive quantifiers
ah, but I do
Consider the following

Greedy: You'll notice that group 1 in (.*)(\\d+) matches as much as possible. This is your greedy quantifier (or, more accurately, greedy-generous), and the default behavior. The group 1 pattern, namely (.*), is 'greedy' and will match as much as possible, as soon as possible. That is the greedy part. However, for the greater good of the entire pattern, group 1 will 'release' as little as it has to in order for group 2, namely (\\d+), to match. Thus, group 1 matches "aaaa1" and group 2 matches "1".
Reluctant: You'll notice that group 1 in (.*?)(\\d+) matches as little as possible. This is your reluctant quantifier. The group 1 pattern, namely (.*?), is reluctant, and thus matches as little as possible. However, for the greater good of the entire pattern, group 1 will match as much as it minimally has to in order for group 2, namely (\\d+), to match. Thus, group 1 matches "aaaa" and group 2 matches "11".
Possessive: You'll notice that this pattern, "(.*+)(\\d+)", doesn't match at all. That's because possessive patterns are greedy, but never generous. Unlike the first two patterns, the group 1 pattern, namely (.*+), will not release the "11" it has captured, thus disallowing a match for (\\d+). Thus, the pattern as a whole fails.
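The example the explanation refers to is not preserved here. The sketch below assumes the input was "aaaa11", which is consistent with the group values quoted above; the class and method names are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class QuantifierDemo {
    // Returns "group1|group2" if regex matches the whole input, else "no match".
    static String groups(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.matches() ? m.group(1) + "|" + m.group(2) : "no match";
    }

    public static void main(String[] args) {
        String input = "aaaa11"; // assumed sample input
        System.out.println(groups("(.*)(\\d+)", input));   // greedy:     aaaa1|1
        System.out.println(groups("(.*?)(\\d+)", input));  // reluctant:  aaaa|11
        System.out.println(groups("(.*+)(\\d+)", input));  // possessive: no match
    }
}
```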
This is actually explained in much greater detail in my regex book: really
HTH,
M
[ April 25, 2004: Message edited by: Max Habibi ]
 
adithi gudipudi
Greenhorn
Posts: 25
Max,
Basically, I had to remove all the tags the customer enters, and all the links, so that we don't have any links in the database.
If the customer entered a tag, the delimiters would look like &lt; and &gt;, so I should check for all the text in the string starting from &lt; and ending with &gt;. I did not know how to do it, so that was my first question.
To remove the links I saw what Jim suggested. I did not try it yet.
Hope I made things clear.
Thanks for all the help. This discussion is making my concepts clear.
Adithi.
 
adithi gudipudi
Greenhorn
Posts: 25
Jim,
Regarding the regular expression to remove the links from the text you wrote

What does \\S++ mean? I know that \\S+ means that there should be at least one non-whitespace character. What does ++ mean?
Also, in the regular expression, why include (? and )?
I am not sure why you included them.
Sorry to bother you all so many times, but I am at the beginner stage with regular expressions.
Adithi.
 
Max Habibi
town drunk
( and author)
Sheriff
Posts: 4118
adithi,
Look over my last post, and you'll see what ++, et al. mean. If you still have questions, bring them up, so everyone can benefit.
All best,
M
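On the "(? and )" part of the question, which the quantifier post does not cover: assuming the pattern used the (?: ... ) form, that is a non-capturing group. It groups alternatives just like ( ... ) but does not create a numbered group. The patterns below are made up for illustration and are not the ones from Jim's post.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class GroupingDemo {
    public static void main(String[] args) {
        String link = "ftp://msn.com";

        // ( ... ) captures: the scheme becomes group 1, the rest group 2.
        Matcher capturing = Pattern.compile("(http|ftp)://(\\S+)").matcher(link);
        System.out.println(capturing.groupCount());     // prints: 2

        // (?: ... ) only groups: the rest of the link is now group 1.
        Matcher nonCapturing = Pattern.compile("(?:http|ftp)://(\\S+)").matcher(link);
        System.out.println(nonCapturing.groupCount());  // prints: 1
        if (nonCapturing.matches()) {
            System.out.println(nonCapturing.group(1));  // prints: msn.com
        }

        // \S+ gives characters back when needed; \S++ (possessive) never does.
        System.out.println("abc".matches("\\S+c"));     // prints: true
        System.out.println("abc".matches("\\S++c"));    // prints: false
    }
}
```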
 
adithi gudipudi
Greenhorn
Posts: 25
Thanks everyone. This discussion has been very useful
Max,
I understood the concept of \\S++. But does having a complex regular expression cause any performance issues on the server?
 