
A question about regular expression

 
Roy Yuan
Greenhorn
Posts: 7
Hi, friends,

I want to input a string like "st 21st st" and get a string like "st 21 st". Please help me figure out the regular expression.

Thanks a lot,
Roy
 
Shaan Shar
Ranch Hand
Posts: 1249
Originally posted by Roy Yuan:
Hi, friends,

I want to input a string like "st 21st st" and get a string like "st 21 st". Please help me figure out the regular expression.

Thanks a lot,
Roy


What problem exactly are you facing? Could you please describe it, or, if this is homework, show what you have done so far?

It would be great to see what you have tried.
 
Roy Yuan
Greenhorn
Posts: 7
I have some date strings like "5th November 2006", "22nd October 1998", and "21st May 1992". I want to remove "th", "nd", "rd", and "st" from those strings. I tried my regular expression "(\\d{1,2})(?:[st|rd|nd])(\\s*\\.*)" on, for example, "5th November 2006". I got: 5. My expectation is: 5 November 2006.
 
Henry Wong
author
Sheriff
Posts: 23295
Originally posted by Roy Yuan:
I have some date strings like "5th November 2006", "22nd October 1998", and "21st May 1992". I want to remove "th", "nd", "rd", and "st" from those strings. I tried my regular expression "(\\d{1,2})(?:[st|rd|nd])(\\s*\\.*)" on, for example, "5th November 2006". I got: 5. My expectation is: 5 November 2006.


Could you show us some code? The Regex looks fine -- it is probably how you are using it.


[EDIT: Spoke too soon. "st|rd|nd" inside [] doesn't make sense, and "\\." is trying to match an actual period. In any case, I would still like to see the Java code, as there are many different options with Regex.]

Henry
[ November 10, 2006: Message edited by: Henry Wong ]
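
To illustrate the point about "\\.": in a Java string literal, "\\.*" is the regex \.* (zero or more literal periods), while ".*" matches any run of characters. Here is a minimal sketch, assuming the pattern was applied with Matcher.find() (the thread does not show the calling code at this point). The class and method names are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PeriodDemo {
    // Returns group(1) + group(2) of the first match, or null if nothing matches.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group(1) + m.group(2) : null;
    }

    public static void main(String[] args) {
        String input = "5th November 2006";
        // Original pattern: [st|rd|nd] consumes only the "t", and \.* matches zero periods.
        System.out.println(firstMatch("(\\d{1,2})(?:[st|rd|nd])(\\s*\\.*)", input)); // 5
        // With .* the rest of the line is captured, but the "h" is still left behind.
        System.out.println(firstMatch("(\\d{1,2})(?:[st|rd|nd])(\\s*.*)", input));   // 5h November 2006
    }
}
```

Under that assumption, the first call reproduces the "5" reported above.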
 
Jim Yingst
Wanderer
Sheriff
Posts: 18671
While I agree we probably need more info on how you're using this, I do see one issue right now:

(?:[st|rd|nd])

This probably doesn't do what you think it does. The [ ] means you have created a character class - this represents a single character, equivalent to [strdn|]. (Duplicates are ignored.) You probably want this instead:

(?:st|rd|nd)
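
The difference between the two forms can be checked directly. This is only a small sketch, and the class name CharClassDemo is hypothetical:

```java
import java.util.regex.Pattern;

public class CharClassDemo {
    static final Pattern CHAR_CLASS = Pattern.compile("[st|rd|nd]");
    static final Pattern ALTERNATION = Pattern.compile("(?:st|rd|nd)");

    public static void main(String[] args) {
        // The character class matches exactly one character from the set s, t, r, d, n, |.
        System.out.println(CHAR_CLASS.matcher("s").matches());   // true
        System.out.println(CHAR_CLASS.matcher("|").matches());   // true
        System.out.println(CHAR_CLASS.matcher("st").matches());  // false
        // The alternation matches one of the two-letter suffixes as a whole.
        System.out.println(ALTERNATION.matcher("st").matches()); // true
        System.out.println(ALTERNATION.matcher("s").matches());  // false
    }
}
```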
 
Roy Yuan
Greenhorn
Posts: 7
Hi, friends,

The following is my code:
String str = "Friday 21st November 1995";
// Group 1 is the day number; group 2 is everything after the suffix.
Matcher tempM = Pattern.compile("(?:\\D*\\s*)(\\d{1,2})(?:st|rd|nd)(\\s*.*)")
        .matcher(str);
if (tempM.matches()) {
    System.out.println(tempM.group(1) + tempM.group(2));
}

I got: 21 November 1995

Is there any better solution?

Thanks a lot,
Roy
 
Henry Wong
author
Sheriff
Posts: 23295
Originally posted by Roy Yuan:
Hi, friends,

I got: 21 November 1995

Is there any better solution?

Thanks a lot,
Roy


Did you forget about the "th"? Like with 4th, 5th, 6th, etc.?

Henry
 
Roy Yuan
Greenhorn
Posts: 7
Oh, I really forgot. Henry, thanks for reminding me.
 
Jim Yingst
Wanderer
Sheriff
Posts: 18671
There are a couple of things which aren't wrong exactly, but could be simpler.

(?:\\D*\\s*)

This means as many non-digits as possible (possibly none), followed by as much whitespace as possible (possibly none). But all whitespace characters are also non-digits. So the \\s* part of the expression will never have any effect, since the \\D* part will grab any whitespace before \\s* has a chance.
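
One way to see this is to turn the two pieces into capturing groups and inspect what \\s* actually captured. This is a sketch for inspection only: the extra capturing groups and the class name GreedyDemo are not part of the original pattern.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GreedyDemo {
    // Returns what the \s* group captured; the capturing groups exist only for inspection.
    static String whitespaceGroup(String input) {
        Matcher m = Pattern.compile("(\\D*)(\\s*)(\\d{1,2}).*").matcher(input);
        if (!m.matches()) {
            throw new IllegalArgumentException("no match: " + input);
        }
        return m.group(2);
    }

    public static void main(String[] args) {
        // \D* has already consumed "Friday " including the space, so \s* matches nothing.
        System.out.println("[" + whitespaceGroup("Friday 21st November 1995") + "]"); // []
    }
}
```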

(\\s*.*)

This means as much whitespace as possible, followed by anything else. Well, couldn't you just drop the \\s*, and let the .* take everything? There appears to be no need to treat the whitespace differently here.

So you could just as well use this:

(?:\\D*)(\\d{1,2})(?:st|rd|nd|th)(.*)
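
For reference, a minimal sketch of that simplified pattern in use, following the same matches() approach as the code posted earlier. The class and method names are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SuffixStripper {
    static final Pattern DATE = Pattern.compile("(?:\\D*)(\\d{1,2})(?:st|rd|nd|th)(.*)");

    // Returns the day number plus the rest of the line, or null when the line does not match.
    static String strip(String line) {
        Matcher m = DATE.matcher(line);
        return m.matches() ? m.group(1) + m.group(2) : null;
    }

    public static void main(String[] args) {
        System.out.println(strip("Friday 21st November 1995")); // 21 November 1995
        System.out.println(strip("5th November 2006"));         // 5 November 2006
        System.out.println(strip("22nd October 1998"));         // 22 October 1998
    }
}
```

Note that, as in the earlier code, any text before the day number (such as "Friday ") is dropped, because it is matched by a non-capturing group.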

----

More importantly, you might want to consider what might happen if you ever get a line in a different format. Perhaps a line that already has the 'st' removed? Should you ignore the line? Print it? Log a warning message? Throw an exception? Or maybe you're certain that there will never be any line in a different format. But such assumptions can be dangerous; I would recommend at least logging any unusual situations you encounter.
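
As one possible way to handle that, here is a sketch that logs a warning and returns the line unchanged when it does not match. The choice of java.util.logging, the "keep unchanged" policy, and the names LenientStripper and stripOrKeep are all assumptions for illustration; throwing an exception or skipping the line would be equally valid choices.

```java
import java.util.logging.Logger;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LenientStripper {
    static final Logger LOG = Logger.getLogger(LenientStripper.class.getName());
    static final Pattern DATE = Pattern.compile("(?:\\D*)(\\d{1,2})(?:st|rd|nd|th)(.*)");

    // Strips the suffix when the line matches; otherwise logs a warning and returns the line as-is.
    static String stripOrKeep(String line) {
        Matcher m = DATE.matcher(line);
        if (m.matches()) {
            return m.group(1) + m.group(2);
        }
        LOG.warning("Unexpected date format, left unchanged: " + line);
        return line;
    }

    public static void main(String[] args) {
        System.out.println(stripOrKeep("21st May 1992")); // 21 May 1992
        System.out.println(stripOrKeep("21 May 1992"));   // warning logged, line printed unchanged
    }
}
```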
 
Henry Wong
author
Sheriff
Posts: 23295
125
C++ Chrome Eclipse IDE Firefox Browser Java jQuery Linux VI Editor Windows
Originally posted by Jim Yingst:
This means as much whitespace as possible, followed by anything else. Well, couldn't you just drop the \\s*, and let the .* take everything? There appears to be no need to treat the whitespace differently here.


I don't want to second-guess Roy, but something tells me that the whitespace treatment is an incorrect attempt at detecting the word boundary, so that "234th" and "11stop" would not match.

Another option: since there is little checking before and after, this could be done with a simple replaceFirst() call (with word boundary checks).
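
A minimal sketch of what such a call might look like, assuming String.replaceFirst() with \\b word boundaries. The exact pattern is a guess at what is meant here, and the names ReplaceDemo and stripSuffix are hypothetical.

```java
public class ReplaceDemo {
    // Removes the first ordinal suffix that follows a one- or two-digit number standing as a whole word.
    static String stripSuffix(String s) {
        return s.replaceFirst("\\b(\\d{1,2})(?:st|nd|rd|th)\\b", "$1");
    }

    public static void main(String[] args) {
        System.out.println(stripSuffix("Friday 21st November 1995")); // Friday 21 November 1995
        System.out.println(stripSuffix("st 21st st"));                // st 21 st
        System.out.println(stripSuffix("234th"));                     // 234th (unchanged)
        System.out.println(stripSuffix("11stop"));                    // 11stop (unchanged)
    }
}
```

Unlike the matches() approach, this keeps any text before the day number and leaves non-matching input untouched.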



Henry
 