Suggest one regex to match all the following CharSequence

 
Ranch Hand
Posts: 93
Hi,
I am looking for a regex to match all of the following CharSequences and extract 2 groups out of 'em. The first group would be the name without the 4-digit year and parentheses. The second group, if present, would be the 4-digit year without the parentheses.
abc.avi
Ab-C.mkv
abc def.mkv
AbC DeF.divx
abc (2010).avi
ABC-DEF (2010).mkv

One I came up with does not work as expected:


Any help would be appreciated.
--
Abhi
 
Sheriff
Posts: 22783
I see that you're using [ and ] to match the extension. That won't work. You should use ( and ), or if you don't want it as a capturing group use (?: and )

When creating regexes it's best to build them in little blocks. You first write down in normal words what you want to do, then translate each part into a sub-regex, and then paste these together.

So let's break it down:
- letters, dashes or spaces
- optionally: an opening parenthesis, a 4-digit year (grouped), a closing parenthesis
- a dot
- mkv, avi, mp4, etc

I see you're using \u2212\u0020 for space and dash. You don't need those; you can add the characters as they are. Just put the dash at the start of the character class, otherwise it gets a special meaning (it denotes a range): [-\w ]
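A small sketch of the dash behaviour described above. The class and method names are made up for illustration. One extra point that is easy to miss: U+2212 is the Unicode MINUS SIGN, not the ASCII hyphen-minus (U+002D) that normally appears in file names, so a class built from \u2212 would not match an ordinary dash.

```java
import java.util.regex.Pattern;

class DashInCharacterClass {
    // Dash first: it is a literal, so the class is '-', word characters and space.
    static final Pattern NAME_CHARS = Pattern.compile("[-\\w ]+");
    // Dash between two characters: it denotes the range a..c.
    static final Pattern RANGE = Pattern.compile("[a-c]+");
    // U+2212 is MINUS SIGN and U+0020 is SPACE; neither is the ASCII '-'.
    static final Pattern UNICODE_ESCAPES = Pattern.compile("[\\u2212\\u0020]+");

    static boolean fullMatch(Pattern p, String s) {
        return p.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(fullMatch(NAME_CHARS, "Ab-C def"));   // true
        System.out.println(fullMatch(RANGE, "b"));               // true: b lies in a..c
        System.out.println(fullMatch(RANGE, "-"));               // false: '-' is not in a..c
        System.out.println(fullMatch(UNICODE_ESCAPES, "-"));     // false: ASCII '-' is U+002D
    }
}
```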

See if you can put all of this together to form one regex. Make sure to check your year group against null. This will occur if it's not present.
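The block-by-block approach above could be sketched roughly as follows. This is only one way to paste the pieces together, not necessarily the regex used in the thread: the constant names, the parse helper and the exact extension list are assumptions for illustration. Note that \w also accepts digits and underscores, which is slightly broader than "letters".

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FileNameBlocks {
    // One sub-regex per bullet in the breakdown.
    static final String NAME = "([-\\w ]+)";            // word chars, dashes or spaces (group 1)
    static final String YEAR = "(?:\\((\\d{4})\\))?";   // optional "(yyyy)"; the digits are group 2
    static final String DOT  = "\\.";                   // a literal dot
    static final String EXT  = "(?:avi|mkv|mp4|divx)";  // assumed list of extensions

    static final Pattern FILE_NAME = Pattern.compile(NAME + YEAR + DOT + EXT);

    /** Returns {name, year}; year is null when absent. Returns null if nothing matches. */
    static String[] parse(String fileName) {
        Matcher m = FILE_NAME.matcher(fileName);
        if (!m.matches()) {
            return null;
        }
        // Group 1 still includes the space in front of "(", so trim it.
        return new String[] { m.group(1).trim(), m.group(2) };
    }

    public static void main(String[] args) {
        for (String f : new String[] { "abc.avi", "AbC DeF.divx", "ABC-DEF (2010).mkv" }) {
            String[] parts = parse(f);
            // Check the year group against null before using it.
            String year = (parts[1] == null) ? "(none)" : parts[1];
            System.out.println(parts[0] + " | " + year);
        }
    }
}
```

Matcher.group(2) returns null when the optional year part did not take part in the match, which is why the null check is needed.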
 
a sarkar
Ranch Hand
Posts: 93

Rob Prime wrote:
See if you can put all of this together to form one regex. Make sure to check your year group against null. This will occur if it's not present.


Thank you Rob for your input. I will chew on that and post back with the results.
--
Abhi
"Old user, new username"
 
Bartender
Posts: 5167

a sarkar wrote:"Old user, new username"


Why, were you taken up for cross posting under the old username?
http://forums.oracle.com/forums/thread.jspa?threadID=1255349
 
a sarkar
Ranch Hand
Posts: 93

Darryl Burke wrote:

a sarkar wrote:"Old user, new username"


Why, were you taken up for cross posting under the old username?
http://forums.oracle.com/forums/thread.jspa?threadID=1255349


Nope, I didn't like the old username.

Cross posting by Darryl:
http://forums.sun.com/thread.jspa?threadID=5441460
https://coderanch.com/t/498351/GUI/java/Bug-IconUIResource-doesn-paint-animated

--
Abhi
"Old user, new username"
 
Saloon Keeper
Posts: 15510
Except he notified everyone he was cross posting?
 
a sarkar
Ranch Hand
Posts: 93

Stephan van Hulst wrote:Except he notified everyone he was cross posting?


Cross posting, as I understand it, applies to forums within a single website. Including all sites on the Web is a pretty big scope, I would say.
For argument's sake, even if we consider this as cross posting, notifying everyone is hardly a justification, don't you think? What if you commit a murder and notify all that you did it? Does that make it any less of a crime?
--
Abhi
"Old user, new username"
 
Rob Spoor
Sheriff
Posts: 22783
Let's not discuss this here any further. If you guys want to continue, go to the Ranch Office forum.

Abhi, please read our BeForthrightWhenCrossPostingToOtherSites FAQ entry. We have this policy to prevent people from spending much time answering a question that may have been answered days ago on another forum. People on other forums will also like it if you notify them of posts here.
The issue you mentioned is our UseOneThreadPerQuestion policy.
 
a sarkar
Ranch Hand
Posts: 93
This is what I finally got working, with help from this forum and the Oracle forum.

Unless someone wants to suggest a better regex, this is good for me.
 
Rob Spoor
Sheriff
Posts: 22783
Let's break it down:
- ([-\\w\\s]++) : I used greedy quantification in my little test, so one +. However, since the ++ is applied to characters that are not part of the remainder (parentheses or dot) this won't matter. All is in a capturing group which is fine.
- (?:\\((\\d{4})\\))?+ : non-capturing, like my test. A (, followed by a capturing group of 4 digits, followed by a ). The entire thing is optional. Exactly like my test.
- \\. : a dot. Can't be simpler.
- (?:avi|mkv|mp4|divx){1}+ : this is where I have some questions. First of all, {1} is never needed. It means "exactly one time", which is the same as what you get if you don't add any quantifier. But that trailing + is odd. Do you want things like "avimp3divx" to be allowed? Surely not?

In the end, if you remove that {1}+ part you get almost what I had. I had no capturing group for the first part, and turned the extension into a capturing group, but apart from that and the ++ vs + it was the same.
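For reference, the four pieces quoted in the breakdown above can be joined and run against the sample names from the first post. This is a sketch that assumes the posted regex was exactly those pieces in that order; the class and method names are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PostedRegexCheck {
    // Assembled from the pieces quoted in the breakdown (assumed to be the whole regex).
    static final Pattern POSTED = Pattern.compile(
            "([-\\w\\s]++)"                    // name, possessive
            + "(?:\\((\\d{4})\\))?+"           // optional (yyyy), possessive
            + "\\."                            // literal dot
            + "(?:avi|mkv|mp4|divx){1}+");     // extension, exactly once, possessive

    static String describe(String fileName) {
        Matcher m = POSTED.matcher(fileName);
        if (!m.matches()) {
            return fileName + " -> no match";
        }
        // Brackets make leading or trailing whitespace in group 1 visible.
        return fileName + " -> name=[" + m.group(1) + "] year=[" + m.group(2) + "]";
    }

    public static void main(String[] args) {
        String[] samples = {
            "abc.avi", "Ab-C.mkv", "abc def.mkv",
            "AbC DeF.divx", "abc (2010).avi", "ABC-DEF (2010).mkv"
        };
        for (String s : samples) {
            System.out.println(describe(s));
        }
    }
}
```

All six samples match. When a year is present, group 1 keeps the space in front of the opening parenthesis (for example "abc " rather than "abc"), so the caller may want to trim() it.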
 
Master Rancher
Posts: 4806

Rob Prime wrote:- (?:avi|mkv|mp4|divx){1}+ : this is where I have some questions about. First of all, {1} is never needed. It means "exactly one time", which is the same you get if you don't add any quantifiers. But that trailing + is odd. Do you want things like "avimp3divx" to be allowed? Surely not?


That's not what the + does here. Instead, it makes the preceding quantifier possessive. In this case that's {1}, which means exactly once, so {1}+ means exactly once in possessive mode: once this part has matched, the engine will not backtrack into it if a later part of the expression fails.

Abhi is using possessive quantifiers throughout his regex here. I like using possessive quantifiers in many cases - but I'm not sure they're very helpful here. I suspect there are some not-yet-considered corner cases where using possessives will prevent a match from happening, even in cases where we might expect a match to happen. Not sure though.
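A quick way to see both points is to compare a few patterns directly. The patterns below are illustrative only (they are not taken from Abhi's regex, apart from the extension group), and the class name is made up.

```java
import java.util.regex.Pattern;

class PossessiveDemo {
    static boolean matches(String regex, String input) {
        return Pattern.matches(regex, input);   // whole-input match
    }

    public static void main(String[] args) {
        String ext = "(?:avi|mkv|mp4|divx)";

        // {1}+ is "exactly once, possessive", not "one or more".
        System.out.println(matches(ext + "{1}+", "avi"));          // true
        System.out.println(matches(ext + "{1}+", "avimp4divx"));   // false
        // A plain + on the group is what would allow repetition.
        System.out.println(matches(ext + "+", "avimp4divx"));      // true

        // Greedy: \w+ first takes "abc1", then gives back "1" so \d can match.
        System.out.println(matches("\\w+\\d", "abc1"));            // true
        // Possessive: \w++ takes "abc1" and never gives anything back.
        System.out.println(matches("\\w++\\d", "abc1"));           // false
    }
}
```

The last pair is the kind of corner case to watch for: a possessive quantifier is only safe when the characters it consumes can never be needed by the part of the regex that follows it.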

Abhi, your examples don't show any punctuation characters besides '-' and '.'. Do you know for sure that they won't occur in the files you need to handle? You might want to include more test cases to make sure you're handling things well. For example, these look like video titles - movies? Picking a few movie titles semi-randomly out of imdb.com lists, I see

Wall Street: Money Never Sleeps
Legend of the Guardians: The Owls of Ga'Hoole
Crouching Tiger, Hidden Dragon
Kill Bill: Vol. 1
9 1/2 Weeks

If these were converted into file names by adding an optional (2010) (or other year) and a file extension, would your regex successfully parse them? Or, can you guarantee that filenames like that will not occur? Worth considering, I think.
 
Rob Spoor
Sheriff
Posts: 22783

Mike Simmons wrote:

Rob Prime wrote:- (?:avi|mkv|mp4|divx){1}+ : this is where I have some questions about. First of all, {1} is never needed. It means "exactly one time", which is the same you get if you don't add any quantifiers. But that trailing + is odd. Do you want things like "avimp3divx" to be allowed? Surely not?


That's not what the + does here. Instead, it makes the preceding quantifier possessive. In this case that's {1}, which means exactly once, so {1}+ means exactly once in possessive mode: once this part has matched, the engine will not backtrack into it if a later part of the expression fails.


Ah ok. I didn't know that possessive quantifiers also applied to {}.
 
a sarkar
Ranch Hand
Posts: 93

Rob Prime wrote:I see that you're using [ and ] to match the extension. That won't work. You should use ( and ), or if you don't want it as a capturing group use (?: and )


I see that this is true, but I don't understand the logic behind it. Could you explain this statement or point to some documentation that does?
 
Rob Spoor
Sheriff
Posts: 22783
How about the Javadoc of java.util.regex.Pattern? 90% of what you need for regexes can be found there.
 
a sarkar
Ranch Hand
Posts: 93

Rob Prime wrote:How about the Javadoc of java.util.regex.Pattern? 90% of what you need for regexes can be found there.


It seems to me that [avi|mkv|mp4|divx] is following straight from the Character class [abc] which means "a, b, or c (simple class)". Why would [avi|mkv|mp4|divx] not work and (avi|mkv|mp4|divx) would?
 
a sarkar
Ranch Hand
Posts: 93

Mike Simmons wrote:
Picking a few movie titles semi-randomly out of imdb.com lists, I see
Wall Street: Money Never Sleeps
Legend of the Guardians: The Owls of Ga'Hoole
Crouching Tiger, Hidden Dragon
Kill Bill: Vol. 1
9 1/2 Weeks

If these were converted into file names by adding an optional (2010) (or other year) and a file extension, would your regex successfully parse them?


Thanks for pointing this out Mike. From your example, let me see what I missed in my regex:
Colon (:) as in "Wall Street: Money Never Sleeps" - Not allowed in a physical file name on Windows. As long as I am reading files from a Windows directory, I can guarantee this will not appear. A colon is permitted in a filename on Unix, but I have yet to see a practical example of such a file.
Apostrophe (') as in "The Owls of Ga'Hoole" - Should be added to the regex.
Comma (,) as in "Crouching Tiger, Hidden Dragon" - Should be added to the regex.
Full stop (.) as in "Kill Bill: Vol. 1" - Should be added to the regex.
Slash (/) as in "9 1/2 Weeks" - Like Colon, not allowed in a filename. Not in Windows or Unix.
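One thing to watch when adding the full stop to the name character class: combined with a possessive quantifier, the class can swallow the ".avi" at the end and never give it back. The sketch below compares two hypothetical variants of the pattern (neither is the final regex from this thread, and the class and method names are made up); they differ only in ++ versus + and ?+ versus ?.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FullStopInNameClass {
    // Hypothetical variants: name class extended with apostrophe, comma and full stop.
    static final Pattern POSSESSIVE = Pattern.compile(
            "([-\\w\\s',.]++)(?:\\((\\d{4})\\))?+\\.(?:avi|mkv|mp4|divx)");
    static final Pattern GREEDY = Pattern.compile(
            "([-\\w\\s',.]+)(?:\\((\\d{4})\\))?\\.(?:avi|mkv|mp4|divx)");

    /** Returns the trimmed name (group 1), or null if the pattern does not match. */
    static String name(Pattern p, String fileName) {
        Matcher m = p.matcher(fileName);
        return m.matches() ? m.group(1).trim() : null;
    }

    public static void main(String[] args) {
        String withYear = "Kill Bill Vol. 1 (2010).avi";
        String noYear = "Kill Bill Vol. 1.avi";

        // With a year, the "(" stops the name class, so both variants match.
        System.out.println(name(POSSESSIVE, withYear));   // Kill Bill Vol. 1
        System.out.println(name(GREEDY, withYear));       // Kill Bill Vol. 1

        // Without a year, the possessive class consumes ".avi" too and cannot back off.
        System.out.println(name(POSSESSIVE, noYear));     // null
        System.out.println(name(GREEDY, noYear));         // Kill Bill Vol. 1
    }
}
```

With the full stop in the class, the possessive variant fails for every file name that has no year, including plain "abc.avi", so the greedy quantifier looks like the safer choice here.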
 
Rob Spoor
Sheriff
Posts: 22783

a sarkar wrote:

Rob Prime wrote:How about the Javadoc of java.util.regex.Pattern? 90% of what you need for regexes can be found there.


It seems to me that [avi|mkv|mp4|divx] is following straight from the Character class [abc] which means "a, b, or c (simple class)". Why would [avi|mkv|mp4|divx] not work and (avi|mkv|mp4|divx) would?


Because a character class matches one single character. You want to match one of a few substrings, and that's what | is for. The () - which can be replaced by (?:) - is used to limit the | to only the things inside them.
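The difference is easy to check directly. A minimal sketch (the class name and constants are made up for illustration):

```java
import java.util.regex.Pattern;

class ClassVersusAlternation {
    // A character class: exactly ONE character out of a, v, i, |, m, k, p, 4, d, x.
    static final String CHAR_CLASS = "[avi|mkv|mp4|divx]";
    // An alternation inside a non-capturing group: one of four whole substrings.
    static final String ALTERNATION = "(?:avi|mkv|mp4|divx)";

    static boolean matches(String regex, String input) {
        return Pattern.matches(regex, input);   // whole-input match
    }

    public static void main(String[] args) {
        System.out.println(matches(CHAR_CLASS, "avi"));        // false: three characters
        System.out.println(matches(CHAR_CLASS, "v"));          // true: one listed character
        System.out.println(matches(CHAR_CLASS, "|"));          // true: | is a literal inside [ ]
        System.out.println(matches(CHAR_CLASS + "+", "kim"));  // true: any mix of listed characters
        System.out.println(matches(ALTERNATION, "avi"));       // true
        System.out.println(matches(ALTERNATION, "kim"));       // false
    }
}
```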
 
a sarkar
Ranch Hand
Posts: 93

Rob Prime wrote:
... a character class matches one single character. You want to match one of a few substrings, and that's what | is for. The () - which can be replaced by (?:) - is used to limit the | to only the things inside them.


Thank you, Rob, for the explanation. It definitely cleared up my misconception. Incorporating Mike's suggestion, and with a little update, the following is the latest regex:

I will therefore mark this thread as resolved.
 