• Post Reply Bookmark Topic Watch Topic
  • New Topic

Limiting the repitition in a RegEx  RSS feed

 
Mike London
Ranch Hand
Posts: 1505
11
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
If you have a RegEx like this:

[a-z][0-9]{2}

It will match everything up to two occurrences of strings like:

a3
b5
z3

but it will also match beyond this. So, if you had a database field with characters beyond the two "matching" ones, they would also be included.

a3404594f
b5ffsfg
z3jjjjjj

My question, then, is how to limit the RegEx RESULT to only be two characters max. That is, I want to limit the repetition to stop after 2 characters (in this case) even if there are additional characters beyond this.

I haven't found any good documentation on how to limit RegEx repetition results.

Thanks in advance for suggestions.

-mike
 
Henry Wong
author
Sheriff
Posts: 23295
125
C++ Chrome Eclipse IDE Firefox Browser Java jQuery Linux VI Editor Windows
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
Mike London wrote:If you have a RegEx like this:

[a-z][0-9]{2}

It will match everything up to two occurrences of strings like:

a3
b5
z3


No. This regex matches a lowercase letter, followed by exactly two numeric digits. So, it will match stuff like "d45".


And as for the second half of the question -- perhaps, you can clarify a bit more?

Henry
 
Mike London
Ranch Hand
Posts: 1505
11
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
Henry Wong wrote:
Mike London wrote:If you have a RegEx like this:

[a-z][0-9]{2}

It will match everything up to two occurrences of strings like:

a3
b5
z3


No. This regex matches a lowercase letter, followed by exactly two numeric digits. So, it will match stuff like "d45".


And as for the second half of the question -- perhaps, you can clarify a bit more?

Henry



----

Thanks for your reply.

Say you have a field in a table that has this: "a3c5d7f9c7".

But, you only want to extract (allow) the first two occurrences -- that is, limit the repetition.

Using the example I included above, it will allow the entire field since it matches the first two repetitions.

But, I only want the first two repetitions and not to allow anything beyond that.

Can this be done?

Thanks,

-mike
 
Richard Tookey
Bartender
Posts: 1166
17
Java Linux Netbeans IDE
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
Mike London wrote:
Say you have a field in a table that has this: "a3c5d7f9c7".

But, you only want to extract (allow) the first two occurrences -- that is, limit the repetition.

Using the example I included above, it will allow the entire field since it matches the first two repetitions.

But, I only want the first two repetitions and not to allow anything beyond that.


I don't understand. Given your example what do you want as output? Do you want to determine if it just has 4 hex digits or do you want to truncate the values or what?
 
Mike London
Ranch Hand
Posts: 1505
11
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
Richard Tookey wrote:
Mike London wrote:
Say you have a field in a table that has this: "a3c5d7f9c7".

But, you only want to extract (allow) the first two occurrences -- that is, limit the repetition.

Using the example I included above, it will allow the entire field since it matches the first two repetitions.

But, I only want the first two repetitions and not to allow anything beyond that.


I don't understand. Given your example what do you want as output? Do you want to determine if it just has 4 hex digits or do you want to truncate the values or what?


Hello,

The RegEx has a repetition function {6} or whatever that says how many times it should match based on the previous RegEx pieces. What I want to have happen is to limit the RegEx output to just that repetition number. In other words, don't match (or return) anything beyond the repetition's values.

So, if you had a field like this: a4c4k5d94kacckd

And the you had a RegEx similar to this:

[a-z][0-9]{2}, I would only want a4c4 to be returned.

However, since a4c4 matches, the entire string gets returned in my testing. I want to limit that to just the a4c4.

Is this possible?
 
Richard Tookey
Bartender
Posts: 1166
17
Java Linux Netbeans IDE
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
I don't see you need a regular expression solution. Just take the first 4 character substring ! What am I missing?
 
Omkar Shetkar
Ranch Hand
Posts: 100
2
Eclipse IDE Java Ubuntu
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
As Henry already pointed out, given regular expression will search for a sequence of alphabet followed by any two digits.
For your requirement, this should be ([a-z][0-9]){2}.

I think you can use Matcher class to achieve your functionality.


Output:

Start index: 0 End index: 4 a4c4
Start index: 4 End index: 8 k5d9
 
Richard Tookey
Bartender
Posts: 1166
17
Java Linux Netbeans IDE
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
Omkar Shetkar wrote:
I think you can use Matcher class to achieve your functionality.


You obviously have a different insight into the OP's requirement because though your code works for your interpretation of the requirement just using substring() works for mine.
 
Henry Wong
author
Sheriff
Posts: 23295
125
C++ Chrome Eclipse IDE Firefox Browser Java jQuery Linux VI Editor Windows
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
Richard Tookey wrote:
You obviously have a different insight into the OP's requirement because though your code works for your interpretation of the requirement just using substring() works for mine.


The OP definitely needs to give more clarity, as I am still not sure of the issue at all. Perhaps the OP should provide an SSCCE with example inputs, outputs, and desired outputs?

Henry
 
Mike London
Ranch Hand
Posts: 1505
11
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
Henry Wong wrote:
Richard Tookey wrote:
You obviously have a different insight into the OP's requirement because though your code works for your interpretation of the requirement just using substring() works for mine.


The OP definitely needs to give more clarity, as I am still not sure of the issue at all. Perhaps the OP should provide an SSCCE with example inputs, outputs, and desired outputs?

Henry


Sorry for the confusion. The RegEx is part of an Oracle SQL Query (REGEXP_LIKE).

What's happening is that since this field is part of a join key, we only want to consider the "repeatable" part of the field (the {5} part, or whatever). But, what's happening is that since there are characters beyond the {5}, after the match, they get included too in the output. Therefore, these fields with the extra characters could never successfully join to another table.

This isn't a Java issue, but a RegEx issue. Part Oracle, part RegEx. It is a Java application, however.

So, if you have a long DB field like this: "a0a2i7g1w8_morefieldcharacters", but you only want the first 10 characters that match: [a-z][0-9]){5}, BUT NO CHARACTERS BEYOND THAT, is this possible with a RegEx?

Perhaps a SUBSTRING on the field would be a better solution to limit it to 10 characters before the RegEx.

Just trying to see what's possible with RegEx.

Hopefully, this clears things up a bit.

Thanks,

-mike
 
Richard Tookey
Bartender
Posts: 1166
17
Java Linux Netbeans IDE
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
Mike London wrote:

So, if you have a long DB field like this: "a0a2i7g1w8_morefieldcharacters", but you only want the first 10 characters that match: [a-z][0-9]){5}, BUT NO CHARACTERS BEYOND THAT, is this possible with a RegEx?


When you say 'want' what do you mean! Do you just want to match against "([a-z][0-9]){5}.*" i.e. the first 10 characters are specified and the rest can be anything? If not then I haven't a clue what you mean and I will bow out since I have never used regex in SQL
 
Steve Luke
Bartender
Posts: 4181
22
IntelliJ IDE Java Python
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
Mike London wrote:So, if you have a long DB field like this: "a0a2i7g1w8_morefieldcharacters", but you only want the first 10 characters that match: [a-z][0-9]){5}, BUT NO CHARACTERS BEYOND THAT

If you have a row whose field of interest is "a0a2i7g1w8_morefieldcharacters" do you want to make sure that row's data is eliminated from the results? If so, then you need to put boundary matches on the regex ^([a-z][0-9]){5}$. In this case only rows whose fields of interest are those 10 characters - no extra characters - will be returned from the result set.

If you instead want the row's data to be included in the result set, but just have the field be returned as "a0a2i7g1w8" instead of its full contents, then no I don't think you can do that with the RegEx there, you would need to parse the string after it is returned.
 
Henry Wong
author
Sheriff
Posts: 23295
125
C++ Chrome Eclipse IDE Firefox Browser Java jQuery Linux VI Editor Windows
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
Steve Luke wrote:
If you instead want the row's data to be included in the result set, but just have the field be returned as "a0a2i7g1w8" instead of its full contents, then no I don't think you can do that with the RegEx there, you would need to parse the string after it is returned.


To clarify a bit, I think a better response would be that there isn't a way to do this with just the regular expression itself. Had this been done with the Java regular expression engine, it could easily be done with the API -- specifically with a combination of groups for extraction and control of whether to call the find() method or not.

Perhaps the OP can check to see if the database's regular expression engine provides such control?

Anyway...

Mike London wrote:
Sorry for the confusion. The RegEx is part of an Oracle SQL Query (REGEXP_LIKE).

This isn't a Java issue, but a RegEx issue. Part Oracle, part RegEx. It is a Java application, however.


Next time, it would be a very good idea to mention this with the first post. Regular expressions isn't just the expression itself. There is a whole library of stuff, for matching, finding, extracting, replacing, splitting, etc. It would be good to know that the question is regarding a different API/engine altogether.

Henry
 
Mike London
Ranch Hand
Posts: 1505
11
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
Henry Wong wrote:
Steve Luke wrote:
If you instead want the row's data to be included in the result set, but just have the field be returned as "a0a2i7g1w8" instead of its full contents, then no I don't think you can do that with the RegEx there, you would need to parse the string after it is returned.


To clarify a bit, I think a better response would be that there isn't a way to do this with just the regular expression itself. Had this been done with the Java regular expression engine, it could easily be done with the API -- specifically with a combination of groups for extraction and control of whether to call the find() method or not.

Perhaps the OP can check to see if the database's regular expression engine provides such control?

Anyway...

Mike London wrote:
Sorry for the confusion. The RegEx is part of an Oracle SQL Query (REGEXP_LIKE).

This isn't a Java issue, but a RegEx issue. Part Oracle, part RegEx. It is a Java application, however.


Next time, it would be a very good idea to mention this with the first post. Regular expressions isn't just the expression itself. There is a whole library of stuff, for matching, finding, extracting, replacing, splitting, etc. It would be good to know that the question is regarding a different API/engine altogether.

Henry


Hello,

If this helps, i finally found the solution to my problem.

"([a-z][0-9]){5}$"

The "$" was all I needed to just match the 5 repetitions of the [a-z][0-9], but nothing to the right of that in the string (that is, nothing in the DB Field beyond the matched 10 characters specified in the RegEx).

My RegEx tool said the original RegEx (without the "$") correctly matched only 10 characters, but when that same RegEx was used in an Oracle WHERE clause, it brought in additional characters after the "matched" 10 characters.

Not sure what is happening exactly, but perhaps this gives enough information.

Appreciate any additional replies.

-- mike
 
Steve Luke
Bartender
Posts: 4181
22
IntelliJ IDE Java Python
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
Mike London wrote:My RegEx tool said the original RegEx (without the "$") correctly matched only 10 characters, but when that same RegEx was used in an Oracle WHERE clause, it brought in additional characters after the "matched" 10 characters.

Not sure what is happening exactly, but perhaps this gives enough information.

The key is the key word is REGEXP_LIKE which is a version of LIKE, LIKE doesn't require an exact match, they act as a 'String.contains()' expression. So as long as the RegEx would be matched someplace in the field it would be found. By adding the $ you are forcing those 10 characters to come at the end of the String (that is what $ means - the end of the line). You could still have a problem with your regex: "([a-z][0-9]){5}$". If the field is "a0a2i7g1w8r2" for example (six repetitions) it could still be found. That is why mine had the hat (^) at the start.
 
Steve Luke
Bartender
Posts: 4181
22
IntelliJ IDE Java Python
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
Henry Wong wrote:
Steve Luke wrote:
If you instead want the row's data to be included in the result set, but just have the field be returned as "a0a2i7g1w8" instead of its full contents, then no I don't think you can do that with the RegEx there, you would need to parse the string after it is returned.


To clarify a bit, I think a better response would be that there isn't a way to do this with just the regular expression itself.

... or in that particular spot (as a condition to the WHERE clause). There apparently is a function (REGEXP_SUBSTR) you can use in the SELECT clause to return a portion of the field that matches a RegEx. I guess that is superfluous now, though, since it appears the OP wanted the condition anyways.
 
Mike London
Ranch Hand
Posts: 1505
11
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
Steve Luke wrote:
Mike London wrote:My RegEx tool said the original RegEx (without the "$") correctly matched only 10 characters, but when that same RegEx was used in an Oracle WHERE clause, it brought in additional characters after the "matched" 10 characters.

Not sure what is happening exactly, but perhaps this gives enough information.

The key is the key word is REGEXP_LIKE which is a version of LIKE, LIKE doesn't require an exact match, they act as a 'String.contains()' expression. So as long as the RegEx would be matched someplace in the field it would be found. By adding the $ you are forcing those 10 characters to come at the end of the String (that is what $ means - the end of the line). You could still have a problem with your regex: "([a-z][0-9]){5}$". If the field is "a0a2i7g1w8r2" for example (six repetitions) it could still be found. That is why mine had the hat (^) at the start.


Steve,

Awesome! Thank you very much. Excellent information.

-- mike
 
  • Post Reply Bookmark Topic Watch Topic
  • New Topic
Boost this thread!