
Regex and startsWith

 
nimo frey
Ranch Hand
Posts: 580
I have this:



and want to replace it with java regex:



Is regex1 the right Pattern:

regex1:

Can I shorten it more?

I have tried to use the shorter form

regex2:

but it does not work.

Is regex1 right?

Would I be better off using


I do not want to distinguish between a, ä, à and the like.

 
Campbell Ritchie
Marshal
Posts: 56600
Completely different idea:
 
Winston Gutkowski
Bartender
Posts: 10575
nimo frey wrote:Is regex1 the right Pattern:

I'm pretty certain you don't want the '+'. It will cause different behaviour from startsWith(), at any rate.

You could try:
s.matches("(?u)^[aäàbc].*");
The '^' tells the regex to search from the beginning of the string, and the '(?u)' tells it to do a Unicode-compliant case-insensitive search; it may also work with '(?i)'.

Alternatively:
Pattern.compile("^[aäàbc]", Pattern.UNICODE_CASE);
should produce a similar regex for a Matcher. It may also work with Pattern.CASE_INSENSITIVE, I'm not sure.

Winston
 
nimo frey
Ranch Hand
Posts: 580
Thanks Campbell, nice idea, but I want to use regex.

Winston, this pattern seems to work as expected:



Is there a trick to avoid the explicit declaration of 'ä' or 'à'?

For example, I thought the following pattern would already recognize 'ä' or 'à' implicitly, but this is not the case:



When using a Collator, I can set col.setStrength( Collator.PRIMARY ); so it automatically includes 'ä' or 'à' or the like.

At the moment, I cannot see any benefit of using this:



instead of this



p1 seems the same as p2. But I am unsure.

 
Rob Spoor
Sheriff
Posts: 21135
nimo frey wrote:

Don't use + to combine flags; use |. In this case the result is the same, but in other cases + will produce the wrong flags.

Let's say you have two sets of flags: one is Pattern.UNICODE_CASE | Pattern.CASE_INSENSITIVE and the other is Pattern.CASE_INSENSITIVE | Pattern.DOTALL. These have the values 66 and 34 respectively. If you use | to combine the two you get 98, which is Pattern.UNICODE_CASE | Pattern.CASE_INSENSITIVE | Pattern.DOTALL, which is what you want. If you use + instead, you don't get 98 but 100, which is Pattern.UNICODE_CASE | Pattern.COMMENTS | Pattern.DOTALL. In other words, Pattern.CASE_INSENSITIVE is changed into Pattern.COMMENTS because you added it twice.
(Values taken from http://docs.oracle.com/javase/7/docs/api/constant-values.html#java.util.regex.Pattern.CASE_INSENSITIVE)
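The arithmetic above can be checked with a small sketch. The class and method names here are made up for illustration; only the Pattern constants come from the JDK.

```java
import java.util.regex.Pattern;

class FlagCombination {
    // The two flag sets from the explanation above.
    static final int UNICODE_INSENSITIVE =
            Pattern.UNICODE_CASE | Pattern.CASE_INSENSITIVE;   // 64 | 2 = 66
    static final int DOTALL_INSENSITIVE =
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL;         // 2 | 32 = 34

    // Bitwise OR: a flag that appears in both sets is still set exactly once.
    static int combineWithOr() {
        return UNICODE_INSENSITIVE | DOTALL_INSENSITIVE;
    }

    // Addition: the shared CASE_INSENSITIVE bit (2) is counted twice and
    // carries over into the COMMENTS bit (4).
    static int combineWithPlus() {
        return UNICODE_INSENSITIVE + DOTALL_INSENSITIVE;
    }

    public static void main(String[] args) {
        System.out.println(combineWithOr());   // 98
        System.out.println(combineWithPlus()); // 100
        // After the addition, CASE_INSENSITIVE is gone and COMMENTS is set.
        System.out.println((combineWithPlus() & Pattern.CASE_INSENSITIVE) != 0); // false
        System.out.println((combineWithPlus() & Pattern.COMMENTS) != 0);         // true
    }
}
```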
 
Winston Gutkowski
Bartender
Posts: 10575
nimo frey wrote:Winston, this pattern seems to work as expected:

OK, I wasn't sure whether CASE_INSENSITIVE would cover the diacritics or not (it may not include all of them).
For a Pattern, you don't need to include the '.*' - in fact, it may make the matching process slower - you only need it
if the pattern is used with String.matches().

Is there a trick to avoid the explicit declaration of 'ä' or 'à'?

Not that I know of. Diacritic combos are in a different section of the character set as far as I know.
But I'm quite happy to be proved wrong.
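One possible workaround, which nobody in this thread suggested, is to strip the diacritics before matching instead of listing them in the character class. The sketch below assumes java.text.Normalizer is acceptable here; the class and method names are hypothetical. It decomposes the input to NFD, removes the combining marks, and then tests a plain ASCII range.

```java
import java.text.Normalizer;
import java.util.regex.Pattern;

class DiacriticInsensitivePrefix {
    // Combining marks (umlaut dots, accents, ...) that remain after NFD decomposition.
    private static final Pattern MARKS = Pattern.compile("\\p{M}+");

    // Plain ASCII range; no accented letters are listed explicitly.
    private static final Pattern STARTS_A_TO_H =
            Pattern.compile("^[a-h]", Pattern.CASE_INSENSITIVE);

    // "\u00e4" (a with diaeresis) decomposes to 'a' + U+0308; the mark is then removed.
    static String stripDiacritics(String s) {
        String decomposed = Normalizer.normalize(s, Normalizer.Form.NFD);
        return MARKS.matcher(decomposed).replaceAll("");
    }

    // True if the first letter, ignoring case and diacritics, is in a..h.
    static boolean startsWithAToH(String s) {
        return STARTS_A_TO_H.matcher(stripDiacritics(s)).find();
    }
}
```

Letters that have no canonical decomposition, such as 'ß' or 'ø', pass through unchanged, so this only covers characters that are a base letter plus combining marks.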

Winston
 
nimo frey
Ranch Hand
Posts: 580
Winston Gutkowski wrote:For a Pattern, you don't need to include the '.*' - in fact, it may make the matching process slower - you only need it
if the pattern is used with String.matches().


I cannot find a way to avoid including '.*'


A normal use would be:



So I am forced to use ".*". When using


the result is different.


 
Winston Gutkowski
Bartender
Posts: 10575
nimo frey wrote:
For a Pattern, you don't need to include the '.*' - in fact, it may make the matching process slower - you only need it
if the pattern is used with String.matches().

...So I am forced to use ".*". When using...("^[a-häà]")...the result is different...

Only with matches(). Try
p.matcher("teststring").find()
If it returns true, the string starts with a match for the pattern (the '^' anchor restricts find() to the start of the input).
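The difference between matches() and find() can be sketched as follows. The character class is the one quoted above; the class, field, and input names are made up for illustration. Matcher.lookingAt() is shown as a further option that is anchored at the start without needing '^'.

```java
import java.util.regex.Pattern;

class PrefixMatchDemo {
    // \u00e4 and \u00e0 are the a-umlaut and a-grave from the thread, written as
    // escapes so the example does not depend on the source file encoding.
    static final Pattern PREFIX = Pattern.compile("^[a-h\u00e4\u00e0]");
    static final Pattern WHOLE = Pattern.compile("^[a-h\u00e4\u00e0].*");
    static final Pattern UNANCHORED = Pattern.compile("[a-h\u00e4\u00e0]");

    public static void main(String[] args) {
        String s = "example";

        // matches() must consume the entire input, so a one-character pattern fails.
        System.out.println(PREFIX.matcher(s).matches());       // false
        // With the trailing ".*" the whole input is consumed.
        System.out.println(WHOLE.matcher(s).matches());        // true
        // find() only needs a match somewhere; '^' pins it to the start.
        System.out.println(PREFIX.matcher(s).find());          // true
        // lookingAt() is anchored at the start but need not reach the end.
        System.out.println(UNANCHORED.matcher(s).lookingAt()); // true

        // 't' is outside the class, so there is no match at the start.
        System.out.println(PREFIX.matcher("teststring").find()); // false
    }
}
```

One side effect of the ".*" form: without Pattern.DOTALL, '.' does not match line terminators, so matches() fails on multi-line input even when the first character is in the class, while find() is unaffected.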

Winston

PS: As far as I know, you don't need both Pattern.CASE_INSENSITIVE and Pattern.UNICODE_CASE, and it may slow down your match.
They basically do the same thing, so choose one (I'd go with the Unicode one myself; but you should test it to make sure).
 
nimo frey
Ranch Hand
Posts: 580
Thank you. find() works as expected.

However, I am forced to use both Pattern.CASE_INSENSITIVE and Pattern.UNICODE_CASE:



When using this:

then "A-HÄÀ" is not considered.

When using this:

then "äà" is not considered.

However, when using this:

then both case-insensitive and unicode are considered correctly.

Which pattern should I use? Is there a difference between p0 and p3?
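The behaviour described in this post is consistent with the Pattern documentation: CASE_INSENSITIVE on its own only folds US-ASCII letters, and UNICODE_CASE only has an effect when CASE_INSENSITIVE is also set. A minimal sketch follows; the field names are hypothetical and are not the p0 to p3 from this thread, whose definitions are not shown. It compares the three flag settings plus the inline form "(?iu)".

```java
import java.util.regex.Pattern;

class CaseFlagDemo {
    // \u00e4 = a-umlaut, \u00e0 = a-grave (lowercase); \u00c4 = A-umlaut (uppercase).
    static final String REGEX = "^[a-h\u00e4\u00e0]";

    // ASCII-only case folding: 'A'..'H' match, but uppercase A-umlaut does not.
    static final Pattern ASCII_ONLY =
            Pattern.compile(REGEX, Pattern.CASE_INSENSITIVE);

    // UNICODE_CASE alone does not switch on case-insensitive matching at all.
    static final Pattern UNICODE_ONLY =
            Pattern.compile(REGEX, Pattern.UNICODE_CASE);

    // Both flags: case folding for ASCII and non-ASCII letters.
    static final Pattern BOTH =
            Pattern.compile(REGEX, Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE);

    // Same two flags written as embedded flag expressions.
    static final Pattern INLINE = Pattern.compile("(?iu)" + REGEX);

    static boolean starts(Pattern p, String s) {
        return p.matcher(s).find();
    }

    public static void main(String[] args) {
        String[] inputs = {"hello", "Hello", "\u00e4pfel", "\u00c4pfel"};
        for (String s : inputs) {
            System.out.println(s
                    + " ascii=" + starts(ASCII_ONLY, s)
                    + " unicode=" + starts(UNICODE_ONLY, s)
                    + " both=" + starts(BOTH, s)
                    + " inline=" + starts(INLINE, s));
        }
    }
}
```

In this sketch only the combination of both flags (or "(?iu)") accepts all four inputs, which suggests that the two-flag pattern is the one to use when uppercase accented letters must match.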

 