This regex should match names that begin with a capital letter, may contain dots & spaces, include at least one multi-letter word (as in 'Java') & are at most 45 characters long.
It should also be able to match names like - A.B. Henry, A B Henry, A Henry, Alex H, Alex H., Alex Henry, Alex Williams Henry, etc.
I've created a package 'validator' in my application & named this class as 'FirstLastNameVerifier'. The code goes like this:
Thanks for your help.
I'm finding it hard to parse that regular expression, but I think you're over-complicating it.
Firstly, remember you don't need to validate everything in the same regular expression. For instance, I'd forget about trying to build the 45-character limit into it. I think you're better off checking .length() before trying the match.
I'm also not sure that it's worth trying to validate the rule that at least one of the groups has to be multi-character in the same regular expression. It may be possible, but only at the expense of complicating it, whereas two separate simple regular expressions would probably do that job.
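As a rough sketch of that idea (the class name, the method name and the placeholder pattern are my own assumptions, not code from this thread), the length test can simply run before the regex does:

```java
import java.util.regex.Pattern;

final class NameLengthCheck {
    // Limit taken from the 45-character requirement in the question.
    static final int MAX_LENGTH = 45;

    // Placeholder pattern (an assumption): capitalised words separated by one space or dot.
    static final Pattern NAME = Pattern.compile("[A-Z][a-z]*(?:[ .][A-Z][a-z]*)*");

    static boolean isValid(String name) {
        // Cheap length test first; the regex only runs on inputs that pass it.
        if (name == null || name.isEmpty() || name.length() > MAX_LENGTH) {
            return false;
        }
        return NAME.matcher(name).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValid("Alex Henry"));
        System.out.println(isValid("A" + "a".repeat(45))); // 46 characters
    }
}
```

This keeps the regex concerned only with the shape of the name, so the limit can change without touching the pattern.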
The following expression matches all the names you've given above, although I haven't tested it properly against names it shouldn't validate:
A second regular expression to check that there is at least one multi-letter word in the name should be pretty simple.
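One possible shape for that second check, assuming "multi-letter word" just means two or more consecutive letters (the class and method names here are hypothetical):

```java
import java.util.regex.Pattern;

final class MultiLetterWordCheck {
    // Two or more consecutive letters anywhere in the input.
    static final Pattern MULTI_LETTER = Pattern.compile("[A-Za-z]{2,}");

    static boolean hasMultiLetterWord(String name) {
        // find() searches anywhere in the input instead of requiring a full match.
        return MULTI_LETTER.matcher(name).find();
    }

    public static void main(String[] args) {
        System.out.println(hasMultiLetterWord("A.B. Henry"));
        System.out.println(hasMultiLetterWord("A B"));
    }
}
```

You would run this in addition to the main name pattern, and accept the name only if both checks pass.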
Matthew - Actually, I tried with a simpler one but as I kept adding those conditions it got complicated. I'll try this one & let you know.
Ulf - That's definitely something new to learn & I will use it in my code.
By the way, I cannot yet understand the use of '?' and '^' in regex. I'm trying to read more about them, but I'm still clueless as to their application in my code. If you can suggest a good tutorial or link, please do so.
The ultimate aim is simple - Match a multi-word string separated by either '.' or ' ', and starts with a capital letter.
Let's split the design into parts.
Write a regex that starts with a capital letter followed by zero or more lowercase letters (zero covers the initials in 'A B Henry'). => [A-Z][a-z]*
Then, coming to the multi-word part, the words may be separated by either a '.' or a ' ' (as in A.B Henry), but not both. => ([A-Z][a-z]*[\\s\\.])+
..but with the regex constructed above, the string must always end with a ' ' or a '.', otherwise it won't match. So we work around that by bringing [\\s\\.] to the front. => [A-Z][a-z]*(?:[\\s\\.][A-Z][a-z]*)*
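The regex constructed above should now match your inputs that use a single separator between words. It will still fail for 'Alex H.' (trailing dot) and 'A.B. Henry' (a dot followed by a space), but I guess you know what to do to make the pattern match these too.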
The '?:' defines a non-capturing group, as we are not interested in using back references here. Including '?:' may improve matching speed slightly, because the engine does not have to remember what the group matched.
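If it helps, here is a small sketch that assembles the pattern from the steps above and runs it over the sample names from the question. The class and method names are made up for illustration:

```java
import java.util.regex.Pattern;

final class NameRegexSteps {
    // Step 1: one word, a capital letter followed by zero or more lowercase letters.
    static final String WORD = "[A-Z][a-z]*";

    // Final step: first word, then any number of (separator + word) groups.
    // In a Java string literal "\\s" is the regex \s and "\\." is a literal dot.
    static final Pattern NAME = Pattern.compile(WORD + "(?:[\\s\\.]" + WORD + ")*");

    static boolean matchesName(String input) {
        // matches() requires the whole input to fit the pattern.
        return NAME.matcher(input).matches();
    }

    public static void main(String[] args) {
        String[] samples = {
            "A.B. Henry", "A B Henry", "A Henry", "Alex H",
            "Alex H.", "Alex Henry", "Alex Williams Henry"
        };
        for (String sample : samples) {
            System.out.println(sample + " -> " + matchesName(sample));
        }
    }
}
```

Running it prints true or false per name, which makes it easy to spot the names that still need the pattern extended.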
(The regexp you posted doesn't make this mistake, but it's good to remember that there are last names that have only two characters, "Ng" being an example.)
For instance, I work in a university. Imagine how many of our systems struggled when we had a student join last year who only has a single name. Not just a preference for using a single name: legally speaking they have a single name.
Thanks Matthew. I went through that article. Yes, I would definitely need to keep in mind all those aspects & 'localization' etc. for later. Presently, I can continue with the min. of requirements that I've mentioned.
Thanks Ulf. In fact the complicated regex is a result of my trying to get everything done & deal with all the conditions at once.
Thanks All for guiding me. I'm reading a few tutorials on 'regex' & trying to come up with a good one for this. Greenhorn that I am, it's taking me a little more time to figure out. But yes, please do watch this post; I'm sure to come up with something soon. I'm at it.
I have tried testing your regex with both the code samples & there are some differences in output.
1. Using the 'match' method returns 'false' with most of the input strings mentioned above.
2. Using the Pattern class does match the input strings, but the matcher groups are not the same as the input strings. For example, A.B. Williams gives the following output:
1. "String is A.B. Williams Match result is false"
2. "String is A.B. Williams Match result is true"
In another detail display, it gives
Matcher grp is: A Matcher start is: 0 Matcher end is: 1
Matcher grp is: B Matcher start is: 2 Matcher end is: 3
Matcher grp is: Williams Matcher start is: 5 Matcher end is: 13
(I understand that because of the use of ?:, we do not get the characters "." & " " as a matching group in the result.)
My question is: why do the two cases treat the input strings differently? I'm not sure but could it be because 'match' does a complete match & would not settle for anything less whereas 'pattern' returns true even if a 'minimum' match is found?
Can you suggest how it is generally done & which would be a better way given the context?
Thanks for your help.
I'm not sure but could it be because 'match' does a complete match & would not settle for anything less whereas 'pattern' returns true even if a 'minimum' match is found?
I assume that you mean String's matches() method by 'match' and Pattern.compile(), Matcher.find() & Matcher.group() by 'pattern'.
So, you want to know the difference between them?
Go for the matches() method if you want to match your pattern against the whole input string, and go for Matcher.find() followed by Matcher.group(), to find any number of substring matches in the input string.
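To make the difference concrete, here is a minimal sketch. It uses an illustrative single-word pattern, which is my assumption and not necessarily the regex you tested; the class and method names are hypothetical too:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class MatchesVersusFind {
    // Illustrative pattern: a single capitalised word.
    static final String WORD_REGEX = "[A-Z][a-z]*";

    static boolean wholeInputMatches(String input) {
        // String.matches() succeeds only if the entire input fits the pattern.
        return input.matches(WORD_REGEX);
    }

    static List<String> findAll(String input) {
        List<String> found = new ArrayList<>();
        Matcher m = Pattern.compile(WORD_REGEX).matcher(input);
        // Each find() call moves on to the next substring that fits the pattern.
        while (m.find()) {
            found.add(m.group() + "[" + m.start() + "," + m.end() + ")");
        }
        return found;
    }

    public static void main(String[] args) {
        String input = "A.B. Williams";
        System.out.println(wholeInputMatches(input)); // false
        System.out.println(findAll(input));           // [A[0,1), B[2,3), Williams[5,13)]
    }
}
```

With this pattern, find() reports three separate substring matches while matches() rejects the input as a whole, which is the same kind of difference you observed.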
I suggest you read more of these API usages in the javadoc.
And one more thing: when you have a lot of strings to match against the same regex pattern, it is advisable to create a Pattern instance once and use Matcher.matches() or Matcher.find()/Matcher.group() on it.
String's matches(regex) compiles the pattern again on every call, which makes it relatively costly for a long list of strings. For just a few matches, it is fine to use.
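A sketch of that reuse, assuming the pattern built earlier in this thread and a hypothetical countValid helper:

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

final class ReusedPattern {
    // Compiled once and shared; Pattern instances are immutable and safe to reuse.
    static final Pattern NAME = Pattern.compile("[A-Z][a-z]*(?:[\\s\\.][A-Z][a-z]*)*");

    static long countValid(List<String> names) {
        long valid = 0;
        for (String name : names) {
            // A new Matcher per input is cheap; the expensive compile step is not repeated.
            if (NAME.matcher(name).matches()) {
                valid++;
            }
        }
        return valid;
    }

    public static void main(String[] args) {
        List<String> names = Arrays.asList("A B Henry", "Alex Henry", "alex henry", "Alex H.");
        System.out.println(countValid(names)); // 2
    }
}
```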
...I understand that because of the use of ?:, we do not get the characters "." & " " as a matching group in the result...
No. When designing a pattern, anything within '(' and ')' is a capturing group: once it has matched, the regex engine remembers that text for the rest of its processing. To stop the engine remembering the match, since we do not intend to use it again elsewhere in our pattern (as a back reference), we use '?:'. You can even remove it. Nothing is going to change, except for the performance when the input string is very large.
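You can see this for yourself with a small sketch like the one below (class and field names are hypothetical). Both patterns accept the same input; only the number of remembered groups differs:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class GroupKinds {
    // Same pattern twice: once with a capturing group, once with a non-capturing one.
    static final Pattern CAPTURING = Pattern.compile("[A-Z][a-z]*([\\s\\.][A-Z][a-z]*)*");
    static final Pattern NON_CAPTURING = Pattern.compile("[A-Z][a-z]*(?:[\\s\\.][A-Z][a-z]*)*");

    public static void main(String[] args) {
        String input = "A B Henry";
        Matcher c = CAPTURING.matcher(input);
        Matcher n = NON_CAPTURING.matcher(input);

        // Both accept the input as a whole.
        System.out.println(c.matches() + " " + n.matches());       // true true
        // groupCount() counts capturing groups only.
        System.out.println(c.groupCount() + " " + n.groupCount()); // 1 0
        // A repeated capturing group keeps the text of its last iteration.
        System.out.println("[" + c.group(1) + "]");                // [ Henry]
    }
}
```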