pattern matching issue when validating String input

 
Ranch Hand
As we all know, pattern matching is not straightforward in Java. There's , then there's , and then there's this method...
It's a little beyond me, to be honest. I created a small program which takes in Roman Numerals, and converts them into Arabic Numerals.

I thought it would be nice to have validation on my inputs, so I created this method:



What I am trying to do, in plain English: the validate method should accept only Strings composed entirely of the Roman numeral letters I, V, X, L, C, D, and M.
Input which should pass:
XX
IX
XI
IV
V
I

Input which should NOT pass:

PP
P
XP
IG

However, when I enter XP, it still recognizes this as "10" and does not catch the "P". When I enter PI, it does not catch the "P" but recognizes the I and gives me a "1"
What am I doing wrong? I think my REGEX pattern is wrong.
 
Saloon Keeper
Your pattern matches only a single character from the set you've listed, and find() reports whether that pattern appears anywhere in the string.

What you want is:
A pattern that is made up of one or more of those characters and then use matches() which will only be true if the pattern matches the entire string exactly.

Your pattern should be "[IVXLCDM]+"
and replace find() with matches().

Alternatively, you can do:
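Both forms can be sketched roughly as follows. The class and method names are made up for illustration, and the shorthand is assumed to be String.matches, which a later reply about compiling the regex on the fly suggests.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical class and method names, used only for illustration.
final class RomanLetterCheck {

    // True only if every character of input is one of I, V, X, L, C, D, M.
    static boolean validate(String input) {
        Pattern pattern = Pattern.compile("[IVXLCDM]+");
        Matcher matcher = pattern.matcher(input);
        return matcher.matches(); // the whole string must match, unlike find()
    }

    // Shorthand form: String.matches also requires a full match.
    static boolean validateShort(String input) {
        return input.matches("[IVXLCDM]+");
    }

    // For comparison: find() is satisfied by any matching substring.
    static boolean findsAnyRomanLetter(String input) {
        return Pattern.compile("[IVXLCDM]").matcher(input).find();
    }
}
```

With find(), "XP" passes because the X alone satisfies the pattern; with matches(), the P makes the whole input fail.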
 
Carey Brown
Saloon Keeper
If you want to be even more robust, you can look for patterns of 4 or more of any of those characters that appear consecutively, which should never happen. Example: "VIIII".

Pattern: "I{4,}|V{4,}|X{4,}|L{4,}|C{4,}|D{4,}|M{4,}"
Now, with that pattern you DO want to use find(), because you don't care where in the string the pattern occurs.
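A minimal sketch of that check, using the pattern from this post. The class and method names are hypothetical; the constant name TOO_MANY is borrowed from a later reply in this thread.

```java
import java.util.regex.Pattern;

// Hypothetical helper; only the regex itself comes from the post.
final class RepeatCheck {
    // Four or more of the same letter in a row.
    private static final Pattern TOO_MANY =
            Pattern.compile("I{4,}|V{4,}|X{4,}|L{4,}|C{4,}|D{4,}|M{4,}");

    // True if the input contains such a run anywhere.
    static boolean hasLongRun(String input) {
        return TOO_MANY.matcher(input).find();
    }
}
```

This check looks only at run length. It says nothing about illegal letters such as P, so it would complement the "[IVXLCDM]+" check rather than replace it.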
 
Bartender
Or do you want to accept only legal Roman numerals? For instance, VV would be accepted by the matching, but it is not a legal Roman numeral, and it should not translate to 55 or 10.

PS: IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII  is legal, but it is not in the preferred form.
 
Carey Brown
Saloon Keeper

Piet Souris wrote:PS: IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII  is legal, but it is not in the preferred form.


So what would this be: IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIX  ?
 
Piet Souris
Bartender
It is a little more complex than that. An excellent description is in Project Euler:
exercise 89
 
Piet Souris
Bartender

Carey Brown wrote:So what would this be: IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIX  ?


This is illegal. X cannot be composed of smaller values, just like C and D. So, for instance, LL for C is also illegal.
 
Piet Souris
Bartender

Piet Souris wrote:PS: IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII  is legal, but it is not in the preferred form.

Hmm, sorry about that, but that is also illegal. 9 I's is the max. It has been a long time since I did this exercise.
 
Marshal

Carey Brown wrote:
Alternatively you can  do:


You should note, however, that this alternative form compiles the regex on the fly. If you're going to try to match multiple strings to the same regular expression, a precompiled pattern is the better choice:
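As a rough sketch of the precompiled approach: the pattern is compiled once into a constant and reused for every input. The class, method, and constant names here are illustrative assumptions, not code from the original post.

```java
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical example class.
final class PrecompiledDemo {
    // Compiled once when the class is loaded, then reused for every input.
    private static final Pattern ROMAN_LETTERS = Pattern.compile("[IVXLCDM]+");

    // Keeps only the inputs made up entirely of Roman numeral letters.
    static List<String> keepValid(List<String> inputs) {
        return inputs.stream()
                .filter(s -> ROMAN_LETTERS.matcher(s).matches())
                .collect(Collectors.toList());
    }
}
```

By contrast, calling input.matches("[IVXLCDM]+") inside a loop compiles the same expression again on every call.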
 
Bartender

Mark Richardson wrote:...It's a little beyond me, to be honest.
...


I think there's a reason why we have two objects. It allows you to do this:

I think that this is quite a simple way to solve multiple scenarios and use cases. Have a look at this complete tutorial page based only on Matcher.

Having said that, not all problems should be solved using regex. I'll leave the choice to you  
 
lowercase baba

Carey Brown wrote:If you want to be even more robust you can look  for patterns of 4 or more of any of those characters that appear consecutively, which should never happen.


Have you ever looked at a grandfather clock face?

 
Junilu Lacar
Marshal
We could speculate and debate about what is and isn't valid but since we haven't heard from OP since the opening post, I don't think any of it matters until we hear what his requirements are. On the subject of clear requirements for this problem, this article might be of interest: https://dzone.com/articles/roman-numerals-kata-tdd-and
 
Mark Richardson
Ranch Hand

Junilu Lacar wrote:We could speculate and debate about what is and isn't valid but since we haven't heard from OP since the opening post, I don't think any of it matters until we hear what his requirements are. On the subject of clear requirements for this problem, this article might be of interest: https://dzone.com/articles/roman-numerals-kata-tdd-and



I would defer to the "specification" which Piet posted from ProjectEuler: https://projecteuler.net/about=roman_numerals
I wrote my program "for fun," and now I'm realizing the uncomfortable position I've put myself in, haha. I suppose one could just copy and paste this solution from Stack Overflow:
https://stackoverflow.com/questions/267399/how-do-you-match-only-valid-roman-numerals-with-a-regular-expression

Here is my ENTIRE code btw...  and currently the does not work as expected...



 
Junilu Lacar
Marshal
I would refer you to this quote:

Jamie Zawinski wrote:Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems.


I would use the regex just to make sure the input contains only valid Roman numeral characters. I wouldn't go any further than that. I think that the domain rules are complex enough to defer further validation to the routine that does the translation from Roman numerals to Arabic. If the input doesn't follow the rules, you simply abort the translation and return an error message.
 
Sheriff
You're not using the variable b, so I'd just remove that line. Since you only want to compile the regex once, I'd make p a static final field. Also, rename p to something like ROMAN_NUMERAL_PATTERN.
 
Mark Richardson
Ranch Hand

Carey Brown wrote:If you want to be even more robust you can look  for patterns of 4 or more of any of those characters that appear consecutively, which should never happen. Example: "VIIII".

Pattern: "I{4,}|V{4,}|X{4,}|L{4,}|C{4,}|D{4,}|M{4,}"
Now  with that pattern you DO want to use find() because you don't care where in the string the pattern occurs.



Hi Carey, so I tried this:



However, it's taking the following (wrong) inputs:
P and returning 0
and IP and returning 1.

I thought that the regex would ensure that upon encountering a "P" it would print "Invalid Input" and error out.
 
Junilu Lacar
Marshal

Mark Richardson wrote:
Hi Carey, so I tried this:
...
I thought that the regex would ensure that upon encountering a "P" it would print "Invalid Input" and error out.


It might be worth taking a few moments to read through some of the previous replies. In his first reply to you,

Carey Brown wrote:
Your pattern should be "[IVXLCDM]+"
and replace find() with matches().


 
Junilu Lacar
Marshal
You don't even have to use regular expressions for this.

The convert() method would also report any errors it finds but those would be violations of whatever formatting rules you choose to implement rather than illegal numerals.
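A regex-free check along these lines might look like the sketch below, which uses a stream over the input's characters. The class, method, and constant names are assumptions for illustration; the convert() method mentioned above is not shown here.

```java
// Hypothetical helper; not the code from the original post.
final class RomanChars {
    private static final String NUMERALS = "IVXLCDM";

    // True if input is non-empty and every character is a Roman numeral letter.
    static boolean hasOnlyRomanNumerals(String input) {
        return !input.isEmpty()
                && input.chars().allMatch(c -> NUMERALS.indexOf(c) >= 0);
    }
}
```

Like the "[IVXLCDM]+" pattern, this only rejects illegal letters. Ordering and repetition rules would still be left to the conversion routine.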
 
Mark Richardson
Ranch Hand

Junilu Lacar wrote:You don't even have to use regular expressions for this.

The convert() method would also report any errors it finds but those would be violations of whatever formatting rules you choose to implement rather than illegal numerals.



Today I learned: Not to ignore Java Streams and actually learn them! :P
 
Piet Souris
Bartender
Alternatively:

It is much easier to go from Arabic to Roman than vice versa. So, make a Map<Integer, String> and when done, you have many legal Romans. Usually, you end with 3999.
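A sketch of that idea, assuming the usual greedy conversion to the minimal form. The class and method names are hypothetical, and the reverse map from Roman string to value is an added assumption to show how the table could be used for validation.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper class.
final class RomanTable {
    private static final int[] VALUES =
            {1000, 900, 500, 400, 100, 90, 50, 40, 10, 9, 5, 4, 1};
    private static final String[] SYMBOLS =
            {"M", "CM", "D", "CD", "C", "XC", "L", "XL", "X", "IX", "V", "IV", "I"};

    // Greedy conversion: repeatedly append the largest symbol that still fits.
    static String toRoman(int number) {
        StringBuilder roman = new StringBuilder();
        for (int i = 0; i < VALUES.length; i++) {
            while (number >= VALUES[i]) {
                roman.append(SYMBOLS[i]);
                number -= VALUES[i];
            }
        }
        return roman.toString();
    }

    // The Map<Integer, String> from the post, filled for 1..3999.
    static Map<Integer, String> arabicToRoman() {
        Map<Integer, String> table = new HashMap<>();
        for (int n = 1; n <= 3999; n++) {
            table.put(n, toRoman(n));
        }
        return table;
    }

    // Reverse lookup: a key is present only for minimal-form numerals.
    static Map<String, Integer> romanToArabic() {
        Map<String, Integer> reverse = new HashMap<>();
        arabicToRoman().forEach((n, roman) -> reverse.put(roman, n));
        return reverse;
    }
}
```

A lookup in the reverse map then validates and converts in one step, though it accepts only the minimal form, so inputs such as IIII would be rejected.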
 
Carey Brown
Saloon Keeper
There are a couple of things to note here. Just because this says the input is valid doesn't mean it is, because it does no syntax checking. And, as others have pointed out, this only works for the "compact" (typical?) Roman numeral format; the input may not be limited to that, so you'd get a false negative. If you can find a robust parsing algorithm, it would probably be quick enough compared to user input and wouldn't have the problems that regular expressions have. You could still use your [IVXLCDM]+ quick test before parsing, but my TOO_MANY test doesn't work in all cases, so it's probably not a good idea.
 
Carey Brown
Saloon Keeper
Version 12 of Java introduced using switch as an expression, so the above code can be replaced with:
Note:
This is a preview feature, which is a feature whose design, specification, and implementation are complete, but is not permanent, which means that the feature may exist in a different form or not at all in future JDK releases. To compile and run code that contains preview features, you must specify additional command-line options. See Preview Features. For background information about the design of switch expressions, see JEP 354.
 
Junilu Lacar
Marshal

Carey Brown wrote:Version 12 of Java introduced using switch as an expression, so the above code can be replaced with:


Switch expressions were preview mode only in Java 12 and 13. If you want to use them in those versions, you need to add the --enable-preview option to the command line.

It's an official feature of the language from Java 14.

I'd think a value of 0 for an invalid numeral would be more useful. Or you could just throw an IllegalArgumentException.
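A switch expression along these lines might look like the sketch below. It assumes Java 14 or later; the class and method names are invented for illustration, and the default branch throws IllegalArgumentException, one of the two options mentioned above.

```java
// Hypothetical helper; requires Java 14+ (or 12/13 with --enable-preview).
final class NumeralValue {

    // Maps a single Roman numeral letter to its value.
    static int valueOf(char numeral) {
        return switch (numeral) {
            case 'I' -> 1;
            case 'V' -> 5;
            case 'X' -> 10;
            case 'L' -> 50;
            case 'C' -> 100;
            case 'D' -> 500;
            case 'M' -> 1000;
            default -> throw new IllegalArgumentException(
                    "Not a Roman numeral: " + numeral);
        };
    }
}
```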
 
Master Rancher
To be fair, this code would be easy to do without switch expressions too:
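One way to do that, sketched here with a lookup map (an assumption; a plain switch statement would work just as well). The class and method names are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper that works on Java 8 and later.
final class NumeralLookup {
    private static final Map<Character, Integer> VALUES = new HashMap<>();
    static {
        VALUES.put('I', 1);
        VALUES.put('V', 5);
        VALUES.put('X', 10);
        VALUES.put('L', 50);
        VALUES.put('C', 100);
        VALUES.put('D', 500);
        VALUES.put('M', 1000);
    }

    // Returns the value of one numeral letter, or throws for anything else.
    static int valueOf(char numeral) {
        Integer value = VALUES.get(numeral);
        if (value == null) {
            throw new IllegalArgumentException("Not a Roman numeral: " + numeral);
        }
        return value;
    }
}
```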
 
Junilu Lacar
Marshal
You could also play around with using an enum type. I think there's some good potential to create intuitive code and semantics. For example, you could use the built-in valueOf(String) method to convert from a string to an enum value. Then you could do something like this:

I'd probably add a numericValue property for each one and a getter for it.
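A minimal sketch of such an enum, with the numericValue property and getter described above. The enum name and the fromChar helper are assumptions for illustration.

```java
// Hypothetical enum; one constant per Roman numeral letter.
enum RomanNumeral {
    I(1), V(5), X(10), L(50), C(100), D(500), M(1000);

    private final int numericValue;

    RomanNumeral(int numericValue) {
        this.numericValue = numericValue;
    }

    int getNumericValue() {
        return numericValue;
    }

    // valueOf(String) is generated for every enum and throws
    // IllegalArgumentException when no constant has the given name.
    static RomanNumeral fromChar(char c) {
        return valueOf(String.valueOf(c));
    }
}
```

Because valueOf(String) already rejects unknown names, an illegal letter such as P surfaces as an IllegalArgumentException without any regex.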
 
Piet Souris
Bartender
As I said, it is easy to get the preferred Roman form of any number 0...999. When you have these, it is a little more work to expand these Romans according to what my linked article says. For instance, IV can be expanded to IIII (but not to IIIIII). The longest series I found was 499:

CDXCIX
CCCCXCIX
CDXCVIIII
CCCCXCVIIII
CDLXXXXVIIII
CDXCIIIIIIIII
CCCCLXXXXVIIII
CCCCXCIIIIIIIII
CDLXXXXIIIIIIIII
CCCCLXXXXIIIIIIIII
CDXXXXXXXXXIIIIIIIII
CCCCXXXXXXXXXIIIIIIIII


So a valid Roman is a String that starts with any number of M, followed by a legal String from above. Does anyone have a working Pattern?
 
Mike Simmons
Master Rancher

Piet Souris wrote:So a valid Roman is a String that starts with any number of M, followed by a legal String from above.



Well, right away that seems to leave out the familiar MCM numbers seen in movie credits throughout the twentieth century.
 
Carey Brown
Saloon Keeper

Mike Simmons wrote:

Piet Souris wrote:So a valid Roman is a String that starts with any number of M, followed by a legal String from above.



Well, right away that seems to leave out the familiar MCM numbers seen in movie credits throughout the twentieth century.

I don't see where MCM breaks the rules. Neither does XIX.

   Only one I, X, and C can be used as the leading numeral in part of a subtractive pair.
   I can only be placed before V and X.
   X can only be placed before L and C.
   C can only be placed before D and M.
   
   Note the following is valid because it is not subtractive:
   XI, XV, CV, CX, CL, MC


 
Mike Simmons
Master Rancher
Ok, I may have read too quickly. Interpreting "from above" as the previous parts of his own post, it did not seem to include any mention of or examples with M. So I thought it did not allow M, other than the initial "any number of M". But if he meant "from the rules already covered in other posts" or something like that, then maybe it works. The post makes sense to me as a guide for how to handle things like XXXX, but not as a comprehensive set of rules.
 
Carey Brown
Saloon Keeper
From ProjectEuler
Traditional Roman numerals are made up of the following denominations:

I = 1
V = 5
X = 10
L = 50
C = 100
D = 500
M = 1000

In order for a number written in Roman numerals to be considered valid there are
three basic rules which must be followed.

#1. Numerals must be arranged in descending order of size.
#2. M, C, and X cannot be equalled or exceeded by smaller denominations.
#3. D, L, and V can each only appear once.

For example, the number sixteen could be written as XVI or XIIIIII, with the first
being the preferred form as it uses the least number of numerals. We could not write
IIIIIIIIIIIIIIII because we are making X (ten) from smaller denominations, nor could
we write VVVI because the second and third rule are being broken.

The "descending size" rule was introduced to allow the use of subtractive combinations.
For example, four can be written IV because it is one before five. As the rule requires
that the numerals be arranged in order of size it should be clear to a reader that the
presence of a smaller numeral out of place, so to speak, was unambiguously to be
subtracted from the following numeral rather than added.

For example, nineteen could be written XIX = X (ten) + IX (nine). Note also how the
rule requires X (ten) be placed before IX (nine), and IXX would not be an acceptable
configuration (descending size rule). Similarly, XVIV would be invalid because V can
only appear once in a number.

Generally the Romans tried to use as few numerals as possible when displaying numbers.
For this reason, XIX would be the preferred form of nineteen over other valid
combinations, like XIIIIIIIII or XVIIII.

By mediaeval times it had become standard practice to avoid more than three consecutive
identical numerals by taking advantage of the more compact subtractive combinations.
That is, IV would be written instead of IIII, IX would be used instead of IIIIIIIII or
VIIII, and so on.

In addition to the three rules given above, if subtractive combinations are used then
the following four rules must be followed.

   Only one I, X, and C can be used as the leading numeral in part of a subtractive pair.
   I can only be placed before V and X.
   X can only be placed before L and C.
   C can only be placed before D and M.
   
   Note the following is valid because it is not subtractive:
   XI, XV, CV, CX, CL, MC

Which means that IL would be considered to be an invalid way of writing forty-nine, and
whereas XXXXIIIIIIIII, XXXXVIIII, XXXXIX, XLIIIIIIIII, XLVIIII, and XLIX are all quite
legitimate, the latter is the preferred (minimal) form. However, minimal form was not a
rule and there still remain in Rome many examples where economy of numerals has not been
employed. For example, in the famous Colosseum the numerals above the forty-ninth entrance
is written XXXXVIIII rather than XLIX.

It is also expected, but not required, that higher denominations should be used whenever
possible; for example, V should be used in place of IIIII, L should be used in place of
XXXXX, and D should be used in place of CCCCC. However, in the church of Sant'Agnese fuori
le Mura (St Agnes' outside the walls), found in Rome, the date, MCCCCCCVI (1606), is
written on the gilded and coffered wooden ceiling; I am sure that many would argue that it
should have been written MDCVI.
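Under one reading of the rules quoted above, they can be sketched as a single pattern. This is an illustration only, not a pattern posted by anyone in this thread, and the class and method names are made up. Each place (hundreds, tens, units) allows either a subtractive pair, an optional "five" letter followed by up to four "one" letters, or up to nine "one" letters, so that M, C, and X are never equalled by smaller denominations.

```java
import java.util.regex.Pattern;

// Hypothetical validator for the quoted rules; an assumption-laden sketch.
final class EulerRoman {
    private static final Pattern VALID = Pattern.compile(
            "(?=.)M*"                        // non-empty; any number of M
          + "(?:CM|CD|DC{0,4}|C{0,9})"       // hundreds, total below 1000
          + "(?:XC|XL|LX{0,4}|X{0,9})"       // tens, total below 100
          + "(?:IX|IV|VI{0,4}|I{0,9})");     // units, total below 10

    static boolean isValid(String input) {
        return VALID.matcher(input).matches();
    }
}
```

This accepts the non-minimal forms given as legitimate in the quote, such as XXXXVIIII and MCCCCCCVI, while rejecting IL, IXX, XVIV, and VVVI.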
 
Mike Simmons
Master Rancher
Here's a pattern I put together:
 
Carey Brown
Saloon Keeper
Some of my test results, Mike:
*** EDITED*** I had some logic reversed



 
Piet Souris
Bartender

Mike Simmons wrote:

Piet Souris wrote:So a valid Roman is a String that starts with any number of M, followed by a legal String from above.



Well, right away that seems to leave out the familiar MCM numbers seen in movie credits throughout the twentieth century.


That series of 12 shows the legal ways to write 499. I have a list of all legal Romans for the values 0-999, so what I meant was that a legal Roman is any string starting with any number of M's, followed by one of those legal strings.
CM is legal, as it is 900, so MCM is legal, just like MMMCM.
 
Mike Simmons
Master Rancher
@Carey, thanks for results and clarification; I was having a hard time seeing what was wrong with some of the initial results.

@Piet, thanks for clarifying.
 
Mike Simmons
Master Rancher
Actually @Carey, I'm still confused.  "MMMDCCCLXXVI" shows as valid according to my code, but you seem to think my code shows it as invalid?


 
Mike Simmons
Master Rancher
Also, I still think the Romans were too restrictive in their rules.  MIM should have been a legal alternative to MCMXCIX.
 
Piet Souris
Bartender
The rules have been relaxed a little since corona: IC is now valid.
 