Different behavior of java.util.regex.Pattern and kotlin.text.Regex

 
Junilu Lacar
I was working on a problem on Codewars.com that required you to break a string into pairs of characters.

One way to solve this is by using a regular expression to split the string. I'm not clever enough to come up with the regex myself but found the incantation that works when used with java.util.regex.Pattern. It worked in Java so I thought I'd try it in Kotlin. Using Pattern, it worked the same, which is to be expected. However, when I tried it with kotlin.text.Regex, which I thought would behave similarly to Pattern, I noticed some differences in behavior. Here's the code that I was playing around with:

And here's my test:

These are the test failures:

byRegex gives extra pair for <ab>! ==> expected: <1> but was: <2>
Expected :1
Actual   :2

byRegex gives extra pair for <abcd>! ==> expected: <2> but was: <3>
Expected :2
Actual   :3

byRegex gives extra pair for <abcdef>! ==> expected: <3> but was: <4>
Expected :3
Actual   :4

Not sure why there's a difference. Maybe it has to do with the default RegexOptions but this is a little surprising (violation of POLA).

If you have any insights as to how to resolve the difference in behavior, I'd love to hear them. Thanks!
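
For readers following along without the original listing, here is a minimal Java sketch of the baseline behavior described above. It is not the poster's code. The pattern string is the one quoted later in the thread, and the class and method names (PairSplitDemo, pairs) are made up for illustration.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

class PairSplitDemo {
    // Pattern string quoted later in the thread: a zero-width match
    // after every two characters, anchored on the previous match by \G.
    static final Pattern BY_PAIRS = Pattern.compile("(?<=\\G.{2})");

    // Hypothetical helper: split with Pattern's default limit of 0.
    static String[] pairs(String s) {
        return BY_PAIRS.split(s);
    }

    public static void main(String[] args) {
        for (String s : new String[] {"a", "ab", "abc", "abcd", "abcdef"}) {
            String[] parts = pairs(s);
            System.out.println(s + " -> " + Arrays.toString(parts) + " (" + parts.length + ")");
        }
    }
}
```

With java.util.regex.Pattern, the even-length inputs give 1, 2 and 3 pairs, which are the "expected" values in the failure messages above.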
 
Junilu Lacar
This is driving me nuts.

I've reworked my tests to isolate failures:

The tests on lines 47-52 and 72-81 are the only ones failing. I have no idea why these would fail.
 
Stephan van Hulst
Just a thought, but could it be that your extension method conflicts with CharSequence.split()?
 
Junilu Lacar

Stephan van Hulst wrote:Just a thought, but could it be that your extension method conflicts with CharSequence.split()?


I've eliminated that possibility with this:

I found out that I didn't need those extension functions anyway since Kotlin already had them, but the above changes use only standard classes and functions/methods.
 
Junilu Lacar
This establishes the expected behavior based on java.util.regex.Pattern:

This test passes as expected. But this fails:

The fact that pattern.split(s) is not symmetric with s.split(pattern) for only even-length strings makes me think there's something really screwy going on here.
 
Stephan van Hulst
Do both regular expression engines treat \G the same way for the edge case where you don't have a previous match yet?
 
Junilu Lacar

Stephan van Hulst wrote:Do both regular expression engines treat \G the same way for the edge case where you don't have a previous match yet?


In the cases where s.length() is 3 and 4, both have a previous match, right? And in both cases, the number of groups should be 2. However, when s.length() == 4, the number of groups == 3. I'm not sure where that phantom group (the string is empty) is coming from. java.util.regex.Pattern certainly doesn't give it.

I'm going to try the symmetry test in jshell for a sanity check.
 
Stephan van Hulst
Also, have you already used your debugger to step into the tests? I can usually find the cause of these kinds of problems fairly quickly by setting a breakpoint in my test and then step into the standard API code (assuming the source is available).
 
Junilu Lacar
Here's the sanity check in jshell:

This shows that in Java, Pattern.split(String) is symmetrical with String.split(patternString) at least. This differs from the even-length symmetry test in Kotlin.
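
The jshell session itself did not survive here, so the following is only a sketch of what such a symmetry check could look like in plain Java. The class and method names are invented for illustration. The pattern string is the one quoted later in the thread.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

class SplitSymmetryCheck {
    static final String BY_PAIRS = "(?<=\\G.{2})";

    // True when Pattern.split(s) and String.split(regex) agree for this input.
    static boolean symmetric(String s) {
        String[] viaPattern = Pattern.compile(BY_PAIRS).split(s);
        String[] viaString = s.split(BY_PAIRS);
        return Arrays.equals(viaPattern, viaString);
    }

    public static void main(String[] args) {
        for (String s : new String[] {"a", "ab", "abc", "abcd", "abcde", "abcdef"}) {
            System.out.println(s + " symmetric: " + symmetric(s));
        }
    }
}
```

In Java both calls use the same default limit of 0, so odd and even lengths behave alike.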
 
Stephan van Hulst
The difference is likely caused by how the regular expression engine treats the "left over" empty string if your zero-width match falls exactly at the end of the input string. It appears that Pattern ignores empty strings after the last delimiter match, whereas Regex includes them.
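
To make the "left over" empty string visible, here is a small Java sketch. It uses only Pattern: a limit of 0 (the default) drops trailing empty strings, while a negative limit keeps them. The class and method names are illustrative, and the claim that Regex behaves like the negative-limit case is this thread's hypothesis rather than something the sketch proves.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

class TrailingEmptyDemo {
    static final Pattern BY_PAIRS = Pattern.compile("(?<=\\G.{2})");

    // Default limit 0: trailing empty strings are discarded.
    static List<String> dropTrailing(String s) {
        return Arrays.asList(BY_PAIRS.split(s));
    }

    // Negative limit: trailing empty strings are kept.
    static List<String> keepTrailing(String s) {
        return Arrays.asList(BY_PAIRS.split(s, -1));
    }

    public static void main(String[] args) {
        System.out.println(dropTrailing("abcd")); // [ab, cd]
        System.out.println(keepTrailing("abcd")); // [ab, cd, ]
        System.out.println(keepTrailing("abc"));  // [ab, c]
    }
}
```

Only even-length inputs end exactly on a zero-width match, which fits the observation that only those tests fail.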

The same could be happening with CharSequence.split(), and your confusion is caused by weird extension method resolution. The way to find out is to step into the code and find out where your method calls end up.
 
Stephan van Hulst

Junilu Lacar wrote:This shows that in Java, Pattern.split(String) is symmetrical with String.split(patternString) at least. This differs from the even-length symmetry test in Kotlin.


But Kotlin has 4 additional extension methods that may interfere.
 
Junilu Lacar
So you don't find this behavior surprising, as in it's violating the Principle of Least Astonishment? I find it quite disconcerting -- I think my expectations of consistent behavior between Java and Kotlin are quite reasonable, especially since Pattern.split(s) works exactly the same way in Java and Kotlin.
 
Stephan van Hulst
The documentation for Regex.split() doesn't explicitly mention ignoring trailing empty matches when the limit is 0, which Pattern does. I'm pretty sure this explains the difference between Pattern and Regex.

The broken symmetry could be caused by the CharSequence.split(Pattern) extension method internally converting the Pattern to Regex and then calling Regex.split(this).

I don't know WHY they would do this, but it's a possible explanation at least.

Like I said, you must first find out what method Kotlin calls when you do s.split(byPattern), and then inspect its source code.
 
Stephan van Hulst

Junilu Lacar wrote:So you don't find this behavior surprising, as in it's violating the Principle of Least Astonishment?


I do find it surprising. I just think there's probably a logical explanation.

Why would Kotlin introduce a Regex class if Pattern did everything they wanted? Maybe the behavior of a 0 limit for Pattern tripped up so many people that they decided to do it differently.

I have an explanation for the behavior. That doesn't mean that I would do it the same way had I been at the helm of the Kotlin design team.
 
Junilu Lacar
It's almost 3am now and I'm more inclined to hit my pillow than F7-Step Into in the debugger right now.

I did find that in Kotlin, s.split(Pattern) will return a List<String> whereas Pattern.split(s) has a type of Array<(out) String!>! (shown by IntelliJ IDEA's inferred type hint). I don't know if that really plays into all this but I suspect you're closer to the truth about Kotlin's implementation than whatever theory I could come up with.  If I filterNot { it.isNullOrEmpty() } then the test passes but that's not really an acceptable workaround to me. Gonna go to bed now -- we'll see if it still bugs me enough later to spend more time looking into this. At the very least, I might post an issue with the JetBrains folks.
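
The post doesn't say why filtering is unacceptable. One possible reason (my reading, not the poster's) is that a blanket filter also removes empty strings that are legitimate results elsewhere in the list. A rough Java sketch with invented names:

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

class FilterWorkaroundDemo {
    // Keeps every piece (limit -1), then drops ALL empty strings,
    // not just the trailing one.
    static List<String> splitAndFilter(String s, Pattern p) {
        return Arrays.stream(p.split(s, -1))
                .filter(part -> !part.isEmpty())
                .collect(Collectors.toList());
    }

    // Pattern's default: only trailing empty strings are dropped.
    static List<String> splitDefault(String s, Pattern p) {
        return Arrays.asList(p.split(s));
    }

    public static void main(String[] args) {
        Pattern comma = Pattern.compile(",");
        System.out.println(splitDefault("a,,b,", comma));   // [a, , b]
        System.out.println(splitAndFilter("a,,b,", comma)); // [a, b]
    }
}
```

For the pair-splitting pattern the two happen to agree, but for delimiters that can produce interior empty fields they do not.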

Thanks for your time in looking into this!
 
Stephan van Hulst
Here you go. It's close to expected:

It forces the limit to -1, which means it will explicitly include an empty string if the input ends on a delimiter.

If this doesn't work for you, then you could try to provide your own extension method that simply returns regex.split(this, limit).asList().
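
The standard-library source referred to above is missing from this copy, so here is a Java emulation of the behavior as this thread describes it: a limit of 0 is forwarded to Pattern as -1, and negative limits are rejected. The class and method names are hypothetical, and this is not the Kotlin source.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

class KotlinStyleSplit {
    // Emulates the described CharSequence.split(Pattern, limit) behavior.
    static List<String> split(CharSequence input, Pattern regex, int limit) {
        if (limit < 0) {
            throw new IllegalArgumentException("Limit must be non-negative, but was " + limit);
        }
        // 0 means "no limit" here, so trailing empty strings are kept.
        return Arrays.asList(regex.split(input, limit == 0 ? -1 : limit));
    }

    // The suggested alternative: pass the limit straight through to Pattern.
    static List<String> passThrough(CharSequence input, Pattern regex, int limit) {
        return Arrays.asList(regex.split(input, limit));
    }

    public static void main(String[] args) {
        Pattern byPairs = Pattern.compile("(?<=\\G.{2})");
        System.out.println(split("ab", byPairs, 0));       // [ab, ]
        System.out.println(passThrough("ab", byPairs, 0)); // [ab]
    }
}
```

For positive limits the two variants return the same list. They differ only when the limit is 0.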
 
Stephan van Hulst

Output without extension method:

Extension method:
Output with extension method:
 
Stephan van Hulst
My final analysis:

It's very likely that the Kotlin designers thought that the Pattern.split(CharSequence, int) method was confusing because of the special meaning of the limit arguments -1 and 0. They were right. I have to look up the meaning every time I use the split() method.

To solve this in their new Regex class, they got rid of the special -1 argument, and made 0 mean "no limit". This was their first mistake. They should have provided an overloaded version that only accepts positive arguments for the limit argument and an overloaded version that has no limit parameter at all.
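
As a sketch of what that pair of overloads might look like, here is a hypothetical Java API. The class name, method names and parameter name (ExplicitSplit, split, maxParts) are all invented, and the choice to keep trailing empty strings in the no-limit version is an assumption.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

class ExplicitSplit {
    // No limit parameter at all: split everywhere, keep trailing empty strings.
    static List<String> split(Pattern regex, CharSequence input) {
        return Arrays.asList(regex.split(input, -1));
    }

    // Limit must be positive, so neither 0 nor -1 carries a special meaning.
    static List<String> split(Pattern regex, CharSequence input, int maxParts) {
        if (maxParts <= 0) {
            throw new IllegalArgumentException("maxParts must be positive, but was " + maxParts);
        }
        return Arrays.asList(regex.split(input, maxParts));
    }

    public static void main(String[] args) {
        Pattern underscore = Pattern.compile("_");
        System.out.println(split(underscore, "_a_b_"));    // [, a, b, ]
        System.out.println(split(underscore, "_a_b_", 2)); // [, a_b_]
    }
}
```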

For convenience they added a CharSequence.split() extension method that allows the user to invert the input and regex arguments. Then they added another convenience method that takes a Pattern. This was the second mistake. I agree with their choice to make these two extension methods yield the same result, this is in line with POLA. The mistake was that they provided the second extension method in the first place. They should have committed to their new view of the world by only providing a convenience method for Regex. If a client wants the Pattern behavior they should explicitly call the Pattern.split() method.
 
Junilu Lacar
Wow, that's pretty hardcore, Stephan. I confess, I don't know enough about regex (a largely conscious choice) to know the subtleties of that particular incantation I used, especially the ramifications of using one value vs another for the limit parameter. I was relying on POLA to provide sensible defaults and consistency but got a rude awakening instead. Kind of a stark reminder of Jamie Zawinski's sentiments about regex:

Jamie Zawinski wrote:Some people, when confronted with a problem, think “I know, I'll use regular expressions.” Now they have two problems.



Have another cow on me for your effort.
 
Junilu Lacar
While it would be nice for Pattern and Regex to be fully consistent, I'm not sure this is an issue to which a generally applicable fix can be made. It would require more effort to prove but I suspect that a "fix" for this particular case might end up starting a whack-a-mole effect of breaking other cases that inspired the design fork in the first place.
 
Stephan van Hulst
Thanks Junilu.

In all fairness, this is not a problem with regular expressions. The same would have happened with a split() method that doesn't take a regular expression but rather a literal delimiter. You just happened upon an edge-case. What do you think should be the result of calling "_a_b_".split("_") ?
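
For reference, this is what Java does with that input when the delimiter is treated as a literal. It is a minimal sketch with invented names; Pattern.LITERAL is used to underline that no regex features are involved.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

class LiteralDelimiterDemo {
    // Pattern.LITERAL makes the delimiter a plain string, not a regex.
    static final Pattern UNDERSCORE = Pattern.compile("_", Pattern.LITERAL);

    static String[] defaultLimit(String s) {
        return UNDERSCORE.split(s);      // limit 0
    }

    static String[] negativeLimit(String s) {
        return UNDERSCORE.split(s, -1);  // keep trailing empty strings
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(defaultLimit("_a_b_")));  // [, a, b]
        System.out.println(Arrays.toString(negativeLimit("_a_b_"))); // [, a, b, ]
    }
}
```

The leading empty string survives in both cases. Only the trailing one depends on the limit.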
 
Junilu Lacar
Here's a function that exercises different incantations of the same intent:

Results:

>>> testRegexPattern("_a_b_", "_")
[, a, b, ]
[, a, b, ]
[, a, b, ]
[, a, b]
[, a, b, ]

>>> val byPairs = "(?<=\\G.{2})"
>>> testRegexPattern("a", byPairs)
[a]
[a]
[a]
[a]
[a]

>>> testRegexPattern("ab", byPairs)
[ab]
[ab, ]
[ab, ]
[ab]
[ab, ]

>>> testRegexPattern("abc", byPairs)
[abc]
[ab, c]
[ab, c]
[ab, c]
[ab, c]

>>> testRegexPattern("abcd", byPairs)
[abcd]
[ab, cd, ]
[ab, cd, ]
[ab, cd]
[ab, cd, ]

Seems like java.util.regex.Pattern will be consistently inconsistent with the other forms in many of these kinds of cases.
 
Junilu Lacar
The rub in this is that the Kotlin documentation for split presents the versions that take Regex and Pattern together, implying that they behave the same way when in fact, they don't.

https://kotlinlang.org/api/latest/jvm/stdlib/kotlin.text/split.html

Here's another twist to this story: Regex has a .toPattern() method that converts a Regex to "an instance of Pattern with the same pattern string and options as this instance of Regex has." This again would lead most people to reasonably expect the same behavior from both instances. However, the inconsistency I have shown is there. The only way to get consistency in behavior is if you explicitly specify a negative limit for Pattern.split() which kind of contradicts the whole "limit: Int = 0" default specification in the signature.
 
Junilu Lacar
I posted a question with summary of all this to Reddit: https://www.reddit.com/r/Kotlin/comments/gls1ko/stringsplitpattern_is_not_symmetrical_with/
 
Junilu Lacar
On the bright side, I got to practice with quite a few JUnit 5 test features in Kotlin.
 
Stephan van Hulst

Junilu Lacar wrote:The rub in this is that the Kotlin documentation for split presents the versions that take Regex and Pattern together, implying that they behave the same way when in fact, they don't.


I think you're confused. s.split(regex) and s.split(pattern) have exactly the same output. s.split(pattern) is inconsistent with pattern.split(s).

Junilu Lacar wrote:The only way to get consistency in behavior is if you explicitly specify a negative limit for Pattern.split() which kind of contradicts the whole "limit: Int = 0" default specification in the signature.


I don't know what to tell you. The makers of Pattern thought it would be a good idea for pattern.split(s, 0) to trim trailing empty input. It's not better or worse than what regex.split(s, 0) does. Interestingly, JavaScript does this a lot better by returning an empty array when the limit is 0.

Why include a Regex class at all? Presumably because they needed a common type in case the programmer doesn't know whether they'll be targeting Java or JavaScript. The mistake was that they introduced a limit parameter that has a different meaning from BOTH Java and JavaScript's variants. We now have three different meanings of limit. This is truly a moronic decision.
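
To put the three meanings side by side, here is a rough Java emulation. Only javaStyle is the real java.util.regex.Pattern behavior. kotlinStyle follows this thread's description, and jsStyle is a simplified approximation of JavaScript's split(separator, limit) for a plain delimiter and a non-negative limit. All names are invented.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

class LimitSemantics {
    static final Pattern UNDERSCORE = Pattern.compile("_");

    // Java: 0 = no limit, trailing empty strings dropped.
    static List<String> javaStyle(String s, int limit) {
        return Arrays.asList(UNDERSCORE.split(s, limit));
    }

    // Kotlin, as described in this thread: 0 = no limit, trailing empties kept.
    static List<String> kotlinStyle(String s, int limit) {
        if (limit < 0) throw new IllegalArgumentException("limit must be non-negative");
        return Arrays.asList(UNDERSCORE.split(s, limit == 0 ? -1 : limit));
    }

    // JavaScript (approximation): return at most `limit` pieces, so 0 gives none.
    static List<String> jsStyle(String s, int limit) {
        if (limit < 0) throw new IllegalArgumentException("limit must be non-negative");
        List<String> all = Arrays.asList(UNDERSCORE.split(s, -1));
        return all.subList(0, Math.min(limit, all.size()));
    }

    public static void main(String[] args) {
        System.out.println(javaStyle("a_b_", 0));   // [a, b]
        System.out.println(kotlinStyle("a_b_", 0)); // [a, b, ]
        System.out.println(jsStyle("a_b_", 0));     // []
    }
}
```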

The dumbest part is that the Kotlin designers decided to include a convenience method CharSequence.split(Pattern, Int). Just commit to using Regex. Don't promote classes if you don't agree with their design.

Honestly, Kotlin feels to me like they wanted to make a better version of Java, but didn't have the balls to really do it. It's like how in Java there is so much junk left over from C++.
 
Junilu Lacar

Stephan van Hulst wrote:
I think you're confused. s.split(regex) and s.split(pattern) have exactly the same output. s.split(pattern) is inconsistent with pattern.split(s).


No, I'm not confused. I just left out a step in my reasoning, namely,

s.split(regex) is the same as s.split(pattern) // true
regex.split(s) is the same as s.split(regex) // true
pattern.split(s) is the same as s.split(pattern) // false - surprise!

In the Kotlin documentation, the links to Pattern take you to the JavaDocs, not Kotlin docs. So I think one could reasonably infer that the description of limit for Pattern in the JavaDocs also applies to the limit for String.split() in Kotlin. The gotcha is that it doesn't.

In Kotlin, limit for split() in Regex, String, and CharSequence does not allow negative values. Pattern is the odd man out that does allow negative values for limit, hence the inconsistency but only for certain cases. In other cases, it behaves the same way as Regex.split(), String.split(), and CharSequence.split() which is why this issue is kind of insidious.
 
Junilu Lacar

Stephan van Hulst wrote:I don't know what to tell you. The makers of Pattern thought it would be a good idea for pattern.split(s, 0) to trim trailing empty input. It's not better or worse than what regex.split(s, 0) does. Interestingly, JavaScript does this a lot better by returning an empty array when the limit is 0.


There's nothing to tell, really. I get where the discrepancy lies now: it's that in the Kotlin classes, limit can't be negative and 0 doesn't disregard trailing empties. Pattern, on the other hand, coming from JavaLand, does allow for negative limits and trims trailing empties when limit is zero. If that's how it has to be then fine. I just think it should be more plainly called out in the documentation as a gotcha rather than have it as some esoteric nuance that you have to have seen before in order to avoid falling into a trap. For the most part, Java<->Kotlin interoperability seems pretty sensible and straightforward but this is one case where I think it isn't.

And for what it's worth, I agree that JavaScript does a more sensible and less surprising thing for a 0 limit.
 
Junilu Lacar
Update: someone from the Kotlin team replied to my question on Reddit and said a ticket has been created to improve the documentation for CharSequence.split() to "emphasize the difference." At least there's that.
 