[Logo]
Forums Register Login
java regular expression giving stack overflow ?
I'm having a very difficult time trying to get this regular expression in java to work. I just don't understand why it is not working. It appears to work in a perl script.The idea is to check if this target is landing in html and is breaking out of a double quote in an html tag.

example:
injecting this in a http request: "xss="xss()"

would be true
would be false single ticks started tag

Regex: used to match
<\w('(\\'|[^'])*'|"(\\"|[^"])*"|[^'">])*xss="xss\(\)"

I know it must be using recursion and just keeps looking.but regex appears to be good?

please help!

Error:

Exception in
thread "Thread-35" java.lang.StackOverflowError
at java.lang.Character.codePointAt(Unknown Source)
at java.util.regex.Pattern$CharProperty.match(Unknown Source)
at java.util.regex.Pattern$Branch.match(Unknown Source)
at java.util.regex.Pattern$GroupHead.match(Unknown Source)
at java.util.regex.Pattern$Loop.match(Unknown Source)
at java.util.regex.Pattern$GroupTail.match(Unknown Source)
at java.util.regex.Pattern$BranchConn.match(Unknown Source)
at java.util.regex.Pattern$CharProperty.match(Unknown Source)
at java.util.regex.Pattern$Branch.match(Unknown Source)
at java.util.regex.Pattern$GroupHead.match(Unknown Source)
at java.util.regex.Pattern$Loop.match(Unknown Source)
at java.util.regex.Pattern$GroupTail.match(Unknown Source)
at java.util.regex.Pattern$BranchConn.match(Unknown Source)
at java.util.regex.Pattern$CharProperty.match(Unknown Source)
Steve, next time you get a StackOverflowError, can you please shorten it? After a while it is looping, and there is no reason to show all of that. I cut off part of your stack trace; the relevant part is still there.
yeah, sorry about that.

For something like this, it may be a good idea to provide a sample program that demostrates the error... The way you describe this, we have no idea how you are using the regex.

Henry
Steve,

I agree with Henry: you should provide sample code. That said, just from looking at your regex, are you increasing the stack for the thread? Your regex seems to have a lot of captures and options, which would easily become very complex, and blow past the stack.

John.
Ok i have written a class to test this example i'm running into.It errors out just as i have mentioned above. Any help would be greatly appreciated!
I have tried attaching the text response to this post with no luck. So, instead this link's source is what im using as my content to match.

http://www.overstock.com/search?keywords='xss='xss()'&taxonomy=sto7&SearchType=Homepage&TID=D:Apparel

just as a side note the {0,1000} part of regex if reduced to {0,100} fixes the problem. Using * instead breaks it as well. So it seems to keep looking for match and then errors out. There should be no matches in this test. Is my regex bad? i want to blame java on this one.
This doesn't answer your questions.... but I am a bit confused with this part of the regex...



Doesn't this say match zero to 1000 characters of something that is -- either something that is not a single quote or is a single quote? Which, of course, is any character? And can be simplified as follows...



and since groups are not used, it can be further simplified as follows...



Henry
\\' part of the regex is looking for this \'. Because if your inside 'stuf' you need a single tick in order to break out of that string. So, i want to make sure i don't see this 'stuff\'xss because my single tick was escaped so i did not break out of the string.

Does that make sense?
 
Does that make sense?


Yes, but that is not what your regex does... The backslash has special meaning for both the regex and the java string literal. If the regex was loaded from a file (or something else that is not a string literal) then it would be fine. But since, it is a string literal, you need to escape it for both the regex and Java.

So, if you want a \' as a regex literal, in the way you are using it, it needs to be escaped twice... \\\\' .... four backslashes.

Henry

thanks i'll try that. I was aware of this issue with java but when i was testing the matches that preivious regex was working? I'll see if that fixes things.

thanks,

Steve
tried this:

String s = input.replaceAll("'([^']|\\\\'){0,1000}'xss='xss\\(\\)'", "pp");
still overflows.
This get really complex with double quotes -- as the backslash and the quote has special meaning to the string literal, but only the backslash has special meaning to the regex...

So, if you want a \" as a regex literal.... first, you must escape the backslash for the regex ... \\" ... then you must escape both the backslash and double quote for the string literal.... \\\\\" ... five backslashes.

Henry
 
steve labar wrote:
String s = input.replaceAll("'([^']|\\\\'){0,1000}'xss='xss\\(\\)'", "pp");


Side note. Since the left side of the OR is looking for something that is not single quote, it will match the backslash of a backslash single quote pair. This is probably not what you want. You may want to flip the two operands of the OR so that it will check the backslash single quote pair first.

Henry
 
steve labar wrote:
String s = input.replaceAll("'([^']|\\\\'){0,1000}'xss='xss\\(\\)'", "pp");
still overflows.


Yea, it overflows because it's complex. For every character, it has to take a branch, and recursively handle the next character, because the regex may fail later, and it has to backtrack it.

Now, in your case, there is never a need to backtrack, since it is a linear match to the closing quote. It may be a good idea to tell the regex to *not* prep for a possible backtrack, because it won't. This is done with the possessive operator, like so...



I also flipped the two operands to the OR (alternation).

Henry
You my friend are magical! This does in fact work. I'm going to have to do some more research as to why the + made all the difference. I was thinking when i was trying to fix it about the greedy vs. lazy. I thought it was just being greedy. I thought if i made it lazy it would just find the minimum and stop. But that was not helping.

Question: why did you alternate the or? is it more efficient this way

Thanks a lot you have saved me much frustration.

 
Question: why did you alternate the or? is it more efficient this way


No. It just won't work the other way -- see my post (about 2 back) for the explanation.

I was thinking when i was trying to fix it about the greedy vs. lazy. I thought it was just being greedy. I thought if i made it lazy it would just find the minimum and stop.


Based on the regex, it definitely wasn't a greedy vs reluctant issue. If you had increased the stack size, so that the regex had more memory, either greedy or reluctant would have worked.

I'm going to have to do some more research as to why the + made all the difference.


Yea. I highly recommend this. The possessive operator can be very efficient -- use less memory -- as you have seen. But in many cases, you actually want the regex to use backtracking. This example happened to work because you didn't need it -- but you need to learn how to recognize this.

Henry
Wink, wink, nudge, nudge, say no more ... https://richsoil.com/cards



All times above are in ranch (not your local) time.
The current ranch time is
Jul 27, 2017 20:52:43.