Sunday, September 3, 2006

Java Regex Broken for Alternation over Large Strings

This is a serious bummer.

I've been working through some of Rails's test cases this evening. Since I've got Depot running reasonably well now and I know I can easily wire together a simple Rails app that calls Hibernate or EJBs, I figured I'd try to get a better handle on the true state of Rails support.

As I'm running test cases--the majority of which appear to be passing, btw--I came across a couple goofy regex issues. The first is one we'd seen before, where dot refuses to match \r. In this case, adding DOT_ALL to the Pattern's flags seems to resolve it. I think we just hadn't noticed that flag before, so many other similar bugs might be resolved the same way. Not an easy issue to narrow down, but I found it and moved on.

So then I started seeing SystemStackErrors popping up within the CGI library. When I see stack overflows in JRuby, I usually figure it's one of two things:
  1. Something's wrong with JRuby that's breaking a termination condition in Ruby code, so a recursion is never ending
  2. JRuby itself is missing a termination and recursing endlessly
Pretty simple issues to track down...find where the recursion is happening, and figure out why it's not terminating. Except in this case, the recursion wasn't in either Ruby or JRuby code. It was in Java's regex library.

Blowing The Stack

O Java, how I love Thee. Every library I could want, all in one place. You care for me so well. It pains me to find your faults.

So it appears that the Java regex library's compiled patterns execute recursively. This alone wouldn't be particularly unusual, especially if the recursion was limited by the complexity of the regex itself. However in this case it seems the recursion is a factor of the string to match, rather than of the expression complexity. With a large enough input string, I can get Java's regex to blow the stack every time.
regex = /\A((?:.|\n)*?)?([\r\n]{1,2}|--)/n
long_string = ""
76.times { long_string += "xxxxxxxxxxxxxxxxxx" }

long_string.sub(r) {""}

This bit of Ruby code will cause java.util.regex.Pattern to blow the stack during matching. I'm sure this regex could be reduced to a simpler case, but I don't really need to go through the exercise of doing so; others have reported the same issue:
So the issue stands, unfortunately. I'm not sure I'd characterize it as an RFE...I'm no regex expert, but I would hope my regex engine wouldn't overflow based on large input, especially if that input was completely bogus as in my example above. I don't think that's unreasonable, especially considering that Ruby's regex engine appears to handle this case (and considerably larger cases) just fine. Correct me if I'm expecting too much.

So what's the damage? Well, since we currently use java.util.regex exclusively, and since Ruby libraries and apps very frequently use regex to match against very large strings, we'll have to migrate to a different regex library. Until that happens, however, Rails under JRuby will not support large multipart posts, at a minimum. There may be other places this has an effect, but multipart posting was the area I investigated.

So for now...I guess it's just busted. We'll begin evaluating alternative regex libraries posthaste. Any help from you folks would be appreciated...just take that regex or one from the bugs and try to match against extremely large input.

Rewrite the Regex?

Some may say that the regex should be rewritten to avoid the use of alternation. Perhaps so, perhaps there's a better way. However Ruby handles it fine, so requesting that the Rails folks or any other Ruby devs start tailoring their regex so they will work under JRuby is quite out of the question. If we can't run the same regex against the same input, we need a different regex library...that's all there is to it.

Besides, the bugs posted have additional regex that are even simpler. It just seems to be a limitation of the Java regex library that certain regex will blow the stack on large input.

For the record, here's the relevant bit of the stack. I get pages and pages of this, under both Java 5 and Java 6 beta 2...both 64-bit under Linux:
...
at java.util.regex.Pattern$Branch.match(Pattern.java:3998)
at java.util.regex.Pattern$GroupHead.match(Pattern.java:4052)
at java.util.regex.Pattern$LazyLoop.match(Pattern.java:4241)
at java.util.regex.Pattern$GroupTail.match(Pattern.java:4111)
at java.util.regex.Pattern$BranchConn.match(Pattern.java:3962)
at java.util.regex.Pattern$CharProperty.match(Pattern.java:3314)
at java.util.regex.Pattern$Branch.match(Pattern.java:3998)
at java.util.regex.Pattern$GroupHead.match(Pattern.java:4052)
at java.util.regex.Pattern$LazyLoop.match(Pattern.java:4241)
at java.util.regex.Pattern$GroupTail.match(Pattern.java:4111)
at java.util.regex.Pattern$BranchConn.match(Pattern.java:3962)
at java.util.regex.Pattern$CharProperty.match(Pattern.java:3314)
at java.util.regex.Pattern$Branch.match(Pattern.java:3998)
at java.util.regex.Pattern$GroupHead.match(Pattern.java:4052)
...

What a downer.

Update (2011-3-1): Someone stumbled across this post recently and I realized I'd never updated it with a solution for this problem. I checked out several alternative Java regex engines. Some had the same problem, being implemented the same way, but JRegex both resolved the issue and performed very well. I'd recommend it to anyone that needs to work around the problems in Java's built-in regex library.

4 comments:

  1. Johannes: Yeah, I didn't really expect that it's "compiling" the regex in a classic sense, especially since the blown stack trace shows all Pattern methods in the stack. It would have been nice if it compiled it down to something more efficient, as you say, but that's the way things are.

    My hope in trying other engines is that they'll be more efficient with stack and memory or based upon the same algorithms that Ruby's gnu regexp library uses. It will take a bit of research, but we don't really have another choice; we can't ask Ruby developers to custom-tailor their regex to suit JRuby or they'll just avoid using JRuby.

    ReplyDelete
  2. Hi Charles, I'm sure you would have discovered this anyway, but the Jakarta Project's ORO library is very good and does Perl 5 compatible regex'es (not sure exactly what Ruby requires in that regard).

    We've been using it for some years now and not run into any showstoppers.

    ReplyDelete
  3. I've tested Jakarta ORO, it doesnt seem to have the StackOverflow issue (while Jakarta Regexp does)

    ReplyDelete
  4. Gregory: Java's engine recurses, which limits the size of data it can handle. Ruby's isn't great, but it doesn't blow out the call stack for large input.

    ReplyDelete