Oniguruma is a regular expression library. Its distinguishing characteristic is that a different character encoding can be specified for every regular expression object. The benefit for us is avoiding the encode/decode we previously had to do for every regular expression match, since Ruby uses byte[]-based strings and all Java regular expression engines work with char[]. You can imagine the overhead all that array churn introduced.
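To make the encode/decode overhead concrete, here is a minimal sketch of what matching byte[] data with a char-based engine tends to involve. It uses only java.util.regex from the JDK; the class name DecodeOverheadSketch and the helper findByteOffset are made up for illustration and are not JRuby's actual code.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DecodeOverheadSketch {
    // Hypothetical helper (not from JRuby): find a match in UTF-8 bytes using
    // a char-based regex engine, returning the match start as a byte offset.
    static int findByteOffset(byte[] utf8, Pattern pattern) {
        // Decode: copies the entire byte[] into a new char-based String.
        String decoded = new String(utf8, StandardCharsets.UTF_8);
        Matcher m = pattern.matcher(decoded);
        if (!m.find()) {
            return -1;
        }
        // Encode: the engine reports char indexes, so the prefix is re-encoded
        // to translate the match position back into a byte offset.
        return decoded.substring(0, m.start())
                      .getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        // Two of the characters below take two bytes each in UTF-8.
        byte[] data = "h\u00e9llo w\u00f6rld 42".getBytes(StandardCharsets.UTF_8);
        int offset = findByteOffset(data, Pattern.compile("\\d+"));
        // The match is at char index 12 but byte offset 14.
        System.out.println("match starts at byte offset " + offset);
    }
}
```

Every call allocates at least one full copy of the input and a second copy of the prefix. An engine that works on byte[] directly can skip both steps.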
After running through a series of basic optimizations, most of the key expressions we worried about were performing as well as or much better than JRegex, so Ola went through with the conversion over the past couple days. Marcin is continuing to work on various optimizations, but both Ola and I have been playing with the new code. And it's looking great.
You may remember I reported recently on how the regexp bottleneck impacted XML parsing with REXML. Here are the numbers run against JRuby immediately before merging Joni:
read content from stream, no DOM
3.362000 0.000000 3.362000 ( 3.362000)
1.232000 0.000000 1.232000 ( 1.232000)
0.887000 0.000000 0.887000 ( 0.887000)
1.009000 0.000000 1.009000 ( 1.010000)
0.801000 0.000000 0.801000 ( 0.801000)
read content once, no DOM
9.869000 0.000000 9.869000 ( 9.869000)
9.779000 0.000000 9.779000 ( 9.779000)
9.786000 0.000000 9.786000 ( 9.786000)
9.655000 0.000000 9.655000 ( 9.655000)
9.601000 0.000000 9.601000 ( 9.601000)
read content from stream, build DOM
1.368000 0.000000 1.368000 ( 1.368000)
1.297000 0.000000 1.297000 ( 1.297000)
1.192000 0.000000 1.192000 ( 1.192000)
1.131000 0.000000 1.131000 ( 1.131000)
0.812000 0.000000 0.812000 ( 0.812000)
read content once, build DOM
10.595000 0.000000 10.595000 ( 10.595000)
9.489000 0.000000 9.489000 ( 9.488000)
9.947000 0.000000 9.947000 ( 9.947000)
9.821000 0.000000 9.821000 ( 9.821000)
9.414000 0.000000 9.414000 ( 9.415000)

And here are the performance numbers today, with Joni:

read content from stream, no DOM
2.309000 0.000000 2.309000 ( 2.308000)
1.217000 0.000000 1.217000 ( 1.217000)
0.776000 0.000000 0.776000 ( 0.776000)
0.825000 0.000000 0.825000 ( 0.825000)
0.637000 0.000000 0.637000 ( 0.637000)
read content once, no DOM
0.370000 0.000000 0.370000 ( 0.369000)
0.415000 0.000000 0.415000 ( 0.415000)
0.288000 0.000000 0.288000 ( 0.288000)
0.260000 0.000000 0.260000 ( 0.260000)
0.254000 0.000000 0.254000 ( 0.254000)
read content from stream, build DOM
1.455000 0.000000 1.455000 ( 1.455000)
0.916000 0.000000 0.916000 ( 0.916000)
0.887000 0.000000 0.887000 ( 0.888000)
0.827000 0.000000 0.827000 ( 0.827000)
0.607000 0.000000 0.607000 ( 0.607000)
read content once, build DOM
0.630000 0.000000 0.630000 ( 0.630000)
0.664000 0.000000 0.664000 ( 0.664000)
0.680000 0.000000 0.680000 ( 0.680000)
0.553000 0.000000 0.553000 ( 0.553000)
0.650000 0.000000 0.650000 ( 0.650000)

Marcin's being modest about the work, but we're all absolutely amazed by it.
So finally the last really gigantic performance bottleneck in JRuby is gone, and it appears that JRuby's slow regexp era has come to a close. Next targets: the remaining issues with IO and Java integration performance.
Uh oh - I have to apologize to Charles and others, looks like I've commented on the wrong post...
For what it's worth, here's another speed report using the latest trunk.
I had to write some code recently to read/parse >1GB of an application's log file (basic regexp stuff), but since Ruby was entirely too slow I had to do it in Perl. Not cool.
I'm happy to report that with the latest trunk of JRuby my minimal test version processes a 1.3GB file in 3m12s versus 1m36s for the Perl equivalent. Not bad at all. MRI, on the other hand, takes 11m45s!
This is with JDK6, by the way.
Great work!