Headius: Another Performance Discovery: REXML

Wednesday, October 17, 2007

Another Performance Discovery: REXML

I've discovered a really awful bottleneck in REXML processing.

Look at these results for parsing our build.xml:

read content from stream, no DOM
2.592000   0.000000   2.592000 (  2.592000)
1.326000   0.000000   1.326000 (  1.326000)
0.853000   0.000000   0.853000 (  0.853000)
0.620000   0.000000   0.620000 (  0.620000)
0.471000   0.000000   0.471000 (  0.471000)
read content once, no DOM
5.323000   0.000000   5.323000 (  5.323000)
5.328000   0.000000   5.328000 (  5.328000)
5.209000   0.000000   5.209000 (  5.209000)
5.173000   0.000000   5.173000 (  5.173000)
5.138000   0.000000   5.138000 (  5.138000)

When reading from a stream, the content is read in in chunks, with each chunk being matched in turn. Because our current regexp engine uses char[] instead of byte[], each chunk must be decoded into UTF-16 characters, matched, and encoded back into UTF-8 bytes.

For small chunks, like those read off a stream, this decode/encode cycle is fairly quick. Here, the streamed numbers are pretty close to MRI. However, when an XML string is parsed from memory, the process goes like this:

set buffer to entire string
match against the buffer
set buffer to post match (remainder of the string)

Now this is obviously a little inefficient, since it creates a lot of extra strings, but a copy-on-write String implementation helps a lot. However in our case it also means that we decode/encode the entire remaining XML string for every element match. For any nontrivial file, this is *terrible* overhead.

So what's the fix? Here's the same second benchmark using a StringIO object passed to the parser instead, with a simple change to rexml/source.rb:

Diff:

Index: lib/ruby/1.8/rexml/source.rb
===================================================================
--- lib/ruby/1.8/rexml/source.rb        (revision 4596)
+++ lib/ruby/1.8/rexml/source.rb        (working copy)
@@ -1,4 +1,5 @@
require 'rexml/encoding'
+require 'stringio'

module REXML
       # Generates Source-s.  USE THIS CLASS.
@@ -8,7 +9,7 @@
               # @return a Source, or nil if a bad argument was given
               def SourceFactory::create_from arg#, slurp=true
      if arg.kind_of? String
-                         Source.new(arg)
+                         IOSource.new(StringIO.new(arg))
      elsif arg.respond_to? :read and
            arg.respond_to? :readline and
            arg.respond_to? :nil? and

New numbers:

read content once, no DOM
0.640000   0.000000   0.640000 (  0.640000)
0.693000   0.000000   0.693000 (  0.693000)
0.542000   0.000000   0.542000 (  0.542000)
0.349000   0.000000   0.349000 (  0.349000)
0.336000   0.000000   0.336000 (  0.336000)

This is a perfect indication why JRuby's Rails performance is currently nowhere near what it will be. We continue to find these little gems...and there's no telling how many more are out there. With recent execution performance numbers looking extremely solid and recent Rails performance getting closer and closer, the upcoming 1.1 release ought to be amazing.

5 comments:

AnonymousOctober 17, 2007 at 8:19 PM
Nice work with JRuby. You and your fellow committers are doing a great job.

What's the best guess on the arrival timeframe for the Oniguruma port?
ReplyDelete
Replies
Charles Oliver NutterOctober 17, 2007 at 9:07 PM
anonymous: Marcin has said it will be done for the final 1.1 release, but perhaps not before RubyConf in early November.
ReplyDelete
Replies
pateOctober 18, 2007 at 5:39 AM
Awesome work by you and the rest of the JRuby crew Charles. I think you're really showing the benefit of a cleaner code base, and a willingness to dig into performance issues. Keep it up!
ReplyDelete
Replies
Jim DevilleOctober 19, 2007 at 5:34 PM
Awesome job. I'm looking forward to playing with JRuby more after my current project is released.
ReplyDelete
Replies
AnonymousMay 29, 2008 at 10:05 AM
Was this change applied to the Jruby 1.1.1 code base. It does not seem to be in that code base.
ReplyDelete
Replies

Add comment