Monday, October 15, 2012

So You Want To Optimize Ruby

I was recently asked for a list of "hard problems" a Ruby implementation really needs to solve before reporting benchmark numbers. You know...the sort of problems that might invalidate early perf numbers because they impact how you optimize Ruby. This post is a rework of my response...I hope you find it informative!

Fixnum to Bignum promotion

In Ruby, Fixnum math can promote to Bignum when the result is out of Fixnum's range. On implementations that use tagged pointers to represent Fixnum (MRI, Rubinius, MacRuby), the Fixnum range is somewhat less than the base CPU bits (32/64). On JRuby, Fixnum is always a straight 64-bit signed value.

This promotion is a performance concern for a couple reasons:
  • Every math operation that returns a new Fixnum must be range-checked. This slows all Fixnum operations.
  • It is difficult (if not impossible) to predict whether a Fixnum math operation will return a Fixnum or a Bignum. Since Bignum is always represented as a full object (not a primitive or a tagged pointer) this impacts optimizing Fixnum math call sites.

Floating-point performance

A similar concern is the performance of floating point values. Most of the native implementations have tagged values for Fixnum but only one I know of (Macruby) uses tagged values for Float. This can skew expectations because an implementation may perform very well on integer math and considerably worse on floating-point math due to the objects created (and collected). JRuby uses objects for both Fixnum and Float, so performance is roughly equivalent (and slower than I'd like).

Closures

Any language that supports closures ("blocks" in Ruby) has to deal with efficiently accessing frame-local data from calls down-stack. In Java, both anonymous inner classes and the upcoming lambda feature treat frame-local values (local variables, basically) as immutable...so their values can simply be copied into the closure object or carried along in some other way. In Ruby, local variables are always mutable, so an eventual activation of a closure body needs to be able to write into its containing frame. If a runtime does not support arbitrary frame access (as is the case on the JVM) it may have to allocate a separate data structure to represent those frame locals...and that impacts performance.

Bindings and eval

The eval methods in Ruby can usually accept an optional binding under which to run. This means any call to binding must return a fully-functional execution environment, and in JRuby this means both eval and binding force a full deoptimization of the surrounding method body.

There's an even more unpleasant aspect to this, however: every block can be used as a binding too.

All blocks can be turned into Proc and used as bindings, which means every block in the system has to have full access to values in the containing call frame. Most implementers hate this feature, since it means that optimizing call frames in the presence of blocks is much more difficult. Because they can be used as a binding, that of course means literally all frame data must be accessible: local variables; frame-local $ variables like $~; constants lookup environment; method visibility; and so on.

callcc and Continuation

JRuby doesn't implement callcc since the JVM doesn't support continuations, but any implementation hoping to optimize Ruby will have to take a stance here. Continuations obviously make optimization more difficult since you can branch into and out of execution contexts in rather unusual ways.

Fiber implementation

In JRuby, each Fiber runs on its own thread (though we pool the native thread to reduce Fiber spin-up costs). Other than that they operate pretty much like closures.

A Ruby implementer needs to decide whether it will use C-style native stack juggling (which makes optimizations like frame elimination trickier to implement) or give Fibers their own stacks in which to execute independently.

Thread/frame/etc local $globals

Thread globals are easy, obviously. All(?) host systems already have some repesentation of thread-local values. The tricky ones are explicit frame globals like $~ and $_ and implicit frame-local values like visibility, etc.

In the case of $~ and $_, the challenge is not in representing accesses of them directly but in handling implicit reads and writes of them that cross call boundaries. For example, calling [] on a String and passing a Regexp will cause the caller's frame-local $~ (and related values) to be updated to the MatchData for the pattern match that happens inside []. There are a number of core Ruby methods like this that can reach back into the caller's frame and read or write these values. This obviously makes reducing or eliminating call frames very tricky.

In JRuby, we track all core methods that read or write these values, and if we see those methods called in a body of code (the names, mind you...this is a static inspection), we will stand up a call frame for that body. This is not ideal. We would like to move these values into a separate stack that's lazily allocated only when actually needed, since methods that cross frames like String#[] force other methods like Array#[] to deoptimize too.

C extension support

If a given Ruby implementation is likely to fit into the "native" side of Ruby implementations (as opposed to implementations like JRuby or IronRuby that target an existing managed runtime), it will need to have a C extension story.

Ruby's C extension API is easier to support than some languages' native APIs (e.g. no reference-counting as in Python) but it still very much impacts how a runtime optimizes. Because the API needs to return forever-valid object references, implementations that don't give out pointers will have to maintain a handle table. The API includes a number of macros that provide access to object internals; they'll need to be simulated or explicitly unsupported. And the API makes no guarantees about concurrency and provides few primitives for controlling concurrent execution, so most implementations will need to lock around native downcalls.

An alternative for a new Ruby implementation is to expect extensions to be written in the host runtime's native language (Java or other JVM languages for JRuby; C# or other .NET languages for IronRuby, etc). However this imposes a burden on folks implementing language extensions, since they'll have to support yet another language to cover all Ruby implementations.

Ultimately, though, the unfortunate fact for most "native" impls is that regardless of how fast you can run Ruby code, the choke point is often going to be the C API emulation, since it will require a lot of handle-juggling and indirection compared to MRI. So without supporting the C API, there's a very large part of the story missing...a part of the story that accesses frame locals, closure bodies, bindings, and so on.

Of course if you can run Ruby code as fast as C, maybe it won't matter. :) Users can just implement their extensions in Ruby. JRuby is starting to approach that kind of performance for non-numeric, non-closure cases, but that sort of perf is not yet widespread enough to bank on.

Ruby 1.9 encoding support

Any benchmark that touches anything relating to binary text data must have encoding support, or you're really fudging the numbers. Encoding touches damn near everything, and can add a significant amount of overhead to String-manipulating benchmarks.

Garbage collection and object allocation

It's easy for a new impl to show good performance on benchmarks that do no allocation (or little allocation) and require no GC, like raw numerics (fib, tak, etc). Macruby and Rubinius, for example, really shine here. But many impls have drastically different performance when an algorithm starts allocating objects. Very few applications are doing pure integer numeric algorithms, so object
allocation and GC performance are an absolutely critical part of the performance story.

Concurrency / Parallelism

If you intend to be an impl that supports parallel thread execution, you're going to have to deal with various issues before publishing numbers. For example, threads can #kill or #raise each other, which in
a truly parallel runtime requires periodic safepoints/pings to know whether a cross-thread event has fired. If you're not handling those safepoints, you're not telling the whole story, since they impact execution.

There's also the thread-safety of runtime structures to be considered. As an example, Rubinius until recently had a hard lock around a data structure responsible for invalidating call sites, which meant that its simple inline cache could see a severe performance degradation at polymorphic call sites (they've since added polymorphic caching to ameliorate this case). The thread-safety of a Ruby implementation's core runtime structures can drastically impact even straight-line, non-concurrent performance.

Of course, for an impl that doesn't support parallel execution (which would put it in the somewhat more limited realm of MRI), you can get away with GIL scheduling tricks. You just won't have a very good in-process scaling story.

Tracing/debugging

All current impls support tracing or debugging APIs, though some (like
JRuby) require you to enable support for them via command-line or compile-time flags. A Ruby implementation needs to have an answer for this, since the runtime-level hooks required will have an impact...and may require users to opt-in.

ObjectSpace

ObjectSpace#each_object needs to be addressed before talking about performance. In JRuby, supporting each_object over arbitrary types was a major performance issue, since we had to track all objects in a separate data structure in case they were needed. We ultimately decided each_object would only work with Class and Module, since those were the major practical use cases (and tracking Class/Module hierarchies is far easier than tracking all objects in the system).

Depending on how a Ruby implementation tracks in-memory objects (and depending on the level of accuracy expected from ObjectSpace#each_object) this can impact how allocation logic and GC are optimized.

Method invalidation

Several implementations can see severe global effects due to methods like Object#extend blowing all global caches (or at least several caches), so you need to be able to support #extend in a reasonable way before talking about performance. Singleton objects also have a similar effect, since they alter the character of method caches by introducing new anonymous types at any time (and sometimes, in rapid succession).

In JRuby, singleton and #extend effects are limited to the call sites that see them. I also have an experimental branch that's smarter about type identity, so simple anonymous types (that have only had modules included or extended into them) will not damage caches at all. Hopefully we'll land that in a future release.

Constant lookup and invalidation

I believe all implementations have implemented constant cache invalidation as a global invalidation, though there are other more complicated ways to do it. The main challenge is the fact that constant lookup is tied to both lexical scope and class hiearchy, so invalidating individual constant lookup sites is usually infeasible. Constant lookup is also rather tricky and must be implemented correctly before talking about the performance of any benchmark that references constants.

Rails

Finally, regardless of how awesome a new Ruby implementation claims to be, most users will simply ask "but does it run Rails?" You can substitute your favorite framework or library, if you like...the bottom line is that an awesome Ruby implementation that doesn't run any Ruby applications is basically useless. Beware of crowing about your victory over Ruby performance before you can run code people actually care about.

17 comments:

  1. Given these optimization problems, if you could change 3 language features, and yet retain the ruby flavor, what would they be?

    Secondly, are there any languages, offering duck-typing, closures and metaprogramming features, that can be heavily optimized? (Please don't say LISP, even though I spent 2 years in college in love with it!)

    ReplyDelete
  2. It doesn't even have to be Shen or Haskell-style static typing.

    ReplyDelete
  3. > Secondly, are there any languages, offering duck-typing, closures and metaprogramming features, that can be heavily optimized?

    Metaprogramming should always be free at run-time, but duck-typing and closures are necessarily expensive (and pure evil imo!)

    ReplyDelete
  4. Jonathan: My top features to remove from Ruby would be the behavior of frame-local values ($~, $_, method visibility, etc), the ability to use a block as a binding (prevents a number of optimizations) and Fixnum to Bignum promotion (slows math, prevents aggressively optimizing to primitive integer). Most other features -- including dynamic/duck typing, metaprogramming, and closures -- can be optimized very well if not for some of these others.

    ReplyDelete
  5. @Charles: I'd mostly agree on frame-local values and block bindings, but I think removing automatic Fixnum to Bignum promotion would hurt a lot (I'm thinking integer overflow is not funny in a high-level language. Even if it's crazy to think any number can be represented, I find it useful in many cases).

    ReplyDelete
  6. Fantastic post. Thanks for being so comprehensive!

    ReplyDelete
  7. Charles could you get rid of Fixnum to Bignum promotion, could you introduce another type such as Expandonum that would do the promotion when someone wants it?

    ReplyDelete
  8. Agh... didn't mean to post that last one about Expandonum anonymously.

    ReplyDelete
  9. @Benoit: 99% of the time I know if my code might overflow, and in that case, I'd take the perfomance hit and use Bignums from the start. Yes, some benchmarks and tests might break, and maybe a small amount of existing ruby code, but nothing major.

    @Mark: Expandonum, great name. Of course, it will still take a performance hit over straight integers.

    @Charles: I suppose you could add a flag to JRuby to not do integer promotions.
    As to the other changes, you could ask Matz to deprecate them in Ruby 2.x, and then to remove them in 3.0 :-)

    Sheesh, these captchas are getting hard, and I pity the poor visually impaired who try the audio captcha. That's even worse!

    ReplyDelete
  10. Sorry, ninkibah == Jonathan O'Connor

    ReplyDelete
  11. About the «C extension support» section, I think that if you need the speed of a C snippet, the best option is to make an standard shared library, and load the functions in it with FFI. As far as I know, most implementations support FFI without problems

    ReplyDelete
  12. The small performance gain is it not worth to give up fixnum to bignum promotion. This is a basic features of scripting languages and I think ruby should stay a scripting language ;) Especially in JRuby it's easy to improve the performance of static code: write it in Java (or another JVM static language). Javascript has the ability for fixnum to bignum conversion, too. Nevertheless, this language is very fast.
    The frame local variables are ugly. I would not miss them. But I think it is not a good idea to remove such heavily used features in ruby.

    In the most cases the performance of ruby is very good. Instead of change the whole ruby language, one could define a subset of ruby and activate it for special methods which need good performance. For example:
    fast_ruby do
    # here comes fast code without eval, framelocal variables, fixnum to integer promotion etc.
    end

    ReplyDelete
  13. Out of interest, would rails go close to running if you dropped support for your top removal features? I could see some interest in a ruby subset that could be highly optimised - jruby -fast! It would be good to at least have a mode which discourages the use of hard to optimise features.

    ReplyDelete
  14. What was so bad about Java and .NET that they came with Ruby?....

    ReplyDelete
  15. Hello Charles,

    Unrelated to this post but I have thoroughly enjoyed javaone JIT for dummies.

    Keep up the good work,
    Khalil

    ReplyDelete
  16. Would it be possible & realistic/practical to introduce runtime options to disable some languages features thus enabling better optimization?? I personally could very well live without frame-local values, block as a binding and Fixnum to Bignum promotion if it provided substantial performance improvement.

    ReplyDelete