Hello friends!
I figured it's about time for me to start writing a bit on how JRuby is actually using invokedynamic.
As of today, JRuby utilizes invokedynamic far more than any other mainstream JVM language. We have worked very closely with the JSR leads and the OpenJDK developers to make sure invokedynamic runs well. And we have been advocating invokedynamic as a game-changer for the JVM and for JVM languages.
Let's explore one area where JRuby is using invokedynamic: Ruby's "constant" lookup.
Non-constant "Constants"
A constant in Ruby is defined on a class or module, and is subject to Ruby's typical namespacing logic. Constants start with a capital letter.
I often put "constants" in quotes because constant values can be reassigned. Reassignment will usually produce a warning, but not an error. This means we can't simply look up constant values once and never look them up again (without special tricks I'll get into later).
Constant lookup is also a bit more complicated than method lookup. When retrieving a constant, Ruby first scans lexically-enclosing scopes' classes and modules for the constant. If the constant can't be found, the next search walks the current class's inheritance hierarchy. If we still can't find the constant, const_missing is called on the current class.
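To make the three-step search concrete, here's a rough Java model of it. Everything in it is hypothetical: ModuleModel, ConstantLookup, and constMissing are names I made up for illustration, not JRuby's real classes, and the model only covers the steps described above.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, simplified stand-in for a Ruby class or module.
final class ModuleModel {
    final String name;
    final ModuleModel superclass;   // null at the top of the hierarchy
    final Map<String, Object> constants = new HashMap<>();

    ModuleModel(String name, ModuleModel superclass) {
        this.name = name;
        this.superclass = superclass;
    }

    // Stand-in for Ruby's const_missing hook; here it simply fails.
    Object constMissing(String constName) {
        throw new IllegalStateException("uninitialized constant " + constName);
    }
}

final class ConstantLookup {
    // lexicalScopes is ordered innermost first; the first entry is the current class.
    static Object lookup(String constName, List<ModuleModel> lexicalScopes) {
        // 1. Scan the lexically enclosing classes and modules.
        for (ModuleModel scope : lexicalScopes) {
            if (scope.constants.containsKey(constName)) {
                return scope.constants.get(constName);
            }
        }
        // 2. Walk the current class's inheritance hierarchy.
        ModuleModel current = lexicalScopes.get(0);
        for (ModuleModel m = current.superclass; m != null; m = m.superclass) {
            if (m.constants.containsKey(constName)) {
                return m.constants.get(constName);
            }
        }
        // 3. Nothing found: hand off to const_missing on the current class.
        return current.constMissing(constName);
    }
}
```

Note that the result depends on both where the code is written (the lexical scopes) and what the class inherits from, which is what makes caching and invalidation awkward.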
In order to make constant lookup fast, we want to do some sort of caching. In classic JRuby, Ruby 1.9 (YARV), Rubinius, and probably most other modern Ruby implementations, this is done with a global serial number. Whenever a constant is updated or a module is included (changing the inheritance hierarchy), all cached constants everywhere are forced to look up again.
I have played with mechanisms for reducing the global impact of constant invalidation, but because constants can be looked up lexically it's simply too complicated to localize (since we need to invalidate classes down-hierarchy from the change and we also need to update all lexical scopes that might see the change).
Constant Invalidation in JRuby 1.6
The logic in JRuby 1.6 goes something like this:
- If cache is empty or invalid, retrieve the constant value in the usual way (lexical, hierarchical search). Store the value with the current global constant serial number.
- On subsequent lookups, check cache for validity against the global constant serial number. If we have a value cached and the cache is still valid, return it.
- If any constant in the system is updated, or if a module is included into an existing class hierarchy, flip the serial number and force future constant lookups to re-cache.
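The three steps above can be sketched in a few lines of Java. This is a minimal, single-threaded illustration, not JRuby 1.6's actual code: SerialConstantCache and its members are hypothetical names, and slowLookup stands in for the full lexical and hierarchical search.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

// Hypothetical sketch of one serial-number-guarded cache per constant access site.
final class SerialConstantCache {
    // One global serial number shared by every cache in the system.
    static final AtomicInteger GLOBAL_SERIAL = new AtomicInteger();

    private Object cachedValue;
    private int cachedSerial = -1;   // -1 means nothing has been cached yet

    // slowLookup is an assumption: it represents the usual full constant search.
    Object get(Supplier<Object> slowLookup) {
        int serial = GLOBAL_SERIAL.get();   // checked on every single access
        if (cachedSerial != serial) {
            // Cache is empty or stale: search again and remember the serial.
            cachedValue = slowLookup.get();
            cachedSerial = serial;
        }
        return cachedValue;
    }

    // Called when any constant is updated or a module is included.
    static void invalidateAll() {
        GLOBAL_SERIAL.incrementAndGet();
    }
}
```

The thing to notice is the read of GLOBAL_SERIAL at the top of get: it runs on every access, even when nothing has changed.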
This turns out to work fairly well. The same mechanism in Ruby 1.9 produced drastically faster constant lookups, and JRuby's performance is even better than 1.9.
But there's a problem here. Because there's this constant pinging of the global constant serial number, every constant access can potentially produce a new value. So we're paying the cost to check that serial number as well as interfering with optimizations that want to see constant values actually be constant.
Can we do better?
Quick Invokedynamic Primer
The main atom of invokedynamic is the MethodHandle. Method handles are essentially function pointers, which can point at Java methods or fields, constructors, constant values, or other method handles. Invokedynamic also provides the MethodHandles utility class, which lets us juggle method handles in various ways:
- adapting method signatures by casting, adding, moving, or dropping arguments
- combining three handles ("test", "target", and "fallback") to form a new "guard with test" if-statement-like handle
- wrapping handles with exception handling or argument/return pre/post-processing
You can think of method handles and the chains of adapter handles that stitch them together as a special sort of functional language the JVM knows how to optimize. Given a chain of handles, you should usually get a piece of code that optimizes as well as (or better, in some cases) writing the same logic by hand in Java.
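Here's a small sketch of two of those adaptations using the real java.lang.invoke API: a guardWithTest built from three handles, and dropArguments to adapt a signature. The HandleAdapters class and its isShort, shout, and echo methods are made up for the example.

```java
import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;

final class HandleAdapters {
    // Illustrative methods that the handles will point at.
    static boolean isShort(String s) { return s.length() < 5; }
    static String shout(String s) { return s.toUpperCase() + "!"; }
    static String echo(String s) { return s; }

    // Builds a handle equivalent to: s -> isShort(s) ? shout(s) : echo(s)
    static MethodHandle buildGuarded() throws ReflectiveOperationException {
        MethodHandles.Lookup lookup = MethodHandles.lookup();
        MethodType stringToString = MethodType.methodType(String.class, String.class);
        MethodHandle test = lookup.findStatic(HandleAdapters.class, "isShort",
                MethodType.methodType(boolean.class, String.class));
        MethodHandle target = lookup.findStatic(HandleAdapters.class, "shout", stringToString);
        MethodHandle fallback = lookup.findStatic(HandleAdapters.class, "echo", stringToString);
        return MethodHandles.guardWithTest(test, target, fallback);
    }

    // Adapts a (String)String handle to (int, String)String;
    // the new leading int argument is accepted and then ignored.
    static MethodHandle withIgnoredInt(MethodHandle handle) {
        return MethodHandles.dropArguments(handle, 0, int.class);
    }

    // Convenience wrapper so callers don't have to deal with Throwable.
    static String run(String input) {
        try {
            MethodHandle adapted = withIgnoredInt(buildGuarded());
            return (String) adapted.invoke(42, input);
        } catch (Throwable t) {
            throw new RuntimeException(t);
        }
    }
}
```

With these definitions, HandleAdapters.run("hi") takes the target branch and run("hello world") takes the fallback.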
The invokedynamic bytecode simply provides a place to plug a method handle chain into code. When the JVM encounters an invokedynamic bytecode, it calls a "bootstrap method" associated with that bytecode for further instructions.
The bootstrap method returns a CallSite object, provided in java.lang.invoke. There are constant call sites for constant values, mutable call sites for when the target handle chain may have to change, and volatile call sites for when those changes must immediately be reflected across threads.
Once a CallSite has been installed for a given invokedynamic, subsequent hits skip the bootstrapping process, and we're off to the races.
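Plain Java source can't emit an invokedynamic bytecode pointing at an arbitrary bootstrap method, so the sketch below calls a bootstrap-shaped method by hand and uses dynamicInvoker() to stand in for the bytecode. BootstrapSketch and its methods are hypothetical. The (Lookup, String, MethodType) parameter list is the standard shape the JVM passes to a bootstrap method.

```java
import java.lang.invoke.CallSite;
import java.lang.invoke.ConstantCallSite;
import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;
import java.lang.invoke.MutableCallSite;

final class BootstrapSketch {
    // Same shape as a real bootstrap method: the JVM would supply the caller's
    // Lookup plus the name and type recorded at the invokedynamic instruction.
    static CallSite bootstrap(MethodHandles.Lookup lookup, String name, MethodType type) {
        MethodHandle target = MethodHandles.constant(String.class, "hello from " + name);
        // ConstantCallSite: the target can never change after construction.
        return new ConstantCallSite(target.asType(type));
    }

    // Simulates the first hit of a call site by calling the bootstrap method ourselves.
    static String demoConstant() {
        try {
            CallSite site = bootstrap(MethodHandles.lookup(), "greet",
                    MethodType.methodType(String.class));
            return (String) site.dynamicInvoker().invoke();
        } catch (Throwable t) {
            throw new RuntimeException(t);
        }
    }

    // MutableCallSite: the target handle can be swapped while the site stays in place.
    static String[] demoMutable() {
        try {
            MutableCallSite site = new MutableCallSite(MethodType.methodType(String.class));
            site.setTarget(MethodHandles.constant(String.class, "first"));
            MethodHandle invoker = site.dynamicInvoker();
            String before = (String) invoker.invoke();
            site.setTarget(MethodHandles.constant(String.class, "second"));
            String after = (String) invoker.invoke();
            return new String[] { before, after };
        } catch (Throwable t) {
            throw new RuntimeException(t);
        }
    }
}
```

In real use the JVM calls the bootstrap method itself, once per invokedynamic instruction, and the language runtime (JRuby, in our case) generates the bytecode that names it.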
SwitchPoint
I mentioned that the MethodHandles class provides a "guardWithTest" method for combining a test, a target (the "then" branch), and a fallback (the "else" branch). SwitchPoint, also in java.lang.invoke, acts like an on/off guardWithTest that once turned off can never be turned on again. You provide a target and fallback, and until the "switch" is thrown the target will be invoked. After the switch is thrown the fallback will be called.
What's the difference between this and a guardWithTest where the test just pings some global value? The difference is that SwitchPoint doesn't need to check anything.
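A minimal sketch of the SwitchPoint behavior described above, using the real java.lang.invoke.SwitchPoint API; the SwitchPointDemo class is only there to hold the example.

```java
import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.SwitchPoint;

final class SwitchPointDemo {
    // Returns the result of calling the guarded handle before and after the switch is thrown.
    static String[] demo() {
        try {
            MethodHandle target = MethodHandles.constant(String.class, "target");
            MethodHandle fallback = MethodHandles.constant(String.class, "fallback");

            SwitchPoint switchPoint = new SwitchPoint();
            // No test handle is supplied: the SwitchPoint itself is the guard.
            MethodHandle guarded = switchPoint.guardWithTest(target, fallback);

            String before = (String) guarded.invoke();

            // Throw the switch. This is one-way; it can never be turned back on.
            SwitchPoint.invalidateAll(new SwitchPoint[] { switchPoint });

            String after = (String) guarded.invoke();
            return new String[] { before, after };
        } catch (Throwable t) {
            throw new RuntimeException(t);
        }
    }
}
```

The same guarded handle routes to the target before invalidation and to the fallback afterwards, with no test handle of our own in between.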
Optimization and Deoptimization in the JVM
When the JVM decides to optimize a piece of code, it does so in an optimistic way. In very broad terms, this means it assumes its information up to this point is perfect: no new methods or classes will be introduced, profiling information is accurate, etc. Based on this "perfect" view of the world, it aggressively optimizes code.
Of course, the world isn't perfect. The JVM has to give up profiling and monitoring at some point, so it always has an imperfect view of the system. In order to avoid its aggressive optimizations triggering a fatal error later on, JVMs like OpenJDK (Hotspot) do something called deoptimization.
Deoptimization is the process by which running, optimized code can adapt on-the-fly to a changing system. In OpenJDK, there are several ways this is accomplished:
- Branches out of compiled code back into the interpreter, when compiled code is determined to be invalid.
- Guards around inlined virtual method accesses, to ensure we're still calling against the same class.
- On-stack replacement, for fixing up a running method already on the native call stack.
- ...
Because of this ability to deoptimize, it's possible to support zero-cost guards at the JVM level. Returning to SwitchPoint, we can see how this new form of "guardWithTest" can be basically free: we're explicitly telling the JVM this switch is a rare occurrence it can optimize aggressively.
SwitchPoint for Constant Lookup
JRuby on invokedynamic uses SwitchPoint for constant lookup, as you'd expect. Instead of actively pinging that global constant serial number, we instead use a global SwitchPoint object to guard all cached constant accesses. When it comes time to invalidate the system's constants, we just flip the SwitchPoint off and create a new one. All SwitchPoint-guarded constant accesses in the system must then recache and use the new SwitchPoint.
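Here's roughly what that scheme could look like in plain Java. This is a hypothetical sketch and not JRuby's actual implementation: ConstantSite, recache, and slowLookup are illustrative names, and a MutableCallSite with dynamicInvoker() stands in for the real invokedynamic call site.

```java
import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;
import java.lang.invoke.MutableCallSite;
import java.lang.invoke.SwitchPoint;
import java.util.function.Supplier;

// Hypothetical sketch of one cached constant access site.
final class ConstantSite {
    // The single global guard; replaced with a fresh one on every invalidation.
    private static volatile SwitchPoint globalSwitchPoint = new SwitchPoint();

    private final MutableCallSite site =
            new MutableCallSite(MethodType.methodType(Object.class));
    private final Supplier<Object> slowLookup;   // stands in for the full constant search
    private final MethodHandle fallback;         // bound handle to recache()

    ConstantSite(Supplier<Object> slowLookup) {
        this.slowLookup = slowLookup;
        MethodHandle recacheHandle;
        try {
            recacheHandle = MethodHandles.lookup()
                    .findVirtual(ConstantSite.class, "recache",
                            MethodType.methodType(Object.class))
                    .bindTo(this);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException(e);
        }
        this.fallback = recacheHandle;
        site.setTarget(fallback);   // the first access always takes the slow path
    }

    // Slow path: look the value up, then install a SwitchPoint-guarded constant handle.
    Object recache() {
        // Read the switch point before the lookup so a concurrent invalidation
        // leaves us holding an already-invalidated guard rather than a stale value.
        SwitchPoint current = globalSwitchPoint;
        Object value = slowLookup.get();
        MethodHandle fast = MethodHandles.constant(Object.class, value);
        site.setTarget(current.guardWithTest(fast, fallback));
        return value;
    }

    Object get() {
        try {
            return site.dynamicInvoker().invoke();
        } catch (Throwable t) {
            throw new RuntimeException(t);
        }
    }

    // Flip the current SwitchPoint off and create a new one for future caching.
    static synchronized void invalidateAll() {
        SwitchPoint old = globalSwitchPoint;
        globalSwitchPoint = new SwitchPoint();
        SwitchPoint.invalidateAll(new SwitchPoint[] { old });
    }
}
```

Compared with the serial-number approach, the fast path here contains no explicit check written by us: it's just a constant handle behind the SwitchPoint guard.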
In a well-behaved system, we should reach a steady state where no new constants are being defined and no new modules are being introduced. Because we're using SwitchPoint, the stable state means all constant accesses are treated as truly constant by the JVM, allowing optimizations that were impossible before. And of course this also means that we've achieved constant lookup performance very near a theoretical maximum.
Numbers
First, a caveat: SwitchPoint is implemented in a fairly naïve way in the released OpenJDK 7, using a volatile field as the switch value. As a result, SwitchPoint guardWithTest is currently very slow, and JRuby's SwitchPoint-based constant logic must be enabled explicitly. The numbers I show below are based on leading-edge Hotspot compiler patches that will go into the first update release (numbers provided by one of the Hotspot devs, Christian Thalinger...thanks Christian!).
The benchmark we're running is a modified version of bench_const_lookup in JRuby's benchmark suite. The modification here runs more iterations (10M instead of 1M) with more constant lookups (50 instead of 10) to get a better idea of optimized performance.
Here's JRuby running our constant-lookup benchmark without SwitchPoint-based constants on Java 7:
As I said before, this is pretty good. JRuby's existing constant lookup performance is roughly 2x faster than Ruby 1.9.2.
Next, we'll try JRuby with SwitchPoint constants on Java 7 (released version, so we expect this to be slow):
The perf hit of purely volatile SwitchPoint is apparent.
And finally, JRuby with SwitchPoint constants on a dev build of Hotspot, which uses deoptimization rather than a volatile field:
This is basically the performance of the 10M iteration loop alone. In fact, if you look at the resulting optimized assembly, the constant accesses have been eliminated entirely since they're optimistically inlined and never used. Of course this would normally not happen in real code, but it shows how much better the JVM can optimize Ruby's behavior using invokedynamic.