Headius

Tuesday, July 3, 2018

Running JRuby on the Graal JIT

Hello, friends! Long time no blog!

I'm still here hacking away on JRuby for the benefit of Rubyists everywhere, usually slogging through compatibility fixes and new Ruby features. However with the release of JRuby 9.2, we've caught up to Ruby 2.5 (the current release) and I'm spending a little time on performance.

I thought today would be a good opportunity to show you how to start exploring some next-generation JRuby performance by running on top of the Graal JIT.

Graal

The Graal JIT is a project that grew out of a Sun Microsystem Labs project called MaxineVM. Maxine was an attempt to implement a JVM entirely in Java...via various specialized Java dialects, annotations, and compilation strategies. Graal is the latest iteration of the JIT work done for Maxine, and provides an optimizing JVM JIT implemented entirely in Java. Maxine only lives on as a research project, but Graal is rapidly shaping up to become the preferred JIT in OpenJDK.

The majority of optimizations that Graal does to code are no different from the "classic" OpenJDK JIT, Hotspot's "C2" compiler. You get all the usual dead code, loop unrolling, method inlining, branch profiling, and so on. However Graal goes beyond Hotspot in a few key ways. Most interesting to JRuby (and other dynamic languages) is the fact that Graal finally brings good Escape Analysis to the JVM.

Escape Analysis

The biggest gains JRuby sees running on Graal are from Escape Analysis (EA). The basic idea behind escape analysis goes like this: if you allocate an object, use it and its contents, and then abandon that object all within the same thread-local piece of code (i.e. not storing that object in a globally-visible location) then there's no reason to allocate the object. Think about this in terms of Java's autoboxing or varargs: if you box arguments or numbers, pass them to a method, and then unbox them again...the intermediate box was not really necessary.

The basic idea of EA is not complicated to understand, but implementing well can be devilishly hard. For example, the Hotspot JIT has had a form of escape analysis for years, optional in Java 7 and I believe turned on by default during Java 8 maintenance releases. However this EA was very limited...if any wrapper object was used across a branch (even a simple if/else) or if it might under any circumstance leave the compiled code being optimized, it would be forced to allocate every time.

The Graal EA improves on this with a technique called Partial Escape Analysis (PEA). In PEA, branches and loops do not interfere with the process of eliminating objects because all paths are considered. In addition, if there are boxed values eventually passed out of the compiled code, their allocation can be deferred until needed.

JRuby on Graal

By now you've likely heard about TruffleRuby, a Graal-based Ruby implementation that uses the Truffle framework for implementing all core language features and classes. Truffle provides many cool features for specializing data structures and making sure code inlines, but many of the optimizations TR users see is due to Graal doing such a great job of eliminating transient objects.

And because Graal is also open source and available in Java 10 and higher, JRuby can see some of those benefits!

There's two easy ways for you to test out JRuby on the Graal JIT

Using Java 10 with Graal

OpenJDK 9 included a new feature to pre-compile Java code to native (with the "jaotc" command), and this compiler made use of Graal. In OpenJDK 10, Graal is now included even on platforms where the ahead-of-time compiler is not supported.

Download and install any OpenJDK 10 (or higher) release, and pass these flags to the JVM (either with -J flags to JRuby or using the JAVA_OPTS environment variable):

-XX:+UnlockExperimentalVMOptions

-XX:+EnableJVMCI

-XX:+UseJVMCICompiler

Using GraalVM

GraalVM is a new build of OpenJDK that includes Graal and Truffle by default. Depending on which one you use (community or enterprise edition) you may also have access to additional proprietary optimizations.

GraalVM can be downloaded in community form for Linux and enterprise form for Linux and MacOS from graalvm.org. Install it and set it up as the JVM JRuby will run with (JAVA_HOME and/or PATH, as usual). It will use the Graal JIT by default.

Additional JRuby Flags

You will also want to include some JRuby flags that help us optimize Ruby code in ways that work better on Graal:

-Xcompile.invokedynamic enables our use of the InvokeDynamic feature, which lets dynamic calls inline and optimize like static Java calls.
-Xfixnum.cache=false disables our cache of small Fixnum objects. Using the cache helps on Hotspot, which has no reliable escape analysis, but having those objects floating around sometimes confuses Graal's partial escape analysis. Try your code with the cache on and off and let us know how it affects performance.

What to Expect

We have only been exploring how JRuby can make use of Graal over the past few months, but we're already seeing positive results on some benchmarks. However the optimizations we want to see are heavily dependent on inlining all code, an area where we need some work. I present two results here, one where Graal is working like we expect, and another where we're not yet seeing our best results.

Numeric Algorithms Look Good

One of the biggest areas JRuby performance suffers is in numeric algorithms. On a typical Hotspot-based JVM, all Fixnum and Float objects are actually object boxes that must be allocated and garbage collected like any other object. As you'd expect, this means that numeric algorithms pay a very high cost. This also means that Graal's partial escape analysis gives us huge gains, because all those boxes get swept away.

This first result is from a pure-Ruby Mandelbrot fractal-generating algorithm that makes the rounds periodically. The math here is almost all floating-point, with a few integers thrown in, but the entire algorithm fits in a single method. With JRuby using invokedynamic and running on Graal, all the code inlines and optimizes like a native numeric algorithm! Hooray for partial escape analysis!

We also have anecdotal reports of other numeric benchmarks performing significantly better with JRuby on Graal than JRuby on Hotspot...and in some cases, JRuby on Graal is the fastest result available!

Data Structures Need More Work

Of course most applications don't just work with numbers. They usually have a graph of objects in memory they need to traverse, search, create and destroy. In many cases, those objects include Array and Hash instances rather than user-defined objects, and frequently these structures are homogeneous: they contain only numbers, for example.

JRuby currently does not do everything it could to inline object creation and access. We also are not doing any numeric specialization of structures like Array, which means a list of Fixnums actually has to allocate all those Fixnum objects and hold them in memory. These are areas we intend to work on; I am currently looking at doing some minimal specialization of algorithmic code and numeric data structures, and we will release some specialization code for instance variables (right-sizing the object rather than using a separate array to hold instance variables) in JRuby 9.2.1.

The red/black benchmark tests the performance of a pure-Ruby red/black tree implementation. It creates a large graph, traverses it, searches it, and clears it. JRuby with InvokeDynamic on Hotspot still provides the best result here, perhaps because the extra magic of Graal is not utilized well.

This benchmark of ActiveRecord shows the opposite result: JRuby on Graal gets the best performance. Without having dug into the details, I'd guess there's some hot loop code in the middle of the "find all" logic, and that loop is optimizing well on Graal. But ultimately the objects all need to be created and the gain from Graal is fairly limited. I also have examples of other read, write, and update operations and only about half of them are faster with Graal.

Your Turn

The JRuby team (of which only two of us are actually employed to work on JRuby) has always managed resources with an eye for compatibility first. This has meant that JRuby performance -- while usually solid and usually faster than CRuby -- has received much less attention than some other Ruby implementations. But it turns out we can get really solid performance simply by inlining all appropriate code, specializing appropriate data structures, and running atop a JIT with good escape analysis.

We will be working over the next few months to better leverage Graal. In the mean time, we'd love to hear from JRuby users about their experiences with JRuby on Graal! If your code runs faster, let us know so we can tell others. If your code runs slower, let us know so we can try to improve it. And if you're interested in comparing with other Ruby implementations, just make sure your benchmark reflects a real-world case and doesn't simply optimize down to nothing.

JRuby on Graal has great promise for the future. Try it out and let us know how it goes!

Tuesday, October 24, 2017

Migrating to Java 9: Modules, Maven, OSGI, Travis CI

Hello friends! It has been too long!

Today, after many years, I've got something I wanted to blog rather than tweet out in code snippits and screenshots: I'm starting to get my projects working with Java 9.

I hope to cover all of the challenges and solutions I've come up with, but today I'll be focusing on something "simple": getting a straightforward Maven project to export a Java 9 module while still working on Java 8 and lower.

InvokeBinder

Some years ago I started work on a library called InvokeBinder (com.headius:invokebinder), a fluent wrapper around the Java 7 "MethodHandles" API in java.lang.invoke. InvokeBinder provides a more straightforward way to manipulate method handles, juggle arguments, and debug problems.

It's also a nice, simple library to try to get into Java 9.

module-info.java

The main stumbling block for exposing a module is this new file format in module-info.java:

Obviously this is not something we can compile on Java 8, so we will need to use a Java 9-compatible toolchain to build at least this one file.

In my case, I also want the rest of the library to work on Java 8, and since compiling module-info.java requires the target class files to be Java 9 format or higher, I'll need to compile everything except module-info.java separately.

Tool Support for Java 9 is Still Weak

I deferred exploring Java 9 until its release because of weak tool support. Now that I'm forced to deal with Java 9 I'm constantly stymied by weak tool support.

Most IDEs have some support for Java 9's syntactic changes, and of course accessing libraries from a Java 9 install is as easy as it was on previous versions. But the structural changes for Java 9: modules, multi-release jars, linking, and ahead-of-time compilation (which is admittedly experimental) generally do not exist.

How bad is it? Even NetBeans, Oracle's flagship IDE, usually the fastest free way to access the latest and greatest Java features...doesn't support these Java 9 features well at all (NetBeans 9 is still in development)

I generally use IntelliJ IDEA, which has always been ahead of the curve on supporting new Java features, and things are somewhat better here, You can define module definitions, but generally can't split JDK versions for a single source tree, the standard layout for a single module.

(Full disclosure: because I'm just trying to support Java 9 now, and I'm mostly migrating additional projects, I've only just started to figure out what features are supported in which IDs and how well; corrections and updates in comments are welcome.)

Because the IDE space is moving quickly, I won't go into how to get your IDE of choice working nicely with Java 9 structural features. As I cross that bridge, I'll try to post about it.

Maven?

Yes, you guessed it: like a majority of projects in the Java world, InvokeBinder still builds with Maven. For my purposes, it's the simplest way to get an artifact build and deployed, and Maven central is still the canonical place people look for libraries.

Maven is made up of thousands of little plugins and libraries, which makes updating any Maven project for a new JDK version an exercise in pain. When I first looked into Java 9 support some months ago, I basically had to give up; too many plugins I use hadn't updated, and in most cases there was no way to work around the incompatibilities.

Thankfully, most core Maven plugins now appear to work properly with Java 9, though most have not started to expose those structural features I discuss above.

To build invokebinder's module targeting Java 9 and build everything else targeting Java 8 required a newer version of the maven-compiler-plugin and some manual configuration, roughly described on this example page for module-info.java. In my case, I did not need to support Java versions earlier than Java 7, so the split toolkit configuration was unnecessary.

With this change, I was able to get my jar to export a module, and it still works properly when run on Java 7 and 8.

OSGI

Before Java 9 modules, the typical fine-grained runtime dependency system of choice has been OSGI, and InvokeBinder does indeed export an OSGI package.

Unfortunately, the plugin I'm using for doing this (the maven-bundle-plugin from the Felix project) has not been released in some time, and the version of the "bnd" library it uses does not work properly on Java 9.

A bit more manual configuration lets us force the plugin to use a newer version of the library, and everything now works: Java 8 or 9 with OSGI and Java 9 with modules!

Update: Travis CI

After finishing the changes needed above, I realized I needed to get this to build and work on Travis CI, which I use to test InvokeBinder.

Luckily, though Java 9 support in Travis is still a bit manual, it's not a difficult change:

What's Next?

I am both excited about Java 9 features and required to make JRuby and my other projects work with them, so this will be one of my primary projects over the next few months. I look forward to hearing how you are supporting Java 9, so please comment with corrections, tips, and updates on anything I've posted here.

In future posts I'll talk about getting multi-release jars, ahead-of-time compilation and linking, and JRuby itself running on Java 9. Stay tuned!

Wednesday, May 21, 2014

JRubyConf.eu 2014!

I'm thrilled to announce that we'll have another edition of JRubyConf.eu this year!

Last year's event was a great success. We had a two-day conference in Berlin immediately before Eurucamp 2013, with two speakers from the core team (myself and Tom Enebo) and a whopping fifteen non-core speakers. A great event was had by all.

This year, we've decided to pull the event back to its roots, as part of Eurucamp 2014. We'll return to the single-track, single-day event co-located with and immediately preceding Eurucamp on 1st August. We really wanted to bring JRuby back to Rubyists, and we're looking forward to hanging out at Eurucamp the whole weekend!

Why not visit Eurucamp early and spend a day learning about JRuby with the best JRubyists in Europe?

If you're interested in attending, tickets are available for only €99 at the Eurucamp ticket site now!

We're also looking for speakers from the JRuby community. You can submit to the CFP (which ends Sunday 28 May) using the Eurucamp CFP app.

Looking forward to seeing you at JRubyConf and Eurucamp this summer!

Friday, June 7, 2013

The Pain of Broken Subprocess Management on JDK

I prefer to write happy posts...I really do. But tonight I'm completely defeated by the JDK's implementation of subprocess launching, and I need to tell the world why.

JRuby has always strived to mimic MRI's behavior as much as possible, which in many cases has meant we need to route around the JDK to get at true POSIX APIs and behaviors.

For example, JRuby has provided the ability to manipulate symbolic links since well before Java 7 provided that capability, using a native POSIX subsystem built atop jnr-ffi, our Java-to-C FFI layer (courtesy of Wayne Meissner). Everyone in the Java world knew for years the lack of symlink support was a gross omission, but most folks just sucked it up and went about their business. We could not afford to do that.

We've repeated this process for many other Ruby features: UNIX sockets, libc-like IO, selectable stdin, filesystem attributes...on and on. And we've been able to provide the best POSIX runtime on the JVM bar none. Nobody has gone as far or done as much as JRuby has.

Another area where we've had to route around the JDK is in subprocess launching and management. The JDK provides java.lang.ProcessBuilder, an API for assembling the appropriate pieces of a subprocess launch, producing a java.lang.Process object. Process in turn provides methods to wait for the subprocess, get access to its streams, and destroy it forcibly. It works great, on the surface.

Unfortunately, the cake is a lie.

Under the covers, the JDK implements Process through a complicated series of tricks. We want to be able to interactively control the child process, monitor it for writes, govern its lifecycle exactly. The JDK attempts to provide a consistent experience across all platforms. Unfortunately, those two worlds are not currently compatible, and the resulting experience is consistently awful.

We'll start at the bottom to see where things go wrong.

POSIX, POSIX, Everywhere

At the core of ProcessBuilder, inside the native code behind UNIXProcess, we do find somewhat standard POSIX calls to fork and exec, wrapped up in a native downcall forkAndExec:

The C code behind this is a bit involved, so I'll summarize what it does.

Sets up pipes for in, out, err, and fail to communicate with the eventual child process.
Copies the parent's descriptors from the pipes into the "fds" array.
Launches the child through a fairly standard fork+exec sequence.
Waits for the child to write a byte to the fail pipe indicating success or failure.
Scrubs the unused sides of the pipes in parent and child.
Returns the child process ID.

This is all pretty standard for subprocess launching, and if it proceeded to put those file descriptors into direct, selectable channels we'd have no issues. Unfortunately, things immediately go awry once we return to the Java code.

Interactive?

The call to forkAndExec occurs inside the UNIXProcess constructor, as the very first thing it does. At that point, it has in hand the three standard file descriptors and the subprocess pid, and it knows that the subprocess has at least been successfully forked. The next step is to wrap the file descriptors in appropriate InputStream and OutputStream objects, and this is where we find the first flaw.

This is the code to set up an OutputStream for the input channel of the child process, so we can write to it. Now we know the operating system is going to funnel those written bytes directly to the subprocess's input stream, and ideally if we're launching a subprocess we intend to control it...perhaps by sending it interactive commands. Why, then, do we wrap the file descriptor with a BufferedOutputStream?
This is where JRuby's hacks begin. In our process subsystem, we have the following piece of code, which attempts to unwrap buffering from any stream it is given.

The FieldAccess.getProtectedFieldValue call there does what you think it does...attempt to read the "out" field from within FilteredOutputStream, which in this case will be the FileOutputStream from above. Unwrapping the stream in this way allows us to do two things:

We can do unbuffered writes to (or reads from, in the case of the child's out and err streams) the child process.
We can get access to the more direct FileChannel for the stream, to do direct ByteBuffer reads and writes or low-level stream copying.

So we're in good shape, right? It's a bit of hackery, but we've got our unbuffered Channel and can interact directly with the subprocess. Is this good enough?

I wish it were.

Selectable?

The second problem we run into is that users very often would like to select against the output streams of the child process, to perform nonblocking IO operations until the child has actually written some data. It gets reported as a JRuby bug over and over again because there's simply no way for us to implement it. Why? Because FileChannel is not selectable.

FileChannel implements methods for random-access reads and writes (positioning) and blocking IO interruption (which NIO implements by closing the stream...that's a rant for another day), but it does not implement any of the logic necessary for doing nonblocking IO using an NIO Selector. This comes up in at least one other place: the JVM's own standard IO streams are also not selectable, which means you can't select for user input at the console. Consistent experience indeed...it seems that all interaction with the user or with processes must be treated as file IO, with no selection capabilities.

(It is interesting to note that the JVM's standard IO streams are *also* wrapped in buffers, which we dutifully unwrap to provide a truly interactive console.)

Why are inter-proces file descriptors, which would support selector operations just wonderfully, wrapped in an unselectable channel? I have no idea, and it's impossible for us to hack around.

Let's not dwell on this item, since there's more to cover.

Fear the Reaper

You may recall I also wanted to have direct control over the lifecycle of the subprocess, to be able to wait for it or kill it at my own discretion. And on the surface, Process appears to provide these capabilities via the waitFor() and destroy() methods. Again it's all smoke and mirrors.

Further down in the UNIXProcess constructor, you'll find this curious piece of code:

For each subprocess started through this API, the JVM will spin up a "process reaper" thread. This thread is designed to monitor the subprocess for liveness and notify the parent UNIXProcess object when that process has died, so it can pass on that information to the user via the waitFor() and exitValue() API calls.

The interesting bit here is the waitForProcessExit(pid) call, which is another native downcall into C land:

There's nothing too peculiar here; this is how you'd wait for the child process to exit if you were writing plain old C code. But there's a sinister detail you can't see just by looking at this code: waitpid can be called exactly once by the parent process.

Part of the Ruby Process API is the ability to get a subprocess PID and wait for it. The concept of a process ID has been around for a long time, and Rubyists (even amateur Rubyists who've never written a line of C code) don't seem to have any problem calling Process.waitpid when they want to wait for a child to exit. JRuby is an implementation of Ruby, and we would ideally like to be able to run all Ruby code that exists, so we also must implement Process.waitpid in some reasonable way. Our choice was to literally call the C function waitpid(2) via our FFI layer.

Here's the subtle language from the wait(2) manpage (which includes waitpid):

RETURN VALUES
     If wait() returns due to a stopped or terminated child
     process, the process ID of the child is returned to the
     calling process.  Otherwise, a value of -1 is returned
     and errno is set to indicate the error.

     If wait3(), wait4(), or waitpid() returns due to a
     stopped or terminated child process, the process ID of
     the child is returned to the calling process.  If there
     are no children not previously awaited, -1 is returned
     with errno set to [ECHILD].  Otherwise, if WNOHANG is
     specified and there are no stopped or exited children,
     0 is returned. If an error is detected or a caught
     signal aborts the call, a value of -1 is returned and
     errno is set to indicate the error.

There's a lot of negatives and passives and conditions there, so I'll spell it out for you more directly: If you call waitpid for a given child PID and someone else in your process has already done so...bad things happen.

We effectively have to race the JDK to the waitpid call. If we get there first, the reaper thread bails out immediately and does no further work. If we don't get their first, it becomes impossible for a Ruby user to waitpid for that child process.

Now you may be saying "why don't you just wait on the Process object and let the JDK do its job, old man? The problem here is that Ruby's Process API behaves like a POSIX process API: you get a PID back, and you wait on that PID. We can't mimic that API without returning a PID and implementing Process.waitpid appropriately.

(Interesting note: we also use reflection tricks to get the real PID out of the java.lang.Process object, since it is not normally exposed.)

Could we have some internal lookup table mapping PIDs to Process objects, and make our wait logic just call Process.waitFor? In order to do so, we'd need to manage a weak-valued map from integers to Process objects...which is certainly doable, but it breaks if someone uses a native library or FFI call to launch a process themselves. Oh, but if it's not in our table we could do waitpid. And so the onion grows more layers, all because we can't simply launch a process, get a PID, and wait on it.

It doesn't end here, though.

Keep Boiling That Ocean

At this point we've managed to at least get interactive streams to the child process, and even if they're not selectable that's a big improvement over the standard API. We've managed to dig out a process ID and sometimes we can successfully wait for it with a normal waitpid function call. So out of our three goals (interactivity, selectability, lifecycle control) we're maybe close to halfway there.

Then the JDK engineers go and pull the rug out from under us.

The logic for UNIXProcess has changed over time. Here's the notable differences in the current JDK 7 codebase:

An Executor is now used to avoid spinning up a new thread for each child process. I'd +1 this, if the reaping logic weren't already causing me headaches.
The streams are now instances of UNIXProcess.ProcessPipeOutputStream and ProcessPipeInputStream. Don't get excited...they're still just buffered wrappers around File streams.
The logic run when the child process exist has changed...with catastrophic consequences.

Here's the new stream setup and reaper logic:

Now instead of simply notifying the UNIXProcess that the child has died, there's a call to processExited().

Ok, doesn't look bad so far. Let's look at ProcessPipeInputStream, which handles output from the child process.

So when the child process exits, the any data waiting to be read from its output stream is drained into a buffer. All of it. In memory.

Did you launch a process that writes a gigabyte of data to its output stream and then terminates? Well, friend, I sure hope you have a gigabyte of memory, because the JDK is going to read that sucker in and there's nothing you can do about it. And let's hope there's not more than 2GB of data, since this code basically just grows a byte[], which in Java can only grow to 2GB. If there's more than 2GB of data on that stream, this logic errors out and the data is lost forever. Oh, and by the way...if you happened to be devlishly clever and managed to dig down to the real FileChannel attached to the child process, all the data from that stream has suddenly disappeared, and the channel itself is closed, even if you never got a chance to read from it. Thanks for the help, JDK.

The JDK has managed to both break our clever workarounds (for its previously broken logic) an break itself even more badly. It's almost like they want to make subprocess launching so dreadfully bad you just don't use it anymore.

Never Surrender

Of course I could cry into my beer over this, but these sorts of problems and challenges are exactly why I'm involved in JRuby and OpenJDK. Obviously this API has gone off the deep end and can't be saved, so what's a hacker to do? In our case, we make our own API.

At this point, that's our only option. The ProcessBuilder and Process APIs are so terribly broken that we can't rely on them anymore. Thankfully, JRuby ships with a solid, fast FFI layer called the Java Native Runtime (JNR) that should make it possible for us to write our own process API entirely in Java. We will of course do that in the open, and we are hoping you will help us.

What's the moral of the story? I don't really know. Perhaps it's that lowest-common-denominator APIs usually trend toward uselessness. Perhaps it's that ignoring POSIX is an expressway to failure. Perhaps it's that I don't know when to quit. In any case, you can count on the JRuby team to continue bringing you the only true POSIX experience on the JVM, and you can count on me to keep pushing OpenJDK to follow our lead.

Saturday, May 11, 2013

On Languages, VMs, Optimization, and the Way of the World

I shouldn't be up this late, but I've been doing lots of thinking and exploring tonight.

In studying various VMs over the past few years, I've come up with a list of do's and don't that make things optimize right. These apply to languages, the structures that back them, and the VMs that optimize those languages, and from what I've seen there's a lot of immutable truths here given current optimization technology.

Let's dive in.

#1: Types don't have to be static

JVM and other dynamic-optimizing runtimes have proven this out. At runtime, it's possible to gather the same information static types would provide you at compile time, leading to optimizations at least as good as fully statically-typed, statically-optimized code. In some cases, it may be possible to do a better job, since runtime profiling is based on real execution, real branch percentages, real behavior, rather than a guess at what a program might do. You could probably make the claim that static optimization is a halting problem, and dynamic optimization eventually can beat it by definition since it can optimize what the program is actually doing.

However, this requires one key thing to really work well.

#2: Types need to be predictable

In order for runtime optimization to happen, objects need to have predictable types and those types need to have a predictable structure. This isn't to say that types must be statically declared...they just need to look the same on repeat visits. If objects can change type (smalltalk's become, perl's and C's weak typing) you're forced to include more guards against those changes, or you're forced to invalidate more code whenever something changes (or in the case of C, you just completely shit the bed when things aren't as expected). If change is possible and exposed at a language level, there may be nothing you can do to cope with all those different type shapes, and optimization can only go so far.

This applies both to the shape of a type's method table (methods remaining consistent once encountered) and the shape of the type's instances (predictable object layout). Many dynamically-typed languages impose dynamic type shape and object shape on VMs that run them, preventing those VMs from making useful predictions about how to optimize code. Optimistic predictions (generating synthetic types for known type shapes or preemptively allocating objects based on previously-seen shapes) still have to include fallback logic to maintain the mutable behavior, should it ever be needed. Again, optimization potential is limited, because the shape of the world can change on a whim and the VM has to be vigilent

The alternative summation of #1 and #2 is that types don't have to be statically declared, but they need to be statically defined. Most popular dynamic languages do neither, but all they really need to do is the latter.

#3: You can't cheat the CPU

Regardless of how clever you'd like to be in your code or language or VM or JIT, the limiting factor is how modern CPUs actually run your code. There's a long list of expectations you must meet to squeeze every last drop of speed out of a system, and diverging from those guidelines will always impose a penalty. This is the end...the bottom turtle...the unifying theory. It is, at the end of the day, the CPU you must appease to get the best performance. All other considerations fall out of that, and anywhere performance does not live up to expectations you are guaranteed to discover that someone tried to cheat the CPU.

Traditionally, static typing was the best way to guarantee we produced good CPU instructions. It gave us a clear picture of the world we could ponder and meditate over, eventually boiling out the secrets of the universe and producing the fastest possible code. But that always assumed a narrow vision of a world with unlimited resources. It assumed we could make all the right decisions for a program ahead of time and that no limitations outside our target instruction set would ever affect us. In the real world, however, CPUs have limited cache sizes, multiple threads, bottlenecked memory pipelines, and basic physics to contend with (you can only push so many electrons through a given piece of matter without blowing it up). Language and VM authors ignore the expectations of their target systems only at great peril.

Let's look at a few languages and where they fit.

Language Scorecard

Java is statically typed and types are of a fixed shape. This is the ideal situation mostly because of the type structure being predictable. Once encountered, a rose is just a rose. Given appropriate dynamic optimizations, there's no reason Java code can't compete with or surpass statically-typed and statically-compiled C/++, and in theory there's nothing preventing Java code from becoming optimal CPU instructions.

Dart is dynamically typed (or at least, types are optional and the VM doesn't care about them), but types are of a fixed shape. If programmers can tolerate fixed-shape types, Dart provides a very nice dynamic language that still can achieve the same optimizations as statically-typed Java or statically-compiled C/++.

Groovy is dynamically typed with some inference and optimization if you specify static types, but most (all?) types defined in Groovy are not guaranteed to be a fixed shape. As a result, even when specifying static types, guards must be inserted to check that those types' shapes have not changed. Groovy does, however, guarantee object shape is consistent over time, which avoids overhead from being able to reshape objects at runtime.

Ruby and JavaScript are dynamically typed and types and objects can change shape at runtime. This is a confluence of all the hardest-to-optimize language characteristics. In both cases, the best we can do is to attempt to predict common type and object shapes and insert guards for when we're wrong, but it's not possible to achieve the performance of a system with fully-predictable type and object shapes. Prove me wrong.

Now of course when I say it's not possible, I mean it's not possible for the general case. Specific cases of a known closed-world application can indeed be optimized as though the types and objects involved had static shapes. I do something along these lines in my RubyFlux compiler, which statically analyzes incoming Ruby code and assumes the methods it sees defined and the fields it sees accessed will be the only methods and fields it ever needs to worry about. But that requires omitting features that can mutate type and object structure, or else you have to have a way to know which types and objects those features will affect. Sufficiently smart compiler indeed.

Python has similar structural complexities to Ruby and adds in the additional complexity of an introspectable call stack. Under those circumstances, even on-stack execution state is not safe; a VM can't even make guarantees about the values it has in hand or the shape of a given call's activation. PyPy does an admirable job of attacking this problem by rewriting currently-running code and lifting on-stack state to the heap when it is accessed, but this approach prevents dropping unused local state (since you can't predict who might want to see it) and also fails to work under parallel execution (since you can't rewrite code another thread might be executing). Again, the dynamicity of a "cool" feature brings with it intrinsic penalties that are reducible but not removable.

Get to the Damn Point, Already

So what am I trying to say in all this? I started the evening by exploring a benchmark post comparing Dart's VM with JVM on the same benchmark. The numbers were not actually very exciting...with a line-by-line port from Dart to Java, Java came out slightly behind Dart. With a few modifications to the Java code, Java pulled slightly ahead. With additional modifications to the Dart code, it might leapfrog Java again. But this isn't interesting because Dart and Java can both rely on type and object shapes remaining consistent, and as a result the optimizations they perform can basically accomplish the same thing. Where it matters, they're similar enough that VMs don't care about the differences.

Where does this put languages I love, like Ruby? It's probably fair to concede that Ruby can't ever achieve the raw, straight-line performance of type-static (not statically-typed) languages like Dart or Java, regardless of the VM technologies involved. We'll be able to get close; JRuby can, with the help of invokedynamic, make method calls *nearly* as fast as Java calls, and by generating type shapes we can make object state *nearly* as predictable as Java types, but we can't go all the way. Regardless of how great the underlying VM is, if you can't hold to its immutable truths, you're walking against the wind. Ruby on Dart would probably not be any faster than Ruby on JVM, because you'd still have to implement mutable types and growable objects in pretty much the same way. Ruby on PyPy might be able to go farther, since the VM is designed for mutable types and growable objects, but you might have to sacrifice parallelism or accept that straight-line object-manipulating performance won't go all the way to a Java or Dart. Conversely, languages that make those type-static guarantees might be able to beat dynamic languages when running on dynamic language VMs (e.g. dart2js) for exactly the same reasons that they excel on their own VMs: they provide a more consistent view of the world, and offer no surprises to the VM that would hinder optimization. You trade dynamicity at the language level for predictability at the VM level.

The Actual Lesson

I guess the bottom line for me is realizing that there's always going to be a conflict between what programmers want out of programming languages and what's actually possible to give them. There's no magical fairy world where every language can be as fast as every other language, because there's no way to predict how every program is going to execute (or in truth, how a given program is going to execute given a general strategy). And that's ok; most of these languages can still get very close to each other in performance, and over time the dynamic type/object-shaped languages may offer ways to ratchet down some of that dynamism...or they might not care and just accept what limitations result. The important thing is for language users to recognize that nothing is free, and to understand the implications of language features and design decisions they make in their own programs.

Saturday, January 5, 2013

Constant and Global Optimization in JRuby 1.7.1 and 1.7.2

With every JRuby release, there's always at least a handful of optimizations. They range from tiny improvements in the compiler to perf-aware rewrites of core class methods, but they're almost always driven by real-world cases.

In JRuby 1.7.1 and 1.7.2, I made several improvements to the performance of Ruby constants and global variables that might be of some interest to you, dear reader.

Constants

In Ruby, a constant is a lexically and hierarchically accessed variable that starts with a capital letter. Class and module names like Object, Kernel, String, are all constants defined under the Object class. When I say constants are both lexical and hierarchically accessed, what I mean is that at access time we first search outward through lexically-enclosing scopes, and failing that we search through the class hierarchy of the innermost scope. For example:

Here, the first two constant accesses inside class B are successful; the first (IN_FOO) is located lexically in Foo, because it encloses the body of class B. The second (IN_A) is located hierarchically by searching B's ancestors. The third access fails, because the IN_BAR constant is only available within the Bar module's scope, so B can't see it.

Constants also...aren't. It is possible to redefine a constant, or define new constants deeper in a lexical or hierarchical strcture that mask earlier ones. However in most code (i.e. "good" code) constants eventually stabilize. This makes it possible to perform a variety of optimizations against them, even though they're not necessarily static.

Constants are used heavily throughout Ruby, both for constant values like Float::MAX and for classes like Array or Hash. It is therefore especially important that they be as fast as possible.

Global Variables

Globals in Ruby are about like you'd expect...name/value pairs in a global namespace. They start with $ character. Several global variables are "special" and exist in a more localized source, like $~ (last regular expression match in this call frame), $! (last exception raised in this thread), and so on. Use of these "local globals" mostly just amounts to special variable names that are always available; they're not really true global variables.

Everyone knows global variables should be discouraged, but that's largely referring to global variable use in normal program flow. Using global state across your application – potentially across threads – is a pretty nasty thing to do to yourself and your coworkers. But there are some valid uses of globals, like for logging state and levels, debugging flags, and truly global constructs like standard IO.

Here, we're using the global $DEBUG to specify whether logging should occur in MyApp#log. Those log messages are written to the stderr stream accessed via $stderr. Note also that $DEBUG can be set to true by passing -d at the JRuby command line.

Optimizing Constant Access (pre-1.7.1)

I've posted in the past about how JRuby optimizes constant access, so I'll just quickly review that here.

At a given access point, constant values are looked up from the current lexical scope and cached. Because constants can be modified, or new constants can be introduce that mask earlier ones, the JRuby runtime (org.jruby.Ruby) holds a global constant invalidator checked on each access to ensure the previous value is still valid.

On non-invokedynamic JVMs, verifying the cache involves an object identity comparison every time, which means a non-final value must be accessed via a couple levels of indirection. This adds a certain amount of overhead to constant access, and also makes it impossible for the JVM to fold multiple constant accesses away, or make static decisions based on a constant's value.

On an invokedynamic JVM, the cache verification is in the form of a SwitchPoint. SwitchPoint is a type of on/off guard used at invokedynamic call sites to represent a hard failure. Because it can only be switched off, the JVM is able to optimize the SwitchPoint logic down to what's called a "safe point", a very inexpensive ping back into the VM. As a result, constant accesses under invokedynamic can be folded away, and repeat access or unused accesses are not made at all.

However, there's a problem. In JRuby 1.7.0 and earlier, the only way we could access the current lexical scope (in a StaticScope object) was via the current call frame's DynamicScope, a heap-based object created on each activation of a given body of code. In order to reduce the performance hit to methods containing constants, we introduced a one-time DynamicScope called the "dummy scope", attached to the lexical scope and only created once. This avoided the huge hit of constructing a DynamicScope for every call, but caused constant-containing methods to be considerably slower than those without constants.

Lifting Lexical Scope Into Code

In JRuby 1.7.1, I decided to finally bite the bullet and make the lexical scope available to all method bodies, without requiring a DynamicScope intermediate. This was a nontrivial piece of work that took several days to get right, so although most of the work occurred before JRuby 1.7.0 was released, we opted to let it bake a bit before release.

The changes made it possible for all class, module, method, and block bodies to access their lexical scope essentially for free. It also helped us finally deliver on the promise of truly free constant access when running under invokedynamic.

So, does it work?

Assuming constant access is free, the three loops here should perform identically. The non-expression calls to foo and bar should disappear, since they both return a constant value that's never used. The calls for decrementing the 'a' variable should produce a constant value '1' and perform the same as the literal decrement in the control loop.

Here's Ruby (MRI) 2.0.0 performance on this benchmark.

The method call itself adds a significant amount of overhead here, and the constant access adds another 50% of that overhead. Ruby 2.0.0 has done a lot of work on performance, but the cost of invoking Ruby methods and accessing constants remains high, and constant accesses do not fold away as you would like.

Here's JRuby 1.7.2 performance on the same benchmark.

We obviously run all cases significantly faster than Ruby 2.0.0, but the important detail is that the method call adds only about 11% overhead to the control case, and constant access adds almost nothing.

For comparison, here's JRuby 1.7.0, which did not have free access to lexical scopes.

So by avoiding the intermediate DynamicScope, methods containing constant accesses are somewhere around 7x faster than before. Not bad.

Optimizing Global Variables

Because global variables have a much simpler structure than constants, they're pretty easy to optimize. I had not done so up to JRuby 1.7.1 mostly because I didn't see a compelling use case and didn't want to encourage their use. However, after Tony Arcieri pointed out that invokedynamic-optimized global variables could be used to add logging and profiling to an application with zero impact when disabled, I was convinced. Let's look at the example from above again.

In this example, we would ideally like there to be no overhead at all when $DEBUG is untrue, so we're free to add optional logging throughout the application with no penalty. In order to support this, two improvements were needed.

First, I modified our invokedynamic logic to cache global variables using a per-variable SwitchPoint. This makes access to mostly-static global variables as free as constant access, with the same performance improvements.

Second, I added some smarts into the compiler for conditional forms like "if $DEBUG" that would avoid re-checking the $DEBUG value at all if it were false the first time (and start checking it again if it were modified).

It's worth noting I also made this second optimization for constants; code like "if DEBUG_ENABLED" will also have the same performance characteristics.

Let's see how it performs.

In this case, we should again expect that all three forms have identical performance. Both the constant and the global resolve to an untrue value, so they should ideally not introduce any overhead compared to the bare method.

Here's Ruby (MRI) 2.0.0:

Both the global and the constant add overhead here in the neighborhood of 25% over an empty method. This means you can't freely add globally-conditional logic to your application without accepting a performance hit.

JRuby 1.7.2:

Again we see JRuby + invokedynamic optimizing method calls considerably better than MRI, but additionally we see that the untrue global conditions add no overhead compared to the empty method. You can freely use globals as conditions for logging, profiling, and other code you'd like to have disabled most of the time.

And finally, JRuby 1.7.1, which optimized constants, did not optimize globals, and did not have specialized conditional logic for either:

Where Do We Go From Here?

Hopefully I've helped show that we're really just seeing the tip of the iceberg as far as optimizing JRuby using invokedynamic. More than anything we want you to report real-world use cases that could benefit from additional optimization, so we can target our work effectively. And as always, please try out your apps on JRuby, enable JRuby testing in Travis CI, and let us know what we can do to make your JRuby experience better!

Monday, November 19, 2012

Refining Ruby

What does the following code do?

If you answered "it upcases two strings and adds them together, returning the result" you might be wrong because of a new Ruby feature called "refinements".

Let's start with the problem refinements are supposed to solve: monkey-patching.

Monkey-patching

In Ruby, all classes are mutable. Indeed, when you define a new class, you're really just creating an empty class and filling it with methods. The ability to mutate classes at runtime has been used (or abused) by many libraries and frameworks to decorate Ruby's core classes with additional (or replacement) behavior. For example, you might add a "camelize" method to String that knows how to convert under_score_names to camelCaseNames. This is lovingly called "monkey-patching" by the Ruby community.

Monkey-patching can be very useful, and many patterns in Ruby are built around the ability to modify classes. It can also cause problems if a library patches code in a way the user does not expect (or want), or if two libraries try to apply conflicting patches. Sometimes, you simply don't want patches to apply globally, and this is where refinements come in.

Localizing Monkeypatches

Refinements have been discussed as a feature for several years, sometimes under the name "selector namespaces". In essence, refinements are intended to allow monkey-patching only within certain limited scopes, like within a library that wants to use altered or enhanced versions of core Ruby types without affecting code outside the library. This is the case within the ActiveSupport library that forms part of the core of Rails.

ActiveSupport provides a number of extensions (patches) to the core Ruby classes like String#pluralize, Range#overlaps?, and Array#second. Some of these extensions are intended for use by Ruby developers, as conveniences that improve the readability or conciseness of code. Others exist mostly to support Rails itself. In both cases, it would be nice if we could prevent those extensions from leaking out of ActiveSupport into code that does not want or need them.

Refinements

In short, refinements provide a way to make class modifications that are only seen from within certain scopes. In the following example, I add a "camelize" method to the String class that's only seen from code within the Foo class.

With the Foo class refined, we can see that the "camelize" method is indeed available within the "camelize_string" method but not outside of the Foo class.

On the surface, this seems like exactly what we want. Unfortunately, there's a lot more complexity here than meets the eye.

Ruby Method Dispatch

In order to do a method call in Ruby, a runtime simply looks at the target object's class hierarchy, searches for the method from bottom to top, and upon finding it performs the call. A smart runtime will cache the method to avoid performing this search every time, but in general the mechanics of looking up a method body are rather simple.

In an implementation like JRuby, we might cache the method at what's called the "call site"—the point in Ruby code where a method call is actually performed. In order to know that the method is valid for future calls, we perform two checks at the call site: that the incoming object is of the same type as for previous calls; and that the type's hierarchy has not been mutated since the method was cached.

Up to now, method dispatch in Ruby has depended solely on the type of the target object. The calling context has not been important to the method lookup process, other than to confirm that visibility restrictions are enforced (primarily for protected methods, since private methods are rejected for non–self calls). That simplicity has allowed Ruby implementations to optimize method calls and Ruby programmers to understand code by simply determining the target object and methods available on it.

Refinements change everything.

Refinements Basics

Let's revisit the camelize example again.

The visible manifestation of refinements comes via the "refine" and "using" methods.

The "refine" method takes a class or module (the String class, in this case) and a block. Within the block, methods defined (camelize) are added to what might be called a patch set (a la monkey-patching) that can be applied to specific scopes in the future. The methods are not actually added to the refined class (String) except in a "virtual" sense when a body of code activates the refinement via the "using" method.

The "using" method takes a refinement-containing module and applies it to the current scope. Methods within that scope should see the refined version of the class, while methods outside that scope do not.

Where things get a little weird is in defining exactly what that scope should be and in implementing refined method lookup in such a way that does not negatively impact the performance of unrefined method lookup. In the current implementation of refinements, a "using" call affects all of the following scopes related to where it is called:

The direct scope, such as the top-level of a script, the body of a class, or the body of a method or block
Classes down-hierarchy from a refined class or module body
Bodies of code run via eval forms that change the "self" of the code, such as module_eval

It's worth emphasizing at this point that refinements can affect code far away from the original "using" call site. It goes without saying that refined method calls must now be aware of both the target type and the calling scope, but what of unrefined calls?

Dynamic Scoping of Method Lookup

Refinements (in their current form) basically cause method lookup to be dynamically scoped. In order to properly do a refined call, we need to know what refinements are active for the context in which the call is occurring and the type of the object we're calling against. The latter is simple, obviously, but determining the former turns out to be rather tricky.

Locally-applied refinements

In the simple case, where a "using" call appears alongside the methods we want to affect, the immediate calling scope contains everything we need. Calls in that scope (or in child scopes like method bodies) would perform method lookup based on the target class, a method name, and the hierarchy of scopes that surrounds them. The key for method lookup expands from a simple name to a name plus a call context.

Hierarchically-applied refinements

Refinements applied to a class must also affect subclasses, so even when we don't have a "using" call present we still may need to do refined dispatch. The following example illustrates this with a subclass of Foo (building off the previous example).

Here, the camelize method is used within a "map" call, showing that refinements used by the Foo class apply to Bar, its method definitions, and any subscopes like blocks within those methods. It should be apparent now why my first example might not do what you expect. Here's my first example again, this time with the Quux class visible.

The Quux class uses refinements from the BadRefinement module, effectively changing String#upcase to actually do String#reverse. By looking at the Baz class alone you can't tell what's supposed to happen, even if you are certain that str1 and str2 are always going to be String. Refinements have effectively localized the changes applied by the BadRefinement module, but they've also made the code more difficult to understand; the programmer (or the reader of the code) must know everything about the calling hierarchy to reason about method calls and expected results.

Dynamically-applied refinements

One of the key features of refinements is to allow block-based DSLs (domain-specific languages) to decorate various types of objects without affecting code outside the DSL. For example, an RSpec spec.

There's several calls here that we'd like to refine.

The "describe" method is called at the top of the script against the "toplevel" object (essentially a singleton Object instance). We'd like to apply a refinement at this level so "describe" does not have to be defined on Object itself.
The "it" method is called within the block passed to "describe". We'd like whatever self object is live inside that block to have an "it" method without modifying self's type directly.
The "should" method is called against an instance of MyClass, presumably a user-created class that does not define such a method. We would like to refine MyClass to have the "should" method only within the context of the block we pass to "it".
Finally, the "be_awesome" method—which RSpec translates into a call to MyClass#awesome?—should be available on the self object active in the "it" block without actually adding be_awesome to self's type.

In order to do this without having a "using" present in the spec file itself, we need to be able to dynamically apply refinements to code that might otherwise not be refined. The current implementation does this via Module#module_eval (or its argument-receiving brother, Module#module_exec).

A block of code passed to "module_eval" or "instance_eval" will see its self object changed from that of the original surrounding scope (the self at block creation time) to the target class or module. This is frequently used in Ruby to run a block of code as if it were within the body of the target class, so that method definitions affect the "module_eval" target rather than the code surrounding the block.

We can leverage this behavior to apply refinements to any block of code in the system. Because refined calls must look at the hierarchy of classes in the surrounding scope, every call in every block in every piece of code can potentially become refined in the future, if the block is passed via module_eval to a refined hierarchy. The following simple case might not do what you expect, even if the String class has not been modified directly.

Because the "+" method is called within a block, all bets are off. The str_ary passed in might not be a simple Array; it could be any user class that implements the "inject" method. If that implementation chooses, it can force the incoming block of code to be refined. Here's a longer version with such an implementation visible.

Suddenly, what looks like a simple addition of two strings produces a distinctly different result.

Now that you know how refinements work, let's discuss the problems they create.

Implementation Challenges

Because I know that most users don't care if a new, useful feature makes my life as a Ruby implementer harder, I'm not going to spend a great deal of time here. My concerns revolve around the complexities of knowing when to do a refined call and how to discover those refinements.

Current Ruby implementations are all built around method dispatch depending solely on the target object's type, and much of the caching and optimization we do depends on that. With refinements in play, we must also search and guard against types in the caller's context, which makes lookup much more complicated. Ideally we'd be able to limit this complexity to only refined calls, but because "using" can affect code far away from where it is called, we often have no way to know whether a given call might be refined in the future. This is especially pronounced in the "module_eval" case, where code that isn't even in the same class hierarchy as a refinement must still observe it.

There are numerous ways to address the implementation challenges.

Eliminate the "module_eval" Feature

At present, nobody knows of an easy way to implement the "module_eval" aspect of refinements. The current implementation in MRI does it in a brute-force way, flushing the global method cache on every execution and generating a new, refined, anonymous module for every call. Obviously this is not a feasible direction to go; block dispatch will happen very frequently at runtime, and we can't allow refined blocks to destroy performance for code elsewhere in the system.

The basic problem here is that in order for "module_eval" to work, every block in the system must be treated as a refined body of code all the time. That means that calls inside blocks throughout the system need to search and guard against the calling context even if no refinements are ever applied to them. The end result is that those calls suffer complexity and performance hits across the board.

At the moment, I do not see (nor does anyone else see) an efficient way to handle the "module_eval" case. It should be removed.

Localize the "using" Call

No new Ruby feature should cause across-the-board performance hits; one solution is for refinements to be recognized at parse time. This makes it easy to keep existing calls the way they are and only impose refinement complexity upon method calls that are actually refined.

The simplest way to do this is also the most limiting and the most cumbersome: force "using" to only apply to the immediate scope. This would require every body of code to "using" a refinement if method calls in that body should be refined. Here's a couple of our previous examples with this modification.

This is obviously pretty ugly, but it makes implementation much simpler. In every scope where we see a "using" call, we simply force all future calls to honor refinements. Calls appearing outside "using" scopes do not get refined and perform calls as normal.

We can improve this by making "using" apply to child scopes as well. This still provides the same parse-time "pseudo-keyword" benefit without the repetition.

Even better would be to officially make "using" a keyword and have it open a refined scope; that results in a clear delineation between refined and unrefined code. I show two forms of this below; the first opens a scope like "class" or "module", and the second uses a "do...end" block form.

It would be fair to say that requiring more explicit scoping of "using" would address my concern about knowing when to do a refined call. It does not, however, address the issues of locating active refinements at call time.

Locating Refinements

In each of the above examples, we still must pass some state from the calling context through to the method dispatch logic. Ideally we'd only need to pass in the calling object, which is already passed through for visibility checking. This works for refined class hierarchies, but it does not work for the RSpec case, since the calling object in some cases is just the top-level Object instance (and remember we don't want to decorate Object).

It turns out that there's already a feature in Ruby that follows lexical scoping: constant lookup. When Ruby code accesses a constant, the runtime must first search all enclosing scopes for a definition of that constant. Failing that, the runtime will walk the self object's class hierarchy. This is similar to what we want for the simplified version of refinements.

If we assume we've localized refinements to only calls within "using" scopes, then at parse time we can emit something like a RefinedCall for every method call in the code. A RefinedCall would be special in that it uses both the containing scope and the target class to look up a target method. The lookup process would proceed as follows:

Search the call's context for refinements, walking lexical scopes only
If refinements are found, search for the target method
If a refined method is found, use it for the call
Otherwise, proceed with normal lookup against the target object's class

Because the parser has already isolated refinement logic to specific calls, the only change needed is to pass the caller's context through to method dispatch.

Usability Concerns

There are indeed flavors of refinements that can be implemented reasonably efficiently, or at least implemented in such a way that unrefined code will not pay a price. I believe this is a requirement of any new feature: do no harm. But harm can come in a different form if a new feature makes Ruby code harder to reason about. I have some concerns here.

Let's go back to our "module_eval" case.

Because there's no "using" anywhere in the code, and we're not extending some other class, most folks will assume we're simply concatenating strings here. After all, why would I expect my "+" call to do something else? Why should my "+" call ever do something else here?

Ruby has many features that might be considered a little "magical". In most cases, they're only magic because the programmer doesn't have a good understanding of how they work. Constant lookup, for example, is actually rather simple...but if you don't know it searches both lexical and hierarchical contexts, you may be confused where values are coming from.

The "module_eval" behavior of refinements simply goes too far. It forces every Ruby programmer to second-guess every block of code they pass into someone else's library or someone else's method call. The guarantees of standard method dispatch no longer apply; you need to know if the method you're calling will change what calls your code makes. You need to understand the internal details of the target method. That's a terrible, terrible thing to do to Rubyists.

The same goes for refinements that are active down a class hierarchy. You can no longer extend a class and know that methods you call actually do what you expect. Instead, you have to know whether your parent classes or their ancestors refine some call you intend to make. I would argue this is considerably worse than directly monkey-patching some class, since at least in that case every piece of code has a uniform view.

The problems are compounded over time, too. As libraries you use change, you need to again review them to see if refinements are in play. You need to understand all those refinements just to be able to reason about your own code. And you need to hope and pray two libraries you're using don't define different refinements, causing one half of your application to behave one way and the other half of your application to behave another way.

I believe the current implementation of refinements introduces more complexity than it solves, mostly due to the lack of a strict lexical "using". Rubyists should be able to look at a piece of code and know what it does based solely on the types of objects it calls. Refinements make that impossible.

Update: Josh Ballanco points out another usability problem: "using" only affects method bodies defined temporally after it is called. For example, the following code only refines the "bar" method, not the "foo" method.

This may simply be an artifact of the current implementation, or it may be specified behavior; it's hard to tell since there's no specification of any kind other than the implementation and a handful of tests. In any case, it's yet another confusing aspect, since it means the order in which code is loaded can actually change which refinements are active.

tl;dr

My point here is not to beat down refinements. I agree there are cases where they'd be very useful, especially given the sort of monkey-patching I've seen in the wild. But the current implementation overreaches; it provides several features of questionable value, while simultaneously making both performance and understandability harder to achieve. Hopefully we'll be able to work with Matz and ruby-core to come up with a more reasonable, limited version of refinements...or else convince them not to include refinements in Ruby 2.0.