Tuesday, September 4, 2012

Avoiding Hash Lookups in a Ruby Implementation

I had an interesting realization tonight: I'm terrified of hash tables. Specifically, my work on JRuby (and even more directly, my work optimizing JRuby) has made me terrified to ever consider using a hash table in the hot path of any program or piece of code if there's any possibility of eliminating it. And what I've learned over the years is that the vast majority of execution-related (as opposed to data-related, purely dynamic-sourced lookup tables) hash tables are totally unnecessary.

Some background might be interesting here.

Hashes are a Language Designer's First Tool

Anyone who's ever designed a simple language knows that pretty much everything you do is trivial to implement as a hash table. Dynamically-expanding tables of functions or methods? Hash table! Variables? Hash table! Globals? Hash table!

In fact, some languages never graduate beyond this phase and remain essentially gobs and gobs of hash tables even in fairly recent implementations. I won't name your favorite language here, but I will name one of mine: Ruby.

Ruby: A Study in Hashes All Over the Freaking Place

As with many dynamic languages, early (for some definition of "early") implementations of Ruby used hash tables all over the place. Let's just take a brief tour through the many places hash tables are used in Ruby 1.8.7.

(Author's note: 1.8.7 is now, by most measures, the "old" Ruby implementation, having been largely supplanted by the 1.9 series which boasts a "real" VM and optimizations to avoid most hot-path hash lookup.)

In Ruby (1.8.7), all of the following are (usually) implemented using hash lookups (and of these, many are hash lookups nearly every time, without any caching constructs):
  • Method Lookup: Ruby's class hierarchy is essentially a tree of hash tables that contain, among other things, methods. Searching for a method involves searching the target object's class. If that fails, you must search the parent class, and so on. In the absence of any sort of caching, this can mean you search all the way up to the root of the hierarchy (Object or Kernel, depending what you consider root) to find the method you need to invoke. This is also known as "slow".
  • Instance Variables: In Ruby, you do not declare ahead of time what variables a given class's object instances will contain. Instead, instance variables are allocated as they're assigned, like a hash table. And in fact, most Ruby implementations still use a hash table for variables under some circumstances, even though most of these variables can be statically determined ahead of time or dynamically determined (to static ends) at runtime.
  • Constants: Ruby's constants are actually "mostly" constant. They're a bit more like "const" in C, assignable once and never assignable again. Except that they are assignable again through various mechanisms. In any case, constants are also not declared ahead of time and are not purely a hierarchically-structured construct (they are both lexically and hierarchically scoped), and as a result the simplest implementation is a hash table (or chains of hash tables), once again.
  • Global Variables: Globals are frequently implemented as a top-level hash table even in modern, optimized languages. They're also evil and you shouldn't use them, so most implementations don't even bother making them anything other than a hash table.
  • Local Variables: Oh yes, Ruby has not been immune to the greatest evil of all: purely hash table-based local variables. A "pure" version of Python would have to do the same, although in practice no implementations really support that (and yes, you can manipulate the execution frame to gain "hash like" behavior for Python locals, but you must surrender your Good Programmer's Card if you do). In Ruby's defense, however, hash tables were only ever used for closure scopes (blocks, etc), and no modern implementations of Ruby use hash tables for locals in any way.
There are other cases (like class variables) that are less interesting than these, but this list serves to show how easy it is for a language implementer to fall into the "everything's a hash, dude!" hole, only to find they have an incredibly flexible and totally useless language. Ruby is not such a language, and almost all of these cases can be optimized into largely static, predictable code paths with nary a hash calculation or lookup to be found.
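To make the "tree of hash tables" picture concrete, here is a minimal sketch of uncached method lookup. ToyClass and find_method are names invented for this illustration; they are not taken from any real Ruby implementation.

```ruby
# Toy model of a class: a method table (a Hash) plus a link to its parent.
# ToyClass and find_method are hypothetical names used only for illustration.
class ToyClass
  attr_reader :name, :parent, :methods_table

  def initialize(name, parent = nil)
    @name = name
    @parent = parent
    @methods_table = {}
  end

  # Walk up the hierarchy, doing one hash lookup per class visited.
  # Returns the method body and the number of hash lookups performed.
  def find_method(method_name)
    lookups = 0
    klass = self
    while klass
      lookups += 1
      body = klass.methods_table[method_name]
      return [body, lookups] if body
      klass = klass.parent
    end
    [nil, lookups]
  end
end

object_class = ToyClass.new("Object")
object_class.methods_table[:to_s] = ->(receiver) { "#<#{receiver}>" }
animal = ToyClass.new("Animal", object_class)
dog = ToyClass.new("Dog", animal)
dog.methods_table[:bark] = ->(_receiver) { "Woof" }

_body, lookups = dog.find_method(:to_s)
puts "to_s found after #{lookups} hash lookups"
```

Three hash lookups for a single to_s call on a hierarchy only three classes deep, and that is the cost the rest of this post sets out to remove.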

How? I'm glad you asked.

JRuby: The Quest For Fewer Hashes

If I were to sum up the past 6 years I've spent optimizing JRuby (and learning how to optimize dynamic languages) it would be with the following phrase: Get Rid Of Hash Lookups.

When I tweeted about this realization yesterday, I got a few replies back about better hashing algorithms (e.g. "perfect" hashes) and a few replies from puzzled folks ("what's wrong with hashes?"), which made me realize that it's not always apparent how unnecessary most (execution-related) hash lookups really are (and from now on, when I talk about unnecessary or optimizable hash lookups, I'm talking about execution-related hash lookups; you data folks can get off my back right now).

So perhaps we should talk a little about why hashes are bad in the first place.

What's Wrong With a Little Hash, Bro?

The most obvious problem with using hash tables is the mind-crunching frustration of finding THE PERFECT HASH ALGORITHM. Every year there's a new way to calculate String hashes, for example, that's [ better | faster | securer | awesomer ] than all precedents. JRuby, along with many other languages, actually released a security fix last year to patch the great hash collision DoS exploit so many folks made a big deal about (while us language implementers just sighed and said "maybe you don't actually want a hash table here, kids"). Now, the implementation we put in place has again been "exploited" and we're told we need to move to cryptographic hashing. Srsly? How about we just give you a crypto-awesome-mersenne-randomized hash impl you can use for all your outward-facing hash tables and you can leave us the hell alone?

But I digress.

Obviously the cost of calculating hash codes is the first sin of a hash table. The second sin is deciding how, based on that hash code, you will distribute buckets. Too many buckets and you're wasting space. Too few and you're more likely to have a collision. Ahh, the intricate dance of space and time plagues us forever.

Ok, so let's say we've got some absolutely smashing hash algorithm and foresight enough to balance our buckets so well we make Lady Justice shed a tear. We're still screwed, my friends, because we've almost certainly defeated the prediction and optimization capabilities of our VM or our M, and we've permanently signed over performance in exchange for ease of implementation.

It is conceivable that a really good machine can learn our hash algorithm really well, but in the case of string hashing we still have to walk some memory to give us reasonable assurance of unique hash codes. So there's performance rule #1 violated: never read from memory.

Even if we ignore the cost of calculating a hash code, which at worst requires reading some object data from memory and at best requires reading a cached hash code from elsewhere in memory, we have to contend with how the buckets are implemented. Most hash tables implement the buckets as either of the typical list forms: an array (contiguous memory locations in a big chunk, so each element must be dereferenced...O(1) complexity) or a linked list (one entry chaining to the next through some sort of memory dereference, leading to O(N) complexity for searching collided entries).

Assuming we're using simple arrays, we're still making life hard for the machine since it has to see through at least one and possibly several mostly-opaque memory references. By the time we've got the data we're after, we've done a bunch of memory-driven calculations to find a chain of memory dereferences. And you wanted this to be fast?
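The steps described above (compute a hash code, pick a bucket, walk whatever is in it) can be sketched as a deliberately simple chained table. ToyHashTable and its probes counter are hypothetical and exist only to make the memory-chasing visible.

```ruby
# A deliberately simple chained hash table. ToyHashTable is a hypothetical
# name; real implementations are far more tuned, but the steps are the same.
class ToyHashTable
  attr_reader :probes # entries examined by the most recent fetch

  def initialize(bucket_count = 8)
    @buckets = Array.new(bucket_count) { [] }
    @probes = 0
  end

  def store(key, value)
    bucket = bucket_for(key)
    entry = bucket.find { |k, _| k.eql?(key) }
    if entry
      entry[1] = value
    else
      bucket << [key, value]
    end
  end

  def fetch(key)
    @probes = 0
    bucket_for(key).each do |k, v| # step 3: walk the collided entries
      @probes += 1
      return v if k.eql?(key)
    end
    nil
  end

  private

  def bucket_for(key)
    code = key.hash                # step 1: compute (or load) the hash code
    @buckets[code % @buckets.size] # step 2: pick a bucket, one more dereference
  end
end

table = ToyHashTable.new(1) # a single bucket forces every key to collide
%w[foo bar baz].each_with_index { |key, i| table.store(key, i) }
value = table.fetch("baz")
puts "found #{value} after examining #{table.probes} entries"
```

With one bucket the lookup degrades to a linear scan, which is the O(N) collision case; with well-balanced buckets it is still a hash calculation plus at least two dependent memory reads.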

Get Rid Of The Hash

Early attempts (of mine and others) to optimize JRuby centered around making hashing as cheap as possible. We made sure our tables only accepted interned strings, so we could guarantee they'd already calculated and cached their hash values. We used the "programmer's hash", switch statements, to localize hash lookups closer to the code performing them, rather than trying to balance buckets. We explored complicated implementations of hierarchical hash tables that "saw through" to parents, so we could represent hierarchical method table relationships in (close to) O(1) complexity.

But we were missing the point. The problem was in our representing any of these language features as hash tables to begin with. And so we started working toward the implementation that has made JRuby actually become the fastest Ruby implementation: eliminate all hash lookups from hot execution paths.

How? Oh right, that's what we were talking about. I'll tell you.

Method Tables

I mentioned earlier that in Ruby, each class contains a method table (a hash table from method name to a piece of code that it binds) and method lookup proceeds up the class hierarchy. What I didn't tell you is that both the method tables and the hierarchy are mutable at runtime.

Hear that sound? It's the static-language fanatics' heads exploding. Or maybe the "everything must be mutable always forever or you are a very bad monkey" fanatics. Whatever.

Ruby is what it is, and the ability to mix in new method tables and patch existing method tables at runtime is part of what makes it attractive. Indeed, it's a huge part of what made frameworks like Rails possible, and also a huge reason why other more static (or more reasonable, depending on how you look at it) languages have had such difficulty replicating Rails' success.
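For anyone who hasn't seen this in action, here is a small plain-Ruby illustration of both kinds of mutation. Greeter and Shouting are made-up names for the example.

```ruby
class Greeter
  def greet
    "hello"
  end
end

greeter = Greeter.new
before = greeter.greet

# Patch the existing method table: reopen the class and redefine greet.
class Greeter
  def greet
    "hi"
  end
end
after_patch = greeter.greet

# Change the hierarchy: mix a new method table into the class's ancestors.
module Shouting
  def shout
    greet.upcase + "!"
  end
end
could_shout_before = greeter.respond_to?(:shout)

class Greeter
  include Shouting
end

puts before, after_patch, greeter.shout
```

The same object, created before either change, sees both of them. Any cached lookup result has to cope with that.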

Mine is not to reason why. Mine is but to do and die. I have to make it fast.

Proceeding from the naive implementation, there are certain truths we can hold at various times during execution:
  • Most method table and hierarchy manipulation will happen early in execution. This was true when I started working on JRuby and it's largely true now, in no small part due to the fact that optimizing method tables and hierarchies that are wildly different all the time is really, really hard (so no implementer does it, so no user should do it). Before you say it: even prototype-based languages like JavaScript that appear to have no fixed structure do indeed settle into a finite set of predictable, optimizable "shapes" which VMs like V8 can take advantage of.
  • When changes do happen, they only affect a limited set of observers. Specifically, only call sites (the places where you actually make calls in code) need to know about the changes, and even they only need to know about them if they've already made some decision based on the old structure.
So we can assume method hierarchy structure is mostly static, and when it isn't there's only a limited set of cases where we care. How can we exploit that?

First, we implement what's called an "inline cache" at the call sites. In other words, every place where Ruby code makes a method call, we keep a slot in memory for the most recent method we looked up. In another quirk of fate, it turns out most calls are "monomorphic" ("one shape") so caching more than one is usually not beneficial.

When we revisit the cache, we need to know we've still got the right method. Obviously it would be stupid to do a full search of the target object's class hierarchy all over again, so what we want is to simply be able to examine the type of the object and know we're ok to use the same method. In JRuby, this is (usually) done by assigning a unique serial number to every class in the system, and caching that serial number along with the method at the call site.

Oh, but wait...how do we know if the class or its ancestors have been modified?

A simple implementation would be to keep a single global serial number that gets spun every time any method table or class hierarchy anywhere in the system is modified. If we assume that those changes eventually stop, this is good enough; the system stabilizes, the global serial number never changes, and all our cached methods are safely tucked away for the machine to branch-predict and optimize to death. This is how Ruby 1.9.3 optimizes inline caches (and I believe Ruby 2.0 works the same way).
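Here is a minimal sketch of an inline cache guarded by one global serial number. Everything in the Toy module (Klass, CallSite, global_serial, full_lookups) is a hypothetical model written for this post, not code from Ruby 1.9.3 or JRuby.

```ruby
module Toy
  class << self
    attr_accessor :global_serial, :full_lookups
  end
  self.global_serial = 0
  self.full_lookups = 0 # counts how often we fall back to the slow search

  class Klass
    attr_reader :parent, :table

    def initialize(parent = nil)
      @parent = parent
      @table = {}
    end

    def define(name, &body)
      @table[name] = body
      Toy.global_serial += 1 # any change anywhere spins the one global serial
    end

    # The slow path: hash lookups up the whole hierarchy.
    def slow_lookup(name)
      Toy.full_lookups += 1
      klass = self
      while klass
        return klass.table[name] if klass.table.key?(name)
        klass = klass.parent
      end
      nil
    end
  end

  # One of these lives at every place in the code that makes a call.
  class CallSite
    def initialize(name)
      @name = name
      @klass = @serial = @body = nil
    end

    def call(receiver_class, *args)
      unless @klass.equal?(receiver_class) && @serial == Toy.global_serial
        @body = receiver_class.slow_lookup(@name)
        @klass = receiver_class
        @serial = Toy.global_serial
      end
      @body.call(*args)
    end
  end
end

base = Toy::Klass.new
base.define(:double) { |x| x * 2 }
child = Toy::Klass.new(base)

site = Toy::CallSite.new(:double)
results = Array.new(3) { site.call(child, 21) }
lookups_when_stable = Toy.full_lookups

Toy::Klass.new.define(:unrelated) { nil } # a change on an unrelated class
site.call(child, 21)
lookups_after_change = Toy.full_lookups
puts "#{lookups_when_stable} slow lookup while stable, #{lookups_after_change} after an unrelated change"
```

While nothing changes, three calls cost one slow lookup. The catch shows up in the last few lines: defining a method on a completely unrelated class still forces this site to search again.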

Unfortunately, our perfect world isn't quite so perfect. Methods do get defined at runtime, especially in Ruby where people often create one-off "singleton methods" that only redefine a couple methods for very localized use. We don't want such changes to blow all inline caches everywhere, do we?

Let's split up the serial number by method name. That way, if you are only redefining the "foobar" method on your singletons, only inline caches for "foobar" calls will be impacted. Much better! This is how Rubinius implements cache invalidation.

Unfortunately again, it turns out that the methods people override on singletons are very often common methods like "hash" or "to_s" or "inspect", which means that a purely name-based invalidator still causes a large number of call sites to fail. Bummer.
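A rough sketch of name-based invalidation, and of why it hurts for popular names. SerialBox, NAME_SERIALS, note_redefinition, and NamedSite are all hypothetical; this models the idea only and is not Rubinius code.

```ruby
# One counter per method name. A site grabs the counter object for its name
# once, at creation, so the per-call check is just an integer comparison.
SerialBox = Struct.new(:value)
NAME_SERIALS = Hash.new { |table, name| table[name] = SerialBox.new(0) }

# Called whenever any class or singleton (re)defines a method with this name.
def note_redefinition(name)
  NAME_SERIALS[name].value += 1
end

class NamedSite
  def initialize(name)
    @box = NAME_SERIALS[name]
    @seen = nil
  end

  # True if the cached method could be reused; refreshes the snapshot either way.
  def cache_hit?
    hit = (@seen == @box.value)
    @seen = @box.value
    hit
  end
end

to_s_site_a = NamedSite.new(:to_s)   # imagine these two sit in unrelated code
to_s_site_b = NamedSite.new(:to_s)
foobar_site = NamedSite.new(:foobar)
[to_s_site_a, to_s_site_b, foobar_site].each(&:cache_hit?) # warm all three

note_redefinition(:to_s) # e.g. someone defines to_s on a single object

hits = [to_s_site_a, to_s_site_b, foobar_site].map(&:cache_hit?)
p hits
```

The foobar site survives, but every to_s site in the program misses because of one singleton.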

In JRuby, we went through the above mechanisms and several others, finally settling on one that allows us to only ever invalidate the call sites that actually called a given method against a given type. And it's actually pretty simple: we spin the serial numbers on the individual classes, rather than in any global location.

Every Ruby class has one parent and zero or more children. The parent connection is obviously a hard link, since at various points during execution we need to be able to walk up the class hierarchy. In JRuby, we also add a weak link from parents to children, updated whenever the hierarchy changes. This allows changes anywhere in a class hierarchy to cascade down to all children, localizing changes to just that subhierarchy rather than inflicting its damage upon more global scopes.

Essentially, by actively invalidating down-hierarchy classes' serial numbers, we automatically know that matching serial numbers at call sites mean the cached method is 100% ok to use. We have reduced O(N) hierarchically-oriented hash table lookups to a single identity check. Victory!
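The per-class scheme can be sketched like this. SerialClass and SerialSite are hypothetical names, and the sketch keeps children in a plain array where JRuby uses weak links; treat it as a model of the idea rather than JRuby's actual code.

```ruby
class SerialClass
  @next_serial = 0
  def self.fresh_serial
    @next_serial += 1
  end

  attr_reader :serial, :parent

  def initialize(parent = nil)
    @parent = parent
    @children = [] # a real VM would hold these weakly
    @table = {}
    @serial = SerialClass.fresh_serial
    parent.add_child(self) if parent
  end

  def add_child(child)
    @children << child
  end

  def define(name, &body)
    @table[name] = body
    invalidate
  end

  # Give this class and everything below it a brand new serial number.
  def invalidate
    @serial = SerialClass.fresh_serial
    @children.each(&:invalidate)
  end

  def own_method(name)
    @table[name]
  end

  def slow_lookup(name)
    klass = self
    while klass
      body = klass.own_method(name)
      return body if body
      klass = klass.parent
    end
    nil
  end
end

class SerialSite
  attr_reader :misses

  def initialize(name)
    @name = name
    @serial = @body = nil
    @misses = 0
  end

  def call(klass, *args)
    if @serial != klass.serial # the fast path is this one comparison
      @misses += 1
      @body = klass.slow_lookup(@name)
      @serial = klass.serial
    end
    @body.call(*args)
  end
end

object = SerialClass.new
animal = SerialClass.new(object)
dog    = SerialClass.new(animal)
number = SerialClass.new(object)
animal.define(:speak) { "..." }
number.define(:zero) { 0 }

speak_site = SerialSite.new(:speak)
zero_site  = SerialSite.new(:zero)
2.times { speak_site.call(dog); zero_site.call(number) } # one miss each, then hits

dog_serial_before = dog.serial
animal.define(:speak) { "Woof" } # cascades down to dog, leaves number alone
puts speak_site.call(dog)
```

Because serials come from one counter and are never reused, a matching serial identifies both the class and its current version, so the call site needs nothing else.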

Instance Variables

Optimizing method lookups actually turned out to be the easiest trick we had to pull. Instance variables defied optimization for a good while. Oddly enough, most Ruby implementations stumbled on a reasonably simple mechanism at the same time.

Ruby instance variables can be thought of as C++ or Java fields that only come into existence at runtime, when code actually starts using them. And where C++ and Java fields can be optimized right into the object's structure, Ruby instance variables have typically been implemented as a hash table that can grow and adapt to a running program as it runs.
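A quick plain-Ruby illustration of variables that "only come into existence at runtime". Point and label! are made-up names for the example.

```ruby
class Point
  def initialize(x)
    @x = x
  end

  def label!(text)
    @label = text # first assignment creates the variable, on this object only
  end
end

a = Point.new(1)
b = Point.new(2)
b.label!("origin")

p a.instance_variables # [:@x]
p b.instance_variables # [:@x, :@label]  (symbols in Ruby 1.9 and later)
```

Two instances of the same class, two different sets of fields, and nothing declared up front.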

Using a hash table for instance variables has some obvious issues:
  • The aforementioned performance costs of using hashes
  • Space concerns; a collection of buckets already consumes space for some sort of table, and too many buckets means you are using way more space per object than you want
At first you might think this problem can be tackled exactly the same way as method lookup, but you'd be wrong. What do we cache at the call site? It's not code we need to keep close to the point of use, it's the steps necessary to reach a point in a given object where a value is stored (ok, that could be considered code...just bear with me for a minute).

There are, however, truths we can exploit in this case as well.
  • A given class of objects will generally reference a small, finite number of variable names during the lifetime of a given program.
  • If a variable is accessed once, it is very likely to be accessed again.
  • The set of variables used by a particular class of objects is largely unique to that class of objects.
  • The majority of the variables ever to be accessed can be determined by inspecting the code contained in that class and its superclasses.
This gives us a lot to work with. Since we can localize the set of variables to a given class, that means we can store something at the class level. How about the actual layout of the values in object instances of that class?

This is how most current implementations of Ruby actually work.

In JRuby, as instance variables are first assigned, we bump a counter on the class that indicates an offset into an instance variable table associated with instances of that class. Eventually, all variables have been encountered and that table and that counter stop changing. Future instances of those objects, then, know exactly how large the table needs to be and which variables are located where.

Invalidation of a given instance variable "call site" is then once again a simple class identity check. If we have the same class in hand, we know the offset into the object is guaranteed to be the same, and therefore we can go straight in without doing any hash lookup whatsoever.
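A minimal sketch of the offset-table approach described above. Layout, ToyObject, and IvarSite are hypothetical names; the Layout stands in for the per-class table and the identity check on it stands in for the class identity check.

```ruby
# Per-class bookkeeping: which variable lives at which offset.
class Layout
  attr_reader :offsets

  def initialize
    @offsets = {} # name => offset; consulted only on a cache miss
  end

  def offset_for(name)
    @offsets[name] ||= @offsets.size # bump the counter the first time we see a name
  end
end

class ToyObject
  attr_reader :layout, :slots

  def initialize(layout)
    @layout = layout
    @slots = Array.new(layout.offsets.size) # sized from what the class has seen so far
  end
end

# One of these per place in the code that reads or writes a given variable.
class IvarSite
  attr_reader :slow_paths

  def initialize(name)
    @name = name
    @cached_layout = @cached_offset = nil
    @slow_paths = 0
  end

  def get(obj)
    obj.slots[offset_in(obj)]
  end

  def set(obj, value)
    obj.slots[offset_in(obj)] = value
  end

  private

  def offset_in(obj)
    unless @cached_layout.equal?(obj.layout) # same class in hand? same offset.
      @slow_paths += 1
      @cached_offset = obj.layout.offset_for(@name)
      @cached_layout = obj.layout
    end
    @cached_offset
  end
end

point_layout = Layout.new
set_x = IvarSite.new(:@x)
set_y = IvarSite.new(:@y)
get_y = IvarSite.new(:@y)

first = ToyObject.new(point_layout)
set_x.set(first, 1)
set_y.set(first, 2)

second = ToyObject.new(point_layout) # allocated after the layout settled
set_x.set(second, 10)
set_y.set(second, 20)
puts get_y.get(second)
```

The only hash access is inside offset_for, and each site reaches it once. Note the separate slots array per object: that is the extra object and extra dereference discussed next.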

Rubinius does things a little differently here. Instead of tracking the offsets at runtime, the Rubinius VM will examine all code associated with a class and use that to make a guess about how many variables will be needed. It sets up a table on the class ahead of time for those statically-determined names, and allocates exactly as much space for the object's header + those variables in memory (as opposed to JRuby, where the object and its table are two separate objects). This allows Rubinius to pack those known variables into a tighter space without hopping through the extra dereference JRuby has, and in many cases, this can translate to faster access.

However, both cases have their failures. In JRuby's version, we pay the cost of a second object (an array of values) and a pointer dereference to reach it, even if we can cache the offset 100% successfully at the call site. This translates to larger memory footprints and somewhat slower access times. In Rubinius, variables that are dynamically allocated fall back on a simple hash table, so dynamically-generated (or dynamically-mutated) classes may end up accessing some values in a much slower way than others.
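To illustrate the "guess ahead of time, fall back to a hash" tradeoff, here is a very rough sketch. It guesses variable names with a regular expression over source text, which is only a stand-in for whatever analysis a real VM performs; ACCOUNT_SOURCE, KNOWN_NAMES, and PackedObject are hypothetical, and this is not how Rubinius is actually written.

```ruby
ACCOUNT_SOURCE = <<~'SRC'
  class Account
    def initialize(owner)
      @owner = owner
      @balance = 0
    end

    def deposit(amount)
      @balance += amount
    end
  end
SRC

# Statically guess the variable names by scanning the class's code.
KNOWN_NAMES = ACCOUNT_SOURCE.scan(/@\w+/).uniq.map(&:to_sym)

class PackedObject
  def initialize(known_names)
    # In a real VM this index would live on the class and the resolved slot
    # would be cached at the access site; it is per-object here for brevity.
    @index = known_names.each_with_index.to_h
    @slots = Array.new(known_names.size) # packed storage for the predicted names
    @overflow = nil                      # hash created only if a surprise shows up
  end

  def ivar_set(name, value)
    if (slot = @index[name])
      @slots[slot] = value
    else
      (@overflow ||= {})[name] = value
    end
  end

  def ivar_get(name)
    if (slot = @index[name])
      @slots[slot]
    else
      @overflow && @overflow[name]
    end
  end

  def overflowed?
    !@overflow.nil?
  end
end

account = PackedObject.new(KNOWN_NAMES)
account.ivar_set(:@owner, "ann")
account.ivar_set(:@balance, 5)
packed_only = !account.overflowed?

account.ivar_set(:@note, "added dynamically") # not visible in the source scan
p KNOWN_NAMES
```

Predicted names get compact slots; anything the scan could not see lands in the slower overflow hash.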

The quest for perfect Ruby instance variable tables continues, but at least we have the tools to almost completely eliminate hashes right now.

Constants

The last case I'm going to cover in depth is that of "constant" values in Ruby.

Constants are, as I mentioned earlier, stored on classes in another hash table. If that were their only means of access, they would be uninteresting; we could use exactly the same mechanism for caching them as we do for methods, since they'd follow the same structure and behavior (other than being somewhat more static than method tables). Unfortunately, that's not the case; constants are located based on both lexical and hierarchical searches.

In Ruby, if you define a class or module, all constants lexically contained in that type's enclosing scopes are also visible within the type. This makes it possible to define new lexically-scoped aliases for values that might otherwise be difficult to retrieve without walking a class hierarchy or requiring a parent/child relationship to make those aliases visible. It also defeats nearly all reasonable mechanisms for eliminating hash lookups.

When you access a constant in Ruby, the implementation must first search all lexically-enclosing scopes. Each scope has a type (class or module) associated, and we check that type (and not its parents) for the constant name in question. Failing that, we fall back on the current type's class hierarchy, searching all the way up to the root type. Obviously, this could be far more searching than even method lookup, and we want to eliminate it.
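The two-phase search is easy to see in plain Ruby. Outer, Base, and Child are made-up names for the example.

```ruby
module Outer
  LIMIT = 10 # visible lexically to everything nested inside Outer

  class Base
    NAME = "base" # visible to subclasses through the class hierarchy
    LIMIT = 99
  end

  class Child < Base
    def self.limit
      LIMIT # enclosing scopes are searched first, so this finds Outer::LIMIT
    end

    def self.name_const
      NAME # not in any enclosing scope, so it is found by walking up to Base
    end
  end
end

p Outer::Child.limit      # 10
p Outer::Child.name_const # "base"
p Outer::Child::LIMIT     # 99, an explicitly scoped lookup skips the lexical search
```

The same name resolves to two different values depending on how the lookup starts, which is why a constant access site can't simply reuse the method-cache trick.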

If we had all the space in the world and no need to worry about dangling references, using our down-hierarchy method table invalidation would actually work very well here. We'd simply add another hierarchy for invalidation: lexical scopes. In practice, however, this is not feasible (or at least I have not found a way to make it feasible) since there are many times more lexical scopes in a given system than there are types, and a large number of those scopes are transient; we'd be tracking thousands or tens of thousands of parent/child relationships weakly all over the codebase. Even worse, invalidation due to constant updates or hierarchy changes would have to proceed both down the class hierarchy and throughout all lexically-enclosing scopes in the entire system. Ouch!

The current state of the art for Ruby implementations is basically our good old global serial number. Change a constant anywhere in Ruby 1.9.3, Rubinius, or JRuby, and you have just caused all constant access sites to invalidate (or they'll invalidate next time they're encountered). Now this sounds bad, perhaps because I told you it was bad above for method caching. But remember that the majority of Ruby programmers advise and practice the art of keeping constants...constant. Most of the big-name Ruby folks would call it a bug if your code is continually assigning or reassigning constants at runtime; there are other structures you could be using that are better suited to mutation, they might say. And in general, most modern Ruby libraries and frameworks do keep constants constant.

I'll admit we could do better here, especially if the world changed such that mutating constants was considered proper and advisable. But until that happens, we have again managed to eliminate hash lookups by caching values based on a (hopefully rarely modified) global serial number.
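A minimal sketch of a constant access site guarded by a global serial. ConstRuntime and ConstSite are hypothetical names, and the flat table stands in for the full lexical-plus-hierarchical search.

```ruby
module ConstRuntime
  class << self
    attr_reader :serial, :slow_lookups

    def reset
      @table = {}
      @serial = 0
      @slow_lookups = 0
    end

    def set(name, value)
      @table[name] = value
      @serial += 1 # any constant assignment anywhere invalidates every site
    end

    def slow_lookup(name)
      @slow_lookups += 1 # stands in for the lexical + hierarchy search
      @table.fetch(name)
    end
  end
  reset

  # One per place in the code that reads a constant.
  class ConstSite
    def initialize(name)
      @name = name
      @serial = @value = nil
    end

    def value
      if @serial != ConstRuntime.serial
        @value = ConstRuntime.slow_lookup(@name)
        @serial = ConstRuntime.serial
      end
      @value
    end
  end
end

ConstRuntime.set(:MAX, 100)
ConstRuntime.set(:NAME, "app")

max_site = ConstRuntime::ConstSite.new(:MAX)
reads = Array.new(1000) { max_site.value }
lookups_while_constant = ConstRuntime.slow_lookups

ConstRuntime.set(:NAME, "renamed") # an unrelated constant changes
max_site.value
lookups_after_reassign = ConstRuntime.slow_lookups
puts "#{lookups_while_constant} slow lookup for #{reads.size} reads"
```

As long as constants stay constant, a thousand reads cost one search. Reassigning any constant costs every site one extra search, which is tolerable precisely because it is supposed to be rare.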

The Others

I did not go into the others because the solutions are either simple or not particularly interesting.

Local variables in any sane language (flame on!) are statically determinable at parse/compile time (rather than being dynamically scoped or determined at runtime). In JRuby, Ruby 1.9.3, and Rubinius, local variables are in all cases a simple tuple of offset into an execution frame and some depth at which to find the appropriate frame in the case of closures.
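As a sketch of that (depth, offset) tuple: Frame and LocalRef are hypothetical names, and the frames are built by hand here where a compiler would emit them.

```ruby
# A frame is just an array of slots plus a link to the enclosing frame.
Frame = Struct.new(:slots, :parent)

# What a compiler would bake into the code for each variable reference.
LocalRef = Struct.new(:depth, :offset) do
  def read(frame)
    depth.times { frame = frame.parent } # hop out through enclosing closures
    frame.slots[offset]
  end

  def write(frame, value)
    depth.times { frame = frame.parent }
    frame.slots[offset] = value
  end
end

# Hand-compiled version of:
#   total = 0
#   [1, 2, 3].each { |n| total += n }
method_frame = Frame.new([0], nil)          # slot 0 holds total
block_frame  = Frame.new([nil], method_frame) # slot 0 holds n

n_ref     = LocalRef.new(0, 0) # n: current frame, slot 0
total_ref = LocalRef.new(1, 0) # total: one frame out, slot 0

[1, 2, 3].each do |n|
  n_ref.write(block_frame, n)
  total_ref.write(block_frame, total_ref.read(block_frame) + n_ref.read(block_frame))
end

puts method_frame.slots[0]
```

No names survive to runtime at all; every access is a fixed number of pointer hops and an array index.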

Global variables are largely discouraged, and usually only accessed at boot time to prepare more locally-defined values (e.g. configuration or environment variable access). In JRuby, we have experimented with mechanisms to cache global variable accessor logic in a way similar to instance variable accessors, but it turned out to be so rarely useful that we never shipped it.

Ruby also has another type of variable called a "class variable", which follows lookup rules almost identical to methods. We don't currently optimize these in JRuby, but it's on my to-do list.

Final Words

There are of course many other ways to avoid hash lookups, with probably the most robust and ambitious being code generation. Ruby developers, JIT compiler writers, and library authors have all used code generation to take what is a mostly-static lookup table and turn it into actually-static code. But you must be careful here to not fall into the trap of simply stuffing your hash logic into a switch table; you're still doing a calculation and some kind of indirection (memory dereference or code jump) to get to your target. Analyze the situation and figure out what immutable truths there are you can exploit, and you too can avoid the evils of hashes.
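One small plain-Ruby example of turning a mostly-static table into real methods. STATUS_TEXT, status_lookup, and Status are made-up names for the illustration.

```ruby
STATUS_TEXT = { ok: "200 OK", not_found: "404 Not Found" }.freeze

# Table-driven version: a hash lookup on every single call.
def status_lookup(name)
  STATUS_TEXT.fetch(name)
end

# Generated version: one real method per entry, defined once up front.
module Status
  STATUS_TEXT.each do |name, text|
    define_singleton_method(name) { text }
  end
end

puts status_lookup(:ok)
puts Status.ok
puts Status.not_found
```

After generation, Status.ok is an ordinary method call, so it benefits from the same call site caching as any other method instead of hashing a key each time.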

8 comments:

  1. Nice write-up, thanks. It's probably worth noticing `eval` regarding local vars. In old Ruby versions it was possible to create a local var using `eval`.

    The same rule goes for JS implementations. And it's even standardized that `eval` cannot create a local var since it is executed in its own execution context. This gives implementors the ability to use effectively allocated activation records on entering the context. Prior to that, even the standard said that activation objects are simple objects (e.g. hashes).

  2. While it is true that eval can dynamically allocate variables in Ruby, it does not add variables to any non-eval scope. Under Ruby 1.8 eval'ed code executes under a hidden scope contained within the surrounding scope, shared across eval calls. Each call may increase the size of that scope, but once they start executing the set of variables is static. Under Ruby 1.9, each eval call gets its own scope, and the set of variables is determined once at parse time. So actually, there are no cases where variables must be represented as a hash in either Ruby 1.8 or 1.9, although eval calls can in 1.8 grow the existing hidden eval scope.

  3. This is a very good 'how to write a dynamic programming language properly' post indeed!

    One thing I did not see in there (though I might have missed it) is the main reason we at GE have avoided hash tables/maps in the global name space. That reason is threading. Hash tables are not thread safe. Even Java's concurrent hash is not thread safe, it is just parallel. As we do not have control over user threads, they can go to sleep at any point. If they did this whilst mutating a ConcurrentHashMap or one we hand locked then the mutex for that sub-set of the map will remain locked and the program becomes effectively broken. I.E. a user sleeping one thread could accidentally also sleep the thread which would then go on to wake the first one back up again.

    The solution we have is to use an atomic trie structure for global names. Access for global names is done via a call site. When the site is bootstrapped a lookup is done on the name space. If the name exists all is good. If it does not then an atomic update is done on the global names trie. This makes it impossible for that structure to be in a data race or be locked (no mutexes were used or hurt in the production of this code).

    The trie is less efficient than a hashmap - but it is only used to patch the sites and after that all sites for the same global share a common reference like objects.

    As threading becomes a bigger and bigger part of programming, the use of hashmaps becomes less and less attractive.

    Replies
    1. If you can do a thread-safe atomic trie, why couldn't you do a thread-safe atomic hashmap? From some perspectives, a trie is nothing more than a particular internal implementation of a hashmap, after all.

      Either way, you'll have the performance problems of hashmaps as in the OP, but I'm not seeing how concurrency matters, or if it does, why a trie is essentially better than a hashmap under concurrency.

    2. A trie is not inherently better than a hashmap. As you point out, a trie is a similar conceptual system. However, making a hashmap immutable is rather tricky. Changing a node in the trie only requires replacing the nodes from the change point to the root node. In a classic hash map, the lookup from a hash into the bucket array is only efficient because the position in the array can be pre-computed. This means the entire array must be created at once. So, if you make that immutable you need to replace the lookup array each time you change anything in the hashmap along with replacing all the elements in the appropriate bucket list from the start of that list to the replaced item.

      One solution to this problem is not to use an array but to use a tree for the initial lookup. This slows access from constant to log(n). But - once you have a tree, you might as well stick with a tree. A trie is easy to implement and is a nice compromise between memory and cpu performance.

      In another part of the same system as I was discussing we have very small hashmaps. In this case, it was most efficient to replace the entire hashmap each time and use an atomic reference.

  4. Fascinating article. This makes me wonder whether and how Python deals with similar issues, and more specifically whether there are any optimization implications for exposing the __dict__ attribute.

  5. J. Whitley: Python exposes many things that definitely do make optimization harder for Python implementers. You point out __dict__, and there's also frame access, the ability to replace get/set attribute methods, and so on. I'm sure there are ways around all this (or you may be able to ignore some of those lesser-used features), but it might require some different techniques than I talk about here.

  6. Very interesting, clear and detailed article.

    I might say something totally stupid but I was wondering what is a typical ratio of callsites / method? Intuitively I guess there are a lot more callsites than methods, so instead of caching the methods on all callsites, why not cache all the callsites at method definitions, so that when a method gets redefined or overloaded by another method in the look-up hierarchy, it can easily find all the current callsites and notify them which new method they should point at?
