Wednesday, December 27, 2006

Making Dynamic Invocation Fast: An Idea?

Evan Phoenix (of Rubinius fame) and I were discussing dynamic dispatch today on #rubinius, sharing our caching strategies and our dispatch woes. We talked at length about various strategies for speeding up dispatch, cache invalidation mechanisms, compilation techniques, and so on. All gloriously fun stuff.

So at one point I related to him the extended version of my plans for speeding up dynamic dispatch. I relate it now to you to hopefully spur some discussion. There are no original ideas anymore, right? Or are there?

The initial experiment with Fixnum was basically a static version of my fuzzy vision. At parse time, a very trivial static table mapped method names to numbers. I only handled three method names: +, -, and <. Those numbers were then passed to Fixnum during method dispatch, where a very trivial static switch statement used them to "fast dispatch" to the appropriate methods.
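
In rough Java terms, the experiment looked something like the following (a hypothetical reconstruction for illustration; the names and types are not the actual JRuby code):

    public class FixnumFastDispatch {
        // parse time: a trivial static table maps the three names to numbers
        static int methodIndex(String name) {
            if (name.equals("+")) return 1;
            if (name.equals("-")) return 2;
            if (name.equals("<")) return 3;
            return 0; // anything else uses the normal dynamic dispatch path
        }

        // dispatch time, inside Fixnum: a static switch "fast dispatches"
        static Object fastDispatch(int index, long self, long arg) {
            switch (index) {
                case 1: return Long.valueOf(self + arg);
                case 2: return Long.valueOf(self - arg);
                case 3: return Boolean.valueOf(self < arg);
                default: return null; // signal the caller to take the slow path
            }
        }
    }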

The ultimate vision, however, is much grander.

The general idea is that during dynamic dispatch, we want to use the method name and the class as keys to locate a specific method instance. The typical way this is done is by keeping a hashtable on every class, and looking up methods by name. This works reasonably well, but each class must have its own table and we must pay some hashtable cost for every search. This is compounded when we consider class hierarchies, where the method we're looking for might be in a super class or a super class's super class, ad infinitum.
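
For contrast, that conventional scheme looks roughly like this (an illustrative sketch; Runnable merely stands in for a real method object):

    import java.util.HashMap;
    import java.util.Map;

    // Conventional dispatch: a hashtable per class, plus a superclass walk.
    class MetaClass {
        final MetaClass superClass;
        final Map<String, Runnable> methods = new HashMap<String, Runnable>();

        MetaClass(MetaClass superClass) { this.superClass = superClass; }

        // a hashtable probe at every level of the hierarchy
        Runnable findMethod(String name) {
            for (MetaClass c = this; c != null; c = c.superClass) {
                Runnable m = c.methods.get(name);
                if (m != null) return m;
            }
            return null; // not found anywhere: method_missing
        }
    }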

Retrieving the class is the easy part; every method we want to dispatch will have a receiver, and receivers have a reference to their class.

So then, how do we perform this magic mapping from M method names and N classes to a given method? Well hell, that's a matrix!

So then the great idea: build a giant matrix mapping method names (method IDs) and classes (class IDs) to the appropriate switchable value. Then when we dispatch to the class, we use that switchable value in a neatly dense switch statement.

Let me repeat that in another way to make it clear...the values in the matrix are simple "int" values, ideally as closely-numbered (dense) as possible for a given class's set of methods. In order to dispatch to a named method on a given object, we have the following steps (a rough code sketch follows the list):
  1. Get the method ID from the method (likely calculated at parse or compile time)
  2. Get the class ID from the class (calculated sequentially at runtime)
  3. Index into the matrix using method ID and class ID to get the class-specific method switch value
  4. Use the switch value to dispatch (via a switch statement, of course) to the appropriate method in the target class
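Here's a minimal Java sketch of those four steps (all names here are illustrative, and how method and class IDs get assigned is glossed over):

    // Dense dispatch matrix: table[methodId][classId] -> class-local switch value.
    // A zero entry means the class doesn't implement the method.
    public class DispatchMatrix {
        private final int[][] table;

        public DispatchMatrix(int numMethods, int numClasses) {
            table = new int[numMethods][numClasses]; // zero-filled by default
        }

        public void define(int methodId, int classId, int switchValue) {
            table[methodId][classId] = switchValue;
        }

        // steps 1-3: both IDs are already known, so lookup is two array indexes
        public int switchValue(int methodId, int classId) {
            return table[methodId][classId];
        }
    }

    // Step 4, in the target class: a neatly dense switch over its own values.
    // switch (matrix.switchValue(methodId, receiver.classId())) {
    //     case 1: return receiver.plus(arg);
    //     case 2: return receiver.minus(arg);
    //     default: return receiver.methodMissing(methodName, arg); // zero lands here
    // }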
Ignoring the size of this matrix for a moment, there are a number of benefits to this data structure:
  • We have no need for per-class or inline method caches
  • Repopulating the switch table for a given class's methods is just a matter of refilling its column in the matrix with appropriate values
  • Since the matrix represents all classes, we can avoid searching hierarchies once parents' methods are known; we just represent those methods in the matrix under the child's column and fast-dispatch to them as normal
  • We can save off and reconstitute the mapping to avoid having to regenerate it
...and there's also the obvious benefit: using the switch values stored in the matrix, we can do a fast dispatch on the target object with only the minimal cost of looking up a value in the matrix. An additional benefit for C code would be simply storing the address of the appropriate code in the matrix, avoiding any switch values completely.

Now naturally we're talking about a big matrix here, I'll admit that. A fellow by the name of "Defiler" piped up and said an incomplete app of his showed 1147 unique method names across 1258 classes. That works out to a matrix with about 1.4M elements. In the Java world, where each entry would be a 32-bit int, that's around 6MB of memory consumed just for method mapping. Acceptable? Perhaps. But what about using a sparse matrix?

A sparse matrix attempts to be as efficient as possible when there are large numbers of "zeros" in the matrix elements. That is, interestingly enough, exactly what we have here. The vast majority of entries in our method-map would contain a zero, since most methods would not exist on most classes. Zero, as it turns out, would neatly map to our good friend "method_missing", and so dispatching to a method on a target class that doesn't implement it would continue to work as it does today. Even better, method_missing dispatch would be just as fast as a regular method dispatch, excluding the actual method_missing implementation code, of course.
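
As a sketch of how that might look (a hash map stands in for a real sparse-matrix structure here, and the key packing is an assumption):

    import java.util.HashMap;
    import java.util.Map;

    // Sparse dispatch table: only real method entries are stored.
    // An absent entry reads as zero, i.e. method_missing, so a miss
    // costs no more than a hit.
    public class SparseDispatch {
        private final Map<Long, Integer> entries = new HashMap<Long, Integer>();

        // pack both IDs into one key
        private static long key(int methodId, int classId) {
            return ((long) methodId << 32) | (classId & 0xFFFFFFFFL);
        }

        public void define(int methodId, int classId, int switchValue) {
            entries.put(key(methodId, classId), switchValue);
        }

        public int switchValue(int methodId, int classId) {
            Integer v = entries.get(key(methodId, classId));
            return v == null ? 0 : v; // zero = method_missing
        }
    }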

So then, there it is...a potential plan for fast dynamic method invocation in JRuby (and perhaps in Rubinius, if Evan agrees the idea has merit). Any of you out there heard of such a thing? Is this an old idea? Are there problems with it? Share!

11 comments:

  1. A lookup matrix of classes vs. method names sounds a lot like the way earlier Eiffel implementations worked. I believe they also went the sparse-table route. I'll see if I can dig up some documentation on that.

  2. First of all, I'd say that several per-class lookup tables will be faster than one huge lookup table: if you reference the row of your matrix directly from the associated class, you get one level of lookup for free.

    Now, to avoid the rows becoming huge, you need to compress them by removing the empty slots, as you said. Since this implies that you can't perform a direct lookup with the statically calculated method number, you really have just two good choices: either use binary search or perform a hash-lookup. Good hashing with custom collision handling is probably the fastest here.

    So hash-tables are not a problem if you are clever with them. If you initialize them correctly, you never have to fall back to the parent class. (Just using the same idea as you had with the matrix.) Moreover, since you can calculate the hash-code statically at compilation time, you don't have that overhead either.

    What's best, this approach generalizes easily to instances that want to override some methods: you can put the hash-table directly on instances. (But this has some subtle cache-invalidation problems when the instance's class is changed after the instance is created, so it might not be worth it.)

    The hash-table cost is not that big once you factor most of it to compilation time. The constant-time lookup for the matrix looks even better, but as you'll notice, it won't work for sparse matrices that are likely to be implemented using trees or hash-tables anyway.

    Finally, I'd better note that using hashed class descriptors with hash-codes calculated at compile time is not my original idea, but is a relatively well-known technique for implementing dynamic languages and/or multiple inheritance. I believe Appel mentions somewhere that this technique has an average overhead of 4 instructions compared to a normal call. Of course, I don't know JVM well enough to tell if the idiom translates easily.
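
    A sketch of such a per-class table, with the method's hash precomputed at compile time, might look like this (the power-of-two sizing and linear probing are assumptions, not necessarily what Appel describes):

        // Per-class open-addressed table: methodId -> switch value.
        // Callers pass a hash precomputed at compile time.
        public class ClassMethodTable {
            private final int[] ids;    // methodId per slot; -1 = empty
            private final int[] values; // switch value per slot
            private final int mask;     // capacity is a power of two

            public ClassMethodTable(int capacityPowerOfTwo) {
                ids = new int[capacityPowerOfTwo];
                values = new int[capacityPowerOfTwo];
                java.util.Arrays.fill(ids, -1);
                mask = capacityPowerOfTwo - 1;
            }

            // assumes the table is never filled completely
            public void put(int methodId, int hash, int value) {
                int slot = hash & mask;
                while (ids[slot] != -1) slot = (slot + 1) & mask; // linear probe
                ids[slot] = methodId;
                values[slot] = value;
            }

            // parents' methods are copied in when the table is built,
            // so there is no superclass walk at lookup time
            public int lookup(int methodId, int hash) {
                int slot = hash & mask;
                while (ids[slot] != -1) {
                    if (ids[slot] == methodId) return values[slot];
                    slot = (slot + 1) & mask;
                }
                return 0; // method_missing
            }
        }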

  3. What about singleton classes? Would they appear in this matrix? And if they don't, then what do you do?

  4. This sounds good; it should be easier to maintain a single matrix than to keep a lot of polymorphic inline caches (PICs) in the AST up to date.

    I'm not sure, but I think PICs would have a faster lookup (a linear search of very short lists), while the sparse matrix would probably win when it comes to memory consumption.

  5. I'm not a big PIC fan here either, but having some weights associated with receivers would help even in the near-worst-case search scenarios; guards would do the rest of the job. Memory consumption is another factor, of course.

  6. It's always amusing to see how ignorant Ruby people are of the fact that Smalltalk solved their problems decades ago.

  7. Oops, never mind...the switch would only contain switch values for the methods of a single class, wouldn't it? Sorry...

  8. Charles,
    are you going to switch to ASM 3.0 in the near future? (There are some minor API changes.) I'm playing with a Ruby DSL for bytecode compiling/generation, and I'd like to rely on as few dependencies as possible. Should I be prepared for it?

  9. I am missing one important subject in this discussion: parallelism. Programs are becoming more threaded, so freezing all access to the mapping during an update is not very acceptable.

  10. Strongtalk is Sun's version of Smalltalk. So why not ask the Strongtalk team, if they're still there?

  11. C Ruby implements a global method cache.

    When you call a method for the first time, the method is stored in the cache. When you call the same method again, it can be fetched from the cache instead.

    The cache is just an array; its index is calculated like this:
    #define EXPR1(c,m) ((((c)>>3)^(m))&CACHE_MASK)

    where c is the VALUE of the class and m is the ID of the method name.

    The cache is implemented in eval.c.

    I've read the JRuby 0.9.1 source. Two points could be improved: first, there is no cache; second, String is the key of the hashtable.

    In the XRuby newruntime branch, the hashtable key is a RubyID, a wrapper around a long, which reduces the cost of calculating hash keys. Of course, a global method cache is also implemented in the XRuby newruntime.
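
    In Java terms, such a direct-mapped global cache might look like this (a sketch; the table size and the use of Object for the cached method are assumptions):

        // Direct-mapped global method cache, indexed much like CRuby's EXPR1 macro.
        public class GlobalMethodCache {
            static final int CACHE_MASK = 0x7FF; // 2048 entries

            static final class Entry {
                int classId = -1;
                int methodId = -1;
                Object method; // the resolved method object
            }

            private final Entry[] entries = new Entry[CACHE_MASK + 1];

            public GlobalMethodCache() {
                for (int i = 0; i < entries.length; i++) entries[i] = new Entry();
            }

            // mirrors #define EXPR1(c,m) ((((c)>>3)^(m))&CACHE_MASK)
            private static int index(int classId, int methodId) {
                return ((classId >> 3) ^ methodId) & CACHE_MASK;
            }

            public Object get(int classId, int methodId) {
                Entry e = entries[index(classId, methodId)];
                return (e.classId == classId && e.methodId == methodId) ? e.method : null;
            }

            public void put(int classId, int methodId, Object method) {
                Entry e = entries[index(classId, methodId)];
                e.classId = classId;
                e.methodId = methodId;
                e.method = method;
            }
        }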
