Monday, April 30, 2007

Dynamic Languages Event at CommunityOne RedMonk Unconference

Ok folks, this is the beginning. I'm going to try, in the midst of JavaOne Day Zero chaos, to run the first big *open* JVM dynamic languages get-together.

It will be part of RedMonk's Unconference running parallel to CommunityOne on Monday, May 7th. It is a *FREE* event. Yes, I said *FREE*.

We're going to have a number of Sun engineers there, including Tom Enebo and myself, John Rose from the JVM, Alex Buckley, and more. There will be representatives from JRuby (besides Tom and me), Jython, SISC, Rhino, Groovy, and others. It's going to be our big opportunity to make sure the JVM continues on a path of solid support for dynamic (and other) languages. And lest ye forget, these discussions will directly influence Java 7...so it's a unique opportunity we must not waste.

Why now? Because interest in dynamic languages for general-purpose application development--especially on top of mainstream virtual machines like the JVM--has simply exploded in the past year. At JavaOne, there's an entire track devoted to Tools and Languages, where the party line in the past has always been "Java Java Java". There are more language-related talks than anyone could attend, where before they were few and far between. And perhaps most importantly, Microsoft today showed they're doing the exact same thing we want to do, announcing their CLR-based Dynamic Language Runtime.

But we have a different palette with which to paint. The JVM is truly open source. All the language implementations are truly open source. We have vast communities in the Java world and strong, energetic communities around each dynamic language project. And all those communities have deep roots in the open source world. Our world is tailor-made for collaboration, and now is the time.

So, on Monday, May 7th (good lord, less than a week away!) I invite language implementers and alternative JVM language enthusiasts to join us. The time has not been decided yet, but I've got two other sessions at 4:00 and 5:00, so it will be before that. Perhaps during the early afternoon period? There's also the possibility of making arrangements for a second discussion that week, maybe co-opting a BOF or a BOF room.

For those of you unable to make it, please try to find a proxy that can represent your interests, or make those interests known to me over email. And know also that I'll be in San Francisco from the 3rd until the 15th, and in Portland, Oregon for RailsConf from then until the 21st. I'd love to get together and discuss language implementation and the future of the JVM. And I love a good beer.

Friends, dynamic languages have truly arrived on the JVM. Let's get together to ensure they feel welcome.

Watch this space for more details.

Tuesday, April 24, 2007

What Would I (Will I?) Change About Ruby

The latest Ruby Blogging Contest hits close to home: What changes would make Ruby into a better language without making it into something that isn't Ruby?

As you might guess, I've got some pretty strong thoughts here. I'm not a heavy Rails user, and I'm not as heavy a Ruby user as I'd like to be. But implementing a Ruby interpreter and now compiler has taught me a few things about what's right and what's wrong with Ruby. I'm not going to complain about performance, whine that the C code is too hard to follow, or even attack C-based extensions. Those may be important issues, but they're all fixable in the long term without breaking anything that works today (or by providing reasonable substitutes). I'm also not going to go into language design ideas...I have mine, you have yours, Matz has his. But my money's on Matz to do the "right thing" with regard to actual language design.

What I'm talking about are a few really important changes to the Ruby runtime, libraries, and ecosystem. Take these as my educated opinions...and don't think too hard about whether I'll be working to change these things in JRuby and in the wider Ruby world.

1. Threading

This more than any other area probably means the most visible changes to Ruby. Ruby currently is green-threaded, as most of you know. JRuby implements native threads mainly because Java uses native threads...we just piggyback off the excellent work of the JVM engineers. And the developing Ruby 1.9, the future successor to the current version 1.8 C implementation, provides something in the middle: native threads with a giant lock, so threads won't run concurrently.

So in general, Ruby is trending toward support for native threads. But there's a problem...some of Ruby's current APIs are impossible to do safely with native threads (and in general, impossible to really do safely with green threads...Ruby just does them anyway). Threading needs to be improved, with support for concurrent execution and removal of operations that prevent that.

Specifically, the following operations and features are inherently unsafe, and are not supported by any mature threaded system:

Thread#kill

Killing one thread from another may leave its locks and resources in an unpredictable state. JRuby currently implements this by setting a kill flag on the target thread and waiting for it to die--basically asking the thread to "please die yourself"--but it's not deterministic and the thread could fail to die.

Thread#raise

Forcing another thread to raise an exception can have the same effect as kill, since the thread may not expect to handle the given exception and may not be able to release locks or tidy up resources. JRuby handles this similarly to kill, by setting a field to contain the exception a target thread should "please raise", but again it's not deterministic and there's no way to guarantee the target thread will raise.

Thread#critical=

There is no way to deterministically force true concurrent threads to stop and wait for the current thread, not to mention the horrendous race conditions that can result when locks are involved. As a result of the many critical problems with critical=, it is already slated to be removed in Ruby 1.9/2.0.

In order for Ruby to survive in a parallel-processing era, unsafe threading operations need to go, and any libraries or apps that depend on them need to find new ways to solve these problems. Sorry folks, these aren't my rules. I understand why people like these features...I like them too. But you can't have your concurrency and eat it too.
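One of those "new ways" is plain cooperative cancellation: instead of killing a thread from outside, you ask it to stop and let it finish at a point where its locks and resources are in a known state. Here's a minimal sketch of the idea. The class and method names (PoliteWorker, request_stop) are made up for illustration, and this is not how JRuby implements Thread#kill internally.

```ruby
# Hypothetical worker that is asked to stop rather than killed from outside.
class PoliteWorker
  attr_reader :cleaned_up

  def initialize
    @stop_requested = false
    @cleaned_up = false
    @lock = Mutex.new
    @thread = Thread.new { run }
  end

  # Ask the worker to finish; the worker decides when it is safe to do so.
  def request_stop
    @lock.synchronize { @stop_requested = true }
  end

  # Wait for the worker thread to exit (optionally with a timeout in seconds).
  def join(timeout = nil)
    @thread.join(timeout)
  end

  def alive?
    @thread.alive?
  end

  private

  def stop_requested?
    @lock.synchronize { @stop_requested }
  end

  def run
    until stop_requested?
      sleep 0.01 # stand-in for one unit of real work
    end
  ensure
    # Cleanup runs at a point the worker itself chose.
    @cleaned_up = true
  end
end

worker = PoliteWorker.new
worker.request_stop
worker.join(2)
puts "worker stopped cleanly: #{worker.cleaned_up}"
```

The trade-off is the same one described above: the worker only stops when it checks the flag, so a thread stuck in a long operation won't die promptly. But nothing is ever torn down halfway through.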

2. ObjectSpace

ObjectSpace is Ruby's view into the garbage-collected heap. You can use it to iterate over all objects of a particular type, attach finalizers to any object, look an object up by its object ID, and so on. In Ruby, it's a pretty low-cost heap-walker, able to dig up objects matching particular criteria for you on a whim. It sounds like it might be pretty useful, but it's used by very few libraries...and most of those uses can be implemented in other (potentially more efficient) ways.
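For anyone who hasn't used it, here's a small sketch of the kind of calls being discussed, assuming the C implementation of Ruby (where walking the heap is always available). Widget and finalizer_for are throwaway names invented for the example.

```ruby
# A throwaway class so we know exactly which objects to expect.
class Widget; end

# Build the finalizer in a method so the proc does not close over the object.
def finalizer_for(label)
  proc { |object_id| puts "#{label} #{object_id} finalized" }
end

# Keep strong references so nothing is collected while we look.
widgets = Array.new(3) { Widget.new }

# Walk the heap for every live Widget instance.
found = ObjectSpace.each_object(Widget).to_a
puts "live widgets: #{found.size}"

# Attach a finalizer to one of them; it runs when the object is collected
# (or at interpreter exit).
ObjectSpace.define_finalizer(widgets.first, finalizer_for("widget"))
```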

JRuby implements ObjectSpace by keeping a separate linked list in memory of weak references to created objects. This means that for every ObjectSpace-aware object that's created, a weakref is added to this list. When the object is collected, the weakref removes itself from the list. Walking all objects of a particular type just involves walking that list. Reconstituting an object ID into the object it references is supported by a separate weak list (again, more memory overhead).
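To make the cost a bit more concrete, here's a rough Ruby sketch of the weak-reference bookkeeping idea, using the standard library's WeakRef. It is only an illustration: WeakRegistry is an invented name, it uses an Array rather than a linked list, and it prunes dead references lazily instead of having each weakref remove itself. JRuby's real implementation lives in Java.

```ruby
require 'weakref'

# Hypothetical registry: one extra weak reference per tracked object.
class WeakRegistry
  def initialize
    @refs = []
  end

  # Called for every object we want to be able to find later.
  def register(object)
    @refs << WeakRef.new(object)
    object
  end

  # Walk the list and return the live objects of the given type.
  def each_object(type)
    live = []
    @refs.each do |ref|
      begin
        object = ref.__getobj__
        live << object if object.is_a?(type)
      rescue WeakRef::RefError
        # The object was already collected; skip it.
      end
    end
    # Drop references whose objects are gone.
    @refs.select!(&:weakref_alive?)
    live
  end
end

registry = WeakRegistry.new
greeting = registry.register("hello".dup)
numbers  = registry.register([1, 2, 3])

strings = registry.each_object(String)
puts "tracked strings: #{strings.inspect}"
```

Every register call allocates a second object and grows the list, which is the per-object overhead described above.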

There are no plans currently for ObjectSpace to be removed from Ruby in a future version. But there's a problem...in addition to being pure overhead in JRuby (which you can turn off completely by using the -O flag), ObjectSpace limits evolving development of the Ruby garbage collector, breaks heap and memory transparency, and poses yet more problems for threading.

There are many issues here. First off, the JRuby thing. By having to add ObjectSpace governors for all objects in the system, JRuby pays a very large penalty. We're forced to do this because the JVM (and most other advanced garbage-collecting VMs) does not allow you to traverse in-memory objects nor retrieve the object that is associated with a given ID. In general this is because the JVM does all sorts of wonderful and magical things with objects and memory behind the scenes, and the ability to ask for all objects of a given type or pull an object based on some ID number at any time cripples many of these tricks.

The threading issues are perhaps more important. Imagine if you will a true concurrent VM, with many threads creating objects, maybe one or more threads collecting garbage, and synchronizing all this to guarantee the integrity and efficiency of the heap and garbage collector. There is absolutely no room in this scenario for those multiple threads to request lists of specifically-typed objects at any time, nor to provide an ID and expect its object to be presented to you. These features break encapsulation across threads, they violate security restrictions from thread to thread, and they require whole new levels of locking to ensure that while reading from the heap no other thread produces new objects and no garbage collection occurs. As a result, ObjectSpace harms Ruby by limiting the flexibility of its garbage collecting and threading subsystems, and should be eliminated.

3. $SAFE and tainting

Safe levels are a fairly simple, straightforward way to set a "security level" that governs what operations are possible for a given thread. By setting the safe level to various values, you can limit modification of Object, prevent IO, disallow creation of new methods or classes, and so on. Added to this is the ability to "taint" or "untaint" objects. Tainted objects are considered "unsafe", and so certain security levels will cause errors to be thrown when those objects are passed to safe-only operations.

JRuby has safe level and tainting checks in place, but it's almost assured they're not working correctly. We have never tested them, largely because practically no tests (or perhaps literally no tests) use safe levels or tainting, and we've had *exactly one* bug report relating to safe levels, just a couple weeks ago. And to further kill the possibility of JRuby ever supporting safe levels and tainting correctly, my work tonight to fix some safe level issues revealed that doing so would add a tremendous amount of overhead to almost all critical operations like method creation, module/class mutation, and worst of all, object creation.

At this point, safe levels will probably remain in their current half-implemented state for 1.0, but I think it's almost decided for us that safe levels and tainting will simply not be supported in JRuby. In their place, we'll do two things (which I'd recommend the C implementation consider as well):
  • Recommend that people who really want "safe" environments use an approach like whytheluckystiff's Sandbox, which takes a more JVM-like approach to safety: it runs code in a true sandboxed sub-runtime with only "safe" operations even defined. In other words, not only is it disallowed to load in files or hit the network, it's physically *impossible* to do so. What makes this even better is that Sandbox is already supported in JRuby (gem "javasand") and JRuby out of the box allows a fine granularity of operations to be disabled in new runtimes.
  • Implement safe levels the way Java handles security restrictions, which we get to leverage since they're already being checked and enforced at the JVM level. We will not be able to map everything...for obvious reasons, checking tainted strings all the time or limiting class and method creation are unlikely to ever happen, but we can limit those operations that the JVM allows us to limit, like loading remote code, opening sockets, accessing local files, and so on. So it's highly likely JRuby's implementation of safe levels will map to clearly-defined sets of Java security restrictions in the near future.

4. Direction

Ruby is a very free-form community. Matz is the most benevolent dictator I've had the pleasure to work with, and most of the community are true free-thinking artists. It's like the hippie commune of the language world. Peace out, man.

But there's a problem here. Ruby needs guidance beyond VM and language design or the loose meanderings of its more vocal community members. It boils down to a few simple points:
  • Ruby needs a spec. Anyone who believes this isn't true isn't paying attention. Now I'm not talking about a gold-standard legal document signed in blood by Matz and the chief stakeholders of the Ruby community. An officially sponsored, widely supported, and massively publicized community spec would work fine--and probably fit the community and the language better. But something needs to be done quickly, since Ruby's "bus number" is dangerously low. A spec is not something to be feared...it's a guarantee that Ruby will live on into the future, that alternative implementations (like JRuby) can't intentionally introduce nasty incompatibilities (or at least, that they'd be easy to discover and easy to document), and perhaps most importantly...that the full glory and beauty of Ruby is published forever for all to see and explore, rather than dangerously trapped in very few minds.
  • Ruby needs a non-profit governing body. I'm not necessarily talking about a council of elders here, I'm just talking about some legal entity to which OSS copyrights can be assigned, donations can be made, and from which projects and initiatives can be funded. Maybe this would be RubyCentral, maybe this would be some other (new) organization...I don't know that. But it would be a great help to the community and Ruby's future if there were some official organization that could act as caretaker for Ruby's future. I'm all set to sign over any JRuby copyrights I have to such an organization, to protect the future of Ruby on the JVM just like the future of the C implementation. How about you?
  • Ruby needs you. Granted, this isn't really a change as such. You probably wouldn't be reading this if Ruby didn't already have you. But the Ruby community is at a big point in its lifetime...at risk of losing its identity, being eclipsed by newer projects, or even slipping deep, deep into the trough of disillusionment. What will prevent that happening is the community showing its strong ties, coming together to support official organizations and official documents, and above all, continuing to pour all our hearts into creating newer and better applications and libraries in Ruby, pushing the boundaries of what people think is possible.

Monday, April 16, 2007

Paving the Road to JRuby 1.0: Performance

Since it looks like Antonio Cangiano is going to delay the next running of the bulls until Ubuntu 7.04 is released, I figured the next JRuby 1.0 update could be about performance.

Performance is such a tricky area for Ruby. Folks outside the Ruby community happily malign its performance, fueled by both FUD and by some truths. Folks within the Ruby community either aren't affected by Ruby's performance (it's "fast enough") or they simply don't care (it's slow, but I still love it too much to leave). A small part of the Ruby community takes what is in my opinion a rather anti-Ruby stance: "write it in C" as a targeted solution for identified bottlenecks. I suppose reality lies somewhere in between all these views, with Ruby's performance certainly not being stellar in the general case, but reasonable and sometimes surprisingly good for specific cases.

Ruby 1.9 has raised the promise of a new bytecode-based interpreter engine--a Ruby "virtual machine" by some reckonings--with the goal of improving performance foremost on the minds of Ruby's developers. And again the performance question is rather complicated. Antonio's recent shootout, running only Ruby 1.9's chosen benchmarks against all other implementations, shows it doing extremely well. It comes out many times faster than Ruby 1.8 in almost every test, and no other implementation even comes close. The truth however doesn't change very much; for non-synthetic benchmarks (like running Rails) the situation only improves by about 15% for some tests, and for many other tests performance actually degrades. Of course Ruby 1.9 is still under heavy development, and many more improvements are ahead, but the wide range of results demonstrates again that benchmarking must be taken with a grain of salt.

Et tu, JRuby?

So then there's us and JRuby. JRuby's professed goal has never been to be a better Ruby; at best, we're trying to build the best Ruby possible on top of what we believe is the best VM in existence. And toward that end we've put most of our time into compatibility and correctness above all else, aiming for the goal of complete Ruby language compatibility and near-complete builtin-class compatibility, hoping for JRuby to someday be treated as "just another Ruby implementation" on which people can run their apps and design their clever frameworks and libraries. But in the past year, it's become apparent that we could actually exceed Ruby's performance for specific cases in the near term, and for general cases over time. So the target market for JRuby seems to be changing from "Ruby users that must use Java VMs, libraries, and servers" to "Ruby users that want a better-performing, more scalable implementation". And our noble quest for near-complete compatibility gets muddled with all these fiddly performance details.

But that's life, right? The goals you set out for yourself and your projects rarely align perfectly with the goals others set out for you. The trick is achieving a balance between what you want to do with your life and what others (like your community members or your employers) want you to do. Perhaps the successful developer is the one who can derive pleasure from both tasks.

On the road to JRuby 1.0, we've done our best to balance compatibility and performance. We are now nearing the end of the compatibility road, with Ruby language features nearly 100% and builtin classes almost as complete as we can make them on the JVM. The real reason for a JRuby 1.0 now is that we believe we're finally approaching "Ruby compatibility" for some high measure of compatibility, such that the vast majority of platform-agnostic Ruby code should run successfully on JRuby. And that's certainly no small feat, given that just a year ago we celebrated a mostly-broken cobbled-together Rails 1.1 app slowly handling CRUD operations. Today, people are deploying JRuby on Rails apps in production, and the game has only gotten more interesting.

So then, performance. You're all wondering what the answer is to this performance thing, aren't you? Is JRuby going to blow away all competition, including the nascent Ruby 1.9 and mid-term projects like XRuby, Rubinius, and Ruby.NET? It's certainly possible, but it's not our goal. Is JRuby going to be faster than Ruby 1.8 when 1.0 is released? For specific cases, I'd say yes...there's plenty of areas we already perform better than Ruby 1.8. For the general cases, it's hard to say. We perform well serving up Rails requests today, but only about 50-70% of Ruby 1.8's performance. And though we know where most of the bottlenecks lie, we're a little resource limited trying to fix them. Do we believe that we'll be faster than Ruby 1.8 in all general cases in the near future? Yes, we strongly believe that will happen.

Now of course I could ramble on and on about performance and put you all to sleep, but actual numbers will probably keep your interest better than my droning.

The Test

Like the shootout, I'm just running the Ruby 1.9 benchmarks here. We have not done any targeted optimization for these tests; there's no Fixnum magic or anything like that under the covers. What we have done is implement multiple general-purpose optimizations to speed method and block invocation, object creation, and interpretation. We've also spent a little more time getting these tests to compile successfully, but of course any work done on the compiler is generally applicable as well.

These results are all based on JRuby trunk code, revision 3480. I'm running Java 6 on a MacBook Pro 2.16GHz Core Duo, and all code was compiled to target Java 6.

For the first set of results the JRuby command executed basically amounts to the following:
JAVA_OPTS=-Xverify:none jruby SERVER -O [script.rb]
  • JAVA_OPTS=-Xverify:none specifies not to verify classes on startup; this is a large part of the speed hit Java applications have when starting. We turn it off here to remove a little of that overhead from the benchmarks, since most of them are very short runs to begin with.
  • SERVER specifies that JRuby should use the "server" VM, which takes a bit longer to optimize a bit more heavily when JITting Java code into native instructions. JRuby generally performs best under the server VM, though using it impacts startup time.
  • -O disables ObjectSpace in JRuby. This may seem like cheating, but the truth is that when ObjectSpace is enabled we pay double or triple the object creation cost in JRuby since we have to track all objects separately. Ruby's ObjectSpace is essentially zero-cost...it's just a window into the memory manager. Since we don't have a low or zero-cost way to implement ObjectSpace, it falls into what I categorize as "optional incompatibility". If you don't need it, turn it off and JRuby performance will improve. You can call it cheating if you like...the truth is that practically no code actually depends on ObjectSpace.
And then there's the standard disclaimer for any Java application: these times include startup, about 1.0 to 1.3 seconds. I know you all are picky about how benchmarks are run, and you love to include startup time even though the vast majority of the world's work is not done in the first few seconds of execution, but if you'll forgive the disabling of ObjectSpace I'm willing to meet you half way. Startup time is included.
TEST                    MRI     JRuby
--------------------------------------
app_answer              0.584   2.239
app_factorial           ERROR   4.248
app_fib                 7.126   10.549
app_mandelbrot          2.346   10.300
app_raise               2.587   4.441
app_strconcat           1.829   2.141
app_tak                 9.711   13.345
app_tarai               7.529   11.050
loop_times              5.475   6.903
loop_whileloop          9.982   11.797
loop_whileloop2         2.009   3.292
so_ackermann            13.726  26.132
so_array                7.257   8.801
so_concatenate          2.063   3.546
so_count_words          0.507   4.491
so_exception            4.342   8.346
so_lists                1.238   2.744
so_matrix               2.258   4.241
so_nested_loop          5.609   7.898
so_object               7.050   6.496
so_random               2.139   4.643
so_sieve                0.740   2.240
vm1_block               23.604  27.405
vm1_const               16.774  20.650
vm1_ensure              16.546  15.800
vm1_length              21.210  21.899
vm1_rescue              13.170  16.197
vm1_simplereturn        21.091  31.376
vm1_swap                25.114  17.949
vm2_array               6.049   5.690
vm2_method              13.528  17.759
vm2_poly_method         16.886  24.956
vm2_poly_method_ov      4.576   6.972
vm2_proc                7.060   7.797
vm2_regexp              4.421   9.353
vm2_send                4.332   8.198
vm2_super               4.992   7.944
vm2_unif1               3.838   6.095
vm2_zsuper              5.409   8.452
vm3_thread_create_join  0.019   1.592
Some of these are rather surprising results. This is JRuby running in plain old interpreted mode, with no compilation involved. The majority of the tests still have JRuby slower than Ruby 1.8, but the gap has narrowed an incredible amount since last year. In only a few tests are we more than twice as slow as MRI, and on a couple we're almost twice as fast. If you'll imagine startup time removed from these numbers, and believe me when I say Java takes far more than 30-60 seconds to rev up to full speed, then the situation looks even better. What's more, we've got a good several weeks before the first 1.0ish release is scheduled (something betaish or RCish that proudly proclaims it's "done") and a bunch of great committers and community members eyeing performance metrics.
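If you want to see what "startup time removed" looks like for yourself, one simple approach is to time the work from inside the process rather than timing the whole command. The sketch below uses Ruby's standard Benchmark module; the fib method is just a stand-in I'm assuming for a small recursive benchmark, not the actual app_fib script.

```ruby
require 'benchmark'

# Assumed stand-in for a small recursive benchmark.
def fib(n)
  n < 2 ? n : fib(n - 1) + fib(n - 2)
end

result = nil

# Benchmark.realtime measures only the block, so VM startup is excluded.
elapsed = Benchmark.realtime { result = fib(20) }

puts format("fib(20) = %d in %.3f s (in-process, excludes VM startup)", result, elapsed)
```

Timing the whole command from the shell (for example with `time`) gives the startup-inclusive figure, which is what the table above reports.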

Ok, you may be asking "what about JRuby's compiler?" Yes, there is a compiler in the works. It's primarily been my job to build out the compiler, though Ola has jumped in a few times to offer his excellent help. And progress has been slow but steady. You have to remember that in all the world, there's no known 100% complete Ruby compiler for a general-purpose VM. There's Ruby 1.9, but its bytecodes have been custom designed around Ruby. There's XRuby and Ruby.NET, but it's still unclear how complete they really are. So this is an open area of research and development. But the results are looking great so far.

For this test, both the ahead-of-time (AOT) compiler and the just-in-time (JIT) compiler modes are activated. This forces the target script to be compiled before execution and also compiles any methods hit heavily once execution gets going. The command amounts to the following:
JAVA_OPTS="-Djruby.jit.enabled=true -Xverify:none" \
jruby SERVER -O -C [script.rb]
  • -Djruby.jit.enabled=true enables the JIT compiler. The default threshold at which a (compilable) method gets compiled is 50 invocations.
  • -C tells JRuby to compile the target script before executing it. If the script can't be compiled, JRuby bombs out with an error.
The compiler can't handle all the Ruby 1.9 tests yet. Specifically, it doesn't handle multiple assignment (e.g. vm1_swap), exception handling (anything involving rescue or ensure), or full class definitions. But it compiles the majority of the tests.
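The JIT threshold idea itself is easy to picture: count invocations, and once a method is hot enough, switch it from the interpreted path to a compiled one. Here's a toy Ruby sketch of that bookkeeping. CountingMethod is an invented name, the "compile" step is just a mode flag, and whether the real switch happens on or after the 50th call is an assumption on my part.

```ruby
# Toy model of an invocation-count JIT threshold.
class CountingMethod
  attr_reader :mode, :calls

  def initialize(threshold: 50, &body)
    @threshold = threshold
    @body = body
    @calls = 0
    @mode = :interpreted
  end

  def call(*args)
    @calls += 1
    # Once the method has been hit often enough, flip to the "compiled" path.
    @mode = :compiled if @mode == :interpreted && @calls >= @threshold
    @body.call(*args)
  end
end

square = CountingMethod.new { |x| x * x }

49.times { square.call(3) }
mode_before = square.mode   # still below the threshold

square.call(3)
mode_after = square.mode    # threshold reached on the 50th call

puts "after 49 calls: #{mode_before}, after 50 calls: #{mode_after}"
```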
TEST                    MRI     JRuby
--------------------------------------
app_factorial           0.029   3.459
app_fib                 7.094   5.093
app_mandelbrot          2.340   9.011
app_strconcat           1.827   2.391
app_tak                 9.714   5.394
app_tarai               7.515   4.642
loop_times              5.428   2.942
loop_whileloop          10.016  6.027
loop_whileloop2         2.012   2.191
so_ackermann            13.610  11.254
so_concatenate          2.043   2.396
so_lists                1.250   2.141
so_matrix               2.256   2.394
so_nested_loop          5.614   4.167
so_random               2.158   3.291
so_sieve                0.741   1.798
vm1_block               23.392  12.397
vm1_const               16.980  8.908
vm1_length              21.094  10.901
vm1_simplereturn        21.252  9.345
vm2_array               6.025   3.041
vm2_method              13.049  7.794
vm2_regexp              4.468   7.647
vm2_unif1               3.855   3.293
vm3_thread_create_join  0.017   1.389
Ahh, now things look a bit different! In more than half of these tests (14 of 25), JRuby performs better than Ruby 1.8. In the long running cases, the difference is even more obvious. Short runs still put Ruby 1.8 ahead, but I'm totally ok with that. Java, and by extension JRuby, has never had stellar short-run and startup performance. But we don't really have to if the heavy, long-running apps people actually use end up running faster.
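If you'd rather not eyeball the table, the comparison is a few lines of Ruby. The numbers below are copied straight from the compiled-mode results above (MRI time first, JRuby time second, in seconds); the constant and variable names are just for this example.

```ruby
# [MRI seconds, JRuby seconds] for each compiled-mode test, from the table above.
RESULTS = {
  "app_factorial"          => [0.029, 3.459],
  "app_fib"                => [7.094, 5.093],
  "app_mandelbrot"         => [2.340, 9.011],
  "app_strconcat"          => [1.827, 2.391],
  "app_tak"                => [9.714, 5.394],
  "app_tarai"              => [7.515, 4.642],
  "loop_times"             => [5.428, 2.942],
  "loop_whileloop"         => [10.016, 6.027],
  "loop_whileloop2"        => [2.012, 2.191],
  "so_ackermann"           => [13.610, 11.254],
  "so_concatenate"         => [2.043, 2.396],
  "so_lists"               => [1.250, 2.141],
  "so_matrix"              => [2.256, 2.394],
  "so_nested_loop"         => [5.614, 4.167],
  "so_random"              => [2.158, 3.291],
  "so_sieve"               => [0.741, 1.798],
  "vm1_block"              => [23.392, 12.397],
  "vm1_const"              => [16.980, 8.908],
  "vm1_length"             => [21.094, 10.901],
  "vm1_simplereturn"       => [21.252, 9.345],
  "vm2_array"              => [6.025, 3.041],
  "vm2_method"             => [13.049, 7.794],
  "vm2_regexp"             => [4.468, 7.647],
  "vm2_unif1"              => [3.855, 3.293],
  "vm3_thread_create_join" => [0.017, 1.389]
}

# Tests where JRuby's time is lower than MRI's.
faster = RESULTS.select { |_, (mri, jruby)| jruby < mri }

# The test with the largest MRI-to-JRuby speedup.
best_name, (best_mri, best_jruby) = RESULTS.max_by { |_, (mri, jruby)| mri / jruby }
best_speedup = (best_mri / best_jruby).round(2)

puts "JRuby faster in #{faster.size} of #{RESULTS.size} tests"
puts "largest speedup: #{best_name} at #{best_speedup}x"
```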

Note also that this is the first real compiler we've had; it's not doing any optimization like Ruby 1.9's and Rhino's compilers, and it's almost certainly far from being efficient. I'm no compiler expert, and this is my first real attempt. These numbers already looking so good demonstrates that there's a grand adventure ahead of us: Ruby really can be made to perform well on the JVM. It just requires a little confidence and a little more effort.

That pretty much wraps up this installment. The bottom line: As we approach JRuby 1.0, our performance is better than it's ever been--faster than Ruby 1.8 for many specific cases, with stellar across-the-board performance right around the corner. And with other implementations like Ruby 1.9, XRuby, Rubinius, and Ruby.NET rapidly coming of age, Ruby's future is looking extremely solid.

script type="text/ruby"

Remember way back when the IRB applet was first thrust upon the world, and I promised that you truly could now use Ruby in the browser? At the time, it was still a little cumbersome, and generally you could only use Ruby by passing it directly into the applet.

Well Dion Almaer has taken it to the next awesome step:

<script type="text/ruby">

Check out his blog entry on how to enable true Ruby scripting on the client side, as well as the link to his demo page (still loads up the applet in the background, but that's a minor implementation detail).

Awesome.

Thursday, April 5, 2007

Paving the Road to JRuby 1.0: Unicode

Well friends, the countdown has begun. Within the next several weeks, we're looking at getting a 1.0 release out in either beta or RC form. Of course most of you OSS folks know how arbitrary release numbers are...the day after we branch 1.0 we're going to start committing to 1.1, and the cycle will continue. But it seems like the last year is coming to a very solid milestone, and these stabilization points are necessary once in a while. So it comes to pass.

As part of "closing the door" on 1.0, we're weighing features and bugs and wishes to decide what makes the cut and what must wait for a future release. To help get the word out, I'll be doing a series of entries on where we are, where we'll be for 1.0, and where we're going in the future. First off, one of the biggest issues for Ruby--Unicode support--and what it will look like in JRuby 1.0.

Unicode

Straddling two worlds is complicated. Ruby is dynamically typed, with open classes, redefinable methods, eval...a very different sort of world than Java. If those really obvious items weren't tricky enough, Ruby has another wrench to throw into the works: it cares not for Unicode.

Of all issues, this is probably the most problematic to solve. Java, as you know, is always unicode, representing strings internally as UTF-16 and supporting most external encodings you could ever possibly want. Ruby, on the other hand, pays little attention to encodings most of the time, preferring to just treat strings as an array of bytes. If it's encoded, so be it. If not, so be it. The majority of Ruby's string operations work with bytes (though there are a few smart exceptions), so multibyte-encoded strings can be damaged if you have high expectations.
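Here's a small sketch of what "strings as an array of bytes" means in practice. It's written for a current, encoding-aware Ruby, where calling `b` on a string gives a binary copy that behaves the way Ruby 1.8 strings did (that mapping is my assumption for illustration; 1.8 itself had no such method).

```ruby
word  = "caf\u00e9"   # four characters; the last one takes two bytes in UTF-8
bytes = word.b        # binary copy: every "character" is a single byte

puts word.length      # => 4 (characters)
puts bytes.length     # => 5 (bytes, which is what a byte-oriented String reports)

# Slicing by bytes can cut a multibyte character in half.
chopped = bytes[0, 4].force_encoding("UTF-8")
puts chopped.valid_encoding?   # => false
```

The last line is the "damage" mentioned above: a perfectly innocent slice leaves half a character behind.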

We've spent the past year weighing options. We want JRuby to be compatible with the C implementation, since it's widely deployed and apps depend on its sometimes quirky string behavior. We also want JRuby to integrate well with Java, and the boundary to be as seamless as possible. How does one accomplish that? We've gone through a couple options:
  • At first, JRuby's string was represented with a String/StringBuffer internally. This allowed the original JRuby authors the most trivial path to a fully-working string implementation; almost all the heavy lifting was done for them, and for simple cases it worked fine. But the APIs did not conform to what Ruby applications expected, frequently returning 16 bit values for individual characters and reporting incorrect byte lengths for strings that couldn't encode into all 8-bit characters. It was broken, as far as Ruby code was concerned.
  • The second option, which is what we currently have, is to do what Ruby does...treat all strings as byte[], and implement all operations to the letter of the law. This allows us to make even the lowest-level APIs conform exactly, and to be honest that's been necessary to get applications like Rails up and running. Ruby blurs the distinction between a byte[] and a string so completely that Rails actually now ships with its own multibyte helper, a library that stands a good chance of becoming widely deployed. So to support Ruby and all its many wonderful apps, we fell in line.
At this point it's appropriate to answer a question many of you may have: If JRuby's strings are to be byte[]-based, how can you integrate with Java? A few answers have been proposed. The one I put forward some months ago was that we should treat Java strings like any other Java object, allowing developers to explicitly work with either Java strings or Ruby strings--never the twain meeting without full consent of both parties (or perhaps by using a few helper APIs). This allows Ruby's strings to work as you'd expect, and Java's strings to work as you'd expect, but it doesn't get the nicest flowthrough we really want in this case. So then there's a compromise option likely to be "it" for 1.0.

When crossing the boundary, Ruby will be assumed to use UTF-8 strings.

There are many benefits to this even beyond the fact that it took me 15 minutes to make it work. It allows Ruby strings to behave like Ruby apps expect them to. It allows Java strings to pass into and out of Ruby code without corruption. And it allows Ruby strings to pass into Java code with the only prerequisite (if you want to avoid garbled strings) being that they should be decodable UTF-8. That last item seems like a fairly reasonable requirement given that you're calling 100% unicode Java APIs, and in general the people that will care about this conversion are the ones already using unicode.

Of course, nobody calling these APIs would ever see this or have to do anything special...strings will pass back and forth and encode/decode as necessary, and things will generally just work. And there are performance tricks we can use to speed things up if that becomes a real issue. But it also does something no other proposed solution can: it solves the problem pretty darn well...in time for 1.0.

So for the record, this is the proposed solution we'll go with in the 1.0 timeframe:
  • Ruby strings are byte[] and conform to Ruby string semantics
  • Java strings passing into Ruby code will be encoded as UTF-8, with the implication that you should expect to be working with UTF-8 byte[] in the receiving code
  • Ruby strings passing out of Ruby into Java libraries will be assumed to be UTF-8, and the resulting string on the Java side of the call will reflect that assumption.
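To illustrate the three rules, here's a sketch in plain Ruby that fakes the boundary. A "Java string" is modelled as a UTF-16 encoded Ruby string, and a "Ruby string" as raw bytes. The helper names java_to_ruby and ruby_to_java are invented for the example; the real conversion happens inside JRuby's Java integration layer, and substituting a replacement character for bad bytes is my assumption about what "garbled" looks like.

```ruby
# Java -> Ruby: encode the Java string as UTF-8 and hand over raw bytes.
def java_to_ruby(java_string)
  java_string.encode("UTF-8").b
end

# Ruby -> Java: assume the bytes are UTF-8 and decode them accordingly.
# Bytes that are not valid UTF-8 come out as replacement characters.
def ruby_to_java(ruby_bytes)
  ruby_bytes.dup.force_encoding("UTF-8").encode("UTF-16BE", invalid: :replace)
end

java_in    = "na\u00efve".encode("UTF-16BE")   # stand-in for a java.lang.String
ruby_bytes = java_to_ruby(java_in)
round_trip = ruby_to_java(ruby_bytes)

puts "bytes on the Ruby side: #{ruby_bytes.bytesize}"
puts "round trip intact: #{round_trip == java_in}"

# Bytes that are not decodable UTF-8 do not survive the crossing.
garbled = ruby_to_java("\xFF\xFE".b)
puts "garbled: #{garbled.encode('UTF-8').inspect}"
```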
And what does the future hold? Well, there's a number of exciting areas happening post 1.0:
  • At some point, either before 1.0 or immediately after, there will be an installable gem that provides Java-native support for Rails' MultiByte library, which should provide a substantial performance boost there. As it is, the current pure Ruby version works almost 100%, so the extra performance should ultimately just be a bonus.
  • We will start implementing a Ruby 2.0-compatible string, but it's unclear when would be a good time to flip that switch. I would predict that JRuby users will have a choice of which to use fairly soon, and we'll swap out behavior depending on what you want.
  • It's also likely that we'll present clean ways to "just use Java strings" for developers that want to do that. This would not result in code that's compatible with stock Ruby, since we'd obviously have to "unicodify" many Ruby methods, but it fits into a larger goal of Ruby-enabling the Java platform as well as possible.
  • And of course, some combination of these three and the current solution is possible, given some time and thought. Ultimately it will be driven by you, the users of JRuby, and what you want to see done with it. Because that's what we here at Sun are trying to be all about...what users and developers want to be doing.
So that about wraps up the unicode story for 1.0. I believe it's an excellent compromise between full Ruby compatibility and tight integration with the Java platform, and it gets us where we need to be right now. So what do you think?