Thursday, April 5, 2007

Paving the Road to JRuby 1.0: Unicode

Well friends, the countdown has begun. Within the next several weeks, we're looking at getting a 1.0 release out in either beta or RC form. Of course most of you OSS folks know how arbitrary release numbers are...the day after we branch 1.0 we're going to start committing to 1.1, and the cycle will continue. But it seems like the last year is coming to a very solid milestone, and these stabilization points are necessary once in a while. So it comes to pass.

As part of "closing the door" on 1.0, we're weighing features and bugs and wishes to decide what makes the cut and what must wait for a future release. To help get the word out, I'll be doing a series of entries on where we are, where we'll be for 1.0, and where we're going in the future. First off, one of the biggest issues for Ruby--Unicode support--and what it will look like in JRuby 1.0.

Unicode

Straddling two worlds is complicated. Ruby is dynamically typed, with open classes, redefinable methods, eval...a very different sort of world than Java. If those really obvious items weren't tricky enough, Ruby has another wrench to throw into the works: it cares not for Unicode.

Of all issues, this is probably the most problematic to solve. Java, as you know, is always unicode, representing strings internally as UTF-16 and supporting most external encodings you could ever possibly want. Ruby, on the other hand, pays little attention to encodings most of the time, preferring to just treat strings as an array of bytes. If it's encoded, so be it. If not, so be it. The majority of Ruby's string operations work with bytes (though there are a few smart exceptions), so multibyte-encoded strings can be damaged if you have high expectations.

We've spent the past year weighing options. We want JRuby to be compatible with the C implementation, since it's widely deployed and apps depend on its sometimes quirky string behavior. We also want JRuby to integrate well with Java, and the boundary to be as seamless as possible. How does one accomplish that? We've gone through a couple options:
  • At first, JRuby's string was represented with a String/StringBuffer internally. This allowed the original JRuby authors the most trivial path to a fully-working string implementation; almost all the heavy lifting was done for them, and for simple cases it worked fine. But the APIs did not conform to what Ruby applications expected, frequently returning 16 bit values for individual characters and reporting incorrect byte lengths for strings that couldn't encode into all 8-bit characters. It was broken, as far as Ruby code was concerned.
  • The second option, which is what we currently have, is to do what Ruby does...treat all strings as byte[], and implement all operations to the letter of the law. This allows us to make even the lowest-level APIs conform exactly, and to be honest that's been necessary to get applications like Rails up and running. Ruby blurs the distinction between a byte[] and a string so completely that Rails actually now ships with its own multibyte helper, a library that stands a good chance of becoming widely deployed. So to support Ruby and all its many wonderful apps, we fell in line.
At this point it's appropriate to answer a question many of you may have: If JRuby's strings are to be byte[]-based, how can you integrate with Java? A few answers have been proposed. The one I put forward some months ago was that we should treat Java strings like any other Java object, allowing developers to explicitly work with either Java strings or Ruby strings--never the twain meeting without full consent of both parties (or perhaps by using a few helper APIs). This allows Ruby's strings to work as you'd expect, and Java's strings to work as you'd expect, but it doesn't get the nicest flowthrough we really want in this case. So then there's a compromise option likely to be "it" for 1.0.

When crossing the boundary, Ruby will be assumed to use UTF-8 strings.

There are many benefits to this even beyond the fact that it took me 15 minutes to make it work. It allows Ruby strings to behave like Ruby apps expect them to. It allows Java strings to pass into and out of Ruby code without corruption. And it allows Ruby strings to pass into Java code with the only prerequisite (if you want to avoid garbled strings) being that they should be decodable UTF-8. That last item seems like a fairly reasonable requirement given that you're calling 100% unicode Java APIs, and in general the people that will care about this conversion are the ones already using unicode.

Of course, nobody calling these APIs would ever see this or have to do anything special...strings will pass back and forth and encode/decode as necessary, and things will generally just work. And there are performance tricks we can use to speed things up if that becomes a real issue. But it also does something no other proposed solution can: it solves the problem pretty darn well...in time for 1.0.

So for the record, this is the proposed solution we'll go with in the 1.0 timeframe:
  • Ruby strings are byte[] and conform to Ruby string semantics
  • Java strings passing into Ruby code will be encoded as UTF-8, with the implication that you should expect to be working with UTF-8 byte[] in the receiving code
  • Ruby strings passing out of Ruby into Java libraries will be assumed to be UTF-8, and the resulting string on the Java side of the call will reflect that assumption.
And what does the future hold? Well, there's a number of exciting areas happening post 1.0:
  • At some point, either before 1.0 or immediately after, there will be an installable gem that provides Java-native support for Rails' MultiByte library, which should provide a substantial performance boost there. As it is, the current pure Ruby version works almost 100%, so the extra performance should ultimately just be a bonus.
  • We will start implementing a Ruby 2.0-compatible string, but it's unclear when would be a good time to flip that switch. I would predict that JRuby users will have a choice of which to use fairly soon, and we'll swap out behavior depending on what you want.
  • It's also likely that we'll present clean ways to "just use Java strings" for developers that want to do that. This would not result in code that's compatible with stock Ruby, since we'd obviously have to "unicodify" many Ruby methods, but it fits into a larger goal of Ruby-enabling the Java platform as well as possible
  • And of course, some combination of these three and the current solution is possible, given some time and thought. Ultimately it will be driven by you, the users of JRuby, and what you want to see done with it. Because that's what we here at Sun are trying to be all about...what users and developers want to be doing.
So that about wraps up the unicode story for 1.0. I believe it's an excellent compromise between full Ruby compatibility and tight integration with the Java platform, and it gets us where we need to be right now. So what do you think?

3 comments:

  1. I am one of the persons that would like to be able to "just use Java strings" in come cases. Primarily because of increased performance if one can avoid converting between 8 and 16 bit UTF coded strings all the time when doing heavily mixed Java and JRuby development.

    ReplyDelete
  2. Hi Charles

    This is really good stuff.

    What about using $KCODE to choose the ruby default?

    There are only three choices, namely EUC, SJIS and UTF-8 (since ASCII
    is a subset of UTF-8).

    Mike

    ReplyDelete
  3. I guess the most important point here is that you are going to work with Ruby 2.0 developers soon. Thanks for your efforts!

    ReplyDelete