Wednesday, January 19, 2011

JRuby on Rails on Amazon Elastic Beanstalk

Amazon this week announced Elastic Beanstalk, a managed Apache Tomcat service for AWS. Naturally, I had to try JRuby on it.

First, the bad:
  • AWSEB is really slow to deploy stuff. Several times it got "stuck" and I waited for more than 30 minutes for it to recover. It did not appear to be an app issue, since the app came up just fine.
  • The default instance size is t1.micro. I was able to get a Rails app to boot there, but it's a very underpowered size.
  • It appears to start up JVMs with 256MB of memory max and 64MB of permgen. For a larger app, or one with many Rails instances, that might not be enough. For a "threadsafe" Rails app, though, it's plenty.
  • The default EC2 load balancer for the new Beanstalk instance is set to ping the "/" URL. If you don't rig up a / route in your Rails app (like I forgot to do) the app will come up for a few minutes and immediately get taken out.
And the good news: it works great once you get past the hassles! Here's the process that worked for my app (assuming app is already build and ready for deploy).

Preparing the app:
  • Ensure jruby-openssl is in Gemfile. Rails seems to want it in production mode.
  • Edit config/environments/production.rb to enable threadsafe mode.
  • `warble`
Preparing Elastic Beanstalk:
  • Create a new instance, specifying the .war file Warbler created above as the app to deploy
  • There is no step two
Once the instance has been prepared, you may want to resize it to something larger than t1.micro if it's meant to be a real app...but it should boot ok.



Have fun!

Wednesday, January 5, 2011

Representing Non-Unicode Text on the JVM

JRuby is an implementation of Ruby, and in order to achieve the high level of compatibility we boast we've had to put in some extra work. Probably the biggest area is in management of String data.

Strings in Ruby 1.8

Ruby 1.8 did not differentiate between strings of bytes or strings of characters. A String was just an array of bytes, and the representation of those bytes was dependent on how you used them. You could do regular expression matches against non-text binary sequences (used by some to parse binary formats like PNG). You could treat them as UTF-8 text by setting global encoding variables to assume UTF-8 in literal strings. And you could use the same methods for dealing with strings of bytes you would with strings of characters...split, gsub, and so on. The lack of a character abstraction was painful, but the ability to do character-string operations against byte-strings was frequently useful.

In order to support all this, we were forced in JRuby to also represent strings as byte[]. This was not an easy decision. Java's strings are all UTF-16 internally. By moving to byte[]-based strings, we lost many benefits of being on the JVM, like built-in regular expression support, seamless passing of strings to Java methods, and easy interaction with any Java libraries that accept, return, or manipulate Java strings. We eventually had to implement our own regexp engines (or the byte[]-to-char[]-to-byte[] overhead would kill us) and calls from Ruby to Java still pay a cost to pass Strings.

But we got a lot out of it too. We would not have been able to represent binary content easily with a char[]-based string, since it would either get garbled (when Java tried to decode it) or we'd have to force the data into only the lower byte of each char, doubling the size of all strings in memory. We have some of the fastest String IO capabilities of any JVM language, since we never have to decode text. And most importantly, we've achieved an incredibly high level of compatibility with C Ruby that would have been difficult or impossible forcing String data into char[].

There's also another major benefit: we can support Ruby 1.9's "multilingualization".

Strings in Ruby 1.9

Ruby 1.9 still represents all string data as byte[] internally, but it adds an additional field to all strings: Encoding (there's also "code range", but it's merely an internal optimization).

The biggest problem with encoded text in MRI was that you never knew what a string's encoding was supposed to be; the String object itself only aggregated byte[], and if you ever ended up with mixed encodings in a given app, you'd be in trouble. Rails even introduced its own "Chars" type specifically to deal with the lack of encoding support.

In Ruby 1.9, however, Strings know their own encoding. Strings can be forced to a specific encoding or transcoded to another. IO streams are aware of (and configurable for) external and internal encodings, and there's also default external and internal encodings. And you can still deal with raw binary data in the same structure and with the same String-manipulating features. For a full discussion of encoding support in Ruby 1.9, see Yehuda Katz's excellent post on Ruby 1.9 Encodings: A Primer.

As part of JRuby 1.6, we've been working on getting much closer to 100% compatible with Ruby 1.9. Of course this has meant working on encoding support. Luckily, we had a hacking machine some years ago in Marcin Mielzynski, who implemented not only our encoding-agnostic regexp engine (a port of Oniguruma from C), but also our byte[]-based String logic and almost all of our Encoding support. The remaining work has trickled in over the subsequent years, leading up the the last few months of heavy activity on 1.9 support.

How It Works

You might find it interesting to know how all this works, since JRuby is a JVM-based language where Strings are supposed to be UTF-16 always.

First off, String, implemented by the RubyString class. RubyString aggregates an Encoding and an array of bytes, using a structure we call ByteList. ByteList represents an arbitrary array of bytes, a view into them, and an encoding. All operations against a String operate within RubyString's code against ByteLists.

IO streams, implemented by RubyIO (and subclasses) and ChannelStream/ChannelDescriptor, accept and return ByteList instances. ByteList is the text/binary currency of JRuby...our java.lang.String.

Regexp is implemented in RubyRegexp using Joni, our Java port of the Oniguruma regular expression library. Oniguruma accepts byte arrays and uses encoding-specific information at match time to know what constitutes a character in that byte array. It is the only regular expression engine on the JVM capable of dealing with encodings other than UTF-16.

The JRuby parser also ties into encoding, using it on a per-file basis to know how to encode each literal string it encounters. Literal strings in the AST are held in StrNode, which aggregates a ByteList and constructs new String instances from it.

The compiler is an interesting case. Ideally we would like all literal strings to still go into the class file's constant pool, so that they can be loaded quickly and live as part of the class metadata. In order to do this, the byte[]-based string content is forced into a char[], which is forced into a java.lang.String that goes in the constant pool. At load time, we unpack the bytes and return them to a ByteList that knows their actual encoding. Dare I claim that JRuby is the first JVM language to allow representing literal strings in encodings other than UTF-16?

What It Means For You

At the end of the day, how are you affected by all this? How is your life improved?

If you are a Ruby user, you can count on JRuby having solid support for Ruby 1.9's "M17N" strings. That may not be complete for JRuby 1.6, but we intend to take it as far as possible. JRuby 1.6 *will* have the lion's share of M17N in-place and working.

If you are a JVM user...JRuby represents the *only* way you can deal with arbitrarily-encoded text without converting it to UTF-16 Unicode. At a minimum, this means JRuby has the potential to deal with raw wire data much more efficiently than libraries that have to up-convert to UTF-16 and downconvert back to UTF-8. It may also mean encodings without complete representation in Unicode (like Japanese "emoji" characters) can *only* be losslessly processed using JRuby, since forcing them into UTF-16 would either destroy them or mangle characters. And of course no other JVM language provides JRuby's capabilities for using String-like operations against arbitrary binary data. That's gotta be worth something!

I want to also take this opportunity to again thank Marcin for his work on JRuby in the past; Tom Enebo for his suffering through encoding-related parser work the past few weeks; Yukihiro "Matz" Matsumoto for adding encoding support to Ruby; and all JRuby committers and contributors who have helped us sort out M17N for JRuby 1.6.

Tuesday, January 4, 2011

Flat and Graph Profiles for JRuby 1.6

Sometimes it's the little things that make all the difference in the world.

For a long time, we've extolled the virtues of the amazing JVM tool ecosystem. There's literally dozens of profiling, debugging, and monitoring tools, making JRuby perhaps the best Ruby tool ecosystem you can get. But there's a surprising lack of tools for command-line use, and that's an area many Rubyists take for granted.

To help improve the situation, we recently got the ruby-debug maintainers to ship our JRuby version, so JRuby has easy-to-use command-line Ruby debugging support. You can simply "gem install ruby-debug" now, so we'll stop shipping it in JRuby 1.6.

We also shipped a basic "flat" instrumented profiler for JRuby 1.5.6. It's almost shocking how few command-line profiling tools there are available for JVM users; most require you to boot up a GUI and click a bunch of buttons to get any information at all. Even when there are tools for profiling, they're often bad at reporting results for non-Java languages like JRuby. So we decided to whip out a --profile flag that gives you a basic, flat, instrumented profile of your code.

To continue this trend, we enlisted the help of Dan Lucraft of the RedCar project to expand our profiler to include "graph" profiling results. Dan previously implemented JRuby support for the "ruby-prof" project, a native extension to C Ruby, in the form of "jruby-prof" (which you can install and use today on any recent JRuby release). He was a natural to work on the built-in profiling support.

For the uninitiated, "flat" profiles just show how much time each method body takes, possibly with downstream aggregate times and total aggregate times. This is what you usually get from built-in command-line profilers like the "hprof" profiler that ships with Hotspot/OpenJDK. Here's a "flat" profile for a simple piece of code.

~/projects/jruby ➔ jruby --profile.flat -e "def foo; 100000.times { (2 ** 200).to_s }; end; foo"
Total time: 0.99

total self children calls method
----------------------------------------------------------------
0.99 0.00 0.99 1 Object#foo
0.99 0.08 0.90 1 Fixnum#times
0.70 0.70 0.00 100000 Bignum#to_s
0.21 0.21 0.00 100000 Fixnum#**
0.00 0.00 0.00 145 Class#inherited
0.00 0.00 0.00 1 Module#method_added

A "graph" profile shows the top N call stacks from your application's run, breaking them down by how much time is spent in each method. It gives you a more complete picture of where time is being spent while running your application. Here's a "graph" profile (abbreviated) for the same code.

~/projects/jruby ➔ jruby --profile.graph -e "def foo; 100000.times { (2 ** 200).to_s }; end; foo"
%total %self total self children calls name
---------------------------------------------------------------------------------------------------------
100% 0% 1.00 0.00 1.00 0 (top)
1.00 0.00 1.00 1/1 Object#foo
0.00 0.00 0.00 145/145 Class#inherited
0.00 0.00 0.00 1/1 Module#method_added
---------------------------------------------------------------------------------------------------------
1.00 0.00 1.00 1/1 (top)
99% 0% 1.00 0.00 1.00 1 Object#foo
1.00 0.09 0.91 1/1 Fixnum#times
---------------------------------------------------------------------------------------------------------
1.00 0.09 0.91 1/1 Object#foo
99% 8% 1.00 0.09 0.91 1 Fixnum#times
0.70 0.70 0.00 100000/100000 Bignum#to_s
0.21 0.21 0.00 100000/100000 Fixnum#**
---------------------------------------------------------------------------------------------------------
0.70 0.70 0.00 100000/100000 Fixnum#times
69% 69% 0.70 0.70 0.00 100000 Bignum#to_s
---------------------------------------------------------------------------------------------------------
0.21 0.21 0.00 100000/100000 Fixnum#times
21% 21% 0.21 0.21 0.00 100000 Fixnum#**
---------------------------------------------------------------------------------------------------------
0.00 0.00 0.00 145/145 (top)
0% 0% 0.00 0.00 0.00 145 Class#inherited
---------------------------------------------------------------------------------------------------------
0.00 0.00 0.00 1/1 (top)
0% 0% 0.00 0.00 0.00 1 Module#method_added

As you can see, you get a much better picture of why certain methods are taking up time and what component calls are contributing most to that time.

We haven't settled on the final command-line flags, but look for the new graph profiling (and the cleaned-up flat profile) to ship with JRuby 1.6 (real soon now!)