The evening before the event, we went to Fuki Sushi in Palo Alto, a short two blocks from my hotel. I don't believe I've ever eaten such quantity or variety of Japanese cuisine, and I think our guests felt the same way. They marveled at the size of most dishes, especially the ice-cream-scoop-sized lump of wasabi and the two-foot-long sushi tray. They also photographed almost everything...I think I posed for a couple dozen snaps.
During the following day, Thursday, they sat through numerous presentations on Sun hardware and software. Tim and I also discussed Sun's position on Ruby and JRuby for an hour before and about forty minutes after lunch. Tim hit the high-level points about where Ruby will likely fit into the Java ecosystem in the future, and I supplied details and demos of JRuby. I also threw in a demo of JRuby's compiler beating Ruby 1.8 in the standard fib algorithm, which elicited a smile and laugh from Matz himself (whew! I was worried how he'd react!).
Most interesting to me, however, was my discussion with Matz that night.
I was invited to join the delegation for a crab dinner in San Francisco. We went to Crustacean, a moderately upscale joint near California and 101. And after the attendant tied my plastic bib on, we were ready to go.
Since Matz and I ended up sitting together, and since very few others at the table spoke English, we managed to get in some time discussing a couple Ruby 2.0 design issues. Here's a quick summary:
- Matz seems to have come around to my visibility proposal for "private" in Ruby 2.0, which is largely the same as how Java handles private visibility. I believe this model is a good simplification over the original proposal. See ruby-core:9996 and related for the original discussion. The basic facts of private then would be:
  - You must dispatch to private methods using a functional call, as in foo() versus xyz.foo(). I didn't like this at first, but I've come around to using call syntax to force certain aspects of visibility.
  - Dispatches to private methods will only look in the same class for the method definition.
  - Methods that are public in superclasses can't be made private in subclasses.
  - Methods that are private in superclasses are not visible to subclasses, and so new methods of the same name and any visibility can exist in subclasses.
- Protected methods in Ruby 2.0 could potentially act like private methods now, though Matz is worried it would be too much of a change. I think it's appropriate; current private method behavior is very similar to Java's model for protected methods, where the methods can't be seen from outside the hierarchy, but can be called and overridden within the hierarchy as normal. I voiced my opinion, so we'll see where Matz goes from here.
- Matz is still comfortable with removing set_trace_func if a better mechanism for profiling and debugging can replace it. I had a few suggestions for alternate mechanisms, but I also promised to look into Java's model, since it seems to work quite well. I also suggested there may be something to learn from DTrace.
- Matz has come around to the idea that encoded character sequences are a different type than unencoded byte arrays, though he still wants them to have the same outward interface.
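For comparison with the private-visibility points above, here is a minimal sketch of how private already behaves in released Ruby. It is not the Ruby 2.0 proposal itself, and the class names Base and Derived are made up for illustration. It shows the functional-call requirement, and it shows that a subclass can currently see a superclass's private method, which is the part the proposal would change.

```ruby
# Illustrative only: Base and Derived are hypothetical names.
class Base
  def call_functionally
    secret             # functional call, no receiver: allowed
  end

  def call_with_receiver(other)
    other.secret       # explicit receiver: rejected for a private method
  end

  private

  def secret
    "base secret"
  end
end

class Derived < Base
  # Today the subclass can still call the superclass's private method.
  # Under the proposal described above it would not be visible here.
  def peek
    secret
  end
end

puts Base.new.call_functionally

begin
  Base.new.call_with_receiver(Base.new)
rescue NoMethodError => e
  puts "NoMethodError: #{e.message}"
end

puts Derived.new.peek
```

The exact wording of the NoMethodError message differs between Ruby versions, so the sketch prints it rather than assuming it.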
The topic of encoded character strings came up a few times during Matz's visit, usually with him asking how we're doing things in JRuby. I explained that we mostly just follow Ruby 1.8, with our String now being backed by a byte[], but that we're also providing out-of-the-box native support for the new Rails ActiveSupport::Multibyte::Chars class, a wrapper around String that enforces character boundaries and encodings.
At dinner, we continued the discussion. I made my case for a separate type with the following points:
- A separate type would not require String's interface to change, and it could remain a byte array
- By having separate types, we can use polymorphic behavior to avoid checking and re-checking encodings for every operation
So the result of the String discussion can be summarized in a few points:
- String's interface will change from 1.8 to work with characters rather than bytes, both in the encoded and unencoded forms of String. The plan for String methods' behaviors does not change from current Ruby 1.9.
- String will have subtypes that represent encoded character data, though in most cases you won't need to know about those types. If you do need to go after a UTF8String (my name), you can, but there will also be some sort of factory model for generating encoded strings and Ruby 2's encoding pragma will handle literals.
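To make the separate-type argument concrete, here is a rough sketch of the idea. It is only an illustration under my own assumptions: ByteString, UTF8String, and StringFactory are invented names, not classes from Ruby or JRuby, and the UTF-8 counting assumes well-formed input. The point is that each type answers `length` for itself, so no call site has to check an encoding flag first.

```ruby
# Hypothetical sketch: all three names below are invented for illustration.
class ByteString
  def initialize(bytes)
    @bytes = bytes            # Array of Integers in 0..255
  end

  def byte_length
    @bytes.size
  end

  # With no encoding attached, every byte counts as one element.
  def length
    @bytes.size
  end
end

class UTF8String < ByteString
  # Counts characters by skipping UTF-8 continuation bytes (10xxxxxx).
  # Assumes the bytes are valid UTF-8.
  def length
    @bytes.count { |b| (b & 0xC0) != 0x80 }
  end
end

# A simple factory: callers name an encoding and get the matching type.
module StringFactory
  TYPES = { nil => ByteString, "UTF-8" => UTF8String }

  def self.build(bytes, encoding = nil)
    TYPES.fetch(encoding).new(bytes)
  end
end

raw  = StringFactory.build([0xC3, 0xA9])            # two unencoded bytes
utf8 = StringFactory.build([0xC3, 0xA9], "UTF-8")   # one character, U+00E9

puts raw.length        # 2
puts utf8.length       # 1
puts utf8.byte_length  # 2
```

Both objects share one outward interface, which is the property Matz wanted to keep; only the dispatch target differs.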
When Strings present characters rather than bytes, does that mean proper 4-byte characters (i.e., UTF-32 encoding)?
tom palmer: Basically, in Ruby 1.9 all the methods that used to return bytes (like String[1], etc) will now return single-character strings.
"hello"[1] => "e"
For getting at bytes directly, there are new methods like each_byte.
Makes sense. Thanks for the info.
Actually, my question still applies. For instance, if a string consists of a single code point that's past the 16-bit range, is length of the string still 1? (As opposed to Java which would say 2 for length() but 1 for codePointCount().)
The above should be:
"hi".each_byte("UTF-16") ->
0x00
0x68
0x00
0x69
FYI, the specifics of the Ruby 2.0 String interface are more Matz's thing; I was just arguing for stronger typing of encoded versus unencoded strings, and it seems like that's going to happen.
tom palmer: My understanding is that it returns single-*character* strings, including multi-byte or multi-word surrogated characters. It is a true character sequence. This is also how the Chars class works in Rails, returning single-character (but potentially multi-byte) strings for numeric string indexes.
anonymous: each_byte would go over each byte of the underlying store as-is; so each_byte on a UTF-16 encoded string would have the nulls you show. each_byte on a UTF-8 string would not, and would walk through each byte of a surrogated character individually.
tom palmer: Oh, and for the length issue...I believe there will be two methods, one for char length and one for byte length.
I don't know if it's correct to refer to UTF-32 as the only "proper" Unicode encoding.
Or to phrase it differently, at one point in the past people thought UTF-16 was the "proper" Unicode encoding.
It sounds like Ruby will be doing it right though, so you can make a UTF-32-backed string if you want. And in the future maybe someone will make a UTF-64-backed one.
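For anyone reading this thread later, the questions above can be checked against Ruby as it eventually shipped (1.9 and later). This is a small check of released behavior, not a record of what was planned at the time. Note that `each_byte` takes no encoding argument in released Ruby, so the sketch transcodes with `encode` first, and it uses "UTF-16BE" because plain "UTF-16" prepends a byte order mark.

```ruby
# Indexing returns a one-character String rather than a byte value.
puts "hello"[1]                                   # e

# Bytes of the underlying store after transcoding to big-endian UTF-16.
p "hi".encode("UTF-16BE").each_byte.to_a          # [0, 104, 0, 105]

# A single code point outside the 16-bit range (U+1F600).
astral = "\u{1F600}"
puts astral.length                                # 1 character
puts astral.bytesize                              # 4 bytes in UTF-8

# In UTF-16 it is stored as a surrogate pair but still counts as one character.
utf16 = astral.encode("UTF-16BE")
puts utf16.length                                 # 1
puts utf16.bytesize                               # 4

# A UTF-32-backed string uses four bytes per character.
puts "hi".encode("UTF-32BE").bytesize             # 8
```

So there are indeed two separate methods, `length` for characters and `bytesize` for bytes, and the answer to the surrogate question is that length stays 1.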