Comments on Headius: California Schemin'

2007-02-13T10:59:00.000-08:00

When Strings present characters rather than bytes, does that mean proper 4-byte characters (i.e., UTF-32 encoding)?

2007-02-13T13:01:00.000-08:00

tom palmer: Basically, in Ruby 1.9 all the methods that used to return bytes (like String#[], etc.) will now return single-character strings.

"hello"[1] => "e"

For getting at bytes directly, there are new methods like each_byte.
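A minimal sketch of the distinction described above, assuming the behavior as it eventually shipped in Ruby 1.9 and later. The variable names are only for illustration.

```ruby
s = "hello"

# Indexing yields a one-character String rather than a byte value.
char = s[1]

# each_byte walks the raw bytes of the underlying storage.
bytes = s.each_byte.to_a

puts char.inspect    # "e"
puts bytes.inspect   # [104, 101, 108, 108, 111]
```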

2007-02-13T13:59:00.000-08:00

Makes sense. Thanks for the info.

2007-02-13T14:30:00.000-08:00

Actually, my question still applies. For instance, if a string consists of a single code point that's past the 16-bit range, is the length of the string still 1? (As opposed to Java, which would say 2 for length() but 1 for codePointCount().)

2007-02-13T15:06:00.000-08:00

The above should be:

"hi".each_byte("UTF-16") ->
0x00
0x68
0x00
0x69

2007-02-13T15:27:00.000-08:00

FYI, the specifics of the Ruby 2.0 String interface are more Matz's thing; I was just arguing for stronger typing of encoded versus unencoded strings, and it seems like that's going to happen.

tom palmer: My understanding is that it returns single-*character* strings, including multi-byte or multi-word surrogated characters. It is a true character sequence. This is also how the Chars class works in Rails, returning single-character (but potentially multi-byte) strings for numeric string indexes.

anonymous: each_byte would go over each byte of the underlying store as-is, so each_byte on a UTF-16 encoded string would have the nulls you show. each_byte on a UTF-8 string would not, and would walk through each byte of a multi-byte character individually.
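To make the each_byte point concrete, here is a rough sketch assuming the String#encode and String#each_byte methods as they shipped in Ruby 1.9 and later. Note that each_byte takes no encoding argument there; the string is transcoded first. U+1D11E is just an arbitrary example of a code point outside the 16-bit range.

```ruby
# "hi" stored as UTF-16 (big-endian) carries a null byte before each ASCII letter.
utf16_bytes = "hi".encode("UTF-16BE").each_byte.to_a

# The same text stored as UTF-8 has no nulls.
utf8_bytes = "hi".encode("UTF-8").each_byte.to_a

# A code point past the 16-bit range (illustrative choice: U+1D11E).
clef = "\u{1D11E}"

# UTF-8 stores it as four bytes; each_byte visits each one individually.
clef_utf8_bytes = clef.encode("UTF-8").each_byte.to_a

# UTF-16 stores it as a surrogate pair, also four bytes in total.
clef_utf16_bytes = clef.encode("UTF-16BE").each_byte.to_a

puts utf16_bytes.inspect        # [0, 104, 0, 105]
puts utf8_bytes.inspect         # [104, 105]
puts clef_utf8_bytes.inspect    # [240, 157, 132, 158]
puts clef_utf16_bytes.inspect   # [216, 52, 221, 30]
```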

2007-02-13T15:35:00.000-08:00

tom palmer: Oh, and for the length issue...I believe there will be two methods, one for char length and one for byte length.
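As a sketch of the two-method idea, assuming the names that ended up in Ruby 1.9 and later (String#length for characters, String#bytesize for bytes). The sample code point U+1D11E is an illustrative choice, not something from the discussion above.

```ruby
# A single code point past the 16-bit range.
clef = "\u{1D11E}"

char_len_utf8  = clef.length                          # counted in characters
byte_len_utf8  = clef.bytesize                        # counted in bytes

# Transcoding changes the storage but not the character count.
clef16 = clef.encode("UTF-16BE")
char_len_utf16 = clef16.length
byte_len_utf16 = clef16.bytesize

puts char_len_utf8    # 1
puts byte_len_utf8    # 4
puts char_len_utf16   # 1
puts byte_len_utf16   # 4
```

So a lone surrogate-pair character reports a length of 1 here, unlike Java's length(), which counts UTF-16 code units.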

2008-09-17T22:05:00.000-07:00

I don't know if it's correct to refer to UTF-32 as the only "proper" Unicode encoding.

Or to phrase it differently, at one point in the past people thought UTF-16 was the "proper" Unicode encoding.

It sounds like Ruby will be doing it right though, so you can make a UTF-32-backed string if you want. And in the future maybe someone will make a UTF-64-backed one.