Comments on Headius: California Schemin'
Blog author: headius (http://www.blogger.com/profile/15717357218364947795)

Tom Palmer, 2007-02-13T10:59:00-08:00

When Strings present characters rather than bytes, does that mean proper 4-byte characters (i.e., UTF-32 encoding)?

Charles Oliver Nutter (http://www.blogger.com/profile/06400331959739924670), 2007-02-13T13:01:00-08:00

tom palmer: Basically, in Ruby 1.9 all the methods that used to return bytes (like String[1], etc) will now return single-character strings.

"hello"[1] => "e"

For getting at bytes directly, there are new methods like each_byte.

Tom Palmer, 2007-02-13T13:59:00-08:00

Makes sense. Thanks for the info.

Tom Palmer, 2007-02-13T14:30:00-08:00

Actually, my question still applies. For instance, if a string consists of a single code point that's past the 16-bit range, is length of the string still 1? (As opposed to Java which would say 2 for length() but 1 for codePointCount().)

Anonymous, 2007-02-13T15:06:00-08:00

The above should be:

"hi".each_byte("UTF-16") ->
0x00
0x68
0x00
0x69

Charles Oliver Nutter, 2007-02-13T15:27:00-08:00

FYI, the specifics of the Ruby 2.0 String interface are more Matz's thing; I was just arguing for stronger typing of encoded versus unencoded strings, and it seems like that's going to happen.

tom palmer: My understanding is that it returns single-*character* strings, including multi-byte or multi-word surrogated characters. It is a true character sequence. This is also how the Chars class works in Rails, returning single-character (but potentially multi-byte) strings for numeric string indexes.

anonymous: each_byte would go over each byte of the underlying store as-is; so each_byte on a UTF-16 encoded string would have the nulls you show. each_byte on a UTF-8 string would not, and would walk through each byte of a surrogated character individually.

Charles Oliver Nutter, 2007-02-13T15:35:00-08:00

tom palmer: Oh, and for the length issue...I believe there will be two methods, one for char length and one for byte length.

Anonymous, 2008-09-17T22:05:00-07:00

I don't know if it's correct to refer to UTF-32 as the only "proper" Unicode encoding.

Or to phrase it differently, at one point in the past people thought UTF-16 was the "proper" Unicode encoding.

It sounds like Ruby will be doing it right though, so you can make a UTF-32-backed string if you want. And in the future maybe someone will make a UTF-64-backed one.
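
For anyone who wants to try the behaviour discussed in this thread, here is a minimal Ruby sketch, assuming Ruby 1.9 or later as it was eventually released. The helper name `string_facts` is hypothetical and only for illustration. Two points differ from what was guessed in the comments: the released `each_byte` takes no encoding argument, so the sketch transcodes first with `String#encode`, and it uses "UTF-16BE" rather than "UTF-16" so that no byte order mark is prepended. The two length methods that shipped are `length` (characters) and `bytesize` (bytes).

```ruby
# Hypothetical helper (not from the thread): summarise how Ruby 1.9+
# views a string in its current encoding.
def string_facts(str)
  {
    encoding: str.encoding.name,  # name of the string's encoding
    chars:    str.length,         # number of characters
    bytes:    str.bytesize,       # number of bytes in the underlying store
    first:    str[0],             # a one-character String, not an Integer
    raw:      str.each_byte.to_a  # the bytes each_byte walks over, as-is
  }
end

# Numeric indexing returns a single-character string.
puts "hello"[1]                          # prints e

# U+1D11E (MUSICAL SYMBOL G CLEF) is past the 16-bit range.
clef = "\u{1D11E}"

p string_facts(clef)                     # UTF-8: 1 character, 4 bytes
p string_facts(clef.encode("UTF-16BE"))  # surrogate pair: 1 character, 4 bytes
p string_facts("hi".encode("UTF-16BE"))  # raw bytes are 0x00 0x68 0x00 0x69
```

So the answer to the length question is 1 in either encoding: the surrogate pair still counts as one character, which corresponds to Java's codePointCount() rather than its length().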