digitalmars.D - Re: Making all strings UTF ranges has some risk of WTF

Jason House <jason.james.house gmail.com> Feb 04 2010

Jason House <jason.james.house gmail.com> writes:

Andrei Alexandrescu Wrote:

 It's no secret that string et al. are not a magic recipe for writing 
 correct Unicode code. However, things are pretty good and could be 
 further improved by operating the following changes in std.array and 
 std.range:
 
 These changes effectively make UTF-8 and UTF-16 bidirectional ranges, 
 with the quirk that you still have a sort of a random-access operator.
 
 I'm very strongly in favor of this change. Bidirectional strings allow 
 beautiful correct algorithms to be written that handle encoded strings 
 without any additional effort; with these changes, everything applicable 
 of std.algorithm works out of the box (with the appropriate fixes here 
 and there), which is really remarkable.
 
 The remaining WTF is the length property. Traditionally, a range 
 offering length also implies the expectation that a range of length n 
 allows you to call popFront n times and then assert that the range is 
 empty. However, if you check e.g. hasLength!string it will yield false, 
 although the string does have an accessible member by that name and of 
 the appropriate type.
 
 Although Phobos always checks its assumptions, people might occasionally 
 write code that just uses .length without checking hasLength. Then, 
 they'll be annoyed when the code fails with UTF-8 and UTF-16 strings.
 
 (The "real" length of the range is not stored, but can be computed by 
 using str.walkLength() in std.range.)
 
 What can be done about that? I see a number of solutions:


The underlying array of byte-sized data fragments is an implementation detail.
hasLength is a kludge. Follow good OO design and hide the implementation
details from the standard interface!
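
To make the mismatch concrete, here is a minimal sketch. It assumes a Phobos in which hasLength and walkLength are available from std.range; the sample string is made up for illustration.

```d
import std.range : hasLength, walkLength;

// The range trait reports false for narrow strings...
static assert(!hasLength!string);

void main()
{
    // Illustrative value: 'é' occupies two UTF-8 code units.
    string s = "héllo";

    // ...yet the array property is still reachable and counts code units.
    assert(s.length == 6);

    // Walking the range counts decoded code points instead.
    assert(s.walkLength == 5);
}
```

Code that reads .length directly and then calls popFront that many times would run past the end of this string.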

I would use a struct for UTF8 and UTF16 strings, and add a method to get the
raw array. That allows simple, compiler-enforced usage while still allowing
special casing to use raw data. As an added bonus, this approach generalizes
to other variable-width range elements.
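
A rough sketch of what I mean, for UTF-8 only. The names Utf8String and raw are hypothetical (nothing like this exists in Phobos under those names), and the decoding leans on std.utf's decode, stride and strideBack. Treat it as an illustration of the idea, not a finished design.

```d
import std.range.primitives : hasLength, isBidirectionalRange, walkLength;
import std.utf : decode, stride, strideBack;

// Hypothetical wrapper type for illustration; not part of Phobos.
struct Utf8String
{
    private string data; // UTF-8 code units, hidden from the range interface

    this(string s) { data = s; }

    @property bool empty() { return data.length == 0; }

    // Decode the first code point without consuming it.
    @property dchar front()
    {
        size_t i = 0;
        return decode(data, i);
    }

    // Drop every code unit that belongs to the first code point.
    void popFront() { data = data[stride(data, 0) .. $]; }

    // Decode the last code point without consuming it.
    @property dchar back()
    {
        size_t i = data.length - strideBack(data, data.length);
        return decode(data, i);
    }

    // Drop every code unit that belongs to the last code point.
    void popBack() { data = data[0 .. $ - strideBack(data, data.length)]; }

    @property Utf8String save() { return this; }

    // Explicit escape hatch for code that wants the raw code units.
    @property string raw() { return data; }
}

// A bidirectional range with no length member to misuse.
static assert(isBidirectionalRange!Utf8String);
static assert(!hasLength!Utf8String);

void main()
{
    auto s = Utf8String("héllo");   // illustrative value
    assert(s.raw.length == 6);      // code units, asked for explicitly
    assert(s.walkLength == 5);      // code points
    assert(s.front == 'h' && s.back == 'o');
}
```

With this shape, generic code cannot reach .length by accident; anything that wants the code-unit count has to go through raw and say so.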
