
digitalmars.D.learn - iteration over a string

Timothee Cour <thelastmammoth gmail.com> writes:
Questions regarding iteration over code points of a utf8 string:

In all that follows, I don't want to go through intermediate UTF32
representation by making a copy of my string, but I want to iterate over
its code points.

say my string is declared as:
string a="Ωabc"; //if email reader screws this up, it's an 'Omega' followed by abc

A)
this doesn't work obviously:
foreach(i,ai; a){
  write(i,",",ai," ");
}
//prints 0,=EF=BF=BD 1,=EF=BF=BD 2,a 3,b 4,c (ie decomposes at the 'char' l=
evel, so 5
elements)

B)
foreach(i,dchar ai;a){
  write(i,",",ai," ");
}
// prints 0,Ω 2,a 3,b 4,c (i.e. decomposes at code points, so 4 elements)
But index i skips position 1: it is the start index of each code point, so
it prints [0,2,3,4]
Is that a bug or a feature?

C)
writeln(a.walkLength); // prints 4
for(size_t i;!a.empty;a.popFront,i++)
  write(i,",",a.front," ");

// prints 0,Ω 1,a 2,b 3,c
This seems the most correct for interpreting a string as a range over code
points, where index i has positions [0,1,2,3]

Is there a more idiomatic way?

D)
How to make the standard algorithms (std.map, etc) work well with the
iteration over code points as in method C above ?

For example this one is very confusing for me:
string a="ΩΩab";
auto b1=a.map!(a=>"<"d~a~">"d).array;
writeln(b1.length);//6
writeln(b1);//["<Ω>", "<Ω>", "<a>", "<b>", "", ""]
Why are there 2 empty strings at the end? (one per Omega if you vary the
number of such symbols in the string).


E)
The fact that there are 2 ways to iterate over strings is confusing:
For example, reading the docs, ForeachType is different from ElementType and
ElementType is special-cased for narrow strings;
foreach(i,ai;a){foo(i,ai);} doesn't behave as for(size_t
i;!a.empty;a.popFront,i++) {foo(i,a.front);}
walkLength != length for strings

F)
Why can't we have the following design instead:
* no special case with isNarrowString scattered throughout phobos
* iteration with foreach behaves as iteration with popFront/empty/front,
and walkLength == length
* ForeachType == ElementType (i.e. one is redundant)
* require *explicit user syntax* to construct a range over code points from
a string:

struct CodepointRange{
 this(string a){...}
 auto popFront(){}
 auto empty(){}
 auto length(){}//
}

now the user can do:
a.map!foo => will iterate over char
a.CodepointRange.map!foo => will iterate over code points.

Everything seems more orthogonal that way, and the user has a clear
understanding of the complexity of each operation.
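
For comparison, here is a minimal sketch of what explicit, opt-in iteration can look like. It assumes a Phobos release that provides std.utf.byCodeUnit and std.utf.byDchar; those are not mentioned anywhere in this thread, and the helper names codeUnitCount and codePointCount are made up for illustration. It is not the CodepointRange proposed above, only a rough analogue of the same idea: the caller states whether to walk chars or code points.

```d
import std.range : walkLength;
import std.stdio : writeln;
import std.utf : byCodeUnit, byDchar;

// Hypothetical helper: count elements when the string is viewed
// as raw UTF-8 code units (no decoding).
size_t codeUnitCount(string s)
{
    return s.byCodeUnit.walkLength;
}

// Hypothetical helper: count elements when the string is decoded
// lazily into code points (no dstring copy is made).
size_t codePointCount(string s)
{
    return s.byDchar.walkLength;
}

void main()
{
    string a = "Ωabc";          // 'Ω' takes two UTF-8 code units
    writeln(codeUnitCount(a));  // 5
    writeln(codePointCount(a)); // 4
}
```

Both views are lazy, so neither goes through an intermediate UTF-32 copy of the string.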
May 28 2013
Ali Çehreli <acehreli yahoo.com> writes:
On 05/28/2013 12:26 AM, Timothee Cour wrote:

 In all that follows, I don't want to go through intermediate UTF32
 representation by making a copy of my string, but I want to iterate over
 its code points.
Yes, the whole situation is a little messy. :)

There is also std.range.stride:

    foreach (ai; a.stride(1)) {
        // ...
    }

If you need the index as well, and do not want to manage it explicitly,
one way is to use zip and sequence:

import std.stdio;
import std.range;

void main()
{
    string a="Ωabc";

    foreach (i, ai; zip(sequence!"n", a.stride(1))) {
        write(i,",",ai," ");
    }
}

The output:

Ω a b c

Ali
May 28 2013
Ali Çehreli <acehreli yahoo.com> writes:
On 05/28/2013 12:42 AM, Ali Çehreli wrote:

 The output:

 Ω a b c
Rather:

0,Ω 1,a 2,b 3,c

Ali
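
The same index/code-point pairing can also be sketched with std.range.enumerate. This assumes a Phobos release that ships enumerate, which nobody in this thread mentions; the function name indexedCodePoints is hypothetical. Like the zip/sequence version, it counts code points rather than byte offsets.

```d
import std.conv : text;
import std.range : enumerate;
import std.stdio : writeln;

// Hypothetical helper: pair each decoded code point with its running
// count (0, 1, 2, ...), not with its byte offset in the string.
string indexedCodePoints(string s)
{
    string result;
    foreach (i, c; s.enumerate)   // c is a dchar: strings decode as ranges
        result ~= text(i, ",", c, " ");
    return result;
}

void main()
{
    writeln(indexedCodePoints("Ωabc")); // 0,Ω 1,a 2,b 3,c
}
```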
May 28 2013
"Diggory" <diggsey googlemail.com> writes:
Most algorithms for strings need the offset rather than the 
character index, so:

foreach (i, dchar c; str)

Gives the offset into the string for "i"
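
A small sketch of why the offset is the useful number: it can be fed straight back into a slice. The helper name codePointOffsets is made up for illustration.

```d
import std.stdio : writeln;

// Hypothetical helper: collect the index that foreach reports for each
// decoded code point. These are offsets in code units (bytes for UTF-8).
size_t[] codePointOffsets(string s)
{
    size_t[] offsets;
    foreach (i, dchar c; s)
        offsets ~= i;
    return offsets;
}

void main()
{
    string str = "Ωabc";
    auto offsets = codePointOffsets(str);
    writeln(offsets);               // [0, 2, 3, 4]
    writeln(str[0 .. offsets[1]]);  // Ω
    writeln(str[offsets[1] .. $]);  // abc
}
```

Slicing at these offsets always lands on a code point boundary, which would not be true of a plain character count.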

If you really need the character index just count it:

int charIndex = 0;
foreach (dchar c; str) {
    // ...

    ++charIndex;
}

If strings were treated specially so that they looked like arrays 
of dchars but used UTF-8 internally it would hide all sorts of 
performance costs. Random access into a UTF-8 string by the 
character index is O(n) whereas index by the offset is O(1).

If you are using random access by character index heavily you 
should therefore convert to a dstring first and then you can get 
the O(1) random access time.
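
A rough sketch of that trade-off, assuming std.conv.to is used for the conversion (the post does not say which conversion it has in mind) and using a made-up helper nthCodePoint for the counting approach:

```d
import std.conv : to;
import std.stdio : writeln;

// Hypothetical helper: find the n-th code point by decoding from the
// start of the UTF-8 string. Each lookup is O(n).
dchar nthCodePoint(string s, size_t n)
{
    size_t charIndex = 0;
    foreach (dchar c; s)
    {
        if (charIndex == n)
            return c;
        ++charIndex;
    }
    assert(0, "index past the end of the string");
}

void main()
{
    string s = "Ωabc";
    writeln(nthCodePoint(s, 1)); // a

    // One O(n) conversion up front, then every d[i] is O(1).
    dstring d = s.to!dstring;
    writeln(d[1]);               // a
    writeln(d.length);           // 4
}
```

The conversion allocates a copy, so it only pays off when the string is indexed by character position many times.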
May 28 2013