digitalmars.D.learn - retro() on a `string` creates a range of `dchar`, causing array()

"Jakob Ovrum" <jakobovrum gmail.com> writes:

Consider this simple function:

	private string findParameterList(string typestr)
	{
		auto strippedHead = typestr.find("(")[1 .. $];
		auto strippedTail = retro(strippedHead).find(")");

		strippedTail.popFront(); // slice off closing parenthesis

		return array(strippedTail);
	}

The type of the return expression is dstring, not string.

What is the most elegant way or correct way to solve this 
friction?

(Note: the function is used in CTFE)

Apr 17 2012

"bearophile" <bearophileHUGS lycos.com> writes:

Jakob Ovrum:

 		return array(strippedTail);
 	}

 The type of the return expression is dstring, not string.

 What is the most elegant way or correct way to solve this 
 friction?

 (Note: the function is used in CTFE)

Try "text" instead of "array".

Bye,
bearophile

Apr 17 2012

"Jakob Ovrum" <jakobovrum gmail.com> writes:

On Tuesday, 17 April 2012 at 15:18:49 UTC, bearophile wrote:
 Jakob Ovrum:

 		return array(strippedTail);
 	}

 The type of the return expression is dstring, not string.

 What is the most elegant way or correct way to solve this 
 friction?

 (Note: the function is used in CTFE)

 Try "text" instead of "array".

 Bye,
 bearophile

Thanks, that did it :)

(I also forgot to retro() a second time to make it build the 
array in the original direction, before anyone points it out)
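
For reference, a minimal sketch of the function with both fixes applied: text() instead of array(), plus the second retro(). The character needles passed to find() and the sample inputs in the unittest are assumptions made for this illustration, not part of the original code.

```d
import std.algorithm : find;
import std.conv : text;
import std.range : retro;

string findParameterList(string typestr)
{
    // Drop everything up to and including the first '('.
    auto strippedHead = typestr.find('(')[1 .. $];
    // Search from the back: the result starts at the last ')'.
    auto strippedTail = retro(strippedHead).find(')');

    strippedTail.popFront(); // slice off closing parenthesis

    // retro() again restores the original direction; text() yields a string.
    return text(retro(strippedTail));
}

unittest
{
    assert(findParameterList("void function(int, string)") == "int, string");
}
```

Like the original, this sketch assumes the input contains both parentheses; it does no error handling for other inputs.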

Apr 17 2012

Ali Çehreli <acehreli yahoo.com> writes:

On 04/17/2012 08:12 AM, Jakob Ovrum wrote:
 Consider this simple function:

 private string findParameterList(string typestr)
 {
 auto strippedHead = typestr.find("(")[1 .. $];
 auto strippedTail = retro(strippedHead).find(")");

 strippedTail.popFront(); // slice off closing parenthesis

 return array(strippedTail);
 }

 The type of the return expression is dstring, not string.

The reason is, a sequence of UTF-8 code units is not valid UTF-8 when 
reversed (or retro'ed :p). But a dchar array can be reversed.
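
To make that concrete, here is a small sketch. The helper name reverseCodeUnits and the sample string are made up for the illustration. Reversing the raw code units of a string that contains a multi-byte character produces an invalid sequence, while retro() steps by code point and yields whole characters.

```d
import std.algorithm : equal, reverse;
import std.array : array;
import std.exception : assertThrown;
import std.range : retro;
import std.string : representation;
import std.utf : UTFException, validate;

/// Hypothetical helper: reverses the raw code units, ignoring character boundaries.
char[] reverseCodeUnits(string s)
{
    ubyte[] units = s.representation.dup;
    reverse(units);
    return cast(char[]) units;
}

unittest
{
    string s = "a\u00E9"; // 'é' is two code units: 0xC3 0xA9

    // Unit-wise reversal puts the continuation byte 0xA9 first: invalid UTF-8.
    assertThrown!UTFException(validate(reverseCodeUnits(s)));

    // retro() steps by code point, so collecting it yields dchars.
    auto r = array(retro(s));
    static assert(is(typeof(r[0]) == dchar));
    assert(equal(r, "\u00E9a"d));
}
```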

Ali

Apr 17 2012

"bearophile" <bearophileHUGS lycos.com> writes:

Ali Çehreli:

 The reason is, a sequence of UTF-8 code units is not valid 
 UTF-8 when reversed (or retro'ed :p).

But reversed(char[]) now works :-)

Bye,
bearophile

Apr 17 2012

Ali Çehreli <acehreli yahoo.com> writes:

On 04/17/2012 08:58 AM, bearophile wrote:
 Ali Çehreli:

 The reason is, a sequence of UTF-8 code units is not valid UTF-8
 when reversed (or retro'ed :p).

 But reversed(char[]) now works :-)

That's pretty cool. :) (You meant reverse()).

Interesting. Since reverse() is in-place, there could be no other way 
anyway. Iterating by dchar without damaging the other end must have been 
challenging, because the first half of the string may consist entirely of 
multi-byte UTF-8 sequences and the rest entirely of single bytes.

The algorithm must be building a local string.

 Bye,
 bearophile

Ali

Apr 17 2012

Timon Gehr <timon.gehr gmx.ch> writes:

On 04/17/2012 06:09 PM, Ali Çehreli wrote:
 On 04/17/2012 08:58 AM, bearophile wrote:
  > Ali Çehreli:
  >
  >> The reason is, a sequence of UTF-8 code units is not valid UTF-8
  >> when reversed (or retro'ed :p).
  >
  > But reversed(char[]) now works :-)

 That's pretty cool. :) (You meant reverse()).

 Interesting, because there could be no other way anyway because
 reverse() is in-place. Iterating by dchar without damaging the other end
 must have been challenging because the first half of the string may have
 been all multi-byte UTF-8 code units and all of the rest of single-bytes.

 The algorithm must be building a local string.

  > Bye,
  > bearophile

 Ali

It does not have to build a local string, see
http://dlang.org/phobos/std_utf.html#strideBack
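
A minimal sketch of walking a UTF-8 string backwards with std.utf.strideBack, without building a decoded copy. The helper name charactersBackwards is made up for the illustration; it only slices the original string.

```d
import std.utf : strideBack;

/// Hypothetical helper: returns the code points of `s` from last to first,
/// each as a slice of the original string (no decoding, no copy of the text).
string[] charactersBackwards(string s)
{
    string[] result;
    size_t end = s.length;
    while (end > 0)
    {
        // Number of code units in the code point that ends just before `end`.
        immutable step = strideBack(s, end);
        result ~= s[end - step .. end];
        end -= step;
    }
    return result;
}

unittest
{
    // 'a' is 1 code unit, 'é' is 2, '€' is 3.
    assert(charactersBackwards("a\u00E9\u20AC") == ["\u20AC", "\u00E9", "a"]);
}
```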

Apr 17 2012

Ali Çehreli <acehreli yahoo.com> writes:

On 04/17/2012 09:12 AM, Timon Gehr wrote:
 On 04/17/2012 06:09 PM, Ali Çehreli wrote:

 The algorithm must be building a local string.


 It does not have to build a local string, see
 http://dlang.org/phobos/std_utf.html#strideBack

I never said otherwise. :p

I was too lazy to look up where 2.059's algorithm.d is installed. 
Apparently it is here:

   /usr/include/x86_64-linux-gnu/dmd/phobos/std/algorithm.d

The algorithm is smart. It first reverses the code units of each 
multi-byte character in place and then reverses the whole string one last 
time, which restores the unit order within each character:

void reverse(Char)(Char[] s)
if (isNarrowString!(Char[]) && !is(Char == const) && !is(Char == immutable))
{
     auto r = representation(s);
     for (size_t i = 0; i < s.length; )
     {
         immutable step = std.utf.stride(s, i);
         if (step > 1)
         {
             .reverse(r[i .. i + step]);
             i += step;
         }
         else
         {
             ++i;
         }
     }
     reverse(r);
}
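
A quick usage check of that overload, with the two passes traced in comments. The helper name reversedCopy and the sample string are assumptions made for the demonstration.

```d
import std.algorithm : reverse;

/// Hypothetical helper for the demonstration: reverses a copy of `s`.
char[] reversedCopy(string s)
{
    char[] buf = s.dup;
    reverse(buf); // char[] selects the narrow-string overload
    return buf;
}

unittest
{
    // Code units: 'a' = 61, 'é' = C3 A9, '€' = E2 82 AC
    // After pass 1 (each multi-unit character reversed): 61 | A9 C3 | AC 82 E2
    // After pass 2 (whole buffer reversed):              E2 82 AC | C3 A9 | 61
    assert(reversedCopy("a\u00E9\u20AC") == "\u20AC\u00E9a");
}
```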

Ali

P.S. As a C++ programmer, I always have exception safety on my mind. 
Unfortunately the topic does not come up much in D forums. The algorithm 
above is not exception-safe because stride() may throw. But this is way 
off topic for this thread. :)

Apr 17 2012

bearophile <bearophileHUGS lycos.com> writes:

Ali:

 The algorithm is smart.

The basic idea for that algorithm was mine, and Andrei was kind enough to
implement it, calling it a "Very fun exercise" :-)
http://d.puremagic.com/issues/show_bug.cgi?id=7086


 The algorithm
 above is not exception-safe because stride() may throw. But this is way 
 off topic for this thread. :)

You can't expect Phobos to be perfect; it needs to be improved iteratively. If
you think that's not exception safe and there are simple means to fix it,
then please file it in Bugzilla. Being formally aware of a problem is the
second step toward improving the situation.

Bye,
bearophile

Apr 17 2012

Ali Çehreli <acehreli yahoo.com> writes:

On 04/17/2012 12:57 PM, bearophile wrote:

 The algorithm
 above is not exception-safe because stride() may throw. But this is way
 off topic for this thread. :)


 You can't expect Phobos to be perfect; it needs to be improved
 iteratively. If you think that's not exception safe and there are
 simple means to fix it, then please file it in Bugzilla. Being formally
 aware of a problem is the second step toward improving the situation.

Agreed.

But I am not that sure about this particular function anymore because 
for the function to be not 'strongly exception safe', the input string 
must be invalid UTF-8 to begin with.

I am not sure how bad it is to not preserve the actual invalidness of 
the string in that case. :)

Ali

Apr 17 2012

"bearophile" <bearophileHUGS lycos.com> writes:

Ali Çehreli:

 Agreed.

 But I am not that sure about this particular function anymore 
 because for the function to be not 'strongly exception safe', 
 the input string must be invalid UTF-8 to begin with.

 I am not sure how bad it is to not preserve the actual 
 invalidness of the string in that case. :)

I see. This is a matter of design. I see some possible solutions:
1) Do nothing and assume the input is well-formed UTF-8; otherwise
the output will be wrong (or the function will throw an exception
unsafely). This is what Phobos may be doing in this case.
2) Put a UTF validation inside the function pre-condition if the
input is a narrow string. This will slow down code in non-release
mode, maybe too much.
3) Use a stronger type system that enforces pre-conditions and
post-conditions in a smarter way. This means that if the return
value of a function that has 'validate' in its post-condition is
given as input to a function that has 'validate' in its
pre-condition, the validation runs only once, even in non-release
mode. If you use many string functions, this generally saves a
lot of 'validate' calls. This kind of solution is favored in
Eiffel-like languages.
4) Use two different types, one for validated UTF-8 and one for
unvalidated UTF-8. Unless you have bad bugs in your code this
will avoid most calls to 'validate'. This solution is very simple
because it doesn't require a smart compiler, and it's favored
in languages like Haskell (for example, see http://www.yesodweb.com/).
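
A rough sketch of what option 4 could look like in D. The type ValidUtf8 and the function countCodePoints are hypothetical names invented for this illustration; nothing like them is claimed to exist in Phobos.

```d
import std.exception : assertThrown;
import std.range : walkLength;
import std.utf : UTFException, validate;

/// Hypothetical wrapper type: an instance can only be built from well-formed UTF-8.
struct ValidUtf8
{
    private string data;

    /// Validates once, at the boundary; throws UTFException on malformed input.
    this(string unchecked)
    {
        validate(unchecked);
        data = unchecked;
    }

    /// The checked text.
    string value() const { return data; }
}

/// Functions that take ValidUtf8 need no validation of their own.
size_t countCodePoints(ValidUtf8 s)
{
    return walkLength(s.value);
}

unittest
{
    assert(countCodePoints(ValidUtf8("a\u00E9")) == 2);

    // A lone continuation byte is rejected when the wrapper is constructed.
    string bad = [cast(char) 0xA9].idup;
    assertThrown!UTFException(ValidUtf8(bad));
}
```

Note that private is module-scoped in D, so code in the same module could still bypass the constructor; the sketch only shows the shape of the idea.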

Bye,
bearophile

Apr 17 2012

"Jakob Ovrum" <jakobovrum gmail.com> writes:

On Tuesday, 17 April 2012 at 15:36:39 UTC, Ali Çehreli wrote:
 The reason is, a sequence of UTF-8 code units is not valid 
 UTF-8 when reversed (or retro'ed :p). But a dchar array can be 
 reversed.

 Ali

It is absolutely possible to walk a UTF-8 string backwards.

The problem here is that arrays of char are ranges of dchar; 
hence you can't go the regular generic path and have to use 
text() instead.
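
A short illustration of that mismatch, as a sketch with an arbitrary sample string: the storage is UTF-8 code units, but the range primitives decode to code points.

```d
import std.range : ElementEncodingType, ElementType, front;

// Storage is UTF-8 code units, but the range interface works in code points.
static assert(is(ElementEncodingType!(char[]) == char));
static assert(is(ElementType!(char[]) == dchar));

unittest
{
    string s = "\u00E9a";        // 'é' is stored as 0xC3 0xA9
    assert(s[0] == 0xC3);        // indexing sees a single code unit
    assert(s.front == '\u00E9'); // front decodes a whole code point
}
```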

Apr 17 2012

Ali Çehreli <acehreli yahoo.com> writes:

On Wednesday, 18 April 2012 at 05:45:06 UTC, Jakob Ovrum wrote:
 On Tuesday, 17 April 2012 at 15:36:39 UTC, Ali Çehreli wrote:
 The reason is, a sequence of UTF-8 code units is not valid
 UTF-8 when reversed (or retro'ed :p). But a dchar array can be
 reversed.

 Ali

 It is absolutely possible to walk a UTF-8 string backwards.

Indeed. I didn't mean otherwise. I was trying to explain why "The type 
of the return expression is dstring, not string."

And I just checked, again, that my use of "UTF-8 code units" above was 
correct. :) I didn't say "Unicode code points".

Ali

Apr 18 2012
