digitalmars.D - [challenge] can you break wstring's back?

Steven Schveighoffer (25/25) Nov 23 2010 I am working on a string implementation that enforces the correct

stephan (14/39) Nov 24 2010 Here you go

Steven Schveighoffer (5/6) Nov 24 2010 [snip]

spir (14/23) Nov 24 2010 If the same logic is supposed tell the nr of code units of char & dchar ...

"Steven Schveighoffer" <schveiguy yahoo.com> writes:

I am working on a string implementation that enforces the correct  
restrictions on a string (bi-directional range, etc), and I came across  
what I feel is a bug.

However, I don't know enough about utf to construct a test case to prove  
it wrong.

In std.array, there are separate functions for array.popBack(), depending  
on whether the array is a char[], a wchar[], or any other array type.  The  
char[] and wchar[] popBacks are drastically different.

However, there is only one back() function for narrow strings which  
supposedly handles both char[] and wchar[].  It looks like it will parse  
1, 2, 3, or 4 elements depending on the bit pattern, and it's only looking  
at the least significant 8 bits of the elements to determine this.  Does  
this make sense for wstring?  I would think the wstring has a different  
way of decoding data than the string, otherwise why the two different  
popBacks?

I don't know how to construct a string which shows there is an issue, is  
there one?  If so, can you prove it with a unit test?

Hint, the bit pattern of the end of the string must 'trick' the function  
into using the wrong number of elements, because ones that happen to match  
the correct number of elements needed will not cause an error (after  
deciding how many elements to decode, the data is passed to the decode  
function, which should do the right thing).

As a bonus, can you write a correct wstring.back function so I can include  
it in my string struct? :)

-Steve

Nov 23 2010

stephan <none example.com> writes:

On 24.11.2010 04:08, Steven Schveighoffer wrote:
 I am working on a string implementation that enforces the correct
 restrictions on a string (bi-directional range, etc), and I came across
 what I feel is a bug.

 However, I don't know enough about utf to construct a test case to prove
 it wrong.

 In std.array, there are separate functions for array.popBack(),
 depending on whether the array is a char[], a wchar[], or any other
 array type. The char[] and wchar[] popBacks are drastically different.

 However, there is only one back() function for narrow strings which
 supposedly handles both char[] and wchar[]. It looks like it will parse
 1, 2, 3, or 4 elements depending on the bit pattern, and it's only
 looking at the least significant 8 bits of the elements to determine
 this. Does this make sense for wstring? I would think the wstring has a
 different way of decoding data than the string, otherwise why the two
 different popBacks?

 I don't know how to construct a string which shows there is an issue, is
 there one? If so, can you prove it with a unit test?

Here you go

import std.array;
import std.conv;

void main() {
     dchar c = cast(dchar) 0x10000;
     auto ws = to!wstring(c);
     assert(ws.length == 2);            // encoded as a surrogate pair
     assert(ws.back == c);              // fails with decoding error
}
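
For reference, here is a minimal sketch of the arithmetic that turns 0x10000 into two wchars. The helper name toSurrogatePair is made up for illustration and is not a Phobos function. It also shows one plausible reason the shared back() trips here: the low 8 bits of the trailing unit are 0x00, which looks like plain ASCII to a check that only inspects those bits.

```d
// Hypothetical helper for illustration only; not part of Phobos.
// Splits a code point above U+FFFF into a UTF-16 surrogate pair.
wchar[2] toSurrogatePair(dchar c)
{
    assert(c >= 0x10000 && c <= 0x10FFFF, "needs a supplementary code point");
    immutable uint v = c - 0x10000;              // 20 payload bits
    return [cast(wchar)(0xD800 + (v >> 10)),     // high (leading) surrogate
            cast(wchar)(0xDC00 + (v & 0x3FF))];  // low (trailing) surrogate
}

unittest
{
    auto p = toSurrogatePair('\U00010000');
    assert(p[0] == 0xD800 && p[1] == 0xDC00);
    // The trailing unit's low byte is zero, so a byte-oriented test
    // would treat it as a single-unit character.
    assert((p[1] & 0xFF) == 0x00);
}
```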


 Hint, the bit pattern of the end of the string must 'trick' the function
 into using the wrong number of elements, because ones that happen to
 match the correct number of elements needed will not cause an error
 (after deciding how many elements to decode, the data is passed to the
 decode function, which should do the right thing).

 As a bonus, can you write a correct wstring.back function so I can
 include it in my string struct? :)

Use the same logic as in popBack for wstring, i.e. check whether the 
last wchar is the low (trailing) part of a surrogate pair (i.e. between 
0xDC00 and 0xDFFF inclusive). If yes, two wchars are needed to decode to 
dchar. Otherwise, only one is needed.
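
A rough sketch of that logic, assuming std.utf.decode is used for the actual decoding. The name wstringBack is made up for illustration; this is not the Phobos implementation.

```d
import std.utf : decode;

// Hypothetical free function for illustration; not the Phobos back().
// Returns the last code point of a UTF-16 string.
dchar wstringBack(const(wchar)[] s)
{
    assert(s.length > 0, "back of an empty string");
    immutable wchar last = s[$ - 1];
    // A low (trailing) surrogate lies in 0xDC00 .. 0xDFFF inclusive.
    immutable bool isLow = last >= 0xDC00 && last <= 0xDFFF;
    // Step back two code units for a pair, otherwise one.
    size_t i = s.length - ((isLow && s.length >= 2) ? 2 : 1);
    // decode validates the units and throws on a malformed sequence.
    return decode(s, i);
}

unittest
{
    assert(wstringBack("ab"w) == 'b');
    assert(wstringBack("a\U00010000"w) == '\U00010000');
}
```

A lone trailing surrogate at index 0 is handed to decode as a single unit, so malformed input is reported by decode rather than silently accepted.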

 -Steve

Nov 24 2010

"Steven Schveighoffer" <schveiguy yahoo.com> writes:

On Wed, 24 Nov 2010 03:46:26 -0500, stephan <none example.com> wrote:


 Here you go

[snip]

Thank you very much :)

http://d.puremagic.com/issues/show_bug.cgi?id=5265

-Steve

Nov 24 2010

spir <denis.spir gmail.com> writes:

On Tue, 23 Nov 2010 22:08:04 -0500
"Steven Schveighoffer" <schveiguy yahoo.com> wrote:

 However, there is only one back() function for narrow strings which
 supposedly handles both char[] and wchar[].

If the same logic is supposed to tell the number of code units of char & dchar, then it is certainly a bug.

 It looks like it will parse
 1, 2, 3, or 4 elements depending on the bit pattern, and it's only looking
 at the least significant 8 bits of the elements to determine this.  Does
 this make sense for wstring?

This is what to do for UTF-8, i.e. when recomposing a whole code point (char) from a variable number of code units.
Stephan answered about UTF-16, i.e. wchar.
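
To make the UTF-8 side concrete, here is a small sketch of how the number of trailing code units can be found by scanning backwards over continuation bytes (bit pattern 10xxxxxx). The name utf8BackLength is invented for this example and is not a Phobos function.

```d
// Hypothetical helper for illustration; not a Phobos function.
// Counts how many code units make up the last UTF-8 sequence in s.
size_t utf8BackLength(const(char)[] s)
{
    assert(s.length > 0, "empty string has no last character");
    size_t n = 1;
    // Continuation bytes have their top two bits set to 10.
    // A sequence is at most 4 units long, so stop there.
    while (n < 4 && n < s.length && (s[$ - n] & 0xC0) == 0x80)
        ++n;
    return n;
}

unittest
{
    assert(utf8BackLength("a") == 1);
    assert(utf8BackLength("a\u00E9") == 2);      // U+00E9: 2 units
    assert(utf8BackLength("a\u20AC") == 3);      // U+20AC: 3 units
    assert(utf8BackLength("a\U00010000") == 4);  // U+10000: 4 units
}
```

These bit patterns only mean something for 8-bit code units, which is why the same test cannot be reused on wchar.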

 I would think the wstring has a different
 way of decoding data than the string, otherwise why the two different
 popBacks?

You are right.


Denis
-- -- -- -- -- -- --
vit esse estrany ☣

spir.wikidot.com

Nov 24 2010
