
digitalmars.D - [challenge] can you break wstring's back?

reply "Steven Schveighoffer" <schveiguy yahoo.com> writes:
I am working on a string implementation that enforces the correct  
restrictions on a string (bi-directional range, etc), and I came across  
what I feel is a bug.

However, I don't know enough about utf to construct a test case to prove  
it wrong.

In std.array, there are separate functions for array.popBack(), depending  
on whether the array is a char[], a wchar[], or any other array type.  The  
char[] and wchar[] popBacks are drastically different.

However, there is only one back() function for narrow strings which  
supposedly handles both char[] and wchar[].  It looks like it will parse  
1, 2, 3, or 4 elements depending on the bit pattern, and it's only looking  
at the least significant 8 bits of the elements to determine this.  Does  
this make sense for wstring?  I would think the wstring has a different  
way of decoding data than the string, otherwise why the two different  
popBacks?

I don't know how to construct a string which shows there is an issue, is  
there one?  If so, can you prove it with a unit test?

Hint, the bit pattern of the end of the string must 'trick' the function  
into using the wrong number of elements, because ones that happen to match  
the correct number of elements needed will not cause an error (after  
deciding how many elements to decode, the data is passed to the decode  
function, which should do the right thing).

As a bonus, can you write a correct wstring.back function so I can include  
it in my string struct? :)

-Steve
Nov 23 2010
next sibling parent reply stephan <none example.com> writes:
On 24.11.2010 04:08, Steven Schveighoffer wrote:
 I am working on a string implementation that enforces the correct
 restrictions on a string (bi-directional range, etc), and I came across
 what I feel is a bug.

 However, I don't know enough about utf to construct a test case to prove
 it wrong.

 In std.array, there are separate functions for array.popBack(),
 depending on whether the array is a char[], a wchar[], or any other
 array type. The char[] and wchar[] popBacks are drastically different.

 However, there is only one back() function for narrow strings which
 supposedly handles both char[] and wchar[]. It looks like it will parse
 1, 2, 3, or 4 elements depending on the bit pattern, and it's only
 looking at the least significant 8 bits of the elements to determine
 this. Does this make sense for wstring? I would think the wstring has a
 different way of decoding data than the string, otherwise why the two
 different popBacks?

 I don't know how to construct a string which shows there is an issue, is
 there one? If so, can you prove it with a unit test?

Here you go:

    import std.array;
    import std.conv;

    void main()
    {
        dchar c = cast(dchar) 0x10000;
        auto ws = to!wstring(c);
        assert(ws.length == 2);  // encoded as a surrogate pair
        assert(ws.back == c);    // fails with decoding error
    }

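For readers unfamiliar with UTF-16, the following is a minimal sketch of why U+10000 occupies two wchars. The helper name toSurrogatePair is hypothetical and exists only for illustration; it is not part of Phobos.

```d
/// Hypothetical helper for illustration: splits a supplementary code point
/// (U+10000 .. U+10FFFF) into its UTF-16 surrogate pair.
wchar[2] toSurrogatePair(dchar c)
{
    assert(c >= 0x10000 && c <= 0x10FFFF);
    immutable uint v = c - 0x10000;                     // 20-bit offset
    wchar hi = cast(wchar)(0xD800 + (v >> 10));         // leading (high) surrogate
    wchar lo = cast(wchar)(0xDC00 + (v & 0x3FF));       // trailing (low) surrogate
    return [hi, lo];
}

unittest
{
    auto p = toSurrogatePair('\U00010000');
    assert(p[0] == 0xD800 && p[1] == 0xDC00);
}
```

So the wstring in the test case ends with the code unit 0xDC00, which is the unit back() has to classify correctly.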
 Hint, the bit pattern of the end of the string must 'trick' the function
 into using the wrong number of elements, because ones that happen to
 match the correct number of elements needed will not cause an error
 (after deciding how many elements to decode, the data is passed to the
 decode function, which should do the right thing).

 As a bonus, can you write a correct wstring.back function so I can
 include it in my string struct? :)

Use the same logic as in popBack for wstring, i.e. check whether the last wchar is the low (trailing) part of a surrogate pair (i.e. between 0xDC00 and 0xDFFF inclusive). If yes, two wchars are needed to decode to dchar. Otherwise, only one is needed.
 -Steve
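The rule described above could be sketched roughly as follows. This is only an illustration under stated assumptions, not the Phobos implementation: the name backUtf16 is hypothetical, and validation is left to std.utf.decode, which throws on malformed input such as a lone surrogate.

```d
import std.utf : decode;

/// Hypothetical helper (not Phobos code): returns the last code point
/// of a non-empty UTF-16 string.
dchar backUtf16(const(wchar)[] s)
{
    assert(s.length > 0, "an empty string has no back");

    // A trailing (low) surrogate lies in 0xDC00 .. 0xDFFF and is the second
    // unit of a pair, so the code point starts one unit earlier.
    immutable wchar last = s[$ - 1];
    size_t start = s.length - 1;
    if (last >= 0xDC00 && last <= 0xDFFF && s.length >= 2)
        --start;

    // decode reads one code point starting at `start` and validates it.
    return decode(s, start);
}

unittest
{
    assert(backUtf16("\U00010000"w) == 0x10000); // surrogate pair
    assert(backUtf16("ab"w) == 'b');             // single code unit
}
```

Only the last code unit is inspected to decide between one and two units, which is all UTF-16 needs; the 1-to-4 unit scan belongs to UTF-8.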

Nov 24 2010
parent "Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Wed, 24 Nov 2010 03:46:26 -0500, stephan <none example.com> wrote:


 Here you go

[snip]

Thank you very much :)

http://d.puremagic.com/issues/show_bug.cgi?id=5265

-Steve
Nov 24 2010
prev sibling parent spir <denis.spir gmail.com> writes:
On Tue, 23 Nov 2010 22:08:04 -0500
"Steven Schveighoffer" <schveiguy yahoo.com> wrote:

 However, there is only one back() function for narrow strings which
 supposedly handles both char[] and wchar[].

If the same logic is supposed to tell the number of code units for both char & wchar, then it is certainly a bug.

 It looks like it will parse
 1, 2, 3, or 4 elements depending on the bit pattern, and it's only looking
 at the least significant 8 bits of the elements to determine this.  Does
 this make sense for wstring?

This is what to do for UTF-8, i.e. recomposing a whole code point (character) from a variable number of code units. Stephan answered about UTF-16 (wchar).

 I would think the wstring has a different
 way of decoding data than the string, otherwise why the two different
 popBacks?

You are right.

Denis
-- -- -- -- -- -- --
vit esse estrany ☣

spir.wikidot.com
Nov 24 2010