digitalmars.D.learn - std.utf.decode behaves unexpectedly - Bug?

HeiHon <heiko.honrath gmx.de> writes:
Consider this:

[code]
import std.stdio, std.utf, std.exception;

void do_decode(string txt)
{
     try
     {
         size_t idx;
         writeln("decode ", txt);
         for (size_t i = 0; i < txt.length; i++)
         {
             dchar dc = std.utf.decode(txt[i..i+1], idx);
             writeln(" i=", i, " length=", txt[i..i+1].length,
                     " char=", txt[i], " idx=", idx, " dchar=", dc);
         }
     }
     catch(Exception e)
     {
         writeln(e.msg, " file=", e.file, " line=", e.line);
     }
     writeln();
}

void main()
{
     do_decode("abc");
/+ result:
decode abc
  i=0 length=1 char=a idx=1 dchar=a
  i=1 length=1 char=b idx=2 dchar=c
  i=2 length=1 char=c idx=3 dchar=
+/

     do_decode("åbc");
/+ result:
decode åbc
Attempted to decode past the end of a string (at index 1) file=D:\dmd2\windows\bin\..\..\src\phobos\std\utf.d line=1268
+/

     do_decode("aåb");
/+ result:
decode aåb
  i=0 length=1 char=a idx=1 dchar=a
core.exception.RangeError std\utf.d(1265): Range violation
----------------
0x004054D4
0x0040214F
0x004045A7
0x004044BB
0x00403008
0x755D339A in BaseThreadInitThunk
0x76EE9EF2 in RtlInitializeExceptionChain
0x76EE9EC5 in RtlInitializeExceptionChain
+/
}
[/code]

I would expect:
decode abc -> dchar a, dchar b, dchar c
decode åbc -> dchar å, dchar b, dchar c
decode aåb -> dchar a, dchar å, dchar b

Am I using std.utf.decode wrongly or is it buggy?
Nov 06 2015
Spacen Jasset <spacen razemail.com> writes:
On Friday, 6 November 2015 at 19:26:50 UTC, HeiHon wrote:
 Am I using std.utf.decode wrongly or is it buggy?
I don't think you want to do this: dchar dc = std.utf.decode(txt[i..i+1], idx);

txt is UTF-8, which is a variable-length, multi-byte encoding, so txt[i..i+1] won't work: it is always a single code unit, and you end up chopping multi-byte sequences into invalid pieces of UTF-8.

You probably want decode(txt, i) instead. According to the documentation, it decodes one code point starting at index i and advances i by the number of code units it consumed. In other words, pairing that with a while (i < txt.length) loop might do the trick.
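
A minimal sketch of that suggestion, assuming std.utf.decode is called with the whole string and a ref index, as the documentation describes. The helper name codePoints is made up for illustration and is not part of Phobos:

---
import std.stdio : writeln;
import std.utf : decode;

// Hypothetical helper: collects the code points of a UTF-8 string.
dchar[] codePoints(string txt)
{
    dchar[] result;
    size_t i = 0;
    while (i < txt.length)
    {
        // decode reads one code point starting at txt[i] and advances i
        // past every code unit that belonged to it.
        result ~= decode(txt, i);
    }
    return result;
}

void main()
{
    writeln(codePoints("abc"));
    writeln(codePoints("åbc"));
    writeln(codePoints("aåb"));
}
---

The loop variable is only ever changed by decode itself, so it can never land in the middle of a multi-byte sequence.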
Nov 06 2015
HeiHon <heiko.honrath gmx.de> writes:
Sorry, I mixed up the line numbers from dmd 2.068.2 and dmd 2.069.0.
The correct line numbers for dmd 2.069.0 are:

Attempted to decode past the end of a string (at index 1) file=D:\dmd2\windows\bin\..\..\src\phobos\std\utf.d line=1281

and

core.exception.RangeError std\utf.d(1278): Range violation
Nov 06 2015
BBaz <bb.temp gmx.com> writes:
On Friday, 6 November 2015 at 19:26:50 UTC, HeiHon wrote:
 Am I using std.utf.decode wrongly or is it buggy?
It's obviously used wrongly, try this instead:

---
import std.utf, std.stdio;

dstring do_decode(string txt)
{
    dstring result;
    try
    {
        size_t idx;
        writeln("decode ", txt);
        while (true)
        {
            // Stop once every code unit is consumed (also covers empty input).
            if (idx >= txt.length)
                break;
            // decode reads one code point at idx and advances idx past it.
            result ~= std.utf.decode(txt, idx);
        }
    }
    catch (Exception e)
    {
        writeln(e.msg, " file=", e.file, " line=", e.line);
    }
    return result;
}

void main()
{
    writeln(do_decode("abc"));
    writeln(do_decode("åbc"));
    writeln(do_decode("aåb"));
}
---
Nov 06 2015
BBaz <bb.temp gmx.com> writes:
Sorry, the forum has stripped my answer. Here is the full version:

On Friday, 6 November 2015 at 19:26:50 UTC, HeiHon wrote:
 Am I using std.utf.decode wrongly or is it buggy?
It's obviously used wrongly, try this instead:

---
import std.utf, std.stdio;

dstring do_decode(string txt)
{
    dstring result;
    try
    {
        size_t idx;
        writeln("decode ", txt);
        while (true)
        {
            // Stop once every code unit is consumed (also covers empty input).
            if (idx >= txt.length)
                break;
            // decode reads one code point at idx and advances idx past it.
            result ~= std.utf.decode(txt, idx);
        }
    }
    catch (Exception e)
    {
        writeln(e.msg, " file=", e.file, " line=", e.line);
    }
    return result;
}

void main()
{
    writeln(do_decode("abc"));
    writeln(do_decode("åbc"));
    writeln(do_decode("aåb"));
}
---

In addition to what's been said in the other answers, there was another error: the `for()` loop was iterating over code units, while `txt` may contain fewer code points than code units. So instead you can use an infinite loop and break once `txt` is fully decoded.

Alternatively you could use std.range primitives to decode, which can be considered a more idiomatic way of doing things, e.g.:

---
import std.utf, std.stdio, std.range;

dstring do_decode(string txt)
{
    dstring result;
    try
    {
        writeln("decode ", txt);
        while (true)
        {
            if (txt.empty)
                break;
            // front decodes the first code point; popFront drops all of its code units.
            result ~= txt.front;
            txt.popFront;
        }
    }
    catch (Exception e)
    {
        writeln(e.msg, " file=", e.file, " line=", e.line);
    }
    return result;
}

void main()
{
    writeln(do_decode("abc"));
    writeln(do_decode("åbc"));
    writeln(do_decode("aåb"));
}
---

This works because `front` auto-decodes its argument.

To finish, a hint: you can use the unit tests found in Phobos to learn how to use a particular function. Usually there are more of them than the one shown in the ddoc.
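
The difference between code units and code points mentioned above can be checked directly. This is only a small sketch, assuming the source file is saved as UTF-8 and that `walkLength` (std.range) and `stride` (std.utf) are available as in current Phobos:

---
import std.range : walkLength;
import std.stdio : writeln;
import std.utf : stride;

void main()
{
    string txt = "aåb";

    // .length counts UTF-8 code units (bytes); 'å' (U+00E5) needs two of them.
    writeln("code units:  ", txt.length);
    // walkLength walks the string as a range of decoded code points.
    writeln("code points: ", txt.walkLength);

    // stride reports how many code units the sequence starting at an index occupies.
    writeln("stride at 0: ", stride(txt, 0));
    writeln("stride at 1: ", stride(txt, 1));

    // A foreach with a dchar loop variable decodes one code point per iteration.
    foreach (dchar c; txt)
        writeln(c);
}
---

Stepping an index by one, as the original `for()` loop did, only stays on sequence boundaries as long as every stride is 1, which is the case for pure ASCII text.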
Nov 06 2015
HeiHon <heiko.honrath gmx.de> writes:
On Friday, 6 November 2015 at 20:00:43 UTC, BBaz wrote:
 Sorry, the forum has stripped my answer. Here is the full version:
 ...
Thank you very much for taking the time to explain it!
Nov 07 2015