digitalmars.D - Reading dchar from UTF-8 stdin

Ali Çehreli <acehreli yahoo.com> writes:
Given that the input stream is UTF-8, it is understandable that the 
following program pulls just one code unit from the standard input (I 
think the console encoding is UTF-8 on my Ubuntu 10.10):

import std.stdio;

void main()
{
     char code;
     readf(" %s", &code);
     writeln(code);       // <-- may write an incomplete character
}

ö is represented by two bytes in the UTF-8 encoding. When ö is fed to 
the input of the program, the writeln expression does not produce a 
complete character on the output. That's understandable with char.
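
To make the two-byte claim concrete, here is a minimal sketch that prints the UTF-8 code units of ö. The helper name codeUnits is hypothetical and exists only for this illustration; the character is written as an escape so that the example does not depend on the encoding of the source file.

```d
import std.stdio;

// Hypothetical helper (not part of Phobos): views the UTF-8 code units
// of a string as raw bytes.
immutable(ubyte)[] codeUnits(string s)
{
    return cast(immutable(ubyte)[]) s;
}

void main()
{
    string s = "\u00F6";       // 'ö'
    writeln(s.length);         // number of UTF-8 code units: 2
    foreach (b; codeUnits(s))
        writefln("0x%02X", b); // 0xC3, then 0xB6
}
```

A char variable can hold only one of these two code units, which is why the first program cannot print a complete character.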

Would you expect all of the bytes to be consumed when a dchar was used 
instead?

import std.stdio;

void main()
{
     dchar code;          // <-- now a dchar
     readf(" %s", &code);
     writeln(code);       // <-- BUG: uses a code unit as a code point!
}

When the input is ö, now the output becomes Ã.
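
The Ã is consistent with the first UTF-8 code unit of ö (0xC3) being stored in the dchar as if it were the code point U+00C3. The following sketch only imitates that effect, assuming this is what happens inside readf; firstUnitAsCodePoint is a hypothetical helper, not a library function.

```d
import std.stdio;

// Hypothetical helper imitating the suspected behavior: the first UTF-8
// code unit is widened to a dchar without any decoding.
dchar firstUnitAsCodePoint(string s)
{
    return s[0];
}

void main()
{
    string input = "\u00F6";              // 'ö' is 0xC3 0xB6 in UTF-8
    dchar wrong = firstUnitAsCodePoint(input);
    writefln("U+%04X", cast(uint) wrong); // U+00C3
    writeln(wrong);                       // Ã
}
```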

What would you expect to happen?

Ali

P.S. As what is written is not the same as what is read above, I am 
reminded of another issue: would you expect the strings "false" and 
"true" to be accepted as correct inputs when readf'ed to bool variables?
Mar 15 2011
spir <denis.spir gmail.com> writes:
On 03/15/2011 11:33 PM, Ali Çehreli wrote:
 Given that the input stream is UTF-8, it is understandable that the following
 program pulls just one code unit from the standard input (I think the console
 encoding is UTF-8 on my Ubuntu 10.10):

 import std.stdio;

 void main()
 {
     char code;
     readf(" %s", &code);
     writeln(code);       // <-- may write an incomplete character
 }

 ö is represented by two bytes in the UTF-8 encoding. When ö is fed to the input
 of the program, writeln expression does not produce a complete character on the
 output. That's understandable with char.

 Would you expect all of the bytes to be consumed when a dchar was used instead?

 import std.stdio;

 void main()
 {
     dchar code;          // <-- now a dchar
     readf(" %s", &code);
     writeln(code);       // <-- BUG: uses a code unit as a code point!
 }

Well, when I try to run that bit of code, I get an error in
std.format.formattedRead (line near the end, marked with "***" below).

void formattedRead(R, Char, S...)(ref R r, const(Char)[] fmt, S args)
{
    auto spec = FormatSpec!Char(fmt);
    static if (!S.length)
    {
        spec.readUpToNextSpec(r);
        enforce(spec.trailing.empty);
    }
    else
    {
        // The function below accounts for '*' == fields meant to be
        // read and skipped
        void skipUnstoredFields()
        {
            for (;;)
            {
                spec.readUpToNextSpec(r);
                if (spec.width != spec.DYNAMIC) break;
                // must skip this field
                skipData(r, spec);
            }
        }
        skipUnstoredFields();
        alias typeof(*args[0]) A;
        static if (isTuple!A)
        {
            foreach (i, T; A.Types)
            {
                //writeln("Parsing ", r, " with format ", fmt);
                (*args[0])[i] = unformatValue!(T)(r, spec);
                skipUnstoredFields();
            }
        }
        else
        {
            *args[0] = unformatValue!(A)(r, spec);    // ***
        }
        return formattedRead(r, spec.trailing, args[1 .. $]);
    }
}

 When the input is ö, now the output becomes Ã.

 What would you expect to happen?
I would expect a whole code representing 'ö'.
 Ali

 P.S. As what is written is not the same as what is read above, I am reminded of
 another issue: would you expect the strings "false" and "true" to be accepted
 as correct inputs when readf'ed to bool variables?
Yep!

Denis
-- 
_________________
vita es estrany
spir.wikidot.com
Mar 16 2011
Ali Çehreli <acehreli yahoo.com> writes:
On 03/16/2011 02:52 AM, spir wrote:
 On 03/15/2011 11:33 PM, Ali Çehreli wrote:
 Given that the input stream is UTF-8
[...]
 Would you expect all of the bytes to be consumed when a dchar was used
 instead?

 import std.stdio;

 void main()
 {
     dchar code;          // <-- now a dchar
     readf(" %s", &code);
     writeln(code);       // <-- BUG: uses a code unit as a code point!
 }

 Well, when I try to run that bit of code, I get an error in
 std.format.formattedRead (line near the end, marked with "***" below).

I use dmd 2.052 on an Ubuntu 10.10 console and it compiles fine for me. I
know that there have been changes in formatted input and output lately.
Perhaps you are using an earlier version?

 *args[0] = unformatValue!(A)(r, spec); // ***
 When the input is ö, now the output becomes Ã.

 What would you expect to happen?
 I would expect a whole code representing 'ö'.

I agree; I just opened a bug report:

http://d.puremagic.com/issues/show_bug.cgi?id=5743

Ali
Mar 16 2011
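
For readers who run into the same behavior, one possible workaround is to read a whole line and decode its first code point explicitly with std.utf.decode instead of relying on readf. This is only a sketch under stated assumptions: the input is valid UTF-8, firstCodePoint is a hypothetical helper, and, unlike the " %s" format used in the thread, nothing here skips leading whitespace.

```d
import std.stdio;
import std.utf : decode;

// Hypothetical helper: decodes the first code point of a UTF-8 string.
// decode throws if the string does not start with a valid sequence.
dchar firstCodePoint(string s)
{
    size_t index = 0;
    return decode(s, index);  // advances index past the whole sequence
}

void main()
{
    string line = readln();
    if (line.length == 0)
        return;               // end of input, nothing to decode
    writeln(firstCodePoint(line));
}
```

With this approach both bytes of ö are consumed as a single code point, which is the behavior the participants in the thread said they would expect from a dchar.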