digitalmars.D - Reading dchar from UTF-8 stdin

Ali Çehreli <acehreli yahoo.com> writes:
Given that the input stream is UTF-8, it is understandable that the 
following program pulls just one code unit from the standard input (I 
think the console encoding is UTF-8 on my Ubuntu 10.10):

import std.stdio;

void main()
{
     char code;
     readf(" %s", &code);
     writeln(code);       // <-- may write an incomplete character
}

ö is represented by two bytes in the UTF-8 encoding. When ö is fed to 
the input of the program, the writeln expression does not produce a complete 
character on the output. That's understandable with char.
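
To make the two-byte claim concrete, here is a minimal sketch that prints the UTF-8 code units of ö. The character is written as the escape \u00F6 so that the example does not depend on the encoding of the source file; that choice is mine, not part of the original program.

```d
import std.stdio;

void main()
{
    string s = "\u00F6";      // 'ö' (U+00F6), stored as UTF-8 in a D string

    writeln(s.length);        // number of code units, prints 2

    // Iterating with 'char' visits code units, not code points.
    foreach (char unit; s)
        writefln("0x%02X", cast(ubyte) unit);   // prints 0xC3, then 0xB6
}
```

A char variable can hold only one of those two code units, which is why the first program cannot print a complete character.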

Would you expect all of the bytes to be consumed when a dchar was used 
instead?

import std.stdio;

void main()
{
     dchar code;          // <-- now a dchar
     readf(" %s", &code);
     writeln(code);       // <-- BUG: uses a code unit as a code point!
}

When the input is ö, the output now becomes Ã.
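
The Ã is consistent with the first code unit of ö (0xC3) being widened to a code point on its own, since U+00C3 is Ã. The sketch below contrasts that widening with proper decoding through std.utf.decode. It only illustrates the symptom under that assumption; it is not a trace of what readf does internally.

```d
import std.stdio;
import std.utf : decode;

void main()
{
    string input = "\u00F6";              // the bytes 0xC3 0xB6

    // Treating a single code unit as a code point (the suspected bug).
    dchar widened = cast(dchar) input[0];
    assert(widened == '\u00C3');          // U+00C3, which is not the character typed

    // Decoding consumes every code unit that belongs to the character.
    size_t index = 0;
    dchar decoded = decode(input, index);
    assert(decoded == '\u00F6');          // U+00F6
    assert(index == 2);                   // both bytes were consumed

    writeln(widened);
    writeln(decoded);
}
```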

What would you expect to happen?

Ali

P.S. As what is written is not the same as what is read above, I am 
reminded of another issue: would you expect the strings "false" and 
"true" to be accepted as correct inputs when readf'ed to bool variables?
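
For comparison, std.conv.to!bool does accept the strings "true" and "false". The sketch below reads a line and converts it that way; parseBool is a hypothetical helper name, and whether readf itself should behave the same is exactly the open question.

```d
import std.conv : to;
import std.stdio;
import std.string : strip;

// Hypothetical helper: converts "true" or "false" (with surrounding
// whitespace allowed) to a bool. Other input makes to!bool throw.
bool parseBool(string text)
{
    return to!bool(text.strip());
}

void main()
{
    string line = readln();   // e.g. the user types: true
    writeln(parseBool(line));
}
```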
Mar 15 2011
spir <denis.spir gmail.com> writes:
On 03/15/2011 11:33 PM, Ali Çehreli wrote:
 Given that the input stream is UTF-8, it is understandable that the following
 program pulls just one code unit from the standard input (I think the console
 encoding is UTF-8 on my Ubuntu 10.10):

 import std.stdio;

 void main()
 {
 char code;
 readf(" %s", &code);
 writeln(code); // <-- may write an incomplete character
 }

 ö is represented by two bytes in the UTF-8 encoding. When ö is fed to the input
 of the program, writeln expression does not produce a complete character on the
 output. That's understandable with char.

 Would you expect all of the bytes to be consumed when a dchar was used instead?

 import std.stdio;

 void main()
 {
 dchar code; // <-- now a dchar
 readf(" %s", &code);
 writeln(code); // <-- BUG: uses a code unit as a code point!
 }

Well, when I try to run that bit of code, I get an error in
std.format.formattedRead (line near the end, marked with "***" below).

void formattedRead(R, Char, S...)(ref R r, const(Char)[] fmt, S args)
{
    auto spec = FormatSpec!Char(fmt);
    static if (!S.length)
    {
        spec.readUpToNextSpec(r);
        enforce(spec.trailing.empty);
    }
    else
    {
        // The function below accounts for '*' == fields meant to be
        // read and skipped
        void skipUnstoredFields()
        {
            for (;;)
            {
                spec.readUpToNextSpec(r);
                if (spec.width != spec.DYNAMIC) break;
                // must skip this field
                skipData(r, spec);
            }
        }
        skipUnstoredFields();
        alias typeof(*args[0]) A;
        static if (isTuple!A)
        {
            foreach (i, T; A.Types)
            {
                //writeln("Parsing ", r, " with format ", fmt);
                (*args[0])[i] = unformatValue!(T)(r, spec);
                skipUnstoredFields();
            }
        }
        else
        {
            *args[0] = unformatValue!(A)(r, spec);         // ***
        }
        return formattedRead(r, spec.trailing, args[1 .. $]);
    }
}

 When the input is ö, now the output becomes Ã.

 What would you expect to happen?

I would expect a whole code point representing 'ö'.
 Ali

 P.S. As what is written is not the same as what is read above, I am reminded of
 another issue: would you expect the strings "false" and "true" to be accepted
 as correct inputs when readf'ed to bool variables?

Yep!

Denis
-- 
vita es estrany
spir.wikidot.com
Mar 16 2011
Ali Çehreli <acehreli yahoo.com> writes:
On 03/16/2011 02:52 AM, spir wrote:
 On 03/15/2011 11:33 PM, Ali Çehreli wrote:

 Given that the input stream is UTF-8


[...]
 Would you expect all of the bytes to be consumed when a dchar was used
 instead?

 import std.stdio;

 void main()
 {
 dchar code; // <-- now a dchar
 readf(" %s", &code);
 writeln(code); // <-- BUG: uses a code unit as a code point!
 }

 Well, when I try to run that bit of code, I get an error in
 std.format.formattedRead (line near the end, marked with "***" below).

I use dmd 2.052 on an Ubuntu 10.10 console and it compiles fine for me. I
know that there have been changes in formatted input and output lately.
Perhaps you are using an earlier version?

 *args[0] = unformatValue!(A)(r, spec); // ***

 When the input is ö, now the output becomes Ã.

 What would you expect to happen?

 I would expect a whole code point representing 'ö'.

I agree; I just opened a bug report:
http://d.puremagic.com/issues/show_bug.cgi?id=5743

Ali
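
Until readf handles dchar as expected, one possible workaround is to read a whole line and take its first code point. This is only a sketch: firstCodePoint is a hypothetical helper name, and it relies on foreach with a dchar loop variable decoding UTF-8 as it iterates.

```d
import std.stdio;
import std.string : strip;

// Hypothetical helper: returns the first code point of a UTF-8 string.
dchar firstCodePoint(string text)
{
    foreach (dchar c; text)   // a dchar loop variable decodes whole characters
        return c;
    throw new Exception("no character in input");
}

void main()
{
    string line = readln();                  // reads every code unit of the line
    writeln(firstCodePoint(line.strip()));   // strip() removes surrounding whitespace
}
```

Note that empty input (including end of file) makes the helper throw, so real code would want to handle that case.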
Mar 16 2011