digitalmars.D.learn - Read a unicode character from the terminal

Jacob Carlborg (6/6) Mar 31 2012 How would I read a unicode character from the terminal? I've tried using...

Ali Çehreli (65/70) Mar 31 2012 I recommend using stdin. The destiny of std.cstream is uncertain and

Jordi Sayol (4/4) Mar 31 2012 Many thanks for being so educational.
Jordi Sayol (7/7) Mar 31 2012 BTW, for those who do not know, Ali Çehreli is writing a book to le...

Ali Çehreli (8/12) Mar 31 2012 Thank you very much for the free plug! :)

Ali Çehreli (67/70) Mar 31 2012 Here is a Unicode character range, which is unfortunately pretty

Jacob Carlborg (11/18) Apr 01 2012 Ok, what's the difference compared to the example in your first post:

Ali Çehreli (14/32) Apr 01 2012 No difference in that example because it consumes the entire input as

Jacob Carlborg (4/17) Apr 01 2012 Ok, I see, thanks.

Jacob Carlborg (9/73) Apr 01 2012 Yeah, exactly. When I think about it, I don't know why I thought "getc"

Stewart Gordon (7/10) Mar 31 2012 What OS are you using?

Jacob Carlborg (5/16) Apr 01 2012 I'll have a look, thanks.
Stewart Gordon (8/11) Apr 04 2012 The D2 version is now up on the site.

Jacob Carlborg (10/17) Apr 04 2012 Sure I can help you with testing. I have a lot on my own table so I

Stewart Gordon (7/11) Apr 04 2012 Just to hold some miscellaneous utility classes/structs/functions.

Jacob Carlborg (6/21) Apr 04 2012 Ok, I see. The functions that need a Posix implementation are mostly in

Stewart Gordon (8/10) Apr 05 2012 Maybe it contains the code I need to finish datetime off. Though I can'...

Jacob Carlborg (5/18) Apr 05 2012 http://dlang.org/phobos/std_getopt.html

Stewart Gordon (5/7) Apr 07 2012 Where is the code in std.getopt that has any relevance whatsoever to

Jacob Carlborg (5/13) Apr 07 2012 Both std.getopt and smjg.libs.util.commandline handle command line

Stewart Gordon (10/12) Apr 07 2012 What's that to do with anything?

Jacob Carlborg (4/16) Apr 07 2012 I don't know what your module is supposed to do.

Stewart Gordon (8/9) Apr 07 2012 Then how about reading its documentation?

Jacob Carlborg (14/18) Apr 04 2012 I solved it like this:

Jacob Carlborg <doob me.com> writes:

How would I read a unicode character from the terminal? I've tried using 
"std.cstream.din.getc" but it seems to only work for ascii characters. 
If I try to read and print something that isn't ascii, it just prints a 
question mark.

-- 
/Jacob Carlborg

Mar 31 2012

Ali Çehreli <acehreli yahoo.com> writes:

On 03/31/2012 08:56 AM, Jacob Carlborg wrote:
 How would I read a unicode character from the terminal? I've tried using
 "std.cstream.din.getc"

I recommend using stdin. The destiny of std.cstream is uncertain and 
stdin is sufficient. (I know that it lacks support for BOM but I don't 
need them.)

 but it seems to only work for ascii characters.
 If I try to read and print something that isn't ascii, it just prints a
 question mark.

The word 'character' used to mean characters of the Latin-based 
alphabets, but with Unicode support that's not the case anymore. In D, 
'character' means UTF code unit, nothing else. Unfortunately, although 
'Unicode character' is the correct term to use, it conflicts with 
D's characters, which are not Unicode characters.

'Unicode code point' is the non-conflicting term that matches what we 
mean by 'Unicode character.' Only dchar is wide enough to hold any code 
point.

That's the part about characters.
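
To make the distinction concrete, here is a minimal sketch (not from the original post) that counts code units and code points for the same letter. It assumes nothing beyond Phobos; the letter is written as the escape \u00E7 so that the result does not depend on how the source file is saved.

```d
import std.stdio;
import std.range : walkLength;

void main()
{
    string  s8  = "\u00E7";   // ç as UTF-8:  char code units
    wstring s16 = "\u00E7"w;  // ç as UTF-16: wchar code units
    dstring s32 = "\u00E7"d;  // ç as UTF-32: dchar code units

    writeln(s8.length);       // 2: .length counts code units, not code points
    writeln(s16.length);      // 1
    writeln(s32.length);      // 1
    writeln(s8.walkLength);   // 1: walking the string as a range decodes code points
}
```

The .length of a string is therefore not a character count; only walking the string (or converting it to a dstring) counts code points.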

The other side is what is being fed into the program through its 
standard input. On my Linux consoles, the text comes as a stream of 
chars, i.e. a UTF-8 encoded text. You must ensure that your terminal is 
capable of supporting Unicode through its settings. On Windows 
terminals, one must enter 'chcp 65001' to set the terminal to UTF-8.

Then, it is the program that must know what the data represents. If you 
are expecting a Unicode code point, then you may think that it should be 
as simple as reading into a dchar:

import std.stdio;

void main()
{
     dchar letter;
     readf("%s", &letter);    // <-- does not work!
     writeln(letter);
}

The output:

$ ./deneme
ç
Ã  <-- will be different on different consoles

The problem is, char can implicitly be converted to dchar. Since the 
letter ç consists of two chars (two UTF-8 code units), dchar gets the 
first one converted as a dchar.
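
The conversion itself can be reproduced without any input at all. The following is only an illustrative sketch (not from the original post): it takes the first UTF-8 code unit of ç and assigns it to a dchar, which compiles silently and produces a different code point.

```d
import std.stdio;

void main()
{
    string s = "\u00E7";   // ç, stored as the two code units 0xC3 0xA7

    // Show the raw code units.
    writefln("%(%02X %)", cast(const(ubyte)[]) s);   // C3 A7

    // Implicit char -> dchar conversion: the value is copied, nothing is decoded.
    dchar wrong = s[0];
    writefln("U+%04X", cast(uint) wrong);            // U+00C3, i.e. Ã rather than ç
}
```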

To see this, read and write two chars in a loop without a newline in 
between:

import std.stdio;

void main()
{
     foreach (i; 0 .. 2) {
         char code;
         readf("%s", &code);
         write(code);
     }

     writeln();
}

This time two code units are read and then outputted to form a Unicode 
character on the console:

$ ./deneme
ç
ç   <-- result of two write(code) expressions

The solution is to use ranges when pulling Unicode characters out of 
strings. std.stdio does not provide this yet, but it will eventually 
happen (so I've heard :)).

For now, this is a way of getting Unicode characters from the input:

import std.stdio;

void main()
{
     string line = readln();

     foreach (dchar c; line) {
         writeln(c);
     }
}

Once you have the input as a string, std.utf.decode can also be used.
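
A minimal sketch of the std.utf.decode route, added for illustration: the literal below merely stands in for a line obtained from readln(), and the variable names are arbitrary.

```d
import std.stdio;
import std.utf : decode;

void main()
{
    string line = "\u00E7a";   // stand-in for the result of readln()

    size_t index = 0;
    while (index < line.length) {
        // decode() returns the code point that starts at index and
        // advances index past all of that code point's code units.
        dchar c = decode(line, index);
        writefln("U+%04X, next code unit at %s", cast(uint) c, index);
    }
}
```

Unlike foreach with a dchar loop variable, decode gives explicit control over the position, which helps when only part of the line should be consumed.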

Ali

Mar 31 2012

Jordi Sayol <g.sayol yahoo.es> writes:

Many thanks for being so educational.

Best regards,
-- 
Jordi Sayol

Mar 31 2012

Jordi Sayol <g.sayol yahoo.es> writes:

BTW, for those who do not know, Ali Çehreli is writing a book to learn "D"
from scratch. It's very educational.
There are two formats: HTML (on-line) and PDF.
http://ddili.org/ders/d.en/index.html

Best regards,
-- 
Jordi Sayol

Mar 31 2012

Ali Çehreli <acehreli yahoo.com> writes:

On 03/31/2012 02:31 PM, Jordi Sayol wrote:
 BTW, for those who do not know, Ali Çehreli is writing a book to learn "D"
from scratch. It's very educational.
 There are two formats: HTML (on-line) and PDF.
 http://ddili.org/ders/d.en/index.html

 Best regards,

Thank you very much for the free plug! :)

I have translated eleven more chapters since the last announcement. I am 
on the assert chapter as we speak. It is taking longer than I had 
expected because I constantly make improvements to the original: 
corrections, consistency improvements, additions, adapting code samples 
to the current state of D, etc.

Ali

Mar 31 2012

Ali Çehreli <acehreli yahoo.com> writes:

On 03/31/2012 11:53 AM, Ali Çehreli wrote:

 The solution is to use ranges when pulling Unicode characters out of
 strings. std.stdio does not provide this yet, but it will eventually
 happen (so I've heard :)).

Here is a Unicode character range, which is unfortunately pretty 
inefficient because it relies on an exception that is thrown from 
isValidDchar! :p

import std.stdio;
import std.utf;
import std.array;

struct UnicodeRange
{
     File file;
     char[4] codes;
     bool ready;

     this(File file)
     {
         this.file = file;
         this.ready = false;
     }

     bool empty() const @property
     {
         return file.eof();
     }

     dchar front() const @property
     {
         if (!ready) {
             // Sorry, no 'mutable' in D! :p
             UnicodeRange * mutable_this = cast(UnicodeRange*)&this;
             mutable_this.readNext();
         }
         return codes.front;
     }

     void popFront()
     {
         codes = codes.init;
         ready = false;
     }

     void readNext()
     {
         foreach (ref code; codes) {
             file.readf("%s", &code);

             if (file.eof()) {
                 codes[] = '\0';
                 ready = false;
                 break;
             }

             // Expensive way of determining "ready"!
             try {
                 if (isValidDchar(codes.front)) {
                     ready = true;
                     break;
                 }

             } catch (Exception) {
                 // not ready
             }
         }
     }
}

UnicodeRange byUnicode(File file = stdin)
{
     return UnicodeRange(file);
}

void main()
{
     foreach(c; byUnicode()) {
         writeln(c);
     }
}
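
The try/catch can be avoided by asking std.utf.stride how long the sequence is, judging from its lead code unit. What follows is only a sketch under stated assumptions, not part of the range above: readCodePoint is a hypothetical helper name, the input is assumed to be valid UTF-8 (stride and decode throw a UTFException otherwise), and the bytes are fetched with File.rawRead instead of readf.

```d
import std.stdio;
import std.utf : decode, stride;

/// Hypothetical helper: reads one UTF-8 encoded code point from file.
/// Returns false when the input ends before a complete code point is read.
bool readCodePoint(File file, out dchar result)
{
    char[4] buffer;

    // Read the lead code unit; an empty result means end of input.
    if (file.rawRead(buffer[0 .. 1]).length == 0)
        return false;

    // stride() derives the sequence length from the lead code unit alone.
    immutable len = stride(buffer[], 0);

    // Read the continuation code units, if any (rawRead rejects empty buffers).
    if (len > 1 && file.rawRead(buffer[1 .. len]).length != len - 1)
        return false;

    auto units = buffer[0 .. len];
    size_t index = 0;
    result = decode(units, index);
    return true;
}

void main()
{
    dchar c;
    while (readCodePoint(stdin, c))
        writeln(c);
}
```

Because exactly one code point's worth of bytes is consumed per call, the rest of the stream stays available to other std.stdio functions.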

Ali

Mar 31 2012

Jacob Carlborg <doob me.com> writes:

On 2012-04-01 01:17, Ali Çehreli wrote:
 On 03/31/2012 11:53 AM, Ali Çehreli wrote:

  > The solution is to use ranges when pulling Unicode characters out of
  > strings. std.stdio does not provide this yet, but it will eventually
  > happen (so I've heard :)).

 Here is a Unicode character range, which is unfortunately pretty
 inefficient because it relies on an exception that is thrown from
 isValidDchar! :p

Ok, what's the difference compared to the example in your first post:

void main()
{
     string line = readln();

     foreach (dchar c; line) {
         writeln(c);
     }
}

-- 
/Jacob Carlborg

Apr 01 2012

Ali Çehreli <acehreli yahoo.com> writes:

On 04/01/2012 05:00 AM, Jacob Carlborg wrote:
 On 2012-04-01 01:17, Ali Çehreli wrote:
 On 03/31/2012 11:53 AM, Ali Çehreli wrote:

 The solution is to use ranges when pulling Unicode characters out of
 strings. std.stdio does not provide this yet, but it will eventually
 happen (so I've heard :)).

 Here is a Unicode character range, which is unfortunately pretty
 inefficient because it relies on an exception that is thrown from
 isValidDchar! :p

 Ok, what's the difference compared to the example in your first post:

 void main()
 {
 string line = readln();

 foreach (dchar c; line) {
 writeln(c);
 }
 }

No difference in that example because it consumes the entire input as 
dchars.

But in general, with that inefficient range, it is possible to pull just 
one dchar from the input and leave the rest of the stream untouched. For 
example, it would be possible to readf() an int right after that:

     auto u = byUnicode();

     dchar d = u.front;  // <-- reads just one dchar from the range

     int i;
     readf("%s", &i);    // <-- continues with std.stdio functions
     writeln(i);

With the readln() method, the int would first have to be looked for in the 
rest of the line already read, and only then in the remaining input.

Ali

Apr 01 2012

Jacob Carlborg <doob me.com> writes:

On 2012-04-01 16:02, Ali Çehreli wrote:

 No difference in that example because it consumes the entire input as
 dchars.

 But in general, with that inefficient range, it is possible to pull just
 one dchar from the input and leave the rest of the stream untouched. For
 example, it would be possible to readf() an int right after that:

 auto u = byUnicode();

 dchar d = u.front; // <-- reads just one dchar from the range

 int i;
 readf("%s", &i); // <-- continues with std.stdio functions
 writeln(i);

 With the readln() method, the int would first have to be looked for in the
 rest of the line already read, and only then in the remaining input.

 Ali

Ok, I see, thanks.

-- 
/Jacob Carlborg

Apr 01 2012

Jacob Carlborg <doob me.com> writes:

On 2012-03-31 20:53, Ali Çehreli wrote:
 I recommend using stdin. The destiny of std.cstream is uncertain and
 stdin is sufficient. (I know that it lacks support for BOM but I don't
 need them.)

I thought std.cstream was a stream wrapper around stdin.

 The word 'character' used to mean characters of the Latin-based
 alphabets but with Unicode support that's not the case anymore. In D,
 'character' means UTF code unit, nothing else. Unfortunately, although
 'Unicode character' is just the correct term to use, it conflicts with
 D's characters which are not Unicode characters.

 'Unicode code point' is the non-conflicting term that matches what we
 mean with 'Unicode character.' Only dchar can hold code points.

 That's the part about characters.

Yeah, exactly. When I think about it, I don't know why I thought "getc" 
would work since it only returns a "char" and not a "dchar".

 The other side is what is being fed into the program through its
 standard input. On my Linux consoles, the text comes as a stream of
 chars, i.e. a UTF-8 encoded text. You must ensure that your terminal is
 capable of supporting Unicode through its settings. On Windows
 terminals, one must enter 'chcp 65001' to set the terminal to UTF-8.

I'm on Mac OS X, the terminal is capable of handling Unicode.

 Then, it is the program that must know what the data represents. If you
 are expecting a Unicode code point, then you may think that it should be
 as simple as reading into a dchar:

 import std.stdio;

 void main()
 {
 dchar letter;
 readf("%s", &letter); // <-- does not work!
 writeln(letter);
 }

 The output:

 $ ./deneme
 ç
 Ã <-- will be different on different consoles

I tried that as well.

 The problem is, char can implicitly be converted to dchar. Since the
 letter ç consists of two chars (two UTF-8 code units), dchar gets the
 first one converted as a dchar.

 To see this, read and write two chars in a loop without a newline in
 between:

 import std.stdio;

 void main()
 {
 foreach (i; 0 .. 2) {
 char code;
 readf("%s", &code);
 write(code);
 }

 writeln();
 }

 This time two code units are read and then outputted to form a Unicode
 character on the console:

 $ ./deneme
 ç
 ç <-- result of two write(code) expressions

 The solution is to use ranges when pulling Unicode characters out of
 strings. std.stdio does not provide this yet, but it will eventually
 happen (so I've heard :)).

 For now, this is a way of getting Unicode characters from the input:

 import std.stdio;

 void main()
 {
 string line = readln();

 foreach (dchar c; line) {
 writeln(c);
 }
 }

 Once you have the input as a string, std.utf.decode can also be used.

 Ali

I'll give that a try, thanks.

-- 
/Jacob Carlborg

Apr 01 2012

Stewart Gordon <smjg_1998 yahoo.com> writes:

On 31/03/2012 16:56, Jacob Carlborg wrote:
 How would I read a unicode character from the terminal? I've tried using
 "std.cstream.din.getc" but it seems to only work for ascii characters. If I
 try to read and print something that isn't ascii, it just prints a question
 mark.

What OS are you using?

And what codepage is the console set to?

You might want to try the console module in my utility library:

http://pr.stewartsplace.org.uk/d/sutil/

(For D1 at the moment, but a D2 version will be available any day now!)

Stewart.

Mar 31 2012

Jacob Carlborg <doob me.com> writes:

On 2012-04-01 00:14, Stewart Gordon wrote:
 On 31/03/2012 16:56, Jacob Carlborg wrote:
 How would I read a unicode character from the terminal? I've tried using
 "std.cstream.din.getc" but it seems to only work for ascii characters.
 If I try to read
 and print something that isn't ascii, it just prints a question mark.

 What OS are you using?

 And what codepage is the console set to?

I'm using Mac OS X and the terminal is set to handle UTF-8.

 You might want to try the console module in my utility library:

 http://pr.stewartsplace.org.uk/d/sutil/

 (For D1 at the moment, but a D2 version will be available any day now!)

 Stewart.

I'll have a look, thanks.

-- 
/Jacob Carlborg

Apr 01 2012

Stewart Gordon <smjg_1998 yahoo.com> writes:

On 31/03/2012 23:14, Stewart Gordon wrote:
<snip>
 You might want to try the console module in my utility library:

 http://pr.stewartsplace.org.uk/d/sutil/

 (For D1 at the moment, but a D2 version will be available any day now!)

The D2 version is now up on the site.

Jacob - would you be up for helping me with testing/implementation of my
library on Mac OS?  If you do a search for "todo" you'll see what needs to be
done.  Some of it will benefit Unix-type systems generally.  If perchance you
have a big-endian CPU, testing the bit arrays on it would also be of value.

Stewart.

Apr 04 2012

Jacob Carlborg <doob me.com> writes:

On 2012-04-04 18:06, Stewart Gordon wrote:

 The D2 version is now up on the site.

 Jacob - would you be up for helping me with testing/implementation of my
 library on Mac OS? If you do a search for "todo" you'll see what needs
 to be done. Some of it will benefit Unix-type systems generally. If
 perchance you have a big-endian CPU, testing the bit arrays on it would
 also be of value.

 Stewart.

Sure I can help you with testing. I have a lot on my own table so I 
don't have any time for implementing things (maybe some small things). 
If I may ask, what is the point of this library? Doesn't it duplicate 
functionality that's already available in Phobos and/or Tango?

For Mac OS X, if you just follow the Posix standard you'll get very far.

I have an x86 CPU; it has been a couple of years since Apple last had 
a PPC-based computer.

-- 
/Jacob Carlborg

Apr 04 2012

Stewart Gordon <smjg_1998 yahoo.com> writes:

On 04/04/2012 17:37, Jacob Carlborg wrote:
<snip>
 Sure I can help you with testing. I have a lot on my own table so I don't have
 any time for implementing things (maybe some small things). If I may ask, what
 is the point of this library?

Just to hold some miscellaneous utility classes/structs/functions.

 Doesn't it duplicate functionality that's already available in Phobos and/or
 Tango?

<snip>

It certainly does in places.  But what matters is that it contains
functionality that isn't present in Phobos (or wasn't present in Phobos at the
time I wrote it).

Stewart.

Apr 04 2012

Jacob Carlborg <doob me.com> writes:

On 2012-04-05 01:21, Stewart Gordon wrote:
 On 04/04/2012 17:37, Jacob Carlborg wrote:
 <snip>
 Sure I can help you with testing. I have a lot on my own table so I
 don't have any time
 for implementing things (maybe some small things). If I may ask, what
 is the point of this
 library?

 Just to hold some miscellaneous utility classes/structs/functions.

 Doesn't it duplicate functionality that's already available in Phobos
 and/or Tango?

 <snip>

 It certainly does in places. But what matters is that it contains
 functionality that isn't present in Phobos (or wasn't present in Phobos
 at the time I wrote it).

 Stewart.

Ok, I see. The functions that need a Posix implementation are mostly in 
datetime and commandline, if I recall correctly. These are already 
present in Phobos?

-- 
/Jacob Carlborg

Apr 04 2012

Stewart Gordon <smjg_1998 yahoo.com> writes:

On 05/04/2012 07:18, Jacob Carlborg wrote:
<snip>
 Ok, I see. The functions that need a Posix implementation are mostly in
 datetime and commandline, if I recall correctly. These are already present in
 Phobos?

Maybe it contains the code I need to finish datetime off.  Though I can't
really just copy someone else's code, I suppose I can at least see what
functions it uses.

I haven't noticed much along the lines of command line manipulation in Phobos -
only the code (now in druntime) to populate the args argument to main (which
under Posix just uses argc/argv from the C main).  Or is there something I
haven't found?

Stewart.

Apr 05 2012

Jacob Carlborg <doob me.com> writes:

On 2012-04-05 12:55, Stewart Gordon wrote:
 On 05/04/2012 07:18, Jacob Carlborg wrote:
 <snip>
 Ok, I see. The functions that need a Posix implementation are mostly
 in datetime and
 commandline, if I recall correctly. These are already present in Phobos?

 Maybe it contains the code I need to finish datetime off. Though I can't
 really just copy someone else's code, I suppose I can at least see what
 functions it uses.

 I haven't noticed much along the lines of command line manipulation in
 Phobos - only the code (now in druntime) to populate the args argument
 to main (which under Posix it just uses argc/argv from the C main). Or
 is there something I haven't found?

 Stewart.

http://dlang.org/phobos/std_getopt.html

But it might not do what you want.

-- 
/Jacob Carlborg

Apr 05 2012

Stewart Gordon <smjg_1998 yahoo.com> writes:

On 05/04/2012 14:51, Jacob Carlborg wrote:
<snip>
 http://dlang.org/phobos/std_getopt.html

 But it might not do what you want.

Where is the code in std.getopt that has any relevance whatsoever to 
what smjg.libs.util.datetime or smjg.libs.util.commandline is for?

Stewart.

Apr 07 2012

Jacob Carlborg <doob me.com> writes:

On 2012-04-07 14:36, Stewart Gordon wrote:
 On 05/04/2012 14:51, Jacob Carlborg wrote:
 <snip>
 http://dlang.org/phobos/std_getopt.html

 But it might not do what you want.

 Where is the code in std.getopt that has any relevance whatsoever to
 what smjg.libs.util.datetime or smjg.libs.util.commandline is for?

 Stewart.

Both std.getopt and smjg.libs.util.commandline handle command line 
arguments?

-- 
/Jacob Carlborg

Apr 07 2012

Stewart Gordon <smjg_1998 yahoo.com> writes:

On 07/04/2012 17:54, Jacob Carlborg wrote:
<snip>
 Both std.getopt and smjg.libs.util.commandline handle command line
 arguments?

What's that to do with anything?

If the code I need to finish smjg.libs.util.commandline is somewhere in 
std.getopt, please tell me where exactly it is.

If it isn't, then why did you refer me to it?  That's like telling 
someone who's writing a bigint library and struggling to implement 
multiplication to just look in std.math.  After all, they both handle 
numbers.

Stewart.

Apr 07 2012

Jacob Carlborg <doob me.com> writes:

On 2012-04-07 19:57, Stewart Gordon wrote:
 On 07/04/2012 17:54, Jacob Carlborg wrote:
 <snip>
 Both std.getopt and smjg.libs.util.commandline handle command line
 arguments?

 What's that to do with anything?

 If the code I need to finish smjg.libs.util.commandline is somewhere in
 std.getopt, please tell me where exactly it is.

 If it isn't, then why did you refer me to it? That's like telling
 someone who's writing a bigint library and struggling to implement
 multiplication to just look in std.math. After all, they both handle
 numbers.

 Stewart.

I don't know what your module is supposed to do.

-- 
/Jacob Carlborg

Apr 07 2012

Stewart Gordon <smjg_1998 yahoo.com> writes:

On 07/04/2012 20:16, Jacob Carlborg wrote:
<snip>
 I don't know what your module is supposed to do.

Then how about reading its documentation?
http://pr.stewartsplace.org.uk/d/sutil/doc/commandline.html

If there's something you don't understand about it, this is the issue 
that needs to be addressed, rather than wildly guessing that some Phobos 
module provides the answer.

Stewart.

Apr 07 2012

Jacob Carlborg <doob me.com> writes:

On 2012-03-31 17:56, Jacob Carlborg wrote:
 How would I read a unicode character from the terminal? I've tried using
 "std.cstream.din.getc" but it seems to only work for ascii characters.
 If I try to read and print something that isn't ascii, it just prints a
 question mark.

I solved it like this:

dchar readChar ()
{
     char[4] buffer;

     buffer[0] = din.getc();
     auto len = codeLength!(char)(buffer[0]);

     foreach (i ; 1 .. len)
         buffer[i] = din.getc();

     size_t i;
     return decode(buffer, i);
}
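
One caveat about the length computation above, offered as a hedged note rather than a correction of the post: std.utf.codeLength expects a code point, while buffer[0] is a lead code unit. The two agree for one- and two-unit sequences but differ for longer ones, where std.utf.stride gives the sequence length. A small sketch of the difference, using the euro sign as an arbitrary three-unit example:

```d
import std.stdio;
import std.utf : codeLength, stride;

void main()
{
    string euro = "\u20AC";   // €, three UTF-8 code units: E2 82 AC

    // codeLength treats its argument as a code point: the lead byte 0xE2
    // is converted to U+00E2, which needs two UTF-8 code units.
    writeln(codeLength!char(euro[0]));   // 2

    // stride inspects the lead code unit itself.
    writeln(stride(euro, 0));            // 3
}
```

With the length taken from stride, the same read-then-decode structure would also cover three- and four-unit sequences.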

-- 
/Jacob Carlborg

Apr 04 2012
