
digitalmars.D.learn - Read a unicode character from the terminal

Jacob Carlborg <doob me.com> writes:
How would I read a unicode character from the terminal? I've tried using 
"std.cstream.din.getc" but it seems to only work for ascii characters. 
If I try to read and print something that isn't ascii, it just prints a 
question mark.

-- 
/Jacob Carlborg
Mar 31 2012
Ali Çehreli <acehreli yahoo.com> writes:
On 03/31/2012 08:56 AM, Jacob Carlborg wrote:
 How would I read a unicode character from the terminal? I've tried using
 "std.cstream.din.getc"

I recommend using stdin. The destiny of std.cstream is uncertain and stdin is sufficient. (I know that it lacks support for BOM but I don't need them.)
 but it seems to only work for ascii characters.
 If I try to read and print something that isn't ascii, it just prints a
 question mark.

The word 'character' used to mean characters of the Latin-based alphabets but with Unicode support that's not the case anymore. In D, 'character' means UTF code unit, nothing else. Unfortunately, although 'Unicode character' is just the correct term to use, it conflicts with D's characters which are not Unicode characters.

'Unicode code point' is the non-conflicting term that matches what we mean with 'Unicode character.' Only dchar can hold code points.

That's the part about characters.

The other side is what is being fed into the program through its standard input. On my Linux consoles, the text comes as a stream of chars, i.e. a UTF-8 encoded text. You must ensure that your terminal is capable of supporting Unicode through its settings. On Windows terminals, one must enter 'chcp 65001' to set the terminal to UTF-8.

Then, it is the program that must know what the data represents. If you are expecting a Unicode code point, then you may think that it should be as simple as reading into a dchar:

    import std.stdio;

    void main()
    {
        dchar letter;
        readf("%s", &letter); // <-- does not work!
        writeln(letter);
    }

The output:

    $ ./deneme
    ç
    Ã <-- will be different on different consoles

The problem is, char can implicitly be converted to dchar. Since the letter ç consists of two chars (two UTF-8 code units), dchar gets the first one converted as a dchar.

To see this, read and write two chars in a loop without a newline in between:

    import std.stdio;

    void main()
    {
        foreach (i; 0 .. 2) {
            char code;
            readf("%s", &code);
            write(code);
        }

        writeln();
    }

This time two code units are read and then outputted to form a Unicode character on the console:

    $ ./deneme
    ç
    ç <-- result of two write(code) expressions

The solution is to use ranges when pulling Unicode characters out of strings. std.stdin does not provide this yet, but it will eventually happen (so I've heard :)).

For now, this is a way of getting Unicode characters from the input:

    import std.stdio;

    void main()
    {
        string line = readln();

        foreach (dchar c; line) {
            writeln(c);
        }
    }

Once you have the input as a string, std.utf.decode can also be used.

Ali
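
The std.utf.decode route mentioned at the end of the post could look roughly like the sketch below. The helper name firstCodePoint and the hard-coded line are illustrative assumptions, not part of Phobos; the literal stands in for whatever readln() would return, and the source file is assumed to be saved as UTF-8.

```d
import std.stdio;
import std.utf : decode;

/// Hypothetical helper: decodes the first code point of `text` and
/// reports how many UTF-8 code units it occupied.
dchar firstCodePoint(string text, out size_t unitsUsed)
{
    size_t index = 0;
    immutable dchar c = decode(text, index); // advances index past the sequence
    unitsUsed = index;
    return c;
}

void main()
{
    string line = "ç\n"; // stand-in for the result of readln()

    // A string is an array of code units: two for 'ç', one for '\n'.
    assert(line.length == 3);

    size_t used;
    dchar c = firstCodePoint(line, used);
    assert(c == 'ç' && used == 2);

    writeln(c, " took ", used, " code units");
}
```

Unlike the foreach loop, decode leaves the position in the caller's hands, which is convenient when only the first code point of a line is wanted.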
Mar 31 2012
Jordi Sayol <g.sayol yahoo.es> writes:
Many thanks for being so educational.

Best regards,
-- 
Jordi Sayol
Mar 31 2012
Jordi Sayol <g.sayol yahoo.es> writes:
BTW, for those who do not know, Ali Çehreli is writing a book to learn "D" from scratch. It's very educational.
There are two formats: HTML (on-line) and PDF.
http://ddili.org/ders/d.en/index.html

Best regards,
-- 
Jordi Sayol
Mar 31 2012
Ali Çehreli <acehreli yahoo.com> writes:
On 03/31/2012 02:31 PM, Jordi Sayol wrote:
 BTW, for those who do not know, Ali Çehreli is writing a book to learn "D"
from scratch. It's very educational.
 There are two formats: HTML (on-line) and PDF.
 http://ddili.org/ders/d.en/index.html

 Best regards,

Thank you very much for the free plug! :)

I have translated eleven more chapters since the last announcement. I am on the assert chapter as we speak. It is taking longer than I had expected because I constantly make improvements to the original: corrections, consistency improvements, additions, adapting code samples to the current state of D, etc.

Ali
Mar 31 2012
Ali Çehreli <acehreli yahoo.com> writes:
On 03/31/2012 11:53 AM, Ali Çehreli wrote:

 The solution is to use ranges when pulling Unicode characters out of
 strings. std.stdin does not provide this yet, but it will eventually
 happen (so I've heard :)).

Here is a Unicode character range, which is unfortunately pretty inefficient because it relies on an exception that is thrown from isValidDchar! :p

    import std.stdio;
    import std.utf;
    import std.array;

    struct UnicodeRange
    {
        File file;
        char[4] codes;
        bool ready;

        this(File file)
        {
            this.file = file;
            this.ready = false;
        }

        bool empty() const @property
        {
            return file.eof();
        }

        dchar front() const @property
        {
            if (!ready) {
                // Sorry, no 'mutable' in D! :p
                UnicodeRange * mutable_this = cast(UnicodeRange*)&this;
                mutable_this.readNext();
            }

            return codes.front;
        }

        void popFront()
        {
            codes = codes.init;
            ready = false;
        }

        void readNext()
        {
            foreach (ref code; codes) {
                file.readf("%s", &code);

                if (file.eof()) {
                    codes[] = '\0';
                    ready = false;
                    break;
                }

                // Expensive way of determining "ready"!
                try {
                    if (isValidDchar(codes.front)) {
                        ready = true;
                        break;
                    }

                } catch (Exception) {
                    // not ready
                }
            }
        }
    }

    UnicodeRange byUnicode(File file = stdin)
    {
        return UnicodeRange(file);
    }

    void main()
    {
        foreach(c; byUnicode()) {
            writeln(c);
        }
    }

Ali
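
One possible way to avoid using an exception as the "ready" test is to ask std.utf.stride how long the sequence is as soon as its first code unit has been read. The sketch below is not the range from the post: CodePoints and codePoints are hypothetical names, the source is any input range of UTF-8 code units rather than a File, and the input is assumed to be well-formed and complete (a truncated sequence is not handled).

```d
import std.range.primitives;
import std.stdio;
import std.utf : byCodeUnit, decode, stride;

/// Hypothetical adapter: turns an input range of UTF-8 code units
/// into an input range of code points.
struct CodePoints(R)
    if (isInputRange!R && is(immutable ElementType!R == immutable char))
{
    private R source;
    private dchar current;
    private bool exhausted;

    this(R source)
    {
        this.source = source;
        popFront(); // prime the first code point
    }

    bool empty() const @property { return exhausted; }
    dchar front() const @property { return current; }

    void popFront()
    {
        if (source.empty) {
            exhausted = true;
            return;
        }

        char[4] buffer;
        buffer[0] = source.front;
        source.popFront();

        // The lead byte alone tells how many code units belong together.
        immutable len = stride(buffer[], 0);

        foreach (i; 1 .. len) {
            buffer[i] = source.front; // assumes the sequence is complete
            source.popFront();
        }

        size_t index = 0;
        current = decode(buffer[0 .. len], index);
    }
}

auto codePoints(R)(R source)
{
    return CodePoints!R(source);
}

void main()
{
    // byCodeUnit exposes the string as raw chars, standing in for a byte stream.
    foreach (c; "çay".byCodeUnit.codePoints) {
        writeln(c);
    }
}
```

stride and decode still throw a UTFException on malformed input, but in this arrangement that signals an actual error instead of "not enough code units yet".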
Mar 31 2012
Jacob Carlborg <doob me.com> writes:
On 2012-04-01 01:17, Ali Çehreli wrote:
 On 03/31/2012 11:53 AM, Ali Çehreli wrote:

  > The solution is to use ranges when pulling Unicode characters out of
  > strings. std.stdin does not provide this yet, but it will eventually
  > happen (so I've heard :)).

 Here is a Unicode character range, which is unfortunately pretty
 inefficient because it relies on an exception that is thrown from
 isValidDchar! :p

OK, what's the difference compared to the example in your first post:

    void main()
    {
        string line = readln();

        foreach (dchar c; line) {
            writeln(c);
        }
    }

-- 
/Jacob Carlborg
Apr 01 2012
Ali Çehreli <acehreli yahoo.com> writes:
On 04/01/2012 05:00 AM, Jacob Carlborg wrote:
 On 2012-04-01 01:17, Ali Çehreli wrote:
 On 03/31/2012 11:53 AM, Ali Çehreli wrote:

 The solution is to use ranges when pulling Unicode characters out of
 strings. std.stdin does not provide this yet, but it will eventually
 happen (so I've heard :)).

Here is a Unicode character range, which is unfortunately pretty inefficient because it relies on an exception that is thrown from isValidDchar! :p

Ok, what's the differences compared to the example in your first post: void main() { string line = readln(); foreach (dchar c; line) { writeln(c); } }

No difference in that example because it consumes the entire input as dchars.

But in general, with that inefficient range, it is possible to pull just one dchar from the input and leave the rest of the stream untouched. For example, it would be possible to readf() an int right after that:

    auto u = byUnicode();

    dchar d = u.front; // <-- reads just one dchar from the range

    int i;
    readf("%s", &i); // <-- continues with std.stdio functions
    writeln(i);

With the readln() method, the int must be looked up in the line first, then from the input.

Ali
Apr 01 2012
Jacob Carlborg <doob me.com> writes:
On 2012-04-01 16:02, Ali Çehreli wrote:

 No difference in that example because it consumes the entire input as
 dchars.

 But in general, with that inefficient range, it is possible to pull just
 one dchar from the input and leave the rest of the stream untouched. For
 example, it would be possible to readf() an int right after that:

 auto u = byUnicode();

 dchar d = u.front; // <-- reads just one dchar from the range

 int i;
 readf("%s", &i); // <-- continues with std.stdio functions
 writeln(i);

 With the readln() method, the int must be looked up in the line first,
 then from the input.

 Ali

Ok, I see, thanks. -- /Jacob Carlborg
Apr 01 2012
Jacob Carlborg <doob me.com> writes:
On 2012-03-31 20:53, Ali Çehreli wrote:
 I recommend using stdin. The destiny of std.cstream is uncertain and
 stdin is sufficient. (I know that it lacks support for BOM but I don't
 need them.)

I thought std.cstream was a stream wrapper around stdin.
 The word 'character' used to mean characters of the Latin-based
 alphabets but with Unicode support that's not the case anymore. In D,
 'character' means UTF code unit, nothing else. Unfortunately, although
 'Unicode character' is just the correct term to use, it conflicts with
 D's characters which are not Unicode characters.

 'Unicode code point' is the non-conflicting term that matches what we
 mean with 'Unicode character.' Only dchar can hold code points.

 That's the part about characters.

Yeah, exactly. When I think about it, I don't know why I thought "getc" would work since it only returns a "char" and not a "dchar".
 The other side is what is being fed into the program through its
 standard input. On my Linux consoles, the text comes as a stream of
 chars, i.e. a UTF-8 encoded text. You must ensure that your terminal is
 capable of supporting Unicode through its settings. On Windows
 terminals, one must enter 'chcp 65001' to set the terminal to UTF-8.

I'm on Mac OS X, the terminal is capable of handling Unicode.
 Then, it is the program that must know what the data represents. If you
 are expecting a Unicode code point, then you may think that it should be
 as simple as reading into a dchar:

 import std.stdio;

 void main()
 {
 dchar letter;
 readf("%s", &letter); // <-- does not work!
 writeln(letter);
 }

 The output:

 $ ./deneme
 ç
 Ã <-- will be different on different consoles

I tried that as well.
 The problem is, char can implicitly be converted to dchar. Since the
 letter ç consists of two chars (two UTF-8 code units), dchar gets the
 first one converted as a dchar.

 To see this, read and write two chars in a loop without a newline in
 between:

 import std.stdio;

 void main()
 {
 foreach (i; 0 .. 2) {
 char code;
 readf("%s", &code);
 write(code);
 }

 writeln();
 }

 This time two code units are read and then outputted to form a Unicode
 character on the console:

 $ ./deneme
 ç
 ç <-- result of two write(code) expressions

 The solution is to use ranges when pulling Unicode characters out of
 strings. std.stdin does not provide this yet, but it will eventually
 happen (so I've heard :)).

 For now, this is a way of getting Unicode characters from the input:

 import std.stdio;

 void main()
 {
 string line = readln();

 foreach (dchar c; line) {
 writeln(c);
 }
 }

 Once you have the input as a string, std.utf.decode can also be used.

 Ali

I'll give that a try, thanks. -- /Jacob Carlborg
Apr 01 2012
Stewart Gordon <smjg_1998 yahoo.com> writes:
On 31/03/2012 16:56, Jacob Carlborg wrote:
 How would I read a unicode character from the terminal? I've tried using
 "std.cstream.din.getc" but it seems to only work for ascii characters. If I
try to read
 and print something that isn't ascii, it just prints a question mark.

What OS are you using? And what codepage is the console set to?

You might want to try the console module in my utility library:

http://pr.stewartsplace.org.uk/d/sutil/

(For D1 at the moment, but a D2 version will be available any day now!)

Stewart.
Mar 31 2012
Jacob Carlborg <doob me.com> writes:
On 2012-04-01 00:14, Stewart Gordon wrote:
 On 31/03/2012 16:56, Jacob Carlborg wrote:
 How would I read a unicode character from the terminal? I've tried using
 "std.cstream.din.getc" but it seems to only work for ascii characters.
 If I try to read
 and print something that isn't ascii, it just prints a question mark.

What OS are you using? And what codepage is the console set to?

I'm using Mac OS X and the terminal is set to handle UTF-8.
 You might want to try the console module in my utility library:

 http://pr.stewartsplace.org.uk/d/sutil/

 (For D1 at the moment, but a D2 version will be available any day now!)

 Stewart.

I'll have a look, thanks. -- /Jacob Carlborg
Apr 01 2012
Stewart Gordon <smjg_1998 yahoo.com> writes:
On 31/03/2012 23:14, Stewart Gordon wrote:
<snip>
 You might want to try the console module in my utility library:

 http://pr.stewartsplace.org.uk/d/sutil/

 (For D1 at the moment, but a D2 version will be available any day now!)

The D2 version is now up on the site.

Jacob - would you be up for helping me with testing/implementation of my library on Mac OS? If you do a search for "todo" you'll see what needs to be done. Some of it will benefit Unix-type systems generally. If perchance you have a big-endian CPU, testing the bit arrays on it would also be of value.

Stewart.
Apr 04 2012
Jacob Carlborg <doob me.com> writes:
On 2012-04-04 18:06, Stewart Gordon wrote:

 The D2 version is now up on the site.

 Jacob - would you be up for helping me with testing/implementation of my
 library on Mac OS? If you do a search for "todo" you'll see what needs
 to be done. Some of it will benefit Unix-type systems generally. If
 perchance you have a big-endian CPU, testing the bit arrays on it would
 also be of value.

 Stewart.

Sure, I can help you with testing. I have a lot on my own plate, so I don't have any time for implementing things (maybe some small things). If I may ask, what is the point of this library? Doesn't it duplicate functionality that's already available in Phobos and/or Tango?

For Mac OS X, if you just follow the Posix standard you'll get very far. I have an x86 CPU; it has been a couple of years since Apple last had a PPC-based computer.

-- 
/Jacob Carlborg
Apr 04 2012
Stewart Gordon <smjg_1998 yahoo.com> writes:
On 04/04/2012 17:37, Jacob Carlborg wrote:
<snip>
 Sure I can help you with testing. I have a lot on my own table so I don't have
any time
 for implementing things (maybe some small things). If I may ask, what is the
point of this
 library?

Just to hold some miscellaneous utility classes/structs/functions.
 Doesn't it duplicate functionality that's already available in Phobos and/or
Tango?

It certainly does in places. But what matters is that it contains functionality that isn't present in Phobos (or wasn't present in Phobos at the time I wrote it). Stewart.
Apr 04 2012
Jacob Carlborg <doob me.com> writes:
On 2012-04-05 01:21, Stewart Gordon wrote:
 On 04/04/2012 17:37, Jacob Carlborg wrote:
 <snip>
 Sure I can help you with testing. I have a lot on my own table so I
 don't have any time
 for implementing things (maybe some small things). If I may ask, what
 is the point of this
 library?

Just to hold some miscellaneous utility classes/structs/functions.
 Doesn't it duplicate functionality that's already available in Phobos
 and/or Tango?

It certainly does in places. But what matters is that it contains functionality that isn't present in Phobos (or wasn't present in Phobos at the time I wrote it). Stewart.

Ok, I see. The functions that need a Posix implementation are mostly in datetime and commandline, if I recall correctly. These are already present in Phobos? -- /Jacob Carlborg
Apr 04 2012
Stewart Gordon <smjg_1998 yahoo.com> writes:
On 05/04/2012 07:18, Jacob Carlborg wrote:
<snip>
 Ok, I see. The functions that need a Posix implementation are mostly in
datetime and
 commandline, if I recall correctly. These are already present in Phobos?

Maybe it contains the code I need to finish datetime off. Though I can't really just copy someone else's code, I suppose I can at least see what functions it uses. I haven't noticed much along the lines of command line manipulation in Phobos - only the code (now in druntime) to populate the args argument to main (which under Posix it just uses argc/argv from the C main). Or is there something I haven't found? Stewart.
Apr 05 2012
Jacob Carlborg <doob me.com> writes:
On 2012-04-05 12:55, Stewart Gordon wrote:
 On 05/04/2012 07:18, Jacob Carlborg wrote:
 <snip>
 Ok, I see. The functions that need a Posix implementation are mostly
 in datetime and
 commandline, if I recall correctly. These are already present in Phobos?

Maybe it contains the code I need to finish datetime off. Though I can't really just copy someone else's code, I suppose I can at least see what functions it uses. I haven't noticed much along the lines of command line manipulation in Phobos - only the code (now in druntime) to populate the args argument to main (which under Posix it just uses argc/argv from the C main). Or is there something I haven't found? Stewart.

http://dlang.org/phobos/std_getopt.html But it might not do what you want. -- /Jacob Carlborg
Apr 05 2012
Stewart Gordon <smjg_1998 yahoo.com> writes:
On 05/04/2012 14:51, Jacob Carlborg wrote:
<snip>
 http://dlang.org/phobos/std_getopt.html

 But it might not do what you want.

Where is the code in std.getopt that has any relevance whatsoever to what smjg.libs.util.datetime or smjg.libs.util.commandline is for? Stewart.
Apr 07 2012
Jacob Carlborg <doob me.com> writes:
On 2012-04-07 14:36, Stewart Gordon wrote:
 On 05/04/2012 14:51, Jacob Carlborg wrote:
 <snip>
 http://dlang.org/phobos/std_getopt.html

 But it might not do what you want.

Where is the code in std.getopt that has any relevance whatsoever to what smjg.libs.util.datetime or smjg.libs.util.commandline is for? Stewart.

Both std.getopt and smjg.libs.util.commandline handle command line arguments?

-- 
/Jacob Carlborg
Apr 07 2012
Stewart Gordon <smjg_1998 yahoo.com> writes:
On 07/04/2012 17:54, Jacob Carlborg wrote:
<snip>
 Both std.getopt and smjg.libs.util.commandline handle command line
 arguments?

What's that to do with anything? If the code I need to finish smjg.libs.util.commandline is somewhere in std.getopt, please tell me where exactly it is. If it isn't, then why did you refer me to it? That's like telling someone who's writing a bigint library and struggling to implement multiplication to just look in std.math. After all, they both handle numbers. Stewart.
Apr 07 2012
Jacob Carlborg <doob me.com> writes:
On 2012-04-07 19:57, Stewart Gordon wrote:
 On 07/04/2012 17:54, Jacob Carlborg wrote:
 <snip>
 Both std.getopt and smjg.libs.util.commandline handle command line
 arguments?

What's that to do with anything? If the code I need to finish smjg.libs.util.commandline is somewhere in std.getopt, please tell me where exactly it is. If it isn't, then why did you refer me to it? That's like telling someone who's writing a bigint library and struggling to implement multiplication to just look in std.math. After all, they both handle numbers. Stewart.

I don't know what your module is supposed to do. -- /Jacob Carlborg
Apr 07 2012
Stewart Gordon <smjg_1998 yahoo.com> writes:
On 07/04/2012 20:16, Jacob Carlborg wrote:
<snip>
 I don't know what your module is supposed to do.

Then how about reading its documentation? http://pr.stewartsplace.org.uk/d/sutil/doc/commandline.html If there's something you don't understand about it, this is the issue that needs to be addressed, rather than wildly guessing that some Phobos module provides the answer. Stewart.
Apr 07 2012
Jacob Carlborg <doob me.com> writes:
On 2012-03-31 17:56, Jacob Carlborg wrote:
 How would I read a unicode character from the terminal? I've tried using
 "std.cstream.din.getc" but it seems to only work for ascii characters.
 If I try to read and print something that isn't ascii, it just prints a
 question mark.

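I solved it like this:

    import std.cstream; // din
    import std.utf;     // codeLength, decode

    dchar readChar ()
    {
        char[4] buffer;

        // Read the first code unit, then as many more as the sequence needs.
        buffer[0] = din.getc();
        auto len = codeLength!(char)(buffer[0]);

        foreach (i ; 1 .. len)
            buffer[i] = din.getc();

        size_t i;
        return decode(buffer, i);
    }

-- 
/Jacob Carlborg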
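
A caveat worth checking before reusing readChar: std.utf.codeLength takes a code point, so a lead byte passed to it is first converted to a dchar and treated as the code point with that value. The short sketch below, which assumes a UTF-8 source file and uses '€' purely as an example, compares it with std.utf.stride, which reports the length of the sequence starting at a given index.

```d
import std.utf : codeLength, stride;

void main()
{
    string euro = "€"; // U+20AC, three UTF-8 code units
    assert(euro.length == 3);

    // codeLength answers "how many code units does this code point need?".
    // Given the lead byte 0xE2 it sees the code point U+00E2 and answers 2.
    assert(codeLength!char(euro[0]) == 2);

    // stride answers "how long is the sequence that starts here?".
    assert(stride(euro, 0) == 3);

    // For a two-unit letter such as 'ç' the two happen to agree.
    string c = "ç";
    assert(codeLength!char(c[0]) == 2);
    assert(stride(c, 0) == 2);
}
```

So readChar as posted reads the right number of code units for ASCII and two-unit letters such as ç, while three- and four-unit sequences would come up short; using stride on the buffer in place of codeLength is one way to cover them.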
Apr 04 2012