digitalmars.D.learn - Reading unicode chars..

"seany" <seany uni-bonn.de> writes:

How do I read Unicode chars that have code points \u1FFF and 
higher from a file?

file.getcw() reads only part of the char, and D identifies this 
character as an array of three or four characters.

Importing std.uni does not change the behavior.

Thank you.

Sep 02 2014

"Andrew Godfrey" <X y.com> writes:

On Tuesday, 2 September 2014 at 14:06:04 UTC, seany wrote:
 How do I read Unicode chars that have code points \u1FFF and 
 higher from a file?

 file.getcw() reads only part of the char, and D identifies this 
 character as an array of three or four characters.

 Importing std.uni does not change the behavior.

 Thank you.

Maybe someone else here will recognize this, but for me you'd 
need to supply more information. std.file doesn't have getcw; I 
see one in std.stream, which carries an "outdated" warning, and 
that getcw is documented as implementation-specific. So what 
platform are you on?
Better yet, can you make a small code sample that shows what 
you're seeing?

Sep 02 2014

Ali Çehreli <acehreli yahoo.com> writes:

On 09/02/2014 07:06 AM, seany wrote:
 How do I read Unicode chars that have code points \u1FFF and higher from
 a file?

 file.getcw() reads only part of the char, and D identifies this
 character as an array of three or four characters.

 Importing std.uni does not change the behavior.

 Thank you.

One way is to use std.stdio.File just like you would use stdin and stdout:

import std.stdio;

void main()
{
     string fileName = "unicode_test_file";
     doWrite(fileName);
     doRead(fileName);
}

void doWrite(string fileName)
{
     auto file = File(fileName, "w");
     file.writeln("abcçdef");
}

void doRead(string fileName)
{
     auto file = File(fileName, "r");

     foreach (line; file.byLine) {        // (1)
         foreach (dchar c; line) {        // (2)
             writeln(c);
         }

         import std.range;
         foreach (c; line.stride(1)) {    // (3)
             writeln(c);
         }
     }
}

Notes:

1) To avoid a common gotcha, note that 'line' is reused at every 
iteration here. You must make copies of portions of it if you need to.

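2) The explicit dchar type is important there: without it, foreach 
iterates over the individual UTF-8 code units (char) of the line.

3) Any algorithm that turns a string into a range exposes decoded 
dchars. Here, I used stride.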

Ali

Sep 02 2014

"seany" <seany uni-bonn.de> writes:

Hi Ali, I know this example from your book.

But try to capture „, the low quotation mark, which appears in the 
General Punctuation block of Unicode, with \u201e. I wrote that 
I am having problems with \u1FFF and up.

This particular symbol is seen as a dchar array "\x1e\x20", so 
two dchars; using wchar returns the same result when I directly 
provide the symbol to the code.

So I was thinking of using two dchars and printing the dstring; 
the problem then is that I do not know beforehand whether a 
particular character read out of the file is a pair of dchars or 
a single dchar.

And yes, it was stream.getcw(), sorry, not file.getcw(). Indeed.

Reading this character from a file (using a while loop until EOF) 
produces an â and an unknown character, shown as a question mark 
in a white polygon.

Sep 02 2014

"seany" <seany uni-bonn.de> writes:

Linux 64 bit, D2, phobos only.

Sep 02 2014

Ali Çehreli <acehreli yahoo.com> writes:

On 09/02/2014 11:11 AM, seany wrote:

 But try to capture „ the low quotation mark, appearing in the
 All-purpose punctuations plane of unicode, with \u201e - I wrote I am
 having problems with \u1FFF and up.

You are doing it differently. Can you show us a minimal example? 
Otherwise, there is nothing special about „. Continuing with my example, 
just change one line and it still works:

     file.writeln("abcçd„ef");

 This particular symbol, is seen as a dchar array "\x1e\x20" - so two
 dchars

That would happen when you treat the chars on the input as 
individual dchars. Those chars must be decoded together as a single 
dchar. My example has shown two different ways of doing it. :)

 using wchar returns the same result

Same issue: You are treating individual chars as two individual wchars.

Ali

Sep 02 2014
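
To make the decoding step concrete, here is a minimal sketch using std.utf.decode. It assumes the input is UTF-8, in which U+201E („) is stored as the three code units E2 80 9E; the helper name firstCodePoint is hypothetical and only serves the illustration.

```d
import std.stdio : writefln;
import std.utf : decode;

// Hypothetical helper: decodes the first code point of a UTF-8 string
// and reports how many code units (chars) it occupied.
dchar firstCodePoint(string s, out size_t consumed)
{
    size_t index = 0;
    dchar c = decode(s, index);   // advances index past the whole sequence
    consumed = index;
    return c;
}

void main()
{
    string s = "\xE2\x80\x9E";    // the raw UTF-8 code units of U+201E
    size_t n;
    dchar c = firstCodePoint(s, n);
    writefln("U+%04X decoded from %s code units", cast(uint) c, n);
}
```

Reading such a file one char at a time hands back each of those three code units separately, which is consistent with the â (0xE2) followed by unprintable bytes reported earlier in the thread.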

"seany" <seany uni-bonn.de> writes:

On Tuesday, 2 September 2014 at 18:22:54 UTC, Ali Çehreli wrote:

 That would happen when you treat the chars on the input as 
 individual dchars.

That is precisely where the problem is.

If you put the character in a file, then open it as a stream, 
and then call

File.getc()

or file.getcw()

until EOF is reached, you get this problem.

I want to read the file char by char,

and the problem is that I don't know where this char will appear, 
meaning I don't know where I have to treat multiple dchars, read 
by getc() or getcw(), as a single char.

Sep 02 2014

"seany" <seany uni-bonn.de> writes:

Your example reads the file by lines; I need to get them by chars.

Sep 02 2014

"monarch_dodra" <monarchdodra gmail.com> writes:

On Tuesday, 2 September 2014 at 18:30:55 UTC, seany wrote:
 Your example reads the file by lines, i need to get them by 
 chars.

If you are intent on reading the stream character (or wcharacter) 
1 by 1, then you will have to decode them manually, as there is 
no "getcd".

Unfortunately, the "newer" std.stdio module does not really 
provide facilities for such unitary reads.

I'd suggest you create a range out of your std.stream.File, which 
reads it byte by byte. Then, you pass it to the "byDchar()" 
range, which will auto decode those characters. If you really 
want to do it "character by character".

What's wrong with reading line by line, but processing the 
characters in said lines 1 by 1? That works "out of the box".

Sep 02 2014
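
The "byte range passed to byDchar" idea above can be sketched with std.stdio alone, without std.stream (which, as far as I know, has since been removed from Phobos). This is only an illustration under assumptions: the helper name decodeBytes is hypothetical, the file is assumed to be UTF-8, and the file name is borrowed from the earlier example.

```d
import std.algorithm.iteration : joiner, map;
import std.stdio : File, writeln;
import std.utf : byDchar;

// Hypothetical helper: wraps any range of bytes so that it yields
// decoded code points (dchar) instead of raw UTF-8 code units.
auto decodeBytes(R)(R bytes)
{
    return bytes.map!(b => cast(char) b).byDchar;
}

void main()
{
    enum fileName = "unicode_test_file";
    File(fileName, "w").writeln("abc\u00E7d\u201Eef");

    // byChunk reads the file in blocks; joiner flattens the blocks
    // lazily into a single stream of bytes.
    foreach (dchar c; decodeBytes(File(fileName).byChunk(4096).joiner))
        writeln(c);
}
```

Because the chunks are joined before decoding, a multi-byte sequence that straddles a chunk boundary should still be decoded as one character.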

Ali Çehreli <acehreli yahoo.com> writes:

On 09/02/2014 02:13 PM, monarch_dodra wrote:

 I'd suggest you create a range out of your std.stream.File, which reads
 it byte by byte.

I was in the process of doing just that.

 Then, you pass it to the "byDchar()" range, which will
 auto decode those characters. If you really want to do it "character by
 character".

I first started writing my own byDchar but then used std.utf.byDchar as 
you suggest. However, I had to resort to

1) Adding attributes to function calls, some of which I know are unsafe 
(see assumeHasAttribs() below). For example, I don't think getc() should 
be pure. (?) Also, how could all of its functions be nothrow? Is 
byDchar() asking too much of its users?

2) I also had to make StreamRange a template just to get attribute 
inference from the compiler.

import std.stdio;
import std.stream;
import std.utf;
import std.traits;

auto assumeHasAttribs(T)(T t) pure
     if (isFunctionPointer!T || isDelegate!T)
{
     enum attrs = functionAttributes!T |
                  FunctionAttribute.pure_ |
                  FunctionAttribute.nogc |
                  FunctionAttribute.nothrow_;

     return cast(SetFunctionAttributes!(T, functionLinkage!T, attrs)) t;
}

/* This is a template just to take advantage of compiler's attribute
  * inference. */
struct StreamRange()
{
     std.stream.File f;
     char c;

     this(std.stream.File f)
     {
         this.f = f;
         prime();
     }

     private void prime()
     {
         if (!empty()) {
             c = assumeHasAttribs(&(f.getc))();
         }
     }

     @property bool empty() const
     {
         return assumeHasAttribs(&(f.eof))();
     }

     @property char front() const
     {
         return c;
     }

     void popFront()
     {
         prime();
     }
}

auto streamRange()(std.stream.File file)
{
     return StreamRange!()(file);
}

void main()
{
     string fileName = "unicode_test_file";
     doWrite(fileName);
     doRead(fileName);
}

void doWrite(string fileName)
{
     auto file = std.stdio.File(fileName, "w");
     file.writeln("abcçd𝔸e„f");
}

void doRead(string fileName)
{
     auto range = byDchar(streamRange(new std.stream.File(fileName,
                                                          FileMode.In)));

     foreach (c; range) {
         writeln(c);
     }
}

 What's wrong with reading line by line, but processing the characters in
 said lines 1 by 1? That works "out of the box".

Agreed.

Ali

Sep 02 2014

"Marc Schütz" <schuetzm gmx.net> writes:

On Tuesday, 2 September 2014 at 23:20:38 UTC, Ali Çehreli wrote:
 On 09/02/2014 02:13 PM, monarch_dodra wrote:

 I'd suggest you create a range out of your std.stream.File,
 which reads it byte by byte.

 I was in the process of doing just that.

 Then, you pass it to the "byDchar()" range, which will
 auto decode those characters. If you really want to do it
 "character by character".

 I first started writing my own byDchar but then used 
 std.utf.byDchar as you suggest. However, I had to resort to

 1) Adding attributes to function calls, some of which I know are 
 unsafe (see assumeHasAttribs() below). For example, I don't 
 think getc() should be pure. (?) Also, how could all of its 
 functions be nothrow? Is byDchar() asking too much of its 
 users?

https://github.com/D-Programming-Language/phobos/pull/2483

Sep 03 2014

Ali Çehreli <acehreli yahoo.com> writes:

On 09/03/2014 01:21 AM, "Marc Schütz" <schuetzm gmx.net> wrote:

 1) Adding attributes to function calls, some of which I know are unsafe
 (see assumeHasAttribs() below). For example, I don't think getc()
 should be pure. (?) Also, how could all of its functions be nothrow?
 Is byDchar() asking too much of its users?

 https://github.com/D-Programming-Language/phobos/pull/2483

Yay! Makes so much sense... :)

Ali

Sep 03 2014

"Marc Schütz" <schuetzm gmx.net> writes:

On Wednesday, 3 September 2014 at 13:54:30 UTC, Ali Çehreli wrote:
 On 09/03/2014 01:21 AM, "Marc Schütz" <schuetzm gmx.net> wrote:

 1) Adding attributes to function calls, some of which I know are 
 unsafe (see assumeHasAttribs() below). For example, I don't think 
 getc() should be pure. (?) Also, how could all of its functions be 
 nothrow? Is byDchar() asking too much of its users?

 https://github.com/D-Programming-Language/phobos/pull/2483

 Yay! Makes so much sense... :)

I was going to suggest to the OP exactly what you suggested, but 
I thought I'd better check whether it actually works. Turns out 
it didn't :-P

Sep 03 2014

"seany" <seany uni-bonn.de> writes:

On Tuesday, 2 September 2014 at 21:13:04 UTC, monarch_dodra wrote:

 What's wrong with reading line by line, but processing the 
 characters in said lines 1 by 1? That works "out of the box".

import std.stdio;
import std.conv;
import core.vararg;

void main() {
    string aa = "abc „";
    foreach (aaa; aa)
        writeln(aaa);
}


output :

a
b
c

�
�
�

Linux 64 bit Netrunner, console font set to Raleway, then to 
Quivira; both support the character under discussion.

Sep 03 2014

ketmar via Digitalmars-d-learn <digitalmars-d-learn puremagic.com> writes:

On Wed, 03 Sep 2014 15:41:59 +0000
seany via Digitalmars-d-learn <digitalmars-d-learn puremagic.com> wrote:

`foreach (dchar aaa; aa)`...

Sep 03 2014
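
Spelled out, the one-line fix above changes the loop variable's type so that foreach decodes the string instead of walking its UTF-8 code units. The sketch below contrasts the two; the helper names countChars and countDchars are hypothetical, and the literal uses the \u201E escape so it does not depend on the source file's encoding.

```d
import std.stdio : writeln;

// Hypothetical helper: number of iterations when walking by code unit.
size_t countChars(string s)
{
    size_t n;
    foreach (char c; s)    // same as the inferred type in `foreach (aaa; aa)`
        ++n;
    return n;
}

// Hypothetical helper: number of iterations when walking by code point.
size_t countDchars(string s)
{
    size_t n;
    foreach (dchar c; s)   // decodes each UTF-8 sequence into one dchar
        ++n;
    return n;
}

void main()
{
    string aa = "abc \u201E";
    writeln(countChars(aa));    // 7: four ASCII bytes plus three bytes for „
    writeln(countDchars(aa));   // 5
    foreach (dchar aaa; aa)
        writeln(aaa);
}
```

The seven code units match the seven output lines in the post above: a, b, c, a space, and three bytes that are not printable on their own.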

"monarch_dodra" <monarchdodra gmail.com> writes:

On Tuesday, 2 September 2014 at 17:10:57 UTC, Ali Çehreli wrote:
 1) To avoid a common gotcha, note that 'line' is reused at 
 every iteration here. You must make copies of portions of it if 
 you need to.

 Ali

I don't know if you are aware, but "byLineCopy" was recently 
introduced. It will be available in 2.067. Just spreading info.

Sep 02 2014
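
As a rough illustration of why byLineCopy helps: byLine reuses its buffer, so lines that need to outlive the loop iteration must be copied, whereas byLineCopy hands out a fresh string each time. The sketch below assumes a compiler release that ships byLineCopy (2.067 or later, per the post above); the helper name readLines and the demo file name are hypothetical.

```d
import std.array : array;
import std.file : remove, write;
import std.stdio : File, writeln;

// Hypothetical helper: collects every line of a file. Each element is an
// independent string, so the array stays valid after reading finishes.
string[] readLines(string fileName)
{
    return File(fileName).byLineCopy.array;
}

void main()
{
    enum name = "bylinecopy_demo.txt";   // hypothetical file name
    write(name, "first\nsecond \u201E\n");
    scope (exit) remove(name);

    foreach (line; readLines(name))
        foreach (dchar c; line)          // per-character processing, decoded
            writeln(c);
}
```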
