digitalmars.D.bugs - [Issue 10668] New: Unicode characters, when taken from strings, are not printed correctly

d-bugmail puremagic.com writes:
http://d.puremagic.com/issues/show_bug.cgi?id=10668

           Summary: Unicode characters, when taken from strings, are not
                    printed correctly
           Product: D
           Version: D2
          Platform: x86_64
        OS/Version: Mac OS X
            Status: NEW
          Severity: normal
          Priority: P2
         Component: Phobos
        AssignedTo: nobody puremagic.com
        ReportedBy: MATTCA sky.com


--- Comment #0 from Matt Carter <MATTCA sky.com> 2013-07-19 01:55:53 PDT ---
Created an attachment (id=1234)
A small program which demonstrates the issue.

When obtaining a char from within a string of non-ASCII characters (in this
example, the pound sign '£'), the resulting char will not be printed correctly
to the console (via std.stdio.writeln). Instead, the '?' symbol is printed.

However, when printing the entire string, the '£' is printed correctly.

-- 
Configure issuemail: http://d.puremagic.com/issues/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
Jul 19 2013
d-bugmail puremagic.com writes:
http://d.puremagic.com/issues/show_bug.cgi?id=10668



--- Comment #1 from Matt Carter <MATTCA sky.com> 2013-07-19 01:58:27 PDT ---
The content of the attachment, just in case:

module main;

import std.stdio;

void main(string[] args) {
    string s = "£";
    writeln(s); // Output: £

    char c = s[0];
    writeln(c); // Output: ?

    writeln(s[0]); // Output: ?
}

Jul 19 2013
d-bugmail puremagic.com writes:
http://d.puremagic.com/issues/show_bug.cgi?id=10668


monarchdodra gmail.com changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
                 CC|                            |monarchdodra gmail.com
         Resolution|                            |INVALID


--- Comment #2 from monarchdodra gmail.com 2013-07-19 02:41:47 PDT ---
Well... what did you think it was going to print? you have a utf-8 sequence.
char c = s[0]; will extract the first code*point* of your unicode. You want the
first code*unit*.

http://www.fileformat.info/info/unicode/char/a3/index.htm
EG: £ is the codepoint "A3"
In UTF8 it is represented by the sequence: [0xC2, 0xA3]

When you write "char c = s[0];", you are extracting the first codeunit, which
is 0xC2. When you pass this to writeln, what will happen will mostly depend
on your locale/codepage. If it is set to UTF-8 (CP65001 on Windows), then it
will print the "unknown character", since you passed an incomplete sequence.
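
To make the code unit point concrete, here is a minimal sketch that lists the
bytes a string actually holds. The helper name utf8Bytes is made up for this
illustration; std.string.representation is the Phobos call that exposes the
raw code units. The pound sign is written as the escape \u00A3 so the example
does not depend on how the source file is encoded.

```d
import std.stdio : writefln, writeln;
import std.string : representation;

// Hypothetical helper: returns the UTF-8 code units of s as plain bytes.
immutable(ubyte)[] utf8Bytes(string s)
{
    return s.representation;
}

void main()
{
    string s = "\u00A3"; // the pound sign

    // length counts code units (bytes), not characters.
    writeln(s.length); // 2

    // One character on screen, two code units in memory.
    foreach (b; utf8Bytes(s))
        writefln("0x%02X", b); // 0xC2, then 0xA3
}
```

So s[0] on its own is only the lead byte 0xC2, which is not a complete UTF-8
sequence.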

The correct code you want is:
dchar c = s.front;

(remember to import std.array to get front).
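
As a rough, self-contained version of that suggestion: the sketch below wraps
front in a helper called firstCodePoint (a name invented here, not a Phobos
function). It imports front from std.range.primitives, which is where current
Phobos defines it; importing std.array as suggested above should work as well.
It assumes the string is non-empty.

```d
import std.range.primitives : front;
import std.stdio : writefln, writeln;

// Hypothetical helper: decodes and returns the first code point of a
// non-empty string.
dchar firstCodePoint(string s)
{
    // For a string, front decodes as many code units as the first
    // code point needs and returns the result as a dchar.
    return s.front;
}

void main()
{
    string s = "\u00A3100"; // pound sign followed by "100"

    dchar c = firstCodePoint(s);
    writeln(c);                       // the pound sign, on a UTF-8 terminal
    writefln("U+%04X", cast(uint) c); // U+00A3
}
```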

Another alternative, is to simply work from the ground up with dstrings.

module main;

import std.stdio;

void main(string[] args) {
    dstring s = "£";
    writeln(s); // Output: £

    dchar c = s[0];
    writeln(c); // Output: £

    writeln(s[0]); // Output: £
}
}
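
If the data already lives in a string, it can be converted rather than
retyped. The following is only a sketch: toUtf32 is a hypothetical wrapper
around std.conv.to, and the input string is an arbitrary example.

```d
import std.conv : to;
import std.stdio : writeln;

// Hypothetical helper: re-encodes a UTF-8 string as UTF-32.
dstring toUtf32(string s)
{
    return s.to!dstring;
}

void main()
{
    string s = "\u00A3x"; // pound sign followed by 'x'
    dstring d = toUtf32(s);

    writeln(s.length); // 3: two code units for the pound sign, one for 'x'
    writeln(d.length); // 2: one code unit per code point in UTF-32
    writeln(d[0]);     // the pound sign, on a UTF-8 terminal
}
```

The trade-off is memory: every element of a dstring takes four bytes.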

Do you have access to "The D Programming Language"? It has the best
introduction to unicode/UTF I've read.

Jul 19 2013
d-bugmail puremagic.com writes:
http://d.puremagic.com/issues/show_bug.cgi?id=10668


Nils <nilsbossung googlemail.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |nilsbossung googlemail.com


--- Comment #3 from Nils <nilsbossung googlemail.com> 2013-07-19 07:59:08 PDT ---
(In reply to comment #2)
 Well... what did you think it was going to print? you have a utf-8 sequence.
 char c = s[0]; will extract the first code*point*

You mean code*unit*.

 of your unicode. You want the first code*unit*.

code*point*
Jul 19 2013
d-bugmail puremagic.com writes:
http://d.puremagic.com/issues/show_bug.cgi?id=10668



--- Comment #4 from Matt Carter <MATTCA sky.com> 2013-07-19 08:24:57 PDT ---
(In reply to comment #2)
 Well... what did you think it was going to print? you have a utf-8 sequence.
 char c = s[0]; will extract the first code*point* of your unicode. You want the
 first code*unit*.
 
 http://www.fileformat.info/info/unicode/char/a3/index.htm
 EG: £ is the codepoint "A3"
 In UTF8 it is represented by the sequence: [0xC2, 0xA3]
 
 When you write "char c = s[0];", you are extracting the first codeunit, which
 is 0xC2. When you pass this to writeln, what will happen will mostly depend
 on your locale/codepage. If it is set to UTF-8 (CP65001 on Windows), then it
 will print the "unknown character", since you passed an incomplete sequence.
 
 The correct code you want is:
 dchar c = s.front;
 
 (remember to import std.array to get front).
 
 Another alternative, is to simply work from the ground up with dstrings.
 
 module main;
 
 import std.stdio;
 
 void main(string[] args) {
     dstring s = "£";
     writeln(s); // Output: £
 
     dchar c = s[0];
     writeln(c); // Output: £
 
     writeln(s[0]); // Output: £
 }
 
 Do you have access to "The D Programming Language"? It has the best
 introduction to unicode/UTF I've read.

Thanks for the response! Yeah, I converted my project to use dstrings on the
off chance it worked after posting, and lo and behold, this is the fix, it
seems.

I plan on eventually getting the book, although I've read some bad reviews
regarding the e-book/kindle version, so I'm having to wait a little longer to
get a hard copy.
Jul 19 2013
d-bugmail puremagic.com writes:
http://d.puremagic.com/issues/show_bug.cgi?id=10668



--- Comment #5 from monarchdodra gmail.com 2013-07-19 08:28:23 PDT ---
(In reply to comment #3)
  (In reply to comment #2)
  Well... what did you think it was going to print? you have a utf-8 sequence.
  char c = s[0]; will extract the first code*point*
 You mean code*unit*.
  of your unicode. You want the first code*unit*.
 code*point*

Oops. Massive face-palm. Thank you for correcting me.
Jul 19 2013
d-bugmail puremagic.com writes:
http://d.puremagic.com/issues/show_bug.cgi?id=10668



--- Comment #6 from monarchdodra gmail.com 2013-07-19 08:41:49 PDT ---
(In reply to comment #4)
 Thanks for the response! Yeah, I converted my project to use dstrings on the
 off chance it worked after posting, and lo and behold, this is the fix, it seems.
 
 I plan on eventually getting the book, although I've read some bad reviews
 regarding the e-book/kindle version, so I'm having to wait a little longer to
 get a hard copy.

I'd recommend trying to get your project to work with "normal UTF8" strings.
They're the norm in D, and you'll have to get around to understanding how they
work sooner or later.

To make it *really* simple, a UTF-8 string should be handled like a
bidirectional range of dchars. You can ask for front/back, popFront/popBack,
and empty. Stick to only these primitives, and your code is *guaranteed* to
work.

All the other primitives (length, index, slice), while *present*, require much
more knowledge of what is going on, and should be used only when you *know*
what you are doing. As a matter of fact, if you ask a string if it supports,
say, length: "hasLength!string", it will say "false".
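
A small sketch of that advice, using only empty, front and popFront on a plain
string. The helper codePoints is invented for this illustration, and the
sample text is arbitrary. It relies on Phobos decoding strings to dchar in the
range primitives, which is the behaviour described in this thread.

```d
import std.range.primitives : empty, front, hasLength, popFront, walkLength;
import std.stdio : writeln;

// As far as the range API is concerned, a string has no length.
static assert(!hasLength!string);

// Hypothetical helper: walks a string with the range primitives and
// collects its code points.
dchar[] codePoints(string s)
{
    dchar[] result;
    while (!s.empty)
    {
        result ~= s.front; // decoded code point, a dchar
        s.popFront();      // skips every code unit of that code point
    }
    return result;
}

void main()
{
    string s = "\u00A3x"; // pound sign followed by 'x'

    writeln(s.length);             // 3 code units
    writeln(s.walkLength);         // 2 code points
    writeln(codePoints(s).length); // 2
}
```

walkLength has to step through the whole string to count code points, which is
one reason the range API does not advertise a length for strings.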
Jul 19 2013