
digitalmars.D.bugs - [Issue 5016] New: to!() can not convert from wide characters to char

d-bugmail puremagic.com writes:
http://d.puremagic.com/issues/show_bug.cgi?id=5016

           Summary: to!() can not convert from wide characters to char
           Product: D
           Version: D2
          Platform: Other
        OS/Version: All
            Status: NEW
          Severity: major
          Priority: P2
         Component: Phobos
        AssignedTo: nobody puremagic.com
        ReportedBy: aarti interia.pl


--- Comment #0 from Marcin Kuszczak <aarti interia.pl> 2010-10-08 01:03:34 PDT
---
Test case:

import std.conv;

void main() {
    //Instantiation error
    dchar from0 = 'A';
    char to0 = to!(char)(from0);

    //Instantiation error
    wchar from1 = 'A';
    char to1 = to!(char)(from1);

    //Ok
    char from2 = 'A';
    char to2 = to!(char)(from2);

    //Ok
    char from3 = 'A';
    wchar to3 = to!(wchar)(from3);

    //Ok
    char from4 = 'A';
    dchar to4 = to!(dchar)(from4);
}

It's an interesting case: the failing conversions cannot always succeed (e.g.
when a wchar/dchar value can not be encoded in one byte), yet in many cases
they are perfectly valid.

I am starting to think that treating strings/chars as just arrays is quite a
big mistake in D's design: it introduces a lot of corner cases.

-- 
Configure issuemail: http://d.puremagic.com/issues/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
Oct 08 2010
d-bugmail puremagic.com writes:
http://d.puremagic.com/issues/show_bug.cgi?id=5016


Andrei Alexandrescu <andrei metalanguage.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |ASSIGNED
                 CC|                            |andrei metalanguage.com
         AssignedTo|nobody puremagic.com        |andrei metalanguage.com


Jan 09 2011
d-bugmail puremagic.com writes:
http://d.puremagic.com/issues/show_bug.cgi?id=5016



--- Comment #1 from Marcin Kuszczak <aarti interia.pl> 2011-01-09 13:18:44 PST
---
After rethinking the problem, it seems that the real problem is that char and
wchar are not "real" characters. These two types are just artificial things
which cause more trouble than necessary. The only "true" character is dchar,
and all other character types should be deprecated.

In such a case:
string <=> ubyte[] => dchar[]
wstring <=> ushort[] => dchar[]

... and maybe also:
dstring <=> uint[] <=> dchar[]

where "=>" means "can be viewed as"

It would cleanly and properly solve the problems with strange and unnecessary
conversions like "dchar -> char".

Jan 09 2011
d-bugmail puremagic.com writes:
http://d.puremagic.com/issues/show_bug.cgi?id=5016


Jonathan M Davis <jmdavisProg gmx.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |jmdavisProg gmx.com


--- Comment #2 from Jonathan M Davis <jmdavisProg gmx.com> 2011-01-09 15:16:32
PST ---
char is explicitly defined to be a UTF-8 code unit. wchar is explicitly defined
to be a UTF-16 code unit. dchar is explicitly defined to be a UTF-32 code unit.
In UTF-8 and UTF-16, it can take multiple code units to make up a code point,
whereas it always takes exactly one UTF-32 code unit to make a code point. A
code point is what you would normally think of as a character. This is all
standard Unicode stuff and getting rid of it would be foolish. It's used all
over the place in computing, not just in D.
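
To make the code unit / code point distinction concrete, here is a minimal
sketch that counts code units for the same code point in each encoding. The two
code points used (U+00E9 and U+1F600) are just illustrative picks, not
something from the bug report.

```d
import std.stdio : writeln;

void main()
{
    // U+00E9 (e with acute accent): a single code point in every encoding.
    writeln("\u00E9"c.length); // 2 UTF-8 code units
    writeln("\u00E9"w.length); // 1 UTF-16 code unit
    writeln("\u00E9"d.length); // 1 UTF-32 code unit

    // U+1F600 lies outside the Basic Multilingual Plane.
    writeln("\U0001F600"c.length); // 4 UTF-8 code units
    writeln("\U0001F600"w.length); // 2 UTF-16 code units (a surrogate pair)
    writeln("\U0001F600"d.length); // 1 UTF-32 code unit
}
```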

Part of the trick to dealing with char and wchar correctly is that if you wish
to deal with code points / characters (_not_ code units), then _never_ deal
with char and wchar individually. That's why most of std.string deals with
entire strings at a time. If you want to deal with an individual character, you
either use a dchar or one of the string types - e.g. 'a' as a dchar or "a" as a
string type. You shouldn't be converting from dchar to char and vice versa (or
between either of those and wchar). It really doesn't make sense. What makes
sense is converting between string types.
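
As a rough illustration of converting between string types rather than between
individual character types, the sketch below turns a dchar into a one-character
string and then converts that string to the other encodings with std.conv.to.
The sample character is an arbitrary choice.

```d
import std.conv : to;
import std.stdio : writeln;

void main()
{
    dchar c = '\u00E9';

    // A single character becomes a string holding one code point.
    string  s = to!string(c);
    wstring w = to!wstring(s);
    dstring d = to!dstring(s);

    // Same code point, different number of code units per encoding.
    writeln(s.length, " ", w.length, " ", d.length); // 2 1 1

    // Converting back is lossless.
    assert(to!string(d) == s);
}
```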

On the whole, what D does works fantastically, but you need to understand the
basics of Unicode. The best place to look would probably be The D Programming
Language by Andrei Alexandrescu, since it applies directly to D, but there are
plenty of places online to find info on Unicode, and you can look at the online
docs on arrays for more info about them: http://is.gd/krYRH .

What it comes down to really is that you use whatever string type you need
based on size - string, wstring, or dstring - or the need to be able to treat
an individual array index as a character. If you need to be able to use random
access on a string (including using them in algorithms in std.algorithm which
require random access ranges), or if you need to be able to alter individual
characters in place, then use dstring or dchar[]. Otherwise, save space and use
either string or wstring (string would generally be better unless you're using
primarily Asian characters, since they tend to take 3 bytes in UTF-8 and 2 in
UTF-16).
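
A small sketch of that trade-off, assuming a three-character CJK sample string
chosen only for illustration: it compares the storage each string type needs
and shows that only the dchar-based array can be indexed and modified per
character.

```d
import std.stdio : writeln;

void main()
{
    string  s = "\u65E5\u672C\u8A9E";  // three CJK characters
    wstring w = "\u65E5\u672C\u8A9E"w;
    dstring d = "\u65E5\u672C\u8A9E"d;

    writeln(s.length * char.sizeof);  // 9 bytes in UTF-8
    writeln(w.length * wchar.sizeof); // 6 bytes in UTF-16
    writeln(d.length * dchar.sizeof); // 12 bytes in UTF-32

    // Indexing a dstring yields a whole character ...
    assert(d[1] == '\u672C');
    // ... while indexing a string yields a single code unit
    // (here a continuation byte of the first character).
    assert(s[1] == 0x97);

    // In-place replacement of one character works on dchar[].
    dchar[] buf = d.dup;
    buf[1] = 'X';
    assert(buf == "\u65E5X\u8A9E"d);
}
```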

There are functions which specifically take a dchar, so you can give them a
character then, but most deal entirely in strings, even if what you really care
about is an individual character. So, generally just treat individual
characters as strings with one character.

Take a look at the functions in std.utf: http://is.gd/krZLW . e.g.
std.utf.count() can be used to tell you how many code points / characters there
are in a string, and std.utf.stride() will tell you how many code units a
particular character is so that you can index into a string or wstring if you
have to.
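
For instance, a minimal sketch of those two functions on a sample string (the
string itself is just an example):

```d
import std.stdio : writeln;
import std.utf : count, stride;

void main()
{
    string s = "h\u00E9llo"; // the second character takes two code units

    writeln(s.length); // 6 code units
    writeln(count(s)); // 5 code points

    // Walk the string one character at a time, using stride() to find
    // out how many code units the character at index i occupies.
    for (size_t i = 0; i < s.length; i += stride(s, i))
        writeln(s[i .. i + stride(s, i)]);
}
```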

When using foreach, make sure that you give the type as dchar. e.g.

import std.stdio;

void main() {
    string str = "hello world";
    // dchar makes foreach decode one code point per iteration
    foreach(dchar c; str)
        writeln(c);
}

will print out each character individually, whereas using char (which is the
default if you don't give a type) or wchar would print out the individual code
units (which isn't generally very useful). foreach is smart enough to convert
the string to the appropriate type on the fly while iterating over it, so if
you give it dchar, it'll take each code point at a time instead of each code
unit.
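
The difference only shows up with non-ASCII text, so here is a quick sketch
that counts iterations both ways over a sample string containing one two-byte
character:

```d
import std.stdio : writeln;

void main()
{
    string str = "h\u00E9llo";
    size_t units, points;

    foreach (char c; str)  ++units;  // one iteration per UTF-8 code unit
    foreach (dchar c; str) ++points; // one iteration per decoded code point

    writeln(units, " ", points); // 6 5
}
```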

I'm sure that there are other things that would be useful to point out, but
that's all that comes to mind at the moment. On the whole, the way D handles
strings is fantastic. You just have to realize that you're dealing with UTF-8,
UTF-16, and UTF-32 code units instead of code points when you have a char,
wchar, or dchar respectively. dchar/UTF-32 is the only type where code units
and code points are the same size.

There has been some talk of various improvements to how all of this works (like
possibly making dchar the default type for foreach with string types), so some
incremental improvements may be made to iron out some of the wrinkles, but
strings in D are designed the way that they are on purpose, and it's not likely
to be drastically changed. For the most part, the problem is not the design but
rather understanding what the design is so that you can use it properly.

If you want to avoid the whole issue, then you can just use dstring everywhere,
but that _will_ result in using about 4 times the amount of memory as you would
need with string if you're dealing primarily with ASCII characters.
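
The factor of four is easy to check for plain ASCII text; a tiny sketch:

```d
import std.stdio : writeln;

void main()
{
    string  s = "hello world";
    dstring d = "hello world"d;

    writeln(s.length * char.sizeof);  // 11 bytes
    writeln(d.length * dchar.sizeof); // 44 bytes
}
```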

Jan 09 2011
d-bugmail puremagic.com writes:
http://d.puremagic.com/issues/show_bug.cgi?id=5016



--- Comment #3 from Jonathan M Davis <jmdavisProg gmx.com> 2011-01-09 15:20:07
PST ---
std.conv.to!() does need to be fixed to better handle the situation though. It
should probably do one of two things. Either it should outright refuse to
convert between the character types, on the theory that there's pretty much no
way that that's a good idea and that the programmer can just use cast if they
really, actually need to do such a conversion. Or it should throw when the
character can't fit in a single code unit of the target type, though that's
going to result in code that is rather hit or miss as to whether it's going to
succeed or not and wouldn't likely be a good idea to use in code generally.
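
To illustrate the second option, here is a rough sketch of a throwing
conversion. narrowToChar is a hypothetical helper name made up for this
example, not anything in Phobos; it uses std.utf.codeLength to ask how many
UTF-8 code units a code point needs and refuses anything that needs more than
one.

```d
import std.stdio : writeln;
import std.utf : codeLength;

// Hypothetical helper: convert only when the code point fits in
// exactly one UTF-8 code unit, otherwise throw.
char narrowToChar(dchar c)
{
    if (codeLength!char(c) != 1)
        throw new Exception("code point needs more than one UTF-8 code unit");
    return cast(char) c;
}

void main()
{
    writeln(narrowToChar('A')); // A

    try
        narrowToChar('\u00E9'); // needs two code units in UTF-8
    catch (Exception e)
        writeln("refused: ", e.msg);
}
```

Whether a given call succeeds depends entirely on the runtime value, which is
the "hit or miss" behaviour described above.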

Jan 09 2011
d-bugmail puremagic.com writes:
http://d.puremagic.com/issues/show_bug.cgi?id=5016


Andrei Alexandrescu <andrei metalanguage.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|ASSIGNED                    |RESOLVED
         Resolution|                            |FIXED


--- Comment #4 from Andrei Alexandrescu <andrei metalanguage.com> 2011-01-22
15:11:51 PST ---
std.conv.to for narrowing conversions acts as a checked cast. This bug was
fixed in http://www.dsource.org/projects/phobos/changeset/2359 and
http://www.dsource.org/projects/phobos/changeset/2363
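
A short sketch of what the checked cast looks like from user code, assuming a
Phobos version that includes the fix. The value 300 is an arbitrary example of
something that does not fit in a char, and the example catches the general
ConvException rather than relying on a more specific exception type.

```d
import std.conv : ConvException, to;
import std.exception : collectException;
import std.stdio : writeln;

void main()
{
    // A value that fits is converted as with a cast.
    dchar ok = 'A';
    assert(to!char(ok) == 'A');

    // A value that does not fit in the target type throws
    // instead of being silently truncated.
    dchar tooBig = 300;
    auto e = collectException!ConvException(to!char(tooBig));
    assert(e !is null);
    writeln("conversion rejected: ", e.msg);
}
```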

Jan 22 2011