digitalmars.D.bugs - [Issue 6458] New: Multibyte char literals shouldn't implicitly convert to char

d-bugmail puremagic.com writes:
http://d.puremagic.com/issues/show_bug.cgi?id=6458

           Summary: Multibyte char literals shouldn't implicitly convert
                    to char
           Product: D
           Version: D2
          Platform: Other
        OS/Version: Windows
            Status: NEW
          Severity: normal
          Priority: P2
         Component: DMD
        AssignedTo: nobody puremagic.com
        ReportedBy: clugdbug yahoo.com.au



The code below should either be rejected, or work correctly.
The particularly problematic case is:   s[0..2] = 'ä', which looks perfectly
reasonable, but creates garbage.
I'm a bit confused about non-ASCII char literals, since although they are typed
as 'char', they can't be stored in a char... This just seems wrong.

----
int bug6458()
{
    char [] s = "abcdef".dup;
    s[0] = 'ä';
    assert(s == "äcdef");
    return 34;
}
void main()
{
    bug6458();
}
----

Surely this has been reported before, but I can't find it.
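
For reference, here is a minimal sketch of how a non-ASCII code point can be stored in a char[] without truncation, assuming std.utf.encode from Phobos. The helper name replaceUnit is hypothetical and not part of the report; it replaces one code unit with the full UTF-8 encoding of the code point, so the array may grow.

```d
import std.utf : encode;

// Hypothetical helper (not from the report): replaces the single code unit
// at index i with the full UTF-8 encoding of c and returns a new array.
char[] replaceUnit(const(char)[] s, size_t i, dchar c)
{
    char[4] buf;
    immutable n = encode(buf, c);   // number of UTF-8 code units written
    char[] result = s[0 .. i].dup;  // mutable copy of the prefix
    result ~= buf[0 .. n];
    result ~= s[i + 1 .. $];
    return result;
}

void main()
{
    assert("ä".length == 2);   // U+00E4 takes two UTF-8 code units
    assert(replaceUnit("abcdef", 0, 'ä') == "äbcdef");
}
```

The point of the sketch is only that 'ä' occupies two elements of a char[], which is why a plain s[0] = 'ä' cannot hold it.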

-- 
Configure issuemail: http://d.puremagic.com/issues/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
Aug 08 2011
d-bugmail puremagic.com writes:
http://d.puremagic.com/issues/show_bug.cgi?id=6458


Jonathan M Davis <jmdavisProg gmx.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |jmdavisProg gmx.com



Personally, I think that all character literals should be typed as dchar, since
it's generally a _bad_ idea to operate on individual chars or wchars. Normally,
the only places that chars or wchars should be used is in ranges of chars or
wchars (which would normally be arrays). But making character literals dchar by
default might break too much code at this point. Though, since it should be
possible to use range propagation to verify whether a particular code point
will fit in a particular code unit, the breakage might be minimal.

Regardless, I actually never would have expected s[0 .. 2] = 'ä' to work, since
you're assigning a character to multiple characters as far as types go, though
I can see why you might think that it would work or why it arguably _should_
work. Obviously though, if the compiler is allowing you to assign a code point
to multiple code units like that, it should only compile if it can verify that
the code point will fit exactly in those code units, and if it does compile, it
should work correctly rather than generate garbage. So, it seems there are
several issues at work here.

Aug 08 2011
d-bugmail puremagic.com writes:
http://d.puremagic.com/issues/show_bug.cgi?id=6458


Don <clugdbug yahoo.com.au> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Keywords|                            |accepts-invalid




> Personally, I think that all character literals should be typed as dchar, since
> it's generally a _bad_ idea to operate on individual chars or wchars. Normally,
> the only places that chars or wchars should be used is in ranges of chars or
> wchars (which would normally be arrays). But making character literals dchar by
> default might break too much code at this point. Though, since it should be
> possible to use range propagation to verify whether a particular code point
> will fit in a particular code unit, the breakage might be minimal.

Oddly, this passes:

static assert('ä'.sizeof == 2);

So there's something a bit nonsensical about the whole thing.

> Regardless, I actually never would have expected s[0 .. 2] = 'ä' to work, since
> you're assigning a character to multiple characters as far as types go,

It's more subtle. This is block assignment.
s[0..4] = 'a'; works, and creates "aaaa".
s[0..4] = 'ä' is expected to fill the string with ä, creating "ää".
Instead, it fills it with four copies of the first UTF-8 byte of ä, creating an
invalid string.
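
To make the expected result of the block assignment concrete, here is a rough sketch of a fill that writes whole UTF-8 sequences instead of a single truncated byte. It assumes std.utf.encode from Phobos; fillWithCodePoint is a hypothetical helper, not something the compiler or the library provides, and refusing lengths that are not a multiple of the encoded length is just one possible policy.

```d
import std.utf : encode;

// Hypothetical helper: fills dst with whole copies of the UTF-8 encoding
// of c. Returns false and leaves dst untouched when dst.length is not a
// multiple of the encoded length.
bool fillWithCodePoint(char[] dst, dchar c)
{
    char[4] buf;
    immutable n = encode(buf, c);        // code units needed for c
    if (dst.length % n != 0)
        return false;
    for (size_t i = 0; i < dst.length; i += n)
        dst[i .. i + n] = buf[0 .. n];   // slice copy, lengths match
    return true;
}

void main()
{
    char[] s = "abcdef".dup;
    assert(fillWithCodePoint(s[0 .. 4], 'ä'));
    assert(s == "ääef");                          // four code units, two ä
    assert(!fillWithCodePoint(s[0 .. 3], 'ä'));   // 3 is not a multiple of 2
    assert(s == "ääef");
}
```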
Aug 08 2011
d-bugmail puremagic.com writes:
http://d.puremagic.com/issues/show_bug.cgi?id=6458


Ah, yes. I forgot that you could assign a single value to every element in an
array like that. That being the case, it should just fail to compile given that
the code point is not going to fit in each of the elements of the array. But
regardless, something odd is definitely going on here given that 'ä'.sizeof ==
2. It's probably an edge case which wasn't caught, since the only types which
take up multiple elements like that are char and wchar.

Aug 08 2011
d-bugmail puremagic.com writes:
http://d.puremagic.com/issues/show_bug.cgi?id=6458


changlon <changlon gmail.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |changlon gmail.com



s[0..3] = 'a';

this should raise an exception ?

Aug 08 2011
d-bugmail puremagic.com writes:
http://d.puremagic.com/issues/show_bug.cgi?id=6458





> s[0..3] = 'a';
>
> this should raise an exception ?

Sorry, I mean s[0..3] = 'ä';
Aug 08 2011
d-bugmail puremagic.com writes:
http://d.puremagic.com/issues/show_bug.cgi?id=6458




It shouldn't even compile, because the types don't match. Even with range
propagation, the best that you'll do with 'ä' is fit it in a wchar, so it won't
fit in a char, and so you _can't_ assign it to each element of s[0 .. 3] like
that. s[0 .. 3] = "ä"[] should work, but s[0 .. 3] = 'ä' definitely shouldn't.
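
A small sketch of the string-slice form, for comparison. One caveat that is my reading rather than something stated above: a slice copy needs both sides to have the same length, and "ä" is two code units, so the sketch uses s[0 .. 2] rather than s[0 .. 3]. The function name overwriteFirstTwo is made up for illustration.

```d
// Hypothetical demo: copies the two UTF-8 code units of "ä" over the
// first two elements of a mutable copy of "abcdef".
char[] overwriteFirstTwo()
{
    char[] s = "abcdef".dup;
    s[0 .. 2] = "ä";   // array copy: both sides are 2 code units long
    return s;
}

void main()
{
    assert(overwriteFirstTwo() == "äcdef");
}
```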

Aug 08 2011
d-bugmail puremagic.com writes:
http://d.puremagic.com/issues/show_bug.cgi?id=6458


Jacob Carlborg <doob me.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |doob me.com



As far as I can see, D uses the smallest type necessary to fit a character
literal. So all non-ASCII character literals will be either wchar or dchar.
Both of the following pass, as expected.

static assert(is(typeof('ä') == wchar));
static assert(is(typeof('a') == char));

But I don't know why the compiler allows assigning a wchar to a char array
element. That doesn't seem right.

Aug 08 2011
d-bugmail puremagic.com writes:
http://d.puremagic.com/issues/show_bug.cgi?id=6458





> As far as I can see, D uses the smallest type necessary to fit a character
> literal. So all non-ASCII character literals will be either wchar or dchar.
> Both of the following pass, as expected.
>
> static assert(is(typeof('ä') == wchar));
> static assert(is(typeof('a') == char));

That's good news. Seems like it's only a few cases where it behaves stupidly.

> But I don't know why the compiler allows assigning a wchar to a char array
> element. That doesn't seem right.

It's more general than that:

wchar w = 'ä';
char c = w; // Error: cannot implicitly convert expression (w) of type wchar to char
char c = 'ä'; // passes!!!
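
The checks discussed so far can be collected into one compile-time sketch. The first three asserts restate what was reported in this thread. The last two assume a compiler that rejects the implicit wchar-to-char conversion of the literal, which is the behaviour this report asks for; a compiler with the bug would fail the first of them.

```d
// 'a' fits in one UTF-8 code unit; 'ä' (U+00E4) gets the next wider type.
static assert(is(typeof('a') == char));
static assert(is(typeof('ä') == wchar));
static assert('ä'.sizeof == 2);

// Assumption: the compiler rejects narrowing the literal to char.
static assert(!__traits(compiles, { char c = 'ä'; }));
// Widening to wchar or dchar stays legal.
static assert(__traits(compiles, { wchar w = 'ä'; dchar d = 'ä'; }));

void main() {}
```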
Aug 09 2011
d-bugmail puremagic.com writes:
http://d.puremagic.com/issues/show_bug.cgi?id=6458


yebblies <yebblies gmail.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |yebblies gmail.com
           Platform|Other                       |All
         AssignedTo|nobody puremagic.com        |yebblies gmail.com
         OS/Version|Windows                     |All




 
> The compiler complains about the code above, just as it should, because a long
> won't fit in an int. Don't know why character literals are treated differently.

They aren't. The problem is that 'ä' evaluates to 0x00E4, and a bug in integer
range propagation thinks this is ok to convert back to a char.
Jan 30 2012
d-bugmail puremagic.com writes:
http://d.puremagic.com/issues/show_bug.cgi?id=6458


yebblies <yebblies gmail.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Keywords|                            |patch



Actually, this doesn't involve integer range propagation.

https://github.com/D-Programming-Language/dmd/pull/663

Jan 30 2012
d-bugmail puremagic.com writes:
http://d.puremagic.com/issues/show_bug.cgi?id=6458


yebblies <yebblies gmail.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |andrei metalanguage.com



*** Issue 6988 has been marked as a duplicate of this issue. ***

Jan 31 2012
d-bugmail puremagic.com writes:
http://d.puremagic.com/issues/show_bug.cgi?id=6458


SomeDude <lovelydear mailmetrash.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |lovelydear mailmetrash.com



This doesn't compile on 2.059 Win32.

PS E:\DigitalMars\dmd2\samples> rdmd -w bug.d
bug.d(4): invalid UTF-8 sequence
bug.d(5): invalid UTF-8 sequence
PS E:\DigitalMars\dmd2\samples>

Apr 20 2012
d-bugmail puremagic.com writes:
http://d.puremagic.com/issues/show_bug.cgi?id=6458


Kenji Hara <k.hara.pg gmail.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
            Version|D2                          |D1



D2 is fixed, but D1 has the same issue.

Pull request for D1:
https://github.com/D-Programming-Language/dmd/pull/1056

Jul 19 2012
d-bugmail puremagic.com writes:
http://d.puremagic.com/issues/show_bug.cgi?id=6458




Fixed for D1:
https://github.com/9rnsr/dmd/commit/6f5ae56f52c1f2a8af921905926a3ea4752ee388

Jul 20 2012
d-bugmail puremagic.com writes:
http://d.puremagic.com/issues/show_bug.cgi?id=6458


Kenji Hara <k.hara.pg gmail.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|                            |FIXED


Jul 20 2012