digitalmars.D.bugs - [Issue 12113] New: A nothrow std.utf.decode with substitution on bad encoding

d-bugmail puremagic.com writes:
https://d.puremagic.com/issues/show_bug.cgi?id=12113

           Summary: A nothrow std.utf.decode with substitution on bad
                    encoding
           Product: D
           Version: D2
          Platform: All
        OS/Version: All
            Status: NEW
          Severity: enhancement
          Priority: P2
         Component: Phobos
        AssignedTo: nobody puremagic.com
        ReportedBy: dmitry.olsh gmail.com


--- Comment #0 from Dmitry Olshansky <dmitry.olsh gmail.com> 2014-02-08 14:26:46 PST ---
Change the behaviour of std.utf.decode to follow the Unicode standard's
recommendation: on badly encoded input, substitute U+FFFD instead of throwing.
As a bonus, dealing with partly broken encoding in text becomes palatable,
which is hardly possible with the current behaviour.

The relevant section of the standard:

5.22 Best Practice for U+FFFD Substitution

When converting text from one character encoding to another, a conversion
algorithm may encounter unconvertible code units. This is most commonly
caused by some sort of corruption of the source data, so that it does not
correctly follow the specification for that character encoding. Examples
include dropping a byte in a multibyte encoding such as Shift-JIS, improper
concatenation of strings, a mismatch between an encoding declaration and
actual encoding of text, use of non-shortest form for UTF-8, and so on.

...

Whenever an unconvertible offset is reached during conversion of a code unit
sequence:
1. The maximal subpart at that offset should be replaced by a single U+FFFD.
2. The conversion should proceed at the offset immediately after the maximal
   subpart.
---
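
For illustration only, the two rules quoted above can be sketched as a small
nothrow UTF-8 decoder. This is not the Phobos implementation and not the code
proposed for std.utf.decode; the names decodeSubstituting, decodeAll and
replacementChar are hypothetical. The second-byte ranges are the ones from the
standard's table of well-formed UTF-8 byte sequences, which is what makes the
"maximal subpart" stop at the first byte that cannot continue a valid sequence.

```d
enum dchar replacementChar = '\uFFFD';

/// Decodes one code point starting at s[i] and advances i.
/// On ill-formed input it returns U+FFFD and leaves i just past the
/// maximal subpart, so the caller simply continues from there.
dchar decodeSubstituting(const(char)[] s, ref size_t i) pure nothrow @nogc @safe
{
    immutable uint b0 = s[i++];
    if (b0 < 0x80)
        return cast(dchar) b0;           // ASCII

    size_t extra;                        // number of trailing bytes expected
    uint lo = 0x80, hi = 0xBF;           // allowed range of the next byte
    uint c;
    if (b0 >= 0xC2 && b0 <= 0xDF)
    {
        extra = 1; c = b0 & 0x1F;
    }
    else if (b0 >= 0xE0 && b0 <= 0xEF)
    {
        extra = 2; c = b0 & 0x0F;
        if (b0 == 0xE0) lo = 0xA0;       // excludes non-shortest forms
        else if (b0 == 0xED) hi = 0x9F;  // excludes surrogates
    }
    else if (b0 >= 0xF0 && b0 <= 0xF4)
    {
        extra = 3; c = b0 & 0x07;
        if (b0 == 0xF0) lo = 0x90;       // excludes non-shortest forms
        else if (b0 == 0xF4) hi = 0x8F;  // excludes values above U+10FFFF
    }
    else
        return replacementChar;          // stray trail byte or invalid lead byte

    foreach (n; 0 .. extra)
    {
        // Rule 1: the bytes consumed so far form the maximal subpart.
        // Rule 2: i is left at the offending byte, where decoding resumes.
        if (i >= s.length || s[i] < lo || s[i] > hi)
            return replacementChar;
        c = (c << 6) | (s[i++] & 0x3F);
        lo = 0x80; hi = 0xBF;            // later trail bytes use the full range
    }
    return cast(dchar) c;
}

/// Decodes a whole UTF-8 string, never throwing on bad input.
dstring decodeAll(const(char)[] s) pure nothrow @safe
{
    dstring result;
    size_t i = 0;
    while (i < s.length)
        result ~= decodeSubstituting(s, i);
    return result;
}

void main()
{
    // 'a', then E0 80 (an invalid pair: two one-byte subparts), then 'b'
    immutable(ubyte)[] raw = [0x61, 0xE0, 0x80, 0x62];
    assert(decodeAll(cast(const(char)[]) raw) == "a\uFFFD\uFFFDb"d);

    // a four-byte sequence truncated after three bytes: one subpart
    immutable(ubyte)[] cut = [0x61, 0xF0, 0x9F, 0x98];
    assert(decodeAll(cast(const(char)[]) cut) == "a\uFFFD"d);
}
```

Because no path in the sketch throws, both functions can be marked nothrow,
which is the property this enhancement asks for.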

-- 
Configure issuemail: https://d.puremagic.com/issues/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
Feb 08 2014
d-bugmail puremagic.com writes:
https://d.puremagic.com/issues/show_bug.cgi?id=12113


bearophile_hugs eml.cc changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |bearophile_hugs eml.cc


--- Comment #1 from bearophile_hugs eml.cc 2014-02-08 14:43:03 PST ---
Is it possible to generalize this to lazy ranges? Currently the code below
can't be nothrow. Can it become nothrow using your suggestion?


void main() nothrow {
    string txt = "hello";
    foreach (dchar c; txt) {}
}


Currently it gives:

test.d(3): Error: '_aApplycd1' is not nothrow
test.d(1): Error: function 'D main' is nothrow yet may throw
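
As a point of comparison, a common way to get such a loop into a nothrow
function while decoding can still throw is to catch the exception locally.
This is only a sketch of that workaround, assuming the decoding foreach still
reports bad UTF-8 by throwing; countCodePoints is a hypothetical helper name.

```d
/// Counts code points, stopping quietly at the first decoding failure.
size_t countCodePoints(string txt) nothrow
{
    size_t count;
    try
    {
        foreach (dchar c; txt)   // decoding happens in the runtime here
            ++count;
    }
    catch (Exception e)
    {
        // Ill-formed UTF-8 ends up here; the count so far is returned.
    }
    return count;
}

void main() nothrow
{
    assert(countCodePoints("hello") == 5);
}
```

The try/catch satisfies the compiler, but the text after the first bad byte is
lost, which is the kind of handling that substitution would make unnecessary.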

Feb 08 2014
d-bugmail puremagic.com writes:
https://d.puremagic.com/issues/show_bug.cgi?id=12113


Jonathan M Davis <jmdavisProg gmx.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |jmdavisProg gmx.com


--- Comment #2 from Jonathan M Davis <jmdavisProg gmx.com> 2014-02-08 16:12:39 PST ---
> Is it possible to generalize this to lazy ranges?

I should point out that that code doesn't use any range-based functions at
all. However, if decode and stride are both altered to be nothrow, then any
function which was not inferred as nothrow because of them would then be
nothrow (assuming that template inference is working correctly, which is not
entirely the case right now; it does particularly poorly with Voldemort
types). So, there's a good chance that a lot of range-based string code would
become nothrow.

As for your example, it uses pure druntime functionality, as it's a foreach
over a string. That code will need to be updated as well as std.utf.decode
and std.utf.stride. In theory it is in the same boat and could become
nothrow, which would make your example nothrow, but I'd have to go digging
through the code in druntime to see what all it does in order to be sure. I
expect that it could be nothrow, but there might be something unexpected that
it's doing that prevents it.
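
The inference argument above can be illustrated with a minimal sketch. The
functions throwingFront, substitutingFront and firstCodePoint are hypothetical
stand-ins, not Phobos code, and they only handle ASCII; the point is that the
same attribute-free template is inferred nothrow or not depending on the
decoder it calls.

```d
// Hypothetical stand-in for a decoder that reports bad input by throwing.
dchar throwingFront(const(char)[] s)
{
    if (s.length == 0 || s[0] >= 0x80)
        throw new Exception("not handled in this sketch");
    return s[0];
}

// Hypothetical stand-in for a decoder that substitutes U+FFFD instead.
dchar substitutingFront(const(char)[] s) nothrow
{
    if (s.length == 0 || s[0] >= 0x80)
        return '\uFFFD';
    return s[0];
}

// A template with no explicit attributes: the compiler infers nothrow
// from the body, that is, from whichever decoder is passed in.
dchar firstCodePoint(alias decoder)(const(char)[] s)
{
    return decoder(s);
}

// Callable from nothrow code only when the decoder itself is nothrow.
static assert( __traits(compiles,
    () nothrow { auto c = firstCodePoint!substitutingFront("a"); }));
static assert(!__traits(compiles,
    () nothrow { auto c = firstCodePoint!throwingFront("a"); }));

void main() nothrow
{
    assert(firstCodePoint!substitutingFront("a") == 'a');
    assert(firstCodePoint!substitutingFront("") == '\uFFFD');
}
```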
Feb 08 2014
d-bugmail puremagic.com writes:
https://d.puremagic.com/issues/show_bug.cgi?id=12113


Walter Bright <bugzilla digitalmars.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |bugzilla digitalmars.com


--- Comment #3 from Walter Bright <bugzilla digitalmars.com> 2014-03-21 11:28:17 PDT ---
The relevant spec:

http://www.unicode.org/versions/Unicode6.2.0/UnicodeStandard-6.2.pdf

Mar 21 2014
d-bugmail puremagic.com writes:
https://d.puremagic.com/issues/show_bug.cgi?id=12113



--- Comment #4 from Walter Bright <bugzilla digitalmars.com> 2014-03-24 11:12:35 PDT ---
https://github.com/D-Programming-Language/phobos/pull/2043

Mar 24 2014