www.digitalmars.com         C & C++   DMDScript  

digitalmars.D.bugs - [Issue 10203] New: std.string.toUpperInPlace is... not in place

reply d-bugmail puremagic.com writes:
http://d.puremagic.com/issues/show_bug.cgi?id=10203

           Summary: std.string.toUpperInPlace is... not in place
           Product: D
           Version: unspecified
          Platform: All
        OS/Version: All
            Status: NEW
          Severity: normal
          Priority: P2
         Component: Phobos
        AssignedTo: nobody puremagic.com
        ReportedBy: monarchdodra gmail.com


--- Comment #0 from monarchdodra gmail.com 2013-05-29 09:53:34 PDT ---
As reported by manu in his talk. All in the title.

toUpperInPlace will actually allocated a new string. As a matter of fact, it
will allocate, a LOT: a new string per char.

-- 
Configure issuemail: http://d.puremagic.com/issues/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
May 29 2013
next sibling parent d-bugmail puremagic.com writes:
http://d.puremagic.com/issues/show_bug.cgi?id=10203



--- Comment #1 from monarchdodra gmail.com 2013-05-29 12:18:46 PDT ---
Fixed here

https://github.com/D-Programming-Language/phobos/pull/1322

-- 
Configure issuemail: http://d.puremagic.com/issues/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
May 29 2013
prev sibling next sibling parent d-bugmail puremagic.com writes:
http://d.puremagic.com/issues/show_bug.cgi?id=10203



--- Comment #2 from monarchdodra gmail.com 2013-05-30 02:13:50 PDT ---
(In reply to comment #1)
 Fixed here
 
 https://github.com/D-Programming-Language/phobos/pull/1322
Turns out, upon investigation, that my fix is incorrect. What is more problematic though, is that the review revealed that there are cases where toUpperInPlace simply *cannot* be inplace: there characters out there, whose UTF representation is not the same length for upper case and lower case. What this means is that the lower case representation could be longer than the original length. How could this be done in place? It also reveals more problems: If the resulting string is actually smaller, than slices that aliased the same data will now reference garbage values. For example: 'İ', which is u0130, and a two byte encoding will become 'i'. So when I take the string "İa": auto a = "\xC4\xB0\x61"; auto b = a; toLowerInPlace(a); //Now: //a == "\x69\xB0" //b == "\x69\xB0\x61" Oops: Trailing code unit :/ Or: say, I have "aİa", and want to reduce in place "İ" auto a = "\x61\xC4\xB0\x61"; auto b = a[1 .. 3]; toLowerInPlace(b); //Now: //b == "\x69" //OK //a == "\x61\x69\xB0\x61" //Wait, what is that B0 doinh here? -------- It seems to me that, overall, toLowerInPlace is a function that is broken, that cannot respect the specs it promises, and violates the principal of least surprise in regards to behavior. I think it should either be tagged with a massive red unsafe, or deprecated. -- Configure issuemail: http://d.puremagic.com/issues/userprefs.cgi?tab=email ------- You are receiving this mail because: -------
May 30 2013
prev sibling next sibling parent d-bugmail puremagic.com writes:
http://d.puremagic.com/issues/show_bug.cgi?id=10203


monarchdodra gmail.com changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           See Also|                            |http://d.puremagic.com/issu
                   |                            |es/show_bug.cgi?id=9629
         AssignedTo|monarchdodra gmail.com      |nobody puremagic.com


--- Comment #3 from monarchdodra gmail.com 2013-05-30 02:19:53 PDT ---
(In reply to comment #2)> So when I take the string "İa":
 auto a = "\xC4\xB0\x61";
 auto b = a;
 toLowerInPlace(a);
 
 //Now:
 //a == "\x69\xB0"
 //b == "\x69\xB0\x61" Oops: Trailing code unit :/
Wait, this example is wrong, corrected as: take the string "aİ" auto a = "\x61\xC4\xB0"; auto b = a; toLowerInPlace(a); //Now: //a == "\x61\x69" //b == "\x61\x69\xB0" Oops: Trailing code unit :/ Sorry. -- Configure issuemail: http://d.puremagic.com/issues/userprefs.cgi?tab=email ------- You are receiving this mail because: -------
May 30 2013
prev sibling next sibling parent d-bugmail puremagic.com writes:
http://d.puremagic.com/issues/show_bug.cgi?id=10203


Dmitry Olshansky <dmitry.olsh gmail.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |dmitry.olsh gmail.com


--- Comment #4 from Dmitry Olshansky <dmitry.olsh gmail.com> 2013-09-21
13:58:19 PDT ---
(In reply to comment #3)
 Wait, this example is wrong, corrected as:
 
 take the string "aİ"
 auto a = "\x61\xC4\xB0";
auto a = "\x61\xC4\xB0".dup;
 auto b = a;
 toLowerInPlace(a);
 
 //Now:
 //a == "\x61\x69"
 //b == "\x61\x69\xB0" Oops: Trailing code unit :/
 
 Sorry.
I belive we now should have solid treatment of toUpper/toLower (pending a bugfix in the works). For me this gives now: //a == 61 69 CC 87 //b == 61 C4 B0 i.e. toLowerInPlace fails to do this in place because resulting length increases. Should it try to extend if a.capacity allows it? (I bet it has about 15 bytes of storage) -- Configure issuemail: http://d.puremagic.com/issues/userprefs.cgi?tab=email ------- You are receiving this mail because: -------
Sep 21 2013
prev sibling next sibling parent d-bugmail puremagic.com writes:
http://d.puremagic.com/issues/show_bug.cgi?id=10203


bearophile_hugs eml.cc changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |bearophile_hugs eml.cc


--- Comment #5 from bearophile_hugs eml.cc 2013-09-21 14:13:19 PDT ---
(In reply to comment #2)

 It seems to me that, overall, toLowerInPlace is a function that is broken, that
 cannot respect the specs it promises, and violates the principal of least
 surprise in regards to behavior.
 
 I think it should either be tagged with a massive red unsafe, or deprecated.
Even if it's impossible for Unicode strings, I'd like to keep a version of it for just arrays of ASCII chars, that is a common enough use case. -- Configure issuemail: http://d.puremagic.com/issues/userprefs.cgi?tab=email ------- You are receiving this mail because: -------
Sep 21 2013
prev sibling next sibling parent d-bugmail puremagic.com writes:
http://d.puremagic.com/issues/show_bug.cgi?id=10203



--- Comment #6 from Dmitry Olshansky <dmitry.olsh gmail.com> 2013-09-21
14:26:34 PDT ---
(In reply to comment #5)
 (In reply to comment #2)
 
 It seems to me that, overall, toLowerInPlace is a function that is broken, that
 cannot respect the specs it promises, and violates the principal of least
 surprise in regards to behavior.
 
 I think it should either be tagged with a massive red unsafe, or deprecated.
Even if it's impossible for Unicode strings, I'd like to keep a version of it for just arrays of ASCII chars, that is a common enough use case.
It also works quite well for UTF-16. And it does now. And I would have kept it if only because of backwards compatibility perspective. There is no other primitive yet, but a version(s) with output range for all string transformations is something to look forward to. The only question is whether or not should this function try to use extra capacity beyond the length if it's present (that would make InPlace suffix look saner). -- Configure issuemail: http://d.puremagic.com/issues/userprefs.cgi?tab=email ------- You are receiving this mail because: -------
Sep 21 2013
prev sibling parent d-bugmail puremagic.com writes:
http://d.puremagic.com/issues/show_bug.cgi?id=10203


w0rp <devw0rp gmail.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |devw0rp gmail.com


--- Comment #7 from w0rp <devw0rp gmail.com> 2013-09-21 14:35:51 PDT ---
It may be worth considering presenting only what is possible, so offer
toUpperInPlace or toLowerInPlace only on ASCII data. As you say, Unicode
strings change in byte length dramatically, but the ASCII range stays the same.
So I think it would be worth offering functions with different names, or
perhaps with locale template arguments. (Locale.ascii, or whatever.) Just so
there can be a guarantee of something never allocating when you want to avoid
that, but still let you get at a "almost there" function when you want that.

-- 
Configure issuemail: http://d.puremagic.com/issues/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
Sep 21 2013