www.digitalmars.com         C & C++   DMDScript  

digitalmars.D - Formal Review of std.uni

reply "Jesse Phillips" <Jesse.K.Phillips+D gmail.com> writes:
First off, Dconf is this next weekend and effects the schedule of 
this review. Review will be held for 3 weeks, instead of holding 
off a week I'm extending the period and starting the review now. 
(Dmitry may be unable to respond due to his being a speaker)

This is a replacement module for the current std.uni by Dmitry 
Olshansky. The std.uni module provides an implementation of 
fundamental Unicode algorithms and data structures.

To use this module, install 2.63 beta, import uni; and not 
std.uni, compile two files from the source uni.d unicode_tables.d

Docs:
http://blackwhale.github.io/phobos/uni.html

Source:
https://github.com/blackwhale/gsoc-bench-2012

DMD Beta:
http://forum.dlang.org/post/517C8552.7040704 digitalmars.com

It should be noted that inclusion into Phobos may require 
addressing inter-dependencies, see "Reducing the 
inter-dependencies"
http://forum.dlang.org/post/kl8hn8$bm3$1 digitalmars.com
Apr 28 2013
next sibling parent reply Jacob Carlborg <doob me.com> writes:
On 2013-04-28 18:56, Jesse Phillips wrote:
 First off, Dconf is this next weekend and effects the schedule of this
 review. Review will be held for 3 weeks, instead of holding off a week
 I'm extending the period and starting the review now. (Dmitry may be
 unable to respond due to his being a speaker)

 This is a replacement module for the current std.uni by Dmitry
 Olshansky. The std.uni module provides an implementation of fundamental
 Unicode algorithms and data structures.

Two minor things: * The docs for "decomposeHangul" contains a ddoc macro * "toLower" and "toUpper" mention overloads which take strings instead of a dchar. I cannot find those overloads in the std.uni module. If they refer to another module it should say so -- /Jacob Carlborg
Apr 28 2013
parent reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
29-Apr-2013 10:45, Jacob Carlborg пишет:
 On 2013-04-28 18:56, Jesse Phillips wrote:
 First off, Dconf is this next weekend and effects the schedule of this
 review. Review will be held for 3 weeks, instead of holding off a week
 I'm extending the period and starting the review now. (Dmitry may be
 unable to respond due to his being a speaker)

 This is a replacement module for the current std.uni by Dmitry
 Olshansky. The std.uni module provides an implementation of fundamental
 Unicode algorithms and data structures.

Two minor things: * The docs for "decomposeHangul" contains a ddoc macro

Thanks, fixed that and the same typo elsewhere.
 * "toLower" and "toUpper" mention overloads which take strings instead
 of a dchar. I cannot find those overloads in the std.uni module. If they
 refer to another module it should say so

Technically these should be in std.string and are there but incorrect. Plus the impossible to implement toLowerInPlace/toUpperInPlace since length may change during case-conversion. Darn... I got to implement them in std.uni then. Not much work given that all of pieces are already here. -- Dmitry Olshansky
Apr 29 2013
parent reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
29-Apr-2013 22:50, Jonathan M Davis пишет:
 On Monday, April 29, 2013 22:13:09 Dmitry Olshansky wrote:
 Technically these should be in std.string and are there but incorrect.

Then fix them there.

I think it will take a certain amount of leaked implementation details to get it done at least half-decently. In particular to re-use the same case-folding table as for case-insensitive matching. Would be a strategic mistake IMHO to spread the internals across 2 modules. Then std.string could provide a public alias(es) as it imports std.uni anyway? Going further if we are to preserve the status quo then std.uni will declare them as package to not advertise their new location.
 - Jonathan M Davis

-- Dmitry Olshansky
Apr 29 2013
parent reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
30-Apr-2013 04:12, Jonathan M Davis пишет:
 On Tuesday, April 30, 2013 03:02:17 Dmitry Olshansky wrote:
 29-Apr-2013 22:50, Jonathan M Davis пишет:
 On Monday, April 29, 2013 22:13:09 Dmitry Olshansky wrote:
 Technically these should be in std.string and are there but incorrect.

Then fix them there.

I think it will take a certain amount of leaked implementation details to get it done at least half-decently. In particular to re-use the same case-folding table as for case-insensitive matching. Would be a strategic mistake IMHO to spread the internals across 2 modules. Then std.string could provide a public alias(es) as it imports std.uni anyway? Going further if we are to preserve the status quo then std.uni will declare them as package to not advertise their new location.

An alias would be one option. The primary issue I see is that the general design of the modules has been that std.ascii and std.uni operate on individual characters and not strings, whereas std.string operates on strings (generally going with the unicode versions of things when there's a choice rather than the ASCII ones). And having overloads in std.uni which operate on strings doesn't follow that organizational model. Usually, std.string has called the std.uni functions to do what it's needed, and no implementation details have been leaked. I haven't looked at what you've done with std.uni yet, so I don't know how well that would work. But toLower and toUpper are far from them only functions which would be affected by something like this (icmp would be another obvious one).

Unicode --> can't be done on character by character basis
 But even if we decided to reorganize where some functionality or functions
 live, we shouldn't have std.uni replacing std.string functions while leaving
 the old functions in std.string. So, worst case, aliases should be used, but
 if at all reasonable, I'd prefer to keep the module organizational structure
 that we've had with regards to how characters and string functionality is
 organized.

Aliases it is then. -- Dmitry Olshansky
Apr 30 2013
parent Dmitry Olshansky <dmitry.olsh gmail.com> writes:
30-Apr-2013 23:17, Jonathan M Davis пишет:
 On Tuesday, April 30, 2013 15:13:14 Dmitry Olshansky wrote:
 Unicode --> can't be done on character by character basis

Sure it can. It operates on dchar.

Getting back to this. Sure it can't - I'd hate to break the illusion but the keyword is e.g. Unicode Case Folding. Another one is Combining Character sequence.
 So, with how it's been, std.uni would only be operating on dchars, and putting
 a function in there which operated on strings wouldn't make any sense. Maybe
 that doesn't work if you've done a bunch of grapheme stuff, and things will
 have to be adjusted, but it would be a definite shift to put anything in
 std.uni which operated on strings, and I think that it would need some definite
 justification (and there's a good chance that I'd be inclined to argue that it
 should still go in std.string, possibly using some internal modules if
 necessary).

Justification is that we'd rather have exactly one module dealing with a bunch of Unicode data arranged into intricate tables. Strictly speaking I'd abolish any Unicode related algorithm in std.string since it's almost definitely doing it wrong anyway (I've checked only 2 - both broken). There is not a single sign of unicode standards used, just the fallacious logic: byte --> dchar and use the same algorithm as with ASCII. It won't work.
 But clearly I need to take the time to take a look at what you've actually
 done (I keep meaning to but haven't gotten around to it yet). It had been my
 impression that what you were doing was primarily a matter of improving the
 implementation, but it sounds like you're doing something beyond that.

Take a peek at icmp and sicmp in new std.uni. Current fork of Phobos is here: https://github.com/blackwhale/phobos/tree/new-std-uni Eventually we'd have to do a bit more in the same direction e.g. title casing, split by word boundary etc. (all of these need fixing in std.string). Also all of the core tools are now in the open: CodepointSet, and generating Tries from sets and AA-s. -- Dmitry Olshansky
May 12 2013
prev sibling next sibling parent "Jonathan M Davis" <jmdavisProg gmx.com> writes:
On Monday, April 29, 2013 22:13:09 Dmitry Olshansky wrote:
 Technically these should be in std.string and are there but incorrect.

Then fix them there. - Jonathan M Davis
Apr 29 2013
prev sibling next sibling parent "Brian Schott" <briancschott gmail.com> writes:
I was working on a list of suggestions, but turned it into a pull 
request instead, as the list had gotten big enough to be 
inconvenient to go through manually. The pull focuses on 
improving some of the phrasing in the DDoc comments.

https://github.com/blackwhale/gsoc-bench-2012/pull/2
Apr 29 2013
prev sibling next sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
Looks nice.

The glossary should be alphabetized.

"combining class" is defined under combiningClass - should it be in the
glossary 
instead?
Apr 29 2013
parent Dmitry Olshansky <dmitry.olsh gmail.com> writes:
30-Apr-2013 02:40, Walter Bright пишет:
 Looks nice.

Cool, this means we are getting there ;)
 The glossary should be alphabetized.

 "combining class" is defined under combiningClass - should it be in the
 glossary instead?

Good ideas. Linked combining class as everything else. -- Dmitry Olshansky
Apr 29 2013
prev sibling next sibling parent "Jonathan M Davis" <jmdavisProg gmx.com> writes:
On Tuesday, April 30, 2013 03:02:17 Dmitry Olshansky wrote:
 29-Apr-2013 22:50, Jonathan M Davis пишет:
 On Monday, April 29, 2013 22:13:09 Dmitry Olshansky wrote:
 Technically these should be in std.string and are there but incorrect.

Then fix them there.

I think it will take a certain amount of leaked implementation details to get it done at least half-decently. In particular to re-use the same case-folding table as for case-insensitive matching. Would be a strategic mistake IMHO to spread the internals across 2 modules. Then std.string could provide a public alias(es) as it imports std.uni anyway? Going further if we are to preserve the status quo then std.uni will declare them as package to not advertise their new location.

An alias would be one option. The primary issue I see is that the general design of the modules has been that std.ascii and std.uni operate on individual characters and not strings, whereas std.string operates on strings (generally going with the unicode versions of things when there's a choice rather than the ASCII ones). And having overloads in std.uni which operate on strings doesn't follow that organizational model. Usually, std.string has called the std.uni functions to do what it's needed, and no implementation details have been leaked. I haven't looked at what you've done with std.uni yet, so I don't know how well that would work. But toLower and toUpper are far from them only functions which would be affected by something like this (icmp would be another obvious one). But even if we decided to reorganize where some functionality or functions live, we shouldn't have std.uni replacing std.string functions while leaving the old functions in std.string. So, worst case, aliases should be used, but if at all reasonable, I'd prefer to keep the module organizational structure that we've had with regards to how characters and string functionality is organized. - Jonathan M Davis
Apr 29 2013
prev sibling next sibling parent "Jonathan M Davis" <jmdavisProg gmx.com> writes:
On Tuesday, April 30, 2013 15:13:14 Dmitry Olshansky wrote:
 Unicode --> can't be done on character by character basis

Sure it can. It operates on dchar. I don't know what you've done with std.uni, but the way things have been laid out is that std.ascii: Operates on dchars, doing ASCII operations, generally ignoring anything that's not ASCII. std.uni: Operates on dchars, doing Unicode operations. The operations are generally the same as what you'd do with ASCII only with Unicode instead. std.utf: Doing encoding related stuff with Unicode, so it operates on strings. std.string: Operates on strings (almost always using Unicode) So, with how it's been, std.uni would only be operating on dchars, and putting a function in there which operated on strings wouldn't make any sense. Maybe that doesn't work if you've done a bunch of grapheme stuff, and things will have to be adjusted, but it would be a definite shift to put anything in std.uni which operated on strings, and I think that it would need some definite justification (and there's a good chance that I'd be inclined to argue that it should still go in std.string, possibly using some internal modules if necessary). But clearly I need to take the time to take a look at what you've actually done (I keep meaning to but haven't gotten around to it yet). It had been my impression that what you were doing was primarily a matter of improving the implementation, but it sounds like you're doing something beyond that. - Jonathan M Davis
Apr 30 2013
prev sibling next sibling parent reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
28-Apr-2013 20:56, Jesse Phillips пишет:
 This is a replacement module for the current std.uni by Dmitry
 Olshansky. The std.uni module provides an implementation of fundamental
 Unicode algorithms and data structures.

 To use this module, install 2.63 beta, import uni; and not std.uni,
 compile two files from the source uni.d unicode_tables.d

 Docs:
 http://blackwhale.github.io/phobos/uni.html

 Source:
 https://github.com/blackwhale/gsoc-bench-2012

 DMD Beta:
 http://forum.dlang.org/post/517C8552.7040704 digitalmars.com

 It should be noted that inclusion into Phobos may require addressing
 inter-dependencies, see "Reducing the inter-dependencies"
 http://forum.dlang.org/post/kl8hn8$bm3$1 digitalmars.com

We have only one week for review left so I'd like to sort out the last issues before we get to the voting. First to fill in on latest developments. With a bunch of ugly hacks I've managed to integrate new std.uni in my Phobos fork and it passes unittests for me now (on win32 at least). See it hanging there and waiting to be destroyed by the pull tester: https://github.com/D-Programming-Language/phobos/pull/1289 Remaining issues that I'm aware of: - proper toLower/toUpper (current one is simplified codepoint-for-codepoint) - clean up the debris after crush-landing back into Phobos, revert some unrelated changes etc. Please take time to make that list grow, esp w.r.t interface choices and the code itself. Plus separately I'd need to remove rudimentary versions of the same data-structures used in std.regex and rewire it to use the new std.uni. There are few bugs and issues uncovered during integration that I wish to get feedback on. std.string has a bogus test for toLower: Of the very few tests being done 2 are very special corner case around \u0130 which is I with dot and is expected to be lowercased to i. But it's *not* supposed to - this conversion is specific to Turk(?) locale (=tailoring). What should happen is unfolding it to 2-codepoint sequence 'i' and 'dot-above' (this is in works). I just hope nobody depends on these particular conversions and I am wondering who's put them there in the first place. std.json is another thing - 0x7F somehow is specifically tested as being accepted as part of string literal. Yet ECMA script docs clearly state that Unicode control characters are to be stripped even before lexing (ignored even in literals). P.S. Someday I need to track down and file about 2 (or 3?) distinct compiler bugs (fwd-ref hell, private alias hijacking) that I worked around while getting there. Another one has a fix already (thanks, Kenji): http://d.puremagic.com/issues/show_bug.cgi?id=10067 -- Dmitry Olshansky
May 12 2013
parent Jacob Carlborg <doob me.com> writes:
On 2013-05-12 22:06, Dmitry Olshansky wrote:

 std.json is another thing - 0x7F somehow is specifically tested as being
 accepted as part of string literal. Yet ECMA script docs clearly state
 that Unicode control characters are to be stripped even before lexing
 (ignored even in literals).

That is my fault. There's a unit test for std.net.isemail that contains a 0x7F. std.json is somehow used in the auto tester. std.net.isemail broke the auto tester because of this. I haven't seen any JSON parser that doesn't accept 0x7F in a string literal. -- /Jacob Carlborg
May 13 2013
prev sibling next sibling parent reply "Jesse Phillips" <Jesse.K.Phillips+D gmail.com> writes:
I've found a difference from the existing  std.uni.

Compile-time execution of toUpper/toLower does not work.

import uni;

void main() {
     enum a = 'm';

     static assert(a.toUpper == 'M');
     static assert(a.toLower == 'm');
}

This never really was part of the contract in std.uni, but is 
probably a good idea to continue supporting it.
May 18 2013
parent reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
18-May-2013 21:15, Jesse Phillips пишет:
 I've found a difference from the existing  std.uni.

 Compile-time execution of toUpper/toLower does not work.

 import uni;

 void main() {
      enum a = 'm';

      static assert(a.toUpper == 'M');
      static assert(a.toLower == 'm');
 }

 This never really was part of the contract in std.uni, but is probably a
 good idea to continue supporting it.

Will take a stub at it. -- Dmitry Olshansky
May 19 2013
parent Dmitry Olshansky <dmitry.olsh gmail.com> writes:
19-May-2013 13:08, Dmitry Olshansky пишет:
 18-May-2013 21:15, Jesse Phillips пишет:
 I've found a difference from the existing  std.uni.

 Compile-time execution of toUpper/toLower does not work.

 import uni;

 void main() {
      enum a = 'm';

      static assert(a.toUpper == 'M');
      static assert(a.toLower == 'm');
 }

 This never really was part of the contract in std.uni, but is probably a
 good idea to continue supporting it.

Will take a stub at it.

Was already fixed in my Phobos fork. I must have hit it during the integration process. Will re-sync the standalone version then. This also works: enum a = 'я'; static assert(a.toUpper == 'Я'); static assert(a.toLower == 'я'); -- Dmitry Olshansky
May 19 2013
prev sibling parent "Jakob Ovrum" <jakobovrum gmail.com> writes:
On Sunday, 28 April 2013 at 16:56:25 UTC, Jesse Phillips wrote:
 Docs:
 http://blackwhale.github.io/phobos/uni.html

Reminder -> remainder.
May 21 2013