digitalmars.D - Formal Review of std.uni

Jesse Phillips (19/19) Apr 28 2013 First off, Dconf is this next weekend and effects the schedule of

Jacob Carlborg (8/15) Apr 28 2013 Two minor things:

Dmitry Olshansky (9/23) Apr 29 2013 Technically these should be in std.string and are there but incorrect.

Jonathan M Davis (3/4) Apr 29 2013 Then fix them there.

Dmitry Olshansky (11/15) Apr 29 2013 I think it will take a certain amount of leaked implementation details

Jonathan M Davis (19/36) Apr 29 2013 An alias would be one option. The primary issue I see is that the genera...

Dmitry Olshansky (5/40) Apr 30 2013 Aliases it is then.

Jonathan M Davis (22/23) Apr 30 2013 Sure it can. It operates on dchar. I don't know what you've done with st...

Dmitry Olshansky (22/37) May 12 2013 Getting back to this.

Brian Schott (5/5) Apr 29 2013 I was working on a list of suggestions, but turned it into a pull
Walter Bright (4/4) Apr 29 2013 Looks nice.

Dmitry Olshansky (5/9) Apr 29 2013 Good ideas. Linked combining class as everything else.

Dmitry Olshansky (37/51) May 12 2013 We have only one week for review left so I'd like to sort out the last

Jacob Carlborg (7/11) May 13 2013 That is my fault. There's a unit test for std.net.isemail that contains

Jesse Phillips (10/10) May 18 2013 I've found a difference from the existing std.uni.

Dmitry Olshansky (4/14) May 19 2013 Will take a stub at it.

Dmitry Olshansky (9/26) May 19 2013 Was already fixed in my Phobos fork. I must have hit it during the

Jakob Ovrum (2/4) May 21 2013 Reminder -> remainder.

"Jesse Phillips" <Jesse.K.Phillips+D gmail.com> writes:

First off, Dconf is this next weekend and effects the schedule of 
this review. Review will be held for 3 weeks, instead of holding 
off a week I'm extending the period and starting the review now. 
(Dmitry may be unable to respond due to his being a speaker)

This is a replacement module for the current std.uni by Dmitry 
Olshansky. The std.uni module provides an implementation of 
fundamental Unicode algorithms and data structures.

To use this module, install 2.63 beta, import uni; and not 
std.uni, compile two files from the source uni.d unicode_tables.d

Docs:
http://blackwhale.github.io/phobos/uni.html

Source:
https://github.com/blackwhale/gsoc-bench-2012

DMD Beta:
http://forum.dlang.org/post/517C8552.7040704 digitalmars.com

It should be noted that inclusion into Phobos may require 
addressing inter-dependencies, see "Reducing the 
inter-dependencies"
http://forum.dlang.org/post/kl8hn8$bm3$1 digitalmars.com

Apr 28 2013

Jacob Carlborg <doob me.com> writes:

On 2013-04-28 18:56, Jesse Phillips wrote:
 First off, Dconf is this next weekend and effects the schedule of this
 review. Review will be held for 3 weeks, instead of holding off a week
 I'm extending the period and starting the review now. (Dmitry may be
 unable to respond due to his being a speaker)

 This is a replacement module for the current std.uni by Dmitry
 Olshansky. The std.uni module provides an implementation of fundamental
 Unicode algorithms and data structures.

Two minor things:

* The docs for "decomposeHangul" contains a ddoc macro

* "toLower" and "toUpper" mention overloads which take strings instead 
of a dchar. I cannot find those overloads in the std.uni module. If they 
refer to another module it should say so

-- 
/Jacob Carlborg

Apr 28 2013

Dmitry Olshansky <dmitry.olsh gmail.com> writes:

29-Apr-2013 10:45, Jacob Carlborg пишет:
 On 2013-04-28 18:56, Jesse Phillips wrote:
 First off, Dconf is this next weekend and effects the schedule of this
 review. Review will be held for 3 weeks, instead of holding off a week
 I'm extending the period and starting the review now. (Dmitry may be
 unable to respond due to his being a speaker)

 This is a replacement module for the current std.uni by Dmitry
 Olshansky. The std.uni module provides an implementation of fundamental
 Unicode algorithms and data structures.

 Two minor things:

 * The docs for "decomposeHangul" contains a ddoc macro

Thanks, fixed that and the same typo elsewhere.

 * "toLower" and "toUpper" mention overloads which take strings instead
 of a dchar. I cannot find those overloads in the std.uni module. If they
 refer to another module it should say so

Technically these should be in std.string and are there but incorrect. 
Plus the impossible to implement toLowerInPlace/toUpperInPlace since 
length may change during case-conversion.

Darn... I got to implement them in std.uni then. Not much work given 
that all of pieces are already here.

-- 
Dmitry Olshansky

Apr 29 2013

"Jonathan M Davis" <jmdavisProg gmx.com> writes:

On Monday, April 29, 2013 22:13:09 Dmitry Olshansky wrote:
 Technically these should be in std.string and are there but incorrect.

Then fix them there.

- Jonathan M Davis

Apr 29 2013

Dmitry Olshansky <dmitry.olsh gmail.com> writes:

29-Apr-2013 22:50, Jonathan M Davis пишет:
 On Monday, April 29, 2013 22:13:09 Dmitry Olshansky wrote:
 Technically these should be in std.string and are there but incorrect.

 Then fix them there.

I think it will take a certain amount of leaked implementation details 
to get it done at least half-decently. In particular to re-use the same 
case-folding table as for case-insensitive matching.

Would be a strategic mistake IMHO to spread the internals across 2 modules.

Then std.string could provide a public alias(es) as it imports std.uni 
anyway?

Going further if we are to preserve the status quo then std.uni will 
declare them as package to not advertise their new location.

 - Jonathan M Davis



-- 
Dmitry Olshansky

Apr 29 2013

"Jonathan M Davis" <jmdavisProg gmx.com> writes:

On Tuesday, April 30, 2013 03:02:17 Dmitry Olshansky wrote:
 29-Apr-2013 22:50, Jonathan M Davis пишет:
 On Monday, April 29, 2013 22:13:09 Dmitry Olshansky wrote:
 Technically these should be in std.string and are there but incorrect.

 
 Then fix them there.

 
 I think it will take a certain amount of leaked implementation details
 to get it done at least half-decently. In particular to re-use the same
 case-folding table as for case-insensitive matching.
 
 Would be a strategic mistake IMHO to spread the internals across 2 modules.
 
 Then std.string could provide a public alias(es) as it imports std.uni
 anyway?
 
 Going further if we are to preserve the status quo then std.uni will
 declare them as package to not advertise their new location.

An alias would be one option. The primary issue I see is that the general 
design of the modules has been that std.ascii and std.uni operate on 
individual characters and not strings, whereas std.string operates on strings 
(generally going with the unicode versions of things when there's a choice 
rather than the ASCII ones). And having overloads in std.uni which operate on 
strings doesn't follow that organizational model. Usually, std.string has 
called the std.uni functions to do what it's needed, and no implementation 
details have been leaked. I haven't looked at what you've done with std.uni 
yet, so I don't know how well that would work. But toLower and toUpper are far 
from them only functions which would be affected by something like this (icmp 
would be another obvious one).

But even if we decided to reorganize where some functionality or functions 
live, we shouldn't have std.uni replacing std.string functions while leaving 
the old functions in std.string. So, worst case, aliases should be used, but 
if at all reasonable, I'd prefer to keep the module organizational structure 
that we've had with regards to how characters and string functionality is 
organized.

- Jonathan M Davis

Apr 29 2013

Dmitry Olshansky <dmitry.olsh gmail.com> writes:

30-Apr-2013 04:12, Jonathan M Davis пишет:
 On Tuesday, April 30, 2013 03:02:17 Dmitry Olshansky wrote:
 29-Apr-2013 22:50, Jonathan M Davis пишет:
 On Monday, April 29, 2013 22:13:09 Dmitry Olshansky wrote:
 Technically these should be in std.string and are there but incorrect.

 Then fix them there.

 I think it will take a certain amount of leaked implementation details
 to get it done at least half-decently. In particular to re-use the same
 case-folding table as for case-insensitive matching.

 Would be a strategic mistake IMHO to spread the internals across 2 modules.

 Then std.string could provide a public alias(es) as it imports std.uni
 anyway?

 Going further if we are to preserve the status quo then std.uni will
 declare them as package to not advertise their new location.

 An alias would be one option. The primary issue I see is that the general
 design of the modules has been that std.ascii and std.uni operate on
 individual characters and not strings, whereas std.string operates on strings
 (generally going with the unicode versions of things when there's a choice
 rather than the ASCII ones). And having overloads in std.uni which operate on
 strings doesn't follow that organizational model. Usually, std.string has
 called the std.uni functions to do what it's needed, and no implementation
 details have been leaked. I haven't looked at what you've done with std.uni
 yet, so I don't know how well that would work. But toLower and toUpper are far
 from them only functions which would be affected by something like this (icmp
 would be another obvious one).

Unicode --> can't be done on character by character basis

 But even if we decided to reorganize where some functionality or functions
 live, we shouldn't have std.uni replacing std.string functions while leaving
 the old functions in std.string. So, worst case, aliases should be used, but
 if at all reasonable, I'd prefer to keep the module organizational structure
 that we've had with regards to how characters and string functionality is
 organized.

Aliases it is then.

-- 
Dmitry Olshansky

Apr 30 2013

"Jonathan M Davis" <jmdavisProg gmx.com> writes:

On Tuesday, April 30, 2013 15:13:14 Dmitry Olshansky wrote:
 Unicode --> can't be done on character by character basis

Sure it can. It operates on dchar. I don't know what you've done with std.uni, 
but the way things have been laid out is that

std.ascii: Operates on dchars, doing ASCII operations, generally ignoring 
anything that's not ASCII.

std.uni: Operates on dchars, doing Unicode operations. The operations are 
generally the same as what you'd do with ASCII only with Unicode instead.

std.utf: Doing encoding related stuff with Unicode, so it operates on strings.

std.string: Operates on strings (almost always using Unicode)

So, with how it's been, std.uni would only be operating on dchars, and putting 
a function in there which operated on strings wouldn't make any sense. Maybe 
that doesn't work if you've done a bunch of grapheme stuff, and things will 
have to be adjusted, but it would be a definite shift to put anything in 
std.uni which operated on strings, and I think that it would need some definite 
justification (and there's a good chance that I'd be inclined to argue that it 
should still go in std.string, possibly using some internal modules if 
necessary).

But clearly I need to take the time to take a look at what you've actually 
done (I keep meaning to but haven't gotten around to it yet). It had been my 
impression that what you were doing was primarily a matter of improving the 
implementation, but it sounds like you're doing something beyond that.

- Jonathan M Davis

Apr 30 2013

Dmitry Olshansky <dmitry.olsh gmail.com> writes:

30-Apr-2013 23:17, Jonathan M Davis пишет:
 On Tuesday, April 30, 2013 15:13:14 Dmitry Olshansky wrote:
 Unicode --> can't be done on character by character basis

 Sure it can. It operates on dchar.

Getting back to this.

Sure it can't - I'd hate to break the illusion but the keyword is e.g. 
Unicode Case Folding. Another one is Combining Character sequence.

 So, with how it's been, std.uni would only be operating on dchars, and putting
 a function in there which operated on strings wouldn't make any sense. Maybe
 that doesn't work if you've done a bunch of grapheme stuff, and things will
 have to be adjusted, but it would be a definite shift to put anything in
 std.uni which operated on strings, and I think that it would need some definite
 justification (and there's a good chance that I'd be inclined to argue that it
 should still go in std.string, possibly using some internal modules if
 necessary).

Justification is that we'd rather have exactly one module dealing with a 
bunch of Unicode data arranged into intricate tables.

Strictly speaking I'd abolish any Unicode related algorithm in 
std.string since it's almost definitely doing it wrong anyway (I've 
checked only 2 - both broken).

There is not a single sign of unicode standards used, just the 
fallacious logic: byte --> dchar and use the same algorithm as with 
ASCII. It won't work.

 But clearly I need to take the time to take a look at what you've actually
 done (I keep meaning to but haven't gotten around to it yet). It had been my
 impression that what you were doing was primarily a matter of improving the
 implementation, but it sounds like you're doing something beyond that.

Take a peek at icmp and sicmp in new std.uni.
Current fork of Phobos is here:
https://github.com/blackwhale/phobos/tree/new-std-uni

Eventually we'd have to do a bit more in the same direction e.g. title 
casing, split by word boundary etc. (all of these need fixing in 
std.string).

Also all of the core tools are now in the open: CodepointSet, and 
generating Tries from sets and AA-s.


-- 
Dmitry Olshansky

May 12 2013

"Brian Schott" <briancschott gmail.com> writes:

I was working on a list of suggestions, but turned it into a pull 
request instead, as the list had gotten big enough to be 
inconvenient to go through manually. The pull focuses on 
improving some of the phrasing in the DDoc comments.

https://github.com/blackwhale/gsoc-bench-2012/pull/2

Apr 29 2013

Walter Bright <newshound2 digitalmars.com> writes:

Looks nice.

The glossary should be alphabetized.

"combining class" is defined under combiningClass - should it be in the
glossary 
instead?

Apr 29 2013

Dmitry Olshansky <dmitry.olsh gmail.com> writes:

30-Apr-2013 02:40, Walter Bright пишет:
 Looks nice.

Cool, this means we are getting there ;)

 The glossary should be alphabetized.

 "combining class" is defined under combiningClass - should it be in the
 glossary instead?

Good ideas. Linked combining class as everything else.

-- 
Dmitry Olshansky

Apr 29 2013

Dmitry Olshansky <dmitry.olsh gmail.com> writes:

28-Apr-2013 20:56, Jesse Phillips пишет:
 This is a replacement module for the current std.uni by Dmitry
 Olshansky. The std.uni module provides an implementation of fundamental
 Unicode algorithms and data structures.

 To use this module, install 2.63 beta, import uni; and not std.uni,
 compile two files from the source uni.d unicode_tables.d

 Docs:
 http://blackwhale.github.io/phobos/uni.html

 Source:
 https://github.com/blackwhale/gsoc-bench-2012

 DMD Beta:
 http://forum.dlang.org/post/517C8552.7040704 digitalmars.com

 It should be noted that inclusion into Phobos may require addressing
 inter-dependencies, see "Reducing the inter-dependencies"
 http://forum.dlang.org/post/kl8hn8$bm3$1 digitalmars.com

We have only one week for review left so I'd like to sort out the last 
issues before we get to the voting.

First to fill in on latest developments.
With a bunch of ugly hacks I've managed to integrate new std.uni in my 
Phobos fork and it passes unittests for me now (on win32 at least).

See it hanging there and waiting to be destroyed by the pull tester:
https://github.com/D-Programming-Language/phobos/pull/1289

Remaining issues that I'm aware of:
- proper toLower/toUpper (current one is simplified codepoint-for-codepoint)
- clean up the debris after crush-landing back into Phobos, revert some 
unrelated changes etc.

Please take time to make that list grow, esp w.r.t interface choices and 
the code itself.

Plus separately I'd need to remove rudimentary versions of the same 
data-structures used in std.regex and rewire it to use the new std.uni.

There are few bugs and issues uncovered during integration that I wish 
to get feedback on.

std.string has a bogus test for toLower:
Of the very few tests being done 2 are very special corner case around 
\u0130 which is I with dot and is expected to be lowercased to i.
But it's *not* supposed to - this conversion is specific to Turk(?) 
locale (=tailoring). What should happen is unfolding it to 2-codepoint 
sequence 'i' and 'dot-above' (this is in works).

I just hope nobody depends on these particular conversions and I am 
wondering who's put them there in the first place.

std.json is another thing - 0x7F somehow is specifically tested as being 
accepted as part of string literal. Yet ECMA script docs clearly state 
that Unicode control characters are to be stripped even before lexing 
(ignored even in literals).

P.S. Someday I need to track down and file about 2 (or 3?) distinct 
compiler bugs (fwd-ref hell, private alias hijacking) that I worked 
around while getting there.
Another one has a fix already (thanks, Kenji):
http://d.puremagic.com/issues/show_bug.cgi?id=10067

-- 
Dmitry Olshansky

May 12 2013

Jacob Carlborg <doob me.com> writes:

On 2013-05-12 22:06, Dmitry Olshansky wrote:

 std.json is another thing - 0x7F somehow is specifically tested as being
 accepted as part of string literal. Yet ECMA script docs clearly state
 that Unicode control characters are to be stripped even before lexing
 (ignored even in literals).

That is my fault. There's a unit test for std.net.isemail that contains 
a 0x7F. std.json is somehow used in the auto tester. std.net.isemail 
broke the auto tester because of this. I haven't seen any JSON parser 
that doesn't accept 0x7F in a string literal.

-- 
/Jacob Carlborg

May 13 2013

"Jesse Phillips" <Jesse.K.Phillips+D gmail.com> writes:

I've found a difference from the existing  std.uni.

Compile-time execution of toUpper/toLower does not work.

import uni;

void main() {
     enum a = 'm';

     static assert(a.toUpper == 'M');
     static assert(a.toLower == 'm');
}

This never really was part of the contract in std.uni, but is 
probably a good idea to continue supporting it.

May 18 2013

Dmitry Olshansky <dmitry.olsh gmail.com> writes:

18-May-2013 21:15, Jesse Phillips пишет:
 I've found a difference from the existing  std.uni.

 Compile-time execution of toUpper/toLower does not work.

 import uni;

 void main() {
      enum a = 'm';

      static assert(a.toUpper == 'M');
      static assert(a.toLower == 'm');
 }

 This never really was part of the contract in std.uni, but is probably a
 good idea to continue supporting it.

Will take a stub at it.

-- 
Dmitry Olshansky

May 19 2013

Dmitry Olshansky <dmitry.olsh gmail.com> writes:

19-May-2013 13:08, Dmitry Olshansky пишет:
 18-May-2013 21:15, Jesse Phillips пишет:
 I've found a difference from the existing  std.uni.

 Compile-time execution of toUpper/toLower does not work.

 import uni;

 void main() {
      enum a = 'm';

      static assert(a.toUpper == 'M');
      static assert(a.toLower == 'm');
 }

 This never really was part of the contract in std.uni, but is probably a
 good idea to continue supporting it.

 Will take a stub at it.

Was already fixed in my Phobos fork. I must have hit it during the 
integration process. Will re-sync the standalone version then.

This also works:

     enum a = 'я';

     static assert(a.toUpper == 'Я');
     static assert(a.toLower == 'я');

-- 
Dmitry Olshansky

May 19 2013

"Jakob Ovrum" <jakobovrum gmail.com> writes:

On Sunday, 28 April 2013 at 16:56:25 UTC, Jesse Phillips wrote:
 Docs:
 http://blackwhale.github.io/phobos/uni.html

Reminder -> remainder.

May 21 2013

D Programming

C/C++ Programming

Other

digitalmars.D - Formal Review of std.uni