www.digitalmars.com         C & C++   DMDScript  

digitalmars.D - Ready for review: new std.uni

reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
It's been long over due to present the work I did during the last GSOC 
as it was a summer not winter of code after all. Unfortunately some 
compiler bugs, a new job :) and unrelated events of importance have 
postponed its release beyond measure.

Anyway it's polished and ready for the good old collective destruction 
called peer review. I'm looking for a review manager.

The code, including extra tests and a benchmark is here:
https://github.com/blackwhale/gsoc-bench-2012

And documentation:
http://blackwhale.github.com/phobos/uni.html

And with that I'm asking if there is anything ready to be reviewed in 
the queue?

Currently I prefer to keep it standalone so read 'uni' everywhere as 
'std.uni'. On the plus side it makes it easy to try new 'uni' and 
compare with the old one without re-compiling the Phobos.

To use in place of std.uni replace 'std.uni'->'uni' in your programs and 
compare the results. Just make sure both uni and unicode_tables modules 
are linked in, usually rdmd can take care of this dependency.

The list of new functionality is quite large so I'll point out the major 
sections and let the rest to the documentation.

In general there are 3 angles to the new std.uni:

1) The same stuff but better and faster. For one thing isXXX 
classification functions are brought up to date with Unicode 6.2 and are 
up to 5-6x times faster on non-ASCII text.

2) The commonly expected stuff in any modern Unicode-aware language:
normalization, grapheme decoding, composition/decomposition and 
case-insensitive comparison*.

3) I've taken it as a crucial point to provide all of the tools used to 
build Unicode algorithms from ground up to the end user.
Thus all generally useful data structures used to implement the library 
internals are accessible for 'mortals' too:
  - a type for manipulating sets of codepoints, with full set algebra
  - a construction for generating fast multi-stage lookup tables (Trie)
  - a ton of predefined sets to construct your own specific ones from

There is an extra candy for meta-programming text processing libraries: 
a set type can generate source code of unrolled hard-coded binary search 
for a given set of codepoints.

Among other things the entire collection of data required is generated 
automatically by downloading from unicode.org. The tool relies on the 
same foundation (3) and for the most part this version of std.uni should 
be trivially updated to the new versions of the standard (see gen_uni.d 
script).

* The only missing 'big' thing is the collation algorithm. At this point 
I'm proposing to just move the large chunk of new std.uni in place. This 
way potential contributors would have tools to implement missing bits 
later on.

P.S. CodepointSet could be easily adjusted to serve as generic integer 
set type and Trie already supports far more the codepoints->values 
mappings. These should probably be enhanced and later adopted to 
std.container(2).

-- 
Dmitry Olshansky
Jan 11 2013
next sibling parent reply "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Fri, Jan 11, 2013 at 11:31:11PM +0400, Dmitry Olshansky wrote:
[...]
 Anyway it's polished and ready for the good old collective
 destruction called peer review. I'm looking for a review manager.
 
 The code, including extra tests and a benchmark is here:
 https://github.com/blackwhale/gsoc-bench-2012
 
 And documentation:
 http://blackwhale.github.com/phobos/uni.html
Excellent!! It looks pretty impressive. We should definitely try out best to get this into Phobos. [...]
 In general there are 3 angles to the new std.uni:
 
 1) The same stuff but better and faster. For one thing isXXX
 classification functions are brought up to date with Unicode 6.2 and
 are up to 5-6x times faster on non-ASCII text.
And there is a tool for auto-updating the data, correct? So in theory, barring major structural changes in the standard, we should be able to recompile Phobos (maybe after running the update tool) and it should be automatically updated to the latest Unicode release?
 2) The commonly expected stuff in any modern Unicode-aware language:
 normalization, grapheme decoding, composition/decomposition and
 case-insensitive comparison*.

 3) I've taken it as a crucial point to provide all of the tools used
 to build Unicode algorithms from ground up to the end user.
Are there enough tools to implement line-breaking algorithms (TR14)?
 Thus all generally useful data structures used to implement the
 library internals are accessible for 'mortals' too:
  - a type for manipulating sets of codepoints, with full set algebra
  - a construction for generating fast multi-stage lookup tables (Trie)
  - a ton of predefined sets to construct your own specific ones from
I looked over the examples in the docs. Looks awesome! One question though: how do I check a character in a specific unicode category (say Ps, but not Pd, Pe, etc.)? I didn't see a function for the specific category, and I assume it's overkill to have one function per category, so I presume it should be possible to do this using the codepoint sets? Can you add an example of this to the docs, if this is possible? [...]
 Among other things the entire collection of data required is
 generated automatically by downloading from unicode.org. The tool
 relies on the same foundation (3) and for the most part this version
 of std.uni should be trivially updated to the new versions of the
 standard (see gen_uni.d script).
Excellent!
 * The only missing 'big' thing is the collation algorithm. At this
 point I'm proposing to just move the large chunk of new std.uni in
 place. This way potential contributors would have tools to implement
 missing bits later on.
But the aforementioned tools should be enough for someone to use as the basis for implementing the collation algorithm right?
 P.S. CodepointSet could be easily adjusted to serve as generic
 integer set type and Trie already supports far more the
 codepoints->values mappings. These should probably be enhanced and
 later adopted to std.container(2).
[...] Yes, having an efficient integer set type and generic Trie type would be a huge plus to add to std.container. Although, isn't Andrei supposed to be working on a new std.container, with a better range-based API and custom allocator scheme? If that's not going to happen for a while yet, I say we should merge std.uni first, then worry about factoring Trie and int set later (we can always just alias them in std.uni later when we move them, that should prevent user code breakage). Now, some nitpicks: - InversionList: - I don't like using ~ to mean symmetric set difference, because ~ already means concatenation in D, and it's confusing to overload it with an unrelated meaning. I propose using ^ instead, because symmetric set difference is analogous to XOR. - Why is the table of operators in opBinary's doc duplicated twice? Is the second table supposed to be something else? - We should at least briefly describe what the various Unicode normalization forms mean (yes I know they can be looked up on unicode.org and various other online resources, but having them right there in the docs makes the docs so much more helpful). - Why are the deprecated functions still in there? The deprecation messages say "It will be removed in August 2012", which is already past. So they should be removed by now. - Is there a reason there's a composeJamo / hangulDecompose, but no other writing system specific functions (such as functions to deal with Thai consonant/vowel compositions)? Is this a peculiarity of the Unicode standard? Or should there be more general functions that handle composition/decomposition of compound characters? - Why are many of the category checking functions system rather than safe? It would seem rather crippling to me if much of std.uni can't be used from safe code! - It would also be nice if these functions can be made pure (I'm not sure I understand why checking the category of a character should be impure). The lack of nothrow I can understand, since the input character may be illegal or otherwise malformed. But nothrow pure seems to me to be necessary for all category-checking functions. T -- The volume of a pizza of thickness a and radius z can be described by the following formula: pi zz a. -- Wouter Verhelst
Jan 11 2013
parent reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
12-Jan-2013 00:35, H. S. Teoh пишет:
 On Fri, Jan 11, 2013 at 11:31:11PM +0400, Dmitry Olshansky wrote:
 [...]
 Anyway it's polished and ready for the good old collective
 destruction called peer review. I'm looking for a review manager.

 The code, including extra tests and a benchmark is here:
 https://github.com/blackwhale/gsoc-bench-2012

 And documentation:
 http://blackwhale.github.com/phobos/uni.html
Excellent!! It looks pretty impressive. We should definitely try out best to get this into Phobos. [...]
 In general there are 3 angles to the new std.uni:

 1) The same stuff but better and faster. For one thing isXXX
 classification functions are brought up to date with Unicode 6.2 and
 are up to 5-6x times faster on non-ASCII text.
And there is a tool for auto-updating the data, correct? So in theory, barring major structural changes in the standard, we should be able to recompile Phobos (maybe after running the update tool) and it should be automatically updated to the latest Unicode release?
Yup, except they tweak the algorithms all along. The normalization for instance is full of erratas and minor notes accumulated over time. I'll probably do a write up on how to implement it and where to find all the related info.
 2) The commonly expected stuff in any modern Unicode-aware language:
 normalization, grapheme decoding, composition/decomposition and
 case-insensitive comparison*.

 3) I've taken it as a crucial point to provide all of the tools used
 to build Unicode algorithms from ground up to the end user.
Are there enough tools to implement line-breaking algorithms (TR14)?
Should be. AFAIK _word_ breaking seemed trivial to do, got to check the line breaking.
 Thus all generally useful data structures used to implement the
 library internals are accessible for 'mortals' too:
   - a type for manipulating sets of codepoints, with full set algebra
   - a construction for generating fast multi-stage lookup tables (Trie)
   - a ton of predefined sets to construct your own specific ones from
I looked over the examples in the docs. Looks awesome! One question though: how do I check a character in a specific unicode category (say Ps, but not Pd, Pe, etc.)? I didn't see a function for the specific category, and I assume it's overkill to have one function per category, so I presume it should be possible to do this using the codepoint sets? Can you add an example of this to the docs, if this is possible?
auto psSet = unicode.Ps; assert(psSet['(']); //should pass
 [...]
 Among other things the entire collection of data required is
 generated automatically by downloading from unicode.org. The tool
 relies on the same foundation (3) and for the most part this version
 of std.uni should be trivially updated to the new versions of the
 standard (see gen_uni.d script).
Excellent!
 * The only missing 'big' thing is the collation algorithm. At this
 point I'm proposing to just move the large chunk of new std.uni in
 place. This way potential contributors would have tools to implement
 missing bits later on.
But the aforementioned tools should be enough for someone to use as the basis for implementing the collation algorithm right?
I sure hope so. I've been looking through all of TRs and found nothing surprising beyond more and more tricky codepoint sets and more and more wierd rules to use these.
 P.S. CodepointSet could be easily adjusted to serve as generic
 integer set type and Trie already supports far more the
 codepoints->values mappings. These should probably be enhanced and
 later adopted to std.container(2).
[...] Yes, having an efficient integer set type and generic Trie type would be a huge plus to add to std.container. Although, isn't Andrei supposed to be working on a new std.container, with a better range-based API and custom allocator scheme? If that's not going to happen for a while yet, I say we should merge std.uni first, then worry about factoring Trie and int set later (we can always just alias them in std.uni later when we move them, that should prevent user code breakage). Now, some nitpicks:
Thanks for early feedback.
 - InversionList:

     - I don't like using ~ to mean symmetric set difference, because ~
       already means concatenation in D, and it's confusing to overload it
       with an unrelated meaning. I propose using ^ instead, because
       symmetric set difference is analogous to XOR.
Point taken, arguably '~' was a bad idea. But I don't like ^ either. Seems like it's the best candidate though.
     - Why is the table of operators in opBinary's doc duplicated twice?
       Is the second table supposed to be something else?
Beats me. Must be some glitch with BOOKTABLE macro (or how I use it). I think I've hit it before.
 - We should at least briefly describe what the various Unicode
    normalization forms mean (yes I know they can be looked up on
    unicode.org and various other online resources, but having them right
    there in the docs makes the docs so much more helpful).
 - Why are the deprecated functions still in there? The deprecation
    messages say "It will be removed in August 2012", which is already
    past. So they should be removed by now.
Will do before Walter chimes in with his motto of keeping clutter just in case.
 - Is there a reason there's a composeJamo / hangulDecompose, but no
    other writing system specific functions (such as functions to deal
    with Thai consonant/vowel compositions)? Is this a peculiarity of the
    Unicode standard? Or should there be more general functions that
    handle composition/decomposition of compound characters?
Hangul are huge (11K+) and they have a specific algorithmic decomposition rule. Also they are hell of a special case in Unicode standard (that is mentioned in one chapter only!). Everything else is table-driven.
 - Why are many of the category checking functions  system rather than
     safe? It would seem rather crippling to me if much of std.uni can't
    be used from  safe code!
system? Well, these got to be either trusted or safe.
 - It would also be nice if these functions can be made pure (I'm not
    sure I understand why checking the category of a character should be
    impure).
Global sophisticated immutable table. I think this can be pure but the compiler might have disagreed.
  The lack of nothrow I can understand, since the input
    character may be illegal or otherwise malformed.
nothrow is probably doable as any codepoint will do. (and if it's > dchar.max then it's a problem somewhere else). TBH I've killed these qualifiers early on as it prevented the thing to compile. I can't recall is there is still a problem with any of theses.
 But  nothrow pure
    seems to me to be necessary for all category-checking functions.
-- Dmitry Olshansky
Jan 11 2013
parent reply "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Sat, Jan 12, 2013 at 01:06:30AM +0400, Dmitry Olshansky wrote:
 12-Jan-2013 00:35, H. S. Teoh пишет:
On Fri, Jan 11, 2013 at 11:31:11PM +0400, Dmitry Olshansky wrote:
[...]
Anyway it's polished and ready for the good old collective
destruction called peer review. I'm looking for a review manager.

The code, including extra tests and a benchmark is here:
https://github.com/blackwhale/gsoc-bench-2012

And documentation:
http://blackwhale.github.com/phobos/uni.html
[...]
2) The commonly expected stuff in any modern Unicode-aware language:
normalization, grapheme decoding, composition/decomposition and
case-insensitive comparison*.

3) I've taken it as a crucial point to provide all of the tools used
to build Unicode algorithms from ground up to the end user.
Are there enough tools to implement line-breaking algorithms (TR14)?
Should be. AFAIK _word_ breaking seemed trivial to do, got to check the line breaking.
Line-breaking is in TR14. I looked over it. It's definitely more complicated than I expected it to be. Some alphabets may even require stylistic rules for line-breaking! (Not to mention hyphenation.) So it's probably too much to expect std.uni to do it. But at least the tools necessary to implement it should be there. [...]
One question though: how do I check a character in a specific unicode
category (say Ps, but not Pd, Pe, etc.)? I didn't see a function for the
specific category, and I assume it's overkill to have one function per
category, so I presume it should be possible to do this using the
codepoint sets? Can you add an example of this to the docs, if this is
possible?
auto psSet = unicode.Ps; assert(psSet['(']); //should pass
It would be nice to have a list of possible categories in the docs for unicode.opDispatch/opCall, at least, the common categories, with a hyperlink to unicode.org (or wherever) that provides the full list. Currently, it's unclear, beyond the single example given, of what the possibilities are. [...]
- We should at least briefly describe what the various Unicode
   normalization forms mean (yes I know they can be looked up on
   unicode.org and various other online resources, but having them right
   there in the docs makes the docs so much more helpful).
- Why are the deprecated functions still in there? The deprecation
   messages say "It will be removed in August 2012", which is already
   past. So they should be removed by now.
Will do before Walter chimes in with his motto of keeping clutter just in case.
In that case, I would keep the functions (aliases?) but remove them from DDoc. So old code will still silently work, but the functions will no longer be officially supported, there will be no documentation for them and they can disappear anytime.
- Is there a reason there's a composeJamo / hangulDecompose, but no
   other writing system specific functions (such as functions to deal
   with Thai consonant/vowel compositions)? Is this a peculiarity of
   the Unicode standard? Or should there be more general functions
   that handle composition/decomposition of compound characters?
Hangul are huge (11K+) and they have a specific algorithmic decomposition rule. Also they are hell of a special case in Unicode standard (that is mentioned in one chapter only!). Everything else is table-driven.
OK. Makes sense, I guess. Though it gave me the impression that std.uni has some undue bias for Korean text. ;-) [...]
- It would also be nice if these functions can be made pure (I'm not
   sure I understand why checking the category of a character should
   be impure).
Global sophisticated immutable table. I think this can be pure but the compiler might have disagreed.
Hmm. This sounds like a bug in purity, maybe? It seems clear to me that immutable values might as well be constants, and therefore using said values should not make the code impure, even if they are implemented as a global table. I think a bug should be filed for this.
 The lack of nothrow I can understand, since the input
   character may be illegal or otherwise malformed.
nothrow is probably doable as any codepoint will do. (and if it's > dchar.max then it's a problem somewhere else).
True, an illegal character by definition does not belong to any category, so all category functions should return false for it. No exceptions need to be thrown.
 TBH I've killed these qualifiers early on as it prevented the thing
 to compile. I can't recall is there is still a problem with any of
 theses.
[...] Before merging into Phobos, I would think we should put (as many of) the qualifiers back (as possible). Preferably everwhere they make sense, barring compiler bugs. T -- Государство делает вид, что платит нам зарплату, а мы делаем вид, что работаем.
Jan 11 2013
parent Dmitry Olshansky <dmitry.olsh gmail.com> writes:
12-Jan-2013 01:33, H. S. Teoh пишет:
 On Sat, Jan 12, 2013 at 01:06:30AM +0400, Dmitry Olshansky wrote:
 12-Jan-2013 00:35, H. S. Teoh пишет:
 On Fri, Jan 11, 2013 at 11:31:11PM +0400, Dmitry Olshansky wrote:
 [...]
 Anyway it's polished and ready for the good old collective
 destruction called peer review. I'm looking for a review manager.

 The code, including extra tests and a benchmark is here:
 https://github.com/blackwhale/gsoc-bench-2012

 And documentation:
 http://blackwhale.github.com/phobos/uni.html
[...]
 2) The commonly expected stuff in any modern Unicode-aware language:
 normalization, grapheme decoding, composition/decomposition and
 case-insensitive comparison*.

 3) I've taken it as a crucial point to provide all of the tools used
 to build Unicode algorithms from ground up to the end user.
Are there enough tools to implement line-breaking algorithms (TR14)?
Should be. AFAIK _word_ breaking seemed trivial to do, got to check the line breaking.
Line-breaking is in TR14. I looked over it. It's definitely more complicated than I expected it to be. Some alphabets may even require stylistic rules for line-breaking! (Not to mention hyphenation.) So it's probably too much to expect std.uni to do it. But at least the tools necessary to implement it should be there.
I had given it a practiced look of tired Unicode implementor. It has a lot of subtle moments to it but should be doable. I'm not about to implement it any time soon, but notes on how to do it are the following: 1. Avoid reading all of the reasons behind the algorithm listed there (there are good ones, a ton of them). So at first SKIP chapter 5 except maybe 5.3-5.4. And the fonts problems listed are not important to get it working. 2. Re-read the important chapters that is 2 and 6, 7. Others pretty much can be simply ignored until you start testing your implementation. 7 is actually pretty nice. 3. The first step to implement is to seek out the character classes (=sets) involved in algorithm. These are listed in Table.1 and the accompanying data file. 4. Process the file (see e.g. gen_uni.d) and create all of sets. 5. The core part of the algorithm itself is by the end of day not bad - section 6.1 lists all of the mandatory line breaking rules. Then there is a bunch of tweakable ones that are a bit harder. Plus see chapter 7 for a (logically) pair-wise table driven solution (though the row/cols in the table are "belongs to the set X"). -- Dmitry Olshansky
Jan 13 2013
prev sibling next sibling parent reply "David Nadlinger" <see klickverbot.at> writes:
On Friday, 11 January 2013 at 19:31:13 UTC, Dmitry Olshansky 
wrote:
 The code, including extra tests and a benchmark is here:
 https://github.com/blackwhale/gsoc-bench-2012
Phew, that repository is quite a hefty checkout with all the example data. Just for the fun of it, a completely irreproducible, unscientific benchmark (sorry, Andrei!) run on a Core i7 M 620 under Linux x86_64: DMD 2.061 --- $ rdmd -O -release -noboundscheck -inline char_class.d enwiki-latest-all-titles-in-ns0 Creating level 1 of Tries. Alpha:262144 Mark:262144 Number:262144 Symbol:262144 Creating level 2 of Tries. Alpha:9472 Mark:6656 Number:6656 Symbol:6400 Creating level 3 of Tries. Alpha:4000 Mark:2304 Number:2432 Symbol:2848 Creating level 4 of Tries. Alpha:3376 Mark:1976 Number:1624 Symbol:2096 Time to build dchar array: 2702 ms Baselines noop [enwiki], 217.335, 888.99 M/s new-std-alpha [enwiki], 1197.49, 161.34 M/s new-std-mark [enwiki], 6374.73, 30.31 M/s new-std-num [enwiki], 6290.89, 30.71 M/s new-std-sym [enwiki], 6500.93, 29.72 M/s new-std-white [enwiki], 822.245, 234.98 M/s combining class [enwiki], 5948.63, 32.48 M/s inv-alpha [enwiki], 10290, 18.78 M/s inv-mark [enwiki], 9229.2, 20.93 M/s inv-num [enwiki], 9046.48, 21.36 M/s inv-sym [enwiki], 9843.59, 19.63 M/s Tries of level 1 trie-alpha [enwiki], 2179.17, 88.66 M/s trie-mark [enwiki], 1912.54, 101.02 M/s trie-num [enwiki], 1924.92, 100.37 M/s trie-sym [enwiki], 1911.22, 101.09 M/s Tries of level 2 trie-alpha [enwiki], 4524.33, 42.70 M/s trie-mark [enwiki], 4276.92, 45.17 M/s trie-num [enwiki], 4308.42, 44.84 M/s trie-sym [enwiki], 4298.68, 44.95 M/s Tries of level 3 trie-alpha [enwiki], 6407.64, 30.15 M/s trie-mark [enwiki], 6150.8, 31.41 M/s trie-num [enwiki], 6095.96, 31.69 M/s trie-sym [enwiki], 6253.7, 30.89 M/s Tries of level 4 trie-alpha [enwiki], 8329.66, 23.20 M/s trie-mark [enwiki], 7993.11, 24.17 M/s trie-num [enwiki], 8040.94, 24.03 M/s trie-sym [enwiki], 8033.07, 24.05 M/s --- LDC (2.061 merge branch): --- $ rdmd --compiler=ldmd2 -O3 -release -noboundscheck -inline char_class.d enwiki-latest-all-titles-in-ns0 Alpha:262144 Mark:262144 Number:262144 Symbol:262144 Creating level 2 of Tries. Alpha:9472 Mark:6656 Number:6656 Symbol:6400 Creating level 3 of Tries. Alpha:4000 Mark:2304 Number:2432 Symbol:2848 Creating level 4 of Tries. Alpha:3376 Mark:1976 Number:1624 Symbol:2096 Time to build dchar array: 1962 ms Baselines noop [enwiki], 211.516, 913.44 M/s new-std-alpha [enwiki], 612.324, 315.53 M/s new-std-mark [enwiki], 577.134, 334.77 M/s new-std-num [enwiki], 600.996, 321.48 M/s new-std-sym [enwiki], 577.433, 334.60 M/s new-std-white [enwiki], 381.535, 506.40 M/s combining class [enwiki], 517.231, 373.54 M/s inv-alpha [enwiki], 8181.25, 23.62 M/s inv-mark [enwiki], 6938.09, 27.85 M/s inv-num [enwiki], 6347.71, 30.44 M/s inv-sym [enwiki], 6930.7, 27.88 M/s Tries of level 1 trie-alpha [enwiki], 275.741, 700.69 M/s trie-mark [enwiki], 274.972, 702.65 M/s trie-num [enwiki], 274.558, 703.70 M/s trie-sym [enwiki], 277.255, 696.86 M/s Tries of level 2 trie-alpha [enwiki], 453.453, 426.08 M/s trie-mark [enwiki], 452.912, 426.59 M/s trie-num [enwiki], 455.004, 424.63 M/s trie-sym [enwiki], 453.492, 426.04 M/s Tries of level 3 trie-alpha [enwiki], 577.442, 334.59 M/s trie-mark [enwiki], 579.054, 333.66 M/s trie-num [enwiki], 577.95, 334.30 M/s trie-sym [enwiki], 577.566, 334.52 M/s Tries of level 4 trie-alpha [enwiki], 778.261, 248.26 M/s trie-mark [enwiki], 778.943, 248.04 M/s trie-num [enwiki], 778.352, 248.23 M/s trie-sym [enwiki], 779.442, 247.88 M/s --- Haven't had a look at what is going on there yet – it could well be that some measurement loop is optimized out altogether. David
Jan 11 2013
parent reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
12-Jan-2013 00:38, David Nadlinger пишет:
 On Friday, 11 January 2013 at 19:31:13 UTC, Dmitry Olshansky wrote:
 The code, including extra tests and a benchmark is here:
 https://github.com/blackwhale/gsoc-bench-2012
Phew, that repository is quite a hefty checkout with all the example data. Just for the fun of it, a completely irreproducible, unscientific benchmark (sorry, Andrei!) run on a Core i7 M 620 under Linux x86_64: DMD 2.061 --- $ rdmd -O -release -noboundscheck -inline char_class.d
[snip] I'd rather use dewiki or arwiki these give the more coverage otherwise you just check the presence of ASCII special casing (only isAlpha and isWhite have it).
 Time to build dchar array: 2702 ms

 Baselines
              noop [enwiki],   217.335, 888.99 M/s
     new-std-alpha [enwiki],   1197.49, 161.34 M/s
      new-std-mark [enwiki],   6374.73, 30.31 M/s
       new-std-num [enwiki],   6290.89, 30.71 M/s
       new-std-sym [enwiki],   6500.93, 29.72 M/s
     new-std-white [enwiki],   822.245, 234.98 M/s
   combining class [enwiki],   5948.63, 32.48 M/s
         inv-alpha [enwiki],     10290, 18.78 M/s
          inv-mark [enwiki],    9229.2, 20.93 M/s
           inv-num [enwiki],   9046.48, 21.36 M/s
           inv-sym [enwiki],   9843.59, 19.63 M/s

 Tries of level 1
        trie-alpha [enwiki],   2179.17, 88.66 M/s
         trie-mark [enwiki],   1912.54, 101.02 M/s
          trie-num [enwiki],   1924.92, 100.37 M/s
          trie-sym [enwiki],   1911.22, 101.09 M/s

 Tries of level 2
        trie-alpha [enwiki],   4524.33, 42.70 M/s
         trie-mark [enwiki],   4276.92, 45.17 M/s
          trie-num [enwiki],   4308.42, 44.84 M/s
          trie-sym [enwiki],   4298.68, 44.95 M/s

 Tries of level 3
        trie-alpha [enwiki],   6407.64, 30.15 M/s
         trie-mark [enwiki],    6150.8, 31.41 M/s
          trie-num [enwiki],   6095.96, 31.69 M/s
          trie-sym [enwiki],    6253.7, 30.89 M/s

 Tries of level 4
        trie-alpha [enwiki],   8329.66, 23.20 M/s
         trie-mark [enwiki],   7993.11, 24.17 M/s
          trie-num [enwiki],   8040.94, 24.03 M/s
          trie-sym [enwiki],   8033.07, 24.05 M/s
 ---
 LDC (2.061 merge branch):
 ---
 $ rdmd --compiler=ldmd2 -O3 -release -noboundscheck -inline char_class.d
 enwiki-latest-all-titles-in-ns0
[snip]
 Baselines
              noop [enwiki],   211.516, 913.44 M/s
     new-std-alpha [enwiki],   612.324, 315.53 M/s
      new-std-mark [enwiki],   577.134, 334.77 M/s
       new-std-num [enwiki],   600.996, 321.48 M/s
       new-std-sym [enwiki],   577.433, 334.60 M/s
     new-std-white [enwiki],   381.535, 506.40 M/s
   combining class [enwiki],   517.231, 373.54 M/s
         inv-alpha [enwiki],   8181.25, 23.62 M/s
          inv-mark [enwiki],   6938.09, 27.85 M/s
           inv-num [enwiki],   6347.71, 30.44 M/s
           inv-sym [enwiki],    6930.7, 27.88 M/s

 Tries of level 1
        trie-alpha [enwiki],   275.741, 700.69 M/s
         trie-mark [enwiki],   274.972, 702.65 M/s
          trie-num [enwiki],   274.558, 703.70 M/s
          trie-sym [enwiki],   277.255, 696.86 M/s

 Tries of level 2
        trie-alpha [enwiki],   453.453, 426.08 M/s
         trie-mark [enwiki],   452.912, 426.59 M/s
          trie-num [enwiki],   455.004, 424.63 M/s
          trie-sym [enwiki],   453.492, 426.04 M/s

 Tries of level 3
        trie-alpha [enwiki],   577.442, 334.59 M/s
         trie-mark [enwiki],   579.054, 333.66 M/s
          trie-num [enwiki],    577.95, 334.30 M/s
          trie-sym [enwiki],   577.566, 334.52 M/s

 Tries of level 4
        trie-alpha [enwiki],   778.261, 248.26 M/s
         trie-mark [enwiki],   778.943, 248.04 M/s
          trie-num [enwiki],   778.352, 248.23 M/s
          trie-sym [enwiki],   779.442, 247.88 M/s
 ---

 Haven't had a look at what is going on there yet – it could well be that
 some measurement loop is optimized out altogether.
You can print total counts after each bench, there is a TLS varaible written at the end of it. But anyway I like your numbers! :)
 David
-- Dmitry Olshansky
Jan 11 2013
parent reply "David Nadlinger" <see klickverbot.at> writes:
On Friday, 11 January 2013 at 20:57:57 UTC, Dmitry Olshansky 
wrote:
 You can print total counts after each bench, there is a TLS 
 varaible written at the end of it. But anyway I like your 
 numbers! :)
Okay, I couldn't resist having a short look at the results, specifically the benchmark of the new isSymbol implementation, where LDC beats DMD by roughly 10x. The reason for the nice performance results is mainly that LDC optimizes the classifyCall loop containing the trie lookup down to the following fairly optimal piece of code (eax is the overall counter that gets stored to lastCount): --- 40bc90: 8b 55 00 mov edx,DWORD PTR [rbp+0x0] 40bc93: 89 d6 mov esi,edx 40bc95: c1 ee 0d shr esi,0xd 40bc98: 40 0f b6 f6 movzx esi,sil 40bc9c: 0f b6 34 31 movzx esi,BYTE PTR [rcx+rsi*1] 40bca0: 48 83 c5 04 add rbp,0x4 40bca4: 0f b6 da movzx ebx,dl 40bca7: c1 e6 05 shl esi,0x5 40bcaa: c1 ea 08 shr edx,0x8 40bcad: 83 e2 1f and edx,0x1f 40bcb0: 09 f2 or edx,esi 40bcb2: 41 0f b7 14 50 movzx edx,WORD PTR [r8+rdx*2] 40bcb7: c1 e2 08 shl edx,0x8 40bcba: 09 da or edx,ebx 40bcbc: 48 c1 ea 06 shr rdx,0x6 40bcc0: 4c 01 ca add rdx,r9 40bcc3: 48 8b 14 d1 mov rdx,QWORD PTR [rcx+rdx*8] 40bcc7: 48 0f a3 da bt rdx,rbx 40bccb: 83 d0 00 adc eax,0x0 40bcce: 48 ff cf dec rdi 40bcd1: 75 bd jne 40bc90 --- The code DMD generates for the lookup, on the other hand, is pretty ugly, including several values being spilled to the stack, and also doesn't get inlined. This is, of course, just a microbenchmark, but it is cases like this which make me wish that we would just use LLVM (or GCC, for that matter) for the reference compiler – and I'm not talking about the slightly Frankensteinian endeavor that LDC is here. Walter, my intention is not at all to doubt your ability at a compiler writer; we all know the stories of how you used to annoy the team leads at the big companies by beating their performance numbers single-handedly, and I'm sure you could e.g. fix your backend to match the performance of the LDC-generated code for Dmitry's benchmark in no time. The question is just: Are we as a community big, resourceful enough to justify spending time on that? Sure, there would still be things we will have to fix ourselves when using another backend, such as SEH support in LLVM. But performance will always be a central selling point of a language like D, and do we really want to take the burden of keeping up with the competition ourselves, when we can just draw on the work of full-time backend developers at Intel, AMD, Apple and others for free? Given the current developments in microprocessors and given that applications such as graphics and scientific computing are naturally a good fit for D, what's next? You taking a year off from active language development to implement an auto-vectorizer for your backend? I know this question has been brought up before (if never really answered), and I don't want to start another futile discussion, but given the developments in the compiler/languages landscape over the last few years, it strikes me as an increasingly bad decision to stick with an obscure, poorly documented backend which nobody knows how to use – and nobody wants to learn how to use either, because, oops, they couldn't even redistribute their own work. Let's put aside all the other arguments (most of which I didn't even mention) for a moment, even the performance aspect; I think that the productivity aspect alone, both regarding duplicated work and accessibility of the project to new developers, makes it hard to justify forging leveraging the momentum of an established backend project like LLVM. [1] Maybe it is naïve to think that the situation could ever change for DMD. But I sincerely hope that the instant a promising self-hosted (as far as the frontend goes) compiler project shows up at the horizon, it will gain the necessary amount of official endorsement – and manpower, especially in the form of your (Walter's) expertise – to make that final, laborious stretch to release quality. If we just sit there and wait for somebody to come along with a new production-ready compiler which is better, faster and shinier than DMD, we will wait for a long, long time – this might happen for a Lisp dialect, but not for D. Sorry for the rant, [2] David [1] The reasons for which I'm focusing on LLVM here are not so much its technical qualities as its liberal BSD-like license – if it is good enough for Apple, Intel (also a compiler vendor) and their lawyer teams, it is probably also for us. The code could even be integrated into commercial products such as DMC without problems. [2] And for any typos which might undermine my credibility – it is way too early in the morning here.
Jan 11 2013
next sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 1/11/2013 9:17 PM, David Nadlinger wrote:
 Sorry for the rant, [2]
 David
No problem. I think it's great that you're the champion for LDC. Having 3 robust compilers for D is only a win, and I don't think having 3 takes away from D at all. The 3 compilers have different strengths and weaknesses. For example, LDC doesn't work with VS, and so would have cost us the possible design win at Remedy.
Jan 11 2013
parent reply "David Nadlinger" <see klickverbot.at> writes:
On Saturday, 12 January 2013 at 05:44:00 UTC, Walter Bright wrote:
 No problem. I think it's great that you're the champion for LDC.
 Having 3 robust compilers for D is only a win, and I don't 
 think having 3 takes away from D at all.
I precisely did *not* want to drag LDC into this discussion, but I guess it was unavoidable given that I talked about it in the first part of my post, and of course because I'm the one maintaining it right now. So, just to clarify, I took up LDC development mainly because it produces significantly faster code for some of my projects, and it would have been a shame to let it die from a marketing point of view. If I had lots of spare time (which I don't have right now) I'd certainly try to help out with a D-based compiler project instead. So it's not like I'm overly attached to LDC, emotionally – resurrecting it just seemed like the most viable short-term solution. But yes, you are right, having 3 robust compilers for D would certainly be a good thing for D. But unfortunately, I think it would be quite a stretch to say that we have even a single such one right now. To elaborate – and again, please don't take this as a personal attack –, every time this discussion comes up you seem to resort to the »diversity is good« cliché without ever actually responding to the arguments other people bring up. You want to stick to the DMC backend? Fine, that's your good right, but wouldn't it only be fair to share your reasons with the other people who are investing considerable amounts of work and time in »your« language? As a note aside, I think this point is much deeper than it might appear at first. Sociology is not exactly my field, so I don't know what current research has to say about this, but at least to me it seems obvious that a healthy culture of communication is essential for the success of any non-trivial open source project. Even if it would be short-sighted to claim that all disputes can be resolved by a good technical discussion, it seems to go a long way towards reaching a consensus in most cases; and many engineers/scientists/hackers seem to regard it as the norm. On the other hand, it seems like failure to communicate the reasons behind a given decision is one of the most effective ways to cause (potential) contributors to turn away, looking for a more promising project to throw their spare time at. And to be honest, I think some of the most unfortunate events in the history of D can be directly traced back to such communication failures (I hope it is obvious what I'm alluding to here?). But let's return to the particular topic I brought up. To recap, my impression is that several large players in the industry have started to pool their efforts regarding compiler development, as spending the ever-increasing amount of work needed to come up with their own state-of-the-art optimizer is no longer justified by the potential advantages this would offer over their competitors – maybe quite similar to the almost universal success of Linux in the OS domain. So, in short, even if the DMC backend was technically superior to every other solution out there right now, I would be hard-pressed to make an argument for preferring it over a more widely used solution like LLVM because of this. I cannot see how putting work into a custom backend could realistically help to make D more attractive compared to other languages, so the current situation seems like a case of making things unnecessarily hard for ourselves. Of course, as with any open source project, it would be foolish to try and tell you what to spend your time on. But I would really be interested in your long-term vision for D(MD), because as noted before I can't see any signs that the expectations at a compiler backend will drop in the future (more and more vector units in processors, the rise of ARM, ...), and even when neglecting the platform support issue, I'm not sure how good an offer D would still be if its main implementation started to perform significantly worse than common C++ compilers. To summarize: Although there is quite a lot of promising development going on in the language space right now, I still think D has an excellent offer to make, otherwise I would left the camp long ago. But it doesn't seem wise to intentionally weaken our position in the arena by spending considerable amounts of manpower on solving a problem from the ground up for which there is already a free off-the-shelf solution that would get us at least 95% there – and the fact that our resources are limited is precisely what you seem to neglect when reciting the »diversity is good« mantra.
 For example, LDC doesn't work with VS, and so would have cost
 us the possible design win at Remedy.
In view of the above, I'm not sure how much weight this argument could carry, but I don't think it is valid in the first place: If our reference compiler was backed by LLVM, you could have just added CV8 support to it like you did for DMD. By the way, the other thing missing from LLVM regarding VS compatibility is SEH support, but DMD doesn't implement that either (LLVM actually supports EH on MinGW right now, so the implementation could likely have been re-used as a workaround, just like you did for DMD). David
Jan 13 2013
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 1/13/2013 3:10 PM, David Nadlinger wrote:
 To elaborate – and again, please don't take this as a personal attack –,
every
 time this discussion comes up you seem to resort to the »diversity is good«
 cliché
Being a cliche doesn't make it invalid.
 without ever actually responding to the arguments other people bring up.
 You want to stick to the DMC backend? Fine, that's your good right, but
wouldn't
 it only be fair to share your reasons with the other people who are investing
 considerable amounts of work and time in »your« language?
It's long ceased to be my language. It's a group creation. Anyhow, I like using the dmc backend because: 1. I can completely control the user experience. 2. I can fix a problem where it should be fixed, rather than working around it. 3. I understand it completely, and so I can reasonably quickly add in things that some other back end may not support. 4. It's fast. 5. I don't run any risk of "you stole X's ideas when you worked on X's back end." 6. I enjoy working on it.
 So, in short, even if the DMC backend was technically superior to every other
 solution out there right now, I would be hard-pressed to make an argument for
 preferring it over a more widely used solution like LLVM because of this. I
 cannot see how putting work into a custom backend could realistically help to
 make D more attractive compared to other languages, so the current situation
 seems like a case of making things unnecessarily hard for ourselves.
This assumes that the existence of dmd takes away from ldc. I don't believe it does. Ldc doesn't need me.
 To summarize: Although there is quite a lot of promising development going on
in
 the language space right now, I still think D has an excellent offer to make,
 otherwise I would left the camp long ago. But it doesn't seem wise to
 intentionally weaken our position in the arena by spending considerable amounts
 of manpower on solving a problem from the ground up for which there is already
a
 free off-the-shelf solution that would get us at least 95% there – and the
fact
 that our resources are limited is precisely what you seem to neglect when
 reciting the »diversity is good« mantra.
I'm not convinced that the effort to keep up with llvm in order to work with it is actually less time.
 For example, LDC doesn't work with VS, and so would have cost
 us the possible design win at Remedy.
In view of the above, I'm not sure how much weight this argument could carry, but I don't think it is valid in the first place: If our reference compiler was backed by LLVM, you could have just added CV8 support to it like you did for DMD.
The thing is, I don't see how that would have saved time. It would have been more time - I already had CV4 support in dmd, so I had a head start on CV8. There were also a number of wacky things about Mscoff64 - how to integrate them into LLVM would have required a thorough knowledge of it on my part, and *then* I'd still have to implement it. There were also the MS64 ABI changes, which consumed a considerable amount of time. If that is not in LLVM, I would have been in for a world of effort and would have had to do some considerable surgery on it.
 By the way, the other thing missing from LLVM regarding VS compatibility is
 SEH support, but DMD doesn't implement that either (LLVM actually supports EH
on
 MinGW right now, so the implementation could likely have been re-used as a
 workaround, just like you did for DMD).
I think that underestimates what all is necessary to be ABI and format compatible with VC64. SEH is only a small part of it. For example, 1. Win64 ABI, which is fundamentally different from any other platform 2. Making it compatible with and use VC++'s runtime 3. Generating MsCoff 64. 4. Adding the wacky pdata and xdata sections 5. Learning the completely undocumented CV8 format, then implementing it Doing this all requires a fairly comprehensive understanding of the innards of the back end. If LLVM is lacking one or more of 1..5, then I wouldn't be surprised a bit if it took me considerably *longer* to do it for LLVM, and then I'd have to sell my changes to the LLVM group. (If they don't buy it, then I'm stuck with a fork I have to maintain myself. Ugh.)
Jan 13 2013
next sibling parent reply "David Nadlinger" <see klickverbot.at> writes:
On Sunday, 13 January 2013 at 23:57:48 UTC, Walter Bright wrote:
 On 1/13/2013 3:10 PM, David Nadlinger wrote:
 To elaborate – and again, please don't take this as a personal 
 attack –, every
 time this discussion comes up you seem to resort to the 
 »diversity is good«
 cliché
Being a cliche doesn't make it invalid.
You conveniently ended your citation before the »without ever actually responding to the arguments other people bring up« part.
 You want to stick to the DMC backend? Fine, that's your good 
 right, but wouldn't
 it only be fair to share your reasons with the other people 
 who are investing
 considerable amounts of work and time in »your« language?
It's long ceased to be my language. It's a group creation. Anyhow, I like using the dmc backend because:
Let me reply specifically focusing on LLVM:
 1. I can completely control the user experience.
You can do this with LLVM as well, as you can modify its source at will. Let me also note that this means that *only you* can fix certain parts of the user experience (see for example the Optlink issues). And besides, to be brutally honest, there has not been too much good feedback about the DMD toolchain so far, apart from compilation speeds, has there? The reason for this might of course be that users just expect a D compiler to work just like every other C/C++ compiler, but I can't see how having a potentially bigger degree of control would have helped much so far.
 2. I can fix a problem where it should be fixed, rather than 
 working around it.
You can do this with LLVM as well – after all, that's precisely the point of open source software. And additionally, it has the big advantage that other people can fix problems in it too, and in many cases, the LLVM people will even do the work for you.
 3. I understand it completely, and so I can reasonably quickly 
 add in things that some other back end may not support.
LLVM is reasonably well documented and has many useful debugging and visualization features, so I'm sure you would find your way around it quickly if you wanted to.
 4. It's fast.
No objection there. But let me note that its weak(er) optimizer might get it disqualified as far as release builds go before this argument even counts.
 5. I don't run any risk of "you stole X's ideas when you worked 
 on X's back end."
You can use any code from LLVM under a BSD-like license. Would this be too restrictive? I am not a lawyer, but I'm reasonably certain it would even allow you to just take the whole LLVM code base and sell it as a closed-source package.
 6. I enjoy working on it.
Fair enough, and all too understandable given that it's your creation! But to put it bluntly, it's a hassle for almost everybody else working on the compiler, so a project you want to attract other contributors to might not be the best environment to pursue this »hobby«.
 So, in short, even if the DMC backend was technically superior 
 to every other
 solution out there right now, I would be hard-pressed to make 
 an argument for
 preferring it over a more widely used solution like LLVM 
 because of this. I
 cannot see how putting work into a custom backend could 
 realistically help to
 make D more attractive compared to other languages, so the 
 current situation
 seems like a case of making things unnecessarily hard for 
 ourselves.
This assumes that the existence of dmd takes away from ldc. I don't believe it does. Ldc doesn't need me.
Why are you even bringing up LDC here? My last post must have been very unclear in this regard? But of course, LDC does need you – after all, it uses the DMD frontend! My entire point is that *D* needs you and your expertise: Having a veteran compiler writer on board is just about the best thing that can happen to a language effort; just because of your experience you are infinitely more productive in some areas than most of us other contributors are. What I'm trying to say is that if you want D to succeed, which I assume you do, spending (part of) your time on writing a custom backend might not be the best idea: First, you could have spent that time working on things that help to sell the language, and second, doing so actually makes life *harder* for other contributors.
 To summarize: Although there is quite a lot of promising 
 development going on in
 the language space right now, I still think D has an excellent 
 offer to make,
 otherwise I would left the camp long ago. But it doesn't seem 
 wise to
 intentionally weaken our position in the arena by spending 
 considerable amounts
 of manpower on solving a problem from the ground up for which 
 there is already a
 free off-the-shelf solution that would get us at least 95% 
 there – and the fact
 that our resources are limited is precisely what you seem to 
 neglect when
 reciting the »diversity is good« mantra.
I'm not convinced that the effort to keep up with llvm in order to work with it is actually less time.
Less time than what? Developing a code generator that matches theirs? You are certainly a very capable compiler engineer, but LLVM has quite a few bright minds working on it as well, full-time. I honestly don't see how »keeping up with LLVM« could be more time-consuming than replicating all of their work. Maybe you can elaborate on what exactly you meant? But for what it's worth, Kai and I have had no problems with keeping LDC up to date with the latest LLVM version for the last few upstream releases, and we both can't spend too much time on LDC, so this likely wouldn't be a big issue for you either.
 For example, LDC doesn't work with VS, and so would have cost
 us the possible design win at Remedy.
In view of the above, I'm not sure how much weight this argument could carry, but I don't think it is valid in the first place: If our reference compiler was backed by LLVM, you could have just added CV8 support to it like you did for DMD.
The thing is, I don't see how that would have saved time. It would have been more time - I already had CV4 support in dmd, so I had a head start on CV8. There were also a number of wacky things about Mscoff64 - how to integrate them into LLVM would have required a thorough knowledge of it on my part, and *then* I'd still have to implement it. There were also the MS64 ABI changes, which consumed a considerable amount of time. If that is not in LLVM, I would have been in for a world of effort and would have had to do some considerable surgery on it.
If I'm not mistaken, most of the related work for LLVM was done by the end of 2011.
 By the way, the other thing missing from LLVM regarding VS 
 compatibility is
 SEH support, but DMD doesn't implement that either (LLVM 
 actually supports EH on
 MinGW right now, so the implementation could likely have been 
 re-used as a
 workaround, just like you did for DMD).
I think that underestimates what all is necessary to be ABI and format compatible with VC64. SEH is only a small part of it. For example, 1. Win64 ABI, which is fundamentally different from any other platform 2. Making it compatible with and use VC++'s runtime 3. Generating MsCoff 64. 4. Adding the wacky pdata and xdata sections
I was not just blindly guessing in my last post. LLVM supports directly emitting 64 bit COFF, and as far as I know the code for writing out the RUNTIME_FUNCTION/UNWIND_INFO instances is functional.
 5. Learning the completely undocumented CV8 format, then 
 implementing it

 Doing this all requires a fairly comprehensive understanding of 
 the innards of the back end. If LLVM is lacking one or more of 
 1..5, then I wouldn't be surprised a bit if it took me 
 considerably *longer* to do it for LLVM, and then I'd have to 
 sell my changes to the LLVM group. (If they don't buy it, then 
 I'm stuck with a fork I have to maintain myself. Ugh.)
Given the modular design of LLVM, maintaining a fork probably wouldn't even be too bad; many companies using it e.g. have their own private backends. That being said, the LLVM folks are generally very open with accepting contributions into the main tree (except for 32 bit SEH support, where the Apple people are afraid of issues regarding that notorious Borland patent). But note that I didn't even want to convince you that implementing a given new feature (such as CV8) for LLVM would take you _less_ time than for your own backend. I'm just saying that I would expect it to take a comparable amount of time, and even if it might take you some time to get familiar with the LLVM internals at first, I'm sure that initial cost would quickly be offset by the fact that you don't have to write *everything* on your own. I wonder if there is another production-grade backend out there that is maintained by a single person? David
Jan 13 2013
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 1/13/2013 6:07 PM, David Nadlinger wrote:
 1. I can completely control the user experience.
You can do this with LLVM as well, as you can modify its source at will. Let me also note that this means that *only you* can fix certain parts of the user experience (see for example the Optlink issues).
I don't really want to get into a long back-and-forth thing where we just keep repeating our statements. But as for optlink, I would have happily turned it over to anyone who expressed interest in working on it. I'd also put it on github if anyone wanted to look at it.
Jan 13 2013
next sibling parent Andrej Mitrovic <andrej.mitrovich gmail.com> writes:
On 1/14/13, Walter Bright <newshound2 digitalmars.com> wrote:
 But as for optlink, I would have happily turned it over to anyone who
 expressed
 interest in working on it. I'd also put it on github if anyone wanted to
 look at it.
What's the state of Optlink, have you finished the transition to C? IIRC we've asked for this years ago.
Jan 13 2013
prev sibling next sibling parent reply Andrej Mitrovic <andrej.mitrovich gmail.com> writes:
On 1/14/13, Walter Bright <newshound2 digitalmars.com> wrote:
 I'd also put it on github if anyone wanted to look at it.
Anyway don't tease, put it online! :p
Jan 13 2013
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 1/13/2013 6:40 PM, Andrej Mitrovic wrote:
 On 1/14/13, Walter Bright <newshound2 digitalmars.com> wrote:
 I'd also put it on github if anyone wanted to look at it.
Anyway don't tease, put it online! :p
Have fun! https://github.com/DigitalMars/optlink
Jan 13 2013
next sibling parent Andrej Mitrovic <andrej.mitrovich gmail.com> writes:
On 1/14/13, Walter Bright <newshound2 digitalmars.com> wrote:
 Have fun!
 https://github.com/DigitalMars/optlink
Cool! Can you tell us how this code is licensed?
Jan 13 2013
prev sibling parent reply Andrej Mitrovic <andrej.mitrovich gmail.com> writes:
 On 1/14/13, Walter Bright <newshound2 digitalmars.com> wrote:
 Have fun!
 https://github.com/DigitalMars/optlink
I'm getting a build failure when running the root build.bat: Fatal error: unable to open input file 'scio.h' Tried with both VC9 and VC10's nmake, I can't find this file in DMC, optlink, or VCs include dirs.
Jan 13 2013
parent Walter Bright <newshound2 digitalmars.com> writes:
On 1/13/2013 8:00 PM, Andrej Mitrovic wrote:
 On 1/14/13, Walter Bright <newshound2 digitalmars.com> wrote:
 Have fun!
 https://github.com/DigitalMars/optlink
I'm getting a build failure when running the root build.bat: Fatal error: unable to open input file 'scio.h' Tried with both VC9 and VC10's nmake, I can't find this file in DMC, optlink, or VCs include dirs.
That's part of the dmc source code.
Jan 13 2013
prev sibling next sibling parent reply "David Nadlinger" <see klickverbot.at> writes:
On Monday, 14 January 2013 at 02:24:38 UTC, Walter Bright wrote:
 I don't really want to get into a long back-and-forth thing 
 where we just keep repeating our statements.
Neither do I. You listed your reasons for sticking with the DMC backend, and I tried to outline how I would judge things differently and why. I guess discussions generally work like this. :) Anyway, thanks for responding to my post in the first place. I very much appreciate that, as I don't think you ever commented on the topic in the last few years in any detail at all, at least as far as the public forums are concerned. David
Jan 13 2013
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 1/13/13 10:22 PM, David Nadlinger wrote:
 On Monday, 14 January 2013 at 02:24:38 UTC, Walter Bright wrote:
 I don't really want to get into a long back-and-forth thing where we
 just keep repeating our statements.
Neither do I. You listed your reasons for sticking with the DMC backend, and I tried to outline how I would judge things differently and why. I guess discussions generally work like this. :) Anyway, thanks for responding to my post in the first place. I very much appreciate that, as I don't think you ever commented on the topic in the last few years in any detail at all, at least as far as the public forums are concerned.
I'll candidly mention that David's request was unclear to me. It's been winding enough that I missed the key sentence. Was it that we switch dmd to using llvm as backend? Andrei
Jan 13 2013
parent reply "David Nadlinger" <see klickverbot.at> writes:
On Monday, 14 January 2013 at 04:42:33 UTC, Andrei Alexandrescu 
wrote:
 I'll candidly mention that David's request was unclear to me. 
 It's been winding enough that I missed the key sentence. Was it 
 that we switch dmd to using llvm as backend?
If you want, yes - but not in the form of an actionable proposal yet. I was trying to argue that the benefits of using an existing solution like GCC or LLVM are large enough (resp. the costs of using a custom backend high enough) that we should seriously consider doing so. Especially because it looks as if the amount of work needed to keep the DMD backend and thus the reference D compiler competitive is going to increase further as other backends are gaining things like auto-vectorization to light up modern CPUs and ARM is gaining in importance. David
Jan 13 2013
next sibling parent reply Chad J <chadjoan __spam.is.bad__gmail.com> writes:
On 01/14/2013 12:33 AM, David Nadlinger wrote:
 On Monday, 14 January 2013 at 04:42:33 UTC, Andrei Alexandrescu wrote:
 I'll candidly mention that David's request was unclear to me. It's
 been winding enough that I missed the key sentence. Was it that we
 switch dmd to using llvm as backend?
If you want, yes - but not in the form of an actionable proposal yet. I was trying to argue that the benefits of using an existing solution like GCC or LLVM are large enough (resp. the costs of using a custom backend high enough) that we should seriously consider doing so. Especially because it looks as if the amount of work needed to keep the DMD backend and thus the reference D compiler competitive is going to increase further as other backends are gaining things like auto-vectorization to light up modern CPUs and ARM is gaining in importance. David
"gaining" he says ;) D just missed out on the leading edge of smartphone games. That is a HUGE market packed with small developers that can easily adopt new tooling. We got the invite and we stood them up. IMO, sticking to an x86-centric toolset cost D one of its perfect opportunities for being a killer tool. That makes me kinda sad. Sorry for the downer. I bring it up in the hope that we can learn from it. I like to think that we'll see more opportunities in the future.
Jan 13 2013
parent Russel Winder <russel winder.org.uk> writes:
On Mon, 2013-01-14 at 01:30 -0500, Chad J wrote:
[=E2=80=A6]
 tooling.  We got the invite and we stood them up.  IMO, sticking to an=
=20
 x86-centric toolset cost D one of its perfect opportunities for being a=
=20
 killer tool.  That makes me kinda sad.
If you can't do x86_64 and ARM you are legacy. Even the JVM now runs on ARM processors. --=20 Russel. =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D Dr Russel Winder t: +44 20 7585 2200 voip: sip:russel.winder ekiga.n= et 41 Buckmaster Road m: +44 7770 465 077 xmpp: russel winder.org.uk London SW11 1EN, UK w: www.russel.org.uk skype: russel_winder
Jan 13 2013
prev sibling next sibling parent "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Mon, Jan 14, 2013 at 06:33:44AM +0100, David Nadlinger wrote:
 On Monday, 14 January 2013 at 04:42:33 UTC, Andrei Alexandrescu
 wrote:
I'll candidly mention that David's request was unclear to me. It's
been winding enough that I missed the key sentence. Was it that we
switch dmd to using llvm as backend?
If you want, yes - but not in the form of an actionable proposal yet. I was trying to argue that the benefits of using an existing solution like GCC or LLVM are large enough (resp. the costs of using a custom backend high enough) that we should seriously consider doing so. Especially because it looks as if the amount of work needed to keep the DMD backend and thus the reference D compiler competitive is going to increase further as other backends are gaining things like auto-vectorization to light up modern CPUs and ARM is gaining in importance.
[...] I have to say, as a bystander to this discussion, that I *have* felt the dilemma of whether to use dmd (I want to try out the latest D features) vs. gdc/ldc (I want maximal performance for compute-heavy apps). I don't mean to downplay dmd in any way, but IME code generated by gdc -O3 has been *consistently* 30-40% faster than code generated by dmd -O, sometimes even 50% or more. I have compared the assembly code generated by either compiler, and I have to say that, at least IME, the code produced by gdc is indisputably superior to the code produced by dmd. Due to this, I have wanted to use only gdc for my D projects, and I'd expect that any project where performance is a concern would feel the same way. But I was forced to use dmd because I wanted to try out the newest features and get the latest bugfixes as well. So now I'm torn between wanting competitive performance vs. getting the bugfixes I need or the latest and greatest features. This is far from ideal, IMHO. Users shouldn't have to choose between up-to-date language support vs. performance. Having the reference D compiler match gdc/ldc in performance would be a big factor in D adoption IMO (how many people would spend the time to look up gdc, or even know it exists, instead of just compiling a C/C++ program with gcc/g++, compare it to dmd output for a comparable D program, and walk away?). T -- WINDOWS = Will Install Needless Data On Whole System -- CompuMan
Jan 13 2013
prev sibling next sibling parent Jonathan M Davis <jmdavisProg gmx.com> writes:
On Sunday, January 13, 2013 22:46:21 H. S. Teoh wrote:
 On Mon, Jan 14, 2013 at 06:33:44AM +0100, David Nadlinger wrote:
 On Monday, 14 January 2013 at 04:42:33 UTC, Andrei Alexandrescu
 
 wrote:
I'll candidly mention that David's request was unclear to me. It's
been winding enough that I missed the key sentence. Was it that we
switch dmd to using llvm as backend?
If you want, yes - but not in the form of an actionable proposal yet. I was trying to argue that the benefits of using an existing solution like GCC or LLVM are large enough (resp. the costs of using a custom backend high enough) that we should seriously consider doing so. Especially because it looks as if the amount of work needed to keep the DMD backend and thus the reference D compiler competitive is going to increase further as other backends are gaining things like auto-vectorization to light up modern CPUs and ARM is gaining in importance.
[...] I have to say, as a bystander to this discussion, that I *have* felt the dilemma of whether to use dmd (I want to try out the latest D features) vs. gdc/ldc (I want maximal performance for compute-heavy apps). I don't mean to downplay dmd in any way, but IME code generated by gdc -O3 has been *consistently* 30-40% faster than code generated by dmd -O, sometimes even 50% or more. I have compared the assembly code generated by either compiler, and I have to say that, at least IME, the code produced by gdc is indisputably superior to the code produced by dmd. Due to this, I have wanted to use only gdc for my D projects, and I'd expect that any project where performance is a concern would feel the same way. But I was forced to use dmd because I wanted to try out the newest features and get the latest bugfixes as well. So now I'm torn between wanting competitive performance vs. getting the bugfixes I need or the latest and greatest features. This is far from ideal, IMHO. Users shouldn't have to choose between up-to-date language support vs. performance. Having the reference D compiler match gdc/ldc in performance would be a big factor in D adoption IMO (how many people would spend the time to look up gdc, or even know it exists, instead of just compiling a C/C++ program with gcc/g++, compare it to dmd output for a comparable D program, and walk away?).
If you want to use the lastest master, then you're going to be stuck with dmd. If you're willing to use the latest release, then you have a choice. And assuming that you don't need the latest master, I would argue that if performance is critical, what you'll want to do is develop with dmd (because of its fast compilation items), and then you'll actually want to release using gdc or ldc so that you actually get fast programs. And given the fact that Walter is rightly focused on language features rather than optimizing dmd's backend and the fact that almost no one else seems likely to spend time optimizing dmd's backend, it's a foregone conclusion at this point that dmd is going to generate slower programs than gdc and ldc for the forseeable future. On the other hand, it wouldn't surprise me if having Walter switch to another backend for the reference compiler would cost us so much development time from him (as he learns the new backend) that it wouldn't be even vaguely worth it. - Jonathan M Davis
Jan 13 2013
prev sibling next sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 1/13/2013 9:33 PM, David Nadlinger wrote:
 as other backends are gaining things
 like auto-vectorization to light up modern CPUs and ARM is gaining in
importance.
I've done some research on auto-vectorization, i.e. "The Software Vectorization Handbook" by Bik. My conclusion (Manu independently came to the same one, he's our resident SIMD expert) is that auto-vectorization is a disaster. What it is is: 1. Reverse engineer a loop into a higher level construct 2. Recompile that construct using vector instructions It's a disaster because (2) often fails in ways that are utterly mysterious to 99% of programmers. The failure mode is to not auto-vectorize the loop. Hence, the failure is silent and the user just sees poor performance, if he notices it at all. D's intended approach is completely different, and much more straightforward. The SIMD types are exposed to the front end, and you build vector operations like you would any other arithmetic operation. If a particular vector operation is not available for a particular SIMD type, then the compilation fails with an appropriate error message. I.e. D's approach is to fix the language to support vector semantics, rather than trying to back-asswards fit it into the optimizer.
Jan 13 2013
next sibling parent reply "David Nadlinger" <see klickverbot.at> writes:
On Monday, 14 January 2013 at 07:21:57 UTC, Walter Bright wrote:
 I've done some research on auto-vectorization, i.e. "The 
 Software Vectorization Handbook" by Bik.

 My conclusion (Manu independently came to the same one, he's 
 our resident SIMD expert) is that auto-vectorization is a 
 disaster.

 What it is is:

 1. Reverse engineer a loop into a higher level construct
 2. Recompile that construct using vector instructions

 It's a disaster because (2) often fails in ways that are 
 utterly mysterious to 99% of programmers. The failure mode is 
 to not auto-vectorize the loop. Hence, the failure is silent 
 and the user just sees poor performance, if he notices it at 
 all.
Virtually all common C/C++ compilers have had SIMD types/intrinsics for years now, which is more or less exactly what core.simd is for D - well, there is some syntax sugar for arithmetic operations on the D vector types, but it doesn't cover any of the "interesting" operations such as broadcasting, swizzling, etc. Yet *all* the big compiler vendors (Microsoft, Intel, GCC, LLVM) seem to find it necessary to spend considerable amounts of effort on improving their heuristics for the indeed quite problematic optimization process you describe. I don't think that rehashing the common reasons cited for the importance of (auto-)vectorizing code here would be of any use, you are certainly aware of them already. Just don't forget that Manu (who seems to have disappeared from the forums, crunch time?) represents a group of programmers who would resort to hand-written assembly for their "embarrassingly parallel" code if suitable compiler intrinsics didn't exist, whereas the vast majority of programmers probably would be troubled to reliably identify the cases where vectorization would have the biggest benefits in their code bases, even if they could justify spending time/maintainability on manually optimizing their code at this level. David
Jan 14 2013
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 1/14/2013 12:28 AM, David Nadlinger wrote:
 Virtually all common C/C++ compilers have had SIMD types/intrinsics for years
 now, which is more or less exactly what core.simd is for D - well, there is
some
 syntax sugar for arithmetic operations on the D vector types, but it doesn't
 cover any of the "interesting" operations such as broadcasting, swizzling, etc.
Actually, they do a crappy job of it in my experiments.
 Yet *all* the big compiler vendors (Microsoft, Intel, GCC, LLVM) seem to find
it
 necessary to spend considerable amounts of effort on improving their heuristics
 for the indeed quite problematic optimization process you describe.
The fundamental problem they face is they cannot change the language - they have to work with bog standard code that has no notion of SIMD.
 I don't think that rehashing the common reasons cited for the importance of
 (auto-)vectorizing code here would be of any use, you are certainly aware of
 them already. Just don't forget that Manu (who seems to have disappeared from
 the forums, crunch time?) represents a group of programmers who would resort to
 hand-written assembly for their "embarrassingly parallel" code if suitable
 compiler intrinsics didn't exist, whereas the vast majority of programmers
 probably would be troubled to reliably identify the cases where vectorization
 would have the biggest benefits in their code bases, even if they could justify
 spending time/maintainability on manually optimizing their code at this level.
Here's the problem. Draw a table of SIMD vector types on one axis, and arithmetic operations on the other. Plot an X for what is supported. You'll find that the pattern of X's doesn't follow much of a pattern, and it varies all over the place for different CPUs and architectures. Now consider your autovectorizer. If an X isn't there, it (silently) generates workaround code. The only way you'll really be able to tell it went wrong is to disassemble the result - and who is going to do that? And what's much, much worse about this, is pulling values out of and into SIMD registers to do those workarounds causes massive stalls (the SIMD unit and the arithmetic unit are separate), so bad that you're better off doing no vectorization at all. The next problem with an autovectorizer is alignment - consider the C function: int func(float *p) { ... } Now you want to autovectorize inside func(). Is p aligned? Who knows? So you must emit code to do misaligned vectors, which is another performance disaster. The D approach is if you are able to get your vector code to compile, it will be: 1. aligned properly 2. compiled to real SIMD instructions
Jan 14 2013
parent reply "David Nadlinger" <see klickverbot.at> writes:
On Monday, 14 January 2013 at 09:02:25 UTC, Walter Bright wrote:
 Yet *all* the big compiler vendors (Microsoft, Intel, GCC, 
 LLVM) seem to find it
 necessary to spend considerable amounts of effort on improving 
 their heuristics
 for the indeed quite problematic optimization process you 
 describe.
The fundamental problem they face is they cannot change the language - they have to work with bog standard code that has no notion of SIMD.
They *are* changing the language using SIMD intrinsics and built-in types. It might be that D can do a somewhat nicer job at that by standardizing them and using less cryptic names, but the fundamental concept remains the same. And again, you don't have to convince me about the problems inherent to auto-vectorization, I'm well aware of them. I'm just pointing out that despite these obstacles, compiler vendors are still trying to implement it, probably because writing vector code is fundamentally difficult (as you mentioned, you can easily shoot yourself in the foot in many ways, performance-wise), and you can't justify just not using the vector units on today's chips. It's not that I think __vector, core.simd and friends would be a *bad* idea. But your general dismissal of auto-vectorization seems a bit too quick and convenient to me. David
Jan 14 2013
parent Walter Bright <newshound2 digitalmars.com> writes:
On 1/14/2013 1:41 AM, David Nadlinger wrote:
 It's not that I think __vector, core.simd and friends would be a *bad* idea.
But
 your general dismissal of auto-vectorization seems a bit too quick and
 convenient to me.
I agree it's more convenient not to implement it. I also prefer using a lawnmower to a pair of scissors <g>.
Jan 14 2013
prev sibling next sibling parent reply "jerro" <a a.com> writes:
 I.e. D's approach is to fix the language to support vector 
 semantics, rather than trying to back-asswards fit it into the 
 optimizer.
I agree that adding SIMD types and operations to the languages is a better approach than autovectorization. I'd just like to point out that DMD's backend still has serious problems regarding SIMD support, which makes getting SIMD code to work with DMD a pain. For example, see those two bugs: http://d.puremagic.com/issues/show_bug.cgi?id=9304 http://d.puremagic.com/issues/show_bug.cgi?id=9200 I don't know how much work it is to fix those issues and any similar ones, but I think this is relevant to the discussion in this thread, since GCC and LLVM backends don't have those problems.
Jan 14 2013
parent Walter Bright <newshound2 digitalmars.com> writes:
On 1/14/2013 12:54 AM, jerro wrote:
 I.e. D's approach is to fix the language to support vector semantics, rather
 than trying to back-asswards fit it into the optimizer.
I agree that adding SIMD types and operations to the languages is a better approach than autovectorization. I'd just like to point out that DMD's backend still has serious problems regarding SIMD support, which makes getting SIMD code to work with DMD a pain. For example, see those two bugs: http://d.puremagic.com/issues/show_bug.cgi?id=9304 http://d.puremagic.com/issues/show_bug.cgi?id=9200
Yeah, I know about those, I'll fix 'em. The SIMD support in the compiler is still young.
Jan 14 2013
prev sibling parent reply "Mehrdad" <wfunction hotmail.com> writes:
On Monday, 14 January 2013 at 07:21:57 UTC, Walter Bright wrote:
 I've done some research on auto-vectorization, i.e. "The 
 Software Vectorization Handbook" by Bik.

 My conclusion (Manu independently came to the same one, he's 
 our resident SIMD expert) is that auto-vectorization is a 
 disaster.

 What it is is:

 1. Reverse engineer a loop into a higher level construct
 2. Recompile that construct using vector instructions

 It's a disaster because (2) often fails in ways that are 
 utterly mysterious to 99% of programmers. The failure mode is 
 to not auto-vectorize the loop. Hence, the failure is silent 
 and the user just sees poor performance, if he notices it at 
 all.
Have you seen the Visual C++ 2012 compiler? It fixes that problem. Its auto-vectorizer has two switches: 1. /Qvec-report:1, which reports the auto-vectorized loops 2. /Qvec-report:2, which reports all potential candidates and whether or not they were vectorized (and if not, a reason code)
Jan 17 2013
parent Walter Bright <newshound2 digitalmars.com> writes:
On 1/17/2013 1:20 AM, Mehrdad wrote:
 Have you seen the Visual C++ 2012 compiler? It fixes that problem.

 Its auto-vectorizer has two switches:

 1. /Qvec-report:1, which reports the auto-vectorized loops
 2. /Qvec-report:2, which reports all potential candidates and whether or not
 they were vectorized (and if not, a reason code)
I would call it more of a workaround than a fix.
Jan 17 2013
prev sibling next sibling parent reply Russel Winder <russel winder.org.uk> writes:
On Mon, 2013-01-14 at 06:33 +0100, David Nadlinger wrote:
 On Monday, 14 January 2013 at 04:42:33 UTC, Andrei Alexandrescu=20
 wrote:
 I'll candidly mention that David's request was unclear to me.=20
 It's been winding enough that I missed the key sentence. Was it=20
 that we switch dmd to using llvm as backend?
=20 If you want, yes - but not in the form of an actionable proposal=20 yet. I was trying to argue that the benefits of using an existing=20 solution like GCC or LLVM are large enough (resp. the costs of=20 using a custom backend high enough) that we should seriously=20 consider doing so. Especially because it looks as if the amount=20 of work needed to keep the DMD backend and thus the reference D=20 compiler competitive is going to increase further as other=20 backends are gaining things like auto-vectorization to light up=20 modern CPUs and ARM is gaining in importance.
Having D in the GCC set is a good move for marketing the D language. GDC could be a vehicle for bringing more programmers to D but it needs more resource to bring it up to date. Having D in the LLVM set is a good move for marketing the D language. LDC is a vehicle for bringing people to D. It would be good to get the Debian package updated and actively kept up. This would then get it packaged for Ubuntu and Mint. It would be good to get it packaged for Fedora. Keeping going with DMD is fine, as long as it is in the knowledge that GDC and LDC are the D toolchains that will lead to wider D usage. --=20 Russel. =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D Dr Russel Winder t: +44 20 7585 2200 voip: sip:russel.winder ekiga.n= et 41 Buckmaster Road m: +44 7770 465 077 xmpp: russel winder.org.uk London SW11 1EN, UK w: www.russel.org.uk skype: russel_winder
Jan 13 2013
parent reply "David Nadlinger" <see klickverbot.at> writes:
On Monday, 14 January 2013 at 07:52:49 UTC, Russel Winder wrote:
 LDC is a vehicle for bringing people to D. It would be good to 
 get the Debian package updated and actively kept up.
As we have a recent release now: Who do I have to ping about resurrecting the Debian package? I have never really used any Debian-based distros (except for some server boxes), so I have no good idea of how to find a packager and how to best assist with the packaging process. I know of the importance of getting an LDC package into Debian, but it would be much, much easier if somebody who is at home in both the D and Debian communities (or at least uses a Debian-based distro on a day-to-day basis) could handle this. David
Jan 14 2013
next sibling parent reply Iain Buclaw <ibuclaw ubuntu.com> writes:
On 14 January 2013 08:03, David Nadlinger <see klickverbot.at> wrote:

 On Monday, 14 January 2013 at 07:52:49 UTC, Russel Winder wrote:

 LDC is a vehicle for bringing people to D. It would be good to get the
 Debian package updated and actively kept up.
As we have a recent release now: Who do I have to ping about resurrecting the Debian package? I have never really used any Debian-based distros (except for some server boxes), so I have no good idea of how to find a packager and how to best assist with the packaging process. I know of the importance of getting an LDC package into Debian, but it would be much, much easier if somebody who is at home in both the D and Debian communities (or at least uses a Debian-based distro on a day-to-day basis) could handle this. David
I could set things in motion for you, but you'd probably have to nudge me once every 6 months to push in updates. -- Iain Buclaw *(p < e ? p++ : p) = (c & 0x0f) + '0';
Jan 14 2013
parent reply "David Nadlinger" <see klickverbot.at> writes:
On Monday, 14 January 2013 at 08:20:11 UTC, Iain Buclaw wrote:
 I could set things in motion for you, but you'd probably have 
 to nudge me
 once every 6 months to push in updates.
That would be great! How do we proceed? David
Jan 14 2013
next sibling parent Russel Winder <russel winder.org.uk> writes:
On Mon, 2013-01-14 at 09:26 +0100, David Nadlinger wrote:
 On Monday, 14 January 2013 at 08:20:11 UTC, Iain Buclaw wrote:
 I could set things in motion for you, but you'd probably have=20
 to nudge me
 once every 6 months to push in updates.
=20 That would be great! How do we proceed?
I just emailed the listed packager cc yourself. I guess I should have included Ian due to the GDC connection.=20 --=20 Russel. =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D Dr Russel Winder t: +44 20 7585 2200 voip: sip:russel.winder ekiga.n= et 41 Buckmaster Road m: +44 7770 465 077 xmpp: russel winder.org.uk London SW11 1EN, UK w: www.russel.org.uk skype: russel_winder
Jan 14 2013
prev sibling next sibling parent Russel Winder <russel winder.org.uk> writes:
On Mon, 2013-01-14 at 08:37 +0000, Russel Winder wrote:
[=E2=80=A6]
 included Ian due to the GDC connection.=20
/Ian/s//Iain/ --=20 Russel. =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D Dr Russel Winder t: +44 20 7585 2200 voip: sip:russel.winder ekiga.n= et 41 Buckmaster Road m: +44 7770 465 077 xmpp: russel winder.org.uk London SW11 1EN, UK w: www.russel.org.uk skype: russel_winder
Jan 14 2013
prev sibling next sibling parent Iain Buclaw <ibuclaw ubuntu.com> writes:
On 14 January 2013 08:44, Russel Winder <russel winder.org.uk> wrote:

 On Mon, 2013-01-14 at 08:37 +0000, Russel Winder wrote:
 [=85]
 included Ian due to the GDC connection.
/Ian/s//Iain/
Thanks for the name correction. :) --=20 Iain Buclaw *(p < e ? p++ : p) =3D (c & 0x0f) + '0';
Jan 14 2013
prev sibling parent Iain Buclaw <ibuclaw ubuntu.com> writes:
On 14 January 2013 08:26, David Nadlinger <see klickverbot.at> wrote:

 On Monday, 14 January 2013 at 08:20:11 UTC, Iain Buclaw wrote:

 I could set things in motion for you, but you'd probably have to nudge me
 once every 6 months to push in updates.
That would be great! How do we proceed? David
Mattias is the current maintainer IIRC. I can request a transfer of ownership to either myself or someone who is willing to maintain the package for LDC. Also I need to post updates for the GDC package too, so might as well do both in one hit. -- Iain Buclaw *(p < e ? p++ : p) = (c & 0x0f) + '0';
Jan 14 2013
prev sibling parent reply "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Mon, Jan 14, 2013 at 09:03:13AM +0100, David Nadlinger wrote:
 On Monday, 14 January 2013 at 07:52:49 UTC, Russel Winder wrote:
LDC is a vehicle for bringing people to D. It would be good to get
the Debian package updated and actively kept up.
As we have a recent release now: Who do I have to ping about resurrecting the Debian package? I have never really used any Debian-based distros (except for some server boxes), so I have no good idea of how to find a packager and how to best assist with the packaging process. I know of the importance of getting an LDC package into Debian, but it would be much, much easier if somebody who is at home in both the D and Debian communities (or at least uses a Debian-based distro on a day-to-day basis) could handle this.
[...] I'm a Debian developer (a very dormant one ever since I got married, I'll admit, but nevertheless one). I am rusty with the procedures due to not having used them for a while, but at least I know where to look and am familiar with the conventions used. So I'm willing to help. I can probably even sponsor uploads, if that's what it takes. T -- If lightning were to ever strike an orchestra, it'd always hit the conductor first.
Jan 14 2013
parent "eles" <eles eles.com> writes:
On Monday, 14 January 2013 at 16:11:00 UTC, H. S. Teoh wrote:
 On Mon, Jan 14, 2013 at 09:03:13AM +0100, David Nadlinger wrote:
 On Monday, 14 January 2013 at 07:52:49 UTC, Russel Winder
I'm a Debian developer (a very dormant one ever since I got married, I'll admit, but nevertheless one). I am rusty with the procedures due to not having used them for a while, but at least I know where to look and am familiar with the conventions used. So I'm willing to help. I can probably even sponsor uploads, if that's what it takes.
Hi, I would love to see gdc-4.7 (ie. based on gcc 4.7) in Debian Sid ASAP (maybe it still on schedule to be included into next Ubuntu - Raring Ringtail). Currently, the gdc package is still at the pre-historic era of gcc 4.6.3: http://packages.debian.org/sid/devel/gdc and Ubuntu follows closely with: http://packages.ubuntu.com/ko/raring/devel/gdc while its gcc package is 4.7.2: http://packages.ubuntu.com/ko/raring/gcc Many thanks for any implication on that side.
Jan 14 2013
prev sibling next sibling parent Jordi Sayol <g.sayol yahoo.es> writes:
Al 14/01/13 08:52, En/na Russel Winder ha escrit:
 
 Having D in the LLVM set is a good move for marketing the D language.
 LDC is a vehicle for bringing people to D. It would be good to get the
 Debian package updated and actively kept up. This would then get it
 packaged for Ubuntu and Mint. It would be good to get it packaged for
 Fedora.
LDC is already available on Fedora 17. -- Jordi Sayol
Jan 14 2013
prev sibling parent Russel Winder <russel winder.org.uk> writes:
On Mon, 2013-01-14 at 19:01 +0100, Jordi Sayol wrote:
[=E2=80=A6]
 LDC is already available on Fedora 17.
LDC is already available in Debian, it is just a very old version. Looks like the Fedora version is LDC as at 2012-06-24 from the Git repository. One up to Fedora :-) --=20 Russel. =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D Dr Russel Winder t: +44 20 7585 2200 voip: sip:russel.winder ekiga.n= et 41 Buckmaster Road m: +44 7770 465 077 xmpp: russel winder.org.uk London SW11 1EN, UK w: www.russel.org.uk skype: russel_winder
Jan 14 2013
prev sibling parent Jacob Carlborg <doob me.com> writes:
On 2013-01-14 03:24, Walter Bright wrote:

 But as for optlink, I would have happily turned it over to anyone who
 expressed interest in working on it. I'd also put it on github if anyone
 wanted to look at it.
Then I would say, put in on github regardless if anyone wants to take a look at it. It doesn't hurt. -- /Jacob Carlborg
Jan 13 2013
prev sibling parent Jacob Carlborg <doob me.com> writes:
On 2013-01-14 00:57, Walter Bright wrote:

 Doing this all requires a fairly comprehensive understanding of the
 innards of the back end. If LLVM is lacking one or more of 1..5, then I
 wouldn't be surprised a bit if it took me considerably *longer* to do it
 for LLVM
Sure, but if you had switch to LLVM, say three years ago, you would know innards of the back end. -- /Jacob Carlborg
Jan 13 2013
prev sibling parent Dmitry Olshansky <dmitry.olsh gmail.com> writes:
12-Jan-2013 09:17, David Nadlinger пишет:
 On Friday, 11 January 2013 at 20:57:57 UTC, Dmitry Olshansky wrote:
 You can print total counts after each bench, there is a TLS varaible
 written at the end of it. But anyway I like your numbers! :)
Okay, I couldn't resist having a short look at the results, specifically the benchmark of the new isSymbol implementation, where LDC beats DMD by roughly 10x. The reason for the nice performance results is mainly that LDC optimizes the classifyCall loop containing the trie lookup down to the following fairly optimal piece of code (eax is the overall counter that gets stored to lastCount):
So these are legit? Coooooool! BTW I'm having about 2-3 times better numbers on DMD 32bits with oldish AMD K10. Can you test 32bit versions also, could it be some glitch in 64bit codegen?
 ---
    40bc90:       8b 55 00                mov    edx,DWORD PTR [rbp+0x0]
    40bc93:       89 d6                   mov    esi,edx
    40bc95:       c1 ee 0d                shr    esi,0xd
    40bc98:       40 0f b6 f6             movzx  esi,sil
    40bc9c:       0f b6 34 31             movzx  esi,BYTE PTR [rcx+rsi*1]
    40bca0:       48 83 c5 04             add    rbp,0x4
    40bca4:       0f b6 da                movzx  ebx,dl
    40bca7:       c1 e6 05                shl    esi,0x5
    40bcaa:       c1 ea 08                shr    edx,0x8
    40bcad:       83 e2 1f                and    edx,0x1f
    40bcb0:       09 f2                   or     edx,esi
    40bcb2:       41 0f b7 14 50          movzx  edx,WORD PTR [r8+rdx*2]
    40bcb7:       c1 e2 08                shl    edx,0x8
    40bcba:       09 da                   or     edx,ebx
    40bcbc:       48 c1 ea 06             shr    rdx,0x6
    40bcc0:       4c 01 ca                add    rdx,r9
    40bcc3:       48 8b 14 d1             mov    rdx,QWORD PTR [rcx+rdx*8]
    40bcc7:       48 0f a3 da             bt     rdx,rbx
    40bccb:       83 d0 00                adc    eax,0x0
    40bcce:       48 ff cf                dec    rdi
    40bcd1:       75 bd                   jne    40bc90
 ---
This looks quite nice indeed.
 The code DMD generates for the lookup, on the other hand, is pretty
 ugly, including several values being spilled to the stack, and also
 doesn't get inlined.
To be honest one of the major problems I see with DMD is a lack of principled reliable inliner. Currently it may inline or not 2 equivalent pieces of code just because one of it has early return, or switch statement or whatever. And it's about to time to start inlining functions with loops as it's not 90-s anymore.
 [1] The reasons for which I'm focusing on LLVM here are not so much its
 technical qualities as its liberal BSD-like license – if it is good
 enough for Apple, Intel (also a compiler vendor) and their lawyer teams,
 it is probably also for us. The code could even be integrated into
 commercial products such as DMC without problems.
I like LLVM, and next to everybody in industry like it. Another example is AMD. They are building their compiler infrastructure for GPUs on top of LLVM.
 [2] And for any typos which might undermine my credibility – it is way
 too early in the morning here.
-- Dmitry Olshansky
Jan 12 2013
prev sibling next sibling parent "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Fri, Jan 11, 2013 at 12:35:42PM -0800, H. S. Teoh wrote:
[...]
 - It would also be nice if these functions can be made pure (I'm not
   sure I understand why checking the category of a character should be
   impure). The lack of nothrow I can understand, since the input
   character may be illegal or otherwise malformed. But  nothrow pure
   seems to me to be necessary for all category-checking functions.
Gah, I meant safe pure. T -- Lawyer: (n.) An innocence-vending machine, the effectiveness of which depends on how much money is inserted.
Jan 11 2013
prev sibling next sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 1/11/2013 11:31 AM, Dmitry Olshansky wrote:
 Anyway it's polished and ready for the good old collective destruction called
 peer review. I'm looking for a review manager.
Thank you for doing this. Some notes: struct InversionList: 1. It implements length and empty, but otherwise does not seem to be a range, though its name implies it is a container. 2. The description "InversionList is a packed interval-based data structure that represents a set of codepoints." completely baffles me. 3. The table for opBinary appears twice. 4. toSourceCode says "if the codepoint". If what codepoint? And what funcName is ""? mapTrieIndex: I have no idea what this does from the description. Is it a map like in std.algorithms? I.e. does it work with ranges? Need an example. TrieBuilder: 1. No description. I'm completely baffled. 2. putRange? What does that do? General lack of examples. Trie: Should have a wikipedia link to what a trie is. buildTrie: Contains a number of functions that return auto, but no mention of what is returned. While I like auto, in these cases it is not helpful, because the user needs to know what type is returned. Grapheme: Should provide an InputRange so a user can iterate over its codepoints. It already appears to provide part of the range interface. sicmp and icmp: No reason one should have "in" and the other "inout". Both should be "const". combiningClass What is "the canonicial combining class" ? General note: Needs a section that briefly defines terms, with links to external documentation.
Jan 11 2013
parent reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
12-Jan-2013 00:50, Walter Bright пишет:
 On 1/11/2013 11:31 AM, Dmitry Olshansky wrote:
 Anyway it's polished and ready for the good old collective destruction
 called
 peer review. I'm looking for a review manager.
Thank you for doing this. Some notes:
Great bashing, thanks :)
 struct InversionList:

     1. It implements length and empty, but otherwise does not seem to be
 a range, though its name implies it is a container.
Containers also have length and can be empty. Ranges on top of it are .byChar, .byInterval.
     2. The description "InversionList is a packed interval-based data
 structure that represents a set of codepoints." completely baffles me.
OK, probably it was a bad idea to combine implementation detail and general description in one statement. How about: InversionList is a set of codepoints, optimized for size. It represents a set as an array of intervals using 6 bytes per interval of codepoints. Better?
     3. The table for opBinary appears twice.

     4. toSourceCode says "if the codepoint". If what codepoint? And what
 funcName is ""?

 mapTrieIndex:

     I have no idea what this does from the description. Is it a map like
 in std.algorithms? I.e. does it work with ranges? Need an example.

 TrieBuilder:

     1. No description. I'm completely baffled.

     2. putRange? What does that do?

 General lack of examples.

 Trie:

      Should have a wikipedia link to what a trie is.

 buildTrie:

      Contains a number of functions that return auto, but no mention of
 what is returned. While I like auto, in these cases it is not helpful,
 because the user needs to know what type is returned.
Trust me you won't like it when you see it :) Part of the reason it's hidden. But you are correct that documentation on these artifacts is *ehm* ... sketchy. Will fix/update.
 Grapheme:

      Should provide an InputRange so a user can iterate over its
 codepoints. It already appears to provide part of the range interface.
Container != range. Just slice it and RA range is in your hands. I need to put this info somewhere prominent I guess. BTW foreach(v; Grapheme("abc")) { ... } should work because of automatic slicing.
 sicmp and icmp:

      No reason one should have "in" and the other "inout". Both should
 be "const".

 combiningClass

      What is "the canonicial combining class" ?

 General note:

      Needs a section that briefly defines terms, with links to external
 documentation.
All to the point, thanks. -- Dmitry Olshansky
Jan 11 2013
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 1/11/2013 1:21 PM, Dmitry Olshansky wrote:
 12-Jan-2013 00:50, Walter Bright пишет:
 On 1/11/2013 11:31 AM, Dmitry Olshansky wrote:
 Anyway it's polished and ready for the good old collective destruction
 called
 peer review. I'm looking for a review manager.
Thank you for doing this. Some notes:
Great bashing, thanks :)
Anytime!
 struct InversionList:

     1. It implements length and empty, but otherwise does not seem to be
 a range, though its name implies it is a container.
Containers also have length and can be empty. Ranges on top of it are .byChar, .byInterval.
Some more text about accessing it as a range would be helpful. Being a set of "codepoints" yet accessing it "by char" makes no sense. In unicode, the terms "char", "codepoint", "octet", "grapheme", etc., are bandied about and are often conflated and confused with each other. I suggest a brief section defining those terms, and then scrupulously, consistently, anally, and nitpickingly adhering to them in the documentation and names of things.
     2. The description "InversionList is a packed interval-based data
 structure that represents a set of codepoints." completely baffles me.
OK, probably it was a bad idea to combine implementation detail and general description in one statement. How about: InversionList is a set of codepoints, optimized for size. It represents a set as an array of intervals using 6 bytes per interval of codepoints. Better?
The name says "list" and yet it is described as a "set". Being a "list" makes an implication that it is a linked list. BTW, what is an "interval"?
     3. The table for opBinary appears twice.

     4. toSourceCode says "if the codepoint". If what codepoint? And what
 funcName is ""?

 mapTrieIndex:

     I have no idea what this does from the description. Is it a map like
 in std.algorithms? I.e. does it work with ranges? Need an example.

 TrieBuilder:

     1. No description. I'm completely baffled.

     2. putRange? What does that do?

 General lack of examples.

 Trie:

      Should have a wikipedia link to what a trie is.

 buildTrie:

      Contains a number of functions that return auto, but no mention of
 what is returned. While I like auto, in these cases it is not helpful,
 because the user needs to know what type is returned.
Trust me you won't like it when you see it :) Part of the reason it's hidden. But you are correct that documentation on these artifacts is *ehm* ... sketchy. Will fix/update.
If the compiler type is hard to grok, ok, but at least the documentation should make it clear what is returned.
 Grapheme:

      Should provide an InputRange so a user can iterate over its
 codepoints. It already appears to provide part of the range interface.
Container != range. Just slice it and RA range is in your hands. I need to put this info somewhere prominent I guess. BTW foreach(v; Grapheme("abc")) { ... } should work because of automatic slicing.
Yes, but is the resulting v an octet, char, codepoint ????
 All to the point, thanks.
Welcs!
Jan 11 2013
parent reply "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Fri, Jan 11, 2013 at 01:36:21PM -0800, Walter Bright wrote:
 On 1/11/2013 1:21 PM, Dmitry Olshansky wrote:
12-Jan-2013 00:50, Walter Bright пишет:
[...]
buildTrie:

     Contains a number of functions that return auto, but no mention
of what is returned. While I like auto, in these cases it is not
helpful, because the user needs to know what type is returned.
Trust me you won't like it when you see it :) Part of the reason it's hidden. But you are correct that documentation on these artifacts is *ehm* ... sketchy. Will fix/update.
If the compiler type is hard to grok, ok, but at least the documentation should make it clear what is returned.
[...] Would an alias be appropriate in this case, so that we can put a concrete name to the function return type? Or is that an abuse of alias? T -- Debugging is twice as hard as writing the code in the first place. Therefore, if you write the code as cleverly as possible, you are, by definition, not smart enough to debug it. -- Brian W. Kernighan
Jan 11 2013
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 1/11/2013 4:14 PM, H. S. Teoh wrote:
 Would an alias be appropriate in this case, so that we can put a
 concrete name to the function return type? Or is that an abuse of alias?
Actually, that sounds like a good idea.
Jan 11 2013
parent reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
12-Jan-2013 05:31, Walter Bright пишет:
 On 1/11/2013 4:14 PM, H. S. Teoh wrote:
 Would an alias be appropriate in this case, so that we can put a
 concrete name to the function return type? Or is that an abuse of alias?
Actually, that sounds like a good idea.
Except that it's created out of template parameters by wrapping them in a couple of template. So not, can't do. -- Dmitry Olshansky
Jan 12 2013
parent reply "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Sat, Jan 12, 2013 at 12:20:46PM +0400, Dmitry Olshansky wrote:
 12-Jan-2013 05:31, Walter Bright пишет:
On 1/11/2013 4:14 PM, H. S. Teoh wrote:
Would an alias be appropriate in this case, so that we can put a
concrete name to the function return type? Or is that an abuse of
alias?
Actually, that sounds like a good idea.
Except that it's created out of template parameters by wrapping them in a couple of template. So not, can't do.
[...] I was going to suggest something like: alias typeof(return) ConcreteName; But then I realized it's impossible to use ConcreteName in the function signature, which makes it pointless to have the alias since it won't show up in the Ddoc. :-( Although, maybe something like this *might* be possible (but it's sorta getting to the point of being ridiculously complex just for the sake of making ddoc display a single identifier): template ConcreteName(A,B,C) { alias ConcreteName = SomeTemplate!(A, OtherTemplate!B, ...); } ConcreteName!(A,B,C) func(A,B,C,D)(D args) { ConcreteName!(A,B,C) retVal = ...; return retVal; } T -- Three out of two people have difficulties with fractions. -- Dirk Eddelbuettel
Jan 12 2013
parent reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
12-Jan-2013 20:16, H. S. Teoh пишет:
 On Sat, Jan 12, 2013 at 12:20:46PM +0400, Dmitry Olshansky wrote:
 12-Jan-2013 05:31, Walter Bright пишет:
 On 1/11/2013 4:14 PM, H. S. Teoh wrote:
 Would an alias be appropriate in this case, so that we can put a
 concrete name to the function return type? Or is that an abuse of
 alias?
Actually, that sounds like a good idea.
Except that it's created out of template parameters by wrapping them in a couple of template. So not, can't do.
[...] I was going to suggest something like: alias typeof(return) ConcreteName; But then I realized it's impossible to use ConcreteName in the function signature, which makes it pointless to have the alias since it won't show up in the Ddoc. :-( Although, maybe something like this *might* be possible (but it's sorta getting to the point of being ridiculously complex just for the sake of making ddoc display a single identifier): template ConcreteName(A,B,C) { alias ConcreteName = SomeTemplate!(A, OtherTemplate!B, ...); } ConcreteName!(A,B,C) func(A,B,C,D)(D args) { ConcreteName!(A,B,C) retVal = ...; return retVal; }
For basic needs there are 2 alias for custom Tries that map dchar -> bool and dchar -> some integer. http://blackwhale.github.com/phobos/uni.html#CodepointSetTrie http://blackwhale.github.com/phobos/uni.html#CodepointTrie Both accept custom selection of bits per Trie level. The docs are not really helpful in this area yet. I'll make another pass and add more examples, glossary and so on. -- Dmitry Olshansky
Jan 12 2013
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 1/12/2013 8:26 AM, Dmitry Olshansky wrote:
 I'll make another pass and add more examples, glossary
 and so on.
You can also try the ACRONYM macro: ACRONYM=<acronym title="$+">$1</acronym> ($+) If you mouse over an acronym, a little popup text will appear.
Jan 12 2013
parent Walter Bright <newshound2 digitalmars.com> writes:
On 1/12/2013 11:57 AM, Walter Bright wrote:
 On 1/12/2013 8:26 AM, Dmitry Olshansky wrote:
 I'll make another pass and add more examples, glossary
 and so on.
You can also try the ACRONYM macro: ACRONYM=<acronym title="$+">$1</acronym> ($+) If you mouse over an acronym, a little popup text will appear.
More info: http://html5doctor.com/the-abbr-element/ Although "octet", for example, isn't exactly an abbreviation.
Jan 12 2013
prev sibling next sibling parent "monarch_dodra" <monarchdodra gmail.com> writes:
On Friday, 11 January 2013 at 19:31:13 UTC, Dmitry Olshansky 
wrote:
 It's been long over due to present the work I did during the 
 last GSOC as it was a summer not winter of code after all. 
 Unfortunately some compiler bugs, a new job :) and unrelated 
 events of importance have postponed its release beyond measure.

 Anyway it's polished and ready for the good old collective 
 destruction called peer review. I'm looking for a review 
 manager.
 [SNIP]
I don't have anything to add, but I still want to voice my appreciation for your work. Well done. I hope this makes it through.
Jan 12 2013
prev sibling next sibling parent reply Jacob Carlborg <doob me.com> writes:
On 2013-01-11 20:31, Dmitry Olshansky wrote:
 It's been long over due to present the work I did during the last GSOC
 as it was a summer not winter of code after all. Unfortunately some
 compiler bugs, a new job :) and unrelated events of importance have
 postponed its release beyond measure.
I assume that "sicmp" is faster than "icmp"? In that case you might want to mention that in the documentation. -- /Jacob Carlborg
Jan 12 2013
parent Dmitry Olshansky <dmitry.olsh gmail.com> writes:
12-Jan-2013 19:01, Jacob Carlborg пишет:
 On 2013-01-11 20:31, Dmitry Olshansky wrote:
 It's been long over due to present the work I did during the last GSOC
 as it was a summer not winter of code after all. Unfortunately some
 compiler bugs, a new job :) and unrelated events of importance have
 postponed its release beyond measure.
I assume that "sicmp" is faster than "icmp"? In that case you might want to mention that in the documentation.
Yeah, good point. -- Dmitry Olshansky
Jan 12 2013
prev sibling next sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 1/11/2013 11:31 AM, Dmitry Olshansky wrote:
 And documentation:
 http://blackwhale.github.com/phobos/uni.html
struct InversionList: length() returns "number of characters", while "empty()" returns true if there are no codepoints. Aren't the two redundant? And should it be characters or codepoints?
Jan 12 2013
parent reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
13-Jan-2013 03:39, Walter Bright пишет:
 On 1/11/2013 11:31 AM, Dmitry Olshansky wrote:
 And documentation:
 http://blackwhale.github.com/phobos/uni.html
struct InversionList: length() returns "number of characters", while "empty()" returns true if there are no codepoints. Aren't the two redundant?
Containers commonly have both. But maybe I should kill length on different grounds because it's computed by walking the array of intervals [a, b) and summing up (b-a) lengths. It's O(N).... ouch. The only problem is that then it's not at all obvious that length is better calculated via: reduce!"a + (b[1] - b[0])"(set.byInterval, 0); vs walkLength(set.byChar); where the latter may take enormous time to finish on common sets. (damn it should by DChar at least)
 And should it be characters or codepoints?
Codepoints, everything works on codepoints, that is dchars. A character is a Grapheme. So all mentions of character that slipped through got to be fixed. -- Dmitry Olshansky
Jan 12 2013
parent reply Jonathan M Davis <jmdavisProg gmx.com> writes:
On Sunday, January 13, 2013 11:00:50 Dmitry Olshansky wrote:
 13-Jan-2013 03:39, Walter Bright =D0=BF=D0=B8=D1=88=D0=B5=D1=82:
 On 1/11/2013 11:31 AM, Dmitry Olshansky wrote:
 And documentation:
 http://blackwhale.github.com/phobos/uni.html
=20 struct InversionList: =20 =20 length() returns "number of characters", while "empty()" returns tr=
ue if
 there are no codepoints.
=20
 Aren't the two redundant?
=20 Containers commonly have both. But maybe I should kill length on different grounds because it's computed by walking the array of intervals [a, b) and summing up (b-a) lengths. It's O(N).... ouch. =20 The only problem is that then it's not at all obvious that length is better calculated via: reduce!"a + (b[1] - b[0])"(set.byInterval, 0); vs walkLength(set.byChar); =20 where the latter may take enormous time to finish on common sets. =20 (damn it should by DChar at least)
According to std.container, length should never be defined unless it's = at worst=20 O(log n) (though I would have expected it to require O(1); I don't know= of any=20 container that can calculate its lengeth in O(log n) but not O(1) - it'= s=20 always O(1) or O(n) in my experience). So, don't define length if it's = not that=20 efficient, even if it's not in std.container. It _may_ be valuable to add a function which does the same (e.g.=20 calculateLength or calcLength or calcLen or something liket that), espe= cially=20 if walkLength isn't going to be the most efficient. And yes, providing both empty and length is common and expected. And si= nce=20 empty is _required_ to be O(1), it's still guaranteed to be efficient, = whereas=20 length isn't. I really should ask Andrei why he made length require O(log n) instead = O(1)... - Jonathan M Davis
Jan 12 2013
parent reply Chad J <chadjoan __spam.is.bad__gmail.com> writes:
On 01/13/2013 02:28 AM, Jonathan M Davis wrote:
 I really should ask Andrei why he made length require O(log n) instead O(1)...

 - Jonathan M Davis
Are there cases where it can't be O(1)? My current intuition is that length can always be cached. If a container has a long length calculation, then it can be shortened by wrapping it in a proxy container that does nothing but expose the same operations as jackets around the original container. These jackets forward all data in and out with no changes, except for one: they count the number of elements entering and leaving the underlying container and store this count in a length variable.
Jan 13 2013
next sibling parent reply Jonathan M Davis <jmdavisProg gmx.com> writes:
On Sunday, January 13, 2013 04:58:16 Chad J wrote:
 On 01/13/2013 02:28 AM, Jonathan M Davis wrote:
 I really should ask Andrei why he made length require O(log n) instead
 O(1)...
 
 - Jonathan M Davis
Are there cases where it can't be O(1)?
Most definitely. Take a doubly linked list for example. Either length can be O(1) and splicing is then O(n), or length is O(n) and splicing is then O(1). That's because if you want to keep track of the length, you have to count the number of elements being spliced in. For instance, it's a relatively common mistake in C++ to use std::List's size function and compare it with 0 to see whether the list is empty, because size is O(n). The correct thing to do is to call empty and check its result. You _could_ make std::list's size function be O(1), but that would mean that splicing becomes O(n), which C++98 definitely did not do (though I hear that C++11 made the interesting choice of changing how std::list works so that size is O(1) and splicing is O(n); I don't know if that's good or not). std.container.slist and std.container.dlist don't define length precisely because it can't be O(1) given their design. - Jonathan M Davis
Jan 13 2013
next sibling parent "monarch_dodra" <monarchdodra gmail.com> writes:
On Sunday, 13 January 2013 at 10:24:17 UTC, Jonathan M Davis 
wrote:
 On Sunday, January 13, 2013 04:58:16 Chad J wrote:
 On 01/13/2013 02:28 AM, Jonathan M Davis wrote:
 I really should ask Andrei why he made length require O(log 
 n) instead
 O(1)...
 
 - Jonathan M Davis
Are there cases where it can't be O(1)?
Most definitely. Take a doubly linked list for example. Either length can be O(1) and splicing is then O(n), or length is O(n) and splicing is then O(1). That's because if you want to keep track of the length, you have to count the number of elements being spliced in. For instance, it's a relatively common mistake in C++ to use std::List's size function and compare it with 0 to see whether the list is empty, because size is O(n). The correct thing to do is to call empty and check its result. You _could_ make std::list's size function be O(1), but that would mean that splicing becomes O(n), which C++98 definitely did not do (though I hear that C++11 made the interesting choice of changing how std::list works so that size is O(1) and splicing is O(n); I don't know if that's good or not). std.container.slist and std.container.dlist don't define length precisely because it can't be O(1) given their design. - Jonathan M Davis
Yes, size is now "o(1)" in C++11. However, splice's complexity remains "o(1)" if the list spliced from/to is the same. It only degenerates to "o(N)" if they are not the same list. Also, note that with the old implementation, while length was "O(N)", most implementations could still cache the result it to give "0(1)" access, and only invalidate the cache when splicing with another list. But enough off topic here I guess ;)
Jan 13 2013
prev sibling parent Chad J <chadjoan __spam.is.bad__gmail.com> writes:
On 01/13/2013 05:23 AM, Jonathan M Davis wrote:
 On Sunday, January 13, 2013 04:58:16 Chad J wrote:
 On 01/13/2013 02:28 AM, Jonathan M Davis wrote:
 I really should ask Andrei why he made length require O(log n) instead
 O(1)...

 - Jonathan M Davis
Are there cases where it can't be O(1)?
Most definitely. Take a doubly linked list for example. Either length can be O(1) and splicing is then O(n), or length is O(n) and splicing is then O(1). That's because if you want to keep track of the length, you have to count the number of elements being spliced in. For instance, it's a relatively common mistake in C++ to use std::List's size function and compare it with 0 to see whether the list is empty, because size is O(n). The correct thing to do is to call empty and check its result. You _could_ make std::list's size function be O(1), but that would mean that splicing becomes O(n), which C++98 definitely did not do (though I hear that C++11 made the interesting choice of changing how std::list works so that size is O(1) and splicing is O(n); I don't know if that's good or not). std.container.slist and std.container.dlist don't define length precisely because it can't be O(1) given their design. - Jonathan M Davis
Thanks! That's a good example. It made sense once I realized that you could be splicing in arbitrary sections of another list, not just an entire other list. If I were to used cached lengths and splice in another list with a cached length, then I would just add the lengths of the two lists and do an O(1) splice. I can see how this wouldn't cover all cases though.
Jan 13 2013
prev sibling parent Dmitry Olshansky <dmitry.olsh gmail.com> writes:
13-Jan-2013 13:58, Chad J пишет:
 On 01/13/2013 02:28 AM, Jonathan M Davis wrote:
 I really should ask Andrei why he made length require O(log n) instead
 O(1)...

 - Jonathan M Davis
Are there cases where it can't be O(1)?
 My current intuition is that length can always be cached.
Caching of length is suitable for adapters on top of base/core containers.
  If a
 container has a long length calculation, then it can be shortened by
 wrapping it in a proxy container that does nothing but expose the same
 operations as jackets around the original container.  These jackets
 forward all data in and out with no changes, except for one: they count
 the number of elements entering and leaving the underlying container and
 store this count in a length variable.
Yeah, exactly. -- Dmitry Olshansky
Jan 13 2013
prev sibling next sibling parent reply Martin Nowak <code dawg.eu> writes:
On 01/11/2013 08:31 PM, Dmitry Olshansky wrote:
 There is an extra candy for meta-programming text processing libraries:
 a set type can generate source code of unrolled hard-coded binary search
 for a given set of codepoints.
Great, I already had to do that myself for a DFA Lexer.
Jan 15 2013
parent Dmitry Olshansky <dmitry.olsh gmail.com> writes:
16-Jan-2013 06:37, Martin Nowak пишет:
 On 01/11/2013 08:31 PM, Dmitry Olshansky wrote:
 There is an extra candy for meta-programming text processing libraries:
 a set type can generate source code of unrolled hard-coded binary search
 for a given set of codepoints.
Great, I already had to do that myself for a DFA Lexer.
The only thing left that I'd love to improve is to integrate UTF awareness (for no-decoding processing) into it so that it becomes: auto set = ... ... set.toSourceCode("func", Encoding.UTF8); and prototype of generated function is something along the lines of: size_t func(char[] src); or: size_t func(char[] src, size_t offset); that returns number of matched code units and 0 on failure. -- Dmitry Olshansky
Jan 16 2013
prev sibling parent reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
11-Jan-2013 23:31, Dmitry Olshansky пишет:
 The code, including extra tests and a benchmark is here:
 https://github.com/blackwhale/gsoc-bench-2012

 And documentation:
 http://blackwhale.github.com/phobos/uni.html
First of all, safe pure and nothrow is back. Let me know if something is still not. OK, I've made an extra pass through docs with these things in mind: - getting the introduction & terminology part right - more explanations and details where applicable (let me if that's too much / too little / wrong) - hiding away the truly generic (and not easy to use) Trie from documentation - old deprecated stuff is hidden from docs to discourage its use The whole Trie template was hidden because it needs to be in std.container anyway, and I want to extend & tweak it some more before committing it to public interface. It may very well worth a short review on its own. I've spent some hours to get an easy, useful and correct (as far as it gets) terminology throughout the module. In that wake 2 primary definitions are: A character means an encoded character, that is code point that has *assigned* mapping to abstract character (i.e. symbolically representable meaning). A code point is a point in a range of [0, 10FFFF] that is a value not necessarily having a symbolic meaning. Things I'm aware of that are not there yet: 1. "Moar" examples everywhere! 2. Expand on the description of certain primitives. 3. Table that lists all of the properties, scripts and other sets available through the 'unicode' helper. -- Dmitry Olshansky
Jan 16 2013
next sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 1/16/2013 2:48 AM, Dmitry Olshansky wrote:
 I've spent some hours to get an easy, useful and correct (as far as it gets)
 terminology throughout the module.
Thank you. Looking at the Terminology section (the reference to it at the beginning should be a hyperlink): "Not all code points are assigned to encoded characters.": ?? I thought that was the whole point? "Note that UTF-32 code unit (dchar) holds the actual code point value." => "Note that in UTF-32, a code unit is a code point and is represented by the D dchar type." What happened to "octet", which I thought was the official term? "Also known as simply character." No, please no, at least not in this document. I suspect you need to ban the word "character" from this page. It is so overloaded in meaning that it is useless. "An abstract character does not necessarily correspond to what a user thinks of as a “character” and should not be confused with a Grapheme." This just makes me cry. Who knows what a user thinks of as a character? "not necessarily" means what? Is "Grapheme" a Unicode term? Why can't there be precise definitions of these terms? I wonder if even the Unicode standard people have no idea exactly what they are. Sorry for the rant, but unicode terms always make me mad.
Jan 16 2013
next sibling parent reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
16-Jan-2013 23:35, Walter Bright пишет:
 On 1/16/2013 2:48 AM, Dmitry Olshansky wrote:
 I've spent some hours to get an easy, useful and correct (as far as it
 gets)
 terminology throughout the module.
Thank you. Looking at the Terminology section (the reference to it at the beginning should be a hyperlink): "Not all code points are assigned to encoded characters.": ?? I thought that was the whole point?
Obviously it's not. The simple truth is only 110K+ codepoints are currently assigned everything else is either reserved for speicific use (internal, surrogates etc.) or for future additions. Thin of it this way - previously they thought a unit of symbolic info was 16 bits. Then a lot of problems cropped up (including adapting 8-bit only/Latin-1/ascii systems). Now they treat encoding as form of data storage and everything in the Unicode standard defined as operating on _values_ of codespace (that is code points). This doesn't mean that all these values are real characters.
 "Note that UTF-32 code unit (dchar) holds the actual code point value."
 => "Note that in UTF-32, a code unit is a code point and is represented
 by the D dchar type."
Yeah, that's simpler.
 What happened to "octet", which I thought was the official term?
Octet means 8-bits as a unit of data in just about any Internet protocol description. What do you what it here for?
 "Also known as simply character."

      No, please no, at least not in this document. I suspect you need to
 ban the word "character" from this page. It is so overloaded in meaning
 that it is useless.
It's not me. It's just the way things are and people have to be aware this particular meaning of character. Trust me, I've first converted the whole thing to code point(s) (note the space). Then I read it and it was like "meh" even to me who wrote the stuff in the first place. Then I looked at all documents by Unicode consortium they use 'character' throughout to mean either encoded character or abstract character depending on the time of day. Utility beats pedantry and thus 'character' is 'encoded character' everywhere it's used. Probably I'll remove this statement so that uninitiated don't fixate on it too much.
 "An abstract character does not necessarily correspond to what a user
 thinks of as a “character” and should not be confused with a Grapheme."

      This just makes me cry. Who knows what a user thinks of as a
 character? "not necessarily" means what? Is "Grapheme" a Unicode term?
Yes and yes. The user thinks a character is what he usually writes on a piece of paper obviously. It's a warning that what a typical user thinks a character doesn't always match the character in the Unicode. Grapheme is a unicode term that has direct substitute in this library hence the hyper-link. Grapheme are _visually_ a single entity. In fact the could be a sequence consisting of a base character (Unicode standard says 'character') + sequence of combining marks. This is called combining character sequence, but there are other kinds of grapheme see e.g. Hangul syllables and more weird stuff. BTW the text is taken as is from the Unicode standard definitions. I can litter a couple more pages with this crap, but instead I've taken the most important ones and merged a couple of definitions to make it simpler (e.g. see glyph).
 Why can't there be precise definitions of these terms?
'cause they are muddy ;) 'Abstract character' is one of the ugliest actually.
I wonder if even
 the Unicode standard people have no idea exactly what they are.

 Sorry for the rant, but unicode terms always make me mad.
Nothing I can cure here. And they know what they do but have a boatload of legacy crap that has to crawl on. Just recall the time where everybody thought that 16-bit codes should be enough for everybody. When they gradually took the whole world of writing systems into account things got dirty, awfully so. After learning a lot about Unicode I'd say they managed to do the job just fine given the constraints. I, for one, wouldn't even dare to try to reinvent *these* wheels. A lot of stuff could have been smoothed and simplified but overall - it's complex because the writing systems are utterly irregular plus the backwards compatibility has taken its toll. -- Dmitry Olshansky
Jan 16 2013
parent Walter Bright <newshound2 digitalmars.com> writes:
On 1/16/2013 12:16 PM, Dmitry Olshansky wrote:
 Nothing I can cure here. And they know what they do but have a boatload of
 legacy crap that has to crawl on. Just recall the time where everybody thought
 that 16-bit codes should be enough for everybody. When they gradually took the
 whole world of writing systems into account things got dirty, awfully so.

 After learning a lot about Unicode I'd say they managed to do the job just fine
 given the constraints. I, for one, wouldn't even dare to try to reinvent
*these*
 wheels. A lot of stuff could have been smoothed and simplified but overall -
 it's complex because the writing systems are utterly irregular plus the
 backwards compatibility has taken its toll.
I guess I have to accept it :-) And thanks for taking on this pretty much impossible task.
Jan 16 2013
prev sibling parent Dmitry Olshansky <dmitry.olsh gmail.com> writes:
16-Jan-2013 23:35, Walter Bright пишет:
 On 1/16/2013 2:48 AM, Dmitry Olshansky wrote:
 I've spent some hours to get an easy, useful and correct (as far as it
 gets)
 terminology throughout the module.
Thank you. Looking at the Terminology section (the reference to it at the beginning should be a hyperlink): "Not all code points are assigned to encoded characters.": ?? I thought that was the whole point?
And a lot of people really want to go this way see e.g. this: http://www.parrot.org/content/what-nfg-why-you-want-parrot-have-it There is a value in that but: a) it's not portable anywhere except for in-memory usage \ b) everything input/output got to be converted possibly creating new entries in a global dictionary of graphemes. -- Dmitry Olshansky
Jan 16 2013
prev sibling parent reply "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Wed, Jan 16, 2013 at 02:48:30PM +0400, Dmitry Olshansky wrote:
 11-Jan-2013 23:31, Dmitry Olshansky пишет:
The code, including extra tests and a benchmark is here:
https://github.com/blackwhale/gsoc-bench-2012

And documentation:
http://blackwhale.github.com/phobos/uni.html
First of all, safe pure and nothrow is back. Let me know if something is still not. OK, I've made an extra pass through docs with these things in mind: - getting the introduction & terminology part right - more explanations and details where applicable (let me if that's too much / too little / wrong) - hiding away the truly generic (and not easy to use) Trie from documentation - old deprecated stuff is hidden from docs to discourage its use
[...] Looks much better now! Some nitpicks: - Under Overview: [4th paragraph] "It's recognized that an application may need further enhancements and extensions. It could be the need for less commonly known algorithms or tailoring existing ones for regional-specific needs. To help users with building any extra functionality beyond the core primitives the module provides:" The grammar nazi in me thinks a better wording might be (changes delimited by {{}}): "It's recognized that an application may need further enhancements and extensions{{, such as}} less{{-}}commonly known algorithms{{,}} or tailoring existing ones for {{region}}-specific needs. To help users with building any extra functionality beyond the core primitives{{,}} the module provides:" The second item in the subsequent list: A way to construct optimal packed multi-stage tables also known as a special case of Trie. {{The functions}} codepointTrie, codepointSetTrie construct custom tries that map dchar to value. The end result is {{a}} fast and predictable Ο(1) lookup that powers functions like isAlpha {{and}} combiningClass{{,}} but for user-defined data sets. The last item in the list: Access to the commonly{{-}}used predefined sets of code points. The commonly{{-}}defined one{{s}} can be observed in the CLDR utility, on {{the}} page property index. {{S}}upported ones include Script, Block and General Category. See unicode for easy {{}} compile-time checked queries. - Under Terminology: [[3rd paragraph]] "The minimal bit combination that can represent a unit of encoded text for processing or interchange. Depending on the encoding this could be: 8-bit code units in the UTF-8 (($D char)), [...]" I think you transposed the "$(" here. :) The last sentence in this section appears to be truncated. Maybe a runaway DDoc macro somewhere earlier? - Under Construction of lookup tables, the grammar nazi says: [[1st sentence]] "{{The}} Unicode standard describes a set of algorithms that {{}} depend on having {{the}} ability to quickly look{{ }}up various properties of a code point. Given the the codespace of about 1 million code points, it is not a trivial task to providing a space{{-}}efficient solution for the {{}} multitude of properties." [[2nd paragraph]] "[...] Hash-tables {{have}} enormous memory footprint and binary search over intervals is not fast enough for some heavy-duty algorithms." [[3rd paragraph]] "{{(P }}The recommended solution (see Unicode Implementation Guidelines) {{}} is using multi-stage tables{{,}} that is{{,}} {{an instance}} of Trie with integer keys and {{a}} fixed number of stages. For the {{remainder}} of {{this}} section {{it will be}} called {{a}} fixed trie. The following describes a particular implementation that is aimed for the speed of access at the expense of ideal size savings." [[4th paragraph]] "[...] Split {{the}} number of bits in a key (code point, 21 bits) {{into}} 2 components (e.g. 15 and 8). The first is the number of bits in the index of {{the}} trie and the other is {{the}} number of bits {{in each}} page of {{the}} trie. The layout of trie is then an array of size 2^^bits-of-index followed an array of memory chunks of size 2^^bits-of-page/size-of-element." [[5th paragraph]] "[...] {{The}} slots of {{the}} index all have to contain {{the same[?] number of pages}}. The lookup is then just a couple of operations - slice {{the}} upper bits, {{then}} look{{ }}up {{the}} index for these{{.}} The pseudo-code is:" [[Following the code example]] "[...] Where if the elemsPerPage is a power of 2 the whole process is a handful of simple instructions and 2 array reads. {{Subsequent}} levels of {{the}} trie are introduced by recursing {{}} this notion - the index array is treated as values. The number of bits in {{the}} index is then again split into 2 parts, with pages over 'current-index' and {{the}} new 'upper-index'." [[Next paragraph]] "For completeness the level 1 trie is simply an array. {{The}} current implementation takes advantage of bit-packing values when the range is known to be limited in {{}} advance (such as bool){{.}} {{S}}ee also BitPacked for enforcing it manually. [...]" [[Last paragraph]] "The process of construction of a trie is more involved and is hidden from the user in a form of {{convenience}} functions: codepointTrie, codepointSetTrie and even more convenient toTrie. In general a set or built-in AA with dchar type can be turned into a trie. The trie object in this module is {{}} read-only (immutable){{;}} it's effectively frozen after construction." The grammar nazi has run out of steam, so no more grammar nitpicks for now. ;-) But there are still the following questions: - Why is isControl() not pure nothrow? - Why are the isX() functions system? I would have expected they should be at least trusted? (Or are there technical problems / compiler bugs preventing this?) That's all for now. I hope you don't mind me allowing the grammar nazi to take over for a bit. I want Phobos documentation to be professional quality. :) T -- The trouble with TCP jokes is that it's like hearing the same joke over and over.
Jan 17 2013
next sibling parent Dmitry Olshansky <dmitry.olsh gmail.com> writes:
17-Jan-2013 22:48, H. S. Teoh пишет:
 On Wed, Jan 16, 2013 at 02:48:30PM +0400, Dmitry Olshansky wrote:
 11-Jan-2013 23:31, Dmitry Olshansky пишет:
 The code, including extra tests and a benchmark is here:
 https://github.com/blackwhale/gsoc-bench-2012

 And documentation:
 http://blackwhale.github.com/phobos/uni.html
First of all, safe pure and nothrow is back. Let me know if something is still not. OK, I've made an extra pass through docs with these things in mind: - getting the introduction & terminology part right - more explanations and details where applicable (let me if that's too much / too little / wrong) - hiding away the truly generic (and not easy to use) Trie from documentation - old deprecated stuff is hidden from docs to discourage its use
[...] Looks much better now! Some nitpicks:
Great ;)
 - Under Overview:

 	[4th paragraph] "It's recognized that an application may need
 	further enhancements and extensions. It could be the need for
 	less commonly known algorithms or tailoring existing ones for
 	regional-specific needs. To help users with building any extra
 	functionality beyond the core primitives the module provides:"

    The grammar nazi in me thinks a better wording might be (changes
    delimited by {{}}):

 	"It's recognized that an application may need further
 	enhancements and extensions{{, such as}} less{{-}}commonly known
 	algorithms{{,}} or tailoring existing ones for
 	{{region}}-specific needs. To help users with building any extra
 	functionality beyond the core primitives{{,}} the module
 	provides:"

    The second item in the subsequent list:

 	A way to construct optimal packed multi-stage tables also known
 	as a special case of Trie. {{The functions}} codepointTrie,
 	codepointSetTrie construct custom tries that map dchar to value.
 	The end result is {{a}} fast and predictable Ο(1) lookup that
 	powers functions like isAlpha {{and}} combiningClass{{,}} but
 	for user-defined data sets.

    The last item in the list:

 	Access to the commonly{{-}}used predefined sets of code points.
 	The commonly{{-}}defined one{{s}} can be observed in the CLDR
 	utility, on {{the}} page property index. {{S}}upported ones
 	include Script, Block and General Category. See unicode for easy
 	{{}} compile-time checked queries.

 - Under Terminology:

 	[[3rd paragraph]] "The minimal bit combination that can represent
 	a unit of encoded text for processing or interchange. Depending
 	on the encoding this could be: 8-bit code units in the UTF-8
 	(($D char)), [...]"

    I think you transposed the "$(" here. :)
Looks like one commit wasn't pushed :( I'll peruse you wording though.
    The last sentence in this section appears to be truncated. Maybe a
    runaway DDoc macro somewhere earlier?

 - Under Construction of lookup tables, the grammar nazi says:

 	[[1st sentence]] "{{The}} Unicode standard describes a set of
 	algorithms that {{}} depend on having {{the}} ability to quickly
 	look{{ }}up various properties of a code point. Given the the
 	codespace of about 1 million code points, it is not a trivial
 	task to providing a space{{-}}efficient solution for the {{}}
 	multitude of properties."

 	[[2nd paragraph]] "[...] Hash-tables {{have}} enormous memory
 	footprint and binary search over intervals is not fast enough
 	for some heavy-duty algorithms."

 	[[3rd paragraph]] "{{(P }}The recommended solution (see Unicode
 	Implementation Guidelines) {{}} is using multi-stage tables{{,}}
 	that is{{,}} {{an instance}} of Trie with integer keys and {{a}}
 	fixed number of stages. For the {{remainder}} of {{this}}
 	section {{it will be}} called {{a}} fixed trie. The following
 	describes a particular implementation that is aimed for the
 	speed of access at the expense of ideal size savings."

 	[[4th paragraph]] "[...] Split {{the}} number of bits in a key
 	(code point, 21 bits) {{into}} 2 components (e.g. 15 and 8). The
 	first is the number of bits in the index of {{the}} trie and the
 	other is {{the}} number of bits {{in each}} page of {{the}}
 	trie. The layout of trie is then an array of size
 	2^^bits-of-index followed an array of memory chunks of size
 	2^^bits-of-page/size-of-element."

 	[[5th paragraph]] "[...] {{The}} slots of {{the}} index all have
 	to contain {{the same[?] number of pages}}. The lookup is then
 	just a couple of operations - slice {{the}} upper bits, {{then}}
 	look{{ }}up {{the}} index for these{{.}} The pseudo-code is:"

 	[[Following the code example]] "[...] Where if the elemsPerPage
 	is a power of 2 the whole process is a handful of simple
 	instructions and 2 array reads.  {{Subsequent}} levels of
 	{{the}} trie are introduced by recursing {{}} this notion - the
 	index array is treated as values. The number of bits in {{the}}
 	index is then again split into 2 parts, with pages over
 	'current-index' and {{the}} new 'upper-index'."

 	[[Next paragraph]] "For completeness the level 1 trie is simply
 	an array. {{The}} current implementation takes advantage of
 	bit-packing values when the range is known to be limited in {{}}
 	advance (such as bool){{.}} {{S}}ee also BitPacked for enforcing
 	it manually.  [...]"

 	[[Last paragraph]] "The process of construction of a trie is
 	more involved and is hidden from the user in a form of
 	{{convenience}} functions: codepointTrie, codepointSetTrie and
 	even more convenient toTrie. In general a set or built-in AA
 	with dchar type can be turned into a trie. The trie object in
 	this module is {{}} read-only (immutable){{;}} it's effectively
 	frozen after construction."


 The grammar nazi has run out of steam, so no more grammar nitpicks for
 now. ;-) But there are still the following questions:

 - Why is isControl() not pure nothrow?
Missed this one.
 - Why are the isX() functions  system? I would have expected they should
    be at least  trusted? (Or are there technical problems / compiler bugs
    preventing this?)
M-hm I'm seeing this in my sources: bool isAlpha(dchar c) safe pure nothrow {...} The DDoc however shows system. A compiler bug?
 That's all for now. I hope you don't mind me allowing the grammar nazi
 to take over for a bit. I want Phobos documentation to be professional
 quality. :)
Sure, thanks.
 T
-- Dmitry Olshansky
Jan 17 2013
prev sibling parent reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
17-Jan-2013 22:48, H. S. Teoh пишет:
 On Wed, Jan 16, 2013 at 02:48:30PM +0400, Dmitry Olshansky wrote:
 11-Jan-2013 23:31, Dmitry Olshansky пишет:
 The code, including extra tests and a benchmark is here:
 https://github.com/blackwhale/gsoc-bench-2012

 And documentation:
 http://blackwhale.github.com/phobos/uni.html
[snip]
 - Why are the isX() functions  system? I would have expected they should
    be at least  trusted? (Or are there technical problems / compiler bugs
    preventing this?)
M-hm I'm seeing this in my sources: bool isAlpha(dchar c) safe pure nothrow {...} The DDoc however shows system. A compiler bug?
It's indeed a bug in the compiler, looks funny: http://d.puremagic.com/issues/show_bug.cgi?id=9371 Makes me think that the compiler treats safe after the function prototype differently then the one placed before it. The trailing- safe camp must be jealous.
 That's all for now. I hope you don't mind me allowing the grammar nazi
 to take over for a bit. I want Phobos documentation to be professional
 quality. :)
All of these were incorporated, thanks. Among other new stuff I've added some cross-links throughout. Now on to the normalization forms description, etc. -- Dmitry Olshansky
Jan 22 2013
parent "Jonathan M Davis" <jmdavisProg gmx.com> writes:
On Wednesday, January 23, 2013 00:20:19 Dmitry Olshansky wrote:
 It's indeed a bug in the compiler, looks funny:
 http://d.puremagic.com/issues/show_bug.cgi?id=9371
 
 Makes me think that the compiler treats  safe after
 the function prototype differently then the one placed before it.
 The trailing- safe camp must be jealous.
Prepending safe is evil. ;) That's definitely a weird bug though. I'd say that it pretty much has to be a bug in the parser given that the AST should be identical regardless of which side of the function signature an attribute is on. - Jonathan m Davis
Jan 22 2013