digitalmars.D - Ready for review: new std.uni

Dmitry Olshansky (52/52) Jan 11 2013 It's been long over due to present the work I did during the last GSOC

H. S. Teoh (61/97) Jan 11 2013 Excellent!! It looks pretty impressive. We should definitely try out

Dmitry Olshansky (32/128) Jan 11 2013 Yup, except they tweak the algorithms all along. The normalization for

H. S. Teoh (35/104) Jan 11 2013 Line-breaking is in TR14. I looked over it. It's definitely more

Dmitry Olshansky (23/53) Jan 13 2013 I had given it a practiced look of tired Unicode implementor. It has a

David Nadlinger (125/127) Jan 11 2013 Phew, that repository is quite a hefty checkout with all the

Dmitry Olshansky (10/93) Jan 11 2013 [snip]

David Nadlinger (96/99) Jan 11 2013 Okay, I couldn't resist having a short look at the results,

Walter Bright (5/7) Jan 11 2013 No problem. I think it's great that you're the champion for LDC. Having ...

David Nadlinger (86/91) Jan 13 2013 I precisely did *not* want to drag LDC into this discussion, but

Walter Bright (35/66) Jan 13 2013 It's long ceased to be my language. It's a group creation.

David Nadlinger (79/175) Jan 13 2013 You conveniently ended your citation before the »without ever

Walter Bright (5/9) Jan 13 2013 I don't really want to get into a long back-and-forth thing where we jus...

Andrej Mitrovic (3/7) Jan 13 2013 What's the state of Optlink, have you finished the transition to C?
Andrej Mitrovic (2/3) Jan 13 2013 Anyway don't tease, put it online! :p

Walter Bright (3/6) Jan 13 2013 Have fun!

Andrej Mitrovic (2/4) Jan 13 2013 Cool! Can you tell us how this code is licensed?
Andrej Mitrovic (4/7) Jan 13 2013 I'm getting a build failure when running the root build.bat:

Walter Bright (2/9) Jan 13 2013 That's part of the dmc source code.

David Nadlinger (10/12) Jan 13 2013 Neither do I. You listed your reasons for sticking with the DMC

Andrei Alexandrescu (5/15) Jan 13 2013 I'll candidly mention that David's request was unclear to me. It's been

David Nadlinger (12/15) Jan 13 2013 If you want, yes - but not in the form of an actionable proposal

Chad J (9/22) Jan 13 2013 "gaining" he says ;)

Russel Winder (16/19) Jan 13 2013 =20

H. S. Teoh (26/41) Jan 13 2013 [...]
Jonathan M Davis (15/58) Jan 13 2013 If you want to use the lastest master, then you're going to be stuck wit...
Walter Bright (19/21) Jan 13 2013 I've done some research on auto-vectorization, i.e. "The Software Vector...

David Nadlinger (24/37) Jan 14 2013 Virtually all common C/C++ compilers have had SIMD

Walter Bright (22/38) Jan 14 2013 The fundamental problem they face is they cannot change the language - t...

David Nadlinger (17/26) Jan 14 2013 They *are* changing the language using SIMD intrinsics and

Walter Bright (3/6) Jan 14 2013 I agree it's more convenient not to implement it.

jerro (11/14) Jan 14 2013 I agree that adding SIMD types and operations to the languages is

Walter Bright (3/11) Jan 14 2013 Yeah, I know about those, I'll fix 'em. The SIMD support in the compiler...

Mehrdad (6/19) Jan 17 2013 Have you seen the Visual C++ 2012 compiler? It fixes that problem.

Walter Bright (2/7) Jan 17 2013 I would call it more of a workaround than a fix.

Russel Winder (21/36) Jan 13 2013 Having D in the GCC set is a good move for marketing the D language. GDC

David Nadlinger (11/13) Jan 14 2013 As we have a recent release now: Who do I have to ping about

Iain Buclaw (6/19) Jan 14 2013 I could set things in motion for you, but you'd probably have to nudge m...

David Nadlinger (3/6) Jan 14 2013 That would be great! How do we proceed?

Russel Winder (13/19) Jan 14 2013 I just emailed the listed packager cc yourself. I guess I should have
Russel Winder (13/14) Jan 14 2013 /Ian/s//Iain/
Iain Buclaw (5/9) Jan 14 2013 Thanks for the name correction. :)
Iain Buclaw (8/14) Jan 14 2013 Mattias is the current maintainer IIRC. I can request a transfer of

H. S. Teoh (10/24) Jan 14 2013 [...]

eles (13/24) Jan 14 2013 Hi,

Jordi Sayol (4/10) Jan 14 2013 LDC is already available on Fedora 17.
Russel Winder (15/16) Jan 14 2013 LDC is already available in Debian, it is just a very old version. Looks

Jacob Carlborg (5/8) Jan 13 2013 Then I would say, put in on github regardless if anyone wants to take a

Jacob Carlborg (5/9) Jan 13 2013 Sure, but if you had switch to LLVM, say three years ago, you would know...

Dmitry Olshansky (16/58) Jan 12 2013 So these are legit? Coooooool!

H. S. Teoh (6/11) Jan 11 2013 Gah, I meant @safe pure.
Walter Bright (33/35) Jan 11 2013 Thank you for doing this. Some notes:

Dmitry Olshansky (20/57) Jan 11 2013 Containers also have length and can be empty. Ranges on top of it are

Walter Bright (14/79) Jan 11 2013 Some more text about accessing it as a range would be helpful. Being a s...

H. S. Teoh (10/25) Jan 11 2013 [...]

Walter Bright (2/4) Jan 11 2013 Actually, that sounds like a good idea.

Dmitry Olshansky (5/9) Jan 12 2013 Except that it's created out of template parameters by wrapping them in

H. S. Teoh (20/31) Jan 12 2013 [...]

Dmitry Olshansky (10/38) Jan 12 2013 For basic needs there are 2 alias for custom Tries that map dchar ->

Walter Bright (4/6) Jan 12 2013 You can also try the ACRONYM macro:

Walter Bright (4/10) Jan 12 2013 More info:

monarch_dodra (5/13) Jan 12 2013 I don't have anything to add, but I still want to voice my
Jacob Carlborg (5/9) Jan 12 2013 I assume that "sicmp" is faster than "icmp"? In that case you might want...

Dmitry Olshansky (4/12) Jan 12 2013 Yeah, good point.

Walter Bright (5/7) Jan 12 2013 struct InversionList:

Dmitry Olshansky (16/24) Jan 12 2013 Containers commonly have both. But maybe I should kill length on

Jonathan M Davis (23/49) Jan 12 2013 According to std.container, length should never be defined unless it's =

Chad J (9/11) Jan 13 2013 Are there cases where it can't be O(1)?

Jonathan M Davis (15/22) Jan 13 2013 Most definitely. Take a doubly linked list for example. Either length ca...

monarch_dodra (11/45) Jan 13 2013 Yes, size is now "o(1)" in C++11.
Chad J (8/30) Jan 13 2013 Thanks!

Dmitry Olshansky (5/20) Jan 13 2013 Yeah, exactly.

Martin Nowak (2/5) Jan 15 2013 Great, I already had to do that myself for a DFA Lexer.

Dmitry Olshansky (14/19) Jan 16 2013 The only thing left that I'd love to improve is to integrate UTF

Dmitry Olshansky (29/33) Jan 16 2013 First of all, @safe pure and nothrow is back. Let me know if something

Walter Bright (18/20) Jan 16 2013 Thank you. Looking at the Terminology section (the reference to it at th...

Dmitry Olshansky (50/74) Jan 16 2013 Obviously it's not. The simple truth is only 110K+ codepoints are

Walter Bright (3/12) Jan 16 2013 I guess I have to accept it :-)

Dmitry Olshansky (9/17) Jan 16 2013 And a lot of people really want to go this way see e.g. this:

H. S. Teoh (98/117) Jan 17 2013 [...]

Dmitry Olshansky (13/128) Jan 17 2013 Looks like one commit wasn't pushed :(
Dmitry Olshansky (12/34) Jan 22 2013 It's indeed a bug in the compiler, looks funny:

Jonathan M Davis (6/12) Jan 22 2013 Prepending @safe is evil. ;)

Dmitry Olshansky <dmitry.olsh gmail.com> writes:

It's been long over due to present the work I did during the last GSOC 
as it was a summer not winter of code after all. Unfortunately some 
compiler bugs, a new job :) and unrelated events of importance have 
postponed its release beyond measure.

Anyway it's polished and ready for the good old collective destruction 
called peer review. I'm looking for a review manager.

The code, including extra tests and a benchmark is here:
https://github.com/blackwhale/gsoc-bench-2012

And documentation:
http://blackwhale.github.com/phobos/uni.html

And with that I'm asking if there is anything ready to be reviewed in 
the queue?

Currently I prefer to keep it standalone so read 'uni' everywhere as 
'std.uni'. On the plus side it makes it easy to try new 'uni' and 
compare with the old one without re-compiling the Phobos.

To use in place of std.uni replace 'std.uni'->'uni' in your programs and 
compare the results. Just make sure both uni and unicode_tables modules 
are linked in, usually rdmd can take care of this dependency.

The list of new functionality is quite large so I'll point out the major 
sections and let the rest to the documentation.

In general there are 3 angles to the new std.uni:

1) The same stuff but better and faster. For one thing isXXX 
classification functions are brought up to date with Unicode 6.2 and are 
up to 5-6x times faster on non-ASCII text.

2) The commonly expected stuff in any modern Unicode-aware language:
normalization, grapheme decoding, composition/decomposition and 
case-insensitive comparison*.

3) I've taken it as a crucial point to provide all of the tools used to 
build Unicode algorithms from ground up to the end user.
Thus all generally useful data structures used to implement the library 
internals are accessible for 'mortals' too:
  - a type for manipulating sets of codepoints, with full set algebra
  - a construction for generating fast multi-stage lookup tables (Trie)
  - a ton of predefined sets to construct your own specific ones from

There is an extra candy for meta-programming text processing libraries: 
a set type can generate source code of unrolled hard-coded binary search 
for a given set of codepoints.

Among other things the entire collection of data required is generated 
automatically by downloading from unicode.org. The tool relies on the 
same foundation (3) and for the most part this version of std.uni should 
be trivially updated to the new versions of the standard (see gen_uni.d 
script).

* The only missing 'big' thing is the collation algorithm. At this point 
I'm proposing to just move the large chunk of new std.uni in place. This 
way potential contributors would have tools to implement missing bits 
later on.

P.S. CodepointSet could be easily adjusted to serve as generic integer 
set type and Trie already supports far more the codepoints->values 
mappings. These should probably be enhanced and later adopted to 
std.container(2).

-- 
Dmitry Olshansky

Jan 11 2013

"H. S. Teoh" <hsteoh quickfur.ath.cx> writes:

On Fri, Jan 11, 2013 at 11:31:11PM +0400, Dmitry Olshansky wrote:
[...]
 Anyway it's polished and ready for the good old collective
 destruction called peer review. I'm looking for a review manager.
 
 The code, including extra tests and a benchmark is here:
 https://github.com/blackwhale/gsoc-bench-2012
 
 And documentation:
 http://blackwhale.github.com/phobos/uni.html

Excellent!! It looks pretty impressive. We should definitely try out
best to get this into Phobos.


[...]
 In general there are 3 angles to the new std.uni:
 
 1) The same stuff but better and faster. For one thing isXXX
 classification functions are brought up to date with Unicode 6.2 and
 are up to 5-6x times faster on non-ASCII text.

And there is a tool for auto-updating the data, correct? So in theory,
barring major structural changes in the standard, we should be able to
recompile Phobos (maybe after running the update tool) and it should be
automatically updated to the latest Unicode release?


 2) The commonly expected stuff in any modern Unicode-aware language:
 normalization, grapheme decoding, composition/decomposition and
 case-insensitive comparison*.

 3) I've taken it as a crucial point to provide all of the tools used
 to build Unicode algorithms from ground up to the end user.

Are there enough tools to implement line-breaking algorithms (TR14)?


 Thus all generally useful data structures used to implement the
 library internals are accessible for 'mortals' too:
  - a type for manipulating sets of codepoints, with full set algebra
  - a construction for generating fast multi-stage lookup tables (Trie)
  - a ton of predefined sets to construct your own specific ones from

I looked over the examples in the docs. Looks awesome!

One question though: how do I check a character in a specific unicode
category (say Ps, but not Pd, Pe, etc.)? I didn't see a function for the
specific category, and I assume it's overkill to have one function per
category, so I presume it should be possible to do this using the
codepoint sets? Can you add an example of this to the docs, if this is
possible?


[...]
 Among other things the entire collection of data required is
 generated automatically by downloading from unicode.org. The tool
 relies on the same foundation (3) and for the most part this version
 of std.uni should be trivially updated to the new versions of the
 standard (see gen_uni.d script).

Excellent!


 * The only missing 'big' thing is the collation algorithm. At this
 point I'm proposing to just move the large chunk of new std.uni in
 place. This way potential contributors would have tools to implement
 missing bits later on.

But the aforementioned tools should be enough for someone to use as the
basis for implementing the collation algorithm right?


 P.S. CodepointSet could be easily adjusted to serve as generic
 integer set type and Trie already supports far more the
 codepoints->values mappings. These should probably be enhanced and
 later adopted to std.container(2).

[...]

Yes, having an efficient integer set type and generic Trie type would be
a huge plus to add to std.container. Although, isn't Andrei supposed to
be working on a new std.container, with a better range-based API and
custom allocator scheme? If that's not going to happen for a while yet,
I say we should merge std.uni first, then worry about factoring Trie and
int set later (we can always just alias them in std.uni later when we
move them, that should prevent user code breakage).

Now, some nitpicks:

- InversionList:

   - I don't like using ~ to mean symmetric set difference, because ~
     already means concatenation in D, and it's confusing to overload it
     with an unrelated meaning. I propose using ^ instead, because
     symmetric set difference is analogous to XOR.

   - Why is the table of operators in opBinary's doc duplicated twice?
     Is the second table supposed to be something else?

- We should at least briefly describe what the various Unicode
  normalization forms mean (yes I know they can be looked up on
  unicode.org and various other online resources, but having them right
  there in the docs makes the docs so much more helpful).

- Why are the deprecated functions still in there? The deprecation
  messages say "It will be removed in August 2012", which is already
  past. So they should be removed by now.

- Is there a reason there's a composeJamo / hangulDecompose, but no
  other writing system specific functions (such as functions to deal
  with Thai consonant/vowel compositions)? Is this a peculiarity of the
  Unicode standard? Or should there be more general functions that
  handle composition/decomposition of compound characters?

- Why are many of the category checking functions  system rather than
   safe? It would seem rather crippling to me if much of std.uni can't
  be used from  safe code!

- It would also be nice if these functions can be made pure (I'm not
  sure I understand why checking the category of a character should be
  impure). The lack of nothrow I can understand, since the input
  character may be illegal or otherwise malformed. But  nothrow pure
  seems to me to be necessary for all category-checking functions.


T

-- 
The volume of a pizza of thickness a and radius z can be described by
the following formula: pi zz a. -- Wouter Verhelst

Jan 11 2013

Dmitry Olshansky <dmitry.olsh gmail.com> writes:

12-Jan-2013 00:35, H. S. Teoh пишет:
 On Fri, Jan 11, 2013 at 11:31:11PM +0400, Dmitry Olshansky wrote:
 [...]
 Anyway it's polished and ready for the good old collective
 destruction called peer review. I'm looking for a review manager.

 The code, including extra tests and a benchmark is here:
 https://github.com/blackwhale/gsoc-bench-2012

 And documentation:
 http://blackwhale.github.com/phobos/uni.html

 Excellent!! It looks pretty impressive. We should definitely try out
 best to get this into Phobos.


 [...]
 In general there are 3 angles to the new std.uni:

 1) The same stuff but better and faster. For one thing isXXX
 classification functions are brought up to date with Unicode 6.2 and
 are up to 5-6x times faster on non-ASCII text.

 And there is a tool for auto-updating the data, correct? So in theory,
 barring major structural changes in the standard, we should be able to
 recompile Phobos (maybe after running the update tool) and it should be
 automatically updated to the latest Unicode release?

Yup, except they tweak the algorithms all along. The normalization for 
instance is full of erratas and minor notes accumulated over time. I'll 
probably do a write up on how to implement it and where to find all the 
related info.

 2) The commonly expected stuff in any modern Unicode-aware language:
 normalization, grapheme decoding, composition/decomposition and
 case-insensitive comparison*.

 3) I've taken it as a crucial point to provide all of the tools used
 to build Unicode algorithms from ground up to the end user.

 Are there enough tools to implement line-breaking algorithms (TR14)?

Should be. AFAIK _word_ breaking seemed trivial to do, got to check the 
line breaking.

 Thus all generally useful data structures used to implement the
 library internals are accessible for 'mortals' too:
   - a type for manipulating sets of codepoints, with full set algebra
   - a construction for generating fast multi-stage lookup tables (Trie)
   - a ton of predefined sets to construct your own specific ones from

 I looked over the examples in the docs. Looks awesome!

 One question though: how do I check a character in a specific unicode
 category (say Ps, but not Pd, Pe, etc.)? I didn't see a function for the
 specific category, and I assume it's overkill to have one function per
 category, so I presume it should be possible to do this using the
 codepoint sets? Can you add an example of this to the docs, if this is
 possible?

auto psSet = unicode.Ps;
assert(psSet['(']); //should pass

 [...]
 Among other things the entire collection of data required is
 generated automatically by downloading from unicode.org. The tool
 relies on the same foundation (3) and for the most part this version
 of std.uni should be trivially updated to the new versions of the
 standard (see gen_uni.d script).

 Excellent!


 * The only missing 'big' thing is the collation algorithm. At this
 point I'm proposing to just move the large chunk of new std.uni in
 place. This way potential contributors would have tools to implement
 missing bits later on.

 But the aforementioned tools should be enough for someone to use as the
 basis for implementing the collation algorithm right?

I sure hope so. I've been looking through all of TRs and found nothing 
surprising beyond more and more tricky codepoint sets and more and more 
wierd rules to use these.

 P.S. CodepointSet could be easily adjusted to serve as generic
 integer set type and Trie already supports far more the
 codepoints->values mappings. These should probably be enhanced and
 later adopted to std.container(2).

 [...]

 Yes, having an efficient integer set type and generic Trie type would be
 a huge plus to add to std.container. Although, isn't Andrei supposed to
 be working on a new std.container, with a better range-based API and
 custom allocator scheme? If that's not going to happen for a while yet,
 I say we should merge std.uni first, then worry about factoring Trie and
 int set later (we can always just alias them in std.uni later when we
 move them, that should prevent user code breakage).

 Now, some nitpicks:

Thanks for early feedback.

 - InversionList:

     - I don't like using ~ to mean symmetric set difference, because ~
       already means concatenation in D, and it's confusing to overload it
       with an unrelated meaning. I propose using ^ instead, because
       symmetric set difference is analogous to XOR.

Point taken, arguably '~' was a bad idea. But I don't like ^ either. 
Seems like it's the best candidate though.


     - Why is the table of operators in opBinary's doc duplicated twice?
       Is the second table supposed to be something else?

Beats me. Must be some glitch with BOOKTABLE macro (or how I use it). I 
think I've hit it before.

 - We should at least briefly describe what the various Unicode
    normalization forms mean (yes I know they can be looked up on
    unicode.org and various other online resources, but having them right
    there in the docs makes the docs so much more helpful).

 - Why are the deprecated functions still in there? The deprecation
    messages say "It will be removed in August 2012", which is already
    past. So they should be removed by now.

Will do before Walter chimes in with his motto of keeping clutter just 
in case.

 - Is there a reason there's a composeJamo / hangulDecompose, but no
    other writing system specific functions (such as functions to deal
    with Thai consonant/vowel compositions)? Is this a peculiarity of the
    Unicode standard? Or should there be more general functions that
    handle composition/decomposition of compound characters?

Hangul are huge (11K+) and they have a specific algorithmic 
decomposition rule. Also they are hell of a special case in Unicode 
standard (that is mentioned in one chapter only!). Everything else is 
table-driven.

 - Why are many of the category checking functions  system rather than
     safe? It would seem rather crippling to me if much of std.uni can't
    be used from  safe code!

 system? Well, these got to be either  trusted or  safe.

 - It would also be nice if these functions can be made pure (I'm not
    sure I understand why checking the category of a character should be
    impure).

Global sophisticated immutable table. I think this can be pure but the 
compiler might have disagreed.

  The lack of nothrow I can understand, since the input
    character may be illegal or otherwise malformed.

nothrow is probably doable as any codepoint will do. (and if it's > 
dchar.max then it's a problem somewhere else).

TBH I've killed these qualifiers early on as it prevented the thing to 
compile. I can't recall is there is still a problem with any of theses.

 But  nothrow pure
    seems to me to be necessary for all category-checking functions.


-- 
Dmitry Olshansky

Jan 11 2013

"H. S. Teoh" <hsteoh quickfur.ath.cx> writes:

On Sat, Jan 12, 2013 at 01:06:30AM +0400, Dmitry Olshansky wrote:
 12-Jan-2013 00:35, H. S. Teoh пишет:
On Fri, Jan 11, 2013 at 11:31:11PM +0400, Dmitry Olshansky wrote:
[...]
Anyway it's polished and ready for the good old collective
destruction called peer review. I'm looking for a review manager.

The code, including extra tests and a benchmark is here:
https://github.com/blackwhale/gsoc-bench-2012

And documentation:
http://blackwhale.github.com/phobos/uni.html



[...]
2) The commonly expected stuff in any modern Unicode-aware language:
normalization, grapheme decoding, composition/decomposition and
case-insensitive comparison*.

3) I've taken it as a crucial point to provide all of the tools used
to build Unicode algorithms from ground up to the end user.

Are there enough tools to implement line-breaking algorithms (TR14)?

 
 Should be. AFAIK _word_ breaking seemed trivial to do, got to check
 the line breaking.

Line-breaking is in TR14. I looked over it. It's definitely more
complicated than I expected it to be. Some alphabets may even require
stylistic rules for line-breaking! (Not to mention hyphenation.) So it's
probably too much to expect std.uni to do it. But at least the tools
necessary to implement it should be there.


[...]
One question though: how do I check a character in a specific unicode
category (say Ps, but not Pd, Pe, etc.)? I didn't see a function for the
specific category, and I assume it's overkill to have one function per
category, so I presume it should be possible to do this using the
codepoint sets? Can you add an example of this to the docs, if this is
possible?

 
 auto psSet = unicode.Ps;
 assert(psSet['(']); //should pass

It would be nice to have a list of possible categories in the docs for
unicode.opDispatch/opCall, at least, the common categories, with a
hyperlink to unicode.org (or wherever) that provides the full list.
Currently, it's unclear, beyond the single example given, of what the
possibilities are.


[...]
- We should at least briefly describe what the various Unicode
   normalization forms mean (yes I know they can be looked up on
   unicode.org and various other online resources, but having them right
   there in the docs makes the docs so much more helpful).

 
- Why are the deprecated functions still in there? The deprecation
   messages say "It will be removed in August 2012", which is already
   past. So they should be removed by now.

 
 Will do before Walter chimes in with his motto of keeping clutter
 just in case.

In that case, I would keep the functions (aliases?) but remove them from
DDoc. So old code will still silently work, but the functions will no
longer be officially supported, there will be no documentation for them
and they can disappear anytime.


- Is there a reason there's a composeJamo / hangulDecompose, but no
   other writing system specific functions (such as functions to deal
   with Thai consonant/vowel compositions)? Is this a peculiarity of
   the Unicode standard? Or should there be more general functions
   that handle composition/decomposition of compound characters?

 Hangul are huge (11K+) and they have a specific algorithmic
 decomposition rule. Also they are hell of a special case in Unicode
 standard (that is mentioned in one chapter only!). Everything else
 is table-driven.

OK. Makes sense, I guess. Though it gave me the impression that std.uni
has some undue bias for Korean text. ;-)

[...]
- It would also be nice if these functions can be made pure (I'm not
   sure I understand why checking the category of a character should
   be impure).

 
 Global sophisticated immutable table. I think this can be pure but
 the compiler might have disagreed.

Hmm. This sounds like a bug in purity, maybe? It seems clear to me that
immutable values might as well be constants, and therefore using said
values should not make the code impure, even if they are implemented as
a global table. I think a bug should be filed for this.


 The lack of nothrow I can understand, since the input
   character may be illegal or otherwise malformed.

 
 nothrow is probably doable as any codepoint will do. (and if it's >
 dchar.max then it's a problem somewhere else).

True, an illegal character by definition does not belong to any
category, so all category functions should return false for it. No
exceptions need to be thrown.


 TBH I've killed these qualifiers early on as it prevented the thing
 to compile. I can't recall is there is still a problem with any of
 theses.

[...]

Before merging into Phobos, I would think we should put (as many of) the
qualifiers back (as possible). Preferably everwhere they make sense,
barring compiler bugs.


T

-- 
Государство делает вид, что платит нам
зарплату, а мы делаем вид, что работаем.

Jan 11 2013

Dmitry Olshansky <dmitry.olsh gmail.com> writes:

12-Jan-2013 01:33, H. S. Teoh пишет:
 On Sat, Jan 12, 2013 at 01:06:30AM +0400, Dmitry Olshansky wrote:
 12-Jan-2013 00:35, H. S. Teoh пишет:
 On Fri, Jan 11, 2013 at 11:31:11PM +0400, Dmitry Olshansky wrote:
 [...]
 Anyway it's polished and ready for the good old collective
 destruction called peer review. I'm looking for a review manager.

 The code, including extra tests and a benchmark is here:
 https://github.com/blackwhale/gsoc-bench-2012

 And documentation:
 http://blackwhale.github.com/phobos/uni.html



 [...]
 2) The commonly expected stuff in any modern Unicode-aware language:
 normalization, grapheme decoding, composition/decomposition and
 case-insensitive comparison*.

 3) I've taken it as a crucial point to provide all of the tools used
 to build Unicode algorithms from ground up to the end user.

 Are there enough tools to implement line-breaking algorithms (TR14)?

 Should be. AFAIK _word_ breaking seemed trivial to do, got to check
 the line breaking.

 Line-breaking is in TR14. I looked over it. It's definitely more
 complicated than I expected it to be. Some alphabets may even require
 stylistic rules for line-breaking! (Not to mention hyphenation.) So it's
 probably too much to expect std.uni to do it. But at least the tools
 necessary to implement it should be there.

I had given it a practiced look of tired Unicode implementor. It has a 
lot of subtle moments to it but should be doable.

I'm not about to implement it any time soon, but notes on how to do it 
are the following:

1. Avoid reading all of the reasons behind the algorithm listed there 
(there are good ones, a ton of them). So at first SKIP chapter 5 except 
maybe 5.3-5.4. And the fonts problems listed are not important to get it 
working.

2. Re-read the important chapters that is 2 and 6, 7. Others pretty much 
can be simply ignored until you start testing your implementation. 7 is 
actually pretty nice.

3. The first step to implement is to seek out the character classes 
(=sets) involved in algorithm. These are listed in Table.1 and the 
accompanying data file.

4. Process the file (see e.g. gen_uni.d) and create all of sets.

5. The core part of the algorithm itself is by the end of day not bad - 
section 6.1 lists all of the mandatory line breaking rules. Then there 
is a bunch of tweakable ones that are a bit harder. Plus see chapter 7 
for a (logically) pair-wise table driven solution (though the row/cols 
in the table are "belongs to the set X").

-- 
Dmitry Olshansky

Jan 13 2013

"David Nadlinger" <see klickverbot.at> writes:

On Friday, 11 January 2013 at 19:31:13 UTC, Dmitry Olshansky 
wrote:
 The code, including extra tests and a benchmark is here:
 https://github.com/blackwhale/gsoc-bench-2012

Phew, that repository is quite a hefty checkout with all the 
example data.

Just for the fun of it, a completely irreproducible, unscientific 
benchmark (sorry, Andrei!) run on a Core i7 M 620 under Linux 
x86_64:

DMD 2.061
---
$ rdmd -O -release -noboundscheck -inline char_class.d 
enwiki-latest-all-titles-in-ns0
Creating level 1 of Tries.
Alpha:262144
Mark:262144
Number:262144
Symbol:262144
Creating level 2 of Tries.
Alpha:9472
Mark:6656
Number:6656
Symbol:6400
Creating level 3 of Tries.
Alpha:4000
Mark:2304
Number:2432
Symbol:2848
Creating level 4 of Tries.
Alpha:3376
Mark:1976
Number:1624
Symbol:2096
Time to build dchar array: 2702 ms

Baselines
             noop [enwiki],   217.335, 888.99 M/s
    new-std-alpha [enwiki],   1197.49, 161.34 M/s
     new-std-mark [enwiki],   6374.73, 30.31 M/s
      new-std-num [enwiki],   6290.89, 30.71 M/s
      new-std-sym [enwiki],   6500.93, 29.72 M/s
    new-std-white [enwiki],   822.245, 234.98 M/s
  combining class [enwiki],   5948.63, 32.48 M/s
        inv-alpha [enwiki],     10290, 18.78 M/s
         inv-mark [enwiki],    9229.2, 20.93 M/s
          inv-num [enwiki],   9046.48, 21.36 M/s
          inv-sym [enwiki],   9843.59, 19.63 M/s

Tries of level 1
       trie-alpha [enwiki],   2179.17, 88.66 M/s
        trie-mark [enwiki],   1912.54, 101.02 M/s
         trie-num [enwiki],   1924.92, 100.37 M/s
         trie-sym [enwiki],   1911.22, 101.09 M/s

Tries of level 2
       trie-alpha [enwiki],   4524.33, 42.70 M/s
        trie-mark [enwiki],   4276.92, 45.17 M/s
         trie-num [enwiki],   4308.42, 44.84 M/s
         trie-sym [enwiki],   4298.68, 44.95 M/s

Tries of level 3
       trie-alpha [enwiki],   6407.64, 30.15 M/s
        trie-mark [enwiki],    6150.8, 31.41 M/s
         trie-num [enwiki],   6095.96, 31.69 M/s
         trie-sym [enwiki],    6253.7, 30.89 M/s

Tries of level 4
       trie-alpha [enwiki],   8329.66, 23.20 M/s
        trie-mark [enwiki],   7993.11, 24.17 M/s
         trie-num [enwiki],   8040.94, 24.03 M/s
         trie-sym [enwiki],   8033.07, 24.05 M/s
---

LDC (2.061 merge branch):
---
$ rdmd --compiler=ldmd2 -O3 -release -noboundscheck -inline 
char_class.d enwiki-latest-all-titles-in-ns0
Alpha:262144
Mark:262144
Number:262144
Symbol:262144
Creating level 2 of Tries.
Alpha:9472
Mark:6656
Number:6656
Symbol:6400
Creating level 3 of Tries.
Alpha:4000
Mark:2304
Number:2432
Symbol:2848
Creating level 4 of Tries.
Alpha:3376
Mark:1976
Number:1624
Symbol:2096
Time to build dchar array: 1962 ms

Baselines
             noop [enwiki],   211.516, 913.44 M/s
    new-std-alpha [enwiki],   612.324, 315.53 M/s
     new-std-mark [enwiki],   577.134, 334.77 M/s
      new-std-num [enwiki],   600.996, 321.48 M/s
      new-std-sym [enwiki],   577.433, 334.60 M/s
    new-std-white [enwiki],   381.535, 506.40 M/s
  combining class [enwiki],   517.231, 373.54 M/s
        inv-alpha [enwiki],   8181.25, 23.62 M/s
         inv-mark [enwiki],   6938.09, 27.85 M/s
          inv-num [enwiki],   6347.71, 30.44 M/s
          inv-sym [enwiki],    6930.7, 27.88 M/s

Tries of level 1
       trie-alpha [enwiki],   275.741, 700.69 M/s
        trie-mark [enwiki],   274.972, 702.65 M/s
         trie-num [enwiki],   274.558, 703.70 M/s
         trie-sym [enwiki],   277.255, 696.86 M/s

Tries of level 2
       trie-alpha [enwiki],   453.453, 426.08 M/s
        trie-mark [enwiki],   452.912, 426.59 M/s
         trie-num [enwiki],   455.004, 424.63 M/s
         trie-sym [enwiki],   453.492, 426.04 M/s

Tries of level 3
       trie-alpha [enwiki],   577.442, 334.59 M/s
        trie-mark [enwiki],   579.054, 333.66 M/s
         trie-num [enwiki],    577.95, 334.30 M/s
         trie-sym [enwiki],   577.566, 334.52 M/s

Tries of level 4
       trie-alpha [enwiki],   778.261, 248.26 M/s
        trie-mark [enwiki],   778.943, 248.04 M/s
         trie-num [enwiki],   778.352, 248.23 M/s
         trie-sym [enwiki],   779.442, 247.88 M/s
---

Haven't had a look at what is going on there yet – it could well 
be that some measurement loop is optimized out altogether.

David

Jan 11 2013

Dmitry Olshansky <dmitry.olsh gmail.com> writes:

12-Jan-2013 00:38, David Nadlinger пишет:
 On Friday, 11 January 2013 at 19:31:13 UTC, Dmitry Olshansky wrote:
 The code, including extra tests and a benchmark is here:
 https://github.com/blackwhale/gsoc-bench-2012

 Phew, that repository is quite a hefty checkout with all the example data.

 Just for the fun of it, a completely irreproducible, unscientific
 benchmark (sorry, Andrei!) run on a Core i7 M 620 under Linux x86_64:

 DMD 2.061
 ---
 $ rdmd -O -release -noboundscheck -inline char_class.d

[snip]

I'd rather use dewiki or arwiki these give the more coverage otherwise 
you just check the presence of ASCII special casing (only isAlpha and 
isWhite have it).

 Time to build dchar array: 2702 ms

 Baselines
              noop [enwiki],   217.335, 888.99 M/s
     new-std-alpha [enwiki],   1197.49, 161.34 M/s
      new-std-mark [enwiki],   6374.73, 30.31 M/s
       new-std-num [enwiki],   6290.89, 30.71 M/s
       new-std-sym [enwiki],   6500.93, 29.72 M/s
     new-std-white [enwiki],   822.245, 234.98 M/s
   combining class [enwiki],   5948.63, 32.48 M/s
         inv-alpha [enwiki],     10290, 18.78 M/s
          inv-mark [enwiki],    9229.2, 20.93 M/s
           inv-num [enwiki],   9046.48, 21.36 M/s
           inv-sym [enwiki],   9843.59, 19.63 M/s

 Tries of level 1
        trie-alpha [enwiki],   2179.17, 88.66 M/s
         trie-mark [enwiki],   1912.54, 101.02 M/s
          trie-num [enwiki],   1924.92, 100.37 M/s
          trie-sym [enwiki],   1911.22, 101.09 M/s

 Tries of level 2
        trie-alpha [enwiki],   4524.33, 42.70 M/s
         trie-mark [enwiki],   4276.92, 45.17 M/s
          trie-num [enwiki],   4308.42, 44.84 M/s
          trie-sym [enwiki],   4298.68, 44.95 M/s

 Tries of level 3
        trie-alpha [enwiki],   6407.64, 30.15 M/s
         trie-mark [enwiki],    6150.8, 31.41 M/s
          trie-num [enwiki],   6095.96, 31.69 M/s
          trie-sym [enwiki],    6253.7, 30.89 M/s

 Tries of level 4
        trie-alpha [enwiki],   8329.66, 23.20 M/s
         trie-mark [enwiki],   7993.11, 24.17 M/s
          trie-num [enwiki],   8040.94, 24.03 M/s
          trie-sym [enwiki],   8033.07, 24.05 M/s
 ---



 LDC (2.061 merge branch):
 ---
 $ rdmd --compiler=ldmd2 -O3 -release -noboundscheck -inline char_class.d
 enwiki-latest-all-titles-in-ns0

[snip]

 Baselines
              noop [enwiki],   211.516, 913.44 M/s
     new-std-alpha [enwiki],   612.324, 315.53 M/s
      new-std-mark [enwiki],   577.134, 334.77 M/s
       new-std-num [enwiki],   600.996, 321.48 M/s
       new-std-sym [enwiki],   577.433, 334.60 M/s
     new-std-white [enwiki],   381.535, 506.40 M/s
   combining class [enwiki],   517.231, 373.54 M/s
         inv-alpha [enwiki],   8181.25, 23.62 M/s
          inv-mark [enwiki],   6938.09, 27.85 M/s
           inv-num [enwiki],   6347.71, 30.44 M/s
           inv-sym [enwiki],    6930.7, 27.88 M/s

 Tries of level 1
        trie-alpha [enwiki],   275.741, 700.69 M/s
         trie-mark [enwiki],   274.972, 702.65 M/s
          trie-num [enwiki],   274.558, 703.70 M/s
          trie-sym [enwiki],   277.255, 696.86 M/s

 Tries of level 2
        trie-alpha [enwiki],   453.453, 426.08 M/s
         trie-mark [enwiki],   452.912, 426.59 M/s
          trie-num [enwiki],   455.004, 424.63 M/s
          trie-sym [enwiki],   453.492, 426.04 M/s

 Tries of level 3
        trie-alpha [enwiki],   577.442, 334.59 M/s
         trie-mark [enwiki],   579.054, 333.66 M/s
          trie-num [enwiki],    577.95, 334.30 M/s
          trie-sym [enwiki],   577.566, 334.52 M/s

 Tries of level 4
        trie-alpha [enwiki],   778.261, 248.26 M/s
         trie-mark [enwiki],   778.943, 248.04 M/s
          trie-num [enwiki],   778.352, 248.23 M/s
          trie-sym [enwiki],   779.442, 247.88 M/s
 ---

 Haven't had a look at what is going on there yet – it could well be that
 some measurement loop is optimized out altogether.

You can print total counts after each bench, there is a TLS varaible 
written at the end of it. But anyway I like your numbers! :)

 David


-- 
Dmitry Olshansky

Jan 11 2013

"David Nadlinger" <see klickverbot.at> writes:

On Friday, 11 January 2013 at 20:57:57 UTC, Dmitry Olshansky 
wrote:
 You can print total counts after each bench, there is a TLS 
 varaible written at the end of it. But anyway I like your 
 numbers! :)

Okay, I couldn't resist having a short look at the results, 
specifically the benchmark of the new isSymbol implementation, 
where LDC beats DMD by roughly 10x. The reason for the nice 
performance results is mainly that LDC optimizes the classifyCall 
loop containing the trie lookup down to the following fairly 
optimal piece of code (eax is the overall counter that gets 
stored to lastCount):

---
   40bc90:       8b 55 00                mov    edx,DWORD PTR 
[rbp+0x0]
   40bc93:       89 d6                   mov    esi,edx
   40bc95:       c1 ee 0d                shr    esi,0xd
   40bc98:       40 0f b6 f6             movzx  esi,sil
   40bc9c:       0f b6 34 31             movzx  esi,BYTE PTR 
[rcx+rsi*1]
   40bca0:       48 83 c5 04             add    rbp,0x4
   40bca4:       0f b6 da                movzx  ebx,dl
   40bca7:       c1 e6 05                shl    esi,0x5
   40bcaa:       c1 ea 08                shr    edx,0x8
   40bcad:       83 e2 1f                and    edx,0x1f
   40bcb0:       09 f2                   or     edx,esi
   40bcb2:       41 0f b7 14 50          movzx  edx,WORD PTR 
[r8+rdx*2]
   40bcb7:       c1 e2 08                shl    edx,0x8
   40bcba:       09 da                   or     edx,ebx
   40bcbc:       48 c1 ea 06             shr    rdx,0x6
   40bcc0:       4c 01 ca                add    rdx,r9
   40bcc3:       48 8b 14 d1             mov    rdx,QWORD PTR 
[rcx+rdx*8]
   40bcc7:       48 0f a3 da             bt     rdx,rbx
   40bccb:       83 d0 00                adc    eax,0x0
   40bcce:       48 ff cf                dec    rdi
   40bcd1:       75 bd                   jne    40bc90
---

The code DMD generates for the lookup, on the other hand, is 
pretty ugly, including several values being spilled to the stack, 
and also doesn't get inlined.

This is, of course, just a microbenchmark, but it is cases like 
this which make me wish that we would just use LLVM (or GCC, for 
that matter) for the reference compiler – and I'm not talking 
about the slightly Frankensteinian endeavor that LDC is here. 
Walter, my intention is not at all to doubt your ability at a 
compiler writer; we all know the stories of how you used to annoy 
the team leads at the big companies by beating their performance 
numbers single-handedly, and I'm sure you could e.g. fix your 
backend to match the performance of the LDC-generated code for 
Dmitry's benchmark in no time. The question is just: Are we as a 
community big, resourceful enough to justify spending time on 
that?

Sure, there would still be things we will have to fix ourselves 
when using another backend, such as SEH support in LLVM. But 
performance will always be a central selling point of a language 
like D, and do we really want to take the burden of keeping up 
with the competition ourselves, when we can just draw on the work 
of full-time backend developers at Intel, AMD, Apple and others 
for free? Given the current developments in microprocessors and 
given that applications such as graphics and scientific computing 
are naturally a good fit for D, what's next? You taking a year 
off from active language development to implement an 
auto-vectorizer for your backend?

I know this question has been brought up before (if never really 
answered), and I don't want to start another futile discussion, 
but given the developments in the compiler/languages landscape 
over the last few years, it strikes me as an increasingly bad 
decision to stick with an obscure, poorly documented backend 
which nobody knows how to use – and nobody wants to learn how to 
use either, because, oops, they couldn't even redistribute their 
own work.

Let's put aside all the other arguments (most of which I didn't 
even mention) for a moment, even the performance aspect; I think 
that the productivity aspect alone, both regarding duplicated 
work and accessibility of the project to new developers, makes it 
hard to justify forging leveraging the momentum of an established 
backend project like LLVM. [1]

Maybe it is naïve to think that the situation could ever change 
for DMD. But I sincerely hope that the instant a promising 
self-hosted (as far as the frontend goes) compiler project shows 
up at the horizon, it will gain the necessary amount of official 
endorsement – and manpower, especially in the form of your 
(Walter's) expertise – to make that final, laborious stretch to 
release quality. If we just sit there and wait for somebody to 
come along with a new production-ready compiler which is better, 
faster and shinier than DMD, we will wait for a long, long time – 
this might happen for a Lisp dialect, but not for D.

Sorry for the rant, [2]
David



[1] The reasons for which I'm focusing on LLVM here are not so 
much its technical qualities as its liberal BSD-like license – if 
it is good enough for Apple, Intel (also a compiler vendor) and 
their lawyer teams, it is probably also for us. The code could 
even be integrated into commercial products such as DMC without 
problems.


[2] And for any typos which might undermine my credibility – it 
is way too early in the morning here.

Jan 11 2013

Walter Bright <newshound2 digitalmars.com> writes:

On 1/11/2013 9:17 PM, David Nadlinger wrote:
 Sorry for the rant, [2]
 David

No problem. I think it's great that you're the champion for LDC. Having 3
robust 
compilers for D is only a win, and I don't think having 3 takes away from D at
all.

The 3 compilers have different strengths and weaknesses. For example, LDC 
doesn't work with VS, and so would have cost us the possible design win at
Remedy.

Jan 11 2013

"David Nadlinger" <see klickverbot.at> writes:

On Saturday, 12 January 2013 at 05:44:00 UTC, Walter Bright wrote:
 No problem. I think it's great that you're the champion for LDC.
 Having 3 robust compilers for D is only a win, and I don't 
 think having 3 takes away from D at all.

I precisely did *not* want to drag LDC into this discussion, but 
I guess it was unavoidable given that I talked about it in the 
first part of my post, and of course because I'm the one 
maintaining it right now.

So, just to clarify, I took up LDC development mainly because it 
produces significantly faster code for some of my projects, and 
it would have been a shame to let it die from a marketing point 
of view. If I had lots of spare time (which I don't have right 
now) I'd certainly try to help out with a D-based compiler 
project instead. So it's not like I'm overly attached to LDC, 
emotionally – resurrecting it just seemed like the most viable 
short-term solution.

But yes, you are right, having 3 robust compilers for D would 
certainly be a good thing for D. But unfortunately, I think it 
would be quite a stretch to say that we have even a single such 
one right now.

To elaborate – and again, please don't take this as a personal 
attack –, every time this discussion comes up you seem to resort 
to the »diversity is good« cliché without ever actually 
responding to the arguments other people bring up. You want to 
stick to the DMC backend? Fine, that's your good right, but 
wouldn't it only be fair to share your reasons with the other 
people who are investing considerable amounts of work and time in 
»your« language?

As a note aside, I think this point is much deeper than it might 
appear at first. Sociology is not exactly my field, so I don't 
know what current research has to say about this, but at least to 
me it seems obvious that a healthy culture of communication is 
essential for the success of any non-trivial open source project. 
Even if it would be short-sighted to claim that all disputes can 
be resolved by a good technical discussion, it seems to go a long 
way towards reaching a consensus in most cases; and many 
engineers/scientists/hackers seem to regard it as the norm. On 
the other hand, it seems like failure to communicate the reasons 
behind a given decision is one of the most effective ways to 
cause (potential) contributors to turn away, looking for a more 
promising project to throw their spare time at. And to be honest, 
I think some of the most unfortunate events in the history of D 
can be directly traced back to such communication failures (I 
hope it is obvious what I'm alluding to here?).

But let's return to the particular topic I brought up. To recap, 
my impression is that several large players in the industry have 
started to pool their efforts regarding compiler development, as 
spending the ever-increasing amount of work needed to come up 
with their own state-of-the-art optimizer is no longer justified 
by the potential advantages this would offer over their 
competitors – maybe quite similar to the almost universal success 
of Linux in the OS domain.

So, in short, even if the DMC backend was technically superior to 
every other solution out there right now, I would be hard-pressed 
to make an argument for preferring it over a more widely used 
solution like LLVM because of this. I cannot see how putting work 
into a custom backend could realistically help to make D more 
attractive compared to other languages, so the current situation 
seems like a case of making things unnecessarily hard for 
ourselves.

Of course, as with any open source project, it would be foolish 
to try and tell you what to spend your time on. But I would 
really be interested in your long-term vision for D(MD), because 
as noted before I can't see any signs that the expectations at a 
compiler backend will drop in the future (more and more vector 
units in processors, the rise of ARM, ...), and even when 
neglecting the platform support issue, I'm not sure how good an 
offer D would still be if its main implementation started to 
perform significantly worse than common C++ compilers.

To summarize: Although there is quite a lot of promising 
development going on in the language space right now, I still 
think D has an excellent offer to make, otherwise I would left 
the camp long ago. But it doesn't seem wise to intentionally 
weaken our position in the arena by spending considerable amounts 
of manpower on solving a problem from the ground up for which 
there is already a free off-the-shelf solution that would get us 
at least 95% there – and the fact that our resources are limited 
is precisely what you seem to neglect when reciting the 
»diversity is good« mantra.


 For example, LDC doesn't work with VS, and so would have cost
 us the possible design win at Remedy.

In view of the above, I'm not sure how much weight this argument 
could carry, but I don't think it is valid in the first place: If 
our reference compiler was backed by LLVM, you could have just 
added CV8 support to it like you did for DMD. By the way, the 
other thing missing from LLVM regarding VS compatibility is SEH 
support, but DMD doesn't implement that either (LLVM actually 
supports EH on MinGW right now, so the implementation could 
likely have been re-used as a workaround, just like you did for 
DMD).


David

Jan 13 2013

Walter Bright <newshound2 digitalmars.com> writes:

On 1/13/2013 3:10 PM, David Nadlinger wrote:
 To elaborate – and again, please don't take this as a personal attack –,
every
 time this discussion comes up you seem to resort to the »diversity is good«
 cliché

Being a cliche doesn't make it invalid.

 without ever actually responding to the arguments other people bring up.
 You want to stick to the DMC backend? Fine, that's your good right, but
wouldn't
 it only be fair to share your reasons with the other people who are investing
 considerable amounts of work and time in »your« language?

It's long ceased to be my language. It's a group creation.

Anyhow, I like using the dmc backend because:

1. I can completely control the user experience.

2. I can fix a problem where it should be fixed, rather than working around it.

3. I understand it completely, and so I can reasonably quickly add in things 
that some other back end may not support.

4. It's fast.

5. I don't run any risk of "you stole X's ideas when you worked on X's back
end."

6. I enjoy working on it.


 So, in short, even if the DMC backend was technically superior to every other
 solution out there right now, I would be hard-pressed to make an argument for
 preferring it over a more widely used solution like LLVM because of this. I
 cannot see how putting work into a custom backend could realistically help to
 make D more attractive compared to other languages, so the current situation
 seems like a case of making things unnecessarily hard for ourselves.

This assumes that the existence of dmd takes away from ldc. I don't believe it 
does. Ldc doesn't need me.

 To summarize: Although there is quite a lot of promising development going on
in
 the language space right now, I still think D has an excellent offer to make,
 otherwise I would left the camp long ago. But it doesn't seem wise to
 intentionally weaken our position in the arena by spending considerable amounts
 of manpower on solving a problem from the ground up for which there is already
a
 free off-the-shelf solution that would get us at least 95% there – and the
fact
 that our resources are limited is precisely what you seem to neglect when
 reciting the »diversity is good« mantra.

I'm not convinced that the effort to keep up with llvm in order to work with it 
is actually less time.


 For example, LDC doesn't work with VS, and so would have cost
 us the possible design win at Remedy.

 In view of the above, I'm not sure how much weight this argument could carry,
 but I don't think it is valid in the first place: If our reference compiler was
 backed by LLVM, you could have just added CV8 support to it like you did for
 DMD.

The thing is, I don't see how that would have saved time. It would have been 
more time - I already had CV4 support in dmd, so I had a head start on CV8. 
There were also a number of wacky things about Mscoff64 - how to integrate them 
into LLVM would have required a thorough knowledge of it on my part, and *then* 
I'd still have to implement it. There were also the MS64 ABI changes, which 
consumed a considerable amount of time. If that is not in LLVM, I would have 
been in for a world of effort and would have had to do some considerable
surgery 
on it.

 By the way, the other thing missing from LLVM regarding VS compatibility is
 SEH support, but DMD doesn't implement that either (LLVM actually supports EH
on
 MinGW right now, so the implementation could likely have been re-used as a
 workaround, just like you did for DMD).

I think that underestimates what all is necessary to be ABI and format 
compatible with VC64. SEH is only a small part of it. For example,

1. Win64 ABI, which is fundamentally different from any other platform
2. Making it compatible with and use VC++'s runtime
3. Generating MsCoff 64.
4. Adding the wacky pdata and xdata sections
5. Learning the completely undocumented CV8 format, then implementing it

Doing this all requires a fairly comprehensive understanding of the innards of 
the back end. If LLVM is lacking one or more of 1..5, then I wouldn't be 
surprised a bit if it took me considerably *longer* to do it for LLVM, and then 
I'd have to sell my changes to the LLVM group. (If they don't buy it, then I'm 
stuck with a fork I have to maintain myself. Ugh.)

Jan 13 2013

"David Nadlinger" <see klickverbot.at> writes:

On Sunday, 13 January 2013 at 23:57:48 UTC, Walter Bright wrote:
 On 1/13/2013 3:10 PM, David Nadlinger wrote:
 To elaborate – and again, please don't take this as a personal 
 attack –, every
 time this discussion comes up you seem to resort to the 
 »diversity is good«
 cliché

 Being a cliche doesn't make it invalid.

You conveniently ended your citation before the »without ever 
actually responding to the arguments other people bring up« part.

 You want to stick to the DMC backend? Fine, that's your good 
 right, but wouldn't
 it only be fair to share your reasons with the other people 
 who are investing
 considerable amounts of work and time in »your« language?

 It's long ceased to be my language. It's a group creation.

 Anyhow, I like using the dmc backend because:

Let me reply specifically focusing on LLVM:

 1. I can completely control the user experience.

You can do this with LLVM as well, as you can modify its source 
at will. Let me also note that this means that *only you* can fix 
certain parts of the user experience (see for example the Optlink 
issues).

And besides, to be brutally honest, there has not been too much 
good feedback about the DMD toolchain so far, apart from 
compilation speeds, has there? The reason for this might of 
course be that users just expect a D compiler to work just like 
every other C/C++ compiler, but I can't see how having a 
potentially bigger degree of control would have helped much so 
far.

 2. I can fix a problem where it should be fixed, rather than 
 working around it.

You can do this with LLVM as well – after all, that's precisely 
the point of open source software. And additionally, it has the 
big advantage that other people can fix problems in it too, and 
in many cases, the LLVM people will even do the work for you.

 3. I understand it completely, and so I can reasonably quickly 
 add in things that some other back end may not support.

LLVM is reasonably well documented and has many useful debugging 
and visualization features, so I'm sure you would find your way 
around it quickly if you wanted to.

 4. It's fast.

No objection there. But let me note that its weak(er) optimizer 
might get it disqualified as far as release builds go before this 
argument even counts.

 5. I don't run any risk of "you stole X's ideas when you worked 
 on X's back end."

You can use any code from LLVM under a BSD-like license. Would 
this be too restrictive? I am not a lawyer, but I'm reasonably 
certain it would even allow you to just take the whole LLVM code 
base and sell it as a closed-source package.

 6. I enjoy working on it.

Fair enough, and all too understandable given that it's your 
creation! But to put it bluntly, it's a hassle for almost 
everybody else working on the compiler, so a project you want to 
attract other contributors to might not be the best environment 
to pursue this »hobby«.

 So, in short, even if the DMC backend was technically superior 
 to every other
 solution out there right now, I would be hard-pressed to make 
 an argument for
 preferring it over a more widely used solution like LLVM 
 because of this. I
 cannot see how putting work into a custom backend could 
 realistically help to
 make D more attractive compared to other languages, so the 
 current situation
 seems like a case of making things unnecessarily hard for 
 ourselves.

 This assumes that the existence of dmd takes away from ldc. I 
 don't believe it does. Ldc doesn't need me.

Why are you even bringing up LDC here? My last post must have 
been very unclear in this regard? But of course, LDC does need 
you – after all, it uses the DMD frontend!

My entire point is that *D* needs you and your expertise: Having 
a veteran compiler writer on board is just about the best thing 
that can happen to a language effort; just because of your 
experience you are infinitely more productive in some areas than 
most of us other contributors are. What I'm trying to say is that 
if you want D to succeed, which I assume you do, spending (part 
of) your time on writing a custom backend might not be the best 
idea: First, you could have spent that time working on things 
that help to sell the language, and second, doing so actually 
makes life *harder* for other contributors.

 To summarize: Although there is quite a lot of promising 
 development going on in
 the language space right now, I still think D has an excellent 
 offer to make,
 otherwise I would left the camp long ago. But it doesn't seem 
 wise to
 intentionally weaken our position in the arena by spending 
 considerable amounts
 of manpower on solving a problem from the ground up for which 
 there is already a
 free off-the-shelf solution that would get us at least 95% 
 there – and the fact
 that our resources are limited is precisely what you seem to 
 neglect when
 reciting the »diversity is good« mantra.

 I'm not convinced that the effort to keep up with llvm in order 
 to work with it is actually less time.

Less time than what? Developing a code generator that matches 
theirs? You are certainly a very capable compiler engineer, but 
LLVM has quite a few bright minds working on it as well, 
full-time. I honestly don't see how »keeping up with LLVM« could 
be more time-consuming than replicating all of their work. Maybe 
you can elaborate on what exactly you meant?

But for what it's worth, Kai and I have had no problems with 
keeping LDC up to date with the latest LLVM version for the last 
few upstream releases, and we both can't spend too much time on 
LDC, so this likely wouldn't be a big issue for you either.

 For example, LDC doesn't work with VS, and so would have cost
 us the possible design win at Remedy.

 In view of the above, I'm not sure how much weight this 
 argument could carry,
 but I don't think it is valid in the first place: If our 
 reference compiler was
 backed by LLVM, you could have just added CV8 support to it 
 like you did for
 DMD.

 The thing is, I don't see how that would have saved time. It 
 would have been more time - I already had CV4 support in dmd, 
 so I had a head start on CV8. There were also a number of wacky 
 things about Mscoff64 - how to integrate them into LLVM would 
 have required a thorough knowledge of it on my part, and *then* 
 I'd still have to implement it. There were also the MS64 ABI 
 changes, which consumed a considerable amount of time. If that 
 is not in LLVM, I would have been in for a world of effort and 
 would have had to do some considerable surgery on it.

If I'm not mistaken, most of the related work for LLVM was done 
by the end of 2011.

 By the way, the other thing missing from LLVM regarding VS 
 compatibility is
 SEH support, but DMD doesn't implement that either (LLVM 
 actually supports EH on
 MinGW right now, so the implementation could likely have been 
 re-used as a
 workaround, just like you did for DMD).

 I think that underestimates what all is necessary to be ABI and 
 format compatible with VC64. SEH is only a small part of it.

 For example,

 1. Win64 ABI, which is fundamentally different from any other 
 platform
 2. Making it compatible with and use VC++'s runtime
 3. Generating MsCoff 64.
 4. Adding the wacky pdata and xdata sections

I was not just blindly guessing in my last post. LLVM supports 
directly emitting 64 bit COFF, and as far as I know the code for 
writing out the RUNTIME_FUNCTION/UNWIND_INFO instances is 
functional.

 5. Learning the completely undocumented CV8 format, then 
 implementing it

 Doing this all requires a fairly comprehensive understanding of 
 the innards of the back end. If LLVM is lacking one or more of 
 1..5, then I wouldn't be surprised a bit if it took me 
 considerably *longer* to do it for LLVM, and then I'd have to 
 sell my changes to the LLVM group. (If they don't buy it, then 
 I'm stuck with a fork I have to maintain myself. Ugh.)

Given the modular design of LLVM, maintaining a fork probably 
wouldn't even be too bad; many companies using it e.g. have their 
own private backends. That being said, the LLVM folks are 
generally very open with accepting contributions into the main 
tree (except for 32 bit SEH support, where the Apple people are 
afraid of issues regarding that notorious Borland patent).

But note that I didn't even want to convince you that 
implementing a given new feature (such as CV8) for LLVM would 
take you _less_ time than for your own backend. I'm just saying 
that I would expect it to take a comparable amount of time, and 
even if it might take you some time to get familiar with the LLVM 
internals at first, I'm sure that initial cost would quickly be 
offset by the fact that you don't have to write *everything* on 
your own. I wonder if there is another production-grade backend 
out there that is maintained by a single person?

David

Jan 13 2013

Walter Bright <newshound2 digitalmars.com> writes:

On 1/13/2013 6:07 PM, David Nadlinger wrote:
 1. I can completely control the user experience.

 You can do this with LLVM as well, as you can modify its source at will. Let me
 also note that this means that *only you* can fix certain parts of the user
 experience (see for example the Optlink issues).

I don't really want to get into a long back-and-forth thing where we just keep 
repeating our statements.

But as for optlink, I would have happily turned it over to anyone who expressed 
interest in working on it. I'd also put it on github if anyone wanted to look
at it.

Jan 13 2013

Andrej Mitrovic <andrej.mitrovich gmail.com> writes:

On 1/14/13, Walter Bright <newshound2 digitalmars.com> wrote:
 But as for optlink, I would have happily turned it over to anyone who
 expressed
 interest in working on it. I'd also put it on github if anyone wanted to
 look at it.

What's the state of Optlink, have you finished the transition to C?
IIRC we've asked for this years ago.

Jan 13 2013

Andrej Mitrovic <andrej.mitrovich gmail.com> writes:

On 1/14/13, Walter Bright <newshound2 digitalmars.com> wrote:
 I'd also put it on github if anyone wanted to look at it.

Anyway don't tease, put it online! :p

Jan 13 2013

Walter Bright <newshound2 digitalmars.com> writes:

On 1/13/2013 6:40 PM, Andrej Mitrovic wrote:
 On 1/14/13, Walter Bright <newshound2 digitalmars.com> wrote:
 I'd also put it on github if anyone wanted to look at it.

 Anyway don't tease, put it online! :p

Have fun!

https://github.com/DigitalMars/optlink

Jan 13 2013

Andrej Mitrovic <andrej.mitrovich gmail.com> writes:

On 1/14/13, Walter Bright <newshound2 digitalmars.com> wrote:
 Have fun!
 https://github.com/DigitalMars/optlink

Cool! Can you tell us how this code is licensed?

Jan 13 2013

Andrej Mitrovic <andrej.mitrovich gmail.com> writes:

 On 1/14/13, Walter Bright <newshound2 digitalmars.com> wrote:
 Have fun!
 https://github.com/DigitalMars/optlink


I'm getting a build failure when running the root build.bat:

Fatal error: unable to open input file 'scio.h'

Tried with both VC9 and VC10's nmake, I can't find this file in DMC,
optlink, or VCs include dirs.

Jan 13 2013

Walter Bright <newshound2 digitalmars.com> writes:

On 1/13/2013 8:00 PM, Andrej Mitrovic wrote:
 On 1/14/13, Walter Bright <newshound2 digitalmars.com> wrote:
 Have fun!
 https://github.com/DigitalMars/optlink


 I'm getting a build failure when running the root build.bat:

 Fatal error: unable to open input file 'scio.h'

 Tried with both VC9 and VC10's nmake, I can't find this file in DMC,
 optlink, or VCs include dirs.

That's part of the dmc source code.

Jan 13 2013

"David Nadlinger" <see klickverbot.at> writes:

On Monday, 14 January 2013 at 02:24:38 UTC, Walter Bright wrote:
 I don't really want to get into a long back-and-forth thing 
 where we just keep repeating our statements.

Neither do I. You listed your reasons for sticking with the DMC 
backend, and I tried to outline how I would judge things 
differently and why. I guess discussions generally work like 
this. :)

Anyway, thanks for responding to my post in the first place. I 
very much appreciate that, as I don't think you ever commented on 
the topic in the last few years in any detail at all, at least as 
far as the public forums are concerned.

David

Jan 13 2013

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

On 1/13/13 10:22 PM, David Nadlinger wrote:
 On Monday, 14 January 2013 at 02:24:38 UTC, Walter Bright wrote:
 I don't really want to get into a long back-and-forth thing where we
 just keep repeating our statements.

 Neither do I. You listed your reasons for sticking with the DMC backend,
 and I tried to outline how I would judge things differently and why. I
 guess discussions generally work like this. :)

 Anyway, thanks for responding to my post in the first place. I very much
 appreciate that, as I don't think you ever commented on the topic in the
 last few years in any detail at all, at least as far as the public
 forums are concerned.

I'll candidly mention that David's request was unclear to me. It's been 
winding enough that I missed the key sentence. Was it that we switch dmd 
to using llvm as backend?

Andrei

Jan 13 2013

"David Nadlinger" <see klickverbot.at> writes:

On Monday, 14 January 2013 at 04:42:33 UTC, Andrei Alexandrescu 
wrote:
 I'll candidly mention that David's request was unclear to me. 
 It's been winding enough that I missed the key sentence. Was it 
 that we switch dmd to using llvm as backend?

If you want, yes - but not in the form of an actionable proposal 
yet. I was trying to argue that the benefits of using an existing 
solution like GCC or LLVM are large enough (resp. the costs of 
using a custom backend high enough) that we should seriously 
consider doing so. Especially because it looks as if the amount 
of work needed to keep the DMD backend and thus the reference D 
compiler competitive is going to increase further as other 
backends are gaining things like auto-vectorization to light up 
modern CPUs and ARM is gaining in importance.

David

Jan 13 2013

Chad J <chadjoan __spam.is.bad__gmail.com> writes:

On 01/14/2013 12:33 AM, David Nadlinger wrote:
 On Monday, 14 January 2013 at 04:42:33 UTC, Andrei Alexandrescu wrote:
 I'll candidly mention that David's request was unclear to me. It's
 been winding enough that I missed the key sentence. Was it that we
 switch dmd to using llvm as backend?

 If you want, yes - but not in the form of an actionable proposal yet. I
 was trying to argue that the benefits of using an existing solution like
 GCC or LLVM are large enough (resp. the costs of using a custom backend
 high enough) that we should seriously consider doing so. Especially
 because it looks as if the amount of work needed to keep the DMD backend
 and thus the reference D compiler competitive is going to increase
 further as other backends are gaining things like auto-vectorization to
 light up modern CPUs and ARM is gaining in importance.

 David

"gaining" he says ;)

D just missed out on the leading edge of smartphone games.  That is a 
HUGE market packed with small developers that can easily adopt new 
tooling.  We got the invite and we stood them up.  IMO, sticking to an 
x86-centric toolset cost D one of its perfect opportunities for being a 
killer tool.  That makes me kinda sad.

Sorry for the downer.  I bring it up in the hope that we can learn from 
it.  I like to think that we'll see more opportunities in the future.

Jan 13 2013

Russel Winder <russel winder.org.uk> writes:

On Mon, 2013-01-14 at 01:30 -0500, Chad J wrote:
[=E2=80=A6]
 tooling.  We got the invite and we stood them up.  IMO, sticking to an=

=20
 x86-centric toolset cost D one of its perfect opportunities for being a=

=20
 killer tool.  That makes me kinda sad.

If you can't do x86_64 and ARM you are legacy. Even the JVM now runs on
ARM processors.

--=20
Russel.
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D=3D
Dr Russel Winder      t: +44 20 7585 2200   voip: sip:russel.winder ekiga.n=
et
41 Buckmaster Road    m: +44 7770 465 077   xmpp: russel winder.org.uk
London SW11 1EN, UK   w: www.russel.org.uk  skype: russel_winder

Jan 13 2013

"H. S. Teoh" <hsteoh quickfur.ath.cx> writes:

On Mon, Jan 14, 2013 at 06:33:44AM +0100, David Nadlinger wrote:
 On Monday, 14 January 2013 at 04:42:33 UTC, Andrei Alexandrescu
 wrote:
I'll candidly mention that David's request was unclear to me. It's
been winding enough that I missed the key sentence. Was it that we
switch dmd to using llvm as backend?

 
 If you want, yes - but not in the form of an actionable proposal
 yet. I was trying to argue that the benefits of using an existing
 solution like GCC or LLVM are large enough (resp. the costs of using
 a custom backend high enough) that we should seriously consider
 doing so. Especially because it looks as if the amount of work
 needed to keep the DMD backend and thus the reference D compiler
 competitive is going to increase further as other backends are
 gaining things like auto-vectorization to light up modern CPUs and
 ARM is gaining in importance.

[...]

I have to say, as a bystander to this discussion, that I *have* felt the
dilemma of whether to use dmd (I want to try out the latest D features)
vs. gdc/ldc (I want maximal performance for compute-heavy apps).

I don't mean to downplay dmd in any way, but IME code generated by gdc
-O3 has been *consistently* 30-40% faster than code generated by dmd -O,
sometimes even 50% or more. I have compared the assembly code generated
by either compiler, and I have to say that, at least IME, the code
produced by gdc is indisputably superior to the code produced by dmd.
Due to this, I have wanted to use only gdc for my D projects, and I'd
expect that any project where performance is a concern would feel the
same way. But I was forced to use dmd because I wanted to try out the
newest features and get the latest bugfixes as well.  So now I'm torn
between wanting competitive performance vs. getting the bugfixes I need
or the latest and greatest features.

This is far from ideal, IMHO. Users shouldn't have to choose between
up-to-date language support vs. performance. Having the reference D
compiler match gdc/ldc in performance would be a big factor in D
adoption IMO (how many people would spend the time to look up gdc, or
even know it exists, instead of just compiling a C/C++ program with
gcc/g++, compare it to dmd output for a comparable D program, and walk
away?).


T

-- 
WINDOWS = Will Install Needless Data On Whole System -- CompuMan

Jan 13 2013

Jonathan M Davis <jmdavisProg gmx.com> writes:

On Sunday, January 13, 2013 22:46:21 H. S. Teoh wrote:
 On Mon, Jan 14, 2013 at 06:33:44AM +0100, David Nadlinger wrote:
 On Monday, 14 January 2013 at 04:42:33 UTC, Andrei Alexandrescu
 
 wrote:
I'll candidly mention that David's request was unclear to me. It's
been winding enough that I missed the key sentence. Was it that we
switch dmd to using llvm as backend?

 
 If you want, yes - but not in the form of an actionable proposal
 yet. I was trying to argue that the benefits of using an existing
 solution like GCC or LLVM are large enough (resp. the costs of using
 a custom backend high enough) that we should seriously consider
 doing so. Especially because it looks as if the amount of work
 needed to keep the DMD backend and thus the reference D compiler
 competitive is going to increase further as other backends are
 gaining things like auto-vectorization to light up modern CPUs and
 ARM is gaining in importance.

 
 [...]
 
 I have to say, as a bystander to this discussion, that I *have* felt the
 dilemma of whether to use dmd (I want to try out the latest D features)
 vs. gdc/ldc (I want maximal performance for compute-heavy apps).
 
 I don't mean to downplay dmd in any way, but IME code generated by gdc
 -O3 has been *consistently* 30-40% faster than code generated by dmd -O,
 sometimes even 50% or more. I have compared the assembly code generated
 by either compiler, and I have to say that, at least IME, the code
 produced by gdc is indisputably superior to the code produced by dmd.
 Due to this, I have wanted to use only gdc for my D projects, and I'd
 expect that any project where performance is a concern would feel the
 same way. But I was forced to use dmd because I wanted to try out the
 newest features and get the latest bugfixes as well.  So now I'm torn
 between wanting competitive performance vs. getting the bugfixes I need
 or the latest and greatest features.
 
 This is far from ideal, IMHO. Users shouldn't have to choose between
 up-to-date language support vs. performance. Having the reference D
 compiler match gdc/ldc in performance would be a big factor in D
 adoption IMO (how many people would spend the time to look up gdc, or
 even know it exists, instead of just compiling a C/C++ program with
 gcc/g++, compare it to dmd output for a comparable D program, and walk
 away?).

If you want to use the lastest master, then you're going to be stuck with dmd. 
If you're willing to use the latest release, then you have a choice. And 
assuming that you don't need the latest master, I would argue that if 
performance is critical, what you'll want to do is develop with dmd (because 
of its fast compilation items), and then you'll actually want to release using 
gdc or ldc so that you actually get fast programs. And given the fact that 
Walter is rightly focused on language features rather than optimizing dmd's 
backend and the fact that almost no one else seems likely to spend time 
optimizing dmd's backend, it's a foregone conclusion at this point that dmd is 
going to generate slower programs than gdc and ldc for the forseeable future. 
On the other hand, it wouldn't surprise me if having Walter switch to another 
backend for the reference compiler would cost us so much development time from 
him (as he learns the new backend) that it wouldn't be even vaguely worth it.

- Jonathan M Davis

Jan 13 2013

Walter Bright <newshound2 digitalmars.com> writes:

On 1/13/2013 9:33 PM, David Nadlinger wrote:
 as other backends are gaining things
 like auto-vectorization to light up modern CPUs and ARM is gaining in
importance.

I've done some research on auto-vectorization, i.e. "The Software Vectorization 
Handbook" by Bik.

My conclusion (Manu independently came to the same one, he's our resident SIMD 
expert) is that auto-vectorization is a disaster.

What it is is:

1. Reverse engineer a loop into a higher level construct
2. Recompile that construct using vector instructions

It's a disaster because (2) often fails in ways that are utterly mysterious to 
99% of programmers. The failure mode is to not auto-vectorize the loop. Hence, 
the failure is silent and the user just sees poor performance, if he notices it 
at all.

D's intended approach is completely different, and much more straightforward. 
The SIMD types are exposed to the front end, and you build vector operations 
like you would any other arithmetic operation. If a particular vector operation 
is not available for a particular SIMD type, then the compilation fails with an 
appropriate error message.

I.e. D's approach is to fix the language to support vector semantics, rather 
than trying to back-asswards fit it into the optimizer.

Jan 13 2013

"David Nadlinger" <see klickverbot.at> writes:

On Monday, 14 January 2013 at 07:21:57 UTC, Walter Bright wrote:
 I've done some research on auto-vectorization, i.e. "The 
 Software Vectorization Handbook" by Bik.

 My conclusion (Manu independently came to the same one, he's 
 our resident SIMD expert) is that auto-vectorization is a 
 disaster.

 What it is is:

 1. Reverse engineer a loop into a higher level construct
 2. Recompile that construct using vector instructions

 It's a disaster because (2) often fails in ways that are 
 utterly mysterious to 99% of programmers. The failure mode is 
 to not auto-vectorize the loop. Hence, the failure is silent 
 and the user just sees poor performance, if he notices it at 
 all.

Virtually all common C/C++ compilers have had SIMD 
types/intrinsics for years now, which is more or less exactly 
what core.simd is for D - well, there is some syntax sugar for 
arithmetic operations on the D vector types, but it doesn't cover 
any of the "interesting" operations such as broadcasting, 
swizzling, etc.

Yet *all* the big compiler vendors (Microsoft, Intel, GCC, LLVM) 
seem to find it necessary to spend considerable amounts of effort 
on improving their heuristics for the indeed quite problematic 
optimization process you describe.

I don't think that rehashing the common reasons cited for the 
importance of (auto-)vectorizing code here would be of any use, 
you are certainly aware of them already. Just don't forget that 
Manu (who seems to have disappeared from the forums, crunch 
time?) represents a group of programmers who would resort to 
hand-written assembly for their "embarrassingly parallel" code if 
suitable compiler intrinsics didn't exist, whereas the vast 
majority of programmers probably would be troubled to reliably 
identify the cases where vectorization would have the biggest 
benefits in their code bases, even if they could justify spending 
time/maintainability on manually optimizing their code at this 
level.

David

Jan 14 2013

Walter Bright <newshound2 digitalmars.com> writes:

On 1/14/2013 12:28 AM, David Nadlinger wrote:
 Virtually all common C/C++ compilers have had SIMD types/intrinsics for years
 now, which is more or less exactly what core.simd is for D - well, there is
some
 syntax sugar for arithmetic operations on the D vector types, but it doesn't
 cover any of the "interesting" operations such as broadcasting, swizzling, etc.

Actually, they do a crappy job of it in my experiments.

 Yet *all* the big compiler vendors (Microsoft, Intel, GCC, LLVM) seem to find
it
 necessary to spend considerable amounts of effort on improving their heuristics
 for the indeed quite problematic optimization process you describe.

The fundamental problem they face is they cannot change the language - they
have 
to work with bog standard code that has no notion of SIMD.


 I don't think that rehashing the common reasons cited for the importance of
 (auto-)vectorizing code here would be of any use, you are certainly aware of
 them already. Just don't forget that Manu (who seems to have disappeared from
 the forums, crunch time?) represents a group of programmers who would resort to
 hand-written assembly for their "embarrassingly parallel" code if suitable
 compiler intrinsics didn't exist, whereas the vast majority of programmers
 probably would be troubled to reliably identify the cases where vectorization
 would have the biggest benefits in their code bases, even if they could justify
 spending time/maintainability on manually optimizing their code at this level.

Here's the problem. Draw a table of SIMD vector types on one axis, and 
arithmetic operations on the other. Plot an X for what is supported.

You'll find that the pattern of X's doesn't follow much of a pattern, and it 
varies all over the place for different CPUs and architectures.

Now consider your autovectorizer. If an X isn't there, it (silently) generates 
workaround code. The only way you'll really be able to tell it went wrong is to 
disassemble the result - and who is going to do that?

And what's much, much worse about this, is pulling values out of and into SIMD 
registers to do those workarounds causes massive stalls (the SIMD unit and the 
arithmetic unit are separate), so bad that you're better off doing no 
vectorization at all.

The next problem with an autovectorizer is alignment - consider the C function:

     int func(float *p) { ... }

Now you want to autovectorize inside func(). Is p aligned? Who knows? So you 
must emit code to do misaligned vectors, which is another performance disaster.

The D approach is if you are able to get your vector code to compile, it will
be:

1. aligned properly
2. compiled to real SIMD instructions

Jan 14 2013

"David Nadlinger" <see klickverbot.at> writes:

On Monday, 14 January 2013 at 09:02:25 UTC, Walter Bright wrote:
 Yet *all* the big compiler vendors (Microsoft, Intel, GCC, 
 LLVM) seem to find it
 necessary to spend considerable amounts of effort on improving 
 their heuristics
 for the indeed quite problematic optimization process you 
 describe.

 The fundamental problem they face is they cannot change the 
 language - they have to work with bog standard code that has no 
 notion of SIMD.

They *are* changing the language using SIMD intrinsics and 
built-in types. It might be that D can do a somewhat nicer job at 
that by standardizing them and using less cryptic names, but the 
fundamental concept remains the same.

And again, you don't have to convince me about the problems 
inherent to auto-vectorization, I'm well aware of them. I'm just 
pointing out that despite these obstacles, compiler vendors are 
still trying to implement it, probably because writing vector 
code is fundamentally difficult (as you mentioned, you can easily 
shoot yourself in the foot in many ways, performance-wise), and 
you can't justify just not using the vector units on today's 
chips.

It's not that I think __vector, core.simd and friends would be a 
*bad* idea. But your general dismissal of auto-vectorization 
seems a bit too quick and convenient to me.

David

Jan 14 2013

Walter Bright <newshound2 digitalmars.com> writes:

On 1/14/2013 1:41 AM, David Nadlinger wrote:
 It's not that I think __vector, core.simd and friends would be a *bad* idea.
But
 your general dismissal of auto-vectorization seems a bit too quick and
 convenient to me.

I agree it's more convenient not to implement it.

I also prefer using a lawnmower to a pair of scissors <g>.

Jan 14 2013

"jerro" <a a.com> writes:

 I.e. D's approach is to fix the language to support vector 
 semantics, rather than trying to back-asswards fit it into the 
 optimizer.

I agree that adding SIMD types and operations to the languages is 
a better approach than autovectorization. I'd just like to point 
out that DMD's backend still has serious problems regarding SIMD 
support, which makes getting SIMD code to work with DMD a pain. 
For example, see those two bugs:

http://d.puremagic.com/issues/show_bug.cgi?id=9304
http://d.puremagic.com/issues/show_bug.cgi?id=9200

I don't know how much work it is to fix those issues and any 
similar ones, but I think this is relevant to the discussion in 
this thread, since GCC and LLVM backends don't have those 
problems.

Jan 14 2013

Walter Bright <newshound2 digitalmars.com> writes:

On 1/14/2013 12:54 AM, jerro wrote:
 I.e. D's approach is to fix the language to support vector semantics, rather
 than trying to back-asswards fit it into the optimizer.

 I agree that adding SIMD types and operations to the languages is a better
 approach than autovectorization. I'd just like to point out that DMD's backend
 still has serious problems regarding SIMD support, which makes getting SIMD
code
 to work with DMD a pain. For example, see those two bugs:

 http://d.puremagic.com/issues/show_bug.cgi?id=9304
 http://d.puremagic.com/issues/show_bug.cgi?id=9200

Yeah, I know about those, I'll fix 'em. The SIMD support in the compiler is 
still young.

Jan 14 2013

"Mehrdad" <wfunction hotmail.com> writes:

On Monday, 14 January 2013 at 07:21:57 UTC, Walter Bright wrote:
 I've done some research on auto-vectorization, i.e. "The 
 Software Vectorization Handbook" by Bik.

 My conclusion (Manu independently came to the same one, he's 
 our resident SIMD expert) is that auto-vectorization is a 
 disaster.

 What it is is:

 1. Reverse engineer a loop into a higher level construct
 2. Recompile that construct using vector instructions

 It's a disaster because (2) often fails in ways that are 
 utterly mysterious to 99% of programmers. The failure mode is 
 to not auto-vectorize the loop. Hence, the failure is silent 
 and the user just sees poor performance, if he notices it at 
 all.



Have you seen the Visual C++ 2012 compiler? It fixes that problem.

Its auto-vectorizer has two switches:

1. /Qvec-report:1, which reports the auto-vectorized loops
2. /Qvec-report:2, which reports all potential candidates and 
whether or not they were vectorized (and if not, a reason code)

Jan 17 2013

Walter Bright <newshound2 digitalmars.com> writes:

On 1/17/2013 1:20 AM, Mehrdad wrote:
 Have you seen the Visual C++ 2012 compiler? It fixes that problem.

 Its auto-vectorizer has two switches:

 1. /Qvec-report:1, which reports the auto-vectorized loops
 2. /Qvec-report:2, which reports all potential candidates and whether or not
 they were vectorized (and if not, a reason code)

I would call it more of a workaround than a fix.

Jan 17 2013

Russel Winder <russel winder.org.uk> writes:

On Mon, 2013-01-14 at 06:33 +0100, David Nadlinger wrote:
 On Monday, 14 January 2013 at 04:42:33 UTC, Andrei Alexandrescu=20
 wrote:
 I'll candidly mention that David's request was unclear to me.=20
 It's been winding enough that I missed the key sentence. Was it=20
 that we switch dmd to using llvm as backend?

=20
 If you want, yes - but not in the form of an actionable proposal=20
 yet. I was trying to argue that the benefits of using an existing=20
 solution like GCC or LLVM are large enough (resp. the costs of=20
 using a custom backend high enough) that we should seriously=20
 consider doing so. Especially because it looks as if the amount=20
 of work needed to keep the DMD backend and thus the reference D=20
 compiler competitive is going to increase further as other=20
 backends are gaining things like auto-vectorization to light up=20
 modern CPUs and ARM is gaining in importance.

Having D in the GCC set is a good move for marketing the D language. GDC
could be a vehicle for bringing more programmers to D but it needs more
resource to bring it up to date.

Having D in the LLVM set is a good move for marketing the D language.
LDC is a vehicle for bringing people to D. It would be good to get the
Debian package updated and actively kept up. This would then get it
packaged for Ubuntu and Mint. It would be good to get it packaged for
Fedora.

Keeping going with DMD is fine, as long as it is in the knowledge that
GDC and LDC are the D toolchains that will lead to wider D usage.

--=20
Russel.
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D=3D
Dr Russel Winder      t: +44 20 7585 2200   voip: sip:russel.winder ekiga.n=
et
41 Buckmaster Road    m: +44 7770 465 077   xmpp: russel winder.org.uk
London SW11 1EN, UK   w: www.russel.org.uk  skype: russel_winder

Jan 13 2013

"David Nadlinger" <see klickverbot.at> writes:

On Monday, 14 January 2013 at 07:52:49 UTC, Russel Winder wrote:
 LDC is a vehicle for bringing people to D. It would be good to 
 get the Debian package updated and actively kept up.

As we have a recent release now: Who do I have to ping about 
resurrecting the Debian package? I have never really used any 
Debian-based distros (except for some server boxes), so I have no 
good idea of how to find a packager and how to best assist with 
the packaging process.

I know of the importance of getting an LDC package into Debian, 
but it would be much, much easier if somebody who is at home in 
both the D and Debian communities (or at least uses a 
Debian-based distro on a day-to-day basis) could handle this.

David

Jan 14 2013

Iain Buclaw <ibuclaw ubuntu.com> writes:

On 14 January 2013 08:03, David Nadlinger <see klickverbot.at> wrote:

 On Monday, 14 January 2013 at 07:52:49 UTC, Russel Winder wrote:

 LDC is a vehicle for bringing people to D. It would be good to get the
 Debian package updated and actively kept up.

 As we have a recent release now: Who do I have to ping about resurrecting
 the Debian package? I have never really used any Debian-based distros
 (except for some server boxes), so I have no good idea of how to find a
 packager and how to best assist with the packaging process.

 I know of the importance of getting an LDC package into Debian, but it
 would be much, much easier if somebody who is at home in both the D and
 Debian communities (or at least uses a Debian-based distro on a day-to-day
 basis) could handle this.

 David


I could set things in motion for you, but you'd probably have to nudge me
once every 6 months to push in updates.

-- 
Iain Buclaw

*(p < e ? p++ : p) = (c & 0x0f) + '0';

Jan 14 2013

"David Nadlinger" <see klickverbot.at> writes:

On Monday, 14 January 2013 at 08:20:11 UTC, Iain Buclaw wrote:
 I could set things in motion for you, but you'd probably have 
 to nudge me
 once every 6 months to push in updates.

That would be great! How do we proceed?

David

Jan 14 2013

Russel Winder <russel winder.org.uk> writes:

On Mon, 2013-01-14 at 09:26 +0100, David Nadlinger wrote:
 On Monday, 14 January 2013 at 08:20:11 UTC, Iain Buclaw wrote:
 I could set things in motion for you, but you'd probably have=20
 to nudge me
 once every 6 months to push in updates.

=20
 That would be great! How do we proceed?

I just emailed the listed packager cc yourself. I guess I should have
included Ian due to the GDC connection.=20

--=20
Russel.
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D=3D
Dr Russel Winder      t: +44 20 7585 2200   voip: sip:russel.winder ekiga.n=
et
41 Buckmaster Road    m: +44 7770 465 077   xmpp: russel winder.org.uk
London SW11 1EN, UK   w: www.russel.org.uk  skype: russel_winder

Jan 14 2013

Russel Winder <russel winder.org.uk> writes:

On Mon, 2013-01-14 at 08:37 +0000, Russel Winder wrote:
[=E2=80=A6]
 included Ian due to the GDC connection.=20

/Ian/s//Iain/

--=20
Russel.
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D=3D
Dr Russel Winder      t: +44 20 7585 2200   voip: sip:russel.winder ekiga.n=
et
41 Buckmaster Road    m: +44 7770 465 077   xmpp: russel winder.org.uk
London SW11 1EN, UK   w: www.russel.org.uk  skype: russel_winder

Jan 14 2013

Iain Buclaw <ibuclaw ubuntu.com> writes:

On 14 January 2013 08:44, Russel Winder <russel winder.org.uk> wrote:

 On Mon, 2013-01-14 at 08:37 +0000, Russel Winder wrote:
 [=85]
 included Ian due to the GDC connection.

 /Ian/s//Iain/

Thanks for the name correction. :)

--=20
Iain Buclaw

*(p < e ? p++ : p) =3D (c & 0x0f) + '0';

Jan 14 2013

Iain Buclaw <ibuclaw ubuntu.com> writes:

On 14 January 2013 08:26, David Nadlinger <see klickverbot.at> wrote:

 On Monday, 14 January 2013 at 08:20:11 UTC, Iain Buclaw wrote:

 I could set things in motion for you, but you'd probably have to nudge me
 once every 6 months to push in updates.

 That would be great! How do we proceed?

 David

Mattias is the current maintainer IIRC.  I can request a transfer of
ownership to either myself or someone who is willing to maintain the
package for LDC.  Also I need to post updates for the GDC package too, so
might as well do both in one hit.

-- 
Iain Buclaw

*(p < e ? p++ : p) = (c & 0x0f) + '0';

Jan 14 2013

"H. S. Teoh" <hsteoh quickfur.ath.cx> writes:

On Mon, Jan 14, 2013 at 09:03:13AM +0100, David Nadlinger wrote:
 On Monday, 14 January 2013 at 07:52:49 UTC, Russel Winder wrote:
LDC is a vehicle for bringing people to D. It would be good to get
the Debian package updated and actively kept up.

 
 As we have a recent release now: Who do I have to ping about
 resurrecting the Debian package? I have never really used any
 Debian-based distros (except for some server boxes), so I have no
 good idea of how to find a packager and how to best assist with the
 packaging process.
 
 I know of the importance of getting an LDC package into Debian, but
 it would be much, much easier if somebody who is at home in both the
 D and Debian communities (or at least uses a Debian-based distro on
 a day-to-day basis) could handle this.

[...]

I'm a Debian developer (a very dormant one ever since I got married,
I'll admit, but nevertheless one). I am rusty with the procedures due to
not having used them for a while, but at least I know where to look and
am familiar with the conventions used. So I'm willing to help. I can
probably even sponsor uploads, if that's what it takes.


T

-- 
If lightning were to ever strike an orchestra, it'd always hit the conductor
first.

Jan 14 2013

"eles" <eles eles.com> writes:

On Monday, 14 January 2013 at 16:11:00 UTC, H. S. Teoh wrote:
 On Mon, Jan 14, 2013 at 09:03:13AM +0100, David Nadlinger wrote:
 On Monday, 14 January 2013 at 07:52:49 UTC, Russel Winder

 I'm a Debian developer (a very dormant one ever since I got 
 married,
 I'll admit, but nevertheless one). I am rusty with the 
 procedures due to
 not having used them for a while, but at least I know where to 
 look and
 am familiar with the conventions used. So I'm willing to help. 
 I can
 probably even sponsor uploads, if that's what it takes.

Hi,

  I would love to see gdc-4.7 (ie. based on gcc 4.7) in Debian Sid 
ASAP (maybe it still on schedule to be included into next Ubuntu 
- Raring Ringtail).

  Currently, the gdc package is still at the pre-historic era of 
gcc 4.6.3:

http://packages.debian.org/sid/devel/gdc

and Ubuntu follows closely with:

http://packages.ubuntu.com/ko/raring/devel/gdc

while its gcc package is 4.7.2:

http://packages.ubuntu.com/ko/raring/gcc

Many thanks for any implication on that side.

Jan 14 2013

Jordi Sayol <g.sayol yahoo.es> writes:

Al 14/01/13 08:52, En/na Russel Winder ha escrit:
 
 Having D in the LLVM set is a good move for marketing the D language.
 LDC is a vehicle for bringing people to D. It would be good to get the
 Debian package updated and actively kept up. This would then get it
 packaged for Ubuntu and Mint. It would be good to get it packaged for
 Fedora.

LDC is already available on Fedora 17.

-- 
Jordi Sayol

Jan 14 2013

Russel Winder <russel winder.org.uk> writes:

On Mon, 2013-01-14 at 19:01 +0100, Jordi Sayol wrote:
[=E2=80=A6]
 LDC is already available on Fedora 17.

LDC is already available in Debian, it is just a very old version. Looks
like the Fedora version is LDC as at 2012-06-24 from the Git repository.
One up to Fedora :-)

--=20
Russel.
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D=3D
Dr Russel Winder      t: +44 20 7585 2200   voip: sip:russel.winder ekiga.n=
et
41 Buckmaster Road    m: +44 7770 465 077   xmpp: russel winder.org.uk
London SW11 1EN, UK   w: www.russel.org.uk  skype: russel_winder

Jan 14 2013

Jacob Carlborg <doob me.com> writes:

On 2013-01-14 03:24, Walter Bright wrote:

 But as for optlink, I would have happily turned it over to anyone who
 expressed interest in working on it. I'd also put it on github if anyone
 wanted to look at it.

Then I would say, put in on github regardless if anyone wants to take a 
look at it. It doesn't hurt.

-- 
/Jacob Carlborg

Jan 13 2013

Jacob Carlborg <doob me.com> writes:

On 2013-01-14 00:57, Walter Bright wrote:

 Doing this all requires a fairly comprehensive understanding of the
 innards of the back end. If LLVM is lacking one or more of 1..5, then I
 wouldn't be surprised a bit if it took me considerably *longer* to do it
 for LLVM

Sure, but if you had switch to LLVM, say three years ago, you would know 
innards of the back end.

-- 
/Jacob Carlborg

Jan 13 2013

Dmitry Olshansky <dmitry.olsh gmail.com> writes:

12-Jan-2013 09:17, David Nadlinger пишет:
 On Friday, 11 January 2013 at 20:57:57 UTC, Dmitry Olshansky wrote:
 You can print total counts after each bench, there is a TLS varaible
 written at the end of it. But anyway I like your numbers! :)

 Okay, I couldn't resist having a short look at the results, specifically
 the benchmark of the new isSymbol implementation, where LDC beats DMD by
 roughly 10x. The reason for the nice performance results is mainly that
 LDC optimizes the classifyCall loop containing the trie lookup down to
 the following fairly optimal piece of code (eax is the overall counter
 that gets stored to lastCount):

So these are legit? Coooooool!

BTW I'm having about 2-3 times better numbers on DMD 32bits with oldish 
AMD K10. Can you test 32bit versions also, could it be some glitch in 
64bit codegen?

 ---
    40bc90:       8b 55 00                mov    edx,DWORD PTR [rbp+0x0]
    40bc93:       89 d6                   mov    esi,edx
    40bc95:       c1 ee 0d                shr    esi,0xd
    40bc98:       40 0f b6 f6             movzx  esi,sil
    40bc9c:       0f b6 34 31             movzx  esi,BYTE PTR [rcx+rsi*1]
    40bca0:       48 83 c5 04             add    rbp,0x4
    40bca4:       0f b6 da                movzx  ebx,dl
    40bca7:       c1 e6 05                shl    esi,0x5
    40bcaa:       c1 ea 08                shr    edx,0x8
    40bcad:       83 e2 1f                and    edx,0x1f
    40bcb0:       09 f2                   or     edx,esi
    40bcb2:       41 0f b7 14 50          movzx  edx,WORD PTR [r8+rdx*2]
    40bcb7:       c1 e2 08                shl    edx,0x8
    40bcba:       09 da                   or     edx,ebx
    40bcbc:       48 c1 ea 06             shr    rdx,0x6
    40bcc0:       4c 01 ca                add    rdx,r9
    40bcc3:       48 8b 14 d1             mov    rdx,QWORD PTR [rcx+rdx*8]
    40bcc7:       48 0f a3 da             bt     rdx,rbx
    40bccb:       83 d0 00                adc    eax,0x0
    40bcce:       48 ff cf                dec    rdi
    40bcd1:       75 bd                   jne    40bc90
 ---

This looks quite nice indeed.

 The code DMD generates for the lookup, on the other hand, is pretty
 ugly, including several values being spilled to the stack, and also
 doesn't get inlined.

To be honest one of the major problems I see with DMD is a lack of 
principled reliable inliner. Currently it may inline or not 2 equivalent 
  pieces of code just because one of it has early return,  or switch 
statement or whatever. And it's about to time to start inlining 
functions with loops as it's not 90-s anymore.


 [1] The reasons for which I'm focusing on LLVM here are not so much its
 technical qualities as its liberal BSD-like license – if it is good
 enough for Apple, Intel (also a compiler vendor) and their lawyer teams,
 it is probably also for us. The code could even be integrated into
 commercial products such as DMC without problems.

I like LLVM, and next to everybody in industry like it. Another example 
is AMD. They are building their compiler infrastructure for GPUs on top 
of LLVM.

 [2] And for any typos which might undermine my credibility – it is way
 too early in the morning here.


-- 
Dmitry Olshansky

Jan 12 2013

"H. S. Teoh" <hsteoh quickfur.ath.cx> writes:

On Fri, Jan 11, 2013 at 12:35:42PM -0800, H. S. Teoh wrote:
[...]
 - It would also be nice if these functions can be made pure (I'm not
   sure I understand why checking the category of a character should be
   impure). The lack of nothrow I can understand, since the input
   character may be illegal or otherwise malformed. But  nothrow pure
   seems to me to be necessary for all category-checking functions.

Gah, I meant  safe pure.


T

-- 
Lawyer: (n.) An innocence-vending machine, the effectiveness of which depends
on how much money is inserted.

Jan 11 2013

Walter Bright <newshound2 digitalmars.com> writes:

On 1/11/2013 11:31 AM, Dmitry Olshansky wrote:
 Anyway it's polished and ready for the good old collective destruction called
 peer review. I'm looking for a review manager.

Thank you for doing this. Some notes:


struct InversionList:

    1. It implements length and empty, but otherwise does not seem to be a 
range, though its name implies it is a container.

    2. The description "InversionList is a packed interval-based data structure 
that represents a set of codepoints." completely baffles me.

    3. The table for opBinary appears twice.

    4. toSourceCode says "if the codepoint". If what codepoint? And what 
funcName is ""?

mapTrieIndex:

    I have no idea what this does from the description. Is it a map like in 
std.algorithms? I.e. does it work with ranges? Need an example.

TrieBuilder:

    1. No description. I'm completely baffled.

    2. putRange? What does that do?

General lack of examples.

Trie:

     Should have a wikipedia link to what a trie is.

buildTrie:

     Contains a number of functions that return auto, but no mention of what is 
returned. While I like auto, in these cases it is not helpful, because the user 
needs to know what type is returned.

Grapheme:

     Should provide an InputRange so a user can iterate over its codepoints. It 
already appears to provide part of the range interface.


sicmp and icmp:

     No reason one should have "in" and the other "inout". Both should be
"const".

combiningClass

     What is "the canonicial combining class" ?

General note:

     Needs a section that briefly defines terms, with links to external 
documentation.

Jan 11 2013

Dmitry Olshansky <dmitry.olsh gmail.com> writes:

12-Jan-2013 00:50, Walter Bright пишет:
 On 1/11/2013 11:31 AM, Dmitry Olshansky wrote:
 Anyway it's polished and ready for the good old collective destruction
 called
 peer review. I'm looking for a review manager.

 Thank you for doing this. Some notes:

Great bashing, thanks :)

 struct InversionList:

     1. It implements length and empty, but otherwise does not seem to be
 a range, though its name implies it is a container.

Containers also have length and can be empty. Ranges on top of it are 
.byChar, .byInterval.

     2. The description "InversionList is a packed interval-based data
 structure that represents a set of codepoints." completely baffles me.

OK, probably it was a bad idea to combine implementation detail and 
general description in one statement.

How about: InversionList is a set of codepoints, optimized for size.
It represents a set as an array of intervals using 6 bytes per interval 
of codepoints.

Better?

     3. The table for opBinary appears twice.

     4. toSourceCode says "if the codepoint". If what codepoint? And what
 funcName is ""?

 mapTrieIndex:

     I have no idea what this does from the description. Is it a map like
 in std.algorithms? I.e. does it work with ranges? Need an example.

 TrieBuilder:

     1. No description. I'm completely baffled.

     2. putRange? What does that do?

 General lack of examples.

 Trie:

      Should have a wikipedia link to what a trie is.

 buildTrie:

      Contains a number of functions that return auto, but no mention of
 what is returned. While I like auto, in these cases it is not helpful,
 because the user needs to know what type is returned.

Trust me you won't like it when you see it :) Part of the reason it's 
hidden. But you are correct that documentation on these artifacts is 
*ehm* ... sketchy. Will fix/update.

 Grapheme:

      Should provide an InputRange so a user can iterate over its
 codepoints. It already appears to provide part of the range interface.

Container != range. Just slice it and RA range is in your hands. I need 
to put this info somewhere prominent I guess.

BTW foreach(v; Grapheme("abc")) { ... }
should work because of automatic slicing.

 sicmp and icmp:

      No reason one should have "in" and the other "inout". Both should
 be "const".

 combiningClass

      What is "the canonicial combining class" ?

 General note:

      Needs a section that briefly defines terms, with links to external
 documentation.

All to the point, thanks.

-- 
Dmitry Olshansky

Jan 11 2013

Walter Bright <newshound2 digitalmars.com> writes:

On 1/11/2013 1:21 PM, Dmitry Olshansky wrote:
 12-Jan-2013 00:50, Walter Bright пишет:
 On 1/11/2013 11:31 AM, Dmitry Olshansky wrote:
 Anyway it's polished and ready for the good old collective destruction
 called
 peer review. I'm looking for a review manager.

 Thank you for doing this. Some notes:

 Great bashing, thanks :)

Anytime!


 struct InversionList:

     1. It implements length and empty, but otherwise does not seem to be
 a range, though its name implies it is a container.

 Containers also have length and can be empty. Ranges on top of it are .byChar,
 .byInterval.

Some more text about accessing it as a range would be helpful. Being a set of 
"codepoints" yet accessing it "by char" makes no sense.

In unicode, the terms "char", "codepoint", "octet", "grapheme", etc., are 
bandied about and are often conflated and confused with each other. I suggest a 
brief section defining those terms, and then scrupulously, consistently,
anally, 
and nitpickingly adhering to them in the documentation and names of things.


     2. The description "InversionList is a packed interval-based data
 structure that represents a set of codepoints." completely baffles me.

 OK, probably it was a bad idea to combine implementation detail and general
 description in one statement.

 How about: InversionList is a set of codepoints, optimized for size.
 It represents a set as an array of intervals using 6 bytes per interval of
 codepoints.

 Better?

The name says "list" and yet it is described as a "set". Being a "list" makes
an 
implication that it is a linked list. BTW, what is an "interval"?


     3. The table for opBinary appears twice.

     4. toSourceCode says "if the codepoint". If what codepoint? And what
 funcName is ""?

 mapTrieIndex:

     I have no idea what this does from the description. Is it a map like
 in std.algorithms? I.e. does it work with ranges? Need an example.

 TrieBuilder:

     1. No description. I'm completely baffled.

     2. putRange? What does that do?

 General lack of examples.

 Trie:

      Should have a wikipedia link to what a trie is.

 buildTrie:

      Contains a number of functions that return auto, but no mention of
 what is returned. While I like auto, in these cases it is not helpful,
 because the user needs to know what type is returned.

 Trust me you won't like it when you see it :) Part of the reason it's hidden.
 But you are correct that documentation on these artifacts is *ehm* ... sketchy.
 Will fix/update.

If the compiler type is hard to grok, ok, but at least the documentation should 
make it clear what is returned.


 Grapheme:

      Should provide an InputRange so a user can iterate over its
 codepoints. It already appears to provide part of the range interface.

 Container != range. Just slice it and RA range is in your hands. I need to put
 this info somewhere prominent I guess.

 BTW foreach(v; Grapheme("abc")) { ... }
 should work because of automatic slicing.

Yes, but is the resulting v an octet, char, codepoint ????


 All to the point, thanks.

Welcs!

Jan 11 2013

"H. S. Teoh" <hsteoh quickfur.ath.cx> writes:

On Fri, Jan 11, 2013 at 01:36:21PM -0800, Walter Bright wrote:
 On 1/11/2013 1:21 PM, Dmitry Olshansky wrote:
12-Jan-2013 00:50, Walter Bright пишет:


[...]
buildTrie:

     Contains a number of functions that return auto, but no mention
of what is returned. While I like auto, in these cases it is not
helpful, because the user needs to know what type is returned.

Trust me you won't like it when you see it :) Part of the reason it's
hidden.  But you are correct that documentation on these artifacts is
*ehm* ... sketchy.  Will fix/update.

 
 If the compiler type is hard to grok, ok, but at least the
 documentation should make it clear what is returned.

[...]

Would an alias be appropriate in this case, so that we can put a
concrete name to the function return type? Or is that an abuse of alias?


T

-- 
Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are, by
definition, not smart enough to debug it. -- Brian W. Kernighan

Jan 11 2013

Walter Bright <newshound2 digitalmars.com> writes:

On 1/11/2013 4:14 PM, H. S. Teoh wrote:
 Would an alias be appropriate in this case, so that we can put a
 concrete name to the function return type? Or is that an abuse of alias?

Actually, that sounds like a good idea.

Jan 11 2013

Dmitry Olshansky <dmitry.olsh gmail.com> writes:

12-Jan-2013 05:31, Walter Bright пишет:
 On 1/11/2013 4:14 PM, H. S. Teoh wrote:
 Would an alias be appropriate in this case, so that we can put a
 concrete name to the function return type? Or is that an abuse of alias?

 Actually, that sounds like a good idea.

Except that it's created out of template parameters by wrapping them in 
a couple of template. So not, can't do.

-- 
Dmitry Olshansky

Jan 12 2013

"H. S. Teoh" <hsteoh quickfur.ath.cx> writes:

On Sat, Jan 12, 2013 at 12:20:46PM +0400, Dmitry Olshansky wrote:
 12-Jan-2013 05:31, Walter Bright пишет:
On 1/11/2013 4:14 PM, H. S. Teoh wrote:
Would an alias be appropriate in this case, so that we can put a
concrete name to the function return type? Or is that an abuse of
alias?

Actually, that sounds like a good idea.

 
 Except that it's created out of template parameters by wrapping them
 in a couple of template. So not, can't do.

[...]

I was going to suggest something like:

	alias typeof(return) ConcreteName;

But then I realized it's impossible to use ConcreteName in the function
signature, which makes it pointless to have the alias since it won't
show up in the Ddoc. :-(

Although, maybe something like this *might* be possible (but it's sorta
getting to the point of being ridiculously complex just for the sake of
making ddoc display a single identifier):

	template ConcreteName(A,B,C) {
		alias ConcreteName = SomeTemplate!(A, OtherTemplate!B, ...);
	}

	ConcreteName!(A,B,C) func(A,B,C,D)(D args) {
		ConcreteName!(A,B,C) retVal = ...;
		return retVal;
	}


T

-- 
Three out of two people have difficulties with fractions. -- Dirk Eddelbuettel

Jan 12 2013

Dmitry Olshansky <dmitry.olsh gmail.com> writes:

12-Jan-2013 20:16, H. S. Teoh пишет:
 On Sat, Jan 12, 2013 at 12:20:46PM +0400, Dmitry Olshansky wrote:
 12-Jan-2013 05:31, Walter Bright пишет:
 On 1/11/2013 4:14 PM, H. S. Teoh wrote:
 Would an alias be appropriate in this case, so that we can put a
 concrete name to the function return type? Or is that an abuse of
 alias?

 Actually, that sounds like a good idea.

 Except that it's created out of template parameters by wrapping them
 in a couple of template. So not, can't do.

 [...]

 I was going to suggest something like:

 	alias typeof(return) ConcreteName;

 But then I realized it's impossible to use ConcreteName in the function
 signature, which makes it pointless to have the alias since it won't
 show up in the Ddoc. :-(

 Although, maybe something like this *might* be possible (but it's sorta
 getting to the point of being ridiculously complex just for the sake of
 making ddoc display a single identifier):

 	template ConcreteName(A,B,C) {
 		alias ConcreteName = SomeTemplate!(A, OtherTemplate!B, ...);
 	}

 	ConcreteName!(A,B,C) func(A,B,C,D)(D args) {
 		ConcreteName!(A,B,C) retVal = ...;
 		return retVal;
 	}

For basic needs there are 2 alias for custom Tries that map dchar -> 
bool and dchar -> some integer.

http://blackwhale.github.com/phobos/uni.html#CodepointSetTrie
http://blackwhale.github.com/phobos/uni.html#CodepointTrie

Both accept custom selection of bits per Trie level. The docs are not 
really helpful in this area yet. I'll make another pass and add more 
examples, glossary and so on.

-- 
Dmitry Olshansky

Jan 12 2013

Walter Bright <newshound2 digitalmars.com> writes:

On 1/12/2013 8:26 AM, Dmitry Olshansky wrote:
 I'll make another pass and add more examples, glossary
 and so on.

You can also try the ACRONYM macro:

ACRONYM=<acronym title="$+">$1</acronym> ($+)

If you mouse over an acronym, a little popup text will appear.

Jan 12 2013

Walter Bright <newshound2 digitalmars.com> writes:

On 1/12/2013 11:57 AM, Walter Bright wrote:
 On 1/12/2013 8:26 AM, Dmitry Olshansky wrote:
 I'll make another pass and add more examples, glossary
 and so on.

 You can also try the ACRONYM macro:

 ACRONYM=<acronym title="$+">$1</acronym> ($+)

 If you mouse over an acronym, a little popup text will appear.

More info:

http://html5doctor.com/the-abbr-element/

Although "octet", for example, isn't exactly an abbreviation.

Jan 12 2013

"monarch_dodra" <monarchdodra gmail.com> writes:

On Friday, 11 January 2013 at 19:31:13 UTC, Dmitry Olshansky 
wrote:
 It's been long over due to present the work I did during the 
 last GSOC as it was a summer not winter of code after all. 
 Unfortunately some compiler bugs, a new job :) and unrelated 
 events of importance have postponed its release beyond measure.

 Anyway it's polished and ready for the good old collective 
 destruction called peer review. I'm looking for a review 
 manager.
 [SNIP]

I don't have anything to add, but I still want to voice my 
appreciation for your work.

Well done. I hope this makes it through.

Jan 12 2013

Jacob Carlborg <doob me.com> writes:

On 2013-01-11 20:31, Dmitry Olshansky wrote:
 It's been long over due to present the work I did during the last GSOC
 as it was a summer not winter of code after all. Unfortunately some
 compiler bugs, a new job :) and unrelated events of importance have
 postponed its release beyond measure.

I assume that "sicmp" is faster than "icmp"? In that case you might want 
to mention that in the documentation.


-- 
/Jacob Carlborg

Jan 12 2013

Dmitry Olshansky <dmitry.olsh gmail.com> writes:

12-Jan-2013 19:01, Jacob Carlborg пишет:
 On 2013-01-11 20:31, Dmitry Olshansky wrote:
 It's been long over due to present the work I did during the last GSOC
 as it was a summer not winter of code after all. Unfortunately some
 compiler bugs, a new job :) and unrelated events of importance have
 postponed its release beyond measure.

 I assume that "sicmp" is faster than "icmp"? In that case you might want
 to mention that in the documentation.

Yeah, good point.

-- 
Dmitry Olshansky

Jan 12 2013

Walter Bright <newshound2 digitalmars.com> writes:

On 1/11/2013 11:31 AM, Dmitry Olshansky wrote:
 And documentation:
 http://blackwhale.github.com/phobos/uni.html

struct InversionList:


length() returns "number of characters", while "empty()" returns true if there 
are no codepoints.

Aren't the two redundant? And should it be characters or codepoints?

Jan 12 2013

Dmitry Olshansky <dmitry.olsh gmail.com> writes:

13-Jan-2013 03:39, Walter Bright пишет:
 On 1/11/2013 11:31 AM, Dmitry Olshansky wrote:
 And documentation:
 http://blackwhale.github.com/phobos/uni.html

 struct InversionList:


 length() returns "number of characters", while "empty()" returns true if
 there are no codepoints.

 Aren't the two redundant?

Containers commonly have both. But maybe I should kill length on 
different grounds because it's computed by walking the array of 
intervals [a, b) and summing up (b-a) lengths. It's O(N).... ouch.

The only problem is that then it's not at all obvious that length is 
better calculated via:
reduce!"a + (b[1] - b[0])"(set.byInterval, 0);
vs
walkLength(set.byChar);

where the latter may take enormous time to finish on common sets.

(damn it should by DChar at least)

 And should it be characters or codepoints?

Codepoints, everything works on codepoints, that is dchars. A character 
is a Grapheme. So all mentions of character that slipped through got to 
be fixed.

-- 
Dmitry Olshansky

Jan 12 2013

Jonathan M Davis <jmdavisProg gmx.com> writes:

On Sunday, January 13, 2013 11:00:50 Dmitry Olshansky wrote:
 13-Jan-2013 03:39, Walter Bright =D0=BF=D0=B8=D1=88=D0=B5=D1=82:
 On 1/11/2013 11:31 AM, Dmitry Olshansky wrote:
 And documentation:
 http://blackwhale.github.com/phobos/uni.html

=20
 struct InversionList:
=20
=20
 length() returns "number of characters", while "empty()" returns tr=


ue if
 there are no codepoints.
=20
 Aren't the two redundant?

=20
 Containers commonly have both. But maybe I should kill length on
 different grounds because it's computed by walking the array of
 intervals [a, b) and summing up (b-a) lengths. It's O(N).... ouch.
=20
 The only problem is that then it's not at all obvious that length is
 better calculated via:
 reduce!"a + (b[1] - b[0])"(set.byInterval, 0);
 vs
 walkLength(set.byChar);
=20
 where the latter may take enormous time to finish on common sets.
=20
 (damn it should by DChar at least)

According to std.container, length should never be defined unless it's =
at worst=20
O(log n) (though I would have expected it to require O(1); I don't know=
 of any=20
container that can calculate its lengeth in O(log n) but not O(1) - it'=
s=20
always O(1) or O(n) in my experience). So, don't define length if it's =
not that=20
efficient, even if it's not in std.container.

It _may_ be valuable to add a function which does the same (e.g.=20
calculateLength or calcLength or calcLen or something liket that), espe=
cially=20
if walkLength isn't going to be the most efficient.

And yes, providing both empty and length is common and expected. And si=
nce=20
empty is _required_ to be O(1), it's still guaranteed to be efficient, =
whereas=20
length isn't.

I really should ask Andrei why he made length require O(log n) instead =
O(1)...

- Jonathan M Davis

Jan 12 2013

Chad J <chadjoan __spam.is.bad__gmail.com> writes:

On 01/13/2013 02:28 AM, Jonathan M Davis wrote:
 I really should ask Andrei why he made length require O(log n) instead O(1)...

 - Jonathan M Davis

Are there cases where it can't be O(1)?

My current intuition is that length can always be cached.  If a 
container has a long length calculation, then it can be shortened by 
wrapping it in a proxy container that does nothing but expose the same 
operations as jackets around the original container.  These jackets 
forward all data in and out with no changes, except for one: they count 
the number of elements entering and leaving the underlying container and 
store this count in a length variable.

Jan 13 2013

Jonathan M Davis <jmdavisProg gmx.com> writes:

On Sunday, January 13, 2013 04:58:16 Chad J wrote:
 On 01/13/2013 02:28 AM, Jonathan M Davis wrote:
 I really should ask Andrei why he made length require O(log n) instead
 O(1)...
 
 - Jonathan M Davis

 
 Are there cases where it can't be O(1)?

Most definitely. Take a doubly linked list for example. Either length can be 
O(1) and splicing is then O(n), or length is O(n) and splicing is then O(1). 
That's because if you want to keep track of the length, you have to count the 
number of elements being spliced in. For instance, it's a relatively common 
mistake in C++ to use std::List's size function and compare it with 0 to see 
whether the list is empty, because size is O(n). The correct thing to do is to 
call empty and check its result.

You _could_ make std::list's size function be O(1), but that would mean that 
splicing becomes O(n), which C++98 definitely did not do (though I hear that 
C++11 made the interesting choice of changing how std::list works so that size 
is O(1) and splicing is O(n); I don't know if that's good or not).

std.container.slist and std.container.dlist don't define length precisely 
because it can't be O(1) given their design.

- Jonathan M Davis

Jan 13 2013

"monarch_dodra" <monarchdodra gmail.com> writes:

On Sunday, 13 January 2013 at 10:24:17 UTC, Jonathan M Davis 
wrote:
 On Sunday, January 13, 2013 04:58:16 Chad J wrote:
 On 01/13/2013 02:28 AM, Jonathan M Davis wrote:
 I really should ask Andrei why he made length require O(log 
 n) instead
 O(1)...
 
 - Jonathan M Davis

 
 Are there cases where it can't be O(1)?

 Most definitely. Take a doubly linked list for example. Either 
 length can be
 O(1) and splicing is then O(n), or length is O(n) and splicing 
 is then O(1).
 That's because if you want to keep track of the length, you 
 have to count the
 number of elements being spliced in. For instance, it's a 
 relatively common
 mistake in C++ to use std::List's size function and compare it 
 with 0 to see
 whether the list is empty, because size is O(n). The correct 
 thing to do is to
 call empty and check its result.

 You _could_ make std::list's size function be O(1), but that 
 would mean that
 splicing becomes O(n), which C++98 definitely did not do 
 (though I hear that
 C++11 made the interesting choice of changing how std::list 
 works so that size
 is O(1) and splicing is O(n); I don't know if that's good or 
 not).

 std.container.slist and std.container.dlist don't define length 
 precisely
 because it can't be O(1) given their design.

 - Jonathan M Davis

Yes, size is now "o(1)" in C++11.

However, splice's complexity remains "o(1)" if the list spliced 
from/to is the same. It only degenerates to "o(N)" if they are 
not the same list.

Also, note that with the old implementation, while length was 
"O(N)", most implementations could still cache the result it to 
give "0(1)" access, and only invalidate the cache when splicing 
with another list.

But enough off topic here I guess ;)

Jan 13 2013

Chad J <chadjoan __spam.is.bad__gmail.com> writes:

On 01/13/2013 05:23 AM, Jonathan M Davis wrote:
 On Sunday, January 13, 2013 04:58:16 Chad J wrote:
 On 01/13/2013 02:28 AM, Jonathan M Davis wrote:
 I really should ask Andrei why he made length require O(log n) instead
 O(1)...

 - Jonathan M Davis

 Are there cases where it can't be O(1)?

 Most definitely. Take a doubly linked list for example. Either length can be
 O(1) and splicing is then O(n), or length is O(n) and splicing is then O(1).
 That's because if you want to keep track of the length, you have to count the
 number of elements being spliced in. For instance, it's a relatively common
 mistake in C++ to use std::List's size function and compare it with 0 to see
 whether the list is empty, because size is O(n). The correct thing to do is to
 call empty and check its result.

 You _could_ make std::list's size function be O(1), but that would mean that
 splicing becomes O(n), which C++98 definitely did not do (though I hear that
 C++11 made the interesting choice of changing how std::list works so that size
 is O(1) and splicing is O(n); I don't know if that's good or not).

 std.container.slist and std.container.dlist don't define length precisely
 because it can't be O(1) given their design.

 - Jonathan M Davis

Thanks!

That's a good example.  It made sense once I realized that you could be 
splicing in arbitrary sections of another list, not just an entire other 
list.

If I were to used cached lengths and splice in another list with a 
cached length, then I would just add the lengths of the two lists and do 
an O(1) splice.  I can see how this wouldn't cover all cases though.

Jan 13 2013

Dmitry Olshansky <dmitry.olsh gmail.com> writes:

13-Jan-2013 13:58, Chad J пишет:
 On 01/13/2013 02:28 AM, Jonathan M Davis wrote:
 I really should ask Andrei why he made length require O(log n) instead
 O(1)...

 - Jonathan M Davis

 Are there cases where it can't be O(1)?

 My current intuition is that length can always be cached.

Caching of length is suitable for adapters on top of base/core containers.

  If a
 container has a long length calculation, then it can be shortened by
 wrapping it in a proxy container that does nothing but expose the same
 operations as jackets around the original container.  These jackets
 forward all data in and out with no changes, except for one: they count
 the number of elements entering and leaving the underlying container and
 store this count in a length variable.

Yeah, exactly.


-- 
Dmitry Olshansky

Jan 13 2013

Martin Nowak <code dawg.eu> writes:

On 01/11/2013 08:31 PM, Dmitry Olshansky wrote:
 There is an extra candy for meta-programming text processing libraries:
 a set type can generate source code of unrolled hard-coded binary search
 for a given set of codepoints.

Great, I already had to do that myself for a DFA Lexer.

Jan 15 2013

Dmitry Olshansky <dmitry.olsh gmail.com> writes:

16-Jan-2013 06:37, Martin Nowak пишет:
 On 01/11/2013 08:31 PM, Dmitry Olshansky wrote:
 There is an extra candy for meta-programming text processing libraries:
 a set type can generate source code of unrolled hard-coded binary search
 for a given set of codepoints.

 Great, I already had to do that myself for a DFA Lexer.

The only thing left that I'd love to improve is to integrate UTF 
awareness (for no-decoding processing) into it so that it becomes:
auto set = ...
...
set.toSourceCode("func", Encoding.UTF8);

and prototype of generated function
is something along the lines of:
size_t func(char[] src);

or:
size_t func(char[] src, size_t offset);

that returns number of matched code units and 0 on failure.

-- 
Dmitry Olshansky

Jan 16 2013

Dmitry Olshansky <dmitry.olsh gmail.com> writes:

11-Jan-2013 23:31, Dmitry Olshansky пишет:
 The code, including extra tests and a benchmark is here:
 https://github.com/blackwhale/gsoc-bench-2012

 And documentation:
 http://blackwhale.github.com/phobos/uni.html

First of all,  safe pure and nothrow is back. Let me know if something 
is still not.

OK, I've made an extra pass through docs with these things in mind:
- getting the introduction & terminology part right
- more explanations and details where applicable
  (let me if that's too much / too little / wrong)
- hiding away the truly generic (and not easy to use) Trie from 
documentation
- old deprecated stuff is hidden from docs to discourage its use

The whole Trie template was hidden because it needs to be in 
std.container anyway, and I want to extend & tweak it some more before 
committing it to public interface. It may very well worth a short review 
on its own.

I've spent some hours to get an easy, useful and correct (as far as it 
gets) terminology throughout the module.

In that wake 2 primary definitions are:
A character means an encoded character, that is code point that has 
*assigned* mapping to abstract character (i.e. symbolically 
representable meaning).
A code point is a point in a range of [0, 10FFFF] that is a value not 
necessarily having a symbolic meaning.

Things I'm aware of that are not there yet:

1. "Moar" examples everywhere!

2. Expand on the description of certain primitives.

3. Table that lists all of the properties, scripts and other sets 
available through the 'unicode' helper.


-- 
Dmitry Olshansky

Jan 16 2013

Walter Bright <newshound2 digitalmars.com> writes:

On 1/16/2013 2:48 AM, Dmitry Olshansky wrote:
 I've spent some hours to get an easy, useful and correct (as far as it gets)
 terminology throughout the module.

Thank you. Looking at the Terminology section (the reference to it at the 
beginning should be a hyperlink):

"Not all code points are assigned to encoded characters.":

	?? I thought that was the whole point?

"Note that UTF-32 code unit (dchar) holds the actual code point value." =>
"Note 
that in UTF-32, a code unit is a code point and is represented by the D dchar
type."

What happened to "octet", which I thought was the official term?

"Also known as simply character."

	No, please no, at least not in this document. I suspect you need to ban the 
word "character" from this page. It is so overloaded in meaning that it is
useless.

"An abstract character does not necessarily correspond to what a user thinks of 
as a “character” and should not be confused with a Grapheme."

	This just makes me cry. Who knows what a user thinks of as a character? "not 
necessarily" means what? Is "Grapheme" a Unicode term?

Why can't there be precise definitions of these terms? I wonder if even the 
Unicode standard people have no idea exactly what they are.

Sorry for the rant, but unicode terms always make me mad.

Jan 16 2013

Dmitry Olshansky <dmitry.olsh gmail.com> writes:

16-Jan-2013 23:35, Walter Bright пишет:
 On 1/16/2013 2:48 AM, Dmitry Olshansky wrote:
 I've spent some hours to get an easy, useful and correct (as far as it
 gets)
 terminology throughout the module.

 Thank you. Looking at the Terminology section (the reference to it at
 the beginning should be a hyperlink):

 "Not all code points are assigned to encoded characters.":

      ?? I thought that was the whole point?

Obviously it's not. The simple truth is only 110K+ codepoints are 
currently assigned everything else is either reserved for speicific use 
(internal, surrogates etc.) or for future additions.

Thin of it this way - previously they thought a unit of symbolic info 
was 16 bits. Then a lot of problems cropped up (including adapting 8-bit 
only/Latin-1/ascii systems). Now they treat encoding as form of data 
storage and everything in the Unicode standard defined as operating on 
_values_ of codespace (that is code points).

This doesn't mean that all these values are real characters.

 "Note that UTF-32 code unit (dchar) holds the actual code point value."
 => "Note that in UTF-32, a code unit is a code point and is represented
 by the D dchar type."

Yeah, that's simpler.

 What happened to "octet", which I thought was the official term?

Octet means 8-bits as a unit of data in just about any Internet protocol 
description. What do you what it here for?

 "Also known as simply character."

      No, please no, at least not in this document. I suspect you need to
 ban the word "character" from this page. It is so overloaded in meaning
 that it is useless.

It's not me. It's just the way things are and people have to be aware 
this particular meaning of character.

Trust me, I've first converted the whole thing to code point(s) (note 
the space). Then I read it and it was like "meh" even to me who wrote 
the stuff in the first place. Then I looked at all documents by Unicode 
consortium they use 'character' throughout to mean either encoded 
character or abstract character depending on the time of day.

Utility beats pedantry and thus 'character' is 'encoded character' 
everywhere it's used. Probably I'll remove this statement so that 
uninitiated don't fixate on it too much.

 "An abstract character does not necessarily correspond to what a user
 thinks of as a “character” and should not be confused with a Grapheme."

      This just makes me cry. Who knows what a user thinks of as a
 character? "not necessarily" means what? Is "Grapheme" a Unicode term?

Yes and yes. The user thinks a character is what he usually writes on a 
piece of paper obviously. It's a warning that what a typical user thinks 
a character doesn't always match the character in the Unicode.

Grapheme is a unicode term that has direct substitute in this library 
hence the hyper-link. Grapheme are _visually_ a single entity. In fact 
the could be a sequence consisting of a base character (Unicode standard 
says 'character') + sequence of combining marks. This is called 
combining character sequence, but there are other kinds of grapheme see 
e.g. Hangul syllables and more weird stuff.

BTW the text is taken as is from the Unicode standard definitions.
I can litter a couple more pages with this crap, but instead I've taken
the most important ones and merged a couple of definitions to make it 
simpler (e.g. see glyph).

 Why can't there be precise definitions of these terms?

'cause they are muddy ;)
'Abstract character' is one of the ugliest actually.

I wonder if even
 the Unicode standard people have no idea exactly what they are.

 Sorry for the rant, but unicode terms always make me mad.

Nothing I can cure here. And they know what they do but have a boatload 
of legacy crap that has to crawl on. Just recall the time where 
everybody thought that 16-bit codes should be enough for everybody. When 
they gradually took the whole world of writing systems into account 
things got dirty, awfully so.

After learning a lot about Unicode I'd say they managed to do the job 
just fine given the constraints. I, for one, wouldn't even dare to try 
to reinvent *these* wheels. A lot of stuff could have been smoothed and 
simplified but overall - it's complex because the writing systems are 
utterly irregular plus the backwards compatibility has taken its toll.

-- 
Dmitry Olshansky

Jan 16 2013

Walter Bright <newshound2 digitalmars.com> writes:

On 1/16/2013 12:16 PM, Dmitry Olshansky wrote:
 Nothing I can cure here. And they know what they do but have a boatload of
 legacy crap that has to crawl on. Just recall the time where everybody thought
 that 16-bit codes should be enough for everybody. When they gradually took the
 whole world of writing systems into account things got dirty, awfully so.

 After learning a lot about Unicode I'd say they managed to do the job just fine
 given the constraints. I, for one, wouldn't even dare to try to reinvent
*these*
 wheels. A lot of stuff could have been smoothed and simplified but overall -
 it's complex because the writing systems are utterly irregular plus the
 backwards compatibility has taken its toll.

I guess I have to accept it :-)

And thanks for taking on this pretty much impossible task.

Jan 16 2013

Dmitry Olshansky <dmitry.olsh gmail.com> writes:

16-Jan-2013 23:35, Walter Bright пишет:
 On 1/16/2013 2:48 AM, Dmitry Olshansky wrote:
 I've spent some hours to get an easy, useful and correct (as far as it
 gets)
 terminology throughout the module.

 Thank you. Looking at the Terminology section (the reference to it at
 the beginning should be a hyperlink):

 "Not all code points are assigned to encoded characters.":

      ?? I thought that was the whole point?

And a lot of people really want to go this way see e.g. this:
http://www.parrot.org/content/what-nfg-why-you-want-parrot-have-it

There is a value in that but:
a) it's not portable anywhere except for in-memory usage \
b) everything input/output got to be converted possibly creating new 
entries in a global dictionary of graphemes.

-- 
Dmitry Olshansky

Jan 16 2013

"H. S. Teoh" <hsteoh quickfur.ath.cx> writes:

On Wed, Jan 16, 2013 at 02:48:30PM +0400, Dmitry Olshansky wrote:
 11-Jan-2013 23:31, Dmitry Olshansky пишет:
The code, including extra tests and a benchmark is here:
https://github.com/blackwhale/gsoc-bench-2012

And documentation:
http://blackwhale.github.com/phobos/uni.html

 
 First of all,  safe pure and nothrow is back. Let me know if
 something is still not.
 
 OK, I've made an extra pass through docs with these things in mind:
 - getting the introduction & terminology part right
 - more explanations and details where applicable
  (let me if that's too much / too little / wrong)
 - hiding away the truly generic (and not easy to use) Trie from
 documentation
 - old deprecated stuff is hidden from docs to discourage its use

[...]

Looks much better now!

Some nitpicks:

- Under Overview:

	[4th paragraph] "It's recognized that an application may need
	further enhancements and extensions. It could be the need for
	less commonly known algorithms or tailoring existing ones for
	regional-specific needs. To help users with building any extra
	functionality beyond the core primitives the module provides:"

  The grammar nazi in me thinks a better wording might be (changes
  delimited by {{}}):

	"It's recognized that an application may need further
	enhancements and extensions{{, such as}} less{{-}}commonly known
	algorithms{{,}} or tailoring existing ones for
	{{region}}-specific needs. To help users with building any extra
	functionality beyond the core primitives{{,}} the module
	provides:"

  The second item in the subsequent list:

	A way to construct optimal packed multi-stage tables also known
	as a special case of Trie. {{The functions}} codepointTrie,
	codepointSetTrie construct custom tries that map dchar to value.
	The end result is {{a}} fast and predictable Ο(1) lookup that
	powers functions like isAlpha {{and}} combiningClass{{,}} but
	for user-defined data sets.

  The last item in the list:

	Access to the commonly{{-}}used predefined sets of code points.
	The commonly{{-}}defined one{{s}} can be observed in the CLDR
	utility, on {{the}} page property index. {{S}}upported ones
	include Script, Block and General Category. See unicode for easy
	{{}} compile-time checked queries.

- Under Terminology:

	[[3rd paragraph]] "The minimal bit combination that can represent
	a unit of encoded text for processing or interchange. Depending
	on the encoding this could be: 8-bit code units in the UTF-8
	(($D char)), [...]"

  I think you transposed the "$(" here. :)

  The last sentence in this section appears to be truncated. Maybe a
  runaway DDoc macro somewhere earlier?

- Under Construction of lookup tables, the grammar nazi says:

	[[1st sentence]] "{{The}} Unicode standard describes a set of
	algorithms that {{}} depend on having {{the}} ability to quickly
	look{{ }}up various properties of a code point. Given the the
	codespace of about 1 million code points, it is not a trivial
	task to providing a space{{-}}efficient solution for the {{}}
	multitude of properties."

	[[2nd paragraph]] "[...] Hash-tables {{have}} enormous memory
	footprint and binary search over intervals is not fast enough
	for some heavy-duty algorithms."

	[[3rd paragraph]] "{{(P }}The recommended solution (see Unicode
	Implementation Guidelines) {{}} is using multi-stage tables{{,}}
	that is{{,}} {{an instance}} of Trie with integer keys and {{a}}
	fixed number of stages. For the {{remainder}} of {{this}}
	section {{it will be}} called {{a}} fixed trie. The following
	describes a particular implementation that is aimed for the
	speed of access at the expense of ideal size savings."

	[[4th paragraph]] "[...] Split {{the}} number of bits in a key
	(code point, 21 bits) {{into}} 2 components (e.g. 15 and 8). The
	first is the number of bits in the index of {{the}} trie and the
	other is {{the}} number of bits {{in each}} page of {{the}}
	trie. The layout of trie is then an array of size
	2^^bits-of-index followed an array of memory chunks of size
	2^^bits-of-page/size-of-element."

	[[5th paragraph]] "[...] {{The}} slots of {{the}} index all have
	to contain {{the same[?] number of pages}}. The lookup is then
	just a couple of operations - slice {{the}} upper bits, {{then}}
	look{{ }}up {{the}} index for these{{.}} The pseudo-code is:"

	[[Following the code example]] "[...] Where if the elemsPerPage
	is a power of 2 the whole process is a handful of simple
	instructions and 2 array reads.  {{Subsequent}} levels of
	{{the}} trie are introduced by recursing {{}} this notion - the
	index array is treated as values. The number of bits in {{the}}
	index is then again split into 2 parts, with pages over
	'current-index' and {{the}} new 'upper-index'."

	[[Next paragraph]] "For completeness the level 1 trie is simply
	an array. {{The}} current implementation takes advantage of
	bit-packing values when the range is known to be limited in {{}}
	advance (such as bool){{.}} {{S}}ee also BitPacked for enforcing
	it manually.  [...]"

	[[Last paragraph]] "The process of construction of a trie is
	more involved and is hidden from the user in a form of
	{{convenience}} functions: codepointTrie, codepointSetTrie and
	even more convenient toTrie. In general a set or built-in AA
	with dchar type can be turned into a trie. The trie object in
	this module is {{}} read-only (immutable){{;}} it's effectively
	frozen after construction."


The grammar nazi has run out of steam, so no more grammar nitpicks for
now. ;-) But there are still the following questions:

- Why is isControl() not pure nothrow?

- Why are the isX() functions  system? I would have expected they should
  be at least  trusted? (Or are there technical problems / compiler bugs
  preventing this?)

That's all for now. I hope you don't mind me allowing the grammar nazi
to take over for a bit. I want Phobos documentation to be professional
quality. :)


T

-- 
The trouble with TCP jokes is that it's like hearing the same joke over and
over.

Jan 17 2013

Dmitry Olshansky <dmitry.olsh gmail.com> writes:

17-Jan-2013 22:48, H. S. Teoh пишет:
 On Wed, Jan 16, 2013 at 02:48:30PM +0400, Dmitry Olshansky wrote:
 11-Jan-2013 23:31, Dmitry Olshansky пишет:
 The code, including extra tests and a benchmark is here:
 https://github.com/blackwhale/gsoc-bench-2012

 And documentation:
 http://blackwhale.github.com/phobos/uni.html

 First of all,  safe pure and nothrow is back. Let me know if
 something is still not.

 OK, I've made an extra pass through docs with these things in mind:
 - getting the introduction & terminology part right
 - more explanations and details where applicable
   (let me if that's too much / too little / wrong)
 - hiding away the truly generic (and not easy to use) Trie from
 documentation
 - old deprecated stuff is hidden from docs to discourage its use

 [...]

 Looks much better now!

 Some nitpicks:

Great ;)

 - Under Overview:

 	[4th paragraph] "It's recognized that an application may need
 	further enhancements and extensions. It could be the need for
 	less commonly known algorithms or tailoring existing ones for
 	regional-specific needs. To help users with building any extra
 	functionality beyond the core primitives the module provides:"

    The grammar nazi in me thinks a better wording might be (changes
    delimited by {{}}):

 	"It's recognized that an application may need further
 	enhancements and extensions{{, such as}} less{{-}}commonly known
 	algorithms{{,}} or tailoring existing ones for
 	{{region}}-specific needs. To help users with building any extra
 	functionality beyond the core primitives{{,}} the module
 	provides:"

    The second item in the subsequent list:

 	A way to construct optimal packed multi-stage tables also known
 	as a special case of Trie. {{The functions}} codepointTrie,
 	codepointSetTrie construct custom tries that map dchar to value.
 	The end result is {{a}} fast and predictable Ο(1) lookup that
 	powers functions like isAlpha {{and}} combiningClass{{,}} but
 	for user-defined data sets.

    The last item in the list:

 	Access to the commonly{{-}}used predefined sets of code points.
 	The commonly{{-}}defined one{{s}} can be observed in the CLDR
 	utility, on {{the}} page property index. {{S}}upported ones
 	include Script, Block and General Category. See unicode for easy
 	{{}} compile-time checked queries.

 - Under Terminology:

 	[[3rd paragraph]] "The minimal bit combination that can represent
 	a unit of encoded text for processing or interchange. Depending
 	on the encoding this could be: 8-bit code units in the UTF-8
 	(($D char)), [...]"

    I think you transposed the "$(" here. :)

Looks like one commit wasn't pushed :(
I'll peruse you wording though.

    The last sentence in this section appears to be truncated. Maybe a
    runaway DDoc macro somewhere earlier?

 - Under Construction of lookup tables, the grammar nazi says:

 	[[1st sentence]] "{{The}} Unicode standard describes a set of
 	algorithms that {{}} depend on having {{the}} ability to quickly
 	look{{ }}up various properties of a code point. Given the the
 	codespace of about 1 million code points, it is not a trivial
 	task to providing a space{{-}}efficient solution for the {{}}
 	multitude of properties."

 	[[2nd paragraph]] "[...] Hash-tables {{have}} enormous memory
 	footprint and binary search over intervals is not fast enough
 	for some heavy-duty algorithms."

 	[[3rd paragraph]] "{{(P }}The recommended solution (see Unicode
 	Implementation Guidelines) {{}} is using multi-stage tables{{,}}
 	that is{{,}} {{an instance}} of Trie with integer keys and {{a}}
 	fixed number of stages. For the {{remainder}} of {{this}}
 	section {{it will be}} called {{a}} fixed trie. The following
 	describes a particular implementation that is aimed for the
 	speed of access at the expense of ideal size savings."

 	[[4th paragraph]] "[...] Split {{the}} number of bits in a key
 	(code point, 21 bits) {{into}} 2 components (e.g. 15 and 8). The
 	first is the number of bits in the index of {{the}} trie and the
 	other is {{the}} number of bits {{in each}} page of {{the}}
 	trie. The layout of trie is then an array of size
 	2^^bits-of-index followed an array of memory chunks of size
 	2^^bits-of-page/size-of-element."

 	[[5th paragraph]] "[...] {{The}} slots of {{the}} index all have
 	to contain {{the same[?] number of pages}}. The lookup is then
 	just a couple of operations - slice {{the}} upper bits, {{then}}
 	look{{ }}up {{the}} index for these{{.}} The pseudo-code is:"

 	[[Following the code example]] "[...] Where if the elemsPerPage
 	is a power of 2 the whole process is a handful of simple
 	instructions and 2 array reads.  {{Subsequent}} levels of
 	{{the}} trie are introduced by recursing {{}} this notion - the
 	index array is treated as values. The number of bits in {{the}}
 	index is then again split into 2 parts, with pages over
 	'current-index' and {{the}} new 'upper-index'."

 	[[Next paragraph]] "For completeness the level 1 trie is simply
 	an array. {{The}} current implementation takes advantage of
 	bit-packing values when the range is known to be limited in {{}}
 	advance (such as bool){{.}} {{S}}ee also BitPacked for enforcing
 	it manually.  [...]"

 	[[Last paragraph]] "The process of construction of a trie is
 	more involved and is hidden from the user in a form of
 	{{convenience}} functions: codepointTrie, codepointSetTrie and
 	even more convenient toTrie. In general a set or built-in AA
 	with dchar type can be turned into a trie. The trie object in
 	this module is {{}} read-only (immutable){{;}} it's effectively
 	frozen after construction."


 The grammar nazi has run out of steam, so no more grammar nitpicks for
 now. ;-) But there are still the following questions:

 - Why is isControl() not pure nothrow?

Missed this one.

 - Why are the isX() functions  system? I would have expected they should
    be at least  trusted? (Or are there technical problems / compiler bugs
    preventing this?)

M-hm I'm seeing this in my sources:
bool isAlpha(dchar c)  safe pure nothrow
{...}

The DDoc however shows  system.

A compiler bug?

 That's all for now. I hope you don't mind me allowing the grammar nazi
 to take over for a bit. I want Phobos documentation to be professional
 quality. :)

Sure, thanks.

 T


-- 
Dmitry Olshansky

Jan 17 2013

Dmitry Olshansky <dmitry.olsh gmail.com> writes:

17-Jan-2013 22:48, H. S. Teoh пишет:
 On Wed, Jan 16, 2013 at 02:48:30PM +0400, Dmitry Olshansky wrote:
 11-Jan-2013 23:31, Dmitry Olshansky пишет:
 The code, including extra tests and a benchmark is here:
 https://github.com/blackwhale/gsoc-bench-2012

 And documentation:
 http://blackwhale.github.com/phobos/uni.html



[snip]

 - Why are the isX() functions  system? I would have expected they should
    be at least  trusted? (Or are there technical problems / compiler bugs
    preventing this?)

 M-hm I'm seeing this in my sources:
 bool isAlpha(dchar c)  safe pure nothrow
 {...}

 The DDoc however shows  system.

 A compiler bug?

It's indeed a bug in the compiler, looks funny:
http://d.puremagic.com/issues/show_bug.cgi?id=9371

Makes me think that the compiler treats  safe after
the function prototype differently then the one placed before it.
The trailing- safe camp must be jealous.

 That's all for now. I hope you don't mind me allowing the grammar nazi
 to take over for a bit. I want Phobos documentation to be professional
 quality. :)


All of these were incorporated, thanks.
Among other new stuff I've added some cross-links throughout.

Now on to the normalization forms description, etc.

-- 
Dmitry Olshansky

Jan 22 2013

"Jonathan M Davis" <jmdavisProg gmx.com> writes:

On Wednesday, January 23, 2013 00:20:19 Dmitry Olshansky wrote:
 It's indeed a bug in the compiler, looks funny:
 http://d.puremagic.com/issues/show_bug.cgi?id=9371
 
 Makes me think that the compiler treats  safe after
 the function prototype differently then the one placed before it.
 The trailing- safe camp must be jealous.

Prepending  safe is evil. ;)

That's definitely a weird bug though. I'd say that it pretty much has to be a 
bug in the parser given that the AST should be identical regardless of which 
side of the function signature an attribute is on.

- Jonathan m Davis

Jan 22 2013

D Programming

C/C++ Programming

Other

digitalmars.D - Ready for review: new std.uni