www.digitalmars.com         C & C++   DMDScript  

digitalmars.D - Proposal for SentinelInputRange

reply Walter Bright <newshound2 digitalmars.com> writes:
A SentinelInputRange is an InputRange with the following additions:

1. a compile time property 'sentinel' that is the terminating value of the range
2. empty is defined as: empty = (front == sentinel)
3. it is not necessary for empty to be called before front

A C style 0-terminated string is an example of a SentinelInputRange.

The additions to std.range would be:

1. isSentinelInputRange(T) which returns true if T is a SentinelInputRange
2. a unittest
3. documentation of this

An addition to std.string would be a function that takes a char* and returns a 
SentinelInputRange.

Motivation:
1. easy conversion of C strings to ranges
2. necessary for a fast implementation of a lexer

Any takers?
Feb 27 2013
next sibling parent reply Timon Gehr <timon.gehr gmx.ch> writes:
On 02/28/2013 02:11 AM, Walter Bright wrote:
 A SentinelInputRange is an InputRange with the following additions:

 1. a compile time property 'sentinel' that is the terminating value of
 the range
 2. empty is defined as: empty = (front == sentinel)
 3. it is not necessary for empty to be called before front

 A C style 0-terminated string is an example of a SentinelInputRange.

 The additions to std.range would be:

 1. isSentinelInputRange(T) which returns true if T is a SentinelInputRange
 2. a unittest
 3. documentation of this

 An addition to std.string would be a function that takes a char* and
 returns a SentinelInputRange.

 Motivation:
 1. easy conversion of C strings to ranges
 2. necessary for a fast implementation of a lexer

 Any takers?

This is not general enough. The concept is not exclusive to input ranges.
Feb 27 2013
parent reply "John Colvin" <john.loughran.colvin gmail.com> writes:
On Thursday, 28 February 2013 at 01:14:31 UTC, Timon Gehr wrote:
 On 02/28/2013 02:11 AM, Walter Bright wrote:
 A SentinelInputRange is an InputRange with the following 
 additions:

 1. a compile time property 'sentinel' that is the terminating 
 value of
 the range
 2. empty is defined as: empty = (front == sentinel)
 3. it is not necessary for empty to be called before front

 A C style 0-terminated string is an example of a 
 SentinelInputRange.

 The additions to std.range would be:

 1. isSentinelInputRange(T) which returns true if T is a 
 SentinelInputRange
 2. a unittest
 3. documentation of this

 An addition to std.string would be a function that takes a 
 char* and
 returns a SentinelInputRange.

 Motivation:
 1. easy conversion of C strings to ranges
 2. necessary for a fast implementation of a lexer

 Any takers?

This is not general enough. The concept is not exclusive to input ranges.

I presume you're remembering that if something is a forward, bidirectional or random access range it is also an input range. What further generality would you want?
Feb 27 2013
parent Timon Gehr <timon.gehr gmx.ch> writes:
On 02/28/2013 02:18 AM, John Colvin wrote:
 On Thursday, 28 February 2013 at 01:14:31 UTC, Timon Gehr wrote:
 On 02/28/2013 02:11 AM, Walter Bright wrote:
 A SentinelInputRange is an InputRange with the following additions:

 1. a compile time property 'sentinel' that is the terminating value of
 the range
 2. empty is defined as: empty = (front == sentinel)
 3. it is not necessary for empty to be called before front

 A C style 0-terminated string is an example of a SentinelInputRange.

 The additions to std.range would be:

 1. isSentinelInputRange(T) which returns true if T is a
 SentinelInputRange
 2. a unittest
 3. documentation of this

 An addition to std.string would be a function that takes a char* and
 returns a SentinelInputRange.

 Motivation:
 1. easy conversion of C strings to ranges
 2. necessary for a fast implementation of a lexer

 Any takers?

This is not general enough. The concept is not exclusive to input ranges.

I presume you're remembering that if something is a forward, bidirectional or random access range it is also an input range. What further generality would you want?

Never mind. I was confused.
Feb 27 2013
prev sibling next sibling parent reply "Zach the Mystic" <reachBUTMINUSTHISzach gOOGLYmail.com> writes:
On Thursday, 28 February 2013 at 01:12:19 UTC, Walter Bright 
wrote:
 A SentinelInputRange is an InputRange with the following 
 additions:

 1. a compile time property 'sentinel' that is the terminating 
 value of the range
 2. empty is defined as: empty = (front == sentinel)
 3. it is not necessary for empty to be called before front

 A C style 0-terminated string is an example of a 
 SentinelInputRange.

 The additions to std.range would be:

 1. isSentinelInputRange(T) which returns true if T is a 
 SentinelInputRange
 2. a unittest
 3. documentation of this

 An addition to std.string would be a function that takes a 
 char* and returns a SentinelInputRange.

 Motivation:
 1. easy conversion of C strings to ranges
 2. necessary for a fast implementation of a lexer

 Any takers?

Seems smart. Patches up a hole in the armor of Ranges. What if more than one value can end the range, EOF, '\0'?
Feb 27 2013
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 2/27/2013 5:53 PM, Zach the Mystic wrote:
 What if more than one
 value can end the range, EOF, '\0'?

I have never seen a need for that.
Feb 27 2013
next sibling parent reply Michel Fortin <michel.fortin michelf.ca> writes:
On 2013-02-28 02:05:20 +0000, "timotheecour" <thelastmammoth gmail.com> said:

 On Thursday, 28 February 2013 at 01:56:33 UTC, Walter Bright wrote:
 On 2/27/2013 5:53 PM, Zach the Mystic wrote:
 What if more than one
 value can end the range, EOF, '\0'?

I have never seen a need for that.

how about a predicate isSentinel instead of a fixed sentinel value? That'd allow more flexibility such as more than one sentinel value.

Not usable in a switch/case statement. -- Michel Fortin michel.fortin michelf.ca http://michelf.ca/
Feb 27 2013
parent Walter Bright <newshound2 digitalmars.com> writes:
On 2/27/2013 6:10 PM, timotheecour wrote:
 how about a predicate isSentinel instead of a fixed sentinel value?
 That'd allow more flexibility such as more than one sentinel value.

Not usable in a switch/case statement.

why is that needed?

Look at the lexer.c source code. tl,dr: PERFORMANCE!
Feb 27 2013
prev sibling next sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 2/27/2013 6:01 PM, Zach the Mystic wrote:
 On Thursday, 28 February 2013 at 01:56:33 UTC, Walter Bright wrote:
 On 2/27/2013 5:53 PM, Zach the Mystic wrote:
 What if more than one
 value can end the range, EOF, '\0'?

I have never seen a need for that.

Do you mean that you've never seen software that uses that, or that you've never been convinced that such software couldn't be made simpler?

Pretty much both. For example, often a foreach loop will exit either when it runs out of data or some other condition is met. I don't see a need to work both of those conditions into the definition of a range.
Feb 27 2013
next sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 2/27/2013 8:53 PM, Zach the Mystic wrote:
 My understanding of the logic of Sentinel Ranges so far is that switch
 statements and other control flow can proceed eagerly, because "go" values can
 be checked before the sentinel "stop" value, and "!empty" is known implicitly.
I
 don't know exactly where the speed benefits of having a single "stop" value
 known at compile time come from.

 Is this design focused more on your knowledge of how the compiler optimizes
 machine code, or on something which can be grasped at a higher level?

Take a look at lexer.c. With an InputRange, reading a character from a 0 terminated string requires two read operations. A SentinalInputRange requires only one.
Feb 27 2013
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 2/27/2013 9:20 PM, deadalnix wrote:
 If the range define empty with something like front == sentinel, the inliner
 should kick in a reduce the whole stuff to only one read, no ?

auto c = front; if (c == sentinel || c == XX) is two reads. This may not seem important, but when you want high speed, it can halve it.
Feb 27 2013
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 2/27/2013 10:02 PM, deadalnix wrote:
 On Thursday, 28 February 2013 at 05:28:47 UTC, Walter Bright wrote:
 On 2/27/2013 9:20 PM, deadalnix wrote:
 If the range define empty with something like front == sentinel, the inliner
 should kick in a reduce the whole stuff to only one read, no ?

auto c = front; if (c == sentinel || c == XX) is two reads. This may not seem important, but when you want high speed, it can halve it.

Don't you have to check for both all the time ? You have to check for the sentinel anyway.

I suggest again taking a look at the dmd source lexer.c for how to do it. There is no extra check.
Feb 27 2013
next sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 2/27/2013 11:37 PM, deadalnix wrote:
 It does check for sentinel in many places (and it has to).

No, it doesn't. I suggest examining line 481 of lexer.c. Then examine line 558 and 559. https://github.com/D-Programming-Language/dmd/blob/master/src/lexer.c
Feb 28 2013
next sibling parent Walter Bright <newshound2 digitalmars.com> writes:
On 2/28/2013 2:42 AM, deadalnix wrote:
 On Thursday, 28 February 2013 at 10:37:35 UTC, Walter Bright wrote:
 On 2/27/2013 11:37 PM, deadalnix wrote:
 It does check for sentinel in many places (and it has to).

No, it doesn't. I suggest examining line 481 of lexer.c. Then examine line 558 and 559. https://github.com/D-Programming-Language/dmd/blob/master/src/lexer.c

escapeSequence does check for it.

Note the lookahead is done without checking first. This is the salient point. A sentinel means you are guaranteed to always be able to look one character ahead without checking. This is used all over the lexer. The other InputRange forms do not allow peeking ahead without testing *first*.
Feb 28 2013
prev sibling parent Jacob Carlborg <doob me.com> writes:
On 2013-02-28 11:37, Walter Bright wrote:

 No, it doesn't. I suggest examining line 481 of lexer.c. Then examine
 line 558 and 559.

 https://github.com/D-Programming-Language/dmd/blob/master/src/lexer.c

You know that you can link to a specific line by clicking on it in the margin. -- /Jacob Carlborg
Feb 28 2013
prev sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 2/28/13 2:37 AM, deadalnix wrote:
 I don't see how defining a specific sentinel range here helps.

On first blush I agree. It may as well be a range that by convention is sentinel-terminated, and there's calls to front and popFront but never to empty. Andrei
Feb 28 2013
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 2/28/2013 6:59 AM, Andrei Alexandrescu wrote:
 On 2/28/13 2:37 AM, deadalnix wrote:
 I don't see how defining a specific sentinel range here helps.

On first blush I agree. It may as well be a range that by convention is sentinel-terminated, and there's calls to front and popFront but never to empty.

Consider the following code from lexer.c: p++; switch (*p) Written using an InputRange: popFront(); switch (front) That code is INVALID. This is why a SentinelInputRange is necessary. You can't just use an InputRange in an invalid manner by convention.
Feb 28 2013
next sibling parent reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
28-Feb-2013 21:09, Steven Schveighoffer пишет:
 On Thu, 28 Feb 2013 11:43:06 -0500, Walter Bright
 <newshound2 digitalmars.com> wrote:

 On 2/28/2013 6:59 AM, Andrei Alexandrescu wrote:
 On 2/28/13 2:37 AM, deadalnix wrote:
 I don't see how defining a specific sentinel range here helps.

On first blush I agree. It may as well be a range that by convention is sentinel-terminated, and there's calls to front and popFront but never to empty.

Consider the following code from lexer.c: p++; switch (*p) Written using an InputRange: popFront(); switch (front) That code is INVALID. This is why a SentinelInputRange is necessary. You can't just use an InputRange in an invalid manner by convention.

Does switch(*p) include a case for 0? If so, wouldn't it be equivalent to say if(empty) /* do stuff that case 0 does */ else switch(front) -Steve

No as a compiler will take it (or may depending on its brain) that 0 is what you want to test *first*. It may speed things up if branch is almost always taken but its not the case with sentinel. Thus its jsut dead code that needs to be decoded, evalutated and skipped (as predicated) *before* doing switch jump. In fact some people avoid the overhead of switch by placing one or two of highly-frequent branches with tests before the switch (thus avoiding indirect branch it entails in these frequent cases). -- Dmitry Olshansky
Feb 28 2013
next sibling parent Ary Borenszweig <ary esperanto.org.ar> writes:
On 2/28/13 2:32 PM, Steven Schveighoffer wrote:
 On Thu, 28 Feb 2013 12:13:27 -0500, Dmitry Olshansky
 <dmitry.olsh gmail.com> wrote:

 No as a compiler will take it (or may depending on its brain) that 0
 is what you want to test *first*. It may speed things up if branch is
 almost always taken but its not the case with sentinel. Thus its jsut
 dead code that needs to be decoded, evalutated and skipped (as
 predicated) *before* doing switch jump.

Well, according to lexer.c, case 0 is the first case. Why does that not make the compiler decide to test 0 first? Answer: it actually doesn't care, the switch compiles to a jump table. Which still begs the question, why can't the compiler make the leap to the equivalent code with a range? are these not semantically equivalent? if(x == 0) { // case for 0 } else { switch(x) { case 1: // case for 1; break; case 2: // case for 2; break; ... // cases for everything but 0 } } vs. switch(x) { case 0: // case for 0; break; case 1: // case for 1; break; case 2: // case for 2; break; ... // cases for everything but 0 } They seem semantically identical. It would not be impossible for an optimizer to make this leap. Combined with inlining, the above could be: if(r.empty) { // case for 0; } else ... and still be optimized the same. I'm not saying it happens, I'm saying it's possible. I have read several posts here claiming that llvm does this (I haven't tried this myself, but it sounds likely true).

It is true. Compiling this Crystal program: --- def my_fun(num) num == 0 end num = ARGV[0].to_i if my_fun(num) puts 0 else case num when 1 puts 1 when 2 puts 2 end end --- I see this somewhere in the generated llvm code: switch i32 %25, label %crystal_main.exit [ i32 0, label %then.i i32 1, label %then16.i i32 2, label %then20.i ]
Feb 28 2013
prev sibling next sibling parent Dmitry Olshansky <dmitry.olsh gmail.com> writes:
28-Feb-2013 21:38, deadalnix пишет:
 On Thursday, 28 February 2013 at 17:13:36 UTC, Dmitry Olshansky wrote:
 No as a compiler will take it (or may depending on its brain) that 0
 is what you want to test *first*. It may speed things up if branch is
 almost always taken but its not the case with sentinel. Thus its jsut
 dead code that needs to be decoded, evalutated and skipped (as
 predicated) *before* doing switch jump.

 In fact some people avoid the overhead of switch by placing one or two
 of highly-frequent branches with tests before the switch (thus
 avoiding indirect branch it entails in these frequent cases).

That won't work as expected with LLVM and full optimizations, as it will combine everything in a switch, unless you use branch weight, in which case it can do the reverse : extract common cases from the switch. See : http://llvm.org/docs/BranchWeightMetadata.html GCC does something similar. This is another example of how people think they are doing high level assembly when in fact, the compiler is changing everything behind their back, and not doing at all what they think.

Depends on the brains of compiler like I said. BTW it can't know the most frequent branches so jump-table may be actually slower (in some cases though they are niche). -- Dmitry Olshansky
Feb 28 2013
prev sibling parent Dmitry Olshansky <dmitry.olsh gmail.com> writes:
28-Feb-2013 21:38, deadalnix пишет:
 On Thursday, 28 February 2013 at 17:13:36 UTC, Dmitry Olshansky wrote:
 No as a compiler will take it (or may depending on its brain) that 0
 is what you want to test *first*. It may speed things up if branch is
 almost always taken but its not the case with sentinel. Thus its jsut
 dead code that needs to be decoded, evalutated and skipped (as
 predicated) *before* doing switch jump.

 In fact some people avoid the overhead of switch by placing one or two
 of highly-frequent branches with tests before the switch (thus
 avoiding indirect branch it entails in these frequent cases).

That won't work as expected with LLVM and full optimizations, as it will combine everything in a switch, unless you use branch weight, in which case it can do the reverse : extract common cases from the switch. See : http://llvm.org/docs/BranchWeightMetadata.html

behind the senescence. Then I stand corrected.
 GCC does something similar.

-- Dmitry Olshansky
Feb 28 2013
prev sibling parent FG <home fgda.pl> writes:
On 2013-02-28 17:43, Walter Bright wrote:
 Consider the following code from lexer.c:

     p++;
     switch (*p)

 Written using an InputRange:

     popFront();
     switch (front)

 That code is INVALID. This is why a SentinelInputRange is necessary. You can't
 just use an InputRange in an invalid manner by convention.

I do not understand... Why make a special type of InputRange when you can achieve exactly that with a normal string with an added extra '\0' at the end and then use it without any calls to empty(): while(1) { ... popFront(); switch(front) { case '\0': return; ... } } Cstrings are a very special case because we know the terminator beforehand and the check is trivial CPU-wise.
Feb 28 2013
prev sibling parent reply Jonathan M Davis <jmdavisProg gmx.com> writes:
On Thursday, February 28, 2013 05:53:13 Zach the Mystic wrote:
 My understanding of the logic of Sentinel Ranges so far is that
 switch statements and other control flow can proceed eagerly,
 because "go" values can be checked before the sentinel "stop"
 value, and "!empty" is known implicitly. I don't know exactly
 where the speed benefits of having a single "stop" value known at
 compile time come from.

Because the case value needs to be known at compile time, but even if it didn't have to be, it would be less efficient to have to call a function for it than it would be to have a value known at compile time.
 Is this design focused more on your knowledge of how the compiler
 optimizes machine code, or on something which can be grasped at a
 higher level?

I think that it's simply a matter of looking at how often empty has to be called in the lexer if you don't have a sentinel value. Most every time that you pop front, you'll have to check empty, whereas with a sentinel value, you'll almost never have to check it. That saves at least one operation for every character that's lexed. Now, given that you're going to have to wrap any range in order for to be sentinel range (unless it was designed as one in the first place), in most cases, you're going to be stuck checking empty all the time anyway. For strings wrapped in a sentinel range, you can avoid it, because you can just make its last element 0 (though that may mean reallocating the entire string depending on whether you can append 0 to it without reallocating). But you're still stuck wrapping it. I suspect that it would just be better to special case strings rather than to create this sentinel range idea, because I really don't see any benefit for it outside of strings, and it's not like it's hard to just use a couple of static ifs to avoid the check for empty and add the sentinel case when you're operating on strings. - Jonathan M Davis
Feb 27 2013
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 2/27/2013 9:06 PM, Jonathan M Davis wrote:
 Now, given that you're going to have to wrap any range in order for to be
 sentinel range

No. Look at how lexer.c is fed data.
Feb 27 2013
parent reply Jacob Carlborg <doob me.com> writes:
On 2013-02-28 06:13, Walter Bright wrote:

 No. Look at how lexer.c is fed data.

C strings have a sentinel from the beginning, '\0'. -- /Jacob Carlborg
Feb 28 2013
parent Walter Bright <newshound2 digitalmars.com> writes:
On 2/28/2013 12:20 AM, Jacob Carlborg wrote:
 On 2013-02-28 06:13, Walter Bright wrote:

 No. Look at how lexer.c is fed data.

C strings have a sentinel from the beginning, '\0'.

Look at how the source file is transformed into a 0 terminated input. Files are not C strings. There are no extra copies and no extra tests.
Feb 28 2013
prev sibling next sibling parent Walter Bright <newshound2 digitalmars.com> writes:
On 2/27/2013 6:05 PM, timotheecour wrote:
 On Thursday, 28 February 2013 at 01:56:33 UTC, Walter Bright wrote:
 On 2/27/2013 5:53 PM, Zach the Mystic wrote:
 What if more than one
 value can end the range, EOF, '\0'?

I have never seen a need for that.

how about a predicate isSentinel instead of a fixed sentinel value? That'd allow more flexibility such as more than one sentinel value.

A fixed sentinel value makes it clear that it should be a compile time value for performance reasons. If it was a function, then there's really not much advantage to having a SentinelInputRange.
Feb 27 2013
prev sibling parent reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
28-Feb-2013 05:56, Walter Bright пишет:
 On 2/27/2013 5:53 PM, Zach the Mystic wrote:
 What if more than one
 value can end the range, EOF, '\0'?

I have never seen a need for that.

Inside of the D lexer: EOF is 0 or 0x1a :) -- Dmitry Olshansky
Feb 27 2013
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 2/27/2013 11:02 PM, Dmitry Olshansky wrote:
 Inside of the D lexer:

 EOF is 0 or 0x1a :)

There is always a terminating 0, even if the file ends in 0x1a. (The 0x1A comes from some old editors that end a file with a control Z.)
Feb 27 2013
parent reply Jacob Carlborg <doob me.com> writes:
On 2013-02-28 08:23, Walter Bright wrote:

 There is always a terminating 0, even if the file ends in 0x1a.

 (The 0x1A comes from some old editors that end a file with a control Z.)

http://dlang.org/lex.html, "End of File" shows this: EndOfFile: physical end of the file \u0000 \u001A Doesn't that mean that the input is empty if the file is empty, \u0000 or \u001A is encountered? -- /Jacob Carlborg
Feb 28 2013
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 2/28/2013 12:25 AM, Jacob Carlborg wrote:
 On 2013-02-28 08:23, Walter Bright wrote:

 There is always a terminating 0, even if the file ends in 0x1a.

 (The 0x1A comes from some old editors that end a file with a control Z.)

http://dlang.org/lex.html, "End of File" shows this: EndOfFile: physical end of the file \u0000 \u001A Doesn't that mean that the input is empty if the file is empty, \u0000 or \u001A is encountered?

Please follow the source code from file to input to the lexer. https://github.com/D-Programming-Language/dmd/blob/master/src/root/root.c line 1012 and 1038
Feb 28 2013
next sibling parent reply Jacob Carlborg <doob me.com> writes:
On 2013-02-28 11:44, Walter Bright wrote:

 Please follow the source code from file to input to the lexer.

 https://github.com/D-Programming-Language/dmd/blob/master/src/root/root.c

 line 1012 and 1038

I have no doubt in what you're saying. The spec just didn't look accurate according to what you said. -- /Jacob Carlborg
Feb 28 2013
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 2/28/2013 5:40 AM, Jacob Carlborg wrote:
 I have no doubt in what you're saying. The spec just didn't look accurate
 according to what you said.

I don't know what you mean.
Feb 28 2013
parent Jacob Carlborg <doob me.com> writes:
On 2013-02-28 17:31, Walter Bright wrote:

 I don't know what you mean.

I'm just saying that you and the spec disagree. At least that's how I read the spec. -- /Jacob Carlborg
Feb 28 2013
prev sibling parent reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
28-Feb-2013 14:44, Walter Bright пишет:
 On 2/28/2013 12:25 AM, Jacob Carlborg wrote:
 On 2013-02-28 08:23, Walter Bright wrote:

 There is always a terminating 0, even if the file ends in 0x1a.

 (The 0x1A comes from some old editors that end a file with a control Z.)

http://dlang.org/lex.html, "End of File" shows this: EndOfFile: physical end of the file \u0000 \u001A Doesn't that mean that the input is empty if the file is empty, \u0000 or \u001A is encountered?


 Please follow the source code from file to input to the lexer.

Source as spec is no good. Either change the spec or admit that having a tuple of sentinels as manifest constant is fine. In any case constant tuples are easily unrolled into case statements.
 https://github.com/D-Programming-Language/dmd/blob/master/src/root/root.c

 line 1012 and 1038

-- Dmitry Olshansky
Feb 28 2013
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 2/28/2013 7:38 AM, Dmitry Olshansky wrote:
 28-Feb-2013 14:44, Walter Bright пишет:
 On 2/28/2013 12:25 AM, Jacob Carlborg wrote:
 On 2013-02-28 08:23, Walter Bright wrote:

 There is always a terminating 0, even if the file ends in 0x1a.

 (The 0x1A comes from some old editors that end a file with a control Z.)

http://dlang.org/lex.html, "End of File" shows this: EndOfFile: physical end of the file \u0000 \u001A Doesn't that mean that the input is empty if the file is empty, \u0000 or \u001A is encountered?


 Please follow the source code from file to input to the lexer.

Source as spec is no good. Either change the spec or admit that having a tuple of sentinels as manifest constant is fine. In any case constant tuples are easily unrolled into case statements.

The spec is correct, so is the code, and tuple sentinels are entirely unnecessary.
Feb 28 2013
parent reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
28-Feb-2013 20:33, Walter Bright пишет:
 On 2/28/2013 7:38 AM, Dmitry Olshansky wrote:
 28-Feb-2013 14:44, Walter Bright пишет:
 On 2/28/2013 12:25 AM, Jacob Carlborg wrote:
 On 2013-02-28 08:23, Walter Bright wrote:

 There is always a terminating 0, even if the file ends in 0x1a.

 (The 0x1A comes from some old editors that end a file with a
 control Z.)

http://dlang.org/lex.html, "End of File" shows this: EndOfFile: physical end of the file \u0000 \u001A Doesn't that mean that the input is empty if the file is empty, \u0000 or \u001A is encountered?


 Please follow the source code from file to input to the lexer.

Source as spec is no good. Either change the spec or admit that having a tuple of sentinels as manifest constant is fine. In any case constant tuples are easily unrolled into case statements.

The spec is correct, so is the code, and tuple sentinels are entirely unnecessary.

line 300: case 0x1A: // ^Z means end of file case 0: break; On the lines you noted it claimed that that 0x1a is outdate. Along with the fact that you allocate filesize+2 and fill the last 2 bytes with zeros. In any case I see 0 and 0x1a as 2 values that act like sentinels i.e. a tuple. And this is what spec says - any one of them is a sentinel. Correct me if I'm wrong. -- Dmitry Olshansky
Feb 28 2013
parent reply Timon Gehr <timon.gehr gmx.ch> writes:
On 02/28/2013 05:48 PM, Dmitry Olshansky wrote:
 ...

 line 300:
      case 0x1A:          // ^Z means end of file
      case 0:
                          break;

 On the lines you noted it claimed that that 0x1a is outdate. Along with
 the fact that you allocate filesize+2 and fill the last 2 bytes with zeros.

 In any case I see 0 and 0x1a as 2 values that act like sentinels i.e. a
 tuple. And this is what spec says - any one of them is a sentinel.
 Correct me if I'm wrong.

A sentinel is some data the original data is augmented with in order to simplify its processing. The lexer acts the same on 0x1A and 0, but only the additional 0 at the end which does not occur in the input is the sentinel. The lexer may even encounter a 0 that is not a sentinel.
Feb 28 2013
parent reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
28-Feb-2013 22:08, Timon Gehr пишет:
 On 02/28/2013 05:48 PM, Dmitry Olshansky wrote:
 ...

 line 300:
      case 0x1A:          // ^Z means end of file
      case 0:
                          break;

 On the lines you noted it claimed that that 0x1a is outdate. Along with
 the fact that you allocate filesize+2 and fill the last 2 bytes with
 zeros.

 In any case I see 0 and 0x1a as 2 values that act like sentinels i.e. a
 tuple. And this is what spec says - any one of them is a sentinel.
 Correct me if I'm wrong.

A sentinel is some data the original data is augmented with in order to simplify its processing.

I thought 0 was proposed as such. Might have misunderstood the proposition.
 The lexer acts the same on 0x1A and 0, but only the additional 0 at the
 end which does not occur in the input is the sentinel.

That would mean that when you see 0 or 0x1A you do a check to see if that's the end of input then decide it's really the end of input. If that's the intended behavior I fail to decipher it from here: http://dlang.org/lex.html#EndOfFile
 The lexer may
 even encounter a 0 that is not a sentinel.

-- Dmitry Olshansky
Feb 28 2013
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 2/28/2013 10:39 AM, Dmitry Olshansky wrote:
 That would mean that when you see 0 or 0x1A you do a check to see if that's the
 end of input then decide it's really the end of input.

A 0 or a 0x1A is the end of valid D source. Period. But that doesn't mean the sentinel has to be a tuple.
Feb 28 2013
parent reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
01-Mar-2013 02:57, Walter Bright пишет:
 On 2/28/2013 10:39 AM, Dmitry Olshansky wrote:
 That would mean that when you see 0 or 0x1A you do a check to see if
 that's the
 end of input then decide it's really the end of input.

A 0 or a 0x1A is the end of valid D source. Period.

So I thought. Well, I can't see the difference between 2 sentinel values: both 0 and 0x1A has to both work. For me that means both values have to be accepted as sentinels I'd call that a tuple of sentinels.
 But that doesn't mean the sentinel has to be a tuple.

Should I say you are getting cryptic? -- Dmitry Olshansky
Feb 28 2013
parent Jacob Carlborg <doob me.com> writes:
On 2013-03-01 07:12, Dmitry Olshansky wrote:

 A 0 or a 0x1A is the end of valid D source. Period.

So I thought. Well, I can't see the difference between 2 sentinel values: both 0 and 0x1A has to both work. For me that means both values have to be accepted as sentinels I'd call that a tuple of sentinels.

I think I understand how Walter things here. The D code always ends with '0' or '0x1A'. Regardless of what the end character is, DMD will always add an '0' to the end. But you do need to check for both 0 and 0x1A. -- /Jacob Carlborg
Feb 28 2013
prev sibling next sibling parent "Zach the Mystic" <reachBUTMINUSTHISzach gOOGLYmail.com> writes:
On Thursday, 28 February 2013 at 01:56:33 UTC, Walter Bright 
wrote:
 On 2/27/2013 5:53 PM, Zach the Mystic wrote:
 What if more than one
 value can end the range, EOF, '\0'?

I have never seen a need for that.

Do you mean that you've never seen software that uses that, or that you've never been convinced that such software couldn't be made simpler?
Feb 27 2013
prev sibling next sibling parent "timotheecour" <thelastmammoth gmail.com> writes:
On Thursday, 28 February 2013 at 01:56:33 UTC, Walter Bright 
wrote:
 On 2/27/2013 5:53 PM, Zach the Mystic wrote:
 What if more than one
 value can end the range, EOF, '\0'?

I have never seen a need for that.

how about a predicate isSentinel instead of a fixed sentinel value? That'd allow more flexibility such as more than one sentinel value.
Feb 27 2013
prev sibling next sibling parent "timotheecour" <thelastmammoth gmail.com> writes:
 how about a predicate isSentinel instead of a fixed sentinel 
 value?
 That'd allow more flexibility such as more than one sentinel 
 value.

Not usable in a switch/case statement.

why is that needed?
Feb 27 2013
prev sibling next sibling parent reply "John Colvin" <john.loughran.colvin gmail.com> writes:
On Thursday, 28 February 2013 at 01:12:19 UTC, Walter Bright 
wrote:
 A SentinelInputRange is an InputRange with the following 
 additions:

 1. a compile time property 'sentinel' that is the terminating 
 value of the range
 2. empty is defined as: empty = (front == sentinel)
 3. it is not necessary for empty to be called before front

 A C style 0-terminated string is an example of a 
 SentinelInputRange.

 The additions to std.range would be:

 1. isSentinelInputRange(T) which returns true if T is a 
 SentinelInputRange
 2. a unittest
 3. documentation of this

 An addition to std.string would be a function that takes a 
 char* and returns a SentinelInputRange.

 Motivation:
 1. easy conversion of C strings to ranges
 2. necessary for a fast implementation of a lexer

 Any takers?

Why must sentinel be known at compile time? I don't see what's in the way of it being a runtime argument.
Feb 27 2013
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 2/27/2013 8:01 PM, John Colvin wrote:
 Why must sentinel be known at compile time? I don't see what's in the way of it
 being a runtime argument.

Performance! It should be usable as a case in a switch statement.
Feb 27 2013
next sibling parent Walter Bright <newshound2 digitalmars.com> writes:
On 2/27/2013 9:03 PM, Andrej Mitrovic wrote:
 But I guess this defeats the purpose of the switch statement (i.e.
optimizations).

It certainly does defeat the performance of the switch statement, and hence the point of the sentinel.
 Btw can we get a clear opinion on whether 5862 is an actual bug that needs to
be
 fixed?

I'd have to investigate.
Feb 27 2013
prev sibling parent reply Ary Borenszweig <ary esperanto.org.ar> writes:
On 2/28/13 1:57 AM, Walter Bright wrote:
 On 2/27/2013 8:01 PM, John Colvin wrote:
 Why must sentinel be known at compile time? I don't see what's in the
 way of it
 being a runtime argument.

Performance! It should be usable as a case in a switch statement.

Isn't it possible for the optimizer to inline the function call and then combine the next ifs? if (isSentinel(value)) { } else { switch(value) { case ... } }
Feb 28 2013
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 2/28/2013 6:48 AM, Ary Borenszweig wrote:
 On 2/28/13 1:57 AM, Walter Bright wrote:
 On 2/27/2013 8:01 PM, John Colvin wrote:
 Why must sentinel be known at compile time? I don't see what's in the
 way of it
 being a runtime argument.

Performance! It should be usable as a case in a switch statement.

Isn't it possible for the optimizer to inline the function call and then combine the next ifs? if (isSentinel(value)) { } else { switch(value) { case ... } }

1. I don't know of any C compiler that would do that. 2. There are other cases, as I pointed out to deadalnix. 3. You still can't do lookahead without extra checks Once again, guys, have a look at lexer.c at the lines I pointed out.
Feb 28 2013
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 2/28/2013 9:05 AM, deadalnix wrote:
 I see nothing that can't be done by an optimizing compiler. Maybe something is,
 but I can't find it. And the example you pointed me aren't.

Tell me how front() will be optimized out to this. case '>': p++; if (*p == '=') { p++; t->value = TOKge; // >= } else if (*p == '>') { p++; if (*p == '=') { p++; t->value = TOKshrass; // >>= } else if (*p == '>') { p++; if (*p == '=') { p++; t->value = TOKushrass; // >>>= } else t->value = TOKushr; // >>> } else t->value = TOKshr; // >> } else t->value = TOKgt; // > return;
Feb 28 2013
parent reply Timon Gehr <timon.gehr gmx.ch> writes:
On 02/28/2013 10:40 PM, Walter Bright wrote:
 On 2/28/2013 9:05 AM, deadalnix wrote:
 I see nothing that can't be done by an optimizing compiler. Maybe
 something is,
 but I can't find it. And the example you pointed me aren't.

Tell me how front() will be optimized out to this. case '>': p++; if (*p == '=') { p++; t->value = TOKge; // >= } else if (*p == '>') { p++; if (*p == '=') { p++; t->value = TOKshrass; // >>= } else if (*p == '>') { p++; if (*p == '=') { p++; t->value = TOKushrass; // >>>= } else t->value = TOKushr; // >>> } else t->value = TOKshr; // >> } else t->value = TOKgt; // > return;

I'd be surprised if DMD couldn't do it already. assume: popFront(){ p++; }, front()=>*p, empty()=>*p=='\0' case '>': popFront(); if(!empty && front == '='){ popFront(); t->value = TOKge; // >= }else if(!empty && front == '>'){ popFront(); if(!empty && front == '='){ popFront(); t->value = TOKshrass; // >>= }else if(!empty && front == '>'){ popFront(); if(!empty && front == '='){ popFront(); t->value = TOKushrass; // >>>= }else t->value = TOKushr; // >>> }else t->value = TOKshr; // >> }else t->value = TOKgt; // > return; // inlining => case '>': p++; if(!(*p=='\0') && *p == '='){ p++; t->value = TOKge; // >= }else if(!(*p=='\0') && *p == '>'){ p++; if(!(*p=='\0') && *p == '='){ p++; t->value = TOKshrass; // >>= }else if(!(*p=='\0') && *p == '>'){ p++; if(!(*p=='\0') && *p == '='){ p++; t->value = TOKushrass; // >>>= }else t->value = TOKushr; // >>> }else t->value = TOKshr; // >> }else t->value = TOKgt; // > return; // expression simplification // (eg. trivial peephole for x!=a && x==b case sufficient) // => case '>': p++; if(*p == '='){ p++; t->value = TOKge; // >= }else if(*p == '>'){ p++; if(*p == '='){ p++; t->value = TOKshrass; // >>= }else if(*p == '>'){ p++; if(*p == '='){ p++; t->value = TOKushrass; // >>>= }else t->value = TOKushr; // >>> }else t->value = TOKshr; // >> }else t->value = TOKgt; // > return;
Feb 28 2013
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 2/28/2013 2:16 PM, Timon Gehr wrote:
      // expression simplification
      // (eg. trivial peephole for x!=a && x==b case sufficient)

Hmm, I hadn't thought of that. Some checking shows that VC and DMC do not do it, gcc and clang do.
Feb 28 2013
parent Jacob Carlborg <doob me.com> writes:
On 2013-03-01 00:23, Walter Bright wrote:

 Hmm, I hadn't thought of that. Some checking shows that VC and DMC do
 not do it, gcc and clang do.

Time for some new optimizations for DMC :) -- /Jacob Carlborg
Feb 28 2013
prev sibling next sibling parent Marco Leise <Marco.Leise gmx.de> writes:
Am Thu, 28 Feb 2013 05:01:39 +0100
schrieb "John Colvin" <john.loughran.colvin gmail.com>:

 On Thursday, 28 February 2013 at 01:12:19 UTC, Walter Bright 
 wrote:
 A SentinelInputRange is an InputRange with the following 
 additions:

 1. a compile time property 'sentinel' that is the terminating 
 value of the range
 2. empty is defined as: empty = (front == sentinel)
 3. it is not necessary for empty to be called before front

 A C style 0-terminated string is an example of a 
 SentinelInputRange.

 The additions to std.range would be:

 1. isSentinelInputRange(T) which returns true if T is a 
 SentinelInputRange
 2. a unittest
 3. documentation of this

 An addition to std.string would be a function that takes a 
 char* and returns a SentinelInputRange.

 Motivation:
 1. easy conversion of C strings to ranges
 2. necessary for a fast implementation of a lexer

 Any takers?

Why must sentinel be known at compile time? I don't see what's in the way of it being a runtime argument.

Possibly because a 0 check can be reduced to a simple bit test on the CPU (and you need to do that often and possibly in several source code locations when parsing long strings). A runtime value means a memory read and comparison every time! (For simple loops the compiler will keep the value in a register, but that's not generally the case.) -- Marco
Feb 27 2013
prev sibling next sibling parent reply "deadalnix" <deadalnix gmail.com> writes:
On Thursday, 28 February 2013 at 01:12:19 UTC, Walter Bright 
wrote:
 A SentinelInputRange is an InputRange with the following 
 additions:

 1. a compile time property 'sentinel' that is the terminating 
 value of the range
 2. empty is defined as: empty = (front == sentinel)
 3. it is not necessary for empty to be called before front

 A C style 0-terminated string is an example of a 
 SentinelInputRange.

 The additions to std.range would be:

 1. isSentinelInputRange(T) which returns true if T is a 
 SentinelInputRange
 2. a unittest
 3. documentation of this

 An addition to std.string would be a function that takes a 
 char* and returns a SentinelInputRange.

 Motivation:
 1. easy conversion of C strings to ranges
 2. necessary for a fast implementation of a lexer

 Any takers?

Can you explain what it does buy us ? Is it faster to check for the sentinel value that to check for emptiness ? Is it really handy in generic code ? For instance, if I use a switch, the actual sentinel value may be conflicting with other element in the switch. How does this kind of problem are solved ? Do they are problems in the first place ?
Feb 27 2013
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 2/27/2013 8:47 PM, Jonathan M Davis wrote:
 Now, the only real benefit that I see for this allowing you to make a string
 zero-terminated (which in the case of a lexer would probably mean copying the
 entire file into a new string which has zero on the end).

Nawp, there is no extra copy. Take a look at the compiler source. You have to read the file into memory anyway - so make the buffer one byte longer, and put a 0 at the end.
 In general, you'll be
 forced to wrap a range in a sentinel range to get this behavior, which means
 that you're _still_ checking empty all the time, because it has to keep
 checking whether it's supposed to make front 0 now. And that probably means
 that it'll be slightly _more_ expensive to do this for anything other than a
 string. That being the case, it might be better to just special case strings
 rather than come up with this whole new range idea.

Nope, not necessary to wrap it.
 I'm also not at all covinced that this is generally useful. It may be that
 it's a great idea for lexers, but what other use cases are there?

Anything that walks a C string. Lots of cases for that. Sentinels are often used where high speed processing of data is desired. Google sentinel-terminated data for more examples.
 Also, I'd point out that even for strings, doing something like this means
 wrapping them, because their empty isn't defined in a manner which works with
 isSentinelRange.

For D strings, yes, for C strings, no need to wrap them.
 So, I'm inclined to believe that we'd be better off just special casing strings
 in any algorithms that can take advantage of this sort of thing than we would
 be creating this sentinel range idea.

0 terminated C strings are a classic case of this. Another case is a token stream ending with an EOF token.
Feb 27 2013
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 2/27/2013 9:28 PM, Jonathan M Davis wrote:
 But you have to deal with D strings, not C strings if you're dealing with
 ranges. char* isn't a range. So, unless you're talking about wrapping a char*
 in a range, char* isn't going to work. And simply appending 0 to the end of a
 D string isn't enough, because isSentinelnputRange would fail, because
 std.array.empty doesn't match it. So, you need a wrapper even if it's only to
 pass the template constraint. That being the case, regardless of whether
 you're dealing with char* or string, you need a wrapper.

Again, please see how lexer.c works. I assure you, there is no double copying going on, nor is there a double test for the terminating 0.
 And what ranges other than strings can take advantage of anything like this?

I've mentioned a couple already in this thread, and I'll add another - an interpreter bytecode, you can see an example in the (former) std.regexp. Search for "REend".
 What are we gaining here

High speed processing.
So, why not just special case strings or arrays in the few situations
 where something like this is needed, especially when it would be so easy to
 do?

Sentinels structure the code differently.
Feb 27 2013
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 2/27/2013 11:55 PM, Jonathan M Davis wrote:
 Again, please see how lexer.c works. I assure you, there is no double
 copying going on, nor is there a double test for the terminating 0.

I know what the lexer does, and remember that it _doesn't_ operate on ranges, and there are subtle differences between being able to just use char* and trying to handle generic ranges.

Hence the need to invent SentinelInputRange.
 Given how a lexer works (and I have been working on a lexer off and on
 recently), the only real difference is that you'd just use a couple of static
 ifs like

 static if(!isSomeString!R)
 {
      if(range.empty)
          break; //or whatever you do at the end
 }

 static if(isSomeString!R)
 {
      case 0:
          break; //or whatever you do at the end
 }

There are so many places where this would occur, it cries out for a new type.
 So, in the case of a lexer, I don't see sentinel ranges as buying us much. You
 end up having to wrap most any range that you pass to the lexer or whatever
 (including strings so that they'll pass isSentinelRange), you lose out on any
 optimizations of any functions that you call which special-case strings
 (though there probably wouldn't be many of those in a lexer), and all you
 avoid is a couple of static ifs.

And NO, THE SOURCE FILE INPUT IS NEITHER WRAPPED NOR DOUBLE COPIED. Here's how it's done: https://github.com/D-Programming-Language/dmd/blob/master/src/root/root.c line 1012 and 1038
 The idea of sentinels certainly isn't useless, but anything caring about that
 sort of speed is likely to just use strings or arrays, and those can trivially
 be special cased to avoid unnecessary empty checks and to add the check for
 the sentinel, making the whole sentinel range idea an unnecessary complication
 IMHO.

You can't do efficient lookahead without sentinels, either. Lexers are sensitive to every instruction executed per character read. No sentinels mean double the number of instructions per source character. InputRanges are an abject failure if "anyone caring about speed" is not going to use them. And yes, I care very much about the D lexing speed.
Feb 28 2013
next sibling parent reply Jacob Carlborg <doob me.com> writes:
On 2013-02-28 12:29, Jonathan M Davis wrote:

 And you were just claiming that the lexer checked the sentinel type in only
 one place. If that's indeed the case (and I think that it's quite close to
 being true if it isn't true), then you _wouldn't_ need to use static ifs like
 this in many places. So, which is it? If you need to check the sentinel often
 enough that using static ifs is a problem, then it's probably not buying you
 much of anything over checking empty anyway.

You pick a sentinel that you need to check for anyway, i.e. null or eof. But if you don't manually add the sentinel there's nothing that says that the sentinel will be there, and therefore you weed to check for empty as well. -- /Jacob Carlborg
Feb 28 2013
parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 2/28/13 8:44 AM, Jacob Carlborg wrote:
 On 2013-02-28 12:29, Jonathan M Davis wrote:

 And you were just claiming that the lexer checked the sentinel type in
 only
 one place. If that's indeed the case (and I think that it's quite
 close to
 being true if it isn't true), then you _wouldn't_ need to use static
 ifs like
 this in many places. So, which is it? If you need to check the
 sentinel often
 enough that using static ifs is a problem, then it's probably not
 buying you
 much of anything over checking empty anyway.

You pick a sentinel that you need to check for anyway, i.e. null or eof. But if you don't manually add the sentinel there's nothing that says that the sentinel will be there, and therefore you weed to check for empty as well.

auto range = assumeWithSentinel(input); :o) I'm only half-joking - the awesomeness of assumeSorted suggests we could reuse the pattern elsewhere. Andrei
Feb 28 2013
prev sibling next sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 2/28/13 5:54 AM, Walter Bright wrote:
 On 2/27/2013 11:55 PM, Jonathan M Davis wrote:
 Again, please see how lexer.c works. I assure you, there is no double
 copying going on, nor is there a double test for the terminating 0.

I know what the lexer does, and remember that it _doesn't_ operate on ranges, and there are subtle differences between being able to just use char* and trying to handle generic ranges.

Hence the need to invent SentinelInputRange.

I don't think the sentinel input range is a blocker for redoing the parser (with ranges) in D. This discussion has probably run its course. The right thing to do at this point is port the lexer and figure what primitives are necessary from its input. Andrei
Feb 28 2013
next sibling parent reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
28-Feb-2013 19:31, Andrei Alexandrescu пишет:
 On 2/28/13 5:54 AM, Walter Bright wrote:
 On 2/27/2013 11:55 PM, Jonathan M Davis wrote:
 Again, please see how lexer.c works. I assure you, there is no double
 copying going on, nor is there a double test for the terminating 0.

I know what the lexer does, and remember that it _doesn't_ operate on ranges, and there are subtle differences between being able to just use char* and trying to handle generic ranges.

Hence the need to invent SentinelInputRange.

I don't think the sentinel input range is a blocker for redoing the parser (with ranges) in D. This discussion has probably run its course. The right thing to do at this point is port the lexer and figure what primitives are necessary from its input.

And it wasn't the problem that std.d.lexer had anyway. Check the latest results. The merit of sentinel range IMHO is largely in tapping into C-strings and the like more naturally. -- Dmitry Olshansky
Feb 28 2013
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 2/28/2013 7:41 AM, Dmitry Olshansky wrote:
 The merit of sentinel range IMHO is largely in tapping into C-strings and the
 like more naturally.

The merit is having significantly faster speed.
Feb 28 2013
parent reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
28-Feb-2013 20:56, Walter Bright пишет:
 On 2/28/2013 7:41 AM, Dmitry Olshansky wrote:
 The merit of sentinel range IMHO is largely in tapping into C-strings
 and the
 like more naturally.

The merit is having significantly faster speed.

5% is good but doesn't count as significant. But that was for std.regex quite sometime ago. In the lexer it might get be more. I can try right away with std.d.lexer by Brain Schott and get rid of the check for the length and place a null byte right after the file buffer. Then I should be able to estimate the gain. -- Dmitry Olshansky
Feb 28 2013
parent Walter Bright <newshound2 digitalmars.com> writes:
On 2/28/2013 9:06 AM, Dmitry Olshansky wrote:
 28-Feb-2013 20:56, Walter Bright пишет:
 The merit is having significantly faster speed.

sometime ago. In the lexer it might get be more.

Even 5% is worth getting, and especially so because it's low hanging fruit. I hear from people who have switched over to D and compile speed was a BIG factor. And I know you are well cognizant of the fact that for regex, speed is also a very big deal.
Feb 28 2013
prev sibling next sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 2/28/2013 7:31 AM, Andrei Alexandrescu wrote:
 I don't think the sentinel input range is a blocker for redoing the parser
(with
 ranges) in D. This discussion has probably run its course. The right thing to
do
 at this point is port the lexer and figure what primitives are necessary from
 its input.

I don't agree. It is a blocker for getting a fast lexer.
Feb 28 2013
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 2/28/13 11:54 AM, Walter Bright wrote:
 On 2/28/2013 7:31 AM, Andrei Alexandrescu wrote:
 I don't think the sentinel input range is a blocker for redoing the
 parser (with
 ranges) in D. This discussion has probably run its course. The right
 thing to do
 at this point is port the lexer and figure what primitives are
 necessary from
 its input.

I don't agree. It is a blocker for getting a fast lexer.

I don't think what I said was understood. Andrei
Feb 28 2013
parent Walter Bright <newshound2 digitalmars.com> writes:
On 2/28/2013 9:10 AM, Andrei Alexandrescu wrote:
 I don't think what I said was understood.

I know the feeling :-)
Feb 28 2013
prev sibling next sibling parent Walter Bright <newshound2 digitalmars.com> writes:
On 2/28/2013 7:37 AM, monarch_dodra wrote:
 Just provide it with a disclaimer than it will only ever be useful for a select
 class of algorithms (parser), but that the optimization opportunity will be
 ignored by the rest of phobos.

1. parsing text is very commonplace 2. it is used in std.regexp in the interpreter engine, a quite different use case, but also common to bytecode engines
Feb 28 2013
prev sibling parent reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
28-Feb-2013 19:31, Andrei Alexandrescu пишет:
 On 2/28/13 5:54 AM, Walter Bright wrote:
 On 2/27/2013 11:55 PM, Jonathan M Davis wrote:
 Again, please see how lexer.c works. I assure you, there is no double
 copying going on, nor is there a double test for the terminating 0.

I know what the lexer does, and remember that it _doesn't_ operate on ranges, and there are subtle differences between being able to just use char* and trying to handle generic ranges.

Hence the need to invent SentinelInputRange.

I don't think the sentinel input range is a blocker for redoing the parser (with ranges) in D. This discussion has probably run its course.

 The right thing to do at this point is port the lexer and figure what
 primitives are necessary from its input.

No need to port - look at std.d.lexer again. It was revamped and is ready for the new round of review I should say. Let's not use the old source for the new module and go the long path of : split off --> port --> patch up --> D-ify & re-write to ranges Instead we can just tweak the current std.d.lexer a little bit more and we have good clean-room lexer written in the idiomatic D. Well, it's getting there w.r.t. idiomaticness but it supports ranges including both random-access and forward ones (by transparently specializing for each one). -- Dmitry Olshansky
Feb 28 2013
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 2/28/13 12:20 PM, Dmitry Olshansky wrote:
 28-Feb-2013 19:31, Andrei Alexandrescu пишет:
 On 2/28/13 5:54 AM, Walter Bright wrote:
 On 2/27/2013 11:55 PM, Jonathan M Davis wrote:
 Again, please see how lexer.c works. I assure you, there is no double
 copying going on, nor is there a double test for the terminating 0.

I know what the lexer does, and remember that it _doesn't_ operate on ranges, and there are subtle differences between being able to just use char* and trying to handle generic ranges.

Hence the need to invent SentinelInputRange.

I don't think the sentinel input range is a blocker for redoing the parser (with ranges) in D. This discussion has probably run its course.

 The right thing to do at this point is port the lexer and figure what
 primitives are necessary from its input.

No need to port - look at std.d.lexer again. It was revamped and is ready for the new round of review I should say. Let's not use the old source for the new module and go the long path of : split off --> port --> patch up --> D-ify & re-write to ranges Instead we can just tweak the current std.d.lexer a little bit more and we have good clean-room lexer written in the idiomatic D. Well, it's getting there w.r.t. idiomaticness but it supports ranges including both random-access and forward ones (by transparently specializing for each one).

I think that's a good idea but I took a look at https://github.com/bhelyer/std.d.lexer/blob/master/std/d/lexer.d and I will destroy it. In a good sense :o). Andrei
Feb 28 2013
parent reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
28-Feb-2013 22:38, Andrei Alexandrescu пишет:
 On 2/28/13 12:20 PM, Dmitry Olshansky wrote:
 28-Feb-2013 19:31, Andrei Alexandrescu пишет:
 On 2/28/13 5:54 AM, Walter Bright wrote:
 On 2/27/2013 11:55 PM, Jonathan M Davis wrote:
 Again, please see how lexer.c works. I assure you, there is no double
 copying going on, nor is there a double test for the terminating 0.

I know what the lexer does, and remember that it _doesn't_ operate on ranges, and there are subtle differences between being able to just use char* and trying to handle generic ranges.

Hence the need to invent SentinelInputRange.

I don't think the sentinel input range is a blocker for redoing the parser (with ranges) in D. This discussion has probably run its course.

 The right thing to do at this point is port the lexer and figure what
 primitives are necessary from its input.

No need to port - look at std.d.lexer again. It was revamped and is ready for the new round of review I should say. Let's not use the old source for the new module and go the long path of : split off --> port --> patch up --> D-ify & re-write to ranges Instead we can just tweak the current std.d.lexer a little bit more and we have good clean-room lexer written in the idiomatic D. Well, it's getting there w.r.t. idiomaticness but it supports ranges including both random-access and forward ones (by transparently specializing for each one).

I think that's a good idea but I took a look at https://github.com/bhelyer/std.d.lexer/blob/master/std/d/lexer.d and I will destroy it. In a good sense :o).

That's the wrong one. This is the one: https://github.com/Hackerpilot/Dscanner/tree/range-based-lexer Though feel free to destroy the other one too ;) But I need your full powers with Dscanner first :o)
 Andrei

-- Dmitry Olshansky
Feb 28 2013
parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 2/28/13 1:42 PM, Dmitry Olshansky wrote:
 That's the wrong one. This is the one:
 https://github.com/Hackerpilot/Dscanner/tree/range-based-lexer

 Though feel free to destroy the other one too ;)
 But I need your full powers with Dscanner first :o)

Whoa, that looks really good! Andrei
Feb 28 2013
prev sibling next sibling parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 2/28/13 6:29 AM, Jonathan M Davis wrote:
 There's tons of stuff that's fine with forward ranges (though I seriously
 question the viability of pure input ranges given how insanely limiting is to
 be unable to save their state), but there's also plenty of stuff that needs
 random access, and if you _really_ care about speed, there's a decent chance
 that you need random access (if nothing else, to be able to pop off multiple
 elements at a time). That doesn't necessarily mean that you're using an array,
 but odds are whatever you're using wraps an array if it isn't one. I don't
 think that there's much of anything which is random access without having an
 array or pointer underneath the hood. Reading from a file, I'd be very inclined
 to use something like MmFile or reading the whole file into an array rather
 than trying to wrap in something that's a pure input range. They're just too
 limiting.

I think this (and the long, long, long text above) is missing a point. The necessity here is defining new range primitives (such as lookahead), not force the existing range notions onto the needed functionality. Again: 1. Port the blessed lexer as is to D already 2. Figure out what abstraction is needed for lexer's input 3. Reify that abstraction as an existing range type, a variation on an existing range type, or a new range type Talking abstraction without concrete is a poor approach. Concrete first, abstraction extracted from it. Andrei
Feb 28 2013
prev sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 2/28/2013 3:29 AM, Jonathan M Davis wrote:
 And you were just claiming that the lexer checked the sentinel type in only
 one place. If that's indeed the case (and I think that it's quite close to
 being true if it isn't true), then you _wouldn't_ need to use static ifs like
 this in many places. So, which is it? If you need to check the sentinel often
 enough that using static ifs is a problem, then it's probably not buying you
 much of anything over checking empty anyway.

Please, again, examine lexer.c. For example, look at what happens after line 719. Try 794 in particular. Now look at line 996 and what follows. Note that there is not a single null check there, yet the code is correct and does not run off the end of the data.
 But my point is that outside of strings or arrays, you're almost certainly
 stuck with that.

I've given you two examples (lexer and regexp) where you are certainly not stuck with that, and those two cases matter.
 Pure input ranges fail utterly as you can't save them, so you get _zero_
 lookahead [...]

Yet the lexer will work efficiently and correctly with a SentinelInputRange that is also a ForwardRange. It fits in nicely with the range concepts.
Feb 28 2013
next sibling parent Dmitry Olshansky <dmitry.olsh gmail.com> writes:
28-Feb-2013 20:52, Walter Bright пишет:
 On 2/28/2013 3:29 AM, Jonathan M Davis wrote:
 And you were just claiming that the lexer checked the sentinel type in
 only
 one place. If that's indeed the case (and I think that it's quite
 close to
 being true if it isn't true), then you _wouldn't_ need to use static
 ifs like
 this in many places. So, which is it? If you need to check the
 sentinel often
 enough that using static ifs is a problem, then it's probably not
 buying you
 much of anything over checking empty anyway.

Please, again, examine lexer.c. For example, look at what happens after line 719. Try 794 in particular. Now look at line 996 and what follows. Note that there is not a single null check there, yet the code is correct and does not run off the end of the data.
 But my point is that outside of strings or arrays, you're almost
 certainly
 stuck with that.

I've given you two examples (lexer and regexp) where you are certainly not stuck with that, and those two cases matter.

I'll add that when I wrote current std.regex internal VM I've switched to the sentinel as terminator of the program instead of checking the length. The end result was about 5% of speed up gained back then (measured on DMD alone though, other compilers didn't compile it at the time). In the lexer it may help a bit more given the lookahead. -- Dmitry Olshansky
Feb 28 2013
prev sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 2/28/2013 10:52 AM, Jonathan M Davis wrote:
 Notice that it has to check for empty every time that the front is popped, and
 it can't avoid that,

Yes it can avoid it - that is the whole point. Notice there are no checks for the sentinel here - but the code is correct: case '>': p++; if (*p == '=') { p++; t->value = TOKge; // >= } else if (*p == '>') { p++; if (*p == '=') { p++; t->value = TOKshrass; // >>= } else if (*p == '>') { p++; if (*p == '=') { p++; t->value = TOKushrass; // >>>= } else t->value = TOKushr; // >>> } else t->value = TOKshr; // >> } else t->value = TOKgt; // > return;
Feb 28 2013
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 2/28/2013 4:07 PM, Jonathan M Davis wrote:
 But that's with a pointer. You see every ++ there? That would be a popFront
 call, and for most ranges, that would mean checking empty internally if the
 range needs to have a sentinel value on its end, because most ranges will be a
 wrapper around another range, and the only way to know whether you've reached
 their end (and that therefore, front now needs to be the sentinel) is to check
 for empty.

It's trivial to construct a SentinelInputRange that does not do this yet is correct.
 I just don't see how you're going to get a
 performance gain from much of anything other than strings.

I gave you other examples already. We're just going around in circles.
Feb 28 2013
next sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 2/28/13 7:34 PM, Walter Bright wrote:
 On 2/28/2013 4:07 PM, Jonathan M Davis wrote:
 I just don't see how you're going to get a
 performance gain from much of anything other than strings.

I gave you other examples already. We're just going around in circles.

We're talking about an abstraction that doesn't exist, exemplifying with code that doesn't use it. No wonder there's going to be miscommunication. The solution is to spin some code. Andrei
Feb 28 2013
parent Walter Bright <newshound2 digitalmars.com> writes:
On 2/28/2013 5:29 PM, Andrei Alexandrescu wrote:
 On 2/28/13 7:34 PM, Walter Bright wrote:
 On 2/28/2013 4:07 PM, Jonathan M Davis wrote:
 I just don't see how you're going to get a
 performance gain from much of anything other than strings.

I gave you other examples already. We're just going around in circles.

We're talking about an abstraction that doesn't exist, exemplifying with code that doesn't use it. No wonder there's going to be miscommunication. The solution is to spin some code.

std.regexp uses sentinels in the bytecode.
Feb 28 2013
prev sibling parent Walter Bright <newshound2 digitalmars.com> writes:
On 2/28/2013 5:00 PM, deadalnix wrote:
 On Friday, 1 March 2013 at 00:34:25 UTC, Walter Bright wrote:
 I just don't see how you're going to get a
 performance gain from much of anything other than strings.

I gave you other examples already. We're just going around in circles.

You didn't posted a single example that wasn't optimized by LLVM.

Search for [1] in lexer.c, i.e. lookahead cases, although these are less important. I admit I was surprised that LLVM did that, though the other compilers did not. There are some lingering issues with ordering. The sentinel is going to be the rare case, and you'd often want to sort the cases so that the most likely ones are tested first (if it doesn't all go into a jump table indexed by the value).
 I do understand that some compiler may not do the optimization, but it is show
 already that this is clearly possible and in fact done. That is not an
 theoretical improvement.

 I don't see the point in creating a new type of range simply because some
 compiler don't optimize properly.

It's a good point.
Feb 28 2013
prev sibling next sibling parent Jonathan M Davis <jmdavisProg gmx.com> writes:
On Thursday, February 28, 2013 05:24:57 deadalnix wrote:
 Can you explain what it does buy us ? Is it faster to check for
 the sentinel value that to check for emptiness ?
 
 Is it really handy in generic code ? For instance, if I use a
 switch, the actual sentinel value may be conflicting with other
 element in the switch. How does this kind of problem are solved ?
 Do they are problems in the first place ?

In the case of a lexer, it allows you to skip checking for empty most of the time. You have something like switch(front) { case '+': ... case '>': ... ... case 0: //do whatever you do at the end } So, as far as the switch statement goes, you don't have to worry about whether the range is empty or not. The case for 0 takes care of it and will only get checked if it actually happens, whereas without the sentinel value, you have to check empty every time you pop front. Now, the only real benefit that I see for this allowing you to make a string zero-terminated (which in the case of a lexer would probably mean copying the entire file into a new string which has zero on the end). In general, you'll be forced to wrap a range in a sentinel range to get this behavior, which means that you're _still_ checking empty all the time, because it has to keep checking whether it's supposed to make front 0 now. And that probably means that it'll be slightly _more_ expensive to do this for anything other than a string. That being the case, it might be better to just special case strings rather than come up with this whole new range idea. I'm also not at all covinced that this is generally useful. It may be that it's a great idea for lexers, but what other use cases are there? I don't know that there isn't one, but I can't think of any. And the _only_ time that it's of any real benefit is definitely going to be when ranges are built with this idea in the first place rather than wrapping them, which almost guarantees that you're just dealing with arrays, which you can always just special case if you need this level of efficiency. Also, I'd point out that even for strings, doing something like this means wrapping them, because their empty isn't defined in a manner which works with isSentinelRange. Unlike most other cases, the wrapper wouldn't have to keep checking for empty (it could just guarantee that the string it was given ended with 0 and then say that it's length was one less than the underlying string), but you're still dealing with a wrapper and relying on the compiler to optimize that cost away. Also, that totally screws with passing the range to any functions which special case strings. Those functions would now also have to special case the wrapper type if you wanted the extra efficiency that you get from the function understanding that it's a string rather than a range of code units or code points (depending on which the sentinel range is claiming it is - probably code units in the case of the lexer). So, I'm inclined to believe that we'd be better off just special casing strings in any algorithms that can take advantage of this sort of thing than we would be creating this sentinel range idea. - Jonathan M Davis
Feb 27 2013
prev sibling next sibling parent "Zach the Mystic" <reachBUTMINUSTHISzach gOOGLYmail.com> writes:
On Thursday, 28 February 2013 at 03:37:25 UTC, Walter Bright 
wrote:
 On 2/27/2013 6:01 PM, Zach the Mystic wrote:
 On Thursday, 28 February 2013 at 01:56:33 UTC, Walter Bright 
 wrote:
 On 2/27/2013 5:53 PM, Zach the Mystic wrote:
 What if more than one
 value can end the range, EOF, '\0'?

I have never seen a need for that.

Do you mean that you've never seen software that uses that, or that you've never been convinced that such software couldn't be made simpler?

Pretty much both. For example, often a foreach loop will exit either when it runs out of data or some other condition is met. I don't see a need to work both of those conditions into the definition of a range.

My understanding of the logic of Sentinel Ranges so far is that switch statements and other control flow can proceed eagerly, because "go" values can be checked before the sentinel "stop" value, and "!empty" is known implicitly. I don't know exactly where the speed benefits of having a single "stop" value known at compile time come from. Is this design focused more on your knowledge of how the compiler optimizes machine code, or on something which can be grasped at a higher level?
Feb 27 2013
prev sibling next sibling parent "Andrej Mitrovic" <andrej.mitrovich gmail.com> writes:
On Thursday, 28 February 2013 at 04:58:23 UTC, Walter Bright 
wrote:
 On 2/27/2013 8:01 PM, John Colvin wrote:
 Why must sentinel be known at compile time? I don't see what's 
 in the way of it
 being a runtime argument.

Performance! It should be usable as a case in a switch statement.

Little did you know about this bug: http://d.puremagic.com/issues/show_bug.cgi?id=5862 :) But I guess this defeats the purpose of the switch statement (i.e. optimizations). Btw can we get a clear opinion on whether 5862 is an actual bug that needs to be fixed?
Feb 27 2013
prev sibling next sibling parent "deadalnix" <deadalnix gmail.com> writes:
On Thursday, 28 February 2013 at 04:57:08 UTC, Walter Bright 
wrote:
 On 2/27/2013 8:53 PM, Zach the Mystic wrote:
 My understanding of the logic of Sentinel Ranges so far is 
 that switch
 statements and other control flow can proceed eagerly, because 
 "go" values can
 be checked before the sentinel "stop" value, and "!empty" is 
 known implicitly. I
 don't know exactly where the speed benefits of having a single 
 "stop" value
 known at compile time come from.

 Is this design focused more on your knowledge of how the 
 compiler optimizes
 machine code, or on something which can be grasped at a higher 
 level?

Take a look at lexer.c. With an InputRange, reading a character from a 0 terminated string requires two read operations. A SentinalInputRange requires only one.

If the range define empty with something like front == sentinel, the inliner should kick in a reduce the whole stuff to only one read, no ?
Feb 27 2013
prev sibling next sibling parent Jonathan M Davis <jmdavisProg gmx.com> writes:
On Wednesday, February 27, 2013 21:10:15 Walter Bright wrote:
 Also, I'd point out that even for strings, doing something like this means
 wrapping them, because their empty isn't defined in a manner which works
 with isSentinelRange.

For D strings, yes, for C strings, no need to wrap them.

But you have to deal with D strings, not C strings if you're dealing with ranges. char* isn't a range. So, unless you're talking about wrapping a char* in a range, char* isn't going to work. And simply appending 0 to the end of a D string isn't enough, because isSentinelnputRange would fail, because std.array.empty doesn't match it. So, you need a wrapper even if it's only to pass the template constraint. That being the case, regardless of whether you're dealing with char* or string, you need a wrapper. And what ranges other than strings can take advantage of anything like this? There's no way to append a value to range other than chain, which means that you're forced to either construct ranges specifically with the idea that they'll be used as sentinel ranges (in which case, they're a wrapper around whatever they're using to hold the data internally - probably an array - and if you're doing that, why not just use an array in the first place?), or you're forced to wrap an existing range (which would then require it to check for empty on each popFront so that it can make its front be 0 when it's empty). What are we gaining here that can't be gained by simply using a few static ifs to special case strings? And I don't see how this idea is gaining us much of anything outside of strings and maybe arrays - both of which have to be wrapped in order for their empty to act appropriately (even if they can avoid the extra empty checks internally by appending the sentinel value to their end). So, why not just special case strings or arrays in the few situations where something like this is needed, especially when it would be so easy to do? - Jonathan M Davis
Feb 27 2013
prev sibling next sibling parent "deadalnix" <deadalnix gmail.com> writes:
On Thursday, 28 February 2013 at 05:28:47 UTC, Walter Bright 
wrote:
 On 2/27/2013 9:20 PM, deadalnix wrote:
 If the range define empty with something like front == 
 sentinel, the inliner
 should kick in a reduce the whole stuff to only one read, no ?

auto c = front; if (c == sentinel || c == XX) is two reads. This may not seem important, but when you want high speed, it can halve it.

Don't you have to check for both all the time ? You have to check for the sentinel anyway.
Feb 27 2013
prev sibling next sibling parent "Zach the Mystic" <reachBUTMINUSTHISzach gOOGLYmail.com> writes:
On Thursday, 28 February 2013 at 05:28:20 UTC, Jonathan M Davis 
wrote:
 What are we gaining here that can't be gained by simply using a 
 few static ifs to special case strings?

That's the question that is swaying me to your side at the moment. It's possible that the special performance which Sentinel Ranges offer is actually confined to a very narrow band of problems which can be dealt with transparently in the existing library. In which case the right thing to do is make an enhancement request asking for just such performance enhancing drugs to be administered. :-)
Feb 27 2013
prev sibling next sibling parent "deadalnix" <deadalnix gmail.com> writes:
On Thursday, 28 February 2013 at 07:22:53 UTC, Walter Bright 
wrote:
 On 2/27/2013 10:02 PM, deadalnix wrote:
 On Thursday, 28 February 2013 at 05:28:47 UTC, Walter Bright 
 wrote:
 On 2/27/2013 9:20 PM, deadalnix wrote:
 If the range define empty with something like front == 
 sentinel, the inliner
 should kick in a reduce the whole stuff to only one read, no 
 ?

auto c = front; if (c == sentinel || c == XX) is two reads. This may not seem important, but when you want high speed, it can halve it.

Don't you have to check for both all the time ? You have to check for the sentinel anyway.

I suggest again taking a look at the dmd source lexer.c for how to do it. There is no extra check.

I did, in fact I know it for quite a while. It does check for sentinel in many places (and it has to). How is it better than a empty property defined as front == sentinel ? For instance, if(!r.empty) { switch(r.front) { // cases } // somecode } else { // error code } Will be optimized by adding a case for the sentinel in the switch (if dmd don't, then it has to be fixed, LLVM is definitively able to do stuff like that, and I didn't checked, but I'd bet that GCC can do it as well) in the given way : switch(r.front) { case sentinel : // error code goto Label; // cases } // somecode Label: I don't see how defining a specific sentinel range here helps.
Feb 27 2013
prev sibling next sibling parent Jonathan M Davis <jmdavisProg gmx.com> writes:
On Wednesday, February 27, 2013 23:33:09 Walter Bright wrote:
 On 2/27/2013 9:28 PM, Jonathan M Davis wrote:
 But you have to deal with D strings, not C strings if you're dealing with
 ranges. char* isn't a range. So, unless you're talking about wrapping a
 char* in a range, char* isn't going to work. And simply appending 0 to
 the end of a D string isn't enough, because isSentinelnputRange would
 fail, because std.array.empty doesn't match it. So, you need a wrapper
 even if it's only to pass the template constraint. That being the case,
 regardless of whether you're dealing with char* or string, you need a
 wrapper.

Again, please see how lexer.c works. I assure you, there is no double copying going on, nor is there a double test for the terminating 0.

I know what the lexer does, and remember that it _doesn't_ operate on ranges, and there are subtle differences between being able to just use char* and trying to handle generic ranges. And no, the lexer doesn't have a double test. The place you're going to be stuck with a double test is most any range which isn't a string, because such ranges won't have sentinel values, and there will be no way to add them (as you really can't append to ranges), and so they'll end up being wrapped in a SentinelRange which will have to check on each popFront whether the wrapped range is now empty making it so that front needs to be the sentinel value. And most any range which _was_ designed to have a sentinel value would have to be managing its own contents (because otherwise, it would just be back to wrapping a range and having to check empty), which likely means that it'll just be a thin wrapper around a string or array anyway. Strings will still need to be wrapped, because they won't pass isSentinelRange otherwise, but they won't get any extra checks, because the wrapper can just check for 0 on the end and append it if it's not there.
So, why not just special case strings or arrays in the few situations

 where something like this is needed, especially when it would be so easy
 to
 do?

Sentinels structure the code differently.

Given how a lexer works (and I have been working on a lexer off and on recently), the only real difference is that you'd just use a couple of static ifs like static if(!isSomeString!R) { if(range.empty) break; //or whatever you do at the end } static if(isSomeString!R) { case 0: break; //or whatever you do at the end } So, in the case of a lexer, I don't see sentinel ranges as buying us much. You end up having to wrap most any range that you pass to the lexer or whatever (including strings so that they'll pass isSentinelRange), you lose out on any optimizations of any functions that you call which special-case strings (though there probably wouldn't be many of those in a lexer), and all you avoid is a couple of static ifs. The idea of sentinels certainly isn't useless, but anything caring about that sort of speed is likely to just use strings or arrays, and those can trivially be special cased to avoid unnecessary empty checks and to add the check for the sentinel, making the whole sentinel range idea an unnecessary complication IMHO. - Jonathan M Davis
Feb 27 2013
prev sibling next sibling parent reply "Peter Alexander" <peter.alexander.au gmail.com> writes:
On Thursday, 28 February 2013 at 01:12:19 UTC, Walter Bright 
wrote:
 Motivation:
 1. easy conversion of C strings to ranges

This is already easy. struct CString { property char front() { return *p; } void popFront() { ++p; } property bool empty() const { return !*p; } char* p; } (Yep, rubbish implementation, missing save, constructors etc., but you get the idea).
 2. necessary for a fast implementation of a lexer

This is very domain specific. Is a whole new range concept in the standard library necessary? My worry here is that we are setting a precedent for adding new range concepts. If the only justification needed is that is saves a single operation in some niche area of computing then we are opening the door to a LOT of different range concepts. Are there any other REAL uses for this other than in one line of a lexer implementation? I think inclusions into the standard library should require at least several distinct and realistic use cases.
Feb 28 2013
next sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 2/28/2013 1:19 AM, Peter Alexander wrote:
 On Thursday, 28 February 2013 at 01:12:19 UTC, Walter Bright wrote:
 Motivation:
 1. easy conversion of C strings to ranges

This is already easy. struct CString { property char front() { return *p; } void popFront() { ++p; } property bool empty() const { return !*p; } char* p; } (Yep, rubbish implementation, missing save, constructors etc., but you get the idea).

Notice the double read of *p. This is what SentinelInputRange is trying to fix.
 2. necessary for a fast implementation of a lexer

This is very domain specific. Is a whole new range concept in the standard library necessary?

I've posted in this thread several use cases.
 My worry here is that we are setting a precedent for adding new range concepts.
 If the only justification needed is that is saves a single operation in some
 niche area of computing then we are opening the door to a LOT of different
range
 concepts.

There are many cases where speed is a very big deal.
 Are there any other REAL uses for this other than in one line of a lexer
 implementation? I think inclusions into the standard library should require at
 least several distinct and realistic use cases.

I've posted several - including one in Phobos.
Feb 28 2013
next sibling parent reply Jacob Carlborg <doob me.com> writes:
On 2013-02-28 15:31, Steven Schveighoffer wrote:

 while(!r.empty)  // inlined to r.front != 0
 {
     switch(r.front) // why would another load occur here?
     {
        // no need to check for 0, already done
        ...
     }
 }

Don't you have to check for 0 anyway. You could still have more data in the buffer? I doesn't have to be the manually added sentential that is encountered. -- /Jacob Carlborg
Feb 28 2013
parent Jacob Carlborg <doob me.com> writes:
On 2013-02-28 15:53, Steven Schveighoffer wrote:

 You are missing the point.  Empty is DEFINED as r.front == 0.  Adding a
 check for 0, would essentially lead to dead code (and technically, the
 compiler could trim it out).

Ah, right. -- /Jacob Carlborg
Feb 28 2013
prev sibling next sibling parent Walter Bright <newshound2 digitalmars.com> writes:
On 2/28/2013 3:23 AM, deadalnix wrote:
 On Thursday, 28 February 2013 at 10:57:29 UTC, Walter Bright wrote:
 Notice the double read of *p. This is what SentinelInputRange is trying to fix.

But it will be optimized away.

(c == 1 || c == 5) does not optimize to (c == 5).
Feb 28 2013
prev sibling next sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 2/28/2013 6:31 AM, Steven Schveighoffer wrote:
 If this doesn't translate to the same code, I don't know why not.

Try it and see with your favorite C compiler. Then try the lookahead cases I also posted.
Feb 28 2013
next sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 2/28/2013 9:25 AM, Steven Schveighoffer wrote:
 On Thu, 28 Feb 2013 12:00:48 -0500, Walter Bright <newshound2 digitalmars.com>
 wrote:

 On 2/28/2013 6:31 AM, Steven Schveighoffer wrote:
 If this doesn't translate to the same code, I don't know why not.

Try it and see with your favorite C compiler.

A sample case of 1 does not prove it's not possible, or explain why those optimizers don't take that step. A valid response would be to give a case why an optimizer COULDN'T make that leap.

No, it is not. DMD is compiled with real compilers, not abstract "sufficiently smart compilers".
 Then try the lookahead cases I also posted.

You have already stated it gets changed into a jump table.

Please, please listen to what I write. This is very frustrating. The code in lexer.c is there for all to see, and it amply illustrates everything I'm saying. For example, this code does not get translated into a jump table: case '+': p++; if (*p == '=') { p++; t->value = TOKaddass; } else if (*p == '+') { p++; t->value = TOKplusplus; } else t->value = TOKadd; return;
 Such an optimization
 seems possible to me, even with the != 0 check outside the switch, even if not
 all C compilers employ it.

It doesn't matter if it is theoretically possible if the compilers we need to use do not do it.
Feb 28 2013
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 3/24/13 7:32 AM, deadalnix wrote:
 I finally decided to write a post about such imaginary "sufficiently
 smart compilers" that clearly don't exists today, never will and so we
 have to clutter the language :
 http://www.deadalnix.me/2013/03/23/a-story-about-optimization-llvm-and-the-sentinelinputrange/


 I really hope that someone wrote this imaginary LLVM thing everybody is
 talking about !

Looking great (also love the page design). The piece has a few systematic English language errors that should be easy to fix. Would you want to accept a copyedit step from a native speaker, and if so, is anyone here inclined to do that? Thanks, Andrei
Mar 24 2013
parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 3/24/13 9:07 AM, John Colvin wrote:
 On Sunday, 24 March 2013 at 12:22:37 UTC, deadalnix wrote:
 On Sunday, 24 March 2013 at 11:49:17 UTC, Andrei Alexandrescu wrote:
 On 3/24/13 7:32 AM, deadalnix wrote:
 I finally decided to write a post about such imaginary "sufficiently
 smart compilers" that clearly don't exists today, never will and so we
 have to clutter the language :
 http://www.deadalnix.me/2013/03/23/a-story-about-optimization-llvm-and-the-sentinelinputrange/



 I really hope that someone wrote this imaginary LLVM thing everybody is
 talking about !

Looking great (also love the page design). The piece has a few systematic English language errors that should be easy to fix. Would you want to accept a copyedit step from a native speaker, and if so, is anyone here inclined to do that?

I'd be happy to, if someone want to do it.

Native English speaker here; I'd be happy to edit for grammar.

Great, please go ahead! Something that worked for my reviewers was to just email a bunch of "before" -> "after" sentences, e.g.: * "Developers want code that run fast." -> "Developers want code that runs fast." * "And this is how the SentinelInputRange get into the discussion." -> "And this is how the SentinelInputRange gets into the discussion." ... Andrei
Mar 24 2013
prev sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 2/28/2013 10:12 AM, deadalnix wrote:
 On Thursday, 28 February 2013 at 17:00:51 UTC, Walter Bright wrote:
 On 2/28/2013 6:31 AM, Steven Schveighoffer wrote:
 If this doesn't translate to the same code, I don't know why not.

Try it and see with your favorite C compiler. Then try the lookahead cases I also posted.

You are being stubborn here. You claim that this is significantly faster, but give no numbers. You claim that the compiler can't do some optimization, when LLVM does it already.

I didn't say can't, I said didn't. llvm handles that case (which does surprise me), dmc, gcc, and vc do not. Furthermore, I'll say "cannot" on this: case '|': p++; if (*p == '=') { p++; t->value = TOKorass; } else if (*p == '|') { p++; t->value = TOKoror; } else t->value = TOKor; return;
Feb 28 2013
next sibling parent Timon Gehr <timon.gehr gmx.ch> writes:
On 02/28/2013 11:17 PM, Walter Bright wrote:
 On 2/28/2013 10:12 AM, deadalnix wrote:
 On Thursday, 28 February 2013 at 17:00:51 UTC, Walter Bright wrote:
 On 2/28/2013 6:31 AM, Steven Schveighoffer wrote:
 If this doesn't translate to the same code, I don't know why not.

Try it and see with your favorite C compiler. Then try the lookahead cases I also posted.

You are being stubborn here. You claim that this is significantly faster, but give no numbers. You claim that the compiler can't do some optimization, when LLVM does it already.

I didn't say can't, I said didn't. llvm handles that case (which does surprise me), dmc, gcc, and vc do not. Furthermore, I'll say "cannot" on this: case '|': p++; if (*p == '=') { p++; t->value = TOKorass; } else if (*p == '|') { p++; t->value = TOKoror; } else t->value = TOKor; return;

Why not? It boils down to a little CSE and a trivial implies check in an and expression.
Feb 28 2013
prev sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 2/28/2013 3:14 PM, jerro wrote:
 ldc2 -O3 -release compiles it to:

As I posted to Timon, clang (i.e. ldc2) and gcc do do this optimization, dmc and vc do not.
Feb 28 2013
parent Walter Bright <newshound2 digitalmars.com> writes:
On 2/28/2013 4:47 PM, H. S. Teoh wrote:
 On Thu, Feb 28, 2013 at 04:36:03PM -0800, Walter Bright wrote:
 On 2/28/2013 3:14 PM, jerro wrote:
 ldc2 -O3 -release compiles it to:

As I posted to Timon, clang (i.e. ldc2) and gcc do do this optimization, dmc and vc do not.

Couldn't dmc be made to do this optimization?

Yes, it could.
Feb 28 2013
prev sibling next sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 2/28/2013 7:40 AM, Steven Schveighoffer wrote:
 If you look at lexer.c, case 0 is the first test.

No, it is not. It is actually a table lookup - all done in parallel. jmp cases[character]
Feb 28 2013
parent Walter Bright <newshound2 digitalmars.com> writes:
On 2/28/2013 9:18 AM, Steven Schveighoffer wrote:
 The point is, if the lexer simply requires an input range, and not a sentinel
 input range, it is more flexible for its input.  But when it does get called
 with a sentinel input range, the optimizer can reap the benefits.

I suggest you compile lexer.c with various compilers, add in a !empty() call before every front(), and verify your assumption about this.
Feb 28 2013
prev sibling parent Walter Bright <newshound2 digitalmars.com> writes:
On 2/28/2013 5:02 AM, Peter Alexander wrote:
 On Thursday, 28 February 2013 at 10:57:29 UTC, Walter Bright wrote:
 My worry here is that we are setting a precedent for adding new range concepts.
 If the only justification needed is that is saves a single operation in some
 niche area of computing then we are opening the door to a LOT of different
range
 concepts.

There are many cases where speed is a very big deal.

Yes, are we going to add new range categories every time there's a performance benefit to doing so? has16ByteAlignedElements hasUniqueElements hasContiguousMemory isMonotonic isUnimodal ... ?

Depends on whether one can make a good case for them.
Feb 28 2013
prev sibling parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 2/28/13 4:19 AM, Peter Alexander wrote:
 Are there any other REAL uses for this other than in one line of a lexer
 implementation? I think inclusions into the standard library should
 require at least several distinct and realistic use cases.

Singly-linked lists. Andrei
Feb 28 2013
prev sibling next sibling parent "deadalnix" <deadalnix gmail.com> writes:
On Thursday, 28 February 2013 at 10:37:35 UTC, Walter Bright 
wrote:
 On 2/27/2013 11:37 PM, deadalnix wrote:
 It does check for sentinel in many places (and it has to).

No, it doesn't. I suggest examining line 481 of lexer.c. Then examine line 558 and 559. https://github.com/D-Programming-Language/dmd/blob/master/src/lexer.c

escapeSequence does check for it.
Feb 28 2013
prev sibling next sibling parent "deadalnix" <deadalnix gmail.com> writes:
On Thursday, 28 February 2013 at 10:57:29 UTC, Walter Bright 
wrote:
 Notice the double read of *p. This is what SentinelInputRange 
 is trying to fix.

But it will be optimized away.
Feb 28 2013
prev sibling next sibling parent Jonathan M Davis <jmdavisProg gmx.com> writes:
On Thursday, February 28, 2013 02:54:24 Walter Bright wrote:
 On 2/27/2013 11:55 PM, Jonathan M Davis wrote:
 Again, please see how lexer.c works. I assure you, there is no double
 copying going on, nor is there a double test for the terminating 0.

I know what the lexer does, and remember that it _doesn't_ operate on ranges, and there are subtle differences between being able to just use char* and trying to handle generic ranges.

Hence the need to invent SentinelInputRange.
 Given how a lexer works (and I have been working on a lexer off and on
 recently), the only real difference is that you'd just use a couple of
 static ifs like
 
 static if(!isSomeString!R)
 {
 
      if(range.empty)
      
          break; //or whatever you do at the end
 
 }
 
 static if(isSomeString!R)
 {
 
      case 0:
          break; //or whatever you do at the end
 
 }

There are so many places where this would occur, it cries out for a new type.

And you were just claiming that the lexer checked the sentinel type in only one place. If that's indeed the case (and I think that it's quite close to being true if it isn't true), then you _wouldn't_ need to use static ifs like this in many places. So, which is it? If you need to check the sentinel often enough that using static ifs is a problem, then it's probably not buying you much of anything over checking empty anyway.
 The idea of sentinels certainly isn't useless, but anything caring about
 that sort of speed is likely to just use strings or arrays, and those can
 trivially be special cased to avoid unnecessary empty checks and to add
 the check for the sentinel, making the whole sentinel range idea an
 unnecessary complication IMHO.

You can't do efficient lookahead without sentinels, either. Lexers are sensitive to every instruction executed per character read. No sentinels mean double the number of instructions per source character.

But my point is that outside of strings or arrays, you're almost certainly stuck with that. Most ranges would have to be wrapped in order to act as a sentinel range, and the sentinel range would have to check empty on the internal range on every popFront. The only time that a sentinel range would buy you anything outside of strings or arrays is when you have a range type written specifically to be a sentinel range and which doesn't have to call empty on a range internally. But that means that it's going to have to manage whatever it's holding internally itself and can't wrap another range - and that memory will almost certainly be an array of some sort. And if that's the case, what did we buy by wrapping it in a sentinel range? Why not just use the array directly? Pretty much the only type of range which isn't wrapping another range which wouldn't have an array internally would be one that was either reading from a file (or some other input) as it iterates or one which was generative. In either case, you would be unable to put a zero on the end without checking whether it was empty with each popFront, meaning that making it a sentinel range was pointless. So, pretty much the only ranges which could gain efficiency as sentinel ranges are those that hold arrays internally, meaning that the only gain from having the sentinel range over special casing the code on strings or arrays is the fact that you won't have to special case strings or arrays at all with static ifs or template constraints, but I'm quite convinced that in most cases, the number of static ifs involved would be quite few, making it quite trivial to special case on strings, meaning that there's pretty much no gain in having the sentinel ranges. It just incurs more overhead that the compiler must optimize away, because you have to wrap strings for them to work as sentinel ranges, or no template constraint will be able to detect that they're sentinel ranges. Of course, you could special case them so that they're treated as sentinel ranges in spite of that, but if you're special casing, why use the sentinel ranges in the first place instead of just special casing on strings?
 InputRanges are an abject failure if "anyone caring about speed" is not
 going to use them. And yes, I care very much about the D lexing speed.

Pure input ranges fail utterly as you can't save them, so you get _zero_ lookahead unless you start copying their contents somewhere else - which is quite expensive in comparison to slicing an array and which complicates the code. Forward ranges do better, but the lack of random access can be a killer in some cases. For instance, anything involving unicode (which a lexer mostly avoids, because it rarely has to decode unicode, but it still happens sometimes) requires random access for speed at least some of the time (especially when skipping code units), and it makes comparisons faster, because you can compare slices at once rather than compare a character and then pop it, compare, pop, compare, pop... For the fastest results, you need random-access ranges. There's tons of stuff that's fine with forward ranges (though I seriously question the viability of pure input ranges given how insanely limiting is to be unable to save their state), but there's also plenty of stuff that needs random access, and if you _really_ care about speed, there's a decent chance that you need random access (if nothing else, to be able to pop off multiple elements at a time). That doesn't necessarily mean that you're using an array, but odds are whatever you're using wraps an array if it isn't one. I don't think that there's much of anything which is random access without having an array or pointer underneath the hood. Reading from a file, I'd be very inclined to use something like MmFile or reading the whole file into an array rather than trying to wrap in something that's a pure input range. They're just too limiting. - Jonathan M Davis
Feb 28 2013
prev sibling next sibling parent "Peter Alexander" <peter.alexander.au gmail.com> writes:
On Thursday, 28 February 2013 at 10:57:29 UTC, Walter Bright 
wrote:
 My worry here is that we are setting a precedent for adding 
 new range concepts.
 If the only justification needed is that is saves a single 
 operation in some
 niche area of computing then we are opening the door to a LOT 
 of different range
 concepts.

There are many cases where speed is a very big deal.

Yes, are we going to add new range categories every time there's a performance benefit to doing so? has16ByteAlignedElements hasUniqueElements hasContiguousMemory isMonotonic isUnimodal ... ?
Feb 28 2013
prev sibling next sibling parent reply Artur Skawina <art.08.09 gmail.com> writes:
On 02/28/13 02:11, Walter Bright wrote:
 A SentinelInputRange is an InputRange with the following additions:
 
 1. a compile time property 'sentinel' that is the terminating value of the
range
 2. empty is defined as: empty = (front == sentinel)
 3. it is not necessary for empty to be called before front
 
 A C style 0-terminated string is an example of a SentinelInputRange.
 
 The additions to std.range would be:
 
 1. isSentinelInputRange(T) which returns true if T is a SentinelInputRange
 2. a unittest
 3. documentation of this
 
 An addition to std.string would be a function that takes a char* and returns a
SentinelInputRange.
 
 Motivation:
 1. easy conversion of C strings to ranges
 2. necessary for a fast implementation of a lexer

LookAheadRange, that not only defines 'sentinel', but also 'lookahead' which defines how far one can safely look ahead. When 'lookahead==0' it becomes equivalent to your SentinelInputRange, where 'front' and '[0]' is always valid and can contain the sentinel. When 'lookahead==N', accessing '[N]' is always valid. Having said that, I've used this approach in a D lexer, and it does not really matter in practice - avoiding the length (or '\0' sentinel) check makes a <~1ms difference when lexing "datetime.d" sized objects (1.5Mbytes+, 460k+ tokens). Which is practically irrelevant both in an IDE context and a compiler context - other processing will be be orders of magnitude more expensive. An IDE doesn't need to re-lex the whole file after every key press and 1ms won't make any difference for a compiler run. "Necessary for a fast implementation of a lexer" is an exaggeration, but yes, it is an obvious optimization that has a measurable impact. Before implementing it I was expecting a larger improvement too; and was somewhat surprised when the resulting gain was in the noise (~1ms +/- ~3ms). /Relatively/ it is significant (10% perf gain is never insignificant), but the /absolute/ perf difference isn't that large. Which of course isn't an argument against (SentinelInputLookAhead)Range; a std API like that would be a good thing. The advantage of SentinelInputRange/LookAheadRange is that it, unlike an array, can be lazy. Which starts to matter when you're dealing with /token/ streams. artur
Feb 28 2013
next sibling parent reply Jacob Carlborg <doob me.com> writes:
On 2013-02-28 15:08, Artur Skawina wrote:

 Having said that, I've used this approach in a D lexer, and it does not really
 matter in practice - avoiding the length (or '\0' sentinel) check makes a
 <~1ms difference when lexing "datetime.d" sized objects (1.5Mbytes+, 460k+
tokens).
 Which is practically irrelevant both in an IDE context and a compiler context
 - other processing will be be orders of magnitude more expensive. An IDE
doesn't
 need to re-lex the whole file after every key press and 1ms won't make any
 difference for a compiler run.

It's not about lexing a single file like std.datetime. We're takling be able to fast lex, I don't know, 100 or 1000 of files like std.datetime. -- /Jacob Carlborg
Feb 28 2013
next sibling parent reply Jacob Carlborg <doob me.com> writes:
On 2013-02-28 15:47, Artur Skawina wrote:

 Define "fast". Lexing std.datetime takes at most ~10-20ms (possibly a
single-digit
 ms number, but i'd need to write some code to check the actual number). Smaller
 objects take proportionally less. Meaning you'll be I/O bound, even /one/
(disk)
 cache miss will have more impact then these kind of optimizations.
 Lexing a hundred small files or one 100x as big file is basically the same
operation;
 the difference will be in I/O + setup/teardown costs, which will be /outside/
the
 lexer, so aren't affected by how it accesses input.

You'll have to convince Walter. -- /Jacob Carlborg
Feb 28 2013
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 2/28/2013 7:00 AM, Jacob Carlborg wrote:
 You'll have to convince Walter.

I've spent a lot of time profiling compilers. The result is pretty obvious - dmc is by far the fastest C/C++ compiler available, and dmd blows away about every other language in compile speed. Making this happen means you go after everything that eats time. Even a 5% improvement is a big deal. All those drips and draps add up to big speed increases. Two corporate users of D have also told me that their motivating case to adopt D was - compile speed! Throwing compile speed under the bus is what other compiler vendors do, and we're not going to do that.
Feb 28 2013
next sibling parent FG <home fgda.pl> writes:
On 2013-02-28 23:27, Walter Bright wrote:
 I've spent a lot of time profiling compilers. The result is pretty obvious -
dmc
 is by far the fastest C/C++ compiler available, and dmd blows away about every
 other language in compile speed.

 Making this happen means you go after everything that eats time. Even a 5%
 improvement is a big deal. All those drips and draps add up to big speed
increases.

I get that, but it doesn't require introducing a new type of range.
Feb 28 2013
prev sibling parent "Jonathan M Davis" <jmdavisProg gmx.com> writes:
On Thursday, February 28, 2013 23:32:16 FG wrote:
 On 2013-02-28 23:27, Walter Bright wrote:
 I've spent a lot of time profiling compilers. The result is pretty obvious
 - dmc is by far the fastest C/C++ compiler available, and dmd blows away
 about every other language in compile speed.
 
 Making this happen means you go after everything that eats time. Even a 5%
 improvement is a big deal. All those drips and draps add up to big speed
 increases.


Exactly. - Jonathan M Davis
Feb 28 2013
prev sibling parent Artur Skawina <art.08.09 gmail.com> writes:
On 02/28/13 16:00, Jacob Carlborg wrote:
 On 2013-02-28 15:47, Artur Skawina wrote:
 
 Define "fast". Lexing std.datetime takes at most ~10-20ms (possibly a
single-digit
 ms number, but i'd need to write some code to check the actual number). Smaller
 objects take proportionally less. Meaning you'll be I/O bound, even /one/
(disk)
 cache miss will have more impact then these kind of optimizations.
 Lexing a hundred small files or one 100x as big file is basically the same
operation;
 the difference will be in I/O + setup/teardown costs, which will be /outside/
the
 lexer, so aren't affected by how it accesses input.

You'll have to convince Walter.

Actually, no -- like i said: /I'd like to have such std ranges/ - there are real gains to be had. I'm just saying that for the "fast lexer" case the /absolute/ improvement isn't necessarily very large. Modern compilers can do wonders. The advantage of defining /std/ range types is, well, that they are "std"; everybody doesn't have to reinvent them, often with inadequate docs and subtle differences in behavior. In this case the interesting properties are: a) terminating sentinel b) limited lookahead ("a" is a special case of "b") c) lazyness It's better to have a common interface than having everybody invent a private one every time some of the above features are needed. artur
Feb 28 2013
prev sibling parent Artur Skawina <art.08.09 gmail.com> writes:
On 02/28/13 15:20, Jacob Carlborg wrote:
 On 2013-02-28 15:08, Artur Skawina wrote:
 
 Having said that, I've used this approach in a D lexer, and it does not really
 matter in practice - avoiding the length (or '\0' sentinel) check makes a
 <~1ms difference when lexing "datetime.d" sized objects (1.5Mbytes+, 460k+
tokens).
 Which is practically irrelevant both in an IDE context and a compiler context
 - other processing will be be orders of magnitude more expensive. An IDE
doesn't
 need to re-lex the whole file after every key press and 1ms won't make any
 difference for a compiler run.

It's not about lexing a single file like std.datetime. We're takling be able to fast lex, I don't know, 100 or 1000 of files like std.datetime.

Define "fast". Lexing std.datetime takes at most ~10-20ms (possibly a single-digit ms number, but i'd need to write some code to check the actual number). Smaller objects take proportionally less. Meaning you'll be I/O bound, even /one/ (disk) cache miss will have more impact then these kind of optimizations. Lexing a hundred small files or one 100x as big file is basically the same operation; the difference will be in I/O + setup/teardown costs, which will be /outside/ the lexer, so aren't affected by how it accesses input. artur
Feb 28 2013
prev sibling next sibling parent "Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Thu, 28 Feb 2013 06:23:45 -0500, deadalnix <deadalnix gmail.com> wrote:

 On Thursday, 28 February 2013 at 10:57:29 UTC, Walter Bright wrote:
 Notice the double read of *p. This is what SentinelInputRange is trying  
 to fix.

But it will be optimized away.

I have to say I agree with deadalnix. you have essentially in lexer.c: outer: while(1) { switch(r.front) { case 0: break outer; ... } } whereas a range where empty is defined as r.front == 0: while(!r.empty) // inlined to r.front != 0 { switch(r.front) // why would another load occur here? { // no need to check for 0, already done ... } } If this doesn't translate to the same code, I don't know why not. The second implementation looks more elegant/straightforward. I think others have pointed out that other compilers already do this. I hope we aren't introducing library components to get around DMD optimizer shortcomings. Note, r cannot be a D string in order to achieve the desired efficiency. With a sentinel range, length is never checked, and should not have to be updated. I think it's not necessary to add extra isSentinelRange, etc. checks, just define your sentinel range like any other input range. -Steve
Feb 28 2013
prev sibling next sibling parent "Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Thu, 28 Feb 2013 09:48:02 -0500, Jacob Carlborg <doob me.com> wrote:

 On 2013-02-28 15:31, Steven Schveighoffer wrote:

 while(!r.empty)  // inlined to r.front != 0
 {
     switch(r.front) // why would another load occur here?
     {
        // no need to check for 0, already done
        ...
     }
 }

Don't you have to check for 0 anyway. You could still have more data in the buffer? I doesn't have to be the manually added sentential that is encountered.

You are missing the point. Empty is DEFINED as r.front == 0. Adding a check for 0, would essentially lead to dead code (and technically, the compiler could trim it out). An important thing about a sentinel input range is that it is not an array -- you cannot maintain or process length, because length is defined by the sentinel. This becomes an advantage when your goal is simply to process every element -- no time wasted updating length. -Steve
Feb 28 2013
prev sibling next sibling parent "monarch_dodra" <monarchdodra gmail.com> writes:
On Thursday, 28 February 2013 at 14:31:08 UTC, Steven 
Schveighoffer wrote:
 On Thu, 28 Feb 2013 06:23:45 -0500, deadalnix

 I have to say I agree with deadalnix.

 you have essentially in lexer.c:

 outer:
 while(1)
 {
    switch(r.front)
    {
       case 0:
         break outer;
       ...
    }
 }

 whereas a range where empty is defined as r.front == 0:

 while(!r.empty)  // inlined to r.front != 0
 {
    switch(r.front) // why would another load occur here?
    {
       // no need to check for 0, already done
       ...
    }
 }

 If this doesn't translate to the same code, I don't know why 
 not[SNIP].

 -Steve

The difference is that by doing "!r.empty" first, you are actually executing "r.front == 0" each and every time, before doing the rest of the checks proper. Doing it the other way around however, the only time you actually check the sentinel value is once you've reached the actual sentinel value, or you've run out of values "of interest": EG not actually on every element. You are basically re-ordering the tests to speed up the usual case. //---- This is kind of the same logic as in C++'s "operator=": It is recommended to always check for self assignment. However, most quality implementations will *not* execute this check, instead being extra careful about instruction ordering. While the rare self-assignement case becomes more expensive, the common assignment becomes cheaper. That's where the win is.
Feb 28 2013
prev sibling next sibling parent "monarch_dodra" <monarchdodra gmail.com> writes:
On Thursday, 28 February 2013 at 15:31:21 UTC, Andrei 
Alexandrescu wrote:
 On 2/28/13 5:54 AM, Walter Bright wrote:
 On 2/27/2013 11:55 PM, Jonathan M Davis wrote:
 Again, please see how lexer.c works. I assure you, there is 
 no double
 copying going on, nor is there a double test for the 
 terminating 0.

I know what the lexer does, and remember that it _doesn't_ operate on ranges, and there are subtle differences between being able to just use char* and trying to handle generic ranges.

Hence the need to invent SentinelInputRange.

I don't think the sentinel input range is a blocker for redoing the parser (with ranges) in D. This discussion has probably run its course. The right thing to do at this point is port the lexer and figure what primitives are necessary from its input. Andrei

An actual sentinel range is trivial to implement, and the algorithms that can *actually* truly exploit it are rare. While I'm not against having such ranges in phobos, I'd just be weary to provide too many traits for them, or trying to have phobos exploit them either. Ranges have enough primitives as it is. Just provide it with a disclaimer than it will only ever be useful for a select class of algorithms (parser), but that the optimization opportunity will be ignored by the rest of phobos.
Feb 28 2013
prev sibling next sibling parent "Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Thu, 28 Feb 2013 10:31:42 -0500, monarch_dodra <monarchdodra gmail.com>  
wrote:

 On Thursday, 28 February 2013 at 14:31:08 UTC, Steven Schveighoffer  
 wrote:
 On Thu, 28 Feb 2013 06:23:45 -0500, deadalnix

 I have to say I agree with deadalnix.

 you have essentially in lexer.c:

 outer:
 while(1)
 {
    switch(r.front)
    {
       case 0:
         break outer;
       ...
    }
 }

 whereas a range where empty is defined as r.front == 0:

 while(!r.empty)  // inlined to r.front != 0
 {
    switch(r.front) // why would another load occur here?
    {
       // no need to check for 0, already done
       ...
    }
 }

 If this doesn't translate to the same code, I don't know why not[SNIP].

The difference is that by doing "!r.empty" first, you are actually executing "r.front == 0" each and every time, before doing the rest of the checks proper. Doing it the other way around however, the only time you actually check the sentinel value is once you've reached the actual sentinel value, or you've run out of values "of interest": EG not actually on every element. You are basically re-ordering the tests to speed up the usual case.

If you look at lexer.c, case 0 is the first test. How is it that the compiler knows that's the least likely case to check? Even it if it does, sure, the compiler could reorder it. But then it could include the while loop test into the switch statement, as others have suggested, and we get the same result. Besides, I don't think a jnz instruction is that expensive. My impression from Walter's other posts is to avoid the double load/double check, not to change the ordering. -Steve
Feb 28 2013
prev sibling next sibling parent "bearophile" <bearophileHUGS lycos.com> writes:
Walter Bright:

 The merit is having significantly faster speed.

Maybe it's time to write two benchmarks to compare the two different speeds (and run the two benchmarks with DMD and LDC). Bye, bearophile
Feb 28 2013
prev sibling next sibling parent "deadalnix" <deadalnix gmail.com> writes:
On Thursday, 28 February 2013 at 16:36:44 UTC, Walter Bright 
wrote:
 On 2/28/2013 6:48 AM, Ary Borenszweig wrote:
 On 2/28/13 1:57 AM, Walter Bright wrote:
 On 2/27/2013 8:01 PM, John Colvin wrote:
 Why must sentinel be known at compile time? I don't see 
 what's in the
 way of it
 being a runtime argument.

Performance! It should be usable as a case in a switch statement.

Isn't it possible for the optimizer to inline the function call and then combine the next ifs? if (isSentinel(value)) { } else { switch(value) { case ... } }

1. I don't know of any C compiler that would do that.

As mentioned, LLVM is able to do this kind of things. I have to admit I was quite amazed when I discovered it.
 2. There are other cases, as I pointed out to deadalnix.

 3. You still can't do lookahead without extra checks

You need the actual check to do proper generic D code. But the compiler is able to optimize it away via inlining and mecanism described in 1.
 Once again, guys, have a look at lexer.c at the lines I pointed 
 out.

I see nothing that can't be done by an optimizing compiler. Maybe something is, but I can't find it. And the example you pointed me aren't.
Feb 28 2013
prev sibling next sibling parent "Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Thu, 28 Feb 2013 11:43:06 -0500, Walter Bright  
<newshound2 digitalmars.com> wrote:

 On 2/28/2013 6:59 AM, Andrei Alexandrescu wrote:
 On 2/28/13 2:37 AM, deadalnix wrote:
 I don't see how defining a specific sentinel range here helps.

On first blush I agree. It may as well be a range that by convention is sentinel-terminated, and there's calls to front and popFront but never to empty.

Consider the following code from lexer.c: p++; switch (*p) Written using an InputRange: popFront(); switch (front) That code is INVALID. This is why a SentinelInputRange is necessary. You can't just use an InputRange in an invalid manner by convention.

Does switch(*p) include a case for 0? If so, wouldn't it be equivalent to say if(empty) /* do stuff that case 0 does */ else switch(front) -Steve
Feb 28 2013
prev sibling next sibling parent "Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Thu, 28 Feb 2013 12:02:53 -0500, Walter Bright  
<newshound2 digitalmars.com> wrote:

 On 2/28/2013 7:40 AM, Steven Schveighoffer wrote:
 If you look at lexer.c, case 0 is the first test.

No, it is not. It is actually a table lookup - all done in parallel. jmp cases[character]

You are comparing the assembly output of your solution with uncompiled D code. Apples and oranges. Quoting directly from https://github.com/D-Programming-Language/dmd/blob/master/src/lexer.c#L479: switch (*p) { case 0: As several others have pointed out, an optimizer can (and some do) make this rewrite automatically. The point is, if the lexer simply requires an input range, and not a sentinel input range, it is more flexible for its input. But when it does get called with a sentinel input range, the optimizer can reap the benefits. -Steve
Feb 28 2013
prev sibling next sibling parent "Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Thu, 28 Feb 2013 12:00:48 -0500, Walter Bright  
<newshound2 digitalmars.com> wrote:

 On 2/28/2013 6:31 AM, Steven Schveighoffer wrote:
 If this doesn't translate to the same code, I don't know why not.

Try it and see with your favorite C compiler.

A sample case of 1 does not prove it's not possible, or explain why those optimizers don't take that step. A valid response would be to give a case why an optimizer COULDN'T make that leap.
 Then try the lookahead cases I also posted.

You have already stated it gets changed into a jump table. Such an optimization seems possible to me, even with the != 0 check outside the switch, even if not all C compilers employ it. -Steve
Feb 28 2013
prev sibling next sibling parent "Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Thu, 28 Feb 2013 12:13:27 -0500, Dmitry Olshansky  
<dmitry.olsh gmail.com> wrote:

 No as a compiler will take it (or may depending on its brain) that 0 is  
 what you want to test *first*. It may speed things up if branch is  
 almost always taken but its not the case with sentinel. Thus its jsut  
 dead code that needs to be decoded, evalutated and skipped (as  
 predicated) *before* doing switch jump.

Well, according to lexer.c, case 0 is the first case. Why does that not make the compiler decide to test 0 first? Answer: it actually doesn't care, the switch compiles to a jump table. Which still begs the question, why can't the compiler make the leap to the equivalent code with a range? are these not semantically equivalent? if(x == 0) { // case for 0 } else { switch(x) { case 1: // case for 1; break; case 2: // case for 2; break; ... // cases for everything but 0 } } vs. switch(x) { case 0: // case for 0; break; case 1: // case for 1; break; case 2: // case for 2; break; ... // cases for everything but 0 } They seem semantically identical. It would not be impossible for an optimizer to make this leap. Combined with inlining, the above could be: if(r.empty) { // case for 0; } else ... and still be optimized the same. I'm not saying it happens, I'm saying it's possible. I have read several posts here claiming that llvm does this (I haven't tried this myself, but it sounds likely true). -Steve
Feb 28 2013
prev sibling next sibling parent "deadalnix" <deadalnix gmail.com> writes:
On Thursday, 28 February 2013 at 17:13:36 UTC, Dmitry Olshansky 
wrote:
 No as a compiler will take it (or may depending on its brain) 
 that 0 is what you want to test *first*. It may speed things up 
 if branch is almost always taken but its not the case with 
 sentinel. Thus its jsut dead code that needs to be decoded, 
 evalutated and skipped (as predicated) *before* doing switch 
 jump.

 In fact some people avoid the overhead of switch by placing one 
 or two of highly-frequent branches with tests before the switch 
 (thus avoiding indirect branch it entails in these frequent 
 cases).

That won't work as expected with LLVM and full optimizations, as it will combine everything in a switch, unless you use branch weight, in which case it can do the reverse : extract common cases from the switch. See : http://llvm.org/docs/BranchWeightMetadata.html GCC does something similar. This is another example of how people think they are doing high level assembly when in fact, the compiler is changing everything behind their back, and not doing at all what they think.
Feb 28 2013
prev sibling next sibling parent "deadalnix" <deadalnix gmail.com> writes:
On Thursday, 28 February 2013 at 17:00:51 UTC, Walter Bright 
wrote:
 On 2/28/2013 6:31 AM, Steven Schveighoffer wrote:
 If this doesn't translate to the same code, I don't know why 
 not.

Try it and see with your favorite C compiler. Then try the lookahead cases I also posted.

You are being stubborn here. You claim that this is significantly faster, but give no numbers. You claim that the compiler can't do some optimization, when LLVM does it already. See sample code bellow, which is a typical structure find in lexer.c : int bar(int n) { switch(n) { case 0: return 0; // Bad case ! case 25: return 75; case 42: return 69; case 666: return 999; default: return -1; } } int foo(int n) { switch(n) { case 1: return 23; case 2: case 3: return n; default: return bar(n); } } As we see, foo don't check for 0, and bar does, but only when necessary, as bar isn't called. Or this is what you think. Let's see how SDC compile this : define i32 _D3barFiZi(i32 %arg.n) { %n = alloca i32 store i32 %arg.n, i32* %n br label %body body: ; preds = %0 %1 = load i32* %n switch i32 %1, label %default [ i32 0, label %.case i32 25, label %.case1 i32 42, label %.case2 i32 666, label %.case3 ] switchstart: ; No predecessors! br label %.case .case: ; preds = %body, %switchstart ret i32 0 .case1: ; preds = %body ret i32 75 .case2: ; preds = %body ret i32 69 .case3: ; preds = %body ret i32 999 default: ; preds = %body ret i32 -1 switchend: ; No predecessors! unreachable } define i32 _D3fooFiZi(i32 %arg.n) { %n = alloca i32 store i32 %arg.n, i32* %n br label %body body: ; preds = %0 %1 = load i32* %n switch i32 %1, label %default [ i32 1, label %.case i32 2, label %.case1 i32 3, label %.case2 ] switchstart: ; No predecessors! br label %.case .case: ; preds = %body, %switchstart ret i32 23 .case1: ; preds = %body br label %.case2 .case2: ; preds = %body, %.case1 %2 = load i32* %n ret i32 %2 default: ; preds = %body %3 = load i32* %n %4 = call i32 _D3barFiZi(i32 %3) ret i32 %4 switchend: ; No predecessors! unreachable } Here is the raw result of the frontend in LLVM IR. Now here is what the optimizer give us : define i32 _D3barFiZi(i32 %arg.n) nounwind readnone { body: switch i32 %arg.n, label %default [ i32 0, label %.case i32 25, label %.case1 i32 42, label %.case2 i32 666, label %.case3 ] .case: ; preds = %default, %.case3, %.case2, %.case1, %body %merge = phi i32 [ 0, %body ], [ 75, %.case1 ], [ 69, %.case2 ], [ 999, %.case3 ], [ -1, %default ] ret i32 %merge .case1: ; preds = %body br label %.case .case2: ; preds = %body br label %.case .case3: ; preds = %body br label %.case default: ; preds = %body br label %.case } define i32 _D3fooFiZi(i32 %arg.n) nounwind readnone { body: switch i32 %arg.n, label %default.i [ i32 1, label %.case i32 2, label %.case2 i32 3, label %.case2 i32 0, label %_D3barFiZi.exit i32 25, label %.case1.i i32 42, label %.case2.i i32 666, label %.case3.i ] .case: ; preds = %body, %_D3barFiZi.exit, %.case2 %merge = phi i32 [ 23, %body ], [ %arg.n, %.case2 ], [ %merge.i, %_D3barFiZi.exit ] ret i32 %merge .case2: ; preds = %body, %body br label %.case .case1.i: ; preds = %body br label %_D3barFiZi.exit .case2.i: ; preds = %body br label %_D3barFiZi.exit .case3.i: ; preds = %body br label %_D3barFiZi.exit default.i: ; preds = %body br label %_D3barFiZi.exit _D3barFiZi.exit: ; preds = %body, %.case1.i, %.case2.i, %.case3.i, %default.i %merge.i = phi i32 [ 75, %.case1.i ], [ 69, %.case2.i ], [ 999, %.case3.i ], [ -1, %default.i ], [ 0, %body ] br label %.case } See, foo don't even call bar anymore. No let's include the null check, as it would have been done if checking for empty on a regular range (that happen to internally use a sentinel) : int foo(int n) { if(n != 0) { switch(n) { case 1: return 23; case 2: case 3: return n; default: return bar(n); } } return bar(n); } I didn't repeated bar as it doesn't change. Which is optimized as : define i32 _D3fooFiZi(i32 %arg.n) nounwind readnone { body: switch i32 %arg.n, label %default.i [ i32 0, label %_D3barFiZi.exit8 i32 1, label %.case i32 2, label %.case2 i32 3, label %.case2 i32 666, label %.case3.i i32 25, label %_D3barFiZi.exit i32 42, label %.case2.i ] .case: ; preds = %body, %_D3barFiZi.exit8, %_D3barFiZi.exit, %.case2 %merge = phi i32 [ %arg.n, %.case2 ], [ %merge.i, %_D3barFiZi.exit ], [ 0, %_D3barFiZi.exit8 ], [ 23, %body ] ret i32 %merge .case2: ; preds = %body, %body br label %.case .case2.i: ; preds = %body br label %_D3barFiZi.exit .case3.i: ; preds = %body br label %_D3barFiZi.exit default.i: ; preds = %body br label %_D3barFiZi.exit _D3barFiZi.exit: ; preds = %body, %.case2.i, %.case3.i, %default.i %merge.i = phi i32 [ 69, %.case2.i ], [ 999, %.case3.i ], [ -1, %default.i ], [ 75, %body ] br label %.case _D3barFiZi.exit8: ; preds = %body br label %.case } The exact damn same thing ! Let's try with a different schemes, we check first an early return : int foo(int n) { if(n == 0) { return bar(n); } switch(n) { case 1: return 23; case 2: case 3: return n; default: return bar(n); } } Guess what ? It is optimized as : define i32 _D3fooFiZi(i32 %arg.n) nounwind readnone { body: switch i32 %arg.n, label %default.i7 [ i32 0, label %_D3barFiZi.exit i32 1, label %.case i32 2, label %.case2 i32 3, label %.case2 i32 666, label %.case3.i6 i32 25, label %_D3barFiZi.exit8 i32 42, label %.case2.i5 ] _D3barFiZi.exit: ; preds = %body, %_D3barFiZi.exit8, %.case2, %.case %merge9 = phi i32 [ 0, %body ], [ 23, %.case ], [ %arg.n, %.case2 ], [ %merge.i3, %_D3barFiZi.exit8 ] ret i32 %merge9 .case: ; preds = %body br label %_D3barFiZi.exit .case2: ; preds = %body, %body br label %_D3barFiZi.exit .case2.i5: ; preds = %body br label %_D3barFiZi.exit8 .case3.i6: ; preds = %body br label %_D3barFiZi.exit8 default.i7: ; preds = %body br label %_D3barFiZi.exit8 _D3barFiZi.exit8: ; preds = %body, %.case2.i5, %.case3.i6, %default.i7 %merge.i3 = phi i32 [ 69, %.case2.i5 ], [ 999, %.case3.i6 ], [ -1, %default.i7 ], [ 75, %body ] br label %_D3barFiZi.exit } Yet again the same thing. A range using a sentinel internally is clearly beneficial for a lexer (like C strings). Defining explicitly a range that does so is plain useless.
Feb 28 2013
prev sibling next sibling parent "Brian Schott" <briancschott gmail.com> writes:
On Thursday, 28 February 2013 at 18:38:59 UTC, Andrei 
Alexandrescu wrote:
 I think that's a good idea but I took a look at 
 https://github.com/bhelyer/std.d.lexer/blob/master/std/d/lexer.d 
 and I will destroy it. In a good sense :o).

 Andrei

https://github.com/Hackerpilot/Dscanner/blob/range-based-lexer/std/d/lexer.d He's talking about this one.
Feb 28 2013
prev sibling next sibling parent "Jonathan M Davis" <jmdavisProg gmx.com> writes:
On Thursday, February 28, 2013 08:52:45 Walter Bright wrote:
 I've given you two examples (lexer and regexp) where you are certainly not
 stuck with that, and those two cases matter.
 
 Pure input ranges fail utterly as you can't save them, so you get _zero_
 lookahead [...]


My point is that in almost all cases, what'll end up happening with a sentinel range which is something like this: struct SRange(R)    if(is(Unqual!(ElementType!R) == char)) {    enum char sentinel = 0;     property char front() { return _front; }     property bool empty() { return _front == sentinel; }    void popFront()    {        _range.popFront();        if(_range.empty)            _front = sentinel;    }    this(R range)    {        if(_range.empty)            _front = sentinel;        else            _front = _rang.front;    } private:    char _front;    R _range; } Notice that it has to check for empty every time that the front is popped, and it can't avoid that, because it's wrapping another range which is not a sentinel range. The only time that it can avoid that is if it's managing its own memory internally via an array or pointer or whatnot. And if it's just going to be a string or an array, I'd argue for simply special casing strings or arrays to skip unnecessary empty checks. Also, with sentinel ranges, you'll be forced to wrap strings because of that sentinel marker that's required for isSentinelRange, meaning that you'll get something like struct SRange {    enum char sentinel = 0;     property char front() { return _front; }     property bool empty() { return _front == sentinel; }    void popFront() { _str = _str[1 .. $]; }    this(string str)    {        _str = str;        if(_str.empty)            _str = [sentinel];        else if(_str[$ - 1] != sentinel)            _str ~= sentinel;    } private:    string _str; } And while the compiler will hopefully optimize away all of the extra overhead of the wrapper type, any functions which would be able to optimize what they're doing for a string or array will not special case for this wrapper type, because they probably won't even know about it. A prime case where that might be an issue would be with std.utf.decode or std.utf.decodeFront which do some special casing for strings and which any unicode-aware lexer would have to use at least some of the time (even if the vast majority of the time, it can just check ASCII stuff). So, for the most part, the only time that you'll get any performance boost out of this is with strings or arrays, and you'll take a performance hit if you're using functions which special case strings or arrays but not the wrapper sentinel type. And yet special-casing strings and arrays will give you exactly the same performance boost without this sentinel range concept. - Jonathan M Davis
Feb 28 2013
prev sibling next sibling parent "jerro" <a a.com> writes:
 I didn't say can't, I said didn't. llvm handles that case 
 (which does surprise me), dmc, gcc, and vc do not. Furthermore, 
 I'll say "cannot" on this:

             case '|':
                 p++;
                 if (*p == '=')
                 {   p++;
                     t->value = TOKorass;
                 }
                 else if (*p == '|')
                 {   p++;
                     t->value = TOKoror;
                 }
                 else
                     t->value = TOKor;
                 return;

This code is correct without assuming that r is a sentinel range, and it is pretty close to your code above: void foo(R)(ref R r, ref int tvalue) { if(!r.empty) switch(r.front) { case '|': r.popFront(); if (!r.empty && r.front == '=') { r.popFront(); tvalue = 1; } else if (!r.empty && r.front == '|') { r.popFront(); tvalue = 2; } else tvalue = 3; return; default: } } If I add this: struct R { ubyte* p; property front(){ return *p; } property empty(){ return *p == 0; } void popFront() { p++; } } alias foo!R bar; ldc2 -O3 -release compiles it to: 0: mov (%rsi),%rax 3: cmpb $0x7c,(%rax) 6: jne 3e 8: lea 0x1(%rax),%rcx c: mov %rcx,(%rsi) f: mov 0x1(%rax),%cl 12: cmp $0x7c,%cl 15: jne 25 17: add $0x2,%rax 1b: mov %rax,(%rsi) 1e: movl $0x2,(%rdi) 24: retq 25: cmp $0x3d,%cl 28: jne 38 2a: add $0x2,%rax 2e: mov %rax,(%rsi) 31: movl $0x1,(%rdi) 37: retq 38: movl $0x3,(%rdi) 3e: retq
Feb 28 2013
prev sibling next sibling parent "Jonathan M Davis" <jmdavisProg gmx.com> writes:
On Thursday, February 28, 2013 13:42:34 Walter Bright wrote:
 On 2/28/2013 10:52 AM, Jonathan M Davis wrote:
 Notice that it has to check for empty every time that the front is popped,
 and it can't avoid that,

Yes it can avoid it - that is the whole point. Notice there are no checks for the sentinel here - but the code is correct: case '>': p++; if (*p == '=') { p++; t->value = TOKge; // >= } else if (*p == '>') { p++; if (*p == '=') { p++; t->value = TOKshrass; // >>= } else if (*p == '>') { p++; if (*p == '=') { p++; t->value = TOKushrass; // >>>= } else t->value = TOKushr; // >>> } else t->value = TOKshr; // >> } else t->value = TOKgt; // > return;

But that's with a pointer. You see every ++ there? That would be a popFront call, and for most ranges, that would mean checking empty internally if the range needs to have a sentinel value on its end, because most ranges will be a wrapper around another range, and the only way to know whether you've reached their end (and that therefore, front now needs to be the sentinel) is to check for empty. If a range isn't a wrapper around another range, then it's probably an array, but arrays won't work with the sentinel range scheme, because they won't pass isSentinelRange. So, they'll still need a wrapper which allows them to be treated as a sentinel range. And so the pretty much the only case where you get any performance gain is with strings wrapped in a sentinel range, and if that's the case, I'd just special case them to use sentinels and avoid the sentinel range concept entirely. I just don't see how you're going to get a performance gain from much of anything other than strings. For the lexer I'm working on, I just use string mixins for any operation which might differ between range types (e.g. strings can't use front or they'll decode, whereas with a range of code units, you need to use front). So, instead of p++ like above, I might have something like mixin(popCodeUnit!R()); But if you're dealing with strings, and you want your function to operate on ranges of code units, then you're pretty much stuck either casting the strings to arrays of ubyte to operate on them, wrapping them in another range, or special casing them. And wrapping them (like your solution would require) is arguably the worst, because you lose out on any possibility of any helper functions you use optimizing for your strings, unless you write all of them yourself, because they won't be recognized as strings - and that includes functions like std.utf.decode, which any unicode-aware lexer will have to use at least some of the time. - Jonathan M Davis
Feb 28 2013
prev sibling next sibling parent "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Thu, Feb 28, 2013 at 04:36:03PM -0800, Walter Bright wrote:
 On 2/28/2013 3:14 PM, jerro wrote:
ldc2 -O3 -release compiles it to:

As I posted to Timon, clang (i.e. ldc2) and gcc do do this optimization, dmc and vc do not.

Couldn't dmc be made to do this optimization? T -- It only takes one twig to burn down a forest.
Feb 28 2013
prev sibling next sibling parent "deadalnix" <deadalnix gmail.com> writes:
On Thursday, 28 February 2013 at 23:24:01 UTC, Walter Bright 
wrote:
 On 2/28/2013 2:16 PM, Timon Gehr wrote:
     // expression simplification
     // (eg. trivial peephole for x!=a && x==b case sufficient)

Hmm, I hadn't thought of that. Some checking shows that VC and DMC do not do it, gcc and clang do.

Finally we are getting somewhere.
Feb 28 2013
prev sibling next sibling parent "deadalnix" <deadalnix gmail.com> writes:
On Friday, 1 March 2013 at 00:34:25 UTC, Walter Bright wrote:
 I just don't see how you're going to get a
 performance gain from much of anything other than strings.

I gave you other examples already. We're just going around in circles.

You didn't posted a single example that wasn't optimized by LLVM. I do understand that some compiler may not do the optimization, but it is show already that this is clearly possible and in fact done. That is not an theoretical improvement. I don't see the point in creating a new type of range simply because some compiler don't optimize properly.
Feb 28 2013
prev sibling next sibling parent "deadalnix" <deadalnix gmail.com> writes:
On Friday, 1 March 2013 at 01:46:34 UTC, Walter Bright wrote:
 On 2/28/2013 5:29 PM, Andrei Alexandrescu wrote:
 On 2/28/13 7:34 PM, Walter Bright wrote:
 On 2/28/2013 4:07 PM, Jonathan M Davis wrote:
 I just don't see how you're going to get a
 performance gain from much of anything other than strings.

I gave you other examples already. We're just going around in circles.

We're talking about an abstraction that doesn't exist, exemplifying with code that doesn't use it. No wonder there's going to be miscommunication. The solution is to spin some code.

std.regexp uses sentinels in the bytecode.

I think nobody says that this technic is useless. Simply that a good compiler can take advantage of it without introducing an higher level concept of sentinel input range.
Feb 28 2013
prev sibling next sibling parent "Chris Nicholson-Sauls" <ibisbasenji gmail.com> writes:
A use case that comes immediately to mind: a sentinal range that, 
yes wraps, an infinite (but predictable!) range, effectively 
allowing you to take a head-slice of the infinite range.

auto foo = infiniteRangeOfEvenNumbers();
auto upto1000 = GenericSentinalInputRange!1000( foo );

Yes, in this contrived example, one could use take(), but what 
about an infinite range that can be predicted to always hit 
certain points, but is not restricted to those points?


As for other cases of sentinal ranges wrapping others, in 
particular arrays, there seems to be no mention of what I expect 
would be the most common use case: wherein the code which 
instantiates the sentinal range can guarantee that the sentinal 
already exists in the source data/range, because it *put* it 
there -- the lexer remains a fine example, as it tacks the 
terminating \0 on.  There is no need for extra checks, because 
there is no need to *replace* the source data with the sentinel. 
It is naturally reduced to it through normal processing.


Other apparent use cases: list processing (Andrei has mentioned 
this already, for singly-linked lists); protocol processing with 
a terminate-transaction message, or similar; state sequences 
(which can include lexers and parsers, as it happens).  This is 
just off the top of my head.

Can we do it now, without a special type?  Yes.  Is there any 
gain from introducing the special type, versus what can be done 
now?  Yes, sometimes.  Is there any harm in introducing the 
special type?  No.  Is there significant cost in adding it?  No.  
Should this just be handled by compiler optimization?  Yes and no 
-- yes in that a great compiler should be able to optimize *many* 
(but not neccessarily all!) cases; but no in that including 
implementation details in a language spec is usually bad mojo.  
Of course, any compiler that could optimize the common cases, 
could at least as readily optimize the use of a sentinal range -- 
in fact, I'd expect it to open optimization paths for some of the 
corner cases that might otherwise not have been optimized.

So... I could live without a standard sentinal range concept 
(have so far, using sentinal injection with input ranges, which 
as Walter pointed out is really an incorrect use (abuse?) of 
input ranges), but I also know I'd be using it if it existed (and 
thereby cleaning up some code versus how I do it now).
Feb 28 2013
prev sibling next sibling parent "deadalnix" <deadalnix gmail.com> writes:
On Friday, 1 March 2013 at 06:19:19 UTC, Chris Nicholson-Sauls 
wrote:
 A use case that comes immediately to mind: a sentinal range 
 that, yes wraps, an infinite (but predictable!) range, 
 effectively allowing you to take a head-slice of the infinite 
 range.

 auto foo = infiniteRangeOfEvenNumbers();
 auto upto1000 = GenericSentinalInputRange!1000( foo );

struct GenericSentinelRange(R, Sentinel) { R r; property auto front() { return r.front; } void popFront() { r.popFront(); } property empty() { return r.front == Sentinel; } } We don't need a new type of range at all. You confuse legitimate uses case for using a sentinel to terminate a range and uses case where an actual sentinel range is needed.
 So... I could live without a standard sentinal range concept 
 (have so far, using sentinal injection with input ranges, which 
 as Walter pointed out is really an incorrect use (abuse?) of 
 input ranges), but I also know I'd be using it if it existed 
 (and thereby cleaning up some code versus how I do it now).

You can live without, and guess what : if you use LDC (and probably GDC) you'll get the performance boost Walter is talking about.
Feb 28 2013
prev sibling next sibling parent "Chris Nicholson-Sauls" <ibisbasenji gmail.com> writes:
On Friday, 1 March 2013 at 06:33:42 UTC, deadalnix wrote:
 struct GenericSentinelRange(R, Sentinel) {
     R r;

      property auto front() {
         return r.front;
     }

     void popFront() {
         r.popFront();
     }

      property empty() {
         return r.front == Sentinel;
     }
 }

That's exactly what I wrote as well, when I played with the idea before posting. Except I just called the sentinel constant S and added 'enum sentinel = S;' -- not significant extra work.
 We don't need a new type of range at all. You confuse 
 legitimate uses case for using a sentinel to terminate a range 
 and uses case where an actual sentinel range is needed.

I'm not confused. The use case for a sentinel range /type/ is always going to be the same, and singular: automating legitimate use cases for sentinels. So, in effect, an argument for one is as good as for the other, when I'm only responding to the calls for use case examples. My mistake was in not making it clear that I was responding just to those, sorry.
 You can live without, and guess what : if you use LDC (and 
 probably GDC) you'll get the performance boost Walter is 
 talking about.

Which presupposes that I *can* use LDC (or GDC). For many people, it may not be an option. And last time I tried (which has been months now, admittedly) I couldn't get LDC to work for me at all, which is a shame since I'm actually very much interested in it. Either way, reliance on implementation details is not a good thing.
Feb 28 2013
prev sibling next sibling parent Marco Leise <Marco.Leise gmx.de> writes:
Am Thu, 28 Feb 2013 14:27:28 -0800
schrieb Walter Bright <newshound2 digitalmars.com>:

 Two corporate users of D have also told me that their motivating case to adopt
D 
 was - compile speed!

That's what I thought and why I keep telling that Carthago has to..., I mean that Phobos and the GC is not the ultimate performance platform. You just don't get fast + safe + convenient in one package, it's a compromise.
 Throwing compile speed under the bus is what other compiler vendors do, and 
 we're not going to do that.

I certainly hope so :D. It's just amazing that you barely notice a delay when compiling a medium sized program. DMD practically competes with interpreted languages here in terms of turn-around time. -- Marco
Feb 28 2013
prev sibling next sibling parent "deadalnix" <deadalnix gmail.com> writes:
On Friday, 1 March 2013 at 07:02:51 UTC, Chris Nicholson-Sauls 
wrote:
 Which presupposes that I *can* use LDC (or GDC). For many 
 people, it may not be an option.  And last time I tried (which 
 has been months now, admittedly) I couldn't get LDC to work for 
 me at all, which is a shame since I'm actually very much 
 interested in it.  Either way, reliance on implementation 
 details is not a good thing.

You don't rely on any implementation detail as the whole stuff will work with all implementation. It will simply not be able to take advantage of sentinel for speedup with some compilers. But the behavior will be exactly the same. Different compilers produces executable of different efficiency. That isn't something new.
Feb 28 2013
prev sibling next sibling parent =?iso-8859-15?Q?Simen_Kj=E6r=E5s?= <simen.kjaras gmail.com> writes:
On Thu, 28 Feb 2013 19:53:05 +0100, Andrei Alexandrescu  
<SeeWebsiteForEmail erdani.org> wrote:

 On 2/28/13 1:42 PM, Dmitry Olshansky wrote:
 That's the wrong one. This is the one:
 https://github.com/Hackerpilot/Dscanner/tree/range-based-lexer

 Though feel free to destroy the other one too ;)
 But I need your full powers with Dscanner first :o)

Whoa, that looks really good!

Nonono, destroy! :p -- Simen
Mar 01 2013
prev sibling next sibling parent "Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Thu, 28 Feb 2013 17:00:06 -0500, Walter Bright  
<newshound2 digitalmars.com> wrote:

 On 2/28/2013 9:25 AM, Steven Schveighoffer wrote:
 You have already stated it gets changed into a jump table.

Please, please listen to what I write. This is very frustrating. The code in lexer.c is there for all to see, and it amply illustrates everything I'm saying. For example, this code does not get translated into a jump table: case '+': p++; if (*p == '=') { p++; t->value = TOKaddass; } else if (*p == '+') { p++; t->value = TOKplusplus; } else t->value = TOKadd; return;

I don't need to add any more to this discussion, it seems more qualified people are making the points I am making, but in a more understandable way. Sorry to add to your frustration. -Steve
Mar 01 2013
prev sibling next sibling parent "deadalnix" <deadalnix gmail.com> writes:
On Thursday, 28 February 2013 at 22:00:11 UTC, Walter Bright 
wrote:
 On 2/28/2013 9:25 AM, Steven Schveighoffer wrote:
 On Thu, 28 Feb 2013 12:00:48 -0500, Walter Bright 
 <newshound2 digitalmars.com>
 wrote:

 On 2/28/2013 6:31 AM, Steven Schveighoffer wrote:
 If this doesn't translate to the same code, I don't know why 
 not.

Try it and see with your favorite C compiler.

A sample case of 1 does not prove it's not possible, or explain why those optimizers don't take that step. A valid response would be to give a case why an optimizer COULDN'T make that leap.

No, it is not. DMD is compiled with real compilers, not abstract "sufficiently smart compilers".

I finally decided to write a post about such imaginary "sufficiently smart compilers" that clearly don't exists today, never will and so we have to clutter the language : http://www.deadalnix.me/2013/03/23/a-story-about-optimization-llvm-and-the-sentinelinputrange/ I really hope that someone wrote this imaginary LLVM thing everybody is talking about !
Mar 24 2013
prev sibling next sibling parent "bearophile" <bearophileHUGS lycos.com> writes:
deadalnix:

 http://www.deadalnix.me/2013/03/23/a-story-about-optimization-llvm-and-the-sentinelinputrange/

A nice article. I'd like to see longer lines for the non-proportional text of the code. Another way to improve the article is to show what GDC does here. Bye, bearophile
Mar 24 2013
prev sibling next sibling parent "deadalnix" <deadalnix gmail.com> writes:
On Sunday, 24 March 2013 at 11:49:17 UTC, Andrei Alexandrescu 
wrote:
 On 3/24/13 7:32 AM, deadalnix wrote:
 I finally decided to write a post about such imaginary 
 "sufficiently
 smart compilers" that clearly don't exists today, never will 
 and so we
 have to clutter the language :
 http://www.deadalnix.me/2013/03/23/a-story-about-optimization-llvm-and-the-sentinelinputrange/


 I really hope that someone wrote this imaginary LLVM thing 
 everybody is
 talking about !

Looking great (also love the page design). The piece has a few systematic English language errors that should be easy to fix. Would you want to accept a copyedit step from a native speaker, and if so, is anyone here inclined to do that?

I'd be happy to, if someone want to do it.
Mar 24 2013
prev sibling next sibling parent "John Colvin" <john.loughran.colvin gmail.com> writes:
On Sunday, 24 March 2013 at 12:22:37 UTC, deadalnix wrote:
 On Sunday, 24 March 2013 at 11:49:17 UTC, Andrei Alexandrescu 
 wrote:
 On 3/24/13 7:32 AM, deadalnix wrote:
 I finally decided to write a post about such imaginary 
 "sufficiently
 smart compilers" that clearly don't exists today, never will 
 and so we
 have to clutter the language :
 http://www.deadalnix.me/2013/03/23/a-story-about-optimization-llvm-and-the-sentinelinputrange/


 I really hope that someone wrote this imaginary LLVM thing 
 everybody is
 talking about !

Looking great (also love the page design). The piece has a few systematic English language errors that should be easy to fix. Would you want to accept a copyedit step from a native speaker, and if so, is anyone here inclined to do that?

I'd be happy to, if someone want to do it.

Native English speaker here; I'd be happy to edit for grammar.
Mar 24 2013
prev sibling next sibling parent "John Colvin" <john.loughran.colvin gmail.com> writes:
On Sunday, 24 March 2013 at 12:22:37 UTC, deadalnix wrote:
 On Sunday, 24 March 2013 at 11:49:17 UTC, Andrei Alexandrescu 
 wrote:
 On 3/24/13 7:32 AM, deadalnix wrote:
 I finally decided to write a post about such imaginary 
 "sufficiently
 smart compilers" that clearly don't exists today, never will 
 and so we
 have to clutter the language :
 http://www.deadalnix.me/2013/03/23/a-story-about-optimization-llvm-and-the-sentinelinputrange/


 I really hope that someone wrote this imaginary LLVM thing 
 everybody is
 talking about !

Looking great (also love the page design). The piece has a few systematic English language errors that should be easy to fix. Would you want to accept a copyedit step from a native speaker, and if so, is anyone here inclined to do that?

I'd be happy to, if someone want to do it.

I copied and pasted in to libreoffice for editing: http://dl.dropbox.com/u/910836/sentinal.odt or http://dl.dropbox.com/u/910836/sentinal.doc There are 2 comments from me, they are in [] and highlighted in yellow. You'll probably want to address them and then delete them. There were a lot of grammatical mistakes that have now been fixed and a couple of sentences have been rearranged. I had planned to work as Andrei suggested with "before -> after" but I quickly realised the edit file would be longer than the document! P.S. I may have accidentally introduced some British English by accident, perhaps a native American English speaker will spot them.
Mar 24 2013
prev sibling parent "deadalnix" <deadalnix gmail.com> writes:
On Sunday, 24 March 2013 at 14:37:18 UTC, John Colvin wrote:
 On Sunday, 24 March 2013 at 12:22:37 UTC, deadalnix wrote:
 On Sunday, 24 March 2013 at 11:49:17 UTC, Andrei Alexandrescu 
 wrote:
 On 3/24/13 7:32 AM, deadalnix wrote:
 I finally decided to write a post about such imaginary 
 "sufficiently
 smart compilers" that clearly don't exists today, never will 
 and so we
 have to clutter the language :
 http://www.deadalnix.me/2013/03/23/a-story-about-optimization-llvm-and-the-sentinelinputrange/


 I really hope that someone wrote this imaginary LLVM thing 
 everybody is
 talking about !

Looking great (also love the page design). The piece has a few systematic English language errors that should be easy to fix. Would you want to accept a copyedit step from a native speaker, and if so, is anyone here inclined to do that?

I'd be happy to, if someone want to do it.

I copied and pasted in to libreoffice for editing: http://dl.dropbox.com/u/910836/sentinal.odt or http://dl.dropbox.com/u/910836/sentinal.doc There are 2 comments from me, they are in [] and highlighted in yellow. You'll probably want to address them and then delete them. There were a lot of grammatical mistakes that have now been fixed and a couple of sentences have been rearranged. I had planned to work as Andrei suggested with "before -> after" but I quickly realised the edit file would be longer than the document! P.S. I may have accidentally introduced some British English by accident, perhaps a native American English speaker will spot them.

That is much better now. Thank a lot, I updated the article.
Mar 24 2013