digitalmars.D - Request for comments: std.d.lexer

Brian Schott (18/18) Jan 27 2013 I'm writing a D lexer for possible inclusion in Phobos.

Philippe Sigaud (29/44) Jan 27 2013 Cool! I remember linking to it in the wiki a week ago:

deadalnix (3/14) Jan 27 2013 Oh yes, that is very important. Conditions are perfect to handle
Brian Schott (23/44) Jan 27 2013 It implements the InputRange interface from std.range so that

Timon Gehr (5/28) Jan 27 2013 You are measuring lexing speed against compilation speed. A reasonably

Johannes Pfau (6/45) Jan 27 2013 Profiling is always a good idea, but to be fair: His dmd was probably

dennis luehring (6/51) Jan 27 2013 that makes no sense - we need a tiny piece of benchmark code inside of

dennis luehring (3/8) Jan 27 2013 the dmd lexer is in

Philippe Sigaud (12/32) Jan 27 2013 Hmm. You're the first person I see pushing this. Personally, I'd go
Jacob Carlborg (5/9) Jan 27 2013 Always good to try and minimize the size of a token when the lexer

Walter Bright (5/7) Jan 27 2013 Speed is critical for a lexer.

Dmitry Olshansky (8/15) Jan 27 2013 I concur. One of the biggest reason* there is a separate lexer step is

Philippe Sigaud (6/11) Jan 27 2013 ... which is exactly what parsing expression grammars (and other

Dmitry Olshansky (14/26) Jan 27 2013 That is only a happenstance and is highly overrated. What does it buy

Philippe Sigaud (13/33) Jan 27 2013 I don't know. I discovered what scannerless parsing was after

Dmitry Olshansky (14/27) Jan 27 2013 Reminds me to tweak/refactor/optimize CT-part of it.

Philippe Sigaud (4/8) Jan 27 2013 The latter. The one I use for Pegged to generate (what is hopefully) a

Walter Bright (5/17) Jan 27 2013 If you do such a thing when designing a language, I guarantee that you w...

Philippe Sigaud (6/14) Jan 27 2013 Something I never tought about: given a 10k lines module, how many

Jacob Carlborg (13/18) Jan 27 2013 If we're talking about dynamic allocation you can make sure you're just

FG (12/22) Jan 28 2013 I was also thinking about using slices to limit string allocation. So fa...

Walter Bright (3/19) Jan 28 2013 Just study the dmd lexer.c source code.
Mehrdad (9/15) Jan 28 2013 Here's my (VERY) simple NFA-based "regex"-based lexer, which

Philippe Sigaud (2/3) Jan 28 2013 This link does not work for me :(

Mehrdad (8/12) Jan 28 2013 Yeah sorry, I found a bug in it, so I took it off and fixed it

Philippe Sigaud (5/13) Jan 28 2013 Grabby grab!

Mehrdad (6/10) Jan 28 2013 lol. Mainly because I got tired of running into compiler bugs, to

Paulo Pinto (2/9) Jan 27 2013 I second that.
Mehrdad (4/11) Jan 28 2013 Semi-unrelated question -- how would you benchmark a _parser_?

Jacob Carlborg (6/9) Jan 28 2013 It should always be possible to get a number, but it would be hard to

Mehrdad (4/13) Jan 28 2013 Sorry I think my question was vague -- what I meant was, how do

Jacob Carlborg (6/9) Jan 28 2013 The easy way out would be to just measure how long time it takes to

Mehrdad (4/14) Jan 28 2013 Yeah that's exactly what I meant by not comparing xD

Walter Bright (2/3) Jan 28 2013 Just run under a profiler, then fix the hot spots.

dennis luehring (2/5) Jan 28 2013 mehrdad speaks about benchmarking not optimization :)

Walter Bright (3/10) Jan 28 2013 I know. But profiling is how you make it fast. Lexers should be as fast ...

Mehrdad (3/15) Jan 28 2013 I was talking about parsers actually :| on a second thought maybe

Walter Bright (2/17) Jan 28 2013 Same comment applies.

Johannes Pfau (10/18) Jan 28 2013 But to be fair that doesn't fit ranges very well. If you don't want to

Dmitry Olshansky (9/27) Jan 28 2013 Not the whole source but to construct a table of all identifiers. The

Timon Gehr (18/33) Jan 28 2013 Source code is usually small. (Even std.datetime has 'only' 1.6 MB.) My

Walter Bright (2/3) Jan 28 2013 This makes symbol lookups very fast.
Dmitry Olshansky (13/49) Jan 30 2013 In allocation scheme I proposed that ID could be a 32bit offset into the...

Walter Bright (4/6) Feb 01 2013 That only works if you know in advance the max size the chunk can ever b...

Dmitry Olshansky (13/20) Feb 01 2013 Well I supposed it's exactly one reallocatable block. Then token have an...

Walter Bright (4/23) Feb 01 2013 Your technique can work, provided the number of identifiers isn't large ...

Dmitry Olshansky (238/270) Feb 02 2013 I think I failed to describe the virtual memory trick or you just

Walter Bright (3/4) Feb 02 2013 You can't just reserve 1Gb of address space in a 32 bit system. (I know ...

Dmitry Olshansky (20/25) Feb 03 2013 OK, sorry you must knew this stuff throughly. 1Gb is not the point I

Tove (4/12) Feb 01 2013 This can easily be archived by preallocating file.size bytes...

deadalnix (22/22) Jan 27 2013 Very happy to see that !

deadalnix (5/27) Jan 27 2013 And the famous Job's « one last thing » : I'm not a big fan of
Brian Schott (17/38) Jan 27 2013 I decided not to do this because the lexer actually calls itself

deadalnix (6/23) Jan 27 2013 Even without special casing for slice of the input, generating a

Dicebot (4/4) Jan 27 2013 In one of last discussions about standard lexer/parser I remember
Philippe Sigaud (4/4) Jan 27 2013 Let's add another question:

Jacob Carlborg (4/8) Jan 27 2013 Perhaps an option for this.

Jacob Carlborg (5/22) Jan 27 2013 How about changing the type of TokenType to ushort, if all members fit.
Walter Bright (3/6) Jan 27 2013 Just a quick comment: byToken() should not accept a filename. It's input...

Dmitry Olshansky (5/13) Jan 27 2013 InputRange of char/wchar/dchar I guess as lexer should have an ability

Walter Bright (3/15) Jan 27 2013 Just char would suffice. If a user needs wchar/dchar, they can put a map...

Brian Schott (5/13) Jan 27 2013 The file name is accepted for eventual error reporting purposes.

Timon Gehr (3/6) Jan 27 2013 Sure. The point you brought across, however, was that it is not
Walter Bright (2/3) Jan 27 2013 Use an OutputRange for that.

David Nadlinger (4/8) Jan 27 2013 What about that delegate-based design? I thought everyone agreed

Walter Bright (4/10) Jan 27 2013 An OutputRange is a way of doing that. The advantage of OutputRange's is...

David Nadlinger (8/22) Jan 27 2013 I was talking about the design you proposed yourself here:

Brian Schott (42/46) Jan 27 2013 I think you misunderstand. The file name is so that if you pass

Walter Bright (3/11) Jan 27 2013 Yes, I did misunderstand. I suggest updating the documentation to clear ...
deadalnix (5/15) Jan 27 2013 I don't think that is a good idea. For instance mixin need to be
Johannes Pfau (7/11) Jan 28 2013 This sounds like a valid use case for buffered ranges (which we don't

Jacob Carlborg (6/11) Jan 28 2013 If think slicing the buffer will force the whole buffer to remain in

Walter Bright (3/6) Jan 28 2013 Well, the source buffer can be large, and will span a lot of memory cach...

Jacob Carlborg (5/7) Jan 28 2013 Would it be better to .dup the slices? Won't that be just as slow as

Dmitry Olshansky (5/10) Jan 28 2013 It would be better to compactly layout these one by one in a separate

jerro (5/19) Jan 28 2013 Would it also make sense to do small string optimization? On 64

Dmitry Olshansky (4/20) Jan 28 2013 It very well could be iff slice is a field in the Token struct.

Timon Gehr (9/42) Jan 28 2013 I see, probably there should be an option to do this by slicing instead.

Brian Schott (28/33) Jan 30 2013 I implemented the various suggestions from a past thread and made

dennis luehring (3/19) Jan 30 2013 but you still need to compare that to the current dmd lexer - nothing
FG (3/8) Jan 30 2013 Yep. I'm eager to see what timings you get with the whole file kept in m...
Dmitry Olshansky (9/42) Jan 30 2013 idup --> allocation

FG (10/13) Jan 30 2013 Yeah, similar to what I suggested, except that probably it would be good...

Dmitry Olshansky (14/27) Jan 30 2013 Mm trie shouldn't have that many allocations. Typically each node is

Jacob Carlborg (4/24) Jan 31 2013 How many tokens would that be in total?

H. S. Teoh (99/105) Jan 28 2013 [...]

Dejan Lekic (3/22) Jan 28 2013 Having a "standard" lexer module, that many people work on, and
Jacob Carlborg (6/23) Jan 31 2013 Just thinking out loud here. Would it be possible to lex a file in

FG (2/4) Jan 31 2013 Do you know where you can safely cut it without having it lexed beforeha...

Jacob Carlborg (8/10) Jan 31 2013 I was thinking that myself. It would probably be possible to just cut it...

H. S. Teoh (9/18) Jan 31 2013 [...]

Jacob Carlborg (4/7) Jan 31 2013 That would be a problem.

deadalnix (4/13) Jan 31 2013 I don't think it worth the complexity. You can lex both file in

dennis luehring (4/30) Jan 31 2013 why not only the symbols at the border needs to be "connected" then

Jacob Carlborg (4/7) Jan 31 2013 That would require some profiling to figure out.

dennis luehring (7/13) Jan 31 2013 i would say it "can" help alot so the design should be able to use

Jacob Carlborg (7/13) Jan 31 2013 This is kind of obvious, I think. That's why I started to think at this

qznc (6/9) Jan 31 2013 I think a lexer should be IO-bound on todays machines, so

Mehrdad (4/11) Feb 01 2013 Pretty sure RAMs these days can handle more than 1 processor's
Jacob Carlborg (4/6) Feb 01 2013 I'm not sure I understand.

Brian Schott (5/5) Feb 03 2013 After several hours of optimizing I've managed to make it so that

FG (2/4) Feb 03 2013 What are you comparing here? How do you time DMD's lexing stage?

Brian Schott (3/8) Feb 03 2013 A simple hack of module.c that prints out the file's token count

FG (5/8) Feb 03 2013 Ah, fine. Then it's not that bad. :)

Jacob Carlborg (5/9) Feb 03 2013 That would be interesting to hear. Three times slower than DMD doesn't

FG (5/12) Feb 04 2013 Looking at the current source, there is now a StringCache to hold the st...

Mehrdad (6/11) Feb 04 2013 Where is the current bottleneck?
Brian Schott (4/4) Feb 04 2013 More optimizing:

Andrei Alexandrescu (4/8) Feb 04 2013 Suggestion: take lexer.c and convert it to D. Should take one day, and

Andrej Mitrovic (3/5) Feb 04 2013 This was already done for DDMD, and the more recent minimal version of i...

Andrei Alexandrescu (3/8) Feb 04 2013 Awesome! Has anyone measured the speed?

Brian Schott (6/8) Feb 04 2013 I gave up on getting it to compile.

Andrej Mitrovic (6/7) Feb 05 2013 It seems master is broken. An earlier commit is buildable with a small

deadalnix (5/14) Feb 04 2013 DMD's lexer is not suitable for phobos IMO. It doesn't take a

Andrei Alexandrescu (4/18) Feb 05 2013 I looked at the code. Once converted, it's trivial to convert the lexer

Jacob Carlborg (9/11) Feb 05 2013 There's reason for why nobody has just extract the lexer from DMD. It

Jonathan M Davis (18/20) Feb 05 2013 Not exactly. I decided that it would be of greater benefit to just write...

Jacob Carlborg (4/7) Feb 05 2013 Aha, I see.

Jonathan M Davis (12/19) Feb 05 2013 There are basic ideas about how it works which are obviously good and sh...

Jacob Carlborg (4/10) Feb 05 2013 Yeah, that could be useful.
Andrei Alexandrescu (5/23) Feb 05 2013 As far as I could tell the dependencies of the lexer are fairly

Jacob Carlborg (4/7) Feb 05 2013 That's not what I remember from last time I gave it a try.
Jonathan M Davis (14/17) Feb 05 2013 I don't remember all of the details at the moment, since it's been sever...

Andrei Alexandrescu (5/21) Feb 05 2013 I think it would be reasonable for a lexer to require a range of ubyte

Jonathan M Davis (9/12) Feb 05 2013 I'd have to think about how you'd handle the Unicode stuff in that case,...
Jonathan M Davis (27/30) Feb 08 2013 Another big issue is the fact that in some ways, using a pointer like dm...

Dmitry Olshansky (6/17) Feb 08 2013 Not true, certain ranges know length but can't be random access as

Jonathan M Davis (33/51) Feb 08 2013 t may

Dmitry Olshansky (14/45) Feb 08 2013 Well I honestly disagree about the promise of knowing length - being

Jonathan M Davis (10/15) Feb 08 2013 I don't know exactly how costly it ends up being (and certainly, poor de...

Dmitry Olshansky (20/37) Feb 08 2013 I suspect it *might* require extra register for base depending on

Jonathan M Davis (9/11) Feb 08 2013 The main problem isn't RA ranges. It's quite probable that the optimizer...

deadalnix (6/24) Feb 08 2013 For pure InputRanges, that is pretty bad as the decoding of UTF

deadalnix (7/33) Feb 08 2013 Wow that is complete bullshit xD I completely messed up what I

Jonathan M Davis (15/19) Feb 09 2013 Well, with the pull request that I have, it'll always pop for input rang...

Jacob Carlborg (6/12) Feb 09 2013 Following this discussion it seems it's complicated to support ranges in...

Andrei Alexandrescu (4/17) Feb 09 2013 Requiring a random-access range of ubyte with a terminating zero may be

Jacob Carlborg (6/8) Feb 09 2013 Requiring a random-access range probably makes it easier. People here

Andrei Alexandrescu (4/10) Feb 09 2013 Yah. The way I see it is, start with a random-access range and then see

Dmitry Olshansky (10/21) Feb 09 2013 I don't get this. There is no sensible requirement to forbid non-random

Jonathan M Davis (21/40) Feb 09 2013 may be

Walter Bright (4/6) Feb 08 2013 A more problematic case is dmd's lexer relies on a 0 byte at the end to ...

Jonathan M Davis (7/14) Feb 09 2013 I didn't know that. That's a cute trick. But yeah, without controlling t...

Andrei Alexandrescu (3/16) Feb 09 2013 That's not a problem. You may require that the input has a terminating z...

Jacob Carlborg (4/6) Feb 09 2013 You cannot just append a terminating zero in the lexer if it's a range?

Andrei Alexandrescu (3/7) Feb 09 2013 You require it. It's client's burden.

Jacob Carlborg (4/5) Feb 09 2013 Ok, I see.

Walter Bright (11/27) Feb 09 2013 Perhaps we can formalize this a bit. Define a SentinelInputRange, which ...

Walter Bright (4/11) Feb 09 2013 If it's not clear, SentinelInputRange.front can be called *before* check...

Andrei Alexandrescu (5/23) Feb 09 2013 Yah, that all would fit C-style strings and singly-linked lists like a

deadalnix (10/14) Feb 08 2013 Playing around that, I discovered a bug in std.utf : slice and

Jonathan M Davis (7/13) Feb 05 2013 It turns out that it has nothing to do with string mixins (though it doe...

Jacob Carlborg (4/9) Feb 05 2013 Ok, good that it's not a blocker.

Dmitry Olshansky (4/8) Feb 05 2013 Time to do some hacking on your lexer I guess. I'll try add a couple of

Brian Schott (7/10) Feb 05 2013 I've been using avgtime[1] for measuring run times, perf[2], and

Dmitry Olshansky (7/18) Feb 05 2013 Thanks.

Brian Schott (5/11) Feb 08 2013 http://hackerpilot.github.com/experimental/std_lexer/images/times3.png

Jacob Carlborg (4/7) Feb 08 2013 DMD is still consistently faster, :(

Brian Schott (3/4) Feb 08 2013 We might be getting to the part where we say that code generated

Brian Schott (5/11) Feb 08 2013 For the lulz I compiled with "dmd -release -inline -noboundscheck

Dmitry Olshansky (4/14) Feb 08 2013 Would be intereesting to try gdc as dmd on linux uses gcc.

Iain Buclaw (6/27) Feb 08 2013 What? That's an outright fib. :-)

Brian Schott (2/3) Feb 08 2013 I think he means that "on linux the dmd binary is compiled by gcc"

Iain Buclaw (5/9) Feb 08 2013 That's still lies. :o)

Dmitry Olshansky (4/13) Feb 08 2013 --

Iain Buclaw (9/25) Feb 08 2013 I see we could be doing this all day. :=C3=BE

Brian Schott (5/13) Feb 08 2013 What we're saying is that dmd, The Digital Mars D Compiler, is

Iain Buclaw (8/20) Feb 08 2013 ly

Dmitry Olshansky (7/23) Feb 08 2013 We've been discussing DMD's lexer written in C++ that is obviously

SomeDude (3/12) Feb 09 2013 I'd think it's built by DMC++, Digital Mars C++ compiler.

Dmitry Olshansky (4/18) Feb 09 2013 On linux, sure ;)

Brian Schott (8/9) Feb 08 2013 GDC decided to randomly not fail to build on my system, so it's

Brian Schott (3/4) Feb 08 2013 Copy/pate fail. That's actually "dmd -release -inline
Iain Buclaw (5/15) Feb 08 2013 Cool. How do you determine the speed times?
FG (3/5) Feb 08 2013 Interesting. Seems that LDC is faster than GDC but has a bigger start ov...

David Nadlinger (7/14) Feb 08 2013 It seems like that, but the differences are small enough that you

dennis luehring (8/17) Feb 08 2013 i still don't get what you comparing here
Jacob Carlborg (4/12) Feb 09 2013 Do you have the exact code you test available somewhere?

Dmitry Olshansky (40/46) Feb 08 2013 Keep in mind the D runtime start-up cost. In the end on small files DMD

Jacob Carlborg (5/52) Feb 08 2013 Ok, I see. But wouldn't it be more fair to compare either GCC to GDC or

Walter Bright (2/4) Feb 08 2013 I worship the Dark Arts.

Iain Buclaw (3/8) Feb 09 2013 More like cursed the god's when you wrote it. :-)
Brian Schott (5/10) Feb 28 2013 *recites incantations*

Andrej Mitrovic (2/3) Feb 28 2013 That's a nice plot, but what is the X axis?

FG (2/5) Feb 28 2013 Lexing time in milliseconds.

Dmitry Olshansky (22/33) Feb 28 2013 Looking far better now. Keep in mind that we still don't use the

Dmitry Olshansky (7/32) Feb 28 2013 To clarify these are problems I observed but haven't solved/avoided

"Brian Schott" <briancschott gmail.com> writes:

I'm writing a D lexer for possible inclusion in Phobos.

DDOC: 
http://hackerpilot.github.com/experimental/std_lexer/phobos/lexer.html
Code: 
https://github.com/Hackerpilot/Dscanner/blob/range-based-lexer/std/d/lexer.d

It's currently able to correctly syntax highlight all of Phobos, 
but does a fairly bad job at rejecting or notifying users/callers 
about invalid input.

I'd like to hear arguments on the various ways to handle errors 
in the lexer. In a compiler it would be useful to throw an 
exception on finding something like a string literal that doesn't 
stop before EOF, but a text editor or IDE would probably want to 
be a bit more lenient. Maybe having it run-time (or compile-time 
configurable) like std.csv would be the best option here.

I'm interested in ideas on the API design and other high-level 
issues at the moment. I don't consider this ready for inclusion. 
(The current module being reviewed for inclusion in Phobos is the 
new std.uni.)

Jan 27 2013

Philippe Sigaud <philippe.sigaud gmail.com> writes:

On Sun, Jan 27, 2013 at 10:51 AM, Brian Schott <briancschott gmail.com> wrote:
 I'm writing a D lexer for possible inclusion in Phobos.

 DDOC: http://hackerpilot.github.com/experimental/std_lexer/phobos/lexer.html
 Code:
 https://github.com/Hackerpilot/Dscanner/blob/range-based-lexer/std/d/lexer.d

Cool! I remember linking to it in the wiki a week ago:
Here:

http://wiki.dlang.org/Lexers_Parsers

Feel free to correct the entry.

 It's currently able to correctly syntax highlight all of Phobos, but does a
 fairly bad job at rejecting or notifying users/callers about invalid input.

 I'd like to hear arguments on the various ways to handle errors in the
 lexer. In a compiler it would be useful to throw an exception on finding
 something like a string literal that doesn't stop before EOF, but a text
 editor or IDE would probably want to be a bit more lenient. Maybe having it
 run-time (or compile-time configurable) like std.csv would be the best
 option here.

Last time we discussed it, IIRC, some people wanted the lexer to stop
at once, other just wanted an Error token.
I personally prefer an Error token, but that means finding a way to
start lexing again after the error (and hence, finding where the error
ends).
I guess any separator/terminator could be used to re-engage the lexer:
space, semicolon, closing brace, closing parenthesis?

 I'm interested in ideas on the API design and other high-level issues at the
 moment. I don't consider this ready for inclusion. (The current module being
 reviewed for inclusion in Phobos is the new std.uni.)

OK, here are a few questions:

* Having a range interface is good. Any reason why you made byToken a
class and not a struct? Most (like, 99%) of range in Phobos are
structs. Do you need reference semantics?

* Also, is there a way to keep comments? Any code wanting the modify
the code might need them.
(edit: Ah, I see it: IterationStyle.IncludeComments)

* I'd distinguish between standard comments and documentation
comments. These are different beasts, to my eyes.

* I see Token has a startIndex member. Any reason not to have a
endIndex member? Or can and end index always be deduced from
startIndex and value.length?

* How does it fare with non ASCII code?

* A rough estimate of number of tokens/s would be good (I know it'll
vary). Walter seems to think if a lexer is not able to vomit thousands
of tokens a seconds, then it's not good. On a related note, does your
lexer have any problem with 10k+-lines files?

Jan 27 2013

"deadalnix" <deadalnix gmail.com> writes:

On Sunday, 27 January 2013 at 10:17:48 UTC, Philippe Sigaud wrote:
 Last time we discussed it, IIRC, some people wanted the lexer 
 to stop
 at once, other just wanted an Error token.
 I personally prefer an Error token, but that means finding a 
 way to
 start lexing again after the error (and hence, finding where 
 the error
 ends).
 I guess any separator/terminator could be used to re-engage the 
 lexer:
 space, semicolon, closing brace, closing parenthesis?

Oh yes, that is very important. Conditions are perfect to handle 
stuff like that.

Jan 27 2013

"Brian Schott" <briancschott gmail.com> writes:

On Sunday, 27 January 2013 at 10:17:48 UTC, Philippe Sigaud wrote:
 * Having a range interface is good. Any reason why you made 
 byToken a
 class and not a struct? Most (like, 99%) of range in Phobos are
 structs. Do you need reference semantics?

It implements the InputRange interface from std.range so that 
users have a choice of using template constraints or the OO model 
in their code.

 * Also, is there a way to keep comments? Any code wanting the 
 modify
 the code might need them.
 (edit: Ah, I see it: IterationStyle.IncludeComments)

 * I'd distinguish between standard comments and documentation
 comments. These are different beasts, to my eyes.

The standard at http://dlang.org/lex.html doesn't differentiate 
between them. It's trivial to write a function that checks if a 
token starts with "///", "/**", or "/++" while iterating over the 
tokens.

 * I see Token has a startIndex member. Any reason not to have a
 endIndex member? Or can and end index always be deduced from
 startIndex and value.length?

That's the idea.

 * How does it fare with non ASCII code?

Everything is templated on the character type, but I haven't done 
any testing on UTF-16 or UTF-32. Valgrind still shows functions 
from std.uni being called, so at the moment I assume it works.

 * A rough estimate of number of tokens/s would be good (I know 
 it'll
 vary). Walter seems to think if a lexer is not able to vomit 
 thousands
 of tokens a seconds, then it's not good. On a related note, 
 does your
 lexer have any problem with 10k+-lines files?

$ time dscanner --sloc ../phobos/std/datetime.d
14950

real	0m0.319s
user	0m0.313s
sys	0m0.006s

$ time dmd -c ../phobos/std/datetime.d

real	0m0.354s
user	0m0.318s
sys	0m0.036s

Yes, I know that "time" is a terrible benchmarking tool, but 
they're fairly close for whatever that's worth.

Jan 27 2013

Timon Gehr <timon.gehr gmx.ch> writes:

On 01/27/2013 11:42 AM, Brian Schott wrote:
 On Sunday, 27 January 2013 at 10:17:48 UTC, Philippe Sigaud wrote:
 * Having a range interface is good. Any reason why you made byToken a
 class and not a struct? Most (like, 99%) of range in Phobos are
 structs. Do you need reference semantics?

 It implements the InputRange interface from std.range so that users have
 a choice of using template constraints or the OO model in their code.
 ...

The lexer range must be a struct.

 ...
 * A rough estimate of number of tokens/s would be good (I know it'll
 vary). Walter seems to think if a lexer is not able to vomit thousands
 of tokens a seconds, then it's not good. On a related note, does your
 lexer have any problem with 10k+-lines files?

 $ time dscanner --sloc ../phobos/std/datetime.d
 14950

 real    0m0.319s
 user    0m0.313s
 sys    0m0.006s

 $ time dmd -c ../phobos/std/datetime.d

 real    0m0.354s
 user    0m0.318s
 sys    0m0.036s

 Yes, I know that "time" is a terrible benchmarking tool, but they're
 fairly close for whatever that's worth.

You are measuring lexing speed against compilation speed. A reasonably 
well performing lexer is around one order of magnitude faster on 
std.datetime. Maybe you should profile a little?

Jan 27 2013

Johannes Pfau <nospam example.com> writes:

Am Sun, 27 Jan 2013 12:38:33 +0100
schrieb Timon Gehr <timon.gehr gmx.ch>:

 On 01/27/2013 11:42 AM, Brian Schott wrote:
 On Sunday, 27 January 2013 at 10:17:48 UTC, Philippe Sigaud wrote:
 * Having a range interface is good. Any reason why you made
 byToken a class and not a struct? Most (like, 99%) of range in
 Phobos are structs. Do you need reference semantics?

 It implements the InputRange interface from std.range so that users
 have a choice of using template constraints or the OO model in
 their code. ...

 
 The lexer range must be a struct.
 
 ...
 * A rough estimate of number of tokens/s would be good (I know
 it'll vary). Walter seems to think if a lexer is not able to vomit
 thousands of tokens a seconds, then it's not good. On a related
 note, does your lexer have any problem with 10k+-lines files?

 $ time dscanner --sloc ../phobos/std/datetime.d
 14950

 real    0m0.319s
 user    0m0.313s
 sys    0m0.006s

 $ time dmd -c ../phobos/std/datetime.d

 real    0m0.354s
 user    0m0.318s
 sys    0m0.036s

 Yes, I know that "time" is a terrible benchmarking tool, but they're
 fairly close for whatever that's worth.

 
 You are measuring lexing speed against compilation speed. A
 reasonably well performing lexer is around one order of magnitude
 faster on std.datetime. Maybe you should profile a little?
 

Profiling is always a good idea, but to be fair: His dmd was probably
compiled with gcc and -O2 if it's a normal release build.
So to compare that he should use gdc, -O2 -release -fno-bounds-check
and probably more flags to compile the d code.

Jan 27 2013

dennis luehring <dl.soluz gmx.net> writes:

Am 27.01.2013 13:31, schrieb Johannes Pfau:
 Am Sun, 27 Jan 2013 12:38:33 +0100
 schrieb Timon Gehr <timon.gehr gmx.ch>:

 On 01/27/2013 11:42 AM, Brian Schott wrote:
 On Sunday, 27 January 2013 at 10:17:48 UTC, Philippe Sigaud wrote:
 * Having a range interface is good. Any reason why you made
 byToken a class and not a struct? Most (like, 99%) of range in
 Phobos are structs. Do you need reference semantics?

 It implements the InputRange interface from std.range so that users
 have a choice of using template constraints or the OO model in
 their code. ...

 The lexer range must be a struct.

 ...
 * A rough estimate of number of tokens/s would be good (I know
 it'll vary). Walter seems to think if a lexer is not able to vomit
 thousands of tokens a seconds, then it's not good. On a related
 note, does your lexer have any problem with 10k+-lines files?

 $ time dscanner --sloc ../phobos/std/datetime.d
 14950

 real    0m0.319s
 user    0m0.313s
 sys    0m0.006s

 $ time dmd -c ../phobos/std/datetime.d

 real    0m0.354s
 user    0m0.318s
 sys    0m0.036s

 Yes, I know that "time" is a terrible benchmarking tool, but they're
 fairly close for whatever that's worth.

 You are measuring lexing speed against compilation speed. A
 reasonably well performing lexer is around one order of magnitude
 faster on std.datetime. Maybe you should profile a little?

 Profiling is always a good idea, but to be fair: His dmd was probably
 compiled with gcc and -O2 if it's a normal release build.
 So to compare that he should use gdc, -O2 -release -fno-bounds-check
 and probably more flags to compile the d code.

that makes no sense - we need a tiny piece of benchmark code inside of 
dmd frontend (and gdc) - these results are the only reliable/compareable 
benchmark

someone knows the place where such benchmarking can take place in the 
dmd frontend code?

Jan 27 2013

dennis luehring <dl.soluz gmx.net> writes:

 that makes no sense - we need a tiny piece of benchmark code inside of
 dmd frontend (and gdc) - these results are the only reliable/compareable
 benchmark

 someone knows the place where such benchmarking can take place in the
 dmd frontend code?

the dmd lexer is in

https://github.com/D-Programming-Language/dmd/blob/master/src/lexer.c

with an small unit test

Jan 27 2013

Philippe Sigaud <philippe.sigaud gmail.com> writes:

On Sun, Jan 27, 2013 at 11:42 AM, Brian Schott <briancschott gmail.com> wrote:
 On Sunday, 27 January 2013 at 10:17:48 UTC, Philippe Sigaud wrote:
 * Having a range interface is good. Any reason why you made byToken a
 class and not a struct? Most (like, 99%) of range in Phobos are
 structs. Do you need reference semantics?


 It implements the InputRange interface from std.range so that users have a
 choice of using template constraints or the OO model in their code.

Hmm. You're the first person I see pushing this. Personally, I'd go
for a struct, if only because I don't need reference semantics.

 * Also, is there a way to keep comments? Any code wanting the modify
 the code might need them.
 (edit: Ah, I see it: IterationStyle.IncludeComments)

 * I'd distinguish between standard comments and documentation
 comments. These are different beasts, to my eyes.


 The standard at http://dlang.org/lex.html doesn't differentiate between
 them. It's trivial to write a function that checks if a token starts with
 "///", "/**", or "/++" while iterating over the tokens.

Yes but, the standard lexer was done with DMD in mind, and DMD has a
different code path for generating comments.

It's your project, sure, but I'd appreciate tokens differentiating
between the many comments.
Oh, and recognizing some inner Ddoc tokens, like ---- delimiters for
documentation code blocks. That way, code-in-doc could use the lexer
also.

Pretty please?


 * I see Token has a startIndex member. Any reason not to have a
 endIndex member? Or can and end index always be deduced from
 startIndex and value.length?


 That's the idea.

Does it work for UTF-16 and UTF-32 strings?

Jan 27 2013

Jacob Carlborg <doob me.com> writes:

On 2013-01-27 11:42, Brian Schott wrote:

 * I see Token has a startIndex member. Any reason not to have a
 endIndex member? Or can and end index always be deduced from
 startIndex and value.length?

 That's the idea.

Always good to try and minimize the size of a token when the lexer 
should output thousands of tokens per second.

-- 
/Jacob Carlborg

Jan 27 2013

Walter Bright <newshound2 digitalmars.com> writes:

On 1/27/2013 2:17 AM, Philippe Sigaud wrote:
 Walter seems to think if a lexer is not able to vomit thousands
 of tokens a seconds, then it's not good.

Speed is critical for a lexer.

This means, for example, you'll need to squeeze pretty much all storage 
allocation out of it. A lexer that does an allocation per token will is not 
going to do very well at all.

Jan 27 2013

Dmitry Olshansky <dmitry.olsh gmail.com> writes:

27-Jan-2013 23:48, Walter Bright пишет:
 On 1/27/2013 2:17 AM, Philippe Sigaud wrote:
 Walter seems to think if a lexer is not able to vomit thousands
 of tokens a seconds, then it's not good.

 Speed is critical for a lexer.

 This means, for example, you'll need to squeeze pretty much all storage
 allocation out of it. A lexer that does an allocation per token will is
 not going to do very well at all.

I concur. One of the biggest reason* there is a separate lexer step is 
because it could be made to do this stage very-very fast. Then the rest 
of the parser will greatly benefit from this underlying speed.

*Otherwise we could have just as well add the lexer stage as simple 
rules to the grammar that treats all of codepoints as terminals.

-- 
Dmitry Olshansky

Jan 27 2013

Philippe Sigaud <philippe.sigaud gmail.com> writes:

 I concur. One of the biggest reason* there is a separate lexer step is
 because it could be made to do this stage very-very fast. Then the rest of
 the parser will greatly benefit from this underlying speed.

 *Otherwise we could have just as well add the lexer stage as simple rules to
 the grammar that treats all of codepoints as terminals.

... which is exactly what parsing expression grammars (and other
scannerless parsers) do.

AFAICT, one interesting consequence is the ability to have composition
of grammars, which I sure have with Pegged. But maybe it's not related
that much, that's not something I stopped to think about.
In any case, grammar composition is something I've learnt to like quite a lot.

Jan 27 2013

Dmitry Olshansky <dmitry.olsh gmail.com> writes:

28-Jan-2013 00:18, Philippe Sigaud пишет:
 I concur. One of the biggest reason* there is a separate lexer step is
 because it could be made to do this stage very-very fast. Then the rest of
 the parser will greatly benefit from this underlying speed.

 *Otherwise we could have just as well add the lexer stage as simple rules to
 the grammar that treats all of codepoints as terminals.

 ... which is exactly what parsing expression grammars (and other
 scannerless parsers) do.

That is only a happenstance and is highly overrated. What does it buy 
you is the right question to ask. After all every LL-parser can be 
written avoiding notion of lexer and strangely I see no fuss about it.
So is it just hype? I dunno.

 AFAICT, one interesting consequence is the ability to have composition
 of grammars, which I sure have with Pegged. But maybe it's not related
 that much, that's not something I stopped to think about.
 In any case, grammar composition is something I've learnt to like quite a lot.

You can still have composability of grammars. In fact I'd define a 
couple of a less powerful but better optimized variations if I were you.

PEG-style memoization is still possible as long as underlying grammars 
are context-free all the useful usually are.

LR hardly composable ( I haven't seen a single one that does).
But all top-down one always are unless there is some explicit work spent 
to undermine this ability ;)

-- 
Dmitry Olshansky

Jan 27 2013

Philippe Sigaud <philippe.sigaud gmail.com> writes:

 ... which is exactly what parsing expression grammars (and other
 scannerless parsers) do.

 That is only a happenstance and is highly overrated. What does it buy you is
 the right question to ask. After all every LL-parser can be written avoiding
 notion of lexer and strangely I see no fuss about it.
 So is it just hype? I dunno.

I don't know. I discovered what scannerless parsing was after
discovering (and getting interested) in PEG.
That's a somewhat unusual path, as I'm now discovering standard
lexer+parser parsing :)

Anyway, let's not derail the OP thread too much.

 AFAICT, one interesting consequence is the ability to have composition
 of grammars, which I sure have with Pegged. But maybe it's not related
 that much, that's not something I stopped to think about.
 In any case, grammar composition is something I've learnt to like quite a
 lot.

 You can still have composability of grammars. In fact I'd define a couple of
 a less powerful but better optimized variations if I were you.

Yes, getting an automaton to deal with the regular part would be good.
But then, there already is std.regex ;)

it's good to know composition is possible elsewhere. I really like it:
it makes me use grammars (and the related, generated parsers) as I use
functions in PL: slowly building layers of abstraction and reuse.

 PEG-style memoization is still possible as long as underlying grammars are
 context-free all the useful usually are.

Yes, I agree.


 LR hardly composable ( I haven't seen a single one that does).
 But all top-down one always are unless there is some explicit work spent to
 undermine this ability ;)

What I still can't get an easy answer on is: is the D grammar LR(1),
LR(0), LALR?

Jan 27 2013

Dmitry Olshansky <dmitry.olsh gmail.com> writes:

28-Jan-2013 00:45, Philippe Sigaud пишет:
[snip]
 AFAICT, one interesting consequence is the ability to have composition
 of grammars, which I sure have with Pegged. But maybe it's not related
 that much, that's not something I stopped to think about.
 In any case, grammar composition is something I've learnt to like quite a
 lot.

 You can still have composability of grammars. In fact I'd define a couple of
 a less powerful but better optimized variations if I were you.

 Yes, getting an automaton to deal with the regular part would be good.
 But then, there already is std.regex ;)

Reminds me to tweak/refactor/optimize CT-part of it.
[snip]
 What I still can't get an easy answer on is: is the D grammar LR(1),
 LR(0), LALR?

I *think* you can shoehorn it into any one of these: LALR(1),  LL(k)+ 
some lookahead, LL(*) a generalization of it, or PEG. As usual you might 
need semantic predicates to iron out some quirks though. We are getting 
OT quick ;)

The hint is that your question is a bit faulty: by calling it "the D 
grammar" do you mean the exact one listed on the website or any 
equivalent that parses the same language (including the ones obtained by 
simple transformations)?

-- 
Dmitry Olshansky

Jan 27 2013

Philippe Sigaud <philippe.sigaud gmail.com> writes:

 The hint is that your question is a bit faulty: by calling it "the D
 grammar" do you mean the exact one listed on the website or any equivalent
 that parses the same language (including the ones obtained by simple
 transformations)?

The latter. The one I use for Pegged to generate (what is hopefully) a
D parser is already modified, discards constructs like NameList :=
Name NameList in favor of Name+

Anyway, let's stop here. Back to lexing proper :)

Jan 27 2013

Walter Bright <newshound2 digitalmars.com> writes:

On 1/27/2013 12:18 PM, Philippe Sigaud wrote:
 I concur. One of the biggest reason* there is a separate lexer step is
 because it could be made to do this stage very-very fast. Then the rest of
 the parser will greatly benefit from this underlying speed.

 *Otherwise we could have just as well add the lexer stage as simple rules to
 the grammar that treats all of codepoints as terminals.

 ... which is exactly what parsing expression grammars (and other
 scannerless parsers) do.

 AFAICT, one interesting consequence is the ability to have composition
 of grammars, which I sure have with Pegged. But maybe it's not related
 that much, that's not something I stopped to think about.
 In any case, grammar composition is something I've learnt to like quite a lot.

If you do such a thing when designing a language, I guarantee that you will
find 
it impossible to resist having custom "tokens" in various parts of the grammar. 
This will make it impossible to write tools that only need to tokens, such as 
those syntax highlighting editors that really are token highlighters.

Jan 27 2013

Philippe Sigaud <philippe.sigaud gmail.com> writes:

On Sun, Jan 27, 2013 at 8:48 PM, Walter Bright
<newshound2 digitalmars.com> wrote:
 On 1/27/2013 2:17 AM, Philippe Sigaud wrote:
 Walter seems to think if a lexer is not able to vomit thousands
 of tokens a seconds, then it's not good.


 Speed is critical for a lexer.

Something I never tought about: given a 10k lines module, how many
tokens does that represent, roughly. Ten times as much?

 This means, for example, you'll need to squeeze pretty much all storage
 allocation out of it. A lexer that does an allocation per token will is not
 going to do very well at all.

How does one do that? Honest question, I'm not really concerned by
extreme speed, most of the time, and have lots to learn in this area.

Jan 27 2013

Jacob Carlborg <doob me.com> writes:

On 2013-01-27 21:15, Philippe Sigaud wrote:

 This means, for example, you'll need to squeeze pretty much all storage
 allocation out of it. A lexer that does an allocation per token will is not
 going to do very well at all.

 How does one do that? Honest question, I'm not really concerned by
 extreme speed, most of the time, and have lots to learn in this area.

If we're talking about dynamic allocation you can make sure you're just 
using value types. A token could look like:

struct Token
{
     TokenKind kind;
     string value;
     uint index;
}

For the "value" you could just slice the buffer. But I think this will 
prevent the whole buffer from being collected.

-- 
/Jacob Carlborg

Jan 27 2013

FG <home fgda.pl> writes:

On 2013-01-28 08:47, Jacob Carlborg wrote:
 If we're talking about dynamic allocation you can make sure you're just using
 value types. A token could look like:

 struct Token
 {
      TokenKind kind;
      string value;
      uint index;
 }

 For the "value" you could just slice the buffer. But I think this will prevent
 the whole buffer from being collected.


I was also thinking about using slices to limit string allocation. So far the 
combined size of source files in D projects is so small, that it wouldn't hurt 
to mmap the files and slice them. It is possible though that someone would 
create a huge file, even if only just to see this program crash. :)

In that case something else may be useful. Allocate special arrays for holding 
value strings, for example 256 kB per array. Token.value will be a slice of
such 
array. Additionally have a lookup Trie to help reuse repeating values - if a 
string is already in an array, the Trie leaf will store its position (slice)
and 
the token will only have to copy that info. If the lookup doesn't turn up the 
string, the string will be added to the end of the array using Appender or, if 
it doesn't fit, a new array will be created.

Jan 28 2013

Walter Bright <newshound2 digitalmars.com> writes:

On 1/27/2013 12:15 PM, Philippe Sigaud wrote:
 On Sun, Jan 27, 2013 at 8:48 PM, Walter Bright
 <newshound2 digitalmars.com> wrote:
 On 1/27/2013 2:17 AM, Philippe Sigaud wrote:
 Walter seems to think if a lexer is not able to vomit thousands
 of tokens a seconds, then it's not good.


 Speed is critical for a lexer.

 Something I never tought about: given a 10k lines module, how many
 tokens does that represent, roughly. Ten times as much?

I don't know.

 This means, for example, you'll need to squeeze pretty much all storage
 allocation out of it. A lexer that does an allocation per token will is not
 going to do very well at all.

 How does one do that? Honest question, I'm not really concerned by
 extreme speed, most of the time, and have lots to learn in this area.

Just study the dmd lexer.c source code.

Jan 28 2013

"Mehrdad" <wfunction hotmail.com> writes:

On Sunday, 27 January 2013 at 20:15:33 UTC, Philippe Sigaud wrote:

 This means, for example, you'll need to squeeze pretty much 
 all storage allocation out of it. A lexer that does an 
 allocation per token will is not going to do very well at all.

 How does one do that? Honest question, I'm not really concerned 
 by extreme speed, most of the time, and have lots to learn in 
 this area.


Here's my (VERY) simple NFA-based "regex"-based lexer, which 
performs **no** heap allocations (unless the pattern is very 
complicated):

https://gist.github.com/b10ae22ab822c87467a3

Yes, the code is ugly and the capabilities are far too basic, but 
it was good enough for what I needed.

The point it illustrates is how you can lex without any heap 
allocations.

Jan 28 2013

Philippe Sigaud <philippe.sigaud gmail.com> writes:

On Mon, Jan 28, 2013 at 12:43 PM, Mehrdad <wfunction hotmail.com> wrote:
 https://gist.github.com/b10ae22ab822c87467a3

This link does not work for me :(

Jan 28 2013

"Mehrdad" <wfunction hotmail.com> writes:

On Monday, 28 January 2013 at 20:35:51 UTC, Philippe Sigaud wrote:
 On Mon, Jan 28, 2013 at 12:43 PM, Mehrdad 
 <wfunction hotmail.com> wrote:
 https://gist.github.com/b10ae22ab822c87467a3

 This link does not work for me :(


Yeah sorry, I found a bug in it, so I took it off and fixed it 
(... or so I think).
I wouldn't be surprised if it's still buggy.

It's more of a proof of concept than anything... it definitely 
has room for improvement, but the speed is still decent.

Here's another link (grab it before it's gone!):

https://gist.github.com/addfe830acf9785d01ef

Jan 28 2013

Philippe Sigaud <philippe.sigaud gmail.com> writes:

 This link does not work for me :(



 Yeah sorry, I found a bug in it, so I took it off and fixed it (... or so I
 think).
 I wouldn't be surprised if it's still buggy.

 It's more of a proof of concept than anything... it definitely has room for
 improvement, but the speed is still decent.

 Here's another link (grab it before it's gone!):

 https://gist.github.com/addfe830acf9785d01ef

Grabby grab!

Owww, C++! My eyes, they're melting! My hairs, on fire! Why did you
forfeit your immortal soul?

Anyway, thanks. I'll look at it more closely. It seems to be the kind
of C++ I can still read *and* understand.

Jan 28 2013

"Mehrdad" <wfunction hotmail.com> writes:

On Monday, 28 January 2013 at 22:00:30 UTC, Philippe Sigaud wrote:
 Owww, C++! My eyes, they're melting! My hairs, on fire! Why did 
 you forfeit your immortal soul?


lol. Mainly because I got tired of running into compiler bugs, to 
be honest. :|


 Anyway, thanks. I'll look at it more closely. It seems to be 
 the kind of C++ I can still read *and* understand.


Yeah, my goal was to make it mostly C-compatible, so I kept the 
C++-iness to a minimum.

It's more like "C"++  rather than "C++", I think. :P

Jan 28 2013

Paulo Pinto <pjmlp progtools.org> writes:

Am 27.01.2013 20:48, schrieb Walter Bright:
 On 1/27/2013 2:17 AM, Philippe Sigaud wrote:
 Walter seems to think if a lexer is not able to vomit thousands
 of tokens a seconds, then it's not good.

 Speed is critical for a lexer.

 This means, for example, you'll need to squeeze pretty much all storage
 allocation out of it. A lexer that does an allocation per token will is
 not going to do very well at all.

I second that.

Jan 27 2013

"Mehrdad" <wfunction hotmail.com> writes:

On Sunday, 27 January 2013 at 19:48:23 UTC, Walter Bright wrote:
 On 1/27/2013 2:17 AM, Philippe Sigaud wrote:
 Walter seems to think if a lexer is not able to vomit thousands
 of tokens a seconds, then it's not good.

 Speed is critical for a lexer.

 This means, for example, you'll need to squeeze pretty much all 
 storage allocation out of it. A lexer that does an allocation 
 per token will is not going to do very well at all.


Semi-unrelated question -- how would you benchmark a _parser_?

Is there any way to get a _number_ as an answer, or is comparing 
against a rival parser the only way of benchmarking a parser?

Jan 28 2013

Jacob Carlborg <doob me.com> writes:

On 2013-01-28 10:03, Mehrdad wrote:

 Semi-unrelated question -- how would you benchmark a _parser_?

 Is there any way to get a _number_ as an answer, or is comparing against
 a rival parser the only way of benchmarking a parser?

It should always be possible to get a number, but it would be hard to 
determine if that's a good number or not without comparing with anything 
else.

-- 
/Jacob Carlborg

Jan 28 2013

"Mehrdad" <wfunction hotmail.com> writes:

On Monday, 28 January 2013 at 09:07:15 UTC, Jacob Carlborg wrote:
 On 2013-01-28 10:03, Mehrdad wrote:

 Semi-unrelated question -- how would you benchmark a _parser_?

 Is there any way to get a _number_ as an answer, or is 
 comparing against
 a rival parser the only way of benchmarking a parser?

 It should always be possible to get a number, but it would be 
 hard to determine if that's a good number or not without 
 comparing with anything else.


Sorry I think my question was vague -- what I meant was, how do 
you measure it, i.e. in what units?

Lines per second? Tokens per second? Syntax nodes per second? etc.

Jan 28 2013

Jacob Carlborg <doob me.com> writes:

On 2013-01-28 10:09, Mehrdad wrote:

 Sorry I think my question was vague -- what I meant was, how do you
 measure it, i.e. in what units?

 Lines per second? Tokens per second? Syntax nodes per second? etc.

The easy way out would be to just measure how long time it takes to 
parse a given piece of code and compare with some other parse using the 
same code.

-- 
/Jacob Carlborg

Jan 28 2013

"Mehrdad" <wfunction hotmail.com> writes:

On Monday, 28 January 2013 at 09:21:51 UTC, Jacob Carlborg wrote:
 On 2013-01-28 10:09, Mehrdad wrote:

 Sorry I think my question was vague -- what I meant was, how 
 do you
 measure it, i.e. in what units?

 Lines per second? Tokens per second? Syntax nodes per second? 
 etc.

 The easy way out would be to just measure how long time it 
 takes to parse a given piece of code and compare with some 
 other parse using the same code.


Yeah that's exactly what I meant by not comparing xD

I'm wondering if there's any way to make it independent of the 
code/grammar

Jan 28 2013

Walter Bright <newshound2 digitalmars.com> writes:

On 1/28/2013 1:30 AM, Mehrdad wrote:
 I'm wondering if there's any way to make it independent of the code/grammar

Just run under a profiler, then fix the hot spots.

Jan 28 2013

dennis luehring <dl.soluz gmx.net> writes:

Am 28.01.2013 11:13, schrieb Walter Bright:
 On 1/28/2013 1:30 AM, Mehrdad wrote:
 I'm wondering if there's any way to make it independent of the code/grammar

 Just run under a profiler, then fix the hot spots.

mehrdad speaks about benchmarking not optimization :)

Jan 28 2013

Walter Bright <newshound2 digitalmars.com> writes:

On 1/28/2013 2:16 AM, dennis luehring wrote:
 Am 28.01.2013 11:13, schrieb Walter Bright:
 On 1/28/2013 1:30 AM, Mehrdad wrote:
 I'm wondering if there's any way to make it independent of the code/grammar

 Just run under a profiler, then fix the hot spots.

 mehrdad speaks about benchmarking not optimization :)

I know. But profiling is how you make it fast. Lexers should be as fast as 
possible, not merely faster than the competition.

Jan 28 2013

"Mehrdad" <wfunction hotmail.com> writes:

On Monday, 28 January 2013 at 11:01:49 UTC, Walter Bright wrote:
 On 1/28/2013 2:16 AM, dennis luehring wrote:
 Am 28.01.2013 11:13, schrieb Walter Bright:
 On 1/28/2013 1:30 AM, Mehrdad wrote:
 I'm wondering if there's any way to make it independent of 
 the code/grammar

 Just run under a profiler, then fix the hot spots.

 mehrdad speaks about benchmarking not optimization :)

 I know. But profiling is how you make it fast. Lexers should be 
 as fast as possible, not merely faster than the competition.



I was talking about parsers actually :| on a second thought maybe 
I shouldn't have asked that on a thread about lexing...

Jan 28 2013

Walter Bright <newshound2 digitalmars.com> writes:

On 1/28/2013 3:34 AM, Mehrdad wrote:
 On Monday, 28 January 2013 at 11:01:49 UTC, Walter Bright wrote:
 On 1/28/2013 2:16 AM, dennis luehring wrote:
 Am 28.01.2013 11:13, schrieb Walter Bright:
 On 1/28/2013 1:30 AM, Mehrdad wrote:
 I'm wondering if there's any way to make it independent of the code/grammar

 Just run under a profiler, then fix the hot spots.

 mehrdad speaks about benchmarking not optimization :)

 I know. But profiling is how you make it fast. Lexers should be as fast as
 possible, not merely faster than the competition.



 I was talking about parsers actually :| on a second thought maybe I shouldn't
 have asked that on a thread about lexing...

Same comment applies.

Jan 28 2013

Johannes Pfau <nospam example.com> writes:

Am Sun, 27 Jan 2013 11:48:23 -0800
schrieb Walter Bright <newshound2 digitalmars.com>:

 On 1/27/2013 2:17 AM, Philippe Sigaud wrote:
 Walter seems to think if a lexer is not able to vomit thousands
 of tokens a seconds, then it's not good.

 
 Speed is critical for a lexer.
 
 This means, for example, you'll need to squeeze pretty much all
 storage allocation out of it.

But to be fair that doesn't fit ranges very well. If you don't want to
do any allocation but still keep identifiers etc in memory this
basically means you have to keep the whole source in memory and this is
conceptually an array and not a range.


But you can of course write a lexer which accepts buffered ranges and
does some allocation for those and is special cased for arrays to not
allocate at all. (Unbuffered ranges should be supported using a
generic BufferedRange)

Jan 28 2013

Dmitry Olshansky <dmitry.olsh gmail.com> writes:

28-Jan-2013 15:48, Johannes Pfau пишет:
 Am Sun, 27 Jan 2013 11:48:23 -0800
 schrieb Walter Bright <newshound2 digitalmars.com>:

 On 1/27/2013 2:17 AM, Philippe Sigaud wrote:
 Walter seems to think if a lexer is not able to vomit thousands
 of tokens a seconds, then it's not good.

 Speed is critical for a lexer.

 This means, for example, you'll need to squeeze pretty much all
 storage allocation out of it.

 But to be fair that doesn't fit ranges very well. If you don't want to
 do any allocation but still keep identifiers etc in memory this
 basically means you have to keep the whole source in memory and this is
 conceptually an array and not a range.

Not the whole source but to construct a table of all identifiers. The 
source is awfully redundant because of repeated identifiers, spaces, 
comments and what not. The set of unique identifiers is rather small.

I think the best course of action is to just provide a hook to trigger 
on every identifier encountered. That could be as discussed earlier a 
delegate.

 But you can of course write a lexer which accepts buffered ranges and
 does some allocation for those and is special cased for arrays to not
 allocate at all. (Unbuffered ranges should be supported using a
 generic BufferedRange)



-- 
Dmitry Olshansky

Jan 28 2013

Timon Gehr <timon.gehr gmx.ch> writes:

On 01/28/2013 12:59 PM, Dmitry Olshansky wrote:
 28-Jan-2013 15:48, Johannes Pfau пишет:
 ...

 But to be fair that doesn't fit ranges very well. If you don't want to
 do any allocation but still keep identifiers etc in memory this
 basically means you have to keep the whole source in memory and this is
 conceptually an array and not a range.

 Not the whole source but to construct a table of all identifiers. The
 source is awfully redundant because of repeated identifiers, spaces,
 comments and what not. The set of unique identifiers is rather small.

Source code is usually small. (Even std.datetime has 'only' 1.6 MB.) My 
own lexer-parser combination slices directly into the original source 
code, for every token, in order to remember the exact source location, 
and the last time I have measured, it ran faster than DMD's. I keep the 
source around for error reporting anyway:

tt.d:27:5: error: no member 'n' for type 'A'
     a.n=2;
     ^~~

Since the tokens point directly into the source code, it is not 
necessary to construct any other data structures in order to allow fast 
retrieval of the appropriate source code line.

But it's clear that a general-purpose library might not want to impose 
this restriction on storage upon it's clients. I think it is somewhat 
helpful for speed though. The other thing I do is buffering tokens in a 
contiguous ring buffer that grows if a lot of lookahead is requested.

 I think the best course of action is to just provide a hook to trigger
 on every identifier encountered. That could be as discussed earlier a
 delegate.

 ...

Maybe. I map identifiers to unique id's later, in the identifier AST 
node constructor, though.

Jan 28 2013

Walter Bright <newshound2 digitalmars.com> writes:

On 1/28/2013 1:04 PM, Timon Gehr wrote:
 Maybe. I map identifiers to unique id's later,

This makes symbol lookups very fast.

Jan 28 2013

Dmitry Olshansky <dmitry.olsh gmail.com> writes:

29-Jan-2013 01:04, Timon Gehr пишет:
 On 01/28/2013 12:59 PM, Dmitry Olshansky wrote:
 28-Jan-2013 15:48, Johannes Pfau пишет:
 ...

 But to be fair that doesn't fit ranges very well. If you don't want to
 do any allocation but still keep identifiers etc in memory this
 basically means you have to keep the whole source in memory and this is
 conceptually an array and not a range.

 Not the whole source but to construct a table of all identifiers. The
 source is awfully redundant because of repeated identifiers, spaces,
 comments and what not. The set of unique identifiers is rather small.

 Source code is usually small. (Even std.datetime has 'only' 1.6 MB.) My
 own lexer-parser combination slices directly into the original source
 code, for every token, in order to remember the exact source location,
 and the last time I have measured, it ran faster than DMD's. I keep the
 source around for error reporting anyway:

 tt.d:27:5: error: no member 'n' for type 'A'
      a.n=2;
      ^~~

 Since the tokens point directly into the source code, it is not
 necessary to construct any other data structures in order to allow fast
 retrieval of the appropriate source code line.

 But it's clear that a general-purpose library might not want to impose
 this restriction on storage upon it's clients. I think it is somewhat
 helpful for speed though. The other thing I do is buffering tokens in a
 contiguous ring buffer that grows if a lot of lookahead is requested.

 I think the best course of action is to just provide a hook to trigger
 on every identifier encountered. That could be as discussed earlier a
 delegate.

 ...

 Maybe. I map identifiers to unique id's later, in the identifier AST
 node constructor, though.


In allocation scheme I proposed that ID could be a 32bit offset into the 
unique identifiers chunk. Identifiers themselves then would have to be 
stored with sentinels (e.g. zeros) at end like in C. Then the 'map' 
process comes for free.

The only overhead is actually copying the bytes to this buffer but it 
only happens once per unique identifier and in exchange you get to lex 
anything not only 'the whole module in one memory block' kind of thing.
On the plus side it's also more cache friendly.

Overall I think it's a good trade-off, but I never had the time to 
exercise it.

-- 
Dmitry Olshansky

Jan 30 2013

Walter Bright <newshound2 digitalmars.com> writes:

On 1/30/2013 8:44 AM, Dmitry Olshansky wrote:
 In allocation scheme I proposed that ID could be a 32bit offset into the unique
 identifiers chunk.

That only works if you know in advance the max size the chunk can ever be and 
preallocate it. Otherwise, you have no guarantee that the next allocated chunk 
will be within 32 bits of address of the previous chunks.

Feb 01 2013

Dmitry Olshansky <dmitry.olsh gmail.com> writes:

01-Feb-2013 15:05, Walter Bright пишет:
 On 1/30/2013 8:44 AM, Dmitry Olshansky wrote:
 In allocation scheme I proposed that ID could be a 32bit offset into
 the unique
 identifiers chunk.

 That only works if you know in advance the max size the chunk can ever
 be and preallocate it. Otherwise, you have no guarantee that the next
 allocated chunk will be within 32 bits of address of the previous chunks.

Well I supposed it's exactly one reallocatable block. Then token have an 
offset that doesn't care if the block was reallocated.

Or rather the reallocating just RESERVE virtual RAM for it (say 1G), and 
COMMIT it page by page when you need to grow it. Once lexing is done, 
shrink virtual region to the actual used size to free up address space 
(e.g. if we are on 32bits).

AS for 32bit limit that gives 4Gb maximum of the cumulative length of 
all unique identifier names is more then enough by any standard. I 
haven't seen a 4G codebase not to speak of identifiers alone that even 
if we count all the repetitions separately.

-- 
Dmitry Olshansky

Feb 01 2013

Walter Bright <newshound2 digitalmars.com> writes:

On 2/1/2013 3:22 AM, Dmitry Olshansky wrote:
 01-Feb-2013 15:05, Walter Bright пишет:
 On 1/30/2013 8:44 AM, Dmitry Olshansky wrote:
 In allocation scheme I proposed that ID could be a 32bit offset into
 the unique
 identifiers chunk.

 That only works if you know in advance the max size the chunk can ever
 be and preallocate it. Otherwise, you have no guarantee that the next
 allocated chunk will be within 32 bits of address of the previous chunks.

 Well I supposed it's exactly one reallocatable block. Then token have an offset
 that doesn't care if the block was reallocated.

 Or rather the reallocating just RESERVE virtual RAM for it (say 1G), and COMMIT
 it page by page when you need to grow it. Once lexing is done, shrink virtual
 region to the actual used size to free up address space (e.g. if we are on
32bits).

 AS for 32bit limit that gives 4Gb maximum of the cumulative length of all
unique
 identifier names is more then enough by any standard. I haven't seen a 4G
 codebase not to speak of identifiers alone that even if we count all the
 repetitions separately.

Your technique can work, provided the number of identifiers isn't large enough 
that memory fragmentation will prevent being able to reallocate the buffer to a 
larger size.

Feb 01 2013

Dmitry Olshansky <dmitry.olsh gmail.com> writes:

02-Feb-2013 01:10, Walter Bright пишет:
 On 2/1/2013 3:22 AM, Dmitry Olshansky wrote:
 01-Feb-2013 15:05, Walter Bright пишет:
 On 1/30/2013 8:44 AM, Dmitry Olshansky wrote:
 In allocation scheme I proposed that ID could be a 32bit offset into
 the unique
 identifiers chunk.

 That only works if you know in advance the max size the chunk can ever
 be and preallocate it. Otherwise, you have no guarantee that the next
 allocated chunk will be within 32 bits of address of the previous
 chunks.

 Well I supposed it's exactly one reallocatable block. Then token have
 an offset
 that doesn't care if the block was reallocated.

 Or rather the reallocating just RESERVE virtual RAM for it (say 1G),
 and COMMIT
 it page by page when you need to grow it. Once lexing is done, shrink
 virtual
 region to the actual used size to free up address space (e.g. if we
 are on 32bits).

 AS for 32bit limit that gives 4Gb maximum of the cumulative length of
 all unique
 identifier names is more then enough by any standard. I haven't seen a 4G
 codebase not to speak of identifiers alone that even if we count all the
 repetitions separately.

 Your technique can work, provided the number of identifiers isn't large
 enough that memory fragmentation will prevent being able to reallocate
 the buffer to a larger size.

I think I failed to describe the virtual memory trick or you just 
ignored it? But regardless I have to correct myself on both the amount 
to reserve and the fact that it's actually not virtual memory alone but 
a virtue of MMU.

See below a full program in D see it the trick in action, currently 
Win32 only, as I can't recall the right POSIX calls offhand. It's a 
benchmark against built-in array/appender - the heap stuff.

The speed is simillar, with my quick and dirty stuff being much faster 
iff arrays don't allocate all of ram beforehand.

The steps to snatch the power of not having any reallocations:

1. Reserve a contiguous *address space* range. Not even a bit of virtual 
memory is reserved/commited/touched at this moment. This just creates a 
TLB entry AFAICT.

(for lexer this address space range would have length equal to the sum 
of all files length as Tove points out, this is the first correction)

2. Any time there is no memory to put more stuff into (as is initially), 
commit the next few pages. At this moment virtual ram is committed but 
not allocated.

3. [this I think we all know] The moment memory location in the 
committed part of the range is touched the page for it is allocated (and 
on win32 initially contains zeros automatically). Only at this point 
Virtual memory is allocated

So you see where my 2nd correction goes. In essence you get: 0 
reallocations and no more then 1 page of virtual memory wasted in all 
cases. Sweet deal ;)

Then there is an optional step if you think that too much of address 
space is used (makes sense only on 32-bits if at all): copy the 
resulting block of actual data elsewhere in the heap and release the 
whole address range.

The code:
(the same as github gist: https://gist.github.com/4699388 )

///Written in the D programming language

// Bench built-in array append, std.array.Appender
// and custom virtual-memory aware VirtualArray in terms of
// memory usage and the time taken
// to append up to X megabytes using different chunk sizes:
// 16 bytes, 64bytes, 256 bytes and 16Kb at a time.


// pass this version to take insert readln's
// at interesting points and investigate RAM usage
// make sure to enable relevant columns in task manager process details:
// committed size and private working set
// version = diagnose;
import std.c.windows.windows;

import std.exception, std.datetime, std.algorithm, std.range, std.array;

auto roundUpToPowerOf2(size_t var, size_t pow2)
{
      return (var + pow2-1) & ~(pow2-1);
}

struct VirtualArray
{
private:
     ubyte* pointer;     // to the beginning of the reserved memory
     size_t _length;     // length of data
     size_t commited;   // committed memory so far
     size_t reserved;   // total possible maximum to grow to
     size_t pageSize;   // page size, this could be global
     //size_t pageSize;

     // commit some more memory from the reserved range
     void extend(size_t size)
     {
         size = roundUpToPowerOf2(size, pageSize);
         // this will throw once run out of reserved pages
         enforce(VirtualAlloc(pointer+commited, size,
             MEM_COMMIT, PAGE_READWRITE));
         commited += size;
     }
public:
     this(size_t maxCapacity)
     {
         SYSTEM_INFO sysinfo;
         GetSystemInfo(&sysinfo);
         pageSize = sysinfo.dwPageSize;
         // the page size is a power of 2
         // round the capacity up to a multiple of page size
         maxCapacity = roundUpToPowerOf2(maxCapacity, pageSize);
         pointer = cast(ubyte*)enforce(VirtualAlloc(null, maxCapacity,
             MEM_RESERVE, PAGE_READWRITE));
         commited = 0;
         reserved = maxCapacity;
         _length = 0;
     }

     // bare minimum primitives to run benchmark
     ref ubyte opIndex(size_t idx)
     {
         assert(idx < length);
         return pointer[idx];
     }

     // ditto
     ubyte[] opSlice(size_t from, size_t to)
     {
         assert(from < to && to <= length);
         return pointer[from..to];
     }

     // ditto
      property size_t length(){ return _length; }

     void opOpAssign(string op:"~")(ubyte[] data)
     {
         size_t end = length + data.length;
         while(end > commited)
             extend(end-commited);
         pointer[length..end] = data[];
         _length = end;
     }

     ~this()
     {
         // should not throw, struct is not copyable
         enforce(VirtualFree(pointer, 0, MEM_RELEASE));
     }
}

import std.stdio;

void timeIt(string prefix, void delegate() run)
{
     StopWatch sw;
     sw.start();
     run();
     sw.stop();
     writefln(prefix~ " %4s ms", sw.peek().msecs);
}

bool verify(Arr)(ubyte[] pattern, ref Arr arr)
{
     for(size_t i=0; i<arr.length; i+= pattern.length)
     {

         if(arr[i..i+pattern.length] != pattern[])
         {
             writeln(arr[i .. i+pattern.length], " vs ", pattern);
             return false;
         }
     }
     return true;
}

void main()
{
     size_t totalSize = 120*2^^20;
     auto chunks = [
         iota(0, 16).map!(x=>cast(ubyte)x).array,
         iota(0, 64).map!(x=>cast(ubyte)x).array,
         iota(0, 256).map!(x=>cast(ubyte)x).array,
         iota(0, 2^^10).map!(x=>cast(ubyte)x).array
     ];
     auto titles = [ " 16b", " 64b", "256b", " 16K" ];
     import core.memory;
     GC.disable(); // to prevent measuring a collection by mistake
     foreach(n, chunk; chunks)
     {
         // reserve memory anew on each iteration
         // I was able to easily go up to 1.5 G on 32bits
         // note: there are no reallocations here at all
         version(diagnose)
         if(n == 0)
         {
             writeln("Before reserving address space");
             readln();
         }
         auto virtArr = VirtualArray(totalSize);
         version(diagnose)
         if(n == 0)
         {
             writeln("Reserved address space");
             readln();
         }
         timeIt(titles[n]~" virtual", (){
             foreach(i; 0..totalSize/chunk.length)
             {
                 virtArr ~= chunk;
             }
         });
         version(diagnose)
         if(n == 0)
         {
             writeln("Commited all of virtual memory");
             readln(); //uncomment to take a pause to investigate RAM usage
         }
         // time to verify is the same for all with -O -inline
         enforce(verify(chunk, virtArr));
     }
     writeln("============");
     foreach(n, chunk; chunks)
     {
         ubyte[] arr;
         // the GC is out of the game as soon as 512 Mbytes on 32bits
         // as I hit OutOfMemory on a 2nd iteration
         // try to help poor runtime
         // without reserve it crashes and burns at about 120 M
         // with reserve it's on par with chunk >= 256
         // but it _commits_ all memory beforehand !
         version(diagnose)
         if(n == 0)
         {
             writeln("Before array.reserve()");
             readln();
         }
         arr.reserve(totalSize);
         version(diagnose)
         if(n == 0)
         {
             writeln("After array.reserve()");
             readln();
         }
         timeIt(titles[n]~" array ", (){
             foreach(i; 0..totalSize/chunk.length)
             {
                 arr ~= chunk;
             }
         });
         version(diagnose)
         if(n == 0)
         {
             writeln("After all data touched");
             readln();
         }
         // time to verify is the same for all with -O -inline
         enforce(verify(chunk, arr));
	// sadly need to not run out of memory on 32 bits
         delete arr;
     }
     writeln("============");
     foreach(n, chunk; chunks)
     {
         {
             // appender is supposed to be faster then array on appends
             // but it's actually slower starting at 256 byte chunks
             // and esp. so with 16Kb chunks
             auto app = appender!(ubyte[])();
             timeIt(titles[n]~" appender", (){
                 foreach(i; 0..totalSize/chunk.length)
                 {
                     app.put(chunk);
                 }
             });
             auto appArr = app.data();
             enforce(verify(chunk, appArr));
         }
         GC.collect();
     }
}

-- 
Dmitry Olshansky

Feb 02 2013

Walter Bright <newshound2 digitalmars.com> writes:

On 2/2/2013 1:54 PM, Dmitry Olshansky wrote:
 I think I failed to describe the virtual memory trick or you just ignored it?

You can't just reserve 1Gb of address space in a 32 bit system. (I know about 
reserving address space vs committing memory.)

Feb 02 2013

Dmitry Olshansky <dmitry.olsh gmail.com> writes:

03-Feb-2013 05:30, Walter Bright пишет:
 On 2/2/2013 1:54 PM, Dmitry Olshansky wrote:
 I think I failed to describe the virtual memory trick or you just
 ignored it?

 You can't just reserve 1Gb of address space in a 32 bit system. (I know
 about reserving address space vs committing memory.)

OK, sorry you must knew this stuff throughly.  1Gb is not the point I 
feel like I shouldn't been listing 1Gb at all as we need much less.
It's about 21 Mb of D source code in full dmd test suite + druntime + 
phobos, and only 4M in vibe.d including all examples. Reserve this amount.

It's the same amount of memory as in read them all and slice them all 
approach that is said to be so fast and practical. And yet we don't 
commit this RAM at all most of the time.

Thinking more of 32bit systems that are so tight on address space - it's 
2Gb for user-mode. Precisely because of that a signed 32 bit offset is 
enough to address RAM if we store it in multiple chunks like you said. 
The fact that there is only 2Gb of user-space actually sort of helps us 
as we surely can reach the other chunks from a single base.

In case of 3Gb (some linux-es and/or large address aware win32) we just 
need to reserve the first range somewhere in the middle, as we do it at 
startup this should be easy with a few probes.

Also after the lexing you can always free the address space and 
reallocate-compact it as the last step like I said earlier.


-- 
Dmitry Olshansky

Feb 03 2013

"Tove" <tove fransson.se> writes:

On Friday, 1 February 2013 at 11:06:02 UTC, Walter Bright wrote:
 On 1/30/2013 8:44 AM, Dmitry Olshansky wrote:
 In allocation scheme I proposed that ID could be a 32bit 
 offset into the unique
 identifiers chunk.

 That only works if you know in advance the max size the chunk 
 can ever be and preallocate it. Otherwise, you have no 
 guarantee that the next allocated chunk will be within 32 bits 
 of address of the previous chunks.

This can easily be archived by preallocating file.size bytes... 
it will be x orders of magnitude too much, but it doesn't matter, 
as in the end only the cache locality matters.

Feb 01 2013

"deadalnix" <deadalnix gmail.com> writes:

Very happy to see that !

Some remarks :
  - Many parameters should be compile time parameters. Instead of 
runtime.
  - I'm not a big fan of byToken name, but let's see what others 
think of it.
  - I'm not sure this is the role of the lexer to process 
__IDENTIFIER__ special stuffs.
  - You need to provide a way to specify haw textual 
representation of the token (ie value) is set. The best way to do 
it IMO is an alias parameter that return a string when called 
with a string (then the user can choose to keep the value from 
original string, create a copy, always get the same copy with the 
same string, etc . . .).
  - Ideally, the location format should be configurable.
  - You must return at least a forward range, and not an input 
range, otherwize a lexer cannot lookahead.
  - I'm not sure about making TokenRange a class.

As a more positive note, I was just in need of something like 
this today and wondered if such project was going to finally 
happen. This is a great step forward ! Some stuff still need to 
be polished, but who get everything right the first time ?

Jan 27 2013

"deadalnix" <deadalnix gmail.com> writes:

On Sunday, 27 January 2013 at 10:32:39 UTC, deadalnix wrote:
 Very happy to see that !

 Some remarks :
  - Many parameters should be compile time parameters. Instead 
 of runtime.
  - I'm not a big fan of byToken name, but let's see what others 
 think of it.
  - I'm not sure this is the role of the lexer to process 
 __IDENTIFIER__ special stuffs.
  - You need to provide a way to specify haw textual 
 representation of the token (ie value) is set. The best way to 
 do it IMO is an alias parameter that return a string when 
 called with a string (then the user can choose to keep the 
 value from original string, create a copy, always get the same 
 copy with the same string, etc . . .).
  - Ideally, the location format should be configurable.
  - You must return at least a forward range, and not an input 
 range, otherwize a lexer cannot lookahead.
  - I'm not sure about making TokenRange a class.

 As a more positive note, I was just in need of something like 
 this today and wondered if such project was going to finally 
 happen. This is a great step forward ! Some stuff still need to 
 be polished, but who get everything right the first time ?

And the famous Job's « one last thing » : I'm not a big fan of 
having OPERATORS_BEGIN of the same type as regular token types. 
Now they make valid token. Why not provide a set of function like 
isOperator ?

Jan 27 2013

"Brian Schott" <briancschott gmail.com> writes:

On Sunday, 27 January 2013 at 10:32:39 UTC, deadalnix wrote:
 Very happy to see that !

 Some remarks :
  - Many parameters should be compile time parameters. Instead 
 of runtime.

I decided not to do this because the lexer actually calls itself 
while parsing token strings. If they were compile-time 
parameters, the compiler would likely generate a lot more code.

  - I'm not a big fan of byToken name, but let's see what others 
 think of it.

Chosen for consistency with the various functions in std.stdio, 
but now that you point his out, it's not very consistent with 
std.algorithm or std.range.

  - I'm not sure this is the role of the lexer to process 
 __IDENTIFIER__ special stuffs.

According to http://dlang.org/lex#specialtokens this is the 
correct behavior.

  - You need to provide a way to specify haw textual 
 representation of the token (ie value) is set. The best way to 
 do it IMO is an alias parameter that return a string when 
 called with a string (then the user can choose to keep the 
 value from original string, create a copy, always get the same 
 copy with the same string, etc . . .).

The lexer does not operate on slices of its input. It would be 
possible to special-case for this in the future.

  - Ideally, the location format should be configurable.
  - You must return at least a forward range, and not an input 
 range, otherwize a lexer cannot lookahead.

It's easy to wrap this range inside of another that does 
buffering for lookahead. 
https://github.com/Hackerpilot/Dscanner/blob/range-based-lexer/circularbuffer.d

 And the famous Job's « one last thing » : I'm not a big fan of 
 having OPERATORS_BEGIN of the same type as regular token types. 
 Now they make valid token. Why not provide a set of function 
 like isOperator ?

This eliminates possible uses of the case range statement. It may 
be the case (ha ha) that nobody cares and would rather have those 
functions. I'd be fine with changing that.

Jan 27 2013

"deadalnix" <deadalnix gmail.com> writes:

On Sunday, 27 January 2013 at 10:55:49 UTC, Brian Schott wrote:
 On Sunday, 27 January 2013 at 10:32:39 UTC, deadalnix wrote:
 Very happy to see that !

 Some remarks :
 - Many parameters should be compile time parameters. Instead 
 of runtime.

 I decided not to do this because the lexer actually calls 
 itself while parsing token strings. If they were compile-time 
 parameters, the compiler would likely generate a lot more code.

Have you measured ?

 - You need to provide a way to specify haw textual 
 representation of the token (ie value) is set. The best way to 
 do it IMO is an alias parameter that return a string when 
 called with a string (then the user can choose to keep the 
 value from original string, create a copy, always get the same 
 copy with the same string, etc . . .).

 The lexer does not operate on slices of its input. It would be 
 possible to special-case for this in the future.

Even without special casing for slice of the input, generating a 
new string or reusing an existing one is already an important 
thing. Depending of the processing that come after, it may be of 
great importance.

Jan 27 2013

"Dicebot" <m.strashun gmail.com> writes:

In one of last discussions about standard lexer/parser I remember 
quite a neat proposal - take a delegate for error handling and 
provide two out of the box ( one, that throws exception and one 
that returns Error token)

Jan 27 2013

Philippe Sigaud <philippe.sigaud gmail.com> writes:

Let's add another question:

what about treating q{ } token strings as... well, a list of tokens?
IDE would like this: no need to reparse the string, the token are
there directly.

Jan 27 2013

Jacob Carlborg <doob me.com> writes:

On 2013-01-27 12:51, Philippe Sigaud wrote:
 Let's add another question:

 what about treating q{ } token strings as... well, a list of tokens?
 IDE would like this: no need to reparse the string, the token are
 there directly.

Perhaps an option for this.

-- 
/Jacob Carlborg

Jan 27 2013

Jacob Carlborg <doob me.com> writes:

On 2013-01-27 10:51, Brian Schott wrote:
 I'm writing a D lexer for possible inclusion in Phobos.

 DDOC:
 http://hackerpilot.github.com/experimental/std_lexer/phobos/lexer.html
 Code:
 https://github.com/Hackerpilot/Dscanner/blob/range-based-lexer/std/d/lexer.d


 It's currently able to correctly syntax highlight all of Phobos, but
 does a fairly bad job at rejecting or notifying users/callers about
 invalid input.

 I'd like to hear arguments on the various ways to handle errors in the
 lexer. In a compiler it would be useful to throw an exception on finding
 something like a string literal that doesn't stop before EOF, but a text
 editor or IDE would probably want to be a bit more lenient. Maybe having
 it run-time (or compile-time configurable) like std.csv would be the
 best option here.

 I'm interested in ideas on the API design and other high-level issues at
 the moment. I don't consider this ready for inclusion. (The current
 module being reviewed for inclusion in Phobos is the new std.uni.)

How about changing the type of TokenType to ushort, if all members fit. 
Just to minimize the size of a token.

-- 
/Jacob Carlborg

Jan 27 2013

Walter Bright <newshound2 digitalmars.com> writes:

On 1/27/2013 1:51 AM, Brian Schott wrote:
 I'm interested in ideas on the API design and other high-level issues at the
 moment. I don't consider this ready for inclusion. (The current module being
 reviewed for inclusion in Phobos is the new std.uni.)

Just a quick comment: byToken() should not accept a filename. It's input should 
be via an InputRange, not a file.

Jan 27 2013

Dmitry Olshansky <dmitry.olsh gmail.com> writes:

27-Jan-2013 23:46, Walter Bright пишет:
 On 1/27/2013 1:51 AM, Brian Schott wrote:
 I'm interested in ideas on the API design and other high-level issues
 at the
 moment. I don't consider this ready for inclusion. (The current module
 being
 reviewed for inclusion in Phobos is the new std.uni.)

 Just a quick comment: byToken() should not accept a filename. It's input
 should be via an InputRange, not a file.

InputRange of char/wchar/dchar I guess as lexer should have an ability 
to avoid decoding.

-- 
Dmitry Olshansky

Jan 27 2013

Walter Bright <newshound2 digitalmars.com> writes:

On 1/27/2013 12:11 PM, Dmitry Olshansky wrote:
 27-Jan-2013 23:46, Walter Bright пишет:
 On 1/27/2013 1:51 AM, Brian Schott wrote:
 I'm interested in ideas on the API design and other high-level issues
 at the
 moment. I don't consider this ready for inclusion. (The current module
 being
 reviewed for inclusion in Phobos is the new std.uni.)

 Just a quick comment: byToken() should not accept a filename. It's input
 should be via an InputRange, not a file.

 InputRange of char/wchar/dchar I guess as lexer should have an ability to avoid
 decoding.

Just char would suffice. If a user needs wchar/dchar, they can put a map in 
between the InputRange and the lexer.

Jan 27 2013

"Brian Schott" <briancschott gmail.com> writes:

On Sunday, 27 January 2013 at 19:46:12 UTC, Walter Bright wrote:
 On 1/27/2013 1:51 AM, Brian Schott wrote:
 I'm interested in ideas on the API design and other high-level 
 issues at the
 moment. I don't consider this ready for inclusion. (The 
 current module being
 reviewed for inclusion in Phobos is the new std.uni.)

 Just a quick comment: byToken() should not accept a filename. 
 It's input should be via an InputRange, not a file.

The file name is accepted for eventual error reporting purposes. 
The actual input for the lexer is the parameter called "range".

Regarding the times that I posted, my point was that it's not 
slower than "dmd -c", nothing more.

Jan 27 2013

Timon Gehr <timon.gehr gmx.ch> writes:

On 01/27/2013 10:39 PM, Brian Schott wrote:
 ...

 Regarding the times that I posted, my point was that it's not slower
 than "dmd -c", nothing more.

Sure. The point you brought across, however, was that it is not 
significantly faster yet. :o)

Jan 27 2013

Walter Bright <newshound2 digitalmars.com> writes:

On 1/27/2013 1:39 PM, Brian Schott wrote:
 The file name is accepted for eventual error reporting purposes.

Use an OutputRange for that.

Jan 27 2013

"David Nadlinger" <see klickverbot.at> writes:

On Sunday, 27 January 2013 at 23:49:11 UTC, Walter Bright wrote:
 On 1/27/2013 1:39 PM, Brian Schott wrote:
 The file name is accepted for eventual error reporting 
 purposes.

 Use an OutputRange for that.

What about that delegate-based design? I thought everyone agreed 
that it was nice?

David

Jan 27 2013

Walter Bright <newshound2 digitalmars.com> writes:

On 1/27/2013 4:48 PM, David Nadlinger wrote:
 On Sunday, 27 January 2013 at 23:49:11 UTC, Walter Bright wrote:
 On 1/27/2013 1:39 PM, Brian Schott wrote:
 The file name is accepted for eventual error reporting purposes.

 Use an OutputRange for that.

 What about that delegate-based design? I thought everyone agreed that it was
nice?

An OutputRange is a way of doing that. The advantage of OutputRange's is that
is 
TheWayToDoThings in Phobos so that components can all interoperate and plug
into 
each other.

Jan 27 2013

"David Nadlinger" <see klickverbot.at> writes:

On Monday, 28 January 2013 at 00:51:28 UTC, Walter Bright wrote:
 On 1/27/2013 4:48 PM, David Nadlinger wrote:
 On Sunday, 27 January 2013 at 23:49:11 UTC, Walter Bright 
 wrote:
 On 1/27/2013 1:39 PM, Brian Schott wrote:
 The file name is accepted for eventual error reporting 
 purposes.

 Use an OutputRange for that.

 What about that delegate-based design? I thought everyone 
 agreed that it was nice?

 An OutputRange is a way of doing that. The advantage of 
 OutputRange's is that is TheWayToDoThings in Phobos so that 
 components can all interoperate and plug into each other.

I was talking about the design you proposed yourself here: 
http://forum.dlang.org/post/jvp9ke$2m45$1 digitalmars.com

Oh, and you really don't need to give me the basic Phobos/ranges 
sales pitch, I think I'm quite aware of their advantages. I'm 
just not sure that e.g. having an "exception thrower" output 
range would be a wise design decision.

David

Jan 27 2013

"Brian Schott" <briancschott gmail.com> writes:

On Sunday, 27 January 2013 at 23:49:11 UTC, Walter Bright wrote:
 On 1/27/2013 1:39 PM, Brian Schott wrote:
 The file name is accepted for eventual error reporting 
 purposes.

 Use an OutputRange for that.

I think you misunderstand. The file name is so that if you pass 
in "foo.d" the lexer can say "Error: unterminated string literal 
beginning on line 123 of foo.d". It's not so that error messagaes 
will be written to a file of that name.

On the topic of performance, I realized that the numbers posted 
previously were actually for a debug build. Fail.

For whatever reason, the current version of the lexer code isn't 
triggering my heisenbug[1] and I was able to build with -release 
-inline -O.

Here's what avgtime has to say:

$ avgtime -q -h -r 200 dscanner --tokenCount 
../phobos/std/datetime.d

------------------------
Total time (ms): 51409.8
Repetitions    : 200
Sample mode    : 250 (169 ocurrences)
Median time    : 255.57
Avg time       : 257.049
Std dev.       : 4.39338
Minimum        : 252.931
Maximum        : 278.658
95% conf.int.  : [248.438, 265.66]  e = 8.61087
99% conf.int.  : [245.733, 268.366]  e = 11.3166
EstimatedAvg95%: [256.44, 257.658]  e = 0.608881
EstimatedAvg99%: [256.249, 257.849]  e = 0.800205
Histogram      :
     msecs: count  normalized bar




Which works out to 1,327,784 tokens per second on my Ivy Bridge 
i7.

I created a small program that demangles the output of valgrind 
so that tools like KCachegrind can display profiling information 
more clearly. It's now on the wiki[2]

The bottleneck in std.d.lexer as it stands is the appender 
instances that assemble Token.value during iteration and front() 
on the array of char[]. (As I'm sure everyone expected)

[1] 
http://forum.dlang.org/thread/bug-9353-3 http.d.puremagic.com%2Fissues%2F
[2] http://wiki.dlang.org/Other_Dev_Tools

Jan 27 2013

Walter Bright <newshound2 digitalmars.com> writes:

On 1/27/2013 4:53 PM, Brian Schott wrote:
 On Sunday, 27 January 2013 at 23:49:11 UTC, Walter Bright wrote:
 On 1/27/2013 1:39 PM, Brian Schott wrote:
 The file name is accepted for eventual error reporting purposes.

 Use an OutputRange for that.

 I think you misunderstand. The file name is so that if you pass in "foo.d" the
 lexer can say "Error: unterminated string literal beginning on line 123 of
 foo.d". It's not so that error messagaes will be written to a file of that
name.

Yes, I did misunderstand. I suggest updating the documentation to clear up the 
misunderstanding.

Jan 27 2013

"deadalnix" <deadalnix gmail.com> writes:

On Monday, 28 January 2013 at 00:53:03 UTC, Brian Schott wrote:
 On Sunday, 27 January 2013 at 23:49:11 UTC, Walter Bright wrote:
 On 1/27/2013 1:39 PM, Brian Schott wrote:
 The file name is accepted for eventual error reporting 
 purposes.

 Use an OutputRange for that.

 I think you misunderstand. The file name is so that if you pass 
 in "foo.d" the lexer can say "Error: unterminated string 
 literal beginning on line 123 of foo.d". It's not so that error 
 messagaes will be written to a file of that name.

I don't think that is a good idea. For instance mixin need to be 
lexed but don't come from a file.

The lexer should report the error, what is done on error is up to 
the user of the lexer.

Jan 27 2013

Johannes Pfau <nospam example.com> writes:

Am Mon, 28 Jan 2013 01:53:02 +0100
schrieb "Brian Schott" <briancschott gmail.com>:

 
 The bottleneck in std.d.lexer as it stands is the appender 
 instances that assemble Token.value during iteration and front() 
 on the array of char[]. (As I'm sure everyone expected)

This sounds like a valid use case for buffered ranges (which we don't
have yet, afaik). When used correctly you could avoid the appender
completely. Instead slice the input buffer and copy it if necessary.
(If the complete file is kept in memory copying could also be
avoided)

Jan 28 2013

Jacob Carlborg <doob me.com> writes:

On 2013-01-28 09:51, Johannes Pfau wrote:

 This sounds like a valid use case for buffered ranges (which we don't
 have yet, afaik). When used correctly you could avoid the appender
 completely. Instead slice the input buffer and copy it if necessary.
 (If the complete file is kept in memory copying could also be
 avoided)

If think slicing the buffer will force the whole buffer to remain in 
memory and not be collected by the GC. If one keeps the whole buffer in 
memory anyway this won't be a problem.

-- 
/Jacob Carlborg

Jan 28 2013

Walter Bright <newshound2 digitalmars.com> writes:

On 1/28/2013 1:05 AM, Jacob Carlborg wrote:
 If think slicing the buffer will force the whole buffer to remain in memory and
 not be collected by the GC. If one keeps the whole buffer in memory anyway this
 won't be a problem.

Well, the source buffer can be large, and will span a lot of memory cache
lines, 
making accessing those slices slow.

Jan 28 2013

Jacob Carlborg <doob me.com> writes:

On 2013-01-28 11:14, Walter Bright wrote:

 Well, the source buffer can be large, and will span a lot of memory
 cache lines, making accessing those slices slow.

Would it be better to .dup the slices? Won't that be just as slow as 
using the appender?

-- 
/Jacob Carlborg

Jan 28 2013

Dmitry Olshansky <dmitry.olsh gmail.com> writes:

28-Jan-2013 14:39, Jacob Carlborg пишет:
 On 2013-01-28 11:14, Walter Bright wrote:

 Well, the source buffer can be large, and will span a lot of memory
 cache lines, making accessing those slices slow.

 Would it be better to .dup the slices? Won't that be just as slow as
 using the appender?

It would be better to compactly layout these one by one in a separate 
buffer skipping all of the extra slack in the source file(s).

-- 
Dmitry Olshansky

Jan 28 2013

"jerro" <a a.com> writes:

On Monday, 28 January 2013 at 11:00:09 UTC, Dmitry Olshansky 
wrote:
 28-Jan-2013 14:39, Jacob Carlborg пишет:
 On 2013-01-28 11:14, Walter Bright wrote:

 Well, the source buffer can be large, and will span a lot of 
 memory
 cache lines, making accessing those slices slow.

 Would it be better to .dup the slices? Won't that be just as 
 slow as
 using the appender?

 It would be better to compactly layout these one by one in a 
 separate buffer skipping all of the extra slack in the source 
 file(s).

Would it also make sense to do small string optimization? On 64 
bit, you could use the memory used by the string field itself to 
store strings of length up to 15 characters.

Jan 28 2013

Dmitry Olshansky <dmitry.olsh gmail.com> writes:

28-Jan-2013 18:08, jerro пишет:
 On Monday, 28 January 2013 at 11:00:09 UTC, Dmitry Olshansky wrote:
 28-Jan-2013 14:39, Jacob Carlborg пишет:
 On 2013-01-28 11:14, Walter Bright wrote:

 Well, the source buffer can be large, and will span a lot of memory
 cache lines, making accessing those slices slow.

 Would it be better to .dup the slices? Won't that be just as slow as
 using the appender?

 It would be better to compactly layout these one by one in a separate
 buffer skipping all of the extra slack in the source file(s).

 Would it also make sense to do small string optimization? On 64 bit, you
 could use the memory used by the string field itself to store strings of
 length up to 15 characters.

It very well could be iff slice is a field in the Token struct.

-- 
Dmitry Olshansky

Jan 28 2013

Timon Gehr <timon.gehr gmx.ch> writes:

On 01/28/2013 01:53 AM, Brian Schott wrote:
 ...

 On the topic of performance, I realized that the numbers posted
 previously were actually for a debug build. Fail.

 For whatever reason, the current version of the lexer code isn't
 triggering my heisenbug[1] and I was able to build with -release -inline
 -O.

 Here's what avgtime has to say:

 $ avgtime -q -h -r 200 dscanner --tokenCount ../phobos/std/datetime.d

 ------------------------
 Total time (ms): 51409.8
 Repetitions    : 200
 Sample mode    : 250 (169 ocurrences)
 Median time    : 255.57
 Avg time       : 257.049
 Std dev.       : 4.39338
 Minimum        : 252.931
 Maximum        : 278.658
 95% conf.int.  : [248.438, 265.66]  e = 8.61087
 99% conf.int.  : [245.733, 268.366]  e = 11.3166
 EstimatedAvg95%: [256.44, 257.658]  e = 0.608881
 EstimatedAvg99%: [256.249, 257.849]  e = 0.800205
 Histogram      :
      msecs: count  normalized bar




 Which works out to 1,327,784 tokens per second on my Ivy Bridge i7.

Better, but still slow.

 I created a small program that demangles the output of valgrind so that
 tools like KCachegrind can display profiling information more clearly.
 It's now on the wiki[2]

 The bottleneck in std.d.lexer as it stands is the appender instances
 that assemble Token.value during iteration and front() on the array of
 char[]. (As I'm sure everyone expected)

I see, probably there should be an option to do this by slicing instead.
Also try to treat narrow strings in such a way that they do not incur 
undue decoding overhead.

I guess that at some point

pure nothrow TokenType lookupTokenType(const string input)

might become a bottleneck. (DMD does not generate near-optimal string 
switches, I think.)

Jan 28 2013

"Brian Schott" <briancschott gmail.com> writes:

On Monday, 28 January 2013 at 21:03:21 UTC, Timon Gehr wrote:
 Better, but still slow.

I implemented the various suggestions from a past thread and made 
the lexer only work ubyte[] (to aviod phobos converting 
everything to dchar all the time) and gave the tokenizer instance 
a character buffer that it re-uses.

Results:

$ avgtime -q -r 200 ./dscanner --tokenCount 
../phobos/std/datetime.d

------------------------
Total time (ms): 13861.8
Repetitions    : 200
Sample mode    : 69 (90 ocurrences)
Median time    : 69.0745
Avg time       : 69.3088
Std dev.       : 0.670203
Minimum        : 68.613
Maximum        : 72.635
95% conf.int.  : [67.9952, 70.6223]  e = 1.31357
99% conf.int.  : [67.5824, 71.0351]  e = 1.72633
EstimatedAvg95%: [69.2159, 69.4016]  e = 0.0928836
EstimatedAvg99%: [69.1867, 69.4308]  e = 0.12207

If my math is right, that means it's getting 4.9 million 
tokens/second now. According to Valgrind the only way to really 
improve things now is to require that the input to the lexer 
support slicing. (Remember the secret of Tango's XML parser...) 
The bottleneck is now on the calls to .idup to construct the 
token strings from slices of the buffer.

 I guess that at some point

 pure nothrow TokenType lookupTokenType(const string input)

 might become a bottleneck. (DMD does not generate near-optimal 
 string switches, I think.)

Right now that's a fairly small box on KCachegrind.

Jan 30 2013

dennis luehring <dl.soluz gmx.net> writes:

Am 30.01.2013 10:49, schrieb Brian Schott:
 On Monday, 28 January 2013 at 21:03:21 UTC, Timon Gehr wrote:
 Better, but still slow.

 I implemented the various suggestions from a past thread and made
 the lexer only work ubyte[] (to aviod phobos converting
 everything to dchar all the time) and gave the tokenizer instance
 a character buffer that it re-uses.

 Results:

 $ avgtime -q -r 200 ./dscanner --tokenCount
 ../phobos/std/datetime.d

 .....

 If my math is right, that means it's getting 4.9 million
 tokens/second now. According to Valgrind the only way to really
 improve things now is to require that the input to the lexer
 support slicing. (Remember the secret of Tango's XML parser...)
 The bottleneck is now on the calls to .idup to construct the
 token strings from slices of the buffer.

but you still need to compare that to the current dmd lexer - nothing 
else can be a benchmark reference

Jan 30 2013

FG <home fgda.pl> writes:

On 2013-01-30 10:49, Brian Schott wrote:
 If my math is right, that means it's getting 4.9 million tokens/second now.
 According to Valgrind the only way to really improve things now is to require
 that the input to the lexer support slicing. (Remember the secret of Tango's
XML
 parser...) The bottleneck is now on the calls to .idup to construct the token
 strings from slices of the buffer.

Yep. I'm eager to see what timings you get with the whole file kept in memory 
and sliced, without copying token strings.

Jan 30 2013

Dmitry Olshansky <dmitry.olsh gmail.com> writes:

30-Jan-2013 13:49, Brian Schott пишет:
 On Monday, 28 January 2013 at 21:03:21 UTC, Timon Gehr wrote:
 Better, but still slow.

 I implemented the various suggestions from a past thread and made the
 lexer only work ubyte[] (to aviod phobos converting everything to dchar
 all the time) and gave the tokenizer instance a character buffer that it
 re-uses.

 Results:

 $ avgtime -q -r 200 ./dscanner --tokenCount ../phobos/std/datetime.d

 ------------------------
 Total time (ms): 13861.8
 Repetitions    : 200
 Sample mode    : 69 (90 ocurrences)
 Median time    : 69.0745
 Avg time       : 69.3088
 Std dev.       : 0.670203
 Minimum        : 68.613
 Maximum        : 72.635
 95% conf.int.  : [67.9952, 70.6223]  e = 1.31357
 99% conf.int.  : [67.5824, 71.0351]  e = 1.72633
 EstimatedAvg95%: [69.2159, 69.4016]  e = 0.0928836
 EstimatedAvg99%: [69.1867, 69.4308]  e = 0.12207

 If my math is right, that means it's getting 4.9 million tokens/second
 now. According to Valgrind the only way to really improve things now is
 to require that the input to the lexer support slicing. (Remember the
 secret of Tango's XML parser...) The bottleneck is now on the calls to
 .idup to construct the token strings from slices of the buffer.

idup --> allocation

Instead I suggest to try and allocate a big block of fixed size (say 
about 16-64K) upfront and copy identifiers one by one there. When it 
fills just allocate another one and move on.

If identifier is exceptionally long then you can just idup it as before.

This should bring down the number of calls to GC significantly.

 I guess that at some point

 pure nothrow TokenType lookupTokenType(const string input)

 might become a bottleneck. (DMD does not generate near-optimal string
 switches, I think.)

 Right now that's a fairly small box on KCachegrind.


-- 
Dmitry Olshansky

Jan 30 2013

FG <home fgda.pl> writes:

On 2013-01-30 17:50, Dmitry Olshansky wrote:
 Instead I suggest to try and allocate a big block of fixed size (say about
 16-64K) upfront and copy identifiers one by one there. When it fills just
 allocate another one and move on.

Yeah, similar to what I suggested, except that probably it would be good to
also 
have a look-up structure for identifiers, so that only unique strings go into 
those blocks.

I wonder what would be a practical data structure for such look-ups:
A trie is great except for the implementation hurdle to keep it also in one or
a 
few memory blocks to prevent frequent allocations.
A simpler approach would be to make it an array of (hash, string_slice) pairs, 
sorted by hash (of the identifier) - lots of string scanning though.

What do you think?

Jan 30 2013

Dmitry Olshansky <dmitry.olsh gmail.com> writes:

30-Jan-2013 21:21, FG пишет:
 On 2013-01-30 17:50, Dmitry Olshansky wrote:
 Instead I suggest to try and allocate a big block of fixed size (say
 about
 16-64K) upfront and copy identifiers one by one there. When it fills just
 allocate another one and move on.

 Yeah, similar to what I suggested, except that probably it would be good
 to also have a look-up structure for identifiers, so that only unique
 strings go into those blocks.

 I wonder what would be a practical data structure for such look-ups:
 A trie is great except for the implementation hurdle to keep it also in
 one or a few memory blocks to prevent frequent allocations.

Mm trie shouldn't have that many allocations. Typically each node is 
fixed-sized. Also I don't see keeping it in many memory blocks as a show 
stopper. The trick is exactly the same as I mentioned above but blocks 
got to be quite a bit large (say 8 times :) ).

Truth be told the trie should be optimal for this, but a lot of effort 
is required to implement a fast trie compared to rolling out a simple 
hash-table. That's something I hope to change once my Trie template is 
polished enough for Phobos proposal.

 A simpler approach would be to make it an array of (hash, string_slice)
 pairs, sorted by hash (of the identifier) - lots of string scanning though.

Mm since you need to find what's unique as you lex the file I suspect 
that sorting is out of the window. I'd be modest and for starters just 
use a hash-table + tune the hash function a bit.


-- 
Dmitry Olshansky

Jan 30 2013

Jacob Carlborg <doob me.com> writes:

On 2013-01-30 10:49, Brian Schott wrote:

 Results:

 $ avgtime -q -r 200 ./dscanner --tokenCount ../phobos/std/datetime.d

 ------------------------
 Total time (ms): 13861.8
 Repetitions    : 200
 Sample mode    : 69 (90 ocurrences)
 Median time    : 69.0745
 Avg time       : 69.3088
 Std dev.       : 0.670203
 Minimum        : 68.613
 Maximum        : 72.635
 95% conf.int.  : [67.9952, 70.6223]  e = 1.31357
 99% conf.int.  : [67.5824, 71.0351]  e = 1.72633
 EstimatedAvg95%: [69.2159, 69.4016]  e = 0.0928836
 EstimatedAvg99%: [69.1867, 69.4308]  e = 0.12207

 If my math is right, that means it's getting 4.9 million tokens/second
 now. According to Valgrind the only way to really improve things now is
 to require that the input to the lexer support slicing. (Remember the
 secret of Tango's XML parser...) The bottleneck is now on the calls to
 .idup to construct the token strings from slices of the buffer.

How many tokens would that be in total?

-- 
/Jacob Carlborg

Jan 31 2013

"H. S. Teoh" <hsteoh quickfur.ath.cx> writes:

On Sun, Jan 27, 2013 at 10:39:13PM +0100, Brian Schott wrote:
 On Sunday, 27 January 2013 at 19:46:12 UTC, Walter Bright wrote:

[...]
Just a quick comment: byToken() should not accept a filename. It's
input should be via an InputRange, not a file.

 
 The file name is accepted for eventual error reporting purposes. The
 actual input for the lexer is the parameter called "range".

[...]

FWIW, I've developed this little idiom in my code when it comes to
dealing with error reporting in lexing/parsing code (for my own DSLs,
not D):

The main problem I have is that my lexer/parser accepts an input range,
but input ranges don't (necessarily) have any filename/line number
associated with them. Moreover, the code that throws the exception may
be quite far down the call chain, and may have not access to the context
that knows what filename/line number the error occurred at. For example,
I may have a generic function called parseInt(), which can be called
from the lexer, the parser, or a whole bunch of other places. It
wouldn't make sense to force parseInt() to take a filename and line
number, just so it can have nicer error reporting, for example.

So I decided to move the inclusion of filename/line number information
to where they belong: in the code that knows about them. So here's a
sketch of my approach:

	class SyntaxError : Exception {
		string filename;
		int linenum;
		this(string msg) { super(msg); }
	}
	...
	int parseInt(R)(R inputRange) {
		...
		// N.B.: no filename/line number info here
		if (!isDigit(inputRange.front))
			throw new Exception("Invalid digit: %s",
				inputRange.front);
	}
	...
	Expr parseExpr(R)(R inputRange) {
		...
		// N.B.: any exception just unrolls past this call, 'cos
		// we have no filename/line number info to insert anyway
		if (tokenType == IntLiteral) {
			value = parseInt(inputRange):
		}
		...
	}
	...
	Expr parseFileInput(string filename) {
		auto f = File(filename);
		try {
			// Wrapper range that counts line numbers
			auto r = NumberedSrc(f);

			return parseExpr(r);
		} catch(SyntaxError e) {
			// Insert filename/line number info into message
			e.filename = filename;
			e.linenum = r.linenum;
			e.msg = format("%s:%d %s", filename, r.linenum, e.msg);
			throw e;
		}
	}
	...
	Expr parseConsoleInput() {
		// No filename/line number info here
		return parseExpr(stdin.byLine());
	}
	...
	Expr parseStringInput(string input) {
		try {
			auto r = NumberedSrc(input);
			return parseExpr(r);
		} catch(SyntaxError e) {
			// We don't have filename here, but we do have
			// line number, so use that.
			e.linenum = r.linenum;
			e.msg = format("Line %d: %s", r.linenum, e.msg);
			throw e;
		}
	}

Notice that I have different wrapper functions for dealing with
different kinds of input; the underlying parser doesn't even care about
filename/line numbers, but the wrapper functions catch any parsing
exceptions that are thrown from underneath and prepend this info as
appropriate. This simplifies the parsing code (don't have to keep
worrying about line numbers and propagating filenames) and also produces
output that makes sense:

- Console input don't need line numbers; the user doesn't care if this
  is the 500th command he typed, or the 701st.

- Internal strings don't get a nonsensical "filename", 'cos they don't
  *have* a filename in the first place. Just a single line number so you
  can locate the problem in, say, the string literal or something.

- File input has filename and line number.

- Other kinds of input contexts can be handled in the same way.

- The use of NumberedSrc (maybe better named LineNumberedRange or
  something) makes line numbers available to each of these contexts at
  the top-level. Though of course, the lexer itself can also handle this
  (but it adds complications if you have to continue detecting newlines
  in, say, string literals, when the lexer is in a different state).

The cleaner code does come at a price, though: this code probably is a
bit inefficient due to the number of layers in it. But, just thought I'd
share this idea.


T

-- 
You are only young once, but you can stay immature indefinitely. -- azephrahel

Jan 28 2013

"Dejan Lekic" <dejan.lekic gmail.com> writes:

On Sunday, 27 January 2013 at 09:51:10 UTC, Brian Schott wrote:
 I'm writing a D lexer for possible inclusion in Phobos.

 DDOC: 
 http://hackerpilot.github.com/experimental/std_lexer/phobos/lexer.html
 Code: 
 https://github.com/Hackerpilot/Dscanner/blob/range-based-lexer/std/d/lexer.d

 It's currently able to correctly syntax highlight all of 
 Phobos, but does a fairly bad job at rejecting or notifying 
 users/callers about invalid input.

 I'd like to hear arguments on the various ways to handle errors 
 in the lexer. In a compiler it would be useful to throw an 
 exception on finding something like a string literal that 
 doesn't stop before EOF, but a text editor or IDE would 
 probably want to be a bit more lenient. Maybe having it 
 run-time (or compile-time configurable) like std.csv would be 
 the best option here.

 I'm interested in ideas on the API design and other high-level 
 issues at the moment. I don't consider this ready for 
 inclusion. (The current module being reviewed for inclusion in 
 Phobos is the new std.uni.)

Having a "standard" lexer module, that many people work on, and 
has up-to-date rules, is priceless! Thank you for doing this.

Jan 28 2013

Jacob Carlborg <doob me.com> writes:

On 2013-01-27 10:51, Brian Schott wrote:
 I'm writing a D lexer for possible inclusion in Phobos.

 DDOC:
 http://hackerpilot.github.com/experimental/std_lexer/phobos/lexer.html
 Code:
 https://github.com/Hackerpilot/Dscanner/blob/range-based-lexer/std/d/lexer.d


 It's currently able to correctly syntax highlight all of Phobos, but
 does a fairly bad job at rejecting or notifying users/callers about
 invalid input.

 I'd like to hear arguments on the various ways to handle errors in the
 lexer. In a compiler it would be useful to throw an exception on finding
 something like a string literal that doesn't stop before EOF, but a text
 editor or IDE would probably want to be a bit more lenient. Maybe having
 it run-time (or compile-time configurable) like std.csv would be the
 best option here.

 I'm interested in ideas on the API design and other high-level issues at
 the moment. I don't consider this ready for inclusion. (The current
 module being reviewed for inclusion in Phobos is the new std.uni.)

Just thinking out loud here. Would it be possible to lex a file in 
parallel? Cutting it in half (or similar) and lex both pieces 
simultaneously in parallel.

-- 
/Jacob Carlborg

Jan 31 2013

FG <home fgda.pl> writes:

On 2013-01-31 13:14, Jacob Carlborg wrote:
 Just thinking out loud here. Would it be possible to lex a file in parallel?
 Cutting it in half (or similar) and lex both pieces simultaneously in parallel.

Do you know where you can safely cut it without having it lexed beforehand? :)

Jan 31 2013

Jacob Carlborg <doob me.com> writes:

On 2013-01-31 13:34, FG wrote:

 Do you know where you can safely cut it without having it lexed
 beforehand? :)

I was thinking that myself. It would probably be possible to just cut it 
in the middle and then lex a few characters forward and backwards until 
you get a valid token. Try and calculate the correct index where to cut.

Although I have no idea how much trouble it would given you and how much 
you would gain.

-- 
/Jacob Carlborg

Jan 31 2013

"H. S. Teoh" <hsteoh quickfur.ath.cx> writes:

On Thu, Jan 31, 2013 at 01:48:02PM +0100, Jacob Carlborg wrote:
 On 2013-01-31 13:34, FG wrote:
 
Do you know where you can safely cut it without having it lexed
beforehand? :)

 
 I was thinking that myself. It would probably be possible to just
 cut it in the middle and then lex a few characters forward and
 backwards until you get a valid token. Try and calculate the correct
 index where to cut.

[...]

Doesn't work if the middle happens to be inside a string literal
containing code. Esp. a q{} literal (you wouldn't be able to tell where
it starts/ends without scanning the entire file, because the {}'s nest).


T

-- 
Life is unfair. Ask too much from it, and it may decide you don't
deserve what you have now either.

Jan 31 2013

Jacob Carlborg <doob me.com> writes:

On 2013-01-31 16:54, H. S. Teoh wrote:

 Doesn't work if the middle happens to be inside a string literal
 containing code. Esp. a q{} literal (you wouldn't be able to tell where
 it starts/ends without scanning the entire file, because the {}'s nest).

That would be a problem.

-- 
/Jacob Carlborg

Jan 31 2013

"deadalnix" <deadalnix gmail.com> writes:

On Thursday, 31 January 2013 at 12:48:03 UTC, Jacob Carlborg 
wrote:
 On 2013-01-31 13:34, FG wrote:

 Do you know where you can safely cut it without having it lexed
 beforehand? :)

 I was thinking that myself. It would probably be possible to 
 just cut it in the middle and then lex a few characters forward 
 and backwards until you get a valid token. Try and calculate 
 the correct index where to cut.

 Although I have no idea how much trouble it would given you and 
 how much you would gain.

I don't think it worth the complexity. You can lex both file in 
parallel with 2 lexer instance if you want to make things faster.

Jan 31 2013

dennis luehring <dl.soluz gmx.net> writes:

Am 31.01.2013 13:14, schrieb Jacob Carlborg:
 On 2013-01-27 10:51, Brian Schott wrote:
 I'm writing a D lexer for possible inclusion in Phobos.

 DDOC:
 http://hackerpilot.github.com/experimental/std_lexer/phobos/lexer.html
 Code:
 https://github.com/Hackerpilot/Dscanner/blob/range-based-lexer/std/d/lexer.d


 It's currently able to correctly syntax highlight all of Phobos, but
 does a fairly bad job at rejecting or notifying users/callers about
 invalid input.

 I'd like to hear arguments on the various ways to handle errors in the
 lexer. In a compiler it would be useful to throw an exception on finding
 something like a string literal that doesn't stop before EOF, but a text
 editor or IDE would probably want to be a bit more lenient. Maybe having
 it run-time (or compile-time configurable) like std.csv would be the
 best option here.

 I'm interested in ideas on the API design and other high-level issues at
 the moment. I don't consider this ready for inclusion. (The current
 module being reviewed for inclusion in Phobos is the new std.uni.)

 Just thinking out loud here. Would it be possible to lex a file in
 parallel? Cutting it in half (or similar) and lex both pieces
 simultaneously in parallel.

why not only the symbols at the border needs to be "connected" then

the question is: how many blocks(threads) are ok for 1,2,3,8 cores - can 
also depend on the speed of the filesystem

Jan 31 2013

Jacob Carlborg <doob me.com> writes:

On 2013-01-31 13:35, dennis luehring wrote:

 why not only the symbols at the border needs to be "connected" then

 the question is: how many blocks(threads) are ok for 1,2,3,8 cores - can
 also depend on the speed of the filesystem

That would require some profiling to figure out.

-- 
/Jacob Carlborg

Jan 31 2013

dennis luehring <dl.soluz gmx.net> writes:

Am 31.01.2013 13:48, schrieb Jacob Carlborg:
 On 2013-01-31 13:35, dennis luehring wrote:

 why not only the symbols at the border needs to be "connected" then

 the question is: how many blocks(threads) are ok for 1,2,3,8 cores - can
 also depend on the speed of the filesystem

 That would require some profiling to figure out.

i would say it "can" help alot so the design should be able to use 
split-parts and combine symboles at the border

and also file-based lexing should be threadable so that 16 files can be 
handled by my 16 cores system full in parallel :)

but its also size dependend - it makes no sense to split in too small 
parts, could be counter-productive

Jan 31 2013

Jacob Carlborg <doob me.com> writes:

On 2013-01-31 13:57, dennis luehring wrote:

 i would say it "can" help alot so the design should be able to use
 split-parts and combine symboles at the border

 and also file-based lexing should be threadable so that 16 files can be
 handled by my 16 cores system full in parallel :)

This is kind of obvious, I think. That's why I started to think at this 
less obvious case.

 but its also size dependend - it makes no sense to split in too small
 parts, could be counter-productive

Of course not. That would require some profiling as well to find a sweet 
spot.

-- 
/Jacob Carlborg

Jan 31 2013

"qznc" <qznc go.to> writes:

On Thursday, 31 January 2013 at 12:14:35 UTC, Jacob Carlborg 
wrote:
 Just thinking out loud here. Would it be possible to lex a file 
 in parallel? Cutting it in half (or similar) and lex both 
 pieces simultaneously in parallel.

I think a lexer should be IO-bound on todays machines, so 
parallelizing should not give much benefits.

Of course, you can write a microbenchmark, such that lexing is 
compute-bound and parallelizing gives a speedup. ;)

Jan 31 2013

"Mehrdad" <wfunction hotmail.com> writes:

On Friday, 1 February 2013 at 07:41:11 UTC, qznc wrote:
 On Thursday, 31 January 2013 at 12:14:35 UTC, Jacob Carlborg 
 wrote:
 Just thinking out loud here. Would it be possible to lex a 
 file in parallel? Cutting it in half (or similar) and lex both 
 pieces simultaneously in parallel.

 I think a lexer should be IO-bound on todays machines, so 
 parallelizing should not give much benefits.

Pretty sure RAMs these days can handle more than 1 processor's 
memory request at a given point in time, so there should be an 
N-times speedup, where N maxes out at the max throughput...

Feb 01 2013

Jacob Carlborg <doob me.com> writes:

On 2013-02-01 08:41, qznc wrote:

 I think a lexer should be IO-bound on todays machines, so parallelizing
 should not give much benefits.

I'm not sure I understand.

-- 
/Jacob Carlborg

Feb 01 2013

"Brian Schott" <briancschott gmail.com> writes:

After several hours of optimizing I've managed to make it so that 
dmd's lexer is only three times faster.

http://hackerpilot.github.com/experimental/std_lexer/images/times.png

The funny thing is that compiling with LDC gave a bigger speed 
boost than any of my code refacatoring.

Feb 03 2013

FG <home fgda.pl> writes:

On 2013-02-04 01:22, Brian Schott wrote:
 After several hours of optimizing I've managed to make it so that dmd's lexer
is
 only three times faster.

What are you comparing here? How do you time DMD's lexing stage?

Feb 03 2013

"Brian Schott" <briancschott gmail.com> writes:

On Monday, 4 February 2013 at 00:38:57 UTC, FG wrote:
 On 2013-02-04 01:22, Brian Schott wrote:
 After several hours of optimizing I've managed to make it so 
 that dmd's lexer is
 only three times faster.

 What are you comparing here? How do you time DMD's lexing stage?

A simple hack of module.c that prints out the file's token count 
and then calls exit(0).

Feb 03 2013

FG <home fgda.pl> writes:

On 2013-02-04 01:41, Brian Schott wrote:
 What are you comparing here? How do you time DMD's lexing stage?

 A simple hack of module.c that prints out the file's token count and then calls
 exit(0).

Ah, fine. Then it's not that bad. :)
Have you already made it use slices of the whole source,
without copying the string values of tokens?
What kind of optimizations have you made?

Feb 03 2013

Jacob Carlborg <doob me.com> writes:

On 2013-02-04 01:50, FG wrote:

 Ah, fine. Then it's not that bad. :)
 Have you already made it use slices of the whole source,
 without copying the string values of tokens?
 What kind of optimizations have you made?

That would be interesting to hear. Three times slower than DMD doesn't 
sound good. I know that DMD is fast, but three times.

-- 
/Jacob Carlborg

Feb 03 2013

FG <home fgda.pl> writes:

On 2013-02-04 08:57, Jacob Carlborg wrote:
 On 2013-02-04 01:50, FG wrote:

 Ah, fine. Then it's not that bad. :)
 Have you already made it use slices of the whole source,
 without copying the string values of tokens?
 What kind of optimizations have you made?

 That would be interesting to hear. Three times slower than DMD doesn't sound
 good. I know that DMD is fast, but three times.

Looking at the current source, there is now a StringCache to hold the strings.
It is however also used to store all those long comments, so I believe quite
some time is unnecessarily wasted on generating hashes for them.
But probably there are other parts slowing things down.

Feb 04 2013

"Mehrdad" <wfunction hotmail.com> writes:

On Monday, 4 February 2013 at 00:22:42 UTC, Brian Schott wrote:
 After several hours of optimizing I've managed to make it so 
 that dmd's lexer is only three times faster.

 http://hackerpilot.github.com/experimental/std_lexer/images/times.png

 The funny thing is that compiling with LDC gave a bigger speed 
 boost than any of my code refacatoring.




Where is the current bottleneck?

(Should be easy to find just by running the program ~5 times and 
suddenly breaking into it with a debugger.)

Also, I'm assuming you've already tried disabling range-checking 
on arrays?

Feb 04 2013

"Brian Schott" <briancschott gmail.com> writes:

More optimizing:
http://hackerpilot.github.com/experimental/std_lexer/images/times2.png

Still only half speed. I'm becoming more and more convinced that 
Walter is actually a wizard.

Feb 04 2013

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

On 2/4/13 10:19 PM, Brian Schott wrote:
 More optimizing:
 http://hackerpilot.github.com/experimental/std_lexer/images/times2.png

 Still only half speed. I'm becoming more and more convinced that Walter
 is actually a wizard.

Suggestion: take lexer.c and convert it to D. Should take one day, and 
you'll have performance on par.

Andrei

Feb 04 2013

Andrej Mitrovic <andrej.mitrovich gmail.com> writes:

On 2/5/13, Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> wrote:
 Suggestion: take lexer.c and convert it to D. Should take one day, and
 you'll have performance on par.

This was already done for DDMD, and the more recent minimal version of it:

https://github.com/zachthemystic/ddmd-clean/blob/master/dmd/lexer.d

Feb 04 2013

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

On 2/4/13 11:05 PM, Andrej Mitrovic wrote:
 On 2/5/13, Andrei Alexandrescu<SeeWebsiteForEmail erdani.org>  wrote:
 Suggestion: take lexer.c and convert it to D. Should take one day, and
 you'll have performance on par.

 This was already done for DDMD, and the more recent minimal version of it:

 https://github.com/zachthemystic/ddmd-clean/blob/master/dmd/lexer.d

Awesome! Has anyone measured the speed?

Andrei

Feb 04 2013

"Brian Schott" <briancschott gmail.com> writes:

On Tuesday, 5 February 2013 at 04:19:54 UTC, Andrei Alexandrescu 
wrote:
 Awesome! Has anyone measured the speed?

 Andrei

I gave up on getting it to compile.

ddmd (the project this one is based on) was last updated to 2.040 
according to its page on dsource, and I also haven't gotten it to 
compile.

Feb 04 2013

Andrej Mitrovic <andrej.mitrovich gmail.com> writes:

On 2/5/13, Brian Schott <briancschott gmail.com> wrote:
 I gave up on getting it to compile.

It seems master is broken. An earlier commit is buildable with a small
change, but it doesn't seem to parse modules like std.datetime. I
don't know what exact version the front-end was ported from, but I've
tried to parse datetime from 2.055 to 2.061 and it didn't work.

So yeah it's not usable in this state.

Feb 05 2013

"deadalnix" <deadalnix gmail.com> writes:

On Tuesday, 5 February 2013 at 03:22:52 UTC, Andrei Alexandrescu 
wrote:
 On 2/4/13 10:19 PM, Brian Schott wrote:
 More optimizing:
 http://hackerpilot.github.com/experimental/std_lexer/images/times2.png

 Still only half speed. I'm becoming more and more convinced 
 that Walter
 is actually a wizard.

 Suggestion: take lexer.c and convert it to D. Should take one 
 day, and you'll have performance on par.

DMD's lexer is not suitable for phobos IMO. It doesn't take a 
range as input and don't produce a range. It also lack the 
features you may want from a multi usage D lexer.

Feb 04 2013

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

On 2/5/13 12:45 AM, deadalnix wrote:
 On Tuesday, 5 February 2013 at 03:22:52 UTC, Andrei Alexandrescu wrote:
 On 2/4/13 10:19 PM, Brian Schott wrote:
 More optimizing:
 http://hackerpilot.github.com/experimental/std_lexer/images/times2.png

 Still only half speed. I'm becoming more and more convinced that Walter
 is actually a wizard.

 Suggestion: take lexer.c and convert it to D. Should take one day, and
 you'll have performance on par.

 DMD's lexer is not suitable for phobos IMO. It doesn't take a range as
 input and don't produce a range. It also lack the features you may want
 from a multi usage D lexer.

I looked at the code. Once converted, it's trivial to convert the lexer 
into an input range.

Andrei

Feb 05 2013

Jacob Carlborg <doob me.com> writes:

On 2013-02-05 04:22, Andrei Alexandrescu wrote:

 Suggestion: take lexer.c and convert it to D. Should take one day, and
 you'll have performance on par.

There's reason for why nobody has just extract the lexer from DMD. It 
will probably take more than a day just to extract the lexer to be able 
to use it without the rest of DMD.

It's probably easier to do the actual porting than extract the lexer.

Actually, when I think about it, Johnathan is working on porting the DMD 
lexer to D.

-- 
/Jacob Carlborg

Feb 05 2013

Jonathan M Davis <jmdavisProg gmx.com> writes:

On Tuesday, February 05, 2013 09:14:35 Jacob Carlborg wrote:
 Actually, when I think about it, Johnathan is working on porting the DMD
 lexer to D.

Not exactly. I decided that it would be of greater benefit to just write the 
thing according to the grammar and make sure that the compiler matched the 
spec. I've already found and fixed several bugs in the spec because of that. 
Also, given how I'm writing it, I expect that its speed will be similar to 
that of dmd given the fact that I'm writing it such that it will generally do 
the minimum number of operations required. But it wouldn't surprise me at all 
if no lexer can possibly match dmd at the moment as long as it's compiled with 
dmd (at least on Linux), because gcc's optimizer is so much better than dmd's, 
and dmd is getting compiled with gcc, whereas the competing lexer in D is 
being compiled with dmd.

I don't have a lot of time to work on my lexer at the moment, but I'd really 
like to get it done soon, and I have most of the features in place. 
Unfortunately, when I went to try and work on it again the other day, the code 
wasn't compiling anymore, and I need to figure out why. I suspect that it's a 
regression related to string mixins, but I have to investigate further to sort 
it out.

- Jonathan M Davis

Feb 05 2013

Jacob Carlborg <doob me.com> writes:

On 2013-02-05 10:07, Jonathan M Davis wrote:

 Not exactly. I decided that it would be of greater benefit to just write the
 thing according to the grammar and make sure that the compiler matched the
 spec.

Aha, I see.

-- 
/Jacob Carlborg

Feb 05 2013

Jonathan M Davis <jmdavisProg gmx.com> writes:

On Tuesday, February 05, 2013 09:14:35 Jacob Carlborg wrote:
 On 2013-02-05 04:22, Andrei Alexandrescu wrote:
 Suggestion: take lexer.c and convert it to D. Should take one day, and
 you'll have performance on par.

 
 There's reason for why nobody has just extract the lexer from DMD. It
 will probably take more than a day just to extract the lexer to be able
 to use it without the rest of DMD.

There are basic ideas about how it works which are obviously good and should 
be in the finished product in D, but it's not range-based, which forces you to 
do things differently. It's also not configurable, which forces you to do
things 
differently.

If it could be ported as-is and then compared for speed, then that would be a 
great test, since it would be able to show how much of the speed problem is 
purely a compiler issue as opposed to a design issue, but you wouldn't be able 
to actually use it for anything more than what Brian is doing with his 
performance testing, because as you point out, it's too integrated into dmd. 
It _would_ be valuable though as a performance test of the compiler.

- Jonathan M Davis

Feb 05 2013

Jacob Carlborg <doob me.com> writes:

On 2013-02-05 11:44, Jonathan M Davis wrote:

 If it could be ported as-is and then compared for speed, then that would be a
 great test, since it would be able to show how much of the speed problem is
 purely a compiler issue as opposed to a design issue, but you wouldn't be able
 to actually use it for anything more than what Brian is doing with his
 performance testing, because as you point out, it's too integrated into dmd.
 It _would_ be valuable though as a performance test of the compiler.

Yeah, that could be useful.

-- 
/Jacob Carlborg

Feb 05 2013

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

On 2/5/13 5:44 AM, Jonathan M Davis wrote:
 On Tuesday, February 05, 2013 09:14:35 Jacob Carlborg wrote:
 On 2013-02-05 04:22, Andrei Alexandrescu wrote:
 Suggestion: take lexer.c and convert it to D. Should take one day, and
 you'll have performance on par.

 There's reason for why nobody has just extract the lexer from DMD. It
 will probably take more than a day just to extract the lexer to be able
 to use it without the rest of DMD.

 There are basic ideas about how it works which are obviously good and should
 be in the finished product in D, but it's not range-based, which forces you to
 do things differently. It's also not configurable, which forces you to do
things
 differently.

 If it could be ported as-is and then compared for speed, then that would be a
 great test, since it would be able to show how much of the speed problem is
 purely a compiler issue as opposed to a design issue, but you wouldn't be able
 to actually use it for anything more than what Brian is doing with his
 performance testing, because as you point out, it's too integrated into dmd.
 It _would_ be valuable though as a performance test of the compiler.

As far as I could tell the dependencies of the lexer are fairly 
contained (util, token, identifier) and conversion to input range is 
immediate.

Andrei

Feb 05 2013

Jacob Carlborg <doob me.com> writes:

On 2013-02-05 14:34, Andrei Alexandrescu wrote:

 As far as I could tell the dependencies of the lexer are fairly
 contained (util, token, identifier) and conversion to input range is
 immediate.

That's not what I remember from last time I gave it a try.

-- 
/Jacob Carlborg

Feb 05 2013

Jonathan M Davis <jmdavisProg gmx.com> writes:

On Tuesday, February 05, 2013 08:34:29 Andrei Alexandrescu wrote:
 As far as I could tell the dependencies of the lexer are fairly
 contained (util, token, identifier) and conversion to input range is
 immediate.

I don't remember all of the details at the moment, since it's been several 
months since I looked at dmd's lexer, but a lot of the problem stems from the 
fact that it's all written around the assumption that it's dealing with a 
char*. Converting it to operate on string might be fairly straightforward, but 
it gets more complicated when dealing with ranges. Also, both Walter and 
others have stated that the lexer in D should be configurable in a number of 
ways, and dmd's lexer isn't configurable at all. So, while a direct translation 
would likely be quick, refactoring it to do what it's been asked to be able to 
do would not be.

I'm quite a ways along with one that's written from scratch, but I need to find 
the time to finish it. Also, doing it from scratch has had the added benefit of 
helping me find bugs in the spec and in dmd.

- Jonathan M Davis

Feb 05 2013

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

On 2/5/13 10:29 PM, Jonathan M Davis wrote:
 On Tuesday, February 05, 2013 08:34:29 Andrei Alexandrescu wrote:
 As far as I could tell the dependencies of the lexer are fairly
 contained (util, token, identifier) and conversion to input range is
 immediate.

 I don't remember all of the details at the moment, since it's been several
 months since I looked at dmd's lexer, but a lot of the problem stems from the
 fact that it's all written around the assumption that it's dealing with a
 char*. Converting it to operate on string might be fairly straightforward, but
 it gets more complicated when dealing with ranges. Also, both Walter and
 others have stated that the lexer in D should be configurable in a number of
 ways, and dmd's lexer isn't configurable at all. So, while a direct translation
 would likely be quick, refactoring it to do what it's been asked to be able to
 do would not be.

 I'm quite a ways along with one that's written from scratch, but I need to find
 the time to finish it. Also, doing it from scratch has had the added benefit of
 helping me find bugs in the spec and in dmd.

I think it would be reasonable for a lexer to require a range of ubyte 
as input, and carry its own decoding. In the first approximation it may 
even require a random-access range of ubyte.

Andrei

Feb 05 2013

Jonathan M Davis <jmdavisProg gmx.com> writes:

On Tuesday, February 05, 2013 22:51:32 Andrei Alexandrescu wrote:
 I think it would be reasonable for a lexer to require a range of ubyte
 as input, and carry its own decoding. In the first approximation it may
 even require a random-access range of ubyte.

I'd have to think about how you'd handle the Unicode stuff in that case, since 
I'm not quite sure what you mean by having it handle its own decoding if it's 
a range of code units, but what I've been working on works with all of the 
character types and is very careful about how it deals with decoding in order 
to avoid unnecessary decoding. And that wasn't all that hard as far as the 
lexer's code goes. The hard part with that was making std.utf work with ranges 
of code units rather than just strings, and that was committed months ago.

- Jonathan M Davis

Feb 05 2013

Jonathan M Davis <jmdavisProg gmx.com> writes:

On Tuesday, February 05, 2013 22:51:32 Andrei Alexandrescu wrote:
 I think it would be reasonable for a lexer to require a range of ubyte
 as input, and carry its own decoding. In the first approximation it may
 even require a random-access range of ubyte.

Another big issue is the fact that in some ways, using a pointer like dmd's 
lexer does is actually superior to using a range. In particular, it's trivial 
to determine where in the text a token is, because you can simply subtract the 
pointer in the token from the initial pointer. Strings would be okay too, 
because you can subtract their ptr properties. But the closest that you'll get 
with ranges is to subtract their lengths, and the only ranges that are likely 
to define length are random-access ranges. And to do that, you'd either have to 
keep calculating the index for each token as its generated or save the range 
with ever token (rather than just having a pointer) so that you could 
determine the index later if you needed to. And depending on the range, all of 
that saving could be expensive.

And for any other type of range, you'd literally have to count the code units 
as you iterated in order to figure out what the index is (and you'd have to 
keep saving the range as you went along if you wanted to slice it at all, 
since it wouldn't actually be sliceable, and so getting to a particular index 
in the range would be very expensive even if you kept track of it). And for 
syntax highlighting and some error reporting and a variety of other uses, you 
need to be able to determine where in the text a token was (not just its 
column and line number). And that's simply a lot easier with a pointer.

So, dealing with generic ranges is a bit problematic - far more so than any 
issues with character types. If the lexer is well-written, the extra overhead 
had be obviated by having the lexer function do stuff a bit differently 
depending on the type of the range, but regardless, restricting it to strings 
or pointers would be cleaner in that regard. It's not quite a use case where 
ranges shine - especially when efficiency is a top priority.

- Jonathan M Davis

Feb 08 2013

Dmitry Olshansky <dmitry.olsh gmail.com> writes:

08-Feb-2013 12:01, Jonathan M Davis пишет:
 On Tuesday, February 05, 2013 22:51:32 Andrei Alexandrescu wrote:
 I think it would be reasonable for a lexer to require a range of ubyte
 as input, and carry its own decoding. In the first approximation it may
 even require a random-access range of ubyte.

 Another big issue is the fact that in some ways, using a pointer like dmd's
 lexer does is actually superior to using a range. In particular, it's trivial
 to determine where in the text a token is, because you can simply subtract the
 pointer in the token from the initial pointer. Strings would be okay too,
 because you can subtract their ptr properties. But the closest that you'll get
 with ranges is to subtract their lengths, and the only ranges that are likely
 to define length are random-access ranges.

Not true, certain ranges know length but can't be random access as 
indexing is O(lgN) or worse. Including a stripe of chunks as taken from 
file.


-- 
Dmitry Olshansky

Feb 08 2013

Jonathan M Davis <jmdavisProg gmx.com> writes:

On Friday, February 08, 2013 12:12:30 Dmitry Olshansky wrote:
 08-Feb-2013 12:01, Jonathan M Davis =D0=BF=D0=B8=D1=88=D0=B5=D1=82:
 On Tuesday, February 05, 2013 22:51:32 Andrei Alexandrescu wrote:
 I think it would be reasonable for a lexer to require a range of u=



byte
 as input, and carry its own decoding. In the first approximation i=



t may
 even require a random-access range of ubyte.

=20
 Another big issue is the fact that in some ways, using a pointer li=


ke
 dmd's
 lexer does is actually superior to using a range. In particular, it=


's
 trivial to determine where in the text a token is, because you can =


simply
 subtract the pointer in the token from the initial pointer. Strings=


 would
 be okay too, because you can subtract their ptr properties. But the=


 closest that you'll get with ranges is to subtract their lengths, a=


nd the
 only ranges that are likely to define length are random-access rang=


es.
=20
 Not true, certain ranges know length but can't be random access as
 indexing is O(lgN) or worse. Including a stripe of chunks as taken fr=

om
 file.

I said that the only ones which are "likely" to define length are rando=
m-access=20
range. There _are_ other ranges which can, but in most cases, if you ca=
n know=20
the length, you can do random access as well. Regardless, the main issu=
e still=20
stands in that dealing with keeping track of the index of the code unit=
 of a=20
token is more complicated and generally more expensive with ranges than=
 it is=20
with a pointer. Some range types will do better than others, but short =
of=20
using a string's ptr property, there's always going to be some addition=
al=20
overhead in comparison to pointers to keep track of the indices or to k=
eep a=20
range or slice of one as part of a token. The pointer's just more light=
weight.=20
That doesn't make ranges unacceptable by any means. It just means that =
they're=20
going to take at least a slight performance hit in comparison to pointe=
rs.

- Jonathan M Davis

Feb 08 2013

Dmitry Olshansky <dmitry.olsh gmail.com> writes:

08-Feb-2013 13:40, Jonathan M Davis пишет:
 On Friday, February 08, 2013 12:12:30 Dmitry Olshansky wrote:
 08-Feb-2013 12:01, Jonathan M Davis пишет:
 On Tuesday, February 05, 2013 22:51:32 Andrei Alexandrescu wrote:
 I think it would be reasonable for a lexer to require a range of ubyte
 as input, and carry its own decoding. In the first approximation it may
 even require a random-access range of ubyte.

 Another big issue is the fact that in some ways, using a pointer like
 dmd's
 lexer does is actually superior to using a range. In particular, it's
 trivial to determine where in the text a token is, because you can simply
 subtract the pointer in the token from the initial pointer. Strings would
 be okay too, because you can subtract their ptr properties. But the
 closest that you'll get with ranges is to subtract their lengths, and the
 only ranges that are likely to define length are random-access ranges.

 Not true, certain ranges know length but can't be random access as
 indexing is O(lgN) or worse. Including a stripe of chunks as taken from
 file.

 I said that the only ones which are "likely" to define length are random-access
 range. There _are_ other ranges which can, but in most cases, if you can know
 the length, you can do random access as well.

Well I honestly disagree about the promise of knowing length - being 
able to index. "The most ranges" is arrays and wrappers on top of these.
Given current realities oF D and Phobos I'm afraid you are right though.

  Regardless, the main issue still
 stands in that dealing with keeping track of the index of the code unit of a
 token is more complicated and generally more expensive with ranges than it is
 with a pointer.

If target is random access range just use offset throughout. It's 
basically becomes base + offset vs base + pointer i.e. non-issue

If not then pointer argument no longer applies and you can just as well 
use separate counter on per popFront. It'd not that costly in this case 
and flexibility tramps other concerns with forward ranges in any case.

 Some range types will do better than others, but short of
 using a string's ptr property, there's always going to be some additional
 overhead in comparison to pointers to keep track of the indices or to keep a
 range or slice of one as part of a token. The pointer's just more lightweight.
 That doesn't make ranges unacceptable by any means. It just means that they're
 going to take at least a slight performance hit in comparison to pointers.

See above. Pointer to something inside of a buffer == index in buffer, 
typically even with pointer you can't drop the 'buffer' reference itself.

-- 
Dmitry Olshansky

Feb 08 2013

"Jonathan M Davis" <jmdavisProg gmx.com> writes:

On Friday, February 08, 2013 14:29:20 Dmitry Olshansky wrote:
 If target is random access range just use offset throughout. It's
 basically becomes base + offset vs base + pointer i.e. non-issue

 If not then pointer argument no longer applies and you can just as well
 use separate counter on per popFront. It'd not that costly in this case
 and flexibility tramps other concerns with forward ranges in any case.

I don't know exactly how costly it ends up being (and certainly, poor design 
choices in other areas could dwarf that cost), but it does incur extra 
overhead throughout the lexing process. In most code, it wouldn't be a big 
deal, but the lexer is trying to be lightning fast, so every extra bit like 
that is going to add up and slow it down. But you really don't have any choice 
if you don't even have random access. Regardless, my point is that accepting 
generic ranges rather than pointers complicates things somewhat and does incur 
at least a slight performance penalty.

- Jonathan M Davis

Feb 08 2013

Dmitry Olshansky <dmitry.olsh gmail.com> writes:

08-Feb-2013 23:25, Jonathan M Davis пишет:
 On Friday, February 08, 2013 14:29:20 Dmitry Olshansky wrote:
 If target is random access range just use offset throughout. It's
 basically becomes base + offset vs base + pointer i.e. non-issue

 If not then pointer argument no longer applies and you can just as well
 use separate counter on per popFront. It'd not that costly in this case
 and flexibility tramps other concerns with forward ranges in any case.

 I don't know exactly how costly it ends up being (and certainly, poor design
 choices in other areas could dwarf that cost), but it does incur extra
 overhead throughout the lexing process.
 In most code, it wouldn't be a big
 deal, but the lexer is trying to be lightning fast, so every extra 

bit like
 that is going to add up and slow it down.

I suspect it *might* require extra register for base depending on 
smartness of the compiler, extra register in turn could rise the cost.

The key to speed here however is not in using raw pointers, assembly 
and/or SIMD. FWIW if there is only a single base pointer + offsets 
compiler can assume no pointer aliasing and optimize things better.

The discussion becomes purely rhetorical unless some hard data is 
actually presented and not based on DMD's optimizer, please.

 But you really don't have any choice
 if you don't even have random access.

Agreed.

 Regardless, my point is that accepting
 generic ranges rather than pointers complicates things somewhat and 

does incur
 at least a slight performance penalty.

Complication - yes, slight performance cost is what I doubt it in RA 
case. Seems like a compiler/optimizer issue.

For one I used to write cycles with naked pointers in C to get better 
performance out of crappy compilers. I won't ever do it today as it 
would get worse performance in addition to being less readable (I've 
checked at least on VC++ about a year ago, compiler rewrites these 
index-based loops better, YMMV).


-- 
Dmitry Olshansky

Feb 08 2013

"Jonathan M Davis" <jmdavisProg gmx.com> writes:

On Friday, February 08, 2013 23:50:13 Dmitry Olshansky wrote:
 Complication - yes, slight performance cost is what I doubt it in RA
 case. Seems like a compiler/optimizer issue.

The main problem isn't RA ranges. It's quite probable that the optimizer takes 
care of any minor additional overhead that's incurred there. The issue is 
really pure forward ranges, because you're forced to count the code units as 
they're processed (or even save the range as you go along, given how expensive 
it would be to go find the index even if you have it). But we really have no 
way of knowing how bad it is without hard data. It would certainly be trivial 
to end up doing other stuff in the implementation which cost far more.

- Jonathan M Davis

Feb 08 2013

"deadalnix" <deadalnix gmail.com> writes:

On Friday, 8 February 2013 at 20:54:32 UTC, Jonathan M Davis 
wrote:
 On Friday, February 08, 2013 23:50:13 Dmitry Olshansky wrote:
 Complication - yes, slight performance cost is what I doubt it 
 in RA
 case. Seems like a compiler/optimizer issue.

 The main problem isn't RA ranges. It's quite probable that the 
 optimizer takes
 care of any minor additional overhead that's incurred there. 
 The issue is
 really pure forward ranges, because you're forced to count the 
 code units as
 they're processed (or even save the range as you go along, 
 given how expensive
 it would be to go find the index even if you have it). But we 
 really have no
 way of knowing how bad it is without hard data. It would 
 certainly be trivial
 to end up doing other stuff in the implementation which cost 
 far more.

For pure InputRanges, that is pretty bad as the decoding of UTF 
chars basically have to be done 2 times, and each codepoint 
popped individually, or the lexer have to embed its own homebrew 
version of std.utf .

Feb 08 2013

"deadalnix" <deadalnix gmail.com> writes:

On Saturday, 9 February 2013 at 07:15:57 UTC, deadalnix wrote:
 On Friday, 8 February 2013 at 20:54:32 UTC, Jonathan M Davis 
 wrote:
 On Friday, February 08, 2013 23:50:13 Dmitry Olshansky wrote:
 Complication - yes, slight performance cost is what I doubt 
 it in RA
 case. Seems like a compiler/optimizer issue.

 The main problem isn't RA ranges. It's quite probable that the 
 optimizer takes
 care of any minor additional overhead that's incurred there. 
 The issue is
 really pure forward ranges, because you're forced to count the 
 code units as
 they're processed (or even save the range as you go along, 
 given how expensive
 it would be to go find the index even if you have it). But we 
 really have no
 way of knowing how bad it is without hard data. It would 
 certainly be trivial
 to end up doing other stuff in the implementation which cost 
 far more.

 For pure InputRanges, that is pretty bad as the decoding of UTF 
 chars basically have to be done 2 times, and each codepoint 
 popped individually, or the lexer have to embed its own 
 homebrew version of std.utf .

Wow that is complete bullshit xD I completely messed up what I 
wanted to say.

So, std.utf.decodeFront pop or not the utf character. And in case 
it does not, you ends up having to do extra hanky panky (and 
duplicate logic in std.utf to know if it does pop or not, which 
is likely to be very bug prone).

Feb 08 2013

Jonathan M Davis <jmdavisProg gmx.com> writes:

On Saturday, February 09, 2013 08:19:33 deadalnix wrote:
 So, std.utf.decodeFront pop or not the utf character. And in case
 it does not, you ends up having to do extra hanky panky (and
 duplicate logic in std.utf to know if it does pop or not, which
 is likely to be very bug prone).

Well, with the pull request that I have, it'll always pop for input ranges and 
never for anything else. I'm disinclined to have it pop off elements at all, 
because I'd like it to function as much like decode as possible, but there's 
no choice with pure input ranges, and it may be better to just make it always 
pop the elements off. I'll have to think about it. I hate pure input ranges in 
general. They're just so frustrating to deal with, and I believe that you can 
always have a forward range if you're willing to put a bit more effort into it.

The lexer that I'm writing doesn't even support pure input ranges, because it 
has to be able to save to do what it does, and the only reason that 
decodeFront exists is because I wrote it to support that use case. And needing 
to operate on ranges of code units is likely to be quite rare, and the 
function is quite new, so I don't expect that much code is affected by any bugs 
in decodeFront or that much would be broken were it to be changed.

- Jonathan M Davis

Feb 09 2013

Jacob Carlborg <doob me.com> writes:

On 2013-02-09 09:05, Jonathan M Davis wrote:

 The lexer that I'm writing doesn't even support pure input ranges, because it
 has to be able to save to do what it does, and the only reason that
 decodeFront exists is because I wrote it to support that use case. And needing
 to operate on ranges of code units is likely to be quite rare, and the
 function is quite new, so I don't expect that much code is affected by any bugs
 in decodeFront or that much would be broken were it to be changed.

Following this discussion it seems it's complicated to support ranges in 
a lexer and still maintain the speed. Is it really worth the trouble to 
support ranges?

-- 
/Jacob Carlborg

Feb 09 2013

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

On 2/9/13 10:05 AM, Jacob Carlborg wrote:
 On 2013-02-09 09:05, Jonathan M Davis wrote:

 The lexer that I'm writing doesn't even support pure input ranges,
 because it
 has to be able to save to do what it does, and the only reason that
 decodeFront exists is because I wrote it to support that use case. And
 needing
 to operate on ranges of code units is likely to be quite rare, and the
 function is quite new, so I don't expect that much code is affected by
 any bugs
 in decodeFront or that much would be broken were it to be changed.

 Following this discussion it seems it's complicated to support ranges in
 a lexer and still maintain the speed. Is it really worth the trouble to
 support ranges?

Requiring a random-access range of ubyte with a terminating zero may be 
the most general approach to a fast lexer - and that's fine.

Andrei

Feb 09 2013

Jacob Carlborg <doob me.com> writes:

On 2013-02-09 16:10, Andrei Alexandrescu wrote:

 Requiring a random-access range of ubyte with a terminating zero may be
 the most general approach to a fast lexer - and that's fine.

Requiring a random-access range probably makes it easier. People here 
seems to try to support ranges with less functionality, like input or 
forward ranges.

-- 
/Jacob Carlborg

Feb 09 2013

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

On 2/9/13 10:37 AM, Jacob Carlborg wrote:
 On 2013-02-09 16:10, Andrei Alexandrescu wrote:

 Requiring a random-access range of ubyte with a terminating zero may be
 the most general approach to a fast lexer - and that's fine.

 Requiring a random-access range probably makes it easier. People here
 seems to try to support ranges with less functionality, like input or
 forward ranges.

Yah. The way I see it is, start with a random-access range and then see 
what the use patterns are. Then possibly refine.

Andrei

Feb 09 2013

Dmitry Olshansky <dmitry.olsh gmail.com> writes:

09-Feb-2013 19:46, Andrei Alexandrescu пишет:
 On 2/9/13 10:37 AM, Jacob Carlborg wrote:
 On 2013-02-09 16:10, Andrei Alexandrescu wrote:

 Requiring a random-access range of ubyte with a terminating zero may be
 the most general approach to a fast lexer - and that's fine.

 Requiring a random-access range probably makes it easier. People here
 seems to try to support ranges with less functionality, like input or
 forward ranges.

 Yah. The way I see it is, start with a random-access range and then see
 what the use patterns are. Then possibly refine.

I don't get this. There is no sensible requirement to forbid non-random 
access. A couple of static ifs and you are done.

And I can confidently say that std.d.lexer has quite some room for 
optimization in both cases and it doesn't have to sacrifice the generic 
path.

I intent to continue hacking on Brain's implementation and to help him 
refine it. Any real help (as in work and analysis) is appreciated, thanks.

-- 
Dmitry Olshansky

Feb 09 2013

Jonathan M Davis <jmdavisProg gmx.com> writes:

On Sunday, February 10, 2013 00:26:41 Dmitry Olshansky wrote:
 09-Feb-2013 19:46, Andrei Alexandrescu =D0=BF=D0=B8=D1=88=D0=B5=D1=82=

:
 On 2/9/13 10:37 AM, Jacob Carlborg wrote:
 On 2013-02-09 16:10, Andrei Alexandrescu wrote:
 Requiring a random-access range of ubyte with a terminating zero =




may be
 the most general approach to a fast lexer - and that's fine.

=20
 Requiring a random-access range probably makes it easier. People h=



ere
 seems to try to support ranges with less functionality, like input=



 or
 forward ranges.

=20
 Yah. The way I see it is, start with a random-access range and then=


 see
 what the use patterns are. Then possibly refine.

=20
 I don't get this. There is no sensible requirement to forbid non-rand=

om
 access. A couple of static ifs and you are done.
=20
 And I can confidently say that std.d.lexer has quite some room for
 optimization in both cases and it doesn't have to sacrifice the gener=

ic
 path.

It may end up being less efficient with some types of ranges, but it ca=
n still=20
work, and someone's use case may not care about that extra loss of effi=
ciency.=20
Those who really do can use RA ranges or strings or whatever.

But if we throw too many extra restrictions on the ranges (like they ha=
ve to=20
be random access or end with 0), then pretty soon you might as well req=
uire=20
that they use string, because the extra benefit of the few extra types =
of=20
ranges which would work wolud be pretty minimal.

- Jonathan M Davis

Feb 09 2013

Walter Bright <newshound2 digitalmars.com> writes:

On 2/8/2013 12:01 AM, Jonathan M Davis wrote:
 It's not quite a use case where
 ranges shine - especially when efficiency is a top priority.

A more problematic case is dmd's lexer relies on a 0 byte at the end to be a 
"sentinel" for the end of file. Without such a sentinel, every consumption of a 
source character requires two operations rather than one.

Feb 08 2013

Jonathan M Davis <jmdavisProg gmx.com> writes:

On Friday, February 08, 2013 23:48:50 Walter Bright wrote:
 On 2/8/2013 12:01 AM, Jonathan M Davis wrote:
 It's not quite a use case where
 ranges shine - especially when efficiency is a top priority.

 
 A more problematic case is dmd's lexer relies on a 0 byte at the end to be a
 "sentinel" for the end of file. Without such a sentinel, every consumption
 of a source character requires two operations rather than one.

I didn't know that. That's a cute trick. But yeah, without controlling the 
input, it's not possible, and that won't work with a general implementation in 
a library. It would be trivial enough to put a wrapper around the input to add 
the 0 byte at the end, but the wrapper would still have to keep checking for 
empty, so you wouldn't gain anything.

- Jonathan M Davis

Feb 09 2013

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

On 2/9/13 3:07 AM, Jonathan M Davis wrote:
 On Friday, February 08, 2013 23:48:50 Walter Bright wrote:
 On 2/8/2013 12:01 AM, Jonathan M Davis wrote:
 It's not quite a use case where
 ranges shine - especially when efficiency is a top priority.

 A more problematic case is dmd's lexer relies on a 0 byte at the end to be a
 "sentinel" for the end of file. Without such a sentinel, every consumption
 of a source character requires two operations rather than one.

 I didn't know that. That's a cute trick. But yeah, without controlling the
 input, it's not possible, and that won't work with a general implementation in
 a library. It would be trivial enough to put a wrapper around the input to add
 the 0 byte at the end, but the wrapper would still have to keep checking for
 empty, so you wouldn't gain anything.

That's not a problem. You may require that the input has a terminating zero.

Andrei

Feb 09 2013

Jacob Carlborg <doob me.com> writes:

On 2013-02-09 15:37, Andrei Alexandrescu wrote:

 That's not a problem. You may require that the input has a terminating
 zero.

You cannot just append a terminating zero in the lexer if it's a range?

-- 
/Jacob Carlborg

Feb 09 2013

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

On 2/9/13 10:38 AM, Jacob Carlborg wrote:
 On 2013-02-09 15:37, Andrei Alexandrescu wrote:

 That's not a problem. You may require that the input has a terminating
 zero.

 You cannot just append a terminating zero in the lexer if it's a range?

You require it. It's client's burden.

Andrei

Feb 09 2013

Jacob Carlborg <doob me.com> writes:

On 2013-02-09 16:46, Andrei Alexandrescu wrote:

 You require it. It's client's burden.

Ok, I see.

-- 
/Jacob Carlborg

Feb 09 2013

Walter Bright <newshound2 digitalmars.com> writes:

On 2/9/2013 6:37 AM, Andrei Alexandrescu wrote:
 On 2/9/13 3:07 AM, Jonathan M Davis wrote:
 On Friday, February 08, 2013 23:48:50 Walter Bright wrote:
 On 2/8/2013 12:01 AM, Jonathan M Davis wrote:
 It's not quite a use case where
 ranges shine - especially when efficiency is a top priority.

 A more problematic case is dmd's lexer relies on a 0 byte at the end to be a
 "sentinel" for the end of file. Without such a sentinel, every consumption
 of a source character requires two operations rather than one.

 I didn't know that. That's a cute trick. But yeah, without controlling the
 input, it's not possible, and that won't work with a general implementation in
 a library. It would be trivial enough to put a wrapper around the input to add
 the 0 byte at the end, but the wrapper would still have to keep checking for
 empty, so you wouldn't gain anything.

 That's not a problem. You may require that the input has a terminating zero.

Perhaps we can formalize this a bit. Define a SentinelInputRange, which has an 
additional manifest constant with the name 'sentinel'. empty becomes defined as:

     empty = front == sentinel;

popFront() does not advance if empty. Advancing over a SentinelInputRange can
be 
done the usual:

     for (r = range; !r.empty; r.popFront()) { auto e = r.front; }

or can be done as:

     for (r = range; r.front != sentinel; r.popFront()) { auto e = r.front; }

This should work well with 0 terminated strings. The output of the lexer should 
also be a SentinelInputRange, with TOKeof as the sentinel.

Feb 09 2013

Walter Bright <newshound2 digitalmars.com> writes:

On 2/9/2013 12:43 PM, Walter Bright wrote:> Perhaps we can formalize this a
bit. 
Define a SentinelInputRange, which has an
 additional manifest constant with the name 'sentinel'. empty becomes defined
as:

      empty = front == sentinel;

 popFront() does not advance if empty. Advancing over a SentinelInputRange can
be
 done the usual:

      for (r = range; !r.empty; r.popFront()) { auto e = r.front; }

 or can be done as:

      for (r = range; r.front != sentinel; r.popFront()) { auto e = r.front; }

If it's not clear, SentinelInputRange.front can be called *before* checking 
empty. If it is empty, then the sentinel is returned.

Feb 09 2013

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

On 2/9/13 3:49 PM, Walter Bright wrote:
 On 2/9/2013 12:43 PM, Walter Bright wrote:> Perhaps we can formalize
 this a bit. Define a SentinelInputRange, which has an
  > additional manifest constant with the name 'sentinel'. empty becomes
 defined as:
  >
  > empty = front == sentinel;
  >
  > popFront() does not advance if empty. Advancing over a
 SentinelInputRange can be
  > done the usual:
  >
  > for (r = range; !r.empty; r.popFront()) { auto e = r.front; }
  >
  > or can be done as:
  >
  > for (r = range; r.front != sentinel; r.popFront()) { auto e = r.front; }

 If it's not clear, SentinelInputRange.front can be called *before*
 checking empty. If it is empty, then the sentinel is returned.

Yah, that all would fit C-style strings and singly-linked lists like a 
glove. I've been long hoping the opportunity to define such a range 
would present itself :o).

Andrei

Feb 09 2013

"deadalnix" <deadalnix gmail.com> writes:

On Wednesday, 6 February 2013 at 03:51:33 UTC, Andrei 
Alexandrescu wrote:
 I think it would be reasonable for a lexer to require a range 
 of ubyte as input, and carry its own decoding. In the first 
 approximation it may even require a random-access range of 
 ubyte.

Playing around that, I discovered a bug in std.utf : slice and 
other range are not threated the same way by decodeFront, which 
is rather problematic. Jonathan also hit that bug : 
http://d.puremagic.com/issues/show_bug.cgi?id=9456

That make the lexer hard to write with a consistent behavior for 
InputRanges.

The bug probably exists in everything that rely on decodeFront at 
some point.

Feb 08 2013

Jonathan M Davis <jmdavisProg gmx.com> writes:

On Tuesday, February 05, 2013 01:07:48 Jonathan M Davis wrote:
 I don't have a lot of time to work on my lexer at the moment, but I'd really
 like to get it done soon, and I have most of the features in place.
 Unfortunately, when I went to try and work on it again the other day, the
 code wasn't compiling anymore, and I need to figure out why. I suspect that
 it's a regression related to string mixins, but I have to investigate
 further to sort it out.

It turns out that it has nothing to do with string mixins (though it does have 
to do with CTFE):

http://d.puremagic.com/issues/show_bug.cgi?id=9452

Fortunately, there's a simple workaround that'll let me continue until the bug 
is fixed.

- Jonathan M Davis

Feb 05 2013

Jacob Carlborg <doob me.com> writes:

On 2013-02-05 11:46, Jonathan M Davis wrote:

 It turns out that it has nothing to do with string mixins (though it does have
 to do with CTFE):

 http://d.puremagic.com/issues/show_bug.cgi?id=9452

 Fortunately, there's a simple workaround that'll let me continue until the bug
 is fixed.

Ok, good that it's not a blocker.

-- 
/Jacob Carlborg

Feb 05 2013

Dmitry Olshansky <dmitry.olsh gmail.com> writes:

On 02/05/2013 07:19 AM, Brian Schott wrote:
 More optimizing:
 http://hackerpilot.github.com/experimental/std_lexer/images/times2.png

 Still only half speed. I'm becoming more and more convinced that Walter
 is actually a wizard.

Time to do some hacking on your lexer I guess. I'll try add a couple of 
tricks and see if it helps.

What command do you use for benchmarking?

Feb 05 2013

"Brian Schott" <briancschott gmail.com> writes:

On Tuesday, 5 February 2013 at 08:52:53 UTC, Dmitry Olshansky
wrote:
 Time to do some hacking on your lexer I guess. I'll try add a 
 couple of tricks and see if it helps.

 What command do you use for benchmarking?

I've been using avgtime[1] for measuring run times, perf[2], and
callgrind/kcachegrind[3] for profiling.

[1] https://github.com/jmcabo/avgtime
[2] https://perf.wiki.kernel.org/index.php/Tutorial
[3] http://kcachegrind.sourceforge.net/html/Home.html

Feb 05 2013

Dmitry Olshansky <dmitry.olsh gmail.com> writes:

05-Feb-2013 22:25, Brian Schott пишет:
 On Tuesday, 5 February 2013 at 08:52:53 UTC, Dmitry Olshansky
 wrote:
 Time to do some hacking on your lexer I guess. I'll try add a couple
 of tricks and see if it helps.

 What command do you use for benchmarking?

 I've been using avgtime[1] for measuring run times, perf[2], and
 callgrind/kcachegrind[3] for profiling.

 [1] https://github.com/jmcabo/avgtime
 [2] https://perf.wiki.kernel.org/index.php/Tutorial
 [3] http://kcachegrind.sourceforge.net/html/Home.html

Thanks.
I've made a pass through the code and found some places to improve.
Sadly I've been rather busy at work today. Anyway get ready for pull 
requests should my ideas prove to be worthy performance-wise.

-- 
Dmitry Olshansky

Feb 05 2013

"Brian Schott" <briancschott gmail.com> writes:

On Tuesday, 5 February 2013 at 19:00:42 UTC, Dmitry Olshansky 
wrote:
 Thanks.
 I've made a pass through the code and found some places to 
 improve.
 Sadly I've been rather busy at work today. Anyway get ready for 
 pull requests should my ideas prove to be worthy 
 performance-wise.

http://hackerpilot.github.com/experimental/std_lexer/images/times3.png

Your suggestions seem to have worked. We're below 20ms on my 
machine for the datetime module.

Feb 08 2013

Jacob Carlborg <doob me.com> writes:

On 2013-02-08 16:08, Brian Schott wrote:

 http://hackerpilot.github.com/experimental/std_lexer/images/times3.png

 Your suggestions seem to have worked. We're below 20ms on my machine for
 the datetime module.

DMD is still consistently faster, :(

-- 
/Jacob Carlborg

Feb 08 2013

"Brian Schott" <briancschott gmail.com> writes:

On Friday, 8 February 2013 at 15:12:23 UTC, Jacob Carlborg wrote:
 DMD is still consistently faster, :(

We might be getting to the part where we say that code generated 
by gcc is still consistently faster than code generated by ldc.

Feb 08 2013

"Brian Schott" <briancschott gmail.com> writes:

On Friday, 8 February 2013 at 15:23:00 UTC, Brian Schott wrote:
 On Friday, 8 February 2013 at 15:12:23 UTC, Jacob Carlborg 
 wrote:
 DMD is still consistently faster, :(

 We might be getting to the part where we say that code 
 generated by gcc is still consistently faster than code 
 generated by ldc.

For the lulz I compiled with "dmd -release -inline -noboundscheck 
-O -m64 -property":

http://hackerpilot.github.com/experimental/std_lexer/images/times3-dmd.png

Yes, dmd vs ldc actually is a matter of 32 vs 18 ms for datetime.

Feb 08 2013

Dmitry Olshansky <dmitry.olsh gmail.com> writes:

08-Feb-2013 19:34, Brian Schott пишет:
 On Friday, 8 February 2013 at 15:23:00 UTC, Brian Schott wrote:
 On Friday, 8 February 2013 at 15:12:23 UTC, Jacob Carlborg wrote:
 DMD is still consistently faster, :(

 We might be getting to the part where we say that code generated by
 gcc is still consistently faster than code generated by ldc.

 For the lulz I compiled with "dmd -release -inline -noboundscheck -O
 -m64 -property":

 http://hackerpilot.github.com/experimental/std_lexer/images/times3-dmd.png

 Yes, dmd vs ldc actually is a matter of 32 vs 18 ms for datetime.

Would be intereesting to try gdc as dmd on linux uses gcc.

-- 
Dmitry Olshansky

Feb 08 2013

Iain Buclaw <ibuclaw ubuntu.com> writes:

On 8 February 2013 15:35, Dmitry Olshansky <dmitry.olsh gmail.com> wrote:

 08-Feb-2013 19:34, Brian Schott =D0=BF=D0=B8=D1=88=D0=B5=D1=82:

  On Friday, 8 February 2013 at 15:23:00 UTC, Brian Schott wrote:
 On Friday, 8 February 2013 at 15:12:23 UTC, Jacob Carlborg wrote:

 DMD is still consistently faster, :(

 We might be getting to the part where we say that code generated by
 gcc is still consistently faster than code generated by ldc.

 For the lulz I compiled with "dmd -release -inline -noboundscheck -O
 -m64 -property":

 http://hackerpilot.github.com/**experimental/std_lexer/images/**
 times3-dmd.png<http://hackerpilot.github.com/experimental/std_lexer/imag=


es/times3-dmd.png>
 Yes, dmd vs ldc actually is a matter of 32 vs 18 ms for datetime.

 Would be intereesting to try gdc as dmd on linux uses gcc.

What?  That's an outright fib. :-)


--=20
Iain Buclaw

*(p < e ? p++ : p) =3D (c & 0x0f) + '0';

Feb 08 2013

"Brian Schott" <briancschott gmail.com> writes:

On Friday, 8 February 2013 at 15:42:47 UTC, Iain Buclaw wrote:
 What?  That's an outright fib. :-)

I think he means that "on linux the dmd binary is compiled by gcc"

Feb 08 2013

Iain Buclaw <ibuclaw ubuntu.com> writes:

On 8 February 2013 15:46, Brian Schott <briancschott gmail.com> wrote:

 On Friday, 8 February 2013 at 15:42:47 UTC, Iain Buclaw wrote:

 What?  That's an outright fib. :-)

 I think he means that "on linux the dmd binary is compiled by gcc"

That's still lies. :o)


-- 
Iain Buclaw

*(p < e ? p++ : p) = (c & 0x0f) + '0';

Feb 08 2013

Dmitry Olshansky <dmitry.olsh gmail.com> writes:

08-Feb-2013 19:52, Iain Buclaw пишет:
 On 8 February 2013 15:46, Brian Schott <briancschott gmail.com
 <mailto:briancschott gmail.com>> wrote:

     On Friday, 8 February 2013 at 15:42:47 UTC, Iain Buclaw wrote:

         What?  That's an outright fib. :-)


     I think he means that "on linux the dmd binary is compiled by gcc"


 That's still lies. :o)

g++ ? :)


 --
 Iain Buclaw

 *(p < e ? p++ : p) = (c & 0x0f) + '0';


-- 
Dmitry Olshansky

Feb 08 2013

Iain Buclaw <ibuclaw ubuntu.com> writes:

On 8 February 2013 16:00, Dmitry Olshansky <dmitry.olsh gmail.com> wrote:

 08-Feb-2013 19:52, Iain Buclaw =D0=BF=D0=B8=D1=88=D0=B5=D1=82:

 On 8 February 2013 15:46, Brian Schott <briancschott gmail.com
 <mailto:briancschott gmail.com**>> wrote:

     On Friday, 8 February 2013 at 15:42:47 UTC, Iain Buclaw wrote:

         What?  That's an outright fib. :-)


     I think he means that "on linux the dmd binary is compiled by gcc"


 That's still lies. :o)

 g++ ? :)

I see we could be doing this all day. :=C3=BE

I'll lay down the hint, dmd compiles the source, not gcc. And although gcc
may be invoked during a certain special stage of compilation, its actually
just a driver to call a certain toolchain program that is outside of gcc.
:-)

--=20
Iain Buclaw

*(p < e ? p++ : p) =3D (c & 0x0f) + '0';

Feb 08 2013

"Brian Schott" <briancschott gmail.com> writes:

On Friday, 8 February 2013 at 16:38:00 UTC, Iain Buclaw wrote:
 I see we could be doing this all day. :þ

We could.

 I'll lay down the hint, dmd compiles the source, not gcc. And 
 although gcc
 may be invoked during a certain special stage of compilation, 
 its actually
 just a driver to call a certain toolchain program that is 
 outside of gcc.
 :-)

What we're saying is that dmd, The Digital Mars D Compiler, is 
written in C++ and is thus built by GCC/G++. We can tell by 
looking at the Makefile that DMD doesn't build itself.

Feb 08 2013

Iain Buclaw <ibuclaw ubuntu.com> writes:

On 8 February 2013 16:41, Brian Schott <briancschott gmail.com> wrote:

 On Friday, 8 February 2013 at 16:38:00 UTC, Iain Buclaw wrote:

 I see we could be doing this all day. :=FE

 We could.


  I'll lay down the hint, dmd compiles the source, not gcc. And although g=

cc
 may be invoked during a certain special stage of compilation, its actual=


ly
 just a driver to call a certain toolchain program that is outside of gcc=


.
 :-)

 What we're saying is that dmd, The Digital Mars D Compiler, is written in
 C++ and is thus built by GCC/G++. We can tell by looking at the Makefile
 that DMD doesn't build itself.

That still has nothing to do with how dmd links D programs. ;)


--=20
Iain Buclaw

*(p < e ? p++ : p) =3D (c & 0x0f) + '0';

Feb 08 2013

Dmitry Olshansky <dmitry.olsh gmail.com> writes:

08-Feb-2013 20:51, Iain Buclaw пишет:
 On 8 February 2013 16:41, Brian Schott <briancschott gmail.com
 <mailto:briancschott gmail.com>> wrote:

     On Friday, 8 February 2013 at 16:38:00 UTC, Iain Buclaw wrote:

         I see we could be doing this all day. :þ


     We could.


         I'll lay down the hint, dmd compiles the source, not gcc. And
         although gcc
         may be invoked during a certain special stage of compilation,
         its actually
         just a driver to call a certain toolchain program that is
         outside of gcc.
         :-)


     What we're saying is that dmd, The Digital Mars D Compiler, is
     written in C++ and is thus built by GCC/G++. We can tell by looking
     at the Makefile that DMD doesn't build itself.


 That still has nothing to do with how dmd links D programs. ;)

We've been discussing DMD's lexer written in C++ that is obviously 
compiled using the default compiler on target platform. That's what is 
being measured the good ol' C++ lexer vs D lexer. I don't see how the 
way DMD does linking is related to what tool is used to actually build it.


-- 
Dmitry Olshansky

Feb 08 2013

"SomeDude" <lovelydear mailmetrash.com> writes:

On Friday, 8 February 2013 at 16:00:05 UTC, Dmitry Olshansky 
wrote:
 08-Feb-2013 19:52, Iain Buclaw пишет:

 That's still lies. :o)

 g++ ? :)


 --
 Iain Buclaw

 *(p < e ? p++ : p) = (c & 0x0f) + '0';


I'd think it's built by DMC++, Digital Mars C++ compiler.

Feb 09 2013

Dmitry Olshansky <dmitry.olsh gmail.com> writes:

09-Feb-2013 23:20, SomeDude пишет:
 On Friday, 8 February 2013 at 16:00:05 UTC, Dmitry Olshansky wrote:
 08-Feb-2013 19:52, Iain Buclaw пишет:

 That's still lies. :o)

 g++ ? :)


 --
 Iain Buclaw

 *(p < e ? p++ : p) = (c & 0x0f) + '0';


 I'd think it's built by DMC++, Digital Mars C++ compiler.

On linux, sure ;)

-- 
Dmitry Olshansky

Feb 09 2013

"Brian Schott" <briancschott gmail.com> writes:

On Friday, 8 February 2013 at 15:35:55 UTC, Dmitry Olshansky 
wrote:
 Would be intereesting to try gdc as dmd on linux uses gcc.

GDC decided to randomly not fail to build on my system, so it's 
time for MOAR CHARTS.

dmd -inline -noboundscheck -O -w -wi -m64 -property
ldc2 -O2 -release -vectorize -m64
gdc -O3 -fno-bounds-check -frelease -m64
http://hackerpilot.github.com/experimental/std_lexer/images/times3-all.png

Feb 08 2013

"Brian Schott" <briancschott gmail.com> writes:

On Friday, 8 February 2013 at 16:35:50 UTC, Brian Schott wrote:
 dmd -inline -noboundscheck -O -w -wi -m64 -property

Copy/pate fail. That's actually "dmd -release -inline 
-noboundscheck -O -w -wi -m64 -property"

Feb 08 2013

Iain Buclaw <ibuclaw ubuntu.com> writes:

On 8 February 2013 16:35, Brian Schott <briancschott gmail.com> wrote:

 On Friday, 8 February 2013 at 15:35:55 UTC, Dmitry Olshansky wrote:

 Would be intereesting to try gdc as dmd on linux uses gcc.

 GDC decided to randomly not fail to build on my system, so it's time for
 MOAR CHARTS.

 dmd -inline -noboundscheck -O -w -wi -m64 -property
 ldc2 -O2 -release -vectorize -m64
 gdc -O3 -fno-bounds-check -frelease -m64
 http://hackerpilot.github.com/**experimental/std_lexer/images/**
 times3-all.png<http://hackerpilot.github.com/experimental/std_lexer/images/times3-all.png>

Cool.  How do you determine the speed times?


-- 
Iain Buclaw

*(p < e ? p++ : p) = (c & 0x0f) + '0';

Feb 08 2013

FG <home fgda.pl> writes:

On 2013-02-08 17:35, Brian Schott wrote:
 GDC decided to randomly not fail to build on my system, so it's time for MOAR
 CHARTS.

Interesting. Seems that LDC is faster than GDC but has a bigger start overhead.

OT: Why is the chart's legend in reverse order? :P

Feb 08 2013

"David Nadlinger" <see klickverbot.at> writes:

On Friday, 8 February 2013 at 20:41:52 UTC, FG wrote:
 On 2013-02-08 17:35, Brian Schott wrote:
 GDC decided to randomly not fail to build on my system, so 
 it's time for MOAR
 CHARTS.

 Interesting. Seems that LDC is faster than GDC but has a bigger 
 start overhead.

 OT: Why is the chart's legend in reverse order? :P

It seems like that, but the differences are small enough that you 
can't make reliable statements without having a closer look at 
the statistics.

David,
who dreams of a world in which every graph has (at least) error 
bars

Feb 08 2013

dennis luehring <dl.soluz gmx.net> writes:

Am 08.02.2013 17:35, schrieb Brian Schott:
 On Friday, 8 February 2013 at 15:35:55 UTC, Dmitry Olshansky
 wrote:
 Would be intereesting to try gdc as dmd on linux uses gcc.

 GDC decided to randomly not fail to build on my system, so it's
 time for MOAR CHARTS.

 dmd -inline -noboundscheck -O -w -wi -m64 -property
 ldc2 -O2 -release -vectorize -m64
 gdc -O3 -fno-bounds-check -frelease -m64
 http://hackerpilot.github.com/experimental/std_lexer/images/times3-all.png

i still don't get what you comparing here

you benchmarking something like an mixup of code-generation, 
startup-times and filesystem behavior + lexing time (+ lexing output style)

and this all makes only sense if you lexer produces the very same output 
(so it can be out of the box used as an replacement) as the dmd lexer - 
for beeing compareable

or aren't you in the phase of detail benchmarking/profiling?

Feb 08 2013

Jacob Carlborg <doob me.com> writes:

On 2013-02-08 17:35, Brian Schott wrote:
 On Friday, 8 February 2013 at 15:35:55 UTC, Dmitry Olshansky wrote:
 Would be intereesting to try gdc as dmd on linux uses gcc.

 GDC decided to randomly not fail to build on my system, so it's time for
 MOAR CHARTS.

 dmd -inline -noboundscheck -O -w -wi -m64 -property
 ldc2 -O2 -release -vectorize -m64
 gdc -O3 -fno-bounds-check -frelease -m64
 http://hackerpilot.github.com/experimental/std_lexer/images/times3-all.png

Do you have the exact code you test available somewhere?

-- 
/Jacob Carlborg

Feb 09 2013

Dmitry Olshansky <dmitry.olsh gmail.com> writes:

08-Feb-2013 19:12, Jacob Carlborg пишет:
 On 2013-02-08 16:08, Brian Schott wrote:

 http://hackerpilot.github.com/experimental/std_lexer/images/times3.png

 Your suggestions seem to have worked. We're below 20ms on my machine for
 the datetime module.

 DMD is still consistently faster, :(

Keep in mind the D runtime start-up cost. In the end on small files DMD 
always wins because of slim C-runtime.

To estimate runtime startup lag I've compared run times of

int main(){ return 0; } in C (gcc):

------------------------
Total time (ms): 31637.7
Repetitions    : 35000
Sample mode    : 0.8 (23649 ocurrences)
Median time    : 0.873
Avg time       : 0.903935
Std dev.       : 0.204107
Minimum        : 0.552
Maximum        : 9.057
95% conf.int.  : [0.503892, 1.30398]  e = 0.400043
99% conf.int.  : [0.378189, 1.42968]  e = 0.525746
EstimatedAvg95%: [0.901796, 0.906073]  e = 0.00213832
EstimatedAvg99%: [0.901125, 0.906745]  e = 0.00281023

void main(){} in D (ldc):
------------------------
Total time (ms): 15429.4
Repetitions    : 7000
Sample mode    : 2.1 (1785 ocurrences)
Median time    : 2.128
Avg time       : 2.2042
Std dev.       : 0.466834
Minimum        : 1.286
Maximum        : 19.933
95% conf.int.  : [1.28922, 3.11918]  e = 0.914978
99% conf.int.  : [1.00171, 3.40668]  e = 1.20248
EstimatedAvg95%: [2.19326, 2.21514]  e = 0.0109361
EstimatedAvg99%: [2.18983, 2.21857]  e = 0.0143724
dmitry dmitry-VirtualBox ~ $

I think that the mode could serve as an indicator of the most probable 
outcome.
For me it's around 0.8 vs 2.1 ms. D runtime rides on top of C one + plus 
the same OS costs to launch a process, so from now on let's keep in mind 
these extra ~ 1.3 ms of bias in favor of C-based lexer.


-- 
Dmitry Olshansky

Feb 08 2013

Jacob Carlborg <doob me.com> writes:

On 2013-02-08 16:28, Dmitry Olshansky wrote:
 08-Feb-2013 19:12, Jacob Carlborg пишет:
 On 2013-02-08 16:08, Brian Schott wrote:

 http://hackerpilot.github.com/experimental/std_lexer/images/times3.png

 Your suggestions seem to have worked. We're below 20ms on my machine for
 the datetime module.

 DMD is still consistently faster, :(

 Keep in mind the D runtime start-up cost. In the end on small files DMD
 always wins because of slim C-runtime.

 To estimate runtime startup lag I've compared run times of

 int main(){ return 0; } in C (gcc):

 ------------------------
 Total time (ms): 31637.7
 Repetitions    : 35000
 Sample mode    : 0.8 (23649 ocurrences)
 Median time    : 0.873
 Avg time       : 0.903935
 Std dev.       : 0.204107
 Minimum        : 0.552
 Maximum        : 9.057
 95% conf.int.  : [0.503892, 1.30398]  e = 0.400043
 99% conf.int.  : [0.378189, 1.42968]  e = 0.525746
 EstimatedAvg95%: [0.901796, 0.906073]  e = 0.00213832
 EstimatedAvg99%: [0.901125, 0.906745]  e = 0.00281023

 void main(){} in D (ldc):
 ------------------------
 Total time (ms): 15429.4
 Repetitions    : 7000
 Sample mode    : 2.1 (1785 ocurrences)
 Median time    : 2.128
 Avg time       : 2.2042
 Std dev.       : 0.466834
 Minimum        : 1.286
 Maximum        : 19.933
 95% conf.int.  : [1.28922, 3.11918]  e = 0.914978
 99% conf.int.  : [1.00171, 3.40668]  e = 1.20248
 EstimatedAvg95%: [2.19326, 2.21514]  e = 0.0109361
 EstimatedAvg99%: [2.18983, 2.21857]  e = 0.0143724
 dmitry dmitry-VirtualBox ~ $

 I think that the mode could serve as an indicator of the most probable
 outcome.
 For me it's around 0.8 vs 2.1 ms. D runtime rides on top of C one + plus
 the same OS costs to launch a process, so from now on let's keep in mind
 these extra ~ 1.3 ms of bias in favor of C-based lexer.

Ok, I see. But wouldn't it be more fair to compare either GCC to GDC or 
Clang to LDC.

-- 
/Jacob Carlborg

Feb 08 2013

Walter Bright <newshound2 digitalmars.com> writes:

On 2/4/2013 7:19 PM, Brian Schott wrote:
 Still only half speed. I'm becoming more and more convinced that Walter is
 actually a wizard.

I worship the Dark Arts.

Feb 08 2013

Iain Buclaw <ibuclaw ubuntu.com> writes:

On Feb 9, 2013 8:01 AM, "Walter Bright" <newshound2 digitalmars.com> wrote:
 On 2/4/2013 7:19 PM, Brian Schott wrote:
 Still only half speed. I'm becoming more and more convinced that Walter


is
 actually a wizard.


 I worship the Dark Arts.

More like cursed the god's when you wrote it. :-)

Feb 09 2013

"Brian Schott" <briancschott gmail.com> writes:

On Saturday, 9 February 2013 at 07:55:04 UTC, Walter Bright wrote:
 On 2/4/2013 7:19 PM, Brian Schott wrote:
 Still only half speed. I'm becoming more and more convinced 
 that Walter is
 actually a wizard.

 I worship the Dark Arts.

*recites incantations*
*merges pull requests*
*adds unit tests*

http://hackerpilot.github.com/experimental/std_lexer/images/times4.png

Feb 28 2013

Andrej Mitrovic <andrej.mitrovich gmail.com> writes:

On 2/28/13, Brian Schott <briancschott gmail.com> wrote:
 http://hackerpilot.github.com/experimental/std_lexer/images/times4.png

That's a nice plot, but what is the X axis?

Feb 28 2013

FG <home fgda.pl> writes:

On 2013-02-28 16:34, Andrej Mitrovic wrote:
 On 2/28/13, Brian Schott <briancschott gmail.com> wrote:
 http://hackerpilot.github.com/experimental/std_lexer/images/times4.png

 That's a nice plot, but what is the X axis?

Lexing time in milliseconds.

Feb 28 2013

Dmitry Olshansky <dmitry.olsh gmail.com> writes:

28-Feb-2013 13:09, Brian Schott пишет:
 On Saturday, 9 February 2013 at 07:55:04 UTC, Walter Bright wrote:
 On 2/4/2013 7:19 PM, Brian Schott wrote:
 Still only half speed. I'm becoming more and more convinced that
 Walter is
 actually a wizard.

 I worship the Dark Arts.

 *recites incantations*
 *merges pull requests*
 *adds unit tests*

 http://hackerpilot.github.com/experimental/std_lexer/images/times4.png

Looking far better now. Keep in mind that we still don't use the 
original dmd's sentinel magic to avoid length check in std.d.lexer :)

Being that wizard who gave a couple of scrolls to Brain I'll outline 
some interesting data points collected while helping him out.

- D run-time takes some time to start up/shut down. It's on the order of 
1ms for me vs 1/10th of ms for plain C
- expanding on the previous point - GC.malloc takes sometime and 
triggers collections from time to time (2 collections to lex datetime, 
there is next to no garbage, but they run regardless).

Even simply disabling GC at first and re-enabling it at the end of 
program speeds up the whole time by roughly 5% on my machine.

- opApply is definitely slower then inlined range-based foreach loop 
even with scope delegate.

Other then this specifics the prime spells that give the performance are:

- don't use built-in AA or subtle allocations (avoid GC as far as you can)

- lean common path, keep there only the operations that are required in 
*every* code-path

- avoid ever doing the same work twice (redesign, hack and whatnot but 
don't do it)


-- 
Dmitry Olshansky

Feb 28 2013

Dmitry Olshansky <dmitry.olsh gmail.com> writes:

28-Feb-2013 21:27, Dmitry Olshansky пишет:
 28-Feb-2013 13:09, Brian Schott пишет:
 On Saturday, 9 February 2013 at 07:55:04 UTC, Walter Bright wrote:
 On 2/4/2013 7:19 PM, Brian Schott wrote:
 Still only half speed. I'm becoming more and more convinced that
 Walter is
 actually a wizard.

 I worship the Dark Arts.

 *recites incantations*
 *merges pull requests*
 *adds unit tests*

 http://hackerpilot.github.com/experimental/std_lexer/images/times4.png

 Looking far better now. Keep in mind that we still don't use the
 original dmd's sentinel magic to avoid length check in std.d.lexer :)

 Being that wizard who gave a couple of scrolls to Brain I'll outline
 some interesting data points collected while helping him out.

To clarify these are problems I observed but haven't solved/avoided 
completely:

 - D run-time takes some time to start up/shut down. It's on the order of
 1ms for me vs 1/10th of ms for plain C
 - expanding on the previous point - GC.malloc takes sometime and
 triggers collections from time to time (2 collections to lex datetime,
 there is next to no garbage, but they run regardless).

And this one below is an experiment and not is not related to the 
numbers posted still.

 Even simply disabling GC at first and re-enabling it at the end of
 program speeds up the whole time by roughly 5% on my machine.


-- 
Dmitry Olshansky

Feb 28 2013

D Programming

C/C++ Programming

Other

digitalmars.D - Request for comments: std.d.lexer