www.digitalmars.com         C & C++   DMDScript  

digitalmars.D - Request for comments: std.d.lexer

reply "Brian Schott" <briancschott gmail.com> writes:
I'm writing a D lexer for possible inclusion in Phobos.

DDOC: 
http://hackerpilot.github.com/experimental/std_lexer/phobos/lexer.html
Code: 
https://github.com/Hackerpilot/Dscanner/blob/range-based-lexer/std/d/lexer.d

It's currently able to correctly syntax highlight all of Phobos, 
but does a fairly bad job at rejecting or notifying users/callers 
about invalid input.

I'd like to hear arguments on the various ways to handle errors 
in the lexer. In a compiler it would be useful to throw an 
exception on finding something like a string literal that doesn't 
stop before EOF, but a text editor or IDE would probably want to 
be a bit more lenient. Maybe having it run-time (or compile-time 
configurable) like std.csv would be the best option here.

I'm interested in ideas on the API design and other high-level 
issues at the moment. I don't consider this ready for inclusion. 
(The current module being reviewed for inclusion in Phobos is the 
new std.uni.)
Jan 27 2013
next sibling parent reply Philippe Sigaud <philippe.sigaud gmail.com> writes:
On Sun, Jan 27, 2013 at 10:51 AM, Brian Schott <briancschott gmail.com> wrote:
 I'm writing a D lexer for possible inclusion in Phobos.

 DDOC: http://hackerpilot.github.com/experimental/std_lexer/phobos/lexer.html
 Code:
 https://github.com/Hackerpilot/Dscanner/blob/range-based-lexer/std/d/lexer.d

Cool! I remember linking to it in the wiki a week ago: Here: http://wiki.dlang.org/Lexers_Parsers Feel free to correct the entry.
 It's currently able to correctly syntax highlight all of Phobos, but does a
 fairly bad job at rejecting or notifying users/callers about invalid input.

 I'd like to hear arguments on the various ways to handle errors in the
 lexer. In a compiler it would be useful to throw an exception on finding
 something like a string literal that doesn't stop before EOF, but a text
 editor or IDE would probably want to be a bit more lenient. Maybe having it
 run-time (or compile-time configurable) like std.csv would be the best
 option here.

Last time we discussed it, IIRC, some people wanted the lexer to stop at once, other just wanted an Error token. I personally prefer an Error token, but that means finding a way to start lexing again after the error (and hence, finding where the error ends). I guess any separator/terminator could be used to re-engage the lexer: space, semicolon, closing brace, closing parenthesis?
 I'm interested in ideas on the API design and other high-level issues at the
 moment. I don't consider this ready for inclusion. (The current module being
 reviewed for inclusion in Phobos is the new std.uni.)

OK, here are a few questions: * Having a range interface is good. Any reason why you made byToken a class and not a struct? Most (like, 99%) of range in Phobos are structs. Do you need reference semantics? * Also, is there a way to keep comments? Any code wanting the modify the code might need them. (edit: Ah, I see it: IterationStyle.IncludeComments) * I'd distinguish between standard comments and documentation comments. These are different beasts, to my eyes. * I see Token has a startIndex member. Any reason not to have a endIndex member? Or can and end index always be deduced from startIndex and value.length? * How does it fare with non ASCII code? * A rough estimate of number of tokens/s would be good (I know it'll vary). Walter seems to think if a lexer is not able to vomit thousands of tokens a seconds, then it's not good. On a related note, does your lexer have any problem with 10k+-lines files?
Jan 27 2013
next sibling parent reply Timon Gehr <timon.gehr gmx.ch> writes:
On 01/27/2013 11:42 AM, Brian Schott wrote:
 On Sunday, 27 January 2013 at 10:17:48 UTC, Philippe Sigaud wrote:
 * Having a range interface is good. Any reason why you made byToken a
 class and not a struct? Most (like, 99%) of range in Phobos are
 structs. Do you need reference semantics?

It implements the InputRange interface from std.range so that users have a choice of using template constraints or the OO model in their code. ...

The lexer range must be a struct.
 ...
 * A rough estimate of number of tokens/s would be good (I know it'll
 vary). Walter seems to think if a lexer is not able to vomit thousands
 of tokens a seconds, then it's not good. On a related note, does your
 lexer have any problem with 10k+-lines files?

$ time dscanner --sloc ../phobos/std/datetime.d 14950 real 0m0.319s user 0m0.313s sys 0m0.006s $ time dmd -c ../phobos/std/datetime.d real 0m0.354s user 0m0.318s sys 0m0.036s Yes, I know that "time" is a terrible benchmarking tool, but they're fairly close for whatever that's worth.

You are measuring lexing speed against compilation speed. A reasonably well performing lexer is around one order of magnitude faster on std.datetime. Maybe you should profile a little?
Jan 27 2013
parent reply dennis luehring <dl.soluz gmx.net> writes:
Am 27.01.2013 13:31, schrieb Johannes Pfau:
 Am Sun, 27 Jan 2013 12:38:33 +0100
 schrieb Timon Gehr <timon.gehr gmx.ch>:

 On 01/27/2013 11:42 AM, Brian Schott wrote:
 On Sunday, 27 January 2013 at 10:17:48 UTC, Philippe Sigaud wrote:
 * Having a range interface is good. Any reason why you made
 byToken a class and not a struct? Most (like, 99%) of range in
 Phobos are structs. Do you need reference semantics?

It implements the InputRange interface from std.range so that users have a choice of using template constraints or the OO model in their code. ...

The lexer range must be a struct.
 ...
 * A rough estimate of number of tokens/s would be good (I know
 it'll vary). Walter seems to think if a lexer is not able to vomit
 thousands of tokens a seconds, then it's not good. On a related
 note, does your lexer have any problem with 10k+-lines files?

$ time dscanner --sloc ../phobos/std/datetime.d 14950 real 0m0.319s user 0m0.313s sys 0m0.006s $ time dmd -c ../phobos/std/datetime.d real 0m0.354s user 0m0.318s sys 0m0.036s Yes, I know that "time" is a terrible benchmarking tool, but they're fairly close for whatever that's worth.

You are measuring lexing speed against compilation speed. A reasonably well performing lexer is around one order of magnitude faster on std.datetime. Maybe you should profile a little?

Profiling is always a good idea, but to be fair: His dmd was probably compiled with gcc and -O2 if it's a normal release build. So to compare that he should use gdc, -O2 -release -fno-bounds-check and probably more flags to compile the d code.

that makes no sense - we need a tiny piece of benchmark code inside of dmd frontend (and gdc) - these results are the only reliable/compareable benchmark someone knows the place where such benchmarking can take place in the dmd frontend code?
Jan 27 2013
parent dennis luehring <dl.soluz gmx.net> writes:
 that makes no sense - we need a tiny piece of benchmark code inside of
 dmd frontend (and gdc) - these results are the only reliable/compareable
 benchmark

 someone knows the place where such benchmarking can take place in the
 dmd frontend code?

the dmd lexer is in https://github.com/D-Programming-Language/dmd/blob/master/src/lexer.c with an small unit test
Jan 27 2013
prev sibling next sibling parent Jacob Carlborg <doob me.com> writes:
On 2013-01-27 11:42, Brian Schott wrote:

 * I see Token has a startIndex member. Any reason not to have a
 endIndex member? Or can and end index always be deduced from
 startIndex and value.length?

That's the idea.

Always good to try and minimize the size of a token when the lexer should output thousands of tokens per second. -- /Jacob Carlborg
Jan 27 2013
prev sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 1/27/2013 2:17 AM, Philippe Sigaud wrote:
 Walter seems to think if a lexer is not able to vomit thousands
 of tokens a seconds, then it's not good.

Speed is critical for a lexer. This means, for example, you'll need to squeeze pretty much all storage allocation out of it. A lexer that does an allocation per token will is not going to do very well at all.
Jan 27 2013
next sibling parent reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
27-Jan-2013 23:48, Walter Bright пишет:
 On 1/27/2013 2:17 AM, Philippe Sigaud wrote:
 Walter seems to think if a lexer is not able to vomit thousands
 of tokens a seconds, then it's not good.

Speed is critical for a lexer. This means, for example, you'll need to squeeze pretty much all storage allocation out of it. A lexer that does an allocation per token will is not going to do very well at all.

I concur. One of the biggest reason* there is a separate lexer step is because it could be made to do this stage very-very fast. Then the rest of the parser will greatly benefit from this underlying speed. *Otherwise we could have just as well add the lexer stage as simple rules to the grammar that treats all of codepoints as terminals. -- Dmitry Olshansky
Jan 27 2013
next sibling parent reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
28-Jan-2013 00:18, Philippe Sigaud пишет:
 I concur. One of the biggest reason* there is a separate lexer step is
 because it could be made to do this stage very-very fast. Then the rest of
 the parser will greatly benefit from this underlying speed.

 *Otherwise we could have just as well add the lexer stage as simple rules to
 the grammar that treats all of codepoints as terminals.

... which is exactly what parsing expression grammars (and other scannerless parsers) do.

That is only a happenstance and is highly overrated. What does it buy you is the right question to ask. After all every LL-parser can be written avoiding notion of lexer and strangely I see no fuss about it. So is it just hype? I dunno.
 AFAICT, one interesting consequence is the ability to have composition
 of grammars, which I sure have with Pegged. But maybe it's not related
 that much, that's not something I stopped to think about.
 In any case, grammar composition is something I've learnt to like quite a lot.

You can still have composability of grammars. In fact I'd define a couple of a less powerful but better optimized variations if I were you. PEG-style memoization is still possible as long as underlying grammars are context-free all the useful usually are. LR hardly composable ( I haven't seen a single one that does). But all top-down one always are unless there is some explicit work spent to undermine this ability ;) -- Dmitry Olshansky
Jan 27 2013
parent Dmitry Olshansky <dmitry.olsh gmail.com> writes:
28-Jan-2013 00:45, Philippe Sigaud пишет:
[snip]
 AFAICT, one interesting consequence is the ability to have composition
 of grammars, which I sure have with Pegged. But maybe it's not related
 that much, that's not something I stopped to think about.
 In any case, grammar composition is something I've learnt to like quite a
 lot.

You can still have composability of grammars. In fact I'd define a couple of a less powerful but better optimized variations if I were you.

Yes, getting an automaton to deal with the regular part would be good. But then, there already is std.regex ;)

Reminds me to tweak/refactor/optimize CT-part of it. [snip]
 What I still can't get an easy answer on is: is the D grammar LR(1),
 LR(0), LALR?

I *think* you can shoehorn it into any one of these: LALR(1), LL(k)+ some lookahead, LL(*) a generalization of it, or PEG. As usual you might need semantic predicates to iron out some quirks though. We are getting OT quick ;) The hint is that your question is a bit faulty: by calling it "the D grammar" do you mean the exact one listed on the website or any equivalent that parses the same language (including the ones obtained by simple transformations)? -- Dmitry Olshansky
Jan 27 2013
prev sibling parent Walter Bright <newshound2 digitalmars.com> writes:
On 1/27/2013 12:18 PM, Philippe Sigaud wrote:
 I concur. One of the biggest reason* there is a separate lexer step is
 because it could be made to do this stage very-very fast. Then the rest of
 the parser will greatly benefit from this underlying speed.

 *Otherwise we could have just as well add the lexer stage as simple rules to
 the grammar that treats all of codepoints as terminals.

... which is exactly what parsing expression grammars (and other scannerless parsers) do. AFAICT, one interesting consequence is the ability to have composition of grammars, which I sure have with Pegged. But maybe it's not related that much, that's not something I stopped to think about. In any case, grammar composition is something I've learnt to like quite a lot.

If you do such a thing when designing a language, I guarantee that you will find it impossible to resist having custom "tokens" in various parts of the grammar. This will make it impossible to write tools that only need to tokens, such as those syntax highlighting editors that really are token highlighters.
Jan 27 2013
prev sibling next sibling parent Paulo Pinto <pjmlp progtools.org> writes:
Am 27.01.2013 20:48, schrieb Walter Bright:
 On 1/27/2013 2:17 AM, Philippe Sigaud wrote:
 Walter seems to think if a lexer is not able to vomit thousands
 of tokens a seconds, then it's not good.

Speed is critical for a lexer. This means, for example, you'll need to squeeze pretty much all storage allocation out of it. A lexer that does an allocation per token will is not going to do very well at all.

I second that.
Jan 27 2013
prev sibling next sibling parent reply Jacob Carlborg <doob me.com> writes:
On 2013-01-27 21:15, Philippe Sigaud wrote:

 This means, for example, you'll need to squeeze pretty much all storage
 allocation out of it. A lexer that does an allocation per token will is not
 going to do very well at all.

How does one do that? Honest question, I'm not really concerned by extreme speed, most of the time, and have lots to learn in this area.

If we're talking about dynamic allocation you can make sure you're just using value types. A token could look like: struct Token { TokenKind kind; string value; uint index; } For the "value" you could just slice the buffer. But I think this will prevent the whole buffer from being collected. -- /Jacob Carlborg
Jan 27 2013
parent FG <home fgda.pl> writes:
On 2013-01-28 08:47, Jacob Carlborg wrote:
 If we're talking about dynamic allocation you can make sure you're just using
 value types. A token could look like:

 struct Token
 {
      TokenKind kind;
      string value;
      uint index;
 }

 For the "value" you could just slice the buffer. But I think this will prevent
 the whole buffer from being collected.

I was also thinking about using slices to limit string allocation. So far the combined size of source files in D projects is so small, that it wouldn't hurt to mmap the files and slice them. It is possible though that someone would create a huge file, even if only just to see this program crash. :) In that case something else may be useful. Allocate special arrays for holding value strings, for example 256 kB per array. Token.value will be a slice of such array. Additionally have a lookup Trie to help reuse repeating values - if a string is already in an array, the Trie leaf will store its position (slice) and the token will only have to copy that info. If the lookup doesn't turn up the string, the string will be added to the end of the array using Appender or, if it doesn't fit, a new array will be created.
Jan 28 2013
prev sibling next sibling parent reply Jacob Carlborg <doob me.com> writes:
On 2013-01-28 10:03, Mehrdad wrote:

 Semi-unrelated question -- how would you benchmark a _parser_?

 Is there any way to get a _number_ as an answer, or is comparing against
 a rival parser the only way of benchmarking a parser?

It should always be possible to get a number, but it would be hard to determine if that's a good number or not without comparing with anything else. -- /Jacob Carlborg
Jan 28 2013
parent reply Jacob Carlborg <doob me.com> writes:
On 2013-01-28 10:09, Mehrdad wrote:

 Sorry I think my question was vague -- what I meant was, how do you
 measure it, i.e. in what units?

 Lines per second? Tokens per second? Syntax nodes per second? etc.

The easy way out would be to just measure how long time it takes to parse a given piece of code and compare with some other parse using the same code. -- /Jacob Carlborg
Jan 28 2013
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 1/28/2013 1:30 AM, Mehrdad wrote:
 I'm wondering if there's any way to make it independent of the code/grammar

Just run under a profiler, then fix the hot spots.
Jan 28 2013
parent reply dennis luehring <dl.soluz gmx.net> writes:
Am 28.01.2013 11:13, schrieb Walter Bright:
 On 1/28/2013 1:30 AM, Mehrdad wrote:
 I'm wondering if there's any way to make it independent of the code/grammar

Just run under a profiler, then fix the hot spots.

mehrdad speaks about benchmarking not optimization :)
Jan 28 2013
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 1/28/2013 2:16 AM, dennis luehring wrote:
 Am 28.01.2013 11:13, schrieb Walter Bright:
 On 1/28/2013 1:30 AM, Mehrdad wrote:
 I'm wondering if there's any way to make it independent of the code/grammar

Just run under a profiler, then fix the hot spots.

mehrdad speaks about benchmarking not optimization :)

I know. But profiling is how you make it fast. Lexers should be as fast as possible, not merely faster than the competition.
Jan 28 2013
parent Walter Bright <newshound2 digitalmars.com> writes:
On 1/28/2013 3:34 AM, Mehrdad wrote:
 On Monday, 28 January 2013 at 11:01:49 UTC, Walter Bright wrote:
 On 1/28/2013 2:16 AM, dennis luehring wrote:
 Am 28.01.2013 11:13, schrieb Walter Bright:
 On 1/28/2013 1:30 AM, Mehrdad wrote:
 I'm wondering if there's any way to make it independent of the code/grammar

Just run under a profiler, then fix the hot spots.

mehrdad speaks about benchmarking not optimization :)

I know. But profiling is how you make it fast. Lexers should be as fast as possible, not merely faster than the competition.

I was talking about parsers actually :| on a second thought maybe I shouldn't have asked that on a thread about lexing...

Same comment applies.
Jan 28 2013
prev sibling next sibling parent Walter Bright <newshound2 digitalmars.com> writes:
On 1/27/2013 12:15 PM, Philippe Sigaud wrote:
 On Sun, Jan 27, 2013 at 8:48 PM, Walter Bright
 <newshound2 digitalmars.com> wrote:
 On 1/27/2013 2:17 AM, Philippe Sigaud wrote:
 Walter seems to think if a lexer is not able to vomit thousands
 of tokens a seconds, then it's not good.

Speed is critical for a lexer.

Something I never tought about: given a 10k lines module, how many tokens does that represent, roughly. Ten times as much?

I don't know.
 This means, for example, you'll need to squeeze pretty much all storage
 allocation out of it. A lexer that does an allocation per token will is not
 going to do very well at all.

How does one do that? Honest question, I'm not really concerned by extreme speed, most of the time, and have lots to learn in this area.

Just study the dmd lexer.c source code.
Jan 28 2013
prev sibling parent reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
28-Jan-2013 15:48, Johannes Pfau пишет:
 Am Sun, 27 Jan 2013 11:48:23 -0800
 schrieb Walter Bright <newshound2 digitalmars.com>:

 On 1/27/2013 2:17 AM, Philippe Sigaud wrote:
 Walter seems to think if a lexer is not able to vomit thousands
 of tokens a seconds, then it's not good.

Speed is critical for a lexer. This means, for example, you'll need to squeeze pretty much all storage allocation out of it.

But to be fair that doesn't fit ranges very well. If you don't want to do any allocation but still keep identifiers etc in memory this basically means you have to keep the whole source in memory and this is conceptually an array and not a range.

Not the whole source but to construct a table of all identifiers. The source is awfully redundant because of repeated identifiers, spaces, comments and what not. The set of unique identifiers is rather small. I think the best course of action is to just provide a hook to trigger on every identifier encountered. That could be as discussed earlier a delegate.
 But you can of course write a lexer which accepts buffered ranges and
 does some allocation for those and is special cased for arrays to not
 allocate at all. (Unbuffered ranges should be supported using a
 generic BufferedRange)

-- Dmitry Olshansky
Jan 28 2013
parent reply Timon Gehr <timon.gehr gmx.ch> writes:
On 01/28/2013 12:59 PM, Dmitry Olshansky wrote:
 28-Jan-2013 15:48, Johannes Pfau пишет:
 ...

 But to be fair that doesn't fit ranges very well. If you don't want to
 do any allocation but still keep identifiers etc in memory this
 basically means you have to keep the whole source in memory and this is
 conceptually an array and not a range.

Not the whole source but to construct a table of all identifiers. The source is awfully redundant because of repeated identifiers, spaces, comments and what not. The set of unique identifiers is rather small.

Source code is usually small. (Even std.datetime has 'only' 1.6 MB.) My own lexer-parser combination slices directly into the original source code, for every token, in order to remember the exact source location, and the last time I have measured, it ran faster than DMD's. I keep the source around for error reporting anyway: tt.d:27:5: error: no member 'n' for type 'A' a.n=2; ^~~ Since the tokens point directly into the source code, it is not necessary to construct any other data structures in order to allow fast retrieval of the appropriate source code line. But it's clear that a general-purpose library might not want to impose this restriction on storage upon it's clients. I think it is somewhat helpful for speed though. The other thing I do is buffering tokens in a contiguous ring buffer that grows if a lot of lookahead is requested.
 I think the best course of action is to just provide a hook to trigger
 on every identifier encountered. That could be as discussed earlier a
 delegate.

 ...

Maybe. I map identifiers to unique id's later, in the identifier AST node constructor, though.
Jan 28 2013
next sibling parent Walter Bright <newshound2 digitalmars.com> writes:
On 1/28/2013 1:04 PM, Timon Gehr wrote:
 Maybe. I map identifiers to unique id's later,

This makes symbol lookups very fast.
Jan 28 2013
prev sibling parent reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
29-Jan-2013 01:04, Timon Gehr пишет:
 On 01/28/2013 12:59 PM, Dmitry Olshansky wrote:
 28-Jan-2013 15:48, Johannes Pfau пишет:
 ...

 But to be fair that doesn't fit ranges very well. If you don't want to
 do any allocation but still keep identifiers etc in memory this
 basically means you have to keep the whole source in memory and this is
 conceptually an array and not a range.

Not the whole source but to construct a table of all identifiers. The source is awfully redundant because of repeated identifiers, spaces, comments and what not. The set of unique identifiers is rather small.

Source code is usually small. (Even std.datetime has 'only' 1.6 MB.) My own lexer-parser combination slices directly into the original source code, for every token, in order to remember the exact source location, and the last time I have measured, it ran faster than DMD's. I keep the source around for error reporting anyway: tt.d:27:5: error: no member 'n' for type 'A' a.n=2; ^~~ Since the tokens point directly into the source code, it is not necessary to construct any other data structures in order to allow fast retrieval of the appropriate source code line. But it's clear that a general-purpose library might not want to impose this restriction on storage upon it's clients. I think it is somewhat helpful for speed though. The other thing I do is buffering tokens in a contiguous ring buffer that grows if a lot of lookahead is requested.
 I think the best course of action is to just provide a hook to trigger
 on every identifier encountered. That could be as discussed earlier a
 delegate.

 ...

Maybe. I map identifiers to unique id's later, in the identifier AST node constructor, though.

In allocation scheme I proposed that ID could be a 32bit offset into the unique identifiers chunk. Identifiers themselves then would have to be stored with sentinels (e.g. zeros) at end like in C. Then the 'map' process comes for free. The only overhead is actually copying the bytes to this buffer but it only happens once per unique identifier and in exchange you get to lex anything not only 'the whole module in one memory block' kind of thing. On the plus side it's also more cache friendly. Overall I think it's a good trade-off, but I never had the time to exercise it. -- Dmitry Olshansky
Jan 30 2013
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 1/30/2013 8:44 AM, Dmitry Olshansky wrote:
 In allocation scheme I proposed that ID could be a 32bit offset into the unique
 identifiers chunk.

That only works if you know in advance the max size the chunk can ever be and preallocate it. Otherwise, you have no guarantee that the next allocated chunk will be within 32 bits of address of the previous chunks.
Feb 01 2013
parent reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
01-Feb-2013 15:05, Walter Bright пишет:
 On 1/30/2013 8:44 AM, Dmitry Olshansky wrote:
 In allocation scheme I proposed that ID could be a 32bit offset into
 the unique
 identifiers chunk.

That only works if you know in advance the max size the chunk can ever be and preallocate it. Otherwise, you have no guarantee that the next allocated chunk will be within 32 bits of address of the previous chunks.

Well I supposed it's exactly one reallocatable block. Then token have an offset that doesn't care if the block was reallocated. Or rather the reallocating just RESERVE virtual RAM for it (say 1G), and COMMIT it page by page when you need to grow it. Once lexing is done, shrink virtual region to the actual used size to free up address space (e.g. if we are on 32bits). AS for 32bit limit that gives 4Gb maximum of the cumulative length of all unique identifier names is more then enough by any standard. I haven't seen a 4G codebase not to speak of identifiers alone that even if we count all the repetitions separately. -- Dmitry Olshansky
Feb 01 2013
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 2/1/2013 3:22 AM, Dmitry Olshansky wrote:
 01-Feb-2013 15:05, Walter Bright пишет:
 On 1/30/2013 8:44 AM, Dmitry Olshansky wrote:
 In allocation scheme I proposed that ID could be a 32bit offset into
 the unique
 identifiers chunk.

That only works if you know in advance the max size the chunk can ever be and preallocate it. Otherwise, you have no guarantee that the next allocated chunk will be within 32 bits of address of the previous chunks.

Well I supposed it's exactly one reallocatable block. Then token have an offset that doesn't care if the block was reallocated. Or rather the reallocating just RESERVE virtual RAM for it (say 1G), and COMMIT it page by page when you need to grow it. Once lexing is done, shrink virtual region to the actual used size to free up address space (e.g. if we are on 32bits). AS for 32bit limit that gives 4Gb maximum of the cumulative length of all unique identifier names is more then enough by any standard. I haven't seen a 4G codebase not to speak of identifiers alone that even if we count all the repetitions separately.

Your technique can work, provided the number of identifiers isn't large enough that memory fragmentation will prevent being able to reallocate the buffer to a larger size.
Feb 01 2013
parent reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
02-Feb-2013 01:10, Walter Bright пишет:
 On 2/1/2013 3:22 AM, Dmitry Olshansky wrote:
 01-Feb-2013 15:05, Walter Bright пишет:
 On 1/30/2013 8:44 AM, Dmitry Olshansky wrote:
 In allocation scheme I proposed that ID could be a 32bit offset into
 the unique
 identifiers chunk.

That only works if you know in advance the max size the chunk can ever be and preallocate it. Otherwise, you have no guarantee that the next allocated chunk will be within 32 bits of address of the previous chunks.

Well I supposed it's exactly one reallocatable block. Then token have an offset that doesn't care if the block was reallocated. Or rather the reallocating just RESERVE virtual RAM for it (say 1G), and COMMIT it page by page when you need to grow it. Once lexing is done, shrink virtual region to the actual used size to free up address space (e.g. if we are on 32bits). AS for 32bit limit that gives 4Gb maximum of the cumulative length of all unique identifier names is more then enough by any standard. I haven't seen a 4G codebase not to speak of identifiers alone that even if we count all the repetitions separately.

Your technique can work, provided the number of identifiers isn't large enough that memory fragmentation will prevent being able to reallocate the buffer to a larger size.

I think I failed to describe the virtual memory trick or you just ignored it? But regardless I have to correct myself on both the amount to reserve and the fact that it's actually not virtual memory alone but a virtue of MMU. See below a full program in D see it the trick in action, currently Win32 only, as I can't recall the right POSIX calls offhand. It's a benchmark against built-in array/appender - the heap stuff. The speed is simillar, with my quick and dirty stuff being much faster iff arrays don't allocate all of ram beforehand. The steps to snatch the power of not having any reallocations: 1. Reserve a contiguous *address space* range. Not even a bit of virtual memory is reserved/commited/touched at this moment. This just creates a TLB entry AFAICT. (for lexer this address space range would have length equal to the sum of all files length as Tove points out, this is the first correction) 2. Any time there is no memory to put more stuff into (as is initially), commit the next few pages. At this moment virtual ram is committed but not allocated. 3. [this I think we all know] The moment memory location in the committed part of the range is touched the page for it is allocated (and on win32 initially contains zeros automatically). Only at this point Virtual memory is allocated So you see where my 2nd correction goes. In essence you get: 0 reallocations and no more then 1 page of virtual memory wasted in all cases. Sweet deal ;) Then there is an optional step if you think that too much of address space is used (makes sense only on 32-bits if at all): copy the resulting block of actual data elsewhere in the heap and release the whole address range. The code: (the same as github gist: https://gist.github.com/4699388 ) ///Written in the D programming language // Bench built-in array append, std.array.Appender // and custom virtual-memory aware VirtualArray in terms of // memory usage and the time taken // to append up to X megabytes using different chunk sizes: // 16 bytes, 64bytes, 256 bytes and 16Kb at a time. // pass this version to take insert readln's // at interesting points and investigate RAM usage // make sure to enable relevant columns in task manager process details: // committed size and private working set // version = diagnose; import std.c.windows.windows; import std.exception, std.datetime, std.algorithm, std.range, std.array; auto roundUpToPowerOf2(size_t var, size_t pow2) { return (var + pow2-1) & ~(pow2-1); } struct VirtualArray { private: ubyte* pointer; // to the beginning of the reserved memory size_t _length; // length of data size_t commited; // committed memory so far size_t reserved; // total possible maximum to grow to size_t pageSize; // page size, this could be global //size_t pageSize; // commit some more memory from the reserved range void extend(size_t size) { size = roundUpToPowerOf2(size, pageSize); // this will throw once run out of reserved pages enforce(VirtualAlloc(pointer+commited, size, MEM_COMMIT, PAGE_READWRITE)); commited += size; } public: this(size_t maxCapacity) { SYSTEM_INFO sysinfo; GetSystemInfo(&sysinfo); pageSize = sysinfo.dwPageSize; // the page size is a power of 2 // round the capacity up to a multiple of page size maxCapacity = roundUpToPowerOf2(maxCapacity, pageSize); pointer = cast(ubyte*)enforce(VirtualAlloc(null, maxCapacity, MEM_RESERVE, PAGE_READWRITE)); commited = 0; reserved = maxCapacity; _length = 0; } // bare minimum primitives to run benchmark ref ubyte opIndex(size_t idx) { assert(idx < length); return pointer[idx]; } // ditto ubyte[] opSlice(size_t from, size_t to) { assert(from < to && to <= length); return pointer[from..to]; } // ditto property size_t length(){ return _length; } void opOpAssign(string op:"~")(ubyte[] data) { size_t end = length + data.length; while(end > commited) extend(end-commited); pointer[length..end] = data[]; _length = end; } ~this() { // should not throw, struct is not copyable enforce(VirtualFree(pointer, 0, MEM_RELEASE)); } } import std.stdio; void timeIt(string prefix, void delegate() run) { StopWatch sw; sw.start(); run(); sw.stop(); writefln(prefix~ " %4s ms", sw.peek().msecs); } bool verify(Arr)(ubyte[] pattern, ref Arr arr) { for(size_t i=0; i<arr.length; i+= pattern.length) { if(arr[i..i+pattern.length] != pattern[]) { writeln(arr[i .. i+pattern.length], " vs ", pattern); return false; } } return true; } void main() { size_t totalSize = 120*2^^20; auto chunks = [ iota(0, 16).map!(x=>cast(ubyte)x).array, iota(0, 64).map!(x=>cast(ubyte)x).array, iota(0, 256).map!(x=>cast(ubyte)x).array, iota(0, 2^^10).map!(x=>cast(ubyte)x).array ]; auto titles = [ " 16b", " 64b", "256b", " 16K" ]; import core.memory; GC.disable(); // to prevent measuring a collection by mistake foreach(n, chunk; chunks) { // reserve memory anew on each iteration // I was able to easily go up to 1.5 G on 32bits // note: there are no reallocations here at all version(diagnose) if(n == 0) { writeln("Before reserving address space"); readln(); } auto virtArr = VirtualArray(totalSize); version(diagnose) if(n == 0) { writeln("Reserved address space"); readln(); } timeIt(titles[n]~" virtual", (){ foreach(i; 0..totalSize/chunk.length) { virtArr ~= chunk; } }); version(diagnose) if(n == 0) { writeln("Commited all of virtual memory"); readln(); //uncomment to take a pause to investigate RAM usage } // time to verify is the same for all with -O -inline enforce(verify(chunk, virtArr)); } writeln("============"); foreach(n, chunk; chunks) { ubyte[] arr; // the GC is out of the game as soon as 512 Mbytes on 32bits // as I hit OutOfMemory on a 2nd iteration // try to help poor runtime // without reserve it crashes and burns at about 120 M // with reserve it's on par with chunk >= 256 // but it _commits_ all memory beforehand ! version(diagnose) if(n == 0) { writeln("Before array.reserve()"); readln(); } arr.reserve(totalSize); version(diagnose) if(n == 0) { writeln("After array.reserve()"); readln(); } timeIt(titles[n]~" array ", (){ foreach(i; 0..totalSize/chunk.length) { arr ~= chunk; } }); version(diagnose) if(n == 0) { writeln("After all data touched"); readln(); } // time to verify is the same for all with -O -inline enforce(verify(chunk, arr)); // sadly need to not run out of memory on 32 bits delete arr; } writeln("============"); foreach(n, chunk; chunks) { { // appender is supposed to be faster then array on appends // but it's actually slower starting at 256 byte chunks // and esp. so with 16Kb chunks auto app = appender!(ubyte[])(); timeIt(titles[n]~" appender", (){ foreach(i; 0..totalSize/chunk.length) { app.put(chunk); } }); auto appArr = app.data(); enforce(verify(chunk, appArr)); } GC.collect(); } } -- Dmitry Olshansky
Feb 02 2013
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 2/2/2013 1:54 PM, Dmitry Olshansky wrote:
 I think I failed to describe the virtual memory trick or you just ignored it?

You can't just reserve 1Gb of address space in a 32 bit system. (I know about reserving address space vs committing memory.)
Feb 02 2013
parent Dmitry Olshansky <dmitry.olsh gmail.com> writes:
03-Feb-2013 05:30, Walter Bright пишет:
 On 2/2/2013 1:54 PM, Dmitry Olshansky wrote:
 I think I failed to describe the virtual memory trick or you just
 ignored it?

You can't just reserve 1Gb of address space in a 32 bit system. (I know about reserving address space vs committing memory.)

OK, sorry you must knew this stuff throughly. 1Gb is not the point I feel like I shouldn't been listing 1Gb at all as we need much less. It's about 21 Mb of D source code in full dmd test suite + druntime + phobos, and only 4M in vibe.d including all examples. Reserve this amount. It's the same amount of memory as in read them all and slice them all approach that is said to be so fast and practical. And yet we don't commit this RAM at all most of the time. Thinking more of 32bit systems that are so tight on address space - it's 2Gb for user-mode. Precisely because of that a signed 32 bit offset is enough to address RAM if we store it in multiple chunks like you said. The fact that there is only 2Gb of user-space actually sort of helps us as we surely can reach the other chunks from a single base. In case of 3Gb (some linux-es and/or large address aware win32) we just need to reserve the first range somewhere in the middle, as we do it at startup this should be easy with a few probes. Also after the lexing you can always free the address space and reallocate-compact it as the last step like I said earlier. -- Dmitry Olshansky
Feb 03 2013
prev sibling next sibling parent "deadalnix" <deadalnix gmail.com> writes:
Very happy to see that !

Some remarks :
  - Many parameters should be compile time parameters. Instead of 
runtime.
  - I'm not a big fan of byToken name, but let's see what others 
think of it.
  - I'm not sure this is the role of the lexer to process 
__IDENTIFIER__ special stuffs.
  - You need to provide a way to specify haw textual 
representation of the token (ie value) is set. The best way to do 
it IMO is an alias parameter that return a string when called 
with a string (then the user can choose to keep the value from 
original string, create a copy, always get the same copy with the 
same string, etc . . .).
  - Ideally, the location format should be configurable.
  - You must return at least a forward range, and not an input 
range, otherwize a lexer cannot lookahead.
  - I'm not sure about making TokenRange a class.

As a more positive note, I was just in need of something like 
this today and wondered if such project was going to finally 
happen. This is a great step forward ! Some stuff still need to 
be polished, but who get everything right the first time ?
Jan 27 2013
prev sibling next sibling parent "deadalnix" <deadalnix gmail.com> writes:
On Sunday, 27 January 2013 at 10:17:48 UTC, Philippe Sigaud wrote:
 Last time we discussed it, IIRC, some people wanted the lexer 
 to stop
 at once, other just wanted an Error token.
 I personally prefer an Error token, but that means finding a 
 way to
 start lexing again after the error (and hence, finding where 
 the error
 ends).
 I guess any separator/terminator could be used to re-engage the 
 lexer:
 space, semicolon, closing brace, closing parenthesis?

Oh yes, that is very important. Conditions are perfect to handle stuff like that.
Jan 27 2013
prev sibling next sibling parent "deadalnix" <deadalnix gmail.com> writes:
On Sunday, 27 January 2013 at 10:32:39 UTC, deadalnix wrote:
 Very happy to see that !

 Some remarks :
  - Many parameters should be compile time parameters. Instead 
 of runtime.
  - I'm not a big fan of byToken name, but let's see what others 
 think of it.
  - I'm not sure this is the role of the lexer to process 
 __IDENTIFIER__ special stuffs.
  - You need to provide a way to specify haw textual 
 representation of the token (ie value) is set. The best way to 
 do it IMO is an alias parameter that return a string when 
 called with a string (then the user can choose to keep the 
 value from original string, create a copy, always get the same 
 copy with the same string, etc . . .).
  - Ideally, the location format should be configurable.
  - You must return at least a forward range, and not an input 
 range, otherwize a lexer cannot lookahead.
  - I'm not sure about making TokenRange a class.

 As a more positive note, I was just in need of something like 
 this today and wondered if such project was going to finally 
 happen. This is a great step forward ! Some stuff still need to 
 be polished, but who get everything right the first time ?

And the famous Job's « one last thing » : I'm not a big fan of having OPERATORS_BEGIN of the same type as regular token types. Now they make valid token. Why not provide a set of function like isOperator ?
Jan 27 2013
prev sibling next sibling parent "Brian Schott" <briancschott gmail.com> writes:
On Sunday, 27 January 2013 at 10:17:48 UTC, Philippe Sigaud wrote:
 * Having a range interface is good. Any reason why you made 
 byToken a
 class and not a struct? Most (like, 99%) of range in Phobos are
 structs. Do you need reference semantics?

It implements the InputRange interface from std.range so that users have a choice of using template constraints or the OO model in their code.
 * Also, is there a way to keep comments? Any code wanting the 
 modify
 the code might need them.
 (edit: Ah, I see it: IterationStyle.IncludeComments)

 * I'd distinguish between standard comments and documentation
 comments. These are different beasts, to my eyes.

The standard at http://dlang.org/lex.html doesn't differentiate between them. It's trivial to write a function that checks if a token starts with "///", "/**", or "/++" while iterating over the tokens.
 * I see Token has a startIndex member. Any reason not to have a
 endIndex member? Or can and end index always be deduced from
 startIndex and value.length?

That's the idea.
 * How does it fare with non ASCII code?

Everything is templated on the character type, but I haven't done any testing on UTF-16 or UTF-32. Valgrind still shows functions from std.uni being called, so at the moment I assume it works.
 * A rough estimate of number of tokens/s would be good (I know 
 it'll
 vary). Walter seems to think if a lexer is not able to vomit 
 thousands
 of tokens a seconds, then it's not good. On a related note, 
 does your
 lexer have any problem with 10k+-lines files?

$ time dscanner --sloc ../phobos/std/datetime.d 14950 real 0m0.319s user 0m0.313s sys 0m0.006s $ time dmd -c ../phobos/std/datetime.d real 0m0.354s user 0m0.318s sys 0m0.036s Yes, I know that "time" is a terrible benchmarking tool, but they're fairly close for whatever that's worth.
Jan 27 2013
prev sibling next sibling parent "Dicebot" <m.strashun gmail.com> writes:
In one of last discussions about standard lexer/parser I remember 
quite a neat proposal - take a delegate for error handling and 
provide two out of the box ( one, that throws exception and one 
that returns Error token)
Jan 27 2013
prev sibling next sibling parent "Brian Schott" <briancschott gmail.com> writes:
On Sunday, 27 January 2013 at 10:32:39 UTC, deadalnix wrote:
 Very happy to see that !

 Some remarks :
  - Many parameters should be compile time parameters. Instead 
 of runtime.

I decided not to do this because the lexer actually calls itself while parsing token strings. If they were compile-time parameters, the compiler would likely generate a lot more code.
  - I'm not a big fan of byToken name, but let's see what others 
 think of it.

Chosen for consistency with the various functions in std.stdio, but now that you point his out, it's not very consistent with std.algorithm or std.range.
  - I'm not sure this is the role of the lexer to process 
 __IDENTIFIER__ special stuffs.

According to http://dlang.org/lex#specialtokens this is the correct behavior.
  - You need to provide a way to specify haw textual 
 representation of the token (ie value) is set. The best way to 
 do it IMO is an alias parameter that return a string when 
 called with a string (then the user can choose to keep the 
 value from original string, create a copy, always get the same 
 copy with the same string, etc . . .).

The lexer does not operate on slices of its input. It would be possible to special-case for this in the future.
  - Ideally, the location format should be configurable.
  - You must return at least a forward range, and not an input 
 range, otherwize a lexer cannot lookahead.

It's easy to wrap this range inside of another that does buffering for lookahead. https://github.com/Hackerpilot/Dscanner/blob/range-based-lexer/circularbuffer.d
 And the famous Job's « one last thing » : I'm not a big fan of 
 having OPERATORS_BEGIN of the same type as regular token types. 
 Now they make valid token. Why not provide a set of function 
 like isOperator ?

This eliminates possible uses of the case range statement. It may be the case (ha ha) that nobody cares and would rather have those functions. I'd be fine with changing that.
Jan 27 2013
prev sibling next sibling parent "deadalnix" <deadalnix gmail.com> writes:
On Sunday, 27 January 2013 at 10:55:49 UTC, Brian Schott wrote:
 On Sunday, 27 January 2013 at 10:32:39 UTC, deadalnix wrote:
 Very happy to see that !

 Some remarks :
 - Many parameters should be compile time parameters. Instead 
 of runtime.

I decided not to do this because the lexer actually calls itself while parsing token strings. If they were compile-time parameters, the compiler would likely generate a lot more code.

Have you measured ?
 - You need to provide a way to specify haw textual 
 representation of the token (ie value) is set. The best way to 
 do it IMO is an alias parameter that return a string when 
 called with a string (then the user can choose to keep the 
 value from original string, create a copy, always get the same 
 copy with the same string, etc . . .).

The lexer does not operate on slices of its input. It would be possible to special-case for this in the future.

Even without special casing for slice of the input, generating a new string or reusing an existing one is already an important thing. Depending of the processing that come after, it may be of great importance.
Jan 27 2013
prev sibling next sibling parent Philippe Sigaud <philippe.sigaud gmail.com> writes:
On Sun, Jan 27, 2013 at 11:42 AM, Brian Schott <briancschott gmail.com> wrote:
 On Sunday, 27 January 2013 at 10:17:48 UTC, Philippe Sigaud wrote:
 * Having a range interface is good. Any reason why you made byToken a
 class and not a struct? Most (like, 99%) of range in Phobos are
 structs. Do you need reference semantics?

It implements the InputRange interface from std.range so that users have a choice of using template constraints or the OO model in their code.

Hmm. You're the first person I see pushing this. Personally, I'd go for a struct, if only because I don't need reference semantics.
 * Also, is there a way to keep comments? Any code wanting the modify
 the code might need them.
 (edit: Ah, I see it: IterationStyle.IncludeComments)

 * I'd distinguish between standard comments and documentation
 comments. These are different beasts, to my eyes.

The standard at http://dlang.org/lex.html doesn't differentiate between them. It's trivial to write a function that checks if a token starts with "///", "/**", or "/++" while iterating over the tokens.

Yes but, the standard lexer was done with DMD in mind, and DMD has a different code path for generating comments. It's your project, sure, but I'd appreciate tokens differentiating between the many comments. Oh, and recognizing some inner Ddoc tokens, like ---- delimiters for documentation code blocks. That way, code-in-doc could use the lexer also. Pretty please?
 * I see Token has a startIndex member. Any reason not to have a
 endIndex member? Or can and end index always be deduced from
 startIndex and value.length?

That's the idea.

Does it work for UTF-16 and UTF-32 strings?
Jan 27 2013
prev sibling next sibling parent reply Philippe Sigaud <philippe.sigaud gmail.com> writes:
Let's add another question:

what about treating q{ } token strings as... well, a list of tokens?
IDE would like this: no need to reparse the string, the token are
there directly.
Jan 27 2013
parent Jacob Carlborg <doob me.com> writes:
On 2013-01-27 12:51, Philippe Sigaud wrote:
 Let's add another question:

 what about treating q{ } token strings as... well, a list of tokens?
 IDE would like this: no need to reparse the string, the token are
 there directly.

Perhaps an option for this. -- /Jacob Carlborg
Jan 27 2013
prev sibling next sibling parent Jacob Carlborg <doob me.com> writes:
On 2013-01-27 10:51, Brian Schott wrote:
 I'm writing a D lexer for possible inclusion in Phobos.

 DDOC:
 http://hackerpilot.github.com/experimental/std_lexer/phobos/lexer.html
 Code:
 https://github.com/Hackerpilot/Dscanner/blob/range-based-lexer/std/d/lexer.d


 It's currently able to correctly syntax highlight all of Phobos, but
 does a fairly bad job at rejecting or notifying users/callers about
 invalid input.

 I'd like to hear arguments on the various ways to handle errors in the
 lexer. In a compiler it would be useful to throw an exception on finding
 something like a string literal that doesn't stop before EOF, but a text
 editor or IDE would probably want to be a bit more lenient. Maybe having
 it run-time (or compile-time configurable) like std.csv would be the
 best option here.

 I'm interested in ideas on the API design and other high-level issues at
 the moment. I don't consider this ready for inclusion. (The current
 module being reviewed for inclusion in Phobos is the new std.uni.)

How about changing the type of TokenType to ushort, if all members fit. Just to minimize the size of a token. -- /Jacob Carlborg
Jan 27 2013
prev sibling next sibling parent Johannes Pfau <nospam example.com> writes:
Am Sun, 27 Jan 2013 12:38:33 +0100
schrieb Timon Gehr <timon.gehr gmx.ch>:

 On 01/27/2013 11:42 AM, Brian Schott wrote:
 On Sunday, 27 January 2013 at 10:17:48 UTC, Philippe Sigaud wrote:
 * Having a range interface is good. Any reason why you made
 byToken a class and not a struct? Most (like, 99%) of range in
 Phobos are structs. Do you need reference semantics?

It implements the InputRange interface from std.range so that users have a choice of using template constraints or the OO model in their code. ...

The lexer range must be a struct.
 ...
 * A rough estimate of number of tokens/s would be good (I know
 it'll vary). Walter seems to think if a lexer is not able to vomit
 thousands of tokens a seconds, then it's not good. On a related
 note, does your lexer have any problem with 10k+-lines files?

$ time dscanner --sloc ../phobos/std/datetime.d 14950 real 0m0.319s user 0m0.313s sys 0m0.006s $ time dmd -c ../phobos/std/datetime.d real 0m0.354s user 0m0.318s sys 0m0.036s Yes, I know that "time" is a terrible benchmarking tool, but they're fairly close for whatever that's worth.

You are measuring lexing speed against compilation speed. A reasonably well performing lexer is around one order of magnitude faster on std.datetime. Maybe you should profile a little?

Profiling is always a good idea, but to be fair: His dmd was probably compiled with gcc and -O2 if it's a normal release build. So to compare that he should use gdc, -O2 -release -fno-bounds-check and probably more flags to compile the d code.
Jan 27 2013
prev sibling next sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 1/27/2013 1:51 AM, Brian Schott wrote:
 I'm interested in ideas on the API design and other high-level issues at the
 moment. I don't consider this ready for inclusion. (The current module being
 reviewed for inclusion in Phobos is the new std.uni.)

Just a quick comment: byToken() should not accept a filename. It's input should be via an InputRange, not a file.
Jan 27 2013
next sibling parent reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
27-Jan-2013 23:46, Walter Bright пишет:
 On 1/27/2013 1:51 AM, Brian Schott wrote:
 I'm interested in ideas on the API design and other high-level issues
 at the
 moment. I don't consider this ready for inclusion. (The current module
 being
 reviewed for inclusion in Phobos is the new std.uni.)

Just a quick comment: byToken() should not accept a filename. It's input should be via an InputRange, not a file.

InputRange of char/wchar/dchar I guess as lexer should have an ability to avoid decoding. -- Dmitry Olshansky
Jan 27 2013
parent Walter Bright <newshound2 digitalmars.com> writes:
On 1/27/2013 12:11 PM, Dmitry Olshansky wrote:
 27-Jan-2013 23:46, Walter Bright пишет:
 On 1/27/2013 1:51 AM, Brian Schott wrote:
 I'm interested in ideas on the API design and other high-level issues
 at the
 moment. I don't consider this ready for inclusion. (The current module
 being
 reviewed for inclusion in Phobos is the new std.uni.)

Just a quick comment: byToken() should not accept a filename. It's input should be via an InputRange, not a file.

InputRange of char/wchar/dchar I guess as lexer should have an ability to avoid decoding.

Just char would suffice. If a user needs wchar/dchar, they can put a map in between the InputRange and the lexer.
Jan 27 2013
prev sibling next sibling parent Timon Gehr <timon.gehr gmx.ch> writes:
On 01/27/2013 10:39 PM, Brian Schott wrote:
 ...

 Regarding the times that I posted, my point was that it's not slower
 than "dmd -c", nothing more.

Sure. The point you brought across, however, was that it is not significantly faster yet. :o)
Jan 27 2013
prev sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 1/27/2013 1:39 PM, Brian Schott wrote:
 The file name is accepted for eventual error reporting purposes.

Use an OutputRange for that.
Jan 27 2013
next sibling parent Walter Bright <newshound2 digitalmars.com> writes:
On 1/27/2013 4:48 PM, David Nadlinger wrote:
 On Sunday, 27 January 2013 at 23:49:11 UTC, Walter Bright wrote:
 On 1/27/2013 1:39 PM, Brian Schott wrote:
 The file name is accepted for eventual error reporting purposes.

Use an OutputRange for that.

What about that delegate-based design? I thought everyone agreed that it was nice?

An OutputRange is a way of doing that. The advantage of OutputRange's is that is TheWayToDoThings in Phobos so that components can all interoperate and plug into each other.
Jan 27 2013
prev sibling next sibling parent Walter Bright <newshound2 digitalmars.com> writes:
On 1/27/2013 4:53 PM, Brian Schott wrote:
 On Sunday, 27 January 2013 at 23:49:11 UTC, Walter Bright wrote:
 On 1/27/2013 1:39 PM, Brian Schott wrote:
 The file name is accepted for eventual error reporting purposes.

Use an OutputRange for that.

I think you misunderstand. The file name is so that if you pass in "foo.d" the lexer can say "Error: unterminated string literal beginning on line 123 of foo.d". It's not so that error messagaes will be written to a file of that name.

Yes, I did misunderstand. I suggest updating the documentation to clear up the misunderstanding.
Jan 27 2013
prev sibling next sibling parent reply Jacob Carlborg <doob me.com> writes:
On 2013-01-28 09:51, Johannes Pfau wrote:

 This sounds like a valid use case for buffered ranges (which we don't
 have yet, afaik). When used correctly you could avoid the appender
 completely. Instead slice the input buffer and copy it if necessary.
 (If the complete file is kept in memory copying could also be
 avoided)

If think slicing the buffer will force the whole buffer to remain in memory and not be collected by the GC. If one keeps the whole buffer in memory anyway this won't be a problem. -- /Jacob Carlborg
Jan 28 2013
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 1/28/2013 1:05 AM, Jacob Carlborg wrote:
 If think slicing the buffer will force the whole buffer to remain in memory and
 not be collected by the GC. If one keeps the whole buffer in memory anyway this
 won't be a problem.

Well, the source buffer can be large, and will span a lot of memory cache lines, making accessing those slices slow.
Jan 28 2013
parent reply Jacob Carlborg <doob me.com> writes:
On 2013-01-28 11:14, Walter Bright wrote:

 Well, the source buffer can be large, and will span a lot of memory
 cache lines, making accessing those slices slow.

Would it be better to .dup the slices? Won't that be just as slow as using the appender? -- /Jacob Carlborg
Jan 28 2013
parent reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
28-Jan-2013 14:39, Jacob Carlborg пишет:
 On 2013-01-28 11:14, Walter Bright wrote:

 Well, the source buffer can be large, and will span a lot of memory
 cache lines, making accessing those slices slow.

Would it be better to .dup the slices? Won't that be just as slow as using the appender?

It would be better to compactly layout these one by one in a separate buffer skipping all of the extra slack in the source file(s). -- Dmitry Olshansky
Jan 28 2013
parent Dmitry Olshansky <dmitry.olsh gmail.com> writes:
28-Jan-2013 18:08, jerro пишет:
 On Monday, 28 January 2013 at 11:00:09 UTC, Dmitry Olshansky wrote:
 28-Jan-2013 14:39, Jacob Carlborg пишет:
 On 2013-01-28 11:14, Walter Bright wrote:

 Well, the source buffer can be large, and will span a lot of memory
 cache lines, making accessing those slices slow.

Would it be better to .dup the slices? Won't that be just as slow as using the appender?

It would be better to compactly layout these one by one in a separate buffer skipping all of the extra slack in the source file(s).

Would it also make sense to do small string optimization? On 64 bit, you could use the memory used by the string field itself to store strings of length up to 15 characters.

It very well could be iff slice is a field in the Token struct. -- Dmitry Olshansky
Jan 28 2013
prev sibling parent reply Timon Gehr <timon.gehr gmx.ch> writes:
On 01/28/2013 01:53 AM, Brian Schott wrote:
 ...

 On the topic of performance, I realized that the numbers posted
 previously were actually for a debug build. Fail.

 For whatever reason, the current version of the lexer code isn't
 triggering my heisenbug[1] and I was able to build with -release -inline
 -O.

 Here's what avgtime has to say:

 $ avgtime -q -h -r 200 dscanner --tokenCount ../phobos/std/datetime.d

 ------------------------
 Total time (ms): 51409.8
 Repetitions    : 200
 Sample mode    : 250 (169 ocurrences)
 Median time    : 255.57
 Avg time       : 257.049
 Std dev.       : 4.39338
 Minimum        : 252.931
 Maximum        : 278.658
 95% conf.int.  : [248.438, 265.66]  e = 8.61087
 99% conf.int.  : [245.733, 268.366]  e = 11.3166
 EstimatedAvg95%: [256.44, 257.658]  e = 0.608881
 EstimatedAvg99%: [256.249, 257.849]  e = 0.800205
 Histogram      :
      msecs: count  normalized bar
        250:   169  ########################################
        260:    22  #####
        270:     9  ##

 Which works out to 1,327,784 tokens per second on my Ivy Bridge i7.

Better, but still slow.
 I created a small program that demangles the output of valgrind so that
 tools like KCachegrind can display profiling information more clearly.
 It's now on the wiki[2]

 The bottleneck in std.d.lexer as it stands is the appender instances
 that assemble Token.value during iteration and front() on the array of
 char[]. (As I'm sure everyone expected)

I see, probably there should be an option to do this by slicing instead. Also try to treat narrow strings in such a way that they do not incur undue decoding overhead. I guess that at some point pure nothrow TokenType lookupTokenType(const string input) might become a bottleneck. (DMD does not generate near-optimal string switches, I think.)
Jan 28 2013
next sibling parent dennis luehring <dl.soluz gmx.net> writes:
Am 30.01.2013 10:49, schrieb Brian Schott:
 On Monday, 28 January 2013 at 21:03:21 UTC, Timon Gehr wrote:
 Better, but still slow.

I implemented the various suggestions from a past thread and made the lexer only work ubyte[] (to aviod phobos converting everything to dchar all the time) and gave the tokenizer instance a character buffer that it re-uses. Results: $ avgtime -q -r 200 ./dscanner --tokenCount ../phobos/std/datetime.d ..... If my math is right, that means it's getting 4.9 million tokens/second now. According to Valgrind the only way to really improve things now is to require that the input to the lexer support slicing. (Remember the secret of Tango's XML parser...) The bottleneck is now on the calls to .idup to construct the token strings from slices of the buffer.

but you still need to compare that to the current dmd lexer - nothing else can be a benchmark reference
Jan 30 2013
prev sibling next sibling parent FG <home fgda.pl> writes:
On 2013-01-30 10:49, Brian Schott wrote:
 If my math is right, that means it's getting 4.9 million tokens/second now.
 According to Valgrind the only way to really improve things now is to require
 that the input to the lexer support slicing. (Remember the secret of Tango's
XML
 parser...) The bottleneck is now on the calls to .idup to construct the token
 strings from slices of the buffer.

Yep. I'm eager to see what timings you get with the whole file kept in memory and sliced, without copying token strings.
Jan 30 2013
prev sibling next sibling parent reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
30-Jan-2013 13:49, Brian Schott пишет:
 On Monday, 28 January 2013 at 21:03:21 UTC, Timon Gehr wrote:
 Better, but still slow.

I implemented the various suggestions from a past thread and made the lexer only work ubyte[] (to aviod phobos converting everything to dchar all the time) and gave the tokenizer instance a character buffer that it re-uses. Results: $ avgtime -q -r 200 ./dscanner --tokenCount ../phobos/std/datetime.d ------------------------ Total time (ms): 13861.8 Repetitions : 200 Sample mode : 69 (90 ocurrences) Median time : 69.0745 Avg time : 69.3088 Std dev. : 0.670203 Minimum : 68.613 Maximum : 72.635 95% conf.int. : [67.9952, 70.6223] e = 1.31357 99% conf.int. : [67.5824, 71.0351] e = 1.72633 EstimatedAvg95%: [69.2159, 69.4016] e = 0.0928836 EstimatedAvg99%: [69.1867, 69.4308] e = 0.12207 If my math is right, that means it's getting 4.9 million tokens/second now. According to Valgrind the only way to really improve things now is to require that the input to the lexer support slicing. (Remember the secret of Tango's XML parser...) The bottleneck is now on the calls to .idup to construct the token strings from slices of the buffer.

idup --> allocation Instead I suggest to try and allocate a big block of fixed size (say about 16-64K) upfront and copy identifiers one by one there. When it fills just allocate another one and move on. If identifier is exceptionally long then you can just idup it as before. This should bring down the number of calls to GC significantly.
 I guess that at some point

 pure nothrow TokenType lookupTokenType(const string input)

 might become a bottleneck. (DMD does not generate near-optimal string
 switches, I think.)

Right now that's a fairly small box on KCachegrind.

-- Dmitry Olshansky
Jan 30 2013
parent reply FG <home fgda.pl> writes:
On 2013-01-30 17:50, Dmitry Olshansky wrote:
 Instead I suggest to try and allocate a big block of fixed size (say about
 16-64K) upfront and copy identifiers one by one there. When it fills just
 allocate another one and move on.

Yeah, similar to what I suggested, except that probably it would be good to also have a look-up structure for identifiers, so that only unique strings go into those blocks. I wonder what would be a practical data structure for such look-ups: A trie is great except for the implementation hurdle to keep it also in one or a few memory blocks to prevent frequent allocations. A simpler approach would be to make it an array of (hash, string_slice) pairs, sorted by hash (of the identifier) - lots of string scanning though. What do you think?
Jan 30 2013
parent Dmitry Olshansky <dmitry.olsh gmail.com> writes:
30-Jan-2013 21:21, FG пишет:
 On 2013-01-30 17:50, Dmitry Olshansky wrote:
 Instead I suggest to try and allocate a big block of fixed size (say
 about
 16-64K) upfront and copy identifiers one by one there. When it fills just
 allocate another one and move on.

Yeah, similar to what I suggested, except that probably it would be good to also have a look-up structure for identifiers, so that only unique strings go into those blocks. I wonder what would be a practical data structure for such look-ups: A trie is great except for the implementation hurdle to keep it also in one or a few memory blocks to prevent frequent allocations.

Mm trie shouldn't have that many allocations. Typically each node is fixed-sized. Also I don't see keeping it in many memory blocks as a show stopper. The trick is exactly the same as I mentioned above but blocks got to be quite a bit large (say 8 times :) ). Truth be told the trie should be optimal for this, but a lot of effort is required to implement a fast trie compared to rolling out a simple hash-table. That's something I hope to change once my Trie template is polished enough for Phobos proposal.
 A simpler approach would be to make it an array of (hash, string_slice)
 pairs, sorted by hash (of the identifier) - lots of string scanning though.

Mm since you need to find what's unique as you lex the file I suspect that sorting is out of the window. I'd be modest and for starters just use a hash-table + tune the hash function a bit. -- Dmitry Olshansky
Jan 30 2013
prev sibling parent Jacob Carlborg <doob me.com> writes:
On 2013-01-30 10:49, Brian Schott wrote:

 Results:

 $ avgtime -q -r 200 ./dscanner --tokenCount ../phobos/std/datetime.d

 ------------------------
 Total time (ms): 13861.8
 Repetitions    : 200
 Sample mode    : 69 (90 ocurrences)
 Median time    : 69.0745
 Avg time       : 69.3088
 Std dev.       : 0.670203
 Minimum        : 68.613
 Maximum        : 72.635
 95% conf.int.  : [67.9952, 70.6223]  e = 1.31357
 99% conf.int.  : [67.5824, 71.0351]  e = 1.72633
 EstimatedAvg95%: [69.2159, 69.4016]  e = 0.0928836
 EstimatedAvg99%: [69.1867, 69.4308]  e = 0.12207

 If my math is right, that means it's getting 4.9 million tokens/second
 now. According to Valgrind the only way to really improve things now is
 to require that the input to the lexer support slicing. (Remember the
 secret of Tango's XML parser...) The bottleneck is now on the calls to
 .idup to construct the token strings from slices of the buffer.

How many tokens would that be in total? -- /Jacob Carlborg
Jan 31 2013
prev sibling next sibling parent Philippe Sigaud <philippe.sigaud gmail.com> writes:
On Sun, Jan 27, 2013 at 8:48 PM, Walter Bright
<newshound2 digitalmars.com> wrote:
 On 1/27/2013 2:17 AM, Philippe Sigaud wrote:
 Walter seems to think if a lexer is not able to vomit thousands
 of tokens a seconds, then it's not good.

Speed is critical for a lexer.

Something I never tought about: given a 10k lines module, how many tokens does that represent, roughly. Ten times as much?
 This means, for example, you'll need to squeeze pretty much all storage
 allocation out of it. A lexer that does an allocation per token will is not
 going to do very well at all.

How does one do that? Honest question, I'm not really concerned by extreme speed, most of the time, and have lots to learn in this area.
Jan 27 2013
prev sibling next sibling parent Philippe Sigaud <philippe.sigaud gmail.com> writes:
 I concur. One of the biggest reason* there is a separate lexer step is
 because it could be made to do this stage very-very fast. Then the rest of
 the parser will greatly benefit from this underlying speed.

 *Otherwise we could have just as well add the lexer stage as simple rules to
 the grammar that treats all of codepoints as terminals.

... which is exactly what parsing expression grammars (and other scannerless parsers) do. AFAICT, one interesting consequence is the ability to have composition of grammars, which I sure have with Pegged. But maybe it's not related that much, that's not something I stopped to think about. In any case, grammar composition is something I've learnt to like quite a lot.
Jan 27 2013
prev sibling next sibling parent Philippe Sigaud <philippe.sigaud gmail.com> writes:
 ... which is exactly what parsing expression grammars (and other
 scannerless parsers) do.

That is only a happenstance and is highly overrated. What does it buy you is the right question to ask. After all every LL-parser can be written avoiding notion of lexer and strangely I see no fuss about it. So is it just hype? I dunno.

I don't know. I discovered what scannerless parsing was after discovering (and getting interested) in PEG. That's a somewhat unusual path, as I'm now discovering standard lexer+parser parsing :) Anyway, let's not derail the OP thread too much.
 AFAICT, one interesting consequence is the ability to have composition
 of grammars, which I sure have with Pegged. But maybe it's not related
 that much, that's not something I stopped to think about.
 In any case, grammar composition is something I've learnt to like quite a
 lot.

You can still have composability of grammars. In fact I'd define a couple of a less powerful but better optimized variations if I were you.

Yes, getting an automaton to deal with the regular part would be good. But then, there already is std.regex ;) it's good to know composition is possible elsewhere. I really like it: it makes me use grammars (and the related, generated parsers) as I use functions in PL: slowly building layers of abstraction and reuse.
 PEG-style memoization is still possible as long as underlying grammars are
 context-free all the useful usually are.

Yes, I agree.
 LR hardly composable ( I haven't seen a single one that does).
 But all top-down one always are unless there is some explicit work spent to
 undermine this ability ;)

What I still can't get an easy answer on is: is the D grammar LR(1), LR(0), LALR?
Jan 27 2013
prev sibling next sibling parent Philippe Sigaud <philippe.sigaud gmail.com> writes:
 The hint is that your question is a bit faulty: by calling it "the D
 grammar" do you mean the exact one listed on the website or any equivalent
 that parses the same language (including the ones obtained by simple
 transformations)?

The latter. The one I use for Pegged to generate (what is hopefully) a D parser is already modified, discards constructs like NameList := Name NameList in favor of Name+ Anyway, let's stop here. Back to lexing proper :)
Jan 27 2013
prev sibling next sibling parent "Brian Schott" <briancschott gmail.com> writes:
On Sunday, 27 January 2013 at 19:46:12 UTC, Walter Bright wrote:
 On 1/27/2013 1:51 AM, Brian Schott wrote:
 I'm interested in ideas on the API design and other high-level 
 issues at the
 moment. I don't consider this ready for inclusion. (The 
 current module being
 reviewed for inclusion in Phobos is the new std.uni.)

Just a quick comment: byToken() should not accept a filename. It's input should be via an InputRange, not a file.

The file name is accepted for eventual error reporting purposes. The actual input for the lexer is the parameter called "range". Regarding the times that I posted, my point was that it's not slower than "dmd -c", nothing more.
Jan 27 2013
prev sibling next sibling parent "David Nadlinger" <see klickverbot.at> writes:
On Sunday, 27 January 2013 at 23:49:11 UTC, Walter Bright wrote:
 On 1/27/2013 1:39 PM, Brian Schott wrote:
 The file name is accepted for eventual error reporting 
 purposes.

Use an OutputRange for that.

What about that delegate-based design? I thought everyone agreed that it was nice? David
Jan 27 2013
prev sibling next sibling parent "Brian Schott" <briancschott gmail.com> writes:
On Sunday, 27 January 2013 at 23:49:11 UTC, Walter Bright wrote:
 On 1/27/2013 1:39 PM, Brian Schott wrote:
 The file name is accepted for eventual error reporting 
 purposes.

Use an OutputRange for that.

I think you misunderstand. The file name is so that if you pass in "foo.d" the lexer can say "Error: unterminated string literal beginning on line 123 of foo.d". It's not so that error messagaes will be written to a file of that name. On the topic of performance, I realized that the numbers posted previously were actually for a debug build. Fail. For whatever reason, the current version of the lexer code isn't triggering my heisenbug[1] and I was able to build with -release -inline -O. Here's what avgtime has to say: $ avgtime -q -h -r 200 dscanner --tokenCount ../phobos/std/datetime.d ------------------------ Total time (ms): 51409.8 Repetitions : 200 Sample mode : 250 (169 ocurrences) Median time : 255.57 Avg time : 257.049 Std dev. : 4.39338 Minimum : 252.931 Maximum : 278.658 95% conf.int. : [248.438, 265.66] e = 8.61087 99% conf.int. : [245.733, 268.366] e = 11.3166 EstimatedAvg95%: [256.44, 257.658] e = 0.608881 EstimatedAvg99%: [256.249, 257.849] e = 0.800205 Histogram : msecs: count normalized bar 250: 169 ######################################## 260: 22 ##### 270: 9 ## Which works out to 1,327,784 tokens per second on my Ivy Bridge i7. I created a small program that demangles the output of valgrind so that tools like KCachegrind can display profiling information more clearly. It's now on the wiki[2] The bottleneck in std.d.lexer as it stands is the appender instances that assemble Token.value during iteration and front() on the array of char[]. (As I'm sure everyone expected) [1] http://forum.dlang.org/thread/bug-9353-3 http.d.puremagic.com%2Fissues%2F [2] http://wiki.dlang.org/Other_Dev_Tools
Jan 27 2013
prev sibling next sibling parent "deadalnix" <deadalnix gmail.com> writes:
On Monday, 28 January 2013 at 00:53:03 UTC, Brian Schott wrote:
 On Sunday, 27 January 2013 at 23:49:11 UTC, Walter Bright wrote:
 On 1/27/2013 1:39 PM, Brian Schott wrote:
 The file name is accepted for eventual error reporting 
 purposes.

Use an OutputRange for that.

I think you misunderstand. The file name is so that if you pass in "foo.d" the lexer can say "Error: unterminated string literal beginning on line 123 of foo.d". It's not so that error messagaes will be written to a file of that name.

I don't think that is a good idea. For instance mixin need to be lexed but don't come from a file. The lexer should report the error, what is done on error is up to the user of the lexer.
Jan 27 2013
prev sibling next sibling parent "David Nadlinger" <see klickverbot.at> writes:
On Monday, 28 January 2013 at 00:51:28 UTC, Walter Bright wrote:
 On 1/27/2013 4:48 PM, David Nadlinger wrote:
 On Sunday, 27 January 2013 at 23:49:11 UTC, Walter Bright 
 wrote:
 On 1/27/2013 1:39 PM, Brian Schott wrote:
 The file name is accepted for eventual error reporting 
 purposes.

Use an OutputRange for that.

What about that delegate-based design? I thought everyone agreed that it was nice?

An OutputRange is a way of doing that. The advantage of OutputRange's is that is TheWayToDoThings in Phobos so that components can all interoperate and plug into each other.

I was talking about the design you proposed yourself here: http://forum.dlang.org/post/jvp9ke$2m45$1 digitalmars.com Oh, and you really don't need to give me the basic Phobos/ranges sales pitch, I think I'm quite aware of their advantages. I'm just not sure that e.g. having an "exception thrower" output range would be a wise design decision. David
Jan 27 2013
prev sibling next sibling parent Johannes Pfau <nospam example.com> writes:
Am Mon, 28 Jan 2013 01:53:02 +0100
schrieb "Brian Schott" <briancschott gmail.com>:

 
 The bottleneck in std.d.lexer as it stands is the appender 
 instances that assemble Token.value during iteration and front() 
 on the array of char[]. (As I'm sure everyone expected)

This sounds like a valid use case for buffered ranges (which we don't have yet, afaik). When used correctly you could avoid the appender completely. Instead slice the input buffer and copy it if necessary. (If the complete file is kept in memory copying could also be avoided)
Jan 28 2013
prev sibling next sibling parent "Mehrdad" <wfunction hotmail.com> writes:
On Sunday, 27 January 2013 at 19:48:23 UTC, Walter Bright wrote:
 On 1/27/2013 2:17 AM, Philippe Sigaud wrote:
 Walter seems to think if a lexer is not able to vomit thousands
 of tokens a seconds, then it's not good.

Speed is critical for a lexer. This means, for example, you'll need to squeeze pretty much all storage allocation out of it. A lexer that does an allocation per token will is not going to do very well at all.

Semi-unrelated question -- how would you benchmark a _parser_? Is there any way to get a _number_ as an answer, or is comparing against a rival parser the only way of benchmarking a parser?
Jan 28 2013
prev sibling next sibling parent "Mehrdad" <wfunction hotmail.com> writes:
On Monday, 28 January 2013 at 09:07:15 UTC, Jacob Carlborg wrote:
 On 2013-01-28 10:03, Mehrdad wrote:

 Semi-unrelated question -- how would you benchmark a _parser_?

 Is there any way to get a _number_ as an answer, or is 
 comparing against
 a rival parser the only way of benchmarking a parser?

It should always be possible to get a number, but it would be hard to determine if that's a good number or not without comparing with anything else.

Sorry I think my question was vague -- what I meant was, how do you measure it, i.e. in what units? Lines per second? Tokens per second? Syntax nodes per second? etc.
Jan 28 2013
prev sibling next sibling parent "Mehrdad" <wfunction hotmail.com> writes:
On Monday, 28 January 2013 at 09:21:51 UTC, Jacob Carlborg wrote:
 On 2013-01-28 10:09, Mehrdad wrote:

 Sorry I think my question was vague -- what I meant was, how 
 do you
 measure it, i.e. in what units?

 Lines per second? Tokens per second? Syntax nodes per second? 
 etc.

The easy way out would be to just measure how long time it takes to parse a given piece of code and compare with some other parse using the same code.

Yeah that's exactly what I meant by not comparing xD I'm wondering if there's any way to make it independent of the code/grammar
Jan 28 2013
prev sibling next sibling parent "Mehrdad" <wfunction hotmail.com> writes:
On Monday, 28 January 2013 at 11:01:49 UTC, Walter Bright wrote:
 On 1/28/2013 2:16 AM, dennis luehring wrote:
 Am 28.01.2013 11:13, schrieb Walter Bright:
 On 1/28/2013 1:30 AM, Mehrdad wrote:
 I'm wondering if there's any way to make it independent of 
 the code/grammar

Just run under a profiler, then fix the hot spots.

mehrdad speaks about benchmarking not optimization :)

I know. But profiling is how you make it fast. Lexers should be as fast as possible, not merely faster than the competition.

I was talking about parsers actually :| on a second thought maybe I shouldn't have asked that on a thread about lexing...
Jan 28 2013
prev sibling next sibling parent "Mehrdad" <wfunction hotmail.com> writes:
On Sunday, 27 January 2013 at 20:15:33 UTC, Philippe Sigaud wrote:

 This means, for example, you'll need to squeeze pretty much 
 all storage allocation out of it. A lexer that does an 
 allocation per token will is not going to do very well at all.

How does one do that? Honest question, I'm not really concerned by extreme speed, most of the time, and have lots to learn in this area.

Here's my (VERY) simple NFA-based "regex"-based lexer, which performs **no** heap allocations (unless the pattern is very complicated): https://gist.github.com/b10ae22ab822c87467a3 Yes, the code is ugly and the capabilities are far too basic, but it was good enough for what I needed. The point it illustrates is how you can lex without any heap allocations.
Jan 28 2013
prev sibling next sibling parent Johannes Pfau <nospam example.com> writes:
Am Sun, 27 Jan 2013 11:48:23 -0800
schrieb Walter Bright <newshound2 digitalmars.com>:

 On 1/27/2013 2:17 AM, Philippe Sigaud wrote:
 Walter seems to think if a lexer is not able to vomit thousands
 of tokens a seconds, then it's not good.

Speed is critical for a lexer. This means, for example, you'll need to squeeze pretty much all storage allocation out of it.

But to be fair that doesn't fit ranges very well. If you don't want to do any allocation but still keep identifiers etc in memory this basically means you have to keep the whole source in memory and this is conceptually an array and not a range. But you can of course write a lexer which accepts buffered ranges and does some allocation for those and is special cased for arrays to not allocate at all. (Unbuffered ranges should be supported using a generic BufferedRange)
Jan 28 2013
prev sibling next sibling parent "Dejan Lekic" <dejan.lekic gmail.com> writes:
On Sunday, 27 January 2013 at 09:51:10 UTC, Brian Schott wrote:
 I'm writing a D lexer for possible inclusion in Phobos.

 DDOC: 
 http://hackerpilot.github.com/experimental/std_lexer/phobos/lexer.html
 Code: 
 https://github.com/Hackerpilot/Dscanner/blob/range-based-lexer/std/d/lexer.d

 It's currently able to correctly syntax highlight all of 
 Phobos, but does a fairly bad job at rejecting or notifying 
 users/callers about invalid input.

 I'd like to hear arguments on the various ways to handle errors 
 in the lexer. In a compiler it would be useful to throw an 
 exception on finding something like a string literal that 
 doesn't stop before EOF, but a text editor or IDE would 
 probably want to be a bit more lenient. Maybe having it 
 run-time (or compile-time configurable) like std.csv would be 
 the best option here.

 I'm interested in ideas on the API design and other high-level 
 issues at the moment. I don't consider this ready for 
 inclusion. (The current module being reviewed for inclusion in 
 Phobos is the new std.uni.)

Having a "standard" lexer module, that many people work on, and has up-to-date rules, is priceless! Thank you for doing this.
Jan 28 2013
prev sibling next sibling parent "jerro" <a a.com> writes:
On Monday, 28 January 2013 at 11:00:09 UTC, Dmitry Olshansky 
wrote:
 28-Jan-2013 14:39, Jacob Carlborg пишет:
 On 2013-01-28 11:14, Walter Bright wrote:

 Well, the source buffer can be large, and will span a lot of 
 memory
 cache lines, making accessing those slices slow.

Would it be better to .dup the slices? Won't that be just as slow as using the appender?

It would be better to compactly layout these one by one in a separate buffer skipping all of the extra slack in the source file(s).

Would it also make sense to do small string optimization? On 64 bit, you could use the memory used by the string field itself to store strings of length up to 15 characters.
Jan 28 2013
prev sibling next sibling parent "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Sun, Jan 27, 2013 at 10:39:13PM +0100, Brian Schott wrote:
 On Sunday, 27 January 2013 at 19:46:12 UTC, Walter Bright wrote:

Just a quick comment: byToken() should not accept a filename. It's
input should be via an InputRange, not a file.

The file name is accepted for eventual error reporting purposes. The actual input for the lexer is the parameter called "range".

FWIW, I've developed this little idiom in my code when it comes to dealing with error reporting in lexing/parsing code (for my own DSLs, not D): The main problem I have is that my lexer/parser accepts an input range, but input ranges don't (necessarily) have any filename/line number associated with them. Moreover, the code that throws the exception may be quite far down the call chain, and may have not access to the context that knows what filename/line number the error occurred at. For example, I may have a generic function called parseInt(), which can be called from the lexer, the parser, or a whole bunch of other places. It wouldn't make sense to force parseInt() to take a filename and line number, just so it can have nicer error reporting, for example. So I decided to move the inclusion of filename/line number information to where they belong: in the code that knows about them. So here's a sketch of my approach: class SyntaxError : Exception { string filename; int linenum; this(string msg) { super(msg); } } ... int parseInt(R)(R inputRange) { ... // N.B.: no filename/line number info here if (!isDigit(inputRange.front)) throw new Exception("Invalid digit: %s", inputRange.front); } ... Expr parseExpr(R)(R inputRange) { ... // N.B.: any exception just unrolls past this call, 'cos // we have no filename/line number info to insert anyway if (tokenType == IntLiteral) { value = parseInt(inputRange): } ... } ... Expr parseFileInput(string filename) { auto f = File(filename); try { // Wrapper range that counts line numbers auto r = NumberedSrc(f); return parseExpr(r); } catch(SyntaxError e) { // Insert filename/line number info into message e.filename = filename; e.linenum = r.linenum; e.msg = format("%s:%d %s", filename, r.linenum, e.msg); throw e; } } ... Expr parseConsoleInput() { // No filename/line number info here return parseExpr(stdin.byLine()); } ... Expr parseStringInput(string input) { try { auto r = NumberedSrc(input); return parseExpr(r); } catch(SyntaxError e) { // We don't have filename here, but we do have // line number, so use that. e.linenum = r.linenum; e.msg = format("Line %d: %s", r.linenum, e.msg); throw e; } } Notice that I have different wrapper functions for dealing with different kinds of input; the underlying parser doesn't even care about filename/line numbers, but the wrapper functions catch any parsing exceptions that are thrown from underneath and prepend this info as appropriate. This simplifies the parsing code (don't have to keep worrying about line numbers and propagating filenames) and also produces output that makes sense: - Console input don't need line numbers; the user doesn't care if this is the 500th command he typed, or the 701st. - Internal strings don't get a nonsensical "filename", 'cos they don't *have* a filename in the first place. Just a single line number so you can locate the problem in, say, the string literal or something. - File input has filename and line number. - Other kinds of input contexts can be handled in the same way. - The use of NumberedSrc (maybe better named LineNumberedRange or something) makes line numbers available to each of these contexts at the top-level. Though of course, the lexer itself can also handle this (but it adds complications if you have to continue detecting newlines in, say, string literals, when the lexer is in a different state). The cleaner code does come at a price, though: this code probably is a bit inefficient due to the number of layers in it. But, just thought I'd share this idea. T -- You are only young once, but you can stay immature indefinitely. -- azephrahel
Jan 28 2013
prev sibling next sibling parent Philippe Sigaud <philippe.sigaud gmail.com> writes:
On Mon, Jan 28, 2013 at 12:43 PM, Mehrdad <wfunction hotmail.com> wrote:
 https://gist.github.com/b10ae22ab822c87467a3

This link does not work for me :(
Jan 28 2013
prev sibling next sibling parent "Mehrdad" <wfunction hotmail.com> writes:
On Monday, 28 January 2013 at 20:35:51 UTC, Philippe Sigaud wrote:
 On Mon, Jan 28, 2013 at 12:43 PM, Mehrdad 
 <wfunction hotmail.com> wrote:
 https://gist.github.com/b10ae22ab822c87467a3

This link does not work for me :(

Yeah sorry, I found a bug in it, so I took it off and fixed it (... or so I think). I wouldn't be surprised if it's still buggy. It's more of a proof of concept than anything... it definitely has room for improvement, but the speed is still decent. Here's another link (grab it before it's gone!): https://gist.github.com/addfe830acf9785d01ef
Jan 28 2013
prev sibling next sibling parent Philippe Sigaud <philippe.sigaud gmail.com> writes:
 This link does not work for me :(

Yeah sorry, I found a bug in it, so I took it off and fixed it (... or so I think). I wouldn't be surprised if it's still buggy. It's more of a proof of concept than anything... it definitely has room for improvement, but the speed is still decent. Here's another link (grab it before it's gone!): https://gist.github.com/addfe830acf9785d01ef

Grabby grab! Owww, C++! My eyes, they're melting! My hairs, on fire! Why did you forfeit your immortal soul? Anyway, thanks. I'll look at it more closely. It seems to be the kind of C++ I can still read *and* understand.
Jan 28 2013
prev sibling next sibling parent "Mehrdad" <wfunction hotmail.com> writes:
On Monday, 28 January 2013 at 22:00:30 UTC, Philippe Sigaud wrote:
 Owww, C++! My eyes, they're melting! My hairs, on fire! Why did 
 you forfeit your immortal soul?

lol. Mainly because I got tired of running into compiler bugs, to be honest. :|
 Anyway, thanks. I'll look at it more closely. It seems to be 
 the kind of C++ I can still read *and* understand.

Yeah, my goal was to make it mostly C-compatible, so I kept the C++-iness to a minimum. It's more like "C"++ rather than "C++", I think. :P
Jan 28 2013
prev sibling next sibling parent "Brian Schott" <briancschott gmail.com> writes:
On Monday, 28 January 2013 at 21:03:21 UTC, Timon Gehr wrote:
 Better, but still slow.

I implemented the various suggestions from a past thread and made the lexer only work ubyte[] (to aviod phobos converting everything to dchar all the time) and gave the tokenizer instance a character buffer that it re-uses. Results: $ avgtime -q -r 200 ./dscanner --tokenCount ../phobos/std/datetime.d ------------------------ Total time (ms): 13861.8 Repetitions : 200 Sample mode : 69 (90 ocurrences) Median time : 69.0745 Avg time : 69.3088 Std dev. : 0.670203 Minimum : 68.613 Maximum : 72.635 95% conf.int. : [67.9952, 70.6223] e = 1.31357 99% conf.int. : [67.5824, 71.0351] e = 1.72633 EstimatedAvg95%: [69.2159, 69.4016] e = 0.0928836 EstimatedAvg99%: [69.1867, 69.4308] e = 0.12207 If my math is right, that means it's getting 4.9 million tokens/second now. According to Valgrind the only way to really improve things now is to require that the input to the lexer support slicing. (Remember the secret of Tango's XML parser...) The bottleneck is now on the calls to .idup to construct the token strings from slices of the buffer.
 I guess that at some point

 pure nothrow TokenType lookupTokenType(const string input)

 might become a bottleneck. (DMD does not generate near-optimal 
 string switches, I think.)

Right now that's a fairly small box on KCachegrind.
Jan 30 2013
prev sibling next sibling parent reply Jacob Carlborg <doob me.com> writes:
On 2013-01-27 10:51, Brian Schott wrote:
 I'm writing a D lexer for possible inclusion in Phobos.

 DDOC:
 http://hackerpilot.github.com/experimental/std_lexer/phobos/lexer.html
 Code:
 https://github.com/Hackerpilot/Dscanner/blob/range-based-lexer/std/d/lexer.d


 It's currently able to correctly syntax highlight all of Phobos, but
 does a fairly bad job at rejecting or notifying users/callers about
 invalid input.

 I'd like to hear arguments on the various ways to handle errors in the
 lexer. In a compiler it would be useful to throw an exception on finding
 something like a string literal that doesn't stop before EOF, but a text
 editor or IDE would probably want to be a bit more lenient. Maybe having
 it run-time (or compile-time configurable) like std.csv would be the
 best option here.

 I'm interested in ideas on the API design and other high-level issues at
 the moment. I don't consider this ready for inclusion. (The current
 module being reviewed for inclusion in Phobos is the new std.uni.)

Just thinking out loud here. Would it be possible to lex a file in parallel? Cutting it in half (or similar) and lex both pieces simultaneously in parallel. -- /Jacob Carlborg
Jan 31 2013
next sibling parent reply FG <home fgda.pl> writes:
On 2013-01-31 13:14, Jacob Carlborg wrote:
 Just thinking out loud here. Would it be possible to lex a file in parallel?
 Cutting it in half (or similar) and lex both pieces simultaneously in parallel.

Do you know where you can safely cut it without having it lexed beforehand? :)
Jan 31 2013
parent reply Jacob Carlborg <doob me.com> writes:
On 2013-01-31 13:34, FG wrote:

 Do you know where you can safely cut it without having it lexed
 beforehand? :)

I was thinking that myself. It would probably be possible to just cut it in the middle and then lex a few characters forward and backwards until you get a valid token. Try and calculate the correct index where to cut. Although I have no idea how much trouble it would given you and how much you would gain. -- /Jacob Carlborg
Jan 31 2013
parent Jacob Carlborg <doob me.com> writes:
On 2013-01-31 16:54, H. S. Teoh wrote:

 Doesn't work if the middle happens to be inside a string literal
 containing code. Esp. a q{} literal (you wouldn't be able to tell where
 it starts/ends without scanning the entire file, because the {}'s nest).

That would be a problem. -- /Jacob Carlborg
Jan 31 2013
prev sibling next sibling parent reply dennis luehring <dl.soluz gmx.net> writes:
Am 31.01.2013 13:14, schrieb Jacob Carlborg:
 On 2013-01-27 10:51, Brian Schott wrote:
 I'm writing a D lexer for possible inclusion in Phobos.

 DDOC:
 http://hackerpilot.github.com/experimental/std_lexer/phobos/lexer.html
 Code:
 https://github.com/Hackerpilot/Dscanner/blob/range-based-lexer/std/d/lexer.d


 It's currently able to correctly syntax highlight all of Phobos, but
 does a fairly bad job at rejecting or notifying users/callers about
 invalid input.

 I'd like to hear arguments on the various ways to handle errors in the
 lexer. In a compiler it would be useful to throw an exception on finding
 something like a string literal that doesn't stop before EOF, but a text
 editor or IDE would probably want to be a bit more lenient. Maybe having
 it run-time (or compile-time configurable) like std.csv would be the
 best option here.

 I'm interested in ideas on the API design and other high-level issues at
 the moment. I don't consider this ready for inclusion. (The current
 module being reviewed for inclusion in Phobos is the new std.uni.)

Just thinking out loud here. Would it be possible to lex a file in parallel? Cutting it in half (or similar) and lex both pieces simultaneously in parallel.

why not only the symbols at the border needs to be "connected" then the question is: how many blocks(threads) are ok for 1,2,3,8 cores - can also depend on the speed of the filesystem
Jan 31 2013
parent reply Jacob Carlborg <doob me.com> writes:
On 2013-01-31 13:35, dennis luehring wrote:

 why not only the symbols at the border needs to be "connected" then

 the question is: how many blocks(threads) are ok for 1,2,3,8 cores - can
 also depend on the speed of the filesystem

That would require some profiling to figure out. -- /Jacob Carlborg
Jan 31 2013
parent reply dennis luehring <dl.soluz gmx.net> writes:
Am 31.01.2013 13:48, schrieb Jacob Carlborg:
 On 2013-01-31 13:35, dennis luehring wrote:

 why not only the symbols at the border needs to be "connected" then

 the question is: how many blocks(threads) are ok for 1,2,3,8 cores - can
 also depend on the speed of the filesystem

That would require some profiling to figure out.

i would say it "can" help alot so the design should be able to use split-parts and combine symboles at the border and also file-based lexing should be threadable so that 16 files can be handled by my 16 cores system full in parallel :) but its also size dependend - it makes no sense to split in too small parts, could be counter-productive
Jan 31 2013
parent Jacob Carlborg <doob me.com> writes:
On 2013-01-31 13:57, dennis luehring wrote:

 i would say it "can" help alot so the design should be able to use
 split-parts and combine symboles at the border

 and also file-based lexing should be threadable so that 16 files can be
 handled by my 16 cores system full in parallel :)

This is kind of obvious, I think. That's why I started to think at this less obvious case.
 but its also size dependend - it makes no sense to split in too small
 parts, could be counter-productive

Of course not. That would require some profiling as well to find a sweet spot. -- /Jacob Carlborg
Jan 31 2013
prev sibling parent Jacob Carlborg <doob me.com> writes:
On 2013-02-01 08:41, qznc wrote:

 I think a lexer should be IO-bound on todays machines, so parallelizing
 should not give much benefits.

I'm not sure I understand. -- /Jacob Carlborg
Feb 01 2013
prev sibling next sibling parent "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Thu, Jan 31, 2013 at 01:48:02PM +0100, Jacob Carlborg wrote:
 On 2013-01-31 13:34, FG wrote:
 
Do you know where you can safely cut it without having it lexed
beforehand? :)

I was thinking that myself. It would probably be possible to just cut it in the middle and then lex a few characters forward and backwards until you get a valid token. Try and calculate the correct index where to cut.

Doesn't work if the middle happens to be inside a string literal containing code. Esp. a q{} literal (you wouldn't be able to tell where it starts/ends without scanning the entire file, because the {}'s nest). T -- Life is unfair. Ask too much from it, and it may decide you don't deserve what you have now either.
Jan 31 2013
prev sibling next sibling parent "deadalnix" <deadalnix gmail.com> writes:
On Thursday, 31 January 2013 at 12:48:03 UTC, Jacob Carlborg 
wrote:
 On 2013-01-31 13:34, FG wrote:

 Do you know where you can safely cut it without having it lexed
 beforehand? :)

I was thinking that myself. It would probably be possible to just cut it in the middle and then lex a few characters forward and backwards until you get a valid token. Try and calculate the correct index where to cut. Although I have no idea how much trouble it would given you and how much you would gain.

I don't think it worth the complexity. You can lex both file in parallel with 2 lexer instance if you want to make things faster.
Jan 31 2013
prev sibling next sibling parent "qznc" <qznc go.to> writes:
On Thursday, 31 January 2013 at 12:14:35 UTC, Jacob Carlborg 
wrote:
 Just thinking out loud here. Would it be possible to lex a file 
 in parallel? Cutting it in half (or similar) and lex both 
 pieces simultaneously in parallel.

I think a lexer should be IO-bound on todays machines, so parallelizing should not give much benefits. Of course, you can write a microbenchmark, such that lexing is compute-bound and parallelizing gives a speedup. ;)
Jan 31 2013
prev sibling next sibling parent "Mehrdad" <wfunction hotmail.com> writes:
On Friday, 1 February 2013 at 07:41:11 UTC, qznc wrote:
 On Thursday, 31 January 2013 at 12:14:35 UTC, Jacob Carlborg 
 wrote:
 Just thinking out loud here. Would it be possible to lex a 
 file in parallel? Cutting it in half (or similar) and lex both 
 pieces simultaneously in parallel.

I think a lexer should be IO-bound on todays machines, so parallelizing should not give much benefits.

Pretty sure RAMs these days can handle more than 1 processor's memory request at a given point in time, so there should be an N-times speedup, where N maxes out at the max throughput...
Feb 01 2013
prev sibling next sibling parent "Tove" <tove fransson.se> writes:
On Friday, 1 February 2013 at 11:06:02 UTC, Walter Bright wrote:
 On 1/30/2013 8:44 AM, Dmitry Olshansky wrote:
 In allocation scheme I proposed that ID could be a 32bit 
 offset into the unique
 identifiers chunk.

That only works if you know in advance the max size the chunk can ever be and preallocate it. Otherwise, you have no guarantee that the next allocated chunk will be within 32 bits of address of the previous chunks.

This can easily be archived by preallocating file.size bytes... it will be x orders of magnitude too much, but it doesn't matter, as in the end only the cache locality matters.
Feb 01 2013
prev sibling next sibling parent reply "Brian Schott" <briancschott gmail.com> writes:
After several hours of optimizing I've managed to make it so that 
dmd's lexer is only three times faster.

http://hackerpilot.github.com/experimental/std_lexer/images/times.png

The funny thing is that compiling with LDC gave a bigger speed 
boost than any of my code refacatoring.
Feb 03 2013
next sibling parent reply FG <home fgda.pl> writes:
On 2013-02-04 01:22, Brian Schott wrote:
 After several hours of optimizing I've managed to make it so that dmd's lexer
is
 only three times faster.

What are you comparing here? How do you time DMD's lexing stage?
Feb 03 2013
parent reply FG <home fgda.pl> writes:
On 2013-02-04 01:41, Brian Schott wrote:
 What are you comparing here? How do you time DMD's lexing stage?

A simple hack of module.c that prints out the file's token count and then calls exit(0).

Ah, fine. Then it's not that bad. :) Have you already made it use slices of the whole source, without copying the string values of tokens? What kind of optimizations have you made?
Feb 03 2013
parent reply Jacob Carlborg <doob me.com> writes:
On 2013-02-04 01:50, FG wrote:

 Ah, fine. Then it's not that bad. :)
 Have you already made it use slices of the whole source,
 without copying the string values of tokens?
 What kind of optimizations have you made?

That would be interesting to hear. Three times slower than DMD doesn't sound good. I know that DMD is fast, but three times. -- /Jacob Carlborg
Feb 03 2013
parent FG <home fgda.pl> writes:
On 2013-02-04 08:57, Jacob Carlborg wrote:
 On 2013-02-04 01:50, FG wrote:

 Ah, fine. Then it's not that bad. :)
 Have you already made it use slices of the whole source,
 without copying the string values of tokens?
 What kind of optimizations have you made?

That would be interesting to hear. Three times slower than DMD doesn't sound good. I know that DMD is fast, but three times.

Looking at the current source, there is now a StringCache to hold the strings. It is however also used to store all those long comments, so I believe quite some time is unnecessarily wasted on generating hashes for them. But probably there are other parts slowing things down.
Feb 04 2013
prev sibling next sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 2/4/13 10:19 PM, Brian Schott wrote:
 More optimizing:
 http://hackerpilot.github.com/experimental/std_lexer/images/times2.png

 Still only half speed. I'm becoming more and more convinced that Walter
 is actually a wizard.

Suggestion: take lexer.c and convert it to D. Should take one day, and you'll have performance on par. Andrei
Feb 04 2013
next sibling parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 2/4/13 11:05 PM, Andrej Mitrovic wrote:
 On 2/5/13, Andrei Alexandrescu<SeeWebsiteForEmail erdani.org>  wrote:
 Suggestion: take lexer.c and convert it to D. Should take one day, and
 you'll have performance on par.

This was already done for DDMD, and the more recent minimal version of it: https://github.com/zachthemystic/ddmd-clean/blob/master/dmd/lexer.d

Awesome! Has anyone measured the speed? Andrei
Feb 04 2013
prev sibling next sibling parent reply Jacob Carlborg <doob me.com> writes:
On 2013-02-05 04:22, Andrei Alexandrescu wrote:

 Suggestion: take lexer.c and convert it to D. Should take one day, and
 you'll have performance on par.

There's reason for why nobody has just extract the lexer from DMD. It will probably take more than a day just to extract the lexer to be able to use it without the rest of DMD. It's probably easier to do the actual porting than extract the lexer. Actually, when I think about it, Johnathan is working on porting the DMD lexer to D. -- /Jacob Carlborg
Feb 05 2013
next sibling parent Jacob Carlborg <doob me.com> writes:
On 2013-02-05 10:07, Jonathan M Davis wrote:

 Not exactly. I decided that it would be of greater benefit to just write the
 thing according to the grammar and make sure that the compiler matched the
 spec.

Aha, I see. -- /Jacob Carlborg
Feb 05 2013
prev sibling next sibling parent Jacob Carlborg <doob me.com> writes:
On 2013-02-05 11:44, Jonathan M Davis wrote:

 If it could be ported as-is and then compared for speed, then that would be a
 great test, since it would be able to show how much of the speed problem is
 purely a compiler issue as opposed to a design issue, but you wouldn't be able
 to actually use it for anything more than what Brian is doing with his
 performance testing, because as you point out, it's too integrated into dmd.
 It _would_ be valuable though as a performance test of the compiler.

Yeah, that could be useful. -- /Jacob Carlborg
Feb 05 2013
prev sibling next sibling parent Jacob Carlborg <doob me.com> writes:
On 2013-02-05 11:46, Jonathan M Davis wrote:

 It turns out that it has nothing to do with string mixins (though it does have
 to do with CTFE):

 http://d.puremagic.com/issues/show_bug.cgi?id=9452

 Fortunately, there's a simple workaround that'll let me continue until the bug
 is fixed.

Ok, good that it's not a blocker. -- /Jacob Carlborg
Feb 05 2013
prev sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 2/5/13 5:44 AM, Jonathan M Davis wrote:
 On Tuesday, February 05, 2013 09:14:35 Jacob Carlborg wrote:
 On 2013-02-05 04:22, Andrei Alexandrescu wrote:
 Suggestion: take lexer.c and convert it to D. Should take one day, and
 you'll have performance on par.

There's reason for why nobody has just extract the lexer from DMD. It will probably take more than a day just to extract the lexer to be able to use it without the rest of DMD.

There are basic ideas about how it works which are obviously good and should be in the finished product in D, but it's not range-based, which forces you to do things differently. It's also not configurable, which forces you to do things differently. If it could be ported as-is and then compared for speed, then that would be a great test, since it would be able to show how much of the speed problem is purely a compiler issue as opposed to a design issue, but you wouldn't be able to actually use it for anything more than what Brian is doing with his performance testing, because as you point out, it's too integrated into dmd. It _would_ be valuable though as a performance test of the compiler.

As far as I could tell the dependencies of the lexer are fairly contained (util, token, identifier) and conversion to input range is immediate. Andrei
Feb 05 2013
next sibling parent Jacob Carlborg <doob me.com> writes:
On 2013-02-05 14:34, Andrei Alexandrescu wrote:

 As far as I could tell the dependencies of the lexer are fairly
 contained (util, token, identifier) and conversion to input range is
 immediate.

That's not what I remember from last time I gave it a try. -- /Jacob Carlborg
Feb 05 2013
prev sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 2/5/13 10:29 PM, Jonathan M Davis wrote:
 On Tuesday, February 05, 2013 08:34:29 Andrei Alexandrescu wrote:
 As far as I could tell the dependencies of the lexer are fairly
 contained (util, token, identifier) and conversion to input range is
 immediate.

I don't remember all of the details at the moment, since it's been several months since I looked at dmd's lexer, but a lot of the problem stems from the fact that it's all written around the assumption that it's dealing with a char*. Converting it to operate on string might be fairly straightforward, but it gets more complicated when dealing with ranges. Also, both Walter and others have stated that the lexer in D should be configurable in a number of ways, and dmd's lexer isn't configurable at all. So, while a direct translation would likely be quick, refactoring it to do what it's been asked to be able to do would not be. I'm quite a ways along with one that's written from scratch, but I need to find the time to finish it. Also, doing it from scratch has had the added benefit of helping me find bugs in the spec and in dmd.

I think it would be reasonable for a lexer to require a range of ubyte as input, and carry its own decoding. In the first approximation it may even require a random-access range of ubyte. Andrei
Feb 05 2013
next sibling parent reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
08-Feb-2013 12:01, Jonathan M Davis пишет:
 On Tuesday, February 05, 2013 22:51:32 Andrei Alexandrescu wrote:
 I think it would be reasonable for a lexer to require a range of ubyte
 as input, and carry its own decoding. In the first approximation it may
 even require a random-access range of ubyte.

Another big issue is the fact that in some ways, using a pointer like dmd's lexer does is actually superior to using a range. In particular, it's trivial to determine where in the text a token is, because you can simply subtract the pointer in the token from the initial pointer. Strings would be okay too, because you can subtract their ptr properties. But the closest that you'll get with ranges is to subtract their lengths, and the only ranges that are likely to define length are random-access ranges.

Not true, certain ranges know length but can't be random access as indexing is O(lgN) or worse. Including a stripe of chunks as taken from file. -- Dmitry Olshansky
Feb 08 2013
parent reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
08-Feb-2013 13:40, Jonathan M Davis пишет:
 On Friday, February 08, 2013 12:12:30 Dmitry Olshansky wrote:
 08-Feb-2013 12:01, Jonathan M Davis пишет:
 On Tuesday, February 05, 2013 22:51:32 Andrei Alexandrescu wrote:
 I think it would be reasonable for a lexer to require a range of ubyte
 as input, and carry its own decoding. In the first approximation it may
 even require a random-access range of ubyte.

Another big issue is the fact that in some ways, using a pointer like dmd's lexer does is actually superior to using a range. In particular, it's trivial to determine where in the text a token is, because you can simply subtract the pointer in the token from the initial pointer. Strings would be okay too, because you can subtract their ptr properties. But the closest that you'll get with ranges is to subtract their lengths, and the only ranges that are likely to define length are random-access ranges.

Not true, certain ranges know length but can't be random access as indexing is O(lgN) or worse. Including a stripe of chunks as taken from file.

I said that the only ones which are "likely" to define length are random-access range. There _are_ other ranges which can, but in most cases, if you can know the length, you can do random access as well.

Well I honestly disagree about the promise of knowing length - being able to index. "The most ranges" is arrays and wrappers on top of these. Given current realities oF D and Phobos I'm afraid you are right though. Regardless, the main issue still
 stands in that dealing with keeping track of the index of the code unit of a
 token is more complicated and generally more expensive with ranges than it is
 with a pointer.

If target is random access range just use offset throughout. It's basically becomes base + offset vs base + pointer i.e. non-issue If not then pointer argument no longer applies and you can just as well use separate counter on per popFront. It'd not that costly in this case and flexibility tramps other concerns with forward ranges in any case.
 Some range types will do better than others, but short of
 using a string's ptr property, there's always going to be some additional
 overhead in comparison to pointers to keep track of the indices or to keep a
 range or slice of one as part of a token. The pointer's just more lightweight.
 That doesn't make ranges unacceptable by any means. It just means that they're
 going to take at least a slight performance hit in comparison to pointers.

See above. Pointer to something inside of a buffer == index in buffer, typically even with pointer you can't drop the 'buffer' reference itself. -- Dmitry Olshansky
Feb 08 2013
parent Dmitry Olshansky <dmitry.olsh gmail.com> writes:
08-Feb-2013 23:25, Jonathan M Davis пишет:
 On Friday, February 08, 2013 14:29:20 Dmitry Olshansky wrote:
 If target is random access range just use offset throughout. It's
 basically becomes base + offset vs base + pointer i.e. non-issue

 If not then pointer argument no longer applies and you can just as well
 use separate counter on per popFront. It'd not that costly in this case
 and flexibility tramps other concerns with forward ranges in any case.

I don't know exactly how costly it ends up being (and certainly, poor design choices in other areas could dwarf that cost), but it does incur extra overhead throughout the lexing process. In most code, it wouldn't be a big deal, but the lexer is trying to be lightning fast, so every extra

 that is going to add up and slow it down.

I suspect it *might* require extra register for base depending on smartness of the compiler, extra register in turn could rise the cost. The key to speed here however is not in using raw pointers, assembly and/or SIMD. FWIW if there is only a single base pointer + offsets compiler can assume no pointer aliasing and optimize things better. The discussion becomes purely rhetorical unless some hard data is actually presented and not based on DMD's optimizer, please.
 But you really don't have any choice
 if you don't even have random access.

Agreed.
 Regardless, my point is that accepting
 generic ranges rather than pointers complicates things somewhat and 

 at least a slight performance penalty.

Complication - yes, slight performance cost is what I doubt it in RA case. Seems like a compiler/optimizer issue. For one I used to write cycles with naked pointers in C to get better performance out of crappy compilers. I won't ever do it today as it would get worse performance in addition to being less readable (I've checked at least on VC++ about a year ago, compiler rewrites these index-based loops better, YMMV). -- Dmitry Olshansky
Feb 08 2013
prev sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 2/8/2013 12:01 AM, Jonathan M Davis wrote:
 It's not quite a use case where
 ranges shine - especially when efficiency is a top priority.

A more problematic case is dmd's lexer relies on a 0 byte at the end to be a "sentinel" for the end of file. Without such a sentinel, every consumption of a source character requires two operations rather than one.
Feb 08 2013
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 2/9/13 3:07 AM, Jonathan M Davis wrote:
 On Friday, February 08, 2013 23:48:50 Walter Bright wrote:
 On 2/8/2013 12:01 AM, Jonathan M Davis wrote:
 It's not quite a use case where
 ranges shine - especially when efficiency is a top priority.

A more problematic case is dmd's lexer relies on a 0 byte at the end to be a "sentinel" for the end of file. Without such a sentinel, every consumption of a source character requires two operations rather than one.

I didn't know that. That's a cute trick. But yeah, without controlling the input, it's not possible, and that won't work with a general implementation in a library. It would be trivial enough to put a wrapper around the input to add the 0 byte at the end, but the wrapper would still have to keep checking for empty, so you wouldn't gain anything.

That's not a problem. You may require that the input has a terminating zero. Andrei
Feb 09 2013
next sibling parent reply Jacob Carlborg <doob me.com> writes:
On 2013-02-09 15:37, Andrei Alexandrescu wrote:

 That's not a problem. You may require that the input has a terminating
 zero.

You cannot just append a terminating zero in the lexer if it's a range? -- /Jacob Carlborg
Feb 09 2013
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 2/9/13 10:38 AM, Jacob Carlborg wrote:
 On 2013-02-09 15:37, Andrei Alexandrescu wrote:

 That's not a problem. You may require that the input has a terminating
 zero.

You cannot just append a terminating zero in the lexer if it's a range?

You require it. It's client's burden. Andrei
Feb 09 2013
parent Jacob Carlborg <doob me.com> writes:
On 2013-02-09 16:46, Andrei Alexandrescu wrote:

 You require it. It's client's burden.

Ok, I see. -- /Jacob Carlborg
Feb 09 2013
prev sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 2/9/2013 6:37 AM, Andrei Alexandrescu wrote:
 On 2/9/13 3:07 AM, Jonathan M Davis wrote:
 On Friday, February 08, 2013 23:48:50 Walter Bright wrote:
 On 2/8/2013 12:01 AM, Jonathan M Davis wrote:
 It's not quite a use case where
 ranges shine - especially when efficiency is a top priority.

A more problematic case is dmd's lexer relies on a 0 byte at the end to be a "sentinel" for the end of file. Without such a sentinel, every consumption of a source character requires two operations rather than one.

I didn't know that. That's a cute trick. But yeah, without controlling the input, it's not possible, and that won't work with a general implementation in a library. It would be trivial enough to put a wrapper around the input to add the 0 byte at the end, but the wrapper would still have to keep checking for empty, so you wouldn't gain anything.

That's not a problem. You may require that the input has a terminating zero.

Perhaps we can formalize this a bit. Define a SentinelInputRange, which has an additional manifest constant with the name 'sentinel'. empty becomes defined as: empty = front == sentinel; popFront() does not advance if empty. Advancing over a SentinelInputRange can be done the usual: for (r = range; !r.empty; r.popFront()) { auto e = r.front; } or can be done as: for (r = range; r.front != sentinel; r.popFront()) { auto e = r.front; } This should work well with 0 terminated strings. The output of the lexer should also be a SentinelInputRange, with TOKeof as the sentinel.
Feb 09 2013
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 2/9/2013 12:43 PM, Walter Bright wrote:> Perhaps we can formalize this a
bit. 
Define a SentinelInputRange, which has an
 additional manifest constant with the name 'sentinel'. empty becomes defined
as:

      empty = front == sentinel;

 popFront() does not advance if empty. Advancing over a SentinelInputRange can
be
 done the usual:

      for (r = range; !r.empty; r.popFront()) { auto e = r.front; }

 or can be done as:

      for (r = range; r.front != sentinel; r.popFront()) { auto e = r.front; }

If it's not clear, SentinelInputRange.front can be called *before* checking empty. If it is empty, then the sentinel is returned.
Feb 09 2013
parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 2/9/13 3:49 PM, Walter Bright wrote:
 On 2/9/2013 12:43 PM, Walter Bright wrote:> Perhaps we can formalize
 this a bit. Define a SentinelInputRange, which has an
  > additional manifest constant with the name 'sentinel'. empty becomes
 defined as:
  >
  > empty = front == sentinel;
  >
  > popFront() does not advance if empty. Advancing over a
 SentinelInputRange can be
  > done the usual:
  >
  > for (r = range; !r.empty; r.popFront()) { auto e = r.front; }
  >
  > or can be done as:
  >
  > for (r = range; r.front != sentinel; r.popFront()) { auto e = r.front; }

 If it's not clear, SentinelInputRange.front can be called *before*
 checking empty. If it is empty, then the sentinel is returned.

Yah, that all would fit C-style strings and singly-linked lists like a glove. I've been long hoping the opportunity to define such a range would present itself :o). Andrei
Feb 09 2013
prev sibling next sibling parent Jonathan M Davis <jmdavisProg gmx.com> writes:
On Tuesday, February 05, 2013 09:14:35 Jacob Carlborg wrote:
 Actually, when I think about it, Johnathan is working on porting the DMD
 lexer to D.

Not exactly. I decided that it would be of greater benefit to just write the thing according to the grammar and make sure that the compiler matched the spec. I've already found and fixed several bugs in the spec because of that. Also, given how I'm writing it, I expect that its speed will be similar to that of dmd given the fact that I'm writing it such that it will generally do the minimum number of operations required. But it wouldn't surprise me at all if no lexer can possibly match dmd at the moment as long as it's compiled with dmd (at least on Linux), because gcc's optimizer is so much better than dmd's, and dmd is getting compiled with gcc, whereas the competing lexer in D is being compiled with dmd. I don't have a lot of time to work on my lexer at the moment, but I'd really like to get it done soon, and I have most of the features in place. Unfortunately, when I went to try and work on it again the other day, the code wasn't compiling anymore, and I need to figure out why. I suspect that it's a regression related to string mixins, but I have to investigate further to sort it out. - Jonathan M Davis
Feb 05 2013
prev sibling next sibling parent Jonathan M Davis <jmdavisProg gmx.com> writes:
On Tuesday, February 05, 2013 09:14:35 Jacob Carlborg wrote:
 On 2013-02-05 04:22, Andrei Alexandrescu wrote:
 Suggestion: take lexer.c and convert it to D. Should take one day, and
 you'll have performance on par.

There's reason for why nobody has just extract the lexer from DMD. It will probably take more than a day just to extract the lexer to be able to use it without the rest of DMD.

There are basic ideas about how it works which are obviously good and should be in the finished product in D, but it's not range-based, which forces you to do things differently. It's also not configurable, which forces you to do things differently. If it could be ported as-is and then compared for speed, then that would be a great test, since it would be able to show how much of the speed problem is purely a compiler issue as opposed to a design issue, but you wouldn't be able to actually use it for anything more than what Brian is doing with his performance testing, because as you point out, it's too integrated into dmd. It _would_ be valuable though as a performance test of the compiler. - Jonathan M Davis
Feb 05 2013
prev sibling parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 2/5/13 12:45 AM, deadalnix wrote:
 On Tuesday, 5 February 2013 at 03:22:52 UTC, Andrei Alexandrescu wrote:
 On 2/4/13 10:19 PM, Brian Schott wrote:
 More optimizing:
 http://hackerpilot.github.com/experimental/std_lexer/images/times2.png

 Still only half speed. I'm becoming more and more convinced that Walter
 is actually a wizard.

Suggestion: take lexer.c and convert it to D. Should take one day, and you'll have performance on par.

DMD's lexer is not suitable for phobos IMO. It doesn't take a range as input and don't produce a range. It also lack the features you may want from a multi usage D lexer.

I looked at the code. Once converted, it's trivial to convert the lexer into an input range. Andrei
Feb 05 2013
prev sibling next sibling parent reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
On 02/05/2013 07:19 AM, Brian Schott wrote:
 More optimizing:
 http://hackerpilot.github.com/experimental/std_lexer/images/times2.png

 Still only half speed. I'm becoming more and more convinced that Walter
 is actually a wizard.

Time to do some hacking on your lexer I guess. I'll try add a couple of tricks and see if it helps. What command do you use for benchmarking?
Feb 05 2013
parent reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
05-Feb-2013 22:25, Brian Schott пишет:
 On Tuesday, 5 February 2013 at 08:52:53 UTC, Dmitry Olshansky
 wrote:
 Time to do some hacking on your lexer I guess. I'll try add a couple
 of tricks and see if it helps.

 What command do you use for benchmarking?

I've been using avgtime[1] for measuring run times, perf[2], and callgrind/kcachegrind[3] for profiling. [1] https://github.com/jmcabo/avgtime [2] https://perf.wiki.kernel.org/index.php/Tutorial [3] http://kcachegrind.sourceforge.net/html/Home.html

Thanks. I've made a pass through the code and found some places to improve. Sadly I've been rather busy at work today. Anyway get ready for pull requests should my ideas prove to be worthy performance-wise. -- Dmitry Olshansky
Feb 05 2013
parent reply Jacob Carlborg <doob me.com> writes:
On 2013-02-08 16:08, Brian Schott wrote:

 http://hackerpilot.github.com/experimental/std_lexer/images/times3.png

 Your suggestions seem to have worked. We're below 20ms on my machine for
 the datetime module.

DMD is still consistently faster, :( -- /Jacob Carlborg
Feb 08 2013
next sibling parent reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
08-Feb-2013 19:12, Jacob Carlborg пишет:
 On 2013-02-08 16:08, Brian Schott wrote:

 http://hackerpilot.github.com/experimental/std_lexer/images/times3.png

 Your suggestions seem to have worked. We're below 20ms on my machine for
 the datetime module.

DMD is still consistently faster, :(

Keep in mind the D runtime start-up cost. In the end on small files DMD always wins because of slim C-runtime. To estimate runtime startup lag I've compared run times of int main(){ return 0; } in C (gcc): ------------------------ Total time (ms): 31637.7 Repetitions : 35000 Sample mode : 0.8 (23649 ocurrences) Median time : 0.873 Avg time : 0.903935 Std dev. : 0.204107 Minimum : 0.552 Maximum : 9.057 95% conf.int. : [0.503892, 1.30398] e = 0.400043 99% conf.int. : [0.378189, 1.42968] e = 0.525746 EstimatedAvg95%: [0.901796, 0.906073] e = 0.00213832 EstimatedAvg99%: [0.901125, 0.906745] e = 0.00281023 void main(){} in D (ldc): ------------------------ Total time (ms): 15429.4 Repetitions : 7000 Sample mode : 2.1 (1785 ocurrences) Median time : 2.128 Avg time : 2.2042 Std dev. : 0.466834 Minimum : 1.286 Maximum : 19.933 95% conf.int. : [1.28922, 3.11918] e = 0.914978 99% conf.int. : [1.00171, 3.40668] e = 1.20248 EstimatedAvg95%: [2.19326, 2.21514] e = 0.0109361 EstimatedAvg99%: [2.18983, 2.21857] e = 0.0143724 dmitry dmitry-VirtualBox ~ $ I think that the mode could serve as an indicator of the most probable outcome. For me it's around 0.8 vs 2.1 ms. D runtime rides on top of C one + plus the same OS costs to launch a process, so from now on let's keep in mind these extra ~ 1.3 ms of bias in favor of C-based lexer. -- Dmitry Olshansky
Feb 08 2013
parent Jacob Carlborg <doob me.com> writes:
On 2013-02-08 16:28, Dmitry Olshansky wrote:
 08-Feb-2013 19:12, Jacob Carlborg пишет:
 On 2013-02-08 16:08, Brian Schott wrote:

 http://hackerpilot.github.com/experimental/std_lexer/images/times3.png

 Your suggestions seem to have worked. We're below 20ms on my machine for
 the datetime module.

DMD is still consistently faster, :(

Keep in mind the D runtime start-up cost. In the end on small files DMD always wins because of slim C-runtime. To estimate runtime startup lag I've compared run times of int main(){ return 0; } in C (gcc): ------------------------ Total time (ms): 31637.7 Repetitions : 35000 Sample mode : 0.8 (23649 ocurrences) Median time : 0.873 Avg time : 0.903935 Std dev. : 0.204107 Minimum : 0.552 Maximum : 9.057 95% conf.int. : [0.503892, 1.30398] e = 0.400043 99% conf.int. : [0.378189, 1.42968] e = 0.525746 EstimatedAvg95%: [0.901796, 0.906073] e = 0.00213832 EstimatedAvg99%: [0.901125, 0.906745] e = 0.00281023 void main(){} in D (ldc): ------------------------ Total time (ms): 15429.4 Repetitions : 7000 Sample mode : 2.1 (1785 ocurrences) Median time : 2.128 Avg time : 2.2042 Std dev. : 0.466834 Minimum : 1.286 Maximum : 19.933 95% conf.int. : [1.28922, 3.11918] e = 0.914978 99% conf.int. : [1.00171, 3.40668] e = 1.20248 EstimatedAvg95%: [2.19326, 2.21514] e = 0.0109361 EstimatedAvg99%: [2.18983, 2.21857] e = 0.0143724 dmitry dmitry-VirtualBox ~ $ I think that the mode could serve as an indicator of the most probable outcome. For me it's around 0.8 vs 2.1 ms. D runtime rides on top of C one + plus the same OS costs to launch a process, so from now on let's keep in mind these extra ~ 1.3 ms of bias in favor of C-based lexer.

Ok, I see. But wouldn't it be more fair to compare either GCC to GDC or Clang to LDC. -- /Jacob Carlborg
Feb 08 2013
prev sibling parent reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
08-Feb-2013 19:34, Brian Schott пишет:
 On Friday, 8 February 2013 at 15:23:00 UTC, Brian Schott wrote:
 On Friday, 8 February 2013 at 15:12:23 UTC, Jacob Carlborg wrote:
 DMD is still consistently faster, :(

We might be getting to the part where we say that code generated by gcc is still consistently faster than code generated by ldc.

For the lulz I compiled with "dmd -release -inline -noboundscheck -O -m64 -property": http://hackerpilot.github.com/experimental/std_lexer/images/times3-dmd.png Yes, dmd vs ldc actually is a matter of 32 vs 18 ms for datetime.

Would be intereesting to try gdc as dmd on linux uses gcc. -- Dmitry Olshansky
Feb 08 2013
next sibling parent reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
08-Feb-2013 19:52, Iain Buclaw пишет:
 On 8 February 2013 15:46, Brian Schott <briancschott gmail.com
 <mailto:briancschott gmail.com>> wrote:

     On Friday, 8 February 2013 at 15:42:47 UTC, Iain Buclaw wrote:

         What?  That's an outright fib. :-)


     I think he means that "on linux the dmd binary is compiled by gcc"


 That's still lies. :o)

g++ ? :)
 --
 Iain Buclaw

 *(p < e ? p++ : p) = (c & 0x0f) + '0';

-- Dmitry Olshansky
Feb 08 2013
next sibling parent Dmitry Olshansky <dmitry.olsh gmail.com> writes:
08-Feb-2013 20:51, Iain Buclaw пишет:
 On 8 February 2013 16:41, Brian Schott <briancschott gmail.com
 <mailto:briancschott gmail.com>> wrote:

     On Friday, 8 February 2013 at 16:38:00 UTC, Iain Buclaw wrote:

         I see we could be doing this all day. :þ


     We could.


         I'll lay down the hint, dmd compiles the source, not gcc. And
         although gcc
         may be invoked during a certain special stage of compilation,
         its actually
         just a driver to call a certain toolchain program that is
         outside of gcc.
         :-)


     What we're saying is that dmd, The Digital Mars D Compiler, is
     written in C++ and is thus built by GCC/G++. We can tell by looking
     at the Makefile that DMD doesn't build itself.


 That still has nothing to do with how dmd links D programs. ;)

We've been discussing DMD's lexer written in C++ that is obviously compiled using the default compiler on target platform. That's what is being measured the good ol' C++ lexer vs D lexer. I don't see how the way DMD does linking is related to what tool is used to actually build it. -- Dmitry Olshansky
Feb 08 2013
prev sibling parent Dmitry Olshansky <dmitry.olsh gmail.com> writes:
09-Feb-2013 23:20, SomeDude пишет:
 On Friday, 8 February 2013 at 16:00:05 UTC, Dmitry Olshansky wrote:
 08-Feb-2013 19:52, Iain Buclaw пишет:

 That's still lies. :o)

g++ ? :)
 --
 Iain Buclaw

 *(p < e ? p++ : p) = (c & 0x0f) + '0';


I'd think it's built by DMC++, Digital Mars C++ compiler.

On linux, sure ;) -- Dmitry Olshansky
Feb 09 2013
prev sibling next sibling parent FG <home fgda.pl> writes:
On 2013-02-08 17:35, Brian Schott wrote:
 GDC decided to randomly not fail to build on my system, so it's time for MOAR
 CHARTS.

Interesting. Seems that LDC is faster than GDC but has a bigger start overhead. OT: Why is the chart's legend in reverse order? :P
Feb 08 2013
prev sibling next sibling parent dennis luehring <dl.soluz gmx.net> writes:
Am 08.02.2013 17:35, schrieb Brian Schott:
 On Friday, 8 February 2013 at 15:35:55 UTC, Dmitry Olshansky
 wrote:
 Would be intereesting to try gdc as dmd on linux uses gcc.

GDC decided to randomly not fail to build on my system, so it's time for MOAR CHARTS. dmd -inline -noboundscheck -O -w -wi -m64 -property ldc2 -O2 -release -vectorize -m64 gdc -O3 -fno-bounds-check -frelease -m64 http://hackerpilot.github.com/experimental/std_lexer/images/times3-all.png

i still don't get what you comparing here you benchmarking something like an mixup of code-generation, startup-times and filesystem behavior + lexing time (+ lexing output style) and this all makes only sense if you lexer produces the very same output (so it can be out of the box used as an replacement) as the dmd lexer - for beeing compareable or aren't you in the phase of detail benchmarking/profiling?
Feb 08 2013
prev sibling parent Jacob Carlborg <doob me.com> writes:
On 2013-02-08 17:35, Brian Schott wrote:
 On Friday, 8 February 2013 at 15:35:55 UTC, Dmitry Olshansky wrote:
 Would be intereesting to try gdc as dmd on linux uses gcc.

GDC decided to randomly not fail to build on my system, so it's time for MOAR CHARTS. dmd -inline -noboundscheck -O -w -wi -m64 -property ldc2 -O2 -release -vectorize -m64 gdc -O3 -fno-bounds-check -frelease -m64 http://hackerpilot.github.com/experimental/std_lexer/images/times3-all.png

Do you have the exact code you test available somewhere? -- /Jacob Carlborg
Feb 09 2013
prev sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 2/4/2013 7:19 PM, Brian Schott wrote:
 Still only half speed. I'm becoming more and more convinced that Walter is
 actually a wizard.

I worship the Dark Arts.
Feb 08 2013
next sibling parent FG <home fgda.pl> writes:
On 2013-02-28 16:34, Andrej Mitrovic wrote:
 On 2/28/13, Brian Schott <briancschott gmail.com> wrote:
 http://hackerpilot.github.com/experimental/std_lexer/images/times4.png

That's a nice plot, but what is the X axis?

Lexing time in milliseconds.
Feb 28 2013
prev sibling parent reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
28-Feb-2013 13:09, Brian Schott пишет:
 On Saturday, 9 February 2013 at 07:55:04 UTC, Walter Bright wrote:
 On 2/4/2013 7:19 PM, Brian Schott wrote:
 Still only half speed. I'm becoming more and more convinced that
 Walter is
 actually a wizard.

I worship the Dark Arts.

*recites incantations* *merges pull requests* *adds unit tests* http://hackerpilot.github.com/experimental/std_lexer/images/times4.png

Looking far better now. Keep in mind that we still don't use the original dmd's sentinel magic to avoid length check in std.d.lexer :) Being that wizard who gave a couple of scrolls to Brain I'll outline some interesting data points collected while helping him out. - D run-time takes some time to start up/shut down. It's on the order of 1ms for me vs 1/10th of ms for plain C - expanding on the previous point - GC.malloc takes sometime and triggers collections from time to time (2 collections to lex datetime, there is next to no garbage, but they run regardless). Even simply disabling GC at first and re-enabling it at the end of program speeds up the whole time by roughly 5% on my machine. - opApply is definitely slower then inlined range-based foreach loop even with scope delegate. Other then this specifics the prime spells that give the performance are: - don't use built-in AA or subtle allocations (avoid GC as far as you can) - lean common path, keep there only the operations that are required in *every* code-path - avoid ever doing the same work twice (redesign, hack and whatnot but don't do it) -- Dmitry Olshansky
Feb 28 2013
parent Dmitry Olshansky <dmitry.olsh gmail.com> writes:
28-Feb-2013 21:27, Dmitry Olshansky пишет:
 28-Feb-2013 13:09, Brian Schott пишет:
 On Saturday, 9 February 2013 at 07:55:04 UTC, Walter Bright wrote:
 On 2/4/2013 7:19 PM, Brian Schott wrote:
 Still only half speed. I'm becoming more and more convinced that
 Walter is
 actually a wizard.

I worship the Dark Arts.

*recites incantations* *merges pull requests* *adds unit tests* http://hackerpilot.github.com/experimental/std_lexer/images/times4.png

Looking far better now. Keep in mind that we still don't use the original dmd's sentinel magic to avoid length check in std.d.lexer :) Being that wizard who gave a couple of scrolls to Brain I'll outline some interesting data points collected while helping him out.

To clarify these are problems I observed but haven't solved/avoided completely:
 - D run-time takes some time to start up/shut down. It's on the order of
 1ms for me vs 1/10th of ms for plain C
 - expanding on the previous point - GC.malloc takes sometime and
 triggers collections from time to time (2 collections to lex datetime,
 there is next to no garbage, but they run regardless).

And this one below is an experiment and not is not related to the numbers posted still.
 Even simply disabling GC at first and re-enabling it at the end of
 program speeds up the whole time by roughly 5% on my machine.

-- Dmitry Olshansky
Feb 28 2013
prev sibling next sibling parent "Brian Schott" <briancschott gmail.com> writes:
On Monday, 4 February 2013 at 00:38:57 UTC, FG wrote:
 On 2013-02-04 01:22, Brian Schott wrote:
 After several hours of optimizing I've managed to make it so 
 that dmd's lexer is
 only three times faster.

What are you comparing here? How do you time DMD's lexing stage?

A simple hack of module.c that prints out the file's token count and then calls exit(0).
Feb 03 2013
prev sibling next sibling parent "Mehrdad" <wfunction hotmail.com> writes:
On Monday, 4 February 2013 at 00:22:42 UTC, Brian Schott wrote:
 After several hours of optimizing I've managed to make it so 
 that dmd's lexer is only three times faster.

 http://hackerpilot.github.com/experimental/std_lexer/images/times.png

 The funny thing is that compiling with LDC gave a bigger speed 
 boost than any of my code refacatoring.

Where is the current bottleneck? (Should be easy to find just by running the program ~5 times and suddenly breaking into it with a debugger.) Also, I'm assuming you've already tried disabling range-checking on arrays?
Feb 04 2013
prev sibling next sibling parent "Brian Schott" <briancschott gmail.com> writes:
More optimizing:
http://hackerpilot.github.com/experimental/std_lexer/images/times2.png

Still only half speed. I'm becoming more and more convinced that 
Walter is actually a wizard.
Feb 04 2013
prev sibling next sibling parent Andrej Mitrovic <andrej.mitrovich gmail.com> writes:
On 2/5/13, Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> wrote:
 Suggestion: take lexer.c and convert it to D. Should take one day, and
 you'll have performance on par.

This was already done for DDMD, and the more recent minimal version of it: https://github.com/zachthemystic/ddmd-clean/blob/master/dmd/lexer.d
Feb 04 2013
prev sibling next sibling parent "Brian Schott" <briancschott gmail.com> writes:
On Tuesday, 5 February 2013 at 04:19:54 UTC, Andrei Alexandrescu 
wrote:
 Awesome! Has anyone measured the speed?

 Andrei

I gave up on getting it to compile. ddmd (the project this one is based on) was last updated to 2.040 according to its page on dsource, and I also haven't gotten it to compile.
Feb 04 2013
prev sibling next sibling parent "deadalnix" <deadalnix gmail.com> writes:
On Tuesday, 5 February 2013 at 03:22:52 UTC, Andrei Alexandrescu 
wrote:
 On 2/4/13 10:19 PM, Brian Schott wrote:
 More optimizing:
 http://hackerpilot.github.com/experimental/std_lexer/images/times2.png

 Still only half speed. I'm becoming more and more convinced 
 that Walter
 is actually a wizard.

Suggestion: take lexer.c and convert it to D. Should take one day, and you'll have performance on par.

DMD's lexer is not suitable for phobos IMO. It doesn't take a range as input and don't produce a range. It also lack the features you may want from a multi usage D lexer.
Feb 04 2013
prev sibling next sibling parent Jonathan M Davis <jmdavisProg gmx.com> writes:
On Tuesday, February 05, 2013 01:07:48 Jonathan M Davis wrote:
 I don't have a lot of time to work on my lexer at the moment, but I'd really
 like to get it done soon, and I have most of the features in place.
 Unfortunately, when I went to try and work on it again the other day, the
 code wasn't compiling anymore, and I need to figure out why. I suspect that
 it's a regression related to string mixins, but I have to investigate
 further to sort it out.

It turns out that it has nothing to do with string mixins (though it does have to do with CTFE): http://d.puremagic.com/issues/show_bug.cgi?id=9452 Fortunately, there's a simple workaround that'll let me continue until the bug is fixed. - Jonathan M Davis
Feb 05 2013
prev sibling next sibling parent Andrej Mitrovic <andrej.mitrovich gmail.com> writes:
On 2/5/13, Brian Schott <briancschott gmail.com> wrote:
 I gave up on getting it to compile.

It seems master is broken. An earlier commit is buildable with a small change, but it doesn't seem to parse modules like std.datetime. I don't know what exact version the front-end was ported from, but I've tried to parse datetime from 2.055 to 2.061 and it didn't work. So yeah it's not usable in this state.
Feb 05 2013
prev sibling next sibling parent "Brian Schott" <briancschott gmail.com> writes:
On Tuesday, 5 February 2013 at 08:52:53 UTC, Dmitry Olshansky
wrote:
 Time to do some hacking on your lexer I guess. I'll try add a 
 couple of tricks and see if it helps.

 What command do you use for benchmarking?

I've been using avgtime[1] for measuring run times, perf[2], and callgrind/kcachegrind[3] for profiling. [1] https://github.com/jmcabo/avgtime [2] https://perf.wiki.kernel.org/index.php/Tutorial [3] http://kcachegrind.sourceforge.net/html/Home.html
Feb 05 2013
prev sibling next sibling parent Jonathan M Davis <jmdavisProg gmx.com> writes:
On Tuesday, February 05, 2013 08:34:29 Andrei Alexandrescu wrote:
 As far as I could tell the dependencies of the lexer are fairly
 contained (util, token, identifier) and conversion to input range is
 immediate.

I don't remember all of the details at the moment, since it's been several months since I looked at dmd's lexer, but a lot of the problem stems from the fact that it's all written around the assumption that it's dealing with a char*. Converting it to operate on string might be fairly straightforward, but it gets more complicated when dealing with ranges. Also, both Walter and others have stated that the lexer in D should be configurable in a number of ways, and dmd's lexer isn't configurable at all. So, while a direct translation would likely be quick, refactoring it to do what it's been asked to be able to do would not be. I'm quite a ways along with one that's written from scratch, but I need to find the time to finish it. Also, doing it from scratch has had the added benefit of helping me find bugs in the spec and in dmd. - Jonathan M Davis
Feb 05 2013
prev sibling next sibling parent Jonathan M Davis <jmdavisProg gmx.com> writes:
On Tuesday, February 05, 2013 22:51:32 Andrei Alexandrescu wrote:
 I think it would be reasonable for a lexer to require a range of ubyte
 as input, and carry its own decoding. In the first approximation it may
 even require a random-access range of ubyte.

I'd have to think about how you'd handle the Unicode stuff in that case, since I'm not quite sure what you mean by having it handle its own decoding if it's a range of code units, but what I've been working on works with all of the character types and is very careful about how it deals with decoding in order to avoid unnecessary decoding. And that wasn't all that hard as far as the lexer's code goes. The hard part with that was making std.utf work with ranges of code units rather than just strings, and that was committed months ago. - Jonathan M Davis
Feb 05 2013
prev sibling next sibling parent Jonathan M Davis <jmdavisProg gmx.com> writes:
On Tuesday, February 05, 2013 22:51:32 Andrei Alexandrescu wrote:
 I think it would be reasonable for a lexer to require a range of ubyte
 as input, and carry its own decoding. In the first approximation it may
 even require a random-access range of ubyte.

Another big issue is the fact that in some ways, using a pointer like dmd's lexer does is actually superior to using a range. In particular, it's trivial to determine where in the text a token is, because you can simply subtract the pointer in the token from the initial pointer. Strings would be okay too, because you can subtract their ptr properties. But the closest that you'll get with ranges is to subtract their lengths, and the only ranges that are likely to define length are random-access ranges. And to do that, you'd either have to keep calculating the index for each token as its generated or save the range with ever token (rather than just having a pointer) so that you could determine the index later if you needed to. And depending on the range, all of that saving could be expensive. And for any other type of range, you'd literally have to count the code units as you iterated in order to figure out what the index is (and you'd have to keep saving the range as you went along if you wanted to slice it at all, since it wouldn't actually be sliceable, and so getting to a particular index in the range would be very expensive even if you kept track of it). And for syntax highlighting and some error reporting and a variety of other uses, you need to be able to determine where in the text a token was (not just its column and line number). And that's simply a lot easier with a pointer. So, dealing with generic ranges is a bit problematic - far more so than any issues with character types. If the lexer is well-written, the extra overhead had be obviated by having the lexer function do stuff a bit differently depending on the type of the range, but regardless, restricting it to strings or pointers would be cleaner in that regard. It's not quite a use case where ranges shine - especially when efficiency is a top priority. - Jonathan M Davis
Feb 08 2013
prev sibling next sibling parent Jonathan M Davis <jmdavisProg gmx.com> writes:
On Friday, February 08, 2013 12:12:30 Dmitry Olshansky wrote:
 08-Feb-2013 12:01, Jonathan M Davis =D0=BF=D0=B8=D1=88=D0=B5=D1=82:
 On Tuesday, February 05, 2013 22:51:32 Andrei Alexandrescu wrote:
 I think it would be reasonable for a lexer to require a range of u=



 as input, and carry its own decoding. In the first approximation i=



 even require a random-access range of ubyte.

Another big issue is the fact that in some ways, using a pointer li=


 dmd's
 lexer does is actually superior to using a range. In particular, it=


 trivial to determine where in the text a token is, because you can =


 subtract the pointer in the token from the initial pointer. Strings=


 be okay too, because you can subtract their ptr properties. But the=


 closest that you'll get with ranges is to subtract their lengths, a=


 only ranges that are likely to define length are random-access rang=


=20
 Not true, certain ranges know length but can't be random access as
 indexing is O(lgN) or worse. Including a stripe of chunks as taken fr=

 file.

I said that the only ones which are "likely" to define length are rando= m-access=20 range. There _are_ other ranges which can, but in most cases, if you ca= n know=20 the length, you can do random access as well. Regardless, the main issu= e still=20 stands in that dealing with keeping track of the index of the code unit= of a=20 token is more complicated and generally more expensive with ranges than= it is=20 with a pointer. Some range types will do better than others, but short = of=20 using a string's ptr property, there's always going to be some addition= al=20 overhead in comparison to pointers to keep track of the indices or to k= eep a=20 range or slice of one as part of a token. The pointer's just more light= weight.=20 That doesn't make ranges unacceptable by any means. It just means that = they're=20 going to take at least a slight performance hit in comparison to pointe= rs. - Jonathan M Davis
Feb 08 2013
prev sibling next sibling parent "Brian Schott" <briancschott gmail.com> writes:
On Tuesday, 5 February 2013 at 19:00:42 UTC, Dmitry Olshansky 
wrote:
 Thanks.
 I've made a pass through the code and found some places to 
 improve.
 Sadly I've been rather busy at work today. Anyway get ready for 
 pull requests should my ideas prove to be worthy 
 performance-wise.

http://hackerpilot.github.com/experimental/std_lexer/images/times3.png Your suggestions seem to have worked. We're below 20ms on my machine for the datetime module.
Feb 08 2013
prev sibling next sibling parent "Brian Schott" <briancschott gmail.com> writes:
On Friday, 8 February 2013 at 15:12:23 UTC, Jacob Carlborg wrote:
 DMD is still consistently faster, :(

We might be getting to the part where we say that code generated by gcc is still consistently faster than code generated by ldc.
Feb 08 2013
prev sibling next sibling parent "Brian Schott" <briancschott gmail.com> writes:
On Friday, 8 February 2013 at 15:23:00 UTC, Brian Schott wrote:
 On Friday, 8 February 2013 at 15:12:23 UTC, Jacob Carlborg 
 wrote:
 DMD is still consistently faster, :(

We might be getting to the part where we say that code generated by gcc is still consistently faster than code generated by ldc.

For the lulz I compiled with "dmd -release -inline -noboundscheck -O -m64 -property": http://hackerpilot.github.com/experimental/std_lexer/images/times3-dmd.png Yes, dmd vs ldc actually is a matter of 32 vs 18 ms for datetime.
Feb 08 2013
prev sibling next sibling parent Iain Buclaw <ibuclaw ubuntu.com> writes:
--047d7bb70dfe95615e04d5386845
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: quoted-printable

On 8 February 2013 15:35, Dmitry Olshansky <dmitry.olsh gmail.com> wrote:

 08-Feb-2013 19:34, Brian Schott =D0=BF=D0=B8=D1=88=D0=B5=D1=82:

  On Friday, 8 February 2013 at 15:23:00 UTC, Brian Schott wrote:
 On Friday, 8 February 2013 at 15:12:23 UTC, Jacob Carlborg wrote:

 DMD is still consistently faster, :(

We might be getting to the part where we say that code generated by gcc is still consistently faster than code generated by ldc.

For the lulz I compiled with "dmd -release -inline -noboundscheck -O -m64 -property": http://hackerpilot.github.com/**experimental/std_lexer/images/** times3-dmd.png<http://hackerpilot.github.com/experimental/std_lexer/imag=


 Yes, dmd vs ldc actually is a matter of 32 vs 18 ms for datetime.

Would be intereesting to try gdc as dmd on linux uses gcc.

What? That's an outright fib. :-) --=20 Iain Buclaw *(p < e ? p++ : p) =3D (c & 0x0f) + '0'; --047d7bb70dfe95615e04d5386845 Content-Type: text/html; charset=UTF-8 Content-Transfer-Encoding: quoted-printable <div dir=3D"ltr"><br><div class=3D"gmail_extra"><div class=3D"gmail_quote">= On 8 February 2013 15:35, Dmitry Olshansky <span dir=3D"ltr">&lt;<a href=3D= "mailto:dmitry.olsh gmail.com" target=3D"_blank">dmitry.olsh gmail.com</a>&= gt;</span> wrote:<br> <blockquote class=3D"gmail_quote" style=3D"margin:0 0 0 .8ex;border-left:1p= x #ccc solid;padding-left:1ex">08-Feb-2013 19:34, Brian Schott =D0=BF=D0=B8= =D1=88=D0=B5=D1=82:<div><div class=3D"h5"><br> <blockquote class=3D"gmail_quote" style=3D"margin:0 0 0 .8ex;border-left:1p= x #ccc solid;padding-left:1ex"> On Friday, 8 February 2013 at 15:23:00 UTC, Brian Schott wrote:<br> <blockquote class=3D"gmail_quote" style=3D"margin:0 0 0 .8ex;border-left:1p= x #ccc solid;padding-left:1ex"> On Friday, 8 February 2013 at 15:12:23 UTC, Jacob Carlborg wrote:<br> <blockquote class=3D"gmail_quote" style=3D"margin:0 0 0 .8ex;border-left:1p= x #ccc solid;padding-left:1ex"> DMD is still consistently faster, :(<br> </blockquote> <br> We might be getting to the part where we say that code generated by<br> gcc is still consistently faster than code generated by ldc.<br> </blockquote> <br> For the lulz I compiled with &quot;dmd -release -inline -noboundscheck -O<b= r> -m64 -property&quot;:<br> <br> <a href=3D"http://hackerpilot.github.com/experimental/std_lexer/images/time= s3-dmd.png" target=3D"_blank">http://hackerpilot.github.com/<u></u>experime= ntal/std_lexer/images/<u></u>times3-dmd.png</a><br> <br> Yes, dmd vs ldc actually is a matter of 32 vs 18 ms for datetime.<br> </blockquote> <br></div></div> Would be intereesting to try gdc as dmd on linux uses gcc.<span class=3D"HO= EnZb"><font color=3D"#888888"><br> <br></font></span></blockquote></div><br><br></div><div class=3D"gmail_extr= a">What?=C2=A0 That&#39;s an outright fib. :-)<br></div><div class=3D"gmail= _extra"><br><br>-- <br>Iain Buclaw<br><br>*(p &lt; e ? p++ : p) =3D (c &amp= ; 0x0f) + &#39;0&#39;; </div></div> --047d7bb70dfe95615e04d5386845--
Feb 08 2013
prev sibling next sibling parent "Brian Schott" <briancschott gmail.com> writes:
On Friday, 8 February 2013 at 15:42:47 UTC, Iain Buclaw wrote:
 What?  That's an outright fib. :-)

I think he means that "on linux the dmd binary is compiled by gcc"
Feb 08 2013
prev sibling next sibling parent Iain Buclaw <ibuclaw ubuntu.com> writes:
--001636b14cd00b563104d5388c27
Content-Type: text/plain; charset=ISO-8859-1

On 8 February 2013 15:46, Brian Schott <briancschott gmail.com> wrote:

 On Friday, 8 February 2013 at 15:42:47 UTC, Iain Buclaw wrote:

 What?  That's an outright fib. :-)

I think he means that "on linux the dmd binary is compiled by gcc"

-- Iain Buclaw *(p < e ? p++ : p) = (c & 0x0f) + '0'; --001636b14cd00b563104d5388c27 Content-Type: text/html; charset=ISO-8859-1 Content-Transfer-Encoding: quoted-printable <div dir=3D"ltr"><div class=3D"gmail_extra"><div class=3D"gmail_quote">On 8= February 2013 15:46, Brian Schott <span dir=3D"ltr">&lt;<a href=3D"mailto:= briancschott gmail.com" target=3D"_blank">briancschott gmail.com</a>&gt;</s= pan> wrote:<br> <blockquote class=3D"gmail_quote" style=3D"margin:0 0 0 .8ex;border-left:1p= x #ccc solid;padding-left:1ex"><div class=3D"im">On Friday, 8 February 2013= at 15:42:47 UTC, Iain Buclaw wrote:<br> <blockquote class=3D"gmail_quote" style=3D"margin:0 0 0 .8ex;border-left:1p= x #ccc solid;padding-left:1ex"> What? =A0That&#39;s an outright fib. :-)<br> </blockquote> <br></div> I think he means that &quot;on linux the dmd binary is compiled by gcc&quot= ;<br> <br> </blockquote></div><br></div><div class=3D"gmail_extra">That&#39;s still li= es. :o)<br></div><div class=3D"gmail_extra"><br clear=3D"all"><br>-- <br>Ia= in Buclaw<br><br>*(p &lt; e ? p++ : p) =3D (c &amp; 0x0f) + &#39;0&#39;; </div></div> --001636b14cd00b563104d5388c27--
Feb 08 2013
prev sibling next sibling parent "Brian Schott" <briancschott gmail.com> writes:
On Friday, 8 February 2013 at 15:35:55 UTC, Dmitry Olshansky 
wrote:
 Would be intereesting to try gdc as dmd on linux uses gcc.

GDC decided to randomly not fail to build on my system, so it's time for MOAR CHARTS. dmd -inline -noboundscheck -O -w -wi -m64 -property ldc2 -O2 -release -vectorize -m64 gdc -O3 -fno-bounds-check -frelease -m64 http://hackerpilot.github.com/experimental/std_lexer/images/times3-all.png
Feb 08 2013
prev sibling next sibling parent Iain Buclaw <ibuclaw ubuntu.com> writes:
--047d7bb70dfe379ecf04d5392e27
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: quoted-printable

On 8 February 2013 16:00, Dmitry Olshansky <dmitry.olsh gmail.com> wrote:

 08-Feb-2013 19:52, Iain Buclaw =D0=BF=D0=B8=D1=88=D0=B5=D1=82:

 On 8 February 2013 15:46, Brian Schott <briancschott gmail.com
 <mailto:briancschott gmail.com**>> wrote:

     On Friday, 8 February 2013 at 15:42:47 UTC, Iain Buclaw wrote:

         What?  That's an outright fib. :-)


     I think he means that "on linux the dmd binary is compiled by gcc"


 That's still lies. :o)


I'll lay down the hint, dmd compiles the source, not gcc. And although gcc may be invoked during a certain special stage of compilation, its actually just a driver to call a certain toolchain program that is outside of gcc. :-) --=20 Iain Buclaw *(p < e ? p++ : p) =3D (c & 0x0f) + '0'; --047d7bb70dfe379ecf04d5392e27 Content-Type: text/html; charset=UTF-8 Content-Transfer-Encoding: quoted-printable <div dir=3D"ltr"><div class=3D"gmail_extra"><div class=3D"gmail_quote">On 8= February 2013 16:00, Dmitry Olshansky <span dir=3D"ltr">&lt;<a href=3D"mai= lto:dmitry.olsh gmail.com" target=3D"_blank">dmitry.olsh gmail.com</a>&gt;<= /span> wrote:<br> <blockquote class=3D"gmail_quote" style=3D"margin:0 0 0 .8ex;border-left:1p= x #ccc solid;padding-left:1ex">08-Feb-2013 19:52, Iain Buclaw =D0=BF=D0=B8= =D1=88=D0=B5=D1=82:<br> <blockquote class=3D"gmail_quote" style=3D"margin:0 0 0 .8ex;border-left:1p= x #ccc solid;padding-left:1ex"><div class=3D"im"> On 8 February 2013 15:46, Brian Schott &lt;<a href=3D"mailto:briancschott g= mail.com" target=3D"_blank">briancschott gmail.com</a><br></div><div><div c= lass=3D"h5"> &lt;mailto:<a href=3D"mailto:briancschott gmail.com" target=3D"_blank">bria= ncschott gmail.com</a><u></u>&gt;&gt; wrote:<br> <br> =C2=A0 =C2=A0 On Friday, 8 February 2013 at 15:42:47 UTC, Iain Buclaw wrote= :<br> <br> =C2=A0 =C2=A0 =C2=A0 =C2=A0 What? =C2=A0That&#39;s an outright fib. :-)<br> <br> <br> =C2=A0 =C2=A0 I think he means that &quot;on linux the dmd binary is compil= ed by gcc&quot;<br> <br> <br></div></div><div class=3D"im"> That&#39;s still lies. :o)<br> <br> </div></blockquote> <br> g++ ? :)<div class=3D"HOEnZb"><div class=3D"h5"><br></div></div></blockquot= e><div><br></div><div>I see we could be doing this all day. :=C3=BE<br><br>= </div><div>I&#39;ll lay down the hint, dmd compiles the source, not gcc. An= d although gcc may be invoked during a certain special stage of compilation= , its actually just a driver to call a certain toolchain program that is ou= tside of gcc. :-)<br> </div><div>=C2=A0</div></div>-- <br>Iain Buclaw<br><br>*(p &lt; e ? p++ : p= ) =3D (c &amp; 0x0f) + &#39;0&#39;; </div></div> --047d7bb70dfe379ecf04d5392e27--
Feb 08 2013
prev sibling next sibling parent "Brian Schott" <briancschott gmail.com> writes:
On Friday, 8 February 2013 at 16:35:50 UTC, Brian Schott wrote:
 dmd -inline -noboundscheck -O -w -wi -m64 -property

Copy/pate fail. That's actually "dmd -release -inline -noboundscheck -O -w -wi -m64 -property"
Feb 08 2013
prev sibling next sibling parent "Brian Schott" <briancschott gmail.com> writes:
On Friday, 8 February 2013 at 16:38:00 UTC, Iain Buclaw wrote:
 I see we could be doing this all day. :þ

We could.
 I'll lay down the hint, dmd compiles the source, not gcc. And 
 although gcc
 may be invoked during a certain special stage of compilation, 
 its actually
 just a driver to call a certain toolchain program that is 
 outside of gcc.
 :-)

What we're saying is that dmd, The Digital Mars D Compiler, is written in C++ and is thus built by GCC/G++. We can tell by looking at the Makefile that DMD doesn't build itself.
Feb 08 2013
prev sibling next sibling parent Iain Buclaw <ibuclaw ubuntu.com> writes:
--20cf3074b24625dfa404d53963a5
Content-Type: text/plain; charset=ISO-8859-1

On 8 February 2013 16:35, Brian Schott <briancschott gmail.com> wrote:

 On Friday, 8 February 2013 at 15:35:55 UTC, Dmitry Olshansky wrote:

 Would be intereesting to try gdc as dmd on linux uses gcc.

GDC decided to randomly not fail to build on my system, so it's time for MOAR CHARTS. dmd -inline -noboundscheck -O -w -wi -m64 -property ldc2 -O2 -release -vectorize -m64 gdc -O3 -fno-bounds-check -frelease -m64 http://hackerpilot.github.com/**experimental/std_lexer/images/** times3-all.png<http://hackerpilot.github.com/experimental/std_lexer/images/times3-all.png>

Cool. How do you determine the speed times? -- Iain Buclaw *(p < e ? p++ : p) = (c & 0x0f) + '0'; --20cf3074b24625dfa404d53963a5 Content-Type: text/html; charset=ISO-8859-1 Content-Transfer-Encoding: quoted-printable <div dir=3D"ltr"><div class=3D"gmail_extra"><div class=3D"gmail_quote">On 8= February 2013 16:35, Brian Schott <span dir=3D"ltr">&lt;<a href=3D"mailto:= briancschott gmail.com" target=3D"_blank">briancschott gmail.com</a>&gt;</s= pan> wrote:<br> <blockquote class=3D"gmail_quote" style=3D"margin:0 0 0 .8ex;border-left:1p= x #ccc solid;padding-left:1ex"><div class=3D"im">On Friday, 8 February 2013= at 15:35:55 UTC, Dmitry Olshansky wrote:<br> <blockquote class=3D"gmail_quote" style=3D"margin:0 0 0 .8ex;border-left:1p= x #ccc solid;padding-left:1ex"> Would be intereesting to try gdc as dmd on linux uses gcc.<br> </blockquote> <br></div> GDC decided to randomly not fail to build on my system, so it&#39;s time fo= r MOAR CHARTS.<br> <br> dmd -inline -noboundscheck -O -w -wi -m64 -property<br> ldc2 -O2 -release -vectorize -m64<br> gdc -O3 -fno-bounds-check -frelease -m64<br> <a href=3D"http://hackerpilot.github.com/experimental/std_lexer/images/time= s3-all.png" target=3D"_blank">http://hackerpilot.github.com/<u></u>experime= ntal/std_lexer/images/<u></u>times3-all.png</a><br> </blockquote></div><br></div><div class=3D"gmail_extra">Cool.=A0 How do you= determine the speed times?<br><br clear=3D"all"></div><div class=3D"gmail_= extra"><br>-- <br>Iain Buclaw<br><br>*(p &lt; e ? p++ : p) =3D (c &amp; 0x0= f) + &#39;0&#39;; </div></div> --20cf3074b24625dfa404d53963a5--
Feb 08 2013
prev sibling next sibling parent Iain Buclaw <ibuclaw ubuntu.com> writes:
--001636b14cd04a015204d5395ed0
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: quoted-printable

On 8 February 2013 16:41, Brian Schott <briancschott gmail.com> wrote:

 On Friday, 8 February 2013 at 16:38:00 UTC, Iain Buclaw wrote:

 I see we could be doing this all day. :=FE

We could. I'll lay down the hint, dmd compiles the source, not gcc. And although g=

 may be invoked during a certain special stage of compilation, its actual=


 just a driver to call a certain toolchain program that is outside of gcc=


 :-)

What we're saying is that dmd, The Digital Mars D Compiler, is written in C++ and is thus built by GCC/G++. We can tell by looking at the Makefile that DMD doesn't build itself.

That still has nothing to do with how dmd links D programs. ;) --=20 Iain Buclaw *(p < e ? p++ : p) =3D (c & 0x0f) + '0'; --001636b14cd04a015204d5395ed0 Content-Type: text/html; charset=ISO-8859-1 Content-Transfer-Encoding: quoted-printable <div dir=3D"ltr"><div class=3D"gmail_extra"><div class=3D"gmail_quote">On 8= February 2013 16:41, Brian Schott <span dir=3D"ltr">&lt;<a href=3D"mailto:= briancschott gmail.com" target=3D"_blank">briancschott gmail.com</a>&gt;</s= pan> wrote:<br> <blockquote class=3D"gmail_quote" style=3D"margin:0 0 0 .8ex;border-left:1p= x #ccc solid;padding-left:1ex"><div class=3D"im">On Friday, 8 February 2013= at 16:38:00 UTC, Iain Buclaw wrote:<br> <blockquote class=3D"gmail_quote" style=3D"margin:0 0 0 .8ex;border-left:1p= x #ccc solid;padding-left:1ex"> I see we could be doing this all day. :=FE<br> </blockquote> <br></div> We could.<div class=3D"im"><br> <br> <blockquote class=3D"gmail_quote" style=3D"margin:0 0 0 .8ex;border-left:1p= x #ccc solid;padding-left:1ex"> I&#39;ll lay down the hint, dmd compiles the source, not gcc. And although = gcc<br> may be invoked during a certain special stage of compilation, its actually<= br> just a driver to call a certain toolchain program that is outside of gcc.<b= r> :-)<br> </blockquote> <br></div> What we&#39;re saying is that dmd, The Digital Mars D Compiler, is written = in C++ and is thus built by GCC/G++. We can tell by looking at the Makefile= that DMD doesn&#39;t build itself.<br> </blockquote></div><br></div><div class=3D"gmail_extra">That still has noth= ing to do with how dmd links D programs. ;)<br></div><div class=3D"gmail_ex= tra"><br clear=3D"all"><br>-- <br>Iain Buclaw<br><br>*(p &lt; e ? p++ : p) = =3D (c &amp; 0x0f) + &#39;0&#39;; </div></div> --001636b14cd04a015204d5395ed0--
Feb 08 2013
prev sibling next sibling parent "Jonathan M Davis" <jmdavisProg gmx.com> writes:
On Friday, February 08, 2013 14:29:20 Dmitry Olshansky wrote:
 If target is random access range just use offset throughout. It's
 basically becomes base + offset vs base + pointer i.e. non-issue

 If not then pointer argument no longer applies and you can just as well
 use separate counter on per popFront. It'd not that costly in this case
 and flexibility tramps other concerns with forward ranges in any case.

I don't know exactly how costly it ends up being (and certainly, poor design choices in other areas could dwarf that cost), but it does incur extra overhead throughout the lexing process. In most code, it wouldn't be a big deal, but the lexer is trying to be lightning fast, so every extra bit like that is going to add up and slow it down. But you really don't have any choice if you don't even have random access. Regardless, my point is that accepting generic ranges rather than pointers complicates things somewhat and does incur at least a slight performance penalty. - Jonathan M Davis
Feb 08 2013
prev sibling next sibling parent "Jonathan M Davis" <jmdavisProg gmx.com> writes:
On Friday, February 08, 2013 23:50:13 Dmitry Olshansky wrote:
 Complication - yes, slight performance cost is what I doubt it in RA
 case. Seems like a compiler/optimizer issue.

The main problem isn't RA ranges. It's quite probable that the optimizer takes care of any minor additional overhead that's incurred there. The issue is really pure forward ranges, because you're forced to count the code units as they're processed (or even save the range as you go along, given how expensive it would be to go find the index even if you have it). But we really have no way of knowing how bad it is without hard data. It would certainly be trivial to end up doing other stuff in the implementation which cost far more. - Jonathan M Davis
Feb 08 2013
prev sibling next sibling parent "David Nadlinger" <see klickverbot.at> writes:
On Friday, 8 February 2013 at 20:41:52 UTC, FG wrote:
 On 2013-02-08 17:35, Brian Schott wrote:
 GDC decided to randomly not fail to build on my system, so 
 it's time for MOAR
 CHARTS.

Interesting. Seems that LDC is faster than GDC but has a bigger start overhead. OT: Why is the chart's legend in reverse order? :P

It seems like that, but the differences are small enough that you can't make reliable statements without having a closer look at the statistics. David, who dreams of a world in which every graph has (at least) error bars
Feb 08 2013
prev sibling next sibling parent "deadalnix" <deadalnix gmail.com> writes:
On Wednesday, 6 February 2013 at 03:51:33 UTC, Andrei 
Alexandrescu wrote:
 I think it would be reasonable for a lexer to require a range 
 of ubyte as input, and carry its own decoding. In the first 
 approximation it may even require a random-access range of 
 ubyte.

Playing around that, I discovered a bug in std.utf : slice and other range are not threated the same way by decodeFront, which is rather problematic. Jonathan also hit that bug : http://d.puremagic.com/issues/show_bug.cgi?id=9456 That make the lexer hard to write with a consistent behavior for InputRanges. The bug probably exists in everything that rely on decodeFront at some point.
Feb 08 2013
prev sibling next sibling parent reply "deadalnix" <deadalnix gmail.com> writes:
On Friday, 8 February 2013 at 20:54:32 UTC, Jonathan M Davis 
wrote:
 On Friday, February 08, 2013 23:50:13 Dmitry Olshansky wrote:
 Complication - yes, slight performance cost is what I doubt it 
 in RA
 case. Seems like a compiler/optimizer issue.

The main problem isn't RA ranges. It's quite probable that the optimizer takes care of any minor additional overhead that's incurred there. The issue is really pure forward ranges, because you're forced to count the code units as they're processed (or even save the range as you go along, given how expensive it would be to go find the index even if you have it). But we really have no way of knowing how bad it is without hard data. It would certainly be trivial to end up doing other stuff in the implementation which cost far more.

For pure InputRanges, that is pretty bad as the decoding of UTF chars basically have to be done 2 times, and each codepoint popped individually, or the lexer have to embed its own homebrew version of std.utf .
Feb 08 2013
parent reply Jacob Carlborg <doob me.com> writes:
On 2013-02-09 09:05, Jonathan M Davis wrote:

 The lexer that I'm writing doesn't even support pure input ranges, because it
 has to be able to save to do what it does, and the only reason that
 decodeFront exists is because I wrote it to support that use case. And needing
 to operate on ranges of code units is likely to be quite rare, and the
 function is quite new, so I don't expect that much code is affected by any bugs
 in decodeFront or that much would be broken were it to be changed.

Following this discussion it seems it's complicated to support ranges in a lexer and still maintain the speed. Is it really worth the trouble to support ranges? -- /Jacob Carlborg
Feb 09 2013
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 2/9/13 10:05 AM, Jacob Carlborg wrote:
 On 2013-02-09 09:05, Jonathan M Davis wrote:

 The lexer that I'm writing doesn't even support pure input ranges,
 because it
 has to be able to save to do what it does, and the only reason that
 decodeFront exists is because I wrote it to support that use case. And
 needing
 to operate on ranges of code units is likely to be quite rare, and the
 function is quite new, so I don't expect that much code is affected by
 any bugs
 in decodeFront or that much would be broken were it to be changed.

Following this discussion it seems it's complicated to support ranges in a lexer and still maintain the speed. Is it really worth the trouble to support ranges?

Requiring a random-access range of ubyte with a terminating zero may be the most general approach to a fast lexer - and that's fine. Andrei
Feb 09 2013
parent reply Jacob Carlborg <doob me.com> writes:
On 2013-02-09 16:10, Andrei Alexandrescu wrote:

 Requiring a random-access range of ubyte with a terminating zero may be
 the most general approach to a fast lexer - and that's fine.

Requiring a random-access range probably makes it easier. People here seems to try to support ranges with less functionality, like input or forward ranges. -- /Jacob Carlborg
Feb 09 2013
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 2/9/13 10:37 AM, Jacob Carlborg wrote:
 On 2013-02-09 16:10, Andrei Alexandrescu wrote:

 Requiring a random-access range of ubyte with a terminating zero may be
 the most general approach to a fast lexer - and that's fine.

Requiring a random-access range probably makes it easier. People here seems to try to support ranges with less functionality, like input or forward ranges.

Yah. The way I see it is, start with a random-access range and then see what the use patterns are. Then possibly refine. Andrei
Feb 09 2013
parent Dmitry Olshansky <dmitry.olsh gmail.com> writes:
09-Feb-2013 19:46, Andrei Alexandrescu пишет:
 On 2/9/13 10:37 AM, Jacob Carlborg wrote:
 On 2013-02-09 16:10, Andrei Alexandrescu wrote:

 Requiring a random-access range of ubyte with a terminating zero may be
 the most general approach to a fast lexer - and that's fine.

Requiring a random-access range probably makes it easier. People here seems to try to support ranges with less functionality, like input or forward ranges.

Yah. The way I see it is, start with a random-access range and then see what the use patterns are. Then possibly refine.

I don't get this. There is no sensible requirement to forbid non-random access. A couple of static ifs and you are done. And I can confidently say that std.d.lexer has quite some room for optimization in both cases and it doesn't have to sacrifice the generic path. I intent to continue hacking on Brain's implementation and to help him refine it. Any real help (as in work and analysis) is appreciated, thanks. -- Dmitry Olshansky
Feb 09 2013
prev sibling next sibling parent "deadalnix" <deadalnix gmail.com> writes:
On Saturday, 9 February 2013 at 07:15:57 UTC, deadalnix wrote:
 On Friday, 8 February 2013 at 20:54:32 UTC, Jonathan M Davis 
 wrote:
 On Friday, February 08, 2013 23:50:13 Dmitry Olshansky wrote:
 Complication - yes, slight performance cost is what I doubt 
 it in RA
 case. Seems like a compiler/optimizer issue.

The main problem isn't RA ranges. It's quite probable that the optimizer takes care of any minor additional overhead that's incurred there. The issue is really pure forward ranges, because you're forced to count the code units as they're processed (or even save the range as you go along, given how expensive it would be to go find the index even if you have it). But we really have no way of knowing how bad it is without hard data. It would certainly be trivial to end up doing other stuff in the implementation which cost far more.

For pure InputRanges, that is pretty bad as the decoding of UTF chars basically have to be done 2 times, and each codepoint popped individually, or the lexer have to embed its own homebrew version of std.utf .

Wow that is complete bullshit xD I completely messed up what I wanted to say. So, std.utf.decodeFront pop or not the utf character. And in case it does not, you ends up having to do extra hanky panky (and duplicate logic in std.utf to know if it does pop or not, which is likely to be very bug prone).
Feb 08 2013
prev sibling next sibling parent Jonathan M Davis <jmdavisProg gmx.com> writes:
On Saturday, February 09, 2013 08:19:33 deadalnix wrote:
 So, std.utf.decodeFront pop or not the utf character. And in case
 it does not, you ends up having to do extra hanky panky (and
 duplicate logic in std.utf to know if it does pop or not, which
 is likely to be very bug prone).

Well, with the pull request that I have, it'll always pop for input ranges and never for anything else. I'm disinclined to have it pop off elements at all, because I'd like it to function as much like decode as possible, but there's no choice with pure input ranges, and it may be better to just make it always pop the elements off. I'll have to think about it. I hate pure input ranges in general. They're just so frustrating to deal with, and I believe that you can always have a forward range if you're willing to put a bit more effort into it. The lexer that I'm writing doesn't even support pure input ranges, because it has to be able to save to do what it does, and the only reason that decodeFront exists is because I wrote it to support that use case. And needing to operate on ranges of code units is likely to be quite rare, and the function is quite new, so I don't expect that much code is affected by any bugs in decodeFront or that much would be broken were it to be changed. - Jonathan M Davis
Feb 09 2013
prev sibling next sibling parent Jonathan M Davis <jmdavisProg gmx.com> writes:
On Friday, February 08, 2013 23:48:50 Walter Bright wrote:
 On 2/8/2013 12:01 AM, Jonathan M Davis wrote:
 It's not quite a use case where
 ranges shine - especially when efficiency is a top priority.

A more problematic case is dmd's lexer relies on a 0 byte at the end to be a "sentinel" for the end of file. Without such a sentinel, every consumption of a source character requires two operations rather than one.

I didn't know that. That's a cute trick. But yeah, without controlling the input, it's not possible, and that won't work with a general implementation in a library. It would be trivial enough to put a wrapper around the input to add the 0 byte at the end, but the wrapper would still have to keep checking for empty, so you wouldn't gain anything. - Jonathan M Davis
Feb 09 2013
prev sibling next sibling parent Iain Buclaw <ibuclaw ubuntu.com> writes:
--f46d0447f382f817b504d547e6a3
Content-Type: text/plain; charset=ISO-8859-1

On Feb 9, 2013 8:01 AM, "Walter Bright" <newshound2 digitalmars.com> wrote:
 On 2/4/2013 7:19 PM, Brian Schott wrote:
 Still only half speed. I'm becoming more and more convinced that Walter


 actually a wizard.

I worship the Dark Arts.

More like cursed the god's when you wrote it. :-) --f46d0447f382f817b504d547e6a3 Content-Type: text/html; charset=ISO-8859-1 <p><br> On Feb 9, 2013 8:01 AM, &quot;Walter Bright&quot; &lt;<a href="mailto:newshound2 digitalmars.com">newshound2 digitalmars.com</a>&gt; wrote:<br> &gt;<br> &gt; On 2/4/2013 7:19 PM, Brian Schott wrote:<br> &gt;&gt;<br> &gt;&gt; Still only half speed. I&#39;m becoming more and more convinced that Walter is<br> &gt;&gt; actually a wizard.<br> &gt;<br> &gt;<br> &gt; I worship the Dark Arts.</p> <p>More like cursed the god&#39;s when you wrote it. :-)</p> --f46d0447f382f817b504d547e6a3--
Feb 09 2013
prev sibling next sibling parent "SomeDude" <lovelydear mailmetrash.com> writes:
On Friday, 8 February 2013 at 16:00:05 UTC, Dmitry Olshansky 
wrote:
 08-Feb-2013 19:52, Iain Buclaw пишет:

 That's still lies. :o)

g++ ? :)
 --
 Iain Buclaw

 *(p < e ? p++ : p) = (c & 0x0f) + '0';


I'd think it's built by DMC++, Digital Mars C++ compiler.
Feb 09 2013
prev sibling next sibling parent Jonathan M Davis <jmdavisProg gmx.com> writes:
On Sunday, February 10, 2013 00:26:41 Dmitry Olshansky wrote:
 09-Feb-2013 19:46, Andrei Alexandrescu =D0=BF=D0=B8=D1=88=D0=B5=D1=82=

 On 2/9/13 10:37 AM, Jacob Carlborg wrote:
 On 2013-02-09 16:10, Andrei Alexandrescu wrote:
 Requiring a random-access range of ubyte with a terminating zero =




 the most general approach to a fast lexer - and that's fine.

Requiring a random-access range probably makes it easier. People h=



 seems to try to support ranges with less functionality, like input=



 forward ranges.

Yah. The way I see it is, start with a random-access range and then=


 what the use patterns are. Then possibly refine.

I don't get this. There is no sensible requirement to forbid non-rand=

 access. A couple of static ifs and you are done.
=20
 And I can confidently say that std.d.lexer has quite some room for
 optimization in both cases and it doesn't have to sacrifice the gener=

 path.

It may end up being less efficient with some types of ranges, but it ca= n still=20 work, and someone's use case may not care about that extra loss of effi= ciency.=20 Those who really do can use RA ranges or strings or whatever. But if we throw too many extra restrictions on the ranges (like they ha= ve to=20 be random access or end with 0), then pretty soon you might as well req= uire=20 that they use string, because the extra benefit of the few extra types = of=20 ranges which would work wolud be pretty minimal. - Jonathan M Davis
Feb 09 2013
prev sibling next sibling parent "Brian Schott" <briancschott gmail.com> writes:
On Saturday, 9 February 2013 at 07:55:04 UTC, Walter Bright wrote:
 On 2/4/2013 7:19 PM, Brian Schott wrote:
 Still only half speed. I'm becoming more and more convinced 
 that Walter is
 actually a wizard.

I worship the Dark Arts.

*recites incantations* *merges pull requests* *adds unit tests* http://hackerpilot.github.com/experimental/std_lexer/images/times4.png
Feb 28 2013
prev sibling parent Andrej Mitrovic <andrej.mitrovich gmail.com> writes:
On 2/28/13, Brian Schott <briancschott gmail.com> wrote:
 http://hackerpilot.github.com/experimental/std_lexer/images/times4.png

That's a nice plot, but what is the X axis?
Feb 28 2013