www.digitalmars.com         C & C++   DMDScript  

digitalmars.D - std.d.lexer requirements

reply Walter Bright <newshound2 digitalmars.com> writes:
Given the various proposals for a lexer module for Phobos, I thought I'd share 
some characteristics it ought to have.

First of all, it should be suitable for, at a minimum:

1. compilers

2. syntax highlighting editors

3. source code formatters

4. html creation

To that end:

1. It should accept as input an input range of UTF8. I feel it is a mistake to 
templatize it for UTF16 and UTF32. Anyone desiring to feed it UTF16 should use 
an 'adapter' range to convert the input to UTF8. (This is what component 
programming is all about.)

2. It should output an input range of tokens

3. tokens should be values, not classes

4. It should avoid memory allocation as much as possible

5. It should read or write any mutable global state outside of its "Lexer"
instance

6. A single "Lexer" instance should be able to serially accept input ranges, 
sharing and updating one identifier table

7. It should accept a callback delegate for errors. That delegate should decide 
whether to:
    1. ignore the error (and "Lexer" will try to recover and continue)
    2. print an error message (and "Lexer" will try to recover and continue)
    3. throw an exception, "Lexer" is done with that input range

8. Lexer should be configurable as to whether it should collect information 
about comments and ddoc comments or not

9. Comments and ddoc comments should be attached to the next following token, 
they should not themselves be tokens

10. High speed matters a lot

11. Tokens should have begin/end line/column markers, though most of the time 
this can be implicitly determined

12. It should come with unittests that, using -cov, show 100% coverage


Basically, I don't want anyone to be motivated to do a separate one after
seeing 
this one.
Aug 01 2012
next sibling parent reply "Jakob Ovrum" <jakobovrum gmail.com> writes:
On Thursday, 2 August 2012 at 00:11:15 UTC, Walter Bright wrote:
 3. tokens should be values, not classes

I agree with everything but this point. Tokens become quite large when they have all kinds of string and position information on them, much larger than the typical recommended sizes for pass-by-value. It's still possible to provide an interface for them that doesn't require as much copying (ref returns and getters and whatnot), but it's fiddly and way too easy to copy these huge Token structs around, especially if the output is a range of Token structs.
Aug 01 2012
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 8/1/2012 5:21 PM, Jakob Ovrum wrote:
 On Thursday, 2 August 2012 at 00:11:15 UTC, Walter Bright wrote:
 3. tokens should be values, not classes

I agree with everything but this point. Tokens become quite large when they have all kinds of string and position information on them, much larger than the typical recommended sizes for pass-by-value. It's still possible to provide an interface for them that doesn't require as much copying (ref returns and getters and whatnot), but it's fiddly and way too easy to copy these huge Token structs around, especially if the output is a range of Token structs.

Doing a class requires a heap allocation per token, which will have disastrous performance consequences.
Aug 01 2012
parent Walter Bright <newshound2 digitalmars.com> writes:
On 8/1/2012 9:02 PM, Jakob Ovrum wrote:
 No it doesn't. That's an implication of using the NewExpression, not classes.

See my other reply to you on this topic.
Aug 01 2012
prev sibling next sibling parent reply "Bernard Helyer" <b.helyer gmail.com> writes:
 5. It should read or write any mutable global state outside of 
 its "Lexer"
 instance

I assume you mean 'shouldn't'. :P
Aug 01 2012
parent Walter Bright <newshound2 digitalmars.com> writes:
On 8/1/2012 6:30 PM, Bernard Helyer wrote:
 5. It should read or write any mutable global state outside of its "Lexer"
 instance

I assume you mean 'shouldn't'. :P

Good catch.
Aug 01 2012
prev sibling next sibling parent reply deadalnix <deadalnix gmail.com> writes:
Le 02/08/2012 02:10, Walter Bright a écrit :
 6. A single "Lexer" instance should be able to serially accept input
 ranges, sharing and updating one identifier table

I see the lexer as a function that take an range of char as input and give back a range of token. Does it make sense to make an instance of a lexer ?
 7. It should accept a callback delegate for errors. That delegate should
 decide whether to:
 1. ignore the error (and "Lexer" will try to recover and continue)
 2. print an error message (and "Lexer" will try to recover and continue)
 3. throw an exception, "Lexer" is done with that input range

Off topic, but it look like the condition proposal from H.S. Teoh and myself.
 Basically, I don't want anyone to be motivated to do a separate one
 after seeing this one.

That would be awesome !
Aug 01 2012
next sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 8/1/2012 7:04 PM, deadalnix wrote:
 Le 02/08/2012 02:10, Walter Bright a écrit :
 6. A single "Lexer" instance should be able to serially accept input
 ranges, sharing and updating one identifier table

I see the lexer as a function that take an range of char as input and give back a range of token. Does it make sense to make an instance of a lexer ?

Yes, as the lexer will have some state and some configuration parameters.
Aug 01 2012
next sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 8/1/2012 9:06 PM, Jonathan M Davis wrote:
 On Wednesday, August 01, 2012 21:00:53 Walter Bright wrote:
 On 8/1/2012 7:04 PM, deadalnix wrote:
 Le 02/08/2012 02:10, Walter Bright a écrit :
 6. A single "Lexer" instance should be able to serially accept input
 ranges, sharing and updating one identifier table

I see the lexer as a function that take an range of char as input and give back a range of token. Does it make sense to make an instance of a lexer ?


Couldn't that just be part of the range type? A separate lexer type shouldn't be necessary.

No, because the user may want to serially present ranges to the same lexer type. Why do that, you might say? That's how dmd works - it means the identifier table will be the same for all source files.
Aug 01 2012
next sibling parent Jonathan M Davis <jmdavisProg gmx.com> writes:
On Wednesday, August 01, 2012 21:54:01 Jonathan M Davis wrote:
 On Wednesday, August 01, 2012 21:47:22 Walter Bright wrote:
 Couldn't that just be part of the range type? A separate lexer type
 shouldn't be necessary.

No, because the user may want to serially present ranges to the same lexer type. Why do that, you might say? That's how dmd works - it means the identifier table will be the same for all source files.

Then just pass the same identifier table to the function which creates the token range. That doesn't require another type.

If you end up with enough pieces of state that you need to carry around between files/strings being lexed, then creating another object makes sense, but I would hope that the amount of global state like that would be minimal. - Jonathan M Davis
Aug 01 2012
prev sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 8/1/2012 9:54 PM, Jonathan M Davis wrote:
 Then just pass the same identifier table to the function which creates the
 token range. That doesn't require another type.

You're still going to require another type, otherwise you'll have to duplicate the state in every token allocation, with resultant heavy memory and initialization costs. Please keep in mind that a lexer is not something you just pass a few short strings to. It's very very very performance critical as all those extra instructions add up to long delays when you're shoving millions of lines of code into its maw. For the same reason you're also not going to want the lexer putting pressure on the GC. It could bring your whole system down. To get a high performance lexer, you're going to be counting the average number of instructions executed per input character. Each one counts. Shaving one off is a victory. You should also be thinking about memory cache access patterns.
Aug 01 2012
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 8/1/2012 10:34 PM, Jonathan M Davis wrote:
 On Wednesday, August 01, 2012 22:09:05 Walter Bright wrote:
 On 8/1/2012 9:54 PM, Jonathan M Davis wrote:
 Then just pass the same identifier table to the function which creates the
 token range. That doesn't require another type.

You're still going to require another type, otherwise you'll have to duplicate the state in every token allocation, with resultant heavy memory and initialization costs.

Why would you be duplicating any state in the token? The range would have the state, not the token. The token would only have the data specific to the token, not the range as a whole.

I meant two types: the Lexer and the Token. The Lexer can present a range interface.
Aug 01 2012
parent Walter Bright <newshound2 digitalmars.com> writes:
On 8/1/2012 10:53 PM, Jonathan M Davis wrote:
 What I was expecting there to  be was a type which was a range of tokens.  You
 passed the source string to a function which returned that range, and you
 iterated over it to process each token. What you appear to have been arguing
 for is another type which you get the range from which holds additional state
 (e.g. the indentifier table). So, you're saying now that you only meant that
 the token type needs to be separate from the range type (which I would have
 thought would be the only way to do it), or are you indeed saying that there
 should be another type that the range comes from and which holds additional
 state?

You could do it either way.
Aug 01 2012
prev sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 8/1/2012 9:41 PM, H. S. Teoh wrote:
 Whether it's part of the range type or a separate lexer type,
 *definitely* make it possible to have multiple instances. One of the
 biggest flaws of otherwise-good lexer generators like lex and flex
 (C/C++) is that the core code assumes a single instance, and
 multi-instances were glued on after the fact, making it a royal pain to
 work with anything that needs lexing multiple things at the same time.

Yup. I keep trying to think of a way to lex multiple files at the same time in separate threads, but the problem is serializing access to the identifier table will likely kill off any perf gain.
Aug 01 2012
next sibling parent deadalnix <deadalnix gmail.com> writes:
Le 02/08/2012 06:48, Walter Bright a écrit :
 On 8/1/2012 9:41 PM, H. S. Teoh wrote:
 Whether it's part of the range type or a separate lexer type,
 *definitely* make it possible to have multiple instances. One of the
 biggest flaws of otherwise-good lexer generators like lex and flex
 (C/C++) is that the core code assumes a single instance, and
 multi-instances were glued on after the fact, making it a royal pain to
 work with anything that needs lexing multiple things at the same time.

Yup. I keep trying to think of a way to lex multiple files at the same time in separate threads, but the problem is serializing access to the identifier table will likely kill off any perf gain.

That was exactly my reaction to the « let's reuse the identifier table » comment of yours. The future is multicore.
Aug 02 2012
prev sibling parent Walter Bright <newshound2 digitalmars.com> writes:
On 8/6/2012 5:14 PM, Jason House wrote:
 The following is an incredibly fast multithreaded hash table. It is both
 lock-free and fence-free. Would something like that solve your problem?

 http://www.azulsystems.com/events/javaone_2007/2007_LockFreeHash.pdf

It might if I understood it! There do seem to be some cases where fences are required. This would take considerable study.
Aug 07 2012
prev sibling next sibling parent Jonathan M Davis <jmdavisProg gmx.com> writes:
On Wednesday, August 01, 2012 21:00:53 Walter Bright wrote:
 On 8/1/2012 7:04 PM, deadalnix wrote:
 Le 02/08/2012 02:10, Walter Bright a =C3=A9crit :
 6. A single "Lexer" instance should be able to serially accept inp=



 ranges, sharing and updating one identifier table

I see the lexer as a function that take an range of char as input a=


 back a range of token. Does it make sense to make an instance of a =


 ?


Couldn't that just be part of the range type? A separate lexer type sho= uldn't=20 be necessary. - Jonathan M Davis
Aug 01 2012
prev sibling next sibling parent "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Wed, Aug 01, 2012 at 09:06:42PM -0700, Jonathan M Davis wrote:
 On Wednesday, August 01, 2012 21:00:53 Walter Bright wrote:
 On 8/1/2012 7:04 PM, deadalnix wrote:
 Le 02/08/2012 02:10, Walter Bright a écrit :
 6. A single "Lexer" instance should be able to serially accept
 input ranges, sharing and updating one identifier table

I see the lexer as a function that take an range of char as input and give back a range of token. Does it make sense to make an instance of a lexer ?

parameters.

Couldn't that just be part of the range type? A separate lexer type shouldn't be necessary.

Whether it's part of the range type or a separate lexer type, *definitely* make it possible to have multiple instances. One of the biggest flaws of otherwise-good lexer generators like lex and flex (C/C++) is that the core code assumes a single instance, and multi-instances were glued on after the fact, making it a royal pain to work with anything that needs lexing multiple things at the same time. T -- The day Microsoft makes something that doesn't suck is probably the day they start making vacuum cleaners... -- Slashdotter
Aug 01 2012
prev sibling parent "Jason House" <jason.james.house gmail.com> writes:
On Thursday, 2 August 2012 at 04:48:56 UTC, Walter Bright wrote:
 On 8/1/2012 9:41 PM, H. S. Teoh wrote:
 Whether it's part of the range type or a separate lexer type,
 *definitely* make it possible to have multiple instances. One 
 of the
 biggest flaws of otherwise-good lexer generators like lex and 
 flex
 (C/C++) is that the core code assumes a single instance, and
 multi-instances were glued on after the fact, making it a 
 royal pain to
 work with anything that needs lexing multiple things at the 
 same time.

Yup. I keep trying to think of a way to lex multiple files at the same time in separate threads, but the problem is serializing access to the identifier table will likely kill off any perf gain.

The following is an incredibly fast multithreaded hash table. It is both lock-free and fence-free. Would something like that solve your problem? http://www.azulsystems.com/events/javaone_2007/2007_LockFreeHash.pdf
Aug 06 2012
prev sibling next sibling parent reply Jonathan M Davis <jmdavisProg gmx.com> writes:
On Wednesday, August 01, 2012 17:10:07 Walter Bright wrote:
 1. It should accept as input an input range of UTF8. I feel it is a mistake
 to templatize it for UTF16 and UTF32. Anyone desiring to feed it UTF16
 should use an 'adapter' range to convert the input to UTF8. (This is what
 component programming is all about.)

But that's not how ranges of characters work. They're ranges of dchar. Ranges don't operate on UTF-8 or UTF-16. They operate on UTF-32. You'd have to create special wrappers around string or wstring to have ranges of UTF-8. The way that it's normally done is to have ranges of dchar and then special-case range-based functions for strings. Then the function can operate on any range of dchar but still operates on strings efficiently. So, it's quite possible to make a lexer operate on ranges of dchar but be operating primarily on ASCII by special-casing strings and avoiding decoding as much as possible. My lexer does a good job of this, I believe, so it's thoroughly optimized for strings, but it's still operating on ranges of dchar, not ranges of UTF-8.
 2. It should output an input range of tokens
 
 3. tokens should be values, not classes
 
 4. It should avoid memory allocation as much as possible
 
 5. It should read or write any mutable global state outside of its "Lexer"
 instance
 
 6. A single "Lexer" instance should be able to serially accept input ranges,
 sharing and updating one identifier table

When you say identifier table, do you mean symbol table, or something else? Isn't the symbol table something for the parser to worry about? Other than parsers, I can't think of anything which would even _care_ about a symbol table. And depending on how a symbol table were done, you'd want it to take scoping rules into account (_I'd_ certainly expect it to), meaning that it only includes identifiers which are relevant to the current scope. And if that's the case, it _really_ doesn't belong in the lexer. The code using the lexer can easily have its own table that it populates according to however it wants it populated by processing the identifier tokens as they get lexed. Not to mention, don't symbol tables usually include type information that a lexer _wouldn't_ have? Sure, a symbol table _could_ be put in the lexer (at least if it doesn't care about scoping), but I definitely question that that's the right place for it. If you mean a table that simply lists identifiers rather than a symbol table, I'm not quite sure why you'd want it in addition to the symbol table, and again, I'd expect only parsers to care about that sort of thing, so I'd expect the parser to be implementing it. So, what exactly are you looking for the lexer to implement as far as an identifier table goes, and why is it in the lexer rather than the parser?
 7. It should accept a callback delegate for errors. That delegate should
 decide whether to:
     1. ignore the error (and "Lexer" will try to recover and continue)
     2. print an error message (and "Lexer" will try to recover and continue)
 3. throw an exception, "Lexer" is done with that input range

I'm currently treating errors as tokens. It then becomes easy for the code using the lexer to just ignore the errors, to process them immediately, or to put off dealing with them until the lexing is complete. So, the code using the lexer can handle errors however and whenever it likes without having to worry about delegates or exceptions. Since tokens are lexed lazily, the fact that an error is reported as a token doesn't require that the lexing continue, but it also makes it _easy_ to continue lexing, ignoring or saving the error. I'm inclined to think that that's a superior approach to using delegates and exceptions.
 8. Lexer should be configurable as to whether it should collect information
 about comments and ddoc comments or not

 9. Comments and ddoc comments should be attached to the next following
 token, they should not themselves be tokens

Why? I'd argue for just processing them as tokens just like I'm doing with errors. The code using the lexer can then process them or ignore them as it likes (it can even specifically look for them and ignore all other tokens if it's something which only operates on comments). And if you want to see which token follows the comment, it's the next one in the range. So, I don't see much need to attach comments to other tokens. What's the advantage of not treating them as tokens?
 12. It should come with unittests that, using -cov, show 100% coverage

100% is actually unreasonable, because there are lines which should _never_ be hit (e.g. an assert(0) line in a switch statement). All lines _other_ than those sort of lines should have code coverage, but depending on the lexer implementation, 100% is impossible. std.datetime has a ton of unit tests and IIRC it still only manages 98% because of assert(0) lines. So, I agree with the spirit of your statement, but it's not necessarily reasonable to do exactly what you're saying. - Jonathan M Davis
Aug 01 2012
next sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 8/1/2012 8:04 PM, Jonathan M Davis wrote:
 On Wednesday, August 01, 2012 17:10:07 Walter Bright wrote:
 1. It should accept as input an input range of UTF8. I feel it is a mistake
 to templatize it for UTF16 and UTF32. Anyone desiring to feed it UTF16
 should use an 'adapter' range to convert the input to UTF8. (This is what
 component programming is all about.)

But that's not how ranges of characters work. They're ranges of dchar. Ranges don't operate on UTF-8 or UTF-16. They operate on UTF-32. You'd have to create special wrappers around string or wstring to have ranges of UTF-8. The way that it's normally done is to have ranges of dchar and then special-case range-based functions for strings. Then the function can operate on any range of dchar but still operates on strings efficiently.

I have argued against making ranges of characters dchars, because of performance reasons. This will especially adversely affect the performance of the lexer. The fact is, files are normally in UTF8 and just about everything else is in UTF8. Prematurely converting to UTF-32 is a performance disaster. Note that the single largest thing holding back regex performance is that premature conversion to dchar and back to char. If lexer is required to accept dchar ranges, its performance will drop at least in half, and people are going to go reinvent their own lexers.
 So, it's quite possible to make a lexer operate on ranges of dchar but be
 operating primarily on ASCII by special-casing strings and avoiding decoding
 as much as possible. My lexer does a good job of this, I believe, so it's
 thoroughly optimized for strings, but it's still operating on ranges of dchar,
 not ranges of UTF-8.

 2. It should output an input range of tokens

 3. tokens should be values, not classes

 4. It should avoid memory allocation as much as possible

 5. It should read or write any mutable global state outside of its "Lexer"
 instance

 6. A single "Lexer" instance should be able to serially accept input ranges,
 sharing and updating one identifier table

When you say identifier table, do you mean symbol table, or something else?

All identifiers are entered into a hashtable, and are referred to by pointers into that hashtable for the rest of dmd. This makes symbol lookups incredibly fast, as no string comparisons are done.
 Isn't the symbol table something for the parser to worry about? Other than
 parsers, I can't think of anything which would even _care_ about a symbol
 table. And depending on how a symbol table were done, you'd want it to take
 scoping rules into account (_I'd_ certainly expect it to), meaning that it
 only includes identifiers which are relevant to the current scope. And if
 that's the case, it _really_ doesn't belong in the lexer. The code using the
 lexer can easily have its own table that it populates according to however it
 wants it populated by processing the identifier tokens as they get lexed. Not
 to mention, don't symbol tables usually include type information that a lexer
 _wouldn't_ have?

The symbol table becomes a symbol table of pointers into the lexer's identifier table.
 Sure, a symbol table _could_ be put in the lexer (at least if it doesn't care
 about scoping), but I definitely question that that's the right place for it.

The symbol table isn't in the lexer, the identifier string table is.
 If you mean a table that simply lists identifiers rather than a symbol table,
 I'm not quite sure why you'd want it in addition to the symbol table, and
 again, I'd expect only parsers to care about that sort of thing, so I'd expect
 the parser to be implementing it.

Because it is gawdammed fast to do it that way. An identifier is treated as a string exactly once, in the lexer, and thereafter by its unique handle.
 So, what exactly are you looking for the lexer to implement as far as an
 identifier table goes, and why is it in the lexer rather than the parser?

Very simple. A mapping of [identifier string] => [unique value], and a pointer servers nicely as that unique value.
 7. It should accept a callback delegate for errors. That delegate should
 decide whether to:
      1. ignore the error (and "Lexer" will try to recover and continue)
      2. print an error message (and "Lexer" will try to recover and continue)
 3. throw an exception, "Lexer" is done with that input range

I'm currently treating errors as tokens. It then becomes easy for the code using the lexer to just ignore the errors, to process them immediately, or to put off dealing with them until the lexing is complete. So, the code using the lexer can handle errors however and whenever it likes without having to worry about delegates or exceptions. Since tokens are lexed lazily, the fact that an error is reported as a token doesn't require that the lexing continue, but it also makes it _easy_ to continue lexing, ignoring or saving the error. I'm inclined to think that that's a superior approach to using delegates and exceptions.

I don't agree. It means that the parser has to check everywhere for error tokens. Besides the ugliness, it means a slowdown for those checks.
 8. Lexer should be configurable as to whether it should collect information
 about comments and ddoc comments or not

 9. Comments and ddoc comments should be attached to the next following
 token, they should not themselves be tokens

Why?

Because then the parser would have to add checks for 0 or more comment tokens between every other token. It uglifies the parser pretty badly, and slows it down.
 I'd argue for just processing them as tokens just like I'm doing with
 errors. The code using the lexer can then process them or ignore them as it
 likes (it can even specifically look for them and ignore all other tokens if
 it's something which only operates on comments). And if you want to see which
 token follows the comment, it's the next one in the range. So, I don't see
 much need to attach comments to other tokens. What's the advantage of not
 treating them as tokens?

Performance and simplicity. Your grammar will be awfully ugly unless you insert another range to filter out those tokens - but that'll be a slowdown.
 12. It should come with unittests that, using -cov, show 100% coverage

100% is actually unreasonable, because there are lines which should _never_ be hit (e.g. an assert(0) line in a switch statement). All lines _other_ than those sort of lines should have code coverage, but depending on the lexer implementation, 100% is impossible. std.datetime has a ton of unit tests and IIRC it still only manages 98% because of assert(0) lines. So, I agree with the spirit of your statement, but it's not necessarily reasonable to do exactly what you're saying.

I know, but I expect all unreachable code to be justified. The lexer is a well defined problem, and this should not be difficult.
Aug 01 2012
next sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 8/1/2012 9:52 PM, Jonathan M Davis wrote:
 1. The current design of Phobos is to have ranges of dchar, because it fosters
 code correctness (though it can harm efficiency). It's arguably too late to do
 otherwise. Certainly, doing otherwise now would break a lot of code. If the
 lexer tried to operate on UTF-8 as part of its API rather than operating on
 ranges of dchar and special-casing strings, then it wouldn't fit in with Phobos
 at all.

The lexer must use char or it will not be acceptable as anything but a toy for performance reasons.
 2. The lexer does _not_ have to have its performance tank by accepting ranges
 of dchar. It's true that the performance will be harmed for ranges which
 _aren't_ strings, but for strings (as would be by far the most common use
 case) it can be very efficient by special-casing them.

Somebody has to convert the input files into dchars, and then back into chars. That blows for performance. Think billions and billions of characters going through, not just a few random strings. Always always think of the lexer as having a firehose of data shoved into its maw, and it better be thirsty!
 And as much as there are potential performance issues with Phobos' choice of
 treating strings as ranges of dchar, if it were to continue to treat them as
 ranges of code units, it's pretty much a guarantee that there would be a _ton_
 of bugs caused by it. Most programmers have absolutely no clue about how
 unicode works and don't really want to know. They want strings to just work.
 Phobos' approach of defaulting to correct but making it possible to make the
 code faster through specializations is very much in line with D's typical
 approach of making things safe by default but allowing the programmer to do
 unsafe things for optimizations when they know what they're doing.

I expect std.d.lexer to handle UTF8 correctly, so I don't think this should be an issue in this particular case. dmd's lexer does handle UTF8 correctly. Note also that the places where non-ASCII characters can appear in correct D code is severely limited, and there are even fewer places where multibyte characters need to be decoded at all, and the lexer takes full advantage of this to boost its speed. For example, non-ASCII characters can appear in comments, but they DO NOT need to be decoded, and even just having the test for a non-ASCII character in the comment scanner will visibly slow down the lexer.
 All identifiers are entered into a hashtable, and are referred to by
 pointers into that hashtable for the rest of dmd. This makes symbol lookups
 incredibly fast, as no string comparisons are done.

Hmmm. Well, I'd still argue that that's a parser thing. Pretty much nothing else will care about it. At most, it should be an optional feature of the lexer. But it certainly could be added that way.

I hate to say "trust me on this", but if you don't, have a look at dmd's lexer and how it handles identifiers, then look at dmd's symbol table.
Aug 01 2012
next sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 8/1/2012 10:44 PM, Jonathan M Davis wrote:
 On Wednesday, August 01, 2012 22:33:12 Walter Bright wrote:
 The lexer must use char or it will not be acceptable as anything but a toy
 for performance reasons.

Avoiding decoding can be done with strings and operating on ranges of dchar, so you'd be operating almost entirely on ASCII. Are you saying that there's a performance issue aside from decoding?

1. Encoding it into a dchar is a performance problem. Source code sits in files that are nearly always in UTF8. So your input range MUST check every single char and convert it to UTF32 as necessary. Plus, there's that additional step removed from sticking the file input buffer directly into the lexer's input. 2. Storing strings as dchars is a performance and memory problem (4x as much data and hence time). Remember, nearly ALL of D source will be ASCII. All performance considerations must be tilted towards the usual case.
 Somebody has to convert the input files into dchars, and then back into
 chars. That blows for performance. Think billions and billions of
 characters going through, not just a few random strings.

Why is there any converting to dchar going on here?

Because your input range is a range of dchar?
 I don't see why any would
 be necessary. If you reading in a file as a string or char[] (as would be
 typical), then you're operating on a string, and then the only time that any
 decoding will be necessary is when you actually need to operate on a unicode
 character, which is very rare in D's grammar. It's only when operating on
 something _other_ than a string that you'd have to actually deal with dchars.

That's what I've been saying. So why have an input range of dchars, which must be decoded in advance, otherwise it wouldn't be a range of dchars?
 Hmmm. Well, I'd still argue that that's a parser thing. Pretty much
 nothing
 else will care about it. At most, it should be an optional feature of the
 lexer. But it certainly could be added that way.

I hate to say "trust me on this", but if you don't, have a look at dmd's lexer and how it handles identifiers, then look at dmd's symbol table.

My point is that it's the sort of thing that _only_ a parser would care about. So, unless it _needs_ to be in the lexer for some reason, it shouldn't be.

I think you are misunderstanding. The lexer doesn't have a *symbol* table in it. It has a mapping from identifiers to unique handles. It needs to be there, otherwise the semantic analysis has to scan identifier strings a second time.
Aug 01 2012
next sibling parent reply Jacob Carlborg <doob me.com> writes:
On 2012-08-02 08:39, Walter Bright wrote:

 That's what I've been saying. So why have an input range of dchars,
 which must be decoded in advance, otherwise it wouldn't be a range of
 dchars?

Your first requirement is a bit strange and confusing: "1. It should accept as input an input range of UTF8." To my understand ranges in Phobos are designed to operate on dchars, not chars, regardless of the original input type. So if you can create a range that operates on UTF8 I don't know if that will be usable with the rest of the Phobos functions expecting ranges of dchars. Basically making this type of range useless since it cannot use any other functionality in Phobos. The current state of Phobos: you either have a range of dchars or you don't have a range, you have an array or something else. -- /Jacob Carlborg
Aug 01 2012
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 8/1/2012 11:59 PM, Jacob Carlborg wrote:
 Your first requirement is a bit strange and confusing:

 "1. It should accept as input an input range of UTF8."

 To my understand ranges in Phobos are designed to operate on dchars, not chars,
 regardless of the original input type. So if you can create a range that
 operates on UTF8 I don't know if that will be usable with the rest of the
Phobos
 functions expecting ranges of dchars. Basically making this type of range
 useless since it cannot use any other functionality in Phobos.

 The current state of Phobos: you either have a range of dchars or you don't
have
 a range, you have an array or something else.

I answered this point a few posts up in the thread.
Aug 02 2012
parent reply Jacob Carlborg <doob me.com> writes:
On 2012-08-02 09:21, Walter Bright wrote:

 I answered this point a few posts up in the thread.

I've read a few posts up and the only answer I found is that the lexer needs to operates on chars. But it does not answer the question how that range type would be used by all other range based functions in Phobos. Have I missed this? Can you please link to your post. -- /Jacob Carlborg
Aug 02 2012
parent Walter Bright <newshound2 digitalmars.com> writes:
On 8/2/2012 12:29 AM, Jacob Carlborg wrote:
 On 2012-08-02 09:21, Walter Bright wrote:

 I answered this point a few posts up in the thread.

I've read a few posts up and the only answer I found is that the lexer needs to operates on chars. But it does not answer the question how that range type would be used by all other range based functions in Phobos. Have I missed this? Can you please link to your post.

You can use an adapter range.
Aug 02 2012
prev sibling parent Walter Bright <newshound2 digitalmars.com> writes:
On 8/2/2012 12:21 AM, Jonathan M Davis wrote:
 Because your input range is a range of dchar?

range-based function which operates on ranges of dchar will use static if or overloads to special-case strings. This means that it will function with any range of dchar, but it _also_ will be as efficient with strings as if it just operated on strings.

It *still* must convert UTF8 to dchars before presenting them to the consumer of the dchar elements.
 It won't decode anything in the string unless it has to.
 So, having a lexer which operates on ranges of dchar does _not_ make string
 processing less efficient. It just makes it so that it can _also_ operate on
 ranges of dchar which aren't strings.

 For instance, my lexer uses this whenever it needs to get at the first
 character in the range:

 static if(isNarrowString!R)
      Unqual!(ElementEncodingType!R) first = range[0];
 else
      dchar first = range.front;

You're requiring a random access input range that has random access to something other than the range element type?? and you're requiring an isNarrowString to work on an arbitrary range?
 if I need to know the number of code units that make up the code point, I
 explicitly call decode in the case of a narrow string. In either case, code
 units are _not_ being converted to dchar unless they absolutely have to be.

Or you could do away with requiring a special range type and just have it be a UTF8 range. What I wasn't realizing earlier was that you were positing a range type that has two different kinds of elements. I don't think this is a proper component type.
 Yes. I understand. It has a mapping of pointers to identifiers. My point is
 that nothing but parsers will need that.
 From the standpoint of functionality,
 it's a parser feature, not a lexer feature. So, if it can be done just fine in
 the parser, then that's where it should be. If on the other hand, it _needs_
 to be in the lexer for some reason (e.g. performance), then that's a reason to
 put it there.

If you take it out of the lexer, then: 1. the lexer must allocate storage for every identifier, rather than only for unique identifiers 2. and then the parser must scan the identifier string *again* 3. there must be two hash lookups of each identifier rather than one It's a suboptimal design.
Aug 02 2012
prev sibling parent Jacob Carlborg <doob me.com> writes:
On 2012-08-02 07:44, Jonathan M Davis wrote:
 On Wednesday, August 01, 2012 22:33:12 Walter Bright wrote:
 The lexer must use char or it will not be acceptable as anything but a toy
 for performance reasons.

Avoiding decoding can be done with strings and operating on ranges of dchar, so you'd be operating almost entirely on ASCII. Are you saying that there's a performance issue aside from decoding?
 Somebody has to convert the input files into dchars, and then back into
 chars. That blows for performance. Think billions and billions of
 characters going through, not just a few random strings.

Why is there any converting to dchar going on here? I don't see why any would be necessary. If you reading in a file as a string or char[] (as would be typical), then you're operating on a string, and then the only time that any decoding will be necessary is when you actually need to operate on a unicode character, which is very rare in D's grammar. It's only when operating on something _other_ than a string that you'd have to actually deal with dchars.

I think you two are saying the same things. -- /Jacob Carlborg
Aug 01 2012
prev sibling next sibling parent reply Jonathan M Davis <jmdavisProg gmx.com> writes:
On Wednesday, August 01, 2012 21:52:15 Jonathan M Davis wrote:
 And as much as there are potential performance issues with Phobos' choice of
 treating strings as ranges of dchar, if it were to continue to treat them
 as ranges of code units, it's pretty much a guarantee that there would be a
 _ton_ of bugs caused by it. Most programmers have absolutely no clue about
 how unicode works and don't really want to know. They want strings to just
 work. Phobos' approach of defaulting to correct but making it possible to
 make the code faster through specializations is very much in line with D's
 typical approach of making things safe by default but allowing the
 programmer to do unsafe things for optimizations when they know what
 they're doing.

Another thing that I should point out is that a range of UTF-8 or UTF-16 wouldn't work with many range-based functions at all. Most of std.algorithm and its ilk would be completely useless. Range-based functions operate on a ranges elements, so operating on a range of code units would mean operating on code units, which is going to be _wrong_ almost all the time. Think about what would happen if you used a function like map or filter on a range of code units. The resultant range would be completely invalid as far as unicode goes. Range-based functions need to be operating on _characters_. Technically, not even code points gets us there, so it's _still_ buggy. It's just a _lot_ closer to being correct and works 99.99+% of the time. If we want to be able to operate on ranges of UTF-8 or UTF-16, we need to add a concept of variably-length encoded ranges so that it's possible to treat them as both their encoding and whatever they represent (e.g. code point or grapheme in the case of ranges of code units). - Jonathan M Davis
Aug 01 2012
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 8/1/2012 11:56 PM, Jonathan M Davis wrote:
 Another thing that I should point out is that a range of UTF-8 or UTF-16
 wouldn't work with many range-based functions at all. Most of std.algorithm
 and its ilk would be completely useless. Range-based functions operate on a
 ranges elements, so operating on a range of code units would mean operating on
 code units, which is going to be _wrong_ almost all the time. Think about what
 would happen if you used a function like map or filter on a range of code
 units. The resultant range would be completely invalid as far as unicode goes.

My experience in writing fast string based code that worked on UTF8 and correctly handled multibyte characters was that they are very possible and practical, and they are faster. The lexer MUST MUST MUST be FAST FAST FAST. Or it will not be useful. If it isn't fast, serious users will eschew it and will cook up their own. You'll have a nice, pretty, useless toy of std.d.lexer. I think there's some serious underestimation of how critical this is.
 Range-based functions need to be operating on _characters_. Technically, not
 even code points gets us there, so it's _still_ buggy. It's just a _lot_
 closer to being correct and works 99.99+% of the time.

Multi-code point characters are quite irrelevant to the correctness of a D lexer.
 If we want to be able to operate on ranges of UTF-8 or UTF-16, we need to add
 a concept of variably-length encoded ranges so that it's possible to treat
 them as both their encoding and whatever they represent (e.g. code point or
 grapheme in the case of ranges of code units).

No, this is not necessary.
Aug 02 2012
next sibling parent reply Jacob Carlborg <doob me.com> writes:
On 2012-08-02 09:29, Walter Bright wrote:

 My experience in writing fast string based code that worked on UTF8 and
 correctly handled multibyte characters was that they are very possible
 and practical, and they are faster.

 The lexer MUST MUST MUST be FAST FAST FAST. Or it will not be useful. If
 it isn't fast, serious users will eschew it and will cook up their own.
 You'll have a nice, pretty, useless toy of std.d.lexer.

 I think there's some serious underestimation of how critical this is.

I do understand that the lexer needs to be insanely fast and it needs to operate on UTF-8 and not UTF-32 or anything else. But what I still don't understand is how a UTF-8 range is going to be usable by other range based functions in Phobos. -- /Jacob Carlborg
Aug 02 2012
next sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 8/2/2012 12:49 AM, Jacob Carlborg wrote:
 But what I still don't understand is how a UTF-8 range is going to be usable by
 other range based functions in Phobos.

Worst case use an adapter range.
Aug 02 2012
next sibling parent travert phare.normalesup.org (Christophe Travert) writes:
Walter Bright , dans le message (digitalmars.D:174015), a écrit :
 On 8/2/2012 12:49 AM, Jacob Carlborg wrote:
 But what I still don't understand is how a UTF-8 range is going to be usable by
 other range based functions in Phobos.

Worst case use an adapter range.

Yes auto r = myString.byChar(); after implementing a byChar adapter range or just auto r = cast(const(ubyte)[]) myString; And it's a range of code unit, not code point. And it's usable in phobos.
Aug 02 2012
prev sibling parent reply Jacob Carlborg <doob me.com> writes:
On 2012-08-02 10:15, Walter Bright wrote:

 Worst case use an adapter range.

And that is better than a plain string? -- /Jacob Carlborg
Aug 02 2012
parent reply travert phare.normalesup.org (Christophe Travert) writes:
Jacob Carlborg , dans le message (digitalmars.D:174069), a écrit :
 On 2012-08-02 10:15, Walter Bright wrote:
 
 Worst case use an adapter range.

And that is better than a plain string?

Aug 02 2012
parent reply Jacob Carlborg <doob me.com> writes:
On 2012-08-02 22:51, Christophe Travert wrote:
 Jacob Carlborg , dans le message (digitalmars.D:174069), a écrit :
 On 2012-08-02 10:15, Walter Bright wrote:

 Worst case use an adapter range.

And that is better than a plain string?


If it was a plain string you wouldn't use "front". You would handle any, possible, decoding by yourself, internally in the lexer. This is what Johnathan already does, it seems: static if(isNarrowString!R) Unqual!(ElementEncodingType!R) first = range[0]; else dchar first = range.front; If you're only supporting plain UTF-8 strings you would just do: char first = range[0]; -- /Jacob Carlborg
Aug 02 2012
parent reply travert phare.normalesup.org (Christophe Travert) writes:
Jacob Carlborg , dans le message (digitalmars.D:174131), a écrit :
 static if(isNarrowString!R)
      Unqual!(ElementEncodingType!R) first = range[0];
 else
      dchar first = range.front;

I find it more comfortable to just use first = range.front, with a range of char or ubyte. This range does not have to be a string, it can be a something over a file, stream, socket. It can also be the result of an algorithm, because you *can* use algorithm on ranges of char, and it makes sense if you know what you are doing. If Walter discovers the lexer does not work with a socket, a "file.byChunk.join", and has to do expensive utf-8 decoding for the lexer to work because it can only use range of dchar, and not range of char (except that it special-cased strings), he may not be happy. It the range happens to be a string, I would use an adapter to make it appear like a range of char, not dchar, like the library likes to do. I think Andrei suggested that already. -- Christophe
Aug 03 2012
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 8/3/2012 1:21 AM, Christophe Travert wrote:
 This range does not have to be a string, it can be a something over a
 file, stream, socket. It can also be the result of an algorithm, because
 you *can* use algorithm on ranges of char, and it makes sense if you
 know what you are doing.

Correct, that's the whole point of using a range - it can come from anything.
 If Walter discovers the lexer does not work with a socket, a
 "file.byChunk.join", and has to do expensive utf-8 decoding for the
 lexer to work because it can only use range of dchar, and not range of
 char (except that it special-cased strings), he may not be happy.

No may about it, I wouldn't be happy :-)
Aug 03 2012
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 8/3/2012 3:07 AM, Walter Bright wrote:
 On 8/3/2012 1:21 AM, Christophe Travert wrote:
 This range does not have to be a string, it can be a something over a
 file, stream, socket. It can also be the result of an algorithm, because
 you *can* use algorithm on ranges of char, and it makes sense if you
 know what you are doing.

Correct, that's the whole point of using a range - it can come from anything.

For example, let's suppose we want to do D syntax highlighting in our IDE. It is highly unlikely that the text editor's data structure is a simple string. It's likely to be an array of lines, or something like that. It's simple to provide a range interface to that, where it presents the data structure as a sequence of chars. But that wouldn't work with Jonathan's string specialization. Nor would the string specialization work if the input was a 1024 byte file buffer.
Aug 03 2012
parent Walter Bright <newshound2 digitalmars.com> writes:
On 8/3/2012 4:40 AM, Tobias Pankrath wrote:
 Would this be an argument for putting the computation of source locations (i.e.
 line + offset or similar) into the range / into an template argument / policy,
 so that it's done in the most effective way for the client?

 Kate for example has a "range"-type that marks a span in the text buffer. This
 way the lexer can return token with the correct "textrange" attached and you
 don't need to recompute the text ranges from line/col numbers.

Worth thinking about.
Aug 03 2012
prev sibling parent "Tobias Pankrath" <tobias pankrath.net> writes:
 Correct, that's the whole point of using a range - it can come 
 from anything.

For example, let's suppose we want to do D syntax highlighting in our IDE. It is highly unlikely that the text editor's data structure is a simple string. It's likely to be an array of lines, or something like that. It's simple to provide a range interface to that, where it presents the data structure as a sequence of chars.

Would this be an argument for putting the computation of source locations (i.e. line + offset or similar) into the range / into an template argument / policy, so that it's done in the most effective way for the client? Kate for example has a "range"-type that marks a span in the text buffer. This way the lexer can return token with the correct "textrange" attached and you don't need to recompute the text ranges from line/col numbers.
Aug 03 2012
prev sibling next sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 8/2/2012 12:33 AM, Bernard Helyer wrote:
 On Thursday, 2 August 2012 at 07:29:52 UTC, Walter Bright wrote:
 The lexer MUST MUST MUST be FAST FAST FAST. Or it will not be useful. If it
 isn't fast, serious users will eschew it and will cook up their own. You'll
 have a nice, pretty, useless toy of std.d.lexer.

If you want to throw out some target times, that would be useful.

As fast as the dmd one would be best.
Aug 02 2012
next sibling parent dennis luehring <dl.soluz gmx.net> writes:
Am 02.08.2012 10:13, schrieb Walter Bright:
 On 8/2/2012 12:33 AM, Bernard Helyer wrote:
 On Thursday, 2 August 2012 at 07:29:52 UTC, Walter Bright wrote:
 The lexer MUST MUST MUST be FAST FAST FAST. Or it will not be useful. If it
 isn't fast, serious users will eschew it and will cook up their own. You'll
 have a nice, pretty, useless toy of std.d.lexer.

If you want to throw out some target times, that would be useful.

As fast as the dmd one would be best.

would it be (easily) possible to "extract" the dmd lexer code (and needed interface) for using it as an spererated benchmark reference?
Aug 02 2012
prev sibling next sibling parent dennis luehring <dl.soluz gmx.net> writes:
Am 02.08.2012 10:13, schrieb Walter Bright:
 On 8/2/2012 12:33 AM, Bernard Helyer wrote:
 On Thursday, 2 August 2012 at 07:29:52 UTC, Walter Bright wrote:
 The lexer MUST MUST MUST be FAST FAST FAST. Or it will not be useful. If it
 isn't fast, serious users will eschew it and will cook up their own. You'll
 have a nice, pretty, useless toy of std.d.lexer.

If you want to throw out some target times, that would be useful.

As fast as the dmd one would be best.

can the dmd lexer seperated as an lib and become useable from outside - as the benchmark reference
Aug 02 2012
prev sibling next sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 8/2/2012 1:21 AM, Jonathan M Davis wrote:
 How would we measure that? dmd's lexer is tied to dmd, so how would we test
 the speed of only its lexer?

Easy. Just make a special version of dmd that lexes only, and time it.
Aug 02 2012
parent reply Ed McCardell <edmccard hotmail.com> writes:
On 08/02/2012 04:41 AM, Walter Bright wrote:
 On 8/2/2012 1:21 AM, Jonathan M Davis wrote:
 How would we measure that? dmd's lexer is tied to dmd, so how would we
 test
 the speed of only its lexer?

Easy. Just make a special version of dmd that lexes only, and time it.

I made a lexing-only version of dmd at https://github.com/edmccard/dmd/tree/lexonly by stripping non-lexer-related code from mars.c, and adding a lexModule function that is called instead of Module::parse.. There's no benchmarking code yet; it basically just does while (token.value != TOKeof) nextToken(); for each D source file passed on the command line. --Ed
Aug 03 2012
parent reply dennis luehring <dl.soluz gmx.net> writes:
Am 03.08.2012 09:56, schrieb Ed McCardell:
 On 08/02/2012 04:41 AM, Walter Bright wrote:
 On 8/2/2012 1:21 AM, Jonathan M Davis wrote:
 How would we measure that? dmd's lexer is tied to dmd, so how would we
 test
 the speed of only its lexer?

Easy. Just make a special version of dmd that lexes only, and time it.

I made a lexing-only version of dmd at https://github.com/edmccard/dmd/tree/lexonly by stripping non-lexer-related code from mars.c, and adding a lexModule function that is called instead of Module::parse.. There's no benchmarking code yet; it basically just does while (token.value != TOKeof) nextToken(); for each D source file passed on the command line. --Ed

Walter mentioned an easier way - without "forking" dmd - should be better an integral part of the ongoing development maybe under a tools or benchmark section? news://news.digitalmars.com:119/jvfv14$1ri0$1 digitalmars.com ->Look in doc.c at highlightCode2() for how to call the lexer by itself.
Aug 03 2012
parent Walter Bright <newshound2 digitalmars.com> writes:
On 8/3/2012 1:27 AM, dennis luehring wrote:
 Walter mentioned an easier way - without "forking" dmd - should be better an
 integral part of the ongoing development maybe under a tools or benchmark
section?

Forking it is fine. This is just for a one-off benchmarking thing.
Aug 03 2012
prev sibling parent reply deadalnix <deadalnix gmail.com> writes:
Le 02/08/2012 10:13, Walter Bright a écrit :
 On 8/2/2012 12:33 AM, Bernard Helyer wrote:
 On Thursday, 2 August 2012 at 07:29:52 UTC, Walter Bright wrote:
 The lexer MUST MUST MUST be FAST FAST FAST. Or it will not be useful.
 If it
 isn't fast, serious users will eschew it and will cook up their own.
 You'll
 have a nice, pretty, useless toy of std.d.lexer.

If you want to throw out some target times, that would be useful.

As fast as the dmd one would be best.

That'd be great but . . . lexer really isn't the performance bottleneck of dmd (or any compiler of a non trivial language). Additionally, anybody that have touched dmd source code can agree that usability/maintainability isn't as its best. Sacrificing some perfs in a non bottleneck area to increase ease of use make perfect sense.
Aug 02 2012
next sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 8/2/2012 4:47 AM, deadalnix wrote:
 Le 02/08/2012 10:13, Walter Bright a écrit :
 As fast as the dmd one would be best.

That'd be great but . . . lexer really isn't the performance bottleneck of dmd (or any compiler of a non trivial language). Additionally, anybody that have touched dmd source code can agree that usability/maintainability isn't as its best. Sacrificing some perfs in a non bottleneck area to increase ease of use make perfect sense.

A lexer is like a device driver, or a memory allocator. Having it be as fast as possible pays off dividends endlessly for whoever uses it. And in my profiling, lexing speed is a big factor in the debug build times. The Digital Mars C++ compiler (and its antecedents) made a lot of hay by being the fastest compiler available by a wide margin. A big advantage D has is also compile speed. Please do not underestimate its importance.
Aug 02 2012
parent reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
On 02-Aug-12 22:06, Walter Bright wrote:
 On 8/2/2012 4:47 AM, deadalnix wrote:
 Le 02/08/2012 10:13, Walter Bright a écrit :
 As fast as the dmd one would be best.

That'd be great but . . . lexer really isn't the performance bottleneck of dmd (or any compiler of a non trivial language). Additionally, anybody that have touched dmd source code can agree that usability/maintainability isn't as its best. Sacrificing some perfs in a non bottleneck area to increase ease of use make perfect sense.

A lexer is like a device driver, or a memory allocator. Having it be as fast as possible pays off dividends endlessly for whoever uses it.

+1 Another point is that it's crystal clear *how* to optimize lexer, and gain some precious time. It is well trodden path with well known destination. The time saved in lexer gives you some headroom in say optimizer, where you always can find more ways to improve result by sacrificing speed. In the long run I totally expect millions of lines of D code in the wild.
 And in my profiling, lexing speed is a big factor in the debug build times.

 The Digital Mars C++ compiler (and its antecedents) made a lot of hay by
 being the fastest compiler available by a wide margin. A big advantage D
 has is also compile speed.

 Please do not underestimate its importance.

-- Dmitry Olshansky
Aug 02 2012
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 8/2/2012 1:20 PM, Dmitry Olshansky wrote:
 +1 Another point is that it's crystal clear *how* to optimize lexer, and gain
 some precious time. It is well trodden path with well known destination.

I'm less sure how well known it is. DMC++ was known for being *several times* faster than other C++ compilers, not just a few percent faster. But we do have the DMD lexer which is useful as a benchmark and a guide. I won't say it couldn't be made faster, but it does set a minimum bar for performance.
Aug 02 2012
parent reply Jacob Carlborg <doob me.com> writes:
On 2012-08-03 00:01, Walter Bright wrote:

 But we do have the DMD lexer which is useful as a benchmark and a guide.
 I won't say it couldn't be made faster, but it does set a minimum bar
 for performance.

I'm not sure how easy it would be to just measure the lexing phase of DMD. If it's easy someone would probably already have extracted the lexer from DMD. -- /Jacob Carlborg
Aug 02 2012
next sibling parent reply dennis luehring <dl.soluz gmx.net> writes:
Am 03.08.2012 08:37, schrieb Jacob Carlborg:
 On 2012-08-03 00:01, Walter Bright wrote:

 But we do have the DMD lexer which is useful as a benchmark and a guide.
 I won't say it couldn't be made faster, but it does set a minimum bar
 for performance.

I'm not sure how easy it would be to just measure the lexing phase of DMD. If it's easy someone would probably already have extracted the lexer from DMD.

wouldn't it be better to extract the lexer part of dmd into its own (hopefully small) library - that way the lexer is still useable by dmd AND benchmarkable from outside - it is then even possible to replace the dmd lexer by an D version due to the c linkage feature of D
Aug 02 2012
parent Jacob Carlborg <doob me.com> writes:
On 2012-08-03 08:49, dennis luehring wrote:

 wouldn't it be better to extract the lexer part of dmd into its own
 (hopefully small) library - that way the lexer is still useable by dmd
 AND benchmarkable from outside - it is then even possible to replace the
 dmd lexer by an D version due to the c linkage feature of D

That was what I meant. If it's easy someone would have already done it. -- /Jacob Carlborg
Aug 02 2012
prev sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 8/2/2012 11:37 PM, Jacob Carlborg wrote:
 I'm not sure how easy it would be to just measure the lexing phase of DMD. If
 it's easy someone would probably already have extracted the lexer from DMD.

You don't need to extract it to measure it. Just have it lex the source files in a loop, and time that loop.
Aug 02 2012
parent reply Jacob Carlborg <doob me.com> writes:
On 2012-08-03 08:59, Walter Bright wrote:

 You don't need to extract it to measure it. Just have it lex the source
 files in a loop, and time that loop.

Well, that's the problem. It's not like DMD has a single "lex" function that does all the job. Would it perhaps be possible to time Parser::parseModule and remove all things that doesn't seem related to lexing? Remove stuff like: a = new Identifiers(); md = new ModuleDeclaration(a, id, safe); And similar. -- /Jacob Carlborg
Aug 03 2012
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 8/3/2012 12:11 AM, Jacob Carlborg wrote:
 On 2012-08-03 08:59, Walter Bright wrote:

 You don't need to extract it to measure it. Just have it lex the source
 files in a loop, and time that loop.

Well, that's the problem. It's not like DMD has a single "lex" function that does all the job. Would it perhaps be possible to time Parser::parseModule and remove all things that doesn't seem related to lexing? Remove stuff like: a = new Identifiers(); md = new ModuleDeclaration(a, id, safe); And similar.

Look in doc.c at highlightCode2() for how to call the lexer by itself.
Aug 03 2012
parent reply Jacob Carlborg <doob me.com> writes:
On 2012-08-03 09:35, Walter Bright wrote:

 Look in doc.c at highlightCode2() for how to call the lexer by itself.

So basically: Token tok; //start timer while (tok.value != TOKeof) lex.scan(&tok); //end timer Something like that? -- /Jacob Carlborg
Aug 03 2012
parent Walter Bright <newshound2 digitalmars.com> writes:
On 8/3/2012 2:01 AM, Jacob Carlborg wrote:
 On 2012-08-03 09:35, Walter Bright wrote:

 Look in doc.c at highlightCode2() for how to call the lexer by itself.

So basically: Token tok; //start timer while (tok.value != TOKeof) lex.scan(&tok); //end timer Something like that?

Pretty much. Or just call lex.nextToken() until you get a TOKeof.
Aug 03 2012
prev sibling next sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 8/2/2012 11:14 AM, Marco Leise wrote:
 If I type `` at the top of std.datetime in Mono-D right after "module
 std.datetime;", I have a 12 (!) seconds wait, till the IDE becomes responsive
 again. But if I use ' or ", then the editor inserts the second terminator and
 is done with reparsing in 1-2 seconds.

A good IDE should do its parsing in a separate thread, so the main user input thread remains crisp and responsive. If the user edits the text while the parsing is in progress, the background parsing thread simply abandons the current parse and starts over.
Aug 02 2012
parent reply Jacob Carlborg <doob me.com> writes:
On 2012-08-02 21:35, Walter Bright wrote:

 A good IDE should do its parsing in a separate thread, so the main user
 input thread remains crisp and responsive.

 If the user edits the text while the parsing is in progress, the
 background parsing thread simply abandons the current parse and starts
 over.

It still needs to update the editor view with the correct syntax highlighting which needs to be done in the same thread as the rest of the GUI. -- /Jacob Carlborg
Aug 02 2012
next sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 8/2/2012 1:41 PM, Jacob Carlborg wrote:
 On 2012-08-02 21:35, Walter Bright wrote:

 A good IDE should do its parsing in a separate thread, so the main user
 input thread remains crisp and responsive.

 If the user edits the text while the parsing is in progress, the
 background parsing thread simply abandons the current parse and starts
 over.

It still needs to update the editor view with the correct syntax highlighting which needs to be done in the same thread as the rest of the GUI.

The rendering code should be in yet a third thread. An editor I wrote years ago had the rendering code in a separate thread from user input. You never had to wait to type in commands, the rendering would catch up when it could. What was also effective was the rendering would abandon a render midstream and restart it if it detected that the underlying data had changed in the meantime. This meant that the display was never more than one render out of date. Although the code itself wasn't any faster, it certainly *felt* faster with this approach. It made for crisp editing even on a pig slow machine.
Aug 02 2012
next sibling parent reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
On 03-Aug-12 02:10, Walter Bright wrote:
 On 8/2/2012 1:41 PM, Jacob Carlborg wrote:
 On 2012-08-02 21:35, Walter Bright wrote:

 A good IDE should do its parsing in a separate thread, so the main user
 input thread remains crisp and responsive.

 If the user edits the text while the parsing is in progress, the
 background parsing thread simply abandons the current parse and starts
 over.

It still needs to update the editor view with the correct syntax highlighting which needs to be done in the same thread as the rest of the GUI.

The rendering code should be in yet a third thread.

It never ceases to amaze me how people miss this very simple point: GUI runs on its own thread and shouldn't ever block on something (save for message pump itself, of course). Everything else (including possibly slow rendering) done on the side and then result (once ready) swiftly indicated on GUI. Recent Windows 8 talks were in fact nothing new in this regard, but now even API is made so that it's harder to do something blocking in GUI thread.
 An editor I wrote years ago had the rendering code in a separate thread
 from user input. You never had to wait to type in commands, the
 rendering would catch up when it could. What was also effective was the
 rendering would abandon a render midstream and restart it if it detected
 that the underlying data had changed in the meantime. This meant that
 the display was never more than one render out of date.

 Although the code itself wasn't any faster, it certainly *felt* faster
 with this approach. It made for crisp editing even on a pig slow machine.

-- Dmitry Olshansky
Aug 02 2012
parent reply Jacob Carlborg <doob me.com> writes:
On 2012-08-03 00:25, Dmitry Olshansky wrote:

 OT:
 It never ceases to amaze me how people miss this very simple point:
 GUI runs on its own thread and shouldn't ever block on something (save
 for message pump itself, of course). Everything else (including possibly
 slow rendering) done on the side and then result (once ready) swiftly
 indicated on GUI.

I'm not entirely sure what you mean be "rendering" but most GUI systems are not thread safe and all GUI updates need to happen in the same thread. -- /Jacob Carlborg
Aug 02 2012
parent reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
On 03-Aug-12 10:35, Jacob Carlborg wrote:
 On 2012-08-03 00:25, Dmitry Olshansky wrote:

 OT:
 It never ceases to amaze me how people miss this very simple point:
 GUI runs on its own thread and shouldn't ever block on something (save
 for message pump itself, of course). Everything else (including possibly
 slow rendering) done on the side and then result (once ready) swiftly
 indicated on GUI.

I'm not entirely sure what you mean be "rendering" but most GUI systems are not thread safe and all GUI updates need to happen in the same thread.

back to UI thread a reference to the buffer with pixels). This technique been in use for decades. Imagine drawing some large intricate fractal it could easily take few seconds. -- Dmitry Olshansky
Aug 03 2012
parent Jacob Carlborg <doob me.com> writes:
On 2012-08-03 18:49, Dmitry Olshansky wrote:

 Draw thing to an off screen  bitmap  then blit it to window (aye, pass
 back to UI thread a reference to the buffer with pixels). This technique
 been in use for decades. Imagine drawing some large intricate fractal it
 could easily take few seconds.

Ok, now I see. -- /Jacob Carlborg
Aug 03 2012
prev sibling parent Jacob Carlborg <doob me.com> writes:
On 2012-08-03 00:10, Walter Bright wrote:

 The rendering code should be in yet a third thread.

Most GUI systems are not thread safe. I know for sure that Cocoa on Mac OS X is not. All the changes to the GUI needs to happen in the same thread. But you can usually post a message from another thread to the GUI thread if you want to update the GUI from another thread.
 An editor I wrote years ago had the rendering code in a separate thread
 from user input. You never had to wait to type in commands, the
 rendering would catch up when it could. What was also effective was the
 rendering would abandon a render midstream and restart it if it detected
 that the underlying data had changed in the meantime. This meant that
 the display was never more than one render out of date.

 Although the code itself wasn't any faster, it certainly *felt* faster
 with this approach. It made for crisp editing even on a pig slow machine.

But if you type something and you don't see any new characters that will feel slow. Regardless of the application received the user input or not. The user doesn't care what's happening in the background of the application, it only cares about what it can see. I've used many editors/IDEs where the GUI locks, for whatever reason, you continue typing and suddenly you get 5 characters at once. To me, that would indicate the the application do receive the user input but it cannot render it fast enough, for whatever reason. -- /Jacob Carlborg
Aug 02 2012
prev sibling parent Jacob Carlborg <doob me.com> writes:
On 2012-08-02 22:54, Andrej Mitrovic wrote:

 It can do that immediately for the text that's visible in the window
 because ~100 lines of text can be lexed pretty damn instantly. As soon
 as that's done the GUI should be responsive and the rest of the text
 buffer should be lexed in the background.

That should work if the lexer can handle it. -- /Jacob Carlborg
Aug 02 2012
prev sibling next sibling parent "Jonathan M Davis" <jmdavisProg gmx.com> writes:
On Thursday, August 02, 2012 11:06:39 Walter Bright wrote:
 On 8/2/2012 4:47 AM, deadalnix wrote:
 Le 02/08/2012 10:13, Walter Bright a écrit :
 As fast as the dmd one would be best.

That'd be great but . . . lexer really isn't the performance bottleneck of dmd (or any compiler of a non trivial language). Additionally, anybody that have touched dmd source code can agree that usability/maintainability isn't as its best. Sacrificing some perfs in a non bottleneck area to increase ease of use make perfect sense.

A lexer is like a device driver, or a memory allocator. Having it be as fast as possible pays off dividends endlessly for whoever uses it. And in my profiling, lexing speed is a big factor in the debug build times. The Digital Mars C++ compiler (and its antecedents) made a lot of hay by being the fastest compiler available by a wide margin. A big advantage D has is also compile speed. Please do not underestimate its importance.

Recent benchmarks by Iain Buclaw show that lexing is not a bottleneck in dmd, so it's natural for some folks to think that the lexing is not all that critical in comparison to other compiler components. However, the reason that lexing is not a bottleneck is almost ceratinly because the lexer is so thoroughly optimized. So, if it _didn't_ try and be as efficient as possible, it _would_ be a bottleneck. However, given that we're providing an API for a lexer rather than having the lexer be an internal component of a compiler, that _does_ put more pressure on making the API user-friendly. Hopefully that can be done without sacrificing any performance, but there may be a point where the only way to make it faster would be to make the API very unfriendly. If that's the case, then we're going to have some tough decisions to make. But we'll cross that bridge if/when we get to it. - Jonathan M Davis
Aug 02 2012
prev sibling parent deadalnix <deadalnix gmail.com> writes:
Le 02/08/2012 20:14, Marco Leise a écrit :
 Am Thu, 02 Aug 2012 14:26:58 +0200
 schrieb "Adam D. Ruppe"<destructionator gmail.com>:

 On Thursday, 2 August 2012 at 11:47:20 UTC, deadalnix wrote:
 lexer really isn't the performance bottleneck of dmd (or any
 compiler of a non trivial language).

What if we're just using this lexer in something like a syntax highlighting text editor? I'd be annoyed if it stopped typing for a little bit cuz of processing.

My pet peeve for editors is how to make the lexer work on only what I just edited. It is really not trivial, but eased for example by editors that insert closing } automatically, so the scopes don't mess up. A missing curly bracket results in the parser reinterpreting the whole file. The same goes for string terminators: If I type `` at the top of std.datetime in Mono-D right after "module std.datetime;", I have a 12 (!) seconds wait, till the IDE becomes responsive again. But if I use ' or ", then the editor inserts the second terminator and is done with reparsing in 1-2 seconds. Can we really have a good lexer that is equally well suited for compilers as for editors ?

A common solution to that problem is to cancel the computation when user type something again.
Aug 03 2012
prev sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 8/2/2012 12:43 AM, Jonathan M Davis wrote:
 It is for ranges in general. In the general case, a range of UTF-8 or UTF-16
 makes no sense whatsoever. Having range-based functions which understand the
 encodings and optimize accordingly can be very beneficial (which happens with
 strings but can't happen with general ranges without the concept of a
 variably-length encoded range like we have with forward range or random access
 range), but to actually have a range of UTF-8 or UTF-16 just wouldn't work.
 Range-based functions operate on elements, and doing stuff like filter or map
or
 reduce on code units doesn't make any sense at all.

Yes, it can work.
Aug 02 2012
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 8/2/2012 1:38 AM, Jonathan M Davis wrote:
 On Thursday, August 02, 2012 01:14:30 Walter Bright wrote:
 On 8/2/2012 12:43 AM, Jonathan M Davis wrote:
 It is for ranges in general. In the general case, a range of UTF-8 or
 UTF-16 makes no sense whatsoever. Having range-based functions which
 understand the encodings and optimize accordingly can be very beneficial
 (which happens with strings but can't happen with general ranges without
 the concept of a variably-length encoded range like we have with forward
 range or random access range), but to actually have a range of UTF-8 or
 UTF-16 just wouldn't work. Range-based functions operate on elements, and
 doing stuff like filter or map or reduce on code units doesn't make any
 sense at all.

Yes, it can work.

How?

Keep a 6 character buffer in your consumer. If you read a char with the high bit set, start filling that buffer and then decode it.
 Do you really think that it makes sense for a function like map or filter to
 operate on individual code units? Because that's what would end up happening
 with a range of code units. Your average, range-based function only makes
 sense with _characters_, not code units. Functions which can operate on ranges
 of code units without screwing up the encoding are a rarity.

Rare or not, they are certainly possible, and the early versions of std.string did just that (although they weren't using ranges, the same techniques apply).
Aug 02 2012
next sibling parent reply deadalnix <deadalnix gmail.com> writes:
Le 02/08/2012 10:44, Walter Bright a écrit :
 On 8/2/2012 1:38 AM, Jonathan M Davis wrote:
 On Thursday, August 02, 2012 01:14:30 Walter Bright wrote:
 On 8/2/2012 12:43 AM, Jonathan M Davis wrote:
 It is for ranges in general. In the general case, a range of UTF-8 or
 UTF-16 makes no sense whatsoever. Having range-based functions which
 understand the encodings and optimize accordingly can be very
 beneficial
 (which happens with strings but can't happen with general ranges
 without
 the concept of a variably-length encoded range like we have with
 forward
 range or random access range), but to actually have a range of UTF-8 or
 UTF-16 just wouldn't work. Range-based functions operate on
 elements, and
 doing stuff like filter or map or reduce on code units doesn't make any
 sense at all.

Yes, it can work.

How?

Keep a 6 character buffer in your consumer. If you read a char with the high bit set, start filling that buffer and then decode it.

How is that different than a manually done range of dchar ?
Aug 02 2012
next sibling parent Walter Bright <newshound2 digitalmars.com> writes:
On 8/2/2012 4:49 AM, deadalnix wrote:
 How is that different than a manually done range of dchar ?

The decoding is rarely necessary, even if non-ascii data is there. However, the range cannot decide if decoding is necessary - the receiver has to, hence the receiver does the decoding.
Aug 02 2012
prev sibling parent "Jonathan M Davis" <jmdavisProg gmx.com> writes:
On Thursday, August 02, 2012 10:58:21 Walter Bright wrote:
 On 8/2/2012 4:49 AM, deadalnix wrote:
 How is that different than a manually done range of dchar ?

The decoding is rarely necessary, even if non-ascii data is there. However, the range cannot decide if decoding is necessary - the receiver has to, hence the receiver does the decoding.

Hence why special-casing must be used to deal with variably-length encodings like UTF-8. If it's not something that the range can handle through front and popFront, then it's not going to work as a range. Having it be _possible_ to special-case but not require it allows for the type to works a range but still be efficient of the function operating on the range codes for it. But if you require the function to code for it, then it's not really a range. - Jonathan M Davis
Aug 02 2012
prev sibling next sibling parent reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
On 02-Aug-12 12:44, Walter Bright wrote:
 On 8/2/2012 1:38 AM, Jonathan M Davis wrote:
 On Thursday, August 02, 2012 01:14:30 Walter Bright wrote:
 On 8/2/2012 12:43 AM, Jonathan M Davis wrote:
 It is for ranges in general. In the general case, a range of UTF-8 or
 UTF-16 makes no sense whatsoever. Having range-based functions which
 understand the encodings and optimize accordingly can be very
 beneficial
 (which happens with strings but can't happen with general ranges
 without
 the concept of a variably-length encoded range like we have with
 forward
 range or random access range), but to actually have a range of UTF-8 or
 UTF-16 just wouldn't work. Range-based functions operate on
 elements, and
 doing stuff like filter or map or reduce on code units doesn't make any
 sense at all.

Yes, it can work.

How?

Keep a 6 character buffer in your consumer. If you read a char with the high bit set, start filling that buffer and then decode it.

Since Unicode 5(?) the range of codepoints was defined to be 0...0x10FFFF specifically so that it could be encoded in 4 bytes of UTF-8. P.S. Looks like I'm too late for this party ;) -- Dmitry Olshansky
Aug 02 2012
parent Walter Bright <newshound2 digitalmars.com> writes:
On 8/2/2012 8:46 AM, Dmitry Olshansky wrote:
 Keep a 6 character buffer in your consumer. If you read a char with the
 high bit set, start filling that buffer and then decode it.

Since Unicode 5(?) the range of codepoints was defined to be 0...0x10FFFF specifically so that it could be encoded in 4 bytes of UTF-8.

Yeah, but I thought 6 bytes would future proof it! (Inevitably, the Unicode committee will add more.)
 P.S. Looks like I'm too late for this party ;)

It affects you strongly, too, so I'm glad to see you join in.
Aug 02 2012
prev sibling next sibling parent travert phare.normalesup.org (Christophe Travert) writes:
"Jonathan M Davis" , dans le message (digitalmars.D:174059), a écrit :
 In either case, because the consumer must do something other than simply 
 operate on front, popFront, empty, etc., you're _not_ dealing with the range 
 API but rather working around it.

In some case a range of dchar is useful. In some case a range of char is sufficient, and much more efficient. And for the UTF-aware programer, it makes much sense. The fact that you sometimes have to buffer some information because the meaning of one element is affected by the previous element is a normal issue in many algoritms, it's not working arround anything. Your lexer uses range API, would you say it is just working arroung range because you have to take into account several caracters (let they be dchars) at the same time to know what they are meaning?
Aug 02 2012
prev sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 8/2/2012 1:26 PM, Jonathan M Davis wrote:
 On Thursday, August 02, 2012 01:44:18 Walter Bright wrote:
 On 8/2/2012 1:38 AM, Jonathan M Davis wrote:
 On Thursday, August 02, 2012 01:14:30 Walter Bright wrote:
 On 8/2/2012 12:43 AM, Jonathan M Davis wrote:
 It is for ranges in general. In the general case, a range of UTF-8 or
 UTF-16 makes no sense whatsoever. Having range-based functions which
 understand the encodings and optimize accordingly can be very beneficial
 (which happens with strings but can't happen with general ranges without
 the concept of a variably-length encoded range like we have with forward
 range or random access range), but to actually have a range of UTF-8 or
 UTF-16 just wouldn't work. Range-based functions operate on elements,
 and
 doing stuff like filter or map or reduce on code units doesn't make any
 sense at all.

Yes, it can work.

How?

Keep a 6 character buffer in your consumer. If you read a char with the high bit set, start filling that buffer and then decode it.

And how on earth is that going to work as a range?

1. read a character from the range 2. if the character is the start of a multibyte character, put the character in the buffer 3. keep reading from the range until you've got the whole of the multibyte character 4. convert that 6 (or 4) character buffer into a dchar Remember, its the consumer doing the decoding, not the input range.
 I agree that we should be making string operations more efficient by taking
code
 units into account, but I completely disagree that we can do that generically.

The requirement I listed was that the input range present UTF8 characters. Not any random character type.
Aug 02 2012
next sibling parent Piotr Szturmaj <bncrbme jadamspam.pl> writes:
Walter Bright wrote:
 On 8/2/2012 1:26 PM, Jonathan M Davis wrote:
 On Thursday, August 02, 2012 01:44:18 Walter Bright wrote:
 Keep a 6 character buffer in your consumer. If you read a char with
 the high
 bit set, start filling that buffer and then decode it.

And how on earth is that going to work as a range?

1. read a character from the range 2. if the character is the start of a multibyte character, put the character in the buffer 3. keep reading from the range until you've got the whole of the multibyte character 4. convert that 6 (or 4) character buffer into a dchar

Working example: https://github.com/pszturmaj/json-streaming-parser/blob/master/json.d#L18 :-)
Aug 02 2012
prev sibling next sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 8/2/12 6:38 PM, Jonathan M Davis wrote:
 On Thursday, August 02, 2012 15:14:17 Walter Bright wrote:
 Remember, its the consumer doing the decoding, not the input range.

But that's the problem. The consumer has to treat character ranges specially to make this work. It's not generic.

It is generic! It's just in another dimension: it operates on any range of _bytes_. Andrei
Aug 02 2012
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 8/2/12 6:54 PM, Jonathan M Davis wrote:
 So, a function which does the buffering of code units like Walter suggests is
 generic?

Of course, because it operates on bytes read from memory, files, or sockets etc.
 It's doing something that makes no sense outside of strings.

Right. The bytes represent UTF-8 encoded strings, except their type is ubyte so there's no processing in the library.
 And if
 it's doing that with strings and something else with everything else (which it
 _has_ to do if the same function is going to work with both unicode as well as
 range types that have nothing to do with unicode), then it's special casing
 strings and therefore is _not_ generic.

This is automatically destroyed because its assumption was destroyed.
 Sure, you could have a function which specifically operates on ranges of code
 units and understands how unicode works and is written accordingly, but then
 that function is specific to ranges of code units and is only generic with
 regards to various ranges of code units. It can't operate on generic ranges
 like functions such as map and filter can.

Yes, and I think that's exactly what the doctor prescribed here. Andrei
Aug 02 2012
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 8/2/12 7:18 PM, Jonathan M Davis wrote:
 On Thursday, August 02, 2012 19:06:32 Andrei Alexandrescu wrote:
 Sure, you could have a function which specifically operates on ranges of
 code units and understands how unicode works and is written accordingly,
 but then that function is specific to ranges of code units and is only
 generic with regards to various ranges of code units. It can't operate on
 generic ranges like functions such as map and filter can.

Yes, and I think that's exactly what the doctor prescribed here.

It may be the best approach for the lexer (though I'm not convinced; I'll have to think about it more),

Your insights are always appreciated; even their Cliff notes :o).
 but Walter seems to be arguing that that strings
 should be treated as ranges of code units in general, which I think is
 completely wrong.

I think Walter has very often emphasized the need for the lexer to be faster than the usual client software. My perception is that he's discussing lexer design in understanding there's a need for a less comfortable approach, namely do decoding in client. Andrei
Aug 02 2012
next sibling parent Jonathan M Davis <jmdavisProg gmx.com> writes:
On Thursday, August 02, 2012 19:52:35 Jonathan M Davis wrote:
 I suppose that we could make it operate on code units and just let ranges of
 dchar have UTF-32 as their code unit (since dchar is both a code unit and a
 code point), then ranges of dchar will still work but ranges of char and
 wchar will _also_ work. Hmmm. As I said, I'll have to think this through a
 bit.

LOL. It looks like taking this approach results in almost identical code to what I've been doing. The main difference is that if you're dealing with a range other than a string, you need to use decode instead of front, which means that decode is going to need to work with more than just strings (probably stride too). I'll have to create a pull request for that. But unless you restrict it to strings and ranges of code units which are random access, you still have to worry about stuff like using range[0] vs range.front depending on the type, so my mixin approach is still applicable, and it makes it very easy to switch what I'm doing, since there are very few lines that need to be tweaked. So, I guess that I'll be taking the approach of taking ranges of char, wchar, and dchar and treat them all as ranges of code units. So, it'll work with everything that it worked with before but will now also work with ranges of char and wchar. There's still a performance hit if you do something like passing it filter!"true(source), but there's no way to fix that without disallowing dchar ranges entirely, which would be unnecessarily restrictive. Once you allow arbitrary ranges of char rather than requiring strings, the extra code required to allow ranges of wchar and dchar is trivial. It's stuff like worrying about range[0] vs range.front which complicates things (even if front is a code unit rather than a code point), and using string mixins makes it so that the code with the logic is just as simple as it would be with strings. So, I think that I can continue almost exactly as I have been and still achieve what Walter wants. The main issue that I have (beyond finishing what I haven't gotten to yet) is changing how I handle errors and comments, since I currently have them as tokens, but that shouldn't be terribly hard to fix. - Jonathan M Davis
Aug 02 2012
prev sibling parent Walter Bright <newshound2 digitalmars.com> writes:
On 8/2/2012 4:30 PM, Andrei Alexandrescu wrote:
 I think Walter has very often emphasized the need for the lexer to be faster
 than the usual client software. My perception is that he's discussing lexer
 design in understanding there's a need for a less comfortable approach, namely
 do decoding in client.

+1
Aug 02 2012
prev sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 8/2/2012 3:38 PM, Jonathan M Davis wrote:
 On Thursday, August 02, 2012 15:14:17 Walter Bright wrote:
 Remember, its the consumer doing the decoding, not the input range.

But that's the problem. The consumer has to treat character ranges specially to make this work. It's not generic. If it were generic, then it would simply be using front, popFront, etc. It's going to have to special case strings to do the buffering that you're suggesting. And if you have to special case strings, then how is that any different from what we have now?

No, the consumer can do its own buffering. It only needs a 4 character buffer, worst case.
 If you're arguing that strings should be treated as ranges of code units, then
 pretty much _every_ range-based function will have to special case strings to
 even work correctly - otherwise it'll be operating on individual code points
 rather than code points (e.g. filtering code units rather than code points,
 which would generate an invalid string). This makes the default behavior
 incorrect, forcing _everyone_ to special case strings _everywhere_ if they
 want correct behavior with ranges which are strings. And efficiency means
 nothing if the result is wrong.

No, I'm arguing that the LEXER should accept a UTF8 input range for its input. I am not making a general argument about ranges, characters, or Phobos.
 As it is now, the default behavior of strings with range-based functions is
 correct but inefficient, so at least we get correct code. And if someone wants
 their string processing to be efficient, then they special case strings and do
 things like the buffering that you're suggesting. So, we have correct by
 default with efficiency as an option. The alternative that you seem to be
 suggesting (making strings be treated as ranges of code units) means that it
 would be fast by default but correct as an option, which is completely
 backwards IMHO. Efficiency is important, but it's pointless how efficient
 something is if it's wrong, and expecting that your average programmer is
 going to write unicode-aware code which functions correctly is completely
 unrealistic.

Efficiency for the *lexer* is of *paramount* importance. I don't anticipate std.d.lexer will be implemented by some random newbie, I expect it to be carefully implemented and to do Unicode correctly, regardless of how difficult or easy that may be. I seem to utterly fail at making this point. The same point applies to std.regex - efficiency is terribly, terribly important for it. Everyone judges regexes by their speed, and nobody cares how hard they are to implement to get that speed. To reiterate another point, since we are in the compiler business, people will expect std.d.lexer to be of top quality, not some bag on the side. It needs to be usable as a base for writing a professional quality compiler. It's the reason why I'm pushing much harder on this than I do for other modules.
Aug 02 2012
next sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 8/2/12 10:40 PM, Walter Bright wrote:
 To reiterate another point, since we are in the compiler business,
 people will expect std.d.lexer to be of top quality, not some bag on the
 side. It needs to be usable as a base for writing a professional quality
 compiler. It's the reason why I'm pushing much harder on this than I do
 for other modules.

The lexer must be configurable enough to tokenize other languages than D. I confess I'm very unhappy that there seem to be no less than three people determined to write lexers for D. We're wasting precious talent and resources doubly. Once, several people are working in parallel on the same product. Second, none of them is actually solving the problem that should be solved. Andrei
Aug 02 2012
next sibling parent Timon Gehr <timon.gehr gmx.ch> writes:
On 08/03/2012 05:08 AM, Jonathan M Davis wrote:
 On Thursday, August 02, 2012 23:00:41 Andrei Alexandrescu wrote:
 On 8/2/12 10:40 PM, Walter Bright wrote:
 To reiterate another point, since we are in the compiler business,
 people will expect std.d.lexer to be of top quality, not some bag on the
 side. It needs to be usable as a base for writing a professional quality
 compiler. It's the reason why I'm pushing much harder on this than I do
 for other modules.

The lexer must be configurable enough to tokenize other languages than D. I confess I'm very unhappy that there seem to be no less than three people determined to write lexers for D. We're wasting precious talent and resources doubly. Once, several people are working in parallel on the same product. Second, none of them is actually solving the problem that should be solved.

You're not going to get as fast a lexer if it's not written specifically for D.

This should certainly be possible.
 Writing a generic lexer is a different problem.

It is a more general problem.
Aug 02 2012
prev sibling next sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 8/2/2012 8:00 PM, Andrei Alexandrescu wrote:
 On 8/2/12 10:40 PM, Walter Bright wrote:
 To reiterate another point, since we are in the compiler business,
 people will expect std.d.lexer to be of top quality, not some bag on the
 side. It needs to be usable as a base for writing a professional quality
 compiler. It's the reason why I'm pushing much harder on this than I do
 for other modules.

The lexer must be configurable enough to tokenize other languages than D. I confess I'm very unhappy that there seem to be no less than three people determined to write lexers for D. We're wasting precious talent and resources doubly. Once, several people are working in parallel on the same product. Second, none of them is actually solving the problem that should be solved.

I agree and I hope the three can combine their efforts with the best ideas of each. I don't think the lexer needs to be configurable to other languages. I think it's fine that it be custom for D. However, and this is a big however, its user-facing interface should be usable for other languages, I see no obvious reason why this cannot be.
Aug 02 2012
next sibling parent Jonathan M Davis <jmdavisProg gmx.com> writes:
On Friday, August 03, 2012 05:36:05 Bernard Helyer wrote:
 If the other guys think they've got it, then I can withdraw my
 efforts. I was just thinking I had a lexer just sitting around,
 may as well use it, but if the other guys have it, then I'm fine
 with withdrawing.

I'm a fair ways along already. Making changes according to what Walter wants will slow me down some but not a lot. It's something that I've wanted to do for quite some time but just finally got around to starting a couple of weeks ago. So, I definitely want to complete it regardless of what anyone else is doing. It also gives me an opportunity to make sure that the spec and dmd match (and I've already found a few bugs with the spec and fixed them). Feel free to do whatever you want, but I'm definitely going to complete what I'm working on. - Jonathan M Davis
Aug 02 2012
prev sibling next sibling parent "Bernard Helyer" <b.helyer gmail.com> writes:
On Friday, 3 August 2012 at 03:59:29 UTC, Jonathan M Davis wrote:
 On Friday, August 03, 2012 05:36:05 Bernard Helyer wrote:
 If the other guys think they've got it, then I can withdraw my
 efforts. I was just thinking I had a lexer just sitting around,
 may as well use it, but if the other guys have it, then I'm 
 fine
 with withdrawing.

I'm a fair ways along already. Making changes according to what Walter wants will slow me down some but not a lot. It's something that I've wanted to do for quite some time but just finally got around to starting a couple of weeks ago. So, I definitely want to complete it regardless of what anyone else is doing. It also gives me an opportunity to make sure that the spec and dmd match (and I've already found a few bugs with the spec and fixed them). Feel free to do whatever you want, but I'm definitely going to complete what I'm working on. - Jonathan M Davis

I'll let you get on with it then. I'll amuse myself with the thought of someone asking why SDC doesn't use std.d.lexer or a parser generator. I'll then hit them with my cane, and tell them to get off of my lawn. :P
Aug 02 2012
prev sibling next sibling parent Philippe Sigaud <philippe.sigaud gmail.com> writes:
On Fri, Aug 3, 2012 at 5:59 AM, Jonathan M Davis <jmdavisProg gmx.com> wrote:
 On Friday, August 03, 2012 05:36:05 Bernard Helyer wrote:
 If the other guys think they've got it, then I can withdraw my
 efforts. I was just thinking I had a lexer just sitting around,
 may as well use it, but if the other guys have it, then I'm fine
 with withdrawing.

I'm a fair ways along already. Making changes according to what Walter wants will slow me down some but not a lot. It's something that I've wanted to do for quite some time but just finally got around to starting a couple of weeks ago. So, I definitely want to complete it regardless of what anyone else is doing. It also gives me an opportunity to make sure that the spec and dmd match (and I've already found a few bugs with the spec and fixed them). Feel free to do whatever you want, but I'm definitely going to complete what I'm working on.

Look, many of us here were interested in your idea of having comments and errors lexed as tokens. Could it be possible to just add two static policies errorBehavior { report, skip, ...} and comments { asToken, grouped }? That way, a user just say; // Walter auto toks = Lexer!(comments.grouped)(code); // May be the default value // Other: auto toks = Lexer!(comments.asToken)(code); It's just a small static if away...
Aug 02 2012
prev sibling parent "Jakob Ovrum" <jakobovrum gmail.com> writes:
On Friday, 3 August 2012 at 04:02:34 UTC, Bernard Helyer wrote:
 I'll let you get on with it then. I'll amuse myself with the
 thought of someone asking why SDC doesn't use std.d.lexer or
 a parser generator. I'll then hit them with my cane, and tell
 them to get off of my lawn. :P

I don't think you should give up on this adaption. SDC's lexer is already complete. I don't know if anyone else can claim that. Apart from that, any changes made will almost certainly benefit SDC as well.
Aug 02 2012
prev sibling next sibling parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 8/2/12 11:11 PM, Bernard Helyer wrote:
 On Friday, 3 August 2012 at 03:00:42 UTC, Andrei Alexandrescu wrote:
 The lexer must be configurable enough to tokenize other languages than D.

You're going to have to defend that one.

I wouldn't know how to. To me it's all too obvious it's better to have a tool that allows writing lexers for many languages (and incidentally use it for D), than a tool that can only tokenize D code. Andrei
Aug 02 2012
prev sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 8/2/12 11:08 PM, Jonathan M Davis wrote:
 You're not going to get as fast a lexer if it's not written specifically for D.
 Writing a generic lexer is a different problem. It's also one that needs to be
 solved, but I think that it's a mistake to think that a generic lexer is going
 to be able to be as fast as one specifically optimized for D.

Do you have any evidence to back that up? I mean you're just saying it. Andrei
Aug 02 2012
next sibling parent Timon Gehr <timon.gehr gmx.ch> writes:
On 08/03/2012 05:53 AM, Jonathan M Davis wrote:
 On Thursday, August 02, 2012 23:41:39 Andrei Alexandrescu wrote:
 On 8/2/12 11:08 PM, Jonathan M Davis wrote:
 You're not going to get as fast a lexer if it's not written specifically
 for D. Writing a generic lexer is a different problem. It's also one that
 needs to be solved, but I think that it's a mistake to think that a
 generic lexer is going to be able to be as fast as one specifically
 optimized for D.

Do you have any evidence to back that up? I mean you're just saying it.

Because all of the rules are built directly into the code. You don't have to use regexes or anything like that.

The parts that can be specified with simple regexen are certainly not a problem. A generic string mixin based lexer should be able to generate very close to optimal code by eg. merging common token prefixes and the like.
 Pieces of the lexer could certainly be
 generic or copied over to other lexers just fine, but when you write the lexer
 by hand specifically for D, you can guarantee that it checks exactly what it
 needs to for D without any extra cruft or lost efficiency due to decoding where
 it doesn't need to or checking an additional character at any point or
 anything along those lines.

This is achievable if it is fully generic as well. Just add generic peephole optimizations until the generated lexer is identical to what the hand-written one would have looked like.
 And tuning it is much easier,

Yes.
 because you have control over the whole thing.

The library writer has control over the whole thing in each case.
 Also, given features such as token strings, I
 would think that using a generic lexer on D would be rather difficult anyway.

It would of course need to incorporate custom parsing routine support.
 If someone wants to try and write a generic lexer for D and see if they can
 beat out any hand-written ones,

I'll possibly give it a shot if I can find the time.
 then more power to them, but I don't see how
 you could possibly expect to shave the operations down to the bare minimum
 necessary to get the job done with a generic lexer, whereas a hand-written
 parser can do that given enough effort.

If it is optimal for D lexing and close-optimal or optimal for other languages then it is profoundly more useful than just a D lexer.
Aug 02 2012
prev sibling parent deadalnix <deadalnix gmail.com> writes:
Le 03/08/2012 05:41, Andrei Alexandrescu a écrit :
 On 8/2/12 11:08 PM, Jonathan M Davis wrote:
 You're not going to get as fast a lexer if it's not written
 specifically for D.
 Writing a generic lexer is a different problem. It's also one that
 needs to be
 solved, but I think that it's a mistake to think that a generic lexer
 is going
 to be able to be as fast as one specifically optimized for D.

Do you have any evidence to back that up? I mean you're just saying it. Andrei

I think it can be as fast if the lexer is compile time created. I'd also argue that it is better to have a D lexer ASAP, even if it isn't as configurable as it could be. Better something now that something better later. We can make it better later anyway.
Aug 03 2012
prev sibling parent Jonathan M Davis <jmdavisProg gmx.com> writes:
On Thursday, August 02, 2012 23:00:41 Andrei Alexandrescu wrote:
 On 8/2/12 10:40 PM, Walter Bright wrote:
 To reiterate another point, since we are in the compiler business,
 people will expect std.d.lexer to be of top quality, not some bag on the
 side. It needs to be usable as a base for writing a professional quality
 compiler. It's the reason why I'm pushing much harder on this than I do
 for other modules.

The lexer must be configurable enough to tokenize other languages than D. I confess I'm very unhappy that there seem to be no less than three people determined to write lexers for D. We're wasting precious talent and resources doubly. Once, several people are working in parallel on the same product. Second, none of them is actually solving the problem that should be solved.

You're not going to get as fast a lexer if it's not written specifically for D. Writing a generic lexer is a different problem. It's also one that needs to be solved, but I think that it's a mistake to think that a generic lexer is going to be able to be as fast as one specifically optimized for D. And there are several people already working on the generic stuff (like Phillipe). - Jonathan M Davis
Aug 02 2012
prev sibling next sibling parent "Bernard Helyer" <b.helyer gmail.com> writes:
On Thursday, 2 August 2012 at 07:29:52 UTC, Walter Bright wrote:
 The lexer MUST MUST MUST be FAST FAST FAST. Or it will not be 
 useful. If it isn't fast, serious users will eschew it and will 
 cook up their own. You'll have a nice, pretty, useless toy of 
 std.d.lexer.

If you want to throw out some target times, that would be useful.
Aug 02 2012
prev sibling next sibling parent "Adam D. Ruppe" <destructionator gmail.com> writes:
On Thursday, 2 August 2012 at 11:47:20 UTC, deadalnix wrote:
 lexer really isn't the performance bottleneck of dmd (or any 
 compiler of a non trivial language).

What if we're just using this lexer in something like a syntax highlighting text editor? I'd be annoyed if it stopped typing for a little bit cuz of processing.
Aug 02 2012
prev sibling next sibling parent Andrej Mitrovic <andrej.mitrovich gmail.com> writes:
On 8/2/12, Walter Bright <newshound2 digitalmars.com> wrote:
 A good IDE should do its parsing in a separate thread, so the main user
 input
 thread remains crisp and responsive.

 If the user edits the text while the parsing is in progress, the background

 parsing thread simply abandons the current parse and starts over.

Agreed, this is an editor/IDE domain problem. And there are some obvious optimizations an editor could use, such as only lexing the visible portion of text on the screen (and then lexing the rest in a background thread which doesn't block the interface).
Aug 02 2012
prev sibling next sibling parent Andrej Mitrovic <andrej.mitrovich gmail.com> writes:
On 8/2/12, Dmitry Olshansky <dmitry.olsh gmail.com> wrote:
 In the long run I totally expect millions of lines of D code in the wild.

In a single project or globally? I hope not the former, D should inspire people to write less code that does more. :)
Aug 02 2012
prev sibling next sibling parent Dmitry Olshansky <dmitry.olsh gmail.com> writes:
On 02-Aug-12 08:30, Walter Bright wrote:
 On 8/1/2012 8:04 PM, Jonathan M Davis wrote:
 On Wednesday, August 01, 2012 17:10:07 Walter Bright wrote:
 1. It should accept as input an input range of UTF8. I feel it is a
 mistake
 to templatize it for UTF16 and UTF32. Anyone desiring to feed it UTF16
 should use an 'adapter' range to convert the input to UTF8. (This is
 what
 component programming is all about.)

But that's not how ranges of characters work. They're ranges of dchar. Ranges don't operate on UTF-8 or UTF-16. They operate on UTF-32. You'd have to create special wrappers around string or wstring to have ranges of UTF-8. The way that it's normally done is to have ranges of dchar and then special-case range-based functions for strings. Then the function can operate on any range of dchar but still operates on strings efficiently.

I have argued against making ranges of characters dchars, because of performance reasons. This will especially adversely affect the performance of the lexer. The fact is, files are normally in UTF8 and just about everything else is in UTF8. Prematurely converting to UTF-32 is a performance disaster. Note that the single largest thing holding back regex performance is that premature conversion to dchar and back to char.

Well, it doesn't convert back to UTF-8 as it just slices of the input :) Otherwise very true especially with ctRegex that used to recieve quite some hype even in its present state. 33% of time spent is doing and redoing UTF-8 decoding. (Note that quite some extra work on top of what lexer does is done, e.g. lexer is largely deterministic but regex has some of try-rollback).
 If lexer is required to accept dchar ranges, its performance will drop
 at least in half, and people are going to go reinvent their own lexers.

Yes, it slows things down. Decoding (if any) should kick in only where it's absolutely necessary and be an integral part of lexer automation. -- Dmitry Olshansky
Aug 02 2012
prev sibling parent Andrej Mitrovic <andrej.mitrovich gmail.com> writes:
On 8/2/12, Jacob Carlborg <doob me.com> wrote:
 It still needs to update the editor view with the correct syntax
 highlighting which needs to be done in the same thread as the rest of
 the GUI.

It can do that immediately for the text that's visible in the window because ~100 lines of text can be lexed pretty damn instantly. As soon as that's done the GUI should be responsive and the rest of the text buffer should be lexed in the background.
Aug 02 2012
prev sibling parent reply deadalnix <deadalnix gmail.com> writes:
 7. It should accept a callback delegate for errors. That delegate should
 decide whether to:
      1. ignore the error (and "Lexer" will try to recover and continue)
      2. print an error message (and "Lexer" will try to recover and continue)
 3. throw an exception, "Lexer" is done with that input range

I'm currently treating errors as tokens. It then becomes easy for the code using the lexer to just ignore the errors, to process them immediately, or to put off dealing with them until the lexing is complete. So, the code using the lexer can handle errors however and whenever it likes without having to worry about delegates or exceptions. Since tokens are lexed lazily, the fact that an error is reported as a token doesn't require that the lexing continue, but it also makes it _easy_ to continue lexing, ignoring or saving the error. I'm inclined to think that that's a superior approach to using delegates and exceptions.

Really nice idea. It is still easy to wrap the Range in another Range that process errors in a custom way.
Aug 02 2012
parent Walter Bright <newshound2 digitalmars.com> writes:
On 8/2/2012 4:27 AM, deadalnix wrote:
 Really nice idea. It is still easy to wrap the Range in another Range that
 process errors in a custom way.

It's suboptimal for a high speed lexer/parser, as already explained. Speed is of paramount importance for a lexer, and it overrides things like elegance and simplicity.
Aug 02 2012
prev sibling next sibling parent "Jakob Ovrum" <jakobovrum gmail.com> writes:
On Thursday, 2 August 2012 at 04:00:54 UTC, Walter Bright wrote:
 Doing a class requires a heap allocation per token, which will 
 have disastrous performance consequences.

No it doesn't. That's an implication of using the NewExpression, not classes.
Aug 01 2012
prev sibling next sibling parent Jonathan M Davis <jmdavisProg gmx.com> writes:
On Wednesday, August 01, 2012 21:30:44 Walter Bright wrote:
 On 8/1/2012 8:04 PM, Jonathan M Davis wrote:
 On Wednesday, August 01, 2012 17:10:07 Walter Bright wrote:
 1. It should accept as input an input range of UTF8. I feel it is a
 mistake
 to templatize it for UTF16 and UTF32. Anyone desiring to feed it UTF16
 should use an 'adapter' range to convert the input to UTF8. (This is what
 component programming is all about.)

But that's not how ranges of characters work. They're ranges of dchar. Ranges don't operate on UTF-8 or UTF-16. They operate on UTF-32. You'd have to create special wrappers around string or wstring to have ranges of UTF-8. The way that it's normally done is to have ranges of dchar and then special-case range-based functions for strings. Then the function can operate on any range of dchar but still operates on strings efficiently.

I have argued against making ranges of characters dchars, because of performance reasons. This will especially adversely affect the performance of the lexer. The fact is, files are normally in UTF8 and just about everything else is in UTF8. Prematurely converting to UTF-32 is a performance disaster. Note that the single largest thing holding back regex performance is that premature conversion to dchar and back to char. If lexer is required to accept dchar ranges, its performance will drop at least in half, and people are going to go reinvent their own lexers.

1. The current design of Phobos is to have ranges of dchar, because it fosters code correctness (though it can harm efficiency). It's arguably too late to do otherwise. Certainly, doing otherwise now would break a lot of code. If the lexer tried to operate on UTF-8 as part of its API rather than operating on ranges of dchar and special-casing strings, then it wouldn't fit in with Phobos at all. 2. The lexer does _not_ have to have its performance tank by accepting ranges of dchar. It's true that the performance will be harmed for ranges which _aren't_ strings, but for strings (as would be by far the most common use case) it can be very efficient by special-casing them. And as much as there are potential performance issues with Phobos' choice of treating strings as ranges of dchar, if it were to continue to treat them as ranges of code units, it's pretty much a guarantee that there would be a _ton_ of bugs caused by it. Most programmers have absolutely no clue about how unicode works and don't really want to know. They want strings to just work. Phobos' approach of defaulting to correct but making it possible to make the code faster through specializations is very much in line with D's typical approach of making things safe by default but allowing the programmer to do unsafe things for optimizations when they know what they're doing.
 All identifiers are entered into a hashtable, and are referred to by
 pointers into that hashtable for the rest of dmd. This makes symbol lookups
 incredibly fast, as no string comparisons are done.

Hmmm. Well, I'd still argue that that's a parser thing. Pretty much nothing else will care about it. At most, it should be an optional feature of the lexer. But it certainly could be added that way. - Jonathan M Davis
Aug 01 2012
prev sibling next sibling parent Jonathan M Davis <jmdavisProg gmx.com> writes:
On Wednesday, August 01, 2012 21:47:22 Walter Bright wrote:
 Couldn't that just be part of the range type? A separate lexer type
 shouldn't be necessary.

No, because the user may want to serially present ranges to the same lexer type. Why do that, you might say? That's how dmd works - it means the identifier table will be the same for all source files.

Then just pass the same identifier table to the function which creates the token range. That doesn't require another type. - Jonathan M Davis
Aug 01 2012
prev sibling next sibling parent Jonathan M Davis <jmdavisProg gmx.com> writes:
On Wednesday, August 01, 2012 22:09:05 Walter Bright wrote:
 On 8/1/2012 9:54 PM, Jonathan M Davis wrote:
 Then just pass the same identifier table to the function which creates the
 token range. That doesn't require another type.

You're still going to require another type, otherwise you'll have to duplicate the state in every token allocation, with resultant heavy memory and initialization costs.

Why would you be duplicating any state in the token? The range would have the state, not the token. The token would only have the data specific to the token, not the range as a whole. - Jonathan M Davis
Aug 01 2012
prev sibling next sibling parent Jonathan M Davis <jmdavisProg gmx.com> writes:
On Wednesday, August 01, 2012 22:33:12 Walter Bright wrote:
 The lexer must use char or it will not be acceptable as anything but a toy
 for performance reasons.

Avoiding decoding can be done with strings and operating on ranges of dchar, so you'd be operating almost entirely on ASCII. Are you saying that there's a performance issue aside from decoding?
 Somebody has to convert the input files into dchars, and then back into
 chars. That blows for performance. Think billions and billions of
 characters going through, not just a few random strings.

Why is there any converting to dchar going on here? I don't see why any would be necessary. If you reading in a file as a string or char[] (as would be typical), then you're operating on a string, and then the only time that any decoding will be necessary is when you actually need to operate on a unicode character, which is very rare in D's grammar. It's only when operating on something _other_ than a string that you'd have to actually deal with dchars.
 Hmmm. Well, I'd still argue that that's a parser thing. Pretty much
 nothing
 else will care about it. At most, it should be an optional feature of the
 lexer. But it certainly could be added that way.

I hate to say "trust me on this", but if you don't, have a look at dmd's lexer and how it handles identifiers, then look at dmd's symbol table.

My point is that it's the sort of thing that _only_ a parser would care about. So, unless it _needs_ to be in the lexer for some reason, it shouldn't be. - Jonathan M Davis
Aug 01 2012
prev sibling next sibling parent Jonathan M Davis <jmdavisProg gmx.com> writes:
On Wednesday, August 01, 2012 22:41:01 Walter Bright wrote:
 On 8/1/2012 10:34 PM, Jonathan M Davis wrote:
 On Wednesday, August 01, 2012 22:09:05 Walter Bright wrote:
 On 8/1/2012 9:54 PM, Jonathan M Davis wrote:
 Then just pass the same identifier table to the function which creates
 the
 token range. That doesn't require another type.

You're still going to require another type, otherwise you'll have to duplicate the state in every token allocation, with resultant heavy memory and initialization costs.

Why would you be duplicating any state in the token? The range would have the state, not the token. The token would only have the data specific to the token, not the range as a whole.

I meant two types: the Lexer and the Token. The Lexer can present a range interface.

What I was expecting there to be was a type which was a range of tokens. You passed the source string to a function which returned that range, and you iterated over it to process each token. What you appear to have been arguing for is another type which you get the range from which holds additional state (e.g. the indentifier table). So, you're saying now that you only meant that the token type needs to be separate from the range type (which I would have thought would be the only way to do it), or are you indeed saying that there should be another type that the range comes from and which holds additional state? - Jonathan M Davis
Aug 01 2012
prev sibling next sibling parent reply Jacob Carlborg <doob me.com> writes:
On 2012-08-02 02:10, Walter Bright wrote:

 1. It should accept as input an input range of UTF8. I feel it is a
 mistake to templatize it for UTF16 and UTF32. Anyone desiring to feed it
 UTF16 should use an 'adapter' range to convert the input to UTF8. (This
 is what component programming is all about.)

I'm no expert on ranges but won't that prevent slicing? Slicing is one of the main reasons for why the Tango XML parser is so amazingly fast. -- /Jacob Carlborg
Aug 01 2012
next sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 8/1/2012 11:49 PM, Jacob Carlborg wrote:
 On 2012-08-02 02:10, Walter Bright wrote:

 1. It should accept as input an input range of UTF8. I feel it is a
 mistake to templatize it for UTF16 and UTF32. Anyone desiring to feed it
 UTF16 should use an 'adapter' range to convert the input to UTF8. (This
 is what component programming is all about.)

I'm no expert on ranges but won't that prevent slicing? Slicing is one of the main reasons for why the Tango XML parser is so amazingly fast.

You don't want to use slicing on the lexer. The reason is that your slices will be spread all over memory, as source files can be huge, and all that memory will be retained and never released. What you want is a compact representation after lexing. Compactness also helps a lot with memory caching.
Aug 02 2012
parent reply deadalnix <deadalnix gmail.com> writes:
Le 02/08/2012 09:30, Walter Bright a écrit :
 On 8/1/2012 11:49 PM, Jacob Carlborg wrote:
 On 2012-08-02 02:10, Walter Bright wrote:

 1. It should accept as input an input range of UTF8. I feel it is a
 mistake to templatize it for UTF16 and UTF32. Anyone desiring to feed it
 UTF16 should use an 'adapter' range to convert the input to UTF8. (This
 is what component programming is all about.)

I'm no expert on ranges but won't that prevent slicing? Slicing is one of the main reasons for why the Tango XML parser is so amazingly fast.

You don't want to use slicing on the lexer. The reason is that your slices will be spread all over memory, as source files can be huge, and all that memory will be retained and never released. What you want is a compact representation after lexing. Compactness also helps a lot with memory caching.

Token are not kept in memory. You usually consume them for other processing and throw them away. It isn't an issue.
Aug 02 2012
next sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 8/2/2012 4:52 AM, deadalnix wrote:
 Le 02/08/2012 09:30, Walter Bright a écrit :
 On 8/1/2012 11:49 PM, Jacob Carlborg wrote:
 On 2012-08-02 02:10, Walter Bright wrote:

 1. It should accept as input an input range of UTF8. I feel it is a
 mistake to templatize it for UTF16 and UTF32. Anyone desiring to feed it
 UTF16 should use an 'adapter' range to convert the input to UTF8. (This
 is what component programming is all about.)

I'm no expert on ranges but won't that prevent slicing? Slicing is one of the main reasons for why the Tango XML parser is so amazingly fast.

You don't want to use slicing on the lexer. The reason is that your slices will be spread all over memory, as source files can be huge, and all that memory will be retained and never released. What you want is a compact representation after lexing. Compactness also helps a lot with memory caching.

Token are not kept in memory. You usually consume them for other processing and throw them away. It isn't an issue.

The tokens are not kept, correct. But the identifier strings, and the string literals, are kept, and if they are slices into the input buffer, then everything I said applies.
Aug 02 2012
next sibling parent reply deadalnix <deadalnix gmail.com> writes:
Le 02/08/2012 20:08, Walter Bright a écrit :
 On 8/2/2012 4:52 AM, deadalnix wrote:
 Le 02/08/2012 09:30, Walter Bright a écrit :
 On 8/1/2012 11:49 PM, Jacob Carlborg wrote:
 On 2012-08-02 02:10, Walter Bright wrote:

 1. It should accept as input an input range of UTF8. I feel it is a
 mistake to templatize it for UTF16 and UTF32. Anyone desiring to
 feed it
 UTF16 should use an 'adapter' range to convert the input to UTF8.
 (This
 is what component programming is all about.)

I'm no expert on ranges but won't that prevent slicing? Slicing is one of the main reasons for why the Tango XML parser is so amazingly fast.

You don't want to use slicing on the lexer. The reason is that your slices will be spread all over memory, as source files can be huge, and all that memory will be retained and never released. What you want is a compact representation after lexing. Compactness also helps a lot with memory caching.

Token are not kept in memory. You usually consume them for other processing and throw them away. It isn't an issue.

The tokens are not kept, correct. But the identifier strings, and the string literals, are kept, and if they are slices into the input buffer, then everything I said applies.

Ok, what do you think of that : lexer can have a parameter that tell if it should build a table of token or slice the input. The second is important, for instance for an IDE : lexing will occur often, and you prefer slicing here because you already have the source file in memory anyway. The token always contains as a member a slice. The slice come either from the source or from a memory chunk allocated by the lexer. If the lexer allocate chunks, it will reuse the same memory location for the same string. Considering the following mecanism to compare slice, this will require 2 comparaisons for identifier lexed with that method : if(a.length != b.length) return false; if(a.ptr == b.ptr) return true; // Regular char by char comparison. Is that a suitable option ?
Aug 03 2012
next sibling parent travert phare.normalesup.org (Christophe Travert) writes:
deadalnix , dans le message (digitalmars.D:174155), a écrit :
 The tokens are not kept, correct. But the identifier strings, and the
 string literals, are kept, and if they are slices into the input buffer,
 then everything I said applies.

Ok, what do you think of that : lexer can have a parameter that tell if it should build a table of token or slice the input. The second is important, for instance for an IDE : lexing will occur often, and you prefer slicing here because you already have the source file in memory anyway. The token always contains as a member a slice. The slice come either from the source or from a memory chunk allocated by the lexer.

If I may add, there are several possitilities here: 1- a real slice of the input range 2- a slice of the input range created with .save and takeExactly 3- a slice allocated in GC memory by the lexer 4- a slice of memory owned by the lexer, which is reused for the next token (thus, the next call to popFront invalidates the token). 5- a slice of memory from a lookup table. All are useful in certain situations. #1 is usable for sliceable ranges, and is definitely efficient when you don't have a huge amont of code to parse. #2 is usable for forward ranges. #3 is usable for any range, but I would not recommand it... #4 is usable for any range, #5 is best if you perform complicated operations with the tokens. #1/#2 should not be very hard to code: when you start to lex a new token, you save the range, and when you found the end of the token, you just use takeExactly on the saved ranged. #4 requires to use an internal buffer. That's more code, but you have to do them in a second step if you want to be able to use input range (which you have too). Actually, the buffer may be external, if you use a buffered-range adapter to make a forward range out of an input range. Having an internal buffer may be more efficient. That something that has to be profiled. #3 can be obtained from #4 by map!(x => x.dup). #5 requires one of the previous to be implemented. You need to have a slice saved somewhere before having a look at the look-up table. Therefore, I think #5 should be obtained without a high loss of efficiency by an algorithm external to the lexer. This would probably bring many ways to use the lexer. For example, you can filter-out many tokens that you don't want before building the table, which avoids to have an overfull look-up table if you are only interested in a subset of tokens. #1/#2 with adapter ranges might be the only thing that is required to code, although the API should allow to define #4 and #5, for the user to use the adapters blindly, or if an internal implementation proves to be significantly more efficient. -- Christophe
Aug 03 2012
prev sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 8/3/2012 6:18 AM, deadalnix wrote:
 lexer can have a parameter that tell if it should build a table of token or
 slice the input. The second is important, for instance for an IDE : lexing will
 occur often, and you prefer slicing here because you already have the source
 file in memory anyway.

A string may span multiple lines - IDEs do not store the text as one string.
 If the lexer allocate chunks, it will reuse the same memory location for the
 same string. Considering the following mecanism to compare slice, this will
 require 2 comparaisons for identifier lexed with that method :

 if(a.length != b.length) return false;
 if(a.ptr == b.ptr) return true;
 // Regular char by char comparison.

 Is that a suitable option ?

You're talking about doing for strings what is done for identifiers - returning a unique handle for each. I don't think this works very well for string literals, as there seem to be few duplicates.
Aug 03 2012
parent deadalnix <deadalnix gmail.com> writes:
Le 03/08/2012 21:59, Walter Bright a écrit :
 On 8/3/2012 6:18 AM, deadalnix wrote:
 lexer can have a parameter that tell if it should build a table of
 token or
 slice the input. The second is important, for instance for an IDE :
 lexing will
 occur often, and you prefer slicing here because you already have the
 source
 file in memory anyway.

A string may span multiple lines - IDEs do not store the text as one string.
 If the lexer allocate chunks, it will reuse the same memory location
 for the
 same string. Considering the following mecanism to compare slice, this
 will
 require 2 comparaisons for identifier lexed with that method :

 if(a.length != b.length) return false;
 if(a.ptr == b.ptr) return true;
 // Regular char by char comparison.

 Is that a suitable option ?

You're talking about doing for strings what is done for identifiers - returning a unique handle for each. I don't think this works very well for string literals, as there seem to be few duplicates.

That option have the benefice to allow very fast identifier comparison (like DMD does) but don't impose it. For instance, you could use that trick in a single thread, but another identifier table for another. It allow to avoid completely the problem with multithreading you mention, while keeping most identifiers comparison really fast. It allow also for several allocation scheme for the slice, that fit different needs, as shown by Christophe Travert.
Aug 04 2012
prev sibling parent reply travert phare.normalesup.org (Christophe Travert) writes:
Jonathan M Davis , dans le message (digitalmars.D:174191), a écrit :
 On Thursday, August 02, 2012 11:08:23 Walter Bright wrote:
 The tokens are not kept, correct. But the identifier strings, and the string
 literals, are kept, and if they are slices into the input buffer, then
 everything I said applies.

String literals often _can't_ be slices unless you leave them in their original state rather than giving the version that they translate to (e.g. leaving \&copy; in the string rather than replacing it with its actual, unicode value). And since you're not going to be able to create the literal using whatever type the range is unless it's a string of some variety, that means that the literals often can't be slices, which - depending on the implementation - would make it so that that they can't _ever_ be slices. Identifiers are a different story, since they don't have to be translated at all, but regardless of whether keeping a slice would be better than creating a new string, the identifier table will be far superior, since then you only need one copy of each identifier. So, it ultimately doesn't make sense to use slices in either case even without considering issues like them being spread across memory. The only place that I'd expect a slice in a token is in the string which represents the text which was lexed, and that won't normally be kept around. - Jonathan M Davis

I thought it was not the lexer's job to process litterals. Just split the input in tokens, and provide minimal info: TokenType, line and col along with the representation from the input. That's enough for a syntax highlighting tools for example. Otherwise you'll end up doing complex interpretation and the lexer will not be that efficient. Litteral interpretation can be done in a second step. Do you think doing litteral interpretation separately when you need it would be less efficient? -- Christophe
Aug 04 2012
parent reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
On 04-Aug-12 14:02, Christophe Travert wrote:
 Jonathan M Davis , dans le message (digitalmars.D:174191), a écrit :
 On Thursday, August 02, 2012 11:08:23 Walter Bright wrote:
 The tokens are not kept, correct. But the identifier strings, and the string
 literals, are kept, and if they are slices into the input buffer, then
 everything I said applies.

String literals often _can't_ be slices unless you leave them in their original state rather than giving the version that they translate to (e.g. leaving \&copy; in the string rather than replacing it with its actual, unicode value). And since you're not going to be able to create the literal using whatever type the range is unless it's a string of some variety, that means that the literals often can't be slices, which - depending on the implementation - would make it so that that they can't _ever_ be slices. Identifiers are a different story, since they don't have to be translated at all, but regardless of whether keeping a slice would be better than creating a new string, the identifier table will be far superior, since then you only need one copy of each identifier. So, it ultimately doesn't make sense to use slices in either case even without considering issues like them being spread across memory. The only place that I'd expect a slice in a token is in the string which represents the text which was lexed, and that won't normally be kept around. - Jonathan M Davis

I thought it was not the lexer's job to process litterals. Just split the input in tokens, and provide minimal info: TokenType, line and col along with the representation from the input. That's enough for a syntax highlighting tools for example. Otherwise you'll end up doing complex interpretation and the lexer will not be that efficient. Litteral interpretation can be done in a second step. Do you think doing litteral interpretation separately when you need it would be less efficient?

-- Dmitry Olshansky
Aug 04 2012
parent reply travert phare.normalesup.org (Christophe Travert) writes:
Dmitry Olshansky , dans le message (digitalmars.D:174214), a écrit :
 Most likely - since you re-read the same memory twice to do it.

You're probably right, but if you do this right after the token is generated, the memory should still be near the processor. And the operation on the first read should be very basic: just check nothing illegal appears, and check for the end of the token. The cost is not negligible, but what you do with litteral tokens can vary much, and what the lexer will propose may not be what the user want, so the user may suffer the cost of the litteral decoding (including allocation of the decoded string, the copy of the caracters, etc), that he doesn't want, or will have to re-do his own way... -- Christophe
Aug 04 2012
next sibling parent reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
On 04-Aug-12 14:55, Christophe Travert wrote:
 Dmitry Olshansky , dans le message (digitalmars.D:174214), a écrit :
 Most likely - since you re-read the same memory twice to do it.

You're probably right, but if you do this right after the token is generated, the memory should still be near the processor. And the operation on the first read should be very basic: just check nothing illegal appears, and check for the end of the token.

q{ .. } "\x13\x27 ...\u1212" In most cases it takes around the same time to check correctness and output it as simply pass it by. (see also re-decoding unicode in identifiers, though that's rare to see unicode chars in identifier)
 The cost is not
 negligible, but what you do with litteral tokens can vary much, and what
 the lexer will propose may not be what the user want, so the user may
 suffer the cost of the litteral decoding (including allocation of the
 decoded string, the copy of the caracters, etc), that he doesn't want,
 or will have to re-do his own way...

issues. Just provide a templates with a few hooks, and add a Noop policy that does nothing. -- Dmitry Olshansky
Aug 04 2012
next sibling parent reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
On 04-Aug-12 15:48, Jonathan M Davis wrote:
 On Saturday, August 04, 2012 15:32:22 Dmitry Olshansky wrote:
 I see it as a compile-time policy, that will fit nicely and solve both
 issues. Just provide a templates with a few hooks, and add a Noop policy
 that does nothing.

It's starting to look like figuring out what should and shouldn't be configurable and how to handle it is going to be the largest problem in the lexer...

Let's add some meat to my post. I've seen it mostly as follows: //user defines mixin template that is mixed in inside lexer template MyConfig() { enum identifierTable = true; // means there would be calls to table.insert on each identifier enum countLines = true; //adds line, column properties to the lexer/Tokens //statically bound callbacks, inside one can use say: // skip() - to skip a char (popFront) // get() - to read next char (via popFront, front) // line, col - as readonly properties // (skip & get do the counting if enabled) bool onError() { skip(); //the most dumb recovery, just skip a char return true; //go on with tokenizing, false - stop prematurely } ... } usage: { auto my_supa_table = ...; //some kind of container (should a set on strings and support .insert("blah"); ) auto dlex = Lexer!(MyConfig)(table); auto all_tokens = array(dlex(joiner(stdin.byChunk(4096)))); //or if we had no interest in table but only tokens: auto noop = Lexer!(NoopLex)(); ... } -- Dmitry Olshansky
Aug 04 2012
parent reply deadalnix <deadalnix gmail.com> writes:
Le 04/08/2012 15:45, Dmitry Olshansky a écrit :
 On 04-Aug-12 15:48, Jonathan M Davis wrote:
 On Saturday, August 04, 2012 15:32:22 Dmitry Olshansky wrote:
 I see it as a compile-time policy, that will fit nicely and solve both
 issues. Just provide a templates with a few hooks, and add a Noop policy
 that does nothing.

It's starting to look like figuring out what should and shouldn't be configurable and how to handle it is going to be the largest problem in the lexer...

Let's add some meat to my post. I've seen it mostly as follows: //user defines mixin template that is mixed in inside lexer template MyConfig() { enum identifierTable = true; // means there would be calls to table.insert on each identifier enum countLines = true; //adds line, column properties to the lexer/Tokens //statically bound callbacks, inside one can use say: // skip() - to skip a char (popFront) // get() - to read next char (via popFront, front) // line, col - as readonly properties // (skip & get do the counting if enabled) bool onError() { skip(); //the most dumb recovery, just skip a char return true; //go on with tokenizing, false - stop prematurely } ... } usage: { auto my_supa_table = ...; //some kind of container (should a set on strings and support .insert("blah"); ) auto dlex = Lexer!(MyConfig)(table); auto all_tokens = array(dlex(joiner(stdin.byChunk(4096)))); //or if we had no interest in table but only tokens: auto noop = Lexer!(NoopLex)(); ... }

It seems way too much. The most complex thing that is needed is the policy to allocate identifiers in tokens. It can be made by passing a function that have a string as parameter and a string as return value. The default one would be an identity function. The second parameter is a bool to tokenize comments or not. Is that enough ? The onError look like a typical use case for conditions as explained in the huge thread on Exception.
Aug 06 2012
next sibling parent Jacob Carlborg <doob me.com> writes:
On 2012-08-06 21:00, Philippe Sigaud wrote:

 Yes, well we don't have a condition system. And using exceptions
 during lexing would most probably kill its efficiency.
 Errors in lexing are not uncommon. The usual D idiom of having an enum
 StopOnError { no, yes } should be enough.

Especially when if the lexer is used in an IDE. The code is constantly changing and will be invalid quite often. -- /Jacob Carlborg
Aug 06 2012
prev sibling next sibling parent reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
On 06-Aug-12 22:03, deadalnix wrote:
 Le 04/08/2012 15:45, Dmitry Olshansky a écrit :
 On 04-Aug-12 15:48, Jonathan M Davis wrote:
 On Saturday, August 04, 2012 15:32:22 Dmitry Olshansky wrote:
 I see it as a compile-time policy, that will fit nicely and solve both
 issues. Just provide a templates with a few hooks, and add a Noop
 policy
 that does nothing.

It's starting to look like figuring out what should and shouldn't be configurable and how to handle it is going to be the largest problem in the lexer...

Let's add some meat to my post. I've seen it mostly as follows: //user defines mixin template that is mixed in inside lexer template MyConfig() { enum identifierTable = true; // means there would be calls to table.insert on each identifier enum countLines = true; //adds line, column properties to the lexer/Tokens //statically bound callbacks, inside one can use say: // skip() - to skip a char (popFront) // get() - to read next char (via popFront, front) // line, col - as readonly properties // (skip & get do the counting if enabled) bool onError() { skip(); //the most dumb recovery, just skip a char return true; //go on with tokenizing, false - stop prematurely } ... } usage: { auto my_supa_table = ...; //some kind of container (should a set on strings and support .insert("blah"); ) auto dlex = Lexer!(MyConfig)(table); auto all_tokens = array(dlex(joiner(stdin.byChunk(4096)))); //or if we had no interest in table but only tokens: auto noop = Lexer!(NoopLex)(); ... }

It seems way too much. The most complex thing that is needed is the policy to allocate identifiers in tokens.

Editor that highlights text may choose not to build identifier table at all. One may see it as a safe mode (low resource mode) for more advance IDE.
 The second parameter is a bool to tokenize comments or not. Is that
 enough ?

And doing Tokens as special comment token is frankly bad idea. See Walter's comments in this thread. Also e.g. For compiler only DDoc ones are ever useful, not so for IDE. Filtering them out later is inefficient, as it would be far better not to create them in the first place.
 The onError look like a typical use case for conditions as explained in
 the huge thread on Exception.

mm I lost track of that discussion. Either way I see statically bound function as good enough hook into the process as it can do anything useful: skip wrong chars, throw exception, stop parsing prematurely, whatever - pick your poison. -- Dmitry Olshansky
Aug 06 2012
parent reply Jacob Carlborg <doob me.com> writes:
On 2012-08-06 22:26, Dmitry Olshansky wrote:

 No.

 And doing Tokens as special comment token is frankly bad idea. See
 Walter's comments in this thread.

 Also e.g. For compiler only DDoc ones are ever useful, not so for IDE.
 Filtering them out later is inefficient, as it would be far better not
 to create them in the first place.

The Eclipse plugin Descent can show formatted DDoc comments. -- /Jacob Carlborg
Aug 06 2012
parent Dmitry Olshansky <dmitry.olsh gmail.com> writes:
On 07-Aug-12 01:48, Jacob Carlborg wrote:
 On 2012-08-06 22:26, Dmitry Olshansky wrote:

 No.

 And doing Tokens as special comment token is frankly bad idea. See
 Walter's comments in this thread.

 Also e.g. For compiler only DDoc ones are ever useful, not so for IDE.
 Filtering them out later is inefficient, as it would be far better not
 to create them in the first place.

The Eclipse plugin Descent can show formatted DDoc comments.

to highlight things properly. -- Dmitry Olshansky
Aug 06 2012
prev sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 8/6/2012 12:00 PM, Philippe Sigaud wrote:
 Yes, well we don't have a condition system. And using exceptions
 during lexing would most probably kill its efficiency.
 Errors in lexing are not uncommon. The usual D idiom of having an enum
 StopOnError { no, yes } should be enough.

That's why I suggested supplying a callback delegate to decide what to do with errors (ignore, throw exception, or quit) and have the delegate itself do that. That way, there is no customization of the Lexer required.
Aug 06 2012
next sibling parent reply travert phare.normalesup.org (Christophe Travert) writes:
Walter Bright , dans le message (digitalmars.D:174360), a écrit :
 On 8/6/2012 12:00 PM, Philippe Sigaud wrote:
 Yes, well we don't have a condition system. And using exceptions
 during lexing would most probably kill its efficiency.
 Errors in lexing are not uncommon. The usual D idiom of having an enum
 StopOnError { no, yes } should be enough.

That's why I suggested supplying a callback delegate to decide what to do with errors (ignore, throw exception, or quit) and have the delegate itself do that. That way, there is no customization of the Lexer required.

It may be easier to take into accound few cases (return error token and throwing is enough, so that is a basic static if), than to define a way to integrate a delegate (what should be the delegate's signature, what value to return to query for stopping, how to provide ways to recovers, etc).
Aug 07 2012
next sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 8/7/2012 1:00 AM, Christophe Travert wrote:
 That's why I suggested supplying a callback delegate to decide what to do with
 errors (ignore, throw exception, or quit) and have the delegate itself do that.
 That way, there is no customization of the Lexer required.

It may be easier to take into accound few cases (return error token and throwing is enough, so that is a basic static if), than to define a way to integrate a delegate (what should be the delegate's signature, what value to return to query for stopping, how to provide ways to recovers, etc).

If the delegate returns, then the lexer recovers. The delegate is passed the error message and the location. I don't see it is more complex than that.
Aug 07 2012
parent travert phare.normalesup.org (Christophe Travert) writes:
Walter Bright , dans le message (digitalmars.D:174393), a écrit :
 If the delegate returns, then the lexer recovers.

That's an option, if there is only one way to recover (which is a reasonable assumption). You wanted the delegate to "decide what to do with the errors (ignore, throw exception, or quit)". Throwing is handled, but not ignore/quit. Jonathan's solution (delegate returning a bool) is good. It could also be a delegate returning an int, 0 meaning continue, and any other value being an error code that can be retrieved later. It could also be a number of characters to skip (0 meaning break).
Aug 07 2012
prev sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 8/7/2012 1:14 AM, Jonathan M Davis wrote:
 But you can also configure the lexer to return an error token instead of using
 the delegate if that's what you prefer. But Walter is right in that if you
 have to check every token for whether it's an error, that will incur overhead.
 So, depending on your use case, that could be unacceptable.

It's not just overhead - it's just plain ugly to constantly check for error tokens. It's also tedious and error prone to insert those checks. I don't see any advantage to it.
Aug 07 2012
next sibling parent travert phare.normalesup.org (Christophe Travert) writes:
Walter Bright , dans le message (digitalmars.D:174394), a écrit :
 On 8/7/2012 1:14 AM, Jonathan M Davis wrote:
 But you can also configure the lexer to return an error token instead of using
 the delegate if that's what you prefer. But Walter is right in that if you
 have to check every token for whether it's an error, that will incur overhead.
 So, depending on your use case, that could be unacceptable.

It's not just overhead - it's just plain ugly to constantly check for error tokens. It's also tedious and error prone to insert those checks.

It's not necessarily ugly, because of the powerful range design. You can branch the lexer to a range adapter that just ignore error tokens, or throw when it meats an error token. For example, just use: auto tokens = data.lexer.throwOnErrorToken; I don't think this is more ugly than: auto tokens = data.lexer!(complex signature) { throw LexException; }; But yes, there is overhead, so I understand returning error tokens is not satisfactory for everyone.
 I don't see any advantage to it.

Storing the error somewhere can be of use. For example, you may want to lex a whole file into an array of tokens, and then deal with you errors as you analyse the array of tokens. Of course, you can alway make a delegate to store the error somewhere, but it is easier if this somewhere is in your token pile. What I don't see any advantage is using a delegate that can only return or throw. A policy makes the job: auto tokens = data.lexer!ExceptionPolicy.throwException; That's clean too. If you want the delegate to be of any use, then it must have data to process. That's why I said we have to worry about the signature of the delegate. -- Christophe
Aug 07 2012
prev sibling next sibling parent reply Jacob Carlborg <doob me.com> writes:
On 2012-08-07 12:06, Jonathan M Davis wrote:

 It's easier to see where in the range of tokens the errors occur. A delegate
 is disconnected from the point where the range is being consumed, whereas if
 tokens are used for errors, then the function consuming the range can see
 exactly where in the range of tokens the error is (and potentially handle it
 differently based on that information).

Just pass the same token to the delegate that you would have returned otherwise? -- /Jacob Carlborg
Aug 07 2012
parent travert phare.normalesup.org (Christophe Travert) writes:
Jacob Carlborg , dans le message (digitalmars.D:174421), a écrit :
 On 2012-08-07 12:06, Jonathan M Davis wrote:
 
 It's easier to see where in the range of tokens the errors occur. A delegate
 is disconnected from the point where the range is being consumed, whereas if
 tokens are used for errors, then the function consuming the range can see
 exactly where in the range of tokens the error is (and potentially handle it
 differently based on that information).

Just pass the same token to the delegate that you would have returned otherwise?

That's what I would do. If you have to define a way to return error information as a token, better use it again when using delegates. Personnaly, I would have the delegate be: int delegate(Token); A return value of 0 means: continue parsing. Any other value is an error number and stops the parser (makes it empty). The error number can be retrieved from the empty parser with a specific function. If you want to throw, just throw in the delegate. No need to return a specific value for that. But a bool return value may be enough too...
Aug 07 2012
prev sibling next sibling parent Walter Bright <newshound2 digitalmars.com> writes:
On 8/7/2012 3:06 AM, Jonathan M Davis wrote:
 It's easier to see where in the range of tokens the errors occur. A delegate
 is disconnected from the point where the range is being consumed, whereas if
 tokens are used for errors, then the function consuming the range can see
 exactly where in the range of tokens the error is (and potentially handle it
 differently based on that information).

The delegate has a context pointer giving it a reference to whatever context the code calling the Lexer needs.
Aug 07 2012
prev sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 8/7/2012 7:15 AM, Philippe Sigaud wrote:
 Also, what I proposed was a *static* decision: with SkipErrors { no,
 yes }. With a static if inside its guts, the lexer could change its
 behavior accordingly.

Yes, I understand about static if decisions :-) hell I invented them!
 Walter, with all due respect, you sometimes give the impression to
 forget we are talking about D and go back to deeply entrenched C-isms.

Delegates are not C-isms.
 Compile-time decisions can be used to avoid any overhead as long as
 you have a clear idea of what the two code paths should look like.

Yes, I understand that. There's also a point about adding too much complexity to the interface. The delegate callback reduces complexity in the interface.
 And, as Christophe said, ranges are a powerful API. In another thread
 Simen and me did some comparison between C-like code and code using
 only ranges upon ranges upon ranges. A (limited!) difference in speed
 appeared only for very long calculations.

That's good, and you really don't need to sell me on ranges - I'm already sold.
Aug 07 2012
parent Walter Bright <newshound2 digitalmars.com> writes:
On 8/7/2012 2:14 PM, Jonathan M Davis wrote:
 I expect that the configuration stuff is going to have to be adjusted after I'm
 done, since I'm not sure that it's entirely clear what's worth configuring or
 not.

"When in doubt, leave it out." If experience later shows it is really needed, it is easier to add it in than to have a blizzard of useless configuration options that cannot be removed.
Aug 07 2012
prev sibling parent Jonathan M Davis <jmdavisProg gmx.com> writes:
On Tuesday, August 07, 2012 08:00:24 Christophe Travert wrote:
 Walter Bright , dans le message (digitalmars.D:174360), a =C3=A9crit =

 That's why I suggested supplying a callback delegate to decide what=


 with errors (ignore, throw exception, or quit) and have the delegat=


 itself do that. That way, there is no customization of the Lexer
 required.

It may be easier to take into accound few cases (return error token a=

 throwing is enough, so that is a basic static if), than to define a w=

 to integrate a delegate (what should be the delegate's signature, wha=

 value to return to query for stopping, how to provide ways to recover=

 etc).

For the moment at least, I'm doing this bool delegate(string errorMsg, SourcePos pos) errorHandler; where SourcePos is a struct which holds the line and col of the place i= n the=20 source code where the bad token starts. It it returns true, the token i= s=20 skipped. If it returns false, an exception is thrown - and of course th= e=20 delegate can throw its own exception if it wants to. But you can also configure the lexer to return an error token instead o= f using=20 the delegate if that's what you prefer. But Walter is right in that if = you=20 have to check every token for whether it's an error, that will incur ov= erhead.=20 So, depending on your use case, that could be unacceptable. - Jonathan M Davis
Aug 07 2012
prev sibling parent travert phare.normalesup.org (Christophe Travert) writes:
Jonathan M Davis , dans le message (digitalmars.D:174223), a écrit :
 On Saturday, August 04, 2012 15:32:22 Dmitry Olshansky wrote:
 I see it as a compile-time policy, that will fit nicely and solve both
 issues. Just provide a templates with a few hooks, and add a Noop policy
 that does nothing.

It's starting to look like figuring out what should and shouldn't be configurable and how to handle it is going to be the largest problem in the lexer...

Yes, I figured-out a policy could be used to, but since the begining of the thread, that makes a lot of things to configure! Jonathan would have trouble trying to make them all. Choices have to be made. That's why I proposed to use adapter range to enable to do the buffering instead of slicing, and to build the look up table. Done correctly, it can make the core of the lexer imlementation clean without loosing efficiency (I hope). If this policy for parsing literals if the only thing that remains to be configured directly in the core of the lexer with static if, then it's reasonable.
Aug 04 2012
prev sibling next sibling parent Jonathan M Davis <jmdavisProg gmx.com> writes:
On Saturday, August 04, 2012 15:32:22 Dmitry Olshansky wrote:
 I see it as a compile-time policy, that will fit nicely and solve both
 issues. Just provide a templates with a few hooks, and add a Noop policy
 that does nothing.

It's starting to look like figuring out what should and shouldn't be configurable and how to handle it is going to be the largest problem in the lexer... - Jonathan M Davis
Aug 04 2012
prev sibling next sibling parent "Tobias Pankrath" <tobias pankrath.net> writes:
On Saturday, 4 August 2012 at 11:58:09 UTC, Jonathan M Davis 
wrote:
 On Saturday, August 04, 2012 15:32:22 Dmitry Olshansky wrote:
 I see it as a compile-time policy, that will fit nicely and 
 solve both
 issues. Just provide a templates with a few hooks, and add a 
 Noop policy
 that does nothing.

It's starting to look like figuring out what should and shouldn't be configurable and how to handle it is going to be the largest problem in the lexer... - Jonathan M Davis

If we have a really fast lexer that is highly compile-time configureable and has a readable codebase, then this would be a really good showcase for D.
Aug 04 2012
prev sibling parent Philippe Sigaud <philippe.sigaud gmail.com> writes:
On Mon, Aug 6, 2012 at 8:03 PM, deadalnix <deadalnix gmail.com> wrote:

 The most complex thing that is needed is the policy to allocate identifiers
 in tokens. It can be made by passing a function that have a string as
 parameter and a string as return value. The default one would be an identity
 function.

I think one should pass it an empty symbol table and the lexer should fill it and associate each identifier with a unique ID, ID which would appear in the Identifier token.
 The second parameter is a bool to tokenize comments or not. Is that enough ?

 The onError look like a typical use case for conditions as explained in the
 huge thread on Exception.

Yes, well we don't have a condition system. And using exceptions during lexing would most probably kill its efficiency. Errors in lexing are not uncommon. The usual D idiom of having an enum StopOnError { no, yes } should be enough.
Aug 06 2012
prev sibling parent Jonathan M Davis <jmdavisProg gmx.com> writes:
On Thursday, August 02, 2012 11:08:23 Walter Bright wrote:
 The tokens are not kept, correct. But the identifier strings, and the string
 literals, are kept, and if they are slices into the input buffer, then
 everything I said applies.

String literals often _can't_ be slices unless you leave them in their original state rather than giving the version that they translate to (e.g. leaving \&copy; in the string rather than replacing it with its actual, unicode value). And since you're not going to be able to create the literal using whatever type the range is unless it's a string of some variety, that means that the literals often can't be slices, which - depending on the implementation - would make it so that that they can't _ever_ be slices. Identifiers are a different story, since they don't have to be translated at all, but regardless of whether keeping a slice would be better than creating a new string, the identifier table will be far superior, since then you only need one copy of each identifier. So, it ultimately doesn't make sense to use slices in either case even without considering issues like them being spread across memory. The only place that I'd expect a slice in a token is in the string which represents the text which was lexed, and that won't normally be kept around. - Jonathan M Davis
Aug 03 2012
prev sibling parent "Tobias Pankrath" <tobias pankrath.net> writes:
On Friday, 3 August 2012 at 14:49:55 UTC, 
travert phare.normalesup.org
 If I may add, there are several possitilities here:
  1- a real slice of the input range
  2- a slice of the input range created with .save and 
 takeExactly
  3- a slice allocated in GC memory by the lexer
  4- a slice of memory owned by the lexer, which is reused for 
 the next
 token (thus, the next call to popFront invalidates the token).
  5- a slice of memory from a lookup table.

#6 None. Just provide information of token type and location (i.e. for syntax highlighting).
Aug 03 2012
prev sibling next sibling parent reply "Bernard Helyer" <b.helyer gmail.com> writes:
http://i.imgur.com/oSXTc.png

Posted without comment.
Aug 02 2012
next sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 8/2/2012 12:09 AM, Bernard Helyer wrote:
 http://i.imgur.com/oSXTc.png

 Posted without comment.

I'm afraid I'm mystified as to your point.
Aug 02 2012
parent Walter Bright <newshound2 digitalmars.com> writes:
On 8/2/2012 12:34 AM, Bernard Helyer wrote:
 Just that I'm slaving away over a hot IDE here. :P

Ah! Well, keep on keepin' on, then!
Aug 02 2012
prev sibling parent Chad J <chadjoan __spam.is.bad__gmail.com> writes:
On 08/02/2012 03:09 AM, Bernard Helyer wrote:
 http://i.imgur.com/oSXTc.png

 Posted without comment.

Hell yeah Alexander Brandon.
Aug 04 2012
prev sibling next sibling parent Jonathan M Davis <jmdavisProg gmx.com> writes:
On Wednesday, August 01, 2012 23:39:42 Walter Bright wrote:
 Somebody has to convert the input files into dchars, and then back into
 chars. That blows for performance. Think billions and billions of
 characters going through, not just a few random strings.

Why is there any converting to dchar going on here?

Because your input range is a range of dchar?

I think that we're misunderstanding each other here. A typical, well-written, range-based function which operates on ranges of dchar will use static if or overloads to special-case strings. This means that it will function with any range of dchar, but it _also_ will be as efficient with strings as if it just operated on strings. It won't decode anything in the string unless it has to. So, having a lexer which operates on ranges of dchar does _not_ make string processing less efficient. It just makes it so that it can _also_ operate on ranges of dchar which aren't strings. For instance, my lexer uses this whenever it needs to get at the first character in the range: static if(isNarrowString!R) Unqual!(ElementEncodingType!R) first = range[0]; else dchar first = range.front; I use string mixins to reduce the code duplication that goes with this, but notice that strings and wstrings do _not_ get decoded. Their first code unit is used, and then most everything operates on that code unit. e.g. switch(first) { case '\n': //... case ' ': //... case '+' //... //... default: //... } It's only when the code actually needs to decode the first code point that it does so (e.g. when all of the possible ASCII characters have been exhausted, it needs to check whether the first code point is a unicode alpha character, since that could be the beginning of an identifier). At that point, I end up doing something like static if(isNarrowString!R) dchar front = range.front; else alias first front; or if I need to know the number of code units that make up the code point, I explicitly call decode in the case of a narrow string. In either case, code units are _not_ being converted to dchar unless they absolutely have to be. The only thing that suffers here performance-wise is ranges that aren't actually strings. For instance, filter!"true"(source) would have _much_ worse performance, because it has to decode every character. But without the concept of a variably-lengthed encoded range, we can't really get around that.
 My point is that it's the sort of thing that _only_ a parser would care
 about. So, unless it _needs_ to be in the lexer for some reason, it
 shouldn't be.

it. It has a mapping from identifiers to unique handles. It needs to be there, otherwise the semantic analysis has to scan identifier strings a second time.

Yes. I understand. It has a mapping of pointers to identifiers. My point is that nothing but parsers will need that. From the standpoint of functionality, it's a parser feature, not a lexer feature. So, if it can be done just fine in the parser, then that's where it should be. If on the other hand, it _needs_ to be in the lexer for some reason (e.g. performance), then that's a reason to put it there. It sounds like you're arguing that performance requirements dictate that it be put in the lexer even if nothing but the parser cares about it, in which case, the lexer is indeed where it should be, but if performance isn't affected, then I'd still argue that it should be in the parser. - Jonathan M Davis
Aug 02 2012
prev sibling next sibling parent "Bernard Helyer" <b.helyer gmail.com> writes:
On Thursday, 2 August 2012 at 07:32:59 UTC, Walter Bright wrote:
 On 8/2/2012 12:09 AM, Bernard Helyer wrote:
 http://i.imgur.com/oSXTc.png

 Posted without comment.

I'm afraid I'm mystified as to your point.

Just that I'm slaving away over a hot IDE here. :P
Aug 02 2012
prev sibling next sibling parent Jonathan M Davis <jmdavisProg gmx.com> writes:
On Thursday, August 02, 2012 00:29:09 Walter Bright wrote:
 If we want to be able to operate on ranges of UTF-8 or UTF-16, we need to
 add a concept of variably-length encoded ranges so that it's possible to
 treat them as both their encoding and whatever they represent (e.g. code
 point or grapheme in the case of ranges of code units).

No, this is not necessary.

It is for ranges in general. In the general case, a range of UTF-8 or UTF-16 makes no sense whatsoever. Having range-based functions which understand the encodings and optimize accordingly can be very beneficial (which happens with strings but can't happen with general ranges without the concept of a variably-length encoded range like we have with forward range or random access range), but to actually have a range of UTF-8 or UTF-16 just wouldn't work. Range-based functions operate on elements, and doing stuff like filter or map or reduce on code units doesn't make any sense at all. - Jonathan M Davis
Aug 02 2012
prev sibling next sibling parent dennis luehring <dl.soluz gmx.net> writes:
 10. High speed matters a lot

then add a benchmark "suite" to the list - the lexer should be benchmarked from the very first beginning and it should be designed for multithreading - there is no need for on-the-fly hash-table updating - maybe just one update on each lex threads end
Aug 02 2012
prev sibling next sibling parent Jonathan M Davis <jmdavisProg gmx.com> writes:
On Thursday, August 02, 2012 01:13:04 Walter Bright wrote:
 On 8/2/2012 12:33 AM, Bernard Helyer wrote:
 On Thursday, 2 August 2012 at 07:29:52 UTC, Walter Bright wrote:
 The lexer MUST MUST MUST be FAST FAST FAST. Or it will not be useful. If
 it
 isn't fast, serious users will eschew it and will cook up their own.
 You'll
 have a nice, pretty, useless toy of std.d.lexer.

If you want to throw out some target times, that would be useful.

As fast as the dmd one would be best.

How would we measure that? dmd's lexer is tied to dmd, so how would we test the speed of only its lexer? - Jonathan M Davis
Aug 02 2012
prev sibling next sibling parent Jonathan M Davis <jmdavisProg gmx.com> writes:
On Thursday, August 02, 2012 01:14:30 Walter Bright wrote:
 On 8/2/2012 12:43 AM, Jonathan M Davis wrote:
 It is for ranges in general. In the general case, a range of UTF-8 or
 UTF-16 makes no sense whatsoever. Having range-based functions which
 understand the encodings and optimize accordingly can be very beneficial
 (which happens with strings but can't happen with general ranges without
 the concept of a variably-length encoded range like we have with forward
 range or random access range), but to actually have a range of UTF-8 or
 UTF-16 just wouldn't work. Range-based functions operate on elements, and
 doing stuff like filter or map or reduce on code units doesn't make any
 sense at all.

Yes, it can work.

How? If you operate on a range of code units, then you're operating on individual code units, which almost never makes sense. There are plenty cases where a function which understands the encoding can avoid some of costs associated with decoding and whatnot, but since range-based functions operate on their elements, if the elementse are code units, a range-based function will operate on individual code units with _no_ understanding of the encoding at all. Ranges have no concept of encoding. Do you really think that it makes sense for a function like map or filter to operate on individual code units? Because that's what would end up happening with a range of code units. Your average, range-based function only makes sense with _characters_, not code units. Functions which can operate on ranges of code units without screwing up the encoding are a rarity. Unless a range-based function special cases a range-type which is variably- lengthed encoded (e.g. string), it just isn't going to deal with the encoding properly. Either it operates on the encoding or the actual value, depending on what its element type is. I concur that operating on strings as code units is better from the standpoint of efficiency, but it just doesn't work with a generic function without it having a special case which therefore _isn't_ generic. - Jonathan M Davis
Aug 02 2012
prev sibling next sibling parent reply Piotr Szturmaj <bncrbme jadamspam.pl> writes:
Walter Bright wrote:
 1. It should accept as input an input range of UTF8. I feel it is a
 mistake to templatize it for UTF16 and UTF32. Anyone desiring to feed it
 UTF16 should use an 'adapter' range to convert the input to UTF8. (This
 is what component programming is all about.)

Why it is a mistake? I think Lexer should parse any UTF range and return compatible token's strings. That is it should provide strings for UTF8 input, wstrings for UTF16 input and so on.
Aug 02 2012
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 8/2/2012 2:27 AM, Piotr Szturmaj wrote:
 Walter Bright wrote:
 1. It should accept as input an input range of UTF8. I feel it is a
 mistake to templatize it for UTF16 and UTF32. Anyone desiring to feed it
 UTF16 should use an 'adapter' range to convert the input to UTF8. (This
 is what component programming is all about.)

Why it is a mistake?

Because the lexer is large and it would have to have a lot of special case code inserted here and there to make that work.
 I think Lexer should parse any UTF range and return
 compatible token's strings. That is it should provide strings for UTF8 input,
 wstrings for UTF16 input and so on.

Why? I've never seen any UTF16 or UTF32 D source in the wild. Besides, if it is not templated then it doesn't need to be recompiled by every user of it - it can exist as object code in the library.
Aug 02 2012
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 8/2/12 6:07 AM, Walter Bright wrote:
 Why? I've never seen any UTF16 or UTF32 D source in the wild.

Here's a crazy idea that I'll hang to this one remark. No, two crazy ideas. First, after having read the large back-and-forth Jonathan/Walter in one sitting, it's becoming obvious to me you'll never understand each other on this nontrivial matter through this medium. I suggest you set up a skype/phone call. Once you get past the first 30 seconds of social awkwardness of hearing each other's voice, you'll make fantastic progress in communicating. Regarding the problem at hand, it's becoming painfully obvious to me that the lexer MUST do its own decoding internally. Hence, a very simple thing to do is have the entire lexer only deal with ranges of ubyte. If someone passes a char[], the lexer's front end can simply call s.representation and obtain the underlying ubyte[]. If someone passes some range of char, the lexer uses an adapter (e.g. map()) that casts every char to ubyte, which is a zero-cost operation. Then it uses the same core operating on ranges of ubyte. In the first implementation, the lexer may actually refuse any range of 16-bit or 32-bit elements (wchar[], range of wchar, dchar[], range of dchar). Later on the core may be evolved to handle range of ushort and range of dchar. The front-end would use again representation() against wchar[], cast with range of wchar, and would just pass the dchar[] and range of dchar around. This makes the core simple and efficient (I think Jonathan's use of static if and mixins everywhere, while well-intended, complicates matters without necessity). And as such we have a lexer! Which operates with ranges, just has a simple front-end clarifying that the lexer must do its own decoding. Works? Andrei
Aug 02 2012
parent reply Michel Fortin <michel.fortin michelf.ca> writes:
On 2012-08-02 12:28:03 +0000, Andrei Alexandrescu 
<SeeWebsiteForEmail erdani.org> said:

 Regarding the problem at hand, it's becoming painfully obvious to me 
 that the lexer MUST do its own decoding internally.

That's not a great surprise to me. I hit the same issues when writing my XML parser, which is why I invented functions called frontUnit and popFrontUnit. I'm glad you're realizing this.
 Hence, a very simple thing to do is have the entire lexer only deal 
 with ranges of ubyte. If someone passes a char[], the lexer's front end 
 can simply call s.representation and obtain the underlying ubyte[].

That's ugly, but it could work (assuming s.representation returns the casted range by ref). I still prefer my frontUnit and popFrontUnit approach though. In fact, any parser for which speed is important will have to bypass std.range's clever handling of UTF characters. Dealing simply with ubytes isn't enough, since in some cases you'll want to fire up the UTF decoder. The next issue, which I haven's seen discussed here is that for a parser to be efficient it should operate on buffers. You can make it work with arbitrary ranges, but if you don't have a buffer you can slice when you need to preserve a string, you're going to have to build the string character by character, which is not efficient at all. But then you can only really return slices if the underlying representation is the same as the output representation, and unless your API has a templated output type, you're going to special case a lot of things. After having attempted an XML parser with ranges, I'm not sure parsing using generic ranges can be made very efficient. Automatic conversion to UTF-32 is a nuisance for performance, and if the output needs to return parts of the input, you'll need to create an inefficient special case just to allocate many new strings in the correct format. I wonder how your call with Walter will turn out. -- Michel Fortin michel.fortin michelf.ca http://michelf.ca/
Aug 02 2012
next sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 8/2/12 2:17 PM, Michel Fortin wrote:
 On 2012-08-02 12:28:03 +0000, Andrei Alexandrescu
 <SeeWebsiteForEmail erdani.org> said:
 Hence, a very simple thing to do is have the entire lexer only deal
 with ranges of ubyte. If someone passes a char[], the lexer's front
 end can simply call s.representation and obtain the underlying ubyte[].

That's ugly, but it could work (assuming s.representation returns the casted range by ref). I still prefer my frontUnit and popFrontUnit approach though.

I agree frontUnit and popFrontUnit are more generic because they allow other ranges to define them.
 In fact, any parser for which speed is important will have to bypass
 std.range's clever handling of UTF characters. Dealing simply with
 ubytes isn't enough, since in some cases you'll want to fire up the UTF
 decoder.

 The next issue, which I haven's seen discussed here is that for a parser
 to be efficient it should operate on buffers. You can make it work with
 arbitrary ranges, but if you don't have a buffer you can slice when you
 need to preserve a string, you're going to have to build the string
 character by character, which is not efficient at all. But then you can
 only really return slices if the underlying representation is the same
 as the output representation, and unless your API has a templated output
 type, you're going to special case a lot of things.

I think a BufferedRange could go a long way here.
 After having attempted an XML parser with ranges, I'm not sure parsing
 using generic ranges can be made very efficient. Automatic conversion to
 UTF-32 is a nuisance for performance, and if the output needs to return
 parts of the input, you'll need to create an inefficient special case
 just to allocate many new strings in the correct format.

I'm not so sure, but I'll measure.
 I wonder how your call with Walter will turn out.

What call? Thanks, Andrei
Aug 02 2012
next sibling parent reply Jacob Carlborg <doob me.com> writes:
On 2012-08-02 22:26, Andrei Alexandrescu wrote:
 On 8/2/12 2:17 PM, Michel Fortin wrote:

 I wonder how your call with Walter will turn out.

What call?

The skype call you suggested: "First, after having read the large back-and-forth Jonathan/Walter in one sitting, it's becoming obvious to me you'll never understand each other on this nontrivial matter through this medium. I suggest you set up a skype/phone call. Once you get past the first 30 seconds of social awkwardness of hearing each other's voice, you'll make fantastic progress in communicatin" -- /Jacob Carlborg
Aug 02 2012
parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 8/2/12 4:44 PM, Jacob Carlborg wrote:
 On 2012-08-02 22:26, Andrei Alexandrescu wrote:
 On 8/2/12 2:17 PM, Michel Fortin wrote:

 I wonder how your call with Walter will turn out.

What call?

The skype call you suggested: "First, after having read the large back-and-forth Jonathan/Walter in one sitting, it's becoming obvious to me you'll never understand each other on this nontrivial matter through this medium. I suggest you set up a skype/phone call. Once you get past the first 30 seconds of social awkwardness of hearing each other's voice, you'll make fantastic progress in communicatin"

Oh, ok, I thought I'd be on the call. That would be between Jonathan and Walter. Andrei
Aug 02 2012
prev sibling parent travert phare.normalesup.org (Christophe Travert) writes:
Andrei Alexandrescu , dans le message (digitalmars.D:174060), a écrit :
 I agree frontUnit and popFrontUnit are more generic because they allow 
 other ranges to define them.

Any range of dchar could have a representation (or you may want to call it something else) that returns a range of char (or ubyte). And I think they are more generic because they use a generic API (ie range), that is very powerful: the representation can provide length, slicing, etc... that is different that the dchar length or whatever. You don't want to duplicate all range methods by postfixing Unit...
 I wonder how your call with Walter will turn out.

What call?

You proposed Jonathan to call Walter in an earlier post. I believe there is a misunderstandment.
Aug 02 2012
prev sibling parent Piotr Szturmaj <bncrbme jadamspam.pl> writes:
Michel Fortin wrote:
 The next issue, which I haven's seen discussed here is that for a parser
 to be efficient it should operate on buffers. You can make it work with
 arbitrary ranges, but if you don't have a buffer you can slice when you
 need to preserve a string, you're going to have to build the string
 character by character, which is not efficient at all. But then you can
 only really return slices if the underlying representation is the same
 as the output representation, and unless your API has a templated output
 type, you're going to special case a lot of things.

Instead of returning whole strings, you may provide another sub-range for the string itself. I wrote JSON parser using this approach (https://github.com/pszturmaj/json-streaming-parser) and thanks to that it is possible to parse json without a single heap allocation. This could be also used in XML parser.
Aug 02 2012
prev sibling next sibling parent Marco Leise <Marco.Leise gmx.de> writes:
Am Thu, 02 Aug 2012 14:26:58 +0200
schrieb "Adam D. Ruppe" <destructionator gmail.com>:

 On Thursday, 2 August 2012 at 11:47:20 UTC, deadalnix wrote:
 lexer really isn't the performance bottleneck of dmd (or any 
 compiler of a non trivial language).

What if we're just using this lexer in something like a syntax highlighting text editor? I'd be annoyed if it stopped typing for a little bit cuz of processing.

My pet peeve for editors is how to make the lexer work on only what I just edited. It is really not trivial, but eased for example by editors that insert closing } automatically, so the scopes don't mess up. A missing curly bracket results in the parser reinterpreting the whole file. The same goes for string terminators: If I type `` at the top of std.datetime in Mono-D right after "module std.datetime;", I have a 12 (!) seconds wait, till the IDE becomes responsive again. But if I use ' or ", then the editor inserts the second terminator and is done with reparsing in 1-2 seconds. Can we really have a good lexer that is equally well suited for compilers as for editors ? -- Marco
Aug 02 2012
prev sibling next sibling parent "Jonathan M Davis" <jmdavisProg gmx.com> writes:
On Thursday, August 02, 2012 01:44:18 Walter Bright wrote:
 On 8/2/2012 1:38 AM, Jonathan M Davis wrote:
 On Thursday, August 02, 2012 01:14:30 Walter Bright wrote:
 On 8/2/2012 12:43 AM, Jonathan M Davis wrote:
 It is for ranges in general. In the general case, a range of UTF-8 or
 UTF-16 makes no sense whatsoever. Having range-based functions which
 understand the encodings and optimize accordingly can be very beneficial
 (which happens with strings but can't happen with general ranges without
 the concept of a variably-length encoded range like we have with forward
 range or random access range), but to actually have a range of UTF-8 or
 UTF-16 just wouldn't work. Range-based functions operate on elements,
 and
 doing stuff like filter or map or reduce on code units doesn't make any
 sense at all.

Yes, it can work.

How?

Keep a 6 character buffer in your consumer. If you read a char with the high bit set, start filling that buffer and then decode it.

And how on earth is that going to work as a range? Range-based functions operate on elements. They use empty, front, popFront, etc. If front doesn't return an element that a range-based function can operate on without caring what it is, then that type isn't going to work as a range. If you need the consumer to be doing something special, then that means you need to special case it for that range type. And that's what you're doing when you special- case range-base functions for strings. So, because of how front and popFront work, you either have to have a range of code units or a range of code points. With a range of code units, the element type is a code unit, so any operations that you do will be operating on individual code units, not code points. With a range of code points, any operations being done will operate on code points, which _will_ require decoding as long as you're actually using the range API. You only make strings more efficent by special-casing the function for them such that it understands unicode and will operate on the string in the most efficient way according to how that string's encoding works. You seem to be arguing that we can somehow have a generic range API which operates on strings, and yet what you're saying a function using those ranges must do (e.g. having a buffer of multipl code units) requires that a range- based function operate in a non-generic manner for strings. If the consumer has to do _anything_ special for strings, then what it's doing is non-generic with regards to strings. I agree that we should be making string operations more efficient by taking code units into account, but I completely disagree that we can do that generically. At best, we could add the concept of a variably-length encoded range such that a range-based function could special case them and use the encoding where appropriate, but all that buys us is the ability to special case variably- length encoded ranges _other_ than strings (since we can already special case strings), and I don't think that it's even possible to deal with a variably- length encoded range properly without understanding what the encoding is, in which case, we'd be dealing with special range types which were specifically UTF-8 encoded or UTF-16 encoded, and range-based functions would be special- casing _those_ rather than a generic variably-length encoded range. In either case, because the consumer must do something other than simply operate on front, popFront, empty, etc., you're _not_ dealing with the range API but rather working around it. - Jonathan M Davis
Aug 02 2012
prev sibling next sibling parent =?ISO-8859-1?Q?Jos=E9_Armando_Garc=EDa_Sancio?= <jsancio gmail.com> writes:
On Thu, Aug 2, 2012 at 1:26 PM, Jonathan M Davis <jmdavisProg gmx.com> wrote:
 On Thursday, August 02, 2012 01:44:18 Walter Bright wrote:
 On 8/2/2012 1:38 AM, Jonathan M Davis wrote:
 On Thursday, August 02, 2012 01:14:30 Walter Bright wrote:
 On 8/2/2012 12:43 AM, Jonathan M Davis wrote:
 It is for ranges in general. In the general case, a range of UTF-8 or
 UTF-16 makes no sense whatsoever. Having range-based functions which
 understand the encodings and optimize accordingly can be very beneficial
 (which happens with strings but can't happen with general ranges without
 the concept of a variably-length encoded range like we have with forward
 range or random access range), but to actually have a range of UTF-8 or
 UTF-16 just wouldn't work. Range-based functions operate on elements,
 and
 doing stuff like filter or map or reduce on code units doesn't make any
 sense at all.

Yes, it can work.

How?

Keep a 6 character buffer in your consumer. If you read a char with the high bit set, start filling that buffer and then decode it.

And how on earth is that going to work as a range? Range-based functions operate on elements. They use empty, front, popFront, etc. If front doesn't return an element that a range-based function can operate on without caring what it is, then that type isn't going to work as a range. If you need the consumer to be doing something special, then that means you need to special case it for that range type. And that's what you're doing when you special- case range-base functions for strings.

A little bit off topic but... People have been composing/decorating Streams/Ranges for probably 30 years now. Examples: input stream, output stream, byte stream, char stream, buffered stream, cipher stream, base64 stream, etc. If you need more example. Consider an HTTPS request. At the lowest level you have a byte stream/range. No sane developer wants to deal with HTTPS request at this level so you decorate it with an SSL stream/range. That is still too low level so you decorate this with a char stream/range. Still too low level? Decorate it with a modal line buffered stream/range. We are getting closer but it still not the correct range abstraction so then you need a modal http stream/range. You need the modal part if you want to support http streaming.
Aug 02 2012
prev sibling next sibling parent "Jonathan M Davis" <jmdavisProg gmx.com> writes:
On Thursday, August 02, 2012 15:14:17 Walter Bright wrote:
 Remember, its the consumer doing the decoding, not the input range.

But that's the problem. The consumer has to treat character ranges specially to make this work. It's not generic. If it were generic, then it would simply be using front, popFront, etc. It's going to have to special case strings to do the buffering that you're suggesting. And if you have to special case strings, then how is that any different from what we have now? If you're arguing that strings should be treated as ranges of code units, then pretty much _every_ range-based function will have to special case strings to even work correctly - otherwise it'll be operating on individual code points rather than code points (e.g. filtering code units rather than code points, which would generate an invalid string). This makes the default behavior incorrect, forcing _everyone_ to special case strings _everywhere_ if they want correct behavior with ranges which are strings. And efficiency means nothing if the result is wrong. As it is now, the default behavior of strings with range-based functions is correct but inefficient, so at least we get correct code. And if someone wants their string processing to be efficient, then they special case strings and do things like the buffering that you're suggesting. So, we have correct by default with efficiency as an option. The alternative that you seem to be suggesting (making strings be treated as ranges of code units) means that it would be fast by default but correct as an option, which is completely backwards IMHO. Efficiency is important, but it's pointless how efficient something is if it's wrong, and expecting that your average programmer is going to write unicode-aware code which functions correctly is completely unrealistic. - Jonathan M Davis
Aug 02 2012
prev sibling next sibling parent "Jonathan M Davis" <jmdavisProg gmx.com> writes:
On Thursday, August 02, 2012 18:41:23 Andrei Alexandrescu wrote:
 On 8/2/12 6:38 PM, Jonathan M Davis wrote:
 On Thursday, August 02, 2012 15:14:17 Walter Bright wrote:
 Remember, its the consumer doing the decoding, not the input range.

But that's the problem. The consumer has to treat character ranges specially to make this work. It's not generic.

It is generic! It's just in another dimension: it operates on any range of _bytes_.

So, a function which does the buffering of code units like Walter suggests is generic? It's doing something that makes no sense outside of strings. And if it's doing that with strings and something else with everything else (which it _has_ to do if the same function is going to work with both unicode as well as range types that have nothing to do with unicode), then it's special casing strings and therefore is _not_ generic. Sure, you could have a function which specifically operates on ranges of code units and understands how unicode works and is written accordingly, but then that function is specific to ranges of code units and is only generic with regards to various ranges of code units. It can't operate on generic ranges like functions such as map and filter can. - Jonathan M Davis
Aug 02 2012
prev sibling next sibling parent "Jonathan M Davis" <jmdavisProg gmx.com> writes:
On Thursday, August 02, 2012 19:06:32 Andrei Alexandrescu wrote:
 Sure, you could have a function which specifically operates on ranges of
 code units and understands how unicode works and is written accordingly,
 but then that function is specific to ranges of code units and is only
 generic with regards to various ranges of code units. It can't operate on
 generic ranges like functions such as map and filter can.

Yes, and I think that's exactly what the doctor prescribed here.

It may be the best approach for the lexer (though I'm not convinced; I'll have to think about it more), but Walter seems to be arguing that that strings should be treated as ranges of code units in general, which I think is completely wrong. - Jonathan M Davis
Aug 02 2012
prev sibling next sibling parent "Jonathan M Davis" <jmdavisProg gmx.com> writes:
On Thursday, August 02, 2012 19:30:47 Andrei Alexandrescu wrote:
 On 8/2/12 7:18 PM, Jonathan M Davis wrote:
 Your insights are always appreciated; even their Cliff notes :o).

LOL. Well, I'm not about to decide on the best approach to this without thinking through it more. What I've been doing manages to deal quite nicely with avoiding unnecessary decoding and still allows for the lexing of ranges of dchar which aren't strings (though there's obviously an efficiency hit there), and it really isn't complicated or messy thanks to some basic mixins that I've been using. Switching to operating specifically on code units and not accepting ranges of dchar at all has some serious ramifications, and I have to think through them all before I take a position on that.
 but Walter seems to be arguing that that strings
 should be treated as ranges of code units in general, which I think is
 completely wrong.

I think Walter has very often emphasized the need for the lexer to be faster than the usual client software. My perception is that he's discussing lexer design in understanding there's a need for a less comfortable approach, namely do decoding in client.

That may be, but if he's arguing that strings should _always_ be treated as range of code units - as in all D programs, most of which don't have anything to do with lexers (other than when they're compiled) - then I'm definitely going to object to that, and it's my understanding that that's what he's arguing. But maybe I've misunderstood. I've been arguing that strings should still be treated as ranges of code points and that that does not preclude making the lexer efficiently operate on code units when operating on strings even if it operates on ranges of dchar. I think that whether making the lexer operate on ranges of dchar but specialize on strings is a better approach or making it operate specifically on ranges of code units is a better approach depends on what we want it to be usable with. It should be just as fast with strings in either case, so it becomes a question of how we want to handle ranges which _aren't_ strings. I suppose that we could make it operate on code units and just let ranges of dchar have UTF-32 as their code unit (since dchar is both a code unit and a code point), then ranges of dchar will still work but ranges of char and wchar will _also_ work. Hmmm. As I said, I'll have to think this through a bit. - Jonathan M Davis
Aug 02 2012
prev sibling next sibling parent Jonathan M Davis <jmdavisProg gmx.com> writes:
On Thursday, August 02, 2012 19:40:13 Walter Bright wrote:
 No, I'm arguing that the LEXER should accept a UTF8 input range for its
 input. I am not making a general argument about ranges, characters, or
 Phobos.

I think that this is the main point of misunderstanding then. From your comments, you seemed to me to be arguing for strings being treated as ranges of code units all the time, which really doesn't make sense. - Jonathan M Davis
Aug 02 2012
prev sibling next sibling parent "Bernard Helyer" <b.helyer gmail.com> writes:
On Friday, 3 August 2012 at 03:00:42 UTC, Andrei Alexandrescu 
wrote:
 The lexer must be configurable enough to tokenize other 
 languages than D.

You're going to have to defend that one.
Aug 02 2012
prev sibling next sibling parent "Bernard Helyer" <b.helyer gmail.com> writes:
On Friday, 3 August 2012 at 03:14:14 UTC, Walter Bright wrote:
 On 8/2/2012 8:00 PM, Andrei Alexandrescu wrote:
 On 8/2/12 10:40 PM, Walter Bright wrote:
 To reiterate another point, since we are in the compiler 
 business,
 people will expect std.d.lexer to be of top quality, not some 
 bag on the
 side. It needs to be usable as a base for writing a 
 professional quality
 compiler. It's the reason why I'm pushing much harder on this 
 than I do
 for other modules.

The lexer must be configurable enough to tokenize other languages than D. I confess I'm very unhappy that there seem to be no less than three people determined to write lexers for D. We're wasting precious talent and resources doubly. Once, several people are working in parallel on the same product. Second, none of them is actually solving the problem that should be solved.

I agree and I hope the three can combine their efforts with the best ideas of each.

If the other guys think they've got it, then I can withdraw my efforts. I was just thinking I had a lexer just sitting around, may as well use it, but if the other guys have it, then I'm fine with withdrawing.
Aug 02 2012
prev sibling next sibling parent Jonathan M Davis <jmdavisProg gmx.com> writes:
On Thursday, August 02, 2012 23:41:39 Andrei Alexandrescu wrote:
 On 8/2/12 11:08 PM, Jonathan M Davis wrote:
 You're not going to get as fast a lexer if it's not written specifically
 for D. Writing a generic lexer is a different problem. It's also one that
 needs to be solved, but I think that it's a mistake to think that a
 generic lexer is going to be able to be as fast as one specifically
 optimized for D.

Do you have any evidence to back that up? I mean you're just saying it.

Because all of the rules are built directly into the code. You don't have to use regexes or anything like that. Pieces of the lexer could certainly be generic or copied over to other lexers just fine, but when you write the lexer by hand specifically for D, you can guarantee that it checks exactly what it needs to for D without any extra cruft or lost efficiency due to decoding where it doesn't need to or checking an additional character at any point or anything along those lines. And tuning it is much easier, because you have control over the whole thing. Also, given features such as token strings, I would think that using a generic lexer on D would be rather difficult anyway. If someone wants to try and write a generic lexer for D and see if they can beat out any hand-written ones, then more power to them, but I don't see how you could possibly expect to shave the operations down to the bare minimum necessary to get the job done with a generic lexer, whereas a hand-written parser can do that given enough effort. - Jonathan M Davis
Aug 02 2012
prev sibling next sibling parent Jonathan M Davis <jmdavisProg gmx.com> writes:
On Friday, August 03, 2012 06:02:29 Bernard Helyer wrote:
 I'll let you get on with it then. I'll amuse myself with the
 thought of someone asking why SDC doesn't use std.d.lexer or
 a parser generator. I'll then hit them with my cane, and tell
 them to get off of my lawn. :P

Well, if std.d.lexer is done correctly, then it should at least be very usable by SDC, though whether it would be worth changing SDC to use it, I have no idea. That'll be up to you and the other SDC devs. But when code gets posted, you should definitely chime in on it. - Jonathan M Davis
Aug 02 2012
prev sibling next sibling parent Jonathan M Davis <jmdavisProg gmx.com> writes:
On Friday, August 03, 2012 06:14:08 Timon Gehr wrote:
 If it is optimal for D lexing and close-optimal or optimal for other
 languages then it is profoundly more useful than just a D lexer.

If fully concur that we should have a generic lexer, but unless the generic lexer can be just as fast as a hand-written one (which I seriously question), then I definitely think that we're going to want both, not just a generic one. - Jonathan M Davis
Aug 02 2012
prev sibling next sibling parent Philippe Sigaud <philippe.sigaud gmail.com> writes:
On Fri, Aug 3, 2012 at 6:14 AM, Timon Gehr <timon.gehr gmx.ch> wrote:

 If someone wants to try and write a generic lexer for D and see if they
 can
 beat out any hand-written ones,

I'll possibly give it a shot if I can find the time.

I propose we let him finish std.lexer, test it with Jonathan, benchmark it, etc. Once it's out, we can create a small group interested in getting generic and we can try and decorticate it at our leasure.
Aug 02 2012
prev sibling next sibling parent Jonathan M Davis <jmdavisProg gmx.com> writes:
On Friday, August 03, 2012 07:30:32 Philippe Sigaud wrote:
 Look, many of us here were interested in your idea of having comments
 and errors lexed as tokens.
 Could it be possible to just add two static policies errorBehavior {
 report, skip, ...} and comments { asToken, grouped }?
 That way, a user just say;
 
 // Walter
 auto toks = Lexer!(comments.grouped)(code); // May be the default value
 // Other:
 auto toks = Lexer!(comments.asToken)(code);
 
 It's just a small static if away...

I'm sure it's possible. I just don't know how messy it will be. It depends on how the code is affected by needing to not treat comments and errors as tokens for what Walter wants. I'll see what I can do though. - Jonathan M Davis
Aug 02 2012
prev sibling next sibling parent reply Brad Roberts <braddr puremagic.com> writes:
To help with performance comparisons I ripped dmd's lexer out and got it
building as a few .d files.  It's very crude.
It's got tons of casts (more than the original c++ version).  I attempted no
cleanup or any other change than the
minimum I could to get it to build and run.  Obviously there's tons of room for
cleanup, but that's not the point...
it's just useful as a baseline.

The branch:
    https://github.com/braddr/phobos/tree/dmd_lexer

The commit with the changes:
    https://github.com/braddr/phobos/commit/040540ef3baa38997b15a56be3e9cd9c4bfa51ab

On my desktop (far from idle, it's running 2 of the auto testers), it
consistently takes 0.187s to lex all of the .d
files in phobos.

Later,
Brad

On 8/1/2012 5:10 PM, Walter Bright wrote:
 Given the various proposals for a lexer module for Phobos, I thought I'd share
some characteristics it ought to have.
 
 First of all, it should be suitable for, at a minimum:
 
 1. compilers
 
 2. syntax highlighting editors
 
 3. source code formatters
 
 4. html creation
 
 To that end:
 
 1. It should accept as input an input range of UTF8. I feel it is a mistake to
templatize it for UTF16 and UTF32. Anyone
 desiring to feed it UTF16 should use an 'adapter' range to convert the input
to UTF8. (This is what component
 programming is all about.)
 
 2. It should output an input range of tokens
 
 3. tokens should be values, not classes
 
 4. It should avoid memory allocation as much as possible
 
 5. It should read or write any mutable global state outside of its "Lexer"
 instance
 
 6. A single "Lexer" instance should be able to serially accept input ranges,
sharing and updating one identifier table
 
 7. It should accept a callback delegate for errors. That delegate should
decide whether to:
    1. ignore the error (and "Lexer" will try to recover and continue)
    2. print an error message (and "Lexer" will try to recover and continue)
    3. throw an exception, "Lexer" is done with that input range
 
 8. Lexer should be configurable as to whether it should collect information
about comments and ddoc comments or not
 
 9. Comments and ddoc comments should be attached to the next following token,
they should not themselves be tokens
 
 10. High speed matters a lot
 
 11. Tokens should have begin/end line/column markers, though most of the time
this can be implicitly determined
 
 12. It should come with unittests that, using -cov, show 100% coverage
 
 
 Basically, I don't want anyone to be motivated to do a separate one after
seeing this one.

Aug 05 2012
parent Walter Bright <newshound2 digitalmars.com> writes:
On 8/5/2012 12:59 AM, Brad Roberts wrote:
 To help with performance comparisons I ripped dmd's lexer out and got it
building as a few .d files.  It's very crude.
 It's got tons of casts (more than the original c++ version).  I attempted no
cleanup or any other change than the
 minimum I could to get it to build and run.  Obviously there's tons of room
for cleanup, but that's not the point...
 it's just useful as a baseline.

 The branch:
      https://github.com/braddr/phobos/tree/dmd_lexer

 The commit with the changes:
      https://github.com/braddr/phobos/commit/040540ef3baa38997b15a56be3e9cd9c4bfa51ab

 On my desktop (far from idle, it's running 2 of the auto testers), it
consistently takes 0.187s to lex all of the .d
 files in phobos.

 Later,
 Brad

Thanks, Brad!
Aug 05 2012
prev sibling next sibling parent Jonathan M Davis <jmdavisProg gmx.com> writes:
On Saturday, August 04, 2012 17:45:58 Dmitry Olshansky wrote:
 On 04-Aug-12 15:48, Jonathan M Davis wrote:
 On Saturday, August 04, 2012 15:32:22 Dmitry Olshansky wrote:
 I see it as a compile-time policy, that will fit nicely and solve both
 issues. Just provide a templates with a few hooks, and add a Noop policy
 that does nothing.

It's starting to look like figuring out what should and shouldn't be configurable and how to handle it is going to be the largest problem in the lexer...

Let's add some meat to my post. I've seen it mostly as follows:

It would probably be a bit more user friendly to pass a struct as a template argument (which you can't do in the normal sense, but you can pass it as an alias). Regardless, the problem isn't so much how to provide a configuration as how the configuration options affect the lexer and what it does. I'll figure it out, but it does definitely complicate things. I wasn't originally planning an having anything be configurable. But if we want it to be such that no one will want to write another one (as Walter is looking for), then it's going to need to be configurable enough to make it efficient for all of the common lexing scenarios rather than efficient for one particular scenario. - Jonathan M Davis
Aug 05 2012
prev sibling next sibling parent Ary Manzana <ary esperanto.org.ar> writes:
On 8/1/12 21:10 , Walter Bright wrote:
 8. Lexer should be configurable as to whether it should collect
 information about comments and ddoc comments or not

 9. Comments and ddoc comments should be attached to the next following
 token, they should not themselves be tokens

I believe there should be an option to get comments as tokens. Or otherwise the attached comments should have source location...
Aug 06 2012
prev sibling next sibling parent Jonathan M Davis <jmdavisProg gmx.com> writes:
On Tuesday, August 07, 2012 02:54:42 Walter Bright wrote:
 On 8/7/2012 1:14 AM, Jonathan M Davis wrote:
 But you can also configure the lexer to return an error token instead of
 using the delegate if that's what you prefer. But Walter is right in that
 if you have to check every token for whether it's an error, that will
 incur overhead. So, depending on your use case, that could be
 unacceptable.

It's not just overhead - it's just plain ugly to constantly check for error tokens. It's also tedious and error prone to insert those checks. I don't see any advantage to it.

It's easier to see where in the range of tokens the errors occur. A delegate is disconnected from the point where the range is being consumed, whereas if tokens are used for errors, then the function consuming the range can see exactly where in the range of tokens the error is (and potentially handle it differently based on that information). Regardless, I was asked to keep that option in there by at least one person (Philippe Sigaud IIRC), which is why I didn't just switch over to the delegate entirely. - Jonathan M Davis
Aug 07 2012
prev sibling next sibling parent Philippe Sigaud <philippe.sigaud gmail.com> writes:
On Tue, Aug 7, 2012 at 12:06 PM, Jonathan M Davis <jmdavisProg gmx.com> wrote:

 Regardless, I was asked to keep that option in there by at least one person
 (Philippe Sigaud IIRC), which is why I didn't just switch over to the delegate
 entirely.

IIRC, I was not the only one, as people here interested in coding an IDE asked for it too. A lexer is useful for more than 'just' parsing D afterwards: an IDE could easily color tokens according to their type and an error token is just was is needed to highlight errors. Also, what I proposed was a *static* decision: with SkipErrors { no, yes }. With a static if inside its guts, the lexer could change its behavior accordingly. Make skipError.yes the default and Walter get its speed. It's just that an IDE or another parser could use auto lex = std.lexer.Lexer!(SkipError.no)(input); Walter, with all due respect, you sometimes give the impression to forget we are talking about D and go back to deeply entrenched C-isms. Compile-time decisions can be used to avoid any overhead as long as you have a clear idea of what the two code paths should look like. And, as Christophe said, ranges are a powerful API. In another thread Simen and me did some comparison between C-like code and code using only ranges upon ranges upon ranges. A (limited!) difference in speed appeared only for very long calculations.
Aug 07 2012
prev sibling next sibling parent Philippe Sigaud <philippe.sigaud gmail.com> writes:
On Tue, Aug 7, 2012 at 9:38 PM, Walter Bright
<newshound2 digitalmars.com> wrote:

 Yes, I understand about static if decisions :-) hell I invented them!

And what a wonderful decision that was!
 Yes, I understand that. There's also a point about adding too much
 complexity to the interface. The delegate callback reduces complexity in the
 interface.

OK, then let's let Jonathan work, and we will see how it goes.
 And, as Christophe said, ranges are a powerful API. In another thread
 Simen and me did some comparison between C-like code and code using
 only ranges upon ranges upon ranges. A (limited!) difference in speed
 appeared only for very long calculations.

That's good, and you really don't need to sell me on ranges - I'm already sold.

Well, you gave the impression a bit upstream in this thread that having to filter a token range to eliminate errors was an atrocity (millions of tokens!). As far as I'm concerned, the recent good news was to (re?)discover than complex calls of ranges upon ranges could still be calculated by CTFE. That's really neat.
Aug 07 2012
prev sibling next sibling parent "Jonathan M Davis" <jmdavisProg gmx.com> writes:
On Tuesday, August 07, 2012 12:38:26 Walter Bright wrote:
 Yes, I understand that. There's also a point about adding too much
 complexity to the interface. The delegate callback reduces complexity in
 the interface.

It doesn't really affect much to allow choosing between returning a token and using a delegate, especially if ignoring errors is treated as a separate option rather than simply using a delegate that skips them (which may or may not be beneficial - it's faster without the delegate, but it's actually kind of hard to get lexing errors). What worries me more is stuff like providing a way to have the range calculate the current position itself (as Christophe suggested IIRC) or having it provide an efficient way to determine the number of code units between two ranges so that you can slice the range lexed to put in the Token. Determining the number of code units is easily done with ptr for strings, but for everything else, you generally have to count as code units are consumed, which isn't really an issue for small tokens (especially those like symbols where the length is known without counting) but does add up for arbitrarily long ones such as comments or string literals. So, providing a way to calculate it more efficiently where possible might be desirable, but it's yet another layer of complication, and I don't know that it's actually possible to provide such a function in enough situations for it to be worth providing that functionality. I expect that the configuration stuff is going to have to be adjusted after I'm done, since I'm not sure that it's entirely clear what's worth configuring or not. - Jonathan M Davis
Aug 07 2012
prev sibling parent "Jonathan M Davis" <jmdavisProg gmx.com> writes:
On Tuesday, August 07, 2012 14:30:51 Walter Bright wrote:
 On 8/7/2012 2:14 PM, Jonathan M Davis wrote:
 I expect that the configuration stuff is going to have to be adjusted
 after I'm done, since I'm not sure that it's entirely clear what's worth
 configuring or not.

"When in doubt, leave it out." If experience later shows it is really needed, it is easier to add it in than to have a blizzard of useless configuration options that cannot be removed.

On the whole, I agree, and I'm definitely leaving some of that stuff out at this point. A simplified version of what I have at the moment is struct LexConfig { /+ configuration options +/} struct Lexer(alias config) if(is(typeof(config) == LexConfig)) { auto lex(R)(R range, SourcePos pos = SourcePos.init) {} } The main problem that I have is that if I end up with any delegates which need to take the range being lexed, then either I can't put them in the Lexer like I am with the error handler delegate (forcing you to pass it to the lex function every time that you lex a new range), or I have to templatize Lexer on the range type, which is also undesirable. If it turns out later that we want to add such delegates and we didn't templatize Lexer on the range type, then we'll be forced to make it so that they're arguments to the lex function. But templatizing the Lexer on the range type is ugly enough that it's not really something that I want to do if I can reasonbly avoid it. Neither solution appeals to me. If we don't ever need to add such delegates, then it doesn't really matter, but I'd prefer to have a solution which is appropriately flexible if we _do_ need to add that sort of thing later. I guess that I'll have to keep thinking about it to see if I can come up with a reasonable way to do it without hampering the API by either making it ugly through unneeded flexibility or by making it too simple to be able to reasonably add more functionality later. - Jonathan M Davis
Aug 07 2012