digitalmars.D - std.d.lexer requirements

Walter Bright (34/34) Aug 01 2012 Given the various proposals for a lexer module for Phobos, I thought I'd...

Jakob Ovrum (9/10) Aug 01 2012 I agree with everything but this point. Tokens become quite large

Walter Bright (3/11) Aug 01 2012 Doing a class requires a heap allocation per token, which will have disa...

Jakob Ovrum (3/5) Aug 01 2012 No it doesn't. That's an implication of using the NewExpression,

Walter Bright (2/3) Aug 01 2012 See my other reply to you on this topic.

Bernard Helyer (1/4) Aug 01 2012 I assume you mean 'shouldn't'. :P

Walter Bright (2/5) Aug 01 2012 Good catch.

deadalnix (7/16) Aug 01 2012 I see the lexer as a function that take an range of char as input and

Walter Bright (2/8) Aug 01 2012 Yes, as the lexer will have some state and some configuration parameters...

Jonathan M Davis (9/18) Aug 01 2012 nd give

Walter Bright (4/16) Aug 01 2012 No, because the user may want to serially present ranges to the same lex...

Jonathan M Davis (4/10) Aug 01 2012 Then just pass the same identifier table to the function which creates t...

Walter Bright (13/15) Aug 01 2012 You're still going to require another type, otherwise you'll have to dup...

Jonathan M Davis (5/12) Aug 01 2012 Why would you be duplicating any state in the token? The range would hav...

Walter Bright (2/13) Aug 01 2012 I meant two types: the Lexer and the Token. The Lexer can present a rang...

Jonathan M Davis (11/29) Aug 01 2012 What I was expecting there to be was a type which was a range of tokens...

Walter Bright (2/11) Aug 01 2012 You could do it either way.

Jonathan M Davis (5/15) Aug 01 2012 If you end up with enough pieces of state that you need to carry around

H. S. Teoh (12/26) Aug 01 2012 [...]

Walter Bright (4/10) Aug 01 2012 Yup. I keep trying to think of a way to lex multiple files at the same t...

deadalnix (4/14) Aug 02 2012 That was exactly my reaction to the � let's reuse the identifier table �...
Jason House (5/20) Aug 06 2012 The following is an incredibly fast multithreaded hash table. It

Walter Bright (3/6) Aug 07 2012 It might if I understood it! There do seem to be some cases where fences...

Jonathan M Davis (55/80) Aug 01 2012 But that's not how ranges of characters work. They're ranges of dchar. R...

Walter Bright (27/108) Aug 01 2012 I have argued against making ranges of characters dchars, because of per...

Jonathan M Davis (24/54) Aug 01 2012 1. The current design of Phobos is to have ranges of dchar, because it f...

Walter Bright (19/44) Aug 01 2012 The lexer must use char or it will not be acceptable as anything but a t...

Jonathan M Davis (13/25) Aug 01 2012 Avoiding decoding can be done with strings and operating on ranges of dc...

Walter Bright (15/40) Aug 01 2012 1. Encoding it into a dchar is a performance problem. Source code sits i...

Jacob Carlborg (13/16) Aug 01 2012 Your first requirement is a bit strange and confusing:

Walter Bright (2/11) Aug 02 2012 I answered this point a few posts up in the thread.

Jacob Carlborg (7/8) Aug 02 2012 I've read a few posts up and the only answer I found is that the lexer

Walter Bright (2/8) Aug 02 2012 You can use an adapter range.

Jonathan M Davis (54/68) Aug 02 2012 I think that we're misunderstanding each other here. A typical, well-wri...

Walter Bright (16/42) Aug 02 2012 It *still* must convert UTF8 to dchars before presenting them to the con...

Jacob Carlborg (4/19) Aug 01 2012 I think you two are saying the same things.

Jonathan M Davis (16/26) Aug 01 2012 Another thing that I should point out is that a range of UTF-8 or UTF-16...

Walter Bright (10/24) Aug 02 2012 My experience in writing fast string based code that worked on UTF8 and

Bernard Helyer (2/6) Aug 02 2012 If you want to throw out some target times, that would be useful.

Walter Bright (2/7) Aug 02 2012 As fast as the dmd one would be best.

Jonathan M Davis (4/15) Aug 02 2012 How would we measure that? dmd's lexer is tied to dmd, so how would we t...

Walter Bright (2/4) Aug 02 2012 Easy. Just make a special version of dmd that lexes only, and time it.

Ed McCardell (9/14) Aug 03 2012 I made a lexing-only version of dmd at

dennis luehring (6/21) Aug 03 2012 Walter mentioned an easier way - without "forking" dmd - should be

Walter Bright (2/4) Aug 03 2012 Forking it is fine. This is just for a one-off benchmarking thing.

dennis luehring (3/11) Aug 02 2012 would it be (easily) possible to "extract" the dmd lexer code (and
dennis luehring (3/11) Aug 02 2012 can the dmd lexer seperated as an lib and become useable from outside -
deadalnix (7/17) Aug 02 2012 That'd be great but . . .

Adam D. Ruppe (4/6) Aug 02 2012 What if we're just using this lexer in something like a

Marco Leise (8/15) Aug 02 2012 My pet peeve for editors is how to make the lexer work on only what I ju...

Walter Bright (5/9) Aug 02 2012 A good IDE should do its parsing in a separate thread, so the main user ...

Andrej Mitrovic (5/10) Aug 02 2012 Agreed, this is an editor/IDE domain problem. And there are some
Jacob Carlborg (6/11) Aug 02 2012 It still needs to update the editor view with the correct syntax

Andrej Mitrovic (5/8) Aug 02 2012 It can do that immediately for the text that's visible in the window

Jacob Carlborg (4/8) Aug 02 2012 That should work if the lexer can handle it.

Walter Bright (10/19) Aug 02 2012 The rendering code should be in yet a third thread.

Dmitry Olshansky (11/34) Aug 02 2012 OT:

Jacob Carlborg (5/11) Aug 02 2012 I'm not entirely sure what you mean be "rendering" but most GUI systems

Dmitry Olshansky (7/16) Aug 03 2012 Draw thing to an off screen bitmap then blit it to window (aye, pass

Jacob Carlborg (4/8) Aug 03 2012 Ok, now I see.

Jacob Carlborg (15/24) Aug 02 2012 Most GUI systems are not thread safe. I know for sure that Cocoa on Mac

deadalnix (3/16) Aug 03 2012 A common solution to that problem is to cancel the computation when user...

Walter Bright (8/17) Aug 02 2012 A lexer is like a device driver, or a memory allocator. Having it be as ...

Dmitry Olshansky (9/32) Aug 02 2012 +1 Another point is that it's crystal clear *how* to optimize lexer, and...

Andrej Mitrovic (3/4) Aug 02 2012 In a single project or globally? I hope not the former, D should
Walter Bright (5/7) Aug 02 2012 I'm less sure how well known it is. DMC++ was known for being *several t...

Jacob Carlborg (6/9) Aug 02 2012 I'm not sure how easy it would be to just measure the lexing phase of

dennis luehring (5/12) Aug 02 2012 wouldn't it be better to extract the lexer part of dmd into its own

Jacob Carlborg (4/8) Aug 02 2012 That was what I meant. If it's easy someone would have already done it.

Walter Bright (3/5) Aug 02 2012 You don't need to extract it to measure it. Just have it lex the source ...

Jacob Carlborg (11/13) Aug 03 2012 Well, that's the problem. It's not like DMD has a single "lex" function

Walter Bright (2/13) Aug 03 2012 Look in doc.c at highlightCode2() for how to call the lexer by itself.

Jacob Carlborg (10/11) Aug 03 2012 So basically:

Walter Bright (2/11) Aug 03 2012 Pretty much. Or just call lex.nextToken() until you get a TOKeof.

Jonathan M Davis (15/38) Aug 02 2012 Recent benchmarks by Iain Buclaw show that lexing is not a bottleneck in...

Jonathan M Davis (10/16) Aug 02 2012 It is for ranges in general. In the general case, a range of UTF-8 or UT...

Walter Bright (2/10) Aug 02 2012 Yes, it can work.

Jonathan M Davis (21/33) Aug 02 2012 How? If you operate on a range of code units, then you're operating on

Walter Bright (5/24) Aug 02 2012 Keep a 6 character buffer in your consumer. If you read a char with the ...

deadalnix (2/24) Aug 02 2012 How is that different than a manually done range of dchar ?

Walter Bright (4/5) Aug 02 2012 The decoding is rarely necessary, even if non-ascii data is there. Howev...

Jonathan M Davis (8/14) Aug 02 2012 Hence why special-casing must be used to deal with variably-length encod...

Dmitry Olshansky (7/29) Aug 02 2012 4 bytes is enough.

Walter Bright (4/11) Aug 02 2012 Yeah, but I thought 6 bytes would future proof it! (Inevitably, the Unic...

Jonathan M Davis (38/58) Aug 02 2012 And how on earth is that going to work as a range? Range-based functions...

travert phare.normalesup.org (Christophe Travert) (10/13) Aug 02 2012 In some case a range of dchar is useful. In some case a range of char is...
Walter Bright (9/33) Aug 02 2012 1. read a character from the range

Jonathan M Davis (26/27) Aug 02 2012 But that's the problem. The consumer has to treat character ranges speci...

Andrei Alexandrescu (4/8) Aug 02 2012 It is generic! It's just in another dimension: it operates on any range

Jonathan M Davis (13/22) Aug 02 2012 So, a function which does the buffering of code units like Walter sugges...

Andrei Alexandrescu (8/21) Aug 02 2012 Of course, because it operates on bytes read from memory, files, or

Jonathan M Davis (6/13) Aug 02 2012 It may be the best approach for the lexer (though I'm not convinced; I'l...

Andrei Alexandrescu (7/20) Aug 02 2012 I think Walter has very often emphasized the need for the lexer to be

Jonathan M Davis (27/37) Aug 02 2012 LOL. Well, I'm not about to decide on the best approach to this without
Jonathan M Davis (28/33) Aug 02 2012 LOL. It looks like taking this approach results in almost identical code...
Walter Bright (2/6) Aug 02 2012 +1

Walter Bright (17/43) Aug 02 2012 No, the consumer can do its own buffering. It only needs a 4 character b...

Jonathan M Davis (5/8) Aug 02 2012 I think that this is the main point of misunderstanding then. From your
Andrei Alexandrescu (8/13) Aug 02 2012 The lexer must be configurable enough to tokenize other languages than

Jonathan M Davis (7/20) Aug 02 2012 You're not going to get as fast a lexer if it's not written specifically...

Timon Gehr (3/19) Aug 02 2012 It is a more general problem.
Andrei Alexandrescu (3/7) Aug 02 2012 Do you have any evidence to back that up? I mean you're just saying it.

Jonathan M Davis (16/24) Aug 02 2012 Because all of the rules are built directly into the code. You don't hav...

Timon Gehr (14/41) Aug 02 2012 The parts that can be specified with simple regexen are certainly not a

Jonathan M Davis (5/7) Aug 02 2012 If fully concur that we should have a generic lexer, but unless the gene...
Philippe Sigaud (4/8) Aug 02 2012 I propose we let him finish std.lexer, test it with Jonathan, benchmark ...

deadalnix (6/16) Aug 03 2012 I think it can be as fast if the lexer is compile time created. I'd also...

Bernard Helyer (3/5) Aug 02 2012 You're going to have to defend that one.

Andrei Alexandrescu (5/8) Aug 02 2012 I wouldn't know how to. To me it's all too obvious it's better to have a...

Walter Bright (6/17) Aug 02 2012 I agree and I hope the three can combine their efforts with the best ide...

Bernard Helyer (5/29) Aug 02 2012 If the other guys think they've got it, then I can withdraw my

Jonathan M Davis (10/14) Aug 02 2012 I'm a fair ways along already. Making changes according to what Walter w...

Bernard Helyer (5/27) Aug 02 2012 I'll let you get on with it then. I'll amuse myself with the

Jonathan M Davis (6/10) Aug 02 2012 Well, if std.d.lexer is done correctly, then it should at least be very ...
Jakob Ovrum (5/9) Aug 02 2012 I don't think you should give up on this adaption. SDC's lexer is

Philippe Sigaud (11/24) Aug 02 2012 Look, many of us here were interested in your idea of having comments

Piotr Szturmaj (3/16) Aug 02 2012 Working example:

=?ISO-8859-1?Q?Jos=E9_Armando_Garc=EDa_Sancio?= (13/41) Aug 02 2012 A little bit off topic but...

Jacob Carlborg (7/14) Aug 02 2012 I do understand that the lexer needs to be insanely fast and it needs to...

Walter Bright (2/4) Aug 02 2012 Worst case use an adapter range.

travert phare.normalesup.org (Christophe Travert) (7/14) Aug 02 2012 Yes
Jacob Carlborg (4/5) Aug 02 2012 And that is better than a plain string?

travert phare.normalesup.org (Christophe Travert) (2/8) Aug 02 2012 because its front method does not do any decoding.

Jacob Carlborg (12/20) Aug 02 2012 If it was a plain string you wouldn't use "front". You would handle any,...

travert phare.normalesup.org (Christophe Travert) (16/20) Aug 03 2012 I find it more comfortable to just use

Walter Bright (3/11) Aug 03 2012 No may about it, I wouldn't be happy :-)

Walter Bright (8/14) Aug 03 2012 For example, let's suppose we want to do D syntax highlighting in our ID...

Tobias Pankrath (8/16) Aug 03 2012 Would this be an argument for putting the computation of source

Walter Bright (2/8) Aug 03 2012 Worth thinking about.

Dmitry Olshansky (11/39) Aug 02 2012 Well, it doesn't convert back to UTF-8 as it just slices of the input :)

deadalnix (2/16) Aug 02 2012 Really nice idea. It is still easy to wrap the Range in another Range

Walter Bright (4/6) Aug 02 2012 It's suboptimal for a high speed lexer/parser, as already explained.

Jacob Carlborg (5/9) Aug 01 2012 I'm no expert on ranges but won't that prevent slicing? Slicing is one

Walter Bright (5/12) Aug 02 2012 You don't want to use slicing on the lexer. The reason is that your slic...

deadalnix (4/21) Aug 02 2012 Token are not kept in memory. You usually consume them for other

Walter Bright (4/27) Aug 02 2012 The tokens are not kept, correct. But the identifier strings, and the st...

deadalnix (15/47) Aug 03 2012 Ok, what do you think of that :

travert phare.normalesup.org (Christophe Travert) (39/53) Aug 03 2012 If I may add, there are several possitilities here:

Tobias Pankrath (4/13) Aug 03 2012 #6 None. Just provide information of token type and location

Walter Bright (5/16) Aug 03 2012 You're talking about doing for strings what is done for identifiers - re...

deadalnix (8/32) Aug 04 2012 That option have the benefice to allow very fast identifier comparison

Jonathan M Davis (17/20) Aug 03 2012 String literals often _can't_ be slices unless you leave them in their

travert phare.normalesup.org (Christophe Travert) (10/34) Aug 04 2012 I thought it was not the lexer's job to process litterals. Just split

Dmitry Olshansky (4/36) Aug 04 2012 Most likely - since you re-read the same memory twice to do it.

travert phare.normalesup.org (Christophe Travert) (12/13) Aug 04 2012 You're probably right, but if you do this right after the token is

Dmitry Olshansky (11/23) Aug 04 2012 q{ .. }

Jonathan M Davis (5/8) Aug 04 2012 It's starting to look like figuring out what should and shouldn't be

Tobias Pankrath (5/17) Aug 04 2012 If we have a really fast lexer that is highly compile-time
Dmitry Olshansky (34/41) Aug 04 2012 Let's add some meat to my post.

Jonathan M Davis (12/25) Aug 05 2012 [snip]
deadalnix (9/50) Aug 06 2012 It seems way too much.

Philippe Sigaud (8/15) Aug 06 2012 I think one should pass it an empty symbol table and the lexer should

Jacob Carlborg (5/9) Aug 06 2012 Especially when if the lexer is used in an IDE. The code is constantly
Walter Bright (4/8) Aug 06 2012 That's why I suggested supplying a callback delegate to decide what to d...

travert phare.normalesup.org (Christophe Travert) (6/16) Aug 07 2012 It may be easier to take into accound few cases (return error token and

Jonathan M Davis (25/36) Aug 07 2012 to do

Walter Bright (4/8) Aug 07 2012 It's not just overhead - it's just plain ugly to constantly check for er...

Jonathan M Davis (10/21) Aug 07 2012 It's easier to see where in the range of tokens the errors occur. A dele...

Jacob Carlborg (5/10) Aug 07 2012 Just pass the same token to the delegate that you would have returned

travert phare.normalesup.org (Christophe Travert) (11/21) Aug 07 2012 That's what I would do. If you have to define a way to return error

Walter Bright (3/8) Aug 07 2012 The delegate has a context pointer giving it a reference to whatever con...

travert phare.normalesup.org (Christophe Travert) (24/33) Aug 07 2012 It's not necessarily ugly, because of the powerful range design. You can...
Philippe Sigaud (18/21) Aug 07 2012 IIRC, I was not the only one, as people here interested in coding an

Walter Bright (6/17) Aug 07 2012 Delegates are not C-isms.

Philippe Sigaud (10/20) Aug 07 2012 And what a wonderful decision that was!
Jonathan M Davis (23/26) Aug 07 2012 It doesn't really affect much to allow choosing between returning a toke...

Walter Bright (4/7) Aug 07 2012 "When in doubt, leave it out."

Jonathan M Davis (27/37) Aug 07 2012 On the whole, I agree, and I'm definitely leaving some of that stuff out...

Walter Bright (4/12) Aug 07 2012 If the delegate returns, then the lexer recovers.

travert phare.normalesup.org (Christophe Travert) (10/11) Aug 07 2012 That's an option, if there is only one way to recover (which is a

Dmitry Olshansky (15/77) Aug 06 2012 Editor that highlights text may choose not to build identifier table at

Jacob Carlborg (4/10) Aug 06 2012 The Eclipse plugin Descent can show formatted DDoc comments.

Dmitry Olshansky (5/15) Aug 06 2012 I've meant that IDE may be interested in more then just DDoc, at least

travert phare.normalesup.org (Christophe Travert) (10/18) Aug 04 2012 Yes, I figured-out a policy could be used to, but since the begining of

Bernard Helyer (2/2) Aug 02 2012 http://i.imgur.com/oSXTc.png

Walter Bright (2/4) Aug 02 2012 I'm afraid I'm mystified as to your point.

Bernard Helyer (2/7) Aug 02 2012 Just that I'm slaving away over a hot IDE here. :P

Walter Bright (2/3) Aug 02 2012 Ah! Well, keep on keepin' on, then!

Chad J (2/4) Aug 04 2012 Hell yeah Alexander Brandon.

dennis luehring (5/6) Aug 02 2012 then add a benchmark "suite" to the list - the lexer should be
Piotr Szturmaj (4/8) Aug 02 2012 Why it is a mistake? I think Lexer should parse any UTF range and return...

Walter Bright (6/15) Aug 02 2012 Because the lexer is large and it would have to have a lot of special ca...

Andrei Alexandrescu (29/30) Aug 02 2012 Here's a crazy idea that I'll hang to this one remark. No, two crazy ide...

Michel Fortin (30/35) Aug 02 2012 That's not a great surprise to me. I hit the same issues when writing

Andrei Alexandrescu (8/34) Aug 02 2012 I agree frontUnit and popFrontUnit are more generic because they allow

Jacob Carlborg (10/13) Aug 02 2012 The skype call you suggested:

Andrei Alexandrescu (4/16) Aug 02 2012 Oh, ok, I thought I'd be on the call. That would be between Jonathan and...

travert phare.normalesup.org (Christophe Travert) (9/14) Aug 02 2012 Any range of dchar could have a representation (or you may want to call

Piotr Szturmaj (6/14) Aug 02 2012 Instead of returning whole strings, you may provide another sub-range

Jonathan M Davis (5/17) Aug 02 2012 I'm sure it's possible. I just don't know how messy it will be. It depen...
Brad Roberts (13/59) Aug 05 2012 To help with performance comparisons I ripped dmd's lexer out and got it...

Walter Bright (2/14) Aug 05 2012 Thanks, Brad!

Ary Manzana (3/7) Aug 06 2012 I believe there should be an option to get comments as tokens. Or

Walter Bright <newshound2 digitalmars.com> writes:

Given the various proposals for a lexer module for Phobos, I thought I'd share 
some characteristics it ought to have.

First of all, it should be suitable for, at a minimum:

1. compilers

2. syntax highlighting editors

3. source code formatters

4. html creation

To that end:

1. It should accept as input an input range of UTF8. I feel it is a mistake to 
templatize it for UTF16 and UTF32. Anyone desiring to feed it UTF16 should use 
an 'adapter' range to convert the input to UTF8. (This is what component 
programming is all about.)

2. It should output an input range of tokens

3. tokens should be values, not classes

4. It should avoid memory allocation as much as possible

5. It should read or write any mutable global state outside of its "Lexer"
instance

6. A single "Lexer" instance should be able to serially accept input ranges, 
sharing and updating one identifier table

7. It should accept a callback delegate for errors. That delegate should decide 
whether to:
    1. ignore the error (and "Lexer" will try to recover and continue)
    2. print an error message (and "Lexer" will try to recover and continue)
    3. throw an exception, "Lexer" is done with that input range

8. Lexer should be configurable as to whether it should collect information 
about comments and ddoc comments or not

9. Comments and ddoc comments should be attached to the next following token, 
they should not themselves be tokens

10. High speed matters a lot

11. Tokens should have begin/end line/column markers, though most of the time 
this can be implicitly determined

12. It should come with unittests that, using -cov, show 100% coverage


Basically, I don't want anyone to be motivated to do a separate one after
seeing 
this one.

Aug 01 2012

"Jakob Ovrum" <jakobovrum gmail.com> writes:

On Thursday, 2 August 2012 at 00:11:15 UTC, Walter Bright wrote:
 3. tokens should be values, not classes

I agree with everything but this point. Tokens become quite large 
when they have all kinds of string and position information on 
them, much larger than the typical recommended sizes for 
pass-by-value. It's still possible to provide an interface for 
them that doesn't require as much copying (ref returns and 
getters and whatnot), but it's fiddly and way too easy to copy 
these huge Token structs around, especially if the output is a 
range of Token structs.

Aug 01 2012

Walter Bright <newshound2 digitalmars.com> writes:

On 8/1/2012 5:21 PM, Jakob Ovrum wrote:
 On Thursday, 2 August 2012 at 00:11:15 UTC, Walter Bright wrote:
 3. tokens should be values, not classes

 I agree with everything but this point. Tokens become quite large when they
have
 all kinds of string and position information on them, much larger than the
 typical recommended sizes for pass-by-value. It's still possible to provide an
 interface for them that doesn't require as much copying (ref returns and
getters
 and whatnot), but it's fiddly and way too easy to copy these huge Token structs
 around, especially if the output is a range of Token structs.

Doing a class requires a heap allocation per token, which will have disastrous 
performance consequences.

Aug 01 2012

"Jakob Ovrum" <jakobovrum gmail.com> writes:

On Thursday, 2 August 2012 at 04:00:54 UTC, Walter Bright wrote:
 Doing a class requires a heap allocation per token, which will 
 have disastrous performance consequences.

No it doesn't. That's an implication of using the NewExpression, 
not classes.

Aug 01 2012

Walter Bright <newshound2 digitalmars.com> writes:

On 8/1/2012 9:02 PM, Jakob Ovrum wrote:
 No it doesn't. That's an implication of using the NewExpression, not classes.

See my other reply to you on this topic.

Aug 01 2012

"Bernard Helyer" <b.helyer gmail.com> writes:

 5. It should read or write any mutable global state outside of 
 its "Lexer"
 instance

I assume you mean 'shouldn't'. :P

Aug 01 2012

Walter Bright <newshound2 digitalmars.com> writes:

On 8/1/2012 6:30 PM, Bernard Helyer wrote:
 5. It should read or write any mutable global state outside of its "Lexer"
 instance

 I assume you mean 'shouldn't'. :P

Good catch.

Aug 01 2012

deadalnix <deadalnix gmail.com> writes:

Le 02/08/2012 02:10, Walter Bright a �crit :
 6. A single "Lexer" instance should be able to serially accept input
 ranges, sharing and updating one identifier table

I see the lexer as a function that take an range of char as input and 
give back a range of token. Does it make sense to make an instance of a 
lexer ?

 7. It should accept a callback delegate for errors. That delegate should
 decide whether to:
 1. ignore the error (and "Lexer" will try to recover and continue)
 2. print an error message (and "Lexer" will try to recover and continue)
 3. throw an exception, "Lexer" is done with that input range

Off topic, but it look like the condition proposal from H.S. Teoh and 
myself.

 Basically, I don't want anyone to be motivated to do a separate one
 after seeing this one.

That would be awesome !

Aug 01 2012

Walter Bright <newshound2 digitalmars.com> writes:

On 8/1/2012 7:04 PM, deadalnix wrote:
 Le 02/08/2012 02:10, Walter Bright a �crit :
 6. A single "Lexer" instance should be able to serially accept input
 ranges, sharing and updating one identifier table

 I see the lexer as a function that take an range of char as input and give back
 a range of token. Does it make sense to make an instance of a lexer ?

Yes, as the lexer will have some state and some configuration parameters.

Aug 01 2012

Jonathan M Davis <jmdavisProg gmx.com> writes:

On Wednesday, August 01, 2012 21:00:53 Walter Bright wrote:
 On 8/1/2012 7:04 PM, deadalnix wrote:
 Le 02/08/2012 02:10, Walter Bright a =C3=A9crit :
 6. A single "Lexer" instance should be able to serially accept inp=



ut
 ranges, sharing and updating one identifier table

=20
 I see the lexer as a function that take an range of char as input a=


nd give
 back a range of token. Does it make sense to make an instance of a =


lexer
 ?

 Yes, as the lexer will have some state and some configuration paramet=

ers.

Couldn't that just be part of the range type? A separate lexer type sho=
uldn't=20
be necessary.

- Jonathan M Davis

Aug 01 2012

Walter Bright <newshound2 digitalmars.com> writes:

On 8/1/2012 9:06 PM, Jonathan M Davis wrote:
 On Wednesday, August 01, 2012 21:00:53 Walter Bright wrote:
 On 8/1/2012 7:04 PM, deadalnix wrote:
 Le 02/08/2012 02:10, Walter Bright a écrit :
 6. A single "Lexer" instance should be able to serially accept input
 ranges, sharing and updating one identifier table

 I see the lexer as a function that take an range of char as input and give
 back a range of token. Does it make sense to make an instance of a lexer
 ?

 Yes, as the lexer will have some state and some configuration parameters.

 Couldn't that just be part of the range type? A separate lexer type shouldn't
 be necessary.

No, because the user may want to serially present ranges to the same lexer
type. 
Why do that, you might say? That's how dmd works - it means the identifier
table 
will be the same for all source files.

Aug 01 2012

Jonathan M Davis <jmdavisProg gmx.com> writes:

On Wednesday, August 01, 2012 21:47:22 Walter Bright wrote:
 Couldn't that just be part of the range type? A separate lexer type
 shouldn't be necessary.

 
 No, because the user may want to serially present ranges to the same lexer
 type. Why do that, you might say? That's how dmd works - it means the
 identifier table will be the same for all source files.

Then just pass the same identifier table to the function which creates the 
token range. That doesn't require another type.

- Jonathan M Davis

Aug 01 2012

Walter Bright <newshound2 digitalmars.com> writes:

On 8/1/2012 9:54 PM, Jonathan M Davis wrote:
 Then just pass the same identifier table to the function which creates the
 token range. That doesn't require another type.

You're still going to require another type, otherwise you'll have to duplicate 
the state in every token allocation, with resultant heavy memory and 
initialization costs.

Please keep in mind that a lexer is not something you just pass a few short 
strings to. It's very very very performance critical as all those extra 
instructions add up to long delays when you're shoving millions of lines of
code 
into its maw.

For the same reason you're also not going to want the lexer putting pressure on 
the GC. It could bring your whole system down.

To get a high performance lexer, you're going to be counting the average number 
of instructions executed per input character. Each one counts. Shaving one off 
is a victory. You should also be thinking about memory cache access patterns.

Aug 01 2012

Jonathan M Davis <jmdavisProg gmx.com> writes:

On Wednesday, August 01, 2012 22:09:05 Walter Bright wrote:
 On 8/1/2012 9:54 PM, Jonathan M Davis wrote:
 Then just pass the same identifier table to the function which creates the
 token range. That doesn't require another type.

 
 You're still going to require another type, otherwise you'll have to
 duplicate the state in every token allocation, with resultant heavy memory
 and initialization costs.

Why would you be duplicating any state in the token? The range would have the 
state, not the token. The token would only have the data specific to the token, 
not the range as a whole.

- Jonathan M Davis

Aug 01 2012

Walter Bright <newshound2 digitalmars.com> writes:

On 8/1/2012 10:34 PM, Jonathan M Davis wrote:
 On Wednesday, August 01, 2012 22:09:05 Walter Bright wrote:
 On 8/1/2012 9:54 PM, Jonathan M Davis wrote:
 Then just pass the same identifier table to the function which creates the
 token range. That doesn't require another type.

 You're still going to require another type, otherwise you'll have to
 duplicate the state in every token allocation, with resultant heavy memory
 and initialization costs.

 Why would you be duplicating any state in the token? The range would have the
 state, not the token. The token would only have the data specific to the token,
 not the range as a whole.

I meant two types: the Lexer and the Token. The Lexer can present a range
interface.

Aug 01 2012

Jonathan M Davis <jmdavisProg gmx.com> writes:

On Wednesday, August 01, 2012 22:41:01 Walter Bright wrote:
 On 8/1/2012 10:34 PM, Jonathan M Davis wrote:
 On Wednesday, August 01, 2012 22:09:05 Walter Bright wrote:
 On 8/1/2012 9:54 PM, Jonathan M Davis wrote:
 Then just pass the same identifier table to the function which creates
 the
 token range. That doesn't require another type.

 
 You're still going to require another type, otherwise you'll have to
 duplicate the state in every token allocation, with resultant heavy
 memory
 and initialization costs.

 
 Why would you be duplicating any state in the token? The range would have
 the state, not the token. The token would only have the data specific to
 the token, not the range as a whole.

 
 I meant two types: the Lexer and the Token. The Lexer can present a range
 interface.

What I was expecting there to  be was a type which was a range of tokens.  You 
passed the source string to a function which returned that range, and you 
iterated over it to process each token. What you appear to have been arguing 
for is another type which you get the range from which holds additional state 
(e.g. the indentifier table). So, you're saying now that you only meant that 
the token type needs to be separate from the range type (which I would have 
thought would be the only way to do it), or are you indeed saying that there 
should be another type that the range comes from and which holds additional 
state?

- Jonathan M Davis

Aug 01 2012

Walter Bright <newshound2 digitalmars.com> writes:

On 8/1/2012 10:53 PM, Jonathan M Davis wrote:
 What I was expecting there to  be was a type which was a range of tokens.  You
 passed the source string to a function which returned that range, and you
 iterated over it to process each token. What you appear to have been arguing
 for is another type which you get the range from which holds additional state
 (e.g. the indentifier table). So, you're saying now that you only meant that
 the token type needs to be separate from the range type (which I would have
 thought would be the only way to do it), or are you indeed saying that there
 should be another type that the range comes from and which holds additional
 state?

You could do it either way.

Aug 01 2012

Jonathan M Davis <jmdavisProg gmx.com> writes:

On Wednesday, August 01, 2012 21:54:01 Jonathan M Davis wrote:
 On Wednesday, August 01, 2012 21:47:22 Walter Bright wrote:
 Couldn't that just be part of the range type? A separate lexer type
 shouldn't be necessary.

 
 No, because the user may want to serially present ranges to the same lexer
 type. Why do that, you might say? That's how dmd works - it means the
 identifier table will be the same for all source files.

 
 Then just pass the same identifier table to the function which creates the
 token range. That doesn't require another type.

If you end up with enough pieces of state that you need to carry around 
between files/strings being lexed, then creating another object makes sense, 
but I would hope that the amount of global state like that would be minimal.

- Jonathan M Davis

Aug 01 2012

"H. S. Teoh" <hsteoh quickfur.ath.cx> writes:

On Wed, Aug 01, 2012 at 09:06:42PM -0700, Jonathan M Davis wrote:
 On Wednesday, August 01, 2012 21:00:53 Walter Bright wrote:
 On 8/1/2012 7:04 PM, deadalnix wrote:
 Le 02/08/2012 02:10, Walter Bright a �crit :
 6. A single "Lexer" instance should be able to serially accept
 input ranges, sharing and updating one identifier table

 
 I see the lexer as a function that take an range of char as input
 and give back a range of token. Does it make sense to make an
 instance of a lexer ?

 Yes, as the lexer will have some state and some configuration
 parameters.

 
 Couldn't that just be part of the range type? A separate lexer type
 shouldn't be necessary.

[...]

Whether it's part of the range type or a separate lexer type,
*definitely* make it possible to have multiple instances. One of the
biggest flaws of otherwise-good lexer generators like lex and flex
(C/C++) is that the core code assumes a single instance, and
multi-instances were glued on after the fact, making it a royal pain to
work with anything that needs lexing multiple things at the same time.


T

-- 
The day Microsoft makes something that doesn't suck is probably the day
they start making vacuum cleaners... -- Slashdotter

Aug 01 2012

Walter Bright <newshound2 digitalmars.com> writes:

On 8/1/2012 9:41 PM, H. S. Teoh wrote:
 Whether it's part of the range type or a separate lexer type,
 *definitely* make it possible to have multiple instances. One of the
 biggest flaws of otherwise-good lexer generators like lex and flex
 (C/C++) is that the core code assumes a single instance, and
 multi-instances were glued on after the fact, making it a royal pain to
 work with anything that needs lexing multiple things at the same time.

Yup. I keep trying to think of a way to lex multiple files at the same time in 
separate threads, but the problem is serializing access to the identifier table 
will likely kill off any perf gain.

Aug 01 2012

deadalnix <deadalnix gmail.com> writes:

Le 02/08/2012 06:48, Walter Bright a �crit :
 On 8/1/2012 9:41 PM, H. S. Teoh wrote:
 Whether it's part of the range type or a separate lexer type,
 *definitely* make it possible to have multiple instances. One of the
 biggest flaws of otherwise-good lexer generators like lex and flex
 (C/C++) is that the core code assumes a single instance, and
 multi-instances were glued on after the fact, making it a royal pain to
 work with anything that needs lexing multiple things at the same time.

 Yup. I keep trying to think of a way to lex multiple files at the same
 time in separate threads, but the problem is serializing access to the
 identifier table will likely kill off any perf gain.

That was exactly my reaction to the � let's reuse the identifier table � 
comment of yours.

The future is multicore.

Aug 02 2012

"Jason House" <jason.james.house gmail.com> writes:

On Thursday, 2 August 2012 at 04:48:56 UTC, Walter Bright wrote:
 On 8/1/2012 9:41 PM, H. S. Teoh wrote:
 Whether it's part of the range type or a separate lexer type,
 *definitely* make it possible to have multiple instances. One 
 of the
 biggest flaws of otherwise-good lexer generators like lex and 
 flex
 (C/C++) is that the core code assumes a single instance, and
 multi-instances were glued on after the fact, making it a 
 royal pain to
 work with anything that needs lexing multiple things at the 
 same time.

 Yup. I keep trying to think of a way to lex multiple files at 
 the same time in separate threads, but the problem is 
 serializing access to the identifier table will likely kill off 
 any perf gain.

The following is an incredibly fast multithreaded hash table. It 
is both lock-free and fence-free. Would something like that solve 
your problem?

http://www.azulsystems.com/events/javaone_2007/2007_LockFreeHash.pdf

Aug 06 2012

Walter Bright <newshound2 digitalmars.com> writes:

On 8/6/2012 5:14 PM, Jason House wrote:
 The following is an incredibly fast multithreaded hash table. It is both
 lock-free and fence-free. Would something like that solve your problem?

 http://www.azulsystems.com/events/javaone_2007/2007_LockFreeHash.pdf

It might if I understood it! There do seem to be some cases where fences are 
required. This would take considerable study.

Aug 07 2012

Jonathan M Davis <jmdavisProg gmx.com> writes:

On Wednesday, August 01, 2012 17:10:07 Walter Bright wrote:
 1. It should accept as input an input range of UTF8. I feel it is a mistake
 to templatize it for UTF16 and UTF32. Anyone desiring to feed it UTF16
 should use an 'adapter' range to convert the input to UTF8. (This is what
 component programming is all about.)

But that's not how ranges of characters work. They're ranges of dchar. Ranges 
don't operate on UTF-8 or UTF-16. They operate on UTF-32. You'd have to create 
special wrappers around string or wstring to have ranges of UTF-8. The way 
that it's normally done is to have ranges of dchar and then special-case 
range-based functions for strings. Then the function can operate on any range 
of dchar but still operates on strings efficiently.

So, it's quite possible to make a lexer operate on ranges of dchar but be 
operating primarily on ASCII by special-casing strings and avoiding decoding 
as much as possible. My lexer does a good job of this, I believe, so it's 
thoroughly optimized for strings, but it's still operating on ranges of dchar, 
not ranges of UTF-8.

 2. It should output an input range of tokens
 
 3. tokens should be values, not classes
 
 4. It should avoid memory allocation as much as possible
 
 5. It should read or write any mutable global state outside of its "Lexer"
 instance
 
 6. A single "Lexer" instance should be able to serially accept input ranges,
 sharing and updating one identifier table

When you say identifier table, do you mean symbol table, or something else?

Isn't the symbol table something for the parser to worry about? Other than 
parsers, I can't think of anything which would even _care_ about a symbol 
table. And depending on how a symbol table were done, you'd want it to take 
scoping rules into account (_I'd_ certainly expect it to), meaning that it 
only includes identifiers which are relevant to the current scope. And if 
that's the case, it _really_ doesn't belong in the lexer. The code using the 
lexer can easily have its own table that it populates according to however it 
wants it populated by processing the identifier tokens as they get lexed. Not 
to mention, don't symbol tables usually include type information that a lexer 
_wouldn't_ have?

Sure, a symbol table _could_ be put in the lexer (at least if it doesn't care 
about scoping), but I definitely question that that's the right place for it.

If you mean a table that simply lists identifiers rather than a symbol table, 
I'm not quite sure why you'd want it in addition to the symbol table, and 
again, I'd expect only parsers to care about that sort of thing, so I'd expect 
the parser to be implementing it.

So, what exactly are you looking for the lexer to implement as far as an 
identifier table goes, and why is it in the lexer rather than the parser?

 7. It should accept a callback delegate for errors. That delegate should
 decide whether to:
     1. ignore the error (and "Lexer" will try to recover and continue)
     2. print an error message (and "Lexer" will try to recover and continue)
 3. throw an exception, "Lexer" is done with that input range

I'm currently treating errors as tokens. It then becomes easy for the code 
using the lexer to just ignore the errors, to process them immediately, or to 
put off dealing with them until the lexing is complete. So, the code using the 
lexer can handle errors however and whenever it likes without having to worry 
about delegates or exceptions. Since tokens are lexed lazily, the fact that an 
error is reported as a token doesn't require that the lexing continue, but it 
also makes it _easy_ to continue lexing, ignoring or saving the error. I'm 
inclined to think that that's a superior approach to using delegates and 
exceptions.

 8. Lexer should be configurable as to whether it should collect information
 about comments and ddoc comments or not

 9. Comments and ddoc comments should be attached to the next following
 token, they should not themselves be tokens

Why? I'd argue for just processing them as tokens just like I'm doing with 
errors. The code using the lexer can then process them or ignore them as it 
likes (it can even specifically look for them and ignore all other tokens if 
it's something which only operates on comments). And if you want to see which 
token follows the comment, it's the next one in the range. So, I don't see 
much need to attach comments to other tokens. What's the advantage of not 
treating them as tokens?

 12. It should come with unittests that, using -cov, show 100% coverage

100% is actually unreasonable, because there are lines which should _never_ be 
hit (e.g. an assert(0) line in a switch statement). All lines _other_ than 
those sort of lines should have code coverage, but depending on the lexer 
implementation, 100% is impossible. std.datetime has a ton of unit tests and 
IIRC it still only manages 98% because of assert(0) lines. So, I agree with 
the spirit of your statement, but it's not necessarily reasonable to do 
exactly what you're saying.

- Jonathan M Davis

Aug 01 2012

Walter Bright <newshound2 digitalmars.com> writes:

On 8/1/2012 8:04 PM, Jonathan M Davis wrote:
 On Wednesday, August 01, 2012 17:10:07 Walter Bright wrote:
 1. It should accept as input an input range of UTF8. I feel it is a mistake
 to templatize it for UTF16 and UTF32. Anyone desiring to feed it UTF16
 should use an 'adapter' range to convert the input to UTF8. (This is what
 component programming is all about.)

 But that's not how ranges of characters work. They're ranges of dchar. Ranges
 don't operate on UTF-8 or UTF-16. They operate on UTF-32. You'd have to create
 special wrappers around string or wstring to have ranges of UTF-8. The way
 that it's normally done is to have ranges of dchar and then special-case
 range-based functions for strings. Then the function can operate on any range
 of dchar but still operates on strings efficiently.

I have argued against making ranges of characters dchars, because of
performance 
reasons. This will especially adversely affect the performance of the lexer.

The fact is, files are normally in UTF8 and just about everything else is in 
UTF8. Prematurely converting to UTF-32 is a performance disaster. Note that the 
single largest thing holding back regex performance is that premature
conversion 
to dchar and back to char.

If lexer is required to accept dchar ranges, its performance will drop at least 
in half, and people are going to go reinvent their own lexers.


 So, it's quite possible to make a lexer operate on ranges of dchar but be
 operating primarily on ASCII by special-casing strings and avoiding decoding
 as much as possible. My lexer does a good job of this, I believe, so it's
 thoroughly optimized for strings, but it's still operating on ranges of dchar,
 not ranges of UTF-8.

 2. It should output an input range of tokens

 3. tokens should be values, not classes

 4. It should avoid memory allocation as much as possible

 5. It should read or write any mutable global state outside of its "Lexer"
 instance

 6. A single "Lexer" instance should be able to serially accept input ranges,
 sharing and updating one identifier table

 When you say identifier table, do you mean symbol table, or something else?

All identifiers are entered into a hashtable, and are referred to by pointers 
into that hashtable for the rest of dmd. This makes symbol lookups incredibly 
fast, as no string comparisons are done.


 Isn't the symbol table something for the parser to worry about? Other than
 parsers, I can't think of anything which would even _care_ about a symbol
 table. And depending on how a symbol table were done, you'd want it to take
 scoping rules into account (_I'd_ certainly expect it to), meaning that it
 only includes identifiers which are relevant to the current scope. And if
 that's the case, it _really_ doesn't belong in the lexer. The code using the
 lexer can easily have its own table that it populates according to however it
 wants it populated by processing the identifier tokens as they get lexed. Not
 to mention, don't symbol tables usually include type information that a lexer
 _wouldn't_ have?

The symbol table becomes a symbol table of pointers into the lexer's identifier 
table.


 Sure, a symbol table _could_ be put in the lexer (at least if it doesn't care
 about scoping), but I definitely question that that's the right place for it.

The symbol table isn't in the lexer, the identifier string table is.


 If you mean a table that simply lists identifiers rather than a symbol table,
 I'm not quite sure why you'd want it in addition to the symbol table, and
 again, I'd expect only parsers to care about that sort of thing, so I'd expect
 the parser to be implementing it.

Because it is gawdammed fast to do it that way. An identifier is treated as a 
string exactly once, in the lexer, and thereafter by its unique handle.


 So, what exactly are you looking for the lexer to implement as far as an
 identifier table goes, and why is it in the lexer rather than the parser?

Very simple. A mapping of [identifier string] => [unique value], and a pointer 
servers nicely as that unique value.


 7. It should accept a callback delegate for errors. That delegate should
 decide whether to:
      1. ignore the error (and "Lexer" will try to recover and continue)
      2. print an error message (and "Lexer" will try to recover and continue)
 3. throw an exception, "Lexer" is done with that input range

 I'm currently treating errors as tokens. It then becomes easy for the code
 using the lexer to just ignore the errors, to process them immediately, or to
 put off dealing with them until the lexing is complete. So, the code using the
 lexer can handle errors however and whenever it likes without having to worry
 about delegates or exceptions. Since tokens are lexed lazily, the fact that an
 error is reported as a token doesn't require that the lexing continue, but it
 also makes it _easy_ to continue lexing, ignoring or saving the error. I'm
 inclined to think that that's a superior approach to using delegates and
 exceptions.

I don't agree. It means that the parser has to check everywhere for error 
tokens. Besides the ugliness, it means a slowdown for those checks.


 8. Lexer should be configurable as to whether it should collect information
 about comments and ddoc comments or not

 9. Comments and ddoc comments should be attached to the next following
 token, they should not themselves be tokens

 Why?

Because then the parser would have to add checks for 0 or more comment tokens 
between every other token. It uglifies the parser pretty badly, and slows it
down.

 I'd argue for just processing them as tokens just like I'm doing with
 errors. The code using the lexer can then process them or ignore them as it
 likes (it can even specifically look for them and ignore all other tokens if
 it's something which only operates on comments). And if you want to see which
 token follows the comment, it's the next one in the range. So, I don't see
 much need to attach comments to other tokens. What's the advantage of not
 treating them as tokens?

Performance and simplicity. Your grammar will be awfully ugly unless you insert 
another range to filter out those tokens - but that'll be a slowdown.


 12. It should come with unittests that, using -cov, show 100% coverage

 100% is actually unreasonable, because there are lines which should _never_ be
 hit (e.g. an assert(0) line in a switch statement). All lines _other_ than
 those sort of lines should have code coverage, but depending on the lexer
 implementation, 100% is impossible. std.datetime has a ton of unit tests and
 IIRC it still only manages 98% because of assert(0) lines. So, I agree with
 the spirit of your statement, but it's not necessarily reasonable to do
 exactly what you're saying.

I know, but I expect all unreachable code to be justified. The lexer is a well 
defined problem, and this should not be difficult.

Aug 01 2012

Jonathan M Davis <jmdavisProg gmx.com> writes:

On Wednesday, August 01, 2012 21:30:44 Walter Bright wrote:
 On 8/1/2012 8:04 PM, Jonathan M Davis wrote:
 On Wednesday, August 01, 2012 17:10:07 Walter Bright wrote:
 1. It should accept as input an input range of UTF8. I feel it is a
 mistake
 to templatize it for UTF16 and UTF32. Anyone desiring to feed it UTF16
 should use an 'adapter' range to convert the input to UTF8. (This is what
 component programming is all about.)

 
 But that's not how ranges of characters work. They're ranges of dchar.
 Ranges don't operate on UTF-8 or UTF-16. They operate on UTF-32. You'd
 have to create special wrappers around string or wstring to have ranges
 of UTF-8. The way that it's normally done is to have ranges of dchar and
 then special-case range-based functions for strings. Then the function
 can operate on any range of dchar but still operates on strings
 efficiently.

 
 I have argued against making ranges of characters dchars, because of
 performance reasons. This will especially adversely affect the performance
 of the lexer.
 
 The fact is, files are normally in UTF8 and just about everything else is in
 UTF8. Prematurely converting to UTF-32 is a performance disaster. Note that
 the single largest thing holding back regex performance is that premature
 conversion to dchar and back to char.
 
 If lexer is required to accept dchar ranges, its performance will drop at
 least in half, and people are going to go reinvent their own lexers.

1. The current design of Phobos is to have ranges of dchar, because it fosters 
code correctness (though it can harm efficiency). It's arguably too late to do 
otherwise. Certainly, doing otherwise now would break a lot of code. If the 
lexer tried to operate on UTF-8 as part of its API rather than operating on 
ranges of dchar and special-casing strings, then it wouldn't fit in with Phobos 
at all.

2. The lexer does _not_ have to have its performance tank by accepting ranges 
of dchar. It's true that the performance will be harmed for ranges which 
_aren't_ strings, but for strings (as would be by far the most common use 
case) it can be very efficient by special-casing them.

And as much as there are potential performance issues with Phobos' choice of 
treating strings as ranges of dchar, if it were to continue to treat them as 
ranges of code units, it's pretty much a guarantee that there would be a _ton_ 
of bugs caused by it. Most programmers have absolutely no clue about how 
unicode works and don't really want to know. They want strings to just work. 
Phobos' approach of defaulting to correct but making it possible to make the 
code faster through specializations is very much in line with D's typical 
approach of making things safe by default but allowing the programmer to do 
unsafe things for optimizations when they know what they're doing.

 All identifiers are entered into a hashtable, and are referred to by
 pointers into that hashtable for the rest of dmd. This makes symbol lookups
 incredibly fast, as no string comparisons are done.

Hmmm. Well, I'd still argue that that's a parser thing. Pretty much nothing 
else will care about it. At most, it should be an optional feature of the 
lexer. But it certainly could be added that way.

- Jonathan M Davis

Aug 01 2012

Walter Bright <newshound2 digitalmars.com> writes:

On 8/1/2012 9:52 PM, Jonathan M Davis wrote:
 1. The current design of Phobos is to have ranges of dchar, because it fosters
 code correctness (though it can harm efficiency). It's arguably too late to do
 otherwise. Certainly, doing otherwise now would break a lot of code. If the
 lexer tried to operate on UTF-8 as part of its API rather than operating on
 ranges of dchar and special-casing strings, then it wouldn't fit in with Phobos
 at all.

The lexer must use char or it will not be acceptable as anything but a toy for 
performance reasons.


 2. The lexer does _not_ have to have its performance tank by accepting ranges
 of dchar. It's true that the performance will be harmed for ranges which
 _aren't_ strings, but for strings (as would be by far the most common use
 case) it can be very efficient by special-casing them.

Somebody has to convert the input files into dchars, and then back into chars. 
That blows for performance. Think billions and billions of characters going 
through, not just a few random strings.

Always always think of the lexer as having a firehose of data shoved into its 
maw, and it better be thirsty!


 And as much as there are potential performance issues with Phobos' choice of
 treating strings as ranges of dchar, if it were to continue to treat them as
 ranges of code units, it's pretty much a guarantee that there would be a _ton_
 of bugs caused by it. Most programmers have absolutely no clue about how
 unicode works and don't really want to know. They want strings to just work.
 Phobos' approach of defaulting to correct but making it possible to make the
 code faster through specializations is very much in line with D's typical
 approach of making things safe by default but allowing the programmer to do
 unsafe things for optimizations when they know what they're doing.

I expect std.d.lexer to handle UTF8 correctly, so I don't think this should be 
an issue in this particular case. dmd's lexer does handle UTF8 correctly.

Note also that the places where non-ASCII characters can appear in correct D 
code is severely limited, and there are even fewer places where multibyte 
characters need to be decoded at all, and the lexer takes full advantage of
this 
to boost its speed.

For example, non-ASCII characters can appear in comments, but they DO NOT need 
to be decoded, and even just having the test for a non-ASCII character in the 
comment scanner will visibly slow down the lexer.

 All identifiers are entered into a hashtable, and are referred to by
 pointers into that hashtable for the rest of dmd. This makes symbol lookups
 incredibly fast, as no string comparisons are done.

 Hmmm. Well, I'd still argue that that's a parser thing. Pretty much nothing
 else will care about it. At most, it should be an optional feature of the
 lexer. But it certainly could be added that way.

I hate to say "trust me on this", but if you don't, have a look at dmd's lexer 
and how it handles identifiers, then look at dmd's symbol table.

Aug 01 2012

Jonathan M Davis <jmdavisProg gmx.com> writes:

On Wednesday, August 01, 2012 22:33:12 Walter Bright wrote:
 The lexer must use char or it will not be acceptable as anything but a toy
 for performance reasons.

Avoiding decoding can be done with strings and operating on ranges of dchar, 
so you'd be operating almost entirely on ASCII. Are you saying that there's a 
performance issue aside from decoding?

 Somebody has to convert the input files into dchars, and then back into
 chars. That blows for performance. Think billions and billions of
 characters going through, not just a few random strings.

Why is there any converting to dchar going on here? I don't see why any would 
be necessary. If you reading in a file as a string or char[] (as would be 
typical), then you're operating on a string, and then the only time that any 
decoding will be necessary is when you actually need to operate on a unicode 
character, which is very rare in D's grammar. It's only when operating on 
something _other_ than a string that you'd have to actually deal with dchars.

 Hmmm. Well, I'd still argue that that's a parser thing. Pretty much
 nothing
 else will care about it. At most, it should be an optional feature of the
 lexer. But it certainly could be added that way.

 
 I hate to say "trust me on this", but if you don't, have a look at dmd's
 lexer and how it handles identifiers, then look at dmd's symbol table.

My point is that it's the sort of thing that _only_ a parser would care about. 
So, unless it _needs_ to be in the lexer for some reason, it shouldn't be.

- Jonathan M Davis

Aug 01 2012

Walter Bright <newshound2 digitalmars.com> writes:

On 8/1/2012 10:44 PM, Jonathan M Davis wrote:
 On Wednesday, August 01, 2012 22:33:12 Walter Bright wrote:
 The lexer must use char or it will not be acceptable as anything but a toy
 for performance reasons.

 Avoiding decoding can be done with strings and operating on ranges of dchar,
 so you'd be operating almost entirely on ASCII. Are you saying that there's a
 performance issue aside from decoding?

1. Encoding it into a dchar is a performance problem. Source code sits in files 
that are nearly always in UTF8. So your input range MUST check every single
char 
and convert it to UTF32 as necessary. Plus, there's that additional step
removed 
from sticking the file input buffer directly into the lexer's input.

2. Storing strings as dchars is a performance and memory problem (4x as much 
data and hence time).

Remember, nearly ALL of D source will be ASCII. All performance considerations 
must be tilted towards the usual case.


 Somebody has to convert the input files into dchars, and then back into
 chars. That blows for performance. Think billions and billions of
 characters going through, not just a few random strings.

 Why is there any converting to dchar going on here?

Because your input range is a range of dchar?


 I don't see why any would
 be necessary. If you reading in a file as a string or char[] (as would be
 typical), then you're operating on a string, and then the only time that any
 decoding will be necessary is when you actually need to operate on a unicode
 character, which is very rare in D's grammar. It's only when operating on
 something _other_ than a string that you'd have to actually deal with dchars.

That's what I've been saying. So why have an input range of dchars, which must 
be decoded in advance, otherwise it wouldn't be a range of dchars?


 Hmmm. Well, I'd still argue that that's a parser thing. Pretty much
 nothing
 else will care about it. At most, it should be an optional feature of the
 lexer. But it certainly could be added that way.

 I hate to say "trust me on this", but if you don't, have a look at dmd's
 lexer and how it handles identifiers, then look at dmd's symbol table.

 My point is that it's the sort of thing that _only_ a parser would care about.
 So, unless it _needs_ to be in the lexer for some reason, it shouldn't be.

I think you are misunderstanding. The lexer doesn't have a *symbol* table in
it. 
It has a mapping from identifiers to unique handles. It needs to be there, 
otherwise the semantic analysis has to scan identifier strings a second time.

Aug 01 2012

Jacob Carlborg <doob me.com> writes:

On 2012-08-02 08:39, Walter Bright wrote:

 That's what I've been saying. So why have an input range of dchars,
 which must be decoded in advance, otherwise it wouldn't be a range of
 dchars?

Your first requirement is a bit strange and confusing:

"1. It should accept as input an input range of UTF8."

To my understand ranges in Phobos are designed to operate on dchars, not 
chars, regardless of the original input type. So if you can create a 
range that operates on UTF8 I don't know if that will be usable with the 
rest of the Phobos functions expecting ranges of dchars. Basically 
making this type of range useless since it cannot use any other 
functionality in Phobos.

The current state of Phobos: you either have a range of dchars or you 
don't have a range, you have an array or something else.

-- 
/Jacob Carlborg

Aug 01 2012

Walter Bright <newshound2 digitalmars.com> writes:

On 8/1/2012 11:59 PM, Jacob Carlborg wrote:
 Your first requirement is a bit strange and confusing:

 "1. It should accept as input an input range of UTF8."

 To my understand ranges in Phobos are designed to operate on dchars, not chars,
 regardless of the original input type. So if you can create a range that
 operates on UTF8 I don't know if that will be usable with the rest of the
Phobos
 functions expecting ranges of dchars. Basically making this type of range
 useless since it cannot use any other functionality in Phobos.

 The current state of Phobos: you either have a range of dchars or you don't
have
 a range, you have an array or something else.

I answered this point a few posts up in the thread.

Aug 02 2012

Jacob Carlborg <doob me.com> writes:

On 2012-08-02 09:21, Walter Bright wrote:

 I answered this point a few posts up in the thread.

I've read a few posts up and the only answer I found is that the lexer 
needs to operates on chars. But it does not answer the question how that 
range type would be used by all other range based functions in Phobos.

Have I missed this? Can you please link to your post.

-- 
/Jacob Carlborg

Aug 02 2012

Walter Bright <newshound2 digitalmars.com> writes:

On 8/2/2012 12:29 AM, Jacob Carlborg wrote:
 On 2012-08-02 09:21, Walter Bright wrote:

 I answered this point a few posts up in the thread.

 I've read a few posts up and the only answer I found is that the lexer needs to
 operates on chars. But it does not answer the question how that range type
would
 be used by all other range based functions in Phobos.

 Have I missed this? Can you please link to your post.

You can use an adapter range.

Aug 02 2012

Jonathan M Davis <jmdavisProg gmx.com> writes:

On Wednesday, August 01, 2012 23:39:42 Walter Bright wrote:
 Somebody has to convert the input files into dchars, and then back into
 chars. That blows for performance. Think billions and billions of
 characters going through, not just a few random strings.

 
 Why is there any converting to dchar going on here?

 
 Because your input range is a range of dchar?

I think that we're misunderstanding each other here. A typical, well-written, 
range-based function which operates on ranges of dchar will use static if or 
overloads to special-case strings. This means that it will function with any 
range of dchar, but it _also_ will be as efficient with strings as if it just 
operated on strings. It won't decode anything in the string unless it has to. 
So, having a lexer which operates on ranges of dchar does _not_ make string 
processing less efficient. It just makes it so that it can _also_ operate on 
ranges of dchar which aren't strings.

For instance, my lexer uses this whenever it needs to get at the first 
character in the range:

static if(isNarrowString!R)
    Unqual!(ElementEncodingType!R) first = range[0];
else
    dchar first = range.front;

I use string mixins to reduce the code duplication that goes with this, but 
notice that strings and wstrings do _not_ get decoded. Their first code unit is 
used, and then most everything operates on that code unit. e.g.

switch(first)
{
    case '\n': //...
    case ' ': //...
    case '+' //...
    //...
    default: //...
}

It's only when the code actually needs to decode the first code point that it 
does so (e.g. when all of the possible ASCII characters have been exhausted, 
it needs to check whether the first code point is a unicode alpha character, 
since that could be the beginning of an identifier). At that point, I end up 
doing something like

static if(isNarrowString!R)
    dchar front = range.front;
else
    alias first front;

or

if I need to know the number of code units that make up the code point, I 
explicitly call decode in the case of a narrow string. In either case, code 
units are _not_ being converted to dchar unless they absolutely have to be.

The only thing that suffers here performance-wise is ranges that aren't 
actually strings. For instance, filter!"true"(source) would have _much_ worse 
performance, because it has to decode every character. But without the concept 
of a variably-lengthed encoded range, we can't really get around that.

 My point is that it's the sort of thing that _only_ a parser would care
 about. So, unless it _needs_ to be in the lexer for some reason, it
 shouldn't be.

 I think you are misunderstanding. The lexer doesn't have a *symbol* table in
 it. It has a mapping from identifiers to unique handles. It needs to be
 there, otherwise the semantic analysis has to scan identifier strings a
 second time.

Yes. I understand. It has a mapping of pointers to identifiers. My point is 
that nothing but parsers will need that. From the standpoint of functionality, 
it's a parser feature, not a lexer feature. So, if it can be done just fine in 
the parser, then that's where it should be. If on the other hand, it _needs_ 
to be in the lexer for some reason (e.g. performance), then that's a reason to 
put it there.

It sounds like you're arguing that performance requirements dictate that it be 
put in the lexer even if nothing but the parser cares about it, in which case, 
the lexer is indeed where it should be, but if performance isn't affected, then 
I'd still argue that it should be in the parser.

- Jonathan M Davis

Aug 02 2012

Walter Bright <newshound2 digitalmars.com> writes:

On 8/2/2012 12:21 AM, Jonathan M Davis wrote:
 Because your input range is a range of dchar?

 I think that we're misunderstanding each other here. A typical, well-written,
 range-based function which operates on ranges of dchar will use static if or
 overloads to special-case strings. This means that it will function with any
 range of dchar, but it _also_ will be as efficient with strings as if it just
 operated on strings.

It *still* must convert UTF8 to dchars before presenting them to the consumer
of 
the dchar elements.


 It won't decode anything in the string unless it has to.
 So, having a lexer which operates on ranges of dchar does _not_ make string
 processing less efficient. It just makes it so that it can _also_ operate on
 ranges of dchar which aren't strings.

 For instance, my lexer uses this whenever it needs to get at the first
 character in the range:

 static if(isNarrowString!R)
      Unqual!(ElementEncodingType!R) first = range[0];
 else
      dchar first = range.front;

You're requiring a random access input range that has random access to
something 
other than the range element type?? and you're requiring an isNarrowString to 
work on an arbitrary range?


 if I need to know the number of code units that make up the code point, I
 explicitly call decode in the case of a narrow string. In either case, code
 units are _not_ being converted to dchar unless they absolutely have to be.

Or you could do away with requiring a special range type and just have it be a 
UTF8 range.

What I wasn't realizing earlier was that you were positing a range type that
has 
two different kinds of elements. I don't think this is a proper component type.


 Yes. I understand. It has a mapping of pointers to identifiers. My point is
 that nothing but parsers will need that.
 From the standpoint of functionality,
 it's a parser feature, not a lexer feature. So, if it can be done just fine in
 the parser, then that's where it should be. If on the other hand, it _needs_
 to be in the lexer for some reason (e.g. performance), then that's a reason to
 put it there.

If you take it out of the lexer, then:

1. the lexer must allocate storage for every identifier, rather than only for 
unique identifiers

2. and then the parser must scan the identifier string *again*

3. there must be two hash lookups of each identifier rather than one

It's a suboptimal design.

Aug 02 2012

Jacob Carlborg <doob me.com> writes:

On 2012-08-02 07:44, Jonathan M Davis wrote:
 On Wednesday, August 01, 2012 22:33:12 Walter Bright wrote:
 The lexer must use char or it will not be acceptable as anything but a toy
 for performance reasons.

 Avoiding decoding can be done with strings and operating on ranges of dchar,
 so you'd be operating almost entirely on ASCII. Are you saying that there's a
 performance issue aside from decoding?

 Somebody has to convert the input files into dchars, and then back into
 chars. That blows for performance. Think billions and billions of
 characters going through, not just a few random strings.

 Why is there any converting to dchar going on here? I don't see why any would
 be necessary. If you reading in a file as a string or char[] (as would be
 typical), then you're operating on a string, and then the only time that any
 decoding will be necessary is when you actually need to operate on a unicode
 character, which is very rare in D's grammar. It's only when operating on
 something _other_ than a string that you'd have to actually deal with dchars.

I think you two are saying the same things.

-- 
/Jacob Carlborg

Aug 01 2012

Jonathan M Davis <jmdavisProg gmx.com> writes:

On Wednesday, August 01, 2012 21:52:15 Jonathan M Davis wrote:
 And as much as there are potential performance issues with Phobos' choice of
 treating strings as ranges of dchar, if it were to continue to treat them
 as ranges of code units, it's pretty much a guarantee that there would be a
 _ton_ of bugs caused by it. Most programmers have absolutely no clue about
 how unicode works and don't really want to know. They want strings to just
 work. Phobos' approach of defaulting to correct but making it possible to
 make the code faster through specializations is very much in line with D's
 typical approach of making things safe by default but allowing the
 programmer to do unsafe things for optimizations when they know what
 they're doing.

Another thing that I should point out is that a range of UTF-8 or UTF-16 
wouldn't work with many range-based functions at all. Most of std.algorithm 
and its ilk would be completely useless. Range-based functions operate on a 
ranges elements, so operating on a range of code units would mean operating on 
code units, which is going to be _wrong_ almost all the time. Think about what 
would happen if you used a function like map or filter on a range of code 
units. The resultant range would be completely invalid as far as unicode goes. 
Range-based functions need to be operating on _characters_. Technically, not 
even code points gets us there, so it's _still_ buggy. It's just a _lot_ 
closer to being correct and works 99.99+% of the time.

If we want to be able to operate on ranges of UTF-8 or UTF-16, we need to add 
a concept of variably-length encoded ranges so that it's possible to treat 
them as both their encoding and whatever they represent (e.g. code point or 
grapheme in the case of ranges of code units).

- Jonathan M Davis

Aug 01 2012

Walter Bright <newshound2 digitalmars.com> writes:

On 8/1/2012 11:56 PM, Jonathan M Davis wrote:
 Another thing that I should point out is that a range of UTF-8 or UTF-16
 wouldn't work with many range-based functions at all. Most of std.algorithm
 and its ilk would be completely useless. Range-based functions operate on a
 ranges elements, so operating on a range of code units would mean operating on
 code units, which is going to be _wrong_ almost all the time. Think about what
 would happen if you used a function like map or filter on a range of code
 units. The resultant range would be completely invalid as far as unicode goes.

My experience in writing fast string based code that worked on UTF8 and 
correctly handled multibyte characters was that they are very possible and 
practical, and they are faster.

The lexer MUST MUST MUST be FAST FAST FAST. Or it will not be useful. If it 
isn't fast, serious users will eschew it and will cook up their own. You'll
have 
a nice, pretty, useless toy of std.d.lexer.

I think there's some serious underestimation of how critical this is.



 Range-based functions need to be operating on _characters_. Technically, not
 even code points gets us there, so it's _still_ buggy. It's just a _lot_
 closer to being correct and works 99.99+% of the time.

Multi-code point characters are quite irrelevant to the correctness of a D
lexer.


 If we want to be able to operate on ranges of UTF-8 or UTF-16, we need to add
 a concept of variably-length encoded ranges so that it's possible to treat
 them as both their encoding and whatever they represent (e.g. code point or
 grapheme in the case of ranges of code units).

No, this is not necessary.

Aug 02 2012

"Bernard Helyer" <b.helyer gmail.com> writes:

On Thursday, 2 August 2012 at 07:29:52 UTC, Walter Bright wrote:
 The lexer MUST MUST MUST be FAST FAST FAST. Or it will not be 
 useful. If it isn't fast, serious users will eschew it and will 
 cook up their own. You'll have a nice, pretty, useless toy of 
 std.d.lexer.

If you want to throw out some target times, that would be useful.

Aug 02 2012

Walter Bright <newshound2 digitalmars.com> writes:

On 8/2/2012 12:33 AM, Bernard Helyer wrote:
 On Thursday, 2 August 2012 at 07:29:52 UTC, Walter Bright wrote:
 The lexer MUST MUST MUST be FAST FAST FAST. Or it will not be useful. If it
 isn't fast, serious users will eschew it and will cook up their own. You'll
 have a nice, pretty, useless toy of std.d.lexer.

 If you want to throw out some target times, that would be useful.

As fast as the dmd one would be best.

Aug 02 2012

Jonathan M Davis <jmdavisProg gmx.com> writes:

On Thursday, August 02, 2012 01:13:04 Walter Bright wrote:
 On 8/2/2012 12:33 AM, Bernard Helyer wrote:
 On Thursday, 2 August 2012 at 07:29:52 UTC, Walter Bright wrote:
 The lexer MUST MUST MUST be FAST FAST FAST. Or it will not be useful. If
 it
 isn't fast, serious users will eschew it and will cook up their own.
 You'll
 have a nice, pretty, useless toy of std.d.lexer.

 
 If you want to throw out some target times, that would be useful.

 
 As fast as the dmd one would be best.

How would we measure that? dmd's lexer is tied to dmd, so how would we test 
the speed of only its lexer?

- Jonathan M Davis

Aug 02 2012

Walter Bright <newshound2 digitalmars.com> writes:

On 8/2/2012 1:21 AM, Jonathan M Davis wrote:
 How would we measure that? dmd's lexer is tied to dmd, so how would we test
 the speed of only its lexer?

Easy. Just make a special version of dmd that lexes only, and time it.

Aug 02 2012

Ed McCardell <edmccard hotmail.com> writes:

On 08/02/2012 04:41 AM, Walter Bright wrote:
 On 8/2/2012 1:21 AM, Jonathan M Davis wrote:
 How would we measure that? dmd's lexer is tied to dmd, so how would we
 test
 the speed of only its lexer?

 Easy. Just make a special version of dmd that lexes only, and time it.

I made a lexing-only version of dmd at

https://github.com/edmccard/dmd/tree/lexonly

by stripping non-lexer-related code from mars.c, and adding a lexModule 
function that is called instead of Module::parse.. There's no 
benchmarking code yet; it basically just does

   while (token.value != TOKeof) nextToken();

for each D source file passed on the command line.

--Ed

Aug 03 2012

dennis luehring <dl.soluz gmx.net> writes:

Am 03.08.2012 09:56, schrieb Ed McCardell:
 On 08/02/2012 04:41 AM, Walter Bright wrote:
 On 8/2/2012 1:21 AM, Jonathan M Davis wrote:
 How would we measure that? dmd's lexer is tied to dmd, so how would we
 test
 the speed of only its lexer?

 Easy. Just make a special version of dmd that lexes only, and time it.

 I made a lexing-only version of dmd at

 https://github.com/edmccard/dmd/tree/lexonly

 by stripping non-lexer-related code from mars.c, and adding a lexModule
 function that is called instead of Module::parse.. There's no
 benchmarking code yet; it basically just does

     while (token.value != TOKeof) nextToken();

 for each D source file passed on the command line.

 --Ed

Walter mentioned an easier way - without "forking" dmd - should be 
better an integral part of the ongoing development maybe under a tools 
or benchmark section?

news://news.digitalmars.com:119/jvfv14$1ri0$1 digitalmars.com
->Look in doc.c at highlightCode2() for how to call the lexer by itself.

Aug 03 2012

Walter Bright <newshound2 digitalmars.com> writes:

On 8/3/2012 1:27 AM, dennis luehring wrote:
 Walter mentioned an easier way - without "forking" dmd - should be better an
 integral part of the ongoing development maybe under a tools or benchmark
section?


Forking it is fine. This is just for a one-off benchmarking thing.

Aug 03 2012

dennis luehring <dl.soluz gmx.net> writes:

Am 02.08.2012 10:13, schrieb Walter Bright:
 On 8/2/2012 12:33 AM, Bernard Helyer wrote:
 On Thursday, 2 August 2012 at 07:29:52 UTC, Walter Bright wrote:
 The lexer MUST MUST MUST be FAST FAST FAST. Or it will not be useful. If it
 isn't fast, serious users will eschew it and will cook up their own. You'll
 have a nice, pretty, useless toy of std.d.lexer.

 If you want to throw out some target times, that would be useful.

 As fast as the dmd one would be best.

would it be (easily) possible to "extract" the dmd lexer code (and 
needed interface) for using it as an spererated benchmark reference?

Aug 02 2012

dennis luehring <dl.soluz gmx.net> writes:

Am 02.08.2012 10:13, schrieb Walter Bright:
 On 8/2/2012 12:33 AM, Bernard Helyer wrote:
 On Thursday, 2 August 2012 at 07:29:52 UTC, Walter Bright wrote:
 The lexer MUST MUST MUST be FAST FAST FAST. Or it will not be useful. If it
 isn't fast, serious users will eschew it and will cook up their own. You'll
 have a nice, pretty, useless toy of std.d.lexer.

 If you want to throw out some target times, that would be useful.

 As fast as the dmd one would be best.

can the dmd lexer seperated as an lib and become useable from outside - 
as the benchmark reference

Aug 02 2012

deadalnix <deadalnix gmail.com> writes:

Le 02/08/2012 10:13, Walter Bright a écrit :
 On 8/2/2012 12:33 AM, Bernard Helyer wrote:
 On Thursday, 2 August 2012 at 07:29:52 UTC, Walter Bright wrote:
 The lexer MUST MUST MUST be FAST FAST FAST. Or it will not be useful.
 If it
 isn't fast, serious users will eschew it and will cook up their own.
 You'll
 have a nice, pretty, useless toy of std.d.lexer.

 If you want to throw out some target times, that would be useful.

 As fast as the dmd one would be best.

That'd be great but . . .

lexer really isn't the performance bottleneck of dmd (or any compiler of 
a non trivial language). Additionally, anybody that have touched dmd 
source code can agree that usability/maintainability isn't as its best.

Sacrificing some perfs in a non bottleneck area to increase ease of use 
make perfect sense.

Aug 02 2012

"Adam D. Ruppe" <destructionator gmail.com> writes:

On Thursday, 2 August 2012 at 11:47:20 UTC, deadalnix wrote:
 lexer really isn't the performance bottleneck of dmd (or any 
 compiler of a non trivial language).

What if we're just using this lexer in something like a
syntax highlighting text editor? I'd be annoyed if it
stopped typing for a little bit cuz of processing.

Aug 02 2012

Marco Leise <Marco.Leise gmx.de> writes:

Am Thu, 02 Aug 2012 14:26:58 +0200
schrieb "Adam D. Ruppe" <destructionator gmail.com>:

 On Thursday, 2 August 2012 at 11:47:20 UTC, deadalnix wrote:
 lexer really isn't the performance bottleneck of dmd (or any 
 compiler of a non trivial language).

 
 What if we're just using this lexer in something like a
 syntax highlighting text editor? I'd be annoyed if it
 stopped typing for a little bit cuz of processing.

My pet peeve for editors is how to make the lexer work on only what I just
edited. It is really not trivial, but eased for example by editors that insert
closing } automatically, so the scopes don't mess up. A missing curly bracket
results in the parser reinterpreting the whole file. The same goes for string
terminators:

If I type `` at the top of std.datetime in Mono-D right after "module
std.datetime;", I have a 12 (!) seconds wait, till the IDE becomes responsive
again.
But if I use ' or ", then the editor inserts the second terminator and is done
with reparsing in 1-2 seconds.

Can we really have a good lexer that is equally well suited for compilers as
for editors ?

-- 
Marco

Aug 02 2012

Walter Bright <newshound2 digitalmars.com> writes:

On 8/2/2012 11:14 AM, Marco Leise wrote:
 If I type `` at the top of std.datetime in Mono-D right after "module
 std.datetime;", I have a 12 (!) seconds wait, till the IDE becomes responsive
 again. But if I use ' or ", then the editor inserts the second terminator and
 is done with reparsing in 1-2 seconds.

A good IDE should do its parsing in a separate thread, so the main user input 
thread remains crisp and responsive.

If the user edits the text while the parsing is in progress, the background 
parsing thread simply abandons the current parse and starts over.

Aug 02 2012

Andrej Mitrovic <andrej.mitrovich gmail.com> writes:

On 8/2/12, Walter Bright <newshound2 digitalmars.com> wrote:
 A good IDE should do its parsing in a separate thread, so the main user
 input
 thread remains crisp and responsive.

 If the user edits the text while the parsing is in progress, the background

 parsing thread simply abandons the current parse and starts over.

Agreed, this is an editor/IDE domain problem. And there are some
obvious optimizations an editor could use, such as only lexing the
visible portion of text on the screen (and then lexing the rest in a
background thread which doesn't block the interface).

Aug 02 2012

Jacob Carlborg <doob me.com> writes:

On 2012-08-02 21:35, Walter Bright wrote:

 A good IDE should do its parsing in a separate thread, so the main user
 input thread remains crisp and responsive.

 If the user edits the text while the parsing is in progress, the
 background parsing thread simply abandons the current parse and starts
 over.

It still needs to update the editor view with the correct syntax 
highlighting which needs to be done in the same thread as the rest of 
the GUI.

-- 
/Jacob Carlborg

Aug 02 2012

Andrej Mitrovic <andrej.mitrovich gmail.com> writes:

On 8/2/12, Jacob Carlborg <doob me.com> wrote:
 It still needs to update the editor view with the correct syntax
 highlighting which needs to be done in the same thread as the rest of
 the GUI.

It can do that immediately for the text that's visible in the window
because ~100 lines of text can be lexed pretty damn instantly. As soon
as that's done the GUI should be responsive and the rest of the text
buffer should be lexed in the background.

Aug 02 2012

Jacob Carlborg <doob me.com> writes:

On 2012-08-02 22:54, Andrej Mitrovic wrote:

 It can do that immediately for the text that's visible in the window
 because ~100 lines of text can be lexed pretty damn instantly. As soon
 as that's done the GUI should be responsive and the rest of the text
 buffer should be lexed in the background.

That should work if the lexer can handle it.

-- 
/Jacob Carlborg

Aug 02 2012

Walter Bright <newshound2 digitalmars.com> writes:

On 8/2/2012 1:41 PM, Jacob Carlborg wrote:
 On 2012-08-02 21:35, Walter Bright wrote:

 A good IDE should do its parsing in a separate thread, so the main user
 input thread remains crisp and responsive.

 If the user edits the text while the parsing is in progress, the
 background parsing thread simply abandons the current parse and starts
 over.

 It still needs to update the editor view with the correct syntax highlighting
 which needs to be done in the same thread as the rest of the GUI.

The rendering code should be in yet a third thread.

An editor I wrote years ago had the rendering code in a separate thread from 
user input. You never had to wait to type in commands, the rendering would
catch 
up when it could. What was also effective was the rendering would abandon a 
render midstream and restart it if it detected that the underlying data had 
changed in the meantime. This meant that the display was never more than one 
render out of date.

Although the code itself wasn't any faster, it certainly *felt* faster with
this 
approach. It made for crisp editing even on a pig slow machine.

Aug 02 2012

Dmitry Olshansky <dmitry.olsh gmail.com> writes:

On 03-Aug-12 02:10, Walter Bright wrote:
 On 8/2/2012 1:41 PM, Jacob Carlborg wrote:
 On 2012-08-02 21:35, Walter Bright wrote:

 A good IDE should do its parsing in a separate thread, so the main user
 input thread remains crisp and responsive.

 If the user edits the text while the parsing is in progress, the
 background parsing thread simply abandons the current parse and starts
 over.

 It still needs to update the editor view with the correct syntax
 highlighting
 which needs to be done in the same thread as the rest of the GUI.

 The rendering code should be in yet a third thread.

OT:
It never ceases to amaze me how people miss this very simple point:
GUI runs on its own thread and shouldn't ever block on something (save 
for message pump itself, of course). Everything else (including possibly 
slow rendering) done on the side and then result (once ready) swiftly 
indicated on GUI.

Recent Windows 8 talks were in fact nothing new in this regard, but now 
even API is made so that it's harder to do something blocking in GUI thread.


 An editor I wrote years ago had the rendering code in a separate thread
 from user input. You never had to wait to type in commands, the
 rendering would catch up when it could. What was also effective was the
 rendering would abandon a render midstream and restart it if it detected
 that the underlying data had changed in the meantime. This meant that
 the display was never more than one render out of date.

 Although the code itself wasn't any faster, it certainly *felt* faster
 with this approach. It made for crisp editing even on a pig slow machine.


-- 
Dmitry Olshansky

Aug 02 2012

Jacob Carlborg <doob me.com> writes:

On 2012-08-03 00:25, Dmitry Olshansky wrote:

 OT:
 It never ceases to amaze me how people miss this very simple point:
 GUI runs on its own thread and shouldn't ever block on something (save
 for message pump itself, of course). Everything else (including possibly
 slow rendering) done on the side and then result (once ready) swiftly
 indicated on GUI.

I'm not entirely sure what you mean be "rendering" but most GUI systems 
are not thread safe and all GUI updates need to happen in the same thread.

-- 
/Jacob Carlborg

Aug 02 2012

Dmitry Olshansky <dmitry.olsh gmail.com> writes:

On 03-Aug-12 10:35, Jacob Carlborg wrote:
 On 2012-08-03 00:25, Dmitry Olshansky wrote:

 OT:
 It never ceases to amaze me how people miss this very simple point:
 GUI runs on its own thread and shouldn't ever block on something (save
 for message pump itself, of course). Everything else (including possibly
 slow rendering) done on the side and then result (once ready) swiftly
 indicated on GUI.

 I'm not entirely sure what you mean be "rendering" but most GUI systems
 are not thread safe and all GUI updates need to happen in the same thread.

Draw thing to an off screen  bitmap  then blit it to window (aye, pass 
back to UI thread a reference to the buffer with pixels). This technique 
been in use for decades. Imagine drawing some large intricate fractal it 
could easily take few seconds.

-- 
Dmitry Olshansky

Aug 03 2012

Jacob Carlborg <doob me.com> writes:

On 2012-08-03 18:49, Dmitry Olshansky wrote:

 Draw thing to an off screen  bitmap  then blit it to window (aye, pass
 back to UI thread a reference to the buffer with pixels). This technique
 been in use for decades. Imagine drawing some large intricate fractal it
 could easily take few seconds.

Ok, now I see.

-- 
/Jacob Carlborg

Aug 03 2012

Jacob Carlborg <doob me.com> writes:

On 2012-08-03 00:10, Walter Bright wrote:

 The rendering code should be in yet a third thread.

Most GUI systems are not thread safe. I know for sure that Cocoa on Mac 
OS X is not. All the changes to the GUI needs to happen in the same 
thread. But you can usually post a message from another thread to the 
GUI thread if you want to update the GUI from another thread.

 An editor I wrote years ago had the rendering code in a separate thread
 from user input. You never had to wait to type in commands, the
 rendering would catch up when it could. What was also effective was the
 rendering would abandon a render midstream and restart it if it detected
 that the underlying data had changed in the meantime. This meant that
 the display was never more than one render out of date.

 Although the code itself wasn't any faster, it certainly *felt* faster
 with this approach. It made for crisp editing even on a pig slow machine.

But if you type something and you don't see any new characters that will 
feel slow. Regardless of the application received the user input or not. 
The user doesn't care what's happening in the background of the 
application, it only cares about what it can see.

I've used many editors/IDEs where the GUI locks, for whatever reason, 
you continue typing and suddenly you get 5 characters at once. To me, 
that would indicate the the application do receive the user input but it 
cannot render it fast enough, for whatever reason.

-- 
/Jacob Carlborg

Aug 02 2012

deadalnix <deadalnix gmail.com> writes:

Le 02/08/2012 20:14, Marco Leise a �crit :
 Am Thu, 02 Aug 2012 14:26:58 +0200
 schrieb "Adam D. Ruppe"<destructionator gmail.com>:

 On Thursday, 2 August 2012 at 11:47:20 UTC, deadalnix wrote:
 lexer really isn't the performance bottleneck of dmd (or any
 compiler of a non trivial language).

 What if we're just using this lexer in something like a
 syntax highlighting text editor? I'd be annoyed if it
 stopped typing for a little bit cuz of processing.

 My pet peeve for editors is how to make the lexer work on only what I just
edited. It is really not trivial, but eased for example by editors that insert
closing } automatically, so the scopes don't mess up. A missing curly bracket
results in the parser reinterpreting the whole file. The same goes for string
terminators:

 If I type `` at the top of std.datetime in Mono-D right after "module
std.datetime;", I have a 12 (!) seconds wait, till the IDE becomes responsive
again.
 But if I use ' or ", then the editor inserts the second terminator and is done
with reparsing in 1-2 seconds.

 Can we really have a good lexer that is equally well suited for compilers as
for editors ?

A common solution to that problem is to cancel the computation when user 
type something again.

Aug 03 2012

Walter Bright <newshound2 digitalmars.com> writes:

On 8/2/2012 4:47 AM, deadalnix wrote:
 Le 02/08/2012 10:13, Walter Bright a écrit :
 As fast as the dmd one would be best.

 That'd be great but . . .

 lexer really isn't the performance bottleneck of dmd (or any compiler of a non
 trivial language). Additionally, anybody that have touched dmd source code can
 agree that usability/maintainability isn't as its best.

 Sacrificing some perfs in a non bottleneck area to increase ease of use make
 perfect sense.

A lexer is like a device driver, or a memory allocator. Having it be as fast as 
possible pays off dividends endlessly for whoever uses it.

And in my profiling, lexing speed is a big factor in the debug build times.

The Digital Mars C++ compiler (and its antecedents) made a lot of hay by being 
the fastest compiler available by a wide margin. A big advantage D has is also 
compile speed.

Please do not underestimate its importance.

Aug 02 2012

Dmitry Olshansky <dmitry.olsh gmail.com> writes:

On 02-Aug-12 22:06, Walter Bright wrote:
 On 8/2/2012 4:47 AM, deadalnix wrote:
 Le 02/08/2012 10:13, Walter Bright a écrit :
 As fast as the dmd one would be best.

 That'd be great but . . .

 lexer really isn't the performance bottleneck of dmd (or any compiler
 of a non
 trivial language). Additionally, anybody that have touched dmd source
 code can
 agree that usability/maintainability isn't as its best.

 Sacrificing some perfs in a non bottleneck area to increase ease of
 use make
 perfect sense.

 A lexer is like a device driver, or a memory allocator. Having it be as
 fast as possible pays off dividends endlessly for whoever uses it.

+1 Another point is that it's crystal clear *how* to optimize lexer, and 
gain some precious time. It is well trodden path with well known 
destination.
The time saved in lexer gives you some headroom in say optimizer, where 
you always can find more ways to improve result by sacrificing speed.

In the long run I totally expect millions of lines of D code in the wild.

 And in my profiling, lexing speed is a big factor in the debug build times.

 The Digital Mars C++ compiler (and its antecedents) made a lot of hay by
 being the fastest compiler available by a wide margin. A big advantage D
 has is also compile speed.

 Please do not underestimate its importance.


-- 
Dmitry Olshansky

Aug 02 2012

Andrej Mitrovic <andrej.mitrovich gmail.com> writes:

On 8/2/12, Dmitry Olshansky <dmitry.olsh gmail.com> wrote:
 In the long run I totally expect millions of lines of D code in the wild.

In a single project or globally? I hope not the former, D should
inspire people to write less code that does more. :)

Aug 02 2012

Walter Bright <newshound2 digitalmars.com> writes:

On 8/2/2012 1:20 PM, Dmitry Olshansky wrote:
 +1 Another point is that it's crystal clear *how* to optimize lexer, and gain
 some precious time. It is well trodden path with well known destination.

I'm less sure how well known it is. DMC++ was known for being *several times* 
faster than other C++ compilers, not just a few percent faster.

But we do have the DMD lexer which is useful as a benchmark and a guide. I
won't 
say it couldn't be made faster, but it does set a minimum bar for performance.

Aug 02 2012

Jacob Carlborg <doob me.com> writes:

On 2012-08-03 00:01, Walter Bright wrote:

 But we do have the DMD lexer which is useful as a benchmark and a guide.
 I won't say it couldn't be made faster, but it does set a minimum bar
 for performance.

I'm not sure how easy it would be to just measure the lexing phase of 
DMD. If it's easy someone would probably already have extracted the 
lexer from DMD.

-- 
/Jacob Carlborg

Aug 02 2012

dennis luehring <dl.soluz gmx.net> writes:

Am 03.08.2012 08:37, schrieb Jacob Carlborg:
 On 2012-08-03 00:01, Walter Bright wrote:

 But we do have the DMD lexer which is useful as a benchmark and a guide.
 I won't say it couldn't be made faster, but it does set a minimum bar
 for performance.

 I'm not sure how easy it would be to just measure the lexing phase of
 DMD. If it's easy someone would probably already have extracted the
 lexer from DMD.

wouldn't it be better to extract the lexer part of dmd into its own 
(hopefully small) library - that way the lexer is still useable by dmd 
AND benchmarkable from outside - it is then even possible to replace the 
dmd lexer by an D version due to the c linkage feature of D

Aug 02 2012

Jacob Carlborg <doob me.com> writes:

On 2012-08-03 08:49, dennis luehring wrote:

 wouldn't it be better to extract the lexer part of dmd into its own
 (hopefully small) library - that way the lexer is still useable by dmd
 AND benchmarkable from outside - it is then even possible to replace the
 dmd lexer by an D version due to the c linkage feature of D

That was what I meant. If it's easy someone would have already done it.

-- 
/Jacob Carlborg

Aug 02 2012

Walter Bright <newshound2 digitalmars.com> writes:

On 8/2/2012 11:37 PM, Jacob Carlborg wrote:
 I'm not sure how easy it would be to just measure the lexing phase of DMD. If
 it's easy someone would probably already have extracted the lexer from DMD.


You don't need to extract it to measure it. Just have it lex the source files
in 
a loop, and time that loop.

Aug 02 2012

Jacob Carlborg <doob me.com> writes:

On 2012-08-03 08:59, Walter Bright wrote:

 You don't need to extract it to measure it. Just have it lex the source
 files in a loop, and time that loop.

Well, that's the problem. It's not like DMD has a single "lex" function 
that does all the job.

Would it perhaps be possible to time Parser::parseModule and remove all 
things that doesn't seem related to lexing?

Remove stuff like:

a = new Identifiers();
md = new ModuleDeclaration(a, id, safe);

And similar.

-- 
/Jacob Carlborg

Aug 03 2012

Walter Bright <newshound2 digitalmars.com> writes:

On 8/3/2012 12:11 AM, Jacob Carlborg wrote:
 On 2012-08-03 08:59, Walter Bright wrote:

 You don't need to extract it to measure it. Just have it lex the source
 files in a loop, and time that loop.

 Well, that's the problem. It's not like DMD has a single "lex" function that
 does all the job.

 Would it perhaps be possible to time Parser::parseModule and remove all things
 that doesn't seem related to lexing?

 Remove stuff like:

 a = new Identifiers();
 md = new ModuleDeclaration(a, id, safe);

 And similar.

Look in doc.c at highlightCode2() for how to call the lexer by itself.

Aug 03 2012

Jacob Carlborg <doob me.com> writes:

On 2012-08-03 09:35, Walter Bright wrote:

 Look in doc.c at highlightCode2() for how to call the lexer by itself.

So basically:

Token tok;

//start timer

while (tok.value != TOKeof)
     lex.scan(&tok);

//end timer

Something like that?

-- 
/Jacob Carlborg

Aug 03 2012

Walter Bright <newshound2 digitalmars.com> writes:

On 8/3/2012 2:01 AM, Jacob Carlborg wrote:
 On 2012-08-03 09:35, Walter Bright wrote:

 Look in doc.c at highlightCode2() for how to call the lexer by itself.

 So basically:

 Token tok;

 //start timer

 while (tok.value != TOKeof)
      lex.scan(&tok);

 //end timer

 Something like that?


Pretty much. Or just call lex.nextToken() until you get a TOKeof.

Aug 03 2012

"Jonathan M Davis" <jmdavisProg gmx.com> writes:

On Thursday, August 02, 2012 11:06:39 Walter Bright wrote:
 On 8/2/2012 4:47 AM, deadalnix wrote:
 Le 02/08/2012 10:13, Walter Bright a écrit :
 As fast as the dmd one would be best.

 
 That'd be great but . . .
 
 lexer really isn't the performance bottleneck of dmd (or any compiler of a
 non trivial language). Additionally, anybody that have touched dmd source
 code can agree that usability/maintainability isn't as its best.
 
 Sacrificing some perfs in a non bottleneck area to increase ease of use
 make perfect sense.

 
 A lexer is like a device driver, or a memory allocator. Having it be as fast
 as possible pays off dividends endlessly for whoever uses it.
 
 And in my profiling, lexing speed is a big factor in the debug build times.
 
 The Digital Mars C++ compiler (and its antecedents) made a lot of hay by
 being the fastest compiler available by a wide margin. A big advantage D
 has is also compile speed.
 
 Please do not underestimate its importance.

Recent benchmarks by Iain Buclaw show that lexing is not a bottleneck in dmd, 
so it's natural for some folks to think that the lexing is not all that 
critical in comparison to other compiler components. However, the reason that 
lexing is not a bottleneck is almost ceratinly because the lexer is so 
thoroughly optimized. So, if it _didn't_ try and be as efficient as possible,
it 
_would_ be a bottleneck.

However, given that we're providing an API for a lexer rather than having the 
lexer be an internal component of a compiler, that _does_ put more pressure on 
making the API user-friendly. Hopefully that can be done without sacrificing 
any performance, but there may be a point where the only way to make it faster 
would be to make the API very unfriendly. If that's the case, then we're going 
to have some tough decisions to make. But we'll cross that bridge if/when we 
get to it.

- Jonathan M Davis

Aug 02 2012

Jonathan M Davis <jmdavisProg gmx.com> writes:

On Thursday, August 02, 2012 00:29:09 Walter Bright wrote:
 If we want to be able to operate on ranges of UTF-8 or UTF-16, we need to
 add a concept of variably-length encoded ranges so that it's possible to
 treat them as both their encoding and whatever they represent (e.g. code
 point or grapheme in the case of ranges of code units).

 
 No, this is not necessary.

It is for ranges in general. In the general case, a range of UTF-8 or UTF-16 
makes no sense whatsoever. Having range-based functions which understand the 
encodings and optimize accordingly can be very beneficial (which happens with 
strings but can't happen with general ranges without the concept of a 
variably-length encoded range like we have with forward range or random access 
range), but to actually have a range of UTF-8 or UTF-16 just wouldn't work. 
Range-based functions operate on elements, and doing stuff like filter or map
or 
reduce on code units doesn't make any sense at all.

- Jonathan M Davis

Aug 02 2012

Walter Bright <newshound2 digitalmars.com> writes:

On 8/2/2012 12:43 AM, Jonathan M Davis wrote:
 It is for ranges in general. In the general case, a range of UTF-8 or UTF-16
 makes no sense whatsoever. Having range-based functions which understand the
 encodings and optimize accordingly can be very beneficial (which happens with
 strings but can't happen with general ranges without the concept of a
 variably-length encoded range like we have with forward range or random access
 range), but to actually have a range of UTF-8 or UTF-16 just wouldn't work.
 Range-based functions operate on elements, and doing stuff like filter or map
or
 reduce on code units doesn't make any sense at all.

Yes, it can work.

Aug 02 2012

Jonathan M Davis <jmdavisProg gmx.com> writes:

On Thursday, August 02, 2012 01:14:30 Walter Bright wrote:
 On 8/2/2012 12:43 AM, Jonathan M Davis wrote:
 It is for ranges in general. In the general case, a range of UTF-8 or
 UTF-16 makes no sense whatsoever. Having range-based functions which
 understand the encodings and optimize accordingly can be very beneficial
 (which happens with strings but can't happen with general ranges without
 the concept of a variably-length encoded range like we have with forward
 range or random access range), but to actually have a range of UTF-8 or
 UTF-16 just wouldn't work. Range-based functions operate on elements, and
 doing stuff like filter or map or reduce on code units doesn't make any
 sense at all.

 
 Yes, it can work.

How? If you operate on a range of code units, then you're operating on 
individual code units, which almost never makes sense. There are plenty cases 
where a function which understands the encoding can avoid some of costs 
associated with decoding and whatnot, but since range-based functions operate 
on their elements, if the elementse are code units, a range-based function 
will operate on individual code units with _no_ understanding of the encoding 
at all. Ranges have no concept of encoding.

Do you really think that it makes sense for a function like map or filter to 
operate on individual code units? Because that's what would end up happening 
with a range of code units. Your average, range-based function only makes 
sense with _characters_, not code units. Functions which can operate on ranges 
of code units without screwing up the encoding are a rarity.

Unless a range-based function special cases a range-type which is variably-
lengthed encoded (e.g. string), it just isn't going to deal with the encoding 
properly. Either it operates on the encoding or the actual value, depending on 
what its element type is.

I concur that operating on strings as code units is better from the standpoint 
of efficiency, but it just doesn't work with a generic function without it 
having a special case which therefore _isn't_ generic.

- Jonathan M Davis

Aug 02 2012

Walter Bright <newshound2 digitalmars.com> writes:

On 8/2/2012 1:38 AM, Jonathan M Davis wrote:
 On Thursday, August 02, 2012 01:14:30 Walter Bright wrote:
 On 8/2/2012 12:43 AM, Jonathan M Davis wrote:
 It is for ranges in general. In the general case, a range of UTF-8 or
 UTF-16 makes no sense whatsoever. Having range-based functions which
 understand the encodings and optimize accordingly can be very beneficial
 (which happens with strings but can't happen with general ranges without
 the concept of a variably-length encoded range like we have with forward
 range or random access range), but to actually have a range of UTF-8 or
 UTF-16 just wouldn't work. Range-based functions operate on elements, and
 doing stuff like filter or map or reduce on code units doesn't make any
 sense at all.

 Yes, it can work.

 How?

Keep a 6 character buffer in your consumer. If you read a char with the high
bit 
set, start filling that buffer and then decode it.


 Do you really think that it makes sense for a function like map or filter to
 operate on individual code units? Because that's what would end up happening
 with a range of code units. Your average, range-based function only makes
 sense with _characters_, not code units. Functions which can operate on ranges
 of code units without screwing up the encoding are a rarity.

Rare or not, they are certainly possible, and the early versions of std.string 
did just that (although they weren't using ranges, the same techniques apply).

Aug 02 2012

deadalnix <deadalnix gmail.com> writes:

Le 02/08/2012 10:44, Walter Bright a écrit :
 On 8/2/2012 1:38 AM, Jonathan M Davis wrote:
 On Thursday, August 02, 2012 01:14:30 Walter Bright wrote:
 On 8/2/2012 12:43 AM, Jonathan M Davis wrote:
 It is for ranges in general. In the general case, a range of UTF-8 or
 UTF-16 makes no sense whatsoever. Having range-based functions which
 understand the encodings and optimize accordingly can be very
 beneficial
 (which happens with strings but can't happen with general ranges
 without
 the concept of a variably-length encoded range like we have with
 forward
 range or random access range), but to actually have a range of UTF-8 or
 UTF-16 just wouldn't work. Range-based functions operate on
 elements, and
 doing stuff like filter or map or reduce on code units doesn't make any
 sense at all.

 Yes, it can work.

 How?

 Keep a 6 character buffer in your consumer. If you read a char with the
 high bit set, start filling that buffer and then decode it.

How is that different than a manually done range of dchar ?

Aug 02 2012

Walter Bright <newshound2 digitalmars.com> writes:

On 8/2/2012 4:49 AM, deadalnix wrote:
 How is that different than a manually done range of dchar ?

The decoding is rarely necessary, even if non-ascii data is there. However, the 
range cannot decide if decoding is necessary - the receiver has to, hence the 
receiver does the decoding.

Aug 02 2012

"Jonathan M Davis" <jmdavisProg gmx.com> writes:

On Thursday, August 02, 2012 10:58:21 Walter Bright wrote:
 On 8/2/2012 4:49 AM, deadalnix wrote:
 How is that different than a manually done range of dchar ?

 
 The decoding is rarely necessary, even if non-ascii data is there. However,
 the range cannot decide if decoding is necessary - the receiver has to,
 hence the receiver does the decoding.

Hence why special-casing must be used to deal with variably-length encodings 
like UTF-8. If it's not something that the range can handle through front and 
popFront, then it's not going to work as a range. Having it be _possible_ to 
special-case but not require it allows for the type to works a range but still 
be efficient of the function operating on the range codes for it. But if you 
require the function to code for it, then it's not really a range.

- Jonathan M Davis

Aug 02 2012

Dmitry Olshansky <dmitry.olsh gmail.com> writes:

On 02-Aug-12 12:44, Walter Bright wrote:
 On 8/2/2012 1:38 AM, Jonathan M Davis wrote:
 On Thursday, August 02, 2012 01:14:30 Walter Bright wrote:
 On 8/2/2012 12:43 AM, Jonathan M Davis wrote:
 It is for ranges in general. In the general case, a range of UTF-8 or
 UTF-16 makes no sense whatsoever. Having range-based functions which
 understand the encodings and optimize accordingly can be very
 beneficial
 (which happens with strings but can't happen with general ranges
 without
 the concept of a variably-length encoded range like we have with
 forward
 range or random access range), but to actually have a range of UTF-8 or
 UTF-16 just wouldn't work. Range-based functions operate on
 elements, and
 doing stuff like filter or map or reduce on code units doesn't make any
 sense at all.

 Yes, it can work.

 How?

 Keep a 6 character buffer in your consumer. If you read a char with the
 high bit set, start filling that buffer and then decode it.

4 bytes is enough.

Since Unicode 5(?) the range of codepoints was defined to be 
0...0x10FFFF specifically so that it could be encoded in 4 bytes of UTF-8.


P.S. Looks like I'm too late for this party ;)


-- 
Dmitry Olshansky

Aug 02 2012

Walter Bright <newshound2 digitalmars.com> writes:

On 8/2/2012 8:46 AM, Dmitry Olshansky wrote:
 Keep a 6 character buffer in your consumer. If you read a char with the
 high bit set, start filling that buffer and then decode it.

 4 bytes is enough.

 Since Unicode 5(?) the range of codepoints was defined to be 0...0x10FFFF
 specifically so that it could be encoded in 4 bytes of UTF-8.

Yeah, but I thought 6 bytes would future proof it! (Inevitably, the Unicode 
committee will add more.)

 P.S. Looks like I'm too late for this party ;)

It affects you strongly, too, so I'm glad to see you join in.

Aug 02 2012

"Jonathan M Davis" <jmdavisProg gmx.com> writes:

On Thursday, August 02, 2012 01:44:18 Walter Bright wrote:
 On 8/2/2012 1:38 AM, Jonathan M Davis wrote:
 On Thursday, August 02, 2012 01:14:30 Walter Bright wrote:
 On 8/2/2012 12:43 AM, Jonathan M Davis wrote:
 It is for ranges in general. In the general case, a range of UTF-8 or
 UTF-16 makes no sense whatsoever. Having range-based functions which
 understand the encodings and optimize accordingly can be very beneficial
 (which happens with strings but can't happen with general ranges without
 the concept of a variably-length encoded range like we have with forward
 range or random access range), but to actually have a range of UTF-8 or
 UTF-16 just wouldn't work. Range-based functions operate on elements,
 and
 doing stuff like filter or map or reduce on code units doesn't make any
 sense at all.

 
 Yes, it can work.

 
 How?

 
 Keep a 6 character buffer in your consumer. If you read a char with the high
 bit set, start filling that buffer and then decode it.

And how on earth is that going to work as a range? Range-based functions 
operate on elements. They use empty, front, popFront, etc. If front doesn't 
return an element that a range-based function can operate on without caring 
what it is, then that type isn't going to work as a range. If you need the 
consumer to be doing something special, then that means you need to special 
case it for that range type. And that's what you're doing when you special-
case range-base functions for strings.

So, because of how front and popFront work, you either have to have a range of 
code units or a range of code points. With a range of code units, the element 
type is a code unit, so any operations that you do will be operating on 
individual code units, not code points. With a range of code points, any 
operations being done will operate on code points, which _will_ require 
decoding as long as you're actually using the range API. You only make strings 
more efficent by special-casing the function for them such that it understands 
unicode and will operate on the string in the most efficient way according to 
how that string's encoding works.

You seem to be arguing that we can somehow have a generic range API which 
operates on strings, and yet what you're saying a function using those ranges 
must do (e.g. having a buffer of multipl code units) requires that a range-
based function operate in a non-generic manner for strings. If the consumer 
has to do _anything_ special for strings, then what it's doing is non-generic 
with regards to strings.

I agree that we should be making string operations more efficient by taking
code 
units into account, but I completely disagree that we can do that generically. 
At best, we could add the concept of a variably-length encoded range such that 
a range-based function could special case them and use the encoding where 
appropriate, but all that buys us is the ability to special case variably-
length encoded ranges _other_ than strings (since we can already special case 
strings), and I don't think that it's even possible to deal with a variably-
length encoded range properly without understanding what the encoding is, in 
which case, we'd be dealing with special range types which were specifically 
UTF-8 encoded or UTF-16 encoded, and range-based functions would be special-
casing _those_ rather than a generic variably-length encoded range.

In either case, because the consumer must do something other than simply 
operate on front, popFront, empty, etc., you're _not_ dealing with the range 
API but rather working around it.

- Jonathan M Davis

Aug 02 2012

travert phare.normalesup.org (Christophe Travert) writes:

"Jonathan M Davis" , dans le message (digitalmars.D:174059), a �crit�:
 In either case, because the consumer must do something other than simply 
 operate on front, popFront, empty, etc., you're _not_ dealing with the range 
 API but rather working around it.

In some case a range of dchar is useful. In some case a range of char is 
sufficient, and much more efficient. And for the UTF-aware programer, it 
makes much sense.

The fact that you sometimes have to buffer some information because the 
meaning of one element is affected by the previous element is a normal 
issue in many algoritms, it's not working arround anything. Your lexer 
uses range API, would you say it is just working arroung range because 
you have to take into account several caracters (let they be dchars) at 
the same time to know what they are meaning?

Aug 02 2012

Walter Bright <newshound2 digitalmars.com> writes:

On 8/2/2012 1:26 PM, Jonathan M Davis wrote:
 On Thursday, August 02, 2012 01:44:18 Walter Bright wrote:
 On 8/2/2012 1:38 AM, Jonathan M Davis wrote:
 On Thursday, August 02, 2012 01:14:30 Walter Bright wrote:
 On 8/2/2012 12:43 AM, Jonathan M Davis wrote:
 It is for ranges in general. In the general case, a range of UTF-8 or
 UTF-16 makes no sense whatsoever. Having range-based functions which
 understand the encodings and optimize accordingly can be very beneficial
 (which happens with strings but can't happen with general ranges without
 the concept of a variably-length encoded range like we have with forward
 range or random access range), but to actually have a range of UTF-8 or
 UTF-16 just wouldn't work. Range-based functions operate on elements,
 and
 doing stuff like filter or map or reduce on code units doesn't make any
 sense at all.

 Yes, it can work.

 How?

 Keep a 6 character buffer in your consumer. If you read a char with the high
 bit set, start filling that buffer and then decode it.

 And how on earth is that going to work as a range?

1. read a character from the range
2. if the character is the start of a multibyte character, put the character in 
the buffer
3. keep reading from the range until you've got the whole of the multibyte
character
4. convert that 6 (or 4) character buffer into a dchar

Remember, its the consumer doing the decoding, not the input range.

 I agree that we should be making string operations more efficient by taking
code
 units into account, but I completely disagree that we can do that generically.

The requirement I listed was that the input range present UTF8 characters. Not 
any random character type.

Aug 02 2012

"Jonathan M Davis" <jmdavisProg gmx.com> writes:

On Thursday, August 02, 2012 15:14:17 Walter Bright wrote:
 Remember, its the consumer doing the decoding, not the input range.

But that's the problem. The consumer has to treat character ranges specially 
to make this work. It's not generic. If it were generic, then it would simply 
be using front, popFront, etc. It's going to have to special case strings to 
do the buffering that you're suggesting. And if you have to special case 
strings, then how is that any different from what we have now?

If you're arguing that strings should be treated as ranges of code units, then 
pretty much _every_ range-based function will have to special case strings to 
even work correctly - otherwise it'll be operating on individual code points 
rather than code points (e.g. filtering code units rather than code points, 
which would generate an invalid string). This makes the default behavior 
incorrect, forcing _everyone_ to special case strings _everywhere_ if they 
want correct behavior with ranges which are strings. And efficiency means 
nothing if the result is wrong.

As it is now, the default behavior of strings with range-based functions is 
correct but inefficient, so at least we get correct code. And if someone wants 
their string processing to be efficient, then they special case strings and do 
things like the buffering that you're suggesting. So, we have correct by 
default with efficiency as an option. The alternative that you seem to be 
suggesting (making strings be treated as ranges of code units) means that it 
would be fast by default but correct as an option, which is completely 
backwards IMHO. Efficiency is important, but it's pointless how efficient 
something is if it's wrong, and expecting that your average programmer is 
going to write unicode-aware code which functions correctly is completely 
unrealistic.

- Jonathan M Davis

Aug 02 2012

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

On 8/2/12 6:38 PM, Jonathan M Davis wrote:
 On Thursday, August 02, 2012 15:14:17 Walter Bright wrote:
 Remember, its the consumer doing the decoding, not the input range.

 But that's the problem. The consumer has to treat character ranges specially
 to make this work. It's not generic.

It is generic! It's just in another dimension: it operates on any range 
of _bytes_.

Andrei

Aug 02 2012

"Jonathan M Davis" <jmdavisProg gmx.com> writes:

On Thursday, August 02, 2012 18:41:23 Andrei Alexandrescu wrote:
 On 8/2/12 6:38 PM, Jonathan M Davis wrote:
 On Thursday, August 02, 2012 15:14:17 Walter Bright wrote:
 Remember, its the consumer doing the decoding, not the input range.

 
 But that's the problem. The consumer has to treat character ranges
 specially to make this work. It's not generic.

 
 It is generic! It's just in another dimension: it operates on any range
 of _bytes_.

So, a function which does the buffering of code units like Walter suggests is 
generic? It's doing something that makes no sense outside of strings. And if 
it's doing that with strings and something else with everything else (which it 
_has_ to do if the same function is going to work with both unicode as well as 
range types that have nothing to do with unicode), then it's special casing 
strings and therefore is _not_ generic.

Sure, you could have a function which specifically operates on ranges of code 
units and understands how unicode works and is written accordingly, but then 
that function is specific to ranges of code units and is only generic with 
regards to various ranges of code units. It can't operate on generic ranges 
like functions such as map and filter can.

- Jonathan M Davis

Aug 02 2012

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

On 8/2/12 6:54 PM, Jonathan M Davis wrote:
 So, a function which does the buffering of code units like Walter suggests is
 generic?

Of course, because it operates on bytes read from memory, files, or 
sockets etc.

 It's doing something that makes no sense outside of strings.

Right. The bytes represent UTF-8 encoded strings, except their type is 
ubyte so there's no processing in the library.

 And if
 it's doing that with strings and something else with everything else (which it
 _has_ to do if the same function is going to work with both unicode as well as
 range types that have nothing to do with unicode), then it's special casing
 strings and therefore is _not_ generic.

This is automatically destroyed because its assumption was destroyed.

 Sure, you could have a function which specifically operates on ranges of code
 units and understands how unicode works and is written accordingly, but then
 that function is specific to ranges of code units and is only generic with
 regards to various ranges of code units. It can't operate on generic ranges
 like functions such as map and filter can.

Yes, and I think that's exactly what the doctor prescribed here.


Andrei

Aug 02 2012

"Jonathan M Davis" <jmdavisProg gmx.com> writes:

On Thursday, August 02, 2012 19:06:32 Andrei Alexandrescu wrote:
 Sure, you could have a function which specifically operates on ranges of
 code units and understands how unicode works and is written accordingly,
 but then that function is specific to ranges of code units and is only
 generic with regards to various ranges of code units. It can't operate on
 generic ranges like functions such as map and filter can.

 
 Yes, and I think that's exactly what the doctor prescribed here.

It may be the best approach for the lexer (though I'm not convinced; I'll have 
to think about it more), but Walter seems to be arguing that that strings 
should be treated as ranges of code units in general, which I think is 
completely wrong.

- Jonathan M Davis

Aug 02 2012

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

On 8/2/12 7:18 PM, Jonathan M Davis wrote:
 On Thursday, August 02, 2012 19:06:32 Andrei Alexandrescu wrote:
 Sure, you could have a function which specifically operates on ranges of
 code units and understands how unicode works and is written accordingly,
 but then that function is specific to ranges of code units and is only
 generic with regards to various ranges of code units. It can't operate on
 generic ranges like functions such as map and filter can.

 Yes, and I think that's exactly what the doctor prescribed here.

 It may be the best approach for the lexer (though I'm not convinced; I'll have
 to think about it more),

Your insights are always appreciated; even their Cliff notes :o).

 but Walter seems to be arguing that that strings
 should be treated as ranges of code units in general, which I think is
 completely wrong.

I think Walter has very often emphasized the need for the lexer to be 
faster than the usual client software. My perception is that he's 
discussing lexer design in understanding there's a need for a less 
comfortable approach, namely do decoding in client.

Andrei

Aug 02 2012

"Jonathan M Davis" <jmdavisProg gmx.com> writes:

On Thursday, August 02, 2012 19:30:47 Andrei Alexandrescu wrote:
 On 8/2/12 7:18 PM, Jonathan M Davis wrote:
 Your insights are always appreciated; even their Cliff notes :o).

LOL. Well, I'm not about to decide on the best approach to this without 
thinking through it more. What I've been doing manages to deal quite nicely 
with avoiding unnecessary decoding and still allows for the lexing of ranges 
of dchar which aren't strings (though there's obviously an efficiency hit 
there), and it really isn't complicated or messy thanks to some basic mixins 
that I've been using. Switching to operating specifically on code units and not 
accepting ranges of dchar at all has some serious ramifications, and I have to 
think through them all before I take a position on that.

 but Walter seems to be arguing that that strings
 should be treated as ranges of code units in general, which I think is
 completely wrong.

 
 I think Walter has very often emphasized the need for the lexer to be
 faster than the usual client software. My perception is that he's
 discussing lexer design in understanding there's a need for a less
 comfortable approach, namely do decoding in client.

That may be, but if he's arguing that strings should _always_ be treated as 
range of code units - as in all D programs, most of which don't have anything 
to do with lexers (other than when they're compiled) - then I'm definitely 
going to object to that, and it's my understanding that that's what he's 
arguing. But maybe I've misunderstood.

I've been arguing that strings should still be treated as ranges of code 
points and that that does not preclude making the lexer efficiently operate on 
code units when operating on strings even if it operates on ranges of dchar. I 
think that whether making the lexer operate on ranges of dchar but specialize 
on strings is a better approach or making it operate specifically on ranges of 
code units is a better approach depends on what we want it to be usable with. 
It should be just as fast with strings in either case, so it becomes a 
question of how we want to handle ranges which _aren't_ strings.

I suppose that we could make it operate on code units and just let ranges of 
dchar have UTF-32 as their code unit (since dchar is both a code unit and a 
code point), then ranges of dchar will still work but ranges of char and wchar 
will _also_ work. Hmmm. As I said, I'll have to think this through a bit.

- Jonathan M Davis

Aug 02 2012

Jonathan M Davis <jmdavisProg gmx.com> writes:

On Thursday, August 02, 2012 19:52:35 Jonathan M Davis wrote:
 I suppose that we could make it operate on code units and just let ranges of
 dchar have UTF-32 as their code unit (since dchar is both a code unit and a
 code point), then ranges of dchar will still work but ranges of char and
 wchar will _also_ work. Hmmm. As I said, I'll have to think this through a
 bit.

LOL. It looks like taking this approach results in almost identical code to 
what I've been doing. The main difference is that if you're dealing with a 
range other than a string, you need to use decode instead of front, which 
means that decode is going to need to work with more than just strings 
(probably stride too). I'll have to create a pull request for that.

But unless you restrict it to strings and ranges of code units which are 
random access, you still have to worry about stuff like using range[0] vs 
range.front depending on the type, so my mixin approach is still applicable, 
and it makes it very easy to switch what I'm doing, since there are very few 
lines that need to be tweaked.

So, I guess that I'll be taking the approach of taking ranges of char, wchar, 
and dchar and treat them all as ranges of code units. So, it'll work with 
everything that it worked with before but will now also work with ranges of 
char and wchar. There's still a performance hit if you do something like 
passing it filter!"true(source), but there's no way to fix that without 
disallowing dchar ranges entirely, which would be unnecessarily restrictive. 
Once you allow arbitrary ranges of char rather than requiring strings, the 
extra code required to allow ranges of wchar and dchar is trivial. It's stuff 
like worrying about range[0] vs range.front which complicates things (even if 
front is a code unit rather than a code point), and using string mixins makes 
it so that the code with the logic is just as simple as it would be with 
strings. So, I think that I can continue almost exactly as I have been and 
still achieve what Walter wants. The main issue that I have (beyond finishing 
what I haven't gotten to yet) is changing how I handle errors and comments, 
since I currently have them as tokens, but that shouldn't be terribly hard to 
fix.

- Jonathan M Davis

Aug 02 2012

Walter Bright <newshound2 digitalmars.com> writes:

On 8/2/2012 4:30 PM, Andrei Alexandrescu wrote:
 I think Walter has very often emphasized the need for the lexer to be faster
 than the usual client software. My perception is that he's discussing lexer
 design in understanding there's a need for a less comfortable approach, namely
 do decoding in client.

+1

Aug 02 2012

Walter Bright <newshound2 digitalmars.com> writes:

On 8/2/2012 3:38 PM, Jonathan M Davis wrote:
 On Thursday, August 02, 2012 15:14:17 Walter Bright wrote:
 Remember, its the consumer doing the decoding, not the input range.

 But that's the problem. The consumer has to treat character ranges specially
 to make this work. It's not generic. If it were generic, then it would simply
 be using front, popFront, etc. It's going to have to special case strings to
 do the buffering that you're suggesting. And if you have to special case
 strings, then how is that any different from what we have now?

No, the consumer can do its own buffering. It only needs a 4 character buffer, 
worst case.


 If you're arguing that strings should be treated as ranges of code units, then
 pretty much _every_ range-based function will have to special case strings to
 even work correctly - otherwise it'll be operating on individual code points
 rather than code points (e.g. filtering code units rather than code points,
 which would generate an invalid string). This makes the default behavior
 incorrect, forcing _everyone_ to special case strings _everywhere_ if they
 want correct behavior with ranges which are strings. And efficiency means
 nothing if the result is wrong.

No, I'm arguing that the LEXER should accept a UTF8 input range for its input.
I 
am not making a general argument about ranges, characters, or Phobos.


 As it is now, the default behavior of strings with range-based functions is
 correct but inefficient, so at least we get correct code. And if someone wants
 their string processing to be efficient, then they special case strings and do
 things like the buffering that you're suggesting. So, we have correct by
 default with efficiency as an option. The alternative that you seem to be
 suggesting (making strings be treated as ranges of code units) means that it
 would be fast by default but correct as an option, which is completely
 backwards IMHO. Efficiency is important, but it's pointless how efficient
 something is if it's wrong, and expecting that your average programmer is
 going to write unicode-aware code which functions correctly is completely
 unrealistic.

Efficiency for the *lexer* is of *paramount* importance. I don't anticipate 
std.d.lexer will be implemented by some random newbie, I expect it to be 
carefully implemented and to do Unicode correctly, regardless of how difficult 
or easy that may be.

I seem to utterly fail at making this point.

The same point applies to std.regex - efficiency is terribly, terribly
important 
for it. Everyone judges regexes by their speed, and nobody cares how hard they 
are to implement to get that speed.

To reiterate another point, since we are in the compiler business, people will 
expect std.d.lexer to be of top quality, not some bag on the side. It needs to 
be usable as a base for writing a professional quality compiler. It's the
reason 
why I'm pushing much harder on this than I do for other modules.

Aug 02 2012

Jonathan M Davis <jmdavisProg gmx.com> writes:

On Thursday, August 02, 2012 19:40:13 Walter Bright wrote:
 No, I'm arguing that the LEXER should accept a UTF8 input range for its
 input. I am not making a general argument about ranges, characters, or
 Phobos.

I think that this is the main point of misunderstanding then. From your 
comments, you seemed to me to be arguing for strings being treated as ranges 
of code units all the time, which really doesn't make sense.

- Jonathan M Davis

Aug 02 2012

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

On 8/2/12 10:40 PM, Walter Bright wrote:
 To reiterate another point, since we are in the compiler business,
 people will expect std.d.lexer to be of top quality, not some bag on the
 side. It needs to be usable as a base for writing a professional quality
 compiler. It's the reason why I'm pushing much harder on this than I do
 for other modules.

The lexer must be configurable enough to tokenize other languages than 
D. I confess I'm very unhappy that there seem to be no less than three 
people determined to write lexers for D. We're wasting precious talent 
and resources doubly. Once, several people are working in parallel on 
the same product. Second, none of them is actually solving the problem 
that should be solved.

Andrei

Aug 02 2012

Jonathan M Davis <jmdavisProg gmx.com> writes:

On Thursday, August 02, 2012 23:00:41 Andrei Alexandrescu wrote:
 On 8/2/12 10:40 PM, Walter Bright wrote:
 To reiterate another point, since we are in the compiler business,
 people will expect std.d.lexer to be of top quality, not some bag on the
 side. It needs to be usable as a base for writing a professional quality
 compiler. It's the reason why I'm pushing much harder on this than I do
 for other modules.

 
 The lexer must be configurable enough to tokenize other languages than
 D. I confess I'm very unhappy that there seem to be no less than three
 people determined to write lexers for D. We're wasting precious talent
 and resources doubly. Once, several people are working in parallel on
 the same product. Second, none of them is actually solving the problem
 that should be solved.

You're not going to get as fast a lexer if it's not written specifically for D. 
Writing a generic lexer is a different problem. It's also one that needs to be 
solved, but I think that it's a mistake to think that a generic lexer is going 
to be able to be as fast as one specifically optimized for D. And there are 
several people already working on the generic stuff (like Phillipe).

- Jonathan M Davis

Aug 02 2012

Timon Gehr <timon.gehr gmx.ch> writes:

On 08/03/2012 05:08 AM, Jonathan M Davis wrote:
 On Thursday, August 02, 2012 23:00:41 Andrei Alexandrescu wrote:
 On 8/2/12 10:40 PM, Walter Bright wrote:
 To reiterate another point, since we are in the compiler business,
 people will expect std.d.lexer to be of top quality, not some bag on the
 side. It needs to be usable as a base for writing a professional quality
 compiler. It's the reason why I'm pushing much harder on this than I do
 for other modules.

 The lexer must be configurable enough to tokenize other languages than
 D. I confess I'm very unhappy that there seem to be no less than three
 people determined to write lexers for D. We're wasting precious talent
 and resources doubly. Once, several people are working in parallel on
 the same product. Second, none of them is actually solving the problem
 that should be solved.

 You're not going to get as fast a lexer if it's not written specifically for D.

This should certainly be possible.

 Writing a generic lexer is a different problem.

It is a more general problem.

Aug 02 2012

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

On 8/2/12 11:08 PM, Jonathan M Davis wrote:
 You're not going to get as fast a lexer if it's not written specifically for D.
 Writing a generic lexer is a different problem. It's also one that needs to be
 solved, but I think that it's a mistake to think that a generic lexer is going
 to be able to be as fast as one specifically optimized for D.

Do you have any evidence to back that up? I mean you're just saying it.

Andrei

Aug 02 2012

Jonathan M Davis <jmdavisProg gmx.com> writes:

On Thursday, August 02, 2012 23:41:39 Andrei Alexandrescu wrote:
 On 8/2/12 11:08 PM, Jonathan M Davis wrote:
 You're not going to get as fast a lexer if it's not written specifically
 for D. Writing a generic lexer is a different problem. It's also one that
 needs to be solved, but I think that it's a mistake to think that a
 generic lexer is going to be able to be as fast as one specifically
 optimized for D.

 
 Do you have any evidence to back that up? I mean you're just saying it.

Because all of the rules are built directly into the code. You don't have to 
use regexes or anything like that. Pieces of the lexer could certainly be 
generic or copied over to other lexers just fine, but when you write the lexer 
by hand specifically for D, you can guarantee that it checks exactly what it 
needs to for D without any extra cruft or lost efficiency due to decoding where 
it doesn't need to or checking an additional character at any point or 
anything along those lines. And tuning it is much easier, because you have 
control over the whole thing. Also, given features such as token strings, I 
would think that using a generic lexer on D would be rather difficult anyway.

If someone wants to try and write a generic lexer for D and see if they can 
beat out any hand-written ones, then more power to them, but I don't see how 
you could possibly expect to shave the operations down to the bare minimum 
necessary to get the job done with a generic lexer, whereas a hand-written 
parser can do that given enough effort.

- Jonathan M Davis

Aug 02 2012

Timon Gehr <timon.gehr gmx.ch> writes:

On 08/03/2012 05:53 AM, Jonathan M Davis wrote:
 On Thursday, August 02, 2012 23:41:39 Andrei Alexandrescu wrote:
 On 8/2/12 11:08 PM, Jonathan M Davis wrote:
 You're not going to get as fast a lexer if it's not written specifically
 for D. Writing a generic lexer is a different problem. It's also one that
 needs to be solved, but I think that it's a mistake to think that a
 generic lexer is going to be able to be as fast as one specifically
 optimized for D.

 Do you have any evidence to back that up? I mean you're just saying it.

 Because all of the rules are built directly into the code. You don't have to
 use regexes or anything like that.

The parts that can be specified with simple regexen are certainly not a
problem. A generic string mixin based lexer should be able to generate
very close to optimal code by eg. merging common token prefixes and the
like.

 Pieces of the lexer could certainly be
 generic or copied over to other lexers just fine, but when you write the lexer
 by hand specifically for D, you can guarantee that it checks exactly what it
 needs to for D without any extra cruft or lost efficiency due to decoding where
 it doesn't need to or checking an additional character at any point or
 anything along those lines.

This is achievable if it is fully generic as well. Just add generic
peephole optimizations until the generated lexer is identical to what
the hand-written one would have looked like.

 And tuning it is much easier,

Yes.

 because you have control over the whole thing.

The library writer has control over the whole thing in each case.

 Also, given features such as token strings, I
 would think that using a generic lexer on D would be rather difficult anyway.

It would of course need to incorporate custom parsing routine support.

 If someone wants to try and write a generic lexer for D and see if they can
 beat out any hand-written ones,

I'll possibly give it a shot if I can find the time.

 then more power to them, but I don't see how
 you could possibly expect to shave the operations down to the bare minimum
 necessary to get the job done with a generic lexer, whereas a hand-written
 parser can do that given enough effort.

If it is optimal for D lexing and close-optimal or optimal for other
languages then it is profoundly more useful than just a D lexer.

Aug 02 2012

Jonathan M Davis <jmdavisProg gmx.com> writes:

On Friday, August 03, 2012 06:14:08 Timon Gehr wrote:
 If it is optimal for D lexing and close-optimal or optimal for other
 languages then it is profoundly more useful than just a D lexer.

If fully concur that we should have a generic lexer, but unless the generic 
lexer can be just as fast as a hand-written one (which I seriously question), 
then I definitely think that we're going to want both, not just a generic one.

- Jonathan M Davis

Aug 02 2012

Philippe Sigaud <philippe.sigaud gmail.com> writes:

On Fri, Aug 3, 2012 at 6:14 AM, Timon Gehr <timon.gehr gmx.ch> wrote:

 If someone wants to try and write a generic lexer for D and see if they
 can
 beat out any hand-written ones,


 I'll possibly give it a shot if I can find the time.

I propose we let him finish std.lexer, test it with Jonathan, benchmark it, etc.
Once it's out, we can create a small group interested in getting
generic and we can try and decorticate it at our leasure.

Aug 02 2012

deadalnix <deadalnix gmail.com> writes:

Le 03/08/2012 05:41, Andrei Alexandrescu a écrit :
 On 8/2/12 11:08 PM, Jonathan M Davis wrote:
 You're not going to get as fast a lexer if it's not written
 specifically for D.
 Writing a generic lexer is a different problem. It's also one that
 needs to be
 solved, but I think that it's a mistake to think that a generic lexer
 is going
 to be able to be as fast as one specifically optimized for D.

 Do you have any evidence to back that up? I mean you're just saying it.

 Andrei

I think it can be as fast if the lexer is compile time created. I'd also 
argue that it is better to have a D lexer ASAP, even if it isn't as 
configurable as it could be.

Better something now that something better later. We can make it better 
later anyway.

Aug 03 2012

"Bernard Helyer" <b.helyer gmail.com> writes:

On Friday, 3 August 2012 at 03:00:42 UTC, Andrei Alexandrescu 
wrote:
 The lexer must be configurable enough to tokenize other 
 languages than D.

You're going to have to defend that one.

Aug 02 2012

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

On 8/2/12 11:11 PM, Bernard Helyer wrote:
 On Friday, 3 August 2012 at 03:00:42 UTC, Andrei Alexandrescu wrote:
 The lexer must be configurable enough to tokenize other languages than D.

 You're going to have to defend that one.

I wouldn't know how to. To me it's all too obvious it's better to have a 
tool that allows writing lexers for many languages (and incidentally use 
it for D), than a tool that can only tokenize D code.

Andrei

Aug 02 2012

Walter Bright <newshound2 digitalmars.com> writes:

On 8/2/2012 8:00 PM, Andrei Alexandrescu wrote:
 On 8/2/12 10:40 PM, Walter Bright wrote:
 To reiterate another point, since we are in the compiler business,
 people will expect std.d.lexer to be of top quality, not some bag on the
 side. It needs to be usable as a base for writing a professional quality
 compiler. It's the reason why I'm pushing much harder on this than I do
 for other modules.

 The lexer must be configurable enough to tokenize other languages than D. I
 confess I'm very unhappy that there seem to be no less than three people
 determined to write lexers for D. We're wasting precious talent and resources
 doubly. Once, several people are working in parallel on the same product.
 Second, none of them is actually solving the problem that should be solved.

I agree and I hope the three can combine their efforts with the best ideas of
each.

I don't think the lexer needs to be configurable to other languages. I think 
it's fine that it be custom for D. However, and this is a big however, its 
user-facing interface should be usable for other languages, I see no obvious 
reason why this cannot be.

Aug 02 2012

"Bernard Helyer" <b.helyer gmail.com> writes:

On Friday, 3 August 2012 at 03:14:14 UTC, Walter Bright wrote:
 On 8/2/2012 8:00 PM, Andrei Alexandrescu wrote:
 On 8/2/12 10:40 PM, Walter Bright wrote:
 To reiterate another point, since we are in the compiler 
 business,
 people will expect std.d.lexer to be of top quality, not some 
 bag on the
 side. It needs to be usable as a base for writing a 
 professional quality
 compiler. It's the reason why I'm pushing much harder on this 
 than I do
 for other modules.

 The lexer must be configurable enough to tokenize other 
 languages than D. I
 confess I'm very unhappy that there seem to be no less than 
 three people
 determined to write lexers for D. We're wasting precious 
 talent and resources
 doubly. Once, several people are working in parallel on the 
 same product.
 Second, none of them is actually solving the problem that 
 should be solved.

 I agree and I hope the three can combine their efforts with the 
 best ideas of each.

If the other guys think they've got it, then I can withdraw my
efforts. I was just thinking I had a lexer just sitting around,
may as well use it, but if the other guys have it, then I'm fine
with withdrawing.

Aug 02 2012

Jonathan M Davis <jmdavisProg gmx.com> writes:

On Friday, August 03, 2012 05:36:05 Bernard Helyer wrote:
 If the other guys think they've got it, then I can withdraw my
 efforts. I was just thinking I had a lexer just sitting around,
 may as well use it, but if the other guys have it, then I'm fine
 with withdrawing.

I'm a fair ways along already. Making changes according to what Walter wants 
will slow me down some but not a lot. It's something that I've wanted to do 
for quite some time but just finally got around to starting a couple of weeks 
ago. So, I definitely want to complete it regardless of what anyone else is 
doing. It also gives me an opportunity to make sure that the spec and dmd 
match (and I've already found a few bugs with the spec and fixed them).

Feel free to do whatever you want, but I'm definitely going to complete what 
I'm working on.

- Jonathan M Davis

Aug 02 2012

"Bernard Helyer" <b.helyer gmail.com> writes:

On Friday, 3 August 2012 at 03:59:29 UTC, Jonathan M Davis wrote:
 On Friday, August 03, 2012 05:36:05 Bernard Helyer wrote:
 If the other guys think they've got it, then I can withdraw my
 efforts. I was just thinking I had a lexer just sitting around,
 may as well use it, but if the other guys have it, then I'm 
 fine
 with withdrawing.

 I'm a fair ways along already. Making changes according to what 
 Walter wants
 will slow me down some but not a lot. It's something that I've 
 wanted to do
 for quite some time but just finally got around to starting a 
 couple of weeks
 ago. So, I definitely want to complete it regardless of what 
 anyone else is
 doing. It also gives me an opportunity to make sure that the 
 spec and dmd
 match (and I've already found a few bugs with the spec and 
 fixed them).

 Feel free to do whatever you want, but I'm definitely going to 
 complete what
 I'm working on.

 - Jonathan M Davis

I'll let you get on with it then. I'll amuse myself with the
thought of someone asking why SDC doesn't use std.d.lexer or
a parser generator. I'll then hit them with my cane, and tell
them to get off of my lawn. :P

Aug 02 2012

Jonathan M Davis <jmdavisProg gmx.com> writes:

On Friday, August 03, 2012 06:02:29 Bernard Helyer wrote:
 I'll let you get on with it then. I'll amuse myself with the
 thought of someone asking why SDC doesn't use std.d.lexer or
 a parser generator. I'll then hit them with my cane, and tell
 them to get off of my lawn. :P

Well, if std.d.lexer is done correctly, then it should at least be very usable 
by SDC, though whether it would be worth changing SDC to use it, I have no 
idea. That'll be up to you and the other SDC devs. But when code gets posted, 
you should definitely chime in on it.

- Jonathan M Davis

Aug 02 2012

"Jakob Ovrum" <jakobovrum gmail.com> writes:

On Friday, 3 August 2012 at 04:02:34 UTC, Bernard Helyer wrote:
 I'll let you get on with it then. I'll amuse myself with the
 thought of someone asking why SDC doesn't use std.d.lexer or
 a parser generator. I'll then hit them with my cane, and tell
 them to get off of my lawn. :P

I don't think you should give up on this adaption. SDC's lexer is 
already complete. I don't know if anyone else can claim that. 
Apart from that, any changes made will almost certainly benefit 
SDC as well.

Aug 02 2012

Philippe Sigaud <philippe.sigaud gmail.com> writes:

On Fri, Aug 3, 2012 at 5:59 AM, Jonathan M Davis <jmdavisProg gmx.com> wrote:
 On Friday, August 03, 2012 05:36:05 Bernard Helyer wrote:
 If the other guys think they've got it, then I can withdraw my
 efforts. I was just thinking I had a lexer just sitting around,
 may as well use it, but if the other guys have it, then I'm fine
 with withdrawing.

 I'm a fair ways along already. Making changes according to what Walter wants
 will slow me down some but not a lot. It's something that I've wanted to do
 for quite some time but just finally got around to starting a couple of weeks
 ago. So, I definitely want to complete it regardless of what anyone else is
 doing. It also gives me an opportunity to make sure that the spec and dmd
 match (and I've already found a few bugs with the spec and fixed them).

 Feel free to do whatever you want, but I'm definitely going to complete what
 I'm working on.

Look, many of us here were interested in your idea of having comments
and errors lexed as tokens.
Could it be possible to just add two static policies errorBehavior {
report, skip, ...} and comments { asToken, grouped }?
That way, a user just say;

// Walter
auto toks = Lexer!(comments.grouped)(code); // May be the default value
// Other:
auto toks = Lexer!(comments.asToken)(code);

It's just a small static if away...

Aug 02 2012

Piotr Szturmaj <bncrbme jadamspam.pl> writes:

Walter Bright wrote:
 On 8/2/2012 1:26 PM, Jonathan M Davis wrote:
 On Thursday, August 02, 2012 01:44:18 Walter Bright wrote:
 Keep a 6 character buffer in your consumer. If you read a char with
 the high
 bit set, start filling that buffer and then decode it.

 And how on earth is that going to work as a range?

 1. read a character from the range
 2. if the character is the start of a multibyte character, put the
 character in the buffer
 3. keep reading from the range until you've got the whole of the
 multibyte character
 4. convert that 6 (or 4) character buffer into a dchar

Working example: 
https://github.com/pszturmaj/json-streaming-parser/blob/master/json.d#L18 :-)

Aug 02 2012

=?ISO-8859-1?Q?Jos=E9_Armando_Garc=EDa_Sancio?= <jsancio gmail.com> writes:

On Thu, Aug 2, 2012 at 1:26 PM, Jonathan M Davis <jmdavisProg gmx.com> wrote:
 On Thursday, August 02, 2012 01:44:18 Walter Bright wrote:
 On 8/2/2012 1:38 AM, Jonathan M Davis wrote:
 On Thursday, August 02, 2012 01:14:30 Walter Bright wrote:
 On 8/2/2012 12:43 AM, Jonathan M Davis wrote:
 It is for ranges in general. In the general case, a range of UTF-8 or
 UTF-16 makes no sense whatsoever. Having range-based functions which
 understand the encodings and optimize accordingly can be very beneficial
 (which happens with strings but can't happen with general ranges without
 the concept of a variably-length encoded range like we have with forward
 range or random access range), but to actually have a range of UTF-8 or
 UTF-16 just wouldn't work. Range-based functions operate on elements,
 and
 doing stuff like filter or map or reduce on code units doesn't make any
 sense at all.

 Yes, it can work.

 How?

 Keep a 6 character buffer in your consumer. If you read a char with the high
 bit set, start filling that buffer and then decode it.

 And how on earth is that going to work as a range? Range-based functions
 operate on elements. They use empty, front, popFront, etc. If front doesn't
 return an element that a range-based function can operate on without caring
 what it is, then that type isn't going to work as a range. If you need the
 consumer to be doing something special, then that means you need to special
 case it for that range type. And that's what you're doing when you special-
 case range-base functions for strings.

A little bit off topic but...

People have been composing/decorating Streams/Ranges for probably 30
years now. Examples: input stream, output stream, byte stream, char
stream, buffered stream, cipher stream, base64 stream, etc.

If you need more example. Consider an HTTPS request. At the lowest
level you have a byte stream/range. No sane developer wants to deal
with HTTPS request at this level so you decorate it with an SSL
stream/range. That is still too low level so you decorate this with a
char stream/range. Still too low level? Decorate it with a modal line
buffered stream/range. We are getting closer but it still not the
correct range abstraction so then you need a modal http stream/range.
You need the modal part if you want to support http streaming.

Aug 02 2012

Jacob Carlborg <doob me.com> writes:

On 2012-08-02 09:29, Walter Bright wrote:

 My experience in writing fast string based code that worked on UTF8 and
 correctly handled multibyte characters was that they are very possible
 and practical, and they are faster.

 The lexer MUST MUST MUST be FAST FAST FAST. Or it will not be useful. If
 it isn't fast, serious users will eschew it and will cook up their own.
 You'll have a nice, pretty, useless toy of std.d.lexer.

 I think there's some serious underestimation of how critical this is.

I do understand that the lexer needs to be insanely fast and it needs to 
operate on UTF-8 and not UTF-32 or anything else.

But what I still don't understand is how a UTF-8 range is going to be 
usable by other range based functions in Phobos.

-- 
/Jacob Carlborg

Aug 02 2012

Walter Bright <newshound2 digitalmars.com> writes:

On 8/2/2012 12:49 AM, Jacob Carlborg wrote:
 But what I still don't understand is how a UTF-8 range is going to be usable by
 other range based functions in Phobos.

Worst case use an adapter range.

Aug 02 2012

travert phare.normalesup.org (Christophe Travert) writes:

Walter Bright , dans le message (digitalmars.D:174015), a �crit�:
 On 8/2/2012 12:49 AM, Jacob Carlborg wrote:
 But what I still don't understand is how a UTF-8 range is going to be usable by
 other range based functions in Phobos.

 
 Worst case use an adapter range.
 
 

Yes

auto r = myString.byChar();

after implementing a byChar adapter range or just

auto r = cast(const(ubyte)[]) myString;

And it's a range of code unit, not code point.
And it's usable in phobos.

Aug 02 2012

Jacob Carlborg <doob me.com> writes:

On 2012-08-02 10:15, Walter Bright wrote:

 Worst case use an adapter range.

And that is better than a plain string?

-- 
/Jacob Carlborg

Aug 02 2012

travert phare.normalesup.org (Christophe Travert) writes:

Jacob Carlborg , dans le message (digitalmars.D:174069), a �crit�:
 On 2012-08-02 10:15, Walter Bright wrote:
 
 Worst case use an adapter range.

 
 And that is better than a plain string?
 

because its front method does not do any decoding.

Aug 02 2012

Jacob Carlborg <doob me.com> writes:

On 2012-08-02 22:51, Christophe Travert wrote:
 Jacob Carlborg , dans le message (digitalmars.D:174069), a �crit :
 On 2012-08-02 10:15, Walter Bright wrote:

 Worst case use an adapter range.

 And that is better than a plain string?

 because its front method does not do any decoding.

If it was a plain string you wouldn't use "front". You would handle any, 
possible, decoding by yourself, internally in the lexer. This is what 
Johnathan already does, it seems:

static if(isNarrowString!R)
     Unqual!(ElementEncodingType!R) first = range[0];
else
     dchar first = range.front;

If you're only supporting plain UTF-8 strings you would just do:

char first = range[0];

-- 
/Jacob Carlborg

Aug 02 2012

travert phare.normalesup.org (Christophe Travert) writes:

Jacob Carlborg , dans le message (digitalmars.D:174131), a �crit�:
 static if(isNarrowString!R)
      Unqual!(ElementEncodingType!R) first = range[0];
 else
      dchar first = range.front;

I find it more comfortable to just use
first = range.front, with a range of char or ubyte.

This range does not have to be a string, it can be a something over a 
file, stream, socket. It can also be the result of an algorithm, because 
you *can* use algorithm on ranges of char, and it makes sense if you 
know what you are doing.

If Walter discovers the lexer does not work with a socket, a 
"file.byChunk.join", and has to do expensive utf-8 decoding for the 
lexer to work because it can only use range of dchar, and not range of 
char (except that it special-cased strings), he may not be happy.

It the range happens to be a string, I would use an adapter to make it 
appear like a range of char, not dchar, like the library likes to do. I 
think Andrei suggested that already.

-- 
Christophe

Aug 03 2012

Walter Bright <newshound2 digitalmars.com> writes:

On 8/3/2012 1:21 AM, Christophe Travert wrote:
 This range does not have to be a string, it can be a something over a
 file, stream, socket. It can also be the result of an algorithm, because
 you *can* use algorithm on ranges of char, and it makes sense if you
 know what you are doing.

Correct, that's the whole point of using a range - it can come from anything.

 If Walter discovers the lexer does not work with a socket, a
 "file.byChunk.join", and has to do expensive utf-8 decoding for the
 lexer to work because it can only use range of dchar, and not range of
 char (except that it special-cased strings), he may not be happy.

No may about it, I wouldn't be happy :-)

Aug 03 2012

Walter Bright <newshound2 digitalmars.com> writes:

On 8/3/2012 3:07 AM, Walter Bright wrote:
 On 8/3/2012 1:21 AM, Christophe Travert wrote:
 This range does not have to be a string, it can be a something over a
 file, stream, socket. It can also be the result of an algorithm, because
 you *can* use algorithm on ranges of char, and it makes sense if you
 know what you are doing.

 Correct, that's the whole point of using a range - it can come from anything.

For example, let's suppose we want to do D syntax highlighting in our IDE. It
is 
highly unlikely that the text editor's data structure is a simple string. It's 
likely to be an array of lines, or something like that. It's simple to provide
a 
range interface to that, where it presents the data structure as a sequence of 
chars.

But that wouldn't work with Jonathan's string specialization.

Nor would the string specialization work if the input was a 1024 byte file
buffer.

Aug 03 2012

"Tobias Pankrath" <tobias pankrath.net> writes:

 Correct, that's the whole point of using a range - it can come 
 from anything.

 For example, let's suppose we want to do D syntax highlighting 
 in our IDE. It is highly unlikely that the text editor's data 
 structure is a simple string. It's likely to be an array of 
 lines, or something like that. It's simple to provide a range 
 interface to that, where it presents the data structure as a 
 sequence of chars.

Would this be an argument for putting the computation of source 
locations (i.e. line + offset or similar) into the range / into 
an template argument / policy, so that it's done in the most 
effective way for the client?

Kate for example has a "range"-type that marks a span in the text 
buffer. This way the lexer can return token with the correct 
"textrange" attached and you don't need to recompute the text 
ranges from line/col numbers.

Aug 03 2012

Walter Bright <newshound2 digitalmars.com> writes:

On 8/3/2012 4:40 AM, Tobias Pankrath wrote:
 Would this be an argument for putting the computation of source locations (i.e.
 line + offset or similar) into the range / into an template argument / policy,
 so that it's done in the most effective way for the client?

 Kate for example has a "range"-type that marks a span in the text buffer. This
 way the lexer can return token with the correct "textrange" attached and you
 don't need to recompute the text ranges from line/col numbers.

Worth thinking about.

Aug 03 2012

Dmitry Olshansky <dmitry.olsh gmail.com> writes:

On 02-Aug-12 08:30, Walter Bright wrote:
 On 8/1/2012 8:04 PM, Jonathan M Davis wrote:
 On Wednesday, August 01, 2012 17:10:07 Walter Bright wrote:
 1. It should accept as input an input range of UTF8. I feel it is a
 mistake
 to templatize it for UTF16 and UTF32. Anyone desiring to feed it UTF16
 should use an 'adapter' range to convert the input to UTF8. (This is
 what
 component programming is all about.)

 But that's not how ranges of characters work. They're ranges of dchar.
 Ranges
 don't operate on UTF-8 or UTF-16. They operate on UTF-32. You'd have
 to create
 special wrappers around string or wstring to have ranges of UTF-8. The
 way
 that it's normally done is to have ranges of dchar and then special-case
 range-based functions for strings. Then the function can operate on
 any range
 of dchar but still operates on strings efficiently.

 I have argued against making ranges of characters dchars, because of
 performance reasons. This will especially adversely affect the
 performance of the lexer.

 The fact is, files are normally in UTF8 and just about everything else
 is in UTF8. Prematurely converting to UTF-32 is a performance disaster.
 Note that the single largest thing holding back regex performance is
 that premature conversion to dchar and back to char.

Well, it doesn't convert back to UTF-8 as it just slices of the input :)

Otherwise very true especially with ctRegex that used to recieve quite 
some hype even in its present state. 33% of time spent is doing and 
redoing UTF-8 decoding.
(Note that quite some extra work on top of what lexer does is done, e.g. 
lexer is largely deterministic but regex has some of try-rollback).

 If lexer is required to accept dchar ranges, its performance will drop
 at least in half, and people are going to go reinvent their own lexers.

Yes, it slows things down. Decoding (if any) should kick in only where 
it's absolutely necessary and be an integral part of lexer automation.


-- 
Dmitry Olshansky

Aug 02 2012

deadalnix <deadalnix gmail.com> writes:

 7. It should accept a callback delegate for errors. That delegate should
 decide whether to:
      1. ignore the error (and "Lexer" will try to recover and continue)
      2. print an error message (and "Lexer" will try to recover and continue)
 3. throw an exception, "Lexer" is done with that input range

 I'm currently treating errors as tokens. It then becomes easy for the code
 using the lexer to just ignore the errors, to process them immediately, or to
 put off dealing with them until the lexing is complete. So, the code using the
 lexer can handle errors however and whenever it likes without having to worry
 about delegates or exceptions. Since tokens are lexed lazily, the fact that an
 error is reported as a token doesn't require that the lexing continue, but it
 also makes it _easy_ to continue lexing, ignoring or saving the error. I'm
 inclined to think that that's a superior approach to using delegates and
 exceptions.

Really nice idea. It is still easy to wrap the Range in another Range 
that process errors in a custom way.

Aug 02 2012

Walter Bright <newshound2 digitalmars.com> writes:

On 8/2/2012 4:27 AM, deadalnix wrote:
 Really nice idea. It is still easy to wrap the Range in another Range that
 process errors in a custom way.

It's suboptimal for a high speed lexer/parser, as already explained.

Speed is of paramount importance for a lexer, and it overrides things like 
elegance and simplicity.

Aug 02 2012

Jacob Carlborg <doob me.com> writes:

On 2012-08-02 02:10, Walter Bright wrote:

 1. It should accept as input an input range of UTF8. I feel it is a
 mistake to templatize it for UTF16 and UTF32. Anyone desiring to feed it
 UTF16 should use an 'adapter' range to convert the input to UTF8. (This
 is what component programming is all about.)

I'm no expert on ranges but won't that prevent slicing? Slicing is one 
of the main reasons for why the Tango XML parser is so amazingly fast.

-- 
/Jacob Carlborg

Aug 01 2012

Walter Bright <newshound2 digitalmars.com> writes:

On 8/1/2012 11:49 PM, Jacob Carlborg wrote:
 On 2012-08-02 02:10, Walter Bright wrote:

 1. It should accept as input an input range of UTF8. I feel it is a
 mistake to templatize it for UTF16 and UTF32. Anyone desiring to feed it
 UTF16 should use an 'adapter' range to convert the input to UTF8. (This
 is what component programming is all about.)

 I'm no expert on ranges but won't that prevent slicing? Slicing is one of the
 main reasons for why the Tango XML parser is so amazingly fast.

You don't want to use slicing on the lexer. The reason is that your slices will 
be spread all over memory, as source files can be huge, and all that memory
will 
be retained and never released. What you want is a compact representation after 
lexing. Compactness also helps a lot with memory caching.

Aug 02 2012

deadalnix <deadalnix gmail.com> writes:

Le 02/08/2012 09:30, Walter Bright a �crit :
 On 8/1/2012 11:49 PM, Jacob Carlborg wrote:
 On 2012-08-02 02:10, Walter Bright wrote:

 1. It should accept as input an input range of UTF8. I feel it is a
 mistake to templatize it for UTF16 and UTF32. Anyone desiring to feed it
 UTF16 should use an 'adapter' range to convert the input to UTF8. (This
 is what component programming is all about.)

 I'm no expert on ranges but won't that prevent slicing? Slicing is one
 of the
 main reasons for why the Tango XML parser is so amazingly fast.

 You don't want to use slicing on the lexer. The reason is that your
 slices will be spread all over memory, as source files can be huge, and
 all that memory will be retained and never released. What you want is a
 compact representation after lexing. Compactness also helps a lot with
 memory caching.

Token are not kept in memory. You usually consume them for other 
processing and throw them away.

It isn't an issue.

Aug 02 2012

Walter Bright <newshound2 digitalmars.com> writes:

On 8/2/2012 4:52 AM, deadalnix wrote:
 Le 02/08/2012 09:30, Walter Bright a �crit :
 On 8/1/2012 11:49 PM, Jacob Carlborg wrote:
 On 2012-08-02 02:10, Walter Bright wrote:

 1. It should accept as input an input range of UTF8. I feel it is a
 mistake to templatize it for UTF16 and UTF32. Anyone desiring to feed it
 UTF16 should use an 'adapter' range to convert the input to UTF8. (This
 is what component programming is all about.)

 I'm no expert on ranges but won't that prevent slicing? Slicing is one
 of the
 main reasons for why the Tango XML parser is so amazingly fast.

 You don't want to use slicing on the lexer. The reason is that your
 slices will be spread all over memory, as source files can be huge, and
 all that memory will be retained and never released. What you want is a
 compact representation after lexing. Compactness also helps a lot with
 memory caching.

 Token are not kept in memory. You usually consume them for other processing and
 throw them away.

 It isn't an issue.

The tokens are not kept, correct. But the identifier strings, and the string 
literals, are kept, and if they are slices into the input buffer, then 
everything I said applies.

Aug 02 2012

deadalnix <deadalnix gmail.com> writes:

Le 02/08/2012 20:08, Walter Bright a �crit :
 On 8/2/2012 4:52 AM, deadalnix wrote:
 Le 02/08/2012 09:30, Walter Bright a �crit :
 On 8/1/2012 11:49 PM, Jacob Carlborg wrote:
 On 2012-08-02 02:10, Walter Bright wrote:

 1. It should accept as input an input range of UTF8. I feel it is a
 mistake to templatize it for UTF16 and UTF32. Anyone desiring to
 feed it
 UTF16 should use an 'adapter' range to convert the input to UTF8.
 (This
 is what component programming is all about.)

 I'm no expert on ranges but won't that prevent slicing? Slicing is one
 of the
 main reasons for why the Tango XML parser is so amazingly fast.

 You don't want to use slicing on the lexer. The reason is that your
 slices will be spread all over memory, as source files can be huge, and
 all that memory will be retained and never released. What you want is a
 compact representation after lexing. Compactness also helps a lot with
 memory caching.

 Token are not kept in memory. You usually consume them for other
 processing and
 throw them away.

 It isn't an issue.

 The tokens are not kept, correct. But the identifier strings, and the
 string literals, are kept, and if they are slices into the input buffer,
 then everything I said applies.

Ok, what do you think of that :

lexer can have a parameter that tell if it should build a table of token 
or slice the input. The second is important, for instance for an IDE : 
lexing will occur often, and you prefer slicing here because you already 
have the source file in memory anyway.

The token always contains as a member a slice. The slice come either 
from the source or from a memory chunk allocated by the lexer.

If the lexer allocate chunks, it will reuse the same memory location for 
the same string. Considering the following mecanism to compare slice, 
this will require 2 comparaisons for identifier lexed with that method :

if(a.length != b.length) return false;
if(a.ptr == b.ptr) return true;
// Regular char by char comparison.

Is that a suitable option ?

Aug 03 2012

travert phare.normalesup.org (Christophe Travert) writes:

deadalnix , dans le message (digitalmars.D:174155), a �crit�:
 The tokens are not kept, correct. But the identifier strings, and the
 string literals, are kept, and if they are slices into the input buffer,
 then everything I said applies.

 
 Ok, what do you think of that :
 
 lexer can have a parameter that tell if it should build a table of token 
 or slice the input. The second is important, for instance for an IDE : 
 lexing will occur often, and you prefer slicing here because you already 
 have the source file in memory anyway.
 
 The token always contains as a member a slice. The slice come either 
 from the source or from a memory chunk allocated by the lexer.

If I may add, there are several possitilities here:
 1- a real slice of the input range
 2- a slice of the input range created with .save and takeExactly
 3- a slice allocated in GC memory by the lexer
 4- a slice of memory owned by the lexer, which is reused for the next 
token (thus, the next call to popFront invalidates the token).
 5- a slice of memory from a lookup table.

All are useful in certain situations.

don't have a huge amont of code to parse.






token, you save the range, and when you found the end of the token, you 
just use takeExactly on the saved ranged.


to do them in a second step if you want to be able to use input range 
(which you have too). Actually, the buffer may be external, if you use a 
buffered-range adapter to make a forward range out of an input range. 
Having an internal buffer may be more efficient. That something that has 
to be profiled.




slice saved somewhere before having a look at the look-up table. 

efficiency by an algorithm external to the lexer. This would probably 
bring many ways to use the lexer. For example, you can filter-out many 
tokens that you don't want before building the table, which avoids to 
have an overfull look-up table if you are only interested in a subset of 
tokens.



user to use the adapters blindly, or if an internal implementation
proves to be significantly more efficient.

-- 
Christophe

Aug 03 2012

"Tobias Pankrath" <tobias pankrath.net> writes:

On Friday, 3 August 2012 at 14:49:55 UTC, 
travert phare.normalesup.org
 If I may add, there are several possitilities here:
  1- a real slice of the input range
  2- a slice of the input range created with .save and 
 takeExactly
  3- a slice allocated in GC memory by the lexer
  4- a slice of memory owned by the lexer, which is reused for 
 the next
 token (thus, the next call to popFront invalidates the token).
  5- a slice of memory from a lookup table.


(i.e. for syntax highlighting).

Aug 03 2012

Walter Bright <newshound2 digitalmars.com> writes:

On 8/3/2012 6:18 AM, deadalnix wrote:
 lexer can have a parameter that tell if it should build a table of token or
 slice the input. The second is important, for instance for an IDE : lexing will
 occur often, and you prefer slicing here because you already have the source
 file in memory anyway.

A string may span multiple lines - IDEs do not store the text as one string.

 If the lexer allocate chunks, it will reuse the same memory location for the
 same string. Considering the following mecanism to compare slice, this will
 require 2 comparaisons for identifier lexed with that method :

 if(a.length != b.length) return false;
 if(a.ptr == b.ptr) return true;
 // Regular char by char comparison.

 Is that a suitable option ?

You're talking about doing for strings what is done for identifiers - returning 
a unique handle for each. I don't think this works very well for string 
literals, as there seem to be few duplicates.

Aug 03 2012

deadalnix <deadalnix gmail.com> writes:

Le 03/08/2012 21:59, Walter Bright a �crit :
 On 8/3/2012 6:18 AM, deadalnix wrote:
 lexer can have a parameter that tell if it should build a table of
 token or
 slice the input. The second is important, for instance for an IDE :
 lexing will
 occur often, and you prefer slicing here because you already have the
 source
 file in memory anyway.

 A string may span multiple lines - IDEs do not store the text as one
 string.

 If the lexer allocate chunks, it will reuse the same memory location
 for the
 same string. Considering the following mecanism to compare slice, this
 will
 require 2 comparaisons for identifier lexed with that method :

 if(a.length != b.length) return false;
 if(a.ptr == b.ptr) return true;
 // Regular char by char comparison.

 Is that a suitable option ?

 You're talking about doing for strings what is done for identifiers -
 returning a unique handle for each. I don't think this works very well
 for string literals, as there seem to be few duplicates.

That option have the benefice to allow very fast identifier comparison 
(like DMD does) but don't impose it. For instance, you could use that 
trick in a single thread, but another identifier table for another.

It allow to avoid completely the problem with multithreading you 
mention, while keeping most identifiers comparison really fast.

It allow also for several allocation scheme for the slice, that fit 
different needs, as shown by Christophe Travert.

Aug 04 2012

Jonathan M Davis <jmdavisProg gmx.com> writes:

On Thursday, August 02, 2012 11:08:23 Walter Bright wrote:
 The tokens are not kept, correct. But the identifier strings, and the string
 literals, are kept, and if they are slices into the input buffer, then
 everything I said applies.

String literals often _can't_ be slices unless you leave them in their 
original state rather than giving the version that they translate to (e.g. 
leaving \&copy; in the string rather than replacing it with its actual, 
unicode value). And since you're not going to be able to create the literal 
using whatever type the range is unless it's a string of some variety, that 
means that the literals often can't be slices, which - depending on the 
implementation - would make it so that that they can't _ever_ be slices.

Identifiers are a different story, since they don't have to be translated at 
all, but regardless of whether keeping a slice would be better than creating a 
new string, the identifier table will be far superior, since then you only need 
one copy of each identifier. So, it ultimately doesn't make sense to use slices 
in either case even without considering issues like them being spread across 
memory.

The only place that I'd expect a slice in a token is in the string which 
represents the text which was lexed, and that won't normally be kept around.

- Jonathan M Davis

Aug 03 2012

travert phare.normalesup.org (Christophe Travert) writes:

Jonathan M Davis , dans le message (digitalmars.D:174191), a �crit�:
 On Thursday, August 02, 2012 11:08:23 Walter Bright wrote:
 The tokens are not kept, correct. But the identifier strings, and the string
 literals, are kept, and if they are slices into the input buffer, then
 everything I said applies.

 
 String literals often _can't_ be slices unless you leave them in their 
 original state rather than giving the version that they translate to (e.g. 
 leaving \&copy; in the string rather than replacing it with its actual, 
 unicode value). And since you're not going to be able to create the literal 
 using whatever type the range is unless it's a string of some variety, that 
 means that the literals often can't be slices, which - depending on the 
 implementation - would make it so that that they can't _ever_ be slices.
 
 Identifiers are a different story, since they don't have to be translated at 
 all, but regardless of whether keeping a slice would be better than creating a 
 new string, the identifier table will be far superior, since then you only
need 
 one copy of each identifier. So, it ultimately doesn't make sense to use
slices 
 in either case even without considering issues like them being spread across 
 memory.
 
 The only place that I'd expect a slice in a token is in the string which 
 represents the text which was lexed, and that won't normally be kept around.
 
 - Jonathan M Davis

I thought it was not the lexer's job to process litterals. Just split 
the input in tokens, and provide minimal info: TokenType, line and col 
along with the representation from the input. That's enough for a syntax 
highlighting tools for example. Otherwise you'll end up doing complex 
interpretation and the lexer will not be that efficient. Litteral 
interpretation can be done in a second step. Do you think doing litteral 
interpretation separately when you need it would be less efficient?

-- 
Christophe

Aug 04 2012

Dmitry Olshansky <dmitry.olsh gmail.com> writes:

On 04-Aug-12 14:02, Christophe Travert wrote:
 Jonathan M Davis , dans le message (digitalmars.D:174191), a �crit :
 On Thursday, August 02, 2012 11:08:23 Walter Bright wrote:
 The tokens are not kept, correct. But the identifier strings, and the string
 literals, are kept, and if they are slices into the input buffer, then
 everything I said applies.

 String literals often _can't_ be slices unless you leave them in their
 original state rather than giving the version that they translate to (e.g.
 leaving \&copy; in the string rather than replacing it with its actual,
 unicode value). And since you're not going to be able to create the literal
 using whatever type the range is unless it's a string of some variety, that
 means that the literals often can't be slices, which - depending on the
 implementation - would make it so that that they can't _ever_ be slices.

 Identifiers are a different story, since they don't have to be translated at
 all, but regardless of whether keeping a slice would be better than creating a
 new string, the identifier table will be far superior, since then you only need
 one copy of each identifier. So, it ultimately doesn't make sense to use slices
 in either case even without considering issues like them being spread across
 memory.

 The only place that I'd expect a slice in a token is in the string which
 represents the text which was lexed, and that won't normally be kept around.

 - Jonathan M Davis

 I thought it was not the lexer's job to process litterals. Just split
 the input in tokens, and provide minimal info: TokenType, line and col
 along with the representation from the input. That's enough for a syntax
 highlighting tools for example. Otherwise you'll end up doing complex
 interpretation and the lexer will not be that efficient. Litteral
 interpretation can be done in a second step. Do you think doing litteral
 interpretation separately when you need it would be less efficient?

Most likely - since you re-read the same memory twice to do it.

-- 
Dmitry Olshansky

Aug 04 2012

travert phare.normalesup.org (Christophe Travert) writes:

Dmitry Olshansky , dans le message (digitalmars.D:174214), a �crit�:
 Most likely - since you re-read the same memory twice to do it.

You're probably right, but if you do this right after the token is 
generated, the memory should still be near the processor. And the 
operation on the first read should be very basic: just check nothing 
illegal appears, and check for the end of the token. The cost is not 
negligible, but what you do with litteral tokens can vary much, and what 
the lexer will propose may not be what the user want, so the user may 
suffer the cost of the litteral decoding (including allocation of the 
decoded string, the copy of the caracters, etc), that he doesn't want, 
or will have to re-do his own way...

-- 
Christophe

Aug 04 2012

Dmitry Olshansky <dmitry.olsh gmail.com> writes:

On 04-Aug-12 14:55, Christophe Travert wrote:
 Dmitry Olshansky , dans le message (digitalmars.D:174214), a �crit :
 Most likely - since you re-read the same memory twice to do it.

 You're probably right, but if you do this right after the token is
 generated, the memory should still be near the processor. And the
 operation on the first read should be very basic: just check nothing
 illegal appears, and check for the end of the token.

q{ .. }
"\x13\x27 ...\u1212"

In most cases it takes around the same time to check correctness and 
output it as simply pass it by. (see also re-decoding unicode in 
identifiers, though that's rare to see unicode chars in identifier)


 The cost is not
 negligible, but what you do with litteral tokens can vary much, and what
 the lexer will propose may not be what the user want, so the user may
 suffer the cost of the litteral decoding (including allocation of the
 decoded string, the copy of the caracters, etc), that he doesn't want,
 or will have to re-do his own way...

I see it as a compile-time policy, that will fit nicely and solve both 
issues. Just provide a templates with a few hooks, and add a Noop policy 
that does nothing.

-- 
Dmitry Olshansky

Aug 04 2012

Jonathan M Davis <jmdavisProg gmx.com> writes:

On Saturday, August 04, 2012 15:32:22 Dmitry Olshansky wrote:
 I see it as a compile-time policy, that will fit nicely and solve both
 issues. Just provide a templates with a few hooks, and add a Noop policy
 that does nothing.

It's starting to look like figuring out what should and shouldn't be 
configurable and how to handle it is going to be the largest problem in the 
lexer...

- Jonathan M Davis

Aug 04 2012

"Tobias Pankrath" <tobias pankrath.net> writes:

On Saturday, 4 August 2012 at 11:58:09 UTC, Jonathan M Davis 
wrote:
 On Saturday, August 04, 2012 15:32:22 Dmitry Olshansky wrote:
 I see it as a compile-time policy, that will fit nicely and 
 solve both
 issues. Just provide a templates with a few hooks, and add a 
 Noop policy
 that does nothing.

 It's starting to look like figuring out what should and 
 shouldn't be
 configurable and how to handle it is going to be the largest 
 problem in the
 lexer...

 - Jonathan M Davis

If we have a really fast lexer that is highly compile-time 
configureable and has a readable codebase, then this would be a 
really good showcase for D.

Aug 04 2012

Dmitry Olshansky <dmitry.olsh gmail.com> writes:

On 04-Aug-12 15:48, Jonathan M Davis wrote:
 On Saturday, August 04, 2012 15:32:22 Dmitry Olshansky wrote:
 I see it as a compile-time policy, that will fit nicely and solve both
 issues. Just provide a templates with a few hooks, and add a Noop policy
 that does nothing.

 It's starting to look like figuring out what should and shouldn't be
 configurable and how to handle it is going to be the largest problem in the
 lexer...

Let's add some meat to my post.
I've seen it mostly as follows:

//user defines mixin template that is mixed in inside lexer
template MyConfig()
{
	enum identifierTable = true; // means there would be calls to 
table.insert on each identifier
	enum countLines = true; //adds line, column properties to the lexer/Tokens

	//statically bound callbacks, inside one can use say:
	// skip() - to skip a char (popFront)
	// get() - to read next char (via popFront, front)
	// line, col - as readonly properties
	// (skip & get do the counting if enabled)
	
	bool onError()
	{
		skip(); //the most dumb recovery, just skip a char
		return true; //go on with tokenizing, false - stop prematurely
	}

	...
}

usage:


{
	auto my_supa_table = ...; //some kind of container (should a  set on 
strings and support .insert("blah"); )

	auto dlex = Lexer!(MyConfig)(table);
	auto all_tokens = array(dlex(joiner(stdin.byChunk(4096))));

	//or if we had no interest in table but only tokens:
	auto noop = Lexer!(NoopLex)();
	...		
}

-- 
Dmitry Olshansky

Aug 04 2012

Jonathan M Davis <jmdavisProg gmx.com> writes:

On Saturday, August 04, 2012 17:45:58 Dmitry Olshansky wrote:
 On 04-Aug-12 15:48, Jonathan M Davis wrote:
 On Saturday, August 04, 2012 15:32:22 Dmitry Olshansky wrote:
 I see it as a compile-time policy, that will fit nicely and solve both
 issues. Just provide a templates with a few hooks, and add a Noop policy
 that does nothing.

 
 It's starting to look like figuring out what should and shouldn't be
 configurable and how to handle it is going to be the largest problem in
 the
 lexer...

 
 Let's add some meat to my post.
 I've seen it mostly as follows:

[snip]

It would probably be a bit more user friendly to pass a struct as a template 
argument (which you can't do in the normal sense, but you can pass it as an 
alias). Regardless, the problem isn't so much how to provide a configuration as 
how the configuration options affect the lexer and what it does. I'll figure it 
out, but it does definitely complicate things. I wasn't originally planning an 
having anything be configurable. But if we want it to be such that no one will 
want to write another one (as Walter is looking for), then it's going to need 
to be configurable enough to make it efficient for all of the common lexing 
scenarios rather than efficient for one particular scenario.

- Jonathan M Davis

Aug 05 2012

deadalnix <deadalnix gmail.com> writes:

Le 04/08/2012 15:45, Dmitry Olshansky a écrit :
 On 04-Aug-12 15:48, Jonathan M Davis wrote:
 On Saturday, August 04, 2012 15:32:22 Dmitry Olshansky wrote:
 I see it as a compile-time policy, that will fit nicely and solve both
 issues. Just provide a templates with a few hooks, and add a Noop policy
 that does nothing.

 It's starting to look like figuring out what should and shouldn't be
 configurable and how to handle it is going to be the largest problem
 in the
 lexer...

 Let's add some meat to my post.
 I've seen it mostly as follows:

 //user defines mixin template that is mixed in inside lexer
 template MyConfig()
 {
 enum identifierTable = true; // means there would be calls to
 table.insert on each identifier
 enum countLines = true; //adds line, column properties to the lexer/Tokens

 //statically bound callbacks, inside one can use say:
 // skip() - to skip a char (popFront)
 // get() - to read next char (via popFront, front)
 // line, col - as readonly properties
 // (skip & get do the counting if enabled)

 bool onError()
 {
 skip(); //the most dumb recovery, just skip a char
 return true; //go on with tokenizing, false - stop prematurely
 }

 ...
 }

 usage:


 {
 auto my_supa_table = ...; //some kind of container (should a set on
 strings and support .insert("blah"); )

 auto dlex = Lexer!(MyConfig)(table);
 auto all_tokens = array(dlex(joiner(stdin.byChunk(4096))));

 //or if we had no interest in table but only tokens:
 auto noop = Lexer!(NoopLex)();
 ...
 }

It seems way too much.

The most complex thing that is needed is the policy to allocate 
identifiers in tokens. It can be made by passing a function that have a 
string as parameter and a string as return value. The default one would 
be an identity function.

The second parameter is a bool to tokenize comments or not. Is that enough ?

The onError look like a typical use case for conditions as explained in 
the huge thread on Exception.

Aug 06 2012

Philippe Sigaud <philippe.sigaud gmail.com> writes:

On Mon, Aug 6, 2012 at 8:03 PM, deadalnix <deadalnix gmail.com> wrote:

 The most complex thing that is needed is the policy to allocate identifiers
 in tokens. It can be made by passing a function that have a string as
 parameter and a string as return value. The default one would be an identity
 function.

I think one should pass it an empty symbol table and the lexer should
fill it and associate each identifier with a unique ID, ID which would
appear in the Identifier token.


 The second parameter is a bool to tokenize comments or not. Is that enough ?

 The onError look like a typical use case for conditions as explained in the
 huge thread on Exception.

Yes, well we don't have a condition system. And using exceptions
during lexing would most probably kill its efficiency.
Errors in lexing are not uncommon. The usual D idiom of having an enum
StopOnError { no, yes } should be enough.

Aug 06 2012

Jacob Carlborg <doob me.com> writes:

On 2012-08-06 21:00, Philippe Sigaud wrote:

 Yes, well we don't have a condition system. And using exceptions
 during lexing would most probably kill its efficiency.
 Errors in lexing are not uncommon. The usual D idiom of having an enum
 StopOnError { no, yes } should be enough.

Especially when if the lexer is used in an IDE. The code is constantly 
changing and will be invalid quite often.

-- 
/Jacob Carlborg

Aug 06 2012

Walter Bright <newshound2 digitalmars.com> writes:

On 8/6/2012 12:00 PM, Philippe Sigaud wrote:
 Yes, well we don't have a condition system. And using exceptions
 during lexing would most probably kill its efficiency.
 Errors in lexing are not uncommon. The usual D idiom of having an enum
 StopOnError { no, yes } should be enough.


That's why I suggested supplying a callback delegate to decide what to do with 
errors (ignore, throw exception, or quit) and have the delegate itself do that. 
That way, there is no customization of the Lexer required.

Aug 06 2012

travert phare.normalesup.org (Christophe Travert) writes:

Walter Bright , dans le message (digitalmars.D:174360), a écrit :
 On 8/6/2012 12:00 PM, Philippe Sigaud wrote:
 Yes, well we don't have a condition system. And using exceptions
 during lexing would most probably kill its efficiency.
 Errors in lexing are not uncommon. The usual D idiom of having an enum
 StopOnError { no, yes } should be enough.

 
 
 That's why I suggested supplying a callback delegate to decide what to do with 
 errors (ignore, throw exception, or quit) and have the delegate itself do
that. 
 That way, there is no customization of the Lexer required.

It may be easier to take into accound few cases (return error token and 
throwing is enough, so that is a basic static if), than to define a way 
to integrate a delegate (what should be the delegate's signature, what 
value to return to query for stopping, how to provide ways to recovers, 
etc).

Aug 07 2012

Jonathan M Davis <jmdavisProg gmx.com> writes:

On Tuesday, August 07, 2012 08:00:24 Christophe Travert wrote:
 Walter Bright , dans le message (digitalmars.D:174360), a =C3=A9crit =

:

 That's why I suggested supplying a callback delegate to decide what=


 to do
 with errors (ignore, throw exception, or quit) and have the delegat=


e
 itself do that. That way, there is no customization of the Lexer
 required.

=20
 It may be easier to take into accound few cases (return error token a=

nd
 throwing is enough, so that is a basic static if), than to define a w=

ay
 to integrate a delegate (what should be the delegate's signature, wha=

t
 value to return to query for stopping, how to provide ways to recover=

s,
 etc).

For the moment at least, I'm doing this

bool delegate(string errorMsg, SourcePos pos) errorHandler;

where SourcePos is a struct which holds the line and col of the place i=
n the=20
source code where the bad token starts. It it returns true, the token i=
s=20
skipped. If it returns false, an exception is thrown - and of course th=
e=20
delegate can throw its own exception if it wants to.

But you can also configure the lexer to return an error token instead o=
f using=20
the delegate if that's what you prefer. But Walter is right in that if =
you=20
have to check every token for whether it's an error, that will incur ov=
erhead.=20
So, depending on your use case, that could be unacceptable.

- Jonathan M Davis

Aug 07 2012

Walter Bright <newshound2 digitalmars.com> writes:

On 8/7/2012 1:14 AM, Jonathan M Davis wrote:
 But you can also configure the lexer to return an error token instead of using
 the delegate if that's what you prefer. But Walter is right in that if you
 have to check every token for whether it's an error, that will incur overhead.
 So, depending on your use case, that could be unacceptable.

It's not just overhead - it's just plain ugly to constantly check for error 
tokens. It's also tedious and error prone to insert those checks.

I don't see any advantage to it.

Aug 07 2012

Jonathan M Davis <jmdavisProg gmx.com> writes:

On Tuesday, August 07, 2012 02:54:42 Walter Bright wrote:
 On 8/7/2012 1:14 AM, Jonathan M Davis wrote:
 But you can also configure the lexer to return an error token instead of
 using the delegate if that's what you prefer. But Walter is right in that
 if you have to check every token for whether it's an error, that will
 incur overhead. So, depending on your use case, that could be
 unacceptable.

 
 It's not just overhead - it's just plain ugly to constantly check for error
 tokens. It's also tedious and error prone to insert those checks.
 
 I don't see any advantage to it.

It's easier to see where in the range of tokens the errors occur. A delegate 
is disconnected from the point where the range is being consumed, whereas if 
tokens are used for errors, then the function consuming the range can see 
exactly where in the range of tokens the error is (and potentially handle it 
differently based on that information).

Regardless, I was asked to keep that option in there by at least one person 
(Philippe Sigaud IIRC), which is why I didn't just switch over to the delegate 
entirely.

- Jonathan M Davis

Aug 07 2012

Jacob Carlborg <doob me.com> writes:

On 2012-08-07 12:06, Jonathan M Davis wrote:

 It's easier to see where in the range of tokens the errors occur. A delegate
 is disconnected from the point where the range is being consumed, whereas if
 tokens are used for errors, then the function consuming the range can see
 exactly where in the range of tokens the error is (and potentially handle it
 differently based on that information).

Just pass the same token to the delegate that you would have returned 
otherwise?

-- 
/Jacob Carlborg

Aug 07 2012

travert phare.normalesup.org (Christophe Travert) writes:

Jacob Carlborg , dans le message (digitalmars.D:174421), a écrit :
 On 2012-08-07 12:06, Jonathan M Davis wrote:
 
 It's easier to see where in the range of tokens the errors occur. A delegate
 is disconnected from the point where the range is being consumed, whereas if
 tokens are used for errors, then the function consuming the range can see
 exactly where in the range of tokens the error is (and potentially handle it
 differently based on that information).

 
 Just pass the same token to the delegate that you would have returned 
 otherwise?

That's what I would do. If you have to define a way to return error 
information as a token, better use it again when using delegates.
Personnaly, I would have the delegate be:
int delegate(Token);
A return value of 0 means: continue parsing. Any other value is an 
error number and stops the parser (makes it empty). The error number 
can be retrieved from the empty parser with a specific function.
If you want to throw, just throw in the delegate. No need to return a 
specific value for that.

But a bool return value may be enough too...

Aug 07 2012

Walter Bright <newshound2 digitalmars.com> writes:

On 8/7/2012 3:06 AM, Jonathan M Davis wrote:
 It's easier to see where in the range of tokens the errors occur. A delegate
 is disconnected from the point where the range is being consumed, whereas if
 tokens are used for errors, then the function consuming the range can see
 exactly where in the range of tokens the error is (and potentially handle it
 differently based on that information).

The delegate has a context pointer giving it a reference to whatever context
the 
code calling the Lexer needs.

Aug 07 2012

travert phare.normalesup.org (Christophe Travert) writes:

Walter Bright , dans le message (digitalmars.D:174394), a écrit :
 On 8/7/2012 1:14 AM, Jonathan M Davis wrote:
 But you can also configure the lexer to return an error token instead of using
 the delegate if that's what you prefer. But Walter is right in that if you
 have to check every token for whether it's an error, that will incur overhead.
 So, depending on your use case, that could be unacceptable.

 
 It's not just overhead - it's just plain ugly to constantly check for error 
 tokens. It's also tedious and error prone to insert those checks.

It's not necessarily ugly, because of the powerful range design. You can 
branch the lexer to a range adapter that just ignore error tokens, or 
throw when it meats an error token.

For example, just use:
auto tokens = data.lexer.throwOnErrorToken;

I don't think this is more ugly than:
auto tokens = data.lexer!(complex signature) { throw LexException; };

But yes, there is overhead, so I understand returning error tokens is 
not satisfactory for everyone.

 I don't see any advantage to it.

Storing the error somewhere can be of use.
For example, you may want to lex a whole file into an array of tokens, 
and then deal with you errors as you analyse the array of tokens. 
Of course, you can alway make a delegate to store the error somewhere, 
but it is easier if this somewhere is in your token pile.

What I don't see any advantage is using a delegate that can only return 
or throw. A policy makes the job:
auto tokens = data.lexer!ExceptionPolicy.throwException;
That's clean too.

If you want the delegate to be of any use, then it must have 
data to process. That's why I said we have to worry about the 
signature of the delegate.

-- 
Christophe

Aug 07 2012

Philippe Sigaud <philippe.sigaud gmail.com> writes:

On Tue, Aug 7, 2012 at 12:06 PM, Jonathan M Davis <jmdavisProg gmx.com> wrote:

 Regardless, I was asked to keep that option in there by at least one person
 (Philippe Sigaud IIRC), which is why I didn't just switch over to the delegate
 entirely.

IIRC, I was not the only one, as people here interested in coding an
IDE asked for it too. A lexer is useful for more than 'just' parsing D
afterwards: an IDE could easily color tokens according to their type
and an error token is just was is needed to highlight errors.

Also, what I proposed was a *static* decision: with SkipErrors { no,
yes }. With a static if inside its guts, the lexer could change its
behavior accordingly. Make skipError.yes the default and Walter get
its speed. It's just that an IDE or another parser could use

auto lex = std.lexer.Lexer!(SkipError.no)(input);


Walter, with all due respect, you sometimes give the impression to
forget we are talking about D and go back to deeply entrenched C-isms.
Compile-time decisions can be used to avoid any overhead as long as
you have a clear idea of what the two code paths should look like.

And, as Christophe said, ranges are a powerful API. In another thread
Simen and me did some comparison between C-like code and code using
only ranges upon ranges upon ranges. A (limited!) difference in speed
appeared only for very long calculations.

Aug 07 2012

Walter Bright <newshound2 digitalmars.com> writes:

On 8/7/2012 7:15 AM, Philippe Sigaud wrote:
 Also, what I proposed was a *static* decision: with SkipErrors { no,
 yes }. With a static if inside its guts, the lexer could change its
 behavior accordingly.

Yes, I understand about static if decisions :-) hell I invented them!


 Walter, with all due respect, you sometimes give the impression to
 forget we are talking about D and go back to deeply entrenched C-isms.

Delegates are not C-isms.


 Compile-time decisions can be used to avoid any overhead as long as
 you have a clear idea of what the two code paths should look like.

Yes, I understand that. There's also a point about adding too much complexity
to 
the interface. The delegate callback reduces complexity in the interface.

 And, as Christophe said, ranges are a powerful API. In another thread
 Simen and me did some comparison between C-like code and code using
 only ranges upon ranges upon ranges. A (limited!) difference in speed
 appeared only for very long calculations.

That's good, and you really don't need to sell me on ranges - I'm already sold.

Aug 07 2012

Philippe Sigaud <philippe.sigaud gmail.com> writes:

On Tue, Aug 7, 2012 at 9:38 PM, Walter Bright
<newshound2 digitalmars.com> wrote:

 Yes, I understand about static if decisions :-) hell I invented them!

And what a wonderful decision that was!

 Yes, I understand that. There's also a point about adding too much
 complexity to the interface. The delegate callback reduces complexity in the
 interface.

OK, then let's let Jonathan work, and we will see how it goes.


 And, as Christophe said, ranges are a powerful API. In another thread
 Simen and me did some comparison between C-like code and code using
 only ranges upon ranges upon ranges. A (limited!) difference in speed
 appeared only for very long calculations.


 That's good, and you really don't need to sell me on ranges - I'm already
 sold.

Well, you gave the impression a bit upstream in this thread that
having to filter a token range to eliminate errors was an atrocity
(millions of tokens!).

As far as I'm concerned, the recent good news was to (re?)discover
than complex calls of ranges upon ranges could still be calculated by
CTFE. That's really neat.

Aug 07 2012

"Jonathan M Davis" <jmdavisProg gmx.com> writes:

On Tuesday, August 07, 2012 12:38:26 Walter Bright wrote:
 Yes, I understand that. There's also a point about adding too much
 complexity to the interface. The delegate callback reduces complexity in
 the interface.

It doesn't really affect much to allow choosing between returning a token and 
using a delegate, especially if ignoring errors is treated as a separate 
option rather than simply using a delegate that skips them (which may or may 
not be beneficial - it's faster without the delegate, but it's actually kind of 
hard to get lexing errors).

What worries me more is stuff like providing a way to have the range calculate 
the current position itself (as Christophe suggested IIRC) or having it 
provide an efficient way to determine the number of code units between two 
ranges so that you can slice the range lexed to put in the Token. Determining 
the number of code units is easily done with ptr for strings, but for 
everything else, you generally have to count as code units are consumed, which 
isn't really an issue for small tokens (especially those like symbols where 
the length is known without counting) but does add up for arbitrarily long 
ones such as comments or string literals. So, providing a way to calculate it 
more efficiently where possible might be desirable, but it's yet another layer 
of complication, and I don't know that it's actually possible to provide such 
a function in enough situations for it to be worth providing that 
functionality.

I expect that the configuration stuff is going to have to be adjusted after I'm 
done, since I'm not sure that it's entirely clear what's worth configuring or 
not.

- Jonathan M Davis

Aug 07 2012

Walter Bright <newshound2 digitalmars.com> writes:

On 8/7/2012 2:14 PM, Jonathan M Davis wrote:
 I expect that the configuration stuff is going to have to be adjusted after I'm
 done, since I'm not sure that it's entirely clear what's worth configuring or
 not.

"When in doubt, leave it out."

If experience later shows it is really needed, it is easier to add it in than
to 
have a blizzard of useless configuration options that cannot be removed.

Aug 07 2012

"Jonathan M Davis" <jmdavisProg gmx.com> writes:

On Tuesday, August 07, 2012 14:30:51 Walter Bright wrote:
 On 8/7/2012 2:14 PM, Jonathan M Davis wrote:
 I expect that the configuration stuff is going to have to be adjusted
 after I'm done, since I'm not sure that it's entirely clear what's worth
 configuring or not.

 
 "When in doubt, leave it out."
 
 If experience later shows it is really needed, it is easier to add it in
 than to have a blizzard of useless configuration options that cannot be
 removed.

On the whole, I agree, and I'm definitely leaving some of that stuff out at
this 
point. A simplified version of what I have at the moment is

struct LexConfig { /+ configuration options +/}

struct Lexer(alias config)
 if(is(typeof(config) == LexConfig))
{
 auto lex(R)(R range, SourcePos pos = SourcePos.init) {}
}

The main problem that I have is that if I end up with any delegates which need 
to take the range being lexed, then either I can't put them in the Lexer like 
I am with the error handler delegate (forcing you to pass it to the lex 
function every time that you lex a new range), or I have to templatize Lexer 
on the range type, which is also undesirable.

If it turns out later that we want to add such delegates and we didn't 
templatize Lexer on the range type, then we'll be forced to make it so that 
they're arguments to the lex function. But templatizing the Lexer on the range 
type is ugly enough that it's not really something that I want to do if I can 
reasonbly avoid it. Neither solution appeals to me. If we don't ever need to 
add such delegates, then it doesn't really matter, but I'd prefer to have a 
solution which is appropriately flexible if we _do_ need to add that sort of 
thing later.

I guess that I'll have to keep thinking about it to see if I can come up with 
a reasonable way to do it without hampering the API by either making it ugly 
through unneeded flexibility or by making it too simple to be able to 
reasonably add more functionality later.

- Jonathan M Davis

Aug 07 2012

Walter Bright <newshound2 digitalmars.com> writes:

On 8/7/2012 1:00 AM, Christophe Travert wrote:
 That's why I suggested supplying a callback delegate to decide what to do with
 errors (ignore, throw exception, or quit) and have the delegate itself do that.
 That way, there is no customization of the Lexer required.

 It may be easier to take into accound few cases (return error token and
 throwing is enough, so that is a basic static if), than to define a way
 to integrate a delegate (what should be the delegate's signature, what
 value to return to query for stopping, how to provide ways to recovers,
 etc).


If the delegate returns, then the lexer recovers.

The delegate is passed the error message and the location.

I don't see it is more complex than that.

Aug 07 2012

travert phare.normalesup.org (Christophe Travert) writes:

Walter Bright , dans le message (digitalmars.D:174393), a écrit :
 If the delegate returns, then the lexer recovers.

That's an option, if there is only one way to recover (which is a 
reasonable assumption).

You wanted the delegate to "decide what to do with the errors (ignore, 
throw exception, or quit)".

Throwing is handled, but not ignore/quit. Jonathan's solution (delegate 
returning a bool) is good. It could also be a delegate returning an int, 
0 meaning continue, and any other value being an error code that can be 
retrieved later. It could also be a number of characters to skip (0 
meaning break).

Aug 07 2012

Dmitry Olshansky <dmitry.olsh gmail.com> writes:

On 06-Aug-12 22:03, deadalnix wrote:
 Le 04/08/2012 15:45, Dmitry Olshansky a écrit :
 On 04-Aug-12 15:48, Jonathan M Davis wrote:
 On Saturday, August 04, 2012 15:32:22 Dmitry Olshansky wrote:
 I see it as a compile-time policy, that will fit nicely and solve both
 issues. Just provide a templates with a few hooks, and add a Noop
 policy
 that does nothing.

 It's starting to look like figuring out what should and shouldn't be
 configurable and how to handle it is going to be the largest problem
 in the
 lexer...

 Let's add some meat to my post.
 I've seen it mostly as follows:

 //user defines mixin template that is mixed in inside lexer
 template MyConfig()
 {
 enum identifierTable = true; // means there would be calls to
 table.insert on each identifier
 enum countLines = true; //adds line, column properties to the
 lexer/Tokens

 //statically bound callbacks, inside one can use say:
 // skip() - to skip a char (popFront)
 // get() - to read next char (via popFront, front)
 // line, col - as readonly properties
 // (skip & get do the counting if enabled)

 bool onError()
 {
 skip(); //the most dumb recovery, just skip a char
 return true; //go on with tokenizing, false - stop prematurely
 }

 ...
 }

 usage:


 {
 auto my_supa_table = ...; //some kind of container (should a set on
 strings and support .insert("blah"); )

 auto dlex = Lexer!(MyConfig)(table);
 auto all_tokens = array(dlex(joiner(stdin.byChunk(4096))));

 //or if we had no interest in table but only tokens:
 auto noop = Lexer!(NoopLex)();
 ...
 }

 It seems way too much.

 The most complex thing that is needed is the policy to allocate
 identifiers in tokens.

Editor that highlights text may choose not to build identifier table at 
all. One may see it as a safe mode (low resource mode) for more advance IDE.

 The second parameter is a bool to tokenize comments or not. Is that
 enough ?

No.

And doing Tokens as special comment token is frankly bad idea. See 
Walter's comments in this thread.

Also e.g. For compiler only DDoc ones are ever useful, not so for IDE. 
Filtering them out later is inefficient, as it would be far better not 
to create them in the first place.

 The onError look like a typical use case for conditions as explained in
 the huge thread on Exception.

mm I lost track of that discussion. Either way I see statically bound 
function as good enough hook into the process as it can do anything 
useful: skip wrong chars, throw exception, stop parsing prematurely, 
whatever - pick your poison.

-- 
Dmitry Olshansky

Aug 06 2012

Jacob Carlborg <doob me.com> writes:

On 2012-08-06 22:26, Dmitry Olshansky wrote:

 No.

 And doing Tokens as special comment token is frankly bad idea. See
 Walter's comments in this thread.

 Also e.g. For compiler only DDoc ones are ever useful, not so for IDE.
 Filtering them out later is inefficient, as it would be far better not
 to create them in the first place.

The Eclipse plugin Descent can show formatted DDoc comments.

-- 
/Jacob Carlborg

Aug 06 2012

Dmitry Olshansky <dmitry.olsh gmail.com> writes:

On 07-Aug-12 01:48, Jacob Carlborg wrote:
 On 2012-08-06 22:26, Dmitry Olshansky wrote:

 No.

 And doing Tokens as special comment token is frankly bad idea. See
 Walter's comments in this thread.

 Also e.g. For compiler only DDoc ones are ever useful, not so for IDE.
 Filtering them out later is inefficient, as it would be far better not
 to create them in the first place.

 The Eclipse plugin Descent can show formatted DDoc comments.

I've meant that IDE may be interested in more then just DDoc, at least 
to highlight things properly.

-- 
Dmitry Olshansky

Aug 06 2012

travert phare.normalesup.org (Christophe Travert) writes:

Jonathan M Davis , dans le message (digitalmars.D:174223), a �crit�:
 On Saturday, August 04, 2012 15:32:22 Dmitry Olshansky wrote:
 I see it as a compile-time policy, that will fit nicely and solve both
 issues. Just provide a templates with a few hooks, and add a Noop policy
 that does nothing.

 
 It's starting to look like figuring out what should and shouldn't be 
 configurable and how to handle it is going to be the largest problem in the 
 lexer...

Yes, I figured-out a policy could be used to, but since the begining of 
the thread, that makes a lot of things to configure! Jonathan would 
have trouble trying to make them all. Choices have to be made. That's 
why I proposed to use adapter range to enable to do the buffering 
instead of slicing, and to build the look up table. Done correctly, it 
can make the core of the lexer imlementation clean without loosing 
efficiency (I hope). If this policy for parsing literals if the only 
thing that remains to be configured directly in the core of the lexer 
with static if, then it's reasonable.

Aug 04 2012

"Bernard Helyer" <b.helyer gmail.com> writes:

http://i.imgur.com/oSXTc.png

Posted without comment.

Aug 02 2012

Walter Bright <newshound2 digitalmars.com> writes:

On 8/2/2012 12:09 AM, Bernard Helyer wrote:
 http://i.imgur.com/oSXTc.png

 Posted without comment.

I'm afraid I'm mystified as to your point.

Aug 02 2012

"Bernard Helyer" <b.helyer gmail.com> writes:

On Thursday, 2 August 2012 at 07:32:59 UTC, Walter Bright wrote:
 On 8/2/2012 12:09 AM, Bernard Helyer wrote:
 http://i.imgur.com/oSXTc.png

 Posted without comment.

 I'm afraid I'm mystified as to your point.

Just that I'm slaving away over a hot IDE here. :P

Aug 02 2012

Walter Bright <newshound2 digitalmars.com> writes:

On 8/2/2012 12:34 AM, Bernard Helyer wrote:
 Just that I'm slaving away over a hot IDE here. :P

Ah! Well, keep on keepin' on, then!

Aug 02 2012

Chad J <chadjoan __spam.is.bad__gmail.com> writes:

On 08/02/2012 03:09 AM, Bernard Helyer wrote:
 http://i.imgur.com/oSXTc.png

 Posted without comment.

Hell yeah Alexander Brandon.

Aug 04 2012

dennis luehring <dl.soluz gmx.net> writes:

 10. High speed matters a lot

then add a benchmark "suite" to the list - the lexer should be 
benchmarked from the very first beginning

and it should be designed for multithreading - there is no need for
on-the-fly hash-table updating - maybe just one update on each lex 
threads end

Aug 02 2012

Piotr Szturmaj <bncrbme jadamspam.pl> writes:

Walter Bright wrote:
 1. It should accept as input an input range of UTF8. I feel it is a
 mistake to templatize it for UTF16 and UTF32. Anyone desiring to feed it
 UTF16 should use an 'adapter' range to convert the input to UTF8. (This
 is what component programming is all about.)

Why it is a mistake? I think Lexer should parse any UTF range and return 
compatible token's strings. That is it should provide strings for UTF8 
input, wstrings for UTF16 input and so on.

Aug 02 2012

Walter Bright <newshound2 digitalmars.com> writes:

On 8/2/2012 2:27 AM, Piotr Szturmaj wrote:
 Walter Bright wrote:
 1. It should accept as input an input range of UTF8. I feel it is a
 mistake to templatize it for UTF16 and UTF32. Anyone desiring to feed it
 UTF16 should use an 'adapter' range to convert the input to UTF8. (This
 is what component programming is all about.)

 Why it is a mistake?

Because the lexer is large and it would have to have a lot of special case code 
inserted here and there to make that work.

 I think Lexer should parse any UTF range and return
 compatible token's strings. That is it should provide strings for UTF8 input,
 wstrings for UTF16 input and so on.

Why? I've never seen any UTF16 or UTF32 D source in the wild.

Besides, if it is not templated then it doesn't need to be recompiled by every 
user of it - it can exist as object code in the library.

Aug 02 2012

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

On 8/2/12 6:07 AM, Walter Bright wrote:
 Why? I've never seen any UTF16 or UTF32 D source in the wild.

Here's a crazy idea that I'll hang to this one remark. No, two crazy ideas.

First, after having read the large back-and-forth Jonathan/Walter in one 
sitting, it's becoming obvious to me you'll never understand each other 
on this nontrivial matter through this medium. I suggest you set up a 
skype/phone call. Once you get past the first 30 seconds of social 
awkwardness of hearing each other's voice, you'll make fantastic 
progress in communicating.

Regarding the problem at hand, it's becoming painfully obvious to me 
that the lexer MUST do its own decoding internally. Hence, a very simple 
thing to do is have the entire lexer only deal with ranges of ubyte. If 
someone passes a char[], the lexer's front end can simply call 
s.representation and obtain the underlying ubyte[].

If someone passes some range of char, the lexer uses an adapter (e.g. 
map()) that casts every char to ubyte, which is a zero-cost operation. 
Then it uses the same core operating on ranges of ubyte.

In the first implementation, the lexer may actually refuse any range of 
16-bit or 32-bit elements (wchar[], range of wchar, dchar[], range of 
dchar). Later on the core may be evolved to handle range of ushort and 
range of dchar. The front-end would use again representation() against 
wchar[], cast with range of wchar, and would just pass the dchar[] and 
range of dchar around.

This makes the core simple and efficient (I think Jonathan's use of 
static if and mixins everywhere, while well-intended, complicates 
matters without necessity).

And as such we have a lexer! Which operates with ranges, just has a 
simple front-end clarifying that the lexer must do its own decoding.

Works?


Andrei

Aug 02 2012

Michel Fortin <michel.fortin michelf.ca> writes:

On 2012-08-02 12:28:03 +0000, Andrei Alexandrescu 
<SeeWebsiteForEmail erdani.org> said:

 Regarding the problem at hand, it's becoming painfully obvious to me 
 that the lexer MUST do its own decoding internally.

That's not a great surprise to me. I hit the same issues when writing 
my XML parser, which is why I invented functions called frontUnit and 
popFrontUnit. I'm glad you're realizing this.

 Hence, a very simple thing to do is have the entire lexer only deal 
 with ranges of ubyte. If someone passes a char[], the lexer's front end 
 can simply call s.representation and obtain the underlying ubyte[].

That's ugly, but it could work (assuming s.representation returns the 
casted range by ref). I still prefer my frontUnit and popFrontUnit 
approach though.

In fact, any parser for which speed is important will have to bypass 
std.range's clever handling of UTF characters. Dealing simply with 
ubytes isn't enough, since in some cases you'll want to fire up the UTF 
decoder.

The next issue, which I haven's seen discussed here is that for a 
parser to be efficient it should operate on buffers. You can make it 
work with arbitrary ranges, but if you don't have a buffer you can 
slice when you need to preserve a string, you're going to have to build 
the string character by character, which is not efficient at all. But 
then you can only really return slices if the underlying representation 
is the same as the output representation, and unless your API has a 
templated output type, you're going to special case a lot of things.

After having attempted an XML parser with ranges, I'm not sure parsing 
using generic ranges can be made very efficient. Automatic conversion 
to UTF-32 is a nuisance for performance, and if the output needs to 
return parts of the input, you'll need to create an inefficient special 
case just to allocate many new strings in the correct format.

I wonder how your call with Walter will turn out.

-- 
Michel Fortin
michel.fortin michelf.ca
http://michelf.ca/

Aug 02 2012

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

On 8/2/12 2:17 PM, Michel Fortin wrote:
 On 2012-08-02 12:28:03 +0000, Andrei Alexandrescu
 <SeeWebsiteForEmail erdani.org> said:
 Hence, a very simple thing to do is have the entire lexer only deal
 with ranges of ubyte. If someone passes a char[], the lexer's front
 end can simply call s.representation and obtain the underlying ubyte[].

 That's ugly, but it could work (assuming s.representation returns the
 casted range by ref). I still prefer my frontUnit and popFrontUnit
 approach though.

I agree frontUnit and popFrontUnit are more generic because they allow 
other ranges to define them.

 In fact, any parser for which speed is important will have to bypass
 std.range's clever handling of UTF characters. Dealing simply with
 ubytes isn't enough, since in some cases you'll want to fire up the UTF
 decoder.

 The next issue, which I haven's seen discussed here is that for a parser
 to be efficient it should operate on buffers. You can make it work with
 arbitrary ranges, but if you don't have a buffer you can slice when you
 need to preserve a string, you're going to have to build the string
 character by character, which is not efficient at all. But then you can
 only really return slices if the underlying representation is the same
 as the output representation, and unless your API has a templated output
 type, you're going to special case a lot of things.

I think a BufferedRange could go a long way here.

 After having attempted an XML parser with ranges, I'm not sure parsing
 using generic ranges can be made very efficient. Automatic conversion to
 UTF-32 is a nuisance for performance, and if the output needs to return
 parts of the input, you'll need to create an inefficient special case
 just to allocate many new strings in the correct format.

I'm not so sure, but I'll measure.

 I wonder how your call with Walter will turn out.

What call?


Thanks,

Andrei

Aug 02 2012

Jacob Carlborg <doob me.com> writes:

On 2012-08-02 22:26, Andrei Alexandrescu wrote:
 On 8/2/12 2:17 PM, Michel Fortin wrote:

 I wonder how your call with Walter will turn out.

 What call?

The skype call you suggested:

"First, after having read the large back-and-forth Jonathan/Walter in 
one sitting, it's becoming obvious to me you'll never understand each 
other on this nontrivial matter through this medium. I suggest you set 
up a skype/phone call. Once you get past the first 30 seconds of social 
awkwardness of hearing each other's voice, you'll make fantastic 
progress in communicatin"

-- 
/Jacob Carlborg

Aug 02 2012

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

On 8/2/12 4:44 PM, Jacob Carlborg wrote:
 On 2012-08-02 22:26, Andrei Alexandrescu wrote:
 On 8/2/12 2:17 PM, Michel Fortin wrote:

 I wonder how your call with Walter will turn out.

 What call?

 The skype call you suggested:

 "First, after having read the large back-and-forth Jonathan/Walter in
 one sitting, it's becoming obvious to me you'll never understand each
 other on this nontrivial matter through this medium. I suggest you set
 up a skype/phone call. Once you get past the first 30 seconds of social
 awkwardness of hearing each other's voice, you'll make fantastic
 progress in communicatin"

Oh, ok, I thought I'd be on the call. That would be between Jonathan and 
Walter.

Andrei

Aug 02 2012

travert phare.normalesup.org (Christophe Travert) writes:

Andrei Alexandrescu , dans le message (digitalmars.D:174060), a �crit�:
 I agree frontUnit and popFrontUnit are more generic because they allow 
 other ranges to define them.

Any range of dchar could have a representation (or you may want to call 
it something else) that returns a range of char (or ubyte). And I think 
they are more generic because they use a generic API (ie range), that is 
very powerful: the representation can provide length, slicing, etc... 
that is different that the dchar length or whatever. You don't want to 
duplicate all range methods by postfixing Unit...

 I wonder how your call with Walter will turn out.

 
 What call?

You proposed Jonathan to call Walter in an earlier post. I believe there 
is a misunderstandment.

Aug 02 2012

Piotr Szturmaj <bncrbme jadamspam.pl> writes:

Michel Fortin wrote:
 The next issue, which I haven's seen discussed here is that for a parser
 to be efficient it should operate on buffers. You can make it work with
 arbitrary ranges, but if you don't have a buffer you can slice when you
 need to preserve a string, you're going to have to build the string
 character by character, which is not efficient at all. But then you can
 only really return slices if the underlying representation is the same
 as the output representation, and unless your API has a templated output
 type, you're going to special case a lot of things.

Instead of returning whole strings, you may provide another sub-range 
for the string itself. I wrote JSON parser using this approach 
(https://github.com/pszturmaj/json-streaming-parser) and thanks to that 
it is possible to parse json without a single heap allocation. This 
could be also used in XML parser.

Aug 02 2012

Jonathan M Davis <jmdavisProg gmx.com> writes:

On Friday, August 03, 2012 07:30:32 Philippe Sigaud wrote:
 Look, many of us here were interested in your idea of having comments
 and errors lexed as tokens.
 Could it be possible to just add two static policies errorBehavior {
 report, skip, ...} and comments { asToken, grouped }?
 That way, a user just say;
 
 // Walter
 auto toks = Lexer!(comments.grouped)(code); // May be the default value
 // Other:
 auto toks = Lexer!(comments.asToken)(code);
 
 It's just a small static if away...

I'm sure it's possible. I just don't know how messy it will be. It depends on 
how the code is affected by needing to not treat comments and errors as tokens 
for what Walter wants. I'll see what I can do though.

- Jonathan M Davis

Aug 02 2012

Brad Roberts <braddr puremagic.com> writes:

To help with performance comparisons I ripped dmd's lexer out and got it
building as a few .d files.  It's very crude.
It's got tons of casts (more than the original c++ version).  I attempted no
cleanup or any other change than the
minimum I could to get it to build and run.  Obviously there's tons of room for
cleanup, but that's not the point...
it's just useful as a baseline.

The branch:
    https://github.com/braddr/phobos/tree/dmd_lexer

The commit with the changes:
    https://github.com/braddr/phobos/commit/040540ef3baa38997b15a56be3e9cd9c4bfa51ab

On my desktop (far from idle, it's running 2 of the auto testers), it
consistently takes 0.187s to lex all of the .d
files in phobos.

Later,
Brad

On 8/1/2012 5:10 PM, Walter Bright wrote:
 Given the various proposals for a lexer module for Phobos, I thought I'd share
some characteristics it ought to have.
 
 First of all, it should be suitable for, at a minimum:
 
 1. compilers
 
 2. syntax highlighting editors
 
 3. source code formatters
 
 4. html creation
 
 To that end:
 
 1. It should accept as input an input range of UTF8. I feel it is a mistake to
templatize it for UTF16 and UTF32. Anyone
 desiring to feed it UTF16 should use an 'adapter' range to convert the input
to UTF8. (This is what component
 programming is all about.)
 
 2. It should output an input range of tokens
 
 3. tokens should be values, not classes
 
 4. It should avoid memory allocation as much as possible
 
 5. It should read or write any mutable global state outside of its "Lexer"
 instance
 
 6. A single "Lexer" instance should be able to serially accept input ranges,
sharing and updating one identifier table
 
 7. It should accept a callback delegate for errors. That delegate should
decide whether to:
    1. ignore the error (and "Lexer" will try to recover and continue)
    2. print an error message (and "Lexer" will try to recover and continue)
    3. throw an exception, "Lexer" is done with that input range
 
 8. Lexer should be configurable as to whether it should collect information
about comments and ddoc comments or not
 
 9. Comments and ddoc comments should be attached to the next following token,
they should not themselves be tokens
 
 10. High speed matters a lot
 
 11. Tokens should have begin/end line/column markers, though most of the time
this can be implicitly determined
 
 12. It should come with unittests that, using -cov, show 100% coverage
 
 
 Basically, I don't want anyone to be motivated to do a separate one after
seeing this one.

Aug 05 2012

Walter Bright <newshound2 digitalmars.com> writes:

On 8/5/2012 12:59 AM, Brad Roberts wrote:
 To help with performance comparisons I ripped dmd's lexer out and got it
building as a few .d files.  It's very crude.
 It's got tons of casts (more than the original c++ version).  I attempted no
cleanup or any other change than the
 minimum I could to get it to build and run.  Obviously there's tons of room
for cleanup, but that's not the point...
 it's just useful as a baseline.

 The branch:
      https://github.com/braddr/phobos/tree/dmd_lexer

 The commit with the changes:
      https://github.com/braddr/phobos/commit/040540ef3baa38997b15a56be3e9cd9c4bfa51ab

 On my desktop (far from idle, it's running 2 of the auto testers), it
consistently takes 0.187s to lex all of the .d
 files in phobos.

 Later,
 Brad

Thanks, Brad!

Aug 05 2012

Ary Manzana <ary esperanto.org.ar> writes:

On 8/1/12 21:10 , Walter Bright wrote:
 8. Lexer should be configurable as to whether it should collect
 information about comments and ddoc comments or not

 9. Comments and ddoc comments should be attached to the next following
 token, they should not themselves be tokens

I believe there should be an option to get comments as tokens. Or 
otherwise the attached comments should have source location...

Aug 06 2012

D Programming

C/C++ Programming

Other

digitalmars.D - std.d.lexer requirements