
digitalmars.D.learn - Lexer in D

reply "Namespace" <rswhite4 googlemail.com> writes:
For one of our courses this year we have a very cool topic: Lexer 
and Compiler.
In the past I have already tried some experiments in this 
direction with Romulus and Remus, but I've never measured my 
Lexing time.
That's why I wanted to start again to write a lexer (hitherto 
purely for the learning effect).
I looked at the code of some of the other lexers that are also 
written in D to get some inspiration, but currently my lexer is a 
little bit slow: it requires, for example, ~200 msecs for 
std.datetime.
So I wanted to ask whether there are nice people who would help me 
to improve our lexer.

Code: https://github.com/Dgame/Puzzle/blob/master/lexer.d

Thanks in advance. :)
Mar 02 2013
next sibling parent reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
02-Mar-2013 13:33, Namespace writes:
 For one of our courses this year we have a very cool topic: Lexer and
 Compiler.
 In the past I have already tried some experiments in this direction with
 Romulus and Remus, but I've never measured my Lexing time.
 That's why I wanted to start again to write a lexer (hitherto purely for
 the learning effect).
 I looked at the code of some of the other lexers, that are also written
 in D, to get some inspiration from, but currently my Lexer is a little
 bit slow, it requires for example ~200 msecs for std.datetime.
 So I wanted to ask whether there are nice people who help me to improve
 our lexer.

 Code: https://github.com/Dgame/Puzzle/blob/master/lexer.d

 Thanks in advance. :)
See this one I'm currently investing my time into:
https://github.com/Hackerpilot/Dscanner/tree/range-based-lexer

Other than this, use cachegrind and compile with ldc/gdc. :)
Also for dmd, check that the flags are "-release -O -inline -noboundscheck".

-- 
Dmitry Olshansky
Mar 02 2013
parent "Namespace" <rswhite4 googlemail.com> writes:
I'm already using these flags. But thanks for the link, I will 
take a look at it. :)
Mar 02 2013
prev sibling parent reply David <d dav1d.de> writes:
On 02.03.2013 10:33, Namespace wrote:
 For one of our courses this year we have a very cool topic: Lexer and
 Compiler.
 In the past I have already tried some experiments in this direction with
 Romulus and Remus, but I've never measured my Lexing time.
 That's why I wanted to start again to write a lexer (hitherto purely for
 the learning effect).
 I looked at the code of some of the other lexers, that are also written
 in D, to get some inspiration from, but currently my Lexer is a little
 bit slow, it requires for example ~200 msecs for std.datetime.
 So I wanted to ask whether there are nice people who help me to improve
 our lexer.
 
 Code: https://github.com/Dgame/Puzzle/blob/master/lexer.d
 
 Thanks in advance. :)
Making `types` and `keywords` an array might improve the speed a
little bit:

if (id in types)  ->  if (types.canFind(id))

(same with keywords)
Mar 02 2013
parent reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
02-Mar-2013 15:04, David writes:
 Am 02.03.2013 10:33, schrieb Namespace:
 For one of our courses this year we have a very cool topic: Lexer and
 Compiler.
 In the past I have already tried some experiments in this direction with
 Romulus and Remus, but I've never measured my Lexing time.
 That's why I wanted to start again to write a lexer (hitherto purely for
 the learning effect).
 I looked at the code of some of the other lexers, that are also written
 in D, to get some inspiration from, but currently my Lexer is a little
 bit slow, it requires for example ~200 msecs for std.datetime.
 So I wanted to ask whether there are nice people who help me to improve
 our lexer.

 Code: https://github.com/Dgame/Puzzle/blob/master/lexer.d

 Thanks in advance. :)
Making `types` and `keywords` an array might improve the speed a little bit: if (id in types) -> if (types.canFind(id)) (same with keywords)
That would be slower, as canFind is a linear search.
A simpler way is to use the first letter and the length to select a
bucket of keywords.

But otherwise, yes, the first advice is to never use the built-in AA,
as these take quite some time to allocate. Plus they are not that fast
to look up, as they use mod-prime rather than mod-power-of-2 to pick a
bucket.

-- 
Dmitry Olshansky
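The bucketing idea can be sketched roughly like this (the keyword set and the `maxLen` bound are hypothetical stand-ins, not the thread's code): pick the bucket by word length and first byte, so each lookup compares only the handful of candidates sharing both.

```d
import std.algorithm.searching : canFind;

// Hypothetical keyword set, just for illustration.
immutable string[] kwords = ["if", "in", "int", "for", "foreach", "while"];

enum maxLen = 16;                    // assumed longest keyword length
string[][256][maxLen + 1] buckets;   // indexed [length][first byte]

static this()
{
    foreach (w; kwords)
        buckets[w.length][w[0]] ~= w;
}

bool isKeyword(string value)
{
    if (value.length == 0 || value.length > maxLen)
        return false;
    // Only words sharing length and first letter are ever compared.
    return buckets[value.length][value[0]].canFind(value);
}

void main()
{
    assert(isKeyword("foreach"));
    assert(!isKeyword("foo"));
}
```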
Mar 02 2013
next sibling parent reply "Namespace" <rswhite4 googlemail.com> writes:
With canFind it takes ~300 msecs. :/
Any idea how I could improve it?

 Simpler way is use first letter and length to select a bucket 
 of keywords.
I don't understand how you could do this without AAs, and how it could be more performant than canFind.
Mar 02 2013
parent "Namespace" <rswhite4 googlemail.com> writes:
I sorted both types and keywords alphabetically and wrote my own 
'canFind':

bool canFind(const size_t size)(ref const string[size] values, 
string value) pure nothrow {
	if (value[0] < 'm' || value[0] == '_') {
		foreach (string _val; values) {
			if (_val == value) return true;
		}
	} else {
		foreach_reverse (string _val; values) {
			if (_val == value) return true;
		}
	}
	
	return false;
}

That brings ~20 msecs, so I have ~280 msecs instead of ~300. 
But it isn't perfect.
Mar 02 2013
prev sibling next sibling parent reply "Namespace" <rswhite4 googlemail.com> writes:
 That would be slower as canFind is linear search.
 Simpler way is use first letter and length to select a bucket 
 of keywords.
I think I get it. Did you mean something like this?

[code]
immutable ubyte[char] typeMap;
immutable ubyte[char] keywordMap;

static this() {
    typeMap = [
        'd' : 7, 'l' : 16, 'i' : 12, 'u' : 20, 'b' : 0, 'f' : 10,
        'r' : 17, 'v' : 25, 'c' : 2, 's' : 18, 'w' : 26
    ];

    keywordMap = [
        '_' : 0, 'a' : 6, 'b' : 12, 'c' : 14, 'd' : 20, 'e' : 26,
        'f' : 30, 'g' : 36, 'i' : 37, 'l' : 45, 'm' : 46, 'n' : 49,
        'o' : 52, 'p' : 54, 'r' : 60, 's' : 62, 't' : 69, 'u' : 77,
        'v' : 79, 'w' : 81
    ];
}

bool isKeyword(string value) pure nothrow {
    if (value[0] !in keywordMap) return false;

    foreach (string kword; keywords[keywordMap[value[0]] .. $]) {
        if (kword == value) return true;
        if (kword[0] != value[0]) return false;
    }

    return false;
}

bool isType(string value) pure nothrow {
    if (value[0] !in typeMap) return false;

    foreach (string type; types[typeMap[value[0]] .. $]) {
        if (type == value) return true;
        if (type[0] != value[0]) return false;
    }

    return false;
}
[/code]

I'm now at ~230 msecs.
Mar 02 2013
parent reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
02-Mar-2013 18:53, Namespace writes:
 That would be slower as canFind is linear search.
 Simpler way is use first letter and length to select a bucket of
 keywords.
I think I get it. Did you mean something like this?
No-no and no.

Just translate the list of keywords/types to a bunch of switch-cases;
that will give you near optimal performance. Use CTFE to construct that
string of D code. In fact I proposed to do just 2 levels of switches:
first switch by length, then by first char.

Even simpler, if you don't want to go for CTFE code-generation, is to
make a plain array of arrays. The length of the array is exactly 256,
i.e. even this should have decent speed:

string[][256] keywords = [
    null,
    null,
    ...
    //at index 'i'
    ['int', 'if', 'invariant', 'idouble', 'ifloat' ...]
    //array of strings starting with that char, for each char
    ...
];

bool isKeyword(string value)
{
    string[] toSearch = keywords[value[0]];
    return toSearch.canFind(value);
}

Better yet, only store the [1..$] parts of the keywords and search for
value[1..$]. Even faster is to do the same for each possible length of
keyword.
 [code]
 immutable ubyte[char] typeMap;
 immutable ubyte[char] keywordMap;

 static this() {
      typeMap = [
          'd' : 7, 'l' : 16, 'i' : 12, 'u' : 20, 'b' : 0, 'f' : 10, 'r' :
 17, 'v' : 25, 'c' : 2, 's' : 18, 'w' : 26
      ];

      keywordMap = [
          '_' : 0, 'a' : 6, 'b' : 12, 'c' : 14, 'd' : 20, 'e' : 26, 'f' :
 30, 'g' : 36, 'i' : 37, 'l' : 45,
          'm' : 46, 'n' : 49, 'o' : 52, 'p' : 54, 'r' : 60, 's' : 62, 't'
 : 69, 'u' : 77, 'v' : 79, 'w' : 81
      ];
 }

 bool isKeyword(string value) pure nothrow {
      if (value[0] !in keywordMap) return false;

      foreach (string kword; keywords[keywordMap[value[0]] .. $]) {
          if (kword == value) return true;
          if (kword[0] != value[0]) return false;
      }

      return false;
 }

 bool isType(string value) pure nothrow {
      if (value[0] !in typeMap) return false;

      foreach (string type; types[typeMap[value[0]] .. $]) {
          if (type == value) return true;
          if (type[0] != value[0]) return false;
      }

      return false;
 }
 [/code]

 I'm now by ~230 msecs.
-- Dmitry Olshansky
Mar 02 2013
parent reply "Namespace" <rswhite4 googlemail.com> writes:
 No-no and no.

 Just translate list of keywords/types to a bunch of 
 switch-cases that will give you near optimal performance. Use 
 CTFE to construct that string of D code.
 In fact I proposed to do just 2 levels of switches: first 
 switch by length then by first char.
I don't understand. You don't mean something like:

switch (value) {
    case "if": isKeyword = true; break;
    ...

right?
 Even simple if you don't want to go for CTFE code-generation is 
 to make plain array of array. The length of array is exactly 
 256 i.e.

 Even this should have decent speed:

 string[][256] keywords = [
 null,
 null,
 ...
 //at index 'i'
 ['int', 'if', 'invariant', 'idouble', 'ifloat' ...]
 //array of strings starting with char for each
 ...
 ];

 bool isKeyword(string value)
 {
 	string[] toSearch = keywords[value[0]];
 	return toSearch.canFind(value);
 }
Ok I will try it. Thank you.
Mar 02 2013
parent reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
02-Mar-2013 22:09, Namespace writes:
 No-no and no.

 Just translate list of keywords/types to a bunch of switch-cases that
 will give you near optimal performance. Use CTFE to construct that
 string of D code.
 In fact I proposed to do just 2 levels of switches: first switch by
 length then by first char.
 I don't understand. You don't mean something like:

 switch (value) {
     case "if": isKeyword = true; break;
     ...

 right?
No.

switch(value.length)
case 2:
    switch(value[0]) {
    case 'i':
        switch(value[1]) {
        case 'f': return value.length == 2;
        case ...
        ...
        }
    ...
case 3:
    ...
}
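The CTFE variant could look something like this sketch (hypothetical keyword list; it generates a single string switch rather than the full two-level length/char version, but the principle is the same: build the switch source at compile time and mix it in):

```d
// Hypothetical keyword list for the demo.
immutable string[] kwords = ["if", "in", "for", "int", "while"];

// Build the switch statement as a source string at compile time.
string genIsKeyword(immutable string[] words) pure
{
    string code = "switch (value) {\n";
    foreach (w; words)
        code ~= `case "` ~ w ~ `": return true;` ~ "\n";
    code ~= "default: return false;\n}";
    return code;
}

bool isKeyword(string value) pure
{
    // The mixin runs genIsKeyword during compilation; the compiler then
    // emits its optimized code for the string switch.
    mixin(genIsKeyword(kwords));
}

void main()
{
    assert(isKeyword("while"));
    assert(!isKeyword("banana"));
}
```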
 Even simple if you don't want to go for CTFE code-generation is to
 make plain array of array. The length of array is exactly 256 i.e.

 Even this should have decent speed:

 string[][256] keywords = [
 null,
 null,
 ...
 //at index 'i'
 ['int', 'if', 'invariant', 'idouble', 'ifloat' ...]
 //array of strings starting with char for each
 ...
 ];

 bool isKeyword(string value)
 {
     string[] toSearch = keywords[value[0]];
     return toSearch.canFind(value);
 }
Ok I will try it. Thank you.
-- Dmitry Olshansky
Mar 02 2013
parent reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
02-Mar-2013 22:15, Dmitry Olshansky writes:
 02-Mar-2013 22:09, Namespace writes:
 No-no and no.

 Just translate list of keywords/types to a bunch of switch-cases that
 will give you near optimal performance. Use CTFE to construct that
 string of D code.
 In fact I proposed to do just 2 levels of switches: first switch by
 length then by first char.
 I don't understand. You don't mean something like:

 switch (value) {
     case "if": isKeyword = true; break;
     ...

 right?
No.

switch(value.length)
case 2:
    switch(value[0]) {
    case 'i':
        switch(value[1]) {
        case 'f': return value.length == 2;
return true;

I edited the post to include the length switch before the per-char
switch.
          case ...
      ...
      }
 ...
 case 3:
 ...
 }
-- Dmitry Olshansky
Mar 02 2013
parent reply "Namespace" <rswhite4 googlemail.com> writes:
I hope I understand you right this time:
http://dpaste.1azy.net/4c2e4428

Was this your idea?
With this I reached between 215 and 230 msecs.
Mar 02 2013
parent reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
02-Mar-2013 23:01, Namespace writes:
 I hope I understand you right this time:
 http://dpaste.1azy.net/4c2e4428

 Was this your idea?
 With this I reached between 215 and 230 msecs.
This mess of conditionals is what a direct array of 256 entries would
avoid:

	int idx;
	if (value[0] != '_') {
		idx = value[0] - 'a';
		if (idx == 26) return false;
	} else {
		idx = 26;
	}

	if (idx < 0 || idx > 26) {
		// debug writeln("kword: ", idx, ':', value[0]);
		return false;
	}

	if (keywords[idx] is null) return false;

	return keywords[idx].canFind(value);

Gaining some speed in the process. Plus another layer of array to
discern keywords by length. You see why I suggested generating the code
in the first place? ;)

BTW, what's the reason to separate keywords and type keywords? They are
processed the same in the lexer, and only the parser somewhere up above
knows what to do with them regardless. Just return different token
values for each.

-- 
Dmitry Olshansky
Mar 02 2013
parent reply "Namespace" <rswhite4 googlemail.com> writes:
 This is a mess of conditionals is what a direct array of 256 
 entries would avoid:

 	int idx;
 	if (value[0] != '_') {
 		idx = value[0] - 'a';
 		if (idx == 26) return false;
 	} else {
 		idx = 26;
 	}
 	
 	if (idx < 0 || idx > 26) {
 		// debug writeln("kword: ", idx, ':', value[0]);
 		return false;
 	}
 	
 	if (keywords[idx] is null) return false;
 	
 	return keywords[idx].canFind(value);

 Gaining some speed in the process. Plus another layer of array 
 to discern keywords by length. You see why I suggested to 
 generate the code in the first place ? ;)

 BTW what's the reason to separate keywords and type keywords? 
 They are processed the same in lexer and only parser somewhere 
 up above knows what to do with these regardless. Just return 
 different token values for each.
I changed it and merged them together. Also, I now use an array of 256
entries, but I must keep the check whether idx is < 0, because
'D' - 'a' is negative. And yes, I see what you meant. ^^

Code: http://dpaste.1azy.net/317241c0

I still reach 215 - 230 msecs.
Mar 02 2013
next sibling parent "Namespace" <rswhite4 googlemail.com> writes:
It is sufficient if my array has only 26 entries, because I check 
only lower-case letters.
Mar 02 2013
prev sibling parent Dmitry Olshansky <dmitry.olsh gmail.com> writes:
03-Mar-2013 00:03, Namespace writes:
 This is a mess of conditionals is what a direct array of 256 entries
 would avoid:

     int idx;
     if (value[0] != '_') {
         idx = value[0] - 'a';
         if (idx == 26) return false;
     } else {
         idx = 26;
     }

     if (idx < 0 || idx > 26) {
         // debug writeln("kword: ", idx, ':', value[0]);
         return false;
     }

     if (keywords[idx] is null) return false;

     return keywords[idx].canFind(value);

 Gaining some speed in the process. Plus another layer of array to
 discern keywords by length. You see why I suggested to generate the
 code in the first place ? ;)

 BTW what's the reason to separate keywords and type keywords? They are
 processed the same in lexer and only parser somewhere up above knows
 what to do with these regardless. Just return different token values
 for each.
 I changed it and merged them together. Also I use now a array of 256
 entities, but I must keep the check if idx is < 0, because 'D' - 'a'
 is negative. And yes I see what you meant.^^
Obviously, you still largely don't. :) Use the full byte to do the
lookup, dammit. Saving a few array slots doesn't bring your speed up.
Using the full byte *directly* as the index does speed things up,
because you don't have to do *any* computations and *conditionals*.
Conditionals that can easily go both ways are the worst thing
performance-wise. I don't know how to phrase that better.
 Code: http://dpaste.1azy.net/317241c0
 I reach still 215 - 230 msecs.
Use profiling to determine what takes most of the time. If you don't
like callgrind/cachegrind (I seriously don't know why you wouldn't use
them), try dmd's -profile.

Looking at your main code I see some spaghetti code and passing the
index by pointer. Don't do that - encapsulate the lexing state in some
kind of struct; then the index would be implicitly passed to all
functions via the `this` pointer. It may very well be the case that you
are bound by some other code.

In any case, after some heroic efforts Dscanner lexes std.datetime in
24 msecs, and that on a Linux VM powered by an oldish AMD Phenom II.
What kind of CPU do you use?

-- 
Dmitry Olshansky
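The "state in a struct" advice might look like this minimal sketch (names are hypothetical and real lexing is elided): the position lives inside the struct, so member functions share it implicitly through `this` instead of a `ref size_t` parameter.

```d
struct Lexer
{
    string src;   // the input text
    size_t pos;   // current index, shared by all members via `this`

    void skipWhitespace()
    {
        while (pos < src.length &&
               (src[pos] == ' ' || src[pos] == '\t' || src[pos] == '\n'))
            ++pos;
    }

    // Returns the next identifier-like word, advancing `pos`.
    string nextWord()
    {
        skipWhitespace();
        immutable start = pos;
        while (pos < src.length &&
               (src[pos] == '_' ||
                (src[pos] >= 'a' && src[pos] <= 'z') ||
                (src[pos] >= 'A' && src[pos] <= 'Z')))
            ++pos;
        return src[start .. pos];
    }
}

void main()
{
    auto lx = Lexer("int  x");
    assert(lx.nextWord() == "int");
    assert(lx.nextWord() == "x");   // pos carried over, no ref parameter
}
```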
Mar 02 2013
prev sibling parent reply "Zach the Mystic" <reachBUTMINUSTHISzach gOOGLYmail.com> writes:
On Saturday, 2 March 2013 at 11:59:39 UTC, Dmitry Olshansky wrote:
 02-Mar-2013 15:04, David writes:
 On 02.03.2013 10:33, Namespace wrote:
 For one of our courses this year we have a very cool topic: 
 Lexer and
 Compiler.
 In the past I have already tried some experiments in this 
 direction with
 Romulus and Remus, but I've never measured my Lexing time.
 That's why I wanted to start again to write a lexer (hitherto 
 purely for
 the learning effect).
 I looked at the code of some of the other lexers, that are 
 also written
 in D, to get some inspiration from, but currently my Lexer is 
 a little
 bit slow, it requires for example ~200 msecs for std.datetime.
 So I wanted to ask whether there are nice people who help me 
 to improve
 our lexer.

 Code: https://github.com/Dgame/Puzzle/blob/master/lexer.d

 Thanks in advance. :)
 Making `types` and `keywords` an array might improve the speed a
 little bit:

 if (id in types) -> if (types.canFind(id))

 (same with keywords)
That would be slower as canFind is linear search.
Simpler way is use first letter and length to select a bucket of
keywords.

But otherwise, yes, the first advice is never use built-in AA as these
take quite some time to allocate. Plus they are not that fast to look
up as they use mod-prime not mod-power-of-2 to pick a bucket.
And it doesn't even need to: http://d.puremagic.com/issues/show_bug.cgi?id=9522
Mar 02 2013
parent reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
02-Mar-2013 23:50, Zach the Mystic writes:
 On Saturday, 2 March 2013 at 11:59:39 UTC, Dmitry Olshansky wrote:
 02-Mar-2013 15:04, David writes:
 On 02.03.2013 10:33, Namespace wrote:
 For one of our courses this year we have a very cool topic: Lexer and
 Compiler.
 In the past I have already tried some experiments in this direction
 with
 Romulus and Remus, but I've never measured my Lexing time.
 That's why I wanted to start again to write a lexer (hitherto purely
 for
 the learning effect).
 I looked at the code of some of the other lexers, that are also written
 in D, to get some inspiration from, but currently my Lexer is a little
 bit slow, it requires for example ~200 msecs for std.datetime.
 So I wanted to ask whether there are nice people who help me to improve
 our lexer.

 Code: https://github.com/Dgame/Puzzle/blob/master/lexer.d

 Thanks in advance. :)
 Making `types` and `keywords` an array might improve the speed a
 little bit:

 if (id in types) -> if (types.canFind(id))

 (same with keywords)
That would be slower as canFind is linear search.
Simpler way is use first letter and length to select a bucket of
keywords.

But otherwise, yes, the first advice is never use built-in AA as these
take quite some time to allocate. Plus they are not that fast to look
up as they use mod-prime not mod-power-of-2 to pick a bucket.
And it doesn't even need to: http://d.puremagic.com/issues/show_bug.cgi?id=9522
Yes, obviously the AA is becoming a central pain point that needs
fixing. If you are inclined to help, I suggest you join forces with
H.S. Teoh and Dicebot and help fix the AA.

-- 
Dmitry Olshansky
Mar 02 2013
parent reply "Namespace" <rswhite4 googlemail.com> writes:
I never used the -profile flag before (only -cov), so I want to 
ask if I understand it right:
http://dpaste.1azy.net/4382097b
It seems that the call which is listed on line 133

1589328      214949      214948           0     pure  trusted 
dchar std.utf.decode!(const(immutable(char)[])).decode(ref 
const(immutable(char)[]), ref uint)

takes the most time. Am I right?
And why is something being decoded? Should I use dchar instead of char?

And I won't take anything that is already finished, because I want 
to become more familiar with the matter.
I want to see what you have to pay attention to in order to get 
performance.
The 24 msecs would obviously be a dream, but I would be pleased 
with 50-100 msecs. :D
Mar 02 2013
parent reply "Namespace" <rswhite4 googlemail.com> writes:
I noticed it too late: decode comes from readText, because readText 
validates your text. I will now use std.file.read.

The new trace output is here, and it seems you're right: isKeyword 
and isType (yes, I decided to separate them again and keep both in 
the lexer) take a lot of time:
http://dpaste.1azy.net/0d6aff6b
  154874      106186       95403           0     pure nothrow bool 
Puzzle.Lexer.isKeyword(immutable(char)[])
  159688       78020       69289           0     pure nothrow bool 
Puzzle.Lexer.isType(immutable(char)[])

I never thought that. :)
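The readText/read difference boils down to validation: std.file.read returns the raw bytes, so reinterpreting them skips the UTF decode entirely. A hedged sketch (the file name is a stand-in), only safe when the input is known or assumed to be valid UTF-8:

```d
import std.file : read, remove, write;

// Like readText, but without the UTF validation pass: reinterpret the
// raw bytes that std.file.read returns as a string.
string readRaw(string path)
{
    return cast(string) read(path);
}

void main()
{
    write("demo.tmp", "int x;");   // hypothetical temp file for the demo
    scope(exit) remove("demo.tmp");
    assert(readRaw("demo.tmp") == "int x;");
}
```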
Mar 02 2013
next sibling parent reply "Namespace" <rswhite4 googlemail.com> writes:
I think that thanks to your suggestions isKeyword and isType are 
entirely inlined, because they aren't listed in the trace.log 
anymore (http://dpaste.1azy.net/b94b19ff). The only thing which 
makes trouble is the isNext function.
Maybe I should also use a StringStream (implemented as a range). 
What do you think?
Mar 02 2013
parent reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
03-Mar-2013 03:52, Namespace writes:
 I think that thanks to your suggestions isKeyword and isType are
 entirely inlined, because they aren't listet in the trace.log anymore
 (http://dpaste.1azy.net/b94b19ff). The only thing which makes trouble is
 the isNext function.
 Maybe I should also use a StringStream (implemented as Range). What do
 you think?
I'd repeat that I think it makes no sense to treat isType separately.
In any case there is way more to types than the built-in ones, and it
is the job of the parser (to assume types) and the semantic step (to
type-check and infer).

Another thing is to run -profile without -inline to understand the
structure of the time spent in each subroutine better.

-- 
Dmitry Olshansky
Mar 02 2013
parent reply "Namespace" <rswhite4 googlemail.com> writes:
 I'd repeat that I think it makes no sense to separately treat 
 isType. In any case there is way more to types then built-in 
 ones and is the job of parser (to assume types) and semantic 
 step (to type-check and infer).
Yes, I understand, but for now I want it this way.
 Another thing is to run -profile without inline to understand 
 the structure of time spent per each subroutine better.
I did this and I refactored a lot of my code. Now I get 170 - 180
msecs, and this is my current trace.log:
http://dpaste.1azy.net/b94b19ff

But currently I have no more ideas how I could gain more performance.
Maybe I should disable the compiler while looping through the text? Or
maybe I should allocate more space for the Token array (e.g.
toks.length = 100;)?

I hope that you or anyone else has further ideas. But anyway, thanks
for the help! :)
Mar 03 2013
parent reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
03-Mar-2013 18:28, Namespace writes:
 I'd repeat that I think it makes no sense to separately treat isType.
 In any case there is way more to types then built-in ones and is the
 job of parser (to assume types) and semantic step (to type-check and
 infer).
Yes I've understood, but currently I want it so.
 Another thing is to run -profile without inline to understand the
 structure of time spent per each subroutine better.
I did this and I refactored a lot of my code. Now I get 170 - 180 msecs and this is my current trace.log: http://dpaste.1azy.net/b94b19ff But currently I have no more ideas how I could gain more performance. Maybe I should disable the compiler while looping through the text? Or maybe I should allocate more space for the Token array (e.g. toks.length = 100;)? I hope that you or anyone other have further ideas. But anyway, thanks for the help! :)
Simple - don't use array append and don't produce an array. Just
produce a lazy forward range that is iterated.

At the very least use Appender!(Token[]); it ought to be much faster.

-- 
Dmitry Olshansky
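A minimal sketch of the Appender route (the Token struct and the whitespace "lexing" are placeholders): put() into an Appender, optionally reserve up front, and take .data at the end.

```d
import std.array : appender, split;

struct Token { string value; }   // hypothetical token type

Token[] lex(string src)
{
    auto toks = appender!(Token[])();
    toks.reserve(64);                // hedged guess at the token count
    foreach (word; src.split())      // stand-in for real lexing
        toks.put(Token(word));       // amortized growth, no ~= bookkeeping
    return toks.data;
}

void main()
{
    assert(lex("int x ;").length == 3);
}
```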
Mar 03 2013
parent reply "Namespace" <rswhite4 googlemail.com> writes:
 Simple - don't use array append and don't produce and array.
 Just produce a lazy forward range that is iterated.

 At the very least use Appender!(Token[]) it ought to be much 
 faster.
But doesn't each range have an array internally? Or how does that
work? I'll try to use Appender, but this lazy forward range sounds
interesting.
Mar 03 2013
next sibling parent reply "Namespace" <rswhite4 googlemail.com> writes:
On Sunday, 3 March 2013 at 16:19:27 UTC, Namespace wrote:
 Simple - don't use array append and don't produce and array.
 Just produce a lazy forward range that is iterated.

 At the very least use Appender!(Token[]) it ought to be much 
 faster.
But doesn't each range have an array internally? Or how does that
work? I'll try to use Appender, but this lazy forward range sounds
interesting.
Appender does not work for my Token struct, because I have
const/immutable members, so I cannot append anything.
Example: http://dpaste.1azy.net/21f100d3

Without 'const' it works fine.
Mar 03 2013
parent reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
03-Mar-2013 20:36, Namespace writes:
 On Sunday, 3 March 2013 at 16:19:27 UTC, Namespace wrote:
 Simple - don't use array append and don't produce and array.
 Just produce a lazy forward range that is iterated.

 At the very least use Appender!(Token[]) it ought to be much faster.
But hasn't each range an array internally? Or how does that work? I
try to use Appender, but this lazy forward range sounds interesting.
Appender does not work for my Token struct because I have
const/immutable members. So I cannot add something.
Example: http://dpaste.1azy.net/21f100d3
Without 'const' it works fine.
Just rip off the const; why would you need it anyway?

-- 
Dmitry Olshansky
Mar 03 2013
parent reply "Namespace" <rswhite4 googlemail.com> writes:
 Just rip off the const why would you need it anyway?
The data in 'tokens' are final, so why should they not be const?
What exactly is the problem if there are const members? Is this a
known problem? Otherwise, I will go and investigate it.
Mar 03 2013
parent reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
03-Mar-2013 20:45, Namespace writes:
 Just rip off the const why would you need it anyway?
The data in 'tokens' are final, so why they should not be const?
Then just return const(Token), i.e. plain full const. In general I'd
avoid marking data as bitwise const unless that is actually true. For
all you know, some algorithm up above might tweak existing token
values directly.
 What exactly is the problem if there are const member? Is this a known
 problem?
It's quite old and ugly as a lot of things would need to bit-blit stuff over it and fail to do so.
 Otherwise, I will go and investigate it.
-- Dmitry Olshansky
Mar 03 2013
parent reply "Namespace" <rswhite4 googlemail.com> writes:
It's quite amazing: with Appender I reach between 115 and 125 
msecs.
I adapted the Appender for myself and solved the problem with 
memcpy/memmove.
That works fine.
But is there anything else I can do (besides the LazyForwardRange) 
to gain more performance?
Mar 03 2013
parent reply "Namespace" <rswhite4 googlemail.com> writes:
I've installed cygwin to use the time benchmark function, and these 
are the results:
$ time ./Lexer.exe

real    0m0.225s
user    0m0.015s
sys     0m0.046s

I also read in the std.d.lexer thread that the use of 'char' 
isn't that good for a lexer, because dmd wants to decode 'char' 
to 'dchar'. The usage of ubyte should be better. Is that right?
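That matches how Phobos treats narrow strings: range primitives auto-decode char sequences into dchar, while a ubyte slice of the same memory stays on raw bytes. A small illustration:

```d
import std.range.primitives : front;

void main()
{
    string s = "Größe";

    // Range primitives decode narrow strings: the element is a dchar.
    static assert(is(typeof(s.front) == dchar));

    // A ubyte slice of the same memory iterates raw bytes, no decoding.
    auto bytes = cast(immutable(ubyte)[]) s;
    static assert(is(typeof(bytes.front) == immutable(ubyte)));
    assert(bytes.length == 7);   // "ö" and "ß" are two UTF-8 bytes each
}
```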
Mar 04 2013
parent reply "Namespace" <rswhite4 googlemail.com> writes:
By reserving 60% of the file size in memory I get a big speed 
boost.

------------------------
Total time (ms): 17544.6
Repetitions    : 200
Sample mode    : 87 (121 ocurrences)
Median time    : 87.3835
Avg time       : 87.7232
Std dev.       : 1.77905
Minimum        : 84.238
Maximum        : 101.919
95% conf.int.  : [84.2363, 91.21]  e = 3.48687
99% conf.int.  : [83.1407, 92.3057]  e = 4.58252
EstimatedAvg95%: [87.4766, 87.9697]  e = 0.246559
EstimatedAvg99%: [87.3991, 88.0472]  e = 0.324033

But I'm still away from:
Avg time       : 69.3088 
http://forum.dlang.org/thread/dpdgcycrgfspcxenzrjf forum.dlang.org?page=8#post-abncwoqtpnaoohihkliw:40forum.dlang.org

I've tried also 10%, 40% and 80% but 60% seems optimal.

And I think I have reached my performance limit. I could implement 
the LazyForwardRange and, if anyone can confirm the statement about 
char <-> dchar decoding and that the usage of ubyte would be 
(crucially) better, I will try this too.
Mar 04 2013
parent reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
04-Mar-2013 16:17, Namespace writes:
 With reserving of 60% of the file size in memory I get a big speed boost.

 ------------------------
 Total time (ms): 17544.6
 Repetitions    : 200
 Sample mode    : 87 (121 ocurrences)
 Median time    : 87.3835
 Avg time       : 87.7232
 Std dev.       : 1.77905
 Minimum        : 84.238
 Maximum        : 101.919
 95% conf.int.  : [84.2363, 91.21]  e = 3.48687
 99% conf.int.  : [83.1407, 92.3057]  e = 4.58252
 EstimatedAvg95%: [87.4766, 87.9697]  e = 0.246559
 EstimatedAvg99%: [87.3991, 88.0472]  e = 0.324033

 But I'm still away from:
 Avg time       : 69.3088
 http://forum.dlang.org/thread/dpdgcycrgfspcxenzrjf forum.dlang.org?page=8#post-abncwoqtpnaoohihkliw:40forum.dlang.org
Keep in mind the CPU you are testing on. What's your chip?

In any case, Dscanner has been improved to take far, far less time
than 60ms. Brain listed graphs showing under 20 ms on his CPU for all
of std, and <10 ms for every module in std except std.datetime.
 I've tried also 10%, 40% and 80% but 60% seems optimal.

 And I think I reached my performance limit. I could implement the
 LazyForwardRange and, if anyone can confirm the statement of char <->
 dchar decoding and that the usage of ubyte would be (crucial) better, I
 will try this too.
-- Dmitry Olshansky
Mar 04 2013
next sibling parent reply "Namespace" <rswhite4 googlemail.com> writes:
Yes, I know. But I do not think I have a chance to get even 
closer to it. At least I don't know how.
I'm on an AMD Athlon 64 X2 Dual Core Processor 5400+.
Mar 04 2013
next sibling parent reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
04-Mar-2013 22:58, Namespace writes:
 Yes, I know. But I do not think I have a chance to get even closer to
 it. At least I don't know, how.
One thing that all of these lexers do (dmd's original and Dscanner) is
get tokens as you need them - lazily. You will always be slower if you
take time to append them to some array (no matter how). The idiomatic
way, as I said, is to use ranges.
 I'm on a AMD Athlon 64 X2 Dual Core Processor 5400+.
Mm, that's not that far away from mine. But Brain has got something
even more powerful. :) You can easily download and test Dscanner on
your own CPU so that you get numbers that are actually comparable.

-- 
Dmitry Olshansky
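The lazy range approach would look roughly like this sketch (whitespace-token splitting stands in for real lexing; Token is hypothetical): tokens are produced one at a time in popFront, so no token array is ever built.

```d
import std.algorithm : equal, map;

struct Token { string value; }   // hypothetical token type

struct TokenRange
{
    private string rest;   // unconsumed input (a slice, cheap to copy)
    private Token cur;
    private bool done;

    this(string src) { rest = src; popFront(); }

    @property bool empty() const { return done; }
    @property Token front() const { return cur; }

    void popFront()
    {
        // skip whitespace between tokens
        while (rest.length && (rest[0] == ' ' || rest[0] == '\n'))
            rest = rest[1 .. $];
        if (!rest.length) { done = true; return; }
        size_t i = 0;
        while (i < rest.length && rest[i] != ' ' && rest[i] != '\n')
            ++i;
        cur = Token(rest[0 .. i]);
        rest = rest[i .. $];
    }

    // forward range: copying the slice-based state is enough
    @property TokenRange save() { return this; }
}

void main()
{
    auto toks = TokenRange("int x ;");
    assert(equal(toks.map!(t => t.value), ["int", "x", ";"]));
}
```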
Mar 04 2013
parent reply "Namespace" <rswhite4 googlemail.com> writes:
Yes, but I'm not that familiar with ranges. But I want to learn 
more about them and will try it then.
What do you think, how much would the use of ranges bring compared 
to my current method? 30%?
Mar 04 2013
parent Dmitry Olshansky <dmitry.olsh gmail.com> writes:
04-Mar-2013 23:19, Namespace writes:
 Yes, but I'm not as familiar with ranges. But I want to learn more about
 them and try it then.
 What do you mean, how much would entail the use of ranges compared to my
 current method? 30%?
Basically you copy Token structs and sometimes reallocate the whole
array (I guess at 60% you don't reallocate but waste time on the
GC.malloc that zeroes out the memory block). My estimate is no less
than 10-15%.

-- 
Dmitry Olshansky
Mar 04 2013
prev sibling parent reply "Namespace" <rswhite4 googlemail.com> writes:
------------------------
Total time (ms): 11374.4
Repetitions    : 200
Sample mode    : 56 (89 ocurrences)
Median time    : 56.312
Avg time       : 56.872
Std dev.       : 2.60698
Minimum        : 53.978
Maximum        : 81.602
95% conf.int.  : [51.7624, 61.9816]  e = 5.10959
99% conf.int.  : [50.1569, 63.5871]  e = 6.71514
EstimatedAvg95%: [56.5107, 57.2333]  e = 0.361303
EstimatedAvg99%: [56.3972, 57.3468]  e = 0.474832

I'm getting closer.
Mar 11 2013
parent reply "Namespace" <rswhite4 googlemail.com> writes:
------------------------
Total time (ms): 10941.9
Repetitions    : 200
Sample mode    : 54 (118 ocurrences)
Median time    : 54.5855
Avg time       : 54.7094
Std dev.       : 1.33935
Minimum        : 52.535
Maximum        : 67.206
95% conf.int.  : [52.0843, 57.3345]  e = 2.62509
99% conf.int.  : [51.2594, 58.1593]  e = 3.44995
EstimatedAvg95%: [54.5238, 54.895]  e = 0.185622
EstimatedAvg99%: [54.4654, 54.9533]  e = 0.243948

And even a bit closer.

I've also done some benchmarks. And I wonder why my appender is 
much faster if I enable memory reservation.
My Appender: http://dpaste.1azy.net/1bb34d9a
I'm also very interested to hear some improvement tips.

In each benchmark this struct is appended 1_000_000 times:

struct B {
public:
	const int b;
}

Without memory reserve:

My Appender:
------------------------
Total time (ms): 7638.67
Repetitions    : 200
Sample mode    : 38 (68 ocurrences)
Median time    : 38.0545
Avg time       : 38.1934
Std dev.       : 1.56827
Minimum        : 35.859
Maximum        : 47.586
95% conf.int.  : [35.1196, 41.2671]  e = 3.07375
99% conf.int.  : [34.1538, 42.233]  e = 4.0396
EstimatedAvg95%: [37.976, 38.4107]  e = 0.217347
EstimatedAvg99%: [37.9077, 38.479]  e = 0.285643

D's Appender:
------------------------
Total time (ms): 6972.66
Repetitions    : 200
Sample mode    : 34 (97 ocurrences)
Median time    : 34.7795
Avg time       : 34.8633
Std dev.       : 1.41623
Minimum        : 32.96
Maximum        : 49.579
95% conf.int.  : [32.0876, 37.6391]  e = 2.77576
99% conf.int.  : [31.2153, 38.5113]  e = 3.64797
EstimatedAvg95%: [34.667, 35.0596]  e = 0.196276
EstimatedAvg99%: [34.6054, 35.1213]  e = 0.257951

D array:
------------------------
Total time (ms): 22868.2
Repetitions    : 200
Sample mode    : 110 (198 ocurrences)
Median time    : 113.658
Avg time       : 114.341
Std dev.       : 1.97184
Minimum        : 113.011
Maximum        : 133.978
95% conf.int.  : [110.477, 118.206]  e = 3.86473
99% conf.int.  : [109.262, 119.42]  e = 5.07911
EstimatedAvg95%: [114.068, 114.615]  e = 0.273277
EstimatedAvg99%: [113.982, 114.7]  e = 0.359147

With memory reserve:

My Appender:
------------------------
Total time (ms): 6197.48
Repetitions    : 200
Sample mode    : 30 (88 ocurrences)
Median time    : 30.617
Avg time       : 30.9874
Std dev.       : 2.86913
Minimum        : 28.583
Maximum        : 64.06
95% conf.int.  : [25.364, 36.6108]  e = 5.62339
99% conf.int.  : [23.597, 38.3778]  e = 7.39039
EstimatedAvg95%: [30.5897, 31.385]  e = 0.397634
EstimatedAvg99%: [30.4648, 31.51]  e = 0.52258

D's Appender:
------------------------
Total time (ms): 6628.31
Repetitions    : 200
Sample mode    : 32 (104 ocurrences)
Median time    : 32.6775
Avg time       : 33.1415
Std dev.       : 3.00344
Minimum        : 31.506
Maximum        : 57.493
95% conf.int.  : [27.2549, 39.0282]  e = 5.88663
99% conf.int.  : [25.4052, 40.8779]  e = 7.73635
EstimatedAvg95%: [32.7253, 33.5578]  e = 0.416248
EstimatedAvg99%: [32.5945, 33.6886]  e = 0.547042

D array:
------------------------
Total time (ms): 22849.9
Repetitions    : 200
Sample mode    : 110 (198 ocurrences)
Median time    : 113.025
Avg time       : 114.249
Std dev.       : 8.40906
Minimum        : 112.355
Maximum        : 222.982
95% conf.int.  : [97.768, 130.731]  e = 16.4815
99% conf.int.  : [92.5891, 135.91]  e = 21.6603
EstimatedAvg95%: [113.084, 115.415]  e = 1.16542
EstimatedAvg99%: [112.718, 115.781]  e = 1.53162
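For comparison, the "with memory reserve" variant with Phobos's std.array.Appender comes down to one reserve call up front (the element count here is illustrative):

```d
import std.array : appender;

void main() {
    auto app = appender!(int[])();
    // Reserving up front means put() never has to grow-and-copy
    // until the reserved capacity is exhausted.
    app.reserve(1_000_000);
    foreach (i; 0 .. 1_000_000)
        app.put(i);
    assert(app.data.length == 1_000_000);
    assert(app.data[$ - 1] == 999_999);
}
```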
Mar 11 2013
parent reply FG <home fgda.pl> writes:
On 2013-03-11 16:11, Namespace wrote:
 I've also done some benchmarks. And I wonder why my appender is much faster if I
 enable memory reservation.
Probably just because D's Appender has a better growth scheme than yours and you reallocate too often. Try to compare how often that happens.
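The growth-scheme difference FG describes can be illustrated with a hypothetical minimal appender (not Phobos's; error handling is a bare assert) whose only trick is doubling the capacity, so a million appends cost only ~17 reallocations:

```d
import core.stdc.stdlib : realloc;

/// Hypothetical appender; holds only value types without GC pointers,
/// and omits a destructor/free to keep the sketch short.
struct GrowAppender(T) {
    private T* _ptr;
    private size_t _len, _cap;
    size_t reallocs; // how often we had to reallocate

    void put(T value) {
        if (_len == _cap) {
            // Double instead of adding a fixed step: amortized O(1) appends,
            // and only O(log n) reallocations in total.
            _cap = _cap ? _cap * 2 : 16;
            _ptr = cast(T*) realloc(_ptr, _cap * T.sizeof);
            assert(_ptr !is null, "realloc failed"); // FG's point: check it
            ++reallocs;
        }
        _ptr[_len++] = value;
    }

    @property T[] data() { return _ptr[0 .. _len]; }
}

void main() {
    GrowAppender!int app;
    foreach (i; 0 .. 1_000_000)
        app.put(i);
    // One million appends, but only ~17 reallocations.
    assert(app.reallocs < 20);
    assert(app.data.length == 1_000_000);
}
```

A fixed-step scheme (say, +1024 elements each time) would instead reallocate roughly a thousand times for the same workload, which is the "you reallocate too often" effect.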
 I'm also very interested to hear some improvement tips.
Well, you may add a check whether alloc/realloc actually succeeded. :)
Mar 11 2013
parent reply "Namespace" <rswhite4 googlemail.com> writes:
On Monday, 11 March 2013 at 15:21:31 UTC, FG wrote:
 On 2013-03-11 16:11, Namespace wrote:
 I've also done some benchmarks. And I wonder why my appender 
 is much faster if I
 enable memory reservation.
Probably just because D's Appender has a better growth scheme than yours and you reallocate too often. Try to compare how often that happens.
D's: 24 extends, 9 moves (qalloc + memcpy). My Appender: 25 moves.

But I do not think you can extend a memory block with C functions.
 I'm also very interested to hear some improvement tips.
Well, you may add a check whether alloc/realloc actually succeeded. :)
Maybe. :)
Mar 11 2013
parent reply David <d dav1d.de> writes:
 But I do not think you can extend a memory block with C functions.
What about realloc?
Mar 11 2013
parent reply "Namespace" <rswhite4 googlemail.com> writes:
On Monday, 11 March 2013 at 15:51:58 UTC, David wrote:
 But I do not think you can extend a memory block with C 
 functions.
What about realloc?
That's what I use. But it moves your old memory into the newly allocated block. Or does it first check whether the current memory block can be extended?
Mar 11 2013
parent reply "monarch_dodra" <monarchdodra gmail.com> writes:
On Monday, 11 March 2013 at 15:56:37 UTC, Namespace wrote:
 On Monday, 11 March 2013 at 15:51:58 UTC, David wrote:
 But I do not think you can extend a memory block with C 
 functions.
What about realloc?
That's what I use. But it moves your old memory into the newly allocated block. Or does it first check whether the current memory block can be extended?
It extends if possible.

http://dpaste.dzfl.pl/f86e5db6

import core.stdc.stdlib;
void main()
{
    auto p1 = malloc(14);
    auto p2 = realloc(p1, 15);
    assert(p1 is p2);
}
Mar 11 2013
parent reply "Namespace" <rswhite4 googlemail.com> writes:
 extends if possible.

 http://dpaste.dzfl.pl/f86e5db6

 import core.stdc.stdlib;
 void main()
 {
   auto p1 = malloc(14);
   auto p2 = realloc(p1, 15);
   assert(p1 is p2);
 }
Cool, nice to know.

Currently I have this benchmark:

Without memory reserve:
------------------------
Total time (ms): 6779.16
Repetitions    : 200
Sample mode    : 33 (73 ocurrences)
Median time    : 33.689
Avg time       : 33.8958
Std dev.       : 1.86754
Minimum        : 31.414
Maximum        : 52.237
95% conf.int.  : [30.2355, 37.5561]  e = 3.66031
99% conf.int.  : [29.0853, 38.7063]  e = 4.81047
EstimatedAvg95%: [33.637, 34.1546]  e = 0.258823
EstimatedAvg99%: [33.5556, 34.2359]  e = 0.340151

With memory reserve:
------------------------
Total time (ms): 5945.1
Repetitions    : 200
Sample mode    : 29 (68 ocurrences)
Median time    : 29.3215
Avg time       : 29.7255
Std dev.       : 2.083
Minimum        : 27.458
Maximum        : 45.32
95% conf.int.  : [25.6429, 33.8081]  e = 4.08261
99% conf.int.  : [24.36, 35.0909]  e = 5.36546
EstimatedAvg95%: [29.4368, 30.0142]  e = 0.288684
EstimatedAvg99%: [29.3461, 30.1049]  e = 0.379396

But it does not gain that much:
------------------------
Total time (ms): 10833
Repetitions    : 200
Sample mode    : 53 (105 ocurrences)
Median time    : 53.848
Avg time       : 54.1652
Std dev.       : 1.72681
Minimum        : 52.118
Maximum        : 69.703
95% conf.int.  : [50.7807, 57.5497]  e = 3.38448
99% conf.int.  : [49.7172, 58.6131]  e = 4.44796
EstimatedAvg95%: [53.9259, 54.4045]  e = 0.239319
EstimatedAvg99%: [53.8507, 54.4797]  e = 0.314518

Seems to be the wrong place to optimize.
Mar 11 2013
parent reply "monarch_dodra" <monarchdodra gmail.com> writes:
On Monday, 11 March 2013 at 16:37:00 UTC, Namespace wrote:
 Seems to be the wrong point of optimizations.
Profile?

Also, note that appender is a tool for appending stuff *into an array*. If you don't actually need an array at the end, then you could give Array a try?

I don't think it will necessarily do better, but it doesn't hurt to try.
Mar 11 2013
parent reply "Namespace" <rswhite4 googlemail.com> writes:
 Profile?

 Also, note that appender is a tool for appending stuff *into an 
 array*. If you don't actually need an array at the end, then 
 you could give Array a roll?

 I don't think it will necessarily do better, but it doesn't 
 hurt to give it a roll.
My profile output looks like this: http://dpaste.1azy.net/0409f0bc

Since I have already discussed most of the stuff in this thread, I do not know what I could still improve in the lexer. But I have not yet thought of Array. Because my Appender already uses malloc and free, I could learn something from the implementation of 'Array'. Thanks for the suggestion.
Mar 11 2013
parent reply "Namespace" <rswhite4 googlemail.com> writes:
Array has horrible code...

I decided to start from scratch.
So I took a close look at the lexer of dmd and took over the 
basic functionality.
Result for std.datetime (without profiling so far):

------------------------
Total time (ms): 5971.66
Repetitions    : 200
Sample mode    : 28 (101 ocurrences)
Median time    : 28.89
Avg time       : 29.8583
Std dev.       : 3.37222
Minimum        : 27.958
Maximum        : 70.583
95% conf.int.  : [23.2489, 36.4677]  e = 6.60943
99% conf.int.  : [21.172, 38.5446]  e = 8.68627
EstimatedAvg95%: [29.3909, 30.3257]  e = 0.467357
EstimatedAvg99%: [29.2441, 30.4725]  e = 0.614212
Mar 13 2013
parent reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
14-Mar-2013 01:19, Namespace пишет:
 Array has an horrible code...

 I decided to start from scratch.
 So I took a close look at the lexer of dmd and took over the basic
 functionality.
 Result for std.datetime (without profiling so far):
It's getting better!
 ------------------------
 Total time (ms): 5971.66
 Repetitions    : 200
 Sample mode    : 28 (101 ocurrences)
 Median time    : 28.89
 Avg time       : 29.8583
 Std dev.       : 3.37222
 Minimum        : 27.958
 Maximum        : 70.583
 95% conf.int.  : [23.2489, 36.4677]  e = 6.60943
 99% conf.int.  : [21.172, 38.5446]  e = 8.68627
 EstimatedAvg95%: [29.3909, 30.3257]  e = 0.467357
 EstimatedAvg99%: [29.2441, 30.4725]  e = 0.614212
-- Dmitry Olshansky
Mar 14 2013
parent reply "Namespace" <rswhite4 googlemail.com> writes:
Yes, and it's really fun. :)
By the way, the code can be found here: 
https://github.com/Dgame/Puzzle/blob/master/DLexer.d
Maybe some of you have an idea how I could improve anything.
Last trace.log is also there: 
https://github.com/Dgame/Puzzle/blob/master/trace.log
My goal is about 20 msecs. :D
Mar 14 2013
next sibling parent reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
14-Mar-2013 21:20, Namespace пишет:
 Yes, and it's really fun. :)
 By the way, the code can be found here:
 https://github.com/Dgame/Puzzle/blob/master/DLexer.d
 Maybe some of you have an idea how I could improve anything.
 Last trace.log is also there:
 https://github.com/Dgame/Puzzle/blob/master/trace.log
 My goal is about 20 msecs. :D
And mine is 15 with Dscanner. But it's stuck at around 21-22. But both of us can just cheat and buy a newer PC :)

So what we need to count is the number of syscalls + cycles spent in user mode to have a meaningful metric.

-- Dmitry Olshansky
Mar 14 2013
parent reply Artur Skawina <art.08.09 gmail.com> writes:
On 03/14/13 18:24, Dmitry Olshansky wrote:
 14-Mar-2013 21:20, Namespace пишет:
 Yes, and it's really fun. :)
 By the way, the code can be found here:
 https://github.com/Dgame/Puzzle/blob/master/DLexer.d
 Maybe some of you have an idea how I could improve anything.
 Last trace.log is also there:
 https://github.com/Dgame/Puzzle/blob/master/trace.log
 My goal is about 20 msecs. :D
And mine is 15 with Dscanner. But it's stuck at around 21-22. But both of us can just cheat and buy a newer PC :)
What are you both actually measuring? IOW what happens with the resulting tokens? artur
Mar 14 2013
next sibling parent Dmitry Olshansky <dmitry.olsh gmail.com> writes:
15-Mar-2013 01:17, Artur Skawina пишет:
 On 03/14/13 18:24, Dmitry Olshansky wrote:
 14-Mar-2013 21:20, Namespace пишет:
 Yes, and it's really fun. :)
 By the way, the code can be found here:
 https://github.com/Dgame/Puzzle/blob/master/DLexer.d
 Maybe some of you have an idea how I could improve anything.
 Last trace.log is also there:
 https://github.com/Dgame/Puzzle/blob/master/trace.log
 My goal is about 20 msecs. :D
And mine is 15 with Dscanner. But it's stuck at around 21-22. But both of us can just cheat and buy a newer PC :)
What are you both actually measuring? IOW what happens with the resulting tokens? artur
I'm counting and/or filtering out specific tokens. The latter is a tiny bit slower.

The other options of Dscanner are to generate e.g. HTML highlighting for code or to measure token frequencies. Haven't timed those, as their speed doesn't entirely depend on the lexer as is.

-- Dmitry Olshansky
Mar 14 2013
prev sibling parent reply "Namespace" <rswhite4 googlemail.com> writes:
So far, my lexer is a pure exercise.
But my goal is actually to filter variables and functions, to see 
if they are ever used in the code.
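A toy sketch of that idea (hypothetical helper, with no scoping or declaration-site tracking; a real checker needs both): identifiers that occur exactly once in the token stream were declared but never used again.

```d
import std.algorithm : filter, sort;
import std.array : array;

/// Hypothetical helper: identifiers occurring exactly once in the token
/// stream were declared but never used. A real checker would track
/// declaration sites and scopes instead of bare counts.
string[] unusedIdents(string[] idents) {
    int[string] uses;
    foreach (id; idents)
        ++uses[id];
    auto res = uses.byKey.filter!(k => uses[k] == 1).array;
    sort(res); // byKey order is unspecified, so sort for a stable result
    return res;
}

void main() {
    // Stand-in identifier stream as a lexer might produce it.
    assert(unusedIdents(["x", "y", "x", "z", "x"]) == ["y", "z"]);
}
```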
Mar 15 2013
parent reply "Namespace" <rswhite4 googlemail.com> writes:
On Friday, 15 March 2013 at 08:20:18 UTC, Namespace wrote:
 So far, my lexer is pure exercise.
 But my goal is actually to filter variables and functions, to 
 see if they are ever used in the code.
I'm almost finished. In my few tests, my little parser detects function, struct and variable declarations and expressions. It can resolve rvalue references (by creating temporary variables) and detects unused/underused variables.

It is still a beta, but it's a lot of fun. :D
Mar 21 2013
parent reply "Namespace" <rswhite4 googlemail.com> writes:
On Thursday, 21 March 2013 at 11:36:45 UTC, Namespace wrote:
 On Friday, 15 March 2013 at 08:20:18 UTC, Namespace wrote:
 So far, my lexer is pure exercise.
 But my goal is actually to filter variables and functions, to 
 see if they are ever used in the code.
I'm almost finished. In my few tests, my little parser detects function, struct and variable declarations and expressions. It can resolve rvalue references (by creating temporary variables) and detects unused/underused variables. But it is still a beta. But it makes a lot of fun. :D
Moved to https://github.com/Dgame/DAT/
Mar 23 2013
parent reply "Namespace" <rswhite4 googlemail.com> writes:
On Saturday, 23 March 2013 at 15:39:50 UTC, Namespace wrote:
 On Thursday, 21 March 2013 at 11:36:45 UTC, Namespace wrote:
 On Friday, 15 March 2013 at 08:20:18 UTC, Namespace wrote:
 So far, my lexer is pure exercise.
 But my goal is actually to filter variables and functions, to 
 see if they are ever used in the code.
I'm almost finished. In my few tests, my little parser detects function, struct and variable declarations and expressions. It can resolve rvalue references (by creating temporary variables) and detects unused/underused variables. But it is still a beta. But it makes a lot of fun. :D
Moved to https://github.com/Dgame/DAT/
It's time for a beta. It would be nice if someone took a look at DAT or tried it. I'm sure that there are still some bugs, but maybe this little recreational fun is indeed useful for one or the other. I had a lot of fun.

It's tested with 'simple.d', std.stdio and std.datetime.
Mar 24 2013
parent reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
24-Mar-2013 20:08, Namespace пишет:
 On Saturday, 23 March 2013 at 15:39:50 UTC, Namespace wrote:
 On Thursday, 21 March 2013 at 11:36:45 UTC, Namespace wrote:
 On Friday, 15 March 2013 at 08:20:18 UTC, Namespace wrote:
 So far, my lexer is pure exercise.
 But my goal is actually to filter variables and functions, to see if
 they are ever used in the code.
I'm almost finished. In my few tests, my little parser detects function, struct and variable declarations and expressions. It can resolve rvalue references (by creating temporary variables) and detects unused/underused variables. But it is still a beta. But it makes a lot of fun. :D
Moved to https://github.com/Dgame/DAT/
It's time for a beta. Would be nice if someone takes a look at DAT or try it. I'm sure that there are still some bugs, but maybe this little recreational fun is indeed useful for one or the other. I had a lot of fun. It's tested with 'simple.d', std.stdio and std.datetime.
Have no time. It looks interesting, but I highly doubt it can do what is promised. Still, take time to do a small formal announcement for the beta. Reusing this old, fizzled-out D lexer thread to discuss a tool created on top of it is bound to get very little feedback.

-- Dmitry Olshansky
Mar 24 2013
parent reply "Namespace" <rswhite4 googlemail.com> writes:
 Have no time - looks interesting but I highly doubt it can do 
 what is promised. Still take time to do a small formal 
 announcement for beta.
Yes, you're right, it's going to take some time before it all works in general. It has held up in my small test runs, but D is also incredibly complex.

I will also stop working on it very soon, but maybe the code / the idea inspires someone else to write something more mature for the D community. This is also the reason why I'm asking for feedback in this fizzled-out thread. :)

But once you've taken a look at it, I would be happy if you could tell me what should be improved next time.
Mar 24 2013
parent reply "Brian Schott" <briancschott gmail.com> writes:
On Sunday, 24 March 2013 at 16:54:22 UTC, Namespace wrote:
 Have no time - looks interesting but I highly doubt it can do 
 what is promised. Still take time to do a small formal 
 announcement for beta.
Yes you're right, it's going to take some time before all would work in general. In my small test runs, it has, however, been proven, but D is also incredibly complex. I will also stop very soon to work on it, but maybe the code / the idea inspire someone else to write something more mature for the D community. This is also the reason why I'm asking in this fizzle thread for a feedback. :) But once you've taken a look on it, I would be happy if you could tell me what should be improved next time.
I have similar goals for the Dscanner tool, so I may steal some of your ideas.
Mar 24 2013
parent "Namespace" <rswhite4 googlemail.com> writes:
On Sunday, 24 March 2013 at 22:16:45 UTC, Brian Schott wrote:
 On Sunday, 24 March 2013 at 16:54:22 UTC, Namespace wrote:
 Have no time - looks interesting but I highly doubt it can do 
 what is promised. Still take time to do a small formal 
 announcement for beta.
Yes you're right, it's going to take some time before all would work in general. In my small test runs, it has, however, been proven, but D is also incredibly complex. I will also stop very soon to work on it, but maybe the code / the idea inspire someone else to write something more mature for the D community. This is also the reason why I'm asking in this fizzle thread for a feedback. :) But once you've taken a look on it, I would be happy if you could tell me what should be improved next time.
I have similar goals for the Dscanner tool, so I may steal some of your ideas.
I'm glad to hear that. If I can help, let me know; I would be happy to gain more knowledge in this area. And I would be glad if some of my ideas became reality.
Mar 24 2013
prev sibling parent reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
14-Mar-2013 21:20, Namespace пишет:
 Yes, and it's really fun. :)
 By the way, the code can be found here:
 https://github.com/Dgame/Puzzle/blob/master/DLexer.d
 Maybe some of you have an idea how I could improve anything.
 Last trace.log is also there:
 https://github.com/Dgame/Puzzle/blob/master/trace.log
 My goal is about 20 msecs. :D
So you got rid of array creation. About time ;) -- Dmitry Olshansky
Mar 14 2013
parent "Namespace" <rswhite4 googlemail.com> writes:
 So you got rid of array creation. About time ;)
Yes, that was the only way to get closer to your measured time. :D
Mar 15 2013
prev sibling parent reply "Brian Schott" <briancschott gmail.com> writes:
On Monday, 4 March 2013 at 18:10:48 UTC, Dmitry Olshansky wrote:
 Brian listed graphs showing under 20 ms on his CPU for all and 
 <10 ms for every module in std except std.datetime.
https://github.com/Hackerpilot/Dscanner/blob/range-based-lexer/perftest.sh

I checked in the script that I use to generate the data for the graph last night. You can copy/paste its output into a spreadsheet (using tabs as field separators).
Mar 04 2013
parent Dmitry Olshansky <dmitry.olsh gmail.com> writes:
05-Mar-2013 01:35, Brian Schott пишет:
 On Monday, 4 March 2013 at 18:10:48 UTC, Dmitry Olshansky wrote:
 Brian listed graphs showing under 20 ms on his CPU for all and <10
 ms for every module in std except std.datetime.
https://github.com/Hackerpilot/Dscanner/blob/range-based-lexer/perftest.sh

I checked in the script that I use to generate the data for the graph last night. You can copy/paste its output into a spreadsheet (using tabs as field separators).
Great, thanks! I'll see if I can make a plot from it with std-dev shown as well. -- Dmitry Olshansky
Mar 04 2013
prev sibling next sibling parent Dmitry Olshansky <dmitry.olsh gmail.com> writes:
03-Mar-2013 20:19, Namespace пишет:
 Simple - don't use array append and don't produce and array.
 Just produce a lazy forward range that is iterated.

 At the very least use Appender!(Token[]) it ought to be much faster.
 But doesn't each range have an array internally?
Of course not, or the whole range concept would be meaningless.
 Or how does that work?
By having a struct, or generally any object, that in fact works as a generator of tokens. popFront lexes a new token, front gives the current token, and empty indicates whether there are no more tokens. See std.range for some simple ranges.

If you throw in a .save method, you can replay tokenization from some point in the past (= some older state; think lookahead when parsing).
 I try to use Appender, but this lazy forward range sounds interesting.
-- Dmitry Olshansky
Mar 03 2013
prev sibling parent Jonathan M Davis <jmdavisProg gmx.com> writes:
On Sunday, March 03, 2013 17:19:25 Namespace wrote:
 But doesn't each range have an array internally?
Many do, but some don't. Take std.range.iota or the ranges in std.random, for instance. They're generative. And if all ranges had arrays under the hood, infinite ranges wouldn't be possible except for something like std.range.cycle.

Something like

struct Range
{
    @property int front() { return _i; }
    void popFront() { ++_i; }
    @property bool empty() { return _i == int.max; }
    private int _i;
}

would be a valid range.

In the case of a lexer, you generate the next token on each popFront call, so you consume the range of characters that you're lexing as you go but don't need to allocate memory for more than the current token at a time.

- Jonathan M Davis
Mar 03 2013
prev sibling parent Dmitry Olshansky <dmitry.olsh gmail.com> writes:
03-Mar-2013 03:12, Namespace пишет:
 I noted it too late: decode comes from readText: readText validates your
 text. I will now use std.file.read.
That's why Walter promotes profilers. They help verify hypotheses about performance and constantly surprise people with what they actually find.
 The new trace output is here and it seems you're right, isKeyword and
 isType (yes I'd decided to separate it again and keep both in the lexer)
 take a lot of time:
 http://dpaste.1azy.net/0d6aff6b
   154874      106186       95403           0     pure nothrow bool
 Puzzle.Lexer.isKeyword(immutable(char)[])
   159688       78020       69289           0     pure nothrow bool
 Puzzle.Lexer.isType(immutable(char)[])

 I never thought that. :)
-- Dmitry Olshansky
Mar 02 2013