
digitalmars.D.announce - Dscanner - It exists

"Brian Schott" <briancschott gmail.com> writes:
First: This is not a release announcement.

I want to let people know that Dscanner *exists*.

https://github.com/Hackerpilot/Dscanner/

It's a utility that I designed to be used by text editors such as 
VIM, Textadept, etc., for getting information about D source code.

I've held off on announcing this in the past because I don't think 
that it's really ready for a release, but after seeing several of 
the threads about lexers in the D newsgroup I decided I should 
make some sort of announcement.

What it does:
* Has a D lexer
* Can syntax-highlight D source files as HTML
* Can generate CTAGS files from D code
* VERY BASIC autocomplete <- The reason I don't consider it "done"
* Can generate a JSON summary of D code.
* Line of code counter. Basically just a filter on the range of 
tokens that looks for things like semicolons.

It's Boost licensed, so feel free to use (or submit improvements 
for) the tokenizer.
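A sketch of roughly how such a token-filter line counter might look (illustrative only; the Token/TokenType names here are hypothetical, not Dscanner's actual API):

```d
import std.algorithm.searching : count;

// Hypothetical token types; the real lexer defines its own.
enum TokenType { identifier, semicolon, lBrace, rBrace /* ... */ }
struct Token { TokenType type; string text; uint line; }

// A "logical line of code" counter: just count statement-ending
// tokens in the range the lexer produces.
size_t countLogicalLines(R)(R tokens)
{
    return tokens.count!(t => t.type == TokenType.semicolon);
}
```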
Aug 01 2012
Walter Bright <newshound2 digitalmars.com> writes:
On 8/1/2012 10:30 AM, Brian Schott wrote:
 First: This is not a release announcement.

 I want to let people know that Dscanner *exists*.

 https://github.com/Hackerpilot/Dscanner/

 It's a utility that I designed to be used by text editors such as VIM,
 Textadept, etc., for getting information about D source code.

 I've held off on announcing this in the past because I don't think that it's
 really ready for a release, but after seeing several of the threads about
 lexers in the D newsgroup I decided I should make some sort of announcement.

 What it does:
 * Has a D lexer
 * Can syntax-highlight D source files as HTML
 * Can generate CTAGS files from D code
 * VERY BASIC autocomplete <- The reason I don't consider it "done"
 * Can generate a JSON summary of D code.
 * Line of code counter. Basically just a filter on the range of tokens that
 looks for things like semicolons.

 It's Boost licensed, so feel free to use (or submit improvements for) the
 tokenizer.

I suggest proposing the D lexer as an addition to Phobos. But if that is done, its interface would need to accept a range as input, and its output should be a range of tokens.
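In sketch form, the interface Walter describes might look like this (all names are hypothetical; this is not Phobos code):

```d
import std.range : isInputRange, ElementType;

enum TokenType { identifier, number, operator, eof /* ... */ }
struct Token { TokenType type; string text; size_t line; }

// Accepts any input range of characters, produces a lazy input
// range of Tokens. The lexing itself is elided.
auto byToken(R)(R source)
    if (isInputRange!R && is(ElementType!R : dchar))
{
    static struct TokenRange
    {
        R source;
        Token current;
        bool done;

        bool empty() const { return done; }
        Token front() const { return current; }
        void popFront()
        {
            // Lex the next token out of `source`; set `done` at EOF.
        }
    }
    return TokenRange(source);
}
```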
Aug 01 2012
deadalnix <deadalnix gmail.com> writes:
Le 01/08/2012 19:58, Brian Schott a écrit :
 On Wednesday, 1 August 2012 at 17:36:16 UTC, Walter Bright wrote:
 I suggest proposing the D lexer as an addition to Phobos. But if that
 is done, its interface would need to accept a range as input, and its
 output should be a range of tokens.

 It used to be range-based, but the performance was terrible. The inability to use slicing on a forward-range of characters and the gigantic block on KCachegrind labeled "std.utf.decode" were the reasons that I chose this approach. I wish I had saved the measurements on this....

Maybe a RandomAccessRange could do the trick?
Aug 01 2012
Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 8/1/12 5:09 PM, deadalnix wrote:
 Le 01/08/2012 19:58, Brian Schott a écrit :
 On Wednesday, 1 August 2012 at 17:36:16 UTC, Walter Bright wrote:
 I suggest proposing the D lexer as an addition to Phobos. But if that
 is done, its interface would need to accept a range as input, and its
 output should be a range of tokens.

 It used to be range-based, but the performance was terrible. The inability to use slicing on a forward-range of characters and the gigantic block on KCachegrind labeled "std.utf.decode" were the reasons that I chose this approach. I wish I had saved the measurements on this....

 Maybe a RandomAccessRange could do the trick?

I think the best way here is to define a BufferedRange that takes any other range and supplies a buffer for it (with the appropriate primitives) in a native array. Andrei
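A rough sketch of the BufferedRange idea (hypothetical, not Phobos code; a real implementation would refill a fixed-size window lazily rather than reading everything up front):

```d
import std.array : array;
import std.range : isInputRange, ElementType;

// Hypothetical wrapper: reads from any input range into a native
// array so a lexer can use slicing and lookahead on top of it.
struct Buffered(R) if (isInputRange!R)
{
    alias E = ElementType!R;
    private E[] buf;
    private size_t pos;

    this(R source)
    {
        // Simplest possible strategy: buffer the whole input.
        buf = source.array;
    }

    bool empty() const { return pos >= buf.length; }
    E front() const { return buf[pos]; }
    void popFront() { ++pos; }

    // The primitives that make lexing fast: slicing and lookahead.
    E[] slice(size_t from, size_t to) { return buf[from .. to]; }
    E peek(size_t n) const { return buf[pos + n]; }
}
```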
Aug 01 2012
David <d dav1d.de> writes:
 I think the best way here is to define a BufferedRange that takes any
 other range and supplies a buffer for it (with the appropriate
 primitives) in a native array.

 Andrei

Don't you think this range stuff is overdone? Why define some fancy Range machinery when an array just works perfectly? Ranges > Iterators, yes, but I think they are overdone.
Aug 01 2012
Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 8/1/12 6:23 PM, David wrote:
 I think the best way here is to define a BufferedRange that takes any
 other range and supplies a buffer for it (with the appropriate
 primitives) in a native array.

 Andrei

 Don't you think this range stuff is overdone? Why define some fancy Range machinery when an array just works perfectly? Ranges > Iterators, yes, but I think they are overdone.

I don't. Andrei
Aug 01 2012
Walter Bright <newshound2 digitalmars.com> writes:
On 8/1/2012 3:44 PM, Bernard Helyer wrote:
 I would be concerned with potential performance ramifications,
 though.

As well you should be. A poorly constructed range can have terrible performance. But one thing to take careful note of: you *can* define a range that is nothing more than a pointer. Of course, you must be careful using such, because it won't be safe, but if performance overrides everything else, that option is available. And best of all, one could still supply a safe range to the same algorithm code, without changing any of it.
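An illustrative sketch of the pointer-thin range Walter mentions (deliberately unsafe, as he notes; a bounds-checked range could replace it without touching the algorithm code):

```d
// A "range" that is nothing more than a pointer into a
// zero-terminated buffer. No bounds checking - trusted input only.
struct ZeroTerminatedRange
{
    immutable(char)* p;

    bool empty() const { return *p == '\0'; }
    char front() const { return *p; }
    void popFront() { ++p; }
}

unittest
{
    import std.algorithm.searching : count;

    // D string literals are zero-terminated, so .ptr works here.
    auto r = ZeroTerminatedRange("hello world".ptr);
    assert(r.count('o') == 2);
}
```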
Aug 01 2012
Jacob Carlborg <doob me.com> writes:
On 2012-08-02 00:23, David wrote:
 I think the best way here is to define a BufferedRange that takes any
 other range and supplies a buffer for it (with the appropriate
 primitives) in a native array.

 Andrei

 Don't you think this range stuff is overdone? Why define some fancy Range machinery when an array just works perfectly? Ranges > Iterators, yes, but I think they are overdone.

I think so. Some parts of the community seem to be obsessed with ranges. -- /Jacob Carlborg
Aug 01 2012
deadalnix <deadalnix gmail.com> writes:
Le 01/08/2012 23:19, Andrei Alexandrescu a écrit :
 On 8/1/12 5:09 PM, deadalnix wrote:
 Le 01/08/2012 19:58, Brian Schott a écrit :
 On Wednesday, 1 August 2012 at 17:36:16 UTC, Walter Bright wrote:
 I suggest proposing the D lexer as an addition to Phobos. But if that
 is done, its interface would need to accept a range as input, and its
 output should be a range of tokens.

 It used to be range-based, but the performance was terrible. The inability to use slicing on a forward-range of characters and the gigantic block on KCachegrind labeled "std.utf.decode" were the reasons that I chose this approach. I wish I had saved the measurements on this....

 Maybe a RandomAccessRange could do the trick?

 I think the best way here is to define a BufferedRange that takes any other range and supplies a buffer for it (with the appropriate primitives) in a native array. Andrei

This makes sense. The buffering can be a no-op on arrays, and that solves everything.
Aug 01 2012
Piotr Szturmaj <bncrbme jadamspam.pl> writes:
Brian Schott wrote:
 On Wednesday, 1 August 2012 at 17:36:16 UTC, Walter Bright wrote:
 I suggest proposing the D lexer as an addition to Phobos. But if that
 is done, its interface would need to accept a range as input, and its
 output should be a range of tokens.

 It used to be range-based, but the performance was terrible. The inability to use slicing on a forward-range of characters and the gigantic block on KCachegrind labeled "std.utf.decode" were the reasons that I chose this approach. I wish I had saved the measurements on this....

Ranges are usually taken as template parameters, so you can use static if to provide different code for arrays and regular ranges.
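For example, a range-based helper can take the fast slicing path for strings and fall back to the generic range primitives otherwise (the function name is illustrative, not from Dscanner):

```d
import std.range.primitives : empty, front, popFront;
import std.traits : isSomeString;

// Skip spaces and tabs; special-case strings with static if.
size_t skipWhitespace(R)(ref R r)
{
    size_t skipped;
    static if (isSomeString!R)
    {
        // Fast path: index code units directly, no decoding needed,
        // since ASCII whitespace never occurs inside a multi-byte
        // UTF-8 sequence.
        while (r.length && (r[0] == ' ' || r[0] == '\t'))
        {
            r = r[1 .. $];
            ++skipped;
        }
    }
    else
    {
        // Generic path: use the range primitives.
        while (!r.empty && (r.front == ' ' || r.front == '\t'))
        {
            r.popFront();
            ++skipped;
        }
    }
    return skipped;
}
```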
Aug 01 2012
Walter Bright <newshound2 digitalmars.com> writes:
On 8/1/2012 10:35 AM, Walter Bright wrote:
 I suggest proposing the D lexer as an addition to Phobos. But if that is done,
 its interface would need to accept a range as input, and its output should be a
 range of tokens.

See the thread over in digitalmars.D about a proposed std.d.lexer.
Aug 01 2012
Jacob Carlborg <doob me.com> writes:
On 2012-08-01 22:20, Jonathan M Davis wrote:

 If you want really good performance out of a range-based solution operating on
 ranges of dchar, then you need to special case for the built-in string types
 all over the place, and if you have to wrap them in other range types
 (generally because of calling another range-based function), then there's a
 good chance that you will indeed get a performance hit. D's range-based
 approach is really nice from the perspective of usability, but you have to
 work at it a bit if you want it to be efficient when operating on strings. It
 _can_ be done though.

Is it really worth it, though? Won't most use cases just be with regular strings? -- /Jacob Carlborg
Aug 01 2012
Jacob Carlborg <doob me.com> writes:
On 2012-08-02 08:26, Jonathan M Davis wrote:

 It's really not all that hard to special case for strings, especially when
 you're operating primarily on code units. And I think that the lexer should be
 flexible enough to be usable with ranges other than strings. We're trying to
 make most stuff in Phobos range-based, not string-based or array-based.

Ok. I just don't think it's worth giving up some performance or making the design overly complicated just to provide a range interface. But if ranges don't cause these problems, I'm happy. -- /Jacob Carlborg
Aug 01 2012
Jacob Carlborg <doob me.com> writes:
On 2012-08-02 09:43, Jonathan M Davis wrote:

 For instance, I have this function which I use to generate a mixin any time
 that I want to get the first code unit:

 string declareFirst(R)()
      if(isForwardRange!R && is(Unqual!(ElementType!R) == dchar))
 {
      static if(isNarrowString!R)
          return "Unqual!(ElementEncodingType!R) first = range[0];";
      else
          return "dchar first = range.front;";
 }

 So, every line using it becomes

 mixin(declareFirst!R());

 which really isn't any worse than

 char c = str[0];

 except that it works with more than just strings. Yes, it's more effort to get
 the lexer working with all ranges of dchar, but I don't think that it's all
 that much worse, and the result is much more flexible.

That doesn't look too bad. I'm quite happy :) -- /Jacob Carlborg
Aug 02 2012
Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 8/2/12 3:43 AM, Jonathan M Davis wrote:
 A range-based function operating on strings without special-casing them often
 _will_ harm performance. But if you special-case them for strings, then you
 can avoid that performance penalty - especially if you can avoid having to
 decode any characters.

 The result is that using range-based functions on strings is generally correct
 without the function writer (or the caller) having to worry about encodings
 and the like, but if they want to eke out all of the performance that they
 can, they need to go to the extra effort of special-casing the function for
 strings. Like much of D, it favors correctness/safety but allows you to get
 full performance if you work at it a bit harder.

 In the case of the lexer, it's really not all that bad - especially since
 string mixins allow me to give the operation that I need (e.g. get the first
 code unit) in the correct way for that particular range type without worrying
 about the details.

 For instance, I have this function which I use to generate a mixin any time
 that I want to get the first code unit:

 string declareFirst(R)()
      if(isForwardRange!R&&  is(Unqual!(ElementType!R) == dchar))
 {
      static if(isNarrowString!R)
          return "Unqual!(ElementEncodingType!R) first = range[0];";
      else
          return "dchar first = range.front;";
 }

 So, every line using it becomes

 mixin(declareFirst!R());

 which really isn't any worse than

 char c = str[0];

 except that it works with more than just strings. Yes, it's more effort to get
 the lexer working with all ranges of dchar, but I don't think that it's all
 that much worse, and the result is much more flexible.

I just posted in the .D forum a simple solution that is fast, uses general ranges, and lets you avoid the hecatomb of code above. Andrei
Aug 02 2012
"Brian Schott" <briancschott gmail.com> writes:
On Wednesday, 1 August 2012 at 17:36:16 UTC, Walter Bright wrote:
 I suggest proposing the D lexer as an addition to Phobos. But 
 if that is done, its interface would need to accept a range as 
 input, and its output should be a range of tokens.

It used to be range-based, but the performance was terrible. The inability to use slicing on a forward-range of characters and the gigantic block on KCachegrind labeled "std.utf.decode" were the reasons that I chose this approach. I wish I had saved the measurements on this....
Aug 01 2012
"Jonathan M Davis" <jmdavisProg gmx.com> writes:
On Wednesday, August 01, 2012 19:58:46 Brian Schott wrote:
 On Wednesday, 1 August 2012 at 17:36:16 UTC, Walter Bright wrote:
 I suggest proposing the D lexer as an addition to Phobos. But
 if that is done, its interface would need to accept a range as
 input, and its output should be a range of tokens.

 It used to be range-based, but the performance was terrible. The inability to use slicing on a forward-range of characters and the gigantic block on KCachegrind labeled "std.utf.decode" were the reasons that I chose this approach. I wish I had saved the measurements on this....

If you want really good performance out of a range-based solution operating on ranges of dchar, then you need to special-case the built-in string types all over the place, and if you have to wrap them in other range types (generally because of calling another range-based function), then there's a good chance that you will indeed get a performance hit. D's range-based approach is really nice from the perspective of usability, but you have to work at it a bit if you want it to be efficient when operating on strings. It _can_ be done though.

The D lexer that I'm currently writing special-cases strings pretty much _everywhere_ (string mixins really help reduce the cost of that in terms of code duplication). The result is that if I do it right, its performance for strings should be very close to what dmd can do (it probably won't quite reach dmd's performance simply because of some extra stuff it does to make it more usable for stuff other than compilers - e.g. syntax highlighters). But you'll still likely get a performance hit if you did something like

string source = getSource();
auto result = tokenRange(filter!"true"(source));

instead of

string source = getSource();
auto result = tokenRange(source);

It won't be quite as bad a performance hit with 2.060 thanks to some recent optimizations to string's popFront, but you're going to lose out on some performance regardless, because nothing can special-case for every possible range type, and one of the keys to fast string processing is minimizing how much you decode characters, which generally requires special-casing.

- Jonathan M Davis
Aug 01 2012
Marco Leise <Marco.Leise gmx.de> writes:
Am Wed, 01 Aug 2012 19:58:46 +0200
schrieb "Brian Schott" <briancschott gmail.com>:

 On Wednesday, 1 August 2012 at 17:36:16 UTC, Walter Bright wrote:
 I suggest proposing the D lexer as an addition to Phobos. But 
 if that is done, its interface would need to accept a range as 
 input, and its output should be a range of tokens.

 It used to be range-based, but the performance was terrible. The inability to use slicing on a forward-range of characters and the gigantic block on KCachegrind labeled "std.utf.decode" were the reasons that I chose this approach. I wish I had saved the measurements on this....

I can understand you. I was reading a dictionary file with readText().splitLines(); and wondering why Unicode decoding was performed. Unfortunately, ranges work on Unicode code points, while all structured text files are structured by ASCII characters. These file formats are probably just old, or were designed with some false sense of compatibility in mind, but it was also clear to their inventors that parsing them is easier and faster with single-byte characters to delimit tokens. But we have talked about UTF-8 vs. ASCII and foreach vs. ranges before. I still hope for some super-smart solution that doesn't need a book of documentation and allows some kind of ASCII-equivalent range. I've heard that foreach over UTF-8 with a dchar loop variable does an implicit decoding of the UTF-8 string. While this is useful, it is also not self-explanatory and needs some reading into the topic. -- Marco
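The implicit decoding Marco mentions can be seen directly by changing the loop variable's type:

```d
import std.stdio : write, writeln;

void main()
{
    string s = "héllo"; // 'é' takes two UTF-8 code units

    // A dchar loop variable makes foreach decode UTF-8 implicitly:
    foreach (dchar c; s)
        write(c, ' ');           // five code points: h é l l o
    writeln();

    // A char loop variable iterates the raw code units instead:
    foreach (char c; s)
        write(cast(int) c, ' '); // six bytes; 'é' shows as two numbers
    writeln();
}
```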
Aug 01 2012
Philippe Sigaud <philippe.sigaud gmail.com> writes:
On Wed, Aug 1, 2012 at 7:30 PM, Brian Schott <briancschott gmail.com> wrote:
 First: This is not a release announcement.

 I want to let people know that Dscanner *exists*.

 https://github.com/Hackerpilot/Dscanner/

 What it does:
 * Has a D lexer

 * Can generate a JSON summary of D code.

I just tested the JSON output and it works nicely. Finally, a way to get imports! I have two remarks (not criticisms!):
- there seem to be two "structs" objects in the JSON, unless I'm mistaken.
- alias declarations are not parsed, seemingly (as in "alias int MyInt;").
Also, do you think comments could be included in the JSON? Nice work, keep going!
Aug 01 2012
David <d dav1d.de> writes:
 I use them quite frequently in unittest {} blocks, if only to import
 std.stdio to get why my unittests don't work :)

version(unittest) { private import std.stdio; }

Place this where you have your other imports and you don't have to import it in your unittest{} blocks.
Aug 01 2012
"Jonathan M Davis" <jmdavisProg gmx.com> writes:
On Wednesday, August 01, 2012 22:34:14 Marco Leise wrote:
 I still hope for some
 super-smart solution, that doesn't need a book of documentation and allows
 some kind of ASCII-equivalent range.

If you want pure ASCII, then just cast to ubyte[] (or const(ubyte)[] or immutable(ubyte)[], depending on the constness involved). string functions won't work, because they require UTF-8 (or UTF-16 or UTF-32 if they're templatized on string type), but other range-based and array functions will work just fine. - Jonathan M Davis
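The cast Jonathan describes looks like this in practice (a small illustrative example):

```d
import std.algorithm.iteration : splitter;
import std.algorithm.searching : count;

void main()
{
    string text = "one;two;three";

    // Drop the UTF-8 interpretation: process raw ASCII bytes with
    // no decoding. string functions won't accept this, but generic
    // array/range functions will.
    auto bytes = cast(immutable(ubyte)[]) text;

    assert(bytes.count(cast(ubyte) ';') == 2);

    foreach (chunk; bytes.splitter(cast(ubyte) ';'))
    {
        // A chunk can be viewed as a string again when needed:
        string piece = cast(string) chunk;
    }
}
```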
Aug 01 2012
"Brian Schott" <briancschott gmail.com> writes:
On Wednesday, 1 August 2012 at 20:39:49 UTC, Philippe Sigaud 
wrote:
 I have two remarks (not criticisms!)

 - there seem to be two "structs" objects in the JSON, unless 
 I'm mistaken.
 - alias declarations are not parsed, seemingly (as in "alias 
 int MyInt;")

 Also, do you think comments could be included in the JSON?

 Nice work, keep going!

It's more likely that I'll remember things if they're enhancement requests/bugs on Github.

Structs: I'll look into it.
Alias: Not implemented yet.
Comments: It's planned. I want to be able to give doc comments in the autocomplete information.
Aug 01 2012
Philippe Sigaud <philippe.sigaud gmail.com> writes:
On Wed, Aug 1, 2012 at 11:03 PM, Brian Schott <briancschott gmail.com> wrote:

 It's more likely that I'll remember things if they're enhancement
 requests/bugs on Github.

Right, I say the same to people asking for things in my projects :) OK, done.
Aug 01 2012
Marco Leise <Marco.Leise gmx.de> writes:
Am Wed, 1 Aug 2012 22:39:41 +0200
schrieb Philippe Sigaud <philippe.sigaud gmail.com>:

 I just tested the JSON output and it works nicely. Finally, a way to
 get imports!

What does it do if you import from _inside_ a function? Not that this would happen often, but it can. :-] -- Marco
Aug 01 2012
"Brian Schott" <briancschott gmail.com> writes:
On Wednesday, 1 August 2012 at 21:35:08 UTC, Marco Leise wrote:
 Am Wed, 1 Aug 2012 22:39:41 +0200
 schrieb Philippe Sigaud <philippe.sigaud gmail.com>:

 I just tested the JSON output and it works nicely. Finally, a 
 way to
 get imports!

 What does it do if you import from _inside_ a function? Not that this would happen often, but it can. :-]

It ignores the insides of functions, mostly because writing a full D parser was not a design goal. I'm mostly concerned with autocomplete, ctags, and summarizing. Unfortunately, it also ignores the insides of static if and version statements. I've thought about making versions a command-line or configuration option, but the only way to handle static if is to actually be a compiler.
Aug 01 2012
Philippe Sigaud <philippe.sigaud gmail.com> writes:
On Wed, Aug 1, 2012 at 11:35 PM, Marco Leise <Marco.Leise gmx.de> wrote:
 Am Wed, 1 Aug 2012 22:39:41 +0200
 schrieb Philippe Sigaud <philippe.sigaud gmail.com>:

 I just tested the JSON output and it works nicely. Finally, a way to
 get imports!

 What does it do if you import from _inside_ a function? Not that this would happen often, but it can. :-]

I use them quite frequently in unittest {} blocks, if only to import std.stdio to get why my unittests don't work :)
Aug 01 2012
"Bernard Helyer" <b.helyer gmail.com> writes:
On Wednesday, 1 August 2012 at 22:31:39 UTC, Andrei Alexandrescu 
wrote:
 On 8/1/12 6:23 PM, David wrote:
 Ranges > Iterators, yes, but I think they are overdone.

I don't.

I think the main problem is that you need that abstraction for Phobos, whereas if you're writing stuff for yourself, you don't bother, even if it's a library meant for consumption by others. I wonder if there's an abstraction that would make defining a range around some data trivial. Maybe even just a good article on "why use ranges over X", where X == an array of data, or iterators for the C++ crowd. I know that in SDC's lexer we actually do have things that could be turned into input and output ranges fairly trivially. I would be concerned with potential performance ramifications, though. -Bernard.
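For the C++ crowd Bernard mentions, the entire input-range contract is just three members (a toy example):

```d
// The whole input-range interface: empty, front, popFront.
struct Squares
{
    int n = 1;
    enum limit = 10;

    bool empty() const { return n > limit; }
    int front() const { return n * n; }
    void popFront() { ++n; }
}

unittest
{
    import std.algorithm.iteration : sum;

    auto total = Squares().sum; // 1^2 + 2^2 + ... + 10^2
    assert(total == 385);
}
```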
Aug 01 2012
Jonathan M Davis <jmdavisProg gmx.com> writes:
On Thursday, August 02, 2012 08:18:39 Jacob Carlborg wrote:
 On 2012-08-01 22:20, Jonathan M Davis wrote:
 If you want really good performance out of a range-based solution
 operating on ranges of dchar, then you need to special case for the
 built-in string types all over the place, and if you have to wrap them in
 other range types (generally because of calling another range-based
 function), then there's a good chance that you will indeed get a
 performance hit. D's range-based approach is really nice from the
 perspective of usability, but you have to work at it a bit if you want it
 to be efficient when operating on strings. It _can_ be done though.

 Is it really worth it, though? Won't most use cases just be with regular strings?

It's really not all that hard to special case for strings, especially when you're operating primarily on code units. And I think that the lexer should be flexible enough to be usable with ranges other than strings. We're trying to make most stuff in Phobos range-based, not string-based or array-based. - Jonathan M Davis
Aug 01 2012
Jonathan M Davis <jmdavisProg gmx.com> writes:
On Thursday, August 02, 2012 08:51:26 Jacob Carlborg wrote:
 On 2012-08-02 08:26, Jonathan M Davis wrote:
 It's really not all that hard to special case for strings, especially when
 you're operating primarily on code units. And I think that the lexer
 should be flexible enough to be usable with ranges other than strings.
 We're trying to make most stuff in Phobos range-based, not string-based
 or array-based.

 Ok. I just don't think it's worth giving up some performance or making the design overly complicated just to provide a range interface. But if ranges don't cause these problems, I'm happy.

A range-based function operating on strings without special-casing them often _will_ harm performance. But if you special-case them for strings, then you can avoid that performance penalty - especially if you can avoid having to decode any characters.

The result is that using range-based functions on strings is generally correct without the function writer (or the caller) having to worry about encodings and the like, but if they want to eke out all of the performance that they can, they need to go to the extra effort of special-casing the function for strings. Like much of D, it favors correctness/safety but allows you to get full performance if you work at it a bit harder.

In the case of the lexer, it's really not all that bad - especially since string mixins allow me to give the operation that I need (e.g. get the first code unit) in the correct way for that particular range type without worrying about the details.

For instance, I have this function which I use to generate a mixin any time that I want to get the first code unit:

string declareFirst(R)()
    if(isForwardRange!R && is(Unqual!(ElementType!R) == dchar))
{
    static if(isNarrowString!R)
        return "Unqual!(ElementEncodingType!R) first = range[0];";
    else
        return "dchar first = range.front;";
}

So, every line using it becomes

mixin(declareFirst!R());

which really isn't any worse than

char c = str[0];

except that it works with more than just strings. Yes, it's more effort to get the lexer working with all ranges of dchar, but I don't think that it's all that much worse, and the result is much more flexible.

- Jonathan M Davis
Aug 02 2012