digitalmars.D.learn - How would I optimize this parser?

T.D.Spenser (99/99) Oct 30 2010 I wrote a simple parser in D. The language for it looks like this:

bearophile (10/11) Oct 30 2010 Very good :-)

spir (60/61) Oct 31 2010 some cases an Appender helps, and in some cases some workaround may be n...

Nick Sabalausky (3/5) Oct 31 2010 std.algorithm. But the docs for it do need to be improved.

spir (13/20) Oct 31 2010 I find it very strange that map/filter/reduce take *strings* as func exp...

Simen kjaeraas (9/25) Oct 31 2010 They take both, in fact:

bearophile (25/28) Oct 31 2010 But that needs to be compile-time constant, so if you have several funct...

Steven Schveighoffer (6/13) Nov 02 2010 I don't think it needs to be a compile-time constant.

Rainer Deyke (5/9) Oct 31 2010 In other words: function literals in D are too verbose to be convenient.

spir (28/36) Nov 01 2010 True. But it is very hard to find a notation that is at the same time:

T.D.Spenser (12/23) Oct 31 2010 something to optimize for.

bearophile (12/14) Oct 31 2010 Good. Your code looks good, well formatted and readable.
bearophile (15/15) Oct 31 2010 If you import:

bearophile (6/7) Oct 31 2010 The program allocates 169_000 TextNode and 245_001 TagNode. Just alloca...

bearophile (3/3) Oct 31 2010 Converting the nodes to tagged structs, and allocating the nodes with a ...
spir (9/10) Nov 01 2010 ey cause reallocations.

bearophile (4/5) Nov 01 2010 In some situations that's not handy to do, because you don't know what's...
bearophile (5/6) Nov 01 2010 I have answered the wrong question sorry :-) My answer was about the tre...

T.D.Spenser (31/47) Nov 02 2010 I'll reply to several posts at once.

bearophile (50/71) Nov 02 2010 Java code, prints "12":

T.D.Spenser <thoughdipenser.net gmail.com> writes:

I wrote a simple parser in D. The language for it looks like this:

[tagName content content [childTagName] [childTagName more content]]

The parser is here:

http://thoughtdispenser.net/files/parser.d

Unfortunately, it's quite slow. Can anyone point out what might be the issue(s)?
begin 644 parser.d
M<'5B;&EC(&-L87-S(%-T871I8U!A<G-E<B!["B` ("!P=6)L:6, <W1A=&EC
M(%1A9TYO9&4 <&%R<V4H<W1R:6YG(&-O9&4I>PH ("` ("` (%1A9TYO9&4 
M<F]O="`](&YE=R!486=.;V1E*"(B*3L*("` ("` ("!I9B`H8V]D92YL96YG
M=&  /B`P*2!["B` ("` ("` ("` (&EN="!C;VYT96YT16YD(#T <&%R<V5"

M9"`\(&-O9&4N;&5N9W1H*2!["B` ("` ("` ("` ("` ("!T:')O=R!N97< 
M17AC97!T:6]N*")4:&5R92!A<F4 97AT<F$ 8VQO<VEN9R!B<F%C:V5T<R(I
M.PH ("` ("` ("` ("!]"B` ("` ("` ?0H ("` ("` (')E='5R;B!R;V]T
M.PH ("` ?0H*("` ('-T871I8R!P=6)L:6, :6YT('!A<G-E0F]D>2AS=')I
M;F< =&5X="P 5&%G3F]D92!C=7)R96YT3F]D92P :6YT('!O<VET:6]N*7L*
M("` ("` ("!W:&EL92`H<&]S:71I;VX /"!T97AT+FQE;F=T:"D >PH ("` 
M("` ("` ("!I9B`H=&5X=%MP;W-I=&EO;ET /3T )ULG*2!["B` ("` ("` 
M("` ("` ("!P;W-I=&EO;B`]('!A<G-E3F]D92AT97AT+"!C=7)R96YT3F]D
M92P <&]S:71I;VX *R`Q*3L*("` ("` ("` ("` ("` ('!O<VET:6]N*RL[
M"B` ("` ("` ("` ('T 96QS92!I9B`H=&5X=%MP;W-I=&EO;ET /3T )UTG
M*2!["B` ("` ("` ("` ("` ("!R971U<FX <&]S:71I;VX["B` ("` ("` 
M("` ('T 96QS92!["B` ("` ("` ("` ("` ("!P;W-I=&EO;B`]('!A<G-E
M5&5X="AT97AT+"!C=7)R96YT3F]D92P <&]S:71I;VXI.PH ("` ("` ("` 
M("!]"B` ("` ("` ?0H ("` ("` (')E='5R;B!T97AT+FQE;F=T:#L*("` 
M('T*"B` ("!S=&%T:6, <'5B;&EC(&EN="!P87)S951E>'0H<W1R:6YG('1E
M>'0L(%1A9TYO9&4 8W5R<F5N=$YO9&4L(&EN="!S=&%R="E["B` ("` ("` 
M+R]A<W-E<G0H=&5X=%MS=&%R=%T (3T )ULG("8F('1E>'1;<W1A<G1=("$]
M("==)RD["B` ("` ("` :6YT('!O<VET:6]N(#T <W1A<G0["B` ("` ("` 
M=VAI;&4 *'!O<VET:6]N(#P =&5X="YL96YG=& I('L*("` ("` ("` ("` 
M:68 *'1E>'1;<&]S:71I;VY=(#T]("=;)R!\?"!T97AT6W!O<VET:6]N72`]
M/2`G72<I('L*("` ("` ("` ("` ("` (&)R96%K.PH ("` ("` ("` ("!]
M"B` ("` ("` ("` ('!O<VET:6]N*RL["B` ("` ("` ?0H ("` ("` ('-T
M<FEN9R!T97AT0VAA<G, /2!T97AT6W-T87)T("XN('!O<VET:6]N73L*("` 
M("` ("!497AT3F]D92!T97AT3F]D92`](&YE=R!497AT3F]D92AT97AT0VAA
M<G,I.PH ("` ("` (&-U<G)E;G1.;V1E+F%D9$-H:6QD*'1E>'1.;V1E*3L*
M("` ("` ("!R971U<FX <&]S:71I;VX["B` ("!]" H ("` <W1A=&EC('!U
M8FQI8R!I;G0 9V5T3F%M945N9"AS=')I;F< =&5X="P :6YT('!O<VET:6]N
M*7L*("` ("` ("!W:&EL92`H<&]S:71I;VX /"!T97AT+FQE;F=T:"D >PH 
M("` ("` ("` ("!S=VET8V  *'1E>'1;<&]S:71I;VY=*2!["B` ("` ("` 
M("` ("` ("!C87-E("< )RP )UQT)RP )UQN)RP )UQR)RP )UTG+"`G6R<Z
M"B` ("` ("` ("` ("` ("` ("` <F5T=7)N('!O<VET:6]N.PH ("` ("` 
M("` ("` ("` 8G)E86L["B` ("` ("` ("` ("` ("!D969A=6QT.B`*("` 
M("` ("` ("` ("` ("` ("`O+V1O(&YO=&AI;F<L(&YE960 =&AI<R!T;R!A
M=F]I9"!S:6QL>2!E<G)O<G,*("` ("` ("` ("` ?0H ("` ("` ("` ("!P
M;W-I=&EO;BLK.PH ("` ("` ('T*("` ("` ("`O+V%S<V5R="AF86QS92D[
M"B` ("` ("` =&AR;W< ;F5W($5X8V5P=&EO;B B4V]M92!T86=S(&%R92!N
M;W0 8VQO<V5D(BD["B` ("!]" H ("` <W1A=&EC('!U8FQI8R!I;G0 <&%R
M<V5.;V1E*'-T<FEN9R!T97AT+"!486=.;V1E('!A<F5N=$YO9&4L(&EN="!S
M=&%R="E["B` ("` ("` :6YT('!O<VET:6]N(#T 9V5T3F%M945N9"AT97AT
M+"!S=&%R="D["B` ("` ("` <W1R:6YG(&YA;64 /2!T97AT6W-T87)T("XN
M('!O<VET:6]N73L*("` ("` ("!486=.;V1E('1H:7-.;V1E(#T ;F5W(%1A
M9TYO9&4H;F%M92D[" H ("` ("` (&EF("AT97AT6W!O<VET:6]N72`A/2`G
M6R< )B8 =&5X=%MP;W-I=&EO;ET (3T )UTG*2!["B` ("` ("` ("` ('!O
M<VET:6]N*RL[("\O<VMI<"!T:&4 =VAI=&5S<&%C92!C:&%R('1H870 <V5P
M87)A=&5S(&YA;64 86YD(&)O9'D*("` ("` ("!]" H ("` ("` (&EN="!T
M86=%;F0 /2!P87)S94)O9'DH=&5X="P =&AI<TYO9&4L('!O<VET:6]N*3L*
M("` ("` ("!I9B`H=&%G16YD(#X]('1E>'0N;&5N9W1H*2!["B` ("` ("` 
M("` ('1H<F]W(&YE=R!%>&-E<'1I;VXH(E-O;64 =&%G(&ES;B=T(&-L;W-E
M9"(I.PH ("` ("` ('T*("` ("` ("!P87)E;G1.;V1E+F%D9$-H:6QD*'1H
M:7-.;V1E*3L*("` ("` ("!R971U<FX =&%G16YD.PH ("` ?0I]" IP=6)L
M:6, 86)S=')A8W0 8VQA<W, 3F]D92!["B` ("!.;V1E('!A<F5N=#L*("` 
M(`H ("` <'5B;&EC(&%B<W1R86-T('-T<FEN9R!T;U-T<FEN9R I.PH ("` 
M"B` ("!P=6)L:6, 3F]D92!G971087)E;G0H*7L*("` ("` ("!R971U<FX 
M<&%R96YT.PH ("` ?0I]" IP=6)L:6, 8VQA<W, 5&%G3F]D92`Z($YO9&4 
M>PH ("` <W1R:6YG(&YA;64["B` ("!.;V1E6UT 8VAI;&1R96X[" H ("` 
M<'5B;&EC('1H:7,H<W1R:6YG(&YA;64L($YO9&5;72!C:&EL9')E;BXN+BE[
M("\O8V%N)W0 :&%V92!T=V\ 8V]N<W1R=6-T;W)S(&QI:V4 :6X 2F%V80H 
M("` ("` ('1H:7,N;F%M92`](&YA;64["B` ("` ("` :68 *&-H:6QD<F5N

M:6QD<F5N.PH ("` ("` ("` ("!F;W)E86-H("A.;V1E(&-H:6QD(#L 8VAI
M;&1R96XI('L*("` ("` ("` ("` ("` (&-H:6QD+G!A<F5N="`]('1H:7,[
M"B` ("` ("` ("` ('T*("` ("` ("!]"B` ("!]"B` ("`*("` ('!U8FQI
M8R!S=')I;F< 9V5T3F%M92 I>PH ("` ("` (')E='5R;B!N86UE.PH ("` 
M?0H*("` ('!U8FQI8R!.;V1E(&=E=$-H:6QD*&EN="!P;W-I=&EO;BE["B` 
M("` ("` <F5T=7)N(&-H:6QD<F5N6W!O<VET:6]N73L*("` ('T*"B` ("!P
M=6)L:6, 5&%G3F]D95M=(&=E=$-H:6QD<F5N*'-T<FEN9R!N86UE*7L*("` 
M("` ("!486=.;V1E6UT ;6%T8VAE<SL*("` ("` ("!F;W)E86-H("A.;V1E
M(&-H:6QD(#L 8VAI;&1R96XI('L*("` ("` ("` ("` 875T;R!C:&EL9%1A
M9R`](&-A<W0H5&%G3F]D92D 8VAI;&0["B` ("` ("` ("` (&EF("AC:&EL
M9%1A9R`A:7, ;G5L;"`F)B!C:&EL9%1A9RYG971.86UE*"D /3T ;F%M92D 
M>R`O+W1R:6-K>2P <VEN8V4 8V%S="!I<R!U<V5D('1O(&-H96-K('1Y<&4*
M("` ("` ("` ("` ("` (&UA=&-H97, ?CT 8VAI;&1486<["B` ("` ("` 
M("` ('T*("` ("` ("!]"B` ("` ("` <F5T=7)N(&UA=&-H97,["B` ("!]

M("` ("` 8VAI;&1R96X ?CT ;F5W0VAI;&0["B` ("` ("` ;F5W0VAI;&0N
M<&%R96YT(#T =&AI<SL*("` ('T*"B` ("!P=6)L:6, <W1R:6YG('1O4W1R
M:6YG*"E["B` ("` ("` <W1R:6YG(&-O;G1E;G0 /2!N86UE('X (B`B.PH 
M("` ("` (&9O<F5A8V  *$YO9&4 8VAI;&0 .R!C:&EL9')E;BD >PH ("` 
M("` ("` ("!C;VYT96YT('X](&-H:6QD+G1O4W1R:6YG*"D["B` ("` ("` 
M?0H ("` ("` (&EF("AP87)E;G0 :7, ;G5L;"D >PH ("` ("` ("` ("!R
M971U<FX 8V]N=&5N=#L*("` ("` ("!]"B` ("` ("` <F5T=7)N(");(B!^
M(&-O;G1E;G0 ?B`B72(["B` ("!]"GT*"G!U8FQI8R!C;&%S<R!497AT3F]D
M92`Z($YO9&4 >PH ("` <W1R:6YG(&-O;G1E;G0[" H ("` =&AI<RAS=')I
M;F< 8V]N=&5N="E["B` ("` ("` =&AI<RYC;VYT96YT(#T 8V]N=&5N=#L*
M("` ('T*"B` ("!P=6)L:6, <W1R:6YG('1O4W1R:6YG*"E["B` ("` ("` 
8<F5T=7)N(&-O;G1E;G0["B` ("!]"GT*
`
end

Oct 30 2010

bearophile <bearophileHUGS lycos.com> writes:

T.D.Spenser:

 Unfortunately, it's quite slow. Can anyone point out what might be the
issue(s)?

Very good :-)
First suggestions:
- Generally compile your D code using the -w switch (you have missing some
overrides and more things)
- Add a main() with some benchmarking examples to your program, so I will have
something to optimize for.
- Use the -profile compiler switch on the benchmarking examples to see what's
slow.
- Use final classes or final methods where possible and keep an eye on memory
allocations. Consider using structs in some places instead of classes.
- Keep in mind that ~ and ~= in arrays isn't a very fast operation. In some
cases an Appender helps, and in some cases some workaround may be needed.

See you later,
bearophile

Oct 30 2010

spir <denis.spir gmail.com> writes:

On Sat, 30 Oct 2010 23:00:59 -0400
bearophile <bearophileHUGS lycos.com> wrote:

 - Keep in mind that ~ and ~=3D in arrays isn't a very fast operation. In =

some cases an Appender helps, and in some cases some workaround may be need=
ed.

A use case of join (in std.string) is indeed to build a text from snippets.
But I personly use this func most often to build a text representation of l=
ist of elements of any kind. For this, join falls short:
* no auto-conversion to string
* no delimiters
Thus, I think the following may be useful in the stdlib (aside join):

=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D
string listText(T)(T[] elements, string sep) {
    uint l =3D elements.length ;
    if (l =3D=3D 0)
        return "" ;
    string[] texts ; texts.length =3D l ;
    foreach (uint i, T element ; elements)
        texts[i] =3D to!string(elements[i]) ;
    return join(texts, sep) ;
}
string listText(T)(T[] elements, string sep, string leftDelim, string right=
Delim) {
    return format("%s%s%s", leftDelim, listText(elements, sep), rightDelim)=
 ;
}

// testing
struct Symbol {
    string k;
    int v;
    string toString () {
        return format("%s:%s", this.k, to!string(this.v)) ;
    }
}

void main () {
    int[] ints =3D [1,2,3] ;
    writeln(listText(ints , "---")) ;
    writeln(listText(ints , " " , "(",")")) ;
    Symbol[3] symbols =3D [Symbol("a",1) , Symbol("b",2) , Symbol("c",3)] ;
    writeln(listText(symbols , " | " , "{ "," }")) ;
}

// writes:
1---2---3
(1 2 3)
{ a:1 | b:2 | c:3 }
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D

This works sensibly with any custom type where toString is defined. (else y=
ou get the type name ;-)

I would also love a mapping-func param, to allow expressing things like lis=
tText(ints, square, " "); but without optional parameters (AFAIK), combinat=
ions are problematic.

Also, I could not find functional methods like map, filter, reduce in std.f=
unctional. Where else? Also not in std.array. Map would be handy above to s=
tring-ify. And remove the need for a map func param in listText (I would be=
 happy to contribute them if someone guides me on how to participate.)


Denis
-- -- -- -- -- -- --
vit esse estrany =E2=98=A3

spir.wikidot.com

Oct 31 2010

"Nick Sabalausky" <a a.a> writes:

"spir" <denis.spir gmail.com> wrote in message 
news:mailman.45.1288523296.21107.digitalmars-d-learn puremagic.com...
Also, I could not find functional methods like map, filter,
 reduce in std.functional. Where else? Also not in std.array.

std.algorithm. But the docs for it do need to be improved.

Oct 31 2010

spir <denis.spir gmail.com> writes:

On Sun, 31 Oct 2010 17:31:54 -0400
"Nick Sabalausky" <a a.a> wrote:

 "spir" <denis.spir gmail.com> wrote in message=20
 news:mailman.45.1288523296.21107.digitalmars-d-learn puremagic.com...
Also, I could not find functional methods like map, filter,
 reduce in std.functional. Where else? Also not in std.array.

=20
 std.algorithm. But the docs for it do need to be improved.

I find it very strange that map/filter/reduce take *strings* as func expres=
sions, instead of function literals:
	int[] arr =3D [ 1, 2, 3, 4 ];
	auto squares =3D map!("a * a")(arr);
Why then have func literal? Func parameter is precisely the place where pro=
grammers like to use them, isn't it? Isn't this a sign that D func literals=
 are not considered as practical enough?

Denis
-- -- -- -- -- -- --
vit esse estrany =E2=98=A3

spir.wikidot.com

Oct 31 2010

"Simen kjaeraas" <simen.kjaras gmail.com> writes:

spir <denis.spir gmail.com> wrote:

 On Sun, 31 Oct 2010 17:31:54 -0400
 "Nick Sabalausky" <a a.a> wrote:

 "spir" <denis.spir gmail.com> wrote in message
 news:mailman.45.1288523296.21107.digitalmars-d-learn puremagic.com...
Also, I could not find functional methods like map, filter,
 reduce in std.functional. Where else? Also not in std.array.

 std.algorithm. But the docs for it do need to be improved.

 I find it very strange that map/filter/reduce take *strings* as func  
 expressions, instead of function literals:
 	int[] arr = [ 1, 2, 3, 4 ];
 	auto squares = map!("a * a")(arr);

They take both, in fact:

auto cubes = map!((a){ return a*a*a; })(arr);


 Why then have func literal? Func parameter is precisely the place where  
 programmers like to use them, isn't it? Isn't this a sign that D func  
 literals are not considered as practical enough?

For very short functions, strings are better, because of the length of
the 'return' keyword. Had D instead always returned the result of the
last line of a function (unless specified to be of void return type),
map/filter/reduce would likely not take strings.

-- 
Simen

Oct 31 2010

bearophile <bearophileHUGS lycos.com> writes:

Simen kjaeraas:

 They take both, in fact:
 
 auto cubes = map!((a){ return a*a*a; })(arr);

But that needs to be compile-time constant, so if you have several functions,
you need to put them inside a typetuple, or duplicate the code. And some idioms
are just not possible.

You may see it well here regarding the "compose":
http://rosettacode.org/wiki/First-class_functions#D


import std.stdio, std.typetuple, std.functional;
private import std.math;
 
void main() {
    // wrappers needed as not all built-in functions
    // have same signature, eg pure/nothrow
    auto sin  = (real x) { return std.math.sin(x); };
    auto asin = (real x) { return std.math.asin(x); };
    auto cos  = (real x) { return std.math.cos(x); };
    auto acos = (real x) { return std.math.acos(x); };
    auto cube = (real x) { return x ^^ 3; };
    auto cbrt = (real x) { return std.math.cbrt(x); };
 
    alias TypeTuple!(sin,  cos,  cube) dir;
    alias TypeTuple!(asin, acos, cbrt) inv;
 
    foreach (i, f; dir)
        writefln("%6.3f", compose!(f, inv[i])(0.5));
}

Bye,
bearophile

Oct 31 2010

"Steven Schveighoffer" <schveiguy yahoo.com> writes:

On Sun, 31 Oct 2010 19:20:04 -0400, bearophile <bearophileHUGS lycos.com>  
wrote:

 Simen kjaeraas:

 They take both, in fact:

 auto cubes = map!((a){ return a*a*a; })(arr);

 But that needs to be compile-time constant, so if you have several  
 functions, you need to put them inside a typetuple, or duplicate the  
 code. And some idioms are just not possible.

I don't think it needs to be a compile-time constant.

I think one of the most horrible problems with documenting D templates is  
it's impossible to tell what's possible without trying it.

-Steve

Nov 02 2010

Rainer Deyke <rainerd eldwood.com> writes:

On 10/31/2010 16:57, Simen kjaeraas wrote:
 For very short functions, strings are better, because of the length of
 the 'return' keyword. Had D instead always returned the result of the
 last line of a function (unless specified to be of void return type),
 map/filter/reduce would likely not take strings.

In other words: function literals in D are too verbose to be convenient.
 This language-level shortcoming is plastered over at the library level.


-- 
Rainer Deyke - rainerd eldwood.com

Oct 31 2010

spir <denis.spir gmail.com> writes:

On Sun, 31 Oct 2010 20:24:59 -0600
Rainer Deyke <rainerd eldwood.com> wrote:

 On 10/31/2010 16:57, Simen kjaeraas wrote:
 For very short functions, strings are better, because of the length of
 the 'return' keyword. Had D instead always returned the result of the
 last line of a function (unless specified to be of void return type),
 map/filter/reduce would likely not take strings.

=20
 In other words: function literals in D are too verbose to be convenient.
  This language-level shortcoming is plastered over at the library level.

True. But it is very hard to find a notation that is at the same time:
* compact
* general
* clear
In many language communities, there are endless discussions of this topic. =
One common issue is the contradiction between compact and general. Oftentim=
es people come up with a cooool format without realising they only consider=
 their little egotic use case (typically one single arg, one single expr, e=
tc...).
Also, the need for clarity is difficult to achieve for historical reasons i=
n languages of the C & Pascal lines: namely that func defs (and type defs) =
were basically special cased (because then they were not first-class elemen=
ts):
	string Text (...) {...}
instead of
	Text =3D func string (...) {...}
This makes theoretically superior formats difficult to integrate nicely wit=
h the general syntax of the language. The solution is indeed to get rid of =
the legacy syntax and use only literals (1), but...


Denis

(1) Once the bug that prevents using literals in func defs, using the form
	auto f =3D function type (params) {block}
is solved, we do not need the legacy format at all anymore.
-- -- -- -- -- -- --
vit esse estrany =E2=98=A3

spir.wikidot.com

Nov 01 2010

T.D.Spenser <thoughdispenser.org gmail.com> writes:

== Quote from bearophile (bearophileHUGS lycos.com)'s article
 T.D.Spenser:
 Unfortunately, it's quite slow. Can anyone point out what might be the
issue(s)?

 Very good :-)
 First suggestions:
 - Generally compile your D code using the -w switch (you have missing some

overrides and more things)
 - Add a main() with some benchmarking examples to your program, so I will have

something to optimize for.
 - Use the -profile compiler switch on the benchmarking examples to see what's
slow.
 - Use final classes or final methods where possible and keep an eye on memory

allocations. Consider using structs in some places instead of classes.
 - Keep in mind that ~ and ~= in arrays isn't a very fast operation. In some

cases an Appender helps, and in some cases some workaround may be needed.
 See you later,
 bearophile

Thanks for replying.

Here is a version with a main function and sample file to parse:

http://thoughtdispenser.net/files/parser-test.7z

I fixed the warnings, but didn't change anything else yet.

Having built-in profiler is nice, I didn't know about it. Although, the files it
spits out are quite strange. If I read them right, it seems the parser spends
most
time constructing nodes for the document's object model.

Oct 31 2010

bearophile <bearophileHUGS lycos.com> writes:

T.D.Spenser:

 Here is a version with a main function and sample file to parse:

Good. Your code looks good, well formatted and readable.
But before you try to optimize code you have write some tests, otherwise you
can't be sure you are not breaking the code, it's like driving with closed
eyes. D has both unittests and design by contract, add them (it's also a good
training in using such important things). I will try to look at your code even
without that necessary safety net, but I can't be sure I am not breaking your
code, so I hope you will give me some tests&contracts too.


 Although, the files it spits out are quite strange.

I think they are normal enough for a profiler. GCC produces a not too much
different file.

Notes on your code:

Is this a critique of the D way to tell if an object is of a class? :-)

auto childTag = cast(TagNode) child;
if (childTag !is null && childTag.getName() == name) { //tricky, since cast is
used to check type


In D you may have many constructors, so what were you trying to do here?

public this(string name, Node[] children...) { //can't have two constructors
like in Java

Bye,
bearophile

Oct 31 2010

bearophile <bearophileHUGS lycos.com> writes:

If you import:
import core.memory: GC;

And then you disable the GC just before parsing:
GC.disable();

The parsing runtime on my PC becomes about one third. Disabling the GC when you
have to build a large data structure and re-enabling it after the creation is a
normal optimization both in CPython and D.

You may also import:
import std.c.stdlib: exit;

And add this last line to the main:

exit(0);

to kill garbage collection at the end to remove the final collection time you
weren't timing (but was there).

Now the total running time is about 0.3 seconds instead of 1.1 seconds.

All this means that your code is GC-bound :-) D GC is far worse than the Java
one.

I will try to optimize other things.

Later,
bearophile

Oct 31 2010

bearophile <bearophileHUGS lycos.com> writes:

bearophile:
 Now the total running time is about 0.3 seconds instead of 1.1 seconds.

The program allocates 169_000 TextNode  and 245_001 TagNode. Just allocating
two dynamic arrays of them, even with disabled GC, takes about 0.16 seconds of
the about 0.30 of running time.

The children arrays inside TagNode receive a total of 414_000 appends, they
cause reallocations.

I'll try to study the code some more.

Bye,
bearophile

Oct 31 2010

bearophile <bearophileHUGS lycos.com> writes:

Converting the nodes to tagged structs, and allocating the nodes with a memory
pool data structure (that allocates large chunks from the C heap), seems to
reduce the total running time to about 0.15 seconds. I don't know a good
language to write this kind of code.

Bye,
bearophile

Oct 31 2010

spir <denis.spir gmail.com> writes:

On Sun, 31 Oct 2010 22:02:10 -0400
bearophile <bearophileHUGS lycos.com> wrote:

 The children arrays inside TagNode receive a total of 414_000 appends, th=

ey cause reallocations.

Then an optimization would be to somehow grossly predict array sizes and pr=
eallocate?


Denis
-- -- -- -- -- -- --
vit esse estrany =E2=98=A3

spir.wikidot.com

Nov 01 2010

bearophile <bearophileHUGS lycos.com> writes:

spir:

 Then an optimization would be to somehow grossly predict array sizes and
preallocate?

In some situations that's not handy to do, because you don't know what's a good
size to preallocate. So I suggest a more general solution, to create a struct
that inside keeps a dynamic array of pointers to fixed-sized chunks of structs.
The size of the chunk size is chosen at compile time to be "good" (close to 64
KB, for example). A member function of this struct may return a pointer to a
new struct on request, and allocate a new chunk when the last allocated chunk
is full. Another member function may clear the data structure with no
deallocation (just resetting the pointer to the first free struct), and another
member function may really free all the chunks. This data structure may
allocate its chunks from the GC heap or the C heap. This is the data structure
I have used for this parser. 

Bye,
bearophile

Nov 01 2010

bearophile <bearophileHUGS lycos.com> writes:

spir:

 Then an optimization would be to somehow grossly predict array sizes and
preallocate?

I have answered the wrong question sorry :-) My answer was about the tree node
allocations.

I have not studied enough the allocation patterns of those arrays, so I can't
answer. You may collect better statistics yourself.

Bye,
bearophile

Nov 01 2010

T.D.Spenser <thoughdispenser.org gmail.com> writes:

I'll reply to several posts at once.

The code comments you quoted weren't meant as a criticism. They are mostly notes
to myself in case I want to document my learning experience with D on some
personal website no one ever visits. I'll post criticisms separately if/when I
have any. :)

When I was speaking about several constructors, I referred to stuff like this:

    public TagNode(String name){
        this.name = name;
        this.children = new ArrayList<Node>(1);
    }

    public TagNode(String name, Node... children){
        this.children = Arrays.asList(children);
        for (Node child : children) {
            child.parent = this;
        }
    }

It's allowed in Java, but I can see why this can be prohibited.

You're right about unit tests. One thing that surprised me is that unit tests
are
run when the program is run. For some reason I thought they were run immediately
after the compilation with the appropriate flag.

== Quote from bearophile (bearophileHUGS lycos.com)'s article
 If you import:
 import core.memory: GC;
 And then you disable the GC just before parsing:
 GC.disable();
 The parsing runtime on my PC becomes about one third.

Interesting. I didn't think about GC, because there aren't any objects that get
unreferenced. But, of course, garbage collector can't know about that until it
runs. (Well, unless there is a mechanism for marking objects as non-collectible
manually.)

Tangent question. Is there a way to disable GC per application thread?

 Disabling the GC when you have to build a large data structure and re-enabling

it after the creation is a normal optimization both in CPython and D.
 You may also import:
 import std.c.stdlib: exit;
 And add this last line to the main:
 exit(0);
 to kill garbage collection at the end to remove the final collection time you

weren't timing (but was there).
 Now the total running time is about 0.3 seconds instead of 1.1 seconds.
 All this means that your code is GC-bound :-) D GC is far worse than the Java
one.
 I will try to optimize other things.
 Later,
 bearophile

Thanks for such informative replies. I can't quite keep up on the weekdays, but
I
will experiment with the suggestions this weekend and probably post an update of
some sort.

Nov 02 2010

bearophile <bearophileHUGS lycos.com> writes:

T.D.Spense:

     public TagNode(String name){
         this.name = name;
         this.children = new ArrayList<Node>(1);
     }
 
     public TagNode(String name, Node... children){
         this.children = Arrays.asList(children);
         for (Node child : children) {
             child.parent = this;
         }
     }
 
 It's allowed in Java, but I can see why this can be prohibited.


Java code, prints "12":


class Node {}
 
class Main {
  public void foo(String name) {
    System.out.print("1"); 
  }
 
  public void foo(String name, Node... nodes) {
    System.out.print("2"); 
  }
 
   public static void main(String[] args) {
    Main m = new Main();
    m.foo("hello");
    m.foo("hello", new Node());
  }
}


In D the T[]... syntax means zero or more T, so if you supply zero T, the
compiler can't know what of the two methods you are calling, so it's statically
forbidden.

This prints 22 in Java:


class Node {}
 
class Main {
  public void foo(String name, Node... nodes) {
    System.out.print("2"); 
  }
 
   public static void main(String[] args) {
    Main m = new Main();
    m.foo("hello");
    m.foo("hello", new Node());
  }
}

Here I prefer how D is designed, it looks tidier (often it's the opposite, the



 You're right about unit tests. One thing that surprised me is that unit tests
are
 run when the program is run. For some reason I thought they were run
immediately
 after the compilation with the appropriate flag.

If you compile your code with:

dmd foo.d
And then you run the program:
foo.exe
or:
./foo

Your unit tests aren't run.

If you instead compile with this:
dmd -unittest foo.d

And then you run the program, the unittests are run and then the program is
run. In D2 with a version(unittest) you may disable the running as you like.


 Interesting. I didn't think about GC, because there aren't any objects that get
 unreferenced. But, of course, garbage collector can't know about that until it
 runs.

Right, the program builds the data structure and then now and then the GC
transversed it all and marks all objects as reachable. This takes a lot of time.


 (Well, unless there is a mechanism for marking objects as non-collectible
manually.)

A simple way to make objects not collectable is to allocate them from the C
heap (there is a kind of placement new in D).


 Tangent question. Is there a way to disable GC per application thread?

I am not expert on this, I think there is no way yet. Maybe it will be added.

Bye,
bearophile

Nov 02 2010

D Programming

C/C++ Programming

Other

digitalmars.D.learn - How would I optimize this parser?