
digitalmars.D.announce - #line decoder

reply BCS <ao pathlink.com> writes:
I’m doing a pile of code generation and got tired of manually translating 
the effects of #line directives, so I just whipped up a little tool to do it 
for me.

You give it the input file, the “filename” and line number you are looking 
for, and it either spits out that line or the actual line number it’s on, 
depending on what you ask for.

source: http://www.dsource.org/projects/scrapple/browser/trunk/lines/lines.d
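The mapping half of the tool can be sketched in a few lines of Python (a hypothetical reimplementation for illustration, not the lines.d linked above; the function name `physical_line` is made up): scan the file, track #line directives, and translate a logical (filename, line) pair back into a physical line number.

```python
import re

# "#line 10" or '#line 10 "somefile.d"'
LINE_RE = re.compile(r'^#line\s+(\d+)(?:\s+"([^"]*)")?')

def physical_line(path, wanted_file, wanted_line):
    """Return the physical line number in `path` where the logical
    position (wanted_file, wanted_line) lands, or None if not found."""
    cur_file, cur_line = path, 0
    with open(path) as f:
        for phys, text in enumerate(f, 1):
            m = LINE_RE.match(text)
            if m:
                # The directive sets the logical number of the NEXT line.
                cur_line = int(m.group(1)) - 1
                if m.group(2) is not None:
                    cur_file = m.group(2)
                continue
            cur_line += 1
            if cur_file == wanted_file and cur_line == wanted_line:
                return phys
    return None
```

Spitting out the line's text instead is the same walk, returning `text` rather than `phys`.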
Sep 24 2008
next sibling parent reply bearophile <bearophileHUGS lycos.com> writes:
BCS Wrote:
 so I just whipped up a little tool to do it for me.

I think that's a job better suited to a script, for example in Python (or Ruby or Perl).

-----------------

But trying to write such little text-oriented programs in D is useful anyway, because it helps spot that there are some rough edges in D/Phobos still that deserve some improvements. For example, this is "data.txt":

jan sun 12
feb mon 14
mar fri 23
aug sat 3
jun tue 15

The purpose is to read those 3 columns as separate arrays, so the output is transposed:

[["jan", "feb", "mar", "aug", "jun"],
 ["sun", "mon", "fri", "sat", "tue"],
 ["12", "14", "23", "3", "15"]]

With Python you can do it this way (this produces immutable sub-arrays), loader1:

r1 = zip(*(l.split() for l in file("data2.txt")))

If you duplicate data.txt many times to get a bigger file (for example a 44 MB text file) like this:

t = file("data.txt").read()
t2 = t * 400000
len(t2)
file("data2.txt", "w").write(t2)

you find that transposing is far too slow. This is my faster Python+Psyco version:

# loader2
from collections import deque
def main():
    fin = file("data2.txt")
    cols = [deque([el]) for el in fin.readline().split()]
    for line in fin:
        col = 0
        for el in line.split():
            cols[col].append(el)
            col += 1
import psyco; psyco.full()
main()

My first D version:

// loader3
import std.stream;
import std.string: split;
void main() {
    auto fin = new BufferedFile("data2.txt");
    string[] first = fin.readLine().split();
    auto cols = new string[first.length];
    foreach (col, el; first)
        cols[col] ~= el.dup;
    foreach (string line; fin)
        foreach (col, el; line.split())
            cols[col] ~= el.dup;
}

Disabling the GC doesn't help:

// loader4
import std.stream, std.gc;
import std.string: split;
void main() {
    auto fin = new BufferedFile("data2.txt");
    string[] first = fin.readLine().split();
    auto cols = new string[first.length];
    foreach (col, el; first)
        cols[col] ~= el.dup;
    disable();
    foreach (string line; fin)
        foreach (col, el; line.split())
            cols[col] ~= el.dup;
    enable();
}

Here using an ArrayBuilder of mine makes the running time about two times worse, because the GC wastes a large amount of time scanning pointers (I have not converted the ArrayBuilders into arrays because that operation is very fast, just a slicing, so hopefully it doesn't change the overall running time):

// loader5
import std.stream;
import std.string: split;
import d.builders: ArrayBuilder;
void main() {
    auto fin = new BufferedFile("data2.txt");
    string[] first = fin.readLine().split();
    auto cols = new ArrayBuilder!(string)[first.length];
    foreach (col, el; first)
        cols[col] ~= el.dup;
    foreach (string line; fin)
        foreach (col, el; line.split())
            cols[col] ~= el.dup;
}

Using the ArrayBuilder and disabling the GC helps a little:

// loader6
import std.stream;
import std.string: split;
import d.builders: ArrayBuilder;
import std.gc;
void main() {
    auto fin = new BufferedFile("data2.txt");
    string[] first = fin.readLine().split();
    auto cols = new ArrayBuilder!(string)[first.length];
    foreach (col, el; first)
        cols[col] ~= el.dup;
    disable();
    foreach (string line; fin)
        foreach (col, el; line.split())
            cols[col] ~= el.dup;
    enable();
}

Using a lazy splitter generally helps a lot; this time it helps only a little because the lines are made of few parts:

// loader7
import std.stream;
import std.string: split;
import d.string: xsplit;
void main() {
    auto fin = new BufferedFile("data2.txt");
    string[] first = fin.readLine().split();
    auto cols = new string[first.length];
    foreach (col, el; first)
        cols[col] ~= el.dup;
    foreach (string line; fin)
        foreach (col, el; line.xsplit())
            cols[col] ~= el.dup;
}

But I don't understand why disabling the GC with the xsplit worsens the situation:

// loader8
import std.stream;
import std.string: split;
import d.string: xsplit;
import std.gc;
void main() {
    auto fin = new BufferedFile("data2.txt");
    string[] first = fin.readLine().split();
    auto cols = new string[first.length];
    foreach (col, el; first)
        cols[col] ~= el.dup;
    disable();
    foreach (string line; fin)
        foreach (col, el; line.xsplit())
            cols[col] ~= el.dup;
    enable();
}

Using both an ArrayBuilder and lazy splitting improves the situation significantly:

// loader9
import std.stream;
import std.string: split;
import d.string: xsplit;
import d.builders: ArrayBuilder;
void main() {
    auto fin = new BufferedFile("data2.txt");
    string[] first = fin.readLine().split();
    auto cols = new ArrayBuilder!(string)[first.length];
    foreach (col, el; first)
        cols[col] ~= el.dup;
    foreach (string line; fin)
        foreach (col, el; line.xsplit())
            cols[col] ~= el.dup;
}

This time disabling the GC helps a lot (see loader8 for comparison); we are now close to the timings of Python, so a lot of people can probably stop trying to optimize the code at this point:

// loader10
import std.stream;
import std.string: split;
import d.string: xsplit;
import std.gc;
import d.builders: ArrayBuilder;
void main() {
    auto fin = new BufferedFile("data2.txt");
    string[] first = fin.readLine().split();
    auto cols = new ArrayBuilder!(string)[first.length];
    foreach (col, el; first)
        cols[col] ~= el.dup;
    disable();
    foreach (string line; fin)
        foreach (col, el; line.xsplit())
            cols[col] ~= el.dup;
    enable();
}

But I have done a few other versions, trying to go faster than the Python version. In the following versions I have assumed the programmer knows the number of columns (3); in theory this helps the code. In theory this is very fast, because the xsplit() is very fast and there is no need to dup the elements: you can just keep the txt string around (which read() loads quickly) plus a matrix of string slices. In practice the GC kills the performance, I don't know why. Note that the running time of the foreach loop is very fast (0.3-0.4 s); most of the time is used in the final deallocation:

// loader11
import std.file: read;
import d.string: xsplit;
import d.builders: ArrayBuilder;
void main() {
    const int NCOLS = 3;
    auto txt = cast(string)read("data2.txt");
    auto cols = new ArrayBuilder!(string)[NCOLS];
    foreach (i, el; txt.xsplit())
        cols[i % NCOLS] ~= el;
}

Disabling the GC helps:

// loader12
import std.file: read;
import d.string: xsplit;
import d.builders: ArrayBuilder;
import std.gc;
void main() {
    const int NCOLS = 3;
    auto txt = cast(string)read("data2.txt");
    auto cols = new ArrayBuilder!(string)[NCOLS];
    disable();
    foreach (i, el; txt.xsplit())
        cols[i % NCOLS] ~= el;
    enable();
}

I have tried a few other mad things without good results.

Timings, data2.txt, warm timings, best of 3:
  loader1:  23.05 s
  loader2:   3.00 s
  loader3:   8.82 s
  loader4:  11.44 s
  loader5:  21.31 s
  loader6:   7.20 s
  loader7:   7.51 s
  loader8:   8.45 s
  loader9:   5.46 s
  loader10:  3.73 s
  loader11: 82.54 s
  loader12: 38.87 s

I can now probably write a pure C version, but I am too lazy for that :-)

If you know of ways to speed up the loading of such a file in D I'll gladly take a look at your code. In bioinformatics I process text files all the time.

Bye,
bearophile
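As a side note for readers on current Python: loader1 above is Python 2 code (file() and a lazy zip()). A self-contained Python 3 sketch of the same transposition, using the sample data from the post:

```python
# zip(*rows) transposes a list of rows into a tuple of columns;
# this is the core trick behind loader1.
rows = [line.split() for line in """\
jan sun 12
feb mon 14
mar fri 23
aug sat 3
jun tue 15""".splitlines()]

cols = [list(c) for c in zip(*rows)]
# cols[0]: months, cols[1]: day names, cols[2]: numbers (still strings)
```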
Sep 25 2008
next sibling parent reply Sergey Gromov <snake.scaly gmail.com> writes:
In article <gbg0uf$uka$1 digitalmars.com>, bearophileHUGS lycos.com 
says...
 // loader3
 import std.stream;
 import std.string: split;
 void main() {
     auto fin = new BufferedFile("data2.txt");
     string[] first = fin.readLine().split();
     auto cols = new string[first.length];
     foreach (col, el; first)
         cols[col] ~= el.dup;
     foreach (string line; fin)
         foreach (col, el; line.split())
             cols[col] ~= el.dup;
 }

Specifically in this line:
     auto cols = new string[first.length];

you probably actually want "new string[][first.length]", otherwise you simply get three huge strings. My timing of your original sample is 9.8s (best out of 5 runs). The fixed sample timing is 48.9s.
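The effect of the missing [] can be illustrated in Python terms (a sketch of the two shapes, not the D code itself): accumulating into three strings fuses the elements together, while appending to three lists keeps them separate.

```python
lines = ["jan sun 12", "feb mon 14"]

# Wrong shape: three ever-growing strings
# (the analogue of `new string[first.length]`)
cols_str = ["", "", ""]
for line in lines:
    for i, el in enumerate(line.split()):
        cols_str[i] += el          # concatenation: "jan" + "feb" -> "janfeb"

# Right shape: three lists of separate elements
# (the analogue of `new string[][first.length]`)
cols_list = [[], [], []]
for line in lines:
    for i, el in enumerate(line.split()):
        cols_list[i].append(el)
```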
Sep 25 2008
parent reply bearophile <bearophileHUGS lycos.com> writes:
Sergey Gromov:
 you probably actually want "new string[][first.length]", otherwise you 
 simply get three huge strings.

Yes, I had not seen it because of the missing commas in the test run output (this tells me to always avoid writeln and always use my putr, which puts "" around strings when they are printed inside containers). The timings of the loader3/loader4 versions now show that the ArrayBuilder is quite useful. Version 4 has the same bug too; the new versions:

// loader3
import std.stream;
import std.string: split;
void main() {
    auto fin = new BufferedFile("data2.txt");
    string[] first = fin.readLine().split();
    auto cols = new string[][first.length];
    foreach (col, el; first)
        cols[col] ~= el.dup;
    foreach (string line; fin)
        foreach (col, el; line.split())
            cols[col] ~= el.dup;
}

// loader4
import std.stream, std.gc;
import std.string: split;
void main() {
    auto fin = new BufferedFile("data2.txt");
    string[] first = fin.readLine().split();
    auto cols = new string[][first.length];
    foreach (col, el; first)
        cols[col] ~= el.dup;
    disable();
    foreach (string line; fin)
        foreach (col, el; line.split())
            cols[col] ~= el.dup;
    enable();
}

And just to be sure and have a more realistic test, I have added a benchmark that includes the final array creation too (the final append is very quick, less than 0.01 s), derived from loader10:

// loader10b
import std.stream;
import std.string: split;
import d.string: xsplit;
import std.gc;
import d.builders: ArrayBuilder;
void main() {
    auto fin = new BufferedFile("data2.txt");
    string[] first = fin.readLine().split();
    auto cols = new ArrayBuilder!(string)[first.length];
    foreach (col, el; first)
        cols[col] ~= el.dup;
    disable();
    foreach (string line; fin)
        foreach (col, el; line.xsplit())
            cols[col] ~= el.dup;
    enable();
    string[][] truecols;
    foreach (col; cols)
        truecols ~= col.toarray;
}

Updated timings:
Timings, data2.txt, warm timings, best of 3:
  loader1:   23.05 s
  loader2:    3.00 s
  loader3:   44.79 s
  loader4:   39.28 s
  loader5:   21.31 s
  loader6:    7.20 s
  loader7:    7.51 s
  loader8:    8.45 s
  loader9:    5.46 s
  loader10:   3.73 s
  loader10b:  3.88 s
  loader11:  82.54 s
  loader12:  38.87 s

Bye and thank you,
bearophile
Sep 25 2008
next sibling parent reply Sergey Gromov <snake.scaly gmail.com> writes:
In article <gbgem1$1qvo$1 digitalmars.com>, bearophileHUGS lycos.com 
says...
 Updated timings:
 Timings, data2.txt, warm timings, best of 3:
   loader1:  23.05 s
   loader2:   3.00 s
   loader3:  44.79 s
   loader4:  39.28 s
   loader5:  21.31 s
   loader6:   7.20 s
   loader7:   7.51 s
   loader8:   8.45 s
   loader9:   5.46 s
   loader10:  3.73 s
   loader10b: 3.88 s
   loader11: 82.54 s
   loader12: 38.87 s

You've intrigued me so I did some experimenting, too. I won't post my bad tries, only the better one:

import std.stream: BufferedFile;
import std.string: split;
import std.gc: enable, disable;
import rope: Rope;

void main() {
    // disable();
    auto fin = new BufferedFile("data2.txt");
    alias Rope!(string) SRope;
    SRope[] result;
    foreach (el; split(cast(string) fin.readLine())) {
        result ~= new SRope;
        result[$-1] ~= el;
    }

    foreach (char[] line; fin)
        foreach (id, el; split(cast(string) line))
            result[id] ~= el;

    auto month = result[0].get();
    auto day = result[1].get();
    auto num = result[2].get();
    // enable();
}

The rope thing is my version of an ArrayBuilder ;) :

module rope;
class Rope(T)
{
    typeof(this) opCatAssign(T el)
    {
        if (pool >= current.length)
        {
            chunks ~= current;
            current = new T[current.length * 2];
            pool = 0;
        }
        current[pool++] = el;
        return this;
    }
    T[] get()
    {
        T[] result;
        foreach (c; chunks)
            result ~= c;
        return result ~ current[0 .. pool];
    }
    this()
    {
        current = new T[16];
        pool = 0;
    }
    private
    {
        T[][] chunks;
        T[] current;
        size_t pool;
    }
}

The timings are:
  2.77 s  the loader as it is posted here
  2.99 s  with the garbage collector disabled (uncomment the commented lines)
  3.2 s   the Python version

On a side note, I've got a lot of weird timings today. It turned out that on many occasions, when I had very bad timings, most of the processing time was spent on a closing brace of a function, or on an std.file.read obviously not reading the file but doing some memory handling. Something definitely goes wrong sometimes.
Sep 26 2008
parent reply Sergey Gromov <snake.scaly gmail.com> writes:
In article <MPG.2347c0e378c6cc6a98970c news.digitalmars.com>, 
snake.scaly gmail.com says...
 module rope;
 class Rope(T)
 {
     typeof(this) opCatAssign(T el)
     {
         if (pool >= current.length)
         {
             chunks ~= current;
             current = new T[current.length * 2];
             pool = 0;
         }
         current[pool++] = el;
         return this;
     }
     T[] get()
     {
         T[] result;
         foreach (c; chunks)
             result ~= c;
         return result ~ current[0 .. pool];
     }
     this()
     {
         current = new T[16];
         pool = 0;
     }
     private
     {
         T[][] chunks;
         T[] current;
         size_t pool;
     }
 }

The simpler, the better: use Array instead of Rope and get 2.36s:

module array;
class Array(T)
{
    this()
    {
        data = new T[16];
    }
    typeof(this) opCatAssign(T el)
    {
        if (pos == data.length)
            data.length = data.length * 2;
        data[pos++] = el;
        return this;
    }
    T[] get()
    {
        return data[0 .. pos];
    }
    private
    {
        T[] data;
        size_t pos;
    }
}
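For comparison, the same grow-by-doubling idea in Python (purely illustrative, since Python lists already over-allocate internally; the method names mirror the Array(T) class above):

```python
class ArrayBuilder:
    """Amortized O(1) append via capacity doubling, like Array(T)."""
    def __init__(self):
        self._data = [None] * 16   # initial capacity, as in the D version
        self._pos = 0

    def append(self, el):
        if self._pos == len(self._data):
            # capacity exhausted: double it, copying once
            self._data.extend([None] * len(self._data))
        self._data[self._pos] = el
        self._pos += 1

    def get(self):
        # the filled prefix, like D's data[0 .. pos] slice (here a copy)
        return self._data[:self._pos]
```

Each element is copied at most O(log n) times in total, which is why this beats reallocating on every `~=`.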
Sep 26 2008
parent reply bearophile <bearophileHUGS lycos.com> writes:
Sergey Gromov, there's a bug in your code: you don't perform the dup, so the
contents of all the string slices are lost. This is a bug-prone feature of the
current D; I have made similar bugs several times in the past.
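The failure mode can be illustrated in Python (memoryview standing in for a D slice into readLine's reused buffer; an analogy, not the D semantics verbatim):

```python
buf = bytearray(b"jan sun 12")    # stand-in for the reader's internal buffer

view = memoryview(buf)[0:3]       # "light" slice: refers to buf, no copy
copy = bytes(buf[0:3])            # explicit copy, the analogue of el.dup

# the reader overwrites its buffer in place with the next line
for i, ch in enumerate(b"feb mon 14"):
    buf[i] = ch

assert bytes(view) == b"feb"      # the un-dup'ed slice lost the old contents
assert copy == b"jan"             # only the copy survived
```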

The updated timings:
Timings, data2.txt, warm timings, best of 3:
  loader1:    23.05 s
  loader2:     3.00 s
  loader3:    44.79 s
  loader4:    39.28 s
  loader5:    21.31 s
  loader6:     7.20 s
  loader7:     7.51 s
  loader8:     8.45 s
  loader9:     5.46 s
  loader10:    3.73 s
  loader10b:   3.88 s
  loader11:   82.54 s
  loader12:   38.87 s
  loader13:   28.28 s
  loader14:   15.37 s
  loader13:    6.74 s (no GC)
  loader14:    6.94 s (no GC)


Your code I have used:

// loader13
import std.stream: BufferedFile;
import std.string: split;
//import std.gc: enable, disable;

class Rope(T) {
    private {
        T[][] chunks;
        T[] current;
        size_t pool;
    }
    this() {
        current = new T[16];
        pool = 0;
    }
    typeof(this) opCatAssign(T el) {
        if (pool >= current.length) {
            chunks ~= current;
            current = new T[current.length * 2];
            pool = 0;
        }
        current[pool++] = el;
        return this;
    }
    T[] get() {
        T[] result;
        foreach (c; chunks)
            result ~= c;
        return result ~ current[0 .. pool];
    }
}

void main() {
    //disable();
    auto fin = new BufferedFile("data2.txt");
    alias Rope!(string) SRope;
    SRope[] result;
    foreach (el; split(cast(string) fin.readLine())) {
        result ~= new SRope;
        result[$-1] ~= el;
    }

    foreach (char[] line; fin)
        foreach (id, el; split(cast(string) line))
            result[id] ~= el.dup;

    auto month = result[0].get();
    auto day = result[1].get();
    auto num = result[2].get();
    //enable();
}



// loader14
import std.stream: BufferedFile;
import std.string: split;
//import std.gc: enable, disable;

class Array(T) {
    private {
        T[] data;
        size_t pos;
    }
    this() {
        data = new T[16];
    }
    typeof(this) opCatAssign(T el) {
        if (pos == data.length)
            data.length = data.length * 2;
        data[pos++] = el;
        return this;
    }
    T[] get() {
        return data[0 .. pos];
    }
}

void main() {
    //disable();
    auto fin = new BufferedFile("data2.txt");
    alias Array!(string) SArray;
    SArray[] result;
    foreach (el; split(cast(string) fin.readLine())) {
        result ~= new SArray;
        result[$-1] ~= el;
    }

    foreach (char[] line; fin)
        foreach (id, el; line.split())
            result[id] ~= el.dup;

    auto month = result[0].get();
    auto day = result[1].get();
    auto num = result[2].get();
    //enable();
}

Bye,
bearophile
Sep 27 2008
parent reply Sergey Gromov <snake.scaly gmail.com> writes:
Sat, 27 Sep 2008 08:00:02 -0400,
bearophile wrote:
   loader13:   28.28 s
   loader14:   15.37 s
   loader13:    6.74 s (no GC)
   loader14:    6.94 s (no GC)

This will teach me not to do research late at night. D-:
Sep 27 2008
parent reply bearophile <bearophileHUGS lycos.com> writes:
Sergey Gromov:
 This will teach me not to do research late at night. D-:

So far I have made similar bugs a few times myself, so I think it's not your fault: such "light" slices are a bug-prone feature of D. But light slices are quite efficient and handy too. (A significant part of the performance difference between the NumPy package and Matlab comes from the fact that in NumPy slices are light. In D I miss the rectangular dynamic arrays of NumPy. I think their syntax can be added without problems: a[x,y] is rectangular, and a[x][y] is not. The same works for their creation: new int[10,20] is easy to tell apart from new int[10][20], but other problems may be present.)

D is supposed to do the safe thing by default (if it's not too slow and doesn't require too much memory) and the efficient but unsafe thing on request, but in this situation I think it is doing the opposite. So the situation can be inverted: use the normal slice syntax to denote a real copy of the slice contents, and perform a "light" slice only when the programmer uses a special syntax.

Real copy of the slice:
int[] s = a[2 .. 5];

Light copy, by reference only, some possible alternative syntaxes:
int[] s = a[2 .. 5].light;
int[] s = a[[2 .. 5]];
int[] s = a[2 .. 5].undup;
int[] s = a[|2 .. 5|];
int[] s = a[{2 .. 5}];
int[] s = a{2 .. 5};
int[] s = a[(2 .. 5)];

Bye,
bearophile
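Python happens to offer both behaviors already, which makes the proposal concrete (memoryview standing in here for the proposed light syntax): the default slice is a real copy, and the by-reference view must be requested explicitly.

```python
a = [1, 2, 3, 4, 5, 6]

s = a[2:5]          # default slice: a real copy (the "safe" behavior)
s[0] = 99
assert a[2] == 3    # the original is untouched

b = bytearray(range(6))
v = memoryview(b)[2:5]   # explicit "light" slice, by reference only
v[0] = 99
assert b[2] == 99        # writes through to the original
```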
Sep 28 2008
parent reply Sergey Gromov <snake.scaly gmail.com> writes:
Sun, 28 Sep 2008 16:38:33 -0400,
bearophile wrote:
 Real copy of the slice:
 int[] s = a[2 .. 5];
 
 Light copy, by reference only, some possible alternative syntax:
 int[] s = a[2 .. 5].light;
 int[] s = a[[2 .. 5]];
 int[] s = a[2 .. 5].undup;
 int[] s = a[|2 .. 5|];
 int[] s = a[{2 .. 5}];
 int[] s = a{2 .. 5};
 int[] s = a[(2 .. 5)];

Unfortunately this doesn't address the issue I've faced: readLine returns a slice of an internal buffer and there's no way to notice that except by RTFM or debugging.
Sep 28 2008
parent bearophile <bearophileHUGS lycos.com> writes:
Sergey Gromov:
 Unfortunately this doesn't address the issue I've faced: readLine 
 returns a slice of an internal buffer and there's no way to notice that 
 except by RTFM or debugging.

You are right. This is how I have mostly solved your problem in my libs. This is a lazy generator of all the combinations of the given items, taken in groups of length k:

class Xcombinations(TyItem) {
    this(TyItem[] items, int k, bool copy=true) { ... }
}

The copy argument is true by default; it means that all the arrays it yields are actual copies. So by default it acts safely, but it's slower. In the less common situations where you want to go faster and you know what you are doing, you can set copy to false, so the dupping isn't performed and the code goes ten or more times faster. You have to pass that third argument as false explicitly, so while you program you know what it means. A similar strategy may help avoid some problems with readLine and other functions/generators.

If no performance loss can be accepted then 'copy' may even become a compile-time constant, so you can use it with "static if":

class Xcombinations(TyItem, bool copy=true) {
    this(TyItem[] items, int k) { ... }
}

Bye,
bearophile
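The same pattern can be sketched in Python (a hypothetical generator; itertools.combinations already covers the safe case), showing how the copy flag trades safety for speed:

```python
def xcombinations(items, k, copy=True):
    """Lazily yield the length-k combinations of items.
    copy=True (default): each result is a fresh list, safe to keep.
    copy=False: the same scratch list is reused, fast but aliasing-prone."""
    scratch = [None] * k

    def rec(start, depth):
        if depth == k:
            yield list(scratch) if copy else scratch
            return
        for i in range(start, len(items)):
            scratch[depth] = items[i]
            yield from rec(i + 1, depth + 1)

    return rec(0, 0)
```

With copy=False every yielded list is the same object, so collecting the results with list() gives n references to the final scratch state; that is exactly the dup bug discussed above.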
Sep 28 2008
prev sibling parent Sergey Gromov <snake.scaly gmail.com> writes:
Thu, 25 Sep 2008 12:36:17 -0400,
bearophile wrote:
 Updated timings:
 Timings, data2.txt, warm timings, best of 3:
   loader1:  23.05 s
   loader2:   3.00 s
   loader3:  44.79 s
   loader4:  39.28 s
   loader5:  21.31 s
   loader6:   7.20 s
   loader7:   7.51 s
   loader8:   8.45 s
   loader9:   5.46 s
   loader10:  3.73 s
   loader10b: 3.88 s
   loader11: 82.54 s
   loader12: 38.87 s

And, for completeness' sake, a straightforward implementation in C++:

#include <fstream>
#include <string>
#include <memory>
#include <algorithm>
#include <ctype.h>
#include <vector>
#include <functional>
using namespace std;

int main()
{
    char buf[1024];
    char *pos, *end;
    ifstream ifs("data2.txt");

    // result
    vector<vector<string> > result;

    // number of columns
    ifs.getline(buf, sizeof buf);
    pos = buf;
    end = buf + ifs.gcount();
    // loop over the tokens
    while (pos != end)
    {
        char *wordEnd = find_if(pos, end, isspace);
        result.push_back(vector<string>());
        result.back().push_back(string(pos, wordEnd));
        pos = find_if(wordEnd, end, not1(ptr_fun(isspace)));
    }

    // rest of the lines
    while (ifs.good())
    {
        ifs.getline(buf, sizeof buf);
        pos = buf;
        end = buf + ifs.gcount();
        // loop over the tokens
        size_t col = 0;
        while (pos != end)
        {
            char *wordEnd = find_if(pos, end, isspace);
            result[col].push_back(string(pos, wordEnd));
            pos = find_if(wordEnd, end, not1(ptr_fun(isspace)));
            ++col;
        }
    }
}

On my system:
  C++:    3.93 s
  best D: 2.36 s
  Python: 3.20 s

Replacing the inner vector<> with a list<> makes matters worse: 5.68 s instead of 3.93 s. I also guessed that the problem was in the vectors copying all the strings on every re-allocation, and tried to replace the strings with boost::shared_ptr<string>. That made for 13.35 s. I'm not sure I can optimize this code any further without re-writing it in Cee.
Sep 27 2008
prev sibling parent Christopher Wright <dhasenan gmail.com> writes:
bearophile wrote:
 BCS Wrote:
 so I just whipped up a little tool to do it for me.

 I think that's a job better suited to a script, for example in Python (or Ruby or Perl).

 -----------------

 But trying to write such little text-oriented programs in D is useful anyway, because it helps spot that there are some rough edges in D/Phobos still that deserve some improvements.

I actually do most of my scripting in D (except for the odd line of sed/awk now and then). I just don't know any scripting languages well enough to write my scripts in, say, Python, unless I want to take more than one order of magnitude longer to write the code. Even with sufficient familiarity, I'd only see small gains, so why bother?
Sep 25 2008
prev sibling parent "Denis Koroskin" <2korden gmail.com> writes:
On Mon, 29 Sep 2008 02:31:59 +0400, Sergey Gromov <snake.scaly gmail.com>  
wrote:

 Sun, 28 Sep 2008 16:38:33 -0400,
 bearophile wrote:
 Real copy of the slice:
 int[] s = a[2 .. 5];

 Light copy, by reference only, some possible alternative syntax:
 int[] s = a[2 .. 5].light;
 int[] s = a[[2 .. 5]];
 int[] s = a[2 .. 5].undup;
 int[] s = a[|2 .. 5|];
 int[] s = a[{2 .. 5}];
 int[] s = a{2 .. 5};
 int[] s = a[(2 .. 5)];

Unfortunately this doesn't address the issue I've faced: readLine returns a slice of an internal buffer and there's no way to notice that except by RTFM or debugging.

I think it is easy. Adopt a guideline and make it standard (across the library or the whole language):

a) if a const slice is returned, it means that you don't own the data but someone else does. You can only read from it (const == readonly).
b) if a mutable slice is returned, you are free to modify it; you are the sole owner of it.
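A loose Python analogue of the guideline (Python has no const, so the immutability of the returned type plays its role; the class and method names are made up for illustration):

```python
class Table:
    def __init__(self, data):
        self._data = bytes(data)      # internal, immutable storage

    def view(self):
        # (a) "const slice": shared, read-only; no copy is made and
        # the caller cannot modify the returned bytes object
        return self._data

    def own_copy(self):
        # (b) "mutable slice": a fresh buffer the caller owns outright
        return bytearray(self._data)
```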
Sep 29 2008