
digitalmars.D.announce - #line decoder

reply BCS <ao pathlink.com> writes:
I’m doing a pile of code generation and got tired of manually translating 
the effects of #line directives, so I just whipped up a little tool to do it 
for me.

You give it the input file, the “filename” and line number you are looking 
for, and it either spits out that line or the actual line number it’s on, 
depending on what you ask for.

source: http://www.dsource.org/projects/scrapple/browser/trunk/lines/lines.d
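The mapping half of the tool can be sketched in a few lines of Python (a hypothetical reimplementation for illustration, not the lines.d linked above; the function name `physical_line` is made up): scan the file, track #line directives, and translate a logical (filename, line) pair back into a physical line number.

```python
import re

# "#line 10" or '#line 10 "somefile.d"'
LINE_RE = re.compile(r'^#line\s+(\d+)(?:\s+"([^"]*)")?')

def physical_line(path, wanted_file, wanted_line):
    """Return the physical line number in `path` where the logical
    position (wanted_file, wanted_line) lands, or None if not found."""
    cur_file, cur_line = path, 0
    with open(path) as f:
        for phys, text in enumerate(f, 1):
            m = LINE_RE.match(text)
            if m:
                # The directive sets the logical number of the NEXT line.
                cur_line = int(m.group(1)) - 1
                if m.group(2) is not None:
                    cur_file = m.group(2)
                continue
            cur_line += 1
            if cur_file == wanted_file and cur_line == wanted_line:
                return phys
    return None
```

Spitting out the line's text instead is the same walk, returning `text` rather than `phys`.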
Sep 24 2008
next sibling parent reply bearophile <bearophileHUGS lycos.com> writes:
BCS Wrote:
 so I just whipped up a little tool to do it for me.

I think that's a job better suited to a script, for example in Python (or Ruby or Perl).

-----------------

But trying to write such little text-oriented programs in D is useful anyway, because it helps spot that there are some rough edges in D/Phobos still that deserve some improvements. For example, this is "data.txt":

jan sun 12
feb mon 14
mar fri 23
aug sat 3
jun tue 15

The purpose is to read those 3 columns as separate arrays, so the output is transposed:

[["jan", "feb", "mar", "aug", "jun"],
 ["sun", "mon", "fri", "sat", "tue"],
 ["12", "14", "23", "3", "15"]]

With Python you can do it this way (this produces immutable sub-arrays), loader1:

r1 = zip(*(l.split() for l in file("data2.txt")))

If you duplicate data.txt many times to get a bigger file (for example a 44 MB text file) like this:

t = file("data.txt").read()
t2 = t * 400000
len(t2)
file("data2.txt", "w").write(t2)

you find that transposing is far too slow. This is my faster Python+Psyco version:

# loader2
from collections import deque
def main():
    fin = file("data2.txt")
    cols = [deque([el]) for el in fin.readline().split()]
    for line in fin:
        col = 0
        for el in line.split():
            cols[col].append(el)
            col += 1
import psyco; psyco.full()
main()

My first D version:

// loader3
import std.stream;
import std.string: split;
void main() {
    auto fin = new BufferedFile("data2.txt");
    string[] first = fin.readLine().split();
    auto cols = new string[first.length];
    foreach (col, el; first)
        cols[col] ~= el.dup;
    foreach (string line; fin)
        foreach (col, el; line.split())
            cols[col] ~= el.dup;
}

Disabling the GC doesn't help:

// loader4
import std.stream, std.gc;
import std.string: split;
void main() {
    auto fin = new BufferedFile("data2.txt");
    string[] first = fin.readLine().split();
    auto cols = new string[first.length];
    foreach (col, el; first)
        cols[col] ~= el.dup;
    disable();
    foreach (string line; fin)
        foreach (col, el; line.split())
            cols[col] ~= el.dup;
    enable();
}

Here using an ArrayBuilder of mine makes the running time about two times worse, because the GC wastes a large amount of time scanning pointers (I have not converted the ArrayBuilders into arrays because that operation is very fast, just a slicing, so hopefully it doesn't change the overall running time):

// loader5
import std.stream;
import std.string: split;
import d.builders: ArrayBuilder;
void main() {
    auto fin = new BufferedFile("data2.txt");
    string[] first = fin.readLine().split();
    auto cols = new ArrayBuilder!(string)[first.length];
    foreach (col, el; first)
        cols[col] ~= el.dup;
    foreach (string line; fin)
        foreach (col, el; line.split())
            cols[col] ~= el.dup;
}

Using the ArrayBuilder and disabling the GC helps a little:

// loader6
import std.stream;
import std.string: split;
import d.builders: ArrayBuilder;
import std.gc;
void main() {
    auto fin = new BufferedFile("data2.txt");
    string[] first = fin.readLine().split();
    auto cols = new ArrayBuilder!(string)[first.length];
    foreach (col, el; first)
        cols[col] ~= el.dup;
    disable();
    foreach (string line; fin)
        foreach (col, el; line.split())
            cols[col] ~= el.dup;
    enable();
}

Using a lazy splitter generally helps a lot; this time it helps only a little because the lines are made of few parts:

// loader7
import std.stream;
import std.string: split;
import d.string: xsplit;
void main() {
    auto fin = new BufferedFile("data2.txt");
    string[] first = fin.readLine().split();
    auto cols = new string[first.length];
    foreach (col, el; first)
        cols[col] ~= el.dup;
    foreach (string line; fin)
        foreach (col, el; line.xsplit())
            cols[col] ~= el.dup;
}

But I don't understand why disabling the GC with the xsplit worsens the situation:

// loader8
import std.stream;
import std.string: split;
import d.string: xsplit;
import std.gc;
void main() {
    auto fin = new BufferedFile("data2.txt");
    string[] first = fin.readLine().split();
    auto cols = new string[first.length];
    foreach (col, el; first)
        cols[col] ~= el.dup;
    disable();
    foreach (string line; fin)
        foreach (col, el; line.xsplit())
            cols[col] ~= el.dup;
    enable();
}

Using both an ArrayBuilder and lazy splitting improves the situation significantly:

// loader9
import std.stream;
import std.string: split;
import d.string: xsplit;
import d.builders: ArrayBuilder;
void main() {
    auto fin = new BufferedFile("data2.txt");
    string[] first = fin.readLine().split();
    auto cols = new ArrayBuilder!(string)[first.length];
    foreach (col, el; first)
        cols[col] ~= el.dup;
    foreach (string line; fin)
        foreach (col, el; line.xsplit())
            cols[col] ~= el.dup;
}

This time disabling the GC helps a lot (see loader8 for comparison); we are now close to the timings of Python, so a lot of people can probably stop trying to optimize the code at this point:

// loader10
import std.stream;
import std.string: split;
import d.string: xsplit;
import std.gc;
import d.builders: ArrayBuilder;
void main() {
    auto fin = new BufferedFile("data2.txt");
    string[] first = fin.readLine().split();
    auto cols = new ArrayBuilder!(string)[first.length];
    foreach (col, el; first)
        cols[col] ~= el.dup;
    disable();
    foreach (string line; fin)
        foreach (col, el; line.xsplit())
            cols[col] ~= el.dup;
    enable();
}

But I have done a few other versions, trying to go faster than the Python version. In the following versions I have assumed the programmer knows the number of columns (3); in theory this helps the code. In theory this is very fast, because the xsplit() is very fast and there is no need to dup the elements: you can just keep the txt string around (which read() loads quickly) plus a matrix of string slices. In practice the GC kills the performance, I don't know why. Note that the running time of the foreach loop is very fast (0.3-0.4 s); most of the time is used in the final deallocation:

// loader11
import std.file: read;
import d.string: xsplit;
import d.builders: ArrayBuilder;
void main() {
    const int NCOLS = 3;
    auto txt = cast(string)read("data2.txt");
    auto cols = new ArrayBuilder!(string)[NCOLS];
    foreach (i, el; txt.xsplit())
        cols[i % NCOLS] ~= el;
}

Disabling the GC helps:

// loader12
import std.file: read;
import d.string: xsplit;
import d.builders: ArrayBuilder;
import std.gc;
void main() {
    const int NCOLS = 3;
    auto txt = cast(string)read("data2.txt");
    auto cols = new ArrayBuilder!(string)[NCOLS];
    disable();
    foreach (i, el; txt.xsplit())
        cols[i % NCOLS] ~= el;
    enable();
}

I have tried a few other mad things without good results.

Timings, data2.txt, warm timings, best of 3:
  loader1:  23.05 s
  loader2:   3.00 s
  loader3:   8.82 s
  loader4:  11.44 s
  loader5:  21.31 s
  loader6:   7.20 s
  loader7:   7.51 s
  loader8:   8.45 s
  loader9:   5.46 s
  loader10:  3.73 s
  loader11: 82.54 s
  loader12: 38.87 s

I can now probably write a pure C version, but I am too lazy for that :-)

If you know of ways to speed up the loading of such a file in D I'll gladly take a look at your code. In bioinformatics I process text files all the time.

Bye,
bearophile
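As a side note for readers on current Python: loader1 above is Python 2 code (file() and a lazy zip()). A self-contained Python 3 sketch of the same transposition, using the sample data from the post:

```python
# zip(*rows) transposes a list of rows into a tuple of columns;
# this is the core trick behind loader1.
rows = [line.split() for line in """\
jan sun 12
feb mon 14
mar fri 23
aug sat 3
jun tue 15""".splitlines()]

cols = [list(c) for c in zip(*rows)]
# cols[0]: months, cols[1]: day names, cols[2]: numbers (still strings)
```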
Sep 25 2008
next sibling parent reply Sergey Gromov <snake.scaly gmail.com> writes:
In article <gbg0uf$uka$1 digitalmars.com>, bearophileHUGS lycos.com 
says...
 // loader3
 import std.stream;
 import std.string: split;
 void main() {
     auto fin = new BufferedFile("data2.txt");
     string[] first = fin.readLine().split();
     auto cols = new string[first.length];
     foreach (col, el; first)
         cols[col] ~= el.dup;
     foreach (string line; fin)
         foreach (col, el; line.split())
             cols[col] ~= el.dup;
 }

Specifically in this line:
     auto cols = new string[first.length];

you probably actually want "new string[][first.length]", otherwise you simply get three huge strings. My timing of your original sample is 9.8s (best out of 5 runs). The fixed sample timing is 48.9s.
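The effect of the missing [] can be illustrated in Python terms (a sketch of the two shapes, not the D code itself): accumulating into three strings fuses the elements together, while appending to three lists keeps them separate.

```python
lines = ["jan sun 12", "feb mon 14"]

# Wrong shape: three ever-growing strings
# (the analogue of `new string[first.length]`)
cols_str = ["", "", ""]
for line in lines:
    for i, el in enumerate(line.split()):
        cols_str[i] += el          # concatenation: "jan" + "feb" -> "janfeb"

# Right shape: three lists of separate elements
# (the analogue of `new string[][first.length]`)
cols_list = [[], [], []]
for line in lines:
    for i, el in enumerate(line.split()):
        cols_list[i].append(el)
```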
Sep 25 2008
parent reply bearophile <bearophileHUGS lycos.com> writes:
Sergey Gromov:
 you probably actually want "new string[][first.length]", otherwise you 
 simply get three huge strings.

Yes, I had not seen it because of the missing commas in the test run output (this tells me to always avoid writeln and always use my putr, which puts "" around strings when they are printed inside containers). The timings of the loader3/loader4 versions now show that the ArrayBuilder is quite useful. Version 4 has the same bug too; the new versions:

// loader3
import std.stream;
import std.string: split;
void main() {
    auto fin = new BufferedFile("data2.txt");
    string[] first = fin.readLine().split();
    auto cols = new string[][first.length];
    foreach (col, el; first)
        cols[col] ~= el.dup;
    foreach (string line; fin)
        foreach (col, el; line.split())
            cols[col] ~= el.dup;
}

// loader4
import std.stream, std.gc;
import std.string: split;
void main() {
    auto fin = new BufferedFile("data2.txt");
    string[] first = fin.readLine().split();
    auto cols = new string[][first.length];
    foreach (col, el; first)
        cols[col] ~= el.dup;
    disable();
    foreach (string line; fin)
        foreach (col, el; line.split())
            cols[col] ~= el.dup;
    enable();
}

And just to be sure and have a more realistic test, I have added a benchmark that includes the final array creation too (the final append is very quick, less than 0.01 s), derived from loader10:

// loader10b
import std.stream;
import std.string: split;
import d.string: xsplit;
import std.gc;
import d.builders: ArrayBuilder;
void main() {
    auto fin = new BufferedFile("data2.txt");
    string[] first = fin.readLine().split();
    auto cols = new ArrayBuilder!(string)[first.length];
    foreach (col, el; first)
        cols[col] ~= el.dup;
    disable();
    foreach (string line; fin)
        foreach (col, el; line.xsplit())
            cols[col] ~= el.dup;
    enable();
    string[][] truecols;
    foreach (col; cols)
        truecols ~= col.toarray;
}

Updated timings:
Timings, data2.txt, warm timings, best of 3:
  loader1:   23.05 s
  loader2:    3.00 s
  loader3:   44.79 s
  loader4:   39.28 s
  loader5:   21.31 s
  loader6:    7.20 s
  loader7:    7.51 s
  loader8:    8.45 s
  loader9:    5.46 s
  loader10:   3.73 s
  loader10b:  3.88 s
  loader11:  82.54 s
  loader12:  38.87 s

Bye and thank you,
bearophile
Sep 25 2008
next sibling parent reply Sergey Gromov <snake.scaly gmail.com> writes:
In article <gbgem1$1qvo$1 digitalmars.com>, bearophileHUGS lycos.com 
says...
 Updated timings:
 Timings, data2.txt, warm timings, best of 3:
   loader1:  23.05 s
   loader2:   3.00 s
   loader3:  44.79 s
   loader4:  39.28 s
   loader5:  21.31 s
   loader6:   7.20 s
   loader7:   7.51 s
   loader8:   8.45 s
   loader9:   5.46 s
   loader10:  3.73 s
   loader10b: 3.88 s
   loader11: 82.54 s
   loader12: 38.87 s

You've intrigued me so I did some experimenting, too. I won't post my bad tries, only the better one:

import std.stream: BufferedFile;
import std.string: split;
import std.gc: enable, disable;
import rope: Rope;

void main() {
    // disable();
    auto fin = new BufferedFile("data2.txt");
    alias Rope!(string) SRope;
    SRope[] result;
    foreach (el; split(cast(string) fin.readLine())) {
        result ~= new SRope;
        result[$-1] ~= el;
    }

    foreach (char[] line; fin)
        foreach (id, el; split(cast(string) line))
            result[id] ~= el;

    auto month = result[0].get();
    auto day = result[1].get();
    auto num = result[2].get();
    // enable();
}

The rope thing is my version of an ArrayBuilder ;) :

module rope;
class Rope(T)
{
    typeof(this) opCatAssign(T el)
    {
        if (pool >= current.length)
        {
            chunks ~= current;
            current = new T[current.length * 2];
            pool = 0;
        }
        current[pool++] = el;
        return this;
    }
    T[] get()
    {
        T[] result;
        foreach (c; chunks)
            result ~= c;
        return result ~ current[0 .. pool];
    }
    this()
    {
        current = new T[16];
        pool = 0;
    }
    private
    {
        T[][] chunks;
        T[] current;
        size_t pool;
    }
}

The timings are:
  2.77 s  the loader as it is posted here
  2.99 s  with the garbage collector disabled (uncomment the commented lines)
  3.2 s   the Python version

On a side note, I've got a lot of weird timings today. It turned out that on many occasions, when I had very bad timings, most of the processing time was spent on a closing brace of a function, or on an std.file.read obviously not reading the file but doing some memory handling. Something definitely goes wrong sometimes.
Sep 26 2008
parent reply Sergey Gromov <snake.scaly gmail.com> writes:
In article <MPG.2347c0e378c6cc6a98970c news.digitalmars.com>, 
snake.scaly gmail.com says...
 module rope;
 class Rope(T)
 {
     typeof(this) opCatAssign(T el)
     {
         if (pool >= current.length)
         {
             chunks ~= current;
             current = new T[current.length * 2];
             pool = 0;
         }
         current[pool++] = el;
         return this;
     }
     T[] get()
     {
         T[] result;
         foreach (c; chunks)
             result ~= c;
         return result ~ current[0 .. pool];
     }
     this()
     {
         current = new T[16];
         pool = 0;
     }
     private
     {
         T[][] chunks;
         T[] current;
         size_t pool;
     }
 }

The simpler, the better: use Array instead of Rope and get 2.36s:

module array;
class Array(T)
{
    this()
    {
        data = new T[16];
    }
    typeof(this) opCatAssign(T el)
    {
        if (pos == data.length)
            data.length = data.length * 2;
        data[pos++] = el;
        return this;
    }
    T[] get()
    {
        return data[0 .. pos];
    }
    private
    {
        T[] data;
        size_t pos;
    }
}
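For comparison, the same grow-by-doubling idea in Python (purely illustrative, since Python lists already over-allocate internally; the method names mirror the Array(T) class above):

```python
class ArrayBuilder:
    """Amortized O(1) append via capacity doubling, like Array(T)."""
    def __init__(self):
        self._data = [None] * 16   # initial capacity, as in the D version
        self._pos = 0

    def append(self, el):
        if self._pos == len(self._data):
            # capacity exhausted: double it, copying once
            self._data.extend([None] * len(self._data))
        self._data[self._pos] = el
        self._pos += 1

    def get(self):
        # the filled prefix, like D's data[0 .. pos] slice (here a copy)
        return self._data[:self._pos]
```

Each element is copied at most O(log n) times in total, which is why this beats reallocating on every `~=`.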
Sep 26 2008
parent reply bearophile <bearophileHUGS lycos.com> writes:
Sergey Gromov, there's a bug in your code: you don't perform the dup, so the
contents of all the string slices are lost. This is a bug-prone feature of the
current D; I have made similar bugs several times in the past.
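The failure mode can be illustrated in Python (memoryview standing in for a D slice into readLine's reused buffer; an analogy, not the D semantics verbatim):

```python
buf = bytearray(b"jan sun 12")    # stand-in for the reader's internal buffer

view = memoryview(buf)[0:3]       # "light" slice: refers to buf, no copy
copy = bytes(buf[0:3])            # explicit copy, the analogue of el.dup

# the reader overwrites its buffer in place with the next line
for i, ch in enumerate(b"feb mon 14"):
    buf[i] = ch

assert bytes(view) == b"feb"      # the un-dup'ed slice lost the old contents
assert copy == b"jan"             # only the copy survived
```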

The updated timings:
Timings, data2.txt, warm timings, best of 3:
  loader1:    23.05 s
  loader2:     3.00 s
  loader3:    44.79 s
  loader4:    39.28 s
  loader5:    21.31 s
  loader6:     7.20 s
  loader7:     7.51 s
  loader8:     8.45 s
  loader9:     5.46 s
  loader10:    3.73 s
  loader10b:   3.88 s
  loader11:   82.54 s
  loader12:   38.87 s
  loader13:   28.28 s
  loader14:   15.37 s
  loader13:    6.74 s (no GC)
  loader14:    6.94 s (no GC)


Your code I have used:

// loader13
import std.stream: BufferedFile;
import std.string: split;
//import std.gc: enable, disable;

class Rope(T) {
    private {
        T[][] chunks;
        T[] current;
        size_t pool;
    }
    this() {
        current = new T[16];
        pool = 0;
    }
    typeof(this) opCatAssign(T el) {
        if (pool >= current.length) {
            chunks ~= current;
            current = new T[current.length * 2];
            pool = 0;
        }
        current[pool++] = el;
        return this;
    }
    T[] get() {
        T[] result;
        foreach (c; chunks)
            result ~= c;
        return result ~ current[0 .. pool];
    }
}

void main() {
    //disable();
    auto fin = new BufferedFile("data2.txt");
    alias Rope!(string) SRope;
    SRope[] result;
    foreach (el; split(cast(string) fin.readLine())) {
        result ~= new SRope;
        result[$-1] ~= el;
    }

    foreach (char[] line; fin)
        foreach (id, el; split(cast(string) line))
            result[id] ~= el.dup;

    auto month = result[0].get();
    auto day = result[1].get();
    auto num = result[2].get();
    //enable();
}



// loader14
import std.stream: BufferedFile;
import std.string: split;
//import std.gc: enable, disable;

class Array(T) {
    private {
        T[] data;
        size_t pos;
    }
    this() {
        data = new T[16];
    }
    typeof(this) opCatAssign(T el) {
        if (pos == data.length)
            data.length = data.length * 2;
        data[pos++] = el;
        return this;
    }
    T[] get() {
        return data[0 .. pos];
    }
}

void main() {
    //disable();
    auto fin = new BufferedFile("data2.txt");
    alias Array!(string) SArray;
    SArray[] result;
    foreach (el; split(cast(string) fin.readLine())) {
        result ~= new SArray;
        result[$-1] ~= el;
    }

    foreach (char[] line; fin)
        foreach (id, el; line.split())
            result[id] ~= el.dup;

    auto month = result[0].get();
    auto day = result[1].get();
    auto num = result[2].get();
    //enable();
}

Bye,
bearophile
Sep 27 2008
parent reply Sergey Gromov <snake.scaly gmail.com> writes:
Sat, 27 Sep 2008 08:00:02 -0400,
bearophile wrote:
   loader13:   28.28 s
   loader14:   15.37 s
   loader13:    6.74 s (no GC)
   loader14:    6.94 s (no GC)

This will teach me not to do research late at night. D-:
Sep 27 2008
parent reply bearophile <bearophileHUGS lycos.com> writes:
Sergey Gromov:
 This will teach me not to do research late at night. D-:

So far I have made similar bugs a few times myself, so I think it's not your fault: such "light" slices are a bug-prone feature of D. But light slices are quite efficient and handy too. (A significant part of the performance difference between the NumPy package and Matlab comes from the fact that in NumPy slices are light. In D I miss the rectangular dynamic arrays of NumPy. I think their syntax can be added without problems: a[x,y] is rectangular, and a[x][y] is not. The same works for their creation: new int[10,20] is easy to tell apart from new int[10][20], but other problems may be present.)

D is supposed to do the safe thing by default (if it's not too slow and doesn't require too much memory) and the efficient but unsafe thing on request, but in this situation I think it is doing the opposite. So the situation can be inverted: use the normal slice syntax to denote a real copy of the slice contents, and perform a "light" slice only when the programmer uses a special syntax.

Real copy of the slice:
int[] s = a[2 .. 5];

Light copy, by reference only, some possible alternative syntaxes:
int[] s = a[2 .. 5].light;
int[] s = a[[2 .. 5]];
int[] s = a[2 .. 5].undup;
int[] s = a[|2 .. 5|];
int[] s = a[{2 .. 5}];
int[] s = a{2 .. 5};
int[] s = a[(2 .. 5)];

Bye,
bearophile
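Python happens to offer both behaviors already, which makes the proposal concrete (memoryview standing in here for the proposed light syntax): the default slice is a real copy, and the by-reference view must be requested explicitly.

```python
a = [1, 2, 3, 4, 5, 6]

s = a[2:5]          # default slice: a real copy (the "safe" behavior)
s[0] = 99
assert a[2] == 3    # the original is untouched

b = bytearray(range(6))
v = memoryview(b)[2:5]   # explicit "light" slice, by reference only
v[0] = 99
assert b[2] == 99        # writes through to the original
```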
Sep 28 2008
parent reply Sergey Gromov <snake.scaly gmail.com> writes:
Sun, 28 Sep 2008 16:38:33 -0400,
bearophile wrote:
 Real copy of the slice:
 int[] s = a[2 .. 5];
 
 Light copy, by reference only, some possible alternative syntax:
 int[] s = a[2 .. 5].light;
 int[] s = a[[2 .. 5]];
 int[] s = a[2 .. 5].undup;
 int[] s = a[|2 .. 5|];
 int[] s = a[{2 .. 5}];
 int[] s = a{2 .. 5};
 int[] s = a[(2 .. 5)];

Unfortunately this doesn't address the issue I've faced: readLine returns a slice of an internal buffer and there's no way to notice that except by RTFM or debugging.
Sep 28 2008
parent bearophile <bearophileHUGS lycos.com> writes:
Sergey Gromov:
 Unfortunately this doesn't address the issue I've faced: readLine 
 returns a slice of an internal buffer and there's no way to notice that 
 except by RTFM or debugging.

You are right. This is how I have mostly solved your problem in my libs. This is a lazy generator of all the combinations of the given items, taken in groups of length k:

class Xcombinations(TyItem) {
    this(TyItem[] items, int k, bool copy=true) { ... }
}

The copy argument is true by default; it means that all the arrays it yields are actual copies. So by default it acts safely, but it's slower. In the less common situations where you want to go faster and you know what you are doing, you can set copy to false, so the dupping isn't performed and the code goes ten or more times faster. You have to pass that third argument as false explicitly, so while you program you know what it means. A similar strategy may help avoid some problems with readLine and other functions/generators.

If no performance loss can be accepted then 'copy' may even become a compile-time constant, so you can use it with "static if":

class Xcombinations(TyItem, bool copy=true) {
    this(TyItem[] items, int k) { ... }
}

Bye,
bearophile
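The same pattern can be sketched in Python (a hypothetical generator; itertools.combinations already covers the safe case), showing how the copy flag trades safety for speed:

```python
def xcombinations(items, k, copy=True):
    """Lazily yield the length-k combinations of items.
    copy=True (default): each result is a fresh list, safe to keep.
    copy=False: the same scratch list is reused, fast but aliasing-prone."""
    scratch = [None] * k

    def rec(start, depth):
        if depth == k:
            yield list(scratch) if copy else scratch
            return
        for i in range(start, len(items)):
            scratch[depth] = items[i]
            yield from rec(i + 1, depth + 1)

    return rec(0, 0)
```

With copy=False every yielded list is the same object, so collecting the results with list() gives n references to the final scratch state; that is exactly the dup bug discussed above.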
Sep 28 2008
prev sibling parent Sergey Gromov <snake.scaly gmail.com> writes:
Thu, 25 Sep 2008 12:36:17 -0400,
bearophile wrote:
 Updated timings:
 Timings, data2.txt, warm timings, best of 3:
   loader1:  23.05 s
   loader2:   3.00 s
   loader3:  44.79 s
   loader4:  39.28 s
   loader5:  21.31 s
   loader6:   7.20 s
   loader7:   7.51 s
   loader8:   8.45 s
   loader9:   5.46 s
   loader10:  3.73 s
   loader10b: 3.88 s
   loader11: 82.54 s
   loader12: 38.87 s

And, for completeness' sake, a straightforward implementation in C++:

#include <fstream>
#include <string>
#include <memory>
#include <algorithm>
#include <ctype.h>
#include <vector>
#include <functional>
using namespace std;

int main()
{
    char buf[1024];
    char *pos, *end;
    ifstream ifs("data2.txt");

    // result
    vector<vector<string> > result;

    // number of columns
    ifs.getline(buf, sizeof buf);
    pos = buf;
    end = buf + ifs.gcount();
    // loop over the tokens
    while (pos != end)
    {
        char *wordEnd = find_if(pos, end, isspace);
        result.push_back(vector<string>());
        result.back().push_back(string(pos, wordEnd));
        pos = find_if(wordEnd, end, not1(ptr_fun(isspace)));
    }

    // rest of the lines
    while (ifs.good())
    {
        ifs.getline(buf, sizeof buf);
        pos = buf;
        end = buf + ifs.gcount();
        // loop over the tokens
        size_t col = 0;
        while (pos != end)
        {
            char *wordEnd = find_if(pos, end, isspace);
            result[col].push_back(string(pos, wordEnd));
            pos = find_if(wordEnd, end, not1(ptr_fun(isspace)));
            ++col;
        }
    }
}

On my system:
  C++:    3.93 s
  best D: 2.36 s
  Python: 3.20 s

Replacing the inner vector<> with a list<> makes matters worse: 5.68 s instead of 3.93 s. I also guessed that the problem was in the vectors copying all the strings on every re-allocation, and tried to replace the strings with boost::shared_ptr<string>. That made for 13.35 s. I'm not sure I can optimize this code any further without re-writing it in Cee.
Sep 27 2008
prev sibling parent Christopher Wright <dhasenan gmail.com> writes:
bearophile wrote:
 BCS Wrote:
 so I just whipped up a little tool to do it for me.

 I think that's a job better suited to a script, for example in Python (or Ruby or Perl).

 -----------------

 But trying to write such little text-oriented programs in D is useful anyway, because it helps spot that there are some rough edges in D/Phobos still that deserve some improvements.

I actually do most of my scripting in D (except for the odd line of sed/awk now and then). I just don't know any scripting languages well enough to write my scripts in, say, Python, unless I want to take more than one order of magnitude longer to write the code. Even with sufficient familiarity, I'd only see small gains, so why bother?
Sep 25 2008
prev sibling parent "Denis Koroskin" <2korden gmail.com> writes:
On Mon, 29 Sep 2008 02:31:59 +0400, Sergey Gromov <snake.scaly gmail.com>  
wrote:

 Sun, 28 Sep 2008 16:38:33 -0400,
 bearophile wrote:
 Real copy of the slice:
 int[] s = a[2 .. 5];

 Light copy, by reference only, some possible alternative syntax:
 int[] s = a[2 .. 5].light;
 int[] s = a[[2 .. 5]];
 int[] s = a[2 .. 5].undup;
 int[] s = a[|2 .. 5|];
 int[] s = a[{2 .. 5}];
 int[] s = a{2 .. 5};
 int[] s = a[(2 .. 5)];

Unfortunately this doesn't address the issue I've faced: readLine returns a slice of an internal buffer and there's no way to notice that except by RTFM or debugging.

I think it is easy. Adopt a guideline and make it standard (across the library or the whole language):

a) if a const slice is returned, it means that you don't own the data but someone else does. You can only read from it (const == readonly).
b) if a mutable slice is returned, you are free to modify it; you are the sole owner of it.
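A loose Python analogue of the guideline (Python has no const, so the immutability of the returned type plays its role; the class and method names are made up for illustration):

```python
class Table:
    def __init__(self, data):
        self._data = bytes(data)      # internal, immutable storage

    def view(self):
        # (a) "const slice": shared, read-only; no copy is made and
        # the caller cannot modify the returned bytes object
        return self._data

    def own_copy(self):
        # (b) "mutable slice": a fresh buffer the caller owns outright
        return bytearray(self._data)
```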
Sep 29 2008