www.digitalmars.com         C & C++   DMDScript  

digitalmars.D.learn - Can I speed up this log parsing script further?

reply uncorroded <uncorroded temp.org> writes:
Hi guys,

I am a beginner in D. As a project, I converted a log-parsing 
script in Python which we use at work, to D. This link was 
helpful - ( 
https://dlang.org/blog/2017/05/24/faster-command-line-tools-in-d/ 
) I compiled it with dmd and ldc. The log file is 52 MB. With dmd 
(not release build), it takes 1.1 sec and with ldc, it takes 0.3 
sec.

The Python script (run with system python, not Pypy) takes 0.75 
sec. The D and Python functions are here and on pastebin ( D - 
https://pastebin.com/SeUR3wFP , Python - 
https://pastebin.com/F5JbfBmE ).

Basically, i am reading a line, checking for 2 constants. If 
either one is found, some processing is done on line and stored 
to an array for later analysis. I tried reading the file entirely 
in one go using std.file : readText and using std.algorithm : 
splitter for lazily splitting newline but there is no difference 
in speed, so I used the byLine approach mentioned in the linked 
blog. Is there a better way of doing this in D?

Note:
I ran GC profiling as mentioned in linked blog. The results were:

Number of collections:  3
	Total GC prep time:  0 milliseconds
	Total mark time:  0 milliseconds
	Total sweep time:  0 milliseconds
	Total page recovery time:  0 milliseconds
	Max Pause Time:  0 milliseconds
	Grand total GC time:  2 milliseconds
GC summary:   12 MB,    3 GC    2 ms, Pauses    0 ms <    0 ms

So GC does not seem to be an issue.

Here's the D script:

import std.stdio;
import std.string;
import std.array;
import std.algorithm : splitter;
import std.typecons : tuple, Tuple;
import std.conv : to;

void read_log(string filename) {
     File file = File(filename, "r");
     Tuple!(char[], int, char[])[] npushed;
     Tuple!(int, char[], int, bool, bool)[] pushed;
     foreach (line; file.byLine) {
         if (line.indexOf("SOC_NOT_PUSHED") != -1) {
             auto tarr = line.split();
             npushed ~= tuple(tarr[2] ~ tarr[3], to!int(tarr[$ - 
1]), tarr[$ - 2]);
             continue;
         }
         if (line.indexOf("SYNC_PUSH:") != -1) {
             auto rel = line.split("SYNC_PUSH:")[1].strip();
             auto att = rel.split(" at ");
             auto ina = att[1].split(" in ");
             auto msa = ina[1].split(" ms ");
             pushed ~= tuple(to!int(att[0]), ina[0], 
to!int(msa[0]),
                     msa[1].indexOf("PA-SOC_POP") != -1, 
msa[1].indexOf("CU-SOC_POP") != -1);
         }
     }
     // Using the arrays later on in production script
     writeln(npushed.length);
     writeln(pushed.length);
}


Here is Python function:

def read_log(fname):
     try:
         with open(fname, 'r') as f:
             raw = f.read().splitlines()
             ns = [s.split() for s in raw if 'SOC_NOT_PUSHED' in s]
             ss = [w.split("SYNC_PUSH:")[1].strip() for w in raw 
if 'SYNC_PUSH:' in w]
             not_pushed = [[s[2]+s[3], int(s[-1]), s[-2]] for s in 
ns]
             ww = [(int(e.split(' at ')[0]), e.split(' at 
')[1].split(' in ')[0], int(e.split(' at ')[1].split(' in 
')[1].split(' ms ')[0]), set(e.split(' at ')[1].split(' in 
')[1].split(' ms ')[1].split())) for e in ss]
             pushed = [[w[0], w[1], w[2], 1 if 'PA-SOC_POP' in 
w[3] else 0, 1 if 'CU-SOC_POP' in w[3] else 0] for w in ww]
             return not_pushed, pushed
     except:
         return []
Jun 09 2017
next sibling parent reply rikki cattermole <rikki cattermole.co.nz> writes:
On 09/06/2017 8:34 AM, uncorroded wrote:
 Hi guys,
 
 I am a beginner in D. As a project, I converted a log-parsing script in 
 Python which we use at work, to D. This link was helpful - ( 
 https://dlang.org/blog/2017/05/24/faster-command-line-tools-in-d/ ) I 
 compiled it with dmd and ldc. The log file is 52 MB. With dmd (not 
 release build), it takes 1.1 sec and with ldc, it takes 0.3 sec.
 
 The Python script (run with system python, not Pypy) takes 0.75 sec. The 
 D and Python functions are here and on pastebin ( D - 
 https://pastebin.com/SeUR3wFP , Python - https://pastebin.com/F5JbfBmE ).
 
 Basically, i am reading a line, checking for 2 constants. If either one 
 is found, some processing is done on line and stored to an array for 
 later analysis. I tried reading the file entirely in one go using 
 std.file : readText and using std.algorithm : splitter for lazily 
 splitting newline but there is no difference in speed, so I used the 
 byLine approach mentioned in the linked blog. Is there a better way of 
 doing this in D?
 
 Note:
 I ran GC profiling as mentioned in linked blog. The results were:
 
 Number of collections:  3
      Total GC prep time:  0 milliseconds
      Total mark time:  0 milliseconds
      Total sweep time:  0 milliseconds
      Total page recovery time:  0 milliseconds
      Max Pause Time:  0 milliseconds
      Grand total GC time:  2 milliseconds
 GC summary:   12 MB,    3 GC    2 ms, Pauses    0 ms <    0 ms
 
 So GC does not seem to be an issue.
 
 Here's the D script:
 
 import std.stdio;
 import std.string;
 import std.array;
 import std.algorithm : splitter;
 import std.typecons : tuple, Tuple;
 import std.conv : to;
 
 void read_log(string filename) {
      File file = File(filename, "r");
      Tuple!(char[], int, char[])[] npushed;
      Tuple!(int, char[], int, bool, bool)[] pushed;
      foreach (line; file.byLine) {
          if (line.indexOf("SOC_NOT_PUSHED") != -1) {
              auto tarr = line.split();
              npushed ~= tuple(tarr[2] ~ tarr[3], to!int(tarr[$ - 1]), 
 tarr[$ - 2]);
              continue;
          }
          if (line.indexOf("SYNC_PUSH:") != -1) {
              auto rel = line.split("SYNC_PUSH:")[1].strip();
              auto att = rel.split(" at ");
              auto ina = att[1].split(" in ");
              auto msa = ina[1].split(" ms ");
              pushed ~= tuple(to!int(att[0]), ina[0], to!int(msa[0]),
                      msa[1].indexOf("PA-SOC_POP") != -1, 
 msa[1].indexOf("CU-SOC_POP") != -1);
          }
      }
      // Using the arrays later on in production script
      writeln(npushed.length);
      writeln(pushed.length);
 }
 
 
 Here is Python function:
 
 def read_log(fname):
      try:
          with open(fname, 'r') as f:
              raw = f.read().splitlines()
              ns = [s.split() for s in raw if 'SOC_NOT_PUSHED' in s]
              ss = [w.split("SYNC_PUSH:")[1].strip() for w in raw if 
 'SYNC_PUSH:' in w]
              not_pushed = [[s[2]+s[3], int(s[-1]), s[-2]] for s in ns]
              ww = [(int(e.split(' at ')[0]), e.split(' at ')[1].split(' 
 in ')[0], int(e.split(' at ')[1].split(' in ')[1].split(' ms ')[0]), 
 set(e.split(' at ')[1].split(' in ')[1].split(' ms ')[1].split())) for e 
 in ss]
              pushed = [[w[0], w[1], w[2], 1 if 'PA-SOC_POP' in w[3] else 
 0, 1 if 'CU-SOC_POP' in w[3] else 0] for w in ww]
              return not_pushed, pushed
      except:
          return []
 
The code isn't entirely 1:1. Any usage of IO (includes stdout via writeln) is expensive. Your python code doesn't write anything to stdout (or perform any calls). It would also be good to get the results of dmd -release as well.
Jun 09 2017
next sibling parent Daniel Kozak via Digitalmars-d-learn <digitalmars-d-learn puremagic.com> writes:
I would considered using appender for pushed and npushed. Can you post file
on which you are running benchmarking?

On Fri, Jun 9, 2017 at 9:50 AM, rikki cattermole via Digitalmars-d-learn <
digitalmars-d-learn puremagic.com> wrote:

 On 09/06/2017 8:34 AM, uncorroded wrote:

 Hi guys,

 I am a beginner in D. As a project, I converted a log-parsing script in
 Python which we use at work, to D. This link was helpful - (
 https://dlang.org/blog/2017/05/24/faster-command-line-tools-in-d/ ) I
 compiled it with dmd and ldc. The log file is 52 MB. With dmd (not release
 build), it takes 1.1 sec and with ldc, it takes 0.3 sec.

 The Python script (run with system python, not Pypy) takes 0.75 sec. The
 D and Python functions are here and on pastebin ( D -
 https://pastebin.com/SeUR3wFP , Python - https://pastebin.com/F5JbfBmE ).

 Basically, i am reading a line, checking for 2 constants. If either one
 is found, some processing is done on line and stored to an array for later
 analysis. I tried reading the file entirely in one go using std.file :
 readText and using std.algorithm : splitter for lazily splitting newline
 but there is no difference in speed, so I used the byLine approach
 mentioned in the linked blog. Is there a better way of doing this in D?

 Note:
 I ran GC profiling as mentioned in linked blog. The results were:

 Number of collections:  3
      Total GC prep time:  0 milliseconds
      Total mark time:  0 milliseconds
      Total sweep time:  0 milliseconds
      Total page recovery time:  0 milliseconds
      Max Pause Time:  0 milliseconds
      Grand total GC time:  2 milliseconds
 GC summary:   12 MB,    3 GC    2 ms, Pauses    0 ms <    0 ms

 So GC does not seem to be an issue.

 Here's the D script:

 import std.stdio;
 import std.string;
 import std.array;
 import std.algorithm : splitter;
 import std.typecons : tuple, Tuple;
 import std.conv : to;

 void read_log(string filename) {
      File file = File(filename, "r");
      Tuple!(char[], int, char[])[] npushed;
      Tuple!(int, char[], int, bool, bool)[] pushed;
      foreach (line; file.byLine) {
          if (line.indexOf("SOC_NOT_PUSHED") != -1) {
              auto tarr = line.split();
              npushed ~= tuple(tarr[2] ~ tarr[3], to!int(tarr[$ - 1]),
 tarr[$ - 2]);
              continue;
          }
          if (line.indexOf("SYNC_PUSH:") != -1) {
              auto rel = line.split("SYNC_PUSH:")[1].strip();
              auto att = rel.split(" at ");
              auto ina = att[1].split(" in ");
              auto msa = ina[1].split(" ms ");
              pushed ~= tuple(to!int(att[0]), ina[0], to!int(msa[0]),
                      msa[1].indexOf("PA-SOC_POP") != -1,
 msa[1].indexOf("CU-SOC_POP") != -1);
          }
      }
      // Using the arrays later on in production script
      writeln(npushed.length);
      writeln(pushed.length);
 }


 Here is Python function:

 def read_log(fname):
      try:
          with open(fname, 'r') as f:
              raw = f.read().splitlines()
              ns = [s.split() for s in raw if 'SOC_NOT_PUSHED' in s]
              ss = [w.split("SYNC_PUSH:")[1].strip() for w in raw if
 'SYNC_PUSH:' in w]
              not_pushed = [[s[2]+s[3], int(s[-1]), s[-2]] for s in ns]
              ww = [(int(e.split(' at ')[0]), e.split(' at ')[1].split('
 in ')[0], int(e.split(' at ')[1].split(' in ')[1].split(' ms ')[0]),
 set(e.split(' at ')[1].split(' in ')[1].split(' ms ')[1].split())) for e in
 ss]
              pushed = [[w[0], w[1], w[2], 1 if 'PA-SOC_POP' in w[3] else
 0, 1 if 'CU-SOC_POP' in w[3] else 0] for w in ww]
              return not_pushed, pushed
      except:
          return []
The code isn't entirely 1:1. Any usage of IO (includes stdout via writeln) is expensive. Your python code doesn't write anything to stdout (or perform any calls). It would also be good to get the results of dmd -release as well.
Jun 09 2017
prev sibling parent Daniel Kozak via Digitalmars-d-learn <digitalmars-d-learn puremagic.com> writes:
Dne 9.6.2017 v 09:50 rikki cattermole via Digitalmars-d-learn napsal(a):

 On 09/06/2017 8:34 AM, uncorroded wrote:
 Hi guys,

 ...
The code isn't entirely 1:1. Any usage of IO (includes stdout via writeln) is expensive. Your python code doesn't write anything to stdout (or perform any calls). It would also be good to get the results of dmd -release as well.
Not in this case, he only write two numbers at the end, so I guess there will be no difference
Jun 09 2017
prev sibling parent reply Daniel Kozak via Digitalmars-d-learn <digitalmars-d-learn puremagic.com> writes:
On Fri, Jun 9, 2017 at 9:34 AM, uncorroded via Digitalmars-d-learn <
digitalmars-d-learn puremagic.com> wrote:

 Hi guys,

 I am a beginner in D. As a project, I converted a log-parsing script in
 Python which we use at work, to D. This link was helpful - (
 https://dlang.org/blog/2017/05/24/faster-command-line-tools-in-d/ ) I
 compiled it with dmd and ldc. The log file is 52 MB. With dmd (not release
 build), it takes 1.1 sec and with ldc, it takes 0.3 sec.

 The Python script (run with system python, not Pypy) takes 0.75 sec. The D
 and Python functions are here and on pastebin ( D -
 https://pastebin.com/SeUR3wFP , Python - https://pastebin.com/F5JbfBmE ).

 Basically, i am reading a line, checking for 2 constants. If either one is
 found, some processing is done on line and stored to an array for later
 analysis. I tried reading the file entirely in one go using std.file :
 readText and using std.algorithm : splitter for lazily splitting newline
 but there is no difference in speed, so I used the byLine approach
 mentioned in the linked blog. Is there a better way of doing this in D?

 There is no difference in speed because you do not process your data
lazily, so you make many allocations, so this is main reason why it is so slow. I could improve that, but I will need to see some example data, which you are trying to parse. But some rules, 1.) instead of ~= you shoud use std.array.appender 2.) instead of std.string.split you could use std.algorithm.splitter or std.algorithm.findSplit 3.) instead of indexOf I would use std.algorithm.startsWith (in case it is on the begining of the line)
Jun 09 2017
parent reply uncorroded <uncorroded temp.org> writes:
On Friday, 9 June 2017 at 08:58:38 UTC, Daniel Kozak wrote:

 There is no difference in speed because you do not process 
 your data
lazily, so you make many allocations, so this is main reason why it is so slow. I could improve that, but I will need to see some example data, which you are trying to parse. But some rules, 1.) instead of ~= you shoud use std.array.appender 2.) instead of std.string.split you could use std.algorithm.splitter or std.algorithm.findSplit 3.) instead of indexOf I would use std.algorithm.startsWith (in case it is on the begining of the line)
Thanks everyone for the tips. The log file itself is 52 MB but I have added a sample in pastebin ( https://pastebin.com/vj778PK4 ). Will try the suggestions today evening.
Jun 09 2017
parent reply Daniel Kozak via Digitalmars-d-learn <digitalmars-d-learn puremagic.com> writes:
import std.stdio;
import std.array: appender, array;
import std.algorithm : findSplit, splitter, joiner, canFind, map;
import std.typecons : tuple, Tuple;
import std.conv : to;
import std.range : dropOne, dropExactly, takeExactly, chain;

alias push_type = Tuple!(int, char[], int, bool, bool);
alias npush_type = Tuple!(char[], int, char[]);

void read_log(string filename) {
    File file = File(filename, "r");
    auto npushed = appender!(npush_type[])();
    auto pushed = appender!(push_type[])();
    foreach (line; file.byLine) {
        if (auto findResult = line.findSplit(" SYNC_PUSH: ")) {
            auto rel = findResult[2];
            auto att = rel.splitter(" ");

            auto firstVal = att.front.to!int;
            auto secondVal = att.dropExactly(2).takeExactly(2).joiner("
").to!(char[]).dup;
            auto thirdVal = att.dropExactly(5).front.to!int;
            auto fourthVal = findResult[2].canFind("PA-SOC_POP");
            auto fifthVal = findResult[2].canFind("CU-SOC_POP");
            pushed.put(tuple(firstVal, secondVal, thirdVal, fourthVal,
fifthVal));
            continue;
        }
        if (auto findResult = line.findSplit(" SOC_NOT_PUSHED: ")) {
            auto leftPart = findResult[0].splitter(" ").dropExactly(2)
                                                       .takeExactly(2);
            auto rightPart = findResult[2].splitter(" ").takeExactly(2);
            auto firstVal = chain(leftPart.front,
leftPart.dropOne.front).to!(char[]);
            auto thirdVal = rightPart.front.to!(char[]).dup;
            auto secondVal = rightPart.dropOne.front.to!int;
            npushed.put(tuple(firstVal, secondVal, thirdVal));
            continue;
        }
    }
    // Doing more stuff with these arrays later. For now, just printing
lengths
    writeln(npushed.data.length);
    writeln(pushed.data.length);
}

On Fri, Jun 9, 2017 at 12:01 PM, uncorroded via Digitalmars-d-learn <
digitalmars-d-learn puremagic.com> wrote:

 On Friday, 9 June 2017 at 08:58:38 UTC, Daniel Kozak wrote:

 There is no difference in speed because you do not process your data

 lazily, so you make many allocations, so this is main reason why it is so
 slow. I could improve that, but I will need to see some example data, which
 you are trying to parse.

 But some rules,
 1.) instead of ~= you shoud use std.array.appender
 2.) instead of std.string.split you could use std.algorithm.splitter or
 std.algorithm.findSplit
 3.) instead of indexOf I would use std.algorithm.startsWith (in case it is
 on the begining of the line)
Thanks everyone for the tips. The log file itself is 52 MB but I have added a sample in pastebin ( https://pastebin.com/vj778PK4 ). Will try the suggestions today evening.
Jun 09 2017
parent reply uncorroded <uncorroded temp.org> writes:
On Friday, 9 June 2017 at 14:19:48 UTC, Daniel Kozak wrote:
 import std.stdio;
 import std.array: appender, array;
 import std.algorithm : findSplit, splitter, joiner, canFind, 
 map;
 import std.typecons : tuple, Tuple;
 import std.conv : to;
 import std.range : dropOne, dropExactly, takeExactly, chain;

 alias push_type = Tuple!(int, char[], int, bool, bool);
 alias npush_type = Tuple!(char[], int, char[]);

 void read_log(string filename) {
     File file = File(filename, "r");
     auto npushed = appender!(npush_type[])();
     auto pushed = appender!(push_type[])();
     foreach (line; file.byLine) {
         if (auto findResult = line.findSplit(" SYNC_PUSH: ")) {
             auto rel = findResult[2];
             auto att = rel.splitter(" ");

             auto firstVal = att.front.to!int;
             auto secondVal = 
 att.dropExactly(2).takeExactly(2).joiner("
 ").to!(char[]).dup;
             auto thirdVal = att.dropExactly(5).front.to!int;
             auto fourthVal = 
 findResult[2].canFind("PA-SOC_POP");
             auto fifthVal = findResult[2].canFind("CU-SOC_POP");
             pushed.put(tuple(firstVal, secondVal, thirdVal, 
 fourthVal,
 fifthVal));
             continue;
         }
         if (auto findResult = line.findSplit(" SOC_NOT_PUSHED: 
 ")) {
             auto leftPart = findResult[0].splitter(" 
 ").dropExactly(2)
                                                        
 .takeExactly(2);
             auto rightPart = findResult[2].splitter(" 
 ").takeExactly(2);
             auto firstVal = chain(leftPart.front,
 leftPart.dropOne.front).to!(char[]);
             auto thirdVal = rightPart.front.to!(char[]).dup;
             auto secondVal = rightPart.dropOne.front.to!int;
             npushed.put(tuple(firstVal, secondVal, thirdVal));
             continue;
         }
     }
     // Doing more stuff with these arrays later. For now, just 
 printing
 lengths
     writeln(npushed.data.length);
     writeln(pushed.data.length);
 }
Hi Daniel, Thanks a lot for the code. I tested it on our production log file and it takes 0.2 sec (50% improvement over old time) I tried using just the appender and it did not make a significant difference. On the other hand, just changing the inner loop to use std.algorithm and range seems to make a big difference.Is there a good resource to read about the good stuff in std.algorithm and range? I tried going through the library docs but they are too exhaustive! Thanks :)
Jun 09 2017
next sibling parent "H. S. Teoh via Digitalmars-d-learn" <digitalmars-d-learn puremagic.com> writes:
On Fri, Jun 09, 2017 at 04:28:08PM +0000, uncorroded via Digitalmars-d-learn
wrote:
[...]
 Is there a good resource to read about the good stuff in std.algorithm
 and range? I tried going through the library docs but they are too
 exhaustive!
[...] Try these: http://ddili.org/ders/d.en/ranges.html http://www.informit.com/articles/article.aspx?p=1407357 T -- Gone Chopin. Bach in a minuet.
Jun 09 2017
prev sibling parent Daniel Kozak via Digitalmars-d-learn <digitalmars-d-learn puremagic.com> writes:
btw it is important to use right compiler with right flags I am using ldc
with this flags
ldmd2 -O -release -boundscheck=off

On Fri, Jun 9, 2017 at 6:28 PM, uncorroded via Digitalmars-d-learn <
digitalmars-d-learn puremagic.com> wrote:

 On Friday, 9 June 2017 at 14:19:48 UTC, Daniel Kozak wrote:

 import std.stdio;
 import std.array: appender, array;
 import std.algorithm : findSplit, splitter, joiner, canFind, map;
 import std.typecons : tuple, Tuple;
 import std.conv : to;
 import std.range : dropOne, dropExactly, takeExactly, chain;

 alias push_type = Tuple!(int, char[], int, bool, bool);
 alias npush_type = Tuple!(char[], int, char[]);

 void read_log(string filename) {
     File file = File(filename, "r");
     auto npushed = appender!(npush_type[])();
     auto pushed = appender!(push_type[])();
     foreach (line; file.byLine) {
         if (auto findResult = line.findSplit(" SYNC_PUSH: ")) {
             auto rel = findResult[2];
             auto att = rel.splitter(" ");

             auto firstVal = att.front.to!int;
             auto secondVal = att.dropExactly(2).takeExactly(2).joiner("
 ").to!(char[]).dup;
             auto thirdVal = att.dropExactly(5).front.to!int;
             auto fourthVal = findResult[2].canFind("PA-SOC_POP");
             auto fifthVal = findResult[2].canFind("CU-SOC_POP");
             pushed.put(tuple(firstVal, secondVal, thirdVal, fourthVal,
 fifthVal));
             continue;
         }
         if (auto findResult = line.findSplit(" SOC_NOT_PUSHED: ")) {
             auto leftPart = findResult[0].splitter(" ").dropExactly(2)
                                                        .takeExactly(2);
             auto rightPart = findResult[2].splitter(" ").takeExactly(2);
             auto firstVal = chain(leftPart.front,
 leftPart.dropOne.front).to!(char[]);
             auto thirdVal = rightPart.front.to!(char[]).dup;
             auto secondVal = rightPart.dropOne.front.to!int;
             npushed.put(tuple(firstVal, secondVal, thirdVal));
             continue;
         }
     }
     // Doing more stuff with these arrays later. For now, just printing
 lengths
     writeln(npushed.data.length);
     writeln(pushed.data.length);
 }
Hi Daniel, Thanks a lot for the code. I tested it on our production log file and it takes 0.2 sec (50% improvement over old time) I tried using just the appender and it did not make a significant difference. On the other hand, just changing the inner loop to use std.algorithm and range seems to make a big difference.Is there a good resource to read about the good stuff in std.algorithm and range? I tried going through the library docs but they are too exhaustive! Thanks :)
Jun 09 2017