digitalmars.D.learn - Speeding up text file parser (BLAST tabular format)

Fredrik Boulund (49/49) Sep 14 2015 Hi,

Fredrik Boulund (13/14) Sep 14 2015 Example output might be useful for you to see as well:
Edwin van Leeuwen (8/14) Sep 14 2015 Sounds like this program is actually IO bound. In that case I

Fredrik Boulund (12/18) Sep 14 2015 Hi Edwin, thanks for your quick reply!

Edwin van Leeuwen (10/30) Sep 14 2015 Two things that you could try:

Fredrik Boulund (19/27) Sep 14 2015 I didn't know that hitlists.byKey was that expensive, that's just

Laeeth Isharc (19/49) Sep 14 2015 I picked up D to start learning maybe a couple of years ago. I

Fredrik Boulund (10/27) Sep 14 2015 I saw the dlangscience project on GitHub the other day. I've yet

Andrea Fontana (8/9) Sep 14 2015 Also if problem probabily is i/o related, have you tried with:

Andrea Fontana (3/12) Sep 14 2015 s/also/even
John Colvin (5/14) Sep 14 2015 -inline in particular is likely to have a strong impact here

Fredrik Boulund (6/22) Sep 14 2015 Why would -inline be particularly likely to make a big difference

John Colvin (11/36) Sep 14 2015 Range-based code like you are using leads to *huge* numbers of

Fredrik Boulund (7/16) Sep 14 2015 Thanks for that explanation! Now that you mention it it makes

H. S. Teoh via Digitalmars-d-learn (13/31) Sep 14 2015 If performance is a problem, the first thing I'd recommend is to use a

Fredrik Boulund (12/21) Sep 14 2015 I agree with you on that. I used Python's cProfile module to find

Edwin van Leeuwen (8/19) Sep 14 2015 See this link for clarification on what the columns/numbers in

H. S. Teoh via Digitalmars-d-learn (8/23) Sep 14 2015 Dmd's profiler has some limitations, especially if you're doing
Fredrik Boulund (6/12) Sep 15 2015 Thanks for the link. I read up on what everything means, but I

Fredrik Boulund (13/22) Sep 14 2015 Thanks for the suggestions! I'm not too familiar with compiled

John Colvin (4/18) Sep 14 2015 what system are you on? What are the error messages you are

Fredrik Boulund (26/28) Sep 14 2015 I really appreciate your will to try to help me out. This is what

John Colvin (4/33) Sep 14 2015 Yup, glibc is too old for those binaries.

Fredrik Boulund (4/6) Sep 14 2015 It says "ldd (GNU libc) 2.12". Hmm... The most recent version in

John Colvin (8/16) Sep 14 2015 I've had nothing but trouble when using different versions of

Fredrik Boulund (10/16) Sep 15 2015 Thanks for the offer, but don't go out of your way for my sake.

John Colvin (4/16) Sep 15 2015 try this:

Fredrik Boulund (21/23) Sep 15 2015 Nope, :(

John Colvin (3/8) Sep 15 2015 Oh well, worth a try I guess.
Andrew Brown (12/38) Sep 15 2015 I had some luck building a local copy of llvm in my home

Fredrik Boulund (7/17) Sep 15 2015 Thanks for your suggestion. I'm amazed by the amount of effort

Rikki Cattermole (35/81) Sep 14 2015 A lot of this hasn't been covered I believe.

John Colvin (5/11) Sep 14 2015 Not a good idea in general. Much better to ask for a string in
NX (5/7) Sep 14 2015 I believe that should be:

H. S. Teoh via Digitalmars-d-learn (121/121) Sep 14 2015 I decided to give the code a spin with `gdc -O3 -pg`. Turns out that the

Kapps (3/8) Sep 14 2015 Perhaps using the new rangified splitter instead of split would

H. S. Teoh via Digitalmars-d-learn (5/13) Sep 14 2015 I tried it. It was slower, surprisingly. I didn't dig deeper into why.

Fredrik Boulund (5/14) Sep 15 2015 Interesting, I'll have to go through your code to understand

H. S. Teoh via Digitalmars-d-learn (14/28) Sep 15 2015 It's very simple, actually. Basically you just call GC.disable() at the

Rikki Cattermole (2/9) Sep 14 2015 Probably.

Fredrik Boulund (13/55) Sep 15 2015 Thanks for your suggestions! That sure is a lot of details. I'll

Kagamin (4/8) Sep 15 2015 No, casting to string would be incorrect here because the line

Rikki Cattermole (2/8) Sep 15 2015 I made the assumption it wasn't allocating. Ehhh.

CraigDillabaugh (18/34) Sep 14 2015 I am going to go off the beaten path here. If you really want

John Colvin (5/15) Sep 14 2015 my favourite for streaming a file:

Fredrik Boulund (3/20) Sep 15 2015 Is this an efficient way of reading this type of file? What

Kagamin (4/9) Sep 15 2015 It provides you only one char at a time instead of a whole line.

John Colvin (5/16) Sep 15 2015 reasonably efficient, yes. See http://stackoverflow.com/a/237495

Kagamin (2/7) Sep 15 2015 lineSplitter doesn't work with input ranges.

John Colvin (2/12) Sep 15 2015 Ugh

Fredrik Boulund <fredrik.boulund gmail.com> writes:

Hi,

This is my first post on Dlang forums and I don't have a lot of 
experience with D (yet). I mainly code bioinformatics-stuff in 
Python on my day-to-day job, but I've been toying with D for a 
couple of years now. I had this idea that it'd be fun to write a 
parser for a text-based tabular data format I tend to read a lot 
of in my programs, but I was a bit stomped that the D 
implementation I created was slower than my Python-version. I 
tried running `dmd -profile` on it but didn't really understand 
what I can do to make it go faster. I guess there's some 
unnecessary dynamic array extensions being made but I can't 
figure out how to do without them, maybe someone can help me out? 
I tried making the examples as small as possible.

Here's the code D code: http://dpaste.com/2HP0ZVA
Here's my Python code for comparison: http://dpaste.com/0MPBK67

Using a small test file (~550 MB) on my machine (2x Xeon(R) CPU 
E5-2670 with RAID6 SAS disks and 192GB of RAM), the D version 
runs in about 20 seconds and the Python version less than 16 
seconds. I've repeated runs at least thrice when testing. This 
holds true even if the D version is compiled with -O.

The file being parsed is the output of a DNA/protein sequence 
mapping algorithm called BLAT, but the tabular output format is 
originally known from the famous BLAST algorithm.
Here's a short example of what the input files looks like: 
http://dpaste.com/017N58F
The format is TAB-delimited: query, target, percent_identity, 
alignment_length, mismatches, gaps, query_start, query_end, 
target_start, target_end, e-value, bitscore
In the example the output is sorted by query, but this cannot be 
assumed to hold true for all cases. The input file varies in 
range from several hundred megabytes to several gigabytes (10+ 
GiB).

A brief explanation on what the code does:
Parse each line,
Only accept records with percent_identity >= min_identity (90.0) 
and alignment_length >= min_matches (10),
Store all such records as tuples (in the D code this is a struct) 
in an array in an associative array indexed by 'query',
For each query, remove any records with percent_id less than 5 
percentage points less than the highest value observed for that 
query,
Write results to stdout (in my real code the data is subject to 
further downstream processing)

This was all just for me learning to do some basic stuff in D, 
e.g. file handling, streaming data from disk, etc. I'm really 
curious what I can do to improve the D code. My original idea was 
that maybe I should compile the performance critical parts of my 
Python codebase to D and call them with PyD or something, but not 
I'm not so sure any more. Help and suggestions appreciated!

Sep 14 2015

Fredrik Boulund <fredrik.boulund gmail.com> writes:

On Monday, 14 September 2015 at 12:30:21 UTC, Fredrik Boulund 
wrote:
 [...]

Example output might be useful for you to see as well:

10009.1.1:5.2e-02_13: 16
10014.1.1:2.9e-03_11: 44
10017.1.1:4.1e-02_13: 16
10026.1.1:5.8e-03_12: 27
10027.1.1:6.6e-04_13: 16
10060.1.1:2.7e-03_14: 2
10061.1.1:5.1e-07_13: 41


Worth noticing is that it is essentially impossible to predict 
how many "hits"/records there are for each query, it varies 
wildly from 0 to 1000+ in some cases.

Sep 14 2015

Edwin van Leeuwen <edder tkwsping.nl> writes:

On Monday, 14 September 2015 at 12:30:21 UTC, Fredrik Boulund 
wrote:
 Hi,

 Using a small test file (~550 MB) on my machine (2x Xeon(R) CPU 
 E5-2670 with RAID6 SAS disks and 192GB of RAM), the D version 
 runs in about 20 seconds and the Python version less than 16 
 seconds. I've repeated runs at least thrice when testing. This 
 holds true even if the D version is compiled with -O.

Sounds like this program is actually IO bound. In that case I 
would not expect a really expect an improvement by using D. What 
is the CPU usage like when you run this program?

Also which dmd version are you using. I think there were some 
performance improvements for file reading in the latest version 
(2.068)

Sep 14 2015

Fredrik Boulund <fredrik.boulund gmail.com> writes:

On Monday, 14 September 2015 at 12:44:22 UTC, Edwin van Leeuwen 
wrote:
 Sounds like this program is actually IO bound. In that case I 
 would not expect a really expect an improvement by using D. 
 What is the CPU usage like when you run this program?

 Also which dmd version are you using. I think there were some 
 performance improvements for file reading in the latest version 
 (2.068)

Hi Edwin, thanks for your quick reply!

I'm using v2.068.1; I actually got inspired to try this out after 
skimming the changelog :).

Regarding if it is IO-bound. I actually expected it would be, but 
both the Python and the D-version consume 100% CPU while running, 
and just copying the file around only takes a few seconds (cf 
15-20 sec in runtime for the two programs). There's bound to be 
some aggressive file caching going on, but I figure that would 
rather normalize program runtimes at lower times after running 
them a few times, but I see nothing indicating that.

Sep 14 2015

Edwin van Leeuwen <edder tkwsping.nl> writes:

On Monday, 14 September 2015 at 12:50:03 UTC, Fredrik Boulund 
wrote:
 On Monday, 14 September 2015 at 12:44:22 UTC, Edwin van Leeuwen 
 wrote:
 Sounds like this program is actually IO bound. In that case I 
 would not expect a really expect an improvement by using D. 
 What is the CPU usage like when you run this program?

 Also which dmd version are you using. I think there were some 
 performance improvements for file reading in the latest 
 version (2.068)

 Hi Edwin, thanks for your quick reply!

 I'm using v2.068.1; I actually got inspired to try this out 
 after skimming the changelog :).

 Regarding if it is IO-bound. I actually expected it would be, 
 but both the Python and the D-version consume 100% CPU while 
 running, and just copying the file around only takes a few 
 seconds (cf 15-20 sec in runtime for the two programs). There's 
 bound to be some aggressive file caching going on, but I figure 
 that would rather normalize program runtimes at lower times 
 after running them a few times, but I see nothing indicating 
 that.

Two things that you could try:

First hitlists.byKey can be expensive (especially if hitlists is 
big). Instead use:

foreach( key, value ; hitlists )

Also the filter.array.length is quite expensive. You could use 
count instead.
import std.algorithm : count;
value.count!(h => h.pid >= (max_pid - max_pid_diff));

Sep 14 2015

Fredrik Boulund <fredrik.boulund gmail.com> writes:

On Monday, 14 September 2015 at 13:10:50 UTC, Edwin van Leeuwen 
wrote:
 Two things that you could try:

 First hitlists.byKey can be expensive (especially if hitlists 
 is big). Instead use:

 foreach( key, value ; hitlists )

 Also the filter.array.length is quite expensive. You could use 
 count instead.
 import std.algorithm : count;
 value.count!(h => h.pid >= (max_pid - max_pid_diff));

I didn't know that hitlists.byKey was that expensive, that's just 
the kind of feedback I was hoping for. I'm just grasping for 
straws in the online documentation when I want to do things. With 
my Python background it feels as if I can still get things that 
work that way.

I realize the filter.array.length thing is indeed expensive. I 
find it especially horrendous that the code I've written needs to 
allocate a big dynamic array that will most likely be cut down 
quite drastically in this step. Unfortunately I haven't figured 
out a good way to do this without storing the intermediary 
results since I cannot know if there might be yet another hit for 
any encountered "query" since the input file might not be sorted. 
But the main reason I didn't just count the values like you 
suggest is actually that I need the filtered hits in later 
downstream analysis. The filtered hits for each query are used as 
input to a lowest common ancestor algorithm on the taxonomic tree 
(of life).

Sep 14 2015

Laeeth Isharc <spamnolaeeth nospamlaeeth.com> writes:

On Monday, 14 September 2015 at 13:55:50 UTC, Fredrik Boulund 
wrote:
 On Monday, 14 September 2015 at 13:10:50 UTC, Edwin van Leeuwen 
 wrote:
 Two things that you could try:

 First hitlists.byKey can be expensive (especially if hitlists 
 is big). Instead use:

 foreach( key, value ; hitlists )

 Also the filter.array.length is quite expensive. You could use 
 count instead.
 import std.algorithm : count;
 value.count!(h => h.pid >= (max_pid - max_pid_diff));

 I didn't know that hitlists.byKey was that expensive, that's 
 just the kind of feedback I was hoping for. I'm just grasping 
 for straws in the online documentation when I want to do 
 things. With my Python background it feels as if I can still 
 get things that work that way.

I picked up D to start learning maybe a couple of years ago.  I 
found Ali's book, Andrei's book, github source code (including 
for Phobos), and asking here to be the best resources.  The docs 
make perfect sense when you have got to a certain level (or 
perhaps if you have a computer sciencey background), but can be 
tough before that (though they are getting better).

You should definitely take a look at the dlangscience project 
organized by John Colvin and others.

If you like ipython/jupyter also see his pydmagic - write D 
inline in a notebook.

You may find this series of posts interesting too - another 
bioinformatics guy migrating from Python:
http://forum.dlang.org/post/akzdstfiwwzfeoudhshg forum.dlang.org

 I realize the filter.array.length thing is indeed expensive. I 
 find it especially horrendous that the code I've written needs 
 to allocate a big dynamic array that will most likely be cut 
 down quite drastically in this step. Unfortunately I haven't 
 figured out a good way to do this without storing the 
 intermediary results since I cannot know if there might be yet 
 another hit for any encountered "query" since the input file 
 might not be sorted. But the main reason I didn't just count 
 the values like you suggest is actually that I need the 
 filtered hits in later downstream analysis. The filtered hits 
 for each query are used as input to a lowest common ancestor 
 algorithm on the taxonomic tree (of life).

Unfortunately I haven't time to read your code, and others will 
do better.  But do you use .reserve() ?  Also these are a nice 
fast container library based on Andrei Alexandrescu's allocator:

https://github.com/economicmodeling/containers

Sep 14 2015

Fredrik Boulund <fredrik.boulund gmail.com> writes:

On Monday, 14 September 2015 at 14:15:25 UTC, Laeeth Isharc wrote:
 I picked up D to start learning maybe a couple of years ago.  I 
 found Ali's book, Andrei's book, github source code (including 
 for Phobos), and asking here to be the best resources.  The 
 docs make perfect sense when you have got to a certain level 
 (or perhaps if you have a computer sciencey background), but 
 can be tough before that (though they are getting better).

 You should definitely take a look at the dlangscience project 
 organized by John Colvin and others.

 If you like ipython/jupyter also see his pydmagic - write D 
 inline in a notebook.

I saw the dlangscience project on GitHub the other day. I've yet 
to venture deeper. The inlining of D in jupyter notebooks sure is 
cool, but I'm not sure it's very useful for me, Python feels more 
succinct for notebook use. Still, I really appreciate the effort 
put into that, it's really cool!

 You may find this series of posts interesting too - another 
 bioinformatics guy migrating from Python:
 http://forum.dlang.org/post/akzdstfiwwzfeoudhshg forum.dlang.org

I'll have a look at that series of posts, thanks for the heads-up!

 Unfortunately I haven't time to read your code, and others will 
 do better.  But do you use .reserve() ?  Also these are a nice 
 fast container library based on Andrei Alexandrescu's allocator:

 https://github.com/economicmodeling/containers

Not familiar with .reserve(), nor Andrei's allocator library. 
I'll put that in the stuff-to-read-about-queue for now. :) Thanks 
for your tips!

Sep 14 2015

Andrea Fontana <nospam example.com> writes:

On Monday, 14 September 2015 at 12:30:21 UTC, Fredrik Boulund 
wrote:
 [...]

Also if problem probabily is i/o related, have you tried with:
-O -inline -release -noboundscheck
?

Anyway I think it's a good idea to test it against gdc and ldc 
that are known to generate faster executables.

Andrea

Sep 14 2015

Andrea Fontana <nospam example.com> writes:

On Monday, 14 September 2015 at 13:05:32 UTC, Andrea Fontana 
wrote:
 On Monday, 14 September 2015 at 12:30:21 UTC, Fredrik Boulund 
 wrote:
 [...]

 Also if problem probabily is i/o related, have you tried with:
 -O -inline -release -noboundscheck
 ?

 Anyway I think it's a good idea to test it against gdc and ldc 
 that are known to generate faster executables.

 Andrea

s/also/even

Sep 14 2015

John Colvin <john.loughran.colvin gmail.com> writes:

On Monday, 14 September 2015 at 13:05:32 UTC, Andrea Fontana 
wrote:
 On Monday, 14 September 2015 at 12:30:21 UTC, Fredrik Boulund 
 wrote:
 [...]

 Also if problem probabily is i/o related, have you tried with:
 -O -inline -release -noboundscheck
 ?

-inline in particular is likely to have a strong impact here

 Anyway I think it's a good idea to test it against gdc and ldc 
 that are known to generate faster executables.

 Andrea

+1 I would expect ldc or gdc to strongly outperform dmd on this 
code.

Sep 14 2015

Fredrik Boulund <fredrik.boulund gmail.com> writes:

On Monday, 14 September 2015 at 13:37:18 UTC, John Colvin wrote:
 On Monday, 14 September 2015 at 13:05:32 UTC, Andrea Fontana 
 wrote:
 On Monday, 14 September 2015 at 12:30:21 UTC, Fredrik Boulund 
 wrote:
 [...]

 Also if problem probabily is i/o related, have you tried with:
 -O -inline -release -noboundscheck
 ?

 -inline in particular is likely to have a strong impact here

Why would -inline be particularly likely to make a big difference 
in this case? I'm trying to learn, but I don't see what inlining 
could be done in this case.

 Anyway I think it's a good idea to test it against gdc and ldc 
 that are known to generate faster executables.

 Andrea

 +1 I would expect ldc or gdc to strongly outperform dmd on this 
 code.

Why is that? I would love to learn to understand why they could 
be expected to perform much better on this kind of code.

Sep 14 2015

John Colvin <john.loughran.colvin gmail.com> writes:

On Monday, 14 September 2015 at 13:58:33 UTC, Fredrik Boulund 
wrote:
 On Monday, 14 September 2015 at 13:37:18 UTC, John Colvin wrote:
 On Monday, 14 September 2015 at 13:05:32 UTC, Andrea Fontana 
 wrote:
 On Monday, 14 September 2015 at 12:30:21 UTC, Fredrik Boulund 
 wrote:
 [...]

 Also if problem probabily is i/o related, have you tried with:
 -O -inline -release -noboundscheck
 ?

 -inline in particular is likely to have a strong impact here

 Why would -inline be particularly likely to make a big 
 difference in this case? I'm trying to learn, but I don't see 
 what inlining could be done in this case.

Range-based code like you are using leads to *huge* numbers of 
function calls to get anything done. The advantage of inlining is 
twofold: 1) you don't have to pay the cost of the function call 
itself and 2) often more optimisation can be done once a function 
is inlined.

 Anyway I think it's a good idea to test it against gdc and 
 ldc that are known to generate faster executables.

 Andrea

 +1 I would expect ldc or gdc to strongly outperform dmd on 
 this code.

 Why is that? I would love to learn to understand why they could 
 be expected to perform much better on this kind of code.

Because there are much better at inlining. dmd is quick to 
compile your code and is most up-to-date, but ldc and gdc will 
produce somewhat faster code in almost all cases, sometimes very 
dramatically much faster.

Sep 14 2015

Fredrik Boulund <fredrik.boulund gmail.com> writes:

On Monday, 14 September 2015 at 14:18:58 UTC, John Colvin wrote:
 Range-based code like you are using leads to *huge* numbers of 
 function calls to get anything done. The advantage of inlining 
 is twofold: 1) you don't have to pay the cost of the function 
 call itself and 2) often more optimisation can be done once a 
 function is inlined.

Thanks for that explanation! Now that you mention it it makes 
perfect sense. I never considered it, but of course *huge* 
numbers of function calls to e.g. next() and other range-methods 
will be made.

 Because there are much better at inlining. dmd is quick to 
 compile your code and is most up-to-date, but ldc and gdc will 
 produce somewhat faster code in almost all cases, sometimes 
 very dramatically much faster.

Sure sounds like I could have more fun with LDC and GDC on my 
system in addition to DMD :).

Sep 14 2015

"H. S. Teoh via Digitalmars-d-learn" <digitalmars-d-learn puremagic.com> writes:

On Mon, Sep 14, 2015 at 02:34:41PM +0000, Fredrik Boulund via
Digitalmars-d-learn wrote:
 On Monday, 14 September 2015 at 14:18:58 UTC, John Colvin wrote:
Range-based code like you are using leads to *huge* numbers of
function calls to get anything done. The advantage of inlining is
twofold: 1) you don't have to pay the cost of the function call
itself and 2) often more optimisation can be done once a function is
inlined.

 
 Thanks for that explanation! Now that you mention it it makes perfect
 sense.  I never considered it, but of course *huge* numbers of
 function calls to e.g. next() and other range-methods will be made.
 
Because there are much better at inlining. dmd is quick to compile
your code and is most up-to-date, but ldc and gdc will produce
somewhat faster code in almost all cases, sometimes very dramatically
much faster.

 
 Sure sounds like I could have more fun with LDC and GDC on my system
 in addition to DMD :).

If performance is a problem, the first thing I'd recommend is to use a
profiler to find out where the hotspots are. (More often than not, I
have found that the hotspots are not where I expected them to be;
sometimes a 1-line change to an unanticipated hotspot can result in a
huge performance boost.)

The next thing I'd try is to use gdc instead of dmd. ;-)  IME, code
produced by `gdc -O3` is at least 20-30% faster than code produced by
`dmd -O -inline`. Sometimes the difference can be up to 40-50%,
depending on the kind of code you're compiling.


T

-- 
Lottery: tax on the stupid. -- Slashdotter

Sep 14 2015

Fredrik Boulund <fredrik.boulund gmail.com> writes:

On Monday, 14 September 2015 at 14:40:29 UTC, H. S. Teoh wrote:
 If performance is a problem, the first thing I'd recommend is 
 to use a profiler to find out where the hotspots are. (More 
 often than not, I have found that the hotspots are not where I 
 expected them to be; sometimes a 1-line change to an 
 unanticipated hotspot can result in a huge performance boost.)

I agree with you on that. I used Python's cProfile module to find 
the performance bottleneck in the Python version I posted, and 
shaved off 8-10 seconds of runtime on an extraneous str.split() I 
had missed.
I tried using the built-in profiler in DMD on the D program but 
to no avail. I couldn't really make any sense of the output other 
than that were enormous amounts of calls to lots of functions I 
couldn't find a way to remove from the code. Here's a paste of 
the trace output from the version I posted in the original post: 
http://dpaste.com/1AXPK9P

 The next thing I'd try is to use gdc instead of dmd. ;-)  IME, 
 code produced by `gdc -O3` is at least 20-30% faster than code 
 produced by `dmd -O -inline`. Sometimes the difference can be 
 up to 40-50%, depending on the kind of code you're compiling.

Yes, it really seems that gdc or ldc is the way to go.

Sep 14 2015

Edwin van Leeuwen <edder tkwsping.nl> writes:

On Monday, 14 September 2015 at 14:54:34 UTC, Fredrik Boulund 
wrote:
 On Monday, 14 September 2015 at 14:40:29 UTC, H. S. Teoh wrote:
 I agree with you on that. I used Python's cProfile module to 
 find the performance bottleneck in the Python version I posted, 
 and shaved off 8-10 seconds of runtime on an extraneous 
 str.split() I had missed.
 I tried using the built-in profiler in DMD on the D program but 
 to no avail. I couldn't really make any sense of the output 
 other than that were enormous amounts of calls to lots of 
 functions I couldn't find a way to remove from the code. Here's 
 a paste of the trace output from the version I posted in the 
 original post: http://dpaste.com/1AXPK9P

See this link for clarification on what the columns/numbers in 
the profile file mean
http://forum.dlang.org/post/f9gjmo$2gce$1 digitalmars.com

It is still difficult to parse though. I myself often use sysprof 
(only available on linux), which automatically ranks by time 
spent.

Sep 14 2015

"H. S. Teoh via Digitalmars-d-learn" <digitalmars-d-learn puremagic.com> writes:

On Mon, Sep 14, 2015 at 04:13:12PM +0000, Edwin van Leeuwen via
Digitalmars-d-learn wrote:
 On Monday, 14 September 2015 at 14:54:34 UTC, Fredrik Boulund wrote:
[...] I tried using the built-in profiler in DMD on the D program but
to no avail. I couldn't really make any sense of the output other
than that were enormous amounts of calls to lots of functions I
couldn't find a way to remove from the code. Here's a paste of the
trace output from the version I posted in the original post:
http://dpaste.com/1AXPK9P

 
 See this link for clarification on what the columns/numbers in the
 profile file mean
 http://forum.dlang.org/post/f9gjmo$2gce$1 digitalmars.com
 
 It is still difficult to parse though. I myself often use sysprof
 (only available on linux), which automatically ranks by time spent.

Dmd's profiler has some limitations, especially if you're doing
something that's CPU bound for a long time (its internal counters are
not wide enough and may overflow -- I have run into this before and it
made it unusable for me). I highly recommend using `gdc -pg` with gprof.


T

-- 
Only boring people get bored. -- JM

Sep 14 2015

Fredrik Boulund <fredrik.boulund gmail.com> writes:

On Monday, 14 September 2015 at 16:13:14 UTC, Edwin van Leeuwen 
wrote:
 See this link for clarification on what the columns/numbers in 
 the profile file mean
 http://forum.dlang.org/post/f9gjmo$2gce$1 digitalmars.com

 It is still difficult to parse though. I myself often use 
 sysprof (only available on linux), which automatically ranks by 
 time spent.

Thanks for the link. I read up on what everything means, but I 
think the problem isn't finding what consumes the most time, the 
problem is me not knowing the standard library well enough to 
translate the largest consumers to actual parts of my code :).

Sep 15 2015

Fredrik Boulund <fredrik.boulund gmail.com> writes:

On Monday, 14 September 2015 at 13:05:32 UTC, Andrea Fontana 
wrote:
 On Monday, 14 September 2015 at 12:30:21 UTC, Fredrik Boulund 
 wrote:
 [...]

 Also if problem probabily is i/o related, have you tried with:
 -O -inline -release -noboundscheck
 ?

 Anyway I think it's a good idea to test it against gdc and ldc 
 that are known to generate faster executables.

 Andrea

Thanks for the suggestions! I'm not too familiar with compiled 
languages like this, I've mainly written small programs in D and 
run them via `rdmd` in a scripting language fashion. I'll read up 
on what the different compile flags do (I knew about -O, but I'm 
not sure what the others do).

Unfortunately I cannot get LDC working on my system. It seems to 
fail finding some shared library when I download the binary 
released, and I can't figure out how to make it compile. I 
haven't really given GDC a try yet. I'll see what I can do.

Running the original D code I posted before with the flags you 
suggested reduced the runtime by about 2 seconds on average.

Sep 14 2015

John Colvin <john.loughran.colvin gmail.com> writes:

On Monday, 14 September 2015 at 13:50:22 UTC, Fredrik Boulund 
wrote:
 On Monday, 14 September 2015 at 13:05:32 UTC, Andrea Fontana 
 wrote:
 [...]

 Thanks for the suggestions! I'm not too familiar with compiled 
 languages like this, I've mainly written small programs in D 
 and run them via `rdmd` in a scripting language fashion. I'll 
 read up on what the different compile flags do (I knew about 
 -O, but I'm not sure what the others do).

 Unfortunately I cannot get LDC working on my system. It seems 
 to fail finding some shared library when I download the binary 
 released, and I can't figure out how to make it compile. I 
 haven't really given GDC a try yet. I'll see what I can do.

 Running the original D code I posted before with the flags you 
 suggested reduced the runtime by about 2 seconds on average.

what system are you on? What are the error messages you are 
getting?

Sep 14 2015

Fredrik Boulund <fredrik.boulund gmail.com> writes:

On Monday, 14 September 2015 at 14:14:18 UTC, John Colvin wrote:
 what system are you on? What are the error messages you are 
 getting?

I really appreciate your will to try to help me out. This is what 
ldd shows on the latest binary release of LDC on my machine. I'm 
on a Red Hat Enterprise Linux 6.6 system.

[boulund terra ~]$ ldd 
~/apps/ldc2-0.16.0-alpha2-linux-x86_64/bin/ldc2
/home/boulund/apps/ldc2-0.16.0-alpha2-linux-x86_64/bin/ldc2: 
/lib64/libc.so.6: version `GLIBC_2.14' not found (required by 
/home/boulund/apps/ldc2-0.16.0-alpha2-linux-x86_64/bin/ldc2)
/home/boulund/apps/ldc2-0.16.0-alpha2-linux-x86_64/bin/ldc2: 
/lib64/libc.so.6: version `GLIBC_2.15' not found (required by 
/home/boulund/apps/ldc2-0.16.0-alpha2-linux-x86_64/bin/ldc2)
         linux-vdso.so.1 =>  (0x00007fff623ff000)
         libconfig.so.8 => 
/home/boulund/apps/ldc2-0.16.0-alpha2-linux-x86_64/bin/libconfig.so.8
(0x00007f7f716e1000)
         libpthread.so.0 => /lib64/libpthread.so.0 
(0x00007f7f714a3000)
         libdl.so.2 => /lib64/libdl.so.2 (0x00007f7f7129f000)
         libstdc++.so.6 => /usr/lib64/libstdc++.so.6 
(0x00000032cde00000)
         libm.so.6 => /lib64/libm.so.6 (0x00007f7f7101a000)
         libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x00000032cca00000)
         libc.so.6 => /lib64/libc.so.6 (0x00007f7f70c86000)
         /lib64/ld-linux-x86-64.so.2 (0x00007f7f718ec000)

As you can see it lacks something related to GLIBC, but I'm not 
sure how to fix that.

Sep 14 2015

John Colvin <john.loughran.colvin gmail.com> writes:

On Monday, 14 September 2015 at 14:25:04 UTC, Fredrik Boulund 
wrote:
 On Monday, 14 September 2015 at 14:14:18 UTC, John Colvin wrote:
 what system are you on? What are the error messages you are 
 getting?

 I really appreciate your will to try to help me out. This is 
 what ldd shows on the latest binary release of LDC on my 
 machine. I'm on a Red Hat Enterprise Linux 6.6 system.

 [boulund terra ~]$ ldd 
 ~/apps/ldc2-0.16.0-alpha2-linux-x86_64/bin/ldc2
 /home/boulund/apps/ldc2-0.16.0-alpha2-linux-x86_64/bin/ldc2: 
 /lib64/libc.so.6: version `GLIBC_2.14' not found (required by 
 /home/boulund/apps/ldc2-0.16.0-alpha2-linux-x86_64/bin/ldc2)
 /home/boulund/apps/ldc2-0.16.0-alpha2-linux-x86_64/bin/ldc2: 
 /lib64/libc.so.6: version `GLIBC_2.15' not found (required by 
 /home/boulund/apps/ldc2-0.16.0-alpha2-linux-x86_64/bin/ldc2)
         linux-vdso.so.1 =>  (0x00007fff623ff000)
         libconfig.so.8 => 
 /home/boulund/apps/ldc2-0.16.0-alpha2-linux-x86_64/bin/libconfig.so.8
(0x00007f7f716e1000)
         libpthread.so.0 => /lib64/libpthread.so.0 
 (0x00007f7f714a3000)
         libdl.so.2 => /lib64/libdl.so.2 (0x00007f7f7129f000)
         libstdc++.so.6 => /usr/lib64/libstdc++.so.6 
 (0x00000032cde00000)
         libm.so.6 => /lib64/libm.so.6 (0x00007f7f7101a000)
         libgcc_s.so.1 => /lib64/libgcc_s.so.1 
 (0x00000032cca00000)
         libc.so.6 => /lib64/libc.so.6 (0x00007f7f70c86000)
         /lib64/ld-linux-x86-64.so.2 (0x00007f7f718ec000)

 As you can see it lacks something related to GLIBC, but I'm not 
 sure how to fix that.

Yup, glibc is too old for those binaries.

What does "ldd --version" say?

Sep 14 2015

Fredrik Boulund <fredrik.boulund gmail.com> writes:

On Monday, 14 September 2015 at 14:28:41 UTC, John Colvin wrote:
 Yup, glibc is too old for those binaries.

 What does "ldd --version" say?

It says "ldd (GNU libc) 2.12". Hmm... The most recent version in 
RHEL's repo is "2.12-1.166.el6_7.1", which is what is installed. 
Can this be side-loaded without too much hassle and manual effort?

Sep 14 2015

John Colvin <john.loughran.colvin gmail.com> writes:

On Monday, 14 September 2015 at 14:35:26 UTC, Fredrik Boulund 
wrote:
 On Monday, 14 September 2015 at 14:28:41 UTC, John Colvin wrote:
 Yup, glibc is too old for those binaries.

 What does "ldd --version" say?

 It says "ldd (GNU libc) 2.12". Hmm... The most recent version 
 in RHEL's repo is "2.12-1.166.el6_7.1", which is what is 
 installed. Can this be side-loaded without too much hassle and 
 manual effort?

I've had nothing but trouble when using different versions of 
libc. It would be easier to do this instead: 
http://wiki.dlang.org/Building_LDC_from_source

I'm running a build of LDC git HEAD right now on an old server 
with 2.11, I'll upload the result somewhere once it's done if it 
might be useful

Sep 14 2015

Fredrik Boulund <fredrik.boulund gmail.com> writes:

On Monday, 14 September 2015 at 15:04:12 UTC, John Colvin wrote:
 I've had nothing but trouble when using different versions of 
 libc. It would be easier to do this instead: 
 http://wiki.dlang.org/Building_LDC_from_source

 I'm running a build of LDC git HEAD right now on an old server 
 with 2.11, I'll upload the result somewhere once it's done if 
 it might be useful

Thanks for the offer, but don't go out of your way for my sake. 
Maybe I'll just build this in a clean environment instead of on 
my work computer to get rid of all the hassle. The Red Hat 
llvm-devel packages are broken, dependent on libffi-devel which 
is unavailable. Getting the build environment up to speed on my 
main machine would take me a lot more time than I have right now.

Tried building LDC from scratch but it fails because of missing 
LLVM components, despite having LLVM 3.4.2 installed (though 
lacking devel components).

Sep 15 2015

John Colvin <john.loughran.colvin gmail.com> writes:

On Tuesday, 15 September 2015 at 08:45:00 UTC, Fredrik Boulund 
wrote:
 On Monday, 14 September 2015 at 15:04:12 UTC, John Colvin wrote:
 [...]

 Thanks for the offer, but don't go out of your way for my sake. 
 Maybe I'll just build this in a clean environment instead of on 
 my work computer to get rid of all the hassle. The Red Hat 
 llvm-devel packages are broken, dependent on libffi-devel which 
 is unavailable. Getting the build environment up to speed on my 
 main machine would take me a lot more time than I have right 
 now.

 Tried building LDC from scratch but it fails because of missing 
 LLVM components, despite having LLVM 3.4.2 installed (though 
 lacking devel components).

try this: 
https://dlangscience.github.io/resources/ldc-0.16.0-a2_glibc2.11.3.tar.xz

Sep 15 2015

Fredrik Boulund <fredrik.boulund gmail.com> writes:

On Tuesday, 15 September 2015 at 10:01:30 UTC, John Colvin wrote:
 try this: 
 https://dlangscience.github.io/resources/ldc-0.16.0-a2_glibc2.11.3.tar.xz

Nope, :(

$ ldd ldc2
./ldc2: /usr/lib64/libstdc++.so.6: version `GLIBCXX_3.4.20' not 
found (required by ./ldc2)
         linux-vdso.so.1 =>  (0x00007fff2ffd8000)
         libpthread.so.0 => /lib64/libpthread.so.0 
(0x000000318a000000)
         libdl.so.2 => /lib64/libdl.so.2 (0x000000318a400000)
         libncurses.so.5 => /lib64/libncurses.so.5 
(0x000000319bc00000)
         librt.so.1 => /lib64/librt.so.1 (0x000000318a800000)
         libz.so.1 => /lib64/libz.so.1 (0x000000318ac00000)
         libstdc++.so.6 => /usr/lib64/libstdc++.so.6 
(0x000000318dc00000)
         libm.so.6 => /lib64/libm.so.6 (0x0000003189c00000)
         libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x000000318c000000)
         libc.so.6 => /lib64/libc.so.6 (0x0000003189800000)
         /lib64/ld-linux-x86-64.so.2 (0x0000003189400000)
         libtinfo.so.5 => /lib64/libtinfo.so.5 (0x0000003199000000)

Thanks for trying though!

Sep 15 2015

John Colvin <john.loughran.colvin gmail.com> writes:

On Tuesday, 15 September 2015 at 13:49:04 UTC, Fredrik Boulund 
wrote:
 On Tuesday, 15 September 2015 at 10:01:30 UTC, John Colvin 
 wrote:
 [...]

 Nope, :(

 [...]

Oh well, worth a try I guess.

Sep 15 2015

Andrew Brown <aabrown24 hotmail.com> writes:

I had some luck building a local copy of llvm in my home 
directory, using a linux version about as old as yours (llvm 3.5 
i used) specifying:

--configure --prefix=/home/andrew/llvm

so make install would install it somewhere I had permissions.

Then I changed the cmake command to:

cmake -L -DLLVM_CONFIG="/home/andrew/llvm/bin/llvm-config" ..

and I got a working install of ldc.

Make yourself a cup of tea while you wait though if you try it, 
llvm was about an hour and a half to compile.

On Tuesday, 15 September 2015 at 13:49:04 UTC, Fredrik Boulund 
wrote:
 On Tuesday, 15 September 2015 at 10:01:30 UTC, John Colvin 
 wrote:
 try this: 
 https://dlangscience.github.io/resources/ldc-0.16.0-a2_glibc2.11.3.tar.xz

 Nope, :(

 $ ldd ldc2
 ./ldc2: /usr/lib64/libstdc++.so.6: version `GLIBCXX_3.4.20' not 
 found (required by ./ldc2)
         linux-vdso.so.1 =>  (0x00007fff2ffd8000)
         libpthread.so.0 => /lib64/libpthread.so.0 
 (0x000000318a000000)
         libdl.so.2 => /lib64/libdl.so.2 (0x000000318a400000)
         libncurses.so.5 => /lib64/libncurses.so.5 
 (0x000000319bc00000)
         librt.so.1 => /lib64/librt.so.1 (0x000000318a800000)
         libz.so.1 => /lib64/libz.so.1 (0x000000318ac00000)
         libstdc++.so.6 => /usr/lib64/libstdc++.so.6 
 (0x000000318dc00000)
         libm.so.6 => /lib64/libm.so.6 (0x0000003189c00000)
         libgcc_s.so.1 => /lib64/libgcc_s.so.1 
 (0x000000318c000000)
         libc.so.6 => /lib64/libc.so.6 (0x0000003189800000)
         /lib64/ld-linux-x86-64.so.2 (0x0000003189400000)
         libtinfo.so.5 => /lib64/libtinfo.so.5 
 (0x0000003199000000)

 Thanks for trying though!

Sep 15 2015

Fredrik Boulund <fredrik.boulund gmail.com> writes:

On Tuesday, 15 September 2015 at 18:42:29 UTC, Andrew Brown wrote:
 I had some luck building a local copy of llvm in my home 
 directory, using a linux version about as old as yours (llvm 
 3.5 i used) specifying:

 --configure --prefix=/home/andrew/llvm

 so make install would install it somewhere I had permissions.

 Then I changed the cmake command to:

 cmake -L -DLLVM_CONFIG="/home/andrew/llvm/bin/llvm-config" ..

 and I got a working install of ldc.

 Make yourself a cup of tea while you wait though if you try it, 
 llvm was about an hour and a half to compile.

Thanks for your suggestion. I'm amazed by the amount of effort 
you guys put into helping me. Unfortunately the only precompiled 
version of libstdc++ available for the system in question via Red 
Hat repo's is 4.4.7, and compiling llvm from scratch requires at 
least 4.7. I'll be fine using DMD for now as I'm still learning 
more about D :).

Sep 15 2015

Rikki Cattermole <alphaglosined gmail.com> writes:

On 15/09/15 12:30 AM, Fredrik Boulund wrote:
 Hi,

 This is my first post on Dlang forums and I don't have a lot of
 experience with D (yet). I mainly code bioinformatics-stuff in Python on
 my day-to-day job, but I've been toying with D for a couple of years
 now. I had this idea that it'd be fun to write a parser for a text-based
 tabular data format I tend to read a lot of in my programs, but I was a
 bit stomped that the D implementation I created was slower than my
 Python-version. I tried running `dmd -profile` on it but didn't really
 understand what I can do to make it go faster. I guess there's some
 unnecessary dynamic array extensions being made but I can't figure out
 how to do without them, maybe someone can help me out? I tried making
 the examples as small as possible.

 Here's the code D code: http://dpaste.com/2HP0ZVA
 Here's my Python code for comparison: http://dpaste.com/0MPBK67

 Using a small test file (~550 MB) on my machine (2x Xeon(R) CPU E5-2670
 with RAID6 SAS disks and 192GB of RAM), the D version runs in about 20
 seconds and the Python version less than 16 seconds. I've repeated runs
 at least thrice when testing. This holds true even if the D version is
 compiled with -O.

 The file being parsed is the output of a DNA/protein sequence mapping
 algorithm called BLAT, but the tabular output format is originally known
 from the famous BLAST algorithm.
 Here's a short example of what the input files looks like:
 http://dpaste.com/017N58F
 The format is TAB-delimited: query, target, percent_identity,
 alignment_length, mismatches, gaps, query_start, query_end,
 target_start, target_end, e-value, bitscore
 In the example the output is sorted by query, but this cannot be assumed
 to hold true for all cases. The input file varies in range from several
 hundred megabytes to several gigabytes (10+ GiB).

 A brief explanation on what the code does:
 Parse each line,
 Only accept records with percent_identity >= min_identity (90.0) and
 alignment_length >= min_matches (10),
 Store all such records as tuples (in the D code this is a struct) in an
 array in an associative array indexed by 'query',
 For each query, remove any records with percent_id less than 5
 percentage points less than the highest value observed for that query,
 Write results to stdout (in my real code the data is subject to further
 downstream processing)

 This was all just for me learning to do some basic stuff in D, e.g. file
 handling, streaming data from disk, etc. I'm really curious what I can
 do to improve the D code. My original idea was that maybe I should
 compile the performance critical parts of my Python codebase to D and
 call them with PyD or something, but not I'm not so sure any more. Help
 and suggestions appreciated!

A lot of this hasn't been covered I believe.

http://dpaste.dzfl.pl/f7ab2915c3e1

1) You don't need to convert char[] to string via to. No. Too much. Cast it.
2) You don't need byKey, use foreach key, value syntax. That way you 
won't go around modifying things unnecessarily.

Ok, I disabled GC + reserved a bunch of memory. It probably won't help 
much actually. In fact may make it fail so keep that in mind.

Humm what else.

I'm worried about that first foreach. I don't think it needs to exist as 
it does. I believe an input range would be far better. Use a buffer to 
store the Hit[]'s. Have a subset per set of them.

If the first foreach is an input range, then things become slightly 
easier in the second. Now you can turn that into it's own input range.
Also that .array usage concerns me. Many an allocation there! Hence why 
the input range should be the return from it.

The last foreach, is lets assume dummy. Keep in mind, stdout is 
expensive here. DO NOT USE. If you must buffer output then do it large 
quantities.


Based upon what I can see, you are definitely not able to use your cpu's 
to the max. There is no way that is the limiting factor here. Maybe your 
usage of a core is. But not the cpu's itself.

The thing is, you cannot use multiple threads on that first foreach loop 
to speed things up. No. That needs to happen all on one thread.
Instead after that thread you need to push the result into another.

Perhaps, per thread one lock (mutex) + buffer for hits. Go round robin 
over all the threads. If mutex is still locked, you'll need to wait. In 
this situation a locked mutex means all you worker threads are working. 
So you can't do anything more (anyway).

Of course after all this, the HDD may still be getting hit too hard. In 
which case I would recommend you memory mapping it. Which should allow 
the OS to more efficiently handle reading it into memory. But you'll 
need to rework .byLine for that.


Wow that was a lot at 4:30am! So don't take it too seriously. I'm sure 
somebody else will rip that to shreds!

Sep 14 2015

John Colvin <john.loughran.colvin gmail.com> writes:

On Monday, 14 September 2015 at 16:33:23 UTC, Rikki Cattermole 
wrote:
 On 15/09/15 12:30 AM, Fredrik Boulund wrote:
[...]

 A lot of this hasn't been covered I believe.

 http://dpaste.dzfl.pl/f7ab2915c3e1

 1) You don't need to convert char[] to string via to. No. Too 
 much. Cast it.

Not a good idea in general. Much better to ask for a string in 
the first place by using byLine!(immutable(char), 
immutable(char)). Alternatively just use char[] throughout.

Sep 14 2015

NX <nightmarex1337 hotmail.com> writes:

On Monday, 14 September 2015 at 16:33:23 UTC, Rikki Cattermole 
wrote:
 A lot of this hasn't been covered I believe.

 http://dpaste.dzfl.pl/f7ab2915c3e1

I believe that should be:
foreach (query, ref value; hitlists)
Since an assignment happenin there..?

Sep 14 2015

"H. S. Teoh via Digitalmars-d-learn" <digitalmars-d-learn puremagic.com> writes:

I decided to give the code a spin with `gdc -O3 -pg`. Turns out that the
hotspot is in std.array.split, contrary to expectations. :-)  Here are
the first few lines of the gprof output:

-----snip-----
Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total           
 time   seconds   seconds    calls  ms/call  ms/call  name    
 19.77      0.43     0.43 76368364     0.00     0.00  nothrow  safe int
std.array.split!(char[]).split(char[]).__foreachbody2(ref ulong, ref dchar)
 16.74      0.79     0.36                             nothrow void
gc.gc.Gcx.mark(void*, void*, int)
 10.23      1.01     0.22                             _aApplycd2
  8.14      1.18     0.18                             _d_arrayappendcTX
  4.19      1.27     0.09                             nothrow ulong
gc.gc.Gcx.fullcollect()
-----snip-----

As you can see, std.array.split takes up almost 20% of the total running
time.

The surprising (or not-so-surprising?) second runner-up for hot spot is
actually the GC's marking algorithm. I'm guessing this is likely because
of the extensive use of small GC-allocated arrays (splitting into
strings, and the like).

The 3rd entry, _aApplycd2, appears to be the druntime implementation of
the foreach loop, so I'm not sure what could be done about it.

Anyway, just for kicks, I tried various ways of reducing the cost of
std.array.split, but didn't get very far.  Replacing it with
std.regex.split didn't help. Looking at its implementation, while it
does allocate a new array each time, it also slices over the input when
constructing the substrings, so it didn't seem as inefficient as I first
thought.

Next I tried disabling the GC with core.memory.GC.disable().
Immediately, I got a 20% performance boost.  Of course, running with a
disabled GC will soon eat up all memory and crash, which isn't an option
in real-world usage, so the next best solution is to manually schedule
GC collection cycles, say call GC.collect() every n iterations of the
parsing loop, for some value of n.

I tried implementing a crude version of this (see code below), and found
that manually calling GC.collect() even as frequently as once every 5000
loop iterations (for a 500,000 line test input file) still gives about
15% performance improvement over completely disabling the GC.  Since
most of the arrays involved here are pretty small, the frequency could
be reduced to once every 50,000 iterations and you'd pretty much get the
20% performance boost for free, and still not run out of memory too
quickly.

Note that all measurements were done with `gdc -O3`.  I did try the
original code with `dmd -O`, and found that it was 20% slower than the
gdc version, so I didn't look further.

Anyway, here's the code with the manual GC collection schedule (I
modified main() slightly so that I could easily test the code with
different input files, so you have to specify the input filename as an
argument to the program when running it):

-------snip-------

// Programmed in the D language
// Fredrik Boulund 2015-09-09
// Modified by H. S. Teoh 2015-09-14

import core.memory; // for GC control
import std.stdio;
import std.array;
import std.conv;
import std.algorithm;

struct Hit 
{
    string target;
    float pid;
    int matches, mismatches, gaps, qstart, qstop, tstart, tstop;
    double evalue, bitscore;
}

enum manualGcCount = 5_000;
ulong ticksToCollect = manualGcCount;

void gcTick()
{
    if (--ticksToCollect == 0)
    {
        GC.collect();
        ticksToCollect = manualGcCount;
    }
}

void custom_parse(string filename)
{
    float min_identity = 90.0;
    int min_matches = 10;
    Hit[][string] hitlists;

    foreach (record; filename
                     .File
                     .byLine
                     .map!split
                     .filter!(l => l[2].to!float >= min_identity && 
                                   l[3].to!int >= min_matches))
    {
        hitlists[record[0].to!string] ~= Hit(record[1].to!string,
                                             record[2].to!float,
                                             record[3].to!int,
                                             record[4].to!int,
                                             record[5].to!int,
                                             record[6].to!int,
                                             record[7].to!int,
                                             record[8].to!int,
                                             record[9].to!int,
                                             record[10].to!double,
                                             record[11].to!double);
        gcTick();
    }

    foreach (query; hitlists.byKey)
    {
        float max_pid = reduce!((a,b) => max(a, b.pid))(-double.max,
hitlists[query]);
        float max_pid_diff = 5.00;
        hitlists[query] = hitlists[query].filter!(h => h.pid >= (max_pid -
max_pid_diff)).array();
        writeln(query, ": ", hitlists[query].length);

        gcTick();
    }

}

void main(string[] args)
{
    auto filename = args[1];
    //auto filename = "blat.blast8";
    //auto filename = "./QE_150828_46.blast8";

    GC.disable();
    custom_parse(filename);
}
-------snip-------


T

-- 
Stop staring at me like that! It's offens... no, you'll hurt your eyes!

Sep 14 2015

Kapps <opantm2+spam gmail.com> writes:

On Monday, 14 September 2015 at 18:31:38 UTC, H. S. Teoh wrote:
 I decided to give the code a spin with `gdc -O3 -pg`. Turns out 
 that the hotspot is in std.array.split, contrary to 
 expectations. :-)  Here are the first few lines of the gprof 
 output:

 [...]

Perhaps using the new rangified splitter instead of split would 
help.

Sep 14 2015

"H. S. Teoh via Digitalmars-d-learn" <digitalmars-d-learn puremagic.com> writes:

On Mon, Sep 14, 2015 at 08:07:45PM +0000, Kapps via Digitalmars-d-learn wrote:
 On Monday, 14 September 2015 at 18:31:38 UTC, H. S. Teoh wrote:
I decided to give the code a spin with `gdc -O3 -pg`. Turns out that
the hotspot is in std.array.split, contrary to expectations. :-)
Here are the first few lines of the gprof output:

[...]

 
 Perhaps using the new rangified splitter instead of split would help.

I tried it. It was slower, surprisingly. I didn't dig deeper into why.


T

-- 
I see that you JS got Bach.

Sep 14 2015

Fredrik Boulund <fredrik.boulund gmail.com> writes:

On Monday, 14 September 2015 at 18:31:38 UTC, H. S. Teoh wrote:
 I tried implementing a crude version of this (see code below), 
 and found that manually calling GC.collect() even as frequently 
 as once every 5000 loop iterations (for a 500,000 line test 
 input file) still gives about 15% performance improvement over 
 completely disabling the GC.  Since most of the arrays involved 
 here are pretty small, the frequency could be reduced to once 
 every 50,000 iterations and you'd pretty much get the 20% 
 performance boost for free, and still not run out of memory too 
 quickly.

Interesting, I'll have to go through your code to understand 
exactly what's going on. I also noticed some GC-related stuff 
high up in my profiling, but had no idea what could be done about 
that. Appreciate the suggestions!

Sep 15 2015

"H. S. Teoh via Digitalmars-d-learn" <digitalmars-d-learn puremagic.com> writes:

On Tue, Sep 15, 2015 at 08:55:43AM +0000, Fredrik Boulund via
Digitalmars-d-learn wrote:
 On Monday, 14 September 2015 at 18:31:38 UTC, H. S. Teoh wrote:
I tried implementing a crude version of this (see code below), and
found that manually calling GC.collect() even as frequently as once
every 5000 loop iterations (for a 500,000 line test input file) still
gives about 15% performance improvement over completely disabling the
GC.  Since most of the arrays involved here are pretty small, the
frequency could be reduced to once every 50,000 iterations and you'd
pretty much get the 20% performance boost for free, and still not run
out of memory too quickly.

 
 Interesting, I'll have to go through your code to understand exactly
 what's going on. I also noticed some GC-related stuff high up in my
 profiling, but had no idea what could be done about that. Appreciate
 the suggestions!

It's very simple, actually. Basically you just call GC.disable() at the
beginning of the program to disable automatic collection cycles, then at
period intervals in you manually trigger collections by calling
GC.collect().

The way I implemented it in my test code was to use a global counter
that I decrement once every loop iteration. When the counter reaches
zero, GC.collect() is called, and then the counter is reset to its
original value.  This is encapsulated in the gcTick() function, so that
it's easy to tweak the frequency of the manual collections without
modifying several different places in the code each time.


T

-- 
BREAKFAST.COM halted...Cereal Port Not Responding. -- YHL

Sep 15 2015

Rikki Cattermole <alphaglosined gmail.com> writes:

On 15/09/15 5:41 AM, NX wrote:
 On Monday, 14 September 2015 at 16:33:23 UTC, Rikki Cattermole wrote:
 A lot of this hasn't been covered I believe.

 http://dpaste.dzfl.pl/f7ab2915c3e1

 I believe that should be:
 foreach (query, ref value; hitlists)
 Since an assignment happenin there..?

Probably.

Sep 14 2015

Fredrik Boulund <fredrik.boulund gmail.com> writes:

On Monday, 14 September 2015 at 16:33:23 UTC, Rikki Cattermole 
wrote:
 A lot of this hasn't been covered I believe.

 http://dpaste.dzfl.pl/f7ab2915c3e1

 1) You don't need to convert char[] to string via to. No. Too 
 much. Cast it.
 2) You don't need byKey, use foreach key, value syntax. That 
 way you won't go around modifying things unnecessarily.

 Ok, I disabled GC + reserved a bunch of memory. It probably 
 won't help much actually. In fact may make it fail so keep that 
 in mind.

 Humm what else.

 I'm worried about that first foreach. I don't think it needs to 
 exist as it does. I believe an input range would be far better. 
 Use a buffer to store the Hit[]'s. Have a subset per set of 
 them.

 If the first foreach is an input range, then things become 
 slightly easier in the second. Now you can turn that into it's 
 own input range.
 Also that .array usage concerns me. Many an allocation there! 
 Hence why the input range should be the return from it.

 The last foreach, is lets assume dummy. Keep in mind, stdout is 
 expensive here. DO NOT USE. If you must buffer output then do 
 it large quantities.


 Based upon what I can see, you are definitely not able to use 
 your cpu's to the max. There is no way that is the limiting 
 factor here. Maybe your usage of a core is. But not the cpu's 
 itself.

 The thing is, you cannot use multiple threads on that first 
 foreach loop to speed things up. No. That needs to happen all 
 on one thread.
 Instead after that thread you need to push the result into 
 another.

 Perhaps, per thread one lock (mutex) + buffer for hits. Go 
 round robin over all the threads. If mutex is still locked, 
 you'll need to wait. In this situation a locked mutex means all 
 you worker threads are working. So you can't do anything more 
 (anyway).

 Of course after all this, the HDD may still be getting hit too 
 hard. In which case I would recommend you memory mapping it. 
 Which should allow the OS to more efficiently handle reading it 
 into memory. But you'll need to rework .byLine for that.


 Wow that was a lot at 4:30am! So don't take it too seriously. 
 I'm sure somebody else will rip that to shreds!

Thanks for your suggestions! That sure is a lot of details. I'll 
have to go through them carefully to understand what to do with 
all this. Going multithreaded sounds fun but would  effectively 
kill of all of my spare time, so I might have to skip that. :)

Using char[] all around might be a good idea, but it doesn't seem 
like the string conversions are really that taxing. What are the 
arguments for working on char[] arrays rather than strings?

I'm aware that printing output like that is a performance killer, 
but it's not supposed to write anything in the final program. 
It's just there for me to be able to compare the results to my 
Python code.

Sep 15 2015

Kagamin <spam here.lot> writes:

On Tuesday, 15 September 2015 at 08:51:02 UTC, Fredrik Boulund 
wrote:
 Using char[] all around might be a good idea, but it doesn't 
 seem like the string conversions are really that taxing. What 
 are the arguments for working on char[] arrays rather than 
 strings?

No, casting to string would be incorrect here because the line 
buffer is reused, your original usage of to!string is correct.

Sep 15 2015

Rikki Cattermole <alphaglosined gmail.com> writes:

On 15/09/15 9:00 PM, Kagamin wrote:
 On Tuesday, 15 September 2015 at 08:51:02 UTC, Fredrik Boulund wrote:
 Using char[] all around might be a good idea, but it doesn't seem like
 the string conversions are really that taxing. What are the arguments
 for working on char[] arrays rather than strings?

 No, casting to string would be incorrect here because the line buffer is
 reused, your original usage of to!string is correct.

I made the assumption it wasn't allocating. Ehhh.

Sep 15 2015

CraigDillabaugh <craig.dillabaugh gmail.com> writes:

On Monday, 14 September 2015 at 12:30:21 UTC, Fredrik Boulund 
wrote:
 Hi,

 This is my first post on Dlang forums and I don't have a lot of 
 experience with D (yet). I mainly code bioinformatics-stuff in 
 Python on my day-to-day job, but I've been toying with D for a 
 couple of years now. I had this idea that it'd be fun to write 
 a parser for a text-based tabular data format I tend to read a 
 lot of in my programs, but I was a bit stomped that the D 
 implementation I created was slower than my Python-version. I 
 tried running `dmd -profile` on it but didn't really understand 
 what I can do to make it go faster. I guess there's some 
 unnecessary dynamic array extensions being made but I can't 
 figure out how to do without them, maybe someone can help me 
 out? I tried making the examples as small as possible.

 Here's the code D code: http://dpaste.com/2HP0ZVA
 Here's my Python code for comparison: http://dpaste.com/0MPBK67

 clip

I am going to go off the beaten path here.  If you really want 
speed
for a file like this one way of getting that is to read the file
in as a single large binary array of ubytes (or in blocks if its 
too big)
and parse the lines yourself. Should be fairly easy with D's 
array slicing.

I looked at the format and it appears that lines are quite simple 
and use
a limited subset of the ASCII chars.  If that is in fact true 
then you
should be able to speed up reading using this technique.  If you 
can have
UTF8 chars in there, or if the format can be more complex than 
that shown
in your example, then please ignore my suggestion.

Sep 14 2015

John Colvin <john.loughran.colvin gmail.com> writes:

On Monday, 14 September 2015 at 17:51:43 UTC, CraigDillabaugh 
wrote:
 On Monday, 14 September 2015 at 12:30:21 UTC, Fredrik Boulund 
 wrote:
 [...]

 I am going to go off the beaten path here.  If you really want 
 speed
 for a file like this one way of getting that is to read the file
 in as a single large binary array of ubytes (or in blocks if 
 its too big)
 and parse the lines yourself. Should be fairly easy with D's 
 array slicing.

my favourite for streaming a file:
enum chunkSize = 4096;
File(fileName).byChunk(chunkSize).map!"cast(char[])a".joiner()

Sep 14 2015

Fredrik Boulund <fredrik.boulund gmail.com> writes:

On Monday, 14 September 2015 at 18:08:31 UTC, John Colvin wrote:
 On Monday, 14 September 2015 at 17:51:43 UTC, CraigDillabaugh 
 wrote:
 On Monday, 14 September 2015 at 12:30:21 UTC, Fredrik Boulund 
 wrote:
 [...]

 I am going to go off the beaten path here.  If you really want 
 speed
 for a file like this one way of getting that is to read the 
 file
 in as a single large binary array of ubytes (or in blocks if 
 its too big)
 and parse the lines yourself. Should be fairly easy with D's 
 array slicing.

 my favourite for streaming a file:
 enum chunkSize = 4096;
 File(fileName).byChunk(chunkSize).map!"cast(char[])a".joiner()

Is this an efficient way of reading this type of file? What 
should one keep in mind when choosing chunkSize?

Sep 15 2015

Kagamin <spam here.lot> writes:

On Tuesday, 15 September 2015 at 08:53:37 UTC, Fredrik Boulund 
wrote:
 my favourite for streaming a file:
 enum chunkSize = 4096;
 File(fileName).byChunk(chunkSize).map!"cast(char[])a".joiner()

 Is this an efficient way of reading this type of file? What 
 should one keep in mind when choosing chunkSize?

It provides you only one char at a time instead of a whole line. 
It will be quite constraining for your code if not mind-bending.

Sep 15 2015

John Colvin <john.loughran.colvin gmail.com> writes:

On Tuesday, 15 September 2015 at 09:09:00 UTC, Kagamin wrote:
 On Tuesday, 15 September 2015 at 08:53:37 UTC, Fredrik Boulund 
 wrote:
 my favourite for streaming a file:
 enum chunkSize = 4096;
 File(fileName).byChunk(chunkSize).map!"cast(char[])a".joiner()

 Is this an efficient way of reading this type of file? What 
 should one keep in mind when choosing chunkSize?


reasonably efficient, yes. See http://stackoverflow.com/a/237495 
for a discussion of chunk sizing when streaming a file.

 It provides you only one char at a time instead of a whole 
 line. It will be quite constraining for your code if not 
 mind-bending.



File(fileName).byChunk(chunkSize).map!"cast(char[])a".joiner().lineSplitter()

Sep 15 2015

Kagamin <spam here.lot> writes:

On Tuesday, 15 September 2015 at 09:19:29 UTC, John Colvin wrote:
 It provides you only one char at a time instead of a whole 
 line. It will be quite constraining for your code if not 
 mind-bending.



 File(fileName).byChunk(chunkSize).map!"cast(char[])a".joiner().lineSplitter()

lineSplitter doesn't work with input ranges.

Sep 15 2015

John Colvin <john.loughran.colvin gmail.com> writes:

On Tuesday, 15 September 2015 at 13:01:06 UTC, Kagamin wrote:
 On Tuesday, 15 September 2015 at 09:19:29 UTC, John Colvin 
 wrote:
 It provides you only one char at a time instead of a whole 
 line. It will be quite constraining for your code if not 
 mind-bending.



 File(fileName).byChunk(chunkSize).map!"cast(char[])a".joiner().lineSplitter()

 lineSplitter doesn't work with input ranges.

Ugh

Sep 15 2015

D Programming

C/C++ Programming

Other

digitalmars.D.learn - Speeding up text file parser (BLAST tabular format)