digitalmars.D.learn - 200-600x slower Dlang performance with nested foreach loop

methonash (58/58) Jan 26 2021 Greetings Dlang wizards,

Paul Backus (7/22) Jan 26 2021 It would be much easier for us to help you with this if you could

methonash (16/21) Jan 26 2021 I cannot post the full source code.

H. S. Teoh (18/33) Jan 26 2021 Then we are limited in how much we can help you.

Imperatorn (2/6) Jan 26 2021 Source please 👍
H. S. Teoh (56/106) Jan 26 2021 [...]

methonash (31/57) Jan 26 2021 Many thanks for the tip! I look forward to trying this soon. For

Paul Backus (6/14) Jan 26 2021 You could try sorting the array first, and then using `uniq` [1]

H. S. Teoh (11/27) Jan 26 2021 Yes, definitely try this. This will completely eliminate the overhead

FeepingCreature (4/13) Jan 27 2021 Associative arrays allocate per entry added?!

Paul Backus (7/10) Jan 27 2021 Maybe it's to avoid invalidating the result of `key in aa` when

Paul Backus (4/9) Jan 27 2021 Correction: the GC would take care of the safety issue, of

Steven Schveighoffer (18/27) Jan 27 2021 No, it wouldn't. The problem is not a GC issue, but an issue with

methonash (63/63) Jan 30 2021 Greetings all,

Steven Schveighoffer (9/90) Jan 30 2021 The code looks pretty minimal.

methonash (29/35) Jan 31 2021 I have user requirements that this application track string IDs

a11e99z (3/9) Jan 31 2021 j is counted from current begginning: from 0

CraigDillabaugh (3/14) Jan 30 2021 But that is still O(n^2), you've only changed the constant.

Steven Schveighoffer (8/21) Jan 26 2021 Maybe try a different approach.
mw (2/15) Jan 26 2021 What's the typical length of your strings?

mw (13/29) Jan 26 2021 Actually, I think it's reasonable given your algo:

methonash <nope dne.com> writes:

Greetings Dlang wizards,

I seek knowledge/understanding of a very frustrating phenomenon 
I've experienced over the past several days.

The problem space:

1) Read a list of strings from a file
2) De-duplicate all strings into the subset of unique strings
3) Sort the subset of unique strings by descending length and 
then by ascending lexicographic identity
4) Iterate through the sorted subset of unique strings, 
identifying smaller sequences with perfect identity to their 
largest possible parent string

I have written a Dlang program that performantly progresses 

to uniquely de-duplicate the initial set of strings and then used 
multiSort(). Performance was good up till this point, especially 
with use of the LDC compiler.


SortedRange, I used .array to convert the returned SortedRange 
into an array of type string[]. This appeared to work, and 
neither DMD nor LDC threw any warnings/errors for doing this.

With the formally returned array, I then attempted to construct a 
double foreach loop to iterate through the sorted array of unique 
strings and find substring matches.

foreach( i, ref pStr; sortedArr )
{
     foreach( j, ref cStr; sortedArr[ i + 1 .. $ ] )
     {
         if( indexOf( pStr, cStr ) > -1 )
         {
             // ...
         }
     }
}

Before adding the code excerpt above, the Dlang program was 
taking ~1 second on an input file containing approx. 64,000 
strings.

By adding the code above, the program now takes 6 minutes to 
complete. An attempt was made to more efficiently perform 
ASCII-only substring searching by converting the sorted string[] 
into ubyte[][] and then using countUntil() instead of indexOf(), 
but this had an effect that was completely opposite to what I had 
previously experienced: the program then took over 20 minutes to 
complete!

Thus, I am entirely baffled.

My first attempt to solve this problem space used a small Perl 
program to perform steps 1 through 3, which would then pipe 
intermediate output to a small Dlang program handling only step 

countUntil().

The Dlang code for the nested foreach block above is essentially 
near-identical between my two Dlang implementations. Yet, the 
second implementation--where I'm trying to solve the entire 
problem space in D--has absolutely failed in terms of performance.

Perl+D, ubyte[][], countUntil() :: under 2 seconds
only D, string[], indexOf() :: ~6 minutes
only D, ubyte[][], countUntil() :: >20 minutes

Please advise. This nightmarish experience is shaking my 
confidence in using D.

Jan 26 2021

Paul Backus <snarwin gmail.com> writes:

On Tuesday, 26 January 2021 at 17:40:36 UTC, methonash wrote:
 foreach( i, ref pStr; sortedArr )
 {
     foreach( j, ref cStr; sortedArr[ i + 1 .. $ ] )
     {
         if( indexOf( pStr, cStr ) > -1 )
         {
             // ...
         }
     }
 }

 Before adding the code excerpt above, the Dlang program was 
 taking ~1 second on an input file containing approx. 64,000 
 strings.

 By adding the code above, the program now takes 6 minutes to 
 complete.

It would be much easier for us to help you with this if you could 
post the full program, or at the very least a reduced version 
that reproduces the same issue. [1] Since your attempts so far 
have failed to fix the problem, it is quite likely that some part 
of the code you do not suspect is actually to blame.

[1] https://idownvotedbecau.se/nomcve/

Jan 26 2021

methonash <nope dne.com> writes:

On Tuesday, 26 January 2021 at 17:56:22 UTC, Paul Backus wrote:
 It would be much easier for us to help you with this if you 
 could post the full program, or at the very least a reduced 
 version that reproduces the same issue. [1] Since your attempts 
 so far have failed to fix the problem, it is quite likely that 
 some part of the code you do not suspect is actually to blame.

I cannot post the full source code.

Regarding a reduced version reproducing the issue: well, that's 
exactly what the nested foreach loop does. Without it, the 
program reaches that point quickly.

With the nested foreach block, it slows to a crawl. More 
specifically, commenting-out the indexOf() or countUntil() 
sub-blocks preserves fast performance, but I'm not sure if that 
may be related to compiler optimizations realizing that there's 
nothing but "dead/nonexistent code" inside the loops and 
generating a binary that just never goes there.

If this may help: I've composed the second Dlang implementation 
as one extended block of code within main() and am thinking of 
soon refactoring the code into different functions. I remain 
pessimistic of whether this may help.

Is there any possibility this could be GC-related?

Jan 26 2021

"H. S. Teoh" <hsteoh quickfur.ath.cx> writes:

On Tue, Jan 26, 2021 at 06:13:54PM +0000, methonash via Digitalmars-d-learn
wrote:
[...]
 I cannot post the full source code.

Then we are limited in how much we can help you.


 Regarding a reduced version reproducing the issue: well, that's
 exactly what the nested foreach loop does. Without it, the program
 reaches that point quickly.

By "reduced version" we mean a code exceprt that's (1) compilable, (2)
runnable, and (3) exhibits the same problem that you're seeing in your
full code.

Posting an isolated loop cut out of some unknown function in some
unknown module somewhere is not helping us identify where your problem
is.


 With the nested foreach block, it slows to a crawl. More specifically,
 commenting-out the indexOf() or countUntil() sub-blocks preserves fast
 performance, but I'm not sure if that may be related to compiler
 optimizations realizing that there's nothing but "dead/nonexistent
 code" inside the loops and generating a binary that just never goes
 there.

When in doubt, use a disassembler to see exactly what is the generated
code.


 If this may help: I've composed the second Dlang implementation as one
 extended block of code within main() and am thinking of soon
 refactoring the code into different functions. I remain pessimistic of
 whether this may help.

As I said in my other reply: don't guess, profile!  Randomly changing
your code in the hopes that it will help, in general, won't.


 Is there any possibility this could be GC-related?

Without more information and a complete, compilable example, it's
anybody's guess.


T

-- 
Debian GNU/Linux: Cray on your desktop.

Jan 26 2021

Imperatorn <johan_forsberg_86 hotmail.com> writes:

On Tuesday, 26 January 2021 at 17:40:36 UTC, methonash wrote:
 Greetings Dlang wizards,

 I seek knowledge/understanding of a very frustrating phenomenon 
 I've experienced over the past several days.

 [...]

Source please 👍

Jan 26 2021

"H. S. Teoh" <hsteoh quickfur.ath.cx> writes:

On Tue, Jan 26, 2021 at 05:40:36PM +0000, methonash via Digitalmars-d-learn
wrote:
[...]
 1) Read a list of strings from a file
 2) De-duplicate all strings into the subset of unique strings
 3) Sort the subset of unique strings by descending length and then by
 ascending lexicographic identity
 4) Iterate through the sorted subset of unique strings, identifying smaller
 sequences with perfect identity to their largest possible parent string

[...]

 I used .array to convert the returned SortedRange into an array of type
 string[].

Do not do this. Every time you call .array it allocates a new array and
copies all its contents over. If this code runs frequently, it will
cause a big performance hit, not to mention high GC load.

The function you're looking for is .release, not .array.


[...]
 With the formally returned array, I then attempted to construct a
 double foreach loop to iterate through the sorted array of unique
 strings and find substring matches.
 
 foreach( i, ref pStr; sortedArr )
 {
     foreach( j, ref cStr; sortedArr[ i + 1 .. $ ] )
     {
         if( indexOf( pStr, cStr ) > -1 )
         {
             // ...
         }
     }
 }
 
 Before adding the code excerpt above, the Dlang program was taking ~1
 second on an input file containing approx. 64,000 strings.
 
 By adding the code above, the program now takes 6 minutes to complete.

That nested loop is an O(n^2) algorithm. Meaning it will slow down
*very* quickly as the size of the array n increases.  You might want to
think about how to improve this algorithm.


 An attempt was made to more efficiently perform ASCII-only substring
 searching by converting the sorted string[] into ubyte[][] and then
 using countUntil() instead of indexOf(), but this had an effect that
 was completely opposite to what I had previously experienced: the
 program then took over 20 minutes to complete!

How are you doing the conversion?  If you're using std.conv.to or
something like that, it will definitely cause a big performance hit
because of the needless allocations and copying.  You probably want a
direct cast instead.  I.e., you want to reinterpret the array reference,
not transcribe a copy of it into a ubyte[][].

Probably what you're looking for is to use .representation and
.countUntil, or maybe just .canFind to bypass the decoding overhead. (If
indeed that is the bottleneck; it may not be.  Have you used a profiler
to identify where the hotspot is?)


 Thus, I am entirely baffled.
 
 My first attempt to solve this problem space used a small Perl program
 to perform steps 1 through 3, which would then pipe intermediate

 arrays (no use of AAs) of ubyte[][] with use of countUntil().

Using AA's may not necessarily improve performance.  It depends on what
your code does with it.  Because AA's require random access to memory,
it's not friendly to the CPU cache hierarchy, whereas traversing linear
arrays is more cache-friendly and in some cases will out-perform AA's.


 The Dlang code for the nested foreach block above is essentially
 near-identical between my two Dlang implementations. Yet, the second
 implementation--where I'm trying to solve the entire problem space in
 D--has absolutely failed in terms of performance.
 
 Perl+D, ubyte[][], countUntil() :: under 2 seconds
 only D, string[], indexOf() :: ~6 minutes
 only D, ubyte[][], countUntil() :: >20 minutes
 
 Please advise. This nightmarish experience is shaking my confidence in
 using D.

First of all, you need to use a profiler to identify where the hotspots
are.  Otherwise you could well be pouring tons of effort into
"optimizing" code that doesn't actually need optimizing, while
completely missing the real source of the problem.  Whenever you run
into performance problems, do not assume you know where the problem is,
profile, profile, profile!

Second, you only posted a small fragment of your code, so it's hard to
say where the problem really is.  I can only guess based on what you
described.  If you could post the entire program, or at least a
complete, compilable and runnable excerpt thereof that displays the same
(or similar) performance problems, then we could better help you
pinpoint where the problem is.

In general, though, things to look for are:

(1) Unnecessary memory allocations, e.g., calling .array on a
SortedRange when you really should be using .release, or calling a
conversion function to transcribe an array instead of just casting to
reinterpret it;

(2) Algorithms with poor big-O characteristics, e.g., the O(n^2) nested
loop you have above.

(3) Expensive operations inside inner loops, because the loop nesting
magnifies any slowness in the code.

But above all, before randomly changing your code in the hopes that you
will hit upon a solution, use a profiler.  Don't shoot in the dark; use
a profiler to identify the actual performance bottlenecks.  Don't waste
time optimizing things that don't need optimizing; focus your efforts on
the hotspots identified by your profiler. (And don't think you're
smarter than your profiler -- I've been coding for ~20 years, and I
still find that my bottlenecks are usually not where I think they are.
Profile, profile, profile!)


T

-- 
What do you call optometrist jokes? Vitreous humor.

Jan 26 2021

methonash <nope dne.com> writes:

On Tuesday, 26 January 2021 at 18:17:31 UTC, H. S. Teoh wrote:
 Do not do this. Every time you call .array it allocates a new 
 array and copies all its contents over. If this code runs 
 frequently, it will cause a big performance hit, not to mention 
 high GC load.

 The function you're looking for is .release, not .array.

Many thanks for the tip! I look forward to trying this soon. For 
reference, the .array call is only performed once.

 That nested loop is an O(n^2) algorithm. Meaning it will slow 
 down *very* quickly as the size of the array n increases.  You 
 might want to think about how to improve this algorithm.

Nice observation, and yes, this would typically be an O(n^2) 
approach.

However, due to subsetting the input dataset to unique strings 
and then sorting in descending length, one might notice that the 
inner foreach loop does not iterate over all of n, only on the 
iterator value i+1 through the end of the array.

Thus, I believe this would then become approximately O(n^2/2). 
More precisely, it should be O( ( n^2 + n ) / 2 ).

Further: the original dataset has 64k strings. Squaring that 
yields 4.1 billion string comparisons.

Once uniquely de-duplicated, the dataset is reduced to ~46k 
strings. Considering roughly O(n^2/2) at this level, this yields 
1.06 billion string comparisons.

So, performing steps 1 through 3 improves the brute-force string 
comparison problem four-fold in my test development dataset.

 Using AA's may not necessarily improve performance.  It depends 
 on what your code does with it.  Because AA's require random 
 access to memory, it's not friendly to the CPU cache hierarchy, 
 whereas traversing linear arrays is more cache-friendly and in 
 some cases will out-perform AA's.

I figured a built-in AA might be an efficient path to performing 
unique string de-duplication. If there's a more performant method 
available, I'll certainly try it.

 First of all, you need to use a profiler to identify where the 
 hotspots are.  Otherwise you could well be pouring tons of 
 effort into "optimizing" code that doesn't actually need 
 optimizing, while completely missing the real source of the 
 problem.  Whenever you run into performance problems, do not 
 assume you know where the problem is, profile, profile, profile!

Message received. Given that D is the first compiled language 
I've semi-seriously dabbled with, I have no real experience with 
profiler usage.

 Second, you only posted a small fragment of your code, so it's 
 hard to say where the problem really is.  I can only guess 
 based on what you described.  If you could post the entire 
 program, or at least a complete, compilable and runnable 
 excerpt thereof that displays the same (or similar) performance 
 problems, then we could better help you pinpoint where the 
 problem is.

Yes, I'll be looking to present a complete, compilable, and 
executable demo of code for this issue if/when subsequent efforts 
continue to fail.

For professional reasons (because I no longer work in academia), 
I cannot share the original source code for the issue presented 
here, but I can attempt to reproduce it in a minimally complete 
form for a public dataset.

Jan 26 2021

Paul Backus <snarwin gmail.com> writes:

On Tuesday, 26 January 2021 at 23:57:43 UTC, methonash wrote:
 Using AA's may not necessarily improve performance.  It 
 depends on what your code does with it.  Because AA's require 
 random access to memory, it's not friendly to the CPU cache 
 hierarchy, whereas traversing linear arrays is more 
 cache-friendly and in some cases will out-perform AA's.

 I figured a built-in AA might be an efficient path to 
 performing unique string de-duplication. If there's a more 
 performant method available, I'll certainly try it.

You could try sorting the array first, and then using `uniq` [1] 
to discard duplicate elements. There's an example in the docs 
that shows how to do this in-place (without allocating additional 
memory).

[1] http://phobos.dpldocs.info/std.algorithm.iteration.uniq.html

Jan 26 2021

"H. S. Teoh" <hsteoh quickfur.ath.cx> writes:

On Wed, Jan 27, 2021 at 01:28:33AM +0000, Paul Backus via Digitalmars-d-learn
wrote:
 On Tuesday, 26 January 2021 at 23:57:43 UTC, methonash wrote:
 Using AA's may not necessarily improve performance.  It depends on
 what your code does with it.  Because AA's require random access
 to memory, it's not friendly to the CPU cache hierarchy, whereas
 traversing linear arrays is more cache-friendly and in some cases
 will out-perform AA's.

 
 I figured a built-in AA might be an efficient path to performing
 unique string de-duplication. If there's a more performant method
 available, I'll certainly try it.

 
 You could try sorting the array first, and then using `uniq` [1] to
 discard duplicate elements. There's an example in the docs that shows
 how to do this in-place (without allocating additional memory).
 
 [1] http://phobos.dpldocs.info/std.algorithm.iteration.uniq.html

Yes, definitely try this.  This will completely eliminate the overhead
of using an AA, which has to allocate memory (at least) once per entry
added.  Especially since the data has to be sorted eventually anyway,
you might as well sort first then use the sortedness as a convenient
property for fast de-duplication.  Since .uniq traverses the range
linearly, this will be cache-friendly, and along with eliminating GC
load should give you a speed boost.


T

-- 
Nearly all men can stand adversity, but if you want to test a man's character,
give him power. -- Abraham Lincoln

Jan 26 2021

FeepingCreature <feepingcreature gmail.com> writes:

On Wednesday, 27 January 2021 at 02:14:39 UTC, H. S. Teoh wrote:
 Yes, definitely try this.  This will completely eliminate the 
 overhead of using an AA, which has to allocate memory (at 
 least) once per entry added.  Especially since the data has to 
 be sorted eventually anyway, you might as well sort first then 
 use the sortedness as a convenient property for fast 
 de-duplication.  Since .uniq traverses the range linearly, this 
 will be cache-friendly, and along with eliminating GC load 
 should give you a speed boost.


 T

Associative arrays allocate per entry added?!

https://github.com/dlang/druntime/blob/master/src/rt/aaA.d#L205 
Oh God, associative arrays allocate per entry added!

Jan 27 2021

Paul Backus <snarwin gmail.com> writes:

On Wednesday, 27 January 2021 at 14:15:26 UTC, FeepingCreature 
wrote:
 Associative arrays allocate per entry added?!

 https://github.com/dlang/druntime/blob/master/src/rt/aaA.d#L205 
 Oh God, associative arrays allocate per entry added!

Maybe it's to avoid invalidating the result of `key in aa` when 
adding or removing entries? The spec doesn't say anything about 
it either way [1], but allowing invalidation would make AAs much 
more difficult to use in  safe code.

[1] https://dlang.org/spec/hash-map.html

Jan 27 2021

Paul Backus <snarwin gmail.com> writes:

On Wednesday, 27 January 2021 at 15:12:32 UTC, Paul Backus wrote:
 Maybe it's to avoid invalidating the result of `key in aa` when 
 adding or removing entries? The spec doesn't say anything about 
 it either way [1], but allowing invalidation would make AAs 
 much more difficult to use in  safe code.

 [1] https://dlang.org/spec/hash-map.html

Correction: the GC would take care of the safety issue, of 
course. I haven't had my morning tea yet, and clearly I'm not 
thinking straight. :)

Jan 27 2021

Steven Schveighoffer <schveiguy gmail.com> writes:

On 1/27/21 10:14 AM, Paul Backus wrote:
 On Wednesday, 27 January 2021 at 15:12:32 UTC, Paul Backus wrote:
 Maybe it's to avoid invalidating the result of `key in aa` when adding 
 or removing entries? The spec doesn't say anything about it either way 
 [1], but allowing invalidation would make AAs much more difficult to 
 use in  safe code.


Yes, that's the reason.

 
 Correction: the GC would take care of the safety issue, of course. I 
 haven't had my morning tea yet, and clearly I'm not thinking straight. :)

No, it wouldn't. The problem is not a GC issue, but an issue with 
aliasing expectations (if you rehash, you completely rearrange the buckets).

so if you have:

auto x = key in aa;

aa[someOtherKey] = value; // rehashes

Code at this point currently can count on x pointing at the item being 
used in the AA.

If we change it now, lots of code will break.

This is not truly a horrible issue, you can use a custom implementation 
(see any number of container libraries on code.dlang.org).

What would be nice though, is to provide an actual template, that 
builtin AAs are an alias for (like string), and then if you want 
slightly different behavior, you provide different parameters.

But at the end of the day, the builtin AAs will ALWAYS do this. We can't 
change it now.

-Steve

Jan 27 2021

methonash <nope dne.net> writes:

Greetings all,

Many thanks for sharing your collective perspective and advice 
thus far! It has been very helpful and instructive. I return 
bearing live data and a minimally complete, compilable, and 
executable program to experiment with and potentially optimize. 
The dataset can be pulled from here:

https://filebin.net/qf2km1ea9qgd5skp/seqs.fasta.gz?t=97kgpebg

Running "cksum" on this file:

1477520542 2199192 seqs.fasta.gz

Naturally, you'll need to gunzip this file. The decompressed file 
contains strings on every even-numbered line that have already 
been reduced to the unique de-duplicated subset, and they have 
already been sorted by descending length and alphabetical 
identity.


finding/removing strings that can be entirely absorbed 
(substringed) by their largest possible parent.

And now for the code:


import std.stdio : writefln, File, stdin;
import std.conv : to;
import std.string : indexOf;

void main()
{
         string[] seqs;

         foreach( line; stdin.byLine() )
         {
                 if( line[ 0 ] == '>' ) continue;
                 else seqs ~= to!string( line );
         }

         foreach( i; 0 .. seqs.length )
         {
                 if( seqs[ i ].length == 0 ) continue;

                 foreach( j; i + 1 .. seqs.length )
                 {
                         if( seqs[ j ].length == 0 ||
                             seqs[ i ].length == seqs[ j ].length 
) continue;

                         if( indexOf( seqs[ i ], seqs[ j ] ) > -1 )
                         {
                                 seqs[ j ] = "";

                                 writefln( "%s contains %s", i, j 
);
                         }
                 }
         }
}


Compile the source and then run the executable via redirecting 
stdin:

./substr < seqs.fasta

See any straightforward optimization paths here?

For curiosity, I experimented with use of string[] and ubyte[][] 
and several functions (indexOf, canFind, countUntil) to assess 
for the best potential performer. My off-the-cuff results:

string[] with indexOf() :: 26.5-27 minutes
string[] with canFind() :: >28 minutes
ubyte[][] with canFind() :: 27.5 minutes
ubyte[][] with countUntil() :: 27.5 minutes

Resultantly, the code above uses string[] with indexOf(). Tests 
were performed with an Intel(R) Xeon(R) Platinum 8259CL CPU   
2.50GHz.

I have additional questions/concerns/confusion surrounding the 
foreach() syntax I have had to apply above, but performance 
remains my chief immediate concern.

Jan 30 2021

Steven Schveighoffer <schveiguy gmail.com> writes:

On 1/30/21 6:13 PM, methonash wrote:
 Greetings all,
 
 Many thanks for sharing your collective perspective and advice thus far! 
 It has been very helpful and instructive. I return bearing live data and 
 a minimally complete, compilable, and executable program to experiment 
 with and potentially optimize. The dataset can be pulled from here:
 
 https://filebin.net/qf2km1ea9qgd5skp/seqs.fasta.gz?t=97kgpebg
 
 Running "cksum" on this file:
 
 1477520542 2199192 seqs.fasta.gz
 
 Naturally, you'll need to gunzip this file. The decompressed file 
 contains strings on every even-numbered line that have already been 
 reduced to the unique de-duplicated subset, and they have already been 
 sorted by descending length and alphabetical identity.
 

 finding/removing strings that can be entirely absorbed (substringed) by 
 their largest possible parent.
 
 And now for the code:
 
 
 import std.stdio : writefln, File, stdin;
 import std.conv : to;
 import std.string : indexOf;
 
 void main()
 {
          string[] seqs;
 
          foreach( line; stdin.byLine() )
          {
                  if( line[ 0 ] == '>' ) continue;
                  else seqs ~= to!string( line );
          }
 
          foreach( i; 0 .. seqs.length )
          {
                  if( seqs[ i ].length == 0 ) continue;
 
                  foreach( j; i + 1 .. seqs.length )
                  {
                          if( seqs[ j ].length == 0 ||
                              seqs[ i ].length ==
seqs[ j ].length ) 
 continue;
 
                          if( indexOf( seqs[ i ], seqs[
j ] ) > -1 )
                          {
                                  seqs[ j ] = "";
 
                                  writefln( "%s
contains %s", i, j );
                          }
                  }
          }
 }
 
 
 Compile the source and then run the executable via redirecting stdin:
 
 ./substr < seqs.fasta
 
 See any straightforward optimization paths here?
 
 For curiosity, I experimented with use of string[] and ubyte[][] and 
 several functions (indexOf, canFind, countUntil) to assess for the best 
 potential performer. My off-the-cuff results:
 
 string[] with indexOf() :: 26.5-27 minutes
 string[] with canFind() :: >28 minutes
 ubyte[][] with canFind() :: 27.5 minutes
 ubyte[][] with countUntil() :: 27.5 minutes
 
 Resultantly, the code above uses string[] with indexOf(). Tests were 
 performed with an Intel(R) Xeon(R) Platinum 8259CL CPU   2.50GHz.
 
 I have additional questions/concerns/confusion surrounding the foreach() 
 syntax I have had to apply above, but performance remains my chief 
 immediate concern.

The code looks pretty minimal.

I'd suggest trying it in reverse. If you have the sequence "cba", "ba", 
"a", then determining "a" is in "ba" is probably cheaper than 
determining "a" is in "cba".

Are you still convinced that it's possible to do it in under 2 seconds? 
That would seem a huge discrepancy. If not, what specifically are you 
looking for in terms of performance?

-Steve

Jan 30 2021

methonash <nope dne.net> writes:

On Sunday, 31 January 2021 at 00:53:05 UTC, Steven Schveighoffer 
wrote:
 I'd suggest trying it in reverse. If you have the sequence 
 "cba", "ba", "a", then determining "a" is in "ba" is probably 
 cheaper than determining "a" is in "cba".

I have user requirements that this application track string IDs 
that get collapsed under parents. To minimize/simplify array 
concatenations, I figured that going in descending order might 
make those operations less expensive.

Though I think this is also worth trying.

 Are you still convinced that it's possible to do it in under 2 
 seconds? That would seem a huge discrepancy. If not, what 
 specifically are you looking for in terms of performance?

Good question. As a sanity check, at some point this past week, I 
re-wrote everything in a single contained Perl program, and 
running that program on the dataset I've shared above took well 
over 2 hours of wall time.

Considering that the index() function in Perl is pre-compiled, I 
imagine that Dlang running 4-5x faster is a pretty good speed 
boost, as it may only be benefiting from the speed of compiled 
loops over interpreted loops.

What confuses me, at this point, is this: I originally wrote the 
D code using foreach in this style:

foreach( i, ref parentString; strings )
{
     foreach( j, ref childString; strings[ i + 1 .. $ ] )
     {
         // ...
     }
}

Essentially, the value of j printed to stdout should always be 
larger than the value of i. Yet, when running in the above style, 
the values of j reported to stdout inevitably become smaller than 
i, suggesting that the loop is somehow traversing backwards. How 
can this be explained?

Jan 31 2021

a11e99z <black80 bk.ru> writes:

On Sunday, 31 January 2021 at 16:50:15 UTC, methonash wrote:
 What confuses me, at this point, is this: I originally wrote 
 the D code using foreach in this style:

 foreach( i, ref parentString; strings )
     foreach( j, ref childString; strings[ i + 1 .. $ ] )

 Essentially, the value of j printed to stdout should always be 
 larger than the value of i.

j is counted from current begginning: from 0
strings[i+1..$][0] - its right behaviour

Jan 31 2021

CraigDillabaugh <craig.dillabaugh gmail.com> writes:

On Tuesday, 26 January 2021 at 23:57:43 UTC, methonash wrote:
clip
 That nested loop is an O(n^2) algorithm. Meaning it will slow 
 down *very* quickly as the size of the array n increases.  You 
 might want to think about how to improve this algorithm.

 Nice observation, and yes, this would typically be an O(n^2) 
 approach.

 However, due to subsetting the input dataset to unique strings 
 and then sorting in descending length, one might notice that 
 the inner foreach loop does not iterate over all of n, only on 
 the iterator value i+1 through the end of the array.

 Thus, I believe this would then become approximately O(n^2/2). 
 More precisely, it should be O( ( n^2 + n ) / 2 ).

But that is still O(n^2), you've only changed the constant.

Jan 30 2021

Steven Schveighoffer <schveiguy gmail.com> writes:

On 1/26/21 12:40 PM, methonash wrote:
 My first attempt to solve this problem space used a small Perl program 
 to perform steps 1 through 3, which would then pipe intermediate output 

 use of AAs) of ubyte[][] with use of countUntil().
 
 The Dlang code for the nested foreach block above is essentially 
 near-identical between my two Dlang implementations. Yet, the second 
 implementation--where I'm trying to solve the entire problem space in 
 D--has absolutely failed in terms of performance.
 
 Perl+D, ubyte[][], countUntil() :: under 2 seconds
 only D, string[], indexOf() :: ~6 minutes
 only D, ubyte[][], countUntil() :: >20 minutes

Maybe try a different approach.

Replace the perl code with D, and still have it output to the same small 
D program that processes the results. It seems from your description 
that everything is "identical" that if your conclusions are correct, it 
should be at least as fast as the Perl+D version.

But I think you are missing something else.

-Steve

Jan 26 2021

mw <mingwu gmail.com> writes:

On Tuesday, 26 January 2021 at 17:40:36 UTC, methonash wrote:
 foreach( i, ref pStr; sortedArr )
 {
     foreach( j, ref cStr; sortedArr[ i + 1 .. $ ] )
     {
         if( indexOf( pStr, cStr ) > -1 )
         {
             // ...
         }
     }
 }

 Before adding the code excerpt above, the Dlang program was 
 taking ~1 second on an input file containing approx. 64,000 
 strings.

What's the typical length of your strings?

Jan 26 2021

mw <mingwu gmail.com> writes:

On Tuesday, 26 January 2021 at 21:55:47 UTC, mw wrote:
 On Tuesday, 26 January 2021 at 17:40:36 UTC, methonash wrote:
 foreach( i, ref pStr; sortedArr )
 {
     foreach( j, ref cStr; sortedArr[ i + 1 .. $ ] )
     {
         if( indexOf( pStr, cStr ) > -1 )
         {
             // ... yourInnerOp
         }
     }
 }

 Before adding the code excerpt above, the Dlang program was 
 taking ~1 second on an input file containing approx. 64,000 
 strings.

 What's the typical length of your strings?

Actually, I think it's reasonable given your algo:

Your algo (double-loop) is O(n^2), n = 64,000

so the loop will run n^2 = 4,096,000,000 i.e 4G

Suppose your CPU is 2GHz, and suppose each loop operation take 
just 1 machine cycle (very unlikely), this algo will take 2 
seconds.

However, string searching i.e `indexOf`, or `yourInnerLoop` can 
easily take hundreds of cycles, let's suppose it's 100 machine 
cycles (still a very low estimate), then the algo will take ~200 
seconds = ~3.3 minutes.

If you want, you can try to rewrite your algo in Java or Python, 
and compare the run time with the Dlang version.

Jan 26 2021

D Programming

C/C++ Programming

Other

digitalmars.D.learn - 200-600x slower Dlang performance with nested foreach loop