digitalmars.D.learn - Create many objects using threads

"Caslav Sabani" <caslav.sabani gmail.com> writes:

Hi,


I have just started to learn D. It's a great language. I am trying 
to achieve the following, but I am not sure whether it is possible or 
whether it should be done at all:

I want to have one array in which I will store about 100000 objects.

But I want to use 4 threads, where each thread creates 25000 
objects and stores them in the array mentioned above. All 4 
threads should work in parallel because I have a 4-core 
processor, for example. I do not care in which order the objects are 
created, and the objects do not need to be aware of one another. I just 
need them stored in the array.

Can threading help in creating many objects at once?

Note that I am a beginner at working with threads, so any help is 
welcome :)


Thanks

May 05 2014

Ali Çehreli <acehreli yahoo.com> writes:

On 05/05/2014 10:14 AM, Caslav Sabani wrote:

> I want to have one array where I will store like 100000  objects.
>
> But I want to use 4 threads where each thread will create 25000 objects
> and store them in array above mentioned.

1) If it has to be a single array, meaning that all of the objects are 
in consecutive memory, you can create the array and give four slices of 
it to the four tasks.

To do that, you can either create a proper D array filled with objects 
with .init values; or you can allocate any type of memory and create 
objects in place there.

2) If it doesn't have to be a single array, you can have the four tasks 
create four separate arrays. You can then use them as a single range with 
std.range.chain. This option allows you to have a single array as well.

I would like to give examples of those methods later. Gotta go now... :)

> And all 4 threads should be working in parallel because I have 4 core
> processor for example. I do not care in which order objects are created
> nor objects should be aware of one another. I just need them stored in
> array. Can threading help in creating many objects at once?  Note that I
> am beginner at working with threads so any help is welcome :) Thanks

I recommend looking at std.parallelism first and then std.concurrency. 
Here are two chapters that may be helpful:

   http://ddili.org/ders/d.en/parallelism.html

   http://ddili.org/ders/d.en/concurrency.html

Ali

May 05 2014

Ali Çehreli <acehreli yahoo.com> writes:

On 05/05/2014 10:25 AM, Ali Çehreli wrote:

> On 05/05/2014 10:14 AM, Caslav Sabani wrote:
>
> > I want to have one array where I will store like 100000  objects.
> >
> > But I want to use 4 threads where each thread will create 25000 objects
> > and store them in array above mentioned.
>
> 1) If it has to be a single array, meaning that all of the objects are
> in consecutive memory, you can create the array and give four slices of
> it to the four tasks.
>
> To do that, you can either create a proper D array filled with objects
> with .init values; or you can allocate any type of memory and create
> objects in place there.

Here is an example:

import std.stdio;
import std.parallelism;
import std.range;
import core.thread;
import std.conv;

enum elementCount = 8;
size_t elementPerThread;

static this ()
{
     assert((elementCount % totalCPUs) == 0,
            "Cannot distribute tasks to cores evenly");

     elementPerThread = elementCount / totalCPUs;
}

void main()
{
     auto arr = new int[](elementCount);

     // .parallel hands the iterations to the worker threads of taskPool;
     // each iteration fills only its own slice of the array.
     foreach (i; iota(totalCPUs).parallel) {
         const beg = i * elementPerThread;
         const end = beg + elementPerThread;
         arr[beg .. end] = i.to!int;
     }

     thread_joinAll();    // (I don't think this is necessary with std.parallelism)

     writeln(arr);    // [0, 0, 1, 1, 2, 2, 3, 3] on a system with 4 cores
}
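
The example above fills an array of int. For class objects, which is what the original question asks about, the same idea might look like the following minimal sketch. The names Item, id, and createItems are made up for illustration and do not come from the thread; the sketch assumes that each element of the array is written by exactly one iteration.

```d
import std.parallelism : parallel;
import std.stdio : writeln;

// Hypothetical class, used only for illustration.
class Item
{
    size_t id;
    this(size_t id) { this.id = id; }
}

// Creates `count` objects and stores them in a single array.
Item[] createItems(size_t count)
{
    // One array of class references; every element is null at first.
    auto items = new Item[](count);

    // Each iteration writes only to its own element, so the iterations
    // need no extra synchronization among themselves.
    foreach (i, ref item; parallel(items)) {
        item = new Item(i);
    }

    return items;
}

void main()
{
    auto items = createItems(100_000);
    writeln(items.length);      // 100000
    writeln(items[$ - 1].id);   // 99999
}
```

Note that the array holds references only. The objects themselves still come from the GC heap one by one, so whether this is faster than a plain loop has to be measured.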

> 2) If it doesn't have to a single array, you can have the four tasks
> create four separate arrays. You can then use them as a single range by
> std.range.chain.

That is a lie. :) chain would work, but it would have to know the number 
of arrays (one per core) at compile time. Instead, joiner or join can be used:

import std.stdio;
import std.parallelism;
import std.algorithm;    // joiner
import std.range;        // iota, array, join
import core.thread;

enum elementCount = 8;
size_t elementPerThread;

static this ()
{
     assert((elementCount % totalCPUs) == 0,
            "Cannot distribute tasks to cores evenly");

     elementPerThread = elementCount / totalCPUs;
}

void main()
{
     auto arr = new int[][](totalCPUs);

     // Each parallel iteration appends only to its own inner array.
     foreach (i; iota(totalCPUs).parallel) {
         foreach (e; 0 .. elementPerThread) {
             arr[i] ~= i;
         }
     }

     thread_joinAll();    // (I don't think this is necessary with std.parallelism)

     // The outputs below are for a system with 4 cores.
     writeln(arr);             // [[0, 0], [1, 1], [2, 2], [3, 3]]

     // joiner presents the inner arrays lazily as a single range.
     writeln(arr.joiner);      // [0, 0, 1, 1, 2, 2, 3, 3]

     // .array copies the joined range into one new array.
     auto arr2 = arr.joiner.array;
     static assert(is (typeof(arr2) == int[]));
     writeln(arr2);           // [0, 0, 1, 1, 2, 2, 3, 3]

     // join concatenates eagerly in one step.
     auto arr3 = arr.join;
     static assert(is (typeof(arr3) == int[]));
     writeln(arr3);           // [0, 0, 1, 1, 2, 2, 3, 3]
}

> This option allows you to have a single array as well.

arr2 and arr3 above are examples of that.
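
The same separate-arrays idea can be sketched with class objects and explicit tasks. This is only an illustration under assumptions: Item, makeBatch, and createInBatches are hypothetical names, and the sketch relies on task, taskPool.put, and yieldForce from std.parallelism plus join from std.array.

```d
import std.array : join;
import std.parallelism : task, taskPool;
import std.stdio : writeln;

// Hypothetical class, used only for illustration.
class Item
{
    size_t id;
    this(size_t id) { this.id = id; }
}

// Builds one batch of objects with ids first, first + 1, ...
Item[] makeBatch(size_t first, size_t count)
{
    auto batch = new Item[](count);
    foreach (k, ref item; batch) {
        item = new Item(first + k);
    }
    return batch;
}

// Starts one task per batch, then concatenates the results in order.
Item[] createInBatches(size_t batchCount, size_t perBatch)
{
    alias BatchTask = typeof(task(&makeBatch, size_t.init, size_t.init));
    BatchTask[] tasks;

    foreach (b; 0 .. batchCount) {
        auto t = task(&makeBatch, b * perBatch, perBatch);
        taskPool.put(t);          // hand the task to a worker thread
        tasks ~= t;
    }

    Item[][] parts;
    foreach (t; tasks) {
        parts ~= t.yieldForce;    // wait for the task and take its array
    }

    return parts.join;            // one Item[] holding every object
}

void main()
{
    auto items = createInBatches(4, 25_000);
    writeln(items.length);        // 100000
}
```

Because each task owns its array until yieldForce returns it, no two threads ever write to the same array.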

Ali

May 05 2014

"Caslav Sabani" <caslav.sabani gmail.com> writes:

Hi Ali,


Thanks for your reply. But I am struggling to understand from 
your example where the code that creates or spawns a new thread is.


How do you create a new thread and fill the array with instantiated 
objects in that thread?



Thanks

May 05 2014

Ali Çehreli <acehreli yahoo.com> writes:

On 05/05/2014 01:38 PM, Caslav Sabani wrote:

> I am struggling to understand from your example where is the code that
> creates or spawns new thread.

The .parallel in the foreach loop makes the body of the loop execute 
in parallel.

> How do you create new thread and fill array with instantiated objects in
> that thread?

It is automatic in that example, but you can also create threads 
explicitly with std.concurrency or core.thread.
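
For comparison, here is a minimal sketch of the explicit route with core.thread. Item, makeWorker, and createWithThreads are hypothetical names chosen for this illustration. The helper function makeWorker is there so that every thread's delegate captures its own slice and starting id rather than variables shared by all loop iterations.

```d
import core.thread : Thread;
import std.stdio : writeln;

// Hypothetical class, used only for illustration.
class Item
{
    size_t id;
    this(size_t id) { this.id = id; }
}

// Returns a thread (not yet started) that fills `slice` with objects.
// Being a separate function, each call gets its own `slice` and `first`.
Thread makeWorker(Item[] slice, size_t first)
{
    return new Thread({
        foreach (k, ref item; slice) {
            item = new Item(first + k);
        }
    });
}

Item[] createWithThreads(size_t threadCount, size_t perThread)
{
    auto items = new Item[](threadCount * perThread);
    Thread[] workers;

    // Give every thread a distinct slice of the same array and start it.
    foreach (t; 0 .. threadCount) {
        const first = t * perThread;
        workers ~= makeWorker(items[first .. first + perThread], first).start();
    }

    // Wait until every thread has finished before using the array.
    foreach (w; workers) {
        w.join();
    }

    return items;
}

void main()
{
    auto items = createWithThreads(4, 25_000);
    writeln(items.length);      // 100000
}
```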

Ali

May 05 2014

"Kapps" <opantm2+spam gmail.com> writes:

On Monday, 5 May 2014 at 17:14:54 UTC, Caslav Sabani wrote:
> Hi,
>
> I have just started to learn D. Its a great language. I am
> trying to achieve the following but I am not sure is it
> possible or should be done at all:
>
> I want to have one array where I will store like 100000
> objects.
>
> But I want to use 4 threads where each thread will create 25000
> objects and store them in array above mentioned. And all 4
> threads should be working in parallel because I have 4 core
> processor for example. I do not care in which order objects are
> created nor objects should be aware of one another. I just need
> them stored in array.
>
> Can threading help in creating many objects at once?
>
> Note that I am beginner at working with threads so any help is
> welcome :)
>
> Thanks

I could be wrong here, but I think that the GC actually blocks 
when creating objects, and thus multiple threads creating 
instances would not provide a significant speedup, possibly even 
a slowdown.

You'd want to benchmark this to be certain it helps.

May 05 2014

Ali Çehreli <acehreli yahoo.com> writes:

On 05/05/2014 02:38 PM, Kapps wrote:

> I think that the GC actually blocks when creating objects, and thus
> multiple threads creating instances would not provide a significant
> speedup, possibly even a slowdown.

Wow! That is the case. :)

> You'd want to benchmark this to be certain it helps.

I did:

import std.range;
import std.parallelism;

class C
{}

void foo()
{
     auto c = new C;
}

void main(string[] args)
{
     enum totalElements = 10_000_000;

     if (args.length > 1) {
         foreach (i; iota(totalElements).parallel) {
             foo();
         }

     } else {
         foreach (i; iota(totalElements)) {
             foo();
         }
     }
}

Typical run on my system for "-O -noboundscheck -inline":

$ time ./deneme parallel

real	0m4.236s
user	0m4.325s
sys	0m9.795s

$ time ./deneme

real	0m0.753s
user	0m0.748s
sys	0m0.003s

Ali

May 05 2014

"Caslav Sabani" <caslav.sabani gmail.com> writes:

Hi all,


Thanks for your reply. So basically, using threads in D for 
creating multiple instances of a class is actually slower.


But what exactly does it mean that the garbage collector blocks? What 
does it block, and in what way?



Thanks

May 05 2014

Ali Çehreli <acehreli yahoo.com> writes:

On 05/05/2014 04:32 PM, Caslav Sabani wrote:

> So basically using threads in D for creating multiple instances of class
> is actually slower.

Not at all! That statement can be true only in certain programs. :)

Ali

May 05 2014

"hardcoremore" <caslav.sabani gmail.com> writes:

On Tuesday, 6 May 2014 at 03:26:52 UTC, Ali Çehreli wrote:
> On 05/05/2014 04:32 PM, Caslav Sabani wrote:
>
> > So basically using threads in D for creating multiple instances of
> > class is actually slower.
>
> Not at all! That statement can be true only in certain programs. :)
>
> Ali


But what exactly does it mean that the garbage collector blocks? What
does it block, and in what way?


And can I use threads to create multiple instances faster, or is that 
just not possible?



Thanks

May 06 2014

Ali Çehreli <acehreli yahoo.com> writes:

On 05/06/2014 05:46 AM, hardcoremore wrote:

> But what does exactly means that Garbage Collector blocks? What
> does it blocks and in which way?

I know this much: The current GC that comes with the D runtime is a 
"stop-the-world" GC, meaning that all threads are stopped while the GC 
is running a garbage collection cycle. Allocating from the GC also takes 
a global lock, so threads that allocate at the same time have to wait 
for one another.
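
One way to check how much of the cost comes from collection cycles is to switch automatic collections off around the allocation burst. The following is only a sketch under that assumption: createMany is a hypothetical helper, GC.disable and GC.enable come from core.memory, and the runtime may still collect if it runs out of memory. It does not remove the cost of the allocations themselves.

```d
import core.memory : GC;
import std.stdio : writeln;

class C
{}

// Hypothetical helper: creates `count` objects with automatic
// collections suppressed for the duration of the loop.
C[] createMany(size_t count)
{
    auto objects = new C[](count);

    GC.disable();               // no automatic collection cycles from here on
    scope (exit) GC.enable();   // restore normal behavior on the way out

    foreach (ref obj; objects) {
        obj = new C;
    }

    return objects;
}

void main()
{
    auto objects = createMany(1_000_000);
    writeln(objects.length);    // 1000000
}
```

Timing a program with and without the GC.disable call gives a rough idea of whether collection pauses or the allocations themselves dominate.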

> And can I use threads to create multiple instance faster or that is just
> not possible?

My example program that did nothing but constructed objects on the GC 
heap cannot be an indicator of the performance of all multi-threaded 
programs. In real programs there will be computation-intensive parts; 
there will be parts blocked on I/O; etc. There is no way of knowing 
without measuring.
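
As an illustration of measuring, here is a small sketch in which every object needs some computation before it is created. Result, expensive, buildSerial, and buildParallel are hypothetical names, and the loop in expensive is just a stand-in for real work. No timings are shown because they depend entirely on the machine.

```d
import std.datetime.stopwatch : AutoStart, StopWatch;
import std.math : sqrt;
import std.parallelism : parallel;
import std.stdio : writeln;

// Hypothetical class, used only for illustration.
class Result
{
    double value;
    this(double value) { this.value = value; }
}

// Stand-in for a computation-intensive step.
double expensive(size_t i)
{
    double sum = 0;
    foreach (k; 1 .. 5_000) {
        sum += sqrt(cast(double)(i + k));
    }
    return sum;
}

// Computes and creates every object on the calling thread.
Result[] buildSerial(size_t count)
{
    auto results = new Result[](count);
    foreach (i, ref r; results) {
        r = new Result(expensive(i));
    }
    return results;
}

// Same work, but the iterations are spread over the task pool.
Result[] buildParallel(size_t count)
{
    auto results = new Result[](count);
    foreach (i, ref r; parallel(results)) {
        r = new Result(expensive(i));
    }
    return results;
}

void main()
{
    enum count = 20_000;

    auto sw = StopWatch(AutoStart.yes);
    auto a = buildSerial(count);
    writeln("serial:   ", sw.peek);

    sw.reset();
    auto b = buildParallel(count);
    writeln("parallel: ", sw.peek);
}
```

Varying the amount of work in expensive shows how the balance between computation and allocation changes the outcome.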

Ali

May 06 2014

"Kapps" <opantm2+spam gmail.com> writes:

On Monday, 5 May 2014 at 22:11:39 UTC, Ali Çehreli wrote:
> On 05/05/2014 02:38 PM, Kapps wrote:
>
> > I think that the GC actually blocks when creating objects, and thus
> > multiple threads creating instances would not provide a significant
> > speedup, possibly even a slowdown.
>
> Wow! That is the case. :)
>
> > You'd want to benchmark this to be certain it helps.
>
> I did:
>
> snip

Huh, that's a much, much higher impact than I'd expected.
I tried with GDC as well (the one in Debian stable, which is 
unfortunately still 2.055...) and got similar results. I also 
tried creating only totalCPUs threads and having each of them 
create NUM_ELEMENTS / totalCPUs objects rather than risking that 
each creation was a task, and it still seems to be the same.

Using malloc and emplace instead of new, results are about 50% 
faster for single-threaded and ~3-4 times faster for 
multi-threaded (4 CPU, 8 thread machine, Linux 64-bit). The 
multi-threaded version is still twice as slow as the 
single-threaded one, though. On my Windows laptop (with the 
program compiled for 32-bit), it did not make a significant 
difference and the multi-threaded version is still 4 times slower.

That being said, I think most malloc implementations, while being 
thread-safe, usually use locks or do not scale well.

Code:
import std.range;
import std.parallelism;
import std.datetime.stopwatch;    // StopWatch lives here in current Phobos
import std.stdio;
import core.stdc.stdlib;
import std.conv;

class C {}

void foo() {
     //auto c = new C;
     // Allocate raw memory with malloc and construct the object in place.
     // The memory is never freed; this is only a benchmark.
     enum size = __traits(classInstanceSize, C);
     void[] mem = malloc(size)[0..size];
     emplace!C(mem);
}

void createFoos(size_t count) {
     foreach(i; 0 .. count) {
         foo();
     }
}

void main(string[] args) {
     StopWatch sw = StopWatch(AutoStart.yes);
     enum totalElements = 10_000_000;
     if (args.length <= 1) {
         // No argument: create everything on the main thread.
         foreach (i; iota(totalElements)) {
             foo();
         }
     } else if(args[1] == "tasks") {
         // "tasks": let parallel foreach split the iterations.
         foreach (i; parallel(iota(totalElements))) {
             foo();
         }
     } else if(args[1] == "parallel") {
         // "parallel": one task per core, each creating an equal share.
         for(int i = 0; i < totalCPUs; i++) {
             taskPool.put(task(&createFoos, totalElements / totalCPUs));
         }
         taskPool.finish(true);    // block until all tasks are done
     } else
         writeln("Unknown argument '", args[1], "'.");
     sw.stop();
     writeln(sw.peek);
}

Results (Linux 64-bit):
shardsoft:~$ dmd -O -inline -release test.d
shardsoft:~$ ./test
552 ms, 729 μs, and 7 hnsecs
shardsoft:~$ ./test
532 ms, 139 μs, and 5 hnsecs
shardsoft:~$ ./test tasks
1 sec, 171 ms, 126 μs, and 4 hnsecs
shardsoft:~$ ./test tasks
1 sec, 38 ms, 468 μs, and 6 hnsecs
shardsoft:~$ ./test parallel
1 sec, 146 ms, 738 μs, and 2 hnsecs
shardsoft:~$ ./test parallel
1 sec, 268 ms, 195 μs, and 3 hnsecs

May 06 2014

"Kapps" <opantm2+spam gmail.com> writes:

On Tuesday, 6 May 2014 at 15:56:11 UTC, Kapps wrote:
> Huh, that's a much, much, higher impact than I'd expected.
> I tried with GDC as well (the one in Debian stable, which is
> unfortunately still 2.055...) and got similar results. I also
> tried creating only totalCPUs threads and having each of them
> create NUM_ELEMENTS / totalCPUs objects rather than risking
> that each creation was a task, and it still seems to be the
> same.
>
> snip

I tried with using an allocator that never releases memory, 
rounds up to a power of 2, and is lock-free. The results are 
quite a bit better.
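
The allocator code itself is only available at the link further down, so the following is not that code. It is a minimal sketch of what an allocator with those three properties could look like, assuming a single preallocated block and one atomic counter. BumpAllocator, pool, roundUp, and create are hypothetical names.

```d
import core.atomic : atomicLoad, atomicOp, atomicStore;
import core.stdc.stdlib : malloc;
import std.conv : emplace;
import std.parallelism : parallel;
import std.range : iota;
import std.stdio : writeln;

// Hypothetical allocator: hands out pieces of one big block and never
// releases anything.
struct BumpAllocator
{
    private ubyte* base;
    private size_t capacity;
    private shared size_t offset;    // bytes handed out so far

    void initialize(size_t bytes)
    {
        base = cast(ubyte*) malloc(bytes);
        capacity = base is null ? 0 : bytes;
        atomicStore(offset, cast(size_t) 0);
    }

    // Smallest power of two that is >= n, but never less than 16 so that
    // every block starts at a multiple of 16 bytes from the base.
    static size_t roundUp(size_t n)
    {
        size_t p = 16;
        while (p < n) p <<= 1;
        return p;
    }

    void[] allocate(size_t size)
    {
        const rounded = roundUp(size);
        // A single atomic add reserves the block; no lock is taken.
        const end = atomicOp!"+="(offset, rounded);
        if (end > capacity) return null;    // block exhausted
        const start = end - rounded;
        return base[start .. start + size];
    }

    size_t used() { return atomicLoad(offset); }
}

// One allocator shared by all threads.
__gshared BumpAllocator pool;

class C {}

C create()
{
    enum size = __traits(classInstanceSize, C);
    auto mem = pool.allocate(size);
    assert(mem !is null, "pool exhausted");
    return emplace!C(mem);    // construct the object in the reserved block
}

void main()
{
    enum total = 1_000_000;
    pool.initialize(total * 16);

    foreach (i; parallel(iota(total))) {
        create();
    }

    writeln(pool.used, " bytes handed out");
}
```

Objects created this way are invisible to the GC and are never destroyed, which is acceptable for a benchmark but not for general use.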

shardsoft:~$ ./test
1 sec, 47 ms, 474 μs, and 4 hnsecs
shardsoft:~$ ./test
1 sec, 43 ms, 588 μs, and 2 hnsecs
shardsoft:~$ ./test tasks
692 ms, 769 μs, and 8 hnsecs
shardsoft:~$ ./test tasks
692 ms, 686 μs, and 8 hnsecs
shardsoft:~$ ./test parallel
691 ms, 856 μs, and 9 hnsecs
shardsoft:~$ ./test parallel
690 ms, 22 μs, and 3 hnsecs

I get similar results on my laptop (which is much faster than the 
results I got on it using DMD's malloc):
test
1 sec, 125 ms, and 847 μs
test
1 sec, 125 ms, 741 μs, and 6 hnsecs
test tasks
556 ms, 613 μs, and 8 hnsecs
test tasks
552 ms and 287 μs
test parallel
554 ms, 542 μs, and 6 hnsecs
test parallel
551 ms, 514 μs, and 9 hnsecs


Code:
http://pastie.org/9146326

Unfortunately it doesn't compile with the ancient version of gdc 
available in Debian, so I couldn't test with that. The results 
should be quite a bit better since core.atomic would be faster. 
And frankly, I'm not sure if the allocator actually works 
properly, but it's just for testing purposes anyways.

May 06 2014
