digitalmars.D.learn - Create many objects using threads

"Caslav Sabani" <caslav.sabani gmail.com> writes:
Hi,


I have just started to learn D. It's a great language. I am trying
to achieve the following, but I am not sure whether it is possible
or whether it should be done at all:

I want to have one array in which I will store about 100000 objects.

But I want to use 4 threads, where each thread will create 25000
objects and store them in the array mentioned above. All 4 threads
should work in parallel because I have, for example, a 4-core
processor. I do not care in which order the objects are created,
and the objects do not need to be aware of one another. I just need
them stored in the array.

Can threading help in creating many objects at once?

Note that I am a beginner at working with threads, so any help is
welcome :)


Thanks
May 05 2014
Ali Çehreli <acehreli yahoo.com> writes:
On 05/05/2014 10:14 AM, Caslav Sabani wrote:

> I want to have one array where I will store like 100000 objects.
>
> But I want to use 4 threads where each thread will create 25000 objects
> and store them in array above mentioned.

1) If it has to be a single array, meaning that all of the objects are
in consecutive memory, you can create the array and give four slices of
it to the four tasks.

To do that, you can either create a proper D array filled with objects
with .init values, or you can allocate any type of memory and create
objects in place there.

2) If it doesn't have to be a single array, you can have the four tasks
create four separate arrays. You can then use them as a single range by
std.range.chain.

This option allows you to have a single array as well.

I would like to give examples of those methods later. Gotta go now... :)

> And all 4 threads should be working in parallel because I have 4 core
> processor for example. I do not care in which order objects are
> created nor objects should be aware of one another. I just need them
> stored in array.
>
> Can threading help in creating many objects at once? Note that I am
> beginner at working with threads so any help is welcome :) Thanks

I recommend looking at std.parallelism first and then std.concurrency.
Here are two chapters that may be helpful:

http://ddili.org/ders/d.en/parallelism.html

http://ddili.org/ders/d.en/concurrency.html

Ali
May 05 2014
Ali Çehreli <acehreli yahoo.com> writes:
On 05/05/2014 10:25 AM, Ali Çehreli wrote:

> On 05/05/2014 10:14 AM, Caslav Sabani wrote:
>
> > I want to have one array where I will store like 100000 objects.
> >
> > But I want to use 4 threads where each thread will create 25000 objects
> > and store them in array above mentioned.
>
> 1) If it has to be a single array, meaning that all of the objects are
> in consecutive memory, you can create the array and give four slices of
> it to the four tasks.
>
> To do that, you can either create a proper D array filled with objects
> with .init values; or you can allocate any type of memory and create
> objects in place there.

Here is an example:

    import std.stdio;
    import std.parallelism;
    import std.range;
    import core.thread;
    import std.conv;

    enum elementCount = 8;
    size_t elementPerThread;    // module-level variables are thread-local

    // Runs once in every thread, so each worker sets its own copy.
    static this ()
    {
        assert((elementCount % totalCPUs) == 0,
               "Cannot distribute tasks to cores evenly");

        elementPerThread = elementCount / totalCPUs;
    }

    void main()
    {
        auto arr = new int[](elementCount);

        // .parallel runs the loop body on the threads of the default task pool.
        foreach (i; totalCPUs.iota.parallel) {
            const beg = i * elementPerThread;
            const end = beg + elementPerThread;
            arr[beg .. end] = i.to!int;    // each iteration owns its own slice
        }

        thread_joinAll();    // (I don't think this is necessary with std.parallelism)

        writeln(arr);    // [0, 0, 1, 1, 2, 2, 3, 3] on a 4-core machine
    }

> 2) If it doesn't have to be a single array, you can have the four tasks
> create four separate arrays. You can then use them as a single range by
> std.range.chain.

That is a lie. :) chain would work, but it would have to know the total
number of cores at compile time. Instead, joiner or join can be used:

    import std.stdio;
    import std.parallelism;
    import std.algorithm;
    import std.array;
    import std.range;
    import core.thread;

    enum elementCount = 8;
    size_t elementPerThread;    // module-level variables are thread-local

    // Runs once in every thread, so each worker sets its own copy.
    static this ()
    {
        assert((elementCount % totalCPUs) == 0,
               "Cannot distribute tasks to cores evenly");

        elementPerThread = elementCount / totalCPUs;
    }

    void main()
    {
        auto arr = new int[][](totalCPUs);

        // Each iteration appends only to its own inner array.
        foreach (i; totalCPUs.iota.parallel) {
            foreach (e; 0 .. elementPerThread) {
                arr[i] ~= i;
            }
        }

        thread_joinAll();    // (I don't think this is necessary with std.parallelism)

        // The outputs below are for a 4-core machine.
        writeln(arr);           // [[0, 0], [1, 1], [2, 2], [3, 3]]

        writeln(arr.joiner);    // [0, 0, 1, 1, 2, 2, 3, 3]

        auto arr2 = arr.joiner.array;
        static assert(is (typeof(arr2) == int[]));
        writeln(arr2);          // [0, 0, 1, 1, 2, 2, 3, 3]

        auto arr3 = arr.join;
        static assert(is (typeof(arr3) == int[]));
        writeln(arr3);          // [0, 0, 1, 1, 2, 2, 3, 3]
    }

> This option allows you to have a single array as well.

arr2 and arr3 above are examples of that.

Ali
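
For class objects rather than ints, the slicing idea from option 1 could look like the sketch below. It is not taken from the posts: the Item class and the fillInParallel function are hypothetical names, and the work unit size handed to parallel is only one plausible choice. The point is that every array slot is written by exactly one loop iteration, so the threads never touch the same element. Whether this beats a plain loop has to be measured.

```d
import std.parallelism : parallel, totalCPUs;
import std.stdio : writeln;

// Hypothetical payload type, used only for illustration.
class Item
{
    size_t id;
    this(size_t id) { this.id = id; }
}

// Fills every slot of `items` with a freshly created object.
// Each slot is assigned by exactly one iteration of the parallel loop.
void fillInParallel(Item[] items)
{
    // Work unit size: how many consecutive elements a worker takes at a
    // time. Roughly one chunk per core here (an assumption, not a rule).
    const workUnit = items.length / totalCPUs + 1;

    foreach (i, ref item; parallel(items, workUnit))
    {
        item = new Item(i);
    }
}

void main()
{
    auto items = new Item[](100_000);   // 100000 null references
    fillInParallel(items);

    // Every slot must now hold the object created for its index.
    foreach (i, item; items)
        assert(item !is null && item.id == i);

    writeln("created ", items.length, " objects");
}
```

The array itself holds only references; the objects still come from the GC heap, so all workers share the same allocator.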
May 05 2014
"Caslav Sabani" <caslav.sabani gmail.com> writes:
Hi Ali,


Thanks for your reply. But I am struggling to understand from
your example where the code that creates or spawns a new thread is.


How do you create a new thread and fill the array with objects
instantiated in that thread?



Thanks
May 05 2014
Ali Çehreli <acehreli yahoo.com> writes:
On 05/05/2014 01:38 PM, Caslav Sabani wrote:

> I am struggling to understand from your example where is the code that
> creates or spawns new thread.

The .parallel in the foreach loop makes the loop body execute in
parallel.

> How do you create new thread and fill array with instantiated objects in
> that thread?

It is automatic in that example, but you can also create threads
explicitly with std.concurrency or core.thread.

Ali
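
A minimal sketch of the explicit route with core.thread might look like this. It is an illustration under assumptions, not code from the thread: Item, makeWorker and createWithThreads are hypothetical names, and the array is shared between threads without any `shared` qualifier, which core.thread allows but does not check.

```d
import core.thread : Thread;
import std.stdio : writeln;

// Hypothetical payload type, used only for illustration.
class Item
{
    size_t id;
    this(size_t id) { this.id = id; }
}

// Builds a thread that fills `slice` only. A separate function is used so
// that every delegate captures its own copy of `slice` and `firstId`.
Thread makeWorker(Item[] slice, size_t firstId)
{
    return new Thread({
        foreach (i, ref item; slice)
            item = new Item(firstId + i);
    });
}

// Creates `total` objects with `threadCount` explicitly started threads.
Item[] createWithThreads(size_t total, size_t threadCount)
{
    auto items = new Item[](total);
    const chunk = total / threadCount;
    Thread[] threads;

    foreach (t; 0 .. threadCount)
    {
        // Half-open range [beg, end) owned by thread t; the last thread
        // also takes the remainder when total is not evenly divisible.
        const beg = t * chunk;
        const end = (t + 1 == threadCount) ? total : beg + chunk;
        threads ~= makeWorker(items[beg .. end], beg);
    }

    foreach (th; threads) th.start();   // run all workers
    foreach (th; threads) th.join();    // wait until every slice is filled

    return items;
}

void main()
{
    auto items = createWithThreads(100_000, 4);

    foreach (i, item; items)
        assert(item !is null && item.id == i);

    writeln("created ", items.length, " objects on 4 threads");
}
```

The join loop matters: the array may only be read after every worker has finished writing its slice.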
May 05 2014
"Kapps" <opantm2+spam gmail.com> writes:
On Monday, 5 May 2014 at 17:14:54 UTC, Caslav Sabani wrote:
> I want to have one array where I will store like 100000 objects.
>
> But I want to use 4 threads where each thread will create 25000
> objects and store them in array above mentioned. [...]
>
> Can threading help in creating many objects at once?

I could be wrong here, but I think that the GC actually blocks when creating objects, and thus multiple threads creating instances would not provide a significant speedup, possibly even a slowdown. You'd want to benchmark this to be certain it helps.
May 05 2014
Ali Çehreli <acehreli yahoo.com> writes:
On 05/05/2014 02:38 PM, Kapps wrote:

> I think that the GC actually blocks when
> creating objects, and thus multiple threads creating instances would not
> provide a significant speedup, possibly even a slowdown.

Wow! That is the case. :)

> You'd want to benchmark this to be certain it helps.

I did:

    import std.range;
    import std.parallelism;

    class C
    {}

    void foo()
    {
        auto c = new C;
    }

    void main(string[] args)
    {
        enum totalElements = 10_000_000;

        if (args.length > 1) {
            foreach (i; iota(totalElements).parallel) {
                foo();
            }

        } else {
            foreach (i; iota(totalElements)) {
                foo();
            }
        }
    }

Typical run on my system for "-O -noboundscheck -inline":

    $ time ./deneme parallel

    real    0m4.236s
    user    0m4.325s
    sys     0m9.795s

    $ time ./deneme

    real    0m0.753s
    user    0m0.748s
    sys     0m0.003s

Ali
May 05 2014
"Caslav Sabani" <caslav.sabani gmail.com> writes:
Hi all,


Thanks for your reply. So basically, using threads in D to
create multiple instances of a class is actually slower.


But what exactly does it mean that the garbage collector blocks?
What does it block, and in which way?



Thanks
May 05 2014
Ali Çehreli <acehreli yahoo.com> writes:
On 05/05/2014 04:32 PM, Caslav Sabani wrote:

> So basically using threads in D for creating multiple instances of
> class is actually slower.

Not at all! That statement can be true only in certain programs. :)

Ali
May 05 2014
"hardcoremore" <caslav.sabani gmail.com> writes:
On Tuesday, 6 May 2014 at 03:26:52 UTC, Ali Çehreli wrote:
> On 05/05/2014 04:32 PM, Caslav Sabani wrote:
>
> > So basically using threads in D for creating multiple instances of
> > class is actually slower.
>
> Not at all! That statement can be true only in certain programs. :)
>
> Ali

But what exactly does it mean that the garbage collector blocks? What
does it block, and in which way?

And can I use threads to create multiple instances faster, or is that
just not possible?

Thanks
May 06 2014
Ali Çehreli <acehreli yahoo.com> writes:
On 05/06/2014 05:46 AM, hardcoremore wrote:

> But what does exactly means that Garbage Collector blocks? What
> does it blocks and in which way?

I know this much: The current GC that comes with the D runtime is a
single-threaded GC (a "stop-the-world" GC), meaning that all threads are
stopped while the GC is running a garbage collection cycle.

> And can I use threads to create multiple instance faster or that is just
> not possible?

My example program, which did nothing but construct objects on the GC
heap, cannot be an indicator of the performance of all multi-threaded
programs. In real programs there will be computation-intensive parts;
there will be parts blocked on I/O; etc. There is no way of knowing
without measuring.

Ali
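
As one way of "measuring", the sketch below times the same loop sequentially and with .parallel for a case where each element needs real computation and no GC allocation. It is only an illustration: the work function, the array size and the iteration count are made-up assumptions, and it uses StopWatch from std.datetime.stopwatch, which is where current D releases keep it. The timings depend entirely on the machine, so none are shown.

```d
import std.datetime.stopwatch : AutoStart, StopWatch;
import std.math : sqrt;
import std.parallelism : parallel;
import std.stdio : writefln;

// Hypothetical stand-in for "real work": CPU-bound, no GC allocation.
double work(size_t seed)
{
    double acc = 0;
    foreach (k; 1 .. 2_000)
        acc += sqrt(cast(double)(seed + k));
    return acc;
}

// Computes every element on the calling thread.
void fillSequential(double[] results)
{
    foreach (i, ref r; results)
        r = work(i);
}

// Computes the elements on the default task pool; each element is
// written by exactly one iteration.
void fillParallel(double[] results)
{
    foreach (i, ref r; parallel(results))
        r = work(i);
}

void main()
{
    auto a = new double[](200_000);
    auto b = new double[](200_000);

    auto sw = StopWatch(AutoStart.yes);
    fillSequential(a);
    const sequentialTime = sw.peek;

    sw.reset();                 // restart the measurement from zero
    fillParallel(b);
    const parallelTime = sw.peek;

    assert(a == b);             // both versions must compute the same values

    writefln("sequential: %s", sequentialTime);
    writefln("parallel:   %s", parallelTime);
}
```

Swapping the body of work for `new C` turns this back into the allocation-only benchmark from earlier in the thread, which makes it easy to compare the two kinds of workload on the same machine.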
May 06 2014
"Kapps" <opantm2+spam gmail.com> writes:
On Monday, 5 May 2014 at 22:11:39 UTC, Ali Çehreli wrote:
> On 05/05/2014 02:38 PM, Kapps wrote:
>
> > I think that the GC actually blocks when creating objects, and thus
> > multiple threads creating instances would not provide a significant
> > speedup, possibly even a slowdown.
>
> Wow! That is the case. :)
>
> > You'd want to benchmark this to be certain it helps.
>
> I did: snip

Huh, that's a much, much higher impact than I'd expected. I tried with
GDC as well (the one in Debian stable, which is unfortunately still
2.055...) and got similar results. I also tried creating only totalCPUs
threads and having each of them create NUM_ELEMENTS / totalCPUs objects,
rather than risking that each creation was a task, and it still seems to
be the same.

Using malloc and emplace instead of new, results are about 50% faster
for single-threaded and ~3-4 times faster for multi-threaded (4 CPU,
8 thread machine, Linux 64-bit). The multi-threaded version is still
twice as slow though. On my Windows laptop (with the program compiled
for 32-bit), it did not make a significant difference, and the
multi-threaded version is still 4 times slower.

That being said, I think most malloc implementations, while being
thread-safe, usually use locks or do not scale well.

Code:

    import std.range;
    import std.parallelism;
    import std.datetime;
    import std.stdio;
    import core.stdc.stdlib;
    import std.conv;

    class C {}

    void foo() {
        //auto c = new C;
        // Take raw memory from C's malloc and construct the object in place,
        // bypassing the GC. The memory is never freed in this benchmark.
        enum size = __traits(classInstanceSize, C);
        void[] mem = malloc(size)[0..size];
        emplace!C(mem);
    }

    void createFoos(size_t count) {
        foreach(i; 0 .. count) {
            foo();
        }
    }

    void main(string[] args) {
        StopWatch sw = StopWatch(AutoStart.yes);
        enum totalElements = 10_000_000;
        if (args.length <= 1) {
            // No argument: single-threaded.
            foreach (i; iota(totalElements)) {
                foo();
            }
        } else if(args[1] == "tasks") {
            // Parallel foreach over all elements.
            foreach (i; parallel(iota(totalElements))) {
                foo();
            }
        } else if(args[1] == "parallel") {
            // One task per CPU, each creating an equal share.
            for(int i = 0; i < totalCPUs; i++) {
                taskPool.put(task(&createFoos, totalElements / totalCPUs));
            }
            taskPool.finish(true);
        } else
            writeln("Unknown argument '", args[1], "'.");
        sw.stop();
        writeln(cast(Duration)sw.peek);
    }

Results (Linux 64-bit):

    shardsoft:~$ dmd -O -inline -release test.d
    shardsoft:~$ ./test
    552 ms, 729 μs, and 7 hnsecs
    shardsoft:~$ ./test
    532 ms, 139 μs, and 5 hnsecs
    shardsoft:~$ ./test tasks
    1 sec, 171 ms, 126 μs, and 4 hnsecs
    shardsoft:~$ ./test tasks
    1 sec, 38 ms, 468 μs, and 6 hnsecs
    shardsoft:~$ ./test parallel
    1 sec, 146 ms, 738 μs, and 2 hnsecs
    shardsoft:~$ ./test parallel
    1 sec, 268 ms, 195 μs, and 3 hnsecs
May 06 2014
"Kapps" <opantm2+spam gmail.com> writes:
On Tuesday, 6 May 2014 at 15:56:11 UTC, Kapps wrote:
> On Monday, 5 May 2014 at 22:11:39 UTC, Ali Çehreli wrote:
>
> > I did: snip
>
> Huh, that's a much, much higher impact than I'd expected. I tried with
> GDC as well (the one in Debian stable, which is unfortunately still
> 2.055...) and got similar results. I also tried creating only totalCPUs
> threads and having each of them create NUM_ELEMENTS / totalCPUs objects
> rather than risking that each creation was a task, and it still seems
> to be the same.
>
> snip

I tried using an allocator that never releases memory, rounds up to a
power of 2, and is lock-free. The results are quite a bit better.

    shardsoft:~$ ./test
    1 sec, 47 ms, 474 μs, and 4 hnsecs
    shardsoft:~$ ./test
    1 sec, 43 ms, 588 μs, and 2 hnsecs
    shardsoft:~$ ./test tasks
    692 ms, 769 μs, and 8 hnsecs
    shardsoft:~$ ./test tasks
    692 ms, 686 μs, and 8 hnsecs
    shardsoft:~$ ./test parallel
    691 ms, 856 μs, and 9 hnsecs
    shardsoft:~$ ./test parallel
    690 ms, 22 μs, and 3 hnsecs

I get similar results on my laptop (which is much faster than the
results I got on it using DMD's malloc):

    test
    1 sec, 125 ms, and 847 μs
    test
    1 sec, 125 ms, 741 μs, and 6 hnsecs
    test tasks
    556 ms, 613 μs, and 8 hnsecs
    test tasks
    552 ms and 287 μs
    test parallel
    554 ms, 542 μs, and 6 hnsecs
    test parallel
    551 ms, 514 μs, and 9 hnsecs

Code: http://pastie.org/9146326

Unfortunately it doesn't compile with the ancient version of gdc
available in Debian, so I couldn't test with that. The results should be
quite a bit better since core.atomic would be faster. And frankly, I'm
not sure if the allocator actually works properly, but it's just for
testing purposes anyways.
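
The linked code is not reproduced here, so the following is only a guess at what an allocator with those three properties could look like, not the code that was benchmarked. The names arena, used, roundUpPow2 and bumpAlloc, the 64 MiB arena size and the 16-byte minimum block are all assumptions. It hands out blocks by atomically bumping a shared offset with core.atomic and never frees anything.

```d
import core.atomic : atomicOp;
import core.stdc.stdlib : malloc;
import std.conv : emplace;
import std.parallelism : parallel;
import std.range : iota;

class C {}

enum size_t arenaSize = 64 * 1024 * 1024;   // assumed memory budget
__gshared ubyte* arena;     // set once in main before any worker runs
shared size_t used;         // bytes handed out so far

// Rounds n up to a power of two, with a minimum of 16 so that every
// block starts on a 16-byte boundary within the arena.
size_t roundUpPow2(size_t n)
{
    size_t p = 16;
    while (p < n)
        p <<= 1;
    return p;
}

// Lock-free allocation: a single atomic add reserves the block.
// Memory is never released. Returns null when the arena is exhausted.
void[] bumpAlloc(size_t size)
{
    const n = roundUpPow2(size);

    // atomicOp returns the value after the addition, so the block this
    // caller owns begins n bytes before that value.
    const end = atomicOp!"+="(used, n);
    if (end > arenaSize)
        return null;

    const start = end - n;
    return arena[start .. start + size];
}

void main()
{
    arena = cast(ubyte*) malloc(arenaSize);
    assert(arena !is null, "could not allocate the arena");

    enum size = __traits(classInstanceSize, C);

    foreach (i; parallel(iota(1_000_000)))
    {
        auto mem = bumpAlloc(size);
        assert(mem !is null, "arena exhausted");
        emplace!C(mem);     // construct the object in the reserved block
    }
}
```

Because no lock is taken, threads only contend on the single atomic counter. The price is that nothing can be freed individually, and objects placed here must not hold the only reference to GC-managed memory, since the GC does not scan the arena.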
May 06 2014