
digitalmars.D - Testing some singleton implementations

reply Andrej Mitrovic <andrej.mitrovich gmail.com> writes:
There was a nice blog-post about implementing low-lock singletons in D, here:
http://davesdprogramming.wordpress.com/2013/05/06/low-lock-singletons/

One suggestion on Reddit was by dawgfoto (I think this is Martin
Nowak?), to use atomic primitives instead:
http://www.reddit.com/r/programming/comments/1droaa/lowlock_singletons_in_d_the_singleton_pattern/c9tmz07

I wanted to benchmark these different approaches. I was expecting
Martin's implementation to be the fastest one, but on my machine
(Athlon II X4 620 - 2.61GHz) the implementation in the blog post turns
out to be the fastest one. I'm wondering whether my test case is
flawed in some way. Btw, I think we should put an implementation of
this into Phobos.

The timings on my machine:

Test time for LockSingleton: 542 msecs.
Test time for SyncSingleton: 20 msecs.
Test time for AtomicSingleton: 755 msecs.

Here's the code:

http://codepad.org/TMb0xxYw

And pasted below for convenience:

-----
module singleton;

import std.concurrency;
import core.atomic;
import core.thread;

class LockSingleton
{
    static LockSingleton get()
    {
        __gshared LockSingleton _instance;

        synchronized
        {
            if (_instance is null)
                _instance = new LockSingleton;
        }

        return _instance;
    }

private:
    this() { }
}

class SyncSingleton
{
    static SyncSingleton get()
    {
        static bool _instantiated;  // tls
        __gshared SyncSingleton _instance;

        if (!_instantiated)
        {
            synchronized
            {
                if (_instance is null)
                    _instance = new SyncSingleton;

                _instantiated = true;
            }
        }

        return _instance;
    }

private:
    this() { }
}

class AtomicSingleton
{
    static AtomicSingleton get()
    {
        shared bool _instantiated;
        __gshared AtomicSingleton _instance;

        // only enter synchronized block if not instantiated
        if (!atomicLoad!(MemoryOrder.acq)(_instantiated))
        {
            synchronized
            {
                if (_instance is null)
                    _instance = new AtomicSingleton;

                atomicStore!(MemoryOrder.rel)(_instantiated, true);
            }
        }

        return _instance;
    }
}

version (unittest)
{
    ulong _thread_call_count;  // TLS
}

unittest
{
    import std.datetime;
    import std.stdio;
    import std.string;
    import std.typetuple;

    foreach (TestClass; TypeTuple!(LockSingleton, SyncSingleton, AtomicSingleton))
    {
        // mixin to avoid multiple definition errors
        mixin(q{

        static void test_%1$s()
        {
            foreach (i; 0 .. 1024_000)
            {
                // just trying to keep the compiler from doing dead-code optimization
                _thread_call_count += (TestClass.get() !is null);
            }
        }

        auto sw = StopWatch(AutoStart.yes);

        enum threadCount = 4;
        foreach (i; 0 .. threadCount)
            spawn(&test_%1$s);
        thread_joinAll();

        }.format(TestClass.stringof));

        sw.stop();
        writefln("Test time for %s: %s msecs.", TestClass.stringof,
sw.peek.msecs);
    }
}

void main() { }
-----
Jan 31 2014
next sibling parent reply "Stanislav Blinov" <stanislav.blinov gmail.com> writes:
You forgot to make the flag static for AtomicSingleton. I'd also 
move the timing into the threads themselves, for fairness :)

http://codepad.org/gvm3A88k
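
For reference, in case the paste rots: the fix is essentially a one-word change. A minimal sketch of the corrected class (same imports as the original listing):

-----
class AtomicSingleton
{
    static AtomicSingleton get()
    {
        // 'static' makes this a single process-global flag; without it,
        // every call creates a fresh local flag that is always false,
        // so every call takes the synchronized path (the bug above)
        static shared bool _instantiated;
        __gshared AtomicSingleton _instance;

        if (!atomicLoad!(MemoryOrder.acq)(_instantiated))
        {
            synchronized
            {
                if (_instance is null)
                    _instance = new AtomicSingleton;

                atomicStore!(MemoryOrder.rel)(_instantiated, true);
            }
        }

        return _instance;
    }
}
-----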

Timings on my machine:

ldc2 -unittest -release -O3:

Test time for LockSingleton: 537 msecs.
Test time for SyncSingleton: 2 msecs.
Test time for AtomicSingleton: 2.25 msecs.

dmd -unittest -release -O -inline:

Test time for LockSingleton: 451.5 msecs.
Test time for SyncSingleton: 7.75 msecs.
Test time for AtomicSingleton: 99.75 msecs.
Jan 31 2014
next sibling parent reply Andrej Mitrovic <andrej.mitrovich gmail.com> writes:
On 1/31/14, Stanislav Blinov <stanislav.blinov gmail.com> wrote:
 You forgot to make the flag static for AtomicSingleton.
Ah. It was copied verbatim from reddit, I guess we both missed it.
 Timings on my machine:

 ldc2 -unittest -release -O3:

 Test time for LockSingleton: 537 msecs.
 Test time for SyncSingleton: 2 msecs.
 Test time for AtomicSingleton: 2.25 msecs.
Here's mine:

$ dmd -release -inline -O -noboundscheck -unittest -run singleton.d

Test time for LockSingleton: 577.5 msecs.
Test time for SyncSingleton: 9.25 msecs.
Test time for AtomicSingleton: 159.75 msecs.

Maybe ldc's optimizer is just much better at this? In either case, how come the atomic version is slower?
Jan 31 2014
parent "Stanislav Blinov" <stanislav.blinov gmail.com> writes:
On Friday, 31 January 2014 at 10:39:19 UTC, Andrej Mitrovic wrote:
 On 1/31/14, Stanislav Blinov <stanislav.blinov gmail.com> wrote:
 You forgot to make the flag static for AtomicSingleton.
Ah. It was copied verbatim from reddit, I guess we both missed it.
Yeah, with D's verbosity in these cases it's easy to miss.
 Here's mine:

 $ dmd -release -inline -O -noboundscheck -unittest -run 
 singleton.d

 Test time for LockSingleton: 577.5 msecs.
 Test time for SyncSingleton: 9.25 msecs.
 Test time for AtomicSingleton: 159.75 msecs.

 Maybe ldc's optimizer is just much better at this?
It is :) http://forum.dlang.org/thread/lqmqsnucadaqlkxkoffc forum.dlang.org
 In either case how come the atomic version is slower?
It may not be universally true, as Dmitry mentioned. On some platforms, TLS could be slow but atomics fast. I'm suspecting that on Windows TLS could be slower, actually.
Jan 31 2014
prev sibling parent reply Benjamin Thaut <code benjamin-thaut.de> writes:
On 31.01.2014 10:18, Stanislav Blinov wrote:
 You forgot to make the flag static for AtomicSingleton. I'd also move
 the timing into the threads themselves, for fairness :)

 http://codepad.org/gvm3A88k

 Timings on my machine:

 ldc2 -unittest -release -O3:

 Test time for LockSingleton: 537 msecs.
 Test time for SyncSingleton: 2 msecs.
 Test time for AtomicSingleton: 2.25 msecs.

 dmd -unittest -release -O -inline:

 Test time for LockSingleton: 451.5 msecs.
 Test time for SyncSingleton: 7.75 msecs.
 Test time for AtomicSingleton: 99.75 msecs.
For x86 CPUs you don't really need MemoryOrder.acq as reads are atomic by default. So I replaced that with MemoryOrder.raw and named it AtomicSingletonRaw.

On Windows 7:

dmd -unittest -release -O -inline -noboundscheck

Test time for LockSingleton: 299 msecs.
Test time for SyncSingleton: 5 msecs.
Test time for AtomicSingleton: 304 msecs.
Test time for AtomicSingletonRaw: 280 msecs.

ldc2 -release -unittest -O3

Test time for LockSingleton: 320 msecs.
Test time for SyncSingleton: 2 msecs.
Test time for AtomicSingleton: 271 msecs.
Test time for AtomicSingletonRaw: 209 msecs.

It seems that the SyncSingleton is superior in all cases.
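
A minimal sketch of that raw variant (assuming the static-flag fix from earlier in the thread; only the load ordering differs from AtomicSingleton):

-----
class AtomicSingletonRaw
{
    static AtomicSingletonRaw get()
    {
        static shared bool _instantiated;
        __gshared AtomicSingletonRaw _instance;

        // raw emits no barrier for the load; a plain x86 load is
        // already atomic, as discussed below
        if (!atomicLoad!(MemoryOrder.raw)(_instantiated))
        {
            synchronized
            {
                if (_instance is null)
                    _instance = new AtomicSingletonRaw;

                atomicStore!(MemoryOrder.rel)(_instantiated, true);
            }
        }

        return _instance;
    }
}
-----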
Jan 31 2014
next sibling parent reply Andrej Mitrovic <andrej.mitrovich gmail.com> writes:
On 1/31/14, Benjamin Thaut <code benjamin-thaut.de> wrote:
 For x86 CPUs you don't really need MemoryOrder.acq as reads are atomic
 by default.
Hmm, I guess we could use a version(X86) block to pick this. When you say x86, do you also imply X86_64? Where can I read about the memory reads being atomic by default?
Jan 31 2014
parent reply Benjamin Thaut <code benjamin-thaut.de> writes:
On 31.01.2014 12:44, Andrej Mitrovic wrote:
 On 1/31/14, Benjamin Thaut <code benjamin-thaut.de> wrote:
 For x86 CPUs you don't really need MemoryOrder.acq as reads are atomic
 by default.
Hmm, I guess we could use a version(X86) block to pick this. When you say x86, do you also imply X86_64? Where can I read about the memory reads being atomic by default?
It depends on the processor architecture. Usually, if you have a "normal" CPU architecture, it guarantees a consistent view of memory, meaning all reads and writes are atomic (but not read-modify-write, or even read-write). Usually only NUMA architectures don't guarantee a consistent view of memory, resulting in reads and writes not being atomic. For example, the Intel Itanium architecture does not guarantee this. But usually all single-processor architectures guarantee a consistent view of memory; I have not come across one yet that didn't (so ARM, PPC, X86 and X86_64 all have atomic reads/writes).

Also see: http://en.wikipedia.org/wiki/Cache_coherence
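
A sketch of the version(X86) selection Andrej suggests (hypothetical code; note that later in the thread acq/rel is argued to be the safer choice after all):

-----
// Hypothetical: pick the cheapest safe load ordering per architecture.
version (X86)
    enum loadOrder = MemoryOrder.raw;  // plain x86 loads are atomic
else version (X86_64)
    enum loadOrder = MemoryOrder.raw;
else
    enum loadOrder = MemoryOrder.acq;  // be conservative elsewhere

// ...and in get():
//     if (!atomicLoad!loadOrder(_instantiated)) { ... }
-----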
Jan 31 2014
parent reply Benjamin Thaut <code benjamin-thaut.de> writes:
If you need the details, read:

http://lwn.net/Articles/250967/

Kind Regards
Benjamin Thaut
Jan 31 2014
next sibling parent reply "Jonathan Bettencourt" <jbetten gmail.com> writes:
Is it just me or does the implementation of atomic.d look grossly 
inefficient and badly in need of a rewrite?
Jan 31 2014
parent Benjamin Thaut <code benjamin-thaut.de> writes:
On 31.01.2014 15:27, Jonathan Bettencourt wrote:
 Is it just me or does the implementation of atomic.d look grossly
 inefficient and badly in need of a rewrite?
I can't really judge that, as I don't have much experience in lock-free programming. But if someone is to rewrite this module, then it should be someone with quite some experience in lock-free programming. Taking a look at the memory model of C++11 and copying from there might not hurt either.
Jan 31 2014
prev sibling parent reply Andrej Mitrovic <andrej.mitrovich gmail.com> writes:
On 1/31/14, Benjamin Thaut <code benjamin-thaut.de> wrote:
 If you need the details, read:

 http://lwn.net/Articles/250967/
Aye, it's been on my todo list forever, even though I've read the first part when it was a single blog post, afair.
Jan 31 2014
parent Benjamin Thaut <code benjamin-thaut.de> writes:
On 31.01.2014 15:30, Andrej Mitrovic wrote:
 On 1/31/14, Benjamin Thaut <code benjamin-thaut.de> wrote:
 If you need the details, read:

 http://lwn.net/Articles/250967/
Aye, it's been on my todo list forever, even though I've read the first part when it was a single blog post, afair.
You should really take the time to read it. It's one of the best articles on the internet I ever read, and it has tons of relevant information for programmers. You can skip the first chapter, as it mostly talks about the hardware details of how memory works, and why it is hard to make it faster.
Jan 31 2014
prev sibling parent reply "Stanislav Blinov" <stanislav.blinov gmail.com> writes:
On Friday, 31 January 2014 at 11:31:53 UTC, Benjamin Thaut wrote:

 For x86 CPUs you don't really need MemoryOrder.acq as reads are 
 atomic by default.
Uhm... atomicLoad() itself guarantees that the read is atomic. It's not about atomicity of operation, it's about sequential consistency. Using raw in this case is safe because the further synchronized block guarantees that this read will not be reordered to follow write. In fact, the presence of that synchronized block allows for making both load and store raw.
Jan 31 2014
parent reply "Stanislav Blinov" <stanislav.blinov gmail.com> writes:

synchronized block:

         // (2)
         if (!atomicLoad!(MemoryOrder.raw)(_instantiated))
         {
             // (1)
             synchronized
             { // <- this is 'acquire'
                 if (_instance is null) {
                     _instance = new AtomicSingleton;
                 }

             } // <- this is 'release'

             // This store cannot be moved to positions (1) or (2) 
because
             // of 'synchronized' above
             atomicStore!(MemoryOrder.raw)(_instantiated, true);
         }
Jan 31 2014
parent reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
31-Jan-2014 17:26, Stanislav Blinov wrote:

 synchronized block:

          // (2)
          if (!atomicLoad!(MemoryOrder.raw)(_instantiated))
          {
              // (1)
              synchronized
              { // <- this is 'acquire'
                  if (_instance is null) {
//(3)
                      _instance = new AtomicSingleton;
                  }

              } // <- this is 'release'
//(4)
              // This store cannot be moved to positions (1) or (2) because
              // of 'synchronized' above
              atomicStore!(MemoryOrder.raw)(_instantiated, true);
          }
No it's not - the second thread may get to (3) while some other thread is at (4).

-- 
Dmitry Olshansky
Jan 31 2014
parent reply "Stanislav Blinov" <stanislav.blinov gmail.com> writes:
On Friday, 31 January 2014 at 15:18:43 UTC, Dmitry Olshansky
wrote:
31-Jan-2014 17:26, Stanislav Blinov wrote:

 the
 synchronized block:

         // (2)
         if (!atomicLoad!(MemoryOrder.raw)(_instantiated))
         {
             // (1)
             synchronized
             { // <- this is 'acquire'
                 if (_instance is null) {
//(3)
                     _instance = new AtomicSingleton;
                 }

             } // <- this is 'release'
//(4)
             // This store cannot be moved to positions (1) or 
 (2) because
             // of 'synchronized' above
             atomicStore!(MemoryOrder.raw)(_instantiated, true);
         }
No it's not - the second thread may get to (3) while some other thread is at (4).
Nope. The only way the thread is going to end up past the null check is if it's instantiating the singleton. It's inside the locked region. As long as the bool is false, one of the threads will get inside the synchronized block; all others will lock. Once that "first" thread is done, the others will see a non-null reference. No thread can get to (4) until the singleton is created.
Jan 31 2014
parent reply "Stanislav Blinov" <stanislav.blinov gmail.com> writes:
On Friday, 31 January 2014 at 23:35:25 UTC, Stanislav Blinov 
wrote:

        // (2)
        if (!atomicLoad!(MemoryOrder.raw)(_instantiated))
        {
            // (1)
            synchronized
            { // <- this is 'acquire'
                if (_instance is null) {
//(3)
                    _instance = new AtomicSingleton;
                }

            } // <- this is 'release'
//(4)
            // This store cannot be moved to positions (1) or 
 (2) because
            // of 'synchronized' above
            atomicStore!(MemoryOrder.raw)(_instantiated, true);
        }
No it's not - the second thread may get to (3) while some other thread is at (4).
 Nope. The only way the thread is going to end up past the null check is if it's instantiating the singleton. It's inside the locked region. As long as the bool is false, one of the threads will get inside the synchronized block; all others will lock. Once that "first" thread is done, the others will see a non-null reference. No thread can get to (4) until the singleton is created.
To clarify: only one thread will ever get to position (3). All others that follow it will see that _instance is not null, thus will just leave the synchronized section. Of course, this means that some N threads (that arrived to the synchronized section before the singleton was created) will all write 'true' into the flag. No big deal :)
Feb 01 2014
next sibling parent Dmitry Olshansky <dmitry.olsh gmail.com> writes:
01-Feb-2014 18:23, Stanislav Blinov wrote:
 On Friday, 31 January 2014 at 23:35:25 UTC, Stanislav Blinov wrote:

        // (2)
        if (!atomicLoad!(MemoryOrder.raw)(_instantiated))
        {
            // (1)
            synchronized
            { // <- this is 'acquire'
                if (_instance is null) {
//(3)
                    _instance = new AtomicSingleton;
                }

            } // <- this is 'release'
//(4)
            // This store cannot be moved to positions (1) or (2)
 because
            // of 'synchronized' above
            atomicStore!(MemoryOrder.raw)(_instantiated, true);
        }
No it's not - the second thread may get to (3) while some other thread is at (4).
 Nope. The only way the thread is going to end up past the null check is if it's instantiating the singleton. It's inside the locked region. As long as the bool is false, one of the threads will get inside the synchronized block; all others will lock. Once that "first" thread is done, the others will see a non-null reference. No thread can get to (4) until the singleton is created.
To clarify: only one thread will ever get to position (3). All others that follow it will see that _instance is not null, thus will just leave the synchronized section. Of course, this means that some N threads (that arrived to the synchronized section before the singleton was created) will all write 'true' into the flag. No big deal :)
Yes, I see there could be many writes to the _instantiated field but not _instance.

-- 
Dmitry Olshansky
Feb 01 2014
prev sibling parent reply "Stanislav Blinov" <stanislav.blinov gmail.com> writes:
There's a lot more to these singletons than meets the eye.

- It would seem that such usage of raw MemoryOrder in 
AtomicSingleton would be wrong (e.g. return to acq/rel is in 
order, which should not pose any performance issues on X86, as 
Sean mentioned).

- The instance references should be qualified shared (see the sketch below).

This needs more serious review, even if only for academic 
purposes. I'll see what I can come up with :)
In the meantime, if anyone has anything to add to the list, 
please chime in!
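
For the second point, one possible shape — a sketch only, keeping the SyncSingleton-style TLS fast path; whether the cast can be avoided is exactly the kind of thing such a review would need to settle:

-----
class SharedSingleton
{
    // shared: the type system now knows the instance crosses threads
    private static shared SharedSingleton _instance;

    static shared(SharedSingleton) get()
    {
        static bool _instantiated;  // TLS fast path, as in SyncSingleton

        if (!_instantiated)
        {
            synchronized
            {
                if (_instance is null)
                    _instance = cast(shared) new SharedSingleton;

                _instantiated = true;
            }
        }

        return _instance;
    }

private:
    this() { }
}
-----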
Feb 07 2014
next sibling parent "Jonathan Bettencourt" <jbetten gmail.com> writes:
On Friday, 7 February 2014 at 20:09:29 UTC, Stanislav Blinov 
wrote:
 There's a lot more to these singletons than meets the eye.

 - It would seem that such usage of raw MemoryOrder in 
 AtomicSingleton would be wrong (e.g. return to acq/rel is in 
 order, which should not pose any performance issues on X86, as 
 Sean mentioned).
I agree that acq/rel is the correct way to go, but it will cause performance issues with the current implementation of AtomicLoad.
Feb 07 2014
prev sibling parent reply "Cecil Ward" <d cecilward.com> writes:
On Friday, 7 February 2014 at 20:09:29 UTC, Stanislav Blinov
wrote:
 There's a lot more to these singletons than meets the eye.

 - It would seem that such usage of raw MemoryOrder in 
 AtomicSingleton would be wrong (e.g. return to acq/rel is in 
 order, which should not pose any performance issues on X86, as 
 Sean mentioned).

 - The instance references should be qualified shared.

 This needs more serious review, even if only for academic 
 purposes. I'll see what I can come up with :)
 In the meantime, if anyone has anything to add to the list, 
 please chime in!
Hi Martin, Sean, Stanislav et al,

I would quite like to code-review atomics.d and maybe think about improving the documentation and adding a few comments, especially for the purposes of knowledge capture in this sticky field.

Would that be ok, in principle?

There are a few rough edges here and there _in my very unworthy opinion_, and the odd bit that doesn't look quite right somehow, especially in the x64 branch. If I could even find the odd bug then that would be good. Or rather bad.

A big amount of work has clearly gone into this module. So, many beers to Sean and others who put their time into it. Research can be quite a pig too on a project of this kind, I would imagine.

There is quite a list of things that I'm currently unclear about when I read through the D, and this might mean me whimpering for help occasionally..?

Best,
Cecil.
Feb 27 2014
parent "Stanislav Blinov" <stanislav.blinov gmail.com> writes:
On Friday, 28 February 2014 at 00:29:49 UTC, Cecil Ward wrote:
 On Friday, 7 February 2014 at 20:09:29 UTC, Stanislav Blinov
 This needs more serious review, even if only for academic 
 purposes. I'll see what I can come up with :)
 In the meantime, if anyone has anything to add to the list, 
 please chime in!
Hi Martin, Sean, Stanislav et al I would quite like to code-review atomics.d
When I said "review" I meant this specific issue, i.e. singletons. Since then I got a bit carried away into general issues with the 'shared' qualifier, so for me the quirks of singletons are on hold for now. But if you find other bugs (in atomic.d or anywhere else), inconsistencies, documentation omissions, etc., please post them.

This thread clearly shows the value of more thorough testing. Who knows how long it would've taken to notice that atomicLoad() issue if Andrej hadn't created this thread.
 and maybe think about improving the documentation and adding a 
 few comments, especially
 for the purposes of knowledge capture in this sticky field.

 Would that be ok, in principle?
IMO submitting issues, enhancements, and documentation updates is always a good idea. Though don't be surprised if your submissions hang in the air for a while; it's pretty common, esp. when the people responsible for the original code are busy with other things.
 There are a few rough edges here and there _in my very unworthy
 opinion_, and the odd bit that doesn't look quite right somehow
 especially in the x64 branch. If I could even find the odd bug
 then that would be good. Or rather bad.

 A big amount of work has clearly gone into this module. So, many
 beers to Sean and others who put their time into it. Research 
 can
 be quite a pig too on a project of this kind, I would imagine.
Use bugzilla (https://d.puremagic.com/issues/) to submit issues/enhancement requests; or submit ready pull requests on github so that they can be reviewed, improved, and if all is good, eventually accepted. It's best done that way since it presents clear history and more focused discussion, and because threads in this NG sink rather quickly.
 There is quite a list of things that I'm currently unclear about
 when I read through the D, and this might mean me whimpering for
 help occasionally..?
I don't see a big red banner saying "don't post your questions here" anywhere ;)
Mar 03 2014
prev sibling next sibling parent reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
31-Jan-2014 12:25, Andrej Mitrovic wrote:
 There was a nice blog-post about implementing low-lock singletons in D, here:
 http://davesdprogramming.wordpress.com/2013/05/06/low-lock-singletons/

 One suggestion on Reddit was by dawgfoto (I think this is Martin
 Nowak?), to use atomic primitives instead:
 http://www.reddit.com/r/programming/comments/1droaa/lowlock_singletons_in_d_the_singleton_pattern/c9tmz07

 I wanted to benchmark these different approaches. I was expecting
 Martin's implementation to be the fastest one, but on my machine
 (Athlon II X4 620 - 2.61GHz) the implementation in the blog post turns
 out to be the fastest one.
And it was a big thing because of that. Also keep in mind that atomic ops are _relatively_ cheap on x86; the stuff should get even better on, say, ARM.

-- 
Dmitry Olshansky
Jan 31 2014
next sibling parent Andrej Mitrovic <andrej.mitrovich gmail.com> writes:
On 1/31/14, Dmitry Olshansky <dmitry.olsh gmail.com> wrote:
 Also keep in mind that atomic
 ops are _relatively_ cheap on x86; the stuff should get even better on,
 say, ARM.
Hmm yeah, but I was expecting better numbers. Even after the 'static' bug fix noted by Stanislav, the atomic version is slower.
Jan 31 2014
prev sibling parent reply Andrej Mitrovic <andrej.mitrovich gmail.com> writes:
On 1/31/14, Andrej Mitrovic <andrej.mitrovich gmail.com> wrote:
 Hmm yeah, but I was expecting better numbers. Even after the 'static'
 bug fix noted by Stanislav, the atomic version is slower.
Actually, I think I understand why this happens. Logically, the atomic version will do an atomic read for *every* access, whereas the TLS implementation only checks a thread-local boolean flag. Even though the TLS implementation forces each new thread to enter the synchronized block *on the first read for that thread*, on subsequent reads that thread will not enter the synchronized block anymore.

After the very first call of every thread, the cost of the read operation for the TLS version is a TLS read, whereas for the atomic version it is an atomic read. I guess TLS read operations simply beat atomic read operations.

The atomic implementation probably beats the TLS version when a lot of new threads are being spawned at once and they only retrieve the singleton which has already been initialized. E.g., say a 1000 threads are spawned. In the atomic version, the 1000 threads will all do an atomic read and not enter the synchronized block, whereas in the TLS version the 1000 threads will all need to enter a synchronized block on the very first read.
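
To make that concrete, here is a minimal sketch that times just the two steady-state reads in isolation (the names are made up for this example, and an optimizer may hoist the loads out of the loops, so treat the numbers as a rough guide only):

-----
import core.atomic;
import std.datetime;
import std.stdio;

bool tlsFlag = true;            // thread-local, like SyncSingleton's fast path
shared bool sharedFlag = true;  // read atomically, like AtomicSingleton's

void main()
{
    enum N = 10_000_000;
    ulong sink;  // prevent dead-code elimination

    auto sw = StopWatch(AutoStart.yes);
    foreach (i; 0 .. N)
        sink += tlsFlag;  // plain TLS read
    writefln("TLS read:    %s msecs.", sw.peek.msecs);

    sw.reset();
    foreach (i; 0 .. N)
        sink += atomicLoad!(MemoryOrder.acq)(sharedFlag);  // atomic read
    writefln("atomic read: %s msecs.", sw.peek.msecs);

    writeln(sink);
}
-----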
Jan 31 2014
parent reply "Stanislav Blinov" <stanislav.blinov gmail.com> writes:
On Friday, 31 January 2014 at 10:57:53 UTC, Andrej Mitrovic wrote:

 The atomic implementation probably beats the TLS version when a 
 lot of
 new threads are being spawned at once and they only retrieve the
 singleton which has already been initialized. E.g., say a 1000 
 threads
 are spawned.
Easy enough to test. But inconclusive. I just ran some tests with 1024 threads :)

First, subsequent runs on my machine show interleaving results:

Test time for SyncSingleton: 61.2334 msecs.
Test time for AtomicSingleton: 15.9795 msecs.

Test time for SyncSingleton: 11.209 msecs.
Test time for AtomicSingleton: 25.4395 msecs.

Test time for SyncSingleton: 22.8105 msecs.
Test time for AtomicSingleton: 35.1865 msecs.

I guess I'd need a different CPU (and probably one that's not doing anything else at the time) to get conclusive results.

It also seems that either there *is* a race in there somewhere, or maybe a bug?.. Some runs just flat freeze (even on small thread counts) :\
Jan 31 2014
parent reply Andrej Mitrovic <andrej.mitrovich gmail.com> writes:
On 1/31/14, Stanislav Blinov <stanislav.blinov gmail.com> wrote:
 First, subsequent runs on my machine show interleaving results.
 It also seems that either there *is* a race in there somewhere,
 or maybe a bug?.. Some runs just flat freeze (even on small
 thread counts) :\
Hmm.. Well I know we've had some issues with threads on FreeBSD. It's hard to just guess what's wrong though. :)
Jan 31 2014
parent reply "Stanislav Blinov" <stanislav.blinov gmail.com> writes:
On Friday, 31 January 2014 at 11:18:03 UTC, Andrej Mitrovic wrote:
 On 1/31/14, Stanislav Blinov <stanislav.blinov gmail.com> wrote:
 First, subsequent runs on my machine show interleaving results.
 It also seems that either there *is* a race in there somewhere,
 or maybe a bug?.. Some runs just flat freeze (even on small
 thread counts) :\
Hmm.. Well I know we've had some issues with threads on FreeBSD. It's hard to just guess what's wrong though. :)
I'm not comfortable with that atomicOp in the thread function. I've reworked the unittest a little, to accommodate multiple runs:

http://codepad.org/ghZdjvUE

And here are ldc's results (you may want to lower the thread count for dmd, I killed the program after the very first test took 27 seconds :o):

Test 0 time for SyncSingleton: 35.4775 msecs.
Test 0 time for AtomicSingleton: 58.5859 msecs.

Test 1 time for SyncSingleton: 64.9863 msecs.
Test 1 time for AtomicSingleton: 12.5479 msecs.

Test 2 time for SyncSingleton: 44.2617 msecs.
Test 2 time for AtomicSingleton: 26.2842 msecs.

Test 3 time for SyncSingleton: 24.8008 msecs.
Test 3 time for AtomicSingleton: 34.416 msecs.

Test 4 time for SyncSingleton: 5.63477 msecs.
Test 4 time for AtomicSingleton: 28.458 msecs.

Test 5 time for SyncSingleton: 18.1123 msecs.
Test 5 time for AtomicSingleton: 29.6738 msecs.

Test 6 time for SyncSingleton: 12.0234 msecs.
Test 6 time for AtomicSingleton: 53.2061 msecs.

Test 7 time for SyncSingleton: 70.6982 msecs.
Test 7 time for AtomicSingleton: 13.2285 msecs.

Test 8 time for SyncSingleton: 12.3447 msecs.
Test 8 time for AtomicSingleton: 8.06348 msecs.

Test 9 time for SyncSingleton: 20.3145 msecs.
Test 9 time for AtomicSingleton: 14.334 msecs.

Again, inconclusive :)
Jan 31 2014
parent reply Andrej Mitrovic <andrej.mitrovich gmail.com> writes:
On 1/31/14, Stanislav Blinov <stanislav.blinov gmail.com> wrote:
 I've reworked the unittest a little, to accomodate for multiple
 runs:

 http://codepad.org/ghZdjvUE
I've finally managed to build LDC2 on Windows (MinGW version), here are the timings between DMD and LDC2:

$ dmd -release -inline -O -noboundscheck -unittest singleton_2.d -oftest.exe && test.exe
Test time for LockSingleton: 606.5 msecs.
Test time for SyncSingleton: 7 msecs.
Test time for AtomicSingleton: 138 msecs.

$ ldmd2 -release -inline -O -noboundscheck -unittest singleton_2.d -oftest.exe && test.exe
Test time for LockSingleton: 536.25 msecs.
Test time for SyncSingleton: 5 msecs.
Test time for AtomicSingleton: 3 msecs.

Freaking awesome!
Feb 04 2014
parent reply "Stanislav Blinov" <stanislav.blinov gmail.com> writes:
On Tuesday, 4 February 2014 at 09:44:04 UTC, Andrej Mitrovic 
wrote:

 I've finally managed to build LDC2 on Windows (MinGW version), 
 here
 are the timings between DMD and LDC2:

 $ dmd -release -inline -O -noboundscheck -unittest singleton_2.d
  -oftest.exe && test.exe
 Test time for LockSingleton: 606.5 msecs.
 Test time for SyncSingleton: 7 msecs.
 Test time for AtomicSingleton: 138 msecs.

 $ ldmd2 -release -inline -O -noboundscheck -unittest 
 singleton_2.d
  -oftest.exe && test.exe
 Test time for LockSingleton: 536.25 msecs.
 Test time for SyncSingleton: 5 msecs.
 Test time for AtomicSingleton: 3 msecs.

 Freaking awesome!
:)

Have you also included fixes from http://forum.dlang.org/post/khidcgetalmguhassvqm forum.dlang.org ?

How do the test results look in multiple runs? Is AtomicSingleton always faster than SyncSingleton on Windows?
Feb 04 2014
next sibling parent reply Andrej Mitrovic <andrej.mitrovich gmail.com> writes:
On 2/4/14, Stanislav Blinov <stanislav.blinov gmail.com> wrote:
 Have you also included fixes from
 http://forum.dlang.org/post/khidcgetalmguhassvqm forum.dlang.org ?
I haven't figured out exactly what you're trying to swap there. Do you have a full example:
 How do the test results look in multiple runs? Is AtomicSingleton
 always faster than SyncSingleton on Windows?
Pretty much. I'm getting reliable results. But I'm not a statistics pro (and yeah I've read http://zedshaw.com/essays/programmer_stats.html - still doesn't make me a pro).
Feb 04 2014
parent reply "Stanislav Blinov" <stanislav.blinov gmail.com> writes:
On Tuesday, 4 February 2014 at 14:23:51 UTC, Andrej Mitrovic 
wrote:
 On 2/4/14, Stanislav Blinov <stanislav.blinov gmail.com> wrote:
 Have you also included fixes from
 http://forum.dlang.org/post/khidcgetalmguhassvqm forum.dlang.org 
 ?
I haven't figured out exactly what you're trying to swap there. Do you have a full example:
Both atomicLoad and atomicStore use raw MemoryOrder, and also the atomicStore is out of the synchronized {} section: http://dpaste.dzfl.pl/291abc51bb0e
 How do the test results look in multiple runs? Is 
 AtomicSingleton
 always faster than SyncSingleton on Windows?
Pretty much. I'm getting reliable results.
Interesting. As you've seen, for me on Linux it's 50/50.
 But I'm not a statistics pro (and yeah I've read
 http://zedshaw.com/essays/programmer_stats.html - still doesn't 
 make me a pro).
Same here :)
Feb 04 2014
parent reply Andrej Mitrovic <andrej.mitrovich gmail.com> writes:
On 2/4/14, Stanislav Blinov <stanislav.blinov gmail.com> wrote:
 Both atomicLoad and atomicStore use raw MemoryOrder, and also the
 atomicStore is out of the synchronized {} section:

 http://dpaste.dzfl.pl/291abc51bb0e
No difference, but maybe the timing precision isn't proper. It always displays one of 3/3.25/4 msecs.

Anywho, what's important is that Atomic is really speedy and Sync is almost as fast. Except with DMD, which is bad at optimizing this specific code.
Feb 05 2014
parent reply "Stanislav Blinov" <stanislav.blinov gmail.com> writes:
On Wednesday, 5 February 2014 at 08:39:08 UTC, Andrej Mitrovic 
wrote:

 No difference, but maybe the timing precision isn't proper. It 
 always displays one of 3/3.25/4 msecs.
Hmm... It should be as proper as it gets, judging from StopWatch's docs.
 Anywho what's important is that Atomic is really speedy and 
 Sync is almost as fast. Except with DMD  which is
 bad at optimizing this specific code.
Yup, at least we have two fast low-lock implementations to choose from depending on platform's capabilities regarding TLS and atomics.
Feb 05 2014
parent "Jonathan Bettencourt" <jbetten gmail.com> writes:
On Wednesday, 5 February 2014 at 09:30:51 UTC, Stanislav Blinov 
wrote:
 On Wednesday, 5 February 2014 at 08:39:08 UTC, Andrej Mitrovic 
 wrote:

 No difference, but maybe the timing precision isn't proper. It 
 always displays one of 3/3.25/4 msecs.
Hmm... It should be as proper as it gets, judging from StopWatch's docs.
 Anywho what's important is that Atomic is really speedy and 
 Sync is almost as fast. Except with DMD  which is
 bad at optimizing this specific code.
Yup, at least we have two fast low-lock implementations to choose from depending on platform's capabilities regarding TLS and atomics.
The atomics implementation in druntime is very inefficient, it uses compare-and-swap for nearly everything. I'm working on a rewrite.
Feb 05 2014
prev sibling next sibling parent Andrej Mitrovic <andrej.mitrovich gmail.com> writes:
On 2/4/14, Andrej Mitrovic <andrej.mitrovich gmail.com> wrote:
 I haven't figured out exactly what you're trying to swap there. Do you
 have a full example:
s/:/?
Feb 04 2014
prev sibling parent reply Jerry <jlquinn optonline.net> writes:
"Stanislav Blinov" <stanislav.blinov gmail.com> writes:

 On Tuesday, 4 February 2014 at 09:44:04 UTC, Andrej Mitrovic wrote:

 I've finally managed to build LDC2 on Windows (MinGW version), here
 are the timings between DMD and LDC2:

 $ dmd -release -inline -O -noboundscheck -unittest singleton_2.d
  -oftest.exe && test.exe
 Test time for LockSingleton: 606.5 msecs.
 Test time for SyncSingleton: 7 msecs.
 Test time for AtomicSingleton: 138 msecs.

 $ ldmd2 -release -inline -O -noboundscheck -unittest singleton_2.d
  -oftest.exe && test.exe
 Test time for LockSingleton: 536.25 msecs.
 Test time for SyncSingleton: 5 msecs.
 Test time for AtomicSingleton: 3 msecs.

 Freaking awesome!
Here's the best and worst times I get on my linux laptop. These are with 2.064.2 dmd and gdc 4.9 with 2.064.2.

On Ubuntu x86_64:

~/dmd2/linux/bin64/dmd -O -release -inline -noboundscheck -unittest singleton.d

Test 2 time for SyncSingleton: 753.547 msecs.
Test 2 time for AtomicSingleton: 22290.3 msecs.

Test 3 time for SyncSingleton: 254.968 msecs.
Test 3 time for AtomicSingleton: 22903.3 msecs.

Test 6 time for SyncSingleton: 510.118 msecs.
Test 6 time for AtomicSingleton: 23970.9 msecs.

Test 8 time for SyncSingleton: 480.175 msecs.
Test 8 time for AtomicSingleton: 12827.9 msecs.

../bin/gdc -frelease -funittest -O3 singleton.d

Test 0 time for SyncSingleton: 458.605 msecs.
Test 0 time for AtomicSingleton: 1985.87 msecs.

Test 1 time for SyncSingleton: 334.097 msecs.
Test 1 time for AtomicSingleton: 2030.29 msecs.

Test 5 time for SyncSingleton: 355.765 msecs.
Test 5 time for AtomicSingleton: 1040.87 msecs.

Test 9 time for SyncSingleton: 295.145 msecs.
Test 9 time for AtomicSingleton: 1272.22 msecs.

It seems like gdc and dmd are similar for SyncSingleton. AtomicSingleton is significantly faster for gdc, but not as fast as SyncSingleton.
Feb 04 2014
parent reply "Stanislav Blinov" <stanislav.blinov gmail.com> writes:
On Wednesday, 5 February 2014 at 00:11:58 UTC, Jerry wrote:

 Here's the best and worst times I get on my linux laptop.  
 These are
 with 2.064.2 dmd and gdc 4.9 with 2.064.2

 On Ubuntu x86_64:

 ~/dmd2/linux/bin64/dmd -O -release -inline -noboundscheck 
 -unittest singleton.d

 Test 2 time for SyncSingleton: 753.547 msecs.
 Test 2 time for AtomicSingleton: 22290.3 msecs.

 Test 3 time for SyncSingleton: 254.968 msecs.
 Test 3 time for AtomicSingleton: 22903.3 msecs.

 Test 6 time for SyncSingleton: 510.118 msecs.
 Test 6 time for AtomicSingleton: 23970.9 msecs.

 Test 8 time for SyncSingleton: 480.175 msecs.
 Test 8 time for AtomicSingleton: 12827.9 msecs.
Whoah, those times for AtomicSingleton are way high. What kind of machine is your laptop?

Perhaps we need to repost the test with the latest implementation of AtomicSingleton.
Feb 05 2014
parent reply Jerry <jlquinn optonline.net> writes:
"Stanislav Blinov" <stanislav.blinov gmail.com> writes:

 On Wednesday, 5 February 2014 at 00:11:58 UTC, Jerry wrote:

 Here's the best and worst times I get on my linux laptop.  These are
 with 2.064.2 dmd and gdc 4.9 with 2.064.2

 On Ubuntu x86_64:

 ~/dmd2/linux/bin64/dmd -O -release -inline -noboundscheck -unittest
 singleton.d

 Test 2 time for SyncSingleton: 753.547 msecs.
 Test 2 time for AtomicSingleton: 22290.3 msecs.

 Test 3 time for SyncSingleton: 254.968 msecs.
 Test 3 time for AtomicSingleton: 22903.3 msecs.

 Test 6 time for SyncSingleton: 510.118 msecs.
 Test 6 time for AtomicSingleton: 23970.9 msecs.

 Test 8 time for SyncSingleton: 480.175 msecs.
 Test 8 time for AtomicSingleton: 12827.9 msecs.
Whoah, those times for AtomicSingleton are way high. What kind of machine is your laptop?
Core 2 Duo T9400. The gdc times were much better for AtomicSingleton - about 4x slower than SyncSingleton.
 Perhaps we need to repost the test with the latest implementation of
 AtomicSingleton.
I downloaded the test program yesterday.
Feb 05 2014
next sibling parent reply "Stanislav Blinov" <stanislav.blinov gmail.com> writes:
On Wednesday, 5 February 2014 at 21:47:40 UTC, Jerry wrote:

 I downloaded the test program yesterday.
Here's my latest revision:

http://dpaste.dzfl.pl/5b54df1c7004

Andrej, I hope you don't mind me fiddling with that code? I've put that atomic fix in there, also switched timing to use hnsecs (converted back to msecs for output), which seems to give more accurate readings.
Feb 05 2014
parent reply Jerry <jlquinn optonline.net> writes:
"Stanislav Blinov" <stanislav.blinov gmail.com> writes:

 On Wednesday, 5 February 2014 at 21:47:40 UTC, Jerry wrote:

 I downloaded the test program yesterday.
Here's my latest revision: http://dpaste.dzfl.pl/5b54df1c7004 Andrej, I hope you don't mind me fiddling with that code? I've put that atomic fix in there, also switched timing to use hnsecs (converted back to msecs for output), which seems to give more accurate readings.
Yup, that helps out the AtomicSingleton a lot. Here's best and worst times for each for dmd and gdc:

jlquinn wyvern:~/d/tests$ ~/dmd2/linux/bin64/dmd -O -release -inline -unittest singleton2.d
jlquinn wyvern:~/d/tests$ ./singleton2

*Test 2 time for SyncSingleton: 585.992 msecs.
Test 2 time for AtomicSingleton: 1189.03 msecs.

Test 5 time for SyncSingleton: 796.834 msecs.
*Test 5 time for AtomicSingleton: 1069.08 msecs.

*Test 7 time for SyncSingleton: 811.711 msecs.
Test 7 time for AtomicSingleton: 1263.36 msecs.

Test 9 time for SyncSingleton: 605.729 msecs.
*Test 9 time for AtomicSingleton: 2173.74 msecs.

jlquinn wyvern:~/d/tests$ ../bin/gdc -O3 -finline -frelease -fno-bounds-check -funittest singleton2.d
jlquinn wyvern:~/d/tests$ ./a.out

Test 0 time for SyncSingleton: 542.797 msecs.
*Test 0 time for AtomicSingleton: 257.805 msecs.

*Test 5 time for SyncSingleton: 620.052 msecs.
Test 5 time for AtomicSingleton: 248.951 msecs.

Test 7 time for SyncSingleton: 437.124 msecs.
*Test 7 time for AtomicSingleton: 605.781 msecs.

*Test 8 time for SyncSingleton: 252.643 msecs.
Test 8 time for AtomicSingleton: 279.854 msecs.
Feb 06 2014
next sibling parent reply "Sean Kelly" <sean invisibleduck.org> writes:
Weird.  atomicLoad(raw) should be the same as atomicLoad(acq), 
and atomicStore(raw) should be the same as atomicStore(rel).  At 
least on x86.  I don't know why that change made a difference in 
performance.
Feb 07 2014
parent reply "Stanislav Blinov" <stanislav.blinov gmail.com> writes:
On Friday, 7 February 2014 at 08:10:58 UTC, Sean Kelly wrote:
 Weird.  atomicLoad(raw) should be the same as atomicLoad(acq), 
 and atomicStore(raw) should be the same as atomicStore(rel).  
 At least on x86.  I don't know why that change made a 
 difference in performance.
huh?

--8<-- core/atomic.d

        template needsLoadBarrier( MemoryOrder ms )
        {
            enum bool needsLoadBarrier = ms != MemoryOrder.raw;
        }

-->8--

Didn't you write this? :)
Feb 07 2014
parent reply "Sean Kelly" <sean invisibleduck.org> writes:
On Friday, 7 February 2014 at 11:17:49 UTC, Stanislav Blinov 
wrote:
 On Friday, 7 February 2014 at 08:10:58 UTC, Sean Kelly wrote:
 Weird.  atomicLoad(raw) should be the same as atomicLoad(acq), 
 and atomicStore(raw) should be the same as atomicStore(rel).  
 At least on x86.  I don't know why that change made a 
 difference in performance.
 huh?

 --8<-- core/atomic.d

         template needsLoadBarrier( MemoryOrder ms )
         {
             enum bool needsLoadBarrier = ms != MemoryOrder.raw;
         }

 -->8--

 Didn't you write this? :)
Oops. I thought that since Intel has officially defined loads as having acquire semantics, I had eliminated the barrier requirement there. But I guess not. I suppose it's an issue worth discussing. Does anyone know offhand what C++0x implementations do for load acquires on x86?
Feb 07 2014
next sibling parent reply "Stanislav Blinov" <stanislav.blinov gmail.com> writes:
On Friday, 7 February 2014 at 15:42:06 UTC, Sean Kelly wrote:

 Oops.  I thought that since Intel has officially defined loads 
 as having acquire semantics, I had eliminated the barrier 
 requirement there.  But I guess not.  I suppose it's an issue 
 worth discussing.  Does anyone know offhand what C++0x 
 implementations do for load acquires on x86?
Offhand - no. But who forbids empirical tests? :)

--8<-- main.cpp

#include <atomic>
#include <cstdint>
#include <iostream>

int test32()
{
    std::atomic<int> ai(0xfacefeed);
    return ai.load(std::memory_order_acquire);
}

int64_t test64()
{
    std::atomic<int64_t> ai(0xbadface00badface);
    return ai.load(std::memory_order_acquire);
}

int main(int argc, char** argv)
{
    auto i1 = test32();
    auto i2 = test64();
    // Prevent dead code optimization
    std::cout << i1 << " " << i2 << std::endl;
}

-->8--

I've pulled the atomic ops into separate functions to try and prevent the compiler from being too clever. I'm using --std=c++11 but --std=c++0x would work as well.

$ g++ -Ofast -m32 --std=c++11 main.cpp
$ objdump -d -w -r -C --no-show-raw-insn --disassembler-options=intel a.out | less -S

--8<--
08048830 <test32()>:
 8048830: sub    esp,0x10
 8048833: mov    DWORD PTR [esp+0xc],0xfacefeed
 804883b: mov    eax,DWORD PTR [esp+0xc]
 804883f: add    esp,0x10
 8048842: ret
 8048843: lea    esi,[esi+0x0]
 8048849: lea    edi,[edi+eiz*1+0x0]

08048850 <test64()>:
 8048850: sub    esp,0x1c
 8048853: mov    DWORD PTR [esp+0x10],0xbadface
 804885b: mov    DWORD PTR [esp+0x14],0xbadface0
 8048863: fild   QWORD PTR [esp+0x10]
 8048867: fistp  QWORD PTR [esp]
 804886a: mov    eax,DWORD PTR [esp]
 804886d: mov    edx,DWORD PTR [esp+0x4]
 8048871: add    esp,0x1c
 8048874: ret
 8048875: xchg   ax,ax
 8048877: xchg   ax,ax
 8048879: xchg   ax,ax
 804887b: xchg   ax,ax
 804887d: xchg   ax,ax
 804887f: nop
-->8--

$ g++ -Ofast -m64 --std=c++11 main.cpp
$ objdump -d -w -r -C --no-show-raw-insn --disassembler-options=intel a.out | less -S

--8<--
0000000000400950 <test32()>:
 400950: mov    DWORD PTR [rsp-0x18],0xfacefeed
 400958: mov    eax,DWORD PTR [rsp-0x18]
 40095c: ret
 40095d: nop    DWORD PTR [rax]

0000000000400960 <test64()>:
 400960: movabs rax,0xbadface00badface
 40096a: mov    QWORD PTR [rsp-0x18],rax
 40096f: mov    rax,QWORD PTR [rsp-0x18]
 400974: ret
 400975: nop    WORD PTR cs:[rax+rax*1+0x0]
 40097f: nop
-->8--

No barriers in sight.
Feb 07 2014
parent reply "Sean Kelly" <sean invisibleduck.org> writes:
On Friday, 7 February 2014 at 16:36:03 UTC, Stanislav Blinov 
wrote:
 No barriers in sight.
Awesome. Then I think we can go back to the old logic.
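
Presumably something along these lines — a hypothetical sketch only, not the actual druntime patch (the real fix ends up tracked in a bug report later in this thread):

-----
// Hypothetical: on x86/x86_64, plain loads already have acquire
// semantics, so only a sequentially-consistent load would pay a barrier.
template needsLoadBarrier( MemoryOrder ms )
{
    version (X86)
        enum bool needsLoadBarrier = ms == MemoryOrder.seq;
    else version (X86_64)
        enum bool needsLoadBarrier = ms == MemoryOrder.seq;
    else
        enum bool needsLoadBarrier = ms != MemoryOrder.raw;
}
-----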
Feb 07 2014
parent reply "Stanislav Blinov" <stanislav.blinov gmail.com> writes:
On Friday, 7 February 2014 at 16:57:50 UTC, Sean Kelly wrote:
 On Friday, 7 February 2014 at 16:36:03 UTC, Stanislav Blinov 
 wrote:
 No barriers in sight.
Awesome. Then I think we can go back to the old logic.
Cool. Also, from http://en.cppreference.com/w/cpp/atomic/memory_order:

--8<--

On strongly-ordered systems (x86, SPARC, IBM mainframe), release-acquire ordering is automatic for the majority of operations. No additional CPU instructions are issued for this synchronization mode, only certain compiler optimizations are affected (e.g. the compiler is prohibited from moving non-atomic stores past the atomic store-release or perform non-atomic loads earlier than the atomic load-acquire).

-->8--
Feb 07 2014
next sibling parent Marco Leise <Marco.Leise gmx.de> writes:
On Fri, 07 Feb 2014 17:10:06 +0000,
"Stanislav Blinov" <stanislav.blinov gmail.com> wrote:

 On Friday, 7 February 2014 at 16:57:50 UTC, Sean Kelly wrote:
 On Friday, 7 February 2014 at 16:36:03 UTC, Stanislav Blinov 
 wrote:
 No barriers in sight.
Awesome. Then I think we can go back to the old logic.
 Cool. Also, from http://en.cppreference.com/w/cpp/atomic/memory_order:

 --8<--

 On strongly-ordered systems (x86, SPARC, IBM mainframe), release-acquire ordering is automatic for the majority of operations. No additional CPU instructions are issued for this synchronization mode, only certain compiler optimizations are affected (e.g. the compiler is prohibited from moving non-atomic stores past the atomic store-release or perform non-atomic loads earlier than the atomic load-acquire).

 -->8--

Strong-ordering does not work on x86/amd64 in two cases:

http://preshing.com/20120913/acquire-and-release-semantics/#IDComment721195739

Just thought I should throw that in. Only the official CPU docs will give certainty :)

-- 
Marco
Feb 07 2014
prev sibling parent reply Martin Nowak <code dawg.eu> writes:
On 02/07/2014 06:10 PM, Stanislav Blinov wrote:
 On Friday, 7 February 2014 at 16:57:50 UTC, Sean Kelly wrote:
 --8<--

 On strongly-ordered systems (x86, SPARC, IBM mainframe), release-acquire
 ordering is automatic for the majority of operations. No additional CPU
 instructions are issued for this synchronization mode, only certain
 compiler optimizations are affected (e.g. the compiler is prohibited
 from moving non-atomic stores past the atomic store-release or perform
 non-atomic loads earlier than the atomic load-acquire)

 -->8--
So, who is going to fix core.atomic?
Feb 08 2014
parent reply "Stanislav Blinov" <stanislav.blinov gmail.com> writes:
On Sunday, 9 February 2014 at 01:40:51 UTC, Martin Nowak wrote:

 So, who is going to fix core.atomic?
I was under impression that Sean was onto it.
Feb 09 2014
parent reply Martin Nowak <code dawg.eu> writes:
On 02/09/2014 03:07 PM, Stanislav Blinov wrote:
 On Sunday, 9 February 2014 at 01:40:51 UTC, Martin Nowak wrote:

 So, who is going to fix core.atomic?
I was under impression that Sean was onto it.
Can you please submit a bug report, so we don't lose track of this.
Feb 09 2014
parent "Stanislav Blinov" <stanislav.blinov gmail.com> writes:
On Sunday, 9 February 2014 at 18:07:50 UTC, Martin Nowak wrote:

 Can you please submit a bug report, so we don't lose track of 
 this.
Sure: https://d.puremagic.com/issues/show_bug.cgi?id=12121
Feb 09 2014
prev sibling parent reply Iain Buclaw <ibuclaw gdcproject.org> writes:
On 7 Feb 2014 15:45, "Sean Kelly" <sean invisibleduck.org> wrote:
 On Friday, 7 February 2014 at 11:17:49 UTC, Stanislav Blinov wrote:
 On Friday, 7 February 2014 at 08:10:58 UTC, Sean Kelly wrote:
 Weird.  atomicLoad(raw) should be the same as atomicLoad(acq), and
atomicStore(raw) should be the same as atomicStore(rel). At least on x86. I don't know why that change made a difference in performance.
 huh?

 --8<-- core/atomic.d

         template needsLoadBarrier( MemoryOrder ms )
         {
             enum bool needsLoadBarrier = ms != MemoryOrder.raw;
         }

 -->8--

 Didn't you write this? :)
Oops. I thought that since Intel has officially defined loads as having
 acquire semantics, I had eliminated the barrier requirement there. But I guess not. I suppose it's an issue worth discussing. Does anyone know offhand what C++0x implementations do for load acquires on x86?

Speaking of which, I need to add 'Update gcc.atomics to use new C++0x intrinsics' to the GDCProjects page - they map closely to what core.atomic is doing, and should see better performance compared to the __sync intrinsics. :)
Feb 07 2014
parent reply Marco Leise <Marco.Leise gmx.de> writes:
On Fri, 7 Feb 2014 18:42:29 +0000,
Iain Buclaw <ibuclaw gdcproject.org> wrote:

 On 7 Feb 2014 15:45, "Sean Kelly" <sean invisibleduck.org> wrote:
 On Friday, 7 February 2014 at 11:17:49 UTC, Stanislav Blinov wrote:
 On Friday, 7 February 2014 at 08:10:58 UTC, Sean Kelly wrote:
 Weird.  atomicLoad(raw) should be the same as atomicLoad(acq), and
atomicStore(raw) should be the same as atomicStore(rel). At least on x86. I don't know why that change made a difference in performance.
 huh?

 --8<-- core/atomic.d

         template needsLoadBarrier( MemoryOrder ms )
         {
             enum bool needsLoadBarrier = ms != MemoryOrder.raw;
         }

 -->8--

 Didn't you write this? :)
Oops. I thought that since Intel has officially defined loads as having
 acquire semantics, I had eliminated the barrier requirement there. But I guess not. I suppose it's an issue worth discussing. Does anyone know offhand what C++0x implementations do for load acquires on x86?

 Speaking of which, I need to add 'Update gcc.atomics to use new C++0x intrinsics' to the GDCProjects page - they map closely to what core.atomic is doing, and should see better performance compared to the __sync intrinsics. :)

You send shared variables as "volatile" to the backend and that is correct. I wonder, since that should create strong ordering of memory operations (correct?), if DMD has something similar, or if D's "shared" isn't really shared at all and relies entirely on the correct use of atomicLoad/atomicStore and atomicFence. In that case, would the GCC backend be able to optimize more around shared variables (by not considering them volatile) and still be no worse off than DMD?

-- 
Marco
Feb 07 2014
parent reply Iain Buclaw <ibuclaw gdcproject.org> writes:
On 8 Feb 2014 01:20, "Marco Leise" <Marco.Leise gmx.de> wrote:
 On Fri, 7 Feb 2014 18:42:29 +0000,
 Iain Buclaw <ibuclaw gdcproject.org> wrote:

 On 7 Feb 2014 15:45, "Sean Kelly" <sean invisibleduck.org> wrote:
 On Friday, 7 February 2014 at 11:17:49 UTC, Stanislav Blinov wrote:
 On Friday, 7 February 2014 at 08:10:58 UTC, Sean Kelly wrote:
 Weird.  atomicLoad(raw) should be the same as atomicLoad(acq), and
atomicStore(raw) should be the same as atomicStore(rel). At least on
x86.
  I don't know why that change made a difference in performance.
 huh?

 --8<-- core/atomic.d

         template needsLoadBarrier( MemoryOrder ms )
         {
             enum bool needsLoadBarrier = ms != MemoryOrder.raw;
         }

 -->8--

 Didn't you write this? :)
Oops. I thought that since Intel has officially defined loads as
having
 acquire semantics, I had eliminated the barrier requirement there.  But
I
 guess not.  I suppose it's an issue worth discussing.  Does anyone know
 offhand what C++0x implementations do for load acquires on x86?

 Speaking of which, I need to add 'Update gcc.atomics to use new C++0x
 intrinsics' to the GDCProjects page - they map closely to what
core.atomic
 is doing, and should see better performance compared to the __sync
 intrinsics.  :)
 You send shared variables as "volatile" to the backend and that is correct. I wonder, since that should create strong ordering of memory operations (correct?), if DMD has something similar, or if D's "shared" isn't really shared at all and relies entirely on the correct use of atomicLoad/atomicStore and atomicFence. In that case, would the GCC backend be able to optimize more around shared variables (by not considering them volatile) and still be no worse off than DMD?
No. The fact that I decided shared data be marked volatile was *not* because of a strong ordering. Remember, we follow C semantics here, which is quite specific in not guaranteeing this.

The reason it is set as volatile is that it (instead) guarantees the compiler will not generate code that explicitly caches the shared data.
Feb 09 2014
next sibling parent "Stanislav Blinov" <stanislav.blinov gmail.com> writes:
Isn't it great how a simple benchmark thread can reveal such 
valuable insights and important problems?
Feb 09 2014
prev sibling parent Marco Leise <Marco.Leise gmx.de> writes:
On Sun, 9 Feb 2014 20:47:07 +0000,
Iain Buclaw <ibuclaw gdcproject.org> wrote:

 On 8 Feb 2014 01:20, "Marco Leise" <Marco.Leise gmx.de> wrote:
 On Fri, 7 Feb 2014 18:42:29 +0000,
 Iain Buclaw <ibuclaw gdcproject.org> wrote:

 On 7 Feb 2014 15:45, "Sean Kelly" <sean invisibleduck.org> wrote:
 On Friday, 7 February 2014 at 11:17:49 UTC, Stanislav Blinov wrote:
 On Friday, 7 February 2014 at 08:10:58 UTC, Sean Kelly wrote:
 Weird.  atomicLoad(raw) should be the same as atomicLoad(acq), and
atomicStore(raw) should be the same as atomicStore(rel). At least on
x86.
  I don't know why that change made a difference in performance.
 huh?

 --8<-- core/atomic.d

         template needsLoadBarrier( MemoryOrder ms )
         {
             enum bool needsLoadBarrier = ms != MemoryOrder.raw;
         }

 -->8--

 Didn't you write this? :)
Oops. I thought that since Intel has officially defined loads as
having
 acquire semantics, I had eliminated the barrier requirement there.  But
 I
 guess not.  I suppose it's an issue worth discussing.  Does anyone know
 offhand what C++0x implementations do for load acquires on x86?

 Speaking of which, I need to add 'Update gcc.atomics to use new C++0x
 intrinsics' to the GDCProjects page - they map closely to what
core.atomic
 is doing, and should see better performance compared to the __sync
 intrinsics.  :)
 You send shared variables as "volatile" to the backend and that is correct. I wonder, since that should create strong ordering of memory operations (correct?), if DMD has something similar, or if D's "shared" isn't really shared at all and relies entirely on the correct use of atomicLoad/atomicStore and atomicFence. In that case, would the GCC backend be able to optimize more around shared variables (by not considering them volatile) and still be no worse off than DMD?
 No. The fact that I decided shared data be marked volatile was *not* because of a strong ordering. Remember, we follow C semantics here, which is quite specific in not guaranteeing this.

 The reason it is set as volatile is that it (instead) guarantees the compiler will not generate code that explicitly caches the shared data.

Ah, alright then.

-- 
Marco
Feb 17 2014
prev sibling parent "Stanislav Blinov" <stanislav.blinov gmail.com> writes:
On Friday, 7 February 2014 at 04:06:40 UTC, Jerry wrote:
 "Stanislav Blinov" <stanislav.blinov gmail.com> writes:
 Here's my latest revision: http://dpaste.dzfl.pl/5b54df1c7004
Yup, that helps out the AtomicSingleton a lot. Here's best and worst times for each for dmd and gdc:
Cool, I almost started to research that CPU of yours :)
 jlquinn wyvern:~/d/tests$ ~/dmd2/linux/bin64/dmd -O -release 
 -inline -unittest singleton2.d
 jlquinn wyvern:~/d/tests$ ./singleton2
 *Test 2 time for SyncSingleton: 585.992 msecs.
 Test 2 time for AtomicSingleton: 1189.03 msecs.

 Test 5 time for SyncSingleton: 796.834 msecs.
 *Test 5 time for AtomicSingleton: 1069.08 msecs.

 *Test 7 time for SyncSingleton: 811.711 msecs.
 Test 7 time for AtomicSingleton: 1263.36 msecs.

 Test 9 time for SyncSingleton: 605.729 msecs.
 *Test 9 time for AtomicSingleton: 2173.74 msecs.

 jlquinn wyvern:~/d/tests$ ../bin/gdc -O3 -finline -frelease 
 -fno-bounds-check -funittest singleton2.d
 jlquinn wyvern:~/d/tests$ ./a.out
 Test 0 time for SyncSingleton: 542.797 msecs.
 *Test 0 time for AtomicSingleton: 257.805 msecs.

 *Test 5 time for SyncSingleton: 620.052 msecs.
 Test 5 time for AtomicSingleton: 248.951 msecs.

 Test 7 time for SyncSingleton: 437.124 msecs.
 *Test 7 time for AtomicSingleton: 605.781 msecs.

 *Test 8 time for SyncSingleton: 252.643 msecs.
 Test 8 time for AtomicSingleton: 279.854 msecs.
Nice.
Feb 07 2014
prev sibling parent Marco Leise <Marco.Leise gmx.de> writes:
On Wed, 05 Feb 2014 16:47:40 -0500,
Jerry <jlquinn optonline.net> wrote:

 "Stanislav Blinov" <stanislav.blinov gmail.com> writes:
 
 On Wednesday, 5 February 2014 at 00:11:58 UTC, Jerry wrote:

 Here's the best and worst times I get on my linux laptop.  These are
 with 2.064.2 dmd and gdc 4.9 with 2.064.2

 On Ubuntu x86_64:

 ~/dmd2/linux/bin64/dmd -O -release -inline -noboundscheck -unittest
 singleton.d

 Test 2 time for SyncSingleton: 753.547 msecs.
 Test 2 time for AtomicSingleton: 22290.3 msecs.

 Test 3 time for SyncSingleton: 254.968 msecs.
 Test 3 time for AtomicSingleton: 22903.3 msecs.

 Test 6 time for SyncSingleton: 510.118 msecs.
 Test 6 time for AtomicSingleton: 23970.9 msecs.

 Test 8 time for SyncSingleton: 480.175 msecs.
 Test 8 time for AtomicSingleton: 12827.9 msecs.
Whoah, those times for AtomicSingleton are way high. What kind of machine is your laptop?
Core 2 Duo T9400. The gdc times were much better for AtomicSingleton - about 4x slower than SyncSingleton.
 Perhaps we need to repost the test with the latest implementation of
 AtomicSingleton.
I downloaded the test program yesterday.
I just tested with DMD 2.064.2 and my numbers for the AtomicSingleton
are not as high. This is on a Core 2 Duo T7250 / 2.0 GHz.

Test 0 time for SyncSingleton: 1068.83 msecs.
Test 0 time for AtomicSingleton: 2102.32 msecs.

Test 1 time for SyncSingleton: 901.215 msecs.
Test 1 time for AtomicSingleton: 2479.6 msecs.

Test 2 time for SyncSingleton: 1091.91 msecs.
Test 2 time for AtomicSingleton: 2269.45 msecs.

Test 3 time for SyncSingleton: 1156.74 msecs.
Test 3 time for AtomicSingleton: 2498.25 msecs.

Also for GDC my numbers are like this:

Test 0 time for SyncSingleton: 657.928 msecs.
Test 0 time for AtomicSingleton: 851.795 msecs.

Test 1 time for SyncSingleton: 655.204 msecs.
Test 1 time for AtomicSingleton: 893.51 msecs.

Test 2 time for SyncSingleton: 613.881 msecs.
Test 2 time for AtomicSingleton: 843.635 msecs.

Test 3 time for SyncSingleton: 657.87 msecs.
Test 3 time for AtomicSingleton: 709.823 msecs.

Which is far from the difference you see.

--
Marco
Feb 07 2014
prev sibling next sibling parent reply "Dejan Lekic" <dejan.lekic gmail.com> writes:
I was thinking about implementing a typical Java singleton in D, 
and then decided to first check whether someone already did that, 
and guess what - yes, someone did. Check this URL: 
http://dblog.aldacron.net/2007/03/03/singletons-in-d/

Something like this (taken from the article above) in the case 
you do not want lazy initialisation:

     class Singleton2(T)
     {
     public:
         static const T instance;

     private:
         this() {}

         static this() { instance = new T; }
     }

     class TMySingleton2 : Singleton2!(TMySingleton2)
     {
     }

Something like this (taken from the article above) in the case 
you want lazy initialisation:

     class Singleton(T)
     {
     public:
         static T instance()
         {
             if(_instance is null) _instance = new T;
             return _instance;
         }

     private:
         this() {}

         static T _instance;
     }

     class TMySingleton : Singleton!(TMySingleton)
     {
     }

If there are some Java programmers around who are curious how the 
Java version is done: 
http://www.javaworld.com/article/2073352/core-java/simply-singleton.html
Jan 31 2014
next sibling parent reply "Dejan Lekic" <dejan.lekic gmail.com> writes:
I should have mentioned two things in my previous post.

1) There are no locks involved. No need, because the solution 
relies on the fact that static member variables are guaranteed to 
be created the first time they are accessed.

2) Note that we have the constructor disabled. This is important not 
to forget. ;)
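(A minimal sketch of point 2, with made-up module names: since private in D is module-private, only code in the same module can construct the class; everyone else has to go through the static instance.)

-----
module mysingleton;

class TMySingleton
{
    static TMySingleton instance;

    static this() { instance = new TMySingleton; }

private:
    this() { }
}

// In some other module:
//   import mysingleton;
//   auto a = new TMySingleton;       // error: constructor is not accessible
//   auto b = TMySingleton.instance;  // fine
-----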
Jan 31 2014
parent reply "Stanislav Blinov" <stanislav.blinov gmail.com> writes:
On Friday, 31 January 2014 at 10:26:50 UTC, Dejan Lekic wrote:
 I should have mentioned two things in my previous post.

 1) There are no locks involved. No need, because the solution 
 relies on the fact that static member variables are guaranteed 
 to be created the first time they are accessed.
And they are thread-local :)
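(A minimal sketch of what that means in practice, with made-up names: each thread runs its own static constructor and gets its own instance.)

-----
import core.thread;
import std.stdio;

class S
{
    static S instance;                   // TLS by default
    static this() { instance = new S; }  // runs once per thread
}

void main()
{
    auto here = cast(void*) S.instance;
    void* there;

    auto t = new Thread({ there = cast(void*) S.instance; });
    t.start();
    t.join();

    // prints "different instances": the new thread built its own copy
    writeln(here is there ? "same instance" : "different instances");
}
-----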
 2) Note that we have the constructor disabled. This is important 
 not to forget. ;)
What use would the const version have? You'd still need some way to access the instance, right? Cast away const?
Jan 31 2014
parent reply "Dejan Lekic" <dejan.lekic gmail.com> writes:
 What use would the const version have? You'd still need some 
 way to access the instance, right? Cast away const?
I believe it should have been "final" instead of "const".
Jan 31 2014
parent reply "Stanislav Blinov" <stanislav.blinov gmail.com> writes:
On Friday, 31 January 2014 at 11:08:42 UTC, Dejan Lekic wrote:

 I believe it should have been "final" instead of "const".
But D doesn't have "final" :) In any event, that article by Mike Parker is about D1.
Jan 31 2014
next sibling parent reply Andrej Mitrovic <andrej.mitrovich gmail.com> writes:
On 1/31/14, Stanislav Blinov <stanislav.blinov gmail.com> wrote:
 On Friday, 31 January 2014 at 11:08:42 UTC, Dejan Lekic wrote:

 I believe it should have been "final" instead of "const".
But D doesn't have "final" :) In any event, that article by Mike Parker is about D1.
AFAIK D1's final was equivalent to D2's immutable. But I may be remembering that wrong. Or maybe D2 initially used final before settling on the new keyword immutable, to avoid confusion by users.
Jan 31 2014
parent reply Jacob Carlborg <doob me.com> writes:
On 2014-01-31 12:27, Andrej Mitrovic wrote:

 AFAIK D1's final was equivalent to D2's immutable. But I may be
 remembering that wrong.
In D2, if a variable is immutable or const, you cannot call non-const, non-immutable methods via that variable. D1 didn't have any concept of this. "const" and "final" in D1 were more like "you cannot change this variable".
 Or maybe D2 initially used final before
 settling for the new keyword immutable, to avoid confusion by users.
D2 used "invariant" before it used "immutable". It also changed the meaning of "const" compared to D1. -- /Jacob Carlborg
Jan 31 2014
parent reply Andrej Mitrovic <andrej.mitrovich gmail.com> writes:
On 1/31/14, Jacob Carlborg <doob me.com> wrote:
 In D2, if a variable is immutable or const, you cannot call non-const,
 non-immutable methods via that variable. D1 didn't have any concept of
 this. "const" and "final" in D1 were more like "you cannot change this variable".
So in D1 const is non-transitive?
Jan 31 2014
parent "Dicebot" <public dicebot.lv> writes:
On Friday, 31 January 2014 at 12:09:49 UTC, Andrej Mitrovic wrote:
 On 1/31/14, Jacob Carlborg <doob me.com> wrote:
 In D2, if a variable is immutable or const, you cannot call 
 non-const,
 non-immutable methods via that variable. D1 didn't have any 
 concept of
 this. "const" and "final" in D1 were more like "you cannot 
 change this variable".
So in D1 const is non-transitive?
It is completely different in D1. I think it is not even a qualifier there but a storage class - you can't have const function arguments, it is not printed in typeof and, yes, it is non-transitive. It basically just says "you can't modify this memory block". Also, const variables with an initializer act as D2 enums. This is one of the reasons why porting Sociomantic code will be quite painful :)
Jan 31 2014
prev sibling parent "Dejan Lekic" <dejan.lekic gmail.com> writes:
 But D doesn't have "final" :) In any event, that article by 
 Mike Parker is about D1.
Well, "final" still works. Until it does not we will agree that D does not have it. ;) That article applies to D2 as well, without any problems.
Jan 31 2014
prev sibling next sibling parent reply "Namespace" <rswhite4 googlemail.com> writes:
On Friday, 31 January 2014 at 10:20:45 UTC, Dejan Lekic wrote:
 I was thinking about implementing a typical Java singleton in 
 D, and then decided to first check whether someone already did 
 that, and guess what - yes, someone did. Check this URL: 
 http://dblog.aldacron.net/2007/03/03/singletons-in-d/

 Something like this (taken from the article above) in the case 
 you do not want lazy initialisation:

     class Singleton2(T)
     {
     public:
         static const T instance;

     private:
         this() {}

         static this() { instance = new T; }
     }

      class TMySingleton2 : Singleton2!(TMySingleton2)
     {
     }

 Something like this (taken from the article above) in the case 
 you want lazy initialisation:

     class Singleton(T)
     {
     public:
         static T instance()
         {
             if(_instance is null) _instance = new T;
             return _instance;
         }

     private:
         this() {}

         static T _instance;
     }

     class TMySingleton : Singleton!(TMySingleton)
     {
     }

 If there are some Java programmers around who are curious how 
 the Java version is done: 
 http://www.javaworld.com/article/2073352/core-java/simply-singleton.html
Why is someone interested in implementing such an Anti-Pattern like Singletons? In most cases Singletons are misused.
Jan 31 2014
parent reply "Stanislav Blinov" <stanislav.blinov gmail.com> writes:
On Friday, 31 January 2014 at 10:27:28 UTC, Namespace wrote:

 Why is someone interested in implementing such an Anti-Pattern 
 like Singletons?
Why is someone overquoting without reason? ;)
 In most of all cases Singletons are misused.
Any sort of shared (as in, between threads) resource is often a singleton. A queue for message passing, concurrent GC, a pipe... Even if it doesn't have SINGLETON (yes, in all capitals to irritate reviewers) in its name.
Jan 31 2014
parent "Namespace" <rswhite4 googlemail.com> writes:
On Friday, 31 January 2014 at 10:50:57 UTC, Stanislav Blinov 
wrote:
 On Friday, 31 January 2014 at 10:27:28 UTC, Namespace wrote:

 Why is someone interested in implementing such an Anti-Pattern 
 like Singletons?
Why is someone overquoting without reason? ;)
I know so many people and have read so many books where Singletons are misused that I react a bit allergically to them. In most cases, a singleton is absolutely unnecessary and is just a hidden global variable. Sorry if it may have sounded too harsh. ;)
Jan 31 2014
prev sibling parent reply "Dejan Lekic" <dejan.lekic gmail.com> writes:
Here is an updated version of Andrej's code: 
http://dpaste.dzfl.pl/c85f487c7f70
SingletonSimple is a winner, followed by the SyncSingleton and 
SingletonLazy.
Jan 31 2014
next sibling parent reply Andrej Mitrovic <andrej.mitrovich gmail.com> writes:
On 1/31/14, Dejan Lekic <dejan.lekic gmail.com> wrote:
 SingletonSimple is a winner
Well yeah, but that's not really the only thing a singleton is about. It's also about being able to initialize the singleton at an arbitrary time, rather than in a module constructor before main() is called.
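(A minimal sketch of the distinction, with made-up names; the lazy variant here is deliberately not thread-safe and only illustrates when construction happens:)

-----
import std.stdio;

class Eager
{
    __gshared Eager instance;

    shared static this()
    {
        instance = new Eager;
        writeln("Eager built before main()");
    }
}

class Lazy
{
    static Lazy get()
    {
        __gshared Lazy _instance;

        if (_instance is null)  // not thread-safe; illustration only
        {
            writeln("Lazy built on first use");
            _instance = new Lazy;
        }

        return _instance;
    }
}

void main()
{
    writeln("main() starts");
    Lazy.get();  // construction happens here, at a time of our choosing
}
-----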
Jan 31 2014
next sibling parent "Dejan Lekic" <dejan.lekic gmail.com> writes:
On Friday, 31 January 2014 at 11:42:29 UTC, Andrej Mitrovic wrote:
 On 1/31/14, Dejan Lekic <dejan.lekic gmail.com> wrote:
 SingletonSimple is a winner
Well yeah, but that's not really the only thing a singleton is about. It's also about being able to initialize the singleton at an arbitrary time, rather than in a module constructor before main() is called.
Absolutely, that is why I would use both alternatives, depending on the use case.
Jan 31 2014
prev sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 1/31/14, 3:42 AM, Andrej Mitrovic wrote:
 On 1/31/14, Dejan Lekic <dejan.lekic gmail.com> wrote:
 SingletonSimple is a winner
Well yeah, but that's not really the only thing a singleton is about. It's also about being able to initialize the singleton at an arbitrary time, rather than in a module constructor before main() is called.
Well yah Singleton should be created on first access. Andrei
Jan 31 2014
parent "Dejan Lekic" <dejan.lekic gmail.com> writes:
On Friday, 31 January 2014 at 17:10:08 UTC, Andrei Alexandrescu 
wrote:
 On 1/31/14, 3:42 AM, Andrej Mitrovic wrote:
 On 1/31/14, Dejan Lekic <dejan.lekic gmail.com> wrote:
 SingletonSimple is a winner
Well yeah, but that's not really the only thing a singleton is about. It's also about being able to initialize the singleton at an arbitrary time, rather than in a module constructor before main() is called.
Well yah Singleton should be created on first access. Andrei
If that is what people want, then David's version is definitely the best one.
Jan 31 2014
prev sibling next sibling parent reply "Stanislav Blinov" <stanislav.blinov gmail.com> writes:
On Friday, 31 January 2014 at 11:34:13 UTC, Dejan Lekic wrote:

 SingletonSimple is a winner, followed by the SyncSingleton and 
 SingletonLazy.
Dejan, your singletons are thread-local :)
Jan 31 2014
parent "Dejan Lekic" <dejan.lekic gmail.com> writes:
On Friday, 31 January 2014 at 11:44:10 UTC, Stanislav Blinov
wrote:
 On Friday, 31 January 2014 at 11:34:13 UTC, Dejan Lekic wrote:

 SingletonSimple is a winner, followed by the SyncSingleton and 
 SingletonLazy.
Dejan, your singletons are thread-local :)
YAY, that is correct! :'(
Jan 31 2014
prev sibling parent reply Andrej Mitrovic <andrej.mitrovich gmail.com> writes:
On 1/31/14, Dejan Lekic <dejan.lekic gmail.com> wrote:
 SingletonLazy.
SingletonLazy isn't thread-safe. :)
Jan 31 2014
next sibling parent "Dejan Lekic" <dejan.lekic gmail.com> writes:
 SingletonLazy isn't thread-safe. :)
EEK!
Jan 31 2014
prev sibling parent "Dejan Lekic" <dejan.lekic gmail.com> writes:
On Friday, 31 January 2014 at 11:45:56 UTC, Andrej Mitrovic wrote:
 On 1/31/14, Dejan Lekic <dejan.lekic gmail.com> wrote:
 SingletonLazy.
SingletonLazy isn't thread-safe. :)
I made it thread-safe, and guess what - I ended up with a SyncSingleton-like solution! So SyncSingleton is a clear winner if you want to make it lazy.
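(Presumably something along these lines: a sketch of a thread-safe lazy Singleton(T), combining the template above with the SyncSingleton pattern. This is a guess at the shape, not Dejan's actual code:)

-----
class Singleton(T)
{
    static T instance()
    {
        static bool _instantiated;  // TLS fast path, one check per thread
        __gshared T _instance;

        if (!_instantiated)
        {
            synchronized
            {
                if (_instance is null)
                    _instance = new T;

                _instantiated = true;
            }
        }

        return _instance;
    }
}
-----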
Jan 31 2014
prev sibling next sibling parent reply "TC" <chalucha gmail.com> writes:
On Friday, 31 January 2014 at 08:25:16 UTC, Andrej Mitrovic wrote:
 class LockSingleton
 {
     static LockSingleton get()
     {
         __gshared LockSingleton _instance;

         synchronized
         {
             if (_instance is null)
                 _instance = new LockSingleton;
         }

         return _instance;
     }

 private:
     this() { }
 }
Shouldn't LockSingleton be implemented like this instead?

class LockSingleton
{
    static auto get()
    {
        if (_instance is null)
        {
            synchronized
            {
                if (_instance is null)
                    _instance = new LockSingleton;
            }
        }

        return _instance;
    }

private:
    this() { }

    __gshared LockSingleton _instance;
}

At least this is the way a singleton is usually suggested to be implemented; synchronization is then needed only for the initial instantiation and not always.
Feb 07 2014
next sibling parent Iain Buclaw <ibuclaw gdcproject.org> writes:
On 7 February 2014 10:25, TC <chalucha gmail.com> wrote:
 On Friday, 31 January 2014 at 08:25:16 UTC, Andrej Mitrovic wrote:
 class LockSingleton
 {
     static LockSingleton get()
     {
         __gshared LockSingleton _instance;

         synchronized
         {
             if (_instance is null)

                 _instance = new LockSingleton;
         }

         return _instance;
     }

 private:
     this() { }
 }
Shouldn't LockSingleton be implemented like this instead?

class LockSingleton
{
    static auto get()
    {
        if (_instance is null)
        {
            synchronized
            {
                if (_instance is null)
                    _instance = new LockSingleton;
            }
        }

        return _instance;
    }

private:
    this() { }

    __gshared LockSingleton _instance;
}

Synchronization is then needed only for the initial instantiation and not always.
We don't want double-checked locking. :)  This was discussed at DConf; the D way is to leverage native thread-local storage.  I seem to recall that when David tested this, GDC had pretty much identical speeds to unsafe get()s.  You'll have to consult the slides, but I think it was something like:

class LockSingleton
{
    static auto get()
    {
        if (!_instantiated)
        {
            synchronized (LockSingleton.classinfo)
            {
                if (_instance is null)
                    _instance = new LockSingleton;

                _instantiated = true;
            }
        }

        return _instance;
    }

private:
    this() { }

    static bool _instantiated;
    __gshared LockSingleton _instance;
}
Feb 07 2014
prev sibling parent reply "Stanislav Blinov" <stanislav.blinov gmail.com> writes:
On Friday, 7 February 2014 at 10:25:52 UTC, TC wrote:

 Should't be the LockSingleton implemented like this instead?

 class LockSingleton
 {
     static auto get()
     {
         if (_instance is null)
(_instance is null) will most likely not be an atomic operation. References are two words. Imagine that one thread writes half a reference inside synchronized {}, then goes to sleep. What would the thread that gets to that 'if' return? I'd say it'll return "ouch".
Feb 07 2014
next sibling parent reply "Daniel Murphy" <yebbliesnospam gmail.com> writes:
"Stanislav Blinov"  wrote in message 
news:idrxthgkumydmiszdtcx forum.dlang.org...
 (_instance is null) will most likely not be an atomic operation. 
 References are two words.
References are one word.
Feb 07 2014
parent "Stanislav Blinov" <stanislav.blinov gmail.com> writes:
On Friday, 7 February 2014 at 11:36:23 UTC, Daniel Murphy wrote:
 "Stanislav Blinov"  wrote in message 
 news:idrxthgkumydmiszdtcx forum.dlang.org...
 (_instance is null) will most likely not be an atomic 
 operation. References are two words.
References are one word.
Heh, indeed. Need to go have my brain scanned :\ I have no idea why I thought that.
Feb 07 2014
prev sibling parent "Stanislav Blinov" <stanislav.blinov gmail.com> writes:
On Friday, 7 February 2014 at 11:31:14 UTC, Stanislav Blinov 
wrote:
 On Friday, 7 February 2014 at 10:25:52 UTC, TC wrote:

 Should't be the LockSingleton implemented like this instead?

 class LockSingleton
 {
    static auto get()
    {
        if (_instance is null)
(_instance is null) will most likely not be an atomic operation. References are two words. Imagine that one thread writes half a reference inside synchronized {}, then goes to sleep. What would the thread that gets to that 'if' return? I'd say it'll return "ouch".
Scratch that.
Feb 07 2014
prev sibling parent reply luka8088 <luka8088 owave.net> writes:
On 31.1.2014. 9:25, Andrej Mitrovic wrote:
 There was a nice blog-post about implementing low-lock singletons in D, here:
 http://davesdprogramming.wordpress.com/2013/05/06/low-lock-singletons/
 
 One suggestion on Reddit was by dawgfoto (I think this is Martin
 Nowak?), to use atomic primitives instead:
 http://www.reddit.com/r/programming/comments/1droaa/lowlock_singletons_in_d_the_singleton_pattern/c9tmz07
 
 I wanted to benchmark these different approaches. I was expecting
 Martin's implementation to be the fastest one, but on my machine
 (Athlon II X4 620 - 2.61GHz) the implementation in the blog post turns
 out to be the fastest one. I'm wondering whether my test case is
 flawed in some way. Btw, I think we should put an implementation of
 this into Phobos.
 
 The timings on my machine:
 
 Test time for LockSingleton: 542 msecs.
 Test time for SyncSingleton: 20 msecs.
 Test time for AtomicSingleton: 755 msecs.
 
What about swapping the function pointer so the check is done only once per thread? (The thread is tl;dr, so I am sorry if someone already suggested this.)

--------------------------------------------------
class FunctionPointerSingleton {

   private static __gshared typeof(this) instance_;

   // tls
   @property static typeof(this) function () get;

   static this () {
      get = {
        synchronized {
          if (instance_ is null)
            instance_ = new typeof(this)();
          get = { return instance_; };
          return instance_;
        }
      };
   }

}
--------------------------------------------------

dmd -release -inline -O -noboundscheck -unittest -run singleton.d

Test time for LockSingleton: 901 msecs.
Test time for SyncSingleton: 20.75 msecs.
Test time for AtomicSingleton: 169 msecs.
Test time for FunctionPointerSingleton: 7.5 msecs.

I don't have such a muscular machine xD
Feb 09 2014
next sibling parent reply "Stanislav Blinov" <stanislav.blinov gmail.com> writes:
On Sunday, 9 February 2014 at 12:20:54 UTC, luka8088 wrote:

 What about swapping the function pointer so the check is done only 
 once per thread? (The thread is tl;dr, so I am sorry if someone 
 already suggested this.)
That is an interesting idea indeed, though it seems to be faster only for dmd. I haven't studied the assembly yet, but with LDC I don't see any noticeable difference between SyncSingleton and FunctionPointerSingleton.
Feb 09 2014
parent luka8088 <luka8088 owave.net> writes:
On 9.2.2014. 15:09, Stanislav Blinov wrote:
 On Sunday, 9 February 2014 at 12:20:54 UTC, luka8088 wrote:
 
 What about swapping the function pointer so the check is done only once per
 thread? (The thread is tl;dr, so I am sorry if someone already suggested this.)
That is an interesting idea indeed, though it seems to be faster only for dmd. I haven't studied the assembly yet, but with LDC I don't see any noticeable difference between SyncSingleton and FunctionPointerSingleton.
I got it while writing code for dynamic languages (especially JavaScript). The thought came that instead of checking for something that you know will always have the same result, you just remove that piece of code, and voila :)
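(The same trick in miniature, with made-up names: a thread-local function pointer that does the expensive work once, then replaces itself with a cheap version.)

-----
import std.stdio;

int expensiveThenCheap()
{
    writeln("expensive path (runs once per thread)");
    compute = function int() { return 42; };  // swap in the cheap version
    return 42;
}

int function() compute = &expensiveThenCheap;  // TLS by default

void main()
{
    writeln(compute());  // takes the expensive path first
    writeln(compute());  // cheap path only from now on
}
-----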
Feb 09 2014
prev sibling next sibling parent reply Martin Nowak <code dawg.eu> writes:
On 02/09/2014 01:20 PM, luka8088 wrote:
 class FunctionPointerSingleton {

    private static __gshared typeof(this) instance_;

    // tls
    @property static typeof(this) function () get;
You don't even need to make this TLS, right?
    static this () {
      get = {
        synchronized {
          if (instance_ is null)
            instance_ = new typeof(this)();
          get = { return instance_; };
          return instance_;
        }
      };
    }

 }
Feb 09 2014
parent reply "Stanislav Blinov" <stanislav.blinov gmail.com> writes:
On Sunday, 9 February 2014 at 18:06:46 UTC, Martin Nowak wrote:
 On 02/09/2014 01:20 PM, luka8088 wrote:
 class FunctionPointerSingleton {

   private static __gshared typeof(this) instance_;

   // tls
   @property static typeof(this) function () get;
You don't even need to make this TLS, right?
I don't follow. get should be TLS, as a replacement for SyncSingleton's _instantiated TLS bool.
Feb 09 2014
parent luka8088 <luka8088 owave.net> writes:
On 9.2.2014. 19:51, Stanislav Blinov wrote:
 On Sunday, 9 February 2014 at 18:06:46 UTC, Martin Nowak wrote:
 On 02/09/2014 01:20 PM, luka8088 wrote:
 class FunctionPointerSingleton {

   private static __gshared typeof(this) instance_;

   // tls
   @property static typeof(this) function () get;
You don't even need to make this TLS, right?
I don't follow. get should be TLS, as a replacement for SyncSingleton's _instantiated TLS bool.
It is TLS and it needs to be TLS, because one thread could be replacing what get points to while another is trying to access it. It's either TLS or putting some synchronization above it, which would break the whole idea of executing the synchronized block only once per thread.
Feb 09 2014
prev sibling next sibling parent reply Andrej Mitrovic <andrej.mitrovich gmail.com> writes:
On 2/9/14, luka8088 <luka8088 owave.net> wrote:
 What about swapping the function pointer so the check is done only once per
 thread? (The thread is tl;dr, so I am sorry if someone already suggested this.)
Interesting solution for sure.
   // tls
   @property static typeof(this) function () get;
This confused me for a second since @property is meaningless for variables. :>
Feb 10 2014
parent luka8088 <luka8088 owave.net> writes:
On 10.2.2014. 10:52, Andrej Mitrovic wrote:
 On 2/9/14, luka8088 <luka8088 owave.net> wrote:
 What about swapping the function pointer so the check is done only once per
 thread? (The thread is tl;dr, so I am sorry if someone already suggested this.)
Interesting solution for sure.
   // tls
   @property static typeof(this) function () get;
This confused me for a second since property is meaningless for variables. :>
Yeah. My mistake. It should be removed.
Feb 10 2014
prev sibling next sibling parent reply Andrej Mitrovic <andrej.mitrovich gmail.com> writes:
On 2/9/14, luka8088 <luka8088 owave.net> wrote:
   private static __gshared typeof(this) instance_;
Also, "static __gshared" is really meaningless here, it's either static (TLS), or globally shared, either way it's not a class instance, so you can type __gshared alone here. Otherwise I'm not sure what the semantics of a per-class-instance __gshared field would be, if that can exist.
Feb 10 2014
parent reply luka8088 <luka8088 owave.net> writes:
On 10.2.2014. 10:54, Andrej Mitrovic wrote:
 On 2/9/14, luka8088 <luka8088 owave.net> wrote:
   private static __gshared typeof(this) instance_;
Also, "static __gshared" is really meaningless here, it's either static (TLS), or globally shared, either way it's not a class instance, so you can type __gshared alone here. Otherwise I'm not sure what the semantics of a per-class-instance __gshared field would be, if that can exist.
"static" does not meat it must be tls, as "static shared" is valid. I just like to write that it is static and not shared. I know that __gshared does imply static but this implication is not intuitive to me so I write it explicitly. For example, I think that the following code should output 5 and 6 (as it would it __gshared did not imply static): module program; import std.stdio; import core.thread; class A { __gshared int i; } void main () { auto a1 = new A(); auto a2 = new A(); (new Thread({ a1.i = 5; a2.i = 6; (new Thread({ writeln(a1.i); writeln(a2.i); })).start(); })).start(); } But in any case, this variable is just __gshared.
Feb 10 2014
next sibling parent luka8088 <luka8088 owave.net> writes:
On 10.2.2014. 13:44, luka8088 wrote:
 On 10.2.2014. 10:54, Andrej Mitrovic wrote:
 On 2/9/14, luka8088 <luka8088 owave.net> wrote:
   private static __gshared typeof(this) instance_;
Also, "static __gshared" is really meaningless here, it's either static (TLS), or globally shared, either way it's not a class instance, so you can type __gshared alone here. Otherwise I'm not sure what the semantics of a per-class-instance __gshared field would be, if that can exist.
"static" does not meat it must be tls, as "static shared" is valid. I just like to write that it is static and not shared. I know that __gshared does imply static but this implication is not intuitive to me so I write it explicitly. For example, I think that the following code should output 5 and 6 (as it would it __gshared did not imply static): module program; import std.stdio; import core.thread; class A { __gshared int i; } void main () { auto a1 = new A(); auto a2 = new A(); (new Thread({ a1.i = 5; a2.i = 6; (new Thread({ writeln(a1.i); writeln(a2.i); })).start(); })).start(); } But in any case, this variable is just __gshared.
Um, actually this makes no sense. But anyway, I'll mark it static.
Feb 10 2014
prev sibling parent reply Andrej Mitrovic <andrej.mitrovich gmail.com> writes:
On 2/10/14, luka8088 <luka8088 owave.net> wrote:
 "static" does not mean it must be tls, as "static shared" is valid.
Yes you're right. I'm beginning to really dislike the 20 different meanings of "static". :)
Feb 10 2014
parent reply "Daniel Murphy" <yebbliesnospam gmail.com> writes:
"Andrej Mitrovic"  wrote in message 
news:mailman.111.1392039607.21734.digitalmars-d puremagic.com...

 Yes you're right. I'm beginning to really dislike the 20 different
 meanings of "static". :)
Don't forget that __gshared static and static __gshared do different things!
Feb 10 2014
next sibling parent reply Andrej Mitrovic <andrej.mitrovich gmail.com> writes:
On 2/10/14, Daniel Murphy <yebbliesnospam gmail.com> wrote:
 Don't forget that __gshared static and static __gshared do different things!
wat.
Feb 10 2014
parent "Dicebot" <public dicebot.lv> writes:
On Monday, 10 February 2014 at 16:53:35 UTC, Andrej Mitrovic 
wrote:
 On 2/10/14, Daniel Murphy <yebbliesnospam gmail.com> wrote:
 Don't forget that __gshared static and static __gshared do 
 different things!
wat.
To be more specific: "WATWATWAT"
Feb 10 2014
prev sibling parent reply "Dejan Lekic" <dejan.lekic gmail.com> writes:
On Monday, 10 February 2014 at 14:15:58 UTC, Daniel Murphy wrote:
 "Andrej Mitrovic"  wrote in message 
 news:mailman.111.1392039607.21734.digitalmars-d puremagic.com...

 Yes you're right. I'm beginning to really dislike the 20 
 different
 meanings of "static". :)
Don't forget that __gshared static and static __gshared do different things!
Care to elaborate?
Feb 10 2014
parent reply "Daniel Murphy" <yebbliesnospam gmail.com> writes:
"Dejan Lekic"  wrote in message news:nvakemdpugwupoqctrtd forum.dlang.org...
 Don't forget that __gshared static and static __gshared do different 
 things!
Care to elaborate?
https://d.puremagic.com/issues/show_bug.cgi?id=4419
Feb 10 2014
parent reply "Andrej Mitrovic" <andrej.mitrovich gmail.com> writes:
On Tuesday, 11 February 2014 at 03:43:35 UTC, Daniel Murphy wrote:
 "Dejan Lekic"  wrote in message 
 news:nvakemdpugwupoqctrtd forum.dlang.org...
 Don't forget that __gshared static and static __gshared do 
 different things!
Care to elaborate?
https://d.puremagic.com/issues/show_bug.cgi?id=4419
Ah, that thing. Yeah this whole issue is rather messy IMO.
Feb 11 2014
parent reply Jerry <jlquinn optonline.net> writes:
"Andrej Mitrovic" <andrej.mitrovich gmail.com> writes:

 On Tuesday, 11 February 2014 at 03:43:35 UTC, Daniel Murphy wrote:
 "Dejan Lekic"  wrote in message news:nvakemdpugwupoqctrtd forum.dlang.org...
 Don't forget that __gshared static and static __gshared do > different
things! Care to elaborate?
https://d.puremagic.com/issues/show_bug.cgi?id=4419
Ah, that thing. Yeah this whole issue is rather messy IMO.
Looking at the bug, I see the compiler doesn't implement what the spec says. The spec says __gshared implies static. Is the messiness fixing the implementation to match the spec, or refining the spec to better define what should happen?
Feb 11 2014
parent "Daniel Murphy" <yebbliesnospam gmail.com> writes:
"Jerry"  wrote in message news:87sirpbjdf.fsf optonline.net...

 Looking at the bug, I see the compiler doesn't implement what the spec
 says.  The spec says __gshared implies static.  Is the messiness fixing
 the implementation to match the spec, or refining the spec to better
 define what should happen?
It's just messy in the sense that it doesn't behave in a logical or useful way.
Feb 12 2014
prev sibling parent reply Andrej Mitrovic <andrej.mitrovich gmail.com> writes:
On 2/9/14, luka8088 <luka8088 owave.net> wrote:
 dmd -release -inline -O -noboundscheck -unittest -run singleton.d

 Test time for LockSingleton: 901 msecs.
 Test time for SyncSingleton: 20.75 msecs.
 Test time for AtomicSingleton: 169 msecs.
 Test time for FunctionPointerSingleton: 7.5 msecs.
C:\dev\code\d_code>test_dmd
Test time for LockSingleton: 438 msecs.
Test time for SyncSingleton: 6.25 msecs.
Test time for AtomicSingleton: 8 msecs.
Test time for FunctionPointerSingleton: 5 msecs.

C:\dev\code\d_code>test_ldc
Test time for LockSingleton: 575.5 msecs.
Test time for SyncSingleton: 5 msecs.
Test time for AtomicSingleton: 3 msecs.
Test time for FunctionPointerSingleton: 5.25 msecs.

It seems it makes a tiny bit of difference for DMD, but LDC still generates better codegen for the atomic version.
Feb 10 2014
parent luka8088 <luka8088 owave.net> writes:
On 10.2.2014. 10:59, Andrej Mitrovic wrote:
 On 2/9/14, luka8088 <luka8088 owave.net> wrote:
 dmd -release -inline -O -noboundscheck -unittest -run singleton.d

 Test time for LockSingleton: 901 msecs.
 Test time for SyncSingleton: 20.75 msecs.
 Test time for AtomicSingleton: 169 msecs.
 Test time for FunctionPointerSingleton: 7.5 msecs.
C:\dev\code\d_code>test_dmd
Test time for LockSingleton: 438 msecs.
Test time for SyncSingleton: 6.25 msecs.
Test time for AtomicSingleton: 8 msecs.
Test time for FunctionPointerSingleton: 5 msecs.

C:\dev\code\d_code>test_ldc
Test time for LockSingleton: 575.5 msecs.
Test time for SyncSingleton: 5 msecs.
Test time for AtomicSingleton: 3 msecs.
Test time for FunctionPointerSingleton: 5.25 msecs.

It seems it makes a tiny bit of difference for DMD, but LDC still generates better codegen for the atomic version.
Could it be that TLS is slower in LLVM?
Feb 10 2014