
digitalmars.D - D performance compared with other languages

reply "christopher diggins" <cdiggins users.sourceforge.net> writes:
I have written an object oriented Heapsort in D at
http://www.heron-language.com/benchmarks/heapsort-d.html to compare the
performance of object oriented code in several languages. The results were
somewhat disappointing, when compared with C++, Delphi, Java, and Heron. I
was hoping that there might be some suggestions from members of this group
about how I might improve the code I wrote while keeping the overall
algorithm and design the same.

Please note that the program is a port of the program at
http://www.heron-language.com/benchmarks/heapsort-heron.html and is
discussed further at http://www.heron-language.com/benchmarks/index.html

TIA
-- 
Christopher Diggins
http://www.cdiggins.com
http://www.heron-language.com
May 11 2004
next sibling parent Andy Friesen <andy ikagames.com> writes:
christopher diggins wrote:
 I have written an object oriented Heapsort in D at
 http://www.heron-language.com/benchmarks/heapsort-d.html to compare the
 performance of object oriented code in several languages. The results were
 somewhat disappointing, when compared with C++, Delphi, Java, and Heron. I
 was hoping that there might be some suggestions from members of this group
 about how I might improve the code I wrote while keeping the overall
 algorithm and design the same.
 
 Please note that the program is a port of the program at
 http://www.heron-language.com/benchmarks/heapsort-heron.html and is
 discussed further at http://www.heron-language.com/benchmarks/index.html

The time decreases significantly if you use both -O and -release. -- andy
May 11 2004
prev sibling next sibling parent reply "Walter" <newshound digitalmars.com> writes:
Make sure to compile with:
    -release -O -inline

"christopher diggins" <cdiggins users.sourceforge.net> wrote in message
news:c7qvo6$1256$1 digitaldaemon.com...
 I have written an object oriented Heapsort in D at
 http://www.heron-language.com/benchmarks/heapsort-d.html to compare the
 performance of object oriented code in several languages. The results were
 somewhat disappointing, when compared with C++, Delphi, Java, and Heron. I
 was hoping that there might be some suggestions from members of this group
 about how I might improve the code I wrote while keeping the overall
 algorithm and design the same.

 Please note that the program is a port of the program at
 http://www.heron-language.com/benchmarks/heapsort-heron.html and is
 discussed further at http://www.heron-language.com/benchmarks/index.html

 TIA
 -- 
 Christopher Diggins
 http://www.cdiggins.com
 http://www.heron-language.com

May 11 2004
parent reply "christopher diggins" <cdiggins users.sourceforge.net> writes:
Thanks for the feedback, Walter. I had only used -O but not -inline
and -release which sped up the execution to 703 msec and the memory usage
dropped to 2MB on my system. I have now updated the web page as a result.
There is still 4 bytes per object that I can't account for. I count 8 bytes
for the vector data, 4 bytes for the pointer to the object, 4 bytes for the
vtable in the object, which only brings me to 16 bytes per object, what is
the other 4 bytes for? Is there a way to allocate 100,000 objects at once
rather than one by one?

-- 
Christopher Diggins
http://www.cdiggins.com
http://www.heron-language.com
May 11 2004
next sibling parent Helmut Leitner <helmut.leitner wikiservice.at> writes:
christopher diggins wrote:
 
 Thanks for the feedback, Walter. I had only used -O but not -inline
 and -release which sped up the execution to 703 msec and the memory usage
 dropped to 2MB on my system. I have now updated the web page as a result.
 There is still 4 bytes per object that I can't account for. I count 8 bytes
 for the vector data, 4 bytes for the pointer to the object, 4 bytes for the
 vtable in the object, which only brings me to 16 bytes per object, what is
 the other 4 bytes for? 

Each interface has a 4 byte/object cost.

-- 
Helmut Leitner    leitner hls.via.at
Graz, Austria     www.hls-software.com
May 11 2004
prev sibling next sibling parent "Ben Hinkle" <bhinkle mathworks.com> writes:
Every D object has a synchronized monitor that is right after the vtable.
That is a pointer to a platform-specific mutex of some sort.
Structs don't have any vtable or monitor so they would be the most efficient
memory-wise but wouldn't support any dynamic binding or synchronization.
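Ben's layout arithmetic can be sketched in C++, where the hidden per-object field is just the vptr; D adds a monitor pointer on top of that, which is where the unexplained 4 bytes (on 32-bit) come from. The type names below are invented for illustration:

```cpp
#include <cstdint>

// A plain aggregate: no virtual functions, no hidden fields.
struct PlainPair {
    int32_t a;
    int32_t b;
};

// Adding any virtual function makes the compiler embed one hidden
// vptr per object. A D class object carries one pointer more than
// this: the monitor slot used by `synchronized`.
struct VirtualPair {
    virtual ~VirtualPair() = default;
    int32_t a;
    int32_t b;
};
```

On a typical 64-bit build `sizeof(VirtualPair)` is 16 versus 8 for `PlainPair`, purely from the vptr plus alignment.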

"christopher diggins" <cdiggins users.sourceforge.net> wrote in message
news:c7r63p$1bv1$1 digitaldaemon.com...
 Thanks for the feedback, Walter. I had only used -O but not -inline
 and -release which sped up the execution to 703 msec and the memory usage
 dropped to 2MB on my system. I have now updated the web page as a result.
 There is still 4 bytes per object that I can't account for. I count 8 bytes
 for the vector data, 4 bytes for the pointer to the object, 4 bytes for the
 vtable in the object, which only brings me to 16 bytes per object, what is
 the other 4 bytes for? Is there a way to allocate 100,000 objects at once
 rather than one by one?

 -- 
 Christopher Diggins
 http://www.cdiggins.com
 http://www.heron-language.com

May 11 2004
prev sibling next sibling parent reply "Walter" <newshound digitalmars.com> writes:
"christopher diggins" <cdiggins users.sourceforge.net> wrote in message
news:c7r63p$1bv1$1 digitaldaemon.com...
 Thanks for the feedback, Walter. I had only used -O but not -inline
 and -release which sped up the execution to 703 msec and the memory usage
 dropped to 2MB on my system. I have now updated the web page as a result.

Thanks! Could you also please hyperlink the "Digital Mars Compiler" to
www.digitalmars.com/d/ ?

You'll also get faster results if you use 'final' on all member functions
that are not overridden (essentially all the functions you don't mark as
'virtual' in the C++ version).
 There is still 4 bytes per object that I can't account for. I count 8 bytes
 for the vector data, 4 bytes for the pointer to the object, 4 bytes for the
 vtable in the object, which only brings me to 16 bytes per object, what is
 the other 4 bytes for?

It's for the monitor.
 Is there a way to allocate 100,000 objects at once
 rather then one by one?

Yes. Implement a new() function for the class, and have it parcel out bits from a preallocated array.
May 11 2004
next sibling parent reply Daniel Horn <hellcatv hotmail.com> writes:
Walter wrote:
 "christopher diggins" <cdiggins users.sourceforge.net> wrote in message
 news:c7r63p$1bv1$1 digitaldaemon.com...
 
Thanks for the feedback, Walter. I had only used -O but not -inline
and -release which sped up the execution to 703 msec and the memory usage
dropped to 2MB on my system. I have now updated the web page as a result.

Thanks! Could you also please hyperlink the "Digital Mars Compiler" to www.digitalmars.com/d/ ? You'll also get faster results if you use 'final' on all member functions that are not overridden (essentially all the functions you don't mark as 'virtual' in the C++ version).
There is still 4 bytes per object that I can't account for. I count 8 bytes
for the vector data, 4 bytes for the pointer to the object, 4 bytes for the
vtable in the object, which only brings me to 16 bytes per object, what is
the other 4 bytes for?

It's for the monitor.
Is there a way to allocate 100,000 objects at once
rather then one by one?

Yes. Implement a new() function for the class, and have it parcel out bits from a preallocated array.

in an array (or you may wish to have the same class in 3 different packed arrays). Clearly you could make some sort of template class that inherits, and make that have a "new" operator that does the parceling... but it's a hack that only lets me have a single stack per template class (what if I want to have many stacks of said objects?). I could have a global "current stack" pointer and set it to my next available stack, but this is beginning to sound more and more contrived. Why not just use a struct then? Do structs support static inheritance (I guess they will with mixins?)
May 11 2004
parent reply "Walter" <newshound digitalmars.com> writes:
"Daniel Horn" <hellcatv hotmail.com> wrote in message
news:c7ra0p$1hr4$1 digitaldaemon.com...
Is there a way to allocate 100,000 objects at once
rather than one by one?

Yes. Implement a new() function for the class, and have it parcel out bits
from a preallocated array.

in an array (or you may wish to have the same class in 3 different packed
arrays)

I do the same sorts of optimizations in C++ when I want performance. Storage allocation is always a ripe target for performance tuning.
 clearly you could make some sort of template class that inherits and
 make that have a "new" operator that does the parceling...
 but it's a hack that only lets me have a single stack per template class
 (what if I want to have many stacks of said objects) I could have a
 global "current stack" pointer and set it to my next available stack,
 but this is beginning to sound more and more contrived.

That's why the general purpose storage allocator is the default solution. But when speed is your master, writing custom allocators can yield big benefits, and that's why D supports doing it.
 why not just do a structure then--

Structs can't inherit from an interface.
 Do structs support static inheritance (I guess they will with mixins?)

Mixins have the potential to do this. Stay tuned!
May 11 2004
parent reply -scooter- <scottm cs.ucla.edu> writes:
Walter wrote:

 I do the same sorts of optimizations in C++ when I want performance. Storage
 allocation is always a ripe target for performance tuning.

One would think so, but an excellent paper from OOPSLA '02 has hard evidence
to the contrary: "Reconsidering Custom Memory Allocation" (Berger et al.).
The paper is at http://www.cs.umass.edu/~emery/pubs/berger-oopsla2002.pdf

The only time it seems that a custom memory allocator wins is in the "bulk
free" case, when an entire heap of objects (like a compiler's AST for a
class <hint>) gets deallocated. Otherwise, custom storage management is
almost a complete waste of time.

I haven't taken Berger's HeapLayers or reaps allocation libraries out for a
spin yet, but it looks interesting.

-scooter
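The "bulk free" case the paper singles out can be sketched as a region (arena) in C++. The class below is a toy illustration, not Berger's HeapLayers API: objects are carved out of chunks, nothing is freed individually, and destroying the region releases everything at once.

```cpp
#include <cstddef>
#include <memory>
#include <vector>

// Toy region allocator: allocate related objects together, then free
// them all in one shot when the Region goes out of scope.
class Region {
public:
    void* allocate(std::size_t size) {
        // Round up so every allocation stays maximally aligned.
        size = (size + alignof(std::max_align_t) - 1)
             & ~(alignof(std::max_align_t) - 1);
        if (chunks_.empty() || used_ + size > kChunkBytes) {
            chunks_.push_back(std::make_unique<unsigned char[]>(kChunkBytes));
            used_ = 0;
        }
        void* p = chunks_.back().get() + used_;
        used_ += size;
        return p;
    }
    // ~Region() frees every chunk at once -- the bulk free.
private:
    static constexpr std::size_t kChunkBytes = 64 * 1024;
    std::vector<std::unique_ptr<unsigned char[]>> chunks_;
    std::size_t used_ = 0;
};
```

Note there is no per-object `free`; for objects with non-trivial destructors you would also need to run those before the chunks go away.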
May 12 2004
next sibling parent reply Kevin Bealer <Kevin_member pathlink.com> writes:
In article <c7tlva$22e8$1 digitaldaemon.com>, -scooter- says...
Walter wrote:

 I do the same sorts of optimizations in C++ when I want performance. Storage
 allocation is always a ripe target for performance tuning.

One would think so, but an excellent paper from OOPSLA '02 has hard evidence to the contrary: Reconsidering custom memory allocation (Berger, et. al). Paper is at http://www.cs.umass.edu/~emery/pubs/berger-oopsla2002.pdf The only time it seems that a custom memory allocator wins is in the "bulk free" case, when an entire heap of objects (like a compiler's AST for a class <hint>) gets deallocated. Otherwise, custom storage management is almost a complete waste of time. I haven't taken Berger's HeapLayers or reaps allocation libraries out for a spin yet, but it looks interesting. -scooter

In a GC program, you can avoid GC overhead by keeping free lists of anything
you allocate. The garbage collector won't run because you aren't calling
new(). This was a tip for "real time" programming.

Note that you are giving up benefit #1 of a garbage collector: not having to
manage delete[] yourself. You also run the risk of old pointers to supposedly
"new" objects, if an object wasn't really ready to be deleted. On the plus
side, if you are not sure it's garbage, don't put it on the list: the GC will
eventually hunt it down if you are, in fact, leaking.

Kevin
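The free-list tip, sketched in C++ (names and the bound are invented for illustration): recycle objects through your own list so neither the collector nor the general-purpose allocator is involved on the hot path, and cap the list so it can't hog memory.

```cpp
#include <cstddef>
#include <vector>

// Bounded object recycler. acquire() reuses a cached object when one
// is available; release() caches up to MaxFree objects and really
// frees the rest.
template <typename T, std::size_t MaxFree = 1024>
class Recycler {
public:
    ~Recycler() {
        for (T* p : free_) delete p;   // release anything still cached
    }
    T* acquire() {
        if (!free_.empty()) {
            T* p = free_.back();       // reuse a cached object
            free_.pop_back();
            return p;
        }
        return new T();                // cache empty: really allocate
    }
    void release(T* p) {
        if (free_.size() < MaxFree)
            free_.push_back(p);        // keep for reuse
        else
            delete p;                  // over the bound: really free
    }
private:
    std::vector<T*> free_;
};
```

The "old pointers to recycled objects" risk in the post is exactly the hazard here: after `release(p)`, any surviving alias of `p` silently points at whatever `acquire()` hands out next.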
May 12 2004
parent reply Scott Michel <scottm cs.ucla.edu> writes:
Kevin Bealer wrote:
 In a GC program, you can avoid GC overhead by keeping free lists of
 anything you allocate.  The garbage collector won't run because you aren't
 calling new().  This was a tip for "real time" programming.  Note that you
 are avoiding benefit #1 of a garbage collector: not having to manage
 delete[] yourself.  You also run the risk of old pointers to supposedly
 "new" objects, if it wasn't ready to be deleted.  On the plus side, if you
 are not sure it's garbage, don't put it on the list: the GC will
 eventually hunt it down if you are, in fact, leaking.

Any "modern" general-purpose allocator has done this since the late 80s. Preventing free-space fragmentation turns out to be the bigger problem. Besides, if you keep a free list, then you're into having to play "God" with object ctor and dtor. :-)
May 13 2004
parent reply Kevin Bealer <Kevin_member pathlink.com> writes:
In article <c8075s$1ff$1 digitaldaemon.com>, Scott Michel says...
Kevin Bealer wrote:
 In a GC program, you can avoid GC overhead by keeping free lists of
 anything you allocate.  The garbage collector won't run because you aren't
 calling new().  This was a tip for "real time" programming.  Note that you
 are avoiding benefit #1 of a garbage collector: not having to manage
 delete[] yourself.  You also run the risk of old pointers to supposedly
 "new" objects, if it wasn't ready to be deleted.  On the plus side, if you
 are not sure it's garbage, don't put it on the list: the GC will
 eventually hunt it down if you are, in fact, leaking.

Any "modern" general-purpose allocator already does this since the late 80s. Preventing free space fragmentation turns out to be the bigger problem. Besides, if you keep a free list, then you're into having to play "God" with object ctor and dtor. :-)

I've usually heard that phrase when talking about cloning or life/death
issues... I'm not sure what rights are reserved to God when talking about
writing software...

If you mean that the free list can run very large and hog memory, you can of
course provide a size bound. If it gets over N elements, clear the "top"
pointer, or selectively drop elements down to N.

Regarding fragmentation: does the D collector actually copy collect, or does
it need to be conservative? I would suspect that using C pointers and copy
collection are two great tastes that don't taste great together.

Kevin
May 13 2004
parent reply -scooter- <scottm cs.ucla.edu> writes:
Kevin Bealer wrote:

 In article <c8075s$1ff$1 digitaldaemon.com>, Scott Michel says...
 
Any "modern" general-purpose allocator already does this since the late 80s.
Preventing free space fragmentation turns out to be the bigger problem.

Besides, if you keep a free list, then you're into having to play "God" with
object ctor and dtor. :-)

I've usually heard that phrase when talking about cloning or life/death issues... I'm not sure what rights are reserved to God when talking about writing software... If you mean that the free list can run very large and hog memory, you can of course provide a size bound. If it gets over N elements, clear the "top" pointer, or selectively drop elements down to N.

My point was that you become responsible for object lifetimes; you're responsible for their proper birth and death (and possibly inadvertent resurrection... ok, bad pun sequence.) So keeping your own free list means that you might have faster allocation, but you now become responsible for a lot more than just allocation. But as I pointed out, modern mallocs already do this.
 Regarding fragmentation: does the D collector actually copy collect, or does it
 need to be conservative?  I would suspect that using C pointers and copy
 collection are two great tastes that don't taste great together.

Wrong type of fragmentation. :-) This happens when the allocator doesn't have
specifically sized arenas for certain sized allocations (usually powers of 2,
n > 4). If the allocator just tries to find the best fit, breaking up
previously allocated blocks without coalescing adjacent free space, you end
up with a very fragmented memory map.

-scooter
May 17 2004
parent reply Kevin Bealer <Kevin_member pathlink.com> writes:
In article <c8bg4j$1a4b$1 digitaldaemon.com>, -scooter- says...
Kevin Bealer wrote:

 In article <c8075s$1ff$1 digitaldaemon.com>, Scott Michel says...
 
Any "modern" general-purpose allocator already does this since the late 80s.
Preventing free space fragmentation turns out to be the bigger problem.

Besides, if you keep a free list, then you're into having to play "God" with
object ctor and dtor. :-)

I've usually heard that phrase when talking about cloning or life/death issues... I'm not sure what rights are reserved to God when talking about writing software... If you mean that the free list can run very large and hog memory, you can of course provide a size bound. If it gets over N elements, clear the "top" pointer, or selectively drop elements down to N.

My point was that you become responsible for object lifetimes; you're responsible for their proper birth and death (and possibly inadvertent resurrection... ok, bad pun sequence.) So keeping your own free list means that you might have faster allocation, but you now become responsible for a lot more than just allocation. But as I pointed out, modern mallocs already do this.

Okay, that makes sense.
 Regarding fragmentation: does the D collector actually copy collect, or does it
 need to be conservative?  I would suspect that using C pointers and copy
 collection are two great tastes that don't taste great together.

Wrong type of fragmentation. :-) This happens when the allocator doesn't have specifically sized arenas for certain sized allocations (usually powers of 2, n > 4). If the allocator just tries to find the best fit, breaking up previously allocated blocks without coalescing adjacent free space, you end up with a very fragmented memory map.

I was thinking "locality". Okay, this makes sense. Yeah, you can use a large-program allocator that uses a large number of free lists and chops up blocks (completely unsuitable for small, memory-tight programs), or you can use small-piece allocators and tolerate fragmentation. The ugliest way is to tweak the algorithm to use the same-sized object (by making buffers a certain size).

Kevin
-scooter

May 18 2004
parent reply -scooter- <scottm cs.ucla.edu> writes:
Kevin Bealer wrote:
Regarding fragmentation: does the D collector actually copy collect, or does it
need to be conservative?  I would suspect that using C pointers and copy
collection are two great tastes that don't taste great together.

Wrong type of fragmentation. :-) This happens when the allocator doesn't have specifically sized arenas for certain sized allocations (usually powers of 2, n > 4). If the allocator just tries to find the best fit, breaking up previously allocated blocks without coalescing adjacent free space, you end up with a very fragmented memory map.

I was thinking "locality". Okay, this makes sense. Yeah, you can use a large program allocator that uses a large number of free lists and chops up blocks (completely unsuitable for small, memory tight, programs), or you can use small-piece allocators and tolerate fragmentation. The ugliest way is to tweak the algorithm to use the same sized object (by making buffers a certain size).

You're thinking of customized slab allocators, or separate heaps for a
particular object type or for objects of the same size: once you allocate
from the custom heap, you return objects to it and keep track of the free
ones. Yup, modern mallocs (FreeBSD's libc version and Lea's, commonly found
in Linux's glibc and usable under Win32) do this already; that's why they
have arenas. The allocators you're thinking of that have really nasty
allocation algorithms are the old Microsoft C compiler's malloc and the old
Kingsley malloc in past BSDs.

Basically, there's really not much that's exciting about memory allocation.
Berger's paper is probably the most exciting research I've seen in 15 years,
and it solves an interesting problem when you really need custom allocation
or want a region of related objects allocated, initialized, freed and
destroyed as a region. The current GP allocators do a really good job, and
tweaking won't actually buy you much extra performance.
May 19 2004
parent Kevin Bealer <Kevin_member pathlink.com> writes:
In article <c8hb2a$1o63$1 digitaldaemon.com>, -scooter- says...
Kevin Bealer wrote:
Regarding fragmentation: does the D collector actually copy collect, or does it
need to be conservative?  I would suspect that using C pointers and copy
collection are two great tastes that don't taste great together.

Wrong type of fragmentation. :-) This happens when the allocator doesn't have specifically sized arenas for certain sized allocations (usually powers of 2, n > 4). If the allocator just tries to find the best fit, breaking up previously allocated blocks without coalescing adjacent free space, you end up with a very fragmented memory map.

I was thinking "locality". Okay, this makes sense. Yeah, you can use a large program allocator that uses a large number of free lists and chops up blocks (completely unsuitable for small, memory tight, programs), or you can use small-piece allocators and tolerate fragmentation. The ugliest way is to tweak the algorithm to use the same sized object (by making buffers a certain size).

You're thinking of customized slab allocators or separate heaps for a particular object type or objects of the same size. Once you allocated from the custom heap, you return it and keep track of the free objects. Yup, modern mallocs (FreeBSD's libc version and Lea's, commonly found in Linux's glibc and usable under Win32) do this already; that's why they have arenas. The allocators you're thinking about that has really nasty allocation algorithms are the old Microsoft C compiler's malloc and the old Kingsley malloc in past BSDs. Basically, there's really not much that's exciting about memory allocation. Berger's paper is probably the most exciting research I've seen in 15 years and it solves an interesting problem when you really need custom allocation or want a region of related objects allocated, initialized, freed and destroyed as a region. The current GP allocators do a really good job and tweaking won't actually buy you much extra performance.

Which makes perfect sense. I was writing one because I had a neat idea vis a
vis persistent objects and relative pointers. You can create data in a
subheap and mmap it in, providing an instant program "state load" effect; the
relative pointers keep all objects connected. Except, of course, virtual
pointers; also, you need to rewrite every class to use the subheap,
everything on the subheap needs to be connected to a "top" pointer, etc. It
works, it's good, but it's special purpose. A smart language-space
implementation could take advantage of such a design, but it gets messy in
portable-application space.

The other time you might want one is when you want a set of data (i.e. a
linked list) to have really good locality. But the solution to that is a
vector or plex. (By plex, I mean a list-of-small-vectors with a list-like
interface. They "should" have good performance in many common cases; they
were used in a GNU container library that went away when STL came out.)
Vectors are probably good enough for this purpose.

I guess the general philosophy of all that is that we should start treating
heap storage the way we treat FS storage. If the system runs out of main
memory, and all of your data is organized by memory pages, then OS page-out
won't cause thrashing until every page of system memory is in use. Locality
is treated as foremost in disk container algorithm design; swap thrashing
occurs because it is not treated that way in heap container class design. To
a degree, in most cases, always ski in control, etc.

Kevin
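The relative-pointer trick can be sketched in a few lines of C++: store a byte offset from the pointer's own location rather than an absolute address, so a whole memory-mapped block stays internally consistent wherever it lands. (boost::interprocess::offset_ptr is a production version of this idea; the class below is a toy.)

```cpp
#include <cstddef>

// Self-relative pointer: holds the distance from itself to the target,
// so copying/mapping the enclosing block preserves the link.
// Caveat: offset 0 doubles as null, so it can't point at itself.
template <typename T>
struct RelPtr {
    std::ptrdiff_t offset = 0;   // 0 means null

    void set(T* target) {
        offset = target
            ? reinterpret_cast<const char*>(target)
                  - reinterpret_cast<const char*>(this)
            : 0;
    }
    T* get() const {
        if (!offset) return nullptr;
        char* base = const_cast<char*>(reinterpret_cast<const char*>(this));
        return reinterpret_cast<T*>(base + offset);
    }
};
```

Copying a structure that contains such a pointer keeps the link pointing at the copy's own field, which is exactly what makes the mmap "state load" effect work.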
May 20 2004
prev sibling parent "Walter" <newshound digitalmars.com> writes:
"-scooter-" <scottm cs.ucla.edu> wrote in message
news:c7tlva$22e8$1 digitaldaemon.com...
 Walter wrote:

 I do the same sorts of optimizations in C++ when I want performance. Storage
 allocation is always a ripe target for performance tuning.

 One would think so, but an excellent paper from OOPSLA '02 has hard evidence
 to the contrary: Reconsidering custom memory allocation (Berger, et al.).
 Paper is at http://www.cs.umass.edu/~emery/pubs/berger-oopsla2002.pdf

 The only time it seems that a custom memory allocator wins is in the "bulk
 free" case, when an entire heap of objects (like a compiler's AST for a
 class <hint>) gets deallocated. Otherwise, custom storage management is
 almost a complete waste of time.

 I haven't taken Berger's HeapLayers or reaps allocation libraries out for a
 spin yet, but it looks interesting.

The article is right in that sometimes performance optimizations yield unexpected results. Just goes to show one should rely on a good performance profiler when doing optimization.
May 12 2004
prev sibling next sibling parent reply "christopher diggins" <cdiggins users.sourceforge.net> writes:
-- 
Christopher Diggins
http://www.cdiggins.com
http://www.heron-language.com
"Walter" <newshound digitalmars.com> wrote in message
news:c7r9ku$1h7b$1 digitaldaemon.com...
 "christopher diggins" <cdiggins users.sourceforge.net> wrote in message
 news:c7r63p$1bv1$1 digitaldaemon.com...
 Thanks for the feedback, Walter. I had only used -O but not -inline
 and -release which sped up the execution to 703 msec and the memory usage
 dropped to 2MB on my system. I have now updated the web page as a result.

 Thanks! Could you also please hyperlink the "Digital Mars Compiler" to
 www.digitalmars.com/d/ ? You'll also get faster results if you use 'final'
 on all member functions that are not overridden (essentially all the
 functions you don't mark as 'virtual' in the C++ version).

Done and done. The final keyword sped it up even more, from 710 to 610 msec.
 There is still 4 bytes per object that I can't account for. I count 8 bytes
 for the vector data, 4 bytes for the pointer to the object, 4 bytes for the
 vtable in the object, which only brings me to 16 bytes per object, what is
 the other 4 bytes for?

It's for the monitor.

What about the 4 bytes on top of that for the interfaces? What are they for? Any plans on changing this in the future, or adding a compiler switch?
 Is there a way to allocate 100,000 objects at once
 rather then one by one?

Yes. Implement a new() function for the class, and have it parcel out bits from a preallocated array.

I won't be including that in the example then; it seems like too much of a
roundabout way to do things for the scope of the program I included. Thanks
again, Walter!

Christopher Diggins
http://www.heron-language.com
May 11 2004
parent reply "Walter" <newshound digitalmars.com> writes:
"christopher diggins" <cdiggins users.sourceforge.net> wrote in message
news:c7rfbm$1q4l$1 digitaldaemon.com...
 Done and done. The final keywords sped it up even more from 710 to 610
 msec.

 There is still 4 bytes per object that I can't account for. I count 8 bytes
 for the vector data, 4 bytes for the pointer to the object, 4 bytes for the
 vtable in the object, which only brings me to 16 bytes per object, what is
 the other 4 bytes for?

It's for the monitor.

 What about the 4 bytes on top of that for the interfaces what are they
 for?

Each interface derived from has a vptr allocated for it.
 Any plans on changing this in the future, or making a compile switch?

No, it's needed to get reasonable performance out of interfaces.
 Is there a way to allocate 100,000 objects at once
 rather than one by one?

 Yes. Implement a new() function for the class, and have it parcel out bits
 from a preallocated array.

I won't be including that in the example then, it seems like too much of a roundabout way to do things for the scope of the program I included. Thanks again Walter!

You're welcome. You should also note that arrays have a built-in sort property, which uses quicksort. It'll blow away heapsort <g>. Doing that cuts the sort time on my machine from 453 ms to 63 ms. Source attached.
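The "hand the work to the library's tuned sort" advice, as it would look in the benchmark's C++ twin (a sketch; `sortedCopy` is an invented helper, and D's built-in property is the quicksort Walter mentions, while most std::sort implementations are introsort):

```cpp
#include <algorithm>
#include <vector>

// C++ counterpart of D's built-in array .sort property: one call to
// the library's tuned implementation instead of a hand-rolled
// heapsort. Introsort keeps quicksort's constants but falls back to
// heapsort to avoid the O(n^2) worst case.
std::vector<int> sortedCopy(std::vector<int> v) {
    std::sort(v.begin(), v.end());
    return v;
}
```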
May 11 2004
next sibling parent reply "Jan-Eric Duden" <jeduden whisset.com> writes:
 You're welcome. You should also note that arrays have a built-in sort
 property, which uses quicksort. It'll blow away heapsort <g>. Doing that
 cuts the sort time on my machine from 453 ms to 63 ms. Source attached.

Is the speed difference due to the different algorithms (quicksort vs. heapsort)? Or can the code for the sort property be better optimized than a "normal" implementation of a sort algorithm?
May 12 2004
parent reply "Walter" <newshound digitalmars.com> writes:
"Jan-Eric Duden" <jeduden whisset.com> wrote in message
news:c7sioo$cvh$1 digitaldaemon.com...
 You're welcome. You should also note that arrays have a built-in sort
 property, which uses quicksort. It'll blow away heapsort <g>. Doing that
 cuts the sort time on my machine from 453 ms to 63 ms. Source attached.

Is the speed difference due to the different algorithms (quicksort vs. heapsort)? Or can the code for the sort property be better optimized than a "normal" implementation of a sort algorithm?

I'd have to do testing to make sure, but I'm pretty sure it's due to quicksort being algorithmically superior. One of the reasons I built sort into arrays is because people implement inefficient sorts over and over, and I wanted to make it real easy to use the best algorithm. (A job I had once was to optimize a large program written by others; in poking around in it I found 3 different hacky implementations of bubblesort. Replaced them all with a call to C's qsort(), and I was an instant hero for the speedup <g>.)
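For reference, the C library call in question takes a three-way comparison callback; a minimal, self-contained use (C++ syntax over the same C API):

```cpp
#include <cstdlib>

// Comparison callback in the form qsort() expects: negative, zero, or
// positive. The (x > y) - (x < y) idiom avoids the signed overflow
// that the naive `return x - y;` risks.
int compareInts(const void* a, const void* b) {
    const int x = *static_cast<const int*>(a);
    const int y = *static_cast<const int*>(b);
    return (x > y) - (x < y);
}
```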
May 12 2004
next sibling parent reply "Jan-Eric Duden" <jeduden whisset.com> writes:
 I'd have to do testing to make sure, but I'm pretty sure it's due to
 quicksort being algorithmically superior. One of the reasons I built sort
 into arrays is because people implement inefficient sorts over and over,
 and I wanted to make it real easy to use the best algorithm. (A job I had
 once was to optimize a large program written by others; in poking around in
 it I found 3 different hacky implementations of bubblesort. Replaced them
 all with a call to C's qsort(), and I was an instant hero for the speedup
 <g>.)
I thought qsort and heapsort work at about the same speed. I thought the most
important difference between a heapsort and qsort implementation in C is the
order of the algorithms:
- heapsort has roughly the same performance for all array instantiations:
  O(n*log n)
- qsort is a lot slower, O(n^2), on some instantiations and a lot faster,
  O(n), on almost sorted arrays; the average is of course O(n*log n)
Maybe the difference is that the C qsort implementation is highly optimized
and the heapsort implementation that was used is not?
May 12 2004
parent reply "christopher diggins" <cdiggins users.sourceforge.net> writes:
"Jan-Eric Duden" <jeduden whisset.com> wrote in message
news:c7spim$n4h$1 digitaldaemon.com...
 I'd have to do testing to make sure, but I'm pretty sure it's due to
 quicksort being algorithmically superior. One of the reasons I built sort
 into arrays is because people implement inefficient sorts over and over,
 and I wanted to make it real easy to use the best algorithm. (A job I had
 once was to optimize a large program written by others; in poking around in
 it I found 3 different hacky implementations of bubblesort. Replaced them
 all with a call to C's qsort(), and I was an instant hero for the speedup
 <g>.)

I thought qsort and heapsort work at about the same speed. I thought the most
important difference between a heapsort and qsort implementation in C is the
order of the algorithms: heapsort has roughly the same performance for all
array instantiations, O(n*log n); qsort is a lot slower, O(n^2), on some
instantiations and a lot faster, O(n), on almost sorted arrays; the average
is of course O(n*log n). Maybe the difference is that the C qsort
implementation is highly optimized and the heapsort implementation that was
used is not?

Both quicksort and heapsort are O(n log n) on average, but the constants
involved in quicksort are much better. Quicksort, though, depends a lot on
whether the partitioning subroutine is balanced. The differences are
discussed at length in Introduction to Algorithms by Cormen, Leiserson and
Rivest ( http://theory.lcs.mit.edu/~clr/ )

-- 
Christopher Diggins
http://www.cdiggins.com
http://www.heron-language.com
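For concreteness, here is the heapsort under discussion in compact array form (a plain C++ sketch, not the benchmark's object-oriented version). Both the heapify and extraction phases do O(log n) sifts, which is why its running time barely varies with input order:

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Restore the max-heap property for the subtree rooted at `start`,
// treating a[0..end) as the heap.
static void siftDown(std::vector<int>& a, std::size_t start, std::size_t end) {
    std::size_t root = start;
    while (2 * root + 1 < end) {
        std::size_t child = 2 * root + 1;
        if (child + 1 < end && a[child] < a[child + 1])
            ++child;                        // pick the larger child
        if (a[root] >= a[child])
            return;
        std::swap(a[root], a[child]);
        root = child;
    }
}

void heapsort(std::vector<int>& a) {
    std::size_t n = a.size();
    for (std::size_t i = n / 2; i-- > 0; )  // build the max-heap
        siftDown(a, i, n);
    for (std::size_t i = n; i-- > 1; ) {    // repeatedly extract the max
        std::swap(a[0], a[i]);
        siftDown(a, 0, i);
    }
}
```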
May 12 2004
parent reply Stewart Gordon <smjg_1998 yahoo.com> writes:
christopher diggins wrote:

<snip>
 - qsort is a lot slower O(n^2) on some instanciations and a lot 
 faster on almost sorted arrays O(n). the average is of course: 
 O(n*log n)


Can you give an example of an O(n) case of quicksort?
 Maybe the difference is that the C q-sort implementation is highly
 optimizied and the heapsort implementation that was used is not?

Both quicksort and heapsort are O(n log n) on average

Yes, but in the worst case, heapsort wins O(n log n) to O(n^2).

When benchmarking a sorting algorithm, it's helpful to try a number of
random cases, as well as the control cases of already sorted and already
sorted backwards.

If you're benchmarking algorithms, rather than compilers or languages, you
should try to use the same compiler, the same language and the same compiler
options. As such, it makes limited sense to compare a ready-made C library
function of one algorithm with your own implementation of another. Of
course, if you're comparing languages/compilers, you should compare
implementations of the same algorithm, and set as similar levels of
optimisation as possible (ITMS).

Stewart.

-- 
My e-mail is valid but not my primary mailbox, aside from its being the
unfortunate victim of intensive mail-bombing at the moment. Please keep
replies on the 'group where everyone may benefit.
May 12 2004
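[Editor's note: Stewart's three control cases can be sketched as a small harness. This is a hypothetical C++ sketch (C++ being the thread's comparison language); the names `make_case` and `time_ms` are mine, not from any post.]

```cpp
#include <algorithm>
#include <chrono>
#include <cstddef>
#include <random>
#include <vector>

// The three control inputs Stewart suggests: random, already sorted,
// and already sorted backwards.
enum CaseKind { Random, Sorted, Reverse };

std::vector<int> make_case(std::size_t n, CaseKind kind) {
    std::vector<int> v(n);
    std::mt19937 rng(42);                       // fixed seed: repeatable runs
    for (auto& x : v) x = static_cast<int>(rng() % 100000);
    if (kind != Random) std::sort(v.begin(), v.end());
    if (kind == Reverse) std::reverse(v.begin(), v.end());
    return v;
}

// Time one sort routine on its own copy of the data, in milliseconds.
double time_ms(void (*sort_fn)(std::vector<int>&), std::vector<int> data) {
    auto t0 = std::chrono::steady_clock::now();
    sort_fn(data);
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}
```

For example, `time_ms(+[](std::vector<int>& v) { std::sort(v.begin(), v.end()); }, make_case(100000, Sorted))` pits the library sort against the already-sorted control case.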
next sibling parent reply "Jan-Eric Duden" <jeduden whisset.com> writes:
 Can you give an example of an O(n) case of quicksort?

A sorted array. You only get O(n) in a slightly modified version of q-sort
that quits if no swaps occurred.
 Maybe the difference is that the C q-sort implementation is highly
 optimizied and the heapsort implementation that was used is not?

Both quicksort and heapsort are O(n log n) on average

Yes, but in the worst case, heapsort wins O(n log n) to O(n^2).

My question was more related to the constants. I wasn't aware that heapsort is that much slower.
May 12 2004
parent reply Stewart Gordon <smjg_1998 yahoo.com> writes:
Jan-Eric Duden wrote:

Can you give an example of an O(n) case of quicksort?

A sorted array. You only get O(n) in a slightly modified version of q-sort that quits if no swaps occurred.

By 'swaps' do you mean the swapping of an element with the hole into which the key element is to go? How can one element being already in the right place imply that the whole array is sorted? <snip>
 Yep, that's why sometimes heapsort is preferred over quicksort.
 
 My question was more related to the constants. I wasn't aware that heapsort
 is that much slower.

What constants? Stewart. -- My e-mail is valid but not my primary mailbox, aside from its being the unfortunate victim of intensive mail-bombing at the moment. Please keep replies on the 'group where everyone may benefit.
May 12 2004
parent reply Kevin Bealer <Kevin_member pathlink.com> writes:
In article <c7tbbi$1i8q$1 digitaldaemon.com>, Stewart Gordon says...
Jan-Eric Duden wrote:

Can you give an example of an O(n) case of quicksort?

A sorted array. You only get O(n) in a slightly modified version of q-sort that quits if no swaps occurred.

By 'swaps' do you mean the swapping of an element with the hole into which the key element is to go? How can one element being already in the right place imply that the whole array is sorted? <snip>
 Yep, that's why sometimes heapsort is preferred over quicksort.
 
 My question was more related to the constants. I wasn't aware that heapsort
 is that much slower.

What constants?

O(N), O(N^2), O(N*lnN) and so on describe how the number of operations
required is affected by the size of the input set. This is called "big O
notation". We say a sort takes O(N^2) when the time required is proportional
to the square of the input set size (N). This is called "order of N squared".

The real speed of the algorithm is something like K * N^2. If one algorithm
is twice as fast as another for any value of N, then the slower one is said
to have K=2 while the faster has K=1. K is the constant.

All of this analysis ignores effects like cache locality, because they are
difficult to summarize in any simple and general way. Normally, O(...), or
"order", is the most important criterion. If O(...) is the same for both
designs, then the secondary effects apply. For large data sets, cache and
locality issues are most important, whereas for small data sets, the
"constant" is probably more important. "Small" here probably means "fits, or
almost fits, in memory".

Special data sets can trip up some algorithms; for example, a carelessly
written qsort will take forever on already-sorted data.

Bubble sort is the classic "bad" sorting algorithm. It still has its
defenders, for special cases. Like insertion sort, it probably survives
because it is easy to code for small hand-written linked lists in languages
like C. Often large, old programs in C will have dozens or hundreds of data
structures of the type "small-structs-in-a-short-list"; half of these will
have a ten-line hand-coded insertion sort (shudder). "Don't get out of the
boat".

I always liked "shell sorts" and "natural merge sorts" but no one uses those
any more. Probably because they're really slow... oh well, to each his own I
guess.

Kevin
May 12 2004
parent reply Stewart Gordon <smjg_1998 yahoo.com> writes:
Kevin Bealer wrote:

<snip>
 O(N), O(N^2), O(N*lnN) and so on, describe how the number of operations
required
 is affected by the size of the input set.  This is called "big O notation".
 
 We say a sort takes O(N^2) when the time required is proportional to the square
 of the input set size (N).  This is called "order of N squared".

You seldom get direct proportionality in practice. In fact, it means that
there exists some K for which F(N) < K*N^2, for all N>N0 for some N0. And
if you replace < with >, you get big Omega notation. And if you consider
both together, you get big Theta notation.

I guess that people commonly use big O when they really mean big Theta,
probably because O is on most keyboards/character sets, and considering
that anything Theta(N^2) is O(N^2) anyway.
 The real speed of the algorithm is something like K * N^2.  If one algorithm is
 twice as fast as another for any value of N, then it is said to have K=2 while
 the other has K=1.  K is the constant.

Yes. Though it would depend on such things as the relative costs of comparison and assignment/swapping. <snip>
 Bubble sort is the classic "bad" sorting algorithm.  It still has its
defenders,
 for special cases.  Like insertion sort, it probably survives because it is
easy
 to code for small hand written linked lists in languages like C.

I guess for linked lists, insertion sort is as easy or perhaps easier to
code than bubble sort. Just thinking about it: on a linked list, insertion
sort has the advantage that an insertion is O(1). OTOH, on an array it has
the advantage that a search for the right place is O(log N). I suppose the
constant for the whole alg is smaller for the linked list....

I guess selection sort (or shaker sort, which I hadn't heard of before but
is a perfectly intuitive improvement I had thought up) is equally easy and
thus defendable. It even has its strengths, such as sorting tournament
crosstables....

<snip>
 I always liked "shell sorts" and "natural merge sorts" but noone uses those any
 more.  Probably because they're really slow... oh well to each his own I guess.

Strange - JA's source shows shell sort being marginally quicker than heap sort. Stewart. -- My e-mail is valid but not my primary mailbox, aside from its being the unfortunate victim of intensive mail-bombing at the moment. Please keep replies on the 'group where everyone may benefit.
May 13 2004
next sibling parent reply J Anderson <REMOVEanderson badmama.com.au> writes:
Stewart Gordon wrote:

 Yes.  Though it would depend on such things as the relative costs of 
 comparison and assignment/swapping.

<snip>

One of the great things about Ada generics is that you could easily replace
the comparison operator and do tests on how many comparisons were actually
performed. Of course you could box all the variables you use in the sort
algorithm, but that's just not the same. I wish D templates could accept
operators/functions as parameters. At the very least it's a good way to
profile that particular aspect of an algorithm.

-- 
-Anderson: http://badmama.com.au/~anderson/
May 13 2004
parent reply "Walter" <newshound digitalmars.com> writes:
"J Anderson" <REMOVEanderson badmama.com.au> wrote in message
news:c7vnbp$2bvs$1 digitaldaemon.com...
  I wish D
 templates could accept operators/functions as parameters.

It can accept functions as parameters - both as function pointers, and as aliases.
May 13 2004
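[Editor's note: the equivalent trick exists in C++, which is the thread's comparison language: a function passed as a non-type template parameter behaves much like a D alias parameter. A sketch of my own, not Phobos code, showing the comparison-counting Anderson asks about; the counter and sort names are illustrative.]

```cpp
#include <cstddef>

// Global comparison counter; incremented by the instrumented predicate.
static std::size_t g_compares = 0;

bool counted_less(int a, int b) { ++g_compares; return a < b; }

// The comparison function is a template parameter, so the compiler can
// inline it -- roughly analogous to a D alias template parameter.
template <bool (*Less)(int, int)>
void insertion_sort(int* a, std::size_t n) {
    for (std::size_t i = 1; i < n; ++i) {
        int key = a[i];
        std::size_t j = i;
        while (j > 0 && Less(key, a[j - 1])) {
            a[j] = a[j - 1];      // shift larger elements right
            --j;
        }
        a[j] = key;
    }
}
```

`insertion_sort<counted_less>(arr, n)` sorts the array and leaves the number of comparisons performed in `g_compares`.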
parent reply J Anderson <REMOVEanderson badmama.com.au> writes:
Walter wrote:

"J Anderson" <REMOVEanderson badmama.com.au> wrote in message
news:c7vnbp$2bvs$1 digitaldaemon.com...
  

 I wish D
templates could accept operators/functions as parameters.
    

It can accept functions as parameters - both as function pointers, and as aliases.

-- -Anderson: http://badmama.com.au/~anderson/
May 14 2004
parent "Walter" <newshound digitalmars.com> writes:
"J Anderson" <REMOVEanderson badmama.com.au> wrote in message
news:c81tqo$2h9g$1 digitaldaemon.com...
 Walter wrote:

"J Anderson" <REMOVEanderson badmama.com.au> wrote in message
news:c7vnbp$2bvs$1 digitaldaemon.com...


 I wish D
templates could accept operators/functions as parameters.

It can accept functions as parameters - both as function pointers, and as aliases.


You can use the equivalent name for the operator functions.
May 14 2004
prev sibling parent reply J Anderson <REMOVEanderson badmama.com.au> writes:
Stewart Gordon wrote:

 I always liked "shell sorts" and "natural merge sorts" but noone uses 
 those any
 more.  Probably because they're really slow... oh well to each his 
 own I guess.

Strange - JA's source shows shell sort being marginally quicker than heap sort. Stewart.

If your talking about the website I provided, did you read that shell
sort was slightly broken?

-- 
-Anderson: http://badmama.com.au/~anderson/
May 13 2004
parent reply Stewart Gordon <smjg_1998 yahoo.com> writes:
J Anderson wrote:
<snip>
 If your talking

My talking?
 about the website I provided, did you read that shell 
 sort was slightly broken?

No. I still don't. Where's that little bit of info buried then? Stewart. -- My e-mail is valid but not my primary mailbox, aside from its being the unfortunate victim of intensive mail-bombing at the moment. Please keep replies on the 'group where everyone may benefit.
May 13 2004
parent J Anderson <REMOVEanderson badmama.com.au> writes:
Stewart Gordon wrote:

 J Anderson wrote:
 <snip>

 If your talking

My talking?
 about the website I provided, did you read that shell sort was 
 slightly broken?

No. I still don't. Where's that little bit of info buried then? Stewart.

-- -Anderson: http://badmama.com.au/~anderson/
May 13 2004
prev sibling parent reply Sean Kelly <sean f4.ca> writes:
Stewart Gordon wrote:
 
 Yes, but in the worst case, heapsort wins O(n log n) to O(n^2).

But some versions of heapsort need a considerable memory overhead to run. In general, quicksort is the better choice for average sorting situations. People with special needs generally also know how to code for those needs :) Sean
May 12 2004
parent reply J Anderson <REMOVEanderson badmama.com.au> writes:
Sean Kelly wrote:

 Stewart Gordon wrote:

 Yes, but in the worst case, heapsort wins O(n log n) to O(n^2).

But some versions of heapsort need a considerable memory overhead to run. In general, quicksort is the better choice for average sorting situations. People with special needs generally also know how to code for those needs :) Sean

Right! Even sort algorithms like short-circuit bubble sort can be useful if
they are used in the right application. Short-circuit bubble sort is most
useful when you know that most of the array is sorted, because then you get
O(n). With standard quicksort you'd get O(n^2) for a sorted array.

-- 
-Anderson: http://badmama.com.au/~anderson/
May 12 2004
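[Editor's note: the short-circuit variant Anderson describes is just bubble sort with an early exit; a minimal C++ sketch, my own illustration.]

```cpp
#include <cstddef>
#include <utility>

// Bubble sort that quits as soon as a full pass makes no swaps.
// On already-sorted input the first pass swaps nothing, giving O(n).
void bubble_sort_short_circuit(int* a, std::size_t n) {
    bool swapped = true;
    while (swapped && n > 1) {
        swapped = false;
        for (std::size_t i = 1; i < n; ++i) {
            if (a[i] < a[i - 1]) {
                std::swap(a[i], a[i - 1]);
                swapped = true;   // something was still out of order
            }
        }
        --n;                      // largest element is now in place
    }
}
```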
parent reply Stewart Gordon <smjg_1998 yahoo.com> writes:
J Anderson wrote:

<snip>
 Right!  Even sort algorithms like short-circuit bubble sort can be 
 useful if they are used in right application.  Short-circuit bubble sort 
 is most useful when you know that most-of-the the array is sorted 
 because then you get o(n). With standard quicksort you'd get O(n^2) for 
 a sorted array.

Even better for nearly sorted data is bidirectional short-circuit bubble sort. Stewart. -- My e-mail is valid but not my primary mailbox, aside from its being the unfortunate victim of intensive mail-bombing at the moment. Please keep replies on the 'group where everyone may benefit.
May 12 2004
parent reply C <qbert atari-soldiers.com> writes:
 Even better for nearly sorted data is bidirectional short-circuit bubble 
 sort.

Hmm, I couldn't find anything on this - do you have some more info?

Charlie

On Wed, 12 May 2004 17:09:30 +0100, Stewart Gordon <smjg_1998 yahoo.com>
wrote:
 J Anderson wrote:

 <snip>
 Right!  Even sort algorithms like short-circuit bubble sort can be 
 useful if they are used in right application.  Short-circuit bubble 
 sort is most useful when you know that most-of-the the array is sorted 
 because then you get o(n). With standard quicksort you'd get O(n^2) for 
 a sorted array.

Even better for nearly sorted data is bidirectional short-circuit bubble sort. Stewart.

-- Using M2, Opera's revolutionary e-mail client: http://www.opera.com/m2/
May 12 2004
next sibling parent J Anderson <REMOVEanderson badmama.com.au> writes:
C wrote:

 Even better for nearly sorted data is bidirectional short-circuit 
 bubble sort.

Hmm, I couldn't find anything on this do you have some more info ? Charlie On Wed, 12 May 2004 17:09:30 +0100, Stewart Gordon <smjg_1998 yahoo.com> wrote:
 J Anderson wrote:

 <snip>

 Right!  Even sort algorithms like short-circuit bubble sort can be 
 useful if they are used in right application.  Short-circuit bubble 
 sort is most useful when you know that most-of-the the array is 
 sorted because then you get o(n). With standard quicksort you'd get 
 O(n^2) for a sorted array.

Even better for nearly sorted data is bidirectional short-circuit bubble sort. Stewart.


In a standard bubble sort, elements only bubble up to the top of the
array. In a bidirectional sort elements sink down to the bottom as well.
Of course short-circuit bubble sort still wins in some circumstances.

Here's an interesting page on sorting (includes bidirectional):
http://www.cs.ubc.ca/spider/harrison/Java/sorting-demo.html

BTW: In uni I had to do a project with 18 completely different sort
algorithms. So I really got to know them well.

-- 
-Anderson: http://badmama.com.au/~anderson/
May 12 2004
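[Editor's note: the bidirectional variant (often called cocktail or shaker sort) alternates the direction of the passes; a minimal C++ sketch of my own, keeping the same short-circuit exit.]

```cpp
#include <cstddef>
#include <utility>

// Bidirectional short-circuit bubble sort: a forward pass bubbles the
// largest element up, a backward pass sinks the smallest down, and the
// sort quits once a full sweep makes no swaps.
void cocktail_sort(int* a, std::size_t n) {
    if (n < 2) return;
    std::size_t lo = 0, hi = n - 1;
    bool swapped = true;
    while (swapped && lo < hi) {
        swapped = false;
        for (std::size_t i = lo; i < hi; ++i)          // bubble max up
            if (a[i + 1] < a[i]) { std::swap(a[i], a[i + 1]); swapped = true; }
        --hi;
        for (std::size_t i = hi; i > lo; --i)          // sink min down
            if (a[i] < a[i - 1]) { std::swap(a[i], a[i - 1]); swapped = true; }
        ++lo;
    }
}
```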
prev sibling parent Stewart Gordon <smjg_1998 yahoo.com> writes:
C wrote:

 Even better for nearly sorted data is bidirectional short-circuit 
 bubble sort.

Hmm, I couldn't find anything on this do you have some more info ?

Several years ago I wrote a simple database application. When adding a
record, it would add it to the end of the file. When deleting one it would
fill the gap with the record from the end. When updating it would just
update in place.

The sort implementation was a bidirectional bubble sort. Hence the records
added since the last sort would bubble up to their correct positions, and
the records moved to fill gaps would bubble down back towards the end.
Updated records (where the change rendered the record out of order) could
bubble in either direction.

The algorithm was also quite suited to the program's rather simplistic
dynamic record loading. At a time it would keep 50 consecutive records in
memory, moving the 'window' as necessary.

Stewart.

-- 
My e-mail is valid but not my primary mailbox, aside from its being the
unfortunate victim of intensive mail-bombing at the moment. Please keep
replies on the 'group where everyone may benefit.
May 13 2004
prev sibling next sibling parent reply resistor AT nospam DOT mac DOT com <resistor_member pathlink.com> writes:
Here's something interesting for everyone. I took the heapsort.d example,
redid it to be a bit more D-like (used a true array rather than an array
class, used opCmp instead of a Compare method, etc.)

Anyways, I decided to test my version against Walter's QuickSort. Here are
the results (the QuickSort and HeapSort files are identical except one
calls HeapSort(a) and the other calls a.sort):

DMD QuickSort - 93 msecs
DMD HeapSort - 62 msecs

GDC QuickSort - 76 msecs
GDC HeapSort - 62 msecs

The times are more or less constant across runs. Only the GDC HeapSort
varies, and only by an msec or two.

This brings to light two things: first, that the HeapSort is potentially
more viable than the QuickSort. And secondly, that the GDC backend is a
good bit more powerful than the DMD backend. Also, I'm pretty sure that
GDC's libphobos has debugging enabled, so it might be able to go faster!

Comments? Thoughts?

Owen

P.S. I've attached the code for this test, in case I've made a silly
mistake that is favoring the HeapSort.

In article <c7sncs$k3e$1 digitaldaemon.com>, Walter says...
I'd have to do testing to make sure, but I'm pretty sure it's due to
quicksort being algorithmically superior. One of the reasons I built sort
into arrays is because people implement inefficient sorts over and over, and
I wanted to make it real easy to use the best algorithm. (A job I had once
was to optimize a large program written by others; in poking around in it I
found 3 different hacky implementations of bubblesort. Replaced them all
with a call to C's qsort(), and I was an instant hero for the speedup <g>.)

May 12 2004
parent J Anderson <REMOVEanderson badmama.com.au> writes:
resistor AT nospam DOT mac DOT com wrote:

Here's something interesting for everyone.  I took the heapsort.d example, redid
it to be a bit more D-
like (used a true array rather than an array class, used opCmp instead of a
Compare method, etc.)

Anyways, I decided to test my version against Walter's QuickSort.  Here are the
results (the QuickSort 
and HeapSort files are identical except one calls HeapSort(a) and the other
calls a.sort):

DMD QuickSort - 93 msecs
DMD HeapSort - 62 msecs

GDC QuickSort - 76 msecs
GDC HeapSort - 62 msecs

The times are more or less constant across runs.  Only the GDC HeapSort varies,
and it only be an msec 
or two.

This brings to light two things: First, that the HeapSort is potentially more
viable than the QuickSort.  
And secondly, that the GDC backend is a good bit more powerful than the DMD
backend.  Also, I'm 
pretty sure that GDC's libphobos has debugging enabled, so it might be able to
go faster!

Comment? Thoughts?

Owen

P.S. I've attached the code for this test, in case I've made a silly mistake
that is favoring the HeapSort.

In article <c7sncs$k3e$1 digitaldaemon.com>, Walter says...
  

I'd have to do testing to make sure, but I'm pretty sure it's due to
quicksort being algorithmically superior. One of the reasons I built sort
into arrays is because people implement inefficient sorts over and over, and
I wanted to make it real easy to use the best algorithm. (A job I had once
was to optimize a large program written by others; in poking around in it I
found 3 different hacky implementations of bubblesort. Replaced them all
with a call to C's qsort(), and I was an instant hero for the speedup <g>.)
    


It would also be interesting to see how something like radix sort compares. -- -Anderson: http://badmama.com.au/~anderson/
May 12 2004
prev sibling parent resistor AT nospam DOT mac DOT com <resistor_member pathlink.com> writes:
Oops.  Forgot the attachment.  Here it is.

In article <c7sncs$k3e$1 digitaldaemon.com>, Walter says...
I'd have to do testing to make sure, but I'm pretty sure it's due to
quicksort being algorithmically superior. One of the reasons I built sort
into arrays is because people implement inefficient sorts over and over, and
I wanted to make it real easy to use the best algorithm. (A job I had once
was to optimize a large program written by others; in poking around in it I
found 3 different hacky implementations of bubblesort. Replaced them all
with a call to C's qsort(), and I was an instant hero for the speedup <g>.)

May 12 2004
prev sibling parent reply =?iso-8859-1?q?Knud_S=F8rensen?= <knud NetRunner.all-technology.com> writes:
On Tue, 11 May 2004 14:47:58 -0700, Walter wrote:

 
 You're welcome. You should also note that arrays have a built-in sort
 property, which uses quicksort. It'll blow away heapsort <g>. Doing that
 cuts the sort time on my machine from 453 ms to 63 ms. Source attached.
 

Why are you not using introspective sort? http://www.nist.gov/dads/HTML/introspectiveSort.html
May 12 2004
parent "Walter" <newshound digitalmars.com> writes:
"Knud Sørensen" <knud NetRunner.all-technology.com> wrote in message
news:pan.2004.05.12.22.37.18.149835 NetRunner.all-technology.com...
 On Tue, 11 May 2004 14:47:58 -0700, Walter wrote:

 You're welcome. You should also note that arrays have a built-in sort
 property, which uses quicksort. It'll blow away heapsort <g>. Doing that
 cuts the sort time on my machine from 453 ms to 63 ms. Source attached.

Why are you not using introspective sort? http://www.nist.gov/dads/HTML/introspectiveSort.html

Actually, it does just that (switches to heapsort as necessary).
May 12 2004
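[Editor's note: the introspective-sort idea Walter refers to is quicksort with a recursion-depth budget, falling back to heapsort once the depth looks pathological so the worst case stays O(n log n). A C++ sketch of my own, not the Phobos implementation.]

```cpp
#include <algorithm>
#include <cmath>
#include <utility>
#include <vector>

// Quicksort with a depth budget; past the budget, heapsort the range.
static void intro_impl(std::vector<int>& a, int lo, int hi, int depth) {
    while (hi - lo > 1) {
        if (depth-- == 0) {                       // too deep: switch to heapsort
            std::make_heap(a.begin() + lo, a.begin() + hi);
            std::sort_heap(a.begin() + lo, a.begin() + hi);
            return;
        }
        int pivot = a[lo + (hi - lo) / 2];
        int i = lo, j = hi - 1;
        while (i <= j) {                          // Hoare-style partition
            while (a[i] < pivot) ++i;
            while (pivot < a[j]) --j;
            if (i <= j) std::swap(a[i++], a[j--]);
        }
        intro_impl(a, lo, j + 1, depth);          // recurse on the left part
        lo = i;                                   // iterate on the right part
    }
}

void introsort(std::vector<int>& a) {
    int n = static_cast<int>(a.size());
    if (n > 1) intro_impl(a, 0, n, 2 * static_cast<int>(std::log2(n)));
}
```

The depth budget of roughly 2*log2(n) is the conventional choice: a balanced quicksort never exceeds it, so heapsort only kicks in on adversarial inputs.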
prev sibling parent reply =?ISO-8859-1?Q?J=F6rg_R=FCppel?= <joerg sharky-x.de> writes:
Walter wrote:

 "christopher diggins" <cdiggins users.sourceforge.net> wrote in message
 news:c7r63p$1bv1$1 digitaldaemon.com...
 
Thanks for the feedback, Walter. I had only used -O but not -inline
and -release which sped up the execution to 703 msec and the memory usage
dropped to 2MB on my system. I have now updated the web page as a result.

Thanks! Could you also please hyperlink the "Digital Mars Compiler" to
www.digitalmars.com/d/ ? You'll also get faster results if you use 'final'
on all member functions that are not overridden (essentially all the
functions you don't mark as 'virtual' in the C++ version).

I think I've read in the docs that the D compiler autodetects if a function
should be virtual or not. Why is there such a speed gain from explicitly
marking methods as final, when the compiler can do this implicitly?

Regards,
Jörg
May 12 2004
parent reply "Walter" <newshound digitalmars.com> writes:
"Jörg Rüppel" <joerg sharky-x.de> wrote in message
news:c7u7ru$4dt$1 digitaldaemon.com...
 Walter wrote:

 "christopher diggins" <cdiggins users.sourceforge.net> wrote in message
 news:c7r63p$1bv1$1 digitaldaemon.com...

Thanks for the feedback, Walter. I had only used -O but not -inline
and -release which sped up the execution to 703 msec and the memory usage
dropped to 2MB on my system. I have now updated the web page as a result.

 Thanks! Could you also please hyperlink the "Digital Mars Compiler" to
 www.digitalmars.com/d/ ? You'll also get faster results if you use 'final'
 on all member functions that are not overridden (essentially all the
 functions you don't mark as 'virtual' in the C++ version).

I think I've read in the docs that the D compiler autodetects if a function should be virtual or not. Why is there such a speed gain from explicitly marking methods as final, when the compiler can do this implicitly?

Because the compiler is not sophisticated enough to do it automatically yet.
May 12 2004
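[Editor's note: the effect Walter describes can be shown in C++ terms; a hypothetical illustration of my own (the types echo the benchmark's Vector2D but are not its exact code) of why a final method can be bound statically instead of through the vtable.]

```cpp
// A virtual call through a base pointer normally goes through the
// vtable. Marking the derived class 'final' tells the compiler that no
// further override can exist, so calls on a FastVector2D can be
// devirtualized and inlined. (Hypothetical types for illustration.)
struct Vector2D {
    double x, y;
    Vector2D(double x_, double y_) : x(x_), y(y_) {}
    virtual ~Vector2D() = default;
    virtual double GetMagnitudeSquared() const { return x * x + y * y; }
};

struct FastVector2D final : Vector2D {      // 'final': no subclasses possible
    using Vector2D::Vector2D;
    double GetMagnitudeSquared() const override { return x * x + y * y; }
};

double sum_magnitudes(const FastVector2D* v, int n) {
    double s = 0;
    for (int i = 0; i < n; ++i)
        s += v[i].GetMagnitudeSquared();    // statically bindable call
    return s;
}
```

D's `final` on a member function plays the same role: it closes the method to overriding, which is what licenses the compiler to skip the dynamic dispatch.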
parent =?ISO-8859-1?Q?J=F6rg_R=FCppel?= <joerg sharky-x.de> writes:
I think I've read in the docs that the D compiler autodetects if a
function should be virtual or not. Why is there such a speed gain from
explicitly marking methods as final, when the compiler can do this
implicitly?

Because the compiler is not sophisticated enough to do it automatically yet.

Regards, Jörg
May 13 2004
prev sibling parent reply Kevin Bealer <Kevin_member pathlink.com> writes:
In article <c7r63p$1bv1$1 digitaldaemon.com>, christopher diggins says...
Thanks for the feedback, Walter. I had only used -O but not -inline
and -release which sped up the execution to 703 msec and the memory usage
dropped to 2MB on my system. I have now updated the web page as a result.
There are still 4 bytes per object that I can't account for. I count 8 bytes
for the vector data, 4 bytes for the pointer to the object, 4 bytes for the
vtable pointer in the object, which only brings me to 16 bytes per object;
what is the other 4 bytes for? Is there a way to allocate 100,000 objects at
once rather than one by one?

If I have a container of objects that implement interface Printable(), how does Heron determine which are type Circle and which are type Square, without the use of a vtable? If the language cannot determine which type is which, then you should remove the word "virtual" from the C++ version, and put "final" in the Java (and D?) versions: if the container can only have one type of object, then C++ does not need the virtual keyword, and including it kind of "rigs" the demo, right? Kevin
May 11 2004
parent reply "christopher diggins" <cdiggins users.sourceforge.net> writes:
"Kevin Bealer" <Kevin_member pathlink.com> wrote in message
news:c7rboo$1kqs$1 digitaldaemon.com...
 In article <c7r63p$1bv1$1 digitaldaemon.com>, christopher diggins says...
Thanks for the feedback, Walter. I had only used -O but not -inline
and -release which sped up the execution to 703 msec and the memory usage
dropped to 2MB on my system. I have now updated the web page as a result.
There are still 4 bytes per object that I can't account for. I count 8 bytes
for the vector data, 4 bytes for the pointer to the object, 4 bytes for the
vtable in the object, which only brings me to 16 bytes per object; what is
the other 4 bytes for? Is there a way to allocate 100,000 objects at once
rather than one by one?

If I have a container of objects that implement interface Printable(), how
does Heron determine which are type Circle and which are type Square,
without the use of a vtable?

Interface references contain an extra pointer to a dispatch table. So the dynamic dispatching is done when needed. The compiler ignores the interface references when it can deduce the concrete types.
 If the language cannot determine which type is which, then you should
 remove the word "virtual" from the C++ version, and put "final" in the
 Java (and D?)

I do agree with you about the final keyword and I have done that to the D version, and I am going to do it in the Java version.
 versions: if the container can only have one type of object, then C++ does
 not need the virtual keyword, and including it kind of "rigs" the demo,
 right?

You are correct, but Vector2D is used to demonstrate a class that is run-time polymorphic but happens to be used in a non-polymorphic manner. I could easily add to the demo a modification that forces the issue of polymorphism, but I thought it would complicate things too much. Do you recommend that I change the demo to make this more explicit? -- Christopher Diggins http://www.cdiggins.com http://www.heron-language.com
May 11 2004
parent reply Kevin Bealer <Kevin_member pathlink.com> writes:
In article <c7rhdv$1t4h$1 digitaldaemon.com>, christopher diggins says...
..
You are correct, but Vector2D is used to demonstrate a class that is
run-time polymorphic but happens to be used in a non-polymorphic manner. I
could easily add to the demo a modification that forces the issue of
polymorphism, but I thought it would complicate things too much. Do you
recommend that I change the demo to make this more explicit?

Heron's approach of non-inheritance binding is interesting; I take it there
is a global dispatch table with pointers to all methods appearing in
non-deducible cases?

I think the demo you have is good as far as it goes. But I would be more
interested to know what happens in both cases for both languages: i.e. C++
run-time and compile-time and Heron run-time and compile-time. If Heron can
outperform in *both* cases, that would be much more impressive. But in
performance-oriented code in a language like C++, virtual-ness is usually
used only when compile-time binding is difficult to achieve.

Kevin
May 11 2004
parent reply "christopher diggins" <cdiggins users.sourceforge.net> writes:
"Kevin Bealer" <Kevin_member pathlink.com> wrote in message
news:c7rkjh$2216$1 digitaldaemon.com...
 In article <c7rhdv$1t4h$1 digitaldaemon.com>, christopher diggins says...
 ..
You are correct, but Vector2D is used to demonstrate a class that is
run-time polymorphic but happens to be used in a non-polymorphic manner. I
could easily add to the demo a modification that forces the issue of
polymorphism, but I thought it would complicate things too much. Do you
recommend that I change the demo to make this more explicit?

 Heron's approach of non-inheritance binding is interesting; I take it there
 is a global dispatch table with pointers to all methods appearing in
 non-deducible cases?

Close, there are lots of small tables created.
 I think the demo you have is good as far as it goes. But I would be more
 interested to know what happens in both cases for both languages: i.e. C++
 run-time and compile-time and Heron run-time and compile-time. If Heron can
 outperform in *both* cases, that would be much more impressive.

Good point.
 But in performance-oriented code in a language like C++, virtual-ness is
 usually used only when compile-time binding is difficult to achieve.

In retrospect, I am starting to be of the opinion that the example may in
fact be rigged as you mentioned. There is no way that I can justify a design
that uses ICompare as a class with virtual functions.

-- 
Christopher Diggins
http://www.cdiggins.com
http://www.heron-language.com
May 12 2004
parent reply "Walter" <newshound digitalmars.com> writes:
"christopher diggins" <cdiggins users.sourceforge.net> wrote in message
news:c7t5qi$19oe$1 digitaldaemon.com...
 In retrospect, I am starting to be of the opinion that the example may in
 fact be rigged as you mentioned. There is no way that I can justify a
 design that uses ICompare as a class with virtual functions.

I'd also be careful about the GetMagnitude() function. The time spent in the
two multiplies may swamp other effects. I've seen many benchmark programs
that purport to measure X, when they actually are measuring Y that the
benchmark writer thought was innocuous. Compiler vendors have been known to
exploit these naive benchmarks by optimizing the heck out of Y (which may be
much easier than optimizing X), and letting the benchmark reporter
misinterpret it into saying that X was optimized.

I've even seen crafty vendors create benchmarks specifically to mislead
people into thinking X was being optimized when it was actually Y. (This is
why I tend to not bother creating benchmarks for any but my own use; I like
to use benchmarks written by people not affiliated with DM to avoid even the
appearance of impropriety.)

In your benchmark, I'd check the multiplies; I'd also check to see if maybe
one compiler does a better job of function inlining or register allocation
before concluding that the speed difference is due to virtual dispatch.
Running a disassembler over the generated code and being familiar with
performance assembler coding is essential. The results can be quite
surprising.
May 12 2004
parent "christopher diggins" <cdiggins users.sourceforge.net> writes:
"Walter" <newshound digitalmars.com> wrote in message
news:c7u37p$2m0t$1 digitaldaemon.com...
 "christopher diggins" <cdiggins users.sourceforge.net> wrote in message
 news:c7t5qi$19oe$1 digitaldaemon.com...
 In retrospect, I am starting to be of the opinion that the example may in
 fact be rigged as you mentioned. There is no way that I can justify a
 design that uses ICompare as a class with virtual functions.

 I'd also be careful about the GetMagnitude() function. The time spent in
 the two multiplies may swamp other effects. I've seen many benchmark
 programs that purport to measure X, when they actually are measuring Y
 that the benchmark writer thought was innocuous. Compiler vendors have
 been known to exploit these naive benchmarks by optimizing the heck out of
 Y (which may be much easier than optimizing X), and letting the benchmark
 reporter misinterpret it into saying that X was optimized.

 I've even seen crafty vendors create benchmarks specifically to mislead
 people into thinking X was being optimized when it was actually Y. (This
 is why I tend to not bother creating benchmarks for any but my own use; I
 like to use benchmarks written by people not affiliated with DM to avoid
 even the appearance of impropriety.)

 In your benchmark, I'd check the multiplies; I'd also check to see if
 maybe one compiler does a better job of function inlining or register
 allocation before concluding that the speed difference is due to virtual
 dispatch. Running a disassembler over the generated code and being
 familiar with performance assembler coding is essential. The results can
 be quite surprising.

You make some excellent points. I have written a brand new benchmark, which
only has increments, decrements, == and function calls. Hopefully this is
more accurate. Unfortunately I am too slow and clumsy with assembler to be
able to do any disassembly; hopefully someone else will come along and do it.

-- 
Christopher Diggins
http://www.cdiggins.com
http://www.heron-language.com
May 13 2004
prev sibling next sibling parent Andy Friesen <andy ikagames.com> writes:
christopher diggins wrote:

 I have written an object oriented Heapsort in D at
 http://www.heron-language.com/benchmarks/heapsort-d.html to compare the
 performance of object oriented code in several languages. The results were
 somewhat disappointing, when compared with C++, Delphi, Java, and Heron. I
 was hoping that there might be some suggestions from members of this group
 about how I might improve the code I wrote while keeping the overall
 algorithm and design the same.
 
 Please note that the program is a port of the program at
 http://www.heron-language.com/benchmarks/heapsort-heron.html and is
 discussed further at http://www.heron-language.com/benchmarks/index.html

This is completely tangential, but Microsoft's C++ compiler completes the
heapsort in 235ms. (Athlon XP 1600+ with 512MB of RAM, using /O2)

 -- andy
May 11 2004
prev sibling next sibling parent reply "Matthew" <matthew.hat stlsoft.dot.org> writes:
Why have you not used DMC++ for the Heron and C++? That way the comparison would
be truer

"christopher diggins" <cdiggins users.sourceforge.net> wrote in message
news:c7qvo6$1256$1 digitaldaemon.com...
 I have written an object oriented Heapsort in D at
 http://www.heron-language.com/benchmarks/heapsort-d.html to compare the
 performance of object oriented code in several languages. The results were
 somewhat disappointing, when compared with C++, Delphi, Java, and Heron. I
 was hoping that there might be some suggestions from members of this group
 about how I might improve the code I wrote while keeping the overall
 algorithm and design the same.

 Please note that the program is a port of the program at
 http://www.heron-language.com/benchmarks/heapsort-heron.html and is
 discussed further at http://www.heron-language.com/benchmarks/index.html

 TIA
 -- 
 Christopher Diggins
 http://www.cdiggins.com
 http://www.heron-language.com

May 11 2004
parent reply "christopher diggins" <cdiggins users.sourceforge.net> writes:
"Matthew" <matthew.hat stlsoft.dot.org> wrote in message
news:c7rgk8$1s1l$1 digitaldaemon.com...
 Why have you not used DMC++ for the Heron and C++? That way the comparison
 would be truer

If all that would happen is make the Heron and C++ code run slower (or
faster), I don't really see the point. I am simply using the best compilers I
have at my disposal. I also don't think it would make any difference to the
whole point of the comparison, which is that Heron implements more efficient
object oriented code than the other languages which support object oriented
techniques.

-- 
Christopher Diggins
http://www.cdiggins.com
http://www.heron-language.com
May 11 2004
parent reply "Walter" <newshound digitalmars.com> writes:
"christopher diggins" <cdiggins users.sourceforge.net> wrote in message
news:c7rl11$22nr$1 digitaldaemon.com...
 "Matthew" <matthew.hat stlsoft.dot.org> wrote in message
 news:c7rgk8$1s1l$1 digitaldaemon.com...
 Why have you not used DMC++ for the Heron and C++? That way the comparison
 would be truer

 If all that would happen is make the Heron and C++ code run slower (or
 faster), I don't really see the point. I am simply using the best compilers I
 have at my disposal. I also don't think it would make any difference to the
 whole point of the comparison, which is that Heron implements more efficient
 object oriented code than the other languages which support object oriented
 techniques.

That depends on whether the point is really to compare the inherent
efficiency of the language, not the relative implementation vagaries of
different compilers. DMD and DMC++ share a common optimizer and code
generator.
May 11 2004
parent "christopher diggins" <cdiggins users.sourceforge.net> writes:
"Walter" <newshound digitalmars.com> wrote in message
news:c7rn04$25r5$1 digitaldaemon.com...
 "christopher diggins" <cdiggins users.sourceforge.net> wrote in message
 news:c7rl11$22nr$1 digitaldaemon.com...
 "Matthew" <matthew.hat stlsoft.dot.org> wrote in message
 news:c7rgk8$1s1l$1 digitaldaemon.com...
 Why have you not used DMC++ for the Heron and C++? That way the comparison
 would be truer

 If all that would happen is make the Heron and C++ code run slower (or
 faster), I don't really see the point. I am simply using the best compilers I
 have at my disposal. I also don't think it would make any difference to the
 whole point of the comparison, which is that Heron implements more efficient
 object oriented code than the other languages which support object oriented
 techniques.

 That depends on whether the point is really to compare the inherent
 efficiency of the language, not the relative implementation vagaries of
 different compilers. DMD and DMC++ share a common optimizer and code
 generator.

I am now convinced, and the new example
( http://www.heron-language.com/benchmarks/index.html ) uses only the Digital
Mars compilers.

-- 
Christopher Diggins
http://www.cdiggins.com
http://www.heron-language.com
May 13 2004
prev sibling parent C <qbert atari-soldiers.com> writes:
Impressive speeds on Heron!  I'd like to see some more benchmarks :).

C

On Tue, 11 May 2004 12:41:34 -0400, christopher diggins 
<cdiggins users.sourceforge.net> wrote:

 I have written an object oriented Heapsort in D at
 http://www.heron-language.com/benchmarks/heapsort-d.html to compare the
 performance of object oriented code in several languages. The results 
 were
 somewhat disappointing, when compared with C++, Delphi, Java, and Heron. 
 I
 was hoping that there might be some suggestions from members of this 
 group
 about how I might improve the code I wrote while keeping the overall
 algorithm and design the same.

 Please note that the program is a port of the program at
 http://www.heron-language.com/benchmarks/heapsort-heron.html and is
 discussed further at http://www.heron-language.com/benchmarks/index.html

 TIA

May 11 2004