
digitalmars.D - Dynamic Code in D

Michael Wilson <mwilson purplefrogtext.com> writes:
I'm responsible for an existing Java application that is the core
product of my company. However it's essentially a productised prototype
and needs a rewrite, or at least some serious refactoring, sometime
this year. I am considering doing this in D; I like the language design
and there are potential performance and maintainability benefits. As it
turns out, this may entail creating a new D library. Thus I'd like to
ask the following questions:

1) Is this practical at all?
2) Is there a better way to do this?
3) Would anyone other than us want to use this library if we made it
(i.e. is it worth open-sourcing)?

The system carries out data mining tasks using algorithms generated by a
genetic programming engine. The original prototype (developed before I
joined the company) did this by constructing a Lisp-style function tree
out of Java objects, then running that in an interpreter (i.e. it used
ECJ). Unsurprisingly this was hideously slow. The current version that
I designed works by compiling the function tree to Java bytecode (as
static methods in a new class), calling an in-memory classloader on the
output (allowing Java to link it), then allowing Hotspot to translate
it to native machine code. However to get good performance on large
datasets, I had to implement the terminal functions (i.e. the inner
loops) in C, using GCC's SSE intrinsics. The dynamic java code calls
this via the Java Native Interface (specifically, a class full of static
native methods). However JNI has a significant call overhead, which is
eating up a good fraction of the performance boost, as well as creating
additional development/maintenance hassle. This system is running on a
couple of beefy 16 core Opteron servers and it still takes hours to run
on large datasets.

The major problem with implementing in D is the lack of runtime
compilation and the AFAIK subpar support for dynamic linking. I'll get
to that in a bit. The other potential problems are:

1. Support for AMD64 on Linux is absolutely essential; these datasets
are several gigabytes in size and need to be loaded into memory. Plus
there is a nontrivial amount of 64-bit integer manipulation in the Java
code. Linux obviously because what sane person would waste money on
Windows Server licenses for compute servers? I haven't been able to
track down the latest estimate for when DMD will support AMD64, but at
least GDC does now. Is this stable enough for production use?

2. High-performance synchronization. As of 1.6, Java's monitor
implementation is quite impressive; the biased locking mechanism means
that atomic operations are only required for a small fraction of monitor
acquisitions. I've been trying to find out if D's implementation uses
the same mechanism. This is probably only good for a few percent of
performance, but on 16-way NUMA locking performance can make a
difference even when it isn't in the inner loops.

3. SSE intrinsics. Does GDC have the equivalent of GCC SSE intrinsics
yet (i.e. "target builtins")? This post suggests that it does:
http://www.digitalmars.com/pnews/read.php?server=news.digitalmars.com&group=D.gnu&artnum=2622
but I can't see anything in the documentation about them. If not I'll
have to continue linking to the existing C code, which eliminates a
significant benefit of going to D (simplifying the code base - but at
least the JNI call & pinning overhead will be gone). Ideally
std.intrinsic would have these (presumably that's how DMD is going to
include them, if and when it does - as far as I can tell, DMD is still
doing even basic double-precision operations with slow x87 code rather
than SSE code, unless
http://www.digitalmars.com/d/archives/digitalmars/D/Optimizations_of_DMD_49840.html
is out of date).

Ok, on to the serious problem, dynamic code support. The GP engine is
outputting a function tree structure. I can trivially output this as D
or Java source. I can output it as Java bytecode with a bit more effort.
I really don't want to have to compile it to x86 machine code (and link
it against the terminal functions) myself, particularly given that the
research guys keep elaborating the algorithm. Here's what I've
considered so far:

1. Output as C source plus a JNI stub in bytecode. Invoke gcc on it to
build a dynamic library. Load the class in Java. This would remove most
of the JNI overhead (because the Java/C crossover would be pushed 4 to
10 places up the call graph). However I'd have to keep invoking GCC as
a separate process, though I could cut down the overhead of this
somewhat by putting it and the dynamic files on a ramdisk. I'm unsure
what the performance would be like under Linux of constantly linking
new shared libraries, and I'm not sure if the Sun JVM is capable of
unloading native code along with unloading classes (though this memory
leak is probably insignificant over the lifetime of the process). This
is the solution I will be using if I can't find a good D solution.

2. Port to D. Change the GP system to target MiniD bytecode. I haven't
examined MiniD in detail yet, but I get the impression that this will
be marginally harder than targeting Java bytecode. Alternatively I
could target MiniD source, which would be simpler for me, at some
additional compilation cost, or I could dig into the MiniD compiler
source and see if it has an AST I could target. The latter would be
ideal if the code is stable, but it probably isn't.

MiniD is purely interpreted, so performance of the dynamic code will
almost certainly suck badly compared to Hotspot. On the plus side,
'linking' cost is effectively zero. The inner loops will still have
to be in either D (if it has decent SSE intrinsics now) or C (if it
doesn't); at first glance MiniD native invocation overhead looks better
than JNI (no need to pin for one thing, since the GC is non-copying)
but still nontrivial.

3) Port to D. Change the GP system to target D source. Batch code
updates together. On each update, recompile the codebase using gdc
running in a ramdisk. Start a new process using the new code. Keep
all the persistent data in a shared memory segment. Get the new
process to map in the shared segment, then kill the old process.
Processing can continue from where it left off using the updated code.

AFAIK this /should/ work, but I don't know what the overheads are
going to be like. Having to batch together the code updates is
mildly inconvenient from the algorithm design side. I'm not sure what
if any incremental compilation features GDC has; perhaps I could patch
it to stay in memory and stream new classes to it over a pipe.

4) Port to D. Change the GP system to target D source. Batch code
updates together. On each update, link the new classes in with DDL.

The major problem with this is that AFAIK DDL isn't working on Linux
yet (at all, never mind on AMD64) and is still beta quality code
even on Windows. In addition to that, I'm not clear if it can ever
recover the memory for unused code, though as I mentioned for (1) this
leak isn't likely to be a problem over the lifetime of the process.
Plus the GDC invocation overhead as for (3).

and finally...

5) Write a new 'D Dynamic Code Library'. My best guess for how this
should work is this:

a) This library will statically link to GDC and the GCC back end.
b) On first invocation, look for a metadata file. If it isn't present or
is out of date, recompile the app from source and keep the relevant
GDC state in memory. Otherwise, deserialise GDC state from the metadata
file.
c) Supply functions that accept D source and possibly one or more of the
GCC intermediate representations (i.e. AST, CFG, RTL). These would call
through to GDC/GCC, intercept the output and write it into the code
page, then perform runtime linking as necessary. This is probably a lot
of work, but I am unsure exactly how much.
d) Open source this library in case anyone else wants to use it.

Any thoughts on any of the above? Bear in mind that I have a mild bias
towards using D even if it is some extra work to get things going, but I
am certainly not going to accept worse overall performance than the Java
version.

Michael Wilson
Systems Developer
Purple Frog Text Ltd
Jan 12 2008
Robert Fraser <fraserofthenight gmail.com> writes:
Wow; that's a cool project and as much of it as you're willing to 
open-source would be awesome, even if it mainly serves as an educational 
tool. Dynamic code support has been something D has been lacking for a 
while. I don't know the answer to most of your questions, but...

Michael Wilson wrote:
 1. Support for AMD64 on Linux is absolutely essential; these datasets
 are several gigabytes in size and need to be loaded into memory. Plus
 there is a nontrivial amount of 64-bit integer manipulation in the Java
 code. Linux obviously because what sane person would waste money on
 Windows Server licenses for compute servers? I haven't been able to
 track down the latest estimate for when DMD will support AMD64, but at
 least GDC does now. Is this stable enough for production use?

AFAIK, the code generator in GDC is more mature/stable than DMD's. It uses the GCC backend, so if you consider GCC's AMD64 codegen stable, then GDC's will follow. Anyway, good luck figuring all this out!
Jan 12 2008
bearophile <bearophileHUGS lycos.com> writes:
Your set of problems is quite interesting. I am not much experienced in D yet,
but I can say some of the things I think. There are a lot of people here who
know much more than me on such matters (though I have implemented many
evolution-based programs), and they will spot errors among the things I say
here.

D is a young language, it has a lot of rough corners, so you have to forget the
high level of polish that Java systems have today. I like D, and hopefully more
and more people will start using it, but I think today it's not a good fit to
replace "serious" Java programs.

I am considering doing this in D; I like the language design and there are
potential performance and maintainability benefits.<

D is a very nice language, and it looks quite nice compared to Java, but note that for most of the small (1000-10000 line) programs I have translated from Java to D, the Java 6 version was 2-10 times faster. HotSpot inlines methods, and it has a really good GC, while DMD currently seems not to inline them, and its GC is snail-slow compared to the refined HotSpot one (not even the server one). Those two things alone make D code much slower if you code in the usual highly OOP style that Java code has.
The system carries out data mining tasks using algorithms generated by a
genetic programming engine. The original prototype (developed before I joined
the company) did this by construcing a Lisp-style function tree out of Java
objects, then running that in an interpreter (i.e. it used ECJ). Unsurprisingly
this was hideously slow.<

For that purpose CommonLisp looks like a good language (besides Java itself). It's fast enough if used well, and it compiles functions on the fly. There are some disadvantages, like it's not easy to find CLisp programmers. How much time does a single fitness computation take on average? If it's really little time (less than 30-40 seconds) then compiling/interfacing timings become important. In that situation you need a fast compiler, etc.
The dynamic java code calls this via the Java Native Interface (specifically, a
class full of static native methods). However JNI has a significant call
overhead, which is eating up a good fraction of the performance boost, as well
as creating additional development/maintenance hassle. This system is running
on a couple of beefy 16 core Opteron servers and it still takes hours to run on
large datasets.<

Often the bigger gains in such evolutionary programs come from "improving" the search space, adding more heuristics, etc. Not from going from Java to D.
2. High-performance synchronization. As of 1.6, Java's monitor implementation
is quite impressive;<

I don't know the answer, but in most things D is far from being tuned and refined, etc. In that kind of things only C# may be better.
3. SSE intrinsics. Does GDC have the equivalent of GCC SSE intrinsics yet<

Nope, I think.
1. Output as C source plus a JNI stub in bytecode. Invoke gcc on it to build a
dynamic library.<

If the running time of that fitness function is very little you may need to find something faster than GCC, like TinyCC (but I think that's not your situation, and most probably your fitness functions are small, so they are very quick to compile). Look for other alternative languages/solutions too, like CLisp, Java + TinyCC-compiled code, Python+Psyco with Cython code compiled on the fly, etc.

I think you can answer most of your questions about D by doing some small benchmarks for 1-2 days only. Maybe you will like the language but not its performance for your purposes :-)

Bye, bearophile
Jan 12 2008
Michael Wilson <mwilson purplefrogtext.com> writes:
bearophile wrote:
 D is a young language, it has lot of rough corners, so you have to forget
 the high level of polish that Java systems have today.

I know; Java was awful when I first started using it in 1997 (not by choice, at the time), but these days it's actually very good for large reasonably-long-running applications. However while the runtime system is very good and the standard library is decent, the language itself is kinda gnarly and obviously restrictive compared to C (thus the use of JNI in this app).
 I like D, and hopefully more and more people will start using it, but I
 think today it's not much fit to replace "serious" Java programs.

For this app, there are three types of server: the web cluster, the DB cluster and the compute cluster. The web cluster has all the interface and application logic, and that code is definitely staying in Java (+ JavaScript on the client side). All the persistent data stays on the file/DB servers, in Postgres and some big binary files. The compute servers are the only ones I am considering moving to D. If it wasn't for the SSE I probably wouldn't, but that JNI overhead is seriously annoying. Plus I can do low-level things such as NUMA optimisation from D much more easily than I can from Java.
 HotSpot inlines methods, and it has a really good GC, while DMD
 currently seem to not inline them and its GC is snail-slow compared
 to the refined HotSpot (not the server one) one.

I'm not concerned about GC since there isn't much temporary object usage; the bulk of the data is in a relatively small number of huge arrays (which I may well go to manual memory management for in D anyway). No method inlining is a little more serious, but then I'll have to use GDC rather than DMD for the 64-bit support, and presumably GDC inlines methods ok? If not I could make the GP output stage do it, the call depth and function size is low enough that it shouldn't be difficult.
 For that purpose CommonLisp looks like the a good language (beside
 Java itself). It's fast enough if used well, and it compiles the
 functions on the fly. There are some disadvantages, like it's not
 easy to find CLisp programmers.

It's even harder to find D developers, but it's easy for developers who know C/C++/C#/Java to learn D (hopefully! since I'm learning it, and if I used it on a production system other developers at the company would have to as well). Lisp, not so much. I guarantee that if I said 'I want to reimplement in Lisp' the answer would be no, while I received provisional approval for using D on the basis that 'it's very similar to Java, our developers can learn it easily'.

Aside from that, the other problems with using Lisp in this app are:
1) /Much/ more work to port the sections of code I want to reuse from the old system across.
2) I don't know of any CL implementations with SSE intrinsics. So unless someone else knows one I'd be stuck linking to C code again. It's possible that there's a CL implementation with less overhead for linking dynamic code to external C libraries than JNI has, but if so I don't know of it.
3) AFAIK similar issues with trying to do NUMA optimisation as Java, and worse synchronisation performance (though again, there may be some super-optimal CL implementation out there I don't know about).
 How much time does a single fitness computation take on average?
 If it's really little time (less than 30-40 seconds) then
 compiling/interfacing timings become important.

This depends on the dataset size. For a typical dataset, on each pass the GP engine generates a few thousand classifiers based on a few hundred (dynamic) component functions. Computing the fitness of those takes somewhere between 10 seconds and a minute, on a server with eight Opteron 8218s. This is heavily optimised to reuse intermediate values where possible, which is where the moderate amount of thread synchronisation comes from (and why one big server outperforms lots of small servers of the same price and more aggregate compute performance - though we're working on that). Each run makes several hundred of these passes. So the compiler is invoked a few times a minute on the equivalent of a couple of thousand lines of code. Compiler speed shouldn't be a major factor.
 Often the bigger gains in such evolutionary programs come from
 "improving" the search space, adding more heuristics, etc. Not
 from going from Java to D.

Absolutely, but I'm not responsible for that (though I certainly make suggestions). My job is to make the algorithms the research team comes up with run on commercial datasets (my team as a whole is responsible for 'productisation').
 2. High-performance synchronization. As of 1.6, Java's monitor
 implementation is quite impressive;<

 I don't know the answer, but in most things D is far from being tuned and refined, etc. In that kind of things only C# may be better.

That's something I could potentially help with once I'm decently familiar with the language, if the GDC developers are accepting patches. Otherwise, I could just implement my own synchronization, via inline assembly (since this doesn't have to be multi-platform) or C.
 3. SSE intrinsics. Does GDC have the equivalent of GCC SSE intrinsics yet

 Nope, I think.

Clearly they were scheduled to go in last year (from the post I linked to); has there been a disruption to GDC development recently?
 If the running time of that fitness function is very little you may need to
 find something faster than GCC, like TinyCC (but I think that's not your
 situation, and most probably your fitness functions are small, so 

 very quick to compile).

Interesting suggestion, but unfortunately TinyCC does not (AFAIK) target AMD64 - also a problem for a lot of CL implementations. I'll have a look around for similar projects that do though.
 Look for other alternative languages/solutions too, like CLisp, Java TinyCC 
 compiled code, Python+Psyco with Cython compiled code on the fly, etc.

I am doing so; D is one of them. :) The very first thing I tried was Excelsior JET, a mixed mode compiler for Java (static where possible, JIT for new code loaded at runtime) that I'd had good results with in the past. Unfortunately performance dropped significantly, even when I 'cheated', saved a log of all GP functions generated in a run and supplied them to JET to statically compile for the benchmark run. As you noted, Hotspot is really quite good these days.
 I think you can answer most of your questions about D doing some small
 benchmarks for 1-2 days only. Maybe you will like the language but not
 its performance for your purposes :-)

Yep; I'll probably start with the 'multiple processes passing data via shared memory' solution, as that sounds like the simplest to implement. If and when I've got that working as a proof of concept, I'll take a look at LLVM.

Paolo Invernizzi wrote:
 I guess the best approach it's LLVM...
 Take a look at http://www.llvm.org
 Cheers, Paolo

Thanks! That does look really interesting. I had a quick look at GNU Lightning, but it's almost entirely unoptimising and thus probably unusable. LLVM looks much better at first glance, and I should be able to target LLVM bytecode directly, but I'll have to investigate the feasibility of linking against D.

Googling, it looks like there have been a few attempts to use LLVM and D together before, but mostly in the sense of using it as a third static compiler (e.g. http://www.dsource.org/projects/llvmdc). I will have to do some more research and see if there is anything I could get involved with. Needless to say, if I can get my company to contribute engineering time to some interesting open source development, I will. :)

Robert Fraser wrote:
 Wow; that's a cool project and as much of it as you're willing to
 open-source would be awesome, even if it mainly serves as an
 educational tool.

I'm pretty sure the core algorithms will never be open sourced, but something like a dynamic code library definitely would be OSed, and I'm hopeful that the whole GP engine might eventually be OSed.

CptJack wrote:
 Everything you need to do is provided by Common Lisp, and I don't mean
 "CLisp", which is a byte-code interpreted CL implementation. Contrary
 to what most people assume, almost all CL implementations compile
 functions on-the-fly to machine code. That's raw machine object code,
 not byte-code, compiled and loaded dynamically while the system is
 still running. This sounds exactly like what you want.

That sounds exactly like what the Java version is already doing. The only obvious benefit with Lisp is a slight simplification of the GP engine output stage. As I've noted, Hotspot is very well optimised these days; I'm not aware of a CL implementation that does better, and indeed most perform significantly worse.

D would have the advantage of being able to link to C with effectively zero overhead, do manual memory allocation and synchronization where beneficial (i.e. NUMA-optimised memory allocation that still plays well with the GC), and be able to use inline assembly for SSE code. CL doesn't have any of those advantages AFAIK, plus as I mentioned earlier it's (much) harder to get new developers for and (significantly) harder to train existing developers in.

I'm actually genuinely surprised that the research team chose to develop the original algorithms in Java rather than Lisp; it would have been a good fit for the task, plus they're ex-academics and thus inherently more likely to use Lisp. But I don't think it's a good choice for the production system, particularly given that the current code is in Java.

Michael Wilson
Systems Developer
Purple Frog Text Ltd
Jan 12 2008
bearophile <bearophileHUGS lycos.com> writes:
Michael Wilson:

No method inlining is a little more serious, but then I'll have to use GDC
rather than DMD for the 64-bit support, and presumably GDC inlines methods ok?
If not I could make the GP output stage do it, the call depth and function size
is low enough that it shouldn't be difficult.<

Current DMD D is able to inline free functions and more. But Java code is often full of getters/setters, which don't seem to be inlined by DMD, so they may slow down code translated from Java to D. To put some "facts" behind my words you can perform a quick benchmark:

import std.stdio: put = writef, putr = writefln;
import d.time: clock;

struct S1 { uint s; }
void inc_sA(ref S1 s1) { s1.s++; }
void inc_sA(S1* s1) { s1.s++; }

struct S2 { uint s; void inc() { this.s++; } }

class C1 { uint s; void inc() { this.s++; } }
final class C2 { uint s; final void inc() { this.s++; } }
class C3a : C1 { void inc2() { this.s++; } }
final class C3b : C1 { final void inc2() { this.s++; } }

void main() {
    const uint N = 300_000_000;

    S1 s1;
    auto t = clock();
    for (uint i; i < N; ++i) s1.s++;
    putr(clock() - t, " ", s1.s);

    s1.s = 0;
    t = clock();
    for (uint i; i < N; ++i) inc_sA(s1);
    putr(clock() - t, " ", s1.s);

    s1.s = 0;
    t = clock();
    for (uint i; i < N; ++i) inc_sA(&s1); // inlined
    putr(clock() - t, " ", s1.s);

    S2 s2a;
    t = clock();
    for (uint i; i < N; ++i) s2a.inc; // inlined
    putr(clock() - t, " ", s2a.s);

    auto c1a = new C1;
    t = clock();
    for (uint i; i < N; ++i) c1a.inc;
    putr(clock() - t, " ", c1a.s);

    auto c2a = new C2;
    t = clock();
    for (uint i; i < N; ++i) c2a.inc;
    putr(clock() - t, " ", c2a.s);

    auto c3a = new C3a;
    t = clock();
    for (uint i; i < N; ++i) c3a.inc2;
    putr(clock() - t, " ", c3a.s);

    auto c3b = new C3b;
    t = clock();
    for (uint i; i < N; ++i) c3b.inc2;
    putr(clock() - t, " ", c3b.s);

    c3a = new C3a;
    t = clock();
    for (uint i; i < N; ++i) c3a.inc;
    putr(clock() - t, " ", c3a.s);
}

Timings, N = 300_000_000, DMD 1.025:
1) -O -release
2) -O -release -inline

    1)     2)
  2.63   2.63
  4.29   4.30
  3.70   2.50 *
  4.30   2.69 *
  4.91   4.91
  4.32   4.29
  4.90   4.90
  4.28   4.29
  4.94   4.92

Someone can run that on GDC too.
'it's very similar to Java, our developers can learn it easily'.<

Sometimes D can be seen as a superset of Java, but it has other things that you need time to learn, like all the juicy template tricks :-)
Aside from that the other problems with using Lisp in this app are;<

About them you can post questions on a Lisp newsgroup.
Otherwise, I could just implement my own synchronization, via inline assembly
(since this doesn't have to be multi-platform) or C.<

You are a bold person :-)
Interesting suggestion, but unfortunately TinyCC does not (AFAIK) target AMD64 -
also a problem for a lot of CL implementations. I'll have a look around for
similar projects that do though.<

From the things you have said I think you don't need TinyCC at all.
As you noted, Hotspot is really quite good these days.<

Behind HotSpot there are quite more developers than behind DMD (but I suspect Walter is better than most of them :-) ).
LLVM looks much better at first glance and I should be able to target LLVM
bytecode directly, but I'll have to investigate the feasibility of linking
against D.<

LLVM seems a nice solution. And you can just use C/C++ with LLVM instead of D...

Bye, bearophile
Jan 12 2008
Neal Alexander <wqeqweuqy hotmail.com> writes:
 Otherwise, I could just implement my own synchronization, via inline assembly
(since this doesn't have to be multi-platform) or C.<

You are a bold person :-)

I was under the impression this was easy using XADD / CMPXCHG with the LOCK prefix. How to yield / wait may be a different story, but how many possible general purpose options could there be?
Jan 12 2008
Sean Kelly <sean f4.ca> writes:
Neal Alexander wrote:
 Otherwise, I could just implement my own synchronization, via inline 
 assembly (since this doesn't have to be multi-platform) or C.<

You are a bold person :-)

 I was under the impression this was easy using XADD / CMPXCHG with the LOCK prefix.

It's also very slow compared to some of the optimized algorithms out there. But putting inline asm inside functions is pretty easy either way.
 How to yield / wait may be a different story, but how many possible 
 general purpose options could there be?

There are established, optimized algorithms for locking on x86, so implementing one by example shouldn't be difficult. I don't have any links offhand though. comp.programming.threads would be a good start. Sean
Jan 14 2008
CptJack <cptjack nospam.net> writes:
Michael Wilson Wrote:
 
 For that purpose CommonLisp looks like a good language (besides
 Java itself). It's fast enough if used well, and it compiles the
 functions on the fly. There are some disadvantages, like it's not
 easy to find CLisp programmers.


He meant "Common Lisp programmers". "CLisp" is just a single implementation.
 It's even harder to find D developers, but it's easy for developers
 who know C/C++/C#/Java to learn D (hopefully! since I'm learning it
 and if I used it on a production system other developers at the
 company would have to as well). Lisp, not so much. I guarantee that
 if I said 'I want to reimplement in Lisp' the answer would be no,
 while I received provisional approval for using D on the basis that
 'it's very similar to Java, our developers can learn it easily'.

That's a valid argument. However, I would like to point you to this very interesting essay before dismissing Lisp entirely for that reason: http://www.paulgraham.com/avg.html
 
 Aside from that the other problems with using Lisp in this app are;
 1) /Much/ more work to port the sections of code I want to reuse
 from the old system across.

Once you begin to do some serious programming in Lisp, and know it as well as you do your current language of choice, I think you will find that it is /much/ faster to program in than any other language. At first you will be fighting the syntax and the parens, then one day you will just "get it" and your world will change. It has so many unique features that you will wonder how you ever programmed without them (in fact, almost every "exciting new feature" in language X was implemented in Lisp 40 years ago).
 2) I don't know of any CL implementations with SSE intrinsics. So
 unless someone else knows one I'd be stuck linking to C code again.
 It's possible that there's a CL implementation with less overhead
 for linking dynamic code to external C libraries than JNI has, but
 if so I don't know of it.

As for SSE support, I know SBCL has SSE support, and therefore I believe CMUCL also does (since SBCL is forked from CMUCL). ECL is completely based around linking dynamic code, and is highly optimized to do so.
 3) AFAIK similar issues with trying to do NUMA optimisation as Java
 and worse synchronisation performance (though again, there may be
 some super-optimal CL implementation out there I don't know about).

Again, take a better look at SBCL and CMUCL.
 CptJack wrote:
  > Everything you need to do is provided by Common Lisp, and I don't mean
  > "CLisp", which is a byte-code interpreted CL implementation. Contrary
  > to what most people assume, almost all CL implementations compile
  > functions on-the-fly to machine code. That's raw machine object code,
  > not byte-code, compiled and loaded dynamically while the system is
  > still running. This sounds exactly like what you want.
 
 That sounds exactly like what the Java version is already doing.

This is my point exactly. You've had to re-implement Lisp to get Java to do what you want, a-la Greenspun's Tenth Rule.
 The only obvious benefit with Lisp is a slight simplification of the GP
 engine output stage. As I've noted, Hotspot is very well optimised
 these days, I'm not aware of a CL implementation that does better and
 indeed most perform significantly worse.

http://shootout.alioth.debian.org/gp4/benchmark.php?test=all&lang=sbcl&lang2=java They look pretty similar to me in this admittedly fairly worthless comparison.
 D would have the advantage of
 being able to link to C with effectively zero overhead, do manual memory
 allocation and synchronization where beneficial (i.e. NUMA-optimised
 memory allocation that still plays well with the GC) and be able to use
 inline assembly for SSE code. CL doesn't have any of those advantages

Not true. I think it's just a matter of common misconceptions about Lisp and its implementations. Most people think it's slow and interpreted. It's neither. Movitz for example is one Common Lisp implementation designed to be as "on-the-metal" as you want including inline assembly. Some of this movitz code has been ported to other implementations as well.
 plus as I mentioned earlier it's (much) harder to get new
 developers for and (significantly) harder to train existing developers
 in. 

As I said, that is indeed a valid argument. However, some of this is because of perceived hurdles that really don't exist (or are at least smaller than they at first appear). The only reason I'm still making arguments in favor of Common Lisp for your project is because what you describe is basically a Common Lisp implementation hacked together from Java and C. Why not just do away with the hacks?
Jan 12 2008
next sibling parent bearophile <bearophileHUGS lycos.com> writes:
CptJack:
 Once you begin to do some serious programming in Lisp, and know it 
 as well as you do your current language of choice, I think you will 
 find that it is /much/ faster to program in than any other language.

I like some sides of Lisp (and as you have seen, I too have suggested it as an option for this project), but I think you can't prove that. For example, I don't think you can write serious programs in Lisp faster than in Python... ;-) And to know Common Lisp as well as Java/D you may need years. If you know Java you can learn D rather well in less than 4-6 months. I think only C++ requires more learning time than Common Lisp :-)
 That sounds exactly like what the Java version is already doing.

This is my point exactly. You've had to re-implement Lisp to get Java to do what you want, a-la Greenspun's Tenth Rule.

But if that's already work done, then there may be no need to write it again in Lisp :-) Bye, bearophile
Jan 12 2008
prev sibling parent Michael Wilson <mwilson purplefrogtext.com> writes:
CptJack wrote:
 Once you begin to do some serious programming in Lisp, and know it 
 as well as you do your current language of choice, I think you will 
 find that it is /much/ faster to program in than any other language. 
 At first you will be fighting the syntax and the parens, then one day
 you will just "get it" and your world will change. It has so many
 unique features that you will wonder how you ever programmed without 
 them (in fact, almost every "exciting new feature" in language X was
 implemented in Lisp 40 years ago).

I could write a long response to this but it would be off topic for this newsgroup. Rather, I will say only that I have written a moderate amount of Lisp code and have had many, many encounters with 'keen' Lisp advocates. I can't recall /ever/ winning a 'which language to use' debate with /any/ of them, for any real or proposed application. At some point in becoming fond of the language a critical threshold is reached, and that person forever after believes that no other (high-level) language has a genuine reason to exist. I think I can understand where this notion comes from (I have had some experience of language and compiler design), but I certainly don't agree with it. In any case, I've found such debates rather unproductive.
 The only reason I'm still making arguments in favor of Common Lisp for
 your project is because what you describe is basically a Common Lisp
 implementation hacked together from Java and C.

You may believe this, probably based on a mental model that states 'all other languages are a nasty subset of Lisp' as an axiom, but it is not the case in this (or for that matter, most) instance. In fact I'm curious why you're here at all, as this also implies that D can never be more than a nasty subset of Lisp. Michael Wilson Systems Developer Purple Frog Text Ltd
Jan 12 2008
prev sibling next sibling parent Paolo Invernizzi <paoloinvernizzi fastwebnet.it> writes:
You can also avoid targeting bytecode: what I mean is that you can construct
and link the genetic functions directly in your program (with a veeeery good
optimizer), without having to emit/compile/link anything bytecode-related.
The JIT is your friend.
That can be a very good solution, especially if the JITted functions are
non-trivial ones and can benefit a lot from optimization.

I did something similar some time ago, writing some wrapper C code for the
interesting LLVM C++ API (though now I think there's an effort to produce an
LLVM C API), and using it in a genetic D program that instantiated a lot of
LLVM functions (for a generation) in an LLVM module and then executed them
over the D individuals. I aborted the effort for lack of time, and for the
not-so-big improvement over a more classical, D-only solution.

Cheers, Paolo


Michael Wilson Wrote:

 Thanks! That does look really interesting. I had a quick look at
 GNU Lightning but it's almost entirely unoptimising, and thus probably
 unusable. LLVM looks much better at first glance and I should be able
 to target LLVM bytecode directly, but I'll have to investigate the 
 feasibility of linking against D. Googling, it looks like there have
 been a few attempts to use LLVM and D together before, but mostly in
 the sense of using it as a third static compiler (e.g.
 http://www.dsource.org/projects/llvmdc). I will have to do some more
 research and see if there is anything I could get involved with.
 Needless to say, if I can get my company to contribute engineering
 time to some interesting open source development, I will. :)

Jan 12 2008
prev sibling next sibling parent Walter Bright <newshound1 digitalmars.com> writes:
Michael Wilson wrote:
 That's something I could potentially help with once I'm decently
 familiar with the language, if the GDC developers are accepting patches.
 Otherwise, I could just implement my own synchronization, via
 inline assembly (since this doesn't have to be multi-platform) or C.

I'm certainly willing to accept help with this!
Jan 13 2008
prev sibling parent Sean Kelly <sean f4.ca> writes:
Michael Wilson wrote:
 bearophile wrote:

 2. High-performance synchronization. As of 1.6, Java's monitor


 I don't know the answer, but in most things D is far from being tuned

That's something I could potentially help with once I'm decently familiar with the language, if the GDC developers are accepting patches. Otherwise, I could just implement my own synchronization, via inline assembly (since this doesn't have to be multi-platform) or C.

Right now, D uses OS API routines for synchronization. Tango also has a Mutex class design that integrates with the 'synchronized' statement, and it would be pretty easy to create a futex or whatever that does the same thing. This may be the way to go for special-purpose locking. Sean
Jan 13 2008
prev sibling next sibling parent Paolo Invernizzi <arathorn fastwebnet.it> writes:
I guess the best approach is LLVM...
Take a look at http://www.llvm.org
Cheers, Paolo

Michael Wilson Wrote:

 Any thoughts on any of the above? Bear in mind that I have a mild bias
 towards using D even if it is some extra work to get things going, but I
 am certainly not going to accept worse overall performance than the Java
 version.

Jan 12 2008
prev sibling next sibling parent CptJack <cptjack nospam.net> writes:
I'm not even close to being an expert on D's internal compiler mechanisms, so I
don't know whether it satisfies your requirements. However, reading your post I kept
thinking one thing: Everything you need to do is provided by Common Lisp, and I
don't mean "CLisp", which is a byte-code interpreted CL implementation.
Contrary to what most people assume, almost all CL implementations compile
functions on-the-fly to machine code. That's raw machine object code, not
byte-code, compiled and loaded dynamically while the system is still running.
This sounds exactly like what you want. CMUCL and SBCL are the two most mature
open source implementations that do so on linux, although ECL does as well and
it's getting more mature all the time.

In fact, it sounds like your current system is a perfect example of Greenspun's
Tenth Rule.
Jan 12 2008
prev sibling next sibling parent BLS <nanali nospam-wanadoo.fr> writes:
Hi Michael,
A pragmatic approach:
D-newLISP-C
I guess you'll find the following newLISP examples interesting:

- embed binary machine code into newLISP source

- distributed computing, a map/reduce style example of calculating and 
reducing word counts from different documents
(see the source code comments)

You can embed newLISP as a .so - or run it as an inetd or xinetd service, etc.

Tru64 UNIX binary available (otherwise you have to deal with the source).

Just in case, the example link :
http://newlisp.org/index.cgi?Tips_and_Tricks

HTH Bjoern
Jan 13 2008
prev sibling next sibling parent Walter Bright <newshound1 digitalmars.com> writes:
I don't know what the right answer is for your application, so I'll just 
try and answer where I can.

Michael Wilson wrote:
 The system carries out data mining tasks using algorithms generated by a
 genetic programming engine. The original prototype (developed before I
 joined the company) did this by constructing a Lisp-style function tree
 out of Java objects, then running that in an interpreter (i.e. it used
 ECJ). Unsurprisingly this was hideously slow. The current version that
 I designed works by compiling the function tree to Java bytecode (as
 static methods in a new class), calling an in-memory classloader on the
 output (allowing Java to link it), then allowing Hotspot to translate
 it to native machine code. However to get good performance on large
 datasets, I had to implement the terminal functions (i.e. the inner
 loops) in C, using GCC's SSE intrinsics. The dynamic java code calls
 this via the Java Native Interface (specifically, a class full of static
 native methods). However JNI has a significant call overhead, which is
 eating up a good fraction of the performance boost, as well as creating
 additional development/maintenance hassle. This system is running on a
 couple of beefy 16 core Opteron servers and it still takes hours to run
 on large datasets.

C code can be called directly from D without a JNI type interface.
 The major problem with implementing in D is the lack of runtime
 compilation and the AFAIK subpar support for dynamic linking. I'll get
 to that in a bit. The other potential problems are;
 
 1. Support for AMD64 on Linux is absolutely essential; these datasets
 are several gigabytes in size and need to be loaded into memory. Plus
 there is a nontrivial amount of 64-bit integer manipulation in the Java
 code. Linux obviously because what sane person would waste money on
 Windows Server licenses for compute servers? I haven't been able to
 track down the latest estimate for when DMD will support AMD64, but at
 least GDC does now. Is this stable enough for production use?

64 bit DMD is not on the horizon yet, so your only option there is GDC. If it isn't stable enough, I suspect that your needs will drive it to stability.
 2. High-performance synchronization. As of 1.6, Java's monitor
 implementation is quite impressive; the biased locking mechanism means
 that atomic operations are only required for a small fraction of monitor
 acquisitions. I've been trying to find out if D's implementation uses
 the same mechanism. This is probably only good for a few percent of
 performance, but on 16-way NUMA locking performance can make a
 difference even when it isn't in the inner loops.

D's locking mechanism is controlled by library code, so it is adaptable by the user. Currently, it is rather basic.
 3. SSE intrinsics. Does GDC have the equivalent of GCC SSE intrinsics
 yet (i.e. "target builtins")? This post suggests that it does;
 http://www.digitalmars.com/pnews/read.php?server=news.digitalmars.com&group=D.gnu&artnum=2622 
 
 but I can't see anything in the documentation about them. If not I'll
 have to continue linking to the existing C code, which eliminates a
 significant benefit of going to D (simplifying the code base - but at
 least the JNI call & pinning overhead will be gone). Ideally
 std.intrinsic would have these (presumably that's how DMD is going to
 include them, if an when it does - as far as I can tell, DMD is still
 doing even basic double-precision operations with slow x87 code rather
 than SSE code, unless
 http://www.digitalmars.com/d/archives/digitalmars/D/Optimizations_of_DMD_49840.html 
 
 is out of date).

DMD supports the SSE code via the inline assembler only. I don't know if GDC does. In any case, it would be simple to just link to C code that does it.
 Ok, on to the serious problem, dynamic code support. The GP engine is
 outputting a function tree structure. I can trivially output this as D
 or Java source. I can output it as Java bytecode with a bit more effort.
 I really don't want to have to compile it to x86 machine code (and link
 it against the terminal functions) myself, particularly given that the
 research guys keep elaborating the algorithm. Here's what I've
 considered so far;
 
 1. Output as C source plus a JNI stub in bytecode. Invoke gcc on it to
 build a dynamic library. Load the class in Java. This would remove most
 of the JNI overhead (because the Java/C crossover would be pushed 4 to
 10 places up the call graph). However I'd have to keep invoking GCC as
 a separate process, though I could cut down the overhead of this
 somewhat by putting it and the dynamic files on a ramdisk. I'm unsure
 what the performance would be like under Linux of constantly linking
 new shared libraries, and I'm not sure if the Sun JVM is capable of
 unloading native code along with unloading classes (though this memory
 leak is probably insignificant over the lifetime of the process). This
 is the solution I will be using if I can't find a good D solution.
 
 2. Port to D. Change the GP system to target MiniD bytecode. I haven't
 examined MiniD in detail yet, but I get the impression that this will
 be marginally harder than targeting Java bytecode. Alternatively I
 could target MiniD source, which would be simpler for me, at some
 additional compilation cost, or I could dig into the MiniD compiler
 source and see if it has an AST I could target. The later would be
 ideal if the code is stable, but it probably isn't.
 
 MiniD is purely interpreted, so performance of the dynamic code will
 almost certainly suck badly compared to Hotspot. On the plus side,
 'linking' cost is effectively zero. The inner loops will still have
 to be in either D (if it has decent SSE intrinsics now) or C (if it
 doesn't); at first glance MiniD native invocation overhead looks better
 than JNI (no need to pin for one thing, since the GC is non-copying)
 but still nontrivial.

If the linker overhead is too big, it's possible to do the 'linking' yourself by directly reading the .o file, stuffing it in memory, and executing it.
 3) Port to D. Change the GP system to target D source. Batch code
 updates together. On each update, recompile the codebase using gdc
 running in a ramdisk. Start a new process using the new code. Keep
 all the persistent data in a shared memory segment. Get the new
 process to map in the shared segment, then kill the old process.
 Processing can continue from where it left off using the updated code.
 
 AFAIK this /should/ work, but I don't know what the overheads are
 going to be like. Having to batch together the code updates is
 mildly inconvenient from the algorithm design side. I'm not sure what
 if any incremental compilation features GDC has; perhaps I could path
 it to stay in memory and stream new classes to it over a pipe.

This looks like your best shot with D. Like you say, it is dependent on how fast GDC can be run.
 4) Port to D. Change the GP system to target D source. Batch code
 updates together. On each update, link the new classes in with DDL.
 
 The major problem with this is that AFAIK DDL isn't working on Linux
 yet (at all, never mind on AMD64) and is still beta quality code
 even on Windows. In addition to that, I'm not clear if it can ever
 recover the memory for unused code, though as I mentioned for (1) this
 leak isn't likely to be a problem over the lifetime of the process.
 Plus the GDC invocation overhead as for (3).
 
 and finally...

D DLLs work on Windows. DMD supports writing pic code on Linux, I've just never done the work to build a shared library using it.
 6) Write a new 'D Dynamic Code Library'. My best guess for how this
 should work is this;
 
 a) This library will statically link to GDC and the GCC back end.
 b) On first invocation, look for a metadata file. If it isn't present or
 is out of date, recompile the app from source and keep the relevant
 GDC state in memory. Otherwise, deserialise GDC state from the metadata
 file.
 c) Supply functions that accept D source and possibly one or more of the
 GCC intermediate representations (i.e. AST, CFG, RTL). These would call
 through to GDC/GCC, intercept the output and write it into the code
 page, then perform runtime linking as necessary. This is probably a lot
 of work, but I am unsure exactly how much.
 d) Open source this library in case anyone else wants to use it.

Wow.
 Any thoughts on any of the above? Bear in mind that I have a mild bias
 towards using D even if it is some extra work to get things going, but I
 am certainly not going to accept worse overall performance than the Java
 version.

I think that any D path you take with this will be of a large, sustained benefit to the D community.
Jan 13 2008
prev sibling next sibling parent Tom S <h3r3tic remove.mat.uni.torun.pl> writes:
Michael Wilson wrote:
 4) Port to D. Change the GP system to target D source. Batch code
 updates together. On each update, link the new classes in with DDL.
 
 The major problem with this is that AFAIK DDL isn't working on Linux
 yet (at all, never mind on AMD64) and is still beta quality code
 even on Windows. In addition to that, I'm not clear if it can ever
 recover the memory for unused code, though as I mentioned for (1) this
 leak isn't likely to be a problem over the lifetime of the process.
 Plus the GDC invocation overhead as for (3).

For what it's worth, you don't need to use DDL on Linux; .so is just fine there (at least with GDC). DDL was pretty stable when implemented with Phobos; I have used it in a mid-scale project (a first-person shooter engine+game) for plugins and rendering kernels (in a micro-kernel-like architecture). DDL was recently ported to Tango. I haven't done much testing yet, but it appears to be working just fine. Good luck with your project! :) -- Tomasz Stachowiak http://h3.team0xf.com/ h3/h3r3tic on #D freenode
Jan 13 2008
prev sibling parent Rory Mcguire <rjmcguire gmail.com> writes:
Check out Kong on dsource.org: www.dsource.org/projects/kong

It gives you the ability to modify which function is called when a library
function is invoked. So you could have a library that you load on program
initialization just to make the symbols available, and then each time you
generate new D code, you compile it to a shared library, load it, and then
modify the original functions to point to the newly loaded library's symbols.

-Rory
Mar 27 2008