digitalmars.D - {OT} Youtube Video: newCTFE: Starting to write the x86 JIT


Stefan Koch <uplink.coder googlemail.com> writes:

Hi Guys,

I have just begun work on the x86 JIT backend.

I am doing this now because I am at a stage where further design
decisions need to be made, and those decisions need to be informed
by how a _fast_ JIT-compatible x86 codegen is structured.

Since I believe this is an interesting topic, I will give you
an over-the-shoulder perspective on it.

At the time of posting the video is still uploading, but you 
should be able to see it soon.

https://www.youtube.com/watch?v=pKorjPAvhQY

Cheers,
Stefan

Apr 20 2017

Stefan Koch <uplink.coder googlemail.com> writes:

On Thursday, 20 April 2017 at 12:56:11 UTC, Stefan Koch wrote:
 Hi Guys,

 I have just begun work on the x86 JIT backend.

 I am doing this now because I am at a stage where further design
 decisions need to be made, and those decisions need to be
 informed by how a _fast_ JIT-compatible x86 codegen is
 structured.

 Since I believe this is an interesting topic, I will give you
 an over-the-shoulder perspective on it.

 At the time of posting the video is still uploading, but you 
 should be able to see it soon.

 https://www.youtube.com/watch?v=pKorjPAvhQY

 Cheers,
 Stefan

The actual code-gen starts at around 34:00.

Apr 20 2017

Suliman <evermind live.ru> writes:

On Thursday, 20 April 2017 at 12:56:11 UTC, Stefan Koch wrote:
 Hi Guys,

 I have just begun work on the x86 JIT backend.

 I am doing this now because I am at a stage where further design
 decisions need to be made, and those decisions need to be
 informed by how a _fast_ JIT-compatible x86 codegen is
 structured.

 Since I believe this is an interesting topic, I will give you
 an over-the-shoulder perspective on it.

 At the time of posting the video is still uploading, but you 
 should be able to see it soon.

 https://www.youtube.com/watch?v=pKorjPAvhQY

 Cheers,
 Stefan

Could you explain where it can be helpful?

Apr 20 2017

Stefan Koch <uplink.coder googlemail.com> writes:

On Thursday, 20 April 2017 at 14:35:27 UTC, Suliman wrote:
 Could you explain where it can be helpful?

It's helpful for newCTFE's development. :)
I estimate the JIT will easily be 10 times faster than my
bytecode interpreter, which will make it about 100-1000x faster
than the current CTFE.
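
As a rough illustration of the arithmetic behind that estimate: the two stated figures together imply how the bytecode interpreter compares with the current CTFE. The variable names are made up for this sketch, and the implied range is only what the stated numbers work out to, not a measurement.

```python
# Stated estimate: the JIT is about 10x faster than the bytecode interpreter.
jit_over_interpreter = 10

# Stated estimate: the JIT is about 100-1000x faster than the current CTFE.
jit_over_current_ctfe = (100, 1000)

# Implied: how much faster the bytecode interpreter is than the current CTFE.
interpreter_over_current_ctfe = tuple(
    total // jit_over_interpreter for total in jit_over_current_ctfe
)

print(interpreter_over_current_ctfe)  # (10, 100)
```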

Apr 20 2017

Nordlöw <per.nordlow gmail.com> writes:

On Thursday, 20 April 2017 at 14:54:20 UTC, Stefan Koch wrote:
 It's helpful for newCTFE's development. :)
 I estimate the JIT will easily be 10 times faster than my
 bytecode interpreter, which will make it about 100-1000x faster
 than the current CTFE.

Wow.

Apr 20 2017

evilrat <evilrat666 gmail.com> writes:

On Thursday, 20 April 2017 at 14:54:20 UTC, Stefan Koch wrote:
 On Thursday, 20 April 2017 at 14:35:27 UTC, Suliman wrote:
 Could you explain where it can be helpful?

 It's helpful for newCTFE's development. :)
 I estimate the JIT will easily be 10 times faster than my
 bytecode interpreter, which will make it about 100-1000x faster
 than the current CTFE.

Does this apply to templates too? I recently tried some code, and
the templated version, with about 10 instantiations for 4-5 types,
increased compile time from about 1 sec up to 4! The template
itself was straightforward; it just had a bunch of static
if-else-else branches for type special cases.

Apr 21 2017

Stefan Koch <uplink.coder googlemail.com> writes:

On Saturday, 22 April 2017 at 03:03:32 UTC, evilrat wrote:
 On Thursday, 20 April 2017 at 14:54:20 UTC, Stefan Koch wrote:
 On Thursday, 20 April 2017 at 14:35:27 UTC, Suliman wrote:
 Could you explain where it can be helpful?

 It's helpful for newCTFE's development. :)
 I estimate the JIT will easily be 10 times faster than my
 bytecode interpreter, which will make it about 100-1000x faster
 than the current CTFE.

 Does this apply to templates too? I recently tried some code, and
 the templated version, with about 10 instantiations for 4-5 types,
 increased compile time from about 1 sec up to 4! The template
 itself was straightforward; it just had a bunch of static
 if-else-else branches for type special cases.

No, it most likely will not.
However, I am planning to work on speeding up templates after
newCTFE is done.

Apr 21 2017

Stefan Koch <uplink.coder googlemail.com> writes:

On Saturday, 22 April 2017 at 03:03:32 UTC, evilrat wrote:
 On Thursday, 20 April 2017 at 14:54:20 UTC, Stefan Koch wrote:
 On Thursday, 20 April 2017 at 14:35:27 UTC, Suliman wrote:
 Could you explain where it can be helpful?

 It's helpful for newCTFE's development. :)
 I estimate the JIT will easily be 10 times faster than my
 bytecode interpreter, which will make it about 100-1000x faster
 than the current CTFE.

 Does this apply to templates too? I recently tried some code, and
 the templated version, with about 10 instantiations for 4-5 types,
 increased compile time from about 1 sec up to 4! The template
 itself was straightforward; it just had a bunch of static
 if-else-else branches for type special cases.

If you could share the code it would be appreciated.
If you cannot share it publicly, come onto IRC sometime.
I am Uplink|DMD there.

Apr 22 2017

evilrat <evilrat666 gmail.com> writes:

On Saturday, 22 April 2017 at 10:38:45 UTC, Stefan Koch wrote:
 On Saturday, 22 April 2017 at 03:03:32 UTC, evilrat wrote:
 Does this apply to templates too? I recently tried some code,
 and the templated version, with about 10 instantiations for 4-5
 types, increased compile time from about 1 sec up to 4! The
 template itself was straightforward; it just had a bunch of
 static if-else-else branches for type special cases.

 If you could share the code it would be appreciated.
 If you cannot share it publicly, come onto IRC sometime.
 I am Uplink|DMD there.

Sorry, I failed: that was actually caused by the build system and
added dependencies (which are compiled every time no matter what,
hence the slowdown). Testing overloaded functions vs. a template
shows no significant difference in build times.

Apr 22 2017

Stefan Koch <uplink.coder googlemail.com> writes:

On Sunday, 23 April 2017 at 02:45:09 UTC, evilrat wrote:
 On Saturday, 22 April 2017 at 10:38:45 UTC, Stefan Koch wrote:
 On Saturday, 22 April 2017 at 03:03:32 UTC, evilrat wrote:
 [...]

 If you could share the code it would be appreciated.
 If you cannot share it publicly, come onto IRC sometime.
 I am Uplink|DMD there.

 Sorry, I failed: that was actually caused by the build system and
 added dependencies (which are compiled every time no matter what,
 hence the slowdown). Testing overloaded functions vs. a template
 shows no significant difference in build times.

Ah, I see.
A 4x slowdown for 10 instances seemed rather unusual,
though doubtless possible.

Apr 22 2017

John Colvin <john.loughran.colvin gmail.com> writes:

On Thursday, 20 April 2017 at 12:56:11 UTC, Stefan Koch wrote:
 Hi Guys,

 I have just begun work on the x86 JIT backend.

 I am doing this now because I am at a stage where further design
 decisions need to be made, and those decisions need to be
 informed by how a _fast_ JIT-compatible x86 codegen is
 structured.

 Since I believe this is an interesting topic, I will give you
 an over-the-shoulder perspective on it.

 At the time of posting the video is still uploading, but you 
 should be able to see it soon.

 https://www.youtube.com/watch?v=pKorjPAvhQY

 Cheers,
 Stefan

Is there not some way that you could get the current 
interpreter-based implementation into dmd sooner and then modify
the design later if necessary when you do x86 jit? The benefits 
of having just *fast* ctfe sooner are perhaps larger than the 
benefits of having *even faster* ctfe later. Faster templates are 
also something that might be higher priority - assuming it will 
be you who does the work there.

Obviously it's your time and you're free to do whatever you like 
whenever you like, but I was just wondering what your reasoning
for the order of your plan is?

Apr 22 2017

Stefan Koch <uplink.coder googlemail.com> writes:

On Saturday, 22 April 2017 at 14:22:18 UTC, John Colvin wrote:
 On Thursday, 20 April 2017 at 12:56:11 UTC, Stefan Koch wrote:
 Hi Guys,

 I have just begun work on the x86 JIT backend.

 I am doing this now because I am at a stage where further design
 decisions need to be made, and those decisions need to be
 informed by how a _fast_ JIT-compatible x86 codegen is
 structured.

 Since I believe this is an interesting topic, I will give you
 an over-the-shoulder perspective on it.

 At the time of posting the video is still uploading, but you 
 should be able to see it soon.

 https://www.youtube.com/watch?v=pKorjPAvhQY

 Cheers,
 Stefan

 Is there not some way that you could get the current 
 interpreter-based implementation into dmd sooner and then
 modify the design later if necessary when you do x86 jit? The 
 benefits of having just *fast* ctfe sooner are perhaps larger 
 than the benefits of having *even faster* ctfe later. Faster 
 templates are also something that might be higher priority - 
 assuming it will be you who does the work there.

 Obviously it's your time and you're free to do whatever you 
 like whenever you like, but I was just wondering what your
 reasoning for the order of your plan is?

newCTFE is currently at a phase where high-level features have to
be implemented.
For that reason, I am looking to extend the interface to support,
for example, scaled loads and the like.
Otherwise you end up with 1000 temporaries that add offsets to
pointers.
Also, and perhaps more importantly, I am sick and tired of hearing
"why don't you use ldc/llvm?" all the time...

Apr 22 2017

Ola Fosheim Grøstad writes:

On Saturday, 22 April 2017 at 14:29:22 UTC, Stefan Koch wrote:
 For that reason, I am looking to extend the interface to support,
 for example, scaled loads and the like.
 Otherwise you end up with 1000 temporaries that add offsets to
 pointers.

What are scaled loads?

 Also, and perhaps more importantly, I am sick and tired of
 hearing "why don't you use ldc/llvm?" all the time...

Yes, that's not fair.

Apr 24 2017

Stefan Koch <uplink.coder googlemail.com> writes:

On Monday, 24 April 2017 at 11:29:01 UTC, Ola Fosheim Grøstad 
wrote:
 What are scaled loads?

x86 has addressing modes which allow you to multiply an index by
one of a fixed set of scale factors (1, 2, 4 or 8) and add it as an
offset to the pointer you want to load from.
This makes memory access patterns more transparent to the
caching and prefetch systems,
as well as reducing the overall code size.
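
A minimal sketch of what such a scaled load looks like at the encoding level, written in Python for brevity. The helper names are hypothetical and are not taken from newCTFE. It only handles the simplest form, `mov dst, [base + index*scale]`, with 32-bit registers and no displacement.

```python
# Standard x86 register numbers for the 32-bit general-purpose registers.
EAX, ECX, EDX, EBX, ESP, EBP, ESI, EDI = range(8)

# The SIB byte stores the scale factor as a 2-bit field.
SCALE_BITS = {1: 0b00, 2: 0b01, 4: 0b10, 8: 0b11}


def effective_address(base, index, scale, disp=0):
    """Address the CPU computes for [base + index*scale + disp]."""
    if scale not in SCALE_BITS:
        raise ValueError("scale must be 1, 2, 4 or 8")
    return base + index * scale + disp


def encode_scaled_load(dst, base, index, scale):
    """Encode `mov dst, [base + index*scale]` (hypothetical helper)."""
    if scale not in SCALE_BITS:
        raise ValueError("scale must be 1, 2, 4 or 8")
    if index == ESP:
        raise ValueError("ESP cannot be used as an index register")
    if base == EBP:
        # mod=00 with base=EBP means "disp32, no base"; not handled here.
        raise ValueError("EBP as base needs a displacement form")
    opcode = 0x8B                               # mov r32, r/m32
    modrm = (0b00 << 6) | (dst << 3) | 0b100    # rm=100 selects a SIB byte
    sib = (SCALE_BITS[scale] << 6) | (index << 3) | base
    return bytes([opcode, modrm, sib])


# mov eax, [ebx + ecx*4]: one three-byte instruction.
print(encode_scaled_load(EAX, EBX, ECX, 4).hex())  # 8b048b
print(hex(effective_address(0x1000, 3, 4)))        # 0x100c
```

Without a scaled load in the interface, the same access needs a separate multiply (or shift) and an add into temporaries before the load itself.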

Apr 24 2017

Ola Fosheim Grøstad writes:

On Monday, 24 April 2017 at 17:48:50 UTC, Stefan Koch wrote:
 On Monday, 24 April 2017 at 11:29:01 UTC, Ola Fosheim Grøstad 
 wrote:
 What are scaled loads?

 x86 has addressing modes which allow you to multiply an index
 by a certain set of scalars and add it as an offset to the
 pointer you want to load.
 Thereby making memory access patterns more transparent to the 
 caching and prefetch systems.
 As well as reducing the overall code-size.

Oh, ok. AFAIK the decoding of indexing modes into micro-ops (the
real instructions used inside the CPU, not the actual op-codes)
has no effect on the caching system. It may, however, compress the
generated code so you don't flush the instruction cache, and speed
up the decoding of op-codes into micro-ops.

If you want to improve cache loads you have to consider when to 
use the "prefetch" instructions, but the effect (positive or 
negative) varies greatly between CPU generations so you will 
basically need to target each CPU-generation individually.

Probably too much work to be worthwhile as it usually doesn't pay 
off until you work on large datasets and then you usually have to 
be careful with partitioning the data into cache-friendly 
working-sets. Probably not so easy to do for a JIT.

You'll probably get a decent performance boost without worrying 
about caching too much in the first implementation anyway. Any 
gains in that area could be obliterated in the next CPU 
generation... :-/

Apr 25 2017

Patrick Schluter <Patrick.Schluter bbox.fr> writes:

On Tuesday, 25 April 2017 at 09:09:00 UTC, Ola Fosheim Grøstad 
wrote:
 On Monday, 24 April 2017 at 17:48:50 UTC, Stefan Koch wrote:
 [...]

 Oh, ok. AFAIK the decoding of indexing modes into micro-ops
 (the real instructions used inside the CPU, not the actual
 op-codes) has no effect on the caching system. It may, however,
 compress the generated code so you don't flush the instruction
 cache, and speed up the decoding of op-codes into micro-ops.

 If you want to improve cache loads you have to consider when to 
 use the "prefetch" instructions, but the effect (positive or 
 negative) varies greatly between CPU generations so you will 
 basically need to target each CPU-generation individually.

 Probably too much work to be worthwhile as it usually doesn't 
 pay off until you work on large datasets and then you usually 
 have to be careful with partitioning the data into 
 cache-friendly working-sets. Probably not so easy to do for a 
 JIT.

 You'll probably get a decent performance boost without worrying 
 about caching too much in the first implementation anyway. Any 
 gains in that area could be obliterated in the next CPU 
 generation... :-/

It's already the case. Intel and AMD (especially in Ryzen)
have strongly discouraged the use of prefetch instructions since
at least Core2 and Athlon64. The icache cost rarely pays off, and
prefetching very often breaks the auto-prefetcher algorithms by
wasting memory bandwidth.

Apr 25 2017

Ola Fosheim Grøstad writes:

On Tuesday, 25 April 2017 at 16:16:43 UTC, Patrick Schluter wrote:
 It's already the case. Intel and AMD (especially in Ryzen)
 have strongly discouraged the use of prefetch instructions since
 at least Core2 and Athlon64. The icache cost rarely pays off, and
 prefetching very often breaks the auto-prefetcher algorithms by
 wasting memory bandwidth.

I think it just has to be done on a case-by-case basis. But if 
one doesn't target a specific set of CPUs and a specific 
predictable access pattern (like visiting every 4th cacheline) 
then one probably shouldn't do it.

There are also so many different types to choose from: 
prefetch-for-write, prefetch-for-one-time-use, 
prefetch-to-cache-level2, etc... Hard to get that right for a
small-scale JIT without knowledge of the algorithm or the
dataset.

Apr 25 2017

Jonathan Marler <johnnymarler gmail.com> writes:

On Thursday, 20 April 2017 at 12:56:11 UTC, Stefan Koch wrote:
 Hi Guys,

 I have just begun work on the x86 JIT backend.

 I am doing this now because I am at a stage where further design
 decisions need to be made, and those decisions need to be
 informed by how a _fast_ JIT-compatible x86 codegen is
 structured.

 Since I believe this is an interesting topic, I will give you
 an over-the-shoulder perspective on it.

 At the time of posting the video is still uploading, but you 
 should be able to see it soon.

 https://www.youtube.com/watch?v=pKorjPAvhQY

 Cheers,
 Stefan

Have you considered using the LLVM jit compiler for CTFE? We 
already have an LLVM front end. This would mean that CTFE would 
depend on LLVM, which is a large dependency, but it would create 
very fast, optimized code for CTFE on any platform.

Keep in mind that I'm not as familiar with the technical details
of CTFE, so you may see a lot of negative ramifications that I'm
not aware of. I just want to make sure it's being considered and
to hear what your and others' thoughts are.

Apr 24 2017

jmh530 <john.michael.hall gmail.com> writes:

On Monday, 24 April 2017 at 12:59:55 UTC, Jonathan Marler wrote:
 Have you considered using the LLVM jit compiler for CTFE? We 
 already have an LLVM front end. This would mean that CTFE would 
 depend on LLVM, which is a large dependency, but it would 
 create very fast, optimized code for CTFE on any platform.

I can't help but laugh at this after the above posts...

Apr 24 2017

Jonathan Marler <johnnymarler gmail.com> writes:

On Monday, 24 April 2017 at 14:41:44 UTC, jmh530 wrote:
 On Monday, 24 April 2017 at 12:59:55 UTC, Jonathan Marler wrote:
 Have you considered using the LLVM jit compiler for CTFE? We 
 already have an LLVM front end. This would mean that CTFE 
 would depend on LLVM, which is a large dependency, but it 
 would create very fast, optimized code for CTFE on any 
 platform.

 I can't help but laugh at this after the above posts...

I totally missed when Stefan said:

 Also, and perhaps more importantly, I am sick and tired of
 hearing "why don't you use ldc/llvm?" all the time...

That is pretty hilarious :) I suppose I just demonstrated the
reason he is attempting to create an x86 jitter: so that he will
have an interface that could be extended to something like LLVM. Wow.

Apr 24 2017
