
digitalmars.D - Compilation strategy

reply Russel Winder <russel winder.org.uk> writes:

A quick straw poll.  Do people prefer to have all sources compiled in a
single compiler call, or (more like C++) separate compilation of each
object followed by a link call?

Thanks.

-- 
Russel.
=========================================================================
Dr Russel Winder      t: +44 20 7585 2200   voip: sip:russel.winder ekiga.net
41 Buckmaster Road    m: +44 7770 465 077   xmpp: russel winder.org.uk
London SW11 1EN, UK   w: www.russel.org.uk  skype: russel_winder
Dec 15 2012
next sibling parent "Adam D. Ruppe" <destructionator gmail.com> writes:
On Saturday, 15 December 2012 at 16:55:39 UTC, Russel Winder 
wrote:
 Do people prefer to have all sources compiled in a
 single compiler call

I prefer the single call.
Dec 15 2012
prev sibling next sibling parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 12/15/12 11:55 AM, Russel Winder wrote:
 A quick straw poll.  Do people prefer to have all sources compiled in a
 single compiler call, or (more like C++) separate compilation of each
 object followed by a link call.

In Phobos we use a single call for building the library. Then (at least on Posix) we use multiple calls for running unittests.

Andrei
Dec 15 2012
prev sibling next sibling parent reply "Peter Alexander" <peter.alexander.au gmail.com> writes:
On Saturday, 15 December 2012 at 16:55:39 UTC, Russel Winder 
wrote:
 A quick straw poll.  Do people prefer to have all sources 
 compiled in a
 single compiler call, or (more like C++) separate compilation 
 of each
 object followed by a link call.

Single compiler call is easier for small projects, but I worry about compile times for larger projects...
Dec 15 2012
next sibling parent reply "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Sat, Dec 15, 2012 at 06:31:17PM +0100, RenatoUtsch wrote:
 On Saturday, 15 December 2012 at 17:05:59 UTC, Peter Alexander
 wrote:
On Saturday, 15 December 2012 at 16:55:39 UTC, Russel Winder
wrote:
A quick straw poll.  Do people prefer to have all sources compiled
in a single compiler call, or (more like C++) separate compilation
of each object followed by a link call.

Single compiler call is easier for small projects, but I worry about compile times for larger projects...

Yes, I'm writing a build system for D (one that will be pretty damn good, I think; it has some interesting new concepts), and compiling each source separately to an object and then linking everything will make it easy to parallelize the build, dividing the sources to compile among various threads. Or does the compiler already do that if I pass all source files in one call?

I find that the current front-end (common to dmd, gdc, ldc) tends to work better when passed multiple source files at once. It tends to be faster, presumably because it only has to parse commonly-imported files once, and it also produces smaller object/executable sizes -- maybe due to fewer duplicated template instantiations? I'm not sure of the exact reasons, but this behaviour appears consistent across dmd and gdc, and I presume also ldc (I didn't test that). So based on this, I'd lean toward compiling multiple files at once.

However, in very large projects, clearly this won't work very well. If it takes half an hour to build the entire system, it makes the code - compile - test cycle very slow, which reduces programmer productivity.

So perhaps one possible middle ground would be to link packages separately, but compile all the sources within a single package at once. Presumably, if the project is properly organized, recompiling a single package won't take too long, and it has the perk of optimizing for size within packages. This will probably also map to SCons easily, since SCons builds per-directory.

T

-- 
They pretend to pay us, and we pretend to work. -- Russian saying
Dec 15 2012
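For illustration, here is a minimal sketch (an editorial addition, not from the original thread) of the per-package middle ground described above: one dmd invocation per package, then a single link step. The package layout, output names, and flags beyond -c/-of are assumptions; it presumes dmd on the PATH and a Posix-style toolchain.

    // Minimal sketch of a per-package build driver (hypothetical layout).
    import std.algorithm : map;
    import std.array : array;
    import std.file : dirEntries, SpanMode;
    import std.process : spawnProcess, wait;
    import std.stdio : writeln;

    void main()
    {
        // Hypothetical package directories; each becomes one dmd call.
        // A real tool would also pass the right -I import paths.
        auto packages = ["src/app", "src/util", "src/net"];
        string[] objects;

        foreach (pkg; packages)
        {
            auto sources = dirEntries(pkg, "*.d", SpanMode.shallow)
                           .map!(e => e.name)
                           .array;
            auto obj = pkg ~ ".o";
            objects ~= obj;
            // Compile the whole package in one front-end invocation (-c: no link).
            wait(spawnProcess(["dmd", "-c", "-of" ~ obj] ~ sources));
        }

        // One final link step over the per-package objects.
        wait(spawnProcess(["dmd", "-ofmyapp"] ~ objects));
        writeln("built myapp from ", objects.length, " package objects");
    }

Recompiling after a change then means re-running only the dmd call for the affected package plus the link step.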
parent Ellery Newcomer <ellery-newcomer utulsa.edu> writes:
On 12/15/2012 10:30 AM, RenatoUtsch wrote:
 Well, the idea is good. Small projects usually don't have much packages,
 so there will be just a few compiler calls. And compiling files
 concurrently will only have a meaningful efect if the project is large,
 and a large project will have a lot of packages.

 Maybe adding an option to choose between compiling all sources at once,
 per package, or per source. For example, in development and debug builds
 the compilation is per file or package, but in release builds all
 sources are compiled at once, or various packages at once.

I always thought it would be cool if my build tool could automatically determine the dependency graph of my D code and use it to intelligently break my project into small enough semi-independent chunks to compile.
Dec 16 2012
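A rough sketch (editorial, not from the thread) of how a build tool might obtain such a dependency graph: dmd can emit import dependencies with -deps=<file>, and the graph can then be partitioned into chunks. The parsing below assumes -deps lines of the shape "importer (file) : kind : imported (file)", which may need adjusting for a given compiler version; the entry-point path is hypothetical.

    // Sketch: build a module dependency graph from dmd's -deps output.
    import std.array : split;
    import std.process : execute;
    import std.stdio : File, writeln;
    import std.string : strip;

    void main()
    {
        // -o- : run the front end only, produce no object file.
        auto result = execute(["dmd", "-o-", "-deps=deps.txt", "src/main.d"]);
        if (result.status != 0) { writeln(result.output); return; }

        bool[string][string] graph;   // importer -> set of imported modules
        foreach (line; File("deps.txt").byLine)
        {
            // Assumed line shape: "importer (file) : kind : imported (file)"
            auto parts = line.idup.split(" : ");
            if (parts.length < 3) continue;
            auto importer = parts[0].split(" ")[0];
            auto imported = parts[2].strip.split(" ")[0];
            graph[importer][imported] = true;
        }

        // A build tool could now cluster this graph into semi-independent chunks.
        foreach (mod, deps; graph)
            writeln(mod, " -> ", deps.keys);
    }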
prev sibling next sibling parent "RenatoUtsch" <renatoutsch gmail.com> writes:
On Saturday, 15 December 2012 at 18:00:58 UTC, H. S. Teoh wrote:
 On Sat, Dec 15, 2012 at 06:31:17PM +0100, RenatoUtsch wrote:
 On Saturday, 15 December 2012 at 17:05:59 UTC, Peter Alexander
 wrote:
On Saturday, 15 December 2012 at 16:55:39 UTC, Russel Winder
wrote:
A quick straw poll.  Do people prefer to have all sources 
compiled
in a single compiler call, or (more like C++) separate 
compilation
of each object followed by a link call.

Single compiler call is easier for small projects, but I worry about compile times for larger projects...

Yes, I'm writing a build system for D (that will be pretty damn good, I think, it has some interesting new concepts), and compiling each source separately to an object, and then linking everything will allow easily to make the build parallel, dividing the sources to compile in various threads. Or the compiler already does that if I pass all source files in one call?

I find that the current front-end (common to dmd, gdc, ldc) tends to work better when passed multiple source files at once. It tends to be faster, presumably because it only has to parse commonly-imported files once, and also produces smaller object/executable sizes -- maybe due to fewer duplicated template instantiations? I'm not sure of the exact reasons, but this behaviour appears consistent throughout dmd and gdc, and I presume also ldc (I didn't test that). So based on this, I'd lean toward compiling multiple files at once.

Yeah, I did read about this somewhere.
 However, in very large project, clearly this won't work very 
 well. If it
 takes half an hour to build the entire system, it makes the 
 code -
 compile - test cycle very slow, which reduces programmer 
 productivity.

 So perhaps one possible middle ground would be to link packages
 separately, but compile all the sources within a single package 
 at once.
 Presumably, if the project is properly organized, recompiling a 
 single
 package won't take too long, and has the perk of optimizing for 
 size
 within packages. This will probably also map to SCons easily, 
 since
 SCons builds per-directory.


 T

Well, the idea is good. Small projects usually don't have many packages, so there will be just a few compiler calls. And compiling files concurrently will only have a meaningful effect if the project is large, and a large project will have a lot of packages.

Maybe add an option to choose between compiling all sources at once, per package, or per source. For example, in development and debug builds the compilation is per file or per package, but in release builds all sources are compiled at once, or various packages at once. This way release builds will take advantage of this behaviour of the front end, but developers won't have productivity issues. And, of course, the behaviour will not be fixed; the devs using the build system will choose it.
Dec 15 2012
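For illustration (an editorial sketch, not from the thread), one way a build tool could expose the granularity choice described above as a configuration knob; the module names are hypothetical and the directory is taken as the package boundary.

    // Sketch: planning compiler invocations by the chosen granularity.
    import std.algorithm : map;
    import std.array : array, join;
    import std.path : dirName;
    import std.stdio : writeln;

    enum Granularity { perFile, perPackage, allAtOnce }

    // Group source files into one string[] per compiler invocation.
    string[][] planInvocations(string[] sources, Granularity mode)
    {
        final switch (mode)
        {
            case Granularity.perFile:
            {
                return sources.map!(s => [s]).array;     // one dmd call per file
            }
            case Granularity.perPackage:
            {
                string[][string] byPkg;
                foreach (s; sources)
                    byPkg[dirName(s)] ~= s;              // directory == package
                return byPkg.values;                     // one dmd call per package
            }
            case Granularity.allAtOnce:
            {
                return [sources];                        // a single dmd call
            }
        }
        assert(0);   // unreachable: final switch covers all modes
    }

    void main()
    {
        auto sources = ["app/main.d", "app/config.d", "net/http.d"];
        // Debug builds might use perFile or perPackage; release builds allAtOnce.
        foreach (group; planInvocations(sources, Granularity.perPackage))
            writeln("dmd -c ", group.join(" "));
    }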
prev sibling next sibling parent "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Sat, Dec 15, 2012 at 07:30:52PM +0100, RenatoUtsch wrote:
 On Saturday, 15 December 2012 at 18:00:58 UTC, H. S. Teoh wrote:

So perhaps one possible middle ground would be to link packages
separately, but compile all the sources within a single package at
once.  Presumably, if the project is properly organized, recompiling
a single package won't take too long, and has the perk of optimizing
for size within packages. This will probably also map to SCons
easily, since SCons builds per-directory.


 Well, the idea is good. Small projects usually don't have much
 packages, so there will be just a few compiler calls. And compiling
 files concurrently will only have a meaningful efect if the project
 is large, and a large project will have a lot of packages.

Yes, that's the idea behind it.
 Maybe adding an option to choose between compiling all sources at
 once, per package, or per source. For example, in development and
 debug builds the compilation is per file or package, but in release
 builds all sources are compiled at once, or various packages at once.
 
 This way release builds will take advantage of this behavior that
 the frontend has, but developers won't have productivity issues.
 And, of couse, the behaviour will not be fixed, the devs that are
 using the build system will choose that.

I forgot to mention also that passing too many source files to the compiler may sometimes cause memory consumption issues, as the compiler has to hold everything in memory. This may not be practical for very large projects, where you can't fit everything into RAM.

T

-- 
Stop staring at me like that! You'll offend... no, you'll hurt your eyes!
Dec 15 2012
prev sibling next sibling parent "RenatoUtsch" <renatoutsch gmail.com> writes:
On Saturday, 15 December 2012 at 18:44:35 UTC, H. S. Teoh wrote:
 On Sat, Dec 15, 2012 at 07:30:52PM +0100, RenatoUtsch wrote:
 On Saturday, 15 December 2012 at 18:00:58 UTC, H. S. Teoh 
 wrote:

So perhaps one possible middle ground would be to link 
packages
separately, but compile all the sources within a single 
package at
once.  Presumably, if the project is properly organized, 
recompiling
a single package won't take too long, and has the perk of 
optimizing
for size within packages. This will probably also map to SCons
easily, since SCons builds per-directory.


 Well, the idea is good. Small projects usually don't have much
 packages, so there will be just a few compiler calls. And 
 compiling
 files concurrently will only have a meaningful efect if the 
 project
 is large, and a large project will have a lot of packages.

Yes, that's the idea behind it.
 Maybe adding an option to choose between compiling all sources 
 at
 once, per package, or per source. For example, in development 
 and
 debug builds the compilation is per file or package, but in 
 release
 builds all sources are compiled at once, or various packages 
 at once.
 
 This way release builds will take advantage of this behavior 
 that
 the frontend has, but developers won't have productivity 
 issues.
 And, of couse, the behaviour will not be fixed, the devs that 
 are
 using the build system will choose that.

I forgot to mention also, that passing too many source files to the compiler may sometimes cause memory consumption issues, as the compiler has to hold everything in memory. This may not be practical for very large project, where you can't fit everything into RAM. T

Well, so compiling by packages seems to be the best approach. When I return home I will do some tests to see what I can do.

-- 
Renato
Dec 15 2012
prev sibling parent Walter Bright <newshound2 digitalmars.com> writes:
On 12/15/2012 9:31 AM, RenatoUtsch wrote:
 Yes, I'm writing a build system for D (that will be pretty damn good, I think,
 it has some interesting new concepts), and compiling each source separately to
 an object, and then linking everything will allow easily to make the build
 parallel, dividing the sources to compile in various threads. Or the compiler
 already does that if I pass all source files in one call?

The compiler does a little multithreading, but not enough to make a difference. I've certainly thought about various schemes to parallelize it, though.
Dec 15 2012
prev sibling next sibling parent "Jakob Bornecrantz" <wallbraker gmail.com> writes:
On Saturday, 15 December 2012 at 17:05:59 UTC, Peter Alexander 
wrote:
 On Saturday, 15 December 2012 at 16:55:39 UTC, Russel Winder 
 wrote:
 A quick straw poll.  Do people prefer to have all sources 
 compiled in a
 single compiler call, or (more like C++) separate compilation 
 of each
 object followed by a link call.

Single compiler call is easier for small projects, but I worry about compile times for larger projects...

As evidenced by Phobos and my own project[1], for larger projects multiple concurrent calls are the only way to go. Rebuilding everything does take a bit, but with a bit of thought behind the layout of the project, things work much faster when working on specific areas.

Cheers, Jakob.

[1] http://github.com/Charged/Miners
Dec 15 2012
prev sibling next sibling parent "RenatoUtsch" <renatoutsch gmail.com> writes:
On Saturday, 15 December 2012 at 17:05:59 UTC, Peter Alexander 
wrote:
 On Saturday, 15 December 2012 at 16:55:39 UTC, Russel Winder 
 wrote:
 A quick straw poll.  Do people prefer to have all sources 
 compiled in a
 single compiler call, or (more like C++) separate compilation 
 of each
 object followed by a link call.

Single compiler call is easier for small projects, but I worry about compile times for larger projects...

Yes, I'm writing a build system for D (one that will be pretty damn good, I think; it has some interesting new concepts), and compiling each source separately to an object and then linking everything will make it easy to parallelize the build, dividing the sources to compile among various threads. Or does the compiler already do that if I pass all source files in one call?

-- 
Renato Utsch
Dec 15 2012
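An editorial sketch (not from the thread) of the scheme described above: compiling each source file to its own object in parallel with std.parallelism, then linking once. The file names are made up, and dmd is assumed to be on the PATH.

    // Sketch: per-file compilation spread over worker threads, then one link.
    import std.parallelism : parallel;
    import std.path : setExtension;
    import std.process : spawnProcess, wait;
    import std.stdio : writeln;

    void main()
    {
        auto sources = ["a.d", "b.d", "c.d"];            // hypothetical sources
        auto objects = new string[sources.length];

        // One compiler invocation per file, distributed over a thread pool.
        foreach (i, ref src; parallel(sources))
        {
            auto obj = setExtension(src, "o");
            objects[i] = obj;
            wait(spawnProcess(["dmd", "-c", "-of" ~ obj, src]));
        }

        // Link once every object is ready.
        wait(spawnProcess(["dmd", "-ofapp"] ~ objects));
        writeln("linked app from ", objects.length, " objects");
    }

This trades the front end's ability to share parsing of commonly-imported modules (noted by H. S. Teoh above) for process-level parallelism.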
prev sibling next sibling parent "Peter Alexander" <peter.alexander.au gmail.com> writes:
On Saturday, 15 December 2012 at 17:27:38 UTC, Jakob Bornecrantz 
wrote:
 On Saturday, 15 December 2012 at 17:05:59 UTC, Peter Alexander
 Single compiler call is easier for small projects, but I worry 
 about compile times for larger projects...

As evident by Phobos and my own project[1], for larger projects multiple concurrent calls is the only way to go. Rebuilding everything does take a bit, but a bit of thought behind the layout of the project and things work much faster when working on specific areas.

Phobos is only around 200 kLOC. I'm worried about the really large projects (multi-million lines of code).
Dec 15 2012
prev sibling next sibling parent "jerro" <a a.com> writes:
On Saturday, 15 December 2012 at 17:31:19 UTC, RenatoUtsch wrote:
 On Saturday, 15 December 2012 at 17:05:59 UTC, Peter Alexander 
 wrote:
 On Saturday, 15 December 2012 at 16:55:39 UTC, Russel Winder 
 wrote:
 A quick straw poll.  Do people prefer to have all sources 
 compiled in a
 single compiler call, or (more like C++) separate compilation 
 of each
 object followed by a link call.

Single compiler call is easier for small projects, but I worry about compile times for larger projects...

Yes, I'm writing a build system for D (that will be pretty damn good, I think, it has some interesting new concepts)

I took a look at your GitHub project. There isn't any code yet, but I like the concept. I was actually planning to do something similar, but since you are already doing it, I think my time would be better spent contributing to your project. Will there be some publicly available code in the near future?
Dec 15 2012
prev sibling next sibling parent "RenatoUtsch" <renatoutsch gmail.com> writes:
On Saturday, 15 December 2012 at 18:24:50 UTC, jerro wrote:
 On Saturday, 15 December 2012 at 17:31:19 UTC, RenatoUtsch 
 wrote:
 On Saturday, 15 December 2012 at 17:05:59 UTC, Peter Alexander 
 wrote:
 On Saturday, 15 December 2012 at 16:55:39 UTC, Russel Winder 
 wrote:
 A quick straw poll.  Do people prefer to have all sources 
 compiled in a
 single compiler call, or (more like C++) separate 
 compilation of each
 object followed by a link call.

Single compiler call is easier for small projects, but I worry about compile times for larger projects...

Yes, I'm writing a build system for D (that will be pretty damn good, I think, it has some interesting new concepts)

I took a look at your github project, there isn't any code yet, but I like the concept. I was actually planing to do something similar, but since you are already doing it, I think my time would be better spent contributing to your project. Will there be some publicly available code in the near future?

I expect to release a first alpha version in about 15~30 days, maybe less; it depends on how much time I have during the rest of this month.
Dec 15 2012
prev sibling next sibling parent Walter Bright <newshound2 digitalmars.com> writes:
On 12/15/2012 8:55 AM, Russel Winder wrote:
 A quick straw poll.  Do people prefer to have all sources compiled in a
 single compiler call, or (more like C++) separate compilation of each
 object followed by a link call.

Both are needed, and are suitable for different purposes. It's like asking if you prefer a standard or a Phillips screwdriver.
Dec 15 2012
prev sibling next sibling parent "David Nadlinger" <see klickverbot.at> writes:
On Saturday, 15 December 2012 at 17:02:08 UTC, Andrei 
Alexandrescu wrote:
 In phobos we use a single call for building the library. Then 
 (at least on Posix) we use multiple calls for running unittests.

This highlights the problem with giving a single answer to the question: building a large project in one call is often impractical. It only works for Phobos library builds because many templates don't even get instantiated.

David
Dec 15 2012
prev sibling next sibling parent Paulo Pinto <pjmlp progtools.org> writes:
On 15.12.2012 17:55, Russel Winder wrote:
 A quick straw poll.  Do people prefer to have all sources compiled in a
 single compiler call, or (more like C++) separate compilation of each
 object followed by a link call.

 Thanks.

I prefer to compile by package, as in any module-supporting language.

-- 
Paulo
Dec 16 2012
prev sibling next sibling parent "deadalnix" <deadalnix gmail.com> writes:
On Saturday, 15 December 2012 at 16:55:39 UTC, Russel Winder 
wrote:
 A quick straw poll.  Do people prefer to have all sources 
 compiled in a
 single compiler call, or (more like C++) separate compilation 
 of each
 object followed by a link call.

 Thanks.

I currently use full compilation. I have no choice, because the compiler isn't emitting all the symbols I need and I get linking errors when doing separate compilation. Compiling my code requires more than 2.5 GB of RAM now and is quite slow.

To avoid multiple instantiations of the same template by the compiler, I guess the best option for me would be to compile on a per-package basis.
Dec 16 2012
prev sibling next sibling parent reply Michel Fortin <michel.fortin michelf.ca> writes:
On 2012-12-17 03:18:45 +0000, Walter Bright <newshound2 digitalmars.com> said:

 Whether the file format is text or binary does not make any fundamental 
 difference.

I too expect the difference in performance to be negligible in binary form if you maintain the same structure. But if you're translating it to another format you can improve the structure to make it faster.

If the file had a table of contents (TOC) of publicly visible symbols right at the start, you could read that table of contents alone to fill symbol tables while lazy-loading symbol definitions from the file only when needed. Often, most of the file beyond the TOC wouldn't be needed at all. Having to parse and construct the syntax tree for the whole file incurs many memory allocations in the compiler, which you could avoid if the file was structured for lazy loading. With a TOC you have very little to read from disk and very little to allocate in memory, and that'll make compilation faster.

More importantly, if you use only fully-qualified symbol names in the translated form, then you'll be able to load privately imported modules lazily, because they'll only be needed when you need the actual definition of a symbol. (Template instantiation might require loading privately imported modules too.)

And then you could structure it so a whole library could fit in one file, putting all the TOCs at the start of the same file so it loads from disk in a single read operation (or a couple of *sequential* reads).

I'm not sure of the speedup all this would provide, but I'd hazard a guess that it wouldn't be so negligible when compiling a large project incrementally. Implementing any of this in the current front end would be a *lot* of work, however.

-- 
Michel Fortin
michel.fortin michelf.ca
http://michelf.ca/
Dec 16 2012
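To make the idea concrete, here is an editorial sketch (not from the thread) of what such a lazily-loadable module file might look like. None of this is an existing dmd/gdc/ldc format; every name and field below is invented purely for illustration.

    // Hypothetical on-disk layout for a module file with an up-front TOC.
    import std.stdio : File;

    struct TOCEntry
    {
        string fqn;      // fully-qualified symbol name, e.g. "foo.bar.baz"
        uint   kind;     // function, type, template, ...
        ulong  offset;   // where the serialized definition starts in the file
        ulong  length;   // size of the serialized definition in bytes
    }

    struct ModuleTOC
    {
        string     moduleName;
        string[]   publicImports;  // needed up front to fill symbol tables
        TOCEntry[] symbols;        // read in one sequential pass at load time
    }

    // A consumer reads only the TOC to populate its symbol table, and seeks to
    // a definition later, only if that symbol is actually needed.
    ubyte[] loadDefinition(File f, TOCEntry entry)
    {
        auto buf = new ubyte[cast(size_t) entry.length];
        f.seek(cast(long) entry.offset);
        f.rawRead(buf);
        return buf;
    }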
next sibling parent reply Paulo Pinto <pjmlp progtools.org> writes:
On 17.12.2012 21:09, foobar wrote:
 On Monday, 17 December 2012 at 04:49:46 UTC, Michel Fortin wrote:
 On 2012-12-17 03:18:45 +0000, Walter Bright
 <newshound2 digitalmars.com> said:

 Whether the file format is text or binary does not make any
 fundamental difference.

I too expect the difference in performance to be negligible in binary form if you maintain the same structure. But if you're translating it to another format you can improve the structure to make it faster. If the file had a table of contents (TOC) of publicly visible symbols right at the start, you could read that table of content alone to fill symbol tables while lazy-loading symbol definitions from the file only when needed. Often, most of the file beyond the TOC wouldn't be needed at all. Having to parse and construct the syntax tree for the whole file incurs many memory allocations in the compiler, which you could avoid if the file was structured for lazy-loading. With a TOC you have very little to read from disk and very little to allocate in memory and that'll make compilation faster. More importantly, if you use only fully-qualified symbol names in the translated form, then you'll be able to load lazily privately imported modules because they'll only be needed when you need the actual definition of a symbol. (Template instantiation might require loading privately imported modules too.) And then you could structure it so a whole library could fit in one file, putting all the TOCs at the start of the same file so it loads from disk in a single read operation (or a couple of *sequential* reads). I'm not sure of the speedup all this would provide, but I'd hazard a guess that it wouldn't be so negligible when compiling a large project incrementally. Implementing any of this in the current front end would be a *lot* of work however.

Precisely. That is the correct solution, and it is also how [Turbo?] Pascal units (== libs) were implemented *decades ago*. I'd also like to emphasize the importance of using a *single* encapsulated file. This prevents the synchronization hazards that D inherited from the broken C/C++ model.

I really miss it, but at least it has been picked up by Go as well. I still find it strange that many C and C++ developers are unaware that we have had modules since the early '80s.

-- 
Paulo
Dec 17 2012
parent reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
12/18/2012 12:34 AM, Paulo Pinto wrote:
 On 17.12.2012 21:09, foobar wrote:
 On Monday, 17 December 2012 at 04:49:46 UTC, Michel Fortin wrote:
 On 2012-12-17 03:18:45 +0000, Walter Bright
 <newshound2 digitalmars.com> said:

 Whether the file format is text or binary does not make any
 fundamental difference.

I too expect the difference in performance to be negligible in binary form if you maintain the same structure. But if you're translating it to another format you can improve the structure to make it faster. If the file had a table of contents (TOC) of publicly visible symbols right at the start, you could read that table of content alone to fill symbol tables while lazy-loading symbol definitions from the file only when needed. Often, most of the file beyond the TOC wouldn't be needed at all. Having to parse and construct the syntax tree for the whole file incurs many memory allocations in the compiler, which you could avoid if the file was structured for lazy-loading. With a TOC you have very little to read from disk and very little to allocate in memory and that'll make compilation faster. More importantly, if you use only fully-qualified symbol names in the translated form, then you'll be able to load lazily privately imported modules because they'll only be needed when you need the actual definition of a symbol. (Template instantiation might require loading privately imported modules too.) And then you could structure it so a whole library could fit in one file, putting all the TOCs at the start of the same file so it loads from disk in a single read operation (or a couple of *sequential* reads). I'm not sure of the speedup all this would provide, but I'd hazard a guess that it wouldn't be so negligible when compiling a large project incrementally. Implementing any of this in the current front end would be a *lot* of work however.

Precisely. That is the correct solution and is also how [turbo?] pascal units (==libs) where implemented *decades ago*. I'd like to also emphasize the importance of using a *single* encapsulated file. This prevents synchronization hazards that D inherited from the broken c/c++ model.


I really loved the way Turbo Pascal units were made. I wish D would go the same route. Object files would then be looked at as a minimal and stupid variation of a module, where symbols are identified by mangling (not by plain metadata, as they would be in a module) and no source for templates is emitted.

AFAIK Delphi is able to produce both DCU and OBJ files (and link with them). Dunno what it does with generics (and which kind these are) and how.
 I really miss it, but at least it has been picked up by Go as well.

 Still find strange that many C and C++ developers are unaware that we
 have modules since the early 80's.

I suspect it's one of the prime examples where the UNIX philosophy of combining a bunch of simple (~ dumb) programs together in place of one more complex program was taken *far* beyond reasonable lengths.

Having a pipeline: preprocessor -> compiler -> (still?) assembler -> linker, where every program tries hard to know nothing about the previous ones (and to be as simple as it possibly can be), is bound to get inadequate results on many fronts:
- efficiency & scalability
- cross-border error reporting and detection (linker errors? errors for expanded macro magic?)
- cross-file manipulations (e.g. optimization; see _how_ LTO is done in GCC)
- multiple problems from a loss of information across the pipeline*

*Semantic info on the interdependency of symbols in a source file is destroyed right before the linker, and thus each .obj file is included as a whole or not at all. Thus all C run-times I've seen _sidestep_ this by writing each function in its own file(!). Even this alone should have been a clear indication.

While simplicity (and correspondingly size in memory) of programs was the king in the 70's, it's well past due. Nowadays I think it's all about getting the highest throughput and more powerful features.
 --
 Paulo

-- Dmitry Olshansky
Dec 17 2012
next sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 12/17/2012 2:08 PM, Dmitry Olshansky wrote:
 I really loved the way Turbo Pascal units were made. I wish D go the same
 route.  Object files would then be looked at as minimal and stupid variation of
 module where symbols are identified by mangling (not plain meta data as (would
 be) in module) and no source for templates is emitted.
 +1

I'll bite. How is this superior to D's system? I have never used TP.
 *Semantic info on interdependency of symbols in a source file is destroyed
right
 before the linker and thus each .obj file is included as a whole or not at all.
 Thus all C run-times I've seen _sidestep_ this by writing each function in its
 own file(!). Even this alone should have been a clear indication.

This is done using COMDATs in C++ and D today.
Dec 17 2012
next sibling parent Paulo Pinto <pjmlp progtools.org> writes:
On 17.12.2012 23:23, Walter Bright wrote:
 On 12/17/2012 2:08 PM, Dmitry Olshansky wrote:
 I really loved the way Turbo Pascal units were made. I wish D go the same
 route.  Object files would then be looked at as minimal and stupid
 variation of
 module where symbols are identified by mangling (not plain meta data
 as (would
 be) in module) and no source for templates is emitted.
 +1

I'll bite. How is this superior to D's system? I have never used TP.

Just explaining the TP way, not doing comparisons.

Each unit (module) is a single file and contains all declarations; there is a separation between the public part and the implementation part. Multiple units can be circularly dependent, as long as they depend on each other only in the implementation part.

The compiler and IDE are able to extract all the necessary information from a unit file, thus making a single file all that is required to keep the compiler happy and avoiding synchronization errors. As in any language using modules, the compiler is pretty fast and uses an included linker optimized for the information stored in the units.

Besides the IDE, there are command-line utilities that dump the public information of a given unit, as a way for programmers to read the available exported API.

Basically not much different from what Java and .NET do, but with a language that by default uses native compilation tooling.

-- 
Paulo
Dec 17 2012
prev sibling next sibling parent reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
12/18/2012 2:23 AM, Walter Bright wrote:
 On 12/17/2012 2:08 PM, Dmitry Olshansky wrote:
 I really loved the way Turbo Pascal units were made. I wish D go the same
 route.  Object files would then be looked at as minimal and stupid
 variation of
 module where symbols are identified by mangling (not plain meta data
 as (would
 be) in module) and no source for templates is emitted.
 +1

I'll bite. How is this superior to D's system? I have never used TP.

One superiority is having a compiled module with its public interface (à la .di, but in some binary format) in one file. Along with the public interface it retains dependency information. Basically, things that describe one entity should not be separated. I can say that the advantage of "grab this single file and you are good to go" should not be underestimated. Thus there is no mess with header files that are out of date and/or object files that fail to link because of that.

Now, back then there were no templates nor CTFE, so the module structure was simple. There were no packages either (they landed in Delphi). I'd expect D to have a format built around modules and packages of these. Then pre-compiled libraries are commonly distributed as a package.

The upside of having our own special format is being able to tailor it to our needs, e.g. storing type info & metadata plainly (no mangle-demangle round trip), having separately compiled (and checked) pure functions, better cross-symbol dependency tracking, etc. To link with C we could still compile all of the D modules into a huge object file (split into a monstrous number of sections).
 *Semantic info on interdependency of symbols in a source file is
 destroyed right
 before the linker and thus each .obj file is included as a whole or
 not at all.
 Thus all C run-times I've seen _sidestep_ this by writing each
 function in its
 own file(!). Even this alone should have been a clear indication.

This is done using COMDATs in C++ and D today.

Well, that's terse. Either way, it looks like a workaround for templates that, during separate compilation, dump identical code into the object files so it can be auto-merged. More than that, the end result is the same: to avoid carrying junk into an app you (or the compiler) still have to put each function in its own section.

Doing separate compilation I always (unless doing LTO or template-heavy code) see either whole or nothing (D included). Most likely the compiler will do it for you only with a special switch. This begs another question: why not eliminate junk by default?

P.S. Looking at M$: http://msdn.microsoft.com/en-us/library/xsa71f43.aspx it needs 2 switches, 1 for the linker and 1 for the compiler. Hilarious.

-- 
Dmitry Olshansky
Dec 18 2012
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 12/18/2012 1:33 AM, Dmitry Olshansky wrote:
 More then that - the end result is the same: to avoid carrying junk into an app
 you (or compiler) still have to put each function in its own section.

That's what COMDATs are.
 Doing separate compilation I always (unless doing LTO or template heavy code)
 see either whole or nothing (D included). Most likely the compiler will do it
 for you only with a special switch.

dmd emits COMDATs for all global functions. You can see this by running dumpobj on the output.
Dec 18 2012
parent reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
12/18/2012 6:51 PM, Walter Bright wrote:
 On 12/18/2012 1:33 AM, Dmitry Olshansky wrote:
 More then that - the end result is the same: to avoid carrying junk
 into an app
 you (or compiler) still have to put each function in its own section.

That's what COMDATs are.

 Doing separate compilation I always (unless doing LTO or template
 heavy code)
 see either whole or nothing (D included). Most likely the compiler
 will do it
 for you only with a special switch.

dmd emits COMDATs for all global functions. You can see this by running dumpobj on the output.

Thanks for carrying on with this Q. I'm using objconv by Agner Fog, as I haven't got dumpobj (guess I'll buy it if need be). From the comments in the dumped asm that mark section boundaries, I do see that all functions are indeed in COMDAT sections. Still, linking these object files and disassembling the output, I see that all of the functions are there intact. I added debug symbols to the build, though - could that make optlink keep the symbols?

After dropping the debug info I can't yet make heads or tails of what's in the exe, but it _seems_ not to include all of the unused code. Gotta investigate on a smaller sample.

-- 
Dmitry Olshansky
Dec 18 2012
next sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 12/18/2012 8:48 AM, Dmitry Olshansky wrote:
 After dropping debug info I can't yet make heads or tails of what's in the exe
 yet but it _seems_ to not include all of the unused code. Gotta investigate on
a
 smaller sample.

Generate a linker .map file (-map to dmd). That'll tell you what's in it.
Dec 18 2012
parent Dmitry Olshansky <dmitry.olsh gmail.com> writes:
12/18/2012 9:15 PM, Walter Bright wrote:
 On 12/18/2012 8:48 AM, Dmitry Olshansky wrote:
 After dropping debug info I can't yet make heads or tails of what's in
 the exe
 yet but it _seems_ to not include all of the unused code. Gotta
 investigate on a
 smaller sample.

Generate a linker .map file (-map to dmd). That'll tell you what's in it.

Still, it tells only half the story - what symbols are there (and a lot of them shouldn't have been). Now the most important part to figure out is _why_, given that almost everything is templates and not instantiated (and thus, thank god, not present). Still, quite a few templates and certain normal functions made it in without ever being called. I'm sure they are not called, because I just imported the module; adding trace prints to the functions in question shows nothing on screen.

I tried running the linker with -xref and I see that the stuff I don't expect to land in the .exe looks either like this:

    Symbol                                                 Defined                       Referenced
    immutable(unicode_tables.SetEntry!(ushort).SetEntry) unicode_tables.unicodeCased    unicode_tables

(meaning that it's not referenced anywhere yet is present, but I guess unreferenced global data is not stripped away), or, for functions, like this:

    dchar uni.toUpper(dchar)                               uni                           uni
    const(@trusted dchar function(uint)) uni.Grapheme.opIndex   uni                      uni
    ...

meaning that it's defined & referenced in the same module only (not the one with the empty main). Yet it's getting pulled in... I'm certain that at least toUpper is not called anywhere in the empty module (nor in module ctors, as I have none).

Can you recommend any steps to see the web of symbols that eventually pulls them in? Peeking at the dependency chain (not file-grained but symbol-grained) in any form would be awesome.

-- 
Dmitry Olshansky
Dec 19 2012
prev sibling parent reply Jacob Carlborg <doob me.com> writes:
On 2012-12-18 17:48, Dmitry Olshansky wrote:

 I'm using objconv by Agner Fog as I haven't got dumpobj (guess I'll buy
 it if need be).

dumpobj is included in the DMD release, at least on Mac OS X.

-- 
/Jacob Carlborg
Dec 18 2012
parent reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
12/19/2012 12:01 AM, Jacob Carlborg wrote:
 On 2012-12-18 17:48, Dmitry Olshansky wrote:

 I'm using objconv by Agner Fog as I haven't got dumpobj (guess I'll buy
 it if need be).

dumpobj is included in the DMD release, at least on Mac OS X.

-- Dmitry Olshansky
Dec 18 2012
parent reply Paulo Pinto <pjmlp progtools.org> writes:
Am 18.12.2012 21:09, schrieb Dmitry Olshansky:
 12/19/2012 12:01 AM, Jacob Carlborg wrote:
 On 2012-12-18 17:48, Dmitry Olshansky wrote:

 I'm using objconv by Agner Fog as I haven't got dumpobj (guess I'll buy
 it if need be).

dumpobj is included in the DMD release, at least on Mac OS X.


dumpbin
Dec 18 2012
parent Dmitry Olshansky <dmitry.olsh gmail.com> writes:
12/19/2012 12:15 AM, Paulo Pinto wrote:
 On 18.12.2012 21:09, Dmitry Olshansky wrote:
 12/19/2012 12:01 AM, Jacob Carlborg wrote:
 On 2012-12-18 17:48, Dmitry Olshansky wrote:

 I'm using objconv by Agner Fog as I haven't got dumpobj (guess I'll buy
 it if need be).

dumpobj is included in the DMD release, at least on Mac OS X.


dumpbin

Only COFF, I guess ;)

-- 
Dmitry Olshansky
Dec 19 2012
prev sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 12/18/2012 3:43 AM, foobar wrote:
 Honest question - If D already has all the semantic info in COMDAT sections,

It doesn't. COMDATs are object file sections. They do not contain type info, for example.
   * provide a byte-code solution to support the portability case. e.g Java
 byte-code or Google's pNaCL solution that relies on LLVM bit-code.

There is no advantage to bytecodes. Putting them in a zip file does not make them produce better results.
Dec 18 2012
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 12/18/2012 7:51 AM, H. S. Teoh wrote:
 An idea occurred to me while reading this. What if, when compiling a
 module, say, the compiler not only emits object code, but also
 information like which functions are implied to be strongly pure, weakly
 pure,  safe, etc., as well as some kind of symbol dependency
 information. Basically, any derived information that isn't immediately
 obvious from the code is saved.

 Then when importing the module, the compiler doesn't have to re-derive
 all of this information, but it is immediately available.

 One can also include information like whether a function actually throws
 an exception (regardless of whether it's marked nothrow), which
 exception(s) it throws, etc.. This may open up the possibility of doing
 some things with the language that are currently infeasible, regardless
 of the obfuscation issue.

This is a binary import. It offers negligible advantages over .di files.
Dec 18 2012
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 12/18/2012 9:42 AM, H. S. Teoh wrote:
 I was thinking more along the lines of things like fully automatic
 purity, safety, exception inference. For example, every function body
 eventually has to be processed by the compiler, so if a particular
 function is inferred to throw exception X, for example, then when its
 callers are compiled, this fact can be propagated to them. To do this
 for the whole program might be infeasible due to the sheer size of
 things, but if a module contains, for each function exposed by the API,
 a list of all thrown exceptions, then when the module is imported this
 information is available up-front and can be propagated further up the
 call chain. Same thing goes with purity and  safe.

 This may even allow us to make pure/ safe/nothrow fully automated so
 that you don't have to explicitly state them (except when you want the
 compiler to verify that what you wrote is actually pure, safe, etc.).

The trouble with this is the separate compilation model. If the attributes are not in the function signature, then the function implementation can change without recompiling the user of that function. Changing the inferred attributes then will subtly break your build.

Inferred attributes only work when the implementation source is guaranteed to be available, such as with template functions. Having a binary format doesn't change this.
Dec 18 2012
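An editorial example of the distinction Walter draws above: a template function's attributes are inferred because its source is necessarily available at the point of use, while an ordinary function that may be compiled separately must state them in its signature. The function names are made up for illustration.

    // Explicit attributes form part of the signature callers are compiled against.
    int square(int x) pure nothrow @safe
    {
        return x * x;
    }

    // Template: the importer always sees the body, so purity/@safe/nothrow are
    // inferred from it rather than declared.
    T cube(T)(T x)
    {
        return x * x * x;
    }

    void caller() pure nothrow @safe
    {
        auto a = square(3); // OK: the signature promises pure nothrow @safe
        auto b = cube(3);   // OK: inference derives the same from the body
    }

    void main() { caller(); }

If square's body later became impure, the explicit signature would make square itself fail to compile, instead of silently changing the attributes its separately compiled callers were built against.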
parent Walter Bright <newshound2 digitalmars.com> writes:
On 12/18/2012 11:23 AM, H. S. Teoh wrote:
 On Tue, Dec 18, 2012 at 09:55:57AM -0800, Walter Bright wrote:
 On 12/18/2012 9:42 AM, H. S. Teoh wrote:
 I was thinking more along the lines of things like fully automatic
 purity, safety, exception inference. For example, every function body
 eventually has to be processed by the compiler, so if a particular
 function is inferred to throw exception X, for example, then when its
 callers are compiled, this fact can be propagated to them. To do this
 for the whole program might be infeasible due to the sheer size of
 things, but if a module contains, for each function exposed by the
 API, a list of all thrown exceptions, then when the module is
 imported this information is available up-front and can be propagated
 further up the call chain. Same thing goes with purity and  safe.

 This may even allow us to make pure/ safe/nothrow fully automated so
 that you don't have to explicitly state them (except when you want
 the compiler to verify that what you wrote is actually pure, safe,
 etc.).

The trouble with this is the separate compilation model. If the attributes are not in the function signature, then the function implementation can change without recompiling the user of that function. Changing the inferred attributes then will subtly break your build.

And here's a reason for using an intermediate format (whether it's bytecote or just plain serialized AST or something else, is irrelevant). Say we put the precompiled module in a zip file of some sort. If the function attributes change, so does the zip file. So if proper make dependencies are setup, this will automatically trigger the recompilation of whoever uses the module.

Relying on a makefile being correct does not solve it.
 Inferred attributes only work when the implementation source is
 guaranteed to be available, such as with template functions.

 Having a binary format doesn't change this.

Actually, this doesn't depend on the format being binary. You can save everything in plain text format and it will still work. In fact, there might be reasons to want a text format instead of binary, since then one could look at the compiler output to find out what the inferred attributes of a particular declaration are without needing to add compiler querying features.

The "plain text format" that works is called D source code :-)
Dec 18 2012
prev sibling parent Jacob Carlborg <doob me.com> writes:
On 2012-12-18 01:13, H. S. Teoh wrote:

 The problem is not so much the structure preprocessor -> compiler ->
 assembler -> linker; the problem is that these logical stages have been
 arbitrarily assigned to individual processes residing in their own
 address space, communicating via files (or pipes, whatever it may be).

 The fact that they are separate processes is in itself not that big of a
 problem, but the fact that they reside in their own address space is a
 big problem, because you cannot pass any information down the chain
 except through rudimentary OS interfaces like files and pipes. Even that
 wouldn't have been so bad, if it weren't for the fact that user
 interface (in the form of text input / object file format) has also been
 conflated with program interface (the compiler has to produce the input
 to the assembler, in *text*, and the assembler has to produce object
 files that do not encode any direct dependency information because
 that's the standard file format the linker expects).

 Now consider if we keep the same stages, but each stage is not a
 separate program but a *library*. The code then might look, in greatly
 simplified form, something like this:

 	import libdmd.compiler;
 	import libdmd.assembler;
 	import libdmd.linker;

 	void main(string[] args) {
 		// typeof(asmCode) is some arbitrarily complex data
 		// structure encoding assembly code, inter-module
 		// dependencies, etc.
 		auto asmCode = compiler.lex(args)
 			.parse()
 			.optimize()
 			.codegen();

 		// Note: no stupid redundant convert to string, parse,
 		// convert back to internal representation.
 		auto objectCode = assembler.assemble(asmCode);

 		// Note: linker has direct access to dependency info,
 		// etc., carried over from asmCode -> objectCode.
 		auto executable = linker.link(objectCode);
 		File output(outfile, "w");
 		executable.generate(output);
 	}

 Note that the types asmCode, objectCode, executable, are arbitrarily
 complex, and may contain lazy-evaluated data structure, references to
 on-disk temporary storage (for large projects you can't hold everything
 in RAM), etc.. Dependency information in asmCode is propagated to
 objectCode, as necessary. The linker has full access to all info the
 compiler has access to, and can perform inter-module optimization, etc.,
 by accessing information available to the *compiler* front-end, not just
 some crippled object file format.

 The root of the current nonsense is that perfectly-fine data structures
 are arbitrarily required to be flattened into some kind of intermediate
 form, written to some file (or sent down some pipe), often with loss of
 information, then read from the other end, interpreted, and
 reconstituted into other data structures (with incomplete info), then
 processed. In many cases, information that didn't make it through the
 channel has to be reconstructed (often imperfectly), and then used. Most
 of these steps are redundant. If the compiler data structures were
 already directly available in the first place, none of this baroque
 dance is necessary.

I couldn't agree more.

-- 
/Jacob Carlborg
Dec 17 2012
prev sibling parent "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Tue, Dec 18, 2012 at 08:12:51AM -0800, Walter Bright wrote:
 On 12/18/2012 7:51 AM, H. S. Teoh wrote:
An idea occurred to me while reading this. What if, when compiling a
module, say, the compiler not only emits object code, but also
information like which functions are implied to be strongly pure,
weakly pure,  safe, etc., as well as some kind of symbol dependency
information. Basically, any derived information that isn't
immediately obvious from the code is saved.

Then when importing the module, the compiler doesn't have to
re-derive all of this information, but it is immediately available.

One can also include information like whether a function actually
throws an exception (regardless of whether it's marked nothrow),
which exception(s) it throws, etc.. This may open up the possibility
of doing some things with the language that are currently infeasible,
regardless of the obfuscation issue.

This is a binary import. It offers negligible advantages over .di files.

I was thinking more along the lines of things like fully automatic purity, safety, and exception inference. For example, every function body eventually has to be processed by the compiler, so if a particular function is inferred to throw exception X, for example, then when its callers are compiled, this fact can be propagated to them. To do this for the whole program might be infeasible due to the sheer size of things, but if a module contains, for each function exposed by the API, a list of all thrown exceptions, then when the module is imported this information is available up front and can be propagated further up the call chain. The same thing goes for purity and @safe.

This may even allow us to make pure/@safe/nothrow fully automated, so that you don't have to explicitly state them (except when you want the compiler to verify that what you wrote is actually pure, safe, etc.).

T

-- 
The slower you go, the farther you'll get.
Dec 18 2012
prev sibling next sibling parent reply "foobar" <foo bar.com> writes:
On Monday, 17 December 2012 at 04:49:46 UTC, Michel Fortin wrote:
 On 2012-12-17 03:18:45 +0000, Walter Bright 
 <newshound2 digitalmars.com> said:

 Whether the file format is text or binary does not make any 
 fundamental difference.

I too expect the difference in performance to be negligible in binary form if you maintain the same structure. But if you're translating it to another format you can improve the structure to make it faster. If the file had a table of contents (TOC) of publicly visible symbols right at the start, you could read that table of content alone to fill symbol tables while lazy-loading symbol definitions from the file only when needed. Often, most of the file beyond the TOC wouldn't be needed at all. Having to parse and construct the syntax tree for the whole file incurs many memory allocations in the compiler, which you could avoid if the file was structured for lazy-loading. With a TOC you have very little to read from disk and very little to allocate in memory and that'll make compilation faster. More importantly, if you use only fully-qualified symbol names in the translated form, then you'll be able to load lazily privately imported modules because they'll only be needed when you need the actual definition of a symbol. (Template instantiation might require loading privately imported modules too.) And then you could structure it so a whole library could fit in one file, putting all the TOCs at the start of the same file so it loads from disk in a single read operation (or a couple of *sequential* reads). I'm not sure of the speedup all this would provide, but I'd hazard a guess that it wouldn't be so negligible when compiling a large project incrementally. Implementing any of this in the current front end would be a *lot* of work however.

Precisely. That is the correct solution, and it is also how [Turbo?] Pascal units (== libs) were implemented *decades ago*. I'd also like to emphasize the importance of using a *single* encapsulated file. This prevents the synchronization hazards that D inherited from the broken C/C++ model.
Dec 17 2012
parent "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Tue, Dec 18, 2012 at 09:55:57AM -0800, Walter Bright wrote:
 On 12/18/2012 9:42 AM, H. S. Teoh wrote:
I was thinking more along the lines of things like fully automatic
purity, safety, exception inference. For example, every function body
eventually has to be processed by the compiler, so if a particular
function is inferred to throw exception X, for example, then when its
callers are compiled, this fact can be propagated to them. To do this
for the whole program might be infeasible due to the sheer size of
things, but if a module contains, for each function exposed by the
API, a list of all thrown exceptions, then when the module is
imported this information is available up-front and can be propagated
further up the call chain. Same thing goes with purity and  safe.

This may even allow us to make pure/ safe/nothrow fully automated so
that you don't have to explicitly state them (except when you want
the compiler to verify that what you wrote is actually pure, safe,
etc.).

The trouble with this is the separate compilation model. If the attributes are not in the function signature, then the function implementation can change without recompiling the user of that function. Changing the inferred attributes then will subtly break your build.

And here's a reason for using an intermediate format (whether it's bytecode, plain serialized AST, or something else is irrelevant). Say we put the precompiled module in a zip file of some sort. If the function attributes change, so does the zip file. So if proper make dependencies are set up, this will automatically trigger the recompilation of whoever uses the module.
 Inferred attributes only work when the implementation source is
 guaranteed to be available, such as with template functions.
 
 Having a binary format doesn't change this.

Actually, this doesn't depend on the format being binary. You can save everything in a plain text format and it will still work. In fact, there might be reasons to want a text format instead of a binary one, since then one could look at the compiler output to find out what the inferred attributes of a particular declaration are, without needing to add compiler-querying features.

T

-- 
Why have vacation when you can work?? -- EC
Dec 18 2012
prev sibling next sibling parent "foobar" <foo bar.com> writes:
On Monday, 17 December 2012 at 22:24:00 UTC, Walter Bright wrote:
 On 12/17/2012 2:08 PM, Dmitry Olshansky wrote:
 I really loved the way Turbo Pascal units were made. I wish D 
 go the same
 route.  Object files would then be looked at as minimal and 
 stupid variation of
 module where symbols are identified by mangling (not plain 
 meta data as (would
 be) in module) and no source for templates is emitted.
 +1

I'll bite. How is this superior to D's system? I have never used TP.
 *Semantic info on interdependency of symbols in a source file 
 is destroyed right
 before the linker and thus each .obj file is included as a 
 whole or not at all.
 Thus all C run-times I've seen _sidestep_ this by writing each 
 function in its
 own file(!). Even this alone should have been a clear 
 indication.

This is done using COMDATs in C++ and D today.

Honest question - if D already has all the semantic info in COMDAT sections, why do we still require additional auxiliary files? Surely a single binary library (lib/so) should be enough to encapsulate a library, without the need to re-parse the source files or additional header files?

You yourself seem to agree that a single zip file is superior to what we currently have, and as an aside, the entire Java community agrees with us - the Java Jar/War/etc. formats are all renamed zip archives.

Regarding the obfuscation and portability issues - the zip file can contain whatever we want. This means it should be possible to tailor the contents to support different use cases:
* provide fat libraries as on OSX - internally store multiple binaries for different architectures; those binary objects are very hard to decompile back to source code, thus answering the obfuscation need.
* provide a byte-code solution to support the portability case, e.g. Java bytecode or Google's PNaCl solution that relies on LLVM bitcode.

Also, there are different work-flows that can be implemented - Java uses JIT to gain efficiency vs. .NET that supports install-time AOT compilation. It basically stores the native executable in a special cache.
Dec 18 2012
prev sibling parent "Paulo Pinto" <pjmlp progtools.org> writes:
On Tuesday, 18 December 2012 at 11:43:18 UTC, foobar wrote:
 On Monday, 17 December 2012 at 22:24:00 UTC, Walter Bright 
 wrote:
 On 12/17/2012 2:08 PM, Dmitry Olshansky wrote:
 I really loved the way Turbo Pascal units were made. I wish D 
 go the same
 route.  Object files would then be looked at as minimal and 
 stupid variation of
 module where symbols are identified by mangling (not plain 
 meta data as (would
 be) in module) and no source for templates is emitted.
 +1

I'll bite. How is this superior to D's system? I have never used TP.
 *Semantic info on interdependency of symbols in a source file 
 is destroyed right
 before the linker and thus each .obj file is included as a 
 whole or not at all.
 Thus all C run-times I've seen _sidestep_ this by writing 
 each function in its
 own file(!). Even this alone should have been a clear 
 indication.

This is done using COMDATs in C++ and D today.

Honest question - If D already has all the semantic info in COMDAT sections, why do we still require additional auxiliary files? Surely, a single binary library (lib/so) should be enough to encapsulate a library without the need to re-parse the source files or additional header files? You yourself seem to agree that a single zip file is superior to what we currently have and as an aside the entire Java community agrees with use - Java Jar/War/etc formats are all renamed zip archives. Regarding the obfuscation and portability issues - the zip file can contain whatever we want. This means it should be possible to tailor the contents to support different use-cases: * provide fat-libraries as in OSX - internally store multiple binaries for different architectures, those binary objects are very hard to decompile back to source code thus answering the obfuscation need. * provide a byte-code solution to support the portability case. e.g Java byte-code or Google's pNaCL solution that relies on LLVM bit-code. Also, there are different work-flows that can be implemented - Java uses JIT to gain efficiency vs. .NET that supports install-time AOT compilation. It basically stores the native executable in a special cache.

In Windows 8 RT, .NET binaries are actually compiled to native code when uploaded to the Windows App Store.
Dec 18 2012