D.gnu - Supporting emulated tls

Johannes Pfau <nospam example.com> writes:
I thought about supporting emulated tls a little. The GCC emutls.c
implementation currently can't work with the gc, as every TLS variable
is allocated individually and therefore we don't have a contiguous
memory region for the gc. I think these are the possible solutions:

* Try to fix GCC's emutls to allocate all tls memory for a module
  (application/shared object) at once. That's the best solution
  and native TLS works this way, but I'm not sure if we can extract
  enough information from the runtime linker to make this work (we
  need at least the combined size of all tls variables).

* Provide a callback in GCC's emutls which is called after every
  allocation. This could call GC.addRange for every variable, but I
  guess adding huge amounts of ranges is slow.

* Make it possible to register a custom allocator for GCC's emutls (not
  sure if possible, as this would have to be set up very early in
  application startup). Then allocate the memory directly from the GC
  (but this memory should only be scanned, not collected) 

* Replace the calls to malloc in emutls.c with a custom, region-based
  memory allocator. (This is not a perfect solution though, as it can
  always happen that we'll need more memory.)



* Do not use GCC's emutls at all, roll a custom solution. This could be
  compatible with / based on dmd's tls emulation for OSX. Most of the
  implementation is in core.thread, all that's necessary is to group
  the tls data into a _tls_data_array and call ___tls_get_addr for
  every tls access. I'm not sure if this can be done in the
  'middle-end' though and it doesn't support shared libraries yet.
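
As a rough illustration of the region-based option from the list above, here is a minimal C sketch. The names tls_region, region_init, region_alloc and region_free are hypothetical and are not part of GCC's emutls.c. It only shows that every variable handed out lands inside one contiguous block, so a single range [base, base + used) could be reported to the GC, and that allocation fails once the block is full.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical per-thread region: one malloc'd block, bump-allocated. */
typedef struct {
    unsigned char *base;  /* start of the contiguous block */
    size_t capacity;      /* total bytes available */
    size_t used;          /* bytes handed out so far */
} tls_region;

/* Create a region of `capacity` bytes. Returns 0 on success, -1 on failure. */
int region_init(tls_region *r, size_t capacity)
{
    r->base = malloc(capacity);
    r->capacity = r->base ? capacity : 0;
    r->used = 0;
    return r->base ? 0 : -1;
}

/* Allocate `size` bytes aligned to `align` (a power of two, assumed not
   to exceed malloc's own alignment). Returns NULL when the region is
   exhausted, which is the weakness of this approach. */
void *region_alloc(tls_region *r, size_t size, size_t align)
{
    size_t start = (r->used + align - 1) & ~(align - 1);
    if (start > r->capacity || size > r->capacity - start)
        return NULL;
    r->used = start + size;
    return r->base + start;
}

/* Release the whole region at once (for example at thread exit). */
void region_free(tls_region *r)
{
    free(r->base);
    r->base = NULL;
    r->capacity = 0;
    r->used = 0;
}
```

A real replacement would also have to decide what to do when region_alloc returns NULL, for example by chaining a second region and registering it as an additional range.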
Mar 18 2012
Iain Buclaw <ibuclaw ubuntu.com> writes:
On 18 March 2012 11:32, Johannes Pfau <nospam example.com> wrote:
 I thought about supporting emulated tls a little. The GCC emutls.c
 implementation currently can't work with the gc, as every TLS variable
 is allocated individually and therefore we don't have a contiguous
 memory region for the gc. I think these are the possible solutions:

 * Try to fix GCC's emutls to allocate all tls memory for a module
   (application/shared object) at once. That's the best solution
   and native TLS works this way, but I'm not sure if we can extract
   enough information from the runtime linker to make this work (we
   need at least the combined size of all tls variables).

 * Provide a callback in GCC's emutls which is called after every
   allocation. This could call GC.addRange for every variable, but I
   guess adding huge amounts of ranges is slow.

Painfully slow.
 * Make it possible to register a custom allocator for GCC's emutls (not
   sure if possible, as this would have to be set up very early in
   application startup). Then allocate the memory directly from the GC
   (but this memory should only be scanned, not collected)

 * Replace the calls to malloc in emutls.c with a custom, region based
   memory allocator. (This is not a perfect solution though, it can
   always happen that we'll need more memory)



 * Do not use GCC's emutls at all, roll a custom solution. This could be
   compatible with / based on dmd's tls emulation for OSX. Most of the
   implementation is in core.thread, all that's necessary is to group
   the tls data into a _tls_data_array and call ___tls_get_addr for
   every tls access. I'm not sure if this can be done in the
   'middle-end' though and it doesn't support shared libraries yet.

If we are going to fix TLS, I'd rather it be done in the most platform-agnostic way possible, if it can be helped. That would also mean scrapping the current implementation on Linux (it just tries to mimic what dmd does, and has corner cases where it doesn't always get it right).

--
Iain Buclaw

*(p < e ? p++ : p) = (c & 0x0f) + '0';
Mar 18 2012
Jacob Carlborg <doob me.com> writes:
On 2012-03-18 19:39, Johannes Pfau wrote:

 You mean getting rid of __tls_beg and __tls_end? I'd also like to
 remove those, but:

__tls_beg and __tls_end are not used on Mac OS X any more:
https://github.com/D-Programming-Language/druntime/commit/73cf2c150665cb17d9365a6e3d6cf144d76312d6
https://github.com/D-Programming-Language/dmd/commit/054c525edba048ad7829dd5ec2d8d9261a6517c3
 TLS is mostly object-format specific (not as much OS specific). The ELF
 implementation lays out the TLS data for a module (module = shared
 library or the application) in a contiguous way. The details are
 described in "ELF Handling For Thread-Local
 Storage" (www.akkadia.org/drepper/tls.pdf).

Mac OS X 10.7+ supports TLS natively, but I don't know where to find documentation about it. It's always possible to look at the source code.

--
/Jacob Carlborg
Mar 18 2012
Jacob Carlborg <doob me.com> writes:
On 2012-03-19 09:17, Johannes Pfau wrote:
 On Sun, 18 Mar 2012 22:06:41 +0100, Jacob Carlborg <doob me.com> wrote:

 Yes, but OSX still uses emulated tls. With the way dmd emulates TLS
 it's possible to remove __tls_beg and __tls_end, but for native TLS
 those symbols are still needed. However, as the runtime linker (ld.so)
 has got the necessary information, it's possible that OSX even offers a
 API to access it. It's just that most C libraries don't provide a way to
 get the TLS segment sizes and the (per thread) addresses of the TLS
 blocks.

The dyld library on Mac OS X provides access to segments and sections. But since the dynamic loader can get this information, shouldn't it be possible for other applications to get it as well? Just walk through the object file and find the necessary segments?
 Mac OS X 10.7 + supports TLS natively. But I don't know where to find
 documentation about it. It always possible to look at the source code.

Then it's probably already supported by GCC/GDC. But having working emulated TLS would be nice for many other architectures. Native TLS is not that widespread.

Yeah, I don't know about GCC though. Apple cares less and less about GCC and is putting all its effort into LLVM and Clang. Ok, I didn't know how widespread native TLS was.

--
/Jacob Carlborg
Mar 19 2012
Alex Rønne Petersen <xtzgzorex gmail.com> writes:
On 18-03-2012 12:32, Johannes Pfau wrote:
 I thought about supporting emulated tls a little. The GCC emutls.c
 implementation currently can't work with the gc, as every TLS variable
 is allocated individually and therefore we don't have a contiguous
 memory region for the gc. I think these are the possible solutions:

 * Try to fix GCC's emutls to allocate all tls memory for a module
    (application/shared object) at once. That's the best solution
    and native TLS works this way, but I'm not sure if we can extract
    enough information from the runtime linker to make this work (we
    need at least the combined size of all tls variables).

 * Provide a callback in GCC's emutls which is called after every
    allocation. This could call GC.addRange for every variable, but I
    guess adding huge amounts of ranges is slow.

We should avoid this if possible, yes. A small root set is desirable.
 * Make it possible to register a custom allocator for GCC's emutls (not
    sure if possible, as this would have to be set up very early in
    application startup). Then allocate the memory directly from the GC
    (but this memory should only be scanned, not collected)

Such an allocator would probably just allocate a decently-sized memory block from libc and add it as a root range (rather than individual word-sized roots). The memory doesn't necessarily have to be allocated with the GC.
 * Replace the calls to malloc in emutls.c with a custom, region based
    memory allocator. (This is not a perfect solution though, it can
    always happen that we'll need more memory)



 * Do not use GCC's emutls at all, roll a custom solution. This could be
    compatible with / based on dmd's tls emulation for OSX. Most of the
    implementation is in core.thread, all that's necessary is to group
    the tls data into a _tls_data_array and call ___tls_get_addr for
    every tls access. I'm not sure if this can be done in the
    'middle-end' though and it doesn't support shared libraries yet.

-- - Alex
Mar 18 2012
Johannes Pfau <nospam example.com> writes:
On Sun, 18 Mar 2012 12:21:51 +0000, Iain Buclaw <ibuclaw ubuntu.com> wrote:

 On 18 March 2012 11:32, Johannes Pfau <nospam example.com> wrote:
 I thought about supporting emulated tls a little. The GCC emutls.c
 implementation currently can't work with the gc, as every TLS
 variable is allocated individually and therefore we don't have a
 contiguous memory region for the gc. I think these are the possible
 solutions:

 * Try to fix GCC's emutls to allocate all tls memory for a module
   (application/shared object) at once. That's the best solution
   and native TLS works this way, but I'm not sure if we can extract
   enough information from the runtime linker to make this work (we
   need at least the combined size of all tls variables).

 * Provide a callback in GCC's emutls which is called after every
   allocation. This could call GC.addRange for every variable, but I
   guess adding huge amounts of ranges is slow.

Painfully slow.
 * Make it possible to register a custom allocator for GCC's emutls
   (not sure if possible, as this would have to be set up very early in
   application startup). Then allocate the memory directly from the GC
   (but this memory should only be scanned, not collected)

 * Replace the calls to malloc in emutls.c with a custom, region
   based memory allocator. (This is not a perfect solution though, it
   can always happen that we'll need more memory)



 * Do not use GCC's emutls at all, roll a custom solution. This
 could be compatible with / based on dmd's tls emulation for OSX.
 Most of the implementation is in core.thread, all that's necessary
 is to group the tls data into a _tls_data_array and call
 ___tls_get_addr for every tls access. I'm not sure if this can be
 done in the 'middle-end' though and it doesn't support shared
 libraries yet.

If we are going to fix TLS, I'd rather it be in the most platform agnostic way possible, if it could be helped. That would mean also scrapping the current implementation on Linux (just tries to mimic what dmd does, and has corner cases where it doesn't always get it right).

You mean getting rid of __tls_beg and __tls_end? I'd also like to remove those, but:

TLS is mostly object-format specific (not as much OS specific). The ELF implementation lays out the TLS data for a module (module = shared library or the application) in a contiguous way. The details are described in "ELF Handling For Thread-Local Storage" (www.akkadia.org/drepper/tls.pdf).

The GC requires the TLS blocks to be contiguous. This is not the case for GCC's emulated TLS, which causes issues there. For native TLS/ELF this requirement is met, but the GC also has to know the start and the size of the TLS sections. Although the runtime linker has this information, there's no standard way to access it. So we could:

* Add a custom extension API to the C libraries. We'd need at least a
  'tls_range dl_get_tls_range(void *handle)' function related to the
  dl* set of functions in the runtime linker, and a
  'tls_range dl_get_tls_range2(struct dl_phdr_info *info)' to be used
  with dl_iterate_phdr. We also need some way to get the tls range for
  the application, 'get_app_tls_range' (although some libcs also return
  the application module in dl_iterate_phdr).

This seems to be the best way, but we'd have to patch every C library and it would take some time until those updated C libraries are widely deployed.

The other solution is to hook directly into each C library's non-public (and maybe non-stable!) API. For example, the structure returned by BSD libc's dl_iterate_phdr and dlopen has these fields:

    int tlsindex;        /* Index in DTV for this module */
    void *tlsinit;       /* Base address of TLS init block */
    size_t tlsinitsize;  /* Size of TLS init block for this module */
    size_t tlssize;      /* Size of TLS block for this module */
    size_t tlsoffset;    /* Offset of static TLS block for this module */
    size_t tlsalign;     /* Alignment of static TLS block */

tlsindex gives us the start address of the TLS for every thread, as long as we know how to compute the TLS address from the TP (thread pointer) and the dtv index (there are basically 2 methods, described in "ELF Handling For Thread-Local Storage"), and tlssize gives us the size. However, there doesn't seem to be a painless way to do this...
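
To make the dl_iterate_phdr idea a bit more concrete, here is a small C sketch that walks all loaded objects and sums the sizes of their PT_TLS segments. It assumes Linux with glibc (dl_iterate_phdr is a nonstandard extension, as noted in this thread); tls_summary, collect_tls, summarize_tls and example_tls_var are made-up names for illustration. It reports only the size of each TLS template, not the per-thread block addresses the GC would additionally need.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE  /* needed for dl_iterate_phdr on glibc */
#endif
#include <assert.h>
#include <link.h>
#include <stddef.h>

/* A thread-local variable so that this program itself has a PT_TLS segment. */
__thread int example_tls_var = 42;

/* Hypothetical summary of what the walk finds. */
typedef struct {
    size_t modules_with_tls;  /* loaded objects that carry a PT_TLS segment */
    size_t total_tls_size;    /* sum of p_memsz over those segments */
} tls_summary;

/* Callback invoked once per loaded object (application or shared library). */
static int collect_tls(struct dl_phdr_info *info, size_t size, void *data)
{
    tls_summary *sum = data;
    (void)size;
    for (int i = 0; i < info->dlpi_phnum; i++) {
        const ElfW(Phdr) *ph = &info->dlpi_phdr[i];
        if (ph->p_type == PT_TLS) {
            sum->modules_with_tls++;
            sum->total_tls_size += ph->p_memsz;
        }
    }
    return 0;  /* returning 0 continues the iteration */
}

/* Walk every loaded object and accumulate its TLS segment size. */
tls_summary summarize_tls(void)
{
    tls_summary sum = {0, 0};
    dl_iterate_phdr(collect_tls, &sum);
    return sum;
}
```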
Mar 18 2012
Jacob Carlborg <doob me.com> writes:
On 2012-03-18 12:32, Johannes Pfau wrote:
 I thought about supporting emulated tls a little. The GCC emutls.c
 implementation currently can't work with the gc, as every TLS variable
 is allocated individually and therefore we don't have a contiguous
 memory region for the gc. I think these are the possible solutions:

Why not use the native TLS implementation when available and roll our own, like DMD on Mac OS X, when none exists?

BTW, I think it would be possible to emulate TLS in a very similar way to how it's implemented natively for ELF.

--
/Jacob Carlborg
Mar 18 2012
Iain Buclaw <ibuclaw ubuntu.com> writes:
On 19 March 2012 08:15, Johannes Pfau <nospam example.com> wrote:
 On Sun, 18 Mar 2012 21:57:57 +0100, Jacob Carlborg <doob me.com> wrote:

 On 2012-03-18 12:32, Johannes Pfau wrote:
 I thought about supporting emulated tls a little. The GCC emutls.c
 implementation currently can't work with the gc, as every TLS
 variable is allocated individually and therefore we don't have a
 contiguous memory region for the gc. I think these are the possible
 solutions:

Why not use the native TLS implementation when available and roll our own, like DMD on Mac OS X, when none exists?

That's what we (mostly) do right now. We have 2 issues:

* Our own, emulated TLS support is implemented in GCC. This means it's
  also used in C, which is great. Also GCC's emulated tls needs
  absolutely no special features in the runtime linker, compile time
  linker or language frontends. It's very portable and works with all
  weird combinations of dynamic libraries, dlopen, etc.
  But it has one quirk: It doesn't allocate TLS memory in a contiguous
  way, every tls variable is allocated using malloc. This means we
  can't pass a range to the GC for the tls variables. So we can't
  support this emutls in the GC.

As far as my thought process goes, the only way (implementable in the GDC frontend) to force a contiguous layout of all TLS symbols is to pack them up ourselves into a struct that is accessible via a single global module-level variable. And in the .ctor section, the module adds this range to the GC.

This should be enough for it to work with shared libraries too, however I'm sure there are quite a few details I am missing here that would block this from working. :)

--
Iain Buclaw

*(p < e ? p++ : p) = (c & 0x0f) + '0';
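
The struct-packing idea can be sketched in C roughly as follows. This is only an illustration under assumptions: module_tls, gc_add_range_stub and the other names are invented here, gc_add_range_stub merely records its arguments instead of calling the real GC.addRange, and the GCC/Clang `constructor` attribute stands in for the module's .ctor entry. Because the whole struct is a single thread-local object, emulated TLS would allocate it as one block, and each variable is reached as base address plus a constant offset.

```c
#include <assert.h>
#include <stddef.h>

/* All thread-local variables of one (hypothetical) module, packed together. */
struct module_tls {
    int    counter;
    double ratio;
    char   name[16];
};

/* One thread-local object: each thread gets its own contiguous copy. */
__thread struct module_tls module_tls_block = { 0, 1.5, "demo" };

/* Stand-in for the GC's range registration; it only records the range. */
const void *registered_base;
size_t registered_size;

void gc_add_range_stub(const void *base, size_t size)
{
    registered_base = base;
    registered_size = size;
}

/* Runs before main, like a module constructor. Note that it registers
   only the block of the thread that runs it; other threads would have
   to register their own copy when they start. */
__attribute__((constructor))
void register_module_tls(void)
{
    gc_add_range_stub(&module_tls_block, sizeof module_tls_block);
}

/* Accessing one variable is "address of the block + constant offset". */
double *ratio_ptr(void)
{
    return (double *)((char *)&module_tls_block
                      + offsetof(struct module_tls, ratio));
}
```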
Mar 19 2012
Jacob Carlborg <doob me.com> writes:
On 2012-03-19 09:15, Johannes Pfau wrote:
 On Sun, 18 Mar 2012 21:57:57 +0100, Jacob Carlborg <doob me.com> wrote:

 On 2012-03-18 12:32, Johannes Pfau wrote:
 I thought about supporting emulated tls a little. The GCC emutls.c
 implementation currently can't work with the gc, as every TLS
 variable is allocated individually and therefore we don't have a
 contiguous memory region for the gc. I think these are the possible
 solutions:

Why not use the native TLS implementation when available and roll our own, like DMD on Mac OS X, when none exists?

That's what we (mostly) do right now. We have 2 issues:

* Our own, emulated TLS support is implemented in GCC. This means it's
  also used in C, which is great. Also GCC's emulated tls needs
  absolutely no special features in the runtime linker, compile time
  linker or language frontends. It's very portable and works with all
  weird combinations of dynamic libraries, dlopen, etc.
  But it has one quirk: It doesn't allocate TLS memory in a contiguous
  way, every tls variable is allocated using malloc. This means we
  can't pass a range to the GC for the tls variables. So we can't
  support this emutls in the GC.

Ok, I see.
 * The other issue with native TLS is that using bracketing with
    __tls_beg and __tls_end has corner cases where it doesn't work. We'd
    need an alternative to locate the TLS memory addresses and TLS sizes.
    But there's no standard or public API to do that.

On Mac OS X they are actually not needed. Don't know about other platforms.
 BTW, I think it would be possible to emulate TLS in a very similar
 way to how it's implemented natively for ELF.

I don't think it's that easy. For example, how would you assign module ids? For native TLS this is partially done by the compile time linker (for the main application and libraries that are always loaded), but if no native TLS is available, we can't rely on the linker to do that. We also need some way to get the current module id in running code.

As I understand it, in the native ELF implementation, assembly is used to access the current module id. This is for FreeBSD:
http://people.freebsd.org/~marcel/tls.html

This is how ___tls_get_addr is implemented on FreeBSD ELF i386:
https://bitbucket.org/freebsd/freebsd-head/src/4e8f50fe2f05/libexec/rtld-elf/i386/reloc.c#cl-355
 And how do we get the TLS initialization data? If we placed it into an
 array, like DMD does on OSX, we could use dlsym for dlopened libraries,
 but what about initially loaded libraries?

In the same way it's done in the native implementation. Isn't it possible to access all loaded libraries?
 Say you have application 'app', which depends on 'liba' and 'libb'. All
 of these have TLS data. Maybe we could implement something using
 dl_iterate_phdr, but that's a nonstandard extension.

Ok. Mac OS X has a function called "_dyld_register_func_for_add_image"; I guess other OSes don't have a corresponding function? In general all this stuff is very low level and nonstandard.
https://developer.apple.com/library/mac/#documentation/developertools/Reference/MachOReference/Reference/reference.html#jumpTo_53
 Compare that to GCC's emulation, which is probably slow, but 'just
 works' everywhere (except for the GC :-( ).

Yeah, that's a big advantage. In general I was hoping that the work done by the dynamic loader to set up TLS could be moved to druntime.

--
/Jacob Carlborg
Mar 19 2012
Jacob Carlborg <doob me.com> writes:
On 2012-03-19 16:57, Johannes Pfau wrote:
 On Mon, 19 Mar 2012 10:40:25 +0100, Jacob Carlborg <doob me.com> wrote:

 In general I was hoping that the work done by the dynamic loader to
 setup TLS could be moved to druntime.

That'd be nice, but I think the runtime linker doesn't export all necessary information.

I think this would require investigating each individual platform to see what's possible.

--
/Jacob Carlborg
Mar 19 2012
Jacob Carlborg <doob me.com> writes:
On 2012-03-23 06:03, Martin Nowak wrote:
 On Mon, 19 Mar 2012 10:40:25 +0100, Jacob Carlborg <doob me.com> wrote:

 As I understand it, in the native ELF implementation, assembly is used
 to access the current module id, this is for FreeBSD:
 http://people.freebsd.org/~marcel/tls.html
 This is how ___tls_get_addr is implemented on FreeBSD ELF i386:
 https://bitbucket.org/freebsd/freebsd-head/src/4e8f50fe2f05/libexec/rtld-elf/i386/reloc.c#cl-355

Not quite. Access to the static image is done through %fs relative addressing which is super-fast and requires no runtime linking. The general dynamic addressing needs one tls_index struct in the GOT for every variable and a call to _tls_get_addr(tls_index*). The module index and the offset are filled by the runtime linker.

Ok, I see. -- /Jacob Carlborg
Mar 25 2012
Iain Buclaw <ibuclaw ubuntu.com> writes:
On 19 March 2012 15:25, Johannes Pfau <nospam example.com> wrote:
 On Mon, 19 Mar 2012 09:22:01 +0000, Iain Buclaw <ibuclaw ubuntu.com> wrote:

 On 19 March 2012 08:15, Johannes Pfau <nospam example.com> wrote:
 * Our own, emulated TLS support is implemented in GCC. This means
   it's also used in C, which is great. Also GCC's emulated tls needs
   absolutely no special features in the runtime linker, compile time
   linker or language frontends. It's very portable and works with all
   weird combinations of dynamic libraries, dlopen, etc.
   But it has one quirk: It doesn't allocate TLS memory in a
   contiguous way, every tls variable is allocated using malloc. This
   means we can't pass a range to the GC for the tls variables. So we
   can't support this emutls in the GC.

As far as my thought process goes, the only way (implementable in the GDC frontend) to force a contiguous layout of all TLS symbols is to pack them up ourselves into a struct that is accessible via a single global module-level variable. And in the .ctor section, the module adds this range to the GC. This should be enough for it to work with shared libraries too, however I'm sure there are quite a few details I am missing here that would block this from working. :)

Good idea, I should have thought about that. I can't think of a reason why it wouldn't work and it should be quite fast as well.

Initial things to think about off the top of my head:

* Speed to access symbols.
* Accessing thread local symbols across modules.
 Just to clarify: 'module-level' as in D module(/object file) or as in
 one variable per shared library/application? If we can support one
 variable per shared library/application that'd be great, as we will
 then only have a few tls ranges for the gc.

Per module - see the code that initialises _Dmodule_ref. We're really just adding two extra fields to that, holding the starting address and size.

--
Iain Buclaw

*(p < e ? p++ : p) = (c & 0x0f) + '0';
Mar 19 2012
Iain Buclaw <ibuclaw ubuntu.com> writes:
On 21 March 2012 13:17, Johannes Pfau <nospam example.com> wrote:
 On Mon, 19 Mar 2012 16:14:36 +0000, Iain Buclaw <ibuclaw ubuntu.com> wrote:


 Initial things to think about on the top of my head:

 * Speed to access symbols.

It needs the normal code to access the TLS struct / get the address of the TLS struct + one add instruction which adds the offset for the specific variable. So it should be fast enough.
 * Accessing thread local symbols across modules.

Do we have to use module-local symbols? If we could use symbols with unique, mangled names, we could just access that symbol+offset from every module. This assumes the d/di files provide enough information to calculate the offset.

Oh yeah, that's it. Perhaps the externally visible mangled names could just be references to the actual location? I don't think there would be enough information to access via main entry point symbol+offset.

--
Iain Buclaw

*(p < e ? p++ : p) = (c & 0x0f) + '0';
Mar 21 2012
"Martin Nowak" <dawg dawgfoto.de> writes:
On Mon, 19 Mar 2012 16:57:29 +0100, Johannes Pfau <nospam example.com>  
wrote:

 The only way to access all loaded libraries is dl_iterate_phdr. But I'm
 not sure if it provides all necessary information.

Yes it does. The drawback is that it eagerly allocates the TLS block.
https://github.com/dawgfoto/druntime/blob/SharedRuntime/src/rt/dso.d#L408
https://github.com/dawgfoto/druntime/blob/SharedRuntime/src/rt/dso.d#L459
Mar 22 2012
"Martin Nowak" <dawg dawgfoto.de> writes:
On Mon, 19 Mar 2012 10:40:25 +0100, Jacob Carlborg <doob me.com> wrote:

 As I understand it, in the native ELF implementation, assembly is used  
 to access the current module id, this is for FreeBSD:
  http://people.freebsd.org/~marcel/tls.html
  This is how ___tls_get_addr is implemented on FreeBSD ELF i386:
   
 https://bitbucket.org/freebsd/freebsd-head/src/4e8f50fe2f05/libexec/rtld-elf/i386/reloc.c#cl-355

Not quite. Access to the static image is done through %fs relative addressing which is super-fast and requires no runtime linking. The general dynamic addressing needs one tls_index struct in the GOT for every variable and a call to _tls_get_addr(tls_index*). The module index and the offset are filled by the runtime linker.
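
The general dynamic lookup described above can be modelled in a few lines of C. This is a toy model, not the real runtime linker code: fake_dtv and fake_tls_get_addr are invented names, the dtv is passed explicitly instead of being found through the thread pointer, and the block size is fixed. It only shows how a (module, offset) pair from a tls_index entry is turned into an address.

```c
#include <assert.h>
#include <stddef.h>

/* Mirrors the two-word GOT entry: module index and offset, both of
   which the runtime linker would normally fill in. */
typedef struct {
    unsigned long module;  /* module id, starting at 1 */
    unsigned long offset;  /* offset of the variable in that module's block */
} tls_index;

enum { MAX_MODULES = 4, BLOCK_SIZE = 32 };

/* Simulated per-thread dtv: slot m points at module m's TLS block. */
typedef struct {
    unsigned char *blocks[MAX_MODULES + 1];  /* slot 0 is unused */
} fake_dtv;

/* Toy stand-in for the lookup: returns the variable's address for this
   thread, or NULL if the module has no block or the offset is out of
   range. */
void *fake_tls_get_addr(fake_dtv *dtv, const tls_index *ti)
{
    if (ti->module == 0 || ti->module > MAX_MODULES)
        return NULL;
    unsigned char *block = dtv->blocks[ti->module];
    if (block == NULL || ti->offset >= BLOCK_SIZE)
        return NULL;
    return block + ti->offset;
}
```

A real implementation would allocate a missing block lazily instead of returning NULL; the sketch leaves that out to keep the lookup itself visible.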
Mar 22 2012
"Martin Nowak" <dawg dawgfoto.de> writes:
On Mon, 19 Mar 2012 09:15:08 +0100, Johannes Pfau <nospam example.com>  
wrote:

 And how do we get the TLS initialization data? If we placed it into an
 array, like DMD does on OSX, we could use dlsym for dlopened libraries,
 but what about initially loaded libraries?

That doesn't work because the symbols would collide. If you made them local symbols OTOH you can't access them through dlsym. Use dl_iterate_phdr to get the initial image.
Mar 22 2012
"Martin Nowak" <dawg dawgfoto.de> writes:
 I guess the OSX emulated tls code will also be adapted to support
 multiple modules? I guess we can just wait until your changes are
 merged into DMD and then think about emulated TLS again.

We've already been merging for 3 months or so.
Mar 23 2012
"Martin Nowak" <dawg dawgfoto.de> writes:
On Fri, 23 Mar 2012 11:02:44 +0100, Johannes Pfau <nospam example.com>  
wrote:

 For FreeBSD, this is easy again, dtv[1] contains the size of the dtv.
 But that's probably nonstandard.

Yeah, seems to be non-standard. There might also be issues with outdated dtv's.
Mar 23 2012
Johannes Pfau <nospam example.com> writes:
On Sun, 18 Mar 2012 21:57:57 +0100, Jacob Carlborg <doob me.com> wrote:

 On 2012-03-18 12:32, Johannes Pfau wrote:
 I thought about supporting emulated tls a little. The GCC emutls.c
 implementation currently can't work with the gc, as every TLS
 variable is allocated individually and therefore we don't have a
 contiguous memory region for the gc. I think these are the possible
 solutions:

Why not use the native TLS implementation when available and roll our own, like DMD on Mac OS X, when none exists?

That's what we (mostly) do right now. We have 2 issues:

* Our own, emulated TLS support is implemented in GCC. This means it's
  also used in C, which is great. Also GCC's emulated tls needs
  absolutely no special features in the runtime linker, compile time
  linker or language frontends. It's very portable and works with all
  weird combinations of dynamic libraries, dlopen, etc.
  But it has one quirk: It doesn't allocate TLS memory in a contiguous
  way, every tls variable is allocated using malloc. This means we
  can't pass a range to the GC for the tls variables. So we can't
  support this emutls in the GC.

* The other issue with native TLS is that using bracketing with
  __tls_beg and __tls_end has corner cases where it doesn't work. We'd
  need an alternative to locate the TLS memory addresses and TLS sizes.
  But there's no standard or public API to do that.
 BTW, I think it would be possible to emulate TLS in a very similar
 way to how it's implemented natively for ELF.
 

I don't think it's that easy. For example, how would you assign module ids? For native TLS this is partially done by the compile time linker (for the main application and libraries that are always loaded), but if no native TLS is available, we can't rely on the linker to do that. We also need some way to get the current module id in running code.

And how do we get the TLS initialization data? If we placed it into an array, like DMD does on OSX, we could use dlsym for dlopened libraries, but what about initially loaded libraries? Say you have application 'app', which depends on 'liba' and 'libb'. All of these have TLS data. Maybe we could implement something using dl_iterate_phdr, but that's a nonstandard extension.

Compare that to GCC's emulation, which is probably slow, but 'just works' everywhere (except for the GC :-( ).
Mar 19 2012
Johannes Pfau <nospam example.com> writes:
On Sun, 18 Mar 2012 22:06:41 +0100, Jacob Carlborg <doob me.com> wrote:

 On 2012-03-18 19:39, Johannes Pfau wrote:
 
 You mean getting rid of __tls_beg and __tls_end? I'd also like to
 remove those, but:

__tls_beg and __tls_end are not used on Mac OS X any more:
https://github.com/D-Programming-Language/druntime/commit/73cf2c150665cb17d9365a6e3d6cf144d76312d6
https://github.com/D-Programming-Language/dmd/commit/054c525edba048ad7829dd5ec2d8d9261a6517c3

Yes, but OSX still uses emulated tls. With the way dmd emulates TLS it's possible to remove __tls_beg and __tls_end, but for native TLS those symbols are still needed. However, as the runtime linker (ld.so) has got the necessary information, it's possible that OSX even offers an API to access it. It's just that most C libraries don't provide a way to get the TLS segment sizes and the (per thread) addresses of the TLS blocks.
 TLS is mostly object-format specific (not as much OS specific). The
 ELF implementation lays out the TLS data for a module (module =
 shared library or the application) in a contiguous way. The details
 are described in "ELF Handling For Thread-Local
 Storage" (www.akkadia.org/drepper/tls.pdf).

Mac OS X 10.7+ supports TLS natively, but I don't know where to find documentation about it. It's always possible to look at the source code.

Then it's probably already supported by GCC/GDC. But having working emulated TLS would be nice for many other architectures. Native TLS is not that widespread.
Mar 19 2012
Johannes Pfau <nospam example.com> writes:
On Mon, 19 Mar 2012 09:22:01 +0000, Iain Buclaw <ibuclaw ubuntu.com> wrote:

 On 19 March 2012 08:15, Johannes Pfau <nospam example.com> wrote:
 * Our own, emulated TLS support is implemented in GCC. This means
   it's also used in C, which is great. Also GCC's emulated tls needs
   absolutely no special features in the runtime linker, compile time
   linker or language frontends. It's very portable and works with all
   weird combinations of dynamic libraries, dlopen, etc.
   But it has one quirk: It doesn't allocate TLS memory in a
   contiguous way, every tls variable is allocated using malloc. This
   means we can't pass a range to the GC for the tls variables. So we
   can't support this emutls in the GC.

As far as my thought process goes, the only way (implementable in the GDC frontend) to force a contiguous layout of all TLS symbols is to pack them up ourselves into a struct that is accessible via a single global module-level variable. And in the .ctor section, the module adds this range to the GC. This should be enough for it to work with shared libraries too, however I'm sure there are quite a few details I am missing here that would block this from working. :)

Good idea, I should have thought about that. I can't think of a reason why it wouldn't work and it should be quite fast as well.

Just to clarify: 'module-level' as in D module (/object file) or as in one variable per shared library/application? If we can support one variable per shared library/application that'd be great, as we will then only have a few tls ranges for the gc.
Mar 19 2012
Johannes Pfau <nospam example.com> writes:
On Mon, 19 Mar 2012 10:40:25 +0100, Jacob Carlborg <doob me.com> wrote:

 On 2012-03-19 09:15, Johannes Pfau wrote:
 On Sun, 18 Mar 2012 21:57:57 +0100, Jacob Carlborg <doob me.com> wrote:

 On 2012-03-18 12:32, Johannes Pfau wrote:
 I thought about supporting emulated tls a little. The GCC emutls.c
 implementation currently can't work with the gc, as every TLS
 variable is allocated individually and therefore we don't have a
 contiguous memory region for the gc. I think these are the
 possible solutions:

Why not use the native TLS implementation when available and roll our own, like DMD on Mac OS X, when none exists?

That's what we (mostly) do right now. We have 2 issues:

* Our own, emulated TLS support is implemented in GCC. This means it's also used in C, which is great. Also GCC's emulated tls needs absolutely no special features in the runtime linker, compile time linker or language frontends. It's very portable and works with all weird combinations of dynamic libraries, dlopen, etc. But it has one quirk: It doesn't allocate TLS memory in a contiguous way, every tls variable is allocated using malloc. This means we can't pass a range to the GC for the tls variables. So we can't support this emutls in the GC.

Ok, I see.
 * The other issue with native TLS is that using bracketing with
    __tls_beg and __tls_end has corner cases where it doesn't work.
 We'd need an alternative to locate the TLS memory addresses and TLS
 sizes. But there's no standard or public API to do that.

On Mac OS X they are actually not needed. Don't know about other platforms.
 BTW, I think it would be possible to emulate TLS in a very similar
 way to how it's implemented natively for ELF.

I don't think it's that easy. For example, how would you assign module ids? For native TLS this is partially done by the compile time linker (for the main application and libraries that are always loaded), but if no native TLS is available, we can't rely on the linker to do that. We also need some way to get the current module id in running code.

As I understand it, in the native ELF implementation, assembly is used to access the current module id, this is for FreeBSD: http://people.freebsd.org/~marcel/tls.html This is how ___tls_get_addr is implemented on FreeBSD ELF i386: https://bitbucket.org/freebsd/freebsd-head/src/4e8f50fe2f05/libexec/rtld-elf/i386/reloc.c#cl-355

Yep and the module id is part of the tls_index parameter. That pointer is a pointer into the GOT. The initial values of that GOT entry are undefined, they are filled in by the runtime linker. We could probably emulate all that, but it seems a little complicated to me. If we can get the current emutls to work, that'd be awesome even if it's slow. Proper native TLS support is easier to implement in the runtime linker anyway.
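
For illustration, the lookup half of such an emulation could look roughly like the sketch below. It deliberately leaves out the hard parts discussed here: who assigns module ids, and where the initialization image comes from. The types `tls_module_info` and `thread_tls`, the `modules` table and the `MAX_MODULES` limit are assumptions made up for the sketch. The per-thread table is passed in explicitly, whereas a real runtime would have to locate it itself, e.g. via pthread_getspecific.

```c
#include <stdlib.h>
#include <string.h>

#define MAX_MODULES 8   /* arbitrary limit for this sketch */

/* Module id plus offset, as in the tls_index argument of __tls_get_addr. */
struct tls_index { unsigned long module; unsigned long offset; };

/* What the runtime linker would normally know about each module. */
struct tls_module_info {
    const void *init_image;   /* initialized part of the TLS data */
    size_t      init_size;    /* size of the image */
    size_t      total_size;   /* image plus zero-filled rest */
};

static struct tls_module_info modules[MAX_MODULES + 1];  /* ids start at 1 */

/* One block pointer per module, owned by a single thread. */
struct thread_tls { void *blocks[MAX_MODULES + 1]; };

/* Return the address of a TLS variable for the given thread, allocating the
   module's whole block lazily on first use. Returns NULL on error. */
static void *emu_tls_get_addr(struct thread_tls *t, const struct tls_index *ti)
{
    if (ti->module == 0 || ti->module > MAX_MODULES)
        return NULL;
    const struct tls_module_info *m = &modules[ti->module];
    if (ti->offset >= m->total_size)
        return NULL;                      /* unknown module or bad offset */

    void *blk = t->blocks[ti->module];
    if (!blk) {
        blk = calloc(1, m->total_size);   /* zero-fill the whole block */
        if (!blk)
            return NULL;
        if (m->init_size)
            memcpy(blk, m->init_image, m->init_size);
        t->blocks[ti->module] = blk;
    }
    return (char *)blk + ti->offset;
}
```

Because the whole module block comes from one allocation, this layout would give the GC one contiguous range per module and thread, which is exactly what the per-variable malloc in emutls does not provide.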
 And how do we get the TLS initialization data? If we placed it into
 an array, like DMD does on OSX, we could use dlsym for dlopened
 libraries, but what about initially loaded libraries?

In the same way it's done in the native implementation. Isn't it possible to access all loaded libraries?

The only way to access all loaded libraries is dl_iterate_phdr. But I'm not sure if it provides all necessary information.
 
 Say you have application 'app', which depends on 'liba' and 'libb'.
 All of these have TLS data. Maybe we could implement something using
 dl_iterate_phdr, but that's a nonstandard extension.

Ok. Mac OS X has a function called "_dyld_register_func_for_add_image", I guess other OS'es don't have a corresponding function? In general all this stuff is very low level and nonstandard.

Some C libraries might provide a similar API, but there's no guarantee such an API is available.
 
 https://developer.apple.com/library/mac/#documentation/developertools/Reference/MachOReference/Reference/reference.html#jumpTo_53
 
 Compare that to GCC's emulation, which is probably slow, but 'just
 works' everywhere (except for the GC :-( ).

Yeah, that's a big advantage. In general I was hoping that the work done by the dynamic loader to set up TLS could be moved to druntime.

That'd be nice, but I think the runtime linker doesn't export all necessary information.
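
As a rough way to see what dl_iterate_phdr does expose, the following sketch walks all loaded objects and looks for their PT_TLS program header, which carries the size of the TLS segment. It assumes a glibc-style `<link.h>` (hence `_GNU_SOURCE`). `tls_summary` and `summarize_tls` are names made up for the example.

```c
#define _GNU_SOURCE
#include <link.h>
#include <stddef.h>

/* Totals collected over all loaded objects. */
struct tls_summary {
    int    objects;       /* objects visited */
    int    with_tls;      /* objects that have a PT_TLS segment */
    size_t total_memsz;   /* combined in-memory size of all TLS segments */
};

/* Called once per loaded object (main program and shared libraries). */
static int visit(struct dl_phdr_info *info, size_t size, void *data)
{
    struct tls_summary *s = data;
    (void)size;
    s->objects++;
    for (int i = 0; i < info->dlpi_phnum; i++) {
        const ElfW(Phdr) *ph = &info->dlpi_phdr[i];
        if (ph->p_type == PT_TLS) {
            s->with_tls++;
            s->total_memsz += ph->p_memsz;
        }
    }
    return 0;   /* a non-zero return would stop the iteration */
}

static struct tls_summary summarize_tls(void)
{
    struct tls_summary s = { 0, 0, 0 };
    dl_iterate_phdr(visit, &s);
    return s;
}
```

This only yields sizes and the location of the initialization image. Where the per-thread copy of each segment lives is a separate question that the program headers alone do not answer.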
Mar 19 2012
prev sibling next sibling parent Rainer Schuetze <r.sagitario gmx.de> writes:
On 3/18/2012 12:32 PM, Johannes Pfau wrote:
 I thought about supporting emulated tls a little. The GCC emutls.c
 implementation currently can't work with the gc, as every TLS variable
 is allocated individually and therefore we don't have a contiguous
 memory region for the gc. I think these are the possible solutions:

 * Try to fix GCCs emutls to allocate all tls memory for a module
    (application/shared object) at once. That's the best solution
    and native TLS works this way, but I'm not sure if we can extract
    enough information from the runtime linker to make this work (we
    need at least the combined size of all tls variables).

 * Provide a callback in GCC's emutls which is called after every
    allocation. This could call GC.addRange for every variable, but I
    guess adding huge amounts of ranges is slow.

 * Make it possible to register a custom allocator for GCC's emutls (not
    sure if possible, as this would have to be set up very early in
    application startup). Then allocate the memory directly from the GC
    (but this memory should only be scanned, not collected)

 * Replace the calls to malloc in emutls.c with a custom, region based
    memory allocator. (This is not a perfect solution though, it can
    always happen that we'll need more memory)

Check the implementation of ranges in gcx.d: it's rather fast to add a range (vector-like appending to exponentially growing storage). During a collection there is just a simple loop over the ranges, so performance would not change a lot compared to scanning one memory chunk: it's the marking of references in the scanned data that is expensive. I would be more concerned about removal of ranges, though. It scans existing ranges linearly to find the one to remove and moves the remaining entries in memory. Some optimizations might be helpful here.
 * Do not use GCC's emutls at all, roll a custom solution. This could be
    compatible with / based on dmd's tls emulation for OSX. Most of the
    implementation is in core.thread, all that's necessary is to group
    the tls data into a _tls_data_array and call ___tls_get_addr for
    every tls access. I'm not sure if this can be done in the
    'middle-end' though and it doesn't support shared libraries yet.
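
The cost argument about ranges can be illustrated with a small table that behaves as described above: adding appends to a doubling array, removing scans linearly and shifts the tail. This is a simplified sketch, not the gcx.d code. `range_table`, `range_add` and `range_remove` are names chosen for the example.

```c
#include <stdlib.h>
#include <string.h>

struct range { void *bot, *top; };

struct range_table {
    struct range *items;
    size_t len, cap;
};

/* Append a range; amortised O(1) because the capacity doubles.
   Returns 0 on success, -1 if out of memory. */
static int range_add(struct range_table *t, void *p, size_t n)
{
    if (t->len == t->cap) {
        size_t ncap = t->cap ? t->cap * 2 : 16;
        struct range *ni = realloc(t->items, ncap * sizeof *ni);
        if (!ni)
            return -1;
        t->items = ni;
        t->cap = ncap;
    }
    t->items[t->len].bot = p;
    t->items[t->len].top = (char *)p + n;
    t->len++;
    return 0;
}

/* Remove the range starting at p; O(n) scan plus a memmove of the tail.
   Returns 0 on success, -1 if no such range exists. */
static int range_remove(struct range_table *t, void *p)
{
    for (size_t i = 0; i < t->len; i++) {
        if (t->items[i].bot == p) {
            memmove(&t->items[i], &t->items[i + 1],
                    (t->len - i - 1) * sizeof t->items[0]);
            t->len--;
            return 0;
        }
    }
    return -1;
}
```

With one range per TLS variable and thread, thread exit would call the removal path once per variable, which is where the linear scan starts to matter.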

Mar 19 2012
prev sibling next sibling parent Johannes Pfau <nospam example.com> writes:
Am Mon, 19 Mar 2012 16:14:36 +0000
schrieb Iain Buclaw <ibuclaw ubuntu.com>:


 
 Initial things to think about on the top of my head:
 
 * Speed to access symbols.

It needs the normal code to access the TLS struct / get the address of the TLS struct + one add instruction which adds the offset for the specific variable. So it should be fast enough.
 * Accessing thread local symbols across modules.

Do we have to use module-local symbols? If we could use symbols with unique, mangled names, we could just access that symbol+offset from every module. This assumes the d/di files provide enough information to calculate the offset.
 
 Just to clarify: 'module-level' as in D module(/object file) or as
 in one variable per shared library/application? If we can support
 one variable per shared library/application that'd be great, as we
 will then only have a few tls ranges for the gc.

Per module - see the code that initialises _Dmodule_ref. We're really just adding two extra fields to that which includes starting address and size.
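
To illustrate the access cost: once the variables live in one struct, reaching a single variable is the address of the thread's struct plus a constant offset. The sketch below spells that out by hand. `module_tls` and `addr_of_b` are hypothetical names, and a compiler would of course emit this directly rather than call a helper.

```c
#include <stddef.h>

/* Hypothetical packed TLS data of one module. */
struct module_tls { int a; double b; void *c; };

static __thread struct module_tls tls_block;

/* Address of `b` for the calling thread: base of the block plus a
   compile-time constant, i.e. one extra add after locating the block. */
static double *addr_of_b(void)
{
    char *base = (char *)&tls_block;
    return (double *)(base + offsetof(struct module_tls, b));
}
```

Another module could do the same computation as long as it knows the struct layout, which is the "symbol+offset" access mentioned above.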

Mar 21 2012
prev sibling next sibling parent Johannes Pfau <nospam example.com> writes:
Am Fri, 23 Mar 2012 06:06:39 +0100
schrieb "Martin Nowak" <dawg dawgfoto.de>:

 On Mon, 19 Mar 2012 09:15:08 +0100, Johannes Pfau
 <nospam example.com> wrote:
 
 And how do we get the TLS initialization data? If we placed it into
 an array, like DMD does on OSX, we could use dlsym for dlopened
 libraries, but what about initially loaded libraries?

That doesn't work because the symbols would collide. If you made them local symbols OTOH you can't access them through dlsym. Use dl_iterate_phdr to get the initial image.

I just saw your latest work on DSO yesterday (I was looking for a status update for shared libraries as Android does not officially support native applications. The supported way is to build a shared library and load it into a Java app. And indeed, native applications have some weird corner cases (https://github.com/jpf91/GDC/issues/4). Android is such a crappy platform for native apps...) You're doing some awesome work there. I'm not sure what issues gdc had with the original TLS support, but I guess that your new code will probably solve those. I guess the OSX emulated tls code will also be adapted to support multiple modules? I guess we can just wait until your changes are merged into DMD and then think about emulated TLS again.
Mar 23 2012
prev sibling next sibling parent Johannes Pfau <nospam example.com> writes:
Am Fri, 23 Mar 2012 05:48:46 +0100
schrieb "Martin Nowak" <dawg dawgfoto.de>:

 On Mon, 19 Mar 2012 16:57:29 +0100, Johannes Pfau
 <nospam example.com> wrote:
 
 The only way to access all loaded libraries is dl_iterate_phdr. But
 I'm not sure if it provides all necessary information.

Yes it does. The drawback is that it eagerly allocates the TLS block. https://github.com/dawgfoto/druntime/blob/SharedRuntime/src/rt/dso.d#L408 https://github.com/dawgfoto/druntime/blob/SharedRuntime/src/rt/dso.d#L459

As written in some comment in your code, we can avoid eager allocation using some architecture dependent 'hacks'. I think we'd have to

* get the thread pointer (architecture specific)
* find the dtv (there's only variant 1 / variant 2?)
* access the correct dtv entry (C library dependent?)
* check if the dtv entry is initialized (C library dependent?)

For FreeBSD step 3/4 means checking if dtv[index + 1] == 0. It's probably the same for most other C libraries. The tricky part is that we have to check first if the dtv is big enough for the current index. For FreeBSD, this is easy again, dtv[1] contains the size of the dtv. But that's probably nonstandard. All this is not hard to do, but quite system specific. Normal desktop OSes probably don't need this optimization. For systems which benefit from it, adding it shouldn't be too difficult.

In case you're interested, the FreeBSD linker source is here (BSD licensed, of course): http://www.freebsd.org/cgi/cvsweb.cgi/src/libexec/rtld-elf/rtld.c?rev=1.196;content-type=text%2Fplain search for tls_get_addr_slow, __tls_get_addr, dtv, tls
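
Steps 3 and 4 for the FreeBSD-style layout could be sketched like this. The function works on a plain array so it can be tried without touching the real thread pointer. The layout (dtv[0] = generation, dtv[1] = number of module slots, dtv[index + 1] = block of module `index`) follows the description above and is an assumption for any other C library. `tls_block_if_allocated` is a made-up name.

```c
#include <stddef.h>
#include <stdint.h>

/* Return the TLS block of module `index` if the dtv already has one,
   or NULL if the dtv is too small or the block is not allocated yet.
   Never triggers an allocation. */
static void *tls_block_if_allocated(const uintptr_t *dtv, size_t index)
{
    if (index == 0 || index > dtv[1])   /* dtv has no slot for this module */
        return NULL;
    return (void *)dtv[index + 1];      /* 0 means "not allocated yet" */
}
```

A runtime using this would skip modules for which NULL comes back instead of calling __tls_get_addr, which would allocate the block.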
Mar 23 2012
prev sibling next sibling parent reply "Martin Nowak" <dawg dawgfoto.de> writes:
Just another point about TLS.

extern(C) /*__thread*/ int foo;

At some point you want to be able to access C++ TLS variables
so emulation should not replace native TLS support.
Mar 23 2012
next sibling parent reply Jacob Carlborg <doob me.com> writes:
On 2012-03-23 12:55, Martin Nowak wrote:
 Just another point about TLS.

 extern(C) /*__thread*/ int foo;

 At some point you want to be able to access C++ TLS variables
 so emulation should not replace native TLS support.

So C++ TLS is not using the same implementation as the C extension __thread? -- /Jacob Carlborg
Mar 25 2012
parent Jacob Carlborg <doob me.com> writes:
On 2012-03-25 20:34, Martin Nowak wrote:

 Sorry,
 that might have been misleading.
 The point I was trying to make is that D's TLS support shouldn't
 deviate from the native platform TLS if one is available. I've just
 tried it out, and indeed I can access C TLS variables from D.

Ok. Yes, if a native TLS is available that should be used. -- /Jacob Carlborg
Mar 25 2012
prev sibling next sibling parent "Martin Nowak" <dawg dawgfoto.de> writes:
On Sun, 25 Mar 2012 16:29:25 +0200, Jacob Carlborg <doob me.com> wrote:

 On 2012-03-23 12:55, Martin Nowak wrote:
 Just another point about TLS.

 extern(C) /*__thread*/ int foo;

 At some point you want to be able to access C++ TLS variables
 so emulation should not replace native TLS support.

So C++ TLS is not using the same implementation as the C extension __thread?

Sorry, that might have been misleading. The point I was trying to make is that D's TLS support shouldn't deviate from the native platform TLS if one is available. I've just tried it out, and indeed I can access C TLS variables from D.
Mar 25 2012
prev sibling parent Iain Buclaw <ibuclaw ubuntu.com> writes:
On 25 March 2012 21:29, Jacob Carlborg <doob me.com> wrote:
 On 2012-03-25 20:34, Martin Nowak wrote:

 Sorry,
 that might have been misleading.
 The point I was trying to make is that D's TLS support shouldn't
 deviate from the native platform TLS if one is available. I've just
 tried it out, and indeed I can access C TLS variables from D.

Ok. Yes, if a native TLS is available that should be used. -- /Jacob Carlborg

Native implementations are used in GDC. We are currently going on blind faith that all symbols are between _tlsstart and _tlsend though, and packed together in a contiguous fashion. :~) -- Iain Buclaw *(p < e ? p++ : p) = (c & 0x0f) + '0';
Mar 26 2012
prev sibling parent Johannes Pfau <nospam example.com> writes:
Am Fri, 23 Mar 2012 13:05:55 +0100
schrieb "Martin Nowak" <dawg dawgfoto.de>:

 On Fri, 23 Mar 2012 11:02:44 +0100, Johannes Pfau
 <nospam example.com> wrote:
 
 For FreeBSD, this is easy again, dtv[1] contains the size of the
 dtv. But that's probably nonstandard.

Yeah, seems to be non-standard.

 There might also be issues with outdated dtv's.

Which means we'd have to check the generation counter. And if the counter mismatches, we'd need the C runtime to update it. I'm not sure if we can access tls_dtv_generation on FreeBSD, but updating the counter is easy: every call to __tls_get_addr works, so we could just use module id 1 offset 0. AFAIK the TLS memory for the application module is always allocated anyway. I'll probably try this at some point, but I have to set up a FreeBSD VM first.
Mar 23 2012