D.gnu - Supporting emulated tls

Johannes Pfau (25/25) Mar 18 2012 I thought about supporting emulated tls a little. The GCC emutls.c

Iain Buclaw (10/35) Mar 18 2012 If we are going to fix TLS, I'd rather it be in the most platform

Johannes Pfau (40/86) Mar 18 2012 You mean getting rid of __tls_beg and __tls_end? I'd also like to

Jacob Carlborg (8/15) Mar 18 2012 __tls_beg and __tls_end is not used by Mac OS X any more:

Johannes Pfau (12/32) Mar 19 2012 Yes, but OSX still uses emulated tls. With the way dmd emulates TLS

Jacob Carlborg (10/25) Mar 19 2012 The dyld library on Mac OS X provides access to segments and sections.

Alex Rønne Petersen (8/33) Mar 18 2012 Such an allocator would probably just allocate a decently-sized memory
Jacob Carlborg (7/11) Mar 18 2012 Why not use the native TLS implementation when available and roll our

Johannes Pfau (29/41) Mar 19 2012 That's what we (mostly) do right now. We have 2 issues:

Iain Buclaw (11/32) Mar 19 2012 As far as my thought process goes, the only (implementable in the GDC

Johannes Pfau (9/30) Mar 19 2012 Good idea, I should have thought about that. I can't think of a

Iain Buclaw (10/39) Mar 19 2012 Initial things to think about on the top of my head:

Johannes Pfau (9/25) Mar 21 2012 It needs the normal code to access the TLS struct / get the address of

Iain Buclaw (8/22) Mar 21 2012 Oh yeah, that's it. Perhaps the externally visible mangled names just

Jacob Carlborg (20/61) Mar 19 2012 On Mac OS X they are actually not needed. Don't know about other platfor...

Johannes Pfau (15/96) Mar 19 2012 Yep and the module id is part of the tls_index parameter. That pointer

Jacob Carlborg (5/12) Mar 19 2012 I think this would require to investigate each individual platform and
Martin Nowak (6/8) Mar 22 2012 Yes it does.

Johannes Pfau (20/31) Mar 23 2012 As written in some comment in your code, we can avoid eager allocation

Martin Nowak (4/6) Mar 23 2012 Yeah, seems to be non-standard.

Johannes Pfau (10/18) Mar 23 2012 Which means we'd have to check the generation counter. And if the

Martin Nowak (8/14) Mar 22 2012 Not quite.

Jacob Carlborg (4/18) Mar 25 2012 Ok, I see.

Martin Nowak (6/9) Mar 22 2012 That doesn't work because the symbols would collide.

Johannes Pfau (14/25) Mar 23 2012 I just saw your latest work on DSO yesterday (I was looking for a

Martin Nowak (1/4) Mar 23 2012 We're already merging since 3 month or so.

Rainer Schuetze (9/34) Mar 19 2012 Check the implementation of ranges in gcx.d: it's rather fast to add a
Martin Nowak (4/4) Mar 23 2012 Just another point about TLS.

Jacob Carlborg (4/8) Mar 25 2012 So C++ TLS is not using the same implementation as the C extension __thr...

Martin Nowak (6/16) Mar 25 2012 Sorry,

Jacob Carlborg (4/9) Mar 25 2012 Ok. Yes, if a native TLS is available that should be used.

Iain Buclaw (7/16) Mar 26 2012 Native implementations are used in GDC. We are currently going on

Johannes Pfau <nospam example.com> writes:

I thought about supporting emulated tls a little. The GCC emutls.c
implementation currently can't work with the gc, as every TLS variable
is allocated individually and therefore we don't have a contiguous
memory region for the gc. I think these are the possible solutions:

* Try to fix GCC's emutls to allocate all tls memory for a module
  (application/shared object) at once. That's the best solution
  and native TLS works this way, but I'm not sure if we can extract
  enough information from the runtime linker to make this work (we
  need at least the combined size of all tls variables).

* Provide a callback in GCC's emutls which is called after every
  allocation. This could call GC.addRange for every variable, but I
  guess adding huge amounts of ranges is slow.

* Make it possible to register a custom allocator for GCC's emutls (not
  sure if possible, as this would have to be set up very early in
  application startup). Then allocate the memory directly from the GC
  (but this memory should only be scanned, not collected) 

* Replace the calls to malloc in emutls.c with a custom, region based
  memory allocator. (This is not a perfect solution though, it can
  always happen that we'll need more memory)



* Do not use GCC's emutls at all, roll a custom solution. This could be
  compatible with / based on dmd's tls emulation for OSX. Most of the
  implementation is in core.thread, all that's necessary is to group
  the tls data into a _tls_data_array and call ___tls_get_addr for
  every tls access. I'm not sure if this can be done in the
  'middle-end' though and it doesn't support shared libraries yet.
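
To make the region-based allocator option a bit more concrete, here is a minimal sketch in C. It is not code from emutls.c; the names `tls_region`, `tls_region_alloc` and `tls_region_range` and the fixed `TLS_REGION_SIZE` are assumptions made up for illustration. The point is only that every variable is carved out of one block, so the GC would need a single range per thread, and that the allocator fails once the region is full (the "we'll need more memory" problem).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical fixed size; a real runtime would size or chain regions. */
#define TLS_REGION_SIZE 4096

typedef struct tls_region {
    _Alignas(max_align_t) unsigned char data[TLS_REGION_SIZE];
    size_t used;   /* bytes handed out so far */
} tls_region;

/* Bump-allocate `size` bytes aligned to `align` (a power of two).
   Returns NULL when the region is exhausted. */
static void *tls_region_alloc(tls_region *r, size_t size, size_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    size_t start = (r->used + align - 1) & ~(align - 1);
    if (start > TLS_REGION_SIZE || size > TLS_REGION_SIZE - start)
        return NULL;
    r->used = start + size;
    return r->data + start;
}

/* The single contiguous range a collector would scan for this thread. */
static void tls_region_range(const tls_region *r, const void **begin, size_t *len)
{
    *begin = r->data;
    *len = r->used;
}
```

Each thread would own one `tls_region`, and the pair returned by `tls_region_range` is what would be handed to something like GC.addRange.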

Mar 18 2012

Iain Buclaw <ibuclaw ubuntu.com> writes:

On 18 March 2012 11:32, Johannes Pfau <nospam example.com> wrote:
 I thought about supporting emulated tls a little. The GCC emutls.c
 implementation currently can't work with the gc, as every TLS variable
 is allocated individually and therefore we don't have a contiguous
 memory region for the gc. I think these are the possible solutions:

 * Try to fix GCC's emutls to allocate all tls memory for a module
   (application/shared object) at once. That's the best solution
   and native TLS works this way, but I'm not sure if we can extract
   enough information from the runtime linker to make this work (we
   need at least the combined size of all tls variables).

 * Provide a callback in GCC's emutls which is called after every
   allocation. This could call GC.addRange for every variable, but I
   guess adding huge amounts of ranges is slow.

Painfully slow.


 * Make it possible to register a custom allocator for GCC's emutls (not
   sure if possible, as this would have to be set up very early in
   application startup). Then allocate the memory directly from the GC
   (but this memory should only be scanned, not collected)

 * Replace the calls to malloc in emutls.c with a custom, region based
   memory allocator. (This is not a perfect solution though, it can
   always happen that we'll need more memory)

 * Do not use GCC's emutls at all, roll a custom solution. This could be
   compatible with / based on dmd's tls emulation for OSX. Most of the
   implementation is in core.thread, all that's necessary is to group
   the tls data into a _tls_data_array and call ___tls_get_addr for
   every tls access. I'm not sure if this can be done in the
   'middle-end' though and it doesn't support shared libraries yet.

If we are going to fix TLS, I'd rather it be in the most platform
agnostic way possible, if it could be helped. That would mean also
scrapping the current implementation on Linux (just tries to mimic
what dmd does, and has corner cases where it doesn't always get it
right).




-- 
Iain Buclaw

*(p < e ? p++ : p) = (c & 0x0f) + '0';

Mar 18 2012

Johannes Pfau <nospam example.com> writes:

Am Sun, 18 Mar 2012 12:21:51 +0000
schrieb Iain Buclaw <ibuclaw ubuntu.com>:

 On 18 March 2012 11:32, Johannes Pfau <nospam example.com> wrote:
 I thought about supporting emulated tls a little. The GCC emutls.c
 implementation currently can't work with the gc, as every TLS
 variable is allocated individually and therefore we don't have a
 contiguous memory region for the gc. I think these are the possible
 solutions:

 * Try to fix GCC's emutls to allocate all tls memory for a module
   (application/shared object) at once. That's the best solution
   and native TLS works this way, but I'm not sure if we can extract
   enough information from the runtime linker to make this work (we
   need at least the combined size of all tls variables).

 * Provide a callback in GCC's emutls which is called after every
   allocation. This could call GC.addRange for every variable, but I
   guess adding huge amounts of ranges is slow.

 Painfully slow.

 * Make it possible to register a custom allocator for GCC's emutls
   (not sure if possible, as this would have to be set up very early in
   application startup). Then allocate the memory directly from the GC
   (but this memory should only be scanned, not collected)

 * Replace the calls to malloc in emutls.c with a custom, region
   based memory allocator. (This is not a perfect solution though, it
   can always happen that we'll need more memory)

 * Do not use GCC's emutls at all, roll a custom solution. This
   could be compatible with / based on dmd's tls emulation for OSX.
   Most of the implementation is in core.thread, all that's necessary
   is to group the tls data into a _tls_data_array and call
   ___tls_get_addr for every tls access. I'm not sure if this can be
   done in the 'middle-end' though and it doesn't support shared
   libraries yet.

 If we are going to fix TLS, I'd rather it be in the most platform
 agnostic way possible, if it could be helped. That would mean also
 scrapping the current implementation on Linux (just tries to mimic
 what dmd does, and has corner cases where it doesn't always get it
 right).

You mean getting rid of __tls_beg and __tls_end? I'd also like to
remove those, but:

TLS is mostly object-format specific (not as much OS specific). The ELF
implementation lays out the TLS data for a module (module = shared
library or the application) in a contiguous way. The details are
described in "ELF Handling For Thread-Local
Storage" (www.akkadia.org/drepper/tls.pdf).

The GC requires the TLS blocks to be contiguous, this is not the case
for GCC's emulated TLS and this causes issues there.

For native TLS/ELF this requirement is met, but the GC also has to know
the start and the size of the TLS sections. Although the runtime
linker has this information, there's no standard way to access it. So
we could:

* Add a custom extension API to the C libraries. We'd need at least: A
  'tls_range dl_get_tls_range(void *handle)' function related to the
  dl* set of functions in the runtime linker, and a 'tls_range
  dl_get_tls_range2(struct dl_phdr_info *info)' to be used with
  dl_iterate_phdr. We also need some way to get the tls range for the
  application, 'get_app_tls_range' (although some libcs also return
  the application module in dl_iterate_phdr).

This seems to be the best way, but we'd have to patch every C library
and it would take some time till those updated C libraries are widely
deployed.

The other solution is to hook directly into each C library's non-public
(and maybe non-stable!) API. For example, the structure returned by BSD
libc's dl_iterate_phdr and dlopen has these fields:

 int tlsindex;		/* Index in DTV for this module */
 void *tlsinit;		/* Base address of TLS init block */
 size_t tlsinitsize;	/* Size of TLS init block for this module */
 size_t tlssize;	/* Size of TLS block for this module */
 size_t tlsoffset;	/* Offset of static TLS block for this module */
 size_t tlsalign;	/* Alignment of static TLS block */

tlsindex gives us the start-address of the TLS for every thread, as
long as we know how to compute the TLS address from the TP (thread
pointer) and the dtv index (there are basically 2 methods, described in
"ELF Handling For Thread-Local Storage") and tlssize gives us the size.


However, there doesn't seem to be a painless way to do this...
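
For comparison, here is a rough sketch of how far the public part of dl_iterate_phdr gets us. It assumes a glibc-style `<link.h>` (hence `_GNU_SOURCE`); `tls_summary`, `collect_tls`, `summarize_tls` and `example_tls_counter` are names invented for this example. It walks the program headers of every loaded object and reads the size of each PT_TLS segment. That yields the per-module size, but not the per-thread start address of the block, which is the missing piece discussed above.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE   /* assumption: glibc exposes dl_iterate_phdr under this macro */
#endif
#include <link.h>
#include <stddef.h>

/* What the program headers reveal about TLS. */
struct tls_summary {
    size_t modules_with_tls;  /* loaded objects that carry a PT_TLS segment */
    size_t total_tls_size;    /* sum of p_memsz over those segments */
};

/* Callback invoked once per loaded object (application and libraries). */
static int collect_tls(struct dl_phdr_info *info, size_t size, void *data)
{
    struct tls_summary *sum = data;
    (void)size;
    for (int i = 0; i < info->dlpi_phnum; i++) {
        if (info->dlpi_phdr[i].p_type == PT_TLS) {
            sum->modules_with_tls++;
            sum->total_tls_size += info->dlpi_phdr[i].p_memsz;
        }
    }
    return 0;  /* a non-zero return would stop the iteration */
}

/* A native TLS variable so the application itself has a PT_TLS segment. */
__thread int example_tls_counter = 1;

static struct tls_summary summarize_tls(void)
{
    struct tls_summary sum = {0, 0};
    dl_iterate_phdr(collect_tls, &sum);
    return sum;
}
```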

Mar 18 2012

Jacob Carlborg <doob me.com> writes:

On 2012-03-18 19:39, Johannes Pfau wrote:

 You mean getting rid of __tls_beg and __tls_end? I'd also like to
 remove those, but:

__tls_beg and __tls_end are not used by Mac OS X any more:

https://github.com/D-Programming-Language/druntime/commit/73cf2c150665cb17d9365a6e3d6cf144d76312d6

https://github.com/D-Programming-Language/dmd/commit/054c525edba048ad7829dd5ec2d8d9261a6517c3

 TLS is mostly object-format specific (not as much OS specific). The ELF
 implementation lays out the TLS data for a module (module = shared
 library or the application) in a contiguous way. The details are
 described in "ELF Handling For Thread-Local
 Storage" (www.akkadia.org/drepper/tls.pdf).

Mac OS X 10.7+ supports TLS natively. But I don't know where to find 
documentation about it. It's always possible to look at the source code.

-- 
/Jacob Carlborg

Mar 18 2012

Johannes Pfau <nospam example.com> writes:

Am Sun, 18 Mar 2012 22:06:41 +0100
schrieb Jacob Carlborg <doob me.com>:

On 2012-03-18 19:39, Johannes Pfau wrote:

You mean getting rid of __tls_beg and __tls_end? I'd also like to
remove those, but:

__tls_beg and __tls_end is not used by Mac OS X any more:

https://github.com/D-Programming-Language/druntime/commit/73cf2c150665cb17d9365a6e3d6cf144d76312d6

https://github.com/D-Programming-Language/dmd/commit/054c525edba048ad7829dd5ec2d8d9261a6517c3

Yes, but OSX still uses emulated tls. With the way dmd emulates TLS
it's possible to remove __tls_beg and __tls_end, but for native TLS
those symbols are still needed. However, as the runtime linker (ld.so)
has got the necessary information, it's possible that OSX even offers an
API to access it. It's just that most C libraries don't provide a way to
get the TLS segment sizes and the (per thread) addresses of the TLS
blocks.

TLS is mostly object-format specific (not as much OS specific). The
ELF implementation lays out the TLS data for a module (module =
shared library or the application) in a contiguous way. The details
are described in "ELF Handling For Thread-Local
Storage" (www.akkadia.org/drepper/tls.pdf).

Mac OS X 10.7 + supports TLS natively. But I don't know where to find
documentation about it. It always possible to look at the source code.

Then it's probably already supported by GCC/GDC. But having working
emulated TLS would be nice for many other architectures. Native TLS is
not that widespread.

Mar 19 2012

Jacob Carlborg <doob me.com> writes:

On 2012-03-19 09:17, Johannes Pfau wrote:
 Am Sun, 18 Mar 2012 22:06:41 +0100
 schrieb Jacob Carlborg<doob me.com>:

 Yes, but OSX still uses emulated tls. With the way dmd emulates TLS
 it's possible to remove __tls_beg and __tls_end, but for native TLS
 those symbols are still needed. However, as the runtime linker (ld.so)
 has got the necessary information, it's possible that OSX even offers a
 API to access it. It's just that most C libraries don't provide a way to
 get the TLS segment sizes and the (per thread) addresses of the TLS
 blocks.

The dyld library on Mac OS X provides access to segments and sections. 
But since the dynamic loader can get this information it should be 
possible for other applications to get this information as well?

Just walk through the object file and find the necessary segments?

 Mac OS X 10.7 + supports TLS natively. But I don't know where to find
 documentation about it. It always possible to look at the source code.

 Then it's probably already supported by GCC/GDC. But having working
 emulated TLS would be nice for many other architectures. Native TLS is
 not that widespread.

Yeah, don't know about GCC though, Apple cares less and less about GCC 
and is putting all its effort into LLVM and Clang. Ok, I didn't know how 
widespread TLS was.

-- 
/Jacob Carlborg

Mar 19 2012

Alex Rønne Petersen <xtzgzorex gmail.com> writes:

On 18-03-2012 12:32, Johannes Pfau wrote:
 I thought about supporting emulated tls a little. The GCC emutls.c
 implementation currently can't work with the gc, as every TLS variable
 is allocated individually and therefore we don't have a contiguous
 memory region for the gc. I think these are the possible solutions:

 * Try to fix GCC's emutls to allocate all tls memory for a module
    (application/shared object) at once. That's the best solution
    and native TLS works this way, but I'm not sure if we can extract
    enough information from the runtime linker to make this work (we
    need at least the combined size of all tls variables).

 * Provide a callback in GCC's emutls which is called after every
    allocation. This could call GC.addRange for every variable, but I
    guess adding huge amounts of ranges is slow.

We should avoid this if possible, yes. A small root set is desirable.

 * Make it possible to register a custom allocator for GCC's emutls (not
    sure if possible, as this would have to be set up very early in
    application startup). Then allocate the memory directly from the GC
    (but this memory should only be scanned, not collected)

Such an allocator would probably just allocate a decently-sized memory 
block from libc and add it as a root range (rather than individual 
word-sized roots). The memory doesn't necessarily have to be allocated 
with the GC.

 * Replace the calls to malloc in emutls.c with a custom, region based
    memory allocator. (This is not a perfect solution though, it can
    always happen that we'll need more memory)



 * Do not use GCC's emutls at all, roll a custom solution. This could be
    compatible with / based on dmd's tls emulation for OSX. Most of the
    implementation is in core.thread, all that's necessary is to group
    the tls data into a _tls_data_array and call ___tls_get_addr for
    every tls access. I'm not sure if this can be done in the
    'middle-end' though and it doesn't support shared libraries yet.


-- 
- Alex

Mar 18 2012

Jacob Carlborg <doob me.com> writes:

On 2012-03-18 12:32, Johannes Pfau wrote:
 I thought about supporting emulated tls a little. The GCC emutls.c
 implementation currently can't work with the gc, as every TLS variable
 is allocated individually and therefore we don't have a contiguous
 memory region for the gc. I think these are the possible solutions:

Why not use the native TLS implementation when available and roll our 
own, like DMD on Mac OS X, when none exists?

BTW, I think it would be possible to emulate TLS in a very similar way 
to how it's implemented natively for ELF.

-- 
/Jacob Carlborg

Mar 18 2012

Johannes Pfau <nospam example.com> writes:

Am Sun, 18 Mar 2012 21:57:57 +0100
schrieb Jacob Carlborg <doob me.com>:

 On 2012-03-18 12:32, Johannes Pfau wrote:
 I thought about supporting emulated tls a little. The GCC emutls.c
 implementation currently can't work with the gc, as every TLS
 variable is allocated individually and therefore we don't have a
 contiguous memory region for the gc. I think these are the possible
 solutions:

 
 Why not use the native TLS implementation when available and roll our 
 own, like DMD on Mac OS X, when none exists?

That's what we (mostly) do right now. We have 2 issues:

* Our own, emulated TLS support is implemented in GCC. This means it's
  also used in C, which is great. Also GCC's emulated tls needs
  absolutely no special features in the runtime linker, compile time
  linker or language frontends. It's very portable and works with all
  weird combinations of dynamic libraries, dlopen, etc.
  But it has one quirk: It doesn't allocate TLS memory in a contiguous
  way, every tls variable is allocated using malloc. This means we
  can't pass a range to the GC for the tls variables. So we can't
  support this emutls in the GC.

* The other issue with native TLS is that using bracketing with
  __tls_beg and __tls_end has corner cases where it doesn't work. We'd
  need an alternative to locate the TLS memory addresses and TLS sizes.
  But there's no standard or public API to do that.
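
To illustrate the first issue, here is a deliberately simplified model of the emutls scheme in C. It is not libgcc's actual code: `emu_object`, `emu_thread`, `emu_get_address` and the hook are hypothetical names, and the per-thread table is passed in explicitly instead of being looked up through a thread key. What it shows is that every variable gets its own malloc'ed block on first access, so there is no single range to give to the GC, only one small range per variable (counted here by `count_range`).

```c
#include <stdlib.h>
#include <string.h>

#define MAX_EMUTLS_VARS 16  /* hypothetical fixed limit for this sketch */

/* Control object emitted once per TLS variable (simplified model). */
typedef struct emu_object {
    size_t size;        /* size of the variable in bytes */
    const void *templ;  /* initial value, or NULL for zero-initialised */
    size_t index;       /* slot in each thread's pointer table */
} emu_object;

/* Per-thread state: one separately allocated block per variable. */
typedef struct emu_thread {
    void *slots[MAX_EMUTLS_VARS];
} emu_thread;

/* Hypothetical hook called after each allocation, e.g. to add a GC range. */
typedef void (*emu_alloc_hook)(void *addr, size_t size);

static size_t emu_ranges_seen;
static void count_range(void *addr, size_t size)
{
    (void)addr;
    (void)size;
    emu_ranges_seen++;
}

/* Return this thread's copy of the variable, allocating it on first use. */
static void *emu_get_address(emu_thread *t, const emu_object *obj,
                             emu_alloc_hook hook)
{
    if (obj->index >= MAX_EMUTLS_VARS)
        return NULL;
    void *p = t->slots[obj->index];
    if (p == NULL) {
        p = malloc(obj->size);          /* one allocation per variable */
        if (p == NULL)
            return NULL;
        if (obj->templ != NULL)
            memcpy(p, obj->templ, obj->size);
        else
            memset(p, 0, obj->size);
        t->slots[obj->index] = p;
        if (hook != NULL)
            hook(p, obj->size);         /* one range per variable */
    }
    return p;
}
```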

 BTW, I think it would be possible to emulate TLS in a very similar
 way to how it's implemented natively for ELF.
 

I don't think it's that easy. For example, how would you assign module
ids? For native TLS this is partially done by the compile time linker
(for the main application and libraries that are always loaded), but if
no native TLS is available, we can't rely on the linker to do that. We
also need some way to get the current module id in running code.

And how do we get the TLS initialization data? If we placed it into an
array, like DMD does on OSX, we could use dlsym for dlopened libraries,
but what about initially loaded libraries?

Say you have application 'app', which depends on 'liba' and 'libb'. All
of these have TLS data. Maybe we could implement something using
dl_iterate_phdr, but that's a nonstandard extension.

Compare that to GCC's emulation, which is probably slow, but 'just
works' everywhere (except for the GC :-( ).

Mar 19 2012

Iain Buclaw <ibuclaw ubuntu.com> writes:

On 19 March 2012 08:15, Johannes Pfau <nospam example.com> wrote:
 Am Sun, 18 Mar 2012 21:57:57 +0100
 schrieb Jacob Carlborg <doob me.com>:

 On 2012-03-18 12:32, Johannes Pfau wrote:
 I thought about supporting emulated tls a little. The GCC emutls.c
 implementation currently can't work with the gc, as every TLS
 variable is allocated individually and therefore we don't have a
 contiguous memory region for the gc. I think these are the possible
 solutions:

 Why not use the native TLS implementation when available and roll our
 own, like DMD on Mac OS X, when none exists?

 That's what we (mostly) do right now. We have 2 issues:

 * Our own, emulated TLS support is implemented in GCC. This means it's
   also used in C, which is great. Also GCC's emulated tls needs
   absolutely no special features in the runtime linker, compile time
   linker or language frontends. It's very portable and works with all
   weird combinations of dynamic libraries, dlopen, etc.
   But it has one quirk: It doesn't allocate TLS memory in a contiguous
   way, every tls variable is allocated using malloc. This means we
   can't pass a range to the GC for the tls variables. So we can't
   support this emutls in the GC.

As far as my thought process goes, the only (implementable in the GDC
frontend) way to force contiguous layout of all TLS symbols is to pack
them up ourselves into a struct that is accessible via a single global
module-level variable.  And in the .ctor section, the module adds this
range to the GC.  This should be enough so it also works for shared
libraries too, however I'm sure there are quite a few details I am
missing out on here that would block this from working. :)
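
A rough sketch of that packing idea, written in C rather than as frontend code. Everything here is hypothetical: the `module_tls` layout, the init image, and `gc_add_range_stub`, which only stands in for the runtime's real range registration. Each thread gets one copy of the struct in a single allocation, that block is registered as one range, and a variable is reached as base address plus a compile-time offset.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* All thread-local variables of one module, packed by the compiler
   (hypothetical layout). */
struct module_tls {
    int    counter;
    void  *cache;   /* may point into the GC heap, so the block must be scanned */
    double ratio;
};

/* Initial values, playing the role of the TLS initialisation image. */
static const struct module_tls module_tls_init = { 1, NULL, 0.5 };

/* Hypothetical stand-in for the runtime's range registration. */
static const void *registered_begin;
static size_t registered_size;
static void gc_add_range_stub(const void *begin, size_t size)
{
    registered_begin = begin;
    registered_size = size;
}

/* Create one thread's copy: a single allocation and a single range. */
static struct module_tls *module_tls_create(void)
{
    struct module_tls *blk = malloc(sizeof *blk);
    if (blk == NULL)
        return NULL;
    memcpy(blk, &module_tls_init, sizeof *blk);
    gc_add_range_stub(blk, sizeof *blk);
    return blk;
}

/* Access is the block's base address plus a constant offset. */
static double *module_tls_ratio(struct module_tls *blk)
{
    return (double *)((char *)blk + offsetof(struct module_tls, ratio));
}
```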


-- 
Iain Buclaw

*(p < e ? p++ : p) = (c & 0x0f) + '0';

Mar 19 2012

Johannes Pfau <nospam example.com> writes:

Am Mon, 19 Mar 2012 09:22:01 +0000
schrieb Iain Buclaw <ibuclaw ubuntu.com>:

 On 19 March 2012 08:15, Johannes Pfau <nospam example.com> wrote:
 * Our own, emulated TLS support is implemented in GCC. This means
   it's also used in C, which is great. Also GCC's emulated tls needs
   absolutely no special features in the runtime linker, compile time
   linker or language frontends. It's very portable and works with all
   weird combinations of dynamic libraries, dlopen, etc.
   But it has one quirk: It doesn't allocate TLS memory in a
   contiguous way, every tls variable is allocated using malloc. This
   means we can't pass a range to the GC for the tls variables. So we
   can't support this emutls in the GC.

 As far as my thought process goes, the only (implementable in the GDC
 frontend) way to force contiguous layout of all TLS symbols is to pack
 them up ourselves into a struct that is accessible via a single global
 module-level variable.  And in the .ctor section, the module adds this
 range to the GC.  This should be enough so it also works for shared
 libraries too, however I'm sure there is quite a few details I am
 missing out on here that would block this from working. :)

Good idea, I should have thought about that. I can't think of a
reason why it wouldn't work and it should be quite fast as well.

Just to clarify: 'module-level' as in D module(/object file) or as in
one variable per shared library/application? If we can support one
variable per shared library/application that'd be great, as we will
then only have a few tls ranges for the gc.

Mar 19 2012

Iain Buclaw <ibuclaw ubuntu.com> writes:

On 19 March 2012 15:25, Johannes Pfau <nospam example.com> wrote:
 Am Mon, 19 Mar 2012 09:22:01 +0000
 schrieb Iain Buclaw <ibuclaw ubuntu.com>:

 On 19 March 2012 08:15, Johannes Pfau <nospam example.com> wrote:
 * Our own, emulated TLS support is implemented in GCC. This means
 it's also used in C, which is great. Also GCC's emulated tls needs
   absolutely no special features in the runtime linker, compile time
   linker or language frontends. It's very portable and works with all
   weird combinations of dynamic libraries, dlopen, etc.
   But it has one quirk: It doesn't allocate TLS memory in a
 contiguous way, every tls variable is allocated using malloc. This
 means we can't pass a range to the GC for the tls variables. So we
 can't support this emutls in the GC.

 As far as my thought process goes, the only (implementable in the GDC
 frontend) way to force contiguous layout of all TLS symbols is to pack
 them up ourselves into a struct that is accessible via a single global
 module-level variable.  And in the .ctor section, the module adds this
 range to the GC.  This should be enough so it also works for shared
 libraries too, however I'm sure there is quite a few details I am
 missing out on here that would block this from working. :)

 Good idea, I should have thought about that. I can't think of a
 reason why it wouldn't work and it should be quite fast as well.

Initial things to think about off the top of my head:

* Speed to access symbols.
* Accessing thread local symbols across modules.


 Just to clarify: 'module-level' as in D module(/object file) or as in
 one variable per shared library/application? If we can support one
 variable per shared library/application that'd be great, as we will
 then only have a few tls ranges for the gc.

Per module - see the code that initialises _Dmodule_ref.  We're really
just adding two extra fields to that: the starting address
and the size.
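
Roughly, and only as a sketch in C: the struct below is a simplified model of a module reference list entry, not the real layout, and the two `tls_*` fields, `for_each_tls_range` and `sum_bytes` are hypothetical. The runtime would walk the list once and hand each module's range to the GC.

```c
#include <stddef.h>

/* Simplified model of a module reference list entry. The two tls_* fields
   are the hypothetical additions; the other fields are placeholders. */
struct module_ref {
    struct module_ref *next;
    const void *module_info;
    void  *tls_start;   /* hypothetical: start of this module's packed TLS data */
    size_t tls_size;    /* hypothetical: size of that data in bytes */
};

typedef void (*range_fn)(void *start, size_t size, void *ctx);

/* Visit the TLS range of every module that has one; returns how many. */
static size_t for_each_tls_range(struct module_ref *head, range_fn fn, void *ctx)
{
    size_t n = 0;
    for (struct module_ref *m = head; m != NULL; m = m->next) {
        if (m->tls_size == 0)
            continue;               /* module has no thread-local data */
        fn(m->tls_start, m->tls_size, ctx);
        n++;
    }
    return n;
}

/* Example visitor: add up the bytes a collector would have to scan. */
static void sum_bytes(void *start, size_t size, void *ctx)
{
    (void)start;
    *(size_t *)ctx += size;
}
```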


-- 
Iain Buclaw

*(p < e ? p++ : p) = (c & 0x0f) + '0';

Mar 19 2012

Johannes Pfau <nospam example.com> writes:

Am Mon, 19 Mar 2012 16:14:36 +0000
schrieb Iain Buclaw <ibuclaw ubuntu.com>:


 
 Initial things to think about on the top of my head:
 
 * Speed to access symbols.

It needs the normal code to access the TLS struct / get the address of
the TLS struct + one add instruction which adds the offset for the
specific variable. So it should be fast enough.

 * Accessing thread local symbols across modules.

Do we have to use module-local symbols? If we could use symbols with
unique, mangled names, we could just access that symbol+offset from
every module. This assumes the d/di files provide enough information to 
calculate the offset.

 
 Just to clarify: 'module-level' as in D module(/object file) or as
 in one variable per shared library/application? If we can support
 one variable per shared library/application that'd be great, as we
 will then only have a few tls ranges for the gc.

 
 Per module - see the code that initialises _Dmodule_ref.  We're really
 just adding two extra fields to that which includes starting address
 and size.

Mar 21 2012

Iain Buclaw <ibuclaw ubuntu.com> writes:

On 21 March 2012 13:17, Johannes Pfau <nospam example.com> wrote:
 Am Mon, 19 Mar 2012 16:14:36 +0000
 schrieb Iain Buclaw <ibuclaw ubuntu.com>:


 Initial things to think about on the top of my head:

 * Speed to access symbols.

 It needs the normal code to access the TLS struct / get the address of
 the TLS struct + one add instruction which adds the offset for the
 specific variable. So it should be fast enough.

 * Accessing thread local symbols across modules.

 Do we have to use module-local symbols? If we could use symbols with
 unique, mangled names, we could just access that symbol+offset from
 every module. This assumes the d/di files provide enough information to
 calculate the offset.

Oh yeah, that's it.  Perhaps the externally visible mangled names could just
be references to the actual location?

I don't think there would be enough information to access via main
entry point symbol+offset.


-- 
Iain Buclaw

*(p < e ? p++ : p) = (c & 0x0f) + '0';

Mar 21 2012

Jacob Carlborg <doob me.com> writes:

On 2012-03-19 09:15, Johannes Pfau wrote:
Am Sun, 18 Mar 2012 21:57:57 +0100
schrieb Jacob Carlborg<doob me.com>:

On 2012-03-18 12:32, Johannes Pfau wrote:
I thought about supporting emulated tls a little. The GCC emutls.c
implementation currently can't work with the gc, as every TLS
variable is allocated individually and therefore we don't have a
contiguous memory region for the gc. I think these are the possible
solutions:

Why not use the native TLS implementation when available and roll our
own, like DMD on Mac OS X, when none exists?

That's what we (mostly) do right now. We have 2 issues:

* Our own, emulated TLS support is implemented in GCC. This means it's
also used in C, which is great. Also GCC's emulated tls needs
absolutely no special features in the runtime linker, compile time
linker or language frontends. It's very portable and works with all
weird combinations of dynamic libraries, dlopen, etc.
But it has one quirk: It doesn't allocate TLS memory in a contiguous
way, every tls variable is allocated using malloc. This means we
can't pass a range to the GC for the tls variables. So we can't
support this emutls in the GC.

Ok, I see.

* The other issue with native TLS is that using bracketing with
__tls_beg and __tls_end has corner cases where it doesn't work. We'd
need an alternative to locate the TLS memory addresses and TLS sizes.
But there's no standard or public API to do that.

On Mac OS X they are actually not needed. Don't know about other platforms.

BTW, I think it would be possible to emulate TLS in a very similar
way to how it's implemented natively for ELF.

I don't think it's that easy. For example, how would you assign module
ids? For native TLS this is partially done by the compile time linker
(for the main application and libraries that are always loaded), but if
no native TLS is available, we can't rely on the linker to do that. We
also need some way to get the current module id in running code.

As I understand it, in the native ELF implementation, assembly is used
to access the current module id, this is for FreeBSD:

http://people.freebsd.org/~marcel/tls.html

This is how ___tls_get_addr is implemented on FreeBSD ELF i386:

https://bitbucket.org/freebsd/freebsd-head/src/4e8f50fe2f05/libexec/rtld-elf/i386/reloc.c#cl-355

And how do we get the TLS initialization data? If we placed it into an
array, like DMD does on OSX, we could use dlsym for dlopened libraries,
but what about initially loaded libraries?

In the same way it's done in the native implementation. Isn't it
possible to access all loaded libraries?

Say you have application 'app', which depends on 'liba' and 'libb'. All
of these have TLS data. Maybe we could implement something using
dl_iterate_phdr, but that's a nonstandard extension.

Ok. Mac OS X has a function called
"_dyld_register_func_for_add_image", I guess other OS'es don't have a
corresponding function? In general all this stuff is very low level and
nonstandard.

https://developer.apple.com/library/mac/#documentation/developertools/Reference/MachOReference/Reference/reference.html#jumpTo_53

Compare that to GCC's emulation, which is probably slow, but 'just
works' everywhere (except for the GC :-( ).

Yeah, that's a big advantage.

In general I was hoping that the work done by the dynamic loader to
setup TLS could be moved to druntime.

--
/Jacob Carlborg

Mar 19 2012

Johannes Pfau <nospam example.com> writes:

Am Mon, 19 Mar 2012 10:40:25 +0100
schrieb Jacob Carlborg <doob me.com>:

On 2012-03-19 09:15, Johannes Pfau wrote:
Am Sun, 18 Mar 2012 21:57:57 +0100
schrieb Jacob Carlborg<doob me.com>:

Why not use the native TLS implementation when available and roll
our own, like DMD on Mac OS X, when none exists?

That's what we (mostly) do right now. We have 2 issues:

* Our own, emulated TLS support is implemented in GCC. This means
it's also used in C, which is great. Also GCC's emulated tls needs
absolutely no special features in the runtime linker, compile
time linker or language frontends. It's very portable and works
with all weird combinations of dynamic libraries, dlopen, etc.
But it has one quirk: It doesn't allocate TLS memory in a
contiguous way, every tls variable is allocated using malloc. This
means we can't pass a range to the GC for the tls variables. So we
can't support this emutls in the GC.

Ok, I see.
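
To make the quirk concrete, here is a minimal single-threaded sketch in C of how an emutls-style runtime hands out storage. It is only a model for illustration and not the code from GCC's emutls.c: the names `emu_var` and `emu_get_address` are hypothetical. The point it shows is that every variable gets its own malloc'ed block on first access, so the blocks are not contiguous and there is no single range to hand to the GC.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical control object: one per emulated TLS variable. */
struct emu_var {
    size_t size;        /* size of the variable in bytes */
    const void *init;   /* initial image, or NULL for zero-initialization */
    void *block;        /* storage of the (single) modelled thread */
};

/* Return the storage for v, allocating it lazily on first access. */
void *emu_get_address(struct emu_var *v)
{
    if (v->block == NULL) {
        v->block = malloc(v->size);   /* one separate allocation per variable */
        if (v->block == NULL)
            abort();
        if (v->init != NULL)
            memcpy(v->block, v->init, v->size);
        else
            memset(v->block, 0, v->size);
    }
    return v->block;
}

/* Two emulated TLS variables: one with an initializer, one zeroed. */
static const int counter_init = 7;
struct emu_var emu_counter = { sizeof(int), &counter_init, NULL };
struct emu_var emu_flag    = { sizeof(int), NULL, NULL };
```

A real implementation would keep one such block per thread and per variable, which is exactly why a conservative GC would have to be told about each block separately.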

* The other issue with native TLS is that using bracketing with
__tls_beg and __tls_end has corner cases where it doesn't work.
We'd need an alternative to locate the TLS memory addresses and TLS
sizes. But there's no standard or public API to do that.

On Mac OS X they are actually not needed. Don't know about other
platforms.

BTW, I think it would be possible to emulate TLS in a very similar
way to how it's implemented natively for ELF.

I don't think it's that easy. For example, how would you assign
module ids? For native TLS this is partially done by the compile
time linker (for the main application and libraries that are always
loaded), but if no native TLS is available, we can't rely on the
linker to do that. We also need some way to get the current module
id in running code.

As I understand it, in the native ELF implementation, assembly is
used to access the current module id, this is for FreeBSD:

http://people.freebsd.org/~marcel/tls.html

This is how ___tls_get_addr is implemented on FreeBSD ELF i386:

https://bitbucket.org/freebsd/freebsd-head/src/4e8f50fe2f05/libexec/rtld-elf/i386/reloc.c#cl-355

Yep and the module id is part of the tls_index parameter. That pointer
is a pointer into the GOT. The initial values of that GOT entry are
undefined, they are filled in by the runtime linker. We could probably
emulate all that, but it seems a little complicated to me. If we can
get the current emutls to work, that would be awesome even if it's slow.
Proper native TLS support is easier to implement in the runtime linker
anyway.

And how do we get the TLS initialization data? If we placed it into
an array, like DMD does on OSX, we could use dlsym for dlopened
libraries, but what about initially loaded libraries?

In the same way it's done in the native implementation. Isn't it
possible to access all loaded libraries?

The only way to access all loaded libraries is dl_iterate_phdr. But I'm
not sure if it provides all necessary information.

Say you have application 'app', which depends on 'liba' and 'libb'.
All of these have TLS data. Maybe we could implement something using
dl_iterate_phdr, but that's a nonstandard extension.

Ok. Mac OS X has a function called
"_dyld_register_func_for_add_image"; I guess other OSes don't have a
corresponding function? In general all this stuff is very low level and
nonstandard.

Some C libraries might provide a similar API, but there's no guarantee
such an API is available.

https://developer.apple.com/library/mac/#documentation/developertools/Reference/MachOReference/Reference/reference.html#jumpTo_53

Compare that to GCC's emulation, which is probably slow, but 'just
works' everywhere (except for the GC :-( ).

Yeah, that's a big advantage.

In general I was hoping that the work done by the dynamic loader to
setup TLS could be moved to druntime.

That'd be nice, but I think the runtime linker doesn't export all
necessary information.

Mar 19 2012

Jacob Carlborg <doob me.com> writes:

On 2012-03-19 16:57, Johannes Pfau wrote:
 On Mon, 19 Mar 2012 10:40:25 +0100,
 Jacob Carlborg<doob me.com> wrote:

 In general I was hoping that the work done by the dynamic loader to
 setup TLS could be moved to druntime.

 That'd be nice, but I think the runtime linker doesn't export all
 necessary information.

I think this would require investigating each individual platform to
see what's possible.

-- 
/Jacob Carlborg

Mar 19 2012

"Martin Nowak" <dawg dawgfoto.de> writes:

On Mon, 19 Mar 2012 16:57:29 +0100, Johannes Pfau <nospam example.com>  
wrote:

 The only way to access all loaded libraries is dl_iterate_phdr. But I'm
 not sure if it provides all necessary information.

Yes it does.

The drawback is that it eagerly allocates the TLS block.
https://github.com/dawgfoto/druntime/blob/SharedRuntime/src/rt/dso.d#L408
https://github.com/dawgfoto/druntime/blob/SharedRuntime/src/rt/dso.d#L459
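
As a rough illustration of what dl_iterate_phdr exposes, the following C sketch walks all loaded objects and reports those with a PT_TLS program header. It is not the code from dso.d; it assumes a glibc-style <link.h>, and `visit_object`, `count_tls_objects` and `demo_tls_var` are names made up for this example. The initialization image of an object would start at `dlpi_addr + p_vaddr` and is `p_filesz` bytes long, while the full TLS block needs `p_memsz` bytes.

```c
#define _GNU_SOURCE
#include <link.h>
#include <stddef.h>
#include <stdio.h>

/* A thread-local variable so that the main program has a PT_TLS segment. */
__thread int demo_tls_var = 42;

/* Callback: report every loaded object that carries a TLS segment. */
static int visit_object(struct dl_phdr_info *info, size_t size, void *data)
{
    int *tls_objects = data;

    for (int i = 0; i < info->dlpi_phnum; i++) {
        const ElfW(Phdr) *ph = &info->dlpi_phdr[i];
        if (ph->p_type != PT_TLS)
            continue;

        const char *name = (info->dlpi_name && info->dlpi_name[0])
                               ? info->dlpi_name : "(main program)";
        printf("%s: init image %zu bytes, TLS block %zu bytes\n",
               name, (size_t)ph->p_filesz, (size_t)ph->p_memsz);

        /* The module id is a later addition to the struct, so check that
           the structure passed in is large enough to contain it. */
        if (size >= offsetof(struct dl_phdr_info, dlpi_tls_modid)
                        + sizeof info->dlpi_tls_modid)
            printf("  module id %zu\n", (size_t)info->dlpi_tls_modid);

        ++*tls_objects;
    }
    return 0; /* zero means: keep iterating */
}

/* Count the loaded objects that have TLS data. */
int count_tls_objects(void)
{
    int tls_objects = 0;
    dl_iterate_phdr(visit_object, &tls_objects);
    return tls_objects;
}
```

The sketch deliberately stays away from `dlpi_tls_data`, since reading the per-thread block address is where the eager allocation mentioned above would come in.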

Mar 22 2012

Johannes Pfau <nospam example.com> writes:

On Fri, 23 Mar 2012 05:48:46 +0100,
"Martin Nowak" <dawg dawgfoto.de> wrote:

On Mon, 19 Mar 2012 16:57:29 +0100, Johannes Pfau
<nospam example.com> wrote:

The only way to access all loaded libraries is dl_iterate_phdr. But
I'm not sure if it provides all necessary information.

Yes it does.

The drawback is that it eagerly allocates the TLS block.
https://github.com/dawgfoto/druntime/blob/SharedRuntime/src/rt/dso.d#L408
https://github.com/dawgfoto/druntime/blob/SharedRuntime/src/rt/dso.d#L459

As written in some comment in your code, we can avoid eager allocation
using some architecture dependent 'hacks'. I think we'd have to
* get the thread pointer (architecture specific)
* find the dtv (there's only variant 1 / variant 2?)
* access the correct dtv entry (C library dependent?)
* check if the dtv entry is initialized (C library dependent?)

For FreeBSD step 3/4 means checking if dtv[index + 1] == 0. It's
probably the same for most other C libraries. The tricky part is that
we have to check first if the dtv is big enough for the current index.
For FreeBSD, this is easy again, dtv[1] contains the size of the dtv.
But that's probably nonstandard.
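
The check described for FreeBSD can be sketched in C as follows. This is a model that operates on a plain array, not on the real dtv of a thread: the layout in the comment is an assumption based on the description above, and `tls_block_is_allocated` is a hypothetical helper name. Getting hold of the real dtv would still need the architecture-specific thread pointer access.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Assumed dtv layout (modelled after the FreeBSD description, not taken
 * from a libc header):
 *   dtv[0]         generation counter
 *   dtv[1]         number of module slots in this dtv
 *   dtv[index + 1] TLS block pointer of module `index` (ids start at 1),
 *                  or 0 while the block has not been allocated yet
 */
int tls_block_is_allocated(const uintptr_t *dtv, size_t index)
{
    /* First make sure the dtv is big enough for this module id. */
    if (index < 1 || index > dtv[1])
        return 0;

    /* A zero entry means the block was never allocated for this thread. */
    return dtv[index + 1] != 0;
}
```

With a model dtv of `{ 1, 2, 0x1000, 0 }`, module 1 counts as allocated, module 2 as not yet allocated, and module 3 is rejected by the size check.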

All this is not hard to do, but quite system specific. Normal desktop
OSes probably don't need this optimization. For systems which benefit
from it, adding it shouldn't be too difficult.

In case you're interested, the FreeBSD linker source is here (BSD
licensed, of course):
http://www.freebsd.org/cgi/cvsweb.cgi/src/libexec/rtld-elf/rtld.c?rev=1.196;content-type=text%2Fplain
search for tls_get_addr_slow, __tls_get_addr, dtv, tls

Mar 23 2012

"Martin Nowak" <dawg dawgfoto.de> writes:

On Fri, 23 Mar 2012 11:02:44 +0100, Johannes Pfau <nospam example.com>  
wrote:

 For FreeBSD, this is easy again, dtv[1] contains the size of the dtv.
 But that's probably nonstandard.

Yeah, seems to be non-standard.
There might also be issues with outdated dtv's.

Mar 23 2012

Johannes Pfau <nospam example.com> writes:

On Fri, 23 Mar 2012 13:05:55 +0100,
"Martin Nowak" <dawg dawgfoto.de> wrote:

 On Fri, 23 Mar 2012 11:02:44 +0100, Johannes Pfau
 <nospam example.com> wrote:
 
 For FreeBSD, this is easy again, dtv[1] contains the size of the
 dtv. But that's probably nonstandard.

 
 Yeah, seems to be non-standard.

 There might also be issues with outdated dtv's.

Which means we'd have to check the generation counter. And if the
counter mismatches, we'd need the C runtime to update it. I'm not sure
if we can access tls_dtv_generation on FreeBSD, but updating the
counter is easy: every call to __tls_get_addr works, so we could just
use module id 1 offset 0. AFAIK the TLS memory for the application
module is always allocated anyway.

I'll probably try this at some point, but I have to set up a FreeBSD VM
first.
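
Until then, here is an untested-on-FreeBSD sketch in C of the "module id 1, offset 0" idea. The `tls_index` layout is an assumption (two unsigned longs, as on common ELF targets; it is not taken from a public header), and `touch_main_tls_block` and `anchor_tls_var` are made-up names. Whether the call really brings a stale dtv up to date depends on the C library.

```c
#include <stddef.h>

/* Assumed layout of the __tls_get_addr argument on common ELF targets. */
typedef struct {
    unsigned long ti_module;  /* module id; 1 is the main program */
    unsigned long ti_offset;  /* offset inside that module's TLS block */
} tls_index;

/* Provided by the runtime linker / C library on ELF systems with native TLS. */
extern void *__tls_get_addr(tls_index *ti);

/* Give the main program a TLS segment so that module 1 has a block. */
__thread int anchor_tls_var = 1;

/*
 * Ask for module 1, offset 0. The returned address itself is not
 * interesting here; the call is made for its side effect of letting the
 * C runtime bring this thread's dtv up to date.
 */
void *touch_main_tls_block(void)
{
    tls_index ti = { 1, 0 };
    return __tls_get_addr(&ti);
}
```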

Mar 23 2012

"Martin Nowak" <dawg dawgfoto.de> writes:

On Mon, 19 Mar 2012 10:40:25 +0100, Jacob Carlborg <doob me.com> wrote:

 As I understand it, in the native ELF implementation, assembly is used  
 to access the current module id, this is for FreeBSD:
  http://people.freebsd.org/~marcel/tls.html
  This is how ___tls_get_addr is implemented on FreeBSD ELF i386:
   
 https://bitbucket.org/freebsd/freebsd-head/src/4e8f50fe2f05/libexec/rtld-elf/i386/reloc.c#cl-355

Not quite.
Access to the static image is done through %fs relative addressing
which is super-fast and requires no runtime linking.
The general dynamic addressing needs one tls_index
struct in the GOT for every variable and a call to  
__tls_get_addr(tls_index*).
The module index and the offset are filled by the runtime linker.
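
The two access paths can be tried out from C with the GCC/Clang `tls_model` attribute. This is only a small sketch with made-up variable names. With "initial-exec" the variable is addressed at a fixed offset from the thread pointer; with "global-dynamic" the compiler emits the tls_index / __tls_get_addr sequence, although the linker may relax it to the faster form when it builds an executable.

```c
/* Fixed offset from the thread pointer, resolved at load time. */
__thread int fast_counter __attribute__((tls_model("initial-exec"))) = 1;

/* General dynamic model: goes through __tls_get_addr unless relaxed. */
__thread int dyn_counter __attribute__((tls_model("global-dynamic"))) = 2;

/* Both functions behave the same; only the generated access code differs. */
int bump_fast(void) { return ++fast_counter; }
int bump_dyn(void)  { return ++dyn_counter; }
```

Compiling this with `-S -fPIC` and comparing the two functions shows the difference in the generated code.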

Mar 22 2012

Jacob Carlborg <doob me.com> writes:

On 2012-03-23 06:03, Martin Nowak wrote:
 On Mon, 19 Mar 2012 10:40:25 +0100, Jacob Carlborg <doob me.com> wrote:

 As I understand it, in the native ELF implementation, assembly is used
 to access the current module id, this is for FreeBSD:
 http://people.freebsd.org/~marcel/tls.html
 This is how ___tls_get_addr is implemented on FreeBSD ELF i386:
 https://bitbucket.org/freebsd/freebsd-head/src/4e8f50fe2f05/libexec/rtld-elf/i386/reloc.c#cl-355

 Not quite.
 Access to the static image is done through %fs relative addressing
 which is super-fast and requires no runtime linking.
 The general dynamic addressing needs one tls_index
 struct in the GOT for every variable and a call to
 __tls_get_addr(tls_index*).
 The module index and the offset are filled by the runtime linker.

Ok, I see.

-- 
/Jacob Carlborg

Mar 25 2012

"Martin Nowak" <dawg dawgfoto.de> writes:

On Mon, 19 Mar 2012 09:15:08 +0100, Johannes Pfau <nospam example.com>  
wrote:

 And how do we get the TLS initialization data? If we placed it into an
 array, like DMD does on OSX, we could use dlsym for dlopened libraries,
 but what about initially loaded libraries?

That doesn't work because the symbols would collide.
If you made them local symbols OTOH you can't access
them through dlsym.
Use dl_iterate_phdr to get the initial image.

Mar 22 2012

Johannes Pfau <nospam example.com> writes:

On Fri, 23 Mar 2012 06:06:39 +0100,
"Martin Nowak" <dawg dawgfoto.de> wrote:

 On Mon, 19 Mar 2012 09:15:08 +0100, Johannes Pfau
 <nospam example.com> wrote:
 
 And how do we get the TLS initialization data? If we placed it into
 an array, like DMD does on OSX, we could use dlsym for dlopened
 libraries, but what about initially loaded libraries?

 
 That doesn't work because the symbols would collide.
 If you made them local symbols OTOH you can't access
 them through dlsym.
 Use dl_iterate_phdr to get the initial image.

I just saw your latest work on DSO yesterday (I was looking for a
status update for shared libraries as Android does not officially
support native applications. The supported way is to build a shared
library and load it into a Java app. And indeed, native applications
have some weird corner cases (https://github.com/jpf91/GDC/issues/4).
Android is such a crappy platform for native apps...)
You're doing some awesome work there. I'm not sure what issues gdc had
with the original TLS support, but I guess that your new code will
probably solve those.

I guess the OSX emulated tls code will also be adapted to support
multiple modules? I guess we can just wait until your changes are
merged into DMD and then think about emulated TLS again.

Mar 23 2012

"Martin Nowak" <dawg dawgfoto.de> writes:

 I guess the OSX emulated tls code will also be adapted to support
 multiple modules? I guess we can just wait until your changes are
 merged into DMD and then think about emulated TLS again.

We've already been merging for 3 months or so.

Mar 23 2012

Rainer Schuetze <r.sagitario gmx.de> writes:

On 3/18/2012 12:32 PM, Johannes Pfau wrote:
 I thought about supporting emulated tls a little. The GCC emutls.c
 implementation currently can't work with the gc, as every TLS variable
 is allocated individually and therefore we don't have a contiguous
 memory region for the gc. I think these are the possible solutions:

 * Try to fix GCC's emutls to allocate all tls memory for a module
    (application/shared object) at once. That's the best solution
    and native TLS works this way, but I'm not sure if we can extract
    enough information from the runtime linker to make this work (we
    need at least the combined size of all tls variables).

 * Provide a callback in GCC's emutls which is called after every
    allocation. This could call GC.addRange for every variable, but I
    guess adding huge amounts of ranges is slow.

 * Make it possible to register a custom allocator for GCC's emutls (not
    sure if possible, as this would have to be set up very early in
    application startup). Then allocate the memory directly from the GC
    (but this memory should only be scanned, not collected)

 * Replace the calls to malloc in emutls.c with a custom, region based
    memory allocator. (This is not a perfect solution though, it can
    always happen that we'll need more memory)

Check the implementation of ranges in gcx.d: adding a range is rather
fast (vector-like appending to exponentially growing data). During a
collection the ranges are visited in a simple loop, so performance should
not differ much from scanning a single memory chunk: it's the marking of
references in the scanned data that is expensive.

I would be more concerned about removal of ranges, though. It scans 
existing ranges linearly to find the one to remove and moves the 
remaining entries in memory. Some optimizations might be helpful here.
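
To illustrate the cost difference, here is a small C model of such a range list. It is not the code from gcx.d, just a sketch of the behaviour described above with hypothetical names: adding appends to an exponentially growing array, while removing has to search linearly and shift the remaining entries.

```c
#include <stdlib.h>
#include <string.h>

struct range { void *bot, *top; };

struct range_list {
    struct range *items;
    size_t len, cap;
};

/* Append a range; capacity doubles when full, so adds are amortized O(1). */
int range_add(struct range_list *l, void *bot, void *top)
{
    if (l->len == l->cap) {
        size_t ncap = l->cap ? l->cap * 2 : 16;
        struct range *n = realloc(l->items, ncap * sizeof *n);
        if (n == NULL)
            return -1;
        l->items = n;
        l->cap = ncap;
    }
    l->items[l->len].bot = bot;
    l->items[l->len].top = top;
    l->len++;
    return 0;
}

/* Remove the range starting at bot: linear search, then shift the tail. */
int range_remove(struct range_list *l, void *bot)
{
    for (size_t i = 0; i < l->len; i++) {
        if (l->items[i].bot == bot) {
            memmove(&l->items[i], &l->items[i + 1],
                    (l->len - i - 1) * sizeof l->items[0]);
            l->len--;
            return 0;
        }
    }
    return -1; /* not found */
}
```

With one range per TLS variable and per thread, thread exit would call the removal path once per variable, which is where the linear scan could start to hurt.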

 * Do not use GCC's emutls at all, roll a custom solution. This could be
    compatible with / based on dmd's tls emulation for OSX. Most of the
    implementation is in core.thread, all that's necessary is to group
    the tls data into a _tls_data_array and call ___tls_get_addr for
    every tls access. I'm not sure if this can be done in the
    'middle-end' though and it doesn't support shared libraries yet.

Mar 19 2012

"Martin Nowak" <dawg dawgfoto.de> writes:

Just another point about TLS.

extern(C) /*__thread*/ int foo;

At some point you want to be able to access C++ TLS variables
so emulation should not replace native TLS support.

Mar 23 2012

Jacob Carlborg <doob me.com> writes:

On 2012-03-23 12:55, Martin Nowak wrote:
 Just another point about TLS.

 extern(C) /*__thread*/ int foo;

 At some point you want to be able to access C++ TLS variables
 so emulation should not replace native TLS support.

So C++ TLS is not using the same implementation as the C extension __thread?

-- 
/Jacob Carlborg

Mar 25 2012

"Martin Nowak" <dawg dawgfoto.de> writes:

On Sun, 25 Mar 2012 16:29:25 +0200, Jacob Carlborg <doob me.com> wrote:

 On 2012-03-23 12:55, Martin Nowak wrote:
 Just another point about TLS.

 extern(C) /*__thread*/ int foo;

 At some point you want to be able to access C++ TLS variables
 so emulation should not replace native TLS support.

 So C++ TLS is not using the same implementation as the C extension  
 __thread?

Sorry,
that might have been misleading.
The point I was trying to make is that D's TLS support shouldn't
deviate from the native platform TLS if one is available. I've just
tried it out, and indeed I can access C TLS variables from D.

Mar 25 2012

Jacob Carlborg <doob me.com> writes:

On 2012-03-25 20:34, Martin Nowak wrote:

 Sorry,
 that might have been misleading.
 The point I was trying to make is that D's TLS support shouldn't
 deviate from the native platform TLS if one is available. I've just
 tried it out, and indeed I can access C TLS variables from D.

Ok. Yes, if a native TLS is available that should be used.

-- 
/Jacob Carlborg

Mar 25 2012

Iain Buclaw <ibuclaw ubuntu.com> writes:

On 25 March 2012 21:29, Jacob Carlborg <doob me.com> wrote:
 On 2012-03-25 20:34, Martin Nowak wrote:

 Sorry,
 that might have been misleading.
 The point I was trying to make is that D's TLS support shouldn't
 deviate from the native platform TLS if one is available. I've just
 tried it out, and indeed I can access C TLS variables from D.


 Ok. Yes, if a native TLS is available that should be used.

 --
 /Jacob Carlborg

Native implementations are used in GDC.  We are currently going on
blind faith that all symbols are between _tlsstart and _tlsend though,
and packed together in a contiguous fashion. :~)

-- 
Iain Buclaw

*(p < e ? p++ : p) = (c & 0x0f) + '0';

Mar 26 2012
