digitalmars.D.learn - F*cked by memory corruption after assigning value to associative array

frame <frame86 live.com> writes:
After a while my program crashes.

I can see in the debugger that some strings are overwritten
after a struct is assigned to an associative array.

- I have disabled the GC.
- All happens in the same thread.
- The strings belong to an object collection inside an object
created by a D DLL.
- The object returned by the DLL function is added to the GC with 
GC.addRoot().
- This object also lives in a static array the whole time.
- Not all objects are affected, but many are.
- The struct itself looks okay, and so does the key for the
associative array.

The data is not overwritten by another thread (only one is
running) but by the compiler. I'm watching it by memory location,
and it always becomes visible right after that assignment. But how
is this even possible? In theory, how could I run into this issue?
Jan 25 2021
FeepingCreature <feepingcreature gmail.com> writes:
On Monday, 25 January 2021 at 11:15:28 UTC, frame wrote:
 [...]
 But how is this even possible? In theory, how could I ran into 
 this issue?
That should really not be possible. I suspect the memory used by the original data got reused for the associative array somehow. But if the GC is off from program start, that should really not occur. Do you perhaps turn the GC off before the AA assignment, but after it has already marked that memory as freed? Try turning it off completely from the command line with --DRT-gcopt=gc:manual
Jan 25 2021
frame <frame86 live.com> writes:
On Monday, 25 January 2021 at 11:25:56 UTC, FeepingCreature wrote:

 I suspect the memory used by the original data got reused for 
 the associative array somehow. But if the GC is off from 
 program start, that should really not occur. Do you maybe turn 
 the GC off before the AA assignment, but after it's already 
 marked that memory freed? Try turning it off completely from 
 the commandline with --DRT-gcopt=gc:manual
With that option the bytes behind the strings look different, so it has an impact. But sadly it does not help.

- I have commented out all GC.free() calls.
- GC.disable() is called in main.
- GC.profileStats.numCollections is 0.
Jan 25 2021
vitamin <vit vit.vit> writes:
On Monday, 25 January 2021 at 13:44:52 UTC, frame wrote:
 With that option the bytes behind the strings look different, so it
 has an impact. But sadly it does not help. [...]
Is the object returned from the DLL GC-allocated?
Jan 25 2021
frame <frame86 live.com> writes:
On Monday, 25 January 2021 at 14:34:23 UTC, vitamin wrote:

 Is the object returned from dll GC allocated?
The object is created in the default way; there is no custom allocation. Before the object is returned, it is passed to GC.addRoot(), which should be enough, but maybe I'm wrong. I also tried registering each object that holds the string member with GC.addRoot().
Jan 25 2021
vitamin <vit vit.vit> writes:
On Monday, 25 January 2021 at 15:46:15 UTC, frame wrote:
 The object is created on the default way. No alternating allocation.
 [...]
If "created in the default way" means allocated with new (=> GC), then I don't know where the problem is. But if the object is allocated some other way, for example with malloc or some allocator, then you need to tell the GC about that object with GC.addRange.
Jan 25 2021
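To illustrate vitamin's point about memory the GC did not allocate: a minimal sketch, assuming a hypothetical Node struct kept in C-heap memory that points at GC-allocated data. GC.addRange asks the GC to scan that block for pointers, and GC.removeRange should be called before the block is freed. The names Node, makeNode and freeNode are illustrative only.

```d
import core.memory : GC;
import core.stdc.stdlib : calloc, free;

/// Hypothetical node type: lives in C-heap memory but holds
/// references into the GC heap.
struct Node
{
    string label;  // GC-allocated string
    int[] values;  // GC-allocated array
}

/// Allocates a Node with calloc and registers its bytes with the GC, so the
/// GC scans them for pointers and keeps `label` and `values` alive.
Node* makeNode(string label, int[] values)
{
    auto n = cast(Node*) calloc(1, Node.sizeof); // zeroed: no stray pointers
    if (n is null)
        assert(0, "out of memory");
    GC.addRange(n, Node.sizeof); // register before storing GC pointers
    n.label = label;
    n.values = values;
    return n;
}

/// Unregisters the range *before* returning the memory to the C heap.
void freeNode(Node* n)
{
    GC.removeRange(n);
    free(n);
}

void main()
{
    auto n = makeNode("hello".idup, [1, 2, 3]);
    GC.collect(); // this code keeps the arrays only in *n; addRange lets the GC see them
    assert(n.label == "hello");
    assert(n.values == [1, 2, 3]);
    freeNode(n);
}
```

Registering the range on zeroed memory before storing the GC pointers avoids a window in which a collection could run without knowing about them.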
frame <frame86 live.com> writes:
On Monday, 25 January 2021 at 16:14:05 UTC, vitamin wrote:

 If created on the default way mean allocated with new (=> GC) 
 then I don't known where is problem, but if the object is 
 allocated with other way, for example malloc, some allocator 
 then you need tell GC about that object with GC.addRange.
Yes, with the simple new operator. I forgot to mention: the DLL itself calls another DLL.

I made the following observation; I don't know if it makes any sense. Assuming that the executable and the DLLs have their own GC instances, it seems that the --DRT option is ignored by the DLLs, or they do not see the command line. But using extern (C) __gshared string[] rt_options = ["gcopt=gc:manual"]; in main and in the DLLs keeps the memory intact. Setting gc:conservative in a DLL corrupts it again. Also, GC.profileStats.numCollections shows that collection cycles ran when the memory got corrupted.

So does that mean that GC.addRoot() isn't "global", and must be called after the DLL function has returned the object (to adopt it), or within the DLL itself? Or both?
Jan 25 2021
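The rt_options approach frame describes can be sketched as a standalone program. This assumes a single binary; per frame's observation, each EXE and DLL carrying its own runtime would need its own copy of the variable. GC.profileStats().numCollections gives a quick check of which GC a binary is really running: with gc:manual the count is expected to stay at 0, whereas the default conservative GC would normally report at least one cycle after the explicit GC.collect(). The helper collectionsAfterChurn is a hypothetical name.

```d
import core.memory : GC;
import std.stdio : writeln;

// Options baked into this binary; the D runtime reads them during startup,
// before main() runs. Each EXE/DLL with its own runtime needs its own copy.
extern (C) __gshared string[] rt_options = ["gcopt=gc:manual"];

/// Creates some garbage, asks for a collection and returns the number of
/// collection cycles this binary's GC reports.
size_t collectionsAfterChurn()
{
    foreach (i; 0 .. 1_000)
    {
        auto scratch = new ubyte[](4_096); // garbage after each iteration
        scratch[0] = cast(ubyte) i;
    }
    GC.collect(); // expected to be a no-op for the manual GC
    return GC.profileStats().numCollections;
}

void main()
{
    writeln("collections reported: ", collectionsAfterChurn());
}
```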
vitamin <vit vit.vit> writes:
On Monday, 25 January 2021 at 16:44:40 UTC, frame wrote:
 Yes with simple new operator. Forgot to mention: the DLL itself
 calls a DLL. [...]
If there are separate GCs for the DLLs, then you must call GC.addRoot() from the DLL where you allocate that object.
Jan 25 2021
frame <frame86 live.com> writes:
On Monday, 25 January 2021 at 16:54:42 UTC, vitamin wrote:
 If there are separated GCs for DLLs, then you must call
 GC.addRoot() from DLL where you allocate that object.
Yes, I call it directly in every function that returns an object:

T fun(T)(T object) {
  GC.addRoot(cast(void*) object);
}
...
extern (C) export SomeObject bar() {
    return fun(new SomeObject);
}

Wrong way?
Jan 25 2021
frame <frame86 live.com> writes:
On Monday, 25 January 2021 at 17:11:37 UTC, frame wrote:
 Wrong way?
Please, someone correct me if I'm getting this wrong.

Structure:

EXE/Main Thread:
- GC: manual
- requests object A from DLL 1
- GC knows about object A

DLL/Thread 1:
- GC: conservative
- allocates new object A -> addRoot(object A), returns it to the EXE (out param)
- requests object B from DLL 2
- GC knows about object A and object B
- requests sub objects of object B later

DLL/Thread 2:
- GC: manual
- allocates new object B -> addRoot(object B), returns it to DLL 1 (out param)
- GC knows about object B
- allocates sub objects of object B when DLL 1 requests them, returns them to DLL 1 (out param)
- sub objects are stored in object B
- the memory of object B's sub objects gets corrupted after DLL 1 becomes the active thread again

In this scenario only DLL 1 can cause the corruption, as it does not occur if all GCs are set to manual.

At this point I am confused about how memory allocation is ensured. Each thread should have its own memory area assigned. Each GC adopts the root through the returned object and knows about that area too. But when DLL 1 becomes active, it writes into the sub-object memory of DLL 2. It can only do that because it has adopted the root of object B, but then why doesn't DLL 1 see that the sub objects of B are still alive?
Jan 26 2021
frame <frame86 live.com> writes:
On Tuesday, 26 January 2021 at 14:31:58 UTC, frame wrote:
 but why does not see DLL 1 then that sub objects of
 B are still alive?
I may be fooling myself, but could it be caused by slice data that is already gone? It really looks like only a specific string property is corrupted, one that got the same slice data as an input parameter. I thought the slice data would stay referenced by the persistent object anyway, but the GC doesn't seem smart enough to detect this. So far, the error can be prevented with .dup.
Jan 27 2021
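frame's .dup observation can be reproduced without any DLL. A minimal sketch, assuming a hypothetical Record class: storing an incoming slice only stores a reference to someone else's memory, while .idup makes a copy owned by the receiving side's GC. Overwriting the caller's buffer here stands in for the other runtime freeing and reusing that memory.

```d
import std.stdio : writeln;

/// Hypothetical long-lived object that receives text from a caller.
class Record
{
    const(char)[] borrowed; // aliases the caller's memory: no copy is made
    string owned;           // private copy allocated by this side's GC

    this(const(char)[] input)
    {
        borrowed = input;
        owned = input.idup; // same idea as the .dup fix described above
    }
}

void main()
{
    char[] buffer = "hello".dup; // memory owned by the "caller"
    auto r = new Record(buffer);

    // Stand-in for the owning side freeing and reusing that memory.
    buffer[] = 'X';

    writeln(r.borrowed); // the stored slice changed underneath the object
    writeln(r.owned);    // the copy is unaffected
}
```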
Ali Çehreli <acehreli yahoo.com> writes:
On 1/26/21 6:31 AM, frame wrote:

 all GCs
Multiple D runtimes? That might work, I guess, but I've never heard of anybody talking about having multiple runtimes. Does rt_init() initialize *a* D runtime or *the* D runtime? If it indeed works, we definitely need much better documentation.

I load my libraries with loadLibrary[1] so that "[if] the library contains a D runtime it will be integrated with the current runtime."

Ali

[1] https://dlang.org/library/core/runtime/runtime.load_library.html
Jan 27 2021
frame <frame86 live.com> writes:
On Wednesday, 27 January 2021 at 17:41:05 UTC, Ali Çehreli wrote:
 Multiple D runtimes? That might work I guess but I've never heard
 of anybody talking about having multiple runtimes. [...] I load my
 libraries with loadLibrary[1] so that "[if] the library contains a
 D runtime it will be integrated with the current runtime."
I have no idea whether there are multiple runtimes. I just use the SimpleDllMain mixin. But there must be multiple GC instances running, because:

1) the command line argument --DRT-gcopt=gc:manual was seen by the EXE but ignored by the DLL, and it still crashed;
2) after "burning in" gc:manual in the DLL, GC.profileStats.numCollections shows 0 in one DLL thread and > 0 in the other DLL thread, which then crashed.

Or my debugger lied to me. I also use loadLibrary.
Jan 27 2021
tsbockman <thomas.bockman gmail.com> writes:
On Wednesday, 27 January 2021 at 18:09:39 UTC, frame wrote:
 there must be multiple instances of GCs running because
Sharing data between multiple threads that each use a different instance of the D GC will definitely not work right, because each GC will only know to pause the threads and scan the roots that it has been directly informed of. There is supposed to only be one instance of the D GC running per process. If you have more than one running then either you aren't linking and loading the DLLs correctly, or you have run into a serious bug in the D tooling.
 Or my debugger lied to me.
I have found the gdb debugger on Linux often breaks horribly on my D code, especially when it is multi-threaded, and the debugger is only semi-usable. Maybe the Windows debugger is better now? (I wouldn't know, since I haven't used it in a while.) I think skepticism is warranted here.
Jan 27 2021
frame <frame86 live.com> writes:
On Wednesday, 27 January 2021 at 22:57:11 UTC, tsbockman wrote:

 There is supposed to only be one instance of the D GC running 
 per process. If you have more than one running then either you 
 aren't linking and loading the DLLs correctly, or you have run 
 into a serious bug in the D tooling.
What could I be doing wrong by just using SimpleDllMain and then adding my exports? The build line for the DLL is: rdmd -shared --build-only -gf -m64

Under Linux everything is shared. Under Windows each DLL seems to run in its own thread, has its own rt_options and does not see any __gshared variable value. It's completely isolated, so I assume the GC is too. Also, https://wiki.dlang.org/Win32_DLLs_in_D says: "Each EXE and DLL will have their own gc instance."

I also wonder why the statically linked DLL should use a GC proxy, whereas SimpleDllMain does nothing with a proxy. Should loadLibrary() take care of this automatically? It seems it does not.
Jan 27 2021
tsbockman <thomas.bockman gmail.com> writes:
On Thursday, 28 January 2021 at 07:50:43 UTC, frame wrote:
 Under Linux everything is shared. Under Windows each DLL seems 
 to run in its own thread, has its own rt_options and do not see 
 any __gshared variable value. Its completely isolated and so I 
 assume that also GC is.
This stuff works correctly under Linux, and is quite broken on Windows. This has been known for years, but hasn't been fixed yet. The link in my other reply gives more details: https://forum.dlang.org/post/veeksndchoppftlujrwl@forum.dlang.org
 Also https://wiki.dlang.org/Win32_DLLs_in_D says: Each EXE and 
 DLL will have their own gc instance.
They each have their own GC instance because no one has fully fixed the problems discussed at my link, above, not because it's actually a good idea for them each to have their own GC instance.

It is possible to get things sort of working on Windows, anyway. But this requires either:

A) Following all the same rules that you would need to follow if you wanted to share D GCed memory with another thread written in C. (Just adding GC roots is not enough.)

B) Ensuring that the GC proxy connections are properly established before doing anything else. This doesn't actually work correctly or reliably, but it might work well enough for your use case. Maybe.
Jan 28 2021
frame <frame86 live.com> writes:
On Thursday, 28 January 2021 at 19:22:16 UTC, tsbockman wrote:
 It is possible to get things sort of working with on Windows, 
 anyway.
I'm OK with it as long as the memory is not reused by the GC, and it seems that addRoot() successfully prevents that. The other problem, with shared slice data, is somewhat logical: the DLL's GC doesn't care where data coming from another thread originated, and the GC that owns the data sees every reference to it gone once it has been passed to the DLL function. The GCs are isolated, so data that must be kept longer should be copied where necessary.
Jan 28 2021
tsbockman <thomas.bockman gmail.com> writes:
On Thursday, 28 January 2021 at 20:17:09 UTC, frame wrote:
 I'm ok with it as long as the memory is not re-used by the GC.
 It seems that it can be prevented with addRoot() successfully.
GC.addRoot is not enough by itself. Each GC needs to know about every single thread that may own or mutate any pointer to memory managed by that particular GC. If a GC doesn't know, memory may be prematurely freed, and therefore wrongly re-used.

This is because when it scans memory for pointers to find out which memory is still in use, an untracked thread may be hiding a pointer on its stack or in registers, or it might move a pointer value from a location late in the scanning order to a location early in the scanning order while the GC is scanning the memory in between, such that the pointer value is not in either location *at the time the GC checks it*.

You won't be able to test for this problem easily, because it is non-deterministic and depends upon the precise timing with which each thread is scheduled and memory is synchronized. But, it will probably still bite you later.

If you were just manually creating additional threads unknown to the GC, you could tell the GC about them with core.thread.osthread.thread_attachThis and thread_detachThis. But, I don't know if those work right when there are multiple disconnected copies of D runtime running at the same time like this.

The official solution is to get the GC proxy connected properly from each DLL to the EXE. This is still very broken on Windows in other ways (again, explained at my link), but it should at least prevent the race condition I described above, as well as being more efficient than running multiple GCs in parallel.

Alternatively, you can design your APIs so that no pointer to GC memory is ever owned or mutated by any thread unknown to that GC. (This is the only option when working across language boundaries.)
Jan 28 2021
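One way to follow tsbockman's last suggestion is to return results across the DLL boundary in memory that no GC manages. A minimal sketch; makeGreeting and freeGreeting are hypothetical names, not an existing API. The caller copies the bytes into its own GC heap and hands the buffer back to the module that allocated it, since separately linked binaries may not share a C heap.

```d
import core.stdc.stdlib : malloc, free;
import core.stdc.string : memcpy;

/// Hypothetical exported function: hands its result to the caller in
/// C-heap memory, so no GC (on either side) owns or scans the buffer.
extern (C) export char* makeGreeting(size_t* len) nothrow @nogc
{
    static immutable text = "hello from the DLL";
    auto p = cast(char*) malloc(text.length);
    if (p is null)
    {
        *len = 0;
        return null;
    }
    memcpy(p, text.ptr, text.length);
    *len = text.length;
    return p;
}

/// Hypothetical counterpart: the buffer goes back to the module that
/// allocated it, which frees it with the same C runtime.
extern (C) export void freeGreeting(char* p) nothrow @nogc
{
    free(p);
}

void main()
{
    import std.stdio : writeln;

    size_t n;
    char* p = makeGreeting(&n);
    string copy = p[0 .. n].idup; // the caller's own GC now owns this copy
    freeGreeting(p);
    writeln(copy);
}
```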
frame <frame86 live.com> writes:
On Thursday, 28 January 2021 at 22:11:40 UTC, tsbockman wrote:

 Alternatively, you can design your APIs so that no pointer to 
 GC memory is ever owned or mutated by any thread unknown to 
 that GC. (This is the only option when working across language 
 boundaries.)
Yes, thank you for your input. I was already thinking about that, as it is the better design and also wouldn't require building much redundant code into the DLL itself. However, if I run GC.collect() after every DLL function has finished, it clearly shows which data goes away and bites me. Basically, most objects are allocated in the main EXE/DLL anyway; it only gets into trouble where a new, separate object is returned by the sub-DLL, and GC.addRoot() really solves that problem.
Jan 29 2021
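frame's GC.collect() trick can be wrapped in a small debugging helper that turns a rare, timing-dependent corruption into an early, repeatable one. A sketch under assumptions: callAndCollect and makeValues are hypothetical names, and GC.collect() only collects in the runtime that calls it, so with one GC per DLL the equivalent call would also have to run inside each DLL.

```d
import core.memory : GC;

/// Hypothetical debugging wrapper: calls `fn`, then forces a full collection
/// in *this* binary's GC, so memory the callee failed to keep reachable is
/// reclaimed right away instead of at some unpredictable later point.
auto callAndCollect(alias fn, Args...)(auto ref Args args)
{
    static if (is(typeof(fn(args)) == void))
    {
        fn(args);
        GC.collect();
    }
    else
    {
        auto result = fn(args);
        GC.collect();
        return result;
    }
}

/// Stand-in for a function imported from a DLL.
int[] makeValues(int n)
{
    auto a = new int[](n);
    foreach (i, ref x; a)
        x = cast(int) i;
    return a;
}

void main()
{
    import std.stdio : writeln;

    auto values = callAndCollect!makeValues(5);
    writeln(values); // intact: `values` is a live reference
}
```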
tsbockman <thomas.bockman gmail.com> writes:
On Wednesday, 27 January 2021 at 18:09:39 UTC, frame wrote:
 I have no idea if there are multiple runtimes. I just use the 
 mixin SimpleDllMain. But there must be multiple instances of 
 GCs running because
Another thread is running right now which I think is touching upon these same issues. Adam D. Ruppe explains some of what's going on: https://forum.dlang.org/post/veeksndchoppftlujrwl@forum.dlang.org

Sadly, it looks like shared D DLLs are just kind of broken on Windows, unless you want to go the betterC route...
Jan 27 2021
ShadoLight <ettienne.gilbert gmail.com> writes:
On Monday, 25 January 2021 at 17:11:37 UTC, frame wrote:
 Yes, I directly calling on every function that returns an 
 object:

 T fun(T)(T object) {
   GC.addRoot(cast(void*) object);
 }
 ...
 extern (C) export SomeObject bar() {
     return fun(new SomeObject);
 }
Just to confirm... I assume you just neglected to show the line in the fun template function that returns the object, right? Like...

T fun(T)(T object) {
  GC.addRoot(cast(void*) object);
  return object;
}
Jan 29 2021
frame <frame86 live.com> writes:
On Friday, 29 January 2021 at 15:09:23 UTC, ShadoLight wrote:

 Just to confirm... I assume you just neglected to show the line 
 in fun template function that returns the object, right?
Yes, that's pseudocode with a missing return :D
Jan 29 2021
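Putting the fix ShadoLight points out together with a matching release function gives roughly the following. A sketch only: SomeObject, pinned and releaseObject are hypothetical names, and it assumes the object is rooted and unrooted by the same module (and therefore the same GC instance) that allocated it, as vitamin recommends.

```d
import core.memory : GC;

/// Hypothetical payload type handed out across the module boundary.
class SomeObject
{
    string name = "payload";
}

/// Roots `obj` in the GC of the module that allocated it and returns it
/// (the `return` that the pseudocode in the thread leaves out).
T pinned(T)(T obj) if (is(T == class))
{
    GC.addRoot(cast(void*) obj);
    return obj;
}

/// Exported factory: the object is rooted before it leaves the module.
extern (C) export SomeObject bar()
{
    return pinned(new SomeObject);
}

/// Hypothetical counterpart the caller invokes when it is done with the
/// object, so the GC that owns it can eventually reclaim it.
extern (C) export void releaseObject(SomeObject obj)
{
    GC.removeRoot(cast(void*) obj);
}

void main()
{
    auto obj = bar();
    GC.collect();       // rooted: survives regardless of who holds it
    assert(obj.name == "payload");
    releaseObject(obj); // unrooted: collectable once unreferenced
}
```

Without a release call, every object handed out this way stays rooted for the lifetime of the process, which is effectively a leak.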
Ali Çehreli <acehreli yahoo.com> writes:
On 1/25/21 8:14 AM, vitamin wrote:

 If created on the default way mean allocated with new (=> GC)
I had the same thought. The following would be the "default way" for me, but passing that object's address to addRoot would be wrong:

import core.memory;

struct S {
}

void main() {
  auto a = S();
  GC.addRoot(&a);    // Wrong: 'a' is on the stack
}

I'm pretty sure frame knows this but still...

Ali
Jan 25 2021
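For contrast with Ali's example, a minimal sketch of what GC.addRoot is meant to receive: a pointer to GC-heap memory, such as the result of new S. The int field and the helper names makeRooted and unroot are illustrative additions, not part of Ali's code.

```d
import core.memory : GC;

struct S
{
    int x; // illustrative field, added for the demonstration
}

/// Allocates an S on the GC heap and roots it; this heap pointer (not the
/// address of a stack variable) is what GC.addRoot expects.
S* makeRooted(int x)
{
    S* p = new S(x);
    GC.addRoot(p);
    return p;
}

/// Drops the root again once the object no longer needs pinning.
void unroot(S* p)
{
    GC.removeRoot(p);
}

void main()
{
    auto onStack = S(1);       // stack memory: never pass &onStack to addRoot
    S* onHeap = makeRooted(2); // GC heap: rooted until unroot() is called
    GC.collect();
    assert(onStack.x == 1 && onHeap.x == 2);
    unroot(onHeap);
}
```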