digitalmars.D - btdu - a sampling disk usage profiler for btrfs (written in D)

reply Vladimir Panteleev <thecybershadow.lists gmail.com> writes:
https://blog.cy.md/2020/11/08/btdu-sampling-disk-usage-profiler-for-btrfs/

https://github.com/CyberShadow/btdu

D-related thoughts:

- D programs that build fine on one Linux machine may still fail 
to build with mysterious linking errors on another, even when 
using Dub which takes care of dependency management. I saw two 
counts of this, caused by differences in DMD/LDC and Arch/Debian 
(one being that, for whatever reason, libz is not pulled in on 
LDC/Debian despite being a Phobos dependency). Also, LDC is the D 
compiler that's installed by default when the system wants a D 
compiler (e.g. if you try to install Dub by itself).

- The garbage collector is still a major hindrance for system 
programming. In this case it was due to the ioctls used being 
slow, and when the GC tries to stop the world to do its thing, it 
just hangs the entire program until ALL ioctls in all threads 
complete. This means it wasn't possible to have a stutter-free 
interactive UI, so I had to move processing to subprocesses.

- One user wondered why the program needed so many threads. The 
answer was that half of them were owned by the GC (it never stops 
its worker threads, they just sit idle).

- I used the Deimos ncurses bindings package. I'm thankful that 
it already existed, though I had to push some fixes to fix static 
linking. The most annoying part was waiting overnight for 
code.dlang.org to pick up the new tags, because there is no way 
to get it to update a package unless you're the owner, and no way 
to otherwise specify a dependency unless using a branch (which is 
deprecated and prints a big warning when your users build your 
program).

- Nice D features that came in useful: reflection to generate a 
lightweight serializer/deserializer for subprocess communication; 
strings as slices to allow processing them without copying them 
out of the network buffer; and template mixins to add common 
behavior to types without runtime polymorphism.
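
To make the last point concrete, here is a minimal sketch of how those three features can combine. It is not the code btdu uses; `Dumpable`, `Sample`, and the `name=value;` text format are hypothetical names chosen for illustration.

```d
import std.stdio : writeln;
import std.string : indexOf;

// Hypothetical template mixin: adds a field-by-field text dump to any struct.
// The field list is resolved at compile time, so no base class or virtual
// call is involved.
mixin template Dumpable()
{
    string dump()
    {
        import std.conv : to;
        import std.traits : FieldNameTuple;

        alias names = FieldNameTuple!(typeof(this)); // compile-time reflection
        string result;
        foreach (i, field; this.tupleof)             // unrolled per field
            result ~= names[i] ~ "=" ~ field.to!string ~ ";";
        return result;
    }
}

// Hypothetical message type; any struct can opt in with one line.
struct Sample
{
    ulong offset;
    string path;
    mixin Dumpable;
}

void main()
{
    string line = Sample(42, "/home").dump();
    writeln(line);                     // offset=42;path=/home;

    // Slicing: `first` is a view into `line`, not a copy.
    auto end = line.indexOf(';');
    string first = line[0 .. end];
    assert(first.ptr is line.ptr);     // same memory, different length
    writeln(first);                    // offset=42
}
```

A real serializer would also need escaping and a matching deserializer; the sketch only shows the mechanism.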
Nov 08 2020
next sibling parent reply user1234 <user1234 12.de> writes:
I like the report about how efficient D was for developing this tool. Otherwise, what do you use it for? What is the typical usage of such tools?
Nov 09 2020
parent reply Vladimir Panteleev <thecybershadow.lists gmail.com> writes:
On Monday, 9 November 2020 at 12:21:55 UTC, user1234 wrote:
 I like the report about how efficient D was for developing this 
 tool. Otherwise, what do you use it for? What is the typical 
 usage of such tools?
Well, the README and linked blog post answer that to some extent, but my personal use cases are actually tangential to D, so I can write more about that here.

I've been using btrfs on my home system ever since switching to Linux full-time, and a few years ago I switched the server (hosting this forum / the wiki / some other services) over to it too. This allowed us to have incremental, atomic, hourly, off-site backups, which actually saved our butts big-time when the hosting provider decided to shut off the server over a clerical issue in the distant year of 2019. Some snapshots are also retained for a while to allow rollbacks or undeleting files in case I fat-finger something during maintenance.

One of btrfs's boons is that, across subvolumes and clones, deduplication allows reusing the same unique block across many files and snapshots, which saves space and is also what enables atomic snapshots to work (with successive writes being COW). If you add compression on top of that, it can be challenging to understand what is actually using how much space. Since storage costs are not insignificant on a FOSS budget, space does need to be managed, and I was missing a tool that would help do this.

Another unique benefit of btdu is that it starts displaying results almost instantly, which is great when the disk is full, everything is on fire, and you need to free up some disk space right now.
Nov 09 2020
parent user1234 <user1234 12.de> writes:
All right, it's clearer now. Thanks for the clarifications ;)
Nov 09 2020
prev sibling next sibling parent reply matheus <matheus gmail.com> writes:
On Sunday, 8 November 2020 at 17:23:32 UTC, Vladimir Panteleev 
wrote:
 ...
 - The garbage collector is still a major hindrance for system 
 programming. In this case it was due to the ioctls used being 
 slow, and when the GC tries to stop the world to do its thing, 
 it just hangs the entire program until ALL ioctls in all 
 threads complete. This means it wasn't possible to have a 
 stutter-free interactive UI, so I had to move processing to 
 subprocesses.
 ...
I read about GC issues like this very often, and my question is: can't the GC be set to run without collecting anything, and then be told manually to collect after a process is finished?

Matheus.
Nov 09 2020
parent reply Vladimir Panteleev <thecybershadow.lists gmail.com> writes:
On Monday, 9 November 2020 at 13:33:50 UTC, matheus wrote:
 I read about GC issues like this very often and my question is: 
 Can't GC be set just to run without collecting anything, and 
 manually set it to collect after a process is finished?
You can disable the GC and you can run it manually, but this wouldn't help in this case, because the ioctls are run across threads in an overlapping way.

It would be possible if the program were designed such that every once in a while, the main thread tells all worker threads "OK, let's do a GC, so nobody start any new ioctls for now", and when the last ioctl finishes, it runs the GC and then lets the worker threads start ioctls again. But this means that up to all but one of the worker threads are idle, waiting for the last ioctl to finish. ioctl duration varies from milliseconds to seconds in this case, so it would noticeably affect throughput.
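
For reference, disabling the GC and collecting manually looks roughly like the sketch below, which uses `core.memory.GC` from druntime. `processBatch` is a hypothetical stand-in for real work. Note that the explicit `GC.collect()` is still a stop-the-world collection, so it does not avoid the wait on in-flight ioctls described above.

```d
import core.memory : GC;

// Hypothetical workload: allocates from the GC heap and returns the
// total number of bytes it asked for.
size_t processBatch(size_t count)
{
    size_t total;
    foreach (i; 0 .. count)
    {
        auto chunk = new ubyte[](1024); // still a GC allocation
        total += chunk.length;
    }
    return total;
}

void main()
{
    // Suppress automatic collections. The runtime may still collect if it
    // would otherwise run out of memory.
    GC.disable();
    scope (exit) GC.enable();

    auto total = processBatch(1_000);
    assert(total == 1_000 * 1024);

    // Collect at a point of our choosing. All threads are still suspended
    // for the duration of the collection.
    GC.collect();
}
```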
Nov 09 2020
parent Steven Schveighoffer <schveiguy gmail.com> writes:
It would still help, I think, because the UI, for instance, is probably not running ioctls, so it wouldn't pause while you are waiting for the ioctl-running threads to finish.

-Steve
Nov 09 2020
prev sibling parent reply Jacob Carlborg <doob me.com> writes:
On Sunday, 8 November 2020 at 17:23:32 UTC, Vladimir Panteleev 
wrote:

 - D programs that build fine on one Linux machine may still 
 fail to build with mysterious linking errors on another, even 
 when using Dub which takes care of dependency management. I saw 
 two counts of this, caused by differences in DMD/LDC and 
 Arch/Debian (one being that, for whatever reason, libz is not 
 pulled in on LDC/Debian despite being a Phobos dependency). 
 Also, LDC is the D compiler that's installed by default when 
 the system wants a D compiler (e.g. if you try do install Dub 
 by itself).
I don't think this is specific to D. In the past I've seen problems caused by package maintainers not building a package the same way as upstream, or splitting one package up into multiple packages.
 - The garbage collector is still a major hindrance for system 
 programming. In this case it was due to the ioctls used being 
 slow, and when the GC tries to stop the world to do its thing, 
 it just hangs the entire program until ALL ioctls in all 
 threads complete.
You should probably never let the GC run on a realtime thread, like audio or video processing (not sure if ioctls fall into this category). These days, modern UIs should probably fall into the realtime category.
 This means it wasn't possible to have a stutter-free 
 interactive UI, so I had to move processing to subprocesses.
I'm not sure if it's possible to ever have a completely stutter-free UI with a stop-the-world GC.
 - One user wondered why the program needed so many threads. The 
 answer was that half of them were owned by the GC (it never 
 stops its worker threads, they just sit idle).
Is that the answer? I mean, the GC doesn't create any threads by itself, does it?
 - I used the Deimos ncurses bindings package. I'm thankful that 
 it already existed, though I had to push some fixes to fix 
 static linking. The most annoying part was waiting overnight 
 for code.dlang.org to pick up the new tags, because there is no 
 way to get it to update a package unless you're the owner, and 
 no way to otherwise specify a dependency unless using a branch 
 (which is deprecated and prints a big warning when your users 
 build your program).
Since 2.094.0, you can specify a Git repository as a dependency [1]. You can also specify a local path as a dependency [2], useful when developing a library and an application at the same time, as two separate Dub packages.

[1] https://dlang.org/changelog/2.094.0.html#git-paths
[2] https://dub.pm/package-format-sdl.html#version-specs
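
As a rough illustration of both forms in a `dub.sdl` file (the package names, repository URL, and path below are placeholders, not real packages; for a Git dependency the version is a commit hash):

```sdl
// Git repository as a dependency, pinned to a specific commit
dependency "some-lib" repository="git+https://github.com/example/some-lib.git" version="<full commit hash>"

// Local path as a dependency, handy while developing both packages together
dependency "other-lib" path="../other-lib"
```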
Nov 10 2020
parent reply Vladimir Panteleev <thecybershadow.lists gmail.com> writes:
On Tuesday, 10 November 2020 at 09:40:33 UTC, Jacob Carlborg 
wrote:
 On Sunday, 8 November 2020 at 17:23:32 UTC, Vladimir Panteleev 
 wrote:

 - D programs that build fine on one Linux machine may still 
 fail to build with mysterious linking errors on another, even 
 when using Dub which takes care of dependency management. I 
 saw two counts of this, caused by differences in DMD/LDC and 
 Arch/Debian (one being that, for whatever reason, libz is not 
 pulled in on LDC/Debian despite being a Phobos dependency). 
 Also, LDC is the D compiler that's installed by default when 
 the system wants a D compiler (e.g. if you try do install Dub 
 by itself).
I don't think this is specific to D. I've seen in the past problems caused by package maintainers not building the package in the same way as upstream. Or they split up a package in multiple packages.
I think it might be less of a problem in e.g. Go.
 - The garbage collector is still a major hindrance for system 
 programming. In this case it was due to the ioctls used being 
 slow, and when the GC tries to stop the world to do its thing, 
 it just hangs the entire program until ALL ioctls in all 
 threads complete.
You should probably never let the GC run on a realtime thread, like audio or video processing (not sure if ioctls falls into this category). These days, modern UIs should probably fall into the realtime category.
Doing UI without GC in D would be pretty painful. But, by itself, the GC doesn't add enough latency to introduce stutter in the UI: a GC scan is generally quick enough that the UI doesn't feel laggy or stuttery. The problem is that the GC is waiting for all threads to finish their ioctls, while the program is otherwise completely suspended. This affects not just the UI, but throughput.
 - One user wondered why the program needed so many threads. 
 The answer was that half of them were owned by the GC (it 
 never stops its worker threads, they just sit idle).
Is that the answer? I mean, the GC doesn't create any threads by itself, does it?
Yes, it does, since the introduction of parallel heap scanning in 2.087: https://dlang.org/changelog/2.087.0.html#gc_parallel
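
If the extra threads are unwanted, that changelog entry describes a `parallel` GC option. A minimal sketch of embedding it in the program follows; the same setting can also be passed on the command line as `--DRT-gcopt=parallel:0`.

```d
// Embedded runtime option: ask the GC not to start additional marking
// threads. `rt_options` is the symbol druntime looks for at startup.
extern(C) __gshared string[] rt_options = ["gcopt=parallel:0"];

void main()
{
    import std.stdio : writeln;
    // With parallel:0, marking runs only on the thread that triggers
    // the collection.
    writeln("GC options: ", rt_options);
}
```

This trades away faster heap scanning on multi-core machines for a lower thread count, so whether it is worth it depends on the program.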
 - I used the Deimos ncurses bindings package. I'm thankful 
 that it already existed, though I had to push some fixes to 
 fix static linking. The most annoying part was waiting 
 overnight for code.dlang.org to pick up the new tags, because 
 there is no way to get it to update a package unless you're 
 the owner, and no way to otherwise specify a dependency unless 
 using a branch (which is deprecated and prints a big warning 
 when your users build your program).
Since 2.094.0, you can specify a Git repository as a dependency [1]. You can also specify a local path as a dependency [2], useful when developing a library and an application at the same time, as two separate Dub packages. [1] https://dlang.org/changelog/2.094.0.html#git-paths [2] https://dub.pm/package-format-sdl.html#version-specs
This is super useful. Thanks.
Nov 10 2020
parent reply Ola Fosheim Grøstad <ola.fosheim.grostad gmail.com> writes:
On Tuesday, 10 November 2020 at 10:42:09 UTC, Vladimir Panteleev 
wrote:
 But, by itself the GC doesn't add much latency to introduce 
 stutter in the UI - a GC scan is generally quick enough that 
 the UI doesn't feel laggy or stuttery. The problem is that the 
 GC is waiting for all threads to finish their ioctls, while the 
 program otherwise is completely suspended. This affects not 
 just UI, but throughput.
Would a thread local GC with reference counted shared objects work for your use case?
Nov 10 2020
parent reply Vladimir Panteleev <thecybershadow.lists gmail.com> writes:
On Tuesday, 10 November 2020 at 13:55:52 UTC, Ola Fosheim Grøstad 
wrote:
 On Tuesday, 10 November 2020 at 10:42:09 UTC, Vladimir 
 Panteleev wrote:
 But, by itself the GC doesn't add much latency to introduce 
 stutter in the UI - a GC scan is generally quick enough that 
 the UI doesn't feel laggy or stuttery. The problem is that the 
 GC is waiting for all threads to finish their ioctls, while 
 the program otherwise is completely suspended. This affects 
 not just UI, but throughput.
Would a thread local GC with reference counted shared objects work for your use case?
I don't think there is a simple answer here.

Removing the global GC lock for allocations, and allowing each thread to allocate from its own private pool, would greatly improve the performance of multi-threaded applications. For example, the global GC lock was what was preventing moving more processing in Dustmite to worker threads: currently, it's often better to keep everything in one thread for GC-dependent code instead of using worker threads, specifically because of the overhead of the global GC lock. I think such a modification would be possible without radical changes to the language or GC design, but it's possible I'm missing something.

However, that wouldn't help in this case, because the problem here doesn't come from allocations, but from the stop-the-world aspect of the GC.

A theoretical non-stop-the-world GC would indeed help in this situation, but such a GC is only possible if you restrict the language to a subset, such that all copies of managed objects are always visible to the compiler. It would require all @system / extern(C) code to be carefully re-scrutinized. In short, this would essentially be a different language (based on D). I don't think we can get there from where we are now.
Nov 27 2020
parent reply Ola Fosheim Grostad <ola.fosheim.grostad gmail.com> writes:
On Friday, 27 November 2020 at 10:20:41 UTC, Vladimir Panteleev 
wrote:
 However, that wouldn't help in this case, because the problem 
 here doesn't come from allocations, but from the stop-the-world 
 aspect of the GC.

 A theoretical non-stop-the-world GC would indeed help in this 
 situation, but such a GC is only possible if you restrict the 
 language to a subset, such that all copies of managed objects 
 are always visible to the compiler. It would require all 
 @system / extern(C) code to be carefully re-scrutinized. In 
 short, this would essentially be a different language (based on 
 D). I don't think we can get there from where we are now.
Hm, but it would only stop a single thread. You would not be allowed to share nonpinned objects with other threads.
Nov 27 2020
parent reply Vladimir Panteleev <thecybershadow.lists gmail.com> writes:
On Friday, 27 November 2020 at 10:26:18 UTC, Ola Fosheim Grostad 
wrote:
 On Friday, 27 November 2020 at 10:20:41 UTC, Vladimir Panteleev 
 wrote:
 However, that wouldn't help in this case, because the problem 
 here doesn't come from allocations, but from the 
 stop-the-world aspect of the GC.

 A theoretical non-stop-the-world GC would indeed help in this 
 situation, but such a GC is only possible if you restrict the 
 language to a subset, such that all copies of managed objects 
 are always visible to the compiler. It would require all 
 @system / extern(C) code to be carefully re-scrutinized. In 
 short, this would essentially be a different language (based 
 on D). I don't think we can get there from where we are now.
Hm, but it would only stop a single thread. You would not be allowed to share nonpinned objects with other threads.
Right, so that's another imposed limitation of such a GC. You'd also still lose the ability to memcpy or memset a struct that has managed pointers, as that would break the reference count that the GC relies on to work. It would definitely solve the performance problem, but it would be such a radical change that it would essentially be a different language (and arguably no longer a system-programming one).
Nov 27 2020
parent reply Ola Fosheim Grostad <ola.fosheim.grostad gmail.com> writes:
On Friday, 27 November 2020 at 10:31:21 UTC, Vladimir Panteleev 
wrote:
 Right, so that's another imposed limitation of such a GC. You'd 
 still also lose the ability to memcpy or memset a struct that 
 had managed pointers, as that would break the reference count 
 that the GC relies on to work. It would definitely solve the 
 performance problem, but it would be such a radical change that 
 it would essentially be a different language (and debatedly no 
 longer a system-programming one).
I think it is no different from shared_ptr. I also think one can add some safety through global pointer analysis for existing code.

Let the pinning be done by a counter: when you pin the object you get a smart pointer, borrowed_ptr. When the count goes to zero, the object is local again.
Nov 27 2020
parent reply IGotD- <nise nise.com> writes:
On Friday, 27 November 2020 at 10:41:54 UTC, Ola Fosheim Grostad 
wrote:
 I think it is no different than shared_ptr. I also think one 
 can add some safety through global pointer analysis for 
 existing code.  Let the pinning be done by a counter, when you 
 pin the object you get a smartpointer borrowed_ptr... when the 
 count goes to zero, the object is local again.
Reference counting, which also means multiple ownership, doesn't play well with any borrowing mechanism. The reason is that the compiler cannot settle the borrow checks at compile time and must insert runtime checks for whether you are allowed to borrow or not. This reduces performance, probably not by a lot, but still. Let's leave the borrow checker outside D and just have good old reference counting; that's what we need.

Speaking of parallel GC, even if we have atomic reference counting or another parallel method, the underlying malloc/free must also be non-blocking, or at least reduce locking as much as possible. Many libc implementations have this already, though.
Nov 27 2020
parent reply Ola Fosheim Grostad <ola.fosheim.grostad gmail.com> writes:
On Friday, 27 November 2020 at 11:31:48 UTC, IGotD- wrote:
 Reference counting which also means multiple ownership doesn't 
 play well with any borrowing mechanism. Reason is that the 
 compiler cannot determine the borrow checker at compile time 
 and must insert runtime checks if you are allowed to borrow or 
 not. This reduces the performance, probably not a lot but 
 still. Let's leave borrow checker outside D and just have good 
 old reference counting, that's what we need.
You can view ARC as a borrow checker. If the ARC optimizer succeeds globally, then all acquire/release can be omitted for that type (or that call graph path).

The problem is interior pointers, which would require fat pointers or a borrow checker...

But again, you could rewrite those fat pointers if ARC optimization is highly successful (if the code validates like it would for a borrow checker).
Nov 27 2020
parent Ola Fosheim Grostad <ola.fosheim.grostad gmail.com> writes:
On Friday, 27 November 2020 at 12:00:40 UTC, Ola Fosheim Grostad 
wrote:
 You can view ARC as a borrowchecker. If the ARC optimizer 
 succeeds globally then all acquire/release can be omitted for 
 that type (or that call graph path).

 The problem is interior pointers which would require fat 
 pointers or borrowchecker...

 But again you could rewrite those fat pointers if ARC 
 optimization is highly successful (if the code validates like 
 it would for a borrow
Sadly, templated types that depend on struct size and field offsets could be a problem for such rewrites... So the compiler would have to annotate structs with dependencies...
Nov 27 2020