digitalmars.D.bugs - [Issue 21890] New: Memory layout and access patterns in the
- d-bugmail puremagic.com (89/89) May 03 2021 https://issues.dlang.org/show_bug.cgi?id=21890
https://issues.dlang.org/show_bug.cgi?id=21890 Issue ID: 21890 Summary: Memory layout and access patterns in the backend's COFF implementation lead to apocalyptically terrible LLC misses. Product: D Version: D2 Hardware: x86 OS: Windows Status: NEW Keywords: backend, performance Severity: normal Priority: P1 Component: dmd Assignee: nobody puremagic.com Reporter: maxhaton gmail.com TL;DR dmd doesn't utilize the caches very well at all, this makes it slow. I just pointed Intel's vTune profiler at dmd compiling itself, turns out that out of a ~4 seconds, the hotspots are Top Hotspots: Function Module CPU Time MsCoffObj_getsegment dmdforprofilegc.exe 0.626s dmd.func.traverseIndirections.traverse dmdforprofilegc.exe 0.245s ScopeDsymbol::search dmdforprofilegc.exe 0.146s Type::addMod dmdforprofilegc.exe 0.086s func 0x18002b650 ntdll.dll 0.085s [Others] N/A* 3.261s Something is going wrong with MsCoffObj_getsegment. Further investigation suggests almost all the measured direct-to-DRAM hits happen only in this function. Function / Call Stack CPU Time Memory Bound Loads Stores LLC Miss Count Average Latency (cycles) MsCoffObj_getsegment 626.000ms 70.0% 397,211,916 1,200,036 32,402,268 49 Going deeper the offending line seems to be the following: ------------------------------ trusted segidx_t MsCoffObj_getsegment(const(char)* sectname, uint flags) { //printf("getsegment(%s)\n", sectname); assert(strlen(sectname) <= 8); // so it won't go into string_table if (!(flags & IMAGE_SCN_LNK_COMDAT)) { for (segidx_t seg = 1; seg < SegData.length; seg++) { seg_data *pseg = SegData[seg]; if (!(ScnhdrTab[pseg.SDshtidx].Characteristics & IMAGE_SCN_LNK_COMDAT) && <---- THIS LINE HERE strncmp(cast(const(char)* )ScnhdrTab[pseg.SDshtidx].Name, sectname, 8) == 0) { //printf("\t%s\n", sectname); return seg; // return existing segment } } } segidx_t seg = MsCoffObj_getsegment2(MsCoffObj_addScnhdr(sectname, flags)); //printf("\tSegData.length = %d\n", SegData.length); //printf("\tseg = %d, %d, %s\n", seg, SegData[seg].SDshtidx, ScnhdrTab[SegData[seg].SDshtidx].s_name); return seg; } ------------------------------ Disclaimer: vTune has a habit of bumping performance counter data into the next instructions total, so these cache misses could be from the SegData access also (I will investigate further). Although the sheer volume may be an issue in itself (I haven't instrumented this yet), this code is a n-tuple threat: - SegData's struct is 140 bytes in size, and SDshtidx is at index 64, so the cache efficiency is not good. - This is a memory dependency chain which isn't ideal. - This assessed value is then used as an index into ScnhdrTab, which is an array of structs of size 40, whose last element is then indexed. This means that due to the allocator alignment it is quite likely that only one .Characteristics can actually fit into one cacheline. Solution: Switch the data structure/s from Array of struct to struct of arrays. This could yield somewhere on the order of 4x to 16x speedup if my data and hypothesis are correct. I've given the rest of the code a skim and it seems like this change shouldn't pessimize the rest of the code, as the buffers are mostly accessed sparsely (most accesses are only to 1 member at a time), the actual code itself should be able to stay the same shape thanks to D's usual bag of tricks. Also known as https://en.wikipedia.org/wiki/Data-oriented_design Additional Information: -- Profiled build was the result of "ldc2 -O3", so no PGO or LTO. -- Further investigation of the length of SegData yields a number of roughly 59000 by the end of compilation, however this function seems to be called a lot also so there I think the combination of this loop and a loop in a caller is yielding Ω(n^2) runtime. --
May 03 2021