digitalmars.D - Determining cache sizes -- request for testing

reply Don <nospam nospam.com.au> writes:
To implement efficient memory-intensive operations (memcpy, array 
operations, matrix multiplication, etc.), you really need to know the 
sizes of the data caches.
Although most modern CPUs provide methods to determine the sizes of 
their built-in caches, it's a complete pig's breakfast. There are 
multiple complicated methods, and documentation is scant.
I've written some code to make this mess usable, and provide what you 
really want. For each level of cache, the code provides size in KB, ways 
of associativity, and the cache line size.
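
As a rough illustration of how those three numbers relate to each other (this is not the attached code; the `CacheLevel` name and its fields are made up for this sketch, and the figures are only example values), size, associativity and line size together give the number of sets in a set-associative cache:

```python
from dataclasses import dataclass


@dataclass
class CacheLevel:
    """Hypothetical record for one cache level (not the attached module's API)."""
    size_kb: int    # total size in KB
    ways: int       # ways of associativity
    line_size: int  # cache line size in bytes

    def num_sets(self) -> int:
        # A set-associative cache has size / (ways * line size) sets.
        return (self.size_kb * 1024) // (self.ways * self.line_size)


# Example values only: a 32 KB, 8-way cache with 64-byte lines.
l1 = CacheLevel(size_kb=32, ways=8, line_size=64)
print(l1.num_sets())
```

Addresses that map to the same set compete for only `ways` slots, which is why associativity matters as much as raw size.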

The attached code should eventually become part of std.cpuid, and an 
equivalent module in Tango. But, it needs significant further testing.

Please compile and run the code, and report the results. Any results 
would be useful, but particularly valuable would be:
(1) Multicore AMD machines;
(2) Early AMD machines (K6 or earlier);
(3) Early Intel machines;
(4) anything from another manufacturer;
(5) any crashes or obvious bugs.

Public domain.
Sep 10 2008
next sibling parent "Denis Koroskin" <2korden gmail.com> writes:
On Wed, 10 Sep 2008 13:41:40 +0400, Don <nospam nospam.com.au> wrote:

 To implement efficient memory-intensive operations (memcpy, array
 operations, matrix multiplication, etc), you really need to know the
 sizes of the data caches.
 Although most modern CPUs provide methods to determine the sizes of
 their built-in caches, it's a complete pigs breakfast. There are
 multiple complicated methods, and documentation is scant.
 I've written some code to make this mess usable, and provide what you
 really want. For each level of cache, the code provides size in KB, ways
 of associativity, and the cache line size.

 The attached code should eventually become part of std.cpuid, and an
 equivalent module in Tango. But, it needs significant further testing.

 Please compile and run the code, and report the results. Any results
 would be useful, but particularly valuable would be:
 (1) Multicore AMD machines;
 (2) Early AMD machines (K6 or earlier).
 (3) Early Intel machines;
 (4) anything from another manufacturer.
 (5) any crashes or obvious bugs.

 Public domain.

Vendor string:    AuthenticAMD
Processor string: AMD Athlon(tm) 64 X2 Dual Core Processor 3800+
Signature:        Family=15 Model=35 Stepping=2
Features:         MMX FXSR SSE SSE2 SSE3 3DNow! 3DNow!+ MMX+ AMD64 HTT
Multithreading:   2 threads / 2 cores

Family=F Model=3 Stepping=2
Data caches:
Level 1 size=8K, ways=2 linesize=32
Level 2 size=512K, ways=16 linesize=0
Level 3 size=4194303K, ways=1 linesize=0
Sep 10 2008
prev sibling next sibling parent Tomas Lindquist Olsen <tomas famolsen.dk> writes:
Vendor string:    AuthenticAMD
Processor string: AMD Athlon(tm) 64 X2 Dual Core Processor 3800+
Signature:        Family=15 Model=43 Stepping=1
Features:         MMX FXSR SSE SSE2 SSE3 3DNow! 3DNow!+ MMX+ AMD64 HTT
Multithreading:   2 threads / 2 cores

Family=F Model=B Stepping=1
Data caches:
Level 1 size=8K, ways=2 linesize=32
Level 2 size=512K, ways=16 linesize=0
Level 3 size=4194303K, ways=1 linesize=0
Sep 10 2008
prev sibling next sibling parent Tomas Lindquist Olsen <tomas famolsen.dk> writes:
Vendor string:    GenuineIntel
Processor string: Intel(R) Celeron(R) CPU          550    2.00GHz
Signature:        Family=6 Model=22 Stepping=1
Features:         MMX FXSR SSE SSE2 SSE3 SSSE3 AMD64
Multithreading:   1 threads / 1 cores

Family=6 Model=6 Stepping=1
Data caches:
Level 1 size=32K, ways=8 linesize=64
Level 2 size=1024K, ways=4 linesize=64
Level 3 size=4194303K, ways=1 linesize=64
Sep 10 2008
prev sibling next sibling parent reply Don <nospam nospam.com.au> writes:
That's already shown up a heap of bugs. Aargh.
Here's an improved version. But I still don't understand why it's 
reporting no L3 cache on the AMD64 machines.
Sep 10 2008
parent torhu <no spam.invalid> writes:
This is an Athlon Thunderbird 1.4.  According to CPU-Z, the L1 cache is really 
64+64 kB, with 64-byte line size.  The rest looks correct.

Vendor string:    AuthenticAMD
Processor string: AMD Athlon(tm) processor
Signature:        Family=6 Model=4 Stepping=4
Features:         MMX FXSR 3DNow! 3DNow!+ MMX+
Multithreading:   1 threads / 1 cores

Family=6 Model=4 Stepping=4
Data caches:
Level 1 size=8K, ways=2 linesize=32
Level 2 size=256K, ways=16 linesize=64
Level 3 size=4194303K, ways=1 linesize=64
Sep 10 2008
prev sibling next sibling parent Manuel König <manuelk89 gmx.net> writes:
On Wed, 10 Sep 2008 11:41:40 +0200, Don <nospam nospam.com.au> wrote:

 To implement efficient memory-intensive operations (memcpy, array 
 operations, matrix multiplication, etc), you really need to know the 
 sizes of the data caches.
 Although most modern CPUs provide methods to determine the sizes of 
 their built-in caches, it's a complete pigs breakfast. There are 
 multiple complicated methods, and documentation is scant.
 I've written some code to make this mess usable, and provide what you 
 really want. For each level of cache, the code provides size in KB,
 ways of associativity, and the cache line size.
 
 The attached code should eventually become part of std.cpuid, and an 
 equivalent module in Tango. But, it needs significant further testing.
 
 Please compile and run the code, and report the results. Any results 
 would be useful, but particularly valuable would be:
 (1) Multicore AMD machines;
 (2) Early AMD machines (K6 or earlier).
 (3) Early Intel machines;
 (4) anything from another manufacturer.
 (5) any crashes or obvious bugs.
 
 Public domain.
 

Should the Pentium4 HyperThreading Technology count as two cores?

$ ./cache
Vendor string:    GenuineIntel
Processor string: Intel(R) Pentium(R) 4 CPU 3.00GHz
Signature:        Family=15 Model=2 Stepping=9
Features:         MMX FXSR SSE SSE2 HTT
Multithreading:   1 threads / 1 cores

Family=F Model=2 Stepping=9
Data caches:
Level 1 size=8K, ways=2 linesize=32
Level 2 size=512K, ways=8 linesize=64
Level 3 size=4194303K, ways=1 linesize=64

$ cat /proc/cpuinfo
processor : 0
vendor_id : GenuineIntel
cpu family : 15
model : 2
model name : Intel(R) Pentium(R) 4 CPU 3.00GHz
stepping : 9
cpu MHz : 2992.567
cache size : 512 KB
physical id : 0
siblings : 2
core id : 0
cpu cores : 1
apicid : 0
initial apicid : 0
fdiv_bug : no
hlt_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 2
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe pebs bts cid xtpr
bogomips : 5990.83
clflush size : 64
power management:

processor : 1
vendor_id : GenuineIntel
cpu family : 15
model : 2
model name : Intel(R) Pentium(R) 4 CPU 3.00GHz
stepping : 9
cpu MHz : 2992.567
cache size : 512 KB
physical id : 0
siblings : 2
core id : 0
cpu cores : 1
apicid : 1
initial apicid : 1
fdiv_bug : no
hlt_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 2
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe pebs bts cid xtpr
bogomips : 5986.89
clflush size : 64
power management:
Sep 10 2008
prev sibling next sibling parent reply bearophile <bearophileHUGS lycos.com> writes:
Don:
 To implement efficient memory-intensive operations (memcpy, array 
 operations, matrix multiplication, etc), you really need to know the 
 sizes of the data caches.

Your code is surely useful, but take a look at "cache oblivious data structures":
http://en.wikipedia.org/wiki/Cache-oblivious_algorithm

I presume your code doesn't work at compile time, so I presume you have to run your code once, save the results on disk, and then re-start the compilation to use those values to compute the tuning compile-time constants of the data structures :-)

On this Intel the data looks almost correct (I have used the second version of your code), but there isn't level 3 cache, and the RAM size is 2 GB.

Vendor string:    GenuineIntel
Processor string: Intel(R) Pentium(R) Dual CPU E2180 2.00GHz
Signature:        Family=6 Model=15 Stepping=13
Features:         MMX FXSR SSE SSE2 SSE3 SSSE3 AMD64 HTT
Multithreading:   2 threads / 2 cores

Family=6 Model=F Stepping=D
Data caches:
Level 1 size=16K, ways=8 linesize=64
Level 2 size=512K, ways=4 linesize=64
Level 3 size=4194303K, ways=1 linesize=64

Bye,
bearophile
Sep 10 2008
next sibling parent bearophile <bearophileHUGS lycos.com> writes:
bearophile:
 Features:         MMX FXSR SSE SSE2 SSE3 SSSE3 AMD64 HTT

I don't know if my CPU has HyperThreading; I presume not.

Bye,
bearophile
Sep 10 2008
prev sibling parent reply Don <nospam nospam.com.au> writes:
bearophile wrote:
 Don:
 To implement efficient memory-intensive operations (memcpy, array 
 operations, matrix multiplication, etc), you really need to know the 
 sizes of the data caches.

Your code is surely useful, but take a look at "cache oblivious data structures": http://en.wikipedia.org/wiki/Cache-oblivious_algorithm

Yes, it's a good approach, but it doesn't help for something like memcpy.
 I presume your code doesn't work at compile time,

Correct.
 so I presume you have to run your code once, save the results on disk, and
then re-start the compilation to use those values to compute the tuning
compile-time constants of the data structures :-)

No. That wouldn't be much use. The cache size is simply used as a parameter at run-time. It's only the linesize which has a major impact on optimal code -- but it's 32 or 64 bytes on every system which I know of. So it's possible to deal with it at compile time, too. You can, in fact, just plug the L1 cache size into your cache-oblivious algorithm as the cut-off level, significantly improving performance.
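
A minimal sketch of that last idea, assuming a recursive (cache-oblivious style) matrix transpose as the algorithm; the function names and the choice of transpose are illustrative, not taken from the attached code. The recursion stops splitting once a block is small enough that the source and destination blocks should both fit in L1, and the L1 size is passed in as an ordinary run-time parameter. Python is used only to show the structure; it says nothing about real cache behaviour.

```python
def cutoff_from_l1(l1_bytes, elem_bytes, arrays=2):
    """Largest block (in elements) such that `arrays` blocks fit in L1."""
    return max(1, l1_bytes // (elem_bytes * arrays))


def transpose(src, dst, r0, r1, c0, c1, cutoff_elems):
    """Transpose src[r0:r1][c0:c1] into dst, halving the larger dimension
    until the block holds at most cutoff_elems elements."""
    rows, cols = r1 - r0, c1 - c0
    if rows * cols <= cutoff_elems:
        # Base case: the block is assumed to fit in L1, so copy it directly.
        for i in range(r0, r1):
            for j in range(c0, c1):
                dst[j][i] = src[i][j]
    elif rows >= cols:
        mid = r0 + rows // 2
        transpose(src, dst, r0, mid, c0, c1, cutoff_elems)
        transpose(src, dst, mid, r1, c0, c1, cutoff_elems)
    else:
        mid = c0 + cols // 2
        transpose(src, dst, r0, r1, c0, mid, cutoff_elems)
        transpose(src, dst, r0, r1, mid, c1, cutoff_elems)


# Example: a 5x7 matrix with a deliberately tiny cut-off to force recursion.
src = [[r * 7 + c for c in range(7)] for r in range(5)]
dst = [[0] * 5 for _ in range(7)]
transpose(src, dst, 0, 5, 0, 7, cutoff_elems=4)
print(dst[6][4] == src[4][6])
```

With a real L1 size the cut-off would come from something like `cutoff_from_l1(32 * 1024, 8)`, i.e. 8-byte elements and two arrays sharing the cache.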
 On this Intel the data looks almost correct (I have used the second version of
your code), but there isn't level 3 cache, and the RAM size is 2 GB.

The value shown for L3 should be greater than the memory size, if there is no L2 cache. (you never fall out of the L3 cache). So it's correct.
Sep 10 2008
parent reply bearophile <bearophileHUGS lycos.com> writes:
Don:
 The value shown for L3 should be greater than the memory size, if there 
 is no L2 cache. (you never fall out of the L3 cache). So it's correct.

I don't understand. And I have L2 cache. The results I expect from your code running on my PC are:

L1: 32 + 32 KB
L2: 1024 KB
L3: 0 MB
RAM: 2 GB

Or if you want an output more usable by an algorithm, it can output a dynamic array of longs:

Memory levels ==> [65536, 1048576, 2147483648]

Bye,
bearophile
Sep 10 2008
parent reply Don <nospam nospam.com.au> writes:
bearophile wrote:
 Don:
 The value shown for L3 should be greater than the memory size, if there 
 is no L2 cache. (you never fall out of the L3 cache). So it's correct.

I don't understand. And I have L2 cache.

Oops, that should have been "if there is no L3 cache".
 The results I expect from your code running on my PC are:
 
 L1: 32 + 32 KB
 L2: 1024 KB
 L3: 0 MB
 RAM: 2 GB

(1) Unfortunately, I don't think it's possible to determine the amount of RAM without help from the OS. So I give the last value uint.max bytes. Perhaps that is too confusing.
(2) The cache values are per core.
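
For anyone puzzled by the 4194303K figure in the reports, the arithmetic can be sketched as follows. This assumes the placeholder is exactly uint.max bytes and that the program prints it with integer division by 1024; the helper names are made up for this illustration.

```python
UINT_MAX = 2**32 - 1  # D's uint.max, 4294967295


def bytes_to_kb(n_bytes):
    """Integer KB, rounding down, as a size=...K printout would show it."""
    return n_bytes // 1024


def is_placeholder(size_kb):
    """True if a reported size is the 'no cache at this level' sentinel."""
    return size_kb == bytes_to_kb(UINT_MAX)


print(f"size={bytes_to_kb(UINT_MAX)}K")  # size=4194303K
```

So a level reported as 4194303K would mean "effectively unbounded", not a real cache of nearly 4 GB.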
 Or if you want an output more usable by an algorithm, it can output a dynamic
array of longs:
 
 Memory levels ==> [65536, 1048576, 2147483648]

Perhaps. I haven't decided on a final interface. Note, though, that the relevant size of the cache depends on what you are doing. For example, if you are operating on 3 arrays, you need to divide the cache size by 3. But if the cache level has an associativity less than 3, you have the risk of cache thrashing. So generally you need to make your own table of cache sizes anyway.
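
A small sketch of that reasoning, under the simplifying assumption that the arrays share a cache level equally; the function name and the table layout are hypothetical:

```python
def per_array_budget(cache_bytes, ways, n_arrays):
    """Return (bytes available per array, thrash_risk) for one cache level.

    thrash_risk is True when the associativity is lower than the number of
    arrays, so lines from different arrays that map to the same set can
    keep evicting each other.
    """
    budget = cache_bytes // n_arrays
    thrash_risk = ways < n_arrays
    return budget, thrash_risk


# Example: an operation on 3 arrays, with cache figures like those reported
# in this thread (8K 2-way L1, 512K 16-way L2).
levels = [(8 * 1024, 2), (512 * 1024, 16)]
table = [per_array_budget(size, ways, n_arrays=3) for size, ways in levels]
print(table)  # [(2730, True), (174762, False)]
```

A real table would also have to account for the cache values being per core and for whatever else the routine keeps in cache.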
Sep 10 2008
parent "Manfred_Nowak" <svv1999 hotmail.com> writes:
Don wrote:

  I don't think it's possible to determine the amount 
 of RAM without help from the OS.

Because without help of the OS, one will only get the available virtual memory, which is provided by the OS?

-manfred
-- 
If life is going to exist in this Universe, then the one thing it cannot afford to have is a sense of proportion. (Douglas Adams)
Sep 10 2008
prev sibling next sibling parent TomD <t_demmer nospam.web.de> writes:
Vendor string:    GenuineIntel
Processor string: Intel(R) Core(TM)2 CPU         T5500    1.66GHz
Signature:        Family=6 Model=15 Stepping=2
Features:         MMX FXSR SSE SSE2 SSE3 SSSE3 AMD64 HTT
Multithreading:   2 threads / 2 cores

Family=6 Model=F Stepping=2
Data caches:
Level 1 size=16K, ways=8 linesize=64
Level 2 size=1024K, ways=8 linesize=64
Level 3 size=4194303K, ways=1 linesize=64
Sep 10 2008
prev sibling next sibling parent dsimcha <dsimcha yahoo.com> writes:
== Quote from Don (nospam nospam.com.au)'s article
 To implement efficient memory-intensive operations (memcpy, array
 operations, matrix multiplication, etc), you really need to know the
 sizes of the data caches.
 Although most modern CPUs provide methods to determine the sizes of
 their built-in caches, it's a complete pigs breakfast. There are
 multiple complicated methods, and documentation is scant.
 I've written some code to make this mess usable, and provide what you
 really want. For each level of cache, the code provides size in KB, ways
 of associativity, and the cache line size.
 The attached code should eventually become part of std.cpuid, and an
 equivalent module in Tango. But, it needs significant further testing.
 Please compile and run the code, and report the results. Any results
 would be useful, but particularly valuable would be:
 (1) Multicore AMD machines;
 (2) Early AMD machines (K6 or earlier).
 (3) Early Intel machines;
 (4) anything from another manufacturer.
 (5) any crashes or obvious bugs.

Seems to get things wrong for quad core Intels. No, I'm not rich enough to own one myself, this is my work computer.

Vendor string:    GenuineIntel
Processor string: Intel(R) Core(TM)2 Quad CPU 2.66GHz
Signature:        Family=6 Model=15 Stepping=7
Features:         MMX FXSR SSE SSE2 SSE3 SSSE3 AMD64 HTT
Multithreading:   4 threads / 4 cores

Family=6 Model=F Stepping=7
Data caches:
Level 1 size=8K, ways=8 linesize=64
Level 2 size=1024K, ways=16 linesize=64
Level 3 size=4194303K, ways=1 linesize=64

According to CPU-Z, the L2 cache is 4096 KB * 2. I'm pretty sure that the quad-core Intels are basically two dual-cores stuck together, meaning that they have 2 L2 caches, each shared between two of the cores. Also, the L1 cache info is wrong if CPU-Z is right. I actually have 32K of L1 cache per core. For your convenience, I've attached my CPU-Z HTML output.
Sep 10 2008
prev sibling next sibling parent "Jérôme M. Berger" <jeberger free.fr> writes:
I get:
 ./cache

Family=0 Model=0 Stepping=0
Data caches:
Level 1 size=8K, ways=2 linesize=32
Level 2 size=4194303K, ways=1 linesize=32
Level 3 size=4194303K, ways=1 linesize=32

 gdc --version

 cat /proc/cpuinfo

vendor_id : AuthenticAMD
cpu family : 15
model : 75
model name : AMD Athlon(tm) 64 X2 Dual Core Processor 3800+
stepping : 2
cpu MHz : 1800.000
cache size : 512 KB
physical id : 0
siblings : 2
core id : 0
cpu cores : 2
fpu : yes
fpu_exception : yes
cpuid level : 1
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt rdtscp lm 3dnowext 3dnow rep_good pni cx16 lahf_lm cmp_legacy svm extapic cr8_legacy
bogomips : 3618.77
TLB size : 1024 4K pages
clflush size : 64
cache_alignment : 64
address sizes : 40 bits physical, 48 bits virtual
power management: ts fid vid ttp tm stc

processor : 1
vendor_id : AuthenticAMD
cpu family : 15
model : 75
model name : AMD Athlon(tm) 64 X2 Dual Core Processor 3800+
stepping : 2
cpu MHz : 1800.000
cache size : 512 KB
physical id : 0
siblings : 2
core id : 1
cpu cores : 2
fpu : yes
fpu_exception : yes
cpuid level : 1
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt rdtscp lm 3dnowext 3dnow rep_good pni cx16 lahf_lm cmp_legacy svm extapic cr8_legacy
bogomips : 3618.77
TLB size : 1024 4K pages
clflush size : 64
cache_alignment : 64
address sizes : 40 bits physical, 48 bits virtual
power management: ts fid vid ttp tm stc

-- 
+------------------------- Jerome M. BERGER ---------------------+
|   mailto:jeberger free.fr       | ICQ:    238062172            |
|   http://jeberger.free.fr/      | Jabber: jeberger jabber.fr   |
+---------------------------------+------------------------------+
Sep 10 2008
prev sibling next sibling parent Julio César Carrascal Urquijo <jcarrascal gmail.com> writes:
Hello Don,

 Please compile and run the code, and report the results. Any results
 would be useful, but particularly valuable would be:
 (1) Multicore AMD machines;
 (2) Early AMD machines (K6 or earlier).
 (3) Early Intel machines;
 (4) anything from another manufacturer.
 (5) any crashes or obvious bugs.
 Public domain.
 

Vendor string:    AuthenticAMD
Processor string: AMD Athlon(tm) 64 X2 Dual-Core Processor TK-55
Signature:        Family=15 Model=104 Stepping=1
Features:         MMX FXSR SSE SSE2 SSE3 3DNow! 3DNow!+ MMX+ AMD64 HTT
Multithreading:   2 threads / 2 cores

Family=F Model=8 Stepping=1
Data caches:
Level 1 size=8K, ways=2 linesize=32
Level 2 size=256K, ways=16 linesize=0
Level 3 size=4194303K, ways=1 linesize=0

The same data is reported by CPU-Z.
Sep 10 2008
prev sibling next sibling parent reply "Craig Black" <craigblack2 cox.net> writes:
Don,

Very good work on all this stuff!!  I didn't realize that this was possible 
to do.  Once you get the cache sizes it would be beneficial to know what 
regions of memory are currently loaded into cache and RAM.  Do you know if 
this can be done?

If such information was available, it would be possible to write a "memory 
optimizer" that would work on special data structures that could be moved 
around on the heap.  Objects that are accessed frequently could be moved 
around to improve locality of reference.

-Craig 
Sep 10 2008
parent Don <nospam nospam.com.au> writes:
Craig Black wrote:
 Don,
 
 Very good work on all this stuff!!  I didn't realize that this was 
 possible to do.  Once you get the cache sizes it would be beneficial to 
 know what regions of memory are currently loaded into cache and RAM.  Do 
 you know if this can be done?

I don't think that's possible. One thing you can do, though, is use the performance counters to measure how many cache misses you're getting. (There are performance counters for L1 read misses, L1 write misses, L2 misses, # L2 lines evicted from cache, etc). Requires a small kernel mode driver, though, so can't be used for client code. But it's what I use for development -- you can learn a lot with it.
 If such information was available, it would be possible to write a 
 "memory optimizer" that would work on special data structures that could 
 be moved around on the heap.  Objects that are accessed frequently could 
 be moved around to improve locality of reference.

Nice idea. Still, the most important things can be done at compile time. (especially, making sure that arrays of structs are sensibly arranged).
Sep 11 2008
prev sibling parent Bruno Medeiros <brunodomedeiros+spam com.gmail> writes:
Don wrote:
 To implement efficient memory-intensive operations (memcpy, array 
 operations, matrix multiplication, etc), you really need to know the 
 sizes of the data caches.
 Although most modern CPUs provide methods to determine the sizes of 
 their built-in caches, it's a complete pigs breakfast. There are 
 multiple complicated methods, and documentation is scant.
 I've written some code to make this mess usable, and provide what you 
 really want. For each level of cache, the code provides size in KB, ways 
 of associativity, and the cache line size.
 
 The attached code should eventually become part of std.cpuid, and an 
 equivalent module in Tango. But, it needs significant further testing.
 
 Please compile and run the code, and report the results. Any results 
 would be useful, but particularly valuable would be:
 (1) Multicore AMD machines;
 (2) Early AMD machines (K6 or earlier).
 (3) Early Intel machines;
 (4) anything from another manufacturer.
 (5) any crashes or obvious bugs.
 
 Public domain.
 

Vendor string:    AuthenticAMD
Processor string: AMD Athlon(tm) 64 X2 Dual Core Processor 5000+
Signature:        Family=15 Model=107 Stepping=2
Features:         MMX FXSR SSE SSE2 SSE3 3DNow! 3DNow!+ MMX+ AMD64 HTT
Multithreading:   2 threads / 2 cores

Family=F Model=6B Stepping=2
Data caches:
Level 1 size=8K, ways=2 linesize=32
Level 2 size=512K, ways=16 linesize=64
Level 3 size=4194303K, ways=1 linesize=64

-- 
Bruno Medeiros - Software Developer, MSc. in CS/E graduate
http://www.prowiki.org/wiki4d/wiki.cgi?BrunoMedeiros#D
Sep 23 2008