digitalmars.D.announce - optimized array operations

Eugene Pelekhay <pelekhay gmail.com> writes:
I've finished an optimized version of the array operations, using SSE2 instructions.
Sep 21 2008
Hoenir <mrmocool gmx.de> writes:
Eugene Pelekhay wrote:
 I've finished an optimized version of the array operations, using SSE2 instructions.

Shouldn't else { NoSEE2(); } be NoSSE2()?
Sep 22 2008
Eugene Pelekhay <pelekhay gmail.com> writes:
Hoenir wrote:

 Eugene Pelekhay wrote:
 I've finished an optimized version of the array operations, using SSE2 instructions.

 Shouldn't else { NoSEE2(); } be NoSSE2()?

No, if we have no inline asm, we have nothing to do.
Sep 22 2008
Hoenir <mrmocool gmx.de> writes:
Eugene Pelekhay wrote:
 Hoenir wrote:

 Eugene Pelekhay wrote:
 I've finished an optimized version of the array operations, using SSE2 instructions.

 Shouldn't else { NoSEE2(); } be NoSSE2()?

 No, if we have no inline asm, we have nothing to do.

I meant the spelling. You wrote NoSEE2.
Sep 23 2008
Eugene Pelekhay <pelekhay gmail.com> writes:
Hoenir wrote:

 Eugene Pelekhay wrote:
 Hoenir wrote:

 Eugene Pelekhay wrote:
 I've finished an optimized version of the array operations, using SSE2 instructions.

 Shouldn't else { NoSEE2(); } be NoSSE2()?

 No, if we have no inline asm, we have nothing to do.

 I meant the spelling. You wrote NoSEE2.

Sorry, yes, you're right.
Sep 23 2008
bearophile <bearophileHUGS lycos.com> writes:
Eugene Pelekhay:
 I've finished an optimized version of the array operations, using SSE2 instructions.

Where are the benchmarks to compare the performance of the old version with the new one?

Bye,
bearophile
Sep 22 2008
Eugene Pelekhay <pelekhay gmail.com> writes:
bearophile wrote:

 Eugene Pelekhay:
 I've finished an optimized version of the array operations, using SSE2 instructions.

 Where are the benchmarks to compare the performance of the old version with the new one?

It's the benchmark() function; the old implementation's function names begin with an underscore.
Sep 22 2008
"Saaa" <empty needmail.com> writes:
Claps
But shouldn't you add -O?
Sep 22 2008
bearophile <bearophileHUGS lycos.com> writes:
Denis Koroskin:
 New hardware - new results (-O -release -inline):
 length=1
 _add Time elapsed: 91.212/0 ticks OK
 _sub Time elapsed: 89.8235/0 ticks OK
 add Time elapsed: 82.6135/0 ticks OK
 sub Time elapsed: 84.2032/0 ticks OK
 mul Time elapsed: 82.8325/0 ticks OK
 div Time elapsed: 99.2924/0 ticks OK

I don't understand how to read the timings. Are the timings before the slash from the current implementation, and the timings after the slash from your new code?

Regarding your code, lines such as:

    foreach (i; 0 .. 10_000_000) {

probably need to be rewritten as:

    for (int i; i < 10_000_000; i++) {

because array operations are present in D1 too, and it may be better to write code that works on both D1 and D2 if possible (so Walter has fewer code differences to manage).

Bye,
bearophile
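
The two loop forms mentioned in this post can be compared in a minimal sketch. It assumes a D2 compiler, and `sumForeach` and `sumFor` are hypothetical names used only for illustration; they are not part of the code under discussion.

```d
// Sum 0 .. n-1 using the foreach range form quoted in the post.
long sumForeach(int n)
{
    long total = 0;
    foreach (i; 0 .. n)
        total += i;
    return total;
}

// The same loop written as a classic for statement, the form the post
// suggests for code meant to build under both D1 and D2.
long sumFor(int n)
{
    long total = 0;
    for (int i; i < n; i++) // `int i;` is default-initialized to 0 in D
        total += i;
    return total;
}

void main()
{
    // Both forms visit the same indices, so the results must match.
    assert(sumForeach(1000) == sumFor(1000));
}
```

Only the loop headers differ; the loop bodies and results are identical, so switching forms should not affect what the benchmark measures.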
Sep 22 2008
Eugene Pelekhay <pelekhay gmail.com> writes:
bearophile wrote:

 Denis Koroskin:
 New hardware - new results (-O -release -inline):
 length=1
 _add Time elapsed: 91.212/0 ticks OK
 _sub Time elapsed: 89.8235/0 ticks OK
 add Time elapsed: 82.6135/0 ticks OK
 sub Time elapsed: 84.2032/0 ticks OK
 mul Time elapsed: 82.8325/0 ticks OK
 div Time elapsed: 99.2924/0 ticks OK

 I don't understand how to read the timings. Are the timings before the slash from the current implementation, and the timings after the slash from your new code?

 Regarding your code, lines such as:
     foreach (i; 0 .. 10_000_000) {
 probably need to be rewritten as:
     for (int i; i < 10_000_000; i++) {
 because array operations are present in D1 too, and it may be better to write code that works on both D1 and D2 if possible (so Walter has fewer code differences to manage).

 Bye, bearophile

The timings are shown per implementation for aligned/unaligned data. A function name with a leading underscore is the default/current implementation, without the new optimized version.

Regarding rewriting the loops, maybe you're right; I didn't check whether it even compiles with D 1.0. However, the major code difference will be in the benchmark function, which is not intended for inclusion in the library.
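
One plausible way to produce such an aligned/unaligned pair is sketched below. This is an assumption about how a harness of this kind could work, not the posted benchmark: `alignedSlice` and `timeAdd` are hypothetical helpers, hnsecs from `std.datetime.stopwatch` stand in for "ticks", and only the built-in array addition (the `_add` row) is timed.

```d
import std.datetime.stopwatch : AutoStart, StopWatch;
import std.stdio : writefln;

// Hypothetical helper: carve a slice of `len` floats that starts on a
// 16-byte boundary out of a buffer with at least `len + 3` elements.
float[] alignedSlice(float[] buf, size_t len)
{
    size_t start = 0;
    while ((cast(size_t) &buf[start]) % 16 != 0)
        ++start;
    return buf[start .. start + len];
}

// Time `reps` runs of the built-in array addition and return hnsecs.
long timeAdd(float[] a, float[] b, float[] c, int reps)
{
    auto sw = StopWatch(AutoStart.yes);
    for (int i; i < reps; i++)
        a[] = b[] + c[];
    return sw.peek.total!"hnsecs";
}

void main()
{
    enum len = 16;
    // One extra element so that [1 .. $] is a view that is 4 bytes
    // past a 16-byte boundary, i.e. misaligned for SSE2 loads.
    auto a = alignedSlice(new float[len + 4], len + 1);
    auto b = alignedSlice(new float[len + 4], len + 1);
    auto c = alignedSlice(new float[len + 4], len + 1);
    b[] = 1.0f;
    c[] = 2.0f;

    auto aligned = timeAdd(a[0 .. len], b[0 .. len], c[0 .. len], 1_000_000);
    auto unaligned = timeAdd(a[1 .. $], b[1 .. $], c[1 .. $], 1_000_000);

    // Report OK only if the last run produced the expected sums.
    bool ok = true;
    foreach (x; a[1 .. $])
        ok = ok && x == 3.0f;
    writefln("_add Time elapsed: %s/%s ticks %s", aligned, unaligned,
             ok ? "OK" : "FAIL");
}
```

Taking both views from the same buffers keeps the data identical between the two runs, so any difference in the two numbers comes from alignment rather than from the contents.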
Sep 23 2008
bearophile <bearophileHUGS lycos.com> writes:
Eugene Pelekhay:
 However, the major code difference will be in the benchmark function, which is
 not intended for inclusion in the library.

In my libs I keep the tidy benchmark code too (and a few timing tables), because it's part of the performance testing of the code: it complements the unittests by spotting performance problems, etc.

Bye,
bearophile
Sep 23 2008
"Denis Koroskin" <2korden gmail.com> writes:
On Mon, 22 Sep 2008 20:33:40 +0400, Eugene Pelekhay <pelekhay gmail.com> wrote:

 bearophile wrote:

 Eugene Pelekhay:
 I've finished an optimized version of the array operations, using SSE2 instructions.

 Where are the benchmarks to compare the performance of the old version with the new one?

 It's the benchmark() function; the old implementation's function names begin with an underscore.

I renamed benchmark() to main() and ran the tests with -release -inline. Well done! According to the results, your implementation does indeed perform up to 2 times better than the built-in version. Here they are.

Time elapsed in ticks, as printed by the benchmark (two values separated by a slash); every test reported OK.

length  _add             _sub             add              sub              mul              div
1       280.319/0        256.594/0        214.942/0        154.829/0        150.332/0        157.477/0
2       156.033/151.279  156.565/152.039  150.95/157.855   156.006/159.271  154.156/158.504  195.399/187.488
3       171.777/171.285  172.744/173.135  158.411/171.716  158.756/167.616  159.1/168.461    236.523/230.023
4       187.279/187.507  187.64/186.197   158.703/173.175  157.075/174.885  160.297/174.931  275.418/262.012
5       204.594/200.33   203.379/207.323  163.637/183.095  164.883/179.372  166.262/178.485  317.013/317.882
6       209.717/210.563  209.713/211.452  163.643/181.725  164.512/187.003  162.201/185.62   352.298/340.421
7       238.152/232.379  232.444/226.117  167.371/190.912  164.427/188.946  163.788/189.1    385.258/396.481
8       235.129/230.914  184.234/229.212  165.899/192.069  167.237/191.622  167.428/196.7    452.226/439.266
9       242.522/242.835  198.431/186.926  178.967/202.155  173.741/197.983  176.276/211.608  457.407/479.647
10      260.412/257.576  207.76/206.046   169.342/202.356  169.686/201.182  168.23/201.069   504.553/499.578
11      270.73/266.472   226.691/221.545  171.588/212.554  172.838/213.734  177.953/223.758  525.984/539.466
12      302.982/305.531  247.649/236.579  175.799/218.129  174.062/217.84   175.902/221.012  573.274/560.699
13      313.339/311.964  250.683/242.214  181.252/219.131  182.489/220.372  177.835/227.545  598.186/618.626
14      326.999/320.797  264.637/257.676  173.004/226.604  176.729/226.711  179.686/234.603  649.893/640.516
15      364.665/354.617  274.117/266.381  179.939/233.329  179.812/233.004  179.694/235.165  673.612/699.527
16      223.35/327.695   227.287/272.908  185.393/236.471  182.69/236.823   186.721/243.525  717.472/725.396
17      248.901/230.275  247.695/229.207  191.686/253.113  192.252/255.335  197.645/255.323  757.526/783.176
18      256.292/251.377  283.259/272.567  239.513/350.044  230.761/312.844  216.002/295.263  835.063/882.578
19      269.324/266.079  268.632/260.185  202.924/277.364  204.761/282.057  212.51/280.277   859.039/862.403
20      491.68/478.187   302.505/292.327  251.343/365.493  253.363/364.301  283.059/409.972  1000.98/1085.22
21      503.658/492.33   518.409/509.397  288.601/403.432  253.521/355.318  247.759/342.146  1070.34/1187.87
22      504.76/489.249   511.489/493.343  287.776/419.468  290.898/428.521  350.474/504.08   1023.04/1086.01
23      364.767/356.623  437.096/440.046  239.832/351.762  215.47/304.019   234.817/332.294  978.904/995.593
24      331.951/333.321  273.76/333.263   205.865/305.647  205.496/299.6    208.869/308.821  1049.6/1086.68
25      351.851/349.097  302.465/285.624  218.116/306.209  219.07/311.153   220.504/305.367  1053.65/1094.66
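
The "up to 2 times" figure can be checked with a little arithmetic on the numbers in this post. The sketch below is only an illustration: `speedup` is a hypothetical helper, and the inputs are the first value of each pair from the length=20 (`_add` vs `add`) and length=21 (`_sub` vs `sub`) results.

```d
import std.stdio : writefln;

// Ratio of the built-in timing to the new timing; a value above 1
// means the new implementation took fewer ticks.
double speedup(double builtinTicks, double newTicks)
{
    return builtinTicks / newTicks;
}

void main()
{
    // length=20: _add 491.68 ticks vs add 251.343 ticks
    writefln("length=20 add: %.2fx", speedup(491.68, 251.343));
    // length=21: _sub 518.409 ticks vs sub 253.521 ticks
    writefln("length=21 sub: %.2fx", speedup(518.409, 253.521));
}
```

Both ratios come out close to 2, which matches the claim for these rows; other rows show smaller gains.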
Sep 22 2008
"Denis Koroskin" <2korden gmail.com> writes:
On Mon, 22 Sep 2008 22:24:16 +0400, Saaa <empty needmail.com> wrote:

 Claps
 But shouldn't you add -O?

Yeah, I forgot about that one. New hardware - new results (-O -release -inline).

Time elapsed in ticks, as printed by the benchmark (two values separated by a slash); every test reported OK.

length  _add             _sub             add              sub              mul              div
1       91.212/0         89.8235/0        82.6135/0        84.2032/0        82.8325/0        99.2924/0
2       106.647/109.083  107.971/102.719  82.7988/102.962  81.4732/101.616  83.8156/99.2563  92.5873/108.747
3       126.652/116.706  126.399/125.706  96.3761/107.643  94.1087/105.429  96.0873/103.863  113.2/148.406
4       147.263/144.917  147.572/143.089  85.2906/104.056  90.3686/102.911  84.7035/102.004  95.4113/140.324
5       173.293/167.389  171.469/175.664  99.3267/99.8193  99.7452/103.429  91.2656/108.578  139.442/184.214
6       191.206/187.248  187.905/184.413  87.7778/119.368  93.0108/118.036  87.0704/117.874  117.72/241.09
7       205.864/205.521  202.43/200.819   102.379/122.167  100.484/119.401  106.196/124.518  110.556/245.094
8       220.057/222.208  113.122/210.502  99.7759/125.82   95.7975/129.289  96.1895/135.446  104.639/152.259
9       251.001/217.675  135.93/128.081   110.924/125.979  107.991/130.015  107.806/129.482  114.265/179.028
10      264.573/259.45   148.835/169.921  95.0567/140.087  97.4369/152.479  98.3549/154.104  108.558/252.462
11      279.349/269.359  164.775/161.75   114.546/147.507  109.125/147.676  112.426/136.735  135.365/287.607
12      296.75/296.691   188.989/173.847  102.656/151.517  105.074/157.331  109.028/145.493  147.061/343.124
13      321.381/306.395  212.665/204.659  115.45/152.029   122.004/145.958  119.664/142.964  141.235/354.308
14      335.214/331.419  223.052/229.708  113.681/155.009  113.955/160.36   113.102/149.142  245.735/390.141
15      359.901/332.174  244.768/244.861  131.356/152.704  125.648/157.999  124.153/161.608  250.017/404.974
16      140.074/379.592  149.367/251.098  113.109/170.17   113.996/164.889  114.763/169.273  159.482/241.894
17      155.749/154.477  154.405/168.108  119.572/163.988  120.492/165.51   113.184/161.779  256.044/366.288
18      172.663/172.298  174.816/170.034  123.649/163.883  127.407/161.022  114.204/164.654  264.864/433.902
19      198.912/196.095  193.752/198.134  125.866/175.446  133.9/175.22     128.767/171.784  373.334/453.652
20      217.477/215.582  223.455/216.313  126.333/184.697  123.555/173.922  128.443/179.81   372.062/481.06
21      238.9/229.665    237.118/239.118  130.701/184.553  132.75/174.828   133.93/177.637   400.036/514.569
22      254.142/256.432  262.423/244.817  125.395/196.22   124.649/198.292  123.808/190.321  387.283/542.869
23      260.849/282.866  268.18/275.912   135.358/194.129  135.746/199.16   138.564/192.774  375.942/566.722
24      299.96/281.965   168.451/279.595  130.49/191.557   140.546/191.857  137/196.271      381.22/499.397
25      321.095/300.535  180.699/195.685  145.348/195.331  147.631/204.476  135.815/201.486  384.632/484.732
26      332.848/329.833  205.984/205.769  139.282/201.451  137.362/203.057  142.365/208.345  388.909/542.32
27      345.331/348.951  225.069/221.668  145.462/208.467  148.598/204.095  147.294/208.258  406.289/605.841
28      367.349/359.765  252.124/251.313  146.061/203.42   146.909/210.649  146.599/210.4    410.878/653.689
29      376.89/379.718   269.778/267.857  155.037/209.949  150.242/223.406  153.016/211.877  421.687/659.577
30      403.276/393.81   285.613/287.475  151.406/224.127  147.401/221.73   152.91/222.371   520.294/692.957
31      408.127/423.503  301.977/302.537  155.872/225.604  157.22/233.895   152.599/234.549  517.482/722.413
32      210.358/430.569  210.955/306.312  157.188/228.343  161.601/233.145  158.086/229.47   295.6/394.333
33      224.954/216.645  217.369/220.828  165.163/232.728  168.748/225.759  163.696/229.012  527.48/695.096
34      257.833/244.777  251.336/235.344  161.457/233.644  170.191/227.339  155.878/235.089  542.227/721.453
35      266.234/256.81   257.025/267.033  167.552/239.984  171.778/246.633  181.015/235.993  635.09/767.093
36      285.926/277.594  283.741/279.879  167.919/248.243  172.496/256.965  173.703/244.221  638.483/798.543
37      302.726/295.017  302.918/287.871  179.282/238.392  178.655/235.482  181.672/249.255  685.776/815.693
38      309.713/317.891  311.551/318.951  183.749/248.201  182.457/249.767  173.167/254.031  654.338/850.899
39      331.271/332.909  332.29/340.035   190.792/252.185  180.932/259.944  185.708/255.719  647.137/875.864
40      355.193/336.131  232.643/339.317  180.946/262.636  185.971/263.164  183.59/259.031   654.698/792.629
41      377.304/359.589  255.977/244.171  186.605/263.662  193.321/263.436  188.19/265.654   668.542/791.271
42      379.921/401.174  277.8/270.201    180.505/267.69   180.711/276.402  188.227/271.997  663.969/835.869
43      409.307/404.964  292.069/293.229  195.449/269.036  196.669/277.26   196.262/260.278  676.178/909.889
44      432.034/424.81   309.993/315.46   190.56/265.444   189.873/280.316  192.435/280.126  699.378/954.738
45      444.159/442.594  328.808/335.795  193.941/276.788  199.565/289.84   192.014/289.763  692.945/977.789
46      468.229/459.252  353.782/344.905  194.883/292.752  195.22/293.456   205.8/280.818    782.868/1006.44
47      487.082/475.871  367.18/358.831   198.84/295.624   212.021/289.335  204.586/297.196  799.228/1021.54
48      267.277/502.503  266.504/380.614  206.485/291.437  208.751/293.01   204.827/297.649  780.641/960.721
49      304.351/285.978  296.304/293.319  204.647/300.744  219.501/300.072  215.664/300.494  814.522/973.663
50      314.686/302.493  311.272/295.271  215.314/307.107  220.754/307.691  215.919/290.891  810.286/1051.65
51      325.737/323.767  327.822/325.581  214.684/308.805  223.51/306.075   231.098/301.439  900.325/1099.88
52      343.305/345.213  356.256/330.22   217.673/309.469  213.138/316.923  225.799/302.165  931.267/1099.24
53      362.516/358.258  351.763/370.718  223.179/313.459  229.445/304.11   234.065/307.542  944.563/1138.39
54      377.505/378.166  383.461/377.096  221.536/325.325  228.266/316.626  227.809/315.476  946.69/1138.51
55      399.227/393.232  392.297/400.515  247.683/308.55   229.509/333.954  226.54/328.09    931.044/1170.36
56      416.823/403.239  301.852/396.32   227.044/329.364  236.1/324.558    227.417/331.793  932.889/1107.33
57      431.389/439.424  322.097/319.734  241.851/327.392  242.448/328.433  232.783/334.7    929.991/1117.31
58      461.386/452.545  335.282/339.561  237.17/332.617   237.223/338.237  238.037/338.872  936.697/1146.92
59      474.379/458.845  350.086/365.762  246.19/337.611   235.106/354.139  242.851/341.426  944.318/1234.25
60      495.193/482.783  374.211/367.742  237.374/344.622  240.316/346.119  240.609/338.559  935.387/1297.94
61      514.358/502.993  390.162/390.932  241.255/348.387  242.809/351.17   250.034/345.349  951.429/1293.45
62      520.519/529.711  416.408/408.851  254.22/354.111   244.632/350.879  257.566/345.321  1078.53/1310.91
63      562.293/519.727  428.17/413.478   248.69/355.602   245.328/366.907  254.024/362.498  1067.04/1326.91
64      335.339/553.137  354.924/447.268  246.453/365.996  246.788/376.753  249.476/366.142  552.97/730.293
65      358.974/354.964  363.012/343.379  255.596/368.984  258.5/366.621    259.761/362.566  1092.02/1299.73
66      359.259/378.295  374.198/375.876  263.544/360.228  254.211/384.656  279.067/360.507  1111.76/1344.45
67      384.084/389.701  396.238/388.472  281.756/361.551  269.27/367.422   273.885/369.471  1197.07/1380.06
68      393.396/420.486  414.866/403.697  267.692/379.969  263.991/383.789  274.52/371.483   1187.03/1436.83
69      431.78/427.684   421.565/424.427  277.268/380.385  282.947/373.963  274.886/375.29   1220.53/1431.85
70      442.703/444.091  431.654/447.327  269.111/387.659  277.507/389.476  274.437/382.376  1212.23/1466.07
71      462.218/455.112  453.836/459.67   285.554/387.218  276.451/400.249  281.861/398.923  1191.57/1509.2
72      493.165/451.07   383.937/486.385  272.546/400.891  279.275/399.312  282.033/393.875  1220.72/1411.41
73      512.622/492.644  401.054/398.604  279.056/400.75   282.898/407.293  288.664/399.52   1213.19/1419.73
74      503.085/529.645  411.629/414.095  282.825/418.836  287.468/403.812  278.148/412.515  1214.84/1460.34
75      531.45/535.004   433.107/439.227  287.305/419.055  291.013/414.632  285.342/405.669  1231.41/1533.75
76      561.063/552.126  465.263/443.168  286.613/412.866  286.832/420.092  280.548/421.78   1248.54/1568.52
77      575.087/568.608  465.96/475.069   293.111/424.733  299.161/406.599  297.784/412.169  1233.47/1604.47
78      595.2/572.981    489.888/482.512  293.983/424.379  295.689/415.777  296.699/422.527  1311.24/1654.65
79      598.657/596.911  504.575/496.47   303.08/413.76    304.829/421.582  303.251/419.876  1329.7/1652.36
80      437.117/642.091  429.137/520.225  299.568/429.152  300.053/434.764  302.309/444.049  1339.33/1584.78
81      440.174/435.213  433.759/441.694  306.671/433.963  297.908/444.775  308.561/435.177  1315.29/1653.02

length=82 _add Time elapsed: 469.092/438.8 ticks OK _sub Time elapsed: 451.863/450.332 ticks OK add Time elapsed: 312.852/430.53 ticks OK sub Time 
elapsed: 325.197/420.117 ticks OK mul Time elapsed: 302.835/435.999 ticks OK div Time elapsed: 1360.83/1674.64 ticks OK length=83 _add Time elapsed: 465.206/475.058 ticks OK _sub Time elapsed: 475.242/457.122 ticks OK add Time elapsed: 328.438/421.783 ticks OK sub Time elapsed: 324.661/438.016 ticks OK mul Time elapsed: 313.683/442.003 ticks OK div Time elapsed: 1466.26/1710.36 ticks OK length=84 _add Time elapsed: 489.599/480.966 ticks OK _sub Time elapsed: 495.369/485.923 ticks OK add Time elapsed: 312.373/444.29 ticks OK sub Time elapsed: 314.232/447.084 ticks OK mul Time elapsed: 317.83/449.682 ticks OK div Time elapsed: 1460.4/1732.44 ticks OK length=85 _add Time elapsed: 494.465/510.477 ticks OK _sub Time elapsed: 504.591/501.154 ticks OK add Time elapsed: 329.929/447.761 ticks OK sub Time elapsed: 325.923/429.993 ticks OK mul Time elapsed: 329.57/437.342 ticks OK div Time elapsed: 1529.71/1723.23 ticks OK length=86 _add Time elapsed: 524.428/513.885 ticks OK _sub Time elapsed: 532.973/514.426 ticks OK add Time elapsed: 328.297/453.259 ticks OK sub Time elapsed: 315.945/461.639 ticks OK mul Time elapsed: 313.232/462.955 ticks OK div Time elapsed: 1481.25/1783.94 ticks OK length=87 _add Time elapsed: 542.834/547.054 ticks OK _sub Time elapsed: 534.502/540.445 ticks OK add Time elapsed: 328.451/459.601 ticks OK sub Time elapsed: 323.138/471.289 ticks OK mul Time elapsed: 323.559/469.69 ticks OK div Time elapsed: 1478.56/1789.95 ticks OK length=88 _add Time elapsed: 559.149/540.209 ticks OK _sub Time elapsed: 467.187/548.509 ticks OK add Time elapsed: 325.471/460.771 ticks OK sub Time elapsed: 335.339/466.029 ticks OK mul Time elapsed: 326.084/471.339 ticks OK div Time elapsed: 1479.28/1742.23 ticks OK length=89 _add Time elapsed: 566.712/588.523 ticks OK _sub Time elapsed: 469.806/459.95 ticks OK add Time elapsed: 338.911/457.485 ticks OK sub Time elapsed: 334.403/466.14 ticks OK mul Time elapsed: 340.578/464.191 ticks OK div Time elapsed: 1466.79/1744.03 ticks 
OK length=90 _add Time elapsed: 598.437/587.695 ticks OK _sub Time elapsed: 488.859/482.761 ticks OK add Time elapsed: 338.232/464.905 ticks OK sub Time elapsed: 330.091/476.286 ticks OK mul Time elapsed: 321.685/475.102 ticks OK div Time elapsed: 1491.21/1761.42 ticks OK length=91 _add Time elapsed: 613.987/620.071 ticks OK _sub Time elapsed: 495.52/491.394 ticks OK add Time elapsed: 330.957/467.345 ticks OK sub Time elapsed: 319.492/486.503 ticks OK mul Time elapsed: 337.706/460.03 ticks OK div Time elapsed: 1476.99/1866.98 ticks OK length=92 _add Time elapsed: 625.902/633.783 ticks OK _sub Time elapsed: 523.034/508.256 ticks OK add Time elapsed: 344.815/477.523 ticks OK sub Time elapsed: 342.839/483.43 ticks OK mul Time elapsed: 345.255/472.697 ticks OK div Time elapsed: 1486.58/1907.46 ticks OK length=93 _add Time elapsed: 659.789/634.764 ticks OK _sub Time elapsed: 534.698/534.583 ticks OK add Time elapsed: 338.921/472.935 ticks OK sub Time elapsed: 334.226/489.875 ticks OK mul Time elapsed: 354.807/475.381 ticks OK div Time elapsed: 1495.41/1925.02 ticks OK length=94 _add Time elapsed: 678.224/664.108 ticks OK _sub Time elapsed: 536.47/566.966 ticks OK add Time elapsed: 354.209/470.3 ticks OK sub Time elapsed: 348.005/477.1 ticks OK mul Time elapsed: 350.478/481.299 ticks OK div Time elapsed: 1618.21/1937.98 ticks OK length=95 _add Time elapsed: 682.407/694.824 ticks OK _sub Time elapsed: 562.326/566.028 ticks OK add Time elapsed: 342.352/493.886 ticks OK sub Time elapsed: 353.066/484.998 ticks OK mul Time elapsed: 348.976/488.973 ticks OK div Time elapsed: 1601.93/1968.91 ticks OK length=96 _add Time elapsed: 474.848/726.389 ticks OK _sub Time elapsed: 472.71/602.703 ticks OK add Time elapsed: 352.66/495.778 ticks OK sub Time elapsed: 347.603/510.234 ticks OK mul Time elapsed: 348.987/506.741 ticks OK div Time elapsed: 1614.32/1890.42 ticks OK length=97 _add Time elapsed: 488.601/507.866 ticks OK _sub Time elapsed: 497.994/505.726 ticks OK add Time elapsed: 
356.127/504.72 ticks OK sub Time elapsed: 361.736/487.17 ticks OK mul Time elapsed: 357.125/497.206 ticks OK div Time elapsed: 1628.87/1922.07 ticks OK length=98 _add Time elapsed: 518.49/508.557 ticks OK _sub Time elapsed: 517.291/520.478 ticks OK add Time elapsed: 367.989/506.137 ticks OK sub Time elapsed: 355.828/506.196 ticks OK mul Time elapsed: 359.528/494.497 ticks OK div Time elapsed: 1629.75/1990.39 ticks OK length=99 _add Time elapsed: 541.861/518.594 ticks OK _sub Time elapsed: 536.132/533.183 ticks OK add Time elapsed: 364.271/501.308 ticks OK sub Time elapsed: 363.476/509.053 ticks OK mul Time elapsed: 368.54/512.869 ticks OK div Time elapsed: 1726.36/2012.6 ticks OK
Sep 22 2008
prev sibling parent "Denis Koroskin" <2korden gmail.com> writes:
On Tue, 23 Sep 2008 04:48:41 +0400, bearophile <bearophileHUGS lycos.com>  
wrote:

 Denis Koroskin:
 New hardware - new results (-O -release -inline):
 length=1
 _add Time elapsed: 91.212/0 ticks OK
 _sub Time elapsed: 89.8235/0 ticks OK
 add Time elapsed: 82.6135/0 ticks OK
 sub Time elapsed: 84.2032/0 ticks OK
 mul Time elapsed: 82.8325/0 ticks OK
 div Time elapsed: 99.2924/0 ticks OK

I don't understand how to read the timings, are the timings before the slash of the current implementation, and the timings after the slash of your new code?

No, I think _add/_sub are the old version and add/sub/mul/div are the new version, i.e. you should compare _add and _sub against add and sub respectively.
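
To make that comparison concrete, here is a small sketch in C (the thread's own code is D; C is used here only for illustration). The names `Timing`, `parse_timing` and `speedup` are made up for this example, and the two numbers around the slash are simply kept as "first" and "second" because the thread does not spell out what each column measures.

```c
#include <assert.h>
#include <stdio.h>

/* One benchmark record, e.g. "_add Time elapsed: 335.339/553.137 ticks OK".
   The columns are kept as first/second because the thread does not say
   what each side of the slash measures. */
typedef struct {
    char name[16];
    double first;
    double second;
} Timing;

/* Parse one record; returns 1 on success, 0 if the text does not match. */
int parse_timing(const char *text, Timing *out)
{
    assert(text != NULL && out != NULL);
    return sscanf(text, "%15s Time elapsed: %lf/%lf ticks",
                  out->name, &out->first, &out->second) == 3;
}

/* Ratio old/new for one column: values above 1 mean the new routine
   needed fewer ticks than the old one. */
double speedup(double old_ticks, double new_ticks)
{
    return old_ticks / new_ticks;
}
```

Feeding it the `_add` and `add` records for the same length and calling `speedup` once per column gives the old-versus-new ratio directly.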
Sep 22 2008
prev sibling parent reply Don <nospam nospam.com.au> writes:
Eugene Pelekhay wrote:
 I'm finished optimized version of array operations, using SSE2 instructions.

Good work. Note that your code will actually work better for floats (using SSE) than with SSE2. As far as I can tell, X87 code may actually be faster for the unaligned case.

Comparing x87 code

    fld mem
    fadd mem
    fstp mem

with SSE code

    movapd/movupd reg, mem
    addpd reg, mem
    movapd/movupd mem, reg

On all CPUs below the x87 code takes 3 uops, so it is 6 uops for two doubles, 12 for four floats. The number of SSE uops depends on whether aligned or unaligned loads are used. Importantly, the extra uops are mostly for the load & store ports, so this is going to translate reasonably well to clock cycles:

    CPU        aligned  unaligned
    PentiumM   6        14
    Core2      3        14
    AMD K8     6        11
    AMD K10    4        5

(AMD K7 is the same as K8, except doesn't have SSE2).

Practical conclusion: Probably better to use x87 for the unaligned double case, on everything except K10. For unaligned floats, it's marginal, again only a clear win on the K10. If the _destination_ is aligned, even if the _source_ is not, SSE floats will be better on any of the processors.

Theoretical conclusion: Don't assume SSE is always faster! The balance changes for more complex operations (for simple ones like add, you're limited by memory bandwidth, so SSE doesn't help very much).
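
A minimal sketch of how the practical conclusion could translate into a dispatch, written in C with SSE2 intrinsics instead of D inline assembler (an assumption made so the example is easy to compile). The names `is_aligned16` and `add_doubles` are hypothetical. It takes the packed path only when all three pointers are 16-byte aligned and otherwise falls back to a plain scalar loop, leaving the choice between x87 and scalar SSE to the compiler.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* True if p sits on a 16-byte boundary. */
int is_aligned16(const void *p)
{
    return ((uintptr_t)p & 15u) == 0;
}

/* dst[i] = a[i] + b[i]. Packed SSE2 adds are used only when all three
   pointers are 16-byte aligned; otherwise everything goes through the
   scalar loop at the end. */
void add_doubles(double *dst, const double *a, const double *b, size_t n)
{
    size_t i = 0;
    assert(n == 0 || (dst != NULL && a != NULL && b != NULL));
    if (is_aligned16(dst) && is_aligned16(a) && is_aligned16(b)) {
        for (; i + 2 <= n; i += 2) {
            __m128d va = _mm_load_pd(a + i);            /* movapd */
            __m128d vb = _mm_load_pd(b + i);            /* movapd */
            _mm_store_pd(dst + i, _mm_add_pd(va, vb));  /* addpd + movapd */
        }
    }
    for (; i < n; ++i)  /* tail, or the whole array when unaligned */
        dst[i] = a[i] + b[i];
}
```

Whether the scalar fallback really beats `movupd` on a given CPU is exactly the kind of thing the table above says must be measured per processor.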
Sep 23 2008
parent reply Eugene Pelekhay <pelekhay gmail.com> writes:
Don Wrote:

 Eugene Pelekhay wrote:
 I'm finished optimized version of array operations, using SSE2 instructions.

 Good work. Note that your code will actually work better for floats (using SSE) than with SSE2. As far as I can tell, X87 code may actually be faster for the unaligned case.

 Comparing x87 code

     fld mem
     fadd mem
     fstp mem

 with SSE code

     movapd/movupd reg, mem
     addpd reg, mem
     movapd/movupd mem, reg

 On all CPUs below the x87 code takes 3 uops, so it is 6 uops for two doubles, 12 for four floats. The number of SSE uops depends on whether aligned or unaligned loads are used. Importantly, the extra uops are mostly for the load & store ports, so this is going to translate reasonably well to clock cycles:

     CPU        aligned  unaligned
     PentiumM   6        14
     Core2      3        14
     AMD K8     6        11
     AMD K10    4        5

 (AMD K7 is the same as K8, except doesn't have SSE2).

 Practical conclusion: Probably better to use x87 for the unaligned double case, on everything except K10. For unaligned floats, it's marginal, again only a clear win on the K10. If the _destination_ is aligned, even if the _source_ is not, SSE floats will be better on any of the processors.

 Theoretical conclusion: Don't assume SSE is always faster! The balance changes for more complex operations (for simple ones like add, you're limited by memory bandwidth, so SSE doesn't help very much).

Thanks for the advice, I'll try to improve it. Actually, I haven't used assembler for 7 years and my knowledge is a bit outdated.
Sep 24 2008
parent reply "Jb" <jb nowhere.com> writes:
"Eugene Pelekhay" <pelekhay gmail.com> wrote in message 
news:gbeaf0$g28$1 digitalmars.com...
 Don Wrote:

 Eugene Pelekhay wrote:
 I'm finished optimized version of array operations, using SSE2 
 instructions.

 Good work. Note that your code will actually work better for floats (using SSE) than with SSE2. As far as I can tell, X87 code may actually be faster for the unaligned case.

 Comparing x87 code

     fld mem
     fadd mem
     fstp mem

 with SSE code

     movapd/movupd reg, mem
     addpd reg, mem
     movapd/movupd mem, reg

 On all CPUs below the x87 code takes 3 uops, so it is 6 uops for two doubles, 12 for four floats. The number of SSE uops depends on whether aligned or unaligned loads are used. Importantly, the extra uops are mostly for the load & store ports, so this is going to translate reasonably well to clock cycles:

     CPU        aligned  unaligned
     PentiumM   6        14
     Core2      3        14
     AMD K8     6        11
     AMD K10    4        5

 (AMD K7 is the same as K8, except doesn't have SSE2).

 Practical conclusion: Probably better to use x87 for the unaligned double case, on everything except K10. For unaligned floats, it's marginal, again only a clear win on the K10. If the _destination_ is aligned, even if the _source_ is not, SSE floats will be better on any of the processors.

 Theoretical conclusion: Don't assume SSE is always faster! The balance changes for more complex operations (for simple ones like add, you're limited by memory bandwidth, so SSE doesn't help very much).

 Thanks for the advice, I'll try to improve it. Actually, I haven't used assembler for 7 years and my knowledge is a bit outdated.

If you are doing unaligned memory accesses it's actually faster to do this:

    MOVLPS    XMM0,[address]
    MOVHPS    XMM0,[address+8]

than it is to do

    MOVUPS    XMM0,[address]

The reason being that (on almost all chips except the very latest) SSE ops are actually split into two 64-bit ops, so the former code works out a lot faster.

Also, unaligned loads are a whole lot quicker than unaligned stores, 2 or 3 times faster IIRC. So the best method is to bend over backwards to get your writes aligned.
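
Both suggestions can be sketched together in C with SSE intrinsics (again C rather than the thread's D inline assembler, purely so it compiles as-is). `load_halves` and `add_floats` are hypothetical names. `_mm_loadl_pi` and `_mm_loadh_pi` are the intrinsics that correspond to MOVLPS and MOVHPS, and the scalar loop at the top peels elements until the destination reaches a 16-byte boundary so that every vector store can be an aligned one.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <xmmintrin.h>  /* SSE intrinsics */

/* Unaligned 4-float load done as two 64-bit halves (MOVLPS + MOVHPS). */
__m128 load_halves(const float *p)
{
    __m128 v = _mm_setzero_ps();
    v = _mm_loadl_pi(v, (const __m64 *)p);        /* low two floats  */
    v = _mm_loadh_pi(v, (const __m64 *)(p + 2));  /* high two floats */
    return v;
}

/* dst[i] = a[i] + b[i], peeling scalar iterations until dst is 16-byte
   aligned so that every vector store is an aligned MOVAPS. */
void add_floats(float *dst, const float *a, const float *b, size_t n)
{
    size_t i = 0;
    assert(((uintptr_t)dst & 3u) == 0);  /* dst must at least be float-aligned */
    while (i < n && ((uintptr_t)(dst + i) & 15u) != 0) {
        dst[i] = a[i] + b[i];
        ++i;
    }
    for (; i + 4 <= n; i += 4) {
        __m128 va = load_halves(a + i);             /* sources may be unaligned */
        __m128 vb = load_halves(b + i);
        _mm_store_ps(dst + i, _mm_add_ps(va, vb));  /* aligned store */
    }
    for (; i < n; ++i)  /* leftover elements */
        dst[i] = a[i] + b[i];
}
```

Whether the split load actually beats `_mm_loadu_ps` depends on the processor generation, as the post itself notes, so treat the load helper as something to benchmark rather than a guaranteed win.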
Sep 25 2008
parent reply Eugene Pelekhay <pelekhay gmail.com> writes:
Content-Type: text/plain

Jb Wrote:

 If you are doing unaligned memory accesses it's actually faster to do this..
 
 MOVLPS    XMM0,[address]
 MOVHPS   XMM0,[address+8]
 
 Than it is to do
 
 MOVUPS  XMM0,[address]
 
 The reason being that (on almost all chips except the very latest) SSE ops are 
 actually split into two 64-bit ops. So the former code actually works out a lot 
 faster.
 
 Also, unaligned loads are a whole lot quicker than unaligned stores. 2 or 3 
 times faster IIRC. So the best method is to bend over backwards to get your 
 writes aligned.
 

Thanks, I'll check this way too. Meanwhile, can anybody test the new version on other systems? I implemented the operations for the unaligned case with x87 instructions, and my benchmark shows that it works much slower than the SSE2 version. This means that either Don's theory is wrong, or I have an unusual Pentium-M, or I have bad x87 code.
Sep 26 2008
parent Don <nospam nospam.com.au> writes:
Eugene Pelekhay wrote:
 Jb Wrote:
 
 If you are doing unaligned memory accesses it's actually faster to do this..

 MOVLPS    XMM0,[address]
 MOVHPS   XMM0,[address+8]

 Than it is to do

 MOVUPS  XMM0,[address]

 The reason being that (on almost all chips except the very latest) SSE ops are 
 actually split into two 64-bit ops. So the former code actually works out a lot 
 faster.

 Also, unaligned loads are a whole lot quicker than unaligned stores. 2 or 3 
 times faster IIRC. So the best method is to bend over backwards to get your 
 writes aligned.

 Thanks, I'll check this way too. Meanwhile, can anybody test the new version on other systems? I implemented the operations for the unaligned case with x87 instructions, and my benchmark shows that it works much slower than the SSE2 version. This means that either Don's theory is wrong, or I have an unusual Pentium-M, or I have bad x87 code.

The code is so big that you probably get limited by instruction decoding. The whole loop can be reduced to something like (not tested):

    // EAX=length.
    // count UP from -length
        lea EDX, [EDX + 8*EAX];
        lea EDI, [EDI + 8*EAX];
        lea ESI, [ESI + 8*EAX];
        neg EAX;
    start:
        // 8-byte stride, so the operands are doubles
        fld double ptr [EDX+8*EAX];
        fadd double ptr [ESI+8*EAX];
        fstp double ptr [EDI+8*EAX];
        add EAX, 1;
        jnz start;

There are 5 fused uops in the loop. Every instruction is 1 uop, so decoding is not a bottleneck. There are two memory loads per loop (execution unit p2), one store (p3), add EAX uses p0 or p1, jnz uses p1, fadd uses p0 or p1. Since Pentium M can do 3 uops per clock as long as they're in different units, the best case would be two clocks per loop. Loop unrolling _might_ be necessary to get it to schedule the instructions correctly, but otherwise it's unhelpful.

On PentiumM there's a bug which means it keeps trying to do two fadds at once, even though it only has one FADD execution unit. So one keeps getting stalled, so it probably won't be as fast as it should be. Sometimes you can fix that by moving the add EAX above the store, or above the fadd.

On Core2 you should get 2 clocks per iteration without any trouble.
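
The indexing trick in that loop (move every pointer to the end of its array, then count a single index up from -length to zero) can be sketched in plain C. The function name `add_count_up` is hypothetical, and nothing here says what machine code a given compiler will emit for it; it only shows the addressing scheme.

```c
#include <assert.h>
#include <stddef.h>

/* dst[i] = a[i] + b[i], written the way the assembly loop is: the three
   pointers are moved to the END of their arrays and one signed index
   runs from -n up to 0, so the loop test is just "index != 0". */
void add_count_up(double *dst, const double *a, const double *b, ptrdiff_t n)
{
    assert(n >= 0);
    dst += n;  /* one-past-the-end pointers; only negative offsets are used */
    a += n;
    b += n;
    for (ptrdiff_t i = -n; i != 0; ++i)
        dst[i] = a[i] + b[i];
}
```

One difference from the assembly: the C loop tests the index before the first iteration, so a length of zero simply does nothing.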
Sep 26 2008