digitalmars.D - String compare performance
- bearophile (630/630) Nov 27 2010 While translating a common Python script to D, I have found something in...
- Ellery Newcomer (1/1) Nov 27 2010 how well does dmd perform when you use a switch statement?
- bearophile (31/32) Nov 27 2010 4th D version:
- spir (26/138) Nov 27 2010 4.5.
- bearophile (41/44) Nov 27 2010 Here it is:
- Jonathan M Davis (6/11) Nov 27 2010 You might be able to do that for primitive types and for structs with no...
- Kagamin (2/5) Nov 27 2010 You can use memcmp, though only for utf-8 strings.
- Stewart Gordon (4/11) Nov 28 2010 Only for utf-8 strings? Why's that? I would've thought memcmp to be
- bearophobic (4/19) Nov 28 2010 D community is amazing cult of premature optimization fans. Any one of y...
- Michel Fortin (11/37) Nov 28 2010 Comparing unicode UTF-* strings using memcmp is fine as long as what
- Bruno Medeiros (5/37) Dec 07 2010 Why are people still replying to nameless trolls? There has been several...
- =?ISO-8859-1?Q?Br=fcno_Mediocre?= (3/43) Dec 08 2010 Trololol. Maybe they're a bit dumb, my brother. If they some day become ...
- Bruno Medeiros (4/10) Dec 09 2010 Lol, "Brüno Mediocre", well thought, that's actually funny.
- Daniel Gibson (2/15) Dec 09 2010 Nah, I think it's a rather mediocre joke.
- Robert Jacques (5/18) Nov 28 2010 memcmp is type agnostic if all you want to compare is equality. The othe...
- Kagamin (7/7) Nov 27 2010 You need only match 3-char strings?
- Kagamin (7/7) Nov 27 2010 or
- bearophile (4/6) Nov 27 2010 I agree there are are ways to speed up the code, probably the D #3 code ...
- =?UTF-8?B?IkrDqXLDtG1lIE0uIEJlcmdlciI=?= (7/15) Nov 28 2010 This doesn't work on some architectures if a.ptr or b.ptr is odd...
- bearophile (34/34) Nov 27 2010 I have done another test:
- Robert Jacques (46/90) Nov 27 2010 Hi bearophile,
- bearophile (82/86) Nov 28 2010 Thank you for your work :-)
- =?ISO-8859-1?Q?Pelle_M=E5nsson?= (14/100) Nov 28 2010 I don't have your data set, but for me using random data this was within...
- bearophile (68/70) Nov 28 2010 Thank you for your code. I have added your version, plus a modified vers...
- spir (28/30) Nov 28 2010 lem, it's quite unnatural. Here I have explained what I think is a good ...
- =?ISO-8859-1?Q?Pelle_M=E5nsson?= (6/76) Nov 28 2010 Measuring incorrectly in a performance test is a silly mistake, I was
- Robert Jacques (30/45) Nov 28 2010 [snip]
- bearophile (63/66) Nov 28 2010 The version #11 (I have not reformatted your function):
- Austin Hastings (25/28) Nov 27 2010 It's not clear to me if the point of your post is "how do I make this go...
- Jonathan M Davis (6/23) Nov 27 2010 What I think is pretty clear from this is that at some point, some work ...
- Iain Buclaw (6/29) Nov 28 2010 Just food for thought. GDC uses memcmp when using string comparisons in ...
- bearophile (5/10) Nov 28 2010 The asm of the first version of the function compiled with gdc uses "rep...
- Jonathan M Davis (2/26) Nov 27 2010 Enhancement Request: http://d.puremagic.com/issues/show_bug.cgi?id=5282
- bearophile (8/18) Nov 28 2010 This case is very simple, so a simple tree (as in the third D version of...
- spir (10/17) Nov 28 2010 's
- bearophile (7/8) Nov 28 2010 The solution I suggest for this problem is: when DMD knows at compile-ti...
While translating a common Python script to D, I have found something
interesting, so I have reduced it to a little benchmark.
Below there is the reduced Python2 code (it uses Psyco), and a little program
to generate some test data. The timing of the first D2 version is not good
compared to the Python-Psyco program (the generation of the *300 array is a
quick thing), so I have created two more D2 versions to show myself that D2
wasn't broken :-)
The reduced code looks like a syntetic benchmark, but it has about the same
performance profile of a 60 lines long Python script (the original code was
using xrange(0,len(...)-3,3) instead of xrange(len(...)-3), but the situation
doesn't change much).
Timings, dmd compiler, best of 4, seconds:
Psy: 1.59
2.3 GHz CPU.
dmd 2.050, ldc based on DMD v1.057 and llvm 2.6, gdc head based on GCC 4.4.5.
--------------------------
import psyco
def test(data):
count = 0
for i in xrange(len(data) - 3):
codon = data[i : i + 3]
if codon == "TAG" or codon == "TGA" or codon == "TAA":
count += 1
return count
def main():
data = open("data.txt").read()
data = data * 300
print test(data)
psyco.full()
main()
--------------------------
To produce some test data:
from itertools import islice
def rnd(ia = 3877, ic = 29573, im = 139968):
seed = 42
imf = float(im)
while 1:
seed = (seed * ia + ic) % im
r = seed / imf
yield "ACGT"[int(r * 4)]
def main():
data = "".join(islice(rnd(), 0, 120000))
file("data.txt", "w").write(data)
main()
--------------------------
A D2 traslation (I have used printf to reduce a lot the asm produced):
import std.file: read;
import std.c.stdio: printf;
int test(char[] data) {
int count;
foreach (i; 0 .. data.length - 3) {
char[] codon = data[i .. i + 3];
if (codon == "TAG" || codon == "TGA" || codon == "TAA")
count++;
}
return count;
}
void main() {
char[] data0 = cast(char[])read("data.txt");
int n = 300;
char[] data = new char[data0.length * n];
for (size_t pos; pos < data.length; pos += data0.length)
data[pos .. pos+data0.length] = data0;
printf("%d\n", test(data));
}
--------------------------
import std.file: read;
import std.c.stdio: printf;
bool cmp3(char[] s1, string s2) {
return s1[0] == s2[0] && s1[1] == s2[1] && s1[2] == s2[2];
}
int test(char[] data) {
int count;
foreach (i; 0 .. data.length - 3) {
char[] codon = data[i .. i + 3];
if (cmp3(codon, "TAG") || cmp3(codon, "TGA") || cmp3(codon, "TAA"))
count++;
}
return count;
}
void main() {
char[] data0 = cast(char[])read("data.txt");
int n = 300;
char[] data = new char[data0.length * n];
for (size_t pos; pos < data.length; pos += data0.length)
data[pos .. pos+data0.length] = data0;
printf("%d\n", test(data));
}
--------------------------
import std.file: read;
import std.c.stdio: printf;
int test(char[] data) {
int count;
foreach (i; 0 .. data.length - 3) {
if (data[i] == 'T') {
if (data[i+1] == 'A') {
if (data[i+2] == 'G') {
count++;
} else if (data[i+2] == 'A') {
count++;
}
} else if (data[i+1] == 'G') {
if (data[i+2] == 'A')
count++;
}
}
}
return count;
}
void main() {
char[] data0 = cast(char[])read("data.txt");
int n = 300;
char[] data = new char[data0.length * n];
for (size_t pos; pos < data.length; pos += data0.length)
data[pos .. pos+data0.length] = data0;
printf("%d\n", test(data));
}
--------------------------
dmd -O -release -inline test1.d
_D5test14testFAaZi comdat
L0: sub ESP,024h
push EBX
xor EBX,EBX
push EBP
mov EBP,030h[ESP]
push ESI
xor ESI,ESI
add EBP,0FFFFFFFDh
push EDI
je LA8
mov EDX,03Ch[ESP]
mov EAX,038h[ESP]
mov EDI,EDX
L22: mov ECX,offset FLAT:_D11TypeInfo_Aa6__initZ
lea EAX,3[EBX]
sub EAX,EBX
push ECX
lea EDX,[EDI][EBX]
push dword ptr FLAT:_DATA[0Ch]
push dword ptr FLAT:_DATA[08h]
mov 020h[ESP],EAX
mov 024h[ESP],EDX
push EDX
push EAX
call near ptr __adEq2
add ESP,014h
test EAX,EAX
jne L9E
mov ECX,offset FLAT:_D11TypeInfo_Aa6__initZ
push ECX
push dword ptr FLAT:_DATA[01Ch]
push dword ptr FLAT:_DATA[018h]
push dword ptr 024h[ESP]
push dword ptr 024h[ESP]
call near ptr __adEq2
add ESP,014h
test EAX,EAX
jne L9E
mov EDX,offset FLAT:_D11TypeInfo_Aa6__initZ
push EDX
push dword ptr FLAT:_DATA[02Ch]
push dword ptr FLAT:_DATA[028h]
push dword ptr 024h[ESP]
push dword ptr 024h[ESP]
call near ptr __adEq2
add ESP,014h
test EAX,EAX
je L9F
L9E: inc ESI
L9F: inc EBX
cmp EBX,EBP
jb L22
LA8: pop EDI
mov EAX,ESI
pop ESI
pop EBP
pop EBX
add ESP,024h
ret 8
--------------------------
dmd -O -release -inline test2.d
_D5test24testFAaZi comdat
sub ESP,03Ch
mov ECX,040h[ESP]
push EBX
push EBP
push ESI
xor ESI,ESI
push EDI
xor EDI,EDI
add ECX,0FFFFFFFDh
mov 01Ch[ESP],ECX
je LC5
mov EDX,054h[ESP]
mov EAX,050h[ESP]
mov EBP,EDX
L26: lea EBX,3[ESI]
sub EBX,ESI
lea ECX,[EBP][ESI]
mov 028h[ESP],ECX
mov DL,[ECX]
mov ECX,FLAT:_DATA[0Ch]
mov 024h[ESP],EBX
mov EAX,FLAT:_DATA[08h]
cmp DL,[ECX]
mov 010h[ESP],ECX
jne L65
mov EDX,028h[ESP]
mov EAX,EBX
mov EBX,010h[ESP]
mov CL,1[EDX]
cmp CL,1[EBX]
jne L65
mov DL,2[EDX]
cmp DL,2[EBX]
je LB9
L65: mov EDX,028h[ESP]
mov CL,[EDX]
mov 018h[ESP],EDX
mov EAX,024h[ESP]
mov EDX,FLAT:_DATA[01Ch]
cmp CL,[EDX]
mov EAX,FLAT:_DATA[018h]
jne L96
mov EBX,018h[ESP]
mov AL,1[EBX]
cmp AL,1[EDX]
jne L96
mov AL,2[EBX]
cmp AL,2[EDX]
je LB9
L96: mov EDX,FLAT:_DATA[02Ch]
mov EAX,FLAT:_DATA[028h]
cmp CL,[EDX]
jne LBA
mov ECX,018h[ESP]
mov BL,1[ECX]
cmp BL,1[EDX]
jne LBA
mov CL,2[ECX]
cmp CL,2[EDX]
jne LBA
LB9: inc EDI
LBA: inc ESI
cmp ESI,01Ch[ESP]
jb L26
LC5: mov EAX,EDI
pop EDI
pop ESI
pop EBP
pop EBX
add ESP,03Ch
ret 8
--------------------------
dmd -O -release -inline test3.d
_D5test34testFAaZi comdat
push EAX
xor EDX,EDX
push EBX
xor EBX,EBX
push ESI
mov ESI,010h[ESP]
add ESI,0FFFFFFFDh
je L4A
mov 8[ESP],EDX
mov EDX,014h[ESP]
mov ECX,EDX
mov EAX,010h[ESP]
mov EDX,8[ESP]
L22: cmp [ECX][EDX],054h
jne L45
mov AL,1[ECX][EDX]
cmp AL,041h
jne L39
cmp byte ptr 2[ECX][EDX],047h
je L44
jmp short L3D
L39: cmp AL,047h
jne L45
L3D: cmp byte ptr 2[ECX][EDX],041h
jne L45
L44: inc EBX
L45: inc EDX
cmp EDX,ESI
jb L22
L4A: pop ESI
mov EAX,EBX
pop EBX
pop ECX
ret 8
--------------------------
gdc -O3 -frelease -inline -S test1.d -o test1.s
_D5test14testFAaZi:
pushl %ebp
movl %esp, %ebp
pushl %edi
pushl %esi
pushl %ebx
subl $12, %esp
movl 8(%ebp), %eax
movl 12(%ebp), %edx
movl $0, -24(%ebp)
subl $3, %eax
movl %edx, -20(%ebp)
movl %eax, -16(%ebp)
je .L5
xorl %edx, %edx
jmp .L8
.p2align 4,,7
.p2align 3
.L6:
addl $1, -24(%ebp)
addl $1, %edx
cmpl %edx, -16(%ebp)
jbe .L5
.L8:
movl -20(%ebp), %eax
movl $.LC0, %edi
movl $3, %ecx
addl %edx, %eax
movl %eax, %esi
repz cmpsb
je .L6
movl %eax, %esi
movl $.LC1, %edi
movl $3, %ecx
repz cmpsb
je .L6
movl $.LC2, %edi
movl %eax, %esi
movl $3, %ecx
repz cmpsb
je .L6
addl $1, %edx
cmpl %edx, -16(%ebp)
ja .L8
.L5:
movl -24(%ebp), %eax
addl $12, %esp
popl %ebx
popl %esi
popl %edi
popl %ebp
ret
--------------------------------------
gdc -O3 -frelease -inline -S test2.d -o test2.s
_D5test24testFAaZi:
pushl %ebp
xorl %edx, %edx
movl %esp, %ebp
xorl %eax, %eax
pushl %edi
pushl %esi
pushl %ebx
subl $16, %esp
movl 8(%ebp), %ebx
movl 12(%ebp), %esi
movl %ebx, %edi
subl $3, %edi
jne .L18
jmp .L9
.p2align 4,,7
.p2align 3
.L10:
movl -16(%ebp), %ecx
movzbl .LC1, %ebx
cmpb (%ecx), %bl
je .L15
.L12:
movzbl .LC2, %ebx
cmpb (%ecx), %bl
je .L21
.L13:
addl $1, %edx
cmpl %edx, %edi
jbe .L9
.p2align 4,,7
.p2align 3
.L18:
leal (%esi,%edx), %ecx
movzbl .LC0, %ebx
movl $3, -20(%ebp)
movl %ecx, -16(%ebp)
cmpb (%ecx), %bl
jne .L10
movzbl .LC0+1, %ebx
cmpb 1(%ecx), %bl
jne .L10
movzbl .LC0+2, %ebx
cmpb 2(%ecx), %bl
jne .L10
.L11:
addl $1, %edx
addl $1, %eax
cmpl %edx, %edi
ja .L18
.L9:
addl $16, %esp
popl %ebx
popl %esi
popl %edi
popl %ebp
ret
.p2align 4,,7
.p2align 3
.L15:
movzbl .LC1+1, %ebx
cmpb 1(%ecx), %bl
jne .L12
movzbl .LC1+2, %ebx
cmpb 2(%ecx), %bl
je .L11
movzbl .LC2, %ebx
cmpb (%ecx), %bl
jne .L13
.p2align 4,,7
.p2align 3
.L21:
movzbl .LC2+1, %ebx
cmpb 1(%ecx), %bl
jne .L13
movzbl .LC2+2, %ebx
cmpb 2(%ecx), %bl
jne .L13
jmp .L11
--------------------------------------
gdc -O3 -frelease -inline -S test3.d -o test3.s
_D5test34testFAaZi:
pushl %ebp
xorl %eax, %eax
movl %esp, %ebp
pushl %esi
movl 8(%ebp), %esi
pushl %ebx
movl 12(%ebp), %ebx
subl $3, %esi
je .L6
movl $1, %edx
jmp .L7
.p2align 4,,7
.p2align 3
.L3:
cmpl %edx, %esi
leal 1(%edx), %ecx
jbe .L6
.L11:
movl %ecx, %edx
.L7:
cmpb $84, -1(%ebx,%edx)
jne .L3
movzbl (%ebx,%edx), %ecx
cmpb $65, %cl
je .L10
cmpb $71, %cl
jne .L3
xorl %ecx, %ecx
cmpb $65, 1(%ebx,%edx)
sete %cl
addl %ecx, %eax
cmpl %edx, %esi
leal 1(%edx), %ecx
ja .L11
.p2align 4,,7
.p2align 3
.L6:
popl %ebx
popl %esi
popl %ebp
ret
---------------------------
ldc -O3 -release -inline -output-s test1.d
_D5test14testFAaZi:
pushl %ebp
pushl %ebx
pushl %edi
pushl %esi
subl $20, %esp
movl 40(%esp), %esi
cmpl $3, %esi
je .LBB1_8
addl $4294967293, %esi
xorl %ebx, %ebx
movl %ebx, %edi
.align 16
.LBB1_2:
movl 44(%esp), %eax
leal (%eax,%ebx), %ebp
movl %ebp, 4(%esp)
movl $_D11TypeInfo_Aa6__initZ, 16(%esp)
movl $.str, 12(%esp)
movl $3, 8(%esp)
movl $3, (%esp)
call _adEq
testl %eax, %eax
jne .LBB1_5
movl %ebp, 4(%esp)
movl $_D11TypeInfo_Aa6__initZ, 16(%esp)
movl $.str1, 12(%esp)
movl $3, 8(%esp)
movl $3, (%esp)
call _adEq
testl %eax, %eax
jne .LBB1_5
movl %ebp, 4(%esp)
movl $_D11TypeInfo_Aa6__initZ, 16(%esp)
movl $.str2, 12(%esp)
movl $3, 8(%esp)
movl $3, (%esp)
call _adEq
testl %eax, %eax
je .LBB1_6
.LBB1_5:
incl %edi
.LBB1_6:
incl %ebx
cmpl %esi, %ebx
jb .LBB1_2
.LBB1_7:
movl %edi, %eax
addl $20, %esp
popl %esi
popl %edi
popl %ebx
popl %ebp
ret $8
.LBB1_8:
xorl %edi, %edi
jmp .LBB1_7
---------------------------
ldc -O3 -release -inline -output-s test2.d
_D5test24testFAaZi:
pushl %ebx
pushl %esi
movl 12(%esp), %ecx
cmpl $3, %ecx
je .LBB2_14
movl 16(%esp), %edx
addl $4294967293, %ecx
xorl %eax, %eax
movl %eax, %esi
.align 16
.LBB2_2:
movb (%edx,%esi), %bl
cmpb $84, %bl
jne .LBB2_12
movb 1(%edx,%esi), %bh
cmpb $65, %bh
jne .LBB2_5
cmpb $71, 2(%esi,%edx)
je .LBB2_11
.LBB2_5:
cmpb $84, %bl
jne .LBB2_12
cmpb $71, %bh
jne .LBB2_8
cmpb $65, 2(%esi,%edx)
je .LBB2_11
.LBB2_8:
cmpb $65, %bh
setne %bh
cmpb $84, %bl
jne .LBB2_12
testb %bh, %bh
jne .LBB2_12
cmpb $65, 2(%esi,%edx)
jne .LBB2_12
.LBB2_11:
incl %eax
.LBB2_12:
incl %esi
cmpl %ecx, %esi
jb .LBB2_2
.LBB2_13:
popl %esi
popl %ebx
ret $8
.LBB2_14:
xorl %eax, %eax
jmp .LBB2_13
---------------------------
ldc -O3 -release -inline -output-s test3.d
_D5test34testFAaZi:
pushl %ebx
pushl %edi
pushl %esi
movl 16(%esp), %ecx
cmpl $3, %ecx
je .LBB1_12
movl 20(%esp), %edx
addl $4294967293, %ecx
xorl %eax, %eax
movl %eax, %esi
.align 16
.LBB1_2:
cmpb $84, (%edx,%esi)
jne .LBB1_10
movb 1(%edx,%esi), %bl
cmpb $71, %bl
je .LBB1_8
cmpb $65, %bl
jne .LBB1_10
movb 2(%esi,%edx), %bl
cmpb $71, %bl
jne .LBB1_7
incl %eax
jmp .LBB1_10
.LBB1_7:
cmpb $65, %bl
jmp .LBB1_9
.LBB1_8:
cmpb $65, 2(%esi,%edx)
.LBB1_9:
sete %bl
movzbl %bl, %edi
addl %edi, %eax
.LBB1_10:
incl %esi
cmpl %ecx, %esi
jb .LBB1_2
.LBB1_11:
popl %esi
popl %edi
popl %ebx
ret $8
.LBB1_12:
xorl %eax, %eax
jmp .LBB1_11
---------------------------
Bye,
bearophile
Nov 27 2010
how well does dmd perform when you use a switch statement?
Nov 27 2010
Ellery Newcomer:how well does dmd perform when you use a switch statement?4th D version: import std.file: read; import std.c.stdio: printf; int test(char[] data) { int count; foreach (i; 0 .. data.length - 3) { switch (data[i .. i + 3]) { case "TAG", "TGA", "TAA": count++; default: } } return count; } void main() { char[] data0 = cast(char[])read("data.txt"); int n = 300; char[] data = new char[data0.length * n]; for (size_t pos; pos < data.length; pos += data0.length) data[pos .. pos+data0.length] = data0; printf("%d\n", test(data)); } Timings, dmd compiler, best of 4, seconds: Psy: 1.59 Bye, bearophile
Nov 27 2010
On Sat, 27 Nov 2010 11:53:06 -0500 bearophile <bearophileHUGS lycos.com> wrote:Timings, dmd compiler, best of 4, seconds: Psy: 1.59 =20 =20 2.3 GHz CPU. dmd 2.050, ldc based on DMD v1.057 and llvm 2.6, gdc head based on GCC 4.=4.5.=20 -------------------------- =20 import psyco =20 def test(data): count =3D 0 for i in xrange(len(data) - 3): codon =3D data[i : i + 3] if codon =3D=3D "TAG" or codon =3D=3D "TGA" or codon =3D=3D "TAA": count +=3D 1 return count =20 def main(): data =3D open("data.txt").read() data =3D data * 300 print test(data) =20 psyco.full() main() =20 -------------------------- =20 To produce some test data: =20 =20 from itertools import islice =20 def rnd(ia =3D 3877, ic =3D 29573, im =3D 139968): seed =3D 42 imf =3D float(im) while 1: seed =3D (seed * ia + ic) % im r =3D seed / imf yield "ACGT"[int(r * 4)] =20 =20 def main(): data =3D "".join(islice(rnd(), 0, 120000)) file("data.txt", "w").write(data) =20 main() =20 -------------------------- =20 A D2 traslation (I have used printf to reduce a lot the asm produced): =20 =20 import std.file: read; import std.c.stdio: printf; =20 int test(char[] data) { int count; foreach (i; 0 .. data.length - 3) { char[] codon =3D data[i .. i + 3]; if (codon =3D=3D "TAG" || codon =3D=3D "TGA" || codon =3D=3D "TAA=")count++; } return count; } =20 void main() { char[] data0 =3D cast(char[])read("data.txt"); int n =3D 300; char[] data =3D new char[data0.length * n]; for (size_t pos; pos < data.length; pos +=3D data0.length) data[pos .. pos+data0.length] =3D data0; =20 printf("%d\n", test(data)); } =20 -------------------------- =20 import std.file: read; import std.c.stdio: printf; =20 bool cmp3(char[] s1, string s2) { return s1[0] =3D=3D s2[0] && s1[1] =3D=3D s2[1] && s1[2] =3D=3D s2[2]; } =20 int test(char[] data) { int count; foreach (i; 0 .. data.length - 3) { char[] codon =3D data[i .. i + 3]; if (cmp3(codon, "TAG") || cmp3(codon, "TGA") || cmp3(codon, "TAA"=))count++; } return count; } =20 void main() { char[] data0 =3D cast(char[])read("data.txt"); int n =3D 300; char[] data =3D new char[data0.length * n]; for (size_t pos; pos < data.length; pos +=3D data0.length) data[pos .. pos+data0.length] =3D data0; =20 printf("%d\n", test(data)); }Impressive! What's your diagnostic on why basic string compare parforms the way it does? And have you tried a naive hand-written strComp (just for the sake of compl= eteness) like: bool strComp (string s1, string s2) { auto l =3D s1.length; if (l !=3D s2.length) return false; foreach (uint i ; 0..l) if (s1[i] !=3D s2[i]) return false; return true; } (I bet this will perform at about the same order of magnitude as your versi= Also, is there a way to bit-compare given memory areas at much higher speed= than element per element (I mean for arrays in general)? Denis -- -- -- -- -- -- -- vit esse estrany =E2=98=A3 spir.wikidot.com
Nov 27 2010
denis.spirWhat's your diagnostic on why basic string compare parforms the way it does?I prefer the numbers and asm listings speak for themselves, because my diagnostic sometimes is wrong. But I guess it's caused by three not inlined calls to __adEq2 each loop, done to compare 3 char long arrays.And have you tried a naive hand-written strComp (just for the sake of completeness) like:Here it is: Timings, dmd compiler, best of 4, seconds: Psy: 1.59 import std.file: read; import std.c.stdio: printf; bool cmp2(char[] s1, string s2) { if (s1.length != s2.length) return false; foreach (uint i ; 0 .. s1.length) if (s1[i] != s2[i]) return false; return true; } int test(char[] data) { int count; foreach (i; 0 .. data.length - 3) { char[] codon = data[i .. i + 3]; if (cmp2(codon, "TAG") || cmp2(codon, "TGA") || cmp2(codon, "TAA")) count++; } return count; } void main() { char[] data0 = cast(char[])read("data.txt"); int n = 300; char[] data = new char[data0.length * n]; for (size_t pos; pos < data.length; pos += data0.length) data[pos .. pos+data0.length] = data0; printf("%d\n", test(data)); }Also, is there a way to bit-compare given memory areas at much higher speed than element per element (I mean for arrays in general)?I don't know. I think you can't. Bye, bearophile
Nov 27 2010
On Saturday 27 November 2010 14:46:00 bearophile wrote:denis.spirYou might be able to do that for primitive types and for structs with no opEquals() and with no member variables with an opEquals(), but you certainly can't do it in the general case, since in the general case, you'd have to be calling opEquals() on each element individually. - Jonathan M DavisAlso, is there a way to bit-compare given memory areas at much higher speed than element per element (I mean for arrays in general)?I don't know. I think you can't.
Nov 27 2010
bearophile Wrote:You can use memcmp, though only for utf-8 strings.Also, is there a way to bit-compare given memory areas at much higher speed than element per element (I mean for arrays in general)?I don't know. I think you can't.
Nov 27 2010
On 27/11/2010 23:04, Kagamin wrote:bearophile Wrote:Only for utf-8 strings? Why's that? I would've thought memcmp to be type agnostic. Stewart.You can use memcmp, though only for utf-8 strings.Also, is there a way to bit-compare given memory areas at much higher speed than element per element (I mean for arrays in general)?I don't know. I think you can't.
Nov 28 2010
Stewart Gordon Wrote:On 27/11/2010 23:04, Kagamin wrote:D community is amazing cult of premature optimization fans. Any one of you heard of canonically equivalent sequences? The integrated Unicode support is a clusterfuck. Please do compare ASCII strings with memcmp, but no Unicode. Where did the original poster pull this problem from, his ass? "My system runs 100,000,000,000 instructions per second, but this comparison of 4 letter strings uses 5 cycles too much! This is the only problem on the way to world domination with my $500 Microsoft Word clone!". No wait, the problems are completely imaginatory. Bye, bearophobicbearophile Wrote:Only for utf-8 strings? Why's that? I would've thought memcmp to be type agnostic. Stewart.You can use memcmp, though only for utf-8 strings.Also, is there a way to bit-compare given memory areas at much higher speed than element per element (I mean for arrays in general)?I don't know. I think you can't.
Nov 28 2010
On 2010-11-28 20:57:38 -0500, bearophobic <notbear cave.net> said:Stewart Gordon Wrote:Comparing unicode UTF-* strings using memcmp is fine as long as what you want to know is whether the code points are the same. If your point was that per-code-point comparisons aren't the right way to compare Unicode strings (in most situations), then I support this view too. Though if that's what you wanted to say, you could have made your point clearer. -- Michel Fortin michel.fortin michelf.com http://michelf.com/On 27/11/2010 23:04, Kagamin wrote:D community is amazing cult of premature optimization fans. Any one of you heard of canonically equivalent sequences? The integrated Unicode support is a clusterfuck. Please do compare ASCII strings with memcmp, but no Unicode. Where did the original poster pull this problem from, his ass? "My system runs 100,000,000,000 instructions per second, but this comparison of 4 letter strings uses 5 cycles too much! This is the only problem on the way to world domination with my $500 Microsoft Word clone!". No wait, the problems are completely imaginatory.bearophile Wrote:Only for utf-8 strings? Why's that? I would've thought memcmp to be type agnostic. Stewart.You can use memcmp, though only for utf-8 strings.Also, is there a way to bit-compare given memory areas at much higher speed than element per element (I mean for arrays in general)?I don't know. I think you can't.
Nov 28 2010
On 29/11/2010 02:11, Michel Fortin wrote:On 2010-11-28 20:57:38 -0500, bearophobic <notbear cave.net> said:Why are people still replying to nameless trolls? There has been several cases of that in recent threads. :/ -- Bruno Medeiros - Software EngineerStewart Gordon Wrote:Comparing unicode UTF-* strings using memcmp is fine as long as what you want to know is whether the code points are the same. If your point was that per-code-point comparisons aren't the right way to compare Unicode strings (in most situations), then I support this view too. Though if that's what you wanted to say, you could have made your point clearer.On 27/11/2010 23:04, Kagamin wrote:D community is amazing cult of premature optimization fans. Any one of you heard of canonically equivalent sequences? The integrated Unicode support is a clusterfuck. Please do compare ASCII strings with memcmp, but no Unicode. Where did the original poster pull this problem from, his ass? "My system runs 100,000,000,000 instructions per second, but this comparison of 4 letter strings uses 5 cycles too much! This is the only problem on the way to world domination with my $500 Microsoft Word clone!". No wait, the problems are completely imaginatory.bearophile Wrote:Only for utf-8 strings? Why's that? I would've thought memcmp to be type agnostic. Stewart.You can use memcmp, though only for utf-8 strings.Also, is there a way to bit-compare given memory areas at much higher speed than element per element (I mean for arrays in general)?I don't know. I think you can't.
Dec 07 2010
Bruno Medeiros Wrote:On 29/11/2010 02:11, Michel Fortin wrote:Trololol. Maybe they're a bit dumb, my brother. If they some day become smarter, they'll stop using D. They see how much shit it is. I miss my wife. Oh god.... bring back my life! Bring me my.. sandwich!On 2010-11-28 20:57:38 -0500, bearophobic <notbear cave.net> said:Why are people still replying to nameless trolls? There has been several cases of that in recent threads. :/Stewart Gordon Wrote:Comparing unicode UTF-* strings using memcmp is fine as long as what you want to know is whether the code points are the same. If your point was that per-code-point comparisons aren't the right way to compare Unicode strings (in most situations), then I support this view too. Though if that's what you wanted to say, you could have made your point clearer.On 27/11/2010 23:04, Kagamin wrote:D community is amazing cult of premature optimization fans. Any one of you heard of canonically equivalent sequences? The integrated Unicode support is a clusterfuck. Please do compare ASCII strings with memcmp, but no Unicode. Where did the original poster pull this problem from, his ass? "My system runs 100,000,000,000 instructions per second, but this comparison of 4 letter strings uses 5 cycles too much! This is the only problem on the way to world domination with my $500 Microsoft Word clone!". No wait, the problems are completely imaginatory.bearophile Wrote:Only for utf-8 strings? Why's that? I would've thought memcmp to be type agnostic. Stewart.You can use memcmp, though only for utf-8 strings.Also, is there a way to bit-compare given memory areas at much higher speed than element per element (I mean for arrays in general)?I don't know. I think you can't.
Dec 08 2010
On 08/12/2010 15:02, Brüno Mediocre wrote:Bruno Medeiros Wrote:Lol, "Brüno Mediocre", well thought, that's actually funny. -- Brüno Mediocre - Software EngineerWhy are people still replying to nameless trolls? There has been several cases of that in recent threads. :/Trololol. Maybe they're a bit dumb, my brother. If they some day become smarter, they'll stop using D. They see how much shit it is. I miss my wife. Oh god.... bring back my life! Bring me my.. sandwich!
Dec 09 2010
Bruno Medeiros schrieb:On 08/12/2010 15:02, Brüno Mediocre wrote:Nah, I think it's a rather mediocre joke.Bruno Medeiros Wrote:Lol, "Brüno Mediocre", well thought, that's actually funny.Why are people still replying to nameless trolls? There has been several cases of that in recent threads. :/Trololol. Maybe they're a bit dumb, my brother. If they some day become smarter, they'll stop using D. They see how much shit it is. I miss my wife. Oh god.... bring back my life! Bring me my.. sandwich!
Dec 09 2010
On Sun, 28 Nov 2010 20:32:24 -0500, Stewart Gordon <smjg_1998 yahoo.com> wrote:On 27/11/2010 23:04, Kagamin wrote:memcmp is type agnostic if all you want to compare is equality. The other use of memcmp is essentially as an opCmp, in which case it would be type sensitive.bearophile Wrote:Only for utf-8 strings? Why's that? I would've thought memcmp to be type agnostic. Stewart.You can use memcmp, though only for utf-8 strings.Also, is there a way to bit-compare given memory areas at much higher speed than element per element (I mean for arrays in general)?I don't know. I think you can't.
Nov 28 2010
You need only match 3-char strings?
Try this hack
bool cmp3c(string a, string etalon) pure
{
if(a.length!=3)return false;
return *cast(short*)a.ptr==*cast(short*)b.ptr && a[2]==b[2];
}
Nov 27 2010
or
bool cmp3c(string a, string b) pure
{
if(a.length!=3)return false;
return *cast(char[3]*)a.ptr==*cast(char[3]*)b.ptr;
}
llvm.memcmp should rock with lengths known at compile time.
Nov 27 2010
Kagamin:You need only match 3-char strings? Try this hackenough. But while some things are better left to Phobos, comparing short strings efficiently is the compiler's job, not mine (and I am not sure your trick works in C99). Bye, bearophile
Nov 27 2010
Kagamin wrote:
You need only match 3-char strings?
Try this hack
=20
bool cmp3c(string a, string etalon) pure
{
if(a.length!=3D3)return false;
return *cast(short*)a.ptr=3D=3D*cast(short*)b.ptr && a[2]=3D=3Db[2];
}
This doesn't work on some architectures if a.ptr or b.ptr is odd...
Jerome
--=20
mailto:jeberger free.fr
http://jeberger.free.fr
Jabber: jeberger jabber.fr
Nov 28 2010
I have done another test:
Timings, dmd compiler, best of 4, seconds:
Psy: 1.59
import std.file: read;
import std.c.stdio: printf;
int test(char[] data) {
int count;
foreach (i; 0 .. data.length - 3) {
char[] codon = data[i .. i + 3];
if ((codon.length == 3 && codon[0] == 'T' && codon[1] == 'A' &&
codon[2] == 'G') ||
(codon.length == 3 && codon[0] == 'T' && codon[1] == 'G' &&
codon[2] == 'A') ||
(codon.length == 3 && codon[0] == 'T' && codon[1] == 'A' &&
codon[2] == 'A'))
count++;
}
return count;
}
void main() {
char[] data0 = cast(char[])read("data.txt");
int n = 300;
char[] data = new char[data0.length * n];
for (size_t pos; pos < data.length; pos += data0.length)
data[pos .. pos+data0.length] = data0;
printf("%d\n", test(data));
}
So when there is to compare among strings known at compile-time to be small
(like < 6 char), the comparison shall be replaced with inlined single char
comparisons. This makes the code longer so it increases code cache pressure,
but seeing how much slow the alternative is, I think it's an improvement.
(A smart compiler is even able to remove the codon.length==3 test because the
slice data[i..i+3] is always of length 3 (here mysteriously if you remove those
three length tests the program compiled with dmd gets slower)).
Bye,
bearophile
Nov 27 2010
On Sat, 27 Nov 2010 22:08:29 -0500, bearophile <bearophileHUGS lycos.com>
wrote:
I have done another test:
Timings, dmd compiler, best of 4, seconds:
Psy: 1.59
import std.file: read;
import std.c.stdio: printf;
int test(char[] data) {
int count;
foreach (i; 0 .. data.length - 3) {
char[] codon = data[i .. i + 3];
if ((codon.length == 3 && codon[0] == 'T' && codon[1] == 'A' &&
codon[2] == 'G') ||
(codon.length == 3 && codon[0] == 'T' && codon[1] == 'G' &&
codon[2] == 'A') ||
(codon.length == 3 && codon[0] == 'T' && codon[1] == 'A' &&
codon[2] == 'A'))
count++;
}
return count;
}
void main() {
char[] data0 = cast(char[])read("data.txt");
int n = 300;
char[] data = new char[data0.length * n];
for (size_t pos; pos < data.length; pos += data0.length)
data[pos .. pos+data0.length] = data0;
printf("%d\n", test(data));
}
So when there is to compare among strings known at compile-time to be
small (like < 6 char), the comparison shall be replaced with inlined
single char comparisons. This makes the code longer so it increases code
cache pressure, but seeing how much slow the alternative is, I think
it's an improvement.
(A smart compiler is even able to remove the codon.length==3 test
because the slice data[i..i+3] is always of length 3 (here mysteriously
if you remove those three length tests the program compiled with dmd
gets slower)).
Bye,
bearophile
Hi bearophile,
I've spent some time having fun this afternoon optimizing array-equals
using vectorization techniques. I found that vectorizing using ulongs
worked best on my system except with the shortest strings, where a simple
Duff's device edged it out. If you'd like to try it out on your data set:
bool arrayComp3(bool useBitCompare = true,T)(const T[] a, const T[] b)
pure nothrow {
if(a.length != b.length) return false;
static if(useBitCompare) {
auto pab = cast(ubyte*)a.ptr;
auto pbb = cast(ubyte*)b.ptr;
if(pab is pbb) return true;
auto byte_length = a.length*T.sizeof;
auto pa_end = cast(ulong*)(pab+byte_length);
final switch (byte_length % ulong.sizeof) {
case 7: if(*pab++ != *pbb++ ) return false;
case 6: if(*pab++ != *pbb++ ) return false;
case 5: if(*pab++ != *pbb++ ) return false;
case 4: if(*pab++ != *pbb++ ) return false;
case 3: if(*pab++ != *pbb++ ) return false;
case 2: if(*pab++ != *pbb++ ) return false;
case 1: if(*pab++ != *pbb++ ) return false;
case 0:
}
auto pa = cast(ulong*)pab;
auto pb = cast(ulong*)pbb;
while (pa < pa_end) {
if(*pa++ != *pb++ ) return false;
}
} else { // default to a
short duff's device
auto pa = a.ptr;
auto pb = b.ptr;
if(pa is pb) return true;
auto n = (a.length + 3) / 4;
final switch (a.length % 4) {
case 0: do { if(*pa++ != *pb++ ) return false;
case 3: if(*pa++ != *pb++ ) return false;
case 2: if(*pa++ != *pb++ ) return false;
case 1: if(*pa++ != *pb++ ) return false;
} while (--n > 0);
} }
return true;
}
Nov 27 2010
Robert Jacques:I've spent some time having fun this afternoon optimizing array-equals using vectorization techniques. I found that vectorizing using ulongs worked best on my system except with the shortest strings, where a simple Duff's device edged it out. If you'd like to try it out on your data set:Thank you for your work :-) import std.file: read; import std.c.stdio: printf; import std.exception: assumeUnique; bool arrayComp(bool useBitCompare=true, T) (const T[] a, const T[] b) pure nothrow { if (a.length != b.length) return false; static if (useBitCompare) { auto pab = cast(ubyte*)a.ptr; auto pbb = cast(ubyte*)b.ptr; if (pab is pbb) return true; auto byte_length = a.length * T.sizeof; auto pa_end = cast(ulong*)(pab + byte_length); final switch (byte_length % ulong.sizeof) { case 7: if (*pab++ != *pbb++) return false; case 6: if (*pab++ != *pbb++) return false; case 5: if (*pab++ != *pbb++) return false; case 4: if (*pab++ != *pbb++) return false; case 3: if (*pab++ != *pbb++) return false; case 2: if (*pab++ != *pbb++) return false; case 1: if (*pab++ != *pbb++) return false; case 0: } auto pa = cast(ulong*)pab; auto pb = cast(ulong*)pbb; while (pa < pa_end) { if (*pa++ != *pb++) return false; } } else { // default to a short duff's device auto pa = a.ptr; auto pb = b.ptr; if (pa == pb) return true; auto n = (a.length + 3) / 4; final switch (a.length % 4) { case 0: do { if (*pa++ != *pb++) return false; case 3: if (*pa++ != *pb++) return false; case 2: if (*pa++ != *pb++) return false; case 1: if (*pa++ != *pb++) return false; } while (--n > 0); } } return true; } int test(string data) { int count; foreach (i; 0 .. data.length - 3) { auto codon = data[i .. i + 3]; if (arrayComp(codon, "TAG") || arrayComp(codon, "TGA") || arrayComp(codon, "TAA")) count++; } return count; } void main() { char[] data0 = cast(char[])read("data.txt"); int n = 300; char[] data = new char[data0.length * n]; for (size_t pos; pos < data.length; pos += data0.length) data[pos .. pos+data0.length] = data0; string sdata = assumeUnique(data); printf("%d\n", test(sdata)); } Timings, dmd compiler, best of 4, seconds: Psy: 1.59 Your function can't be inlined because it's big, so this code isn't faster than inlined code like this generated by the compiler: (codon.length == 3 && codon[0] == 'T' && codon[1] == 'A' && codon[2] == 'G') Bye, bearophile
Nov 28 2010
On 11/28/2010 12:44 PM, bearophile wrote:Robert Jacques:I don't have your data set, but for me using random data this was within int test(string data) { int count; while (true) { data = data.find("TAG", "TGA", "TAA")[0]; if (data.empty) return count; count += 1; data.popFront; } } Also, this one is far easier to generalize for strings of different lengths, and such :-)I've spent some time having fun this afternoon optimizing array-equals using vectorization techniques. I found that vectorizing using ulongs worked best on my system except with the shortest strings, where a simple Duff's device edged it out. If you'd like to try it out on your data set:Thank you for your work :-) import std.file: read; import std.c.stdio: printf; import std.exception: assumeUnique; bool arrayComp(bool useBitCompare=true, T) (const T[] a, const T[] b) pure nothrow { if (a.length != b.length) return false; static if (useBitCompare) { auto pab = cast(ubyte*)a.ptr; auto pbb = cast(ubyte*)b.ptr; if (pab is pbb) return true; auto byte_length = a.length * T.sizeof; auto pa_end = cast(ulong*)(pab + byte_length); final switch (byte_length % ulong.sizeof) { case 7: if (*pab++ != *pbb++) return false; case 6: if (*pab++ != *pbb++) return false; case 5: if (*pab++ != *pbb++) return false; case 4: if (*pab++ != *pbb++) return false; case 3: if (*pab++ != *pbb++) return false; case 2: if (*pab++ != *pbb++) return false; case 1: if (*pab++ != *pbb++) return false; case 0: } auto pa = cast(ulong*)pab; auto pb = cast(ulong*)pbb; while (pa< pa_end) { if (*pa++ != *pb++) return false; } } else { // default to a short duff's device auto pa = a.ptr; auto pb = b.ptr; if (pa == pb) return true; auto n = (a.length + 3) / 4; final switch (a.length % 4) { case 0: do { if (*pa++ != *pb++) return false; case 3: if (*pa++ != *pb++) return false; case 2: if (*pa++ != *pb++) return false; case 1: if (*pa++ != *pb++) return false; } while (--n> 0); } } return true; } int test(string data) { int count; foreach (i; 0 .. data.length - 3) { auto codon = data[i .. i + 3]; if (arrayComp(codon, "TAG") || arrayComp(codon, "TGA") || arrayComp(codon, "TAA")) count++; } return count; } void main() { char[] data0 = cast(char[])read("data.txt"); int n = 300; char[] data = new char[data0.length * n]; for (size_t pos; pos< data.length; pos += data0.length) data[pos .. pos+data0.length] = data0; string sdata = assumeUnique(data); printf("%d\n", test(sdata)); } Timings, dmd compiler, best of 4, seconds: Psy: 1.59 Your function can't be inlined because it's big, so this code isn't faster than inlined code like this generated by the compiler: (codon.length == 3&& codon[0] == 'T'&& codon[1] == 'A'&& codon[2] == 'G') Bye, bearophile
Nov 28 2010
Pelle Mĺnsson:I don't have your data set,In my first post I have explained how to create the data set, using a little Python script that I have listed there. You just need a Python2 interpreter.any fiddly code.Thank you for your code. I have added your version, plus a modified version ------------------------- import std.file: read; import std.c.stdio: printf; import std.algorithm: find; import std.array: empty, popFront; int test(char[] data) { int count; while (true) { data = data.find("TAG", "TGA", "TAA")[0]; if (data.empty) return count; count++; data.popFront(); } } void main() { char[] data0 = cast(char[])read("data.txt"); int n = 300; char[] data = new char[data0.length * n]; for (size_t pos; pos < data.length; pos += data0.length) data[pos .. pos + data0.length] = data0; printf("%d\n", test(data)); } ------------------------- import std.file: read; import std.c.stdio: printf; import std.algorithm: find; import std.array: empty, popFront; int test(ubyte[] data) { int count; while (true) { data = data.find("TAG", "TGA", "TAA")[0]; if (data.empty) return count; count++; data.popFront(); } } void main() { ubyte[] data0 = cast(ubyte[])read("data.txt"); int n = 300; ubyte[] data = new ubyte[data0.length * n]; for (size_t pos; pos < data.length; pos += data0.length) data[pos .. pos + data0.length] = data0; printf("%d\n", test(data)); } ------------------------- Timings, dmd compiler, best of 4, seconds: Psy: 1.59 So on this PC it's not much fast. And generally I don't like to use find, empty and popFront to solve this problem, it's quite unnatural. Here I have explained what I think is a good enough solution to this performance problem: http://www.digitalmars.com/webnews/newsgroups.php?art_group=digitalmars.D&article_id=123044 Bye, bearophile
Nov 28 2010
On Sun, 28 Nov 2010 10:48:46 -0500 bearophile <bearophileHUGS lycos.com> wrote:generally I don't like to use find, empty and popFront to solve this prob=lem, it's quite unnatural. Here I have explained what I think is a good eno= ugh solution to this performance problem:http://www.digitalmars.com/webnews/newsgroups.php?art_group=3Ddigitalmars=.D&article_id=3D123044 What I would like is a just decent routine for comparing arrays of elements= which _conceptually_ are values, whatever their complexity (meaning, possi= bly structs or nested arrays). This should always (*) use a default '=3D=3D= ' comparison, which in turn may perform a kind of memcomp. I have no idea how to avoid bad perf such as shown by your tests, but this = is really necessary. Tons of routines and algorithms use '=3D=3D', and amon= g the most used; str.find will perform an arbitrary number of '=3D=3D' chec= ks; and I don't even want to think at pattern matching ;-) Denis (*) I think that: * If an element type is a value type, it should not overide opEquals. '=3D= =3D' should translate to equality of every sub-element. Else, there is some= thing wrong in the design. But there may be use cases I overlook. (Eg metad= ata stored on the element itself?) * A plain value may well refer to a "thing"/object, eg a wiget's outlook po= inting to a color palette, in which case reference comparison is the right = choice -- so, no complication. * Real things (referenced objects, entities with self/id) should not have o= pEquals at all. They should be compare only by id, meaning using 'is'. This= shows again the ambivalence of dyn arrays... -- -- -- -- -- -- -- vit esse estrany =E2=98=A3 spir.wikidot.com
Nov 28 2010
On 11/28/2010 04:48 PM, bearophile wrote:Pelle Mĺnsson:Measuring incorrectly in a performance test is a silly mistake, I was indeed wrong about the factor two thing! I thought find was specialized for chars so it wouldn't require casting to ubyte. Maybe I was wrong again. Thank you for your engaging performance oriented posts. :-)I don't have your data set,In my first post I have explained how to create the data set, using a little Python script that I have listed there. You just need a Python2 interpreter.any fiddly code.Thank you for your code. I have added your version, plus a modified version ------------------------- import std.file: read; import std.c.stdio: printf; import std.algorithm: find; import std.array: empty, popFront; int test(char[] data) { int count; while (true) { data = data.find("TAG", "TGA", "TAA")[0]; if (data.empty) return count; count++; data.popFront(); } } void main() { char[] data0 = cast(char[])read("data.txt"); int n = 300; char[] data = new char[data0.length * n]; for (size_t pos; pos< data.length; pos += data0.length) data[pos .. pos + data0.length] = data0; printf("%d\n", test(data)); } ------------------------- import std.file: read; import std.c.stdio: printf; import std.algorithm: find; import std.array: empty, popFront; int test(ubyte[] data) { int count; while (true) { data = data.find("TAG", "TGA", "TAA")[0]; if (data.empty) return count; count++; data.popFront(); } } void main() { ubyte[] data0 = cast(ubyte[])read("data.txt"); int n = 300; ubyte[] data = new ubyte[data0.length * n]; for (size_t pos; pos< data.length; pos += data0.length) data[pos .. pos + data0.length] = data0; printf("%d\n", test(data)); } ------------------------- Timings, dmd compiler, best of 4, seconds: Psy: 1.59 So on this PC it's not much fast. And generally I don't like to use find, empty and popFront to solve this problem, it's quite unnatural. Here I have explained what I think is a good enough solution to this performance problem: http://www.digitalmars.com/webnews/newsgroups.php?art_group=digitalmars.D&article_id=123044 Bye, bearophile
Nov 28 2010
On Sun, 28 Nov 2010 06:44:23 -0500, bearophile <bearophileHUGS lycos.com> wrote:Robert Jacques:[snip]I've spent some time having fun this afternoon optimizing array-equals using vectorization techniques. I found that vectorizing using ulongs worked best on my system except with the shortest strings, where a simple Duff's device edged it out. If you'd like to try it out on your data set:Thank you for your work :-)Your function can't be inlined because it's big, so this code isn't faster than inlined code like this generated by the compiler: (codon.length == 3 && codon[0] == 'T' && codon[1] == 'A' && codon[2] == 'G') Bye, bearophileStill, part of the point was that string comparisons in general were way too slow. Anyways, I've applied the same technique in a partially unrolled version if you want to check it out: bool arrayComp(T, size_t N)(const T[] a, ref const T[N] b) pure nothrow { static if(T.sizeof*N <= uint.sizeof) { return a.length == N && !( (*(cast(uint*)a.ptr) ^ *(cast(uint*)b.ptr)) & (uint.max >> 8*(uint.sizeof - T.sizeof*N) )); } else static if(T.sizeof*N <= ulong.sizeof) { return a.length == N && !( (*(cast(ulong*)a.ptr) ^ *(cast(ulong*)b.ptr)) & (ulong.max>> 8*(ulong.sizeof - T.sizeof*N) )); } else { // Fall back to a loop if(a.length != N || (*(cast(ulong*)a.ptr) != *(cast(ulong*)b.ptr)) ) return false; enum alignment = T.sizeof*N % ulong.sizeof > 0 ? T.sizeof*N % ulong.sizeof : ulong.sizeof; auto pa = cast(ulong*)(cast(ubyte*)a.ptr + alignment); auto pb = cast(ulong*)(cast(ubyte*)b.ptr + alignment); auto pa_end = cast(ulong*)(cast(ubyte*)a.ptr + T.sizeof*N); while (pa < pa_end) { if(*pa++ != *pb++ ) return false; } return true; } } Be warned that you'll have to use strings explicitly typed as immutable char[N], since both array literals and string literals won't match a this template.
Nov 28 2010
Robert Jacques:Still, part of the point was that string comparisons in general were way too slow. Anyways, I've applied the same technique in a partially unrolled version if you want to check it out:import std.file: read; import std.c.stdio: printf; // not formatted bool arrayComp(T, size_t N)(const T[] a, ref const T[N] b) pure nothrow { static if(T.sizeof*N <= uint.sizeof) { return a.length == N && !( (*(cast(uint*)a.ptr) ^ *(cast(uint*)b.ptr)) & (uint.max >> 8*(uint.sizeof - T.sizeof*N) )); } else static if(T.sizeof*N <= ulong.sizeof) { return a.length == N && !( (*(cast(ulong*)a.ptr) ^ *(cast(ulong*)b.ptr)) & (ulong.max>> 8*(ulong.sizeof - T.sizeof*N) )); } else { // Fall back to a loop if(a.length != N || (*(cast(ulong*)a.ptr) != *(cast(ulong*)b.ptr)) ) return false; enum alignment = T.sizeof*N % ulong.sizeof > 0 ? T.sizeof*N % ulong.sizeof : ulong.sizeof; auto pa = cast(ulong*)(cast(ubyte*)a.ptr + alignment); auto pb = cast(ulong*)(cast(ubyte*)b.ptr + alignment); auto pa_end = cast(ulong*)(cast(ubyte*)a.ptr + T.sizeof*N); while (pa < pa_end) { if(*pa++ != *pb++ ) return false; } return true; } } pure int test(const char[] data) { int count; foreach (i; 0 .. data.length - 3) { const char[] codon = data[i .. i + 3]; if (arrayComp!(char, 3)(codon, "TAG") || arrayComp!(char, 3)(codon, "TGA") || arrayComp!(char, 3)(codon, "TAA")) count++; } return count; } void main() { char[] data0 = cast(char[])read("data.txt"); int n = 300; char[] data = new char[data0.length * n]; for (size_t pos; pos < data.length; pos += data0.length) data[pos .. pos + data0.length] = data0; printf("%d\n", test(data)); } Timings, dmd compiler, best of 4, seconds: Psy: 1.59 This problem probably needs to be solved by the compiler. Bye, bearophile
Nov 28 2010
On 11/27/2010 11:53 AM, bearophile wrote:While translating a common Python script to D, I have found something interesting, so I have reduced it to a little benchmark. Below there is the reduced Python2 code (it uses Psyco), and a little program to generate some test data. The timing of the first D2 version is not good compared to the Python-Psyco program (the generation of the *300 array is a quick thing), so I have created two more D2 versions to show myself that D2 wasn't broken :-) The reduced code looks like a syntetic benchmark, but it has about the same performance profile of a 60 lines long Python script (the original code was using xrange(0,len(...)-3,3) instead of xrange(len(...)-3), but the situation doesn't change much).It's not clear to me if the point of your post is "how do I make this go faster?" or "Waa! D underperforms Python by default". If it is the former, a state machine is probably the right answer for long input strings. start -> E E A,C,G:-> E E T:-> T T A:-> TA T C:-> E T G:-> TG T T:-> T TA A,G:-> E, {++count} TA C:-> E TA T:-> T TG A:-> E, {++count} TG C,G:-> E TG T:-> T Since the probability of failure is about .75, 0.5, {.5 or .75}, the overall comparison cost is about 1 + .25 + .125 = 1.375, times compare+loop times n, plus all the function call overhead, etc. Using a DFA should reduce the cost to 1 * index, times n, assuming you are using a small alphabet (I'm guessing that your codons are ascii+uppercase only, table-size = 26). =Austin
Nov 27 2010
On Saturday 27 November 2010 22:13:39 Austin Hastings wrote:On 11/27/2010 11:53 AM, bearophile wrote:What I think is pretty clear from this is that at some point, some work needs to be done to optimize array comparisons when memcmp can be used instead of having to call the individual opEquals() of each element. It's not the sort of thing that's likely to be a priority, but it really should be done at some point. - Jonathan M DavisWhile translating a common Python script to D, I have found something interesting, so I have reduced it to a little benchmark. Below there is the reduced Python2 code (it uses Psyco), and a little program to generate some test data. The timing of the first D2 version is not good compared to the Python-Psyco program (the generation of the *300 array is a quick thing), so I have created two more D2 versions to show myself that D2 wasn't broken :-) The reduced code looks like a syntetic benchmark, but it has about the same performance profile of a 60 lines long Python script (the original code was using xrange(0,len(...)-3,3) instead of xrange(len(...)-3), but the situation doesn't change much).It's not clear to me if the point of your post is "how do I make this go faster?" or "Waa! D underperforms Python by default".
Nov 27 2010
== Quote from Jonathan M Davis (jmdavisProg gmx.com)'s articleOn Saturday 27 November 2010 22:13:39 Austin Hastings wrote:Just food for thought. GDC uses memcmp when using string comparisons in the first implementation. if (codon == "TAG" || codon == "TGA" || codon == "TAA") And it is still the slowest case of the lot. RegardsOn 11/27/2010 11:53 AM, bearophile wrote:What I think is pretty clear from this is that at some point, some work needs to be done to optimize array comparisons when memcmp can be used instead of having to call the individual opEquals() of each element. It's not the sort of thing that's likely to be a priority, but it really should be done at some point. - Jonathan M DavisWhile translating a common Python script to D, I have found something interesting, so I have reduced it to a little benchmark. Below there is the reduced Python2 code (it uses Psyco), and a little program to generate some test data. The timing of the first D2 version is not good compared to the Python-Psyco program (the generation of the *300 array is a quick thing), so I have created two more D2 versions to show myself that D2 wasn't broken :-) The reduced code looks like a syntetic benchmark, but it has about the same performance profile of a 60 lines long Python script (the original code was using xrange(0,len(...)-3,3) instead of xrange(len(...)-3), but the situation doesn't change much).It's not clear to me if the point of your post is "how do I make this go faster?" or "Waa! D underperforms Python by default".
Nov 28 2010
Iain Buclaw:GDC uses memcmp when using string comparisons in the first implementation. if (codon == "TAG" || codon == "TGA" || codon == "TAA") And it is still the slowest case of the lot.The asm of the first version of the function compiled with gdc uses "repz cmpsb". And while I don't remember precise timings for it, I think it was about two times faster than the first version compiled with dmd (on Windows). Regarding the idea of using memcmp, I have added a comment to the issue 5282. Bye, bearophile
Nov 28 2010
On Saturday 27 November 2010 22:23:39 Jonathan M Davis wrote:On Saturday 27 November 2010 22:13:39 Austin Hastings wrote:Enhancement Request: http://d.puremagic.com/issues/show_bug.cgi?id=5282On 11/27/2010 11:53 AM, bearophile wrote:What I think is pretty clear from this is that at some point, some work needs to be done to optimize array comparisons when memcmp can be used instead of having to call the individual opEquals() of each element. It's not the sort of thing that's likely to be a priority, but it really should be done at some point.While translating a common Python script to D, I have found something interesting, so I have reduced it to a little benchmark. Below there is the reduced Python2 code (it uses Psyco), and a little program to generate some test data. The timing of the first D2 version is not good compared to the Python-Psyco program (the generation of the *300 array is a quick thing), so I have created two more D2 versions to show myself that D2 wasn't broken :-) The reduced code looks like a syntetic benchmark, but it has about the same performance profile of a 60 lines long Python script (the original code was using xrange(0,len(...)-3,3) instead of xrange(len(...)-3), but the situation doesn't change much).It's not clear to me if the point of your post is "how do I make this go faster?" or "Waa! D underperforms Python by default".
Nov 27 2010
Austin Hastings:It's not clear to me if the point of your post is "how do I make this go faster?" or "Waa! D underperforms Python by default".The third D version in my post is fast enough, so the point of that post is that a very common string compare operation, done in the natural way, is a bit too much slow in D/DMD.If it is the former, a state machine is probably the right answer for long input strings.This case is very simple, so a simple tree (as in the third D version of my post) is probably good enough. In some similar but more complex situations of genomic processing I've seen that in C it's good to have computed gotos, to implement faster state machines. See my recent post "explicitly unimplemented computed gotos" :-) This was my use case beside the desire to create interpreters in D (it's a similar thing).Since the probability of failure is about .75, 0.5, {.5 or .75}, the overall comparison cost is about 1 + .25 + .125 = 1.375, times compare+loop times n, plus all the function call overhead, etc.All 3 codons start with T, so if you test that first letter before the 3 string compare you speed up the original code a lot. But I was looking for a more general solution. Implementing a state machine is good, but in this case I'd like the compiler to do more optimizations by itself. A simple but good enough optimization for the compiler is to replace tests with strings (known to be short at compile-time) with inlined comparisons of their single chars.Using a DFA should reduce the cost to 1 * index, times n, assuming you are using a small alphabet (I'm guessing that your codons are ascii+uppercase only, table-size = 26).In this case the alphabet was just 4 symbols: {'A', 'C', 'G', 'T'} (with no undecided or modified nucleosides). Bye, bearophile
Nov 28 2010
On Sat, 27 Nov 2010 22:41:45 -0800 Jonathan M Davis <jmdavisProg gmx.com> wrote:'sWhat I think is pretty clear from this is that at some point, some work needs to be done to optimize array comparisons when memcmp can be used instead of having to call the individual opEquals() of each element. It=uldnot the sort of thing that's likely to be a priority, but it really sho=Yes, but other comments seems to show memcmp only doubles speed. This would= only bring us twice as slow as python ;-) Denis -- -- -- -- -- -- -- vit esse estrany =E2=98=A3 spir.wikidot.combe done at some point. =20=20 Enhancement Request: http://d.puremagic.com/issues/show_bug.cgi?id=3D5282
Nov 28 2010
spir:Yes, but other comments seems to show memcmp only doubles speed. This would only bring us twice as slow as python ;-)The solution I suggest for this problem is: when DMD knows at compile-time the length of one of the two strings to equate, and such length is small (like < 6), then it may replace the string eq code like this: (codon == "TAG") With code like: (codon.length == 3 && codon[0] == 'T' && codon[1] == 'A' && codon[2] == 'G') Bye, bearophile
Nov 28 2010









bearophile <bearophileHUGS lycos.com> 