digitalmars.D - String compare performance

bearophile (630/630) Nov 27 2010 While translating a common Python script to D, I have found something in...

Ellery Newcomer (1/1) Nov 27 2010 how well does dmd perform when you use a switch statement?

bearophile (31/32) Nov 27 2010 4th D version:

spir (26/138) Nov 27 2010 4.5.

bearophile (41/44) Nov 27 2010 Here it is:

Jonathan M Davis (6/11) Nov 27 2010 You might be able to do that for primitive types and for structs with no...

Kagamin (2/5) Nov 27 2010 You can use memcmp, though only for utf-8 strings.

Stewart Gordon (4/11) Nov 28 2010 Only for utf-8 strings? Why's that? I would've thought memcmp to be

bearophobic (4/19) Nov 28 2010 D community is amazing cult of premature optimization fans. Any one of y...

Michel Fortin (11/37) Nov 28 2010 Comparing unicode UTF-* strings using memcmp is fine as long as what

Bruno Medeiros (5/37) Dec 07 2010 Why are people still replying to nameless trolls? There has been several...

=?ISO-8859-1?Q?Br=fcno_Mediocre?= (3/43) Dec 08 2010 Trololol. Maybe they're a bit dumb, my brother. If they some day become ...

Bruno Medeiros (4/10) Dec 09 2010 Lol, "Br�no Mediocre", well thought, that's actually funny.

Daniel Gibson (2/15) Dec 09 2010 Nah, I think it's a rather mediocre joke.

Robert Jacques (5/18) Nov 28 2010 memcmp is type agnostic if all you want to compare is equality. The othe...

Kagamin (7/7) Nov 27 2010 You need only match 3-char strings?

Kagamin (7/7) Nov 27 2010 or
bearophile (4/6) Nov 27 2010 I agree there are are ways to speed up the code, probably the D #3 code ...
=?UTF-8?B?IkrDqXLDtG1lIE0uIEJlcmdlciI=?= (7/15) Nov 28 2010 This doesn't work on some architectures if a.ptr or b.ptr is odd...

bearophile (34/34) Nov 27 2010 I have done another test:

Robert Jacques (46/90) Nov 27 2010 Hi bearophile,

bearophile (82/86) Nov 28 2010 Thank you for your work :-)

=?ISO-8859-1?Q?Pelle_M=E5nsson?= (14/100) Nov 28 2010 I don't have your data set, but for me using random data this was within...

bearophile (68/70) Nov 28 2010 Thank you for your code. I have added your version, plus a modified vers...

spir (28/30) Nov 28 2010 lem, it's quite unnatural. Here I have explained what I think is a good ...
=?ISO-8859-1?Q?Pelle_M=E5nsson?= (6/76) Nov 28 2010 Measuring incorrectly in a performance test is a silly mistake, I was

Robert Jacques (30/45) Nov 28 2010 [snip]

bearophile (63/66) Nov 28 2010 The version #11 (I have not reformatted your function):

Austin Hastings (25/28) Nov 27 2010 It's not clear to me if the point of your post is "how do I make this go...

Jonathan M Davis (6/23) Nov 27 2010 What I think is pretty clear from this is that at some point, some work ...

Iain Buclaw (6/29) Nov 28 2010 Just food for thought. GDC uses memcmp when using string comparisons in ...

bearophile (5/10) Nov 28 2010 The asm of the first version of the function compiled with gdc uses "rep...

Jonathan M Davis (2/26) Nov 27 2010 Enhancement Request: http://d.puremagic.com/issues/show_bug.cgi?id=5282
bearophile (8/18) Nov 28 2010 This case is very simple, so a simple tree (as in the third D version of...
spir (10/17) Nov 28 2010 's

bearophile (7/8) Nov 28 2010 The solution I suggest for this problem is: when DMD knows at compile-ti...

bearophile <bearophileHUGS lycos.com> writes:

While translating a common Python script to D, I have found something
interesting, so I have reduced it to a little benchmark.

Below there is the reduced Python2 code (it uses Psyco), and a little program
to generate some test data. The timing of the first D2 version is not good
compared to the Python-Psyco program (the generation of the *300 array is a
quick thing), so I have created two more D2 versions to show myself that D2
wasn't broken :-)

The reduced code looks like a syntetic benchmark, but it has about the same
performance profile of a 60 lines long Python script (the original code was
using xrange(0,len(...)-3,3)  instead of xrange(len(...)-3), but the situation
doesn't change much).

Timings, dmd compiler, best of 4, seconds:

  Psy:  1.59




2.3 GHz CPU.
dmd 2.050, ldc based on DMD v1.057 and llvm 2.6, gdc head based on GCC 4.4.5.

--------------------------


import psyco

def test(data):
    count = 0
    for i in xrange(len(data) - 3):
        codon = data[i : i + 3]
        if codon == "TAG" or codon == "TGA" or codon == "TAA":
            count += 1
    return count

def main():
    data = open("data.txt").read()
    data = data * 300
    print test(data)

psyco.full()
main()

--------------------------

To produce some test data:



from itertools import islice

def rnd(ia = 3877, ic = 29573, im = 139968):
    seed = 42
    imf = float(im)
    while 1:
        seed = (seed * ia + ic) % im
        r = seed / imf
        yield "ACGT"[int(r * 4)]


def main():
    data = "".join(islice(rnd(), 0, 120000))
    file("data.txt", "w").write(data)

main()

--------------------------

A D2 traslation (I have used printf to reduce a lot the asm produced):



import std.file: read;
import std.c.stdio: printf;

int test(char[] data) {
    int count;
    foreach (i; 0 ..  data.length - 3) {
        char[] codon = data[i .. i + 3];
        if (codon == "TAG" || codon == "TGA" || codon == "TAA")
            count++;
    }
    return count;
}

void main() {
    char[] data0 = cast(char[])read("data.txt");
    int n = 300;
    char[] data = new char[data0.length * n];
    for (size_t pos; pos < data.length; pos += data0.length)
        data[pos .. pos+data0.length] = data0;

    printf("%d\n", test(data));
}

--------------------------


import std.file: read;
import std.c.stdio: printf;

bool cmp3(char[] s1, string s2) {
    return s1[0] == s2[0] && s1[1] == s2[1] && s1[2] == s2[2];
}

int test(char[] data) {
    int count;
    foreach (i; 0 ..  data.length - 3) {
        char[] codon = data[i .. i + 3];
        if (cmp3(codon, "TAG") || cmp3(codon, "TGA") || cmp3(codon, "TAA"))
            count++;
    }
    return count;
}

void main() {
    char[] data0 = cast(char[])read("data.txt");
    int n = 300;
    char[] data = new char[data0.length * n];
    for (size_t pos; pos < data.length; pos += data0.length)
        data[pos .. pos+data0.length] = data0;

    printf("%d\n", test(data));
}

--------------------------


import std.file: read;
import std.c.stdio: printf;

int test(char[] data) {
    int count;

    foreach (i; 0 ..  data.length - 3) {
        if (data[i] == 'T') {
            if (data[i+1] == 'A') {
                if (data[i+2] == 'G') {
                    count++;
                } else if (data[i+2] == 'A') {
                    count++;
                }
            } else if (data[i+1] == 'G') {
                if (data[i+2] == 'A')
                    count++;
            }
        }
    }

    return count;
}

void main() {
    char[] data0 = cast(char[])read("data.txt");
    int n = 300;
    char[] data = new char[data0.length * n];
    for (size_t pos; pos < data.length; pos += data0.length)
        data[pos .. pos+data0.length] = data0;

    printf("%d\n", test(data));
}

--------------------------

dmd -O -release -inline test1.d

_D5test14testFAaZi	comdat
L0:		sub	ESP,024h
		push	EBX
		xor	EBX,EBX
		push	EBP
		mov	EBP,030h[ESP]
		push	ESI
		xor	ESI,ESI
		add	EBP,0FFFFFFFDh
		push	EDI
		je	LA8
		mov	EDX,03Ch[ESP]
		mov	EAX,038h[ESP]
		mov	EDI,EDX
L22:		mov	ECX,offset FLAT:_D11TypeInfo_Aa6__initZ
		lea	EAX,3[EBX]
		sub	EAX,EBX
		push	ECX
		lea	EDX,[EDI][EBX]
		push	dword ptr FLAT:_DATA[0Ch]
		push	dword ptr FLAT:_DATA[08h]
		mov	020h[ESP],EAX
		mov	024h[ESP],EDX
		push	EDX
		push	EAX
		call	near ptr __adEq2
		add	ESP,014h
		test	EAX,EAX
		jne	L9E
		mov	ECX,offset FLAT:_D11TypeInfo_Aa6__initZ
		push	ECX
		push	dword ptr FLAT:_DATA[01Ch]
		push	dword ptr FLAT:_DATA[018h]
		push	dword ptr 024h[ESP]
		push	dword ptr 024h[ESP]
		call	near ptr __adEq2
		add	ESP,014h
		test	EAX,EAX
		jne	L9E
		mov	EDX,offset FLAT:_D11TypeInfo_Aa6__initZ
		push	EDX
		push	dword ptr FLAT:_DATA[02Ch]
		push	dword ptr FLAT:_DATA[028h]
		push	dword ptr 024h[ESP]
		push	dword ptr 024h[ESP]
		call	near ptr __adEq2
		add	ESP,014h
		test	EAX,EAX
		je	L9F
L9E:		inc	ESI
L9F:		inc	EBX
		cmp	EBX,EBP
		jb	L22
LA8:		pop	EDI
		mov	EAX,ESI
		pop	ESI
		pop	EBP
		pop	EBX
		add	ESP,024h
		ret	8

--------------------------

dmd -O -release -inline test2.d

_D5test24testFAaZi	comdat
		sub	ESP,03Ch
		mov	ECX,040h[ESP]
		push	EBX
		push	EBP
		push	ESI
		xor	ESI,ESI
		push	EDI
		xor	EDI,EDI
		add	ECX,0FFFFFFFDh
		mov	01Ch[ESP],ECX
		je	LC5
		mov	EDX,054h[ESP]
		mov	EAX,050h[ESP]
		mov	EBP,EDX
L26:		lea	EBX,3[ESI]
		sub	EBX,ESI
		lea	ECX,[EBP][ESI]
		mov	028h[ESP],ECX
		mov	DL,[ECX]
		mov	ECX,FLAT:_DATA[0Ch]
		mov	024h[ESP],EBX
		mov	EAX,FLAT:_DATA[08h]
		cmp	DL,[ECX]
		mov	010h[ESP],ECX
		jne	L65
		mov	EDX,028h[ESP]
		mov	EAX,EBX
		mov	EBX,010h[ESP]
		mov	CL,1[EDX]
		cmp	CL,1[EBX]
		jne	L65
		mov	DL,2[EDX]
		cmp	DL,2[EBX]
		je	LB9
L65:		mov	EDX,028h[ESP]
		mov	CL,[EDX]
		mov	018h[ESP],EDX
		mov	EAX,024h[ESP]
		mov	EDX,FLAT:_DATA[01Ch]
		cmp	CL,[EDX]
		mov	EAX,FLAT:_DATA[018h]
		jne	L96
		mov	EBX,018h[ESP]
		mov	AL,1[EBX]
		cmp	AL,1[EDX]
		jne	L96
		mov	AL,2[EBX]
		cmp	AL,2[EDX]
		je	LB9
L96:		mov	EDX,FLAT:_DATA[02Ch]
		mov	EAX,FLAT:_DATA[028h]
		cmp	CL,[EDX]
		jne	LBA
		mov	ECX,018h[ESP]
		mov	BL,1[ECX]
		cmp	BL,1[EDX]
		jne	LBA
		mov	CL,2[ECX]
		cmp	CL,2[EDX]
		jne	LBA
LB9:		inc	EDI
LBA:		inc	ESI
		cmp	ESI,01Ch[ESP]
		jb	L26
LC5:		mov	EAX,EDI
		pop	EDI
		pop	ESI
		pop	EBP
		pop	EBX
		add	ESP,03Ch
		ret	8

--------------------------

dmd -O -release -inline test3.d

_D5test34testFAaZi	comdat
		push	EAX
		xor	EDX,EDX
		push	EBX
		xor	EBX,EBX
		push	ESI
		mov	ESI,010h[ESP]
		add	ESI,0FFFFFFFDh
		je	L4A
		mov	8[ESP],EDX
		mov	EDX,014h[ESP]
		mov	ECX,EDX
		mov	EAX,010h[ESP]
		mov	EDX,8[ESP]
L22:		cmp	[ECX][EDX],054h
		jne	L45
		mov	AL,1[ECX][EDX]
		cmp	AL,041h
		jne	L39
		cmp	byte ptr 2[ECX][EDX],047h
		je	L44
		jmp short	L3D
L39:		cmp	AL,047h
		jne	L45
L3D:		cmp	byte ptr 2[ECX][EDX],041h
		jne	L45
L44:		inc	EBX
L45:		inc	EDX
		cmp	EDX,ESI
		jb	L22
L4A:		pop	ESI
		mov	EAX,EBX
		pop	EBX
		pop	ECX
		ret	8

--------------------------

gdc -O3 -frelease -inline -S test1.d -o test1.s

_D5test14testFAaZi:
	pushl	%ebp
	movl	%esp, %ebp
	pushl	%edi
	pushl	%esi
	pushl	%ebx
	subl	$12, %esp
	movl	8(%ebp), %eax
	movl	12(%ebp), %edx
	movl	$0, -24(%ebp)
	subl	$3, %eax
	movl	%edx, -20(%ebp)
	movl	%eax, -16(%ebp)
	je	.L5
	xorl	%edx, %edx
	jmp	.L8
	.p2align 4,,7
	.p2align 3
.L6:
	addl	$1, -24(%ebp)
	addl	$1, %edx
	cmpl	%edx, -16(%ebp)
	jbe	.L5
.L8:
	movl	-20(%ebp), %eax
	movl	$.LC0, %edi
	movl	$3, %ecx
	addl	%edx, %eax
	movl	%eax, %esi
	repz cmpsb
	je	.L6
	movl	%eax, %esi
	movl	$.LC1, %edi
	movl	$3, %ecx
	repz cmpsb
	je	.L6
	movl	$.LC2, %edi
	movl	%eax, %esi
	movl	$3, %ecx
	repz cmpsb
	je	.L6
	addl	$1, %edx
	cmpl	%edx, -16(%ebp)
	ja	.L8
.L5:
	movl	-24(%ebp), %eax
	addl	$12, %esp
	popl	%ebx
	popl	%esi
	popl	%edi
	popl	%ebp
	ret

--------------------------------------

gdc -O3 -frelease -inline -S test2.d -o test2.s

_D5test24testFAaZi:
	pushl	%ebp
	xorl	%edx, %edx
	movl	%esp, %ebp
	xorl	%eax, %eax
	pushl	%edi
	pushl	%esi
	pushl	%ebx
	subl	$16, %esp
	movl	8(%ebp), %ebx
	movl	12(%ebp), %esi
	movl	%ebx, %edi
	subl	$3, %edi
	jne	.L18
	jmp	.L9
	.p2align 4,,7
	.p2align 3
.L10:
	movl	-16(%ebp), %ecx
	movzbl	.LC1, %ebx
	cmpb	(%ecx), %bl
	je	.L15
.L12:
	movzbl	.LC2, %ebx
	cmpb	(%ecx), %bl
	je	.L21
.L13:
	addl	$1, %edx
	cmpl	%edx, %edi
	jbe	.L9
	.p2align 4,,7
	.p2align 3
.L18:
	leal	(%esi,%edx), %ecx
	movzbl	.LC0, %ebx
	movl	$3, -20(%ebp)
	movl	%ecx, -16(%ebp)
	cmpb	(%ecx), %bl
	jne	.L10
	movzbl	.LC0+1, %ebx
	cmpb	1(%ecx), %bl
	jne	.L10
	movzbl	.LC0+2, %ebx
	cmpb	2(%ecx), %bl
	jne	.L10
.L11:
	addl	$1, %edx
	addl	$1, %eax
	cmpl	%edx, %edi
	ja	.L18
.L9:
	addl	$16, %esp
	popl	%ebx
	popl	%esi
	popl	%edi
	popl	%ebp
	ret
	.p2align 4,,7
	.p2align 3
.L15:
	movzbl	.LC1+1, %ebx
	cmpb	1(%ecx), %bl
	jne	.L12
	movzbl	.LC1+2, %ebx
	cmpb	2(%ecx), %bl
	je	.L11
	movzbl	.LC2, %ebx
	cmpb	(%ecx), %bl
	jne	.L13
	.p2align 4,,7
	.p2align 3
.L21:
	movzbl	.LC2+1, %ebx
	cmpb	1(%ecx), %bl
	jne	.L13
	movzbl	.LC2+2, %ebx
	cmpb	2(%ecx), %bl
	jne	.L13
	jmp	.L11

--------------------------------------

gdc -O3 -frelease -inline -S test3.d -o test3.s

_D5test34testFAaZi:
	pushl	%ebp
	xorl	%eax, %eax
	movl	%esp, %ebp
	pushl	%esi
	movl	8(%ebp), %esi
	pushl	%ebx
	movl	12(%ebp), %ebx
	subl	$3, %esi
	je	.L6
	movl	$1, %edx
	jmp	.L7
	.p2align 4,,7
	.p2align 3
.L3:
	cmpl	%edx, %esi
	leal	1(%edx), %ecx
	jbe	.L6
.L11:
	movl	%ecx, %edx
.L7:
	cmpb	$84, -1(%ebx,%edx)
	jne	.L3
	movzbl	(%ebx,%edx), %ecx
	cmpb	$65, %cl
	je	.L10
	cmpb	$71, %cl
	jne	.L3
	xorl	%ecx, %ecx
	cmpb	$65, 1(%ebx,%edx)
	sete	%cl
	addl	%ecx, %eax
	cmpl	%edx, %esi
	leal	1(%edx), %ecx
	ja	.L11
	.p2align 4,,7
	.p2align 3
.L6:
	popl	%ebx
	popl	%esi
	popl	%ebp
	ret

---------------------------

ldc -O3 -release -inline -output-s test1.d

_D5test14testFAaZi:
	pushl	%ebp
	pushl	%ebx
	pushl	%edi
	pushl	%esi
	subl	$20, %esp
	movl	40(%esp), %esi
	cmpl	$3, %esi
	je	.LBB1_8
	addl	$4294967293, %esi
	xorl	%ebx, %ebx
	movl	%ebx, %edi
	.align	16
.LBB1_2:
	movl	44(%esp), %eax
	leal	(%eax,%ebx), %ebp
	movl	%ebp, 4(%esp)
	movl	$_D11TypeInfo_Aa6__initZ, 16(%esp)
	movl	$.str, 12(%esp)
	movl	$3, 8(%esp)
	movl	$3, (%esp)
	call	_adEq
	testl	%eax, %eax
	jne	.LBB1_5
	movl	%ebp, 4(%esp)
	movl	$_D11TypeInfo_Aa6__initZ, 16(%esp)
	movl	$.str1, 12(%esp)
	movl	$3, 8(%esp)
	movl	$3, (%esp)
	call	_adEq
	testl	%eax, %eax
	jne	.LBB1_5
	movl	%ebp, 4(%esp)
	movl	$_D11TypeInfo_Aa6__initZ, 16(%esp)
	movl	$.str2, 12(%esp)
	movl	$3, 8(%esp)
	movl	$3, (%esp)
	call	_adEq
	testl	%eax, %eax
	je	.LBB1_6
.LBB1_5:
	incl	%edi
.LBB1_6:
	incl	%ebx
	cmpl	%esi, %ebx
	jb	.LBB1_2
.LBB1_7:
	movl	%edi, %eax
	addl	$20, %esp
	popl	%esi
	popl	%edi
	popl	%ebx
	popl	%ebp
	ret	$8
.LBB1_8:
	xorl	%edi, %edi
	jmp	.LBB1_7

---------------------------

ldc -O3 -release -inline -output-s test2.d

_D5test24testFAaZi:
	pushl	%ebx
	pushl	%esi
	movl	12(%esp), %ecx
	cmpl	$3, %ecx
	je	.LBB2_14
	movl	16(%esp), %edx
	addl	$4294967293, %ecx
	xorl	%eax, %eax
	movl	%eax, %esi
	.align	16
.LBB2_2:
	movb	(%edx,%esi), %bl
	cmpb	$84, %bl
	jne	.LBB2_12
	movb	1(%edx,%esi), %bh
	cmpb	$65, %bh
	jne	.LBB2_5
	cmpb	$71, 2(%esi,%edx)
	je	.LBB2_11
.LBB2_5:
	cmpb	$84, %bl
	jne	.LBB2_12
	cmpb	$71, %bh
	jne	.LBB2_8
	cmpb	$65, 2(%esi,%edx)
	je	.LBB2_11
.LBB2_8:
	cmpb	$65, %bh
	setne	%bh
	cmpb	$84, %bl
	jne	.LBB2_12
	testb	%bh, %bh
	jne	.LBB2_12
	cmpb	$65, 2(%esi,%edx)
	jne	.LBB2_12
.LBB2_11:
	incl	%eax
.LBB2_12:
	incl	%esi
	cmpl	%ecx, %esi
	jb	.LBB2_2
.LBB2_13:
	popl	%esi
	popl	%ebx
	ret	$8
.LBB2_14:
	xorl	%eax, %eax
	jmp	.LBB2_13

---------------------------

ldc -O3 -release -inline -output-s test3.d

_D5test34testFAaZi:
	pushl	%ebx
	pushl	%edi
	pushl	%esi
	movl	16(%esp), %ecx
	cmpl	$3, %ecx
	je	.LBB1_12
	movl	20(%esp), %edx
	addl	$4294967293, %ecx
	xorl	%eax, %eax
	movl	%eax, %esi
	.align	16
.LBB1_2:
	cmpb	$84, (%edx,%esi)
	jne	.LBB1_10
	movb	1(%edx,%esi), %bl
	cmpb	$71, %bl
	je	.LBB1_8
	cmpb	$65, %bl
	jne	.LBB1_10
	movb	2(%esi,%edx), %bl
	cmpb	$71, %bl
	jne	.LBB1_7
	incl	%eax
	jmp	.LBB1_10
.LBB1_7:
	cmpb	$65, %bl
	jmp	.LBB1_9
.LBB1_8:
	cmpb	$65, 2(%esi,%edx)
.LBB1_9:
	sete	%bl
	movzbl	%bl, %edi
	addl	%edi, %eax
.LBB1_10:
	incl	%esi
	cmpl	%ecx, %esi
	jb	.LBB1_2
.LBB1_11:
	popl	%esi
	popl	%edi
	popl	%ebx
	ret	$8
.LBB1_12:
	xorl	%eax, %eax
	jmp	.LBB1_11

---------------------------

Bye,
bearophile

Nov 27 2010

Ellery Newcomer <ellery-newcomer utulsa.edu> writes:

how well does dmd perform when you use a switch statement?

Nov 27 2010

bearophile <bearophileHUGS lycos.com> writes:

Ellery Newcomer:

 how well does dmd perform when you use a switch statement?

4th D version:

import std.file: read;
import std.c.stdio: printf;

int test(char[] data) {
    int count;
    foreach (i; 0 ..  data.length - 3) {
        switch (data[i .. i + 3]) {
            case "TAG", "TGA", "TAA":
                count++;
            default:
        }
    }
    return count;
}

void main() {
    char[] data0 = cast(char[])read("data.txt");
    int n = 300;
    char[] data = new char[data0.length * n];
    for (size_t pos; pos < data.length; pos += data0.length)
        data[pos .. pos+data0.length] = data0;

    printf("%d\n", test(data));
}



Timings, dmd compiler, best of 4, seconds:


  Psy:  1.59



Bye,
bearophile

Nov 27 2010

spir <denis.spir gmail.com> writes:

On Sat, 27 Nov 2010 11:53:06 -0500
bearophile <bearophileHUGS lycos.com> wrote:

 Timings, dmd compiler, best of 4, seconds:

   Psy:  1.59


=20
=20
 2.3 GHz CPU.
 dmd 2.050, ldc based on DMD v1.057 and llvm 2.6, gdc head based on GCC 4.=

4.5.
=20
 --------------------------
=20

 import psyco
=20
 def test(data):
     count =3D 0
     for i in xrange(len(data) - 3):
         codon =3D data[i : i + 3]
         if codon =3D=3D "TAG" or codon =3D=3D "TGA" or codon =3D=3D "TAA":
             count +=3D 1
     return count
=20
 def main():
     data =3D open("data.txt").read()
     data =3D data * 300
     print test(data)
=20
 psyco.full()
 main()
=20
 --------------------------
=20
 To produce some test data:
=20
=20

 from itertools import islice
=20
 def rnd(ia =3D 3877, ic =3D 29573, im =3D 139968):
     seed =3D 42
     imf =3D float(im)
     while 1:
         seed =3D (seed * ia + ic) % im
         r =3D seed / imf
         yield "ACGT"[int(r * 4)]
=20
=20
 def main():
     data =3D "".join(islice(rnd(), 0, 120000))
     file("data.txt", "w").write(data)
=20
 main()
=20
 --------------------------
=20
 A D2 traslation (I have used printf to reduce a lot the asm produced):
=20
=20

 import std.file: read;
 import std.c.stdio: printf;
=20
 int test(char[] data) {
     int count;
     foreach (i; 0 ..  data.length - 3) {
         char[] codon =3D data[i .. i + 3];
         if (codon =3D=3D "TAG" || codon =3D=3D "TGA" || codon =3D=3D "TAA=

")
             count++;
     }
     return count;
 }
=20
 void main() {
     char[] data0 =3D cast(char[])read("data.txt");
     int n =3D 300;
     char[] data =3D new char[data0.length * n];
     for (size_t pos; pos < data.length; pos +=3D data0.length)
         data[pos .. pos+data0.length] =3D data0;
=20
     printf("%d\n", test(data));
 }
=20
 --------------------------
=20

 import std.file: read;
 import std.c.stdio: printf;
=20
 bool cmp3(char[] s1, string s2) {
     return s1[0] =3D=3D s2[0] && s1[1] =3D=3D s2[1] && s1[2] =3D=3D s2[2];
 }
=20
 int test(char[] data) {
     int count;
     foreach (i; 0 ..  data.length - 3) {
         char[] codon =3D data[i .. i + 3];
         if (cmp3(codon, "TAG") || cmp3(codon, "TGA") || cmp3(codon, "TAA"=

))
             count++;
     }
     return count;
 }
=20
 void main() {
     char[] data0 =3D cast(char[])read("data.txt");
     int n =3D 300;
     char[] data =3D new char[data0.length * n];
     for (size_t pos; pos < data.length; pos +=3D data0.length)
         data[pos .. pos+data0.length] =3D data0;
=20
     printf("%d\n", test(data));
 }

Impressive!
What's your diagnostic on why basic string compare parforms the way it does?
And have you tried a naive hand-written strComp (just for the sake of compl=
eteness) like:

    bool strComp (string s1, string s2) {
        auto l =3D s1.length;
        if (l !=3D s2.length)
            return false;
        foreach (uint i ; 0..l)
            if (s1[i] !=3D s2[i])
                return false;
        return true;
    }

(I bet this will perform at about the same order of magnitude as your versi=

Also, is there a way to bit-compare given memory areas at much higher speed=
 than element per element (I mean for arrays in general)?


Denis
-- -- -- -- -- -- --
vit esse estrany =E2=98=A3

spir.wikidot.com

Nov 27 2010

bearophile <bearophileHUGS lycos.com> writes:

denis.spir

 What's your diagnostic on why basic string compare parforms the way it does?

I prefer the numbers and asm listings speak for themselves, because my
diagnostic sometimes is wrong. But I guess it's caused by three not inlined
calls to __adEq2 each loop, done to compare 3 char long arrays.


 And have you tried a naive hand-written strComp (just for the sake of
completeness) like:

Here it is:

Timings, dmd compiler, best of 4, seconds:



  Psy:  1.59





import std.file: read;
import std.c.stdio: printf;

bool cmp2(char[] s1, string s2) {
    if (s1.length != s2.length)
        return false;
    foreach (uint i ; 0 .. s1.length)
        if (s1[i] != s2[i])
            return false;
    return true;
}

int test(char[] data) {
    int count;
    foreach (i; 0 ..  data.length - 3) {
        char[] codon = data[i .. i + 3];
        if (cmp2(codon, "TAG") || cmp2(codon, "TGA") || cmp2(codon, "TAA"))
            count++;
    }
    return count;
}

void main() {
    char[] data0 = cast(char[])read("data.txt");
    int n = 300;
    char[] data = new char[data0.length * n];
    for (size_t pos; pos < data.length; pos += data0.length)
        data[pos .. pos+data0.length] = data0;

    printf("%d\n", test(data));
}


 Also, is there a way to bit-compare given memory areas at much higher speed
than element per element (I mean for arrays in general)?

I don't know. I think you can't.

Bye,
bearophile

Nov 27 2010

Jonathan M Davis <jmdavisProg gmx.com> writes:

On Saturday 27 November 2010 14:46:00 bearophile wrote:
 denis.spir
 Also, is there a way to bit-compare given memory areas at much higher
 speed than element per element (I mean for arrays in general)?

 
 I don't know. I think you can't.

You might be able to do that for primitive types and for structs with no 
opEquals() and with no member variables with an opEquals(), but you certainly 
can't do it in the general case, since in the general case, you'd have to be 
calling opEquals() on each element individually.

- Jonathan M Davis

Nov 27 2010

Kagamin <spam here.lot> writes:

bearophile Wrote:

 Also, is there a way to bit-compare given memory areas at much higher speed
than element per element (I mean for arrays in general)?

 
 I don't know. I think you can't.

You can use memcmp, though only for utf-8 strings.

Nov 27 2010

Stewart Gordon <smjg_1998 yahoo.com> writes:

On 27/11/2010 23:04, Kagamin wrote:
 bearophile Wrote:

 Also, is there a way to bit-compare given memory areas at much
 higher speed than element per element (I mean for arrays in
 general)?

 I don't know. I think you can't.

 You can use memcmp, though only for utf-8 strings.

Only for utf-8 strings?  Why's that?  I would've thought memcmp to be 
type agnostic.

Stewart.

Nov 28 2010

bearophobic <notbear cave.net> writes:

Stewart Gordon Wrote:

 On 27/11/2010 23:04, Kagamin wrote:
 bearophile Wrote:

 Also, is there a way to bit-compare given memory areas at much
 higher speed than element per element (I mean for arrays in
 general)?

 I don't know. I think you can't.

 You can use memcmp, though only for utf-8 strings.

 
 Only for utf-8 strings?  Why's that?  I would've thought memcmp to be 
 type agnostic.
 
 Stewart.

D community is amazing cult of premature optimization fans. Any one of you
heard of canonically equivalent sequences? The integrated Unicode support is a
clusterfuck. Please do compare ASCII strings with memcmp, but no Unicode. Where
did the original poster pull this problem from, his ass? "My system runs
100,000,000,000 instructions per second, but this comparison of 4 letter
strings uses 5 cycles too much! This is the only problem on the way to world
domination with my $500 Microsoft Word clone!". No wait, the problems are
completely imaginatory.

Bye,
bearophobic

Nov 28 2010

Michel Fortin <michel.fortin michelf.com> writes:

On 2010-11-28 20:57:38 -0500, bearophobic <notbear cave.net> said:

 Stewart Gordon Wrote:
 
 On 27/11/2010 23:04, Kagamin wrote:
 bearophile Wrote:
 
 Also, is there a way to bit-compare given memory areas at much
 higher speed than element per element (I mean for arrays in
 general)?

 
 I don't know. I think you can't.

 
 You can use memcmp, though only for utf-8 strings.

 
 Only for utf-8 strings?  Why's that?  I would've thought memcmp to be
 type agnostic.
 
 Stewart.

 
 D community is amazing cult of premature optimization fans. Any one of 
 you heard of canonically equivalent sequences? The integrated Unicode 
 support is a clusterfuck. Please do compare ASCII strings with memcmp, 
 but no Unicode. Where did the original poster pull this problem from, 
 his ass? "My system runs 100,000,000,000 instructions per second, but 
 this comparison of 4 letter strings uses 5 cycles too much! This is the 
 only problem on the way to world domination with my $500 Microsoft Word 
 clone!". No wait, the problems are completely imaginatory.

Comparing unicode UTF-* strings using memcmp is fine as long as what 
you want to know is whether the code points are the same. If your point 
was that per-code-point comparisons aren't the right way to compare 
Unicode strings (in most situations), then I support this view too. 
Though if that's what you wanted to say, you could have made your point 
clearer.


-- 
Michel Fortin
michel.fortin michelf.com
http://michelf.com/

Nov 28 2010

Bruno Medeiros <brunodomedeiros+spam com.gmail> writes:

On 29/11/2010 02:11, Michel Fortin wrote:
 On 2010-11-28 20:57:38 -0500, bearophobic <notbear cave.net> said:

 Stewart Gordon Wrote:

 On 27/11/2010 23:04, Kagamin wrote:
 bearophile Wrote:

 Also, is there a way to bit-compare given memory areas at much
 higher speed than element per element (I mean for arrays in
 general)?

 I don't know. I think you can't.

 You can use memcmp, though only for utf-8 strings.

 Only for utf-8 strings? Why's that? I would've thought memcmp to be
 type agnostic.

 Stewart.

 D community is amazing cult of premature optimization fans. Any one of
 you heard of canonically equivalent sequences? The integrated Unicode
 support is a clusterfuck. Please do compare ASCII strings with memcmp,
 but no Unicode. Where did the original poster pull this problem from,
 his ass? "My system runs 100,000,000,000 instructions per second, but
 this comparison of 4 letter strings uses 5 cycles too much! This is
 the only problem on the way to world domination with my $500 Microsoft
 Word clone!". No wait, the problems are completely imaginatory.

 Comparing unicode UTF-* strings using memcmp is fine as long as what you
 want to know is whether the code points are the same. If your point was
 that per-code-point comparisons aren't the right way to compare Unicode
 strings (in most situations), then I support this view too. Though if
 that's what you wanted to say, you could have made your point clearer.

Why are people still replying to nameless trolls? There has been several 
cases of that in recent threads. :/

-- 
Bruno Medeiros - Software Engineer

Dec 07 2010

=?ISO-8859-1?Q?Br=fcno_Mediocre?= <gay mail.com> writes:

Bruno Medeiros Wrote:

 On 29/11/2010 02:11, Michel Fortin wrote:
 On 2010-11-28 20:57:38 -0500, bearophobic <notbear cave.net> said:

 Stewart Gordon Wrote:

 On 27/11/2010 23:04, Kagamin wrote:
 bearophile Wrote:

 Also, is there a way to bit-compare given memory areas at much
 higher speed than element per element (I mean for arrays in
 general)?

 I don't know. I think you can't.

 You can use memcmp, though only for utf-8 strings.

 Only for utf-8 strings? Why's that? I would've thought memcmp to be
 type agnostic.

 Stewart.

 D community is amazing cult of premature optimization fans. Any one of
 you heard of canonically equivalent sequences? The integrated Unicode
 support is a clusterfuck. Please do compare ASCII strings with memcmp,
 but no Unicode. Where did the original poster pull this problem from,
 his ass? "My system runs 100,000,000,000 instructions per second, but
 this comparison of 4 letter strings uses 5 cycles too much! This is
 the only problem on the way to world domination with my $500 Microsoft
 Word clone!". No wait, the problems are completely imaginatory.

 Comparing unicode UTF-* strings using memcmp is fine as long as what you
 want to know is whether the code points are the same. If your point was
 that per-code-point comparisons aren't the right way to compare Unicode
 strings (in most situations), then I support this view too. Though if
 that's what you wanted to say, you could have made your point clearer.

 
 Why are people still replying to nameless trolls? There has been several 
 cases of that in recent threads. :/

Trololol. Maybe they're a bit dumb, my brother. If they some day become
smarter, they'll stop using D. They see how much shit it is.

I miss my wife. Oh god.... bring back my life! Bring me my.. sandwich!

Dec 08 2010

Bruno Medeiros <brunodomedeiros+spam com.gmail> writes:

On 08/12/2010 15:02, Br�no Mediocre wrote:
 Bruno Medeiros Wrote:
 Why are people still replying to nameless trolls? There has been several
 cases of that in recent threads. :/

 Trololol. Maybe they're a bit dumb, my brother. If they some day become
smarter, they'll stop using D. They see how much shit it is.

 I miss my wife. Oh god.... bring back my life! Bring me my.. sandwich!

Lol, "Br�no Mediocre", well thought, that's actually funny.

-- 
Br�no Mediocre - Software Engineer

Dec 09 2010

Daniel Gibson <metalcaedes gmail.com> writes:

Bruno Medeiros schrieb:
 On 08/12/2010 15:02, Br�no Mediocre wrote:
 Bruno Medeiros Wrote:
 Why are people still replying to nameless trolls? There has been several
 cases of that in recent threads. :/

 Trololol. Maybe they're a bit dumb, my brother. If they some day 
 become smarter, they'll stop using D. They see how much shit it is.

 I miss my wife. Oh god.... bring back my life! Bring me my.. sandwich!

 
 Lol, "Br�no Mediocre", well thought, that's actually funny.
 

Nah, I think it's a rather mediocre joke.

Dec 09 2010

"Robert Jacques" <sandford jhu.edu> writes:

On Sun, 28 Nov 2010 20:32:24 -0500, Stewart Gordon <smjg_1998 yahoo.com>  
wrote:

 On 27/11/2010 23:04, Kagamin wrote:
 bearophile Wrote:

 Also, is there a way to bit-compare given memory areas at much
 higher speed than element per element (I mean for arrays in
 general)?

 I don't know. I think you can't.

 You can use memcmp, though only for utf-8 strings.

 Only for utf-8 strings?  Why's that?  I would've thought memcmp to be  
 type agnostic.

 Stewart.

memcmp is type agnostic if all you want to compare is equality. The other  
use of memcmp is essentially as an opCmp, in which case it would be type  
sensitive.

Nov 28 2010

Kagamin <spam here.lot> writes:

You need only match 3-char strings?
Try this hack

bool cmp3c(string a, string etalon) pure
{
  if(a.length!=3)return false;
  return *cast(short*)a.ptr==*cast(short*)b.ptr && a[2]==b[2];
}

Nov 27 2010

Kagamin <spam here.lot> writes:

or

bool cmp3c(string a, string b) pure
{
  if(a.length!=3)return false;
  return *cast(char[3]*)a.ptr==*cast(char[3]*)b.ptr;
}

llvm.memcmp should rock with lengths known at compile time.

Nov 27 2010

bearophile <bearophileHUGS lycos.com> writes:

Kagamin:

 You need only match 3-char strings?
 Try this hack


enough. But while some things are better left to Phobos, comparing short
strings efficiently is the compiler's job, not mine (and I am not sure your
trick works in C99).

Bye,
bearophile

Nov 27 2010

=?UTF-8?B?IkrDqXLDtG1lIE0uIEJlcmdlciI=?= <jeberger free.fr> writes:

Kagamin wrote:
 You need only match 3-char strings?
 Try this hack
=20
 bool cmp3c(string a, string etalon) pure
 {
   if(a.length!=3D3)return false;
   return *cast(short*)a.ptr=3D=3D*cast(short*)b.ptr && a[2]=3D=3Db[2];
 }

	This doesn't work on some architectures if a.ptr or b.ptr is odd...

		Jerome
--=20
mailto:jeberger free.fr
http://jeberger.free.fr
Jabber: jeberger jabber.fr

Nov 28 2010

bearophile <bearophileHUGS lycos.com> writes:

I have done another test:

Timings, dmd compiler, best of 4, seconds:



  Psy:  1.59





import std.file: read;
import std.c.stdio: printf;

int test(char[] data) {
    int count;
    foreach (i; 0 ..  data.length - 3) {
        char[] codon = data[i .. i + 3];
        if ((codon.length == 3 && codon[0] == 'T' && codon[1] == 'A' &&
codon[2] == 'G') ||
            (codon.length == 3 && codon[0] == 'T' && codon[1] == 'G' &&
codon[2] == 'A') ||
            (codon.length == 3 && codon[0] == 'T' && codon[1] == 'A' &&
codon[2] == 'A'))
            count++;
    }
    return count;
}

void main() {
    char[] data0 = cast(char[])read("data.txt");
    int n = 300;
    char[] data = new char[data0.length * n];
    for (size_t pos; pos < data.length; pos += data0.length)
        data[pos .. pos+data0.length] = data0;

    printf("%d\n", test(data));
}


So when there is to compare among strings known at compile-time to be small
(like < 6 char), the comparison shall be replaced with inlined single char
comparisons. This makes the code longer so it increases code cache pressure,
but seeing how much slow the alternative is, I think it's an improvement.

(A smart compiler is even able to remove the codon.length==3 test because the
slice data[i..i+3] is always of length 3 (here mysteriously if you remove those
three length tests the program compiled with dmd gets slower)).

Bye,
bearophile

Nov 27 2010

"Robert Jacques" <sandford jhu.edu> writes:

On Sat, 27 Nov 2010 22:08:29 -0500, bearophile <bearophileHUGS lycos.com>  
wrote:
 I have done another test:

 Timings, dmd compiler, best of 4, seconds:



   Psy:  1.59





 import std.file: read;
 import std.c.stdio: printf;

 int test(char[] data) {
     int count;
     foreach (i; 0 ..  data.length - 3) {
         char[] codon = data[i .. i + 3];
         if ((codon.length == 3 && codon[0] == 'T' && codon[1] == 'A' &&  
 codon[2] == 'G') ||
             (codon.length == 3 && codon[0] == 'T' && codon[1] == 'G' &&  
 codon[2] == 'A') ||
             (codon.length == 3 && codon[0] == 'T' && codon[1] == 'A' &&  
 codon[2] == 'A'))
             count++;
     }
     return count;
 }

 void main() {
     char[] data0 = cast(char[])read("data.txt");
     int n = 300;
     char[] data = new char[data0.length * n];
     for (size_t pos; pos < data.length; pos += data0.length)
         data[pos .. pos+data0.length] = data0;

     printf("%d\n", test(data));
 }


 So when there is to compare among strings known at compile-time to be  
 small (like < 6 char), the comparison shall be replaced with inlined  
 single char comparisons. This makes the code longer so it increases code  
 cache pressure, but seeing how much slow the alternative is, I think  
 it's an improvement.

 (A smart compiler is even able to remove the codon.length==3 test  
 because the slice data[i..i+3] is always of length 3 (here mysteriously  
 if you remove those three length tests the program compiled with dmd  
 gets slower)).

 Bye,
 bearophile

Hi bearophile,
I've spent some time having fun this afternoon optimizing array-equals  
using vectorization techniques. I found that vectorizing using ulongs  
worked best on my system except with the shortest strings, where a simple  
Duff's device edged it out. If you'd like to try it out on your data set:


bool arrayComp3(bool useBitCompare = true,T)(const T[] a, const T[] b)  
pure nothrow  {
     if(a.length != b.length) return false;

     static if(useBitCompare) {
         auto pab = cast(ubyte*)a.ptr;
         auto pbb = cast(ubyte*)b.ptr;
         if(pab is pbb) return true;

         auto byte_length = a.length*T.sizeof;
         auto pa_end      = cast(ulong*)(pab+byte_length);
	    final switch (byte_length % ulong.sizeof) {
             case 7:             if(*pab++ != *pbb++ ) return false;
             case 6:             if(*pab++ != *pbb++ ) return false;
             case 5:             if(*pab++ != *pbb++ ) return false;
             case 4:             if(*pab++ != *pbb++ ) return false;
             case 3:             if(*pab++ != *pbb++ ) return false;
             case 2:             if(*pab++ != *pbb++ ) return false;
             case 1:             if(*pab++ != *pbb++ ) return false;
             case 0:
	    }
         auto pa      = cast(ulong*)pab;
         auto pb      = cast(ulong*)pbb;
         while (pa < pa_end) {
             if(*pa++ != *pb++ ) return false;
         }
     } else {                                            // default to a  
short duff's device
         auto pa = a.ptr;
         auto pb = b.ptr;
         if(pa is pb) return true;
	    auto n  = (a.length + 3) / 4;
	    final switch (a.length % 4) {
             case 0:        do {  if(*pa++ != *pb++ ) return false;
             case 3:              if(*pa++ != *pb++ ) return false;
             case 2:              if(*pa++ != *pb++ ) return false;
             case 1:              if(*pa++ != *pb++ ) return false;
                       } while (--n > 0);
	    }    }
     return true;
}

Nov 27 2010

bearophile <bearophileHUGS lycos.com> writes:

Robert Jacques:

 I've spent some time having fun this afternoon optimizing array-equals  
 using vectorization techniques. I found that vectorizing using ulongs  
 worked best on my system except with the shortest strings, where a simple  
 Duff's device edged it out. If you'd like to try it out on your data set:

Thank you for your work :-)




import std.file: read;
import std.c.stdio: printf;
import std.exception: assumeUnique;

bool arrayComp(bool useBitCompare=true, T)
              (const T[] a, const T[] b) pure nothrow {
    if (a.length != b.length)
        return false;

    static if (useBitCompare) {
        auto pab = cast(ubyte*)a.ptr;
        auto pbb = cast(ubyte*)b.ptr;
        if (pab is pbb)
            return true;

        auto byte_length = a.length * T.sizeof;
        auto pa_end = cast(ulong*)(pab + byte_length);

        final switch (byte_length % ulong.sizeof) {
            case 7: if (*pab++ != *pbb++) return false;
            case 6: if (*pab++ != *pbb++) return false;
            case 5: if (*pab++ != *pbb++) return false;
            case 4: if (*pab++ != *pbb++) return false;
            case 3: if (*pab++ != *pbb++) return false;
            case 2: if (*pab++ != *pbb++) return false;
            case 1: if (*pab++ != *pbb++) return false;
            case 0:
        }

        auto pa = cast(ulong*)pab;
        auto pb = cast(ulong*)pbb;

        while (pa < pa_end) {
            if (*pa++ != *pb++)
                return false;
        }
    } else { // default to a short duff's device
        auto pa = a.ptr;
        auto pb = b.ptr;
        if (pa == pb)
            return true;
        auto n  = (a.length + 3) / 4;

        final switch (a.length % 4) {
            case 0: do { if (*pa++ != *pb++) return false;
            case 3:      if (*pa++ != *pb++) return false;
            case 2:      if (*pa++ != *pb++) return false;
            case 1:      if (*pa++ != *pb++) return false;
                       } while (--n > 0);
        }
    }

    return true;
}

int test(string data) {
    int count;
    foreach (i; 0 ..  data.length - 3) {
        auto codon = data[i .. i + 3];
        if (arrayComp(codon, "TAG") || arrayComp(codon, "TGA") ||
arrayComp(codon, "TAA"))
            count++;
    }
    return count;
}

void main() {
    char[] data0 = cast(char[])read("data.txt");
    int n = 300;
    char[] data = new char[data0.length * n];
    for (size_t pos; pos < data.length; pos += data0.length)
        data[pos .. pos+data0.length] = data0;
    string sdata = assumeUnique(data);

    printf("%d\n", test(sdata));
}


Timings, dmd compiler, best of 4, seconds:



  Psy:  1.59







Your function can't be inlined because it's big, so this code isn't faster than
inlined code like this generated by the compiler:
(codon.length == 3 && codon[0] == 'T' && codon[1] == 'A' && codon[2] == 'G')

Bye,
bearophile

Nov 28 2010

=?ISO-8859-1?Q?Pelle_M=E5nsson?= <pelle.mansson gmail.com> writes:

On 11/28/2010 12:44 PM, bearophile wrote:
 Robert Jacques:

 I've spent some time having fun this afternoon optimizing array-equals
 using vectorization techniques. I found that vectorizing using ulongs
 worked best on my system except with the shortest strings, where a simple
 Duff's device edged it out. If you'd like to try it out on your data set:

 Thank you for your work :-)




 import std.file: read;
 import std.c.stdio: printf;
 import std.exception: assumeUnique;

 bool arrayComp(bool useBitCompare=true, T)
                (const T[] a, const T[] b) pure nothrow {
      if (a.length != b.length)
          return false;

      static if (useBitCompare) {
          auto pab = cast(ubyte*)a.ptr;
          auto pbb = cast(ubyte*)b.ptr;
          if (pab is pbb)
              return true;

          auto byte_length = a.length * T.sizeof;
          auto pa_end = cast(ulong*)(pab + byte_length);

          final switch (byte_length % ulong.sizeof) {
              case 7: if (*pab++ != *pbb++) return false;
              case 6: if (*pab++ != *pbb++) return false;
              case 5: if (*pab++ != *pbb++) return false;
              case 4: if (*pab++ != *pbb++) return false;
              case 3: if (*pab++ != *pbb++) return false;
              case 2: if (*pab++ != *pbb++) return false;
              case 1: if (*pab++ != *pbb++) return false;
              case 0:
          }

          auto pa = cast(ulong*)pab;
          auto pb = cast(ulong*)pbb;

          while (pa<  pa_end) {
              if (*pa++ != *pb++)
                  return false;
          }
      } else { // default to a short duff's device
          auto pa = a.ptr;
          auto pb = b.ptr;
          if (pa == pb)
              return true;
          auto n  = (a.length + 3) / 4;

          final switch (a.length % 4) {
              case 0: do { if (*pa++ != *pb++) return false;
              case 3:      if (*pa++ != *pb++) return false;
              case 2:      if (*pa++ != *pb++) return false;
              case 1:      if (*pa++ != *pb++) return false;
                         } while (--n>  0);
          }
      }

      return true;
 }

 int test(string data) {
      int count;
      foreach (i; 0 ..  data.length - 3) {
          auto codon = data[i .. i + 3];
          if (arrayComp(codon, "TAG") || arrayComp(codon, "TGA") ||
arrayComp(codon, "TAA"))
              count++;
      }
      return count;
 }

 void main() {
      char[] data0 = cast(char[])read("data.txt");
      int n = 300;
      char[] data = new char[data0.length * n];
      for (size_t pos; pos<  data.length; pos += data0.length)
          data[pos .. pos+data0.length] = data0;
      string sdata = assumeUnique(data);

      printf("%d\n", test(sdata));
 }


 Timings, dmd compiler, best of 4, seconds:



    Psy:  1.59







 Your function can't be inlined because it's big, so this code isn't faster
than inlined code like this generated by the compiler:
 (codon.length == 3&&  codon[0] == 'T'&&  codon[1] == 'A'&&  codon[2] == 'G')

 Bye,
 bearophile

I don't have your data set, but for me using random data this was within 


int test(string data) {
     int count;
     while (true) {
         data = data.find("TAG", "TGA", "TAA")[0];

         if (data.empty) return count;

         count += 1;
         data.popFront;
     }
}

Also, this one is far easier to generalize for strings of different 
lengths, and such :-)

Nov 28 2010

bearophile <bearophileHUGS lycos.com> writes:

Pelle M�nsson:

 I don't have your data set,

In my first post I have explained how to create the data set, using a little
Python script that I have listed there. You just need a Python2 interpreter.



any fiddly code.

Thank you for your code. I have added your version, plus a modified version


-------------------------


import std.file: read;
import std.c.stdio: printf;
import std.algorithm: find;
import std.array: empty, popFront;

int test(char[] data) {
    int count;
    while (true) {
        data = data.find("TAG", "TGA", "TAA")[0];
        if (data.empty)
            return count;
         count++;
         data.popFront();
     }
}

void main() {
    char[] data0 = cast(char[])read("data.txt");
    int n = 300;
    char[] data = new char[data0.length * n];
    for (size_t pos; pos < data.length; pos += data0.length)
        data[pos .. pos + data0.length] = data0;

    printf("%d\n", test(data));
}

-------------------------


import std.file: read;
import std.c.stdio: printf;
import std.algorithm: find;
import std.array: empty, popFront;

int test(ubyte[] data) {
    int count;
    while (true) {
        data = data.find("TAG", "TGA", "TAA")[0];
        if (data.empty)
            return count;
         count++;
         data.popFront();
     }
}

void main() {
    ubyte[] data0 = cast(ubyte[])read("data.txt");
    int n = 300;
    ubyte[] data = new ubyte[data0.length * n];
    for (size_t pos; pos < data.length; pos += data0.length)
        data[pos .. pos + data0.length] = data0;

    printf("%d\n", test(data));
}

-------------------------

Timings, dmd compiler, best of 4, seconds:





  Psy:   1.59







So on this PC it's not much fast. And generally I don't like to use find, empty
and popFront to solve this problem, it's quite unnatural. Here I have explained
what I think is a good enough solution to this performance problem:
http://www.digitalmars.com/webnews/newsgroups.php?art_group=digitalmars.D&article_id=123044

Bye,
bearophile

Nov 28 2010

spir <denis.spir gmail.com> writes:

On Sun, 28 Nov 2010 10:48:46 -0500
bearophile <bearophileHUGS lycos.com> wrote:

 generally I don't like to use find, empty and popFront to solve this prob=

lem, it's quite unnatural. Here I have explained what I think is a good eno=
ugh solution to this performance problem:
 http://www.digitalmars.com/webnews/newsgroups.php?art_group=3Ddigitalmars=

.D&article_id=3D123044

What I would like is a just decent routine for comparing arrays of elements=
 which _conceptually_ are values, whatever their complexity (meaning, possi=
bly structs or nested arrays). This should always (*) use a default '=3D=3D=
' comparison, which in turn may perform a kind of memcomp.
I have no idea how to avoid bad perf such as shown by your tests, but this =
is really necessary. Tons of routines and algorithms use '=3D=3D', and amon=
g the most used; str.find will perform an arbitrary number of '=3D=3D' chec=
ks; and I don't even want to think at pattern matching ;-)

Denis

(*) I think that:
* If an element type is a value type, it should not overide opEquals. '=3D=
=3D' should translate to equality of every sub-element. Else, there is some=
thing wrong in the design. But there may be use cases I overlook. (Eg metad=
ata stored on the element itself?)
* A plain value may well refer to a "thing"/object, eg a wiget's outlook po=
inting to a color palette, in which case reference comparison is the right =
choice -- so, no complication.
* Real things (referenced objects, entities with self/id) should not have o=
pEquals at all. They should be compare only by id, meaning using 'is'. This=
 shows again the ambivalence of dyn arrays...
-- -- -- -- -- -- --
vit esse estrany =E2=98=A3

spir.wikidot.com

Nov 28 2010

=?ISO-8859-1?Q?Pelle_M=E5nsson?= <pelle.mansson gmail.com> writes:

On 11/28/2010 04:48 PM, bearophile wrote:
 Pelle M�nsson:

 I don't have your data set,

 In my first post I have explained how to create the data set, using a little
Python script that I have listed there. You just need a Python2 interpreter.



any fiddly code.

 Thank you for your code. I have added your version, plus a modified version


 -------------------------


 import std.file: read;
 import std.c.stdio: printf;
 import std.algorithm: find;
 import std.array: empty, popFront;

 int test(char[] data) {
      int count;
      while (true) {
          data = data.find("TAG", "TGA", "TAA")[0];
          if (data.empty)
              return count;
           count++;
           data.popFront();
       }
 }

 void main() {
      char[] data0 = cast(char[])read("data.txt");
      int n = 300;
      char[] data = new char[data0.length * n];
      for (size_t pos; pos<  data.length; pos += data0.length)
          data[pos .. pos + data0.length] = data0;

      printf("%d\n", test(data));
 }

 -------------------------


 import std.file: read;
 import std.c.stdio: printf;
 import std.algorithm: find;
 import std.array: empty, popFront;

 int test(ubyte[] data) {
      int count;
      while (true) {
          data = data.find("TAG", "TGA", "TAA")[0];
          if (data.empty)
              return count;
           count++;
           data.popFront();
       }
 }

 void main() {
      ubyte[] data0 = cast(ubyte[])read("data.txt");
      int n = 300;
      ubyte[] data = new ubyte[data0.length * n];
      for (size_t pos; pos<  data.length; pos += data0.length)
          data[pos .. pos + data0.length] = data0;

      printf("%d\n", test(data));
 }

 -------------------------

 Timings, dmd compiler, best of 4, seconds:





    Psy:   1.59







 So on this PC it's not much fast. And generally I don't like to use find,
empty and popFront to solve this problem, it's quite unnatural. Here I have
explained what I think is a good enough solution to this performance problem:
 http://www.digitalmars.com/webnews/newsgroups.php?art_group=digitalmars.D&article_id=123044

 Bye,
 bearophile

Measuring incorrectly in a performance test is a silly mistake, I was 
indeed wrong about the factor two thing!

I thought find was specialized for chars so it wouldn't require casting 
to ubyte. Maybe I was wrong again.

Thank you for your engaging performance oriented posts. :-)

Nov 28 2010

"Robert Jacques" <sandford jhu.edu> writes:

On Sun, 28 Nov 2010 06:44:23 -0500, bearophile <bearophileHUGS lycos.com>  
wrote:
 Robert Jacques:

 I've spent some time having fun this afternoon optimizing array-equals
 using vectorization techniques. I found that vectorizing using ulongs
 worked best on my system except with the shortest strings, where a  
 simple
 Duff's device edged it out. If you'd like to try it out on your data  
 set:

 Thank you for your work :-)


[snip]

 Your function can't be inlined because it's big, so this code isn't  
 faster than inlined code like this generated by the compiler:
 (codon.length == 3 && codon[0] == 'T' && codon[1] == 'A' && codon[2] ==  
 'G')

 Bye,
 bearophile

Still, part of the point was that string comparisons in general were way  
too slow. Anyways, I've applied the same technique in a partially unrolled  
version if you want to check it out:

bool arrayComp(T, size_t N)(const T[] a, ref const T[N] b) pure nothrow {
     static if(T.sizeof*N <= uint.sizeof) {
         return a.length == N && !( (*(cast(uint*)a.ptr)  ^  
*(cast(uint*)b.ptr))  & (uint.max >> 8*(uint.sizeof  - T.sizeof*N) ));
     } else static if(T.sizeof*N <= ulong.sizeof) {
         return a.length == N && !( (*(cast(ulong*)a.ptr) ^  
*(cast(ulong*)b.ptr)) & (ulong.max>> 8*(ulong.sizeof - T.sizeof*N) ));
     } else { // Fall back to a loop
         if(a.length != N || (*(cast(ulong*)a.ptr) != *(cast(ulong*)b.ptr))  
) return false;
         enum alignment = T.sizeof*N % ulong.sizeof > 0 ? T.sizeof*N %  
ulong.sizeof : ulong.sizeof;
         auto pa      = cast(ulong*)(cast(ubyte*)a.ptr + alignment);
         auto pb      = cast(ulong*)(cast(ubyte*)b.ptr + alignment);
         auto pa_end  = cast(ulong*)(cast(ubyte*)a.ptr + T.sizeof*N);
         while (pa < pa_end) {
             if(*pa++ != *pb++ ) return false;
         }
         return true;
     }
}

Be warned that you'll have to use strings explicitly typed as immutable  
char[N], since both array literals and string literals won't match a this  
template.

Nov 28 2010

bearophile <bearophileHUGS lycos.com> writes:

Robert Jacques:

 Still, part of the point was that string comparisons in general were way  
 too slow. Anyways, I've applied the same technique in a partially unrolled  
 version if you want to check it out:





import std.file: read;
import std.c.stdio: printf;

// not formatted
bool arrayComp(T, size_t N)(const T[] a, ref const T[N] b) pure nothrow {
     static if(T.sizeof*N <= uint.sizeof) {
         return a.length == N && !( (*(cast(uint*)a.ptr)  ^
*(cast(uint*)b.ptr))  & (uint.max >> 8*(uint.sizeof  - T.sizeof*N) ));
     } else static if(T.sizeof*N <= ulong.sizeof) {
         return a.length == N && !( (*(cast(ulong*)a.ptr) ^
*(cast(ulong*)b.ptr)) & (ulong.max>> 8*(ulong.sizeof - T.sizeof*N) ));
     } else { // Fall back to a loop
         if(a.length != N || (*(cast(ulong*)a.ptr) != *(cast(ulong*)b.ptr))
) return false;
         enum alignment = T.sizeof*N % ulong.sizeof > 0 ? T.sizeof*N %
ulong.sizeof : ulong.sizeof;
         auto pa      = cast(ulong*)(cast(ubyte*)a.ptr + alignment);
         auto pb      = cast(ulong*)(cast(ubyte*)b.ptr + alignment);
         auto pa_end  = cast(ulong*)(cast(ubyte*)a.ptr + T.sizeof*N);
         while (pa < pa_end) {
             if(*pa++ != *pb++ ) return false;
         }
         return true;
     }
}

pure int test(const char[] data) {
    int count;
    foreach (i; 0 ..  data.length - 3) {
        const char[] codon = data[i .. i + 3];
        if (arrayComp!(char, 3)(codon, "TAG") ||
            arrayComp!(char, 3)(codon, "TGA") ||
            arrayComp!(char, 3)(codon, "TAA"))
            count++;
    }
    return count;
}

void main() {
    char[] data0 = cast(char[])read("data.txt");
    int n = 300;
    char[] data = new char[data0.length * n];
    for (size_t pos; pos < data.length; pos += data0.length)
        data[pos .. pos + data0.length] = data0;

    printf("%d\n", test(data));
}


Timings, dmd compiler, best of 4, seconds:





  Psy:   1.59








This problem probably needs to be solved by the compiler.

Bye,
bearophile

Nov 28 2010

Austin Hastings <ah08010-d yahoo.com> writes:

On 11/27/2010 11:53 AM, bearophile wrote:
 While translating a common Python script to D, I have found something
interesting, so I have reduced it to a little benchmark.

 Below there is the reduced Python2 code (it uses Psyco), and a little program
to generate some test data. The timing of the first D2 version is not good
compared to the Python-Psyco program (the generation of the *300 array is a
quick thing), so I have created two more D2 versions to show myself that D2
wasn't broken :-)

 The reduced code looks like a syntetic benchmark, but it has about the same
performance profile of a 60 lines long Python script (the original code was
using xrange(0,len(...)-3,3)  instead of xrange(len(...)-3), but the situation
doesn't change much).

It's not clear to me if the point of your post is "how do I make this go 
faster?" or "Waa! D underperforms Python by default".

If it is the former, a state machine is probably the right answer for 
long input strings.

start -> E

E A,C,G:-> E
E T:-> T

T A:-> TA
T C:-> E
T G:-> TG
T T:-> T

TA A,G:-> E, {++count}
TA C:-> E
TA T:-> T

TG A:-> E, {++count}
TG C,G:-> E
TG T:-> T


Since the probability of failure is about .75, 0.5, {.5 or .75}, the 
overall comparison cost is about 1 + .25 + .125 = 1.375, times 
compare+loop times n, plus all the function call overhead, etc.

Using a DFA should reduce the cost to 1 * index, times n, assuming you 
are using a small alphabet (I'm guessing that your codons are 
ascii+uppercase only, table-size = 26).

=Austin

Nov 27 2010

Jonathan M Davis <jmdavisProg gmx.com> writes:

On Saturday 27 November 2010 22:13:39 Austin Hastings wrote:
 On 11/27/2010 11:53 AM, bearophile wrote:
 While translating a common Python script to D, I have found something
 interesting, so I have reduced it to a little benchmark.
 
 Below there is the reduced Python2 code (it uses Psyco), and a little
 program to generate some test data. The timing of the first D2 version
 is not good compared to the Python-Psyco program (the generation of the
 *300 array is a quick thing), so I have created two more D2 versions to
 show myself that D2 wasn't broken :-)
 
 The reduced code looks like a syntetic benchmark, but it has about the
 same performance profile of a 60 lines long Python script (the original
 code was using xrange(0,len(...)-3,3)  instead of xrange(len(...)-3),
 but the situation doesn't change much).

 
 It's not clear to me if the point of your post is "how do I make this go
 faster?" or "Waa! D underperforms Python by default".


What I think is pretty clear from this is that at some point, some work needs
to 
be done to optimize array comparisons when memcmp can be used instead of having 
to call the individual opEquals() of each element. It's not the sort of thing 
that's likely to be a priority, but it really should be done at some point.

- Jonathan M Davis

Nov 27 2010

Iain Buclaw <ibuclaw ubuntu.com> writes:

== Quote from Jonathan M Davis (jmdavisProg gmx.com)'s article
 On Saturday 27 November 2010 22:13:39 Austin Hastings wrote:
 On 11/27/2010 11:53 AM, bearophile wrote:
 While translating a common Python script to D, I have found something
 interesting, so I have reduced it to a little benchmark.

 Below there is the reduced Python2 code (it uses Psyco), and a little
 program to generate some test data. The timing of the first D2 version
 is not good compared to the Python-Psyco program (the generation of the
 *300 array is a quick thing), so I have created two more D2 versions to
 show myself that D2 wasn't broken :-)

 The reduced code looks like a syntetic benchmark, but it has about the
 same performance profile of a 60 lines long Python script (the original
 code was using xrange(0,len(...)-3,3)  instead of xrange(len(...)-3),
 but the situation doesn't change much).

 It's not clear to me if the point of your post is "how do I make this go
 faster?" or "Waa! D underperforms Python by default".

 What I think is pretty clear from this is that at some point, some work needs
to
 be done to optimize array comparisons when memcmp can be used instead of having
 to call the individual opEquals() of each element. It's not the sort of thing
 that's likely to be a priority, but it really should be done at some point.
 - Jonathan M Davis

Just food for thought. GDC uses memcmp when using string comparisons in the
first
implementation.

if (codon == "TAG" || codon == "TGA" || codon == "TAA")

And it is still the slowest case of the lot.

Regards

Nov 28 2010

bearophile <bearophileHUGS lycos.com> writes:

Iain Buclaw:

 GDC uses memcmp when using string comparisons in the first implementation.
 
 if (codon == "TAG" || codon == "TGA" || codon == "TAA")
 
 And it is still the slowest case of the lot.

The asm of the first version of the function compiled with gdc uses "repz
cmpsb". And while I don't remember precise timings for it, I think it was about
two times faster than the first version compiled with dmd (on Windows).

Regarding the idea of using memcmp, I have added a comment to the issue 5282.

Bye,
bearophile

Nov 28 2010

Jonathan M Davis <jmdavisProg gmx.com> writes:

On Saturday 27 November 2010 22:23:39 Jonathan M Davis wrote:
 On Saturday 27 November 2010 22:13:39 Austin Hastings wrote:
 On 11/27/2010 11:53 AM, bearophile wrote:
 While translating a common Python script to D, I have found something
 interesting, so I have reduced it to a little benchmark.
 
 Below there is the reduced Python2 code (it uses Psyco), and a little
 program to generate some test data. The timing of the first D2 version
 is not good compared to the Python-Psyco program (the generation of the
 *300 array is a quick thing), so I have created two more D2 versions to
 show myself that D2 wasn't broken :-)
 
 The reduced code looks like a syntetic benchmark, but it has about the
 same performance profile of a 60 lines long Python script (the original
 code was using xrange(0,len(...)-3,3)  instead of xrange(len(...)-3),
 but the situation doesn't change much).

 
 It's not clear to me if the point of your post is "how do I make this go
 faster?" or "Waa! D underperforms Python by default".

 
 What I think is pretty clear from this is that at some point, some work
 needs to be done to optimize array comparisons when memcmp can be used
 instead of having to call the individual opEquals() of each element. It's
 not the sort of thing that's likely to be a priority, but it really should
 be done at some point.

Enhancement Request: http://d.puremagic.com/issues/show_bug.cgi?id=5282

Nov 27 2010

bearophile <bearophileHUGS lycos.com> writes:

Austin Hastings:

 It's not clear to me if the point of your post is "how do I make this go 
 faster?" or "Waa! D underperforms Python by default".

The third D version in my post is fast enough, so the point of that post is
that a very common string compare operation, done in the natural way, is a bit
too much slow in D/DMD.


 If it is the former, a state machine is probably the right answer for 
 long input strings.

This case is very simple, so a simple tree (as in the third D version of my
post) is probably good enough.

In some similar but more complex situations of genomic processing I've seen
that in C it's good to have computed gotos, to implement faster state machines.
See my recent post "explicitly unimplemented computed gotos" :-) This was my
use case beside the desire to create interpreters in D (it's a similar thing).


 Since the probability of failure is about .75, 0.5, {.5 or .75}, the 
 overall comparison cost is about 1 + .25 + .125 = 1.375, times 
 compare+loop times n, plus all the function call overhead, etc.

All 3 codons start with T, so if you test that first letter before the 3 string
compare you speed up the original code a lot. But I was looking for a more
general solution. Implementing a state machine is good, but in this case I'd
like the compiler to do more optimizations by itself. A simple but good enough
optimization for the compiler is to replace tests with strings (known to be
short at compile-time) with inlined comparisons of their single chars.


 Using a DFA should reduce the cost to 1 * index, times n, assuming you 
 are using a small alphabet (I'm guessing that your codons are 
 ascii+uppercase only, table-size = 26).

In this case the alphabet was just 4 symbols: {'A', 'C', 'G', 'T'} (with no
undecided or modified nucleosides).

Bye,
bearophile

Nov 28 2010

spir <denis.spir gmail.com> writes:

On Sat, 27 Nov 2010 22:41:45 -0800
Jonathan M Davis <jmdavisProg gmx.com> wrote:

 What I think is pretty clear from this is that at some point, some work
 needs to be done to optimize array comparisons when memcmp can be used
 instead of having to call the individual opEquals() of each element. It=


's
 not the sort of thing that's likely to be a priority, but it really sho=


uld
 be done at some point. =20

=20
 Enhancement Request: http://d.puremagic.com/issues/show_bug.cgi?id=3D5282

Yes, but other comments seems to show memcmp only doubles speed. This would=
 only bring us twice as slow as python ;-)

Denis
-- -- -- -- -- -- --
vit esse estrany =E2=98=A3

spir.wikidot.com

Nov 28 2010

bearophile <bearophileHUGS lycos.com> writes:

spir:

 Yes, but other comments seems to show memcmp only doubles speed. This would
only bring us twice as slow as python ;-)

The solution I suggest for this problem is: when DMD knows at compile-time the
length of one of the two strings to equate, and such length is small (like <
6), then it may replace the string eq code like this:
(codon == "TAG")
With code like:
(codon.length == 3 && codon[0] == 'T' && codon[1] == 'A' && codon[2] == 'G')

Bye,
bearophile

Nov 28 2010

D Programming

C/C++ Programming

Other

digitalmars.D - String compare performance