digitalmars.D.learn - Can this implementation of Damm algorithm be optimized?

Nestor (34/34) Feb 09 2017 Hi,

Cym13 (12/47) Feb 09 2017 I can't see how to make it more efficient than it is (aside from
Era Scarecrow (25/39) Feb 09 2017 Well one thing is you can probably reduce them from chars to just

Nestor (43/85) Feb 09 2017 OK I changed the approach using a multidimensional array for the

Daniel Kozak via Digitalmars-d-learn (3/90) Feb 09 2017 Maybe you can try use static array instead of dynamic

Nestor (55/57) Feb 09 2017 I shaved it to this to discard unneccessary time-consuming

Daniel Kozak via Digitalmars-d-learn (2/52) Feb 09 2017 Did you try it with different backends? llvm (ldc), gcc(gdc)?

Nestor (9/11) Feb 09 2017 Not really, just standard dmd.

=?UTF-8?Q?Ali_=c3=87ehreli?= (8/10) Feb 09 2017 avgtime profiles the whole process, right? It measures everything that

Daniel Kozak via Digitalmars-d-learn (3/14) Feb 09 2017 Yes this is important, and for example when you used shared libphobos it...

Cym13 (69/160) Feb 09 2017 The question is, why do you expect it to be noticably faster? On
Era Scarecrow (36/41) Feb 09 2017 Truthfully, because you'll need tens of millions or hundreds of

Nestor (3/18) Feb 10 2017 Thank you for the detailed reply. I wasn't able to follow you

Era Scarecrow (28/30) Feb 11 2017 The idea behind it is like this (which you can scale up):

Nestor (7/37) Feb 11 2017 Notice this is no ordinary matrix, but an Anti-Simmetric

Era Scarecrow (6/11) Feb 11 2017 Yes i know, which is why i had 3 to calculate 2 inputs because

Era Scarecrow (32/34) Feb 11 2017 Alright I've found the bug and fixed it, and it passes with

Era Scarecrow (12/14) Feb 11 2017 Just ran the unittests under the dmd profiler, says the

Era Scarecrow (14/16) Feb 11 2017 Ran some more tests.

Nestor (3/19) Feb 12 2017 Wow!

Era Scarecrow (23/28) Feb 12 2017 Certainly. But the bulk of the answer comes down that the 2

Nestor (5/40) Feb 11 2017 I fail to see where you are declaring QG10Matrix2, because

Era Scarecrow (17/20) Feb 11 2017 I declared it here:

Nestor (4/7) Feb 09 2017 I forgot to comment that what is multiplied by ten is not the

Nestor <nestorperez2016 yopmail.com> writes:

Hi,

I was trying to port C code from the article in Wikiversity [1] 
to D, but I'm not sure this implementation is the most efficient 
way to do it in D, so suggestions to optimize it are welcome:

import std.stdio;

static immutable char[] QG10Matrix =
   "03175986427092154863420687135917509834266123045978" ~
   "36742095815869720134894536201794386172052581436790";

char checkDigit(string str) {
   char tmpdigit = '0';
   foreach(chr; str) tmpdigit = QG10Matrix[(chr - '0') + (tmpdigit 
- '0') * 10];
   return tmpdigit;
}

enum {
   EXIT_SUCCESS = 0,
   EXIT_FAILURE = 1,
}

int main(string[] args) {
   scope(failure) {
     writeln("Invalid arguments. You must pass a number.");
     return EXIT_FAILURE;
   }
   assert(args.length == 2);
   char digit = checkDigit(args[1]);
   if(digit == '0') writefln("%s is a valid number.", args[1]);
   else {
     writefln("%s is not a valid number (but it would be, 
appending digit %s).",
       args[1], digit);
   }

   return EXIT_SUCCESS;
}

[1] https://en.wikiversity.org/wiki/Damm_algorithm

Feb 09 2017

Cym13 <cpicard openmailbox.org> writes:

On Thursday, 9 February 2017 at 17:36:11 UTC, Nestor wrote:
 Hi,

 I was trying to port C code from the article in Wikiversity [1] 
 to D, but I'm not sure this implementation is the most 
 efficient way to do it in D, so suggestions to optimize it are 
 welcome:

 import std.stdio;

 static immutable char[] QG10Matrix =
   "03175986427092154863420687135917509834266123045978" ~
   "36742095815869720134894536201794386172052581436790";

 char checkDigit(string str) {
   char tmpdigit = '0';
   foreach(chr; str) tmpdigit = QG10Matrix[(chr - '0') + 
 (tmpdigit - '0') * 10];
   return tmpdigit;
 }

 enum {
   EXIT_SUCCESS = 0,
   EXIT_FAILURE = 1,
 }

 int main(string[] args) {
   scope(failure) {
     writeln("Invalid arguments. You must pass a number.");
     return EXIT_FAILURE;
   }
   assert(args.length == 2);
   char digit = checkDigit(args[1]);
   if(digit == '0') writefln("%s is a valid number.", args[1]);
   else {
     writefln("%s is not a valid number (but it would be, 
 appending digit %s).",
       args[1], digit);
   }

   return EXIT_SUCCESS;
 }

 [1] https://en.wikiversity.org/wiki/Damm_algorithm

I can't see how to make it more efficient than it is (aside from 
finding a better algorithm which isn't a language concern). You 
are using C style and will get the same machine code as if you'd 
written it with C. The only difference would be boundchecking 
that you can disable by using the -boundscheck=off compiler 
switch.

By the way note that you can use the -profile switch to see where 
the bottlenecks are, in your case writting the result to stdout 
is *by far* what takes the most time (and I'm quite certain the 
bottleneck is the same in the C version for it's a fact that I/O 
is slow).

Feb 09 2017

Era Scarecrow <rtcvb32 yahoo.com> writes:

On Thursday, 9 February 2017 at 17:36:11 UTC, Nestor wrote:
 I was trying to port C code from the article in Wikiversity [1] 
 to D, but I'm not sure this implementation is the most 
 efficient way to do it in D, so suggestions to optimize it are 
 welcome:

 import std.stdio;

 static immutable char[] QG10Matrix =
   "03175986427092154863420687135917509834266123045978" ~
   "36742095815869720134894536201794386172052581436790";

 char checkDigit(string str) {
   char tmpdigit = '0';
   foreach(chr; str) tmpdigit = QG10Matrix[(chr - '0') + 
 (tmpdigit - '0') * 10];
   return tmpdigit;
 }

Well one thing is you can probably reduce them from chars to just 
bytes, instead of having to subtract you can instead add at the 
end. Although unless you're working with a VERY large input you 
won't see a difference.

Actually since you're also multiplying by 10, you can incorporate 
that in the table too... (although a mixin might be better for 
the conversion than by hand)


  static immutable char[] QG10Matrix = [
     00,30,10,70,50,90,80,60,40,20,
     70,00,90,20,10,50,40,80,60,30,
     40,20,00,60,80,70,10,30,50,90,
     10,70,50,00,90,80,30,40,20,60,
     60,10,20,30,00,40,50,90,70,80,
     30,60,70,40,20,00,90,50,80,10,
     50,80,60,90,70,20,00,10,30,40,
     80,90,40,50,30,60,20,00,10,70,
     90,40,30,80,60,10,70,20,00,50,
     20,50,80,10,40,30,60,70,90,00];

  char checkDigit(string str) {
    char tmpdigit = 0;
    foreach(chr; str) tmpdigit = QG10Matrix[(chr - '0') +
  tmpdigit];
    return (tmpdigit/10) + '0';
  }

Feb 09 2017

Nestor <nestorperez2016 yopmail.com> writes:

On Thursday, 9 February 2017 at 18:34:30 UTC, Era Scarecrow wrote:
 On Thursday, 9 February 2017 at 17:36:11 UTC, Nestor wrote:
 I was trying to port C code from the article in Wikiversity 
 [1] to D, but I'm not sure this implementation is the most 
 efficient way to do it in D, so suggestions to optimize it are 
 welcome:

 import std.stdio;

 static immutable char[] QG10Matrix =
   "03175986427092154863420687135917509834266123045978" ~
   "36742095815869720134894536201794386172052581436790";

 char checkDigit(string str) {
   char tmpdigit = '0';
   foreach(chr; str) tmpdigit = QG10Matrix[(chr - '0') + 
 (tmpdigit - '0') * 10];
   return tmpdigit;
 }

 Well one thing is you can probably reduce them from chars to 
 just bytes, instead of having to subtract you can instead add 
 at the end. Although unless you're working with a VERY large 
 input you won't see a difference.

 Actually since you're also multiplying by 10, you can 
 incorporate that in the table too... (although a mixin might be 
 better for the conversion than by hand)


  static immutable char[] QG10Matrix = [
     00,30,10,70,50,90,80,60,40,20,
     70,00,90,20,10,50,40,80,60,30,
     40,20,00,60,80,70,10,30,50,90,
     10,70,50,00,90,80,30,40,20,60,
     60,10,20,30,00,40,50,90,70,80,
     30,60,70,40,20,00,90,50,80,10,
     50,80,60,90,70,20,00,10,30,40,
     80,90,40,50,30,60,20,00,10,70,
     90,40,30,80,60,10,70,20,00,50,
     20,50,80,10,40,30,60,70,90,00];

  char checkDigit(string str) {
    char tmpdigit = 0;
    foreach(chr; str) tmpdigit = QG10Matrix[(chr - '0') +
  tmpdigit];
    return (tmpdigit/10) + '0';
  }



OK I changed the approach using a multidimensional array for the 
matrix so I could ditch arithmetic operations altogether, but 
curiously after measuring a few thousand runs of both 
implementations through avgtime, I see no noticeable difference. 
Why?

import std.stdio;

static immutable ubyte[][] QG10Matrix = [
   [0,3,1,7,5,9,8,6,4,2],[7,0,9,2,1,5,4,8,6,3],
   [4,2,0,6,8,7,1,3,5,9],[1,7,5,0,9,8,3,4,2,6],
   [6,1,2,3,0,4,5,9,7,8],[3,6,7,4,2,0,9,5,8,1],
   [5,8,6,9,7,2,0,1,3,4],[8,9,4,5,3,6,2,0,1,7],
   [9,4,3,8,6,1,7,2,0,5],[2,5,8,1,4,3,6,7,9,0],
];

static int charToInt(char chr) {
   scope(failure) return -1;
   return cast(int)(chr - '0');
}

ubyte checkDigit(string str) {
   ubyte tmpdigit;
   foreach(chr; str) tmpdigit = 
QG10Matrix[tmpdigit][charToInt(chr)];
   return tmpdigit;
}

enum {
   EXIT_SUCCESS = 0,
   EXIT_FAILURE = 1,
}

int main(string[] args) {
   scope(failure) {
     writeln("Invalid arguments. You must pass a number.");
     return EXIT_FAILURE;
   }
   assert(args.length == 2);
   ubyte digit = checkDigit(args[1]);
   if(digit == 0) writefln("%s is a valid number.", args[1]);
   else {
     writefln("%s is not a valid number (but it would be, 
appending digit %s).",
       args[1], digit);
   }

   return EXIT_SUCCESS;
}

Feb 09 2017

Daniel Kozak via Digitalmars-d-learn <digitalmars-d-learn puremagic.com> writes:

Dne 9.2.2017 v 20:39 Nestor via Digitalmars-d-learn napsal(a):

 On Thursday, 9 February 2017 at 18:34:30 UTC, Era Scarecrow wrote:
 On Thursday, 9 February 2017 at 17:36:11 UTC, Nestor wrote:
 I was trying to port C code from the article in Wikiversity [1] to 
 D, but I'm not sure this implementation is the most efficient way to 
 do it in D, so suggestions to optimize it are welcome:

 import std.stdio;

 static immutable char[] QG10Matrix =
   "03175986427092154863420687135917509834266123045978" ~
   "36742095815869720134894536201794386172052581436790";

 char checkDigit(string str) {
   char tmpdigit = '0';
   foreach(chr; str) tmpdigit = QG10Matrix[(chr - '0') + (tmpdigit - 
 '0') * 10];
   return tmpdigit;
 }

 Well one thing is you can probably reduce them from chars to just 
 bytes, instead of having to subtract you can instead add at the end. 
 Although unless you're working with a VERY large input you won't see 
 a difference.

 Actually since you're also multiplying by 10, you can incorporate 
 that in the table too... (although a mixin might be better for the 
 conversion than by hand)


  static immutable char[] QG10Matrix = [
     00,30,10,70,50,90,80,60,40,20,
     70,00,90,20,10,50,40,80,60,30,
     40,20,00,60,80,70,10,30,50,90,
     10,70,50,00,90,80,30,40,20,60,
     60,10,20,30,00,40,50,90,70,80,
     30,60,70,40,20,00,90,50,80,10,
     50,80,60,90,70,20,00,10,30,40,
     80,90,40,50,30,60,20,00,10,70,
     90,40,30,80,60,10,70,20,00,50,
     20,50,80,10,40,30,60,70,90,00];

  char checkDigit(string str) {
    char tmpdigit = 0;
    foreach(chr; str) tmpdigit = QG10Matrix[(chr - '0') +
  tmpdigit];
    return (tmpdigit/10) + '0';
  }



 OK I changed the approach using a multidimensional array for the 
 matrix so I could ditch arithmetic operations altogether, but 
 curiously after measuring a few thousand runs of both implementations 
 through avgtime, I see no noticeable difference. Why?

 import std.stdio;

 static immutable ubyte[][] QG10Matrix = [
   [0,3,1,7,5,9,8,6,4,2],[7,0,9,2,1,5,4,8,6,3],
   [4,2,0,6,8,7,1,3,5,9],[1,7,5,0,9,8,3,4,2,6],
   [6,1,2,3,0,4,5,9,7,8],[3,6,7,4,2,0,9,5,8,1],
   [5,8,6,9,7,2,0,1,3,4],[8,9,4,5,3,6,2,0,1,7],
   [9,4,3,8,6,1,7,2,0,5],[2,5,8,1,4,3,6,7,9,0],
 ];

 static int charToInt(char chr) {
   scope(failure) return -1;
   return cast(int)(chr - '0');
 }

 ubyte checkDigit(string str) {
   ubyte tmpdigit;
   foreach(chr; str) tmpdigit = QG10Matrix[tmpdigit][charToInt(chr)];
   return tmpdigit;
 }

 enum {
   EXIT_SUCCESS = 0,
   EXIT_FAILURE = 1,
 }

 int main(string[] args) {
   scope(failure) {
     writeln("Invalid arguments. You must pass a number.");
     return EXIT_FAILURE;
   }
   assert(args.length == 2);
   ubyte digit = checkDigit(args[1]);
   if(digit == 0) writefln("%s is a valid number.", args[1]);
   else {
     writefln("%s is not a valid number (but it would be, appending 
 digit %s).",
       args[1], digit);
   }

   return EXIT_SUCCESS;
 }

Maybe you can try use static array instead of dynamic
static immutable ubyte[10][10] QG10Matrix = ...

Feb 09 2017

Nestor <nestorperez2016 yopmail.com> writes:

On Thursday, 9 February 2017 at 20:46:06 UTC, Daniel Kozak wrote:
 Maybe you can try use static array instead of dynamic
 static immutable ubyte[10][10] QG10Matrix = ...

I shaved it to this to discard unneccessary time-consuming 
functions:

static immutable ubyte[10][10] QG10Matrix = [
   [0,3,1,7,5,9,8,6,4,2],[7,0,9,2,1,5,4,8,6,3],
   [4,2,0,6,8,7,1,3,5,9],[1,7,5,0,9,8,3,4,2,6],
   [6,1,2,3,0,4,5,9,7,8],[3,6,7,4,2,0,9,5,8,1],
   [5,8,6,9,7,2,0,1,3,4],[8,9,4,5,3,6,2,0,1,7],
   [9,4,3,8,6,1,7,2,0,5],[2,5,8,1,4,3,6,7,9,0],
];

static immutable string number =
   
"0123456789012345678901234567890123456789012345678901234567890123456789";

static int charToInt(char chr) {
   return ((chr >= '0') || (chr <= '9')) ? cast(int)(chr - '0') : 
-1;
}

ubyte checkDigit(string str) {
   ubyte tmpdigit;
   foreach(chr; str) tmpdigit = 
QG10Matrix[tmpdigit][charToInt(chr)];
   return tmpdigit;
}

void main() {
   auto digit = checkDigit(number);
}

I even tried making checkDigit static, but surprisingly this 
increased average execution time by 1ms.

Anyway, the previopus version (modified to benefit from a little 
optimization too) is almost as performant, even though it 
includes multiplication and sum:

static immutable char[] QG10Matrix =
   "03175986427092154863420687135917509834266123045978" ~
   "36742095815869720134894536201794386172052581436790";

static immutable string number =
   
"0123456789012345678901234567890123456789012345678901234567890123456789";

static int charToInt(char chr) {
   return ((chr >= '0') || (chr <= '9')) ? cast(int)(chr - '0') : 
-1;
}

char checkDigit(string str) {
   char tmpdigit = '0';
   foreach(chr; str) tmpdigit =
     QG10Matrix[charToInt(chr) + (charToInt(tmpdigit) * 10)];
   return tmpdigit;
}

void main() {
   auto digit = checkDigit(number);
}

I compiled both with -inline -noboundscheck -release and the 
multidimensional array version does perform 1ms faster for a 
couple hundred runs, but I expected the difference to be much 
more noticeable.

Any idea of what might be happening here?

Feb 09 2017

Daniel Kozak via Digitalmars-d-learn <digitalmars-d-learn puremagic.com> writes:

Dne 9.2.2017 v 22:29 Nestor via Digitalmars-d-learn napsal(a):

 On Thursday, 9 February 2017 at 20:46:06 UTC, Daniel Kozak wrote:
 Maybe you can try use static array instead of dynamic
 static immutable ubyte[10][10] QG10Matrix = ...

 I shaved it to this to discard unneccessary time-consuming functions:

 static immutable ubyte[10][10] QG10Matrix = [
   [0,3,1,7,5,9,8,6,4,2],[7,0,9,2,1,5,4,8,6,3],
   [4,2,0,6,8,7,1,3,5,9],[1,7,5,0,9,8,3,4,2,6],
   [6,1,2,3,0,4,5,9,7,8],[3,6,7,4,2,0,9,5,8,1],
   [5,8,6,9,7,2,0,1,3,4],[8,9,4,5,3,6,2,0,1,7],
   [9,4,3,8,6,1,7,2,0,5],[2,5,8,1,4,3,6,7,9,0],
 ];

 static immutable string number =
 "0123456789012345678901234567890123456789012345678901234567890123456789";

 static int charToInt(char chr) {
   return ((chr >= '0') || (chr <= '9')) ? cast(int)(chr - '0') : -1;
 }

 ubyte checkDigit(string str) {
   ubyte tmpdigit;
   foreach(chr; str) tmpdigit = QG10Matrix[tmpdigit][charToInt(chr)];
   return tmpdigit;
 }

 void main() {
   auto digit = checkDigit(number);
 }

 I even tried making checkDigit static, but surprisingly this increased 
 average execution time by 1ms.

 Anyway, the previopus version (modified to benefit from a little 
 optimization too) is almost as performant, even though it includes 
 multiplication and sum:

 static immutable char[] QG10Matrix =
   "03175986427092154863420687135917509834266123045978" ~
   "36742095815869720134894536201794386172052581436790";

 static immutable string number =
 "0123456789012345678901234567890123456789012345678901234567890123456789";

 static int charToInt(char chr) {
   return ((chr >= '0') || (chr <= '9')) ? cast(int)(chr - '0') : -1;
 }

 char checkDigit(string str) {
   char tmpdigit = '0';
   foreach(chr; str) tmpdigit =
     QG10Matrix[charToInt(chr) + (charToInt(tmpdigit) * 10)];
   return tmpdigit;
 }

 void main() {
   auto digit = checkDigit(number);
 }

 I compiled both with -inline -noboundscheck -release and the 
 multidimensional array version does perform 1ms faster for a couple 
 hundred runs, but I expected the difference to be much more noticeable.

 Any idea of what might be happening here?

Did you try it with different backends? llvm (ldc), gcc(gdc)?

Feb 09 2017

Nestor <nestorperez2016 yopmail.com> writes:

On Thursday, 9 February 2017 at 21:43:08 UTC, Daniel Kozak wrote:
 Any idea of what might be happening here?

 Did you try it with different backends? llvm (ldc), gcc(gdc)?

Not really, just standard dmd.

I tried running each algoritm a few times through avgtime using 
different digit lengths (upto 10,000, my PC won't handle much 
more) and different amount of repetitions, and the results aren't 
consistent, some times one algorithm is marginally faster, 
sometimes the other. Apparently the compiler is capable of 
optimizing the unidimensional array version.

Thank you all nevertheless for the suggestions.

Feb 09 2017

=?UTF-8?Q?Ali_=c3=87ehreli?= <acehreli yahoo.com> writes:

On 02/09/2017 02:04 PM, Nestor wrote:

 I tried running each algoritm a few times through avgtime using
 different digit lengths

avgtime profiles the whole process, right? It measures everything that 
is involved in that little program. At least OS starting the program, D 
runtime initializing, program interacting with the console, etc.

Since you're comparing algorithms, you need to take everything else out 
of the equation. A better approach is to use std.datetime.benchmark, 
which would repeat just the algorithm that you're interested in.

Ali

Feb 09 2017

Daniel Kozak via Digitalmars-d-learn <digitalmars-d-learn puremagic.com> writes:

Dne 9.2.2017 v 23:08 Ali Çehreli via Digitalmars-d-learn napsal(a):

 On 02/09/2017 02:04 PM, Nestor wrote:

 I tried running each algoritm a few times through avgtime using
 different digit lengths

 avgtime profiles the whole process, right? It measures everything that 
 is involved in that little program. At least OS starting the program, 
 D runtime initializing, program interacting with the console, etc.

 Since you're comparing algorithms, you need to take everything else 
 out of the equation. A better approach is to use 
 std.datetime.benchmark, which would repeat just the algorithm that 
 you're interested in.

 Ali

Yes this is important, and for example when you used shared libphobos it 
makes really big difference

Feb 09 2017

Cym13 <cpicard openmailbox.org> writes:

On Thursday, 9 February 2017 at 19:39:49 UTC, Nestor wrote:
 On Thursday, 9 February 2017 at 18:34:30 UTC, Era Scarecrow 
 wrote:
 On Thursday, 9 February 2017 at 17:36:11 UTC, Nestor wrote:
 I was trying to port C code from the article in Wikiversity 
 [1] to D, but I'm not sure this implementation is the most 
 efficient way to do it in D, so suggestions to optimize it 
 are welcome:

 import std.stdio;

 static immutable char[] QG10Matrix =
   "03175986427092154863420687135917509834266123045978" ~
   "36742095815869720134894536201794386172052581436790";

 char checkDigit(string str) {
   char tmpdigit = '0';
   foreach(chr; str) tmpdigit = QG10Matrix[(chr - '0') + 
 (tmpdigit - '0') * 10];
   return tmpdigit;
 }

 Well one thing is you can probably reduce them from chars to 
 just bytes, instead of having to subtract you can instead add 
 at the end. Although unless you're working with a VERY large 
 input you won't see a difference.

 Actually since you're also multiplying by 10, you can 
 incorporate that in the table too... (although a mixin might 
 be better for the conversion than by hand)


  static immutable char[] QG10Matrix = [
     00,30,10,70,50,90,80,60,40,20,
     70,00,90,20,10,50,40,80,60,30,
     40,20,00,60,80,70,10,30,50,90,
     10,70,50,00,90,80,30,40,20,60,
     60,10,20,30,00,40,50,90,70,80,
     30,60,70,40,20,00,90,50,80,10,
     50,80,60,90,70,20,00,10,30,40,
     80,90,40,50,30,60,20,00,10,70,
     90,40,30,80,60,10,70,20,00,50,
     20,50,80,10,40,30,60,70,90,00];

  char checkDigit(string str) {
    char tmpdigit = 0;
    foreach(chr; str) tmpdigit = QG10Matrix[(chr - '0') +
  tmpdigit];
    return (tmpdigit/10) + '0';
  }



 OK I changed the approach using a multidimensional array for 
 the matrix so I could ditch arithmetic operations altogether, 
 but curiously after measuring a few thousand runs of both 
 implementations through avgtime, I see no noticeable 
 difference. Why?

 import std.stdio;

 static immutable ubyte[][] QG10Matrix = [
   [0,3,1,7,5,9,8,6,4,2],[7,0,9,2,1,5,4,8,6,3],
   [4,2,0,6,8,7,1,3,5,9],[1,7,5,0,9,8,3,4,2,6],
   [6,1,2,3,0,4,5,9,7,8],[3,6,7,4,2,0,9,5,8,1],
   [5,8,6,9,7,2,0,1,3,4],[8,9,4,5,3,6,2,0,1,7],
   [9,4,3,8,6,1,7,2,0,5],[2,5,8,1,4,3,6,7,9,0],
 ];

 static int charToInt(char chr) {
   scope(failure) return -1;
   return cast(int)(chr - '0');
 }

 ubyte checkDigit(string str) {
   ubyte tmpdigit;
   foreach(chr; str) tmpdigit = 
 QG10Matrix[tmpdigit][charToInt(chr)];
   return tmpdigit;
 }

 enum {
   EXIT_SUCCESS = 0,
   EXIT_FAILURE = 1,
 }

 int main(string[] args) {
   scope(failure) {
     writeln("Invalid arguments. You must pass a number.");
     return EXIT_FAILURE;
   }
   assert(args.length == 2);
   ubyte digit = checkDigit(args[1]);
   if(digit == 0) writefln("%s is a valid number.", args[1]);
   else {
     writefln("%s is not a valid number (but it would be, 
 appending digit %s).",
       args[1], digit);
   }

   return EXIT_SUCCESS;
 }

The question is, why do you expect it to be noticably faster? On 
modern hardware it the optimizations are such that so little a 
change in code is very hard to link to a difference in running 
time. If you really want to show one the following code does the 
benchmark the right way (ie: using a very long input, tens of 
thousands of runs, and avoiding I/O and load time of the program 
to compare the bare function implementations):

import std.conv;
import std.stdio;
import std.range;
import std.array;
import std.algorithm;

string testcase;

static immutable char[] QG10MatrixOne =
   "03175986427092154863420687135917509834266123045978" ~
   "36742095815869720134894536201794386172052581436790";

char checkDigitOne(string str) {
   char tmpdigit = 0;
   foreach(chr; str) tmpdigit = QG10MatrixOne[(chr - '0') + 
(tmpdigit - '0') * 10];
   return tmpdigit;
}

void testCheckDigitOne() {
     checkDigitTwo(testcase);
}

static immutable ubyte[][] QG10MatrixTwo = [
   [0,3,1,7,5,9,8,6,4,2],[7,0,9,2,1,5,4,8,6,3],
   [4,2,0,6,8,7,1,3,5,9],[1,7,5,0,9,8,3,4,2,6],
   [6,1,2,3,0,4,5,9,7,8],[3,6,7,4,2,0,9,5,8,1],
   [5,8,6,9,7,2,0,1,3,4],[8,9,4,5,3,6,2,0,1,7],
   [9,4,3,8,6,1,7,2,0,5],[2,5,8,1,4,3,6,7,9,0],
];

static int charToInt(char chr) {
   scope(failure) return -1;
   return cast(int)(chr - '0');
}

ubyte checkDigitTwo(string str) {
   ubyte tmpdigit;
   foreach(chr; str) {
       tmpdigit = QG10MatrixTwo[tmpdigit][charToInt(chr)];
   }
   return tmpdigit;
}

void testCheckDigitTwo() {
     checkDigitTwo(testcase);
}

void main(string[] args) {
   testcase = 
iota(10).cycle.take(100000).map!(to!string).array.join;

   import std.datetime: benchmark;
   benchmark!(testCheckDigitOne, 
testCheckDigitTwo)(10000).each!writeln;
}


Yet on my machine I have the following times (note that the time 
itself depends on my hardware, what's important is the 
difference):

TickDuration(15785255852)
TickDuration(15784826803)

So it seems that the first version is slower than the second one, 
but by so little that it's hard to link to the actual 
implementation. If anything it shows the futility of such 
low-level tricks for meaningless optimization. And by the way 
note that the benchmark avoid measuring IO because we were 
interested in the functions. But if we were measuring it it would 
represent 4 fifth of the running time (on my machine, with 
numbers as long as the one used in the benchmark). This means 
even a 20% improvement in checkDigit would actually only 
represent a 4% improvement of the program.

Feb 09 2017

Era Scarecrow <rtcvb32 yahoo.com> writes:

On Thursday, 9 February 2017 at 19:39:49 UTC, Nestor wrote:
 OK I changed the approach using a multidimensional array for 
 the matrix so I could ditch arithmetic operations altogether, 
 but curiously after measuring a few thousand runs of both 
 implementations through avgtime, I see no noticeable 
 difference. Why?

  Truthfully, because you'll need tens of millions or hundreds of 
millions in length to determine if it makes a difference and how 
much. Addition, subtraction and simple memory lookups take very 
little time, and since the entire array (100 bytes) fits in the 
cache, it is going to perform very very very well regardless if 
you can optimize it further.

  If you tested this on a much slower system, say an 8bit 6502 the 
differences would be far more pronounced, but not that much 
different.

  Since the algorithm is more or less O(n) optimizing it won't 
make many differences.

  It's possible you could get a speedup by making them ints 
instead of chars, since then it might not have a penalty for the 
'address not divisible by 4' that applies (which is more for ARM 
architectures and less for x86).

  Other optimizations could be to make it multiple levels, taking 
the basic 100 elements and expanding them 2-3 levels deep in a 
lookup and having it do it in more or less a single operation. 
(100 bytes for 1 level, 10,000 for 2 levels, 1,000,000 for 3 
levels, 100,000,000 for 4 levels, etc), but the steps of 
converting it to the array lookup won't give you that much gain, 
although fewer memory lookups but none of them will be cached, so 
any advantage from that is probably lost. Although if you bump up 
the size to 16x10 instead of 10x10, you could use a shift instead 
of *10 which will make that slightly faster (there will be unused 
empty padded spots)

  In theory if you avoid the memory lookup at all, you could gain 
some amount of speed, depending on how it searches a manual 
table, although using a switch-case and a mixin to do all the 
details it feels like it wouldn't give you any gain...

  Division operations are the slowest operations you can do, but 
otherwise most instructions run really fast. Unless you're trying 
to get it smaller (fewer bytes for the call) or shaving for speed 
by instruction cycle counting (like on the 6502) i doubt you'll 
get much benefit.

Feb 09 2017

Nestor <nestorperez2016 yopmail.com> writes:

On Thursday, 9 February 2017 at 23:49:19 UTC, Era Scarecrow wrote:
  Other optimizations could be to make it multiple levels, 
 taking the basic 100 elements and expanding them 2-3 levels 
 deep in a lookup and having it do it in more or less a single 
 operation. (100 bytes for 1 level, 10,000 for 2 levels, 
 1,000,000 for 3 levels, 100,000,000 for 4 levels, etc), but the 
 steps of converting it to the array lookup won't give you that 
 much gain, although fewer memory lookups but none of them will 
 be cached, so any advantage from that is probably lost. 
 Although if you bump up the size to 16x10 instead of 10x10, you 
 could use a shift instead of *10 which will make that slightly 
 faster (there will be unused empty padded spots)

  In theory if you avoid the memory lookup at all, you could 
 gain some amount of speed, depending on how it searches a 
 manual table, although using a switch-case and a mixin to do 
 all the details it feels like it wouldn't give you any gain...

Thank you for the detailed reply. I wasn't able to follow you 
regarding the multilevel stuff though :(

Feb 10 2017

Era Scarecrow <rtcvb32 yahoo.com> writes:

On Friday, 10 February 2017 at 11:27:02 UTC, Nestor wrote:
 Thank you for the detailed reply. I wasn't able to follow you 
 regarding the multilevel stuff though :(

  The idea behind it is like this (which you can scale up):

static immutable int[] QG10Matrix2 = buildMatrix2();

int[] buildMatrix2() {
     string digits = "0123456789";
     int[] l = new int[16*16*10];
     char[3] s;
     foreach(a; digits)
     foreach(b; digits)
     foreach(c; digits) {
         s[] = [a,b,c];
         l[(a-'0')<< 8|(b-'0')<<4|(c-'0')]=checkDigit(cast(string) 
s) - '0';
     }

     return l;
}


Using that it SHOULD allow you to get the result of 2 inputs 
simply by using 2 characters (plus the old result)

char checkDigit2(string str) {
     int tmpdigit = 0;
     for(;str.length >= 2;str=str[2 .. $]) {
         tmpdigit = QG10Matrix2[tmpdigit<<8|(str[0]-'0')<< 
4|(str[1]-'0')];
     }
    // handle remainder single character and return value


  While it should be easy, I'm having issues trying to get the 
proper results via unittests and I'm not sure why. Probably 
something incredibly simple on my part.

Feb 11 2017

Nestor <nestorperez2016 yopmail.com> writes:

On Saturday, 11 February 2017 at 11:45:02 UTC, Era Scarecrow 
wrote:
 On Friday, 10 February 2017 at 11:27:02 UTC, Nestor wrote:
 Thank you for the detailed reply. I wasn't able to follow you 
 regarding the multilevel stuff though :(

  The idea behind it is like this (which you can scale up):

 static immutable int[] QG10Matrix2 = buildMatrix2();

 int[] buildMatrix2() {
     string digits = "0123456789";
     int[] l = new int[16*16*10];
     char[3] s;
     foreach(a; digits)
     foreach(b; digits)
     foreach(c; digits) {
         s[] = [a,b,c];
         l[(a-'0')<< 
 8|(b-'0')<<4|(c-'0')]=checkDigit(cast(string) s) - '0';
     }

     return l;
 }


 Using that it SHOULD allow you to get the result of 2 inputs 
 simply by using 2 characters (plus the old result)

 char checkDigit2(string str) {
     int tmpdigit = 0;
     for(;str.length >= 2;str=str[2 .. $]) {
         tmpdigit = QG10Matrix2[tmpdigit<<8|(str[0]-'0')<< 
 4|(str[1]-'0')];
     }
    // handle remainder single character and return value


  While it should be easy, I'm having issues trying to get the 
 proper results via unittests and I'm not sure why. Probably 
 something incredibly simple on my part.

Notice this is no ordinary matrix, but an Anti-Simmetric 
QuasiGroup of order 10, and tmpdigit (called interim in the 
algorithm) is used in each round (although the function isn't 
recursive) together with each digit to calculate final check 
digit.

Feb 11 2017

Era Scarecrow <rtcvb32 yahoo.com> writes:

On Saturday, 11 February 2017 at 20:19:51 UTC, Nestor wrote:
 Notice this is no ordinary matrix, but an Anti-Simmetric 
 QuasiGroup of order 10, and tmpdigit (called interim in the 
 algorithm) is used in each round (although the function isn't 
 recursive) together with each digit to calculate final check 
 digit.

  Yes i know, which is why i had 3 to calculate 2 inputs because 
the third is the temp/previous calculation.

  If however you were calculating a fixed number of digits a 
single table could be made and do a single lookup, assuming it 
wasn't too large to make it uncumbersome or impractical.

Feb 11 2017

Era Scarecrow <rtcvb32 yahoo.com> writes:

On Saturday, 11 February 2017 at 21:02:40 UTC, Era Scarecrow 
wrote:
  Yes i know, which is why i had 3 to calculate 2 inputs because 
 the third is the temp/previous calculation.

  Alright I've found the bug and fixed it, and it passes with 
flying colors (brute force tests up to 6 digits); However it 
doesn't use the original function to build the table. So I'm 
satisfied it will handle any length now.

  But it seriously is a lot of overhead for such a simple function.

int[] buildMatrix2() {
     string digits = "0123456789";
     int[] l = new int[16*16*10];
     l[] = -1; //printing the array it's obvious to see what is 
padding
     foreach(a; digits)
     foreach(b; digits)
     foreach(c; digits) {
         int t = (a-'0')*10,
             t2 = (QG10Matrix[(b - '0') + t]-'0') * 10,
             off = (a - '0') << 8 | (b - '0') << 4 | (c - '0');
         l[off] = (QG10Matrix[(c - '0') + t2]-'0')<<8;
     }

     return l;
}

char checkDigit2(string str) {
     int tmpdigit = 0;
     for(;str.length >= 2;str = str[2 .. $])
         tmpdigit = 
QG10Matrix2[tmpdigit|(str[0]-'0')<<4|(str[1]-'0')];

     tmpdigit>>=8;
     if (str.length==1)
         return QG10Matrix[(str[0]-'0')+tmpdigit*10];

     return (tmpdigit+'0') & 0xff;
}

Feb 11 2017

Era Scarecrow <rtcvb32 yahoo.com> writes:

On Saturday, 11 February 2017 at 21:41:11 UTC, Era Scarecrow 
wrote:
  But it seriously is a lot of overhead for such a simple 
 function.

  Just ran the unittests under the dmd profiler, says the 
algorithm is 11% faster now. So yeah slightly more optimized. 
Another level and we could probably get 25%, but the built matrix 
will blow up far larger than the 10k it is now.

   Num          Tree        Func        Per
   Calls        Time        Time        Call
12000000     1281989     1281989           0     char 
damm.checkDigit(immutable(char)[])
12000000     1146308     1146308           0     char 
damm.checkDigit2(immutable(char)[])

Feb 11 2017

Era Scarecrow <rtcvb32 yahoo.com> writes:

On Saturday, 11 February 2017 at 21:56:54 UTC, Era Scarecrow 
wrote:
  Just ran the unittests under the dmd profiler, says the 
 algorithm is 11% faster now. So yeah slightly more optimized.

Ran some more tests.

Without optimization but with with 4 levels (a 2.5Mb table), it 
gains to a whopping 27%!
However with optimizations turned on it dwindled to a mere 15% 
boost
And optimization + no bounds checking, 2 & 4 levels both give a 
9% boost total.

Testing purely on 8byte inputs (Brute forced all combinations) 
receives the same 9% boost with negligible difference.

Safe to say going higher levels isn't going to give you 
sufficient improvement; Also exe file is 3Mb big (but compresses 
to 150k).

Feb 11 2017

Nestor <nestorperez2016 yopmail.com> writes:

On Sunday, 12 February 2017 at 05:54:34 UTC, Era Scarecrow wrote:
 On Saturday, 11 February 2017 at 21:56:54 UTC, Era Scarecrow 
 wrote:
  Just ran the unittests under the dmd profiler, says the 
 algorithm is 11% faster now. So yeah slightly more optimized.

 Ran some more tests.

 Without optimization but with with 4 levels (a 2.5Mb table), it 
 gains to a whopping 27%!
 However with optimizations turned on it dwindled to a mere 15% 
 boost
 And optimization + no bounds checking, 2 & 4 levels both give a 
 9% boost total.

 Testing purely on 8byte inputs (Brute forced all combinations) 
 receives the same 9% boost with negligible difference.

 Safe to say going higher levels isn't going to give you 
 sufficient improvement; Also exe file is 3Mb big (but 
 compresses to 150k).

Wow!
Thanks for the interest and effort.

Feb 12 2017

Era Scarecrow <rtcvb32 yahoo.com> writes:

On Monday, 13 February 2017 at 00:56:37 UTC, Nestor wrote:
 On Sunday, 12 February 2017 at 05:54:34 UTC, Era Scarecrow 
 wrote:
 Ran some more tests.

 Wow!
 Thanks for the interest and effort.

  Certainly. But the bulk of the answer comes down that the 2 
levels that I've already provided are the fastest you're probably 
going to get. Certainly we can test using shorts or bytes 
instead, but it's likely the results will only go down.

  To note my tests are strictly on my x86 system and it would be 
better to also test this on other systems like PPC, Linux, ARM, 
and other architectures to see how they perform, and possibly 
tweak them as appropriate.

  Still we did find out there is some optimization that can be 
done and successfully for the Damm algorithm, it just isn't going 
to be a lot.

  Hmmm... A thought does come to mind. Parallelizing the code; 
However that would require probably 11 instances to get a 2x 
speedup (calculating the second half with all 10 possibilities 
for the carry over, and also calculating the first half, then 
choosing which of the 10 based on the first half's output), which 
only really works if you have a ton of cores, and the input is 
REALLY REALLY large, like a meg or something. While the usage of 
the Damm code is more useful for adding a digit to the end of a 
code like UPC or Barcodes as error detection, and expecting 
larger than 32 for real applications is unlikely.

  But at this point I'm rambling.

Feb 12 2017

Nestor <nestorperez2016 yopmail.com> writes:

On Saturday, 11 February 2017 at 21:41:11 UTC, Era Scarecrow 
wrote:
 On Saturday, 11 February 2017 at 21:02:40 UTC, Era Scarecrow 
 wrote:
  Yes i know, which is why i had 3 to calculate 2 inputs 
 because the third is the temp/previous calculation.

  Alright I've found the bug and fixed it, and it passes with 
 flying colors (brute force tests up to 6 digits); However it 
 doesn't use the original function to build the table. So I'm 
 satisfied it will handle any length now.

  But it seriously is a lot of overhead for such a simple 
 function.

 int[] buildMatrix2() {
     string digits = "0123456789";
     int[] l = new int[16*16*10];
     l[] = -1; //printing the array it's obvious to see what is 
 padding
     foreach(a; digits)
     foreach(b; digits)
     foreach(c; digits) {
         int t = (a-'0')*10,
             t2 = (QG10Matrix[(b - '0') + t]-'0') * 10,
             off = (a - '0') << 8 | (b - '0') << 4 | (c - '0');
         l[off] = (QG10Matrix[(c - '0') + t2]-'0')<<8;
     }

     return l;
 }

 char checkDigit2(string str) {
     int tmpdigit = 0;
     for(;str.length >= 2;str = str[2 .. $])
         tmpdigit = 
 QG10Matrix2[tmpdigit|(str[0]-'0')<<4|(str[1]-'0')];

     tmpdigit>>=8;
     if (str.length==1)
         return QG10Matrix[(str[0]-'0')+tmpdigit*10];

     return (tmpdigit+'0') & 0xff;
 }

I fail to see where you are declaring QG10Matrix2, because 
apparently it's an array of chars, but buildMatrix2 returns an 
array of int (2560 elements??) with lots of -1 values.

Feb 11 2017

Era Scarecrow <rtcvb32 yahoo.com> writes:

On Sunday, 12 February 2017 at 00:43:55 UTC, Nestor wrote:
 I fail to see where you are declaring QG10Matrix2, because 
 apparently it's an array of chars, but buildMatrix2 returns an 
 array of int (2560 elements??) with lots of -1 values.

I declared it here: 
http://forum.dlang.org/post/fiamjuhiddbzwvaplxkr forum.dlang.org

and it's not chars, it's ints. Part of it is to avoid the mem 
access penalty for non divisible by 4 addresses, and another is 
that the array returns not chars, but numbers from 0-2560, which 
is the answer *256 (<<8), which offers hopefully some slight 
speed gain on longer inputs.

  Also the size of the array is guesstimated, and the -1's just 
signify the padding (which can be any value, but -1 makes it 
obvious). Since it's a 10x10 array it's based on, but using 4bits 
per section, there's 6 elements lost, and 6 whole rows lost. It's 
a necessary loss to gain speed; Thankfully it's only using 10k 
(2560 members) and not 700k as was my original guess when i was 
calculating it wrong earlier.

  Doing it wrong earlier, the compiler kept crashing... :P running 
out of memory.

Feb 11 2017

Nestor <nestorperez2016 yopmail.com> writes:

On Thursday, 9 February 2017 at 18:34:30 UTC, Era Scarecrow wrote:
 ...
 Actually since you're also multiplying by 10, you can 
 incorporate that in the table too...

I forgot to comment that what is multiplied by ten is not the 
value but the starting position in the array (a way to emulate a 
matrix), but anyway see my comments in previous post.

Feb 09 2017

D Programming

C/C++ Programming

Other

digitalmars.D.learn - Can this implementation of Damm algorithm be optimized?