www.digitalmars.com         C & C++   DMDScript  

digitalmars.D.learn - Can this implementation of Damm algorithm be optimized?

reply Nestor <nestorperez2016 yopmail.com> writes:
Hi,

I was trying to port C code from the article in Wikiversity [1] 
to D, but I'm not sure this implementation is the most efficient 
way to do it in D, so suggestions to optimize it are welcome:

import std.stdio;

static immutable char[] QG10Matrix =
   "03175986427092154863420687135917509834266123045978" ~
   "36742095815869720134894536201794386172052581436790";

char checkDigit(string str) {
   char tmpdigit = '0';
   foreach(chr; str) tmpdigit = QG10Matrix[(chr - '0') + (tmpdigit 
- '0') * 10];
   return tmpdigit;
}

enum {
   EXIT_SUCCESS = 0,
   EXIT_FAILURE = 1,
}

int main(string[] args) {
   scope(failure) {
     writeln("Invalid arguments. You must pass a number.");
     return EXIT_FAILURE;
   }
   assert(args.length == 2);
   char digit = checkDigit(args[1]);
   if(digit == '0') writefln("%s is a valid number.", args[1]);
   else {
     writefln("%s is not a valid number (but it would be, 
appending digit %s).",
       args[1], digit);
   }

   return EXIT_SUCCESS;
}

[1] https://en.wikiversity.org/wiki/Damm_algorithm
Feb 09 2017
next sibling parent Cym13 <cpicard openmailbox.org> writes:
On Thursday, 9 February 2017 at 17:36:11 UTC, Nestor wrote:
 Hi,

 I was trying to port C code from the article in Wikiversity [1] 
 to D, but I'm not sure this implementation is the most 
 efficient way to do it in D, so suggestions to optimize it are 
 welcome:

 import std.stdio;

 static immutable char[] QG10Matrix =
   "03175986427092154863420687135917509834266123045978" ~
   "36742095815869720134894536201794386172052581436790";

 char checkDigit(string str) {
   char tmpdigit = '0';
   foreach(chr; str) tmpdigit = QG10Matrix[(chr - '0') + 
 (tmpdigit - '0') * 10];
   return tmpdigit;
 }

 enum {
   EXIT_SUCCESS = 0,
   EXIT_FAILURE = 1,
 }

 int main(string[] args) {
   scope(failure) {
     writeln("Invalid arguments. You must pass a number.");
     return EXIT_FAILURE;
   }
   assert(args.length == 2);
   char digit = checkDigit(args[1]);
   if(digit == '0') writefln("%s is a valid number.", args[1]);
   else {
     writefln("%s is not a valid number (but it would be, 
 appending digit %s).",
       args[1], digit);
   }

   return EXIT_SUCCESS;
 }

 [1] https://en.wikiversity.org/wiki/Damm_algorithm
I can't see how to make it more efficient than it is (aside from finding a better algorithm which isn't a language concern). You are using C style and will get the same machine code as if you'd written it with C. The only difference would be boundchecking that you can disable by using the -boundscheck=off compiler switch. By the way note that you can use the -profile switch to see where the bottlenecks are, in your case writting the result to stdout is *by far* what takes the most time (and I'm quite certain the bottleneck is the same in the C version for it's a fact that I/O is slow).
Feb 09 2017
prev sibling parent reply Era Scarecrow <rtcvb32 yahoo.com> writes:
On Thursday, 9 February 2017 at 17:36:11 UTC, Nestor wrote:
 I was trying to port C code from the article in Wikiversity [1] 
 to D, but I'm not sure this implementation is the most 
 efficient way to do it in D, so suggestions to optimize it are 
 welcome:

 import std.stdio;

 static immutable char[] QG10Matrix =
   "03175986427092154863420687135917509834266123045978" ~
   "36742095815869720134894536201794386172052581436790";

 char checkDigit(string str) {
   char tmpdigit = '0';
   foreach(chr; str) tmpdigit = QG10Matrix[(chr - '0') + 
 (tmpdigit - '0') * 10];
   return tmpdigit;
 }
Well one thing is you can probably reduce them from chars to just bytes, instead of having to subtract you can instead add at the end. Although unless you're working with a VERY large input you won't see a difference. Actually since you're also multiplying by 10, you can incorporate that in the table too... (although a mixin might be better for the conversion than by hand) static immutable char[] QG10Matrix = [ 00,30,10,70,50,90,80,60,40,20, 70,00,90,20,10,50,40,80,60,30, 40,20,00,60,80,70,10,30,50,90, 10,70,50,00,90,80,30,40,20,60, 60,10,20,30,00,40,50,90,70,80, 30,60,70,40,20,00,90,50,80,10, 50,80,60,90,70,20,00,10,30,40, 80,90,40,50,30,60,20,00,10,70, 90,40,30,80,60,10,70,20,00,50, 20,50,80,10,40,30,60,70,90,00]; char checkDigit(string str) { char tmpdigit = 0; foreach(chr; str) tmpdigit = QG10Matrix[(chr - '0') + tmpdigit]; return (tmpdigit/10) + '0'; }
Feb 09 2017
next sibling parent reply Nestor <nestorperez2016 yopmail.com> writes:
On Thursday, 9 February 2017 at 18:34:30 UTC, Era Scarecrow wrote:
 On Thursday, 9 February 2017 at 17:36:11 UTC, Nestor wrote:
 I was trying to port C code from the article in Wikiversity 
 [1] to D, but I'm not sure this implementation is the most 
 efficient way to do it in D, so suggestions to optimize it are 
 welcome:

 import std.stdio;

 static immutable char[] QG10Matrix =
   "03175986427092154863420687135917509834266123045978" ~
   "36742095815869720134894536201794386172052581436790";

 char checkDigit(string str) {
   char tmpdigit = '0';
   foreach(chr; str) tmpdigit = QG10Matrix[(chr - '0') + 
 (tmpdigit - '0') * 10];
   return tmpdigit;
 }
Well one thing is you can probably reduce them from chars to just bytes, instead of having to subtract you can instead add at the end. Although unless you're working with a VERY large input you won't see a difference. Actually since you're also multiplying by 10, you can incorporate that in the table too... (although a mixin might be better for the conversion than by hand) static immutable char[] QG10Matrix = [ 00,30,10,70,50,90,80,60,40,20, 70,00,90,20,10,50,40,80,60,30, 40,20,00,60,80,70,10,30,50,90, 10,70,50,00,90,80,30,40,20,60, 60,10,20,30,00,40,50,90,70,80, 30,60,70,40,20,00,90,50,80,10, 50,80,60,90,70,20,00,10,30,40, 80,90,40,50,30,60,20,00,10,70, 90,40,30,80,60,10,70,20,00,50, 20,50,80,10,40,30,60,70,90,00]; char checkDigit(string str) { char tmpdigit = 0; foreach(chr; str) tmpdigit = QG10Matrix[(chr - '0') + tmpdigit]; return (tmpdigit/10) + '0'; }
OK I changed the approach using a multidimensional array for the matrix so I could ditch arithmetic operations altogether, but curiously after measuring a few thousand runs of both implementations through avgtime, I see no noticeable difference. Why? import std.stdio; static immutable ubyte[][] QG10Matrix = [ [0,3,1,7,5,9,8,6,4,2],[7,0,9,2,1,5,4,8,6,3], [4,2,0,6,8,7,1,3,5,9],[1,7,5,0,9,8,3,4,2,6], [6,1,2,3,0,4,5,9,7,8],[3,6,7,4,2,0,9,5,8,1], [5,8,6,9,7,2,0,1,3,4],[8,9,4,5,3,6,2,0,1,7], [9,4,3,8,6,1,7,2,0,5],[2,5,8,1,4,3,6,7,9,0], ]; static int charToInt(char chr) { scope(failure) return -1; return cast(int)(chr - '0'); } ubyte checkDigit(string str) { ubyte tmpdigit; foreach(chr; str) tmpdigit = QG10Matrix[tmpdigit][charToInt(chr)]; return tmpdigit; } enum { EXIT_SUCCESS = 0, EXIT_FAILURE = 1, } int main(string[] args) { scope(failure) { writeln("Invalid arguments. You must pass a number."); return EXIT_FAILURE; } assert(args.length == 2); ubyte digit = checkDigit(args[1]); if(digit == 0) writefln("%s is a valid number.", args[1]); else { writefln("%s is not a valid number (but it would be, appending digit %s).", args[1], digit); } return EXIT_SUCCESS; }
Feb 09 2017
next sibling parent reply Daniel Kozak via Digitalmars-d-learn <digitalmars-d-learn puremagic.com> writes:
Dne 9.2.2017 v 20:39 Nestor via Digitalmars-d-learn napsal(a):

 On Thursday, 9 February 2017 at 18:34:30 UTC, Era Scarecrow wrote:
 On Thursday, 9 February 2017 at 17:36:11 UTC, Nestor wrote:
 I was trying to port C code from the article in Wikiversity [1] to 
 D, but I'm not sure this implementation is the most efficient way to 
 do it in D, so suggestions to optimize it are welcome:

 import std.stdio;

 static immutable char[] QG10Matrix =
   "03175986427092154863420687135917509834266123045978" ~
   "36742095815869720134894536201794386172052581436790";

 char checkDigit(string str) {
   char tmpdigit = '0';
   foreach(chr; str) tmpdigit = QG10Matrix[(chr - '0') + (tmpdigit - 
 '0') * 10];
   return tmpdigit;
 }
Well one thing is you can probably reduce them from chars to just bytes, instead of having to subtract you can instead add at the end. Although unless you're working with a VERY large input you won't see a difference. Actually since you're also multiplying by 10, you can incorporate that in the table too... (although a mixin might be better for the conversion than by hand) static immutable char[] QG10Matrix = [ 00,30,10,70,50,90,80,60,40,20, 70,00,90,20,10,50,40,80,60,30, 40,20,00,60,80,70,10,30,50,90, 10,70,50,00,90,80,30,40,20,60, 60,10,20,30,00,40,50,90,70,80, 30,60,70,40,20,00,90,50,80,10, 50,80,60,90,70,20,00,10,30,40, 80,90,40,50,30,60,20,00,10,70, 90,40,30,80,60,10,70,20,00,50, 20,50,80,10,40,30,60,70,90,00]; char checkDigit(string str) { char tmpdigit = 0; foreach(chr; str) tmpdigit = QG10Matrix[(chr - '0') + tmpdigit]; return (tmpdigit/10) + '0'; }
OK I changed the approach using a multidimensional array for the matrix so I could ditch arithmetic operations altogether, but curiously after measuring a few thousand runs of both implementations through avgtime, I see no noticeable difference. Why? import std.stdio; static immutable ubyte[][] QG10Matrix = [ [0,3,1,7,5,9,8,6,4,2],[7,0,9,2,1,5,4,8,6,3], [4,2,0,6,8,7,1,3,5,9],[1,7,5,0,9,8,3,4,2,6], [6,1,2,3,0,4,5,9,7,8],[3,6,7,4,2,0,9,5,8,1], [5,8,6,9,7,2,0,1,3,4],[8,9,4,5,3,6,2,0,1,7], [9,4,3,8,6,1,7,2,0,5],[2,5,8,1,4,3,6,7,9,0], ]; static int charToInt(char chr) { scope(failure) return -1; return cast(int)(chr - '0'); } ubyte checkDigit(string str) { ubyte tmpdigit; foreach(chr; str) tmpdigit = QG10Matrix[tmpdigit][charToInt(chr)]; return tmpdigit; } enum { EXIT_SUCCESS = 0, EXIT_FAILURE = 1, } int main(string[] args) { scope(failure) { writeln("Invalid arguments. You must pass a number."); return EXIT_FAILURE; } assert(args.length == 2); ubyte digit = checkDigit(args[1]); if(digit == 0) writefln("%s is a valid number.", args[1]); else { writefln("%s is not a valid number (but it would be, appending digit %s).", args[1], digit); } return EXIT_SUCCESS; }
Maybe you can try use static array instead of dynamic static immutable ubyte[10][10] QG10Matrix = ...
Feb 09 2017
parent reply Nestor <nestorperez2016 yopmail.com> writes:
On Thursday, 9 February 2017 at 20:46:06 UTC, Daniel Kozak wrote:
 Maybe you can try use static array instead of dynamic
 static immutable ubyte[10][10] QG10Matrix = ...
I shaved it to this to discard unneccessary time-consuming functions: static immutable ubyte[10][10] QG10Matrix = [ [0,3,1,7,5,9,8,6,4,2],[7,0,9,2,1,5,4,8,6,3], [4,2,0,6,8,7,1,3,5,9],[1,7,5,0,9,8,3,4,2,6], [6,1,2,3,0,4,5,9,7,8],[3,6,7,4,2,0,9,5,8,1], [5,8,6,9,7,2,0,1,3,4],[8,9,4,5,3,6,2,0,1,7], [9,4,3,8,6,1,7,2,0,5],[2,5,8,1,4,3,6,7,9,0], ]; static immutable string number = "0123456789012345678901234567890123456789012345678901234567890123456789"; static int charToInt(char chr) { return ((chr >= '0') || (chr <= '9')) ? cast(int)(chr - '0') : -1; } ubyte checkDigit(string str) { ubyte tmpdigit; foreach(chr; str) tmpdigit = QG10Matrix[tmpdigit][charToInt(chr)]; return tmpdigit; } void main() { auto digit = checkDigit(number); } I even tried making checkDigit static, but surprisingly this increased average execution time by 1ms. Anyway, the previopus version (modified to benefit from a little optimization too) is almost as performant, even though it includes multiplication and sum: static immutable char[] QG10Matrix = "03175986427092154863420687135917509834266123045978" ~ "36742095815869720134894536201794386172052581436790"; static immutable string number = "0123456789012345678901234567890123456789012345678901234567890123456789"; static int charToInt(char chr) { return ((chr >= '0') || (chr <= '9')) ? cast(int)(chr - '0') : -1; } char checkDigit(string str) { char tmpdigit = '0'; foreach(chr; str) tmpdigit = QG10Matrix[charToInt(chr) + (charToInt(tmpdigit) * 10)]; return tmpdigit; } void main() { auto digit = checkDigit(number); } I compiled both with -inline -noboundscheck -release and the multidimensional array version does perform 1ms faster for a couple hundred runs, but I expected the difference to be much more noticeable. Any idea of what might be happening here?
Feb 09 2017
parent reply Daniel Kozak via Digitalmars-d-learn <digitalmars-d-learn puremagic.com> writes:
Dne 9.2.2017 v 22:29 Nestor via Digitalmars-d-learn napsal(a):

 On Thursday, 9 February 2017 at 20:46:06 UTC, Daniel Kozak wrote:
 Maybe you can try use static array instead of dynamic
 static immutable ubyte[10][10] QG10Matrix = ...
I shaved it to this to discard unneccessary time-consuming functions: static immutable ubyte[10][10] QG10Matrix = [ [0,3,1,7,5,9,8,6,4,2],[7,0,9,2,1,5,4,8,6,3], [4,2,0,6,8,7,1,3,5,9],[1,7,5,0,9,8,3,4,2,6], [6,1,2,3,0,4,5,9,7,8],[3,6,7,4,2,0,9,5,8,1], [5,8,6,9,7,2,0,1,3,4],[8,9,4,5,3,6,2,0,1,7], [9,4,3,8,6,1,7,2,0,5],[2,5,8,1,4,3,6,7,9,0], ]; static immutable string number = "0123456789012345678901234567890123456789012345678901234567890123456789"; static int charToInt(char chr) { return ((chr >= '0') || (chr <= '9')) ? cast(int)(chr - '0') : -1; } ubyte checkDigit(string str) { ubyte tmpdigit; foreach(chr; str) tmpdigit = QG10Matrix[tmpdigit][charToInt(chr)]; return tmpdigit; } void main() { auto digit = checkDigit(number); } I even tried making checkDigit static, but surprisingly this increased average execution time by 1ms. Anyway, the previopus version (modified to benefit from a little optimization too) is almost as performant, even though it includes multiplication and sum: static immutable char[] QG10Matrix = "03175986427092154863420687135917509834266123045978" ~ "36742095815869720134894536201794386172052581436790"; static immutable string number = "0123456789012345678901234567890123456789012345678901234567890123456789"; static int charToInt(char chr) { return ((chr >= '0') || (chr <= '9')) ? cast(int)(chr - '0') : -1; } char checkDigit(string str) { char tmpdigit = '0'; foreach(chr; str) tmpdigit = QG10Matrix[charToInt(chr) + (charToInt(tmpdigit) * 10)]; return tmpdigit; } void main() { auto digit = checkDigit(number); } I compiled both with -inline -noboundscheck -release and the multidimensional array version does perform 1ms faster for a couple hundred runs, but I expected the difference to be much more noticeable. Any idea of what might be happening here?
Did you try it with different backends? llvm (ldc), gcc(gdc)?
Feb 09 2017
parent reply Nestor <nestorperez2016 yopmail.com> writes:
On Thursday, 9 February 2017 at 21:43:08 UTC, Daniel Kozak wrote:
 Any idea of what might be happening here?
Did you try it with different backends? llvm (ldc), gcc(gdc)?
Not really, just standard dmd. I tried running each algoritm a few times through avgtime using different digit lengths (upto 10,000, my PC won't handle much more) and different amount of repetitions, and the results aren't consistent, some times one algorithm is marginally faster, sometimes the other. Apparently the compiler is capable of optimizing the unidimensional array version. Thank you all nevertheless for the suggestions.
Feb 09 2017
parent reply =?UTF-8?Q?Ali_=c3=87ehreli?= <acehreli yahoo.com> writes:
On 02/09/2017 02:04 PM, Nestor wrote:

 I tried running each algoritm a few times through avgtime using
 different digit lengths
avgtime profiles the whole process, right? It measures everything that is involved in that little program. At least OS starting the program, D runtime initializing, program interacting with the console, etc. Since you're comparing algorithms, you need to take everything else out of the equation. A better approach is to use std.datetime.benchmark, which would repeat just the algorithm that you're interested in. Ali
Feb 09 2017
parent Daniel Kozak via Digitalmars-d-learn <digitalmars-d-learn puremagic.com> writes:
Dne 9.2.2017 v 23:08 Ali Çehreli via Digitalmars-d-learn napsal(a):

 On 02/09/2017 02:04 PM, Nestor wrote:

 I tried running each algoritm a few times through avgtime using
 different digit lengths
avgtime profiles the whole process, right? It measures everything that is involved in that little program. At least OS starting the program, D runtime initializing, program interacting with the console, etc. Since you're comparing algorithms, you need to take everything else out of the equation. A better approach is to use std.datetime.benchmark, which would repeat just the algorithm that you're interested in. Ali
Yes this is important, and for example when you used shared libphobos it makes really big difference
Feb 09 2017
prev sibling next sibling parent Cym13 <cpicard openmailbox.org> writes:
On Thursday, 9 February 2017 at 19:39:49 UTC, Nestor wrote:
 On Thursday, 9 February 2017 at 18:34:30 UTC, Era Scarecrow 
 wrote:
 On Thursday, 9 February 2017 at 17:36:11 UTC, Nestor wrote:
 I was trying to port C code from the article in Wikiversity 
 [1] to D, but I'm not sure this implementation is the most 
 efficient way to do it in D, so suggestions to optimize it 
 are welcome:

 import std.stdio;

 static immutable char[] QG10Matrix =
   "03175986427092154863420687135917509834266123045978" ~
   "36742095815869720134894536201794386172052581436790";

 char checkDigit(string str) {
   char tmpdigit = '0';
   foreach(chr; str) tmpdigit = QG10Matrix[(chr - '0') + 
 (tmpdigit - '0') * 10];
   return tmpdigit;
 }
Well one thing is you can probably reduce them from chars to just bytes, instead of having to subtract you can instead add at the end. Although unless you're working with a VERY large input you won't see a difference. Actually since you're also multiplying by 10, you can incorporate that in the table too... (although a mixin might be better for the conversion than by hand) static immutable char[] QG10Matrix = [ 00,30,10,70,50,90,80,60,40,20, 70,00,90,20,10,50,40,80,60,30, 40,20,00,60,80,70,10,30,50,90, 10,70,50,00,90,80,30,40,20,60, 60,10,20,30,00,40,50,90,70,80, 30,60,70,40,20,00,90,50,80,10, 50,80,60,90,70,20,00,10,30,40, 80,90,40,50,30,60,20,00,10,70, 90,40,30,80,60,10,70,20,00,50, 20,50,80,10,40,30,60,70,90,00]; char checkDigit(string str) { char tmpdigit = 0; foreach(chr; str) tmpdigit = QG10Matrix[(chr - '0') + tmpdigit]; return (tmpdigit/10) + '0'; }
OK I changed the approach using a multidimensional array for the matrix so I could ditch arithmetic operations altogether, but curiously after measuring a few thousand runs of both implementations through avgtime, I see no noticeable difference. Why? import std.stdio; static immutable ubyte[][] QG10Matrix = [ [0,3,1,7,5,9,8,6,4,2],[7,0,9,2,1,5,4,8,6,3], [4,2,0,6,8,7,1,3,5,9],[1,7,5,0,9,8,3,4,2,6], [6,1,2,3,0,4,5,9,7,8],[3,6,7,4,2,0,9,5,8,1], [5,8,6,9,7,2,0,1,3,4],[8,9,4,5,3,6,2,0,1,7], [9,4,3,8,6,1,7,2,0,5],[2,5,8,1,4,3,6,7,9,0], ]; static int charToInt(char chr) { scope(failure) return -1; return cast(int)(chr - '0'); } ubyte checkDigit(string str) { ubyte tmpdigit; foreach(chr; str) tmpdigit = QG10Matrix[tmpdigit][charToInt(chr)]; return tmpdigit; } enum { EXIT_SUCCESS = 0, EXIT_FAILURE = 1, } int main(string[] args) { scope(failure) { writeln("Invalid arguments. You must pass a number."); return EXIT_FAILURE; } assert(args.length == 2); ubyte digit = checkDigit(args[1]); if(digit == 0) writefln("%s is a valid number.", args[1]); else { writefln("%s is not a valid number (but it would be, appending digit %s).", args[1], digit); } return EXIT_SUCCESS; }
The question is, why do you expect it to be noticably faster? On modern hardware it the optimizations are such that so little a change in code is very hard to link to a difference in running time. If you really want to show one the following code does the benchmark the right way (ie: using a very long input, tens of thousands of runs, and avoiding I/O and load time of the program to compare the bare function implementations): import std.conv; import std.stdio; import std.range; import std.array; import std.algorithm; string testcase; static immutable char[] QG10MatrixOne = "03175986427092154863420687135917509834266123045978" ~ "36742095815869720134894536201794386172052581436790"; char checkDigitOne(string str) { char tmpdigit = 0; foreach(chr; str) tmpdigit = QG10MatrixOne[(chr - '0') + (tmpdigit - '0') * 10]; return tmpdigit; } void testCheckDigitOne() { checkDigitTwo(testcase); } static immutable ubyte[][] QG10MatrixTwo = [ [0,3,1,7,5,9,8,6,4,2],[7,0,9,2,1,5,4,8,6,3], [4,2,0,6,8,7,1,3,5,9],[1,7,5,0,9,8,3,4,2,6], [6,1,2,3,0,4,5,9,7,8],[3,6,7,4,2,0,9,5,8,1], [5,8,6,9,7,2,0,1,3,4],[8,9,4,5,3,6,2,0,1,7], [9,4,3,8,6,1,7,2,0,5],[2,5,8,1,4,3,6,7,9,0], ]; static int charToInt(char chr) { scope(failure) return -1; return cast(int)(chr - '0'); } ubyte checkDigitTwo(string str) { ubyte tmpdigit; foreach(chr; str) { tmpdigit = QG10MatrixTwo[tmpdigit][charToInt(chr)]; } return tmpdigit; } void testCheckDigitTwo() { checkDigitTwo(testcase); } void main(string[] args) { testcase = iota(10).cycle.take(100000).map!(to!string).array.join; import std.datetime: benchmark; benchmark!(testCheckDigitOne, testCheckDigitTwo)(10000).each!writeln; } Yet on my machine I have the following times (note that the time itself depends on my hardware, what's important is the difference): TickDuration(15785255852) TickDuration(15784826803) So it seems that the first version is slower than the second one, but by so little that it's hard to link to the actual implementation. If anything it shows the futility of such low-level tricks for meaningless optimization. And by the way note that the benchmark avoid measuring IO because we were interested in the functions. But if we were measuring it it would represent 4 fifth of the running time (on my machine, with numbers as long as the one used in the benchmark). This means even a 20% improvement in checkDigit would actually only represent a 4% improvement of the program.
Feb 09 2017
prev sibling parent reply Era Scarecrow <rtcvb32 yahoo.com> writes:
On Thursday, 9 February 2017 at 19:39:49 UTC, Nestor wrote:
 OK I changed the approach using a multidimensional array for 
 the matrix so I could ditch arithmetic operations altogether, 
 but curiously after measuring a few thousand runs of both 
 implementations through avgtime, I see no noticeable 
 difference. Why?
Truthfully, because you'll need tens of millions or hundreds of millions in length to determine if it makes a difference and how much. Addition, subtraction and simple memory lookups take very little time, and since the entire array (100 bytes) fits in the cache, it is going to perform very very very well regardless if you can optimize it further. If you tested this on a much slower system, say an 8bit 6502 the differences would be far more pronounced, but not that much different. Since the algorithm is more or less O(n) optimizing it won't make many differences. It's possible you could get a speedup by making them ints instead of chars, since then it might not have a penalty for the 'address not divisible by 4' that applies (which is more for ARM architectures and less for x86). Other optimizations could be to make it multiple levels, taking the basic 100 elements and expanding them 2-3 levels deep in a lookup and having it do it in more or less a single operation. (100 bytes for 1 level, 10,000 for 2 levels, 1,000,000 for 3 levels, 100,000,000 for 4 levels, etc), but the steps of converting it to the array lookup won't give you that much gain, although fewer memory lookups but none of them will be cached, so any advantage from that is probably lost. Although if you bump up the size to 16x10 instead of 10x10, you could use a shift instead of *10 which will make that slightly faster (there will be unused empty padded spots) In theory if you avoid the memory lookup at all, you could gain some amount of speed, depending on how it searches a manual table, although using a switch-case and a mixin to do all the details it feels like it wouldn't give you any gain... Division operations are the slowest operations you can do, but otherwise most instructions run really fast. Unless you're trying to get it smaller (fewer bytes for the call) or shaving for speed by instruction cycle counting (like on the 6502) i doubt you'll get much benefit.
Feb 09 2017
parent reply Nestor <nestorperez2016 yopmail.com> writes:
On Thursday, 9 February 2017 at 23:49:19 UTC, Era Scarecrow wrote:
  Other optimizations could be to make it multiple levels, 
 taking the basic 100 elements and expanding them 2-3 levels 
 deep in a lookup and having it do it in more or less a single 
 operation. (100 bytes for 1 level, 10,000 for 2 levels, 
 1,000,000 for 3 levels, 100,000,000 for 4 levels, etc), but the 
 steps of converting it to the array lookup won't give you that 
 much gain, although fewer memory lookups but none of them will 
 be cached, so any advantage from that is probably lost. 
 Although if you bump up the size to 16x10 instead of 10x10, you 
 could use a shift instead of *10 which will make that slightly 
 faster (there will be unused empty padded spots)

  In theory if you avoid the memory lookup at all, you could 
 gain some amount of speed, depending on how it searches a 
 manual table, although using a switch-case and a mixin to do 
 all the details it feels like it wouldn't give you any gain...
Thank you for the detailed reply. I wasn't able to follow you regarding the multilevel stuff though :(
Feb 10 2017
parent reply Era Scarecrow <rtcvb32 yahoo.com> writes:
On Friday, 10 February 2017 at 11:27:02 UTC, Nestor wrote:
 Thank you for the detailed reply. I wasn't able to follow you 
 regarding the multilevel stuff though :(
The idea behind it is like this (which you can scale up): static immutable int[] QG10Matrix2 = buildMatrix2(); int[] buildMatrix2() { string digits = "0123456789"; int[] l = new int[16*16*10]; char[3] s; foreach(a; digits) foreach(b; digits) foreach(c; digits) { s[] = [a,b,c]; l[(a-'0')<< 8|(b-'0')<<4|(c-'0')]=checkDigit(cast(string) s) - '0'; } return l; } Using that it SHOULD allow you to get the result of 2 inputs simply by using 2 characters (plus the old result) char checkDigit2(string str) { int tmpdigit = 0; for(;str.length >= 2;str=str[2 .. $]) { tmpdigit = QG10Matrix2[tmpdigit<<8|(str[0]-'0')<< 4|(str[1]-'0')]; } // handle remainder single character and return value While it should be easy, I'm having issues trying to get the proper results via unittests and I'm not sure why. Probably something incredibly simple on my part.
Feb 11 2017
parent reply Nestor <nestorperez2016 yopmail.com> writes:
On Saturday, 11 February 2017 at 11:45:02 UTC, Era Scarecrow 
wrote:
 On Friday, 10 February 2017 at 11:27:02 UTC, Nestor wrote:
 Thank you for the detailed reply. I wasn't able to follow you 
 regarding the multilevel stuff though :(
The idea behind it is like this (which you can scale up): static immutable int[] QG10Matrix2 = buildMatrix2(); int[] buildMatrix2() { string digits = "0123456789"; int[] l = new int[16*16*10]; char[3] s; foreach(a; digits) foreach(b; digits) foreach(c; digits) { s[] = [a,b,c]; l[(a-'0')<< 8|(b-'0')<<4|(c-'0')]=checkDigit(cast(string) s) - '0'; } return l; } Using that it SHOULD allow you to get the result of 2 inputs simply by using 2 characters (plus the old result) char checkDigit2(string str) { int tmpdigit = 0; for(;str.length >= 2;str=str[2 .. $]) { tmpdigit = QG10Matrix2[tmpdigit<<8|(str[0]-'0')<< 4|(str[1]-'0')]; } // handle remainder single character and return value While it should be easy, I'm having issues trying to get the proper results via unittests and I'm not sure why. Probably something incredibly simple on my part.
Notice this is no ordinary matrix, but an Anti-Simmetric QuasiGroup of order 10, and tmpdigit (called interim in the algorithm) is used in each round (although the function isn't recursive) together with each digit to calculate final check digit.
Feb 11 2017
parent reply Era Scarecrow <rtcvb32 yahoo.com> writes:
On Saturday, 11 February 2017 at 20:19:51 UTC, Nestor wrote:
 Notice this is no ordinary matrix, but an Anti-Simmetric 
 QuasiGroup of order 10, and tmpdigit (called interim in the 
 algorithm) is used in each round (although the function isn't 
 recursive) together with each digit to calculate final check 
 digit.
Yes i know, which is why i had 3 to calculate 2 inputs because the third is the temp/previous calculation. If however you were calculating a fixed number of digits a single table could be made and do a single lookup, assuming it wasn't too large to make it uncumbersome or impractical.
Feb 11 2017
parent reply Era Scarecrow <rtcvb32 yahoo.com> writes:
On Saturday, 11 February 2017 at 21:02:40 UTC, Era Scarecrow 
wrote:
  Yes i know, which is why i had 3 to calculate 2 inputs because 
 the third is the temp/previous calculation.
Alright I've found the bug and fixed it, and it passes with flying colors (brute force tests up to 6 digits); However it doesn't use the original function to build the table. So I'm satisfied it will handle any length now. But it seriously is a lot of overhead for such a simple function. int[] buildMatrix2() { string digits = "0123456789"; int[] l = new int[16*16*10]; l[] = -1; //printing the array it's obvious to see what is padding foreach(a; digits) foreach(b; digits) foreach(c; digits) { int t = (a-'0')*10, t2 = (QG10Matrix[(b - '0') + t]-'0') * 10, off = (a - '0') << 8 | (b - '0') << 4 | (c - '0'); l[off] = (QG10Matrix[(c - '0') + t2]-'0')<<8; } return l; } char checkDigit2(string str) { int tmpdigit = 0; for(;str.length >= 2;str = str[2 .. $]) tmpdigit = QG10Matrix2[tmpdigit|(str[0]-'0')<<4|(str[1]-'0')]; tmpdigit>>=8; if (str.length==1) return QG10Matrix[(str[0]-'0')+tmpdigit*10]; return (tmpdigit+'0') & 0xff; }
Feb 11 2017
next sibling parent reply Era Scarecrow <rtcvb32 yahoo.com> writes:
On Saturday, 11 February 2017 at 21:41:11 UTC, Era Scarecrow 
wrote:
  But it seriously is a lot of overhead for such a simple 
 function.
Just ran the unittests under the dmd profiler, says the algorithm is 11% faster now. So yeah slightly more optimized. Another level and we could probably get 25%, but the built matrix will blow up far larger than the 10k it is now. Num Tree Func Per Calls Time Time Call 12000000 1281989 1281989 0 char damm.checkDigit(immutable(char)[]) 12000000 1146308 1146308 0 char damm.checkDigit2(immutable(char)[])
Feb 11 2017
parent reply Era Scarecrow <rtcvb32 yahoo.com> writes:
On Saturday, 11 February 2017 at 21:56:54 UTC, Era Scarecrow 
wrote:
  Just ran the unittests under the dmd profiler, says the 
 algorithm is 11% faster now. So yeah slightly more optimized.
Ran some more tests. Without optimization but with with 4 levels (a 2.5Mb table), it gains to a whopping 27%! However with optimizations turned on it dwindled to a mere 15% boost And optimization + no bounds checking, 2 & 4 levels both give a 9% boost total. Testing purely on 8byte inputs (Brute forced all combinations) receives the same 9% boost with negligible difference. Safe to say going higher levels isn't going to give you sufficient improvement; Also exe file is 3Mb big (but compresses to 150k).
Feb 11 2017
parent reply Nestor <nestorperez2016 yopmail.com> writes:
On Sunday, 12 February 2017 at 05:54:34 UTC, Era Scarecrow wrote:
 On Saturday, 11 February 2017 at 21:56:54 UTC, Era Scarecrow 
 wrote:
  Just ran the unittests under the dmd profiler, says the 
 algorithm is 11% faster now. So yeah slightly more optimized.
Ran some more tests. Without optimization but with with 4 levels (a 2.5Mb table), it gains to a whopping 27%! However with optimizations turned on it dwindled to a mere 15% boost And optimization + no bounds checking, 2 & 4 levels both give a 9% boost total. Testing purely on 8byte inputs (Brute forced all combinations) receives the same 9% boost with negligible difference. Safe to say going higher levels isn't going to give you sufficient improvement; Also exe file is 3Mb big (but compresses to 150k).
Wow! Thanks for the interest and effort.
Feb 12 2017
parent Era Scarecrow <rtcvb32 yahoo.com> writes:
On Monday, 13 February 2017 at 00:56:37 UTC, Nestor wrote:
 On Sunday, 12 February 2017 at 05:54:34 UTC, Era Scarecrow 
 wrote:
 Ran some more tests.
Wow! Thanks for the interest and effort.
Certainly. But the bulk of the answer comes down that the 2 levels that I've already provided are the fastest you're probably going to get. Certainly we can test using shorts or bytes instead, but it's likely the results will only go down. To note my tests are strictly on my x86 system and it would be better to also test this on other systems like PPC, Linux, ARM, and other architectures to see how they perform, and possibly tweak them as appropriate. Still we did find out there is some optimization that can be done and successfully for the Damm algorithm, it just isn't going to be a lot. Hmmm... A thought does come to mind. Parallelizing the code; However that would require probably 11 instances to get a 2x speedup (calculating the second half with all 10 possibilities for the carry over, and also calculating the first half, then choosing which of the 10 based on the first half's output), which only really works if you have a ton of cores, and the input is REALLY REALLY large, like a meg or something. While the usage of the Damm code is more useful for adding a digit to the end of a code like UPC or Barcodes as error detection, and expecting larger than 32 for real applications is unlikely. But at this point I'm rambling.
Feb 12 2017
prev sibling parent reply Nestor <nestorperez2016 yopmail.com> writes:
On Saturday, 11 February 2017 at 21:41:11 UTC, Era Scarecrow 
wrote:
 On Saturday, 11 February 2017 at 21:02:40 UTC, Era Scarecrow 
 wrote:
  Yes i know, which is why i had 3 to calculate 2 inputs 
 because the third is the temp/previous calculation.
Alright I've found the bug and fixed it, and it passes with flying colors (brute force tests up to 6 digits); However it doesn't use the original function to build the table. So I'm satisfied it will handle any length now. But it seriously is a lot of overhead for such a simple function. int[] buildMatrix2() { string digits = "0123456789"; int[] l = new int[16*16*10]; l[] = -1; //printing the array it's obvious to see what is padding foreach(a; digits) foreach(b; digits) foreach(c; digits) { int t = (a-'0')*10, t2 = (QG10Matrix[(b - '0') + t]-'0') * 10, off = (a - '0') << 8 | (b - '0') << 4 | (c - '0'); l[off] = (QG10Matrix[(c - '0') + t2]-'0')<<8; } return l; } char checkDigit2(string str) { int tmpdigit = 0; for(;str.length >= 2;str = str[2 .. $]) tmpdigit = QG10Matrix2[tmpdigit|(str[0]-'0')<<4|(str[1]-'0')]; tmpdigit>>=8; if (str.length==1) return QG10Matrix[(str[0]-'0')+tmpdigit*10]; return (tmpdigit+'0') & 0xff; }
I fail to see where you are declaring QG10Matrix2, because apparently it's an array of chars, but buildMatrix2 returns an array of int (2560 elements??) with lots of -1 values.
Feb 11 2017
parent Era Scarecrow <rtcvb32 yahoo.com> writes:
On Sunday, 12 February 2017 at 00:43:55 UTC, Nestor wrote:
 I fail to see where you are declaring QG10Matrix2, because 
 apparently it's an array of chars, but buildMatrix2 returns an 
 array of int (2560 elements??) with lots of -1 values.
I declared it here: http://forum.dlang.org/post/fiamjuhiddbzwvaplxkr forum.dlang.org and it's not chars, it's ints. Part of it is to avoid the mem access penalty for non divisible by 4 addresses, and another is that the array returns not chars, but numbers from 0-2560, which is the answer *256 (<<8), which offers hopefully some slight speed gain on longer inputs. Also the size of the array is guesstimated, and the -1's just signify the padding (which can be any value, but -1 makes it obvious). Since it's a 10x10 array it's based on, but using 4bits per section, there's 6 elements lost, and 6 whole rows lost. It's a necessary loss to gain speed; Thankfully it's only using 10k (2560 members) and not 700k as was my original guess when i was calculating it wrong earlier. Doing it wrong earlier, the compiler kept crashing... :P running out of memory.
Feb 11 2017
prev sibling parent Nestor <nestorperez2016 yopmail.com> writes:
On Thursday, 9 February 2017 at 18:34:30 UTC, Era Scarecrow wrote:
 ...
 Actually since you're also multiplying by 10, you can 
 incorporate that in the table too...
I forgot to comment that what is multiplied by ten is not the value but the starting position in the array (a way to emulate a matrix), but anyway see my comments in previous post.
Feb 09 2017