www.digitalmars.com         C & C++   DMDScript  

digitalmars.D - Is this a bug or a VERY sneaky case?

reply rempas <rempas tutanota.com> writes:
First of all, I would like to ask if there is a better place to 
make this kind of posts. I'm 99% sure that I have found a bug but 
I still want to ask just to be sure that there isn't happening 
something that I don't know about. So yeah, if there is a place 
that is for this kind of things, please inform me.

Ok so I'm making a library and in one of the functions that 
converts an integer to a string (char*) and returns it, there is 
a weird bug. So the problem is when I try to negate the given 
number and add one to it and then take the result and assign it 
to an unsigned long (ulong) variable. The line goes as follows: 
`ulong fnum = -num + 1;`. So it first does then negation from 
`num` and then adds 1 to the result. To test that my function 
works, I'm using the macros from "limits.h" to check the smallest 
and biggest possible values for each type. Everything seems to 
work great except for "INT_MIN". I don't know why this one but it 
doesn't work as expected. What's weird (and why I think it's 99% 
a bug) is that If I change that one line of code and make it into 
two separate lines, it will work (even tho it's the same thing 
under the hood). What's the change? Well we go from:

`ulong fnum = -num + 1;` to `ulong fnum = -num; fnum++;`

which like I said, is the exact same thing! So what are your 
thoughts? Even if there are more things going one here that I 
don't know, it doesn't make sense to me that everything else 
(including "LONG_MIN" which is a bigger number) works and 
"INT_MIN" doesn't. Also keep in mind that I'm using LDC2 to 
compile because I'm using GCC inline assembly syntax for the 
library so I can't compile with DMD.

In case someone want's to see the full function, you can check 
bellow:

```
import core.memory;

import core.stdc.stdio;
import core.stdc.stdlib;
import core.stdc.limits;

alias u8  = ubyte;
alias i8  = byte;
alias u16 = ushort;
alias i16 = short;
alias u32 = uint;
alias i32 = int;
alias u64 = ulong;
alias i64 = long;

enum U8_MAX  = 255;
enum U16_MAX = 65535;
enum U32_MAX = 4294967295;
enum U64_MAX = 18446744073709551615;

enum I8_MIN  = -128;
enum I8_MAX  = 127;
enum I16_MIN = -32768;
enum I16_MAX = 32767;
enum I32_MIN = -2147483648;
enum I32_MAX = 2147483647;
enum I64_MIN = -9223372036854775808;
enum I64_MAX = 9223372036854775807;

enum is_same(alias value, T) = is(typeof(value) == T);

char* to_str(T)(T num, u8 base) {
   if (num == 0) return cast(char*)"0";

   bool min_num = false;

   // Digit count for each size
   // That's not the full code, only the one for
   // signed numbers which is what we want for now
   static if (is_same!(num, i8)) {
     enum buffer_size = 5;
   } else static if (is_same!(num, i16)) {
     enum buffer_size = 7;
   } else static if (is_same!(num, i32)) {
     enum buffer_size = 12;
   } else {
     enum buffer_size = 21;
   }

   // Overflow check
   static if (is_same!(num, i8)) {
     if (num == I8_MIN) {
       min_num = true;
       ++num;
     }
   } else static if (is_same!(num, i16)) {
     if (num == I16_MIN) {
       min_num = true;
       ++num;
     }
   } else static if (is_same!(num, i32)) {
     if (num == I32_MIN) {
       min_num = true;
       ++num;
     }
   } else {
     if (num == I64_MIN) {
       min_num = true;
       ++num;
     }
   }

   char* buf = cast(char*)pureMalloc(buffer_size);
   i32 i = buffer_size;
   u64 fnum;

   if (num < 0) {
     if (min_num) {
       fnum = -num + 1; // This line causes the error
       // It works if used as a separate instructions
       // fnum = -num;
       // fnum++;
     }
     else fnum = -num;
   }
   else fnum = num;

   for(; fnum && i; --i, fnum /= base)
     buf[i] = "0123456789abcdef"[fnum % base];

   if (num < 0) {
     buf[i] = '-';
     return buf + i;
   }

   return buf + (i+1);
}

extern (C) void main() {
   printf("The value is %d\n",         INT_MIN);
   printf("The value is %s\n",  to_str(INT_MIN, 10));
   exit(0);
}
```
Dec 25 2021
parent reply Temtaime <temtaime gmail.com> writes:
To get correct results use

fnum = u64(-num) + 1;

there's no bug.

Also get rid of your *_MAX, *_MIN.
Use u8.max, i8.min built-in properties etc.
Dec 25 2021
next sibling parent reply rempas <rempas tutanota.com> writes:
On Saturday, 25 December 2021 at 13:39:14 UTC, Temtaime wrote:
 To get correct results use

 fnum = u64(-num) + 1;

 there's no bug.
Weird. It truly works but why isn't this necessary for other types as well?
 Also get rid of your *_MAX, *_MIN.
 Use u8.max, i8.min built-in properties etc.
Oh, didn't knew about that! Thanks!
Dec 25 2021
parent Rumbu <rumbu rumbu.ro> writes:
On Saturday, 25 December 2021 at 14:55:29 UTC, rempas wrote:
 On Saturday, 25 December 2021 at 13:39:14 UTC, Temtaime wrote:
 To get correct results use

 fnum = u64(-num) + 1;

 there's no bug.
Weird. It truly works but why isn't this necessary for other types as well?
Because others are promoted to int according to integer promotion rules. https://dlang.org/spec/type.html#integer-promotions so ```-num + 1``` gets compiled as ```cast(int)(-num) + cast(int)1``` as long as num is byte, short, ubyte, ushort, bool, char or wchar.
 Also get rid of your *_MAX, *_MIN.
 Use u8.max, i8.min built-in properties etc.
Oh, didn't knew about that! Thanks!
Dec 25 2021
prev sibling parent reply Rumbu <rumbu rumbu.ro> writes:
On Saturday, 25 December 2021 at 13:39:14 UTC, Temtaime wrote:
 To get correct results use

 fnum = u64(-num) + 1;

 there's no bug.

 Also get rid of your *_MAX, *_MIN.
 Use u8.max, i8.min built-in properties etc.
And max length of a integral can be obtained using ```log10(n)+1```, you don't need a bunch of static ifs.
Dec 25 2021
parent reply rempas <rempas tutanota.com> writes:
On Saturday, 25 December 2021 at 17:29:37 UTC, Rumbu wrote:
 And max length of a integral can be obtained using 
 ```log10(n)+1```, you don't need a bunch of static ifs.
Now you found the guy to talk about maths.... Could you mind giving a demonstration?
Dec 25 2021
parent reply Rumbu <rumbu rumbu.ro> writes:
On Saturday, 25 December 2021 at 20:50:12 UTC, rempas wrote:
 On Saturday, 25 December 2021 at 17:29:37 UTC, Rumbu wrote:
 And max length of a integral can be obtained using 
 ```log10(n)+1```, you don't need a bunch of static ifs.
Now you found the guy to talk about maths.... Could you mind giving a demonstration?
It's common sense, log10 means "give me the power of 10 to obtain n". And we know that 10^x means 1 followed by x zeroes, hence the maximum width for the number. You add 1 because of the 1 before the zeroes. So if we take as an example ubyte.max, we have ```log10(255) = 2.4```. Truncated as int, you get ```2```. Add ```1``` and you obtain ```3```, the exact length of ```255```. Or you can take 1000. ```log10(1000) = 3```, add ```1```, you obtain ```4```, the exact length of ```1000```.
Dec 25 2021
parent reply rempas <rempas tutanota.com> writes:
On Saturday, 25 December 2021 at 21:43:39 UTC, Rumbu wrote:
 It's common sense, log10 means "give me the power of 10 to 
 obtain n". And we know that 10^x means 1 followed by x zeroes, 
 hence the maximum width for the number. You add 1 because of 
 the 1 before the zeroes.

 So if we take as an example ubyte.max, we have ```log10(255) = 
 2.4```. Truncated as int, you get ```2```. Add ```1``` and you 
 obtain ```3```, the exact length of ```255```.

 Or you can take 1000. ```log10(1000) = 3```, add ```1```, you 
 obtain ```4```, the exact length of ```1000```.
I got that now! You want to replace the static ifs for the enum "buffer_size" (which is a name I'm going to change). However, is the algorithm built-in? If I have to make it, this means that I'll have to spend time finding how to make it and I will also end up with more code I will eliminate. Unless of course there is still something I don't understand...
Dec 25 2021
parent reply WebFreak001 <d.forum webfreak.org> writes:
On Sunday, 26 December 2021 at 06:34:14 UTC, rempas wrote:
 On Saturday, 25 December 2021 at 21:43:39 UTC, Rumbu wrote:
 It's common sense, log10 means "give me the power of 10 to 
 obtain n". And we know that 10^x means 1 followed by x zeroes, 
 hence the maximum width for the number. You add 1 because of 
 the 1 before the zeroes.

 So if we take as an example ubyte.max, we have ```log10(255) = 
 2.4```. Truncated as int, you get ```2```. Add ```1``` and you 
 obtain ```3```, the exact length of ```255```.

 Or you can take 1000. ```log10(1000) = 3```, add ```1```, you 
 obtain ```4```, the exact length of ```1000```.
I got that now! You want to replace the static ifs for the enum "buffer_size" (which is a name I'm going to change). However, is the algorithm built-in? If I have to make it, this means that I'll have to spend time finding how to make it and I will also end up with more code I will eliminate. Unless of course there is still something I don't understand...
log10 (or more correctly for all bases: `log(num, base)` or `log(num)/log(base)`) is the mathematically correct answer for **positive** numbers, but I think practically the major disadvantage (performance) of it in code like this outweigh the advantage (memory saving) of being correct, with integers of values at most -2^63..2^64. I think your current code is good as it is. (for base 10 at least) I don't think you will really save any memory by leaving out the spare bytes, the malloc might add way more overhead. There is no need to overthink this really and the `static if` cases for the different data-types are a good enough tradeoff for saving a few bytes of memory for no extra work needed at runtime. You could better improve your code performance & memory usage by looking into better allocation strategies for the small memory blocks you allocate. But you should only really need this when your custom to_str function is called a massive amount of times. (for an int -> string function like this I could imagine it being worthwhile in certain scenarios though) For base 2 the biggest number would then be 64 characters, still very manageable. Some tips I think I would rather suggest to you based on that code: (for style and to avoid bugs, not changing performance or memory usage much) - use [contracts](https://dlang.org/spec/function#contracts) for stuff like the base to indicate, that only bases 2..16 are allowed: `char* to_str(T)(T num, u8 base) in(base >= 2 && base <= 16) { ...` (nice for documentation and catches accidental bugs in development, in release builds these checks are omitted - which is part of the reason why you should never catch AssertError, Error or Throwable!) - use D's datatypes, not your own ones (if you want others to look/work on your code too it's better to use the common names for stuff) - but ofc staying consistent across your code is more important - use `is(T == ubyte)` etc. instead of your custom `is_same!(val, ubyte)` (same reason as above, people need to read the definition of is_same first) - work with slices, not with pointers (if you plan to use your code from D, it's much cleaner and avoids bugs! does not need a trailing null terminator and works with safe code)
Dec 28 2021
parent reply rempas <rempas tutanota.com> writes:
On Tuesday, 28 December 2021 at 15:25:30 UTC, WebFreak001 wrote:

Thanks a lot for your huge answer! Yeah, I don't want my program 
to pay a big price in runtime so I agree with you that "static 
ifs" are fine. Now I will quote some other stuff but in general 
thanks a lot

 I think your current code is good as it is. (for base 10 at 
 least). For base 2 the biggest number would then be 64 
 characters, still very manageable.
Yeah you are right and here are how bugs happen and I don't notice them. I will fix it right away!
 - use [contracts](https://dlang.org/spec/function#contracts) 
 for stuff like the base to indicate, that only bases 2..16 are 
 allowed:
   `char* to_str(T)(T num, u8 base) in(base >= 2 && base <= 16) 
 { ...`
   (nice for documentation and catches accidental bugs in 
 development, in release builds these checks are omitted - which 
 is part of the reason why you should never catch AssertError, 
 Error or Throwable!)
Thanks! Yeah tbh I don't know a lot of D "very advanced" stuff yet and I only use the features that I need. This looks awesome but one problem with it is that it will not allow me to give a custom message in case of a failure and it will also not print the line from where the function was called but rather the line of the function itself. So I don't know...
 - use D's datatypes, not your own ones (if you want others to 
 look/work on your code too it's better to use the common names 
 for stuff) - but ofc staying consistent across your code is 
 more important
I suppose you mean about "str" right? In this case, I would love using D's "string" if it wasn't immutable by default. It pisses me A LOT when a language tries to "protect" me from myself. There are a lot of other stuff that I would like "string" to have but I wouldn't mind them so much if string was mutable by default (or even better if string literals were "char*" like C and could get automatically casted so I could use char[] without the need of an ".dup"). Another thing is that I'm making a library so people will probably read the definition of "str" out of interest anyway and learn it if they want to use the library as users. And even for people that want to only contribute to a specific place in the code and not use the library (which why would you do that anyway?), the way "str" is used in the code, is similar to how "string" is used (check how they both have a ".ptr" property to get the actual pointer for example) so I don't really think that there is a problem with that.
 - use `is(T == ubyte)` etc. instead of your custom 
 `is_same!(val, ubyte)` (same reason as above, people need to 
 read the definition of is_same first)
This is a simple definition actually so why is it such of a big deal? Also we shouldn't use "is(T == ubyte)" but instead "is(typeof(val) == ubyte)" just like I'm doing it. This is because in variadic functions, "T" will have different type for each argument so it will not work (I made this mistake and people told me so that's how I know). So why do we have to type this much when we can automate this with a simple definition? There is also one for checking if a type is a number (integer), a floating point, a string (including my "str") etc. I don't find these hard to learn and memorize so I don't find a reason to not make my (our) life easier and just use "macros". This is the main reason I use D and not [Vox](https://github.com/MrSmith33/vox) in the first place (and the fact that in Vox you cannot fully work with Variadic functions yet).
 - work with slices, not with pointers (if you plan to use your 
 code from D, it's much cleaner and avoids bugs! does not need a 
 trailing null terminator and works with  safe code)
Slices are objects tho and this means paying a runtime cost versus just using a variable. Also the only place I used pointers are with C-type string (char* or u8* in my custom "str") and one member (_count) in my custom "str" struct. And all of this cases were checked very carefully and they are very specific. People that will use the library should rarely need to use pointers and this is what I try to do with my library and why I don't use "libc". However! When it comes to the actual library itself, I want to go as low level as possible and have a library that is as performant as possible.
Dec 30 2021
parent reply WebFreak001 <d.forum webfreak.org> writes:
On Thursday, 30 December 2021 at 08:26:17 UTC, rempas wrote:
 [...]

 - use D's datatypes, not your own ones (if you want others to 
 look/work on your code too it's better to use the common names 
 for stuff) - but ofc staying consistent across your code is 
 more important
I suppose you mean about "str" right? In this case, I would love using D's "string" if it wasn't immutable by default. It pisses me A LOT when a language tries to "protect" me from myself. There are a lot of other stuff that I would like "string" to have but I wouldn't mind them so much if string was mutable by default (or even better if string literals were "char*" like C and could get automatically casted so I could use char[] without the need of an ".dup"). Another thing is that I'm making a library so people will probably read the definition of "str" out of interest anyway and learn it if they want to use the library as users. And even for people that want to only contribute to a specific place in the code and not use the library (which why would you do that anyway?), the way "str" is used in the code, is similar to how "string" is used (check how they both have a ".ptr" property to get the actual pointer for example) so I don't really think that there is a problem with that.
No actually I meant the u8, u16, etc. - if you stay consistent it's fine, but most of the D ecosystem uses just the D native types (ubyte, ushort, etc.) which have [guaranteed bit widths](https://dlang.org/spec/type.html#basic-data-types) as well. Once working with other code it's possible they could also have custom definitions and then there are 3 or more different aliases for something meaning the same thing.
 - use `is(T == ubyte)` etc. instead of your custom 
 `is_same!(val, ubyte)` (same reason as above, people need to 
 read the definition of is_same first)
This is a simple definition actually so why is it such of a big deal? Also we shouldn't use "is(T == ubyte)" but instead "is(typeof(val) == ubyte)" just like I'm doing it. This is because in variadic functions, "T" will have different type for each argument so it will not work (I made this mistake and people told me so that's how I know). So why do we have to type this much when we can automate this with a simple definition? There is also one for checking if a type is a number (integer), a floating point, a string (including my "str") etc. I don't find these hard to learn and memorize so I don't find a reason to not make my (our) life easier and just use "macros". This is the main reason I use D and not [Vox](https://github.com/MrSmith33/vox) in the first place (and the fact that in Vox you cannot fully work with Variadic functions yet).
If you stay consistent it's fine - once you work with other code which also has a template like this it starts to be possible that there is gonna be more than one definition to do the same simple thing. Also the name `is_same` is a little confusing because there is also a `__traits(isSame, a, b)` which returns true if both arguments are the same symbol. It does not do the `typeof(a) == b` you are doing. (which I would btw not think of when I read "is_same")
 - work with slices, not with pointers (if you plan to use your 
 code from D, it's much cleaner and avoids bugs! does not need 
 a trailing null terminator and works with  safe code)
Slices are objects tho and this means paying a runtime cost versus just using a variable. Also the only place I used pointers are with C-type string (char* or u8* in my custom "str") and one member (_count) in my custom "str" struct. And all of this cases were checked very carefully and they are very specific. People that will use the library should rarely need to use pointers and this is what I try to do with my library and why I don't use "libc". However! When it comes to the actual library itself, I want to go as low level as possible and have a library that is as performant as possible.
No slices are basically `struct T[] { T* ptr; size_t length; }` - it's returning pointer + length in a 16 byte struct (on 64 bit, possibly by returning via 2 registers) and does not introduce any indirections. I think it's the best way to handle more (or less) than one element pointers anywhere in D. Slices do bounds checking and with that add more safety to your program. You can disable it globally (unless in safe code) but I would recommend not doing so. It can be a performance issue for algorithms that are on a very hot code path, like in big loops. In these cases I would recommend using some kind of `assert(maxValue < slice.length);` before your loop and to disable single bounds checks inside the loop use `slice.ptr[i]` instead of `slice[i]`. Usually it is not necessary to do this unless you are working on some low-level algorithms like sorting, unicode processing, parsing, etc. Additionally having the length with your array can often give you performance improvements as you can just use a simple loop and don't need to check every item in your array to be the null terminator, which x86 processors can greatly speed up and parallelize!
Dec 30 2021
parent rempas <rempas tutanota.com> writes:
On Thursday, 30 December 2021 at 16:17:28 UTC, WebFreak001 wrote:
 No actually I meant the u8, u16, etc. - if you stay consistent 
 it's fine, but most of the D ecosystem uses just the D native 
 types (ubyte, ushort, etc.) which have [guaranteed bit 
 widths](https://dlang.org/spec/type.html#basic-data-types) as 
 well.
Oh, these type are actually aliases so they will interact nicely with the language. So it's like "alias u8 = ubyte", "alias i8 = byte", "alias u16 = ushort" etc. A lot of other programming languages have used these names for their types because: 1. They are shorter 2. Look nicer (once you get used to them) 3. And for beginners (and for everyone actually as you can instantly notice), it is easier to know the size of each type (u8 is unsigned 8bit (1byte) integer, i32 is a signed 32bit (4 byte) integer etc.)
 Once working with other code it's possible they could also have 
 custom definitions and then there are 3 or more different 
 aliases for something meaning the same thing.
My library will have no dependencies so it will not have to work with other code. These types will be the "official names" used for library development. People that will use the library don't have to use them of course (and that's the awesome thing).
 If you stay consistent it's fine - once you work with other 
 code which also has a template like this it starts to be 
 possible that there is gonna be more than one definition to do 
 the same simple thing.

 Also the name `is_same` is a little confusing because there is 
 also a `__traits(isSame, a, b)` which returns true if both 
 arguments are the same symbol. It does not do the `typeof(a) == 
 b` you are doing. (which I would btw not think of when I read 
 "is_same")
Again we will not work with other code HOWEVER, this definitions are intended (but of course not forced) to be used from library uses. So yeah, "is_same" is indeed not good. Do you suggest any other name (that is not to long)? I'm thinking about "same_type" and "is_type". The first one makes a lot of sense and the second one doesn't make so much sense but it is small and cool ;)
 No slices are basically `struct T[] { T* ptr; size_t length; }` 
 - it's returning pointer + length in a 16 byte struct (on 64 
 bit, possibly by returning via 2 registers) and does not 
 introduce any indirections. I think it's the best way to handle 
 more (or less) than one element pointers anywhere in D.
Yeah, this is exactly what I'm saying! You get back a struct (which I suppose is heavier vs a simple variable) and some times you don't need it (which is the case for the places I'm using pointers). Slices are amazing (even in the cases where I mentioned) where you need to take a specific part of a string so you do two things in one place. In this case then ok, slices just make things easier (with less chances of making bugs)!
 Slices do bounds checking and with that add more safety to your 
 program. You can disable it globally (unless in  safe code) but 
 I would recommend not doing so. It can be a performance issue 
 for algorithms that are on a very hot code path, like in big 
 loops. In these cases I would recommend using some kind of 
 `assert(maxValue < slice.length);` before your loop and to 
 disable single bounds checks inside the loop use `slice.ptr[i]` 
 instead of `slice[i]`.
Yeah, I know. Tbh, to me (and adding to what we already said), slices just seem of a more "beginner friendly" to actually do the same thing you would do with a pointer + a variable to holds its length. And the bounds-checking can actually be unnecessary since we will probably do that ourselves anyway because you will go out of bounds either because: 1. You don't know what you are doing. Which means that you don't pay attention to what you are writing (don't code drunk please) or you don't understand exactly what your code does (which may also be the case when you copy paste code online). In this case, there is a general problem that you should fix and having the bounds automatically checked for you, is not gonna fix the problem (probably) 2. A user input value (I'm talking about reading from the standard input) that was out of bounds. In that case you would probably want to tell the user that they gave a wrong input rather than stop the execution of the program. This is the same way, `to!byte(val)` will throw an exception if they conversion fails rather then give you a value which you can check against your original value to see if the conversion failed (and why). So yeah....
 Usually it is not necessary to do this unless you are working 
 on some low-level algorithms like sorting, unicode processing, 
 parsing, etc.
Yeah, makes sense.
 Additionally having the length with your array can often give 
 you performance improvements as you can just use a simple loop 
 and don't need to check every item in your array to be the null 
 terminator, which x86 processors can greatly speed up and 
 parallelize!
Yeah, I thought the same. Well, don't worry tho, I decided (even before making this talk) to replace every place in my code that returns "char*" to return my custom "str" so we will eliminate the pointers anyway and string will have a ".length" property (just like how D's string do). Anyway is that I haven't wrote a lot of code (in general) and mostly I'm thinking about C ways of doing things except for times where I feel limited and I think about something that I need. I will probably upload the code in some weeks (maybe days If I'm not lazy???) so it would be nice for you guys actually read it and give me some review. Thanks a lot for your time replying to me and I wish you a happy new year!!! Of course I'm not implying that we should stop talking, reply me if you need more ;)
Dec 30 2021