digitalmars.D - Regarding hex strings

bearophile (20/20) Oct 17 2012 (Repost)

H. S. Teoh (13/25) Oct 17 2012 [...]

foobar (7/36) Oct 18 2012 IMO, this is a redundant feature that complicates the language

monarch_dodra (9/15) Oct 18 2012 Have you actually ever written code that requires using code

foobar (13/30) Oct 18 2012 I didn't try to compile it :) I just rewrote berophile's example

bearophile (17/24) Oct 18 2012 But this code:

foobar (8/32) Oct 18 2012 How often large binary blobs are literally spelled in the source

foobar (9/16) Oct 18 2012 This is especially a good reason to remove this feature as it

monarch_dodra (3/20) Oct 18 2012 Yeah, that makes sense too. I'll try to toy around on my end and

monarch_dodra (31/33) Oct 18 2012 That was actually relatively easy!

monarch_dodra (62/64) Oct 18 2012 //----

monarch_dodra (38/39) Oct 18 2012 With correct-er utf string support. In theory, non-ascii

bearophile (7/8) Oct 18 2012 It must scale up to "real world" usages. Try it with a program

monarch_dodra (15/23) Oct 18 2012 Hum... The compilation is pretty fast actually, about 1 second,

Marco Leise (26/58) Oct 18 2012 Hehe, I assume most of the regulars know this: DMD used to

Jonathan M Davis (12/15) Oct 18 2012 Yes, but it didn't use it for long, because it made performance worse, a...

Marco Leise (13/24) Oct 18 2012 He called it a FUD? Without trying to sound too patronizing, most D

Jonathan M Davis (9/29) Oct 18 2012 I don't think that he used quite that term, but his point was that I sho...

monarch_dodra (8/30) Oct 20 2012 I should have read your post in more detail. I thought you were

Nick Sabalausky (11/19) Oct 18 2012 Frequency isn't the issue. The issues are "*Is* it ever needed?" and

foobar (11/34) Oct 19 2012 Any real-world use cases to support this claim? Does C++ have

Nick Sabalausky (23/68) Oct 20 2012 I've used it. And Denis just posted an example of where it was used to

Kagamin (4/11) Oct 18 2012 You should use unicode directly here, that's the whole point to

Jonathan M Davis (6/19) Oct 18 2012 It's a nice feature, but there are plenty of cases where it makes more s...

Kagamin (2/2) Oct 18 2012 Your keyboard doesn't have ready unicode values for all

Jonathan M Davis (13/15) Oct 18 2012 So? That doesn't make it so that it's not valuable to be able to input t...

Don Clugston (6/40) Oct 18 2012 That is not the same. Array literals are not the same as string

foobar (8/61) Oct 18 2012 I don't see how that detail is relevant to this discussion as I

Don Clugston (3/53) Oct 19 2012 That doesn't compile.

foobar (3/8) Oct 19 2012 Come on, "assuming the code points are valid". It says so 4 lines

Don Clugston (7/15) Oct 19 2012 It isn't the same.

foobar (27/45) Oct 19 2012 Yes, the \u requires code points and not code-units for a

foobar (8/56) Oct 19 2012 I just re-checked and to clarify string literals support _three_
Nick Sabalausky (7/21) Oct 20 2012 Using x"..." doesn't prevent anyone from doing that:

Denis Shelomovskij (11/17) Oct 20 2012 Maybe. Just an example of a real world code:

foobar (4/22) Oct 20 2012 I personally find the former more readable but I guess there

Nick Sabalausky (4/21) Oct 20 2012 Honestly, I can't imagine how anyone wouldn't find the latter vastly

H. S. Teoh (28/50) Oct 20 2012 If you want vastly human readable, you want heredoc hex syntax,

foobar (6/61) Oct 20 2012 Yeah, I like this. I'd prefer brackets over quotes but it not a

foobar (2/72) Oct 20 2012 ** not a big deal

Nick Sabalausky (16/68) Oct 20 2012 Can't you already just do this?:
Dejan Lekic (4/18) Oct 22 2012 Having a heredoc syntax for hex-strings that produce ubyte[]

H. S. Teoh (6/27) Oct 22 2012 What I meant was, a syntax similar to heredoc, not an actual heredoc,

Nick Sabalausky (6/31) Oct 18 2012 Big +1

bearophile (5/9) Oct 18 2012 I'd like an opinion on such topics from one of the the D bosses

monarch_dodra (11/25) Oct 18 2012 The conversion can't be done *implicitly*, but you can still get
Dejan Lekic (4/25) Oct 22 2012 +1 on this one
Simen Kjaeraas (5/9) Oct 22 2012 That syntax is already taken, though.

"bearophile" <bearophileHUGS lycos.com> writes:

(Repost)

hex strings are useful, but I think they were invented in D1 when 
strings were convertible to char[]. But today they are an array 
of immutable UTF-8, so I think this default type is not so useful:

void main() {
     string data1 = x"A1 B2 C3 D4"; // OK
     immutable(ubyte)[] data2 = x"A1 B2 C3 D4"; // error
}


test.d(3): Error: cannot implicitly convert expression 
("\xa1\xb2\xc3\xd4") of type string to ubyte[]


Generally I want to use hex strings to put binary data in a 
program, so usually it's a ubyte[] or uint[].

So I have to use something like:

auto data3 = cast(ubyte[])(x"A1 B2 C3 D4".dup);


So maybe the following literals are more useful in D2:

ubyte[] data4 = x[A1 B2 C3 D4];
uint[]  data5 = x[A1 B2 C3 D4];
ulong[] data6 = x[A1 B2 C3 D4 A1 B2 C3 D4];

Bye,
bearophile

Oct 17 2012
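
For comparison with the cast-and-dup workaround in the post above, here is a minimal sketch of viewing a string's bytes without copying. It assumes Phobos' std.string.representation; the helper name bytesOf is made up for illustration, and this is one possible approach, not what the post proposes.

```d
import std.string : representation;

// Hypothetical helper: view a string's UTF-8 code units as raw bytes.
// No copy is made, so the result stays immutable like the string itself.
immutable(ubyte)[] bytesOf(string s)
{
    return s.representation;
}

void main()
{
    // \x escapes insert raw bytes, much like x"A1 B2 C3 D4" in the post.
    string raw = "\xA1\xB2\xC3\xD4";
    immutable(ubyte)[] data = bytesOf(raw);

    immutable ubyte[] expected = [0xA1, 0xB2, 0xC3, 0xD4];
    assert(data == expected);

    // A mutable copy, if one is really needed, costs an explicit .dup.
    ubyte[] copy = data.dup;
    copy[0] = 0;
    assert(data[0] == 0xA1); // the original bytes are untouched
}
```

The trade-off is that the data stays immutable; code that needs a mutable ubyte[] still has to pay for a .dup.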

"H. S. Teoh" <hsteoh quickfur.ath.cx> writes:

On Thu, Oct 18, 2012 at 02:45:10AM +0200, bearophile wrote:
[...]
 hex strings are useful, but I think they were invented in D1 when
 strings were convertible to char[]. But today they are an array of
 immutable UTF-8, so I think this default type is not so useful:
 
 void main() {
     string data1 = x"A1 B2 C3 D4"; // OK
     immutable(ubyte)[] data2 = x"A1 B2 C3 D4"; // error
 }
 
 
 test.d(3): Error: cannot implicitly convert expression
 ("\xa1\xb2\xc3\xd4") of type string to ubyte[]

[...]

Yeah I think hex strings would be better as ubyte[] by default.

More generally, though, I think *both* of the above lines should be
equally accepted.  If you write x"A1 B2 C3" in the context of
initializing a string, then the compiler should infer the type of the
literal as string, and if the same literal occurs in the context of,
say, passing a ubyte[], then its type should be inferred as ubyte[], NOT
string.


T

-- 
Who told you to swim in Crocodile Lake without life insurance??

Oct 17 2012

"foobar" <foo bar.com> writes:

On Thursday, 18 October 2012 at 02:47:42 UTC, H. S. Teoh wrote:
 On Thu, Oct 18, 2012 at 02:45:10AM +0200, bearophile wrote:
 [...]
 hex strings are useful, but I think they were invented in D1 
 when
 strings were convertible to char[]. But today they are an 
 array of
 immutable UTF-8, so I think this default type is not so useful:
 
 void main() {
     string data1 = x"A1 B2 C3 D4"; // OK
     immutable(ubyte)[] data2 = x"A1 B2 C3 D4"; // error
 }
 
 
 test.d(3): Error: cannot implicitly convert expression
 ("\xa1\xb2\xc3\xd4") of type string to ubyte[]

 [...]

 Yeah I think hex strings would be better as ubyte[] by default.

 More generally, though, I think *both* of the above lines 
 should be
 equally accepted.  If you write x"A1 B2 C3" in the context of
 initializing a string, then the compiler should infer the type 
 of the
 literal as string, and if the same literal occurs in the 
 context of,
 say, passing a ubyte[], then its type should be inferred as 
 ubyte[], NOT
 string.


 T

IMO, this is a redundant feature that complicates the language 
for no benefit and should be deprecated.
strings already have an escape sequence for specifying 
code-points "\u" and for ubyte arrays you can simply use:
immutable(ubyte)[] data2 = [0xA1 0xB2 0xC3 0xD4];

So basically this feature gains us nothing.

Oct 18 2012

"monarch_dodra" <monarchdodra gmail.com> writes:

On Thursday, 18 October 2012 at 08:58:57 UTC, foobar wrote:
 IMO, this is a redundant feature that complicates the language 
 for no benefit and should be deprecated.
 strings already have an escape sequence for specifying 
 code-points "\u" and for ubyte arrays you can simply use:
 immutable(ubyte)[] data2 = [0xA1 0xB2 0xC3 0xD4];

 So basically this feature gains us nothing.

Have you actually ever written code that requires using code 
points? This feature is a *huge* convenience for when you do. 
Just compare:

string nihongo1 = x"e697a5 e69cac e8aa9e";
string nihongo2 = "\ue697a5\ue69cac\ue8aa9e";
ubyte[] nihongo3 = [0xe6, 0x97, 0xa5, 0xe6, 0x9c, 0xac, 0xe8, 
0xaa, 0x9e];

BTW, your data2 doesn't compile.

Oct 18 2012
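
The spellings compared above differ in what the hex digits mean: a \u escape takes a Unicode code point (exactly four hex digits), while x"..." and \x take UTF-8 code units, i.e. bytes. A small sketch for 日本語, whose code points are U+65E5, U+672C and U+8A9E; std.string.representation is used here only to look at the bytes.

```d
import std.string : representation;

void main()
{
    string direct  = "日本語";                               // characters typed directly
    string escaped = "\u65E5\u672C\u8A9E";                   // code points
    string units   = "\xE6\x97\xA5\xE6\x9C\xAC\xE8\xAA\x9E"; // UTF-8 code units

    // All three spellings produce the same nine bytes.
    assert(direct == escaped);
    assert(direct == units);
    assert(direct.length == 9);

    immutable ubyte[] expected =
        [0xE6, 0x97, 0xA5, 0xE6, 0x9C, 0xAC, 0xE8, 0xAA, 0x9E];
    assert(direct.representation == expected);
}
```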

"foobar" <foo bar.com> writes:

On Thursday, 18 October 2012 at 09:42:43 UTC, monarch_dodra wrote:
 On Thursday, 18 October 2012 at 08:58:57 UTC, foobar wrote:
 IMO, this is a redundant feature that complicates the language 
 for no benefit and should be deprecated.
 strings already have an escape sequence for specifying 
 code-points "\u" and for ubyte arrays you can simply use:
 immutable(ubyte)[] data2 = [0xA1 0xB2 0xC3 0xD4];

 So basically this feature gains us nothing.

 Have you actually ever written code that requires using code 
 points? This feature is a *huge* convenience for when you do. 
 Just compare:

 string nihongo1 = x"e697a5 e69cac e8aa9e";
 string nihongo2 = "\ue697a5\ue69cac\ue8aa9e";
 ubyte[] nihongo3 = [0xe6, 0x97, 0xa5, 0xe6, 0x9c, 0xac, 0xe8, 
 0xaa, 0x9e];

 BTW, your data2 doesn't compile.

I didn't try to compile it :) I just rewrote bearophile's example 
with 0x prefixes.

How often do you actually need to write code-point _literals_ in 
your code?
I'm not arguing that it isn't convenient. My question would 
rather be Andrei's "does it pull its own weight?", meaning: is 
that convenience worth the added complexity in the language and 
having more than one way of doing something?

Seems to me this is in the same ballpark as the built-in complex 
numbers. Sure it's nice to be able to write "4+5i" instead of 
"complex(4,5)" but how frequently do you actually ever need the 
_literals_ even in complex computational heavy code?

Oct 18 2012

"bearophile" <bearophileHUGS lycos.com> writes:

The docs say:
http://dlang.org/lex.html

"Hex strings allow string literals to be created using hex data. 
The hex data need not form valid UTF characters."

But this code:


void main() {
     immutable ubyte[4] data = x"F9 04 C1 E2";
}



Gives me:

temp.d(2): Error: Outside Unicode code space

Are the docs correct?

--------------------------

foobar:

 Seems to me this is in the same ballpark as the built-in 
 complex numbers. Sure it's nice to be able to write "4+5i" 
 instead of "complex(4,5)" but how frequently do you actually 
 ever need the _literals_ even in complex computational heavy 
 code?

Compared to "oct!5151151511", one problem with code like the 
following is that binary blobs are sometimes large, so supporting 
an x"" syntax is better:

immutable ubyte[4] data = hex!"F9 04 C1 E2";

Bye,
bearophile

Oct 18 2012

"foobar" <foo bar.com> writes:

On Thursday, 18 October 2012 at 10:05:06 UTC, bearophile wrote:
 The docs say:
 http://dlang.org/lex.html

 "Hex strings allow string literals to be created using hex data. 
 The hex data need not form valid UTF characters."

 But this code:


 void main() {
     immutable ubyte[4] data = x"F9 04 C1 E2";
 }



 Gives me:

 temp.d(2): Error: Outside Unicode code space

 Are the docs correct?

 --------------------------

 foobar:

 Seems to me this is in the same ballpark as the built-in 
 complex numbers. Sure it's nice to be able to write "4+5i" 
 instead of "complex(4,5)" but how frequently do you actually 
 ever need the _literals_ even in complex computational heavy 
 code?

 Compared to "oct!5151151511", one problem with code like this 
 is that binary blobs are sometimes large, so supporting a x"" 
 syntax is better:

 immutable ubyte[4] data = hex!"F9 04 C1 E2";

 Bye,
 bearophile

How often are large binary blobs literally spelled out in the 
source code (as opposed to just being read from a file)?
In any case, I'm not opposed to such a utility library, in fact I 
think it's a rather good idea and we already have a precedent 
with "oct!"
I just don't think this belongs as a built-in feature in the 
language.

Oct 18 2012

"foobar" <foo bar.com> writes:

On Thursday, 18 October 2012 at 10:11:14 UTC, foobar wrote:
 On Thursday, 18 October 2012 at 10:05:06 UTC, bearophile wrote:
 The docs say:
 http://dlang.org/lex.html

 "Hex strings allow string literals to be created using hex 
 data. The hex data need not form valid UTF characters."



This is especially a good reason to remove this feature as it 
breaks the principle of least surprise and I consider it a major 
bug, not a feature.

I expect D's strings, which are by definition Unicode, to _only_ 
ever allow _valid_ Unicode. It makes no sense whatsoever to 
allow this nasty back-door. Other text encodings should be either 
stored and treated as binary data (ubyte[]) or better yet stored 
in their own types that will ensure those encodings' invariants.

Oct 18 2012

"monarch_dodra" <monarchdodra gmail.com> writes:

On Thursday, 18 October 2012 at 10:17:06 UTC, foobar wrote:
 On Thursday, 18 October 2012 at 10:11:14 UTC, foobar wrote:
 On Thursday, 18 October 2012 at 10:05:06 UTC, bearophile wrote:
 The docs say:
 http://dlang.org/lex.html

 "Hex strings allow string literals to be created using hex 
 data. The hex data need not form valid UTF characters."



 This is especially a good reason to remove this feature as it 
 breaks the principle of least surprise and I consider it a 
 major bug, not a feature.

 I expect D's strings which are by definition Unicode to _only_ 
 ever allow _valid_ Unicode. It makes no sense what so ever to 
 allow this nasty back-door. Other text encoding should be 
 either stored and treated as binary data (ubyte[]) or better 
 yet stored in their own types that will ensure those encodings' 
 invariants.

Yeah, that makes sense too. I'll try to toy around on my end and 
see if I can write a "hex".

Oct 18 2012

"monarch_dodra" <monarchdodra gmail.com> writes:

On Thursday, 18 October 2012 at 10:39:46 UTC, monarch_dodra wrote:
 Yeah, that makes sense too. I'll try to toy around on my end 
 and see if I can write an "hex".

That was actually relatively easy!

Here are some use cases:

//----
void main()
{
     enum a = hex!"01 ff 7f";
     enum b = hex!0x01_ff_7f;
     ubyte[] c = hex!"0123456789abcdef";
     immutable(ubyte)[] bearophile1 = hex!"A1 B2 C3 D4";
     immutable(ubyte)[] bearophile2 = hex!0xA1_B2_C3_D4;

     a.writeln();
     b.writeln();
     c.writeln();
     bearophile1.writeln();
     bearophile2.writeln();
}
//----

And corresponding output:

//----
[1, 255, 127]
[1, 255, 127]
[1, 35, 69, 103, 137, 171, 205, 239]
[161, 178, 195, 212]
[161, 178, 195, 212]
//----

hex! was a very good idea actually, imo. I'll post my current 
impl in the next post.

That said, I don't know if I'd deprecate x"", as it serves a 
different role, as you have already pointed out, in that it 
*will* validate the code points.

Oct 18 2012
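
Later Phobos releases ship a library template in the same spirit, std.conv.hexString, which turns a hex literal into a string at compile time. The sketch below assumes a Phobos version that provides it, and uses std.string.representation to view the result as bytes; it is an illustration, not the hex! template from this post.

```d
import std.conv : hexString;
import std.string : representation;

void main()
{
    // hexString is evaluated at compile time and yields a string.
    enum s = hexString!"304A314B";
    static assert(s == "0J1K");

    // Viewing the result as bytes needs an explicit step.
    string raw = hexString!"A1B2C3D4";
    immutable(ubyte)[] data = raw.representation;

    immutable ubyte[] expected = [0xA1, 0xB2, 0xC3, 0xD4];
    assert(data == expected);
}
```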

"monarch_dodra" <monarchdodra gmail.com> writes:

On Thursday, 18 October 2012 at 11:24:04 UTC, monarch_dodra wrote:
 hex! was a very good idea actually, imo. I'll post my current 
 impl in the next post.

//----
import std.stdio;
import std.conv;
import std.ascii;


template hex(string s)
{
     enum hex = decode(s);
}


template hex(ulong ul)
{
     enum hex = decode(ul);
}

ubyte[] decode(string s)
{
     ubyte[] ret;
     size_t p;
     while(p < s.length)
     {
         while( s[p] == ' ' || s[p] == '_' )
         {
             ++p;
             if (p == s.length) assert(0, text("Premature end of string at index ", p, "."));
         }

         char c1 = s[p];
         if (!std.ascii.isHexDigit(c1)) assert(0, text("Unexpected character ", c1, " at index ", p, "."));
         c1 = cast(char)std.ascii.toUpper(c1);

         ++p;
         if (p == s.length) assert(0, text("Premature end of string after ", c1, "."));

         char c2 = s[p];
         if (!std.ascii.isHexDigit(c2)) assert(0, text("Unexpected character ", c2, " at index ", p, "."));
         c2 = cast(char)std.ascii.toUpper(c2);
         ++p;


         ubyte val;
         if('0' <= c2 && c2 <= '9') val += (c2 - '0');
         if('A' <= c2 && c2 <= 'F') val += (c2 - 'A' + 10);
         if('0' <= c1 && c1 <= '9') val += ((c1 - '0')*16);
         if('A' <= c1 && c1 <= 'F') val += ((c1 - 'A' + 10)*16);
         ret ~= val;
     }
     return ret;
}

ubyte[] decode(ulong ul)
{
     //NOTE: This is not efficient AT ALL (push front)
     //but it is ctfe, so we can live with it for now ^^
     //I'll optimize it if I try to push it
     ubyte[] ret;
     while(ul)
     {
         ubyte t = ul%256;
         ret = t ~ ret;
         ul /= 256;
     }
     return ret;
}
//----

NOT a final version.

Oct 18 2012
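
A more compact variant of the same idea, as a sketch only: the per-digit conversion is factored into a helper, and trailing separators are accepted instead of triggering an assert. The names nibble and decodeHex are made up for illustration; the static asserts show that the decoding really runs at compile time.

```d
// Hypothetical helper: value of a single hex digit.
ubyte nibble(char c) pure
{
    if (c >= '0' && c <= '9') return cast(ubyte)(c - '0');
    if (c >= 'a' && c <= 'f') return cast(ubyte)(c - 'a' + 10);
    if (c >= 'A' && c <= 'F') return cast(ubyte)(c - 'A' + 10);
    assert(0, "not a hex digit");
}

// Decodes pairs of hex digits; ' ' and '_' between pairs are skipped.
ubyte[] decodeHex(string s) pure
{
    ubyte[] ret;
    size_t i = 0;
    while (i < s.length)
    {
        if (s[i] == ' ' || s[i] == '_') { ++i; continue; }
        assert(i + 1 < s.length, "dangling hex digit at end of input");
        ret ~= cast(ubyte)(nibble(s[i]) * 16 + nibble(s[i + 1]));
        i += 2;
    }
    return ret;
}

template hex(string s)
{
    enum hex = decodeHex(s); // forces compile-time evaluation
}

enum ubyte[] expected1 = [1, 255, 127];
enum ubyte[] expected2 = [0xA1, 0xB2];
static assert(hex!"01 ff 7f" == expected1);
static assert(hex!"A1_B2 " == expected2);

void main()
{
    ubyte[] c = hex!"0123456789abcdef";
    assert(c.length == 8);
}
```

Like the version in the post, this appends one element at a time, so it is subject to the same CTFE memory behaviour on large inputs.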

"monarch_dodra" <monarchdodra gmail.com> writes:

On Thursday, 18 October 2012 at 11:26:13 UTC, monarch_dodra wrote:
 NOT a final version.

With more correct UTF string support. In theory, non-ASCII 
characters are illegal, but this makes for safer code and better 
diagnostics.

//----
import std.array; // front and popFront for strings

ubyte[] decode(string s)
{
     ubyte[] ret;
     while(s.length)
     {
         while( s.front == ' ' || s.front == '_' )
         {
             s.popFront();
             if (!s.length) assert(0, text("Premature end of string."));
         }

         dchar c1 = s.front;
         if (!std.ascii.isHexDigit(c1)) assert(0, text("Unexpected character ", c1, "."));
         c1 = std.ascii.toUpper(c1);

         s.popFront();
         if (!s.length) assert(0, text("Premature end of string after ", c1, "."));

         dchar c2 = s.front;
         if (!std.ascii.isHexDigit(c2)) assert(0, text("Unexpected character ", c2, " after ", c1, "."));
         c2 = std.ascii.toUpper(c2);
         s.popFront();

         ubyte val;
         if('0' <= c2 && c2 <= '9') val += (c2 - '0');
         if('A' <= c2 && c2 <= 'F') val += (c2 - 'A' + 10);
         if('0' <= c1 && c1 <= '9') val += ((c1 - '0')*16);
         if('A' <= c1 && c1 <= 'F') val += ((c1 - 'A' + 10)*16);
         ret ~= val;
     }
     return ret;
}
//----

Oct 18 2012

"bearophile" <bearophileHUGS lycos.com> writes:

monarch_dodra:

 hex! was a very good idea actually, imo.

It must scale up to "real world" usages. Try it with a program
composed of 3 modules each one containing a 100 KB long string.
Then try it with a program with two hundred medium-sized
literals, and let's see compilation times and binary sizes.

Bye,
bearophile

Oct 18 2012

"monarch_dodra" <monarchdodra gmail.com> writes:

On Thursday, 18 October 2012 at 13:15:55 UTC, bearophile wrote:
 monarch_dodra:

 hex! was a very good idea actually, imo.

 It must scale up to "real world" usages. Try it with a program
 composed of 3 modules each one containing a 100 KB long string.
 Then try it with a program with two hundred medium-sized
 literals, and let's see compilation times and binary sizes.

 Bye,
 bearophile

Hum... The compilation is pretty fast actually, about 1 second, 
provided it doesn't choke.

It works for strings up to a length of 400 lines * 80 chars per 
line, which results in approximately 16K of data. After that, I 
get a DMD out of memory error.

DMD memory usage spikes quite quickly. To compile those 400 lines 
(16K), I use 800MB of memory (!). If I reach about 1GB, then it 
crashes.

I tried using a refAppender instead of ret~, but that changed 
nothing.

Kind of weird it would use that much memory though...

Also, the memory doesn't get released. I can parse a 1x400 Line 
string, but if I try to parse 3 of them, DMD will choke on the 
second one. :(

Oct 18 2012

Marco Leise <Marco.Leise gmx.de> writes:

Am Thu, 18 Oct 2012 16:31:57 +0200
schrieb "monarch_dodra" <monarchdodra gmail.com>:

 On Thursday, 18 October 2012 at 13:15:55 UTC, bearophile wrote:
 monarch_dodra:

 hex! was a very good idea actually, imo.

 It must scale up to "real world" usages. Try it with a program
 composed of 3 modules each one containing a 100 KB long string.
 Then try it with a program with two hundred medium-sized
 literals, and let's see compilation times and binary sizes.

 Bye,
 bearophile

 
 Hum... The compilation is pretty fast actually, about 1 second, 
 provided it doesn't choke.
 
 It works for strings up to a length of 400 lines * 80 chars per 
 line, which results in approximately 16K of data. After that, I 
 get a DMD out of memory error.
 
 DMD memory usage spikes quite quickly. To compile those 400 lines 
 (16K), I use 800MB of memory (!). If I reach about 1GB, then it 
 crashes.
 
 I tried using a refAppender instead of ret~, but that changed 
 nothing.
 
 Kind of weird it would use that much memory though...
 
 Also, the memory doesn't get released. I can parse a 1x400 Line 
 string, but if I try to parse 3 of them, DMD will choke on the 
 second one. :(

Hehe, I assume most of the regulars know this: DMD used to
use a garbage collector that is disabled. Memory just isn't
freed! Also it has copy on write semantics during CTFE:

int bug6498(int x)
{
    int n = 0;
    while (n < x)
        ++n;
    return n;
}
static assert(bug6498(10_000_000)==10_000_000);

--> Fails with an 'out of memory' error.

http://d.puremagic.com/issues/show_bug.cgi?id=6498

So, as strange as it sounds, for now try not to write often or
into large blocks. Using this knowledge I was sometimes able
to bring down the memory consumption considerably by caching
recurring concatenations of two strings or to!string calls.

That said, appending single elements to an array may actually
be better than using a fixed-sized one and have DMD duplicate
it on every write. :p

Please remember to give Don a cookie when he manages to change
the compiler to modify in-place where appropriate.

-- 
Marco

Oct 18 2012

Jonathan M Davis <jmdavisProg gmx.com> writes:

On Friday, October 19, 2012 05:14:44 Marco Leise wrote:
 Hehe, I assume most of the regulars know this: DMD used to
 use a garbage collector that is disabled.

Yes, but it didn't use it for long, because it made performance worse, and 
Walter didn't have the time to spend fixing it, so it was disabled. Presumably, 
someone will take the time to improve it at some point and then it will be 
re-enabled.

 Memory just isn't freed!

That was my understanding, but the last time that I said that, Brad Roberts 
said that it wasn't true, and that we should stop spreading that FUD, so I 
don't know what the exact situation is, but it sounds like if that was true in 
the past, it's not true now. Regardless, it's clear that dmd still uses too 
much memory in many cases, especially when code uses a lot of templates or 
CTFE.

- Jonathan M Davis

Oct 18 2012

Marco Leise <Marco.Leise gmx.de> writes:

Am Thu, 18 Oct 2012 21:03:01 -0700
schrieb Jonathan M Davis <jmdavisProg gmx.com>:

 On Friday, October 19, 2012 05:14:44 Marco Leise wrote:
 Memory just isn't freed!

 
 That was my understanding, but the last time that I said that, Brad Roberts 
 said that it wasn't true, and that we should stop spreading that FUD, so I 
 don't know what the exact situation is, but it sounds like if that was true in 
 the past, it's not true now. Regardless, it's clear that dmd still uses too 
 much memory in many cases, especially when code uses a lot of templates or 
 CTFE.
 
 - Jonathan M Davis

He called it a FUD? Without trying to sound too patronizing, most D
programmers would really only notice DMD's memory footprint
when they use CTFE features. It is always Pegged, ctRegex, etc.
that make the issue come up, never basic code. And preloading
the Boehm collector showed that gigabytes of CTFE memory usage
can still be brought down to a few hundred MB [citation
needed]. I guess we can meet somewhere in the middle. Btw. did
I mix up Don and Brad in the last post ? Who is working on the
memory management ?

-- 
Marco

Oct 18 2012

Jonathan M Davis <jmdavisProg gmx.com> writes:

On Friday, October 19, 2012 07:29:46 Marco Leise wrote:
 Am Thu, 18 Oct 2012 21:03:01 -0700
 
 schrieb Jonathan M Davis <jmdavisProg gmx.com>:
 On Friday, October 19, 2012 05:14:44 Marco Leise wrote:
 Memory just isn't freed!

 
 That was my understanding, but the last time that I said that, Brad
 Roberts
 said that it wasn't true, and that we should stop spreading that FUD, so I
 don't know what the exact situation is, but it sounds like if that was
 true in the past, it's not true now. Regardless, it's clear that dmd
 still uses too much memory in many cases, especially when code uses a lot
 of templates or CTFE.
 
 - Jonathan M Davis

 
 He called it a FUD?

I don't think that he used quite that term, but his point was that I shouldn't 
be saying that, because it wasn't true, and so I was spreading incorrect 
information (that and the fact that he was tired of people spreading that 
incorrect information, IIRC). I can't find the exact post at the moment though.

 I guess we can meet somewhere in the middle. Btw. did
 I mix up Don and Brad in the last post ? Who is working on the
 memory management ?

I don't think that you mixed anyone up. Don works primarily on CTFE. Brad 
works primarily on the auto tester and other infrastructure required for the 
dmd/Phobos folks to do what they do.

- Jonathan M Davis

Oct 18 2012

"monarch_dodra" <monarchdodra gmail.com> writes:

On Friday, 19 October 2012 at 03:14:54 UTC, Marco Leise wrote:
 Hehe, I assume most of the regulars know this: DMD used to
 use a garbage collector that is disabled. Memory just isn't
 freed! Also it has copy on write semantics during CTFE:

 int bug6498(int x)
 {
     int n = 0;
     while (n < x)
         ++n;
     return n;
 }
 static assert(bug6498(10_000_000)==10_000_000);

 --> Fails with an 'out of memory' error.

 http://d.puremagic.com/issues/show_bug.cgi?id=6498

 So, as strange as it sounds, for now try not to write often or
 into large blocks. Using this knowledge I was sometimes able
 to bring down the memory consumption considerably by caching
 recurring concatenations of two strings or to!string calls.

 That said, appending single elements to an array may actually
 be better than using a fixed-sized one and have DMD duplicate
 it on every write. :p

 Please remember to give Don a cookie when he manages to change
 the compiler to modify in-place where appropriate.

I should have read your post in more detail. I thought you were 
saying that allocations are never freed, but it is indeed more 
than that: Every write allocates.

I just spent the last hour trying to "optimize" my code, only to 
realize that at its "simplest" (Walk the string counting 
elements), I run out of memory :/

Can't do much more about it at this point.

Oct 20 2012

Nick Sabalausky <SeeWebsiteToContactMe semitwist.com> writes:

On Thu, 18 Oct 2012 12:11:13 +0200
"foobar" <foo bar.com> wrote:
 
 How often are large binary blobs literally spelled out in the 
 source code (as opposed to just being read from a file)?


Frequency isn't the issue. The issues are "*Is* it ever needed?" and
"When it is needed, is it useful enough?" The answer to both is most
certainly "yes". (Remember, D is supposed to be usable as a systems
language, it's not merely a high-level-app-only language.)

Keep in mind, the question "Does it pull its own weight?" is for
adding new features, not for going around gutting the language
just because we can.

 In any case, I'm not opposed to such a utility library, in fact I 
 think it's a rather good idea and we already have a precedent 
 with "oct!"
 I just don't think this belongs as a built-in feature in the 
 language.

I think monarch_dodra's test proves that it definitely needs to be
built-in.

Oct 18 2012

"foobar" <foo bar.com> writes:

On Friday, 19 October 2012 at 00:14:18 UTC, Nick Sabalausky wrote:
 On Thu, 18 Oct 2012 12:11:13 +0200
 "foobar" <foo bar.com> wrote:
 
 How often are large binary blobs literally spelled out in the 
 source code (as opposed to just being read from a file)?


 Frequency isn't the issue. The issues are "*Is* it ever 
 needed?" and
 "When it is needed, is it useful enough?" The answer to both is 
 most
 certainly "yes". (Remember, D is supposed to be usable as a systems
 language, it's not merely a high-level-app-only language.)

Any real-world use cases to support this claim? Does C++ have 
such a feature?
My limited experience with kernels is that this feature is not 
needed. The solution we used for this was to define an extern 
symbol and load it with a linker script (the binary data was of 
course stored in separate files).

 Keep in mind, the question "Does it pull its own weight?" is 
 for
 adding new features, not for going around gutting the language
 just because we can.

Ok, I grant you that, but remember that the whole thread started 
because the feature _doesn't_ work, so let's rephrase: is it worth 
the effort to fix this feature?

 In any case, I'm not opposed to such a utility library, in 
 fact I think it's a rather good idea and we already have a 
 precedent with "oct!"
 I just don't think this belongs as a built-in feature in the 
 language.

 I think monarch_dodra's test proves that it definitely needs to 
 be
 built-in.

It proves that DMD has bugs that should be fixed, nothing more.

Oct 19 2012

Nick Sabalausky <SeeWebsiteToContactMe semitwist.com> writes:

On Fri, 19 Oct 2012 15:07:09 +0200
"foobar" <foo bar.com> wrote:

 On Friday, 19 October 2012 at 00:14:18 UTC, Nick Sabalausky wrote:
 On Thu, 18 Oct 2012 12:11:13 +0200
 "foobar" <foo bar.com> wrote:
 
 How often are large binary blobs literally spelled out in the 
 source code (as opposed to just being read from a file)?


 Frequency isn't the issue. The issues are "*Is* it ever 
 needed?" and
 "When it is needed, is it useful enough?" The answer to both is 
 most
 certainly "yes". (Remember, D is supposed to be usable as a systems
 language, it's not merely a high-level-app-only language.)

 
 Any real-world use cases to support this claim?

I've used it. And Denis just posted an example of where it was used to
make code far more readable.

 Does C++ have such a feature?

It does not. As one consequence off the top of my head, including binary
data into GBA homebrew became more of an awkward bloated mess than it
needed to be.

 My limited experience with kernels is that this feature is not 
 needed.

"I haven't needed it" isn't remotely sufficient to demonstrate that
something doesn't "pull its own weight".

 The solution we used for this was to define an extern 
 symbol and load it with a linker script (the binary data was of 
 course stored in separate files).
 

Yuck!

s/solution/workaround/

 Keep in mind, the question "Does it pull its own weight?" is 
 for
 adding new features, not for going around gutting the language
 just because we can.

 
 Ok, I grant you that but remember that the whole thread started 
 because the feature _doesn't_ work so let's rephrase - is it worth 
 the effort to fix this feature?
 

The only bug is that it tries to validate it as UTF contrary to the
spec. Making it *not* try to validate it sounds like a very minor
effort. I think you're blowing it out of proportion.

And yes, I think it's definitely worth it.

 In any case, I'm not opposed to such a utility library, in 
 fact I think it's a rather good idea and we already have a 
 precedent with "oct!"
 I just don't think this belongs as a built-in feature in the 
 language.

 I think monarch_dodra's test proves that it definitely needs to 
 be
 built-in.

 
 It proves that DMD has bugs that should be fixed, nothing more.

Right so let's jettison x"..." just because *someday* CTFE might become
good enough that we can bring the feature back. How does that make
any sense?

We already have it, it basically works (aside from only a fairly
trivial issue). *When* CTFE is good enough to replace it, *then* we can
have a sane debate about actually doing so. Until then, "Let's get
rid of x"..." because it can be done in the library" is a pointless
argument because at least for now it's NOT TRUE.

Oct 20 2012

"Kagamin" <spam here.lot> writes:

On Thursday, 18 October 2012 at 09:42:43 UTC, monarch_dodra wrote:
 Have you actually ever written code that requires using code 
 points? This feature is a *huge* convenience for when you do. 
 Just compare:

 string nihongo1 = x"e697a5 e69cac e8aa9e";
 string nihongo2 = "\ue697a5\ue69cac\ue8aa9e";
 ubyte[] nihongo3 = [0xe6, 0x97, 0xa5, 0xe6, 0x9c, 0xac, 0xe8, 
 0xaa, 0x9e];

You should use Unicode directly here; that's the whole point of 
supporting it.
string nihongo = "日本語";

Oct 18 2012

"Jonathan M Davis" <jmdavisProg gmx.com> writes:

On Thursday, October 18, 2012 15:56:50 Kagamin wrote:
 On Thursday, 18 October 2012 at 09:42:43 UTC, monarch_dodra wrote:
 Have you actually ever written code that requires using code
 points? This feature is a *huge* convenience for when you do.
 Just compare:
 
 string nihongo1 = x"e697a5 e69cac e8aa9e";
 string nihongo2 = "\ue697a5\ue69cac\ue8aa9e";
 ubyte[] nihongo3 = [0xe6, 0x97, 0xa5, 0xe6, 0x9c, 0xac, 0xe8,
 0xaa, 0x9e];

 
 You should use Unicode directly here; that's the whole point of
 supporting it.
 string nihongo = "日本語";

It's a nice feature, but there are plenty of cases where it makes more sense 
to use the unicode values rather than the characters themselves (e.g. your 
keyboard doesn't have the characters in question). It's valuable to be able to 
do it both ways.

- Jonathan M Davis

Oct 18 2012

"Kagamin" <spam here.lot> writes:

Your keyboard doesn't have ready unicode values for all 
characters either.

Oct 18 2012

"Jonathan M Davis" <jmdavisProg gmx.com> writes:

On Thursday, October 18, 2012 21:09:14 Kagamin wrote:
 Your keyboard doesn't have ready unicode values for all
 characters either.

So? That doesn't make it so that it's not valuable to be able to input the 
values in hexadecimal instead of as actual unicode characters. Heck, if you 
want a specific character, I wouldn't trust copying the characters anyway, 
because it's far too easy to have two characters which look really similar but 
are different (e.g. there are multiple types of angle brackets in unicode), 
whereas with the numbers you can be sure. And with some characters (e.g. 
unicode whitespace characters), it generally doesn't make sense to enter the 
characters directly.
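
A minimal sketch of the look-alike problem, assuming nothing beyond the core language. The two angle brackets chosen here (U+2329 and U+3008) are only one illustrative pair, and the constant names are made up for this example:

```d
// Two angle brackets that render almost identically but are distinct code points.
enum string leftAngleA = "\u2329"; // LEFT-POINTING ANGLE BRACKET
enum string leftAngleB = "\u3008"; // LEFT ANGLE BRACKET (CJK)

// A no-break space looks like an ordinary space in an editor.
enum string noBreakSpace = "\u00A0";

void main()
{
    assert(leftAngleA != leftAngleB); // visually similar, different strings
    assert(noBreakSpace != " ");      // not the ASCII space
    assert(noBreakSpace.length == 2); // two UTF-8 code units
}
```

Writing the values out makes the intended character unambiguous in review, which is hard to guarantee when the glyph itself is pasted in.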

Regardless, my point is that both approaches can be useful, so it's good to be 
able to do both. If you prefer to put the unicode characters in directly, then 
do that, but others may prefer the other way. Personally, I've done both.

- Jonathan M Davis

Oct 18 2012

Don Clugston <dac nospam.com> writes:

On 18/10/12 10:58, foobar wrote:
 On Thursday, 18 October 2012 at 02:47:42 UTC, H. S. Teoh wrote:
 On Thu, Oct 18, 2012 at 02:45:10AM +0200, bearophile wrote:
 [...]
 hex strings are useful, but I think they were invented in D1 when
 strings were convertible to char[]. But today they are an array of
 immutable UTF-8, so I think this default type is not so useful:

 void main() {
     string data1 = x"A1 B2 C3 D4"; // OK
     immutable(ubyte)[] data2 = x"A1 B2 C3 D4"; // error
 }


 test.d(3): Error: cannot implicitly convert expression
 ("\xa1\xb2\xc3\xd4") of type string to ubyte[]

 [...]

 Yeah I think hex strings would be better as ubyte[] by default.

 More generally, though, I think *both* of the above lines should be
 equally accepted.  If you write x"A1 B2 C3" in the context of
 initializing a string, then the compiler should infer the type of the
 literal as string, and if the same literal occurs in the context of,
 say, passing a ubyte[], then its type should be inferred as ubyte[], NOT
 string.


 T

 IMO, this is a redundant feature that complicates the language for no
 benefit and should be deprecated.
 strings already have an escape sequence for specifying code-points "\u"
 and for ubyte arrays you can simply use:
 immutable(ubyte)[] data2 = [0xA1, 0xB2, 0xC3, 0xD4];

 So basically this feature gains us nothing.

That is not the same. Array literals are not the same as string 
literals, they have an implicit .dup.
See my recent thread on this issue (which unfortunately seems to have 
died without a resolution, people got hung up about trailing null 
characters without apparently noticing the more important issue of the dup).
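
A small sketch of the difference, under the assumption that the compiler places string literals in static data (the function names are hypothetical):

```d
// Array literal: a new mutable array is allocated each time this runs.
ubyte[] freshArray()
{
    return [0xA1, 0xB2, 0xC3];
}

// String literal: the slice refers to static data, so no allocation is needed.
string sameString()
{
    return "\xA1\xB2\xC3";
}

void main()
{
    auto a = freshArray();
    auto b = freshArray();
    assert(a == b);          // same contents
    assert(a.ptr !is b.ptr); // but two distinct allocations

    auto s = sameString();
    auto t = sameString();
    assert(s.ptr is t.ptr);  // both slices refer to the same data
}
```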

Oct 18 2012

"foobar" <foo bar.com> writes:

On Thursday, 18 October 2012 at 14:29:57 UTC, Don Clugston wrote:
 On 18/10/12 10:58, foobar wrote:
 On Thursday, 18 October 2012 at 02:47:42 UTC, H. S. Teoh wrote:
 On Thu, Oct 18, 2012 at 02:45:10AM +0200, bearophile wrote:
 [...]
 hex strings are useful, but I think they were invented in D1 
 when
 strings were convertible to char[]. But today they are an 
 array of
 immutable UTF-8, so I think this default type is not so 
 useful:

 void main() {
    string data1 = x"A1 B2 C3 D4"; // OK
    immutable(ubyte)[] data2 = x"A1 B2 C3 D4"; // error
 }


 test.d(3): Error: cannot implicitly convert expression
 ("\xa1\xb2\xc3\xd4") of type string to ubyte[]

 [...]

 Yeah I think hex strings would be better as ubyte[] by 
 default.

 More generally, though, I think *both* of the above lines 
 should be
 equally accepted.  If you write x"A1 B2 C3" in the context of
 initializing a string, then the compiler should infer the 
 type of the
 literal as string, and if the same literal occurs in the 
 context of,
 say, passing a ubyte[], then its type should be inferred as 
 ubyte[], NOT
 string.


 T

 IMO, this is a redundant feature that complicates the language 
 for no
 benefit and should be deprecated.
 strings already have an escape sequence for specifying 
 code-points "\u"
 and for ubyte arrays you can simply use:
 immutable(ubyte)[] data2 = [0xA1, 0xB2, 0xC3, 0xD4];

 So basically this feature gains us nothing.

 That is not the same. Array literals are not the same as string 
 literals, they have an implicit .dup.
 See my recent thread on this issue (which unfortunately seems 
 to have died without a resolution, people got hung up about 
 trailing null characters without apparently noticing the more 
 important issue of the dup).

I don't see how that detail is relevant to this discussion as I 
was not arguing against string literals or array literals in 
general.

We can still have both (assuming the code points are valid...):
string foo = "\ua1\ub2\uc3"; // no .dup
and:
ubyte[3] goo = [0xa1, 0xb2, 0xc3]; // implicit .dup

Oct 18 2012

Don Clugston <dac nospam.com> writes:

On 18/10/12 17:43, foobar wrote:
 On Thursday, 18 October 2012 at 14:29:57 UTC, Don Clugston wrote:
 On 18/10/12 10:58, foobar wrote:
 On Thursday, 18 October 2012 at 02:47:42 UTC, H. S. Teoh wrote:
 On Thu, Oct 18, 2012 at 02:45:10AM +0200, bearophile wrote:
 [...]
 hex strings are useful, but I think they were invented in D1 when
 strings were convertible to char[]. But today they are an array of
 immutable UTF-8, so I think this default type is not so useful:

 void main() {
    string data1 = x"A1 B2 C3 D4"; // OK
    immutable(ubyte)[] data2 = x"A1 B2 C3 D4"; // error
 }


 test.d(3): Error: cannot implicitly convert expression
 ("\xa1\xb2\xc3\xd4") of type string to ubyte[]

 [...]

 Yeah I think hex strings would be better as ubyte[] by default.

 More generally, though, I think *both* of the above lines should be
 equally accepted.  If you write x"A1 B2 C3" in the context of
 initializing a string, then the compiler should infer the type of the
 literal as string, and if the same literal occurs in the context of,
 say, passing a ubyte[], then its type should be inferred as ubyte[],
 NOT
 string.


 T

 IMO, this is a redundant feature that complicates the language for no
 benefit and should be deprecated.
 strings already have an escape sequence for specifying code-points "\u"
 and for ubyte arrays you can simply use:
 immutable(ubyte)[] data2 = [0xA1, 0xB2, 0xC3, 0xD4];

 So basically this feature gains us nothing.

 That is not the same. Array literals are not the same as string
 literals, they have an implicit .dup.
 See my recent thread on this issue (which unfortunately seems to have
 died without a resolution, people got hung up about trailing null
 characters without apparently noticing the more important issue of the
 dup).

 I don't see how that detail is relevant to this discussion as I was not
 arguing against string literals or array literals in general.

 We can still have both (assuming the code points are valid...):
 string foo = "\ua1\ub2\uc3"; // no .dup

That doesn't compile.
Error: escape hex sequence has 2 hex digits instead of 4

Oct 19 2012

"foobar" <foo bar.com> writes:

On Friday, 19 October 2012 at 13:19:09 UTC, Don Clugston wrote:
 We can still have both (assuming the code points are valid...):
 string foo = "\ua1\ub2\uc3"; // no .dup

 That doesn't compile.
 Error: escape hex sequence has 2 hex digits instead of 4

Come on, "assuming the code points are valid". It says so 4 lines 
above!

Oct 19 2012

Don Clugston <dac nospam.com> writes:

On 19/10/12 16:07, foobar wrote:
 On Friday, 19 October 2012 at 13:19:09 UTC, Don Clugston wrote:
 We can still have both (assuming the code points are valid...):
 string foo = "\ua1\ub2\uc3"; // no .dup

 That doesn't compile.
 Error: escape hex sequence has 2 hex digits instead of 4

 Come on, "assuming the code points are valid". It says so 4 lines above!

It isn't the same.
Hex strings are the raw bytes, e.g. UTF-8 code units (i.e., they include the 
high bits that indicate the length of each char).
\u makes dchars.

"\u00A1" is not the same as x"A1" nor is it x"00 A1". It's two non-zero 
bytes.
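
To make the byte counts concrete, here is a minimal sketch (the helper name `bytesOf` is hypothetical). U+00A1 encodes in UTF-8 as the two bytes C2 A1, whereas the `\xA1` escape stores the single byte A1:

```d
// Reinterprets the UTF-8 code units of a string as raw bytes.
immutable(ubyte)[] bytesOf(string s)
{
    return cast(immutable(ubyte)[]) s;
}

void main()
{
    immutable(ubyte)[] twoBytes = [0xC2, 0xA1];
    immutable(ubyte)[] oneByte = [0xA1];

    assert(bytesOf("\u00A1") == twoBytes); // code point, UTF-8 encoded
    assert(bytesOf("\xA1") == oneByte);    // one raw byte, not valid UTF-8 alone
}
```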

Oct 19 2012

"foobar" <foo bar.com> writes:

On Friday, 19 October 2012 at 15:07:44 UTC, Don Clugston wrote:
 On 19/10/12 16:07, foobar wrote:
 On Friday, 19 October 2012 at 13:19:09 UTC, Don Clugston wrote:
 We can still have both (assuming the code points are 
 valid...):
 string foo = "\ua1\ub2\uc3"; // no .dup

 That doesn't compile.
 Error: escape hex sequence has 2 hex digits instead of 4

 Come on, "assuming the code points are valid". It says so 4 
 lines above!

 It isn't the same.
 Hex strings are the raw bytes, e.g. UTF-8 code units (i.e., they 
 include the high bits that indicate the length of each char).
 \u makes dchars.

 "\u00A1" is not the same as x"A1" nor is it x"00 A1". It's two 
 non-zero bytes.

Yes, the \u requires code points and not code-units for a 
specific UTF encoding, which you are correct in pointing out are 
four hex digits and not two.
This is a very reasonable choice to prevent/reduce Unicode 
encoding errors.

http://dlang.org/lex.html#HexString states:
"Hex strings allow string literals to be created using hex data. 
The hex data need not form valid UTF characters."

I _already_ said that I consider this a major semantic bug as it 
violates the principle of least surprise - the programmer's 
expectation that the D string types, which are Unicode according 
to the spec, will, well, actually contain _valid_ Unicode and _not_ 
arbitrary binary data.
Given the above, the design of \u makes perfect sense for 
_strings_ - you can use _valid_ code-points (not code units) in 
hex form.

General purpose binary data (i.e. _not_ UTF encoded Unicode 
text), as I also _already_ said, should IMO be stored either as 
ubyte[] or, better yet, as its own type that would ensure the 
correct invariants for the data type, be it audio, video, or just 
a different text encoding.

In neither case is the hex-string relevant IMO. In the former it 
potentially violates the type's invariant and in the latter we 
already have array literals.

Using a malformed _string_ to initialize ubyte[] IMO is simply 
less readable. What did that article call such features, "WAT"?

Oct 19 2012

"foobar" <foo bar.com> writes:

On Friday, 19 October 2012 at 18:46:07 UTC, foobar wrote:
 On Friday, 19 October 2012 at 15:07:44 UTC, Don Clugston wrote:
 On 19/10/12 16:07, foobar wrote:
 On Friday, 19 October 2012 at 13:19:09 UTC, Don Clugston 
 wrote:
 We can still have both (assuming the code points are 
 valid...):
 string foo = "\ua1\ub2\uc3"; // no .dup

 That doesn't compile.
 Error: escape hex sequence has 2 hex digits instead of 4

 Come on, "assuming the code points are valid". It says so 4 
 lines above!

 It isn't the same.
 Hex strings are the raw bytes, e.g. UTF-8 code units (i.e., they 
 include the high bits that indicate the length of each char).
 \u makes dchars.

 "\u00A1" is not the same as x"A1" nor is it x"00 A1". It's two 
 non-zero bytes.

 Yes, the \u requires code points and not code-units for a 
 specific UTF encoding, which you are correct in pointing out 
 are four hex digits and not two.
 This is a very reasonable choice to prevent/reduce Unicode 
 encoding errors.

 http://dlang.org/lex.html#HexString states:
 "Hex strings allow string literals to be created using hex 
 data. The hex data need not form valid UTF characters."

 I _already_ said that I consider this a major semantic bug as 
 it violates the principle of least surprise - the programmer's 
 expectation that the D string types which are Unicode according 
 to the spec to, well, actually contain _valid_ Unicode and 
 _not_ arbitrary binary data.
 Given the above, the design of \u makes perfect sense for 
 _strings_ - you can use _valid_ code-points (not code units) in 
 hex form.

 For general purpose binary data (i.e. _not_ UTF encoded Unicode 
 text) I also _already_ said IMO should be either stored as 
 ubyte[] or better yet their own types that would ensure the 
 correct invariants for the data type, be it audio, video, or 
 just a different text encoding.

 In neither case the hex-string is relevant IMO. In the former 
 it potentially violates the type's invariant and in the latter 
 we already have array literals.

 Using a malformed _string_ to initialize ubyte[] IMO is simply 
 less readable. How did that article call such features, "WAT"?

I just re-checked and, to clarify, string literals support _three_ 
escape sequences:
\x__ - a single byte, given as two hex digits
\u____ - a code point, given as four hex digits
\U________ - a code point, given as eight hex digits

So raw bytes _can_ be directly specified and I hope the compiler 
still verifies the string literal is valid Unicode.
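
A quick sketch of how many code units each escape produces. The number of hex digits in the escape does not determine the stored size, because `\u` and `\U` name a code point that is then encoded for the literal's string type:

```d
void main()
{
    assert("\x41".length == 1);        // one raw byte
    assert("\u0041".length == 1);      // U+0041 'A': one UTF-8 code unit
    assert("\u00E9".length == 2);      // U+00E9: two UTF-8 code units
    assert("\u65E5".length == 3);      // U+65E5: three UTF-8 code units
    assert("\U0001F600".length == 4);  // outside the BMP: four UTF-8 code units

    // The same code point in the wider string types.
    assert("\U0001F600"w.length == 2); // UTF-16 surrogate pair
    assert("\U0001F600"d.length == 1); // one UTF-32 code unit
}
```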

Oct 19 2012

Nick Sabalausky <SeeWebsiteToContactMe semitwist.com> writes:

On Fri, 19 Oct 2012 20:46:06 +0200
 
 For general purpose binary data (i.e. _not_ UTF encoded Unicode 
 text) I also _already_ said IMO should be either stored as 
 ubyte[]

Problem is, x"..." is FAR better syntax for that.

 or better yet their own types that would ensure the 
 correct invariants for the data type, be it audio, video, or just 
 a different text encoding.

Using x"..." doesn't prevent anyone from doing that:

auto a = SomeAudioType(x"...");

 
 In neither case the hex-string is relevant IMO. In the former it 
 potentially violates the type's invariant and in the latter we 
 already have array literals.
 
 Using a malformed _string_ to initialize ubyte[] IMO is simply 
 less readable. How did that article call such features, "WAT"?

The only thing ridiculous about x"..." is that somewhere along the
lines it was decided that it must be a string instead of the arbitrary
binary data that it *is*.

Oct 20 2012

Denis Shelomovskij <verylonglogin.reg gmail.com> writes:

18.10.2012 12:58, foobar wrote:
 IMO, this is a redundant feature that complicates the language for no
 benefit and should be deprecated.
 strings already have an escape sequence for specifying code-points "\u"
 and for ubyte arrays you can simply use:
 immutable(ubyte)[] data2 = [0xA1, 0xB2, 0xC3, 0xD4];

 So basically this feature gains us nothing.

Maybe. Just an example of a real world code:

Arrays:
https://github.com/D-Programming-Language/druntime/blob/fc45de1d089a1025df60ee2eea66ba27ee0bd99c/src/core/sys/windows/dll.d#L110

vs

Hex strings:
https://github.com/denis-sh/hooking/blob/69105a24d77fcb6eca701282a16dd5ec7311c077/tlsfixer/ntdll.d#L130

By the way, current code isn't affected by the topic issue.

-- 
Денис В. Шеломовский
Denis V. Shelomovskij

Oct 20 2012

"foobar" <foo bar.com> writes:

On Saturday, 20 October 2012 at 10:51:25 UTC, Denis Shelomovskij 
wrote:
 18.10.2012 12:58, foobar wrote:
 IMO, this is a redundant feature that complicates the language 
 for no
 benefit and should be deprecated.
 strings already have an escape sequence for specifying 
 code-points "\u"
 and for ubyte arrays you can simply use:
 immutable(ubyte)[] data2 = [0xA1, 0xB2, 0xC3, 0xD4];

 So basically this feature gains us nothing.

 Maybe. Just an example of a real world code:

 Arrays:
 https://github.com/D-Programming-Language/druntime/blob/fc45de1d089a1025df60ee2eea66ba27ee0bd99c/src/core/sys/windows/dll.d#L110

 vs

 Hex strings:
 https://github.com/denis-sh/hooking/blob/69105a24d77fcb6eca701282a16dd5ec7311c077/tlsfixer/ntdll.d#L130

 By the way, current code isn't affected by the topic issue.

I personally find the former more readable but I guess there 
would always be someone to disagree. As they say, YMMV.

Oct 20 2012

Nick Sabalausky <SeeWebsiteToContactMe semitwist.com> writes:

On Sat, 20 Oct 2012 14:59:27 +0200
"foobar" <foo bar.com> wrote:
 On Saturday, 20 October 2012 at 10:51:25 UTC, Denis Shelomovskij 
 wrote:
 Maybe. Just an example of a real world code:

 Arrays:
 https://github.com/D-Programming-Language/druntime/blob/fc45de1d089a1025df60ee2eea66ba27ee0bd99c/src/core/sys/windows/dll.d#L110

 vs

 Hex strings:
 https://github.com/denis-sh/hooking/blob/69105a24d77fcb6eca701282a16dd5ec7311c077/tlsfixer/ntdll.d#L130

 By the way, current code isn't affected by the topic issue.

 
 I personally find the former more readable but I guess there 
 would always be someone to disagree. As they say, YMMV.

Honestly, I can't imagine how anyone wouldn't find the latter vastly
more readable.

Oct 20 2012

"H. S. Teoh" <hsteoh quickfur.ath.cx> writes:

On Sat, Oct 20, 2012 at 04:39:28PM -0400, Nick Sabalausky wrote:
 On Sat, 20 Oct 2012 14:59:27 +0200
 "foobar" <foo bar.com> wrote:
 On Saturday, 20 October 2012 at 10:51:25 UTC, Denis Shelomovskij
 wrote:
 Maybe. Just an example of a real world code:

 Arrays:
 https://github.com/D-Programming-Language/druntime/blob/fc45de1d089a1025df60ee2eea66ba27ee0bd99c/src/core/sys/windows/dll.d#L110

 vs

 Hex strings:
 https://github.com/denis-sh/hooking/blob/69105a24d77fcb6eca701282a16dd5ec7311c077/tlsfixer/ntdll.d#L130

 By the way, current code isn't affected by the topic issue.

 
 I personally find the former more readable but I guess there 
 would always be someone to disagree. As they say, YMMV.

 
 Honestly, I can't imagine how anyone wouldn't find the latter vastly
 more readable.

If you want vastly human readable, you want heredoc hex syntax,
something like this:

	ubyte[] = x"<<END
	32 2b 32 3d 34 2e 20 32 2a 32 3d 34 2e 20 32 5e
	32 3d 34 2e 20 54 68 65 72 65 66 6f 72 65 2c 20
	2b 2c 20 2a 2c 20 61 6e 64 20 5e 20 61 72 65 20
	74 68 65 20 73 61 6d 65 20 6f 70 65 72 61 74 69
	6f 6e 2e 0a 22 36 34 30 4b 20 6f 75 67 68 74 20
	74 6f 20 62 65 20 65 6e 6f 75 67 68 22 20 2d 2d
	20 42 69 6c 6c 20 47 2e 2c 20 31 39 38 34 2e 20
	22 54 68 65 20 49 6e 74 65 72 6e 65 74 20 69 73
	20 6e 6f 74 20 61 20 70 72 69 6d 61 72 79 20 67
	6f 61 6c 20 66 6f 72 20 50 43 20 75 73 61 67 65
	END";

(I just made that syntax up, so the details are not final, but you get
the idea.) I would propose supporting this in D, but then D already has
way too many different ways of writing strings, some of questionable
utility, so I will refrain.

Of course, the above syntax might actually be implementable with a
suitable mixin template that takes a compile-time string. Maybe we
should lobby for such a template to go into Phobos -- that might
motivate people to fix CTFE in dmd so that it doesn't consume
unreasonable amounts of memory when the size of CTFE input gets
moderately large (see other recent thread on this topic).
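
A rough sketch of what such a template could look like. Both `parseHex` and `hexBytes` are hypothetical names, not existing Phobos symbols, and the sketch says nothing about how well the compiler copes with very large inputs:

```d
// Hypothetical helper (not in Phobos): turns hex digits into bytes.
// Whitespace between digits is ignored.
immutable(ubyte)[] parseHex(string text) pure
{
    immutable(ubyte)[] result;
    int high = -1; // pending high nibble, or -1 when none is pending
    foreach (char c; text)
    {
        int nibble;
        if (c >= '0' && c <= '9')
            nibble = c - '0';
        else if (c >= 'a' && c <= 'f')
            nibble = c - 'a' + 10;
        else if (c >= 'A' && c <= 'F')
            nibble = c - 'A' + 10;
        else if (c == ' ' || c == '\t' || c == '\r' || c == '\n')
            continue;
        else
            assert(0, "invalid character in hex data");

        if (high < 0)
        {
            high = nibble;
        }
        else
        {
            result ~= cast(ubyte)(high * 16 + nibble);
            high = -1;
        }
    }
    assert(high < 0, "odd number of hex digits");
    return result;
}

// Hypothetical eponymous template: the static initializer forces CTFE.
template hexBytes(string text)
{
    static immutable ubyte[] hexBytes = parseHex(text);
}

void main()
{
    immutable(ubyte)[] data = hexBytes!"32 2b 32 3d
                                        34 2e";
    assert(cast(string) data == "2+2=4.");
}
```

Because the result is a static immutable array, using it does not allocate at run time; a malformed input fails the build rather than the program.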


T

-- 
Without effort, you can't even pull a fish out of the pond.

Oct 20 2012

"foobar" <foo bar.com> writes:

On Saturday, 20 October 2012 at 21:03:20 UTC, H. S. Teoh wrote:
 On Sat, Oct 20, 2012 at 04:39:28PM -0400, Nick Sabalausky wrote:
 On Sat, 20 Oct 2012 14:59:27 +0200
 "foobar" <foo bar.com> wrote:
 On Saturday, 20 October 2012 at 10:51:25 UTC, Denis 
 Shelomovskij
 wrote:
 Maybe. Just an example of a real world code:

 Arrays:
 https://github.com/D-Programming-Language/druntime/blob/fc45de1d089a1025df60ee2eea66ba27ee0bd99c/src/core/sys/windows/dll.d#L110

 vs

 Hex strings:
 https://github.com/denis-sh/hooking/blob/69105a24d77fcb6eca701282a16dd5ec7311c077/tlsfixer/ntdll.d#L130

 By the way, current code isn't affected by the topic issue.

 
 I personally find the former more readable but I guess there 
 would always be someone to disagree. As they say, YMMV.

 
 Honestly, I can't imagine how anyone wouldn't find the latter 
 vastly
 more readable.

 If you want vastly human readable, you want heredoc hex syntax,
 something like this:

 	ubyte[] = x"<<END
 	32 2b 32 3d 34 2e 20 32 2a 32 3d 34 2e 20 32 5e
 	32 3d 34 2e 20 54 68 65 72 65 66 6f 72 65 2c 20
 	2b 2c 20 2a 2c 20 61 6e 64 20 5e 20 61 72 65 20
 	74 68 65 20 73 61 6d 65 20 6f 70 65 72 61 74 69
 	6f 6e 2e 0a 22 36 34 30 4b 20 6f 75 67 68 74 20
 	74 6f 20 62 65 20 65 6e 6f 75 67 68 22 20 2d 2d
 	20 42 69 6c 6c 20 47 2e 2c 20 31 39 38 34 2e 20
 	22 54 68 65 20 49 6e 74 65 72 6e 65 74 20 69 73
 	20 6e 6f 74 20 61 20 70 72 69 6d 61 72 79 20 67
 	6f 61 6c 20 66 6f 72 20 50 43 20 75 73 61 67 65
 	END";

 (I just made that syntax up, so the details are not final, but 
 you get
 the idea.) I would propose supporting this in D, but then D 
 already has
 way too many different ways of writing strings, some of 
 questionable
 utility, so I will refrain.

 Of course, the above syntax might actually be implementable 
 with a
 suitable mixin template that takes a compile-time string. Maybe 
 we
 should lobby for such a template to go into Phobos -- that might
 motivate people to fix CTFE in dmd so that it doesn't consume
 unreasonable amounts of memory when the size of CTFE input gets
 moderately large (see other recent thread on this topic).


 T

Yeah, I like this. I'd prefer brackets over quotes but it's not a 
big dig as the quotes in the above are not very noticeable. It 
should look distinct from textual strings.
As you said, this could/should be implemented as a template.

Vote++

Oct 20 2012

"foobar" <foo bar.com> writes:

On Saturday, 20 October 2012 at 21:16:44 UTC, foobar wrote:
 On Saturday, 20 October 2012 at 21:03:20 UTC, H. S. Teoh wrote:
 On Sat, Oct 20, 2012 at 04:39:28PM -0400, Nick Sabalausky 
 wrote:
 On Sat, 20 Oct 2012 14:59:27 +0200
 "foobar" <foo bar.com> wrote:
 On Saturday, 20 October 2012 at 10:51:25 UTC, Denis 
 Shelomovskij
 wrote:
 Maybe. Just an example of a real world code:

 Arrays:
 https://github.com/D-Programming-Language/druntime/blob/fc45de1d089a1025df60ee2eea66ba27ee0bd99c/src/core/sys/windows/dll.d#L110

 vs

 Hex strings:
 https://github.com/denis-sh/hooking/blob/69105a24d77fcb6eca701282a16dd5ec7311c077/tlsfixer/ntdll.d#L130

 By the way, current code isn't affected by the topic 
 issue.

 
 I personally find the former more readable but I guess 
 there would always be someone to disagree. As they say, YMMV.

 
 Honestly, I can't imagine how anyone wouldn't find the latter 
 vastly
 more readable.

 If you want vastly human readable, you want heredoc hex syntax,
 something like this:

 	ubyte[] = x"<<END
 	32 2b 32 3d 34 2e 20 32 2a 32 3d 34 2e 20 32 5e
 	32 3d 34 2e 20 54 68 65 72 65 66 6f 72 65 2c 20
 	2b 2c 20 2a 2c 20 61 6e 64 20 5e 20 61 72 65 20
 	74 68 65 20 73 61 6d 65 20 6f 70 65 72 61 74 69
 	6f 6e 2e 0a 22 36 34 30 4b 20 6f 75 67 68 74 20
 	74 6f 20 62 65 20 65 6e 6f 75 67 68 22 20 2d 2d
 	20 42 69 6c 6c 20 47 2e 2c 20 31 39 38 34 2e 20
 	22 54 68 65 20 49 6e 74 65 72 6e 65 74 20 69 73
 	20 6e 6f 74 20 61 20 70 72 69 6d 61 72 79 20 67
 	6f 61 6c 20 66 6f 72 20 50 43 20 75 73 61 67 65
 	END";

 (I just made that syntax up, so the details are not final, but 
 you get
 the idea.) I would propose supporting this in D, but then D 
 already has
 way too many different ways of writing strings, some of 
 questionable
 utility, so I will refrain.

 Of course, the above syntax might actually be implementable 
 with a
 suitable mixin template that takes a compile-time string. 
 Maybe we
 should lobby for such a template to go into Phobos -- that 
 might
 motivate people to fix CTFE in dmd so that it doesn't consume
 unreasonable amounts of memory when the size of CTFE input gets
 moderately large (see other recent thread on this topic).


 T

 Yeah, I like this. I'd prefer brackets over quotes but it's not a 
 big dig as the quotes in the above are not very noticeable. It 
 should look distinct from textual strings.
 As you said, this could/should be implemented as a template.

 Vote++

** not a big deal

Oct 20 2012

Nick Sabalausky <SeeWebsiteToContactMe semitwist.com> writes:

On Sat, 20 Oct 2012 14:05:21 -0700
"H. S. Teoh" <hsteoh quickfur.ath.cx> wrote:

 On Sat, Oct 20, 2012 at 04:39:28PM -0400, Nick Sabalausky wrote:
 On Sat, 20 Oct 2012 14:59:27 +0200
 "foobar" <foo bar.com> wrote:
 On Saturday, 20 October 2012 at 10:51:25 UTC, Denis Shelomovskij
 wrote:
 Maybe. Just an example of a real world code:

 Arrays:
 https://github.com/D-Programming-Language/druntime/blob/fc45de1d089a1025df60ee2eea66ba27ee0bd99c/src/core/sys/windows/dll.d#L110

 vs

 Hex strings:
 https://github.com/denis-sh/hooking/blob/69105a24d77fcb6eca701282a16dd5ec7311c077/tlsfixer/ntdll.d#L130

 By the way, current code isn't affected by the topic issue.

 
 I personally find the former more readable but I guess there 
 would always be someone to disagree. As they say, YMMV.

 
 Honestly, I can't imagine how anyone wouldn't find the latter vastly
 more readable.

 
 If you want vastly human readable, you want heredoc hex syntax,
 something like this:
 
 	ubyte[] = x"<<END
 	32 2b 32 3d 34 2e 20 32 2a 32 3d 34 2e 20 32 5e
 	32 3d 34 2e 20 54 68 65 72 65 66 6f 72 65 2c 20
 	2b 2c 20 2a 2c 20 61 6e 64 20 5e 20 61 72 65 20
 	74 68 65 20 73 61 6d 65 20 6f 70 65 72 61 74 69
 	6f 6e 2e 0a 22 36 34 30 4b 20 6f 75 67 68 74 20
 	74 6f 20 62 65 20 65 6e 6f 75 67 68 22 20 2d 2d
 	20 42 69 6c 6c 20 47 2e 2c 20 31 39 38 34 2e 20
 	22 54 68 65 20 49 6e 74 65 72 6e 65 74 20 69 73
 	20 6e 6f 74 20 61 20 70 72 69 6d 61 72 79 20 67
 	6f 61 6c 20 66 6f 72 20 50 43 20 75 73 61 67 65
 	END";
 
 (I just made that syntax up, so the details are not final, but you get
 the idea.) I would propose supporting this in D, but then D already
 has way too many different ways of writing strings, some of
 questionable utility, so I will refrain.
 
 Of course, the above syntax might actually be implementable with a
 suitable mixin template that takes a compile-time string. Maybe we
 should lobby for such a template to go into Phobos -- that might
 motivate people to fix CTFE in dmd so that it doesn't consume
 unreasonable amounts of memory when the size of CTFE input gets
 moderately large (see other recent thread on this topic).
 

Can't you already just do this?:

 	auto blah = x"
 	32 2b 32 3d 34 2e 20 32 2a 32 3d 34 2e 20 32 5e
 	32 3d 34 2e 20 54 68 65 72 65 66 6f 72 65 2c 20
 	2b 2c 20 2a 2c 20 61 6e 64 20 5e 20 61 72 65 20
 	74 68 65 20 73 61 6d 65 20 6f 70 65 72 61 74 69
 	6f 6e 2e 0a 22 36 34 30 4b 20 6f 75 67 68 74 20
 	74 6f 20 62 65 20 65 6e 6f 75 67 68 22 20 2d 2d
 	20 42 69 6c 6c 20 47 2e 2c 20 31 39 38 34 2e 20
 	22 54 68 65 20 49 6e 74 65 72 6e 65 74 20 69 73
 	20 6e 6f 74 20 61 20 70 72 69 6d 61 72 79 20 67
 	6f 61 6c 20 66 6f 72 20 50 43 20 75 73 61 67 65
 	";

I thought all string literals in D accepted embedded newlines?

Oct 20 2012

"Dejan Lekic" <dejan.lekic gmail.com> writes:

 If you want vastly human readable, you want heredoc hex syntax,
 something like this:

 	ubyte[] = x"<<END
 	32 2b 32 3d 34 2e 20 32 2a 32 3d 34 2e 20 32 5e
 	32 3d 34 2e 20 54 68 65 72 65 66 6f 72 65 2c 20
 	2b 2c 20 2a 2c 20 61 6e 64 20 5e 20 61 72 65 20
 	74 68 65 20 73 61 6d 65 20 6f 70 65 72 61 74 69
 	6f 6e 2e 0a 22 36 34 30 4b 20 6f 75 67 68 74 20
 	74 6f 20 62 65 20 65 6e 6f 75 67 68 22 20 2d 2d
 	20 42 69 6c 6c 20 47 2e 2c 20 31 39 38 34 2e 20
 	22 54 68 65 20 49 6e 74 65 72 6e 65 74 20 69 73
 	20 6e 6f 74 20 61 20 70 72 69 6d 61 72 79 20 67
 	6f 61 6c 20 66 6f 72 20 50 43 20 75 73 61 67 65
 	END";

Having a heredoc syntax for hex-strings that produce ubyte[] 
arrays is confusing for people who would (naturally) expect a 
string from a heredoc string. It is not named hereDOC for no 
reason. :)

Oct 22 2012

"H. S. Teoh" <hsteoh quickfur.ath.cx> writes:

On Mon, Oct 22, 2012 at 01:14:21PM +0200, Dejan Lekic wrote:
If you want vastly human readable, you want heredoc hex syntax,
something like this:

	ubyte[] = x"<<END
	32 2b 32 3d 34 2e 20 32 2a 32 3d 34 2e 20 32 5e
	32 3d 34 2e 20 54 68 65 72 65 66 6f 72 65 2c 20
	2b 2c 20 2a 2c 20 61 6e 64 20 5e 20 61 72 65 20
	74 68 65 20 73 61 6d 65 20 6f 70 65 72 61 74 69
	6f 6e 2e 0a 22 36 34 30 4b 20 6f 75 67 68 74 20
	74 6f 20 62 65 20 65 6e 6f 75 67 68 22 20 2d 2d
	20 42 69 6c 6c 20 47 2e 2c 20 31 39 38 34 2e 20
	22 54 68 65 20 49 6e 74 65 72 6e 65 74 20 69 73
	20 6e 6f 74 20 61 20 70 72 69 6d 61 72 79 20 67
	6f 61 6c 20 66 6f 72 20 50 43 20 75 73 61 67 65
	END";

 
 Having a heredoc syntax for hex-strings that produce ubyte[] arrays
 is confusing for people who would (naturally) expect a string from a
 heredoc string. It is not named hereDOC for no reason. :)

What I meant was, a syntax similar to heredoc, not an actual heredoc,
which would be a string.


T

-- 
Knowledge is that area of ignorance that we arrange and classify. -- Ambrose
Bierce

Oct 22 2012

Nick Sabalausky <SeeWebsiteToContactMe semitwist.com> writes:

On Wed, 17 Oct 2012 19:49:43 -0700
"H. S. Teoh" <hsteoh quickfur.ath.cx> wrote:

 On Thu, Oct 18, 2012 at 02:45:10AM +0200, bearophile wrote:
 [...]
 hex strings are useful, but I think they were invented in D1 when
 strings were convertible to char[]. But today they are an array of
 immutable UTF-8, so I think this default type is not so useful:
 
 void main() {
     string data1 = x"A1 B2 C3 D4"; // OK
     immutable(ubyte)[] data2 = x"A1 B2 C3 D4"; // error
 }
 
 
 test.d(3): Error: cannot implicitly convert expression
 ("\xa1\xb2\xc3\xd4") of type string to ubyte[]

 [...]
 
 Yeah I think hex strings would be better as ubyte[] by default.
 
 More generally, though, I think *both* of the above lines should be
 equally accepted.  If you write x"A1 B2 C3" in the context of
 initializing a string, then the compiler should infer the type of the
 literal as string, and if the same literal occurs in the context of,
 say, passing a ubyte[], then its type should be inferred as ubyte[],
 NOT string.
 

Big +1

Having the language expect x"..." to always be a string (let alone a
*valid UTF* string) is just insane. It's just too damn useful for
arbitrary binary data.

Oct 18 2012

"bearophile" <bearophileHUGS lycos.com> writes:

Nick Sabalausky:

 Big +1

 Having the language expect x"..." to always be a string (let 
 alone a *valid UTF* string) is just insane. It's just too
 damn useful for arbitrary binary data.

I'd like an opinion on such topics from one of the D bosses 
:-)

Bye,
bearophile

Oct 18 2012

"monarch_dodra" <monarchdodra gmail.com> writes:

On Thursday, 18 October 2012 at 00:45:12 UTC, bearophile wrote:
 (Repost)

 hex strings are useful, but I think they were invented in D1 
 when strings were convertible to char[]. But today they are an 
 array of immutable UTF-8, so I think this default type is not 
 so useful:

 void main() {
     string data1 = x"A1 B2 C3 D4"; // OK
     immutable(ubyte)[] data2 = x"A1 B2 C3 D4"; // error
 }


 test.d(3): Error: cannot implicitly convert expression 
 ("\xa1\xb2\xc3\xd4") of type string to ubyte[]

 [SNIP]

 Bye,
 bearophile

The conversion can't be done *implicitly*, but you can still get 
your code to compile:

//----
void main() {
     immutable(ubyte)[] data2 =
         cast(immutable(ubyte)[]) x"A1 B2 C3 D4"; // OK!
}
//----

It's a bit ugly, and I agree it should work natively, but it is a 
workaround.
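
For comparison, a cast-free variant can be sketched with library helpers, assuming a Phobos release that ships `std.conv.hexString` and `std.string.representation`:

```d
import std.conv : hexString;
import std.string : representation;

void main()
{
    // hexString is evaluated at compile time and yields a string.
    string raw = hexString!"A1 B2 C3 D4";

    // representation views the same data as immutable(ubyte)[], without a cast.
    immutable(ubyte)[] data = raw.representation;

    immutable(ubyte)[] expected = [0xA1, 0xB2, 0xC3, 0xD4];
    assert(data == expected);
}
```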

Oct 18 2012

"Dejan Lekic" <dejan.lekic gmail.com> writes:

On Thursday, 18 October 2012 at 00:45:12 UTC, bearophile wrote:
 (Repost)

 hex strings are useful, but I think they were invented in D1 
 when strings were convertible to char[]. But today they are an 
 array of immutable UTF-8, so I think this default type is not 
 so useful:

 void main() {
     string data1 = x"A1 B2 C3 D4"; // OK
     immutable(ubyte)[] data2 = x"A1 B2 C3 D4"; // error
 }


 test.d(3): Error: cannot implicitly convert expression 
 ("\xa1\xb2\xc3\xd4") of type string to ubyte[]


 Generally I want to use hex strings to put binary data in a 
 program, so usually it's a ubyte[] or uint[].

 So I have to use something like:

 auto data3 = cast(ubyte[])(x"A1 B2 C3 D4".dup);


 So maybe the following literals are more useful in D2:

 ubyte[] data4 = x[A1 B2 C3 D4];
 uint[]  data5 = x[A1 B2 C3 D4];
 ulong[] data6 = x[A1 B2 C3 D4 A1 B2 C3 D4];

 Bye,
 bearophile

+1 on this one
I also like the x[ ... ] literal because it makes it obvious that 
we are dealing with an array.

Oct 22 2012

"Simen Kjaeraas" <simen.kjaras gmail.com> writes:

On 2012-10-18 02:45, bearophile <bearophileHUGS lycos.com> wrote:

 So maybe the following literals are more useful in D2:

 ubyte[] data4 = x[A1 B2 C3 D4];
 uint[]  data5 = x[A1 B2 C3 D4];
 ulong[] data6 = x[A1 B2 C3 D4 A1 B2 C3 D4];

That syntax is already taken, though.

Still, I see no reason for x"..." not to return ubyte[].

-- 
Simen

Oct 22 2012
