
digitalmars.D - Regarding hex strings

reply "bearophile" <bearophileHUGS lycos.com> writes:
(Repost)

hex strings are useful, but I think they were invented in D1 when 
strings were convertible to char[]. But today they are an array 
of immutable UTF-8, so I think this default type is not so useful:

void main() {
     string data1 = x"A1 B2 C3 D4"; // OK
     immutable(ubyte)[] data2 = x"A1 B2 C3 D4"; // error
}


test.d(3): Error: cannot implicitly convert expression 
("\xa1\xb2\xc3\xd4") of type string to ubyte[]


Generally I want to use hex strings to put binary data in a 
program, so usually it's a ubyte[] or uint[].

So I have to use something like:

auto data3 = cast(ubyte[])(x"A1 B2 C3 D4".dup);
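
A minimal sketch of one way to avoid the cast-plus-.dup above, assuming a Phobos recent enough to ship std.conv.hexString (it postdates this thread) and std.string.representation. The function name rawBytes is made up for illustration. The bytes stay immutable, so no copy is needed:

```d
import std.conv : hexString;
import std.string : representation;

// rawBytes is a hypothetical helper name, not something from the thread.
// hexString builds a string literal at compile time; representation
// reinterprets that string as immutable(ubyte)[] without copying.
immutable(ubyte)[] rawBytes()
{
    return hexString!"A1B2C3D4".representation;
}

void main()
{
    immutable(ubyte)[] data = rawBytes();
    assert(data.length == 4);
    assert(data == [0xA1, 0xB2, 0xC3, 0xD4]);
}
```

This only covers the immutable(ubyte)[] case; a mutable ubyte[] would still need a .dup.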


So maybe the following literals are more useful in D2:

ubyte[] data4 = x[A1 B2 C3 D4];
uint[]  data5 = x[A1 B2 C3 D4];
ulong[] data6 = x[A1 B2 C3 D4 A1 B2 C3 D4];

Bye,
bearophile
Oct 17 2012
next sibling parent reply "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Thu, Oct 18, 2012 at 02:45:10AM +0200, bearophile wrote:
[...]
 hex strings are useful, but I think they were invented in D1 when
 strings were convertible to char[]. But today they are an array of
 immutable UTF-8, so I think this default type is not so useful:
 
 void main() {
     string data1 = x"A1 B2 C3 D4"; // OK
     immutable(ubyte)[] data2 = x"A1 B2 C3 D4"; // error
 }
 
 
 test.d(3): Error: cannot implicitly convert expression
 ("\xa1\xb2\xc3\xd4") of type string to ubyte[]
[...]

Yeah I think hex strings would be better as ubyte[] by default.

More generally, though, I think *both* of the above lines should be
equally accepted. If you write x"A1 B2 C3" in the context of
initializing a string, then the compiler should infer the type of the
literal as string, and if the same literal occurs in the context of,
say, passing a ubyte[], then its type should be inferred as ubyte[],
NOT string.


T

-- 
Who told you to swim in Crocodile Lake without life insurance??
Oct 17 2012
next sibling parent reply "foobar" <foo bar.com> writes:
On Thursday, 18 October 2012 at 02:47:42 UTC, H. S. Teoh wrote:
 On Thu, Oct 18, 2012 at 02:45:10AM +0200, bearophile wrote:
 [...]
 hex strings are useful, but I think they were invented in D1 
 when
 strings were convertible to char[]. But today they are an 
 array of
 immutable UFT-8, so I think this default type is not so useful:
 
 void main() {
     string data1 = x"A1 B2 C3 D4"; // OK
     immutable(ubyte)[] data2 = x"A1 B2 C3 D4"; // error
 }
 
 
 test.d(3): Error: cannot implicitly convert expression
 ("\xa1\xb2\xc3\xd4") of type string to ubyte[]
[...] Yeah I think hex strings would be better as ubyte[] by default. More generally, though, I think *both* of the above lines should be equally accepted. If you write x"A1 B2 C3" in the context of initializing a string, then the compiler should infer the type of the literal as string, and if the same literal occurs in the context of, say, passing a ubyte[], then its type should be inferred as ubyte[], NOT string. T
IMO, this is a redundant feature that complicates the language for no benefit and should be deprecated.
strings already have an escape sequence for specifying code-points "\u" and for ubyte arrays you can simply use:
immutable(ubyte)[] data2 = [0xA1 0xB2 0xC3 0xD4];

So basically this feature gains us nothing.
Oct 18 2012
next sibling parent reply "monarch_dodra" <monarchdodra gmail.com> writes:
On Thursday, 18 October 2012 at 08:58:57 UTC, foobar wrote:
 IMO, this is a redundant feature that complicates the language 
 for no benefit and should be deprecated.
 strings already have an escape sequence for specifying 
 code-points "\u" and for ubyte arrays you can simply use:
 immutable(ubyte)[] data2 = [0xA1 0xB2 0xC3 0xD4];

 So basically this feature gains us nothing.
Have you actually ever written code that requires using code points? This feature is a *huge* convenience for when you do. Just compare:

string nihongo1 = x"e697a5 e69cac e8aa9e";
string nihongo2 = "\ue697a5\ue69cac\ue8aa9e";
ubyte[] nihongo3 = [0xe6, 0x97, 0xa5, 0xe6, 0x9c, 0xac, 0xe8, 0xaa, 0x9e];

BTW, your data2 doesn't compile.
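
One caveat on the comparison above: as far as I know, "\u" takes exactly four hex digits naming a code point, not UTF-8 bytes, so nihongo2 would not spell the same text as nihongo1. A small sketch of the three spellings that do agree (the variable names are only for illustration; the code points are U+65E5, U+672C and U+8A9E):

```d
import std.string : representation;

void main()
{
    string direct  = "日本語";                                 // the characters themselves
    string viaCode = "\u65E5\u672C\u8A9E";                     // code points, 4 hex digits each
    string viaByte = "\xE6\x97\xA5\xE6\x9C\xAC\xE8\xAA\x9E";  // the raw UTF-8 bytes

    assert(direct == viaCode);
    assert(direct == viaByte);

    // The byte view matches the nihongo3 array from the post.
    assert(direct.representation ==
           [0xe6, 0x97, 0xa5, 0xe6, 0x9c, 0xac, 0xe8, 0xaa, 0x9e]);
}
```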
Oct 18 2012
next sibling parent reply "foobar" <foo bar.com> writes:
On Thursday, 18 October 2012 at 09:42:43 UTC, monarch_dodra wrote:
 On Thursday, 18 October 2012 at 08:58:57 UTC, foobar wrote:
 IMO, this is a redundant feature that complicates the language 
 for no benefit and should be deprecated.
 strings already have an escape sequence for specifying 
 code-points "\u" and for ubyte arrays you can simply use:
 immutable(ubyte)[] data2 = [0xA1 0xB2 0xC3 0xD4];

 So basically this feature gains us nothing.
Have you actually ever written code that requires using code points? This feature is a *huge* convenience for when you do. Just compare: string nihongo1 = x"e697a5 e69cac e8aa9e"; string nihongo2 = "\ue697a5\ue69cac\ue8aa9e"; ubyte[] nihongo3 = [0xe6, 0x97, 0xa5, 0xe6, 0x9c, 0xac, 0xe8, 0xaa, 0x9e]; BTW, your data2 doesn't compile.
I didn't try to compile it :) I just rewrote bearophile's example with 0x prefixes.

How often do you actually need to write code-point _literals_ in your code? I'm not arguing that it isn't convenient. My question would be rather Andrei's "does it pull its own weight?" meaning: is the added complexity in the language, and having more than one way of doing something, worth that convenience?

Seems to me this is in the same ballpark as the built-in complex numbers. Sure it's nice to be able to write "4+5i" instead of "complex(4,5)" but how frequently do you actually ever need the _literals_ even in complex computational heavy code?
Oct 18 2012
parent reply "bearophile" <bearophileHUGS lycos.com> writes:
The docs say:
http://dlang.org/lex.html

"Hex strings allow string literals to be created using hex data. 
The hex data need not form valid UTF characters."

But this code:

void main() {
    immutable ubyte[4] data = x"F9 04 C1 E2";
}

Gives me:

temp.d(2): Error: Outside Unicode code space

Are the docs correct?

--------------------------

foobar:
 Seems to me this is in the same ballpark as the built-in 
 complex numbers. Sure it's nice to be able to write "4+5i" 
 instead of "complex(4,5)" but how frequently do you actually 
 ever need the _literals_ even in complex computational heavy 
 code?
Compared to "oct!5151151511", one problem with code like this is that binary blobs are sometimes large, so supporting a x"" syntax is better:

immutable ubyte[4] data = hex!"F9 04 C1 E2";

Bye,
bearophile
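
On the docs question earlier in this post, a small sketch of why those four bytes trip a UTF check: 0xF9 cannot start a valid UTF-8 sequence. The bytes are spelled with \x escapes here rather than x"" so the example does not depend on how a given compiler version treats hex strings; invalidUtf is a made-up name.

```d
import std.exception : assertThrown;
import std.utf : UTFException, validate;

// Hypothetical helper: the same four bytes as x"F9 04 C1 E2".
string invalidUtf()
{
    return "\xF9\x04\xC1\xE2";
}

void main()
{
    // The string type holds the bytes without complaint...
    assert(invalidUtf().length == 4);

    // ...but explicit validation rejects them as UTF-8.
    assertThrown!UTFException(validate(invalidUtf()));
}
```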
Oct 18 2012
parent reply "foobar" <foo bar.com> writes:
On Thursday, 18 October 2012 at 10:05:06 UTC, bearophile wrote:
 The docs say:
 http://dlang.org/lex.html

Hex strings allow string literals to be created using hex data. 
The hex data need not form valid UTF characters.<
But this code: void main() { immutable ubyte[4] data = x"F9 04 C1 E2"; } Gives me: temp.d(2): Error: Outside Unicode code space Are the docs correct? -------------------------- foobar:
 Seems to me this is in the same ballpark as the built-in 
 complex numbers. Sure it's nice to be able to write "4+5i" 
 instead of "complex(4,5)" but how frequently do you actually 
 ever need the _literals_ even in complex computational heavy 
 code?
Compared to "oct!5151151511", one problem with code like this is that binary blobs are sometimes large, so supporting a x"" syntax is better: immutable ubyte[4] data = hex!"F9 04 C1 E2"; Bye, bearophile
How often are large binary blobs literally spelled out in the source code (as opposed to just being read from a file)?

In any case, I'm not opposed to such a utility library, in fact I think it's a rather good idea and we already have a precedent with "oct!".
I just don't think this belongs as a built-in feature in the language.
Oct 18 2012
next sibling parent reply "foobar" <foo bar.com> writes:
On Thursday, 18 October 2012 at 10:11:14 UTC, foobar wrote:
 On Thursday, 18 October 2012 at 10:05:06 UTC, bearophile wrote:
 The docs say:
 http://dlang.org/lex.html

Hex strings allow string literals to be created using hex 
data. The hex data need not form valid UTF characters.<
This is especially a good reason to remove this feature, as it breaks the principle of least surprise, and I consider it a major bug, not a feature.

I expect D's strings, which are by definition Unicode, to _only_ ever allow _valid_ Unicode. It makes no sense whatsoever to allow this nasty back-door. Other text encodings should be either stored and treated as binary data (ubyte[]) or, better yet, stored in their own types that will ensure those encodings' invariants.
Oct 18 2012
parent reply "monarch_dodra" <monarchdodra gmail.com> writes:
On Thursday, 18 October 2012 at 10:17:06 UTC, foobar wrote:
 On Thursday, 18 October 2012 at 10:11:14 UTC, foobar wrote:
 On Thursday, 18 October 2012 at 10:05:06 UTC, bearophile wrote:
 The docs say:
 http://dlang.org/lex.html

Hex strings allow string literals to be created using hex 
data. The hex data need not form valid UTF characters.<
This is especially a good reason to remove this feature as it breaks the principle of least surprise and I consider it a major bug, not a feature. I expect D's strings which are by definition Unicode to _only_ ever allow _valid_ Unicode. It makes no sense what so ever to allow this nasty back-door. Other text encoding should be either stored and treated as binary data (ubyte[]) or better yet stored in their own types that will ensure those encodings' invariants.
Yeah, that makes sense too. I'll try to toy around on my end and see if I can write a "hex".
Oct 18 2012
parent reply "monarch_dodra" <monarchdodra gmail.com> writes:
On Thursday, 18 October 2012 at 10:39:46 UTC, monarch_dodra wrote:
 Yeah, that makes sense too. I'll try to toy around on my end 
 and see if I can write an "hex".
That was actually relatively easy!

Here is some usecase:

//----
void main()
{
    enum a = hex!"01 ff 7f";
    enum b = hex!0x01_ff_7f;
    ubyte[] c = hex!"0123456789abcdef";
    immutable(ubyte)[] bearophile1 = hex!"A1 B2 C3 D4";
    immutable(ubyte)[] bearophile2 = hex!0xA1_B2_C3_D4;

    a.writeln();
    b.writeln();
    c.writeln();
    bearophile1.writeln();
    bearophile2.writeln();
}
//----

And corresponding output:

//----
[1, 255, 127]
[1, 255, 127]
[1, 35, 69, 103, 137, 171, 205, 239]
[161, 178, 195, 212]
[161, 178, 195, 212]
//----

hex! was a very good idea actually, imo. I'll post my current impl in the next post.

That said, I don't know if I'd deprecate x"", as it serves a different role, as you have already pointed out, in that it *will* validate the code points.
Oct 18 2012
next sibling parent reply "monarch_dodra" <monarchdodra gmail.com> writes:
On Thursday, 18 October 2012 at 11:24:04 UTC, monarch_dodra wrote:
 hex! was a very good idea actually, imo. I'll post my current 
 impl in the next post.
//----
import std.stdio;
import std.conv;
import std.ascii;

template hex(string s)
{
    enum hex = decode(s);
}

template hex(ulong ul)
{
    enum hex = decode(ul);
}

ubyte[] decode(string s)
{
    ubyte[] ret;
    size_t p;
    while (p < s.length)
    {
        // Skip separators between byte pairs.
        while (s[p] == ' ' || s[p] == '_')
        {
            ++p;
            if (p == s.length)
                assert(0, text("Premature end of string at index ", p, "."));
        }

        // High nibble.
        char c1 = s[p];
        if (!std.ascii.isHexDigit(c1))
            assert(0, text("Unexpected character ", c1, " at index ", p, "."));
        c1 = cast(char)std.ascii.toUpper(c1);

        ++p;
        if (p == s.length)
            assert(0, text("Premature end of string after ", c1, "."));

        // Low nibble.
        char c2 = s[p];
        if (!std.ascii.isHexDigit(c2))
            assert(0, text("Unexpected character ", c2, " at index ", p, "."));
        c2 = cast(char)std.ascii.toUpper(c2);
        ++p;

        ubyte val;
        if ('0' <= c2 && c2 <= '9') val += (c2 - '0');
        if ('A' <= c2 && c2 <= 'F') val += (c2 - 'A' + 10);
        if ('0' <= c1 && c1 <= '9') val += ((c1 - '0')*16);
        if ('A' <= c1 && c1 <= 'F') val += ((c1 - 'A' + 10)*16);
        ret ~= val;
    }
    return ret;
}

ubyte[] decode(ulong ul)
{
    //NOTE: This is not efficient AT ALL (push front)
    //but it is ctfe, so we can live with it for now ^^
    //I'll optimize it if I try to push it
    ubyte[] ret;
    while (ul)
    {
        ubyte t = ul%256;
        ret = t ~ ret;
        ul /= 256;
    }
    return ret;
}
//----

NOT a final version.
Oct 18 2012
parent "monarch_dodra" <monarchdodra gmail.com> writes:
On Thursday, 18 October 2012 at 11:26:13 UTC, monarch_dodra wrote:
 NOT a final version.
With correct-er utf string support. In theory, non-ascii characters are illegal, but it makes for safer code, and better diagnosis.

//----
import std.range; // front/popFront on strings

ubyte[] decode(string s)
{
    ubyte[] ret;
    while (s.length)
    {
        // Skip separators between byte pairs.
        while (s.front == ' ' || s.front == '_')
        {
            s.popFront();
            if (!s.length)
                assert(0, text("Premature end of string."));
        }

        // High nibble.
        dchar c1 = s.front;
        if (!std.ascii.isHexDigit(c1))
            assert(0, text("Unexpected character ", c1, "."));
        c1 = std.ascii.toUpper(c1);

        s.popFront();
        if (!s.length)
            assert(0, text("Premature end of string after ", c1, "."));

        // Low nibble.
        dchar c2 = s.front;
        if (!std.ascii.isHexDigit(c2))
            assert(0, text("Unexpected character ", c2, " after ", c1, "."));
        c2 = std.ascii.toUpper(c2);
        s.popFront();

        ubyte val;
        if ('0' <= c2 && c2 <= '9') val += (c2 - '0');
        if ('A' <= c2 && c2 <= 'F') val += (c2 - 'A' + 10);
        if ('0' <= c1 && c1 <= '9') val += ((c1 - '0')*16);
        if ('A' <= c1 && c1 <= 'F') val += ((c1 - 'A' + 10)*16);
        ret ~= val;
    }
    return ret;
}
//----
Oct 18 2012
prev sibling parent reply "bearophile" <bearophileHUGS lycos.com> writes:
monarch_dodra:

 hex! was a very good idea actually, imo.
It must scale up to "real world" usages. Try it with a program composed of 3 modules, each one containing a 100 KB long string. Then try it with a program with two hundred medium-sized literals, and let's see compilation times and binary sizes.

Bye,
bearophile
Oct 18 2012
parent reply "monarch_dodra" <monarchdodra gmail.com> writes:
On Thursday, 18 October 2012 at 13:15:55 UTC, bearophile wrote:
 monarch_dodra:

 hex! was a very good idea actually, imo.
It must scale up to "real world" usages. Try it with a program composed of 3 modules each one containing a 100 KB long string. Then try it with a program with two hundred of medium sized literals, and let's see compilation times and binary sizes. Bye, bearophile
Hum... The compilation is pretty fast actually, about 1 second, provided it doesn't choke.

It works for strings up to a length of 400 lines at 80 chars per line, which results in approximately 16K of data. After that, I get a DMD out of memory error.

DMD memory usage spikes quite quickly. To compile those 400 lines (16K), I use 800MB of memory (!). If I reach about 1GB, then it crashes.

I tried using a refAppender instead of ret~, but that changed nothing.

Kind of weird it would use that much memory though...

Also, the memory doesn't get released. I can parse a 1x400 line string, but if I try to parse 3 of them, DMD will choke on the second one. :(
Oct 18 2012
parent reply Marco Leise <Marco.Leise gmx.de> writes:
Am Thu, 18 Oct 2012 16:31:57 +0200
schrieb "monarch_dodra" <monarchdodra gmail.com>:

 On Thursday, 18 October 2012 at 13:15:55 UTC, bearophile wrote:
 monarch_dodra:

 hex! was a very good idea actually, imo.
It must scale up to "real world" usages. Try it with a program composed of 3 modules each one containing a 100 KB long string. Then try it with a program with two hundred of medium sized literals, and let's see compilation times and binary sizes. Bye, bearophile
Hum... The compilation is pretty fast actually, about 1 second, provided it doesn't choke. It works for strings up to a length of 400 lines 80 chars per line, which result to approximately 16K of data. After that, I get a DMD out of memory error. DMD memory usage spikes quite quickly. To compile those 400 lines (16K), I use 800MB of memory (!). If I reach about 1GB, then it crashes. I tried using a refAppender instead of ret~, but that changed nothing. Kind of weird it would use that much memory though... Also, the memory doesn't get released. I can parse a 1x400 Line string, but if I try to parse 3 of them, DMD will choke on the second one. :(
Hehe, I assume most of the regulars know this: DMD used to
use a garbage collector that is disabled. Memory just isn't
freed! Also it has copy on write semantics during CTFE:

int bug6498(int x)
{
    int n = 0;
    while (n < x)
        ++n;
    return n;
}
static assert(bug6498(10_000_000)==10_000_000);

--> Fails with an 'out of memory' error.

http://d.puremagic.com/issues/show_bug.cgi?id=6498

So, as strange as it sounds, for now try not to write often or
into large blocks. Using this knowledge I was sometimes able
to bring down the memory consumption considerably by caching
recurring concatenations of two strings or to!string calls.

That said, appending single elements to an array may actually
be better than using a fixed-sized one and have DMD duplicate
it on every write. :p

Please remember to give Don a cookie when he manages to change
the compiler to modify in-place where appropriate.

-- 
Marco
Oct 18 2012
next sibling parent reply Jonathan M Davis <jmdavisProg gmx.com> writes:
On Friday, October 19, 2012 05:14:44 Marco Leise wrote:
 Hehe, I assume most of the regulars know this: DMD used to
 use a garbage collector that is disabled.
Yes, but it didn't use it for long, because it made performance worse, and Walter didn't have the time to spend fixing it, so it was disabled. Presumably, someone will take the time to improve it at some point and then it will be re-enabled.
 Memory just isn't freed!
That was my understanding, but the last time that I said that, Brad Roberts said that it wasn't true, and that we should stop spreading that FUD, so I don't know what the exact situation is, but it sounds like if that was true in the past, it's not true now. Regardless, it's clear that dmd still uses too much memory in many cases, especially when code uses a lot of templates or CTFE. - Jonathan M Davis
Oct 18 2012
parent reply Marco Leise <Marco.Leise gmx.de> writes:
Am Thu, 18 Oct 2012 21:03:01 -0700
schrieb Jonathan M Davis <jmdavisProg gmx.com>:

 On Friday, October 19, 2012 05:14:44 Marco Leise wrote:
 Memory just isn't freed!
That was my understanding, but the last time that I said that, Brad Roberts said that it wasn't true, and that we should stop spreading that FUD, so I don't know what the exact situation is, but it sounds like if that was true in the past, it's not true now. Regardless, it's clear that dmd still uses too much memory in many cases, especially when code uses a lot of templates or CTFE. - Jonathan M Davis
He called it a FUD? Without trying to sound too patronizing, most D programmers would really only notice DMD's memory footprint when they use CTFE features. It is always Pegged, ctRegex, etc. that make the issue come up, never basic code. And preloading the Boehm collector showed that gigabytes of CTFE memory usage can still be brought down to a few hundred MB [citation needed]. I guess we can meet somewhere in the middle. Btw. did I mix up Don and Brad in the last post ? Who is working on the memory management ? -- Marco
Oct 18 2012
parent Jonathan M Davis <jmdavisProg gmx.com> writes:
On Friday, October 19, 2012 07:29:46 Marco Leise wrote:
 Am Thu, 18 Oct 2012 21:03:01 -0700
 
 schrieb Jonathan M Davis <jmdavisProg gmx.com>:
 On Friday, October 19, 2012 05:14:44 Marco Leise wrote:
 Memory just isn't freed!
That was my understanding, but the last time that I said that, Brad Roberts said that it wasn't true, and that we should stop spreading that FUD, so I don't know what the exact situation is, but it sounds like if that was true in the past, it's not true now. Regardless, it's clear that dmd still uses too much memory in many cases, especially when code uses a lot of templates or CTFE. - Jonathan M Davis
He called it a FUD?
I don't think that he used quite that term, but his point was that I shouldn't be saying that, because it wasn't true, and so I was spreading incorrect information (that and the fact that he was tired of people spreading that incorrect information, IIRC). I can't find the exact post at the moment though.
 I guess we can meet somewhere in the middle. Btw. did
 I mix up Don and Brad in the last post ? Who is working on the
 memory management ?
I don't think that you mixed anyone up. Don works primarily on CTFE. Brad works primarily on the auto tester and other infrastructure required for the dmd/Phobos folks to do what they do. - Jonathan M Davis
Oct 18 2012
prev sibling parent "monarch_dodra" <monarchdodra gmail.com> writes:
On Friday, 19 October 2012 at 03:14:54 UTC, Marco Leise wrote:
 Hehe, I assume most of the regulars know this: DMD used to
 use a garbage collector that is disabled. Memory just isn't
 freed! Also it has copy on write semantics during CTFE:

 int bug6498(int x)
 {
     int n = 0;
     while (n < x)
         ++n;
     return n;
 }
 static assert(bug6498(10_000_000)==10_000_000);

 --> Fails with an 'out of memory' error.

 http://d.puremagic.com/issues/show_bug.cgi?id=6498

 So, as strange as it sounds, for now try not to write often or
 into large blocks. Using this knowledge I was sometimes able
 to bring down the memory consumption considerably by caching
 recurring concatenations of two strings or to!string calls.

 That said, appending single elements to an array may actually
 be better than using a fixed-sized one and have DMD duplicate
 it on every write. :p

 Please remember to give Don a cookie when he manages to change
 the compiler to modify in-place where appropriate.
I should have read your post in more detail. I thought you were saying that allocations are never freed, but it is indeed more than that: Every write allocates. I just spent the last hour trying to "optimize" my code, only to realize that at its "simplest" (Walk the string counting elements), I run out of memory :/ Can't do much more about it at this point.
Oct 20 2012
prev sibling parent reply Nick Sabalausky <SeeWebsiteToContactMe semitwist.com> writes:
On Thu, 18 Oct 2012 12:11:13 +0200
"foobar" <foo bar.com> wrote:
 
 How often large binary blobs are literally spelled in the source 
 code (as opposed to just being read from a file)?
Frequency isn't the issue. The issues are "*Is* it ever needed?" and "When it is needed, is it useful enough?" The answer to both is most certainly "yes". (Remember, D is supposed to be usable as a systems language, it's not merely a high-level-app-only language.)

Keep in mind, the question "Does it pull its own weight?" is for adding new features, not for going around gutting the language just because we can.
 In any case, I'm not opposed to such a utility library, in fact I 
 think it's a rather good idea and we already have a precedent 
 with "oct!"
 I just don't think this belongs as a built-in feature in the 
 language.
I think monarch_dodra's test proves that it definitely needs to be built-in.
Oct 18 2012
parent reply "foobar" <foo bar.com> writes:
On Friday, 19 October 2012 at 00:14:18 UTC, Nick Sabalausky wrote:
 On Thu, 18 Oct 2012 12:11:13 +0200
 "foobar" <foo bar.com> wrote:
 
 How often large binary blobs are literally spelled in the 
 source code (as opposed to just being read from a file)?
Frequency isn't the issue. The issues are "*Is* it ever needed?" and "When it is needed, is it useful enough?" The answer to both is most certainly "yes". (Remember, D is supposed to usable as a systems language, it's not merely a high-level-app-only language.)
Any real-world use cases to support this claim? Does C++ have such a feature? My limited experience with kernels is that this feature is not needed. The solution we used for this was to define an extern symbol and load it with a linker script (the binary data was of course stored in separate files).
 Keep in mind, the question "Does it pull it's own weight?" is 
 for
 adding new features, not for going around gutting the language
 just because we can.
Ok, I grant you that, but remember that the whole thread started because the feature _doesn't_ work, so let's rephrase: is it worth the effort to fix this feature?
 In any case, I'm not opposed to such a utility library, in 
 fact I think it's a rather good idea and we already have a 
 precedent with "oct!"
 I just don't think this belongs as a built-in feature in the 
 language.
I think monarch_dodra's test proves that it definitely needs to be built-in.
It proves that DMD has bugs that should be fixed, nothing more.
Oct 19 2012
parent Nick Sabalausky <SeeWebsiteToContactMe semitwist.com> writes:
On Fri, 19 Oct 2012 15:07:09 +0200
"foobar" <foo bar.com> wrote:

 On Friday, 19 October 2012 at 00:14:18 UTC, Nick Sabalausky wrote:
 On Thu, 18 Oct 2012 12:11:13 +0200
 "foobar" <foo bar.com> wrote:
 
 How often large binary blobs are literally spelled in the 
 source code (as opposed to just being read from a file)?
Frequency isn't the issue. The issues are "*Is* it ever needed?" and "When it is needed, is it useful enough?" The answer to both is most certainly "yes". (Remember, D is supposed to usable as a systems language, it's not merely a high-level-app-only language.)
Any real-world use cases to support this claim?
I've used it. And Denis just posted an example of where it was used to make code far more readable.
 Does C++ have such a feature?
It does not. As one consequence off the top of my head, including binary data into GBA homebrew became more of an awkward bloated mess than it needed to be.
 My limited experience with kernels is that this feature is not 
 needed.
"I haven't needed it" isn't remotely sufficient to demonstrate that something doesn't "pull its own weight".
 The solution we used for this was to define an extern 
 symbol and load it with a linker script (the binary data was of 
 course stored in separate files).
 
Yuck! s/solution/workaround/
 Keep in mind, the question "Does it pull it's own weight?" is 
 for
 adding new features, not for going around gutting the language
 just because we can.
Ok, I grant you that but remember that the whole thread started because the feature _doesn't_ work so lets rephrase - is it worth the effort to fix this feature?
The only bug is that it tries to validate it as UTF contrary to the spec. Making it *not* try to validate it sounds like a very minor effort. I think you're blowing it out of proportion. And yes, I think it's definitely worth it.
 In any case, I'm not opposed to such a utility library, in 
 fact I think it's a rather good idea and we already have a 
 precedent with "oct!"
 I just don't think this belongs as a built-in feature in the 
 language.
I think monarch_dodra's test proves that it definitely needs to be built-in.
It proves that DMD has bugs that should be fixed, nothing more.
Right so let's jettison x"..." just because *someday* CTFE might become good enough that we can bring the feature back. How does that make any sense? We already have it, it basically works (aside from only a fairly trivial issue). *When* CTFE is good enough to replace it, *then* we can have a sane debate about actually doing so. Until then, "Let's get rid of x"..." because it can be done in the library" is a pointless argument because at least for now it's NOT TRUE.
Oct 20 2012
prev sibling parent reply "Kagamin" <spam here.lot> writes:
On Thursday, 18 October 2012 at 09:42:43 UTC, monarch_dodra wrote:
 Have you actually ever written code that requires using code 
 points? This feature is a *huge* convenience for when you do. 
 Just compare:

 string nihongo1 = x"e697a5 e69cac e8aa9e";
 string nihongo2 = "\ue697a5\ue69cac\ue8aa9e";
 ubyte[] nihongo3 = [0xe6, 0x97, 0xa5, 0xe6, 0x9c, 0xac, 0xe8, 
 0xaa, 0x9e];
You should use unicode directly here, that's the whole point to support it.

string nihongo = "日本語";
Oct 18 2012
parent reply "Jonathan M Davis" <jmdavisProg gmx.com> writes:
On Thursday, October 18, 2012 15:56:50 Kagamin wrote:
 On Thursday, 18 October 2012 at 09:42:43 UTC, monarch_dodra wrote:
 Have you actually ever written code that requires using code
 points? This feature is a *huge* convenience for when you do.
 Just compare:
 
 string nihongo1 = x"e697a5 e69cac e8aa9e";
 string nihongo2 = "\ue697a5\ue69cac\ue8aa9e";
 ubyte[] nihongo3 = [0xe6, 0x97, 0xa5, 0xe6, 0x9c, 0xac, 0xe8,
 0xaa, 0x9e];
You should use unicode directly here, that's the whole point to support it. string nihongo = "日本語";
It's a nice feature, but there are plenty of cases where it makes more sense to use the unicode values rather than the characters themselves (e.g. your keyboard doesn't have the characters in question). It's valuable to be able to do it both ways. - Jonathan M Davis
Oct 18 2012
parent reply "Kagamin" <spam here.lot> writes:
Your keyboard doesn't have ready unicode values for all 
characters either.
Oct 18 2012
parent "Jonathan M Davis" <jmdavisProg gmx.com> writes:
On Thursday, October 18, 2012 21:09:14 Kagamin wrote:
 Your keyboard doesn't have ready unicode values for all
 characters either.
So? That doesn't make it so that it's not valuable to be able to input the values in hexadecimal instead of as actual unicode characters. Heck, if you want a specific character, I wouldn't trust copying the characters anyway, because it's far too easy to have two characters which look really similar but are different (e.g. there are multiple types of angle brackets in unicode), whereas with the numbers you can be sure. And with some characters (e.g. unicode whitespace characters), it generally doesn't make sense to enter the characters directly.

Regardless, my point is that both approaches can be useful, so it's good to be able to do both. If you prefer to put the unicode characters in directly, then do that, but others may prefer the other way. Personally, I've done both.

- Jonathan M Davis
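
A small sketch of the lookalike and whitespace cases mentioned above, using \u escapes so that the source stays unambiguous. The particular code points (U+2329 and U+3008 for two angle brackets, U+00A0 and U+2028 for whitespace) are picked here for illustration and are not taken from the post:

```d
import std.uni : isWhite;

void main()
{
    // Two left angle brackets that render almost identically.
    string angleA = "\u2329";
    string angleB = "\u3008";
    assert(angleA != angleB);

    // Whitespace that is hard to tell apart when typed directly.
    dchar noBreakSpace  = '\u00A0';
    dchar lineSeparator = '\u2028';
    assert(noBreakSpace != ' ');
    assert(isWhite(noBreakSpace));
    assert(isWhite(lineSeparator));
}
```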
Oct 18 2012
prev sibling next sibling parent reply Don Clugston <dac nospam.com> writes:
On 18/10/12 10:58, foobar wrote:
 On Thursday, 18 October 2012 at 02:47:42 UTC, H. S. Teoh wrote:
 On Thu, Oct 18, 2012 at 02:45:10AM +0200, bearophile wrote:
 [...]
 hex strings are useful, but I think they were invented in D1 when
 strings were convertible to char[]. But today they are an array of
 immutable UTF-8, so I think this default type is not so useful:

 void main() {
     string data1 = x"A1 B2 C3 D4"; // OK
     immutable(ubyte)[] data2 = x"A1 B2 C3 D4"; // error
 }


 test.d(3): Error: cannot implicitly convert expression
 ("\xa1\xb2\xc3\xd4") of type string to ubyte[]
[...] Yeah I think hex strings would be better as ubyte[] by default. More generally, though, I think *both* of the above lines should be equally accepted. If you write x"A1 B2 C3" in the context of initializing a string, then the compiler should infer the type of the literal as string, and if the same literal occurs in the context of, say, passing a ubyte[], then its type should be inferred as ubyte[], NOT string. T
IMO, this is a redundant feature that complicates the language for no benefit and should be deprecated. strings already have an escape sequence for specifying code-points "\u" and for ubyte arrays you can simply use: immutable(ubyte)[] data2 = [0xA1 0xB2 0xC3 0xD4]; So basically this feature gains us nothing.
That is not the same. Array literals are not the same as string literals, they have an implicit .dup.
See my recent thread on this issue (which unfortunately seems to have died without a resolution; people got hung up about trailing null characters without apparently noticing the more important issue of the dup).
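
A minimal sketch of the difference being described, with made-up function names: a mutable array literal has to yield a fresh array on every evaluation, while a string literal keeps pointing at the same immutable data.

```d
// Both helper names are hypothetical, for illustration only.
ubyte[] freshArray()
{
    return [0xA1, 0xB2, 0xC3, 0xD4]; // new array on every call
}

string sameString()
{
    return "\xA1\xB2\xC3\xD4";       // refers to static data
}

void main()
{
    ubyte[] a = freshArray();
    ubyte[] b = freshArray();

    // Writing through one array must not affect the other,
    // so the two calls cannot share storage.
    a[0] = 0;
    assert(b[0] == 0xA1);
    assert(a.ptr !is b.ptr);

    // The string literal is not copied per call.
    assert(sameString().ptr is sameString().ptr);
}
```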
Oct 18 2012
parent reply "foobar" <foo bar.com> writes:
On Thursday, 18 October 2012 at 14:29:57 UTC, Don Clugston wrote:
 On 18/10/12 10:58, foobar wrote:
 On Thursday, 18 October 2012 at 02:47:42 UTC, H. S. Teoh wrote:
 On Thu, Oct 18, 2012 at 02:45:10AM +0200, bearophile wrote:
 [...]
 hex strings are useful, but I think they were invented in D1 
 when
 strings were convertible to char[]. But today they are an 
 array of
 immutable UFT-8, so I think this default type is not so 
 useful:

 void main() {
    string data1 = x"A1 B2 C3 D4"; // OK
    immutable(ubyte)[] data2 = x"A1 B2 C3 D4"; // error
 }


 test.d(3): Error: cannot implicitly convert expression
 ("\xa1\xb2\xc3\xd4") of type string to ubyte[]
[...] Yeah I think hex strings would be better as ubyte[] by default. More generally, though, I think *both* of the above lines should be equally accepted. If you write x"A1 B2 C3" in the context of initializing a string, then the compiler should infer the type of the literal as string, and if the same literal occurs in the context of, say, passing a ubyte[], then its type should be inferred as ubyte[], NOT string. T
IMO, this is a redundant feature that complicates the language for no benefit and should be deprecated. strings already have an escape sequence for specifying code-points "\u" and for ubyte arrays you can simply use: immutable(ubyte)[] data2 = [0xA1, 0xB2, 0xC3, 0xD4]; So basically this feature gains us nothing.
That is not the same. Array literals are not the same as string literals, they have an implicit .dup. See my recent thread on this issue (which unfortunately seems to have died without a resolution; people got hung up about trailing null characters without apparently noticing the more important issue of the dup).
I don't see how that detail is relevant to this discussion as I was not arguing against string literals or array literals in general. We can still have both (assuming the code points are valid...):

string foo = "\ua1\ub2\uc3"; // no .dup

and:

ubyte[3] goo = [0xa1, 0xb2, 0xc3]; // implicit .dup
Oct 18 2012
parent reply Don Clugston <dac nospam.com> writes:
On 18/10/12 17:43, foobar wrote:
 On Thursday, 18 October 2012 at 14:29:57 UTC, Don Clugston wrote:
 On 18/10/12 10:58, foobar wrote:
 On Thursday, 18 October 2012 at 02:47:42 UTC, H. S. Teoh wrote:
 On Thu, Oct 18, 2012 at 02:45:10AM +0200, bearophile wrote:
 [...]
 hex strings are useful, but I think they were invented in D1 when
 strings were convertible to char[]. But today they are an array of
 immutable UFT-8, so I think this default type is not so useful:

 void main() {
    string data1 = x"A1 B2 C3 D4"; // OK
    immutable(ubyte)[] data2 = x"A1 B2 C3 D4"; // error
 }


 test.d(3): Error: cannot implicitly convert expression
 ("\xa1\xb2\xc3\xd4") of type string to ubyte[]
[...] Yeah I think hex strings would be better as ubyte[] by default. More generally, though, I think *both* of the above lines should be equally accepted. If you write x"A1 B2 C3" in the context of initializing a string, then the compiler should infer the type of the literal as string, and if the same literal occurs in the context of, say, passing a ubyte[], then its type should be inferred as ubyte[], NOT string. T
IMO, this is a redundant feature that complicates the language for no benefit and should be deprecated. strings already have an escape sequence for specifying code-points "\u" and for ubyte arrays you can simply use: immutable(ubyte)[] data2 = [0xA1, 0xB2, 0xC3, 0xD4]; So basically this feature gains us nothing.
That is not the same. Array literals are not the same as string literals, they have an implicit .dup. See my recent thread on this issue (which unfortunately seems to have died without a resolution; people got hung up about trailing null characters without apparently noticing the more important issue of the dup).
I don't see how that detail is relevant to this discussion as I was not arguing against string literals or array literals in general. We can still have both (assuming the code points are valid...): string foo = "\ua1\ub2\uc3"; // no .dup
That doesn't compile. Error: escape hex sequence has 2 hex digits instead of 4
Oct 19 2012
parent reply "foobar" <foo bar.com> writes:
On Friday, 19 October 2012 at 13:19:09 UTC, Don Clugston wrote:
 We can still have both (assuming the code points are valid...):
 string foo = "\ua1\ub2\uc3"; // no .dup
That doesn't compile. Error: escape hex sequence has 2 hex digits instead of 4
Come on, "assuming the code points are valid". It says so 4 lines above!
Oct 19 2012
parent reply Don Clugston <dac nospam.com> writes:
On 19/10/12 16:07, foobar wrote:
 On Friday, 19 October 2012 at 13:19:09 UTC, Don Clugston wrote:
 We can still have both (assuming the code points are valid...):
 string foo = "\ua1\ub2\uc3"; // no .dup
That doesn't compile. Error: escape hex sequence has 2 hex digits instead of 4
Come on, "assuming the code points are valid". It says so 4 lines above!
It isn't the same. Hex strings are the raw bytes, e.g. UTF-8 code units (i.e. they include the high bits that indicate the length of each char). \u makes dchars. "\u00A1" is not the same as x"A1", nor is it x"00 A1". It's two non-zero bytes.
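To make the point above concrete, here is a minimal sketch (the helper name rawBytes is made up for illustration, it is not a Phobos function). It only looks at the bytes that end up in the string, assuming the usual UTF-8 encoding of U+00A1:

```d
// Hypothetical helper: view the UTF-8 code units of a string as raw bytes (no copy).
immutable(ubyte)[] rawBytes(string s)
{
    return cast(immutable(ubyte)[]) s;
}

void main()
{
    // U+00A1 is one code point, but UTF-8 stores it as two code units: C2 A1.
    auto bytes = rawBytes("\u00A1");
    assert(bytes.length == 2);
    assert(bytes[0] == 0xC2 && bytes[1] == 0xA1);

    // \x inserts the byte as written, so this is a single byte 0xA1.
    assert(rawBytes("\xA1").length == 1);
    assert(rawBytes("\xA1")[0] == 0xA1);
}
```

So "\u00A1" and a literal holding the single byte A1 really are different data, which is the distinction being argued here.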
Oct 19 2012
parent reply "foobar" <foo bar.com> writes:
On Friday, 19 October 2012 at 15:07:44 UTC, Don Clugston wrote:
 On 19/10/12 16:07, foobar wrote:
 On Friday, 19 October 2012 at 13:19:09 UTC, Don Clugston wrote:
 We can still have both (assuming the code points are 
 valid...):
 string foo = "\ua1\ub2\uc3"; // no .dup
That doesn't compile. Error: escape hex sequence has 2 hex digits instead of 4
Come on, "assuming the code points are valid". It says so 4 lines above!
It isn't the same. Hex strings are the raw bytes, eg UTF8 code points. (ie, it includes the high bits that indicate the length of each char). \u makes dchars. "\u00A1" is not the same as x"A1" nor is it x"00 A1". It's two non-zero bytes.
Yes, \u requires code points and not code units of a specific UTF encoding, and you are correct in pointing out that it takes four hex digits and not two. This is a very reasonable choice to prevent/reduce Unicode encoding errors.

http://dlang.org/lex.html#HexString states: "Hex strings allow string literals to be created using hex data. The hex data need not form valid UTF characters."

I _already_ said that I consider this a major semantic bug, as it violates the principle of least surprise: the programmer's expectation that the D string types, which are Unicode according to the spec, actually contain _valid_ Unicode and _not_ arbitrary binary data.

Given the above, the design of \u makes perfect sense for _strings_ - you can use _valid_ code points (not code units) in hex form.

General-purpose binary data (i.e. _not_ UTF-encoded Unicode text), as I also _already_ said, should IMO be stored either as ubyte[] or, better yet, as its own type that ensures the correct invariants for the data, be it audio, video, or just a different text encoding.

In neither case is the hex string relevant IMO. In the former it potentially violates the type's invariant, and in the latter we already have array literals.

Using a malformed _string_ to initialize a ubyte[] is IMO simply less readable. What did that article call such features, "WAT"?
Oct 19 2012
next sibling parent "foobar" <foo bar.com> writes:
On Friday, 19 October 2012 at 18:46:07 UTC, foobar wrote:
 On Friday, 19 October 2012 at 15:07:44 UTC, Don Clugston wrote:
 On 19/10/12 16:07, foobar wrote:
 On Friday, 19 October 2012 at 13:19:09 UTC, Don Clugston 
 wrote:
 We can still have both (assuming the code points are 
 valid...):
 string foo = "\ua1\ub2\uc3"; // no .dup
That doesn't compile. Error: escape hex sequence has 2 hex digits instead of 4
Come on, "assuming the code points are valid". It says so 4 lines above!
It isn't the same. Hex strings are the raw bytes, eg UTF8 code points. (ie, it includes the high bits that indicate the length of each char). \u makes dchars. "\u00A1" is not the same as x"A1" nor is it x"00 A1". It's two non-zero bytes.
Yes, the \u requires code points and not code-units for a specific UTF encoding, which you are correct in pointing out are four hex digits and not two. This is a very reasonable choice to prevent/reduce Unicode encoding errors. http://dlang.org/lex.html#HexString states: "Hex strings allow string literals to be created using hex data. The hex data need not form valid UTF characters." I _already_ said that I consider this a major semantic bug as it violates the principle of least surprise - the programmer's expectation that the D string types which are Unicode according to the spec to, well, actually contain _valid_ Unicode and _not_ arbitrary binary data. Given the above, the design of \u makes perfect sense for _strings_ - you can use _valid_ code-points (not code units) in hex form. For general purpose binary data (i.e. _not_ UTF encoded Unicode text) I also _already_ said IMO should be either stored as ubyte[] or better yet their own types that would ensure the correct invariants for the data type, be it audio, video, or just a different text encoding. In neither case the hex-string is relevant IMO. In the former it potentially violates the type's invariant and in the latter we already have array literals. Using a malformed _string_ to initialize ubyte[] IMO is simply less readable. How did that article call such features, "WAT"?
I just re-checked, and to clarify: string literals support _three_ hex escape sequences:

\x__ - a single byte
\u____ - two bytes
\U________ - four bytes

So raw bytes _can_ be directly specified, and I hope the compiler still verifies that the string literal is valid Unicode.
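A quick way to check the "is it still verified?" question is to build a string from \x escapes and run it through std.utf.validate. This is only a sketch: isValidUtf8 is a made-up wrapper, and it assumes (as the error message earlier in the thread suggests) that the compiler accepts such a literal as a string without validating it:

```d
import std.utf : UTFException, validate;

// Hypothetical wrapper: true if the string decodes as well-formed UTF-8.
bool isValidUtf8(string s)
{
    try
    {
        validate(s);   // throws UTFException on malformed input
        return true;
    }
    catch (UTFException)
    {
        return false;
    }
}

void main()
{
    // Four raw bytes written with \x escapes; 0xA1 cannot start a UTF-8 sequence.
    string raw = "\xA1\xB2\xC3\xD4";
    assert(raw.length == 4);
    assert(!isValidUtf8(raw));

    // The same code point written with \u is encoded properly (C2 A1).
    assert(isValidUtf8("\u00A1"));
}
```

If that assumption holds, \x already lets arbitrary bytes into a string, so any validation has to happen at run time rather than in the literal itself.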
Oct 19 2012
prev sibling parent Nick Sabalausky <SeeWebsiteToContactMe semitwist.com> writes:
On Fri, 19 Oct 2012 20:46:06 +0200
 
 For general purpose binary data (i.e. _not_ UTF encoded Unicode 
 text) I also _already_ said IMO should be either stored as 
 ubyte[]
Problem is, x"..." is FAR better syntax for that.
 or better yet their own types that would ensure the 
 correct invariants for the data type, be it audio, video, or just 
 a different text encoding.
Using x"..." doesn't prevent anyone from doing that: auto a = SomeAudioType(x"...");
 
 In neither case the hex-string is relevant IMO. In the former it 
 potentially violates the type's invariant and in the latter we 
 already have array literals.
 
 Using a malformed _string_ to initialize ubyte[] IMO is simply 
 less readable. How did that article call such features, "WAT"?
The only thing ridiculous about x"..." is that somewhere along the line it was decided that it must be a string instead of the arbitrary binary data that it *is*.
Oct 20 2012
prev sibling parent reply Denis Shelomovskij <verylonglogin.reg gmail.com> writes:
18.10.2012 12:58, foobar writes:
 IMO, this is a redundant feature that complicates the language for no
 benefit and should be deprecated.
 strings already have an escape sequence for specifying code-points "\u"
 and for ubyte arrays you can simply use:
 immutable(ubyte)[] data2 = [0xA1, 0xB2, 0xC3, 0xD4];

 So basically this feature gains us nothing.
Maybe. Just an example of real-world code:

Arrays:
https://github.com/D-Programming-Language/druntime/blob/fc45de1d089a1025df60ee2eea66ba27ee0bd99c/src/core/sys/windows/dll.d#L110

vs

Hex strings:
https://github.com/denis-sh/hooking/blob/69105a24d77fcb6eca701282a16dd5ec7311c077/tlsfixer/ntdll.d#L130

By the way, current code isn't affected by the topic issue.

-- 
Денис В. Шеломовский
Denis V. Shelomovskij
Oct 20 2012
parent reply "foobar" <foo bar.com> writes:
On Saturday, 20 October 2012 at 10:51:25 UTC, Denis Shelomovskij 
wrote:
 18.10.2012 12:58, foobar writes:
 IMO, this is a redundant feature that complicates the language 
 for no
 benefit and should be deprecated.
 strings already have an escape sequence for specifying 
 code-points "\u"
 and for ubyte arrays you can simply use:
 immutable(ubyte)[] data2 = [0xA1, 0xB2, 0xC3, 0xD4];

 So basically this feature gains us nothing.
Maybe. Just an example of a real world code: Arrays: https://github.com/D-Programming-Language/druntime/blob/fc45de1d089a1025df60ee2eea66ba27ee0bd99c/src/core/sys/windows/dll.d#L110 vs Hex strings: https://github.com/denis-sh/hooking/blob/69105a24d77fcb6eca701282a16dd5ec7311c077/tlsfixer/ntdll.d#L130 By the way, current code isn't affected by the topic issue.
I personally find the former more readable, but I guess there will always be someone who disagrees. As they say, YMMV.
Oct 20 2012
parent reply Nick Sabalausky <SeeWebsiteToContactMe semitwist.com> writes:
On Sat, 20 Oct 2012 14:59:27 +0200
"foobar" <foo bar.com> wrote:
 On Saturday, 20 October 2012 at 10:51:25 UTC, Denis Shelomovskij 
 wrote:
 Maybe. Just an example of a real world code:

 Arrays:
 https://github.com/D-Programming-Language/druntime/blob/fc45de1d089a1025df60ee2eea66ba27ee0bd99c/src/core/sys/windows/dll.d#L110

 vs

 Hex strings:
 https://github.com/denis-sh/hooking/blob/69105a24d77fcb6eca701282a16dd5ec7311c077/tlsfixer/ntdll.d#L130

 By the way, current code isn't affected by the topic issue.
I personally find the former more readable but I guess there would always be someone to disagree. As the say, YMMV.
Honestly, I can't imagine how anyone wouldn't find the latter vastly more readable.
Oct 20 2012
parent reply "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Sat, Oct 20, 2012 at 04:39:28PM -0400, Nick Sabalausky wrote:
 On Sat, 20 Oct 2012 14:59:27 +0200
 "foobar" <foo bar.com> wrote:
 On Saturday, 20 October 2012 at 10:51:25 UTC, Denis Shelomovskij
 wrote:
 Maybe. Just an example of a real world code:

 Arrays:
 https://github.com/D-Programming-Language/druntime/blob/fc45de1d089a1025df60ee2eea66ba27ee0bd99c/src/core/sys/windows/dll.d#L110

 vs

 Hex strings:
 https://github.com/denis-sh/hooking/blob/69105a24d77fcb6eca701282a16dd5ec7311c077/tlsfixer/ntdll.d#L130

 By the way, current code isn't affected by the topic issue.
I personally find the former more readable but I guess there would always be someone to disagree. As the say, YMMV.
Honestly, I can't imagine how anyone wouldn't find the latter vastly more readable.
If you want vastly human readable, you want heredoc hex syntax, something like this:

	ubyte[] = x"<<END
	32 2b 32 3d 34 2e 20 32 2a 32 3d 34 2e 20 32 5e
	32 3d 34 2e 20 54 68 65 72 65 66 6f 72 65 2c 20
	2b 2c 20 2a 2c 20 61 6e 64 20 5e 20 61 72 65 20
	74 68 65 20 73 61 6d 65 20 6f 70 65 72 61 74 69
	6f 6e 2e 0a 22 36 34 30 4b 20 6f 75 67 68 74 20
	74 6f 20 62 65 20 65 6e 6f 75 67 68 22 20 2d 2d
	20 42 69 6c 6c 20 47 2e 2c 20 31 39 38 34 2e 20
	22 54 68 65 20 49 6e 74 65 72 6e 65 74 20 69 73
	20 6e 6f 74 20 61 20 70 72 69 6d 61 72 79 20 67
	6f 61 6c 20 66 6f 72 20 50 43 20 75 73 61 67 65
	END";

(I just made that syntax up, so the details are not final, but you get the idea.) I would propose supporting this in D, but then D already has way too many different ways of writing strings, some of questionable utility, so I will refrain.

Of course, the above syntax might actually be implementable with a suitable mixin template that takes a compile-time string. Maybe we should lobby for such a template to go into Phobos -- that might motivate people to fix CTFE in dmd so that it doesn't consume unreasonable amounts of memory when the size of CTFE input gets moderately large (see other recent thread on this topic).

T

-- 
Without effort, you won't even pull a fish out of the pond.
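The template idea above could look roughly like the sketch below. Everything here is an assumption for illustration: parseHex and hexBytes are invented names, not Phobos symbols, and a plain eponymous template over a CTFE-able function stands in for the "mixin template" mentioned in the post:

```d
// Hypothetical helper: turn whitespace-separated hex digit pairs into bytes.
// Written so that it can run during CTFE.
ubyte[] parseHex(string text) pure @safe
{
    ubyte[] result;
    int high = -1;                      // pending high nibble, -1 if none
    foreach (char c; text)
    {
        int digit;
        if (c >= '0' && c <= '9')      digit = c - '0';
        else if (c >= 'a' && c <= 'f') digit = c - 'a' + 10;
        else if (c >= 'A' && c <= 'F') digit = c - 'A' + 10;
        else if (c == ' ' || c == '\t' || c == '\r' || c == '\n') continue;
        else assert(0, "invalid character in hex data");

        if (high < 0)
            high = digit;
        else
        {
            result ~= cast(ubyte)(high * 16 + digit);
            high = -1;
        }
    }
    assert(high < 0, "odd number of hex digits");
    return result;
}

// Hypothetical template: the static initializer forces evaluation at compile time.
template hexBytes(string text)
{
    immutable ubyte[] hexBytes = parseHex(text);
}

void main()
{
    // Embedded newlines are fine because the argument is an ordinary string literal.
    immutable(ubyte)[] data = hexBytes!"32 2b 32
                                        3d 34";
    assert(cast(string) data == "2+2=4");
}
```

The result is typed as immutable(ubyte)[] rather than string, so no cast is needed at the use site; whether the memory use at compile time is acceptable for large inputs is exactly the CTFE concern raised above.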
Oct 20 2012
next sibling parent reply "foobar" <foo bar.com> writes:
On Saturday, 20 October 2012 at 21:03:20 UTC, H. S. Teoh wrote:
 On Sat, Oct 20, 2012 at 04:39:28PM -0400, Nick Sabalausky wrote:
 On Sat, 20 Oct 2012 14:59:27 +0200
 "foobar" <foo bar.com> wrote:
 On Saturday, 20 October 2012 at 10:51:25 UTC, Denis 
 Shelomovskij
 wrote:
 Maybe. Just an example of a real world code:

 Arrays:
 https://github.com/D-Programming-Language/druntime/blob/fc45de1d089a1025df60ee2eea66ba27ee0bd99c/src/core/sys/windows/dll.d#L110

 vs

 Hex strings:
 https://github.com/denis-sh/hooking/blob/69105a24d77fcb6eca701282a16dd5ec7311c077/tlsfixer/ntdll.d#L130

 By the way, current code isn't affected by the topic issue.
I personally find the former more readable but I guess there would always be someone to disagree. As the say, YMMV.
Honestly, I can't imagine how anyone wouldn't find the latter vastly more readable.
If you want vastly human readable, you want heredoc hex syntax, something like this: ubyte[] = x"<<END 32 2b 32 3d 34 2e 20 32 2a 32 3d 34 2e 20 32 5e 32 3d 34 2e 20 54 68 65 72 65 66 6f 72 65 2c 20 2b 2c 20 2a 2c 20 61 6e 64 20 5e 20 61 72 65 20 74 68 65 20 73 61 6d 65 20 6f 70 65 72 61 74 69 6f 6e 2e 0a 22 36 34 30 4b 20 6f 75 67 68 74 20 74 6f 20 62 65 20 65 6e 6f 75 67 68 22 20 2d 2d 20 42 69 6c 6c 20 47 2e 2c 20 31 39 38 34 2e 20 22 54 68 65 20 49 6e 74 65 72 6e 65 74 20 69 73 20 6e 6f 74 20 61 20 70 72 69 6d 61 72 79 20 67 6f 61 6c 20 66 6f 72 20 50 43 20 75 73 61 67 65 END"; (I just made that syntax up, so the details are not final, but you get the idea.) I would propose supporting this in D, but then D already has way too many different ways of writing strings, some of questionable utility, so I will refrain. Of course, the above syntax might actually be implementable with a suitable mixin template that takes a compile-time string. Maybe we should lobby for such a template to go into Phobos -- that might motivate people to fix CTFE in dmd so that it doesn't consume unreasonable amounts of memory when the size of CTFE input gets moderately large (see other recent thread on this topic). T
Yeah, I like this. I'd prefer brackets over quotes but it's not a big dig as the quotes in the above are not very noticeable. It should look distinct from textual strings. As you said, this could/should be implemented as a template. Vote++
Oct 20 2012
parent "foobar" <foo bar.com> writes:
On Saturday, 20 October 2012 at 21:16:44 UTC, foobar wrote:
 On Saturday, 20 October 2012 at 21:03:20 UTC, H. S. Teoh wrote:
 On Sat, Oct 20, 2012 at 04:39:28PM -0400, Nick Sabalausky 
 wrote:
 On Sat, 20 Oct 2012 14:59:27 +0200
 "foobar" <foo bar.com> wrote:
 On Saturday, 20 October 2012 at 10:51:25 UTC, Denis 
 Shelomovskij
 wrote:
 Maybe. Just an example of a real world code:

 Arrays:
 https://github.com/D-Programming-Language/druntime/blob/fc45de1d089a1025df60ee2eea66ba27ee0bd99c/src/core/sys/windows/dll.d#L110

 vs

 Hex strings:
 https://github.com/denis-sh/hooking/blob/69105a24d77fcb6eca701282a16dd5ec7311c077/tlsfixer/ntdll.d#L130

 By the way, current code isn't affected by the topic 
 issue.
I personally find the former more readable but I guess there would always be someone to disagree. As the say, YMMV.
Honestly, I can't imagine how anyone wouldn't find the latter vastly more readable.
If you want vastly human readable, you want heredoc hex syntax, something like this: ubyte[] = x"<<END 32 2b 32 3d 34 2e 20 32 2a 32 3d 34 2e 20 32 5e 32 3d 34 2e 20 54 68 65 72 65 66 6f 72 65 2c 20 2b 2c 20 2a 2c 20 61 6e 64 20 5e 20 61 72 65 20 74 68 65 20 73 61 6d 65 20 6f 70 65 72 61 74 69 6f 6e 2e 0a 22 36 34 30 4b 20 6f 75 67 68 74 20 74 6f 20 62 65 20 65 6e 6f 75 67 68 22 20 2d 2d 20 42 69 6c 6c 20 47 2e 2c 20 31 39 38 34 2e 20 22 54 68 65 20 49 6e 74 65 72 6e 65 74 20 69 73 20 6e 6f 74 20 61 20 70 72 69 6d 61 72 79 20 67 6f 61 6c 20 66 6f 72 20 50 43 20 75 73 61 67 65 END"; (I just made that syntax up, so the details are not final, but you get the idea.) I would propose supporting this in D, but then D already has way too many different ways of writing strings, some of questionable utility, so I will refrain. Of course, the above syntax might actually be implementable with a suitable mixin template that takes a compile-time string. Maybe we should lobby for such a template to go into Phobos -- that might motivate people to fix CTFE in dmd so that it doesn't consume unreasonable amounts of memory when the size of CTFE input gets moderately large (see other recent thread on this topic). T
Yeah, I like this. I'd prefer brackets over quotes but it not a big dig as the qoutes in the above are not very noticeable. It should look distinct from textual strings. As you said, this could/should be implemented as a template. Vote++
** not a big deal
Oct 20 2012
prev sibling next sibling parent Nick Sabalausky <SeeWebsiteToContactMe semitwist.com> writes:
On Sat, 20 Oct 2012 14:05:21 -0700
"H. S. Teoh" <hsteoh quickfur.ath.cx> wrote:

 On Sat, Oct 20, 2012 at 04:39:28PM -0400, Nick Sabalausky wrote:
 On Sat, 20 Oct 2012 14:59:27 +0200
 "foobar" <foo bar.com> wrote:
 On Saturday, 20 October 2012 at 10:51:25 UTC, Denis Shelomovskij
 wrote:
 Maybe. Just an example of a real world code:

 Arrays:
 https://github.com/D-Programming-Language/druntime/blob/fc45de1d089a1025df60ee2eea66ba27ee0bd99c/src/core/sys/windows/dll.d#L110

 vs

 Hex strings:
 https://github.com/denis-sh/hooking/blob/69105a24d77fcb6eca701282a16dd5ec7311c077/tlsfixer/ntdll.d#L130

 By the way, current code isn't affected by the topic issue.
I personally find the former more readable but I guess there would always be someone to disagree. As the say, YMMV.
Honestly, I can't imagine how anyone wouldn't find the latter vastly more readable.
If you want vastly human readable, you want heredoc hex syntax, something like this: ubyte[] = x"<<END 32 2b 32 3d 34 2e 20 32 2a 32 3d 34 2e 20 32 5e 32 3d 34 2e 20 54 68 65 72 65 66 6f 72 65 2c 20 2b 2c 20 2a 2c 20 61 6e 64 20 5e 20 61 72 65 20 74 68 65 20 73 61 6d 65 20 6f 70 65 72 61 74 69 6f 6e 2e 0a 22 36 34 30 4b 20 6f 75 67 68 74 20 74 6f 20 62 65 20 65 6e 6f 75 67 68 22 20 2d 2d 20 42 69 6c 6c 20 47 2e 2c 20 31 39 38 34 2e 20 22 54 68 65 20 49 6e 74 65 72 6e 65 74 20 69 73 20 6e 6f 74 20 61 20 70 72 69 6d 61 72 79 20 67 6f 61 6c 20 66 6f 72 20 50 43 20 75 73 61 67 65 END"; (I just made that syntax up, so the details are not final, but you get the idea.) I would propose supporting this in D, but then D already has way too many different ways of writing strings, some of questionable utility, so I will refrain. Of course, the above syntax might actually be implementable with a suitable mixin template that takes a compile-time string. Maybe we should lobby for such a template to go into Phobos -- that might motivate people to fix CTFE in dmd so that it doesn't consume unreasonable amounts of memory when the size of CTFE input gets moderately large (see other recent thread on this topic).
Can't you already just do this?:

auto blah = x"
32 2b 32 3d 34 2e 20 32 2a 32 3d 34 2e 20 32 5e
32 3d 34 2e 20 54 68 65 72 65 66 6f 72 65 2c 20
2b 2c 20 2a 2c 20 61 6e 64 20 5e 20 61 72 65 20
74 68 65 20 73 61 6d 65 20 6f 70 65 72 61 74 69
6f 6e 2e 0a 22 36 34 30 4b 20 6f 75 67 68 74 20
74 6f 20 62 65 20 65 6e 6f 75 67 68 22 20 2d 2d
20 42 69 6c 6c 20 47 2e 2c 20 31 39 38 34 2e 20
22 54 68 65 20 49 6e 74 65 72 6e 65 74 20 69 73
20 6e 6f 74 20 61 20 70 72 69 6d 61 72 79 20 67
6f 61 6c 20 66 6f 72 20 50 43 20 75 73 61 67 65
";

I thought all string literals in D accepted embedded newlines?
Oct 20 2012
prev sibling parent reply "Dejan Lekic" <dejan.lekic gmail.com> writes:
 If you want vastly human readable, you want heredoc hex syntax,
 something like this:

 	ubyte[] = x"<<END
 	32 2b 32 3d 34 2e 20 32 2a 32 3d 34 2e 20 32 5e
 	32 3d 34 2e 20 54 68 65 72 65 66 6f 72 65 2c 20
 	2b 2c 20 2a 2c 20 61 6e 64 20 5e 20 61 72 65 20
 	74 68 65 20 73 61 6d 65 20 6f 70 65 72 61 74 69
 	6f 6e 2e 0a 22 36 34 30 4b 20 6f 75 67 68 74 20
 	74 6f 20 62 65 20 65 6e 6f 75 67 68 22 20 2d 2d
 	20 42 69 6c 6c 20 47 2e 2c 20 31 39 38 34 2e 20
 	22 54 68 65 20 49 6e 74 65 72 6e 65 74 20 69 73
 	20 6e 6f 74 20 61 20 70 72 69 6d 61 72 79 20 67
 	6f 61 6c 20 66 6f 72 20 50 43 20 75 73 61 67 65
 	END";
Having a heredoc syntax for hex-strings that produce ubyte[] arrays is confusing for people who would (naturally) expect a string from a heredoc string. It is not named hereDOC for no reason. :)
Oct 22 2012
parent "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Mon, Oct 22, 2012 at 01:14:21PM +0200, Dejan Lekic wrote:
If you want vastly human readable, you want heredoc hex syntax,
something like this:

	ubyte[] = x"<<END
	32 2b 32 3d 34 2e 20 32 2a 32 3d 34 2e 20 32 5e
	32 3d 34 2e 20 54 68 65 72 65 66 6f 72 65 2c 20
	2b 2c 20 2a 2c 20 61 6e 64 20 5e 20 61 72 65 20
	74 68 65 20 73 61 6d 65 20 6f 70 65 72 61 74 69
	6f 6e 2e 0a 22 36 34 30 4b 20 6f 75 67 68 74 20
	74 6f 20 62 65 20 65 6e 6f 75 67 68 22 20 2d 2d
	20 42 69 6c 6c 20 47 2e 2c 20 31 39 38 34 2e 20
	22 54 68 65 20 49 6e 74 65 72 6e 65 74 20 69 73
	20 6e 6f 74 20 61 20 70 72 69 6d 61 72 79 20 67
	6f 61 6c 20 66 6f 72 20 50 43 20 75 73 61 67 65
	END";
Having a heredoc syntax for hex-strings that produce ubyte[] arrays is confusing for people who would (naturally) expect a string from a heredoc string. It is not named hereDOC for no reason. :)
What I meant was, a syntax similar to heredoc, not an actual heredoc, which would be a string. T -- Knowledge is that area of ignorance that we arrange and classify. -- Ambrose Bierce
Oct 22 2012
prev sibling parent reply Nick Sabalausky <SeeWebsiteToContactMe semitwist.com> writes:
On Wed, 17 Oct 2012 19:49:43 -0700
"H. S. Teoh" <hsteoh quickfur.ath.cx> wrote:

 On Thu, Oct 18, 2012 at 02:45:10AM +0200, bearophile wrote:
 [...]
 hex strings are useful, but I think they were invented in D1 when
 strings were convertible to char[]. But today they are an array of
 immutable UFT-8, so I think this default type is not so useful:
 
 void main() {
     string data1 = x"A1 B2 C3 D4"; // OK
     immutable(ubyte)[] data2 = x"A1 B2 C3 D4"; // error
 }
 
 
 test.d(3): Error: cannot implicitly convert expression
 ("\xa1\xb2\xc3\xd4") of type string to ubyte[]
[...] Yeah I think hex strings would be better as ubyte[] by default. More generally, though, I think *both* of the above lines should be equally accepted. If you write x"A1 B2 C3" in the context of initializing a string, then the compiler should infer the type of the literal as string, and if the same literal occurs in the context of, say, passing a ubyte[], then its type should be inferred as ubyte[], NOT string.
Big +1

Having the language expect x"..." to always be a string (let alone a *valid UTF* string) is just insane. It's just too damn useful for arbitrary binary data.
Oct 18 2012
parent "bearophile" <bearophileHUGS lycos.com> writes:
Nick Sabalausky:

 Big +1

 Having the language expect x"..." to always be a string (let 
 alone a *valid UTF* string) is just insane. It's just too
 damn useful for arbitrary binary data.
I'd like an opinion on such topics from one of the D bosses :-) Bye, bearophile
Oct 18 2012
prev sibling next sibling parent "monarch_dodra" <monarchdodra gmail.com> writes:
On Thursday, 18 October 2012 at 00:45:12 UTC, bearophile wrote:
 (Repost)

 hex strings are useful, but I think they were invented in D1 
 when strings were convertible to char[]. But today they are an 
 array of immutable UFT-8, so I think this default type is not 
 so useful:

 void main() {
     string data1 = x"A1 B2 C3 D4"; // OK
     immutable(ubyte)[] data2 = x"A1 B2 C3 D4"; // error
 }


 test.d(3): Error: cannot implicitly convert expression 
 ("\xa1\xb2\xc3\xd4") of type string to ubyte[]

 [SNIP]

 Bye,
 bearophile
The conversion can't be done *implicitly*, but you can still get your code to compile:

//----
void main() {
    immutable(ubyte)[] data2 = cast(immutable(ubyte)[]) x"A1 B2 C3 D4"; // OK!
}
//----

It's a bit ugly, and I agree it should work natively, but it is a workaround.
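The same cast also works on a library-generated string. As far as I know, later Phobos releases provide a std.conv.hexString template that accepts the same digit syntax; the sketch below assumes that template is available in your Phobos version, and the function name payload is made up:

```d
import std.conv : hexString;

// Hypothetical accessor: the bytes come from a compile-time string,
// reinterpreted as immutable bytes with an explicit cast (no copy).
immutable(ubyte)[] payload()
{
    return cast(immutable(ubyte)[]) hexString!"A1 B2 C3 D4";
}

void main()
{
    auto data = payload();
    assert(data.length == 4);
    assert(data[0] == 0xA1 && data[3] == 0xD4);
}
```

Because the target is immutable(ubyte)[], no .dup is involved; a mutable ubyte[] would still need a copy, as in the first post.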
Oct 18 2012
prev sibling next sibling parent "Dejan Lekic" <dejan.lekic gmail.com> writes:
On Thursday, 18 October 2012 at 00:45:12 UTC, bearophile wrote:
 (Repost)

 hex strings are useful, but I think they were invented in D1 
 when strings were convertible to char[]. But today they are an 
 array of immutable UFT-8, so I think this default type is not 
 so useful:

 void main() {
     string data1 = x"A1 B2 C3 D4"; // OK
     immutable(ubyte)[] data2 = x"A1 B2 C3 D4"; // error
 }


 test.d(3): Error: cannot implicitly convert expression 
 ("\xa1\xb2\xc3\xd4") of type string to ubyte[]


 Generally I want to use hex strings to put binary data in a 
 program, so usually it's a ubyte[] or uint[].

 So I have to use something like:

 auto data3 = cast(ubyte[])(x"A1 B2 C3 D4".dup);


 So maybe the following literals are more useful in D2:

 ubyte[] data4 = x[A1 B2 C3 D4];
 uint[]  data5 = x[A1 B2 C3 D4];
 ulong[] data6 = x[A1 B2 C3 D4 A1 B2 C3 D4];

 Bye,
 bearophile
+1 on this one I also like the x[ ... ] literal because it makes it obvious that we are dealing with an array.
Oct 22 2012
prev sibling parent "Simen Kjaeraas" <simen.kjaras gmail.com> writes:
On 2012-45-18 02:10, bearophile <bearophileHUGS lycos.com> wrote:

 So maybe the following literals are more useful in D2:

 ubyte[] data4 = x[A1 B2 C3 D4];
 uint[]  data5 = x[A1 B2 C3 D4];
 ulong[] data6 = x[A1 B2 C3 D4 A1 B2 C3 D4];
That syntax is already taken, though. Still, I see no reason for x"..." not to return ubyte[]. -- Simen
Oct 22 2012