
digitalmars.D.learn - understanding string suffixes

reply Manfred Nowak <svv1999 hotmail.com> writes:
What are the default suffixes depending on the BOM of the source?

What is the meaning of a c-suffix in a UTF-32 source?

-manfred
Aug 13 2005
parent reply Derek Parnell <derek psych.ward> writes:
On Sat, 13 Aug 2005 07:20:01 +0000 (UTC), Manfred Nowak wrote:

 What are the default suffixes depending on the bom of the source?
 
 What is the meaning of a c-suffix in an utf32 source?

I don't believe that the string literal suffixes are affected in any way by the source code encoding scheme. I think that "qwerty"c is formed as a UTF-8 string in RAM by the compiler regardless of the UTF encoding of the source.

-- 
Derek Parnell
Melbourne, Australia
13/08/2005 11:26:11 PM
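A minimal sketch (assuming a D1-era dmd, where string literals are plain mutable arrays) of how the postfix, not the source encoding, fixes the literal's in-memory type:

```d
// The postfix selects the encoding of the literal in RAM,
// regardless of whether this file is saved as UTF-8, UTF-16 or UTF-32.
void main()
{
    char[]  c = "qwerty"c;  // UTF-8 code units
    wchar[] w = "qwerty"w;  // UTF-16 code units
    dchar[] d = "qwerty"d;  // UTF-32 code units
}
```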
Aug 13 2005
parent reply Manfred Nowak <svv1999 hotmail.com> writes:
Derek Parnell <derek psych.ward> wrote:
 
 What are the default suffixes depending on the bom of the
 source? 
 
 What is the meaning of a c-suffix in an utf32 source?

I don't believe that the string literal suffixes are affected in any way by the source code encoding scheme.

True, true. I never thought that the meaning of a supplied suffix might change depending on the source code encoding scheme. But the specs state:

| The optional Postfix character gives a specific type to the
| string, rather than it being inferred from the context. This is
| useful when the type cannot be unambiguously inferred, such as
| when overloading based on string type.

But when is the type of a string ambiguous? The BOM, whether missing or present, always supplies a context for every string in the source:

  missing BOM    c
  UTF8-BOM       ? (probably c)
  UTF16-BOM      w
  UTF32-BOM      d

Therefore a string literal s in a UTF-32 source is equivalent to the same string literal followed by the d-suffix: sd.

I have done some tests and found that a valid UTF-32 code in a string literal suffixed with w throws an error, because it is not a legal UTF-16 code. So the w-suffix, at least, denotes not only a type but also a check of the semantic correctness of the string literal's content.
 I think that "qwerty"c  is formed as a UTF8 string in RAM by the
 compiler regardless of UTF encoding of the source. 

UTF8? Because it is indistinguishable from ASCII in this case?

-manfred
Aug 14 2005
parent reply Derek Parnell <derek psych.ward> writes:
On Sun, 14 Aug 2005 12:17:44 +0000 (UTC), Manfred Nowak wrote:

 Derek Parnell <derek psych.ward> wrote:
  

[snip]
 But when is the type of a string ambiguous? 

The ambiguity is not in the encoding of the source text but in the way that a string literal is used when matching function signatures. Given ...

  void func(char[] x) { . . . }

  func( "some string" );

There is no problem so far, as there is only one possible match, but add this ...

  void func(dchar[] x) { . . . }

And now there is an ambiguity. It is in this situation that string literal suffixes are useful. We need to do ...

  func( "some string"c );

or, before suffixes,

  func( cast(char[]) "some string" );

-- 
Derek Parnell
Melbourne, Australia
14/08/2005 10:54:46 PM
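Derek's two-overload scenario can be assembled into one complete sketch (again assuming a D1-era dmd, where "..."c yields a char[]):

```d
void func(char[]  x) { /* handle UTF-8 */ }
void func(dchar[] x) { /* handle UTF-32 */ }

void main()
{
    // func("some string");            // error: matches both overloads
    func("some string"c);              // unambiguous: char[] overload
    func(cast(char[]) "some string");  // pre-suffix workaround, same effect
}
```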
Aug 14 2005
parent reply Manfred Nowak <svv1999 hotmail.com> writes:
Derek Parnell <derek psych.ward> wrote:

[...]
 but add this ...
 
  void func(dchar[] x) { . . . }
 
 And now there is an ambiguity.

Ouch. Now I see that the old story on string literals has been covered with a fig leaf excuse.

A source containing an overloaded function like

  void func( char[] s){}
  void func( wchar[] s){}
  void func( dchar[] s){}

and a call with an unsuffixed string literal like

  func( "SomeString");

is unambiguously solvable by looking at the BOM of the source file, as I have already mentioned in the foregoing post: in an ASCII source the char[] overload has to be used, whereas in a UTF-32 source the dchar[] overload has to be used. What else would be more natural?

"Hey dear Chinese, you have written all your strings in this UTF-32 source in Chinese letters, but please assure your D compiler that you really meant to write Chinese letters by appending the d-suffix to all your strings!"?

Nope. No Chinese should be forced to act this way. However, if a string in his source is not a UTF-32 string, he can now use the c- or d-suffix.

Of course this would also imply that a UTF-32 source may show severe behaviour changes if the BOM is changed.

There is one more problem I do not understand: what will happen with the call

  func( "\u00001111"d "qwerty"c);

Is this ambiguous?

-manfred
Aug 14 2005
next sibling parent reply Derek Parnell <derek psych.ward> writes:
On Sun, 14 Aug 2005 14:12:21 +0000 (UTC), Manfred Nowak wrote:

 Derek Parnell <derek psych.ward> wrote:
 
 [...]
 but add this ...
 
  void func(dchar[] x) { . . . }
 
 And now there is an ambiguity.

 Ouch. Now I see, that the old story on string literals has been
 covered with a fig leaf excuse. A source containing an overloaded
 function like

   void func( char[] s){}
   void func( wchar[] s){}
   void func( dchar[] s){}

 and a call with an unsuffixed string literal like

   func( "SomeString");

 is unambiguously solvable by looking at the BOM of the source file,
 as I have already mentioned in the foregoing post: In an ASCII-source
 the char[]-overload has to be used, whereas in an UTF32-source the
 dchar[]-overload has to be used. What else should be natural?

I can see where you are going with this, but the encoding of the source text should be independent of the interpretation of undecorated string literals. Just because a file is encoded as UTF8 there should be no restriction on me deciding to save the file as UTF16. The compiler should not go choosing which function to call based on how the file just happens to be encoded.
 "Hey dear chinese, you have written all your strings in this UTF32-
 source in chinese letters, but please assure your D-compiler that 
 you really meant to write chinese letters by appending the d-suffix 
 to all your strings!"?

This sounds more like we need to have a pragma that specifies which default encoding we mean to have on the undecorated literals in a specific source text.
 Nope. No chinese should be forced to act this way. However, if a 
 string in his source is not an UTF32-string he now can use the c- 
 or d-suffix.
 
 Of course this would also imply, that an UTF32-source may have 
 severe behaviour changes, if the BOM is changed.

Exactly, so we should avoid this trap. Keep the default encoding as UTF8, but I still think that a pragma would be a good (and easy to implement) idea.
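No such pragma exists in D; purely as a hypothetical sketch of Derek's suggestion, with an invented pragma name:

```d
// HYPOTHETICAL — dmd has no such pragma. It would set the default type
// of undecorated string literals for this module only, so the choice is
// explicit in the source text rather than inferred from the BOM.
pragma(defaultLiteralType, dchar);

void func(char[]  s) {}
void func(dchar[] s) {}

void main()
{
    func("SomeString");  // would then resolve to the dchar[] overload
}
```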
 There is one more problem I do not understand:
 
 what will now happen with the call:
 
   func( "\u00001111"d "qwerty"c);
 
 Is this ambiguous?

Not to the compiler. If you try this you get the error message:

  mismatched string literal postfixes 'd' and 'c'

-- 
Derek Parnell
Melbourne, Australia
15/08/2005 12:17:33 AM
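The rule can be seen with the concatenation of adjacent literals; a sketch (assuming a D1-era dmd):

```d
void main()
{
    dchar[] ok = "abc"d ~ "def"d;      // fine: postfixes agree
    // auto bad = "\u1111"d "qwerty"c; // error: mismatched string literal
    //                                 // postfixes 'd' and 'c'
}
```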
Aug 14 2005
parent reply Manfred Nowak <svv1999 hotmail.com> writes:
Derek Parnell <derek psych.ward> wrote:

[...]
 the encoding of the
 source text should be independent of the interpretation of
 undecorated string literals. Just because a file is encoded as
 UTF8 there should be no restriction on me deciding to save the
 file as UTF16. The compiler should not go choosing which
 function to call based on how the file just happens to be
 encoded.

Nice example, but does this argument hold in general? How will your embedded string literals be saved to the UTF-16 source by your editor? And once you have changed some of them to real UTF-16 codes, how will your editor save them if you decide to revert to UTF-8?
 This sounds more like we need to have a pragma that specifies
 which default encoding we mean to have on the undecorated
 literals in a specific source text.

Agreed. That might be a solution. [...]
 what will now happen with the call:
 
   func( "\u00001111"d "qwerty"c);
 
 Is this ambiguous?

Not to the compiler. If you try this you get the error message " mismatched string literal postfixes 'd' and 'c' "

Yes. But have you tried any further?

  func( "" ""c); // mismatched string literal postfixes ' ' and 'c'
  func( "" ""d); // mismatched string literal postfixes ' ' and 'd'

So an unsuffixed string literal is compatible with neither c nor d. Why, then, does the compiler complain that undecorated string literals match both char[] and dchar[]?

Are we chasing a phantom because the overloading routine of dmd is broken? vathix and some others have already reported similar problems: digitalmars.D.bugs/3206

-manfred
Aug 14 2005
next sibling parent Manfred Nowak <svv1999 hotmail.com> writes:
Manfred Nowak <svv1999 hotmail.com> wrote:

[...]
 Are we chasing a phantom, because the overloading routine of dmd
 is broken?

digitalmars.D/27069 is the other reference, where a `dchar' was matched by a `char' and a `creal'.

-manfred
Aug 14 2005
prev sibling parent reply Derek Parnell <derek psych.ward> writes:
On Sun, 14 Aug 2005 18:07:45 +0000 (UTC), Manfred Nowak wrote:

 Derek Parnell <derek psych.ward> wrote:
 
 [...]
 the encoding of the
 source text should be independent of the interpretation of
 undecorated string literals. Just because a file is encoded as
 UTF8 there should be no restriction on me deciding to save the
 file as UTF16. The compiler should not go choosing which
 function to call based on how the file just happens to be
 encoded.

Nice example, but is this argument suited in general?

Yes.
 How will your 
 embedded string literals be saved to the UTF16-source by your 
 editor? 

As UTF16 encoded strings, just like the rest of the text. The editor doesn't distinguish between string literals and other text.
And once you have changed some of them to real utf16-codes, 
 how will your editor save them, if you decide to revert to utf8?

"real utf16-codes"? What are these? If the text is encoded in UTF16 then the entire text already contains real UTF16 codes. If I decide to save it as UTF8, then the editor will translate it for me. The way that the text is displayed remains the same, regardless of its encoding. The storage of the text is independent of the way it is displayed and interpreted by the compiler.
 This sounds more like we need to have a pragma that specifies
 which default encoding we mean to have on the undecorated
 literals in a specific source text.

Agreed. That might be a solution. [...]
 what will now happen with the call:
 
   func( "\u00001111"d "qwerty"c);
 
 Is this ambiguous?

Not to the compiler. If you try this you get the error message " mismatched string literal postfixes 'd' and 'c' "

 Yes. But have you tried any further?

   func( "" ""c); // mismatched string literal postfixes ' ' and 'c'
   func( "" ""d); // mismatched string literal postfixes ' ' and 'd'

 Then: an unsuffixed string literal is compatible with neither c nor d.
 So why is it that the compiler complains that undecorated string
 literals match both char[] and dchar[]?

That has got to be a bug.
 Are we chasing a phantom, because the overloading routine of dmd is 
 broken?

It's not broken. It works as documented. ;-)

-- 
Derek Parnell
Melbourne, Australia
15/08/2005 8:02:38 AM
Aug 14 2005
parent "Ben Hinkle" <ben.hinkle gmail.com> writes:
"Derek Parnell" <derek psych.ward> wrote in message 
news:1fki98rnsnkwe$.11rh18dazasty.dlg 40tude.net...
 On Sun, 14 Aug 2005 18:07:45 +0000 (UTC), Manfred Nowak wrote:

 Derek Parnell <derek psych.ward> wrote:

 [...]
 the encoding of the
 source text should be independent of the interpretation of
 undecorated string literals. Just because a file is encoded as
 UTF8 there should be no restriction on me deciding to save the
 file as UTF16. The compiler should not go choosing which
 function to call based on how the file just happens to be
 encoded.

Nice example, but is this argument suited in general?

Yes.
 How will your
 embedded string literals be saved to the UTF16-source by your
 editor?

As UTF16 encoded strings, just like the rest of the text. The editor doesn't distinguish between string literals and other text.
And once you have changed some of them to real utf16-codes,
 how will your editor save them, if you decide to revert to utf8?

"real utf16-codes"? What are these? If the text is encoded in UTF16 then the entire text already contains real UTF16 codes. If I decide to save it as UTF8, then the editor will translate it for me. The way that the text is displayed remains the same, regardless of its encoding. The storage of the text is independent of the way it is displayed and interpreted by the compiler.

Agreed. The c/w/d postfix indicates how to store the string in the obj file, not how it is stored in the source file.
 This sounds more like we need to have a pragma that specifies
 which default encoding we mean to have on the undecorated
 literals in a specific source text.

Agreed. That might be a solution. [...]
 what will now happen with the call:

   func( "\u00001111"d "qwerty"c);

 Is this ambiguous?

Not to the compiler. If you try this you get the error message " mismatched string literal postfixes 'd' and 'c' "

 Yes. But have you tried any further?

   func( "" ""c); // mismatched string literal postfixes ' ' and 'c'
   func( "" ""d); // mismatched string literal postfixes ' ' and 'd'

 Then: an unsuffixed string literal is compatible with neither c nor d.
 So why is it that the compiler complains that undecorated string
 literals match both char[] and dchar[]?

That has got to be a bug.

The current behavior is reasonable to me. It prevents any confusion with func(""c ""): should it keep the c postfix or substitute the "empty" postfix? The current behavior raises an error and makes the user choose.
Aug 14 2005
prev sibling parent reply xs0 <xs0 xs0.com> writes:
Manfred Nowak wrote:
 In an ASCII-source the char[]-overload has to be used, whereas in 
 an UTF32-source the dchar[]-overload has to be used. What else 
 should be natural?

Are you sure that it's a good idea to change the behavior of code based on the encoding of the file? I sure don't think so... That would be as if "123.456" were interpreted either as 123456 or 123.456, depending on your regional settings. A definite disaster :)
 "Hey dear chinese, you have written all your strings in this UTF32-
 source in chinese letters, but please assure your D-compiler that 
 you really meant to write chinese letters by appending the d-suffix 
 to all your strings!"?

Aren't the characters the same in all cases, with just the string type changing?

xs0
Aug 14 2005
parent Manfred Nowak <svv1999 hotmail.com> writes:
xs0 <xs0 xs0.com> wrote:

[...]
 Are you sure that it's a good idea to change behavior of code
 based on the encoding of the file? I sure don't...

What is the code embedded in a file if you do not know the encoding of the file? Please explain. [...]
 That would be like if "123.456" would be interpreted either as
 123456 or 123.456, depending on your regional settings.. A
 definite disaster :) 

... or 123,456. A disaster the Germans are well aware of, because comma and point swap roles when changing from English to German notation. [...]
 Aren't the characters the same in all cases, just the string
 type changes? 

I might get you wrong, but why should the string literal consisting of the one-letter d-string for "true" be the same as the characters "true" as a 4-letter c-string?

-manfred
Aug 14 2005