
digitalmars.D.learn - understanding string suffixes

reply Manfred Nowak <svv1999 hotmail.com> writes:
What are the default suffixes depending on the BOM of the source?

What is the meaning of a c-suffix in a UTF-32 source?

-manfred
Aug 13 2005
parent reply Derek Parnell <derek psych.ward> writes:
On Sat, 13 Aug 2005 07:20:01 +0000 (UTC), Manfred Nowak wrote:

 What are the default suffixes depending on the bom of the source?
 
 What is the meaning of a c-suffix in an utf32 source?

I don't believe that the string literal suffixes are affected in any way by the source code encoding scheme. I think that "qwerty"c is formed as a UTF-8 string in RAM by the compiler regardless of the UTF encoding of the source.

-- 
Derek Parnell
Melbourne, Australia
13/08/2005 11:26:11 PM
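A minimal sketch (assuming a D1-era dmd, where string literals are plain mutable arrays) of how the postfix, not the source encoding, fixes the literal's in-memory type:

```d
// The postfix selects the encoding of the literal in RAM,
// regardless of whether this file is saved as UTF-8, UTF-16 or UTF-32.
void main()
{
    char[]  c = "qwerty"c;  // UTF-8 code units
    wchar[] w = "qwerty"w;  // UTF-16 code units
    dchar[] d = "qwerty"d;  // UTF-32 code units
}
```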
Aug 13 2005
parent reply Manfred Nowak <svv1999 hotmail.com> writes:
Derek Parnell <derek psych.ward> wrote:
 
 What are the default suffixes depending on the bom of the
 source? 
 
 What is the meaning of a c-suffix in an utf32 source?

I don't believe that the string literal suffixes are affected in any way by the source code encoding scheme.

True, true. I never thought that the meaning of a supplied suffix might change depending on the source code encoding scheme. But the specs state:

| The optional Postfix character gives a specific type to the
| string, rather than it being inferred from the context. This is
| useful when the type cannot be unambiguously inferred, such as
| when overloading based on string type.

But when is the type of a string ambiguous? The BOM, whether missing or present, always supplies a context for every string in the source:

  missing BOM    c
  UTF8-BOM       ? (probably c)
  UTF16-BOM      w
  UTF32-BOM      d

Therefore a string literal s in a UTF-32 source is equivalent to the same string literal followed by the d-suffix: sd.

I have done some tests and found that a valid UTF-32 code in a string literal suffixed with w throws an error, because it is not a legal UTF-16 code. So the w-suffix, at least, denotes not only a type but also a check of the semantic correctness of the string literal's content.
 I think that "qwerty"c  is formed as a UTF8 string in RAM by the
 compiler regardless of UTF encoding of the source. 

UTF8? Because it is indistinguishable from ASCII in this case?

-manfred
Aug 14 2005
parent reply Derek Parnell <derek psych.ward> writes:
On Sun, 14 Aug 2005 12:17:44 +0000 (UTC), Manfred Nowak wrote:

 Derek Parnell <derek psych.ward> wrote:
  

[snip]
 But when is the type of a string ambiguous? 

The ambiguity is not in the encoding of the source text but in the way that a string literal is used when matching function signatures. Given ...

  void func(char[] x) { . . . }

  func( "some string" );

There is no problem so far, as there is only one possible match, but add this ...

  void func(dchar[] x) { . . . }

And now there is an ambiguity. It is in this situation that string literal suffixes are useful. We need to do ...

  func( "some string"c );

or, before suffixes,

  func( cast(char[]) "some string" );

-- 
Derek Parnell
Melbourne, Australia
14/08/2005 10:54:46 PM
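Derek's two-overload scenario can be assembled into one complete sketch (again assuming a D1-era dmd, where "..."c yields a char[]):

```d
void func(char[]  x) { /* handle UTF-8 */ }
void func(dchar[] x) { /* handle UTF-32 */ }

void main()
{
    // func("some string");            // error: matches both overloads
    func("some string"c);              // unambiguous: char[] overload
    func(cast(char[]) "some string");  // pre-suffix workaround, same effect
}
```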
Aug 14 2005
parent reply Manfred Nowak <svv1999 hotmail.com> writes:
Derek Parnell <derek psych.ward> wrote:

[...]
 but add this ...
 
  void func(dchar[] x) { . . . }
 
 And now there is an ambiguity.

Ouch. Now I see that the old story on string literals has been covered with a fig leaf excuse.

A source containing an overloaded function like

  void func( char[] s){}
  void func( wchar[] s){}
  void func( dchar[] s){}

and a call with an unsuffixed string literal like

  func( "SomeString");

is unambiguously solvable by looking at the BOM of the source file, as I have already mentioned in the foregoing post: in an ASCII source the char[] overload has to be used, whereas in a UTF-32 source the dchar[] overload has to be used. What else would be more natural?

"Hey dear Chinese, you have written all your strings in this UTF-32 source in Chinese letters, but please assure your D compiler that you really meant to write Chinese letters by appending the d-suffix to all your strings!"?

Nope. No Chinese should be forced to act this way. However, if a string in his source is not a UTF-32 string, he can now use the c- or d-suffix.

Of course this would also imply that a UTF-32 source may show severe behaviour changes if the BOM is changed.

There is one more problem I do not understand: what will happen with the call

  func( "\u00001111"d "qwerty"c);

Is this ambiguous?

-manfred
Aug 14 2005
next sibling parent reply Derek Parnell <derek psych.ward> writes:
On Sun, 14 Aug 2005 14:12:21 +0000 (UTC), Manfred Nowak wrote:

 Derek Parnell <derek psych.ward> wrote:
 
 [...]
 but add this ...
 
  void func(dchar[] x) { . . . }
 
 And now there is an ambiguity.

 Ouch. Now I see, that the old story on string literals has been
 covered with a fig leaf excuse. A source containing an overloaded
 function like

   void func( char[] s){}
   void func( wchar[] s){}
   void func( dchar[] s){}

 and a call with an unsuffixed string literal like

   func( "SomeString");

 is unambiguously solvable by looking at the BOM of the source file,
 as I have already mentioned in the foregoing post: In an ASCII-source
 the char[]-overload has to be used, whereas in an UTF32-source the
 dchar[]-overload has to be used. What else should be natural?

I can see where you are going with this, but the encoding of the source text should be independent of the interpretation of undecorated string literals. Just because a file is encoded as UTF8 there should be no restriction on me deciding to save the file as UTF16. The compiler should not go choosing which function to call based on how the file just happens to be encoded.
 "Hey dear chinese, you have written all your strings in this UTF32-
 source in chinese letters, but please assure your D-compiler that 
 you really meant to write chinese letters by appending the d-suffix 
 to all your strings!"?

This sounds more like we need to have a pragma that specifies which default encoding we mean to have on the undecorated literals in a specific source text.
 Nope. No chinese should be forced to act this way. However, if a 
 string in his source is not an UTF32-string he now can use the c- 
 or d-suffix.
 
 Of course this would also imply, that an UTF32-source may have 
 severe behaviour changes, if the BOM is changed.

Exactly, so we should avoid this trap. Keep the default encoding as UTF8, but I still think that a pragma would be a good (and easy to implement) idea.
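No such pragma exists in D; purely as a hypothetical sketch of Derek's suggestion, with an invented pragma name:

```d
// HYPOTHETICAL — dmd has no such pragma. It would set the default type
// of undecorated string literals for this module only, so the choice is
// explicit in the source text rather than inferred from the BOM.
pragma(defaultLiteralType, dchar);

void func(char[]  s) {}
void func(dchar[] s) {}

void main()
{
    func("SomeString");  // would then resolve to the dchar[] overload
}
```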
 There is one more problem I do not understand:
 
 what will now happen with the call:
 
   func( "\u00001111"d "qwerty"c);
 
 Is this ambiguous?

Not to the compiler. If you try this you get the error message:

  mismatched string literal postfixes 'd' and 'c'

-- 
Derek Parnell
Melbourne, Australia
15/08/2005 12:17:33 AM
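The rule can be seen with the concatenation of adjacent literals; a sketch (assuming a D1-era dmd):

```d
void main()
{
    dchar[] ok = "abc"d ~ "def"d;      // fine: postfixes agree
    // auto bad = "\u1111"d "qwerty"c; // error: mismatched string literal
    //                                 // postfixes 'd' and 'c'
}
```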
Aug 14 2005
parent reply Manfred Nowak <svv1999 hotmail.com> writes:
Derek Parnell <derek psych.ward> wrote:

[...]
 the encoding of the
 source text should be independent of the interpretation of
 undecorated string literals. Just because a file is encoded as
 UTF8 there should be no restriction on me deciding to save the
 file as UTF16. The compiler should not go choosing which
 function to call based on how the file just happens to be
 encoded.

Nice example, but does this argument hold in general? How will your embedded string literals be saved to the UTF-16 source by your editor? And once you have changed some of them to real UTF-16 codes, how will your editor save them if you decide to revert to UTF-8?
 This sounds more like we need to have a pragma that specifies
 which default encoding we mean to have on the undecorated
 literals in a specific source text.

Agreed. That might be a solution. [...]
 what will now happen with the call:
 
   func( "\u00001111"d "qwerty"c);
 
 Is this ambiguous?

Not to the compiler. If you try this you get the error message " mismatched string literal postfixes 'd' and 'c' "

Yes. But have you tried any further?

  func( "" ""c); // mismatched string literal postfixes ' ' and 'c'
  func( "" ""d); // mismatched string literal postfixes ' ' and 'd'

So an unsuffixed string literal is compatible with neither c nor d. Why, then, does the compiler complain that undecorated string literals match both char[] and dchar[]?

Are we chasing a phantom because the overloading routine of dmd is broken? vathix and some others have already reported similar problems: digitalmars.D.bugs/3206

-manfred
Aug 14 2005
next sibling parent Manfred Nowak <svv1999 hotmail.com> writes:
Manfred Nowak <svv1999 hotmail.com> wrote:

[...]
 Are we chasing a phantom, because the overloading routine of dmd
 is broken?

digitalmars.D/27069 is the other reference, where a `dchar' was matched by a `char' and a `creal'.

-manfred
Aug 14 2005
prev sibling parent reply Derek Parnell <derek psych.ward> writes:
On Sun, 14 Aug 2005 18:07:45 +0000 (UTC), Manfred Nowak wrote:

 Derek Parnell <derek psych.ward> wrote:
 
 [...]
 the encoding of the
 source text should be independent of the interpretation of
 undecorated string literals. Just because a file is encoded as
 UTF8 there should be no restriction on me deciding to save the
 file as UTF16. The compiler should not go choosing which
 function to call based on how the file just happens to be
 encoded.

Nice example, but is this argument suited in general?

Yes.
 How will your 
 embedded string literals be saved to the UTF16-source by your 
 editor? 

As UTF16 encoded strings, just like the rest of the text. The editor doesn't distinguish between string literals and other text.
And once you have changed some of them to real utf16-codes, 
 how will your editor save them, if you decide to revert to utf8?

"real utf16-codes"? What are these? If the text is encoded in UTF16 then the entire text already contains real UTF16 codes. If I decide to save it as UTF8, then the editor will translate it for me. The way that the text is displayed remains the same, regardless of its encoding. The storage of the text is independent of the way it is displayed and interpreted by the compiler.
 This sounds more like we need to have a pragma that specifies
 which default encoding we mean to have on the undecorated
 literals in a specific source text.

Agreed. That might be a solution. [...]
 what will now happen with the call:
 
   func( "\u00001111"d "qwerty"c);
 
 Is this ambiguous?

Not to the compiler. If you try this you get the error message " mismatched string literal postfixes 'd' and 'c' "

 Yes. But have you tried any further?

   func( "" ""c); // mismatched string literal postfixes ' ' and 'c'
   func( "" ""d); // mismatched string literal postfixes ' ' and 'd'

 Then: an unsuffixed string literal is compatible with neither c nor d.
 So why is it that the compiler complains that undecorated string
 literals match both char[] and dchar[]?

That has got to be a bug.
 Are we chasing a phantom, because the overloading routine of dmd is 
 broken?

It's not broken. It works as documented. ;-)

-- 
Derek Parnell
Melbourne, Australia
15/08/2005 8:02:38 AM
Aug 14 2005
parent "Ben Hinkle" <ben.hinkle gmail.com> writes:
"Derek Parnell" <derek psych.ward> wrote in message 
news:1fki98rnsnkwe$.11rh18dazasty.dlg 40tude.net...
 On Sun, 14 Aug 2005 18:07:45 +0000 (UTC), Manfred Nowak wrote:

 Derek Parnell <derek psych.ward> wrote:

 [...]
 the encoding of the
 source text should be independent of the interpretation of
 undecorated string literals. Just because a file is encoded as
 UTF8 there should be no restriction on me deciding to save the
 file as UTF16. The compiler should not go choosing which
 function to call based on how the file just happens to be
 encoded.

Nice example, but is this argument suited in general?

Yes.
 How will your
 embedded string literals be saved to the UTF16-source by your
 editor?

As UTF16 encoded strings, just like the rest of the text. The editor doesn't distinguish between string literals and other text.
And once you have changed some of them to real utf16-codes,
 how will your editor save them, if you decide to revert to utf8?

"real utf16-codes"? What are these? If the text is encoded in UTF16 then the entire text already contains real UTF16 codes. If I decide to save it as UTF8, then the editor will translate it for me. The way that the text is displayed remains the same, regardless of its encoding. The storage of the text is independent of the way it is displayed and interpreted by the compiler.

Agreed. The c/w/d postfix indicates how to store the string in the obj file, not how it is stored in the source file.
 This sounds more like we need to have a pragma that specifies
 which default encoding we mean to have on the undecorated
 literals in a specific source text.

Agreed. That might be a solution. [...]
 what will now happen with the call:

   func( "\u00001111"d "qwerty"c);

 Is this ambiguous?

Not to the compiler. If you try this you get the error message " mismatched string literal postfixes 'd' and 'c' "

 Yes. But have you tried any further?

   func( "" ""c); // mismatched string literal postfixes ' ' and 'c'
   func( "" ""d); // mismatched string literal postfixes ' ' and 'd'

 Then: an unsuffixed string literal is compatible with neither c nor d.
 So why is it that the compiler complains that undecorated string
 literals match both char[] and dchar[]?

That has got to be a bug.

The current behavior is reasonable to me. It prevents any confusion with func(""c ""): should it keep the c postfix or substitute the "empty" postfix? The current behavior raises an error and makes the user choose.
Aug 14 2005
prev sibling parent reply xs0 <xs0 xs0.com> writes:
Manfred Nowak wrote:
 In an ASCII-source the char[]-overload has to be used, whereas in 
 an UTF32-source the dchar[]-overload has to be used. What else 
 should be natural?

Are you sure that it's a good idea to change the behavior of code based on the encoding of the file? I sure don't think so... That would be as if "123.456" were interpreted either as 123456 or 123.456, depending on your regional settings. A definite disaster :)
 "Hey dear chinese, you have written all your strings in this UTF32-
 source in chinese letters, but please assure your D-compiler that 
 you really meant to write chinese letters by appending the d-suffix 
 to all your strings!"?

Aren't the characters the same in all cases, with just the string type changing?

xs0
Aug 14 2005
parent Manfred Nowak <svv1999 hotmail.com> writes:
xs0 <xs0 xs0.com> wrote:

[...]
 Are you sure that it's a good idea to change behavior of code
 based on the encoding of the file? I sure don't...

What is the code embedded in a file if you do not know the encoding of the file? Please explain. [...]
 That would be like if "123.456" would be interpreted either as
 123456 or 123.456, depending on your regional settings.. A
 definite disaster :) 

... or 123,456. A disaster the Germans are well aware of, because comma and point swap roles when changing from English to German notation. [...]
 Aren't the characters the same in all cases, just the string
 type changes? 

I might get you wrong, but why should the string literal consisting of the one-letter d-string for "true" be the same as the characters "true" as a 4-letter c-string?

-manfred
Aug 14 2005