
digitalmars.D.bugs - the D crowd does bobdamn Rocket Science

reply Georg Wrede <georg.wrede nospam.org> writes:
I've spent a week studying the UTF issue, and another trying to explain 
it. Some progress, but not enough. Either I'm a bad explainer (hence, 
skip dreams of returning to teaching CS when I'm old), or, this really 
is an intractable issue. (Hmmmm, or everybody else is right, and I need 
to get on the green pills.)

My almost final try: Let's imagine (please, pretty please!), that Bill 
Gates descends from Claudius. This would invariably lead to cardinals 
being represented as Roman Numerals in computer user interfaces, ever 
since MSDOS times.

Then we'd have that as an everyday representation of integers. Obviously 
we'd have the functions r2a and a2r for changing string representations 
between Arabic ("1234...") and Roman ("I II III ...") numerals.

To make this example work we also need to imagine that we need a 
notation for roman numerals within strings. Let this notation be:

"\R" as in "\RXIV" (to represent the number 14)

(Since this in reality is not needed, we have to (again) imagine that it 
is, for Historical Reasons -- the MSDOS machines sometimes crashed when 
there were too many capital X on a command line, but at the time nobody 
found the reason, so the \R notation was created as a [q&d] fix.)

So, since it is politically incorrect to write "December 24", we have to 
write "December XXIV" but since the ancient bug lurks if this file gets 
transferred to "a major operating system", we have to be careful and 
write "December \RXXIV".

Now, programmers are lazy, and they end up writing "\Rxxiv" and getting 
all kinds of error messages like "invalid string literal". So a few 
anarchist programmers decided to implement the possibility of writing 
lower case roman numerals, even if the Romans themselves disapproved of 
it from the beginning.

The prefix \r was already taken, so they had two choices: either make 
computers smart and let them understand \Rxxiv (but that would risk Bill 
getting angry), or invent another prefix. They chose \N (this choice 
is a unix inside joke).

---

Then a compiler guru (who happened to descend from Asterix the Gaul) 
decided to write a new language. In the midst of it all, he stumbled 
upon the Roman issue. Being diligent (which I've wanted to be all my 
life too but never succeeded (go ask my mother, my teacher, my bosses)), 
he decided to implement strings in a non-breakable way.

So now we have:

char[], Rchar[] and Nchar[], the latter two being for situations where 
the string might contain [expletive deleted] roman values.

The logical next step was to decorate the strings themselves, so that 
the computer can unambiguously know what to assign where. Therefore we 
now have "", ""R and ""N kind of strings. Oh, and to be totally 
unambiguous and symmetric, the redundant ""C was introduced to 
explicitly denote the non-R, non-N kind of string, in case such might be 
needed some day.

Now, being modern, the guru had already made the "" kinds of strings 
Roman Proof, with the help of Ancient Gill, an elusive but legendary oracle.

---

The III wise men had also become aware of this problem space. Since 
everything in modern times grows exponentially, the letters X, C and M 
(for ten, hundred and thousand), would sooner than later need to be 
accompanied by letters for larger numbers. For a million M was already 
taken, so they chose N. And then G, T, P, E, Z, Y for giga, tera, peta, 
exa, zetta and yotta. Then the in-betweens had to be worked out too, for 
5000, 5000000 etc. 50 was already L and 500 was D..... whatever, you get 
the picture. :-)

So, they decided that, to make a string spec that lasts "forever" the 
new string had to be stored with 32 bits per character. (Since 
exponential growth (it's true, just look how Bill's purse grows!), is 
reality, they figured the numerals would run out of letters, and that's 
why new glyphs would have to be invented eventually, ad absurdum. 32 
bits would carry us until the end of the Universe.) They called it N. 
This was the official representation of strings that might contain roman 
numerals way into the future.

Then some other guys thought "Naaw, t's nuff if we have strings that 
take us till the day we retire, so 16 bits oughtta be plenty." That 
became the R string. Practical Bill adopted the R string. Later the 
other guys had to admit that their employer might catch the "retiring 
plot", so they amended the R string to SOMETIMES contain 32 bits.

Now, Bill, pragmatist that he is, ignored the issue (probably on the same 
"retiring plot"). And what Bill does defines what is right (at least 
with suits, and hey, they rule -- as opposed to us geeks).

---

Luckily II blessed-by-Bob Sourcerers (notice the spelling) thought the 
R and N stuff was wasting space, was needed only occasionally, and was 
in general cumbersome. Everybody's R but Bill's had to be able to 
handle 32 bits every once in a while, and the N stuff really was overkill.

They figured "The absolute majority of crap needs 7 bits, the absolute 
majority of the rest needs 9 bits, and the absolute majority of the rest 
needs 12 bits. So there's pretty little left after all this -- however, 
since we are blessed-by-Bob, and we do stuff properly, we won't give up 
until we can handle all this, and handle it gracefully."

They decided that their string (which they christened ""C) has to be 
compact, handle 7-bit stuff as fast as non-roman-aware programs do, 9 
bit stuff almost as fast as the R programs, and it has to be lightning 
fast to convert to and from. Also, they wanted the C strings to be 
usable as much as possible by old library routines, so for example, the 
old routines should be able to search and sort their strings without 
upgrades. And they knew that strings do get chopped, so they designed 
them so that you can start wherever, and just by looking at the 
particular octet, you'd know whether it's proper to chop the string 
there. And if it isn't, it should be trivial to look a couple of octets 
forward (or even back), and just immediately see where the next 
breakable place is. Ha, and they wanted the C strings to be 
endianness-proof!!
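The design goals in the paragraph above are, in fact, real properties of UTF-8. A minimal Python sketch (Python standing in for D here; the property belongs to the encoding itself) of the "look at the particular octet and know whether it's proper to chop there" claim:

```python
# Illustrative sketch of the UTF-8 property described above: every octet
# tells you on its own whether the string may be chopped there.
# In UTF-8, continuation bytes always look like 0b10xxxxxx.
def is_boundary(b: int) -> bool:
    """True if this octet begins a character (i.e. is a safe chop point)."""
    return (b & 0xC0) != 0x80

data = "päivä".encode("utf-8")   # 7 octets: each 'ä' takes two
# Find the nearest safe chop point at or before index 4,
# "just by looking at the particular octet":
i = 4
while i > 0 and not is_boundary(data[i]):
    i -= 1
print(i, data[:i].decode("utf-8"))   # → 4 päi
```

Because the test needs only the one octet in hand, old byte-oriented library routines can walk, search and chop such strings without ever decoding them.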

The II were already celebrities with the Enlightened, so it was decided 
that the C string will be standard on POSIX systems. Smart crowd.

*** if I don't see light here, I'll write some more one day ***

---

If I write the following strings here, and somebody pastes them in his 
source code,

"abracadabra"
"räyhäpäivä"
"ШЖЯЮЄШ"

compiles his D program, and runs it, what should (and probably will!) 
happen, is that the program output looks like the strings here.

If the guy has never heard of our Unbelievable utf-discussion, he 
probably never becomes aware that some UTF or other crap is or has been 
involved. (Hell, I've used Finnish letters in my D source code all the 
time, and never thought anything of it.)

After having seen this discussion, he gets nervous, and quickly changes 
all his strings so that they are ""c ""w and ""d decorated. From then 
on, he hardly dares to touch strings that contain non-US content. Like
us here.

The interesting thing is, did I originally write them in UTF-8, UTF-16 
or UTF-32?

How many times were they converted between these widths while travelling 
from my keyboard to this newsgroup to his machine to the executable to 
the output file?

Probably they've been in UTF-7 too, since they've gone through mail 
transport, which still is from the previous millennium.
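The point that those conversions cannot have mattered is mechanically checkable. A small Python sketch (illustrative, not part of the original post) showing the three strings survive any chain of width conversions:

```python
# Illustrative sketch: the three strings from the post survive any chain
# of UTF re-encodings unchanged -- each conversion is a lossless round
# trip, so it cannot matter which width they started in.
strings = ["abracadabra", "räyhäpäivä", "ШЖЯЮЄШ"]
for s in strings:
    t = s
    for enc in ("utf-8", "utf-16", "utf-32", "utf-7", "utf-8"):
        t = t.encode(enc).decode(enc)   # one hop per transport step
    assert t == s                       # identical after every hop
print("all round trips intact")
```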

---

At this point I have to ask, are there any folks here who do not believe 
that the following proves anything:

 Xref: digitalmars.com digitalmars.D.bugs:5436 digitalmars.D:29904
 Xref: digitalmars.com digitalmars.D.bugs:5440 digitalmars.D:29906
 
 Those show that the meaning of the program does not change when the
 source code is transliterated to a different UTF encoding.
 
 They also show that editing code in different UTF formats, and
 inserting "foreign" text even directly to string literals, does
 survive intact when the source file is converted between different
 UTF formats.
 
 Further, they show that decorating a string literal to c, w, or d,
 does not change the interpretation of the contents of the string
 whether it contains "foreign" literals directly inserted, or not.
 
 Most permutations of the above 3 paragraphs were tested.

(Oh, correction to the last line: "_All_ cross permutations of the 3 
paragraphs were tested.")

Endianness was not considered, but hey, with wrong endianness either 
your text editor can't read the file to begin with, or if it can, then 
you _can_ edit the strings with even more "foreign characters" and 
still be ok!

###################################################

I hereby declare that it makes _no_ difference whatsoever in which 
width a string literal is stored, as long as the compiler implicitly 
casts it when it gets used.

###################################################

I hereby also declare that implicit casts of strings (be they literal 
or heap or stack allocated) carry no risks whatsoever. Period.

###################################################

I hereby declare that string literal decorations are not only unneeded, 
they create an enormous amount of confusion. (Even we are totally 
bewildered, so _every_ newcomer to D will be that too.) There are _no_ 
upsides to them.

###################################################

I hereby declare that it should be illegal to implicitly convert char 
or wchar to any integer type. Further, it should be illegal to even 
cast char or wchar to any integer type. The cast should have to be via 
a cast to void! (I.e. difficult but possible.) With dchar even implicit 
casts are ok. Cast from char or wchar via dchar should be illegal. 
(Trust me, illegal. While at the same time even implicit casts from 
char[] and wchar[] to each other and to and from dchar[] are ok!) Casts 
between char, wchar and dchar should be illegal, unless via void.

###################################################

A good programmer would use the same width all over the place. An even 
better programmer would typedef his own anyway. If an idiot has his 
program convert width at every other assignment, then he'll have other 
idiocies in his code too. He should go to VB.
---

But some other things are (both now, and even if we fix the above) 
downright hazardous, and should cause a throw, and in non-release 
programs a runtime assert failure:

Copying any string to a fixed length array, _if_ the array is either 
wchar[] or char[]. (dchar[] is ok.) The (throw or) assert should fail 
if the copied string is not breakable where the receiving array gets 
full.

   whatever foo = "ää";   // foo and "" can be any of c/w/d.
   char[3] barf = foo;    // write cast if needed
   // Odd number of chars in barf, breaks ää wrong. "ää" is 4 bytes.

Same goes for wchar[3].

---

Once we agree on this, then it's time to see if some more AJ stuff is 
left to fix. But not before.
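The hazard in the char[3] example can be demonstrated directly. A Python sketch (Python stands in for D here, since the byte layout is fixed by UTF-8 itself):

```python
# "ää" really is 4 octets in UTF-8, so copying it into a 3-unit char
# buffer (the char[3] barf case above) slices the second 'ä' in half.
s = "ää"
b = s.encode("utf-8")
print(len(b))            # → 4
chopped = b[:3]          # what a char[3] copy would end up holding
try:
    chopped.decode("utf-8")
except UnicodeDecodeError:
    print("broken mid-character, as the post warns")
# dchar[]-style storage is immune: one 32-bit unit per character,
# so a whole-unit copy can never split a character.
assert len(s.encode("utf-32-le")) == 8   # 2 characters x 4 bytes
```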
Nov 18 2005
parent reply Derek Parnell <derek psych.ward> writes:
On Fri, 18 Nov 2005 12:56:24 +0200, Georg Wrede wrote:

[snip]

It seems that you use the word 'cast' to mean conversion of one utf
encoding to another. However, this is not what D does.

   dchar[] y;
   wchar[] x;

   x = cast(wchar[])y;

does *not* convert the content of 'y' to utf-16 encoding. Currently you
*must* use the toUTF16 function to do that.

  x = std.utf.toUTF16(y);
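The distinction can be made concrete outside D as well. A Python sketch of reinterpretation versus transcoding (illustrative only, not D semantics):

```python
# Illustrative sketch (not D semantics): the difference between
# reinterpreting stored bytes and actually transcoding them, using
# 'ä' (U+00E4) held in UTF-32 little-endian storage.
s = "ä"
utf32 = s.encode("utf-32-le")              # bytes E4 00 00 00
# A "cast" merely reinterprets those 4 bytes as two 16-bit units:
reinterpreted = utf32.decode("utf-16-le")  # 'ä' followed by a stray NUL
# A real conversion (what toUTF16 does) re-encodes the character:
transcoded = s.encode("utf-16-le")         # bytes E4 00 -- one unit
print(len(reinterpreted), len(transcoded) // 2)   # → 2 1
```

The reinterpretation yields a well-formed but wrong string; with code points above U+00FF it would yield outright garbage.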

However, are you saying that D should change its behaviour such that it
should always implicitly convert between encoding types? Should this happen
only with assignments or should it also happen on function calls?

   foo(wchar[] x) { . . .  } // #1
   foo(dchar[] x) { . . .  } // #2
   dchar y;
   foo(y);  // Obviously should call #2
   foo("Some Test Data"); // Which one now?

Given just the function signature and an undecorated string, it is not
possible for the compiler to call the 'correct' function. In fact, it is
not possible for a person (other than the original designer) to know which
is the right one to call.

D has currently got the better solution to this problem; get the coder to
identify the storage characteristics of the string!

-- 
Derek Parnell
Melbourne, Australia
18/11/2005 10:42:31 PM
Nov 18 2005
parent reply Georg Wrede <georg.wrede nospam.org> writes:
Derek Parnell wrote:
 On Fri, 18 Nov 2005 12:56:24 +0200, Georg Wrede wrote:
 
 [snip]
 
 It seems that you use the word 'cast' to mean conversion of one utf
 encoding to another. However, this is not what D does.
 
    dchar[] y;
    wchar[] x;
 
    x = cast(wchar[])y;
 
 does *not* convert the content of 'y' to utf-16 encoding. Currently you
 *must* use the toUTF16 function to do that.

If somebody wants to retain the bit pattern while storing the contents to something else, it should be done with a union. (Just as you can do with pointers, or even objects! To name a few "workarounds".) A cast should do precisely what our toUTFxxx functions currently do.
 However, are you saying that D should change its behaviour such that
 it should always implicitly convert between encoding types? Should 
 this happen only with assignments or should it also happen on
 function calls?

Both. And everywhere else (in case we forgot to name some situation).
    foo(wchar[] x) { . . .  } // #1
    foo(dchar[] x) { . . .  } // #2
    dchar y;
    foo(y);  // Obviously should call #2
    foo("Some Test Data"); // Which one now?

Test data is undecorated, hence char[].

Technically, on the last line above it could pick at random, when it has 
no "right" alternative, but I think it would be Polite Manners to make 
the compiler complain.

I'm still trying to get through the notion that it _really_does_not_matter_ 
what it chooses! (Of course performance is slower with a lot of 
unnecessary casts ( = conversions), but that's the programmer's fault, 
not ours.)
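A quick Python sketch (illustrative) of why the chosen width cannot change the meaning of an undecorated literal like "Some Test Data":

```python
# Illustrative sketch: for a literal like "Some Test Data" the stored
# width changes only the byte count, never the meaning -- every width
# decodes back to the identical string.
lit = "Some Test Data"
widths = {enc: len(lit.encode(enc))
          for enc in ("utf-8", "utf-16-le", "utf-32-le")}
assert all(lit.encode(enc).decode(enc) == lit for enc in widths)
print(widths)   # storage differs (14 / 28 / 56 bytes); meaning doesn't
```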
 Given just the function signature and an undecorated string, it is not
 possible for the compiler to call the 'correct' function. In fact, it
 is not possible for a person (other than the original designer) to
 know which is the right one to call?

That is (I'm sorry, no offense), based on a misconception. Please see my other posts today, where I try to clear (among other things) this very issue.
 D has currently got the better solution to this problem; get the
 coder to identify the storage characteristics of the string!

He does, at assignment to a variable. And, up till that time, it makes 
no difference. It _really_ does not. This also I try to explain in the 
other posts.

(The issue and concepts are crystal clear, maybe it's just me not being 
able to describe them with the right words. Not to you, or Walter, or 
the others?)

We are all seeing bogeymen all over the place, where there are none. 
It's like my kids this time of the year, when it is always dark behind 
the house, under the bed, and in the attic.

Aaaaaaaaaaah, now I got it. It's been Halloween again. Sigh!
Nov 18 2005
next sibling parent reply Sean Kelly <sean f4.ca> writes:
Georg Wrede wrote:
 Derek Parnell wrote:
 On Fri, 18 Nov 2005 12:56:24 +0200, Georg Wrede wrote:

 [snip]

 It seems that you use the word 'cast' to mean conversion of one utf
 encoding to another. However, this is not what D does.

    dchar[] y;
    wchar[] x;

    x = cast(wchar[])y;

 does *not* convert the content of 'y' to utf-16 encoding. Currently you
 *must* use the toUTF16 function to do that.

If somebody wants to retain the bit pattern while storing the contents to something else, it should be done with a union. (Just as you can do with pointers, or even objects! To name a few "workarounds".) A cast should do precisely what our toUTFxxx functions currently do.

I somewhat agree. Since the three char types in D really do represent 
various encodings, the current behavior of casting a char[] to dchar[] 
produces a meaningless result (AFAIK). On the other hand, this would 
make casting strings behave differently from casting anything else in 
D, and I abhor inconsistencies.

Though for what it's worth, I don't consider the conversion cost to be 
much of an issue so long as strings must be cast explicitly. And either 
way, I would love to have UTF conversion for strings supported 
in-language. It does make some sense, given that the three encodings 
exist as distinct value types in D already.
 However, are you saying that D should change its behaviour such that
 it should always implicitly convert between encoding types? Should 
 this happen only with assignments or should it also happen on
 function calls?

Both. And everywhere else (in case we forgot to name some situation).

I disagree. While this would make programming quite simple, it would also incur hidden runtime costs that would be difficult to ferret out. This might be fine for a scripting-type language, but not for a systems programming language IMO.
 D has currently got the better solution to this problem; get the
 coder to identify the storage characteristics of the string!

He does, at assignment to a variable. And, up till that time, it makes no difference. It _really_ does not.

True.
 This also I try to explain in the other posts.
 
 (The issue and concepts are crystal clear, maybe it's just me not being 
 able to describe them with the right words. Not to you, or Walter, or 
 the others?)
 
 We are all seeing bogeymen all over the place, where there are none. 
 It's like my kids this time of the year, when it is always dark behind 
 the house, under the bed, and on the attic.

What I like about the current behavior (no implicit conversion) is that 
it makes it readily obvious where translation needs to occur, and thus 
makes it easy for the programmer to decide if that seems appropriate.

That said, I agree that the overall runtime cost is likely consistent 
between a program with and without implicit conversion--either the API 
calls will have overloads for all types and thus allow you to avoid 
conversion, or they will only support one type and require conversion 
if you've standardized on a different type.

It may well be that concern over implicit conversion is unfounded, but 
I'll have to give the matter some more thought before I can say one way 
or the other. My current experience with D isn't such that I've had to 
deal with this particular issue much.

Sean
Nov 18 2005
next sibling parent reply "Regan Heath" <regan netwin.co.nz> writes:
On Fri, 18 Nov 2005 09:48:51 -0800, Sean Kelly <sean f4.ca> wrote:
 Georg Wrede wrote:
 Derek Parnell wrote:
 On Fri, 18 Nov 2005 12:56:24 +0200, Georg Wrede wrote:

 [snip]

 It seems that you use the word 'cast' to mean conversion of one utf
 encoding to another. However, this is not what D does.

    dchar[] y;
    wchar[] x;

    x = cast(wchar[])y;

 does *not* convert the content of 'y' to utf-16 encoding. Currently you
 *must* use the toUTF16 function to do that.

If somebody wants to retain the bit pattern while storing the contents to something else, it should be done with a union. (Just as you can do with pointers, or even objects! To name a few "workarounds".) A cast should do precisely what our toUTFxxx functions currently do.

I somewhat agree. Since the three char types in D really do represent various encodings, the current behavior of casting a char[] to dchar[] produces a meaningless result (AFAIK). On the other hand, this would make casting strings behave differently from casting anything else in D, and I abhor inconsistencies. Though for what it's worth, I don't consider the conversion cost to be much of an issue so long as strings must be cast explicitly. And either way, I would love to have UTF conversion for strings supported in-language. It does make some sense, given that the three encodings exist as distinct value types in D already.

Making the cast explicit sounds like a good compromise to me.

The way I see it, casting from int to float is similar to casting from 
char[] to wchar[]. The data must be converted from one form to another 
for it to make sense; you'd never 'paint' an 'int' as a 'float', it 
would be meaningless, and the same is true for char[] to wchar[].

The correct way to paint data as char[], wchar[] or dchar[] is to paint 
a byte[] (or ubyte[]). In other words, if you have some data of unknown 
encoding you should be reading it into byte[] (or ubyte[]) and then 
painting it as the correct type, once it is known.

Regan
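Regan's rule can be sketched in a few lines of Python (illustrative; the principle, not D's byte[], is what's shown):

```python
# Read data of unknown encoding as raw bytes first, and "paint" (decode)
# it only once the encoding is actually known.
raw = bytes([0xC3, 0xA4])            # two mystery octets off the wire
# Painting with the wrong assumption gives two Latin-1 characters:
assert raw.decode("latin-1") == "Ã¤"
# Once we learn the data is UTF-8, the same octets are one character:
assert raw.decode("utf-8") == "ä"
print("decode only after the encoding is known")
```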
Nov 18 2005
parent reply Sean Kelly <sean f4.ca> writes:
Regan Heath wrote:
 
 Making the cast explicit sounds like a good compromise to me.
 
 The way I see it casting from int to float is similar to casting from 
 char[] to wchar[]. The data must be converted from one form to another 
 for it to make sense, you'd never 'paint' and 'int' as a 'float' it 
 would be meaningless, the same is true for char[] to wchar[].

This is the comparison I was thinking of as well. Though I've never 
tried casting an array of ints to floats. I suspect it doesn't work, 
does it?

My only other reservation is that the behavior could not be preserved 
for casting char types, and unlike narrowing conversions (such as float 
to int), meaning can't even be preserved in narrowing char conversions 
(such as wchar to char).

Sean
Nov 18 2005
parent "Regan Heath" <regan netwin.co.nz> writes:
On Fri, 18 Nov 2005 15:29:23 -0800, Sean Kelly <sean f4.ca> wrote:
 Regan Heath wrote:
  Making the cast explicit sounds like a good compromise to me.
  The way I see it casting from int to float is similar to casting from  
 char[] to wchar[]. The data must be converted from one form to another  
 for it to make sense, you'd never 'paint' and 'int' as a 'float' it  
 would be meaningless, the same is true for char[] to wchar[].

This is the comparison I was thinking of as well. Though I've never tried casting an array of ints to floats. I suspect it doesn't work, does it?

Nope. Kris's post has something about it, here: digitalmars.D/30158
 My only other reservation is that the behavior could not be preserved  
 for casting char types, and unlike narrowing conversions (such as float  
 to int), meaning can't even be preserved in narrowing char conversions  
 (such as wchar to char).

Indeed. Due to the fact that the meaning (the "character") may be 
represented as 1 wchar, but 2 chars. The thread above has some more 
interesting stuff about this.

Regan
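A Python sketch (illustrative) of why narrowing between code-unit widths can never be unit-for-unit:

```python
# 'ä' is one UTF-16 code unit (one wchar) but two UTF-8 units (chars),
# so a wchar -> char conversion cannot map units one-to-one.
ch = "ä"
assert len(ch.encode("utf-16-le")) // 2 == 1   # one 16-bit unit
assert len(ch.encode("utf-8")) == 2            # two 8-bit units
# Beyond the BMP even UTF-16 needs two units (a surrogate pair):
clef = "\U0001D11E"                            # MUSICAL SYMBOL G CLEF
assert len(clef.encode("utf-16-le")) // 2 == 2
print("unit counts differ per encoding")
```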
Nov 18 2005
prev sibling parent reply "Kris" <fu bar.com> writes:
"Sean Kelly" <sean f4.ca> wrote in message
 Georg Wrede wrote:
 Derek Parnell wrote:
 On Fri, 18 Nov 2005 12:56:24 +0200, Georg Wrede wrote:

 [snip]

 It seems that you use the word 'cast' to mean conversion of one utf
 encoding to another. However, this is not what D does.

    dchar[] y;
    wchar[] x;

    x = cast(wchar[])y;

 does *not* convert the content of 'y' to utf-16 encoding. Currently you
 *must* use the toUTF16 function to do that.

If somebody wants to retain the bit pattern while storing the contents to something else, it should be done with a union. (Just as you can do with pointers, or even objects! To name a few "workarounds".) A cast should do precisely what our toUTFxxx functions currently do.

I somewhat agree. Since the three char types in D really do represent various encodings, the current behavior of casting a char[] to dchar[] produces a meaningless result (AFAIK). On the other hand, this would make casting strings behave differently from casting anything else in D, and I abhor inconsistencies.

Amen to that.
 Though for what it's worth, I don't consider the conversion cost to be 
 much of an issue so long as strings must be cast explicitly.  And either 
 way, I would love to have UTF conversion for strings supported 
 in-language.  It does make some sense, given that the three encodings 
 exist as distinct value types in D already.

FWIW, I agree. And it should be explicit, to avoid unseen /runtime/ 
conversion (the performance issue).

But, I have a feeling that cast([]) is not the right approach here? One 
reason is that structs/classes can have only one opCast() method. 
Perhaps there's another approach for such syntax? That's assuming, 
however, that one does not create a special-case for char[] types (per 
above inconsistencies).
 However, are you saying that D should change its behaviour such that
 it should always implicitly convert between encoding types? Should this 
 happen only with assignments or should it also happen on
 function calls?

Both. And everywhere else (in case we forgot to name some situation).

I disagree. While this would make programming quite simple, it would also incur hidden runtime costs that would be difficult to ferret out. This might be fine for a scripting-type language, but not for a systems programming language IMO.

Amen to that, too!
 D has currently got the better solution to this problem; get the
 coder to identify the storage characteristics of the string!

He does, at assignment to a variable. And, up till that time, it makes no difference. It _really_ does not.

True.
 This also I try to explain in the other posts.

 (The issue and concepts are crystal clear, maybe it's just me not being 
 able to describe them with the right words. Not to you, or Walter, or the 
 others?)

 We are all seeing bogeymen all over the place, where there are none. It's 
 like my kids this time of the year, when it is always dark behind the 
 house, under the bed, and on the attic.

What I like about the current behavior (no implicit conversion), is that it makes it readily obvious where translation needs to occur and thus makes it easy for the programmer to decide if that seems appropriate.

Right on!
 That said, I agree that the overall runtime cost is likely consistent 
 between a program with and without implicit conversion--either the API 
 calls with have overloads for all types and thus allow you to avoid 
 conversion, or they will only support one type and require conversion if 
 you've standardized on a different type.

As long as the /runtime/ penalties are clear within the code design (not quietly 'padded' by the compiler), that makes sense.
 It may well be that concerns over implicit convesion is unfounded, but 
 I'll have to give the matter some more thought before I can say one way or 
 the other.  My current experience with D isn't such that I've had to deal 
 with this particular issue much.

I'm afraid I have. Both in Mango.io and in the ICU wrappers. While 
there are no metrics for such things (that I'm aware of), my gut feel 
was that 'hidden' conversion would not be a good thing. Of course, that 
depends upon the "level" one is talking about:

High level :: slow to medium performance
Low level  :: high performance

A lot of folks just don't care about performance (oh, woe!) and that's 
fine. But I think it's worth keeping the distinction in mind when 
discussing this topic. I'd be a bit horrified to find the compiler 
adding hidden transcoding at the IO level (via Mango.io for example). 
But then, I'm a dinosaur.

So. That doesn't mean that the language should not perhaps support some 
sugar for such operations. Yet the difficulty there is said sugar would 
likely bind directly to some internal runtime support (such as utf.d), 
which may not be the most appropriate for the task (it tends to be 
character oriented, rather than stream oriented). In addition, there's 
often a need for multiple return-values from certain types of 
transcoding ops. I imagine that would be tricky via such sugar? Maybe 
not.

Transcoding is easy when the source content is reasonably small and 
fully contained within a block of memory. It quickly becomes quite 
complex when streaming instead. That's really worth considering. To 
illustrate, here's some of the transcoder signatures from the ICU code:

uint function (Handle, wchar*, uint, void*, uint, inout Error) ucnv_toUChars;
uint function (Handle, void*, uint, wchar*, uint, inout Error) ucnv_fromUChars;

Above are the simple ones, where all of the source is present in memory.
void function (Handle, void**, void*, wchar**, wchar*, int*, ubyte, inout Error) ucnv_fromUnicode;
void function (Handle, wchar**, wchar*, void**, void*, int*, ubyte, inout Error) ucnv_toUnicode;
void function (Handle, Handle, void**, void*, void**, void*, wchar*, wchar*, wchar*, wchar*, ubyte, ubyte, inout Error) ucnv_convertEx;

And those are the ones for handling streaming; note the double 
pointers? That's so one can handle "trailing" partial characters. Non 
trivial :-)

Thus, I'd suspect it may be appropriate for D to add some transcoding 
sugar. But it would likely have to be highly constrained (per the 
simple case). Is it worth it?
Nov 18 2005
parent reply Sean Kelly <sean f4.ca> writes:
Kris wrote:
 
 But, I have a feeling that cast([]) is not the right approach here? One 
 reason is that structs/classes can have only one opCast() method. perhaps 
 there's another approach for such syntax? That's assuming, however, that one 
 does not create a special-case for char[] types (per above inconsistencies).

It may well not be. A set of properties is another approach:

   char[] c = "abc";
   dchar[] d = c.toDString();

but this would still only work for arrays. Conversion between char 
types still only makes sense if they are widening conversions. Perhaps 
I'm simply becoming spoiled by having so much built into D. This may 
well be simply a job for library code.
 Transcoding is easy when the source content is reasonably small and fully 
 contained within block of memory. It quickly becomes quite complex when 
 streaming instead. That's really worth considering.

Good point. One of the first things I had to do for readf/unFormat was 
rewrite std.utf to accept delegates. There simply isn't any other good 
way to ensure that too much data isn't read from the stream by mistake.

 Thus, I'd suspect it may be appropriate for D to add some transcoding
 sugar.
 But it would likely have to be highly constrained (per the simple case). Is 
 it worth it?

Probably not :-) But I suppose it's worth discussing. I do like the idea of not having to rely on library code to do simple string transcoding, though this seems of limited use given the above concerns. Sean
Nov 18 2005
parent "Kris" <fu bar.com> writes:
"Sean Kelly" <sean f4.ca> wrote ...
 Kris wrote:
 But, I have a feeling that cast([]) is not the right approach here? One 
 reason is that structs/classes can have only one opCast() method. perhaps 
 there's another approach for such syntax? That's assuming, however, that 
 one does not create a special-case for char[] types (per above 
 inconsistencies).

 It may well not be. A set of properties is another approach:

    char[] c = "abc";
    dchar[] d = c.toDString();

I would agree, since it thoroughly isolates the special cases:

   char[].utf16
   char[].utf32
   wchar[].utf8
   wchar[].utf32
   dchar[].utf8
   dchar[].utf16

Generics might require the addition of 'identity' properties, like 
char[].utf8 ?
 but this would still only work for arrays.  Conversion between char types 
 still only make sense if they are widening conversions.

Aye. If the above set of properties were for arrays only, then one may 
be able to make a case that it doesn't break consistency. There might 
be a second, somewhat distinct, set:

   char.utf16
   char.utf32
   wchar.utf32

I think your approach is far more amenable than cast(), Sean. And 
properties don't eat up keyword space <g>
 Transcoding is easy when the source content is reasonably small and fully 
 contained within block of memory. It quickly becomes quite complex when 
 streaming instead. That's really worth considering.

Good point. One of the first things I had to do for readf/unFormat was 
rewrite std.utf to accept delegates. There simply isn't any other good 
way to ensure that too much data isn't read from the stream by mistake.

 Thus, I'd suspect it may be appropriate for D to add some transcoding
 sugar.
 But it would likely have to be highly constrained (per the simple case). 
 Is it worth it?

Probably not :-) But I suppose it's worth discussing. I do like the idea of not having to rely on library code to do simple string transcoding, though this seems of limited use given the above concerns.

Yeah. It would be limited (e.g. no streaming), and would likely be implemented using the heap. Even then, as you note, it could be attractive to some.
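[The streaming concern above can be illustrated with a sketch of a decoder that pulls bytes through a delegate, so it consumes exactly the bytes one code point needs and no more. This is an illustrative sketch, not the actual readf/unFormat rewrite, and it omits validation of trail bytes:]

```d
// Sketch: decode one UTF-8 code point, pulling bytes on demand through a
// delegate so no byte beyond the current code point is ever consumed.
dchar decodeFrom(ubyte delegate() next)
{
    ubyte b = next();
    if (b < 0x80)
        return b;                          // ASCII: one byte, done
    int extra = b >= 0xF0 ? 3 : (b >= 0xE0 ? 2 : 1);
    dchar c = b & (0x3F >> extra);         // payload bits of the lead byte
    while (extra-- > 0)
        c = (c << 6) | (next() & 0x3F);    // six payload bits per trail byte
    return c;
}
```

A block-based decoder cannot offer the same guarantee: it has to be handed a buffer first, which means something upstream already read an arbitrary amount from the stream.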
Nov 18 2005
prev sibling next sibling parent "Regan Heath" <regan netwin.co.nz> writes:
On Fri, 18 Nov 2005 15:31:48 +0200, Georg Wrede <georg.wrede nospam.org>  
wrote:
 Derek Parnell wrote:
 On Fri, 18 Nov 2005 12:56:24 +0200, Georg Wrede wrote:
  [snip]
  It seems that you use the word 'cast' to mean conversion of one utf
 encoding to another. However, this is not what D does.
     dchar[] y;
    wchar[] x;
     x = cast(wchar[])y;
  does *not* convert the content of 'y' to utf-16 encoding. Currently you
 *must* use the toUTF16 function to do that.

If somebody wants to retain the bit pattern while storing the contents to something else, it should be done with a union. (Just as you can do with pointers, or even objects! To name a few "workarounds".) A cast should do precisely what our toUTFxxx functions currently do.

You have my vote here.
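[Concretely, the distinction under discussion looks like this (a sketch using the existing std.utf functions; note that the array cast's behaviour is the current reinterpretation, not a conversion):]

```d
import std.utf;

void main()
{
    dchar[] y = "abc"d;

    // Today: the cast merely repaints the bits (adjusting length);
    // y's UTF-32 code units are NOT converted, so x is not valid
    // UTF-16 in general.
    wchar[] x = cast(wchar[]) y;

    // What Georg proposes cast should mean, i.e. what one must
    // write explicitly today:
    wchar[] z = toUTF16(y);    // real transcoding
}
```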
 However, are you saying that D should change its behaviour such that
 it should always implicitly convert between encoding types? Should this  
 happen only with assignments or should it also happen on
 function calls?

Both. And everywhere else (in case we forgot to name some situation).

The main argument against this, last time it was proposed, was that an expression containing several char[] types would implicitly convert any number of times during the expression. This transcoding would be inefficient, and silent, and thus bad, eg.

   char[]  a = "this is a test string";
   wchar[] b = "regan was here";
   dchar[] c = "georg posted this thing";
   char[]  d = c[0..7] ~ b[6..10] ~ a[10..14] ~ c[20..$] ~ a[14..$] ~ c[16..17];
   //supposed to be: georg was testing strings :)

How many times does the above transcode using the proposed implicit conversion rules? (The last time this topic was aired it branched into a discussion about how these rules could change to improve the situation.)
    foo(wchar[] x) { . . .  } // #1
    foo(dchar[] x) { . . .  } // #2
    dchar y;
    foo(y);  // Obviously should call #2
    foo("Some Test Data"); // Which one now?

Test data is undecorated, hence char[]. Technically on the last line above it could pick at random, when it has no "right" alternative, but I think it would be Polite Manners to make the compiler complain.

Which is what it does currently, right?
 I'm still trying to get through the notion that it  
 _really_does_not_matter_ what it chooses!

I'm still not convinced. I will raise my issues in the later posts you promise.
 (Of course performance is slower with a lot of unnecessary casts ( =  
 conversions), but that's the programmer's fault, not ours.)

I tend to agree here but as I say above, last time this aired people complained about this very thing.
 Given just the function signature and an undecorated string, it is not
 possible for the compiler to call the 'correct' function. In fact, it
 is not possible for a person (other than the original designer) to
 know which is the right one to call?

That is (I'm sorry, no offense), based on a misconception. Please see my other posts today, where I try to clear (among other things) this very issue.

Ok.
 D has currently got the better solution to this problem; get the
 coder to identify the storage characteristics of the string!

He does, at assignment to a variable. And, up till that time, it makes no difference. It _really_ does not. This also I try to explain in the other posts.

Ok. Regan
Nov 18 2005
prev sibling next sibling parent reply Derek Parnell <derek psych.ward> writes:
On Fri, 18 Nov 2005 15:31:48 +0200, Georg Wrede wrote:

 Derek Parnell wrote:
 On Fri, 18 Nov 2005 12:56:24 +0200, Georg Wrede wrote:
 
 [snip]
 
 It seems that you use the word 'cast' to mean conversion of one utf
 encoding to another. However, this is not what D does.
 
    dchar[] y;
    wchar[] x;
 
    x = cast(wchar[])y;
 
 does *not* convert the content of 'y' to utf-16 encoding. Currently you
 *must* use the toUTF16 function to do that.

If somebody wants to retain the bit pattern while storing the contents to something else, it should be done with a union. (Just as you can do with pointers, or even objects! To name a few "workarounds".) A cast should do precisely what our toUTFxxx functions currently do.

Agreed. There are times, I suppose, when the coder does not want this to happen, but those could be coded with a cast(byte[]) to avoid that.
 However, are you saying that D should change its behaviour such that
 it should always implicitly convert between encoding types? Should 
 this happen only with assignments or should it also happen on
 function calls?

Both. And everywhere else (in case we forgot to name some situation).

We have problems with inout and out parameters.

   foo(inout wchar[] x) {}
   dchar[] y = "abc";
   foo(y);

In this case, if automatic conversion took place, it would have to do it twice. It would be like doing ...

   auto wchar[] temp;
   temp = toUTF16(y);
   foo(temp);
   y = toUTF32(temp);
    foo(wchar[] x) { . . .  } // #1
    foo(dchar[] x) { . . .  } // #2
    dchar y;
    foo(y);  // Obviously should call #2
    foo("Some Test Data"); // Which one now?

Test data is undecorated, hence char[]. Technically on the last line above it could pick at random, when it has no "right" alternative, but I think it would be Polite Manners to make the compiler complain.

Yes, and that's what happens now.
 I'm still trying to get through the notion that it 
 _really_does_not_matter_ what it chooses!

I disagree. Without knowing what the intention of the function is, one has no way of knowing which function to call. Try it. Which one is the right one to call in the example above? It is quite possible that there is no right one.

If we have automatic conversion and it chooses one at random, there is no way of knowing that it's doing the 'right' thing to the data we give it. In my opinion, it's a coding error and the coder needs to provide more information to the compiler.
 (Of course performance is slower with a lot of unnecessary casts ( = 
 conversions), but that's the programmer's fault, not ours.)
 
 Given just the function signature and an undecorated string, it is not
 possible for the compiler to call the 'correct' function. In fact, it
 is not possible for a person (other than the original designer) to
 know which is the right one to call?

That is (I'm sorry, no offense), based on a misconception. Please see my other posts today, where I try to clear (among other things) this very issue.

I challenge you, right here and now, to tell me which of those two functions above is the one that the coder intended to be called. If the coder had written

   foo("Some Test Data"w);

then it's pretty clear which function was intended.

For example, D rightly complains when a similar situation occurs with the various integers.

   void foo(long x) {}
   void foo(int x) {}
   void main()
   {
      short y;
      foo(y);
   }

If D did implicit conversions and chose one at random, I'm sure we would complain.
 D has currently got the better solution to this problem; get the
 coder to identify the storage characteristics of the string!

He does, at assignment to a variable. And, up till that time, it makes no difference. It _really_ does not.

But it *DOES* make a difference when doing signature matching. I'm not talking about assignments to variables.

-- 
Derek Parnell
Melbourne, Australia
19/11/2005 8:59:16 AM
Nov 18 2005
next sibling parent reply "Regan Heath" <regan netwin.co.nz> writes:
On Sat, 19 Nov 2005 09:19:28 +1100, Derek Parnell <derek psych.ward> wrote:
 I'm still trying to get through the notion that it
 _really_does_not_matter_ what it chooses!

I disagree. Without knowing what the intention of the function is, one has no way of knowing which function to call. Try it. Which one is the right one to call in the example above? It is quite possible that there is no right one.

If we have automatic conversion and it chooses one at random, there is no way of knowing that it's doing the 'right' thing to the data we give it. In my opinion, it's a coding error and the coder needs to provide more information to the compiler.
 (Of course performance is slower with a lot of unnecessary casts ( =
 conversions), but that's the programmer's fault, not ours.)

 Given just the function signature and an undecorated string, it is not
 possible for the compiler to call the 'correct' function. In fact, it
 is not possible for a person (other than the original designer) to
 know which is the right one to call?

That is (I'm sorry, no offense), based on a misconception. Please see my other posts today, where I try to clear (among other things) this very issue.

I challenge you, right here and now, to tell me which of those two functions above is the one that the coder intended to be called. If the coder had written

   foo("Some Test Data"w);

then it's pretty clear which function was intended.

For example, D rightly complains when a similar situation occurs with the various integers.

   void foo(long x) {}
   void foo(int x) {}
   void main()
   {
      short y;
      foo(y);
   }

If D did implicit conversions and chose one at random, I'm sure we would complain.
 D has currently got the better solution to this problem; get the
 coder to identify the storage characteristics of the string!

He does, at assignment to a variable. And, up till that time, it makes no difference. It _really_ does not.

But it *DOES* make a difference when doing signature matching. I'm not talking about assignments to variables.

Georg/Derek, I replied to Georg here:

digitalmars.D.bugs/5587

saying essentially the same things as Derek has above. I reckon we combine these threads and continue in this one, as opposed to the one I linked above. I or you can link the other thread to here with a post if you're in agreement.

Regan
Nov 18 2005
parent reply Georg Wrede <georg.wrede nospam.org> writes:
Regan Heath wrote:
 
 Georg/Derek, I replied to Georg here:
 digitalmars.D.bugs/5587
 
 saying essentially the same things as Derek has above. I reckon we 
 combine  these threads and continue in this one, as opposed to the one I 
 linked  above. I or you can link the other thread to here with a post if 
 you're in  agreement.

Good suggestion! I actually intended that, but forgot about it while reading and thinking. :-/ So, the reply is to it directly.
Nov 20 2005
parent reply "Regan Heath" <regan netwin.co.nz> writes:
On Sun, 20 Nov 2005 17:28:33 +0200, Georg Wrede <georg.wrede nospam.org>  
wrote:
 Regan Heath wrote:
  Georg/Derek, I replied to Georg here:
 digitalmars.D.bugs/5587
  saying essentially the same things as Derek has above. I reckon we  
 combine  these threads and continue in this one, as opposed to the one  
 I linked  above. I or you can link the other thread to here with a post  
 if you're in  agreement.

Good suggestion! I actually intended that, but forgot about it while reading and thinking. :-/ So, the reply is to it directly.

Ok. I have taken your reply, clicked reply, and pasted it in here :) (I hope this post isn't confusing for anyone)

------------------------- Copied from: digitalmars.D.bugs/5607 -------------------------

On Sun, 20 Nov 2005 17:17:33 +0200, Georg Wrede <georg.wrede nospam.org> wrote:
 Regan Heath wrote:
 On Fri, 18 Nov 2005 13:02:05 +0200, Georg Wrede  
 <georg.wrede nospam.org>  wrote:
  Lets assume there is 2 functions of the same name (unintentionally),  
 doing different things.
  In that source file the programmer writes:
  write("test");
  DMD tries to choose the storage type of "test" based on the available
 overloads. There are 2 available overloads X and Y. It currently
 fails and gives an error.
  If instead it picked an overload (X) and stored "test" in the type
 for X, calling the overload for X, I agree, there would be
 _absolutely no problems_ with the stored data.
  BUT
  the overload for X doesn't do the same thing as the overload for Y.

Isn't that a problem with having overloading at all in a language? Sooner or later, most of us have done it. If not each already? Isn't this a problem with overloading in general, and not with UTF?

You're right. The problem is not limited to string literals; integer literals exhibit exactly the same problem, AFAICS. So, you've convinced me. Here is why...

http://www.digitalmars.com/d/lex.html#integerliteral
(see "The type of the integer is resolved as follows")

In essence integer literals _default_ to 'int' unless another type is specified or required. This suggested change does that, and nothing else? (can anyone see a difference?)

If so, and if I can accept the behaviour for integer literals, why can't I for string literals?

The only logical reason I can think of for not accepting it is if there exists a difference between integer literals and string literals which affects this behaviour. I can think of differences, but none which affect the behaviour. So, it seems that if I accept the risk for integers, I have to accept the risk for string literals too.

---

Note that string promotion should occur just like integer promotion does, eg:

   void foo(long i) {}
   foo(5); //calls foo(long) with no error

   void foo(wchar[] s) {}
   foo("test"); //should call foo(wchar[]) with no error

this behaviour is current and should not change.

Regan
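[The parallel being drawn can be sketched as follows; the integer half is current D behaviour, the string half is the proposed behaviour, not what DMD does today:]

```d
// Integer literals: an undecorated 5 defaults to int, then promotes.
void bar(long i) {}

// The proposal: an undecorated string literal would default to char[]
// in the same way, instead of producing an ambiguity error.
void put(char[] s)  {}
void put(wchar[] s) {}

void main()
{
    bar(5);        // fine today: 5 is an int literal, promoted to long
    // put("test"); // ambiguous today; under the proposal it would
    //              // pick put(char[]), just as 5 picked int
}
```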
Nov 20 2005
parent reply Derek Parnell <derek psych.ward> writes:
On Mon, 21 Nov 2005 10:58:28 +1300, Regan Heath wrote:

Ok, I'll comment but only 'cos you asked ;-)

 On Sun, 20 Nov 2005 17:28:33 +0200, Georg Wrede <georg.wrede nospam.org>  
 wrote:
 Regan Heath wrote:
  Georg/Derek, I replied to Georg here:
 digitalmars.D.bugs/5587
  saying essentially the same things as Derek has above. I reckon we  
 combine  these threads and continue in this one, as opposed to the one  
 I linked  above. I or you can link the other thread to here with a post  
 if you're in  agreement.

Good suggestion! I actually intended that, but forgot about it while reading and thinking. :-/ So, the reply is to it directly.

Ok. I have taken your reply, clicked reply, and pasted it in here :) (I hope this post isn't confusing for anyone)

------------------------- Copied from: digitalmars.D.bugs/5607 -------------------------

On Sun, 20 Nov 2005 17:17:33 +0200, Georg Wrede <georg.wrede nospam.org> wrote:
 Regan Heath wrote:
 On Fri, 18 Nov 2005 13:02:05 +0200, Georg Wrede  
 <georg.wrede nospam.org>  wrote:
  Lets assume there is 2 functions of the same name (unintentionally),  
 doing different things.
  In that source file the programmer writes:
  write("test");
  DMD tries to choose the storage type of "test" based on the available
 overloads. There are 2 available overloads X and Y. It currently
 fails and gives an error.
  If instead it picked an overload (X) and stored "test" in the type
 for X, calling the overload for X, I agree, there would be
 _absolutely no problems_ with the stored data.
  BUT
  the overload for X doesn't do the same thing as the overload for Y.

Isn't that a problem with having overloading at all in a language? Sooner or later, most of us have done it. If not each already? Isn't this a problem with overloading in general, and not with UTF?

You're right. The problem is not limited to string literals; integer literals exhibit exactly the same problem, AFAICS. So, you've convinced me. Here is why...

http://www.digitalmars.com/d/lex.html#integerliteral
(see "The type of the integer is resolved as follows")

In essence integer literals _default_ to 'int' unless another type is specified or required. This suggested change does that, and nothing else? (can anyone see a difference?)

Are you suggesting that in the situation where multiple function signatures could possibly match an undecorated string literal, D should assume that the string literal is actually in utf-8 format, and if that then fails to find a match, it should signal an error?
 If so and if I can accept the behaviour for integer literals why can't I  
 for string literals?
 
 The only logical reason I can think of for not accepting it, is if there  
 exists a difference between integer literals and string literals which  
 affects this behaviour.
 
 I can think of differences, but none which affect the behaviour. So, it  
 seems that if I accept the risk for integers, I have to accept the risk  
 for string literals too.

What might be a relevant point about this is that we are trying to talk about strings, but as far as D is concerned, we are really talking about arrays (of code-units). And for arrays, the current D behaviour is self-consistent.

If, however, D supported a true string data type, then a great deal of our messy code dealing with UTF conversions would disappear, just as it does with integers and floating point values. Imagine the problems we would have if integers were regarded as arrays of bits by the compiler!
 ---
 
 Note that string promotion should occur just like integer promotion does,  
 eg:
 
 void foo(long i) {}
 foo(5); //calls foo(long) with no error

But what happens when ...

   void foo(long i) {}
   void foo(short i) {}
   foo(5); //calls ???
 void foo(wchar[] s) {}
 foo("test"); //should call foo(wchar[]) with no error
 
 this behaviour is current and should not change.

Agreed.

   void foo(wchar[] s) {}
   void foo(char[] s) {}
   foo("test"); //should call ???

I'm now thinking that it should call the char[] signature without error. But in this case ...

   void foo(wchar[] s) {}
   void foo(dchar[] s) {}
   foo("test"); //should call an error.

If we had a generic string type we'd probably just code ....

   void foo(string s) {}
   foo("test");  // Calls the one function
   foo("test"d); // Also calls the one function

D would convert to an appropriate UTF format silently before (and after) calling.

-- 
Derek
(skype: derek.j.parnell)
Melbourne, Australia
21/11/2005 10:41:35 AM
Nov 20 2005
parent reply "Regan Heath" <regan netwin.co.nz> writes:
On Mon, 21 Nov 2005 11:10:30 +1100, Derek Parnell <derek psych.ward> wrote:
 On Mon, 21 Nov 2005 10:58:28 +1300, Regan Heath wrote:

 Ok, I'll comment but only 'cos you asked ;-)

Thanks <g>.
 Regan Heath wrote:
 On Fri, 18 Nov 2005 13:02:05 +0200, Georg Wrede
 <georg.wrede nospam.org>  wrote:
  Lets assume there is 2 functions of the same name (unintentionally),
 doing different things.
  In that source file the programmer writes:
  write("test");
  DMD tries to choose the storage type of "test" based on the available
 overloads. There are 2 available overloads X and Y. It currently
 fails and gives an error.
  If instead it picked an overload (X) and stored "test" in the type
 for X, calling the overload for X, I agree, there would be
 _absolutely no problems_ with the stored data.
  BUT
  the overload for X doesn't do the same thing as the overload for Y.

Isn't that a problem with having overloading at all in a language? Sooner or later, most of us have done it. If not each already? Isn't this a problem with overloading in general, and not with UTF?

You're right. The problem is not limited to string literals; integer literals exhibit exactly the same problem, AFAICS. So, you've convinced me. Here is why...

http://www.digitalmars.com/d/lex.html#integerliteral
(see "The type of the integer is resolved as follows")

In essence integer literals _default_ to 'int' unless another type is specified or required. This suggested change does that, and nothing else? (can anyone see a difference?)

Are you suggesting that in the situation where multiple function signatures could possibly match an undecorated string literal, D should assume that the string literal is actually in utf-8 format, and if that then fails to find a match, it should signal an error?

I'm suggesting that an undecorated string literal could default to char[], similar to how an undecorated integer literal defaults to 'int', and that the risk created by that behaviour would be no different in either case.
 If so and if I can accept the behaviour for integer literals why can't I
 for string literals?

 The only logical reason I can think of for not accepting it, is if there
 exists a difference between integer literals and string literals which
 affects this behaviour.

 I can think of differences, but none which affect the behaviour. So, it
 seems that if I accept the risk for integers, I have to accept the risk
 for string literals too.

What might be a relevant point about this is that we are trying to talk about strings, but as far as D is concerned, we are really talking about arrays (of code-units). And for arrays, the current D behaviour is self-consistent.

If, however, D supported a true string data type, then a great deal of our messy code dealing with UTF conversions would disappear, just as it does with integers and floating point values. Imagine the problems we would have if integers were regarded as arrays of bits by the compiler!

I'm not sure it makes any difference that char[] is an array. If you imagine that we removed the current integer literal rules, here:

http://www.digitalmars.com/d/lex.html#integerliteral
(see "The type of the integer is resolved as follows")

then short/int/long would exhibit the same problem that char[]/wchar[]/dchar[] does; this would be illegal:

   void foo(short i) {}
   void foo(int i) {}
   void foo(long i) {}
   foo(5);

requiring:

   foo(5s); //to call short version
   foo(5i); //to call int version
   foo(5l); //to call long version

or:

   foo(cast(short)5); //to call short version
   foo(cast(int)5);   //to call int version
   foo(cast(long)5);  //to call long version

just like char[]/wchar[]/dchar[] does today.
 ---

 Note that string promotion should occur just like integer promotion  
 does,
 eg:

 void foo(long i) {}
 foo(5); //calls foo(long) with no error

But what happens when ...

   void foo(long i) {}
   void foo(short i) {}
   foo(5); //calls ???

You get:

   test.d(8): function test.foo called with argument types:
           (int)
   matches both:
           test.foo(short)
   and:
           test.foo(long)

which is correct IMO because 'int' can be promoted to both 'short' and 'long' with equal preference. ("that's the long and short of it" <g>)
 void foo(wchar[] s) {}
 foo("test"); //should call foo(wchar[]) with no error

 this behaviour is current and should not change.

Agreed.

   void foo(wchar[] s) {}
   void foo(char[] s) {}
   foo("test"); //should call ???

I'm now thinking that it should call the char[] signature without error.

That's what they've been suggesting. I have started to agree (because it's no more risky than current integer literal behaviour, which I have blithely accepted for years now - perhaps due to lack of knowledge when I first started programming, and now because I am used to it and it seems natural)
 But in this case ...

  void foo(wchar[] s) {}
  void foo(dchar[] s) {}
  foo("test"); //should call an error.

Agreed, just like the integer literal example above.
 If we had a generic string type we'd probably just code ....

  void foo(string s) {}
  foo("test");  // Calls the one function
  foo("test"d); // Also calls the one function

 D would convert to an appropriate UTF format silently before (and after
 calling).

It's an interesting idea. I was thinking the same thing recently: why not have 1 super-type "string" and have it convert to the format required when asked, eg.

   //writing strings
   void c_function_call(char *string) {}
   void os_function_call(wchar[] string) {}
   void write_to_file_in_specific_encoding(dchar[] string) {}

   string a = "test"; //"test" is stored in application defined default internal representation (more on this later)

   c_function_call(a.utf8);
   os_function_call(a.utf16);
   write_to_file_in_specific_encoding(a.utf32);
   normal_d_function(a);

   //reading strings
   void read_from_file_in_specific_encoding(inout dchar[]) {}

   string a;
   read_from_file_in_specific_encoding(a.utf32);

or, perhaps we can go one step further and implicitly transcode where required, eg:

   c_function_call(a);
   os_function_call(a);
   write_to_file_in_specific_encoding(a);
   read_from_file_in_specific_encoding(a);

The properties (Sean's idea, thanks Sean) utf8, utf16, and utf32 would be of type char[], wchar[] and dchar[] respectively (so, these types remain). Slicing string would give characters, as opposed to code units (parts of characters).

I still believe the only times you care which encoding it is in, and/or should be transcoding, is on input and output, and for performance reasons you do not want it converting all over the place. To address performance concerns each application may want to define the default internal encoding of strings, and/or we could use the encoding specified on assignment/creation, eg.

   string a; //stored in application defined default (or char[] as that is D's general purpose default)
   string a = "test"w; //stored as wchar[] internally
   a.utf16 //does no transcoding
   a.utf32 //causes transcoding
   a.utf8  //causes transcoding

or, when you have nothing to assign, a special syntax is used to specify the internal encoding

   //some options off the top of my head...
   string a = string.UTF16;
   string a!(wchar[]); //random thought, can all this be achieved with a template?
   string a(UTF16);

   read_from_file_in_specific_encoding(a.utf32);

the above would create an empty/non-existant (lets not go here yet <g>) utf16 string in memory, and transcode from the file, which is utf32, to utf16 for internal representation, then:

   a.utf16 //does no transcoding
   a.utf8  //causes transcoding
   a.utf32 //causes transcoding

Assignment of strings with different internal representations would cause transcoding. This should be rare, as most would be in the application defined internal representation; it would naturally occur on input and output where you cannot avoid it anyway.

This idea has me quite excited. If no-one can poke large unsightly holes in it, perhaps we could work on a draft spec for it? (i.e. post it to digitalmars.D and see what everyone thinks)

Regan
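[A minimal sketch of such a string super-type, assuming std.utf's converters and a fixed UTF-8 internal representation for brevity; the real proposal would let the internal encoding vary per assignment. The name `String` and its members are hypothetical, not an existing API:]

```d
import std.utf;

// Hypothetical 'string' super-type; fixed UTF-8 storage for brevity.
struct String
{
    private char[] data;   // internal representation

    static String opCall(char[] s)
    {
        String r;
        r.data = s;
        return r;
    }

    char[]  utf8()  { return data; }           // no transcoding
    wchar[] utf16() { return toUTF16(data); }  // transcodes on demand
    dchar[] utf32() { return toUTF32(data); }
}

void main()
{
    String a = String("test");
    wchar[] w = a.utf16;   // transcoded copy for an API wanting UTF-16
    char[]  c = a.utf8;    // the internal array, untouched
}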
Nov 20 2005
next sibling parent Derek Parnell <derek psych.ward> writes:
On Mon, 21 Nov 2005 14:16:22 +1300, Regan Heath wrote:


[snip]

 This idea has me quite excited, if no-one can poke large unsightly holes  
 in it perhaps we could work on a draft spec for it? (i.e. post it to  
 digitalmars.D and see what everyone thinks)

Not only do great minds think alike, so do you and I! I'm starting to think that you (and your minion helpers) have hit upon a 'Great Idea(tm)'.

-- 
Derek Parnell
Melbourne, Australia
21/11/2005 12:57:40 PM
Nov 20 2005
prev sibling parent "Kris" <fu bar.com> writes:
"Regan Heath" <regan netwin.co.nz> wrote...
 On Mon, 21 Nov 2005 11:10:30 +1100, Derek Parnell <derek psych.ward> 
 wrote:

  void foo(wchar[] s) {}
  void foo(char[]  s) {}
  foo("test"); //should call ???

 I'm now thinking that it should call the char[] signature without error.

That's what they've been suggesting. I have started to agree (because it's no more risky than current integer literal behaviour

Aye!
 But in this case ...

  void foo(wchar[] s) {}
  void foo(dchar[] s) {}
  foo("test"); //should call an error.

Agreed, just like the integer literal example above.

Aye!
Nov 20 2005
prev sibling parent reply Georg Wrede <georg.wrede nospam.org> writes:
Derek Parnell wrote:
 On Fri, 18 Nov 2005 15:31:48 +0200, Georg Wrede wrote:
 Derek Parnell wrote:


 However, are you saying that D should change its behaviour such
 that it should always implicitly convert between encoding types?
 Should this happen only with assignments or should it also happen
 on function calls?

Both. And everywhere else (in case we forgot to name some situation).

We have problems with inout and out parameters.

   foo(inout wchar[] x) {}
   dchar[] y = "abc";
   foo(y);

In this case, if automatic conversion took place, it would have to do it twice. It would be like doing ...

   auto wchar[] temp;
   temp = toUTF16(y);
   foo(temp);
   y = toUTF32(temp);

Would you be surprised:

   Foo[] foo = new Foo[10];
   for(ubyte i=0; i<10; i++) // Not short, int, or long, to "save space"
   {
       foo[i] = whatever; // Gee, compiler silently casts to int!
   }

He might be stupid, uneducated, or then not have coded since 1985. And it happens.
   foo(wchar[] x) { . . .  } // #1
   foo(dchar[] x) { . . .  } // #2
   dchar y;
   foo(y);  // Obviously should call #2
   foo("Some Test Data"); // Which one now?

Test data is undecorated, hence char[]. Technically on the last line above it could pick at random, when it has no "right" alternative, but I think it would be Polite Manners to make the compiler complain.

Yes, at that's what happens now.
I'm still trying to get through the notion that it 
_really_does_not_matter_ what it chooses!

I disagree. Without know what the intention of the function is, one has no way of knowing which function to call. Try it. Which one is the right one to call in the example above? It is quite possible that there is no right one.

If the overloaded functions purport to take UTF (of any width at all), then it is assumed that they do _semantically_ the same thing. Thus, one has the right to sleep at night. The programmer shall not see any difference whichever is chosen:

- if there's only one type, then there's no choice anyway.
- if there's one that matches, then pick that. (Not that it would be obligatory, but it's polite.)
- if there are the two non-matching, then pick the one preferred by the compiler writer, or the OS vendor. If not, then just pick either one.
- if there are no UTF versions, then it'd be okay to complain, at compile time.
 If we have automatic conversion and it choose one at random, there is
 no way of knowing that its doing the 'right' thing to the data we
 give it. In my opinion, its a coding error and the coder need to
 provide more information to the compiler.

I want everyone to understand that it makes just as little difference as when the compiler optimizer chooses a datatype for variable i in this:

   for(ubyte i=0; i<256; i++)
   {
       // do stuff
   }

Can you honestly say that it makes a difference which type i is? (Except signed byte, of course. And we're not talking about performance.)

I wouldn't be surprised if DMD (haven't checked!) would sneak i to int instead of the explicitly asked-for ubyte, already in the default compile mode. And at -release and -O it probably should. (Again, haven't checked, and even if it does not do it, the issue is a matter of principle: would making it int make a difference in this example?)
 (Of course performance is slower with a lot of unnecessary casts (
 = conversions), but that's the programmer's fault, not ours.)

 Given just the function signature and an undecorated string, it
 is not possible for the compiler to call the 'correct' function.
 In fact, it is not possible for a person (other than the original
 designer) to know which is the right one to call?

That is (I'm sorry, no offense), based on a misconception. Please see my other posts today, where I try to clear (among other things) this very issue.

I challenge you, right here and now, to tell me which of those two functions above is the one that the coder intended to be called.

Suppose you're in a huge software project in D, and the customer has ordered it to do all arithmetic in long. After 1500000 lines it goes to the beta testers, and they report weird behavior. Three weeks of searching, and the boss is raving around with an axe. One night the following code is found:

   import std.stdio;
   void main()
   {
       long myvar;
       ...
       myvar = int.max / 47;
       ... 300 lines
       myvar = scale(myvar);
       ... 500 lines
   }
   ... 50000 lines later
   long scale(int v)
   {
       long tmp = 1000 * v;
       return tmp / 3;
   }

Folks suspect the bug is here, but what is wrong? Does the compiler complain? Should it?
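[For what it's worth, the trap in the snippet above sits in two places, assuming D of that era, which accepted implicit narrowing conversions: the long argument is silently truncated to int at the call site, and 1000 * v is computed in 32-bit int before being widened:]

```d
long scale(int v)            // myvar (long) is silently narrowed to int here
{
    long tmp = 1000 * v;     // computed entirely in int: overflows for large v,
                             // and only THEN widened to long
    // long tmp = 1000L * v; // would force 64-bit arithmetic instead
    return tmp / 3;
}
```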
 If the coder had written 
 
     foo("Some Test Data"w);
 
 then its pretty clear which function was intended.

Except that my example above is dangerous, while with UTF it can't get dangerous. Hey, what should the compiler complain about if I write:

   char[] a = "\U00000041"c;

(Do you think it currently complains? Saying what? Or doesn't it? And what do you say happens if one would get this currently compiled and run?)
 D has currently got the better solution to this problem; get the 
 coder to identify the storage characteristics of the string!

He does, at assignment to a variable. And, up till that time, it makes no difference. It _really_ does not.

But it *DOES* make a difference when doing signature matching. I'm not talking about assignments to variables.

Would it be correct to say that nothing can be done with an undecorated string literal unless the type of the receiver is known? Apart from passing to overloaded functions (each of which knows "what it wants"), is there any situation where UTF is accepted, but the receiver does not itself know which it "wants", or even "prefers"? Should there be such cases? Could there?
Nov 20 2005
parent reply Derek Parnell <derek psych.ward> writes:
On Sun, 20 Nov 2005 17:02:04 +0200, Georg Wrede wrote:

 Derek Parnell wrote:
 On Fri, 18 Nov 2005 15:31:48 +0200, Georg Wrede wrote:
 Derek Parnell wrote:


 However, are you saying that D should change its behaviour such
 that it should always implicitly convert between encoding types?
 Should this happen only with assignments or should it also happen
 on function calls?

Both. And everywhere else (in case we forgot to name some situation).

We have problems with inout and out parameters.

    foo(inout wchar[] x) {}
    dchar[] y = "abc";
    foo(y);

In this case, if automatic conversion took place, it would have to do it twice. It would be like doing:

    auto wchar[] temp;
    temp = toUTF16(y);
    foo(temp);
    y = toUTF32(temp);

Would you be surprised:

Surprised about the two conversions? No, I just said that's what it would have to do, so no, I wouldn't be surprised. I just said it would be a problem, in so far as the compiler would (currently) not warn coders about the performance hit until they profiled it, and even then it might not be obvious to some people.
 Foo[10] foo = new Foo;
 for(ubyte i=0; i<10; i++) // Not short, int, or long, "save space"
 {
      foo[i] = whatever;  // Gee, compiler silently casts to int!
 }
 
 He might be stupid, uneducated, or not have coded since 1985.
 And it happens.

What on earth has the above example got to do with double conversions? And converting from ubyte to int is not exactly a performance drain.
   foo(wchar[] x) { . . .  } // #1
   foo(dchar[] x) { . . .  } // #2
   dchar y;
   foo(y);  // Obviously should call #2
   foo("Some Test Data"); // Which one now?

Test data is undecorated, hence char[]. Technically on the last line above it could pick at random, when it has no "right" alternative, but I think it would be Polite Manners to make the compiler complain.

Yes, and that's what happens now.
I'm still trying to get through the notion that it 
_really_does_not_matter_ what it chooses!

I disagree. Without knowing what the intention of the function is, one has no way of knowing which function to call. Try it. Which one is the right one to call in the example above? It is quite possible that there is no right one.
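
A sketch of the situation (the function bodies are hypothetical, and I'm assuming the current compiler's behaviour on the undecorated call):

    import std.stdio;

    void foo(wchar[] x) { writefln("got UTF-16"); } // #1
    void foo(dchar[] x) { writefln("got UTF-32"); } // #2

    void main()
    {
        foo("Some Test Data"w);   // suffix commits the literal: calls #1
        foo("Some Test Data"d);   // suffix commits the literal: calls #2
        // foo("Some Test Data"); // undecorated: matches both, rejected
    }

Uncommenting the last call should reproduce exactly the ambiguity we are arguing about.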

If the overloaded functions purport to take UTF (of any width at all), then it is assumed that they do _semantically_ the same thing. Thus, one has the right to sleep at night.

Assumptions like that have a nasty habit of generating nightmares. It is *only* an assumption and not a decision based on actual knowledge.
 The programmer shall not see any difference whichever is chosen:
 
   - if there's only one type, then there's no choice anyway.

But there is more than one.
   - if there's one that matches, then pick that, (not that it would be 
 obligatory, but it's polite.)

Sorry, no matches.
   - if there are the two non-matching, then pick the one preferred by 
 the compiler writer, or the OS vendor. If not, then just pick either one.

BANG! This is where we part company. My belief is that assuming functions with the same name are going to do the same thing is dangerous and can lead to mistakes. Whereas you seem to be saying that this is a safe assumption to make.
   - if there are no UTF versions, then it'd be okay to complain, at 
 compile time.
 
 If we have automatic conversion and it chooses one at random, there is
 no way of knowing that it's doing the 'right' thing to the data we
 give it. In my opinion, it's a coding error and the coder needs to
 provide more information to the compiler.

I want everyone to understand that it makes just as little difference as when the compiler optimizer chooses a datatype for variable i in this:

    for(ubyte i=0; i<256; i++)
    {
        // do stuff
    }

Can you honestly say that it makes a difference which type i is? (Except signed byte, of course. And we're not talking about performance.)

No, but what's this got to do with the argument?
 I wouldn't be surprised if DMD (haven't checked!) would sneak i to int 
 instead of the explicitly asked-for ubyte, already in the default 
 compile mode. And -release, and at -O probably should. (Again, haven't 
 checked, and even if it does not do it, the issue is a matter of 
 principle: would making it int make a difference in this example?)

Red Herring Alert!
 (Of course performance is slower with a lot of unnecessary casts (
 = conversions), but that's the programmer's fault, not ours.)

 Given just the function signature and an undecorated string, it
 is not possible for the compiler to call the 'correct' function.
 In fact, it is not possible for a person (other than the original
 designer) to know which is the right one to call?

That is (I'm sorry, no offense) based on a misconception. Please see my other posts today, where I try to clear up (among other things) this very issue.

I challenge you, right here and now, to tell me which of those two functions above is the one that the coder intended to be called.

Suppose you're in a huge software project with D, and the customer has ordered it to do all arithmetic in long. After 1500000 lines it goes to the beta testers, and they report weird behavior. Three weeks of searching, and the boss is raving around with an axe. One night the following code is found:

    import std.stdio;

    void main()
    {
        long myvar;
        ...
        myvar = int.max / 47;
        ... 300 lines
        myvar = scale(myvar);
        ... 500 lines
    }

    ... 50000 lines later

    long scale(int v)
    {
        long tmp = 1000 * v;
        return tmp / 3;
    }

Folks suspect the bug is here, but what is wrong? Does the compiler complain? Should it?

No it doesn't and yes it should.
 If the coder had written 
 
     foo("Some Test Data"w);
 
 then its pretty clear which function was intended.

Except that my example above is dangerous, while with UTF it can't get dangerous.

Assumptions can hurt too.
 Hey, should the compiler complain if I write:
 
 char[] a = "\U00000041"c;
 
 (Do you think it currently complains? Saying what? Or doesn't it? And 
 what do you say happens if one would get this currently compiled and run?)

Of course not. Both 'a' and the literal are of the same data type.
 D has currently got the better solution to this problem; get the 
 coder to identify the storage characteristics of the string!

He does, at assignment to a variable. And, up till that time, it makes no difference. It _really_ does not.

But it *DOES* make a difference when doing signature matching. I'm not talking about assignments to variables.

Would it be correct to say that nothing can be done with an undecorated string literal unless the type of the receiver is known? Apart from passing to overloaded functions (each of which knows "what it wants"), is there any situation where UTF is accepted, but the receiver does not itself know which it "wants", or even "prefers"? Should there be such cases? Could there?

Again, I fail to see what this has to do with the issue.

Let's call a halt to this discussion. I suspect that you and I will not agree about this function signature matching issue anytime soon.

-- 
Derek Parnell
Melbourne, Australia
21/11/2005 6:55:57 AM
Nov 20 2005
parent reply Georg Wrede <georg.wrede nospam.org> writes:
Derek Parnell wrote:
 Let's call a halt to this discussion. I suspect that you and I will
 not agree about this function signature matching issue anytime soon.

Whew, I was just starting to wonder what to do. :-) Maybe we'll save the others some headaches too. Besides, at this point, I guess nobody else reads this thread anyway. :-)

But it was nice to learn that with some folks you really can disagree long and good, and still not start fighting.

georg
Nov 20 2005
parent "Regan Heath" <regan netwin.co.nz> writes:
On Mon, 21 Nov 2005 00:23:39 +0200, Georg Wrede <georg.wrede nospam.org>  
wrote:
 Derek Parnell wrote:
 Let's call a halt to this discussion. I suspect that you and I will
 not agree about this function signature matching issue anytime soon.

Whew, I was just starting to wonder what to do. :-)

I'm interested in both your opinions on: digitalmars.D.bugs/5612
 Maybe we'll save the others some headaches too. Besides, at this point,  
 I guess nobody else reads this thread anyway. :-)

Or they prefer to lurk. Or we scared them away.
 But it was nice to learn that with some folks you really can disagree  
 long and good, and still not start fighting.

It's how it's supposed to work :) The key, I believe, is to realise that it's not personal; it's a discussion/argument of opinion. Disagreeing with an opinion is not the same as disliking the person who holds that opinion.

Of course, this is only true when the participants do not make comments which can be taken as being directed at the person, as opposed to the points of the argument itself. This is harder than it sounds, because the written word often does not convey your meaning as well as your face and voice could in a face-to-face conversation.

My 2c.

Regan
Nov 20 2005
prev sibling parent reply Bruno Medeiros <daiphoenixNO SPAMlycos.com> writes:
Georg Wrede wrote:
 
 If somebody wants to retain the bit pattern while storing the contents 
 to something else, it should be done with a union. (Just as you can do 
 with pointers, or even objects! To name a few "workarounds".)
 
 A cast should do precisely what our toUTFxxx functions currently do.
 

It should? Why? What is the problem of using the toUTFxx functions?

-- 
Bruno Medeiros - CS/E student
"Certain aspects of D are a pathway to many abilities some consider to be... unnatural."
Nov 20 2005
next sibling parent Georg Wrede <georg.wrede nospam.org> writes:
Bruno Medeiros wrote:
 Georg Wrede wrote:
 
 
 If somebody wants to retain the bit pattern while storing the
 contents to something else, it should be done with a union. (Just
 as you can do with pointers, or even objects! To name a few
 "workarounds".)
 
 A cast should do precisely what our toUTFxxx functions currently
 do.
 


Nothing wrong. But cast should not do the union thing. Of course, we could have the toUTFxxx and no cast at all for UTF strings, no problem. But definitely _not_ have the cast do the "union thing".
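
A sketch of the difference (toUTF16 as in Phobos' std.utf; the union layout here is my assumption, for illustration only):

    import std.utf; // toUTF16, toUTF8

    union Bits
    {
        char[]  c; // UTF-8 view
        wchar[] w; // UTF-16 view of the same (length, ptr) pair
    }

    void main()
    {
        char[] s = "h\u00E9llo"; // UTF-8 data

        // Conversion: allocates a new array, recomputes the code units.
        wchar[] converted = toUTF16(s);

        // Reinterpretation ("the union thing"): same bits, relabelled.
        Bits b;
        b.c = s;
        wchar[] punned = b.w; // length and contents are nonsense as UTF-16
    }

The conversion preserves the *text*; the union preserves the *bit pattern* and almost certainly yields invalid UTF-16.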
Nov 20 2005
prev sibling parent reply Derek Parnell <derek psych.ward> writes:
On Sun, 20 Nov 2005 11:30:34 +0000, Bruno Medeiros wrote:

 Georg Wrede wrote:
 
 If somebody wants to retain the bit pattern while storing the contents 
 to something else, it should be done with a union. (Just as you can do 
 with pointers, or even objects! To name a few "workarounds".)
 
 A cast should do precisely what our toUTFxxx functions currently do.
 


Do we have a toReal(), toFloat(), toInt(), toDouble(), toLong(), toULong(), ... ?

-- 
Derek Parnell
Melbourne, Australia
21/11/2005 6:48:48 AM
Nov 20 2005
parent reply Bruno Medeiros <daiphoenixNO SPAMlycos.com> writes:
Derek Parnell wrote:
 On Sun, 20 Nov 2005 11:30:34 +0000, Bruno Medeiros wrote:
 
 
Georg Wrede wrote:

If somebody wants to retain the bit pattern while storing the contents 
to something else, it should be done with a union. (Just as you can do 
with pointers, or even objects! To name a few "workarounds".)

A cast should do precisely what our toUTFxxx functions currently do.

It should? Why? What is the problem of using the toUTFxx functions?

Do we have a toReal(), toFloat(), toInt(), toDouble(), toLong(), toULong(), .... ?

No, we don't. But the case is different: between primitive numbers the casts are usually (if not always?) implicit, but most importantly, they are quite trivial. And by trivial I mean Assembly-level trivial. String encoding conversions, on the other hand (as you surely are aware), are quite non-trivial (in terms of code, run time, and heap memory usage), and I don't think a cast should perform such non-trivial operations.

-- 
Bruno Medeiros - CS/E student
"Certain aspects of D are a pathway to many abilities some consider to be... unnatural."
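
To make the trivial/non-trivial line concrete, a small sketch (toUTF32 as in Phobos' std.utf of the time):

    import std.utf; // toUTF32

    void main()
    {
        byte b = 42;
        long l = cast(long) b; // trivial: a single sign-extension

        char[] s = "\u00F6konomisch"; // UTF-8, variable-width code units
        // Non-trivial: loops over s, decodes every code point, and
        // allocates a fresh dchar[] on the heap.
        dchar[] d = toUTF32(s);
    }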
Nov 21 2005
parent reply Derek Parnell <derek psych.ward> writes:
On Mon, 21 Nov 2005 11:14:47 +0000, Bruno Medeiros wrote:

 Derek Parnell wrote:
 On Sun, 20 Nov 2005 11:30:34 +0000, Bruno Medeiros wrote:
 
 
Georg Wrede wrote:

If somebody wants to retain the bit pattern while storing the contents 
to something else, it should be done with a union. (Just as you can do 
with pointers, or even objects! To name a few "workarounds".)

A cast should do precisely what our toUTFxxx functions currently do.

It should? Why? What is the problem of using the toUTFxx functions?

Do we have a toReal(), toFloat(), toInt(), toDouble(), toLong(), toULong(), .... ?

casts are usually (if not always?) implicit, but most importantly, they are quite trivial. And by trivial I mean Assembly-level trivial. String encoding conversions on the other hand (as you surely are aware) are quite non-trivial (both in terms of code, run time, and heap memory usage), and I don't think a cast should perform such non-trivial operations.

Why? If documented, the user can be prepared. And where is the tipping point? The point at which an operation becomes non-trivial?

You mention 'assembly-level', by which I think you mean that a sub-routine is not called but the machine code is generated in-line for the operation. Would that be the trivial/non-trivial divide? Is conversion from byte to real done in-line or via sub-routine call? I don't actually know, just asking.

-- 
Derek Parnell
Melbourne, Australia
21/11/2005 10:47:34 PM
Nov 21 2005
next sibling parent reply Don Clugston <dac nospam.com.au> writes:
Derek Parnell wrote:
 On Mon, 21 Nov 2005 11:14:47 +0000, Bruno Medeiros wrote:
 
 
Derek Parnell wrote:

On Sun, 20 Nov 2005 11:30:34 +0000, Bruno Medeiros wrote:



Georg Wrede wrote:


If somebody wants to retain the bit pattern while storing the contents 
to something else, it should be done with a union. (Just as you can do 
with pointers, or even objects! To name a few "workarounds".)

A cast should do precisely what our toUTFxxx functions currently do.

It should? Why? What is the problem of using the toUTFxx functions?

Do we have a toReal(), toFloat(), toInt(), toDouble(), toLong(), toULong(), .... ?

No, we don't. But the case is different: between primitive numbers the casts are usually (if not always?) implicit, but most importantly, they are quite trivial. And by trivial I mean Assembly-level trivial. String encoding conversions on the other hand (as you surely are aware) are quite non-trivial (both in terms of code, run time, and heap memory usage), and I don't think a cast should perform such non-trivial operations.

Why? If documented, the user can be prepared. And where is the tipping point? The point at which an operation becomes non-trivial? You mention 'assembly-level' by which I think you mean that a sub-routine is not called but the machine code is generated in-line for the operation. Would that be the trivial/non-trivial divide?

I would think so. I'd define trivial as: "the assembly code doesn't have any loops".
 Is conversion from byte to real done in-line or via sub-routine call? I
 don't actually know, just asking.

On x86, int -> real can be done with the FILD instruction, or without the FPU in a couple of instructions. short -> int is done with MOVSX, ushort -> uint with MOVZX.

HOWEVER -- I don't think this is really relevant. The real issue is about literals, which, as Georg rightly said, could be stored in ANY format. Conversion from a literal to any type has ZERO runtime cost.

I think that in a few respects, the existing situation for strings is BETTER than the situation for integers. I personally don't like the fact that integer literals default to 'int', unless you suffix them with L. Even if the number is too big to fit into an int! And floating-point constants default to 'double', not real.

One intriguing possibility would be to have literals having NO type (or more accurately, an unassigned type), the type only being assigned when the literal is used. E.g. "abc" is of type const __unassignedchar[]. There are implicit conversions from __unassignedchar[] to char[], wchar[], and dchar[], but there are none from char[] to wchar[]. Adding a suffix changes the type from __unassignedchar to char[], wchar[], or dchar[], preventing any implicit conversions. (__unassignedchar could also be called __stringliteral -- it's inaccessible, anyway.)

Similarly, an integral constant could be of type __integerliteral UNTIL it is assigned to something. At that point, a check is performed to see if the value can actually fit in the type. If not (e.g. when an extended UTF char is assigned to a char), it's an error.

Admittedly, it's more difficult to deal with when you have integers, and especially with reals, where no lossless conversion exists (because 1.0/3.0f + 1.0/5.0f is not the same as cast(float)(1.0/3.0L + 1.0/5.0L) -- the roundoff errors are different). There are some vagaries -- what rounding mode is used when performing calculations on reals? This is implementation-defined in C and C++; it would be nice if it were specified in D.
UTF strings are not the only worm in this can of worms :-)
Nov 21 2005
next sibling parent Oskar Linde <oskar.lindeREM OVEgmail.com> writes:
In article <dlsj73$2fod$1 digitaldaemon.com>, Don Clugston says...

I think that in a few respects, the existing situation for strings is 
BETTER than the situation for integers.

I personally don't like the fact that integer literals default to 'int', 
unless you suffix them with L. Even if the number is too big to fit into 
an int! And floating-point constants default to 'double', not real.

I agree with you.
One intriguing possibility would be to have literals having NO type (or 
more accurately, an unassigned type). The type only being assigned when 
it is used.

eg  "abc" is of type: const __unassignedchar [].
There are implicit conversions from __unassignedchar [] to char[], 
wchar[], and dchar[]. But there are none from char[] to wchar[].

String literals already work like this. :) String literals without a suffix are char[], but not "committed"; string literals with a suffix are "committed" to their type. Check the frontend sources: StringExp::implicitConvTo(Type *t) allows conversion of non-committed string literals to {,w,d}char arrays and pointers. This is what makes this an error:

# void print(char[] x) {}
# void print(wchar[] x) {}
# void main() { print("test"); }

Regards,

/Oskar
Nov 21 2005
prev sibling parent reply Sean Kelly <sean f4.ca> writes:
Don Clugston wrote:
 
 I personally don't like the fact that integer literals default to 'int', 
 unless you suffix them with L. Even if the number is too big to fit into 
 an int! And floating-point constants default to 'double', not real.

Really? I tested this a few days ago and it seemed like literals larger than int.max were treated as longs. I'll mock up another test on my way to work.

Sean
Nov 21 2005
parent reply Oskar Linde <oskar.lindeREM OVEgmail.com> writes:
In article <dlt2b9$6df$3 digitaldaemon.com>, Sean Kelly says...
Don Clugston wrote:
 
 I personally don't like the fact that integer literals default to 'int', 
 unless you suffix them with L. Even if the number is too big to fit into 
 an int! And floating-point constants default to 'double', not real.

Really? I tested this a few days ago and it seemed like literals larger than int.max were treated as longs. I'll mock up another test on my way to work.

You are right, large integers are automatically treated as longs, but too-large floating point literals are not automatically treated as real.

# import std.stdio;
#
# void main() {
#     writef("%s\n%s\n%s\n%s\n",
#         typeid(typeof(1231231231)),
#         typeid(typeof(12312312312312)),
#         typeid(typeof(1e100)),
#         //typeid(typeof(1e350)), // Error: number is not representable
#         typeid(typeof(1e350l))   // l-suffix
#     );
# }

Prints:

int
long
double
real

/Oskar
Nov 21 2005
parent Sean Kelly <sean f4.ca> writes:
Oskar Linde wrote:
 
 You are right, large integers are automatically treated as longs, but too large
 floating point literals are not automatically treated as real.

This seems reasonable though, since it's really a matter of precision with floating-point numbers more so than representability.

Sean
Nov 21 2005
prev sibling parent Bruno Medeiros <daiphoenixNO SPAMlycos.com> writes:
Derek Parnell wrote:
 On Mon, 21 Nov 2005 11:14:47 +0000, Bruno Medeiros wrote:
 
Derek Parnell wrote:
Do we have a toReal(), toFloat(), toInt(), toDouble(), toLong(), toULong(),
.... ?

No, we don't. But the case is different: between primitive numbers the casts are usually (if not always?) implicit, but most importantly, they are quite trivial. And by trivial I mean Assembly-level trivial. String encoding conversions on the other hand (as you surely are aware) are quite non-trivial (both in terms of code, run time, and heap memory usage), and I don't think a cast should perform such non-trivial operations.

Why? If documented, the user can be prepared. And where is the tipping point? The point at which an operation becomes non-trivial? You mention 'assembly-level' by which I think you mean that a sub-routine is not called but the machine code is generated in-line for the operation. Would that be the trivial/non-trivial divide?

As Don Clugston said: if the code's run time depends on the object size, that is, if it is not constant bounded, then it's beyond the acceptable point. Another disqualifier is allocating memory on the heap. A string encoding conversion does both things.
 Is conversion from byte to real done in-line or via sub-routine call? I
 don't actually know, just asking.
 

I suspected that it was merely an Assembly one-liner (i.e., one instruction only).

Note: I think the most complex cast we have right now is a class object downcast, which, although not universally constant bounded, is still compile-time constant bounded.

-- 
Bruno Medeiros - CS/E student
"Certain aspects of D are a pathway to many abilities some consider to be... unnatural."
Nov 23 2005