
digitalmars.D.bugs - the D crowd does bobdamn Rocket Science

reply Georg Wrede <georg.wrede nospam.org> writes:
I've spent a week studying the UTF issue, and another trying to explain 
it. Some progress, but not enough. Either I'm a bad explainer (hence, 
skip dreams of returning to teaching CS when I'm old), or, this really 
is an intractable issue. (Hmmmm, or everybody else is right, and I need 
to get on the green pills.)

My almost final try: Let's imagine (please, pretty please!), that Bill 
Gates descends from Claudius. This would invariably lead to cardinals 
being represented as Roman Numerals in computer user interfaces, ever 
since MSDOS times.

Then we'd have that as an everyday representation of integers. Obviously 
we'd have the functions r2a and a2r for changing string representations 
between Arabic ("1234...") and Roman ("I II III ...") numerals.

To make this example work we also need to imagine that we need a 
notation for roman numerals within strings. Let this notation be:

"\R" as in "\RXIV" (to represent the number 14)

(Since this in reality is not needed, we have to (again) imagine that it 
is, for Historical Reasons -- the MSDOS machines sometimes crashed when 
there were too many capital X on a command line, but at the time nobody 
found the reason, so the \R notation was created as a [q&d] fix.)

So, since it is politically incorrect to write "December 24", we have to 
write "December XXIV" but since the ancient bug lurks if this file gets 
transferred to "a major operating system", we have to be careful and 
write "December \RXXIV".

Now, programmers are lazy, and they end up writing "\Rxxiv" and getting 
all kinds of error messages like "invalid string literal". So a few 
anarchist programmers decided to implement the possibility of writing 
lower case roman numerals, even if the Romans themselves disapproved of 
it from the beginning.

The prefix \r was already taken, so they had two choices: either make 
computers smart and let them understand \Rxxiv (but that would risk Bill 
getting angry), or invent another prefix. They chose \N (this choice 
is a unix inside joke).

---

Then a compiler guru (who happened to descend from Asterix the Gaul) 
decided to write a new language. In the midst of it all, he stumbled 
upon the Roman issue. Being diligent (which I've wanted to be all my 
life too but never succeeded (go ask my mother, my teacher, my bosses)), 
he decided to implement strings in a non-breakable way.

So now we have:

char[], Rchar[] and Nchar[], the latter two being for situations where 
the string might contain [expletive deleted] roman values.

The logical next step was to decorate the strings themselves, so that 
the computer can unambiguously know what to assign where. Therefore we 
now have "", ""R and ""N kind of strings. Oh, and to be totally 
unambiguous and symmetric, the redundant ""C was introduced to 
explicitly denote the non-R, non-N kind of string, in case such might be 
needed some day.

Now, being modern, the guru had already made the "" kinds of strings 
Roman Proof, with the help of Ancient Gill, an elusive but legendary oracle.

---

The III wise men had also become aware of this problem space. Since 
everything in modern times grows exponentially, the letters X, C and M 
(for ten, hundred and thousand), would sooner than later need to be 
accompanied by letters for larger numbers. For a million M was already 
taken, so they chose N. And then G, T, P, E, Z, Y for giga, tera, peta, 
exa, zetta and yotta. Then the in-betweens had to be worked out too, for 
5000, 5000000 etc. 50 was already L and 500 was D..... whatever, you get 
the picture. :-)

So, they decided that, to make a string spec that lasts "forever" the 
new string had to be stored with 32 bits per character. (Since 
exponential growth (it's true, just look how Bill's purse grows!), is 
reality, they figured the numerals would run out of letters, and that's 
why new glyphs would have to be invented eventually, ad absurdum. 32 
bits would carry us until the end of the Universe.) They called it N. 
This was the official representation of strings that might contain roman 
numerals way into the future.

Then some other guys thought "Naaw, t's nuff if we have strings that 
take us till the day we retire, so 16 bits oughtta be plenty." That 
became the R string. Practical Bill adopted the R string. Later the 
other guys had to admit that their employer might catch the "retiring 
plot", so they amended the R string to SOMETIMES contain 32 bits.

Now, Bill, pragmatist that he is, ignored the issue (probably on the same 
"retiring plot"). And what Bill does defines what is right (at least 
with suits, and hey, they rule -- as opposed to us geeks).

---

Luckily II blessed-by-Bob Sourcerers (notice the spelling) thought the 
R and N stuff was wasting space, was needed only occasionally, and was 
in general cumbersome. Everybody's R but Bill's had to be able to 
handle 32 bits every once in a while, and the N stuff really was overkill.

They figured "The absolute majority of crap needs 7 bits, the absolute 
majority of the rest needs 9 bits, and the absolute majority of the rest 
needs 12 bits. So there's pretty little left after all this -- however, 
since we are blessed-by-Bob, and we do stuff properly, we won't give up 
until we can handle all this, and handle it gracefully."

They decided that their string (which they christened ""C) has to be 
compact, handle 7-bit stuff as fast as non-roman-aware programs do, 9 
bit stuff almost as fast as the R programs, and it has to be lightning 
fast to convert to and from. Also, they wanted the C strings to be 
usable as much as possible by old library routines, so for example, the 
old routines should be able to search and sort their strings without 
upgrades. And they knew that strings do get chopped, so they designed 
them so that you can start wherever, and just by looking at the 
particular octet, you'd know whether it's proper to chop the string 
there. And if it isn't, it should be trivial to look a couple of octets 
forward (or even back), and just immediately see where the next 
breakable place is. Ha, and they wanted the C strings to be 
endianness-proof!!
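The design goals in the paragraph above are, in fact, real properties of UTF-8. A minimal Python sketch (Python standing in for D here; the property belongs to the encoding itself) of the "look at the particular octet and know whether it's proper to chop there" claim:

```python
# Illustrative sketch of the UTF-8 property described above: every octet
# tells you on its own whether the string may be chopped there.
# In UTF-8, continuation bytes always look like 0b10xxxxxx.
def is_boundary(b: int) -> bool:
    """True if this octet begins a character (i.e. is a safe chop point)."""
    return (b & 0xC0) != 0x80

data = "päivä".encode("utf-8")   # 7 octets: each 'ä' takes two
# Find the nearest safe chop point at or before index 4,
# "just by looking at the particular octet":
i = 4
while i > 0 and not is_boundary(data[i]):
    i -= 1
print(i, data[:i].decode("utf-8"))   # → 4 päi
```

Because the test needs only the one octet in hand, old byte-oriented library routines can walk, search and chop such strings without ever decoding them.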

The II were already celebrities with the Enlightened, so it was decided 
that the C string will be standard on POSIX systems. Smart crowd.

*** if I don't see light here, I'll write some more one day ***

---

If I write the following strings here, and somebody pastes them in his 
source code,

"abracadabra"
"räyhäpäivä"
"ШЖЯЮЄШ"

compiles his D program, and runs it, what should (and probably will!) 
happen, is that the program output looks like the strings here.

If the guy has never heard of our Unbelievable utf-discussion, he 
probably never becomes aware that some UTF or other crap is or has been 
involved. (Hell, I've used Finnish letters in my D source code all the 
time, and never thought anything of it.)

After having seen this discussion, he gets nervous, and quickly changes 
all his strings so that they are ""c ""w and ""d decorated. From then 
on, he hardly dares to touch strings that contain non-US content. Like
us here.

The interesting thing is, did I originally write them in UTF-8, UTF-16 
or UTF-32?

How many times were they converted between these widths while travelling 
from my keyboard to this newsgroup to his machine to the executable to 
the output file?

Probably they've been in UTF-7 too, since they've gone through mail 
transport, which still is from the previous millennium.
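The point that those conversions cannot have mattered is mechanically checkable. A small Python sketch (illustrative, not part of the original post) showing the three strings survive any chain of width conversions:

```python
# Illustrative sketch: the three strings from the post survive any chain
# of UTF re-encodings unchanged -- each conversion is a lossless round
# trip, so it cannot matter which width they started in.
strings = ["abracadabra", "räyhäpäivä", "ШЖЯЮЄШ"]
for s in strings:
    t = s
    for enc in ("utf-8", "utf-16", "utf-32", "utf-7", "utf-8"):
        t = t.encode(enc).decode(enc)   # one hop per transport step
    assert t == s                       # identical after every hop
print("all round trips intact")
```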

---

At this point I have to ask, are there any folks here who do not believe 
that the following proves anything:

 Xref: digitalmars.com digitalmars.D.bugs:5436 digitalmars.D:29904
 Xref: digitalmars.com digitalmars.D.bugs:5440 digitalmars.D:29906
 
 Those show that the meaning of the program does not change when the
 source code is transliterated to a different UTF encoding.
 
 They also show that editing code in different UTF formats, and
 inserting "foreign" text even directly to string literals, does
 survive intact when the source file is converted between different
 UTF formats.
 
 Further, they show that decorating a string literal to c, w, or d,
 does not change the interpretation of the contents of the string
 whether it contains "foreign" literals directly inserted, or not.
 
 Most permutations of the above 3 paragraphs were tested.

(Oh, correction to the last line: "_All_ cross permutations of the 3 
paragraphs were tested.")

Endianness was not considered, but hey, with wrong endianness either 
your text editor can't read the file to begin with, or if it can, then 
you _can_ edit the strings with even more "foreign characters" and 
still be ok!

###################################################

I hereby declare that it makes _no_ difference whatsoever in which 
width a string literal is stored, as long as the compiler implicitly 
casts it when it gets used.

###################################################

I hereby also declare that implicit casts of strings (be they literal 
or heap or stack allocated) carry no risks whatsoever. Period.

###################################################

I hereby declare that string literal decorations are not only unneeded, 
they create an enormous amount of confusion. (Even we are totally 
bewildered, so _every_ newcomer to D will be that too.) There are _no_ 
upsides to them.

###################################################

I hereby declare that it should be illegal to implicitly convert char 
or wchar to any integer type. Further, it should be illegal to even 
cast char or wchar to any integer type. The cast should have to be via 
a cast to void! (I.e. difficult but possible.) With dchar even implicit 
casts are ok. Cast from char or wchar via dchar should be illegal. 
(Trust me, illegal. While at the same time even implicit casts from 
char[] and wchar[] to each other and to and from dchar[] are ok!) Casts 
between char, wchar and dchar should be illegal, unless via void.

###################################################

A good programmer would use the same width all over the place. An even 
better programmer would typedef his own anyway. If an idiot has his 
program convert width at every other assignment, then he'll have other 
idiocies in his code too. He should go to VB.
---

But some other things are (both now, and even if we fix the above) 
downright hazardous, and should cause a throw, and in non-release 
programs a runtime assert failure:

Copying any string to a fixed length array, _if_ the array is either 
wchar[] or char[]. (dchar[] is ok.) The (throw or) assert should fail 
if the copied string is not breakable where the receiving array gets 
full.

   whatever foo = "ää";   // foo and "" can be any of c/w/d.
   char[3] barf = foo;    // write cast if needed
   // Odd number of chars in barf, breaks ää wrong. "ää" is 4 bytes.

Same goes for wchar[3].

---

Once we agree on this, then it's time to see if some more AJ stuff is 
left to fix. But not before.
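The hazard in the char[3] example can be demonstrated directly. A Python sketch (Python stands in for D here, since the byte layout is fixed by UTF-8 itself):

```python
# "ää" really is 4 octets in UTF-8, so copying it into a 3-unit char
# buffer (the char[3] barf case above) slices the second 'ä' in half.
s = "ää"
b = s.encode("utf-8")
print(len(b))            # → 4
chopped = b[:3]          # what a char[3] copy would end up holding
try:
    chopped.decode("utf-8")
except UnicodeDecodeError:
    print("broken mid-character, as the post warns")
# dchar[]-style storage is immune: one 32-bit unit per character,
# so a whole-unit copy can never split a character.
assert len(s.encode("utf-32-le")) == 8   # 2 characters x 4 bytes
```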
Nov 18 2005
parent reply Derek Parnell <derek psych.ward> writes:
On Fri, 18 Nov 2005 12:56:24 +0200, Georg Wrede wrote:

[snip]

It seems that you use the word 'cast' to mean conversion of one utf
encoding to another. However, this is not what D does.

   dchar[] y;
   wchar[] x;

   x = cast(wchar[])y;

does *not* convert the content of 'y' to utf-16 encoding. Currently you
*must* use the toUTF16 function to do that.

  x = std.utf.toUTF16(y);
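The distinction can be made concrete outside D as well. A Python sketch of reinterpretation versus transcoding (illustrative only, not D semantics):

```python
# Illustrative sketch (not D semantics): the difference between
# reinterpreting stored bytes and actually transcoding them, using
# 'ä' (U+00E4) held in UTF-32 little-endian storage.
s = "ä"
utf32 = s.encode("utf-32-le")              # bytes E4 00 00 00
# A "cast" merely reinterprets those 4 bytes as two 16-bit units:
reinterpreted = utf32.decode("utf-16-le")  # 'ä' followed by a stray NUL
# A real conversion (what toUTF16 does) re-encodes the character:
transcoded = s.encode("utf-16-le")         # bytes E4 00 -- one unit
print(len(reinterpreted), len(transcoded) // 2)   # → 2 1
```

The reinterpretation yields a well-formed but wrong string; with code points above U+00FF it would yield outright garbage.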

However, are you saying that D should change its behaviour such that it
should always implicitly convert between encoding types? Should this happen
only with assignments or should it also happen on function calls?

   foo(wchar[] x) { . . .  } // #1
   foo(dchar[] x) { . . .  } // #2
   dchar y;
   foo(y);  // Obviously should call #2
   foo("Some Test Data"); // Which one now?

Given just the function signature and an undecorated string, it is not
possible for the compiler to call the 'correct' function. In fact, it is
not possible for a person (other than the original designer) to know which
is the right one to call.

D has currently got the better solution to this problem; get the coder to
identify the storage characteristics of the string!

-- 
Derek Parnell
Melbourne, Australia
18/11/2005 10:42:31 PM
Nov 18 2005
parent reply Georg Wrede <georg.wrede nospam.org> writes:
Derek Parnell wrote:
 On Fri, 18 Nov 2005 12:56:24 +0200, Georg Wrede wrote:
 
 [snip]
 
 It seems that you use the word 'cast' to mean conversion of one utf
 encoding to another. However, this is not what D does.
 
    dchar[] y;
    wchar[] x;
 
    x = cast(wchar[])y;
 
 does *not* convert the content of 'y' to utf-16 encoding. Currently you
 *must* use the toUTF16 function to do that.

If somebody wants to retain the bit pattern while storing the contents to something else, it should be done with a union. (Just as you can do with pointers, or even objects! To name a few "workarounds".) A cast should do precisely what our toUTFxxx functions currently do.
 However, are you saying that D should change its behaviour such that
 it should always implicitly convert between encoding types? Should 
 this happen only with assignments or should it also happen on
 function calls?

Both. And everywhere else (in case we forgot to name some situation).
    foo(wchar[] x) { . . .  } // #1
    foo(dchar[] x) { . . .  } // #2
    dchar y;
    foo(y);  // Obviously should call #2
    foo("Some Test Data"); // Which one now?

Test data is undecorated, hence char[].

Technically, on the last line above it could pick at random, when it has 
no "right" alternative, but I think it would be Polite Manners to make 
the compiler complain.

I'm still trying to get through the notion that it _really_does_not_matter_ 
what it chooses! (Of course performance is slower with a lot of 
unnecessary casts ( = conversions), but that's the programmer's fault, 
not ours.)
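A quick Python sketch (illustrative) of why the chosen width cannot change the meaning of an undecorated literal like "Some Test Data":

```python
# Illustrative sketch: for a literal like "Some Test Data" the stored
# width changes only the byte count, never the meaning -- every width
# decodes back to the identical string.
lit = "Some Test Data"
widths = {enc: len(lit.encode(enc))
          for enc in ("utf-8", "utf-16-le", "utf-32-le")}
assert all(lit.encode(enc).decode(enc) == lit for enc in widths)
print(widths)   # storage differs (14 / 28 / 56 bytes); meaning doesn't
```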
 Given just the function signature and an undecorated string, it is not
 possible for the compiler to call the 'correct' function. In fact, it
 is not possible for a person (other than the original designer) to
 know which is the right one to call?

That is (I'm sorry, no offense), based on a misconception. Please see my other posts today, where I try to clear (among other things) this very issue.
 D has currently got the better solution to this problem; get the
 coder to identify the storage characteristics of the string!

He does, at assignment to a variable. And, up till that time, it makes 
no difference. It _really_ does not. This also I try to explain in the 
other posts.

(The issue and concepts are crystal clear, maybe it's just me not being 
able to describe them with the right words. Not to you, or Walter, or 
the others?)

We are all seeing bogeymen all over the place, where there are none. 
It's like my kids this time of the year, when it is always dark behind 
the house, under the bed, and in the attic.

Aaaaaaaaaaah, now I got it. It's been Halloween again. Sigh!
Nov 18 2005
next sibling parent reply Sean Kelly <sean f4.ca> writes:
Georg Wrede wrote:
 Derek Parnell wrote:
 On Fri, 18 Nov 2005 12:56:24 +0200, Georg Wrede wrote:

 [snip]

 It seems that you use the word 'cast' to mean conversion of one utf
 encoding to another. However, this is not what D does.

    dchar[] y;
    wchar[] x;

    x = cast(wchar[])y;

 does *not* convert the content of 'y' to utf-16 encoding. Currently you
 *must* use the toUTF16 function to do that.

If somebody wants to retain the bit pattern while storing the contents to something else, it should be done with a union. (Just as you can do with pointers, or even objects! To name a few "workarounds".) A cast should do precisely what our toUTFxxx functions currently do.

I somewhat agree. Since the three char types in D really do represent 
various encodings, the current behavior of casting a char[] to dchar[] 
produces a meaningless result (AFAIK). On the other hand, this would 
make casting strings behave differently from casting anything else in 
D, and I abhor inconsistencies.

Though for what it's worth, I don't consider the conversion cost to be 
much of an issue so long as strings must be cast explicitly. And either 
way, I would love to have UTF conversion for strings supported 
in-language. It does make some sense, given that the three encodings 
exist as distinct value types in D already.
 However, are you saying that D should change its behaviour such that
 it should always implicitly convert between encoding types? Should 
 this happen only with assignments or should it also happen on
 function calls?

Both. And everywhere else (in case we forgot to name some situation).

I disagree. While this would make programming quite simple, it would also incur hidden runtime costs that would be difficult to ferret out. This might be fine for a scripting-type language, but not for a systems programming language IMO.
 D has currently got the better solution to this problem; get the
 coder to identify the storage characteristics of the string!

He does, at assignment to a variable. And, up till that time, it makes no difference. It _really_ does not.

True.
 This also I try to explain in the other posts.
 
 (The issue and concepts are crystal clear, maybe it's just me not being 
 able to describe them with the right words. Not to you, or Walter, or 
 the others?)
 
 We are all seeing bogeymen all over the place, where there are none. 
 It's like my kids this time of the year, when it is always dark behind 
 the house, under the bed, and on the attic.

What I like about the current behavior (no implicit conversion) is that 
it makes it readily obvious where translation needs to occur, and thus 
makes it easy for the programmer to decide if that seems appropriate.

That said, I agree that the overall runtime cost is likely consistent 
between a program with and without implicit conversion--either the API 
calls will have overloads for all types and thus allow you to avoid 
conversion, or they will only support one type and require conversion 
if you've standardized on a different type.

It may well be that concern over implicit conversion is unfounded, but 
I'll have to give the matter some more thought before I can say one way 
or the other. My current experience with D isn't such that I've had to 
deal with this particular issue much.

Sean
Nov 18 2005
next sibling parent reply "Regan Heath" <regan netwin.co.nz> writes:
On Fri, 18 Nov 2005 09:48:51 -0800, Sean Kelly <sean f4.ca> wrote:
 Georg Wrede wrote:
 Derek Parnell wrote:
 On Fri, 18 Nov 2005 12:56:24 +0200, Georg Wrede wrote:

 [snip]

 It seems that you use the word 'cast' to mean conversion of one utf
 encoding to another. However, this is not what D does.

    dchar[] y;
    wchar[] x;

    x = cast(wchar[])y;

 does *not* convert the content of 'y' to utf-16 encoding. Currently you
 *must* use the toUTF16 function to do that.

If somebody wants to retain the bit pattern while storing the contents to something else, it should be done with a union. (Just as you can do with pointers, or even objects! To name a few "workarounds".) A cast should do precisely what our toUTFxxx functions currently do.

I somewhat agree. Since the three char types in D really do represent various encodings, the current behavior of casting a char[] to dchar[] produces a meaningless result (AFAIK). On the other hand, this would make casting strings behave differently from casting anything else in D, and I abhor inconsistencies. Though for what it's worth, I don't consider the conversion cost to be much of an issue so long as strings must be cast explicitly. And either way, I would love to have UTF conversion for strings supported in-language. It does make some sense, given that the three encodings exist as distinct value types in D already.

Making the cast explicit sounds like a good compromise to me.

The way I see it, casting from int to float is similar to casting from 
char[] to wchar[]. The data must be converted from one form to another 
for it to make sense; you'd never 'paint' an 'int' as a 'float', it 
would be meaningless, and the same is true for char[] to wchar[].

The correct way to paint data as char[], wchar[] or dchar[] is to paint 
a byte[] (or ubyte[]). In other words, if you have some data of unknown 
encoding you should be reading it into byte[] (or ubyte[]) and then 
painting it as the correct type, once it is known.

Regan
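Regan's rule can be sketched in a few lines of Python (illustrative; the principle, not D's byte[], is what's shown):

```python
# Read data of unknown encoding as raw bytes first, and "paint" (decode)
# it only once the encoding is actually known.
raw = bytes([0xC3, 0xA4])            # two mystery octets off the wire
# Painting with the wrong assumption gives two Latin-1 characters:
assert raw.decode("latin-1") == "Ã¤"
# Once we learn the data is UTF-8, the same octets are one character:
assert raw.decode("utf-8") == "ä"
print("decode only after the encoding is known")
```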
Nov 18 2005
parent reply Sean Kelly <sean f4.ca> writes:
Regan Heath wrote:
 
 Making the cast explicit sounds like a good compromise to me.
 
 The way I see it casting from int to float is similar to casting from 
 char[] to wchar[]. The data must be converted from one form to another 
 for it to make sense, you'd never 'paint' and 'int' as a 'float' it 
 would be meaningless, the same is true for char[] to wchar[].

This is the comparison I was thinking of as well. Though I've never 
tried casting an array of ints to floats. I suspect it doesn't work, 
does it?

My only other reservation is that the behavior could not be preserved 
for casting char types, and unlike narrowing conversions (such as float 
to int), meaning can't even be preserved in narrowing char conversions 
(such as wchar to char).

Sean
Nov 18 2005
parent "Regan Heath" <regan netwin.co.nz> writes:
On Fri, 18 Nov 2005 15:29:23 -0800, Sean Kelly <sean f4.ca> wrote:
 Regan Heath wrote:
  Making the cast explicit sounds like a good compromise to me.
  The way I see it casting from int to float is similar to casting from  
 char[] to wchar[]. The data must be converted from one form to another  
 for it to make sense, you'd never 'paint' and 'int' as a 'float' it  
 would be meaningless, the same is true for char[] to wchar[].

This is the comparison I was thinking of as well. Though I've never tried casting an array of ints to floats. I suspect it doesn't work, does it?

Nope. Kris's post has something about it, here: digitalmars.D/30158
 My only other reservation is that the behavior could not be preserved  
 for casting char types, and unlike narrowing conversions (such as float  
 to int), meaning can't even be preserved in narrowing char conversions  
 (such as wchar to char).

Indeed. Due to the fact that the meaning (the "character") may be 
represented as 1 wchar, but 2 chars. The thread above has some more 
interesting stuff about this.

Regan
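A Python sketch (illustrative) of why narrowing between code-unit widths can never be unit-for-unit:

```python
# 'ä' is one UTF-16 code unit (one wchar) but two UTF-8 units (chars),
# so a wchar -> char conversion cannot map units one-to-one.
ch = "ä"
assert len(ch.encode("utf-16-le")) // 2 == 1   # one 16-bit unit
assert len(ch.encode("utf-8")) == 2            # two 8-bit units
# Beyond the BMP even UTF-16 needs two units (a surrogate pair):
clef = "\U0001D11E"                            # MUSICAL SYMBOL G CLEF
assert len(clef.encode("utf-16-le")) // 2 == 2
print("unit counts differ per encoding")
```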
Nov 18 2005
prev sibling parent reply "Kris" <fu bar.com> writes:
"Sean Kelly" <sean f4.ca> wrote in message
 Georg Wrede wrote:
 Derek Parnell wrote:
 On Fri, 18 Nov 2005 12:56:24 +0200, Georg Wrede wrote:

 [snip]

 It seems that you use the word 'cast' to mean conversion of one utf
 encoding to another. However, this is not what D does.

    dchar[] y;
    wchar[] x;

    x = cast(wchar[])y;

 does *not* convert the content of 'y' to utf-16 encoding. Currently you
 *must* use the toUTF16 function to do that.

If somebody wants to retain the bit pattern while storing the contents to something else, it should be done with a union. (Just as you can do with pointers, or even objects! To name a few "workarounds".) A cast should do precisely what our toUTFxxx functions currently do.

I somewhat agree. Since the three char types in D really do represent various encodings, the current behavior of casting a char[] to dchar[] produces a meaningless result (AFAIK). On the other hand, this would make casting strings behave differently from casting anything else in D, and I abhor inconsistencies.

Amen to that.
 Though for what it's worth, I don't consider the conversion cost to be 
 much of an issue so long as strings must be cast explicitly.  And either 
 way, I would love to have UTF conversion for strings supported 
 in-language.  It does make some sense, given that the three encodings 
 exist as distinct value types in D already.

FWIW, I agree. And it should be explicit, to avoid unseen /runtime/ 
conversion (the performance issue).

But, I have a feeling that cast([]) is not the right approach here? One 
reason is that structs/classes can have only one opCast() method. 
Perhaps there's another approach for such syntax? That's assuming, 
however, that one does not create a special-case for char[] types (per 
above inconsistencies).
 However, are you saying that D should change its behaviour such that
 it should always implicitly convert between encoding types? Should this 
 happen only with assignments or should it also happen on
 function calls?

Both. And everywhere else (in case we forgot to name some situation).

I disagree. While this would make programming quite simple, it would also incur hidden runtime costs that would be difficult to ferret out. This might be fine for a scripting-type language, but not for a systems programming language IMO.

Amen to that, too!
 D has currently got the better solution to this problem; get the
 coder to identify the storage characteristics of the string!

He does, at assignment to a variable. And, up till that time, it makes no difference. It _really_ does not.

True.
 This also I try to explain in the other posts.

 (The issue and concepts are crystal clear, maybe it's just me not being 
 able to describe them with the right words. Not to you, or Walter, or the 
 others?)

 We are all seeing bogeymen all over the place, where there are none. It's 
 like my kids this time of the year, when it is always dark behind the 
 house, under the bed, and on the attic.

What I like about the current behavior (no implicit conversion), is that it makes it readily obvious where translation needs to occur and thus makes it easy for the programmer to decide if that seems appropriate.

Right on!
 That said, I agree that the overall runtime cost is likely consistent 
 between a program with and without implicit conversion--either the API 
 calls with have overloads for all types and thus allow you to avoid 
 conversion, or they will only support one type and require conversion if 
 you've standardized on a different type.

As long as the /runtime/ penalties are clear within the code design (not quietly 'padded' by the compiler), that makes sense.
 It may well be that concerns over implicit convesion is unfounded, but 
 I'll have to give the matter some more thought before I can say one way or 
 the other.  My current experience with D isn't such that I've had to deal 
 with this particular issue much.

I'm afraid I have. Both in Mango.io and in the ICU wrappers. While 
there are no metrics for such things (that I'm aware of), my gut feel 
was that 'hidden' conversion would not be a good thing. Of course, that 
depends upon the "level" one is talking about:

High level :: slow to medium performance
Low level  :: high performance

A lot of folks just don't care about performance (oh, woe!) and that's 
fine. But I think it's worth keeping the distinction in mind when 
discussing this topic. I'd be a bit horrified to find the compiler 
adding hidden transcoding at the IO level (via Mango.io for example). 
But then, I'm a dinosaur.

So. That doesn't mean that the language should not perhaps support some 
sugar for such operations. Yet the difficulty there is said sugar would 
likely bind directly to some internal runtime support (such as utf.d), 
which may not be the most appropriate for the task (it tends to be 
character oriented, rather than stream oriented). In addition, there's 
often a need for multiple return-values from certain types of 
transcoding ops. I imagine that would be tricky via such sugar? Maybe 
not.

Transcoding is easy when the source content is reasonably small and 
fully contained within a block of memory. It quickly becomes quite 
complex when streaming instead. That's really worth considering. To 
illustrate, here's some of the transcoder signatures from the ICU code:

uint function (Handle, wchar*, uint, void*, uint, inout Error) ucnv_toUChars;
uint function (Handle, void*, uint, wchar*, uint, inout Error) ucnv_fromUChars;

Above are the simple ones, where all of the source is present in memory.
void function (Handle, void**, void*, wchar**, wchar*, int*, ubyte, inout Error) ucnv_fromUnicode;
void function (Handle, wchar**, wchar*, void**, void*, int*, ubyte, inout Error) ucnv_toUnicode;
void function (Handle, Handle, void**, void*, void**, void*, wchar*, wchar*, wchar*, wchar*, ubyte, ubyte, inout Error) ucnv_convertEx;

And those are the ones for handling streaming; note the double 
pointers? That's so one can handle "trailing" partial characters. Non 
trivial :-)

Thus, I'd suspect it may be appropriate for D to add some transcoding 
sugar. But it would likely have to be highly constrained (per the 
simple case). Is it worth it?
Nov 18 2005
parent reply Sean Kelly <sean f4.ca> writes:
Kris wrote:
 
 But, I have a feeling that cast([]) is not the right approach here? One 
 reason is that structs/classes can have only one opCast() method. perhaps 
 there's another approach for such syntax? That's assuming, however, that one 
 does not create a special-case for char[] types (per above inconsistencies).

It may well not be. A set of properties is another approach:

   char[] c = "abc";
   dchar[] d = c.toDString();

but this would still only work for arrays. Conversion between char 
types still only makes sense if they are widening conversions. Perhaps 
I'm simply becoming spoiled by having so much built into D. This may 
well be simply a job for library code.
 Transcoding is easy when the source content is reasonably small and fully 
 contained within block of memory. It quickly becomes quite complex when 
 streaming instead. That's really worth considering.

Good point. One of the first things I had to do for readf/unFormat was 
rewrite std.utf to accept delegates. There simply isn't any other good 
way to ensure that too much data isn't read from the stream by mistake.

 Thus, I'd suspect it may be appropriate for D to add some transcoding
 sugar.
 But it would likely have to be highly constrained (per the simple case). Is 
 it worth it?

Probably not :-) But I suppose it's worth discussing. I do like the idea of not having to rely on library code to do simple string transcoding, though this seems of limited use given the above concerns. Sean
Nov 18 2005
parent "Kris" <fu bar.com> writes:
"Sean Kelly" <sean f4.ca> wrote ...
 Kris wrote:
 But, I have a feeling that cast([]) is not the right approach here? One 
 reason is that structs/classes can have only one opCast() method. perhaps 
 there's another approach for such syntax? That's assuming, however, that 
 one does not create a special-case for char[] types (per above 
 inconsistencies).

 It may well not be. A set of properties is another approach:

    char[] c = "abc";
    dchar[] d = c.toDString();

I would agree, since it thoroughly isolates the special cases:

   char[].utf16
   char[].utf32
   wchar[].utf8
   wchar[].utf32
   dchar[].utf8
   dchar[].utf16

Generics might require the addition of 'identity' properties, like 
char[].utf8 ?
 but this would still only work for arrays.  Conversion between char types 
 still only make sense if they are widening conversions.

Aye. If the above set of properties were for arrays only, then one may 
be able to make a case that it doesn't break consistency. There might 
be a second, somewhat distinct, set:

   char.utf16
   char.utf32
   wchar.utf32

I think your approach is far more amenable than cast(), Sean. And 
properties don't eat up keyword space <g>
 Transcoding is easy when the source content is reasonably small and fully 
 contained within block of memory. It quickly becomes quite complex when 
 streaming instead. That's really worth considering.

Good point. One of the first things I had to do for readf/unFormat was 
rewrite std.utf to accept delegates. There simply isn't any other good 
way to ensure that too much data isn't read from the stream by mistake.

 Thus, I'd suspect it may be appropriate for D to add some transcoding
 sugar.
 But it would likely have to be highly constrained (per the simple case). 
 Is it worth it?

Probably not :-) But I suppose it's worth discussing. I do like the idea of not having to rely on library code to do simple string transcoding, though this seems of limited use given the above concerns.

Yeah. It would be limited (e.g. no streaming), and would likely be implemented using the heap. Even then, as you note, it could be attractive to some.
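[The streaming concern above can be illustrated with a sketch of a decoder that pulls bytes through a delegate, so it consumes exactly the bytes one code point needs and no more. This is an illustrative sketch, not the actual readf/unFormat rewrite, and it omits validation of trail bytes:]

```d
// Sketch: decode one UTF-8 code point, pulling bytes on demand through a
// delegate so no byte beyond the current code point is ever consumed.
dchar decodeFrom(ubyte delegate() next)
{
    ubyte b = next();
    if (b < 0x80)
        return b;                          // ASCII: one byte, done
    int extra = b >= 0xF0 ? 3 : (b >= 0xE0 ? 2 : 1);
    dchar c = b & (0x3F >> extra);         // payload bits of the lead byte
    while (extra-- > 0)
        c = (c << 6) | (next() & 0x3F);    // six payload bits per trail byte
    return c;
}
```

A block-based decoder cannot offer the same guarantee: it has to be handed a buffer first, which means something upstream already read an arbitrary amount from the stream.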
Nov 18 2005
prev sibling next sibling parent "Regan Heath" <regan netwin.co.nz> writes:
On Fri, 18 Nov 2005 15:31:48 +0200, Georg Wrede <georg.wrede nospam.org>  
wrote:
 Derek Parnell wrote:
 On Fri, 18 Nov 2005 12:56:24 +0200, Georg Wrede wrote:
  [snip]
  It seems that you use the word 'cast' to mean conversion of one utf
 encoding to another. However, this is not what D does.
     dchar[] y;
    wchar[] x;
     x = cast(wchar[])y;
  does *not* convert the content of 'y' to utf-16 encoding. Currently you
 *must* use the toUTF16 function to do that.

If somebody wants to retain the bit pattern while storing the contents to something else, it should be done with a union. (Just as you can do with pointers, or even objects! To name a few "workarounds".) A cast should do precisely what our toUTFxxx functions currently do.

You have my vote here.
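[Concretely, the distinction under discussion looks like this (a sketch using the existing std.utf functions; note that the array cast's behaviour is the current reinterpretation, not a conversion):]

```d
import std.utf;

void main()
{
    dchar[] y = "abc"d;

    // Today: the cast merely repaints the bits (adjusting length);
    // y's UTF-32 code units are NOT converted, so x is not valid
    // UTF-16 in general.
    wchar[] x = cast(wchar[]) y;

    // What Georg proposes cast should mean, i.e. what one must
    // write explicitly today:
    wchar[] z = toUTF16(y);    // real transcoding
}
```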
 However, are you saying that D should change its behaviour such that
 it should always implicitly convert between encoding types? Should this  
 happen only with assignments or should it also happen on
 function calls?

Both. And everywhere else (in case we forgot to name some situation).

The main argument against this, last time it was proposed, was that an expression containing several char[] types would implicitly convert any number of times during the expression. This transcoding would be inefficient, and silent, and thus bad, eg.

   char[]  a = "this is a test string";
   wchar[] b = "regan was here";
   dchar[] c = "georg posted this thing";
   char[]  d = c[0..7] ~ b[6..10] ~ a[10..14] ~ c[20..$] ~ a[14..$] ~ c[16..17];
   //supposed to be: georg was testing strings :)

How many times does the above transcode using the proposed implicit conversion rules? (The last time this topic was aired it branched into a discussion about how these rules could change to improve the situation.)
    foo(wchar[] x) { . . .  } // #1
    foo(dchar[] x) { . . .  } // #2
    dchar y;
    foo(y);  // Obviously should call #2
    foo("Some Test Data"); // Which one now?

Test data is undecorated, hence char[]. Technically on the last line above it could pick at random, when it has no "right" alternative, but I think it would be Polite Manners to make the compiler complain.

Which is what it does currently, right?
 I'm still trying to get through the notion that it  
 _really_does_not_matter_ what it chooses!

I'm still not convinced. I will raise my issues in the later posts you promise.
 (Of course performance is slower with a lot of unnecessary casts ( =  
 conversions), but that's the programmer's fault, not ours.)

I tend to agree here but as I say above, last time this aired people complained about this very thing.
 Given just the function signature and an undecorated string, it is not
 possible for the compiler to call the 'correct' function. In fact, it
 is not possible for a person (other than the original designer) to
 know which is the right one to call?

That is (I'm sorry, no offense), based on a misconception. Please see my other posts today, where I try to clear (among other things) this very issue.

Ok.
 D has currently got the better solution to this problem; get the
 coder to identify the storage characteristics of the string!

He does, at assignment to a variable. And, up till that time, it makes no difference. It _really_ does not. This also I try to explain in the other posts.

Ok. Regan
Nov 18 2005
prev sibling next sibling parent reply Derek Parnell <derek psych.ward> writes:
On Fri, 18 Nov 2005 15:31:48 +0200, Georg Wrede wrote:

 Derek Parnell wrote:
 On Fri, 18 Nov 2005 12:56:24 +0200, Georg Wrede wrote:
 
 [snip]
 
 It seems that you use the word 'cast' to mean conversion of one utf
 encoding to another. However, this is not what D does.
 
    dchar[] y;
    wchar[] x;
 
    x = cast(wchar[])y;
 
 does *not* convert the content of 'y' to utf-16 encoding. Currently you
 *must* use the toUTF16 function to do that.

If somebody wants to retain the bit pattern while storing the contents to something else, it should be done with a union. (Just as you can do with pointers, or even objects! To name a few "workarounds".) A cast should do precisely what our toUTFxxx functions currently do.

Agreed. There are times, I suppose, when the coder does not want this to happen, but those could be coded with a cast(byte[]) to avoid that.
 However, are you saying that D should change its behaviour such that
 it should always implicitly convert between encoding types? Should 
 this happen only with assignments or should it also happen on
 function calls?

Both. And everywhere else (in case we forgot to name some situation).

We have problems with inout and out parameters.

   foo(inout wchar[] x) {}
   dchar[] y = "abc";
   foo(y);

In this case, if automatic conversion took place, it would have to do it twice. It would be like doing ...

   auto wchar[] temp;
   temp = toUTF16(y);
   foo(temp);
   y = toUTF32(temp);
    foo(wchar[] x) { . . .  } // #1
    foo(dchar[] x) { . . .  } // #2
    dchar y;
    foo(y);  // Obviously should call #2
    foo("Some Test Data"); // Which one now?

Test data is undecorated, hence char[]. Technically on the last line above it could pick at random, when it has no "right" alternative, but I think it would be Polite Manners to make the compiler complain.

Yes, and that's what happens now.
 I'm still trying to get through the notion that it 
 _really_does_not_matter_ what it chooses!

I disagree. Without knowing what the intention of the function is, one has no way of knowing which function to call. Try it. Which one is the right one to call in the example above? It is quite possible that there is no right one.

If we have automatic conversion and it chooses one at random, there is no way of knowing that it's doing the 'right' thing to the data we give it. In my opinion, it's a coding error and the coder needs to provide more information to the compiler.
 (Of course performance is slower with a lot of unnecessary casts ( = 
 conversions), but that's the programmer's fault, not ours.)
 
 Given just the function signature and an undecorated string, it is not
 possible for the compiler to call the 'correct' function. In fact, it
 is not possible for a person (other than the original designer) to
 know which is the right one to call?

That is (I'm sorry, no offense), based on a misconception. Please see my other posts today, where I try to clear (among other things) this very issue.

I challenge you, right here and now, to tell me which of those two functions above is the one that the coder intended to be called. If the coder had written

   foo("Some Test Data"w);

then it's pretty clear which function was intended.

For example, D rightly complains when a similar situation occurs with the various integers.

   void foo(long x) {}
   void foo(int x) {}
   void main()
   {
      short y;
      foo(y);
   }

If D did implicit conversions and chose one at random, I'm sure we would complain.
 D has currently got the better solution to this problem; get the
 coder to identify the storage characteristics of the string!

He does, at assignment to a variable. And, up till that time, it makes no difference. It _really_ does not.

But it *DOES* make a difference when doing signature matching. I'm not talking about assignments to variables.

-- 
Derek Parnell
Melbourne, Australia
19/11/2005 8:59:16 AM
Nov 18 2005
next sibling parent reply "Regan Heath" <regan netwin.co.nz> writes:
On Sat, 19 Nov 2005 09:19:28 +1100, Derek Parnell <derek psych.ward> wrote:
 I'm still trying to get through the notion that it
 _really_does_not_matter_ what it chooses!

I disagree. Without knowing what the intention of the function is, one has no way of knowing which function to call. Try it. Which one is the right one to call in the example above? It is quite possible that there is no right one.

If we have automatic conversion and it chooses one at random, there is no way of knowing that it's doing the 'right' thing to the data we give it. In my opinion, it's a coding error and the coder needs to provide more information to the compiler.
 (Of course performance is slower with a lot of unnecessary casts ( =
 conversions), but that's the programmer's fault, not ours.)

 Given just the function signature and an undecorated string, it is not
 possible for the compiler to call the 'correct' function. In fact, it
 is not possible for a person (other than the original designer) to
 know which is the right one to call?

That is (I'm sorry, no offense), based on a misconception. Please see my other posts today, where I try to clear (among other things) this very issue.

I challenge you, right here and now, to tell me which of those two functions above is the one that the coder intended to be called. If the coder had written

   foo("Some Test Data"w);

then it's pretty clear which function was intended.

For example, D rightly complains when a similar situation occurs with the various integers.

   void foo(long x) {}
   void foo(int x) {}
   void main()
   {
      short y;
      foo(y);
   }

If D did implicit conversions and chose one at random, I'm sure we would complain.
 D has currently got the better solution to this problem; get the
 coder to identify the storage characteristics of the string!

He does, at assignment to a variable. And, up till that time, it makes no difference. It _really_ does not.

But it *DOES* make a difference when doing signature matching. I'm not talking about assignments to variables.

Georg/Derek, I replied to Georg here:

digitalmars.D.bugs/5587

saying essentially the same things as Derek has above. I reckon we combine these threads and continue in this one, as opposed to the one I linked above. I or you can link the other thread to here with a post if you're in agreement.

Regan
Nov 18 2005
parent reply Georg Wrede <georg.wrede nospam.org> writes:
Regan Heath wrote:
 
 Georg/Derek, I replied to Georg here:
 digitalmars.D.bugs/5587
 
 saying essentially the same things as Derek has above. I reckon we 
 combine  these threads and continue in this one, as opposed to the one I 
 linked  above. I or you can link the other thread to here with a post if 
 you're in  agreement.

Good suggestion! I actually intended that, but forgot about it while reading and thinking. :-/ So, the reply is to it directly.
Nov 20 2005
parent reply "Regan Heath" <regan netwin.co.nz> writes:
On Sun, 20 Nov 2005 17:28:33 +0200, Georg Wrede <georg.wrede nospam.org>  
wrote:
 Regan Heath wrote:
  Georg/Derek, I replied to Georg here:
 digitalmars.D.bugs/5587
  saying essentially the same things as Derek has above. I reckon we  
 combine  these threads and continue in this one, as opposed to the one  
 I linked  above. I or you can link the other thread to here with a post  
 if you're in  agreement.

Good suggestion! I actually intended that, but forgot about it while reading and thinking. :-/ So, the reply is to it directly.

Ok. I have taken your reply, clicked reply, and pasted it in here :) (I hope this post isn't confusing for anyone)

------------------------- Copied from: digitalmars.D.bugs/5607 -------------------------

On Sun, 20 Nov 2005 17:17:33 +0200, Georg Wrede <georg.wrede nospam.org> wrote:
 Regan Heath wrote:
 On Fri, 18 Nov 2005 13:02:05 +0200, Georg Wrede  
 <georg.wrede nospam.org>  wrote:
  Lets assume there is 2 functions of the same name (unintentionally),  
 doing different things.
  In that source file the programmer writes:
  write("test");
  DMD tries to choose the storage type of "test" based on the available
 overloads. There are 2 available overloads X and Y. It currently
 fails and gives an error.
  If instead it picked an overload (X) and stored "test" in the type
 for X, calling the overload for X, I agree, there would be
 _absolutely no problems_ with the stored data.
  BUT
  the overload for X doesn't do the same thing as the overload for Y.

Isn't that a problem with having overloading at all in a language? Sooner or later, most of us have done it. If not each already? Isn't this a problem with overloading in general, and not with UTF?

You're right. The problem is not limited to string literals; integer literals exhibit exactly the same problem, AFAICS. So, you've convinced me. Here is why...

http://www.digitalmars.com/d/lex.html#integerliteral
(see "The type of the integer is resolved as follows")

In essence integer literals _default_ to 'int' unless another type is specified or required. This suggested change does that, and nothing else? (can anyone see a difference?)

If so, and if I can accept the behaviour for integer literals, why can't I for string literals?

The only logical reason I can think of for not accepting it is if there exists a difference between integer literals and string literals which affects this behaviour. I can think of differences, but none which affect the behaviour. So, it seems that if I accept the risk for integers, I have to accept the risk for string literals too.

---

Note that string promotion should occur just like integer promotion does, eg:

   void foo(long i) {}
   foo(5); //calls foo(long) with no error

   void foo(wchar[] s) {}
   foo("test"); //should call foo(wchar[]) with no error

this behaviour is current and should not change.

Regan
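[The parallel being drawn can be sketched as follows; the integer half is current D behaviour, the string half is the proposed behaviour, not what DMD does today:]

```d
// Integer literals: an undecorated 5 defaults to int, then promotes.
void bar(long i) {}

// The proposal: an undecorated string literal would default to char[]
// in the same way, instead of producing an ambiguity error.
void put(char[] s)  {}
void put(wchar[] s) {}

void main()
{
    bar(5);        // fine today: 5 is an int literal, promoted to long
    // put("test"); // ambiguous today; under the proposal it would
    //              // pick put(char[]), just as 5 picked int
}
```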
Nov 20 2005
parent reply Derek Parnell <derek psych.ward> writes:
On Mon, 21 Nov 2005 10:58:28 +1300, Regan Heath wrote:

Ok, I'll comment but only 'cos you asked ;-)

 On Sun, 20 Nov 2005 17:28:33 +0200, Georg Wrede <georg.wrede nospam.org>  
 wrote:
 Regan Heath wrote:
  Georg/Derek, I replied to Georg here:
 digitalmars.D.bugs/5587
  saying essentially the same things as Derek has above. I reckon we  
 combine  these threads and continue in this one, as opposed to the one  
 I linked  above. I or you can link the other thread to here with a post  
 if you're in  agreement.

Good suggestion! I actually intended that, but forgot about it while reading and thinking. :-/ So, the reply is to it directly.

Ok. I have taken your reply, clicked reply, and pasted it in here :) (I hope this post isn't confusing for anyone)

------------------------- Copied from: digitalmars.D.bugs/5607 -------------------------

On Sun, 20 Nov 2005 17:17:33 +0200, Georg Wrede <georg.wrede nospam.org> wrote:
 Regan Heath wrote:
 On Fri, 18 Nov 2005 13:02:05 +0200, Georg Wrede  
 <georg.wrede nospam.org>  wrote:
  Lets assume there is 2 functions of the same name (unintentionally),  
 doing different things.
  In that source file the programmer writes:
  write("test");
  DMD tries to choose the storage type of "test" based on the available
 overloads. There are 2 available overloads X and Y. It currently
 fails and gives an error.
  If instead it picked an overload (X) and stored "test" in the type
 for X, calling the overload for X, I agree, there would be
 _absolutely no problems_ with the stored data.
  BUT
  the overload for X doesn't do the same thing as the overload for Y.

Isn't that a problem with having overloading at all in a language? Sooner or later, most of us have done it. If not each already? Isn't this a problem with overloading in general, and not with UTF?

You're right. The problem is not limited to string literals; integer literals exhibit exactly the same problem, AFAICS. So, you've convinced me. Here is why...

http://www.digitalmars.com/d/lex.html#integerliteral
(see "The type of the integer is resolved as follows")

In essence integer literals _default_ to 'int' unless another type is specified or required. This suggested change does that, and nothing else? (can anyone see a difference?)

Are you suggesting that in the situation where multiple function signatures could possibly match an undecorated string literal, D should assume that the string literal is actually in utf-8 format, and if that then fails to find a match, it should signal an error?
 If so and if I can accept the behaviour for integer literals why can't I  
 for string literals?
 
 The only logical reason I can think of for not accepting it, is if there  
 exists a difference between integer literals and string literals which  
 affects this behaviour.
 
 I can think of differences, but none which affect the behaviour. So, it  
 seems that if I accept the risk for integers, I have to accept the risk  
 for string literals too.

What might be a relevant point about this is that we are trying to talk about strings, but as far as D is concerned, we are really talking about arrays (of code-units). And for arrays, the current D behaviour is self-consistent.

If, however, D supported a true string data type, then a great deal of our messy code dealing with UTF conversions would disappear, just as it does with integers and floating point values. Imagine the problems we would have if integers were regarded as arrays of bits by the compiler!
 ---
 
 Note that string promotion should occur just like integer promotion does,  
 eg:
 
 void foo(long i) {}
 foo(5); //calls foo(long) with no error

But what happens when ...

   void foo(long i) {}
   void foo(short i) {}
   foo(5); //calls ???
 void foo(wchar[] s) {}
 foo("test"); //should call foo(wchar[]) with no error
 
 this behaviour is current and should not change.

Agreed.

   void foo(wchar[] s) {}
   void foo(char[] s) {}
   foo("test"); //should call ???

I'm now thinking that it should call the char[] signature without error. But in this case ...

   void foo(wchar[] s) {}
   void foo(dchar[] s) {}
   foo("test"); //should call an error.

If we had a generic string type we'd probably just code ....

   void foo(string s) {}
   foo("test");  // Calls the one function
   foo("test"d); // Also calls the one function

D would convert to an appropriate UTF format silently before (and after) calling.

-- 
Derek
(skype: derek.j.parnell)
Melbourne, Australia
21/11/2005 10:41:35 AM
Nov 20 2005
parent reply "Regan Heath" <regan netwin.co.nz> writes:
On Mon, 21 Nov 2005 11:10:30 +1100, Derek Parnell <derek psych.ward> wrote:
 On Mon, 21 Nov 2005 10:58:28 +1300, Regan Heath wrote:

 Ok, I'll comment but only 'cos you asked ;-)

Thanks <g>.
 Regan Heath wrote:
 On Fri, 18 Nov 2005 13:02:05 +0200, Georg Wrede
 <georg.wrede nospam.org>  wrote:
  Lets assume there is 2 functions of the same name (unintentionally),
 doing different things.
  In that source file the programmer writes:
  write("test");
  DMD tries to choose the storage type of "test" based on the available
 overloads. There are 2 available overloads X and Y. It currently
 fails and gives an error.
  If instead it picked an overload (X) and stored "test" in the type
 for X, calling the overload for X, I agree, there would be
 _absolutely no problems_ with the stored data.
  BUT
  the overload for X doesn't do the same thing as the overload for Y.

Isn't that a problem with having overloading at all in a language? Sooner or later, most of us have done it. If not each already? Isn't this a problem with overloading in general, and not with UTF?

You're right. The problem is not limited to string literals; integer literals exhibit exactly the same problem, AFAICS. So, you've convinced me. Here is why...

http://www.digitalmars.com/d/lex.html#integerliteral
(see "The type of the integer is resolved as follows")

In essence integer literals _default_ to 'int' unless another type is specified or required. This suggested change does that, and nothing else? (can anyone see a difference?)

Are you suggesting that in the situation where multiple function signatures could possibly match an undecorated string literal, D should assume that the string literal is actually in utf-8 format, and if that then fails to find a match, it should signal an error?

I'm suggesting that an undecorated string literal could default to char[], similar to how an undecorated integer literal defaults to 'int', and that the risk created by that behaviour would be no different in either case.
 If so and if I can accept the behaviour for integer literals why can't I
 for string literals?

 The only logical reason I can think of for not accepting it, is if there
 exists a difference between integer literals and string literals which
 affects this behaviour.

 I can think of differences, but none which affect the behaviour. So, it
 seems that if I accept the risk for integers, I have to accept the risk
 for string literals too.

What might be a relevant point about this is that we are trying to talk about strings, but as far as D is concerned, we are really talking about arrays (of code-units). And for arrays, the current D behaviour is self-consistent.

If, however, D supported a true string data type, then a great deal of our messy code dealing with UTF conversions would disappear, just as it does with integers and floating point values. Imagine the problems we would have if integers were regarded as arrays of bits by the compiler!

I'm not sure it makes any difference that char[] is an array. If you imagine that we removed the current integer literal rules, here:

http://www.digitalmars.com/d/lex.html#integerliteral
(see "The type of the integer is resolved as follows")

then short/int/long would exhibit the same problem that char[]/wchar[]/dchar[] does; this would be illegal:

   void foo(short i) {}
   void foo(int i) {}
   void foo(long i) {}
   foo(5);

requiring:

   foo(5s); //to call short version
   foo(5i); //to call int version
   foo(5l); //to call long version

or:

   foo(cast(short)5); //to call short version
   foo(cast(int)5);   //to call int version
   foo(cast(long)5);  //to call long version

just like char[]/wchar[]/dchar[] does today.
 ---

 Note that string promotion should occur just like integer promotion  
 does,
 eg:

 void foo(long i) {}
 foo(5); //calls foo(long) with no error

But what happens when ...

   void foo(long i) {}
   void foo(short i) {}
   foo(5); //calls ???

You get:

   test.d(8): function test.foo called with argument types:
           (int)
   matches both:
           test.foo(short)
   and:
           test.foo(long)

which is correct IMO because 'int' can be promoted to both 'short' and 'long' with equal preference. ("that's the long and short of it" <g>)
 void foo(wchar[] s) {}
 foo("test"); //should call foo(wchar[]) with no error

 this behaviour is current and should not change.

Agreed.

   void foo(wchar[] s) {}
   void foo(char[] s) {}
   foo("test"); //should call ???

I'm now thinking that it should call the char[] signature without error.

That's what they've been suggesting. I have started to agree (because it's no more risky than current integer literal behaviour, which I have blithely accepted for years now - perhaps due to lack of knowledge when I first started programming, and now because I am used to it and it seems natural)
 But in this case ...

  void foo(wchar[] s) {}
  void foo(dchar[] s) {}
  foo("test"); //should call an error.

Agreed, just like the integer literal example above.
 If we had a generic string type we'd probably just code ....

  void foo(string s) {}
  foo("test");  // Calls the one function
  foo("test"d); // Also calls the one function

 D would convert to an appropriate UTF format silently before (and after
 calling).

It's an interesting idea. I was thinking the same thing recently: why not have 1 super-type "string" and have it convert to the format required when asked, eg.

   //writing strings
   void c_function_call(char *string) {}
   void os_function_call(wchar[] string) {}
   void write_to_file_in_specific_encoding(dchar[] string) {}

   string a = "test"; //"test" is stored in application defined default internal representation (more on this later)

   c_function_call(a.utf8);
   os_function_call(a.utf16);
   write_to_file_in_specific_encoding(a.utf32);
   normal_d_function(a);

   //reading strings
   void read_from_file_in_specific_encoding(inout dchar[]) {}

   string a;
   read_from_file_in_specific_encoding(a.utf32);

or, perhaps we can go one step further and implicitly transcode where required, eg:

   c_function_call(a);
   os_function_call(a);
   write_to_file_in_specific_encoding(a);
   read_from_file_in_specific_encoding(a);

The properties (Sean's idea, thanks Sean) utf8, utf16, and utf32 would be of type char[], wchar[] and dchar[] respectively (so, these types remain). Slicing string would give characters, as opposed to code units (parts of characters).

I still believe the only times you care which encoding it is in, and/or should be transcoding, is on input and output, and for performance reasons you do not want it converting all over the place. To address performance concerns each application may want to define the default internal encoding of strings, and/or we could use the encoding specified on assignment/creation, eg.

   string a; //stored in application defined default (or char[] as that is D's general purpose default)
   string a = "test"w; //stored as wchar[] internally
   a.utf16 //does no transcoding
   a.utf32 //causes transcoding
   a.utf8  //causes transcoding

or, when you have nothing to assign, a special syntax is used to specify the internal encoding

   //some options off the top of my head...
   string a = string.UTF16;
   string a!(wchar[]); //random thought, can all this be achieved with a template?
   string a(UTF16);

   read_from_file_in_specific_encoding(a.utf32);

the above would create an empty/non-existant (lets not go here yet <g>) utf16 string in memory, and transcode from the file, which is utf32, to utf16 for internal representation, then:

   a.utf16 //does no transcoding
   a.utf8  //causes transcoding
   a.utf32 //causes transcoding

Assignment of strings with different internal representations would cause transcoding. This should be rare, as most would be in the application defined internal representation; it would naturally occur on input and output where you cannot avoid it anyway.

This idea has me quite excited. If no-one can poke large unsightly holes in it, perhaps we could work on a draft spec for it? (i.e. post it to digitalmars.D and see what everyone thinks)

Regan
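[A minimal sketch of such a string super-type, assuming std.utf's converters and a fixed UTF-8 internal representation for brevity; the real proposal would let the internal encoding vary per assignment. The name `String` and its members are hypothetical, not an existing API:]

```d
import std.utf;

// Hypothetical 'string' super-type; fixed UTF-8 storage for brevity.
struct String
{
    private char[] data;   // internal representation

    static String opCall(char[] s)
    {
        String r;
        r.data = s;
        return r;
    }

    char[]  utf8()  { return data; }           // no transcoding
    wchar[] utf16() { return toUTF16(data); }  // transcodes on demand
    dchar[] utf32() { return toUTF32(data); }
}

void main()
{
    String a = String("test");
    wchar[] w = a.utf16;   // transcoded copy for an API wanting UTF-16
    char[]  c = a.utf8;    // the internal array, untouched
}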
Nov 20 2005
next sibling parent Derek Parnell <derek psych.ward> writes:
On Mon, 21 Nov 2005 14:16:22 +1300, Regan Heath wrote:


[snip]

 This idea has me quite excited, if no-one can poke large unsightly holes  
 in it perhaps we could work on a draft spec for it? (i.e. post it to  
 digitalmars.D and see what everyone thinks)

Not only do great minds think alike, so do you and I! I'm starting to think that you (and your minion helpers) have hit upon a 'Great Idea(tm)'.

-- 
Derek Parnell
Melbourne, Australia
21/11/2005 12:57:40 PM
Nov 20 2005
prev sibling parent "Kris" <fu bar.com> writes:
"Regan Heath" <regan netwin.co.nz> wrote...
 On Mon, 21 Nov 2005 11:10:30 +1100, Derek Parnell <derek psych.ward> 
 wrote:

  void foo(wchar[] s) {}
  void foo(char[]  s) {}
  foo("test"); //should call ???

 I'm now thinking that it should call the char[] signature without error.

That's what they've been suggesting. I have started to agree (because it's no more risky than current integer literal behaviour

Aye!
 But in this case ...

  void foo(wchar[] s) {}
  void foo(dchar[] s) {}
  foo("test"); //should call an error.

Agreed, just like the integer literal example above.

Aye!
Nov 20 2005
prev sibling parent reply Georg Wrede <georg.wrede nospam.org> writes:
Derek Parnell wrote:
 On Fri, 18 Nov 2005 15:31:48 +0200, Georg Wrede wrote:
 Derek Parnell wrote:


 However, are you saying that D should change its behaviour such
 that it should always implicitly convert between encoding types?
 Should this happen only with assignments or should it also happen
 on function calls?

Both. And everywhere else (in case we forgot to name some situation).

We have problems with inout and out parameters.

   foo(inout wchar[] x) {}
   dchar[] y = "abc";
   foo(y);

In this case, if automatic conversion took place, it would have to do it twice. It would be like doing ...

   auto wchar[] temp;
   temp = toUTF16(y);
   foo(temp);
   y = toUTF32(temp);

Would you be surprised:

   Foo[] foo = new Foo[10];
   for(ubyte i=0; i<10; i++) // Not short, int, or long, to "save space"
   {
       foo[i] = whatever; // Gee, compiler silently casts to int!
   }

He might be stupid, uneducated, or then not have coded since 1985. And it happens.
   foo(wchar[] x) { . . .  } // #1
   foo(dchar[] x) { . . .  } // #2
   dchar y;
   foo(y);  // Obviously should call #2
   foo("Some Test Data"); // Which one now?

Test data is undecorated, hence char[]. Technically on the last line above it could pick at random, when it has no "right" alternative, but I think it would be Polite Manners to make the compiler complain.

Yes, at that's what happens now.
I'm still trying to get through the notion that it 
_really_does_not_matter_ what it chooses!

I disagree. Without know what the intention of the function is, one has no way of knowing which function to call. Try it. Which one is the right one to call in the example above? It is quite possible that there is no right one.

If the overloaded functions purport to take UTF (of any width at all), then it is assumed that they do _semantically_ the same thing. Thus, one has the right to sleep at night. The programmer shall not see any difference whichever is chosen:

- if there's only one type, then there's no choice anyway.
- if there's one that matches, then pick that. (Not that it would be obligatory, but it's polite.)
- if there are the two non-matching, then pick the one preferred by the compiler writer, or the OS vendor. If not, then just pick either one.
- if there are no UTF versions, then it'd be okay to complain, at compile time.
 If we have automatic conversion and it choose one at random, there is
 no way of knowing that its doing the 'right' thing to the data we
 give it. In my opinion, its a coding error and the coder need to
 provide more information to the compiler.

I want everyone to understand that it makes just as little difference as when the compiler optimizer chooses a datatype for variable i in this:

   for(ubyte i=0; i<256; i++)
   {
       // do stuff
   }

Can you honestly say that it makes a difference which type i is? (Except signed byte, of course. And we're not talking about performance.)

I wouldn't be surprised if DMD (haven't checked!) would sneak i to int instead of the explicitly asked-for ubyte, already in the default compile mode. And at -release and -O it probably should. (Again, haven't checked, and even if it does not do it, the issue is a matter of principle: would making it int make a difference in this example?)
 (Of course performance is slower with a lot of unnecessary casts (
 = conversions), but that's the programmer's fault, not ours.)

 Given just the function signature and an undecorated string, it
 is not possible for the compiler to call the 'correct' function.
 In fact, it is not possible for a person (other than the original
 designer) to know which is the right one to call?

That is (I'm sorry, no offense), based on a misconception. Please see my other posts today, where I try to clear (among other things) this very issue.

I challenge you, right here and now, to tell me which of those two functions above is the one that the coder intended to be called.

Suppose you're in a huge software project in D, and the customer has ordered it to do all arithmetic in long. After 1500000 lines it goes to the beta testers, and they report weird behavior. Three weeks of searching, and the boss is raving around with an axe. One night the following code is found:

   import std.stdio;
   void main()
   {
       long myvar;
       ...
       myvar = int.max / 47;
       ... 300 lines
       myvar = scale(myvar);
       ... 500 lines
   }
   ... 50000 lines later
   long scale(int v)
   {
       long tmp = 1000 * v;
       return tmp / 3;
   }

Folks suspect the bug is here, but what is wrong? Does the compiler complain? Should it?
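[For what it's worth, the trap in the snippet above sits in two places, assuming D of that era, which accepted implicit narrowing conversions: the long argument is silently truncated to int at the call site, and 1000 * v is computed in 32-bit int before being widened:]

```d
long scale(int v)            // myvar (long) is silently narrowed to int here
{
    long tmp = 1000 * v;     // computed entirely in int: overflows for large v,
                             // and only THEN widened to long
    // long tmp = 1000L * v; // would force 64-bit arithmetic instead
    return tmp / 3;
}
```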
 If the coder had written 
 
     foo("Some Test Data"w);
 
 then its pretty clear which function was intended.

Except that my example above is dangerous, while with UTF it can't get dangerous. Hey, what should the compiler complain about if I write:

   char[] a = "\U00000041"c;

(Do you think it currently complains? Saying what? Or doesn't it? And what do you say happens if one would get this currently compiled and run?)
 D has currently got the better solution to this problem; get the 
 coder to identify the storage characteristics of the string!

He does, at assignment to a variable. And, up till that time, it makes no difference. It _really_ does not.

But it *DOES* make a difference when doing signature matching. I'm not talking about assignments to variables.

Would it be correct to say that nothing can be done with an undecorated string literal unless the type of the receiver is known? Apart from passing to overloaded functions (each of which knows "what it wants"), is there any situation where UTF is accepted, but the receiver does not itself know which it "wants", or even "prefers"? Should there be such cases? Could there?
Nov 20 2005
parent reply Derek Parnell <derek psych.ward> writes:
On Sun, 20 Nov 2005 17:02:04 +0200, Georg Wrede wrote:

 Derek Parnell wrote:
 On Fri, 18 Nov 2005 15:31:48 +0200, Georg Wrede wrote:
 Derek Parnell wrote:


 However, are you saying that D should change its behaviour such
 that it should always implicitly convert between encoding types?
 Should this happen only with assignments or should it also happen
 on function calls?

Both. And everywhere else (in case we forgot to name some situation).

We have problems with inout and out parameters.

    foo(inout wchar[] x) {}
    dchar[] y = "abc";
    foo(y);

In this case, if automatic conversion took place, it would have to do it twice. It would be like doing:

    auto wchar[] temp;
    temp = toUTF16(y);
    foo(temp);
    y = toUTF32(temp);

Would you be surprised:

Surprised about the two conversions? No, I just said that's what it would have to do, so no, I wouldn't be surprised. I just said it would be a problem, in so far as the compiler would (currently) not warn coders about the performance hit until they profiled it, and even then it might not be obvious to some people.
 Foo[10] foo = new Foo;
 for(ubyte i=0; i<10; i++) // Not short, int, or long, "save space"
 {
      foo[i] = whatever;  // Gee, compiler silently casts to int!
 }
 
 He might be stupid, uneducated, or not have coded since 1985.
 And it happens.

What on earth has the above example got to do with double conversions? And converting from ubyte to int is not exactly a performance drain.
   foo(wchar[] x) { . . .  } // #1
   foo(dchar[] x) { . . .  } // #2
   dchar y;
   foo(y);  // Obviously should call #2
   foo("Some Test Data"); // Which one now?

Test data is undecorated, hence char[]. Technically on the last line above it could pick at random, when it has no "right" alternative, but I think it would be Polite Manners to make the compiler complain.

Yes, and that's what happens now.
I'm still trying to get through the notion that it 
_really_does_not_matter_ what it chooses!

I disagree. Without knowing what the intention of the function is, one has no way of knowing which function to call. Try it. Which one is the right one to call in the example above? It is quite possible that there is no right one.
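
A sketch of the situation (the function bodies are hypothetical, and I'm assuming the current compiler's behaviour on the undecorated call):

    import std.stdio;

    void foo(wchar[] x) { writefln("got UTF-16"); } // #1
    void foo(dchar[] x) { writefln("got UTF-32"); } // #2

    void main()
    {
        foo("Some Test Data"w);   // suffix commits the literal: calls #1
        foo("Some Test Data"d);   // suffix commits the literal: calls #2
        // foo("Some Test Data"); // undecorated: matches both, rejected
    }

Uncommenting the last call should reproduce exactly the ambiguity we are arguing about.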

If the overloaded functions purport to take UTF (of any width at all), then it is assumed that they do _semantically_ the same thing. Thus, one has the right to sleep at night.

Assumptions like that have a nasty habit of generating nightmares. It is *only* an assumption and not a decision based on actual knowledge.
 The programmer shall not see any difference whichever is chosen:
 
   - if there's only one type, then there's no choice anyway.

But there is more than one.
   - if there's one that matches, then pick that, (not that it would be 
 obligatory, but it's polite.)

Sorry, no matches.
   - if there are the two non-matching, then pick the one preferred by 
 the compiler writer, or the OS vendor. If not, then just pick either one.

BANG! This is where we part company. My belief is that assuming functions with the same name are going to do the same thing is dangerous and can lead to mistakes. Whereas you seem to be saying that this is a safe assumption to make.
   - if there are no UTF versions, then it'd be okay to complain, at 
 compile time.
 
 If we have automatic conversion and it chooses one at random, there is
 no way of knowing that it's doing the 'right' thing to the data we
 give it. In my opinion, it's a coding error and the coder needs to
 provide more information to the compiler.

I want everyone to understand that it makes just as little difference as when the compiler optimizer chooses a datatype for variable i in this:

    for(ubyte i=0; i<256; i++)
    {
        // do stuff
    }

Can you honestly say that it makes a difference which type i is? (Except signed byte, of course. And we're not talking about performance.)

No, but what's this got to do with the argument?
 I wouldn't be surprised if DMD (haven't checked!) would sneak i to int 
 instead of the explicitly asked-for ubyte, already in the default 
 compile mode. And -release, and at -O probably should. (Again, haven't 
 checked, and even if it does not do it, the issue is a matter of 
 principle: would making it int make a difference in this example?)

Red Herring Alert!
 (Of course performance is slower with a lot of unnecessary casts (
 = conversions), but that's the programmer's fault, not ours.)

 Given just the function signature and an undecorated string, it
 is not possible for the compiler to call the 'correct' function.
 In fact, it is not possible for a person (other than the original
 designer) to know which is the right one to call?

That is (I'm sorry, no offense) based on a misconception. Please see my other posts today, where I try to clear up (among other things) this very issue.

I challenge you, right here and now, to tell me which of those two functions above is the one that the coder intended to be called.

Suppose you're in a huge software project with D, and the customer has ordered it to do all arithmetic in long. After 1500000 lines it goes to the beta testers, and they report weird behavior. Three weeks of searching, and the boss is raving around with an axe. One night the following code is found:

    import std.stdio;

    void main()
    {
        long myvar;
        ...
        myvar = int.max / 47;
        ... 300 lines
        myvar = scale(myvar);
        ... 500 lines
    }

    ... 50000 lines later

    long scale(int v)
    {
        long tmp = 1000 * v;
        return tmp / 3;
    }

Folks suspect the bug is here, but what is wrong? Does the compiler complain? Should it?

No it doesn't and yes it should.
 If the coder had written 
 
     foo("Some Test Data"w);
 
 then its pretty clear which function was intended.

Except that my example above is dangerous, while with UTF it can't get dangerous.

Assumptions can hurt too.
 Hey, should the compiler complain if I write:
 
 char[] a = "\U00000041"c;
 
 (Do you think it currently complains? Saying what? Or doesn't it? And 
 what do you say happens if one would get this currently compiled and run?)

Of course not. Both 'a' and the literal are of the same data type.
 D has currently got the better solution to this problem; get the 
 coder to identify the storage characteristics of the string!

He does, at assignment to a variable. And, up till that time, it makes no difference. It _really_ does not.

But it *DOES* make a difference when doing signature matching. I'm not talking about assignments to variables.

Would it be correct to say that nothing can be done with an undecorated string literal unless the type of the receiver is known? Apart from passing to overloaded functions (each of which knows "what it wants"), is there any situation where UTF is accepted, but the receiver does not itself know which it "wants", or even "prefers"? Should there be such cases? Could there?

Again, I fail to see what this has to do with the issue.

Let's call a halt to this discussion. I suspect that you and I will not agree about this function signature matching issue anytime soon.

-- 
Derek Parnell
Melbourne, Australia
21/11/2005 6:55:57 AM
Nov 20 2005
parent reply Georg Wrede <georg.wrede nospam.org> writes:
Derek Parnell wrote:
 Let's call a halt to this discussion. I suspect that you and I will
 not agree about this function signature matching issue anytime soon.

Whew, I was just starting to wonder what to do. :-) Maybe we'll save the others some headaches too. Besides, at this point, I guess nobody else reads this thread anyway. :-)

But it was nice to learn that with some folks you really can disagree long and good, and still not start fighting.

georg
Nov 20 2005
parent "Regan Heath" <regan netwin.co.nz> writes:
On Mon, 21 Nov 2005 00:23:39 +0200, Georg Wrede <georg.wrede nospam.org>  
wrote:
 Derek Parnell wrote:
 Let's call a halt to this discussion. I suspect that you and I will
 not agree about this function signature matching issue anytime soon.

Whew, I was just starting to wonder what to do. :-)

I'm interested in both your opinions on: digitalmars.D.bugs/5612
 Maybe we'll save the others some headaches too. Besides, at this point,  
 I guess nobody else reads this thread anyway. :-)

Or they prefer to lurk. Or we scared them away.
 But it was nice to learn that with some folks you really can disagree  
 long and good, and still not start fighting.

It's how it's supposed to work :) The key, I believe, is to realise that it's not personal; it's a discussion/argument of opinion. Disagreeing with an opinion is not the same as disliking the person who holds that opinion.

Of course, this is only true when the participants do not make comments which can be taken as being directed at the person, as opposed to the points of the argument itself. This is harder than it sounds, because the written word often does not convey your meaning as well as your face and voice could in a face-to-face conversation.

My 2c.

Regan
Nov 20 2005
prev sibling parent reply Bruno Medeiros <daiphoenixNO SPAMlycos.com> writes:
Georg Wrede wrote:
 
 If somebody wants to retain the bit pattern while storing the contents 
 to something else, it should be done with a union. (Just as you can do 
 with pointers, or even objects! To name a few "workarounds".)
 
 A cast should do precisely what our toUTFxxx functions currently do.
 

It should? Why? What is the problem of using the toUTFxx functions?

-- 
Bruno Medeiros - CS/E student
"Certain aspects of D are a pathway to many abilities some consider to be... unnatural."
Nov 20 2005
next sibling parent Georg Wrede <georg.wrede nospam.org> writes:
Bruno Medeiros wrote:
 Georg Wrede wrote:
 
 
 If somebody wants to retain the bit pattern while storing the
 contents to something else, it should be done with a union. (Just
 as you can do with pointers, or even objects! To name a few
 "workarounds".)
 
 A cast should do precisely what our toUTFxxx functions currently
 do.
 


Nothing wrong. But cast should not do the union thing. Of course, we could have the toUTFxxx and no cast at all for UTF strings, no problem. But definitely _not_ have the cast do the "union thing".
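
A sketch of the difference (toUTF16 as in Phobos' std.utf; the union layout here is my assumption, for illustration only):

    import std.utf; // toUTF16, toUTF8

    union Bits
    {
        char[]  c; // UTF-8 view
        wchar[] w; // UTF-16 view of the same (length, ptr) pair
    }

    void main()
    {
        char[] s = "h\u00E9llo"; // UTF-8 data

        // Conversion: allocates a new array, recomputes the code units.
        wchar[] converted = toUTF16(s);

        // Reinterpretation ("the union thing"): same bits, relabelled.
        Bits b;
        b.c = s;
        wchar[] punned = b.w; // length and contents are nonsense as UTF-16
    }

The conversion preserves the *text*; the union preserves the *bit pattern* and almost certainly yields invalid UTF-16.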
Nov 20 2005
prev sibling parent reply Derek Parnell <derek psych.ward> writes:
On Sun, 20 Nov 2005 11:30:34 +0000, Bruno Medeiros wrote:

 Georg Wrede wrote:
 
 If somebody wants to retain the bit pattern while storing the contents 
 to something else, it should be done with a union. (Just as you can do 
 with pointers, or even objects! To name a few "workarounds".)
 
 A cast should do precisely what our toUTFxxx functions currently do.
 


Do we have a toReal(), toFloat(), toInt(), toDouble(), toLong(), toULong(), ... ?

-- 
Derek Parnell
Melbourne, Australia
21/11/2005 6:48:48 AM
Nov 20 2005
parent reply Bruno Medeiros <daiphoenixNO SPAMlycos.com> writes:
Derek Parnell wrote:
 On Sun, 20 Nov 2005 11:30:34 +0000, Bruno Medeiros wrote:
 
 
Georg Wrede wrote:

If somebody wants to retain the bit pattern while storing the contents 
to something else, it should be done with a union. (Just as you can do 
with pointers, or even objects! To name a few "workarounds".)

A cast should do precisely what our toUTFxxx functions currently do.

It should? Why? What is the problem of using the toUTFxx functions?

Do we have a toReal(), toFloat(), toInt(), toDouble(), toLong(), toULong(), .... ?

No, we don't. But the case is different: between primitive numbers the casts are usually (if not always?) implicit, but most importantly, they are quite trivial. And by trivial I mean Assembly-level trivial. String encoding conversions, on the other hand (as you surely are aware), are quite non-trivial (in terms of code, run time, and heap memory usage), and I don't think a cast should perform such non-trivial operations.

-- 
Bruno Medeiros - CS/E student
"Certain aspects of D are a pathway to many abilities some consider to be... unnatural."
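
To make the trivial/non-trivial line concrete, a small sketch (toUTF32 as in Phobos' std.utf of the time):

    import std.utf; // toUTF32

    void main()
    {
        byte b = 42;
        long l = cast(long) b; // trivial: a single sign-extension

        char[] s = "\u00F6konomisch"; // UTF-8, variable-width code units
        // Non-trivial: loops over s, decodes every code point, and
        // allocates a fresh dchar[] on the heap.
        dchar[] d = toUTF32(s);
    }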
Nov 21 2005
parent reply Derek Parnell <derek psych.ward> writes:
On Mon, 21 Nov 2005 11:14:47 +0000, Bruno Medeiros wrote:

 Derek Parnell wrote:
 On Sun, 20 Nov 2005 11:30:34 +0000, Bruno Medeiros wrote:
 
 
Georg Wrede wrote:

If somebody wants to retain the bit pattern while storing the contents 
to something else, it should be done with a union. (Just as you can do 
with pointers, or even objects! To name a few "workarounds".)

A cast should do precisely what our toUTFxxx functions currently do.

It should? Why? What is the problem of using the toUTFxx functions?

Do we have a toReal(), toFloat(), toInt(), toDouble(), toLong(), toULong(), .... ?

casts are usually (if not always?) implicit, but most importantly, they are quite trivial. And by trivial I mean Assembly-level trivial. String encoding conversions on the other hand (as you surely are aware) are quite non-trivial (both in terms of code, run time, and heap memory usage), and I don't think a cast should perform such non-trivial operations.

Why? If documented, the user can be prepared. And where is the tipping point? The point at which an operation becomes non-trivial?

You mention 'assembly-level', by which I think you mean that a sub-routine is not called but the machine code is generated in-line for the operation. Would that be the trivial/non-trivial divide? Is conversion from byte to real done in-line or via sub-routine call? I don't actually know, just asking.

-- 
Derek Parnell
Melbourne, Australia
21/11/2005 10:47:34 PM
Nov 21 2005
next sibling parent reply Don Clugston <dac nospam.com.au> writes:
Derek Parnell wrote:
 On Mon, 21 Nov 2005 11:14:47 +0000, Bruno Medeiros wrote:
 
 
Derek Parnell wrote:

On Sun, 20 Nov 2005 11:30:34 +0000, Bruno Medeiros wrote:



Georg Wrede wrote:


If somebody wants to retain the bit pattern while storing the contents 
to something else, it should be done with a union. (Just as you can do 
with pointers, or even objects! To name a few "workarounds".)

A cast should do precisely what our toUTFxxx functions currently do.

It should? Why? What is the problem of using the toUTFxx functions?

Do we have a toReal(), toFloat(), toInt(), toDouble(), toLong(), toULong(), .... ?

No, we don't. But the case is different: between primitive numbers the casts are usually (if not always?) implicit, but most importantly, they are quite trivial. And by trivial I mean Assembly-level trivial. String encoding conversions on the other hand (as you surely are aware) are quite non-trivial (both in terms of code, run time, and heap memory usage), and I don't think a cast should perform such non-trivial operations.

Why? If documented, the user can be prepared. And where is the tipping point? The point at which an operation becomes non-trivial? You mention 'assembly-level' by which I think you mean that a sub-routine is not called but the machine code is generated in-line for the operation. Would that be the trivial/non-trivial divide?

I would think so. I'd define trivial as: "the assembly code doesn't have any loops".
 Is conversion from byte to real done in-line or via sub-routine call? I
 don't actually know, just asking.

On x86, int -> real can be done with the FILD instruction, or without the FPU in a couple of instructions. short -> int is done with MOVSX, ushort -> uint with MOVZX.

HOWEVER -- I don't think this is really relevant. The real issue is about literals, which, as Georg rightly said, could be stored in ANY format. Conversion from a literal to any type has ZERO runtime cost.

I think that in a few respects, the existing situation for strings is BETTER than the situation for integers. I personally don't like the fact that integer literals default to 'int', unless you suffix them with L. Even if the number is too big to fit into an int! And floating-point constants default to 'double', not real.

One intriguing possibility would be to have literals having NO type (or more accurately, an unassigned type), the type only being assigned when the literal is used. E.g. "abc" is of type const __unassignedchar[]. There are implicit conversions from __unassignedchar[] to char[], wchar[], and dchar[], but there are none from char[] to wchar[]. Adding a suffix changes the type from __unassignedchar to char[], wchar[], or dchar[], preventing any implicit conversions. (__unassignedchar could also be called __stringliteral -- it's inaccessible, anyway.)

Similarly, an integral constant could be of type __integerliteral UNTIL it is assigned to something. At that point, a check is performed to see if the value can actually fit in the type. If not (e.g. when an extended UTF char is assigned to a char), it's an error.

Admittedly, it's more difficult to deal with when you have integers, and especially with reals, where no lossless conversion exists (because 1.0/3.0f + 1.0/5.0f is not the same as cast(float)(1.0/3.0L + 1.0/5.0L) -- the roundoff errors are different). There are some vagaries -- what rounding mode is used when performing calculations on reals? This is implementation-defined in C and C++; it would be nice if it were specified in D.
UTF strings are not the only worm in this can of worms :-)
Nov 21 2005
next sibling parent Oskar Linde <oskar.lindeREM OVEgmail.com> writes:
In article <dlsj73$2fod$1 digitaldaemon.com>, Don Clugston says...

I think that in a few respects, the existing situation for strings is 
BETTER than the situation for integers.

I personally don't like the fact that integer literals default to 'int', 
unless you suffix them with L. Even if the number is too big to fit into 
an int! And floating-point constants default to 'double', not real.

I agree with you.
One intriguing possibility would be to have literals having NO type (or 
more accurately, an unassigned type). The type only being assigned when 
it is used.

eg  "abc" is of type: const __unassignedchar [].
There are implicit conversions from __unassignedchar [] to char[], 
wchar[], and dchar[]. But there are none from char[] to wchar[].

String literals already work like this. :) String literals without a suffix are char[], but not "committed"; string literals with a suffix are "committed" to their type. Check the frontend sources: StringExp::implicitConvTo(Type *t) allows conversion of non-committed string literals to {,w,d}char arrays and pointers. This is what makes this an error:

# void print(char[] x) {}
# void print(wchar[] x) {}
# void main() { print("test"); }

Regards,

/Oskar
Nov 21 2005
prev sibling parent reply Sean Kelly <sean f4.ca> writes:
Don Clugston wrote:
 
 I personally don't like the fact that integer literals default to 'int', 
 unless you suffix them with L. Even if the number is too big to fit into 
 an int! And floating-point constants default to 'double', not real.

Really? I tested this a few days ago and it seemed like literals larger than int.max were treated as longs. I'll mock up another test on my way to work.

Sean
Nov 21 2005
parent reply Oskar Linde <oskar.lindeREM OVEgmail.com> writes:
In article <dlt2b9$6df$3 digitaldaemon.com>, Sean Kelly says...
Don Clugston wrote:
 
 I personally don't like the fact that integer literals default to 'int', 
 unless you suffix them with L. Even if the number is too big to fit into 
 an int! And floating-point constants default to 'double', not real.

Really? I tested this a few days ago and it seemed like literals larger than int.max were treated as longs. I'll mock up another test on my way to work.

You are right, large integers are automatically treated as longs, but too-large floating point literals are not automatically treated as real.

# import std.stdio;
#
# void main() {
#     writef("%s\n%s\n%s\n%s\n",
#         typeid(typeof(1231231231)),
#         typeid(typeof(12312312312312)),
#         typeid(typeof(1e100)),
#         //typeid(typeof(1e350)), // Error: number is not representable
#         typeid(typeof(1e350l))   // l-suffix
#     );
# }

Prints:

int
long
double
real

/Oskar
Nov 21 2005
parent Sean Kelly <sean f4.ca> writes:
Oskar Linde wrote:
 
 You are right, large integers are automatically treated as longs, but too large
 floating point literals are not automatically treated as real.

This seems reasonable though, since it's really a matter of precision with floating-point numbers more so than representability.

Sean
Nov 21 2005
prev sibling parent Bruno Medeiros <daiphoenixNO SPAMlycos.com> writes:
Derek Parnell wrote:
 On Mon, 21 Nov 2005 11:14:47 +0000, Bruno Medeiros wrote:
 
Derek Parnell wrote:
Do we have a toReal(), toFloat(), toInt(), toDouble(), toLong(), toULong(),
.... ?

No, we don't. But the case is different: between primitive numbers the casts are usually (if not always?) implicit, but most importantly, they are quite trivial. And by trivial I mean Assembly-level trivial. String encoding conversions on the other hand (as you surely are aware) are quite non-trivial (both in terms of code, run time, and heap memory usage), and I don't think a cast should perform such non-trivial operations.

Why? If documented, the user can be prepared. And where is the tipping point? The point at which an operation becomes non-trivial? You mention 'assembly-level' by which I think you mean that a sub-routine is not called but the machine code is generated in-line for the operation. Would that be the trivial/non-trivial divide?

As Don Clugston said: if the code's run time depends on the object size, that is, if it is not constant bounded, then it's beyond the acceptable point. Another disqualifier is allocating memory on the heap. A string encoding conversion does both things.
 Is conversion from byte to real done in-line or via sub-routine call? I
 don't actually know, just asking.
 

I suspected that it was merely an Assembly one-liner (i.e., one instruction only).

Note: I think the most complex cast we have right now is a class object downcast, which, although not universally constant bounded, is still compile-time constant bounded.

-- 
Bruno Medeiros - CS/E student
"Certain aspects of D are a pathway to many abilities some consider to be... unnatural."
Nov 23 2005