
digitalmars.D - The case for ditching char and wchar (and renaming "dchar" as "char")

reply Arcane Jill <Arcane_member pathlink.com> writes:
D has come a long way, and much of the original architecture is now redundant.
Template/library based containers are now making built-in associative arrays
redundant, for example. And now a new revolution is on its way - transcoding,
which makes built in support for UTF-8 and friends equally redundant. (It does
not, of course, make Unicode itself redundant!).

D's "char" type is, by definition, a fragment of UTF-8.
But UTF-8 is just an encoding.

D's "wchar" type is, by definition, a fragment of UTF-16.
But UTF-16 is also just an encoding (or two).

D's "dchar" type flits ambiguously between a fragment of UTF-32 and an actual
Unicode codepoint (the two are more or less interchangeable).

<sarcasm>
By extension of this logic, why not:

schar - a fragment of UTF-7
ichar - a fragment of ISO-8859-1
cchar - a fragment of WINDOWS-1252
.. and so on, for every encoding you can think of. Hang on - we're going to run
out of letters!

and of course, Phobos would have to implement all the conversion functions:
toUTF7(), toISO88591(), and so on.
</sarcasm>

Nonsense? Of course it is. But the analogy is intended to show that the current
behavior of D is also nonsense. For N encodings, you need (N squared minus N)
conversion functions, so the number is going to grow quite rapidly as the number
of supported encodings increases. But if you instead use transcoding, then the
number of conversion functions you need is simply N. Not only that, the
mechanism is smoother, neater. Your code is more elegant. You simply don't have
to /worry/ about all that nonsense trying to get the three built-in encodings to
match, because the issue has simply gone away.
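
To make the scaling argument concrete, here is a minimal sketch of the
transcoder-centred approach (the Transcoder interface and the convert() helper
are invented for illustration; this is not Mango's or Phobos's actual API):

    // Sketch only: one codec per encoding, all meeting at dchar[]
    // (Unicode codepoints) in the middle.
    interface Transcoder
    {
        dchar[] decode(ubyte[] raw);    // encoding-specific bytes -> codepoints
        ubyte[] encode(dchar[] text);   // codepoints -> encoding-specific bytes
    }

    // Any-to-any conversion is decode followed by encode, so N encodings need
    // only N codecs instead of (N squared minus N) pairwise functions.
    ubyte[] convert(ubyte[] raw, Transcoder from, Transcoder to)
    {
        return to.encode(from.decode(raw));
    }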

And once the issue has gone away, you no longer need a special type to hold
fragments of UTF-8 or UTF-16. Bye bye char. Bye bye wchar.

Kris (antiAlias) has sent me the transcoding interface which Mango requires.
(Received, thanks). I've already written a generic one, but which didn't take
those requirements into account. So today I'm going to merge the two approaches
together and see what Kris thinks. So I'm pretty confident that within a few
days, Kris and I will have got together a transcoding architecture we're both
happy with - and since Kris has expertise in streams/Mango, and I have expertise
in Unicode/internationalization, I'd make a pretty good wager that between us
we're going to get it right. And we'll plumb in the UTF transcoders first. You
can probably expect all that to be done within days rather than weeks.

So why would we then need old-style-char or wchar any more?

For reasons of space-efficiency, one might want to store text in memory in UTF-8
format. Fair enough. But if char were to be ditched, you could still do that.
You'd simply use a ubyte[] for that purpose (just as you are now required to do
if you want to store text in memory in UTF-7). After all - what actually /is/ a
UTF fragment anyway? What meaning does the UTF-8 fragment 0x83 have in
isolation? Answer - none. It has meaning only in the context of the bytes
surrounding it. You don't need a special primitive type just to hold that
fragment. And of course, there is /nothing/ to stop special string classes from
being written to provide implementations of such space-efficient abstractions.
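
As a small illustration (assuming char really were ditched; the variable name
is invented), UTF-8 text sits quite happily in a ubyte[], and the point about
fragments is visible in the bytes themselves:

    // "Hi Ã" as raw UTF-8 bytes: 0xC3 0x83 together encode U+00C3, while the
    // trailing fragment 0x83 on its own decodes to nothing at all.
    ubyte[5] utf8Text = [0x48, 0x69, 0x20, 0xC3, 0x83];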

A further argument against char is that people coming from C/C++ /will/ try to
store ISO-8859-1 encoded strings in a char[]. And they will get away with it, too, so
long as they don't try calling any toUTFxx() routines on them. Bugs can be
deeply buried in this confusion, failing to surface for a very long time.

Discussion in another thread has focused on the fact that Object.toString()
returns a char[]. Regan and I have made the suggestion that the three string
types be interchangeable. But there's a better way: have just the /one/ string
type. (As they say in Highlander, "There can be only one"). Problem gone away.

With new-style-char redefined, not merely as a UTF-32 code unit (a fragment of
an encoding), but as an actual Unicode character, things become much, much
simpler.

AND it would make life easier for Walter - fewer primitive types; less for the
compiler to understand/do.

Java tried to do it this way. When Java was invented, they had a char type
intended to hold a single Unicode character. They also had a byte type, an array
of which could store ASCII or ISO-8859-1 or UTF-8 encoded text. They also had
transcoding built in, to make it all hang together. Where it went wrong for Java
was that Unicode changed from being 16-bits wide to being 21-bits wide (so
suddenly Java's char was no longer wide enough, and they were forced to redefine
Java strings as being UTF-16 encoded). But please note that Java did /not/
attempt to have separate char types for each encoding. Even /after/ Unicode
exceeded 16-bits, Java was not tempted to introduce a new kind of char. Why not?
Because having more than one char type is an ugly kludge (particularly if you're
using Unicode by definition). It's an ugly kludge in D, too. I thought it was
really good, once upon a time, but now that transcoding is moving out to
libraries, and encompasses many /more/ encodings than merely UTF-8/16/32, I no longer
think that. Now is the best time of all for a rethink.

But ...

there's a down-side ... it would break a lot of existing code. Well, so what?
This is a pre-1.0 wart-removing exercise. Like all of those other suggestions
we're voting on in another thread, the time to make this change is now, before
it's too late.

Arcane Jill
Aug 23 2004
next sibling parent Matthias Becker <Matthias_member pathlink.com> writes:
After you proposed these ideas about allowing toString to return any character
type I started thinking about it and finally I thought: Why do we have more than
one character-type? (Just like you do)
Aug 23 2004
prev sibling next sibling parent Juanjo Álvarez <juanjuxNO SPAMyahoo.es> writes:
Arcane Jill wrote:
 
 But there's a better way:
 have just the /one/ string type. (As they say in Highlander, "There can be
 only one"). Problem gone away.
Yes, it makes a lot of sense. You have my (useless) vote.
 AND it would make life easier for Walter - fewer primitive types; less for
 the compiler to understand/do.
I think Walter should like it, if only for this.
 But ...
 
 there's a down-side ... it would break a lot of existing code. Well, so
 what? This is a pre-1.0 wart-removing exercise. Like all of those other
 suggestions we're voting on in another thread, the time to make this
 change is now, before it's too late.
I'll be very happy to change my (little) code now.
Aug 23 2004
prev sibling next sibling parent reply Ben Hinkle <bhinkle4 juno.com> writes:
Arcane Jill wrote:

 
 D has come a long way, and much of the original architecture is now
 redundant. Template/library based containers are now making built-in
 associative arrays redundant, for example. And now a new revolution is on
 its way - transcoding, which makes built in support for UTF-8 and friends
 equally redundant. (It does not, of course, make Unicode itself
 redundant!).
 
 D's "char" type is, by definition, a fragment of UTF-8.
 But UTF-8 is just an encoding.
 
 D's "wchar" type is, by definition, a fragment of UTF-16.
 But UTF-16 is also just an encoding (or two).
 
 D's "dchar" type flits ambiguously between a fragment of UTF-32 and an
 actual Unicode codepoint (the two are more or less interchangeable).
 
 <sarcasm>
 By extension of this logic, why not:
 
 schar - a fragment of UTF-7
 ichar - a fragment of ISO-8859-1
 cchar - a fragment of WINDOWS-1252
 .. and so on, for every encoding you can think of. Hang on - we're going
 to run out of letters!
 
 and of course, Phobos would have to implement all the conversion
 functions: toUTF7(), toISO88591(), and so on.
 </sarcasm>
 
 Nonsense? Of course it is. But the analogy is intended to show that the
 current behavior of D is also nonsense. For N encodings, you need (N
 squared minus N) conversion functions, so the number is going to grow
 quite rapidly as the number of supported encodings increases. But if you
 instead use transcoding, then the number of conversion functions you need
 is simply N. Not only that, the mechanism is smoother, neater. Your code
 is more elegant. You simply don't have to /worry/ about all that nonsense
 trying to get the three built-in encodings to match, because the issue has
 simply gone away.
 
 And once the issue has gone away, you no longer need a special type to
 hold fragments of UTF-8 or UTF-16. Bye bye char. Bye bye wchar.
 
 Kris (antiAlias) has sent me the transcoding interface which Mango
 requires. (Received, thanks). I've already written a generic one, but
 which didn't take those requirements into account. So today I'm going to
 merge the two approaches together and see what Kris thinks. So I'm pretty
 confident that within a few days, Kris and I will have got together a
 transcoding architecture we're both happy with - and since Kris has
 expertise in streams/Mango, and I have expertise in
 Unicode/internationalization, I'd make a pretty good wager that between us
 we're going to get it right. And we'll plumb in the UTF transcoders first.
 You can probably expect all that to be done within days rather than weeks.
 
 So why would we then need old-style-char or wchar any more?
 
 For reasons of space-efficiency, one might want to store text in memory in
 UTF-8 format. Fair enough. But if char were to be ditched, you could still
 do that. You'd simply use a ubyte[] for that purpose (just as you are now
 required to do if you want to store text in memory in UTF-7). After all -
 what actually /is/ a UTF fragment anyway? What meaning does the UTF-8
 fragment 0x83 have in isolation? Answer - none. It has meaning only in the
 context of the bytes surrounding it. You don't need a special primitive
 type just to hold that fragment. And of course, there is /nothing/ to stop
 special string classes from being written to provide implementations of
 such space-efficient abstractions.
 
 A further argument against char is that people coming from C/C++ /will/ try to
 store ISO-8859-1 encoded strings in a char[]. And they will get away with
 it, too, so long as they don't try calling any toUTFxx() routines on them.
 Bugs can be deeply buried in this confusion, failing to surface for a very
 long time.
 
 Discussion in another thread has focused on the fact that
 Object.toString() returns a char[]. Regan and I have made the suggestion
 that the three string types be interchangeable. But there's a better way:
 have just the /one/ string type. (As they say in Highlander, "There can be
 only one"). Problem gone away.
 
 With new-style-char redefined, not merely as a UTF-32 code unit (a
 fragment of an encoding), but as an actual Unicode character, things
 become much, much simpler.
 
 AND it would make life easier for Walter - fewer primitive types; less for
 the compiler to understand/do.
 
 Java tried to do it this way. When Java was invented, they had a char type
 intended to hold a single Unicode character. They also had a byte type, an
 array of which could store ASCII or ISO-8859-1 or UTF-8 encoded text. They
 also had transcoding built in, to make it all hang together. Where it went
 wrong for Java was that Unicode changed from being 16-bits wide to being
 21-bits wide (so suddenly Java's char was no longer wide enough, and they
 were forced to redefine Java strings as being UTF-16 encoded). But please
 note that Java did /not/ attempt to have separate char types for each
 encoding. Even /after/ Unicode exceeded 16-bits, Java was not tempted to
 introduce a new kind of char. Why not? Because having more than one char
 type is an ugly kludge (particularly if you're using Unicode by
 definition). It's an ugly kludge in D, too. I thought it was really good,
 once upon a time, but now that transcoding is moving out to libraries, and
 encompasses many /more/ encodings than merely UTF-8/16/32, I no longer think
 that. Now is the best time of all for a rethink.
 
 But ...
 
 there's a down-side ... it would break a lot of existing code. Well, so
 what? This is a pre-1.0 wart-removing exercise. Like all of those other
 suggestions we're voting on in another thread, the time to make this
 change is now, before it's too late.
 
 Arcane Jill
There were huge threads about char vs wchar vs dchar a while ago (on the old
newsgroup, I think). All kinds of things like what the default should be, what
the names should be, what a string class could be etc. For example
 http://www.digitalmars.com/d/archives/20361.html
 http://www.digitalmars.com/d/archives/12382.html
or actually anything at
 http://www.digitalmars.com/d/archives/index.html
with the word "unicode" in the subject.

By the way, why if there are N encodings are there N^2-N converters? Shouldn't
there just be ~2*N to convert to/from one standard like dchar[]? IBM's ICU (at
http://oss.software.ibm.com/icu/) uses wchar[] as the standard.

-Ben
Aug 23 2004
next sibling parent reply Arcane Jill <Arcane_member pathlink.com> writes:
In article <cgcoe6$2cq4$1 digitaldaemon.com>, Ben Hinkle says...

There were huge threads about char vs wchar vs dchar a while ago (on the old
newsgroup, I think). All kinds of things like what the default should be,
what the names should be, what a string class could be etc. For example
 http://www.digitalmars.com/d/archives/20361.html
 http://www.digitalmars.com/d/archives/12382.html
or actually anything at
 http://www.digitalmars.com/d/archives/index.html
with the word "unicode" in the subject.
Well spotted. I had a look at some of those old threads, and it does seem that
most of the views back there were saying much the same thing as I'm suggesting
now, which is good, as I'm happy to count it as more votes for the proposal,
AND evidence of ongoing discontent over some years.

The difference between now and then is that /now/ we have transcoding classes
underway, and we'll have a working architecture very very soon, which will be
able to plug into any kind of string or stream class. This is the difference
which makes ditching char and wchar an actual practical possibility now.

Incidentally, there were plenty of views in those archives which basically said
that the Unicode functions which now exist in etc.unicode (and which didn't
exist at the time) should exist. That's one problem solved.
By the way, why if there are N encodings are there N^2-N converters?
Shouldn't there just be ~2*N to convert to/from one standard like dchar[]?
Well, that's how transcoding will do it, obviously. I was comparing it to the
present system, in which N == 3 (UTF-8, UTF-16 and UTF-32), and there are 6
(= 3^2-3) converters in std.utf, these being:

*) toUTF8(wchar[]);
*) toUTF8(dchar[]);
*) toUTF16(char[]);
*) toUTF16(dchar[]);
*) toUTF32(char[]);
*) toUTF32(wchar[]);

If the current (std.utf) scheme were to be extended to include, say, UTF-7 and
UTF-EBCDIC, how would that scale up?
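
For what it's worth, here is how those six look in use, with the signatures
std.utf declared at the time (the variable names are only for illustration):

    import std.utf;

    void demo()
    {
        char[]  c  = "hello";       // UTF-8 text
        wchar[] w  = toUTF16(c);    // UTF-8  -> UTF-16
        dchar[] d  = toUTF32(c);    // UTF-8  -> UTF-32
        char[]  c2 = toUTF8(w);     // UTF-16 -> UTF-8
        dchar[] d2 = toUTF32(w);    // UTF-16 -> UTF-32
        char[]  c3 = toUTF8(d);     // UTF-32 -> UTF-8
        wchar[] w2 = toUTF16(d);    // UTF-32 -> UTF-16
    }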
IBM's ICU (at http://oss.software.ibm.com/icu/)
Bloody hell. I wish someone had pointed me at ICU earlier. That is exceptional. They've even got Unicode Regular Expressions! And transcoding functions. And it's open source, too! Should I just give up on etc.unicode? Maybe we should just put a D wrapper around ICU instead, which would give D full Unicode support right now, and leave me free to do crypto stuff!
IBM's ICU (at http://oss.software.ibm.com/icu/)
uses wchar[] as the standard.
Ah, no it doesn't. I just checked. ICU has the types UChar (platform dependent,
but wchar for us) and UChar32 (definitely a dchar). So you see, both wchar[]
and dchar[] are "standards" for ICU. (That said, I've only looked at it for a
few seconds, so I may have misunderstood). Anyway, UTF-16 transcoding will
easily take care of interfacing with any UTF-16 architecture. The present
situation in D is no more compatible than what I'm suggesting.

Slightly modified proposal then - ditch char and wchar as before, PLUS,
incorporate ICU into D's core and write a D wrapper for it. (And ditch
etc.unicode - erk!) The ICU license is at
http://oss.software.ibm.com/cvs/icu/~checkout~/icu/license.html.

Arcane Jill
Aug 23 2004
parent Juanjo Álvarez <juanjuxNO SPAMyahoo.es> writes:
Arcane Jill wrote:


 Slightly modified proposal then - ditch char and wchar as before, PLUS,
 incorporate ICU into D's core and write a D wrapper for it. (And ditch
 etc.unicode - erk!) The ICU license is at
 http://oss.software.ibm.com/cvs/icu/~checkout~/icu/license.html.
I suppose that if dchar is what's left after this ditching, it should be
renamed to char.

AJ, I don't know a shit^D^D^D^D too much about Unicode, but your excitement
about ICU is really contagious. Only one question: are the C wrappers at the
same level as the C++/Java ones? If so, it seems that with a little easy and
boring (compared to writing etc.unicode) wrapping we're going to have a
first-class Unicode lib :) => (i18n version of <g>)
Aug 23 2004
prev sibling next sibling parent reply "Roald Ribe" <rr.no spam.teikom.no> writes:
"Ben Hinkle" <bhinkle4 juno.com> wrote in message
news:cgcoe6$2cq4$1 digitaldaemon.com...

[snip]

 There were huge threads about char vs wchar vs dchar a while ago (on the
old
 newsgroup, I think). All kinds of things like what the default should be,
 what the names should be, what a string class could be etc. For example
  http://www.digitalmars.com/d/archives/20361.html
  http://www.digitalmars.com/d/archives/12382.html
 or actually anything at
  http://www.digitalmars.com/d/archives/index.html
 with the word "unicode" in the subject.

 By the way, why if there are N encodings are there N^2-N converters?
 Shouldn't there just be ~2*N to convert to/from one standard like dchar[]?
 IBM's ICU (at http://oss.software.ibm.com/icu/) uses wchar[] as the
 standard.
Indeed. There were several large discussions about this. Only a few
Scandinavian/north European readers of this group seemed to be positive at the
time. I am happy to see that more people are warming to the idea.

wchar (16-bit) is enough. It is even suggested as the best implementation size
by some Unicode coding experts. IBM / Sun / MS can not all be stupid at the
same time... I think it would be smart to interoperate with the 16-bit size
used internally in ICU, Java and MS-Windows. Only on unix/linux would it make
sense to use a 32-bit dchar. The 16 bits are enough for 99% of the
cases/languages. The last 1% can be handled quite fast by cached indexing
techniques in a String object. (This does not make for optimal speed in the 1%
case, but it will more than pay for itself speedwise in 99% of all binary i/o
operations :)

However, that is Walter's main issue, I think. He wants 8-bit chars to be the
default because this will make for the best possible i/o speed with the
current state of affairs. That is what I understood from the last discussion
at least. I am sure he will comment in this thread ;-) and correct me if I am
wrong.

Regards, Roald
Aug 23 2004
next sibling parent Arcane Jill <Arcane_member pathlink.com> writes:
In article <cgct0l$2eut$1 digitaldaemon.com>, Roald Ribe says...

However, that is Walters main issue I think. He wants default 8-bit
chars to be default because this will make for the best possible i/o
speed with the current state of affairs. That is what I understould
from the last discussion at least. I am sure he will comment this
thread ;-) and correct me if I am wrong.
Is that true? But UTF-8 /doesn't/ make the best possible I/O speed. To achieve
that, you'd need to be using the OS-native encoding internally (ISO-8859-1 on
most Linux boxes, WINDOWS-1252 on most Windows boxes). If UTF-8 is not used
natively (which most of the time it isn't), you'd still need transcoding. Fact
is, transcoding from UTF-16 to ISO-8859-1 or WINDOWS-1252 is going to be much
faster than transcoding from UTF-8 to those encodings.

And in any case, the time spent transcoding is almost always going to be
insignificant compared to time spent doing actual I/O. Think console input;
writing to disk; reading from CD-ROM; writing to a socket; .... Transcoding is
really not a bottleneck.

Jill
Aug 23 2004
prev sibling parent Ben Hinkle <bhinkle4 juno.com> writes:
Roald Ribe wrote:

 
 "Ben Hinkle" <bhinkle4 juno.com> wrote in message
 news:cgcoe6$2cq4$1 digitaldaemon.com...
 
 [snip]
 
 There were huge threads about char vs wchar vs dchar a while ago (on the
old
 newsgroup, I think). All kinds of things like what the default should be,
 what the names should be, what a string class could be etc. For example
  http://www.digitalmars.com/d/archives/20361.html
  http://www.digitalmars.com/d/archives/12382.html
 or actually anything at
  http://www.digitalmars.com/d/archives/index.html
 with the word "unicode" in the subject.

 By the way, why if there are N encodings are there N^2-N converters?
 Shouldn't there just be ~2*N to convert to/from one standard like
 dchar[]? IBM's ICU (at http://oss.software.ibm.com/icu/) uses wchar[] as
 the standard.
 Indeed. There were several large discussions about this. Only a few
 Scandinavian/north European readers of this group seemed to be positive at
 the time. I am happy to see that more people are warming to the idea.

 wchar (16-bit) is enough. It is even suggested as the best implementation
 size by some Unicode coding experts. IBM / Sun / MS can not all be stupid at
 the same time... I think it would be smart to interoperate with the 16-bit
 size used internally in ICU, Java and MS-Windows. Only on unix/linux would it
 make sense to use a 32-bit dchar. The 16 bits are enough for 99% of the
 cases/languages. The last 1% can be handled quite fast by cached indexing
 techniques in a String object. (This does not make for optimal speed in the
 1% case, but it will more than pay for itself speedwise in 99% of all binary
 i/o operations :)

 However, that is Walter's main issue, I think. He wants 8-bit chars to be the
 default because this will make for the best possible i/o speed with the
 current state of affairs. That is what I understood from the last discussion
 at least. I am sure he will comment in this thread ;-) and correct me if I am
 wrong.

 Regards, Roald
I didn't mean to suggest D ditch char or wchar or dchar. I'm just saying ICU
uses wchar internally as the intermediate representation when converting
between encodings. That is different than changing D's concept of strings.

I should have added a sentence to my original post saying that I think D
should keep its support of the "big three" char, wchar and dchar (with char[]
as the standard concept of string) and have the library that handles
conversions between unicode and non-unicode (or between non-unicode) encodings
use whatever it wants as the intermediate representation. I think for that
dchar would probably be fine - but I have no experience with that so that is
just a naive guess. Treating the unicode encodings specially seems more
practical than saying all non-standard encodings are treated the same.

-Ben
Aug 23 2004
prev sibling parent J C Calvarese <jcc7 cox.net> writes:
Ben Hinkle wrote:
...
 There were huge threads about char vs wchar vs dchar a while ago (on the old
 newsgroup, I think). All kinds of things like what the default should be,
 what the names should be, what a string class could be etc. For example
  http://www.digitalmars.com/d/archives/20361.html
  http://www.digitalmars.com/d/archives/12382.html
 or actually anything at
  http://www.digitalmars.com/d/archives/index.html
 with the word "unicode" in the subject.
In case anyone is interested, here's a page with links to many Unicode threads
in D newsgroups:

http://www.prowiki.org/wiki4d/wiki.cgi?UnicodeIssues

--
Justin (a/k/a jcc7)
http://jcc_7.tripod.com/d/
Aug 23 2004
prev sibling next sibling parent reply Arcane Jill <Arcane_member pathlink.com> writes:
The case for retaining wchar has been made, and essentially won. Please see
separate thread about ICU (and maybe move this discussion there).

"char", however, is still up for deletion, since all arguments against it still
apply.

Jill
Aug 23 2004
parent Ben Hinkle <bhinkle4 juno.com> writes:
Arcane Jill wrote:

 
 The case for retaining wchar has been made, and essentially won. Please
 see separate thread about ICU (and maybe move this discussion there).
It's won already? What is this - a Mike Tyson fight? :-) Are you referring to the old threads (on the old newsgroup) or this new thread?
 "char", however, is still up for deletion, since all arguments against it
 still apply.
 
 Jill
Once I argued that D's current concept of char should be called uchar or something to indicate the UTF-8 encoding (as opposed to C's char encoding) and that string literals have type uchar[]. I still think it would be interesting to try but it's a tweak on the current system that isn't *that* important.
Aug 23 2004
prev sibling next sibling parent Andy Friesen <andy ikagames.com> writes:
Arcane Jill wrote:

 For reasons of space-efficiency, one might want to store text in memory in
UTF-8
 format. Fair enough. But if char were to be ditched, you could still do that.
 You'd simply use a ubyte[] for that purpose (just as you are now required to do
 if you want to store text in memory in UTF-7). After all - what actually /is/ a
 UTF fragment anyway? What meaning does the UTF-8 fragment 0x83 have in
 isolation? Answer - none. It has meaning only in the context of the bytes
 surrounding it. You don't need a special primitive type just to hold that
 fragment. And of course, there is /nothing/ to stop special string classes from
 being written to provide implementations of such space-efficient abstractions.
I think it might be worth it for the conceptual clarity.

UTF-32 happens to be the character type that's hardest to break. It seems
logical that it be the default. The programmer can still take control and use
another encoding when the problem domain allows for it, but it's an
optimization, not business as usual.

 -- andy
Aug 23 2004
prev sibling parent reply "Walter" <newshound digitalmars.com> writes:
I'm not sure what transcoding means, I assume it's converting from one
string type to another.

I have some experiences with having only one character type - implementing a
C compiler, which is largely string processing using ascii, implementing a
Java compiler and working on the Java vm, which does everything as a wchar,
and implementing a javascript compiler, interpreter, and runtime which does
everything as a dchar.

dchar implementations consume memory at an alarming rate, and if you're
doing server side code, this means you'll soon run into the point where
virtual memory starts thrashing, and performance goes quickly down the
tubes. I had to do a lot of work trying to overcome this problem. dchars are
very convenient to work with, however, and make a great deal of sense as the
temporary common intermediate form of all the conversions. I stress the
temporary, though, as if you keep it around you'll start to notice the
slowdowns it causes.

char implementations can be made really fast and memory efficient. I don't
know if anyone has run any statistics, but the vast bulk of text processing
done by programs is in ASCII.

wchar implementations are, of course, halfway in between. Microsoft went
with wchars back when wchars could handle all of unicode in one word. Now
that it doesn't anymore, that means that all wchar code is going to have to
handle multiword encodings. So it's just as much extra code to write as for
chars.

Java uses wchars, but Java was not designed for high performance (although
years of herculean efforts by a lot of very smart people have brought Java a
long ways in the performance department). My Java compiler written in C++,
which used UTF-8 internally, ran 10x faster than the one written in Java,
which used UTF-16. The speedup wasn't all due to the character encoding, but
it helped.


lot of sense for Microsoft. D isn't just a Win32 language, however, and
linux uses UTF-8. Furthermore, a lot of embedded systems stick with ASCII.

In other words, I think there's a strong case for all three character
encodings in D. I agree that those using code pages will run into trouble
now and then with this, but they will anyway, because code pages are just
endless trouble since which code page some data is in is not inherent in the
data. Your idea of having the compiler detect invalid UTF-8 sequences in
string literals is very helpful here in heading a lot of these issues off at
the pass.

I think it makes perfect sense for a transcoding library to standardize on
dchar and dchar[] as its intermediate form. But don't take away char[] for
people like me that want to use it for other purposes!

I also agree that for single characters, using dchar is the best way to go.
Note that I've been redoing Phobos internals this way. For example,
std.format generates dchars to feed to the consumer function. std.ctype
takes dchar's as arguments.
Aug 23 2004
parent reply Regan Heath <regan netwin.co.nz> writes:
Walter,

I pretty much agree with everything you have said here, I don't think we 
should remove any of the char types. That said I think...

The idea that a cast from one char type to another char type (implicit or 
explicit) should perform the correct UTF transcoding (conversion) is a good
one; my arguments for this are as follows:

  - Q: if you paint a dchar[] as a char[] what do you get?
    A: a possibly (likely?) invalid UTF-8 sequence.

So why do it? I agree we do want to be able to 'paint' one type as 
another, but for a type with a specified encoding I don't think this makes 
any sense, does it? can you think of a reason to do it? Given that you 
could use ubyte, ushort or ulong instead. (types with no specified 
encoding).

  - Doing the transcoding means people writing string handling routines 
need only provide one routine and the result will automatically be 
transcoded to the type they're using.

This is such a great bonus! It will reduce the number of string handling
routines by 2/3, as any routine will have its result converted to the
required type auto-magically.

  - The argument about consistency: a ubyte cast to a ushort does not do
transcoding, so a char to a dchar shouldn't either.

There are two ways of looking at this, on one hand you're saying they 
should all 'paint' as that is consistent. However, on the other I'm saying 
they should all produce a 'valid' result. So my argument here is that when 
you cast you expect a valid result, much like casting a float to an int 
does not just 'paint'.
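
Just to spell the idea out in code (a sketch of the proposal using std.utf as
it stands today, not of how D currently behaves; the commented-out lines show
the proposed meaning):

    import std.utf;

    void sketch(char[] utf8, dchar[] utf32)
    {
        // Today the conversions are explicit calls:
        dchar[] a = toUTF32(utf8);
        char[]  b = toUTF8(utf32);

        // Under the proposal the casts below would perform those same
        // conversions instead of merely repainting the underlying memory,
        // just as casting a float to an int converts rather than repaints.
        // dchar[] c = cast(dchar[]) utf8;   // would mean toUTF32(utf8)
        // char[]  d = cast(char[])  utf32;  // would mean toUTF8(utf32)
    }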

I am interested to hear your opinions on this idea.

Regan.

On Mon, 23 Aug 2004 12:59:28 -0700, Walter <newshound digitalmars.com> 
wrote:
 I'm not sure what transcoding means, I assume it's converting from one
 string type to another.

 I have some experiences with having only one character type - 
 implementing a
 C compiler, which is largely string processing using ascii, implementing 
 a
 Java compiler and working on the Java vm, which does everything as a 
 wchar,
 and implementing a javascript compiler, interpreter, and runtime which 
 does
 everything as a dchar.

 dchar implementations consume memory at an alarming rate, and if you're
 doing server side code, this means you'll soon run into the point where
 virtual memory starts thrashing, and performance goes quickly down the
 tubes. I had to do a lot of work trying to overcome this problem. dchars 
 are
 very convenient to work with, however, and make a great deal of sense as 
 the
 temporary common intermediate form of all the conversions. I stress the
 temporary, though, as if you keep it around you'll start to notice the
 slowdowns it causes.

 char implementations can be made really fast and memory efficient. I 
 don't
 know if anyone has run any statistics, but the vast bulk of text 
 processing
 done by programs is in ASCII.

 wchar implementations are, of course, halfway in between. Microsoft went
 with wchars back when wchars could handle all of unicode in one word. Now
 that it doesn't anymore, that means that all wchar code is going to have 
 to
 handle multiword encodings. So it's just as much extra code to write as 
 for
 chars.

 Java uses wchars, but Java was not designed for high performance 
 (although
 years of herculean efforts by a lot of very smart people have brought 
 Java a
 long ways in the performance department). My Java compiler written in 
 C++,
 which used UTF-8 internally, ran 10x faster than the one written in Java,
 which used UTF-16. The speedup wasn't all due to the character encoding, 
 but
 it helped.


 a
 lot of sense for Microsoft. D isn't just a Win32 language, however, and
 linux uses UTF-8. Furthermore, a lot of embedded systems stick with 
 ASCII.

 In other words, I think there's a strong case for all three character
 encodings in D. I agree that those using code pages will run into trouble
 now and then with this, but they will anyway, because code pages are just
 endless trouble since which code page some data is in is not inherent in 
 the
 data. Your idea of having the compiler detect invalid UTF-8 sequences in
 string literals is very helpful here in heading a lot of these issues 
 off at
 the pass.

 I think it makes perfect sense for a transcoding library to standardize 
 on
 dchar and dchar[] as its intermediate form. But don't take away char[] 
 for
 people like me that want to use it for other purposes!

 I also agree that for single characters, using dchar is the best way to 
 go.
 Note that I've been redoing Phobos internals this way. For example,
 std.format generates dchars to feed to the consumer function. std.ctype
 takes dchar's as arguments.
-- Using M2, Opera's revolutionary e-mail client: http://www.opera.com/m2/
Aug 23 2004
parent reply "Walter" <newshound digitalmars.com> writes:
"Regan Heath" <regan netwin.co.nz> wrote in message
news:opsc7hhm0a5a2sq9 digitalmars.com...
 Walter,

 I pretty much agree with everything you have said here, I don't think we
 should remove any of the char types. That said I think...

 The idea that a cast from one char type to another char type (implicit or
 explicit) should perform the correct UTF transcoding(conversion) is a good
 one, my arguments for this are as follows:

   - Q: if you paint a dchar[] as a char[] what do you get?
     A: a possibly (likely?) invalid UTF-8 sequence.

 So why do it? I agree we do want to be able to 'paint' one type as
 another, but for a type with a specified encoding I don't think this makes
 any sense, does it? can you think of a reason to do it? Given that you
 could use ubyte, ushort or ulong instead. (types with no specified
 encoding).

   - Doing the transcoding means people writing string handling routines
 need only provide one routine and the result will automatically be
 transcoded to the type they're using.

 This is such a great bonus! it will reduce the number of string handling
 routines by 2/3 as any routine will have its result converted to the
 required type auto-magically.

   - The argument about consistency: a ubyte cast to a ushort does not do
 transcoding, so a char to a dchar shouldn't either.

 There are two ways of looking at this, on one hand you're saying they
 should all 'paint' as that is consistent. However, on the other I'm saying
 they should all produce a 'valid' result. So my argument here is that when
 you cast you expect a valid result, much like casting a float to an int
 does not just 'paint'.

 I am interested to hear your opinions on this idea.
I think your idea has a lot of merit. I'm certainly leaning that way.
Aug 23 2004
parent reply "antiAlias" <fu bar.com> writes:
"Walter" <newshound digitalmars.com> wrote in message ...
 "Regan" wrote:
 There are two ways of looking at this, on one hand you're saying they
 should all 'paint' as that is consistent. However, on the other I'm
saying
 they should all produce a 'valid' result. So my argument here is that
when
 you cast you expect a valid result, much like casting a float to an int
 does not just 'paint'.

 I am interested to hear your opinions on this idea.
I think your idea has a lot of merit. I'm certainly leaning that way.
On the one hand, this would be well served by an opCast() and/or opCast_r() on
the primitive types; just the kind of thing suggested in a related thread
(which talked about overloadable methods for primitive types).

On the other hand, we're talking transcoding here. Are you gonna' limit this
to UTF-8 only? Then, since the source and destination will typically be of
different sizes, do you then force all casts between these types to have the
destination be an array reference rather than an instance? One that is always
allocated on the fly?

Then there's performance. It's entirely possible to write transcoders that are
minimally between 5 and 30 times faster than the std.utf ones. Some people
actually do care about efficiency. I'm one of them.

If you do implement overloadable primitive-methods (like properties) then,
will you allow a programmer to override them? So they can make the opCast() do
something more specific to their own specific task? That seems like a lot to
build into the core of a language.

Personally, I think it's 'borderline' to have so many data types available for
similar things. If there were an "alias wchar[] string", and the core Object
supported that via "string toString()", and the IUC library were adopted, then
I think some of the general confusion would perhaps melt somewhat. In many
respects, too many choices is simply a BadThing (TM). Especially when there's
precious little solid guidance to help. That guidance might come from a decent
library that indicates how the types are used, and uses one obvious type
(string?) consistently. Besides, if IUC were adopted, less people would have
to worry about the distinction anyway.

Believe me when I say that Mango would dearly love to go dchar[] only.
Actually, it probably will at the higher levels because it makes life simple
for everyone. Oh, and I've been accused many times of being an efficiency
fanatic, especially when it comes to servers. But there's always a tradeoff
somewhere. Here, the tradeoff is simplicity-of-use versus quantities of RAM.
Which one changes dramatically over time? Hmmmm ... let me see now ... 64bit-OS
for desktops just around the corner?

Even on an embedded device I'd probably go "dchar only" regarding I18N. Simply
because the quantity of text processed on such devices is very limited. Before
anyone shoots me over this one, I regularly write code for devices with just
4KB RAM ~ still use 16bit chars there when dealing with XML input.

So what am I saying here? Available RAM will always increase in great leaps.
Contemplating that the latter should dictate ease-of-use within D is a serious
breach of logic, IMO. Ease of use, and above all, /consistency/ should be
paramount; if you have the programmer in mind.
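
The shape of that suggestion, as a sketch (the greeting() function is invented
for illustration, and the code assumes D as it stands, where string literals
convert to plain wchar[]):

    // One library-blessed alias, used consistently instead of juggling
    // char[]/wchar[]/dchar[] at the API level.
    alias wchar[] string;

    string greeting()
    {
        return "hello"w;   // a wchar[] literal, i.e. a 'string' under this alias
    }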
Aug 23 2004
next sibling parent reply Regan Heath <regan netwin.co.nz> writes:
On Mon, 23 Aug 2004 18:29:54 -0700, antiAlias <fu bar.com> wrote:
 "Walter" <newshound digitalmars.com> wrote in message ...
 "Regan" wrote:
 There are two ways of looking at this, on one hand you're saying they
 should all 'paint' as that is consistent. However, on the other I'm
saying
 they should all produce a 'valid' result. So my argument here is that
when
 you cast you expect a valid result, much like casting a float to an 
int
 does not just 'paint'.

 I am interested to hear your opinions on this idea.
I think your idea has a lot of merit. I'm certainly leaning that way.
On the one hand, this would be well served by an opCast() and/or opCast_r() on the primitive types; just the kind of thing suggested in a related thread (which talked about overloadable methods for primitive types).
True. However, what else should the opCast for char[] to dchar[] do except transcode it? What about opCast for int to dchar.. it seems to me there is only 1 choice about what to do, anything else would be operator abuse.
 On the other hand, we're talking transcoding here. Are you gonna' limit 
 this
 to UTF-8 only?
I hope not, I am hoping to see:

        | UTF-8 | UTF-16 | UTF-32
--------------------------------
UTF-8   |   -   |   +    |   +
UTF-16  |   +   |   -    |   +
UTF-32  |   +   |   +    |   -

(+ indicates transcoding occurs)
 Then, since the source and destination will typically be of
 different sizes, do you then force all casts between these types to have 
 the
 destination be an array reference rather than an instance?
I don't think transcoding makes any sense unless you're talking about a
'string' (i.e. char[], wchar[], or dchar[]) as opposed to a UTF-x fragment
(i.e. char, wchar).

As AJ has frequently pointed out, a char or wchar does not equal one
"character" in some cases. Given that, and assuming 'c' below is "not a whole
character", I cannot see how:

  dchar d;
  char c = ?;        // not a whole character

  d = c;             // implicit
  d = cast(dchar)c;  // explicit

would ever be able to transcode. So in this instance this should either be:
1. a compile error
2. a runtime error
3. a simple copy of the value, creating an invalid? (will it be AJ?) UTF-32
character.

Option 3 has the possibility to create an invalid UTF-x code point/fragment
(I don't know the right term).
 One that is
 always allocated on the fly?
No.. bad idea methinks.
 Then there's performance. It's entirely
 possible to write transcoders that are minimally between 5 and 30 times
 faster than the std.utf ones. Some people actually do care about 
 efficiency.
 I'm one of them.
Same here. I believe we all have the same goal, we just have different ideas about how best to get there.
 If you do implement overloadable primitive-methods (like properties) 
 then,
 will you allow a programmer to override them? So they can make the 
 opCast()
 do something more specific to their own specific task?
Can you think of a task you'd want to put one of these opCast methods to? (and give an example).
 That's seems like a lot to build into the core of a language.
The opCast overloading? or the original idea?

The only reservation I had about the original idea was that it seemed like "a
lot to build into the core of a language" at first, and then I realised if you
cast from one char type to another, the _only_ sensible thing to do is
transcode; anything else is a bug/error. In addition, this sort of cast (char
type to char type) doesn't occur a lot, but when it does you end up calling
toUTFxx manually, so why not just have it happen?
 Personally, I think it's 'borderline' to have so many data types 
 available for similar things.
Like byte, short, int and long?
 If there were an "alias wchar[] string", and the core
 Object supported that via "string toString()", and the IUC library were
 adopted, then I think some of the general confusion would perhaps melt
 somewhat.
Maybe. However, toString for an 'int' (for example) only needs a char[] to
represent all its possible values (with one character equalling one 'char'),
so why use wchar or dchar for its toString? Conversely, something else might
find it more efficient to return its toString as a dchar[]. You might argue
either example is rare, certainly the latter is rarer than the former, but the
efficiency gnome in my head won't shut up about this..

If we simply alias wchar[] as 'string' then these examples will need manual
conversion to 'string', which involves calling toUTF16 or .. what would you
call that conversion function such that it was obvious? toString? :)
 In many respects, too many choices is simply a BadThing (TM).
I think too many very similar but slightly different choices/methods is a bad thing, however, I don't see char, wchar and dchar as that. I think they are more like byte, short and int, different types for different uses, castable to/from each other where required. The choice they give you is being able to choose the right type for the job.
 Especially
 when there's precious little solid guidance to help. That guidance might
 come from a decent library that indicates how the types are used, and 
 uses
 one obvious type (string?) consistently.
I think the confusion comes from being used to only one string type: for C/C++
programmers it's typically an 8-bit unsigned type usually containing ASCII,
for Java a 16-bit signed/unsigned? type containing UTF-16?

Internationalisation is a new topic for me and many others (I suspect), even
for Walter(?). Having 3 types requiring manual transcoding between them _is_ a
pain.
 Besides, if IUC were adopted, less
 people would have to worry about the distinction anyway.
The same could be said for the implicit transcoding from one type to the other.
 Believe me when I say that Mango would dearly love to go dchar[] only.
 Actually, it probably will at the higher levels because it makes life 
 simple
 for everyone. Oh, and I've been accused many times of being an efficiency
 fanatic, especially when it comes to servers. But there's always a 
 tradeoff
 somewhere. Here, the tradeoff is simplicity-of-use versus quantities of 
 RAM.
 Which one changes dramatically over time? Hmmmm ... let me see now ...
 64bit-OS for desktops just around corner?
What you really mean is you'd dearly love to not have to worry about the
differences between the 3 types; implicit transcoding will give you that.
Furthermore, it's simplicity without sacrificing RAM.

The thing I like about implicit transcoding is that if, for example, you have
a lot of string data stored in memory, you can store it in the most efficient
format for that data, which may be char, wchar or dchar. If you then want to
call a function which takes a dchar on some/all of your data, it will be
implicitly converted to dchar (if not already) for the function. If at a later
date you decide to change the format of the stored data, you don't have to
find every call to a function and insert/remove toUTFxx calls. To me, this
sounds like the most efficient way to handle it.
 Even on an embedded device I'd probably go "dchar only" regarding I18N.
 Simply because the quantity of text processed on such devices is very
 limited. Before anyone shoots me over this one, I regularly write code 
 for
 devices with just 4KB RAM ~ still use 16bit chars there when dealing with
 XML input.
Were you to use implicit transcoding you could store the data in memory in
UTF-8, or UTF-16, then only transcode to UTF-32 when required; this would be
more efficient.
 So what am I saying here? Available RAM will always increase in great 
 leaps.
Probably.
 Contemplating that the latter should dictate ease-of-use within D is a
 serious breach of logic, IMO.
"the latter"?
 Ease of use, and above all, /consistency/
 should be paramount; if you have the programmer in mind.
This last statement applies to implicit transcoding perfectly. It's easy and
consistent.

Regan

--
Using M2, Opera's revolutionary e-mail client: http://www.opera.com/m2/
Aug 23 2004
next sibling parent reply "antiAlias" <fu bar.com> writes:
"Regan Heath" <regan netwin.co.nz> wrote in message
news:opsc7u56zf5a2sq9 digitalmars.com...
 On Mon, 23 Aug 2004 18:29:54 -0700, antiAlias <fu bar.com> wrote:
 "Walter" <newshound digitalmars.com> wrote in message ...
 "Regan" wrote:
 There are two ways of looking at this, on one hand you're saying they
 should all 'paint' as that is consistent. However, on the other I'm
saying
 they should all produce a 'valid' result. So my argument here is that
when
 you cast you expect a valid result, much like casting a float to an
int
 does not just 'paint'.

 I am interested to hear your opinions on this idea.
I think your idea has a lot of merit. I'm certainly leaning that way.
On the one hand, this would be well served by an opCast() and/or opCast_r() on the primitive types; just the kind of thing suggested in a related thread (which talked about overloadable methods for primitive types).
True. However, what else should the opCast for char[] to dchar[] do except transcode it? What about opCast for int to dchar.. it seems to me there is only 1 choice about what to do, anything else would be operator abuse.
 On the other hand, we're talking transcoding here. Are you gonna' limit
 this
 to UTF-8 only?
 I hope not, I am hoping to see:

         | UTF-8 | UTF-16 | UTF-32
 --------------------------------
 UTF-8   |   -   |   +    |   +
 UTF-16  |   +   |   -    |   +
 UTF-32  |   +   |   +    |   -

 (+ indicates transcoding occurs)
========================
And what happens when just one additional byte-oriented encoding is
introduced? Perhaps UTF-7? Perhaps EBCDIC? The basic premise is flawed because
there's no flexibility.
 Then, since the source and destination will typically be of
 different sizes, do you then force all casts between these types to have
 the
 destination be an array reference rather than an instance?
I don't think transcoding makes any sense unless you're talking about a 'string' (i.e. char[], wchar[], or dchar[]) as opposed to a UTF-x fragment (i.e. char, wchar)
========================
We /are/ talking about arrays. Perhaps if that sentence had ended "array
reference rather than an *array* instance?", it might have been more clear?
The point being made is that you would not be able to do anything like this:

char[15] dst;
dchar[10] src;

dst = cast(char[]) src;

because there's no ability via a cast() to indicate how many items from src
were converted, and how many items in dst were populated. You are forced into
this kind of thing:

char[] dst;
dchar[10] src;

dst = cast(char[]) src;

You see the distinction? It may be subtle to some, but it's a glaring
imbalance to others. The lValue must always be a reference because it's gonna'
be allocated dynamically to ensure all of the rValue will fit. In the end,
it's just far better to use functions/methods that provide the feedback
required so you can actually control what's going on (or such that a library
function can). That way, you're not restricted in the same fashion. We don't
need more asymmetry in D, and this just reeks of poor design, IMO.

To drive this home, consider the decoding version (rather than the encoding
above):

char[15] src;
dchar[] dst;

dst = cast(dchar[]) src;

What happens when there's a partial character left undecoded at the end of
'src'? There is nothing here to tell you that you've got a dangly bit left at
the end of the source-buffer. It's gone. Poof! Any further decoding from the
same file/socket/whatever is henceforth trashed, because the ball has been
both dropped and buried. End of story.
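
To be explicit about the kind of feedback meant here, this is the general
shape of a streaming decoder that a cast() can never emulate (hypothetical;
not Mango's actual interface):

    // A decoder that reports how much of the source it consumed, so a partial
    // trailing sequence can be carried over and prepended to the next buffer.
    interface StreamDecoder
    {
        // Decodes complete sequences from src into dst and returns the number
        // of bytes consumed; src[consumed .. $] is the undecoded tail.
        size_t decode(ubyte[] src, out dchar[] dst);
    }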
 Having 3 types requiring manual transcoding between them _is_ a pain.
========================
It certainly is. That's why other languages try to avoid it at all costs.
Having it done "generously" by the compiler is also a pain, inflexible, and
likely expensive. There are many things a programmer should take
responsibility for; transcoding comes under that umbrella because (a) there
can be subtle complexity involved and (b) it is relatively expensive to churn
through text and convert it; particularly so with the Phobos utf-8 code.

What you appear to be suggesting is that this kind of thing should happen
silently whilst one nonchalantly passes arguments around between methods.
That's insane, so I hope that's not what you're advocating. Java, for example,
does that at one specific layer (I/O), but you're apparently suggesting doing
it at any old place! And several times over, just in case it wasn't good
enough the first time :-)

Sorry man. This is inane. D is /not/ a scripting language; instead it's
supposed to be a systems language.
 Besides, if IUC were adopted, less
 people would have to worry about the distinction anyway.
The same could be said for the implicit transcoding from one type to the other.
========================
That's pretty short-sighted IMO. You appear to be saying that implicit
transcoding would take the place of ICU; terribly misleading. Transcoding is
just a very small part of that package. Please try to reread the comment as
"most people would be shielded completely by the library functions, therefore
there's far fewer scenarios where they'd ever have a need to drop into
anything else". This would be a very GoodThing for D users. Far better to have
a good library to take care of /all/ this crap than have the D language do
some partial-conversions on the fly, and then quit because it doesn't know how
to provide any further functionality.

This is the classic core-language-versus-library-functionality bitchfest all
over again. Building all this into a cast()? Hey! Let's make Walter's Regex
class part of the compiler too; and make it do UTF-8 decoding /while/ it's
searching, since you'll be able to pass it a dchar[] that will be generously
converted to the accepted char[] for you "on-the-fly". Excuse me for jesting,
but perhaps the Bidi text algorithms plus date/numeric formatting & parsing
will all fit into a single operator also? That's kind of what's being
suggested. I believe there's a serious underestimate of the task being
discussed.
 Believe me when I say that Mango would dearly love to go dchar[] only.
 Actually, it probably will at the higher levels because it makes life
 simple
 for everyone. Oh, and I've been accused many times of being an
efficiency
 fanatic, especially when it comes to servers. But there's always a
 tradeoff
 somewhere. Here, the tradeoff is simplicity-of-use versus quantities of
 RAM.
 Which one changes dramatically over time? Hmmmm ... let me see now ...
 64bit-OS for desktops just around corner?
 What you really mean is you'd dearly love to not have to worry about the differences between the 3 types, implicit transcoding will give you that. Furthermore it's simplicity without sacrificing RAM.
========================
Ahh Thanks. I didn't realize that's what I "really meant". Wait a minute ...
 Even on an embedded device I'd probably go "dchar only" regarding I18N.
 Simply because the quantity of text processed on such devices is very
 limited. Before anyone shoots me over this one, I regularly write code
 for
 devices with just 4KB RAM ~ still use 16bit chars there when dealing
with
 XML input.
 Were you to use implicit transcoding you could store the data in memory in UTF-8, or UTF-16 then only transcode to UTF-32 when required, this would be more efficient.
========================
That's a rather large assumption, don't you think? More efficient? In which
particular way? Is memory usage or CPU usage more important in /my/ particular
applications? Please either refrain, or commit to rewriting all my old code
more efficiently for me ... for free <g>
Aug 23 2004
next sibling parent reply Arcane Jill <Arcane_member pathlink.com> writes:
In article <cgelmc$dqu$1 digitaldaemon.com>, antiAlias says...

And what happens when just one additional byte-oriented encoding is
introduced? Perhaps UTF-7? Perhaps EBCDIC? The basic premise is flawed
because there's no flexibility.
Yes. But I don't see D introducing a utf7char or a utfEbcdicChar type any time soon. There's no flexibility because there are no plans to extend it. I can live with ONE native string type. I can also live with 3 mutually interoperable native string types. But neither of these schemes precludes anyone else from writing/using other string types.
We /are/ talking about arrays. Perhaps if that sentence had ended "array
reference rather than an *array* instance?", it might have been more clear?
The point being made is that you would not be able to do anything like this:

char[15] dst;
dchar[10] src;

dst = cast(char[]) src;
But that's because you already can't do: and that's not going to change, so that's the end of that. Regan isn't suggesting anything beyond a bit of syntactic sugar here. No-one (so far) has suggested that dynamic arrays be converted to static arrays, or that auto-casting should write into a user-supplied fixed-size buffer. D does not prohibit you from doing the things you have suggested. It's just that you'd have to do them explicitly. /Implicit/ conversion is only being suggested for the three D string types.
because there's no ability via a cast() to indicate how many items from src
were converted, and how many items in dst were populated. You are forced
into this kind of thing:

char[] dst;
dchar[10] src;

dst = cast(char[]) src;
Actually it would be: if I had my way.
You see the distinction? It may be subtle to some, but it's a glaring
imbalance to others. The lValue must always be a reference because it's
gonna' be allocated dynamically to ensure all of the rValue will fit. In the
end, it's just far better to use functions/methods that provide the feedback
required so you can actually control what's going on (or such that a library
function can). That way, you're not restricted in the same fashion. We don't
need more asymmetry in D, and this just reeks of poor design, IMO.
I don't agree with this claim. It is being suggested that:

  dst = cast(char[]) src;

be equivalent to:

  dst = toUTF8(src);

Nobody loses anything by this. All that happens is that things work more
smoothly. If you want to call a different function to do your transcoding then
there's nothing to stop you. I assume that the use you have in mind is
stream-internal transcoding via buffers. You can still do that. The above
won't stop you.
To drive this home, consider the decoding version (rather than the encoding
above):

char[15] src;
dchar[] dst;

dst = cast(dchar[]) src;

What happens when there's a partial character left undecoded at the end of
'src'?
*ALL* that is being suggested is that, given the above declarations:

dst = src;

would be equivalent to:

dst = toUTF32(src);

Nothing more. So the answer to your question is: it would throw an exception.
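Here is a minimal sketch of that failure mode (assuming the lowering goes through std.utf.toUTF32(); the exact exception class thrown by std.utf isn't important here):

import std.utf;

void main()
{
    char[3] src;
    src[0] = 'h';
    src[1] = 'i';
    src[2] = cast(char) 0xC3;   // lead byte of a two-byte sequence, with no trailing byte

    dchar[] dst;
    dst = toUTF32(src);         // under the suggestion, "dst = src;" would mean this,
                                // and std.utf rejects the dangling fragment by throwing
}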
There's nothing here to tell you that you've got a dangly bit left at
the end of the source-buffer. It's gone. Poof! Any further decoding from the
same file/socket/whatever is henceforth trashed, because the ball has been
both dropped and buried. End of story.
This has got nothing to do with either transcoding or streams. This is not being suggested as a general transcoding mechanism, merely as an internal conversion between D's three string types. /General/ transcoding will have to work for all supported encodings, and won't be relying on the std.utf functions. Files and sockets won't use the std.utf functions either because they will employ the general transcoding mechanism. Your transcoding ideas are excellent, but they are not relevant to this.
There are many things a programmer should take
responsibility for; transcoding comes under that umbrella because (a) there
can be subtle complexity involved and (b) it is relatively expensive to
churn through text and convert it; particularly so with the Phobos utf-8
code.
I'm surprised at you. I would have said:
There are some things a programmer should /not/ have to take
responsibility for; transcoding comes under that umbrella because (a) there
can be subtle complexity involved and (b) it is relatively expensive to
churn through text and convert it;
A library is the appropriate place for this stuff. That could be Phobos, Mango, whatever. Here's my take on it:

(1) Implicit conversion between native D strings should be handled by the compiler in whatever way it sees fit. If it chooses to invoke a library function then so be it. (And it /should/ be Phobos, because Phobos is shipped with D). Note that this is exactly the same situation that the future "cent" type will incur. If I divide one "cent" by another, I would expect the D compiler to translate this into a function call within Phobos. Why is that such a crime?

(2) Fully featured transcoding should be done by ICU, as this is a high performance mature product, and all the code is already written.

(3) Some adaptation of the ICU wrapper may be necessary to integrate this more neatly with "the D way". But I'm confident we can do this in a way which is not biased toward Phobos.

Phobos UTF-8 code could be faster, I grant you. But perhaps it will be in the next release. We're only talking about a tiny number of functions here, after all.
What you appear to be suggesting is that this kind of thing should happen
silently whilst one nonchalantly passes arguments around between methods.
I would vote for that, yes. But only as a second choice. My /first/ choice would be to have D standardize on one single kind of string, the wchar[].
That's insane, so I hope that's not what you're advocating. Java, for
example, does that at one specific layer (I/O), but you're apparently
suggesting doing it at any old place! And several times over, just in case
it wasn't good enough the first time :-)  Sorry man. This is inane. D is
/not/ a scripting language; instead it's supposed to be a systems language.
How about we all get on the same side here? Like I said, my /first/ choice would be to have D standardize on one single kind of string, the wchar[]. But we don't always get our first choice. Walter doesn't like this idea. I suggest you add your voice to mine and help try to persuade him that ONE kind of string is the way to go.

BUT - if we fail in convincing him - the second choice is better than the status quo. Why? Because if we fail to convince Walter that multiple string types are bad, then all that conversion is going to happen ANYWAY. It will happen because Object.toString() will (often) return UTF-8; because string literals will generate UTF-8; and because the functions in ICU will require and return UTF-16. The ONLY difference between this suggestion and the status quo is that we won't have to write "cast(char[])" and "cast(wchar[])" all over the place.

You're doing a lot of arguing /against/. What are you /for/? Arguing against a suggestion is usually interpreted as a vote for the status quo. Are you really doing that?
That's pretty short-sighted IMO. You appear to be saying that implicit
transcoding would take the place of ICU; terribly misleading.
Of course he's not.
Excuse me for jesting, but perhaps the Bidi text algorithms plus
date/numeric formatting & parsing will all fit into a single operator also?
That's kind of what's being suggested.
Nothing is being suggested except syntactic sugar. My first choice vote still goes to ditching the char. With char gone, it would be natural for wchar[] to become the "standard" D string, which fits in well with ICU. There would be no need for implicit (or even explicit) cast-conversion between wchar[] and dchar[], because dchar[] would be a specialist thing, only used by speed-efficiency fanatics, while wchar[] would be business as usual. But if I can't have my first choice, I'll vote for implicit conversion as my second choice. Arcane Jill
Aug 24 2004
next sibling parent reply "antiAlias" <fu bar.com> writes:
"Arcane Jill" <Arcane_member pathlink.com>
 There's nothing here to tell you that you've got a dangly bit left at
 the end of the source-buffer. It's gone. Poof! Any further decoding from the
 same file/socket/whatever is henceforth trashed, because the ball has been
 both dropped and buried. End of story.
This has got nothing to do with either transcoding or streams. This is not
being suggested as a general transcoding mechanism, merely as an internal
conversion between D's three string types. /General/ transcoding will have
to work for all supported encodings, and won't be relying on the std.utf
functions. Files and sockets won't use the std.utf functions either because
they will employ the general transcoding mechanism.
Oh. I was under the impression the 'solution' being tendered was a jack-of-all trades. If we're simply talking about converting /static/ strings between different representations, then, cool. It's done at compile-time.
 I'm surprised at you. I would have said:

There are some things a programmer should /not/ have to take
responsibility for; transcoding comes under that umbrella because (a) there
can be subtle complexity involved and (b) it is relatively expensive to
churn through text and convert it;

A library is the appropriate place for this stuff. That could be Phobos,
Mango, whatever. Here's my take on it:
We agree. My point was that expensive operations such as these should perhaps not be hidden "under the covers"; but explicitly handled by a library call instead. However, I clearly have the wrong impression about the extent of what this implicit-conversion is attempting.
 How about we all get on the same side here?
I really think we are, Jill. My concerns are about trying to build partial versions of ICU functionality into the D language itself, rather than let that extensive and capable library take care of it. But apparently that's not what's happening here. My mistake.
 You're doing a lot of arguing /against/. What are you /for/? Arguing
 against a suggestion is usually interpreted as a vote for the status quo.
 Are you really doing that?
Nope. There appeared to be some consensus-building that transcoding could all be handled via a cast() operator. I felt it worth pointing out where and why that's not a valid approach.

The other aspect involved here is that of string-concatenation. D cannot have more than one return type for toString() as you know. It's fixed at char[]. If string concatenation uses the toString() method to retrieve its components (as is being proposed elsewhere), then there will be multiple, redundant, implicit conversions going on where the string really wanted to be dchar[] in the first place. That is:

A a; // class instances ...
B b;
C c;

dchar[] message = c ~ b ~ a;

Under the proposed "implicit" scheme, if each toString() of A, B, and C wish to return dchar[], then each concatenation causes an implicit conversion/encoding from each dchar[] to char[] (for the toString() return). Then another full conversion/decoding is performed back to the dchar[] assignment once each has been concatenated. This is like the Wintel 'plot' for selling more cpu's :-)

Doing this manually, one would forego the toString() altogether:

dchar[] message = c.getString() ~ b.getString() ~ a.getString();

... where getString() is a programmer-specific idiom to return the (natural) dchar[] for these classes, and we carefully avoided all those darned implicit-conversions. However, which approach do you think people will use? My guess is that D may become bogged down in conversion hell over such things.

So, to answer your question: What I'm /for/ is not covering up these types of issues with blanket-style implicit conversions. Something more constructive (and with a little more forethought) needs to be done.
Aug 24 2004
next sibling parent reply Regan Heath <regan netwin.co.nz> writes:
On Tue, 24 Aug 2004 11:36:45 -0700, antiAlias <fu bar.com> wrote:

<snip>

 Oh. I was under the impression the 'solution' being tendered was a
 jack-of-all trades. If we're simply talking about converting /static/
 strings between different representations, then, cool. It's done at
 compile-time.
Nope. We are talking about implicit conversion to/from all 3 forms of UTF-x encoded base types where required. Does this make any sense whatsoever currently?

char[] c = ?; // some valid utf-8 sequence
dchar[] d;

d = cast(dchar[]) c;

I believe the answer is "no", reasoning: d will now (possibly) contain an invalid utf-32 sequence. The only sensible thing to do is transcode. If you want to 'paint' a char type as a smaller type to get at its bits/bytes or shorts (snicker) you can and should use ubyte or ushort, _not_ another char type. char types imply a default encoding, so painting one to another is illegal; painting something to/from a char type is legal and useful.

<snip>
 We agree. My point was that expensive operations such as these should
 perhaps not be hidden "under the covers"; but explicitly handled by a
 library call instead.
On principle I totally agree with this statement. However, in this case I am simply suggesting implicit conversion where you would already have to write toUTFxx(); the idea does not _add_ any expense, only convenience. Yes, an unaware programmer might not realise it's transcoding, and they might make some inefficient choices, but that same programmer will probably also do this:

char[] c;
dchar[] d;

c = cast(char[]) d;

and create invalid utf-x sequences (some of the time), and a bug.
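A short sketch of that difference (U+20AC is just a convenient non-ASCII example; painting goes through ubyte[], transcoding through the existing std.utf calls):

import std.utf;

void main()
{
    dchar[] d = "\u20AC"d;          // one code point, U+20AC (EURO SIGN)

    // painting: reinterpret the same bytes; fine for ubyte[], because no
    // encoding is implied by that type
    ubyte[] raw = cast(ubyte[]) d;

    // painting to another char type claims those bytes are UTF-8, which
    // they are not
    char[] bad = cast(char[]) d;

    // transcoding produces the correct 3-byte UTF-8 sequence
    char[] good = toUTF8(d);
    validate(good);                 // passes
    // validate(bad);               // would be rejected as invalid UTF-8
}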
 However, I clearly have the wrong impression about the
 extent of what this implicit-conversion is attempting.
No.. I _think_ you understood it fine. I _think_ we just disagree about what is efficient and what is not.
 How about we all get on the same side here?
I really think we are, Jill. My concerns are about trying to build partial versions of ICU functionality into the D language itself, rather than let that extensive and capable library take care of it. But apparently that's not what's happening here. My mistake.
Err... it is, a small part, the part that already exists in std.utf, conversion from utf-x to utf-<another x>.
 You're doing a lot of arguing /against/. What are you /for/? Arguing
 against a suggestion is usually interpreted as a vote for the status quo.
 Are you really doing that?
Nope. There appeared to be some consensus-building that transcoding could all be handled via a cast() operator. I felt it worth pointing out where and why that's not a valid approach.
Actually I want it to transcode implicitly, eg.

char[] c;
dchar[] d;

d = c; // transcodes

Can you enumerate your reasons why this is 'not a valid approach'? (I could search the previous posts and try to do that for you, but I might misinterpret what you meant.)
 The other aspect involved here is that of string-concatenation. D cannot
 have more than one return type for toString() as you know.
True.
 It's fixed at
 char[].
Is it?!
I didn't realise that, so this is invalid?

class A {
   dchar[] toString() {}
}
 If string concatenation uses the toString() method to retrieve its
 components (as is being proposed elsewhere), then there will be multiple,
 redundant, implicit conversions going on where the string really wanted to
 be dchar[] in the first place. That is:

 A a; // class instances ...
 B b;
 C c;

 dchar[] message = c ~ b ~ a;

 Under the proposed "implicit" scheme, if each toString() of A, B, and C
 wish to return dchar[], then each concatenation causes an implicit
 conversion/encoding from each dchar[] to char[] (for the toString()
 return).
Assuming toString returned char[] and not dchar[], yes. Assuming that trying to return dchar without transcoding will create invalid UTF-8 sequences. If implicit transcoding were implemented then you should be able to define the return value of your classes toString to be any of char[] wchar[] or dchar[] as it will implicitly transcode to the type it requires. Basically you use the most applicable type, for example AJ's Int class would use char[] (unless AJ has another reason not to) as all the string data required is ASCII and fits best in UTF-8.
 Then another full conversion/decoding is performed back to the dchar[]
 assignment once each has been concatenated. This is like the Wintel 
 'plot' for selling more cpu's :-)
Not if toString can return dchar[] and all 3 classes do that.
 Doing this manually, one would forego the toString() altogether:

 dchar[] message = c.getString() ~ b.getString() ~ a.getString();

 ... where getString() is a programmer-specific idiom to return the (natural)
 dchar[] for these classes, and we carefully avoided all those darned
 implicit-conversions. However, which approach do you think people will use?
Taking into account what I have said above.. the easy one, i.e. implicit transcoding.
 My guess is that D may become bogged down in conversion hell over such
 things.

 So, to answer your question:
 What I'm /for/ is not covering up these types of issues with blanket-style
 implicit conversions. Something more constructive (and with a little more
 forethought) needs to be done.
I believe implicit conversion to be constructive; it stops bugs and makes string handling much easier. What we are doing here _is_ the forethought; after all, nothing has been implemented yet. Regan -- Using M2, Opera's revolutionary e-mail client: http://www.opera.com/m2/
Aug 24 2004
next sibling parent reply "antiAlias" <fu bar.com> writes:
"Regan Heath" <regan netwin.co.nz> wrote
 Is it?!
 I didn't realise that, so this is invalid?

 class A {
    dchar[] toString() {}
 }
Yes. It most certainly is, Regan. I (incorrectly) assumed you understood that. Sorry. There have been a number of posts that note this, and its implications.
Aug 24 2004
parent reply Regan Heath <regan netwin.co.nz> writes:
On Tue, 24 Aug 2004 19:45:07 -0700, antiAlias <fu bar.com> wrote:

 "Regan Heath" <regan netwin.co.nz> wrote
 Is it?!
 I didn't realise that, so this is invalid?

 class A {
    dchar[] toString() {}
 }
Yes. It most certainly is, Regan. I (incorrectly) assumed you understood that.
Either:
a. I am overly sensitive/insecure
b. You didn't realise
c. You're intentionally trying to belittle me because ...

"understood" is not the right word, "knew" is a better choice.. "understood" implies I knew but didn't understand. That isn't the case. (this time)
 Sorry. There have been a number of posts that note this, and its
 implications.
I must have missed them, or missed the importance of that fact. Strange, given that I read *everything* in all the D NG's on digitalmars.com. Regan -- Using M2, Opera's revolutionary e-mail client: http://www.opera.com/m2/
Aug 24 2004
parent "antiAlias" <fu bar.com> writes:
Sorry to have offended your sensibilities, dude. If "knew" is used in your
part of the world then "understood" is used in mine. Too bad for the
misunderstanding.

Here's a link to a post from Matthew. Given that it's a reply, I think you
can safely count at least two posts that you missed <g>

news:cg69c1$120n$1 digitaldaemon.com



"Regan Heath" <regan netwin.co.nz> wrote in message
news:opsc9uihna5a2sq9 digitalmars.com...
 On Tue, 24 Aug 2004 19:45:07 -0700, antiAlias <fu bar.com> wrote:

 "Regan Heath" <regan netwin.co.nz> wrote
 Is it?!
 I didn't realise that, so this is invalid?

 class A {
    dchar[] toString() {}
 }
Yes. It most certainly is, Regan. I (incorrectly) assumed you understood that.
Either: a. I am overly sensitive/insecure b. You didn't realise c. You're intentionally trying to belittle me because ... "understood" is not the right word "knew" is a better choice.. "understood" implies I knew but didn't understand. That isn't the case. (this time)
 Sorry. There have been a number of posts that note this, and its
 implications.
I must have missed them, or missed the importance of that fact. strange given that I read *everything* in all the D NG's on digitalmars.com. Regan -- Using M2, Opera's revolutionary e-mail client: http://www.opera.com/m2/
Aug 24 2004
prev sibling parent reply Arcane Jill <Arcane_member pathlink.com> writes:
In article <opsc9m9ccg5a2sq9 digitalmars.com>, Regan Heath says...

Is it?!
I didn't realise that, so this is invalid?

class A {
   dchar[] toString() {}
}
It's not invalid as such, it's just that the return type of an overriding function has to be "covariant" with the return type of the function it overrides. So it's a compile error /now/. But if dchar[] and char[] were to be considered mutually covariant then this would magically start to compile. Arcane Jill
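A quick sketch of the rule as it stands (the class is made up; the rejected override is shown commented out):

class A
{
    // compiles: char[] matches the return type of Object.toString()
    char[] toString() { return "A"; }

    // rejected today, because dchar[] is not covariant with char[];
    // if the two were treated as mutually covariant, this version
    // would compile instead:
    // dchar[] toString() { return "A"d; }
}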
Aug 24 2004
next sibling parent reply Regan Heath <regan netwin.co.nz> writes:
On Wed, 25 Aug 2004 05:44:18 +0000 (UTC), Arcane Jill 
<Arcane_member pathlink.com> wrote:
 In article <opsc9m9ccg5a2sq9 digitalmars.com>, Regan Heath says...

 Is it?!
 I didn't realise that, so this is invalid?

 class A {
   dchar[] toString() {}
 }
It's not invalid as such, it's just that the return type of an overriding function has to be "covariant" with the return type of the function it overrides. So it's a compile error /now/. But if dchar[] and char[] were to be considered mutually covariant then this would magically start to compile.
Ahh.. excellent, that is what I was hoping to hear. Regan -- Using M2, Opera's revolutionary e-mail client: http://www.opera.com/m2/
Aug 24 2004
parent reply Arcane Jill <Arcane_member pathlink.com> writes:
In article <opsc9xu9h35a2sq9 digitalmars.com>, Regan Heath says...

Ahh.. excellent, that is what I was hoping to hear.
It's not /all/ good news however. Consider these two cases: (1) you call toString() through a reference whose type declares the wchar[] return - all hunky dory, no conversions happen. /But/ (2) you call it through a plain Object reference - now /two/ conversions happen (assuming Object.toString() still returns char[]) - toUTF8(wchar[]) followed by toUTF16(char[]). Still, that's polymorphism for you. It is better than the status quo, but not quite as good (IMO) as having wchar[] be the standard string type. Arcane Jill
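A rough sketch of the two cases (using a made-up method name instead of toString() so that it compiles today; the toUTF8()/toUTF16() calls stand in for the conversions the compiler would insert):

import std.utf;

class C
{
    // stand-in for a toString() that returns wchar[] under the suggestion
    wchar[] text() { return "hello"w; }
}

void main()
{
    C c = new C;

    // case (1): the caller sees the wchar[] signature - no conversion
    wchar[] a = c.text();

    // case (2): the result has to pass through a char[] interface, as it
    // would through Object.toString() today - two conversions
    char[]  through = toUTF8(c.text());
    wchar[] b       = toUTF16(through);
}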
Aug 24 2004
parent reply Regan Heath <regan netwin.co.nz> writes:
On Wed, 25 Aug 2004 06:31:16 +0000 (UTC), Arcane Jill 
<Arcane_member pathlink.com> wrote:
 In article <opsc9xu9h35a2sq9 digitalmars.com>, Regan Heath says...

 Ahh.. excellent, that is what I was hoping to hear.
It's not /all/ good news however. Consider these two cases: (1) you call toString() through a reference whose type declares the wchar[] return - all hunky dory, no conversions happen. /But/ (2) you call it through a plain Object reference - now /two/ conversions happen (assuming Object.toString() still returns char[]) - toUTF8(wchar[]) followed by toUTF16(char[]). Still, that's polymorphism for you.
True, and we can come up with much nastier string concatenation examples too.. I wonder if some cleverness can be thought up to lessen this effect somehow?
 It is better than the status quo, but not quite as good (IMO) as having 
 wchar[]
 be the standard string type.
By 'standard' do you mean that the others do not exist? or that it is the type you are encouraged to use unless you have reason not to? I think the other types have a valid place in the D language, after all each type will be more or less efficient based on the specific circumstances it gets used in. The most generally efficient type (if that's even possible to decide) should be the type we're encouraged to use, if that's wchar so be it. Implicit transcoding will fit nicely with a standard type as when you are using another type the library functions (if all written for wchar for example) will still be available without explicit toUTFxx calls. Regan. -- Using M2, Opera's revolutionary e-mail client: http://www.opera.com/m2/
Aug 25 2004
parent reply Arcane Jill <Arcane_member pathlink.com> writes:
In article <opsdbboxb05a2sq9 digitalmars.com>, Regan Heath says...

 It is better than the status quo, but not quite as good (IMO) as having 
 wchar[]
 be the standard string type.
By 'standard' do you mean that the others do not exist? or that it is the type you are encouraged to use unless you have reason not to?
I guess I mean specifically that:

(1) Object.toString() should return wchar[], not char[]

(2) String literals such as "hello world" should be interpreted as wchar[], not char[].

(3) Object.d should contain the line:

(4) The text on page http://www.digitalmars.com/d/arrays.html should be changed. Currently it says:
Dynamic arrays in D suggest the obvious solution - a string is just a dynamic
array of characters. String literals become just an easy way to write character
arrays.

    char[] str;
    char[] str1 = "abc";
This should be changed to:
Dynamic arrays in D suggest the obvious solution - a string is just a dynamic
array of characters. String literals become just an easy way to write character
arrays.

    wchar[] str;
    wchar[] str1 = "abc";
(5) There are probably several other things to change. I don't claim this is an exhaustive list.

In other words, we could actually have our cake /and/ eat it. The intent is to minimize, as far as possible, the number of calls to toUTFxx(). Ideally, they should occur only at input and output. The way to minimize this is to keep everything in the same type, so conversion is not needed. If the D documentation, the behaviour of the compiler, and the organization of Phobos, were to consistently use the same string type, and others were encouraged to use the same type, conversions would be kept to a minimum.

Currently D does that - but its "string of choice" is the char[], not the wchar[]. Conversion to/from UTF-8 is incredibly slow for non-ASCII characters. (It could be made faster, but it can /never/ be made as fast as UTF-16). So we make wchar[], not char[], the "standard", and hey presto, things get faster (and what's more will interface with ICU without conversion, which is really important for internationalization).
The most generally efficient type (if that's even possible to decide) 
should be the type we're encouraged to use, if that's wchar so be it.
Well, it is usually believed that UTF-8 is the most space-efficient but the least speed-efficient. UTF-32 is the most speed-efficient but the least space-efficient. UTF-16 is the happy medium. However...

*) UTF-16 is almost as fast as UTF-32, because the UTF-16 encoding is so simple
*) UTF-16 is more compact than UTF-8 for codepoints between U+0800 and U+FFFF, each of which requires 3 bytes in UTF-8, but only two bytes in UTF-16
*) The characters expressible in UTF-16 in a single wchar include every symbol from every living language, so if you /pretend/ that wchar[] is an array of characters rather than UTF-16 fragments, the effect is relatively harmless (unlike UTF-8).
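For example, taking one code point in that middle range (the lengths in the comments are what the std.utf conversions produce):

import std.utf;

void main()
{
    dchar[] one = "\u20AC"d;      // a single code point between U+0800 and U+FFFF

    char[]  u8  = toUTF8(one);    // u8.length  == 3: three bytes in UTF-8
    wchar[] u16 = toUTF16(one);   // u16.length == 1: one wchar, i.e. two bytes, in UTF-16
}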
Implicit transcoding will fit nicely with a standard type as when you are 
using another type the library functions (if all written for wchar for 
example) will still be available without explicit toUTFxx calls.
True. I can't argue with that. But back to the case for ditching char - think of this from another perspective. In Java, would you be prepared to argue the case for the /introduction/ of an 8-bit wide character type to the language? And that this type could only ever be used for UTF-8? There's a reason why that suggestion sounds absurd. It is. Arcane Jill
Aug 26 2004
next sibling parent "antiAlias" <fu bar.com> writes:
Hear! Hear!


"Arcane Jill" <Arcane_member pathlink.com> wrote in message
news:cgk32j$1fj$1 digitaldaemon.com...
Aug 26 2004
prev sibling next sibling parent reply Regan Heath <regan netwin.co.nz> writes:
On Thu, 26 Aug 2004 07:21:55 +0000 (UTC), Arcane Jill 
<Arcane_member pathlink.com> wrote:
 In article <opsdbboxb05a2sq9 digitalmars.com>, Regan Heath says...

 It is better than the status quo, but not quite as good (IMO) as having
 wchar[]
 be the standard string type.
By 'standard' do you mean that the others do not exist? or that it is the type you are encouraged to use unless you have reason not to?
I guess I mean specifically that: (1) Object.toString() should return wchar[], not char[]
Sure... so long as custom classes can overload and return char[] for situations where the app might be using char[] throughout (for whatever reason). OT: here is an example where we _don't_ want the return value used for method name resolution.
 (2) String literals such as "hello world" should be interpreted as
 wchar[], not char[].
Currently doesn't it decide the type based on the context? i.e.

void foo(dchar[] a);
foo("hello world");

would make "hello world" a dchar string literal?

I guess what you're saying is the default should be wchar[] where the type is indeterminate, i.e.

writef("hello world");

but, why not use char[] for the above, it's more efficient in this case. The compiler could do a quick decision based on whether the string contains any code points >= U+0800: if not use char[], otherwise use wchar[], would that be a good soln? After all, I don't mind if my compile is a little slower if it means my app is faster.
 (3) Object.d should contain the line:

I'm not sure I like this.. will this hide details a programmer should be aware of?
 (4) The text on page http://www.digitalmars.com/d/arrays.html should be 
 changed.
 Currently it says:

 Dynamic arrays in D suggest the obvious solution - a string is just a 
 dynamic
 array of characters. String literals become just an easy way to write 
 character
 arrays.

    char[] str;
    char[] str1 = "abc";
This should be changed to:
 Dynamic arrays in D suggest the obvious solution - a string is just a 
 dynamic
 array of characters. String literals become just an easy way to write 
 character
 arrays.

    wchar[] str;
    wchar[] str1 = "abc";
(5) There are probably several other things to change. I don't claim this is an exhaustive list.
Sure, char[] is probably used and suggested in every example in the manuals except where it's giving an example of the differences in utf-x encodings.
 In other words, we could actually have our cake /and/ eat it. The intent 
 is to
 minimize, as far as possible, the number of calls to toUTFxx().
Agreed.
 Ideally, they should occur only at input and output. The way to minimize 
 this is to keep everything in the same type, so conversion is not 
 needed. If the D
 documentation, the behaviour of the compiler, and the organization of 
 Phobos, were to consistently use the same string type, and others were 
 encouraged to use the same type, conversions would be kept to a minimum.

 Currently D does that - but its "string of choice" is the char[], not
 the wchar[]. Conversion to/from UTF-8 is incredibly slow for non-ASCII
 characters.
 (It could be made faster, but it can /never/ be made as fast as UTF-16). 
 So we make wchar[], not char[], the "standard", and hey presto, things 
 get faster
Qualification: For non ASCII only apps.

 (and what's more will interface with ICU without conversion, which is really
 important for internationalization).
The fact that ICU has no char type suggests it's a bad choice for D, that is, if we want to assume they knew what they were doing. Are there any complaints from developers about ICU anywhere, perhaps some digging for dirt would help make an objective decision here?
 The most generally efficient type (if that's even possible to decide)
 should be the type we're encouraged to use, if that's wchar so be it.
Well, it is usually believed that UTF-8 is the most space-efficient but the least speed-efficient. UTF-32 is the most speed-efficient but the least space-efficient. UTF-16 is the happy medium. However...

*) UTF-16 is almost as fast as UTF-32, because the UTF-16 encoding is so simple
*) UTF-16 is more compact than UTF-8 for codepoints between U+0800 and U+FFFF, each of which requires 3 bytes in UTF-8, but only two bytes in UTF-16
*) The characters expressible in UTF-16 in a single wchar include every symbol from every living language
.. ? I thought the problem Java had was that Unicode contains characters > U+FFFF.. if they're not part of a 'living language', what are they?
 , so if you /pretend/ that wchar[] is an array of
 characters rather than UTF-16 fragments, the effect is relatively harmless
 (unlike UTF-8).
Sure, however if you're only dealing with ASCII, doing that with char[] is also fine. Those of us who haven't done any internationalisation are used to dealing only with ASCII.

I'd like some more stats and figures, simply:
  - how many unicode characters are in the range < U+0800?
  - how many unicode from U+0800 <= x <= U+FFFF?
  - how many unicode > U+FFFF?
(the answers to the above are probably quite simple, but I want them from someone who 'knows' rather than me who'd be guessing)

Then, how commonly is each range used? I imagine this differs depending on exactly what you're doing.. basically when would you use characters in each range and how common is that task? It used to be that ASCII < U+0800 was the most common, it still may be, but I can see that it's not the future, the future is unicode.
 Implicit transcoding will fit nicely with a standard type as when you 
 are
 using another type the library functions (if all written for wchar for
 example) will still be available without explicit toUTFxx calls.
True. I can't argue with that. But back to the case for ditching char - think of this from another perspective. In Java, would you be prepared to argue the case for the /introduction/ of an 8-bit wide character type to the language? And that this type could only ever be used for UTF-8? There's a reason why that suggestion sounds absurd. It is.
Isn't the reason that all the existing Java stuff, of which there is a *lot*, uses wchar, so char wouldn't integrate? That is different to D, where all 3 exist and char is actually the one with the most integration. That said, would the introduction of char to Java give you anything? perhaps.. it would allow you to write an app that only deals with ASCII (chars < U+0800) more space efficiently, correct? Regan -- Using M2, Opera's revolutionary e-mail client: http://www.opera.com/m2/
Aug 26 2004
parent reply Arcane Jill <Arcane_member pathlink.com> writes:
In article <opsdc4sgsi5a2sq9 digitalmars.com>, Regan Heath says...

I guess what you're saying is the default should be wchar[] where the type 
is indeterminate
Yes.
but, why not use char[] for the above, it's more efficient in this case.
The compiler could do a quick decision based on whether the string
contains any code points >= U+0800: if not use char[], otherwise use
wchar[], would that be a good soln?
You suggestion would work. But I'm still thinking along the lines that no conversion is better than some conversion, and no conversion is only achievable by having only one type of string. And even we don't enforce that, we should at least encourage it.
After all, I don't mind if my compile is a little slower if it means my 
app is faster.
UTF-16 is not slower than UTF-8, even for pure ASCII. An app in which all text is ASCII is going to be just as fast in UTF-16 as it is in ASCII. Remember that ASCII is a subset of UTF-16, just as it is a subset of UTF-8. Converting between UTF-8 and UTF-16 won't slow you down much if all your characters are ASCII, of course. Such a conversion is trivial - not much slower than a memcpy. /But/ - you're still making a copy, still allocating stuff off the heap and copying data from one place to another, and that's still overhead which you would have avoided had you used UTF-16 right through.
 (3) Object.d should contain the line:

I'm not sure I like this.. will this hide details a programmer should be aware of?
It's just a strong hint. If aliases are bad then we shouldn't use them anywhere.
 (It could be made faster, but it can /never/ be made as fast as UTF-16). 
 So we make wchar[], not char[], the "standard", and hey presto, things 
 get faster
Qualification: For non ASCII only apps.
No, for /all/ apps. I don't see any reason why ASCII stored in wchar[]s would be any slower than ASCII stored in char[]s. Can you think of a reason why that would be so? ASCII is a subset of UTF-8 ASCII is a subset of UTF-16 where's the difference? The difference is space, not speed.
The fact that ICU has no char type suggests it's a bad choice for D, that
is, if we want to assume they knew what they were doing.
See http://oss.software.ibm.com/icu/userguide/strings.html for UCI's discussion on this.
Are there any 
complaints from developers about ICU anywhere, perhaps some digging for 
dirt would help make an objective decision here?
I don't know. I imagine so. People generally tend to complain about /everything/.
I'd like some more stats and figures, simply:
  - how many unicode characters are in the range < U+0800?
These figures are all from Unicode 4.0. Unicode now stands at 4.0.1, so these figures are out of date - but they'll give you the general idea. 1646
  - how many unicode from U+0800 <= x <= U+FFFF?
54014, of which 6400 are Private Use Characters
  - how many unicode > U+FFFF?
176012, of which 131068 are Private Use Characters
Then, how commonly is each range used? I imagine this differs depending on 
exactly what you're doing.. basically when would you use characters in 
each range and how common is that task?
The biggest non-BMP chunks are:

U+20000 to U+2A6D6 = CJK Compatibility Ideographs
U+F0000 to U+FFFFD = Private Use
U+100000 to U+10FFFD = Private Use

Compatibility Ideographs are not used in Unicode except for round-trip compatibility with legacy CJK character sets. Every single one of them is nothing but a compatibility alias for another character. The Private Use characters are not defined by Unicode (being reserved for private interchange between consenting parties).

The remainder of the non-BMP (> U+FFFF) characters are:
*) More CJK compatibility characters
*) Old italic variants of ASCII characters
*) Gothic letters (no idea what they are)
*) The Deseret script (a dead language, so far as I know)
*) Musical symbols
*) Miscellaneous mathematical symbols
*) Mathematical variants of ASCII and Greek letters
*) "Tagged" variants of ASCII characters

The math characters are used only in math. The tagged characters are used only in /one/ protocol (and in fact, some of those characters are not used at all, even in that protocol). None of these characters are likely to be found in general text, only in specialist applications.

Here is the complete list of "blocks":

0000..007F; Basic Latin
0080..00FF; Latin-1 Supplement
0100..017F; Latin Extended-A
0180..024F; Latin Extended-B
0250..02AF; IPA Extensions
02B0..02FF; Spacing Modifier Letters
0300..036F; Combining Diacritical Marks
0370..03FF; Greek
0400..04FF; Cyrillic
0530..058F; Armenian
0590..05FF; Hebrew
0600..06FF; Arabic
0700..074F; Syriac
0780..07BF; Thaana
0900..097F; Devanagari
0980..09FF; Bengali
0A00..0A7F; Gurmukhi
0A80..0AFF; Gujarati
0B00..0B7F; Oriya
0B80..0BFF; Tamil
0C00..0C7F; Telugu
0C80..0CFF; Kannada
0D00..0D7F; Malayalam
0D80..0DFF; Sinhala
0E00..0E7F; Thai
0E80..0EFF; Lao
0F00..0FFF; Tibetan
1000..109F; Myanmar
10A0..10FF; Georgian
1100..11FF; Hangul Jamo
1200..137F; Ethiopic
13A0..13FF; Cherokee
1400..167F; Unified Canadian Aboriginal Syllabics
1680..169F; Ogham
16A0..16FF; Runic
1780..17FF; Khmer
1800..18AF; Mongolian
1E00..1EFF; Latin Extended Additional
1F00..1FFF; Greek Extended
2000..206F; General Punctuation
2070..209F; Superscripts and Subscripts
20A0..20CF; Currency Symbols
20D0..20FF; Combining Marks for Symbols
2100..214F; Letterlike Symbols
2150..218F; Number Forms
2190..21FF; Arrows
2200..22FF; Mathematical Operators
2300..23FF; Miscellaneous Technical
2400..243F; Control Pictures
2440..245F; Optical Character Recognition
2460..24FF; Enclosed Alphanumerics
2500..257F; Box Drawing
2580..259F; Block Elements
25A0..25FF; Geometric Shapes
2600..26FF; Miscellaneous Symbols
2700..27BF; Dingbats
2800..28FF; Braille Patterns
2E80..2EFF; CJK Radicals Supplement
2F00..2FDF; Kangxi Radicals
2FF0..2FFF; Ideographic Description Characters
3000..303F; CJK Symbols and Punctuation
3040..309F; Hiragana
30A0..30FF; Katakana
3100..312F; Bopomofo
3130..318F; Hangul Compatibility Jamo
3190..319F; Kanbun
31A0..31BF; Bopomofo Extended
3200..32FF; Enclosed CJK Letters and Months
3300..33FF; CJK Compatibility
3400..4DB5; CJK Unified Ideographs Extension A
4E00..9FFF; CJK Unified Ideographs
A000..A48F; Yi Syllables
A490..A4CF; Yi Radicals
AC00..D7A3; Hangul Syllables
D800..DB7F; High Surrogates
DB80..DBFF; High Private Use Surrogates
DC00..DFFF; Low Surrogates
E000..F8FF; Private Use
F900..FAFF; CJK Compatibility Ideographs
FB00..FB4F; Alphabetic Presentation Forms
FB50..FDFF; Arabic Presentation Forms-A
FE20..FE2F; Combining Half Marks
FE30..FE4F; CJK Compatibility Forms
FE50..FE6F; Small Form Variants
FE70..FEFE; Arabic Presentation Forms-B
FEFF..FEFF; Specials
FF00..FFEF; Halfwidth and Fullwidth Forms
FFF0..FFFD; Specials
10300..1032F; Old Italic
10330..1034F; Gothic
10400..1044F; Deseret
1D000..1D0FF; Byzantine Musical Symbols
1D100..1D1FF; Musical Symbols
1D400..1D7FF; Mathematical Alphanumeric Symbols
20000..2A6D6; CJK Unified Ideographs Extension B
2F800..2FA1F; CJK Compatibility Ideographs Supplement
E0000..E007F; Tags
F0000..FFFFD; Private Use
100000..10FFFD; Private Use
It used to be that ASCII < U+0800 was the most common, it still may be, 
but I can see that it's not the future, the future is unicode.
Most common, perhaps, if you limit your text to Latin, Greek, Hebrew and Russian. But not otherwise.
That said, would the introduction of char to Java give you anything? 
perhaps.. it would allow you to write an app that only deals with ASCII 
(chars < U+0800) more space efficiently, correct?
Only chars < U+0080 (not U+0800) would be more space efficient in UTF-8. Between U+0080 and U+07FF they both need two bytes. From U+0800 upwards, UTF-8 needs three bytes where UTF-16 needs two.

How am I doing on the convincing front? I'd still go for:

*) wchar[] for everything in Phobos and everything DMD-generated;
*) ditch the char;
*) lossless implicit conversion between all remaining D string types.

Arcane Jill
Aug 27 2004
next sibling parent Arcane Jill <Arcane_member pathlink.com> writes:
In article <cgmr7f$1b90$1 digitaldaemon.com>, Arcane Jill says...

10400..1044F; Deseret
*) The Deseret script (a dead language, so far as I know)
"The Deseret Alphabet was designed as an alternative to the Latin alphabet for writing the English language. It was developed during the 1850s by The Church of Jesus Christ of Latter-day Saints (also known as the "Mormon" or LDS Church) under the guidance of Church President Brigham Young (1801-1877). Brigham Young's secretary, George D. Watt, was among the designers of the Deseret Alphabet." "The LDS Church published four books using the Deseret Alphabet" See http://www.molossia.org/alphabet.html Jill
Aug 27 2004
prev sibling next sibling parent reply "Walter" <newshound digitalmars.com> writes:
"Arcane Jill" <Arcane_member pathlink.com> wrote in message
news:cgmr7f$1b90$1 digitaldaemon.com...
 UTF-16 is not slower than UTF-8, even for pure ASCII. An app in which all
 text is ASCII is going to be just as fast in UTF-16 as it is in ASCII.
While the rest of your post has a great deal of merit, this bit here is just not true. A pure ASCII app written using UTF-16 will consume twice as much memory for its data, and there are a lot of operations on that data that will be correspondingly half as fast. Furthermore, you'll start swapping a lot sooner, and then performance takes a dive. It makes sense for Java, Javascript, and for languages where performance is not a top priority to standardize on one character type. But if D does not handle ASCII very efficiently, it will not have a chance at interesting the very performance conscious C/C++ programmers.
Aug 28 2004
next sibling parent reply Sean Kelly <sean f4.ca> writes:
In article <cgqlmm$2ui$1 digitaldaemon.com>, Walter says...
"Arcane Jill" <Arcane_member pathlink.com> wrote in message
news:cgmr7f$1b90$1 digitaldaemon.com...
 UTF-16 is not slower than UTF-8, even for pure ASCII. An app in which all
 text is ASCII is going to be just as fast in UTF-16 as it is in ASCII.
While the rest of your post has a great deal of merit, this bit here is just not true. A pure ASCII app written using UTF-16 will consume twice as much memory for its data, and there are a lot of operations on that data that will be correspondingly half as fast. Furthermore, you'll start swapping a lot sooner, and then performance takes a dive. It makes sense for Java, Javascript, and for languages where performance is not a top priority to standardize on one character type. But if D does not handle ASCII very efficiently, it will not have a chance at interesting the very performance conscious C/C++ programmers.
Agreed. Frankly, I've begun to wonder just what the purpose of this discussion is. I think it's already been agreed that none of the three char types should be removed from D, and it seems clear that there is no "default" char type. Is this a nomenclature issue? ie. that the UTF-8 type is named "char" and thus considered to be somehow more important than the others? Sean
Aug 28 2004
next sibling parent J C Calvarese <jcc7 cox.net> writes:
Sean Kelly wrote:

 In article <cgqlmm$2ui$1 digitaldaemon.com>, Walter says...
 
"Arcane Jill" <Arcane_member pathlink.com> wrote in message
news:cgmr7f$1b90$1 digitaldaemon.com...

UTF-16 is not slower than UTF-8, even for pure ASCII. An app in which all
text is ASCII is going to be just as fast in UTF-16 as it is in ASCII.
While the rest of your post has a great deal of merit, this bit here is just not true. A pure ASCII app written using UTF-16 will consume twice as much memory for its data, and there are a lot of operations on that data that will be correspondingly half as fast. Furthermore, you'll start swapping a lot sooner, and then performance takes a dive. It makes sense for Java, Javascript, and for languages where performance is not a top priority to standardize on one character type. But if D does not handle ASCII very efficiently, it will not have a chance at interesting the very performance conscious C/C++ programmers.
Agreed. Frankly, I've begun to wonder just what the purpose of this discussion is. I think it's already been agreed that none of the three char types should be removed from D, and it seems clear that there is no "default" char type. Is
I think the thread has gone somewhat off topic by this point. Apparently, a lot of people feel oppressed by ASCII. I must be a bad person since 7 bits is all I need most of the time. On a related note, the "performance of char vs wchar" thread recently degraded into enlightened comments along the lines of "You're a know-it-all American cowboy who discriminates against all of the Chinese, Japanese, Indians, Russians, British, and members of the European Union in the world. And Mormons, too." Or something like that. Somehow these Unicode-related discussions bring out the best in people. :P
 this a nomenclature issue?  ie. that the UTF-8 type is named "char" and thus
 considered to be somehow more important than the others?
 
 
 Sean
-- Justin (a/k/a jcc7) http://jcc_7.tripod.com/d/
Aug 28 2004
prev sibling parent Matthias Becker <Matthias_member pathlink.com> writes:
Agreed.  Frankly, I've begun to wonder just what the purpose of this discussion
is.
The toString()-method has to return a string, but in which format? That's what this is all about.
I think it's already been agreed that none of the three char types should
be removed from D, and it seems clear that there is no "default" char type.
Well, the language can only use one type as the return type for toString(), so there actually IS a default character type. BTW, isn't the name "char" totally misleading? char means character, but it can't hold a character; it can only hold parts of one. This is confusing.
Is
this a nomenclature issue?  ie. that the UTF-8 type is named "char" and thus
considered to be somehow more important than the others?
Nope. -- Matthias Becker
Aug 29 2004
prev sibling parent reply stonecobra <scott stonecobra.com> writes:
Walter wrote:
 "Arcane Jill" <Arcane_member pathlink.com> wrote in message
 news:cgmr7f$1b90$1 digitaldaemon.com...
 
UTF-16 is not slower than UTF-8, even for pure ASCII. An app in which all
text is ASCII is going to be just as fast in UTF-16 as it is in ASCII.
While the rest of your post has a great deal of merit, this bit here is just not true. A pure ASCII app written using UTF-16 will consume twice as much memory for its data, and there are a lot of operations on that data that will be correspondingly half as fast. Furthermore, you'll start swapping a lot sooner, and then performance takes a dive. It makes sense for Java, Javascript, and for languages where performance is not a top priority to standardize on one character type. But if D does not handle ASCII very efficiently, it will not have a chance at interesting the very performance conscious C/C++ programmers.
They won't worry about it, because if they are true performance affinciandos, they will never create an array in D, because of the default initialization. :) so, ubyte[] it is <g> Seriously, is performance a concern for D? If it truly is, this should be able to be turned off, if I take ownership of the potential consequences, no? Scott
Aug 29 2004
next sibling parent Juanjo =?ISO-8859-15?Q?=C1lvarez?= <juanjuxNO SPAMyahoo.es> writes:
stonecobra wrote:

 affinciandos
What was that? :-D (I guess what you were trying to say was "aficionado").
Aug 29 2004
prev sibling next sibling parent reply "Ivan Senji" <ivan.senji public.srce.hr> writes:
"stonecobra" <scott stonecobra.com> wrote in message
news:cgu2ia$1vad$1 digitaldaemon.com...
 Walter wrote:
 "Arcane Jill" <Arcane_member pathlink.com> wrote in message
 news:cgmr7f$1b90$1 digitaldaemon.com...

UTF-16 is not slower than UTF-8, even for pure ASCII. An app in which all
text is ASCII is going to be just as fast in UTF-16 as it is in ASCII.
While the rest of your post has a great deal of merit, this bit here is just
 not true. A pure ASCII app written using UTF-16 will consume twice as much
 memory for its data, and there are a lot of operations on that data that
 will be correspondingly half as fast. Furthermore, you'll start swapping a
 lot sooner, and then performance takes a dive.

 It makes sense for Java, Javascript, and for languages where performance is
 not a top priority to standardize on one character type. But if D does not
 handle ASCII very efficiently, it will not have a chance at interesting the
 very performance conscious C/C++ programmers.
They won't worry about it, because if they are true performance affinciandos, they will never create an array in D, because of the default initialization. :) so, ubyte[] it is <g> Seriously, is performance a concern for D? If it truly is, this should be able to be turned off, if I take ownership of the potential consequences, no?
What about having some standard allocator for arrays so we can:

char[] str = new(noinit) char[100];
 Scott
Aug 30 2004
parent Juanjo =?ISO-8859-15?Q?=C1lvarez?= <juanjuxNO SPAMyahoo.es> writes:
Ivan Senji wrote:

 What about having some standard allocator for arrays so we can:
 
 char[] str = new(noinit) char[100];
I like it!
Aug 30 2004
prev sibling parent reply "Walter" <newshound digitalmars.com> writes:
"stonecobra" <scott stonecobra.com> wrote in message
news:cgu2ia$1vad$1 digitaldaemon.com...
 They won't worry about it, because if they are true performance
 affinciandos, they will never create an array in D, because of the
 default initialization.   :)  so, ubyte[] it is <g>

 Seriously, is performance a concern for D?  If it truly is, this should
 be able to be turned off, if I take ownership of the potential
 consequences, no?
You can allocate them with std.c.stdlib.malloc() or std.c.stdlib.alloca(), neither of which will do the initialization. Furthermore, std.c.stdlib.alloca(n), where n is a constant, is handled as a special optimization case, and will generate storage as part of the stack frame setup (so it's zero cost).
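Something along these lines, for instance (the function name is made up; the point is just that malloc'd storage is neither default-initialized nor GC-managed, so the caller must free() the pointer explicitly):

import std.c.stdlib;

// build a char[] on top of malloc'd storage - the contents are whatever
// happened to be in memory, and the caller must eventually free() it
char[] uninitChars(uint n)
{
    char* p = cast(char*) std.c.stdlib.malloc(n);
    assert(p != null);
    return p[0 .. n];
}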
Aug 30 2004
next sibling parent stonecobra <scott stonecobra.com> writes:
Walter wrote:

 "stonecobra" <scott stonecobra.com> wrote in message
 news:cgu2ia$1vad$1 digitaldaemon.com...
 
They won't worry about it, because if they are true performance
affinciandos, they will never create an array in D, because of the
default initialization.   :)  so, ubyte[] it is <g>

Seriously, is performance a concern for D?  If it truly is, this should
be able to be turned off, if I take ownership of the potential
consequences, no?
You can allocate them with std.c.stdlib.malloc() or std.c.stdlib.alloca(), neither of which will do the initialization. Furthermore, std.c.stdlib.alloca(n), where n is a constant, is handled as a special optimization case, and will generate storage as part of the stack frame setup (so it's zero cost).
So, the C performance geeks will just stay with C, because that's how they'd do it now (no benefit to moving to D)? Scott
Aug 30 2004
prev sibling parent reply Juanjo =?ISO-8859-15?Q?=C1lvarez?= <juanjuxNO SPAMyahoo.es> writes:
Walter wrote:

 Seriously, is performance a concern for D?  If it truly is, this should
 be able to be turned off, if I take ownership of the potential
 consequences, no?
You can allocate them with std.c.stdlib.malloc() or std.c.stdlib.alloca(), neither of which will do the initialization. Furthermore, std.c.stdlib.alloca(n), where n is a constant, is handled as a special optimization case, and will generate storage as part of the stack frame setup (so it's zero cost).
And will they be garbage-collected? (just asking, I don't know.) Anyway it's an ugly interface to do it, maybe as has been suggested more standard allocators could be included for this case.
Aug 30 2004
parent reply "Walter" <newshound digitalmars.com> writes:
"Juanjo Álvarez" <juanjuxNO SPAMyahoo.es> wrote in message
news:ch0ajb$4d$1 digitaldaemon.com...
 Walter wrote:
 Seriously, is performance a concern for D?  If it truly is, this should
 be able to be turned off, if I take ownership of the potential
 consequences, no?
You can allocate them with std.c.stdlib.malloc() or
std.c.stdlib.alloca(),
 neither of which will do the initialization. Furthermore,
 std.c.stdlib.alloca(n), where n is a constant, is handled as a special
 optimization case, and will generate storage as part of the stack frame
 setup (so it's zero cost).
And will they be garbage-collected? (just asking, I don't know.)
malloc()? No, that will need an explicit call to free().
alloca()? No, that gets deallocated anyway when the function exits
 Anyway it's
 an ugly interface to do it, maybe as has been suggested more standard
 allocators could be included for this case.
You can also overload operators new and delete on a per-class basis, and provide a custom allocator.
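For instance, a bare-bones sketch of such a per-class allocator/deallocator using the C heap (the class is made up; note that memory obtained this way is presumably not scanned by the GC, which matters if the class holds references):

import std.c.stdlib;

class Foo
{
    // class allocator: 'new Foo' takes its memory from the C heap
    new(uint size)
    {
        void* p = std.c.stdlib.malloc(size);
        assert(p != null);
        return p;
    }

    // matching deallocator, run by 'delete'
    delete(void* p)
    {
        if (p)
            std.c.stdlib.free(p);
    }
}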
Aug 30 2004
next sibling parent reply "Ivan Senji" <ivan.senji public.srce.hr> writes:
"Walter" <newshound digitalmars.com> wrote in message
news:ch0ce4$11t$1 digitaldaemon.com...
 "Juanjo Álvarez" <juanjuxNO SPAMyahoo.es> wrote in message
 news:ch0ajb$4d$1 digitaldaemon.com...
 Walter wrote:
 Seriously, is performance a concern for D?  If it truly is, this
should
 be able to be turned off, if I take ownership of the potential
 consequences, no?
You can allocate them with std.c.stdlib.malloc() or
std.c.stdlib.alloca(),
 neither of which will do the initialization. Furthermore,
 std.c.stdlib.alloca(n), where n is a constant, is handled as a special
 optimization case, and will generate storage as part of the stack
frame
 setup (so it's zero cost).
And will they be garbage-collected? (just asking, I don't know.)
malloc()? No, that will need an explicit call to free().
But this would be a good way to get back to good old memory leaks from forgetting to free something. This is something that a high level language like D should try to help with. Isn't there a way to extend the syntax to enable us to allocate uninitialised arrays when someone explicitly wants to do that?
 alloca()? No, that gets deallocated anyway when the function exits

 Anyway it's
 an ugly interface to do it, maybe as has been suggested more standard
 allocators could be included for this case.
You can also overload operators new and delete on a per-class basis, and provide a custom allocator.
But not for arrays. This would mean too much wrapping.
Aug 31 2004
parent reply Sean Kelly <sean f4.ca> writes:
In article <ch1ulm$ori$1 digitaldaemon.com>, Ivan Senji says...
"Walter" <newshound digitalmars.com> wrote in message
news:ch0ce4$11t$1 digitaldaemon.com...

 malloc()? No, that will need an explicit call to free().
But this would be a good way to get back to good old memory leaks from forgetting to free something. This is something that a high level language like D should try to help with. Isn't there a way to extend the syntax to enable us to allocate uninitialised arrays when someone explicitly wants to do that?
How about a smart pointer? I admit that the syntax wouldn't be quite as nice as in C++, but the basic implementation should be the same. Sean
Aug 31 2004
parent Nick <Nick_member pathlink.com> writes:
In article <ch2a1j$uqi$1 digitaldaemon.com>, Sean Kelly says...

 malloc()? No, that will need an explicit call to free().
But this would be a good way to get back to good old memory leaks from forgetting to free something. This is something that a high level language like D should try to help with.
How about a smart pointer? I admit that the syntax wouldn't be quite as nice as in C++, but the basic implementation should be the same.
How about just using a buffer class that clears up its mess when collected? Seems to me like a simple solution that would cover most uses for uninitialized buffers. I have written an example here: http://folk.uio.no/mortennk/d/array/membuffer.d and used it here: http://folk.uio.no/mortennk/d/array/uninitarray.d Nick
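A minimal sketch of that kind of buffer class (hypothetical code, not the example at the links above):

import std.c.stdlib;

class MemBuffer
{
    private void*  ptr;
    private size_t len;

    this(size_t n)
    {
        len = n;
        ptr = malloc(n);                 // uninitialized storage outside the GC heap
        if (!ptr)
            throw new Exception("out of memory");
    }

    ~this()
    {
        free(ptr);                       // released when the object is collected or deleted
        ptr = null;
    }

    // expose the storage as an ordinary array slice
    ubyte[] opSlice()
    {
        return (cast(ubyte*) ptr)[0 .. len];
    }
}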
Aug 31 2004
prev sibling parent reply Arcane Jill <Arcane_member pathlink.com> writes:
In article <ch0ce4$11t$1 digitaldaemon.com>, Walter says...

 And will they be garbage-collected? (just asking, I don't know.)
malloc()? No, that will need an explicit call to free(). alloca()? No, that gets deallocated anyway when the function exits
..but is forbidden from having a destructor (even if you use alloca() via a custom allocator) because you can't have destructible objects on the stack.
 Anyway it's
 an ugly interface to do it, maybe as has been suggested more standard
 allocators could be included for this case.
You can also overload operators new and delete on a per-class basis, and provide a custom allocator.
..which of course will /also/ not be garbage collected. (I've been down this road before). Arcane Jill
Aug 31 2004
next sibling parent Sean Kelly <sean f4.ca> writes:
In article <ch22nn$qtf$1 digitaldaemon.com>, Arcane Jill says...
In article <ch0ce4$11t$1 digitaldaemon.com>, Walter says...

 And will they be garbage-collected? (just asking, I don't know.)
malloc()? No, that will need an explicit call to free(). alloca()? No, that gets deallocated anyway when the function exits
..but is forbidden from having a destructor (even if you use alloca() via a custom allocator) because you can't have destructible objects on the stack.
You can't? Why not?
 Anyway it's
 an ugly interface to do it, maybe as has been suggested more standard
 allocators could be included for this case.
You can also overload operators new and delete on a per-class basis, and provide a custom allocator.
..which of course will /also/ not be garbage collected. (I've been down this road before).
I'm not sure if it would work, but could you use gc.malloc to allocate heap memory in operator new and then not provide an operator delete? Sean
Aug 31 2004
prev sibling parent "Walter" <newshound digitalmars.com> writes:
"Arcane Jill" <Arcane_member pathlink.com> wrote in message
news:ch22nn$qtf$1 digitaldaemon.com...
 In article <ch0ce4$11t$1 digitaldaemon.com>, Walter says...

 And will they be garbage-collected? (just asking, I don't know.)
malloc()? No, that will need an explicit call to free(). alloca()? No, that gets deallocated anyway when the function exits
..but is forbidden from having a destructor (even if you use alloca() via
a
 custom allocator) because you can't have destructible objects on the
stack.

If you make it an auto class, you can.
Aug 31 2004
prev sibling parent Regan Heath <regan netwin.co.nz> writes:
On Fri, 27 Aug 2004 08:26:23 +0000 (UTC), Arcane Jill 
<Arcane_member pathlink.com> wrote:
 In article <opsdc4sgsi5a2sq9 digitalmars.com>, Regan Heath says...
 After all, I don't mind if my compile is a little slower if it means my
 app is faster.
UTF-16 is not slower than UTF-8, even for pure ASCII. An app in which all text is ASCII is going to be just as fast in UTF-16 as it is in ASCII. Remember that ASCII is a subset of UTF-16, just as it is a subset of UTF-8. Converting between UTF-8 and UTF-16 won't slow you down much if all your characters are ASCII, of course. Such a conversion is trivial - not much slower than a memcpy. /But/ - you're still making a copy, still allocating stuff off the heap and copying data from one place to another, and that's still overhead which you would have avoided had you used UTF-16 right through.
You're right.. I would argue however that space == speed when you start to run out, which happens twice as fast if you use wchar (and ASCII only), right? The overall efficiency of a program is made up of both its space and CPU requirements; sometimes you will need to or want to lessen the space requirements.
 (3) Object.d should contain the line:

I'm not sure I like this.. will this hide details a programmer should be aware of?
It's just a strong hint. If aliases are bad then we shouldn't use them anywhere.
I wasn't suggesting aliases were bad. Aliases that serve to make type declarations clearer are very useful; they make code clearer. This alias just renames a type, so now it has two names, which will likely cause some confusion. I think we can suggest a type without it.
 (It could be made faster, but it can /never/ be made as fast as 
 UTF-16).
 So we make wchar[], not char[], the "standard", and hey presto, things
 get faster
Qualification: For non-ASCII-only apps.
No, for /all/ apps. I don't see any reason why ASCII stored in wchar[]s would be any slower than ASCII stored in char[]s. Can you think of a reason why that would be so?

ASCII is a subset of UTF-8.
ASCII is a subset of UTF-16.

Where's the difference? The difference is space, not speed.
Correct, but space == speed (as above).
 The fact that ICU has no char type suggests it's a bad choice for D, 
 that
 is, if we want to assume they knew what they were doing.
See http://oss.software.ibm.com/icu/userguide/strings.html for ICU's discussion on this.
Thanks.
 Are there any
 complaints from developers about ICU anywhere, perhaps some digging for
 dirt would help make an objective decision here?
I don't know. I imagine so. People generally tend to complain about /everything/.
<g> true, too true...
 I'd like some more stats and figures, simply:
  - how many unicode characters are in the range < U+0800?
These figures are all from Unicode 4.0. Unicode now stands at 4.0.1, so these figures are out of date - but they'll give you the general idea. 1646
 - how many unicode characters from U+0800 <= x <= U+FFFF?
54014, of which 6400 are Private Use Characters
 - how many unicode characters > U+FFFF?
176012, of which 131068 are Private Use Characters
 Then, how commonly is each range used? I imagine this differs depending 
 on
 exactly what you're doing.. basically when would you use characters in
 each range and how common is that task?
..thanks for the lists/figures..
 It used to be that ASCII < U+0800 was the most common, it still may be,
 but I can see that it's not the future, the future is unicode.
Most common, perhaps, if you limit your text to Latin, Greek, Hebrew and Russian. But not otherwise.
So would you say most common worldwide then? It may be due to the fact that I only speak English, but I see many more English-only programs than (pick a language)-only programs. Ignoring those applications that come in several languages (as all the big ones do).
 That said, would the introduction of char to Java give you anything?
 perhaps.. it would allow you to write an app that only deals with ASCII
 (chars < U+0800) more space efficiently, correct?
Only chars < U+0080 (not U+0800) would be more space efficient in UTF-8.
Yes, what if they are all you want/need/(are going) to use...
 Between
 U+0080 and U+07FF they both need two bytes. From U+0800 upwards, UTF-8
 needs
 three bytes where UTF-16 needs two.

 How am I doing on the convincing front?
You still have work to do <g>
 I'd still go for:
 *) wchar[] for everything in Phobos and everything DMD-generated;
What if you know you're only going to need ASCII(utf-8), what if all your data is going to be in ASCII(utf-8), won't you want your static strings in ASCII(utf-8) also, to cut down on transcoding?
 *) ditch the char
I don't see the point in this. char[] is still useful regardless of which type we 'promote' as the best type to use for internationalized strings. If it were removed, then dchar should be removed for the same reason, and both types would have to be implemented in ubyte[] and int[] instead. Coincidentally this is what the ICU have done, I quote... "UTF-8 and UTF-32 are supported with converters (ucnv.h), macros (utf.h), and convenience functions (ustring.h), but not directly as string encoding forms for most APIs." I'm not convinced removing them is more useful than keeping them but having implicit transcoding.
 *) lossless implicit conversion between all remaining D string types
Here we agree, this would make life much easier. Regan -- Using M2, Opera's revolutionary e-mail client: http://www.opera.com/m2/
Aug 29 2004
prev sibling parent reply "Walter" <newshound digitalmars.com> writes:
"Arcane Jill" <Arcane_member pathlink.com> wrote in message
news:cgk32j$1fj$1 digitaldaemon.com...
 (2) String literals such as "hello world" should be interpreted as
wchar[], not
 char[].
Actually, string literals are already interpreted as char[], wchar[], or dchar[] depending on the context they appear in. The compiler implicitly does a UTF conversion on them as necessary. If you have an overload based on char[] vs wchar[] vs dchar[] and pass a string literal, it should result in an ambiguity error. The only place it would default to char[] would be when it is passed as a ... argument to a variadic function.
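A small illustration of that behaviour (the function f and the overload set here are made up):

void f(char[] s)  {}
void f(wchar[] s) {}

void main()
{
    wchar[] w = "hello";         // fine: the literal is converted to UTF-16 from context
    // f("hello");               // ambiguity error: the literal matches both overloads
    f(cast(char[]) "hello");     // fine: the cast picks one
}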
Aug 28 2004
parent reply Andy Friesen <andy ikagames.com> writes:
Walter wrote:
 "Arcane Jill" <Arcane_member pathlink.com> wrote in message
 news:cgk32j$1fj$1 digitaldaemon.com...
 
(2) String literals such as "hello world" should be interpreted as
wchar[], not
char[].
Actually, string literals are already interpreted as char[], wchar[], or dchar[] depending on the context they appear in. The compiler implicitly does a UTF conversion on them as necessary. If you have an overload based on char[] vs wchar[] vs dchar[] and pass a string literal, it should result in an ambiguity error.
Is there any chance that this could be adjusted somehow? I know the point is to avoid all the complications that the C++ approach entails, but this has a way of throwing a wrench in any interface that wants to handle all three. Presently, we're given a choice: either handle all three char types and therefore demand ugly casts on all string literal arguments, or only handle one and force conversions that aren't necessarily required or desired by either the caller or the callee. If, say, in the case of an ambiguity, a string literal were assumed to be of the smallest char type for which a match exists, the code would compile and, in almost all cases, do the right thing. It does complicate the rules some, but it seems preferable to the current dilemma. -- andy
Aug 28 2004
parent reply Sean Kelly <sean f4.ca> writes:
In article <cgqmpo$387$1 digitaldaemon.com>, Andy Friesen says...
I know the point is to avoid all the complications that the C++ approach
entails, but this has a way of throwing a wrench in any interface that 
wants to handle all three.
I'm inclined to agree, though I'm wary of making char types a special case for overload resolution. Perhaps a prefix to indicate type?

c"" // utf-8
w"" // utf-16
d"" // utf-32

Still not ideal, but it would require less typing :/

Sean
Aug 28 2004
parent "Walter" <newshound digitalmars.com> writes:
"Sean Kelly" <sean f4.ca> wrote in message
news:cgqnjm$3eq$1 digitaldaemon.com...
 In article <cgqmpo$387$1 digitaldaemon.com>, Andy Friesen says...
I know the point is to avoid all the complications that the C++ approach
entails, but this has a way of throwing a wrench in any interface that
wants to handle all three.
I'm inclined to agree, though I'm wary of making char types a special case
for
 overload resolution.  Perhaps a prefix to indicate type?

 c"" // utf-8
 w"" // utf-16
 d"" // utf-32

 Still not ideal, but it would require less typing :/
I thought of the prefix approach, like C uses, but it just seemed redundant for the odd case where a cast(char[]) will do.
Aug 28 2004
prev sibling parent "Walter" <newshound digitalmars.com> writes:
"Arcane Jill" <Arcane_member pathlink.com> wrote in message
news:cgh8vi$1o7k$1 digitaldaemon.com...
 It's not invalid as such, it's just that the return type of an overloaded
 function has to be "covariant" with the return type of the function it's
 overloading. So it's a compile error /now/. But if dchar[] and char[] were
to be
 considered mutually covariant then this would magically start to compile.
That would be nice, but I don't see how to technically make that work.
Aug 28 2004
prev sibling next sibling parent reply "Carlos Santander B." <carlos8294 msn.com> writes:
"antiAlias" <fu bar.com> escribió en el mensaje
news:cgg1mi$14l2$1 digitaldaemon.com
| A a; // class instances ...
| B b;
| C c;
|
| dchar[] message = c ~ b ~ a;

I have a question regarding this: what if A, B, and C were like this?

//////////////////////////
class A
{
    ... opCat_r (B b) { ... }
    ...
}

class B
{
    ... opCat (A a) { ... }
    ... opCat_r (C c) { ... }
    ...
}

class C
{
    ... opCat (B b) { ... }
    ...
}
//////////////////////////

How would "c ~ b ~ a" work with the proposed automatic call to .toString?

-----------------------
Carlos Santander Bernal
Aug 24 2004
parent reply Regan Heath <regan netwin.co.nz> writes:
On Tue, 24 Aug 2004 21:31:56 -0500, Carlos Santander B. 
<carlos8294 msn.com> wrote:
 "antiAlias" <fu bar.com> escribió en el mensaje
 news:cgg1mi$14l2$1 digitaldaemon.com
 | A a; // class instances ...
 | B b;
 | C c;
 |
 | dchar[] message = c ~ b ~ a;

 I have a question regarding this: what if A, B, and C were like this?

 //////////////////////////
 class A
 {
     ... opCat_r (B b) { ... }
     ...
 }

 class B
 {
     ... opCat (A a) { ... }
     ... opCat_r (C c) { ... }
     ...
 }

 class C
 {
     ... opCat (B b) { ... }
     ...
 }
 //////////////////////////

 How would "c ~ b ~ a" work with the proposed automatic call to .toString?
I assumed opCat's parameter would have to be char[], wchar[] or dchar[], as would its return value. eg.

class B
{
    char[] opCat(char[] rhs){}
}

given implicit transcoding you could then say.

char[]  c;
wchar[] w;
dchar[] d;

B b = new B();

char[] p;

p = b ~ c;
p = b ~ w;
p = b ~ d;

Regan

-- 
Using M2, Opera's revolutionary e-mail client: http://www.opera.com/m2/
Aug 24 2004
parent reply "Carlos Santander B." <carlos8294 msn.com> writes:
"Regan Heath" <regan netwin.co.nz> escribió en el mensaje
news:opsc9un10r5a2sq9 digitalmars.com
| I assumed opCat's parameter would have to be char[], wchar[] or dchar[],
| as would it's return value. eg.
|

I don't see why it has to be only that way. ~ is the concatenation operator, so
I could define:

class Set(T)
{
    Set opCat(T newElem) { ... }
}

And expect it to work the way I want it. opCat is not only for strings. And it
shouldn't be.

| class B
| {
|    char[] opCat(char[] rhs){}
| }
|
| given implicit transcoding you could then say.
|
| char[]  c;
| wchar[] w;
| dchar[] d;
|
| B b = new B();
|
| char[] p;
|
| p = b ~ c;
| p = b ~ w;
| p = b ~ d;
|
| Regan
|

-----------------------
Carlos Santander Bernal
Aug 25 2004
parent Arcane Jill <Arcane_member pathlink.com> writes:
This is irrelevant. opCat() does not need to do anything special for D strings
to work, whether we go the wchar[] route or the implicit conversion route. For
wchar[]s, it already works. If we go for implicit conversion, then the three
different kinds of D string would be regarded as covariant by the D compiler, so
expressions of the form (wchar[] ~ dchar[]) would be handled by the type
promotion system, not by opCat() - just as (float + int) is handled now.

Jill



In article <cgi5q2$254f$1 digitaldaemon.com>, Carlos Santander B. says...
"Regan Heath" <regan netwin.co.nz> escribió en el mensaje
news:opsc9un10r5a2sq9 digitalmars.com
| I assumed opCat's parameter would have to be char[], wchar[] or dchar[],
| as would it's return value. eg.
|

I don't see why it has to be only that way. ~ is the concatenation operator, so
I could define:

class Set(T)
{
    Set opCat(T newElem) { ... }
}

And expect it to work the way I want it. opCat is not only for strings. And it
shouldn't be.

| class B
| {
|    char[] opCat(char[] rhs){}
| }
|
| given implicit transcoding you could then say.
|
| char[]  c;
| wchar[] w;
| dchar[] d;
|
| B b = new B();
|
| char[] p;
|
| p = b ~ c;
| p = b ~ w;
| p = b ~ d;
|
| Regan
|

-----------------------
Carlos Santander Bernal
Aug 25 2004
prev sibling parent reply Ben Hinkle <bhinkle4 juno.com> writes:
 
 The other aspect involved here is that of string-concatenation. D cannot
 have more that one return type for toString() as you know. It's fixed at
 char[]. If string concatenation uses the toString() method to retrieve its
 components (as is being proposed elsewhere), then there will be multiple,
 redundant, implicit conversions going on where the string really wanted to
 be dchar[] in the first place. That is:
 
 A a; // class instances ...
 B b;
 C c;
 
 dchar[] message = c ~ b ~ a;
 
 Under the proposed "implicit" scheme, if each toString() of A, B, and C
 wish to return dchar[], then each concatenation causes an implicit
 conversion/encoding from each dchar[] to char[] (for the toString()
 return). Then another full conversion/decoding is performed back to the
 dchar[] assignment once each has been concatenated. This is like the
 Wintel 'plot' for selling more cpu's :-)
 
 Doing this manually, one would forego the toString() altogether:
 
 dchar[] message = c.getString() ~ b.getString() ~ a.getString();
 
 ... where getString() is a programmer-specific idiom to return the
 (natural) dchar[] for these classes, and we carefully avoided all those
 darned implicit-conversions. However, which approach do you think people
 will use? My guess is that D may become bogged down in conversion hell
 over such things.
"Conversion hell" will exist any time three standards are in use. It doesn't matter how those standards are wrapped up - implicit, explicit, String class or whatever. That's why we all win by agreeing on one standard and trying to stick to it. In D now life is peachy in char[] land, slightly less peachy in wchar[] and dchar[] land. I don't think there's any way to make life peachy for all three cases.
 So, to answer your question:
 What I'm /for/ is not covering up these types of issues with blanket-style
 implicit conversions. Something more constructive (and with a little more
 forethought) needs to be done.
Aug 24 2004
parent Regan Heath <regan netwin.co.nz> writes:
On Tue, 24 Aug 2004 22:53:47 -0400, Ben Hinkle <bhinkle4 juno.com> wrote:

<snip>

 "Conversion hell" will exist any time three standards are in use. It 
 doesn't
 matter how those standards are wrapped up - implicit, explicit, String
 class or whatever. That's why we all win by agreeing on one standard and
 trying to stick to it. In D now life is peachy in char[] land, slightly
 less peachy in wchar[] and dchar[] land. I don't think there's any way to
 make life peachy for all three cases.
Let's assume implicit transcoding is implemented; why wouldn't that make life peachy in all 3? Regan -- Using M2, Opera's revolutionary e-mail client: http://www.opera.com/m2/
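To make the proposal concrete, this is roughly what implicit transcoding would allow (it is not current D behaviour; the assignments below are the hypothetical part):

char[]  c = "hello";
wchar[] w = c;    // would transcode UTF-8 -> UTF-16 behind the scenes
dchar[] d = w;    // UTF-16 -> UTF-32
c = d;            // and back again, with no explicit toUTF8/toUTF16/toUTF32 calls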
Aug 24 2004
prev sibling parent reply "Walter" <newshound digitalmars.com> writes:
"Arcane Jill" <Arcane_member pathlink.com> wrote in message
news:cgfen9$rhq$1 digitaldaemon.com...
 Phobos UTF-8 code could be faster, I grant you. But perhaps it will be in
the
 next release. We're only talking about a tiny number of functions here,
after
 all.
My first goal with std.utf is to make it work right. Then comes the optimize-the-heck-out-of-it. std.utf is shaping up to be a core dependency for D, so making it run as fast as possible is worthwhile. Any suggestions?
Aug 28 2004
parent reply Arcane Jill <Arcane_member pathlink.com> writes:
In article <cgqju0$2b4$1 digitaldaemon.com>, Walter says...

My first goal with std.utf is to make it work right. Then comes the
optimize-the-heck-out-of-it. std.utf is shaping up to be a core dependency
for D, so making it run as fast as possible is worthwhile. Any suggestions?
It's only really UTF-8 decoding that's complicated. All the rest are pretty easy, even UTF-8 encoding (as I'm sure you know).

The approach I took in the code sample I posted here a while back was to read the first byte (which will be the /only/ byte, in the case of ASCII) and use it as the index into a lookup table. A byte has only 256 possible values - 128 after you've eliminated ASCII chars, and you can look up both the sequence length (or 0 for illegal first-bytes), and the initial value for the (dchar) accumulator. Then you just get six more bits from each of the remaining bytes (after ensuring that the bit pattern is 10xxxxxx). This approach will fail to catch precisely /two/ non-shortest cases, so you have to test for them explicitly. Finally, you make sure that the resulting dchar is not a forbidden value.

(From memory, I think that in some cases your current code checks for errors which can never happen, such as checking for a non-shortest 5+ byte sequence /after/ overlong sequences have already been eliminated).

You could go further. Kris has mentioned that heap allocation is slow. Presumably, you could start off by allocating a single char buffer of length 3*N (if input=wchars) or 4*N (if input=dchars), decoding into it, and then reducing its length. (Of course, the excess then won't be released).

(But never let an invalid input go unnoticed. That would be one optimization too many).

In article <cgqju1$2b4$2 digitaldaemon.com>, Walter says...
 It's not invalid as such, it's just that the return type of an overloaded
 function has to be "covariant" with the return type of the function it's
 overloading. So it's a compile error /now/. But if dchar[] and char[] were
to be
 considered mutually covariant then this would magically start to compile.
That would be nice, but I don't see how to technically make that work.
You're right. It wouldn't work.

Well, this is the sense in which D does have a "default" string type. It is the case where we see clearly that char[] has special privilege. Currently, Object - and therefore /everything/ - defines a function toString() which returns a char[]. It is not possible to overload this with a function returning wchar[], even if wchar[] would be more appropriate. This affects /all/ objects.

What to do about it? Hmmm.... You could change the Object.toString to a set of overloads (sketched below), with the latter two employing implicit conversion for the "s = t;" part. Subclasses of Object which overloaded only toString(out char[]) would get the other three for free. But subclasses of Object which decided to go a bit further could return a wchar[] or a dchar[] directly to cut down on conversions.
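The set of declarations being described might look roughly like this (a guess at the shape only, not the exact code):

char[] toString();                 // as now
void   toString(out char[]  s);
void   toString(out wchar[] s);    // the latter two would do "s = t;" via implicit conversion
void   toString(out dchar[] s);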
Actually, string literals are already interpreted as char[], wchar[], or
dchar[] depending on the context they appear in. The compiler implicitly
does a UTF conversion on them as necessary.
For example, with a dchar[] s, writing s ~ " world" gives the error: incompatible types for ((s) ~ (" world")): 'dchar[]' and 'char[]'. But yes - it works /nearly/ always, and that's cool. That case would be covered by implicit conversion, of course (although that would defer the conversion from compile-time to run-time).
If you have an overload based on
char[] vs wchar[] vs dchar[] and pass a string literal, it should result in
an ambiguity error.
Ah! That's the bit I didn't know. I was wondering how that context thing would work, given that signature matching happens /after/ the evaluation of the function's arguments' types. You could fix this by allowing explicit UTF-8, UTF-16 and UTF-32 literals. Sean suggested c"", w"" and d"" (and similarly for char literals). That would fix it.
 UTF-16 is not slower than UTF-8, even for pure ASCII. An app in which all
text
 is ASCII is going to be just as fast in UTF-16 as it is in ASCII.
While the rest of your post has a great deal of merit, this bit here is just not true. A pure ASCII app written using UTF-16 will consume twice as much memory for its data, and there are a lot of operations on that data that will be correspondingly half as fast.
That's true. Guess I got a bit carried away there. I was thinking that statements like "c = *p++;" would compile to just one machine code instruction regardless of the data width, and that the byte-wide version wouldn't necessarily be the fastest. But I forgot about all the initializing and copying that you also have to do. Arcane Jill
Aug 28 2004
parent reply J C Calvarese <jcc7 cox.net> writes:
Arcane Jill wrote:
 In article <cgqju0$2b4$1 digitaldaemon.com>, Walter says...
...
 In article <cgqju1$2b4$2 digitaldaemon.com>, Walter says...
 
It's not invalid as such, it's just that the return type of an overloaded
function has to be "covariant" with the return type of the function it's
overloading. So it's a compile error /now/. But if dchar[] and char[] were
to be
considered mutually covariant then this would magically start to compile.
That would be nice, but I don't see how to technically make that work.
You're right. It wouldn't work. Well, this is the sense in which D does have a "default" string type. It is the case where we see clearly that char[] has special privilege. Currently, Object - and therefore /everything/ - defines a function toString() which returns a char[]. It is not possible to overload this with a function returning wchar[], even if wchar[] would be more appropriate. This affects /all/ objects. What to do about it? Hmmm.... You could change the Object.toString to: with the latter two employing implicit conversion for the "s = t;" part. Subclasses of Object which overloaded only toString(out char[]) would get the other three for free. But subclasses of Object which decided to go a bit further could return a wchar[] or a dchar[] directly to cut down on conversions.
Could we ditch toString and replace the functionality with:

toUtf8(), toUtf16(), and toUtf32()

or

toCharStr(), toWCharStr(), and toDCharStr()

Usually, the person writing the object could define one and the other two would call conversions. There's probably some reason why this wouldn't work, but it's just such a pleasant idea to me that I was forced to share it.

-- 
Justin (a/k/a jcc7)
http://jcc_7.tripod.com/d/
Aug 28 2004
parent reply Ben Hinkle <bhinkle4 juno.com> writes:
J C Calvarese wrote:

 Arcane Jill wrote:
 In article <cgqju0$2b4$1 digitaldaemon.com>, Walter says...
...
 In article <cgqju1$2b4$2 digitaldaemon.com>, Walter says...
 
It's not invalid as such, it's just that the return type of an
overloaded function has to be "covariant" with the return type of the
function it's overloading. So it's a compile error /now/. But if dchar[]
and char[] were
to be
considered mutually covariant then this would magically start to
compile.
That would be nice, but I don't see how to technically make that work.
You're right. It wouldn't work. Well, this is the sense in which D does have a "default" string type. It is the case where we see clearly that char[] has special privilege. Currently, Object - and therefore /everything/ - defines a function toString() which returns a char[]. It is not possible to overload this with a function returning wchar[], even if wchar[] would be more appropriate. This affects /all/ objects. What to do about it? Hmmm.... You could change the Object.toString to: with the latter two employing implicit conversion for the "s = t;" part. Subclasses of Object which overloaded only toString(out char[]) would get the other three for free. But subclasses of Object which decided to go a bit further could return a wchar[] or a dchar[] directly to cut down on conversions.
Could we ditch toString and replace the functionality with: toUtf8(), toUtf16(), and toUtf32() or toCharStr(), toWCharStr(), and toDCharStr() Usually, the person writing the object could define one and the other two would call conversions. There's probably some reason why this wouldn't work, but it's just such a pleasant idea to me that I was forced to share it.
Why is toString such a hot topic anyway? In Java end users hardly ever see the result of a toString. In D I can see things like toString(int) and toString(double) being seen by users but in general Foo.toString should just give a summary of the object - preferably short and easy to transcode. I wouldn't use toString for things like getting user strings out of text fields in a GUI or reading from a file. For those cases I would use another function name like TextBox.getText or File.readStringW. Those other functions can have char[] versions and wchar[] versions as desired. As an example of a "bad" toString see std.stream.Stream.toString. It will usually create a huge string. For classes like AJ's arbitrary sized Int toString should return the whole integer since that is the best summary of the object. So we should still allow toString to return arbitrarily long strings - we just need to be careful how toString is used.
Aug 28 2004
parent reply Arcane Jill <Arcane_member pathlink.com> writes:
In article <cgra2k$udg$1 digitaldaemon.com>, Ben Hinkle says...

Why is toString such a hot topic anyway?
Ask yourself why toString() exists at all. What does D use it for? If it's an unnecessary, hardly-used function, then it should be removed from Object, because this is OOP: if something doesn't make sense for all Objects then it should not be defined for all Objects. On the other hand, if it /is/ necessary for all objects, it shouldn't be biased one way or the other.
but in general Foo.toString should
just give a summary of the object
Of the object's /value/, yes. So toString() only makes sense for objects which actually /have/ a value. I'm not sure if streams can be said to have a "value" in the sense that Ints do, so maybe it shouldn't be defined at all for streams.
As an
example of a "bad" toString see std.stream.Stream.toString. It will usually
create a huge string.
Yes. Now I'm starting to wonder what toString() is actually for, and whether implementing a three-function interface (Stringizable?) might be better than inheriting from Object.
For classes like AJ's arbitrary sized Int toString should return the whole
integer since that is the best summary of the object. So we should still
allow toString to return arbitrarily long strings - we just need to be
careful how toString is used.
Let's go back to Walter on this one. Walter - why does Object have a toString() function? In what way does D require or rely on it? How badly would D be affected if it didn't exist at all or if it were an interface? Jill
Aug 28 2004
next sibling parent reply "Walter" <newshound digitalmars.com> writes:
"Arcane Jill" <Arcane_member pathlink.com> wrote in message
news:cgrsnq$168a$1 digitaldaemon.com...
 Let's go back to Walter on this one. Walter - why does Object have a
toString()
 function? In what way does D require or rely on it? How badly would D be
 affected if it didn't exist at all or if it were an interface?
It's so when you pass an object to writef(), there's a way that it can be printed. But Ben is right, I don't see Object.toString() being used to generate very large strings, so any transcoding of it isn't going to be expensive overall.
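For example (a minimal sketch; the class is made up):

import std.stdio;

class Account
{
    char[] owner;

    this(char[] owner) { this.owner = owner; }

    char[] toString() { return "Account(" ~ owner ~ ")"; }
}

void main()
{
    // writefln falls back on Object.toString() to print the object
    writefln(new Account("jill"));
}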
Aug 29 2004
parent reply Matthias Becker <Matthias_member pathlink.com> writes:
news:cgrsnq$168a$1 digitaldaemon.com...
 Let's go back to Walter on this one. Walter - why does Object have a
toString()
 function? In what way does D require or rely on it? How badly would D be
 affected if it didn't exist at all or if it were an interface?
It's so when you pass an object to writef(), there's a way that it can be printed. But Ben is right, I don't see Object.toString() being used to generate very large strings, so any transcoding of it isn't going to be expensive overall.
I don't get it :( Why isn't it possible to use a Stringizeable interface instead? -- Matthias Becker
Aug 29 2004
parent "Walter" <newshound digitalmars.com> writes:
"Matthias Becker" <Matthias_member pathlink.com> wrote in message
news:cgs9b4$1c0h$1 digitaldaemon.com...
news:cgrsnq$168a$1 digitaldaemon.com...
 Let's go back to Walter on this one. Walter - why does Object have a
toString()
 function? In what way does D require or rely on it? How badly would D
be
 affected if it didn't exist at all or if it were an interface?
It's so when you pass an object to writef(), there's a way that it can be printed. But Ben is right, I don't see Object.toString() being used to generate very large strings, so any transcoding of it isn't going to be expensive overall.
I don't get it :( Why isn't it possible to use a Stringizeable interface instead?
It is possible. But I think every Object should have some basic functionality, and one of those is to be able to have itself pretty-printed. This is also nice for a potential D debugger - it can take advantage of toString() to produce a user-friendly representation of the class data.
Aug 30 2004
prev sibling next sibling parent Ben Hinkle <bhinkle4 juno.com> writes:
Arcane Jill wrote:

 In article <cgra2k$udg$1 digitaldaemon.com>, Ben Hinkle says...
 
Why is toString such a hot topic anyway?
Ask yourself why toString() exists at all. What does D use it for?
Mostly so there's an easy way to print an object out. One drawback with D's Object.toString is that the default implementation doesn't print the object's address (or in Java's case the hash code) so you can't distinguish one object from another. If anything I'd like to see some guidelines for toString so that the output is consistent across D. For example if the object doesn't have an obvious string representation, like a class Foo{int n; double d;} then the result of toString should have the form "[Foo n:0, d:0.0]" - possibly include the address or hash code in there. I think this is basically the format Java uses but I can't remember exactly. In general toString should avoid newlines.
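A toString following that guideline might look like this (assuming std.string.toString overloads for the field types; substitute whatever formatting helpers are available):

import std.string;

class Foo
{
    int    n;
    double d = 0.0;

    char[] toString()
    {
        return "[Foo n:" ~ std.string.toString(n) ~ ", d:" ~ std.string.toString(d) ~ "]";
    }
}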
 If it's an unnecessary, hardly-used function, then it should be removed
 from Object, because this is OOP, if something doesn't make sense for all
 Objects then it should not be defined for all Objects.
 
 On the other hand, if it /is/ necessary for all objects, it shouldn't be
 biased one way or the other.
It isn't necessary but it is nice to have around. Does it make sense for all objects? I guess that depends on one's viewpoint of "makes sense". I think printing the class and hash code makes sense. Maybe others don't.
but in general Foo.toString should
just give a summary of the object
Of the object's /value/, yes. So toString() only makes sense for objects which actually /have/ a value. I'm not sure if streams can be said to have a "value" in the sense that Ints do, so maybe it shouldn't be defined at all for streams.
See above comments about making sense and the default toString format.
As an
example of a "bad" toString see std.stream.Stream.toString. It will
usually create a huge string.
Yes. Now I'm starting to wonder what toString() is actually for, and whether implementing a three-function interface (Stringizable?) might be better than inheriting from Object.
That's an option, but it adds more work for users to make something stringizable - two of the stringizable functions will call the "real" one and wrap the result in toUTF8 etc. Is it worth it to do all that just for debugging?
For classes like AJ's arbitrary sized Int toString should return the whole
integer since that is the best summary of the object. So we should still
allow toString to return arbitrarily long strings - we just need to be
careful how toString is used.
Let's go back to Walter on this one. Walter - why does Object have a toString() function? In what way does D require or rely on it? How badly would D be affected if it didn't exist at all or if it were an interface? Jill
Aug 29 2004
prev sibling parent reply Berin Loritsch <bloritsch d-haven.org> writes:
Arcane Jill wrote:

 In article <cgra2k$udg$1 digitaldaemon.com>, Ben Hinkle says...
 
 
Why is toString such a hot topic anyway?
Ask yourself why toString() exists at all. What does D use it for? If it's an unnecessary, hardly-used function, then it should be removed from Object, because this is OOP, if something doesn't make sense for all Objects then it should not be defined for all Objects. On the other hand, if it /is/ necessary for all objects, it shouldn't be biased one way or the other.
There are a couple of uses I use toString() for (in Java apps):

1: debugging. Some useful info is derived when going through a graphical debugger (hovering over a variable will yield its toString() value).

2: easier inclusion in graphical lists. I might have an object encapsulating the ISO-3166 country codes (long name and id), but use the toString() to show the long name. That way in the drop down list I have everything I need to serialize the info to the DB.

But that's just Java.... D doesn't have decent IDE integration for this operation.
Aug 30 2004
parent reply Dave <Dave_member pathlink.com> writes:
In article <cgva1p$2gep$1 digitaldaemon.com>, Berin Loritsch says...
Arcane Jill wrote:

 In article <cgra2k$udg$1 digitaldaemon.com>, Ben Hinkle says...
 
 
Why is toString such a hot topic anyway?
Ask yourself why toString() exists at all. What does D use it for? If it's an unnecessary, hardly-used function, then it should be removed from Object, because this is OOP, if something doesn't make sense for all Objects then it should not be defined for all Objects. On the other hand, if it /is/ necessary for all objects, it shouldn't be biased one way or the other.
There are a couple of uses I use toString() for (in Java apps): 1: debugging. Some useful info is derived when going through a graphical debugger (hovering over a variable will yield its toString() value). 2: easier inclusion in graphical lists. I might have an object encapsulating the ISO-3166 country codes (long name and id), but use the toString() to show the long name. That way in the drop down list I have everything I need to serialize the info to the DB. But that's just Java.... D doesn't have decent IDE integration at this operation.
(Also referring to the 'most European chars. are ASCII' post just ahead of this one)

And because of their likely small size, even if they are filled with non-ASCII characters, neither of those uses will realistically cause a performance bottleneck if toString() is UTF-8 by 'default', right?

No matter how efficient the memory management system is, if the 'default' is UTF-16 rather than UTF-8, most apps. will have to carry that extra de/allocation & initialization burden for no reason other than expediency or ignorance on the programmer's part. Often twice the work and 1/2 the available memory for many of the same jobs.

UTF-16 used everywhere is probably one reason why heavy-use Java server apps. have the reputation as 'memory-thrashers'. And those runtimes have the benefit of many man-years of development, use-cases and experimentation behind them.

Since UTF-8 is the most efficient and adequate for most of D's currently foreseeable uses, I say leave as is and put the 'burden' on the software that needs or benefits from other than UTF-8. IMO, D really needs to be a general performance winner in order to get a toe-hold in the current market.
Aug 30 2004
parent "Walter" <newshound digitalmars.com> writes:
"Dave" <Dave_member pathlink.com> wrote in message
news:ch033f$2u7u$1 digitaldaemon.com...
 UTF-16 used everywhere is probably one reason why heavy-use Java server
apps.
 have the reputation as 'memory-thrashers'. And those runtimes have the
benefit
 of many man-years of development, use-cases and experimentation behind
them. Many of Java's problems have gradually declined over time due to massive research and development efforts by a lot of very, very smart people. For example, Andy King has just pointed me to some research done by several people that focussed on improving some minor aspects of the Java garbage collector. D doesn't have a billion dollar budget <g>, and simply cannot afford to have problems that need such budgets to find solutions for.
 IMO, D really needs to be a general performance winner in order to get a
 toe-hold in the current market.
Right. And if D acquires an early reputation for being 'slow', it will never have a chance. Currently, D is *faster* than C++ on string benchmarks (see www.digitalmars.com/d/cppstrings.html), and it must be that way, as that seriously blunts criticisms aimed at D for the way it does strings relative to C++.
Aug 30 2004
prev sibling parent reply Regan Heath <regan netwin.co.nz> writes:
On Mon, 23 Aug 2004 23:05:45 -0700, antiAlias <fu bar.com> wrote:
 "Regan Heath" <regan netwin.co.nz> wrote in message
 news:opsc7u56zf5a2sq9 digitalmars.com...
 On Mon, 23 Aug 2004 18:29:54 -0700, antiAlias <fu bar.com> wrote:
 "Walter" <newshound digitalmars.com> wrote in message ...
 "Regan" wrote:
 There are two ways of looking at this, on one hand you're saying 
they
 should all 'paint' as that is consistent. However, on the other I'm
saying
 they should all produce a 'valid' result. So my argument here is 
that
 when
 you cast you expect a valid result, much like casting a float to an
int
 does not just 'paint'.

 I am interested to hear your opinions on this idea.
I think your idea has a lot of merit. I'm certainly leaning that way.
On the one hand, this would be well served by an opCast() and/or opCast_r() on the primitive types; just the kind of thing suggested in a related thread (which talked about overloadable methods for primitive types).
True. However, what else should the opCast for char[] to dchar[] do except transcode it? What about opCast for int to dchar.. it seems to me there is only 1 choice about what to do, anything else would be operator abuse.
 On the other hand, we're talking transcoding here. Are you gonna' 
limit
 this
 to UTF-8 only?
I hope not, I am hoping to see:

         | UTF-8 | UTF-16 | UTF-32
---------+-------+--------+-------
  UTF-8  |   -   |   +    |   +
  UTF-16 |   +   |   -    |   +
  UTF-32 |   +   |   +    |   -

(+ indicates transcoding occurs)
======================== And what happens when just one additional byte-oriented encoding is introduced? Perhaps UTF-7? Perhaps EBCDIC? The basic premise is flawed because there's no flexibility.
--------------------------------------- You'll have to check with Walter but I believe he has no plans to add another basic type to hold any specific encoding. Encodings other than UTF-x will be done with library functions and ubyte[], ushort[], uint[], ulong[].
 Then, since the source and destination will typically be of
 different sizes, do you then force all casts between these types to 
have
 the
 destination be an array reference rather than an instance?
I don't think transcoding makes any sense unless you're talking about a 'string' (i.e. char[], wchar[], or dchar[]) as opposed to a UTF-x fragment (i.e. char, wchar)
======================== We /are/ talking about arrays. Perhaps if that sentence had ended "array reference rather than an *array* instance?", it might have been more clear?
Or maybe I just mis-read or misunderstood it :)
 The point being made is that you would not be able to do anything like 
 this:

 char[15] dst;
 dchar[10] src;

 dst = cast(char[]) src;

 because there's no ability via a cast() to indicate how many items from 
 src
 were converted, and how many items in dst were populated. You are forced
 into this kind of thing:

 char[] dst;
 dchar[10] src;

 dst = cast(char[]) src;
--------------------------------------- How is that different to what we have to do now? dst = toUTF8(src); ?
 You see the distinction? It may be subtle to some, but it's a glaring
 imbalance to others. The lValue must always be a reference because it's
 gonna' be allocated dynamically to ensure all of the rValue will fit. In 
 the
 end, it's just far better to use functions/methods that provide the 
 feedback
 required so you can actually control what's going on (or such that a 
 library
 function can). That way, you're not restricted in the same fashion. We 
 don't
 need more asymmetry in D, and this just reeks of poor design, IMO.
Correct me if I'm wrong, but you're suggesting the use of a library function like this...

bool toUTF8(char[] dst, dchar[] src) {}

or similar, where the caller passes a buffer for the result, and has full control of the size/location/allocation etc. of that buffer, correct?

1. The implicit casting idea does not prevent this.

2. The above function doesn't exist currently.. instead we have

char[] toUTF8(dchar[] src) {}

which is identical to what an implicit cast would do.
 To drive this home, consider the decoding version (rather than the 
 encoding
 above):

 char[15] src;
 dchar[] dst;

 dst = cast(dchar[]) src;

 What happens when there's a partial character left undecoded at the end 
 of
 'src'?
How is that even possible? Assuming src is a 'valid' UTF-8 sequence, all the characters can and will be encoded into UTF-32 producing a 'valid' UTF-32 sequence. If 'src' _is_ invalid then an exception will be thrown much like the ones we currently have for invalid UTF sequences.
 There nothing here to tell you that you've got a dangly bit left at
 the end of the souce-buffer. It's gone. Poof! Any further decoding from 
 the
 same file/socket/whatever is henceforth trashed, because the ball has 
 been
 both dropped and buried. End of story.


 Having 3 types requiring manual transcoding between them _is_ a pain.
======================== It certainly is. That's why other languages try to avoid it at all costs. Having it done "generously" by the compiler is also a pain, inflexible, and likely expensive.
I'd argue that you're wrong about all but the last assertion above. Having it done by the compiler would not be:

1. a 'pain' - you wouldn't notice, and if you did and did not desire the behaviour you can manually convert just like you have to do now.

2. 'inflexible' - this idea does not preclude you doing things another way. It simply provides a default, which IMO is the sensible/correct thing to do.

It is however more 'expensive' than the current situation, but it's no more expensive than doing the conversion manually, which is what you currently have to do..
 There are many things a programmer should take
 responsibility for; transcoding comes under that umbrella because (a) 
 there
 can be subtle complexity involved and (b) it is relatively expensive to
 churn through text and convert it; particularly so with the Phobos utf-8
 code.

 What you appear to be suggesting is that this kind of thing should happen
 silently whilst one nonchalantly passes arguments around between methods.
--------------------------------------- Yes.
 That's insane, so I hope that's not what you're advocating. Java, for
 example, does that at one specific layer (I/O)
--------------------------------------- Which it can do because it only has one string type.
 , but you're apparently
 suggesting doing it at any old place! And several times over, just in 
 case
 it wasn't good enough the first time :-)
Yes, as we only want to go to an inefficient type temporarily, e.g. Walter's comment: "dchars are very convenient to work with, however, and make a great deal of sense as the temporary common intermediate form of all the conversions. I stress the temporary, though, as if you keep it around you'll start to notice the slowdowns it causes." reflects exactly what implicit conversion will give you: the ability to store things in memory in the format you think is most efficient and dip in and out of utf-32 _if_ required. utf-32 is not 'required' for anything, it's simply 'convenient' for certain things; as AJ has frequently pointed out, you can encode every Unicode character in all 3 formats UTF-8, UTF-16 and UTF-32.
 Sorry man. This is inane. D is
 /not/ a scripting language; instead it's supposed to be a systems 
 language.
--------------------------------------- Which is why having a native UTF-8 type is so useful, and why being able to 'temporarily' implicitly convert to utf-32 for convenience while storing in the most space efficient format is so useful.
 Besides, if IUC were adopted, less
 people would have to worry about the distinction anyway.
The same could be said for the implicit transcoding from one type to the other.
======================== That's pretty short-sighted IMO. You appear to be saying that implicit transcoding would take the place of ICU; terribly misleading.
--------------------------------------- Not at all. I implied this statement: "if <implicit transcoding> were adopted, less people would have to worry about the distinction anyway" or tried to.
 Transcoding is
 just a very small part of that package. Please try to reread the comment 
 as
 "most people would be shielded completely by the library functions,
 therefore there's far fewer scenarios where they'd ever have a need to 
 drop
 into anything else".

 This would be a very GoodThing for D users. Far better to have a good
 library to take case of /all/ this crap than have the D language do some
 partial-conversions on the fly, and then quit because it doesn't know 
 how to
 provide any further functionality. This is the classic
 core-language-versus-library-functionality bitchfest all over again.
 Building all this into a cast()? Hey! Let's make Walter's Regex class 
 part
 of the compiler too; and make it do UTF-8 decoding /while/ it's 
 searching,
 since you'll be able to pass it a dchar[] that will be generously 
 converted
 to the accepted char[] for you "on-the-fly".

 Excuse me for jesting, but perhaps the Bidi text algorithms plus
 date/numeric formatting & parsing will all fit into a single operator 
 also?
 That's kind of what's being suggested. I believe there's a serious
 underestimate of the task being discussed.
This is an over-exaggeration, perhaps you should give Jim Carrey acting lessons <g>
 Believe me when I say that Mango would dearly love to go dchar[] only.
 Actually, it probably will at the higher levels because it makes life
 simple
 for everyone. Oh, and I've been accused many times of being an
efficiency
 fanatic, especially when it comes to servers. But there's always a
 tradeoff
 somewhere. Here, the tradeoff is simplicity-of-use versus quantities 
of
 RAM.
 Which one changes dramatically over time? Hmmmm ... let me see now ...
 64bit-OS for desktops just around corner?
What you really mean is you'd dearly love to not have to worry about the differences between the 3 types; implicit transcoding will give you that. Furthermore it's simplicity without sacrificing RAM.
======================== Ahh Thanks. I didn't realize that's what I "really meant". Wait a minute ...
--------------------------------------- Forgive me for sharing my interpretation of what you said.
 Even on an embedded device I'd probably go "dchar only" regarding 
I18N.
 Simply because the quantity of text processed on such devices is very
 limited. Before anyone shoots me over this one, I regularly write code
 for
 devices with just 4KB RAM ~ still use 16bit chars there when dealing
with
 XML input.
Were you to use implicit transcoding you could store the data in memory in UTF-8 or UTF-16, then only transcode to UTF-32 when required; this would be more efficient.
That's a rather large assumption, don't you think? More efficient? In which particular way? Is memory usage or CPU usage more important in /my/ particular applications? Please either refrain, or commit to rewriting all my old code more efficiently for me ... for free <g>
As stated earlier this concept does not stop you from optimising your app in any way, shape or form; it's simply a sensible default behaviour IMO. Regan -- Using M2, Opera's revolutionary e-mail client: http://www.opera.com/m2/
Aug 24 2004
parent reply "antiAlias" <fu bar.com> writes:
"Regan Heath" <regan netwin.co.nz> wrote ..
 What happens when there's a partial character left undecoded
 at the end of  'src'?
--------------------------------------- How is that even possible?
It happens all the time with streamed input. However, as AJ pointed out, neither you nor Walter are apparently suggesting that the cast() approach be used for anything other than trivial conversions. That is, one would not use this approach with respect to IO streaming. I had the (distinctly wrong) impression this implied-conversion was intended to be a jack-of-all-trades.

Everything else in the post is therefore cast(void) ~ so let's stop wasting our breath :)

If these implicit conversions are put in place, then I respectfully suggest the std.utf functions be replaced with something that avoids fragmenting the heap in the manner they currently do (for non Latin-1); and it's not hard to make them an order-of-magnitude faster, too.

Finally, there are still the problems related to string-concatenation and toString(), as described toward the end of this post news:cgg1mi$14l2$1 digitaldaemon.com
Aug 24 2004
next sibling parent Regan Heath <regan netwin.co.nz> writes:
On Tue, 24 Aug 2004 19:29:24 -0700, antiAlias <fu bar.com> wrote:
 "Regan Heath" <regan netwin.co.nz> wrote ..
 What happens when there's a partial character left undecoded
 at the end of  'src'?
--------------------------------------- How is that even possible?
It happens all the time with streamed input.
Ahhh.. I get it, you were referring to not having all the input at one time, with some being left in the 'stream'.. I can see your concern now.
 However, as AJ pointed out,
 neither you nor Walter are apparently suggesting that the cast() 
 approach be
 used for anything other than trivial conversions.
Correct, the cases where the current approach can actually create a bug, a bug that only sometimes happens.
 That is, one would not use
 this approach with respect to IO streaming. I had the (distinctly wrong)
 impression this implied-conversion was intended to be a 
 jack-of-all-trades.

 Everything else in the post is therefore cast(void)  ~  so let's stop
 wasting our breath :)
Yay :)
 If these implicit conversions are put in place, then I respectfully 
 suggest
 the std.utf functions be replaced with something that avoids fragmenting 
 the
 heap in the manner they currently do (for non Latin-1); and it's not 
 hard to
 make them an order-of-magnitude faster, too.
Good idea. If it's done it has to be done as efficiently as possible.
 Finally; there's still the problems related to string-concatentation and
 toString(), as described toward the end of this post
Yep. I think the toString restriction should be lifted; with implicit transcoding any of the string types should be valid. I am still concerned about the number of transcoding operations that might occur in an unsuspecting programmer's string concatenation... Regan -- Using M2, Opera's revolutionary e-mail client: http://www.opera.com/m2/
Aug 24 2004
prev sibling parent reply Sean Kelly <sean f4.ca> writes:
In article <cggtcq$1ird$1 digitaldaemon.com>, antiAlias says...
"Regan Heath" <regan netwin.co.nz> wrote ..
 What happens when there's a partial character left undecoded
 at the end of  'src'?
--------------------------------------- How is that even possible?
It happens all the time with streamed input. However, as AJ pointed out, neither you nor Walter are apparently suggesting that the cast() approach be used for anything other than trivial conversions. That is, one would not use this approach with respect to IO streaming. I had the (distinctly wrong) impression this implied-conversion was intended to be a jack-of-all-trades.
My modified version of std.utf is meant to address the streaming issue. Basically, I added versions of encode and decode that accept as the source or destination hook. Not perfect perhaps, but it does get around the problem of encode/decode wanting to throw an exception if they encounter an invalid sequence.
If these implicit conversions are put in place, then I respectfully suggest
the std.utf functions be replaced with something that avoids fragmenting the
heap in the manner they currently do (for non Latin-1); and it's not hard to
make them an order-of-magnitude faster, too.
Then by all means do so :) Sean
Aug 25 2004
next sibling parent Sean Kelly <sean f4.ca> writes:
In article <cgidgk$28mg$1 digitaldaemon.com>, Sean Kelly says...
Basically, I added versions of encode and decode that accept as the source or
destination hook.
"Accept a delegate." Sean
Aug 25 2004
prev sibling parent reply Arcane Jill <Arcane_member pathlink.com> writes:
In article <cgidgk$28mg$1 digitaldaemon.com>, Sean Kelly says...

If these implicit conversions are put in place, then I respectfully suggest
the std.utf functions be replaced with something that avoids fragmenting the
heap in the manner they currently do (for non Latin-1); and it's not hard to
make them an order-of-magnitude faster, too.
Then by all means do so :) Sean
Some speed-up ideas...

I posted a potentially speedier version of UTF-8 decode here a while back. The basic algorithm I used was this: get the first byte; if it's ASCII, return it; else use it as an index into a lookup table to get the sequence length. There's slightly more to it than that, obviously, but that was the basis. Walter wanted to know if there were any standard tests to check whether a UTF-8 function works correctly. I didn't know of any.

The big difficulty with UTF-8 is that of being fully Unicode conformant. This is poorly understood, so people are often tempted to take shortcuts. The std.utf functions take no shortcuts and so are conformant.

The gist is this, however. You can have two different kinds of UTF-8 decode routine - checked or unchecked. A checked function will ensure that the input contains no invalid sequences (non-shortest sequences are always invalid), and will throw an exception (or otherwise report the error) if that's not the case. Checked decoders can be made fully conformant, but the checking can slow you down. Unchecked decoders, on the other hand, simply /assume/ that the input is valid, and produce garbage if it isn't. Unchecked decoders can be made to go a lot faster, but they are not Unicode conformant ... unless of course you *KNOW* with 100% certainty that the input *IS* valid. (Without this knowledge, your application won't be Unicode conformant, and can actually be a security risk.)

So, it would be possible to write a fast, unchecked UTF-8 decoder, if you made use of D's Design by Contract. If you validate the string in the function's "in" block, then you can assume valid input in the function body, and thereby go faster (at least in a release build). But watch out for coding errors. The caller *MUST* fulfil that contract, or you have a bug. And you'd still need to have a checked UTF-8 decoder for those cases when you're not sure where the input came from.

Being able to distinguish between sequences which have already been validated, and those which have not, can buy you a lot of efficiency. Unfortunately, I don't see how D can take advantage of that. If a D string were a class or a struct, then it could have a class invariant - but D strings are just simple arrays, and constructing invalid UTF-8 arrays is all too easy.

Arcane Jill
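To make the design-by-contract idea concrete, here is a minimal sketch (assuming the caller hands in valid UTF-8 and an index that points at a lead byte; a lookup table on the lead byte, as described above, would be faster still):

    import std.utf;    // for validate()

    // Sketch of an unchecked decoder guarded by a contract: the in-block
    // does the expensive validation, and in a release build it vanishes,
    // leaving only the check-free fast path.
    dchar uncheckedDecode(char[] s, ref size_t idx)
    in
    {
        std.utf.validate(s);       // throws if s contains any invalid sequence
        assert(idx < s.length);
    }
    body
    {
        char c = s[idx];
        if (c < 0x80)              // ASCII: return it directly
        {
            ++idx;
            return c;
        }

        // lead byte gives the sequence length; no validity checks needed here
        size_t len = (c & 0xE0) == 0xC0 ? 2 :
                     (c & 0xF0) == 0xE0 ? 3 : 4;
        uint d = c & (0x7F >> len);
        foreach (char t; s[idx + 1 .. idx + len])
            d = (d << 6) | (t & 0x3F);
        idx += len;
        return cast(dchar) d;
    }

In a debug build the contract catches violations; in a release build it disappears, which is exactly the "validated once, trusted thereafter" split described above.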
Aug 25 2004
parent reply "antiAlias" <fu bar.com> writes:
The decoding thing has issues as you point out Jill.

That's not the whole story, however. The Mango utf-8 Encoder is 5 to 15
times faster than the Phobos one, while executing the equivalent algorithm.
The CheckedDecoder is 5 to 15 times faster, whilst the UncheckedDecoder is
10 to 30 times faster (all  variances are due to alternate mixes of char,
wchar, dchar; all timings performed on a P3).

These are rather significant differences. I think it's safe to say that the
Phobos routines were "not written with efficiency in mind". Either that, or
Mango has some secret means of warping time ...



"Arcane Jill" <Arcane_member pathlink.com> wrote in message
news:cgih09$2ak7$1 digitaldaemon.com...
 In article <cgidgk$28mg$1 digitaldaemon.com>, Sean Kelly says...

If these implicit conversions are put in place, then I respectfully
suggest
the std.utf functions be replaced with something that avoids fragmenting
the
heap in the manner they currently do (for non Latin-1); and it's not
hard to
make them an order-of-magnitude faster, too.
Then by all means do so :) Sean
Some speed-up ideas... I posted a potentially speedier version of UTF-8 decode here a while back.
The
 basic algorithm I used was this: get the first byte; if it's ASCII, return
it;
 else use it as an index into a lookup table to get the sequence length.
There's
 slightly more to it than that, obviously, but that was the basis. Walter
wanted
 to know if there were any standard tests to check whether a UTF-8 function
works
 correctly. I didn't know of any.

 The big difficulty with UTF-8 is that of being fully Unicode conformant.
This is
 poorly understood, so people are often tempted to make shortcuts. The
std.utf
 functions take no shortcuts and so are conformant.

The gist is this, however. You can have two different kinds of UTF-8
decode
 routine - checked or unchecked. A checked function will ensure that the
input
 contains no invalid sequences (non-shortest sequences are always invalid),
and
 will throw an exception (or otherwise report the error) if that's not the
case.
 Checked decoders can be made fully conformant, but the checking can slow
you
 down.

 Unchecked decoders, on the other hand, simply /assume/ that the input is
valid,
 and produce garbage if it isn't. Unchecked decoders can be made to go a
lot
 faster, but they are not Unicode conformant ... unless of course you
*KNOW* with
 100% certainty that the input *IS* valid. (Without this knowledge, your
 application won't be Unicode conformant, and can actually be a security
risk).
 So, it would be possible to write a fast, unchecked UTF-8 decoder, if you
made
 use of D's Design by Contract. If you validate the string in the
function's "in"
 block, then you can assume valid input in the function body, and thereby
go
 faster (at least in a release build). But watch out for coding errors. The
 caller *MUST* fulfil that contract, or you have a bug. And you'd still
need to
 have a checked UTF-8 decoder for those cases when you're not sure where
the
 input came from.

 Being able to distinguish between sequences which have already been
validated,
 and those which have not, can buy you a lot of efficiency. Unfortunately,
I
 don't see how D can take advantage of that. If a D string were a class or
a
 struct, then it could have a class invariant - but D strings are just
simple
 arrays, and constructing invalid UTF-8 arrays is all too easy.

 Arcane Jill
Aug 25 2004
parent reply pragma <pragma_member pathlink.com> writes:
In article <cgij3k$2bon$1 digitaldaemon.com>, antiAlias says...
The decoding thing has issues as you point out Jill.

That's not the whole story, however. The Mango utf-8 Encoder is 5 to 15
times faster than the Phobos one, while executing the equivalent algorithm.
The CheckedDecoder is 5 to 15 times faster, whilst the UncheckedDecoder is
10 to 30 times faster (all  variances are due to alternate mixes of char,
wchar, dchar; all timings performed on a P3).
Holy crap, what kind of data are you throwing at it? I don't mean to criticize, but there must be some clever coding on your part or some serious loopholes in the algorithm to get that kind of an improvement. :)
These are rather significant differences. I think it's safe to say that the
Phobos routines were "not written with efficiency in mind". Either that, or
Mango has some secret means of warping time ...
In the case of the latter, may we rename the "Mango" to "Tardis"? -Pragma [[ EricAnderton at (daleks can't use stairs) yahoo.com ]]
Aug 25 2004
parent reply "antiAlias" <fu bar.com> writes:
<g> That's funny.

No loopholes; no clever coding. Just take a look at (for example) what
utf.encode does to the heap. Those order-of-magnitude timings are best-case
for Phobos ~ in a busy server environment they'd be even slower, likely
cause notable heap fragmentation, and persistently lock the heap against
other (much more appropriate) usage by other threads. Imagine multiple
threads doing implicit dchar[] conversions via utf.encode?
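For illustration only (this is a sketch, not the Mango code): an encoder that writes into a fixed stack buffer and hands each fragment to a caller-supplied sink never needs to touch the heap at all, unlike a conversion that allocates and returns a new array on every call.

    // Sketch: heap-free UTF-8 encoding via a caller-supplied sink delegate.
    // (Checks for surrogates and out-of-range values are omitted for brevity.)
    void encodeTo(dchar c, void delegate(char[] fragment) sink)
    {
        char[4] buf;      // scratch space on the stack
        size_t n;

        if (c < 0x80)
        {
            buf[0] = cast(char) c;
            n = 1;
        }
        else if (c < 0x800)
        {
            buf[0] = cast(char)(0xC0 | (c >> 6));
            buf[1] = cast(char)(0x80 | (c & 0x3F));
            n = 2;
        }
        else if (c < 0x10000)
        {
            buf[0] = cast(char)(0xE0 | (c >> 12));
            buf[1] = cast(char)(0x80 | ((c >> 6) & 0x3F));
            buf[2] = cast(char)(0x80 | (c & 0x3F));
            n = 3;
        }
        else
        {
            buf[0] = cast(char)(0xF0 | (c >> 18));
            buf[1] = cast(char)(0x80 | ((c >> 12) & 0x3F));
            buf[2] = cast(char)(0x80 | ((c >> 6) & 0x3F));
            buf[3] = cast(char)(0x80 | (c & 0x3F));
            n = 4;
        }

        // the slice points into the stack buffer, so the sink must use it
        // (write it out, copy it) before returning
        sink(buf[0 .. n]);
    }

The sink might write straight to a socket, or append into a buffer the caller sized up front; either way the GC never gets involved.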

Because there's no supported means of resolving such things, one becomes
inclined to simply 'reimplement' instead of pooling one's skills and
resources to fix Phobos.

This is exactly the kind of thing the DSLG should take care of.



"pragma" <pragma_member pathlink.com> wrote in message
news:cgin45$2djh$1 digitaldaemon.com...
 In article <cgij3k$2bon$1 digitaldaemon.com>, antiAlias says...
The decoding thing has issues as you point out Jill.

That's not the whole story, however. The Mango utf-8 Encoder is 5 to 15
times faster than the Phobos one, while executing the equivalent
algorithm.
The CheckedDecoder is 5 to 15 times faster, whilst the UncheckedDecoder
is
10 to 30 times faster (all  variances are due to alternate mixes of char,
wchar, dchar; all timings performed on a P3).
Holy crap, what kind of data are you throwing at it? I don't mean to
criticize,
 but there must be some clever coding on your part or some serious
loopholes in
 the algorithm to get that kind of an improvement. :)

These are rather significant differences. I think it's safe to say that
the
Phobos routines were "not written with efficiency in mind". Either that,
or
Mango has some secret means of warping time ...
In the case of the latter, may we rename the "Mango" to "Tardis"? -Pragma [[ EricAnderton at (daleks can't use stairs) yahoo.com ]]
Aug 25 2004
next sibling parent Regan Heath <regan netwin.co.nz> writes:
Might I humbly suggest you add these routines to deimos?
Have you seen my post to the DSLG thread? I think my idea has merit..

On Wed, 25 Aug 2004 13:07:57 -0700, antiAlias <fu bar.com> wrote:
 <g> That's funny.

 No loopholes; no clever coding. Just take a look at (for example) what
 utf.encode does to the heap. Those order-of-magnitude timings are 
 best-case
 for Phobos ~ in a busy server environment they'd be even slower, likely
 cause notable heap fragmentation, and persistently lock the heap against
 other (much more appropriate) usage by other threads. Imagine multiple
 threads doing implicit dchar[] conversions via utf.encode?

 Because there's no supported means of resolving such things, one becomes
 inclined to simply 'reimplement' instead of pooling ones skills and
 resources to fix Phobos.

 This is exactly the kind of thing the DSLG should take care of.



 "pragma" <pragma_member pathlink.com> wrote in message
 news:cgin45$2djh$1 digitaldaemon.com...
 In article <cgij3k$2bon$1 digitaldaemon.com>, antiAlias says...
The decoding thing has issues as you point out Jill.

That's not the whole story, however. The Mango utf-8 Encoder is 5 to 15
times faster than the Phobos one, while executing the equivalent
algorithm.
The CheckedDecoder is 5 to 15 times faster, whilst the UncheckedDecoder
is
10 to 30 times faster (all  variances are due to alternate mixes of 
char,
wchar, dchar; all timings performed on a P3).
Holy crap, what kind of data are you throwing at it? I don't mean to
criticize,
 but there must be some clever coding on your part or some serious
loopholes in
 the algorithm to get that kind of an improvement. :)

These are rather significant differences. I think it's safe to say that
the
Phobos routines were "not written with efficiency in mind". Either 
that,
or
Mango has some secret means of warping time ...
In the case of the latter, may we rename the "Mango" to "Tardis"? -Pragma [[ EricAnderton at (daleks can't use stairs) yahoo.com ]]
-- Using M2, Opera's revolutionary e-mail client: http://www.opera.com/m2/
Aug 25 2004
prev sibling parent pragma <pragma_member pathlink.com> writes:
In article <cgirdm$2fod$1 digitaldaemon.com>, antiAlias says...
<g> That's funny.
Thanks! That's why I've been doing these goofy sigs lately: if I can make people grin while they type, perhaps tempers won't flare as much in this NG. :)
No loopholes; no clever coding. Just take a look at (for example) what
utf.encode does to the heap. Those order-of-magnitude timings are best-case
for Phobos ~ in a busy server environment they'd be even slower, likely
cause notable heap fragmentation, and persistently lock the heap against
other (much more appropriate) usage by other threads. Imagine multiple
threads doing implicit dchar[] conversions via utf.encode?
Yikes. That's amazing. I only wonder how this might stand up against ICU?
Because there's no supported means of resolving such things, one becomes
inclined to simply 'reimplement' instead of pooling ones skills and
resources to fix Phobos.
A personal favorite of mine:

1st law of engineering: Hit it with a hammer
2nd law of engineering: If law 1 fails, *use a bigger hammer*

... or the more classic idiom: use the right tool for the right job.

One could liken reimplementation to a "programmer's sin" of sorts only if they doom others to continue that replication. Sadly, everybody's lib is currently "in progress" with no end in sight yet (which should change once D's remaining quirks are stamped out). At least Brad has been moderating product additions on dsource to cut down some of the more coarse-grained duplication. However we're still very much in the Beta years of D's life. ;)
This is exactly the kind of thing the DSLG should take care of.
At first I wasn't too keen on the idea myself, but perhaps it could use some more discussion. (I'll post on the other thread). - Pragma [[ EricAnderton at (its gumby dammit) yahoo.com ]]
Aug 25 2004
prev sibling parent reply Arcane Jill <Arcane_member pathlink.com> writes:
In article <opsc7u56zf5a2sq9 digitalmars.com>, Regan Heath says...

True. However, what else should the opCast for char[] to dchar[] do except 
transcode it? What about opCast for int to dchar.. it seems to me there is 
only 1 choice about what to do, anything else would be operator abuse.
Correct me if I'm wrong, but according to the docs, there /is/ no from-to version of opCast(). opCast() remains almost completely useless, despite many suggestions to fix it. But there's no need to opCast() anything. Compiler magic can just call std.toUTFxx() directly (which of course is what you said). If you want to use a different encoder, just do it explicitly. e.g.
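Something like this, perhaps (the alternative encoder at the end is a made-up placeholder, not a real API):

    import std.utf;

    void example()
    {
        dchar[] wide = "Übergröße"d.dup;

        // what the implicit conversion / compiler magic would do for you:
        char[] narrow = std.utf.toUTF8(wide);

        // and if you'd rather use a different encoder (hand-rolled, Mango, ICU...),
        // nothing stops you from calling it explicitly instead:
        // char[] narrow2 = myFancyEncoder.encode(wide);   // placeholder name only
    }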
I don't think transcoding makes any sense unless you're talking about a 
'string' (i.e. char[], wchar[], or dchar[]) as opposed to a UTF-x fragment 
(i.e. char, wchar)
Correct. It doesn't.
As AJ has frequently pointed out a char or wchar does not equal one 
"character" in some cases. Given that and assuming 'c' below is "not a 
whole character", I cannot see how:

dchar d;
char  c = d; //not a whole character

d = c;            //implicit
d = cast(dchar)c; //explicit
The existing behavior is already flawed, but it's not going to change (unless char is ditched), because these are primitive types, and Walter says so. Here's what's wrong:

In the direction dchar -> char, well, you'd /expect/ that to go wrong - it's a narrowing conversion. But casting from char to wchar or dchar will /only/ work if the character is actually ASCII.
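A minimal illustration of the pitfall (the values here are just examples):

    char c = "é"c[0];   // first code unit of the two-byte UTF-8 sequence for 'é', i.e. 0xC3
    dchar d = c;        // compiles, but d is now U+00C3 ('Ã'), not U+00E9 ('é')

    char a = 'A';
    dchar ok = a;       // fine: for ASCII, code unit and code point coincide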
3. a simple copy of the value, creating an invalid? (will it be AJ?) 
utf-32 character.
No value in the range U+0000 to U+00FF is invalid UTF32. But you can't convert a UTF-8-fragment to a character (see example above) and expect the result to be meaningful, except for ASCII.
utf-x code-point/fragment (I don't know the right term)
Unicode uses the term "code unit". I have avoided that term as it's not altogether clear to those unfamiliar with the jargon. I usually say "UTF-8 fragment" on this newsgroup.
 Then there's performance. It's entirely
 possible to write transcoders that are minimally between 5 and 30 times
 faster than the std.utf ones. Some people actually do care about 
 efficiency.
 I'm one of them.
Same here. I believe we all have the same goal, we just have different ideas about how best to get there.
The current implementation of the std.utf functions is not really relevant. /Tomorrow/'s implementation could be way faster than today's. I don't see any reason why Walter couldn't some day replace the implementation with the fastest one on the planet. No code would break.
I think the confusion comes from being used to only 1 string type, for 
C/C++ programmers it's typically and 8bit unsigned type usually containing 
ASCII, for Java a 16bit signed/unsigned? type containing UTF-16?
Mebee, but in C++ I can do this:

The "confusion" in D arises (IMO) because we don't have implicit conversion.

Arcane Jill
Aug 24 2004
parent Regan Heath <regan netwin.co.nz> writes:
On Tue, 24 Aug 2004 12:03:17 +0000 (UTC), Arcane Jill 
<Arcane_member pathlink.com> wrote:

<big snip> Thanks for those explanations.

 The "confusion" in D arises (IMO) because we don't have implicit 
 conversion.
That is my thought also, tho I note you would rather have 1 string type. Let's do a pros/cons list for implicit conversion and one string type, because I am not totally convinced one is better than the other. Let me start (trying to be as objective as possible and not favour 'my' idea):

[implicit conversion]

PROS:
P1 - will cause:
  dchar[] d;
  char[] c = d;
to produce valid utf sequences.
P2 - allows you to write 1 of each string returning function (instead of 3)
P3 - explicit conversion calls not required. eg toUTFxx().

p1: is vital IMO
p2: this means less code replication, and less code in general needs to be written.
p3: could be argued to be 'laziness', I've been called lazy in the past.

CONS:
C1 - transcoding is not FREE and it will happen without obvious indicators that it is happening.
C2 - ppl will not learn the difference between char, wchar, and dchar as quickly.

c1: I would argue it's not as big a deal as it first appears; where it happens you would need a toUTFxx call anyway. In string concatenations some extra transcoding will occur and I have no good solution for that, tho allowing toString to return any of the 3 types would lessen this effect.
c2: might be a 'pro' in disguise; they don't learn the difference because with implicit conversion it doesn't matter.

[one string type]

PROS:
P1 - allows you to write 1 of each string returning function (instead of 3) SAME AS ABOVE

CONS:
C1 - all your string characters are 16bits wide, space is wasted when support for ASCII or any other 8bit encoding is all that is required.

c1: I believe this to be a major 'con' for embedded etc small systems programming.

Please everyone add to this list any/all you can think of, correct any you think I have wrong, or misrepresented.

Regan

--
Using M2, Opera's revolutionary e-mail client: http://www.opera.com/m2/
Aug 24 2004
prev sibling parent reply "Walter" <newshound digitalmars.com> writes:
"antiAlias" <fu bar.com> wrote in message
news:cge5h3$5hv$1 digitaldaemon.com...
 On the one hand, this would be well served by an opCast() and/or
opCast_r()
 on the primitive types; just the kind of thing  suggested in a related
 thread (which talked about overloadable methods for primitive types).

 On the other hand, we're talking transcoding here. Are you gonna' limit
this
 to UTF-8 only?
No, to between char[], wchar[], and dchar[].
 Then, since the source and destination will typically be of
 different sizes, do you then force all casts between these types to have
the
 destination be an array reference rather than an instance? One that is
 always allocated on the fly?
I see it as implicitly calling the conversion function(s) in std.utf.
 If you do implement overloadable primitive-methods (like properties) then,
 will you allow a programmer to override them? So they can make the
opCast()
 do something more specific to their own specific task?

 That's seems like a lot to build into the core of a language.
I don't see adding opCast() for builtin types.
 Personally, I think it's 'borderline' to have so many data types available
 for similar things. If there were an "alias wchar[] string", and the core
 Object supported that via "string toString()", and the IUC library were
 adopted, then I think some of the general confusion would perhaps melt
 somewhat.

 In many respects, too many choices is simply a BadThing (TM). Especially
 when there's precious little solid guidance to help. That guidance might
 come from a decent library that indicates how the types are used, and uses
 one obvious type (string?) consistently. Besides, if IUC were adopted,
less
 people would have to worry about the distinction anyway.

 Believe me when I say that Mango would dearly love to go dchar[] only.
 Actually, it probably will at the higher levels because it makes life
simple
 for everyone. Oh, and I've been accused many times of being an efficiency
 fanatic, especially when it comes to servers. But there's always a
tradeoff
 somewhere. Here, the tradeoff is simplicity-of-use versus quantities of
RAM.
 Which one changes dramatically over time? Hmmmm ... let me see now ...
 64bit-OS for desktops just around corner?

 Even on an embedded device I'd probably go "dchar only" regarding I18N.
 Simply because the quantity of text processed on such devices is very
 limited. Before anyone shoots me over this one, I regularly write code for
 devices with just 4KB RAM ~ still use 16bit chars there when dealing with
 XML input.

 So what am I saying here? Available RAM will always increase in great
leaps.
 Contemplating that the latter should dictate ease-of-use within D is a
 serious breach of logic, IMO. Ease of use, and above all, /consistency/
 should be paramount; if you have the programmer in mind.
I thought so, too, until I built a server app that used all dchar[] internally. Server apps tend to be driven to the limit, and reaching that limit 4x sooner means that the customer has to buy 4x more server hardware. Remember that using 4 bytes per char doesn't just consume more ram, it consumes a LOT more processor cycles managing the extra memory (scanning, copying, initializing, gc marking, etc.).
Aug 24 2004
parent reply "antiAlias" <fu bar.com> writes:
"Walter"  wrote in...
 So what am I saying here? Available RAM will always increase in great
leaps.
 Contemplating that the latter should dictate ease-of-use within D is a
 serious breach of logic, IMO. Ease of use, and above all, /consistency/
 should be paramount; if you have the programmer in mind.
I thought so, too, until I built a server app that used all dchar[] internally. Server apps tend to be driven to the limit, and reaching that limit 4x sooner means that the customer has to buy 4x more server
hardware.
 Remember, that using 4 bytes per char doesn't just consume more ram, it
 consumes a LOT more processor cycles with managing the extra memory.
 (scanning, copying, initializing, gc marking, etc.)
I disagree with that for a number of reasons <g>

a) I was saying that usage of memory should not dictate language ease-of-use. I didn't say that byte[], ubyte[], char[] and wchar[] should all be dumped. If D were dchar[] oriented, rather than char[] oriented, it would arguably make it easier to use for the everyday folks. Those who really care about squeezing bytes can, and should, deal with text encoding and decoding issues. As it is, /everyone/ currently has to deal with those issues at various levels.

b) There's an implication that all server apps are text-bound. That's just not the case, but perhaps I'm being pedantic.

c) People who write servers have (traditionally) been a little more careful about what they do. There are plenty of ways to avoid allocating memory and thrashing the GC, where that's a concern. I do it all the time. In fact, one of the unwritten goals of writing server software is to avoid regularly using malloc/calloc where possible.

d) The predominant modern cpu's all have prefetch built-in, because of the marketing craze for streaming-style applications. This is great news for wide chars! It means that a server can stream dchar[] much more effectively than it could just a few years back. It's the conversions that are arguably a problem.

e) dchar is the natural width of a 32bit processor, so it's not gonna take more processor cycles to process those than 8bit chars. In fact, it's the other way round where UTF-8 is involved. The bottleneck used to be the front-side bus. Not so these days of 1GHz HyperTransport, 800MHz Intel quad-pumped bus, and prefetch everywhere.

So, no. I simply cannot agree that using dchar[] automatically means the customer has to buy 4x more server hardware <g>
Aug 24 2004
next sibling parent reply Sean Kelly <sean f4.ca> writes:
In article <cgg5ko$16ja$1 digitaldaemon.com>, antiAlias says...
a) I was saying that usage of memory should not dictate language
ease-of-use. I didn't say that byte[], ubyte[], char[] and wchar[] should
all be dumped.  If D were dchar[] oriented, rather than char[] oriented, it
would arguably make it easier to use for the everyday folks. Those who
really care about squeezing bytes can, and should, deal with text encoding
and decoding issues. As it is, /everyone/ currently has to deal with those
issues at various levels.
Ideally, an i/o library should be able to handle most conversions invisibly, so the user can work in whatever internal format they want without worrying too much about the external format. doFormat already takes char, wchar, and dchar arguments and outputs UTF-8 or UTF-16 as appropriate, and I designed unFormat to do pretty much the same. I will say, however, that multibyte encoding schemes are generally not very easy to deal with, so internal use of dchars still makes a lot of sense.
b) There's an implication that all server apps are text-bound. That's just
not the case, but perhaps I'm being pedantic.
More often than not they are, especially in this age of XML and such. And for the ones that aren't text-bound, who cares how D deals with strings? :)
e) dchar is the natural width of a 32bit processor, so it's not gonna take
more Processor Cycles to process those than 8bit chars. In fact, it's the
other way round where UTF-8 is involved. The bottleneck used to be the
front-side bus. Not so these days of 1Ghz HyperTransport, 800MHz Intel
quad-pumped bus, and prefetch everywhere.
I think that UTF-8 is still more efficient in terms of memory reads and writes, simply because it tends to take up less space than UTF-16 or UTF-32. The tradeoff is in processing time when the data ventures into the multibyte realm, which is becoming increasingly more common. But I'll have to think some more about whether I would always write text-bound servers using dchars. I'd like to since it would simplify handling XML data, but I'm not particularly keen on those 1GB streams suddenly becoming 4GB streams. That's a decent bit of i/o overhead, especially if I know that little to no data in that stream lives beyond the ASCII charset.

At the end of the day, I think the programmer should be able to choose the appropriate charset for the job. Implicit conversion between char types is a great idea and should clear up most of the confusion. And the UTF-32 version is called "dchar", which implies to me that it's the native character format for D anyway. Perhaps "char" should be renamed to something else?

Sean
Aug 24 2004
parent reply "antiAlias" <fu bar.com> writes:
"Sean Kelly" <sean f4.ca> wrote ...
 I think that UTF-8 is still more efficient in terms of memory reads and
writes,
 simply because it tends to take up less space than UTF-16 or UTF-32.  The
 tradeoff is in processing time when the data ventures into the multibyte
realm,
 which is becoming increasingly more common.  But I'll have to think some
more
 about whether I would always write text-bound servers using dchars.
I'm not saying that they should always be dchar[] :-) I'm saying that trading language ease-of-use against the additional memory usage of dchar[] is an invalid tradeoff. You can always dip down into ubyte[] for those apps that actually care.
 I'd like to
 since it would simplify handling XML data, but I'm not particularly keen
on
 those 1GB streams suddenly becoming 4GB streams.  That's a decent bit of
i/o
 overhead, especially if I know that little to no data in that stream lives
 beyond the ASCII charset.
I'm rather tempted to suggest that a 1GB stream of XML has other problems to contend with <g>
Aug 24 2004
parent Sean Kelly <sean f4.ca> writes:
In article <cggca6$19k9$1 digitaldaemon.com>, antiAlias says...
I'm rather tempted to suggest that a 1GB stream of XML has other problems to
contend with <g>
Its only problem is that XML is a terrible text format. But for one server I wrote it was not impossible that there may be a 1GB (uncompressed) data stream. This was obviously not going to a web browser :) Sean
Aug 24 2004
prev sibling parent reply Regan Heath <regan netwin.co.nz> writes:
On Tue, 24 Aug 2004 12:44:03 -0700, antiAlias <fu bar.com> wrote:
 "Walter"  wrote in...
 So what am I saying here? Available RAM will always increase in great
leaps.
 Contemplating that the latter should dictate ease-of-use within D is a
 serious breach of logic, IMO. Ease of use, and above all, 
/consistency/
 should be paramount; if you have the programmer in mind.
I thought so, too, until I built a server app that used all dchar[] internally. Server apps tend to be driven to the limit, and reaching that limit 4x sooner means that the customer has to buy 4x more server
hardware.
 Remember, that using 4 bytes per char doesn't just consume more ram, it
 consumes a LOT more processor cycles with managing the extra memory.
 (scanning, copying, initializing, gc marking, etc.)
I disagree with that for a number of reasons <g> a) I was saying that usage of memory should not dictate language ease-of-use. I didn't say that byte[], ubyte[], char[] and wchar[] should all be dumped. If D were dchar[] oriented, rather than char[] oriented, it would arguably make it easier to use for the everyday folks. Those who really care about squeezing bytes can, and should, deal with text encoding and decoding issues. As it is, /everyone/ currently has to deal with those issues at various levels.
The exact same thing can be said for implicit transcoding. What I mean is: implicit transcoding will make it easier to use for everyday folks, while those who really care about squeezing bytes still can.

The advantage implicit transcoding has is that everyday folks will likely be dealing with ASCII and will likely use char[], which has an advantage over dchar[] where you're dealing with ASCII. Furthermore, implicit transcoding removes the need to deal with encoding/decoding issues, generally speaking. Those that need to worry about it can and will optimise where the implicit transcoding causes inefficient behaviour.
 b) There's an implication that all server apps are text-bound. That's 
 just
 not the case, but perhaps I'm being pedantic.
It depends on the server. The mail server I work on is a candidate to be text-bound; in fact it is disk-bound, meaning we cannot write our email text out to disk as fast as we can receive it (tcp/ip) and process it (transcoding etc).
 c) People who write servers have (traditionally) been a little more 
 careful
 about what they do. There are plenty of ways to avoid allocating memory 
 and
 thrashing the GC, where that's a concern. I do it all the time. In fact, 
 one
 of the unwritten goals of writing server software is to avoid regularly
 using malloc/calloc where possible.
Definitely. Having a UTF-8 char type which you can implicitly convert to a more convenient format temporarily (dchar[], UTF-32) simply makes this easier IMO.
 d) The predominant modern cpu's all have prefetch built-in, because of 
 the
 marketing craze for streaming-style application. This is great news for 
 wide
 chars! It means that a server can stream dchar[] much more effectively 
 than
 it could just a few years back. It's the conversions that are arguably a
 problem.
If we're talking streaming as in streaming to disk or tcp/ip etc, I would argue that the time it takes to transcode is much less than the time it takes to write/send.
 e) dchar is the natural width of a 32bit processor, so it's not gonna 
 take
 more Processor Cycles to process those than 8bit chars. In fact, it's the
 other way round where UTF-8 is involved. The bottleneck used to be the
 front-side bus. Not so these days of 1Ghz HyperTransport, 800MHz Intel
 quad-pumped bus, and prefetch everywhere.

 So, no. I simply cannot agree that using dchar[] automatically means the
 customer has to buy 4x more server hardware <g>
All this arguing about what is more efficient is IMO totally pointless; the types of application vary so much that for one application/situation one method will be best and for another the other will be. D's goal is not to be specialised for any one style or application, and as such 3 char types makes sense, doesn't it?

Regardless, the only way to settle the performance argument is to benchmark something, therefore... in what situations do you believe using UTF-32 dchar throughout the application will be faster than using all 3 types and implicit transcoding? Consider:

- the input may be in any encoding
- the output may be in any encoding
- it may need to store large amounts of the input in memory

..can you think of any more?

Regan

--
Using M2, Opera's revolutionary e-mail client: http://www.opera.com/m2/
Aug 24 2004
parent reply "antiAlias" <fu bar.com> writes:
Regan; I appeal to you to try and read things in context. I'm not even
vaguely interested in getting into a pissing contest with you, so please,
try and follow this (and correlate with the text below if you have to):

a) I say available-memory will always increase in great leaps, so using that
as a design guide vis-a-vis a computer language doesn't make sense to me.

b) Walter says he used to think so too, until he built a wide-char-only
server; and points out that wide-chars can force the customer into
purchasing much more hardware due to memory consumption and additional CPU
usage.

c) I disagree with that position, and try to illustrate why I don't think
wide-chars are the demon they might once have been considered. And that
perhaps they get a 'bad rap' for the wrong reasons.

What you added here seems intended to fan some imaginary flames, or to be
argumentative purely for the sake of it, rather than to make any cohesive
point. In fact, four out of the five items you managed to completely
misconstrue. That may be my failing in terms of language use, so I'll accept
the consequences. I will not, however, bite.

Good-day, my friend  :-)


"Regan Heath" <regan netwin.co.nz> wrote in message
news:opsc9o30tt5a2sq9 digitalmars.com...
 On Tue, 24 Aug 2004 12:44:03 -0700, antiAlias <fu bar.com> wrote:
 "Walter"  wrote in...
 So what am I saying here? Available RAM will always increase in great
leaps.
 Contemplating that the latter should dictate ease-of-use within D is
a
 serious breach of logic, IMO. Ease of use, and above all,
/consistency/
 should be paramount; if you have the programmer in mind.
I thought so, too, until I built a server app that used all dchar[] internally. Server apps tend to be driven to the limit, and reaching that limit 4x sooner means that the customer has to buy 4x more server
hardware.
 Remember, that using 4 bytes per char doesn't just consume more ram, it
 consumes a LOT more processor cycles with managing the extra memory.
 (scanning, copying, initializing, gc marking, etc.)
I disagree with that for a number of reasons <g> a) I was saying that usage of memory should not dictate language ease-of-use. I didn't say that byte[], ubyte[], char[] and wchar[]
should
 all be dumped.  If D were dchar[] oriented, rather than char[] oriented,
 it
 would arguably make it easier to use for the everyday folks. Those who
 really care about squeezing bytes can, and should, deal with text
 encoding
 and decoding issues. As it is, /everyone/ currently has to deal with
 those
 issues at various levels.
The exact same thing can be said for implicit transcoding. What I mean is... Implicit transcoding will make it easier to use for everyday folks. Those who really care about squeezing bytes can. The advantage implicit transcoding has is everyday folks will likely be dealing with ASCII and will likely use char[] which has an advantage over dchar[] where you're dealing with ASCII. Furthermore implicit transcoding removes the need to deal with encoding/decoding issue, generally speaking. Those that need to worry about it, can and will optimise where the implicit transcoding causes in-efficient behaviour.
 b) There's an implication that all server apps are text-bound. That's
 just
 not the case, but perhaps I'm being pedantic.
It depends on the server. The mail server I work on is a candidate to be text-bound, in fact it is, it's disk bound, meaning, we cannot write our email text out to disk as fast as we can receive it (tcp/ip), and process it (transcoding etc).
 c) People who write servers have (traditionally) been a little more
 careful
 about what they do. There are plenty of ways to avoid allocating memory
 and
 thrashing the GC, where that's a concern. I do it all the time. In fact,
 one
 of the unwritten goals of writing server software is to avoid regularly
 using malloc/calloc where possible.
Definately. having a UTF-8 char type which you can implicitly convert to a more convenient format temporarily (dchar[], utf-32) simply makes this easier IMO.
 d) The predominant modern cpu's all have prefetch built-in, because of
 the
 marketing craze for streaming-style application. This is great news for
 wide
 chars! It means that a server can stream dchar[] much more effectively
 than
 it could just a few years back. It's the conversions that are arguably a
 problem.
If we're talking streaming as in streaming to disk or tcp/ip etc, I would argue that the time it takes to transcode is much less than the time it takes to write/send.
 e) dchar is the natural width of a 32bit processor, so it's not gonna
 take
 more Processor Cycles to process those than 8bit chars. In fact, it's
the
 other way round where UTF-8 is involved. The bottleneck used to be the
 front-side bus. Not so these days of 1Ghz HyperTransport, 800MHz Intel
 quad-pumped bus, and prefetch everywhere.

 So, no. I simply cannot agree that using dchar[] automatically means the
 customer has to buy 4x more server hardware <g>
All this arguing about what is more efficient is IMO totally pointless, the types of application vary so much that for one application/situation one method will be best and for another the other method will be. D's goal is not to be specialised for any one style or application, as such 3 char types makes sense, doesn't it? Regardless the only way to settle the performance argument is to benchmark something, therefore... In what situations do you believe using UTF-32 dchar throughout the application will be faster than using all 3 types and implicit transcoding.. Consider: - the input may be in any encoding - the output may be in any encoding - it may need to store large amounts of the input in memory ..can you think of any more? Regan -- Using M2, Opera's revolutionary e-mail client: http://www.opera.com/m2/
Aug 24 2004
parent Regan Heath <regan netwin.co.nz> writes:
On Tue, 24 Aug 2004 22:08:39 -0700, antiAlias <fu bar.com> wrote:

<snip>

 What you added here seems intended to fan some imaginary flames, or to be
 argumentative purely for the sake of it, rather than to make any cohesive
 point. In fact, four out of the five items you managed to completely
 misconstrue. That may be my failing in terms of language use, so I'll 
 accept the consequences. I will not, however, bite.
I'm sorry to have come across that way. I was simply trying to add my point of view. If I have misunderstood your comments, sorry - I don't get it right all the time (despite what I might think).

Your (mis/understood/guided/ing) friend,
Regan

--
Using M2, Opera's revolutionary e-mail client: http://www.opera.com/m2/
Aug 24 2004