
digitalmars.D - The case for ditching char and wchar (and renaming "dchar" as "char")

reply Arcane Jill <Arcane_member pathlink.com> writes:
D has come a long way, and much of the original architecture is now redundant.
Template/library based containers are now making built-in associative arrays
redundant, for example. And now a new revolution is on its way - transcoding,
which makes built in support for UTF-8 and friends equally redundant. (It does
not, of course, make Unicode itself redundant!).

D's "char" type is, by definition, a fragment of UTF-8.
But UTF-8 is just an encoding.

D's "wchar" type is, by definition, a fragment of UTF-16.
But UTF-16 is also just an encoding (or two).

D's "dchar" type flits ambiguously between a fragment of UTF-32 and an actual
Unicode codepoint (the two are more or less interchangeable).

<sarcasm>
By extension of this logic, why not:

schar - a fragment of UTF-7
ichar - a fragment of ISO-8859-1
cchar - a fragment of WINDOWS-1252
.. and so on, for every encoding you can think of. Hang on - we're going to run
out of letters!

and of course, Phobos would have to implement all the conversion functions:
toUTF7(), toISO88591(), and so on.
</sarcasm>

Nonsense? Of course it is. But the analogy is intended to show that the current
behavior of D is also nonsense. For N encodings, you need (N squared minus N)
conversion functions, so the number is going to grow quite rapidly as the number
of supported encodings increases. But if you instead use transcoding, then the
number of conversion functions you need is simply N. Not only that, the
mechanism is smoother, neater. Your code is more elegant. You simply don't have
to /worry/ about all that nonsense trying to get the three built-in encodings to
match, because the issue has simply gone away.
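
To make the scaling argument concrete, here is a minimal sketch of the
transcoder-centred approach (the Transcoder interface and the convert() helper
are invented for illustration; this is not Mango's or Phobos's actual API):

    // Sketch only: one codec per encoding, all meeting at dchar[]
    // (Unicode codepoints) in the middle.
    interface Transcoder
    {
        dchar[] decode(ubyte[] raw);    // encoding-specific bytes -> codepoints
        ubyte[] encode(dchar[] text);   // codepoints -> encoding-specific bytes
    }

    // Any-to-any conversion is decode followed by encode, so N encodings need
    // only N codecs instead of (N squared minus N) pairwise functions.
    ubyte[] convert(ubyte[] raw, Transcoder from, Transcoder to)
    {
        return to.encode(from.decode(raw));
    }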

And once the issue has gone away, you no longer need a special type to hold
fragments of UTF-8 or UTF-16. Bye bye char. Bye bye wchar.

Kris (antiAlias) has sent me the transcoding interface which Mango requires.
(Received, thanks). I've already written a generic one, but which didn't take
those requirements into account. So today I'm going to merge the two approaches
together and see what Kris thinks. So I'm pretty confident that within a few
days, Kris and I will have got together a transcoding architecture we're both
happy with - and since Kris has expertise in streams/Mango, and I have expertise
in Unicode/internationalization, I'd make a pretty good wager that between us
we're going to get it right. And we'll plumb in the UTF transcoders first. You
can probably expect all that to be done within days rather than weeks.

So why would we then need old-style-char or wchar any more?

For reasons of space-efficiency, one might want to store text in memory in UTF-8
format. Fair enough. But if char were to be ditched, you could still do that.
You'd simply use a ubyte[] for that purpose (just as you are now required to do
if you want to store text in memory in UTF-7). After all - what actually /is/ a
UTF fragment anyway? What meaning does the UTF-8 fragment 0x83 have in
isolation? Answer - none. It has meaning only in the context of the bytes
surrounding it. You don't need a special primitive type just to hold that
fragment. And of course, there is /nothing/ to stop special string classes from
being written to provide implementations of such space-efficient abstractions.
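
As a small illustration (assuming char really were ditched; the variable name
is invented), UTF-8 text sits quite happily in a ubyte[], and the point about
fragments is visible in the bytes themselves:

    // "Hi Ã" as raw UTF-8 bytes: 0xC3 0x83 together encode U+00C3, while the
    // trailing fragment 0x83 on its own decodes to nothing at all.
    ubyte[5] utf8Text = [0x48, 0x69, 0x20, 0xC3, 0x83];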

A further argument against char is that people coming from C/C++ /will/ try to
store ISO-8859-1 encoded strings in a char[]. And they will get away with it, too, so
long as they don't try calling any toUTFxx() routines on them. Bugs can be
deeply buried in this confusion, failing to surface for a very long time.

Discussion in another thread has focused on the fact that Object.toString()
returns a char[]. Regan and I have made the suggestion that the three string
types be interchangeable. But there's a better way: have just the /one/ string
type. (As they say in Highlander, "There can be only one"). Problem gone away.

With new-style-char redefined, not merely as a UTF-32 code unit (a fragment of
an encoding), but as an actual Unicode character, things become much, much
simpler.

AND it would make life easier for Walter - fewer primitive types; less for the
compiler to understand/do.

Java tried to do it this way. When Java was invented, they had a char type
intended to hold a single Unicode character. They also had a byte type, an array
of which could store ASCII or ISO-8859-1 or UTF-8 encoded text. They also had
transcoding built in, to make it all hang together. Where it went wrong for Java
was that Unicode changed from being 16-bits wide to being 21-bits wide (so
suddenly Java's char was no longer wide enough, and they were forced to redefine
Java strings as being UTF-16 encoded). But please note that Java did /not/
attempt to have separate char types for each encoding. Even /after/ Unicode
exceeded 16-bits, Java was not tempted to introduce a new kind of char. Why not?
Because having more than one char type is an ugly kludge (particularly if you're
using Unicode by definition). It's an ugly kludge in D, too. I thought it was
really good, once upon a time, but now that transcoding is moving out to
libraries, and encompasses many /more/ encodings than merely UTF-8/16/32, I no longer
think that. Now is the best time of all for a rethink.

But ...

there's a down-side ... it would break a lot of existing code. Well, so what?
This is a pre-1.0 wart-removing exercise. Like all of those other suggestions
we're voting on in another thread, the time to make this change is now, before
it's too late.

Arcane Jill
Aug 23 2004
next sibling parent Matthias Becker <Matthias_member pathlink.com> writes:
After you proposed these ideas about allowing toString to return any character
type I started thinking about it and finally I thought: Why do we have more than
one character-type? (Just like you do)
Aug 23 2004
prev sibling next sibling parent Juanjo Álvarez <juanjuxNO SPAMyahoo.es> writes:
Arcane Jill wrote:
 
 But there's a better way:
 have just the /one/ string type. (As they say in Highlander, "There can be
 only one"). Problem gone away.
Yes, it makes a lot of sense. You have my (useless) vote.
 AND it would make life easier for Walter - fewer primitive types; less for
 the compiler to understand/do.
I think Walter should like it, if only for this.
 But ...
 
 there's a down-side ... it would break a lot of existing code. Well, so
 what? This is a pre-1.0 wart-removing exercise. Like all of those other
 suggestions we're voting on in another thread, the time to make this
 change is now, before it's too late.
I'll be very happy to change my (little) code now.
Aug 23 2004
prev sibling next sibling parent reply Ben Hinkle <bhinkle4 juno.com> writes:
Arcane Jill wrote:

 
 D has come a long way, and much of the original architecture is now
 redundant. Template/library based containers are now making built-in
 associative arrays redundant, for example. And now a new revolution is on
 its way - transcoding, which makes built in support for UTF-8 and friends
 equally redundant. (It does not, of course, make Unicode itself
 redundant!).
 
 D's "char" type is, by definition, a fragment of UTF-8.
 But UTF-8 is just an encoding.
 
 D's "wchar" type is, by definition, a fragment of UTF-16.
 But UTF-16 is also just an encoding (or two).
 
 D's "dchar" type flits ambiguously between a fragment of UTF-32 and an
 actual Unicode codepoint (the two are more or less interchangeable).
 
 <sarcasm>
 By extension of this logic, why not:
 
 schar - a fragment of UTF-7
 ichar - a fragment of ISO-8859-1
 cchar - a fragment of WINDOWS-1252
 .. and so on, for every encoding you can think of. Hang on - we're going
 to run out of letters!
 
 and of course, Phobos would have to implement all the conversion
 functions: toUTF7(), toISO88591(), and so on.
 </sarcasm>
 
 Nonsense? Of course it is. But the analogy is intended to show that the
 current behavior of D is also nonsense. For N encodings, you need (N
 squared minus N) conversion functions, so the number is going to grow
 quite rapidly as the number of supported encodings increases. But if you
 instead use transcoding, then the number of conversion functions you need
 is simply N. Not only that, the mechanism is smoother, neater. Your code
 is more elegant. You simply don't have to /worry/ about all that nonsense
 trying to get the three built-in encodings to match, because the issue has
 simply gone away.
 
 And once the issue has gone away, you no longer need a special type to
 hold fragments of UTF-8 or UTF-16. Bye bye char. Bye bye wchar.
 
 Kris (antiAlias) has sent me the transcoding interface which Mango
 requires. (Received, thanks). I've already written a generic one, but
 which didn't take those requirements into account. So today I'm going to
 merge the two approaches together and see what Kris thinks. So I'm pretty
 confident that within a few days, Kris and I will have got together a
 transcoding architecture we're both happy with - and since Kris has
 expertise in streams/Mango, and I have expertise in
 Unicode/internationalization, I'd make a pretty good wager that between us
 we're going to get it right. And we'll plumb in the UTF transcoders first.
 You can probably expect all that to be done within days rather than weeks.
 
 So why would we then need old-style-char or wchar any more?
 
 For reasons of space-efficiency, one might want to store text in memory in
 UTF-8 format. Fair enough. But if char were to be ditched, you could still
 do that. You'd simply use a ubyte[] for that purpose (just as you are now
 required to do if you want to store text in memory in UTF-7). After all -
 what actually /is/ a UTF fragment anyway? What meaning does the UTF-8
 fragment 0x83 have in isolation? Answer - none. It has meaning only in the
 context of the bytes surrounding it. You don't need a special primitive
 type just to hold that fragment. And of course, there is /nothing/ to stop
 special string classes from being written to provide implementations of
 such space-efficient abstractions.
 
 A further argument against char is that people coming from C/C++ /will/ try to
 store ISO-8859-1 encoded strings in a char[]. And they will get away with
 it, too, so long as they don't try calling any toUTFxx() routines on them.
 Bugs can be deeply buried in this confusion, failing to surface for a very
 long time.
 
 Discussion in another thread has focused on the fact that
 Object.toString() returns a char[]. Regan and I have made the suggestion
 that the three string types be interchangeable. But there's a better way:
 have just the /one/ string type. (As they say in Highlander, "There can be
 only one"). Problem gone away.
 
 With new-style-char redefined, not merely as a UTF-32 code unit (a
 fragment of an encoding), but as an actual Unicode character, things
 become much, much simpler.
 
 AND it would make life easier for Walter - fewer primitive types; less for
 the compiler to understand/do.
 
 Java tried to do it this way. When Java was invented, they had a char type
 intended to hold a single Unicode character. They also had a byte type, an
 array of which could store ASCII or ISO-8859-1 or UTF-8 encoded text. They
 also had transcoding built in, to make it all hang together. Where it went
 wrong for Java was that Unicode changed from being 16-bits wide to being
 21-bits wide (so suddenly Java's char was no longer wide enough, and they
 were forced to redefine Java strings as being UTF-16 encoded). But please
 note that Java did /not/ attempt to have separate char types for each
 encoding. Even /after/ Unicode exceeded 16-bits, Java was not tempted to
 introduce a new kind of char. Why not? Because having more than one char
 type is an ugly kludge (particularly if you're using Unicode by
 definition). It's an ugly kludge in D, too. I thought it was really good,
 once upon a time, but now that transcoding is moving out to libraries, and
 encompasses many /more/ encodings than merely UTF-8/16/32, I no longer think
 that. Now is the best time of all for a rethink.
 
 But ...
 
 there's a down-side ... it would break a lot of existing code. Well, so
 what? This is a pre-1.0 wart-removing exercise. Like all of those other
 suggestions we're voting on in another thread, the time to make this
 change is now, before it's too late.
 
 Arcane Jill
There were huge threads about char vs wchar vs dchar a while ago (on the old
newsgroup, I think). All kinds of things like what the default should be, what
the names should be, what a string class could be etc. For example
 http://www.digitalmars.com/d/archives/20361.html
 http://www.digitalmars.com/d/archives/12382.html
or actually anything at
 http://www.digitalmars.com/d/archives/index.html
with the word "unicode" in the subject.

By the way, why if there are N encodings are there N^2-N converters? Shouldn't
there just be ~2*N to convert to/from one standard like dchar[]? IBM's ICU (at
http://oss.software.ibm.com/icu/) uses wchar[] as the standard.

-Ben
Aug 23 2004
next sibling parent reply Arcane Jill <Arcane_member pathlink.com> writes:
In article <cgcoe6$2cq4$1 digitaldaemon.com>, Ben Hinkle says...

There were huge threads about char vs wchar vs dchar a while ago (on the old
newsgroup, I think). All kinds of things like what the default should be,
what the names should be, what a string class could be etc. For example
 http://www.digitalmars.com/d/archives/20361.html
 http://www.digitalmars.com/d/archives/12382.html
or actually anything at
 http://www.digitalmars.com/d/archives/index.html
with the word "unicode" in the subject.
Well spotted. I had a look at some of those old threads, and it does seem that
most of the views back there were saying much the same thing as I'm suggesting
now, which is good, as I'm happy to count it as more votes for the proposal,
AND evidence of ongoing discontent over some years.

The difference between now and then is that /now/ we have transcoding classes
underway, and we'll have a working architecture very very soon, which will be
able to plug into any kind of string or stream class. This is the difference
which makes ditching char and wchar an actual practical possibility now.

Incidentally, there were plenty of views in those archives which basically said
that the Unicode functions which now exist in etc.unicode (and which didn't
exist at the time) should exist. That's one problem solved.
By the way, why if there are N encodings are there N^2-N converters?
Shouldn't there just be ~2*N to convert to/from one standard like dchar[]?
Well, that's how transcoding will do it, obviously. I was comparing it to the
present system, in which N == 3 (UTF-8, UTF-16 and UTF-32), and there are 6
(= 3^2-3) converters in std.utf, these being:

*) toUTF8(wchar[]);
*) toUTF8(dchar[]);
*) toUTF16(char[]);
*) toUTF16(dchar[]);
*) toUTF32(char[]);
*) toUTF32(wchar[]);

If the current (std.utf) scheme were to be extended to include, say, UTF-7 and
UTF-EBCDIC, how would that scale up?
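
For what it's worth, here is how those six look in use, with the signatures
std.utf declared at the time (the variable names are only for illustration):

    import std.utf;

    void demo()
    {
        char[]  c  = "hello";       // UTF-8 text
        wchar[] w  = toUTF16(c);    // UTF-8  -> UTF-16
        dchar[] d  = toUTF32(c);    // UTF-8  -> UTF-32
        char[]  c2 = toUTF8(w);     // UTF-16 -> UTF-8
        dchar[] d2 = toUTF32(w);    // UTF-16 -> UTF-32
        char[]  c3 = toUTF8(d);     // UTF-32 -> UTF-8
        wchar[] w2 = toUTF16(d);    // UTF-32 -> UTF-16
    }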
IBM's ICU (at http://oss.software.ibm.com/icu/)
Bloody hell. I wish someone had pointed me at ICU earlier. That is exceptional. They've even got Unicode Regular Expressions! And transcoding functions. And it's open source, too! Should I just give up on etc.unicode? Maybe we should just put a D wrapper around ICU instead, which would give D full Unicode support right now, and leave me free to do crypto stuff!
IBM's ICU (at http://oss.software.ibm.com/icu/)
uses wchar[] as the standard.
Ah, no it doesn't. I just checked. ICU has the types UChar (platform dependent,
but wchar for us) and UChar32 (definitely a dchar). So you see, both wchar[]
and dchar[] are "standards" for ICU. (That said, I've only looked at it for a
few seconds, so I may have misunderstood). Anyway, UTF-16 transcoding will
easily take care of interfacing with any UTF-16 architecture. The present
situation in D is no more compatible than what I'm suggesting.

Slightly modified proposal then - ditch char and wchar as before, PLUS,
incorporate ICU into D's core and write a D wrapper for it. (And ditch
etc.unicode - erk!) The ICU license is at
http://oss.software.ibm.com/cvs/icu/~checkout~/icu/license.html.

Arcane Jill
Aug 23 2004
parent Juanjo Álvarez <juanjuxNO SPAMyahoo.es> writes:
Arcane Jill wrote:


 Slightly modified proposal then - ditch char and wchar as before, PLUS,
 incorporate ICU into D's core and write a D wrapper for it. (And ditch
 etc.unicode - erk!) The ICU license is at
 http://oss.software.ibm.com/cvs/icu/~checkout~/icu/license.html.
I suppose that if dchar is what's left after this ditching, it should be
renamed to char.

AJ, I don't know a shit^D^D^D^D too much about Unicode, but your excitement
about ICU is really contagious. Only one question: are the C wrappers at the
same level as the C++/Java ones? If so, it seems that with a little easy and
boring (compared to writing etc.unicode) wrapping we're going to have a
first-class Unicode lib :) => (i18n version of <g>)
Aug 23 2004
prev sibling next sibling parent reply "Roald Ribe" <rr.no spam.teikom.no> writes:
"Ben Hinkle" <bhinkle4 juno.com> wrote in message
news:cgcoe6$2cq4$1 digitaldaemon.com...

[snip]

 There were huge threads about char vs wchar vs dchar a while ago (on the
old
 newsgroup, I think). All kinds of things like what the default should be,
 what the names should be, what a string class could be etc. For example
  http://www.digitalmars.com/d/archives/20361.html
  http://www.digitalmars.com/d/archives/12382.html
 or actually anything at
  http://www.digitalmars.com/d/archives/index.html
 with the word "unicode" in the subject.

 By the way, why if there are N encodings are there N^2-N converters?
 Shouldn't there just be ~2*N to convert to/from one standard like dchar[]?
 IBM's ICU (at http://oss.software.ibm.com/icu/) uses wchar[] as the
 standard.
Indeed. There were several large discussions about this. Only a few
Scandinavian/north European readers of this group seemed to be positive at the
time. I am happy to see that more people are warming to the idea.

wchar (16-bit) is enough. It is even suggested as the best implementation size
by some Unicode coding experts. IBM / Sun / MS can not all be stupid at the
same time... I think it would be smart to interoperate with the 16-bit size
used internally in ICU, Java and MS-Windows. Only on unix/linux would it make
sense to use a 32-bit dchar. The 16 bits are enough for 99% of the
cases/languages. The last 1% can be handled quite fast by cached indexing
techniques in a String object. (This does not make for optimal speed in the 1%
case, but it will more than pay for itself speedwise in 99% of all binary i/o
operations :)

However, that is Walter's main issue, I think. He wants 8-bit chars to be the
default because this will make for the best possible i/o speed with the
current state of affairs. That is what I understood from the last discussion
at least. I am sure he will comment in this thread ;-) and correct me if I am
wrong.

Regards, Roald
Aug 23 2004
next sibling parent Arcane Jill <Arcane_member pathlink.com> writes:
In article <cgct0l$2eut$1 digitaldaemon.com>, Roald Ribe says...

However, that is Walters main issue I think. He wants default 8-bit
chars to be default because this will make for the best possible i/o
speed with the current state of affairs. That is what I understould
from the last discussion at least. I am sure he will comment this
thread ;-) and correct me if I am wrong.
Is that true? But UTF-8 /doesn't/ make the best possible I/O speed. To achieve
that, you'd need to be using the OS-native encoding internally (ISO-8859-1 on
most Linux boxes, WINDOWS-1252 on most Windows boxes). If UTF-8 is not used
natively (which most of the time it isn't), you'd still need transcoding. Fact
is, transcoding from UTF-16 to ISO-8859-1 or WINDOWS-1252 is going to be much
faster than transcoding from UTF-8 to those encodings.

And in any case, the time spent transcoding is almost always going to be
insignificant compared to time spent doing actual I/O. Think console input;
writing to disk; reading from CD-ROM; writing to a socket; .... Transcoding is
really not a bottleneck.

Jill
Aug 23 2004
prev sibling parent Ben Hinkle <bhinkle4 juno.com> writes:
Roald Ribe wrote:

 
 "Ben Hinkle" <bhinkle4 juno.com> wrote in message
 news:cgcoe6$2cq4$1 digitaldaemon.com...
 
 [snip]
 
 There were huge threads about char vs wchar vs dchar a while ago (on the
old
 newsgroup, I think). All kinds of things like what the default should be,
 what the names should be, what a string class could be etc. For example
  http://www.digitalmars.com/d/archives/20361.html
  http://www.digitalmars.com/d/archives/12382.html
 or actually anything at
  http://www.digitalmars.com/d/archives/index.html
 with the word "unicode" in the subject.

 By the way, why if there are N encodings are there N^2-N converters?
 Shouldn't there just be ~2*N to convert to/from one standard like
 dchar[]? IBM's ICU (at http://oss.software.ibm.com/icu/) uses wchar[] as
 the standard.
 Indeed. There were several large discussions about this. Only a few
 Scandinavian/north European readers of this group seemed to be positive at
 the time. I am happy to see that more people are warming to the idea.

 wchar (16-bit) is enough. It is even suggested as the best implementation
 size by some Unicode coding experts. IBM / Sun / MS can not all be stupid at
 the same time... I think it would be smart to interoperate with the 16-bit
 size used internally in ICU, Java and MS-Windows. Only on unix/linux would it
 make sense to use a 32-bit dchar. The 16 bits are enough for 99% of the
 cases/languages. The last 1% can be handled quite fast by cached indexing
 techniques in a String object. (This does not make for optimal speed in the
 1% case, but it will more than pay for itself speedwise in 99% of all binary
 i/o operations :)

 However, that is Walter's main issue, I think. He wants 8-bit chars to be the
 default because this will make for the best possible i/o speed with the
 current state of affairs. That is what I understood from the last discussion
 at least. I am sure he will comment in this thread ;-) and correct me if I am
 wrong.

 Regards, Roald
I didn't mean to suggest D ditch char or wchar or dchar. I'm just saying ICU
uses wchar internally as the intermediate representation when converting
between encodings. That is different than changing D's concept of strings.

I should have added a sentence to my original post saying that I think D
should keep its support of the "big three" char, wchar and dchar (with char[]
as the standard concept of string) and have the library that handles
conversions between unicode and non-unicode (or between non-unicode) encodings
use whatever it wants as the intermediate representation. I think for that
dchar would probably be fine - but I have no experience with that so that is
just a naive guess. Treating the unicode encodings specially seems more
practical than saying all non-standard encodings are treated the same.

-Ben
Aug 23 2004
prev sibling parent J C Calvarese <jcc7 cox.net> writes:
Ben Hinkle wrote:
...
 There were huge threads about char vs wchar vs dchar a while ago (on the old
 newsgroup, I think). All kinds of things like what the default should be,
 what the names should be, what a string class could be etc. For example
  http://www.digitalmars.com/d/archives/20361.html
  http://www.digitalmars.com/d/archives/12382.html
 or actually anything at
  http://www.digitalmars.com/d/archives/index.html
 with the word "unicode" in the subject.
In case anyone is interested, here's a page with links to many Unicode threads
in D newsgroups:

http://www.prowiki.org/wiki4d/wiki.cgi?UnicodeIssues

--
Justin (a/k/a jcc7)
http://jcc_7.tripod.com/d/
Aug 23 2004
prev sibling next sibling parent reply Arcane Jill <Arcane_member pathlink.com> writes:
The case for retaining wchar has been made, and essentially won. Please see
separate thread about ICU (and maybe move this discussion there).

"char", however, is still up for deletion, since all arguments against it still
apply.

Jill
Aug 23 2004
parent Ben Hinkle <bhinkle4 juno.com> writes:
Arcane Jill wrote:

 
 The case for retaining wchar has been made, and essentially won. Please
 see separate thread about ICU (and maybe move this discussion there).
It's won already? What is this - a Mike Tyson fight? :-) Are you referring to the old threads (on the old newsgroup) or this new thread?
 "char", however, is still up for deletion, since all arguments against it
 still apply.
 
 Jill
Once I argued that D's current concept of char should be called uchar or something to indicate the UTF-8 encoding (as opposed to C's char encoding) and that string literals have type uchar[]. I still think it would be interesting to try but it's a tweak on the current system that isn't *that* important.
Aug 23 2004
prev sibling next sibling parent Andy Friesen <andy ikagames.com> writes:
Arcane Jill wrote:

 For reasons of space-efficiency, one might want to store text in memory in
UTF-8
 format. Fair enough. But if char were to be ditched, you could still do that.
 You'd simply use a ubyte[] for that purpose (just as you are now required to do
 if you want to store text in memory in UTF-7). After all - what actually /is/ a
 UTF fragment anyway? What meaning does the UTF-8 fragment 0x83 have in
 isolation? Answer - none. It has meaning only in the context of the bytes
 surrounding it. You don't need a special primitive type just to hold that
 fragment. And of course, there is /nothing/ to stop special string classes from
 being written to provide implementations of such space-efficient abstractions.
I think it might be worth it for the conceptual clarity.

UTF-32 happens to be the character type that's hardest to break. It seems
logical that it be the default. The programmer can still take control and use
another encoding when the problem domain allows for it, but it's an
optimization, not business as usual.

 -- andy
Aug 23 2004
prev sibling parent reply "Walter" <newshound digitalmars.com> writes:
I'm not sure what transcoding means, I assume it's converting from one
string type to another.

I have some experiences with having only one character type - implementing a
C compiler, which is largely string processing using ascii, implementing a
Java compiler and working on the Java vm, which does everything as a wchar,
and implementing a javascript compiler, interpreter, and runtime which does
everything as a dchar.

dchar implementations consume memory at an alarming rate, and if you're
doing server side code, this means you'll soon run into the point where
virtual memory starts thrashing, and performance goes quickly down the
tubes. I had to do a lot of work trying to overcome this problem. dchars are
very convenient to work with, however, and make a great deal of sense as the
temporary common intermediate form of all the conversions. I stress the
temporary, though, as if you keep it around you'll start to notice the
slowdowns it causes.

char implementations can be made really fast and memory efficient. I don't
know if anyone has run any statistics, but the vast bulk of text processing
done by programs is in ASCII.

wchar implementations are, of course, halfway in between. Microsoft went
with wchars back when wchars could handle all of unicode in one word. Now
that it doesn't anymore, that means that all wchar code is going to have to
handle multiword encodings. So it's just as much extra code to write as for
chars.

Java uses wchars, but Java was not designed for high performance (although
years of herculean efforts by a lot of very smart people have brought Java a
long ways in the performance department). My Java compiler written in C++,
which used UTF-8 internally, ran 10x faster than the one written in Java,
which used UTF-16. The speedup wasn't all due to the character encoding, but
it helped.


lot of sense for Microsoft. D isn't just a Win32 language, however, and
linux uses UTF-8. Furthermore, a lot of embedded systems stick with ASCII.

In other words, I think there's a strong case for all three character
encodings in D. I agree that those using code pages will run into trouble
now and then with this, but they will anyway, because code pages are just
endless trouble since which code page some data is in is not inherent in the
data. Your idea of having the compiler detect invalid UTF-8 sequences in
string literals is very helpful here in heading a lot of these issues off at
the pass.

I think it makes perfect sense for a transcoding library to standardize on
dchar and dchar[] as its intermediate form. But don't take away char[] for
people like me that want to use it for other purposes!

I also agree that for single characters, using dchar is the best way to go.
Note that I've been redoing Phobos internals this way. For example,
std.format generates dchars to feed to the consumer function. std.ctype
takes dchar's as arguments.
Aug 23 2004
parent reply Regan Heath <regan netwin.co.nz> writes:
Walter,

I pretty much agree with everything you have said here, I don't think we 
should remove any of the char types. That said I think...

The idea that a cast from one char type to another char type (implicit or 
explicit) should perform the correct UTF transcoding (conversion) is a good
one; my arguments for this are as follows:

  - Q: if you paint a dchar[] as a char[] what do you get?
    A: a possibly (likely?) invalid UTF-8 sequence.

So why do it? I agree we do want to be able to 'paint' one type as 
another, but for a type with a specified encoding I don't think this makes 
any sense, does it? can you think of a reason to do it? Given that you 
could use ubyte, ushort or ulong instead. (types with no specified 
encoding).

  - Doing the transcoding means people writing string handling routines 
need only provide one routine and the result will automatically be 
transcoded to the type they're using.

This is such a great bonus! It will reduce the number of string handling
routines by 2/3, as any routine will have its result converted to the
required type auto-magically.

  - The argument about consistency: a ubyte cast to a ushort does not do
transcoding, so a char to a dchar shouldn't either.

There are two ways of looking at this, on one hand you're saying they 
should all 'paint' as that is consistent. However, on the other I'm saying 
they should all produce a 'valid' result. So my argument here is that when 
you cast you expect a valid result, much like casting a float to an int 
does not just 'paint'.
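
Just to spell the idea out in code (a sketch of the proposal using std.utf as
it stands today, not of how D currently behaves; the commented-out lines show
the proposed meaning):

    import std.utf;

    void sketch(char[] utf8, dchar[] utf32)
    {
        // Today the conversions are explicit calls:
        dchar[] a = toUTF32(utf8);
        char[]  b = toUTF8(utf32);

        // Under the proposal the casts below would perform those same
        // conversions instead of merely repainting the underlying memory,
        // just as casting a float to an int converts rather than repaints.
        // dchar[] c = cast(dchar[]) utf8;   // would mean toUTF32(utf8)
        // char[]  d = cast(char[])  utf32;  // would mean toUTF8(utf32)
    }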

I am interested to hear your opinions on this idea.

Regan.

On Mon, 23 Aug 2004 12:59:28 -0700, Walter <newshound digitalmars.com> 
wrote:
 I'm not sure what transcoding means, I assume it's converting from one
 string type to another.

 I have some experiences with having only one character type - 
 implementing a
 C compiler, which is largely string processing using ascii, implementing 
 a
 Java compiler and working on the Java vm, which does everything as a 
 wchar,
 and implementing a javascript compiler, interpreter, and runtime which 
 does
 everything as a dchar.

 dchar implementations consume memory at an alarming rate, and if you're
 doing server side code, this means you'll soon run into the point where
 virtual memory starts thrashing, and performance goes quickly down the
 tubes. I had to do a lot of work trying to overcome this problem. dchars 
 are
 very convenient to work with, however, and make a great deal of sense as 
 the
 temporary common intermediate form of all the conversions. I stress the
 temporary, though, as if you keep it around you'll start to notice the
 slowdowns it causes.

 char implementations can be made really fast and memory efficient. I 
 don't
 know if anyone has run any statistics, but the vast bulk of text 
 processing
 done by programs is in ASCII.

 wchar implementations are, of course, halfway in between. Microsoft went
 with wchars back when wchars could handle all of unicode in one word. Now
 that it doesn't anymore, that means that all wchar code is going to have 
 to
 handle multiword encodings. So it's just as much extra code to write as 
 for
 chars.

 Java uses wchars, but Java was not designed for high performance 
 (although
 years of herculean efforts by a lot of very smart people have brought 
 Java a
 long ways in the performance department). My Java compiler written in 
 C++,
 which used UTF-8 internally, ran 10x faster than the one written in Java,
 which used UTF-16. The speedup wasn't all due to the character encoding, 
 but
 it helped.


 a
 lot of sense for Microsoft. D isn't just a Win32 language, however, and
 linux uses UTF-8. Furthermore, a lot of embedded systems stick with 
 ASCII.

 In other words, I think there's a strong case for all three character
 encodings in D. I agree that those using code pages will run into trouble
 now and then with this, but they will anyway, because code pages are just
 endless trouble since which code page some data is in is not inherent in 
 the
 data. Your idea of having the compiler detect invalid UTF-8 sequences in
 string literals is very helpful here in heading a lot of these issues 
 off at
 the pass.

 I think it makes perfect sense for a transcoding library to standardize 
 on
 dchar and dchar[] as its intermediate form. But don't take away char[] 
 for
 people like me that want to use it for other purposes!

 I also agree that for single characters, using dchar is the best way to 
 go.
 Note that I've been redoing Phobos internals this way. For example,
 std.format generates dchars to feed to the consumer function. std.ctype
 takes dchar's as arguments.
-- Using M2, Opera's revolutionary e-mail client: http://www.opera.com/m2/
Aug 23 2004
parent reply "Walter" <newshound digitalmars.com> writes:
"Regan Heath" <regan netwin.co.nz> wrote in message
news:opsc7hhm0a5a2sq9 digitalmars.com...
 Walter,

 I pretty much agree with everything you have said here, I don't think we
 should remove any of the char types. That said I think...

 The idea that a cast from one char type to another char type (implicit or
 explicit) should perform the correct UTF transcoding(conversion) is a good
 one, my arguments for this are as follows:

   - Q: if you paint a dchar[] as a char[] what do you get?
     A: a possibly (likely?) invalid UTF-8 sequence.

 So why do it? I agree we do want to be able to 'paint' one type as
 another, but for a type with a specified encoding I don't think this makes
 any sense, does it? can you think of a reason to do it? Given that you
 could use ubyte, ushort or ulong instead. (types with no specified
 encoding).

   - Doing the transcoding means people writing string handling routines
 need only provide one routine and the result will automatically be
 transcoded to the type they're using.

 This is such a great bonus! it will reduce the number of string handling
 routines by 2/3 as any routine will have its result converted to the
 required type auto-magically.

   - The argument about consistency: a ubyte cast to a ushort does not do
 transcoding, so a char to a dchar shouldn't either.

 There are two ways of looking at this, on one hand you're saying they
 should all 'paint' as that is consistent. However, on the other I'm saying
 they should all produce a 'valid' result. So my argument here is that when
 you cast you expect a valid result, much like casting a float to an int
 does not just 'paint'.

 I am interested to hear your opinions on this idea.
I think your idea has a lot of merit. I'm certainly leaning that way.
Aug 23 2004
parent reply "antiAlias" <fu bar.com> writes:
"Walter" <newshound digitalmars.com> wrote in message ...
 "Regan" wrote:
 There are two ways of looking at this, on one hand you're saying they
 should all 'paint' as that is consistent. However, on the other I'm
saying
 they should all produce a 'valid' result. So my argument here is that
when
 you cast you expect a valid result, much like casting a float to an int
 does not just 'paint'.

 I am interested to hear your opinions on this idea.
I think your idea has a lot of merit. I'm certainly leaning that way.
On the one hand, this would be well served by an opCast() and/or opCast_r() on
the primitive types; just the kind of thing suggested in a related thread
(which talked about overloadable methods for primitive types).

On the other hand, we're talking transcoding here. Are you gonna' limit this
to UTF-8 only? Then, since the source and destination will typically be of
different sizes, do you then force all casts between these types to have the
destination be an array reference rather than an instance? One that is always
allocated on the fly?

Then there's performance. It's entirely possible to write transcoders that are
minimally between 5 and 30 times faster than the std.utf ones. Some people
actually do care about efficiency. I'm one of them.

If you do implement overloadable primitive-methods (like properties) then,
will you allow a programmer to override them? So they can make the opCast() do
something more specific to their own specific task? That seems like a lot to
build into the core of a language.

Personally, I think it's 'borderline' to have so many data types available for
similar things. If there were an "alias wchar[] string", and the core Object
supported that via "string toString()", and the IUC library were adopted, then
I think some of the general confusion would perhaps melt somewhat. In many
respects, too many choices is simply a BadThing (TM). Especially when there's
precious little solid guidance to help. That guidance might come from a decent
library that indicates how the types are used, and uses one obvious type
(string?) consistently. Besides, if IUC were adopted, less people would have
to worry about the distinction anyway.

Believe me when I say that Mango would dearly love to go dchar[] only.
Actually, it probably will at the higher levels because it makes life simple
for everyone. Oh, and I've been accused many times of being an efficiency
fanatic, especially when it comes to servers. But there's always a tradeoff
somewhere. Here, the tradeoff is simplicity-of-use versus quantities of RAM.
Which one changes dramatically over time? Hmmmm ... let me see now ... 64bit-OS
for desktops just around the corner?

Even on an embedded device I'd probably go "dchar only" regarding I18N. Simply
because the quantity of text processed on such devices is very limited. Before
anyone shoots me over this one, I regularly write code for devices with just
4KB RAM ~ still use 16bit chars there when dealing with XML input.

So what am I saying here? Available RAM will always increase in great leaps.
Contemplating that the latter should dictate ease-of-use within D is a serious
breach of logic, IMO. Ease of use, and above all, /consistency/ should be
paramount; if you have the programmer in mind.
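
The shape of that suggestion, as a sketch (the greeting() function is invented
for illustration, and the code assumes D as it stands, where string literals
convert to plain wchar[]):

    // One library-blessed alias, used consistently instead of juggling
    // char[]/wchar[]/dchar[] at the API level.
    alias wchar[] string;

    string greeting()
    {
        return "hello"w;   // a wchar[] literal, i.e. a 'string' under this alias
    }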
Aug 23 2004
next sibling parent reply Regan Heath <regan netwin.co.nz> writes:
On Mon, 23 Aug 2004 18:29:54 -0700, antiAlias <fu bar.com> wrote:
 "Walter" <newshound digitalmars.com> wrote in message ...
 "Regan" wrote:
 There are two ways of looking at this, on one hand you're saying they
 should all 'paint' as that is consistent. However, on the other I'm
saying
 they should all produce a 'valid' result. So my argument here is that
when
 you cast you expect a valid result, much like casting a float to an 
int
 does not just 'paint'.

 I am interested to hear your opinions on this idea.
I think your idea has a lot of merit. I'm certainly leaning that way.
On the one hand, this would be well served by an opCast() and/or opCast_r() on the primitive types; just the kind of thing suggested in a related thread (which talked about overloadable methods for primitive types).
True. However, what else should the opCast for char[] to dchar[] do except transcode it? What about opCast for int to dchar.. it seems to me there is only 1 choice about what to do, anything else would be operator abuse.
 On the other hand, we're talking transcoding here. Are you gonna' limit 
 this
 to UTF-8 only?
I hope not, I am hoping to see:

        | UTF-8 | UTF-16 | UTF-32
--------------------------------
UTF-8   |   -   |   +    |   +
UTF-16  |   +   |   -    |   +
UTF-32  |   +   |   +    |   -

(+ indicates transcoding occurs)
 Then, since the source and destination will typically be of
 different sizes, do you then force all casts between these types to have 
 the
 destination be an array reference rather than an instance?
I don't think transcoding makes any sense unless you're talking about a
'string' (i.e. char[], wchar[], or dchar[]) as opposed to a UTF-x fragment
(i.e. char, wchar).

As AJ has frequently pointed out, a char or wchar does not equal one
"character" in some cases. Given that, and assuming 'c' below is "not a whole
character", I cannot see how:

  dchar d;
  char c = ?;        // not a whole character

  d = c;             // implicit
  d = cast(dchar)c;  // explicit

would ever be able to transcode. So in this instance this should either be:
1. a compile error
2. a runtime error
3. a simple copy of the value, creating an invalid? (will it be AJ?) UTF-32
character.

Option 3 has the possibility to create an invalid UTF-x code point/fragment
(I don't know the right term).
 One that is
 always allocated on the fly?
No.. bad idea methinks.
 Then there's performance. It's entirely
 possible to write transcoders that are minimally between 5 and 30 times
 faster than the std.utf ones. Some people actually do care about 
 efficiency.
 I'm one of them.
Same here. I believe we all have the same goal, we just have different ideas about how best to get there.
 If you do implement overloadable primitive-methods (like properties) 
 then,
 will you allow a programmer to override them? So they can make the 
 opCast()
 do something more specific to their own specific task?
Can you think of a task you'd want to put one of these opCast methods to? (and give an example).
 That's seems like a lot to build into the core of a language.
The opCast overloading? or the original idea?

The only reservation I had about the original idea was that it seemed like "a
lot to build into the core of a language" at first, and then I realised if you
cast from one char type to another, the _only_ sensible thing to do is
transcode; anything else is a bug/error. In addition, this sort of cast (char
type to char type) doesn't occur a lot, but when it does you end up calling
toUTFxx manually, so why not just have it happen?
 Personally, I think it's 'borderline' to have so many data types 
 available for similar things.
Like byte, short, int and long?
 If there were an "alias wchar[] string", and the core
 Object supported that via "string toString()", and the IUC library were
 adopted, then I think some of the general confusion would perhaps melt
 somewhat.
Maybe. However, toString for an 'int' (for example) only needs a char[] to
represent all its possible values (with one character equalling one 'char'),
so why use wchar or dchar for its toString? Conversely, something else might
find it more efficient to return its toString as a dchar[]. You might argue
either example is rare, certainly the latter is rarer than the former, but the
efficiency gnome in my head won't shut up about this..

If we simply alias wchar[] as 'string' then these examples will need manual
conversion to 'string', which involves calling toUTF16 or .. what would you
call that conversion function such that it was obvious? toString? :)
 In many respects, too many choices is simply a BadThing (TM).
I think too many very similar but slightly different choices/methods is a bad thing, however, I don't see char, wchar and dchar as that. I think they are more like byte, short and int, different types for different uses, castable to/from each other where required. The choice they give you is being able to choose the right type for the job.
 Especially
 when there's precious little solid guidance to help. That guidance might
 come from a decent library that indicates how the types are used, and 
 uses
 one obvious type (string?) consistently.
I think the confusion comes from being used to only one string type: for C/C++
programmers it's typically an 8-bit unsigned type usually containing ASCII,
for Java a 16-bit signed/unsigned? type containing UTF-16?

Internationalisation is a new topic for me and many others (I suspect), even
for Walter(?). Having 3 types requiring manual transcoding between them _is_ a
pain.
 Besides, if IUC were adopted, less
 people would have to worry about the distinction anyway.
The same could be said for the implicit transcoding from one type to the other.
 Believe me when I say that Mango would dearly love to go dchar[] only.
 Actually, it probably will at the higher levels because it makes life 
 simple
 for everyone. Oh, and I've been accused many times of being an efficiency
 fanatic, especially when it comes to servers. But there's always a 
 tradeoff
 somewhere. Here, the tradeoff is simplicity-of-use versus quantities of 
 RAM.
 Which one changes dramatically over time? Hmmmm ... let me see now ...
 64bit-OS for desktops just around corner?
What you really mean is you'd dearly love to not have to worry about the
differences between the 3 types; implicit transcoding will give you that.
Furthermore, it's simplicity without sacrificing RAM.

The thing I like about implicit transcoding is that if, for example, you have
a lot of string data stored in memory, you can store it in the most efficient
format for that data, which may be char, wchar or dchar. If you then want to
call a function which takes a dchar on some/all of your data, it will be
implicitly converted to dchar (if not already) for the function. If at a later
date you decide to change the format of the stored data, you don't have to
find every call to a function and insert/remove toUTFxx calls. To me, this
sounds like the most efficient way to handle it.
 Even on an embedded device I'd probably go "dchar only" regarding I18N.
 Simply because the quantity of text processed on such devices is very
 limited. Before anyone shoots me over this one, I regularly write code 
 for
 devices with just 4KB RAM ~ still use 16bit chars there when dealing with
 XML input.
Were you to use implicit transcoding you could store the data in memory in
UTF-8, or UTF-16, then only transcode to UTF-32 when required; this would be
more efficient.
 So what am I saying here? Available RAM will always increase in great 
 leaps.
Probably.
 Contemplating that the latter should dictate ease-of-use within D is a
 serious breach of logic, IMO.
"the latter"?
 Ease of use, and above all, /consistency/
 should be paramount; if you have the programmer in mind.
This last statement applies to implicit transcoding perfectly. It's easy and
consistent.

Regan

--
Using M2, Opera's revolutionary e-mail client: http://www.opera.com/m2/
Aug 23 2004
next sibling parent reply "antiAlias" <fu bar.com> writes:
"Regan Heath" <regan netwin.co.nz> wrote in message
news:opsc7u56zf5a2sq9 digitalmars.com...
 On Mon, 23 Aug 2004 18:29:54 -0700, antiAlias <fu bar.com> wrote:
 "Walter" <newshound digitalmars.com> wrote in message ...
 "Regan" wrote:
 There are two ways of looking at this, on one hand you're saying they
 should all 'paint' as that is consistent. However, on the other I'm
saying
 they should all produce a 'valid' result. So my argument here is that
when
 you cast you expect a valid result, much like casting a float to an
int
 does not just 'paint'.

 I am interested to hear your opinions on this idea.
I think your idea has a lot of merit. I'm certainly leaning that way.
On the one hand, this would be well served by an opCast() and/or opCast_r() on the primitive types; just the kind of thing suggested in a related thread (which talked about overloadable methods for primitive types).
True. However, what else should the opCast for char[] to dchar[] do except transcode it? What about opCast for int to dchar.. it seems to me there is only 1 choice about what to do, anything else would be operator abuse.
 On the other hand, we're talking transcoding here. Are you gonna' limit
 this
 to UTF-8 only?
 I hope not, I am hoping to see:

         | UTF-8 | UTF-16 | UTF-32
 --------------------------------
 UTF-8   |   -   |   +    |   +
 UTF-16  |   +   |   -    |   +
 UTF-32  |   +   |   +    |   -

 (+ indicates transcoding occurs)
========================
And what happens when just one additional byte-oriented encoding is
introduced? Perhaps UTF-7? Perhaps EBCDIC? The basic premise is flawed because
there's no flexibility.
 Then, since the source and destination will typically be of
 different sizes, do you then force all casts between these types to have
 the
 destination be an array reference rather than an instance?
I don't think transcoding makes any sense unless you're talking about a 'string' (i.e. char[], wchar[], or dchar[]) as opposed to a UTF-x fragment (i.e. char, wchar)
========================
We /are/ talking about arrays. Perhaps if that sentence had ended "array
reference rather than an *array* instance?", it might have been more clear?
The point being made is that you would not be able to do anything like this:

char[15] dst;
dchar[10] src;

dst = cast(char[]) src;

because there's no ability via a cast() to indicate how many items from src
were converted, and how many items in dst were populated. You are forced into
this kind of thing:

char[] dst;
dchar[10] src;

dst = cast(char[]) src;

You see the distinction? It may be subtle to some, but it's a glaring
imbalance to others. The lValue must always be a reference because it's gonna'
be allocated dynamically to ensure all of the rValue will fit. In the end,
it's just far better to use functions/methods that provide the feedback
required so you can actually control what's going on (or such that a library
function can). That way, you're not restricted in the same fashion. We don't
need more asymmetry in D, and this just reeks of poor design, IMO.

To drive this home, consider the decoding version (rather than the encoding
above):

char[15] src;
dchar[] dst;

dst = cast(dchar[]) src;

What happens when there's a partial character left undecoded at the end of
'src'? There is nothing here to tell you that you've got a dangly bit left at
the end of the source-buffer. It's gone. Poof! Any further decoding from the
same file/socket/whatever is henceforth trashed, because the ball has been
both dropped and buried. End of story.
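
To be explicit about the kind of feedback meant here, this is the general
shape of a streaming decoder that a cast() can never emulate (hypothetical;
not Mango's actual interface):

    // A decoder that reports how much of the source it consumed, so a partial
    // trailing sequence can be carried over and prepended to the next buffer.
    interface StreamDecoder
    {
        // Decodes complete sequences from src into dst and returns the number
        // of bytes consumed; src[consumed .. $] is the undecoded tail.
        size_t decode(ubyte[] src, out dchar[] dst);
    }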
 Having 3 types requiring manual transcoding between them _is_ a pain.
========================
It certainly is. That's why other languages try to avoid it at all costs.
Having it done "generously" by the compiler is also a pain, inflexible, and
likely expensive. There are many things a programmer should take
responsibility for; transcoding comes under that umbrella because (a) there
can be subtle complexity involved and (b) it is relatively expensive to churn
through text and convert it; particularly so with the Phobos utf-8 code.

What you appear to be suggesting is that this kind of thing should happen
silently whilst one nonchalantly passes arguments around between methods.
That's insane, so I hope that's not what you're advocating. Java, for example,
does that at one specific layer (I/O), but you're apparently suggesting doing
it at any old place! And several times over, just in case it wasn't good
enough the first time :-)

Sorry man. This is inane. D is /not/ a scripting language; instead it's
supposed to be a systems language.
 Besides, if IUC were adopted, less
 people would have to worry about the distinction anyway.
The same could be said for the implicit transcoding from one type to the other.
========================
That's pretty short-sighted IMO. You appear to be saying that implicit
transcoding would take the place of ICU; terribly misleading. Transcoding is
just a very small part of that package. Please try to reread the comment as
"most people would be shielded completely by the library functions, therefore
there's far fewer scenarios where they'd ever have a need to drop into
anything else". This would be a very GoodThing for D users. Far better to have
a good library to take care of /all/ this crap than have the D language do
some partial-conversions on the fly, and then quit because it doesn't know how
to provide any further functionality.

This is the classic core-language-versus-library-functionality bitchfest all
over again. Building all this into a cast()? Hey! Let's make Walter's Regex
class part of the compiler too; and make it do UTF-8 decoding /while/ it's
searching, since you'll be able to pass it a dchar[] that will be generously
converted to the accepted char[] for you "on-the-fly". Excuse me for jesting,
but perhaps the Bidi text algorithms plus date/numeric formatting & parsing
will all fit into a single operator also? That's kind of what's being
suggested. I believe there's a serious underestimate of the task being
discussed.
 Believe me when I say that Mango would dearly love to go dchar[] only.
 Actually, it probably will at the higher levels because it makes life
 simple
 for everyone. Oh, and I've been accused many times of being an
efficiency
 fanatic, especially when it comes to servers. But there's always a
 tradeoff
 somewhere. Here, the tradeoff is simplicity-of-use versus quantities of
 RAM.
 Which one changes dramatically over time? Hmmmm ... let me see now ...
 64bit-OS for desktops just around corner?
 What you really mean is you'd dearly love to not have to worry about the differences between the 3 types, implicit transcoding will give you that. Furthermore it's simplicity without sacrificing RAM.
========================
Ahh Thanks. I didn't realize that's what I "really meant". Wait a minute ...
 Even on an embedded device I'd probably go "dchar only" regarding I18N.
 Simply because the quantity of text processed on such devices is very
 limited. Before anyone shoots me over this one, I regularly write code
 for
 devices with just 4KB RAM ~ still use 16bit chars there when dealing
with
 XML input.
 Were you to use implicit transcoding you could store the data in memory in UTF-8, or UTF-16 then only transcode to UTF-32 when required, this would be more efficient.
========================
That's a rather large assumption, don't you think? More efficient? In which
particular way? Is memory usage or CPU usage more important in /my/ particular
applications? Please either refrain, or commit to rewriting all my old code
more efficiently for me ... for free <g>
Aug 23 2004
next sibling parent reply Arcane Jill <Arcane_member pathlink.com> writes:
In article <cgelmc$dqu$1 digitaldaemon.com>, antiAlias says...

And what happens when just one additional byte-oriented encoding is
introduced? Perhaps UTF-7? Perhaps EBCDIC? The basic premise is flawed
because there's no flexibility.
Yes. But I don't see D introducing a utf7char or a utfEbcdicChar type any time soon. There's no flexibility because there are no plans to extend it. I can live with ONE native string type. I can also live with 3 mutually interoperable native string types. But neither of these schemes precludes anyone else from writing/using other string types.
We /are/ talking about arrays. Perhaps if that sentence had ended "array
reference rather than an *array* instance?", it might have been more clear?
The point being made is that you would not be able to do anything like this:

char[15] dst;
dchar[10] src;

dst = cast(char[]) src;
But that's because you already can't do: and that's not going to change, so that's the end of that. Regan isn't suggesting anything beyond a bit of syntactic sugar here. No-one (so far) has suggested that dynamic arrays be converted to static arrays, or that auto-casting should write into a user-supplied fixed-size buffer. D does not prohibit you from doing the things you have suggested. It's just that you'd have to do them explicitly. /Implicit/ conversion is only being suggested for the three D string types.
because there's no ability via a cast() to indicate how many items from src
were converted, and how many items in dst were populated. You are forced
into this kind of thing:

char[] dst;
dchar[10] src;

dst = cast(char[]) src;
Actually it would be: if I had my way.
You see the distinction? It may be subtle to some, but it's a glaring
imbalance to others. The lValue must always be a reference because it's
gonna' be allocated dynamically to ensure all of the rValue will fit. In the
end, it's just far better to use functions/methods that provide the feedback
required so you can actually control what's going on (or such that a library
function can). That way, you're not restricted in the same fashion. We don't
need more asymmetry in D, and this just reeks of poor design, IMO.
I don't agree with this claim. It is being suggested that:

  dst = cast(char[]) src;

be equivalent to:

  dst = toUTF8(src);

Nobody loses anything by this. All that happens is that things work more
smoothly. If you want to call a different function to do your transcoding then
there's nothing to stop you. I assume that the use you have in mind is
stream-internal transcoding via buffers. You can still do that. The above
won't stop you.
To drive this home, consider the decoding version (rather than the encoding
above):

char[15] src;
dchar[] dst;

dst = cast(dchar[]) src;

What happens when there's a partial character left undecoded at the end of
'src'?
*ALL* that is being suggested is that, given the above declarations:

dst = src;

would be equivalent to:

dst = toUTF32(src);

Nothing more. So the answer to your question is: it would throw an exception.
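Here is a minimal sketch of that failure mode (assuming the lowering goes through std.utf.toUTF32(); the exact exception class thrown by std.utf isn't important here):

import std.utf;

void main()
{
    char[3] src;
    src[0] = 'h';
    src[1] = 'i';
    src[2] = cast(char) 0xC3;   // lead byte of a two-byte sequence, with no trailing byte

    dchar[] dst;
    dst = toUTF32(src);         // under the suggestion, "dst = src;" would mean this,
                                // and std.utf rejects the dangling fragment by throwing
}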
There's nothing here to tell you that you've got a dangly bit left at
the end of the source-buffer. It's gone. Poof! Any further decoding from the
same file/socket/whatever is henceforth trashed, because the ball has been
both dropped and buried. End of story.
This has got nothing to do with either transcoding or streams. This is not being suggested as a general transcoding mechanism, merely as an internal conversion between D's three string types. /General/ transcoding will have to work for all supported encodings, and won't be relying on the std.utf functions. Files and sockets won't use the std.utf functions either because they will employ the general transcoding mechanism. Your transcoding ideas are excellent, but they are not relevant to this.
There are many things a programmer should take
responsibility for; transcoding comes under that umbrella because (a) there
can be subtle complexity involved and (b) it is relatively expensive to
churn through text and convert it; particularly so with the Phobos utf-8
code.
I'm surprised at you. I would have said:
There are some things a programmer should /not/ have to take
responsibility for; transcoding comes under that umbrella because (a) there
can be subtle complexity involved and (b) it is relatively expensive to
churn through text and convert it;
A library is the appropriate place for this stuff. That could be Phobos, Mango, whatever. Here's my take on it:

(1) Implicit conversion between native D strings should be handled by the compiler in whatever way it sees fit. If it chooses to invoke a library function then so be it. (And it /should/ be Phobos, because Phobos is shipped with D). Note that this is exactly the same situation that the future "cent" type will incur. If I divide one "cent" by another, I would expect the D compiler to translate this into a function call within Phobos. Why is that such a crime?

(2) Fully featured transcoding should be done by ICU, as this is a high performance mature product, and all the code is already written.

(3) Some adaptation of the ICU wrapper may be necessary to integrate this more neatly with "the D way". But I'm confident we can do this in a way which is not biased toward Phobos.

Phobos UTF-8 code could be faster, I grant you. But perhaps it will be in the next release. We're only talking about a tiny number of functions here, after all.
What you appear to be suggesting is that this kind of thing should happen
silently whilst one nonchalantly passes arguments around between methods.
I would vote for that, yes. But only as a second choice. My /first/ choice would be to have D standardize on one single kind of string, the wchar[].
That's insane, so I hope that's not what you're advocating. Java, for
example, does that at one specific layer (I/O), but you're apparently
suggesting doing it at any old place! And several times over, just in case
it wasn't good enough the first time :-)  Sorry man. This is inane. D is
/not/ a scripting language; instead it's supposed to be a systems language.
How about we all get on the same side here? Like I said, my /first/ choice would be to have D standardize on one single kind of string, the wchar[]. But we don't always get our first choice. Walter doesn't like this idea. I suggest you add your voice to mine and help try to persuade him that ONE kind of string is the way to go.

BUT - if we fail in convincing him - the second choice is better than the status quo. Why? Because if we fail to convince Walter that multiple string types are bad, then all that conversion is going to happen ANYWAY. It will happen because Object.toString() will (often) return UTF-8; because string literals will generate UTF-8; and because the functions in ICU will require and return UTF-16. The ONLY difference between this suggestion and the status quo is that we won't have to write "cast(char[])" and "cast(wchar[])" all over the place.

You're doing a lot of arguing /against/. What are you /for/? Arguing against a suggestion is usually interpreted as a vote for the status quo. Are you really doing that?
That's pretty short-sighted IMO. You appear to be saying that implicit
transcoding would take the place of ICU; terribly misleading.
Of course he's not.
Excuse me for jesting, but perhaps the Bidi text algorithms plus
date/numeric formatting & parsing will all fit into a single operator also?
That's kind of what's being suggested.
Nothing is being suggested except syntactic sugar. My first choice vote still goes to ditching the char. With char gone, it would be natural for wchar[] to become the "standard" D string, which fits in well with ICU. There would be no need for implicit (or even explicit) cast-conversion between wchar[] and dchar[], because dchar[] would be a specialist thing, only used by speed-efficiency fanatics, while wchar[] would be business as usual. But if I can't have my first choice, I'll vote for implicit conversion as my second choice. Arcane Jill
Aug 24 2004
next sibling parent reply "antiAlias" <fu bar.com> writes:
"Arcane Jill" <Arcane_member pathlink.com>
 There's nothing here to tell you that you've got a dangly bit left at
 the end of the source-buffer. It's gone. Poof! Any further decoding from the
 same file/socket/whatever is henceforth trashed, because the ball has been
 both dropped and buried. End of story.
This has got nothing to do with either transcoding or streams. This is not
being suggested as a general transcoding mechanism, merely as an internal
conversion between D's three string types. /General/ transcoding will have
to work for all supported encodings, and won't be relying on the std.utf
functions. Files and sockets won't use the std.utf functions either because
they will employ the general transcoding mechanism.
Oh. I was under the impression the 'solution' being tendered was a jack-of-all trades. If we're simply talking about converting /static/ strings between different representations, then, cool. It's done at compile-time.
 I'm surprised at you. I would have said:

There are some things a programmer should /not/ have to take
responsibility for; transcoding comes under that umbrella because (a) there
can be subtle complexity involved and (b) it is relatively expensive to
churn through text and convert it;

A library is the appropriate place for this stuff. That could be Phobos,
Mango, whatever. Here's my take on it:
We agree. My point was that expensive operations such as these should perhaps not be hidden "under the covers"; but explicitly handled by a library call instead. However, I clearly have the wrong impression about the extent of what this implicit-conversion is attempting.
 How about we all get on the same side here?
I really think we are, Jill. My concerns are about trying to build partial versions of ICU functionality into the D language itself, rather than let that extensive and capable library take care of it. But apparently that's not what's happening here. My mistake.
 You're doing a lot of arguing /against/. What are you /for/? Arguing
 against a suggestion is usually interpreted as a vote for the status quo.
 Are you really doing that?
Nope. There appeared to be some consensus-building that transcoding could all be handled via a cast() operator. I felt it worth pointing out where and why that's not a valid approach.

The other aspect involved here is that of string-concatenation. D cannot have more than one return type for toString() as you know. It's fixed at char[]. If string concatenation uses the toString() method to retrieve its components (as is being proposed elsewhere), then there will be multiple, redundant, implicit conversions going on where the string really wanted to be dchar[] in the first place. That is:

A a; // class instances ...
B b;
C c;

dchar[] message = c ~ b ~ a;

Under the proposed "implicit" scheme, if each toString() of A, B, and C wish to return dchar[], then each concatenation causes an implicit conversion/encoding from each dchar[] to char[] (for the toString() return). Then another full conversion/decoding is performed back to the dchar[] assignment once each has been concatenated. This is like the Wintel 'plot' for selling more cpu's :-)

Doing this manually, one would forego the toString() altogether:

dchar[] message = c.getString() ~ b.getString() ~ a.getString();

... where getString() is a programmer-specific idiom to return the (natural) dchar[] for these classes, and we carefully avoided all those darned implicit-conversions. However, which approach do you think people will use? My guess is that D may become bogged down in conversion hell over such things.

So, to answer your question: What I'm /for/ is not covering up these types of issues with blanket-style implicit conversions. Something more constructive (and with a little more forethought) needs to be done.
Aug 24 2004
next sibling parent reply Regan Heath <regan netwin.co.nz> writes:
On Tue, 24 Aug 2004 11:36:45 -0700, antiAlias <fu bar.com> wrote:

<snip>

 Oh. I was under the impression the 'solution' being tendered was a
 jack-of-all trades. If we're simply talking about converting /static/
 strings between different representations, then, cool. It's done at
 compile-time.
Nope. We are talking about implicit conversion to/from all 3 forms of UTF-x encoded base types where required. Does this make any sense whatsoever currently?

char[] c = ?; // some valid utf-8 sequence
dchar[] d;

d = cast(dchar[]) c;

I believe the answer is "no", reasoning: d will now (possibly) contain an invalid utf-32 sequence. The only sensible thing to do is transcode. If you want to 'paint' a char type as a smaller type to get at its bits/bytes or shorts (snicker) you can and should use ubyte or ushort, _not_ another char type. char types imply a default encoding, so painting one to another is illegal; painting something to/from a char type is legal and useful.

<snip>
 We agree. My point was that expensive operations such as these should
 perhaps not be hidden "under the covers"; but explicitly handled by a
 library call instead.
On principle I totally agree with this statement. However, in this case I am simply suggesting implicit conversion where you would already have to write toUTFxx(); the idea does not _add_ any expense, only convenience. Yes, an unaware programmer might not realise it's transcoding, and they might make some inefficient choices, but that same programmer will probably also do this:

char[] c;
dchar[] d;

c = cast(char[]) d;

and create invalid utf-x sequences (some of the time), and a bug.
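A short sketch of that difference (U+20AC is just a convenient non-ASCII example; painting goes through ubyte[], transcoding through the existing std.utf calls):

import std.utf;

void main()
{
    dchar[] d = "\u20AC"d;          // one code point, U+20AC (EURO SIGN)

    // painting: reinterpret the same bytes; fine for ubyte[], because no
    // encoding is implied by that type
    ubyte[] raw = cast(ubyte[]) d;

    // painting to another char type claims those bytes are UTF-8, which
    // they are not
    char[] bad = cast(char[]) d;

    // transcoding produces the correct 3-byte UTF-8 sequence
    char[] good = toUTF8(d);
    validate(good);                 // passes
    // validate(bad);               // would be rejected as invalid UTF-8
}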
 However, I clearly have the wrong impression about the
 extent of what this implicit-conversion is attempting.
No.. I _think_ you understood it fine. I _think_ we just disagree about what is efficient and what is not.
 How about we all get on the same side here?
I really think we are, Jill. My concerns are about trying to build partial versions of ICU functionality into the D language itself, rather than let that extensive and capable library take care of it. But apparently that's not what's happening here. My mistake.
Err... it is, a small part, the part that already exists in std.utf, conversion from utf-x to utf-<another x>.
 You're doing a lot of arguing /against/. What are you /for/? Arguing
 against a suggestion is usually interpreted as a vote for the status quo.
 Are you really doing that?
Nope. There appeared to be some consensus-building that transcoding could all be handled via a cast() operator. I felt it worth pointing out where and why that's not a valid approach.
Actually I want it to transcode implicitly, eg.

char[] c;
dchar[] d;

d = c; // transcodes

Can you enumerate your reasons why this is 'not a valid approach'? (I could search the previous posts and try to do that for you, but I might misinterpret what you meant.)
 The other aspect involved here is that of string-concatenation. D cannot
 have more than one return type for toString() as you know.
True.
 It's fixed at
 char[].
Is it?!
I didn't realise that, so this is invalid?

class A {
   dchar[] toString() {}
}
 If string concatenation uses the toString() method to retrieve its
 components (as is being proposed elsewhere), then there will be multiple,
 redundant, implicit conversions going on where the string really wanted to
 be dchar[] in the first place. That is:

 A a; // class instances ...
 B b;
 C c;

 dchar[] message = c ~ b ~ a;

 Under the proposed "implicit" scheme, if each toString() of A, B, and C
 wish to return dchar[], then each concatenation causes an implicit
 conversion/encoding from each dchar[] to char[] (for the toString()
 return).
Assuming toString returned char[] and not dchar[], yes. Assuming that trying to return dchar without transcoding will create invalid UTF-8 sequences. If implicit transcoding were implemented then you should be able to define the return value of your classes toString to be any of char[] wchar[] or dchar[] as it will implicitly transcode to the type it requires. Basically you use the most applicable type, for example AJ's Int class would use char[] (unless AJ has another reason not to) as all the string data required is ASCII and fits best in UTF-8.
 Then another full conversion/decoding is performed back to the dchar[]
 assignment once each has been concatenated. This is like the Wintel 
 'plot' for selling more cpu's :-)
Not if toString can return dchar[] and all 3 classes do that.
 Doing this manually, one would forego the toString() altogether:

 dchar[] message = c.getString() ~ b.getString() ~ a.getString();

 ... where getString() is a programmer-specific idiom to return the (natural)
 dchar[] for these classes, and we carefully avoided all those darned
 implicit-conversions. However, which approach do you think people will use?
Taking into account what I have said above.. the easy one, i.e. implicit transcoding.
 My guess is that D may become bogged down in conversion hell over such
 things.

 So, to answer your question:
 What I'm /for/ is not covering up these types of issues with blanket-style
 implicit conversions. Something more constructive (and with a little more
 forethought) needs to be done.
I believe implicit conversion to be constructive; it stops bugs and makes string handling much easier. What we are doing here _is_ the forethought; after all, nothing has been implemented yet. Regan -- Using M2, Opera's revolutionary e-mail client: http://www.opera.com/m2/
Aug 24 2004
next sibling parent reply "antiAlias" <fu bar.com> writes:
"Regan Heath" <regan netwin.co.nz> wrote
 Is it?!
 I didn't realise that, so this is invalid?

 class A {
    dchar[] toString() {}
 }
Yes. It most certainly is, Regan. I (incorrectly) assumed you understood that. Sorry. There have been a number of posts that note this, and its implications.
Aug 24 2004
parent reply Regan Heath <regan netwin.co.nz> writes:
On Tue, 24 Aug 2004 19:45:07 -0700, antiAlias <fu bar.com> wrote:

 "Regan Heath" <regan netwin.co.nz> wrote
 Is it?!
 I didn't realise that, so this is invalid?

 class A {
    dchar[] toString() {}
 }
Yes. It most certainly is, Regan. I (incorrectly) assumed you understood that.
Either:
a. I am overly sensitive/insecure
b. You didn't realise
c. You're intentionally trying to belittle me because ...

"understood" is not the right word, "knew" is a better choice.. "understood" implies I knew but didn't understand. That isn't the case. (this time)
 Sorry. There have been a number of posts that note this, and its
 implications.
I must have missed them, or missed the importance of that fact. Strange, given that I read *everything* in all the D NG's on digitalmars.com. Regan -- Using M2, Opera's revolutionary e-mail client: http://www.opera.com/m2/
Aug 24 2004
parent "antiAlias" <fu bar.com> writes:
Sorry to have offended your sensibilities, dude. If "knew" is used in your
part of the world then "understood" is used in mine. Too bad for the
misunderstanding.

Here's a link to a post from Matthew. Given that it's a reply, I think you
can safely count at least two posts that you missed <g>

news:cg69c1$120n$1 digitaldaemon.com



"Regan Heath" <regan netwin.co.nz> wrote in message
news:opsc9uihna5a2sq9 digitalmars.com...
 On Tue, 24 Aug 2004 19:45:07 -0700, antiAlias <fu bar.com> wrote:

 "Regan Heath" <regan netwin.co.nz> wrote
 Is it?!
 I didn't realise that, so this is invalid?

 class A {
    dchar[] toString() {}
 }
Yes. It most certainly is, Regan. I (incorrectly) assumed you understood that.
Either: a. I am overly sensitive/insecure b. You didn't realise c. You're intentionally trying to belittle me because ... "understood" is not the right word "knew" is a better choice.. "understood" implies I knew but didn't understand. That isn't the case. (this time)
 Sorry. There have been a number of posts that note this, and its
 implications.
I must have missed them, or missed the importance of that fact. strange given that I read *everything* in all the D NG's on digitalmars.com. Regan -- Using M2, Opera's revolutionary e-mail client: http://www.opera.com/m2/
Aug 24 2004
prev sibling parent reply Arcane Jill <Arcane_member pathlink.com> writes:
In article <opsc9m9ccg5a2sq9 digitalmars.com>, Regan Heath says...

Is it?!
I didn't realise that, so this is invalid?

class A {
   dchar[] toString() {}
}
It's not invalid as such, it's just that the return type of an overriding function has to be "covariant" with the return type of the function it overrides. So it's a compile error /now/. But if dchar[] and char[] were to be considered mutually covariant then this would magically start to compile. Arcane Jill
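A quick sketch of the rule as it stands (the class is made up; the rejected override is shown commented out):

class A
{
    // compiles: char[] matches the return type of Object.toString()
    char[] toString() { return "A"; }

    // rejected today, because dchar[] is not covariant with char[];
    // if the two were treated as mutually covariant, this version
    // would compile instead:
    // dchar[] toString() { return "A"d; }
}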
Aug 24 2004
next sibling parent reply Regan Heath <regan netwin.co.nz> writes:
On Wed, 25 Aug 2004 05:44:18 +0000 (UTC), Arcane Jill 
<Arcane_member pathlink.com> wrote:
 In article <opsc9m9ccg5a2sq9 digitalmars.com>, Regan Heath says...

 Is it?!
 I didn't realise that, so this is invalid?

 class A {
   dchar[] toString() {}
 }
It's not invalid as such, it's just that the return type of an overriding function has to be "covariant" with the return type of the function it overrides. So it's a compile error /now/. But if dchar[] and char[] were to be considered mutually covariant then this would magically start to compile.
Ahh.. excellent, that is what I was hoping to hear. Regan -- Using M2, Opera's revolutionary e-mail client: http://www.opera.com/m2/
Aug 24 2004
parent reply Arcane Jill <Arcane_member pathlink.com> writes:
In article <opsc9xu9h35a2sq9 digitalmars.com>, Regan Heath says...

Ahh.. excellent, that is what I was hoping to hear.
It's not /all/ good news however. Consider these two cases: (1) you call toString() through a reference whose type declares the wchar[] return - all hunky dory, no conversions happen. /But/ (2) you call it through a plain Object reference - now /two/ conversions happen (assuming Object.toString() still returns char[]) - toUTF8(wchar[]) followed by toUTF16(char[]). Still, that's polymorphism for you. It is better than the status quo, but not quite as good (IMO) as having wchar[] be the standard string type. Arcane Jill
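A rough sketch of the two cases (using a made-up method name instead of toString() so that it compiles today; the toUTF8()/toUTF16() calls stand in for the conversions the compiler would insert):

import std.utf;

class C
{
    // stand-in for a toString() that returns wchar[] under the suggestion
    wchar[] text() { return "hello"w; }
}

void main()
{
    C c = new C;

    // case (1): the caller sees the wchar[] signature - no conversion
    wchar[] a = c.text();

    // case (2): the result has to pass through a char[] interface, as it
    // would through Object.toString() today - two conversions
    char[]  through = toUTF8(c.text());
    wchar[] b       = toUTF16(through);
}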
Aug 24 2004
parent reply Regan Heath <regan netwin.co.nz> writes:
On Wed, 25 Aug 2004 06:31:16 +0000 (UTC), Arcane Jill 
<Arcane_member pathlink.com> wrote:
 In article <opsc9xu9h35a2sq9 digitalmars.com>, Regan Heath says...

 Ahh.. excellent, that is what I was hoping to hear.
It's not /all/ good news however. Consider these two cases: (1) you call toString() through a reference whose type declares the wchar[] return - all hunky dory, no conversions happen. /But/ (2) you call it through a plain Object reference - now /two/ conversions happen (assuming Object.toString() still returns char[]) - toUTF8(wchar[]) followed by toUTF16(char[]). Still, that's polymorphism for you.
True, and we can come up with much nastier string concatenation examples too.. I wonder if some cleverness can be thought up to lessen this effect somehow?
 It is better than the status quo, but not quite as good (IMO) as having 
 wchar[]
 be the standard string type.
By 'standard' do you mean that the others do not exist? or that it is the type you are encouraged to use unless you have reason not to? I think the other types have a valid place in the D language, after all each type will be more or less efficient based on the specific circumstances it gets used in. The most generally efficient type (if that's even possible to decide) should be the type we're encouraged to use, if that's wchar so be it. Implicit transcoding will fit nicely with a standard type as when you are using another type the library functions (if all written for wchar for example) will still be available without explicit toUTFxx calls. Regan. -- Using M2, Opera's revolutionary e-mail client: http://www.opera.com/m2/
Aug 25 2004
parent reply Arcane Jill <Arcane_member pathlink.com> writes:
In article <opsdbboxb05a2sq9 digitalmars.com>, Regan Heath says...

 It is better than the status quo, but not quite as good (IMO) as having 
 wchar[]
 be the standard string type.
By 'standard' do you mean that the others do not exist? or that it is the type you are encouraged to use unless you have reason not to?
I guess I mean specifically that:

(1) Object.toString() should return wchar[], not char[]

(2) String literals such as "hello world" should be interpreted as wchar[], not char[].

(3) Object.d should contain the line:

(4) The text on page http://www.digitalmars.com/d/arrays.html should be changed. Currently it says:
Dynamic arrays in D suggest the obvious solution - a string is just a dynamic
array of characters. String literals become just an easy way to write character
arrays.

    char[] str;
    char[] str1 = "abc";
This should be changed to:
Dynamic arrays in D suggest the obvious solution - a string is just a dynamic
array of characters. String literals become just an easy way to write character
arrays.

    wchar[] str;
    wchar[] str1 = "abc";
(5) There are probably several other things to change. I don't claim this is an exhaustive list.

In other words, we could actually have our cake /and/ eat it. The intent is to minimize, as far as possible, the number of calls to toUTFxx(). Ideally, they should occur only at input and output. The way to minimize this is to keep everything in the same type, so conversion is not needed. If the D documentation, the behaviour of the compiler, and the organization of Phobos, were to consistently use the same string type, and others were encouraged to use the same type, conversions would be kept to a minimum.

Currently D does that - but its "string of choice" is the char[], not the wchar[]. Conversion to/from UTF-8 is incredibly slow for non-ASCII characters. (It could be made faster, but it can /never/ be made as fast as UTF-16). So we make wchar[], not char[], the "standard", and hey presto, things get faster (and what's more will interface with ICU without conversion, which is really important for internationalization).
The most generally efficient type (if that's even possible to decide) 
should be the type we're encouraged to use, if that's wchar so be it.
Well, it is usually believed that UTF-8 is the most space-efficient but the least speed-efficient. UTF-32 is the most speed-efficient but the least space-efficient. UTF-16 is the happy medium. However...

*) UTF-16 is almost as fast as UTF-32, because the UTF-16 encoding is so simple
*) UTF-16 is more compact than UTF-8 for codepoints between U+0800 and U+FFFF, each of which requires 3 bytes in UTF-8, but only two bytes in UTF-16
*) The characters expressible in UTF-16 in a single wchar include every symbol from every living language, so if you /pretend/ that wchar[] is an array of characters rather than UTF-16 fragments, the effect is relatively harmless (unlike UTF-8).
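For example, taking one code point in that middle range (the lengths in the comments are what the std.utf conversions produce):

import std.utf;

void main()
{
    dchar[] one = "\u20AC"d;      // a single code point between U+0800 and U+FFFF

    char[]  u8  = toUTF8(one);    // u8.length  == 3: three bytes in UTF-8
    wchar[] u16 = toUTF16(one);   // u16.length == 1: one wchar, i.e. two bytes, in UTF-16
}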
Implicit transcoding will fit nicely with a standard type as when you are 
using another type the library functions (if all written for wchar for 
example) will still be available without explicit toUTFxx calls.
True. I can't argue with that. But back to the case for ditching char - think of this from another perspective. In Java, would you be prepared to argue the case for the /introduction/ of an 8-bit wide character type to the language? And that this type could only ever be used for UTF-8? There's a reason why that suggestion sounds absurd. It is. Arcane Jill
Aug 26 2004
next sibling parent "antiAlias" <fu bar.com> writes:
Hear! Hear!


"Arcane Jill" <Arcane_member pathlink.com> wrote in message
news:cgk32j$1fj$1 digitaldaemon.com...
Aug 26 2004
prev sibling next sibling parent reply Regan Heath <regan netwin.co.nz> writes:
On Thu, 26 Aug 2004 07:21:55 +0000 (UTC), Arcane Jill 
<Arcane_member pathlink.com> wrote:
 In article <opsdbboxb05a2sq9 digitalmars.com>, Regan Heath says...

 It is better than the status quo, but not quite as good (IMO) as having
 wchar[]
 be the standard string type.
By 'standard' do you mean that the others do not exist? or that it is the type you are encouraged to use unless you have reason not to?
I guess I mean specifically that: (1) Object.toString() should return wchar[], not char[]
Sure... so long as custom classes can overload and return char[] for situations where the app might be using char[] throughout (for whatever reason). OT: here is an example where we _don't_ want the return value used for method name resolution.
 (2) String literals such as "hello world" should be interpreted as
 wchar[], not char[].
Currently doesn't it decide the type based on the context? i.e.

void foo(dchar[] a);
foo("hello world");

would make "hello world" a dchar string literal?

I guess what you're saying is the default should be wchar[] where the type is indeterminate, i.e.

writef("hello world");

but, why not use char[] for the above, it's more efficient in this case. The compiler could do a quick decision based on whether the string contains any code points >= U+0800: if not use char[], otherwise use wchar[], would that be a good soln? After all, I don't mind if my compile is a little slower if it means my app is faster.
 (3) Object.d should contain the line:

I'm not sure I like this.. will this hide details a programmer should be aware of?
 (4) The text on page http://www.digitalmars.com/d/arrays.html should be 
 changed.
 Currently it says:

 Dynamic arrays in D suggest the obvious solution - a string is just a 
 dynamic
 array of characters. String literals become just an easy way to write 
 character
 arrays.

    char[] str;
    char[] str1 = "abc";
This should be changed to:
 Dynamic arrays in D suggest the obvious solution - a string is just a 
 dynamic
 array of characters. String literals become just an easy way to write 
 character
 arrays.

    wchar[] str;
    wchar[] str1 = "abc";
(5) There are probably several other things to change. I don't claim this is an exhaustive list.
Sure, char[] is probably used and suggested in every example in the manuals except where it's giving an example of the differences in utf-x encodings.
 In other words, we could actually have our cake /and/ eat it. The intent 
 is to
 minimize, as far as possible, the number of calls to toUTFxx().
Agreed.
 Ideally, they should occur only at input and output. The way to minimize 
 this is to keep everything in the same type, so conversion is not 
 needed. If the D
 documentation, the behaviour of the compiler, and the organization of 
 Phobos, were to consistently use the same string type, and others were 
 encouraged to use the same type, conversions would be kept to a minimum.

 Currently D does that - but its "string of choice" is the char[], not
 the wchar[]. Conversion to/from UTF-8 is incredibly slow for non-ASCII
 characters.
 (It could be made faster, but it can /never/ be made as fast as UTF-16). 
 So we make wchar[], not char[], the "standard", and hey presto, things 
 get faster
Qualification: For non ASCII only apps.

 (and what's more will interface with ICU without conversion, which is really
 important for internationalization).
The fact that ICU has no char type suggests it's a bad choice for D, that is, if we want to assume they knew what they were doing. Are there any complaints from developers about ICU anywhere, perhaps some digging for dirt would help make an objective decision here?
 The most generally efficient type (if that's even possible to decide)
 should be the type we're encouraged to use, if that's wchar so be it.
Well, it is usually believed that UTF-8 is the most space-efficient but the least speed-efficient. UTF-32 is the most speed-efficient but the least space-efficient. UTF-16 is the happy medium. However...

*) UTF-16 is almost as fast as UTF-32, because the UTF-16 encoding is so simple
*) UTF-16 is more compact than UTF-8 for codepoints between U+0800 and U+FFFF, each of which requires 3 bytes in UTF-8, but only two bytes in UTF-16
*) The characters expressible in UTF-16 in a single wchar include every symbol from every living language
.. ? I thought the problem Java had was that Unicode contains characters > U+FFFF.. if they're not part of a 'living language', what are they?
 , so if you /pretend/ that wchar[] is an array of
 characters rather than UTF-16 fragments, the effect is relatively harmless
 (unlike UTF-8).
Sure, however if you're only dealing with ASCII, doing that with char[] is also fine. Those of us who haven't done any internationalisation are used to dealing only with ASCII.

I'd like some more stats and figures, simply:
  - how many unicode characters are in the range < U+0800?
  - how many unicode from U+0800 <= x <= U+FFFF?
  - how many unicode > U+FFFF?
(the answers to the above are probably quite simple, but I want them from someone who 'knows' rather than me who'd be guessing)

Then, how commonly is each range used? I imagine this differs depending on exactly what you're doing.. basically when would you use characters in each range and how common is that task? It used to be that ASCII < U+0800 was the most common, it still may be, but I can see that it's not the future, the future is unicode.
 Implicit transcoding will fit nicely with a standard type as when you 
 are
 using another type the library functions (if all written for wchar for
 example) will still be available without explicit toUTFxx calls.
True. I can't argue with that. But back to the case for ditching char - think of this from another perspective. In Java, would you be prepared to argue the case for the /introduction/ of an 8-bit wide character type to the language? And that this type could only ever be used for UTF-8? There's a reason why that suggestion sounds absurd. It is.
Isn't the reason that all the existing Java stuff, of which there is a *lot*, uses wchar, so char wouldn't integrate? That is different to D, where all 3 exist and char is actually the one with the most integration. That said, would the introduction of char to Java give you anything? perhaps.. it would allow you to write an app that only deals with ASCII (chars < U+0800) more space efficiently, correct? Regan -- Using M2, Opera's revolutionary e-mail client: http://www.opera.com/m2/
Aug 26 2004
parent reply Arcane Jill <Arcane_member pathlink.com> writes:
In article <opsdc4sgsi5a2sq9 digitalmars.com>, Regan Heath says...

I guess what you're saying is the default should be wchar[] where the type 
is indeterminate
Yes.
but, why not use char[] for the above, it's more efficient in this case.
The compiler could do a quick decision based on whether the string
contains any code points >= U+0800: if not use char[], otherwise use
wchar[], would that be a good soln?
You suggestion would work. But I'm still thinking along the lines that no conversion is better than some conversion, and no conversion is only achievable by having only one type of string. And even we don't enforce that, we should at least encourage it.
After all, I don't mind if my compile is a little slower if it means my 
app is faster.
UTF-16 is not slower than UTF-8, even for pure ASCII. An app in which all text is ASCII is going to be just as fast in UTF-16 as it is in ASCII. Remember that ASCII is a subset of UTF-16, just as it is a subset of UTF-8. Converting between UTF-8 and UTF-16 won't slow you down much if all your characters are ASCII, of course. Such a conversion is trivial - not much slower than a memcpy. /But/ - you're still making a copy, still allocating stuff off the heap and copying data from one place to another, and that's still overhead which you would have avoided had you used UTF-16 right through.
 (3) Object.d should contain the line:

I'm not sure I like this.. will this hide details a programmer should be aware of?
It's just a strong hint. If aliases are bad then we shouldn't use them anywhere.
 (It could be made faster, but it can /never/ be made as fast as UTF-16). 
 So we make wchar[], not char[], the "standard", and hey presto, things 
 get faster
Qualification: For non ASCII only apps.
No, for /all/ apps. I don't see any reason why ASCII stored in wchar[]s would be any slower than ASCII stored in char[]s. Can you think of a reason why that would be so? ASCII is a subset of UTF-8 ASCII is a subset of UTF-16 where's the difference? The difference is space, not speed.
The fact that ICU has no char type suggests it's a bad choice for D, that
is, if we want to assume they knew what they were doing.
See http://oss.software.ibm.com/icu/userguide/strings.html for UCI's discussion on this.
Are there any 
complaints from developers about ICU anywhere, perhaps some digging for 
dirt would help make an objective decision here?
I don't know. I imagine so. People generally tend to complain about /everything/.
I'd like some more stats and figures, simply:
  - how many unicode characters are in the range < U+0800?
These figures are all from Unicode 4.0. Unicode now stands at 4.0.1, so these figures are out of date - but they'll give you the general idea. 1646
  - how many unicode from U+0800 <= x <= U+FFFF?
54014, of which 6400 are Private Use Characters
  - how many unicode > U+FFFF?
176012, of which 131068 are Private Use Characters
Then, how commonly is each range used? I imagine this differs depending on 
exactly what you're doing.. basically when would you use characters in 
each range and how common is that task?
The biggest non-BMP chunks are:

U+20000 to U+2A6D6 = CJK Compatibility Ideographs
U+F0000 to U+FFFFD = Private Use
U+100000 to U+10FFFD = Private Use

Compatibility Ideographs are not used in Unicode except for round-trip compatibility with legacy CJK character sets. Every single one of them is nothing but a compatibility alias for another character. The Private Use characters are not defined by Unicode (being reserved for private interchange between consenting parties).

The remainder of the non-BMP (> U+FFFF) characters are:
*) More CJK compatibility characters
*) Old italic variants of ASCII characters
*) Gothic letters (no idea what they are)
*) The Deseret script (a dead language, so far as I know)
*) Musical symbols
*) Miscellaneous mathematical symbols
*) Mathematical variants of ASCII and Greek letters
*) "Tagged" variants of ASCII characters

The math characters are used only in math. The tagged characters are used only in /one/ protocol (and in fact, some of those characters are not used at all, even in that protocol). None of these characters are likely to be found in general text, only in specialist applications.

Here is the complete list of "blocks":

0000..007F; Basic Latin
0080..00FF; Latin-1 Supplement
0100..017F; Latin Extended-A
0180..024F; Latin Extended-B
0250..02AF; IPA Extensions
02B0..02FF; Spacing Modifier Letters
0300..036F; Combining Diacritical Marks
0370..03FF; Greek
0400..04FF; Cyrillic
0530..058F; Armenian
0590..05FF; Hebrew
0600..06FF; Arabic
0700..074F; Syriac
0780..07BF; Thaana
0900..097F; Devanagari
0980..09FF; Bengali
0A00..0A7F; Gurmukhi
0A80..0AFF; Gujarati
0B00..0B7F; Oriya
0B80..0BFF; Tamil
0C00..0C7F; Telugu
0C80..0CFF; Kannada
0D00..0D7F; Malayalam
0D80..0DFF; Sinhala
0E00..0E7F; Thai
0E80..0EFF; Lao
0F00..0FFF; Tibetan
1000..109F; Myanmar
10A0..10FF; Georgian
1100..11FF; Hangul Jamo
1200..137F; Ethiopic
13A0..13FF; Cherokee
1400..167F; Unified Canadian Aboriginal Syllabics
1680..169F; Ogham
16A0..16FF; Runic
1780..17FF; Khmer
1800..18AF; Mongolian
1E00..1EFF; Latin Extended Additional
1F00..1FFF; Greek Extended
2000..206F; General Punctuation
2070..209F; Superscripts and Subscripts
20A0..20CF; Currency Symbols
20D0..20FF; Combining Marks for Symbols
2100..214F; Letterlike Symbols
2150..218F; Number Forms
2190..21FF; Arrows
2200..22FF; Mathematical Operators
2300..23FF; Miscellaneous Technical
2400..243F; Control Pictures
2440..245F; Optical Character Recognition
2460..24FF; Enclosed Alphanumerics
2500..257F; Box Drawing
2580..259F; Block Elements
25A0..25FF; Geometric Shapes
2600..26FF; Miscellaneous Symbols
2700..27BF; Dingbats
2800..28FF; Braille Patterns
2E80..2EFF; CJK Radicals Supplement
2F00..2FDF; Kangxi Radicals
2FF0..2FFF; Ideographic Description Characters
3000..303F; CJK Symbols and Punctuation
3040..309F; Hiragana
30A0..30FF; Katakana
3100..312F; Bopomofo
3130..318F; Hangul Compatibility Jamo
3190..319F; Kanbun
31A0..31BF; Bopomofo Extended
3200..32FF; Enclosed CJK Letters and Months
3300..33FF; CJK Compatibility
3400..4DB5; CJK Unified Ideographs Extension A
4E00..9FFF; CJK Unified Ideographs
A000..A48F; Yi Syllables
A490..A4CF; Yi Radicals
AC00..D7A3; Hangul Syllables
D800..DB7F; High Surrogates
DB80..DBFF; High Private Use Surrogates
DC00..DFFF; Low Surrogates
E000..F8FF; Private Use
F900..FAFF; CJK Compatibility Ideographs
FB00..FB4F; Alphabetic Presentation Forms
FB50..FDFF; Arabic Presentation Forms-A
FE20..FE2F; Combining Half Marks
FE30..FE4F; CJK Compatibility Forms
FE50..FE6F; Small Form Variants
FE70..FEFE; Arabic Presentation Forms-B
FEFF..FEFF; Specials
FF00..FFEF; Halfwidth and Fullwidth Forms
FFF0..FFFD; Specials
10300..1032F; Old Italic
10330..1034F; Gothic
10400..1044F; Deseret
1D000..1D0FF; Byzantine Musical Symbols
1D100..1D1FF; Musical Symbols
1D400..1D7FF; Mathematical Alphanumeric Symbols
20000..2A6D6; CJK Unified Ideographs Extension B
2F800..2FA1F; CJK Compatibility Ideographs Supplement
E0000..E007F; Tags
F0000..FFFFD; Private Use
100000..10FFFD; Private Use
It used to be that ASCII < U+0800 was the most common, it still may be, 
but I can see that it's not the future, the future is unicode.
Most common, perhaps, if you limit your text to Latin, Greek, Hebrew and Russian. But not otherwise.
That said, would the introduction of char to Java give you anything? 
perhaps.. it would allow you to write an app that only deals with ASCII 
(chars < U+0800) more space efficiently, correct?
Only chars < U+0080 (not U+0800) would be more space efficient in UTF-8. Between U+0080 and U+07FF they both need two bytes. From U+0800 upwards, UTF-8 needs three bytes where UTF-16 needs two.

How am I doing on the convincing front? I'd still go for:

*) wchar[] for everything in Phobos and everything DMD-generated;
*) ditch the char;
*) lossless implicit conversion between all remaining D string types.

Arcane Jill
Aug 27 2004
next sibling parent Arcane Jill <Arcane_member pathlink.com> writes:
In article <cgmr7f$1b90$1 digitaldaemon.com>, Arcane Jill says...

10400..1044F; Deseret
*) The Deseret script (a dead language, so far as I know)
"The Deseret Alphabet was designed as an alternative to the Latin alphabet for writing the English language. It was developed during the 1850s by The Church of Jesus Christ of Latter-day Saints (also known as the "Mormon" or LDS Church) under the guidance of Church President Brigham Young (1801-1877). Brigham Young's secretary, George D. Watt, was among the designers of the Deseret Alphabet." "The LDS Church published four books using the Deseret Alphabet" See http://www.molossia.org/alphabet.html Jill
Aug 27 2004
prev sibling next sibling parent reply "Walter" <newshound digitalmars.com> writes:
"Arcane Jill" <Arcane_member pathlink.com> wrote in message
news:cgmr7f$1b90$1 digitaldaemon.com...
 UTF-16 is not slower than UTF-8, even for pure ASCII. An app in which all
 text is ASCII is going to be just as fast in UTF-16 as it is in ASCII.
While the rest of your post has a great deal of merit, this bit here is just not true. A pure ASCII app written using UTF-16 will consume twice as much memory for its data, and there are a lot of operations on that data that will be correspondingly half as fast. Furthermore, you'll start swapping a lot sooner, and then performance takes a dive. It makes sense for Java, Javascript, and for languages where performance is not a top priority to standardize on one character type. But if D does not handle ASCII very efficiently, it will not have a chance at interesting the very performance conscious C/C++ programmers.
Aug 28 2004
next sibling parent reply Sean Kelly <sean f4.ca> writes:
In article <cgqlmm$2ui$1 digitaldaemon.com>, Walter says...
"Arcane Jill" <Arcane_member pathlink.com> wrote in message
news:cgmr7f$1b90$1 digitaldaemon.com...
 UTF-16 is not slower than UTF-8, even for pure ASCII. An app in which all
 text is ASCII is going to be just as fast in UTF-16 as it is in ASCII.
While the rest of your post has a great deal of merit, this bit here is just not true. A pure ASCII app written using UTF-16 will consume twice as much memory for its data, and there are a lot of operations on that data that will be correspondingly half as fast. Furthermore, you'll start swapping a lot sooner, and then performance takes a dive. It makes sense for Java, Javascript, and for languages where performance is not a top priority to standardize on one character type. But if D does not handle ASCII very efficiently, it will not have a chance at interesting the very performance conscious C/C++ programmers.
Agreed. Frankly, I've begun to wonder just what the purpose of this discussion is. I think it's already been agreed that none of the three char types should be removed from D, and it seems clear that there is no "default" char type. Is this a nomenclature issue? ie. that the UTF-8 type is named "char" and thus considered to be somehow more important than the others? Sean
Aug 28 2004
next sibling parent J C Calvarese <jcc7 cox.net> writes:
Sean Kelly wrote:

 In article <cgqlmm$2ui$1 digitaldaemon.com>, Walter says...
 
"Arcane Jill" <Arcane_member pathlink.com> wrote in message
news:cgmr7f$1b90$1 digitaldaemon.com...

UTF-16 is not slower than UTF-8, even for pure ASCII. An app in which all
text is ASCII is going to be just as fast in UTF-16 as it is in ASCII.
While the rest of your post has a great deal of merit, this bit here is just not true. A pure ASCII app written using UTF-16 will consume twice as much memory for its data, and there are a lot of operations on that data that will be correspondingly half as fast. Furthermore, you'll start swapping a lot sooner, and then performance takes a dive. It makes sense for Java, Javascript, and for languages where performance is not a top priority to standardize on one character type. But if D does not handle ASCII very efficiently, it will not have a chance at interesting the very performance conscious C/C++ programmers.
Agreed. Frankly, I've begun to wonder just what the purpose of this discussion is. I think it's already been agreed that none of the three char types should be removed from D, and it seems clear that there is no "default" char type. Is
I think the thread has gone somewhat off topic by this point. Apparently, a lot of people feel oppressed by ASCII. I must be a bad person since 7 bits is all I need most of the time. On a related note, the "performance of char vs wchar" thread recently degraded into enlightened comments along the lines of "You're a know-it-all American cowboy who discriminates against all of the Chinese, Japanese, Indians, Russians, British, and members of the European Union in the world. And Mormons, too." Or something like that. Somehow these Unicode-related discussions bring out the best in people. :P
 this a nomenclature issue?  ie. that the UTF-8 type is named "char" and thus
 considered to be somehow more important than the others?
 
 
 Sean
-- Justin (a/k/a jcc7) http://jcc_7.tripod.com/d/
Aug 28 2004
prev sibling parent Matthias Becker <Matthias_member pathlink.com> writes:
Agreed.  Frankly, I've begun to wonder just what the purpose of this discussion
is.
The toString()-method has to return a string, but in which format? That's what this is all about.
I think it's already been agreed that none of the three char types should
be removed from D, and it seems clear that there is no "default" char type.
Well, the language can only use one type as the return type for toString(), so there actually IS a default character type. BTW, isn't the name "char" totally misleading? char means character, but it can't hold a character; it can only hold parts of one. This is confusing.
Is
this a nomenclature issue?  ie. that the UTF-8 type is named "char" and thus
considered to be somehow more important than the others?
Nope. -- Matthias Becker
Aug 29 2004
prev sibling parent reply stonecobra <scott stonecobra.com> writes:
Walter wrote:
 "Arcane Jill" <Arcane_member pathlink.com> wrote in message
 news:cgmr7f$1b90$1 digitaldaemon.com...
 
UTF-16 is not slower than UTF-8, even for pure ASCII. An app in which all
text is ASCII is going to be just as fast in UTF-16 as it is in ASCII.
While the rest of your post has a great deal of merit, this bit here is just not true. A pure ASCII app written using UTF-16 will consume twice as much memory for its data, and there are a lot of operations on that data that will be correspondingly half as fast. Furthermore, you'll start swapping a lot sooner, and then performance takes a dive. It makes sense for Java, Javascript, and for languages where performance is not a top priority to standardize on one character type. But if D does not handle ASCII very efficiently, it will not have a chance at interesting the very performance conscious C/C++ programmers.
They won't worry about it, because if they are true performance affinciandos, they will never create an array in D, because of the default initialization. :) so, ubyte[] it is <g> Seriously, is performance a concern for D? If it truly is, this should be able to be turned off, if I take ownership of the potential consequences, no? Scott
Aug 29 2004
next sibling parent Juanjo =?ISO-8859-15?Q?=C1lvarez?= <juanjuxNO SPAMyahoo.es> writes:
stonecobra wrote:

 affinciandos
What was that? :-D (I guess what you were trying to say was "aficionado").
Aug 29 2004
prev sibling next sibling parent reply "Ivan Senji" <ivan.senji public.srce.hr> writes:
"stonecobra" <scott stonecobra.com> wrote in message
news:cgu2ia$1vad$1 digitaldaemon.com...
 Walter wrote:
 "Arcane Jill" <Arcane_member pathlink.com> wrote in message
 news:cgmr7f$1b90$1 digitaldaemon.com...

UTF-16 is not slower than UTF-8, even for pure ASCII. An app in which all
text is ASCII is going to be just as fast in UTF-16 as it is in ASCII.
While the rest of your post has a great deal of merit, this bit here is just
 not true. A pure ASCII app written using UTF-16 will consume twice as much
 memory for its data, and there are a lot of operations on that data that
 will be correspondingly half as fast. Furthermore, you'll start swapping a
 lot sooner, and then performance takes a dive.

 It makes sense for Java, Javascript, and for languages where performance is
 not a top priority to standardize on one character type. But if D does not
 handle ASCII very efficiently, it will not have a chance at interesting the
 very performance conscious C/C++ programmers.
They won't worry about it, because if they are true performance affinciandos, they will never create an array in D, because of the default initialization. :) so, ubyte[] it is <g> Seriously, is performance a concern for D? If it truly is, this should be able to be turned off, if I take ownership of the potential consequences, no?
What about having some standard allocator for arrays so we can:

char[] str = new(noinit) char[100];
 Scott
Aug 30 2004
parent Juanjo =?ISO-8859-15?Q?=C1lvarez?= <juanjuxNO SPAMyahoo.es> writes:
Ivan Senji wrote:

 What about having some standard allocator for arrays so we can:
 
 char[] str = new(noinit) char[100];
I like it!
Aug 30 2004
prev sibling parent reply "Walter" <newshound digitalmars.com> writes:
"stonecobra" <scott stonecobra.com> wrote in message
news:cgu2ia$1vad$1 digitaldaemon.com...
 They won't worry about it, because if they are true performance
 affinciandos, they will never create an array in D, because of the
 default initialization.   :)  so, ubyte[] it is <g>

 Seriously, is performance a concern for D?  If it truly is, this should
 be able to be turned off, if I take ownership of the potential
 consequences, no?
You can allocate them with std.c.stdlib.malloc() or std.c.stdlib.alloca(), neither of which will do the initialization. Furthermore, std.c.stdlib.alloca(n), where n is a constant, is handled as a special optimization case, and will generate storage as part of the stack frame setup (so it's zero cost).
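Something along these lines, for instance (the function name is made up; the point is just that malloc'd storage is neither default-initialized nor GC-managed, so the caller must free() the pointer explicitly):

import std.c.stdlib;

// build a char[] on top of malloc'd storage - the contents are whatever
// happened to be in memory, and the caller must eventually free() it
char[] uninitChars(uint n)
{
    char* p = cast(char*) std.c.stdlib.malloc(n);
    assert(p != null);
    return p[0 .. n];
}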
Aug 30 2004
next sibling parent stonecobra <scott stonecobra.com> writes:
Walter wrote:

 "stonecobra" <scott stonecobra.com> wrote in message
 news:cgu2ia$1vad$1 digitaldaemon.com...
 
They won't worry about it, because if they are true performance
affinciandos, they will never create an array in D, because of the
default initialization.   :)  so, ubyte[] it is <g>

Seriously, is performance a concern for D?  If it truly is, this should
be able to be turned off, if I take ownership of the potential
consequences, no?
You can allocate them with std.c.stdlib.malloc() or std.c.stdlib.alloca(), neither of which will do the initialization. Furthermore, std.c.stdlib.alloca(n), where n is a constant, is handled as a special optimization case, and will generate storage as part of the stack frame setup (so it's zero cost).
So, the C performance geeks will just stay with C, because that's how they'd do it now (no benefit to moving to D)? Scott
Aug 30 2004
prev sibling parent reply Juanjo =?ISO-8859-15?Q?=C1lvarez?= <juanjuxNO SPAMyahoo.es> writes:
Walter wrote:

 Seriously, is performance a concern for D?  If it truly is, this should
 be able to be turned off, if I take ownership of the potential
 consequences, no?
You can allocate them with std.c.stdlib.malloc() or std.c.stdlib.alloca(), neither of which will do the initialization. Furthermore, std.c.stdlib.alloca(n), where n is a constant, is handled as a special optimization case, and will generate storage as part of the stack frame setup (so it's zero cost).
And will they be garbage-collected? (just asking, I don't know.) Anyway it's an ugly interface to do it, maybe as has been suggested more standard allocators could be included for this case.
Aug 30 2004
parent reply "Walter" <newshound digitalmars.com> writes:
"Juanjo Álvarez" <juanjuxNO SPAMyahoo.es> wrote in message
news:ch0ajb$4d$1 digitaldaemon.com...
 Walter wrote:
 Seriously, is performance a concern for D?  If it truly is, this should
 be able to be turned off, if I take ownership of the potential
 consequences, no?
You can allocate them with std.c.stdlib.malloc() or
std.c.stdlib.alloca(),
 neither of which will do the initialization. Furthermore,
 std.c.stdlib.alloca(n), where n is a constant, is handled as a special
 optimization case, and will generate storage as part of the stack frame
 setup (so it's zero cost).
And will they be garbage-collected? (just asking, I don't know.)
malloc()? No, that will need an explicit call to free().
alloca()? No, that gets deallocated anyway when the function exits
 Anyway it's
 an ugly interface to do it, maybe as has been suggested more standard
 allocators could be included for this case.
You can also overload operators new and delete on a per-class basis, and provide a custom allocator.
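For instance, a bare-bones sketch of such a per-class allocator/deallocator using the C heap (the class is made up; note that memory obtained this way is presumably not scanned by the GC, which matters if the class holds references):

import std.c.stdlib;

class Foo
{
    // class allocator: 'new Foo' takes its memory from the C heap
    new(uint size)
    {
        void* p = std.c.stdlib.malloc(size);
        assert(p != null);
        return p;
    }

    // matching deallocator, run by 'delete'
    delete(void* p)
    {
        if (p)
            std.c.stdlib.free(p);
    }
}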
Aug 30 2004
next sibling parent reply "Ivan Senji" <ivan.senji public.srce.hr> writes:
"Walter" <newshound digitalmars.com> wrote in message
news:ch0ce4$11t$1 digitaldaemon.com...
 "Juanjo Álvarez" <juanjuxNO SPAMyahoo.es> wrote in message
 news:ch0ajb$4d$1 digitaldaemon.com...
 Walter wrote:
 Seriously, is performance a concern for D?  If it truly is, this
should
 be able to be turned off, if I take ownership of the potential
 consequences, no?
You can allocate them with std.c.stdlib.malloc() or
std.c.stdlib.alloca(),
 neither of which will do the initialization. Furthermore,
 std.c.stdlib.alloca(n), where n is a constant, is handled as a special
 optimization case, and will generate storage as part of the stack
frame
 setup (so it's zero cost).
And will they be garbage-collected? (just asking, I don't know.)
malloc()? No, that will need an explicit call to free().
But this would be a good way to get back to good old memory leaks from forgetting to free something. This is something that a high level language like D should try to help with. Isn't there a way to extend the syntax to enable us to allocate uninitialised arrays when someone explicitly wants to do that?
 alloca()? No, that gets deallocated anyway when the function exits

 Anyway it's
 an ugly interface to do it, maybe as has been suggested more standard
 allocators could be included for this case.
You can also overload operators new and delete on a per-class basis, and provide a custom allocator.
But not for arrays. This would mean too much wrapping.
Aug 31 2004
parent reply Sean Kelly <sean f4.ca> writes:
In article <ch1ulm$ori$1 digitaldaemon.com>, Ivan Senji says...
"Walter" <newshound digitalmars.com> wrote in message
news:ch0ce4$11t$1 digitaldaemon.com...

 malloc()? No, that will need an explicit call to free().
But this would be a good way to get back to good old memory leaks from forgetting to free something. This is something that a high level language like D should try to help with. Isn't there a way to extend the syntax to enable us to allocate uninitialised arrays when someone explicitly wants to do that?
How about a smart pointer? I admit that the syntax wouldn't be quite as nice as in C++, but the basic implementation should be the same. Sean
Aug 31 2004
parent Nick <Nick_member pathlink.com> writes:
In article <ch2a1j$uqi$1 digitaldaemon.com>, Sean Kelly says...

 malloc()? No, that will need an explicit call to free().
But this would be a good way to get back to good old memory leaks from forgetting to free something. This is something that a high level language like D should try to help with.
How about a smart pointer? I admit that the syntax wouldn't be quite as nice as in C++, but the basic implementation should be the same.
How about just using a buffer class that clears up its mess when collected? Seems to me like a simple solution that would cover most uses for uninitialized buffers. I have written an example here: http://folk.uio.no/mortennk/d/array/membuffer.d and used it here: http://folk.uio.no/mortennk/d/array/uninitarray.d Nick
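A minimal sketch of that kind of buffer class (hypothetical code, not the example at the links above):

import std.c.stdlib;

class MemBuffer
{
    private void*  ptr;
    private size_t len;

    this(size_t n)
    {
        len = n;
        ptr = malloc(n);                 // uninitialized storage outside the GC heap
        if (!ptr)
            throw new Exception("out of memory");
    }

    ~this()
    {
        free(ptr);                       // released when the object is collected or deleted
        ptr = null;
    }

    // expose the storage as an ordinary array slice
    ubyte[] opSlice()
    {
        return (cast(ubyte*) ptr)[0 .. len];
    }
}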
Aug 31 2004
prev sibling parent reply Arcane Jill <Arcane_member pathlink.com> writes:
In article <ch0ce4$11t$1 digitaldaemon.com>, Walter says...

 And will they be garbage-collected? (just asking, I don't know.)
malloc()? No, that will need an explicit call to free(). alloca()? No, that gets deallocated anyway when the function exits
..but is forbidden from having a destructor (even if you use alloca() via a custom allocator) because you can't have destructible objects on the stack.
 Anyway it's
 an ugly interface to do it, maybe as has been suggested more standard
 allocators could be included for this case.
You can also overload operators new and delete on a per-class basis, and provide a custom allocator.
..which of course will /also/ not be garbage collected. (I've been down this road before). Arcane Jill
Aug 31 2004
next sibling parent Sean Kelly <sean f4.ca> writes:
In article <ch22nn$qtf$1 digitaldaemon.com>, Arcane Jill says...
In article <ch0ce4$11t$1 digitaldaemon.com>, Walter says...

 And will they be garbage-collected? (just asking, I don't know.)
malloc()? No, that will need an explicit call to free(). alloca()? No, that gets deallocated anyway when the function exits
..but is forbidden from having a destructor (even if you use alloca() via a custom allocator) because you can't have destructible objects on the stack.
You can't? Why not?
 Anyway it's
 an ugly interface to do it, maybe as has been suggested more standard
 allocators could be included for this case.
You can also overload operators new and delete on a per-class basis, and provide a custom allocator.
..which of course will /also/ not be garbage collected. (I've been down this road before).
I'm not sure if it would work, but could you use gc.malloc to allocate heap memory in operator new and then not provide an operator delete? Sean
Aug 31 2004
prev sibling parent "Walter" <newshound digitalmars.com> writes:
"Arcane Jill" <Arcane_member pathlink.com> wrote in message
news:ch22nn$qtf$1 digitaldaemon.com...
 In article <ch0ce4$11t$1 digitaldaemon.com>, Walter says...

 And will they be garbage-collected? (just asking, I don't know.)
malloc()? No, that will need an explicit call to free(). alloca()? No, that gets deallocated anyway when the function exits
..but is forbidden from having a destructor (even if you use alloca() via
a
 custom allocator) because you can't have destructible objects on the
stack.

If you make it an auto class, you can.
Aug 31 2004
prev sibling parent Regan Heath <regan netwin.co.nz> writes:
On Fri, 27 Aug 2004 08:26:23 +0000 (UTC), Arcane Jill 
<Arcane_member pathlink.com> wrote:
 In article <opsdc4sgsi5a2sq9 digitalmars.com>, Regan Heath says...
 After all, I don't mind if my compile is a little slower if it means my
 app is faster.
UTF-16 is not slower than UTF-8, even for pure ASCII. An app in which all text is ASCII is going to be just as fast in UTF-16 as it is in ASCII. Remember that ASCII is a subset of UTF-16, just as it is a subset of UTF-8. Converting between UTF-8 and UTF-16 won't slow you down much if all your characters are ASCII, of course. Such a conversion is trivial - not much slower than a memcpy. /But/ - you're still making a copy, still allocating stuff off the heap and copying data from one place to another, and that's still overhead which you would have avoided had you used UTF-16 right through.
You're right.. I would argue however that space == speed when you start to run out, which happens twice as fast if you use wchar (and ASCII only), right? The overall efficiency of a program is made up of both its space and CPU requirements; sometimes you will need to or want to lessen the space requirements.
 (3) Object.d should contain the line:

I'm not sure I like this.. will this hide details a programmer should be aware of?
It's just a strong hint. If aliases are bad then we shouldn't use them anywhere.
I wasn't suggesting aliases were bad. Aliases that serve to make type declarations clearer are very useful; they make code clearer. This alias just renames a type, so now it has two names, which will likely cause some confusion. I think we can suggest a type without it.
 (It could be made faster, but it can /never/ be made as fast as 
 UTF-16).
 So we make wchar[], not char[], the "standard", and hey presto, things
 get faster
Qualification: For non-ASCII-only apps.
No, for /all/ apps. I don't see any reason why ASCII stored in wchar[]s would be any slower than ASCII stored in char[]s. Can you think of a reason why that would be so?

ASCII is a subset of UTF-8.
ASCII is a subset of UTF-16.

Where's the difference? The difference is space, not speed.
Correct, but space == speed (as above).
 The fact that ICU has no char type suggests it's a bad choice for D, 
 that
 is, if we want to assume they knew what they were doing.
See http://oss.software.ibm.com/icu/userguide/strings.html for ICU's discussion on this.
Thanks.
 Are there any
 complaints from developers about ICU anywhere, perhaps some digging for
 dirt would help make an objective decision here?
I don't know. I imagine so. People generally tend to complain about /everything/.
<g> true, too true...
 I'd like some more stats and figures, simply:
  - how many unicode characters are in the range < U+0800?
These figures are all from Unicode 4.0. Unicode now stands at 4.0.1, so these figures are out of date - but they'll give you the general idea. 1646
 - how many unicode characters from U+0800 <= x <= U+FFFF?
54014, of which 6400 are Private Use Characters
 - how many unicode characters > U+FFFF?
176012, of which 131068 are Private Use Characters
 Then, how commonly is each range used? I imagine this differs depending 
 on
 exactly what you're doing.. basically when would you use characters in
 each range and how common is that task?
..thanks for the lists/figures..
 It used to be that ASCII < U+0800 was the most common, it still may be,
 but I can see that it's not the future, the future is unicode.
Most common, perhaps, if you limit your text to Latin, Greek, Hebrew and Russian. But not otherwise.
So would you say most common worldwide then? It may be due to the fact that I only speak English, but I see many more English-only programs than (pick a language)-only programs. Ignoring those applications that come in several languages (as all the big ones do).
 That said, would the introduction of char to Java give you anything?
 perhaps.. it would allow you to write an app that only deals with ASCII
 (chars < U+0800) more space efficiently, correct?
Only chars < U+0080 (not U+0800) would be more space efficient in UTF-8.
Yes, what if they are all you want/need/(are going) to use...
 Between
 U+0080 and U+07FF they both need two bytes. From U+0800 upwards, UTF-8
 needs
 three bytes where UTF-16 needs two.

 How am I doing on the convincing front?
You still have work to do <g>
 I'd still go for:
 *) wchar[] for everything in Phobos and everything DMD-generated;
What if you know you're only going to need ASCII(utf-8), what if all your data is going to be in ASCII(utf-8), won't you want your static strings in ASCII(utf-8) also, to cut down on transcoding?
 *) ditch the char
I don't see the point in this. char[] is still useful regardless of which type we 'promote' as the best type to use for internationalized strings. If it were removed, then dchar should be removed for the same reason, and both types would have to be implemented in ubyte[] and int[] instead. Coincidentally this is what the ICU have done, I quote... "UTF-8 and UTF-32 are supported with converters (ucnv.h), macros (utf.h), and convenience functions (ustring.h), but not directly as string encoding forms for most APIs." I'm not convinced removing them is more useful than keeping them but having implicit transcoding.
 *) lossless implicit conversion between all remaining D string types
Here we agree, this would make life much easier. Regan -- Using M2, Opera's revolutionary e-mail client: http://www.opera.com/m2/
Aug 29 2004
prev sibling parent reply "Walter" <newshound digitalmars.com> writes:
"Arcane Jill" <Arcane_member pathlink.com> wrote in message
news:cgk32j$1fj$1 digitaldaemon.com...
 (2) String literals such as "hello world" should be interpreted as
wchar[], not
 char[].
Actually, string literals are already interpreted as char[], wchar[], or dchar[] depending on the context they appear in. The compiler implicitly does a UTF conversion on them as necessary. If you have an overload based on char[] vs wchar[] vs dchar[] and pass a string literal, it should result in an ambiguity error. The only place it would default to char[] would be when it is passed as a ... argument to a variadic function.
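A small illustration of that behaviour (the function f and the overload set here are made up):

void f(char[] s)  {}
void f(wchar[] s) {}

void main()
{
    wchar[] w = "hello";         // fine: the literal is converted to UTF-16 from context
    // f("hello");               // ambiguity error: the literal matches both overloads
    f(cast(char[]) "hello");     // fine: the cast picks one
}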
Aug 28 2004
parent reply Andy Friesen <andy ikagames.com> writes:
Walter wrote:
 "Arcane Jill" <Arcane_member pathlink.com> wrote in message
 news:cgk32j$1fj$1 digitaldaemon.com...
 
(2) String literals such as "hello world" should be interpreted as
wchar[], not
char[].
Actually, string literals are already interpreted as char[], wchar[], or dchar[] depending on the context they appear in. The compiler implicitly does a UTF conversion on them as necessary. If you have an overload based on char[] vs wchar[] vs dchar[] and pass a string literal, it should result in an ambiguity error.
Is there any chance that this could be adjusted somehow? I know the point is to avoid all the complications that the C++ approach entails, but this has a way of throwing a wrench in any interface that wants to handle all three. Presently, we're given a choice: either handle all three char types and therefore demand ugly casts on all string literal arguments, or only handle one and force conversions that aren't necessarily required or desired by either the caller or the callee. If, say, in the case of an ambiguity, a string literal were assumed to be of the smallest char type for which a match exists, the code would compile and, in almost all cases, do the right thing. It does complicate the rules some, but it seems preferable to the current dilemma. -- andy
Aug 28 2004
parent reply Sean Kelly <sean f4.ca> writes:
In article <cgqmpo$387$1 digitaldaemon.com>, Andy Friesen says...
I know the point is to avoid all the complications that the C++ approach
entails, but this has a way of throwing a wrench in any interface that 
wants to handle all three.
I'm inclined to agree, though I'm wary of making char types a special case for overload resolution. Perhaps a prefix to indicate type?

c"" // utf-8
w"" // utf-16
d"" // utf-32

Still not ideal, but it would require less typing :/

Sean
Aug 28 2004
parent "Walter" <newshound digitalmars.com> writes:
"Sean Kelly" <sean f4.ca> wrote in message
news:cgqnjm$3eq$1 digitaldaemon.com...
 In article <cgqmpo$387$1 digitaldaemon.com>, Andy Friesen says...
I know the point is to avoid all the complications that the C++ approach
entails, but this has a way of throwing a wrench in any interface that
wants to handle all three.
I'm inclined to agree, though I'm wary of making char types a special case
for
 overload resolution.  Perhaps a prefix to indicate type?

 c"" // utf-8
 w"" // utf-16
 d"" // utf-32

 Still not ideal, but it would require less typing :/
I thought of the prefix approach, like C uses, but it just seemed redundant for the odd case where a cast(char[]) will do.
Aug 28 2004
prev sibling parent "Walter" <newshound digitalmars.com> writes:
"Arcane Jill" <Arcane_member pathlink.com> wrote in message
news:cgh8vi$1o7k$1 digitaldaemon.com...
 It's not invalid as such, it's just that the return type of an overloaded
 function has to be "covariant" with the return type of the function it's
 overloading. So it's a compile error /now/. But if dchar[] and char[] were
to be
 considered mutually covariant then this would magically start to compile.
That would be nice, but I don't see how to technically make that work.
Aug 28 2004
prev sibling next sibling parent reply "Carlos Santander B." <carlos8294 msn.com> writes:
"antiAlias" <fu bar.com> escribió en el mensaje
news:cgg1mi$14l2$1 digitaldaemon.com
| A a; // class instances ...
| B b;
| C c;
|
| dchar[] message = c ~ b ~ a;

I have a question regarding this: what if A, B, and C were like this?

//////////////////////////
class A
{
    ... opCat_r (B b) { ... }
    ...
}

class B
{
    ... opCat (A a) { ... }
    ... opCat_r (C c) { ... }
    ...
}

class C
{
    ... opCat (B b) { ... }
    ...
}
//////////////////////////

How would "c ~ b ~ a" work with the proposed automatic call to .toString?

-----------------------
Carlos Santander Bernal
Aug 24 2004
parent reply Regan Heath <regan netwin.co.nz> writes:
On Tue, 24 Aug 2004 21:31:56 -0500, Carlos Santander B. 
<carlos8294 msn.com> wrote:
 "antiAlias" <fu bar.com> escribió en el mensaje
 news:cgg1mi$14l2$1 digitaldaemon.com
 | A a; // class instances ...
 | B b;
 | C c;
 |
 | dchar[] message = c ~ b ~ a;

 I have a question regarding this: what if A, B, and C were like this?

 //////////////////////////
 class A
 {
     ... opCat_r (B b) { ... }
     ...
 }

 class B
 {
     ... opCat (A a) { ... }
     ... opCat_r (C c) { ... }
     ...
 }

 class C
 {
     ... opCat (B b) { ... }
     ...
 }
 //////////////////////////

 How would "c ~ b ~ a" work with the proposed automatic call to .toString?
I assumed opCat's parameter would have to be char[], wchar[] or dchar[], as would its return value. eg.

class B
{
    char[] opCat(char[] rhs){}
}

given implicit transcoding you could then say.

char[]  c;
wchar[] w;
dchar[] d;

B b = new B();

char[] p;

p = b ~ c;
p = b ~ w;
p = b ~ d;

Regan

-- 
Using M2, Opera's revolutionary e-mail client: http://www.opera.com/m2/
Aug 24 2004
parent reply "Carlos Santander B." <carlos8294 msn.com> writes:
"Regan Heath" <regan netwin.co.nz> escribió en el mensaje
news:opsc9un10r5a2sq9 digitalmars.com
| I assumed opCat's parameter would have to be char[], wchar[] or dchar[],
| as would it's return value. eg.
|

I don't see why it has to be only that way. ~ is the concatenation operator, so
I could define:

class Set(T)
{
    Set opCat(T newElem) { ... }
}

And expect it to work the way I want it. opCat is not only for strings. And it
shouldn't be.

| class B
| {
|    char[] opCat(char[] rhs){}
| }
|
| given implicit transcoding you could then say.
|
| char[]  c;
| wchar[] w;
| dchar[] d;
|
| B b = new B();
|
| char[] p;
|
| p = b ~ c;
| p = b ~ w;
| p = b ~ d;
|
| Regan
|

-----------------------
Carlos Santander Bernal
Aug 25 2004
parent Arcane Jill <Arcane_member pathlink.com> writes:
This is irrelevant. opCat() does not need to do anything special for D strings
to work, whether we go the wchar[] route or the implicit conversion route. For
wchar[]s, it already works. If we go for implicit conversion, then the three
different kinds of D string would be regarded as covariant by the D compiler, so
expressions of the form (wchar[] ~ dchar[]) would be handled by the type
promotion system, not by opCat() - just as (float + int) is handled now.

Jill



In article <cgi5q2$254f$1 digitaldaemon.com>, Carlos Santander B. says...
"Regan Heath" <regan netwin.co.nz> escribió en el mensaje
news:opsc9un10r5a2sq9 digitalmars.com
| I assumed opCat's parameter would have to be char[], wchar[] or dchar[],
| as would it's return value. eg.
|

I don't see why it has to be only that way. ~ is the concatenation operator, so
I could define:

class Set(T)
{
    Set opCat(T newElem) { ... }
}

And expect it to work the way I want it. opCat is not only for strings. And it
shouldn't be.

| class B
| {
|    char[] opCat(char[] rhs){}
| }
|
| given implicit transcoding you could then say.
|
| char[]  c;
| wchar[] w;
| dchar[] d;
|
| B b = new B();
|
| char[] p;
|
| p = b ~ c;
| p = b ~ w;
| p = b ~ d;
|
| Regan
|

-----------------------
Carlos Santander Bernal
Aug 25 2004
prev sibling parent reply Ben Hinkle <bhinkle4 juno.com> writes:
 
 The other aspect involved here is that of string-concatenation. D cannot
 have more that one return type for toString() as you know. It's fixed at
 char[]. If string concatenation uses the toString() method to retrieve its
 components (as is being proposed elsewhere), then there will be multiple,
 redundant, implicit conversions going on where the string really wanted to
 be dchar[] in the first place. That is:
 
 A a; // class instances ...
 B b;
 C c;
 
 dchar[] message = c ~ b ~ a;
 
 Under the proposed "implicit" scheme, if each toString() of A, B, and C
 wish to return dchar[], then each concatenation causes an implicit
 conversion/encoding from each dchar[] to char[] (for the toString()
 return). Then another full conversion/decoding is performed back to the
 dchar[] assignment once each has been concatenated. This is like the
 Wintel 'plot' for selling more cpu's :-)
 
 Doing this manually, one would forego the toString() altogether:
 
 dchar[] message = c.getString() ~ b.getString() ~ a.getString();
 
 ... where getString() is a programmer-specific idiom to return the
 (natural) dchar[] for these classes, and we carefully avoided all those
 darned implicit-conversions. However, which approach do you think people
 will use? My guess is that D may become bogged down in conversion hell
 over such things.
"Conversion hell" will exist any time three standards are in use. It doesn't matter how those standards are wrapped up - implicit, explicit, String class or whatever. That's why we all win by agreeing on one standard and trying to stick to it. In D now life is peachy in char[] land, slightly less peachy in wchar[] and dchar[] land. I don't think there's any way to make life peachy for all three cases.
 So, to answer your question:
 What I'm /for/ is not covering up these types of issues with blanket-style
 implicit conversions. Something more constructive (and with a little more
 forethought) needs to be done.
Aug 24 2004
parent Regan Heath <regan netwin.co.nz> writes:
On Tue, 24 Aug 2004 22:53:47 -0400, Ben Hinkle <bhinkle4 juno.com> wrote:

<snip>

 "Conversion hell" will exist any time three standards are in use. It 
 doesn't
 matter how those standards are wrapped up - implicit, explicit, String
 class or whatever. That's why we all win by agreeing on one standard and
 trying to stick to it. In D now life is peachy in char[] land, slightly
 less peachy in wchar[] and dchar[] land. I don't think there's any way to
 make life peachy for all three cases.
Let's assume implicit transcoding is implemented; why wouldn't that make life peachy in all 3? Regan -- Using M2, Opera's revolutionary e-mail client: http://www.opera.com/m2/
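To make the proposal concrete, this is roughly what implicit transcoding would allow (it is not current D behaviour; the assignments below are the hypothetical part):

char[]  c = "hello";
wchar[] w = c;    // would transcode UTF-8 -> UTF-16 behind the scenes
dchar[] d = w;    // UTF-16 -> UTF-32
c = d;            // and back again, with no explicit toUTF8/toUTF16/toUTF32 calls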
Aug 24 2004
prev sibling parent reply "Walter" <newshound digitalmars.com> writes:
"Arcane Jill" <Arcane_member pathlink.com> wrote in message
news:cgfen9$rhq$1 digitaldaemon.com...
 Phobos UTF-8 code could be faster, I grant you. But perhaps it will be in
the
 next release. We're only talking about a tiny number of functions here,
after
 all.
My first goal with std.utf is to make it work right. Then comes the optimize-the-heck-out-of-it. std.utf is shaping up to be a core dependency for D, so making it run as fast as possible is worthwhile. Any suggestions?
Aug 28 2004
parent reply Arcane Jill <Arcane_member pathlink.com> writes:
In article <cgqju0$2b4$1 digitaldaemon.com>, Walter says...

My first goal with std.utf is to make it work right. Then comes the
optimize-the-heck-out-of-it. std.utf is shaping up to be a core dependency
for D, so making it run as fast as possible is worthwhile. Any suggestions?
It's only really UTF-8 decoding that's complicated. All the rest are pretty easy, even UTF-8 encoding (as I'm sure you know).

The approach I took in the code sample I posted here a while back was to read the first byte (which will be the /only/ byte, in the case of ASCII) and use it as the index into a lookup table. A byte has only 256 possible values - 128 after you've eliminated ASCII chars, and you can look up both the sequence length (or 0 for illegal first-bytes), and the initial value for the (dchar) accumulator. Then you just get six more bits from each of the remaining bytes (after ensuring that the bit pattern is 10xxxxxx). This approach will fail to catch precisely /two/ non-shortest cases, so you have to test for them explicitly. Finally, you make sure that the resulting dchar is not a forbidden value.

(From memory, I think that in some cases your current code checks for errors which can never happen, such as checking for a non-shortest 5+ byte sequence /after/ overlong sequences have already been eliminated).

You could go further. Kris has mentioned that heap allocation is slow. Presumably, you could start off by allocating a single char buffer of length 3*N (if input=wchars) or 4*N (if input=dchars), decoding into it, and then reducing its length. (Of course, the excess then won't be released).

(But never let an invalid input go unnoticed. That would be one optimization too many).

In article <cgqju1$2b4$2 digitaldaemon.com>, Walter says...
 It's not invalid as such, it's just that the return type of an overloaded
 function has to be "covariant" with the return type of the function it's
 overloading. So it's a compile error /now/. But if dchar[] and char[] were
to be
 considered mutually covariant then this would magically start to compile.
That would be nice, but I don't see how to technically make that work.
You're right. It wouldn't work.

Well, this is the sense in which D does have a "default" string type. It is the case where we see clearly that char[] has special privilege. Currently, Object - and therefore /everything/ - defines a function toString() which returns a char[]. It is not possible to overload this with a function returning wchar[], even if wchar[] would be more appropriate. This affects /all/ objects.

What to do about it? Hmmm.... You could change the Object.toString to a set of overloads (sketched below), with the latter two employing implicit conversion for the "s = t;" part. Subclasses of Object which overloaded only toString(out char[]) would get the other three for free. But subclasses of Object which decided to go a bit further could return a wchar[] or a dchar[] directly to cut down on conversions.
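The set of declarations being described might look roughly like this (a guess at the shape only, not the exact code):

char[] toString();                 // as now
void   toString(out char[]  s);
void   toString(out wchar[] s);    // the latter two would do "s = t;" via implicit conversion
void   toString(out dchar[] s);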
Actually, string literals are already interpreted as char[], wchar[], or
dchar[] depending on the context they appear in. The compiler implicitly
does a UTF conversion on them as necessary.
For example, with a dchar[] s, writing s ~ " world" gives the error: incompatible types for ((s) ~ (" world")): 'dchar[]' and 'char[]'. But yes - it works /nearly/ always, and that's cool. That case would be covered by implicit conversion, of course (although that would defer the conversion from compile-time to run-time).
If you have an overload based on
char[] vs wchar[] vs dchar[] and pass a string literal, it should result in
an ambiguity error.
Ah! That's the bit I didn't know. I was wondering how that context thing would work, given that signature matching happens /after/ the evaluation of the function's arguments' types. You could fix this by allowing explicit UTF-8, UTF-16 and UTF-32 literals. Sean suggested c"", w"" and d"" (and similarly for char literals). That would fix it.
 UTF-16 is not slower than UTF-8, even for pure ASCII. An app in which all
text
 is ASCII is going to be just as fast in UTF-16 as it is in ASCII.
While the rest of your post has a great deal of merit, this bit here is just not true. A pure ASCII app written using UTF-16 will consume twice as much memory for its data, and there are a lot of operations on that data that will be correspondingly half as fast.
That's true. Guess I got a bit carried away there. I was thinking that statements like "c = *p++;" would compile to just one machine code instruction regardless of the data width, and that the byte-wide version wouldn't necessarily be the fastest. But I forgot about all the initializing and copying that you also have to do. Arcane Jill
Aug 28 2004
parent reply J C Calvarese <jcc7 cox.net> writes:
Arcane Jill wrote:
 In article <cgqju0$2b4$1 digitaldaemon.com>, Walter says...
...
 In article <cgqju1$2b4$2 digitaldaemon.com>, Walter says...
 
It's not invalid as such, it's just that the return type of an overloaded
function has to be "covariant" with the return type of the function it's
overloading. So it's a compile error /now/. But if dchar[] and char[] were
to be
considered mutually covariant then this would magically start to compile.
That would be nice, but I don't see how to technically make that work.
You're right. It wouldn't work. Well, this is the sense in which D does have a "default" string type. It is the case where we see clearly that char[] has special privilege. Currently, Object - and therefore /everything/ - defines a function toString() which returns a char[]. It is not possible to overload this with a function returning wchar[], even if wchar[] would be more appropriate. This affects /all/ objects. What to do about it? Hmmm.... You could change the Object.toString to: with the latter two employing implicit conversion for the "s = t;" part. Subclasses of Object which overloaded only toString(out char[]) would get the other three for free. But subclasses of Object which decided to go a bit further could return a wchar[] or a dchar[] directly to cut down on conversions.
Could we ditch toString and replace the functionality with:

toUtf8(), toUtf16(), and toUtf32()

or

toCharStr(), toWCharStr(), and toDCharStr()

Usually, the person writing the object could define one and the other two would call conversions. There's probably some reason why this wouldn't work, but it's just such a pleasant idea to me that I was forced to share it.

-- 
Justin (a/k/a jcc7)
http://jcc_7.tripod.com/d/
Aug 28 2004
parent reply Ben Hinkle <bhinkle4 juno.com> writes:
J C Calvarese wrote:

 Arcane Jill wrote:
 In article <cgqju0$2b4$1 digitaldaemon.com>, Walter says...
...
 In article <cgqju1$2b4$2 digitaldaemon.com>, Walter says...
 
It's not invalid as such, it's just that the return type of an
overloaded function has to be "covariant" with the return type of the
function it's overloading. So it's a compile error /now/. But if dchar[]
and char[] were
to be
considered mutually covariant then this would magically start to
compile.
That would be nice, but I don't see how to technically make that work.
You're right. It wouldn't work. Well, this is the sense in which D does have a "default" string type. It is the case where we see clearly that char[] has special privilege. Currently, Object - and therefore /everything/ - defines a function toString() which returns a char[]. It is not possible to overload this with a function returning wchar[], even if wchar[] would be more appropriate. This affects /all/ objects. What to do about it? Hmmm.... You could change the Object.toString to: with the latter two employing implicit conversion for the "s = t;" part. Subclasses of Object which overloaded only toString(out char[]) would get the other three for free. But subclasses of Object which decided to go a bit further could return a wchar[] or a dchar[] directly to cut down on conversions.
Could we ditch toString and replace the functionality with: toUtf8(), toUtf16(), and toUtf32() or toCharStr(), toWCharStr(), and toDCharStr() Usually, the person writing the object could define one and the other two would call conversions. There's probably some reason why this wouldn't work, but it's just such a pleasant idea to me that I was forced to share it.
Why is toString such a hot topic anyway? In Java end users hardly ever see the result of a toString. In D I can see things like toString(int) and toString(double) being seen by users but in general Foo.toString should just give a summary of the object - preferably short and easy to transcode. I wouldn't use toString for things like getting user strings out of text fields in a GUI or reading from a file. For those cases I would use another function name like TextBox.getText or File.readStringW. Those other functions can have char[] versions and wchar[] versions as desired. As an example of a "bad" toString see std.stream.Stream.toString. It will usually create a huge string. For classes like AJ's arbitrary sized Int toString should return the whole integer since that is the best summary of the object. So we should still allow toString to return arbitrarily long strings - we just need to be careful how toString is used.
Aug 28 2004
parent reply Arcane Jill <Arcane_member pathlink.com> writes:
In article <cgra2k$udg$1 digitaldaemon.com>, Ben Hinkle says...

Why is toString such a hot topic anyway?
Ask yourself why toString() exists at all. What does D use it for? If it's an unnecessary, hardly-used function, then it should be removed from Object, because this is OOP: if something doesn't make sense for all Objects then it should not be defined for all Objects. On the other hand, if it /is/ necessary for all objects, it shouldn't be biased one way or the other.
but in general Foo.toString should
just give a summary of the object
Of the object's /value/, yes. So toString() only makes sense for objects which actually /have/ a value. I'm not sure if streams can be said to have a "value" in the sense that Ints do, so maybe it shouldn't be defined at all for streams.
As an
example of a "bad" toString see std.stream.Stream.toString. It will usually
create a huge string.
Yes. Now I'm starting to wonder what toString() is actually for, and whether implementing a three-function interface (Stringizable?) might be better than inheriting from Object.
For classes like AJ's arbitrary sized Int toString should return the whole
integer since that is the best summary of the object. So we should still
allow toString to return arbitrarily long strings - we just need to be
careful how toString is used.
Let's go back to Walter on this one. Walter - why does Object have a toString() function? In what way does D require or rely on it? How badly would D be affected if it didn't exist at all or if it were an interface? Jill
Aug 28 2004
next sibling parent reply "Walter" <newshound digitalmars.com> writes:
"Arcane Jill" <Arcane_member pathlink.com> wrote in message
news:cgrsnq$168a$1 digitaldaemon.com...
 Let's go back to Walter on this one. Walter - why does Object have a
toString()
 function? In what way does D require or rely on it? How badly would D be
 affected if it didn't exist at all or if it were an interface?
It's so when you pass an object to writef(), there's a way that it can be printed. But Ben is right, I don't see Object.toString() being used to generate very large strings, so any transcoding of it isn't going to be expensive overall.
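For example (a minimal sketch; the class is made up):

import std.stdio;

class Account
{
    char[] owner;

    this(char[] owner) { this.owner = owner; }

    char[] toString() { return "Account(" ~ owner ~ ")"; }
}

void main()
{
    // writefln falls back on Object.toString() to print the object
    writefln(new Account("jill"));
}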
Aug 29 2004
parent reply Matthias Becker <Matthias_member pathlink.com> writes:
news:cgrsnq$168a$1 digitaldaemon.com...
 Let's go back to Walter on this one. Walter - why does Object have a
toString()
 function? In what way does D require or rely on it? How badly would D be
 affected if it didn't exist at all or if it were an interface?
It's so when you pass an object to writef(), there's a way that it can be printed. But Ben is right, I don't see Object.toString() being used to generate very large strings, so any transcoding of it isn't going to be expensive overall.
I don't get it :( Why isn't it possible to use a Stringizeable interface instead? -- Matthias Becker
Aug 29 2004
parent "Walter" <newshound digitalmars.com> writes:
"Matthias Becker" <Matthias_member pathlink.com> wrote in message
news:cgs9b4$1c0h$1 digitaldaemon.com...
news:cgrsnq$168a$1 digitaldaemon.com...
 Let's go back to Walter on this one. Walter - why does Object have a
toString()
 function? In what way does D require or rely on it? How badly would D
be
 affected if it didn't exist at all or if it were an interface?
It's so when you pass an object to writef(), there's a way that it can be printed. But Ben is right, I don't see Object.toString() being used to generate very large strings, so any transcoding of it isn't going to be expensive overall.
I don't get it :( Why isn't it possible to use a Stringizeable interface instead?
It is possible. But I think every Object should have some basic functionality, and one of those is to be able to have itself pretty-printed. This is also nice for a potential D debugger - it can take advantage of toString() to produce a user-friendly representation of the class data.
Aug 30 2004
prev sibling next sibling parent Ben Hinkle <bhinkle4 juno.com> writes:
Arcane Jill wrote:

 In article <cgra2k$udg$1 digitaldaemon.com>, Ben Hinkle says...
 
Why is toString such a hot topic anyway?
Ask yourself why toString() exists at all. What does D use it for?
Mostly so there's an easy way to print an object out. One drawback with D's Object.toString is that the default implementation doesn't print the object's address (or in Java's case the hash code) so you can't distinguish one object from another. If anything I'd like to see some guidelines for toString so that the output is consistent across D. For example if the object doesn't have an obvious string representation, like a class Foo{int n; double d;} then the result of toString should have the form "[Foo n:0, d:0.0]" - possibly include the address or hash code in there. I think this is basically the format Java uses but I can't remember exactly. In general toString should avoid newlines.
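A toString following that guideline might look like this (assuming std.string.toString overloads for the field types; substitute whatever formatting helpers are available):

import std.string;

class Foo
{
    int    n;
    double d = 0.0;

    char[] toString()
    {
        return "[Foo n:" ~ std.string.toString(n) ~ ", d:" ~ std.string.toString(d) ~ "]";
    }
}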
 If it's an unnecessary, hardly-used function, then it should be removed
 from Object, because this is OOP, if something doesn't make sense for all
 Objects then it should not be defined for all Objects.
 
 On the other hand, if it /is/ necessary for all objects, it shouldn't be
 biased one way or the other.
It isn't necessary but it is nice to have around. Does it make sense for all objects? I guess that depends on one's viewpoint of "makes sense". I think printing the class and hash code makes sense. Maybe others don't.
but in general Foo.toString should
just give a summary of the object
Of the object's /value/, yes. So toString() only makes sense for objects which actually /have/ a value. I'm not sure if streams can be said to have a "value" in the sense that Ints do, so maybe it shouldn't be defined at all for streams.
See above comments about making sense and the default toString format.
As an
example of a "bad" toString see std.stream.Stream.toString. It will
usually create a huge string.
Yes. Now I'm starting to wonder what toString() is actually for, and whether implementing a three-function interface (Stringizable?) might be better than inheriting from Object.
That's an option, but it adds more work for users to make something stringizable - two of the stringizable functions will call the "real" one and wrap the result in toUTF8 etc. Is it worth it to do all that just for debugging?
For classes like AJ's arbitrary sized Int toString should return the whole
integer since that is the best summary of the object. So we should still
allow toString to return arbitrarily long strings - we just need to be
careful how toString is used.
Let's go back to Walter on this one. Walter - why does Object have a toString() function? In what way does D require or rely on it? How badly would D be affected if it didn't exist at all or if it were an interface? Jill
Aug 29 2004
prev sibling parent reply Berin Loritsch <bloritsch d-haven.org> writes:
Arcane Jill wrote:

 In article <cgra2k$udg$1 digitaldaemon.com>, Ben Hinkle says...
 
 
Why is toString such a hot topic anyway?
Ask yourself why toString() exists at all. What does D use it for? If it's an unnecessary, hardly-used function, then it should be removed from Object, because this is OOP, if something doesn't make sense for all Objects then it should not be defined for all Objects. On the other hand, if it /is/ necessary for all objects, it shouldn't be biased one way or the other.
There are a couple of uses I use toString() for (in Java apps):

1: debugging. Some useful info is derived when going through a graphical debugger (hovering over a variable will yield its toString() value).

2: easier inclusion in graphical lists. I might have an object encapsulating the ISO-3166 country codes (long name and id), but use the toString() to show the long name. That way in the drop down list I have everything I need to serialize the info to the DB.

But that's just Java.... D doesn't have decent IDE integration for this operation.
Aug 30 2004
parent reply Dave <Dave_member pathlink.com> writes:
In article <cgva1p$2gep$1 digitaldaemon.com>, Berin Loritsch says...
Arcane Jill wrote:

 In article <cgra2k$udg$1 digitaldaemon.com>, Ben Hinkle says...
 
 
Why is toString such a hot topic anyway?
Ask yourself why toString() exists at all. What does D use it for? If it's an unnecessary, hardly-used function, then it should be removed from Object, because this is OOP, if something doesn't make sense for all Objects then it should not be defined for all Objects. On the other hand, if it /is/ necessary for all objects, it shouldn't be biased one way or the other.
There are a couple of uses I use toString() for (in Java apps): 1: debugging. Some useful info is derived when going through a graphical debugger (hovering over a variable will yield its toString() value). 2: easier inclusion in graphical lists. I might have an object encapsulating the ISO-3166 country codes (long name and id), but use the toString() to show the long name. That way in the drop down list I have everything I need to serialize the info to the DB. But that's just Java.... D doesn't have decent IDE integration at this operation.
(Also referring to the 'most European chars. are ASCII' post just ahead of this one)

And because of their likely small size, even if they are filled with non-ASCII characters, neither of those uses will realistically cause a performance bottleneck if toString() is UTF-8 by 'default', right?

No matter how efficient the memory management system is, if the 'default' is UTF-16 rather than UTF-8, most apps. will have to carry that extra de/allocation & initialization burden for no reason other than expediency or ignorance on the programmer's part. Often twice the work and 1/2 the available memory for many of the same jobs.

UTF-16 used everywhere is probably one reason why heavy-use Java server apps. have the reputation as 'memory-thrashers'. And those runtimes have the benefit of many man-years of development, use-cases and experimentation behind them.

Since UTF-8 is the most efficient and adequate for most of D's currently foreseeable uses, I say leave as is and put the 'burden' on the software that needs or benefits from other than UTF-8. IMO, D really needs to be a general performance winner in order to get a toe-hold in the current market.
Aug 30 2004
parent "Walter" <newshound digitalmars.com> writes:
"Dave" <Dave_member pathlink.com> wrote in message
news:ch033f$2u7u$1 digitaldaemon.com...
 UTF-16 used everywhere is probably one reason why heavy-use Java server
apps.
 have the reputation as 'memory-thrashers'. And those runtimes have the
benefit
 of many man-years of development, use-cases and experimentation behind
them. Many of Java's problems have gradually declined over time due to massive research and development efforts by a lot of very, very smart people. For example, Andy King has just pointed me to some research done by several people that focussed on improving some minor aspects of the Java garbage collector. D doesn't have a billion dollar budget <g>, and simply cannot afford to have problems that need such budgets to find solutions for.
 IMO, D really needs to be a general performance winner in order to get a
 toe-hold in the current market.
Right. And if D acquires an early reputation for being 'slow', it will never have a chance. Currently, D is *faster* than C++ on string benchmarks (see www.digitalmars.com/d/cppstrings.html), and it must be that way, as that seriously blunts criticisms aimed at D for the way it does strings relative to C++.
Aug 30 2004
prev sibling parent reply Regan Heath <regan netwin.co.nz> writes:
On Mon, 23 Aug 2004 23:05:45 -0700, antiAlias <fu bar.com> wrote:
 "Regan Heath" <regan netwin.co.nz> wrote in message
 news:opsc7u56zf5a2sq9 digitalmars.com...
 On Mon, 23 Aug 2004 18:29:54 -0700, antiAlias <fu bar.com> wrote:
 "Walter" <newshound digitalmars.com> wrote in message ...
 "Regan" wrote:
 There are two ways of looking at this, on one hand you're saying 
they
 should all 'paint' as that is consistent. However, on the other I'm
saying
 they should all produce a 'valid' result. So my argument here is 
that
 when
 you cast you expect a valid result, much like casting a float to an
int
 does not just 'paint'.

 I am interested to hear your opinions on this idea.
I think your idea has a lot of merit. I'm certainly leaning that way.
On the one hand, this would be well served by an opCast() and/or opCast_r() on the primitive types; just the kind of thing suggested in a related thread (which talked about overloadable methods for primitive types).
True. However, what else should the opCast for char[] to dchar[] do except transcode it? What about opCast for int to dchar.. it seems to me there is only 1 choice about what to do, anything else would be operator abuse.
 On the other hand, we're talking transcoding here. Are you gonna' 
limit
 this
 to UTF-8 only?
I hope not, I am hoping to see:

         | UTF-8 | UTF-16 | UTF-32
---------+-------+--------+-------
  UTF-8  |   -   |   +    |   +
  UTF-16 |   +   |   -    |   +
  UTF-32 |   +   |   +    |   -

(+ indicates transcoding occurs)
======================== And what happens when just one additional byte-oriented encoding is introduced? Perhaps UTF-7? Perhaps EBCDIC? The basic premise is flawed because there's no flexibility.
--------------------------------------- You'll have to check with Walter but I believe he has no plans to add another basic type to hold any specific encoding. Encodings other than UTF-x will be done with library functions and ubyte[], ushort[], uint[], ulong[].
 Then, since the source and destination will typically be of
 different sizes, do you then force all casts between these types to 
have
 the
 destination be an array reference rather than an instance?
I don't think transcoding makes any sense unless you're talking about a 'string' (i.e. char[], wchar[], or dchar[]) as opposed to a UTF-x fragment (i.e. char, wchar)
======================== We /are/ talking about arrays. Perhaps if that sentence had ended "array reference rather than an *array* instance?", it might have been more clear?
Or maybe I just mis-read or misunderstood it :)
 The point being made is that you would not be able to do anything like 
 this:

 char[15] dst;
 dchar[10] src;

 dst = cast(char[]) src;

 because there's no ability via a cast() to indicate how many items from 
 src
 were converted, and how many items in dst were populated. You are forced
 into this kind of thing:

 char[] dst;
 dchar[10] src;

 dst = cast(char[]) src;
--------------------------------------- How is that different to what we have to do now? dst = toUTF8(src); ?
 You see the distinction? It may be subtle to some, but it's a glaring
 imbalance to others. The lValue must always be a reference because it's
 gonna' be allocated dynamically to ensure all of the rValue will fit. In 
 the
 end, it's just far better to use functions/methods that provide the 
 feedback
 required so you can actually control what's going on (or such that a 
 library
 function can). That way, you're not restricted in the same fashion. We 
 don't
 need more asymmetry in D, and this just reeks of poor design, IMO.
Correct me if I'm wrong, but you're suggesting the use of a library function like this...

bool toUTF8(char[] dst, dchar[] src) {}

or similar, where the caller passes a buffer for the result, and has full control of the size/location/allocation etc. of that buffer, correct?

1. The implicit casting idea does not prevent this.

2. The above function doesn't exist currently.. instead we have

char[] toUTF8(dchar[] src) {}

which is identical to what an implicit cast would do.
 To drive this home, consider the decoding version (rather than the 
 encoding
 above):

 char[15] src;
 dchar[] dst;

 dst = cast(dchar[]) src;

 What happens when there's a partial character left undecoded at the end 
 of
 'src'?
How is that even possible? Assuming src is a 'valid' UTF-8 sequence, all the characters can and will be encoded into UTF-32 producing a 'valid' UTF-32 sequence. If 'src' _is_ invalid then an exception will be thrown much like the ones we currently have for invalid UTF sequences.
 There nothing here to tell you that you've got a dangly bit left at
 the end of the souce-buffer. It's gone. Poof! Any further decoding from 
 the
 same file/socket/whatever is henceforth trashed, because the ball has 
 been
 both dropped and buried. End of story.


 Having 3 types requiring manual transcoding between them _is_ a pain.
======================== It certainly is. That's why other languages try to avoid it at all costs. Having it done "generously" by the compiler is also a pain, inflexible, and likely expensive.
I'd argue that you're wrong about all but the last assertion above. Having it done by the compiler would not be:

1. a 'pain' - you wouldn't notice, and if you did and did not desire the behaviour you can manually convert just like you have to do now.

2. 'inflexible' - this idea does not preclude you doing things another way. It simply provides a default, which IMO is the sensible/correct thing to do.

It is however more 'expensive' than the current situation, but it's no more expensive than doing the conversion manually, which is what you currently have to do..
 There are many things a programmer should take
 responsibility for; transcoding comes under that umbrella because (a) 
 there
 can be subtle complexity involved and (b) it is relatively expensive to
 churn through text and convert it; particularly so with the Phobos utf-8
 code.

 What you appear to be suggesting is that this kind of thing should happen
 silently whilst one nonchalantly passes arguments around between methods.
--------------------------------------- Yes.
 That's insane, so I hope that's not what you're advocating. Java, for
 example, does that at one specific layer (I/O)
--------------------------------------- Which it can do because it only has one string type.
 , but you're apparently
 suggesting doing it at any old place! And several times over, just in 
 case
 it wasn't good enough the first time :-)
Yes, as we only want to go to an inefficient type temporarily, e.g. Walter's comment: "dchars are very convenient to work with, however, and make a great deal of sense as the temporary common intermediate form of all the conversions. I stress the temporary, though, as if you keep it around you'll start to notice the slowdowns it causes." reflects exactly what implicit conversion will give you: the ability to store things in memory in the format you think is most efficient and dip in and out of utf-32 _if_ required. utf-32 is not 'required' for anything, it's simply 'convenient' for certain things; as AJ has frequently pointed out, you can encode every Unicode character in all 3 formats UTF-8, UTF-16 and UTF-32.
 Sorry man. This is inane. D is
 /not/ a scripting language; instead it's supposed to be a systems 
 language.
--------------------------------------- Which is why having a native UTF-8 type is so useful, and why being able to 'temporarily' implicitly convert to utf-32 for convenience while storing in the most space efficient format is so useful.
 Besides, if IUC were adopted, less
 people would have to worry about the distinction anyway.
The same could be said for the implicit transcoding from one type to the other.
======================== That's pretty short-sighted IMO. You appear to be saying that implicit transcoding would take the place of ICU; terribly misleading.
--------------------------------------- Not at all. I implied this statement: "if <implicit transcoding> were adopted, less people would have to worry about the distinction anyway" or tried to.
 Transcoding is
 just a very small part of that package. Please try to reread the comment 
 as
 "most people would be shielded completely by the library functions,
 therefore there's far fewer scenarios where they'd ever have a need to 
 drop
 into anything else".

 This would be a very GoodThing for D users. Far better to have a good
 library to take case of /all/ this crap than have the D language do some
 partial-conversions on the fly, and then quit because it doesn't know 
 how to
 provide any further functionality. This is the classic
 core-language-versus-library-functionality bitchfest all over again.
 Building all this into a cast()? Hey! Let's make Walter's Regex class 
 part
 of the compiler too; and make it do UTF-8 decoding /while/ it's 
 searching,
 since you'll be able to pass it a dchar[] that will be generously 
 converted
 to the accepted char[] for you "on-the-fly".

 Excuse me for jesting, but perhaps the Bidi text algorithms plus
 date/numeric formatting & parsing will all fit into a single operator 
 also?
 That's kind of what's being suggested. I believe there's a serious
 underestimate of the task being discussed.
This is an over-exaggeration, perhaps you should give Jim Carrey acting lessons <g>
 Believe me when I say that Mango would dearly love to go dchar[] only.
 Actually, it probably will at the higher levels because it makes life
 simple
 for everyone. Oh, and I've been accused many times of being an
efficiency
 fanatic, especially when it comes to servers. But there's always a
 tradeoff
 somewhere. Here, the tradeoff is simplicity-of-use versus quantities 
of
 RAM.
 Which one changes dramatically over time? Hmmmm ... let me see now ...
 64bit-OS for desktops just around corner?
What you really mean is you'd dearly love to not have to worry about the differences between the 3 types; implicit transcoding will give you that. Furthermore it's simplicity without sacrificing RAM.
======================== Ahh Thanks. I didn't realize that's what I "really meant". Wait a minute ...
--------------------------------------- Forgive me for sharing my interpretation of what you said.
 Even on an embedded device I'd probably go "dchar only" regarding 
I18N.
 Simply because the quantity of text processed on such devices is very
 limited. Before anyone shoots me over this one, I regularly write code
 for
 devices with just 4KB RAM ~ still use 16bit chars there when dealing
with
 XML input.
Were you to use implicit transcoding you could store the data in memory in UTF-8 or UTF-16, then only transcode to UTF-32 when required; this would be more efficient.
That's a rather large assumption, don't you think? More efficient? In which particular way? Is memory usage or CPU usage more important in /my/ particular applications? Please either refrain, or commit to rewriting all my old code more efficiently for me ... for free <g>
As stated earlier this concept does not stop you from optimising your app in any way, shape or form; it's simply a sensible default behaviour IMO. Regan -- Using M2, Opera's revolutionary e-mail client: http://www.opera.com/m2/
Aug 24 2004
parent reply "antiAlias" <fu bar.com> writes:
"Regan Heath" <regan netwin.co.nz> wrote ..
 What happens when there's a partial character left undecoded
 at the end of  'src'?
--------------------------------------- How is that even possible?
It happens all the time with streamed input. However, as AJ pointed out, neither you nor Walter are apparently suggesting that the cast() approach be used for anything other than trivial conversions. That is, one would not use this approach with respect to IO streaming. I had the (distinctly wrong) impression this implied-conversion was intended to be a jack-of-all-trades.

Everything else in the post is therefore cast(void) ~ so let's stop wasting our breath :)

If these implicit conversions are put in place, then I respectfully suggest the std.utf functions be replaced with something that avoids fragmenting the heap in the manner they currently do (for non Latin-1); and it's not hard to make them an order-of-magnitude faster, too.

Finally, there are still the problems related to string-concatenation and toString(), as described toward the end of this post news:cgg1mi$14l2$1 digitaldaemon.com
Aug 24 2004
next sibling parent Regan Heath <regan netwin.co.nz> writes:
On Tue, 24 Aug 2004 19:29:24 -0700, antiAlias <fu bar.com> wrote:
 "Regan Heath" <regan netwin.co.nz> wrote ..
 What happens when there's a partial character left undecoded
 at the end of  'src'?
--------------------------------------- How is that even possible?
It happens all the time with streamed input.
Ahhh.. I get it, you were referring to not having all the input at one time, with some being left in the 'stream'.. I can see your concern now.
 However, as AJ pointed out,
 neither you nor Walter are apparently suggesting that the cast() 
 approach be
 used for anything other than trivial conversions.
Correct, the cases where the current approach can actually create a bug, a bug that only sometimes happens.
 That is, one would not use
 this approach with respect to IO streaming. I had the (distinctly wrong)
 impression this implied-conversion was intended to be a 
 jack-of-all-trades.

 Everything else in the post is therefore cast(void)  ~  so let's stop
 wasting our breath :)
Yay :)
 If these implicit conversions are put in place, then I respectfully 
 suggest
 the std.utf functions be replaced with something that avoids fragmenting 
 the
 heap in the manner they currently do (for non Latin-1); and it's not 
 hard to
 make them an order-of-magnitude faster, too.
Good idea. If it's done it has to be done as efficiently as possible.
 Finally; there's still the problems related to string-concatentation and
 toString(), as described toward the end of this post
Yep. I think the toString restriction should be lifted; with implicit transcoding any of the string types should be valid. I am still concerned about the number of transcoding operations that might occur in an unsuspecting programmer's string concatenation... Regan -- Using M2, Opera's revolutionary e-mail client: http://www.opera.com/m2/
Aug 24 2004
prev sibling parent reply Sean Kelly <sean f4.ca> writes:
In article <cggtcq$1ird$1 digitaldaemon.com>, antiAlias says...
"Regan Heath" <regan netwin.co.nz> wrote ..
 What happens when there's a partial character left undecoded
 at the end of  'src'?
--------------------------------------- How is that even possible?
It happens all the time with streamed input. However, as AJ pointed out, neither you nor Walter are apparently suggesting that the cast() approach be used for anything other than trivial conversions. That is, one would not use this approach with respect to IO streaming. I had the (distinctly wrong) impression this implied-conversion was intended to be a jack-of-all-trades.
My modified version of std.utf is meant to address the streaming issue. Basically, I added versions of encode and decode that accept as the source or destination hook. Not perfect perhaps, but it does get around the problem of encode/decode wanting to throw an exception if they encounter an invalid sequence.
If these implicit conversions are put in place, then I respectfully suggest
the std.utf functions be replaced with something that avoids fragmenting the
heap in the manner they currently do (for non Latin-1); and it's not hard to
make them an order-of-magnitude faster, too.
Then by all means do so :) Sean
Aug 25 2004
next sibling parent Sean Kelly <sean f4.ca> writes:
In article <cgidgk$28mg$1 digitaldaemon.com>, Sean Kelly says...
Basically, I added versions of encode and decode that accept as the source or
destination hook.
"Accept a delegate." Sean
Aug 25 2004
prev sibling parent reply Arcane Jill <Arcane_member pathlink.com> writes:
In article <cgidgk$28mg$1 digitaldaemon.com>, Sean Kelly says...

If these implicit conversions are put in place, then I respectfully suggest
the std.utf functions be replaced with something that avoids fragmenting the
heap in the manner they currently do (for non Latin-1); and it's not hard to
make them an order-of-magnitude faster, too.
Then by all means do so :) Sean
Some speed-up ideas...

I posted a potentially speedier version of UTF-8 decode here a while back. The basic algorithm I used was this: get the first byte; if it's ASCII, return it; else use it as an index into a lookup table to get the sequence length. There's slightly more to it than that, obviously, but that was the basis. Walter wanted to know if there were any standard tests to check whether a UTF-8 function works correctly. I didn't know of any.

The big difficulty with UTF-8 is that of being fully Unicode conformant. This is poorly understood, so people are often tempted to take shortcuts. The std.utf functions take no shortcuts and so are conformant.

The gist is this, however. You can have two different kinds of UTF-8 decode routine - checked or unchecked. A checked function will ensure that the input contains no invalid sequences (non-shortest sequences are always invalid), and will throw an exception (or otherwise report the error) if that's not the case. Checked decoders can be made fully conformant, but the checking can slow you down. Unchecked decoders, on the other hand, simply /assume/ that the input is valid, and produce garbage if it isn't. Unchecked decoders can be made to go a lot faster, but they are not Unicode conformant ... unless of course you *KNOW* with 100% certainty that the input *IS* valid. (Without this knowledge, your application won't be Unicode conformant, and can actually be a security risk.)

So, it would be possible to write a fast, unchecked UTF-8 decoder, if you made use of D's Design by Contract. If you validate the string in the function's "in" block, then you can assume valid input in the function body, and thereby go faster (at least in a release build). But watch out for coding errors. The caller *MUST* fulfil that contract, or you have a bug. And you'd still need to have a checked UTF-8 decoder for those cases when you're not sure where the input came from.

Being able to distinguish between sequences which have already been validated, and those which have not, can buy you a lot of efficiency. Unfortunately, I don't see how D can take advantage of that. If a D string were a class or a struct, then it could have a class invariant - but D strings are just simple arrays, and constructing invalid UTF-8 arrays is all too easy.

Arcane Jill
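To make the design-by-contract idea concrete, here is a minimal sketch (assuming the caller hands in valid UTF-8 and an index that points at a lead byte; a lookup table on the lead byte, as described above, would be faster still):

    import std.utf;    // for validate()

    // Sketch of an unchecked decoder guarded by a contract: the in-block
    // does the expensive validation, and in a release build it vanishes,
    // leaving only the check-free fast path.
    dchar uncheckedDecode(char[] s, ref size_t idx)
    in
    {
        std.utf.validate(s);       // throws if s contains any invalid sequence
        assert(idx < s.length);
    }
    body
    {
        char c = s[idx];
        if (c < 0x80)              // ASCII: return it directly
        {
            ++idx;
            return c;
        }

        // lead byte gives the sequence length; no validity checks needed here
        size_t len = (c & 0xE0) == 0xC0 ? 2 :
                     (c & 0xF0) == 0xE0 ? 3 : 4;
        uint d = c & (0x7F >> len);
        foreach (char t; s[idx + 1 .. idx + len])
            d = (d << 6) | (t & 0x3F);
        idx += len;
        return cast(dchar) d;
    }

In a debug build the contract catches violations; in a release build it disappears, which is exactly the "validated once, trusted thereafter" split described above.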
Aug 25 2004
parent reply "antiAlias" <fu bar.com> writes:
The decoding thing has issues as you point out Jill.

That's not the whole story, however. The Mango utf-8 Encoder is 5 to 15
times faster than the Phobos one, while executing the equivalent algorithm.
The CheckedDecoder is 5 to 15 times faster, whilst the UncheckedDecoder is
10 to 30 times faster (all  variances are due to alternate mixes of char,
wchar, dchar; all timings performed on a P3).

These are rather significant differences. I think it's safe to say that the
Phobos routines were "not written with efficiency in mind". Either that, or
Mango has some secret means of warping time ...



"Arcane Jill" <Arcane_member pathlink.com> wrote in message
news:cgih09$2ak7$1 digitaldaemon.com...
 In article <cgidgk$28mg$1 digitaldaemon.com>, Sean Kelly says...

If these implicit conversions are put in place, then I respectfully
suggest
the std.utf functions be replaced with something that avoids fragmenting
the
heap in the manner they currently do (for non Latin-1); and it's not
hard to
make them an order-of-magnitude faster, too.
Then by all means do so :) Sean
Some speed-up ideas... I posted a potentially speedier version of UTF-8 decode here a while back.
The
 basic algorithm I used was this: get the first byte; if it's ASCII, return
it;
 else use it as an index into a lookup table to get the sequence length.
There's
 slightly more to it than that, obviously, but that was the basis. Walter
wanted
 to know if there were any standard tests to check whether a UTF-8 function
works
 correctly. I didn't know of any.

 The big difficulty with UTF-8 is that of being fully Unicode conformant.
This is
 poorly understood, so people are often tempted to make shortcuts. The
std.utf
 functions take no shortcuts and so are conformant.

The gist is this, however. You can have two different kinds of UTF-8
decode
 routine - checked or unchecked. A checked function will ensure that the
input
 contains no invalid sequences (non-shortest sequences are always invalid),
and
 will throw an exception (or otherwise report the error) if that's not the
case.
 Checked decoders can be made fully conformant, but the checking can slow
you
 down.

 Unchecked decoders, on the other hand, simply /assume/ that the input is
valid,
 and produce garbage if it isn't. Unchecked decoders can be made to go a
lot
 faster, but they are not Unicode conformant ... unless of course you
*KNOW* with
 100% certainty that the input *IS* valid. (Without this knowledge, your
 application won't be Unicode conformant, and can actually be a security
risk).
 So, it would be possible to write a fast, unchecked UTF-8 decoder, if you
made
 use of D's Design by Contract. If you validate the string in the
function's "in"
 block, then you can assume valid input in the function body, and thereby
go
 faster (at least in a release build). But watch out for coding errors. The
 caller *MUST* fulfil that contract, or you have a bug. And you'd still
need to
 have a checked UTF-8 decoder for those cases when you're not sure where
the
 input came from.

 Being able to distinguish between sequences which have already been
validated,
 and those which have not, can buy you a lot of efficiency. Unfortunately,
I
 don't see how D can take advantage of that. If a D string were a class or
a
 struct, then it could have a class invariant - but D strings are just
simple
 arrays, and constructing invalid UTF-8 arrays is all too easy.

 Arcane Jill
Aug 25 2004
parent reply pragma <pragma_member pathlink.com> writes:
In article <cgij3k$2bon$1 digitaldaemon.com>, antiAlias says...
The decoding thing has issues as you point out Jill.

That's not the whole story, however. The Mango utf-8 Encoder is 5 to 15
times faster than the Phobos one, while executing the equivalent algorithm.
The CheckedDecoder is 5 to 15 times faster, whilst the UncheckedDecoder is
10 to 30 times faster (all  variances are due to alternate mixes of char,
wchar, dchar; all timings performed on a P3).
Holy crap, what kind of data are you throwing at it? I don't mean to criticize, but there must be some clever coding on your part or some serious loopholes in the algorithm to get that kind of an improvement. :)
These are rather significant differences. I think it's safe to say that the
Phobos routines were "not written with efficiency in mind". Either that, or
Mango has some secret means of warping time ...
In the case of the latter, may we rename the "Mango" to "Tardis"? -Pragma [[ EricAnderton at (daleks can't use stairs) yahoo.com ]]
Aug 25 2004
parent reply "antiAlias" <fu bar.com> writes:
<g> That's funny.

No loopholes; no clever coding. Just take a look at (for example) what
utf.encode does to the heap. Those order-of-magnitude timings are best-case
for Phobos ~ in a busy server environment they'd be even slower, likely
cause notable heap fragmentation, and persistently lock the heap against
other (much more appropriate) usage by other threads. Imagine multiple
threads doing implicit dchar[] conversions via utf.encode?
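For illustration only (this is a sketch, not the Mango code): an encoder that writes into a fixed stack buffer and hands each fragment to a caller-supplied sink never needs to touch the heap at all, unlike a conversion that allocates and returns a new array on every call.

    // Sketch: heap-free UTF-8 encoding via a caller-supplied sink delegate.
    // (Checks for surrogates and out-of-range values are omitted for brevity.)
    void encodeTo(dchar c, void delegate(char[] fragment) sink)
    {
        char[4] buf;      // scratch space on the stack
        size_t n;

        if (c < 0x80)
        {
            buf[0] = cast(char) c;
            n = 1;
        }
        else if (c < 0x800)
        {
            buf[0] = cast(char)(0xC0 | (c >> 6));
            buf[1] = cast(char)(0x80 | (c & 0x3F));
            n = 2;
        }
        else if (c < 0x10000)
        {
            buf[0] = cast(char)(0xE0 | (c >> 12));
            buf[1] = cast(char)(0x80 | ((c >> 6) & 0x3F));
            buf[2] = cast(char)(0x80 | (c & 0x3F));
            n = 3;
        }
        else
        {
            buf[0] = cast(char)(0xF0 | (c >> 18));
            buf[1] = cast(char)(0x80 | ((c >> 12) & 0x3F));
            buf[2] = cast(char)(0x80 | ((c >> 6) & 0x3F));
            buf[3] = cast(char)(0x80 | (c & 0x3F));
            n = 4;
        }

        // the slice points into the stack buffer, so the sink must use it
        // (write it out, copy it) before returning
        sink(buf[0 .. n]);
    }

The sink might write straight to a socket, or append into a buffer the caller sized up front; either way the GC never gets involved.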

Because there's no supported means of resolving such things, one becomes
inclined to simply 'reimplement' instead of pooling one's skills and
resources to fix Phobos.

This is exactly the kind of thing the DSLG should take care of.



"pragma" <pragma_member pathlink.com> wrote in message
news:cgin45$2djh$1 digitaldaemon.com...
 In article <cgij3k$2bon$1 digitaldaemon.com>, antiAlias says...
The decoding thing has issues as you point out Jill.

That's not the whole story, however. The Mango utf-8 Encoder is 5 to 15
times faster than the Phobos one, while executing the equivalent
algorithm.
The CheckedDecoder is 5 to 15 times faster, whilst the UncheckedDecoder
is
10 to 30 times faster (all  variances are due to alternate mixes of char,
wchar, dchar; all timings performed on a P3).
Holy crap, what kind of data are you throwing at it? I don't mean to
criticize,
 but there must be some clever coding on your part or some serious
loopholes in
 the algorithm to get that kind of an improvement. :)

These are rather significant differences. I think it's safe to say that
the
Phobos routines were "not written with efficiency in mind". Either that,
or
Mango has some secret means of warping time ...
In the case of the latter, may we rename the "Mango" to "Tardis"? -Pragma [[ EricAnderton at (daleks can't use stairs) yahoo.com ]]
Aug 25 2004
next sibling parent Regan Heath <regan netwin.co.nz> writes:
Might I humbly suggest you add these routines to deimos?
Have you seen my post to the DSLG thread? I think my idea has merit..

On Wed, 25 Aug 2004 13:07:57 -0700, antiAlias <fu bar.com> wrote:
 <g> That's funny.

 No loopholes; no clever coding. Just take a look at (for example) what
 utf.encode does to the heap. Those order-of-magnitude timings are 
 best-case
 for Phobos ~ in a busy server environment they'd be even slower, likely
 cause notable heap fragmentation, and persistently lock the heap against
 other (much more appropriate) usage by other threads. Imagine multiple
 threads doing implicit dchar[] conversions via utf.encode?

 Because there's no supported means of resolving such things, one becomes
 inclined to simply 'reimplement' instead of pooling ones skills and
 resources to fix Phobos.

 This is exactly the kind of thing the DSLG should take care of.



 "pragma" <pragma_member pathlink.com> wrote in message
 news:cgin45$2djh$1 digitaldaemon.com...
 In article <cgij3k$2bon$1 digitaldaemon.com>, antiAlias says...
The decoding thing has issues as you point out Jill.

That's not the whole story, however. The Mango utf-8 Encoder is 5 to 15
times faster than the Phobos one, while executing the equivalent
algorithm.
The CheckedDecoder is 5 to 15 times faster, whilst the UncheckedDecoder
is
10 to 30 times faster (all  variances are due to alternate mixes of 
char,
wchar, dchar; all timings performed on a P3).
Holy crap, what kind of data are you throwing at it? I don't mean to
criticize,
 but there must be some clever coding on your part or some serious
loopholes in
 the algorithm to get that kind of an improvement. :)

These are rather significant differences. I think it's safe to say that
the
Phobos routines were "not written with efficiency in mind". Either 
that,
or
Mango has some secret means of warping time ...
In the case of the latter, may we rename the "Mango" to "Tardis"? -Pragma [[ EricAnderton at (daleks can't use stairs) yahoo.com ]]
-- Using M2, Opera's revolutionary e-mail client: http://www.opera.com/m2/
Aug 25 2004
prev sibling parent pragma <pragma_member pathlink.com> writes:
In article <cgirdm$2fod$1 digitaldaemon.com>, antiAlias says...
<g> That's funny.
Thanks! That's why I've been doing these goofy sigs lately: if I can make people grin while they type, perhaps tempers won't flare as much in this NG. :)
No loopholes; no clever coding. Just take a look at (for example) what
utf.encode does to the heap. Those order-of-magnitude timings are best-case
for Phobos ~ in a busy server environment they'd be even slower, likely
cause notable heap fragmentation, and persistently lock the heap against
other (much more appropriate) usage by other threads. Imagine multiple
threads doing implicit dchar[] conversions via utf.encode?
Yikes. That's amazing. I only wonder how this might stand up against ICU?
Because there's no supported means of resolving such things, one becomes
inclined to simply 'reimplement' instead of pooling ones skills and
resources to fix Phobos.
A personal favorite of mine:

1st law of engineering: Hit it with a hammer
2nd law of engineering: If law 1 fails, *use a bigger hammer*

... or the more classic idiom: use the right tool for the right job.

One could liken reimplementation to a "programmer's sin" of sorts only if they doom others to continue that replication. Sadly, everybody's lib is currently "in progress" with no end in sight yet (which should change once D's remaining quirks are stamped out). At least Brad has been moderating product additions on dsource to cut down some of the more coarse-grained duplication. However we're still very much in the Beta years of D's life. ;)
This is exactly the kind of thing the DSLG should take care of.
At first I wasn't too keen on the idea myself, but perhaps it could use some more discussion. (I'll post on the other thread). - Pragma [[ EricAnderton at (its gumby dammit) yahoo.com ]]
Aug 25 2004
prev sibling parent reply Arcane Jill <Arcane_member pathlink.com> writes:
In article <opsc7u56zf5a2sq9 digitalmars.com>, Regan Heath says...

True. However, what else should the opCast for char[] to dchar[] do except 
transcode it? What about opCast for int to dchar.. it seems to me there is 
only 1 choice about what to do, anything else would be operator abuse.
Correct me if I'm wrong, but according to the docs, there /is/ no from-to version of opCast(). opCast() remains almost completely useless, despite many suggestions to fix it. But there's no need to opCast() anything. Compiler magic can just call std.toUTFxx() directly (which of course is what you said). If you want to use a different encoder, just do it explicitly. e.g.
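Something like this, perhaps (the alternative encoder at the end is a made-up placeholder, not a real API):

    import std.utf;

    void example()
    {
        dchar[] wide = "Übergröße"d.dup;

        // what the implicit conversion / compiler magic would do for you:
        char[] narrow = std.utf.toUTF8(wide);

        // and if you'd rather use a different encoder (hand-rolled, Mango, ICU...),
        // nothing stops you from calling it explicitly instead:
        // char[] narrow2 = myFancyEncoder.encode(wide);   // placeholder name only
    }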
I don't think transcoding makes any sense unless you're talking about a 
'string' (i.e. char[], wchar[], or dchar[]) as opposed to a UTF-x fragment 
(i.e. char, wchar)
Correct. It doesn't.
As AJ has frequently pointed out a char or wchar does not equal one 
"character" in some cases. Given that and assuming 'c' below is "not a 
whole character", I cannot see how:

dchar d;
char  c = d; //not a whole character

d = c;            //implicit
d = cast(dchar)c; //explicit
The existing behavior is already flawed, but it's not going to change (unless char is ditched), because these are primitive types, and Walter says so. Here's what's wrong:

In the direction dchar -> char, well, you'd /expect/ that to go wrong - it's a narrowing conversion. But casting from char to wchar or dchar will /only/ work if the character is actually ASCII.
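A minimal illustration of the pitfall (the values here are just examples):

    char c = "é"c[0];   // first code unit of the two-byte UTF-8 sequence for 'é', i.e. 0xC3
    dchar d = c;        // compiles, but d is now U+00C3 ('Ã'), not U+00E9 ('é')

    char a = 'A';
    dchar ok = a;       // fine: for ASCII, code unit and code point coincide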
3. a simple copy of the value, creating an invalid? (will it be AJ?) 
utf-32 character.
No value in the range U+0000 to U+00FF is invalid UTF32. But you can't convert a UTF-8-fragment to a character (see example above) and expect the result to be meaningful, except for ASCII.
utf-x code-point/fragment (I don't know the right term)
Unicode uses the term "code unit". I have avoided that term as it's not altogether clear to those unfamiliar with the jargon. I usually say "UTF-8 fragment" on this newsgroup.
 Then there's performance. It's entirely
 possible to write transcoders that are minimally between 5 and 30 times
 faster than the std.utf ones. Some people actually do care about 
 efficiency.
 I'm one of them.
Same here. I believe we all have the same goal, we just have different ideas about how best to get there.
The current implementation of the std.utf functions is not really relevant. /Tomorrow/'s implementation could be way faster than today's. I don't see any reason why Walter couldn't some day replace the implementation with the fastest one on the planet. No code would break.
I think the confusion comes from being used to only 1 string type, for 
C/C++ programmers it's typically and 8bit unsigned type usually containing 
ASCII, for Java a 16bit signed/unsigned? type containing UTF-16?
Mebee, but in C++ I can do this:

The "confusion" in D arises (IMO) because we don't have implicit conversion.

Arcane Jill
Aug 24 2004
parent Regan Heath <regan netwin.co.nz> writes:
On Tue, 24 Aug 2004 12:03:17 +0000 (UTC), Arcane Jill 
<Arcane_member pathlink.com> wrote:

<big snip> Thanks for those explanations.

 The "confusion" in D arises (IMO) because we don't have implicit 
 conversion.
That is my thought also, tho I note you would rather have 1 string type. Let's do a pros/cons list for implicit conversion and one string type, because I am not totally convinced one is better than the other. Let me start (trying to be as objective as possible and not favour 'my' idea):

[implicit conversion]

PROS:
P1 - will cause:
  dchar[] d;
  char[] c = d;
to produce valid utf sequences.
P2 - allows you to write 1 of each string returning function (instead of 3)
P3 - explicit conversion calls not required. eg toUTFxx().

p1: is vital IMO
p2: this means less code replication, and less code in general needs to be written.
p3: could be argued to be 'laziness', I've been called lazy in the past.

CONS:
C1 - transcoding is not FREE and it will happen without obvious indicators that it is happening.
C2 - ppl will not learn the difference between char, wchar, and dchar as quickly.

c1: I would argue it's not as big a deal as it first appears; where it happens you would need a toUTFxx call anyway. In string concatenations some extra transcoding will occur and I have no good solution for that, tho allowing toString to return any of the 3 types would lessen this effect.
c2: might be a 'pro' in disguise; they don't learn the difference because with implicit conversion it doesn't matter.

[one string type]

PROS:
P1 - allows you to write 1 of each string returning function (instead of 3) SAME AS ABOVE

CONS:
C1 - all your string characters are 16bits wide, space is wasted when support for ASCII or any other 8bit encoding is all that is required.

c1: I believe this to be a major 'con' for embedded etc small systems programming.

Please everyone add to this list any/all you can think of, correct any you think I have wrong, or misrepresented.

Regan

--
Using M2, Opera's revolutionary e-mail client: http://www.opera.com/m2/
Aug 24 2004
prev sibling parent reply "Walter" <newshound digitalmars.com> writes:
"antiAlias" <fu bar.com> wrote in message
news:cge5h3$5hv$1 digitaldaemon.com...
 On the one hand, this would be well served by an opCast() and/or
opCast_r()
 on the primitive types; just the kind of thing  suggested in a related
 thread (which talked about overloadable methods for primitive types).

 On the other hand, we're talking transcoding here. Are you gonna' limit
this
 to UTF-8 only?
No, to between char[], wchar[], and dchar[].
 Then, since the source and destination will typically be of
 different sizes, do you then force all casts between these types to have
the
 destination be an array reference rather than an instance? One that is
 always allocated on the fly?
I see it as implicitly calling the conversion function(s) in std.utf.
 If you do implement overloadable primitive-methods (like properties) then,
 will you allow a programmer to override them? So they can make the
opCast()
 do something more specific to their own specific task?

 That's seems like a lot to build into the core of a language.
I don't see adding opCast() for builtin types.
 Personally, I think it's 'borderline' to have so many data types available
 for similar things. If there were an "alias wchar[] string", and the core
 Object supported that via "string toString()", and the IUC library were
 adopted, then I think some of the general confusion would perhaps melt
 somewhat.

 In many respects, too many choices is simply a BadThing (TM). Especially
 when there's precious little solid guidance to help. That guidance might
 come from a decent library that indicates how the types are used, and uses
 one obvious type (string?) consistently. Besides, if IUC were adopted,
less
 people would have to worry about the distinction anyway.

 Believe me when I say that Mango would dearly love to go dchar[] only.
 Actually, it probably will at the higher levels because it makes life
simple
 for everyone. Oh, and I've been accused many times of being an efficiency
 fanatic, especially when it comes to servers. But there's always a
tradeoff
 somewhere. Here, the tradeoff is simplicity-of-use versus quantities of
RAM.
 Which one changes dramatically over time? Hmmmm ... let me see now ...
 64bit-OS for desktops just around corner?

 Even on an embedded device I'd probably go "dchar only" regarding I18N.
 Simply because the quantity of text processed on such devices is very
 limited. Before anyone shoots me over this one, I regularly write code for
 devices with just 4KB RAM ~ still use 16bit chars there when dealing with
 XML input.

 So what am I saying here? Available RAM will always increase in great
leaps.
 Contemplating that the latter should dictate ease-of-use within D is a
 serious breach of logic, IMO. Ease of use, and above all, /consistency/
 should be paramount; if you have the programmer in mind.
I thought so, too, until I built a server app that used all dchar[] internally. Server apps tend to be driven to the limit, and reaching that limit 4x sooner means that the customer has to buy 4x more server hardware. Remember that using 4 bytes per char doesn't just consume more ram, it consumes a LOT more processor cycles managing the extra memory (scanning, copying, initializing, gc marking, etc.).
Aug 24 2004
parent reply "antiAlias" <fu bar.com> writes:
"Walter"  wrote in...
 So what am I saying here? Available RAM will always increase in great
leaps.
 Contemplating that the latter should dictate ease-of-use within D is a
 serious breach of logic, IMO. Ease of use, and above all, /consistency/
 should be paramount; if you have the programmer in mind.
I thought so, too, until I built a server app that used all dchar[] internally. Server apps tend to be driven to the limit, and reaching that limit 4x sooner means that the customer has to buy 4x more server
hardware.
 Remember, that using 4 bytes per char doesn't just consume more ram, it
 consumes a LOT more processor cycles with managing the extra memory.
 (scanning, copying, initializing, gc marking, etc.)
I disagree with that for a number of reasons <g>

a) I was saying that usage of memory should not dictate language ease-of-use. I didn't say that byte[], ubyte[], char[] and wchar[] should all be dumped. If D were dchar[] oriented, rather than char[] oriented, it would arguably make it easier to use for the everyday folks. Those who really care about squeezing bytes can, and should, deal with text encoding and decoding issues. As it is, /everyone/ currently has to deal with those issues at various levels.

b) There's an implication that all server apps are text-bound. That's just not the case, but perhaps I'm being pedantic.

c) People who write servers have (traditionally) been a little more careful about what they do. There are plenty of ways to avoid allocating memory and thrashing the GC, where that's a concern. I do it all the time. In fact, one of the unwritten goals of writing server software is to avoid regularly using malloc/calloc where possible.

d) The predominant modern cpu's all have prefetch built-in, because of the marketing craze for streaming-style applications. This is great news for wide chars! It means that a server can stream dchar[] much more effectively than it could just a few years back. It's the conversions that are arguably a problem.

e) dchar is the natural width of a 32bit processor, so it's not gonna take more processor cycles to process those than 8bit chars. In fact, it's the other way round where UTF-8 is involved. The bottleneck used to be the front-side bus. Not so these days of 1GHz HyperTransport, 800MHz Intel quad-pumped bus, and prefetch everywhere.

So, no. I simply cannot agree that using dchar[] automatically means the customer has to buy 4x more server hardware <g>
Aug 24 2004
next sibling parent reply Sean Kelly <sean f4.ca> writes:
In article <cgg5ko$16ja$1 digitaldaemon.com>, antiAlias says...
a) I was saying that usage of memory should not dictate language
ease-of-use. I didn't say that byte[], ubyte[], char[] and wchar[] should
all be dumped.  If D were dchar[] oriented, rather than char[] oriented, it
would arguably make it easier to use for the everyday folks. Those who
really care about squeezing bytes can, and should, deal with text encoding
and decoding issues. As it is, /everyone/ currently has to deal with those
issues at various levels.
Ideally, an i/o library should be able to handle most conversions invisibly, so the user can work in whatever internal format they want without worrying too much about the external format. doFormat already takes char, wchar, and dchar arguments and outputs UTF-8 or UTF-16 as appropriate, and I designed unFormat to do pretty much the same. I will say, however, that multibyte encoding schemes are generally not very easy to deal with, so internal use of dchars still makes a lot of sense.
b) There's an implication that all server apps are text-bound. That's just
not the case, but perhaps I'm being pedantic.
More often than not they are, especially in this age of XML and such. And for the ones that aren't text-bound, who cares how D deals with strings? :)
e) dchar is the natural width of a 32bit processor, so it's not gonna take
more Processor Cycles to process those than 8bit chars. In fact, it's the
other way round where UTF-8 is involved. The bottleneck used to be the
front-side bus. Not so these days of 1Ghz HyperTransport, 800MHz Intel
quad-pumped bus, and prefetch everywhere.
I think that UTF-8 is still more efficient in terms of memory reads and writes, simply because it tends to take up less space than UTF-16 or UTF-32. The tradeoff is in processing time when the data ventures into the multibyte realm, which is becoming increasingly more common. But I'll have to think some more about whether I would always write text-bound servers using dchars. I'd like to since it would simplify handling XML data, but I'm not particularly keen on those 1GB streams suddenly becoming 4GB streams. That's a decent bit of i/o overhead, especially if I know that little to no data in that stream lives beyond the ASCII charset.

At the end of the day, I think the programmer should be able to choose the appropriate charset for the job. Implicit conversion between char types is a great idea and should clear up most of the confusion. And the UTF-32 version is called "dchar", which implies to me that it's the native character format for D anyway. Perhaps "char" should be renamed to something else?

Sean
Aug 24 2004
parent reply "antiAlias" <fu bar.com> writes:
"Sean Kelly" <sean f4.ca> wrote ...
 I think that UTF-8 is still more efficient in terms of memory reads and
writes,
 simply because it tends to take up less space than UTF-16 or UTF-32.  The
 tradeoff is in processing time when the data ventures into the multibyte
realm,
 which is becoming increasingly more common.  But I'll have to think some
more
 about whether I would always write text-bound servers using dchars.
I'm not saying that they should always be dchar[] :-) I'm saying that trading language ease-of-use against the additional memory usage of dchar[] is an invalid tradeoff. You can always dip down into ubyte[] for those apps that actually care.
 I'd like to
 since it would simplify handling XML data, but I'm not particularly keen
on
 those 1GB streams suddenly becoming 4GB streams.  That's a decent bit of
i/o
 overhead, especially if I know that little to no data in that stream lives
 beyond the ASCII charset.
I'm rather tempted to suggest that a 1GB stream of XML has other problems to contend with <g>
Aug 24 2004
parent Sean Kelly <sean f4.ca> writes:
In article <cggca6$19k9$1 digitaldaemon.com>, antiAlias says...
I'm rather tempted to suggest that a 1GB stream of XML has other problems to
contend with <g>
Its only problem is that XML is a terrible text format. But for one server I wrote it was not impossible that there may be a 1GB (uncompressed) data stream. This was obviously not going to a web browser :) Sean
Aug 24 2004
prev sibling parent reply Regan Heath <regan netwin.co.nz> writes:
On Tue, 24 Aug 2004 12:44:03 -0700, antiAlias <fu bar.com> wrote:
 "Walter"  wrote in...
 So what am I saying here? Available RAM will always increase in great
leaps.
 Contemplating that the latter should dictate ease-of-use within D is a
 serious breach of logic, IMO. Ease of use, and above all, 
/consistency/
 should be paramount; if you have the programmer in mind.
I thought so, too, until I built a server app that used all dchar[] internally. Server apps tend to be driven to the limit, and reaching that limit 4x sooner means that the customer has to buy 4x more server
hardware.
 Remember, that using 4 bytes per char doesn't just consume more ram, it
 consumes a LOT more processor cycles with managing the extra memory.
 (scanning, copying, initializing, gc marking, etc.)
I disagree with that for a number of reasons <g> a) I was saying that usage of memory should not dictate language ease-of-use. I didn't say that byte[], ubyte[], char[] and wchar[] should all be dumped. If D were dchar[] oriented, rather than char[] oriented, it would arguably make it easier to use for the everyday folks. Those who really care about squeezing bytes can, and should, deal with text encoding and decoding issues. As it is, /everyone/ currently has to deal with those issues at various levels.
The exact same thing can be said for implicit transcoding. What I mean is: implicit transcoding will make it easier to use for everyday folks, while those who really care about squeezing bytes still can.

The advantage implicit transcoding has is that everyday folks will likely be dealing with ASCII and will likely use char[], which has an advantage over dchar[] where you're dealing with ASCII. Furthermore, implicit transcoding removes the need to deal with encoding/decoding issues, generally speaking. Those that need to worry about it can and will optimise where the implicit transcoding causes inefficient behaviour.
 b) There's an implication that all server apps are text-bound. That's 
 just
 not the case, but perhaps I'm being pedantic.
It depends on the server. The mail server I work on is a candidate to be text-bound; in fact it is disk-bound, meaning we cannot write our email text out to disk as fast as we can receive it (tcp/ip) and process it (transcoding etc).
 c) People who write servers have (traditionally) been a little more 
 careful
 about what they do. There are plenty of ways to avoid allocating memory 
 and
 thrashing the GC, where that's a concern. I do it all the time. In fact, 
 one
 of the unwritten goals of writing server software is to avoid regularly
 using malloc/calloc where possible.
Definitely. Having a UTF-8 char type which you can implicitly convert to a more convenient format temporarily (dchar[], UTF-32) simply makes this easier IMO.
 d) The predominant modern cpu's all have prefetch built-in, because of 
 the
 marketing craze for streaming-style application. This is great news for 
 wide
 chars! It means that a server can stream dchar[] much more effectively 
 than
 it could just a few years back. It's the conversions that are arguably a
 problem.
If we're talking streaming as in streaming to disk or tcp/ip etc, I would argue that the time it takes to transcode is much less than the time it takes to write/send.
 e) dchar is the natural width of a 32bit processor, so it's not gonna 
 take
 more Processor Cycles to process those than 8bit chars. In fact, it's the
 other way round where UTF-8 is involved. The bottleneck used to be the
 front-side bus. Not so these days of 1Ghz HyperTransport, 800MHz Intel
 quad-pumped bus, and prefetch everywhere.

 So, no. I simply cannot agree that using dchar[] automatically means the
 customer has to buy 4x more server hardware <g>
All this arguing about what is more efficient is IMO totally pointless; the types of application vary so much that for one application/situation one method will be best and for another the other will be. D's goal is not to be specialised for any one style or application, and as such 3 char types makes sense, doesn't it?

Regardless, the only way to settle the performance argument is to benchmark something, therefore... in what situations do you believe using UTF-32 dchar throughout the application will be faster than using all 3 types and implicit transcoding? Consider:

- the input may be in any encoding
- the output may be in any encoding
- it may need to store large amounts of the input in memory

..can you think of any more?

Regan

--
Using M2, Opera's revolutionary e-mail client: http://www.opera.com/m2/
Aug 24 2004
parent reply "antiAlias" <fu bar.com> writes:
Regan; I appeal to you to try and read things in context. I'm not even
vaguely interested in getting into a pissing contest with you, so please,
try and follow this (and correlate with the text below if you have to):

a) I say available-memory will always increase in great leaps, so using that
as a design guide vis-a-vis a computer language doesn't make sense to me.

b) Walter says he used to think so too, until he built a wide-char-only
server; and points out that wide-chars can force the customer into
purchasing much more hardware due to memory consumption and additional CPU
usage.

c) I disagree with that position, and try to illustrate why I don't think
wide-chars are the demon they might once have been considered. And that
perhaps they get a 'bad rap' for the wrong reasons.

What you added here seems intended to fan some imaginary flames, or to be
argumentative purely for the sake of it, rather than to make any cohesive
point. In fact, four out of the five items you managed to completely
misconstrue. That may be my failing in terms of language use, so I'll accept
the consequences. I will not, however, bite.

Good-day, my friend  :-)


"Regan Heath" <regan netwin.co.nz> wrote in message
news:opsc9o30tt5a2sq9 digitalmars.com...
 On Tue, 24 Aug 2004 12:44:03 -0700, antiAlias <fu bar.com> wrote:
 "Walter"  wrote in...
 So what am I saying here? Available RAM will always increase in great
leaps.
 Contemplating that the latter should dictate ease-of-use within D is
a
 serious breach of logic, IMO. Ease of use, and above all,
/consistency/
 should be paramount; if you have the programmer in mind.
I thought so, too, until I built a server app that used all dchar[] internally. Server apps tend to be driven to the limit, and reaching that limit 4x sooner means that the customer has to buy 4x more server
hardware.
 Remember, that using 4 bytes per char doesn't just consume more ram, it
 consumes a LOT more processor cycles with managing the extra memory.
 (scanning, copying, initializing, gc marking, etc.)
I disagree with that for a number of reasons <g> a) I was saying that usage of memory should not dictate language ease-of-use. I didn't say that byte[], ubyte[], char[] and wchar[]
should
 all be dumped.  If D were dchar[] oriented, rather than char[] oriented,
 it
 would arguably make it easier to use for the everyday folks. Those who
 really care about squeezing bytes can, and should, deal with text
 encoding
 and decoding issues. As it is, /everyone/ currently has to deal with
 those
 issues at various levels.
The exact same thing can be said for implicit transcoding. What I mean is... Implicit transcoding will make it easier to use for everyday folks. Those who really care about squeezing bytes can. The advantage implicit transcoding has is everyday folks will likely be dealing with ASCII and will likely use char[] which has an advantage over dchar[] where you're dealing with ASCII. Furthermore implicit transcoding removes the need to deal with encoding/decoding issue, generally speaking. Those that need to worry about it, can and will optimise where the implicit transcoding causes in-efficient behaviour.
 b) There's an implication that all server apps are text-bound. That's
 just
 not the case, but perhaps I'm being pedantic.
It depends on the server. The mail server I work on is a candidate to be text-bound, in fact it is, it's disk bound, meaning, we cannot write our email text out to disk as fast as we can receive it (tcp/ip), and process it (transcoding etc).
 c) People who write servers have (traditionally) been a little more
 careful
 about what they do. There are plenty of ways to avoid allocating memory
 and
 thrashing the GC, where that's a concern. I do it all the time. In fact,
 one
 of the unwritten goals of writing server software is to avoid regularly
 using malloc/calloc where possible.
Definately. having a UTF-8 char type which you can implicitly convert to a more convenient format temporarily (dchar[], utf-32) simply makes this easier IMO.
 d) The predominant modern cpu's all have prefetch built-in, because of
 the
 marketing craze for streaming-style application. This is great news for
 wide
 chars! It means that a server can stream dchar[] much more effectively
 than
 it could just a few years back. It's the conversions that are arguably a
 problem.
If we're talking streaming as in streaming to disk or tcp/ip etc, I would argue that the time it takes to transcode is much less than the time it takes to write/send.
 e) dchar is the natural width of a 32bit processor, so it's not gonna
 take
 more Processor Cycles to process those than 8bit chars. In fact, it's
the
 other way round where UTF-8 is involved. The bottleneck used to be the
 front-side bus. Not so these days of 1Ghz HyperTransport, 800MHz Intel
 quad-pumped bus, and prefetch everywhere.

 So, no. I simply cannot agree that using dchar[] automatically means the
 customer has to buy 4x more server hardware <g>
All this arguing about what is more efficient is IMO totally pointless, the types of application vary so much that for one application/situation one method will be best and for another the other method will be. D's goal is not to be specialised for any one style or application, as such 3 char types makes sense, doesn't it? Regardless the only way to settle the performance argument is to benchmark something, therefore... In what situations do you believe using UTF-32 dchar throughout the application will be faster than using all 3 types and implicit transcoding.. Consider: - the input may be in any encoding - the output may be in any encoding - it may need to store large amounts of the input in memory ..can you think of any more? Regan -- Using M2, Opera's revolutionary e-mail client: http://www.opera.com/m2/
Aug 24 2004
parent Regan Heath <regan netwin.co.nz> writes:
On Tue, 24 Aug 2004 22:08:39 -0700, antiAlias <fu bar.com> wrote:

<snip>

 What you added here seems intended to fan some imaginary flames, or to be
 argumentative purely for the sake of it, rather than to make any cohesive
 point. In fact, four out of the five items you managed to completely
 misconstrue. That may be my failing in terms of language use, so I'll 
 accept the consequences. I will not, however, bite.
I'm sorry to have come across that way. I was simply trying to add my point of view. If I have misunderstood your comments, sorry - I don't get it right all the time (despite what I might think).

Your (mis/understood/guided/ing) friend,
Regan

--
Using M2, Opera's revolutionary e-mail client: http://www.opera.com/m2/
Aug 24 2004