
D - Unicode Character and String Intrinsics

reply Mark Evans <Mark_member pathlink.com> writes:
Walter says (in response to my post)...
 D needs a Unicode string primitive.
It does already. In D, a char[] is really a utf-8 array.
I'm dubious about this claim. ANSI C char arrays are UTF-8 too, if the contents are 7-bit ASCII (a subset of UTF-8). That doesn't mean they support UTF-8. UTF-8 is on D's very own 'to-do' list: http://www.digitalmars.com/d/future.html

UTF-8 has a maximum encoding length of 6 bytes for one character. If such a character appears at index 100 in char[] myString, what is the return value from myString[100]? The answer should be "one UTF-8 char with an internal 6-byte representation." I don't think D does that.

Besides which, my idea was a native string primitive, not a quasi-array. The confusion of strings with arrays was a basic, fundamental mistake of C. While some string semantics do resemble those of arrays, this resemblance should not mandate identical data types. Strings are important enough to merit their own intrinsic type. Icon is not the only language to recognize that fact. D documents make no mention of any string primitive: http://www.digitalmars.com/d/type.html

D has two intrinsic character types, a dynamic array type, and _no_ intrinsic string type. Characters should be defined as UTF-8 or UTF-16 or UTF-32, not "short" and "wide." The differing cross-platform widths of the 'wide' char are asking for trouble; poof goes data portability. D characters are not based on Unicode, but on the archaic MS Windows API and legacy C terminology spot-welded onto Linux. How about Unicode as a basis?

The ideal type system would offer as intrinsic/primitive/native language types:
- UTF-8 char
- UTF-16 char
- UTF-32 char
- UTF-8 string
- UTF-16 string
- UTF-32 string
- built-in conversions between all of the above (e.g. UTF-8 to UTF-16)
- built-in conversions to/from UTF strings and C-style byte arrays

The preceding list will not seem very long when you consider how many numeric types D supports. Strings are as important as numbers.

The old C 'char' type is merely a byte; D already has 'ubyte.' The distinction between ubyte and char in D escapes me. Maybe the reasoning is that a char might be 'wide' so D needs a separate type? But that reason disappears once you have nice UTF characters. So even if the list is a bit long, it also eliminates two redundant types, char and wchar. I would not be against retention of char and char[] for C compatibility purposes if someone could point out why 'ubyte' and 'char[]' do not suffice. Otherwise I would just alias 'char' into 'ubyte' and be done with it. The wchar could be stored inside a UTF-16 or UTF-32 char, or be declared as a struct.

To the user, strings would act like dynamic arrays. Internally they are different animals. Each 'element' of the 'array' can have varying length per Unicode specifications. String primitives would hide Unicode complexity under the hood.

That's just the beginning. Now that you have string intrinsics, you can give them special behaviors pertaining to i/o streams and such. You can define 'streaming' conversions from other intrinsic types to strings for i/o purposes. And...permit me to dream!...you can define Icon-style string scanning expressions.

Mark
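As a concrete illustration of the indexing question, here is a minimal sketch in present-day D, using std.utf (which postdates this thread); the sample string and printed values are assumptions for illustration:

    import std.stdio;
    import std.utf : decode;

    void main()
    {
        string s = "héllo";         // UTF-8: 'é' occupies two bytes
        writeln(s.length);          // 6 -- counts code units (bytes), not characters
        writeln(cast(ubyte) s[1]);  // 195 (0xC3) -- first byte of 'é', not a character
        size_t i = 1;
        dchar c = decode(s, i);     // assembles the whole code point at byte index 1
        writeln(c);                 // é
        writeln(i);                 // 3 -- decode advanced past both bytes of 'é'
    }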
Mar 31 2003
next sibling parent Mark Evans <Mark_member pathlink.com> writes:
if someone could point out why 'ubyte' and 'char[]' do not suffice.
Typo: that was "why 'ubyte' and 'ubyte[]' do not suffice." - Mark
Mar 31 2003
prev sibling next sibling parent reply "Walter" <walter digitalmars.com> writes:
"Mark Evans" <Mark_member pathlink.com> wrote in message
news:b6abjh$12m8$1 digitaldaemon.com...
 Walter says (in response to my post)...
 D needs a Unicode string primitive.
It does already. In D, a char[] is really a utf-8 array.
I'm dubious about this claim. ANSI C char arrays are UTF-8 too, if the contents
 are 7-bit ASCII (a subset of UTF-8).  That doesn't mean they support UTF-8.
 UTF-8 is on D's very own 'to-do' list:
 http://www.digitalmars.com/d/future.html
It is incompletely implemented, sure.
UTF-8 has a maximum encoding length of 6 bytes for one character.  If such a
 character appears at index 100 in char[] myString, what is the return value
 from myString[100]?  The answer should be "one UTF-8 char with an internal
 6-byte representation."  I don't think D does that.
No, it doesn't do that. Sometimes you want the byte, sometimes the assembled unicode char.
Besides which, my idea was a native string primitive, not a quasi-array.  The
 confusion of strings with arrays was a basic, fundamental mistake of C.  While
 some string semantics do resemble those of arrays, this resemblance should not
 mandate identical data types.  Strings are important enough to merit their own
 intrinsic type.  Icon is not the only language to recognize that fact.  D
 documents make no mention of any string primitive:
 http://www.digitalmars.com/d/type.html
 D has two intrinsic character types, a dynamic array type, and _no_ intrinsic
 string type.
D does have an intrinsic string literal.
Characters should be defined as UTF-8 or UTF-16 or UTF-32, not "short" and
 "wide."  The differing cross-platform widths of the 'wide' char are asking for
 trouble; poof goes data portability.  D characters are not based on Unicode,
 but on the archaic MS Windows API and legacy C terminology spot-welded onto
 Linux.  How about Unicode as a basis?
Actually, this has changed. Wide chars are now fixed at 16 bits, i.e. UTF-16. For UTF-32, just use uint's.
 The ideal type system would offer as intrinsic/primitive/native language types:
 - UTF-8 char
 - UTF-16 char
 - UTF-32 char
 - UTF-8 string
 - UTF-16 string
 - UTF-32 string
 - built-in conversions between all of the above (e.g. UTF-8 to UTF-16)
 - built-in conversions to/from UTF strings and C-style byte arrays

 The preceding list will not seem very long when you consider how many numeric
 types D supports.  Strings are as important as numbers.
That's actually pretty close to what D supports.
The old C 'char' type is merely a byte; D already has 'ubyte.'  The
 distinction between ubyte and char in D escapes me.  Maybe the reasoning is
 that a char might be 'wide' so D needs a separate type?  But that reason
 disappears once you have nice UTF characters.  So even if the list is a bit
 long, it also eliminates two redundant types, char and wchar.
The distinction is that char is UTF-8, and byte is, well, just a byte. The distinction comes in handy when dealing with overloaded functions.
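A small sketch of the overloading point, in later D, with hypothetical handle functions standing in for real APIs:

    import std.stdio;

    void handle(ubyte[] data) { writeln("raw bytes"); }  // binary payload
    void handle(char[] text)  { writeln("UTF-8 text"); } // textual payload

    void main()
    {
        ubyte[] data = [0xDE, 0xAD];
        handle(data);          // resolves to the ubyte[] overload
        handle("hello".dup);   // .dup yields char[]; resolves to the text overload
    }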
 I would not be against retention of char and char[] for C compatibility
 purposes if someone could point out why 'ubyte' and 'char[]' do not suffice.
Function overloading.
 Otherwise I would just alias 'char' into 'ubyte' and be done with it.  The
 wchar could be stored inside a UTF-16 or UTF-32 char, or be declared as a
 struct.

 To the user, strings would act like dynamic arrays.  Internally they are
 different animals.  Each 'element' of the 'array' can have varying length per
 Unicode specifications.  String primitives would hide Unicode complexity under
 the hood.

 That's just the beginning.  Now that you have string intrinsics, you can give
 them special behaviors pertaining to i/o streams and such.  You can define
 'streaming' conversions from other intrinsic types to strings for i/o
 purposes.  And...permit me to dream!...you can define Icon-style string
 scanning expressions.

 Mark
Mar 31 2003
parent reply Mark Evans <Mark_member pathlink.com> writes:
The answer should be "one UTF-8 char with an internal 6-byte representation."
No, it doesn't do that. Sometimes you want the byte, sometimes the assembled unicode char.
But the only use for raw bytes is precisely such low-level format conversions as are proposed to go under the hood. String usage involves character analysis, not bit shuffling. There is a place for getting raw bytes, but a string subscript is not it. Maybe a typecast to ubyte[], and then an array subscript. The whole point of built-in Unicode support is to let users avoid dealing with bytes and let them deal with characters instead.
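A minimal sketch of that typecast in later D syntax; the sample string is an assumption for illustration:

    void main()
    {
        char[] s = "héllo".dup;
        ubyte[] raw = cast(ubyte[]) s; // explicit byte view of the same storage
        assert(raw.length == 6);       // byte count, not character count
        assert(raw[1] == 0xC3);        // first byte of 'é'; the cast makes the intent visible
    }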
D does have an intrinsic string literal.
But it's not Unicode, just char or wchar. Those are both fixed byte-width, but all Unicode chars, except UTF-32, are variable byte-width.
Wide chars are now fixed at 16 bits, i.e. UTF-16.
Ditto. Wide chars are not UTF-16 chars since they are fixed width. UTF-16 characters can be 16 or 32 bits wide. (UTF-8 characters can be anywhere from 1 byte to 6 bytes wide.)
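A sketch of the surrogate-pair case in later D; std.utf.encode and the chosen code point (MUSICAL SYMBOL G CLEF, U+1D11E) are illustrative assumptions:

    import std.stdio;
    import std.utf : encode;

    void main()
    {
        wchar[2] buf;
        size_t n = encode(buf, cast(dchar) 0x1D11E); // MUSICAL SYMBOL G CLEF
        writeln(n);                                  // 2 -- needs a surrogate pair
        writefln("%04X %04X", cast(ushort) buf[0], cast(ushort) buf[1]); // D834 DD1E
    }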
 For UTF-32, just use uint's.
Possible, but see my final point.
That's actually pretty close to what D supports.
I don't see anything close. (a) There is no Unicode string primitive (char[] is not a string primitive, let alone Unicode; it's an array type). (b) There are no Unicode characters. There are merely types with similar 'average' sizes being touted as Unicode capable (they are not).
 if someone could point out why 'ubyte' and 'ubyte[]' do not suffice.
Function overloading.
This comment is a logical contradiction with prior remarks. If the distinction between ubyte and char matters for this reason, then the same reason makes a difference between uint and UTF-32. But in the latter case you say to just use uint. You can't have it both ways. Thanks for taking all our thoughts into consideration. Mark
Mar 31 2003
next sibling parent reply "Walter" <walter digitalmars.com> writes:
"Mark Evans" <Mark_member pathlink.com> wrote in message
news:b6an4p$1bpj$1 digitaldaemon.com...
 The whole point of built-in Unicode support is to let users avoid
 dealing with bytes and let them deal with characters instead.
That's only partially true - the downside is that for high performance you'll need byte indices, not UTF character strides. There is no getting away from the variable byte encoding. In my (limited) experience with string processing and UTF-8, rarely is it necessary to decode it. Most manipulation is done with indices.
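A sketch of what index-based manipulation without decoding looks like, in later D; it relies on the fact that UTF-8 continuation bytes always have the high bit set, so a byte-level scan for an ASCII delimiter is safe (the sample string is an assumption):

    import std.stdio;

    void main()
    {
        string s = "naïve,text"; // 'ï' takes two bytes in UTF-8
        size_t i = 0;
        while (i < s.length && s[i] != ',')
            ++i;                 // byte scan; ',' never matches a UTF-8 continuation byte
        writeln(i);              // 6 -- a byte index, not a character index
        writeln(s[0 .. i]);      // "naïve" -- slicing on character boundaries is safe
    }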
D does have an intrinsic string literal.
But it's not Unicode, just char or wchar. Those are both fixed byte-width, but all Unicode chars, except UTF-32, are variable byte-width.
No, in D, the intrinsic string literal is not just char or wchar. It's a unicode string - its internal format is not fixed until semantic processing, when it is adjusted to be UTF-8, -16, or -32 as needed.
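For what it's worth, this is how the idea surfaced in later D, where one literal serves all three encodings (the string/wstring/dstring aliases are later additions, shown here as a sketch):

    void main()
    {
        string  a = "hello"; // chosen as UTF-8  (immutable(char)[])
        wstring b = "hello"; // chosen as UTF-16 (immutable(wchar)[])
        dstring c = "hello"; // chosen as UTF-32 (immutable(dchar)[])
        assert(a.length == 5 && b.length == 5 && c.length == 5);
    }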
Wide chars are now fixed at 16 bits, i.e. UTF-16.
Ditto. Wide chars are not UTF-16 chars since they are fixed width.
What I meant is they do not change size from implementation to implementation. They are 16 bits, and line up with the UTF-16 API's of Win32.
 UTF-16 characters can be 16 or 32 bits wide. (UTF-8 characters can be
 anywhere from 1 byte to 6 bytes wide.)
Yes.
 For UTF-32, just use uint's.
Possible, but see my final point.
That's actually pretty close to what D supports.
I don't see anything close. (a) There is no Unicode string primitive (char[] is not a string primitive, let alone Unicode; it's an array type).
I think that's a matter of perspective.
 (b) There are no Unicode characters. There are merely types
 with similar 'average' sizes being touted as Unicode capable (they are
 not).
I believe they are unicode capable. Now, I have not written the I/O routines so they will print as unicode, and there are other gaps in the implementation, but the core concept is there.
 if someone could point out why 'ubyte' and 'ubyte[]' do not suffice.
Function overloading.
This comment is a logical contradiction with prior remarks. If the distinction between ubyte and char matters for this reason, then the same reason makes a difference between uint and UTF-32. But in the latter case you say to just use uint. You can't have it both ways.
I think uint[] will well serve for UTF-32 because there is no need to be concerned about multiword encoding.
 Thanks for taking all our thoughts into consideration.
You're welcome.
Mar 31 2003
next sibling parent Mark Evans <Mark_member pathlink.com> writes:
 The whole point of built-in Unicode support is to let users avoid
 dealing with bytes and let them deal with characters instead.
That's only partially true - the downside is that for high performance you'll need byte indices, not UTF character strides. There is no getting away from the variable byte encoding.
If I understand correctly, the translation is that it's better to let end users process bytes, so they can waste hours <g> tuning inner loops, than to offer language support, with pre-tuned inner loops. I don't see that. In fact native language support is better from a performance perspective (both in time of execution and in time of development).
 In my (limited) experience with string processing and UTF-8, rarely
 is it necessary to decode it. Most manipulation is done with
 indices.
Manipulation is done with indices in C, because that is all C offers. It's one of the big problems with C vis-a-vis Unicode.
 in D, the intrinsic string literal is not just char or wchar. It's a
 unicode string - its internal format is not fixed until semantic
 processing, when it is adjusted to be UTF-8, -16, or -32 as needed.
I think your definition of "Unicode" is basically wrong. What you are calling UTF-8 and UTF-16 is really just fixed-width slots that the user must conglomerate, not true native Unicode characters. So we are talking past each other. For example when you say "internal format" I don't suppose you have in mind that 6-byte-wide UTF-8 character I mentioned. When I say Unicode character, I mean an object that the language recognizes, intrinsically, as a variable-byte-width object, but which it presents to the user as an integrated (opaque) whole. I do not mean a user-defined conglomeration of fixed-width fields. That seems to be your working definition and it does not satisfy me.
Wide chars are now fixed at 16 bits, i.e. UTF-16.
Ditto. Wide chars are not UTF-16 chars since they are fixed width.
What I meant is they do not change size from implementation to implementation.
That's what I understood you to mean; and that much is good, as far as it goes, but doesn't address Unicode.
 They are 16 bits, and line up with the UTF-16 API's of Win32.
If Windows supports full UTF-16, then D does not support UTF-16 API's of Win32 with any native data type. The user still faces the same labor (more or less) as supporting Unicode in ANSI C.
 I think that's a matter of perspective.... I believe they are
 unicode capable. Now, I have not written the I/O routines so they
 will print as unicode, and there are other gaps in the
 implementation, but the core concept is there.
I've tried to explain why there is no Unicode character in D, and on that basis alone, I could say there is no Unicode string in D. The syntax and semantics of char[] are identical across all types of arrays, not limited to strings. (What syntax or semantics are unique to strings?)

End users can create and manipulate almost any data structure -- any collection of bits -- in D, or for that matter C, or assembly language, or even machine language. What I'm talking about is intrinsic language support to save the labor (and mistakes). I could build Unicode strings with a Turing machine if I wanted to. That's not "language support" in my book.

Saying that we already have 8-bit things, and 16-bit things, and 32-bit things, and that users can do Unicode by combining these things in various ways, is not a reasonable argument that the language supports Unicode. At best one might say, D does not prevent users from implementing Unicode, if they want to take the extra trouble.
 if someone could point out why 'ubyte' and 'ubyte[]' do not suffice.
Function overloading.
This comment is a logical contradiction with prior remarks. If the distinction between ubyte and char matters for this reason, then the same reason makes a difference between uint and UTF-32. But in the latter case you say to just use uint. You can't have it both ways.
I think uint[] will well serve for UTF-32 because there is no need to be concerned about multiword encoding.
Then you are ignoring your own argument about function overloading! :-) Mark
Mar 31 2003
prev sibling next sibling parent "Sean L. Palmer" <palmer.sean verizon.net> writes:
"Walter" <walter digitalmars.com> wrote in message
news:b6aoge$1cnp$1 digitaldaemon.com...
 if someone could point out why 'ubyte' and 'ubyte[]' do not suffice.
Function overloading.
This comment is a logical contradiction with prior remarks. If the distinction between ubyte and char matters for this reason, then the same reason makes a difference between uint and UTF-32. But in the latter case you say to just use uint. You can't have it both ways.
I think uint[] will well serve for UTF-32 because there is no need to be concerned about multiword encoding.
But there is still reason to have a separate type, for function overloading. Otherwise, how shall we print a Unicode character higher than position 0xFFFF? Perhaps the basic char type would actually be 32 bits and capable of holding any Unicode character? And when used in array form, char[] would transmogrify into UTF-8? Would we then even need wchar?

Obviously this Unicode thing is a whole can of worms. Too bad we can't get everyone to forget about enough characters that they all fit in 16 bits! ;)

Sean
Apr 01 2003
prev sibling parent reply Ilya Minkov <midiclub 8ung.at> writes:
Walter wrote:
 That's only partially true - the downside comes from needing high
 performance you'll need byte indices, not UTF character strides. There is no
 getting away from the variable byte encoding. In my (limited) experience
 with string processing and UTF-8, rarely is it necessary to decode it. Most
 manipulation is done with indices.
Wait... won't language-supported iterators fix the need for accessing the underlying array indices directly? I *definitely* don't want to know anything about the underlying format, which can be really anything - UTF-8/16/32, or even an aggregate of 2 arrays like i or Mark have proposed.

Walter, you also don't: look what i found in this newsgroup. :) And you claim it to be better to work with pointers into a char[], pretending it was a UTF-8 string!!!

--- 8< ---
At one time I had written a lexer that handled utf-8 source. It turned out to cause a lot of problems because strings could no longer be simply indexed by character position, nor could pointers be arbitrarily incremented and decremented. It turned out to be a lot of trouble and I finally converted it to wchar's.
--- >8 ---

BTW, as to the possibilities that Mark wishes for himself, i've dug his message up, which was posted as i wasn't around yet. Here.

--- 8< ---
Short summaries here:
http://www.nmt.edu/tcc/help/lang/icon/positions.html
http://www.nmt.edu/tcc/help/lang/icon/substring.html
http://www.cs.arizona.edu/icon/docs/ipd266.htm
http://www.toolsofcomputing.com/IconHandbook/IconHandbook.pdf
Sections 6.2 and following.

Icon is simply unsurpassed in string processing and is for that reason famous among linguists. There is more to the string processing than just character position indices. Icon supports special clauses called "string scanning environments" which work like file i/o in a vague analogy. (See third link above, section 3.) Icon also has nice built-in structures like sets (*character sets* turn out to be insanely useful), hash tables, and lists. Somehow Icon never made it to the Big Leagues and that is a shame. It deserves to be up there with Perl. Icon is wicked fast when written correctly. The Unicon project is the next-generation Icon, and has added objects and other modern features to base Icon. It is on SourceForge.

(There was only one project in which I recall desiring a new Icon built-in. I wanted a two-way hash table which could index off of either data column. The workaround was to implement two mutually mirroring one-way hash tables.)

Icon has a very interesting 'success/failure' paradigm which might also be something to study, esp. in light of D's contract emphasis. The unique 'goal-directed' paradigm is quite interesting but may have no application to D.

I have for a very long time desired Icon's string scanning capabilities in my C/C++ programs. Even with std::string or string classes from various class libraries (I've used them all), there is just no comparison with Icon. I would become a total D convert if it could do strings like Icon.

Mark
http://www.cs.arizona.edu/icon/
http://unicon.sourceforge.net/index.html
--- >8 ---

-i.
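Later D did grow exactly this kind of iterator: foreach can decode a char[] into dchar code points, hiding the byte layout. A minimal sketch (none of this existed in 2003):

    import std.stdio;

    void main()
    {
        string s = "dé";
        foreach (dchar c; s)       // the loop decodes; bytes never show through
            writefln("U+%04X", c); // U+0064, U+00E9
    }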
Apr 10 2003
parent Helmut Leitner <leitner hls.via.at> writes:
Ilya Minkov wrote:
 I have for a very long time desired Icon's string scanning capabilities
 in my C/C++ programs.  Even with std::string or string classes from
 various class libraries (I've used them all), there is just no
 comparison with Icon.  I would become a total D convert if it could do
 strings like Icon.
Being used to Perl, I think that the current D regex module has to be extended. In what way does Icon differ (or have advantages) in string processing compared to Perl? -- Helmut Leitner leitner hls.via.at Graz, Austria www.hls-software.com
Apr 11 2003
prev sibling parent "Matthew Wilson" <dmd synesis.com.au> writes:
 This comment is a logical contradiction with prior remarks. If the
 distinction between ubyte and char matters for this reason, then the
 same reason makes a difference between uint and UTF-32. But in the
 latter case you say to just use uint. You can't have it both ways.
Agree. Let's have more char types
Mar 31 2003
prev sibling next sibling parent reply Mark Evans <Mark_member pathlink.com> writes:
Hi again Bill

After your 'meta-programming' talk I shudder to think what your idea
of a maximalist is...maybe a computer that writes a compiler that
generates source code for a computer motherboard design program to
construct another computer that...

Under my scheme we gain 3 character types and drop 2: net gain 1. We
gain 3 string types and drop 1: net gain 2. Total net gain, 3 types.
What does that buy us? Complete internationalization of D, complete
freedom from ugly C string idioms, data portability across platforms,
ease of interfacing with Win32 APIs and other software languages.

The idea of "just one" Unicode type holds little water. Why don't you
make the same argument about numeric types, of which we have some
twenty-odd? Or how about if D offered just one data type, the bit, and
let you construct everything else from that? If D does Unicode then D
should do it right. It's a poor, asymmetric design to have some
Unicode built-in and the rest tacked on as library routines.

Mark


 This is a rare occasion when I agree with Mark. The fact that a
 minimalist like me, and a maximalist like Mark, and a pragmatist
 like yourself seem to agree is something Walter should consider. I
 would want to hold built-in string support to just UTF-8. D could
 offer some support for the other formats through conversion routines
 in a standard library. Having a single string format would surely be
 simpler than supporting them all. Bill
Mar 31 2003
next sibling parent reply "Matthew Wilson" <dmd synesis.com.au> writes:
I'm sold. Where can I sign up?

I presume you'll be working on the libraries ... ;)

To suck up: I've been faffing around with this issue for years, and have
been (unjustifiably, in my opinion) called on numerous times to expertly
opine on it for clients. (My expertise is limited to the C/C++
char/wchar_t/horrid-TCHAR type stuff, which I'm well aware is not the full
picture.) Your discussion here is the first time I even get a hint that I'm
listening to someone that knows what they're talking about. It's nasty,
nasty stuff, and I hope that your promise can bear fruit for D. If it can,
then it'll earn massive brownie points for D over its peer languages.
There's a big market out there of people whose character sets don't fall
into 7-bits ...



"Mark Evans" <Mark_member pathlink.com> wrote in message
news:b6al79$1ahd$1 digitaldaemon.com...
 Hi again Bill

 After your 'meta-programming' talk I shudder to think what your idea
 of a maximalist is...maybe a computer that writes a compiler that
 generates source code for a computer motherboard design program to
 construct another computer that...

 Under my scheme we gain 3 character types and drop 2: net gain 1. We
 gain 3 string types and drop 1: net gain 2. Total net gain, 3 types.
 What does that buy us? Complete internationalization of D, complete
 freedom from ugly C string idioms, data portability across platforms,
 ease of interfacing with Win32 APIs and other software languages.

 The idea of "just one" Unicode type holds little water. Why don't you
 make the same argument about numeric types, of which we have some
 twenty-odd? Or how about if D offered just one data type, the bit, and
 let you construct everything else from that? If D does Unicode then D
 should do it right. It's a poor, asymmetric design to have some
 Unicode built-in and the rest tacked on as library routines.

 Mark


 This is a rare occasion when I agree with Mark. The fact that a
 minimalist like me, and a maximalist like Mark, and a pragmatist
 like yourself seem to agree is something Walter should consider. I
 would want to hold built-in string support to just UTF-8. D could
 offer some support for the other formats through conversion routines
 in a standard library. Having a single string format would surely be
 simpler than supporting them all. Bill
Mar 31 2003
parent reply "Peter Hercek" <vvp no.post.spam.sk> writes:
Well, I went through character and code page problems too about a year
 ago. Very bad experience in C/C++ ... (I'm from a place where 7 bits
 is not enough). I have two points about this; a sketch follows point 2:
1) D should support characters and not bytes (8 bits) or words (16 bits);
 when I'm indexing a string I do so by characters and not by a byte multiple;
 if I wanted to index by e.g. bytes I would ask for the string's byte length
 and cast to a byte array
2) Support for 3 character types (UTF8, UTF16, UTF32) is handy, but
 not critical (can be solved by conversion functions); actually for one
 character only, UTF32 has the shortest representation; it may also be
 interesting not to specify the exact encoding for a string (as opposed to
 an encoding for a character) - let the compiler decide what the best
 representation is (maybe some optimization can be achieved based on this
 later; e.g. the compiler can decide to store strings in partially balanced
 trees like STLPort does for ropes, but with possibly different encodings
 for different nodes ... whatever, just writing down my thoughts)


"Matthew Wilson" <dmd synesis.com.au> wrote in message
news:b6aq84$1dn4$1 digitaldaemon.com...
 I'm sold. Where can I sign up?

 I presume you'll be working on the libraries ... ;)

 To suck up: I've been faffing around with this issue for years, and have
 been (unjustifiably, in my opinion) called on numerous times to expertly
 opine on it for clients. (My expertise is limited to the C/C++
 char/wchar_t/horrid-TCHAR type stuff, which I'm well aware is not the full
 picture.) Your discussion here is the first time I even get a hint that I'm
 listening to someone that knows what they're talking about. It's nasty,
 nasty stuff, and I hope that your promise can bear fruit for D. If it can,
 then it'll earn massive brownie points for D over its peer languages.
 There's a big market out there of people whose character sets don't fall
 into 7-bits ...



 "Mark Evans" <Mark_member pathlink.com> wrote in message
 news:b6al79$1ahd$1 digitaldaemon.com...
 Hi again Bill

 After your 'meta-programming' talk I shudder to think what your idea
 of a maximalist is...maybe a computer that writes a compiler that
 generates source code for a computer motherboard design program to
 construct another computer that...

 Under my scheme we gain 3 character types and drop 2: net gain 1. We
 gain 3 string types and drop 1: net gain 2. Total net gain, 3 types.
 What does that buy us? Complete internationalization of D, complete
 freedom from ugly C string idioms, data portability across platforms,
 ease of interfacing with Win32 APIs and other software languages.

 The idea of "just one" Unicode type holds little water. Why don't you
 make the same argument about numeric types, of which we have some
 twenty-odd? Or how about if D offered just one data type, the bit, and
 let you construct everything else from that? If D does Unicode then D
 should do it right. It's a poor, asymmetric design to have some
 Unicode built-in and the rest tacked on as library routines.

 Mark


 This is a rare occasion when I agree with Mark. The fact that a
 minimalist like me, and a maximalist like Mark, and a pragmatist
 like yourself seem to agree is something Walter should consider. I
 would want to hold built-in string support to just UTF-8. D could
 offer some support for the other formats through conversion routines
 in a standard library. Having a single string format would surely be
 simpler than supporting them all. Bill
Mar 31 2003
parent Ilya Minkov <midiclub 8ung.at> writes:
Peter Hercek wrote:
 Well, I went through character and code page problems too about a year
  ago. Very bad experience in C/C++ ... (I'm from place where 7 bits
  is not enough). I have two points about this:
Me too :)
 1) D should support characters and not bytes (8 bits) or words (16 bits);
  when I'm indexing a string I do so by characters and not by a byte multiple;
  if I wanted to index by e.g. bytes I would ask for the string's byte length
  and cast to a byte array
Right.
 2) Support for 3 character types (UTF8, UTF16, UTF32) is handy, but
  not critical (can be solved by conversion functions); actually for one
  character only, UTF32 has the shortest representation; it may also be
  interesting not to specify the exact encoding for a string (as opposed to
  an encoding for a character) - let the compiler decide what the best
  representation is (maybe some optimization can be achieved based on this
  later; e.g. the compiler can decide to store strings in partially balanced
  trees like STLPort does for ropes, but with possibly different encodings
  for different nodes ... whatever, just writing down my thoughts)
UTF-32 doesn't have the shortest representation, since "in all 3 encodings [i.e. UTF-8/16/32] the maximum possible character representation length is 4 bytes", as the official description says. Though i agree that it's the most practical one, in part because working with an array of longs is nowadays faster than an array of shorts. This is an implementation detail and should not matter though, because whatever the string implementation is, it should hide the underlying complexity.

What matters though is that in UNICODE there are 2 kinds of characters - normal and modifiers. So an "ä" can be represented directly, as well as by an "a" plus a special accent symbol. I'm pretty much sure you want to access these as a whole, not separately.

-i.
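A sketch of the modifier-character point in later D; std.uni.normalize and the sample strings are illustrative assumptions:

    import std.uni : normalize, NFC;

    void main()
    {
        string composed   = "\u00E4";  // "ä" as one precomposed code point
        string decomposed = "a\u0308"; // "a" followed by COMBINING DIAERESIS
        assert(composed != decomposed);                // different code units...
        assert(normalize!NFC(decomposed) == composed); // ...same character after NFC
    }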
Apr 10 2003
prev sibling parent reply Bill Cox <Bill_member pathlink.com> writes:
In article <b6al79$1ahd$1 digitaldaemon.com>, Mark Evans says...
Hi again Bill

After your 'meta-programming' talk I shudder to think what your idea
of a maximalist is...maybe a computer that writes a compiler that
generates source code for a computer motherboard design program to
construct another computer that...
A maximalist wants many built-in features, from functional programming support, to multimethods, to support of every character format known to man. Not in libraries, where we could all contribute, but built-in, where Walter has to write it. As a minimalist, I'd settle for features that allow me to add the features I need to the language in libraries. The meta-programming stuff I'd mentioned leads in that direction. Bill
Mar 31 2003
next sibling parent "Matthew Wilson" <dmd synesis.com.au> writes:
And a pragmatist wants as much as is possible in libraries, but what he/she
feels must be in the compiler because of the likelihood of stuff-ups if left
to the full spectrum of the developer community (such as meaningful ==,
string types and my auto-stringise thingo with char null *)

"Bill Cox" <Bill_member pathlink.com> wrote in message
news:b6b05r$1hsv$1 digitaldaemon.com...
 In article <b6al79$1ahd$1 digitaldaemon.com>, Mark Evans says...
Hi again Bill

After your 'meta-programming' talk I shudder to think what your idea
of a maximalist is...maybe a computer that writes a compiler that
generates source code for a computer motherboard design program to
construct another computer that...
A maximalist wants many built-in features, from functional programming
support,
 to multimethods, to support of every character format known to man.  Not
in
 libraries, where we could all contribute, but built-in, where Walter has
to
 write it.

 As a minimalist, I'd settle for features that allow me to add the features
I
 need to the language in libraries.  The meta-programming stuff I'd
mentioned
 leads in that direction.

 Bill
Mar 31 2003
prev sibling next sibling parent reply Mark Evans <Mark_member pathlink.com> writes:
Bill the point is that trying to paint me this or that color, instead of
focusing on something specific, is ad hominem.  I find it patronizing.
Especially since on this point you've already agreed with me explicitly.

We can quibble on specifics.  I want 3 char types, you want 2 (UTF8 + char) or
maybe even 3 (UTF8 + char + wchar).

I have much to say about those bizarre meta programming concepts.  I have worked
in EDA and know that domain - you can't blow smoke in my face, even if others
are impressed.  All I would say here is that by your own admission, you're
trying to write code for 'average' or 'dumb' programmers, so please focus on
doing just that.

Mark
Mar 31 2003
parent reply Bill Cox <bill viasic.com> writes:
Hi, Mark.

Mark Evans wrote:
 Bill the point is that trying to paint me this or that color, instead of
 focusing on something specific, is ad hominem.  I find it patronizing.
 Especially since on this point you've already agreed with me explicitly.
 
 We can quibble on specifics.  I want 3 char types, you want 2 (UTF8 + char) or
 maybe even 3 (UTF8 + char + wchar).
 
 I have much to say about those bizarre meta programming concepts.  I have
worked
 in EDA and know that domain - you can't blow smoke in my face, even if others
 are impressed.  All I would say here is that by your own admission, you're
 trying to write code for 'average' or 'dumb' programmers, so please focus on
 doing just that.
Ok, I'll bite... Why do you feel I'm blowing smoke in your face?

As for the meta-programming stuff, we use DataDraw today to do lots of it, and I find it very productive, particularly for our EDA work. In particular, we added dynamic class extensions, recursive destructors, array bounds checking, and pointer indirection checking to C. The code generators also give us much of the power of template frameworks. We also use a memory mapping model that works great on 64-bit machines, where EDA is headed fast (we use the Sheesh Kabob code generator). All of these have very specific benefits for EDA, which I've covered in previous posts.

Before calling it bizarre, why not look into it? A fairly recent version of DataDraw is available at:

http://www.viasic.com/download/datadraw.tar.gz

Most GUI programmers use Class Wizard, which is much the same kind of thing. Should that capability be in the language? Possibly. The concept has been researched by other groups, and one way to do it is to add "compile-time reflection classes" to the language. OpenC++ is one example of this approach. XL does it, too.

Also, we don't hire average or dumb programmers. We hire brilliant programmers, and train them to code as if the target audience were stupid people. This really helps them work together, and helps the code last over time. It helps our business output a consistent product - the code looks much the same no matter who wrote it. There are good business reasons for this.

Putting a restrictive coding methodology in place doesn't restrict how an algorithm works, just how the implementation looks. So far, there have been exactly 0 algorithms that had to be changed in order to fit into our methodology. We encourage our programmers to be as creative as possible in algorithm development, and to come up with brilliant solutions. We enable them to implement those algorithms quickly and efficiently with a consistent, solid, and proven coding methodology. They spend less time thinking about how to write code, and more time writing it. It's one of our competitive tools for success.

Bill
Apr 01 2003
next sibling parent Mark Evans <Mark_member pathlink.com> writes:
Please don't turn this into yet another thread about DataDraw or dubious
management 'expertise.'  (Put up a wiki board somewhere, OK?  I could show you
five different ways from Sunday to replace DataDraw with better code using
standard languages/libraries/mixins/design patterns/tools of which you seem
ignorant.  Sorry you'll have to pay me though.)

Thank you for supporting the idea that D needs some kind of native Unicode
support.

Mark
Apr 01 2003
prev sibling parent reply Helmut Leitner <helmut.leitner chello.at> writes:
Bill Cox wrote:
 Before calling it bizarre, why not look into it?  A fairly recent
 version of DataDraw is available at:
 
 http://www.viasic.com/download/datadraw.tar.gz
When I read one of your postings a week ago, I googled for DataDraw and didn't find references or a download page, although you said it is open source. I found this very weird. I also didn't get the impression that you were connected to the project. Now I see in the About-Box that you are the lead developer...

There is no LICENSE. The documentation is so incomplete that I wouldn't even start trying to use it (although its date says 1993).

There are surely better ways to advertise your project. Why don't you set up an official OS project at SourceForge and complete the documentation?

-- Helmut Leitner leitner hls.via.at Graz, Austria www.hls-software.com
Apr 01 2003
parent reply Bill Cox <bill viasic.com> writes:
Hi, Helmut.

Helmut Leitner wrote:
 
 Bill Cox wrote:
 
Before calling it bizarre, why not look into it?  A fairly recent
version of DataDraw is available at:

http://www.viasic.com/download/datadraw.tar.gz
When I read one of your postings a week ago, I googled for DataDraw and didn't find references or a download page, although you said it is open source. I found this very weird. I also didn't get the impression that you were connected to the project. Now a see in the About-Box, that you are the lead developer... There is no LICENSE. The documentation is so imcomplete that I wouldn't even start trying to use it (Although it's date says 1993). There are surely better ways to advertise you project. Why don't you set up an official OS project at sourceforge and complete the documentation. -- Helmut Leitner leitner hls.via.at Graz, Austria www.hls-software.com
I'm not trying to advertise DataDraw. In fact, I'd love to see D incorporate features that would allow me to kill it. I'd prefer that users didn't start adopting DataDraw, as I don't have the time to do free support. It's open-source, as the copyright file describes. It's a very weak copyright, meant to be weaker than the GNU GPL. The documentation sucks, and I think it will probably stay that way. I did write the first version, and placed it into the open-source domain. The guys who wrote the second one kept me listed in the about box, but I didn't write the code. So far as I know, DataDraw is only in use at ViASIC (my company), QuickLogic, and Synplicity. None of these companies has any reason to promote it.

It's specific insights I've gained in working with DataDraw that I've been trying to describe in this group, rather than trying to promote DataDraw. I only posted it because someone asked me to, and the license requires that I do. Through using DataDraw for many years, however, I think I've had some fairly unique insights into language design. Adding features to a target language is what DataDraw is for, and I've been able to try out several features not found in C++ in a real industrial coding environment. Some of those features I've described in other posts. As I said, I was hoping D could be extended to make DataDraw obsolete. That turns out not to be the case. I'll describe some of my current thinking about this matter below.

DataDraw currently just models data structures, and allows me to write code generators. This is much like the old OM tool for UML (which DataDraw precedes). It gives me the power of compile-time reflection classes, like those in OpenC++. However, for each new language, or coding style, I have to write a new code generator, and these things get really complex. DataDraw currently has 5. That kind of sucks. Instead, DataDraw should allow me to write one awesome code generator that targets an intermediate language. Then, it should allow me to write simple translators for each target language and coding style. The bulk of the work could then be shared.

With a built-in language translator, DataDraw would be much simpler than it is now. However, with a built-in language translator, DataDraw becomes a language in itself. What's unique about it? Simple. It's extendable by me and others I work with who are familiar with the DataDraw code base. I can generate code of any type, and add literally any feature I wish. However, I do that by directly editing the code generators, which are written in C and which link into DataDraw's database. That's not elegant, or usable by anyone not familiar with the DataDraw code base, although it does cover my needs.

So, I've been looking into what it takes to get the same power, but in a language that anyone could work with. In particular, I've been examining what it would take for D to cover DataDraw's functionality. That, it turns out, is hard (which is one reason the XL compiler isn't done). The more power you give the user, the more you open up the internals of the compiler, and the more complex you make the language. For example, to do that in D, a natural way would be to make Walter's representation of D as data structures part of the language definition (thus greatly restricting how D compilers are built). Then, you could offer access to reflection classes at compile time (as OpenC++ does). A natural way to use these classes at compile time is to interpret D code. Now, you have to write a D interpreter as well as a compiler. This is the approach taken by VHDL for their generators, and it really complicated implementations of compilers. An alternative is to re-compile the compiler instead. This is a bit brain-bending, but I think getting rid of the interpreter is worth it. Besides, I already recompile DataDraw every time I fix or add a feature, and that's never been much of a problem.

Even if we added compile-time reflection classes, I still don't get all the power of DataDraw, which I can extend in any way, because I directly edit the source. What's still missing?

For one thing, reflection classes can't be used to add syntax to the language. That's a serious limitation. XL's approach allows some syntax extension. Scheme also has a nice mechanism. However, both systems are limited, and complex, and slow. I'm toying with another approach that is easy if you already allow users to compile custom versions of the compiler (which you do to get rid of the interpreter). Just provide a simple mechanism for generating a syntax description for use by bison. That nails the problem. Any new syntax can then be added by a user, so long as it's compatible with what's already there. A drawback is that bison now becomes part of the language, along with all its quirks and strong points. At least bison is pretty much available everywhere.

Just adding new syntax to the language doesn't get you all the way there. You still are stuck with those reflection classes used to model the language. If you have a new construct to implement, you can add the syntax, but what objects do you build to represent it? The reflection classes themselves need to be extendable. Really. At that point, nothing in the language is left as non-configurable. You're stuck with LALR(1) parsers, but that's no big deal.

However, adding reflection classes is tricky. Being C-derived, the language still needs to link with the C linker, including the compiler itself, especially if users are going to compile custom compilers for their applications. That means that new types can't be added to the compiler's database, since C libraries are limited that way. I'm currently toying with the age-old style of non-typed syntax trees rather than fully typed reflection classes. It looks like it will work out, but in the end, all this has done is provide a compiler that's easy to extend. It's easy to extend because its parser and internal data structures are simple and extendable. Plug-ins should be easy to write. However, it's not really a standard language any more. It's just a customizable compiler that's fairly easy to work with.

I'm left with the conclusion that D can't be enhanced to be extendable the way XL wants to be, or the way I'd like D to be. I don't see how D can get there from here.

Bill
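For what it's worth, later D acquired a limited form of this through compile-time function evaluation and string mixins rather than an interpreter; a sketch, with a hypothetical generator and sample struct:

    import std.stdio;

    struct Point { int x; int y; }

    // Hypothetical generator: build printing code from a type's member list
    // at compile time, then mix it in as source.
    string genPrinter(T)()
    {
        string code;
        foreach (name; __traits(allMembers, T))
            code ~= `writeln("` ~ name ~ ` = ", p.` ~ name ~ `);`;
        return code;
    }

    void printMembers(T)(T p)
    {
        mixin(genPrinter!T()); // generated and compiled in place, no interpreter needed
    }

    void main()
    {
        printMembers(Point(3, 4)); // prints: x = 3  then  y = 4
    }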
Apr 02 2003
parent reply Helmut Leitner <helmut.leitner chello.at> writes:
Bill Cox wrote:
There are surely better ways to advertise your project.
... I'm not trying to advertise DataDraw. In fact, I'd love to see D incorporate features that would allow me to kill it. I'd prefer that users didn't start adopting DataDraw, as I don't have the time to do free support.
Ok, I think it's good to have this said.
 It's open-source, as the copyright file describes.  
 The documentation sucks, and I think it will probably stay that way.
That means it's dead outside the heads of its few experts and will remain so.
 ...
 It's specific insights I've gained in working with DataDraw that I've
 been trying to describe in this group, rather than trying to promote
 DataDraw. 
 ...
I'm very interested in your experiences and insights. I've been doing software projects since 1979 and feel very strongly about the way systems present themselves to the programmer (APIs).
 Through using DataDraw for many years, however, I think I've had some
 fairly unique insights into language design.  Adding features to a
 target langauge is what DataDraw is for, and I've been able to try out
 several features not found in C++ in a real industrial coding
 environment.  Some of those features I've described in other posts.
I'll try to reread some of your postings and arguments. Can you give me some hints to find my way?
 As I said, I was hoping D could be extended to make DataDraw obsolete.
 That turns out not to be the case.  I'll describe some of my current
 thinking about this matter below.
 
 DataDraw currently just models data structures, and allows me to write
 code generators.  This is much like the old OM tool for UML (which
 DataDraw precedes).  It gives me the power of compile-time reflection
 classes, like those in OpenC++.  However, for each new language, or
 coding style, I have to write a new code generator, and these things get
 really complex.  DataDraw currently has 5.  That kind of sucks.
 
 Instead, DataDraw should allow me to write one awesome code generator
 that targets in an intermediate language.  Then, it should allow me to
 write simple translators for each target language and coding style.  The
 bulk of the work could then be shared.
That's a natural idea that doesn't seem to work. I think Charles Simonyi put 10 years into Intentional Programming following similar ideas, and they burned millions of dollars.
 With a built-in language translator, DataDraw would be much simpler than
 it is now.  However, with a built-in language translator, DataDraw
 becomes a language in itself.  What's unique about it?  Simple.  It's
 extendable by me and others I work with who are familiar with the
 DataDraw code base.  I can generate code of any type, and add literally
 any feature I wish.  However, I do that by directly editing the code
 generators, which are written in C and which link into DataDraw's
 database.  That's not elegant, or usable by anyone not familiar with the
 DataDraw code base, although it does cover my needs.
This is a certain way to solve problems but it may or may not be optimal. The fact that you have this tool at hand gives power but may mislead.
 So, I've been looking into what it takes to get the same power, but in a
 language that anyone could work with.  In particular, I've been
 examining what it would take for D to cover DataDraw's functionality.
Analytically this is not a goal. The goal is to enable programmers to write great applications. What are their problems and how can they be solved?
 That, it turns out, is hard (which is one reason the XL compiler isn't
 done).  The more power you give the user, the more you open up the
 internals of the compiler, and the more complex you make the language.
I agree. I think this is the problem of C++ itself. Too much complexity for too little gain.
 For example, to do that in D, a natural way would be to make Walter's
 representation of D as data structures part of the language definition
 (thus greatly restricting how D compilers are built).  Then, you could
 offer access to reflection classes at compile time (as OpenC++ does).  A
 natural way to use these classes at compile time is to interpret D code.
   Now, you have to write a D interpreter as well as a compiler.  This is
 the aproach taken by VHDL for their generators, and it really
 complicated implementations of compilers.  An alternative is to
 re-compile the compiler instead.  This is a bit brain-bending, but I
 think getting rid of the interpreter is worth it.  Besides, I already
 recompile DataDraw every time I fix or add a feature, and that's never
 been much of a problem.
 
 Even if we added compile-time reflection classes, I still don't get all
 the power of DataDraw, which I can extend in any way, because I directly
 edit the source.  What's still missing?
 
 For one thing, reflection classes can't be used to add syntax to the
 language.  That's a serious limitation.  XL's approach allows some syntax
 extension.  Scheme also has a nice mechanism.  However, both systems are
 limited, and complex, and slow.  I'm toying with another approach that is
 easy if you already allow users to compile custom versions of the
 compiler (which you do to get rid of the interpreter).  Just provide a
 simple mechanism for generating a syntax description for use by bison.
 That nails the problem.  Any new syntax can then be added by a user, so
 long as it's compatible with what's already there.  A drawback is that
 bison now becomes part of the language, along with all its quirks and
 strong points.  At least bison is pretty much available everywhere.
I still don't know what problems you are trying to solve. A language that is able to extend its own syntax? Surely a fascinating idea, but 99.9 percent of programmers would not be able to make good use of it.
 Just adding new syntax to the language doesn't get you all the way
 there.  You still are stuck with those reflection classes used to model
 the language.  If you have a new construct to implement, you can add the
 syntax, but what objects do you build to represent it?  The reflection
 classes themselves need to be extendable.  Really.  At that point,
 nothing in the language is left as non-configurable.  You're stuck with
 LALR(1) parsers, but that's no big deal.
 
 However, adding reflection classes is tricky.  Being C-derived, the
 language still needs to link with the C linker, including the compiler
 itself, especially if users are going to compile custom compilers for
 their applications.  That means that new types can't be added to the
 compiler's database, since C libraries are limited that way.  I'm
 currently toying with the age-old style of non-typed syntax trees rather
 than fully typed reflection classes.  It looks like it will work out,
 but in the end, all this has done is provide a compiler that's easy to
 extend.  It's easy to extend because its parser and internal data
 structures are simple, and extendable.  Plug-ins should be easy to
 write.  However, it's not really a standard language any more.  It's
 just a customizable compiler that's fairly easy to work with.
 
 I'm left with the conclusion that D can't be enhanced to be extendable the
 way XL wants to be, or the way I'd like D to be.
As I see it D was never designed to have an extensible syntax.
 I don't see how D can get there from here.
For this reason it is unreasonable to think it could go there. Currently I don't understand why it should go there, other than it would allow you to carry your DataDraw methods of problem solving on to D. But, as I said, I'll try to read some of your threads. -- Helmut Leitner leitner hls.via.at Graz, Austria www.hls-software.com
Apr 02 2003
parent Bill Cox <bill viasic.com> writes:
I agree with all your comments.

At this point, I'm not advocating major changes to D, so this reply is
more just to answer your questions than to give Walter any ideas.  You'd
asked about specific features I'd been advocating, so I'll re-summarize
them below.

1) Compile-time reflection classes.  I threw this out there as a
possibility to be investigated.  Now that I've done that, I'm dropping
that request, for reasons described in the message you replied to below.

2) I'd still like to see more powerful iterators than the ones discussed
lately.  You can look up my recommendations under "Cool iterators", or
something like that.

3) Dynamic class extensions are also a great thing, and it's sad that in
C++, databases have to emulate the extensions with cross-coupled void
pointers.

4) A class framework inheritance mechanism, such as Sather's "include"
construct, virtual classes, or Dan's "Template Frameworks".  All of
these cover a gaping hole in C++, but I'm concerned about the complexity
of the virtual class approach Walter was considering.

Embedded replies to a couple questions you posed are below.

Helmut Leitner wrote:
 
 Bill Cox wrote:
 
There are surely better ways to advertise your project.
... I'm not trying to advertise DataDraw. In fact, I'd love to see D incorporate features that would allow me to kill it. I'd prefer that user's didn't start adopting DataDraw, as I don't have the time to do free support.
Ok, I think it's good to have this said.
It's open-source, as the copyright file describes.  
The documentation sucks, and I think it will probably stay that way.
That means it's dead outside the heads of its few experts and will remain so.
...
It's specific insights I've gained in working with DataDraw that I've
been trying to describe in this group, rather than trying to promote
DataDraw. 
...
I'm very interested in your experiences and insights. I've been doing software projects since 1979 and feel very strongly about the way systems present themselves to the programmer (APIs).
Through using DataDraw for many years, however, I think I've had some
fairly unique insights into language design.  Adding features to a
target langauge is what DataDraw is for, and I've been able to try out
several features not found in C++ in a real industrial coding
environment.  Some of those features I've described in other posts.
I'll try to reread some of your postings and arguments. Can you give me some hints to find my way?
As I said, I was hoping D could be extended to make DataDraw obsolete.
That turns out not to be the case. I'll describe some of my current thinking about this matter below. DataDraw currently just models data structures, and allows me to write code generators. This is much like the old OM tool for UML (which DataDraw preceeds). It gives me the power of compile-time reflection classes, like those in OpenC++. However, for each new language, or coding style, I have to write a new code generator, and these things get really complex. DataDraw currenly has 5. That kind of sucks. Instead, DataDraw should allow me to write one awesome code generator that targets in an intermediate language. Then, it should allow me to write simple translators for each target language and coding style. The bulk of the work could then be shared.
That's a natural idea that doesn't seem to work. I think Charles Simonyi put 10 years into Intentional Programming following similar ideas, and they burned millions of dollars.
I believe it. The hard part isn't making a nice intermediate language I can work with. The hard part is making an extendable version that anyone can work with.
With a built-in language translator, DataDraw would be much simpler than
it is now.  However, with a built-in language translator, DataDraw
becomes a language in itself.  What's unique about it?  Simple.  It's
extendable by me and others I work with who are familiar with the
DataDraw code base.  I can generate code of any type, and add literally
any feature I wish.  However, I do that by directly editing the code
generators, which are written in C and which link into DataDraw's
database.  That's not elegant, or usable by anyone not familiar with the
DataDraw code base, although it does cover my needs.
This is a certain way to solve problems but it may or may not be optimal. The fact that you have this tool at hand gives power but may mislead.
You're right about that. You have to be extremely careful about adding features to a language using a custom pre-processor. In particular, every extension has to be carefully thought out, and agreed to by the whole group. If anyone could add a feature any time they wished, it'd result in mayhem.
So, I've been looking into what it takes to get the same power, but in a
language that anyone could work with.  In particular, I've been
examining what it would take for D to cover DataDraw's functionality.
Analytically, this is not a goal in itself. The goal is to enable programmers to write great applications. What are their problems, and how can they be solved?
Oh, there are lots of problems. Big stuff and little stuff. How about array bounds checking in debug mode? We added it to C. Need a few fields added to existing classes at run-time? We do that. The space of solutions to real problems programmers are facing out there is a lot bigger than what most languages address. I agree with your point, though. A good D design is a design that covers most people's most common needs, but not all of anybody's needs. IMO, D's basically on track.
That, it turns out, is hard (which is one reason the XL compiler isn't
done).  The more power you give the user, the more you open up the
internals of the compiler, and the more complex you make the language.
I agree. I think this is the problem of C++ itself. Too much complexity for too little gain.
For example, to do that in D, a natural way would be to make Walter's
representation of D as data structures part of the language definition
(thus greatly restricting how D compilers are built).  Then, you could
offer access to reflection classes at compile time (as OpenC++ does).  A
natural way to use these classes at compile time is to interpret D code.
  Now, you have to write a D interpreter as well as a compiler.  This is
the approach taken by VHDL for their generators, and it really
complicated implementations of compilers.  An alternative is to
re-compile the compiler instead.  This is a bit brain-bending, but I
think getting rid of the interpreter is worth it.  Besides, I already
recompile DataDraw every time I fix or add a feature, and that's never
been much of a problem.

Even if we added compile-time reflection classes, I still don't get all
the power of DataDraw, which I can extend in any way, because I directly
edit the source.  What's still missing?

For one thing, reflection classes can't be used to add syntax to the
language.  That's a serious limitation.  XL's approach allows some syntax
extension.  Scheme also has a nice mechanism.  However, both systems are
limited, and complex, and slow.  I'm toying with another approach that is
easy if you already allow users to compile custom versions of the
compiler (which you do to get rid of the interpreter).  Just provide a
simple mechanism for generating a syntax description for use by bison.
That nails the problem.  Any new syntax can then be added by a user, so
long as it's compatible with what's already there.  A drawback is that
bison now becomes part of the language, along with all its quirks and
strong points.  At least bison is pretty much available everywhere.
I still don't know what problems you are trying to solve. A language that is able to extend its own syntax? Surely a fascinating idea, but 99.9 percent of programmers would not be able to make good use of it.
You're right about how many programmers should use it. It's dangerous stuff, and extensions need to be carefully considered by a few and then adopted by many. Scheme has a nice mechanism for this kind of thing. Much of the syntax of Scheme can actually be written in Scheme. However, without an ability to add syntax, some new features can't cleanly be added to a language, and thus the language isn't fully extensible. For example, how could we add Sather-like "include" constructs to allow module level inheritance? There's no way in D or C++ without extending the parser a little. After that, it's a simple thing to implement with compile-time reflection classes. I'm not pushing for any syntax extension mechanism for D. It's pretty worthless without some way to tie it into reflection classes or an equivalent mechanism.
Just adding new syntax to the language doesn't get you all the way
there.  You still are stuck with those reflection classes used to model
the language.  If you have a new construct to implement, you can add the
syntax, but what objects do you build to represent it?  The reflection
classes themselves need to be extendable.  Really.  At that point,
nothing in the language is left as non-configurable.  You're stuck with
LALR(1) parsers, but that's no big deal.

However, adding reflection classes is tricky.  Being C-derived, the
language still needs to link with the C linker, including the compiler
itself, especially if users are going to compile custom compilers for
their applications.  That means that new types can't be added to the
compiler's database, since C libraries are limited that way.  I'm
currently toying with the age-old style of non-typed syntax trees rather
than fully typed reflection classes.  It looks like it will work out,
but in the end, all this has done is provide a compiler that's easy to
extend.  It's easy to extend because its parser and internal data
structures are simple and extendable.  Plug-ins should be easy to
write.  However, it's not really a standard language any more.  It's
just a customizable compiler that's fairly easy to work with.

I'm left with the conclusion that D can't be enhanced to be extendable the
way XL wants to be, or the way I'd like D to be.
As I see it D was never designed to have an extensible syntax.
I don't see how D can get there from here.
For this reason it is unreasonable to think it could go there. Currently I don't understand why it should go there, other than it would allow you to carry your DataDraw methods of problem solving on to D. But, as I said, I'll try to read some of your threads.

--
Helmut Leitner    leitner hls.via.at
Graz, Austria     www.hls-software.com
I agree. At this point, I've concluded that D should not try to solve the problems I solve with DataDraw.

I've started working on a new system that should replace DataDraw when finished. It's already got the syntax extension mechanism I described that generates a bison file. It's got a simple list-based language parse tree that is capable of representing any feature I wish to support. These get used like compile-time reflection classes, allowing users to write code in the intermediate language in order to add features to the target language. The output can be in any language (as with DataDraw), and users can write new generators to target new languages or coding styles.

I'm thinking of calling it Hack-C, since allowing me to hack new features into C or other languages is its primary function, and because the whole system seems like one of the world's largest hacks. It's a translator that compiles application-specific versions of itself in order to add features to other languages. The opportunities for serious hacking in such a system are vast.

If you think there might be interest in this system in the open-source community, I could try to finish its development that way. It might be fun enough for me to actually support an open-source effort, and if anyone else were to help, I could benefit from that. I haven't seen much interest in this kind of project out there in the past. Languages are always hot, but CASE tools never are. Do you think this could be successful as an open-source effort?

Bill
Apr 03 2003
prev sibling parent reply Mark Evans <Mark_member pathlink.com> writes:
Bill Cox wrote,
Not in libraries, where we could all contribute,
but built-in, where Walter has to write it.
The compiler is open-source. Contributions are welcome. (Wasn't it you who said recently, 'I had a few days off and rewrote the D compiler' or words to that effect? Forgive me if memory fails, I think it was you.)

Whatever reasons you accept for UTF-8 as a native type hold equally well for UTF-16 and UTF-32. The only rationale advanced otherwise was a vague impression of unease (coupled with slurs on my design sense). Dividing type families is a war crime. It's more complex having one member in the compiler and the rest stranded in a library.

Think about slicing Unicode strings. Suppose the compiler includes code for slicing UTF-8 strings. Why do we want to duplicate that in a library for UTF-16? We have to write identical logic, in C for the compiler and in D for the library? Yuk! And what about the conversions between Unicode formats? They are easier with the strings all living in the same place.

Either these strings belong in the language together, or they belong in a library together. I see no objective reason to divide them up. Just think about what you're saying in terms of numeric types and the fallacy will jump out at you. C has trained people too well about what strings really are. Suppose for example that we put all floats in the compiler and all doubles in the library. Silly! <g>

Maybe it will mend fences to say in public that UTF-32 could be dropped. I have objective reasons for saying so, not vague unease: UTF-32 is rarely used and truly fixed-width (so it can be 'faked' as Walter suggests). Nonetheless intrinsic UTF-32 is just as reasonable to support as, say, the equally rarely used, and equally fake-able 'ifloat' type.

Mark
Mar 31 2003
next sibling parent reply Bill Cox <bill viasic.com> writes:
Hi, Mark.

Mark Evans wrote:
 Bill Cox wrote,
 
 The compiler is open-source.  Contributions are welcome.  (Wasn't it you who
 said recently, 'I had a few days off and rewrote the D compiler' or words to
 that effect?  Forgive me if memory fails, I think it was you.)
I wrote a toy compiler to test out some ideas in a few days off, not a D compiler. There's a huge difference between a week's effort, and what D has become. In fact C++ is so complex, the compilers out there still aren't complete. Keeping D simple is key to avoiding this fate.

The fact that D's front-end is open-source is an even greater reason for the language itself to be simple. The author of Linux has a lot to say about keeping open-source code simple. He blasted GNU's Hurd effort for its complexity. I agree with him. The fact that I'm writing this note using a Linux kernel instead of a GNU Hurd kernel supports his assertion.

Last I checked, the D front-end was 35K lines of hand-written code, which is impressively small given the functionality and commenting. However, that's still a lot to learn if you just want to contribute, but it's doable. When it reaches 100K lines, the language is in real trouble. Not many of us will be willing to work with a program that huge, unless we're getting paid.

Bill
Apr 01 2003
parent reply Mark Evans <Mark_member pathlink.com> writes:
Keeping D simple is key to avoiding this fate.
Unicode intrinsics make D a simple language. That is the point of having them. I assume you are still with me that D needs them. The notion is to rid D of ugly 30-year-old C confusions about strings, and to bring their formats up to modern standards in the bargain. We can't help the extra work of Unicode; that is what the world wants.
The fact that D's front-end is open-source is an even greater reason for 
the language itself to be simple.
No one said otherwise. You keep propping up straw-men to tear down. They are purely your own creations. It's amusing to watch you rip them down, but little else beyond that. We all want the language to be as simple and orthogonal as possible. That's why I worry about D's rigid adherence to C++ as a design baseline.

Look Bill - my design sense is as good as yours, maybe better, and definitely more informed. You need not lecture me about simplicity. To be frank, your work betrays complicated over-engineering and reinvented wheels. From my viewpoint you are the one who needs simplicity lessons.

Furthermore I do not 'advocate' everything that I post. You halfway accused me of 'advocating' multimethods, and I don't recall once doing that. I merely linked to a short article showing how multimethods simplify code.

I do advocate functional approaches, for this reason: they allow me to simplify my code. You see, I like simplicity.

There are software engineering concepts that C++ does not offer, and it's important for a new language effort to know about them. That way, even if rejected, a decision about the concepts was made on facts, not ignorance.

If you agree with me about Unicode intrinsics, to whatever degree, then bite the bullet and be done with it. You really are going over the top on this.

Mark
Apr 01 2003
next sibling parent "Matthew Wilson" <dmd synesis.com.au> writes:
Mark

Not wishing to get in the middle of you two stags, but aren't you getting a
bit over the top? I don't doubt that all your skills are as incomparable as
you assert - though I note you did not add an entry to the "Introductions"
thread, why was that? - but do we really need to be told all the time?

Frankly it's beginning to taste a little like Boost, not to mention a waste
of time for the many busy people who read through your posts to get to the
technical points (which are very interesting, I must say) that you're making.




"Mark Evans" <Mark_member pathlink.com> wrote in message
news:b6du7v$jiv$1 digitaldaemon.com...
Keeping D simple is key to avoiding this fate.
Unicode intrinsics make D a simple language. That is the point of having
them.
 I assume you are still with me that D needs them.

 The notion is to rid D of ugly 30-year-old C confusions about strings, and
to
 bring their formats up to modern standards in the bargain.  We can't help
the
 extra work of Unicode; that is what the world wants.

The fact that D's front-end is open-source is an even greater reason for
the language itself to be simple.
No one said otherwise. You keep propping up straw-men to tear down. They
are
 purely your own creations.  It's amusing to watch you rip them down, but
little
 else beyond that.  We all want the language to be as simple and orthogonal
as
 possible.  That's why I worry about D's rigid adherence to C++ as a design
 baseline.

 Look Bill - my design sense is as good as yours, maybe better, and
definitely
 more informed.  You need not lecture me about simplicity.  To be frank,
your
 work belies complicated over-engineering and reinvented wheels. From my
 viewpoint you are the one who needs simplicity lessons.

 Furthermore I do not 'advocate' everything that I post.  You halfway
accused me
 of 'advocating' multimethods, and I don't recall once doing that.  I
merely
 linked to a short article showing how multimethods simplify code.

 I do advocate functional approaches, for this reason:  they allow me to
simplify
 my code.  You see, I like simplicity.

 There are software engineering concepts that C++ does not offer and it's
 important for a new language effort to know about them.  That way, even if
 rejected, a decision about the concepts was made on facts, not ignorance.

 If you agree with me about Unicode intrinsics, to whatever degree, then
bite the
 bullet and be done with it.  You really are going over the top on this.

 Mark
Apr 01 2003
prev sibling parent "Luna Kid" <lunakid neuropolis.org> writes:
Hmm... Mark, appreciating all your informedness and
very welcome sharp and clear view on this matter (and
others), how about improving your diplomatic skills
a bit?

Sorry about the noise.
The Luna Kid
Apr 03 2003
prev sibling parent reply "Walter" <walter digitalmars.com> writes:
"Mark Evans" <Mark_member pathlink.com> wrote in message
news:b6beep$1qom$1 digitaldaemon.com...
 Maybe it will mend fences to say in public that UTF-32 could be dropped.  I
 have objective reasons for saying so, not vague unease: UTF-32 is rarely used
 and truly fixed-width (so it can be 'faked' as Walter suggests).  Nonetheless
 intrinsic UTF-32 is just as reasonable to support as, say, the equally rarely
 used, and equally fake-able 'ifloat' type.
My understanding is that the linux wchar_t type is UTF-32, which puts it in common use. UTF-32 is also handy as an intermediate form when converting between UTF-8 and UTF-16.
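
In code, the round trip through UTF-32 comes down to a couple of bit-twiddling
routines.  A minimal C++ sketch (hypothetical helper names, no validation, and
the thread's 5/6-byte UTF-8 forms omitted for brevity):

#include <cstdint>
#include <vector>

// Decode one UTF-8 sequence at p into a UTF-32 code point; returns the
// number of bytes consumed.
int utf8_to_utf32(const uint8_t* p, uint32_t* out)
{
    if (p[0] < 0x80) { *out = p[0]; return 1; }               // 0xxxxxxx
    if (p[0] < 0xE0) { *out = ((p[0] & 0x1Fu) << 6)           // 110xxxxx
                            |  (p[1] & 0x3Fu);  return 2; }
    if (p[0] < 0xF0) { *out = ((p[0] & 0x0Fu) << 12)          // 1110xxxx
                            | ((p[1] & 0x3Fu) << 6)
                            |  (p[2] & 0x3Fu);  return 3; }
    *out = ((p[0] & 0x07u) << 18) | ((p[1] & 0x3Fu) << 12)    // 11110xxx
         | ((p[2] & 0x3Fu) << 6)  |  (p[3] & 0x3Fu);
    return 4;
}

// Encode one UTF-32 code point as UTF-16: one unit, or a surrogate pair.
void utf32_to_utf16(uint32_t c, std::vector<uint16_t>& out)
{
    if (c < 0x10000) {
        out.push_back(static_cast<uint16_t>(c));
    } else {
        c -= 0x10000;                                         // 20 payload bits
        out.push_back(static_cast<uint16_t>(0xD800 | (c >> 10)));    // high
        out.push_back(static_cast<uint16_t>(0xDC00 | (c & 0x3FF)));  // low
    }
}
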
May 21 2003
parent "J. Daniel Smith" <J_Daniel_Smith HoTMaiL.com> writes:
If you've got a UTF-32 string, UTF-16 is really only needed when calling
things like Win32 APIs.

   Dan

"Walter" <walter digitalmars.com> wrote in message
news:bagjlo$308t$1 digitaldaemon.com...
 "Mark Evans" <Mark_member pathlink.com> wrote in message
 news:b6beep$1qom$1 digitaldaemon.com...
 Maybe it will mend fences to say in public that UTF-32 could be dropped.
I have
 objective reasons for saying so, not vague unease: UTF-32 is rarely used
and
 truly fixed-width (so it can be 'faked' as Walter suggests).
Nonetheless
 intrinsic UTF-32 is just as reasonable to support as, say, the equally
rarely
 used, and equally fake-able 'ifloat' type.
My understanding is that the linux wchar_t type is UTF-32, which puts it
in
 common use. UTF-32 is also handy as an intermediate form when converting
 between UTF-8 and UTF-16.
May 22 2003
prev sibling next sibling parent "Matthew Wilson" <dmd synesis.com.au> writes:
One minor point:

We *must* have char/wchar and byte/ubyte/short/ushort as separate, and
overloadable, entities. This is about the most egregious and toxic aspect of
C/C++ that I can think of. Absolute nightmare when trying to write generic
serialisation components, messing around with compiler discrimination
pre-processor guff to work out whether the compiler "knows" about wchar_t,
and crying oneself to sleep with char, signed char, unsigned char, etc. etc.

Following this logic, if D does evolve to support different character
encoding schemes, it would be nice to have separate char types, although I
know this will draw the succinctness crowd down on me like a pack of
blood-thirsty vultures.

Swoop away, flying beasties, my gizzard is exposed.



"Mark Evans" <Mark_member pathlink.com> wrote in message
news:b6abjh$12m8$1 digitaldaemon.com...
 Walter says (in response to my post)...
 D needs a Unicode string primitive.
It does already. In D, a char[] is really a utf-8 array.
I'm dubious about this claim. ANSI C char arrays are UTF-8 too, if the
contents
 are 7-bit ACSII (a subset of UTF-8).  That doesn't mean they support
UTF-8.
 UTF-8 is on D's very own 'to-do' list:
 http://www.digitalmars.com/d/future.html

 UTF-8 has a maximum encoding length of 6 bytes for one character.  If such
a
 character appears at index 100 in char[] myString, what is the return
value from
 myString[100]?  The answer should be "one UTF-8 char with an internal
6-byte
 representation."  I don't think D does that.

 Besides which, my idea was a native string primitive, not a quasi-array.
The
 confusion of strings with arrays was a basic, fundamental mistake of C.
While
 some string semantics do resemble those of arrays, this resemblance should
not
 mandate identical data types.  Strings are important enough to merit their
own
 intrinsic type.  Icon is not the only language to recognize that fact.  D
 documents make no mention of any string primitive:
 http://www.digitalmars.com/d/type.html
 D has two intrinsic character types, a dynamic array type, and _no_
intrinsic
 string type.

 Characters should be defined as UTF-8 or UTF-16 or UTF-32, not "short" and
 "wide."  The differing cross-platform widths of the 'wide' char is asking
for
 trouble; poof goes data portability.  D characters are not based on
Unicode, but
 archaic MS Windows API and legacy C terminology spot-welded onto Linux.
How
 about Unicode as a basis?

 The ideal type system would offer as intrinsic/primitive/native language
types:
 - UTF-8 char
 - UTF-16 char
 - UTF-32 char
 - UTF-8 string
 - UTF-16 string
 - UTF-32 string
 - built-in conversions between all of the above (e.g. UTF-8 to UTF-16)
 - built-in conversions to/from UTF strings and C-style byte arrays

 The preceding list will not seem very long when you consider how many
numeric
 types D supports.  Strings are as important as numbers.

 The old C 'char' type is merely a byte; D already has 'ubyte.'  The
distinction
 between ubyte and char in D escapes me.  Maybe the reasoning is that a
char
 might be 'wide' so D needs a separate type?  But that reason disappears
once you
 have nice UTF characters.  So even if the list is a bit long it also
eliminates
 two redundant types, char and wchar.

 I would not be against retention of char and char[] for C compatibility
purposes
 if someone could point out why 'ubyte' and 'char[]' do not suffice.
Otherwise I
 would just alias 'char' into 'ubyte' and be done with it.  The wchar could
be
 stored inside a UTF-16 or UTF-32 char, or be declared as a struct.

 To the user, strings would act like dynamic arrays.  Internally they are
 different animals.  Each 'element' of the 'array' can have varying length
per
 Unicode specifications.  String primitives would hide Unicode complexity
under
 the hood.

 That's just the beginning.  Now that you have string intrinsics, you can
give
 them special behaviors pertaining to i/o streams and such.  You can define
 'streaming' conversions from other intrinsic types to strings for i/o
purposes.
 And...permit me to dream!...you can define Icon-style string scanning
 expressions.

 Mark
Mar 31 2003
prev sibling parent reply Mark Evans <Mark_member pathlink.com> writes:
Walter -

On a positive and constructive note, an implementation concept might hold some
interest.  I'm just bringing it to attention, not advocating yet <g>.

There's no hard requirement for serial bytewise storage of the proposed
intrinsic Unicode strings.  Other ways to build Unicode strings exist.  The one
offered here would do little or no damage to the current compiler.  Really it's
just a set of small additions.

Consider a Unicode string made of two data structures:  a C-style array, and a
lookup table.  The C-style array holds the first code word for each character.
The table holds all second, third, and additional code words.  (A 'code word'
meaning 8/16/32 bits for UTF 8/16/32 respectively.)  The keys to the table are
character position indices; they are accessed via some function like
table_access(100).

This setup unifies C array indices with Unicode character indices.  So D can
employ straight pointer arithmetic to find any character in the string.
Character index = array index.  String length (in chars) = implementation array
size (in elements).  These features may address your hesitation over
implementation issues that are complex in the serial case.

Having found the character, D need only check the high bit(s) which flag
additional code words.  Unicode requires such a test in any case; it's
unavoidable.  If flagged, D performs a table lookup.  This table lookup is the
only serious runtime cost.  The table could take whatever form is most
efficient.

* UTF-32 has no extended codes, so UTF-32 strings don't need tables.
* UTF-16 strings have only a few percent of characters with extended codes.
Ergo - the table is small, and the runtime cost is, say, 2-3%.
* UTF-8 needs the most (and largest) table entries, but manageably so.
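
To make the lookup path just described concrete, here is a rough C++ sketch of
the scheme (all names hypothetical - D has no such type today - and std::map
simply stands in for "whatever form is most efficient"):

#include <cstdint>
#include <cstdio>
#include <map>

// One entry per extended character: its trailing code words.
struct Extension {
    uint8_t more[5];   // up to 5 extra code words in 6-byte UTF-8
    int     count;
};

struct String8 {
    const uint8_t*            firstWords;  // one lead code word per character
    long                      length;      // in characters == array elements
    std::map<long, Extension> table;       // keyed by character index

    // Fetch character i into buf (>= 6 bytes): O(1) array access plus, for
    // extended characters only, the one table lookup.
    const uint8_t* charAt(long i, uint8_t* buf) const {
        uint8_t lead = firstWords[i];
        buf[0] = lead;
        if (lead < 0x80) return buf;       // high bit clear: plain ASCII
        const Extension& e = table.at(i);  // the one serious runtime cost
        for (int k = 0; k < e.count; ++k) buf[1 + k] = e.more[k];
        return buf;
    }
};

int main()
{
    // "héllo": 'é' is U+00E9, i.e. 0xC3 0xA9 in UTF-8.
    static const uint8_t leads[] = { 'h', 0xC3, 'l', 'l', 'o' };
    String8 s;
    s.firstWords = leads;
    s.length = 5;                          // 5 characters; index = position
    s.table[1] = Extension{ {0xA9}, 1 };   // the tail of the char at index 1

    uint8_t buf[6];
    const uint8_t* c = s.charAt(1, buf);
    printf("%02X %02X\n", (unsigned)c[0], (unsigned)c[1]);  // prints: C3 A9
    return 0;
}

An all-ASCII string never touches the map, so myString[100] really is one
array access.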

A downside might be file and network serialization - but we might skate by.  D
could supply streams on demand, without an intermediate serialized format.  If I
tell D "write(myFile, myString)" no intermediate format is required.  D can just
empty the internal array and table to disk in proper byte sequence.  The disk or
network won't care how D gets the bytes from memory.

The only hard serialization requirement would be actual user conversion to byte
arrays.  (If the user is doing that, let him suffer!)
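
Assuming the String8 sketch above, "emptying the array and table to disk in
proper byte sequence" is just an in-order walk, with no intermediate flat
buffer anywhere:

#include <cstdio>

// Stream s to f in ordinary serial UTF-8 byte order (a sketch; reuses the
// hypothetical String8/Extension layout from the sketch above).
void writeString(FILE* f, const String8& s)
{
    for (long i = 0; i < s.length; ++i) {
        uint8_t lead = s.firstWords[i];
        fputc(lead, f);                    // first code word
        if (lead >= 0x80) {                // extended char: append its tail
            const Extension& e = s.table.at(i);
            fwrite(e.more, 1, e.count, f);
        }
    }
}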

This scheme supports 7-bit ASCII.  An optimization could yield raw C speed.  Put
an extra boolean flag inside each string structure.  This flag is the logical OR
of all contained Unicode bit flags.  If the string has no extended chars, the
flag is FALSE, and D can use alternate string code on that basis.  (No bit
tests, no table lookups.)  That works for UTF-32, 7-bit ASCII, and the majority
of UTF-16 strings.
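
A sketch of that flag in action, again against the hypothetical String8 above:
conversion to a plain byte array collapses to a block copy whenever the flag
is clear.

#include <cstring>
#include <vector>

// hasExtended caches the OR of the high bits of all lead words.  When it
// is false the string *is* a flat byte array: no per-char tests at all.
std::vector<uint8_t> toBytes(const String8& s, bool hasExtended)
{
    if (!hasExtended)                      // raw C speed path
        return std::vector<uint8_t>(s.firstWords, s.firstWords + s.length);

    std::vector<uint8_t> out;              // general path: splice in tails
    for (long i = 0; i < s.length; ++i) {
        uint8_t lead = s.firstWords[i];
        out.push_back(lead);
        if (lead >= 0x80) {
            const Extension& e = s.table.at(i);
            out.insert(out.end(), e.more, e.more + e.count);
        }
    }
    return out;
}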

The idea can be nitpicked to death, but it's a concept.  Unicode strings and
characters will never enjoy the simplicity or speed of 7-bit ASCII.  That's a
fact of life, meaning that implementation concepts cannot be faulted on such a
basis.

What would be nice is to make Unicode maximally simple and maximally efficient
for D users.

Thanks again Walter,

Best-
Mark
Mar 31 2003
next sibling parent reply "Matthew Wilson" <dmd synesis.com.au> writes:
Qualifying this again with the stipulation that I am far from an expert on
this issue (aside from having a fair amount of experience in a negative
sense):

This sounds like a nice idea - array of 1st-byte plus lookups. I'm intrigued
as to the nature of the lookup table. Is this a constant, process-wide,
entity?

If I had time when it was introduced I'd be keen to participate in the
serialisation stuff, on which I have firmer footing.

It's not clear now whether you've dropped the suggestion for a separate
string class, or just that arrays of "char" types would be dealt with in the
fashion that you've outlined.

Finally, I'm troubled by your comments "on a positive and constructive note"
and "maybe it will mend fences to " (other post). Have I missed some animus
that everyone else has perceived? If so, I don't know which side to be on.
Seriously, though, I don't think anyone's getting shirty, so chill, baby. :)

Keep those great comments coming. I'm learning heaps.


"Mark Evans" <Mark_member pathlink.com> wrote in message
news:b6bb6i$1ont$1 digitaldaemon.com...
 Walter -

 On a positive and constructive note, an implementation concept might hold
some
 interest.  I'm just bringing it to attention, not advocating yet <g>.

 There's no hard requirement for serial bytewise storage of the proposed
 intrinsic Unicode strings.  Other ways to build Unicode strings exist.
The one
 offered here would do little or no damage to the current compiler.  Really
it's
 just a set of small additions.

 Consider a Unicode string made of two data structures:  a C-style array,
and a
 lookup table.  The C-style array holds the first code word for each
character.
 The table holds all second, third, and additional code words.  (A 'code
word'
 meaning 8/16/32 bits for UTF 8/16/32 respectively.)  The keys to the table
are

they are
 accessed via some function like table_access(100).

 This setup unifies C array indices with Unicode character indices.  So D
can
 employ straight pointer arithmetic to find any character in the string.
 Character index = array index.  String length (in chars) = implementation
array
 size (in elements).  These features may address your hesitation over
 implementation issues that are complex in the serial case.

 Having found the character, D need only check the high bit(s) which flag
 additional code words.  Unicode requires such a test in any case; it's
 unavoidable.  If flagged, D performs a table lookup.  This table lookup is
the
 only serious runtime cost.  The table could take whatever form is most
 efficient.

 * UTF-32 has no extended codes, so UTF-32 strings don't need tables.
 * UTF-16 characters involve only a few percent with extended codes.
 Ergo - the table is small, and the runtime cost is, say, 2-3%.
 * UTF-8 needs the biggest and most table entries, but manageably so.

 A downside might be file and network serialization - but we might skate
by. D
 could supply streams on demand, without an intermediate serialized format.
If I
 tell D "write(myFile, myString)" no intermediate format is required.  D
can just
 empty the internal array and table to disk in proper byte sequence.  The
disk or
 network won't care how D get the bytes from memory.

 The only hard serialization requirement would be actual user conversion to
byte
 arrays.  (If the user is doing that, let him suffer!)

 This scheme supports 7-bit ASCII.  An optimization could yield raw C
speed. Put
 an extra boolean flag inside each string structure.  This flag is the
logical OR
 of all contained Unicode bit flags.  If the string has no extended chars,
the
 flag is FALSE, and D can use alternate string code on that basis.  (No bit
 tests, no table lookups.)  That works for UTF-32, 7-bit ASCII, and the
majority
 of UTF-16 strings.

 The idea can be nitpicked to death, but it's a concept.  Unicode strings
and
 characters will never enjoy the simplicity or speed of 7-bit ASCII.
That's a
 fact of life, meaning that implementation concepts cannot be faulted on
such a
 basis.

 What would be nice is to make Unicode maximally simple and maximally
efficient
 for D users.

 Thanks again Walter,

 Best-
 Mark
Mar 31 2003
next sibling parent reply "Sean L. Palmer" <palmer.sean verizon.net> writes:
"Matthew Wilson" <dmd synesis.com.au> wrote in message
news:b6bgt5$1sai$1 digitaldaemon.com...
 This sounds like a nice idea - array of 1st-byte plus lookups. I'm intrigued
 as to the nature of the lookup table. Is this a constant, process-wide,
 entity?
No, because the map is indexed by the same index used to index into the flat array.  Unless I'm misunderstanding something.

Perhaps these could be grouped into separate maps by the total size of the char, which I think is determinable from the first char?  May speed lookups a tad, or slow them down, not sure.

Sean
Apr 01 2003
parent reply "Walter" <walter digitalmars.com> writes:
"Sean L. Palmer" <palmer.sean verizon.net> wrote in message
news:b6bjg5$1ut5$1 digitaldaemon.com...
 "Matthew Wilson" <dmd synesis.com.au> wrote in message
 news:b6bgt5$1sai$1 digitaldaemon.com...
 This sounds like a nice idea - array of 1st-byte plus lookups. I'm
intrigued
 as to the nature of the lookup table. Is this a constant, process-wide,
 entity?
No, because the map is indexed by the same index used to index into the
flat
 array.  Unless I'm misunderstanding something.
You could use a static 256 byte lookup table to give you the 'stride' to the next char.
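
Something like this sketch (using the thread's 6-byte maximum; modern UTF-8
stops at 4 bytes, and stray continuation bytes get stride 1 so a scan over
malformed input cannot get stuck):

#include <cstdint>

static uint8_t kStride[256];   // lead byte -> bytes in its sequence

void initStride()              // call once at startup
{
    for (int b = 0; b < 256; ++b) {
        if      (b < 0xC0) kStride[b] = 1;  // ASCII, or stray 10xxxxxx
        else if (b < 0xE0) kStride[b] = 2;  // 110xxxxx
        else if (b < 0xF0) kStride[b] = 3;  // 1110xxxx
        else if (b < 0xF8) kStride[b] = 4;  // 11110xxx
        else if (b < 0xFC) kStride[b] = 5;  // 111110xx (pre-2003 forms)
        else               kStride[b] = 6;  // 1111110x (pre-2003 forms)
    }
}

// Sequential iteration is then one load and one add per character:
const uint8_t* nextChar(const uint8_t* p) { return p + kStride[*p]; }
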
May 21 2003
parent "Sean L. Palmer" <palmer.sean verizon.net> writes:
That lets you index sequentially pretty fast, but not randomly.

Sean

"Walter" <walter digitalmars.com> wrote in message
news:bagk8l$30ti$2 digitaldaemon.com...
 "Sean L. Palmer" <palmer.sean verizon.net> wrote in message
 news:b6bjg5$1ut5$1 digitaldaemon.com...
 "Matthew Wilson" <dmd synesis.com.au> wrote in message
 news:b6bgt5$1sai$1 digitaldaemon.com...
 This sounds like a nice idea - array of 1st-byte plus lookups. I'm
intrigued
 as to the nature of the lookup table. Is this a constant,
process-wide,
 entity?
No, because the map is indexed by the same index used to index into the
flat
 array.  Unless I'm misunderstanding something.
You could use a static 256 byte lookup table to give you the 'stride' to
the
 next char.
May 22 2003
prev sibling parent reply Mark Evans <Mark_member pathlink.com> writes:
This sounds like a nice idea - array of 1st-byte plus lookups.
Thanks. Correction, "array of first code words." Only in UTF-8 are they byte-sized.
I'm intrigued as to the nature of the lookup table. Is this a
constant, process-wide, entity?
No. There is one table per string.
I'd be keen to participate in the
serialisation stuff
No need for serialization. Even the compiler can do serialization with no memory footprint. Only something like an explicit conversion to ubyte[] would mandate that.
It's not clear now whether you've dropped the suggestion for a separate
string class, or just that arrays of "char" types would be dealt with in the
fashion that you've outlined.
I never suggested a string 'class,' just Unicode string and char intrinsic
types.  My list of proposed intrinsics has already been supplied.  Think int,
float, string8, string16, char8, etc.

C made a huge mistake in confusing arrays with strings.  Strings deserve
intrinsic status and a type all their own.  The ugly char/wchar gimmick has
also seen its day and needs replacement.

Mark

The internal implementation might read like this in C++-ish, heavy on the
"ish"; this is idealized, just a communication vehicle for the concept:

// code word storage types
typedef ubyte    UTF8_CODE;
typedef ushort   UTF16_CODE;
typedef uint     UTF32_CODE;

// max code words per Unicode character
const ushort     UTF8_CODE_MAX  = 6;
const ushort     UTF16_CODE_MAX = 2;
const ushort     UTF32_CODE_MAX = 1;

template <typename UTF_CODE, ushort UTF_CODE_MAX>
class ExtensionTableEntry
{
public:
    int       myStringPositionIndex;
    UTF_CODE  myStorage[UTF_CODE_MAX + 1];   // null terminated?
};

// a partially defined Unicode String class concept
template <typename UTF_CODE, ushort UTF_CODE_MAX>
class UnicodeString
{
public:
    long        length;
    UTF_CODE*   operator[](long index);      // pointer to the char's code words
private:
    UTF_CODE*   firstWordsArray;
    std::hash_map<int,
                  ExtensionTableEntry<UTF_CODE, UTF_CODE_MAX> >
                myLookup;
};

typedef UnicodeString<UTF8_CODE,  UTF8_CODE_MAX>  String8;
typedef UnicodeString<UTF16_CODE, UTF16_CODE_MAX> String16;
typedef UnicodeString<UTF32_CODE, UTF32_CODE_MAX> String32;

/* Walter - each table entry should hold the full Unicode char, not just its
extension codes.  This tactic would create some redundancy, but not much.
Having the whole character in contiguous memory could be advantageous for
passing pointers around.  So the C++ operator[] either returns a pointer into
firstWordsArray, or a pointer to the table entry's myStorage field.  In all
cases firstWordsArray always holds the first code word of the char, whether
it's an extended one or not. */
Apr 01 2003
parent reply "Sean L. Palmer" <palmer.sean verizon.net> writes:
The only problem with this idea is that passing this dual structure to a
piece of code that expects a linear string of data won't work.

Typecasting to ubyte[] or ushort[] should solve that, right?

You would probably need to know the length of such a string both in bytes
and in chars.
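
Both lengths are cheap under the sketch: character count is just the array
length, and byte count is one pass over the (small) table.  For example,
against the hypothetical String8 layout from earlier in the thread:

#include <cstddef>

// bytes = one lead word per character plus every stored tail word
size_t byteLength(const String8& s)
{
    size_t n = static_cast<size_t>(s.length);
    for (const auto& kv : s.table)
        n += kv.second.count;
    return n;
}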

Sean


"Mark Evans" <Mark_member pathlink.com> wrote in message
news:b6bpf9$22g9$1 digitaldaemon.com...
This sounds like a nice idea - array of 1st-byte plus lookups.
Thanks. Correction, "array of first code words." Only in UTF-8 are they byte-sized.
I'm intrigued as to the nature of the lookup table. Is this a
constant, process-wide, entity?
No. There is one table per string.
I'd be keen to participate in the
serialisation stuff
No need for serialization. Even the compiler can do serialization with no memory footprint. Only something like an explicit conversion to ubyte[] would mandate that.
It's not clear now whether you've dropped the suggestion for a separate
string class, or just that arrays of "char" types would be dealt with in
the
fashion that you've outlined.
I never suggested a string 'class,' just Unicode string and char intrinsic types. My list of proposed intrinsics has already been supplied. Think int, float, string8, string16, char 8, etc. C made a huge mistake in confusing arrays with strings. Strings deserve intrinsic status and a type all their own. The ugly char/wchar gimmick
has also
 seen its day and needs replacement.

 Mark

 The internal implementation might read like this in C++-ish, heavy on
 the "ish," this is the ideal, it's just a communication vehicle for
 the concept:

 // code word storage types
 typedef ubyte    UTF8_CODE;
 typedef ushort   UTF16_CODE;
 typedef uint     UTF32_CODE;

 // max code words per Unicode character
 const ushort     UTF8_CODE_MAX  = 6;
 const ushort     UTF16_CODE_MAX = 2;
 const ushort     UTF32_CODE_MAX = 1;

 template <typename UTF_CODE, ushort UTF_CODE_MAX>
 class ExtensionTableEntry
 {
 public:
 int       myStringPositionIndex;
 UTF_CODE  myStorage[UTF_CODE_MAX+1]; // null terminated?
 };

 // a partially defined Unicode String class concept
 template <typename UTF_CODE, ushort UTF_CODE_MAX>
 class UnicodeString
 {
 public:
 long                    length;
 UTF_CODE*               operator[];
 private:
 UTF_CODE*               firstWordsArray;
 std::hash_map<
 int,
 ExtensionTableEntry<UTF_CODE,UTF_CODE_MAX>
                       myLookup;
}; typedef UnicodeString<UTF8_CODE,UTF8_CODE_MAX> String8; typedef UnicodeString<UTF16_CODE,UTF16_CODE_MAX> String16; typedef UnicodeString<UTF32_CODE,UTF32_CODE_MAX> String32; /* Walter - each table entry should hold the full Unicode char not just its extension codes. This tactic would create some redundancy, but not much. Having the whole character in contiguous memory could be advantageous for passing pointers around. So the C++ operator[] either returns a pointer into the firstWordsArray, or a pointer to the table entry's myStorage field. In all cases the firstWordsArray always holds the first code word of the char, whether it's an extended one or not. */
Apr 01 2003
parent reply Mark Evans <Mark_member pathlink.com> writes:
Sean L. Palmer says...
The only problem with this idea is that passing this dual structure to a
piece of code that expects a linear string of data won't work.
Serialization at choke points has a cost of (a) zero, because the string has no extended codes (say typ. 95%+ of UTF-16 and by definition 100% of UTF-32), or (b) an alloc plus copy equivalent, which is acceptable for small to medium strings (another statistically large class in software programs). You run into problems only with large UTF-8 strings that are frequently passed to/from Unicode APIs. Windows uses UTF-16 so it's no problem. Where you find UTF-8 happening is on the web, but that has inherent delays of its own, so the cost might go unnoticed. Consider for example that plenty of web sites are driven with UTF-8 by languages far slower than D. Mark
Apr 01 2003
parent "Walter" <walter digitalmars.com> writes:
"Mark Evans" <Mark_member pathlink.com> wrote in message
news:b6dolr$di3$1 digitaldaemon.com...
 You run into problems only with large UTF-8 strings that are frequently
 passed to/from Unicode APIs.  Windows uses UTF-16 so it's no problem.  Where
 you find UTF-8 happening is on the web, but that has inherent delays of its
 own, so the cost might go unnoticed.  Consider for example that plenty of web
 sites are driven with UTF-8 by languages far slower than D.
I've been looking at some books for programming CGI apps in C. I see the dreaded buffer overflow errors in the sample code even in highly regarded books. No wonder security is such a mess! Doing CGI in D would eliminate those problems.
May 21 2003
prev sibling next sibling parent "Sean L. Palmer" <palmer.sean verizon.net> writes:
That's so crazy it just might work!  ;)

I think it's a fine concept.

One point I'd like to add is that when straight iterating over the string,
the library function can iterate over both the main array and the secondary
map at the same time, in sync, with no map lookups, only iteration.
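
Since the table is keyed (and therefore sorted) by character index, that
merged walk might look like this against the hypothetical String8 sketch from
earlier in the thread: one map iterator advanced in lock-step with the array
cursor, no per-character lookups.

#include <cstdint>

// Visit every character in order; 'it' stays in sync with i, so extended
// characters cost an iterator bump instead of a table lookup.
template <typename Visit>
void forEachChar(const String8& s, Visit visit)
{
    auto it = s.table.begin();             // next extended char, if any
    for (long i = 0; i < s.length; ++i) {
        uint8_t buf[6];
        buf[0] = s.firstWords[i];
        int n = 1;
        if (it != s.table.end() && it->first == i) {
            const Extension& e = it->second;
            for (int k = 0; k < e.count; ++k) buf[n++] = e.more[k];
            ++it;                          // stay in sync
        }
        visit(buf, n);                     // one whole character, n code words
    }
}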

This would be an interesting bit to actually implement.  But no harder than
the many other possible solutions, and easier and more efficient than most,
especially for random-access indexing, which seems to be what D is leaning
toward in general.

I'd prefer iteration to be the normal way of using D arrays, rather than
explicit loops and indexing.  Those are, for obvious reasons, difficult to
optimize.  But Walter has not decided on a good foreach construct, and
newsgroup discussion on the topic has died down.  Anyone have any good
proposals?  I haven't used any language that has good iterators, except if
you count C++ STL.

Sean

"Mark Evans" <Mark_member pathlink.com> wrote in message
news:b6bb6i$1ont$1 digitaldaemon.com...
 Walter -

 On a positive and constructive note, an implementation concept might hold
some
 interest.  I'm just bringing it to attention, not advocating yet <g>.

 There's no hard requirement for serial bytewise storage of the proposed
 intrinsic Unicode strings.  Other ways to build Unicode strings exist.
The one
 offered here would do little or no damage to the current compiler.  Really
it's
 just a set of small additions.

 Consider a Unicode string made of two data structures:  a C-style array,
and a
 lookup table.  The C-style array holds the first code word for each
character.
 The table holds all second, third, and additional code words.  (A 'code
word'
 meaning 8/16/32 bits for UTF 8/16/32 respectively.)  The keys to the table
are

they are
 accessed via some function like table_access(100).

 This setup unifies C array indices with Unicode character indices.  So D
can
 employ straight pointer arithmetic to find any character in the string.
 Character index = array index.  String length (in chars) = implementation
array
 size (in elements).  These features may address your hesitation over
 implementation issues that are complex in the serial case.

 Having found the character, D need only check the high bit(s) which flag
 additional code words.  Unicode requires such a test in any case; it's
 unavoidable.  If flagged, D performs a table lookup.  This table lookup is
the
 only serious runtime cost.  The table could take whatever form is most
 efficient.

 * UTF-32 has no extended codes, so UTF-32 strings don't need tables.
 * UTF-16 characters involve only a few percent with extended codes.
 Ergo - the table is small, and the runtime cost is, say, 2-3%.
 * UTF-8 needs the biggest and most table entries, but manageably so.

 A downside might be file and network serialization - but we might skate
by. D
 could supply streams on demand, without an intermediate serialized format.
If I
 tell D "write(myFile, myString)" no intermediate format is required.  D
can just
 empty the internal array and table to disk in proper byte sequence.  The
disk or
 network won't care how D get the bytes from memory.

 The only hard serialization requirement would be actual user conversion to
byte
 arrays.  (If the user is doing that, let him suffer!)

 This scheme supports 7-bit ASCII.  An optimization could yield raw C
speed. Put
 an extra boolean flag inside each string structure.  This flag is the
logical OR
 of all contained Unicode bit flags.  If the string has no extended chars,
the
 flag is FALSE, and D can use alternate string code on that basis.  (No bit
 tests, no table lookups.)  That works for UTF-32, 7-bit ASCII, and the
majority
 of UTF-16 strings.

 The idea can be nitpicked to death, but it's a concept.  Unicode strings
and
 characters will never enjoy the simplicity or speed of 7-bit ASCII.
That's a
 fact of life, meaning that implementation concepts cannot be faulted on
such a
 basis.

 What would be nice is to make Unicode maximally simple and maximally
efficient
 for D users.

 Thanks again Walter,

 Best-
 Mark
Apr 01 2003
prev sibling parent reply "Walter" <walter digitalmars.com> writes:
"Mark Evans" <Mark_member pathlink.com> wrote in message
news:b6bb6i$1ont$1 digitaldaemon.com...
 What would be nice is to make Unicode maximally simple and maximally
 efficient for D users.
I appreciate the thought, but carrying around an extra array for each string seems difficult to make work, especially in view of slicing, etc. I don't think there's any way to design the language so it is both efficient at dealing with ordinary ascii, and transparently able to do multibytes.
May 21 2003
parent Mark Evans <Mark_member pathlink.com> writes:
Walter wrote:
I appreciate the thought, but carrying around an extra array for each string
seems difficult to make work, especially in view of slicing, etc.
I would need a specific implementation code example to understand your thinking. (Clarification: I did not propose an extra array per string, but a lookup table -- something considerably smaller and often empty.) My gut says it would be easy.
I don't think there's any way to design the language so it is both efficient
at dealing with ordinary ascii, and transparently able to do multibytes.
The problem here is either/or thinking. Both are possible. People who desperately want C byte arrays can declare them, irrespective of Unicode strings.

If the idea is that an intrinsic string type must simultaneously support Unicode and ASCII at equal performance levels, then I think the problem is one of definition. In the first place D lacks an honest string intrinsic, so a new one could be defined just for Unicode, leaving the current whatever-it-is in place. If people don't care for Unicode, then they can use whatever-it-is D offers currently.

However my gut says that a Unicode string intrinsic holding just ASCII vs. an ASCII string as currently implemented would be neck and neck in terms of performance. Remember that you don't necessarily need a bit test on every character every time. The table object can flag callers when it's totally empty and they can proceed with manipulations on that basis. In that sense the Unicode concept is really just a superset of what you already have.

Considering the number of languages now being retrofitted for Unicode, I think it would be a mistake not to build it into D when the chance to do it cleanly exists, one that will be regretted later.

Best,
Mark
May 23 2003