
D - Unicode Character and String Intrinsics

reply Mark Evans <Mark_member pathlink.com> writes:
Walter says (in response to my post)...
 D needs a Unicode string primitive.


I'm dubious about this claim. ANSI C char arrays are UTF-8 too, if the contents
are 7-bit ASCII (a subset of UTF-8). That doesn't mean they support UTF-8.
UTF-8 is on D's very own 'to-do' list:
http://www.digitalmars.com/d/future.html

UTF-8 has a maximum encoding length of 6 bytes for one character. If such a
character appears at index 100 in char[] myString, what is the return value
from myString[100]? The answer should be "one UTF-8 char with an internal
6-byte representation." I don't think D does that.

Besides which, my idea was a native string primitive, not a quasi-array. The
confusion of strings with arrays was a basic, fundamental mistake of C. While
some string semantics do resemble those of arrays, this resemblance should not
mandate identical data types. Strings are important enough to merit their own
intrinsic type. Icon is not the only language to recognize that fact. D
documents make no mention of any string primitive:
http://www.digitalmars.com/d/type.html
D has two intrinsic character types, a dynamic array type, and _no_ intrinsic
string type.

Characters should be defined as UTF-8 or UTF-16 or UTF-32, not "short" and
"wide." The differing cross-platform widths of the 'wide' char is asking for
trouble; poof goes data portability. D characters are not based on Unicode,
but archaic MS Windows API and legacy C terminology spot-welded onto Linux.
How about Unicode as a basis?

The ideal type system would offer as intrinsic/primitive/native language types:

- UTF-8 char
- UTF-16 char
- UTF-32 char
- UTF-8 string
- UTF-16 string
- UTF-32 string
- built-in conversions between all of the above (e.g. UTF-8 to UTF-16)
- built-in conversions to/from UTF strings and C-style byte arrays

The preceding list will not seem very long when you consider how many numeric
types D supports. Strings are as important as numbers.

The old C 'char' type is merely a byte; D already has 'ubyte.' The distinction
between ubyte and char in D escapes me. Maybe the reasoning is that a char
might be 'wide' so D needs a separate type? But that reason disappears once
you have nice UTF characters. So even if the list is a bit long it also
eliminates two redundant types, char and wchar. I would not be against
retention of char and char[] for C compatibility purposes if someone could
point out why 'ubyte' and 'char[]' do not suffice. Otherwise I would just
alias 'char' into 'ubyte' and be done with it. The wchar could be stored
inside a UTF-16 or UTF-32 char, or be declared as a struct.

To the user, strings would act like dynamic arrays. Internally they are
different animals. Each 'element' of the 'array' can have varying length per
Unicode specifications. String primitives would hide Unicode complexity under
the hood.

That's just the beginning. Now that you have string intrinsics, you can give
them special behaviors pertaining to i/o streams and such. You can define
'streaming' conversions from other intrinsic types to strings for i/o
purposes. And...permit me to dream!...you can define Icon-style string
scanning expressions.

Mark
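The indexing pitfall described above can be sketched outside D - here in Python, since D's UTF-8 handling was incomplete at the time - using a 2-byte character in place of the hypothetical 6-byte one:

```python
# Python sketch (not D) of the indexing pitfall: a byte array holding
# UTF-8 does not index by character.
text = "héllo"                 # 'é' encodes to 2 bytes in UTF-8
raw = text.encode("utf-8")     # the char[]-like view: b'h\xc3\xa9llo'

# Indexing the byte view at position 1 yields a lone byte of 'é',
# not the assembled character.
assert raw[1:2] == b"\xc3"     # half of the two-byte sequence
assert text[1] == "é"          # a true character-level index

# A string primitive would make s[1] mean the character, hiding the
# variable-width representation underneath.
```

The same mismatch between byte position and character position is what the myString[100] question is probing.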
Mar 31 2003
next sibling parent Mark Evans <Mark_member pathlink.com> writes:
if someone could point out why 'ubyte' and 'char[]' do not suffice.

Typo: that was "why 'ubyte' and 'ubyte[]' do not suffice." - Mark
Mar 31 2003
prev sibling next sibling parent reply "Walter" <walter digitalmars.com> writes:
"Mark Evans" <Mark_member pathlink.com> wrote in message
news:b6abjh$12m8$1 digitaldaemon.com...
 Walter says (in response to my post)...
 D needs a Unicode string primitive.



 I'm dubious about this claim. ANSI C char arrays are UTF-8 too, if the
 contents are 7-bit ASCII (a subset of UTF-8). That doesn't mean they support
 UTF-8.

 UTF-8 is on D's very own 'to-do' list:
 http://www.digitalmars.com/d/future.html

It is incompletely implemented, sure.
 UTF-8 has a maximum encoding length of 6 bytes for one character. If such a
 character appears at index 100 in char[] myString, what is the return value
 from myString[100]? The answer should be "one UTF-8 char with an internal
 6-byte representation." I don't think D does that.

No, it doesn't do that. Sometimes you want the byte, sometimes the assembled unicode char.
 Besides which, my idea was a native string primitive, not a quasi-array. The
 confusion of strings with arrays was a basic, fundamental mistake of C. While
 some string semantics do resemble those of arrays, this resemblance should not
 mandate identical data types. Strings are important enough to merit their own
 intrinsic type. Icon is not the only language to recognize that fact. D
 documents make no mention of any string primitive:
 http://www.digitalmars.com/d/type.html
 D has two intrinsic character types, a dynamic array type, and _no_ intrinsic
 string type.

D does have an intrinsic string literal.
 Characters should be defined as UTF-8 or UTF-16 or UTF-32, not "short" and
 "wide." The differing cross-platform widths of the 'wide' char is asking for
 trouble; poof goes data portability. D characters are not based on Unicode,
 but archaic MS Windows API and legacy C terminology spot-welded onto Linux.
 How about Unicode as a basis?

Actually, this has changed. Wide chars are now fixed at 16 bits, i.e. UTF-16. For UTF-32, just use uint's.
 The ideal type system would offer as intrinsic/primitive/native language
 types:
 - UTF-8 char
 - UTF-16 char
 - UTF-32 char
 - UTF-8 string
 - UTF-16 string
 - UTF-32 string
 - built-in conversions between all of the above (e.g. UTF-8 to UTF-16)
 - built-in conversions to/from UTF strings and C-style byte arrays

 The preceding list will not seem very long when you consider how many numeric
 types D supports. Strings are as important as numbers.

That's actually pretty close to what D supports.
 The old C 'char' type is merely a byte; D already has 'ubyte.' The
 distinction between ubyte and char in D escapes me. Maybe the reasoning is
 that a char might be 'wide' so D needs a separate type? But that reason
 disappears once you have nice UTF characters. So even if the list is a bit
 long it also eliminates two redundant types, char and wchar.

The distinction is that char is UTF-8, and byte is, well, just a byte. The distinction comes in handy when dealing with overloaded functions.
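Walter's overloading rationale can be illustrated by analogy in Python (D resolves overloads statically; Python's functools.singledispatch dispatches on runtime type, but the motivation - distinct types for raw bytes versus text - is the same):

```python
from functools import singledispatch

# Analogy only: if bytes and characters shared one type, a function
# could not treat them differently based on which one it received.

@singledispatch
def describe(value):
    raise TypeError("unsupported")

@describe.register
def _(value: bytes):          # the 'ubyte[]' view
    return f"{len(value)} raw bytes"

@describe.register
def _(value: str):            # the 'char[]' (text) view
    return f"{len(value)} characters"

s = "héllo"
assert describe(s) == "5 characters"                  # character count
assert describe(s.encode("utf-8")) == "6 raw bytes"   # byte count ('é' is 2 bytes)
```

With only one type for both views, the two `describe` implementations could not coexist - which is the overloading argument in miniature.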
 I would not be against retention of char and char[] for C compatibility
 purposes if someone could point out why 'ubyte' and 'char[]' do not suffice.

Function overloading.
 Otherwise I would just alias 'char' into 'ubyte' and be done with it. The
 wchar could be stored inside a UTF-16 or UTF-32 char, or be declared as a
 struct.

 To the user, strings would act like dynamic arrays. Internally they are
 different animals. Each 'element' of the 'array' can have varying length per
 Unicode specifications. String primitives would hide Unicode complexity under
 the hood.

 That's just the beginning. Now that you have string intrinsics, you can give
 them special behaviors pertaining to i/o streams and such. You can define
 'streaming' conversions from other intrinsic types to strings for i/o
 purposes.
 And...permit me to dream!...you can define Icon-style string scanning
 expressions.

 Mark

Mar 31 2003
parent reply Mark Evans <Mark_member pathlink.com> writes:
The answer should be "one UTF-8 char with an internal 6-byte
representation."

No, it doesn't do that. Sometimes you want the byte, sometimes the assembled
unicode char.

But the only use for raw bytes is precisely such low-level format conversions as are proposed to go under the hood. String usage involves character analysis, not bit shuffling. There is a place for getting raw bytes, but a string subscript is not it. Maybe a typecast to ubyte[], and then an array subscript. The whole point of built-in Unicode support is to let users avoid dealing with bytes and let them deal with characters instead.
D does have an intrinsic string literal.

But it's not Unicode, just char or wchar. Those are both fixed byte-width, but all Unicode chars, except UTF-32, are variable byte-width.
Wide chars are now fixed at 16 bits, i.e. UTF-16.

Ditto. Wide chars are not UTF-16 chars since they are fixed width. UTF-16 characters can be 16 or 32 bits wide. (UTF-8 characters can be anywhere from 1 byte to 6 bytes wide.)
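The variable width Mark describes can be checked in Python (not D): a character outside the Basic Multilingual Plane occupies two 16-bit units in UTF-16.

```python
# UTF-16 code points above U+FFFF take two 16-bit units (a surrogate pair).
bmp = "A"              # U+0041, fits in one 16-bit unit
astral = "\U0001D11E"  # U+1D11E MUSICAL SYMBOL G CLEF, needs a pair

# utf-16-le avoids the 2-byte byte-order mark in the encoded output.
assert len(bmp.encode("utf-16-le")) == 2     # one 16-bit unit
assert len(astral.encode("utf-16-le")) == 4  # two 16-bit units

# UTF-8 is likewise variable: 1 to 4 bytes per character in the current
# standard (early drafts of UTF-8 allowed sequences up to 6 bytes).
assert len(astral.encode("utf-8")) == 4
```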
 For UTF-32, just use uint's.

Possible, but see my final point.
That's actually pretty close to what D supports.

I don't see anything close. (a) There is no Unicode string primitive (char[] is not a string primitive, let alone Unicode; it's an array type). (b) There are no Unicode characters. There are merely types with similar 'average' sizes being touted as Unicode capable (they are not).
 if someone could point out why 'ubyte' and 'ubyte[]' do not suffice.

Function overloading.

This comment is a logical contradiction with prior remarks. If the distinction between ubyte and char matters for this reason, then the same reason makes a difference between uint and UTF-32. But in the latter case you say to just use uint. You can't have it both ways. Thanks for taking all our thoughts into consideration. Mark
Mar 31 2003
next sibling parent reply "Walter" <walter digitalmars.com> writes:
"Mark Evans" <Mark_member pathlink.com> wrote in message
news:b6an4p$1bpj$1 digitaldaemon.com...
 The whole point of built-in Unicode support is to let users avoid
 dealing with bytes and let them deal with characters instead.

That's only partially true - the downside comes when you need high performance: you'll need byte indices, not UTF character strides. There is no getting away from the variable byte encoding. In my (limited) experience with string processing and UTF-8, rarely is it necessary to decode it. Most manipulation is done with indices.
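Walter's claim that most manipulation can stay at the byte-index level can be sketched in Python: because UTF-8 is self-synchronizing, a byte-level search never matches inside the middle of a multi-byte character, so byte indices are safe for search-and-slice work without decoding.

```python
# Byte-index manipulation on UTF-8 without decoding. UTF-8's design
# guarantees the encoding of one character never appears inside
# another's, so byte-level search cannot produce a false match.
haystack = "naïve café".encode("utf-8")
needle = "café".encode("utf-8")

i = haystack.find(needle)      # a byte index, not a character index
assert i >= 0
assert haystack[i:i + len(needle)].decode("utf-8") == "café"

# Slicing at an arbitrary byte offset, though, can split a character:
# haystack[:3] ends mid-'ï' and fails to decode strictly.
```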
D does have an intrinsic string literal.

But it's not Unicode, just char or wchar. Those are both fixed
byte-width, but all Unicode chars, except UTF-32, are variable byte-width.

No, in D, the intrinsic string literal is not just char or wchar. It's a unicode string - its internal format is not fixed until semantic processing, when it is adjusted to be UTF-8, -16, or -32 as needed.
Wide chars are now fixed at 16 bits, i.e. UTF-16.

Ditto. Wide chars are not UTF-16 chars since they are fixed width.

What I meant is they do not change size from implementation to implementation. They are 16 bits, and line up with the UTF-16 API's of Win32.
 UTF-16 characters can be 16 or 32 bits wide. (UTF-8 characters can be
 anywhere from 1 byte to 6 bytes wide.)

Yes.
 For UTF-32, just use uint's.

That's actually pretty close to what D supports.

I don't see anything close. (a) There is no Unicode string primitive
(char[] is not a string primitive, let alone Unicode; it's an array type).

I think that's a matter of perspective.
 (b) There are no Unicode characters. There are merely types
 with similar 'average' sizes being touted as Unicode capable (they are
 not).

I believe they are unicode capable. Now, I have not written the I/O routines so they will print as unicode, and there are other gaps in the implementation, but the core concept is there.
 if someone could point out why 'ubyte' and 'ubyte[]' do not suffice.


This comment is a logical contradiction with prior remarks. If the
distinction between ubyte and char matters for this reason, then the same
reason makes a difference between uint and UTF-32. But in the latter case you
say to just use uint. You can't have it both ways.

I think uint[] will well serve for UTF-32 because there is no need to be concerned about multiword encoding.
 Thanks for taking all our thoughts into consideration.

You're welcome.
Mar 31 2003
next sibling parent Mark Evans <Mark_member pathlink.com> writes:
 The whole point of built-in Unicode support is to let users avoid
 dealing with bytes and let them deal with characters instead.

That's only partially true - the downside comes when you need high
performance: you'll need byte indices, not UTF character strides. There is no
getting away from the variable byte encoding.

If I understand correctly, the translation is that it's better to let end users process bytes, so they can waste hours <g> tuning inner loops, than to offer language support, with pre-tuned inner loops. I don't see that. In fact native language support is better from a performance perspective (both in time of execution and in time of development).
 In my (limited) experience with string processing and UTF-8, rarely
 is it necessary to decode it. Most manipulation is done with
 indices.

Manipulation is done with indices in C, because that is all C offers. It's one of the big problems with C vis-a-vis Unicode.
 in D, the intrinsic string literal is not just char or wchar. It's a
 unicode string - its internal format is not fixed until semantic
 processing, when it is adjusted to be UTF-8, -16, or -32 as needed.

I think your definition of "Unicode" is basically wrong. What you are calling UTF-8 and UTF-16 is really just fixed-width slots that the user must conglomerate, not true native Unicode characters. So we are talking past each other. For example when you say "internal format" I don't suppose you have in mind that 6-byte-wide UTF-8 character I mentioned. When I say Unicode character, I mean an object that the language recognizes, intrinsically, as a variable-byte-width object, but which it presents to the user as an integrated (opaque) whole. I do not mean a user-defined conglomeration of fixed-width fields. That seems to be your working definition and it does not satisfy me.
Wide chars are now fixed at 16 bits, i.e. UTF-16.


What I meant is they do not change size from implementation to implementation.

That's what I understood you to mean; and that much is good, as far as it goes, but doesn't address Unicode.
 They are 16 bits, and line up with the UTF-16 API's of Win32.

If Windows supports full UTF-16, then D does not support UTF-16 API's of Win32 with any native data type. The user still faces the same labor (more or less) as supporting Unicode in ANSI C.
 I think that's a matter of perspective.... I believe they are
 unicode capable. Now, I have not written the I/O routines so they
 will print as unicode, and there are other gaps in the
 implementation, but the core concept is there.

I've tried to explain why there is no Unicode character in D, and on that basis alone, I could say there is no Unicode string in D. The syntax and semantics of char[] are identical across all types of arrays, not limited to strings. (What syntax or semantics are unique to strings?) End users can create and manipulate almost any data structure -- any collection of bits -- in D, or for that matter C, or assembly language, or even machine language. What I'm talking about is intrinsic language support to save the labor (and mistakes). I could build Unicode strings with a Turing machine if I wanted to. That's not "language support" in my book. Saying that we already have 8-bit things, and 16-bit things, and 32-bit things, and that users can do Unicode by combining these things in various ways, is not a reasonable argument that the language supports Unicode. At best one might say, D does not prevent users from implementing Unicode, if they want to take the extra trouble.
 if someone could point out why 'ubyte' and 'ubyte[]' do not suffice.


This comment is a logical contradiction with prior remarks. If the
distinction between ubyte and char matters for this reason, then the same
reason makes a difference between uint and UTF-32. But in the latter case you
say to just use uint. You can't have it both ways.

I think uint[] will well serve for UTF-32 because there is no need to be concerned about multiword encoding.

Then you are ignoring your own argument about function overloading! :-) Mark
Mar 31 2003
prev sibling next sibling parent "Sean L. Palmer" <palmer.sean verizon.net> writes:
"Walter" <walter digitalmars.com> wrote in message
news:b6aoge$1cnp$1 digitaldaemon.com...
 if someone could point out why 'ubyte' and 'ubyte[]' do not suffice.


This comment is a logical contradiction with prior remarks. If the
distinction between ubyte and char matters for this reason, then the same
reason makes a difference between uint and UTF-32. But in the latter case you
say to just use uint. You can't have it both ways.

I think uint[] will well serve for UTF-32 because there is no need to be concerned about multiword encoding.

But there is still concern for there to be a separate type, for function overloading. Otherwise, how shall we print a Unicode character higher than position 0xFFFF? Perhaps the basic char type would actually be 32 bits and capable of holding any Unicode character? And when used in array form, char[] would transmogrify into UTF-8? Would we then even need wchar? Obviously this Unicode thing is a whole can of worms. Too bad we can't get everyone to forget about enough characters that they all fit in 16 bits! ;) Sean
Apr 01 2003
prev sibling parent reply Ilya Minkov <midiclub 8ung.at> writes:
Walter wrote:
 That's only partially true - the downside comes from needing high
 performance you'll need byte indices, not UTF character strides. There is no
 getting away from the variable byte encoding. In my (limited) experience
 with string processing and UTF-8, rarely is it necessary to decode it. Most
 manipulation is done with indices.

Wait... won't language-supported iterators fix a need for accessing the
underlying array indices directly? I *definitely* don't want to know anything
about underlying format, which can be really anything - UTF-8/16/32, or even
an aggregate of 2 arrays like i or Mark have proposed.

Walter, you also don't: look what i found in this newsgroup. :) And you claim
it to be better to work with pointers into a char[], pretending it was an
UTF-8 string!!!

--- 8< ---
At one time I had written a lexer that handled utf-8 source. It turned out to
cause a lot of problems because strings could no longer be simply indexed by
character position, nor could pointers be arbitrarily incremented and
decremented. It turned out to be a lot of trouble and I finally converted it
to wchar's.
--- >8 ---

BTW, as to the possibilities that Mark wishes for himself, i've dug his
message up, which was posted as i wasn't around yet. Here.

--- 8< ---
Short summaries here:
http://www.nmt.edu/tcc/help/lang/icon/positions.html
http://www.nmt.edu/tcc/help/lang/icon/substring.html
http://www.cs.arizona.edu/icon/docs/ipd266.htm
http://www.toolsofcomputing.com/IconHandbook/IconHandbook.pdf
Sections 6.2 and following.

Icon is simply unsurpassed in string processing and is for that reason famous
among linguists. There is more to the string processing than just character
position indices. Icon supports special clauses called "string scanning
environments" which work like file i/o in a vague analogy. (See third link
above, section 3.) Icon also has nice built-in structures like sets
(*character sets* turn out to be insanely useful), hash tables, and lists.

Somehow Icon never made it to the Big Leagues and that is a shame. It
deserves to be up there with Perl. Icon is wicked fast when written
correctly. The Unicon project is the next-generation Icon, and has added
objects and other modern features to base Icon. It is on SourceForge.

(There was only one project in which I recall desiring a new Icon built-in.
I wanted a two-way hash table which could index off of either data column. The workaround was to implement two mutually mirroring one-way hash tables.) Icon has a very interesting 'success/failure' paradigm which might also be something to study, esp. in light of D's contract emphasis. The unique 'goal-directed' paradigm is quite interesting but may have no application to D. I have for a very long time desired Icon's string scanning capabilities in my C/C++ programs. Even with std::string or string classes from various class libraries (I've used them all), there is just no comparison with Icon. I would become a total D convert if it could do strings like Icon. Mark http://www.cs.arizona.edu/icon/ http://unicon.sourceforge.net/index.html --- >8 --- -i.
Apr 10 2003
parent Helmut Leitner <leitner hls.via.at> writes:
Ilya Minkov wrote:
 I have for a very long time desired Icon's string scanning capabilities
 in my C/C++ programs.  Even with std::string or string classes from
 various class libraries (I've used them all), there is just no
 comparison with Icon.  I would become a total D convert if it could do
 strings like Icon.

Being used to Perl, I think that the current D regex module has to be extended. In what way does Icon differ (or have advantages) in string processing compared to Perl? -- Helmut Leitner leitner hls.via.at Graz, Austria www.hls-software.com
Apr 11 2003
prev sibling parent "Matthew Wilson" <dmd synesis.com.au> writes:
 This comment is a logical contradiction with prior remarks. If the
 distinction between ubyte and char matters for this reason, then the
 same reason makes a difference between uint and UTF-32. But in the
 latter case you say to just use uint. You can't have it both ways.

Agree. Let's have more char types
Mar 31 2003
prev sibling next sibling parent reply Mark Evans <Mark_member pathlink.com> writes:
Hi again Bill

After your 'meta-programming' talk I shudder to think what your idea
of a maximalist is...maybe a computer that writes a compiler that
generates source code for a computer motherboard design program to
construct another computer that...

Under my scheme we gain 3 character types and drop 2: net gain 1. We
gain 3 string types and drop 1: net gain 2. Total net gain, 3 types.
What does that buy us? Complete internationalization of D, complete
freedom from ugly C string idioms, data portability across platforms,
ease of interfacing with Win32 APIs and other software languages.

The idea of "just one" Unicode type holds little water. Why don't you
make the same argument about numeric types, of which we have some
twenty-odd? Or how about if D offered just one data type, the bit, and
let you construct everything else from that? If D does Unicode then D
should do it right. It's a poor, asymmetric design to have some
Unicode built-in and the rest tacked on as library routines.

Mark


 This is a rare occasion when I agree with Mark. The fact that a
 minimalist like me, and a maximalist like Mark, and a pragmatist
 like yourself seem to agree is something Walter should consider. I
 would want to hold built-in string support to just UTF-8. D could
 offer some support for the other formats through conversion routines
 in a standard library. Having a single string format would surely be
 simpler than supporting them all. Bill

Mar 31 2003
next sibling parent reply "Matthew Wilson" <dmd synesis.com.au> writes:
I'm sold. Where can I sign up?

I presume you'll be working on the libraries ... ;)

To suck up: I've been faffing around with this issue for years, and have
been (unjustifiably, in my opinion) called on numerous times to expertly
opine on it for clients. (My expertise is limited to the C/C++
char/wchar_t/horrid-TCHAR type stuff, which I'm well aware is not the full
picture.) Your discussion here is the first time I even get a hint that I'm
listening to someone who knows what they're talking about. It's nasty,
nasty stuff, and I hope that your promise can bear fruit for D. If it can,
then it'll earn massive brownie points for D over its peer languages.
There's a big market out there of people whose character sets don't fall
into 7-bits ...



"Mark Evans" <Mark_member pathlink.com> wrote in message
news:b6al79$1ahd$1 digitaldaemon.com...
 Hi again Bill

 After your 'meta-programming' talk I shudder to think what your idea
 of a maximalist is...maybe a computer that writes a compiler that
 generates source code for a computer motherboard design program to
 construct another computer that...

 Under my scheme we gain 3 character types and drop 2: net gain 1. We
 gain 3 string types and drop 1: net gain 2. Total net gain, 3 types.
 What does that buy us? Complete internationalization of D, complete
 freedom from ugly C string idioms, data portability across platforms,
 ease of interfacing with Win32 APIs and other software languages.

 The idea of "just one" Unicode type holds little water. Why don't you
 make the same argument about numeric types, of which we have some
 twenty-odd? Or how about if D offered just one data type, the bit, and
 let you construct everything else from that? If D does Unicode then D
 should do it right. It's a poor, asymmetric design to have some
 Unicode built-in and the rest tacked on as library routines.

 Mark


 This is a rare occasion when I agree with Mark. The fact that a
 minimalist like me, and a maximalist like Mark, and a pragmatist
 like yourself seem to agree is something Walter should consider. I
 would want to hold built-in string support to just UTF-8. D could
 offer some support for the other formats through conversion routines
 in a standard library. Having a single string format would surely be
 simpler than supporting them all. Bill


Mar 31 2003
parent reply "Peter Hercek" <vvp no.post.spam.sk> writes:
Well, I went through character and code page problems too about a year
 ago. Very bad experience in C/C++ ... (I'm from place where 7 bits
 is not enough). I have two points about this:
1) D should support characters and not bytes (8bits) or words (16bits);
 when I'm indexing a string I do so by characters and not by a byte multiply;
 if I would want to index by eg bytes I would ask for the string byte length and
 cast to a byte array
2) Support for 3 character types (UTF8, UTF16, UTF32) is handy, but
 not critical (can be solved by conversion functions); actually for one
 character only, UTF32 has the shortest representation; it may be also
 interesting not to be able to specify the exact encoding for a string
 (as opposed to an encoding for a character) - let the compiler decide
 what is the best representation (maybe some optimization can be
 achieved based on this later; eg the compiler can decide to store strings
 in partially balanced trees like STLPort does for ropes, but with
 possibly different encodings for different nodes ... whatever, just
 writing down my thoughts)


"Matthew Wilson" <dmd synesis.com.au> wrote in message
news:b6aq84$1dn4$1 digitaldaemon.com...
 I'm sold. Where can I sign up?

 I presume you'll be working on the libraries ... ;)

 To suck up: I've been faffing around with this issue for years, and have
 been (unjustifiably, in my opinion) called on numerous times to expertly
 opine on it for clients. (My expertise is limited to the C/C++
 char/wchar_t/horrid-TCHAR type stuff, which I'm well aware is not the full
 picture.) Your discussion here is the first time I even get a hint that I'm
 listening to someone who knows what they're talking about. It's nasty,
 nasty stuff, and I hope that your promise can bear fruit for D. If it can,
 then it'll earn massive brownie points for D over its peer languages.
 There's a big market out there of people whose character sets don't fall
 into 7-bits ...



 "Mark Evans" <Mark_member pathlink.com> wrote in message
 news:b6al79$1ahd$1 digitaldaemon.com...
 Hi again Bill

 After your 'meta-programming' talk I shudder to think what your idea
 of a maximalist is...maybe a computer that writes a compiler that
 generates source code for a computer motherboard design program to
 construct another computer that...

 Under my scheme we gain 3 character types and drop 2: net gain 1. We
 gain 3 string types and drop 1: net gain 2. Total net gain, 3 types.
 What does that buy us? Complete internationalization of D, complete
 freedom from ugly C string idioms, data portability across platforms,
 ease of interfacing with Win32 APIs and other software languages.

 The idea of "just one" Unicode type holds little water. Why don't you
 make the same argument about numeric types, of which we have some
 twenty-odd? Or how about if D offered just one data type, the bit, and
 let you construct everything else from that? If D does Unicode then D
 should do it right. It's a poor, asymmetric design to have some
 Unicode built-in and the rest tacked on as library routines.

 Mark


 This is a rare occasion when I agree with Mark. The fact that a
 minimalist like me, and a maximalist like Mark, and a pragmatist
 like yourself seem to agree is something Walter should consider. I
 would want to hold built-in string support to just UTF-8. D could
 offer some support for the other formats through conversion routines
 in a standard library. Having a single string format would surely be
 simpler than supporting them all. Bill



Mar 31 2003
parent Ilya Minkov <midiclub 8ung.at> writes:
Peter Hercek wrote:
 Well, I went through character and code page problems too about a year
  ago. Very bad experience in C/C++ ... (I'm from place where 7 bits
  is not enough). I have two points about this:

Me too :)
 1) D should support characters and not bytes (8bits) or words (16bits);
 when I'm indexing a string I do so by characters and not by a byte multiply;
 if I would want to index by eg bytes I would ask for the string byte length and
 cast to a byte array

Right.
 2) Support for 3 character types (UTF8, UTF16, UTF32) is handy, but
 not critical (can be solved by conversion functions); actually for one
 character only, UTF32 has the shortest representation; it may be also
 interesting not to be able to specify the exact encoding for a string
 (as opposed to an encoding for a character) - let the compiler decide
 what is the best representation (maybe some optimization can be
 achieved based on this later; eg the compiler can decide to store strings
 in partially balanced trees like STLPort does for ropes, but with
 possibly different encodings for different nodes ... whatever, just
 writing down my thoughts)

UTF-32 doesn't have the shortest representation, since "in all 3 encodings
[i.e. UTF-8/16/32] the maximum possible character representation length is 4
bytes", as the official description says. Though i agree that it's the most
practical one, in part because working with an array of longs is nowadays
faster than an array of shorts. This is an implementation detail and should
not matter though, because whatever the string implementation is, it should
hide the underlying complexity.

What matters though is that in UNICODE there are 2 kinds of characters -
normal and modifiers. So an "ä" can be represented as well as "a" and a
special accent symbol. I'm pretty much sure you want to access these as a
whole, not separately.

-i.
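The "ä" point above can be checked in Python (not D) with the standard unicodedata module: the same visible character exists as one precomposed code point or as a base letter plus a combining mark, and normalization converts between the two forms.

```python
import unicodedata

# The same visible "ä": precomposed (one code point) versus
# decomposed (base letter 'a' plus a combining diaeresis).
precomposed = "\u00E4"  # LATIN SMALL LETTER A WITH DIAERESIS
decomposed = "a\u0308"  # 'a' + COMBINING DIAERESIS

assert precomposed != decomposed  # different code point sequences
assert len(precomposed) == 1 and len(decomposed) == 2

# Normalization makes them comparable, in either direction.
assert unicodedata.normalize("NFC", decomposed) == precomposed
assert unicodedata.normalize("NFD", precomposed) == decomposed
```

A string type that exposed raw code points would hand the user both forms as unequal; accessing "these as a whole" means comparing at the normalized, character level.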
Apr 10 2003
prev sibling parent reply Bill Cox <Bill_member pathlink.com> writes:
In article <b6al79$1ahd$1 digitaldaemon.com>, Mark Evans says...
Hi again Bill

After your 'meta-programming' talk I shudder to think what your idea
of a maximalist is...maybe a computer that writes a compiler that
generates source code for a computer motherboard design program to
construct another computer that...

A maximalist wants many built-in features, from functional programming support, to multimethods, to support of every character format known to man. Not in libraries, where we could all contribute, but built-in, where Walter has to write it. As a minimalist, I'd settle for features that allow me to add the features I need to the language in libraries. The meta-programming stuff I'd mentioned leads in that direction. Bill
Mar 31 2003
next sibling parent "Matthew Wilson" <dmd synesis.com.au> writes:
And a pragmatist wants as much as is possible in libraries, but what he/she
feels must be in the compiler because of the likelihood of stuff-ups if left
to the full spectrum of the developer community (such as meaningful ==,
string types and my auto-stringise thingo with char null *)

"Bill Cox" <Bill_member pathlink.com> wrote in message
news:b6b05r$1hsv$1 digitaldaemon.com...
 In article <b6al79$1ahd$1 digitaldaemon.com>, Mark Evans says...
Hi again Bill

After your 'meta-programming' talk I shudder to think what your idea
of a maximalist is...maybe a computer that writes a compiler that
generates source code for a computer motherboard design program to
construct another computer that...

 A maximalist wants many built-in features, from functional programming
 support, to multimethods, to support of every character format known to man.
 Not in libraries, where we could all contribute, but built-in, where Walter
 has to write it.

 As a minimalist, I'd settle for features that allow me to add the features I
 need to the language in libraries.  The meta-programming stuff I'd mentioned
 leads in that direction.

 Bill

Mar 31 2003
prev sibling next sibling parent reply Mark Evans <Mark_member pathlink.com> writes:
Bill, the point is that trying to paint me this or that color, instead of
focusing on something specific, is ad hominem.  I find it patronizing.
Especially since on this point you've already agreed with me explicitly.

We can quibble on specifics.  I want 3 char types, you want 2 (UTF8 + char) or
maybe even 3 (UTF8 + char + wchar).

I have much to say about those bizarre meta programming concepts.  I have worked
in EDA and know that domain - you can't blow smoke in my face, even if others
are impressed.  All I would say here is that by your own admission, you're
trying to write code for 'average' or 'dumb' programmers, so please focus on
doing just that.

Mark
Mar 31 2003
parent reply Bill Cox <bill viasic.com> writes:
Hi, Mark.

Mark Evans wrote:
 Bill the point is that trying to paint me this or that color, instead of
 focusing on something specific, is ad hominem.  I find it patronizing.
 Especially since on this point you've already agreed with me explicitly.
 
 We can quibble on specifics.  I want 3 char types, you want 2 (UTF8 + char) or
 maybe even 3 (UTF8 + char + wchar).
 
 I have much to say about those bizarre meta programming concepts.  I have worked
 in EDA and know that domain - you can't blow smoke in my face, even if others
 are impressed.  All I would say here is that by your own admission, you're
 trying to write code for 'average' or 'dumb' programmers, so please focus on
 doing just that.

Ok, I'll bite... Why do you feel I'm blowing smoke in your face?

As for the meta-programming stuff, we use DataDraw today to do lots of it, and I find it very productive, particularly for our EDA work. In particular, we added dynamic class extensions, recursive destructors, array bounds checking, and pointer indirection checking to C. The code generators also give us much of the power of template frameworks. We also use a memory mapping model that works great on 64-bit machines, where EDA is headed fast (we use the Sheesh Kabob code generator). All of these have very specific benefits for EDA, which I've covered in previous posts. Before calling it bizarre, why not look into it? A fairly recent version of DataDraw is available at:

http://www.viasic.com/download/datadraw.tar.gz

Most GUI programmers use Class Wizard, which is much the same kind of thing. Should that capability be in the language? Possibly. The concept has been researched by other groups, and one way to do it is to add "compile-time reflection classes" to the language. OpenC++ is one example of this approach. XL does it, too.

Also, we don't hire average or dumb programmers. We hire brilliant programmers, and train them to code as if the target audience were stupid people. This really helps them work together, and helps the code last over time. It helps our business output a consistent product - the code looks much the same no matter who wrote it. There are good business reasons for this.

Putting a restrictive coding methodology in place doesn't restrict how an algorithm works, just how the implementation looks. So far, there have been exactly 0 algorithms that had to be changed in order to fit into our methodology. We encourage our programmers to be as creative as possible in algorithm development, and to come up with brilliant solutions. We enable them to implement those algorithms quickly and efficiently with a consistent, solid, and proven coding methodology. They spend less time thinking about how to write code, and more time writing it. It's one of our competitive tools for success.

Bill
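The general technique described above - generating target-language code from a model of the data structures - can be sketched briefly. This is a hypothetical Python illustration only; the `schema` layout and `emit_c_struct` helper are invented for this example and are not DataDraw's actual format or API:

```python
# Hypothetical sketch of schema-driven code generation: a data-structure
# model in, C struct source text out.
schema = {
    "Point": [("x", "int"), ("y", "int")],
    "Rect":  [("topLeft", "Point"), ("size", "Point")],
}

def emit_c_struct(name, fields):
    """Render one schema entry as C struct source text."""
    lines = [f"typedef struct {name} {{"]
    lines += [f"    {ftype} {fname};" for fname, ftype in fields]
    lines.append(f"}} {name};")
    return "\n".join(lines)

generated = "\n\n".join(emit_c_struct(n, f) for n, f in schema.items())
print(generated)
```

The same model can feed several emitters (bounds-checked accessors, destructors, and so on), which is where the per-language, per-style code-generator explosion mentioned later in the thread comes from.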
Apr 01 2003
next sibling parent Mark Evans <Mark_member pathlink.com> writes:
Please don't turn this into yet another thread about DataDraw or dubious
management 'expertise.'  (Put up a wiki board somewhere, OK?  I could show you
five different ways from Sunday to replace DataDraw with better code using
standard languages/libraries/mixins/design patterns/tools of which you seem
ignorant.  Sorry you'll have to pay me though.)

Thank you for supporting the idea that D needs some kind of native Unicode
support.

Mark
Apr 01 2003
prev sibling parent reply Helmut Leitner <helmut.leitner chello.at> writes:
Bill Cox wrote:
 Before calling it bizarre, why not look into it?  A fairly receint
 version of DataDraw is available at:
 
 http://www.viasic.com/download/datadraw.tar.gz

When I read one of your postings a week ago, I googled for DataDraw and didn't find references or a download page, although you said it is open source. I found this very weird. I also didn't get the impression that you were connected to the project. Now I see in the About box that you are the lead developer...

There is no LICENSE. The documentation is so incomplete that I wouldn't even start trying to use it (although its date says 1993).

There are surely better ways to advertise your project. Why don't you set up an official OS project at SourceForge and complete the documentation?

-- Helmut Leitner leitner hls.via.at Graz, Austria www.hls-software.com
Apr 01 2003
parent reply Bill Cox <bill viasic.com> writes:
Hi, Helmut.

Helmut Leitner wrote:
 
 Bill Cox wrote:
 
Before calling it bizarre, why not look into it?  A fairly receint
version of DataDraw is available at:

http://www.viasic.com/download/datadraw.tar.gz

When I read one of your postings a week ago, I googled for DataDraw and didn't find references or a download page, although you said it is open source. I found this very weird. I also didn't get the impression that you were connected to the project. Now I see in the About box that you are the lead developer...

There is no LICENSE. The documentation is so incomplete that I wouldn't even start trying to use it (although its date says 1993).

There are surely better ways to advertise your project. Why don't you set up an official OS project at SourceForge and complete the documentation?

-- Helmut Leitner leitner hls.via.at Graz, Austria www.hls-software.com

I'm not trying to advertise DataDraw. In fact, I'd love to see D incorporate features that would allow me to kill it. I'd prefer that users didn't start adopting DataDraw, as I don't have the time to do free support. It's open-source, as the copyright file describes. It's a very weak copyright, meant to be weaker than the GNU GPL. The documentation sucks, and I think it will probably stay that way.

I did write the first version, and placed it into the open-source domain. The guys who wrote the second one kept me listed in the About box, but I didn't write the code. So far as I know, DataDraw is only in use at ViASIC (my company), QuickLogic, and Synplicity. None of these companies has any reason to promote it.

It's specific insights I've gained in working with DataDraw that I've been trying to describe in this group, rather than trying to promote DataDraw. I only posted it because someone asked me to, and the license requires that I do. Through using DataDraw for many years, however, I think I've had some fairly unique insights into language design. Adding features to a target language is what DataDraw is for, and I've been able to try out several features not found in C++ in a real industrial coding environment. Some of those features I've described in other posts.

As I said, I was hoping D could be extended to make DataDraw obsolete. That turns out not to be the case. I'll describe some of my current thinking about this matter below.

DataDraw currently just models data structures, and allows me to write code generators. This is much like the old OM tool for UML (which DataDraw precedes). It gives me the power of compile-time reflection classes, like those in OpenC++. However, for each new language, or coding style, I have to write a new code generator, and these things get really complex. DataDraw currently has 5. That kind of sucks. Instead, DataDraw should allow me to write one awesome code generator that targets an intermediate language. Then, it should allow me to write simple translators for each target language and coding style. The bulk of the work could then be shared.

With a built-in language translator, DataDraw would be much simpler than it is now. However, with a built-in language translator, DataDraw becomes a language in itself. What's unique about it? Simple. It's extendable by me and others I work with who are familiar with the DataDraw code base. I can generate code of any type, and add literally any feature I wish. However, I do that by directly editing the code generators, which are written in C and which link into DataDraw's database. That's not elegant, or usable by anyone not familiar with the DataDraw code base, although it does cover my needs.

So, I've been looking into what it takes to get the same power, but in a language that anyone could work with. In particular, I've been examining what it would take for D to cover DataDraw's functionality. That, it turns out, is hard (which is one reason the XL compiler isn't done). The more power you give the user, the more you open up the internals of the compiler, and the more complex you make the language.

For example, to do that in D, a natural way would be to make Walter's representation of D as data structures part of the language definition (thus greatly restricting how D compilers are built). Then, you could offer access to reflection classes at compile time (as OpenC++ does). A natural way to use these classes at compile time is to interpret D code. Now, you have to write a D interpreter as well as a compiler. This is the approach taken by VHDL for their generators, and it really complicated implementations of compilers. An alternative is to re-compile the compiler instead. This is a bit brain-bending, but I think getting rid of the interpreter is worth it. Besides, I already recompile DataDraw every time I fix or add a feature, and that's never been much of a problem.

Even if we added compile-time reflection classes, I still don't get all the power of DataDraw, which I can extend in any way, because I directly edit the source. What's still missing?

For one thing, reflection classes can't be used to add syntax to the language. That's a serious limitation. XL's approach allows some syntax extension. Scheme also has a nice mechanism. However, both systems are limited, and complex, and slow. I'm toying with another approach that is easy if you already allow users to compile custom versions of the compiler (which you do to get rid of the interpreter). Just provide a simple mechanism for generating a syntax description for use by bison. That nails the problem. Any new syntax can then be added by a user, so long as it's compatible with what's already there. A drawback is that bison now becomes part of the language, along with all its quirks and strong points. At least bison is pretty much available everywhere.

Just adding new syntax to the language doesn't get you all the way there. You still are stuck with those reflection classes used to model the language. If you have a new construct to implement, you can add the syntax, but what objects do you build to represent it? The reflection classes themselves need to be extendable. Really. At that point, nothing in the language is left as non-configurable. You're stuck with LALR(1) parsers, but that's no big deal.

However, adding reflection classes is tricky. Being C-derived, the language still needs to link with the C linker, including the compiler itself, especially if users are going to compile custom compilers for their applications. That means that new types can't be added to the compiler's database, since C libraries are limited that way. I'm currently toying with the age-old style of non-typed syntax trees rather than fully typed reflection classes. It looks like it will work out, but in the end, all this has done is provide a compiler that's easy to extend. It's easy to extend because its parser and internal data structures are simple, and extendable. Plug-ins should be easy to write. However, it's not really a standard language any more. It's just a customizable compiler that's fairly easy to work with.

I'm left with the conclusion that D can't be enhanced to be extendable the way XL wants to be, or the way I'd like D to be. I don't see how D can get there from here.

Bill
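The "non-typed syntax tree" style mentioned above can be sketched briefly. This is a hypothetical Python illustration (the tree shape and `render` helper are invented, not DataDraw's actual representation): nodes are plain nested lists of `(tag, *children)`, so a brand-new construct only needs a new tag string, not a new node class in the compiler:

```python
# Hypothetical sketch of an untyped syntax tree: plain nested lists.
def render(node, indent=0):
    """Return pretty-printed lines for a nested-list tree; unknown
    tags need no code changes, which is the point of the style."""
    tag, *children = node
    lines = ["  " * indent + tag]
    for child in children:
        if isinstance(child, list):
            lines += render(child, indent + 1)
        else:
            lines.append("  " * (indent + 1) + repr(child))
    return lines

tree = ["if", ["cmp", "<", "x", 10], ["call", "f", "x"]]
print("\n".join(render(tree)))
```

The trade-off is exactly the one Bill names: the tree accepts anything, so type errors that reflection classes would catch at compile time surface only when a generator walks the tree.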
Apr 02 2003
parent reply Helmut Leitner <helmut.leitner chello.at> writes:
Bill Cox wrote:
 There are surely better ways to advertise you project.

I'm not trying to advertise DataDraw. In fact, I'd love to see D incorporate features that would allow me to kill it. I'd prefer that users didn't start adopting DataDraw, as I don't have the time to do free support.

Ok, I think it's good to have this said.
 It's open-source, as the copyright file describes.  
 The documentation sucks, and I think it will probably stay that way.

That means it's dead outside the heads of its few experts and will remain so.
 ...
 It's specific insights I've gained in working with DataDraw that I've
 been trying to describe in this group, rather than trying to promote
 DataDraw. 
 ...

I'm very interested in your experiences and insights. I've been doing software projects since 1979 and feel very strongly about the way systems present themselves to the programmer (APIs).
 Through using DataDraw for many years, however, I think I've had some
 fairly unique insights into language design.  Adding features to a
 target language is what DataDraw is for, and I've been able to try out
 several features not found in C++ in a real industrial coding
 environment.  Some of those features I've described in other posts.

I'll try to reread some of your postings and arguments. Can you give me some hints to find my way?
 As I said, I was hoping D could be extended to make DataDraw obsolete.
 That turns out not to be the case.  I'll describe some of my current
 thinking about this matter below.
 
 DataDraw currently just models data structures, and allows me to write
 code generators.  This is much like the old OM tool for UML (which
 DataDraw precedes).  It gives me the power of compile-time reflection
 classes, like those in OpenC++.  However, for each new language, or
 coding style, I have to write a new code generator, and these things get
 really complex.  DataDraw currently has 5.  That kind of sucks.
 
 Instead, DataDraw should allow me to write one awesome code generator
 that targets an intermediate language.  Then, it should allow me to
 write simple translators for each target language and coding style.  The
 bulk of the work could then be shared.

That's a natural idea that doesn't seem to work. I think Charles Simonyi put 10 years into Intentional Programming following similar ideas, and they burned millions of dollars.
 With a built-in language translator, DataDraw would be much simpler than
 it is now.  However, with a built-in language translator, DataDraw
 becomes a language in itself.  What's unique about it?  Simple.  It's
 extendable by me and others I work with who are familiar with the
 DataDraw code base.  I can generate code of any type, and add literally
 any feature I wish.  However, I do that by directly editing the code
 generators, which are written in C and which link into DataDraw's
 database.  That's not elegant, or usable by anyone not familiar with the
 DataDraw code base, although it does cover my needs.

This is a certain way to solve problems but it may or may not be optimal. The fact that you have this tool at hand gives power but may mislead.
 So, I've been looking into what it takes to get the same power, but in a
 language that anyone could work with.  In particular, I've been
 examining what it would take for D to cover DataDraw's functionality.

Analytically this is not a goal. The goal is to enable programmers to write great applications. What are their problems and how can they be solved?
 That, it turns out, is hard (which is one reason the XL compiler isn't
 done).  The more power you give the user, the more you open up the
 internals of the compiler, and the more complex you make the language.

I agree. I think this is the problem of C++ itself. Too much complexity for too little gain.
 For example, to do that in D, a natural way would be to make Walter's
 representation of D as data structures part of the language definition
 (thus greatly restricting how D compilers are built).  Then, you could
 offer access to reflection classes at compile time (as OpenC++ does).  A
 natural way to use these classes at compile time is to interpret D code.
   Now, you have to write a D interpreter as well as a compiler.  This is
 the approach taken by VHDL for their generators, and it really
 complicated implementations of compilers.  An alternative is to
 re-compile the compiler instead.  This is a bit brain-bending, but I
 think getting rid of the interpreter is worth it.  Besides, I already
 recompile DataDraw every time I fix or add a feature, and that's never
 been much of a problem.
 
 Even if we added compile-time reflection classes, I still don't get all
 the power of DataDraw, which I can extend in any way, because I directly
 edit the source.  What's still missing?
 
 For one thing, reflection classes can't be used to add syntax to the
 language.  That's a serious limitation.  XL's approach allows some syntax
 extension.  Scheme also has a nice mechanism.  However, both systems are
 limited, and complex, and slow.  I'm toying with another approach that is
 easy if you already allow users to compile custom versions of the
 compiler (which you do to get rid of the interpreter).  Just provide a
 simple mechanism for generating a syntax description for use by bison.
 That nails the problem.  Any new syntax can then be added by a user, so
 long as it's compatible with what's already there.  A drawback is that
 bison now becomes part of the language, along with all its quirks and
 strong points.  At least bison is pretty much available everywhere.

I still don't know what problems you are trying to solve. A language that is able to extend its own syntax? Surely a fascinating idea, but 99.9 percent of programmers would not be able to make good use of it.
 Just adding new syntax to the language doesn't get you all the way
 there.  You still are stuck with those reflection classes used to model
 the language.  If you have a new construct to implement, you can add the
 syntax, but what objects do you build to represent it?  The reflection
 classes themselves need to be extendable.  Really.  At that point,
 nothing in the language is left as non-configurable.  You're stuck with
 LALR(1) parsers, but that's no big deal.
 
 However, adding reflection classes is tricky.  Being C-derived, the
 language still needs to link with the C linker, including the compiler
 itself, especially if users are going to compile custom compilers for
 their applications.  That means that new types can't be added to the
 compiler's database, since C libraries are limited that way.  I'm
 currently toying with the age-old style of non-typed syntax trees rather
 than fully typed reflection classes.  It looks like it will work out,
 but in the end, all this has done is provide a compiler that's easy to
 extend.  It's easy to extend because its parser and internal data
 structures are simple, and extendable.  Plug-ins should be easy to
 write.  However, it's not really a standard language any more.  It's
 just a customizable compiler that's fairly easy to work with.
 
 I'm left with the conclusion that D can't be enhanced to be extendable the
 way XL wants to be, or the way I'd like D to be.

As I see it D was never designed to have an extensible syntax.
 I don't see how D can get there from here.

For this reason it is unreasonable to think it could go there. Currently I don't understand why it should go there, other than it would allow you to carry your DataDraw methods of problem solving on to D. But, as I said, I'll try to read some of your threads. -- Helmut Leitner leitner hls.via.at Graz, Austria www.hls-software.com
Apr 02 2003
parent Bill Cox <bill viasic.com> writes:
I agree with all your comments.

At this point, I'm not advocating major changes to D, so this reply is 
more just to answer your questions than to give Walter any ideas.  You'd 
asked about specific features I'd been advocating, so I'll re-summarize 
them below.

1) Compile-time reflection classes.  I threw this out there as a 
possibility to be investigated.  Now that I've done that, I'm dropping 
that request, for reasons described in the text you replied to below.

2) I'd still like to see more powerful iterators than the ones discussed 
lately.  You can look up my recommendations under "Cool iterators", or 
something like that.

3) Dynamic class extensions are also a great thing, and it's sad C++, 
Java, C# and D don't have them.  Most programmers working with object 
databases have to emulate the extensions with cross-coupled void pointers.

4) A class framework inheritance mechanism, such as Sather's "include" 
construct, virtual classes, or Dan's "Template Frameworks".  All of 
these cover a gaping hole in C++, but I'm concerned about the complexity 
of the virtual class approach Walter was considering.

Embedded replies to a couple questions you posed are below.

Helmut Leitner wrote:
 
 Bill Cox wrote:
 
There are surely better ways to advertise you project.

... I'm not trying to advertise DataDraw. In fact, I'd love to see D incorporate features that would allow me to kill it. I'd prefer that users didn't start adopting DataDraw, as I don't have the time to do free support.

Ok, I think it's good to have this said.
It's open-source, as the copyright file describes.  
The documentation sucks, and I think it will probably stay that way.

That means it's dead outside the heads of its few experts and will remain so.
...
It's specific insights I've gained in working with DataDraw that I've
been trying to describe in this group, rather than trying to promote
DataDraw. 
...

I'm very interested in your experiences and insights. I've been doing software projects since 1979 and feel very strongly about the way systems present themselves to the programmer (APIs).
Through using DataDraw for many years, however, I think I've had some
fairly unique insights into language design.  Adding features to a
target language is what DataDraw is for, and I've been able to try out
several features not found in C++ in a real industrial coding
environment.  Some of those features I've described in other posts.

I'll try to reread some of your postings and arguments. Can you give me some hints to find my way?
As I said, I was hoping D could be extended to make DataDraw obsolete.

That turns out not to be the case. I'll describe some of my current thinking about this matter below.

DataDraw currently just models data structures, and allows me to write code generators. This is much like the old OM tool for UML (which DataDraw precedes). It gives me the power of compile-time reflection classes, like those in OpenC++. However, for each new language, or coding style, I have to write a new code generator, and these things get really complex. DataDraw currently has 5. That kind of sucks. Instead, DataDraw should allow me to write one awesome code generator that targets an intermediate language. Then, it should allow me to write simple translators for each target language and coding style. The bulk of the work could then be shared.

That's a natural idea that doesn't seem to work. I think Charles Simonyi put 10 years into Intentional Programming following similar ideas, and they burned millions of dollars.

I believe it. The hard part isn't making a nice intermediate language I can work with. The hard part is making an extendable version that anyone can work with.
With a built-in language translator, DataDraw would be much simpler than
it is now.  However, with a built-in language translator, DataDraw
becomes a language in itself.  What's unique about it?  Simple.  It's
extendable by me and others I work with who are familiar with the
DataDraw code base.  I can generate code of any type, and add literally
any feature I wish.  However, I do that by directly editing the code
generators, which are written in C and which link into DataDraw's
database.  That's not elegant, or usable by anyone not familiar with the
DataDraw code base, although it does cover my needs.

This is a certain way to solve problems but it may or may not be optimal. The fact that you have this tool at hand gives power but may mislead.

You're right about that. You have to be extremely careful about adding features to a language using a custom pre-processor. In particular, every extension has to be carefully thought out, and agreed to by the whole group. If anyone could add a feature any time they wished, it'd result in mayhem.
So, I've been looking into what it takes to get the same power, but in a
language that anyone could work with.  In particular, I've been
examining what it would take for D to cover DataDraw's functionality.

Analytically this is not a goal. The goal is to enable programmers to write great applications. What are their problems and how can they be solved?

Oh, there are lots of problems. Big stuff and little stuff. How about array bounds checking in debug mode? We added it to C. Need a few fields added to existing classes at run-time? We do that. The space of solutions to real problems programmers are facing out there is a lot bigger than what most languages address. I agree with your point, though. A good D design is a design that covers most people's most common needs, but not all of anybody's needs. IMO, D's basically on track.
That, it turns out, is hard (which is one reason the XL compiler isn't
done).  The more power you give the user, the more you open up the
internals of the compiler, and the more complex you make the language.

I agree. I think this is the problem of C++ itself. Too much complexity for too little gain.
For example, to do that in D, a natural way would be to make Walter's
representation of D as data structures part of the language definition
(thus greatly restricting how D compilers are built).  Then, you could
offer access to reflection classes at compile time (as OpenC++ does).  A
natural way to use these classes at compile time is to interpret D code.
  Now, you have to write a D interpreter as well as a compiler.  This is
the approach taken by VHDL for their generators, and it really
complicated implementations of compilers.  An alternative is to
re-compile the compiler instead.  This is a bit brain-bending, but I
think getting rid of the interpreter is worth it.  Besides, I already
recompile DataDraw every time I fix or add a feature, and that's never
been much of a problem.

Even if we added compile-time reflection classes, I still don't get all
the power of DataDraw, which I can extend in any way, because I directly
edit the source.  What's still missing?

For one thing, reflection classes can't be used to add syntax to the
language.  That's a serious limitation.  XL's approach allows some syntax
extension.  Scheme also has a nice mechanism.  However, both systems are
limited, and complex, and slow.  I'm toying with another approach that is
easy if you already allow users to compile custom versions of the
compiler (which you do to get rid of the interpreter).  Just provide a
simple mechanism for generating a syntax description for use by bison.
That nails the problem.  Any new syntax can then be added by a user, so
long as it's compatible with what's already there.  A drawback is that
bison now becomes part of the language, along with all its quirks and
strong points.  At least bison is pretty much available everywhere.

I still don't know what problems you are trying to solve. A language that is able to extend its own syntax? Surely a fascinating idea, but 99.9 percent of programmers would not be able to make good use of it.

stuff, and extensions need to be carefully considered by a few and then adopted by many. Scheme has a nice mechanism for this kind of thing. Much of the syntax of Scheme can actually be written in Scheme.

However, without an ability to add syntax, some new features can't cleanly be added to a language, and thus the language isn't fully extensible. For example, how could we add Sather-like "include" constructs to allow module-level inheritance? There's no way in D, C++, Java, or C# to even say that. To add this feature, you need to hack the parser a little. After that, it's a simple thing to implement with compile-time reflection classes.

I'm not pushing for any syntax extension mechanism for D. It's pretty worthless without some way to tie it into reflection classes or an equivalent mechanism.
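For readers unfamiliar with Sather's `include`, the closest everyday analogue is implementation reuse via a mixin. A rough Python sketch, with invented class names (and a loose analogy: Python mixins also create a subtype relation, which Sather's `include` deliberately avoids):

```python
# Rough analogue of Sather-style "include": splicing a reusable
# implementation fragment into a concrete class.
class CountingOps:
    """Implementation fragment to be 'included' by concrete classes."""
    def total(self):
        return sum(self.items)

class Basket(CountingOps):   # pulls in CountingOps' implementation
    def __init__(self, items):
        self.items = list(items)

print(Basket([1, 2, 3]).total())
```

The feature Bill wants is this kind of wholesale implementation reuse at the module/class-framework level, expressible in the language itself rather than via a pre-processor.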
Just adding new syntax to the language doesn't get you all the way
there.  You still are stuck with those reflection classes used to model
the language.  If you have a new construct to implement, you can add the
syntax, but what objects do you build to represent it?  The reflection
classes themselves need to be extendable.  Really.  At that point,
nothing in the language is left as non-configurable.  You're stuck with
LALR(1) parsers, but that's no big deal.

However, adding reflection classes is tricky.  Being C-derived, the
language still needs to link with the C linker, including the compiler
itself, especially if users are going to compile custom compilers for
their applications.  That means that new types can't be added to the
compiler's database, since C libraries are limited that way.  I'm
currently toying with the age-old style of non-typed syntax trees rather
than fully typed reflection classes.  It looks like it will work out,
but in the end, all this has done is provide a compiler that's easy to
extend.  It's easy to extend because its parser and internal data
structures are simple, and extendable.  Plug-ins should be easy to
write.  However, it's not really a standard language any more.  It's
just a customizable compiler that's fairly easy to work with.

I'm left with the conclusion that D can't be enhanced to be extendable the
way XL wants to be, or the way I'd like D to be.

As I see it D was never designed to have an extensible syntax.
I don't see how D can get there from here.

For this reason it is unreasonable to think it could go there. Currently I don't understand why it should go there, other than it would allow you to carry your DataDraw methods of problem solving on to D. But, as I said, I'll try to read some of your threads. -- Helmut Leitner leitner hls.via.at Graz, Austria www.hls-software.com

I agree. At this point, I've concluded that D should not try to solve the problems I solve with DataDraw.

I've started working on a new system that should replace DataDraw when finished. It's already got the syntax extension mechanism I described that generates a bison file. It's got a simple list-based language parse tree that is capable of representing any feature I wish to support. These get used like compile-time reflection classes, allowing users to write code in the intermediate language in order to add features to the target language. The output can be in any language (as with DataDraw), and users can write new generators to target new languages or coding styles.

I'm thinking of calling it Hack-C, since allowing me to hack new features into C or other languages is its primary function, and because the whole system seems like one of the world's largest hacks. It's a translator that compiles application-specific versions of itself in order to add features to other languages. The opportunities for serious hacking in such a system are vast.

If you think there might be interest in this system in the open-source community, I could try to finish its development that way. It might be fun enough for me to actually support an open-source effort, and if anyone else were to help, I could benefit from that. I haven't seen much interest in this kind of project out there in the past. Languages are always hot, but CASE tools never are. Do you think this could be successful as an open-source effort?

Bill
Apr 03 2003
prev sibling parent reply Mark Evans <Mark_member pathlink.com> writes:
Bill Cox wrote,
Not in libraries, where we could all contribute,
but built-in, where Walter has to write it.

The compiler is open-source. Contributions are welcome. (Wasn't it you who said recently, 'I had a few days off and rewrote the D compiler' or words to that effect? Forgive me if memory fails, I think it was you.)

Whatever reasons you accept for UTF-8 as a native type hold equally well for UTF-16 and UTF-32. The only rationale advanced otherwise was a vague impression of unease (coupled with slurs on my design sense). Dividing type families is a war crime. It's more complex having one member in the compiler and the rest stranded in a library.

Think about slicing Unicode strings. Suppose the compiler includes code for slicing UTF-8 strings. Why do we want to duplicate that in a library for UTF-16? We have to write identical logic, in C for the compiler and in D for the library? Yuk! And what about the conversions between Unicode formats? They are easier with the strings all living in the same place.

Either these strings belong in the language together, or they belong in a library together. I see no objective reason to divide them up. Just think about what you're saying in terms of numeric types and the fallacy will jump out at you. C has trained people too well about what strings really are. Suppose for example that we put all floats in the compiler and all doubles in the library. Silly! <g>

Maybe it will mend fences to say in public that UTF-32 could be dropped. I have objective reasons for saying so, not vague unease: UTF-32 is rarely used and truly fixed-width (so it can be 'faked' as Walter suggests). Nonetheless intrinsic UTF-32 is just as reasonable to support as, say, the equally rarely used, and equally fake-able 'ifloat' type.

Mark
Mar 31 2003
next sibling parent reply Bill Cox <bill viasic.com> writes:
Hi, Mark.

Mark Evans wrote:
 Bill Cox wrote,
 
 The compiler is open-source.  Contributions are welcome.  (Wasn't it you who
 said recently, 'I had a few days off and rewrote the D compiler' or words to
 that effect?  Forgive me if memory fails, I think it was you.)

I wrote a toy compiler to test out some ideas in a few days off, not a D compiler. There's a huge difference between a week's effort and what D has become. In fact C++ is so complex that the compilers out there still aren't complete. Keeping D simple is key to avoiding this fate.

The fact that D's front-end is open-source is an even greater reason for the language itself to be simple. The author of Linux has a lot to say about keeping open-source code simple. He blasted GNU's Hurd effort for its complexity. I agree with him. The fact that I'm writing this note using a Linux kernel instead of a GNU Hurd kernel supports his assertion.

Last I checked, the D front-end was 35K lines of hand-written code, which is impressively small given the functionality and commenting. However, that's still a lot to learn if you just want to contribute, but it's doable. When it reaches 100K lines, the language is in real trouble. Not many of us will be willing to work with a program that huge unless we're getting paid.

Bill
Apr 01 2003
parent reply Mark Evans <Mark_member pathlink.com> writes:
Keeping D simple is key to avoiding this fate.

Unicode intrinsics make D a simple language. That is the point of having them. I assume you are still with me that D needs them. The notion is to rid D of ugly 30-year-old C confusions about strings, and to bring their formats up to modern standards in the bargain. We can't help the extra work of Unicode; that is what the world wants.
The fact that D's front-end is open-source is an even greater reason for 
the language itself to be simple.

No one said otherwise. You keep propping up straw-men to tear down. They are purely your own creations. It's amusing to watch you rip them down, but little else beyond that. We all want the language to be as simple and orthogonal as possible. That's why I worry about D's rigid adherence to C++ as a design baseline.

Look Bill - my design sense is as good as yours, maybe better, and definitely more informed. You need not lecture me about simplicity. To be frank, your work belies complicated over-engineering and reinvented wheels. From my viewpoint you are the one who needs simplicity lessons.

Furthermore I do not 'advocate' everything that I post. You halfway accused me of 'advocating' multimethods, and I don't recall once doing that. I merely linked to a short article showing how multimethods simplify code. I do advocate functional approaches, for this reason: they allow me to simplify my code. You see, I like simplicity.

There are software engineering concepts that C++ does not offer and it's important for a new language effort to know about them. That way, even if rejected, a decision about the concepts was made on facts, not ignorance.

If you agree with me about Unicode intrinsics, to whatever degree, then bite the bullet and be done with it. You really are going over the top on this.

Mark
Apr 01 2003
next sibling parent "Matthew Wilson" <dmd synesis.com.au> writes:
Mark

Not wishing to get in the middle of you two stags, but aren't you getting a
bit over the top? I don't doubt that all your skills are as incomparable as
you assert - though I note you did not add an entry to the "Introductions"
thread, why was that? - but do we really need to be told all the time?

Frankly it's beginning to taste a little like Boost, not to mention a waste
of time in the lives of lots of busy people in reading through them to get
to the technical points (which are very interesting, I must say) that you're
making.




"Mark Evans" <Mark_member pathlink.com> wrote in message
news:b6du7v$jiv$1 digitaldaemon.com...
Keeping D simple is key to avoiding this fate.

Apr 01 2003
prev sibling parent "Luna Kid" <lunakid neuropolis.org> writes:
Hmm... Mark, appreciating all your informedness and
very welcome sharp and clear view on this matter (and
others), how about improving your diplomatic skills
a bit?

Sorry about the noise.
The Luna Kid
Apr 03 2003
prev sibling parent reply "Walter" <walter digitalmars.com> writes:
"Mark Evans" <Mark_member pathlink.com> wrote in message
news:b6beep$1qom$1 digitaldaemon.com...
 Maybe it will mend fences to say in public that UTF-32 could be dropped. I have
 objective reasons for saying so, not vague unease: UTF-32 is rarely used and
 truly fixed-width (so it can be 'faked' as Walter suggests). Nonetheless
 intrinsic UTF-32 is just as reasonable to support as, say, the equally rarely
 used, and equally fake-able 'ifloat' type.

My understanding is that the Linux wchar_t type is UTF-32, which puts it in common use. UTF-32 is also handy as an intermediate form when converting between UTF-8 and UTF-16.
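Walter's intermediate-form point can be sketched in C++ (illustrative only; the function names are invented for this example and are not D library code). Each UTF-8 sequence is decoded to a code point, which is exactly its UTF-32 value, and then re-encoded as UTF-16, using a surrogate pair above U+FFFF:

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Decode one UTF-8 sequence starting at s[i] into a code point (its UTF-32
// value). Sketch only: assumes well-formed input, and handles the modern
// 1- to 4-byte forms (the thread's 5/6-byte sequences predate RFC 3629).
static uint32_t decodeUtf8(const std::string& s, size_t& i) {
    unsigned char c = s[i++];
    if (c < 0x80) return c;                      // ASCII fast path
    int extra = (c >= 0xF0) ? 3 : (c >= 0xE0) ? 2 : 1;
    uint32_t cp = c & (0x3F >> extra);           // bits from the lead byte
    while (extra-- > 0)
        cp = (cp << 6) | (s[i++] & 0x3F);        // 6 bits per continuation
    return cp;
}

// Encode one code point as UTF-16: one unit, or a surrogate pair above U+FFFF.
static void encodeUtf16(uint32_t cp, std::vector<uint16_t>& out) {
    if (cp < 0x10000) {
        out.push_back(static_cast<uint16_t>(cp));
    } else {
        cp -= 0x10000;
        out.push_back(static_cast<uint16_t>(0xD800 | (cp >> 10)));
        out.push_back(static_cast<uint16_t>(0xDC00 | (cp & 0x3FF)));
    }
}

// UTF-8 -> UTF-16, using the code point (UTF-32) as the intermediate form.
std::vector<uint16_t> utf8ToUtf16(const std::string& s) {
    std::vector<uint16_t> out;
    for (size_t i = 0; i < s.size(); )
        encodeUtf16(decodeUtf8(s, i), out);
    return out;
}
```

The UTF-32 pivot is what makes the conversion pair simple: both variable-width encodings round-trip through the same fixed-width code point.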
May 21 2003
parent "J. Daniel Smith" <J_Daniel_Smith HoTMaiL.com> writes:
If you've got a UTF-32 string, UTF-16 is really only needed when calling
things like Win32 APIs.

   Dan

"Walter" <walter digitalmars.com> wrote in message
news:bagjlo$308t$1 digitaldaemon.com...
My understanding is that the Linux wchar_t type is UTF-32, which puts it in
common use. UTF-32 is also handy as an intermediate form when converting
between UTF-8 and UTF-16.

May 22 2003
prev sibling next sibling parent "Matthew Wilson" <dmd synesis.com.au> writes:
One minor point:

We *must* have char/wchar and byte/ubyte/short/ushort as separate, and
overloadable, entities. This is about the most egregious and toxic aspect of
C/C++ that I can think of. Absolute nightmare when trying to write generic
serialisation components, messing around with compiler discrimination
pre-processor guff to work out whether the compiler "knows" about wchar_t,
and crying oneself to sleep with char, signed char, unsigned char, etc. etc.

Following this logic, if D does evolve to support different character
encoding schemes, it would be nice to have separate char types, although I
know this will draw the succinctness crowd down on me like a pack of
blood-thirsty vultures.

Swoop away, flying beasties, my gizzard is exposed.



"Mark Evans" <Mark_member pathlink.com> wrote in message
news:b6abjh$12m8$1 digitaldaemon.com...
 Walter says (in response to my post)...
 D needs a Unicode string primitive.


 I'm dubious about this claim.  ANSI C char arrays are UTF-8 too, if the contents
 are 7-bit ASCII (a subset of UTF-8).  That doesn't mean they support UTF-8.
 UTF-8 is on D's very own 'to-do' list:
 http://www.digitalmars.com/d/future.html

 UTF-8 has a maximum encoding length of 6 bytes for one character.  If such a
 character appears at index 100 in char[] myString, what is the return value
 from myString[100]?  The answer should be "one UTF-8 char with an internal
 6-byte representation."  I don't think D does that.

 Besides which, my idea was a native string primitive, not a quasi-array.  The
 confusion of strings with arrays was a basic, fundamental mistake of C.  While
 some string semantics do resemble those of arrays, this resemblance should not
 mandate identical data types.  Strings are important enough to merit their own
 intrinsic type.  Icon is not the only language to recognize that fact.  D
 documents make no mention of any string primitive:
 http://www.digitalmars.com/d/type.html
 D has two intrinsic character types, a dynamic array type, and _no_ intrinsic
 string type.

 Characters should be defined as UTF-8 or UTF-16 or UTF-32, not "short" and
 "wide."  The differing cross-platform widths of the 'wide' char is asking for
 trouble; poof goes data portability.  D characters are not based on Unicode,
 but archaic MS Windows API and legacy C terminology spot-welded onto Linux.
 How about Unicode as a basis?

 The ideal type system would offer as intrinsic/primitive/native language types:

 - UTF-8 char
 - UTF-16 char
 - UTF-32 char
 - UTF-8 string
 - UTF-16 string
 - UTF-32 string
 - built-in conversions between all of the above (e.g. UTF-8 to UTF-16)
 - built-in conversions to/from UTF strings and C-style byte arrays

 The preceding list will not seem very long when you consider how many numeric
 types D supports.  Strings are as important as numbers.

 The old C 'char' type is merely a byte; D already has 'ubyte.'  The distinction
 between ubyte and char in D escapes me.  Maybe the reasoning is that a char
 might be 'wide' so D needs a separate type?  But that reason disappears once we
 have nice UTF characters.  So even if the list is a bit long it also eliminates
 two redundant types, char and wchar.

 I would not be against retention of char and char[] for C compatibility
 reasons, if someone could point out why 'ubyte' and 'char[]' do not suffice.
 Otherwise I would just alias 'char' into 'ubyte' and be done with it.  The
 wchar could be stored inside a UTF-16 or UTF-32 char, or be declared as a
 struct.

 To the user, strings would act like dynamic arrays.  Internally they are
 different animals.  Each 'element' of the 'array' can have varying length per
 the Unicode specifications.  String primitives would hide Unicode complexity
 under the hood.

 That's just the beginning.  Now that you have string intrinsics, you can give
 them special behaviors pertaining to i/o streams and such.  You can define
 'streaming' conversions from other intrinsic types to strings for i/o
 purposes.  And...permit me to dream!...you can define Icon-style string
 scanning expressions.

 Mark

Mar 31 2003
prev sibling parent reply Mark Evans <Mark_member pathlink.com> writes:
Walter -

On a positive and constructive note, an implementation concept might hold some
interest.  I'm just bringing it to attention, not advocating yet <g>.

There's no hard requirement for serial bytewise storage of the proposed
intrinsic Unicode strings.  Other ways to build Unicode strings exist.  The one
offered here would do little or no damage to the current compiler.  Really it's
just a set of small additions.

Consider a Unicode string made of two data structures:  a C-style array, and a
lookup table.  The C-style array holds the first code word for each character.
The table holds all second, third, and additional code words.  (A 'code word'
meaning 8/16/32 bits for UTF 8/16/32 respectively.)  The keys to the table are
the indices of the string.  So if character #100 has extra code words, they are
accessed via some function like table_access(100).

This setup unifies C array indices with Unicode character indices.  So D can
employ straight pointer arithmetic to find any character in the string.
Character index = array index.  String length (in chars) = implementation array
size (in elements).  These features may address your hesitation over
implementation issues that are complex in the serial case.

Having found the character, D need only check the high bit(s) which flag
additional code words.  Unicode requires such a test in any case; it's
unavoidable.  If flagged, D performs a table lookup.  This table lookup is the
only serious runtime cost.  The table could take whatever form is most
efficient.

* UTF-32 has no extended codes, so UTF-32 strings don't need tables.
* UTF-16 characters involve only a few percent with extended codes.
Ergo - the table is small, and the runtime cost is, say, 2-3%.
* UTF-8 needs the biggest and most table entries, but manageably so.

A downside might be file and network serialization - but we might skate by.  D
could supply streams on demand, without an intermediate serialized format.  If I
tell D "write(myFile, myString)" no intermediate format is required.  D can just
empty the internal array and table to disk in proper byte sequence.  The disk or
network won't care how D gets the bytes from memory.

The only hard serialization requirement would be actual user conversion to byte
arrays.  (If the user is doing that, let him suffer!)

This scheme supports 7-bit ASCII.  An optimization could yield raw C speed.  Put
an extra boolean flag inside each string structure.  This flag is the logical OR
of all contained Unicode bit flags.  If the string has no extended chars, the
flag is FALSE, and D can use alternate string code on that basis.  (No bit
tests, no table lookups.)  That works for UTF-32, 7-bit ASCII, and the majority
of UTF-16 strings.

The idea can be nitpicked to death, but it's a concept.  Unicode strings and
characters will never enjoy the simplicity or speed of 7-bit ASCII.  That's a
fact of life, meaning that implementation concepts cannot be faulted on such a
basis.

What would be nice is to make Unicode maximally simple and maximally efficient
for D users.

Thanks again Walter,

Best-
Mark
Mar 31 2003
next sibling parent reply "Matthew Wilson" <dmd synesis.com.au> writes:
Qualifying this again with the stipulation that I am far from an expert on
this issue (aside from having a fair amount of experience in a negative
sense):

This sounds like a nice idea - array of 1st-byte plus lookups. I'm intrigued
as to the nature of the lookup table. Is this a constant, process-wide,
entity?

If I had time when it was introduced I'd be keen to participate in the
serialisation stuff, on which I have firmer footing.

It's not clear now whether you've dropped the suggestion for a separate
string class, or just that arrays of "char" types would be dealt with in the
fashion that you've outlined.

Finally, I'm troubled by your comments "on a positive and constructive note"
and "maybe it will mend fences to " (other post). Have I missed some animus
that everyone else has perceived? If so, I don't know which side to be on.
Seriously, though, I don't think anyone's getting shirty, so chill, baby. :)

Keep those great comments coming. I'm learning heaps.


"Mark Evans" <Mark_member pathlink.com> wrote in message
news:b6bb6i$1ont$1 digitaldaemon.com...
Mar 31 2003
next sibling parent reply "Sean L. Palmer" <palmer.sean verizon.net> writes:
"Matthew Wilson" <dmd synesis.com.au> wrote in message
news:b6bgt5$1sai$1 digitaldaemon.com...
This sounds like a nice idea - array of 1st-byte plus lookups. I'm intrigued
as to the nature of the lookup table. Is this a constant, process-wide,
entity?

No, because the map is indexed by the same index used to index into the flat array. Unless I'm misunderstanding something. Perhaps these could be grouped into separate maps by the total size of the char, which I think is determinable from the first char? May speed lookups a tad, or slow them down, not sure. Sean
Apr 01 2003
parent reply "Walter" <walter digitalmars.com> writes:
"Sean L. Palmer" <palmer.sean verizon.net> wrote in message
news:b6bjg5$1ut5$1 digitaldaemon.com...
No, because the map is indexed by the same index used to index into the flat
array. Unless I'm misunderstanding something.

You could use a static 256 byte lookup table to give you the 'stride' to the next char.
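Walter's table idea might look like this in C++ (a sketch with hypothetical names; the 5- and 6-byte strides cover the historical UTF-8 forms discussed earlier in the thread):

```cpp
#include <cstddef>
#include <cstdint>

// A static 256-entry table giving the byte 'stride' from a UTF-8 lead byte
// to the start of the next character. Continuation bytes (0x80-0xBF) should
// never lead a character; they get stride 1 so a scan can resynchronize.
struct StrideTable {
    uint8_t stride[256];
    StrideTable() {
        for (int b = 0; b < 256; ++b) {
            if      (b < 0x80) stride[b] = 1;  // ASCII
            else if (b < 0xC0) stride[b] = 1;  // continuation (malformed lead)
            else if (b < 0xE0) stride[b] = 2;  // 2-byte sequence
            else if (b < 0xF0) stride[b] = 3;  // 3-byte sequence
            else if (b < 0xF8) stride[b] = 4;  // 4-byte sequence
            else if (b < 0xFC) stride[b] = 5;  // 5-byte (historical UTF-8)
            else               stride[b] = 6;  // 6-byte (historical UTF-8)
        }
    }
};

// Count characters by hopping stride-by-stride through the byte sequence.
size_t utf8Length(const char* s, size_t n) {
    static const StrideTable t;
    size_t count = 0;
    for (size_t i = 0; i < n; i += t.stride[(uint8_t)s[i]])
        ++count;
    return count;
}
```

As the follow-up notes, this is fast for sequential iteration but does nothing for random access, which still requires a scan from the start of the string.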
May 21 2003
parent "Sean L. Palmer" <palmer.sean verizon.net> writes:
That lets you index sequentially pretty fast, but not randomly.

Sean

"Walter" <walter digitalmars.com> wrote in message
news:bagk8l$30ti$2 digitaldaemon.com...
You could use a static 256 byte lookup table to give you the 'stride' to the
next char.


May 22 2003
prev sibling parent reply Mark Evans <Mark_member pathlink.com> writes:
This sounds like a nice idea - array of 1st-byte plus lookups.

Thanks. Correction, "array of first code words." Only in UTF-8 are they byte-sized.
I'm intrigued as to the nature of the lookup table. Is this a
constant, process-wide, entity?

No. There is one table per string.
I'd be keen to participate in the
serialisation stuff

No need for serialization. Even the compiler can do serialization with no memory footprint. Only something like an explicit conversion to ubyte[] would mandate that.
It's not clear now whether you've dropped the suggestion for a separate
string class, or just that arrays of "char" types would be dealt with in the
fashion that you've outlined.

I never suggested a string 'class,' just Unicode string and char intrinsic
types. My list of proposed intrinsics has already been supplied. Think int,
float, string8, string16, char8, etc. C made a huge mistake in confusing
arrays with strings. Strings deserve intrinsic status and a type all their
own. The ugly char/wchar gimmick has also seen its day and needs replacement.

Mark

The internal implementation might read like this in C++-ish, heavy on the
"ish." This is the idea; it's just a communication vehicle for the concept:

// code word storage types
typedef ubyte    UTF8_CODE;
typedef ushort   UTF16_CODE;
typedef uint     UTF32_CODE;

// max code words per Unicode character
const ushort     UTF8_CODE_MAX  = 6;
const ushort     UTF16_CODE_MAX = 2;
const ushort     UTF32_CODE_MAX = 1;

template <typename UTF_CODE, ushort UTF_CODE_MAX>
class ExtensionTableEntry
{
public:
    int       myStringPositionIndex;
    UTF_CODE  myStorage[UTF_CODE_MAX+1]; // null terminated?
};

// a partially defined Unicode String class concept
template <typename UTF_CODE, ushort UTF_CODE_MAX>
class UnicodeString
{
public:
    long      length;
    UTF_CODE* operator[](long index);
private:
    UTF_CODE* firstWordsArray;
    std::hash_map<
        int,
        ExtensionTableEntry<UTF_CODE,UTF_CODE_MAX> >
              myLookup;
};

typedef UnicodeString<UTF8_CODE,UTF8_CODE_MAX>   String8;
typedef UnicodeString<UTF16_CODE,UTF16_CODE_MAX> String16;
typedef UnicodeString<UTF32_CODE,UTF32_CODE_MAX> String32;

/* Walter - each table entry should hold the full Unicode char, not just its
extension codes. This tactic would create some redundancy, but not much.
Having the whole character in contiguous memory could be advantageous for
passing pointers around. So the C++ operator[] either returns a pointer into
the firstWordsArray, or a pointer to the table entry's myStorage field. In
all cases the firstWordsArray always holds the first code word of the char,
whether it's an extended one or not. */
Apr 01 2003
parent reply "Sean L. Palmer" <palmer.sean verizon.net> writes:
The only problem with this idea is that passing this dual structure to a
piece of code that expects a linear string of data won't work.

Typecasting to ubyte[] or ushort[] should solve that, right?

You would probably need to know the length of such a string both in bytes
and in chars.

Sean


"Mark Evans" <Mark_member pathlink.com> wrote in message
news:b6bpf9$22g9$1 digitaldaemon.com...

Apr 01 2003
parent reply Mark Evans <Mark_member pathlink.com> writes:
Sean L. Palmer says...
The only problem with this idea is that passing this dual structure to a
piece of code that expects a linear string of data won't work.

Serialization at choke points has a cost of (a) zero, because the string has no extended codes (say typ. 95%+ of UTF-16 and by definition 100% of UTF-32), or (b) an alloc plus copy equivalent, which is acceptable for small to medium strings (another statistically large class in software programs). You run into problems only with large UTF-8 strings that are frequently passed to/from Unicode APIs. Windows uses UTF-16 so it's no problem. Where you find UTF-8 happening is on the web, but that has inherent delays of its own, so the cost might go unnoticed. Consider for example that plenty of web sites are driven with UTF-8 by languages far slower than D. Mark
Apr 01 2003
parent "Walter" <walter digitalmars.com> writes:
"Mark Evans" <Mark_member pathlink.com> wrote in message
news:b6dolr$di3$1 digitaldaemon.com...
 You run into problems only with large UTF-8 strings that are frequently passed
 to/from Unicode APIs.  Windows uses UTF-16 so it's no problem.  Where you find
 UTF-8 happening is on the web, but that has inherent delays of its own, so the
 cost might go unnoticed.  Consider for example that plenty of web sites are
 driven with UTF-8 by languages far slower than D.

I've been looking at some books for programming CGI apps in C. I see the dreaded buffer overflow errors in the sample code even in highly regarded books. No wonder security is such a mess! Doing CGI in D would eliminate those problems.
May 21 2003
prev sibling next sibling parent "Sean L. Palmer" <palmer.sean verizon.net> writes:
That's so crazy it just might work!  ;)

I think it's a fine concept.

One point I'd like to add is that when straight iterating over the string,
the library function can iterate over both the main array and the secondary
map at the same time, in sync, with no map lookups, only iteration.

This would be an interesting bit to actually implement.  But no harder than
the many other possible solutions, and easier and more efficient than most,
especially for random-access indexing, which seems to be what D is leaning
toward in general.

I'd prefer iteration to be the normal way of using D arrays, rather than
explicit loops and indexing.  Those are, for obvious reasons, difficult to
optimize.  But Walter has not decided on a good foreach construct, and
newsgroup discussion on the topic has died down.  Anyone have any good
proposals?  I haven't used any language that has good iterators, except if
you count C++ STL.

Sean

"Mark Evans" <Mark_member pathlink.com> wrote in message
news:b6bb6i$1ont$1 digitaldaemon.com...
 Walter -

 On a positive and constructive note, an implementation concept might hold

 interest.  I'm just bringing it to attention, not advocating yet <g>.

 There's no hard requirement for serial bytewise storage of the proposed
 intrinsic Unicode strings.  Other ways to build Unicode strings exist.

 offered here would do little or no damage to the current compiler.  Really

 just a set of small additions.

 Consider a Unicode string made of two data structures:  a C-style array and a
 lookup table.  The C-style array holds the first code word for each character.
 The table holds all second, third, and additional code words.  (A 'code word'
 meaning 8/16/32 bits for UTF 8/16/32 respectively.)  The keys to the table are
 the indices of the string.  So if character #100 has extra code words, they are
 accessed via some function like table_access(100).

 This setup unifies C array indices with Unicode character indices.  So D can
 employ straight pointer arithmetic to find any character in the string.
 Character index = array index.  String length (in chars) = implementation array
 size (in elements).  These features may address your hesitation over
 implementation issues that are complex in the serial case.

 Having found the character, D need only check the high bit(s) which flag
 additional code words.  Unicode requires such a test in any case; it's
 unavoidable.  If flagged, D performs a table lookup.  This table lookup is the
 only serious runtime cost.  The table could take whatever form is most
 efficient.

 * UTF-32 has no extended codes, so UTF-32 strings don't need tables.
 * Only a few percent of UTF-16 characters involve extended codes.
 Ergo - the table is small, and the runtime cost is, say, 2-3%.
 * UTF-8 needs the most (and biggest) table entries, but manageably so.

 A downside might be file and network serialization - but we might skate by.  D
 could supply streams on demand, without an intermediate serialized format.  If
 you tell D "write(myFile, myString)" no intermediate format is required.  D can
 empty the internal array and table to disk in proper byte sequence.  The
 network won't care how D gets the bytes from memory.

 The only hard serialization requirement would be actual user conversion to C
 byte arrays.  (If the user is doing that, let him suffer!)

 This scheme supports 7-bit ASCII.  An optimization could yield raw C speed:
 keep an extra boolean flag inside each string structure.  This flag is the OR
 of all contained Unicode bit flags.  If the string has no extended chars, the
 flag is FALSE, and D can use alternate string code on that basis.  (No bit
 tests, no table lookups.)  That works for UTF-32, 7-bit ASCII, and the majority
 of UTF-16 strings.

 The idea can be nitpicked to death, but it's a concept.  Unicode strings with
 extended characters will never enjoy the simplicity or speed of 7-bit ASCII.
 That's a fact of life, meaning that implementation concepts cannot be faulted
 on that basis.

 What would be nice is to make Unicode maximally simple and maximally efficient
 for D users.

 Thanks again Walter,

 Best-
 Mark

Apr 01 2003
prev sibling parent reply "Walter" <walter digitalmars.com> writes:
"Mark Evans" <Mark_member pathlink.com> wrote in message
news:b6bb6i$1ont$1 digitaldaemon.com...
 What would be nice is to make Unicode maximally simple and maximally efficient
 for D users.

I appreciate the thought, but carrying around an extra array for each string seems difficult to make work, especially in view of slicing, etc. I don't think there's any way to design the language so it is both efficient at dealing with ordinary ascii, and transparently able to do multibytes.
May 21 2003
parent Mark Evans <Mark_member pathlink.com> writes:
Walter wrote:
I appreciate the thought, but carrying around an extra array for each string
seems difficult to make work, especially in view of slicing, etc.

I would need a specific implementation code example to understand your thinking. (Clarification: I did not propose an extra array per string, but a lookup table -- something considerably smaller and often empty.) My gut says it would be easy.
I don't
think there's any way to design the language so it is both efficient at
dealing with ordinary ascii, and transparently able to do multibytes.

The problem here is either/or thinking. Both are possible. People who desperately want C byte arrays can declare them, irrespective of Unicode strings.

If the idea is that an intrinsic string type must simultaneously support Unicode and ASCII at equal performance levels, then I think the problem is one of definition. In the first place D lacks an honest string intrinsic, so a new one could be defined just for Unicode, leaving the current whatever-it-is in place. If people don't care for Unicode, they can keep using whatever-it-is D offers today.

However my gut says that a Unicode string intrinsic holding just ASCII would run neck and neck with an ASCII string as currently implemented. Remember that you don't necessarily need a bit test on every character every time. The table object can tell callers when it's totally empty, and they can proceed with manipulations on that basis. In that sense the Unicode concept is really just a superset of what you already have.

Considering the number of languages now being retrofitted for Unicode, I think it would be a mistake, one that will be regretted later, not to build it into D while the chance to do it cleanly exists. Best, Mark
May 23 2003