
digitalmars.D - String theory in D

"Glen Perkins" <please.dont email.com> writes:
I'd heard a bit about D, but this is the first time I've taken a bit 
of time to look it over. I'm glad I did, because I love the design.

I am wondering about something, though, and that's the apparent 
decision to have three different standard string types, each with its 
encoding exposed to the developer. I've had some experience designing 
text models--I worked with Sun upgrading Java's string model from 
UCS-2 to UTF-16 and for Macromedia upgrading the string types within 
Flash and ColdFusion, for example--but every case has its unique 
constraints.

I don't know enough about D to be sure of the issues and constraints 
in this case, but I'm wondering if it wouldn't make sense to have a 
single standard "String" class for the majority of text handling plus 
something like char/wchar/dchar/ubyte arrays reserved for special 
cases.

In both Java and Flash we kept having to throw away brainstorming 
ideas because they implied changes to internal string implementation 
details that had unnecessarily--in my opinion--been exposed to 
programmers. I've become increasingly convinced that programmers don't 
need to know, much less be forced to decide, how most of their text is 
encoded. They should be thinking in terms of text semantically most of 
the time, without concerning themselves with its byte representation.

I see text handling as analogous to memory handling in the sense that 
I think the time has come to have the platform handle the general 
cases via automated internal mechanisms that are not exposed, while 
still allowing programmer manual intervention for occasional special 
cases.

D already seems to have this memory model (very nice!), and it seems 
to me that the corresponding text model would be a single standard 
"String" class, whose internal encoding was the implementation's 
business, not the programmer's. The String would have the ability to 
produce explicitly encoded/formatted byte arrays for special cases, 
such as I/O, where encoding mattered. I would also want the ability to 
bypass Strings entirely on some occasions and use byte arrays 
directly. (By "byte arrays" I mean something like D's existing char[], 
wchar[], etc.)

Since the internal encoding of the standard String would not be 
exposed to the programmer, it could be optimized differently on every 
platform. I would probably implement my String class in UTF-16 on 
Windows and UTF-8 on Linux to make interactions with the OS and 
neighboring processes as lightweight as possible.

Then I would probably provide standard function wrappers for common OS 
calls such as getting directory listings, opening files, etc. These 
wrapper functions would pass text in the form of Strings. Source code 
that used only these functions would be portable across platforms, and 
since String's implementation would be optimized for its platform, 
this portable source code could produce nearly optimal object code on 
all platforms.

For calling OS functions directly, where you always need to have your 
text in a specific format, you could just have your Strings create an 
explicitly formatted byte sequence for you. A call to a Windows API 
function might pass something like "my_string.toUTF16()". Since the 
internal format would probably already be UTF-16, this "conversion" 
could be optimized away by the compiler, but it would leave you the 
freedom to change the underlying String implementation in the future 
without breaking anybody's code.

And, of course, you would still have the ability to use char[], 
wchar[], dchar[], and even ubyte[] directly when needed for special 
cases.

Having a single String to use for most text handling would make 
writing, reading, porting, and maintaining code much easier. Having an 
underlying encoding that isn't exposed would make it possible for 
implementers to optimize the standard String for the platform, so that 
programmers who used it would find code that was easier to write to 
begin with was also more performant when ported. This has huge 
implications for the creation of the rich libraries that make or break 
a language these days.

And if for no other reason, it seems to me that a new language should 
have a single, standard String class from the start just to avoid 
degenerating into the tangled hairball of conflicting string types 
that C++ text handling has become. Library creators and architects 
working in languages that have had a single, standard String class 
from the start doggedly use the standard String for everything. You 
could easily create your own alternative string classes for languages 
like Java or C#, but almost nobody does. As long as the standard 
String is good enough, it's just not worth the trouble of having to 
juggle multiple string types. All libraries and APIs in these 
languages use a single, consistent text model, which is a big 
advantage these days over C++.

Again, I realize that I may be overlooking any number of important 
issues that would make this argument inapplicable or irrelevant in 
this case, but I'm wondering if this would make sense for D.
Oct 25 2004
Ben Hinkle <bhinkle4 juno.com> writes:
Glen Perkins wrote:

 I'd heard a bit about D, but this is the first time I've taken a bit
 of time to look it over. I'm glad I did, because I love the design.
 
 I am wondering about something, though, and that's the apparent
 decision to have three different standard string types, each with its
 encoding exposed to the developer. I've had some experience designing
 text models--I worked with Sun upgrading Java's string model from
 UCS-2 to UTF-16 and for Macromedia upgrading the string types within
 Flash and ColdFusion, for example--but every case has its unique
 constraints.

welcome.
 I don't know enough about D to be sure of the issues and constraints
 in this case, but I'm wondering if it wouldn't make sense to have a
 single standard "String" class for the majority of text handling plus
 something like char/wchar/dchar/ubyte arrays reserved for special
 cases.

There is a port of IBM's ICU unicode library underway and that will help fill in various unicode shortcomings of phobos. What else do you see a class doing that isn't in phobos?
 In both Java and Flash we kept having to throw away brainstorming
 ideas because they implied changes to internal string implementation
 details that had unnecessarily--in my opinion--been exposed to
 programmers. I've become increasingly convinced that programmers don't
 need to know, much less be forced to decide, how most of their text is
 encoded. They should be thinking in terms of text semantically most of
 the time, without concerning themselves with its byte representation.

are you referring to indexing and slicing being character lookup and not byte lookup?
 I see text handling as analogous to memory handling in the sense that
 I think the time has come to have the platform handle the general
 cases via automated internal mechanisms that are not exposed, while
 still allowing programmer manual  intervention for occasional special
 cases.
 
 D already seems to have this memory model (very nice!), and it seems
 to me that the corresponding text model would be a single standard
 "String" class, whose internal encoding was the implementation's
 business, not the programmer's. The String would have the ability to
 produce explicitly encoded/formatted byte arrays for special cases,
 such as I/O, where encoding mattered. I would also want the ability to
 bypass Strings entirely on some occasions and use byte arrays
 directly. (By "byte arrays" I mean something like D's existing char[],
 wchar[], etc.)
 
 Since the internal encoding of the standard String would not be
 exposed to the programmer, it could be optimized differently on every
 platform. I would probably implement my String class in UTF-16 on
 Windows and UTF-8 on Linux to make interactions with the OS and
 neighboring processes as lightweight as possible.

Aliases can introduce a symbol that can mean different things on different platforms:

    // "Operating System" character
    version (Win32) { alias wchar oschar; }
    else            { alias char  oschar; }

    oschar[] a_string_in_the_OS_preferred_format;
 Then I would probably provide standard function wrappers for common OS
 calls such as getting directory listings, opening files, etc. These
 wrapper functions would pass text in the form of Strings. Source code
 that used only these functions would be portable across platforms, and
 since String's implementation would be optimized for its platform,
 this portable source code could produce nearly optimal object code on
 all platforms.

These should already be in phobos. If the aliases approach is used all that is required are overloaded versions for char[] or wchar[].
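As a sketch of that overloading approach (the names "oschar" and "textUnits" are invented here for illustration and are not part of phobos):

```d
// Sketch only: "oschar" and "textUnits" are made-up names, not phobos.
// The alias resolves at compile time, so code written against oschar[]
// pays no runtime cost on either platform.
version (Win32) { alias wchar oschar; }
else            { alias char  oschar; }

// One overload per array type; the compiler picks at compile time.
size_t textUnits(char[]  s) { return s.length; }
size_t textUnits(wchar[] s) { return s.length; }

int main()
{
    // A wrapper taking oschar[] would dispatch to the right overload at
    // compile time; here we just exercise both overloads directly.
    assert(textUnits("hi".dup)  == 2);  // char[] overload (UTF-8 units)
    assert(textUnits("hi"w.dup) == 2);  // wchar[] overload (UTF-16 units)
    return 0;
}
```

A library written this way needs only one overload per encoding it actually supports; callers using the alias never name the concrete type.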
 For calling OS functions directly, where you always need to have your
 text in a specific format, you could just have your Strings create an
 explicitly formatted byte sequence for you. A call to a Windows API
 function might pass something like "my_string.toUTF16()". Since the
 internal format would probably already be UTF-16, this "conversion"
 could be optimized away by the compiler, but it would leave you the
 freedom to change the underlying String implementation in the future
 without breaking anybody's code.

There exist overloaded versions of std.utf.toUTF16 for char, wchar and dchar arrays. So calling toUTF16(my_string) would do what you propose. Changing the type of my_string would require a recompile but no code change.
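Concretely, that overload resolution looks like this (a small sketch; std.utf's conversion functions are the only library pieces assumed):

```d
import std.utf;

int main()
{
    char[] s = "hello".dup;   // the declared type could later change to
                              // wchar[] or dchar[]...
    wchar[] w = toUTF16(s);   // ...and this call would still compile,
                              // since toUTF16 is overloaded for char[],
                              // wchar[] and dchar[]
    assert(w.length == 5);
    return 0;
}
```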
 And, of course, you would still have the ability to use char[],
 wchar[], dchar[], and even ubyte[] directly when needed for special
 cases.
 
 Having a single String to use for most text handling would make
 writing, reading, porting, and maintaining code much easier. Having an
 underlying encoding that isn't exposed would make it possible for
 implementers to optimize the standard String for the platform, so that
 programmers who used it would find code that was easier to write to
 begin with was also more performant when ported. This has huge
 implications for the creation of the rich libraries that make or break
 a language these days.
 
 And if for no other reason, it seems to me that a new language should
 have a single, standard String class from the start just to avoid
 degenerating into the tangled hairball of conflicting string types
 that C++ text handling has become. Library creators and architects
 working in languages that have had a single, standard String class
 from the start doggedly use the standard String for everything. You
 could easily create your own alternative string classes for languages
 like Java or C#, but almost nobody does. As long as the standard
 String is good enough, it's just not worth the trouble of having to
 juggle multiple string types. All libraries and APIs in these
 languages use a single, consistent text model, which is a big
 advantage these days over C++.
 
 Again, I realize that I may be overlooking any number of important
 issues that would make this argument inapplicable or irrelevant in
 this case, but I'm wondering if this would make sense for D.

One disadvantage of a String class is that the methods of the class are fixed. With arrays and functions anyone can add a string "method". A class will actually reduce flexibility in the eyes of the user IMO. Another disadvantage is that classes in D are by reference (like Java) and so slicing will have to allocate memory - today a slice is a length and pointer to shared data so no allocation is needed. A String struct would be an option if a class isn't used, though.
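The "anyone can add a string method" point relies on D letting a free function whose first parameter is an array be called with method syntax. A sketch ("shout" is a made-up example function):

```d
// Anyone can add a string "method" without touching a class:
// f(arr, args) on an array can also be written arr.f(args).
char[] shout(char[] s)
{
    return s ~ "!";   // concatenation allocates a new array
}

int main()
{
    char[] s = "hello".dup;
    assert(s.shout() == "hello!");  // free function, method-call syntax
    return 0;
}
```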
Oct 25 2004
"Glen Perkins" <please.dont email.com> writes:
"Ben Hinkle" <bhinkle4 juno.com> wrote in message
news:clk269$haj$1 digitaldaemon.com...
 Glen Perkins wrote:

 welcome.

Thanks.
 There is a port of IBM's ICU unicode library underway and that will
 help
 fill in various unicode shortcomings of phobos. What else do you see
 a
 class doing that isn't in phobos?

I don't know enough to comment at this point. I don't even know how modularity works for compiled executables in D, and I don't want to propose something that would violate D's priorities by, for example, creating a heavyweight string, full of ICU features, that would end up being statically linked into every little "hello, world" written in D, ruining the goal of tiny executables if, for example, that is a high priority in D.

If there's no chance of a standard string class for general string operations in D, then there's no point in designing one. If there is a chance, then the design would have to start with the priorities and constraints of this particular language. My sense is that a string class similar to that in C#, but noncommittal regarding its internal encoding, would be nice for a language like D.
 ...I've become increasingly convinced that programmers don't
 need to know, much less be forced to decide, how most of their text
 is
 encoded. They should be thinking in terms of text semantically most
 of
 the time, without concerning themselves with its byte
 representation.

are you referring to indexing and slicing being character lookup and not byte lookup?

Yes, that's a specific example of what I'm referring to, which is the general notion of just thinking about algorithms for working with the text in terms of text itself without regard to how the computer might be representing that text inside (except in the minority of cases where you MUST work explicitly with the representation.)

And though it's probably too radical for D (so nobody freak out), we may well evolve to the point where the most reasonable default for walking through the "characters" in general text data is something like 'foreach char ch in mystring do {}', where the built-in "char" datatype in the language is a variable length entity designed to hold a complete grapheme. Only where optimization was required would you drop down to the level of 'foreach codepoint cp in mytext do {}', where mytext was defined as 'codepoint[] mytext', or even more radically to 'foreach byte b in mytext do {}', where mytext was defined as 'byte[] mytext'.

Once again, I'm not proposing that for D, I'm just promoting the general notion of keeping the developer's mind on the text and off of the representation details to the extent that it is *reasonable*.
 Since the internal encoding of the standard String would not be
 exposed to the programmer, it could be optimized differently on
 every
 platform. I would probably implement my String class in UTF-16 on
 Windows and UTF-8 on Linux to make interactions with the OS and
 neighboring processes as lightweight as possible.

Aliases can introduce a symbol that can mean different things on different platforms:

    // "Operating System" character
    version (Win32) { alias wchar oschar; }
    else            { alias char  oschar; }

    oschar[] a_string_in_the_OS_preferred_format;

Thanks for pointing out this feature. I like it. It provides a mechanism for manual optimization at the cost of greater complexity for those special cases where optimization is called for. You could have different string representations for different zones in your app, labeled by zone name: oschar for internal and OS API calls, xmlchar for an XML I/O boundary etc., so you could change the OS format from OS to OS while leaving the XML format unchanged.

I can't help thinking, though, that it would be best reserved for optimization cases, with a simple works-everywhere, called "string" everywhere, string class for the general case. Otherwise, your language tutorials would be teaching you that a string is "char[]" but real production code would almost always be based on locally-invented names for string types. Libraries, which are also trying hard to be real production quality code, would use the above alias approach and invent their own names.

Not just at points you needed to manually optimize but literally everywhere you did anything with a string internally, you'd have to choose among the three standard names, char, wchar, and dchar, plus your own custom oschar and xmlchar, plus your GUI library's gchar or kchar, and your ICU library's unichar, plus a database orachar designed to match the database encoding, etc. You could easily end up with so many conversions going on between types locally optimized for each zone in your app that you are globally unoptimized.
 One disadvantage of a String class is that the methods of the class
 are
 fixed. With arrays and functions anyone can add a string "method". A
 class
 will actually reduce flexibility in the eyes of the user IMO.
 Another
 disadvantage is that classes in D are by reference (like Java) and
 so
 slicing will have to allocate memory - today a slice is a length and
 pointer to shared data so no allocation is needed. A String struct
 would be
 an option if a class isn't used, though.

It's true what you're saying about the relative lack of flexibility of built-in methods vs. external functions. You can always apply functions to strings, though, and the conservative approach would be to have a few clearly important methods in the string, implement other operations as functions that take string arguments, and over time consider migrating those operations into the string itself.

Another possibility might be to have this "oschar" approach above actually built-in, with everybody (starting from the first "hello, world" tutorial) encouraged to use that one by default. That's tricky, though, because when you asked for mystring[3] from your oschar-based string, what would you get? People would expect the third text character, but as you know it would depend on the platform, and would not have any useful meaning in general, which seems pretty awkward for a standard string.

It doesn't seem very useful to present something in an array format without the individual elements of the array being very useful. You could make them useful by making dchar[] the default, but everybody would probably fuss about the wasted memory, and production code would end up using char or wchar. So that brings us back to a string class where operator overloading could make the [] array-type access yield consistent, complete codepoints on every platform.

I'm sympathetic to performance arguments. That would be one of the big attractions of D. I still can't help thinking that sticking to a single string class shared by almost all of your tutorials, your own code, your downloaded snippets, and all of your libraries might not only be the easiest for programmers to work with but could result in apps that tended to be at least as performant as the existing approach.
Oct 26 2004
Ben Hinkle <bhinkle4 juno.com> writes:
Glen Perkins wrote:

 
 "Ben Hinkle" <bhinkle4 juno.com> wrote in message
 news:clk269$haj$1 digitaldaemon.com...
 Glen Perkins wrote:

 welcome.

Thanks.
 There is a port of IBM's ICU unicode library underway and that will
 help
 fill in various unicode shortcomings of phobos. What else do you see
 a
 class doing that isn't in phobos?

I don't know enough to comment at this point. I don't even know how modularity works for compiled executables in D, and I don't want to propose something that would violate D's priorities by, for example, creating a heavyweight string full of ICU features, that would end up being statically linked into every little "hello, world" written in D, ruining the goal of tiny executables if, for example, that is a high priority in D. If there's no chance of a standard string class for general string operations in D, then there's no point in designing one. If there is a chance, then the design would have to start with the priorities and constraints of this particular language. My sense is that a string class similar to that in C#, but noncommittal regarding its internal encoding, would be nice for a language like D.
 ...I've become increasingly convinced that programmers don't
 need to know, much less be forced to decide, how most of their text
 is
 encoded. They should be thinking in terms of text semantically most
 of
 the time, without concerning themselves with its byte
 representation.

are you referring to indexing and slicing being character lookup and not byte lookup?

Yes, that's a specific example of what I'm referring to, which is the general notion of just thinking about algorithms for working with the text in terms of text itself without regard to how the computer might be representing that text inside (except in the minority of cases where you MUST work explicitly with the representation.) And though it's probably too radical for D (so nobody freak out), we may well evolve to the point where the most reasonable default for walking through the "characters" in general text data is something like 'foreach char ch in mystring do {}', where the built-in "char" datatype in the language is a variable length entity designed to hold a complete grapheme. Only where optimization was required would you drop down to the level of "foreach codepoint cp in mytext do {}', where mytext was defined as 'codepoint[] mytext', or even more radically to 'foreach byte b in mytext do {}', where mytext was defined as 'byte[] mytext'.

One can foreach over dchars from either a char[] or wchar[]:

    int main() {
        char[] t = "hello 中国 world";
        foreach(dchar x; t) printf("%x ", x);
        return 0;
    }

prints

    68 65 6c 6c 6f 20 4e2d 56fd 20 77 6f 72 6c 64

Similarly structs and classes can have overloaded opApply implementations to customize what it means to foreach in different situations.
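As a sketch of that opApply point (the Text struct here is made up for illustration, not a phobos type): a struct can store UTF-8 internally yet present foreach-over-dchar semantics.

```d
// Made-up example type: stores char[] but iterates as dchars.
struct Text
{
    char[] data;

    int opApply(int delegate(ref dchar) dg)
    {
        // reuse the built-in char[] -> dchar decoding foreach
        foreach (dchar c; data)
        {
            int r = dg(c);
            if (r)
                return r;
        }
        return 0;
    }
}

int main()
{
    Text t;
    t.data = "hi".dup;
    int n = 0;
    foreach (dchar c; t)   // calls Text.opApply
        n++;
    assert(n == 2);
    return 0;
}
```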
 Once again, I'm not proposing that for D, I'm just promoting the
 general notion of keeping the developer's mind on the text and off of
 the representation details to the extent that it is *reasonable*.
 
 
 Since the internal encoding of the standard String would not be
 exposed to the programmer, it could be optimized differently on
 every
 platform. I would probably implement my String class in UTF-16 on
 Windows and UTF-8 on Linux to make interactions with the OS and
 neighboring processes as lightweight as possible.

Aliases can introduce a symbol that can mean different things on different platforms:

    // "Operating System" character
    version (Win32) { alias wchar oschar; }
    else            { alias char  oschar; }

    oschar[] a_string_in_the_OS_preferred_format;

Thanks for pointing out this feature. I like it. It provides a mechanism for manual optimization at the cost of greater complexity for those special cases where optimization is called for. You could have different string representations for different zones in your app, labeled by zone name: oschar for internal and OS API calls, xmlchar for an XML I/O boundary etc., so you could change the OS format from OS to OS while leaving the XML format unchanged. I can't help thinking, though, that it would be best reserved for optimization cases, with a simple works-everywhere, called "string" everywhere, string class for the general case. Otherwise, your language tutorials would be teaching you that a string is "char[]" but real production code would almost always be based on locally-invented names for string types. Libraries, which are also trying hard to be real production quality code, would use the above alias approach and invent their own names. Not just at points you needed to manually optimize but literally everywhere you did anything with a string internally, you'd have to choose among the three standard names, char, wchar, and dchar, plus your own custom oschar and xmlchar, plus your GUI library's gchar or kchar, and your ICU library's unichar, plus a database orachar designed to match the database encoding, etc. You could easily end up with so many conversions going on between types locally optimized for each zone in your app that you are globally unoptimized.

That's possible, but so far it doesn't seem so bad to have three core string types. Storing the encoding in the instance instead of the type would turn today's compile-time decisions into run-time decisions, though. That would most likely slow things down since it can't inline as completely.
 One disadvantage of a String class is that the methods of the class
 are
 fixed. With arrays and functions anyone can add a string "method". A
 class
 will actually reduce flexibility in the eyes of the user IMO.
 Another
 disadvantage is that classes in D are by reference (like Java) and
 so
 slicing will have to allocate memory - today a slice is a length and
 pointer to shared data so no allocation is needed. A String struct
 would be
 an option if a class isn't used, though.

It's true what you're saying about the relative lack of flexibility of built-in methods vs. external functions. You can always apply functions to strings, though, and the conservative approach would be to have a few clearly important methods in the string, implement other operations as functions that take string arguments, and over time consider migrating those operations into the string itself. Another possibility might be to have this "oschar" approach above actually built-in, with everybody (starting from the first "hello, world" tutorial) encouraged to use that one by default. That's tricky, though, because when you asked for mystring[3] from your oschar-based string, what would you get? People would expect the third text character, but as you know it would depend on the platform, and would not have any useful meaning in general, which seems pretty awkward for a standard string. It doesn't seem very useful to present something in an array format without the individual elements of the array being very useful. You could make them useful by making dchar[] the default, but everybody would probably fuss about the wasted memory, and production code would end up using char or wchar. So that brings us back to a string class where operator overloading could make the [] array-type access yield consistent, complete codepoints on every platform. I'm sympathetic to performance arguments. That would be one of the big attractions of D. I still can't help thinking that sticking to a single string class shared by almost all of your tutorials, your own code, your downloaded snippets, and all of your libraries might not only be the easiest for programmers to work with but could result in apps that tended to be at least as performant as the existing approach.

Yeah - any design will have trade-offs. dchar[] takes up too much space. On-the-fly character lookup is too slow to make the default. char[] is too fat for asian languages. Judgements like "too much space" and "too slow" are subjective and Walter made his choices. I'm sure he's open to more information that would sway those choices but the best chance of influencing things is to add some solid data that is missing. With your experience in string handling in different languages I'm guessing your opinions are based on accumulated knowledge about what is fast or slow etc so trying to articulate that accumulated knowledge would be very useful.

-Ben
Oct 27 2004
Regan Heath <regan netwin.co.nz> writes:
On Wed, 27 Oct 2004 08:26:52 -0400, Ben Hinkle <bhinkle4 juno.com> wrote:
 That's possible, but so far it doesn't seem so bad to have three core 
 string types. Storing the encoding in the instance instead of the type 
 would turn today's compile-time decisions into run-time decisions, 
 though. That would most likely slow things down since it can't inline as 
 completely.

Ben, can you give me/us an example where this would be the case? How much slower do you think it would make it?

Regan

-- 
Using M2, Opera's revolutionary e-mail client: http://www.opera.com/m2/
Oct 27 2004
"Ben Hinkle" <bhinkle mathworks.com> writes:
"Regan Heath" <regan netwin.co.nz> wrote in message
news:opsgjm7mx55a2sq9 digitalmars.com...
 On Wed, 27 Oct 2004 08:26:52 -0400, Ben Hinkle <bhinkle4 juno.com> wrote:
 That's possible, but so far it doesn't seem so bad to have three core
 string types. Storing the encoding in the instance instead of the type
 would turn today's compile-time decisions into run-time decisions,
 though. That would most likely slow things down since it can't inline as
 completely.

Ben, can you give me/us an example where this would be the case. How much slower do you think it would make it?

I don't know about impact on typical string usage but it certainly makes a difference with a super-cheezy made-up example like:

    import std.c.windows.windows;

    enum Encoding { UTF8, UTF16, UTF32 };

    struct my_string {
        Encoding encoding;
        int length;
        void* data;
    }

    char  index(char[] s,  int n) { return s[n]; }
    wchar index(wchar[] s, int n) { return s[n]; }
    dchar index(dchar[] s, int n) { return s[n]; }

    dchar index(my_string s, int n) {
        switch (s.encoding) {
            case Encoding.UTF8:  return (cast(char*)s.data)[n];
            case Encoding.UTF16: return (cast(wchar*)s.data)[n];
            case Encoding.UTF32: return (cast(dchar*)s.data)[n];
        }
    }

    int main() {
        char[] s = "hello";
        int t1 = GetTickCount();
        for (int k = 0; k < 100_000_000; k++) { index(s,3); }
        int t2 = GetTickCount();
        my_string s2;
        s2.data = s;
        s2.encoding = Encoding.UTF8;
        s2.length = s.length;
        int t3 = GetTickCount();
        for (int k = 0; k < 100_000_000; k++) { index(s2,3); }
        int t4 = GetTickCount();
        printf("compile time %d\n", t2-t1);
        printf("run time %d\n", t4-t3);
        return 0;
    }

compiling with "dmd main.d -O -inline" and running gives

    compile time 110
    run time 531

Any particular example doesn't mean much, though. My statement was meant as a general statement about compile-time vs run-time decisions.
Oct 27 2004
Regan Heath <regan netwin.co.nz> writes:
On Wed, 27 Oct 2004 16:19:14 -0400, Ben Hinkle <bhinkle mathworks.com> 
wrote:
 "Regan Heath" <regan netwin.co.nz> wrote in message
 news:opsgjm7mx55a2sq9 digitalmars.com...
 On Wed, 27 Oct 2004 08:26:52 -0400, Ben Hinkle <bhinkle4 juno.com> 
 wrote:
 That's possible, but so far it doesn't seem so bad to have three core
 string types. Storing the encoding in the instance instead of the type
 would turn today's compile-time decisions into run-time decisions,
 though. That would most likely slow things down since it can't inline 

 completely.

Ben, can you give me/us an example where this would be the case. How much slower do you think it would make it?

I don't know about impact on typical string usage but it certainly makes a difference with a super-cheezy made-up example like:

    import std.c.windows.windows;

    enum Encoding { UTF8, UTF16, UTF32 };

    struct my_string {
        Encoding encoding;
        int length;
        void* data;
    }

    char  index(char[] s,  int n) { return s[n]; }
    wchar index(wchar[] s, int n) { return s[n]; }
    dchar index(dchar[] s, int n) { return s[n]; }

    dchar index(my_string s, int n) {
        switch (s.encoding) {
            case Encoding.UTF8:  return (cast(char*)s.data)[n];
            case Encoding.UTF16: return (cast(wchar*)s.data)[n];
            case Encoding.UTF32: return (cast(dchar*)s.data)[n];
        }
    }

    int main() {
        char[] s = "hello";
        int t1 = GetTickCount();
        for (int k = 0; k < 100_000_000; k++) { index(s,3); }
        int t2 = GetTickCount();
        my_string s2;
        s2.data = s;
        s2.encoding = Encoding.UTF8;
        s2.length = s.length;
        int t3 = GetTickCount();
        for (int k = 0; k < 100_000_000; k++) { index(s2,3); }
        int t4 = GetTickCount();
        printf("compile time %d\n", t2-t1);
        printf("run time %d\n", t4-t3);
        return 0;
    }

compiling with "dmd main.d -O -inline" and running gives

    compile time 110
    run time 531

Any particular example doesn't mean much, though. My statement was meant as a general statement about compile-time vs run-time decisions.

Thanks. I was hacking round with your example, basically inventing a string type which did not have runtime decisions. It is giving me some very strange results; I wonder if you can spot where it's going awry.

    D:\D\src\temp>dmd string.d -O -release -inline
    d:\d\dmd\bin\..\..\dm\bin\link.exe string,,,user32+kernel32/noi;

    D:\D\src\temp>string
    compile time 156
    run time 1000

(string.d is your example, unmodified, as a comparison to what I get below)

    D:\D\src\temp>dmd string2.d -O -release -inline
    d:\d\dmd\bin\..\..\dm\bin\link.exe string2,,,user32+kernel32/noi;

    D:\D\src\temp>string2
    compile time 219
    run time 1156
    template 157

I ran both several times, the results above are typical for my system.

Notice:
1- the compile time string2.d is slower than string.d
2- the template one is faster than the compile time one

I don't understand how either of the above can be true.

--[string2.d]--

    import std.c.windows.windows;

    enum Encoding { UTF8, UTF16, UTF32 };

    struct my_string {
        Encoding encoding;
        void opCall(char[] s)  { encoding = Encoding.UTF8;  cs = s.dup; }
        void opCall(wchar[] s) { encoding = Encoding.UTF16; ws = s.dup; }
        void opCall(dchar[] s) { encoding = Encoding.UTF32; ds = s.dup; }
        union {
            char[] cs;
            wchar[] ws;
            dchar[] ds;
        }
    }

    struct my_string2(Type) {
        Type[] data;
        void opCall(char[] s)  { data = cast(Type[])s.dup; }
        void opCall(wchar[] s) { data = cast(Type[])s.dup; }
        void opCall(dchar[] s) { data = cast(Type[])s.dup; }
        Type opIndex(int i) { return data[i]; }
    }

    char  index(char[] s,  int n) { return s[n]; }
    wchar index(wchar[] s, int n) { return s[n]; }
    dchar index(dchar[] s, int n) { return s[n]; }

    dchar index(my_string s, int n) {
        switch (s.encoding) {
            case Encoding.UTF8:  return s.cs[n];
            case Encoding.UTF16: return s.ws[n];
            case Encoding.UTF32: return s.ds[n];
        }
    }

    char  index(my_string2!(char) s,  int n) { return s.data[n]; }
    wchar index(my_string2!(wchar) s, int n) { return s.data[n]; }
    dchar index(my_string2!(dchar) s, int n) { return s.data[n]; }

    int main() {
        char[] s = "hello";
        int t1 = GetTickCount();
        for (int k = 0; k < 100_000_000; k++) { index(s,3); }
        int t2 = GetTickCount();
        my_string s2;
        s2(s);
        int t3 = GetTickCount();
        for (int k = 0; k < 100_000_000; k++) { index(s2,3); }
        int t4 = GetTickCount();
        my_string2!(char) s3;
        s3(s);
        int t5 = GetTickCount();
        for (int k = 0; k < 100_000_000; k++) { index(s3,3); }
        int t6 = GetTickCount();
        printf("compile time %d\n", t2-t1);
        printf("run time %d\n", t4-t3);
        printf("template %d\n", t6-t5);
        return 0;
    }

-- 
Using M2, Opera's revolutionary e-mail client: http://www.opera.com/m2/
Oct 27 2004
parent "Ben Hinkle" <bhinkle mathworks.com> writes:
 D:\D\src\temp>dmd string2.d -O -release -inline
 d:\d\dmd\bin\..\..\dm\bin\link.exe string2,,,user32+kernel32/noi;

 D:\D\src\temp>string2
 compile time 219
 run time 1156
 template 157

 I ran both several times, the results above are typical for my system.

 Notice:
 1- the compile time string2.d is slower than string.d
 2- the template one is faster than the compile time one

 I don't understand how either of the above can be true.

That is odd. I got:

compile time 78
run time 593
template 79

so I don't know what could be going on. Maybe try switching around the order to see if that changes anything? I don't really know.
Oct 28 2004
prev sibling parent reply "Glen Perkins" <please.dont email.com> writes:
"Ben Hinkle" <bhinkle4 juno.com> wrote in message 
news:clo463$21js$1 digitaldaemon.com...
 You could easily end up with so many conversions going on between
 types locally optimized for each zone in your app that you are
 globally unoptimized.

That's possible, but so far it doesn't seem so bad to have three core string types.

I think you'll end up with many more than that. Since people will be required to make what is essentially an optimization decision every time they do anything with text, the choice will typically be different on different platforms. Rather than letting the implementation deal with that so that source code can be ported and still remain close to optimal, this design requires the programmer to either:

1) live with suboptimal performance when porting,
2) manually rewrite most of his code and live with separate versions that are harder to keep in sync, or
3) use the "alias" feature to invent a local name for a "standard" string.

Of the three, I think #3 is the most attractive. If lots of people agree with me, then when we end up reusing each other's code, we'll end up with the standard three string types, plus our own type, plus those invented by others. And there's no guarantee that our various alias types will all make the same decisions about when to be what. So now I have a whole bunch of string types to deal with, some of which are the same on some platforms but different on others. When I try to optimize my code so that I don't have lots of unnecessary back-and-forth encoding conversions, I have to further de-sync my different platform versions, or use more aliases to manage the aliases, or, once again, live with the lack of optimization, attempting to repair it only where necessary.

If it's going to end up being #3--and it probably should, because of what we know about optimization, where the majority of your operations of all types could execute instantly without a noticeable improvement in overall app performance--then you could probably get about the same performance, without the design nightmare, by using a single, standard string type (optimized by the implementation for the platform) for almost everything.
 Storing the encoding in the instance instead of the type would turn
 today's compile-time decisions into run-time decisions, though. That 
 would
 most likely slow things down since it can't inline as completely.

I'm not suggesting a string type that would have a field to hold its encoding, so that two instances of the same string class on the same platform could have two different internal encodings, and functions would have to decide at runtime what code to run for each instance. I'm talking about a situation similar to the alias idea, where every instance of a standard string on a given platform, whether in your own code or the libraries, would be in the same encoding--an encoding known at compile time.

The information to early-bind the methods would be available at compile time, and a smart compiler might be able to use that fact for compile-time optimization, but I can't completely disagree with you. There may be other reasons why the compiler might not be able to do the binding at compile time, perhaps due to the general implementation of OO support.

Even if this is the case, you don't have to dismiss an idea because it doesn't optimize performance for each instance in which it is used. GC itself doesn't optimize performance for each instance, but it's still the way to go (in my opinion) because the performance of most parts is irrelevant to the performance of the whole, as long as those parts are reasonable, and you have a manual option for special cases. I think the same argument implies having a single default string type and letting the compiler optimize it.
 I'm sympathetic to performance arguments. That would be one of the 
 big
 attractions of D. I still can't help thinking that sticking to a
 single string class shared by almost all of your tutorials, your 
 own
 code, your downloaded snippets, and all of your libraries might not
 only be the easiest for programmers to work with but could result 
 in
 apps that tended to be at least as performant as the existing
 approach.

Yeah - any design will have trade-offs. dchar[] takes up too much space. On-the-fly character lookup is too slow to make the default.

I'm not sure I understand this. I realize that you're just quoting things that "people say", but if this means it's better to have byte fetching from UTF-8 be the default instead of character fetching, it sounds as though it's claiming that it's a better default to do something useless than useful if the useless operation is faster.

For the majority of text work, byte fetching is useless. What you care about is the text, not its representation. Only in a minority of cases would byte fetching matter. Those special cases are definitely important--the general cases will be built on top of byte fetching, so fast byte fetching is mandatory--but defaults should be based on the typical need, not the exceptional need. If the typical need requires more work, well, it's still the typical need, and the default, almost by definition, should be designed for it.

Of course, I may have misunderstood you completely. ;-) Even more likely, this particular point doesn't matter, but it has been a source of some frustration how often people with a "C mindset" (and I'm not talking about you but am thinking of countless design meetings over the years) end up optimizing the insignificant at the expense of the significant, because the insignificant is always in their face.

I think you can have the best of both worlds with a design based roughly on the idea of programmer productivity for defaults, plus fine-grained manual optimization features (that integrate easily with the default features) for bottlenecks (defined as any place where a local optimization will produce a global optimization). D is quite close to such a design, but it seems to me that the string approach doesn't quite match.
 char[] is too
 fat for asian languages. Judgements like "too much space" and "too 
 slow"
 are subjective and Walter made his choices. I'm sure he's open to 
 more
 information that would sway those choices but the best chance of
 influencing things is to add some solid data that is missing.

I'm not sure who "Walter" is, but it sounds like he's the guy to thank for such a nice language design. (If I didn't mean that, I wouldn't waste my time writing any of this.)

For the specific issue of strings, the information that I think is most relevant (and, as I said before, I still can't be *sure* that it's relevant in D's case) is not "data" per se, but a reminder that C++ is about the worst-case scenario among major languages when it comes to programmer productivity in text handling, in large part because you ALWAYS end up getting stuck with multiple string types in any significant app. The problem is NOT that nobody ever managed to create a useful string type for C++; it's that EVERYBODY did so, because Stroustrup wouldn't. The "data", I suppose, is what happened in the case of C++ and didn't happen to any language with a built-in standard string class, but of course you can argue about the relevance of the comparisons.
 With your
 experience in string handling in different languages I'm guessing 
 your
 opinions are based on accumulated knowledge about what is fast or 
 slow etc
 so trying to articulate that accumulated knowledge would be very 
 useful.

My accumulated knowledge tells me that what's fast or slow for a string design should NOT be your primary consideration, even when performance of the app as a whole IS (and it usually isn't). I DO care about performance. Java's design prohibits the kind of performance I'm looking for, which is one reason I'm curious about D. But I care about global performance, not local performance, and I also care about other significant global issues such as programmer productivity, lack of bugs, source portability, maintenance costs, etc. that almost always matter more than the microscale performance of your strings.

A factory doing manual labor can double its output by doubling the people at every station, or it can pull people off of some stations, reducing the local "performance" at those stations, and reallocate them to double the staff at the bottleneck only. One approach improves global performance by improving local performance everywhere. The other either doesn't improve or actually loses performance everywhere except at the bottleneck. Both produce the same doubling of total factory output.

Which approach is better? Well, gaining performance everywhere is obviously better--you never know when it might come in handy, right?--until you factor in the cost. I don't want to get too tangled in the details of the analogy, but the cost of having three standard strings for fine-grained performance tuning everywhere, plus homemade and 3rd party aliases, plus multiple 3rd party string classes that will fill the void in the standard, is the complexity that it will add to designs, with all of the implications that has for debugging, code reuse, architectural decisions, portability, maintenance, and general programmer productivity. All of those factors have costs, and some of them may even negatively impact global performance, which was the reason for the extra complexity to begin with.

I just have a hard time imagining that MAKING people micromanage their string implementations in all cases will produce superior global performance to simply ALLOWING them to do so where it impacts global performance. Doubling the staff at every factory station results in no more total production than simply doubling the staff at the bottleneck. I have an even harder time imagining that the benefits of the unavoidable additional complexity (which you can never avoid if you ever use other people's code) will be worth the performance benefit that may not even exist.

I could still be wrong about any of this. Am I overlooking something?
Oct 27 2004
next sibling parent Regan Heath <regan netwin.co.nz> writes:
On Wed, 27 Oct 2004 17:34:14 -0700, Glen Perkins <please.dont email.com> 
wrote:
 "Ben Hinkle" <bhinkle4 juno.com> wrote in message 
 news:clo463$21js$1 digitaldaemon.com...
 Storing the encoding in the instance instead of the type would turn
 today's compile-time decisions into run-time decisions, though. That 
 would
 most likely slow things down since it can't inline as completely.

I'm not suggesting a string type that would have a field to hold its encoding, so that two instances of the same string class on the same platform could have two different internal encodings and functions would have to decide at runtime what code to run for each instance.

No, but I did :) I am starting to think it's unnecessary however, given that converting from one encoding to another necessitates a copy of the data anyway.

So, instead: a single string type that could be encoded internally as any of the available encodings, couldn't change encoding itself, but could be cast/converted to another encoding (creating a new string). Plus, it needs all the functionality of our current arrays, i.e. indexing, slicing, and being able to write methods for it, i.e.

void foo(char[] a, int b);

char[] aa;
aa.foo(5);   <-- calls 'foo' above

I'm pretty sure the above idea is not possible without some sort of compiler magic.

Regan
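One possible shape for such a runtime-tagged type, sketched in D (the names `Str`, `Enc`, and `toUTF16Str` are invented for illustration; the only real library call assumed is Phobos's `std.utf.toUTF16`):

```d
import std.utf;   // toUTF16 etc. from Phobos

enum Enc { UTF8, UTF16, UTF32 }

// A string whose encoding is fixed at construction; "changing" the
// encoding only ever produces a new value, since conversion copies anyway.
struct Str
{
    Enc enc;
    union { char[] c; wchar[] w; dchar[] d; }

    void opCall(char[] s) { enc = Enc.UTF8; c = s.dup; }

    // Conversion builds a new Str; the original keeps its encoding.
    Str toUTF16Str()
    {
        Str r;
        r.enc = Enc.UTF16;
        switch (enc)
        {
            case Enc.UTF8:  r.w = toUTF16(c); break;
            case Enc.UTF16: r.w = w.dup;      break;
            case Enc.UTF32: r.w = toUTF16(d); break;
        }
        return r;
    }
}
```

The array-method call syntax above (`aa.foo(5)`) already works for plain arrays; giving a struct like this the full indexing/slicing behaviour of the built-in arrays is where the compiler magic would be needed.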
Oct 27 2004
prev sibling parent reply "Walter" <newshound digitalmars.com> writes:
"Glen Perkins" <please.dont email.com> wrote in message
news:clpeud$lql$1 digitaldaemon.com...
 I'm not sure I understand this. I realize that you're just quoting
 things that "people say", but if this means it's better to have byte
 fetching from UTF-8 be the default instead of character fetching, it
 sounds as though it's claiming that it's a better default to do
 something useless than useful if the useless operation is faster. For
 the majority of text work, byte fetching is useless. What you care
 about is the text, not its representation. Only in a minority of cases
 would byte fetching matter. Those special cases are definitely
 important--the general cases will be built on top of byte fetching so
 fast byte fetching is mandatory--but defaults should be based on the
 typical need, not the exceptional need. If the typical need requires
 more work, well, it's still the typical need and the default, almost
 by definition, should be designed for the typical need.

I'm not so sure this is correct. For a number of common string operations, such as copying and searching, byte indexing of UTF-8 is faster than codepoint indexing. For sequential codepoint access, the foreach() statement does the job. For random access of codepoints, one has to always start from the beginning and count forward anyway, and foreach() does that.

As for a single string type, there is no answer for that. Each has significant tradeoffs. For a speed-oriented language, the choice needs to be under the control of the application programmer, not the language. The three types are readily convertible into each other. I don't really see the need for application programmers to layer on more string types.
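The byte-versus-codepoint distinction is easy to see in a few lines of D (a small illustrative sketch, using printf as the rest of the thread does; `codepoints` is an invented helper):

```d
// foreach with a dchar loop variable decodes the UTF-8 sequences in a
// char[] on the fly; indexing and .length, by contrast, work in bytes.
size_t codepoints(char[] s)
{
    size_t n = 0;
    foreach (dchar c; s)
        n++;
    return n;
}

int main()
{
    char[] s = "h\u00EBllo";   // the 'ë' occupies two UTF-8 bytes

    printf("%d bytes, %d codepoints\n",
           s.length, codepoints(s));   // 6 bytes, 5 codepoints
    return 0;
}
```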
Oct 27 2004
parent reply "Glen Perkins" <please.dont email.com> writes:
"Walter" <newshound digitalmars.com> wrote in message 
news:clptlo$161f$1 digitaldaemon.com...

 ...[UTF-8 indexing issue that I don't want to waste your time 
 with]...

 As for a single string type, there is no answer for that. Each has
 significant tradeoffs. For a speed oriented language, the choice 
 needs to be
 under the control of the application programmer, not the language.

I agree. I think there should be a standard string class for default use, plus a selection of byte array forms (e.g. char[], wchar[], dchar[]) for use anywhere that the programmer determined that their use instead of the default improved the app. The choice would be completely under the control of the programmer.

The encoding of the default string class would be up to the implementors to optimize for the platform, so that for the great majority of text operations in an app, the default string would work so well that replacing it with one of the byte array forms would be found to have no positive impact on the app. However, anytime the programmer encountered a situation where use of a byte array type improved the app, he could use it.

With this approach, you could have code with the same performance as under the current system, because anytime it was slower you could just use the current system. However, having a good default string as well, used by most apps on most platforms by most people most of the time, would simplify designs, porting, maintenance, programmer productivity, etc.
 The three
 types are readilly convertible into each other.

In fact, all four types would be readily convertible, though by having one that was almost always the best choice, regardless of platform, you would be able to avoid many unnecessary conversions that could easily de-optimize your code as you added libraries and ported your app to other platforms. Also, by matching the implementation of that default to the preferred form of the local OS APIs, conversions between the default string class and the OS API format could probably be compiled down to very lightweight object code on any platform, from the same source code.
 I don't really see the need
 for application programmers to layer on more string types.

I do, and apparently I'm not alone. People seem to mention it a lot in this newsgroup (I've discovered), just as they did for so long with C++. If I want char[] on Linux and wchar[] on Windows and want to avoid the nightmare of maintaining parallel but subtly different code, I need to create my own type using the "alias" feature. The authors of a couple of libraries I'll use will do likewise, but with their own type names and maybe different alias resolution rules. I expect some people will solve it with string classes.

Stroustrup took a similar position about the need for programmers to optimize their strings long enough that every C++ library and API created its own string type. He once stated in a meeting I attended that his greatest regret about C++ was waiting so long to have a standard library, and that the most requested feature of that library had been a string class.

By adding just one more standard string type that would be a good default on every platform, I think you could eliminate the need so many people will feel to create their own, and prevent string types from multiplying like bunnies, as happened to C++. Performance isn't the only thing programmers want, even from a high-performance language. They'd also like to avoid unnecessary complexity, avoid bugs, reuse other people's code, target multiple platforms with mostly the same source, and so on. I think having a single, good default string type could be very helpful for these things without having to harm performance.

Even so, I realize that my opinion may be based on incorrect assumptions, missing information, faulty logic, selective memory, or peculiar personal preferences, so I may be wrong. If so, though, I'd be curious to know why.
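The per-platform alias described above might look like this in D (a sketch; the alias name `str_t` is made up for the example):

```d
// Hypothetical per-platform "standard string" alias: UTF-16 where the
// OS APIs prefer it, UTF-8 bytes elsewhere.
version (Windows)
    alias wchar[] str_t;   // Win32 "W" APIs take UTF-16
else
    alias char[]  str_t;   // Unix APIs are byte-oriented

// Code written against str_t ports unchanged -- but two libraries that
// define their own aliases differently still force conversions at their
// boundaries, which is exactly the multiplication problem.
uint firstUnit(str_t s)
{
    return s[0];
}
```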
Oct 28 2004
next sibling parent reply =?ISO-8859-1?Q?Anders_F_Bj=F6rklund?= <afb algonet.se> writes:
Glen Perkins wrote:

 As for a single string type, there is no answer for that. Each has
 significant tradeoffs. For a speed oriented language, the choice needs 
 to be under the control of the application programmer, not the language.

I agree. I think there should be a standard string class for default use plus a selection of byte array forms (e.g. char[], wchar[], dchar[]) for use anywhere that the programmer determined that their use instead of the default improved the app.

I don't have a problem with a standard String *class* present in D, as long as I don't *have* to use it (and OOP) - like I do in Java...

The beauty of D's string types (char[] and wchar[]) is that they work for plain old procedural C-style programs too, not just objects?

--anders
Oct 28 2004
parent Regan Heath <regan netwin.co.nz> writes:
On Thu, 28 Oct 2004 10:24:22 +0200, Anders F Björklund <afb algonet.se> 
wrote:
 Glen Perkins wrote:

 As for a single string type, there is no answer for that. Each has
 significant tradeoffs. For a speed oriented language, the choice needs 
 to be under the control of the application programmer, not the 
 language.

I agree. I think there should be a standard string class for default use plus a selection of byte array forms (e.g. char[], wchar[], dchar[]) for use anywhere that the programmer determined that their use instead of the default improved the app.

I don't have a problem with a standard String *class* present in D, as long as I don't *have* to use it (and OOP) - like I do in Java... The beauty about D's string types (char[] and wchar[]) is that they work for plain old procedural C-style programs too, not just objects ?

So we use a 'struct' instead.

Regan
Oct 28 2004
prev sibling next sibling parent reply Ben Hinkle <bhinkle4 juno.com> writes:
 I need to create my own type using the "alias" feature. The authors of
 a couple of libraries I'll use will do likewise, but with their own
 type names and maybe different alias resolution rules.

Technically an alias introduces a new symbol. It's like a #define. It doesn't actually introduce a new type (see typedef). For example, the following doesn't compile:

alias int foo;
void bar(int y) {}
void bar(foo y) {}
int main() { bar(0); return 0; }

compiling results in:

"function bar overloads void(int y) and void(int y) both match argument list for bar"

Redefining an alias is ignored (well, it is very useful for overloading functions, but not for basic types). For example:

alias int foo;
alias long foo;
void bar(int y)  { printf("int\n");  }
void bar(long y) { printf("long\n"); }
int main() { foo x; bar(x); return 0; }

prints "int". So defining multiple aliases for strings or any other type is a pretty harmless thing to do. It should only affect the readability and maintainability of the code.
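The typedef contrast is worth making concrete (a small sketch in the D of this era, where typedef still existed; `which` is an invented helper):

```d
// alias is a transparent second name; typedef (in old D) creates a
// distinct type derived from int that participates in overloading.
alias   int ifoo;   // ifoo IS int: which(int)/which(ifoo) would collide
typedef int tfoo;   // tfoo is a distinct type

char[] which(int y)  { return "int";  }
char[] which(tfoo y) { return "tfoo"; }   // legal: distinct overload

int main()
{
    tfoo x;
    printf("%.*s\n", which(x));           // exact match picks the tfoo overload
    printf("%.*s\n", which(cast(int)x));  // the cast reaches the int overload
    return 0;
}
```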
Oct 28 2004
parent reply Regan Heath <regan netwin.co.nz> writes:
On Thu, 28 Oct 2004 08:35:24 -0400, Ben Hinkle <bhinkle4 juno.com> wrote:
 So defining multiple aliases for strings or any other type is
 a pretty harmless thing to do. It should only effect the readability and
 maintainability of the code.

I'd argue that it's not harmless for the very reasons you just mentioned. Readability and maintainability are important when working on any large-ish project.

Regan
Oct 28 2004
parent reply "Ben Hinkle" <bhinkle mathworks.com> writes:
"Regan Heath" <regan netwin.co.nz> wrote in message
news:opsglk19qr5a2sq9 digitalmars.com...
 On Thu, 28 Oct 2004 08:35:24 -0400, Ben Hinkle <bhinkle4 juno.com> wrote:
 So defining multiple aliases for strings or any other type is
 a pretty harmless thing to do. It should only effect the readability and
 maintainability of the code.

I'd argue that it's not harmless for the very reasons you just mentioned. Readability and maintainability are important when working on any large-ish project.

But introducing more names doesn't always make something more readable or maintainable. One has to factor in the size of the group and the time-scale of the life of the code. A wrapper or alias might seem obvious to the couple of people who started the project, but years down the road, with a group orders of magnitude larger, a little helper wrapper can add up to be more overhead than it is worth.

Also, notions of "this code is readable" and "maintainable" are much more subjective than "this code doesn't compile" or "this code uses the wrong type". My personal preference is that keeping things simple is the best way to make something readable and maintainable.
Oct 29 2004
parent reply Regan Heath <regan netwin.co.nz> writes:
On Fri, 29 Oct 2004 10:39:08 -0400, Ben Hinkle <bhinkle mathworks.com> 
wrote:
 "Regan Heath" <regan netwin.co.nz> wrote in message
 news:opsglk19qr5a2sq9 digitalmars.com...
 On Thu, 28 Oct 2004 08:35:24 -0400, Ben Hinkle <bhinkle4 juno.com> 
 wrote:
 So defining multiple aliases for strings or any other type is
 a pretty harmless thing to do. It should only effect the readability 

 maintainability of the code.

I'd argue that it's not harmless for the very reasons you just mentioned. Readability and maintainability are important when working on any large-ish project.

But introducing more names doesn't always make something more readable or maintainable. One has to factor in the size of the group and time-scale of life of the code. A wrapper or alias might seem obvious to the couple of people who started the project but years down the road with a group orders of magnitude larger a little helper wrapper can add up to be more overhead than it is worth. Also notions of "this code is readable" and "maintainable" are much more subjective than "this code doesn't compile" or "this code uses the wrong type". My personal preference is that keeping things simple is the best way to make something readable and maintainable.

That's what *I* implied/said, wasn't it?

Regan
Oct 31 2004
parent ac <ac_member pathlink.com> writes:
 I'd argue that it's not harmless for the very reasons you just 
 mentioned.
 Readability and maintainability are important when working on any
 large-ish project.



My personal preference is that keeping things simple is the best way to make something readable and maintainable.

That's what *I* implied/said, wasn't it?

As an old man, I cannot avoid thinking that these (obviously both) talented young men cannot find a place between their hormones and the writing on the wall. Had I been that age, I'd have participated vigorously in this.

I hope we get Walter with us in introducing a new name for the Canonical String. Be it an alias, a type, a class, or whatever. The main point is that we do need A Type that "everyone" uses. Sure, we can claim that it's the wchar, uchar, dchar, or whatever, but hey, please, do remember the very purpose of a programming language:

"We may create a programming language from the point of the computer. We may create a programming language from the point of the programmer. We may create a ... ... sw-developing company. We ... ... education. W... ... maintainability."

... the story has other leaves. Psychology, practice, history, just about everything "non-reality" related tells us 10-to-1 that we should create a name and tell everyone to use that. Technically we do not need this, but this, I'm sorry, is not the issue here.
Nov 02 2004
prev sibling next sibling parent reply Sean Kelly <sean f4.ca> writes:
In article <clq9a8$1jkb$1 digitaldaemon.com>, Glen Perkins says...
 I don't really see the need
 for application programmers to layer on more string types.

I do, and apparently I'm not alone. People seem to mention it a lot in this newsgroup (I've discovered), just as they did for so long with C++. If I want char[] on Linux and wchar[] on Windows and want to avoid the nightmare of maintaining parallel but subtly different code, I need to create my own type using the "alias" feature.

Out of curiosity, why would you want to use different char types internally in the same application depending on platform? At worst I would think that the I/O might translate to different encodings, but the internal code would use some normalized form regardless of platform.
The authors of 
a couple of libraries I'll use will do likewise, but with their own 
type names and maybe different alias resolution rules. I expect some 
people will solve it with string classes.

They are certainly welcome to, but I'm not sure I see a need for a standard string class. The built-ins plus support functions should be quite sufficient.
Stroustrup took a similar 
position about the need for programmers to optimize their strings long 
enough that every C++ library and API created its own string type. He 
once stated in a meeting I attended that his greatest regret about C++ 
was waiting so long to have a standard library and that the most 
requested feature of that library had been a string class.

But D has string support while early C++ did not. In fact the current C++ string type is basically just a vector with some helper functions tacked on, and those functions could just as easily have been implemented separate from the string class (as is becoming popular in these days of generic programming).
By adding 
just one more standard string type that would be a good default on 
every platform, I think you could eliminate the need so many people 
will feel to create their own and prevent string types from 
multiplying like bunnies, as happened to C++.

I think people feel the need for a string class for familiarity rather than for need. While dealing with multibyte encodings can be a tad odd at first, foreach and slices make things quite painless.
Even so, I realize that my opinion may be based on incorrect 
assumptions, missing information, faulty logic, selective memory, or 
peculiar personal preferences, so I may be wrong. If so, though, I'd 
be curious to know why.

I haven't seen a good argument *for* a string class yet, but I could certainly be swayed if one were provided. What is the advantage over the built-ins? Is this purely a desire to create a standard implementation because we know that people are going to try to roll their own?

Sean
Oct 28 2004
parent reply Regan Heath <regan netwin.co.nz> writes:
On Thu, 28 Oct 2004 14:57:33 +0000 (UTC), Sean Kelly <sean f4.ca> wrote:
 In article <clq9a8$1jkb$1 digitaldaemon.com>, Glen Perkins says...
 I don't really see the need
 for application programmers to layer on more string types.

I do, and apparently I'm not alone. People seem to mention it a lot in this newsgroup (I've discovered), just as they did for so long with C++. If I want char[] on Linux and wchar[] on Windows and want to avoid the nightmare of maintaining parallel but subtly different code, I need to create my own type using the "alias" feature.

Out of curiosity, why would you want to use different char types internally in the same application depending on platform? At worst I would think that the i/o might translate to different encodings but the internal code would use some normalized form regardless of platform.

Glen mentioned system API calls. AFAIK Unix variants use 8-bit chars internally, but the later Windows platforms use 16-bit, so if you're doing a lot of system API calls it makes sense to have the string data in the right format. Yes/no?
 I haven't seen a good argument *for* a string class yet, but I could 
 certainly
 be swayed if one were provided.  What is the advantage over the 
 built-ins?

I am hoping to outline some below.
 Is
 this purely a desire to create a standard implementation because we know 
 that
 people are going to try to roll their own?

In part, yes, the result of which would be...

Imagine in the future, when a large number of 3rd party libs exist: if each lib uses a different char type, then interfacing between them all will involve conversions, lots of them. If there were only 1 string type, this problem would not exist. I realise that some conversions are unavoidable (i.e. converting for I/O), but converting for use internally should be avoided without a very good reason; I cannot think of any at the moment which I would consider good enough to incur the cost of conversion.

Further, say a conscientious library developer understands the above and wants to make his/her lib as compatible as possible. To do so, he/she has to either:

1- write everything 3 times (as is already happening in the std libs)
2- do conversion internally

Neither option is particularly good, don't you agree?

Basically, I believe conversion should be done at the input and output stages but nowhere in between. The way to achieve that is to have 1 string type used internally; the way to ensure that is to only give people the choice of 1 string type. As suggested above, that type may differ on each platform. Perhaps it could/should also differ per application; this could be achieved with a compile-time flag to choose the internal string type. Not a perfect solution, I know, as now we need 3 versions of each library, one for each internal char type.

That's my 2c anyways.

Regan
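The convert-at-the-edges rule above is already expressible with Phobos's std.utf (toUTF8/toUTF16 are real Phobos functions; the two wrapper names here are invented for the sketch):

```d
import std.utf;   // toUTF8, toUTF16, toUTF32

// Hypothetical app policy: char[] everywhere internally; convert only
// at the OS boundary, never in between.
wchar[] toOsForm(char[] internal)    // e.g. just before a Win32 "W" call
{
    return toUTF16(internal);
}

char[] fromOsForm(wchar[] external)  // e.g. just after reading UTF-16 input
{
    return toUTF8(external);
}
```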
Oct 28 2004
parent reply "Walter" <newshound digitalmars.com> writes:
"Regan Heath" <regan netwin.co.nz> wrote in message
news:opsgllvqje5a2sq9 digitalmars.com...
 Perhaps it could/should also differ per application, this could be
 achieved with a compile time flag to choose the internal string type. Not
 a perfect solution I know, as now we need 3 versions of each library, one
 for each internal char type.

Although some are doing this, I argue it isn't necessary. Just pick one, and use conversions as necessary.
Oct 28 2004
parent Regan Heath <regan netwin.co.nz> writes:
On Thu, 28 Oct 2004 17:53:43 -0700, Walter <newshound digitalmars.com> 
wrote:
 "Regan Heath" <regan netwin.co.nz> wrote in message
 news:opsgllvqje5a2sq9 digitalmars.com...
 Perhaps it could/should also differ per application, this could be
 achieved with a compile time flag to choose the internal string type. 
 Not
 a perfect solution I know, as now we need 3 versions of each library, 
 one
 for each internal char type.

Although some are doing this, I argue it isn't necessary. Just pick one, and use conversions as necessary.

Quite frankly, yuck. As I said earlier, it's inefficient to convert internally; you should only convert on input and output.

Regan
Oct 28 2004
prev sibling parent reply "Walter" <newshound digitalmars.com> writes:
"Glen Perkins" <please.dont email.com> wrote in message
news:clq9a8$1jkb$1 digitaldaemon.com...
 I don't really see the need
 for application programmers to layer on more string types.

I do, and apparently I'm not alone. People seem to mention it a lot in this newsgroup (I've discovered), just as they did for so long with C++.

If I want char[] on Linux and wchar[] on Windows and want to avoid the nightmare of maintaining parallel but subtly different code, I need to create my own type using the "alias" feature. The authors of a couple of libraries I'll use will do likewise, but with their own type names and maybe different alias resolution rules. I expect some people will solve it with string classes.

Stroustrup took a similar position about the need for programmers to optimize their own strings, and held it long enough that every C++ library and API created its own string type. He once stated in a meeting I attended that his greatest regret about C++ was waiting so long to have a standard library, and that the most requested feature of that library had been a string class.

By adding just one more standard string type that would be a good default on every platform, I think you could eliminate the need so many people will feel to create their own, and prevent string types from multiplying like bunnies, as happened to C++.

C++ needs a string class because core C++ strings are so inadequate. But this is not true for D - core strings are more than up to the job. D core strings can do everything std::string does, and a lot more. D core strings more than cover what java.lang.String does, as well.

Using 'alias' doesn't create a new type. It just renames an existing type. Hence, I don't see much of a collision problem between different code bases that use aliases.

I also just don't see the need to even bother using aliases. Just use char[]. I think the issue comes up repeatedly because people coming from a C++ background are so used to char* being inadequate that it's hard to get comfortable with the idea that char[] really does work <g>.
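Walter's point that alias merely renames can be demonstrated directly. The alias name `MyString` below is made up for illustration (the modern `alias X = Y;` spelling is used; D1-era code wrote `alias char[] MyString;`):

```d
// 'alias' merely renames char[]; both names denote the exact same type.
alias MyString = char[];

void main()
{
    MyString a = "hello".dup;  // .dup needed under modern D, where literals are immutable
    char[] b = a;              // no conversion, no cast: they are the same type
    assert(a is b);            // both slices refer to the same data
    static assert(is(MyString == char[]));  // identical even at compile time
}
```

This is why two libraries using different aliases for char[] still interoperate without any conversion code.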
Oct 28 2004
next sibling parent reply Regan Heath <regan netwin.co.nz> writes:
On Thu, 28 Oct 2004 09:39:13 -0700, Walter <newshound digitalmars.com> 
wrote:
 "Glen Perkins" <please.dont email.com> wrote in message
 news:clq9a8$1jkb$1 digitaldaemon.com...
 I don't really see the need
 for application programmers to layer on more string types.

I do, and apparently I'm not alone. People seem to mention it a lot in this newsgroup (I've discovered), just as they did for so long with C++.

If I want char[] on Linux and wchar[] on Windows and want to avoid the nightmare of maintaining parallel but subtly different code, I need to create my own type using the "alias" feature. The authors of a couple of libraries I'll use will do likewise, but with their own type names and maybe different alias resolution rules. I expect some people will solve it with string classes.

Stroustrup took a similar position about the need for programmers to optimize their own strings, and held it long enough that every C++ library and API created its own string type. He once stated in a meeting I attended that his greatest regret about C++ was waiting so long to have a standard library, and that the most requested feature of that library had been a string class.

By adding just one more standard string type that would be a good default on every platform, I think you could eliminate the need so many people will feel to create their own, and prevent string types from multiplying like bunnies, as happened to C++.

C++ needs a string class because core C++ strings are so inadequate. But this is not true for D - core strings are more than up to the job. D core strings can do everything std::string does, and a lot more. D core strings more than cover what java.lang.String does, as well.

Using 'alias' doesn't create a new type. It just renames an existing type. Hence, I don't see much of a collision problem between different code bases that use aliases.

I also just don't see the need to even bother using aliases. Just use char[]. I think the issue comes up repeatedly because people coming from a C++ background are so used to char* being inadequate that it's hard to get comfortable with the idea that char[] really does work <g>.

It's not whether it works or not; I agree it works very well. It's the fact that there are 3 of them: it's possible people will use different ones in their libs, and then my program will have to do internal conversions all over the place. Conversion should only be done at the input and/or output stages.

Regan
Oct 28 2004
parent reply "Walter" <newshound digitalmars.com> writes:
"Regan Heath" <regan netwin.co.nz> wrote in message
news:opsgllx2xp5a2sq9 digitalmars.com...
 It's the fact that there are 3 of them, it's possible people will use
 different ones in their libs, then my program will have to do internal
 conversions all over the place.

I admit it may become a problem, but I don't think it will. More experience will let us know. With C++, there is a big problem because char* and wchar_t* are simply incompatible; one cannot even do conversions. One of my beefs with C++ was having to have multiple versions of the same function for the various char types. This isn't necessary in D.
Oct 28 2004
parent reply Regan Heath <regan netwin.co.nz> writes:
On Thu, 28 Oct 2004 17:51:04 -0700, Walter <newshound digitalmars.com> 
wrote:
 "Regan Heath" <regan netwin.co.nz> wrote in message
 news:opsgllx2xp5a2sq9 digitalmars.com...
 It's the fact that there are 3 of them, it's possible people will use
 different ones in their libs, then my program will have to do internal
 conversions all over the place.

I admit it may become a problem, but I don't think it will. More experience will let us know. With C++, there is a big problem because char* and wchar_t* are simply incompatible, one cannot even do conversions. One of my beefs with C++ was having to have multiple versions of the same function for the various char types. This isn't necessary in D.

Isn't it? Explain std.string then. Don't people convert between char* and wchar_t* all the time, with functions? How is that really different from using a cast() in D? The syntax and knowing the encoding are the only differences I can see.

Regan
Oct 28 2004
parent "Walter" <newshound digitalmars.com> writes:
"Regan Heath" <regan netwin.co.nz> wrote in message
news:opsglzsouo5a2sq9 digitalmars.com...
 On Thu, 28 Oct 2004 17:51:04 -0700, Walter <newshound digitalmars.com>
 wrote:
 "Regan Heath" <regan netwin.co.nz> wrote in message
 news:opsgllx2xp5a2sq9 digitalmars.com...
 It's the fact that there are 3 of them, it's possible people will use
 different ones in their libs, then my program will have to do internal
 conversions all over the place.

I admit it may become a problem, but I don't think it will. More experience will let us know. With C++, there is a big problem because char* and wchar_t* are simply incompatible, one cannot even do conversions. One of my beefs with C++ was having to have multiple versions of the same function for the various char types. This isn't necessary in D.

Isn't it? Explain std.string then. Don't people convert between char* and wchar_t* all the time, with functions? How is that really different from using a cast() in D, the syntax and knowing the encoding are the only differences I can see.

The conversion doesn't work because it doesn't know about UTF. An attempt is being made to fix this in the latest standards.
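The distinction the two are circling can be shown concretely. A sketch, under the assumption that an array cast in D reinterprets the underlying bytes rather than transcoding them, whereas the std.utf routines are UTF-aware:

```d
import std.utf : toUTF16;

void main()
{
    string s = "é";          // one code point, but two UTF-8 code units
    assert(s.length == 2);

    // A cast between array types merely repaints the bytes; it does not
    // transcode. cast(wstring) s would yield one garbage wchar, not "é"w.

    wstring w = toUTF16(s);  // the UTF-aware conversion
    assert(w.length == 1);   // a single UTF-16 code unit
    assert(w == "é"w);
}
```

So in D the encodings are at least known and the library conversion is correct, which is exactly what the C char*/wchar_t* conversions Walter describes lacked.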
Oct 28 2004
prev sibling next sibling parent =?ISO-8859-1?Q?Anders_F_Bj=F6rklund?= <afb algonet.se> writes:
Walter wrote:

 I also just don't see the need to even bother using aliases. Just use
 char[]. I think the issue comes up repeatedly because people coming from a
 C++ background are so used to char* being inadequate that it's hard to get
 comfortable with the idea that char[] really does work <g>.

I just found char[][] a tad confusing, but maybe it grows on you... :-)

Oh well; I can still use a local "string" alias for char[] if I want to, even if it doesn't make it into the standard D includes. No big deal.

And there probably should be a warning that ".length" only works for ASCII strings, since it returns the number of code units otherwise?

--anders
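Anders' caveat about ".length" is easy to demonstrate. For ASCII the two notions coincide, but as soon as a character needs more than one UTF-8 code unit they diverge (std.utf.count gives the code-point tally):

```d
import std.utf : count;

void main()
{
    string s = "naïve";
    assert(s.length == 6);  // .length counts UTF-8 code units; 'ï' occupies two
    assert(count(s) == 5);  // counting code points gives the five characters
}
```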
Oct 28 2004
prev sibling next sibling parent reply James McComb <ned jamesmccomb.id.au> writes:
Walter wrote:

 I also just don't see the need to even bother using aliases. Just use
 char[].

But you need to use aliases for the following scenario.

Suppose that:
 1. I want to write code for both Windows and Unix.
 2. I don't want to pay any string conversion costs at all.

I assume the way to do this in D is:
 1. Use wchar[] on Windows and make UTF-16 API calls.
 2. Use char[] on Linux and make UTF-8 API calls.
 3. Use an alias to toggle between wchar[] and char[].
 4. Use a string library that defines all functions in both wchar[] and char[] versions.

If I just used char[], I would be forced to pay string conversion costs, as Windows ultimately processes all strings in UTF-16.
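The toggle in step 3 can be sketched with a version block; the alias name `osstring` below is made up for illustration:

```d
// Platform-selected string alias, as in James's scenario.
version (Windows)
    alias osstring = wstring;   // UTF-16, matching the Win32 "W" entry points
else
    alias osstring = string;    // UTF-8, matching byte-oriented POSIX calls

// A library written against osstring never transcodes internally.
size_t textLength(osstring s) { return s.length; }

void main()
{
    version (Windows)
        assert(textLength("hello"w) == 5);
    else
        assert(textLength("hello") == 5);
}
```

The cost, as noted, is that every library in the program must be compiled against the same choice of osstring.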
Oct 28 2004
next sibling parent reply "Walter" <newshound digitalmars.com> writes:
"James McComb" <ned jamesmccomb.id.au> wrote in message
news:cls8k1$16r5$1 digitaldaemon.com...
 Walter wrote:
 I also just don't see the need to even bother using aliases. Just use
 char[].

But you need to use aliases for the following scenario.

Suppose that:
 1. I want to write code for both Windows and Unix.
 2. I don't want to pay any string conversion costs at all.

I assume the way to do this in D is:
 1. Use wchar[] on Windows and make UTF-16 API calls.
 2. Use char[] on Linux and make UTF-8 API calls.
 3. Use an alias to toggle between wchar[] and char[].
 4. Use a string library that defines all functions in both wchar[] and char[] versions.

If I just used char[], I would be forced to pay string conversion costs, as Windows ultimately processes all strings in UTF-16.

True, Win32 processes strings in UTF-16 and Linux in UTF-8. But I'll argue that the string conversion costs are insignificant, because very rarely does one write code that crosses from the app to the OS in a tight loop. In fact, one actively tries to avoid doing that, because crossing the process boundary layer is expensive anyway.

If profiling indicates that the conversion cost is significant, then use an alias, sure. But I'll wager that's very unlikely.
Oct 28 2004
parent "Glen Perkins" <please.dont email.com> writes:
"Walter" <newshound digitalmars.com> wrote in message 
news:clsfce$1dlm$1 digitaldaemon.com...

 True, Win32 process strings in UTF-16 and Linux in UTF-8. But I'll 
 argue
 that the string conversion costs are insignificant, because very 
 rarely does
 one write code that crosses from the app to the OS in a tight loop. 
 In fact,
 one actively tries to avoid doing that because crossing the process 
 boundary
 layer is expensive anyway.

 If profiling indicates that the conversion cost is significant, then 
 use an
 alias, sure. But I'll wager that's very unlikely.

Wait a minute. Aren't these pretty close to the same arguments I made for why the difference between the performance of a consistent default string class and a byte array wouldn't generally matter? "X is usually insignificant, and if it is ever significant use a profiler and do something non-default, but in general keep it simple...."

What is the performance difference between sending 1000 wchar[] strings into a filter library function that wants char[] strings, so it converts them all into char[] on the way in, finds the ones that qualify, and converts them back to wchar[] on the way out, versus sending a thousand default string objects, by reference of course, into a library written for default string objects for filtering, which returns the qualifiers as default string objects? (I'm actually asking. It's not rhetorical.)

Having a consistent, default string that's used (almost) everywhere and never suffers any conversion costs inside the app may have a big benefit in reducing complexity, with certainly no need for aliases anywhere, and may not even have any performance penalty over code that repeatedly gets converted back and forth within the app. And if it ever did perform more slowly, you use your profiler as you suggest and tweak it with a byte array.
Oct 29 2004
prev sibling parent reply =?ISO-8859-1?Q?Anders_F_Bj=F6rklund?= <afb algonet.se> writes:
James McComb wrote:

 But you need to use aliases for the following scenario:
 
 Suppose that:
   1. I want to write code for both Windows and Unix.
   2. I don't want to pay any string conversion costs at all.
 
 I assume the way to do this in D is:
   1. Use wchar[] on Windows and make UTF-16 API calls.
   2. Use char[] on Linux and make UTF-8 API calls.
   3. Use an alias to toggle between wchar[] and char[].
   4. Use a string library that defines all functions in both wchar[] and 
 char[] versions.
 
 If I just used char[], I would be forced to pay string conversion costs, 
 as Windows ultimately processes all strings in UTF-16.

Couldn't a new "tchar" alias be introduced for OS / platform strings? (mapping to either char or wchar) Similar to how pointer aliases work with both 32- and 64-bit pointers? (that is: size_t and ptrdiff_t)

It would be similar to using the macro (TCHAR *) in Windows C or C++ (with _tcs macro versions of all the functions like strlen and wcslen). With overloading and templates in D it is easier to maintain, though... (compared to the preprocessor tricks one has to resort to, back in C)

Or just use the standard type "char[]" and cast(), like Walter said? (which seems to be a little biased towards ASCII or UNIX, but anyway)

But using the same name (tchar) as Windows / Linux does would be good, if there indeed is such a platform-character alias eventually added...

--anders

PS. I think that it's only Windows NT (2K, XP) that uses Unicode, while Windows 95 (98, ME) uses ASCII... But I could be wrong?
Oct 29 2004
parent "Roald Ribe" <rr.no spam.teikom.no> writes:
"Anders F Björklund" <afb algonet.se> wrote in message
news:cltfq3$2hqn$1 digitaldaemon.com...

 PS. I think that it's only Windows NT (2K,XP) that uses Unicode,
      while Windows 95 (98,ME) uses ASCII... But I could be wrong?

95, 98 and ME can have the UTF-16 APIs installed (as redistributable DLLs). Both the old 8-bit char API and the UTF-16 API are (currently) available on all the currently supported Win32 platforms.

Roald
Nov 05 2004
prev sibling parent "Glen Perkins" <please.dont email.com> writes:
"Walter" <newshound digitalmars.com> wrote in message 
news:clr7kg$2mi2$1 digitaldaemon.com...

 [D library authors and others won't be tempted to create their own 
 string classes
 as so many did for C++ because D's core strings are so much better]

This may turn out to be true. If so, you are still left with multiple string types and no obvious default.

My concern is that the result will be a lot of unnecessary complexity, with all of its associated real costs, in exchange for little or no real benefit in many cases, and it won't even be avoidable by those who are aware of it if they use other people's code. And if it doesn't work out that way, the situation would be even worse, with even more string types and still no default.
 Using 'alias' doesn't create a new type. It just renames an existing 
 type.

You're talking about 'type' from the compiler's perspective, while I'm talking about it from the perspective of people--well, programmers are sort of like people--as in complexity, programmer productivity, porting, debugging, maintenance, etc. From that perspective, two things with different names have to be managed differently. Though the compiler may (sometimes) not object if you mix them, anyone who works with multiple string type names has more to keep track of and check on and worry about.

Just for grins, here is the sort of thing I've overheard coming from developers at first-rate software companies who ended up with multiple internal string types aliased by #defines:

"<cubicle #1 wonders out loud> Hmm, this FooLib wants to be passed a foochar, but we're passing it a regular char. Is that okay? Or will it fail with a non-ASCII character?
<cubicle #2 offers> Maybe you should test it with an accented e.
<cubicle #3> No, wait, isn't upper ASCII still one byte sometimes....?
<guy #1 again> Well I don't have a Japanese IME. Does anybody remember what a 'foochar' is on Linux?
<guy #3> It's UTF-16 on Windows. Sorry, don't know about Linux...."
 Hence, I don't see much of a collision problem between different 
 code bases
 that use aliases.

Whether or not such a problem exists for the compiler, I don't see how working with multiple string types, even if some of them differ in name only, would not be a complexity problem for *people*.
 I also just don't see the need to even bother using aliases.

That's pretty interesting, because if there really is no need for this feature, you could prevent some unnecessary complexity by eliminating the feature. With no default string type, though, people are essentially told to optimize their string type every time they create a string, which will probably create a demand for a feature like "alias" to create an abstract string type (name) above the implementation level.
 Just use
 char[]. I think the issue comes up repeatedly because people coming 
 from a
 C++ background are so used to char* being inadequate that it's hard 
 to get
 comfortable with the idea that char[] really does work <g>.

<g> Funny, I thought you would say "just use wchar[]". Each one seems about equally likely. ;-)

Again, you may be right about the existing byte array types being good enough to prevent the proliferation of string classes. Even if you are, my concern about the lack of an obvious default resulting in non-trivial complexity costs with no concomitant benefit remains.

But I suppose it's also possible that if you DID add a nice, lightweight string class suitable as a default almost everywhere, the addiction of so many C people to premature optimization could make it unpopular, rendering it less of a unifying default than just a fourth standard string type to add to the complexity.
Oct 29 2004
prev sibling next sibling parent reply Regan Heath <regan netwin.co.nz> writes:
Glen,

I think you make some very good points. In the past several people have 
argued for a single string type. Some may even have written one, I know 
it's on the cards.

In the past I have argued for implicit conversion between the existing 
string types, this would allow them to be used interchangably and 
converted 'on the fly' where required. This idea can have performance 
issues as it can cause a lot of excess conversions. My suggestion was in 
reaction to the impression that the 3 existing types were going to stay.

I think ideally having only one 'string' type would be best. The trick is 
making it efficient enough for those situations where that sort of thing 
matters, i.e. embedded software etc.

That said, a well designed class that could be told what encoding to use 
internally (if required) might be efficient enough for 99% of cases, and 
in the last 1% a ubyte[] should perhaps be used?

If that class were to come into existance, I don't see the need for 3 char 
types, instead ubyte[], ushort[] and uint[] would/could be used by the 
string class internally to represent the data stored.

It's interesting to hear your views on this, I hope your post draws some 
of the older NG members with opinions on this out of the woodwork, it's 
been quiet here the last month or so.

Regan

On Mon, 25 Oct 2004 15:07:30 -0700, Glen Perkins <please.dont email.com> 
wrote:
 I'd heard a bit about D, but this is the first time I've taken a bit of 
 time to look it over. I'm glad I did, because I love the design.

 I am wondering about something, though, and that's the apparent decision 
 to have three different standard string types, each with its encoding 
 exposed to the developer. I've had some experience designing text 
 models--I worked with Sun upgrading Java's string model from UCS-2 to 
 UTF-16 and for Macromedia upgrading the string types within Flash and 
 ColdFusion, for example--but every case has its unique constraints.

 I don't know enough about D to be sure of the issues and constraints in 
 this case, but I'm wondering if it wouldn't make sense to have a single 
 standard "String" class for the majority of text handling plus something 
 like char/wchar/dchar/ubyte arrays reserved for special cases.

 In both Java and Flash we kept having to throw away brainstorming ideas 
 because they implied changes to internal string implementation details 
 that had unnecessarily--in my opinion--been exposed to programmers. I've 
 become increasingly convinced that programmers don't need to know, much 
 less be forced to decide, how most of their text is encoded. They should 
 be thinking in terms of text semantically most of the time, without 
 concerning themselves with its byte representation.

 I see text handling as analogous to memory handling in the sense that I 
 think the time has come to have the platform handle the general cases 
 via automated internal mechanisms that are not exposed, while still 
 allowing programmer manual  intervention for occasional special cases.

 D already seems to have this memory model (very nice!), and it seems to 
 me that the corresponding text model would be a single standard "String" 
 class, whose internal encoding was the implementation's business, not 
 the programmer's. The String would have the ability to produce 
 explicitly encoded/formatted byte arrays for special cases, such as I/O, 
 where encoding mattered. I would also want the ability to bypass Strings 
 entirely on some occasions and use byte arrays directly. (By "byte 
 arrays" I mean something like D's existing char[], wchar[], etc.)

 Since the internal encoding of the standard String would not be exposed 
 to the programmer, it could be optimized differently on every platform. 
 I would probably implement my String class in UTF-16 on Windows and 
 UTF-8 on Linux to make interactions with the OS and neighboring 
 processes as lightweight as possible.

 Then I would probably provide standard function wrappers for common OS 
 calls such as getting directory listings, opening files, etc. These 
 wrapper functions would pass text in the form of Strings. Source code 
 that used only these functions would be portable across platforms, and 
 since String's implementation would be optimized for its platform, this 
 portable source code could produce nearly optimal object code on all 
 platforms.

 For calling OS functions directly, where you always need to have your 
 text in a specific format, you could just have your Strings create an 
 explicitly formatted byte sequence for you. A call to a Windows API 
 function might pass something like "my_string.toUTF16()". Since the 
 internal format would probably already be UTF-16, this "conversion" 
 could be optimized away by the compiler, but it would leave you the 
 freedom to change the underlying String implementation in the future 
 without breaking anybody's code.

 And, of course, you would still have the ability to use char[], wchar[], 
 dchar[], and even ubyte[] directly when needed for special cases.

 Having a single String to use for most text handling would make writing, 
 reading, porting, and maintaining code much easier. Having an underlying 
 encoding that isn't exposed would make it possible for implementers to 
 optimize the standard String for the platform, so that programmers who 
 used it would find code that was easier to write to begin with was also 
 more performant when ported. This has huge implications for the creation 
 of the rich libraries that make or break a language these days.

 And if for no other reason, it seems to me that a new language should 
 have a single, standard String class from the start just to avoid 
 degenerating into the tangled hairball of conflicting string types that 
 C++ text handling has become. Library creators and architects working in 
 languages that have had a single, standard String class from the start 
 doggedly use the standard String for everything. You could easily create 
 your own alternative string classes for languages like Java or C#, but 
 almost nobody does. As long as the standard String is good enough, it's 
 just not worth the trouble of having to juggle multiple string types. 
 All libraries and APIs in these languages use a single, consistent text 
 model, which is a big advantage these days over C++.

 Again, I realize that I may be overlooking any number of important 
 issues that would make this argument inapplicable or irrelevant in this 
 case, but I'm wondering if this would make sense for D.

Oct 25 2004
parent reply Regan Heath <regan netwin.co.nz> writes:
On Tue, 26 Oct 2004 12:47:44 +1300, Regan Heath <regan netwin.co.nz> wrote:
 I think you make some very good points. In the past several people have 
 argued for a single string type. Some may even have written one, I know 
 it's on the cards.

To clarify: I believe some people think one is required and will write one to attempt to prove that one is better. AFAIK Walter does not see the need for one and/or believes char, wchar and dchar to be better.
Oct 25 2004
parent reply A. Coward (not related to Noël) <A._member pathlink.com> writes:
I think Glen's thoughts are excellent.

As long as we use D for smallish programs, library development, and such, it may
seem obvious to continue using arrays to store sequences of characters (of the
size of our choice for the project at hand).

Our aim (at least I think) is to have D usurp C, C++, and to some extent C# and
Java. By that time D would be used in the Programming Industry. Once we are
there it may seem equally obvious that a programmer should not have to spend
time thinking about character sets or widths. A requisite for this is that there
is a string class/type that Everyone Uses. 

We don't have to abandon our current character arrays and library functions;
it just means that we really should create a default for the future. And this
IMHO should be done pretty much along the lines Glen suggested.

Newcomers to D (newbies as well as Old Pros) should be directed to use this new
string. This is what should be prominent and well described in the
documentation. And we should move the current text manipulation docs to the
hairier sections, right where OS-gurus, embedded programmers, performance pros,
and metal-benders go looking. Oh yes, and library developers, too.

The default should be that everyone uses the Default string, and that only
profiling should be used to decide whether some snippets should then be
programmed with arrays (or whatever), as a last resort.
Oct 26 2004
parent reply Kevin Bealer <Kevin_member pathlink.com> writes:
In article <cllf6q$24vg$1 digitaldaemon.com>, not related to Noël says...
The default should be that everyone uses the Default string, and that only
profiling should be used to decide whether some snippets should then be
programmed with arrays (or whatever), as a last resort.

I think there is some merit in this guideline, particularly for those new to
programming. But I'm coming around to the perspective that performance
problems are like bugs. If you don't pay attention to bugs during the design
phase, you will spend your whole career debugging programs. Likewise, if
performance is the last thing you think about, you will spend all of your
career profiling programs with poor performance, trying to overcome slow
designs with small optimizations.

If you want a "standard" string type, use "char". An XML parser needs to look
for "<" and ">" a lot, but how often do you -really- need to scan strings for
multibyte characters? Virtually all traditional tokenization and parsing
tasks can be done with 8 bit types, because they require searching for
delimiters that are themselves 8 bit chars. I've not seen "U+umlaut"
delimited fields ;)

My rule of thumb is to use the smallest type that I won't need to convert
inside the function, usually char. If the function needs to iterate over and
modify dchar elements, accept that type at the function interface.

Kevin
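Kevin's observation rests on a deliberate UTF-8 design property: every byte of a multi-byte sequence has its high bit set (0x80-0xFF), so ASCII delimiters like '<' and '>' can never occur inside an encoded non-ASCII character, and a byte-level scan is safe. A quick check (in Python, with a made-up XML-ish input):

```python
# Scan raw UTF-8 bytes for ASCII delimiters, as an 8-bit tokenizer would.
text = "<tag>héllo – wörld</tag>"
data = text.encode("utf-8")

assert data.index(b"<") == 0
assert data.index(b">") == 4   # '>' right after "tag", despite non-ASCII later

# Every byte of the multi-byte characters is >= 0x80, so none of them
# can collide with an ASCII delimiter byte.
assert all(b >= 0x80 for ch in "é–ö" for b in ch.encode("utf-8"))
```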
Oct 28 2004
parent reply "Lionello Lunesu" <lionello.lunesu crystalinter.remove.com> writes:
Just posting to let you know I also think "string" should be standardized. 
Be it char[] or whatever, but standardized.

Maybe in the future "string" could get some other members/operators that 
have no equivalent with int[]. (The fact that char[] is being treated as 
UTF8 when converting to wchar[] proves that it's not simply an int8[] array)

 Virtually all traditional tokenization and parsing tasks
 can be done with 8 bit types, because they require searching for 
 delimiters that
 are themselves 8 bit chars.  I've not seen "U+umlaut" delimited fields ;)

Indeed :-) Lio.
Nov 05 2004
parent reply =?ISO-8859-1?Q?Anders_F_Bj=F6rklund?= <afb algonet.se> writes:
Lionello Lunesu wrote:

 Maybe in the future "string" could get some other members/operators that 
 have no equivalent with int[]. (The fact that char[] is being treated as 
 UTF8 when converting to wchar[] proves that it's not simply an int8[] array)

The 8-bit integer type in D is "byte". D's "char" is *defined* as UTF-8.
This means that a "char" only holds an ASCII character. You need a wchar
to hold e.g. a Latin-1 character, and a full (32-bit) dchar to hold all
Unicode possibilities...

--anders
Nov 05 2004
parent reply "Lionello Lunesu" <lionello.lunesu crystalinter.remove.com> writes:
Yes, I've noticed that. I was referring to how the array is treated.

char[] array;
wchar[] warray = array;

This is doing some magic that has nothing to do with simply copying members, 
extending them as necessary. OK, I guess they're both arrays of UTF 
characters and the prefix only shows the memory representation, so it's 
still a member-by-member copy...

Can I do a similar assignment from byte[] to uint[] ? (I know I could simply 
test, but I've never written a D program). If not, then there is something 
special about char[] that might perhaps be more obvious if it was a built-in 
string type (the [] is confusing.)

Lio.

"Anders F Björklund" <afb algonet.se> wrote in message 
news:cmfjbd$22d5$1 digitaldaemon.com...
 Lionello Lunesu wrote:

 Maybe in the future "string" could get some other members/operators that 
 have no equivalent with int[]. (The fact that char[] is being treated as 
 UTF8 when converting to wchar[] proves that it's not simply an int8[] 
 array)

The 8-bit integer type in D is "byte". D's "char" is *defined* as UTF-8. This means that a "char" only holds an ASCII character. You need a wchar to hold e.g. a Latin-1 character, and a full (32-bit) dchar to hold all Unicode possibilities... --anders

Nov 05 2004
parent =?ISO-8859-1?Q?Anders_F_Bj=F6rklund?= <afb algonet.se> writes:
Lionello Lunesu wrote:

 Yes, I've noticed that. I was referring to how the array is treated.
 
 char[] array;
 wchar[] warray = array;

That D code just gives an error, when you actually try to compile it:

"cannot implicitly convert expression array of type char[] to wchar[]"

If you insert an explicit cast, the result is probably NOT what you want...
(you CAN cast string *constants*)
 This is doing some magic that has nothing to do with simply copying members, 
 extending them as necessary. OK, I guess they're both arrays of UTF 
 characters and the prefix only shows the memory representation, so it's 
 still a member-by-member copy...

The compiler needs some code for converting between the different UTF
arrays. Each code point (one dchar, UTF-32) corresponds to 1-4 chars
(UTF-8) or 1-2 wchars (UTF-16). It's not a simple memory copy, as you
can see in the std/utf.d code:

wchar[] toUTF16(char[] s);
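The 1-4 char / 1-2 wchar relationship per code point is easy to verify, as is the reason a plain memory copy cannot work. An illustration in Python (std.utf's toUTF16 is the real D routine; this only checks the arithmetic):

```python
# Code units needed per code point in UTF-8 vs UTF-16.
for cp, n8, n16 in [("A", 1, 1), ("é", 2, 1), ("中", 3, 1), ("𝄞", 4, 2)]:
    assert len(cp.encode("utf-8")) == n8
    assert len(cp.encode("utf-16-le")) // 2 == n16

# A byte-for-byte widening of UTF-8 data into 16-bit slots mangles 'é':
utf8 = "é".encode("utf-8")               # b'\xc3\xa9' — two code units
widened = "".join(chr(b) for b in utf8)  # naive per-byte widening
assert widened == "Ã©"                   # classic mojibake, not 'é'
```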
 Can I do a similar assignment from byte[] to uint[] ?

Nope: "cannot implicitly convert expression a of type byte[] to uint[]"

You would have to do something like:

byte[] a;
uint[] b;
foreach (byte c; a)
    b ~= c;

Again, a cast() just does a "memcpy"
 If not, then there is something special about char[] that might
 perhaps be more obvious if it was a built-in string type (the [] is
 confusing.)

Type char[] has a few "stringish" properties, and bit has some magic
"boolean" properties. This is somehow better than built-in types...
(and a frequent source of D discussions/wars)

We'll just have to live with the type aliases "string" and "bool",
as the types aren't changing?

alias char[] string;
alias bit bool;

--anders
Nov 05 2004
prev sibling next sibling parent reply =?ISO-8859-1?Q?Anders_F_Bj=F6rklund?= <afb algonet.se> writes:
Glen Perkins wrote:

 I don't know enough about D to be sure of the issues and constraints in 
 this case, but I'm wondering if it wouldn't make sense to have a single 
 standard "String" class for the majority of text handling plus something 
 like char/wchar/dchar/ubyte arrays reserved for special cases.

Since OOP is *optional* in D, it isn't given to have a *class*?
(a String class is still useful, but not as main implementation)

As for a "string" type alias, I think that's a very good idea...
digitalmars.D/11821
 And if for no other reason, it seems to me that a new language should 
 have a single, standard String class from the start just to avoid 
 degenerating into the tangled hairball of conflicting string types that 
 C++ text handling has become. Library creators and architects working in 
 languages that have had a single, standard String class from the start 
 doggedly use the standard String for everything. You could easily create 
 your own alternative string classes for languages like Java or C#, but 
 almost nobody does. As long as the standard String is good enough, it's 
 just not worth the trouble of having to juggle multiple string types. 
 All libraries and APIs in these languages use a single, consistent text 
 model, which is a big advantage these days over C++.

There is no "string" type, and there is no "bool" type in D.
This seems to have been done by design, as Walter's explained?

The recommended types to use are "char[]" for the usual strings
(even if wchar[] or even dchar[] is sometimes also useful to have)
and "bit" for booleans (even if char and int are sometimes used).

There isn't really a conflict, since all strings are Unicode
and all booleans follow the "zero is false, non-zero is true".
But it does expose the underlying storage and implementation...

It seems the best that can be done at this point are *aliases*?
(and improving upon the D library support in Phobos and Deimos)

--anders
Oct 26 2004
parent reply Regan Heath <regan netwin.co.nz> writes:
On Tue, 26 Oct 2004 13:34:23 +0200, Anders F Björklund <afb algonet.se> 
wrote:
 Glen Perkins wrote:

 I don't know enough about D to be sure of the issues and constraints in 
 this case, but I'm wondering if it wouldn't make sense to have a single 
 standard "String" class for the majority of text handling plus 
 something like char/wchar/dchar/ubyte arrays reserved for special cases.

Since OOP is *optional* in D, it isn't given to have a *class*?
(a String class is still useful, but not as main implementation)

In that case, perhaps not a 'class', but a struct as Ben suggested, or,
better yet, a built-in type like the current arrays, which we can extend
in the same way as we can extend arrays. I think that is important.
 As for a "string" type alias, I think that's a very good idea...
 digitalmars.D/11821

I don't like it:

1- I personally find 'utf_8' ugly and nasty to type.

2- The style guide mentions that 'meaningless type aliases should be
avoided'. I think aliasing 'char' to 'utf_8' is meaningless because a
char is a utf-8 type by definition.

3- I don't want 'more' character types, I want 'less'.
 And if for no other reason, it seems to me that a new language should 
 have a single, standard String class from the start just to avoid 
 degenerating into the tangled hairball of conflicting string types that 
 C++ text handling has become. Library creators and architects working 
 in languages that have had a single, standard String class from the 
 start doggedly use the standard String for everything. You could easily 
 create your own alternative string classes for languages like Java or 
 C#, but almost nobody does. As long as the standard String is good 
 enough, it's just not worth the trouble of having to juggle multiple 
 string types. All libraries and APIs in these languages use a single, 
 consistent text model, which is a big advantage these days over C++.

There is no "string" type, and there is no "bool" type in D. This seems to have been done by design, as Walter's explained ?

Yes and no. Walter has intentionally made the character types UTF ones,
IMO a good decision. However, it has created a problem where they are
not easily interchangeable, i.e. you have to call conversion functions
all the time because some people use one while others use another.

I suggested implicit conversion between them to solve that. Walter sort
of liked that idea, but has not done anything about it yet. A better
solution IMO would be a single 'string' type which can handle 'being'
in any encoding you need.
 The recommended types to use is "char[]" for the usual strings,
 (even if wchar[] or even dchar[] is sometimes also useful to have)
 and "bit" for booleans. (even if char and int are sometimes used)

 There isn't really a conflict, since all strings are Unicode
 and all booleans follow the "zero is false, non-zero is true".
 But it does expose the underlying storage and implementation...

All strings are _not_ Unicode; strings can be in any encoding you want.
D currently has 3 'string' types (char, wchar, dchar) which are all
Unicode. There is no difference in my mind between a char[] and a
ubyte[] array, except for the fact that the char[] array remembers that
its contents are supposed to be UTF-8 and verifies that on occasion.

So, a struct/class/whatever like:

struct string {
    StringType type;
    union {
        ubyte[]  bs;
        ushort[] ss;
        uint[]   ls;
    }
}

could replace char, wchar, and dchar. It could do implicit conversions
where required via 'cast' operators (do we have them yet?). It could
handle many more encodings than the 3 handled by char, wchar, and
dchar. If such a type existed, char, wchar, and dchar would become
obsolete; there would be no need for them at all.

The only weakness a struct has is that you cannot extend it as you can
the built-in arrays, eg.

void foo(char[] a, int b) {}

char[] bob;
bob.foo(1); <- calls the 'foo' function above passing 'bob' as 1st arg.

This is a really useful feature; it is why IMO we need a partially
built-in solution.
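The tagged-union idea above can be sketched concretely. Everything here is invented for illustration (TaggedString, convert, the encoding tag); it is not actual D or Phobos API, and Python stands in for the struct:

```python
class TaggedString:
    """Sketch of a string struct whose tag records the current encoding."""

    def __init__(self, text, encoding="utf-8"):
        self.encoding = encoding            # plays the role of StringType
        self.units = text.encode(encoding)  # plays the role of the union

    def convert(self, encoding):
        # The on-demand conversion described above: re-encode only when
        # the target encoding actually differs from the stored one.
        if encoding == self.encoding:
            return self
        return TaggedString(self.units.decode(self.encoding), encoding)

s = TaggedString("héllo")            # stored as UTF-8
w = s.convert("utf-16-le")           # converted only at this point
assert w.units.decode("utf-16-le") == "héllo"
```

The design choice this illustrates: code that never crosses an encoding boundary pays nothing, while conversions are centralized in one place instead of scattered through every library call.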
 It seems the best that can be done at this point are *aliases*?
 (and improving upon the D library support in Phobos and Deimos)

We can write a string struct/class/whatever and use that; if it becomes
as popular as I imagine it will, it will likely be adopted into Phobos.
Basically I'm saying, if we prove it's the right way to go, we just
might convince Walter.

Regan

-- Using M2, Opera's revolutionary e-mail client: http://www.opera.com/m2/
Oct 26 2004
parent reply =?ISO-8859-15?Q?Anders_F_Bj=F6rklund?= <afb algonet.se> writes:
Regan Heath wrote:

 I don't like it:
 
 1- I personally find 'utf_8' ugly and nasty to type.

Actually it was utf8_t, utf16_t, utf32_t - but point taken :-)
 2- The style guide mentions that 'meaningless type aliases should be 
 avoided' I think aliasing 'char' to 'utf_8' is meaningless because a 
 char is a utf-8 type by definition.
 
 3- I don't want 'more' character types, I want 'less'.

They were meant to 'complement' the standard int aliases in stdint.d:

int8_t, uint8_t, int16_t, uint16_t, int32_t, uint32_t, int64_t, uint64_t

They were not meant as "pretty", more like: self-explanatory
(explains what type it is: utf/int, and how many bits it is).

Didn't intend to change any built-in type names, like char/wchar/dchar
or byte/short/int/long. Just offer *one* "official" alias for each type.

What did you think about the "string" (char[]) and "ustring" (wchar[]) ?
 All strings are _not_ Unicode, strings can be in any encoding you want.
 D currently has 3 'string' types (char,wchar,dchar) which are all Unicode.

I meant the string types that interact with "quotes" and the ~ operator. You are right in that one *could* store strings in ubyte[] or void[]...
 If such a type existed char, wchar, and dchar would become obsolete, 
 there would be no need for them at all.

Unless you like type safety ? As in: chars and ints being different ?
They are of the same bit size as ubyte, ushort and uint - that's true.

 We can write a string struct/whatever and use that, if it becomes
 as popular as I imagine it will, it will likely be adopted into Phobos. 
 Basically I'm saying, if we proove it's the right way to go, we just 
 might convince Walter.

Currently Walter *has* picked the char[] type as the basic string type.
Deimos has, inspired by the ICU library, picked wchar[] as the basis...
(difference being that char[] is best for ASCII, wchar[] for Unicode)

Says http://oss.software.ibm.com/icu/userguide/icufaq.html:
 UTF-8 is 50% smaller than UTF-16 for US-ASCII, but UTF-8 is
 50% larger than UTF-16 for East and South Asian scripts.
 There is no memory difference for Latin extensions, [...]

I just thought "main(string[] args)" better than "main(char[][] args)" ?
(just as I think the "bool" alias to be better than the built-in "bit")

But I'm not sure I like a "magic" class with a hidden run-time cost...

--anders
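The ICU FAQ figures quoted above are straightforward to verify. A Python check of all three claims (ASCII, East Asian scripts, Latin extensions):

```python
# ASCII: UTF-16 is exactly twice the size of UTF-8 (UTF-8 "50% smaller").
ascii_text = "hello world"
assert len(ascii_text.encode("utf-16-le")) == 2 * len(ascii_text.encode("utf-8"))

# East Asian scripts: 3 bytes per char in UTF-8, 2 in UTF-16,
# so UTF-8 is 50% larger here.
cjk = "中文字符"
assert len(cjk.encode("utf-8")) == 12
assert len(cjk.encode("utf-16-le")) == 8

# Latin extensions (U+0100 and up): 2 bytes per char in both encodings.
latin_ext = "āēīō"
assert len(latin_ext.encode("utf-8")) == len(latin_ext.encode("utf-16-le")) == 8
```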
Oct 26 2004
parent reply "Glen Perkins" <please.dont email.com> writes:
"Anders F Björklund" <afb algonet.se> wrote in message 
news:clmimg$fvd$1 digitaldaemon.com...


 What did you think about the "string" (char[]) and "ustring" 
 (wchar[]) ?

I don't think you were asking me, but my concern applies to any "let a hundred flowers bloom" design approach for strings. If you have multiple string types with no dominant leader, plus an "alias" feature, plus strong support for OOP but no standard string class, you are almost begging for a crazy quilt landscape of diverse and incompatible string types. I'd be concerned that most large applications would end up dealing with more string types than they wanted with no significant performance gains to show for it.
 Currently Walter *has* picked the char[] type as the basic string 
 type.
 Deimos has, inspired by the ICU library, picked wchar[] as the 
 basis...
 (difference being that char[] is best for ASCII, wchar[] for 
 Unicode)

 Says http://oss.software.ibm.com/icu/userguide/icufaq.html:
 UTF-8 is 50% smaller than UTF-16 for US-ASCII, but UTF-8 is
 50% larger than UTF-16 for East and South Asian scripts.
 There is no memory difference for Latin extensions, [...]


There is so much room for "well, not necessarily" in all of these
statements, most programmers understand the issues so little, and it
usually matters so little, that it's a bit unfortunate to have a design
that *requires* programmers to repeatedly make this decision. Different
people, even smart ones, will choose differently, choices that may as
well be random for all the difference it usually makes.

Once again, I'm afraid that code will get more complicated than
necessary with no compensating payoff. And I couldn't avoid the
complexity by just choosing wisely myself, because every library author
would be free to make his own decisions, and you need a lot of
libraries to make a language useful. I could have unnecessary and
performance-sapping format conversions taking place at every library
call.
Oct 26 2004
parent reply =?ISO-8859-15?Q?Anders_F_Bj=F6rklund?= <afb algonet.se> writes:
Glen Perkins wrote:

 What did you think about the "string" (char[]) and "ustring" (wchar[]) ?

I don't think you were asking me, but my concern applies to any "let a hundred flowers bloom" design approach for strings. If you have multiple string types with no dominant leader, plus an "alias" feature, plus strong support for OOP but no standard string class, [...]

Walter has earlier ruled out a built-in "native" string type in D,
and a String class brings us back to the earlier "boxing" discussion.

Currently the D language treats strings as arrays of Unicode code units,
and one can still use char[] as ASCII strings, just like one could in C.

There are a lot of things discussed regarding Unicode and strings at:
http://www.prowiki.org/wiki4d/wiki.cgi?UnicodeIssues

A "transcoding" string type with a built-in hash code would have been
welcome, but it is *not* in the current D language specification...

I just wanted a reasonable alias while the theological debate rages on ?
(and the reason for submitting it was so that we could all use the same one)

--anders
Oct 27 2004
parent reply ac <ac_member pathlink.com> writes:
 Walter has earlier ruled out a built-in "native" string type in D,
 and a String class brings us back to the earlier "boxing" discussion.


a) Built-in or library? (Standard library or 3rd party?)
b) 0, 1, 3 or 3+1 "approved" string kinds?
c) Unicode (which?), native (which?), other?

These 3 questions are orthogonal to each other.

To (a) I have no strong opinion. Maybe just building facilities in the
language itself that are geared towards making it easy to implement an
efficient string library would be adequate?

I have no problem with 3+1 in (b). Why not let the 3 existing strings
live on. But I would really like to have an additional string which
would be advertised as what you should use.

(c) I leave to smarter people.

If we don't have exactly _one_ type that everyone _should_ use, then
programmers in, say, the Mid West would all use an 8-bit kind. People
of, say, Chinese origin, probably would use a 32-bit type -- even if
they were coding in the US. And even if they would be working on a
project that is to manipulate ASCII strings, because they'd expect the
application to sooner or later get exposed to non-USASCII characters
anyway. Actually, rednecks would be happy with 7 bits.

What if all these guys happen to work for the same global company?

<joke-mode>
I can hear a crowd all over the D-community shouting to their screens:
"Well, that company would have their global coding policy on strings.
NO problem."

Right. But what when (not if) that company gets merged into another?
Would they have happened to choose the very same string coding policy?
Maybe they began with different operating systems, maybe the other one
originally came from another continent? I don't even want to guess what
the crowd says to this.
</joke-mode>

This ought to be a no-brainer!
Oct 27 2004
parent =?ISO-8859-1?Q?Anders_F_Bj=F6rklund?= <afb algonet.se> writes:
ac wrote:

Walter has earlier ruled out a built-in "native" string type in D,
and a String class brings us back to the earlier "boxing" discussion.


a) Built-in or library? (Standard library or 3rd party?)

There is no built-in D type, and does not look like a standard class. (as in: there will probably be no Integer, Character, String classes?)
 b) 0, 1, 3 or 3+1 "approved" string kinds?

There are *two* approved string types: char[] and wchar[] (there is also a dchar[] type, but hardly any use for it?)
 c) Unicode (which?), native (which?), other?

"Unicode is the future", so there is no Latin-1 support... (I assume you meant something like ISO-8859-1 by "native"?)
 These 3 questions are orthogonal to each other. 

I thought they were a bit strange, but I tried anyway ?
 If we don't have exactly _one_ type that everyone _should_ use, then
programmers
 in, say, the Mid West would all use an 8-bit kind. People from, say, Chinese
 origin, probably would use a 32-bit type -- even if they were coding in the US.
 And even if they would be working on a project that is to manipulate ASCII
 strings, because they'd expect the application to sooner or later get exposed
to
 non-USASCII characters anyway.

Western people that earlier had Latin-1 tend to use "char[]"; the only
trick is to dimension as [length * 2], since some characters occupy two
bytes when encoded. To be i18n-savvy, they should use [length * 4],
which allows for all of Unicode. "char" is only useful for ASCII
characters, as one has to use at least wchar to fit a Latin-1
character, for instance.

Other people tend to use "wchar[]", which is also the string (and
character) encoding that Java chose. Nowadays one has to be prepared to
handle "surrogates", since Unicode does not fit in 16 bits anymore -
but spilled over to 21 bits... "wchar" *usually* works for Unicode
characters, but to be able to handle all characters, dchar must be
used.

Nobody in their right mind uses "dchar[]" to store strings, but the
"dchar" type is useful for storing one code point.

A big disadvantage of UTF-16 (over UTF-8) is that it is
platform-dependent, and that it is not ASCII-compatible. At least not
with C and UNIX, since it will have a "BOM" and since every other byte
in an ASCII string will be NUL.
(more details at http://www.unicode.org/faq/utf_bom.html)

And imagine that all I wanted was two simpler aliases. :-)
(thought "string" and "ustring" were easier to "pronounce" than
"char[]" and "wchar[]", and that was about it really. Just a simple:

alias char[]  string;
alias wchar[] ustring;)

Not any new types or classes or other magic incantations...

--anders
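The surrogate and BOM points above can be checked directly. A Python illustration using U+1D11E (musical G clef), a character outside the 16-bit range:

```python
# A code point beyond U+FFFF needs two 16-bit units: a surrogate pair.
clef = "\U0001D11E"
units = clef.encode("utf-16-le")
assert len(units) == 4                    # two 16-bit code units
hi = int.from_bytes(units[:2], "little")
lo = int.from_bytes(units[2:], "little")
assert 0xD800 <= hi <= 0xDBFF             # high surrogate
assert 0xDC00 <= lo <= 0xDFFF             # low surrogate

# "every other byte in an ASCII string will be NUL" under UTF-16,
# which is what breaks C-style NUL-terminated string handling:
assert b"\x00" in "abc".encode("utf-16-le")

# And the byte-order-dependent codec prepends a BOM:
assert "abc".encode("utf-16")[:2] in (b"\xff\xfe", b"\xfe\xff")
```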
Oct 27 2004
prev sibling parent J C Calvarese <jcc7 cox.net> writes:
Glen Perkins wrote:
 I'd heard a bit about D, but this is the first time I've taken a bit of 
 time to look it over. I'm glad I did, because I love the design.
 
 I am wondering about something, though, and that's the apparent decision 
 to have three different standard string types, each with its encoding 
 exposed to the developer. I've had some experience designing text 
 models--I worked with Sun upgrading Java's string model from UCS-2 to 
 UTF-16 and for Macromedia upgrading the string types within Flash and 
 ColdFusion, for example--but every case has its unique constraints.
 
 I don't know enough about D to be sure of the issues and constraints in 
 this case, but I'm wondering if it wouldn't make sense to have a single 
 standard "String" class for the majority of text handling plus something 
 like char/wchar/dchar/ubyte arrays reserved for special cases.

(I've read some of the posts in this thread. Sorry if I'm repeating
what someone else has already written.)

It seems to me that D would support a string class such as the one you
seem to be proposing. Since Walter is busy getting the bugs out of the
compiler, he's not likely to write an official string class anytime
soon. But someone else could write it. And if that string class was
good and lots of people liked it, I'd be surprised if Walter didn't add
it to the standard library, Phobos.

If you're not up to writing it yourself, maybe you could persuade
someone else to do the work by proposing a design.

-- 
Justin (a/k/a jcc7)
http://jcc_7.tripod.com/d/
Oct 30 2004