
digitalmars.D - String theory in D

"Glen Perkins" <please.dont email.com> writes:
I'd heard a bit about D, but this is the first time I've taken a bit 
of time to look it over. I'm glad I did, because I love the design.

I am wondering about something, though, and that's the apparent 
decision to have three different standard string types, each with its 
encoding exposed to the developer. I've had some experience designing 
text models--I worked with Sun upgrading Java's string model from 
UCS-2 to UTF-16 and for Macromedia upgrading the string types within 
Flash and ColdFusion, for example--but every case has its unique 
constraints.

I don't know enough about D to be sure of the issues and constraints 
in this case, but I'm wondering if it wouldn't make sense to have a 
single standard "String" class for the majority of text handling plus 
something like char/wchar/dchar/ubyte arrays reserved for special 
cases.

In both Java and Flash we kept having to throw away brainstorming 
ideas because they implied changes to internal string implementation 
details that had unnecessarily--in my opinion--been exposed to 
programmers. I've become increasingly convinced that programmers don't 
need to know, much less be forced to decide, how most of their text is 
encoded. They should be thinking in terms of text semantically most of 
the time, without concerning themselves with its byte representation.

I see text handling as analogous to memory handling in the sense that 
I think the time has come to have the platform handle the general 
cases via automated internal mechanisms that are not exposed, while 
still allowing programmer manual intervention for occasional special 
cases.

D already seems to have this memory model (very nice!), and it seems 
to me that the corresponding text model would be a single standard 
"String" class, whose internal encoding was the implementation's 
business, not the programmer's. The String would have the ability to 
produce explicitly encoded/formatted byte arrays for special cases, 
such as I/O, where encoding mattered. I would also want the ability to 
bypass Strings entirely on some occasions and use byte arrays 
directly. (By "byte arrays" I mean something like D's existing char[], 
wchar[], etc.)

Since the internal encoding of the standard String would not be 
exposed to the programmer, it could be optimized differently on every 
platform. I would probably implement my String class in UTF-16 on 
Windows and UTF-8 on Linux to make interactions with the OS and 
neighboring processes as lightweight as possible.

Then I would probably provide standard function wrappers for common OS 
calls such as getting directory listings, opening files, etc. These 
wrapper functions would pass text in the form of Strings. Source code 
that used only these functions would be portable across platforms, and 
since String's implementation would be optimized for its platform, 
this portable source code could produce nearly optimal object code on 
all platforms.

For calling OS functions directly, where you always need to have your 
text in a specific format, you could just have your Strings create an 
explicitly formatted byte sequence for you. A call to a Windows API 
function might pass something like "my_string.toUTF16()". Since the 
internal format would probably already be UTF-16, this "conversion" 
could be optimized away by the compiler, but it would leave you the 
freedom to change the underlying String implementation in the future 
without breaking anybody's code.

And, of course, you would still have the ability to use char[], 
wchar[], dchar[], and even ubyte[] directly when needed for special 
cases.

Having a single String to use for most text handling would make 
writing, reading, porting, and maintaining code much easier. Having an 
underlying encoding that isn't exposed would make it possible for 
implementers to optimize the standard String for the platform, so that 
programmers who used it would find code that was easier to write to 
begin with was also more performant when ported. This has huge 
implications for the creation of the rich libraries that make or break 
a language these days.

And if for no other reason, it seems to me that a new language should 
have a single, standard String class from the start just to avoid 
degenerating into the tangled hairball of conflicting string types 
that C++ text handling has become. Library creators and architects 
working in languages that have had a single, standard String class 
from the start doggedly use the standard String for everything. You 
could easily create your own alternative string classes for languages 
like Java or C#, but almost nobody does. As long as the standard 
String is good enough, it's just not worth the trouble of having to 
juggle multiple string types. All libraries and APIs in these 
languages use a single, consistent text model, which is a big 
advantage these days over C++.

Again, I realize that I may be overlooking any number of important 
issues that would make this argument inapplicable or irrelevant in 
this case, but I'm wondering if this would make sense for D.
Oct 25 2004
Ben Hinkle <bhinkle4 juno.com> writes:
Glen Perkins wrote:

 I'd heard a bit about D, but this is the first time I've taken a bit
 of time to look it over. I'm glad I did, because I love the design.
 
 I am wondering about something, though, and that's the apparent
 decision to have three different standard string types, each with its
 encoding exposed to the developer. I've had some experience designing
 text models--I worked with Sun upgrading Java's string model from
 UCS-2 to UTF-16 and for Macromedia upgrading the string types within
 Flash and ColdFusion, for example--but every case has its unique
 constraints.

welcome.
 I don't know enough about D to be sure of the issues and constraints
 in this case, but I'm wondering if it wouldn't make sense to have a
 single standard "String" class for the majority of text handling plus
 something like char/wchar/dchar/ubyte arrays reserved for special
 cases.

There is a port of IBM's ICU unicode library underway and that will help fill in various unicode shortcomings of phobos. What else do you see a class doing that isn't in phobos?
 In both Java and Flash we kept having to throw away brainstorming
 ideas because they implied changes to internal string implementation
 details that had unnecessarily--in my opinion--been exposed to
 programmers. I've become increasingly convinced that programmers don't
 need to know, much less be forced to decide, how most of their text is
 encoded. They should be thinking in terms of text semantically most of
 the time, without concerning themselves with its byte representation.

are you referring to indexing and slicing being character lookup and not byte lookup?
 I see text handling as analogous to memory handling in the sense that
 I think the time has come to have the platform handle the general
 cases via automated internal mechanisms that are not exposed, while
 still allowing programmer manual  intervention for occasional special
 cases.
 
 D already seems to have this memory model (very nice!), and it seems
 to me that the corresponding text model would be a single standard
 "String" class, whose internal encoding was the implementation's
 business, not the programmer's. The String would have the ability to
 produce explicitly encoded/formatted byte arrays for special cases,
 such as I/O, where encoding mattered. I would also want the ability to
 bypass Strings entirely on some occasions and use byte arrays
 directly. (By "byte arrays" I mean something like D's existing char[],
 wchar[], etc.)
 
 Since the internal encoding of the standard String would not be
 exposed to the programmer, it could be optimized differently on every
 platform. I would probably implement my String class in UTF-16 on
 Windows and UTF-8 on Linux to make interactions with the OS and
 neighboring processes as lightweight as possible.

Aliases can introduce a symbol that can mean different things on different platforms:

    // "Operating System" character
    version (Win32) { alias wchar oschar; }
    else            { alias char  oschar; }

    oschar[] a_string_in_the_OS_preferred_format;
 Then I would probably provide standard function wrappers for common OS
 calls such as getting directory listings, opening files, etc. These
 wrapper functions would pass text in the form of Strings. Source code
 that used only these functions would be portable across platforms, and
 since String's implementation would be optimized for its platform,
 this portable source code could produce nearly optimal object code on
 all platforms.

These should already be in phobos. If the aliases approach is used all that is required are overloaded versions for char[] or wchar[].
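As a sketch of that overloading approach (the names "oschar" and "textUnits" are invented here for illustration and are not part of phobos):

```d
// Sketch only: "oschar" and "textUnits" are made-up names, not phobos.
// The alias resolves at compile time, so code written against oschar[]
// pays no runtime cost on either platform.
version (Win32) { alias wchar oschar; }
else            { alias char  oschar; }

// One overload per array type; the compiler picks at compile time.
size_t textUnits(char[]  s) { return s.length; }
size_t textUnits(wchar[] s) { return s.length; }

int main()
{
    // A wrapper taking oschar[] would dispatch to the right overload at
    // compile time; here we just exercise both overloads directly.
    assert(textUnits("hi".dup)  == 2);  // char[] overload (UTF-8 units)
    assert(textUnits("hi"w.dup) == 2);  // wchar[] overload (UTF-16 units)
    return 0;
}
```

A library written this way needs only one overload per encoding it actually supports; callers using the alias never name the concrete type.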
 For calling OS functions directly, where you always need to have your
 text in a specific format, you could just have your Strings create an
 explicitly formatted byte sequence for you. A call to a Windows API
 function might pass something like "my_string.toUTF16()". Since the
 internal format would probably already be UTF-16, this "conversion"
 could be optimized away by the compiler, but it would leave you the
 freedom to change the underlying String implementation in the future
 without breaking anybody's code.

There exist overloaded versions of std.utf.toUTF16 for char, wchar and dchar arrays. So calling toUTF16(my_string) would do what you propose. Changing the type of my_string would require a recompile but no code change.
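Concretely, that overload resolution looks like this (a small sketch; std.utf's conversion functions are the only library pieces assumed):

```d
import std.utf;

int main()
{
    char[] s = "hello".dup;   // the declared type could later change to
                              // wchar[] or dchar[]...
    wchar[] w = toUTF16(s);   // ...and this call would still compile,
                              // since toUTF16 is overloaded for char[],
                              // wchar[] and dchar[]
    assert(w.length == 5);
    return 0;
}
```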
 And, of course, you would still have the ability to use char[],
 wchar[], dchar[], and even ubyte[] directly when needed for special
 cases.
 
 Having a single String to use for most text handling would make
 writing, reading, porting, and maintaining code much easier. Having an
 underlying encoding that isn't exposed would make it possible for
 implementers to optimize the standard String for the platform, so that
 programmers who used it would find code that was easier to write to
 begin with was also more performant when ported. This has huge
 implications for the creation of the rich libraries that make or break
 a language these days.
 
 And if for no other reason, it seems to me that a new language should
 have a single, standard String class from the start just to avoid
 degenerating into the tangled hairball of conflicting string types
 that C++ text handling has become. Library creators and architects
 working in languages that have had a single, standard String class
 from the start doggedly use the standard String for everything. You
 could easily create your own alternative string classes for languages
 like Java or C#, but almost nobody does. As long as the standard
 String is good enough, it's just not worth the trouble of having to
 juggle multiple string types. All libraries and APIs in these
 languages use a single, consistent text model, which is a big
 advantage these days over C++.
 
 Again, I realize that I may be overlooking any number of important
 issues that would make this argument inapplicable or irrelevant in
 this case, but I'm wondering if this would make sense for D.

One disadvantage of a String class is that the methods of the class are fixed. With arrays and functions anyone can add a string "method". A class will actually reduce flexibility in the eyes of the user IMO. Another disadvantage is that classes in D are by reference (like Java) and so slicing will have to allocate memory - today a slice is a length and pointer to shared data so no allocation is needed. A String struct would be an option if a class isn't used, though.
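The "anyone can add a string method" point relies on D letting a free function whose first parameter is an array be called with method syntax. A sketch ("shout" is a made-up example function):

```d
// Anyone can add a string "method" without touching a class:
// f(arr, args) on an array can also be written arr.f(args).
char[] shout(char[] s)
{
    return s ~ "!";   // concatenation allocates a new array
}

int main()
{
    char[] s = "hello".dup;
    assert(s.shout() == "hello!");  // free function, method-call syntax
    return 0;
}
```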
Oct 25 2004
"Glen Perkins" <please.dont email.com> writes:
"Ben Hinkle" <bhinkle4 juno.com> wrote in message
news:clk269$haj$1 digitaldaemon.com...
 Glen Perkins wrote:

 welcome.

Thanks.
 There is a port of IBM's ICU unicode library underway and that will
 help
 fill in various unicode shortcomings of phobos. What else do you see
 a
 class doing that isn't in phobos?

I don't know enough to comment at this point. I don't even know how modularity works for compiled executables in D, and I don't want to propose something that would violate D's priorities by, for example, creating a heavyweight string, full of ICU features, that would end up being statically linked into every little "hello, world" written in D, ruining the goal of tiny executables if, for example, that is a high priority in D.

If there's no chance of a standard string class for general string operations in D, then there's no point in designing one. If there is a chance, then the design would have to start with the priorities and constraints of this particular language. My sense is that a string class similar to that in C#, but noncommittal regarding its internal encoding, would be nice for a language like D.
 ...I've become increasingly convinced that programmers don't
 need to know, much less be forced to decide, how most of their text
 is
 encoded. They should be thinking in terms of text semantically most
 of
 the time, without concerning themselves with its byte
 representation.

are you referring to indexing and slicing being character lookup and not byte lookup?

Yes, that's a specific example of what I'm referring to, which is the general notion of just thinking about algorithms for working with the text in terms of text itself without regard to how the computer might be representing that text inside (except in the minority of cases where you MUST work explicitly with the representation.)

And though it's probably too radical for D (so nobody freak out), we may well evolve to the point where the most reasonable default for walking through the "characters" in general text data is something like 'foreach char ch in mystring do {}', where the built-in "char" datatype in the language is a variable length entity designed to hold a complete grapheme. Only where optimization was required would you drop down to the level of 'foreach codepoint cp in mytext do {}', where mytext was defined as 'codepoint[] mytext', or even more radically to 'foreach byte b in mytext do {}', where mytext was defined as 'byte[] mytext'.

Once again, I'm not proposing that for D, I'm just promoting the general notion of keeping the developer's mind on the text and off of the representation details to the extent that it is *reasonable*.
 Since the internal encoding of the standard String would not be
 exposed to the programmer, it could be optimized differently on
 every
 platform. I would probably implement my String class in UTF-16 on
 Windows and UTF-8 on Linux to make interactions with the OS and
 neighboring processes as lightweight as possible.

Aliases can introduce a symbol that can mean different things on different platforms:

    // "Operating System" character
    version (Win32) { alias wchar oschar; }
    else            { alias char  oschar; }

    oschar[] a_string_in_the_OS_preferred_format;

Thanks for pointing out this feature. I like it. It provides a mechanism for manual optimization at the cost of greater complexity for those special cases where optimization is called for. You could have different string representations for different zones in your app, labeled by zone name: oschar for internal and OS API calls, xmlchar for an XML I/O boundary etc., so you could change the OS format from OS to OS while leaving the XML format unchanged.

I can't help thinking, though, that it would be best reserved for optimization cases, with a simple works-everywhere, called "string" everywhere, string class for the general case. Otherwise, your language tutorials would be teaching you that a string is "char[]" but real production code would almost always be based on locally-invented names for string types. Libraries, which are also trying hard to be real production quality code, would use the above alias approach and invent their own names.

Not just at points you needed to manually optimize but literally everywhere you did anything with a string internally, you'd have to choose among the three standard names, char, wchar, and dchar, plus your own custom oschar and xmlchar, plus your GUI library's gchar or kchar, and your ICU library's unichar, plus a database orachar designed to match the database encoding, etc. You could easily end up with so many conversions going on between types locally optimized for each zone in your app that you are globally unoptimized.
 One disadvantage of a String class is that the methods of the class
 are
 fixed. With arrays and functions anyone can add a string "method". A
 class
 will actually reduce flexibility in the eyes of the user IMO.
 Another
 disadvantage is that classes in D are by reference (like Java) and
 so
 slicing will have to allocate memory - today a slice is a length and
 pointer to shared data so no allocation is needed. A String struct
 would be
 an option if a class isn't used, though.

It's true what you're saying about the relative lack of flexibility of built-in methods vs. external functions. You can always apply functions to strings, though, and the conservative approach would be to have a few clearly important methods in the string, implement other operations as functions that take string arguments, and over time consider migrating those operations into the string itself.

Another possibility might be to have this "oschar" approach above actually built-in, with everybody (starting from the first "hello, world" tutorial) encouraged to use that one by default. That's tricky, though, because when you asked for mystring[3] from your oschar-based string, what would you get? People would expect the third text character, but as you know it would depend on the platform, and would not have any useful meaning in general, which seems pretty awkward for a standard string.

It doesn't seem very useful to present something in an array format without the individual elements of the array being very useful. You could make them useful by making dchar[] the default, but everybody would probably fuss about the wasted memory, and production code would end up using char or wchar. So that brings us back to a string class where operator overloading could make the [] array-type access yield consistent, complete codepoints on every platform.

I'm sympathetic to performance arguments. That would be one of the big attractions of D. I still can't help thinking that sticking to a single string class shared by almost all of your tutorials, your own code, your downloaded snippets, and all of your libraries might not only be the easiest for programmers to work with but could result in apps that tended to be at least as performant as the existing approach.
Oct 26 2004
Ben Hinkle <bhinkle4 juno.com> writes:
Glen Perkins wrote:

 
 "Ben Hinkle" <bhinkle4 juno.com> wrote in message
 news:clk269$haj$1 digitaldaemon.com...
 Glen Perkins wrote:

 welcome.

Thanks.
 There is a port of IBM's ICU unicode library underway and that will
 help
 fill in various unicode shortcomings of phobos. What else do you see
 a
 class doing that isn't in phobos?

I don't know enough to comment at this point. I don't even know how modularity works for compiled executables in D, and I don't want to propose something that would violate D's priorities by, for example, creating a heavyweight string full of ICU features, that would end up being statically linked into every little "hello, world" written in D, ruining the goal of tiny executables if, for example, that is a high priority in D. If there's no chance of a standard string class for general string operations in D, then there's no point in designing one. If there is a chance, then the design would have to start with the priorities and constraints of this particular language. My sense is that a string class similar to that in C#, but noncommittal regarding its internal encoding, would be nice for a language like D.
 ...I've become increasingly convinced that programmers don't
 need to know, much less be forced to decide, how most of their text
 is
 encoded. They should be thinking in terms of text semantically most
 of
 the time, without concerning themselves with its byte
 representation.

are you referring to indexing and slicing being character lookup and not byte lookup?

Yes, that's a specific example of what I'm referring to, which is the general notion of just thinking about algorithms for working with the text in terms of text itself without regard to how the computer might be representing that text inside (except in the minority of cases where you MUST work explicitly with the representation.) And though it's probably too radical for D (so nobody freak out), we may well evolve to the point where the most reasonable default for walking through the "characters" in general text data is something like 'foreach char ch in mystring do {}', where the built-in "char" datatype in the language is a variable length entity designed to hold a complete grapheme. Only where optimization was required would you drop down to the level of "foreach codepoint cp in mytext do {}', where mytext was defined as 'codepoint[] mytext', or even more radically to 'foreach byte b in mytext do {}', where mytext was defined as 'byte[] mytext'.

One can foreach over dchars from either a char[] or wchar[]:

    int main() {
        char[] t = "hello 中国 world";
        foreach(dchar x; t) printf("%x ", x);
        return 0;
    }

prints

    68 65 6c 6c 6f 20 4e2d 56fd 20 77 6f 72 6c 64

Similarly structs and classes can have overloaded opApply implementations to customize what it means to foreach in different situations.
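As a sketch of that opApply point (the Text struct here is made up for illustration, not a phobos type): a struct can store UTF-8 internally yet present foreach-over-dchar semantics.

```d
// Made-up example type: stores char[] but iterates as dchars.
struct Text
{
    char[] data;

    int opApply(int delegate(ref dchar) dg)
    {
        // reuse the built-in char[] -> dchar decoding foreach
        foreach (dchar c; data)
        {
            int r = dg(c);
            if (r)
                return r;
        }
        return 0;
    }
}

int main()
{
    Text t;
    t.data = "hi".dup;
    int n = 0;
    foreach (dchar c; t)   // calls Text.opApply
        n++;
    assert(n == 2);
    return 0;
}
```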
 Once again, I'm not proposing that for D, I'm just promoting the
 general notion of keeping the developer's mind on the text and off of
 the representation details to the extent that it is *reasonable*.
 
 
 Since the internal encoding of the standard String would not be
 exposed to the programmer, it could be optimized differently on
 every
 platform. I would probably implement my String class in UTF-16 on
 Windows and UTF-8 on Linux to make interactions with the OS and
 neighboring processes as lightweight as possible.

Aliases can introduce a symbol that can mean different things on different platforms:

    // "Operating System" character
    version (Win32) { alias wchar oschar; }
    else            { alias char  oschar; }

    oschar[] a_string_in_the_OS_preferred_format;

Thanks for pointing out this feature. I like it. It provides a mechanism for manual optimization at the cost of greater complexity for those special cases where optimization is called for. You could have different string representations for different zones in your app, labeled by zone name: oschar for internal and OS API calls, xmlchar for an XML I/O boundary etc., so you could change the OS format from OS to OS while leaving the XML format unchanged. I can't help thinking, though, that it would be best reserved for optimization cases, with a simple works-everywhere, called "string" everywhere, string class for the general case. Otherwise, your language tutorials would be teaching you that a string is "char[]" but real production code would almost always be based on locally-invented names for string types. Libraries, which are also trying hard to be real production quality code, would use the above alias approach and invent their own names. Not just at points you needed to manually optimize but literally everywhere you did anything with a string internally, you'd have to choose among the three standard names, char, wchar, and dchar, plus your own custom oschar and xmlchar, plus your GUI library's gchar or kchar, and your ICU library's unichar, plus a database orachar designed to match the database encoding, etc. You could easily end up with so many conversions going on between types locally optimized for each zone in your app that you are globally unoptimized.

That's possible, but so far it doesn't seem so bad to have three core string types. Storing the encoding in the instance instead of the type would turn today's compile-time decisions into run-time decisions, though. That would most likely slow things down since it can't inline as completely.
 One disadvantage of a String class is that the methods of the class
 are
 fixed. With arrays and functions anyone can add a string "method". A
 class
 will actually reduce flexibility in the eyes of the user IMO.
 Another
 disadvantage is that classes in D are by reference (like Java) and
 so
 slicing will have to allocate memory - today a slice is a length and
 pointer to shared data so no allocation is needed. A String struct
 would be
 an option if a class isn't used, though.

It's true what you're saying about the relative lack of flexibility of built-in methods vs. external functions. You can always apply functions to strings, though, and the conservative approach would be to have a few clearly important methods in the string, implement other operations as functions that take string arguments, and over time consider migrating those operations into the string itself. Another possibility might be to have this "oschar" approach above actually built-in, with everybody (starting from the first "hello, world" tutorial) encouraged to use that one by default. That's tricky, though, because when you asked for mystring[3] from your oschar-based string, what would you get? People would expect the third text character, but as you know it would depend on the platform, and would not have any useful meaning in general, which seems pretty awkward for a standard string. It doesn't seem very useful to present something in an array format without the individual elements of the array being very useful. You could make them useful by making dchar[] the default, but everybody would probably fuss about the wasted memory, and production code would end up using char or wchar. So that brings us back to a string class where operator overloading could make the [] array-type access yield consistent, complete codepoints on every platform. I'm sympathetic to performance arguments. That would be one of the big attractions of D. I still can't help thinking that sticking to a single string class shared by almost all of your tutorials, your own code, your downloaded snippets, and all of your libraries might not only be the easiest for programmers to work with but could result in apps that tended to be at least as performant as the existing approach.

Yeah - any design will have trade-offs. dchar[] takes up too much space. On-the-fly character lookup is too slow to make the default. char[] is too fat for asian languages. Judgements like "too much space" and "too slow" are subjective and Walter made his choices. I'm sure he's open to more information that would sway those choices but the best chance of influencing things is to add some solid data that is missing. With your experience in string handling in different languages I'm guessing your opinions are based on accumulated knowledge about what is fast or slow etc so trying to articulate that accumulated knowledge would be very useful.

-Ben
Oct 27 2004
Regan Heath <regan netwin.co.nz> writes:
On Wed, 27 Oct 2004 08:26:52 -0400, Ben Hinkle <bhinkle4 juno.com> wrote:
 That's possible, but so far it doesn't seem so bad to have three core 
 string types. Storing the encoding in the instance instead of the type 
 would turn today's compile-time decisions into run-time decisions, 
 though. That would most likely slow things down since it can't inline as 
 completely.

Ben, can you give me/us an example where this would be the case? How much slower do you think it would make it?

Regan

-- 
Using M2, Opera's revolutionary e-mail client: http://www.opera.com/m2/
Oct 27 2004
"Ben Hinkle" <bhinkle mathworks.com> writes:
"Regan Heath" <regan netwin.co.nz> wrote in message
news:opsgjm7mx55a2sq9 digitalmars.com...
 On Wed, 27 Oct 2004 08:26:52 -0400, Ben Hinkle <bhinkle4 juno.com> wrote:
 That's possible, but so far it doesn't seem so bad to have three core
 string types. Storing the encoding in the instance instead of the type
 would turn today's compile-time decisions into run-time decisions,
 though. That would most likely slow things down since it can't inline as
 completely.

Ben, can you give me/us an example where this would be the case. How much slower do you think it would make it?

I don't know about impact on typical string usage but it certainly makes a difference with a super-cheezy made-up example like:

    import std.c.windows.windows;

    enum Encoding { UTF8, UTF16, UTF32 };

    struct my_string {
        Encoding encoding;
        int length;
        void* data;
    }

    char  index(char[] s,  int n) { return s[n]; }
    wchar index(wchar[] s, int n) { return s[n]; }
    dchar index(dchar[] s, int n) { return s[n]; }

    dchar index(my_string s, int n) {
        switch (s.encoding) {
            case Encoding.UTF8:  return (cast(char*)s.data)[n];
            case Encoding.UTF16: return (cast(wchar*)s.data)[n];
            case Encoding.UTF32: return (cast(dchar*)s.data)[n];
        }
    }

    int main() {
        char[] s = "hello";
        int t1 = GetTickCount();
        for (int k = 0; k < 100_000_000; k++) { index(s,3); }
        int t2 = GetTickCount();
        my_string s2;
        s2.data = s;
        s2.encoding = Encoding.UTF8;
        s2.length = s.length;
        int t3 = GetTickCount();
        for (int k = 0; k < 100_000_000; k++) { index(s2,3); }
        int t4 = GetTickCount();
        printf("compile time %d\n", t2-t1);
        printf("run time %d\n", t4-t3);
        return 0;
    }

compiling with "dmd main.d -O -inline" and running gives

    compile time 110
    run time 531

Any particular example doesn't mean much, though. My statement was meant as a general statement about compile-time vs run-time decisions.
Oct 27 2004
Regan Heath <regan netwin.co.nz> writes:
On Wed, 27 Oct 2004 16:19:14 -0400, Ben Hinkle <bhinkle mathworks.com> 
wrote:
 "Regan Heath" <regan netwin.co.nz> wrote in message
 news:opsgjm7mx55a2sq9 digitalmars.com...
 On Wed, 27 Oct 2004 08:26:52 -0400, Ben Hinkle <bhinkle4 juno.com> 
 wrote:
 That's possible, but so far it doesn't seem so bad to have three core
 string types. Storing the encoding in the instance instead of the type
 would turn today's compile-time decisions into run-time decisions,
 though. That would most likely slow things down since it can't inline 

 completely.

Ben, can you give me/us an example where this would be the case. How much slower do you think it would make it?

I don't know about impact on typical string usage but it certainly makes a difference with a super-cheezy made-up example like:

    import std.c.windows.windows;

    enum Encoding { UTF8, UTF16, UTF32 };

    struct my_string {
        Encoding encoding;
        int length;
        void* data;
    }

    char  index(char[] s,  int n) { return s[n]; }
    wchar index(wchar[] s, int n) { return s[n]; }
    dchar index(dchar[] s, int n) { return s[n]; }

    dchar index(my_string s, int n) {
        switch (s.encoding) {
            case Encoding.UTF8:  return (cast(char*)s.data)[n];
            case Encoding.UTF16: return (cast(wchar*)s.data)[n];
            case Encoding.UTF32: return (cast(dchar*)s.data)[n];
        }
    }

    int main() {
        char[] s = "hello";
        int t1 = GetTickCount();
        for (int k = 0; k < 100_000_000; k++) { index(s,3); }
        int t2 = GetTickCount();
        my_string s2;
        s2.data = s;
        s2.encoding = Encoding.UTF8;
        s2.length = s.length;
        int t3 = GetTickCount();
        for (int k = 0; k < 100_000_000; k++) { index(s2,3); }
        int t4 = GetTickCount();
        printf("compile time %d\n", t2-t1);
        printf("run time %d\n", t4-t3);
        return 0;
    }

compiling with "dmd main.d -O -inline" and running gives

    compile time 110
    run time 531

Any particular example doesn't mean much, though. My statement was meant as a general statement about compile-time vs run-time decisions.

Thanks. I was hacking round with your example, basically inventing a string type which did not have runtime decisions. It is giving me some very strange results; I wonder if you can spot where it's going awry.

    D:\D\src\temp>dmd string.d -O -release -inline
    d:\d\dmd\bin\..\..\dm\bin\link.exe string,,,user32+kernel32/noi;

    D:\D\src\temp>string
    compile time 156
    run time 1000

(string.d is your example, unmodified, as a comparison to what I get below)

    D:\D\src\temp>dmd string2.d -O -release -inline
    d:\d\dmd\bin\..\..\dm\bin\link.exe string2,,,user32+kernel32/noi;

    D:\D\src\temp>string2
    compile time 219
    run time 1156
    template 157

I ran both several times, the results above are typical for my system.

Notice:
1- the compile time string2.d is slower than string.d
2- the template one is faster than the compile time one

I don't understand how either of the above can be true.

--[string2.d]--

    import std.c.windows.windows;

    enum Encoding { UTF8, UTF16, UTF32 };

    struct my_string {
        Encoding encoding;
        void opCall(char[] s)  { encoding = Encoding.UTF8;  cs = s.dup; }
        void opCall(wchar[] s) { encoding = Encoding.UTF16; ws = s.dup; }
        void opCall(dchar[] s) { encoding = Encoding.UTF32; ds = s.dup; }
        union {
            char[] cs;
            wchar[] ws;
            dchar[] ds;
        }
    }

    struct my_string2(Type) {
        Type[] data;
        void opCall(char[] s)  { data = cast(Type[])s.dup; }
        void opCall(wchar[] s) { data = cast(Type[])s.dup; }
        void opCall(dchar[] s) { data = cast(Type[])s.dup; }
        Type opIndex(int i) { return data[i]; }
    }

    char  index(char[] s,  int n) { return s[n]; }
    wchar index(wchar[] s, int n) { return s[n]; }
    dchar index(dchar[] s, int n) { return s[n]; }

    dchar index(my_string s, int n) {
        switch (s.encoding) {
            case Encoding.UTF8:  return s.cs[n];
            case Encoding.UTF16: return s.ws[n];
            case Encoding.UTF32: return s.ds[n];
        }
    }

    char  index(my_string2!(char) s,  int n) { return s.data[n]; }
    wchar index(my_string2!(wchar) s, int n) { return s.data[n]; }
    dchar index(my_string2!(dchar) s, int n) { return s.data[n]; }

    int main() {
        char[] s = "hello";
        int t1 = GetTickCount();
        for (int k = 0; k < 100_000_000; k++) { index(s,3); }
        int t2 = GetTickCount();
        my_string s2;
        s2(s);
        int t3 = GetTickCount();
        for (int k = 0; k < 100_000_000; k++) { index(s2,3); }
        int t4 = GetTickCount();
        my_string2!(char) s3;
        s3(s);
        int t5 = GetTickCount();
        for (int k = 0; k < 100_000_000; k++) { index(s3,3); }
        int t6 = GetTickCount();
        printf("compile time %d\n", t2-t1);
        printf("run time %d\n", t4-t3);
        printf("template %d\n", t6-t5);
        return 0;
    }

-- 
Using M2, Opera's revolutionary e-mail client: http://www.opera.com/m2/
Oct 27 2004
parent "Ben Hinkle" <bhinkle mathworks.com> writes:
 D:\D\src\temp>dmd string2.d -O -release -inline
 d:\d\dmd\bin\..\..\dm\bin\link.exe string2,,,user32+kernel32/noi;

 D:\D\src\temp>string2
 compile time 219
 run time 1156
 template 157

 I ran both several times, the results above are typical for my system.

 Notice:
 1- the compile time string2.d is slower than string.d
 2- the template one is faster than the compile time one

 I don't understand how either of the above can be true.

That is odd. I got:

compile time 78
run time 593
template 79

so I don't know what could be going on. Maybe try switching around the order to see if that changes anything? I don't really know.
Oct 28 2004
prev sibling parent reply "Glen Perkins" <please.dont email.com> writes:
"Ben Hinkle" <bhinkle4 juno.com> wrote in message 
news:clo463$21js$1 digitaldaemon.com...
 You could easily end up with so many conversions going on between
 types locally optimized for each zone in your app that you are
 globally unoptimized.

That's possible, but so far it doesn't seem so bad to have three core string types.

I think you'll end up with many more than that. Since people will be required to make what is essentially an optimization decision every time they do anything with text, the choice will typically be different on different platforms. Rather than letting the implementation deal with that so that source code can be ported and still remain close to optimal, this design requires the programmer to either:

1) live with suboptimal performance when porting,
2) manually rewrite most of his code and live with separate versions that are harder to keep in sync, or
3) use the "alias" feature to invent a local name for a "standard" string.

Of the three, I think #3 is the most attractive. If lots of people agree with me, then when we end up reusing each other's code, we'll end up with the standard three string types, plus our own type, plus those invented by others. And there's no guarantee that our various alias types will all make the same decisions about when to be what. So now I have a whole bunch of string types to deal with, some of which are the same on some platforms but different on others. When I try to optimize my code so that I don't have lots of unnecessary back-and-forth encoding conversions, I have to further de-sync my different platform versions, or use more aliases to manage the aliases, or, once again, live with the lack of optimization, attempting to repair it only where necessary.

If it's going to end up being #3--and it probably should, because of what we know about optimization, where the majority of your operations of all types could execute instantly without a noticeable improvement in overall app performance--then you could probably get about the same performance, without the design nightmare, by using a single, standard string type (optimized by the implementation for the platform) for almost everything.
 Storing the encoding in the instance instead of the type would turn
 today's compile-time decisions into run-time decisions, though. That 
 would
 most likely slow things down since it can't inline as completely.

I'm not suggesting a string type that would have a field to hold its encoding, so that two instances of the same string class on the same platform could have two different internal encodings, and functions would have to decide at runtime what code to run for each instance. I'm talking about a situation similar to the alias idea, where every instance of a standard string on a given platform, whether in your own code or the libraries, would be in the same encoding--an encoding known at compile time.

The information to early-bind the methods would be available at compile time, and a smart compiler might be able to use that fact for compile-time optimization, but I can't completely disagree with you. There may be other reasons why the compiler might not be able to do the binding at compile time, perhaps due to the general implementation of OO support.

Even if this is the case, you don't have to dismiss an idea because it doesn't optimize performance for each instance in which it is used. GC itself doesn't optimize performance for each instance, but it's still the way to go (in my opinion) because the performance of most parts is irrelevant to the performance of the whole, as long as those parts are reasonable, and you have a manual option for special cases. I think the same argument implies having a single default string type and letting the compiler optimize it.
 I'm sympathetic to performance arguments. That would be one of the 
 big
 attractions of D. I still can't help thinking that sticking to a
 single string class shared by almost all of your tutorials, your 
 own
 code, your downloaded snippets, and all of your libraries might not
 only be the easiest for programmers to work with but could result 
 in
 apps that tended to be at least as performant as the existing
 approach.

Yeah - any design will have trade-offs. dchar[] takes up too much space. On-the-fly character lookup is too slow to make the default.

I'm not sure I understand this. I realize that you're just quoting things that "people say", but if this means it's better to have byte fetching from UTF-8 be the default instead of character fetching, it sounds as though it's claiming that it's a better default to do something useless than useful if the useless operation is faster.

For the majority of text work, byte fetching is useless. What you care about is the text, not its representation. Only in a minority of cases would byte fetching matter. Those special cases are definitely important--the general cases will be built on top of byte fetching, so fast byte fetching is mandatory--but defaults should be based on the typical need, not the exceptional need. If the typical need requires more work, well, it's still the typical need, and the default, almost by definition, should be designed for it.

Of course, I may have misunderstood you completely. ;-) Even more likely, this particular point doesn't matter, but it has been a source of some frustration how often people with a "C mindset" (and I'm not talking about you but am thinking of countless design meetings over the years) end up optimizing the insignificant at the expense of the significant, because the insignificant is always in their face.

I think you can have the best of both worlds with a design based roughly on the idea of programmer productivity for defaults, plus fine-grained manual optimization features (that integrate easily with the default features) for bottlenecks (defined as any place where a local optimization will produce a global optimization). D is quite close to such a design, but it seems to me that the string approach doesn't quite match.
 char[] is too
 fat for asian languages. Judgements like "too much space" and "too 
 slow"
 are subjective and Walter made his choices. I'm sure he's open to 
 more
 information that would sway those choices but the best chance of
 influencing things is to add some solid data that is missing.

I'm not sure who "Walter" is, but it sounds like he's the guy to thank for such a nice language design. (If I didn't mean that, I wouldn't waste my time writing any of this.)

For the specific issue of strings, the information that I think is most relevant (and, as I said before, I still can't be *sure* that it's relevant in D's case) is not "data" per se, but a reminder that C++ is about the worst-case scenario among major languages when it comes to programmer productivity in text handling, in large part because you ALWAYS end up getting stuck with multiple string types in any significant app. The problem is NOT that nobody ever managed to create a useful string type for C++; it's that EVERYBODY did so, because Stroustrup wouldn't. The "data", I suppose, is what happened in the case of C++ and didn't happen to any language with a built-in standard string class, but of course you can argue about the relevance of the comparisons.
 With your
 experience in string handling in different languages I'm guessing 
 your
 opinions are based on accumulated knowledge about what is fast or 
 slow etc
 so trying to articulate that accumulated knowledge would be very 
 useful.

My accumulated knowledge tells me that what's fast or slow for a string design should NOT be your primary consideration, even when performance of the app as a whole IS (and it usually isn't). I DO care about performance. Java's design prohibits the kind of performance I'm looking for, which is one reason I'm curious about D. But I care about global performance, not local performance, and I also care about other significant global issues such as programmer productivity, lack of bugs, source portability, maintenance costs, etc. that almost always matter more than the microscale performance of your strings.

A factory doing manual labor can double its output by doubling the people at every station, or it can pull people off of some stations, reducing the local "performance" at those stations, and reallocate them to double the staff at the bottleneck only. One approach improves global performance by improving local performance everywhere. The other either doesn't improve or actually loses performance everywhere except at the bottleneck. Both produce the same doubling of total factory output.

Which approach is better? Well, gaining performance everywhere is obviously better--you never know when it might come in handy, right?--until you factor in the cost. I don't want to get too tangled in the details of the analogy, but the cost of having three standard strings for fine-grained performance tuning everywhere, plus homemade and 3rd party aliases, plus multiple 3rd party string classes that will fill the void in the standard, is the complexity that it will add to designs, with all of the implications that has for debugging, code reuse, architectural decisions, portability, maintenance, and general programmer productivity. All of those factors have costs, and some of them may even negatively impact global performance, which was the reason for the extra complexity to begin with.

I just have a hard time imagining that MAKING people micromanage their string implementations in all cases will produce superior global performance to simply ALLOWING them to do so where it impacts global performance. Doubling the staff at every factory station results in no more total production than simply doubling the staff at the bottleneck. I have an even harder time imagining that the benefits of the unavoidable additional complexity (which you can never avoid if you ever use other people's code) will be worth the performance benefit that may not even exist.

I could still be wrong about any of this. Am I overlooking something?
Oct 27 2004
next sibling parent Regan Heath <regan netwin.co.nz> writes:
On Wed, 27 Oct 2004 17:34:14 -0700, Glen Perkins <please.dont email.com> 
wrote:
 "Ben Hinkle" <bhinkle4 juno.com> wrote in message 
 news:clo463$21js$1 digitaldaemon.com...
 Storing the encoding in the instance instead of the type would turn
 today's compile-time decisions into run-time decisions, though. That 
 would
 most likely slow things down since it can't inline as completely.

I'm not suggesting a string type that would have a field to hold its encoding, so that two instances of the same string class on the same platform could have two different internal encodings and functions would have to decide at runtime what code to run for each instance.

No, but I did :) I am starting to think it's unnecessary however, given that converting from one encoding to another necessitates a copy of the data anyway.

So, instead: a single string type that could be encoded internally as any of the available encodings, couldn't change encoding itself, but could be cast/converted to another encoding (creating a new string). Plus, it needs all the functionality of our current arrays, i.e. indexing, slicing, and being able to write methods for it, i.e.

void foo(char[] a, int b);

char[] aa;
aa.foo(5);   <-- calls 'foo' above

I'm pretty sure the above idea is not possible without some sort of compiler magic.

Regan
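One possible shape for such a runtime-tagged type, sketched in D (the names `Str`, `Enc`, and `toUTF16Str` are invented for illustration; the only real library call assumed is Phobos's `std.utf.toUTF16`):

```d
import std.utf;   // toUTF16 etc. from Phobos

enum Enc { UTF8, UTF16, UTF32 }

// A string whose encoding is fixed at construction; "changing" the
// encoding only ever produces a new value, since conversion copies anyway.
struct Str
{
    Enc enc;
    union { char[] c; wchar[] w; dchar[] d; }

    void opCall(char[] s) { enc = Enc.UTF8; c = s.dup; }

    // Conversion builds a new Str; the original keeps its encoding.
    Str toUTF16Str()
    {
        Str r;
        r.enc = Enc.UTF16;
        switch (enc)
        {
            case Enc.UTF8:  r.w = toUTF16(c); break;
            case Enc.UTF16: r.w = w.dup;      break;
            case Enc.UTF32: r.w = toUTF16(d); break;
        }
        return r;
    }
}
```

The array-method call syntax above (`aa.foo(5)`) already works for plain arrays; giving a struct like this the full indexing/slicing behaviour of the built-in arrays is where the compiler magic would be needed.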
Oct 27 2004
prev sibling parent reply "Walter" <newshound digitalmars.com> writes:
"Glen Perkins" <please.dont email.com> wrote in message
news:clpeud$lql$1 digitaldaemon.com...
 I'm not sure I understand this. I realize that you're just quoting
 things that "people say", but if this means it's better to have byte
 fetching from UTF-8 be the default instead of character fetching, it
 sounds as though it's claiming that it's a better default to do
 something useless than useful if the useless operation is faster. For
 the majority of text work, byte fetching is useless. What you care
 about is the text, not its representation. Only in a minority of cases
 would byte fetching matter. Those special cases are definitely
 important--the general cases will be built on top of byte fetching so
 fast byte fetching is mandatory--but defaults should be based on the
 typical need, not the exceptional need. If the typical need requires
 more work, well, it's still the typical need and the default, almost
 by definition, should be designed for the typical need.

I'm not so sure this is correct. For a number of common string operations, such as copying and searching, byte indexing of UTF-8 is faster than codepoint indexing. For sequential codepoint access, the foreach() statement does the job. For random access of codepoints, one has to always start from the beginning and count forward anyway, and foreach() does that.

As for a single string type, there is no answer for that. Each has significant tradeoffs. For a speed-oriented language, the choice needs to be under the control of the application programmer, not the language. The three types are readily convertible into each other. I don't really see the need for application programmers to layer on more string types.
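The byte-versus-codepoint distinction is easy to see in a few lines of D (a small illustrative sketch, using printf as the rest of the thread does; `codepoints` is an invented helper):

```d
// foreach with a dchar loop variable decodes the UTF-8 sequences in a
// char[] on the fly; indexing and .length, by contrast, work in bytes.
size_t codepoints(char[] s)
{
    size_t n = 0;
    foreach (dchar c; s)
        n++;
    return n;
}

int main()
{
    char[] s = "h\u00EBllo";   // the 'ë' occupies two UTF-8 bytes

    printf("%d bytes, %d codepoints\n",
           s.length, codepoints(s));   // 6 bytes, 5 codepoints
    return 0;
}
```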
Oct 27 2004
parent reply "Glen Perkins" <please.dont email.com> writes:
"Walter" <newshound digitalmars.com> wrote in message 
news:clptlo$161f$1 digitaldaemon.com...

 ...[UTF-8 indexing issue that I don't want to waste your time 
 with]...

 As for a single string type, there is no answer for that. Each has
 significant tradeoffs. For a speed oriented language, the choice 
 needs to be
 under the control of the application programmer, not the language.

I agree. I think there should be a standard string class for default use, plus a selection of byte array forms (e.g. char[], wchar[], dchar[]) for use anywhere that the programmer determined that their use instead of the default improved the app. The choice would be completely under the control of the programmer.

The encoding of the default string class would be up to the implementors to optimize for the platform, so that for the great majority of text operations in an app, the default string would work so well that replacing it with one of the byte array forms would be found to have no positive impact on the app. However, anytime the programmer encountered a situation where use of a byte array type improved the app, he could use it.

With this approach, you could have code with the same performance as under the current system, because anytime it was slower you could just use the current system. However, having a good default string as well, used by most apps on most platforms by most people most of the time, would simplify designs, porting, maintenance, programmer productivity, etc.
 The three
 types are readilly convertible into each other.

In fact, all four types would be readily convertible, though by having one that was almost always the best choice, regardless of platform, you would be able to avoid many unnecessary conversions that could easily de-optimize your code as you added libraries and ported your app to other platforms. Also, by matching the implementation of that default to the preferred form of the local OS APIs, conversions between the default string class and the OS API format could probably be compiled down to very lightweight object code on any platform, from the same source code.
 I don't really see the need
 for application programmers to layer on more string types.

I do, and apparently I'm not alone. People seem to mention it a lot in this newsgroup (I've discovered), just as they did for so long with C++. If I want char[] on Linux and wchar[] on Windows and want to avoid the nightmare of maintaining parallel but subtly different code, I need to create my own type using the "alias" feature. The authors of a couple of libraries I'll use will do likewise, but with their own type names and maybe different alias resolution rules. I expect some people will solve it with string classes.

Stroustrup took a similar position about the need for programmers to optimize their strings long enough that every C++ library and API created its own string type. He once stated in a meeting I attended that his greatest regret about C++ was waiting so long to have a standard library, and that the most requested feature of that library had been a string class.

By adding just one more standard string type that would be a good default on every platform, I think you could eliminate the need so many people will feel to create their own, and prevent string types from multiplying like bunnies, as happened to C++. Performance isn't the only thing programmers want, even from a high-performance language. They'd also like to avoid unnecessary complexity, avoid bugs, reuse other people's code, target multiple platforms with mostly the same source, and so on. I think having a single, good default string type could be very helpful for these things without having to harm performance.

Even so, I realize that my opinion may be based on incorrect assumptions, missing information, faulty logic, selective memory, or peculiar personal preferences, so I may be wrong. If so, though, I'd be curious to know why.
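The per-platform alias described above might look like this in D (a sketch; the alias name `str_t` is made up for the example):

```d
// Hypothetical per-platform "standard string" alias: UTF-16 where the
// OS APIs prefer it, UTF-8 bytes elsewhere.
version (Windows)
    alias wchar[] str_t;   // Win32 "W" APIs take UTF-16
else
    alias char[]  str_t;   // Unix APIs are byte-oriented

// Code written against str_t ports unchanged -- but two libraries that
// define their own aliases differently still force conversions at their
// boundaries, which is exactly the multiplication problem.
uint firstUnit(str_t s)
{
    return s[0];
}
```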
Oct 28 2004
next sibling parent reply =?ISO-8859-1?Q?Anders_F_Bj=F6rklund?= <afb algonet.se> writes:
Glen Perkins wrote:

 As for a single string type, there is no answer for that. Each has
 significant tradeoffs. For a speed oriented language, the choice needs 
 to be under the control of the application programmer, not the language.

I agree. I think there should be a standard string class for default use plus a selection of byte array forms (e.g. char[], wchar[], dchar[]) for use anywhere that the programmer determined that their use instead of the default improved the app.

I don't have a problem with a standard String *class* present in D, as long as I don't *have* to use it (and OOP) - like I do in Java...

The beauty of D's string types (char[] and wchar[]) is that they work for plain old procedural C-style programs too, not just objects?

--anders
Oct 28 2004
parent Regan Heath <regan netwin.co.nz> writes:
On Thu, 28 Oct 2004 10:24:22 +0200, Anders F Björklund <afb algonet.se> 
wrote:
 Glen Perkins wrote:

 As for a single string type, there is no answer for that. Each has
 significant tradeoffs. For a speed oriented language, the choice needs 
 to be under the control of the application programmer, not the 
 language.

I agree. I think there should be a standard string class for default use plus a selection of byte array forms (e.g. char[], wchar[], dchar[]) for use anywhere that the programmer determined that their use instead of the default improved the app.

I don't have a problem with a standard String *class* present in D, as long as I don't *have* to use it (and OOP) - like I do in Java... The beauty about D's string types (char[] and wchar[]) is that they work for plain old procedural C-style programs too, not just objects ?

So we use a 'struct' instead.

Regan
Oct 28 2004
prev sibling next sibling parent reply Ben Hinkle <bhinkle4 juno.com> writes:
 I need to create my own type using the "alias" feature. The authors of
 a couple of libraries I'll use will do likewise, but with their own
 type names and maybe different alias resolution rules.

Technically an alias introduces a new symbol. It's like a #define. It doesn't actually introduce a new type (see typedef). For example, the following doesn't compile:

alias int foo;
void bar(int y) {}
void bar(foo y) {}
int main() { bar(0); return 0; }

compiling results in:

"function bar overloads void(int y) and void(int y) both match argument list for bar"

Redefining an alias is ignored (well, it is very useful for overloading functions, but not for basic types). For example:

alias int foo;
alias long foo;
void bar(int y)  { printf("int\n");  }
void bar(long y) { printf("long\n"); }
int main() { foo x; bar(x); return 0; }

prints "int". So defining multiple aliases for strings or any other type is a pretty harmless thing to do. It should only affect the readability and maintainability of the code.
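The typedef contrast is worth making concrete (a small sketch in the D of this era, where typedef still existed; `which` is an invented helper):

```d
// alias is a transparent second name; typedef (in old D) creates a
// distinct type derived from int that participates in overloading.
alias   int ifoo;   // ifoo IS int: which(int)/which(ifoo) would collide
typedef int tfoo;   // tfoo is a distinct type

char[] which(int y)  { return "int";  }
char[] which(tfoo y) { return "tfoo"; }   // legal: distinct overload

int main()
{
    tfoo x;
    printf("%.*s\n", which(x));           // exact match picks the tfoo overload
    printf("%.*s\n", which(cast(int)x));  // the cast reaches the int overload
    return 0;
}
```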
Oct 28 2004
parent reply Regan Heath <regan netwin.co.nz> writes:
On Thu, 28 Oct 2004 08:35:24 -0400, Ben Hinkle <bhinkle4 juno.com> wrote:
 So defining multiple aliases for strings or any other type is
 a pretty harmless thing to do. It should only effect the readability and
 maintainability of the code.

I'd argue that it's not harmless for the very reasons you just mentioned. Readability and maintainability are important when working on any large-ish project.

Regan
Oct 28 2004
parent reply "Ben Hinkle" <bhinkle mathworks.com> writes:
"Regan Heath" <regan netwin.co.nz> wrote in message
news:opsglk19qr5a2sq9 digitalmars.com...
 On Thu, 28 Oct 2004 08:35:24 -0400, Ben Hinkle <bhinkle4 juno.com> wrote:
 So defining multiple aliases for strings or any other type is
 a pretty harmless thing to do. It should only effect the readability and
 maintainability of the code.

I'd argue that it's not harmless for the very reasons you just mentioned. Readability and maintainability are important when working on any large-ish project.

But introducing more names doesn't always make something more readable or maintainable. One has to factor in the size of the group and the time-scale of the life of the code. A wrapper or alias might seem obvious to the couple of people who started the project, but years down the road, with a group orders of magnitude larger, a little helper wrapper can add up to be more overhead than it is worth.

Also, notions of "this code is readable" and "maintainable" are much more subjective than "this code doesn't compile" or "this code uses the wrong type". My personal preference is that keeping things simple is the best way to make something readable and maintainable.
Oct 29 2004
parent reply Regan Heath <regan netwin.co.nz> writes:
On Fri, 29 Oct 2004 10:39:08 -0400, Ben Hinkle <bhinkle mathworks.com> 
wrote:
 "Regan Heath" <regan netwin.co.nz> wrote in message
 news:opsglk19qr5a2sq9 digitalmars.com...
 On Thu, 28 Oct 2004 08:35:24 -0400, Ben Hinkle <bhinkle4 juno.com> 
 wrote:
 So defining multiple aliases for strings or any other type is
 a pretty harmless thing to do. It should only effect the readability 

 maintainability of the code.

I'd argue that it's not harmless for the very reasons you just mentioned. Readability and maintainability are important when working on any large-ish project.

But introducing more names doesn't always make something more readable or maintainable. One has to factor in the size of the group and time-scale of life of the code. A wrapper or alias might seem obvious to the couple of people who started the project but years down the road with a group orders of magnitude larger a little helper wrapper can add up to be more overhead than it is worth. Also notions of "this code is readable" and "maintainable" are much more subjective than "this code doesn't compile" or "this code uses the wrong type". My personal preference is that keeping things simple is the best way to make something readable and maintainable.

That's what *I* implied/said, wasn't it?

Regan
Oct 31 2004
parent ac <ac_member pathlink.com> writes:
 I'd argue that it's not harmless for the very reasons you just 
 mentioned.
 Readability and maintainability are important when working on any
 large-ish project.



My personal preference is that keeping things simple is the best way to make something readable and maintainable.

That's what *I* implied/said, wasn't it?

As an old man, I cannot avoid thinking that these (obviously both) talented young men cannot find a place between their hormones and the writing on the wall. Had I been that age, I'd have participated vigorously in this.

I hope we get Walter with us in introducing a new name for the Canonical String. Be it an alias, a type, a class, or whatever. The main point is that we do need A Type that "everyone" uses. Sure, we can claim that it's the wchar, uchar, dchar, or whatever, but hey, please, do remember the very purpose of a programming language:

"We may create a programming language from the point of the computer. We may create a programming language from the point of the programmer. We may create a ... ... sw-developing company. We ... ... education. W... ... maintainability."

... the story has other leaves. Psychology, practice, history, just about everything "non-reality" related tells us 10-to-1 that we should create a name and tell everyone to use that. Technically we do not need this, but this, I'm sorry, is not the issue here.
Nov 02 2004
prev sibling next sibling parent reply Sean Kelly <sean f4.ca> writes:
In article <clq9a8$1jkb$1 digitaldaemon.com>, Glen Perkins says...
 I don't really see the need
 for application programmers to layer on more string types.

I do, and apparently I'm not alone. People seem to mention it a lot in this newsgroup (I've discovered), just as they did for so long with C++. If I want char[] on Linux and wchar[] on Windows and want to avoid the nightmare of maintaining parallel but subtly different code, I need to create my own type using the "alias" feature.

Out of curiosity, why would you want to use different char types internally in the same application depending on platform? At worst I would think that the I/O might translate to different encodings, but the internal code would use some normalized form regardless of platform.
The authors of 
a couple of libraries I'll use will do likewise, but with their own 
type names and maybe different alias resolution rules. I expect some 
people will solve it with string classes.

They are certainly welcome to, but I'm not sure I see a need for a standard string class. The built-ins plus support functions should be quite sufficient.
Stroustrup took a similar 
position about the need for programmers to optimize their strings long 
enough that every C++ library and API created its own string type. He 
once stated in a meeting I attended that his greatest regret about C++ 
was waiting so long to have a standard library and that the most 
requested feature of that library had been a string class.

But D has string support while early C++ did not. In fact the current C++ string type is basically just a vector with some helper functions tacked on, and those functions could just as easily have been implemented separate from the string class (as is becoming popular in these days of generic programming).
By adding 
just one more standard string type that would be a good default on 
every platform, I think you could eliminate the need so many people 
will feel to create their own and prevent string types from 
multiplying like bunnies, as happened to C++.

I think people feel the need for a string class for familiarity rather than for need. While dealing with multibyte encodings can be a tad odd at first, foreach and slices make things quite painless.
Even so, I realize that my opinion may be based on incorrect 
assumptions, missing information, faulty logic, selective memory, or 
peculiar personal preferences, so I may be wrong. If so, though, I'd 
be curious to know why.

I haven't seen a good argument *for* a string class yet, but I could certainly be swayed if one were provided. What is the advantage over the built-ins? Is this purely a desire to create a standard implementation because we know that people are going to try to roll their own?

Sean
Oct 28 2004
parent reply Regan Heath <regan netwin.co.nz> writes:
On Thu, 28 Oct 2004 14:57:33 +0000 (UTC), Sean Kelly <sean f4.ca> wrote:
 In article <clq9a8$1jkb$1 digitaldaemon.com>, Glen Perkins says...
 I don't really see the need
 for application programmers to layer on more string types.

I do, and apparently I'm not alone. People seem to mention it a lot in this newsgroup (I've discovered), just as they did for so long with C++. If I want char[] on Linux and wchar[] on Windows and want to avoid the nightmare of maintaining parallel but subtly different code, I need to create my own type using the "alias" feature.

Out of curiosity, why would you want to use different char types internally in the same application depending on platform? At worst I would think that the i/o might translate to different encodings but the internal code would use some normalized form regardless of platform.

Glen mentioned system API calls. AFAIK Unix variants use 8-bit chars internally, but the later Windows platforms use 16-bit, so if you're doing a lot of system API calls it makes sense to have the string data in the right format. Yes/no?
 I haven't seen a good argument *for* a string class yet, but I could 
 certainly
 be swayed if one were provided.  What is the advantage over the 
 built-ins?

I am hoping to outline some below.
 Is
 this purely a desire to create a standard implementation because we know 
 that
 people are going to try to roll their own?

In part, yes, the result of which would be...

Imagine in the future, when a large number of 3rd party libs exist: if each lib uses a different char type, then interfacing between them all will involve conversions, lots of them. If there were only 1 string type, this problem would not exist. I realise that some conversions are unavoidable (i.e. converting for I/O), but converting for use internally should be avoided without a very good reason; I cannot think of any at the moment which I would consider good enough to incur the cost of conversion.

Further, say a conscientious library developer understands the above and wants to make his/her lib as compatible as possible. To do so, he/she has to either:

1- write everything 3 times (as is already happening in the std libs)
2- do conversion internally

Neither option is particularly good, don't you agree?

Basically, I believe conversion should be done at the input and output stages but nowhere in between. The way to achieve that is to have 1 string type used internally; the way to ensure that is to only give people the choice of 1 string type. As suggested above, that type may differ on each platform. Perhaps it could/should also differ per application; this could be achieved with a compile-time flag to choose the internal string type. Not a perfect solution, I know, as now we need 3 versions of each library, one for each internal char type.

That's my 2c anyways.

Regan
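The convert-at-the-edges rule above is already expressible with Phobos's std.utf (toUTF8/toUTF16 are real Phobos functions; the two wrapper names here are invented for the sketch):

```d
import std.utf;   // toUTF8, toUTF16, toUTF32

// Hypothetical app policy: char[] everywhere internally; convert only
// at the OS boundary, never in between.
wchar[] toOsForm(char[] internal)    // e.g. just before a Win32 "W" call
{
    return toUTF16(internal);
}

char[] fromOsForm(wchar[] external)  // e.g. just after reading UTF-16 input
{
    return toUTF8(external);
}
```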
Oct 28 2004
parent reply "Walter" <newshound digitalmars.com> writes:
"Regan Heath" <regan netwin.co.nz> wrote in message
news:opsgllvqje5a2sq9 digitalmars.com...
 Perhaps it could/should also differ per application, this could be
 achieved with a compile time flag to choose the internal string type. Not
 a perfect solution I know, as now we need 3 versions of each library, one
 for each internal char type.

Although some are doing this, I argue it isn't necessary. Just pick one, and use conversions as necessary.
Oct 28 2004
parent Regan Heath <regan netwin.co.nz> writes:
On Thu, 28 Oct 2004 17:53:43 -0700, Walter <newshound digitalmars.com> 
wrote:
 "Regan Heath" <regan netwin.co.nz> wrote in message
 news:opsgllvqje5a2sq9 digitalmars.com...
 Perhaps it could/should also differ per application, this could be
 achieved with a compile time flag to choose the internal string type. 
 Not
 a perfect solution I know, as now we need 3 versions of each library, 
 one
 for each internal char type.

Although some are doing this, I argue it isn't necessary. Just pick one, and use conversions as necessary.

Quite frankly, yuck. As I said earlier, it's inefficient to convert internally; you should only convert on input and output.

Regan
Oct 28 2004
prev sibling parent reply "Walter" <newshound digitalmars.com> writes:
"Glen Perkins" <please.dont email.com> wrote in message
news:clq9a8$1jkb$1 digitaldaemon.com...
 I don't really see the need
 for application programmers to layer on more string types.

I do, and apparently I'm not alone. People seem to mention it a lot in this newsgroup (I've discovered), just as they did for so long with C++.

If I want char[] on Linux and wchar[] on Windows and want to avoid the nightmare of maintaining parallel but subtly different code, I need to create my own type using the "alias" feature. The authors of a couple of libraries I'll use will do likewise, but with their own type names and maybe different alias resolution rules. I expect some people will solve it with string classes.

Stroustrup took a similar position about the need for programmers to optimize their own strings, and held it long enough that every C++ library and API created its own string type. He once stated in a meeting I attended that his greatest regret about C++ was waiting so long to have a standard library, and that the most requested feature of that library had been a string class.

By adding just one more standard string type that would be a good default on every platform, I think you could eliminate the need so many people will feel to create their own, and prevent string types from multiplying like bunnies, as happened to C++.

C++ needs a string class because core C++ strings are so inadequate. But this is not true for D - core strings are more than up to the job. D core strings can do everything std::string does, and a lot more. D core strings more than cover what java.lang.String does, as well.

Using 'alias' doesn't create a new type. It just renames an existing type. Hence, I don't see much of a collision problem between different code bases that use aliases.

I also just don't see the need to even bother using aliases. Just use char[]. I think the issue comes up repeatedly because people coming from a C++ background are so used to char* being inadequate that it's hard to get comfortable with the idea that char[] really does work <g>.
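Walter's point that alias merely renames can be demonstrated directly. The alias name `MyString` below is made up for illustration (the modern `alias X = Y;` spelling is used; D1-era code wrote `alias char[] MyString;`):

```d
// 'alias' merely renames char[]; both names denote the exact same type.
alias MyString = char[];

void main()
{
    MyString a = "hello".dup;  // .dup needed under modern D, where literals are immutable
    char[] b = a;              // no conversion, no cast: they are the same type
    assert(a is b);            // both slices refer to the same data
    static assert(is(MyString == char[]));  // identical even at compile time
}
```

This is why two libraries using different aliases for char[] still interoperate without any conversion code.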
Oct 28 2004
next sibling parent reply Regan Heath <regan netwin.co.nz> writes:
On Thu, 28 Oct 2004 09:39:13 -0700, Walter <newshound digitalmars.com> 
wrote:
 "Glen Perkins" <please.dont email.com> wrote in message
 news:clq9a8$1jkb$1 digitaldaemon.com...
 I don't really see the need
 for application programmers to layer on more string types.

I do, and apparently I'm not alone. People seem to mention it a lot in this newsgroup (I've discovered), just as they did for so long with C++.

If I want char[] on Linux and wchar[] on Windows and want to avoid the nightmare of maintaining parallel but subtly different code, I need to create my own type using the "alias" feature. The authors of a couple of libraries I'll use will do likewise, but with their own type names and maybe different alias resolution rules. I expect some people will solve it with string classes.

Stroustrup took a similar position about the need for programmers to optimize their own strings, and held it long enough that every C++ library and API created its own string type. He once stated in a meeting I attended that his greatest regret about C++ was waiting so long to have a standard library, and that the most requested feature of that library had been a string class.

By adding just one more standard string type that would be a good default on every platform, I think you could eliminate the need so many people will feel to create their own, and prevent string types from multiplying like bunnies, as happened to C++.

C++ needs a string class because core C++ strings are so inadequate. But this is not true for D - core strings are more than up to the job. D core strings can do everything std::string does, and a lot more. D core strings more than cover what java.lang.String does, as well.

Using 'alias' doesn't create a new type. It just renames an existing type. Hence, I don't see much of a collision problem between different code bases that use aliases.

I also just don't see the need to even bother using aliases. Just use char[]. I think the issue comes up repeatedly because people coming from a C++ background are so used to char* being inadequate that it's hard to get comfortable with the idea that char[] really does work <g>.

It's not whether it works or not; I agree it works very well. It's the fact that there are 3 of them: it's possible people will use different ones in their libs, and then my program will have to do internal conversions all over the place. Conversion should only be done at the input and/or output stages.

Regan
Oct 28 2004
parent reply "Walter" <newshound digitalmars.com> writes:
"Regan Heath" <regan netwin.co.nz> wrote in message
news:opsgllx2xp5a2sq9 digitalmars.com...
 It's the fact that there are 3 of them, it's possible people will use
 different ones in their libs, then my program will have to do internal
 conversions all over the place.

I admit it may become a problem, but I don't think it will. More experience will let us know. With C++, there is a big problem because char* and wchar_t* are simply incompatible; one cannot even do conversions. One of my beefs with C++ was having to have multiple versions of the same function for the various char types. This isn't necessary in D.
Oct 28 2004
parent reply Regan Heath <regan netwin.co.nz> writes:
On Thu, 28 Oct 2004 17:51:04 -0700, Walter <newshound digitalmars.com> 
wrote:
 "Regan Heath" <regan netwin.co.nz> wrote in message
 news:opsgllx2xp5a2sq9 digitalmars.com...
 It's the fact that there are 3 of them, it's possible people will use
 different ones in their libs, then my program will have to do internal
 conversions all over the place.

I admit it may become a problem, but I don't think it will. More experience will let us know. With C++, there is a big problem because char* and wchar_t* are simply incompatible, one cannot even do conversions. One of my beefs with C++ was having to have multiple versions of the same function for the various char types. This isn't necessary in D.

Isn't it? Explain std.string then. Don't people convert between char* and wchar_t* all the time, with functions? How is that really different from using a cast() in D? The syntax and knowing the encoding are the only differences I can see.

Regan
Oct 28 2004
parent "Walter" <newshound digitalmars.com> writes:
"Regan Heath" <regan netwin.co.nz> wrote in message
news:opsglzsouo5a2sq9 digitalmars.com...
 On Thu, 28 Oct 2004 17:51:04 -0700, Walter <newshound digitalmars.com>
 wrote:
 "Regan Heath" <regan netwin.co.nz> wrote in message
 news:opsgllx2xp5a2sq9 digitalmars.com...
 It's the fact that there are 3 of them, it's possible people will use
 different ones in their libs, then my program will have to do internal
 conversions all over the place.

I admit it may become a problem, but I don't think it will. More experience will let us know. With C++, there is a big problem because char* and wchar_t* are simply incompatible, one cannot even do conversions. One of my beefs with C++ was having to have multiple versions of the same function for the various char types. This isn't necessary in D.

Isn't it? Explain std.string then. Don't people convert between char* and wchar_t* all the time, with functions? How is that really different from using a cast() in D, the syntax and knowing the encoding are the only differences I can see.

The conversion doesn't work because it doesn't know about UTF. An attempt is being made to fix this in the latest standards.
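The distinction the two are circling can be shown concretely. A sketch, under the assumption that an array cast in D reinterprets the underlying bytes rather than transcoding them, whereas the std.utf routines are UTF-aware:

```d
import std.utf : toUTF16;

void main()
{
    string s = "é";          // one code point, but two UTF-8 code units
    assert(s.length == 2);

    // A cast between array types merely repaints the bytes; it does not
    // transcode. cast(wstring) s would yield one garbage wchar, not "é"w.

    wstring w = toUTF16(s);  // the UTF-aware conversion
    assert(w.length == 1);   // a single UTF-16 code unit
    assert(w == "é"w);
}
```

So in D the encodings are at least known and the library conversion is correct, which is exactly what the C char*/wchar_t* conversions Walter describes lacked.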
Oct 28 2004
prev sibling next sibling parent =?ISO-8859-1?Q?Anders_F_Bj=F6rklund?= <afb algonet.se> writes:
Walter wrote:

 I also just don't see the need to even bother using aliases. Just use
 char[]. I think the issue comes up repeatedly because people coming from a
 C++ background are so used to char* being inadequate that it's hard to get
 comfortable with the idea that char[] really does work <g>.

I just found char[][] a tad confusing, but maybe it grows on you... :-)

Oh well; I can still use a local "string" alias for char[] if I want to, even if it doesn't make it into the standard D includes. No big deal.

And there probably should be a warning that ".length" only works for ASCII strings, since it returns the number of code units otherwise?

--anders
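Anders' caveat about ".length" is easy to demonstrate. For ASCII the two notions coincide, but as soon as a character needs more than one UTF-8 code unit they diverge (std.utf.count gives the code-point tally):

```d
import std.utf : count;

void main()
{
    string s = "naïve";
    assert(s.length == 6);  // .length counts UTF-8 code units; 'ï' occupies two
    assert(count(s) == 5);  // counting code points gives the five characters
}
```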
Oct 28 2004
prev sibling next sibling parent reply James McComb <ned jamesmccomb.id.au> writes:
Walter wrote:

 I also just don't see the need to even bother using aliases. Just use
 char[].

But you need to use aliases for the following scenario.

Suppose that:
 1. I want to write code for both Windows and Unix.
 2. I don't want to pay any string conversion costs at all.

I assume the way to do this in D is:
 1. Use wchar[] on Windows and make UTF-16 API calls.
 2. Use char[] on Linux and make UTF-8 API calls.
 3. Use an alias to toggle between wchar[] and char[].
 4. Use a string library that defines all functions in both wchar[] and char[] versions.

If I just used char[], I would be forced to pay string conversion costs, as Windows ultimately processes all strings in UTF-16.
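The toggle in step 3 can be sketched with a version block; the alias name `osstring` below is made up for illustration:

```d
// Platform-selected string alias, as in James's scenario.
version (Windows)
    alias osstring = wstring;   // UTF-16, matching the Win32 "W" entry points
else
    alias osstring = string;    // UTF-8, matching byte-oriented POSIX calls

// A library written against osstring never transcodes internally.
size_t textLength(osstring s) { return s.length; }

void main()
{
    version (Windows)
        assert(textLength("hello"w) == 5);
    else
        assert(textLength("hello") == 5);
}
```

The cost, as noted, is that every library in the program must be compiled against the same choice of osstring.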
Oct 28 2004
next sibling parent reply "Walter" <newshound digitalmars.com> writes:
"James McComb" <ned jamesmccomb.id.au> wrote in message
news:cls8k1$16r5$1 digitaldaemon.com...
 Walter wrote:
 I also just don't see the need to even bother using aliases. Just use
 char[].

But you need to use aliases for the following scenario.

Suppose that:
 1. I want to write code for both Windows and Unix.
 2. I don't want to pay any string conversion costs at all.

I assume the way to do this in D is:
 1. Use wchar[] on Windows and make UTF-16 API calls.
 2. Use char[] on Linux and make UTF-8 API calls.
 3. Use an alias to toggle between wchar[] and char[].
 4. Use a string library that defines all functions in both wchar[] and char[] versions.

If I just used char[], I would be forced to pay string conversion costs, as Windows ultimately processes all strings in UTF-16.

True, Win32 processes strings in UTF-16 and Linux in UTF-8. But I'll argue that the string conversion costs are insignificant, because very rarely does one write code that crosses from the app to the OS in a tight loop. In fact, one actively tries to avoid doing that, because crossing the process boundary layer is expensive anyway.

If profiling indicates that the conversion cost is significant, then use an alias, sure. But I'll wager that's very unlikely.
Oct 28 2004
parent "Glen Perkins" <please.dont email.com> writes:
"Walter" <newshound digitalmars.com> wrote in message 
news:clsfce$1dlm$1 digitaldaemon.com...

 True, Win32 process strings in UTF-16 and Linux in UTF-8. But I'll 
 argue
 that the string conversion costs are insignificant, because very 
 rarely does
 one write code that crosses from the app to the OS in a tight loop. 
 In fact,
 one actively tries to avoid doing that because crossing the process 
 boundary
 layer is expensive anyway.

 If profiling indicates that the conversion cost is significant, then 
 use an
 alias, sure. But I'll wager that's very unlikely.

Wait a minute. Aren't these pretty close to the same arguments I made for why the difference between the performance of a consistent default string class and a byte array wouldn't generally matter? "X is usually insignificant, and if it is ever significant use a profiler and do something non-default, but in general keep it simple...."

What is the performance difference between sending 1000 wchar[] strings into a filter library function that wants char[] strings, so it converts them all into char[] on the way in, finds the ones that qualify, and converts them back to wchar[] on the way out, versus sending a thousand default string objects, by reference of course, into a library written for default string objects for filtering, which returns the qualifiers as default string objects? (I'm actually asking. It's not rhetorical.)

Having a consistent, default string that's used (almost) everywhere and never suffers any conversion costs inside the app may have a big benefit in reducing complexity, with certainly no need for aliases anywhere, and may not even have any performance penalty over code that repeatedly gets converted back and forth within the app. And if it ever did perform more slowly, you use your profiler as you suggest and tweak it with a byte array.
Oct 29 2004
prev sibling parent reply =?ISO-8859-1?Q?Anders_F_Bj=F6rklund?= <afb algonet.se> writes:
James McComb wrote:

 But you need to use aliases for the following scenario:
 
 Suppose that:
   1. I want to write code for both Windows and Unix.
   2. I don't want to pay any string conversion costs at all.
 
 I assume the way to do this in D is:
   1. Use wchar[] on Windows and make UTF-16 API calls.
   2. Use char[] on Linux and make UTF-8 API calls.
   3. Use an alias to toggle between wchar[] and char[].
   4. Use a string library that defines all functions in both wchar[] and 
 char[] versions.
 
 If I just used char[], I would be forced to pay string conversion costs, 
 as Windows ultimately processes all strings in UTF-16.

Couldn't a new "tchar" alias be introduced for OS / platform strings? (mapping to either char or wchar) Similar to how pointer aliases work with both 32- and 64-bit pointers? (that is: size_t and ptrdiff_t)

It would be similar to using the macro (TCHAR *) in Windows C or C++ (with _tcs macro versions of all the functions like strlen and wcslen). With overloading and templates in D it is easier to maintain, though... (compared to the preprocessor tricks one has to resort to, back in C)

Or just use the standard type "char[]" and cast(), like Walter said? (which seems to be a little biased towards ASCII or UNIX, but anyway)

But using the same name (tchar) as Windows / Linux does would be good, if there indeed is such a platform-character alias eventually added...

--anders

PS. I think that it's only Windows NT (2K, XP) that uses Unicode, while Windows 95 (98, ME) uses ASCII... But I could be wrong?
Oct 29 2004
parent "Roald Ribe" <rr.no spam.teikom.no> writes:
"Anders F Björklund" <afb algonet.se> wrote in message
news:cltfq3$2hqn$1 digitaldaemon.com...

 PS. I think that it's only Windows NT (2K,XP) that uses Unicode,
      while Windows 95 (98,ME) uses ASCII... But I could be wrong?

95, 98 and ME can have the UTF-16 APIs installed (as redistributable DLLs). Both the old 8-bit char API and the UTF-16 API are (currently) available on all the currently supported Win32 platforms.

Roald
Nov 05 2004
prev sibling parent "Glen Perkins" <please.dont email.com> writes:
"Walter" <newshound digitalmars.com> wrote in message 
news:clr7kg$2mi2$1 digitaldaemon.com...

 [D library authors and others won't be tempted to create their own 
 string classes
 as so many did for C++ because D's core strings are so much better]

This may turn out to be true. If so, you are still left with multiple string types and no obvious default.

My concern is that the result will be a lot of unnecessary complexity, with all of its associated real costs, in exchange for little or no real benefit in many cases, and it won't even be avoidable by those who are aware of it if they use other people's code. And if it doesn't work out that way, the situation would be even worse, with even more string types and still no default.
 Using 'alias' doesn't create a new type. It just renames an existing 
 type.

You're talking about 'type' from the compiler's perspective, while I'm talking about it from the perspective of people--well, programmers are sort of like people--as in complexity, programmer productivity, porting, debugging, maintenance, etc. From that perspective, two things with different names have to be managed differently. Though the compiler may (sometimes) not object if you mix them, anyone who works with multiple string type names has more to keep track of and check on and worry about.

Just for grins, here is the sort of thing I've overheard coming from developers at first-rate software companies who ended up with multiple internal string types aliased by #defines:

"<cubicle #1 wonders out loud> Hmm, this FooLib wants to be passed a foochar, but we're passing it a regular char. Is that okay? Or will it fail with a non-ASCII character?
<cubicle #2 offers> Maybe you should test it with an accented e.
<cubicle #3> No, wait, isn't upper ASCII still one byte sometimes....?
<guy #1 again> Well I don't have a Japanese IME. Does anybody remember what a 'foochar' is on Linux?
<guy #3> It's UTF-16 on Windows. Sorry, don't know about Linux...."
 Hence, I don't see much of a collision problem between different 
 code bases
 that use aliases.

Whether or not such a problem exists for the compiler, I don't see how working with multiple string types, even if some of them differ in name only, would not be a complexity problem for *people*.
 I also just don't see the need to even bother using aliases.

That's pretty interesting, because if there really is no need for this feature, you could prevent some unnecessary complexity by eliminating the feature. With no default string type, though, people are essentially told to optimize their string type every time they create a string, which will probably create a demand for a feature like "alias" to create an abstract string type (name) above the implementation level.
 Just use
 char[]. I think the issue comes up repeatedly because people coming 
 from a
 C++ background are so used to char* being inadequate that it's hard 
 to get
 comfortable with the idea that char[] really does work <g>.

<g> Funny, I thought you would say "just use wchar[]". Each one seems about equally likely. ;-)

Again, you may be right about the existing byte array types being good enough to prevent the proliferation of string classes. Even if you are, my concern about the lack of an obvious default resulting in non-trivial complexity costs with no concomitant benefit remains.

But I suppose it's also possible that if you DID add a nice, lightweight string class suitable as a default almost everywhere, the addiction of so many C people to premature optimization could make it unpopular, rendering it less of a unifying default than just a fourth standard string type to add to the complexity.
Oct 29 2004
prev sibling next sibling parent reply Regan Heath <regan netwin.co.nz> writes:
Glen,

I think you make some very good points. In the past several people have 
argued for a single string type. Some may even have written one, I know 
it's on the cards.

In the past I have argued for implicit conversion between the existing 
string types, this would allow them to be used interchangably and 
converted 'on the fly' where required. This idea can have performance 
issues as it can cause a lot of excess conversions. My suggestion was in 
reaction to the impression that the 3 existing types were going to stay.

I think ideally having only one 'string' type would be best. The trick is 
making it efficient enough for those situations where that sort of thing 
matters, i.e. embedded software etc.

That said, a well designed class that could be told what encoding to use 
internally (if required) might be efficient enough for 99% of cases, and 
in the last 1% a ubyte[] should perhaps be used?

If that class were to come into existance, I don't see the need for 3 char 
types, instead ubyte[], ushort[] and uint[] would/could be used by the 
string class internally to represent the data stored.

It's interesting to hear your views on this, I hope your post draws some 
of the older NG members with opinions on this out of the woodwork, it's 
been quiet here the last month or so.

Regan

On Mon, 25 Oct 2004 15:07:30 -0700, Glen Perkins <please.dont email.com> 
wrote:
 I'd heard a bit about D, but this is the first time I've taken a bit of 
 time to look it over. I'm glad I did, because I love the design.

 I am wondering about something, though, and that's the apparent decision 
 to have three different standard string types, each with its encoding 
 exposed to the developer. I've had some experience designing text 
 models--I worked with Sun upgrading Java's string model from UCS-2 to 
 UTF-16 and for Macromedia upgrading the string types within Flash and 
 ColdFusion, for example--but every case has its unique constraints.

 I don't know enough about D to be sure of the issues and constraints in 
 this case, but I'm wondering if it wouldn't make sense to have a single 
 standard "String" class for the majority of text handling plus something 
 like char/wchar/dchar/ubyte arrays reserved for special cases.

 In both Java and Flash we kept having to throw away brainstorming ideas 
 because they implied changes to internal string implementation details 
 that had unnecessarily--in my opinion--been exposed to programmers. I've 
 become increasingly convinced that programmers don't need to know, much 
 less be forced to decide, how most of their text is encoded. They should 
 be thinking in terms of text semantically most of the time, without 
 concerning themselves with its byte representation.

 I see text handling as analogous to memory handling in the sense that I 
 think the time has come to have the platform handle the general cases 
 via automated internal mechanisms that are not exposed, while still 
 allowing programmer manual  intervention for occasional special cases.

 D already seems to have this memory model (very nice!), and it seems to 
 me that the corresponding text model would be a single standard "String" 
 class, whose internal encoding was the implementation's business, not 
 the programmer's. The String would have the ability to produce 
 explicitly encoded/formatted byte arrays for special cases, such as I/O, 
 where encoding mattered. I would also want the ability to bypass Strings 
 entirely on some occasions and use byte arrays directly. (By "byte 
 arrays" I mean something like D's existing char[], wchar[], etc.)

 Since the internal encoding of the standard String would not be exposed 
 to the programmer, it could be optimized differently on every platform. 
 I would probably implement my String class in UTF-16 on Windows and 
 UTF-8 on Linux to make interactions with the OS and neighboring 
 processes as lightweight as possible.

 Then I would probably provide standard function wrappers for common OS 
 calls such as getting directory listings, opening files, etc. These 
 wrapper functions would pass text in the form of Strings. Source code 
 that used only these functions would be portable across platforms, and 
 since String's implementation would be optimized for its platform, this 
 portable source code could produce nearly optimal object code on all 
 platforms.

 For calling OS functions directly, where you always need to have your 
 text in a specific format, you could just have your Strings create an 
 explicitly formatted byte sequence for you. A call to a Windows API 
 function might pass something like "my_string.toUTF16()". Since the 
 internal format would probably already be UTF-16, this "conversion" 
 could be optimized away by the compiler, but it would leave you the 
 freedom to change the underlying String implementation in the future 
 without breaking anybody's code.

 And, of course, you would still have the ability to use char[], wchar[], 
 dchar[], and even ubyte[] directly when needed for special cases.

 Having a single String to use for most text handling would make writing, 
 reading, porting, and maintaining code much easier. Having an underlying 
 encoding that isn't exposed would make it possible for implementers to 
 optimize the standard String for the platform, so that programmers who 
 used it would find code that was easier to write to begin with was also 
 more performant when ported. This has huge implications for the creation 
 of the rich libraries that make or break a language these days.

 And if for no other reason, it seems to me that a new language should 
 have a single, standard String class from the start just to avoid 
 degenerating into the tangled hairball of conflicting string types that 
 C++ text handling has become. Library creators and architects working in 
 languages that have had a single, standard String class from the start 
 doggedly use the standard String for everything. You could easily create 
 your own alternative string classes for languages like Java or C#, but 
 almost nobody does. As long as the standard String is good enough, it's 
 just not worth the trouble of having to juggle multiple string types. 
 All libraries and APIs in these languages use a single, consistent text 
 model, which is a big advantage these days over C++.

 Again, I realize that I may be overlooking any number of important 
 issues that would make this argument inapplicable or irrelevant in this 
 case, but I'm wondering if this would make sense for D.

Oct 25 2004
parent reply Regan Heath <regan netwin.co.nz> writes:
On Tue, 26 Oct 2004 12:47:44 +1300, Regan Heath <regan netwin.co.nz> wrote:
 I think you make some very good points. In the past several people have 
 argued for a single string type. Some may even have written one, I know 
 it's on the cards.

To clarify: I believe some people think one is required and will write one to attempt to prove that one is better. AFAIK Walter does not see the need for one and/or believes char, wchar and dchar to be better.
Oct 25 2004
parent reply A. Coward (not related to Noël) <A._member pathlink.com> writes:
I think Glen's thoughts are excellent.

As long as we use D for smallish programs, library development, and such, it may
seem obvious to continue using arrays to store sequences of characters (of the
size of our choice for the project at hand).

Our aim (at least I think) is to have D usurp C, C++, and to some extent C# and
Java. By that time D would be used in the Programming Industry. Once we are
there it may seem equally obvious that a programmer should not have to spend
time thinking about character sets or widths. A requisite for this is that there
is a string class/type that Everyone Uses. 

We don't have to abandon our current character arrays and library functions;
it just means that we really should create a default for the future. And this
IMHO should be done pretty much along the lines Glen suggested.

Newcomers to D (newbies as well as Old Pros) should be directed to use this new
string. This is what should be prominent and well described in the
documentation. And we should move the current text manipulation docs to the
hairier sections, right where OS-gurus, embedded programmers, performance pros,
and metal-benders go looking. Oh yes, and library developers, too.

The default should be that everyone uses the Default string, and that only
profiling should be used to decide whether some snippets should then be
programmed with arrays (or whatever), as a last resort.
Oct 26 2004
parent reply Kevin Bealer <Kevin_member pathlink.com> writes:
In article <cllf6q$24vg$1 digitaldaemon.com>, not related to Noël says...
The default should be that everyone uses the Default string, and that only
profiling should be used to decide whether some snippets should then be
programmed with arrays (or whatever), as a last resort.

I think there is some merit in this guideline, particularly for those new to
programming. But I'm coming around to the perspective that performance
problems are like bugs. If you don't pay attention to bugs during the design
phase, you will spend your whole career debugging programs. Likewise, if
performance is the last thing you think about, you will spend all of your
career profiling programs with poor performance, trying to overcome slow
designs with small optimizations.

If you want a "standard" string type, use "char". An XML parser needs to look
for "<" and ">" a lot, but how often do you -really- need to scan strings for
multibyte characters? Virtually all traditional tokenization and parsing
tasks can be done with 8 bit types, because they require searching for
delimiters that are themselves 8 bit chars. I've not seen "U+umlaut"
delimited fields ;)

My rule of thumb is to use the smallest type that I won't need to convert
inside the function, usually char. If the function needs to iterate over and
modify dchar elements, accept that type at the function interface.

Kevin
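Kevin's observation rests on a deliberate UTF-8 design property: every byte of a multi-byte sequence has its high bit set (0x80-0xFF), so ASCII delimiters like '<' and '>' can never occur inside an encoded non-ASCII character, and a byte-level scan is safe. A quick check (in Python, with a made-up XML-ish input):

```python
# Scan raw UTF-8 bytes for ASCII delimiters, as an 8-bit tokenizer would.
text = "<tag>héllo – wörld</tag>"
data = text.encode("utf-8")

assert data.index(b"<") == 0
assert data.index(b">") == 4   # '>' right after "tag", despite non-ASCII later

# Every byte of the multi-byte characters is >= 0x80, so none of them
# can collide with an ASCII delimiter byte.
assert all(b >= 0x80 for ch in "é–ö" for b in ch.encode("utf-8"))
```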
Oct 28 2004
parent reply "Lionello Lunesu" <lionello.lunesu crystalinter.remove.com> writes:
Just posting to let you know I also think "string" should be standardized. 
Be it char[] or whatever, but standardized.

Maybe in the future "string" could get some other members/operators that 
have no equivalent with int[]. (The fact that char[] is being treated as 
UTF8 when converting to wchar[] proves that it's not simply an int8[] array)

 Virtually all traditional tokenization and parsing tasks
 can be done with 8 bit types, because they require searching for 
 delimiters that
 are themselves 8 bit chars.  I've not seen "U+umlaut" delimited fields ;)

Indeed :-) Lio.
Nov 05 2004
parent reply =?ISO-8859-1?Q?Anders_F_Bj=F6rklund?= <afb algonet.se> writes:
Lionello Lunesu wrote:

 Maybe in the future "string" could get some other members/operators that 
 have no equivalent with int[]. (The fact that char[] is being treated as 
 UTF8 when converting to wchar[] proves that it's not simply an int8[] array)

The 8-bit integer type in D is "byte". D's "char" is *defined* as UTF-8.
This means that a "char" only holds an ASCII character. You need a wchar
to hold e.g. a Latin-1 character, and a full (32-bit) dchar to hold all
Unicode possibilities...

--anders
Nov 05 2004
parent reply "Lionello Lunesu" <lionello.lunesu crystalinter.remove.com> writes:
Yes, I've noticed that. I was referring to how the array is treated.

char[] array;
wchar[] warray = array;

This is doing some magic that has nothing to do with simply copying members, 
extending them as necessary. OK, I guess they're both arrays of UTF 
characters and the prefix only shows the memory representation, so it's 
still a member-by-member copy...

Can I do a similar assignment from byte[] to uint[] ? (I know I could simply 
test, but I've never written a D program). If not, then there is something 
special about char[] that might perhaps be more obvious if it was a built-in 
string type (the [] is confusing.)

Lio.

"Anders F Björklund" <afb algonet.se> wrote in message 
news:cmfjbd$22d5$1 digitaldaemon.com...
 Lionello Lunesu wrote:

 Maybe in the future "string" could get some other members/operators that 
 have no equivalent with int[]. (The fact that char[] is being treated as 
 UTF8 when converting to wchar[] proves that it's not simply an int8[] 
 array)

The 8-bit integer type in D is "byte". D's "char" is *defined* as UTF-8. This means that a "char" only holds an ASCII character. You need a wchar to hold e.g. a Latin-1 character, and a full (32-bit) dchar to hold all Unicode possibilities... --anders

Nov 05 2004
parent =?ISO-8859-1?Q?Anders_F_Bj=F6rklund?= <afb algonet.se> writes:
Lionello Lunesu wrote:

 Yes, I've noticed that. I was referring to how the array is treated.
 
 char[] array;
 wchar[] warray = array;

That D code just gives an error, when you actually try to compile it:

"cannot implicitly convert expression array of type char[] to wchar[]"

If you insert an explicit cast, the result is probably NOT what you want...
(you CAN cast string *constants*)
 This is doing some magic that has nothing to do with simply copying members, 
 extending them as necessary. OK, I guess they're both arrays of UTF 
 characters and the prefix only shows the memory representation, so it's 
 still a member-by-member copy...

The compiler needs some code for converting between the different UTF
arrays. Each code point (one dchar, UTF-32) corresponds to 1-4 chars
(UTF-8) or 1-2 wchars (UTF-16). It's not a simple memory copy, as you
can see in the std/utf.d code:

wchar[] toUTF16(char[] s);
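The 1-4 char / 1-2 wchar relationship per code point is easy to verify, as is the reason a plain memory copy cannot work. An illustration in Python (std.utf's toUTF16 is the real D routine; this only checks the arithmetic):

```python
# Code units needed per code point in UTF-8 vs UTF-16.
for cp, n8, n16 in [("A", 1, 1), ("é", 2, 1), ("中", 3, 1), ("𝄞", 4, 2)]:
    assert len(cp.encode("utf-8")) == n8
    assert len(cp.encode("utf-16-le")) // 2 == n16

# A byte-for-byte widening of UTF-8 data into 16-bit slots mangles 'é':
utf8 = "é".encode("utf-8")               # b'\xc3\xa9' — two code units
widened = "".join(chr(b) for b in utf8)  # naive per-byte widening
assert widened == "Ã©"                   # classic mojibake, not 'é'
```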
 Can I do a similar assignment from byte[] to uint[] ?

Nope: "cannot implicitly convert expression a of type byte[] to uint[]"

You would have to do something like:

byte[] a;
uint[] b;
foreach (byte c; a)
    b ~= c;

Again, a cast() just does a "memcpy"
 If not, then there is something special about char[] that might
 perhaps be more obvious if it was a built-in string type (the [] is
 confusing.)

Type char[] has a few "stringish" properties, and bit has some magic
"boolean" properties. This is somehow better than built-in types...
(and a frequent source of D discussions/wars)

We'll just have to live with the type aliases "string" and "bool",
as the types aren't changing?

alias char[] string;
alias bit bool;

--anders
Nov 05 2004
prev sibling next sibling parent reply =?ISO-8859-1?Q?Anders_F_Bj=F6rklund?= <afb algonet.se> writes:
Glen Perkins wrote:

 I don't know enough about D to be sure of the issues and constraints in 
 this case, but I'm wondering if it wouldn't make sense to have a single 
 standard "String" class for the majority of text handling plus something 
 like char/wchar/dchar/ubyte arrays reserved for special cases.

Since OOP is *optional* in D, it isn't given to have a *class*?
(a String class is still useful, but not as main implementation)

As for a "string" type alias, I think that's a very good idea...
digitalmars.D/11821
 And if for no other reason, it seems to me that a new language should 
 have a single, standard String class from the start just to avoid 
 degenerating into the tangled hairball of conflicting string types that 
 C++ text handling has become. Library creators and architects working in 
 languages that have had a single, standard String class from the start 
 doggedly use the standard String for everything. You could easily create 
 your own alternative string classes for languages like Java or C#, but 
 almost nobody does. As long as the standard String is good enough, it's 
 just not worth the trouble of having to juggle multiple string types. 
 All libraries and APIs in these languages use a single, consistent text 
 model, which is a big advantage these days over C++.

There is no "string" type, and there is no "bool" type in D.
This seems to have been done by design, as Walter's explained?

The recommended types to use are "char[]" for the usual strings
(even if wchar[] or even dchar[] is sometimes also useful to have)
and "bit" for booleans (even if char and int are sometimes used).

There isn't really a conflict, since all strings are Unicode
and all booleans follow the "zero is false, non-zero is true".
But it does expose the underlying storage and implementation...

It seems the best that can be done at this point are *aliases*?
(and improving upon the D library support in Phobos and Deimos)

--anders
Oct 26 2004
parent reply Regan Heath <regan netwin.co.nz> writes:
On Tue, 26 Oct 2004 13:34:23 +0200, Anders F Björklund <afb algonet.se> 
wrote:
 Glen Perkins wrote:

 I don't know enough about D to be sure of the issues and constraints in 
 this case, but I'm wondering if it wouldn't make sense to have a single 
 standard "String" class for the majority of text handling plus 
 something like char/wchar/dchar/ubyte arrays reserved for special cases.

Since OOP is *optional* in D, it isn't given to have a *class*?
(a String class is still useful, but not as main implementation)

In that case, perhaps not a 'class', but a struct as Ben suggested, or,
better yet, a built-in type like the current arrays, which we can extend
in the same way as we can extend arrays. I think that is important.
 As for a "string" type alias, I think that's a very good idea...
 digitalmars.D/11821

I don't like it:

1- I personally find 'utf_8' ugly and nasty to type.

2- The style guide mentions that 'meaningless type aliases should be
avoided'. I think aliasing 'char' to 'utf_8' is meaningless because a
char is a utf-8 type by definition.

3- I don't want 'more' character types, I want 'less'.
 And if for no other reason, it seems to me that a new language should 
 have a single, standard String class from the start just to avoid 
 degenerating into the tangled hairball of conflicting string types that 
 C++ text handling has become. Library creators and architects working 
 in languages that have had a single, standard String class from the 
 start doggedly use the standard String for everything. You could easily 
 create your own alternative string classes for languages like Java or 
 C#, but almost nobody does. As long as the standard String is good 
 enough, it's just not worth the trouble of having to juggle multiple 
 string types. All libraries and APIs in these languages use a single, 
 consistent text model, which is a big advantage these days over C++.

There is no "string" type, and there is no "bool" type in D. This seems to have been done by design, as Walter's explained ?

Yes and no. Walter has intentionally made the character types UTF ones,
IMO a good decision. However, it has created a problem where they are
not easily interchangeable, i.e. you have to call conversion functions
all the time because some people use one while others use another.

I suggested implicit conversion between them to solve that. Walter sort
of liked that idea, but has not done anything about it yet. A better
solution IMO would be a single 'string' type which can handle 'being'
in any encoding you need.
 The recommended types to use is "char[]" for the usual strings,
 (even if wchar[] or even dchar[] is sometimes also useful to have)
 and "bit" for booleans. (even if char and int are sometimes used)

 There isn't really a conflict, since all strings are Unicode
 and all booleans follow the "zero is false, non-zero is true".
 But it does expose the underlying storage and implementation...

All strings are _not_ Unicode; strings can be in any encoding you want.
D currently has 3 'string' types (char, wchar, dchar) which are all
Unicode. There is no difference in my mind between a char[] and a
ubyte[] array, except for the fact that the char[] array remembers that
its contents are supposed to be UTF-8 and verifies that on occasion.

So, a struct/class/whatever like:

struct string {
    StringType type;
    union {
        ubyte[]  bs;
        ushort[] ss;
        uint[]   ls;
    }
}

could replace char, wchar, and dchar. It could do implicit conversions
where required via 'cast' operators (do we have them yet?). It could
handle many more encodings than the 3 handled by char, wchar, and
dchar. If such a type existed, char, wchar, and dchar would become
obsolete; there would be no need for them at all.

The only weakness a struct has is that you cannot extend it as you can
the built-in arrays, eg.

void foo(char[] a, int b) {}

char[] bob;
bob.foo(1); <- calls the 'foo' function above passing 'bob' as 1st arg.

This is a really useful feature; it is why IMO we need a partially
built-in solution.
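The tagged-union idea above can be sketched concretely. Everything here is invented for illustration (TaggedString, convert, the encoding tag); it is not actual D or Phobos API, and Python stands in for the struct:

```python
class TaggedString:
    """Sketch of a string struct whose tag records the current encoding."""

    def __init__(self, text, encoding="utf-8"):
        self.encoding = encoding            # plays the role of StringType
        self.units = text.encode(encoding)  # plays the role of the union

    def convert(self, encoding):
        # The on-demand conversion described above: re-encode only when
        # the target encoding actually differs from the stored one.
        if encoding == self.encoding:
            return self
        return TaggedString(self.units.decode(self.encoding), encoding)

s = TaggedString("héllo")            # stored as UTF-8
w = s.convert("utf-16-le")           # converted only at this point
assert w.units.decode("utf-16-le") == "héllo"
```

The design choice this illustrates: code that never crosses an encoding boundary pays nothing, while conversions are centralized in one place instead of scattered through every library call.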
 It seems the best that can be done at this point are *aliases*?
 (and improving upon the D library support in Phobos and Deimos)

We can write a string struct/class/whatever and use that; if it becomes
as popular as I imagine it will, it will likely be adopted into Phobos.
Basically I'm saying, if we prove it's the right way to go, we just
might convince Walter.

Regan

-- Using M2, Opera's revolutionary e-mail client: http://www.opera.com/m2/
Oct 26 2004
parent reply =?ISO-8859-15?Q?Anders_F_Bj=F6rklund?= <afb algonet.se> writes:
Regan Heath wrote:

 I don't like it:
 
 1- I personally find 'utf_8' ugly and nasty to type.

Actually it was utf8_t, utf16_t, utf32_t - but point taken :-)
 2- The style guide mentions that 'meaningless type aliases should be 
 avoided' I think aliasing 'char' to 'utf_8' is meaningless because a 
 char is a utf-8 type by definition.
 
 3- I don't want 'more' character types, I want 'less'.

They were meant to 'complement' the standard int aliases in stdint.d:

int8_t, uint8_t, int16_t, uint16_t, int32_t, uint32_t, int64_t, uint64_t

They were not meant as "pretty", more like: self-explanatory
(explains what type it is: utf/int, and how many bits it is).

Didn't intend to change any built-in type names, like char/wchar/dchar
or byte/short/int/long. Just offer *one* "official" alias for each type.

What did you think about the "string" (char[]) and "ustring" (wchar[]) ?
 All strings are _not_ Unicode, strings can be in any encoding you want.
 D currently has 3 'string' types (char,wchar,dchar) which are all Unicode.

I meant the string types that interact with "quotes" and the ~ operator. You are right in that one *could* store strings in ubyte[] or void[]...
 If such a type existed char, wchar, and dchar would become obsolete, 
 there would be no need for them at all.

Unless you like type safety ? As in: chars and ints being different ?
They are of the same bit size as ubyte, ushort and uint - that's true.

 We can write a string struct/whatever and use that, if it becomes
 as popular as I imagine it will, it will likely be adopted into Phobos. 
 Basically I'm saying, if we proove it's the right way to go, we just 
 might convince Walter.

Currently Walter *has* picked the char[] type as the basic string type.
Deimos has, inspired by the ICU library, picked wchar[] as the basis...
(difference being that char[] is best for ASCII, wchar[] for Unicode)

Says http://oss.software.ibm.com/icu/userguide/icufaq.html:
 UTF-8 is 50% smaller than UTF-16 for US-ASCII, but UTF-8 is
 50% larger than UTF-16 for East and South Asian scripts.
 There is no memory difference for Latin extensions, [...]

I just thought "main(string[] args)" better than "main(char[][] args)" ?
(just as I think the "bool" alias to be better than the built-in "bit")

But I'm not sure I like a "magic" class with a hidden run-time cost...

--anders
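The ICU FAQ figures quoted above are straightforward to verify. A Python check of all three claims (ASCII, East Asian scripts, Latin extensions):

```python
# ASCII: UTF-16 is exactly twice the size of UTF-8 (UTF-8 "50% smaller").
ascii_text = "hello world"
assert len(ascii_text.encode("utf-16-le")) == 2 * len(ascii_text.encode("utf-8"))

# East Asian scripts: 3 bytes per char in UTF-8, 2 in UTF-16,
# so UTF-8 is 50% larger here.
cjk = "中文字符"
assert len(cjk.encode("utf-8")) == 12
assert len(cjk.encode("utf-16-le")) == 8

# Latin extensions (U+0100 and up): 2 bytes per char in both encodings.
latin_ext = "āēīō"
assert len(latin_ext.encode("utf-8")) == len(latin_ext.encode("utf-16-le")) == 8
```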
Oct 26 2004
parent reply "Glen Perkins" <please.dont email.com> writes:
"Anders F Björklund" <afb algonet.se> wrote in message 
news:clmimg$fvd$1 digitaldaemon.com...


 What did you think about the "string" (char[]) and "ustring" 
 (wchar[]) ?

I don't think you were asking me, but my concern applies to any "let a hundred flowers bloom" design approach for strings. If you have multiple string types with no dominant leader, plus an "alias" feature, plus strong support for OOP but no standard string class, you are almost begging for a crazy quilt landscape of diverse and incompatible string types. I'd be concerned that most large applications would end up dealing with more string types than they wanted with no significant performance gains to show for it.
 Currently Walter *has* picked the char[] type as the basic string 
 type.
 Deimos has, inspired by the ICU library, picked wchar[] as the 
 basis...
 (difference being that char[] is best for ASCII, wchar[] for 
 Unicode)

 Says http://oss.software.ibm.com/icu/userguide/icufaq.html:
 UTF-8 is 50% smaller than UTF-16 for US-ASCII, but UTF-8 is
 50% larger than UTF-16 for East and South Asian scripts.
 There is no memory difference for Latin extensions, [...]


There is so much room for "well, not necessarily" in all of these
statements, most programmers understand the issues so little, and it
usually matters so little, that it's a bit unfortunate to have a design
that *requires* programmers to repeatedly make this decision. Different
people, even smart ones, will choose differently, choices that may as
well be random for all the difference it usually makes.

Once again, I'm afraid that code will get more complicated than
necessary with no compensating payoff. And I couldn't avoid the
complexity by just choosing wisely myself, because every library author
would be free to make his own decisions, and you need a lot of
libraries to make a language useful. I could have unnecessary and
performance-sapping format conversions taking place at every library
call.
Oct 26 2004
parent reply =?ISO-8859-15?Q?Anders_F_Bj=F6rklund?= <afb algonet.se> writes:
Glen Perkins wrote:

 What did you think about the "string" (char[]) and "ustring" (wchar[]) ?

I don't think you were asking me, but my concern applies to any "let a hundred flowers bloom" design approach for strings. If you have multiple string types with no dominant leader, plus an "alias" feature, plus strong support for OOP but no standard string class, [...]

Walter has earlier ruled out a built-in "native" string type in D,
and a String class brings us back to the earlier "boxing" discussion.

Currently the D language treats strings as arrays of Unicode code units,
and one can still use char[] as ASCII strings, just like one could in C.

There are a lot of things discussed regarding Unicode and strings at:
http://www.prowiki.org/wiki4d/wiki.cgi?UnicodeIssues

A "transcoding" string type with a built-in hash code would have been
welcome, but it is *not* in the current D language specification...

I just wanted a reasonable alias while the theological debate rages on ?
(and the reason for submitting it was so that we could all use the same one)

--anders
Oct 27 2004
parent reply ac <ac_member pathlink.com> writes:
 Walter has earlier ruled out a built-in "native" string type in D,
 and a String class brings us back to the earlier "boxing" discussion.


a) Built-in or library? (Standard library or 3rd party?)
b) 0, 1, 3 or 3+1 "approved" string kinds?
c) Unicode (which?), native (which?), other?

These 3 questions are orthogonal to each other.

To (a) I have no strong opinion. Maybe just building facilities in the
language itself that are geared towards making it easy to implement an
efficient string library would be adequate?

I have no problem with 3+1 in (b). Why not let the 3 existing strings
live on. But I would really like to have an additional string which
would be advertised as what you should use.

(c) I leave to smarter people.

If we don't have exactly _one_ type that everyone _should_ use, then
programmers in, say, the Mid West would all use an 8-bit kind. People
of, say, Chinese origin, probably would use a 32-bit type -- even if
they were coding in the US. And even if they would be working on a
project that is to manipulate ASCII strings, because they'd expect the
application to sooner or later get exposed to non-USASCII characters
anyway. Actually, rednecks would be happy with 7 bits.

What if all these guys happen to work for the same global company?

<joke-mode>
I can hear a crowd all over the D-community shouting to their screens:
"Well, that company would have their global coding policy on strings.
NO problem."

Right. But what when (not if) that company gets merged into another?
Would they have happened to choose the very same string coding policy?
Maybe they began with different operating systems, maybe the other one
originally came from another continent? I don't even want to guess what
the crowd says to this.
</joke-mode>

This ought to be a no-brainer!
Oct 27 2004
parent =?ISO-8859-1?Q?Anders_F_Bj=F6rklund?= <afb algonet.se> writes:
ac wrote:

Walter has earlier ruled out a built-in "native" string type in D,
and a String class brings us back to the earlier "boxing" discussion.


a) Built-in or library? (Standard library or 3rd party?)

There is no built-in D type, and does not look like a standard class. (as in: there will probably be no Integer, Character, String classes?)
 b) 0, 1, 3 or 3+1 "approved" string kinds?

There are *two* approved string types: char[] and wchar[] (there is also a dchar[] type, but hardly any use for it?)
 c) Unicode (which?), native (which?), other?

"Unicode is the future", so there is no Latin-1 support... (I assume you meant something like ISO-8859-1 by "native"?)
 These 3 questions are orthogonal to each other. 

I thought they were a bit strange, but I tried anyway ?
 If we don't have exactly _one_ type that everyone _should_ use, then
programmers
 in, say, the Mid West would all use an 8-bit kind. People from, say, Chinese
 origin, probably would use a 32-bit type -- even if they were coding in the US.
 And even if they would be working on a project that is to manipulate ASCII
 strings, because they'd expect the application to sooner or later get exposed
to
 non-USASCII characters anyway.

Western people that earlier had Latin-1 tend to use "char[]"; the only
trick is to dimension as [length * 2], since some characters occupy two
bytes when encoded. To be i18n-savvy, they should use [length * 4],
which allows for all of Unicode. "char" is only useful for ASCII
characters, as one has to use at least wchar to fit a Latin-1
character, for instance.

Other people tend to use "wchar[]", which is also the string (and
character) encoding that Java chose. Nowadays one has to be prepared to
handle "surrogates", since Unicode does not fit in 16 bits anymore -
but spilled over to 21 bits... "wchar" *usually* works for Unicode
characters, but to be able to handle all characters, dchar must be
used.

Nobody in their right mind uses "dchar[]" to store strings, but the
"dchar" type is useful for storing one code point.

A big disadvantage of UTF-16 (over UTF-8) is that it is
platform-dependent, and that it is not ASCII-compatible. At least not
with C and UNIX, since it will have a "BOM" and since every other byte
in an ASCII string will be NUL.
(more details at http://www.unicode.org/faq/utf_bom.html)

And imagine that all I wanted was two simpler aliases. :-)
(thought "string" and "ustring" were easier to "pronounce" than
"char[]" and "wchar[]", and that was about it really. Just a simple:

alias char[]  string;
alias wchar[] ustring;)

Not any new types or classes or other magic incantations...

--anders
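The surrogate and BOM points above can be checked directly. A Python illustration using U+1D11E (musical G clef), a character outside the 16-bit range:

```python
# A code point beyond U+FFFF needs two 16-bit units: a surrogate pair.
clef = "\U0001D11E"
units = clef.encode("utf-16-le")
assert len(units) == 4                    # two 16-bit code units
hi = int.from_bytes(units[:2], "little")
lo = int.from_bytes(units[2:], "little")
assert 0xD800 <= hi <= 0xDBFF             # high surrogate
assert 0xDC00 <= lo <= 0xDFFF             # low surrogate

# "every other byte in an ASCII string will be NUL" under UTF-16,
# which is what breaks C-style NUL-terminated string handling:
assert b"\x00" in "abc".encode("utf-16-le")

# And the byte-order-dependent codec prepends a BOM:
assert "abc".encode("utf-16")[:2] in (b"\xff\xfe", b"\xfe\xff")
```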
Oct 27 2004
prev sibling parent J C Calvarese <jcc7 cox.net> writes:
Glen Perkins wrote:
 I'd heard a bit about D, but this is the first time I've taken a bit of 
 time to look it over. I'm glad I did, because I love the design.
 
 I am wondering about something, though, and that's the apparent decision 
 to have three different standard string types, each with its encoding 
 exposed to the developer. I've had some experience designing text 
 models--I worked with Sun upgrading Java's string model from UCS-2 to 
 UTF-16 and for Macromedia upgrading the string types within Flash and 
 ColdFusion, for example--but every case has its unique constraints.
 
 I don't know enough about D to be sure of the issues and constraints in 
 this case, but I'm wondering if it wouldn't make sense to have a single 
 standard "String" class for the majority of text handling plus something 
 like char/wchar/dchar/ubyte arrays reserved for special cases.

(I've read some of the posts in this thread. Sorry if I'm repeating
what someone else has already written.)

It seems to me that D would support a string class such as the one you
seem to be proposing. Since Walter is busy getting the bugs out of the
compiler, he's not likely to write an official string class anytime
soon. But someone else could write it. And if that string class was
good and lots of people liked it, I'd be surprised if Walter didn't add
it to the standard library, Phobos.

If you're not up to writing it yourself, maybe you could persuade
someone else to do the work by proposing a design.

-- 
Justin (a/k/a jcc7)
http://jcc_7.tripod.com/d/
Oct 30 2004