
digitalmars.D - String implementations

reply bearophile <bearophileHUGS lycos.com> writes:
Defining how an ASCII string is best managed by a language is already complex
(ropes or not? Mutable or not? With shared parts or not? Etc.), but today ASCII
isn't enough, and once you add Unicode to the mix, string management becomes a
hairy topic. This may be interesting for D developers:

http://weblogs.mozillazine.org/roc/archives/2008/01/string_theory.html

Something curious: sometimes I need mutable strings, but I cope with the
immutable ones when necessary. This author says that even stringAt isn't very
useful! :-)

Bye,
bearophile
Jan 15 2008
next sibling parent Robert Fraser <fraserofthenight gmail.com> writes:
bearophile wrote:
 Defining how an ASCII string is best managed by a language is already complex
(ropes or not? Mutable or not? With shared parts or not? Etc.), but today ASCII
isn't enough, and once you add Unicode to the mix, string management becomes a
hairy topic. This may be interesting for D developers:
 
 http://weblogs.mozillazine.org/roc/archives/2008/01/string_theory.html
 
 Something curious: sometimes I need mutable strings, but I cope with the
immutable ones when necessary. This author says that even stringAt isn't very
useful! :-)
 
 Bye,
 bearophile
I agree with pretty much everything in that article, especially the part about charAt not being very useful (occasionally I iterate, but I rarely need random access, and slicing is a good option for strings whose format I know and that don't need regexes). D's "string" (that is, invariant(char)[]) is a good compromise, although I'd also like to have a String class that I could use for *some* strings: one that can be implicitly used as (but not converted to) a char[] but has interning & hash code caching. But this is impossible within D's type system as it stands today.
Jan 15 2008
prev sibling next sibling parent reply Jarrod <qwerty ytre.wq> writes:
On Tue, 15 Jan 2008 21:23:31 -0500, bearophile wrote:

 Defining how an ASCII string is best managed by a language is already
 complex (ropes or not? Mutable or not? With shared parts or not? Etc.),
 but today ASCII isn't enough, and once you add Unicode to the mix,
 string management becomes a hairy topic. This may be interesting for D
 developers:
 
 http://weblogs.mozillazine.org/roc/archives/2008/01/string_theory.html
 
 Something curious: sometimes I need mutable strings, but I cope with the
 immutable ones when necessary. This author says that even stringAt isn't
 very useful! :-)
 
 Bye,
 bearophile
This article is pretty much correct. While the topic is at hand, I guess I could rant a little:

Why does D practically *require* the coder to use different forms of UTF encoding? D can tell when a code point spans multiple code units, as evidenced by converting a utf-8 string to utf-32 (D knows where to split the sequences apart), yet we can't index char[] arrays by code point. Instead, D indexes char arrays by fixed-length bytes, which is almost nonsensical since the D spec asserts that char[] arrays are designed specifically for Unicode characters, and that other single-byte arrays should instead be made as a byte[].

So if this is the case, then why can't the language itself manage multi-byte characters for us? It would make things a hell of a lot easier and more efficient than having to convert /potentially/ foreign strings to utf-32 for a simple manipulation operation, then converting them back. The only reason I can think of for char arrays being treated as fixed length is for faster indexing, which is hardly useful in most cases since a lot of the time we don't even know if we're dealing with multi-byte characters when handling strings, so we have to convert and traverse the strings anyway. Arg.

I know this would probably be a pain to implement, but it would really give D a huge leg-up if it could properly and automatically handle strings for us, without requiring a bloated string class.
Jan 16 2008
next sibling parent reply "Steven Schveighoffer" <schveiguy yahoo.com> writes:
"Jarrod" wrote
 On Tue, 15 Jan 2008 21:23:31 -0500, bearophile wrote:

 Defining how an ASCII string is best managed by a language is already
 complex (ropes or not? Mutable or not? With shared parts or not? Etc.),
 but today ASCII isn't enough, and once you add Unicode to the mix,
 string management becomes a hairy topic. This may be interesting for D
 developers:

 http://weblogs.mozillazine.org/roc/archives/2008/01/string_theory.html

 Something curious: sometimes I need mutable strings, but I cope with the
 immutable ones when necessary. This author says that even stringAt isn't
 very useful! :-)

 Bye,
 bearophile
This article is pretty much correct. While the topic is at hand, I guess I could rant a little:

Why does D practically *require* the coder to use different forms of UTF encoding? D can tell when a code point spans multiple code units, as evidenced by converting a utf-8 string to utf-32 (D knows where to split the sequences apart), yet we can't index char[] arrays by code point. Instead, D indexes char arrays by fixed-length bytes, which is almost nonsensical since the D spec asserts that char[] arrays are designed specifically for Unicode characters, and that other single-byte arrays should instead be made as a byte[].

So if this is the case, then why can't the language itself manage multi-byte characters for us? It would make things a hell of a lot easier and more efficient than having to convert /potentially/ foreign strings to utf-32 for a simple manipulation operation, then converting them back. The only reason I can think of for char arrays being treated as fixed length is for faster indexing, which is hardly useful in most cases since a lot of the time we don't even know if we're dealing with multi-byte characters when handling strings, so we have to convert and traverse the strings anyway.
The algorithmic penalties would be O(n) for an indexed lookup instead of O(1). I think the way it is now is the best of all worlds. I think the correct method in this case is to convert to utf32 first, then index. Then at least you only take the O(n) penalty once. Or why not just use dchar[] instead of char[] to begin with? -Steve
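(A minimal sketch of that convert-once-then-index approach, using std.utf's toUTF32; the example string and variable names are just for illustration.)

    import std.utf;

    char[] s = "Δx Δy Δz".dup;
    dchar[] ds = toUTF32(s);      // one O(n) decode up front
    dchar third = ds[2];          // from here on, character indexing is O(1): ' '
    dchar fifth = ds[4];          // 'y'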
Jan 16 2008
parent reply Jarrod <qwerty ytre.wq> writes:
On Wed, 16 Jan 2008 10:27:53 -0500, Steven Schveighoffer wrote:
 
 The algorithmic penalties would be O(n) for an indexed lookup
 instead of O(1).
I understand this, but the compiler could probably optimize this for most situations. Most string access would be sequential and thus positions could be cached on access when need be, and string literals that aren't modified and have all single-byte chars could be optimized into normal indexing. Furthermore, modern processors are incredibly good at sequential iteration and I know from personal experience that they can parse over massive chunks of memory in mere milliseconds (hashing entire executables in memory for potential changes is a common example of this). It shouldn't be noticeable at all to scan over a string. I do believe the author of the article that bearophile linked agrees with me in this regard, in his mention of charAt implementation.
 I think the correct method in this case is to convert to utf32 first,
 then index.  Then at least you only take the O(n) penalty once.  
Well, converting to dchar[] means a full iteration over the entire string to split up the units. Then the program has to allocate space, copy chars over, and add padding. Is it really all that much more efficient? And why should the programmer have to worry about the conversion anyway? Good languages avoid cognitive load on the programmers.
 Or why not just use dchar[] instead of char[] to begin with?
Yes, you could just use dchar[] all the time, but how many people do that? It's very space-inefficient which is the whole reason utf-8 exists. If dchar[]s were meant to be used more often Walter probably would have made them the default string type. Eh, I guess this is just one of those annoying little 'loose ends' I see when I look at D.
Jan 16 2008
parent Dan <murpsoft hotmail.com> writes:
Jarrod Wrote:

 On Wed, 16 Jan 2008 10:27:53 -0500, Steven Schveighoffer wrote:
 
 The algorithmic penalties would be O(n) for an indexed lookup
 instead of O(1).
I understand this, but the compiler could probably optimize this for most situations. Most string access would be sequential and thus positions could be cached on access when need be, and string literals that aren't modified and have all single-byte chars could be optimized into normal indexing. Furthermore, modern processors are incredibly good at sequential iteration and I know from personal experience that they can parse over massive chunks of memory in mere milliseconds (hashing entire executables in memory for potential changes is a common example of this). It shouldn't be noticeable at all to scan over a string. I do believe the author of the article that bearophile linked agrees with me in this regard, in his mention of charAt implementation.
 I think the correct method in this case is to convert to utf32 first,
 then index.  Then at least you only take the O(n) penalty once.  
Well, converting to dchar[] means a full iteration over the entire string to split up the units. Then the program has to allocate space, copy chars over, and add padding. Is it really all that much more efficient? And why should the programmer have to worry about the conversion anyway? Good languages avoid cognitive load on the programmers.
 Or why not just use dchar[] instead of char[] to begin with?
Yes, you could just use dchar[] all the time, but how many people do that? It's very space-inefficient which is the whole reason utf-8 exists. If dchar[]s were meant to be used more often Walter probably would have made them the default string type. Eh, I guess this is just one of those annoying little 'loose ends' I see when I look at D.
Certainly is a whole lot better than the loose ends in other languages; at least we're in UTF and not ASCII (or undefined language). I personally prefer UTF-8. I can write any UTF character in UTF8 if I accept that odd case of a UTF-32 character will be stored as \uXXXX. To be honest, that's acceptable; and gives me the memory savings and O(1) as long as I've got the foresight to predict where the \u's are. I love D's handling of strings, in fact it is my *favorite* feature in D.
Jan 17 2008
prev sibling next sibling parent reply Walter Bright <newshound1 digitalmars.com> writes:
Jarrod wrote:
 While the topic is at hand, I guess I could rant a little;
 Why does D practically *require* the coder to use different forms of UTF 
 encoding?
Because I've worked with internationalized code in C/C++ where the encoding isn't specified, and it's very bad.
 D can tell when a code point spans multiple code units, as evidenced by
 converting a utf-8 string to utf-32 (D knows where to split the sequences
 apart), yet we can't index char[] arrays by code point. Instead, D
 indexes char arrays by fixed-length bytes, which is almost nonsensical
It is impractical (i.e. very inefficient) to index arrays otherwise, especially in getting array lengths, doing slicing, etc. In fact, it is rather rare to need to index by code point. The times you might want to do it are easily handled by foreach (dchar c; str).
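(For the record, a minimal sketch of what that foreach looks like; the string and output format are just for illustration.)

    import std.stdio;

    void main()
    {
        string s = "aΔb";                      // 4 code units, 3 code points
        foreach (dchar c; s)                   // the compiler decodes the UTF-8 on the fly
            writefln("U+%04X", cast(uint) c);  // prints each code point once
    }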
 since the D spec asserts that char[] arrays are designed specifically for 
 unicode characters, and that other single byte arrays should instead be 
 made as a byte[].
 So if this is the case, then why can't the language itself manage multi-
 byte characters for us?
It does, see foreach. In general, I don't think it's a good idea for the language to try to completely hide the multibyte nature of UTF. For example, when you're allocating and copying strings around, you need the byte length, not the number of code points.
 It would make things a hell of a lot easier and 
 more efficient than having to convert /potentially/ foreign strings to 
 utf-32 for a simple manipulation operation, then converting them back.
 The only reason I can think of for char arrays being treated as fixed 
 length is for faster indexing, which is hardly useful in most cases since 
 a lot of the time we don't even know if we're dealing with multi-byte 
 characters when handling strings, so we have to convert and traverse the 
 strings anyway.
 Arg.
I was surprised to discover that most indexing work in strings, such as searching, works more efficiently by *not* trying to index by code points. There are standard library functions in std.utf to index by code points, if you do need it.
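(A rough sketch of that std.utf route; I'm quoting toUTFindex and decode from memory, so treat the exact signatures as approximate.)

    import std.utf;

    char[] s = "caféx".dup;
    size_t byteIdx = toUTFindex(s, 4);   // code unit offset of the 5th code point ('x')
    size_t i = byteIdx;
    dchar c = decode(s, i);              // decodes 'x' and advances i past its code units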
 I know this would probably be a pain to implement, but it would really give
 D a huge leg-up if it could properly and automatically handle strings for
 us, without requiring a bloated string class.
I believe D already has found the right approach to handling UTF strings. All I can say is try it out for a while.
Jan 17 2008
parent reply Jarrod <qwerty ytre.wq> writes:
On Thu, 17 Jan 2008 13:40:12 -0800, Walter Bright wrote:

 Because I've worked with internationalized code in C/C++ where the
 encoding isn't specified, and it's very bad.
I was referring more to the required switching to and from different UTF types just to change a few characters around; I wasn't really referring to letting the programmer decide what kind of string encoding to use.
 It is impractical (i.e. very inefficient) to index arrays otherwise,
 especially in getting array lengths, doing slicing, etc. In fact, it is
 rather rare to index by code units. The times you might want to do it
 are easily handled by foreach(dchar c, string).
Well, yes, I'm sure there's a performance hit for changing how it is indexed, but at the same time who would honestly prefer to index by code unit without first finding the code point boundaries? You're practically stabbing in the dark if you try to slice a char[] array without first iterating over it with foreach to find those boundaries.
 It does, see foreach. In general, I don't think it's a good idea for the
 language to try to completely hide the multibyte nature of UTF. For
 example, when you're allocating and copying strings around, you need the
 byte length, not the number of code points.
string str = "etc";
int strlen = str.length;
int arrsize = str.sizeof;

Seems pretty simple to me. And you don't have to completely hide the multibyte nature. Casting to a byte[] would allow full access to each byte, which might sound hackish, but at the same time manipulating individual code points in a string sounds like you're more than likely doing something just as hackish.
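(For what it's worth, a minimal sketch of both counts side by side - the byte length versus the code point count - using the foreach decoding mentioned earlier; the string is arbitrary.)

    string str = "caféΔ";
    size_t codeUnits = str.length;    // 7: length counts UTF-8 code units (bytes)
    size_t codePoints;
    foreach (dchar c; str)            // decodes one code point at a time
        codePoints++;                 // 5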
 I was surprised to discover that most indexing work in strings, such as
 searching, work more efficiently by *not* trying to index by code
 points. There are standard library functions in std.utf to index by code
 points, if you do need it.
Efficiency at the cost of the programmer. :( Perhaps you could design methods to access a string by either code unit or code point if you see the need to keep index-by-byte behaviour. Something like a toggle method would suit me just fine: str.indexByByte(true);
 I believe D already has found the right approach to handling UTF
 strings. All I can say is try it out for a while.
I am, and it's making working with user-editable config files an annoyance that perl avoids very easily.
Jan 19 2008
parent reply "Janice Caron" <caron800 googlemail.com> writes:
On 1/20/08, Jarrod <qwerty ytre.wq> wrote:
 I am, and it's making working with user-editable config files an
 annoyance that perl avoids very easily.
Could you possibly explain that, for the benefit of those of us who don't speak perl? My limited understanding is that perl was invented before Unicode, and probably even before the wheel, so either it deals with Unicode by not dealing with it at all, or else it's a recent addition to the language (or else I've got it completely wrong - like I said, I don't speak perl). Also, isn't perl an interpreted language? You can get away with a lot more in an interpreted language, but you pay the price in speed. Moreover, working with user-editable config files - I would have thought that a job for a text editor, not a programming language. I'm confused.
Jan 19 2008
parent reply Jarrod <qwerty ytre.wq> writes:
On Sun, 20 Jan 2008 06:41:29 +0000, Janice Caron wrote:

 On 1/20/08, Jarrod <qwerty ytre.wq> wrote:
 I am, and it's making working with user-editable config files an
 annoyance that perl avoids very easily.
Could you possibly explain that, for the benefit of those of us who don't speak perl? My limited understanding is that perl was invented before Unicode, and probably even before the wheel, so either it deals with Unicode by not dealing with it at all, or else it's a recent addition to the language (or else I've got it completely wrong - like I said, I don't speak perl).
Perl is still being constantly updated although it is indeed quite old. And it works quite well with unicode as you would expect from a language that prides itself on text manipulation.
 Also, isn't perl an interpreted language? You can get away with a lot
 more in an interpreted language, but you pay the price in speed.
Yes, it's interpreted, and that does cost it a fair amount of speed, but I see it as a worthwhile trade-off for what it can do with strings.
 Moreover, working with user-editable config files - I would have thought
 that a job for a text editor, not a programming language. I'm confused.
Indeed, you are a tad confused. I'm allowing the user to edit config files so that my GUI application can read it in on startup and use it to populate a dialog display as well as fill out numerous options involving how it deals with a web interface. Because I don't know what the user is going to input, I have to do a fair amount of converting. Yes, this is indeed the main motivation behind this entire rant.
Jan 19 2008
parent reply "Janice Caron" <caron800 googlemail.com> writes:
On 1/20/08, Jarrod <qwerty ytre.wq> wrote:
 Moreover, working with user-editable config files - I would have thought
 that a job for a text editor, not a programming language. I'm confused.
Indeed, you are a tad confused.
Yep. I said so! :-)
 I'm allowing the user to edit config
 files
How? With a GUI interface? With a program written in D? With their favorite text editor of choice? If the latter, then you cannot be sure of the encoding, and that's hardly D's fault!
 so that my GUI application can read it in on startup and use it to
 populate a dialog display as well as fill out numerous options involving
 how it deals with a web interface. Because I don't know what the user is
 going to input I have to do a fair amount of converting.
Right, but converting from one encoding to another is the job of specialised classes. Detecting whether a text file is in ISO-8859-1, or Windows-1252, or MAC-ROMAN, or whatever, is not a trivial task. If your application were going to do that, you'd have to provide the implementation. (Or possibly Tango or some other third party library already provides such converters - I don't know). In any case, it's not a common enough task to warrant built-in language support. But I still don't see what this has got to do with whether or not a[n] should identify the (n+1)th character rather than the (n+1)th code unit.
 Yes, this is indeed the main motivation behind this entire rant.
Cool. So what is the real world use case that necessitates that sequences of UTF-8 code units must be addressable by character index as the default?
Jan 20 2008
parent reply Jarrod <qwerty ytre.wq> writes:
On Sun, 20 Jan 2008 08:04:01 +0000, Janice Caron wrote:

 I'm allowing the user to edit config
 files
How? With a GUI interface? With a program written in D? With their favorite text editor of choice? If the latter, then you cannot be sure of the encoding, and that's hardly D's fault!
It is the latter.
 Right, but converting from one encoding to another is the job of
 specialised classes. Detecting whether a text file is in ISO-8859-1, or
 Windows-1252, or MAC-ROMAN, or whatever, is not a trivial task. If your
 application were going to do that, you'd have to provide the
 implementation. (Or possibly Tango or some other third party library
 already provides such converters - I don't know). In any case, it's not
 a common enough task to warrant built-in language support.
 
 But I still don't see what this has got to do with whether or not a[n]
 should identify the (n+1)th character rather than the (n+1)th code unit.
Because this issue isn't really to do with the input file itself, it's to do with the potential input characters given in the file. As far as I can tell (I'm using a C library to parse the input) it should be ascii or UTF-8 encoding. Anything else would probably cause the C lexer to screw up.
 Cool. So what is the real world use case that necessitates that
 sequences of UTF-8 code units must be addressable by character index as
 the default?
The most important one right now is splicing. I'm allowing both user-defined and program-defined macros in the input data. They can be anywhere within a string, so I need to splice them out and replace them with their correct counterparts. I hear the std lib provided with D is unreliable, so I'm unwilling to use it. Plus, even if it is fixed up, I'd hate to limit string manipulation to regular expressions.

I also wish to cut off input at a certain letter count for spacing issues in both the GUI and dealing with the webscript. I'll have to be converting certain characters to their URI equivalent too, and that will probably take more splicing as well. The other thing I'm using is single-letter replacement. Simple stuff like capitalising letters and replacing spaces with underscores.

I can think of a lot of other situations that would benefit from proper multibyte support too; practically any application that takes frequent user input could benefit. A text editor is a very good example. Any coders who don't natively deal with Latin text would probably benefit greatly too ( think of the poor Japanese coders :< ). I've seen a lot of programs that print a specified number of characters before wrapping around or trailing off, too. The humble gnome console is a good example of that. Very handy to have character indexing in this case. String tokenizing and plain old character counting are two operations I can think of that could probably be done more easily too.

In the end I think I'm just tired of having to jump through hoops when it comes to string manipulation. I want to be able to say 'this is a character, I don't care what it is. Store it, change it, splice it, print it.' But instead it seems that if I don't care what the character type is, it might not fit. Then I have to allocate then store it, find and change it, locate then splice it, convert then print it. Small annoyances build up over time, and I'm pretty sure I'm not insured for blood vessels bursting in my eye.

I live in the hope that one day I'll see something magical happen, and I'll be able to type char chr = 'Δ'; and chr will be a proper utf-8 character that I can print, insert into an array, and change. What a beautiful day that will be.

Welp, I think I'm done ranting for now. Back to screwing around with strings. Or more accurately, procrastinating about screwing around with strings.
Jan 20 2008
next sibling parent reply "Janice Caron" <caron800 googlemail.com> writes:
On 1/20/08, Jarrod <qwerty ytre.wq> wrote:
 But I still don't see what this has got to do with whether or not a[n]
 should identify the (n+1)th character rather than the (n+1)th code unit.
Because this issue isn't really to do with the input file itself, it's to do with the potential input characters given in the file.
You mean the plain text config file of unknown encoding?
 As far as I can
 tell (I'm using a C library to parse the input) it should be ascii or
 UTF-8 encoding.
 Anything else would probably cause the C lexer to screw up.
If it's an unknown encoding, you store it in a ubyte array. Then you identify the encoding, convert it to UTF-8 and store the result in a char array.
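(A rough sketch of that flow - read raw bytes, accept them as UTF-8 if they validate, otherwise fall back to something else. The filename is a placeholder, the Latin-1 fallback is just an example, and the exact exception type thrown by validate is from memory, so the catch is deliberately broad.)

    import std.file;
    import std.utf;

    ubyte[] raw = cast(ubyte[]) std.file.read("config.txt");   // encoding unknown so far

    char[] text;
    try
    {
        validate(cast(char[]) raw);      // throws if raw is not well-formed UTF-8
        text = cast(char[]) raw;
    }
    catch (Exception e)                  // fall back: assume ISO-8859-1
    {
        foreach (ubyte b; raw)
            text ~= cast(dchar) b;       // appending a dchar re-encodes it as UTF-8
    }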
 Cool. So what is the real world use case that necessitates that
 sequences of UTF-8 code units must be addressable by character index as
 the default?
The most important one right now is splicing. I'm allowing both user- defined and program-defined macros in the input data. They can be anywhere within a string, so I need to splice them out and replace them with their correct counterparts.
That works right now with ordinary char arrays. Just use find(), rfind(), etc., and slicing.
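(A minimal sketch of that find-and-slice splicing, using std.string.find; the macro name and replacement text are placeholders.)

    import std.string;

    char[] text = "Hello ${NAME}, welcome back.".dup;
    char[] macroName = "${NAME}".dup;

    int i = find(text, macroName);       // byte index of the match, or -1
    if (i != -1)
        text = text[0 .. i] ~ "Jarrod" ~ text[i + macroName.length .. $];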
 I hear the std lib provided with D is
 unreliable
Huh? Please elucidate.
 so I'm unwilling to use it.
That's your loss, but you can hardly expect Walter to consider adding new language features just because you are unwilling to use Phobos.
 I also wish to cut off input at a certain letter count for spacing issues
 in both the GUI and dealing with the webscript.
Well, I hate to spoil things, but even /characters/ are not sufficient to help you figure out spacing issues. For that, you need to be working at the level of /glyphs/.

For example, consider the word "café". (Just in case that didn't render properly, that's c, a, f, followed by e-with-an-acute-accent.) You can write this as either caf\u00E9, which consists of five UTF-8 code units, or four characters, or four glyphs; or you can write it as cafe\u0301, which consists of six UTF-8 code units, or five characters, or four glyphs. In the first case, the e-acute glyph is represented as a single character; in the second case, it is represented as an e character followed by a combining-acute character.

In other words, even indexing by character is not sufficient to achieve your goals. You need to index by glyph. At some point, you have to say to yourself: wait a minute - this would be better implemented in a library than in the primitive types of the language.
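(To make those counts concrete, a minimal D illustration; the counts in the comments are simply the .length values of the two literals.)

    string precomposed = "caf\u00E9";   // é as one code point:   5 code units, 4 code points
    string decomposed  = "cafe\u0301";  // e + combining acute:   6 code units, 5 code points
    // Both render as the same four glyphs, but .length and foreach (dchar) see them differently.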
 The other thing I'm using is single-letter replacement. Simple stuff like
 capitalising letters and replacing spaces with underscores.
I guess what you're getting at here is that uppercasing a character might result in a UTF-8 string longer than that of the original character. And so it might. On the other hand, if you use a foreach loop to do this sort of thing, your problems are solved.
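(A minimal sketch of that foreach approach - building a new string so that a replacement of a different encoded length can't corrupt anything; the space-to-underscore rule and the function name are just examples.)

    char[] replaceSpaces(char[] input)
    {
        char[] result;
        foreach (dchar c; input)     // decodes one code point at a time
        {
            if (c == ' ')
                c = '_';             // single-character replacement
            result ~= c;             // appending a dchar re-encodes it as UTF-8
        }
        return result;
    }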
 I can think of a lot of other situations that would benefit from proper
 multibyte support too,
UTF-8 support /is/ proper multibyte support. That's why D has it built in.
 ( think of the poor Japanese coders :< ).
Which is why D uses Unicode. Again, I say, D got it right.
 I've seen a lot of programs that print a specified number of characters
 before wrapping around or trailing off, too. The humble gnome console is
 a good example of that. Very handy to have character indexing in this
 case.
I don't agree. This is a problem in font rendering. If you happen to be using a proportional font, then even character counting won't work. You need to be counting rendered width in pixels - an operation which should be generic enough to work for both fixed-width and proportional fonts.
 In the end I think I'm just tired of having to jump through hoops when it
 comes to string manipulation. I want to be able to say 'this is a
 character, I don't care what it is. Store it, change it, splice it, print
 it.'
dchar.
 happen, and I'll be able to type char chr = 'Δ'; and chr will be a proper
 utf-8 character that I can print, insert into an array, and change.
 What a beautiful day that will be.
dchar. Put another way, you want to be insulated from the internal representation. UTF-8 is an implementation detail, whereas what you want is an array of Unicode characters (whose implementation is not necessarily dchar[] but you want to be shielded from it anyway). Again I say, this is a problem for a library class, not a builtin type. And you're probably going to want even higher level abstractions dealing with glyphs too (and then font-rendering tools after that). D allows you to write such libraries. But the builtin types do exactly what it says on the tin. Their behaviour is well-defined, and it's up to the programmer to understand that behaviour.
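(A bare-bones sketch of what such a library wrapper might look like - purely illustrative, not a proposed design; it stores UTF-8 internally and pays O(n) to present a code-point view.)

    class String
    {
        private char[] data;            // stored as UTF-8 internally

        this(char[] s) { data = s; }

        size_t length()                 // length in code points, O(n)
        {
            size_t n;
            foreach (dchar c; data)
                n++;
            return n;
        }

        dchar opIndex(size_t i)         // i-th code point, O(n)
        {
            foreach (dchar c; data)
                if (i-- == 0)
                    return c;
            throw new Exception("String index out of range");
        }

        char[] toUtf8() { return data; }   // escape hatch to the raw code units
    }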
Jan 20 2008
next sibling parent Jarrod <qwerty ytre.wq> writes:
On Sun, 20 Jan 2008 11:45:40 +0000, Janice Caron wrote:

 I hear the std lib provided with D is unreliable
Huh? Please elucidate.
Ah, mistyped my thoughts there. I meant to say std.regexp. At the moment, to me char[] is just byte[], except I guess when it comes to foreach. Pretty watered down when you look at it like that. And yes, I guess a library implementation of string would be fine too. I just figure that since strings are one of the most important data types used in any program, D should probably natively support their multi-byte nature more transparently. Perhaps I have been spoiled by scripting languages after leaving C/++ alone for so long, but it would be very nice to see it happen one way or another. I guess I like my language how I like my coffee: filled with sugar.
Jan 20 2008
prev sibling parent James Dennett <jdennett acm.org> writes:
Janice Caron wrote:
 On 1/20/08, Jarrod <qwerty ytre.wq> wrote:
 But I still don't see what this has got to do with whether or not a[n]
 should identify the (n+1)th character rather than the (n+1)th code unit.
Because this issue isn't really to do with the input file itself, it's to do with the potential input characters given in the file.
You mean the plain text config file of unknown encoding?
Let's stop here. If you don't know the encoding, you can't safely process the file. That's nothing to do with language or library designs. You can't process data whose format you do not know. (Yes, you can employ heuristics to try to guess, but they can be wrong, and in the case of text files there are many files which are valid in numerous encodings but have different meanings.) -- James
Jan 20 2008
prev sibling parent "Kris" <foo bar.com> writes:
Jarrod: you might find something useful in the way the Tango Text class
operates? It attempts to make common operations independent of indexing,
in order to avoid some of these unit/point problems.


"Jarrod" <qwerty ytre.wq> wrote in message 
news:fmv7s7$76h$2 digitalmars.com...
 On Sun, 20 Jan 2008 08:04:01 +0000, Janice Caron wrote:

 I'm allowing the user to edit config
 files
How? With a GUI interface? With a program written in D? With their favorite text editor of choice? If the latter, then you cannot be sure of the encoding, and that's hardly D's fault!
It is the latter.
 Right, but converting from one encoding to another is the job of
 specialised classes. Detecting whether a text file is in ISO-8859-1, or
 Windows-1252, or MAC-ROMAN, or whatever, is not a trivial task. If your
 application were going to do that, you'd have to provide the
 implementation. (Or possibly Tango or some other third party library
 already provides such converters - I don't know). In any case, it's not
 a common enough task to warrant built-in language support.

 But I still don't see what this has got to do with whether or not a[n]
 should identify the (n+1)th character rather than the (n+1)th code unit.
Because this issue isn't really to do with the input file itself, it's to do with the potential input characters given in the file. As far as I can tell (I'm using a C library to parse the input) it should be ascii or UTF-8 encoding. Anything else would probably cause the C lexer to screw up.
 Cool. So what is the real world use case that necessitates that
 sequences of UTF-8 code units must be addressable by character index as
 the default?
The most important one right now is splicing. I'm allowing both user- defined and program-defined macros in the input data. They can be anywhere within a string, so I need to splice them out and replace them with their correct counterparts. I hear the std lib provided with D is unreliable so I'm unwilling to use it. Plus even if it is fixed up I'd hate to limit string manipulation to regular expressions. I also wish to cut off input at a certain letter count for spacing issues in both the GUI and dealing with the webscript. I'll have to be converting certain characters to their URI equivalent too, that will probably take more splicing as well. The other thing I'm using is single-letter replacement. Simple stuff like capitalising letters and replacing spaces with underscores. I can think of a lot of other situations that would benefit from proper multibyte support too, for instance practically any application that takes frequent user input could benefit. A text editor is a very good example. Any coders who don't natively deal with Latin text would probably benefit greatly too ( think of the poor Japanese coders :< ). I've seen a lot of programs that print a specified number of characters before wrapping around or trailing off, too. The humble gnome console is a good example of that. Very handy to have character indexing in this case. String tokenizing and plain old character counting are two operations I can think of that could probably be done easier too. In the end I think I'm just tired of having to jump through hoops when it comes to string manipulation. I want to be able to say 'this is a character, I don't care what it is. Store it, change it, splice it, print it.' But instead it seems if I don't care what the character type it, it might not fit. Then I have to allocate then store it, find and change it, locate then splice it, convert then print it. Small annoyances build up over time and I'm pretty sure I'm not insured for blood vessels bursting in my eye. I live in the hope that one day in the future I'll see something magical happen, and I'll be able to type char chr = '?'; and chr will be a proper utf-8 character that I can print, insert into an array, and change. What a beautiful day that will be. Welp, I think I'm done ranting for now. Back to screwing around with strings. Or more accurately, procrastinating about screwing around with strings.
Jan 20 2008
prev sibling parent reply "Janice Caron" <caron800 googlemail.com> writes:
On 1/16/08, Jarrod <qwerty ytre.wq> wrote:
 On Tue, 15 Jan 2008 21:23:31 -0500, bearophile wrote:

 So if this is the case, then why can't the language itself manage multi-
 byte characters for us? It would make things a hell of a lot easier and
 more efficient than having to convert /potentially/ foreign strings to
 utf-32 for a simple manipulation operation, then converting them back.
 The only reason I can think of for char arrays being treated as fixed
 length is for faster indexing, which is hardly useful in most cases since
 a lot of the time we don't even know if we're dealing with multi-byte
 characters when handling strings, so we have to convert and traverse the
 strings anyway.
Because, think about this:

char[] a = new char[8];

If a char array were indexed by character instead of by code unit, as you suggest, how many bytes would the compiler need to allocate? It can't know in advance. Also:

char[] a = "abcd";
char[] b = "\u20AC";
a[0] = b[0];

would cause big problems. (Would a[1] get overwritten? Would a have to be resized and everything shifted up one byte?) I think D has got it right. Use wchar or dchar when you need character-based indexing.
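(For completeness, a minimal sketch of the dchar route for that last example - convert, assign per character, re-encode - using std.utf's toUTF32 and toUTF8; the strings are the ones from above.)

    import std.utf;

    char[] a = "abcd".dup;
    char[] b = "\u20AC".dup;      // the euro sign: one code point, three code units

    dchar[] da = toUTF32(a);
    dchar[] db = toUTF32(b);
    da[0] = db[0];                // character-level assignment is safe here
    a = toUTF8(da);               // re-encode; a's byte length grows from 4 to 6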
Jan 18 2008
parent reply James Dennett <jdennett acm.org> writes:
Janice Caron wrote:
 On 1/16/08, Jarrod <qwerty ytre.wq> wrote:
 On Tue, 15 Jan 2008 21:23:31 -0500, bearophile wrote:

 So if this is the case, then why can't the language itself manage multi-
 byte characters for us? It would make things a hell of a lot easier and
 more efficient than having to convert /potentially/ foreign strings to
 utf-32 for a simple manipulation operation, then converting them back.
 The only reason I can think of for char arrays being treated as fixed
 length is for faster indexing, which is hardly useful in most cases since
 a lot of the time we don't even know if we're dealing with multi-byte
 characters when handling strings, so we have to convert and traverse the
 strings anyway.
Because, think about this: char[] a = new char[8]; If a char array were indexed by character instead of codeunit, as you suggest, how many bytes would the compiler need to allocate?
8. That's independent of how they are indexed.
 It can't
 know in advance. 
Yup, it can. 8.
 Also:
 
     char[] a = "abcd";
     char[] b = "\u20AC";
     a[0] = b[0];
 
 would cause big problems. (Would a[1] get overwritten? Would a have to
 be resized and everything shifted up one byte?)
It causes big problems with byte-wise addressing, because you can end up with a char[] (which is specified to hold a UTF8 string) which does not contain a UTF8 string.
 I think D has got it right. Use wchar or dchar when you need character
 based indexing.
If you have UTF8, you should not be allowed to access the bytes that make it up, only its characters. If you want a bunch of bytes, use an array of bytes. (On the other hand, D is weak here because it identifies UTF8 strings with arrays of char, but char doesn't hold a UTF8 character. I can't imagine persuading Walter that this is a horrible error is going to work though.) -- James
Jan 18 2008
next sibling parent reply "Janice Caron" <caron800 googlemail.com> writes:
On 1/19/08, James Dennett <jdennett acm.org> wrote:
 Because, think about this:

     char[] a = new char[8];

 If a char array were indexed by character instead of codeunit, as you
 suggest, how many bytes would the compiler need to allocate?
8. That's independent of how they are indexed.
My *rhetorical* question was in response to a poster who suggested that char arrays should be indexed by CHARACTER. I know perfectly well that in reality they are indexed by UTF-8 code unit, and therefore that the correct answer is eight. I was giving additional reasons /why/ indexing by character was not sensible.
 It can't
 know in advance.
Yup, it can. 8.
Of course. However, /if/ they were indexed by character, /then/ it would be impossible to know in advance how many UTF-8 code units it would take to construct eight characters. Which is a good reason why they should /not/ be indexed by character. As far as I am concerned, D has got it right.
 Also:

     char[] a = "abcd";
     char[] b = "\u20AC";
     a[0] = b[0];

 would cause big problems. (Would a[1] get overwritten? Would a have to
 be resized and everything shifted up one byte?)
It causes big problems with byte-wise addressing, because you can end up with a char[] (which is specified to hold a UTF8 string) which does not contain a UTF8 string.
That's what I said. Thank you for reiterating it.
 I think D has got it right. Use wchar or dchar when you need character
 based indexing.
If you have UTF8, you should not be allowed to access the bytes that make it up, only its characters.
I absolutely /should/ be (and am) allowed to access the UTF-8 code units stored within an array of UTF-8 code units. This is absolutely as it should be, thank you very much.
 If you want a bunch of
 bytes, use an array of bytes.
Of course. And if you want an array of UTF-8 code units, use a char[] or a string.
 (On the other hand, D is weak here
 because it identifies UTF8 strings with arrays of char, but char
 doesn't hold a UTF8 character.
I believe you are wrong. I would say that D is strong here, because it identifies UTF-8 strings with arrays of char, because char /does/ hold a UTF-8 code unit. The mistake is assuming that code unit == character. It does not. (However, there is overlap in the ASCII range, 0x00 to 0x7F, so the assumption is still valid if you are certain that your strings are ASCII). Once you've grokked that char[] == array of UTF-8 code unit, everything else falls into place and makes sense.
Jan 19 2008
parent reply James Dennett <jdennett acm.org> writes:
Janice Caron wrote:
 On 1/19/08, James Dennett <jdennett acm.org> wrote:
 Because, think about this:

     char[] a = new char[8];

 If a char array were indexed by character instead of codeunit, as you
 suggest, how many bytes would the compiler need to allocate?
8. That's independent of how they are indexed.
My *rhetorical* question was in response to a poster who suggested that char arrays should be indexed by CHARACTER.
Yes, I know. If you don't see that my post answered yours, feel free to ask me to explain more clearly. Indexing by character doesn't automatically tell us anything about how the size specified for new[] works. It would be a little quirky for the size to be in bytes while the indices are in characters, but it's quirky for char[] to pretend to be a UTF8 type without enforcing that.
 I know perfectly well
 that in reality they are indexed by UTF-8 code unit, and therefore
 that the correct answer is eight. I was giving additional reasons
 /why/ indexing by character was not sensible.
I was refuting your claimed reason.
 
 It can't
 know in advance.
Yup, it can. 8.
Of course. However, /if/ they were indexed by character, /then/ it would be impossible to know in advance how many UTF-8 code units it would take to construct eight characters.
Sure. So it wouldn't be a request for 8 characters.
 Which is a good reason why
 they should /not/ be indexed by character.
I *still* disagree with that.
 As far as I am concerned, D has got it right.
Enjoy ;)
 Also:

     char[] a = "abcd";
     char[] b = "\u20AC";
     a[0] = b[0];

 would cause big problems. (Would a[1] get overwritten? Would a have to
 be resized and everything shifted up one byte?)
It causes big problems with byte-wise addressing, because you can end up with a char[] (which is specified to hold a UTF8 string) which does not contain a UTF8 string.
That's what I said. Thank you for reiterating it.
I think we are *disagreeing* here. I claim that this causes problems _with the current design of D_, which would be resolved if char[] (or however we denote mutable UTF8 strings) string were really a UTF8 type.
 I think D has got it right. Use wchar or dchar when you need character
 based indexing.
If you have UTF8, you should not be allowed to access the bytes that make it up, only its characters.
I absolutely /should/ be (and am) allowed to access the UTF-8 code units stored within an array of UTF-8 code units. This is absolutely as it should be, thank you very much.
But you already illustrated why it should not: you can break the invariant that there is valid UTF8 there, so char[] is lying when it says that it is a UTF8 type.
 
 If you want a bunch of
 bytes, use an array of bytes.
Of course. And if you want an array of UTF-8 code units, use a char[] or a string.
If you're at the level of code units, you can't be sure that they're part of UTF8 characters unless there is a higher-level invariant enforced on them.
 (On the other hand, D is weak here
 because it identifies UTF8 strings with arrays of char, but char
 doesn't hold a UTF8 character.
I believe you are wrong. I would say that D is strong here, because it identifies UTF-8 strings with arrays of char, because char /does/ hold a UTF-8 code unit.
That's the problem. char[] can hold non-UTF8 strings.
 The mistake is assuming that code unit == character.
If you're telling me that I made that mistake, you're sorely mistaken.
 It does not.
 (However, there is overlap in the ASCII range, 0x00 to 0x7F, so the
 assumption is still valid if you are certain that your strings are
 ASCII).
 
 Once you've grokked that char[] == array of UTF-8 code unit,
 everything else falls into place and makes sense.
No, it does not. It's precisely that difference which makes D's char[] a poor man's UTF-8 string. -- James
Jan 19 2008
parent reply "Janice Caron" <caron800 googlemail.com> writes:
On 1/19/08, James Dennett <jdennett acm.org> wrote:
     char[] a = "abcd";
     char[] b = "\u20AC";
     a[0] = b[0];
I think we are *disagreeing* here. I claim that this causes problems _with the current design of D_, which would be resolved if char[] (or however we denote mutable UTF8 strings) string were really a UTF8 type.
So you're saying that in your new design, after that assignment, a would equal "\u20ACbcd". The problem is that the compiler would have to allocate extra bytes and then memcpy all the bytes up a bit to make room. That strikes me as kinda slow, which is not something I'd want in a char array.
 That's the problem.  char[] can hold non-UTF8 strings.
Yes, that is possible. But only in buggy code, of course. That really raises the question: is it the compiler's job, or the programmer's, to ensure that the contract is maintained? I don't really have any problem taking responsibility for maintaining UTF-8 correctness. (It's not hard). But if you want to be completely protected from those kinds of errors, I still don't see the problem with using dchar.
 No, it does not.  It's precisely the difference that is why
 D's char[] is a poor man's UTF8 string.
I suppose a library class could be written whose interface behaved like a dchar array, but whose implementation was UTF-8. But when would you ever use it?
Jan 19 2008
parent reply Jarrod <qwerty ytre.wq> writes:
On Sat, 19 Jan 2008 20:21:49 +0000, Janice Caron wrote:

 So you're saying that in your new design, after that assignment, a would
 equal "\u20ACbcd". The problem is that the compiler would have to
 allocate extra bytes and then memcpy all the bytes up a bit to make
 room. That strikes me as kinda slow, which is not something I'd want in
 a char array.
Well how else would you like it to be done? If you were writing something that took a text input much like this very window I'm typing in right now, and the user hit back a few times and input a multi-byte character, how would you deal with it? Allow it to overlap? No. dchars? That's a lot of wasted memory, and it basically makes me wonder why utf-8 even exists if it needs to be dropped for simple text manipulation. May as well stick with utf-32 and ascii. No sir, I don't like it.
Jan 19 2008
parent "Janice Caron" <caron800 googlemail.com> writes:
On 1/20/08, Jarrod <qwerty ytre.wq> wrote:
 If you were writing something that took a text input much like this very
 window I'm typing in right now, and the user hit back a few times and
 input a multi-byte character, how would you deal with it?
I'd write a class, of course. It is simple (though not trivial) to step through the bytes of UTF-8. Bytes in the range 00 to 7F are ASCII; bytes in the range 80 to BF are tail bytes; bytes in the range C0 to F7 are head bytes; and bytes in the range F8 to FF are illegal. Identifying multi-byte sequences is therefore easy. You can make an argument that functions and/or classes to do this sort of thing should perhaps pre-exist in Phobos, but to say it should be built into /the language itself/ ... that's going a bit too far, I feel.

 Allow it to overlap? No. dchars? That's a lot of wasted memory, and it
 basically makes me wonder why utf-8 even exists if it needs to be dropped
 for simple text manipulation. May as well stick with utf-32 and ascii.

 No sir, I don't like it.
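(A minimal sketch of that byte-stepping, applied to the "user hit backspace" case quoted above: drop the last code point by skipping back over tail bytes. It assumes s is non-empty, valid UTF-8; the string itself is arbitrary.)

    char[] s = "caféΔ".dup;
    size_t i = s.length;
    do
    {
        i--;                                     // step back one byte
    } while (i > 0 && (s[i] & 0xC0) == 0x80);    // 80..BF are tail bytes: keep going
    s = s[0 .. i];                               // now ends after "café"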
Jan 19 2008
prev sibling parent reply Walter Bright <newshound1 digitalmars.com> writes:
James Dennett wrote:
 (On the other hand, D is weak here
 because it identifies UTF8 strings with arrays of char, but char
 doesn't hold a UTF8 character.  I can't imagine persuading Walter
 that this is a horrible error is going to work though.)
I've actually done considerable work with UTF-8, both in C++ and D. D's method of dealing with it works out very well (and very naturally). This is why you'll have a hard time persuading me otherwise <g>. Note that C++0x is doing things similarly: http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2007/n2249.html
Jan 20 2008
parent reply James Dennett <jdennett acm.org> writes:
Walter Bright wrote:
 James Dennett wrote:
 (On the other hand, D is weak here
 because it identifies UTF8 strings with arrays of char, but char
 doesn't hold a UTF8 character.  I can't imagine persuading Walter
 that this is a horrible error is going to work though.)
I've actually done considerable work with UTF-8, both in C++ and D.
Yes, by this stage most serious programmers have had to learn in some detail how to work with UTF-8.
 D's 
 method of dealing with it works out very well (and very naturally).
I've given specific problems with it. I've heard no refutation of them. D uses essentially a model of UTF8 which is really just a bunch-of-bytes with smart iteration. C-based projects on which I worked in the 90's did similarly, but with coding conventions that banned direct access to the bytes.
 This is why you'll have a hard time persuading me otherwise <g>.
Because you assert that there's not a problem? ;)
 Note that C++0x is doing things similarly:
 
 http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2007/n2249.html
Looks very different to me. There's no conflation of char with a code unit of UTF8 (and indeed C++ deliberately supports use of varied encodings for multi-byte characters). Yes, C++ is adding 16- and 32-bit character types which are more akin to D's, but that has little bearing on how differently it handles multi-byte (as opposed to wide-character) strings. -- James
Jan 20 2008
next sibling parent reply "Janice Caron" <caron800 googlemail.com> writes:
On 1/20/08, James Dennett <jdennett acm.org> wrote:
 Looks very different to me.
I thought it looked very similar indeed to D, but there you go. Funny how two different people can read the same document and interpret it in two different ways.
 There's no conflation of char with a
 code unit of UTF8
C has no ubyte type. Since time immemorial, C programmers have been using the char type to store every 8-bit wide data type under the sun simply because there's been no alternative (until recently, when int8_t showed up as a typedef for char). That's not a big deal.
 (and indeed C++ deliberately supports use of
 varied encodings for multi-byte characters).
I must have misread the heading that says "Require UTF", and whose text reads "The C TR makes the encoding of char16_t and char32_t implementation-defined. It also provides macros to indicate whether or not the encoding is UTF. In contrast, this proposal requires UTF encoding." Oh, I see what you're saying - C++ would require UTF for wchar and dchar, but not for char. Well, that's historical legacy for you.
 Yes, C++ is adding
 16- and 32-bit character types which are more akin to D's, but that
 has little bearing on how differently it handles multi-byte (as
 opposed to wide-character) strings.
So it has a bunch of procedural functions instead of foreach. Apart from that, the approach seems the same as D. Where's the difference?
Jan 20 2008
parent James Dennett <jdennett acm.org> writes:
Janice Caron wrote:
 On 1/20/08, James Dennett <jdennett acm.org> wrote:
 Looks very different to me.
I thought it looked very similar indeed to D, but there you go. Funny how two different people can read the same document and interpret it in two different ways.
The core issue here, to me, is D's half-hearted attempt to paint char[] as a Unicode string type. C++ has nothing analogous.
 There's no conflation of char with a
 code unit of UTF8
C has no ubyte type. Since time immemorial, C programmers have been using the char type to store every 8-bit wide data type under the sun simply because there's been no alternative (until recently, when int8_t showed up as a typedef for char).
int8_t is necessarily signed, a la "signed char", not a typedef for "char", whose signedness varies (but, unfortunately, is often signed in C and C++).
 That's not a big deal.
 
 
 (and indeed C++ deliberately supports use of
 varied encodings for multi-byte characters).
I must have misread the heading that says "Require UTF", and whose text reads "The C TR makes the encoding of char16_t and char32_t implementation-defined. It also provides macros to indicate whether or not the encoding is UTF. In contrast, this proposal requires UTF encoding." Oh, I see what you're saying - C++ would require UTF for wchar and dchar, but not for char. Well, that's historical legacy for you.
And it's the real world; computer systems need to interface with existing systems which use diverse encodings.
 Yes, C++ is adding
 16- and 32-bit character types which are more akin to D's, but that
 has little bearing on how differently it handles multi-byte (as
 opposed to wide-character) strings.
So it has a bunch of procedural functions instead of foreach. Apart from that, the approach seems the same as D. Where's the difference?
Philosophy: D pushes char[] as if it were a proper UTF8 facility, and goes a small step towards adding language support for that. C++ recognizes diversity in multi-byte character encodings, and doesn't make the language promote one over any other. It admits up-front that you're dealing with code units if you want to work with multi-byte characters. C++ is a long, long way from perfect when it comes to Unicode support. Even C++0x will be. But I'm hoping for more from D, and what I see so far can stand some improvement. -- James
Jan 20 2008
prev sibling parent reply Walter Bright <newshound1 digitalmars.com> writes:
James Dennett wrote:
 I've given specific problems with it.  I've heard no refutation
 of them.
It's hard to describe, but after working with UTF-8 for a while, they are just non-problems. Code isn't written that way. If you want, you can create a String class which wraps a char[] and treats it at the level you wish.
 D uses essentially a model of UTF8 which is really just
 a bunch-of-bytes with smart iteration.
That's what UTF-8 is.
 C-based projects on which
 I worked in the 90's did similarly, but with coding conventions
 that banned direct access to the bytes.
Coding conventions are one thing, but banning things in a systems language are quite another. Copying a UTF-8 string by decoding and encoding the characters one-by-one is unacceptably inefficient, for example, compared with just memcpy. Searching a UTF-8 string for a substring is another operation for which treating it like a bag of bytes works best.
 This is why you'll have a hard time persuading me otherwise <g>.
Because you assert that there's not a problem? ;)
Because I know it works based on experience.
 Note that C++0x is doing things similarly:

 http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2007/n2249.html
Looks very different to me. There's no conflation of char with a code unit of UTF8 (and indeed C++ deliberately supports use of varied encodings for multi-byte characters). Yes, C++ is adding 16- and 32-bit character types which are more akin to D's, but that has little bearing on how differently it handles multi-byte (as opposed to wide-character) strings.
Since, in the C++ proposal, indexing and length is done by byte/word/dword, not by code point, it's semantically equivalent. I don't see any banning of getting at the underlying representation, nor any attempt to hide it.
Jan 20 2008
parent reply James Dennett <jdennett acm.org> writes:
Walter Bright wrote:
 James Dennett wrote:
 I've given specific problems with it.  I've heard no refutation
 of them.
It's hard to describe, but after working with UTF-8 for a while, they are just non-problems. Code isn't written that way. If you want, you can create a String class which wraps a char[] and treats it at the level you wish.
Indeed, but such a thing should be standard, not reinvented over and over.
 D uses essentially a model of UTF8 which is really just
 a bunch-of-bytes with smart iteration.
That's what UTF-8 is.
That view has led to many security issues, where different software reacts differently to byte strings which are not valid UTF-8 in places where UTF-8 is expected.
 C-based projects on which
 I worked in the 90's did similarly, but with coding conventions
 that banned direct access to the bytes.
Coding conventions are one thing, but banning things in a systems language are quite another. Copying a UTF-8 string by decoding and encoding the characters one-by-one is unacceptably inefficient, for example, compared with just memcpy. Searching a UTF-8 string for a substring is another operation for which treating it like a bag of bytes works best.
There are alternatives; explicit notation to access the bytes, which *doesn't* look like it's accessing characters, would be better. (char doesn't represent a character in D. Not great naming? But then D almost follows C in this, where char did double duty as a limited character type and a small integral type.)
 This is why you'll have a hard time persuading me otherwise <g>.
Because you assert that there's not a problem? ;)
Because I know it works based on experience.
And I know, based on experience, of problems with it. So how do we get past this to discuss things more objectively? (Of course, we don't have to. You're the BDFL, and you get to make the call, and try to keep D coherent in the face of a hundred people pushing inconsistent views for how it should evolve. I get the easy job of being just one of those voices.)
 Note that C++0x is doing things similarly:

 http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2007/n2249.html
Looks very different to me. There's no conflation of char with a code unit of UTF8 (and indeed C++ deliberately supports use of varied encodings for multi-byte characters). Yes, C++ is adding 16- and 32-bit character types which are more akin to D's, but that has little bearing on how differently it handles multi-byte (as opposed to wide-character) strings.
Since, in the C++ proposal, indexing and length are done by byte/word/dword, not by code point, it's semantically equivalent. I don't see any banning of getting at the underlying representation, nor any attempt to hide it.
Whereas D partly attempts to hide it; the mathematician in me hates this kind of fence-sitting. But let's get more concrete: suppose D code finds that an alleged char[] passed to it is, in fact, broken (i.e., violates the UTF8 invariants). What should it do -- abort, throw an exception, offer a policy for handling such bugs, other? -- James
Jan 20 2008
next sibling parent reply "Janice Caron" <caron800 googlemail.com> writes:
On Jan 20, 2008 11:01 PM, James Dennett <jdennett acm.org> wrote:
 That view has lead to many security issues, where different
 software reacts differently to byte strings which are not
 valid UTF-8 in places where UTF-8 is expected.
Such input should always be rejected. D will throw an exception, which is the right thing to do. If a programmer wants to be more flexible, they can always catch the exception and delete invalid sequences.

Security issues have arisen as a result of what are called non-shortest sequences. For example, the slash character is represented in UTF-8 as 2F. Some hackers have attempted to get past certain filters by representing the slash character as C0 AF. This is not valid UTF-8 (because UTF-8 forbids non-shortest sequences), but a buggy implementation might get that wrong and interpret it as '\u002F'.

The important point that I want to make here is that *D GETS IT RIGHT*. D's implementation will throw an exception on all invalid UTF-8 sequences, and this will block all such security issues. The only way they can resurface is if you hand-code your own UTF handling. So long as you stick to the built-in UTF-handling stuff which D provides, you will not encounter these security issues.

Other security issues arise as a result of Unicode itself, not UTF-8. This is because Unicode is such a large character set - which makes it really good for phishing attacks. After all, if you spell "amazon" with the Greek letter lowercase omicron instead of Latin lowercase o, who's going to notice? However, this is not D's problem - it's a problem for browser writers, and one they will encounter regardless of what programming language they use. (Similar issues arise if browsers fail to convert URLs to Normalisation Form C, but again, that would be a browser problem, not a D problem. It would also be a bug.)
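(A small sketch of that rejection in code, assuming std.utf.validate; the exception class is spelled UtfException in the Phobos of this era and UTFException in later versions, so the sketch just catches Exception:)

    import std.utf;

    bool accepts(char[] s)
    {
        try
        {
            validate(s);       // throws on any malformed UTF-8 sequence
            return true;
        }
        catch (Exception e)    // UtfException here, UTFException in later Phobos
        {
            return false;
        }
    }

    void demo()
    {
        assert(accepts("/".dup));                            // plain 2F is fine
        assert(!accepts([cast(char)0xC0, cast(char)0xAF]));   // overlong '/' is rejected
    }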
 (char doesn't represent a character in D.  Not great
 naming?
It's reasonable naming, given that UTF-8 code units in the range 00 to 7F do, in fact, correspond to (ASCII) characters. OK, so it's inappropriately named for holding values 80 to FF, but alternatives such as codeunit, or utf8, would probably not catch on so easily.
 But let's get more concrete:
 suppose D code finds that an alleged char[] passed to it is, in
 fact, broken (i.e., violates the UTF8 invariants).  What should
 it do -- abort, throw an exception, offer a policy for handling
 such bugs, other?
It should, and does, throw an exception. Your program may catch the exception, but it should reject the input.
Jan 21 2008
parent reply Matti Niemenmaa <see_signature for.real.address> writes:
Janice Caron wrote:
 The important point that I want to make here is that *D GETS
 IT RIGHT*. D's implementation will throw an exception on all invalid
 UTF-8 sequences, and this will block all such security issues.
Well, mostly right: http://d.puremagic.com/issues/show_bug.cgi?id=978 -- E-mail address: matti.niemenmaa+news, domain is iki (DOT) fi
Jan 21 2008
parent reply "Janice Caron" <caron800 googlemail.com> writes:
On Jan 21, 2008 8:47 AM, Matti Niemenmaa <see_signature for.real.address> wrote:
 Janice Caron wrote:
 The important point that I want to make here is that *D GETS
 IT RIGHT*. D's implementation will throw an exception on all invalid
 UTF-8 sequences, and this will block all such security issues.
Well, mostly right: http://d.puremagic.com/issues/show_bug.cgi?id=978
Oooh - well spotted! In that case, I amend my statement. D /will/ get it right, once this bug is fixed. One would hope that said bug will be fixed in the next release.
Jan 21 2008
parent Matti Niemenmaa <see_signature for.real.address> writes:
Janice Caron wrote:
 On Jan 21, 2008 8:47 AM, Matti Niemenmaa <see_signature for.real.address>
wrote:
 Janice Caron wrote:
 The important point that I want to make here is that *D GETS
 IT RIGHT*. D's implementation will throw an exception on all invalid
 UTF-8 sequences, and this will block all such security issues.
Well, mostly right: http://d.puremagic.com/issues/show_bug.cgi?id=978
Oooh - well spotted! In that case, I amend my statement. D /will/ get it right, once this bug is fixed. One would hope that said bug will be fixed in the next release.
You'll note the bug is a year old. Although that doesn't change the fact that one would, indeed, hope for that. :-) -- E-mail address: matti.niemenmaa+news, domain is iki (DOT) fi
Jan 21 2008
prev sibling parent "Janice Caron" <caron800 googlemail.com> writes:
On Jan 21, 2008 8:11 AM, Janice Caron <caron800 googlemail.com> wrote:
 But let's get more concrete:
 suppose D code finds that an alleged char[] passed to it is, in
 fact, broken (i.e., violates the UTF8 invariants).  What should
 it do -- abort, throw an exception, offer a policy for handling
 such bugs, other?
It should, and does, throw an exception. Your program may catch the exception, but it should reject the input.
In fact, this goes to the heart of almost all modern security problems (SQL injection, buffer overruns, etc.). The golden rule is that *ALL* untrusted input must be sanitised. Every time you don't do that, you provide an opportunity for hackers. But at least in the case of UTF-8, it's easy - just let D validate it. If it doesn't validate, throw it out.
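(In code, that golden rule comes down to a single call at the trust boundary - a sketch with a hypothetical function name:)

    import std.utf;

    // Hypothetical boundary function: everything downstream of this call
    // may assume the char[] really is well-formed UTF-8.
    char[] acceptUntrusted(char[] raw)
    {
        validate(raw);    // throws on malformed input - let the caller reject it
        return raw;
    }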
Jan 21 2008
prev sibling parent reply bearophile <bearophileHUGS lycos.com> writes:
Janice Caron:
 Also, isn't perl an interpreted language? You can get away with a lot
 more in an interpreted language, but you pay the price in speed.
I'm not a Perl expert, and I don't know how well Perl manages Unicode (maybe Python manages Unicode better than Perl), but Perl was designed to process text, so if you process strings you will find that Perl is pretty *fast*; it's easy to write Perl programs that process text faster (and in a more flexible way) than C++ ones... (Note that Python 3.0 will use Unicode strings by default.)

For example, if you use Python dicts (AAs) with string keys they seem faster than current DMD AAs, and probably that's true for Perl ones too. This was a tiny example:
http://www.digitalmars.com/webnews/newsgroups.php?art_group=digitalmars.D&article_id=57986

Perl and Python have well-refined GCs, which may be faster than the current DMD GC if you manage a lot of strings; this was an example where D was slower than Python too:
http://www.digitalmars.com/webnews/newsgroups.php?art_group=digitalmars.D&article_id=62369

With Python you can also use Psyco, which is a JIT, to speed things up, etc. Psyco uses tricks to avoid actually copying strings and string slices in most cases, because Python strings are immutable (Python copies them when you perform a slice), as D's strings are too.

REs in the current DMD are *way* slower than Perl/Python/Tcl ones, etc. Some time ago I found a situation where the RE sub() of D appears to be O(n^2):
http://shootout.alioth.debian.org/gp4/benchmark.php?test=regexdna&lang=dlang&id=4

The string methods of Python are written in really refined C, like this one:
http://effbot.org/zone/stringlib.htm
and they are usually faster than the less refined versions you can find in the current Phobos. I have implemented, and use, a fastJoin, an xsplit, etc. that are faster than the Phobos ones.

The built-in sort of Python is Timsort, which is way faster than the D built-in (I have written a rather simple sort that is up to 3 times faster than the built-in one in D, and it's always faster no matter what data I use).

Now and then the text I/O on disk of the current DMD is slower than Python's; this comes from some of my benchmarks.

I know all those parts of DMD can be improved later. When you create a new language you can't (and don't want to) optimize every little bit (because it may be premature optimization); optimization must come later, so I understand Walter in this regard. But all this is just to show you that if today you have to process a lot of text in a very flexible way it's not easy to beat the languages like Perl (and Python/Ruby/Tcl too; Ruby is less good than Python for Unicode texts, I think) designed for it.

If you take a look near the bottom of this thread:
http://groups.google.com/group/comp.lang.python/browse_thread/thread/0b3ded6d0f494d06/0068cb1406ab9e4c
you can see that I'd like to use D to speed up some text-processing-related bioinformatics scripts of mine, but often I find that the Python programs are faster for that purpose ;-)

Bye,
bearophile
Jan 19 2008
next sibling parent Walter Bright <newshound1 digitalmars.com> writes:
bearophile wrote:
 The built-in sort of Python is Timsort, which is way faster than
 the D built-in (I have written a rather simple sort that is up to 3
 times faster than the built-in one in D, and it's always faster no
 matter what data I use).
D's sort is in phobos/internal/qsort.d and qsort.d. If you have a faster qsort, and want to contribute it, please do so! Same goes for other faster routines you've written.
 Now and then the text I/O on disk of the current DMD is slower than
 Python's; this comes from some of my benchmarks.
The D 2.0 I/O is much faster than the 1.0 I/O. But it still suffers a bit from the requirement (I imposed) of being compatible with C stdio. I don't know if Python does this or not.
 
 I know all those parts of DMD can be improved later. When you create
 a new language you can't (and you don't want to) optimize every
 little bit (because it may be premature optimization); optimization
 must come later, so I understand Walter in this regard. But all this
 is just to show you that if today you have to process a lot of text in
 a very flexible way it's not easy to beat the languages like Perl
 (and Python/Ruby/Tcl too; Ruby is less good than Python for Unicode
 texts, I think) designed for it.
I don't believe there are any fundamental reasons why D string processing should be slower; it's just a matter of spending the effort on it.
Jan 20 2008
prev sibling parent Sean Kelly <sean f4.ca> writes:
bearophile wrote:
 The built-in sort of Python is Timsort, which is way faster than the D
built-in (I have written a rather simple sort that is up to 3 times faster than
the built-in one in D, and it's always faster no matter what data I use).
I'd be interested in seeing that. I've been able to beat the D sort for some data sets and match it in others, but not beat it across the board. Sean
Jan 20 2008