
digitalmars.D - String implementations

reply bearophile <bearophileHUGS lycos.com> writes:
Defining how an ASCII string is best managed by a language is already complex
(ropes or not? Mutable or not? With shared parts or not? Etc.), but today ASCII
isn't enough, and once you add Unicode to the mix, string management becomes a
hairy topic. This may be interesting for D developers:

http://weblogs.mozillazine.org/roc/archives/2008/01/string_theory.html

Something curious: sometimes I need mutable strings, but I cope with the
immutable ones when necessary. This author says that even stringAt isn't very
useful! :-)

Bye,
bearophile
Jan 15 2008
next sibling parent Robert Fraser <fraserofthenight gmail.com> writes:
bearophile wrote:
 Defining how an ASCII string is best managed by a language is already complex
(ropes or not? Mutable or not? With shared parts or not? Etc.), but today ASCII
isn't enough, and once you add Unicode to the mix, string management becomes a
hairy topic. This may be interesting for D developers:
 
 http://weblogs.mozillazine.org/roc/archives/2008/01/string_theory.html
 
 Something curious: sometimes I need mutable strings, but I cope with the
immutable ones when necessary. This author says that even stringAt isn't very
useful! :-)
 
 Bye,
 bearophile
I agree with pretty much everything in that article, especially the part about charAt not being very useful (occasionally I iterate, but I rarely need random access, and slicing is a good option for strings whose format I know and that don't need regexes). D's "string" (that is, invariant(char)[]) is a good compromise, although I'd also like to have a String class that I could use for *some* strings: one that can be implicitly used as (but not converted to) a char[] but has interning & hash code caching. But this is impossible within D's type system as it stands today.
Jan 15 2008
prev sibling next sibling parent reply Jarrod <qwerty ytre.wq> writes:
On Tue, 15 Jan 2008 21:23:31 -0500, bearophile wrote:

 Defining how an ASCII string is best managed by a language is already
 complex (ropes or not? Mutable or not? With shared parts or not? Etc.),
 but today ASCII isn't enough, and once you add Unicode to the mix,
 string management becomes a hairy topic. This may be interesting for D
 developers:
 
 http://weblogs.mozillazine.org/roc/archives/2008/01/string_theory.html
 
 Something curious: sometimes I need mutable strings, but I cope with the
 immutable ones when necessary. This author says that even stringAt isn't
 very useful! :-)
 
 Bye,
 bearophile
This article is pretty much correct. While the topic is at hand, I guess I could rant a little:

Why does D practically *require* the coder to use different forms of UTF encoding? D can tell when a code point spans multiple code units, as evidenced by converting a utf-8 string to utf-32 (D knows where to split the sequences apart), yet we can't index char[] arrays by code point. Instead, D indexes char arrays by fixed-length bytes, which is almost nonsensical since the D spec asserts that char[] arrays are designed specifically for Unicode characters, and that other single-byte arrays should instead be made as a byte[].

So if this is the case, then why can't the language itself manage multi-byte characters for us? It would make things a hell of a lot easier and more efficient than having to convert /potentially/ foreign strings to utf-32 for a simple manipulation operation, then converting them back. The only reason I can think of for char arrays being treated as fixed length is for faster indexing, which is hardly useful in most cases since a lot of the time we don't even know if we're dealing with multi-byte characters when handling strings, so we have to convert and traverse the strings anyway. Arg.

I know this would probably be a pain to implement, but it would really give D a huge leg-up if it could properly and automatically handle strings for us, without requiring a bloated string class.
Jan 16 2008
next sibling parent reply "Steven Schveighoffer" <schveiguy yahoo.com> writes:
"Jarrod" wrote
 On Tue, 15 Jan 2008 21:23:31 -0500, bearophile wrote:

 Defining how an ASCII string is best managed by a language is already
 complex (ropes or not? Mutable or not? With shared parts or not? Etc.),
 but today ASCII isn't enough, and once you add Unicode to the mix,
 string management becomes a hairy topic. This may be interesting for D
 developers:

 http://weblogs.mozillazine.org/roc/archives/2008/01/string_theory.html

 Something curious: sometimes I need mutable strings, but I cope with the
 immutable ones when necessary. This author says that even stringAt isn't
 very useful! :-)

 Bye,
 bearophile
This article is pretty much correct. While the topic is at hand, I guess I could rant a little:

Why does D practically *require* the coder to use different forms of UTF encoding? D can tell when a code point spans multiple code units, as evidenced by converting a utf-8 string to utf-32 (D knows where to split the sequences apart), yet we can't index char[] arrays by code point. Instead, D indexes char arrays by fixed-length bytes, which is almost nonsensical since the D spec asserts that char[] arrays are designed specifically for Unicode characters, and that other single-byte arrays should instead be made as a byte[].

So if this is the case, then why can't the language itself manage multi-byte characters for us? It would make things a hell of a lot easier and more efficient than having to convert /potentially/ foreign strings to utf-32 for a simple manipulation operation, then converting them back. The only reason I can think of for char arrays being treated as fixed length is for faster indexing, which is hardly useful in most cases since a lot of the time we don't even know if we're dealing with multi-byte characters when handling strings, so we have to convert and traverse the strings anyway.
The algorithmic penalties would be O(n) for an indexed lookup instead of O(1). I think the way it is now is the best of all worlds. I think the correct method in this case is to convert to utf32 first, then index. Then at least you only take the O(n) penalty once. Or why not just use dchar[] instead of char[] to begin with? -Steve
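(A minimal sketch of that convert-once-then-index approach, using std.utf's toUTF32; the example string and variable names are just for illustration.)

    import std.utf;

    char[] s = "Δx Δy Δz".dup;
    dchar[] ds = toUTF32(s);      // one O(n) decode up front
    dchar third = ds[2];          // from here on, character indexing is O(1): ' '
    dchar fifth = ds[4];          // 'y'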
Jan 16 2008
parent reply Jarrod <qwerty ytre.wq> writes:
On Wed, 16 Jan 2008 10:27:53 -0500, Steven Schveighoffer wrote:
 
 The algorithmic penalties would be O(n) for an indexed lookup
 instead of O(1).
I understand this, but the compiler could probably optimize this for most situations. Most string access would be sequential and thus positions could be cached on access when need be, and string literals that aren't modified and have all single-byte chars could be optimized into normal indexing. Furthermore, modern processors are incredibly good at sequential iteration and I know from personal experience that they can parse over massive chunks of memory in mere milliseconds (hashing entire executables in memory for potential changes is a common example of this). It shouldn't be noticeable at all to scan over a string. I do believe the author of the article that bearophile linked agrees with me in this regard, in his mention of charAt implementation.
 I think the correct method in this case is to convert to utf32 first,
 then index.  Then at least you only take the O(n) penalty once.  
Well, converting to dchar[] means a full iteration over the entire string to split up the units. Then the program has to allocate space, copy chars over, and add padding. Is it really all that much more efficient? And why should the programmer have to worry about the conversion anyway? Good languages avoid cognitive load on the programmers.
 Or why not just use dchar[] instead of char[] to begin with?
Yes, you could just use dchar[] all the time, but how many people do that? It's very space-inefficient which is the whole reason utf-8 exists. If dchar[]s were meant to be used more often Walter probably would have made them the default string type. Eh, I guess this is just one of those annoying little 'loose ends' I see when I look at D.
Jan 16 2008
parent Dan <murpsoft hotmail.com> writes:
Jarrod Wrote:

 On Wed, 16 Jan 2008 10:27:53 -0500, Steven Schveighoffer wrote:
 
 The algorithmic penalties would be O(n) for an indexed lookup
 instead of O(1).
I understand this, but the compiler could probably optimize this for most situations. Most string access would be sequential and thus positions could be cached on access when need be, and string literals that aren't modified and have all single-byte chars could be optimized into normal indexing. Furthermore, modern processors are incredibly good at sequential iteration and I know from personal experience that they can parse over massive chunks of memory in mere milliseconds (hashing entire executables in memory for potential changes is a common example of this). It shouldn't be noticeable at all to scan over a string. I do believe the author of the article that bearophile linked agrees with me in this regard, in his mention of charAt implementation.
 I think the correct method in this case is to convert to utf32 first,
 then index.  Then at least you only take the O(n) penalty once.  
Well, converting to dchar[] means a full iteration over the entire string to split up the units. Then the program has to allocate space, copy chars over, and add padding. Is it really all that much more efficient? And why should the programmer have to worry about the conversion anyway? Good languages avoid cognitive load on the programmers.
 Or why not just use dchar[] instead of char[] to begin with?
Yes, you could just use dchar[] all the time, but how many people do that? It's very space-inefficient which is the whole reason utf-8 exists. If dchar[]s were meant to be used more often Walter probably would have made them the default string type. Eh, I guess this is just one of those annoying little 'loose ends' I see when I look at D.
Certainly is a whole lot better than the loose ends in other languages; at least we're in UTF and not ASCII (or undefined language). I personally prefer UTF-8. I can write any UTF character in UTF8 if I accept that odd case of a UTF-32 character will be stored as \uXXXX. To be honest, that's acceptable; and gives me the memory savings and O(1) as long as I've got the foresight to predict where the \u's are. I love D's handling of strings, in fact it is my *favorite* feature in D.
Jan 17 2008
prev sibling next sibling parent reply Walter Bright <newshound1 digitalmars.com> writes:
Jarrod wrote:
 While the topic is at hand, I guess I could rant a little;
 Why does D practically *require* the coder to use different forms of UTF 
 encoding?
Because I've worked with internationalized code in C/C++ where the encoding isn't specified, and it's very bad.
 D can tell when a code point spans multiple code units, as evidenced by
 converting a utf-8 string to utf-32 (D knows where to split the sequences
 apart), yet we can't index char[] arrays by code point. Instead, D
 indexes char arrays by fixed-length bytes, which is almost nonsensical
It is impractical (i.e. very inefficient) to index arrays otherwise, especially in getting array lengths, doing slicing, etc. In fact, it is rather rare to need to index by code point. The times you might want to do it are easily handled by foreach (dchar c; str).
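(For the record, a minimal sketch of what that foreach looks like; the string and output format are just for illustration.)

    import std.stdio;

    void main()
    {
        string s = "aΔb";                      // 4 code units, 3 code points
        foreach (dchar c; s)                   // the compiler decodes the UTF-8 on the fly
            writefln("U+%04X", cast(uint) c);  // prints each code point once
    }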
 since the D spec asserts that char[] arrays are designed specifically for 
 unicode characters, and that other single byte arrays should instead be 
 made as a byte[].
 So if this is the case, then why can't the language itself manage multi-
 byte characters for us?
It does, see foreach. In general, I don't think it's a good idea for the language to try to completely hide the multibyte nature of UTF. For example, when you're allocating and copying strings around, you need the byte length, not the number of code points.
 It would make things a hell of a lot easier and 
 more efficient than having to convert /potentially/ foreign strings to 
 utf-32 for a simple manipulation operation, then converting them back.
 The only reason I can think of for char arrays being treated as fixed 
 length is for faster indexing, which is hardly useful in most cases since 
 a lot of the time we don't even know if we're dealing with multi-byte 
 characters when handling strings, so we have to convert and traverse the 
 strings anyway.
 Arg.
I was surprised to discover that most indexing work in strings, such as searching, works more efficiently by *not* trying to index by code points. There are standard library functions in std.utf to index by code points, if you do need it.
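(A rough sketch of that std.utf route; I'm quoting toUTFindex and decode from memory, so treat the exact signatures as approximate.)

    import std.utf;

    char[] s = "caféx".dup;
    size_t byteIdx = toUTFindex(s, 4);   // code unit offset of the 5th code point ('x')
    size_t i = byteIdx;
    dchar c = decode(s, i);              // decodes 'x' and advances i past its code units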
 I know this would probably be a pain to implement, but it would really give
 D a huge leg-up if it could properly and automatically handle strings for
 us, without requiring a bloated string class.
I believe D already has found the right approach to handling UTF strings. All I can say is try it out for a while.
Jan 17 2008
parent reply Jarrod <qwerty ytre.wq> writes:
On Thu, 17 Jan 2008 13:40:12 -0800, Walter Bright wrote:

 Because I've worked with internationalized code in C/C++ where the
 encoding isn't specified, and it's very bad.
I was referring more to the required switching to and from different UTF types just to change a few characters around; I wasn't really referring to letting the programmer decide what kind of string encoding to use.
 It is impractical (i.e. very inefficient) to index arrays otherwise,
 especially in getting array lengths, doing slicing, etc. In fact, it is
 rather rare to index by code units. The times you might want to do it
 are easily handled by foreach(dchar c, string).
Well, yes, I'm sure there's a performance hit for changing how it is indexed, but at the same time who would honestly prefer to index by code unit without first finding the code point boundaries? You're practically stabbing in the dark if you try to slice a char[] array without first iterating over it with foreach to find those boundaries.
 It does, see foreach. In general, I don't think it's a good idea for the
 language to try to completely hide the multibyte nature of UTF. For
 example, when you're allocating and copying strings around, you need the
 byte length, not the number of code points.
string str = "etc";
int strlen = str.length;
int arrsize = str.sizeof;

Seems pretty simple to me. And you don't have to completely hide the multibyte nature. Casting to a byte[] would allow full access to each byte, which might sound hackish, but at the same time manipulating individual code points in a string sounds like you're more than likely doing something just as hackish.
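(For what it's worth, a minimal sketch of both counts side by side - the byte length versus the code point count - using the foreach decoding mentioned earlier; the string is arbitrary.)

    string str = "caféΔ";
    size_t codeUnits = str.length;    // 7: length counts UTF-8 code units (bytes)
    size_t codePoints;
    foreach (dchar c; str)            // decodes one code point at a time
        codePoints++;                 // 5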
 I was surprised to discover that most indexing work in strings, such as
 searching, work more efficiently by *not* trying to index by code
 points. There are standard library functions in std.utf to index by code
 points, if you do need it.
Efficiency at the cost of the programmer. :( Perhaps you could design methods to access a string by either code unit or code point if you see the need to keep index-by-byte behaviour. Something like a toggle method would suit me just fine: str.indexByByte(true);
 I believe D already has found the right approach to handling UTF
 strings. All I can say is try it out for a while.
I am, and it's making working with user-editable config files an annoyance that perl avoids very easily.
Jan 19 2008
parent reply "Janice Caron" <caron800 googlemail.com> writes:
On 1/20/08, Jarrod <qwerty ytre.wq> wrote:
 I am, and it's making working with user-editable config files an
 annoyance that perl avoids very easily.
Could you possibly explain that, for the benefit of those of us who don't speak perl? My limited understanding is that perl was invented before Unicode, and probably even before the wheel, so either it deals with Unicode by not dealing with it at all, or else it's a recent addition to the language (or else I've got it completely wrong - like I said, I don't speak perl). Also, isn't perl an interpreted language? You can get away with a lot more in an interpreted language, but you pay the price in speed. Moreover, working with user-editable config files - I would have thought that a job for a text editor, not a programming language. I'm confused.
Jan 19 2008
parent reply Jarrod <qwerty ytre.wq> writes:
On Sun, 20 Jan 2008 06:41:29 +0000, Janice Caron wrote:

 On 1/20/08, Jarrod <qwerty ytre.wq> wrote:
 I am, and it's making working with user-editable config files an
 annoyance that perl avoids very easily.
Could you possibly explain that, for the benefit of those of us who don't speak perl? My limited understanding is that perl was invented before Unicode, and probably even before the wheel, so either it deals with Unicode by not dealing with it at all, or else it's a recent addition to the language (or else I've got it completely wrong - like I said, I don't speak perl).
Perl is still being constantly updated although it is indeed quite old. And it works quite well with unicode as you would expect from a language that prides itself on text manipulation.
 Also, isn't perl an interpreted language? You can get away with a lot
 more in an interpreted language, but you pay the price in speed.
Yes, it's interpreted, and that does cost it a fair amount of speed, but I see it as a worthwhile trade-off for what it can do with strings.
 Moreover, working with user-editable config files - I would have thought
 that a job for a text editor, not a programming language. I'm confused.
Indeed, you are a tad confused. I'm allowing the user to edit config files so that my GUI application can read it in on startup and use it to populate a dialog display as well as fill out numerous options involving how it deals with a web interface. Because I don't know what the user is going to input, I have to do a fair amount of converting. Yes, this is indeed the main motivation behind this entire rant.
Jan 19 2008
parent reply "Janice Caron" <caron800 googlemail.com> writes:
On 1/20/08, Jarrod <qwerty ytre.wq> wrote:
 Moreover, working with user-editable config files - I would have thought
 that a job for a text editor, not a programming language. I'm confused.
Indeed, you are a tad confused.
Yep. I said so! :-)
 I'm allowing the user to edit config
 files
How? With a GUI interface? With a program written in D? With their favorite text editor of choice? If the latter, then you cannot be sure of the encoding, and that's hardly D's fault!
 so that my GUI application can read it in on startup and use it to
 populate a dialog display as well as fill out numerous options involving
 how it deals with a web interface. Because I don't know what the user is
 going to input I have to do a fair amount of converting.
Right, but converting from one encoding to another is the job of specialised classes. Detecting whether a text file is in ISO-8859-1, or Windows-1252, or MAC-ROMAN, or whatever, is not a trivial task. If your application were going to do that, you'd have to provide the implementation. (Or possibly Tango or some other third party library already provides such converters - I don't know). In any case, it's not a common enough task to warrant built-in language support. But I still don't see what this has got to do with whether or not a[n] should identify the (n+1)th character rather than the (n+1)th code unit.
 Yes, this is indeed the main motivation behind this entire rant.
Cool. So what is the real world use case that necessitates that sequences of UTF-8 code units must be addressable by character index as the default?
Jan 20 2008
parent reply Jarrod <qwerty ytre.wq> writes:
On Sun, 20 Jan 2008 08:04:01 +0000, Janice Caron wrote:

 I'm allowing the user to edit config
 files
How? With a GUI interface? With a program written in D? With their favorite text editor of choice? If the latter, then you cannot be sure of the encoding, and that's hardly D's fault!
It is the latter.
 Right, but converting from one encoding to another is the job of
 specialised classes. Detecting whether a text file is in ISO-8859-1, or
 Windows-1252, or MAC-ROMAN, or whatever, is not a trivial task. If your
 application were going to do that, you'd have to provide the
 implementation. (Or possibly Tango or some other third party library
 already provides such converters - I don't know). In any case, it's not
 a common enough task to warrant built-in language support.
 
 But I still don't see what this has got to do with whether or not a[n]
 should identify the (n+1)th character rather than the (n+1)th code unit.
Because this issue isn't really to do with the input file itself, it's to do with the potential input characters given in the file. As far as I can tell (I'm using a C library to parse the input) it should be ascii or UTF-8 encoding. Anything else would probably cause the C lexer to screw up.
 Cool. So what is the real world use case that necessitates that
 sequences of UTF-8 code units must be addressable by character index as
 the default?
The most important one right now is splicing. I'm allowing both user-defined and program-defined macros in the input data. They can be anywhere within a string, so I need to splice them out and replace them with their correct counterparts. I hear the std lib provided with D is unreliable, so I'm unwilling to use it. Plus, even if it is fixed up, I'd hate to limit string manipulation to regular expressions.

I also wish to cut off input at a certain letter count for spacing issues in both the GUI and dealing with the webscript. I'll have to be converting certain characters to their URI equivalent too, and that will probably take more splicing as well. The other thing I'm using is single-letter replacement. Simple stuff like capitalising letters and replacing spaces with underscores.

I can think of a lot of other situations that would benefit from proper multibyte support too; practically any application that takes frequent user input could benefit. A text editor is a very good example. Any coders who don't natively deal with Latin text would probably benefit greatly too ( think of the poor Japanese coders :< ). I've seen a lot of programs that print a specified number of characters before wrapping around or trailing off, too. The humble gnome console is a good example of that. Very handy to have character indexing in this case. String tokenizing and plain old character counting are two operations I can think of that could probably be done more easily too.

In the end I think I'm just tired of having to jump through hoops when it comes to string manipulation. I want to be able to say 'this is a character, I don't care what it is. Store it, change it, splice it, print it.' But instead it seems that if I don't care what the character type is, it might not fit. Then I have to allocate then store it, find and change it, locate then splice it, convert then print it. Small annoyances build up over time, and I'm pretty sure I'm not insured for blood vessels bursting in my eye.

I live in the hope that one day I'll see something magical happen, and I'll be able to type char chr = 'Δ'; and chr will be a proper utf-8 character that I can print, insert into an array, and change. What a beautiful day that will be.

Welp, I think I'm done ranting for now. Back to screwing around with strings. Or more accurately, procrastinating about screwing around with strings.
Jan 20 2008
next sibling parent reply "Janice Caron" <caron800 googlemail.com> writes:
On 1/20/08, Jarrod <qwerty ytre.wq> wrote:
 But I still don't see what this has got to do with whether or not a[n]
 should identify the (n+1)th character rather than the (n+1)th code unit.
Because this issue isn't really to do with the input file itself, it's to do with the potential input characters given in the file.
You mean the plain text config file of unknown encoding?
 As far as I can
 tell (I'm using a C library to parse the input) it should be ascii or
 UTF-8 encoding.
 Anything else would probably cause the C lexer to screw up.
If it's an unknown encoding, you store it in a ubyte array. Then you identify the encoding, convert it to UTF-8 and store the result in a char array.
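(A rough sketch of that flow - read raw bytes, accept them as UTF-8 if they validate, otherwise fall back to something else. The filename is a placeholder, the Latin-1 fallback is just an example, and the exact exception type thrown by validate is from memory, so the catch is deliberately broad.)

    import std.file;
    import std.utf;

    ubyte[] raw = cast(ubyte[]) std.file.read("config.txt");   // encoding unknown so far

    char[] text;
    try
    {
        validate(cast(char[]) raw);      // throws if raw is not well-formed UTF-8
        text = cast(char[]) raw;
    }
    catch (Exception e)                  // fall back: assume ISO-8859-1
    {
        foreach (ubyte b; raw)
            text ~= cast(dchar) b;       // appending a dchar re-encodes it as UTF-8
    }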
 Cool. So what is the real world use case that necessitates that
 sequences of UTF-8 code units must be addressable by character index as
 the default?
The most important one right now is splicing. I'm allowing both user- defined and program-defined macros in the input data. They can be anywhere within a string, so I need to splice them out and replace them with their correct counterparts.
That works right now with ordinary char arrays. Just use find(), rfind(), etc., and slicing.
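(A minimal sketch of that find-and-slice splicing, using std.string.find; the macro name and replacement text are placeholders.)

    import std.string;

    char[] text = "Hello ${NAME}, welcome back.".dup;
    char[] macroName = "${NAME}".dup;

    int i = find(text, macroName);       // byte index of the match, or -1
    if (i != -1)
        text = text[0 .. i] ~ "Jarrod" ~ text[i + macroName.length .. $];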
 I hear the std lib provided with D is
 unreliable
Huh? Please elucidate.
 so I'm unwilling to use it.
That's your loss, but you can hardly expect Walter to consider adding new language features just because you are unwilling to use Phobos.
 I also wish to cut off input at a certain letter count for spacing issues
 in both the GUI and dealing with the webscript.
Well, I hate to spoil things, but even /characters/ are not sufficient to help you figure out spacing issues. For that, you need to be working at the level of /glyphs/.

For example, consider the word "café". (Just in case that didn't render properly, that's c, a, f, followed by e-with-an-acute-accent.) You can write this as either caf\u00E9, which consists of five UTF-8 code units, or four characters, or four glyphs; or you can write it as cafe\u0301, which consists of six UTF-8 code units, or five characters, or four glyphs. In the first case, the e-acute glyph is represented as a single character; in the second case, it is represented as an e character followed by a combining-acute character.

In other words, even indexing by character is not sufficient to achieve your goals. You need to index by glyph. At some point, you have to say to yourself: wait a minute - this would be better implemented in a library than in the primitive types of the language.
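(To make those counts concrete, a minimal D illustration; the counts in the comments are simply the .length values of the two literals.)

    string precomposed = "caf\u00E9";   // é as one code point:   5 code units, 4 code points
    string decomposed  = "cafe\u0301";  // e + combining acute:   6 code units, 5 code points
    // Both render as the same four glyphs, but .length and foreach (dchar) see them differently.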
 The other thing I'm using is single-letter replacement. Simple stuff like
 capitalising letters and replacing spaces with underscores.
I guess what you're getting at here is that uppercasing a character might result in a UTF-8 string longer than that of the original character. And so it might. On the other hand, if you use a foreach loop to do this sort of thing, your problems are solved.
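(A minimal sketch of that foreach approach - building a new string so that a replacement of a different encoded length can't corrupt anything; the space-to-underscore rule and the function name are just examples.)

    char[] replaceSpaces(char[] input)
    {
        char[] result;
        foreach (dchar c; input)     // decodes one code point at a time
        {
            if (c == ' ')
                c = '_';             // single-character replacement
            result ~= c;             // appending a dchar re-encodes it as UTF-8
        }
        return result;
    }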
 I can think of a lot of other situations that would benefit from proper
 multibyte support too,
UTF-8 support /is/ proper multibyte support. That's why D has it built in.
 ( think of the poor Japanese coders :< ).
Which is why D uses Unicode. Again, I say, D got it right.
 I've seen a lot of programs that print a specified number of characters
 before wrapping around or trailing off, too. The humble gnome console is
 a good example of that. Very handy to have character indexing in this
 case.
I don't agree. This is a problem in font rendering. If you happen to be using a proportional font, then even character counting won't work. You need to be counting rendered width in pixels - an operation which should be generic enough to work for both fixed-width and proportional fonts.
 In the end I think I'm just tired of having to jump through hoops when it
 comes to string manipulation. I want to be able to say 'this is a
 character, I don't care what it is. Store it, change it, splice it, print
 it.'
dchar.
 happen, and I'll be able to type char chr = 'Δ'; and chr will be a proper
 utf-8 character that I can print, insert into an array, and change.
 What a beautiful day that will be.
dchar. Put another way, you want to be insulated from the internal representation. UTF-8 is an implementation detail, whereas what you want is an array of Unicode characters (whose implementation is not necessarily dchar[] but you want to be shielded from it anyway). Again I say, this is a problem for a library class, not a builtin type. And you're probably going to want even higher level abstractions dealing with glyphs too (and then font-rendering tools after that). D allows you to write such libraries. But the builtin types do exactly what it says on the tin. Their behaviour is well-defined, and it's up to the programmer to understand that behaviour.
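(A bare-bones sketch of what such a library wrapper might look like - purely illustrative, not a proposed design; it stores UTF-8 internally and pays O(n) to present a code-point view.)

    class String
    {
        private char[] data;            // stored as UTF-8 internally

        this(char[] s) { data = s; }

        size_t length()                 // length in code points, O(n)
        {
            size_t n;
            foreach (dchar c; data)
                n++;
            return n;
        }

        dchar opIndex(size_t i)         // i-th code point, O(n)
        {
            foreach (dchar c; data)
                if (i-- == 0)
                    return c;
            throw new Exception("String index out of range");
        }

        char[] toUtf8() { return data; }   // escape hatch to the raw code units
    }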
Jan 20 2008
next sibling parent Jarrod <qwerty ytre.wq> writes:
On Sun, 20 Jan 2008 11:45:40 +0000, Janice Caron wrote:

 I hear the std lib provided with D is unreliable
Huh? Please elucidate.
Ah, mistyped my thoughts there. I meant to say std.regexp. At the moment, to me char[] is just byte[], except I guess when it comes to foreach. Pretty watered down when you look at it like that. And yes, I guess a library implementation of string would be fine too. I just figure that since strings are one of the most important data types used in any program, D should probably natively support their multi-byte nature more transparently. Perhaps I have been spoiled by scripting languages after leaving C/++ alone for so long, but it would be very nice to see it happen one way or another. I guess I like my language how I like my coffee: filled with sugar.
Jan 20 2008
prev sibling parent James Dennett <jdennett acm.org> writes:
Janice Caron wrote:
 On 1/20/08, Jarrod <qwerty ytre.wq> wrote:
 But I still don't see what this has got to do with whether or not a[n]
 should identify the (n+1)th character rather than the (n+1)th code unit.
Because this issue isn't really to do with the input file itself, it's to do with the potential input characters given in the file.
You mean the plain text config file of unknown encoding?
Let's stop here. If you don't know the encoding, you can't safely process the file. That's nothing to do with language or library designs. You can't process data whose format you do not know. (Yes, you can employ heuristics to try to guess, but they can be wrong, and in the case of text files there are many files which are valid in numerous encodings but have different meanings.) -- James
Jan 20 2008
prev sibling parent "Kris" <foo bar.com> writes:
Jarrod: you might find something useful in the way the Tango Text class
operates? It attempts to make common operations independent of indexing,
in order to avoid some of these unit/point problems.


"Jarrod" <qwerty ytre.wq> wrote in message 
news:fmv7s7$76h$2 digitalmars.com...
 On Sun, 20 Jan 2008 08:04:01 +0000, Janice Caron wrote:

 I'm allowing the user to edit config
 files
How? With a GUI interface? With a program written in D? With their favorite text editor of choice? If the latter, then you cannot be sure of the encoding, and that's hardly D's fault!
It is the latter.
 Right, but converting from one encoding to another is the job of
 specialised classes. Detecting whether a text file is in ISO-8859-1, or
 Windows-1252, or MAC-ROMAN, or whatever, is not a trivial task. If your
 application were going to do that, you'd have to provide the
 implementation. (Or possibly Tango or some other third party library
 already provides such converters - I don't know). In any case, it's not
 a common enough task to warrant built-in language support.

 But I still don't see what this has got to do with whether or not a[n]
 should identify the (n+1)th character rather than the (n+1)th code unit.
Because this issue isn't really to do with the input file itself, it's to do with the potential input characters given in the file. As far as I can tell (I'm using a C library to parse the input) it should be ascii or UTF-8 encoding. Anything else would probably cause the C lexer to screw up.
 Cool. So what is the real world use case that necessitates that
 sequences of UTF-8 code units must be addressable by character index as
 the default?
The most important one right now is splicing. I'm allowing both user- defined and program-defined macros in the input data. They can be anywhere within a string, so I need to splice them out and replace them with their correct counterparts. I hear the std lib provided with D is unreliable so I'm unwilling to use it. Plus even if it is fixed up I'd hate to limit string manipulation to regular expressions. I also wish to cut off input at a certain letter count for spacing issues in both the GUI and dealing with the webscript. I'll have to be converting certain characters to their URI equivalent too, that will probably take more splicing as well. The other thing I'm using is single-letter replacement. Simple stuff like capitalising letters and replacing spaces with underscores. I can think of a lot of other situations that would benefit from proper multibyte support too, for instance practically any application that takes frequent user input could benefit. A text editor is a very good example. Any coders who don't natively deal with Latin text would probably benefit greatly too ( think of the poor Japanese coders :< ). I've seen a lot of programs that print a specified number of characters before wrapping around or trailing off, too. The humble gnome console is a good example of that. Very handy to have character indexing in this case. String tokenizing and plain old character counting are two operations I can think of that could probably be done easier too. In the end I think I'm just tired of having to jump through hoops when it comes to string manipulation. I want to be able to say 'this is a character, I don't care what it is. Store it, change it, splice it, print it.' But instead it seems if I don't care what the character type it, it might not fit. Then I have to allocate then store it, find and change it, locate then splice it, convert then print it. Small annoyances build up over time and I'm pretty sure I'm not insured for blood vessels bursting in my eye. I live in the hope that one day in the future I'll see something magical happen, and I'll be able to type char chr = '?'; and chr will be a proper utf-8 character that I can print, insert into an array, and change. What a beautiful day that will be. Welp, I think I'm done ranting for now. Back to screwing around with strings. Or more accurately, procrastinating about screwing around with strings.
Jan 20 2008
prev sibling parent reply "Janice Caron" <caron800 googlemail.com> writes:
On 1/16/08, Jarrod <qwerty ytre.wq> wrote:
 On Tue, 15 Jan 2008 21:23:31 -0500, bearophile wrote:

 So if this is the case, then why can't the language itself manage multi-
 byte characters for us? It would make things a hell of a lot easier and
 more efficient than having to convert /potentially/ foreign strings to
 utf-32 for a simple manipulation operation, then converting them back.
 The only reason I can think of for char arrays being treated as fixed
 length is for faster indexing, which is hardly useful in most cases since
 a lot of the time we don't even know if we're dealing with multi-byte
 characters when handling strings, so we have to convert and traverse the
 strings anyway.
Because, think about this:

char[] a = new char[8];

If a char array were indexed by character instead of by code unit, as you suggest, how many bytes would the compiler need to allocate? It can't know in advance. Also:

char[] a = "abcd";
char[] b = "\u20AC";
a[0] = b[0];

would cause big problems. (Would a[1] get overwritten? Would a have to be resized and everything shifted up one byte?) I think D has got it right. Use wchar or dchar when you need character-based indexing.
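(For completeness, a minimal sketch of the dchar route for that last example - convert, assign per character, re-encode - using std.utf's toUTF32 and toUTF8; the strings are the ones from above.)

    import std.utf;

    char[] a = "abcd".dup;
    char[] b = "\u20AC".dup;      // the euro sign: one code point, three code units

    dchar[] da = toUTF32(a);
    dchar[] db = toUTF32(b);
    da[0] = db[0];                // character-level assignment is safe here
    a = toUTF8(da);               // re-encode; a's byte length grows from 4 to 6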
Jan 18 2008
parent reply James Dennett <jdennett acm.org> writes:
Janice Caron wrote:
 On 1/16/08, Jarrod <qwerty ytre.wq> wrote:
 On Tue, 15 Jan 2008 21:23:31 -0500, bearophile wrote:

 So if this is the case, then why can't the language itself manage multi-
 byte characters for us? It would make things a hell of a lot easier and
 more efficient than having to convert /potentially/ foreign strings to
 utf-32 for a simple manipulation operation, then converting them back.
 The only reason I can think of for char arrays being treated as fixed
 length is for faster indexing, which is hardly useful in most cases since
 a lot of the time we don't even know if we're dealing with multi-byte
 characters when handling strings, so we have to convert and traverse the
 strings anyway.
Because, think about this: char[] a = new char[8]; If a char array were indexed by character instead of codeunit, as you suggest, how many bytes would the compiler need to allocate?
8. That's independent of how they are indexed.
 It can't
 know in advance. 
Yup, it can. 8.
 Also:
 
     char[] a = "abcd";
     char[] b = "\u20AC";
     a[0] = b[0];
 
 would cause big problems. (Would a[1] get overwritten? Would a have to
 be resized and everything shifted up one byte?)
It causes big problems with byte-wise addressing, because you can end up with a char[] (which is specified to hold a UTF8 string) which does not contain a UTF8 string.
 I think D has got it right. Use wchar or dchar when you need character
 based indexing.
If you have UTF8, you should not be allowed to access the bytes that make it up, only its characters. If you want a bunch of bytes, use an array of bytes. (On the other hand, D is weak here because it identifies UTF8 strings with arrays of char, but char doesn't hold a UTF8 character. I can't imagine persuading Walter that this is a horrible error is going to work though.) -- James
Jan 18 2008
next sibling parent reply "Janice Caron" <caron800 googlemail.com> writes:
On 1/19/08, James Dennett <jdennett acm.org> wrote:
 Because, think about this:

     char[] a = new char[8];

 If a char array were indexed by character instead of codeunit, as you
 suggest, how many bytes would the compiler need to allocate?
8. That's independent of how they are indexed.
My *rhetorical* question was in response to a poster who suggested that char arrays should be indexed by CHARACTER. I know perfectly well that in reality they are indexed by UTF-8 code unit, and therefore that the correct answer is eight. I was giving additional reasons /why/ indexing by character was not sensible.
 It can't
 know in advance.
Yup, it can. 8.
Of course. However, /if/ they were indexed by character, /then/ it would be impossible to know in advance how many UTF-8 code units it would take to construct eight characters. Which is a good reason why they should /not/ be indexed by character. As far as I am concerned, D has got it right.
 Also:

     char[] a = "abcd";
     char[] b = "\u20AC";
     a[0] = b[0];

 would cause big problems. (Would a[1] get overwritten? Would a have to
 be resized and everything shifted up one byte?)
It causes big problems with byte-wise addressing, because you can end up with a char[] (which is specified to hold a UTF8 string) which does not contain a UTF8 string.
That's what I said. Thank you for reiterating it.
 I think D has got it right. Use wchar or dchar when you need character
 based indexing.
If you have UTF8, you should not be allowed to access the bytes that make it up, only its characters.
I absolutely /should/ be (and am) allowed to access the UTF-8 code units stored within an array of UTF-8 code units. This is absolutely as it should be, thank you very much.
 If you want a bunch of
 bytes, use an array of bytes.
Of course. And if you want an array of UTF-8 code units, use a char[] or a string.
 (On the other hand, D is weak here
 because it identifies UTF8 strings with arrays of char, but char
 doesn't hold a UTF8 character.
I believe you are wrong. I would say that D is strong here, because it identifies UTF-8 strings with arrays of char, because char /does/ hold a UTF-8 code unit. The mistake is assuming that code unit == character. It does not. (However, there is overlap in the ASCII range, 0x00 to 0x7F, so the assumption is still valid if you are certain that your strings are ASCII). Once you've grokked that char[] == array of UTF-8 code unit, everything else falls into place and makes sense.
Jan 19 2008
parent reply James Dennett <jdennett acm.org> writes:
Janice Caron wrote:
 On 1/19/08, James Dennett <jdennett acm.org> wrote:
 Because, think about this:

     char[] a = new char[8];

 If a char array were indexed by character instead of codeunit, as you
 suggest, how many bytes would the compiler need to allocate?
8. That's independent of how they are indexed.
My *rhetorical* question was in response to a poster who suggested that char arrays should be indexed by CHARACTER.
Yes, I know. If you don't see that my post answered yours, feel free to ask me to explain more clearly. Indexing by character doesn't automatically tell us anything about how the size specified for new[] works. It would be a little quirky for the size to be in bytes while the indices are in characters, but it's quirky for char[] to pretend to be a UTF8 type without enforcing that.
 I know perfectly well
 that in reality they are indexed by UTF-8 code unit, and therefore
 that the correct answer is eight. I was giving additional reasons
 /why/ indexing by character was not sensible.
I was refuting your claimed reason.
 
 It can't
 know in advance.
Yup, it can. 8.
Of course. However, /if/ they were indexed by character, /then/ it would be impossible to know in advance how many UTF-8 code units it would take to construct eight characters.
Sure. So it wouldn't be a request for 8 characters.
 Which is a good reason why
 they should /not/ be indexed by character.
I *still* disagree with that.
 As far as I am concerned, D has got it right.
Enjoy ;)
 Also:

     char[] a = "abcd";
     char[] b = "\u20AC";
     a[0] = b[0];

 would cause big problems. (Would a[1] get overwritten? Would a have to
 be resized and everything shifted up one byte?)
It causes big problems with byte-wise addressing, because you can end up with a char[] (which is specified to hold a UTF8 string) which does not contain a UTF8 string.
That's what I said. Thank you for reiterating it.
I think we are *disagreeing* here. I claim that this causes problems _with the current design of D_, which would be resolved if char[] (or however we denote mutable UTF8 strings) string were really a UTF8 type.
 I think D has got it right. Use wchar or dchar when you need character
 based indexing.
If you have UTF8, you should not be allowed to access the bytes that make it up, only its characters.
I absolutely /should/ be (and am) allowed to access the UTF-8 code units stored within an array of UTF-8 code units. This is absolutely as it should be, thank you very much.
But you already illustrated why it should not: you can break the invariant that there is valid UTF8 there, so char[] is lying when it says that it is a UTF8 type.
 
 If you want a bunch of
 bytes, use an array of bytes.
Of course. And if you want an array of UTF-8 code units, use a char[] or a string.
If you're at the level of code units, you can't be sure that they're part of UTF8 characters unless there is a higher-level invariant enforced on them.
 (On the other hand, D is weak here
 because it identifies UTF8 strings with arrays of char, but char
 doesn't hold a UTF8 character.
I believe you are wrong. I would say that D is strong here, because it identifies UTF-8 strings with arrays of char, because char /does/ hold a UTF-8 code unit.
That's the problem. char[] can hold non-UTF8 strings.
 The mistake is assuming that code unit == character.
If you're telling me that I made that mistake, you're sorely mistaken.
 It does not.
 (However, there is overlap in the ASCII range, 0x00 to 0x7F, so the
 assumption is still valid if you are certain that your strings are
 ASCII).
 
 Once you've grokked that char[] == array of UTF-8 code unit,
 everything else falls into place and makes sense.
No, it does not. It's precisely that difference which makes D's char[] a poor man's UTF-8 string. -- James
Jan 19 2008
parent reply "Janice Caron" <caron800 googlemail.com> writes:
On 1/19/08, James Dennett <jdennett acm.org> wrote:
     char[] a = "abcd";
     char[] b = "\u20AC";
     a[0] = b[0];
I think we are *disagreeing* here. I claim that this causes problems _with the current design of D_, which would be resolved if char[] (or however we denote mutable UTF8 strings) string were really a UTF8 type.
So you're saying that in your new design, after that assignment, a would equal "\u20ACbcd". The problem is that the compiler would have to allocate extra bytes and then memcpy all the bytes up a bit to make room. That strikes me as kinda slow, which is not something I'd want in a char array.
 That's the problem.  char[] can hold non-UTF8 strings.
Yes, that is possible. But only in buggy code, of course. That really raises the question: is it the compiler's job, or the programmer's, to ensure that the contract is maintained? I don't really have any problem taking responsibility for maintaining UTF-8 correctness. (It's not hard). But if you want to be completely protected from those kinds of errors, I still don't see the problem with using dchar.
 No, it does not.  It's precisely the difference that is why
 D's char[] is a poor man's UTF8 string.
I suppose a library class could be written whose interface behaved like a dchar array, but whose implementation was UTF-8. But when would you ever use it?
Jan 19 2008
parent reply Jarrod <qwerty ytre.wq> writes:
On Sat, 19 Jan 2008 20:21:49 +0000, Janice Caron wrote:

 So you're saying that in your new design, after that assignment, a would
 equal "\u20ACbcd". The problem is that the compiler would have to
 allocate extra bytes and then memcpy all the bytes up a bit to make
 room. That strikes me as kinda slow, which is not something I'd want in
 a char array.
Well how else would you like it to be done? If you were writing something that took a text input much like this very window I'm typing in right now, and the user hit back a few times and input a multi-byte character, how would you deal with it? Allow it to overlap? No. dchars? That's a lot of wasted memory, and it basically makes me wonder why utf-8 even exists if it needs to be dropped for simple text manipulation. May as well stick with utf-32 and ascii. No sir, I don't like it.
Jan 19 2008
parent "Janice Caron" <caron800 googlemail.com> writes:
On 1/20/08, Jarrod <qwerty ytre.wq> wrote:
 If you were writing something that took a text input much like this very
 window I'm typing in right now, and the user hit back a few times and
 input a multi-byte character, how would you deal with it?
I'd write a class, of course. It is simple (though not trivial) to step through the bytes of UTF-8. Bytes in the range 00 to 7F are ASCII; bytes in the range 80 to BF are tail bytes; bytes in the range C0 to F7 are head bytes; and bytes in the range F8 to FF are illegal. Identifying multi-byte sequences is therefore easy. You can make an argument that functions and/or classes to do this sort of thing should perhaps pre-exist in Phobos, but to say it should be built into /the language itself/ ... that's going a bit too far, I feel.

 Allow it to overlap? No. dchars? That's a lot of wasted memory, and it
 basically makes me wonder why utf-8 even exists if it needs to be dropped
 for simple text manipulation. May as well stick with utf-32 and ascii.

 No sir, I don't like it.
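(A minimal sketch of that byte-stepping, applied to the "user hit backspace" case quoted above: drop the last code point by skipping back over tail bytes. It assumes s is non-empty, valid UTF-8; the string itself is arbitrary.)

    char[] s = "caféΔ".dup;
    size_t i = s.length;
    do
    {
        i--;                                     // step back one byte
    } while (i > 0 && (s[i] & 0xC0) == 0x80);    // 80..BF are tail bytes: keep going
    s = s[0 .. i];                               // now ends after "café"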
Jan 19 2008
prev sibling parent reply Walter Bright <newshound1 digitalmars.com> writes:
James Dennett wrote:
 (On the other hand, D is weak here
 because it identifies UTF8 strings with arrays of char, but char
 doesn't hold a UTF8 character.  I can't imagine persuading Walter
 that this is a horrible error is going to work though.)
I've actually done considerable work with UTF-8, both in C++ and D. D's method of dealing with it works out very well (and very naturally). This is why you'll have a hard time persuading me otherwise <g>. Note that C++0x is doing things similarly: http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2007/n2249.html
Jan 20 2008
parent reply James Dennett <jdennett acm.org> writes:
Walter Bright wrote:
 James Dennett wrote:
 (On the other hand, D is weak here
 because it identifies UTF8 strings with arrays of char, but char
 doesn't hold a UTF8 character.  I can't imagine persuading Walter
 that this is a horrible error is going to work though.)
I've actually done considerable work with UTF-8, both in C++ and D.
Yes, by this stage most serious programmers have had to learn in some detail how to work with UTF-8.
 D's 
 method of dealing with it works out very well (and very naturally).
I've given specific problems with it. I've heard no refutation of them. D uses essentially a model of UTF8 which is really just a bunch-of-bytes with smart iteration. C-based projects on which I worked in the 90's did similarly, but with coding conventions that banned direct access to the bytes.
 This is why you'll have a hard time persuading me otherwise <g>.
Because you assert that there's not a problem? ;)
 Note that C++0x is doing things similarly:
 
 http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2007/n2249.html
Looks very different to me. There's no conflation of char with a code unit of UTF8 (and indeed C++ deliberately supports use of varied encodings for multi-byte characters). Yes, C++ is adding 16- and 32-bit character types which are more akin to D's, but that has little bearing on how differently it handles multi-byte (as opposed to wide-character) strings. -- James
Jan 20 2008
next sibling parent reply "Janice Caron" <caron800 googlemail.com> writes:
On 1/20/08, James Dennett <jdennett acm.org> wrote:
 Looks very different to me.
I thought it looked very similar indeed to D, but there you go. Funny how two different people can read the same document and interpret it in two different ways.
 There's no conflation of char with a
 code unit of UTF8
C has no ubyte type. Since time immemorial, C programmers have been using the char type to store every 8-bit wide data type under the sun simply because there's been no alternative (until recently, when int8_t showed up as a typedef for char). That's not a big deal.
 (and indeed C++ deliberately supports use of
 varied encodings for multi-byte characters).
I must have misread the heading that says "Require UTF", and whose text reads "The C TR makes the encoding of char16_t and char32_t implementation-defined. It also provides macros to indicate whether or not the encoding is UTF. In contrast, this proposal requires UTF encoding." Oh, I see what you're saying - C++ would require UTF for wchar and dchar, but not for char. Well, that's historical legacy for you.
 Yes, C++ is adding
 16- and 32-bit character types which are more akin to D's, but that
 has little bearing on how differently it handles multi-byte (as
 opposed to wide-character) strings.
So it has a bunch of procedural functions instead of foreach. Apart from that, the approach seems the same as D. Where's the difference?
Jan 20 2008
parent James Dennett <jdennett acm.org> writes:
Janice Caron wrote:
 On 1/20/08, James Dennett <jdennett acm.org> wrote:
 Looks very different to me.
I thought it looked very similar indeed to D, but there you go. Funny how two different people can read the same document and interpret it in two different ways.
The core issue here, to me, is D's half-hearted attempt to paint char[] as a Unicode string type. C++ has nothing analogous.
 There's no conflation of char with a
 code unit of UTF8
C has no ubyte type. Since time immemorial, C programmers have been using the char type to store every 8-bit wide data type under the sun simply because there's been no alternative (until recently, when int8_t showed up as a typedef for char).
int8_t is necessarily signed, a la "signed char", not a typedef for "char", whose signedness varies (but, unfortunately, is often signed in C and C++).
 That's not a big deal.
 
 
 (and indeed C++ deliberately supports use of
 varied encodings for multi-byte characters).
I must have misread the heading that says "Require UTF", and whose text reads "The C TR makes the encoding of char16_t and char32_t implementation-defined. It also provides macros to indicate whether or not the encoding is UTF. In contrast, this proposal requires UTF encoding." Oh, I see what you're saying - C++ would require UTF for wchar and dchar, but not for char. Well, that's historical legacy for you.
And it's the real world; computer systems need to interface with existing systems which use diverse encodings.
 Yes, C++ is adding
 16- and 32-bit character types which are more akin to D's, but that
 has little bearing on how differently it handles multi-byte (as
 opposed to wide-character) strings.
So it has a bunch of procedural functions instead of foreach. Apart from that, the approach seems the same as D. Where's the difference?
Philosophy: D pushes char[] as if it were a proper UTF8 facility, and goes a small step towards adding language support for that. C++ recognizes diversity in multi-byte character encodings, and doesn't make the language promote one over any other. It admits up-front that you're dealing with code units if you want to work with multi-byte characters. C++ is a long, long way from perfect when it comes to Unicode support. Even C++0x will be. But I'm hoping for more from D, and what I see so far can stand some improvement. -- James
Jan 20 2008
prev sibling parent reply Walter Bright <newshound1 digitalmars.com> writes:
James Dennett wrote:
 I've given specific problems with it.  I've heard no refutation
 of them.
It's hard to describe, but after working with UTF-8 for a while, they are just non-problems. Code isn't written that way. If you want, you can create a String class which wraps a char[] and treats it at the level you wish.
 D uses essentially a model of UTF8 which is really just
 a bunch-of-bytes with smart iteration.
That's what UTF-8 is.
 C-based projects on which
 I worked in the 90's did similarly, but with coding conventions
 that banned direct access to the bytes.
Coding conventions are one thing, but banning things in a systems language are quite another. Copying a UTF-8 string by decoding and encoding the characters one-by-one is unacceptably inefficient, for example, compared with just memcpy. Searching a UTF-8 string for a substring is another operation for which treating it like a bag of bytes works best.
 This is why you'll have a hard time persuading me otherwise <g>.
Because you assert that there's not a problem? ;)
Because I know it works based on experience.
 Note that C++0x is doing things similarly:

 http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2007/n2249.html
Looks very different to me. There's no conflation of char with a code unit of UTF8 (and indeed C++ deliberately supports use of varied encodings for multi-byte characters). Yes, C++ is adding 16- and 32-bit character types which are more akin to D's, but that has little bearing on how differently it handles multi-byte (as opposed to wide-character) strings.
Since, in the C++ proposal, indexing and length is done by byte/word/dword, not by code point, it's semantically equivalent. I don't see any banning of getting at the underlying representation, nor any attempt to hide it.
Jan 20 2008
parent reply James Dennett <jdennett acm.org> writes:
Walter Bright wrote:
 James Dennett wrote:
 I've given specific problems with it.  I've heard no refutation
 of them.
It's hard to describe, but after working with UTF-8 for a while, they are just non-problems. Code isn't written that way. If you want, you can create a String class which wraps a char[] and treats it at the level you wish.
Indeed, but such a thing should be standard, not reinvented over and over.
 D uses essentially a model of UTF8 which is really just
 a bunch-of-bytes with smart iteration.
That's what UTF-8 is.
That view has led to many security issues, where different software reacts differently to byte strings which are not valid UTF-8 in places where UTF-8 is expected.
 C-based projects on which
 I worked in the 90's did similarly, but with coding conventions
 that banned direct access to the bytes.
Coding conventions are one thing, but banning things in a systems language are quite another. Copying a UTF-8 string by decoding and encoding the characters one-by-one is unacceptably inefficient, for example, compared with just memcpy. Searching a UTF-8 string for a substring is another operation for which treating it like a bag of bytes works best.
There are alternatives; explicit notation to access the bytes, which *doesn't* look like it's accessing characters, would be better. (char doesn't represent a character in D. Not great naming? But then D almost follows C in this, where char did double duty as a limited character type and a small integral type.)
 This is why you'll have a hard time persuading me otherwise <g>.
Because you assert that there's not a problem? ;)
Because I know it works based on experience.
And I know, based on experience, of problems with it. So how do we get past this to discuss things more objectively? (Of course, we don't have to. You're the BDFL, and you get to make the call, and try to keep D coherent in the face of a hundred people pushing inconsistent views for how it should evolve. I get the easy job of being just one of those voices.)
 Note that C++0x is doing things similarly:

 http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2007/n2249.html
Looks very different to me. There's no conflation of char with a code unit of UTF8 (and indeed C++ deliberately supports use of varied encodings for multi-byte characters). Yes, C++ is adding 16- and 32-bit character types which are more akin to D's, but that has little bearing on how differently it handles multi-byte (as opposed to wide-character) strings.
Since, in the C++ proposal, indexing and length are done by byte/word/dword, not by code point, it's semantically equivalent. I don't see any banning of getting at the underlying representation, nor any attempt to hide it.
Whereas D partly attempts to hide it; the mathematician in me hates this kind of fence-sitting. But let's get more concrete: suppose D code finds that an alleged char[] passed to it is, in fact, broken (i.e., violates the UTF8 invariants). What should it do -- abort, throw an exception, offer a policy for handling such bugs, other? -- James
Jan 20 2008
next sibling parent reply "Janice Caron" <caron800 googlemail.com> writes:
On Jan 20, 2008 11:01 PM, James Dennett <jdennett acm.org> wrote:
 That view has lead to many security issues, where different
 software reacts differently to byte strings which are not
 valid UTF-8 in places where UTF-8 is expected.
Such input should always be rejected. D will throw an exception, which is the right thing to do. If a programmer wants to be more flexible, they can always catch the exception and delete invalid sequences.

Security issues have arisen as a result of what are called non-shortest sequences. For example, the slash character is represented in UTF-8 as 2F. Some hackers have attempted to get past certain filters by representing the slash character as C0 AF. This is not valid UTF-8 (because UTF-8 forbids non-shortest sequences), but a buggy implementation might get that wrong and interpret it as '\u002F'.

The important point that I want to make here is that *D GETS IT RIGHT*. D's implementation will throw an exception on all invalid UTF-8 sequences, and this will block all such security issues. The only way they can resurface is if you hand-code your own UTF handling. So long as you stick to the built-in UTF-handling stuff which D provides, you will not encounter these security issues.

Other security issues arise as a result of Unicode itself, not UTF-8. This is because Unicode is such a large character set - which makes it really good for phishing attacks. After all, if you spell "amazon" with the Greek letter lowercase omicron instead of Latin lowercase o, who's going to notice? However, this is not D's problem - it's a problem for browser writers, and one they will encounter regardless of what programming language they use. (Similar issues arise if browsers fail to convert URLs to Normalisation Form C, but again, that would be a browser problem, not a D problem. It would also be a bug.)
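(A small sketch of that rejection in code, assuming std.utf.validate; the exception class is spelled UtfException in the Phobos of this era and UTFException in later versions, so the sketch just catches Exception:)

    import std.utf;

    bool accepts(char[] s)
    {
        try
        {
            validate(s);       // throws on any malformed UTF-8 sequence
            return true;
        }
        catch (Exception e)    // UtfException here, UTFException in later Phobos
        {
            return false;
        }
    }

    void demo()
    {
        assert(accepts("/".dup));                            // plain 2F is fine
        assert(!accepts([cast(char)0xC0, cast(char)0xAF]));   // overlong '/' is rejected
    }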
 (char doesn't represent a character in D.  Not great
 naming?
It's reasonable naming, given that UTF-8 code units in the range 00 to 7F do, in fact, correspond to (ASCII) characters. OK, so it's inappropriately named for holding values 80 to FF, but alternatives such as codeunit, or utf8, would probably not catch on so easily.
 But let's get more concrete:
 suppose D code finds that an alleged char[] passed to it is, in
 fact, broken (i.e., violates the UTF8 invariants).  What should
 it do -- abort, throw an exception, offer a policy for handling
 such bugs, other?
It should, and does, throw an exception. Your program may catch the exception, but it should reject the input.
Jan 21 2008
parent reply Matti Niemenmaa <see_signature for.real.address> writes:
Janice Caron wrote:
 The important point that I want to make here is that *D GETS
 IT RIGHT*. D's implementation will throw an exception on all invalid
 UTF-8 sequences, and this will block all such security issues.
Well, mostly right: http://d.puremagic.com/issues/show_bug.cgi?id=978 -- E-mail address: matti.niemenmaa+news, domain is iki (DOT) fi
Jan 21 2008
parent reply "Janice Caron" <caron800 googlemail.com> writes:
On Jan 21, 2008 8:47 AM, Matti Niemenmaa <see_signature for.real.address> wrote:
 Janice Caron wrote:
 The important point that I want to make here is that *D GETS
 IT RIGHT*. D's implementation will throw an exception on all invalid
 UTF-8 sequences, and this will block all such security issues.
Well, mostly right: http://d.puremagic.com/issues/show_bug.cgi?id=978
Oooh - well spotted! In that case, I amend my statement. D /will/ get it right, once this bug is fixed. One would hope that said bug will be fixed in the next release.
Jan 21 2008
parent Matti Niemenmaa <see_signature for.real.address> writes:
Janice Caron wrote:
 On Jan 21, 2008 8:47 AM, Matti Niemenmaa <see_signature for.real.address>
wrote:
 Janice Caron wrote:
 The important point that I want to make here is that *D GETS
 IT RIGHT*. D's implementation will throw an exception on all invalid
 UTF-8 sequences, and this will block all such security issues.
Well, mostly right: http://d.puremagic.com/issues/show_bug.cgi?id=978
Oooh - well spotted! In that case, I amend my statement. D /will/ get it right, once this bug is fixed. One would hope that said bug will be fixed in the next release.
You'll note the bug is a year old. Although that doesn't change the fact that one would, indeed, hope for that. :-) -- E-mail address: matti.niemenmaa+news, domain is iki (DOT) fi
Jan 21 2008
prev sibling parent "Janice Caron" <caron800 googlemail.com> writes:
On Jan 21, 2008 8:11 AM, Janice Caron <caron800 googlemail.com> wrote:
 But let's get more concrete:
 suppose D code finds that an alleged char[] passed to it is, in
 fact, broken (i.e., violates the UTF8 invariants).  What should
 it do -- abort, throw an exception, offer a policy for handling
 such bugs, other?
It should, and does, throw an exception. Your program may catch the exception, but it should reject the input.
In fact, this goes to the heart of almost all modern security problems (SQL injection, buffer overruns, etc.). The golden rule is that *ALL* untrusted input must be sanitised. Every time you don't do that, you provide an opportunity for hackers. But at least in the case of UTF-8, it's easy - just let D validate it. If it doesn't validate, throw it out.
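(In code, that golden rule comes down to a single call at the trust boundary - a sketch with a hypothetical function name:)

    import std.utf;

    // Hypothetical boundary function: everything downstream of this call
    // may assume the char[] really is well-formed UTF-8.
    char[] acceptUntrusted(char[] raw)
    {
        validate(raw);    // throws on malformed input - let the caller reject it
        return raw;
    }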
Jan 21 2008
prev sibling parent reply bearophile <bearophileHUGS lycos.com> writes:
Janice Caron:
 Also, isn't perl an interpreted language? You can get away with a lot
 more in an interpreted language, but you pay the price in speed.
I'm not a Perl expert, and I don't know how well Perl manages Unicode (maybe Python manages Unicode better than Perl), but Perl was designed to process text, so if you process strings you will find that Perl is pretty *fast*; it's easy to write Perl programs that process text faster (and in a more flexible way) than C++ ones... (Note that Python 3.0 will use Unicode strings by default.)

For example, if you use Python dicts (AAs) with string keys they seem faster than current DMD AAs, and probably that's true for Perl ones too. This was a tiny example:
http://www.digitalmars.com/webnews/newsgroups.php?art_group=digitalmars.D&article_id=57986

Perl and Python have well-refined GCs, which may be faster than the current DMD GC if you manage a lot of strings; this was an example where D was slower than Python too:
http://www.digitalmars.com/webnews/newsgroups.php?art_group=digitalmars.D&article_id=62369

With Python you can also use Psyco, which is a JIT, to speed things up, etc. Psyco uses tricks to avoid actually copying strings and string slices in most cases, because Python strings are immutable (Python copies them when you perform a slice), as D's strings are too.

REs in the current DMD are *way* slower than Perl/Python/Tcl ones, etc. Some time ago I found a situation where the RE sub() of D appears to be O(n^2):
http://shootout.alioth.debian.org/gp4/benchmark.php?test=regexdna&lang=dlang&id=4

The string methods of Python are written in really refined C, like this one:
http://effbot.org/zone/stringlib.htm
and they are usually faster than the less refined versions you can find in the current Phobos. I have implemented, and use, a fastJoin, an xsplit, etc. that are faster than the Phobos ones.

The built-in sort of Python is Timsort, which is way faster than the D built-in (I have written a rather simple sort that is up to 3 times faster than the built-in one in D, and it's always faster no matter what data I use).

Now and then the text I/O on disk of the current DMD is slower than Python's; this comes from some of my benchmarks.

I know all those parts of DMD can be improved later. When you create a new language you can't (and don't want to) optimize every little bit (because it may be premature optimization); optimization must come later, so I understand Walter in this regard. But all this is just to show you that if today you have to process a lot of text in a very flexible way it's not easy to beat the languages like Perl (and Python/Ruby/Tcl too; Ruby is less good than Python for Unicode texts, I think) designed for it.

If you take a look near the bottom of this thread:
http://groups.google.com/group/comp.lang.python/browse_thread/thread/0b3ded6d0f494d06/0068cb1406ab9e4c
you can see that I'd like to use D to speed up some text-processing-related bioinformatics scripts of mine, but often I find that the Python programs are faster for that purpose ;-)

Bye,
bearophile
Jan 19 2008
next sibling parent Walter Bright <newshound1 digitalmars.com> writes:
bearophile wrote:
 The built-in sort of Python is Timsort, which is way faster than
 the D built-in (I have written a rather simple sort that is up to 3
 times faster than the built-in one in D, and it's always faster no
 matter what data I use).
D's sort is in phobos/internal/qsort.d and qsort.d. If you have a faster qsort, and want to contribute it, please do so! Same goes for other faster routines you've written.
 Now and then the text I/O on disk of the current DMD is slower than
 Python's; this comes from some of my benchmarks.
The D 2.0 I/O is much faster than the 1.0 I/O. But it still suffers a bit from the requirement (I imposed) of being compatible with C stdio. I don't know if Python does this or not.
 
 I know all those parts of DMD can be improved later. When you create
 a new language you can't (and you don't want to) optimize every
 little bit (because it may be premature optimization); optimization
 must come later, so I understand Walter in this regard. But all this
 is just to show you that if today you have to process a lot of text in
 a very flexible way it's not easy to beat the languages like Perl
 (and Python/Ruby/Tcl too; Ruby is less good than Python for Unicode
 texts, I think) designed for it.
I don't believe there are any fundamental reasons why D string processing should be slower; it's just a matter of spending the effort on it.
Jan 20 2008
prev sibling parent Sean Kelly <sean f4.ca> writes:
bearophile wrote:
 The built-in sort of Python is Timsort, which is way faster than the D
built-in (I have written a rather simple sort that is up to 3 times faster than
the built-in one in D, and it's always faster no matter what data I use).
I'd be interested in seeing that. I've been able to beat the D sort for some data sets and match it in others, but not beat it across the board. Sean
Jan 20 2008