www.digitalmars.com         C & C++   DMDScript  

digitalmars.D - Why string alias is invariant ?

reply Sergey Gromov <snake.scaly gmail.com> writes:
Really, why ?  You can't assign a dynamically created data to a variable of
type string.  You can't pass a dynamically created data into a function
receiving string !  If you wanted efficiency, if you wanted to guarantee that
anybody receiving string can receive a char literal without duplication,---make
it const then.  Invariant casts to const(char)[] implicitly, so as char[].  And
even then, I'd prefer plain old char[] for string and foo(in string s) for a
function that guarantees not to change it.

Right now I'm having trouble calling std.file.listdir() just because it
receives string.

P.S.  Many thought mus be put into choosing a return type for a library
function.  Because if it returns a unique copy of data it must be char[] so
that i'm free to modify it.  The const(char)[] or const(char[]) must only be
used if the function is not sure if the returned result is unique.

SnakE
Jan 30 2008
next sibling parent reply Sean Kelly <sean f4.ca> writes:
Sergey Gromov wrote:
 Really, why ?  You can't assign a dynamically created data to a variable of
type string.  You can't pass a dynamically created data into a function
receiving string !  If you wanted efficiency, if you wanted to guarantee that
anybody receiving string can receive a char literal without duplication,---make
it const then.  Invariant casts to const(char)[] implicitly, so as char[].  And
even then, I'd prefer plain old char[] for string and foo(in string s) for a
function that guarantees not to change it.

I think the hope is that 'string' will place us in a position to get functional programming-type optimizations "for free" once the language and compiler are in a position to provide them. Such optimizations typically aren't possible with "const char[]" because the compiler must assume that the string data could change unexpectedly. In essence, I believe that 'invariant' is intended for optimization while 'const' is intended for documenting/enforcing API contracts. Sean
Jan 30 2008
parent reply Sergey Gromov <snake.scaly gmail.com> writes:
Sean Kelly Wrote:
 Sergey Gromov wrote:
 Really, why ?  You can't assign a dynamically created data to a variable of
type string.  You can't pass a dynamically created data into a function
receiving string !  If you wanted efficiency, if you wanted to guarantee that
anybody receiving string can receive a char literal without duplication,---make
it const then.  Invariant casts to const(char)[] implicitly, so as char[].  And
even then, I'd prefer plain old char[] for string and foo(in string s) for a
function that guarantees not to change it.

I think the hope is that 'string' will place us in a position to get functional programming-type optimizations "for free" once the language and compiler are in a position to provide them. Such optimizations typically aren't possible with "const char[]" because the compiler must assume that the string data could change unexpectedly. In essence, I believe that 'invariant' is intended for optimization while 'const' is intended for documenting/enforcing API contracts.

I don't quite get it. If 'const' is "for documenting/enforcing API contracts", then "string[] listdir(string)" is an API bug. If it were "char[][] listdir(in char[])", then the compiler wouldn't be able to "get functional programming-type optimizations" within the 'listdir()'. Overall, the use of invariant 'string' is limited to unmodified string literals and their substrings, because for other data the behaviour is undefined, by specification. SnakE
Jan 30 2008
parent Christian Kamm <kamm.incasoftware shift-at-left-and-remove-this.de> writes:
 Overall, the use of invariant 'string' is limited
 to unmodified string literals and their substrings, because for
 other data the behaviour is undefined, by specification.

You need to duplicate your mutable strings: char[] mutablestring = "hello".dup(); string invariantstring = mutablestring.idup(); Where I think the second line is a shorthand for string invariantstring = cast(string) mutablestring.dup(); and the cast to invariant is valid since you guarantee to give no one else access to the newly dupped data. Christian Kamm
Jan 30 2008
prev sibling next sibling parent reply "Janice Caron" <caron800 googlemail.com> writes:
On Jan 30, 2008 11:03 PM, Sergey Gromov <snake.scaly gmail.com> wrote:
 You can't assign a dynamically created data to a variable of type string.

Correct, and this is a good default. There are, of course, times when the default is not what you want. Others have suggested "idup" as the solution, but that means an extra copying pass. Forturnately, however, there is a solution which does /exactly/ what you want, with no superflous copying. It is this: import std.contracts; char[] buffer = whatever; string s = assumeUnique(buffer);
 You can't pass a dynamically created data into a function receiving string!

Again, correct. This is also a good thing. It allows copy-on-write to work correctly. For example, consider a function which lowercases a string. Since string is invariant, that means that when lowercasing something that is /already/ lowercase, the function is free to return the original string. If string were merely const (as opposed to invariant), it would have to make a copy every time. Again, assumeUnique is your friend. (But don't lie to the compiler or things will go badly wrong!)
 P.S.  Many thought mus be put into choosing a return type for a library
function.  Because if it returns a unique copy of data it must be char[] so
that i'm free to modify it.

Well, consider again the example of lowercasing a string to see why that is not so. If I return the original string (not a copy of it), then you are /not/ free to modify it, because there might be other pointers to that data. So you must first copy it (using dup) and then you can modify the copy. This means that the copy need be done /only when it is required/, instead of every single time you call the function - so yes, it is an improvement in efficiency.
Jan 31 2008
next sibling parent reply Sean Kelly <sean f4.ca> writes:
Janice Caron wrote:
 On Jan 30, 2008 11:03 PM, Sergey Gromov <snake.scaly gmail.com> wrote:
 
 P.S.  Many thought mus be put into choosing a return type for a library
function.  Because if it returns a unique copy of data it must be char[] so
that i'm free to modify it.

Well, consider again the example of lowercasing a string to see why that is not so. If I return the original string (not a copy of it), then you are /not/ free to modify it, because there might be other pointers to that data. So you must first copy it (using dup) and then you can modify the copy. This means that the copy need be done /only when it is required/, instead of every single time you call the function - so yes, it is an improvement in efficiency.

I'd say that it is an improvement in safety rather than efficiency because the model assumes the string may be shared and thus enforces copy on write. But consider something like this: char[] data = cast(char[]) read( "myfile.txt" ); char[][] lines = splitlines( data ); foreach( line; lines ) { writefln( tolower( line.idup ) ); } In this routine, the programmer knows he is the sole owner of the data and simply wants to print the contents of a file in lower case line-by-line. And to do so the contents of data must be duplicated, which causes GC churn and may slow the app considerably. I suppose one possible fix for this sample app would be to convert the entire file to lowercase in one shot, thus incurring only one copy, but in a real app it may not be apparent or even possible to modify the algorithm in such a way. Sean
Jan 31 2008
next sibling parent Don Clugston <dac nospam.com.au> writes:
Janice Caron wrote:
 On 1/31/08, Sean Kelly <sean f4.ca> wrote:
 And to do so the contents of data must be duplicated,

The problem there is the idup. Replace it with foreach( line; lines ) { writefln( tolower( assumeUnique(line)) ); } and the duplication goes away.

Seems that assumeUnique is going to be so common, it deserves a shorter name.
Jan 31 2008
prev sibling next sibling parent reply Sergey Gromov <snake.scaly gmail.com> writes:
Sean Kelly Wrote:
 Janice Caron wrote:
 On Jan 30, 2008 11:03 PM, Sergey Gromov <snake.scaly gmail.com> wrote:
 
 P.S.  Many thought mus be put into choosing a return type for a library
 function.  Because if it returns a unique copy of data it must be char[]
 so that i'm free to modify it.

Well, consider again the example of lowercasing a string to see why that is not so. If I return the original string (not a copy of it), then you are /not/ free to modify it, because there might be other pointers to that data. So you must first copy it (using dup) and then This means that the copy need be done /only when it is required/, instead of every single time you call the function - so yes, it is an improvement in efficiency.

I'd say that it is an improvement in safety rather than efficiency because the model assumes the string may be shared and thus enforces copy on write. But consider something like this: char[] data = cast(char[]) read( "myfile.txt" ); char[][] lines = splitlines( data ); foreach( line; lines ) { writefln( tolower( line.idup ) ); } In this routine, the programmer knows he is the sole owner of the data and simply wants to print the contents of a file in lower case line-by-line. And to do so the contents of data must be duplicated, which causes GC churn and may slow the app considerably.

If I were doing this in C by loading an entire file in memory, I would lowercase it in-place which is both faster and takes less memory. D should support such efficient solutions. SnakE
Jan 31 2008
parent Sean Kelly <sean f4.ca> writes:
Sergey Gromov wrote:
 Sean Kelly Wrote:
 Janice Caron wrote:
 On Jan 30, 2008 11:03 PM, Sergey Gromov <snake.scaly gmail.com> wrote:

 P.S.  Many thought mus be put into choosing a return type for a library
 function.  Because if it returns a unique copy of data it must be char[]
 so that i'm free to modify it.

that is not so. If I return the original string (not a copy of it), then you are /not/ free to modify it, because there might be other pointers to that data. So you must first copy it (using dup) and then This means that the copy need be done /only when it is required/, instead of every single time you call the function - so yes, it is an improvement in efficiency.

because the model assumes the string may be shared and thus enforces copy on write. But consider something like this: char[] data = cast(char[]) read( "myfile.txt" ); char[][] lines = splitlines( data ); foreach( line; lines ) { writefln( tolower( line.idup ) ); } In this routine, the programmer knows he is the sole owner of the data and simply wants to print the contents of a file in lower case line-by-line. And to do so the contents of data must be duplicated, which causes GC churn and may slow the app considerably.

If I were doing this in C by loading an entire file in memory, I would lowercase it in-place which is both faster and takes less memory. D should support such efficient solutions.

The D language does. It simply isn't a feature provided by Phobos. Sean
Jan 31 2008
prev sibling parent reply Sean Kelly <sean f4.ca> writes:
Janice Caron wrote:
 On 1/31/08, Sean Kelly <sean f4.ca> wrote:
 And to do so the contents of data must be duplicated,

The problem there is the idup. Replace it with foreach( line; lines ) { writefln( tolower( assumeUnique(line)) ); } and the duplication goes away.

It does? I didn't think tolower would modify a string in place. Sean
Jan 31 2008
next sibling parent Sean Kelly <sean f4.ca> writes:
Janice Caron wrote:
 On 1/31/08, Sean Kelly <sean f4.ca> wrote:
 
 and the duplication goes away.


OK - /one/ of the duplications goes away. It's not possible, even in principle, to lowercase a char[] in place, because a char[] by definition is an array of UTF-8 code units, /not/ an array of characters. Lowercasing a character may result in the length of its UTF-8 sequence changing. If the length increases, you're screwed. You can lowercase a dchar[] in place, but not a char[].

Ah, good point. Sean
Jan 31 2008
prev sibling parent Frits van Bommel <fvbommel REMwOVExCAPSs.nl> writes:
Janice Caron wrote:
 It's not possible, even in principle, to lowercase a char[] in place,
 because a char[] by definition is an array of UTF-8 code units, /not/
 an array of characters. Lowercasing a character may result in the
 length of its UTF-8 sequence changing. If the length increases, you're
 screwed.
 
 You can lowercase a dchar[] in place, but not a char[].

I'm not sure if that's true[2]. However, I *am* sure it's *not* true for uppercasing. Some code points expand to 2 or 3 codepoints when uppercased. One common case is U+00DF "ß", LATIN SMALL LETTER SHARP S, which expands to "SS" (two characters) when uppercased[1]. Another example from the Unicode standard, U+0390, GREEK SMALL LETTER IOTA WITH DIALYTIKA AND TONOS apparently expands to three codepoints. [1] Interestingly though, the UTF-8 (aka char[]) representation is the same length :P. [2] The relevant section[3] of the Unicode standard says "Case mappings may produce strings of different lengths than the original." but proceeds to only give examples for uppercasing. [3] Section 5.18, see http://www.unicode.org/versions/Unicode5.0.0/ch05.pdf#G21180
Jan 31 2008
prev sibling parent reply Sergey Gromov <snake.scaly gmail.com> writes:
Well thanks for explanation, all this really makes sense.  Though there are
many benefits and drawbacks for different approaches which I can't keep track
of.

Janice Caron Wrote:
 On Jan 30, 2008 11:03 PM, Sergey Gromov <snake.scaly gmail.com> wrote:
 You can't pass a dynamically created data into a function receiving string!

Again, correct. This is also a good thing. It allows copy-on-write to work correctly. For example, consider a function which lowercases a string. Since string is invariant, that means that when lowercasing something that is /already/ lowercase, the function is free to return the original string.

Yes, this works for string tokenizers for instance. But doesn't quite for lowercasing. The already lowercase case is generally rare and requires an additional pass to check, so you only benefit from this if you optimize for memory. And if you optimize for speed, you'd want lowercasing in-place because you're obviously in the middle of processing data, and the chances that this data is immutable are little. Even if so, there's always .dup for you. And of course there's no reason in having 'string' parameters in functions like read(), listdir() etc. because their result never contains parts of the arguments.
 P.S.  Many thought mus be put into choosing a return type for a library
function.  Because if it returns a unique copy of data it must be char[] so
that i'm free to modify it.

Well, consider again the example of lowercasing a string to see why that is not so.

But I'm talking about always returning a unique copy of data. What about that listdir() ? The returned names are obviously unique because they're received from file system. If you make them invariant, and I want to uppercase them, I can't do that in-place because I'm not sure if they're actually unique. If you make them mutable but I need them as is, I still cannot count on them not to change because who knows where this data is also used. Maybe a contract should be added for standard library that if a function returns a mutable array, it guarantees that this array can be modified without side effects and therefore can be safely assumed unique. SnakE
Jan 31 2008
parent =?ISO-8859-1?Q?Jari-Matti_M=e4kel=e4?= <jmjmak utu.fi.invalid> writes:
Janice Caron Wrote:

 On 1/31/08, Sergey Gromov <snake.scaly gmail.com> wrote:
 Maybe a contract should be added for standard library that if a function
 returns a mutable array, it guarantees that this array can be modified
 without side effects and therefore can be safely assumed unique.

Yes, D has a problem here, in that it has no way to express uniqueness. (That's true of other languages too, of course).

That's not true. Linear types and their derivatives solve this problem elegantly and have been known at least since 1990. They have interesting properties. The data can be simultaneously "constant" (referentially transparent) and unique, and still destructive updates are possible in situ. The Clean language is an example of this, it has uniqueness types[1]. But I'm afraid the current implementations of D aren't really well suited for these kinds of "advanced" type systems in the near future... [1] http://www.cs.ru.nl/~clean/download/papers/1996/bare96-uniclosed.pdf
 If a
 string is immutable, then it doesn't /have/ to be unique (but if it
 is, it's safe to cast it to invariant), but if a string is mutable
 then it /must/ be. That's a problem, because one simply cannot declare
 
     unique(char)[] array;
 
 (...although it would be interesting to try...) So D just makes it your
problem.
 
 One problem area where this sort of thing arises is concatenation. If
 you concatenate char arrays, then /regardless/ of the constancy of the
 inputs, the result is guaranteed to be unique (because, by
 specification, concatenation always makes a copy). If we had a
 "unique" type-constructor, one could make the result type unique(T)[].
 But we don't, so D makes the (arbitrary) choice of invariant(T)[].
 Also, it is currently a syntax error to concatenate a char[] with an
 invariant(char)[], even though both will implicitly cast to
 const(char)[].
 
 So it's not a perfect system, but it is better than what we had
 before, and (I would argue) better than C++. But remember that D2.x is
 experimental. We're here to play with it. If we don't like it, things
 could change. Our experiences are important here in shaping the
 future.

Jan 31 2008
prev sibling next sibling parent "Janice Caron" <caron800 googlemail.com> writes:
On 1/31/08, Sean Kelly <sean f4.ca> wrote:
 And to do so the contents of data must be duplicated,

The problem there is the idup. Replace it with foreach( line; lines ) { writefln( tolower( assumeUnique(line)) ); } and the duplication goes away.
Jan 31 2008
prev sibling next sibling parent "Janice Caron" <caron800 googlemail.com> writes:
On 1/31/08, Don Clugston <dac nospam.com.au> wrote:
 Seems that assumeUnique is going to be so common, it deserves a shorter name.

Oddly enough, cast(string)s is shorter than assumeUnique(s) and achieves the same effect. :) I think someone somewhere must have decided that explicit casts are to be avoided. That said, assumeUnique works for things that aren't strings, too. It's basically equivalent to cast(invariant). assumeUnique does have a side effect though, which is that its parameter must be an lvalue, which assumeUnique nulls. So you couldn't do string data = assumeUnique( read( "myfile.txt" )); even if you wanted to. The "officially correct" way would be char[] temp = cast(char[]) read( "myfile.txt" ); string data = assumeUnique(temp); /* temp is now null */ But that's way too much typing for me, so I'd just write: string data = cast(string) read("myfile.txt");
Jan 31 2008
prev sibling next sibling parent "Janice Caron" <caron800 googlemail.com> writes:
On 1/31/08, Sergey Gromov <snake.scaly gmail.com> wrote:
 If I were doing this in C by loading an entire file in memory, I would
 lowercase it in-place which is both faster and takes less memory.
 D should support such efficient solutions.

It does. char[] data = cast(char[]) read( "myfile.txt" ); inPlaceLower(data); Now all you have to do is write the function inPlaceLower(). :) Presumably, what you really mean is that functions like inPlaceLower() should be in Phobos. I won't argue with that, because I agree with you. But the D language does not prevent you from modifying in place - it only prevents you from modifying invariant data in place.
Jan 31 2008
prev sibling next sibling parent reply "Janice Caron" <caron800 googlemail.com> writes:
On 1/31/08, Sergey Gromov <snake.scaly gmail.com> wrote:
 Maybe a contract should be added for standard library that if a function
 returns a mutable array, it guarantees that this array can be modified
 without side effects and therefore can be safely assumed unique.

Yes, D has a problem here, in that it has no way to express uniqueness. (That's true of other languages too, of course). If a string is immutable, then it doesn't /have/ to be unique (but if it is, it's safe to cast it to invariant), but if a string is mutable then it /must/ be. That's a problem, because one simply cannot declare unique(char)[] array; (...although it would be interesting to try...) So D just makes it your problem. One problem area where this sort of thing arises is concatenation. If you concatenate char arrays, then /regardless/ of the constancy of the inputs, the result is guaranteed to be unique (because, by specification, concatenation always makes a copy). If we had a "unique" type-constructor, one could make the result type unique(T)[]. But we don't, so D makes the (arbitrary) choice of invariant(T)[]. Also, it is currently a syntax error to concatenate a char[] with an invariant(char)[], even though both will implicitly cast to const(char)[]. So it's not a perfect system, but it is better than what we had before, and (I would argue) better than C++. But remember that D2.x is experimental. We're here to play with it. If we don't like it, things could change. Our experiences are important here in shaping the future.
Jan 31 2008
parent "Steven Schveighoffer" <schveiguy yahoo.com> writes:
"Janice Caron" wrote
 On 1/31/08, Sergey Gromov <snake.scaly gmail.com> wrote:
 Maybe a contract should be added for standard library that if a function
 returns a mutable array, it guarantees that this array can be modified
 without side effects and therefore can be safely assumed unique.

One problem area where this sort of thing arises is concatenation. If you concatenate char arrays, then /regardless/ of the constancy of the inputs, the result is guaranteed to be unique (because, by specification, concatenation always makes a copy). If we had a "unique" type-constructor, one could make the result type unique(T)[]. But we don't, so D makes the (arbitrary) choice of invariant(T)[]. Also, it is currently a syntax error to concatenate a char[] with an invariant(char)[], even though both will implicitly cast to const(char)[].

I actually filed an enhancement request for that. http://d.puremagic.com/issues/show_bug.cgi?id=1654 -Steve
Jan 31 2008
prev sibling next sibling parent "Janice Caron" <caron800 googlemail.com> writes:
On 1/31/08, Sean Kelly <sean f4.ca> wrote:

 and the duplication goes away.

It does? I didn't think tolower would modify a string in place.

OK - /one/ of the duplications goes away. It's not possible, even in principle, to lowercase a char[] in place, because a char[] by definition is an array of UTF-8 code units, /not/ an array of characters. Lowercasing a character may result in the length of its UTF-8 sequence changing. If the length increases, you're screwed. You can lowercase a dchar[] in place, but not a char[].
Jan 31 2008
prev sibling next sibling parent "Janice Caron" <caron800 googlemail.com> writes:
On Feb 1, 2008 12:03 AM, Frits van Bommel <fvbommel remwovexcapss.nl> wrote:
 Some code points
 expand to 2 or 3 codepoints when uppercased. One common case is U+00DF
 "", LATIN SMALL LETTER SHARP S, which expands to "SS" (two characters)
 when uppercased[1]. Another example from the Unicode standard, U+0390,
 GREEK SMALL LETTER IOTA WITH DIALYTIKA AND TONOS apparently expands to
 three codepoints.

I know. I would have mentioned that, but I didn't want to needlessly complicate the issue. But Unicode makes a distinction between "simple casing" and "full casing". What you're talking about is full casing. In simple casing, one character maps to one character. So you could uppercase U+00DF (to itself) using simple-casing. When using full casing, as you quite rightly point out, one can only case-change strings, not characters.
Jan 31 2008
prev sibling parent "Janice Caron" <caron800 googlemail.com> writes:
On Feb 1, 2008 7:50 AM, Janice Caron <caron800 googlemail.com> wrote:
 So you could uppercase U+00DF (to
 itself) using simple-casing.

I'm obviously telling you stuff you already know - I apologise. I would imagine that normalisation forms probably also complicate full casing. One interesting thing is that simple casing also works just fine for wchar (that is, UTF-16). That's because every letter of every living language will be found in the Basic Multilingual Plane (the range U+0000 to U+FFFF). Codepoints outside this range are either symbols, or letters of dead languages. (Or combining characters, etc.), so it's probably safe to leave all non-BMP codepoints unchanged when case-changing. Codepoints in the BMP occupy a single UTF-16 code unit. In many languages (e.g. Chinese), a UTF-16 string is likely to be shorter than the corresponding UTF-8 string. This makes me suspect that UTF-16 may well be the ideal choice for string representation in the real world. (It's what Java went with). Maybe UTF-16, not UTF-8, should be the default kind of string?
Feb 01 2008