www.digitalmars.com         C & C++   DMDScript  

digitalmars.D - .NET on a string

reply Walter Bright <newshound1 digitalmars.com> writes:
Cristi talks about adapting D strings to .NET.

http://www.reddit.com/r/programming/comments/84urf/net_on_a_string/?already_submitted=true
Mar 16 2009
next sibling parent reply Georg Wrede <georg.wrede iki.fi> writes:
Walter Bright wrote:
 Cristi talks about adapting D strings to .NET.
 
 http://www.reddit.com/r/programming/comments/84urf/net_on_a_string/?al
eady_submitted=true 
 

Actually, meant to ask earlier, but "[...in .NET,] System.Array and System.ArraySegment are distinct, unrelated types. In D arrays slices are indistinguishable from arrays and this creates the problems that I mentioned in the interview." Is this discussed somewhere? The idea of slices and arrays being distinct types does seem to have advantages. I've seen a couple of mentions of this lately, but has there been a *rigorous* discussion?
Mar 17 2009
next sibling parent reply "Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Tue, 17 Mar 2009 10:59:21 -0400, Georg Wrede <georg.wrede iki.fi> wrote:

 Walter Bright wrote:
 Cristi talks about adapting D strings to .NET.
   
 http://www.reddit.com/r/programming/comments/84urf/net_on_a_string/?already_submitted=true

Actually, meant to ask earlier, but "[...in .NET,] System.Array and System.ArraySegment are distinct, unrelated types. In D arrays slices are indistinguishable from arrays and this creates the problems that I mentioned in the interview." Is this discussed somewhere? The idea of slices and arrays being distinct types does seem to have advantages. I've seen a couple of mentions of this lately, but has there been a *rigorous* discussion?

There has been. But there are very good reasons to keep arrays and slices the same type. Even in C# and Java, a substring is the same type as a string. It allows iterative patterns such as: str = str[1..$]; The main problem comes form having a substring overwrite data in another string via appending. From what I can tell, that is the *only* issue with arrays and slices being the same type. This problem can be solved in other ways. I've put forth 2 solutions that as far as I know were not proven to be infeasible, but which never received any attention in the NG from Walter or Andrei. There was mention of a separate T[new] type, before my time with D, but I think you still have the same aliasing problems there unless you zero out the source instance on assignment, or simply disallow assigning a T[new] from another T[new], which can affect the design of code, and make certain iterative patterns difficult if not impossible. The two solutions I put forth are: 1. Storing the requested length of an allocated block in the GC. This not only allows for more intuitive appending, but could also help the GC when scanning for pointers (you might not have to zero out the block first). This proposal received zero responses. 2. Storing the requested length of an allocated array at the front of the array. I had a hackish scheme to use the most significant bit in the length field to identify if a slice is just beyond the allocated length. Therefore, an append operation would first check if it is the first element in a block, and if so check the allocated length before deciding whether to append in place or allocate a new block. There were some questions on the proposal, but I don't believe anyone found any killer problems with it. It has advantages and disadvantages over the first proposal and current implementation. The main advantage is that the append code does not have to query the GC to get the requested length. Aside from the append problem, I see no other drawbacks to having strings the way they are. References: Proposal 1: http://www.digitalmars.com/webnews/newsgroups.php?art_group=digitalmars.D&article_id=63146 Proposal 2: http://www.digitalmars.com/webnews/newsgroups.php?art_group=digitalmars.D&article_id=77437 -Steve
Mar 17 2009
parent reply "Cristian Vlasceanu" <cristian zerobugs.org> writes:
 The idea of slices and arrays being distinct types does seem to have 
 advantages. I've seen a couple of mentions of this lately, but has there 
 been a *rigorous* discussion?

There has been. But there are very good reasons to keep arrays and slices the same type. Even in C# and Java, a substring is the same type as a string. It allows iterative patterns such as: str = str[1..$];

I am afraid that there's a fallacy in your argument: substrings ARE strings: 1) they are full copies of the characters in the given range, and 2) once the substring is created it goes its own merry way (i.e. it does not keep track of any relationship to the "original" string). Slices ARE NOT arrays. Slices are more like "views" into the original array. It is like the difference between the icon and the saint / deity that it represents. Another point that I have a hard time getting accross (even to the language heavy-weights) is that just because it is easy to represent arrays and slices seemlessly IN THE PARTICULAR CASE OF THE DIGITAL MARS BACKEND it does not mean it is going to work as smooth and seamless in other systems. The .NET backend that I am working on is the case in point. If instead of using .NET built-in arrays I craft my own representation (to stay compatible with the DMD's way of doing array and slices) then I give up interoperability with other languages -- and that would defeat the point of doing D on .NET to begin with. Your proposed solution is interesting but implementation-specific. I am afraid that I cannot not use it with .NET (I just generate IL code, which is more high-level than "ordinary" assembly code). I passed a proposal of my own to Walter and Andrei, and that is to have D coders explicitly state the intent of using a slice with the "ref" keyword; "ref" is already a legal token in D (at least in 2.0) albeit it is only valid in the context of a parameter list, or foreach argument list. It is not legal to say "ref int j = i;" in a declaration, for example. But it is a trivial change in the parser (I have implemented this change as a proof of concept / language extension research) to allow ref (just for slices): "ref int[] s = a[1..2];" other than in parameter and foreach arg lists. I think that "ref" makes sense, because slices, like I said, are conceptually views (or references) into the "true" arrays. This simple change would a) make D code more self-documenting, and it would give a very powerful hint to the compiler. Also, the "ref" semantics is backwards compatbile with the exisiting cases where "ref" is allowed. But there may be some yet unforseen side-effects (when dealing with generic code, for instance) that need to be further investigated before I push for this solution at full speed. If I find it to work suitably well I have no problem in making my own D 2.0 / .NET dialect... Cheers, Cristian
Mar 21 2009
next sibling parent reply Kagamin <spam here.lot> writes:
Cristian Vlasceanu Wrote:

 I passed a proposal of my own to Walter and Andrei, and that is to have D 
 coders explicitly state the intent of using a slice with the "ref" keyword; 
 "ref" is already a legal token in D (at least in 2.0) albeit it is only 
 valid in the context of a parameter list, or foreach argument list. It is 
 not legal to say "ref int j = i;" in a declaration, for example. But it is a 
 trivial change in the parser (I have implemented this change as a proof of 
 concept / language extension research) to allow ref (just for slices): "ref 
 int[] s = a[1..2];" other than in parameter and foreach arg lists.

There was another suggestion - to mark arrays: int[] is a slice, int[new] is an array. I think, you'll have problems with const system, .net doesn't have one.
Mar 22 2009
next sibling parent Christopher Wright <dhasenan gmail.com> writes:
Kagamin wrote:
 Cristian Vlasceanu Wrote:
 
 I passed a proposal of my own to Walter and Andrei, and that is to have D 
 coders explicitly state the intent of using a slice with the "ref" keyword; 
 "ref" is already a legal token in D (at least in 2.0) albeit it is only 
 valid in the context of a parameter list, or foreach argument list. It is 
 not legal to say "ref int j = i;" in a declaration, for example. But it is a 
 trivial change in the parser (I have implemented this change as a proof of 
 concept / language extension research) to allow ref (just for slices): "ref 
 int[] s = a[1..2];" other than in parameter and foreach arg lists.

There was another suggestion - to mark arrays: int[] is a slice, int[new] is an array. I think, you'll have problems with const system, .net doesn't have one.

The const system is largely a frontend issue. I don't think that will cause many issues.
Mar 22 2009
prev sibling parent Michel Fortin <michel.fortin michelf.com> writes:
On 2009-03-22 05:08:55 -0400, Kagamin <spam here.lot> said:

 Cristian Vlasceanu Wrote:
 
 I passed a proposal of my own to Walter and Andrei, and that is to have D
 coders explicitly state the intent of using a slice with the "ref" keyword;
 "ref" is already a legal token in D (at least in 2.0) albeit it is only
 valid in the context of a parameter list, or foreach argument list. It is
 not legal to say "ref int j = i;" in a declaration, for example. But it is a
 trivial change in the parser (I have implemented this change as a proof of
 concept / language extension research) to allow ref (just for slices): "ref
 int[] s = a[1..2];" other than in parameter and foreach arg lists.

There was another suggestion - to mark arrays: int[] is a slice, int[new] is an array.

I like that better, because the less verbose type is a slice and that's what you need to pass around most of the time. But I think it'd be even better (perhaps not for .NET compatibility, but for D in general) to just remove the need for having a separate array type. For instance, we could have the GC keep track of the array bounds in addition to the allocated size, allowing it to determine if it can grow a slice in place by checking if the slice points at the end. I'm sure someone has already proposed this at some point. -- Michel Fortin michel.fortin michelf.com http://michelf.com/
Mar 22 2009
prev sibling next sibling parent bearophile <bearophileHUGS lycos.com> writes:
Cristian Vlasceanu:
 arrays and 
 slices seemlessly IN THE PARTICULAR CASE OF THE DIGITAL MARS BACKEND it does 
 not mean it is going to work as smooth and seamless in other systems.

It seems that LLVM (and GCC) too are able to support them well enough. I think D was never designed to be a VM-based language, it is supposed to be a system language after all (while dotnet isn't) so it's normal to find some troubles when you try to implement it on a VM. It's wrong to worsen the syntax/semantics of D just to simplify a dotnet port. Bye, bearophile
Mar 22 2009
prev sibling next sibling parent Michel Fortin <michel.fortin michelf.com> writes:
On 2009-03-22 01:31:28 -0400, "Cristian Vlasceanu" 
<cristian zerobugs.org> said:

 Another point that I have a hard time getting accross (even to the language
 heavy-weights) is that just because it is easy to represent arrays and
 slices seemlessly IN THE PARTICULAR CASE OF THE DIGITAL MARS BACKEND it does
 not mean it is going to work as smooth and seamless in other systems. The
 .NET backend that I am working on is the case in point. If instead of using
 .NET built-in arrays I craft my own representation (to stay compatible with
 the DMD's way of doing array and slices) then I give up interoperability
 with other languages -- and that would defeat the point of doing D on .NET
 to begin with.

As I read it have no problem integrating with the writing CLR code to make D semantics work, but you struggle with seamless compatibility with the framework (using .NET predefined classes). But really, no compatibility? In the D/Objective-C bridge, I have a bunch of conversions functions to and from NSArray. You can surely do the same in .NET: have the D array type, and the .NET array type, and convert from one another when necessary. Sure, the D/Objective-C bridge isn't a backend based on the Objective-C runtime, but to the user it almost is (D and Objective-C objects look the same and are interchangeable). If I was pushing that a little further and writing a compiler using Objective-C runtime to implement D classes, I still wouldn't be able to use NSArray to represents D arrays because they have different caracteristics, limitations, and behaviours. But I don't see that as a flaw. That D arrays make different tradoffs than NSArray or .NET arrays isn't a problem that should be corrected: it's something that makes D what it is as it enables some patterns that wouldn't work otherwise. Sure, those different tradoffs makes things harder to bridge with arrays defined in some existing frameworks (both Cocoa and .NET), but if they make things easier and nicer inside the D language I'm all for keeping them. So when using the D/Objective-C bridge, there are two array types and you use the one most appropriate for the job. And you can convert from one to the other when necessary. Perhaps you should look at means of easying the conversion process when passing D arrays to functions expecting .NET arrays instead of fighting to unify two different array concepts into one. -- Michel Fortin michel.fortin michelf.com http://michelf.com/
Mar 22 2009
prev sibling next sibling parent Robin KAY <komadori gekkou.co.uk> writes:
Steven Schveighoffer wrote:
[snip]
 I thought one of the benefits of having immutable strings is that 
 substrings were just pointers to slices of the original data in Java and 
 .NET.  So every time I do a substring in Java and .NET, it creates a 
 copy of the data?  That seems very wasteful, especially when the data is 
 immutable...

Sun's Java implements the String class as below, so that sub-strings can share their parent's character array. This is, AFAIK, an implementation detail although I don't know of any implementation that differs. The downside is that holding a reference to a sub-string will prevent the parent from being collected and this is not an obvious side-effect of substring(). Storing longer strings as ropes might be better. class String { char[] value; int offset; int count; } -- Robin KAY
Mar 23 2009
prev sibling next sibling parent Christopher Wright <dhasenan gmail.com> writes:
Steven Schveighoffer wrote:
 I thought one of the benefits of having immutable strings is that 
 substrings were just pointers to slices of the original data in Java and 
 .NET.  So every time I do a substring in Java and .NET, it creates a 
 copy of the data?  That seems very wasteful, especially when the data is 
 immutable...

I checked the source code for ClassPath's string implementation about a year ago. It copies to create a substring. I thought it was stupid. On the other hand, ClassPath is an open source project and doesn't have hundreds of millions of dollars in backing, so I imagine Sun's String implementation is slightly smarter.
Mar 23 2009
prev sibling parent reply "Cristian Vlasceanu" <cristian zerobugs.org> writes:
 I thought one of the benefits of having immutable strings is that 
 substrings were just pointers to slices of the original data in Java and 
 .NET.  So every time I do a substring in Java and .NET, it creates a copy 
 of the data?  That seems very wasteful, especially when the data is 
 immutable...

Fair enough, I meant substrings in more of a C++ way. Perhaps discussing slices / arrays by comparison with substrings / strings is a bad idea, because as you point out strings are immutable whilst arrays are not. My main problem with slices is that when you a) append to them, or b) resize them (by assigning to the length property), and the new size goes past the bounds of the "original" array, then the slice gets "divorced" from the array, and from a light-weight "view" it gets promoted to a "first class" array (at least this is the behavior in 2.025). This means that for arrays of value types, you can modify the original array indirectly via the outstanding slices, but only up until when either a) or b) occur, at runtime. I would prefer to be able to see explicitly in the source code what is a "full array", and what is just a "view". On the other hand, this is not a deal breaker, because I can always roll my own slice like this: struct (T) ArraySlice { private: T[] a; // reference to array int begin; // index where slice begins int end; // one past index where slice ends public: T opIndex(size_t i) { return a[i]; } void opIndexAssign(size_t i, T val) { a[i] = val; } int length() { return end - begin; } // comment this function out to prevent resizing void length(int newLength) { end = begin + newlength; if (end > a.length) { end = a.length; } } // support foreach int opApply(int delegate(ref int) dg) { foreach (i;begin..end) { if (dg(a[i])) break; } return 0; } }
 That is an interesting idea.  But I have a couple problems with it:

 First, when I see ref int[] s, I think reference to an array, not this 
 array references data from another array.
 Second, your proposed default (not using ref) is to copy data everywhere, 
 which is not good for performance.  Most of the time, arrays are passed 
 without needing to "own" the data, so making copies everywhere you forgot 
 to put ref would be hard to deal with.  It's also completely incompatible 
 with existing code, which expects reference semantics without using ref.

That's exactly right, but for D / .NET I do not expect existing code to compile as-is. For example, I do not intent to port any of the phobos / tango code. I envision all I / O and system stuff to go through [mscorlib]System. Cristian
Mar 23 2009
next sibling parent reply bearophile <bearophileHUGS lycos.com> writes:
Cristian Vlasceanu:
 That's exactly right, but for D / .NET I do not expect existing code to 
 compile as-is. For example, I do not intent to port any of the phobos / 
 tango code. I envision all I / O and system stuff to go through 
 [mscorlib]System.

Can you explain me what are the purposes of D.net? Bye, bearophile
Mar 24 2009
parent Cristian Vlasceanu <cristian zerobugs.org> writes:
bearophile Wrote:
 
 Can you explain me what are the purposes of D.net?
 
 Bye,
 bearophile

You mean other than the tremendous fun that I am having writing a compiler back-end? ;) I guess the idea is to help the language catch on, by spreading it (or a dialect of it) to as many platforms as possible. Other than the above-mentioned levels of personal fun, I am envisioning to lower the entry barrier for experimentation with the language. The back-end for .net is much simpler than the native code generator (it just outputs textual IL, then invokes ILASM) so it should be easier to understand and tweak. The front-end is open source (is dually licensed as GPL / artistic) and my .net back-end will be released under the Ms-PL (OSI-certified), so the entire project will be 100% Open Source; people will be able to make changes and redistribute them, and hopefully this will entice even more folks to adopt (or at least give a try to) the language. Cheers, Cristian
Mar 24 2009
prev sibling parent reply Cristian Vlasceanu <cristian zerobugs.org> writes:
Steven Schveighoffer Wrote:

 Oh, and BTW, if I couldn't use Tango, I'd most certainly not use D.NET ;)   
 I sort of loathe the .NET runtime libs, except for certain parts.
 

What I meant is that porting the libraries is not in the scope of my project. Anyway, that is a side discussion. Back to the slices topic: I agree that my proposed "ref" solution would require code changes, but isn't that true for T[new] as well? Cristian
Mar 24 2009
parent reply Cristian Vlasceanu <cristian zerobugs.org> writes:
Steven Schveighoffer Wrote:

 On Tue, 24 Mar 2009 18:26:16 -0400, Cristian Vlasceanu  
 <cristian zerobugs.org> wrote:
 
 Back to the slices topic: I agree that my proposed "ref" solution would  
 require code changes, but isn't that true for T[new] as well?

 Cristian

There is not already a meaning for T[new], it is a syntax error. There is already a meaning for ref T[].

Yes, but the current, existing meaning will be preserved: void f(ref T[] a) { a[13] = 42; // still works as before if "a" is a slice under the hood a = null; // very easy for the compiler to make this work: a.array = null } I thought that your main objection was that, with my proposed change, in code like this: int [] a = ... int [] s = a[0..$]; "s" becomes a full blown copy of a (rather than a slice), and to reinstate the today's semantics the code should be re-written to: ref int [] s = a[0..$]; It is true, but such cases can be worked out in the compiler and we'll have warning messages issued. Thanks, Cristian
Mar 24 2009
parent "Cristian Vlasceanu" <cristian zerobugs.org> writes:
At a first look yes I think the assertion will fail, but not if you declare 
x:
ref char[] x = "   trim this!   ".dup;

It could be possible to tweak the compiler so that it forces you to declare 
x like that

"Steven Schveighoffer" <schveiguy yahoo.com> wrote in umessage 
news:op.urex03gyeav7ka steves.networkengines.com...
 On Tue, 24 Mar 2009 20:02:16 -0400, Cristian Vlasceanu 
 <cristian zerobugs.org> wrote:

 Steven Schveighoffer Wrote:

 On Tue, 24 Mar 2009 18:26:16 -0400, Cristian Vlasceanu
 <cristian zerobugs.org> wrote:

 Back to the slices topic: I agree that my proposed "ref" solution

 require code changes, but isn't that true for T[new] as well?

 Cristian

There is not already a meaning for T[new], it is a syntax error. There is already a meaning for ref T[].

Yes, but the current, existing meaning will be preserved: void f(ref T[] a) { a[13] = 42; // still works as before if "a" is a slice under the hood a = null; // very easy for the compiler to make this work: a.array = null }

OK, I'm not sure I understood your original proposal, before I respond more, let me make it clear what my understanding was. In your proposal, a ref T[] a is the same as a slice today. That is, assignment to a ref T[] simply copies the pointer and length from another T[] or ref T[]. However, it does not reference another slice struct, but is a local struct in itself. In the current situation, a ref T[] is a reference to a slice struct. That is, assignement to a ref T[] overwrites the pointer and length on the reference that was passed. So here is my objection: void trim(ref char[] c) { // // remove leading and trailing spaces // while(c.length > 0 && c[0] == ' ') c = c[1..$]; while(c.length > 0 && c[$-1] == ' ') c = c[0..$-1]; } void foo() { char[] x = " trim this! ".dup; trim(x); assert(x == "trim this!"); } Now, in your scheme, the ref simply means that c's data is referencing something else, not that c is a reference, so the assert will fail, no? If this isn't the case, let me know how you signify: 1. a normal slice (struct is local, but ptr and length are aliased from data). 2. a reference of a slice (references an external struct). -Steve

Mar 29 2009
prev sibling next sibling parent reply dsimcha <dsimcha yahoo.com> writes:
== Quote from Georg Wrede (georg.wrede iki.fi)'s article
 The idea of slices and arrays being distinct types does seem to have
 advantages. I've seen a couple of mentions of this lately, but has there
 been a *rigorous* discussion?

No rigorous discussion. This seems to be one of those thorny issues that noone wants to delve into seriously. I had proposed once that T[new], which has been thrown around here before, be a value type but be implicitly convertible to T[] by reference. Here is the proposal reproduced. Maybe this time it will get some comments. Resizable arrays are such an important, heavily used feature that they should get special treatment from the language, to ensure that they are syntactically elegant, efficient both at runtime and in terms of compilation time, easy to use, free of odd corner cases, and in general are true first class citizens. Nonetheless, D's arrays do have some rough edges. My proposal would be as follows: There should be two array types: T[new] and T[]. This has been proposed before, though not in as much detail as what I'm suggesting. The semantics should work as follows: T[new] is considered a subtype of T[]. This works mostly like you would expect. T[new] can be implicitly converted to T[]. The twist is that T[] can also be assigned to T[new] without any explicit conversion, but a copy of the underlying data is implicitly made for safety. Assigning either a T[] or a T[new] to a T[] will result in aliasing: T[] foo; T[new] bar = new T[N]; bar[0] = 1; foo = bar; foo[0] = 2; writeln(bar[0]); // Prints 2. T[new]s are always assigned by value to make concatenation and resizing safe. The underlying data is copied implicitly when assigning from a T[] to a T[new] or from a T[new] to another T[new]. T[new] foo; // new returns T[new], implicitly converted to T[] T[] bar = new T[N]; bar[0] = 1; foo = bar; foo[0] = 2; writeln(bar[0]); // Prints 1. T[]s can be sliced, but cannot be enlarged or appended to. They should be laid out the same way they are now, as a struct of a pointer and a length. T[new]s can be appended to and enlarged in addition to being sliced. Their layout should be a struct of pointer, length, capacity to make appending fast. This will also make implicit conversion to T[] essentially free, since all that is necessary is slicing off the capacity field. I also wonder if this proposal could be extended to fix some of the covariance issues with arrays (see Bugzilla 2095). This might work by only allowing covariant types to be assigned to T[new], not T[]. For example: class A{} class B:A{} B[new] ba = [new B]; A[] aa = ba; // Error: Cannot assign covariant types to T[]. A[new] aa = ba; // Works, but copy is implicitly made.
Mar 17 2009
parent reply bearophile <bearophileHUGS lycos.com> writes:
dsimcha:
Nonetheless, D's arrays do have some rough edges.<

For the second time I have put a bug like this: def foo(int[] a) { ... a.length = a.length - 1; } The length change can't be seen from the outside, because a is a struct passed by value. You have to use foo(ref int[] a). I think I'd like this to be the default behavior of dynamic array pass (disabled for example if you put an foo(in int[] a). Do you add this bug to your programs too? Steven Schveighoffer:
I had a hackish scheme to use the most significant bit in the length field to
identify if a slice is just beyond the allocated length.<

Such bit may also be stored in the pointer, if they are aligned to 8 or 16 bytes. Bye, bearophile
Mar 17 2009
parent bearophile <bearophileHUGS lycos.com> writes:
bearophile:
 def foo(int[] a) {

Sorry, I meant: void foo(int[] a) {
Mar 17 2009
prev sibling next sibling parent "Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Tue, 17 Mar 2009 13:29:59 -0400, bearophile <bearophileHUGS lycos.com>  
wrote:

 Steven Schveighoffer:
 I had a hackish scheme to use the most significant bit in the length  
 field to identify if a slice is just beyond the allocated length.<

Such bit may also be stored in the pointer, if they are aligned to 8 or 16 bytes.

That will affect the GC when it is scanning for pointers. -Steve
Mar 17 2009
prev sibling next sibling parent "Robert Jacques" <sandford jhu.edu> writes:
On Tue, 17 Mar 2009 13:03:56 -0400, Steven Schveighoffer  
<schveiguy yahoo.com> wrote:

 On Tue, 17 Mar 2009 10:59:21 -0400, Georg Wrede <georg.wrede iki.fi>  
 wrote:

 Walter Bright wrote:
 Cristi talks about adapting D strings to .NET.
   
 http://www.reddit.com/r/programming/comments/84urf/net_on_a_string/?already_submitted=true

Actually, meant to ask earlier, but "[...in .NET,] System.Array and System.ArraySegment are distinct, unrelated types. In D arrays slices are indistinguishable from arrays and this creates the problems that I mentioned in the interview." Is this discussed somewhere? The idea of slices and arrays being distinct types does seem to have advantages. I've seen a couple of mentions of this lately, but has there been a *rigorous* discussion?

There has been. But there are very good reasons to keep arrays and slices the same type. Even in C# and Java, a substring is the same type as a string. It allows iterative patterns such as: str = str[1..$]; The main problem comes form having a substring overwrite data in another string via appending. From what I can tell, that is the *only* issue with arrays and slices being the same type. This problem can be solved in other ways. I've put forth 2 solutions that as far as I know were not proven to be infeasible, but which never received any attention in the NG from Walter or Andrei. There was mention of a separate T[new] type, before my time with D, but I think you still have the same aliasing problems there unless you zero out the source instance on assignment, or simply disallow assigning a T[new] from another T[new], which can affect the design of code, and make certain iterative patterns difficult if not impossible. The two solutions I put forth are: 1. Storing the requested length of an allocated block in the GC. This not only allows for more intuitive appending, but could also help the GC when scanning for pointers (you might not have to zero out the block first). This proposal received zero responses.

Well, if it helps, my roommate and I just independently came up with this idea. (So is this great minds think a like or that simple ones seldom differ?)
 2. Storing the requested length of an allocated array at the front of  
 the array.  I had a hackish scheme to use the most significant bit in  
 the length field to identify if a slice is just beyond the allocated  
 length.  Therefore, an append operation would first check if it is the  
 first element in a block, and if so check the allocated length before  
 deciding whether to append in place or allocate a new block.  There were  
 some questions on the proposal, but I don't believe anyone found any  
 killer problems with it.  It has advantages and disadvantages over the  
 first proposal and current implementation.  The main advantage is that  
 the append code does not have to query the GC to get the requested  
 length.

You mentioned being un-sure of the threading and alignment issues of your proposal in [2]. I see both a lock and lock-free solution: Lock solution: As to append to an array, one has to know the capacity, which requires taking the GC lock. If, as in [1], the used length and capacity are stored in the GC together, one could move the extension operation to the GC. i.e. given a block, the used length in that block and an array (consisting of a pointer into that block and a length) it is straight forward to calculate if that array could be lengthen and if not, allocate a new one. The caller can then do a full or partial copy as appropriate. This avoids the alignment issues (as no new information is stored in the block) and maintains the maximum array size (although this is a fairly minor corner case) Lock-free solution This solution would make use of the 'empty' 16-byte block that proceeds each allocated block. I'm assuming this is usable, but if it's not, an extra 16-byte block would have to be added to each heap array to maintain proper alignment. This solution would cache the array capacity and its used length just prior to the start of the array (i.e. in the 'empty' block). This is similar to [2], except that in [2] you still have to query the GC for capacity. Again, a bit flag in the array length determines slice / non-slice status. Regarding multi-threading issues: reading capacity suffers from publication safety issues (I think) but it might get cleared up by the releasing of the GC allocation lock. Otherwise, it would need to be fenced. As for the length, one can use a single compare and swap (comparing the GC's length with the local array's length and setting the new desired length). This also has the advantage of avoiding the GC when doing array appending, which decreases the need for separate ArrayBuldier classes. (Of course, once shared/local is introduced, the expensive fences and CAS instructions can be limited appropriately.)
 Aside from the append problem, I see no other drawbacks to having  
 strings the way they are.

 References:

 Proposal 1:  
 http://www.digitalmars.com/webnews/newsgroups.php?art_group=digitalmars.D&article_id=63146

 Proposal 2:  
 http://www.digitalmars.com/webnews/newsgroups.php?art_group=digitalmars.D&article_id=77437

 -Steve

Mar 17 2009
prev sibling next sibling parent "Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Sun, 22 Mar 2009 01:31:28 -0400, Cristian Vlasceanu wrote:

 The idea of slices and arrays being distinct types does seem to have
 advantages. I've seen a couple of mentions of this lately, but has  
 there
 been a *rigorous* discussion?

There has been. But there are very good reasons to keep arrays and slices the same type. Even in C# and Java, a substring is the same type as a string. It allows iterative patterns such as: str = str[1..$];

I am afraid that there's a fallacy in your argument: substrings ARE strings: 1) they are full copies of the characters in the given range, and 2) once the substring is created it goes its own merry way (i.e. it does not keep track of any relationship to the "original" string). Slices ARE NOT arrays. Slices are more like "views" into the original array. It is like the difference between the icon and the saint / deity that it represents.

I thought one of the benefits of having immutable strings is that substrings were just pointers to slices of the original data in Java and .NET. So every time I do a substring in Java and .NET, it creates a copy of the data? That seems very wasteful, especially when the data is immutable... Note I have no intimate knowledge of the inner workings of Java and .Net, I just went on what logically makes sense. Your statement that slices are not arrays depends on your definition of slice and array. To me, slices and arrays are identical. Slices are simply a smaller set of the array. It's like saying subsets are a different type than sets. I suppose it depends on what language you learned about slices (for me it was D). The only issue I see is the builtin append operation. Everything else has a clean solution already.
 Another point that I have a hard time getting accross (even to the  
 language
 heavy-weights) is that just because it is easy to represent arrays and
 slices seemlessly IN THE PARTICULAR CASE OF THE DIGITAL MARS BACKEND it  
 does
 not mean it is going to work as smooth and seamless in other systems. The
 .NET backend that I am working on is the case in point. If instead of  
 using
 .NET built-in arrays I craft my own representation (to stay compatible  
 with
 the DMD's way of doing array and slices) then I give up interoperability
 with other languages -- and that would defeat the point of doing D on  
 .NET
 to begin with.

Fair enough, but I think restricting the design of D to cater to other back ends is probably not a huge driver for Walter. It's not without precedent that people adapting languages to .NET have to introduce syntax changes to the language, no? I'm thinking of C++.NET.
 Your proposed solution is interesting but implementation-specific. I am
 afraid that I cannot not use it with .NET (I just generate IL code,  
 which is
 more high-level than "ordinary" assembly code).

 I passed a proposal of my own to Walter and Andrei, and that is to have D
 coders explicitly state the intent of using a slice with the "ref"  
 keyword;
 "ref" is already a legal token in D (at least in 2.0) albeit it is only
 valid in the context of a parameter list, or foreach argument list. It is
 not legal to say "ref int j = i;" in a declaration, for example. But it  
 is a
 trivial change in the parser (I have implemented this change as a proof  
 of
 concept / language extension research) to allow ref (just for slices):  
 "ref
 int[] s = a[1..2];" other than in parameter and foreach arg lists.

 I think that "ref" makes sense, because slices, like I said, are
 conceptually views (or references) into the "true" arrays. This simple
 change would a) make D code more self-documenting, and it would give a  
 very
 powerful hint to the compiler. Also, the "ref" semantics is backwards
 compatbile with the exisiting cases where "ref" is allowed.

That is an interesting idea. But I have a couple problems with it: First, when I see ref int[] s, I think reference to an array, not this array references data from another array. Second, your proposed default (not using ref) is to copy data everywhere, which is not good for performance. Most of the time, arrays are passed without needing to "own" the data, so making copies everywhere you forgot to put ref would be hard to deal with. It's also completely incompatible with existing code, which expects reference semantics without using ref. -Steve
Mar 23 2009
prev sibling next sibling parent "Denis Koroskin" <2korden gmail.com> writes:
On Mon, 23 Mar 2009 20:41:45 +0300, Steven Schveighoffer <schveiguy yahoo.com>
wrote:

 On Sun, 22 Mar 2009 01:31:28 -0400, Cristian Vlasceanu wrote:

 The idea of slices and arrays being distinct types does seem to have
 advantages. I've seen a couple of mentions of this lately, but has  
 there
 been a *rigorous* discussion?

There has been. But there are very good reasons to keep arrays and slices the same type. Even in C# and Java, a substring is the same type as a string. It allows iterative patterns such as: str = str[1..$];

I am afraid that there's a fallacy in your argument: substrings ARE strings: 1) they are full copies of the characters in the given range, and 2) once the substring is created it goes its own merry way (i.e. it does not keep track of any relationship to the "original" string). Slices ARE NOT arrays. Slices are more like "views" into the original array. It is like the difference between the icon and the saint / deity that it represents.

I thought one of the benefits of having immutable strings is that substrings were just pointers to slices of the original data in Java and .NET. So every time I do a substring in Java and .NET, it creates a copy of the data? That seems very wasteful, especially when the data is immutable... Note I have no intimate knowledge of the inner workings of Java and .Net, I just went on what logically makes sense. Your statement that slices are not arrays depends on your definition of slice and array. To me, slices and arrays are identical. Slices are simply a smaller set of the array. It's like saying subsets are a different type than sets. I suppose it depends on what language you learned about slices (for me it was D). The only issue I see is the builtin append operation. Everything else has a clean solution already.
 Another point that I have a hard time getting accross (even to the  
 language
 heavy-weights) is that just because it is easy to represent arrays and
 slices seemlessly IN THE PARTICULAR CASE OF THE DIGITAL MARS BACKEND it  
 does
 not mean it is going to work as smooth and seamless in other systems.  
 The
 .NET backend that I am working on is the case in point. If instead of  
 using
 .NET built-in arrays I craft my own representation (to stay compatible  
 with
 the DMD's way of doing array and slices) then I give up interoperability
 with other languages -- and that would defeat the point of doing D on  
 .NET
 to begin with.

Fair enough, but I think restricting the design of D to cater to other back ends is probably not a huge driver for Walter. It's not without precedent that people adapting languages to .NET have to introduce syntax changes to the language, no? I'm thinking of C++.NET.
 Your proposed solution is interesting but implementation-specific. I am
 afraid that I cannot not use it with .NET (I just generate IL code,  
 which is
 more high-level than "ordinary" assembly code).

 I passed a proposal of my own to Walter and Andrei, and that is to have  
 D
 coders explicitly state the intent of using a slice with the "ref"  
 keyword;
 "ref" is already a legal token in D (at least in 2.0) albeit it is only
 valid in the context of a parameter list, or foreach argument list. It  
 is
 not legal to say "ref int j = i;" in a declaration, for example. But it  
 is a
 trivial change in the parser (I have implemented this change as a proof  
 of
 concept / language extension research) to allow ref (just for slices):  
 "ref
 int[] s = a[1..2];" other than in parameter and foreach arg lists.

 I think that "ref" makes sense, because slices, like I said, are
 conceptually views (or references) into the "true" arrays. This simple
 change would a) make D code more self-documenting, and it would give a  
 very
 powerful hint to the compiler. Also, the "ref" semantics is backwards
 compatbile with the exisiting cases where "ref" is allowed.

That is an interesting idea. But I have a couple problems with it: First, when I see ref int[] s, I think reference to an array, not this array references data from another array. Second, your proposed default (not using ref) is to copy data everywhere, which is not good for performance. Most of the time, arrays are passed without needing to "own" the data, so making copies everywhere you forgot to put ref would be hard to deal with. It's also completely incompatible with existing code, which expects reference semantics without using ref. -Steve

I recall a discussion on Java forum where a user was getting a huge "memory leak" because he was processing large xml files and stored some (rather small) String values that were substrings inside those xml files. Since then I got an impression that a substring of a String is just a slice into an original string. But I agree that this is an implementation detail.
Mar 23 2009
prev sibling next sibling parent "Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Mon, 23 Mar 2009 22:24:30 -0400, Cristian Vlasceanu  
<cristian zerobugs.org> wrote:

 I thought one of the benefits of having immutable strings is that
 substrings were just pointers to slices of the original data in Java and
 .NET.  So every time I do a substring in Java and .NET, it creates a  
 copy
 of the data?  That seems very wasteful, especially when the data is
 immutable...

Fair enough, I meant substrings in more of a C++ way. Perhaps discussing slices / arrays by comparison with substrings / strings is a bad idea, because as you point out strings are immutable whilst arrays are not.

OK, I didn't think of that implementation.
 My main problem with slices is that when you a) append to them, or b)  
 resize
 them (by assigning to the length property), and the new size goes past  
 the
 bounds of the "original" array, then the slice gets "divorced" from the
 array, and from a light-weight "view" it gets promoted to a "first class"
 array (at least this is the behavior in 2.025).

It's even more bizarre than this :) If you append to a slice that happens to point to the first bytes of the original array, it appends in place (no divorce!), possibly overwriting the (possibly immutable!) data still in the original array. This is the bug I'm trying to fix with my proposals. But I see your point, which is one aspect that I was aware of, but didn't really feel like it was a huge problem. After all, if the behavior is deterministic, then you can know whether your slice still points to the original array. But looking at it from your point of view, the scheme definitely has valid issues. Most other parts of the D language allow you to simply look at the type of something and know what it means. Arrays, you have to examine the code that created/used the array to figure out whether it's an alias or unique data. That is a problem. So maybe it is a worthwhile exercise to figure out if there is a way to embed the attributes of the array into the type. I'll have to think about how this could be implemented in a way that makes sense, is realistic, and does not hinder performance or syntax. There might still be a way to make this work.
 That is an interesting idea.  But I have a couple problems with it:

 First, when I see ref int[] s, I think reference to an array, not this
 array references data from another array.
 Second, your proposed default (not using ref) is to copy data  
 everywhere,
 which is not good for performance.  Most of the time, arrays are passed
 without needing to "own" the data, so making copies everywhere you  
 forgot
 to put ref would be hard to deal with.  It's also completely  
 incompatible
 with existing code, which expects reference semantics without using ref.

That's exactly right, but for D / .NET I do not expect existing code to compile as-is. For example, I do not intent to port any of the phobos / tango code. I envision all I / O and system stuff to go through [mscorlib]System.

That is what I would expect also, but part of the benefit of having .NET implemented in another language is to port existing code from that language to a .NET runtime. Any code that was to be ported might suffer. There are already instances of ref char[] or ref string in many applications/libs that would cause strange bugs when ported to .NET. Oh, and BTW, if I couldn't use Tango, I'd most certainly not use D.NET ;) I sort of loathe the .NET runtime libs, except for certain parts. -Steve
Mar 24 2009
prev sibling next sibling parent "Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Tue, 24 Mar 2009 18:26:16 -0400, Cristian Vlasceanu  
<cristian zerobugs.org> wrote:

 Back to the slices topic: I agree that my proposed "ref" solution would  
 require code changes, but isn't that true for T[new] as well?

 Cristian

There is not already a meaning for T[new], it is a syntax error. There is already a meaning for ref T[]. -Steve
Mar 24 2009
prev sibling parent "Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Tue, 24 Mar 2009 20:02:16 -0400, Cristian Vlasceanu  
<cristian zerobugs.org> wrote:

 Steven Schveighoffer Wrote:

 On Tue, 24 Mar 2009 18:26:16 -0400, Cristian Vlasceanu
 <cristian zerobugs.org> wrote:

 Back to the slices topic: I agree that my proposed "ref" solution  

 require code changes, but isn't that true for T[new] as well?

 Cristian

There is not already a meaning for T[new], it is a syntax error. There is already a meaning for ref T[].

Yes, but the current, existing meaning will be preserved: void f(ref T[] a) { a[13] = 42; // still works as before if "a" is a slice under the hood a = null; // very easy for the compiler to make this work: a.array = null }

OK, I'm not sure I understood your original proposal, before I respond more, let me make it clear what my understanding was. In your proposal, a ref T[] a is the same as a slice today. That is, assignment to a ref T[] simply copies the pointer and length from another T[] or ref T[]. However, it does not reference another slice struct, but is a local struct in itself. In the current situation, a ref T[] is a reference to a slice struct. That is, assignement to a ref T[] overwrites the pointer and length on the reference that was passed. So here is my objection: void trim(ref char[] c) { // // remove leading and trailing spaces // while(c.length > 0 && c[0] == ' ') c = c[1..$]; while(c.length > 0 && c[$-1] == ' ') c = c[0..$-1]; } void foo() { char[] x = " trim this! ".dup; trim(x); assert(x == "trim this!"); } Now, in your scheme, the ref simply means that c's data is referencing something else, not that c is a reference, so the assert will fail, no? If this isn't the case, let me know how you signify: 1. a normal slice (struct is local, but ptr and length are aliased from data). 2. a reference of a slice (references an external struct). -Steve
Mar 26 2009
prev sibling parent Georg Wrede <georg.wrede iki.fi> writes:
Walter Bright wrote:
 Cristi talks about adapting D strings to .NET.
 
 http://www.reddit.com/r/programming/comments/84urf/net_on_a_string/?al
eady_submitted=true 

At times I find myself wondering if the D string (i.e. invariant char[]) is such a good idea when it (sort-of implicitly) means string of utf8. After all, it could be construed as being just a "quick and good-enough" hack. But then I come across things like http://www.yoda.arachsys.com/csharp/strings.html#culture and remember that this is more or less hopeless to get right (with any decent effort anyway).
Mar 17 2009