
digitalmars.D - Why Strings as Classes?

reply Benji Smith <dlanguage benjismith.net> writes:
In another thread (about array append performance) I mentioned that 
Strings ought to be implemented as classes rather than as simple builtin
arrays. Superdan asked why. Here's my response...

I'll start with a few of the softball, easy reasons.

For starters, with strings implemented as character arrays, writing 
library code that accepts and operates on strings is a bit of a pain in 
the neck, since you always have to write templates and template code is 
slightly less readable than non-template code. You can't distribute your 
code as a DLL or a shared object, because the template instantiations 
won't be included (unless you create wrapper functions with explicit 
template instantiations, bloating your code size, but more importantly 
tripling the number of functions in your API).

Another good low-hanging argument is that strings are frequently used as 
keys in associative arrays. Every insertion and retrieval in an 
associative array requires a hashcode computation. And since D strings 
are just dumb arrays, they have no way of memoizing their hashcodes. 
We've already observed that D assoc arrays are less performant than even 
Python maps, so the extra cost of lookup operations is unwelcome.
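To make the memoization point concrete, here's a sketch of what a hash-caching wrapper could look like. This HashedString class is purely hypothetical, not anything in Phobos or Tango:

------------------------------------------------------------------
// Hypothetical sketch: a string wrapper that computes its hashcode
// once and caches it, so repeated associative-array insertions and
// lookups don't re-hash the same bytes over and over.
class HashedString {
    private char[] data;
    private hash_t cachedHash;
    private bool hashed = false;

    this(char[] data) {
        this.data = data;
    }

    hash_t toHash() {
        if (!hashed) {
            foreach (char c; data)
                cachedHash = cachedHash * 11 + c;
            hashed = true;
        }
        return cachedHash;
    }
}
------------------------------------------------------------------

A plain char[] key pays the full hash computation on every insertion and lookup; a class like this pays it once per string.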

But much more important than either of those reasons is the lack of 
polymorphism on character arrays. Arrays can't have subclasses, and they 
can't implement interfaces.

A good example of what I'm talking about can be seen in the Phobos and 
Tango regular expression engines. At least the Tango implementation 
matches against all string types (the Phobos one only works with char[] 
strings).

But what if I want to consume a 100 MB logfile, counting all lines that 
match a pattern?

Right now, to use either regex engine, I have to read the entire 
logfile into an enormous array before invoking the regex search function.

Instead, what if there was a CharacterStream interface? And what if all 
the text-handling code in Phobos & Tango was written to consume and 
return instances of that interface?

A regex engine accepting a CharacterStream interface could process text 
from string literals, file input streams, socket input streams, database 
records, etc, etc, etc... without having to pollute the API with a bunch 
of casts, copies, and conversions. And my logfile processing application 
would consume only a tiny fraction of the memory needed by the character 
array implementation.

Most importantly, the contract between the regex engine and its 
consumers would provide a well-defined interface for processing text, 
regardless of the source or representation of that text.
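To sketch what I mean, the interface might look something like this (the names and methods are illustrative assumptions, not an existing Phobos or Tango API):

------------------------------------------------------------------
// Hypothetical CharacterStream interface: text-handling code written
// against it never sees the backing representation.
interface CharacterStream {
    bool hasNext();   // more characters available?
    dchar next();     // decode and return the next character
}

// One possible implementor: an in-memory string source. A file- or
// socket-backed implementor would expose the same two methods.
class StringCharacterStream : CharacterStream {
    private dchar[] data;
    private size_t pos = 0;

    this(dchar[] data) {
        this.data = data;
    }

    bool hasNext() {
        return pos < data.length;
    }

    dchar next() {
        return data[pos++];
    }
}
------------------------------------------------------------------

A regex engine written against CharacterStream would see only hasNext() and next(), regardless of whether the characters come from memory, a file, or a socket.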

Along a similar vein, I've worked on a lot of parsers over the past few 
years, for domain specific languages and templating engines, and stuff 
like that. Sometimes it'd be very handy to define a "Token" class that 
behaves exactly like a String, but with some additional behavior. 
Ideally, I'd like to implement that Token class as an implementor of the 
CharacterStream interface, so that it can be passed directly into other 
text-handling functions.

But, in D, with no polymorphic text handling, I can't do that.

As one final thought... I suspect that mutable/const/invariant string 
handling would be much more conveniently implemented with a 
MutableCharacterStream interface (as an extended interface of 
CharacterStream).

Any function written to accept a CharacterStream would automatically 
accept a MutableCharacterStream, thanks to interface polymorphism, 
without any casts, conversions, or copies. And various implementors of 
the interface could provide buffered implementations operating on 
in-memory strings, file data, or network data.

Coding against the CharacterStream interface, library authors wouldn't 
need to worry about const-correctness, since the interface wouldn't 
provide any mutator methods.

But then again, I haven't used any of the const functionality in D2, so 
I can't actually comment on relative usability of compiler-enforced 
immutability versus interface-enforced immutability.

Anyhow, those are some of my thoughts... I think there are a lot of 
compelling reasons for de-coupling the specification of string handling 
functionality from the implementation of that functionality, primarily 
for enabling polymorphic text-processing.

But memoized hashcodes would be cool too :-)

--benji
Aug 25 2008
next sibling parent reply Benji Smith <dlanguage benjismith.net> writes:
Oh, man, I forgot one other thing... And it's a biggie...

The D _Arrays_ page says that "A string is an array of characters. 
String literals are just an easy way to write character arrays."

http://digitalmars.com/d/1.0/arrays.html

In my previous post, I also use the "character array" terminology.

Unfortunately, though, it's just not true.

A char[] is actually an array of UTF-8 encoded octets, where each 
character may consume one or more consecutive elements of the array. 
Retrieving the str.length property may or may not tell you how many 
characters are in the string. And pretty much any code that tries to 
iterate character-by-character through the array elements is 
fundamentally broken.

Take a look at this code, for example:

------------------------------------------------------------------
import tango.io.Stdout;

void main() {

    // Create a string with UTF-8 content
    char[] str = "mötley crüe";
    Stdout.formatln("full string value: {}", str);

    Stdout.formatln("len: {}", str.length);
    // --> "len: 13" ... but there are only 11 characters!

    Stdout.formatln("2nd char: '{}'", str[1]);
    // --> "2nd char: ''" ... where'd my character go?

    Stdout.formatln("first 3 chars: '{}'", str[0..3]);
    // --> "first 3 chars: 'mö'" ... why only 2?

    char o_umlat = 'ö';
    Stdout.formatln("char value: '{}'", o_umlat);
    // --> "char value: ''" ... where's my char?

}
------------------------------------------------------------------

So you can't actually iterate the char elements of a char[] without 
risking that you'll turn your string data into garbage. And you can't 
trust that the length property tells you how many characters there are. 
And you can't trust that an index or a slice will return valid data.

Also: take a look at the Phobos string "find" functions:

   int find(char[] s, dchar c);
   int ifind(char[] s, dchar c);
   int rfind(char[] s, dchar c);
   int irfind(char[] s, dchar c);

Huh?

To find a character in a char[] array, you have to use a dchar?

To me, that's like looking for a long within an int[] array.

So.. If a char[] actually consists of dchar elements, does that mean I 
can append a dchar to a char[] array?

   dchar u_umlat = 'ü';
   char[] newString = "mötley crüe" ~ u_umlat;

No. Of course not. The compiler complains that you can't concatenate a 
dchar to a char[] array. Even though the "find" functions indicate that 
the array is truly a collection of dchar elements.

Now, don't get me wrong. I understand why the string is encoded as 
UTF-8. And I understand that the encoding prevents accurate element 
iteration, indexing, slicing, and all the other nice array goodies.

The existing D string implementation is exactly what I'd expect to see 
inside the guts of a string class, because encodings are important and 
efficiency is important. But those implementation details shouldn't be 
exposed through a public API.

To claim that D strings are actually usable as character arrays is more 
than a little spurious, since direct access of the array elements can 
return fragmented garbage bytes.

If accurate string manipulation is impossible without a set of 
special-purpose functions, then I'll argue that the implementation is 
already equivalent to that of a class, but without any of the niceties 
of encapsulation and polymorphism.

--benji
Aug 25 2008
parent reply superdan <super dan.org> writes:
Benji Smith Wrote:

 Oh, man, I forgot one other thing... And it's a biggie...
 
 The D _Arrays_ page says that "A string is an array of characters. 
 String literals are just an easy way to write character arrays."
 
 http://digitalmars.com/d/1.0/arrays.html
 
 In my previous post, I also use the "character array" terminology.
 
 Unfortunately, though, it's just not true.
 
 A char[] is actually an array of UTF-8 encoded octets, where each 
 character may consume one or more consecutive elements of the array. 
 Retrieving the str.length property may or may not tell you how many 
 characters are in the string. And pretty much any code that tries to 
 iterate character-by-character through the array elements is 
 fundamentally broken.

try this: foreach (dchar c; str) { process c }
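spelled out as a complete program (same Tango import as the example below it), that looks like:

------------------------------------------------------------------
import tango.io.Stdout;

void main() {
    char[] str = "mötley crüe";
    // a dchar loop variable makes foreach decode the UTF-8 as it goes:
    // 11 whole characters come out, though the array holds 13 octets.
    foreach (dchar c; str)
        Stdout.formatln("char: {}", c);
}
------------------------------------------------------------------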
 Take a look at this code, for example:
 
 ------------------------------------------------------------------
 import tango.io.Stdout;
 
 void main() {
 
     // Create a string with UTF-8 content
     char[] str = "mötley crüe";
     Stdout.formatln("full string value: {}", str);
 
     Stdout.formatln("len: {}", str.length);
     // --> "len: 13" ... but there are only 11 characters!
 
     Stdout.formatln("2nd char: '{}'", str[1]);
     // --> "2nd char: ''" ... where'd my character go?
 
     Stdout.formatln("first 3 chars: '{}'", str[0..3]);
     // --> "first 3 chars: 'mö'" ... why only 2?
 
     char o_umlat = 'ö';
     Stdout.formatln("char value: '{}'", o_umlat);
     // --> "char value: ''" ... where's my char?
 
 }
 ------------------------------------------------------------------
 
 So you can't actually iterate the the char elements of a char[] without 
 risking that you'll turn your string data into garbage. And you can't 
 trust that the length property tells you how many characters there are. 
 And you can't trust that an index or a slice will return valid data.

you can iterate with foreach or lib functions. an index or slice won't return valid data indeed, but it couldn't anyway. there's no o(1) indexing into a string unless it's utf32.
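The UTF-32 trade-off is available off the shelf: transcode once with Phobos's std.utf.toUTF32 and indexing is O(1) afterwards, at the cost of four bytes per character. A minimal sketch:

------------------------------------------------------------------
import std.utf;

void main() {
    char[] s = "mötley crüe";    // UTF-8: 13 octets for 11 characters
    dchar[] d = toUTF32(s);      // UTF-32: one array element per character
    assert(d.length == 11);
    assert(d[1] == 'ö');         // O(1) indexing by character position
}
------------------------------------------------------------------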
 Also: take a look at the Phobos string "find" functions:
 
    int find(char[] s, dchar c);
    int ifind(char[] s, dchar c);
    int rfind(char[] s, dchar c);
    int irfind(char[] s, dchar c);
 
 Huh?
 
 To find a character in a char[] array, you have to use a dchar?
 
 To me, that's like looking for a long within an int[] array.

because you're wrong. you look for a dchar which can represent all characters in an array of a given encoding. the comparison is off.
 So.. If a char[] actually consists of dchar elements, does that mean I 
 can append a dchar to a char[] array?
 
    dchar u_umlat = 'ü';
    char[] newString = "mötley crüe" ~ u_umlat;
 
 No. Of course not. The compiler complains that you can't concatenate a 
 dchar to a char[] array. Even though the "find" functions indicate that 
 the array is truly a collection of dchar elements.

that's a bug in the compiler. report it.
 Now, don't get me wrong. I understand why the string is encoded as 
 UTF-8. And I understand that the encoding prevents accurate element 
 iteration, indexing, slicing, and all the other nice array goodies.

i know you understand. you should also understand
 The existing D string implementation is exactly what I'd expect to see 
 inside the guts of a string class, because encodings are important and 
 efficiency is important. But those implementation details shouldn't be 
 exposed through a public API.

exactly at this point your argument kinda explodes. yes, you should see that stuff inside the guts of a string. which means builtin strings should be just arrays that you build larger stuff from. but wait. that's exactly what happens right now.
 To claim that D strings are actually usable as character arrays is more 
 than a little spurious, since direct access of the array elements can 
 return fragmented garbage bytes.

agreed.
 If accurate string manipulation is impossible without a set of 
 special-purpose functions, then I'll argue that the implementation is 
 already equivalent to that of a class, but without any of the niceties 
 of encapsulation and polymorphism.

and without the disadvantages.
Aug 25 2008
next sibling parent reply BCS <ao pathlink.com> writes:
Reply to superdan,

 The existing D string implementation is exactly what I'd expect to see
 inside the guts of a string class, because encodings are important and
 efficiency is important. But those implementation details shouldn't be
 exposed through a public API.

exactly at this point your argument kinda explodes. yes, you should see that stuff inside the guts of a string. which means builtin strings should be just arrays that you build larger stuff from. but wait. that's exactly what happens right now.

Ditto. D is a *systems language*. It's *supposed* to have access to the lowest level representation and build stuff on top of that
Aug 25 2008
parent reply Benji Smith <dlanguage benjismith.net> writes:
BCS wrote:
 Ditto, D is a *systems language* It's *supposed* to have access to the 
 lowest level representation and build stuff on top of that

But in this "systems language", it's an O(n) operation to get the nth 
character from a string, to slice a string based on character offsets, 
or to determine the number of characters in the string.

I'd gladly pay the price of a single interface vtable lookup to turn 
all of those into O(1) operations.

--benji
Aug 25 2008
next sibling parent reply BCS <ao pathlink.com> writes:
Reply to Benji,

 BCS wrote:
 
 Ditto, D is a *systems language* It's *supposed* to have access to
 the lowest level representation and build stuff on top of that
 

But in this "systems language", it's an O(n) operation to get the nth 
character from a string, to slice a string based on character offsets, 
or to determine the number of characters in the string. I'd gladly pay 
the price of a single interface vtable lookup to turn all of those into 
O(1) operations. --benji

Then borrow, buy, steal or build a class that does that /on top of the D arrays/ No one has said that this should not be available, just that it should not /replace/ what is available
Aug 25 2008
parent reply Benji Smith <dlanguage benjismith.net> writes:
BCS wrote:
 Reply to Benji,
 
 BCS wrote:

 Ditto, D is a *systems language* It's *supposed* to have access to
 the lowest level representation and build stuff on top of that

But in this "systems language", it's an O(n) operation to get the nth 
character from a string, to slice a string based on character offsets, 
or to determine the number of characters in the string. I'd gladly pay 
the price of a single interface vtable lookup to turn all of those into 
O(1) operations. --benji

Then borrow, buy, steal or build a class that does that /on top of the D arrays/ No one has said that this should not be available, just that it should not /replace/ what is available

The point is that the new string class would be incompatible with the *hundreds* of existing functions that process character arrays. Why don't strings qualify for polymorphism? Am I the only one who thinks the existing tradeoff is a fool's bargain? --benji
Aug 25 2008
next sibling parent BCS <ao pathlink.com> writes:
Reply to Benji,

 BCS wrote:
 
 Reply to Benji,
 
 BCS wrote:
 
 Ditto, D is a *systems language* It's *supposed* to have access to
 the lowest level representation and build stuff on top of that
 

But in this "systems language", it's an O(n) operation to get the nth 
character from a string, to slice a string based on character offsets, 
or to determine the number of characters in the string. I'd gladly pay 
the price of a single interface vtable lookup to turn all of those into 
O(1) operations. --benji

the D arrays/ No one has said that this should not be available, just that it should not /replace/ what is available

*hundreds* of existing functions that process character arrays.

That is an issue with (and *only* with) "the *hundreds* of existing functions that process character arrays".
Aug 25 2008
prev sibling parent "Chris R. Miller" <lordSaurontheGreat gmail.com> writes:

Benji Smith wrote:
 BCS wrote:
 Reply to Benji,

 BCS wrote:

 Ditto, D is a *systems language* It's *supposed* to have access to
 the lowest level representation and build stuff on top of that

 But in this "systems language", it's an O(n) operation to get the nth
 character from a string, to slice a string based on character offsets,
 or to determine the number of characters in the string.

 I'd gladly pay the price of a single interface vtable lookup to turn
 all of those into O(1) operations.

 --benji

Then borrow, buy, steal or build a class that does that /on top of the
D arrays/

 No one has said that this should not be available, just that it should
 not /replace/ what is available

The point is that the new string class would be incompatible with the 
*hundreds* of existing functions that process character arrays.

Why don't strings qualify for polymorphism?

-------------------------------------------
wchar[] foo = "text"w;

int indexOf(char[] str, char ch) {
    foreach (int idx, char c; str)
        if (c == ch)
            return idx;
    return -1;
}

void main() {
    assert(indexOf(foo, 'x') == 2);
}
-------------------------------------------

If that does compile, it shouldn't. The best way to get that to work is 
to use a template. Templates can be annoying. A String class could 
simplify the different kinds of String inherent in D. The String class 
would (should) internally know what kind of String it is (wchar, char, 
dchar) and know how to mitigate those differences when operations are 
called on it.

Benji: If you want a String class, why don't you write one? It's a 
fairly simple task; even high-school CS students do it quite routinely 
in C++ (which is a lot more unwieldy for OOP than D is). A very 
successful instance of Strings-as-objects is present in Java. I'd 
suggest trying to duplicate that functionality. Then you could easily 
write wrappers on existing libraries to use the new String object.
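The template version alluded to above might look like this; it's a sketch, instantiated once per string type instead of forcing everything through char[]:

------------------------------------------------------------------
// Hypothetical templated indexOf: works for char[], wchar[], and
// dchar[] alike. The foreach decodes each element to a dchar, so the
// comparison is always between whole characters.
int indexOf(T)(T[] str, dchar ch) {
    foreach (int idx, dchar c; str)
        if (c == ch)
            return idx;   // idx is the code-unit index where c starts
    return -1;
}

void main() {
    wchar[] w = "text"w;
    char[]  c = "text";
    assert(indexOf(w, 'x') == 2);
    assert(indexOf(c, 'x') == 2);
}
------------------------------------------------------------------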
Aug 25 2008
prev sibling next sibling parent reply superdan <super dan.org> writes:
Benji Smith Wrote:

 BCS wrote:
 Ditto, D is a *systems language* It's *supposed* to have access to the 
 lowest level representation and build stuff on top of that

But in this "systems language", it's an O(n) operation to get the nth 
character from a string, to slice a string based on character offsets, 
or to determine the number of characters in the string. I'd gladly pay 
the price of a single interface vtable lookup to turn all of those into 
O(1) operations.

dood. i dunno where to start. allow me to answer from multiple angles.

1. when was the last time looking up one char in a string or computing length was your bottleneck.

2. you talk as if o(1) happens by magic that d currently disallows.

3. maybe i don't want to blow the size of my string by a factor of 4 if i'm just interested in some occasional character search.

4. implement all that nice stuff you wanna. nobody put a gun to yer head not to. understand you can't put a gun to my head to pay the price.
Aug 25 2008
parent reply Benji Smith <dlanguage benjismith.net> writes:
superdan wrote:
 Benji Smith Wrote:
 
 BCS wrote:
 Ditto, D is a *systems language* It's *supposed* to have access to the 
 lowest level representation and build stuff on top of that

But in this "systems language", it's an O(n) operation to get the nth 
character from a string, to slice a string based on character offsets, 
or to determine the number of characters in the string. I'd gladly pay 
the price of a single interface vtable lookup to turn all of those into 
O(1) operations.

dood. i dunno where to start. allow me to answer from multiple angles. 1. when was the last time looking up one char in a string or computing length was your bottleneck. 2. you talk as if o(1) happens by magic that d currently disallows. 3. maybe i don't want to blow the size of my string by a factor of 4 if i'm just interested in some occasional character search. 4. implement all that nice stuff you wanna. nobody put a gun to yer head not to. understand you can't put a gun to my head to pay the price.

Geez, man, you just keep missing the point, over and over again.

Let me make one point, blisteringly clear: I don't give a shit about the 
data format. You want the fastest strings in the universe, implemented 
with zero-byte magic beans and burned into the local ROM. Fantastic! I'm 
completely in favor of it.

Presumably, people will be so into those strings that they'll write a 
shitload of functionality for them. Parsing, searching, sorting, 
indexing... the motherlode.

One day, I come along, and I'd like to perform some text processing. But 
all of my string data comes from non-magic-beans data sources. I'd like 
to implement a new kind of string class that supports my data. I'm not 
going to push my super-slow string class on anybody else, because I know 
how concerned with performance you are.

But check this out... you can have your fast class, and I can have my 
slow class, and they can both implement the same interface. Like this:

interface CharSequence {
   int find(CharSequence needle);
   int rfind(CharSequence needle);
   // ...
}

class ZeroByteFastMagicString : CharSequence {
   // ...
}

class SuperSlowStoneTabletString : CharSequence {
   // ...
}

Now we can both use the same string functions. Just by implementing an 
interface, I can use the same text-processing as your 
hyper-compiler-optimized builtin arrays.

But only if the interface exists. And only if library authors write 
their text-processing code against that interface. That's the point.

A good API allows multiple implementations to make use of the same 
algorithms. Application authors can choose their own tradeoffs between 
speed, memory consumption, and functionality. A rigid builtin 
implementation, with no interface definition, locks everybody into the 
same choices.

--benji
Aug 25 2008
next sibling parent BCS <ao pathlink.com> writes:
Reply to Benji,

 But check this out... you can have your fast class, and I can have my
 slow class, and they can both implement the same interface. Like this:
 

No, you can't. The overhead needed to implement that is EXACTLY what we 
are unwilling to use.

I want an indexed array load

   x = arr[i];

to be:

-- load arr.ptr into a reg
-- add i to that reg
-- indirect load through that reg into x

3 ASM ops. If you can get that and what you want, go get a PhD. You 
will have earned it.
Aug 25 2008
prev sibling next sibling parent reply Robert Fraser <fraserofthenight gmail.com> writes:
Benji Smith wrote:
 superdan wrote:
 Benji Smith Wrote:

 BCS wrote:
 Ditto, D is a *systems language* It's *supposed* to have access to 
 the lowest level representation and build stuff on top of that

But in this "systems language", it's an O(n) operation to get the nth 
character from a string, to slice a string based on character offsets, 
or to determine the number of characters in the string. I'd gladly pay 
the price of a single interface vtable lookup to turn all of those into 
O(1) operations.

dood. i dunno where to start. allow me to answer from multiple angles. 1. when was the last time looking up one char in a string or computing length was your bottleneck. 2. you talk as if o(1) happens by magic that d currently disallows. 3. maybe i don't want to blow the size of my string by a factor of 4 if i'm just interested in some occasional character search. 4. implement all that nice stuff you wanna. nobody put a gun to yer head not to. understand you can't put a gun to my head to pay the price.

Geez, man, you just keep missing the point, over and over again. Let me make one point, blisteringly clear: I don't give a shit about the data format. You want the fastest strings in the universe, implemented with zero-byte magic beans and burned into the local ROM. Fantastic! I'm completely in favor of it. Presumably. people will be so into those strings that they'll write a shitload of functionality for them. Parsing, searching, sorting, indexing... the motherload. One day, I come along, and I'd like to perform some text processing. But all of my string data comes from non-magic-beans data sources. I'd like to implement a new kind of string class that supports my data. I'm not going to push my super-slow string class on anybody else, because I know how concerned with performance you are. But check this out... you can have your fast class, and I can have my slow class, and they can both implement the same interface. Like this: interface CharSequence { int find(CharSequence needle); int rfind(CharSequence needle); // ... } class ZeroByteFastMagicString : CharSequence { // ... } class SuperSlowStoneTabletString : CharSequence { // ... } Now we can both use the same string functions. Just by implementing an interface, I can use the same text-processing as your hyper-compiler-optimized builtin arrays. But only if the interface exists. And only if library authors write their text-processing code against that interface. That's the point. A good API allows multiple implementations to make use of the same algorithms. Application authors can choose their own tradeoffs between speed, memory consumption, and functionality. A rigid builtin implementation, with no interface definition, locks everybody into the same choices. --benji

Superdan is confusing the issues here. The main argument against your proposal (besides backwards compatibility, of course) is that every access would require a virtual call, which can be fairly slow.
Aug 25 2008
parent superdan <super dan.org> writes:
Robert Fraser Wrote:

 Benji Smith wrote:
 superdan wrote:
 Benji Smith Wrote:

 BCS wrote:
 Ditto, D is a *systems language* It's *supposed* to have access to 
 the lowest level representation and build stuff on top of that

But in this "systems language", it's an O(n) operation to get the nth 
character from a string, to slice a string based on character offsets, 
or to determine the number of characters in the string. I'd gladly pay 
the price of a single interface vtable lookup to turn all of those into 
O(1) operations.

dood. i dunno where to start. allow me to answer from multiple angles. 1. when was the last time looking up one char in a string or computing length was your bottleneck. 2. you talk as if o(1) happens by magic that d currently disallows. 3. maybe i don't want to blow the size of my string by a factor of 4 if i'm just interested in some occasional character search. 4. implement all that nice stuff you wanna. nobody put a gun to yer head not to. understand you can't put a gun to my head to pay the price.

Geez, man, you just keep missing the point, over and over again. Let me make one point, blisteringly clear: I don't give a shit about the data format. You want the fastest strings in the universe, implemented with zero-byte magic beans and burned into the local ROM. Fantastic! I'm completely in favor of it. Presumably. people will be so into those strings that they'll write a shitload of functionality for them. Parsing, searching, sorting, indexing... the motherload. One day, I come along, and I'd like to perform some text processing. But all of my string data comes from non-magic-beans data sources. I'd like to implement a new kind of string class that supports my data. I'm not going to push my super-slow string class on anybody else, because I know how concerned with performance you are. But check this out... you can have your fast class, and I can have my slow class, and they can both implement the same interface. Like this: interface CharSequence { int find(CharSequence needle); int rfind(CharSequence needle); // ... } class ZeroByteFastMagicString : CharSequence { // ... } class SuperSlowStoneTabletString : CharSequence { // ... } Now we can both use the same string functions. Just by implementing an interface, I can use the same text-processing as your hyper-compiler-optimized builtin arrays. But only if the interface exists. And only if library authors write their text-processing code against that interface. That's the point. A good API allows multiple implementations to make use of the same algorithms. Application authors can choose their own tradeoffs between speed, memory consumption, and functionality. A rigid builtin implementation, with no interface definition, locks everybody into the same choices. --benji

Superdan is confusing the issues here. The main argument against your proposal (besides backwards compatibility, of course) is that every access would require a virtual call, which can be fairly slow.

i'm not confusin'. mentioned the efficiency thing a number of times, didn't seem to faze him a bit. so i tried some more viewpoints.
Aug 25 2008
prev sibling parent reply superdan <super dan.org> writes:
Benji Smith Wrote:

 superdan wrote:
 Benji Smith Wrote:
 
 BCS wrote:
 Ditto, D is a *systems language* It's *supposed* to have access to the 
 lowest level representation and build stuff on top of that

But in this "systems language", it's an O(n) operation to get the nth 
character from a string, to slice a string based on character offsets, 
or to determine the number of characters in the string. I'd gladly pay 
the price of a single interface vtable lookup to turn all of those into 
O(1) operations.

dood. i dunno where to start. allow me to answer from multiple angles. 1. when was the last time looking up one char in a string or computing length was your bottleneck. 2. you talk as if o(1) happens by magic that d currently disallows. 3. maybe i don't want to blow the size of my string by a factor of 4 if i'm just interested in some occasional character search. 4. implement all that nice stuff you wanna. nobody put a gun to yer head not to. understand you can't put a gun to my head to pay the price.

Geez, man, you just keep missing the point, over and over again.

relax. believe me i'm tryin', maybe you could put it a better way and meet me in the middle.
 Let me make one point, blisteringly clear: I don't give a shit about the 
    data format. You want the fastest strings in the universe, 
 implemented with zero-byte magic beans and burned into the local ROM. 
 Fantastic! I'm completely in favor of it.

so far so good.
 Presumably. people will be so into those strings that they'll write a 
 shitload of functionality for them. Parsing, searching, sorting, 
 indexing... the motherload.

cool.
 One day, I come along, and I'd like to perform some text processing. But 
 all of my string data comes from non-magic-beans data sources. I'd like 
 to implement a new kind of string class that supports my data. I'm not 
 going to push my super-slow string class on anybody else, because I know 
 how concerned with performance you are.

i'm in nirvana.
 But check this out... you can have your fast class, and I can have my 
 slow class, and they can both implement the same interface. Like this:
 
 interface CharSequence {
    int find(CharSequence needle);
    int rfind(CharSequence needle);
    // ...
 }
 
 class ZeroByteFastMagicString : CharSequence {
    // ...
 }
 
 class SuperSlowStoneTabletString : CharSequence {
    // ...
 }
 
 Now we can both use the same string functions. Just by implementing an 
 interface, I can use the same text-processing as your 
 hyper-compiler-optimized builtin arrays.

but maestro. the interface call is already what's costing.
 But only if the interface exists.
 
 And only if library authors write their text-processing code against 
 that interface.
 
 That's the point.

then there was none. sorry.
 A good API allows multiple implementations to make use of the same 
 algorithms. Application authors can choose their own tradeoffs between 
 speed, memory consumption, and functionality.
 
 A rigid builtin implementation, with no interface definition, locks 
 everybody into the same choices.

no. this is just wrong. perfectly backwards in fact. a low-level builtin allows unbounded architectures with control over efficiency.
Aug 25 2008
parent reply Benji Smith <dlanguage benjismith.net> writes:
superdan wrote:
 relax. believe me i'm tryin', maybe you could put it a better way and meet me
in the middle.

Okay. I'll try :)

Think about a collection API. The container classes are all written to 
satisfy a few basic primitive operations: you can get an item at a 
particular index, you can iterate in sequence (either forward or in 
reverse). You can insert items into a hashtable or retrieve them by 
key. And so on.

Someone else comes along and writes a library of algorithms. The 
algorithms can operate on any container that implements the necessary 
operations.

When someone clever comes along and writes a new sorting algorithm, I 
can plug my new container class right into it, and get the algorithm 
for free. Likewise for the guy with the clever new collection class.

We don't bat an eye at the idea of containers & algorithms connecting 
to one another using a reciprocal set of interfaces. In most cases, you 
get a performance **benefit** because you can mix and match the 
container and algorithm implementations that most suit your needs. You 
can design your own performance solution, rather than being stuck with 
a single "low level" implementation that might be good for the general 
case but isn't ideal for your problem.

Over in another message BCS said he wants an array index to compile to 
3 ASM ops. Cool. I'm all for it.

I don't know a whole lot about the STL, but my understanding is that 
most C++ compilers are smart enough that they can produce the same ASM 
from an iterator moving over a vector as from incrementing a pointer 
over an array. So the default implementation is damn fast.

But if someone else, with special design constraints, needs to 
implement a custom container template, it's no problem. As long as the 
container provides a function for getting iterators to the container 
elements, it can consume any of the STL algorithms too, even if the 
performance isn't as good as the performance for a vector.

There's no good reason the same technique couldn't provide both speed 
and API flexibility for text processing.

--benji
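In D, the STL story translates into templates rather than runtime interfaces. A sketch of that compile-time flavor (hypothetical names, assuming only indexing and length access):

------------------------------------------------------------------
// Hypothetical duck-typed algorithm: any type usable with [] and
// .length will instantiate. The calls bind at compile time, so for a
// plain array this boils down to ordinary array indexing -- no vtable.
T maxElement(T, C)(C container) {
    T best = container[0];
    for (size_t i = 1; i < container.length; i++)
        if (container[i] > best)
            best = container[i];
    return best;
}

void main() {
    int[] arr = [3, 1, 4, 1, 5];
    assert(maxElement!(int, int[])(arr) == 5);
}
------------------------------------------------------------------

This compile-time version and a runtime CharSequence interface aren't mutually exclusive: the template can be instantiated with a class that does interface dispatch just as easily as with a raw array that doesn't.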
Aug 25 2008
next sibling parent reply superdan <super dan.org> writes:
Benji Smith Wrote:

 superdan wrote:
 relax. believe me i'm tryin', maybe you could put it a better way and meet me
in the middle.

Okay. I'll try :)

'preciate that.
 Think about a collection API.

okay.
 The container classes are all written to satisfy a few basic primitive 
 operations: you can get an item at a particular index, you can iterate 
 in sequence (either forward or in reverse). You can insert items into a 
 hashtable or retrieve them by key. And so on.

how do you implement getting an item at a particular index for a linked list? how do you make a hashtable, an array, and a linked list obey the same interface? guess hashtable has stuff that others don't support? these are serious questions. not in jest not rhetorical and not trick.
 Someone else comes along and writes a library of algorithms. The 
 algorithms can operate on any container that implements the necessary 
 operations.

hm. things are starting to screech a bit. but let's see your answers to your questions above.
 When someone clever comes along and writes a new sorting algorithm, I 
 can plug my new container class right into it, and get the algorithm for 
 free. Likewise for the guy with the clever new collection class.

things ain't that simple. saw this flick "the devil wears prada", an ok movie but one funny remark stayed with me. "you are in desperate need of chanel." i'll paraphrase. "you are in desperate need of stl." you need to learn stl and then you'll figure why you can't plug a new sorting algorithm into a container. you need more guarantees. and you need iterators.
 We don't bat an eye at the idea of containers & algorithms connecting to 
 one another using a reciprocal set of interfaces.

i do. it's completely wrong. you need iterators that broker between containers and algos. and iterators must give complexity guarantees.
 In most cases, you get 
 a performance **benefit** because you can mix and match the container 
 and algorithm implementations that most suit your needs. You can design 
 your own performance solution, rather than being stuck a single "low 
 level" implementation that might be good for the general case but isn't 
 ideal for your problem.

assuming there are iterators in the picture, sure. there is a performance benefit. even more so when said mixing and matching is done during compilation.
 Over in another message BCS said he wants an array index to compile to 3 
 ASM ops. Cool I'm all for it.

great. but then you must be all for the consequences of it.
 I don't know a whole lot about the STL, but my understanding is that 
 most C++ compilers are smart enough that they can produce the same ASM 
 from an iterator moving over a vector as incrementing a pointer over an 
 array.

they are because stl is designed in a specific way. that specific way is lightyears away from the design you outline above.
 So the default implementation is damn fast.

not sure what you mean by default here, but playing along.
 But if someone else, with special design constraints, needs to implement 
 a custom container template, it's no problem. As long as the container 
 provides a function for getting iterators to the container elements, it 
 can consume any of the STL algorithms too, even if the performance isn't 
 as good as the performance for a vector.
 
 There's no good reason the same technique couldn't provide both speed 
 and API flexibility for text processing.

you see here's the problem. you systematically forget to factor in the cost of reaching through a binary interface. and if that's not there, congrats. you just discovered perpetual motion. stl is fast for two main reasons. one. it uses compile-time interfaces and not run-time interfaces as you want. two. it defines and strictly uses a compile-time hierarchy of iterators with stringent complexity guarantees. your container design can't be fast because it uses runtime interfaces. let alone that you don't mention complexity guarantees. but let's say those can be provided. but the fundamental problem is that you want runtime interfaces for a very low level data structure. fast that can't be. please understand.
Aug 25 2008
parent reply Christopher Wright <dhasenan gmail.com> writes:
superdan wrote:
 Benji Smith Wrote:
 
 superdan wrote:
 relax. believe me i'm tryin', maybe you could put it a better way and meet me
in the middle.


'preciate that.
 Think about a collection API.

okay.
 The container classes are all written to satisfy a few basic primitive 
 operations: you can get an item at a particular index, you can iterate 
 in sequence (either forward or in reverse). You can insert items into a 
 hashtable or retrieve them by key. And so on.

how do you implement getting an item at a particular index for a linked list?

class Node(T)
{
	Node!(T) next;
	T value;
}

class LinkedList(T)
{
	Node!(T) head;

	/** Gets the ith item of the list. Throws: SIGSEGV if i >= length of
	the list. Time complexity: O(N) for a list of length N. This operation
	is provided for completeness and not recommended for frequent use in
	large lists. */
	T opIndex(int i)
	{
		auto current = head;
		while (i)
		{
			current = current.next;
			i--;
		}
		return current.value;
	}
}
 how do you make a hashtable, an array, and a linked list obey the same
interface? guess hashtable has stuff that others don't support?

You have an interface for collections (you can add, remove, and get the length, maybe a couple other things). You have an interface for lists (they're collections, and you can index them). Then you can use all the collection-oriented stuff with lists, and you can do special list-type things with them if you want.
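As a rough C++ sketch of that layering (`Collection`, `List`, and `ArrayList` here are hypothetical names, and this deliberately uses the runtime interfaces being argued about):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Base interface: any collection can report its size and accept items.
struct Collection
{
    virtual ~Collection() = default;
    virtual void add(int x) = 0;
    virtual std::size_t length() const = 0;
};

// Lists are collections that can additionally be indexed.
struct List : Collection
{
    virtual int get(std::size_t i) const = 0;
};

// An array-backed list satisfying both interfaces.
class ArrayList : public List
{
    std::vector<int> items_;
public:
    void add(int x) override { items_.push_back(x); }
    std::size_t length() const override { return items_.size(); }
    int get(std::size_t i) const override { return items_[i]; }
};

// Code written against Collection works with any implementation;
// code that needs indexing asks for a List.
int second_item()
{
    ArrayList a;
    a.add(4);
    a.add(7);
    const List& l = a;
    return l.get(1);
}
```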
 these are serious questions. not in jest not rhetorical and not trick.

Yes, but they're solved problems.
 Someone else comes along and writes a library of algorithms. The 
 algorithms can operate on any container that implements the necessary 
 operations.

hm. things are starting to screech a bit. but let's see your answers to your questions above.
 When someone clever comes along and writes a new sorting algorithm, I 
 can plug my new container class right into it, and get the algorithm for 
 free. Likewise for the guy with the clever new collection class.

things ain't that simple.

Collection-oriented library code will care sufficiently about performance that this mix-and-match stuff is not feasible. Almost anything else doesn't care enough to take only an AssociativeArrayHashSet and not a TreeSet or a LinkedList or a primitive array.
 saw this flick "the devil wears prada", an ok movie but one funny remark
stayed with me. "you are in desperate need of chanel." 
 
 i'll paraphrase. "you are in desperate need of stl." you need to learn stl and
then you'll figure why you can't plug a new sorting algorithm into a container.
you need more guarantees. and you need iterators.
 
 We don't bat an eye at the idea of containers & algorithms connecting to 
 one another using a reciprocal set of interfaces.

i do. it's completely wrong. you need iterators that broker between containers and algos. and iterators must give complexity guarantees.

I don't. If I'm just going to iterate through the items of a collection, I only care about the opApply. If I need to index stuff, I don't care if I get a primitive array or Bob Dole's brand-new ArrayList class.
Aug 25 2008
parent reply superdan <super dan.org> writes:
Christopher Wright Wrote:

 superdan wrote:
 Benji Smith Wrote:
 
 superdan wrote:
 relax. believe me i'm tryin', maybe you could put it a better way and meet me
in the middle.


'preciate that.
 Think about a collection API.

okay.
 The container classes are all written to satisfy a few basic primitive 
 operations: you can get an item at a particular index, you can iterate 
 in sequence (either forward or in reverse). You can insert items into a 
 hashtable or retrieve them by key. And so on.

how do you implement getting an item at a particular index for a linked list?

class Node(T) { Node!(T) next; T value; }

so far so good.
 class LinkedList(T)
 {
 	Node!(T) head;
 
 	/** Gets the ith item of the list. Throws: SIGSEGV if i >= length of 
 the list. Time complexity: O(N) for a list of length N. This operation 
 is provided for completeness and not recommended for frequent use in 
 large lists. */
 	T opIndex(int i)
 	{
 		auto current = head;
 		while (i)
 		{
 			current = current.next;
 			i--;
 		}
 		return current.value;
 	}
 }

oopsies. houston we got a problem here. problem is all that pluggable sort business works only if it can count on a constant time opIndex. why? because sort has right in its spec that it takes o(n log n) time. if u pass LinkedList to it you obtain a nonsensical design that compiles but runs afoul of the spec. because with that opIndex sort will run in quadratic time and no amount of commentin' is gonna save it from that. this design is a stillborn. what needs done is to allow different kinds of containers implement different interfaces. in fact a better way to factor things is via iterators, as stl has shown.
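For reference, the STL encodes exactly this constraint: std::sort demands random-access iterators, so handing it std::list iterators fails to compile instead of silently going quadratic, while std::list carries its own O(n log n) node-splicing sort. A small C++ sketch:

```cpp
#include <algorithm>
#include <cassert>
#include <list>
#include <vector>

std::vector<int> sort_vector()
{
    std::vector<int> v = {3, 1, 2};
    std::sort(v.begin(), v.end()); // OK: vector iterators are random-access
    return v;
}

std::list<int> sort_list()
{
    std::list<int> l = {3, 1, 2};
    // std::sort(l.begin(), l.end()); // would NOT compile: list iterators
    //                                // are only bidirectional
    l.sort(); // member sort: O(n log n) by relinking nodes, no indexing
    return l;
}
```

The mismatch is caught at compile time by the iterator category, not discovered at runtime as a performance surprise.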
 how do you make a hashtable, an array, and a linked list obey the same
interface? guess hashtable has stuff that others don't support?

You have an interface for collections (you can add, remove, and get the length, maybe a couple other things).

this is an incomplete response. what do you add for a vector!(T)? A T. what do you add for a hash!(T, U)? you tell me. and you tell me how you make that signature consistent across vector and hash.
 You have an interface for lists (they're collections, and you can index 
 them).

wrong. don't ever mention a linear-time indexing operator in an interview. you will fail it right then. you can always define linear-time indexing as a named function. but never masquerade it as an index operator.
 Then you can use all the collection-oriented stuff with lists, and you 
 can do special list-type things with them if you want.
 
 these are serious questions. not in jest not rhetorical and not trick.

Yes, but they're solved problems.

apparently not since you failed at'em.
 Someone else comes along and writes a library of algorithms. The 
 algorithms can operate on any container that implements the necessary 
 operations.

hm. things are starting to screech a bit. but let's see your answers to your questions above.
 When someone clever comes along and writes a new sorting algorithm, I 
 can plug my new container class right into it, and get the algorithm for 
 free. Likewise for the guy with the clever new collection class.

things ain't that simple.

Collection-oriented library code will care sufficiently about performance that this mix-and-match stuff is not feasible.

what's that supposed to mean? you sayin' stl don't exist?
 Almost 
 anything else doesn't care enough to take only an 
 AssociativeArrayHashSet and not a TreeSet or a LinkedList or a primitive 
 array.

wrong for the reasons above.
 saw this flick "the devil wears prada", an ok movie but one funny remark
stayed with me. "you are in desperate need of chanel." 
 
 i'll paraphrase. "you are in desperate need of stl." you need to learn stl and
then you'll figure why you can't plug a new sorting algorithm into a container.
you need more guarantees. and you need iterators.
 
 We don't bat an eye at the idea of containers & algorithms connecting to 
 one another using a reciprocal set of interfaces.

i do. it's completely wrong. you need iterators that broker between containers and algos. and iterators must give complexity guarantees.

I don't. If I'm just going to iterate through the items of a collection, I only care about the opApply. If I need to index stuff, I don't care if I get a primitive array or Bob Dole's brand-new ArrayList class.

you too are in desperate need for stl.
Aug 25 2008
parent reply Christopher Wright <dhasenan gmail.com> writes:
superdan wrote:
 class LinkedList(T)
 {
 	Node!(T) head;

 	/** Gets the ith item of the list. Throws: SIGSEGV if i >= length of 
 the list. Time complexity: O(N) for a list of length N. This operation 
 is provided for completeness and not recommended for frequent use in 
 large lists. */
 	T opIndex(int i)
 	{
 		auto current = head;
 		while (i)
 		{
 			current = current.next;
 			i--;
 		}
 		return current.value;
 	}
 }

oopsies. houston we got a problem here. problem is all that pluggable sort business works only if it can count on a constant time opIndex. why? because sort has right in its spec that it takes o(n log n) time. if u pass LinkedList to it you obtain a nonsensical design that compiles but runs afoul of the spec. because with that opIndex sort will run in quadratic time and no amount of commentin' is gonna save it from that.

WRONG! Those sorting algorithms are correct. Their runtime is now O(n^2 log n) for this linked list.
 this design is a stillborn.
 
 what needs done is to allow different kinds of containers implement different
interfaces. in fact a better way to factor things is via iterators, as stl has
shown.

You didn't ask for an O(1) opIndex on a linked list. You asked for a correct opIndex on a linked list. Any sorting algorithm you name that would work on an array would also work on this linked list. Admittedly, insertion sort would be much faster than qsort, but if your library provides that, you, knowing that you are using a linked list, would choose the insertion sort algorithm.
 how do you make a hashtable, an array, and a linked list obey the same
interface? guess hashtable has stuff that others don't support?

length, maybe a couple other things).

this is an incomplete response. what do you add for a vector!(T)? A T. what do you add for a hash!(T, U)? you tell me. and you tell me how you make that signature consistent across vector and hash.

interface List(T) : Collection!(T) {}
class Vector(T) : List!(T) {}
class HashMap(T, U) : Collection!(KeyValuePair!(T, U)) {}

Have a look at C#'s collection classes. They solved this problem.
 You have an interface for lists (they're collections, and you can index 
 them).

wrong. don't ever mention a linear-time indexing operator in an interview. you will fail it right then. you can always define linear-time indexing as a named function. but never masquerade it as an index operator.

If you create a linked list with O(1) indexing, that might suffice to get you a PhD. If you claim that you can do so in an interview, you should be required to show proof; and should you fail to do so, you will probably be shown the door. Even if you did prove it in the interview, they would probably consider you overqualified, unless the company's focus was data structures.
 Then you can use all the collection-oriented stuff with lists, and you 
 can do special list-type things with them if you want.

 these are serious questions. not in jest not rhetorical and not trick.


apparently not since you failed at'em.

You claimed that a problem with an inefficient solution has no solution, and then you execrate me for providing an inefficient solution. Why?
 Someone else comes along and writes a library of algorithms. The 
 algorithms can operate on any container that implements the necessary 
 operations.

 When someone clever comes along and writes a new sorting algorithm, I 
 can plug my new container class right into it, and get the algorithm for 
 free. Likewise for the guy with the clever new collection class.


performance that this mix-and-match stuff is not feasible.

what's that supposed to mean? you sayin' stl don't exist?

No, I'm saying that for efficiency, you need to know about the internals of a data structure to implement a number of collection-oriented algorithms. Just like the linked list example.
 Almost 
 anything else doesn't care enough to take only an 
 AssociativeArrayHashSet and not a TreeSet or a LinkedList or a primitive 
 array.

wrong for the reasons above.

They expose very similar interfaces. You might care about which you choose because of the efficiency of various operations, but most of your code won't care which type it gets; it would still be correct. Well, sets have a property that no element appears in them twice, so that is an algorithmic consideration sometimes.
 saw this flick "the devil wears prada", an ok movie but one funny remark
stayed with me. "you are in desperate need of chanel." 

 i'll paraphrase. "you are in desperate need of stl." you need to learn stl and
then you'll figure why you can't plug a new sorting algorithm into a container.
you need more guarantees. and you need iterators.

 We don't bat an eye at the idea of containers & algorithms connecting to 
 one another using a reciprocal set of interfaces.


I only care about the opApply. If I need to index stuff, I don't care if I get a primitive array or Bob Dole's brand-new ArrayList class.

you too are in desperate need for stl.

You are in desperate need of System.Collections.Generic. Or tango.util.container.
Aug 26 2008
next sibling parent reply superdan <super dan.org> writes:
Christopher Wright Wrote:

 superdan wrote:
 class LinkedList(T)
 {
 	Node!(T) head;

 	/** Gets the ith item of the list. Throws: SIGSEGV if i >= length of 
 the list. Time complexity: O(N) for a list of length N. This operation 
 is provided for completeness and not recommended for frequent use in 
 large lists. */
 	T opIndex(int i)
 	{
 		auto current = head;
 		while (i)
 		{
 			current = current.next;
 			i--;
 		}
 		return current.value;
 	}
 }

oopsies. houston we got a problem here. problem is all that pluggable sort business works only if it can count on a constant time opIndex. why? because sort has right in its spec that it takes o(n log n) time. if u pass LinkedList to it you obtain a nonsensical design that compiles but runs afoul of the spec. because with that opIndex sort will run in quadratic time and no amount of commentin' is gonna save it from that.

WRONG! Those sorting algorithms are correct. Their runtime is now O(n^2 log n) for this linked list.

you got the wrong definition of correct. sort became a non-scalable algo from a scalable algo. whacha gonna say next. bubble sort is viable?!?
 this design is a stillborn.
 
 what needs done is to allow different kinds of containers implement different
interfaces. in fact a better way to factor things is via iterators, as stl has
shown.

You didn't ask for an O(1) opIndex on a linked list. You asked for a correct opIndex on a linked list.

the correct opIndex runs in o(1).
 Any sorting algorithm you name that 
 would work on an array would also work on this linked list. Admittedly, 
 insertion sort would be much faster than qsort, but if your library 
 provides that, you, knowing that you are using a linked list, would 
 choose the insertion sort algorithm.

no. the initial idea was for a design that allows that cool mixing and matching gig thru interfaces without knowing what is where. but ur design leads to a lot of unworkable mixing and matching. it is a stillborn design.
 how do you make a hashtable, an array, and a linked list obey the same
interface? guess hashtable has stuff that others don't support?

length, maybe a couple other things).

this is an incomplete response. what do you add for a vector!(T)? A T. what do you add for a hash!(T, U)? you tell me. and you tell me how you make that signature consistent across vector and hash.

interface List(T) : Collection!(T) {}
class Vector(T) : List!(T) {}
class HashMap(T, U) : Collection!(KeyValuePair!(T, U)) {}

Have a look at C#'s collection classes. They solved this problem.

dood. i know how the problem is solved. stl solved that before c#. my point was that making vector!(string) and hash!(int, string) offer the same interface is a tenuous proposition.
 You have an interface for lists (they're collections, and you can index 
 them).

wrong. don't ever mention a linear-time indexing operator in an interview. you will fail it right then. you can always define linear-time indexing as a named function. but never masquerade it as an index operator.

If you create a linked list with O(1) indexing, that might suffice to get you a PhD. If you claim that you can do so in an interview, you should be required to show proof; and should you fail to do so, you will probably be shown the door. Even if you did prove it in the interview, they would probably consider you overqualified, unless the company's focus was data structures.

my point was opIndex should not be written for a list to begin with.
 Then you can use all the collection-oriented stuff with lists, and you 
 can do special list-type things with them if you want.

 these are serious questions. not in jest not rhetorical and not trick.


apparently not since you failed at'em.

You claimed that a problem with an inefficient solution has no solution, and then you execrate me for providing an inefficient solution. Why?

because the correct answer was: a list cannot implement opIndex. it must be in a different hierarchy branch than a vector. which reveals one of the wrongs in the post i answered.
 Someone else comes along and writes a library of algorithms. The 
 algorithms can operate on any container that implements the necessary 
 operations.

 When someone clever comes along and writes a new sorting algorithm, I 
 can plug my new container class right into it, and get the algorithm for 
 free. Likewise for the guy with the clever new collection class.


performance that this mix-and-match stuff is not feasible.

what's that supposed to mean? you sayin' stl don't exist?

No, I'm saying that for efficiency, you need to know about the internals of a data structure to implement a number of collection-oriented algorithms. Just like the linked list example.

wrong. you only need to define your abstract types appropriately. e.g. stl defines forward and random iterators. a forward iterators has ++ but no []. random iterator has both. so a random iterator can be substituted for a forward iterator. but not the other way. bottom line, sort won't compile on a forward iterator. your design allowed it to compile. which makes the design wrong.
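A minimal C++ sketch of that compile-time dispatch, in the style of std::advance (`step` and `advance_by` are made-up names; the real library uses the same tag technique):

```cpp
#include <cassert>
#include <iterator>
#include <list>
#include <vector>

// Advance an iterator n steps, picking the O(1) or O(n) strategy
// at compile time from the iterator's category tag.
template <typename Iter>
void step(Iter& it, int n, std::random_access_iterator_tag)
{
    it += n; // constant time: the category guarantees operator+=
}

template <typename Iter>
void step(Iter& it, int n, std::forward_iterator_tag)
{
    while (n--) ++it; // linear time: ++ is all we may assume
}

template <typename Iter>
void advance_by(Iter& it, int n)
{
    step(it, n, typename std::iterator_traits<Iter>::iterator_category());
}

int third_of_vector()
{
    std::vector<int> v = {10, 20, 30, 40};
    auto it = v.begin();
    advance_by(it, 2); // resolves to the random-access overload
    return *it;
}

int fourth_of_list()
{
    std::list<int> l = {10, 20, 30, 40};
    auto it = l.begin();
    advance_by(it, 3); // bidirectional tag derives from forward: O(n) walk
    return *it;
}
```

An algorithm that requires `random_access_iterator_tag` simply has no overload to match a forward iterator, so the misuse fails to compile.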
 Almost 
 anything else doesn't care enough to take only an 
 AssociativeArrayHashSet and not a TreeSet or a LinkedList or a primitive 
 array.

wrong for the reasons above.

They expose very similar interfaces. You might care about which you choose because of the efficiency of various operations, but most of your code won't care which type it gets; it would still be correct.

no. you got the wrong notion of correctness.
 Well, sets have a property that no element appears in them twice, so 
 that is an algorithmic consideration sometimes.

finally a good point. true. thing is, that can't be told with types. indexing can.
 saw this flick "the devil wears prada", an ok movie but one funny remark
stayed with me. "you are in desperate need of chanel." 

 i'll paraphrase. "you are in desperate need of stl." you need to learn stl and
then you'll figure why you can't plug a new sorting algorithm into a container.
you need more guarantees. and you need iterators.

 We don't bat an eye at the idea of containers & algorithms connecting to 
 one another using a reciprocal set of interfaces.


I only care about the opApply. If I need to index stuff, I don't care if I get a primitive array or Bob Dole's brand-new ArrayList class.

you too are in desperate need for stl.

You are in desperate need of System.Collections.Generic. Or tango.util.container.

guess my advice fell on deaf ears eh. btw my respect for tango improved when i found no opIndex in their list container.
Aug 26 2008
next sibling parent reply "Steven Schveighoffer" <schveiguy yahoo.com> writes:
"superdan" wrote
 Christopher Wright Wrote:

 superdan wrote:
 class LinkedList(T)
 {
 Node!(T) head;

 /** Gets the ith item of the list. Throws: SIGSEGV if i >= length of
 the list. Time complexity: O(N) for a list of length N. This operation
 is provided for completeness and not recommended for frequent use in
 large lists. */
 T opIndex(int i)
 {
 auto current = head;
 while (i)
 {
 current = current.next;
 i--;
 }
 return current.value;
 }
 }

oopsies. houston we got a problem here. problem is all that pluggable sort business works only if it can count on a constant time opIndex. why? because sort has right in its spec that it takes o(n log n) time. if u pass LinkedList to it you obtain a nonsensical design that compiles but runs afoul of the spec. because with that opIndex sort will run in quadratic time and no amount of commentin' is gonna save it from that.

WRONG! Those sorting algorithms are correct. Their runtime is now O(n^2 log n) for this linked list.

you got the wrong definition of correct. sort became a non-scalable algo from a scalable algo. whacha gonna say next. bubble sort is viable?!?

O(n^2 log n) is still considered solved. Anything that is not exponential is usually considered to be in P, meaning it can be solved in polynomial time. The really hard problems are the NP-complete ones, where no polynomial-time algorithm is known. Non-polynomial usually means n is in one of the exponents, e.g.:

O(2^n).

That being said, it doesn't take a genius to figure out that a standard sorting algorithm on a linked list while trying to use random access is going to run longer than the same sorting algorithm on a random-access list. But there are ways around this. For instance, you can sort a linked list in O(n log n) time with (in pseudocode):

vector v = list; // copy all elements to v, O(n)
v.sort; // O(n lgn)
list.replaceAll(v); // O(n)

So the total is O(2n + n lgn), and we all know you always take the most significant part of the polynomial, so it then becomes:

O(n lgn)

Can I have my PhD now? :P

In all seriousness though, with the way you can call functions with arrays as the first argument like member functions, it almost seems like they are already classes. One thing I have against having a string class be the default is that you can use substring on a string in D without any heap allocation, and it is super-fast. And I think substring (slicing) is one of the best features that D has.

FWIW, you can have both a string class and an array representing a string, and you can define the string class to use an array as its backing storage. I do this in dcollections (ArrayList). If you want the interface, wrap the array; if you want the speed of an array, it is accessible as a member. This allows you to decide whichever one you want to use. You can even use algorithms on the array (like sort) by using the member, because you are accessing the actual storage of the ArrayList.

-Steve
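The pseudocode above translates directly to C++ (`sort_via_vector` and `demo` are made-up names for this sketch):

```cpp
#include <algorithm>
#include <cassert>
#include <list>
#include <vector>

// Sort a linked list in O(n log n): O(n) copy out, O(n log n) sort,
// O(n) copy back -- the log-linear term dominates.
void sort_via_vector(std::list<int>& l)
{
    std::vector<int> v(l.begin(), l.end());   // O(n) copy into random-access buffer
    std::sort(v.begin(), v.end());            // O(n log n)
    std::copy(v.begin(), v.end(), l.begin()); // O(n) write-back, same length
}

std::list<int> demo()
{
    std::list<int> l = {5, 3, 4, 1, 2};
    sort_via_vector(l);
    return l;
}
```

The caller still has to know it is holding a list to choose this strategy, which is the point being argued back and forth here.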
Aug 26 2008
parent reply superdan <super dan.org> writes:
Steven Schveighoffer Wrote:

 "superdan" wrote
 Christopher Wright Wrote:

 superdan wrote:
 class LinkedList(T)
 {
 Node!(T) head;

 /** Gets the ith item of the list. Throws: SIGSEGV if i >= length of
 the list. Time complexity: O(N) for a list of length N. This operation
 is provided for completeness and not recommended for frequent use in
 large lists. */
 T opIndex(int i)
 {
 auto current = head;
 while (i)
 {
 current = current.next;
 i--;
 }
 return current.value;
 }
 }

oopsies. houston we got a problem here. problem is all that pluggable sort business works only if it can count on a constant time opIndex. why? because sort has right in its spec that it takes o(n log n) time. if u pass LinkedList to it you obtain a nonsensical design that compiles but runs afoul of the spec. because with that opIndex sort will run in quadratic time and no amount of commentin' is gonna save it from that.

WRONG! Those sorting algorithms are correct. Their runtime is now O(n^2 log n) for this linked list.

you got the wrong definition of correct. sort became a non-scalable algo from a scalable algo. whacha gonna say next. bubble sort is viable?!?

O(n^2 log n) is still considered solved.

of course. for traveling salesman that is. just not sorting. o(n^2 log n) sorting algo, that's a no-go. notice also i didn't say solved/unsolved. i said scalable. stuff beyond n log n don't scale.
  Anything that is not exponential 
 is usually considered to be in P, meaning it can be solved in polynomial 
 time.  The really hard problems are the NP-complete ones, where no 
 polynomial-time algorithm is known.  Non-polynomial usually means n is in 
 one of the exponents, e.g.:
 
 O(2^n).

sure thing.
 That being said, it doesn't take a genius to figure out that a standard 
 sorting algorithm on a linked list while trying to use random access is 
 going to run longer than the same sorting algorithm on a random-access list. 
 But there are ways around this.  For instance, you can sort a linked list in 
 O(n log n) time with (in pseudocode):
 
 vector v = list; // copy all elements to v, O(n)
 v.sort; // O(n lgn)
 list.replaceAll(v); // O(n)

sure thing. problem is, you must know it's a list. otherwise you wouldn't make a copy. don't forget how this all started when answering tiny bits of my post. it started with a dood claiming vector and list both have opIndex and then sort works with both without knowing the details. it don't work with both.
 So the total is O(2n + n lgn), and we all know you always take the most 
 significant part of the polynomial, so it then becomes:
 
 O(n lgn)
 
 Can I have my PhD now? :P

sure. i must have seen an email with an offer somewhere ;)
 In all seriousness though, with the way you can call functions with arrays 
 as the first argument like member functions, it almost seems like they are 
 already classes.  One thing I have against having a string class be the 
 default, is that you can use substring on a string in D without any heap 
 allocation, and it is super-fast.  And I think substring (slicing) is one of 
 the best features that D has.
 
 FWIW, you can have both a string class and an array representing a string, 
 and you can define the string class to use an array as it's backing storage. 
 I do this in dcollections (ArrayList).  If you want the interface, wrap the 
 array, if you want the speed of an array, it is accessible as a member. 
 This allows you to decide whichever one you want to use.  You can even use 
 algorithms on the array like sort by using the member because you are 
 accessing the actual storage of the ArrayList.

that sounds better.
Aug 26 2008
parent reply "Chris R. Miller" <lordSaurontheGreat gmail.com> writes:

superdan wrote:
 Steven Schveighoffer Wrote:
 
 "superdan" wrote
 Christopher Wright Wrote:
 
 superdan wrote:
 class LinkedList(T)
 {
 Node!(T) head;
 
 /** Gets the ith item of the list. Throws: SIGSEGV if i >= length of
 the list. Time complexity: O(N) for a list of length N. This operation
 is provided for completeness and not recommended for frequent use in
 large lists. */
 T opIndex(int i)
 {
 auto current = head;
 while (i)
 {
 current = current.next;
 i--;
 }
 return current.value;
 }
 }
 
 oopsies. houston we got a problem here. problem is all that pluggable sort business works only if it can count on a constant time opIndex. why? because sort has right in its spec that it takes o(n log n) time. if u pass LinkedList to it you obtain a nonsensical design that compiles but runs afoul of the spec. because with that opIndex sort will run in quadratic time and no amount of commentin' is gonna save it from that.
 
 WRONG! Those sorting algorithms are correct. Their runtime is now O(n^2 log n) for this linked list.
 
 you got the wrong definition of correct. sort became a non-scalable algo from a scalable algo. whacha gonna say next. bubble sort is viable?!?
 
 O(n^2 log n) is still considered solved.
 
 of course. for traveling salesman that is. just not sorting. o(n^2 log n) sorting algo, that's a no-go. notice also i didn't say solved/unsolved. i said scalable. stuff beyond n log n don't scale.

Why are people unable to understand that they're using data structures in a suboptimal manner?

Furthermore, you're faulting linked lists as having a bad opIndex. Why not implement a cursor (a Java LinkedList-style iterator) in the opIndex function? You could retain a reference to the last indexed location and start from it, instead of the root node, on the next call to opIndex. Granted, whenever the contents of the list are modified that reference would have to be considered invalid (start from the root node again), but it would give O(1) efficiency for sequential accesses from 0 to length. True, it adds another pointer to a node in memory, as well as an integer recording the position of that node reference.
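A C++ sketch of that cursor idea (`CursorList` is a hypothetical type; mutation conservatively resets the cached position, as described above):

```cpp
#include <cassert>
#include <cstddef>
#include <list>
#include <utility>

// List wrapper memoizing the last indexed position, so sequential
// indexing (0, 1, 2, ...) costs O(1) amortized per access.
class CursorList
{
    std::list<int> data_;
    std::list<int>::iterator cursor_;
    std::size_t cursor_pos_ = 0;

public:
    explicit CursorList(std::list<int> d)
        : data_(std::move(d)), cursor_(data_.begin()) {}

    int at(std::size_t i)
    {
        if (i < cursor_pos_) // going backwards: restart from the head
        {
            cursor_ = data_.begin();
            cursor_pos_ = 0;
        }
        while (cursor_pos_ < i) // walk forward from the cached node
        {
            ++cursor_;
            ++cursor_pos_;
        }
        return *cursor_;
    }

    void push_back(int x)
    {
        data_.push_back(x);
        cursor_ = data_.begin(); // mutation invalidates the cached cursor
        cursor_pos_ = 0;
    }
};

int nth_via_cursor(std::size_t i)
{
    CursorList c(std::list<int>{7, 8, 9, 10});
    return c.at(i);
}
```

Worst case (random access) is still O(n), so this doesn't answer the complexity-guarantee objection; it only makes the common sequential pattern cheap.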
 sure thing. problem is, you must know it's a list. otherwise you wouldn't make
a copy.

 don't forget how this all started when answering tiny bits of my post. it
started with a dood claiming vector and list both have opIndex and then sort
works with both without knowing the details. it don't work with both.

Wrong. It works. That it's not precisely what the spec for sort dictates (which is probably in error, since no spec can guarantee a precise efficiency if it doesn't know the precise container type). You are also misinterpreting the spec. It is saying that it uses a specific efficiency of algorithm, not that you can arbitrarily expect a certain efficiency out of it regardless of how dumb you might be with the choice of container you use.
 So the total is O(2n + n lgn), and we all know you always take the most


 significant part of the polynomial, so it then becomes:

 O(n lgn)

 Can I have my PhD now? :P

sure. i must have seen an email with an offer somewhere ;)

A Ph.D from superdan... gee, I'd value that just above my MSDN membership. Remember: I value nothing less than my MSDN membership.
Aug 26 2008
next sibling parent reply superdan <super dan.org> writes:
Chris R. Miller Wrote:

 superdan wrote:
 Steven Schveighoffer Wrote:
 
 "superdan" wrote
 Christopher Wright Wrote:

 superdan wrote:
 class LinkedList(T)
 {
 Node!(T) head;

 /** Gets the ith item of the list. Throws: SIGSEGV if i >= length of
 the list. Time complexity: O(N) for a list of length N. This operation
 is provided for completeness and not recommended for frequent use in
 large lists. */
 T opIndex(int i)
 {
 auto current = head;
 while (i--)
 {
 current = current.next;
 }
 return current.value;
 }
 }

sort business works only if it can count on a constant time opIndex. why? because sort has right in its spec that it takes o(n log n) time. if u pass LinkedList to it you obtain a nonsensical design that compiles but runs afoul of the spec. because with that opIndex sort will run in quadratic time and no amount of commentin' is gonna save it from that.

Those sorting algorithms are correct. Their runtime is now O(n^2 log n) for this linked list.

from a scalable algo. whacha gonna say next. bubble sort is viable?!?


of course. for traveling salesman that is. just not sorting. o(n^2 log n) sorting algo, that's a no-go. notice also i didn't say solved/unsolved. i said scalable. stuff beyond n log n don't scale.

Why are people unable to have the brains to understand that they're using data structures in a suboptimal manner?

coz they wants to do generic programming. they can't know what structures are using. so mos def structures must define expressive interfaces that describe their capabilities.
 Furthermore, you're faulting linked lists as having a bad opIndex.  Why
 not implement a cursor (Java LinkedList-like iterator) in the opIndex
 function?  Thus you could retain the reference to the last indexed
 location, and simply use that instead of the root node when calling
 opIndex.  Granted that whenever the contents of the list are modified
 that reference would have to be considered invalid (start from the root
 node again), but it'd work with an O(1) efficiency for sequential
 accesses from 0 to length.  True, it'll add another pointer to a node in
 memory, as well as an integer representing the position of that node
 reference.

you: "this scent will make skunk farts stink less." me: "let's kick the gorram skunk outta here!"
 sure thing. problem is, you must know it's a list. otherwise you wouldn't make
a copy. 
 
 don't forget how this all started when answering tiny bits of my post. it
started with a dood claiming vector and list both have opIndex and then sort
works with both without knowing the details. it don't work with both.

Wrong. It works. That it's not precisely what the spec for sort dictates (which is probably in error, since no spec can guarantee a precise efficiency if it doesn't know the precise container type).

sure it can. in big oh.
  You
 are also misinterpreting the spec.  It is saying that it uses a specific
 efficiency of algorithm, not that you can arbitrarily expect a certain
 efficiency out of it regardless of how dumb you might be with the choice
 of container you use.

in stl the spec says as i say. in d the spec is not precise. it should.
 So the total is O(2n + n lgn), and we all know you always take the most 
 significant part of the polynomial, so it then becomes:

 O(n lgn)

 Can I have my PhD now? :P

sure. i must have seen an email with an offer somewhere ;)

A Ph.D from superdan... gee, I'd value that just above my MSDN membership. Remember: I value nothing less than my MSDN membership.

humor is a sign of intelligence. but let me explain it. i was referring to the spam emails advertising phds from non-accredited universities.
Aug 27 2008
parent reply "Chris R. Miller" <lordSaurontheGreat gmail.com> writes:

superdan wrote:
 Chris R. Miller Wrote:
 Furthermore, you're faulting linked lists as having a bad opIndex.  Why
 not implement a cursor (Java LinkedList-like iterator) in the opIndex
 function?  Thus you could retain the reference to the last indexed
 location, and simply use that instead of the root node when calling
 opIndex.  Granted that whenever the contents of the list are modified
 that reference would have to be considered invalid (start from the root
 node again), but it'd work with an O(1) efficiency for sequential
 accesses from 0 to length.  True, it'll add another pointer to a node in
 memory, as well as an integer representing the position of that node
 reference.

you: "this scent will make skunk farts stink less." me: "let's kick the gorram skunk outta here!"

I would imagine that you'll have a hard time convincing others that linked-lists are evil when you apparently have two broken shift keys.
 Wrong.  It works.  That it's not precisely what the spec for sort
 dictates (which is probably in error, since no spec can guarantee a
 precise efficiency if it doesn't know the precise container type).

sure it can. in big oh.

Which is simply identifying the algorithm used by its efficiency. If you're not familiar with the types of algorithms, it tells you the proximate efficiency of the algorithm used. If you are familiar with algorithms, then you can identify the type of algorithm used so you can better leverage it to do what you want.
  You
 are also misinterpreting the spec.  It is saying that it uses a specific
 efficiency of algorithm, not that you can arbitrarily expect a certain
 efficiency out of it regardless of how dumb you might be with the choice
 of container you use.

in stl the spec says as i say. in d the spec is not precise. it should.

Yes, it probably should explicitly say that "sort uses the xxxxx algorithm, which gives a proximate efficiency of O(n log n) when used with optimal data structures." You honestly cannot write a spec for generic programming and expect uniform performance.

Trying to move back on topic: yes, I believe it is important that such a degree of ambiguity be avoided with something so simple as string handling. So no strings-as-objects. But writing a String class and using that wherever possible is advantageous, especially because it does not remove the ability of the language to support the simpler string implementation.
 A Ph.D from superdan... gee, I'd value that just above my MSDN
 membership.  Remember: I value nothing less than my MSDN membership.

humor is a sign of intelligence. but let me explain it. i was referring to the spam emails advertising phds from non-accredited universities.
You get different spam than I do then. I just get junk about cheap Canadian pharmaceuticals and dead South African oil moguls who have left large amounts of money in my name.
Aug 27 2008
parent reply Dee Girl <deegirl noreply.com> writes:
Chris R. Miller Wrote:

 superdan wrote:
 Chris R. Miller Wrote:
 Furthermore, you're faulting linked lists as having a bad opIndex.  Why
 not implement a cursor (Java LinkedList-like iterator) in the opIndex
 function?  Thus you could retain the reference to the last indexed
 location, and simply use that instead of the root node when calling
 opIndex.  Granted that whenever the contents of the list are modified
 that reference would have to be considered invalid (start from the root
 node again), but it'd work with an O(1) efficiency for sequential
 accesses from 0 to length.  True, it'll add another pointer to a node in
 memory, as well as an integer representing the position of that node
 reference.

you: "this scent will make skunk farts stink less." me: "let's kick the gorram skunk outta here!"

I would imagine that you'll have a hard time convincing others that linked-lists are evil when you apparently have two broken shift keys.
 Wrong.  It works.  That it's not precisely what the spec for sort
 dictates (which is probably in error, since no spec can guarantee a
 precise efficiency if it doesn't know the precise container type).

sure it can. in big oh.

Which is simply identifying the algorithm used by its efficiency. If you're not familiar with the types of algorithms, it tells you the proximate efficiency of the algorithm used. If you are familiar with algorithms, then you can identify the type of algorithm used so you can better leverage it to do what you want.
  You
 are also misinterpreting the spec.  It is saying that it uses a specific
 efficiency of algorithm, not that you can arbitrarily expect a certain
 efficiency out of it regardless of how dumb you might be with the choice
 of container you use.

in stl the spec says as i say. in d the spec is not precise. it should.

Yes, it probably should explicitly say that "sort uses the xxxxx algorithm, which gives a proximate efficiency of O(n log n) when used with optimal data structures." You honestly cannot write a spec for generic programming and expect uniform performance.

But this is what STL did. Sorry, Dee Girl
Aug 27 2008
parent reply "Chris R. Miller" <lordSaurontheGreat gmail.com> writes:

Dee Girl wrote:
 Chris R. Miller Wrote:
 You honestly cannot write a spec for generic programming and expect
 uniform performance.

But this is what STL did. Sorry, Dee Girl

Reading back through the STL intro, it seems that all this STL power comes from the iterator. Supposing I wrote a horrid iterator (sort of like the annoying O(n) opIndex previously discussed) I don't see why STL is "immune" to the same weakness of a slower data structure.

I can see how STL is more powerful in that you can pick and choose the algorithm to use, but at this point I think we're discussing changing the nature of the sort property in D at a fundamental level. I still just don't see the (apparently obvious) advantage of STL.

Disclaimer: I do not /know/ STL that well at all. I came from Java with a ___brief___ dabbling in C/C++. So I'm not trying to be annoying, stupid, or blind - I'm just ignorant of what you see that I don't.
Aug 27 2008
parent Don <nospam nospam.com.au> writes:
Chris R. Miller wrote:
 Dee Girl wrote:
 Chris R. Miller Wrote:
 You honestly cannot write a spec for generic programming and expect
 uniform performance.


Reading back through the STL intro, it seems that all this STL power comes from the iterator. Supposing I wrote a horrid iterator (sort of like the annoying O(n) opIndex previously discussed) I don't see why STL is "immune" to the same weakness of a slower data structure. I can see how STL is more powerful in that you can pick and choose the algorithm to use, but at this point I think we're discussing changing the nature of the sort property in D at a fundamental level. I still just don't see the (apparently obvious) advantage of STL.

Read Alexander Stepanov's notes. They are fantastic. http://www.stepanovpapers.com/notes.pdf
Aug 28 2008
prev sibling parent Dee Girl <deegirl noreply.com> writes:
Chris R. Miller Wrote:

 superdan wrote:
 Steven Schveighoffer Wrote:
 
 "superdan" wrote
 Christopher Wright Wrote:

 superdan wrote:
 class LinkedList(T)
 {
 Node!(T) head;

 /** Gets the ith item of the list. Throws: SIGSEGV if i >= length of
 the list. Time complexity: O(N) for a list of length N. This operation
 is provided for completeness and not recommended for frequent use in
 large lists. */
 T opIndex(int i)
 {
 auto current = head;
 while (i--)
 {
 current = current.next;
 }
 return current.value;
 }
 }

sort business works only if it can count on a constant time opIndex. why? because sort has right in its spec that it takes o(n log n) time. if u pass LinkedList to it you obtain a nonsensical design that compiles but runs afoul of the spec. because with that opIndex sort will run in quadratic time and no amount of commentin' is gonna save it from that.

Those sorting algorithms are correct. Their runtime is now O(n^2 log n) for this linked list.

from a scalable algo. whacha gonna say next. bubble sort is viable?!?


of course. for traveling salesman that is. just not sorting. o(n^2 log n) sorting algo, that's a no-go. notice also i didn't say solved/unsolved. i said scalable. stuff beyond n log n don't scale.

Why are people unable to have the brains to understand that they're using data structures in a suboptimal manner? Furthermore, you're faulting linked lists as having a bad opIndex. Why not implement a cursor (Java LinkedList-like iterator) in the opIndex function? Thus you could retain the reference to the last indexed location, and simply use that instead of the root node when calling opIndex. Granted that whenever the contents of the list are modified that reference would have to be considered invalid (start from the root node again), but it'd work with an O(1) efficiency for sequential accesses from 0 to length. True, it'll add another pointer to a node in memory, as well as an integer representing the position of that node reference.

I am sorry to enter discussion. But I have some thing to say. Please do not scare me ^_^.

I think Super Dan choose wrong example sort. Because sort is O(n log n) even for list. But good example is find. Taken collection that gives length and opIndex abstraction. Then I write find easy with index. It is slow O(n*n) for list. But with optimization from Chris it is fast again. But if I want to write findLast. Find last element equal to some thing. Then I go back. But going back the optimization never works. I am again to O(n*n)!

This is important because abstraction. You want to write find abstract. Also you want write container abstract. And you want both to work together well. If you choose algorithm manually "you cheat". You break abstraction. Because you want abstract algorithm work on abstract container. Not concrete algorithm on concrete container.

Also is not only detail. When call findLast on container I expect better or worse depending on optimization of library. But I expect proportional with number of elements. If I know is O(n*n) maybe I want redesign. O(n*n) is really bad. 1000 elements is not many. But 1000000 operations is many. I took two data structures classes. What each structure gives fast is essential. Not detail.

I am not sure is clear what I say. Structures are special for certain operations. For example there is suffix tree. It is for fast common substring. Suffix tree must not have same interface as O(n*n) search. Because algorithm should not accept both. If you say list has random access it is naive I think (sorry!). Everybody in class could laugh. To find index in list is linear search. A similar example an array string[] can define overload a["abc"] to do linear search for "abc". But search is not indexing. It must be name search find or linearSearch.
Aug 27 2008
prev sibling next sibling parent reply Robert Fraser <fraserofthenight gmail.com> writes:
superdan wrote:
 Christopher Wright Wrote:
 
 superdan wrote:
 class LinkedList(T)
 {
 	Node!(T) head;

 	/** Gets the ith item of the list. Throws: SIGSEGV if i >= length of 
 the list. Time complexity: O(N) for a list of length N. This operation 
 is provided for completeness and not recommended for frequent use in 
 large lists. */
 	T opIndex(int i)
 	{
 		auto current = head;
 		while (i--)
 		{
 			current = current.next;
 		}
 		return current.value;
 	}
 }


Those sorting algorithms are correct. Their runtime is now O(n^2 log n) for this linked list.

you got the wrong definition of correct. sort became a non-scalable algo from a scalable algo. whacha gonna say next. bubble sort is viable?!?

No, YOU got the wrong definition of correct. "Correct" and "scalabale" are different words. As are "correct" and "viable". In Java, I've been known to index into linked lists... usually ones with ~5 elements, but I've done it.
 this design is a stillborn.

 what needs done is to allow different kinds of containers implement different
interfaces. in fact a better way to factor things is via iterators, as stl has
shown.

correct opIndex on a linked list.

the correct opIndex runs in o(1).

No, the SCALABLE opIndex runs in O(1). The CORRECT opIndex can run in O(n^n) and still be correct.
 Any sorting algorithm you name that 
 would work on an array would also work on this linked list. Admittedly, 
 insertion sort would be much faster than qsort, but if your library 
 provides that, you, knowing that you are using a linked list, would 
 choose the insertion sort algorithm.

no. the initial idea was for a design that allows that cool mixing and matching gig thru interfaces without knowing what is where. but ur design leads to a lot of unworkable mixing and matching.

Again, WORKABLE, just not SCALABLE. You should wrap your head around this concept, since it's been around for about 30 years now.
 it is a stillborn design.

Tell that to Java, the world's most used programming language for new projects. Or C#, the world's fastest growing programming language. Or Tango, one of D's standard libraries.
 my point was opIndex should not be written for a list to begin with.

Yes it should be. Here's a fairly good example: Say you have a GUI control that displays a list and allows the user to insert or remove items from the list. It also allows the user to double-click on an item at a given position. Looking up what position maps to what item is an opIndex. Would this problem be better solved using an array (Vector)? Maybe. Luckily, if you used a List interface throughout your code, you can change one line, and it'll work either way.
 wrong. you only need to define your abstract types appropriately. e.g. stl
defines forward and random iterators. a forward iterators has ++ but no [].
random iterator has both. so a random iterator can be substituted for a forward
iterator. but not the other way. bottom line, sort won't compile on a forward
iterator. your design allowed it to compile. which makes the design wrong.

STL happens to be one design and one world-view. It's a good one, but it's not the only one. My main problem with the STL is that it takes longer to learn than the Java/.NET standard libraries -- and thus the cost of a programmer who knows it is higher. But there are language considerations in there too, and this is a topic for another day.
 btw my respect for tango improved when i found no opIndex in their list
container.

http://www.dsource.org/projects/tango/browser/trunk/tango/util/collection/LinkSeq.d#L176
Aug 26 2008
next sibling parent reply Lars Ivar Igesund <larsivar igesund.net> writes:
Robert Fraser wrote:

 superdan wrote:

 btw my respect for tango improved when i found no opIndex in their list
 container.


This particular collection package is deprecated. -- Lars Ivar Igesund blog at http://larsivi.net DSource, #d.tango & #D: larsivi Dancing the Tango
Aug 26 2008
parent reply Robert Fraser <fraserofthenight gmail.com> writes:
Lars Ivar Igesund wrote:
 Robert Fraser wrote:
 
 superdan wrote:

 btw my respect for tango improved when i found no opIndex in their list
 container.


This particular collection package is deprecated.

The new package has this feature too: http://www.dsource.org/projects/tango/browser/trunk/tango/util/container/Slink.d#L248 It's a good feature to have (I wouldn't consider a list class complete without it), it just shouldn't be abused.
Aug 26 2008
parent "Steven Schveighoffer" <schveiguy yahoo.com> writes:
"Robert Fraser" <fraserofthenight gmail.com> wrote in message 
news:g91fva$1if4$1 digitalmars.com...
 Lars Ivar Igesund wrote:
 Robert Fraser wrote:

 superdan wrote:

 btw my respect for tango improved when i found no opIndex in their list
 container.


This particular collection package is deprecated.

The new package has this feature too: http://www.dsource.org/projects/tango/browser/trunk/tango/util/container/Slink.d#L248 It's a good feature to have (I wouldn't consider a list class complete without it), it just shouldn't be abused.

First, Slink is not really the public interface, it is the unit that LinkedList (and other containers) use to build linked lists. Second, LinkedList implements a 'lookup by index', through the get function, but note that it is not implemented as an opIndex function. an opIndex implies fast lookup (at least < O(n)). I don't think these functions were intended to be used in sorting routines. -Steve
Aug 26 2008
prev sibling parent reply superdan <super dan.org> writes:
Robert Fraser Wrote:

 superdan wrote:
 Christopher Wright Wrote:
 
 superdan wrote:
 class LinkedList(T)
 {
 	Node!(T) head;

 	/** Gets the ith item of the list. Throws: SIGSEGV if i >= length of 
 the list. Time complexity: O(N) for a list of length N. This operation 
 is provided for completeness and not recommended for frequent use in 
 large lists. */
 	T opIndex(int i)
 	{
 		auto current = head;
 		while (i--)
 		{
 			current = current.next;
 		}
 		return current.value;
 	}
 }


Those sorting algorithms are correct. Their runtime is now O(n^2 log n) for this linked list.

you got the wrong definition of correct. sort became a non-scalable algo from a scalable algo. whacha gonna say next. bubble sort is viable?!?

No, YOU got the wrong definition of correct. "Correct" and "scalabale" are different words. As are "correct" and "viable". In Java, I've been known to index into linked lists... usually ones with ~5 elements, but I've done it.

yeah. for starters my dictionary fails to list "scalabale".
 this design is a stillborn.

 what needs done is to allow different kinds of containers implement different
interfaces. in fact a better way to factor things is via iterators, as stl has
shown.

correct opIndex on a linked list.

the correct opIndex runs in o(1).

No, the SCALABLE opIndex runs in O(1). The CORRECT opIndex can run in O(n^n) and still be correct.

stepanov has shown that for composable operations the complexity must be part of the specification. otherwise composition easily leads to high-order polynomials that fail to terminate in reasonable time. opIndex is an indexing operator expected to run in constant time, and algorithms rely on that. so no. opIndex running in o(n^n) is incorrect because it fails its spec.
 Any sorting algorithm you name that 
 would work on an array would also work on this linked list. Admittedly, 
 insertion sort would be much faster than qsort, but if your library 
 provides that, you, knowing that you are using a linked list, would 
 choose the insertion sort algorithm.

no. the initial idea was for a design that allows that cool mixing and matching gig thru interfaces without knowing what is where. but ur design leads to a lot of unworkable mixing and matching.

Again, WORKABLE, just not SCALABLE. You should wrap your head around this concept, since it's been around for about 30 years now.

guess would do you good to entertain just for a second the idea that i know what i'm talking about and you don't get me.
 it is a stillborn design.

Tell that to Java, the world's most used programming language for new projects. Or C#, the world's fastest growing programming language. Or Tango, one of D's standard libraries.

here you hint you don't understand what i'm talking about indeed. neither of java, c#, or tango define a[n] to run in o(n). they define named functions, which i'm perfectly fine with.
 my point was opIndex should not be written for a list to begin with.

Yes it should be. Here's a fairly good example: Say you have a GUI control that displays a list and allows the user to insert or remove items from the list. It also allows the user to double-click on an item at a given position. Looking up what position maps to what item is an opIndex. Would this problem be better solved using an array (Vector)? Maybe. Luckily, if you used a List interface throughout your code, you can change one line, and it'll work either way.

funny you should mention that. window manager in windows 3.1 worked exactly like that. users noticed that the more windows they opened, the longer it took to open a new window. with new systems and more memory people would have many windows. before long this became a big issue. windows 95 fixed that. never misunderestimate scalability.
 wrong. you only need to define your abstract types appropriately. e.g. stl
defines forward and random iterators. a forward iterators has ++ but no [].
random iterator has both. so a random iterator can be substituted for a forward
iterator. but not the other way. bottom line, sort won't compile on a forward
iterator. your design allowed it to compile. which makes the design wrong.

STL happens to be one design and one world-view. It's a good one, but it's not the only one. My main problem with the STL is that it takes longer to learn than the Java/.NET standard libraries -- and thus the cost of a programmer who knows it is higher. But there are language considerations in there too, and this is a topic for another day.

cool. don't see how this all relates to the problem at hand.
 btw my respect for tango improved when i found no opIndex in their list
container.

http://www.dsource.org/projects/tango/browser/trunk/tango/util/collection/LinkSeq.d#L176

here you gently provide irrefutable proof you don't get what i'm sayin'. schveiguy did. the page fails to list opIndex. there is a function called 'get'. better yet. the new package lists a function 'nth' suggesting a linear walk to the nth element. good job tango fellas.
Aug 26 2008
parent reply Robert Fraser <fraserofthenight gmail.com> writes:
superdan wrote:
 No, the SCALABLE opIndex runs in O(1). The CORRECT opIndex can run in 
 O(n^n) and still be correct.

stepanov has shown that for composable operations the complexity must be part of the specification. otherwise composition easily leads to high-order polynomials that fail to terminate in reasonable time. opIndex is an indexing operator expected to run in constant time, and algorithms rely on that. so no. opIndex running in o(n^n) is incorrect because it fails its spec.

Um... how would one "show" that? I'm not talking theoretical bullshit here, I'm talking real-world requirements. Some specs of operations (composable or not) list their time/memory complexity. Most do not. They're still usable. I agree that a standard library sort routine is one that *should* list its time complexity. My internal function for enumerating registry keys doesn't need to.
 here you hint you don't understand what i'm talking about indeed. neither of
java, c#, or tango define a[n] to run in o(n). they define named functions,
which i'm perfectly fine with.

I guess I didn't understand what you were saying because you _never mentioned_ you were talking only about opIndex and not other functions. I don't see the difference between a[n] and a.get(n); the former is just a shorter syntax. The D spec certainly doesn't make any guarantees about the time/memory complexity of opIndex; it's up to the implementing class to do so. In fact, the D spec makes no time/memory complexity guarantees about sort for arbitrary user-defined types, either, so maybe you shouldn't use that.
 funny you should mention that. window manager in windows 3.1 worked exactly
like that. users noticed that the more windows they opened, the longer it took
to open a new window. with new systems and more memory people would have many
windows. before long this became a big issue. windows 95 fixed that.
 
 never misunderestimate scalability.

I don't know enough about GUI programming to say for sure, but that suggests a window manager shouldn't be written using linked lists. It doesn't suggest that getting a value from an arbitrary index in a linked list is useless (in fact, it shows the opposite -- that it works fine -- it just shows it's not scalable).
Aug 26 2008
parent reply superdan <super dan.org> writes:
Robert Fraser Wrote:

 superdan wrote:
 No, the SCALABLE opIndex runs in O(1). The CORRECT opIndex can run in 
 O(n^n) and still be correct.

stepanov has shown that for composable operations the complexity must be part of the specification. otherwise composition easily leads to high-order polynomials that fail to terminate in reasonable time. opIndex is an indexing operator expected to run in constant time, and algorithms rely on that. so no. opIndex running in o(n^n) is incorrect because it fails its spec.

Um... how would one "show" that? I'm not talking theoretical bullshit here, I'm talking real-world requirements.

hey. hey. watch'em manners :) he's shown it by putting stl together.
 Some specs of operations 
 (composable or not) list their time/memory complexity. Most do not. 
 They're still usable. I agree that a standard library sort routine is 
 one that *should* list its time complexity. My internal function for 
 enumerating registry keys doesn't need to.

sure thing.
 here you hint you don't understand what i'm talking about indeed. neither of
java, c#, or tango define a[n] to run in o(n). they define named functions,
which i'm perfectly fine with.

I guess I didn't understand what you were saying because you _never mentioned_ you were talking only about opIndex and not other functions.

well then allow me to quote myself: "oopsies. houston we got a problem here. problem is all that pluggable sort business works only if it can count on a constant time opIndex. why? because sort has right in its spec that it takes o(n log n) time. if u pass LinkedList to it you obtain a nonsensical design that compiles but runs afoul of the spec. because with that opIndex sort will run in quadratic time and no amount of commentin' is gonna save it from that."
 I don't see the difference between a[n] and a.get(n); the former is just 
 a shorter syntax.

wrong. the former is used by sort. the latter ain't.
 The D spec certainly doesn't make any guarantees about 
 the time/memory complexity of opIndex; it's up to the implementing class 
 to do so.

it don't indeed. it should. that's a problem with the spec.
 In fact, the D spec makes no time/memory complexity guarantees 
 about sort for arbitrary user-defined types, either, so maybe you 
 shouldn't use that.

makes guarantees in terms of the primitive operations used.
 funny you should mention that. window manager in windows 3.1 worked exactly
like that. users noticed that the more windows they opened, the longer it took
to open a new window. with new systems and more memory people would have many
windows. before long this became a big issue. windows 95 fixed that.
 
 never misunderestimate scalability.

I don't know enough about GUI programming to say for sure, but that suggests a window manager shouldn't be written using linked lists. It doesn't suggest that getting a value from an arbitrary index in a linked list is useless (in fact, it shows the opposite -- that it works fine -- it just shows it's not scalable).

problem's too many people talk without knowin' enuff 'bout stuff. there's only a handful of subjects i know any about. and i try to not stray. when it come about any stuff i know i'm amazed readin' here at how many just fudge their way around.
Aug 26 2008
next sibling parent Benji Smith <dlanguage benjismith.net> writes:
Denis Koroskin wrote:
 I agree. You can't rely on function invocation, i.e. the following
 might be slow as death:

 auto n = collection.at(i);
 auto len = collection.length();

 but index operations and properties getters should be real-time and 
 have O(1) complexity by design.

 auto n = collection[i];
 auto len = collection.length;

The same goes for assignment, casts, comparisons, shifts, i.e. everything that doesn't have a function invocation syntax.

This is the main reason I dislike D's optional parentheses for function invocations:

something.dup;   // looks cheap
something.dup(); // looks expensive

Since any zero-arg function can have its parens omitted, it's harder to read code and see where the expensive operations are.

--benji
Aug 26 2008
prev sibling next sibling parent "Steven Schveighoffer" <schveiguy yahoo.com> writes:
"Denis Koroskin" wrote
 On Tue, 26 Aug 2008 23:29:33 +0400, superdan <super dan.org> wrote:
 [snip]
 The D spec certainly doesn't make any guarantees about
 the time/memory complexity of opIndex; it's up to the implementing class
 to do so.

it don't indeed. it should. that's a problem with the spec.

I agree. You can't rely on function invocation, i.e. the following might be slow as death:

auto n = collection.at(i);
auto len = collection.length();

but index operations and properties getters should be real-time and have O(1) complexity by design.

auto n = collection[i];
auto len = collection.length;

less than O(n) complexity please :) Think of tree map complexity which is usually O(lg n) for lookups. And the opIndex syntax is sooo nice for maps :) In general, opIndex just shouldn't imply 'linear search', as its roots come from array lookup, which is always O(1). The perception is that x[n] should be fast. Otherwise you have coders using x[n] all over the place thinking they are doing quick lookups, and wondering why their code is so damned slow. -Steve
Aug 26 2008
prev sibling parent reply "Nick Sabalausky" <a a.a> writes:
"Denis Koroskin" <2korden gmail.com> wrote in message 
news:op.ugie28dxo7cclz proton.creatstudio.intranet...
 On Tue, 26 Aug 2008 23:29:33 +0400, superdan <super dan.org> wrote:
 [snip]
 The D spec certainly doesn't make any guarantees about
 the time/memory complexity of opIndex; it's up to the implementing class
 to do so.

it don't indeed. it should. that's a problem with the spec.

I agree. You can't rely on function invocation, i.e. the following might be slow as death:

auto n = collection.at(i);
auto len = collection.length();

but index operations and properties getters should be real-time and have O(1) complexity by design.

auto n = collection[i];
auto len = collection.length;

I disagree. That strategy strikes me as a very clear example of breaking encapsulation by having implementation details dictate certain aspects of the API. At the very least, that will make the API overly rigid, hindering future changes that could otherwise have been non-breaking, behind-the-scenes changes. For realtime code, I can see the benefit to what you're saying. Although in many cases only part of a program needs to be realtime, and for the rest of the program's code I'd hate to have to sacrifice the encapsulation benefits.
Aug 26 2008
parent reply superdan <super dan.org> writes:
Nick Sabalausky Wrote:

 "Denis Koroskin" <2korden gmail.com> wrote in message 
 news:op.ugie28dxo7cclz proton.creatstudio.intranet...
 On Tue, 26 Aug 2008 23:29:33 +0400, superdan <super dan.org> wrote:
 [snip]
 The D spec certainly doesn't make any guarantees about
 the time/memory complexity of opIndex; it's up to the implementing class
 to do so.

it don't indeed. it should. that's a problem with the spec.

I agree. You can't rely on function invocation, i.e. the following might be slow as death:

auto n = collection.at(i);
auto len = collection.length();

but index operations and properties getters should be real-time and have O(1) complexity by design.

auto n = collection[i];
auto len = collection.length;

I disagree. That strategy strikes me as a very clear example of breaking encapsulation by having implementation details dictate certain aspects of the API. At the very least, that will make the API overly rigid, hindering future changes that could otherwise have been non-breaking, behind-the-scenes changes.

take this:

auto c = new Customer;
c.loadCustomerInfo("123-12-1234");

that's all cool. there's no performance guarantee other than some best effort kinda thing `we won't sabotage things'. if you come with faster or slower methods to load customers, no problem. coz noone assumes any.

now take sort. sort says. my input is a range that supports indexing and swapping independent of the range size. if you don't have that just let me know and i'll use a totally different method. just don't pretend.

with that indexing and swapping complexity ain't implementation detail. they're part of the spec. guess stepanov's main contribution was to clarify that.
 For realtime code, I can see the benefit to what you're saying. Although in 
 many cases only part of a program needs to be realtime, and for the rest of 
 the program's code I'd hate to have to sacrifice the encapsulation benefits.

realtime has nothin' to do with it. encapsulation ain't broken by making complexity part of the reqs. any more than any req ain't breakin' encapsulation. if it looks like a problem then encapsulation was misdesigned and needs change. case in point. all containers should provide 'nth' say is it's o(n) or better. then there's a subclass of container that is indexed_container. that provides opIndex and says it's o(log n) or better. it also provides 'nth' by just forwarding to opIndex. faster than o(n) ain't a problem. but forcing a list to blurt something for opIndex - that's just bad design.
Aug 26 2008
next sibling parent reply Benji Smith <dlanguage benjismith.net> writes:
superdan wrote:
 now take sort. sort says. my input is a range that supports indexing and
swapping independent of the range size. if you don't have that just let me know
and i'll use a totally different method. just don't pretend.
 
 with that indexing and swapping complexity ain't implementation detail.
they're part of the spec. guess stepanov's main contribution was to clarify
that.

The other variable cost operation of a sort is the element comparison. Even if indexing and swapping are O(1), the cost of a comparison between two elements might be O(m), where m is proportional to the size of the elements themselves. And since a typical sort algorithm will perform n log n comparisons, the cost of the comparison has to be factored into the total cost. The performance of sorting...say, an array of strings based on a locale-specific collation...could be an expensive operation, if the strings themselves are really long. But that wouldn't make the implementation incorrect, and I'm always glad when a sorting implementation provides a way of passing a custom comparison delegate into the sort routine. Not a counterargument to what you're saying about performance guarantees for indexing and swapping. Just something else to think about. --benji
Aug 26 2008
parent superdan <super dan.org> writes:
Benji Smith Wrote:

 superdan wrote:
 now take sort. sort says. my input is a range that supports indexing and
swapping independent of the range size. if you don't have that just let me know
and i'll use a totally different method. just don't pretend.
 
 with that indexing and swapping complexity ain't implementation detail.
they're part of the spec. guess stepanov's main contribution was to clarify
that.

The other variable cost operation of a sort is the element comparison. Even if indexing and swapping are O(1), the cost of a comparison between two elements might be O(m), where m is proportional to the size of the elements themselves. And since a typical sort algorithm will perform n log n comparisons, the cost of the comparison has to be factored into the total cost. The performance of sorting...say, an array of strings based on a locale-specific collation...could be an expensive operation, if the strings themselves are really long. But that wouldn't make the implementation incorrect, and I'm always glad when a sorting implementation provides a way of passing a custom comparison delegate into the sort routine.

good points. i only know of one trick to save on comparisons. it's that -1/0/1 comparison. you compare once and get info on less/equal/greater. that cuts comparisons in half. too bad std.algorithm don't use it. then i moseyed 'round std.algorithm and saw all that schwartz xform business. not sure i grokked it. does it have to do with saving on comparisons?
Aug 26 2008
prev sibling parent reply "Nick Sabalausky" <a a.a> writes:
"superdan" <super dan.org> wrote in message 
news:g921nb$2qqq$1 digitalmars.com...
 Nick Sabalausky Wrote:

 "Denis Koroskin" <2korden gmail.com> wrote in message
 news:op.ugie28dxo7cclz proton.creatstudio.intranet...
 On Tue, 26 Aug 2008 23:29:33 +0400, superdan <super dan.org> wrote:
 [snip]
 The D spec certainly doesn't make any guarantees about
 the time/memory complexity of opIndex; it's up to the implementing 
 class
 to do so.

it don't indeed. it should. that's a problem with the spec.

I agree. You can't rely on function invocation, i.e. the following might be slow as death:

auto n = collection.at(i);
auto len = collection.length();

but index operations and properties getters should be real-time and have O(1) complexity by design.

auto n = collection[i];
auto len = collection.length;

I disagree. That strategy strikes me as a very clear example of breaking encapsulation by having implementation details dictate certain aspects of the API. At the very least, that will make the API overly rigid, hindering future changes that could otherwise have been non-breaking, behind-the-scenes changes.

take this:

auto c = new Customer;
c.loadCustomerInfo("123-12-1234");

that's all cool. there's no performance guarantee other than some best effort kinda thing `we won't sabotage things'. if you come with faster or slower methods to load customers, no problem. coz noone assumes any.

now take sort. sort says. my input is a range that supports indexing and swapping independent of the range size. if you don't have that just let me know and i'll use a totally different method. just don't pretend.

Choosing a sort method is a separate task from the actual sorting. Any sort that expects to be able to index will still work correctly if given a collection that has O(n) or worse indexing, just as it will still work correctly if given an O(n) or worse comparison delegate. It's up to the caller of the sort function to know that an O(n log n) sort (for instance) is only O(n log n) if the indexing and comparison are both O(1). And then it's up to them to decide if they still want to send it a linked list or a complex comparison. The sort shouldn't crap out at compile-time (or runtime) just because some novice might not know that doing a generalized bubble sort on a linked list scales poorly. If you want automatic choosing of an appropriate sort algorithm (which is certainly a good thing to have), then that can be done at a separate, optional, level of abstraction using function overloading, template specialization, RTTI, etc. That way you're not imposing arbitrary restrictions on anyone.
 with that indexing and swapping complexity ain't implementation detail. 
 they're part of the spec. guess stepanov's main contribution was to 
 clarify that.

When I called indexing an implementation detail, I was referring to the collection itself. The method of indexing *is* an implementation detail of the collection. It should not be considered an implementation detail of the sort algorithm since it's encapsulated in the collection and thus hidden away from the sort algorithm. If you make a rule that collections with cheap indexing are indexed via opIndex and collections with expensive indexing are indexed via a function, then you've just defined the API in terms of the collection's implementation (and introduced an unnecessary inconsistency into the API). If a sort function is desired that only accepts collections with O(1) indexing, then that can be accomplished at a higher level of abstraction (using function overloading, RTTI, etc.) without getting in the way when such a guarantee is not needed.
 For realtime code, I can see the benefit to what you're saying. Although 
 in
 many cases only part of a program needs to be realtime, and for the rest 
 of
 the program's code I'd hate to have to sacrifice the encapsulation 
 benefits.

realtime has nothin' to do with it.

For code that needs to run in realtime, I agree with Denis Koroskin that it could be helpful to be able to look at a piece of code and have some sort of guarantee that there is no behind-the-scenes overloading going on that is any more complex than the operators' default behaviors. But for code that doesn't need to finish within a maximum amount of time, that becomes less important and the encapsulation/syntactic-consistency gained from the use of such things becomes a more worthy pursuit. That's what I was saying about realtime.
 encapsulation ain't broken by making complexity part of the reqs. any more 
 than any req ain't breakin' encapsulation. if it looks like a problem then 
 encapsulation was misdesigned and needs change.

 case in point. all containers should provide 'nth' say is it's o(n) or 
 better. then there's a subclass of container that is indexed_container. 
 that provides opIndex and says it's o(log n) or better. it also provides 
 'nth' by just forwarding to opIndex. faster than o(n) ain't a problem.

 but forcing a list to blurt something for opIndex - that's just bad 
 design.

I agree that not all collections should implement an opIndex. Anything without a natural sequence or mapping should lack opIndex (such as a tree or graph). But forcing the user of a collection that *does* have a natural sequence (like a linked list) to use function-call-syntax instead of standard indexing-syntax just because the collection is implemented in a way that causes indexing to be less scalable than other collections - that's bad design. The way I see it, "group[n]" means "Get the nth element of group". Not "Get the element at location group.ptr + (n * sizeof(group_base_type)) or something else that's just as scalable." In plain C, those are one and the same. But when you start talking about generic collections, encapsulation and interface versus implementation, they are very different: the former is interface, the latter is implementation.
Aug 26 2008
parent reply superdan <super dan.org> writes:
Nick Sabalausky Wrote:

 "superdan" <super dan.org> wrote in message 
 news:g921nb$2qqq$1 digitalmars.com...
 Nick Sabalausky Wrote:

 "Denis Koroskin" <2korden gmail.com> wrote in message
 news:op.ugie28dxo7cclz proton.creatstudio.intranet...
 On Tue, 26 Aug 2008 23:29:33 +0400, superdan <super dan.org> wrote:
 [snip]
 The D spec certainly doesn't make any guarantees about
 the time/memory complexity of opIndex; it's up to the implementing 
 class
 to do so.

it don't indeed. it should. that's a problem with the spec.

I agree. You can't rely on function invocation, i.e. the following might be slow as death:

auto n = collection.at(i);
auto len = collection.length();

but index operations and properties getters should be real-time and have O(1) complexity by design.

auto n = collection[i];
auto len = collection.length;

I disagree. That strategy strikes me as a very clear example of breaking encapsulation by having implementation details dictate certain aspects of the API. At the very least, that will make the API overly rigid, hindering future changes that could otherwise have been non-breaking, behind-the-scenes changes.

take this:

auto c = new Customer;
c.loadCustomerInfo("123-12-1234");

that's all cool. there's no performance guarantee other than some best effort kinda thing `we won't sabotage things'. if you come with faster or slower methods to load customers, no problem. coz noone assumes any.

now take sort. sort says. my input is a range that supports indexing and swapping independent of the range size. if you don't have that just let me know and i'll use a totally different method. just don't pretend.

Choosing a sort method is a separate task from the actual sorting.

thot you were the one big on abstraction and encapsulation and all those good things. as a user i want to sort stuff. i let the library choose what's best for the collection at hand.

sort(stuff)
{
    1) figure out best algo for stuff
    2) have at it
}

i don't want to make that decision outside. for example i like one sort routine. not quicksort, heapsort, quicksort_with_median_of_5, or god forbid bubblesort.
 Any sort 
 that expects to be able to index will still work correctly if given a 
 collection that has O(n) or worse indexing, just as it will still work 
 correctly if given an O(n) or worse comparison delegate.

i disagree but now that christopher gave me a black eye guess i have to shut up.
 It's up to the 
 caller of the sort function to know that an O(n log n) sort (for instance) 
 is only O(n log n) if the indexing and comparison are both O(1). And then 
 it's up to them to decide if they still want to send it a linked list or a 
 complex comparison. The sort shouldn't crap out at compile-time (or runtime) 
 just because some novice might not know that doing a generalized bubble sort 
 on a linked list scales poorly.

it should coz there's an obvious good choice. there's no good tradeoff in that. there never will be a case to call the bad sort on the bad range.
 If you want automatic choosing of an appropriate sort algorithm (which is 
 certainly a good thing to have), then that can be done at a separate, 
 optional, level of abstraction using function overloading, template 
 specialization, RTTI, etc. That way you're not imposing arbitrary 
 restrictions on anyone.

i think u missed a lil point i was making.

All collections: implement nth()
Indexable collections: implement opIndex

is all. there is no restriction. just use nth.
 with that indexing and swapping complexity ain't implementation detail. 
 they're part of the spec. guess stepanov's main contribution was to 
 clarify that.

When I called indexing an implementation detail, I was referring to the collection itself. The method of indexing *is* an implementation detail of the collection.

not when it gets composed in higher level ops.
 It should not be considered an implementation detail of the 
 sort algorithm since it's encapsulated in the collection and thus hidden 
 away from the sort algorithm. If you make a rule that collections with cheap 
 indexing are indexed via opIndex and collections with expensive indexing are 
 indexed via a function, then you've just defined the API in terms of the 
 collection's implementation (and introduced an unnecessary inconsistency 
 into the API).

no. there is consistency. nth() is consistent across - o(n) or better indexing. i think u have it wrong when u think of "cheap" as if it were "fewer machine instructions". no. it's about asymptotic complexity and that does matter.
 If a sort function is desired that only accepts collections with O(1) 
 indexing, then that can be accomplished at a higher level of abstraction 
 (using function overloading, RTTI, etc.) without getting in the way when 
 such a guarantee is not needed.

exactly. nth() and opIndex() fit the bill. what's there not to love?
 For realtime code, I can see the benefit to what you're saying. Although 
 in
 many cases only part of a program needs to be realtime, and for the rest 
 of
 the program's code I'd hate to have to sacrifice the encapsulation 
 benefits.

realtime has nothin' to do with it.

For code that needs to run in realtime, I agree with Denis Koroskin that it could be helpful to be able to look at a piece of code and have some sort of guarantee that there is no behind-the-scenes overloading going on that is any more complex than the operators' default behaviors. But for code that doesn't need to finish within a maximum amount of time, that becomes less important and the encapsulation/syntactic-consistency gained from the use of such things becomes a more worthy pursuit. That's what I was saying about realtime.

i disagree but am in a rush now. guess i can't convince u.
 encapsulation ain't broken by making complexity part of the reqs. any more 
 than any req ain't breakin' encapsulation. if it looks like a problem then 
 encapsulation was misdesigned and needs change.

 case in point. all containers should provide 'nth' say is it's o(n) or 
 better. then there's a subclass of container that is indexed_container. 
 that provides opIndex and says it's o(log n) or better. it also provides 
 'nth' by just forwarding to opIndex. faster than o(n) ain't a problem.

 but forcing a list to blurt something for opIndex - that's just bad 
 design.

I agree that not all collections should implement an opIndex. Anything without a natural sequence or mapping should lack opIndex (such as a tree or graph). But forcing the user of a collection that *does* have a natural sequence (like a linked list) to use function-call-syntax instead of standard indexing-syntax just because the collection is implemented in a way that causes indexing to be less scalable than other collections - that's bad design.

no. it's great design. because it's not lyin'. you want o(1) indexing you say a[n]. you are ok with o(n) indexing you say a.nth(n). this is how generic code works, with consistent notation. not with lyin'.
 The way I see it, "group[n]" means "Get the nth element of group". Not "Get 
 the element at location group.ptr + (n * sizeof(group_base_type)) or 
 something else that's just as scalable."

no need to get that low. just say o(1) and understand o(1) has nothing to do with the count of assembler ops.
 In plain C, those are one and the 
 same. But when you start talking about generic collections, encapsulation 
 and interface versus implementation, they are very different: the former is 
 interface, the latter is implementation. 

so now would u say stl has a poor design? because it's all about stuff that you consider badly designed.
Aug 26 2008
parent "Nick Sabalausky" <a a.a> writes:
"superdan" <super dan.org> wrote in message 
news:g92drj$h0u$1 digitalmars.com...
 Nick Sabalausky Wrote:

 "superdan" <super dan.org> wrote in message
 news:g921nb$2qqq$1 digitalmars.com...
 Nick Sabalausky Wrote:

 "Denis Koroskin" <2korden gmail.com> wrote in message
 news:op.ugie28dxo7cclz proton.creatstudio.intranet...
 On Tue, 26 Aug 2008 23:29:33 +0400, superdan <super dan.org> wrote:
 [snip]
 The D spec certainly doesn't make any guarantees about
 the time/memory complexity of opIndex; it's up to the implementing
 class
 to do so.

it don't indeed. it should. that's a problem with the spec.

I agree. You can't rely on function invocation, i.e. the following might be slow as death:

auto n = collection.at(i);
auto len = collection.length();

but index operations and properties getters should be real-time and have O(1) complexity by design.

auto n = collection[i];
auto len = collection.length;

I disagree. That strategy strikes me as a very clear example of breaking encapsulation by having implementation details dictate certain aspects of the API. At the very least, that will make the API overly rigid, hindering future changes that could otherwise have been non-breaking, behind-the-scenes changes.

take this:

auto c = new Customer;
c.loadCustomerInfo("123-12-1234");

that's all cool. there's no performance guarantee other than some best effort kinda thing `we won't sabotage things'. if you come with faster or slower methods to load customers, no problem. coz noone assumes any.

now take sort. sort says. my input is a range that supports indexing and swapping independent of the range size. if you don't have that just let me know and i'll use a totally different method. just don't pretend.

Choosing a sort method is a separate task from the actual sorting.

thot you were the one big on abstraction and encapsulation and all those good things. as a user i want to sort stuff. i let the library choose what's best for the collection at hand.

sort(stuff)
{
    1) figure out best algo for stuff
    2) have at it
}

i don't want to make that decision outside. for example i like one sort routine. not quicksort, heapsort, quicksort_with_median_of_5, or god forbid bubblesort.

I never said that shouldn't be available. In fact, I did say it should be there. But just not forced.
 Any sort
 that expects to be able to index will still work correctly if given a
 collection that has O(n) or worse indexing, just as it will still work
 correctly if given an O(n) or worse comparison delegate.

i disagree but now that christopher gave me a black eye guess i have to shut up.

That's not a matter of agreeing or disagreeing, it's a verifiable fact. Grab a working sort function that operates on collection classes that implement indexing-syntax and a length property, feed it an unsorted linked list that has opIndex overloaded to return the nth node and a proper length property, and when it returns, the list will be sorted. Or are you maybe talking about a sort function that's parameterized specifically to take an "array" instead of "a collection that implements opIndex and a length property"? Because that might make a difference depending on the language (not sure about D offhand).
 It's up to the
 caller of the sort function to know that an O(n log n) sort (for 
 instance)
 is only O(n log n) if the indexing and comparison are both O(1). And then
 it's up to them to decide if they still want to send it a linked list or 
 a
 complex comparison. The sort shouldn't crap out at compile-time (or 
 runtime)
 just because some novice might not know that doing a generalized bubble 
 sort
 on a linked list scales poorly.

it should coz there's an obvious good choice. there's no good tradeoff in that. there never will be a case to call the bad sort on the bad range.

No matter what type of collection you're using, the "best sort" is still going to vary depending on factors like the number of elements to be sorted, whether duplicates might exist, how close the collection is to either perfectly sorted, perfectly backwards or totally random, how likely it is to be random/sorted/backwards at any given time, etc. And then there can be different variations of the same basic algorithm that can be better or worse for certain scenarios. And then there's the issue of how does the algorithm-choosing sort handle user-created collections, if at all.
 If you want automatic choosing of an appropriate sort algorithm (which is
 certainly a good thing to have), then that can be done at a separate,
 optional, level of abstraction using function overloading, template
 specialization, RTTI, etc. That way you're not imposing arbitrary
 restrictions on anyone.

i think u missed a lil point i was making.

All collections: implement nth()
Indexable collections: implement opIndex

is all. there is no restriction. just use nth.

If you want opIndex to be reserved for highly scalable indexing, then I can see how that would lead to what you describe here. But I'm in the camp that feels opIndex means "indexing", not "cheap/scalable indexing", in which case it becomes unnecessary to also expose the separate "nth()" function.
 with that indexing and swapping complexity ain't implementation detail.
 they're part of the spec. guess stepanov's main contribution was to
 clarify that.

When I called indexing an implementation detail, I was referring to the collection itself. The method of indexing *is* an implementation detail of the collection.

not when it gets composed in higher level ops.

It's not? If a collection's indexing isn't implemented by the collection's own class (or the equivalent functions in non-OO), then where is it implemented? Don't tell me it's the sort function, because I know that I'm not calling a sort function every time I say "collection[i]". The method of indexing is implemented by the collection class, therefore, it's an implementation detail of that method/class, not the functions that call it. Claiming otherwise is like saying that all of the inner working of printf() are implementation details of main().
 It should not be considered an implementation detail of the
 sort algorithm since it's encapsulated in the collection and thus hidden
 away from the sort algorithm. If you make a rule that collections with 
 cheap
 indexing are indexed via opIndex and collections with expensive indexing 
 are
 indexed via a function, then you've just defined the API in terms of the
 collection's implementation (and introduced an unnecessary inconsistency
 into the API).

no. there is consistency. nth() is consistent across - o(n) or better indexing. i think u have it wrong when u think of "cheap" as if it were "fewer machine instructions". no. it's about asymptotic complexity and that does matter.

Since we're talking about algorithmic complexity, I figured "cheap" and "expensive" would be understood as being intended in the same sense. So yes, I'm well aware of that.
 If a sort function is desired that only accepts collections with O(1)
 indexing, then that can be accomplished at a higher level of abstraction
 (using function overloading, RTTI, etc.) without getting in the way when
 such a guarantee is not needed.

exactly. nth() and opIndex() fit the bill. what's there not to love?

The "nth()" deviates from standard indexing syntax. I consider "collection[i]" to mean "indexing", not "low-complexity indexing".
 For realtime code, I can see the benefit to what you're saying. 
 Although
 in
 many cases only part of a program needs to be realtime, and for the 
 rest
 of
 the program's code I'd hate to have to sacrifice the encapsulation
 benefits.

realtime has nothin' to do with it.

For code that needs to run in realtime, I agree with Denis Koroskin that it could be helpful to be able to look at a piece of code and have some sort of guarantee that there is no behind-the-scenes overloading going on that is any more complex than the operators' default behaviors. But for code that doesn't need to finish within a maximum amount of time, that becomes less important and the encapsulation/syntactic-consistency gained from the use of such things becomes a more worthy pursuit. That's what I was saying about realtime.

i disagree but am in a rush now. guess i can't convince u.
 encapsulation ain't broken by making complexity part of the reqs. any 
 more
 than any req ain't breakin' encapsulation. if it looks like a problem 
 then
 encapsulation was misdesigned and needs change.

 case in point. all containers should provide 'nth' say is it's o(n) or
 better. then there's a subclass of container that is indexed_container.
 that provides opIndex and says it's o(log n) or better. it also 
 provides
 'nth' by just forwarding to opIndex. faster than o(n) ain't a problem.

 but forcing a list to blurt something for opIndex - that's just bad
 design.

I agree that not all collections should implement an opIndex. Anything without a natural sequence or mapping should lack opIndex (such as a tree or graph). But forcing the user of a collection that *does* have a natural sequence (like a linked list) to use function-call-syntax instead of standard indexing-syntax just because the collection is implemented in a way that causes indexing to be less scalable than other collections - that's bad design.

no. it's great design. because it's not lyin'. you want o(1) indexing you say a[n]. you are ok with o(n) indexing you say a.nth(n). this is how generic code works, with consistent notation. not with lyin'.

Number of different ways to index a collection:

One way: Consistent
Two ways: Not consistent
 The way I see it, "group[n]" means "Get the nth element of group". Not 
 "Get
 the element at location group.ptr + (n * sizeof(group_base_type)) or
 something else that's just as scalable."

no need to get that low. just say o(1) and understand o(1) has nothing to do with the count of assembler ops.

Here: "group.ptr + (n * sizeof(group_base_type))..." One multiplication, one addition, one memory read, no loops, no recursion: O(1). "...or something else that's just as scalable" Something that's just as scalable as O(1) must be O(1). So yes, that's what I said.
 In plain C, those are one and the same. But when you start talking about generic collections, encapsulation and interface versus implementation, they are very different: the former is interface, the latter is implementation.

so now would u say stl has a poor design? because it's all about stuff that you consider badly designed.

I abandoned C++ back when STL was still fairly new, so I can't honestly say. I seem to remember C++ having some trouble with certain newer language concepts; it might be that STL is the best that can be reasonably done given the drawbacks of C++. Or there might be room for improvement.
Aug 26 2008
prev sibling parent reply Michiel Helvensteijn <nomail please.com> writes:
superdan wrote:

 my point was opIndex should not be written for a list to begin with.

Ok. So you are not opposed to the random access operation on a list, as long as it doesn't use opIndex but a named function, correct? You are saying that there is a rule somewhere (either written or unwritten) that guarantees a time-complexity of O(1) for opIndex, wherever it appears.

This of course means that a linked list cannot define opIndex, since a random access operation on it will take O(n) (there are tricks that can make it faster in most practical cases, but I digress). That, in turn, means that a linked list and a dynamic array cannot share a common interface that includes opIndex. Aren't you making things difficult for yourself with this rule?

A list and an array are very similar data-structures and it is natural for them to share a common interface. The main differences are:
* A list takes more memory.
* A list has slower random access.
* A list has faster insertions and growth.

But the interface shouldn't necessarily make any complexity guarantees. The implementations should. And any programmer worth his salt will be able to use this wisely and choose the right sorting algorithm for the right data-structure. There are other algorithms, I'm sure, that work equally well on either. Of course, any algorithm should give its time-complexity in terms of the complexity of the operations it uses.

I do understand your point, however. And I believe my argument would be stronger if there were some sort of automatic complexity analysis tool. This could either warn a programmer in case he makes the wrong choice, or even take the choice out of the programmer's hands and automatically choose the right sorting algorithm for the job. That's a bit ambitious. I guess a profiler is the next best thing.

-- 
Michiel
Aug 27 2008
parent reply superdan <super dan.org> writes:
Michiel Helvensteijn Wrote:

 superdan wrote:
 
 my point was opIndex should not be written for a list to begin with.

Ok. So you are not opposed to the random access operation on a list, as long as it doesn't use opIndex but a named function, correct?

correctamundo.
 You are saying that there is a rule somewhere (either written or unwritten)
 that guarantees a time-complexity of O(1) for opIndex, wherever it appears.

yeppers. amend that to o(log n). in d, that rule is a social contract derived from the built-in vector and hash indexing syntax.
 This of course means that a linked list cannot define opIndex, since a
 random access operation on it will take O(n) (there are tricks that can
 make it faster in most practical cases, but I digress).

it oughtn't. & you digress in the wrong direction. you can't prove a majority of "practical cases" will not suffer a performance hit. the right direction is to define the right abstraction for forward iteration. i mean opIndex optimization is making a shitty design run better. y not make a good design to start with?
 That, in turn, means that a linked list and a dynamic array can not share a
 common interface that includes opIndex.

so what. they can share a common interface that includes nth(). what exactly is yer problem with that.
 Aren't you making things difficult for yourself with this rule?

not all. i want o(n) index access, i use nth() and i know it's gonna take me o(n) and i'll design my higher-level algorithm accordingly. if random access helps my particular algo then is(typeof(a[n])) tells me that a supports random access. if i can't live without a[n] my algo won't compile. every1's happy.
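The is(typeof(a[n])) check described here has a direct C++ analogue, sketched below for concreteness (the trait has_index and the function element_at are hypothetical names, not from the post):

```cpp
#include <cstddef>
#include <iterator>
#include <list>
#include <type_traits>
#include <vector>

// Detect at compile time whether a container supports c[i] -- the C++
// counterpart of D's is(typeof(a[n])).
template <typename C, typename = void>
struct has_index : std::false_type {};
template <typename C>
struct has_index<C,
    std::void_t<decltype(std::declval<const C&>()[std::size_t{0}])>>
    : std::true_type {};

// Uses c[i] only when the container really offers it; otherwise it walks,
// so the O(n) cost is visible in the code instead of hidden behind [].
template <typename C>
auto element_at(const C& c, std::size_t i) {
    if constexpr (has_index<C>::value) {
        return c[i];                  // indexed path
    } else {
        auto it = std::begin(c);      // honest linear walk
        std::advance(it, i);
        return *it;
    }
}
```

An algorithm that cannot live without a[n] can simply static_assert on the trait and refuse to compile for a list, exactly as described above.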
 A list and an array are very similar data-structures and it is natural for
 them to share a common interface.

sure. both r sequence containers. opIndex ain't part of a sequence container interface.
 The main differences are:
 * A list takes more memory.
 * A list has slower random access.

nooonononono. o(n) vs. o(1) to be precise. that's not "slower". that's sayin' "list don't have random access, you can as well get up your lazy ass & do a linear search by calling nth()". turn that on its head. if i give u a container and say "it has random access" u'd rightly expect better than a linear search. a deque has slower random access than a vector, but the same complexity.
 * A list has faster insertions and growth.

o(1) insertion if u have the insertion point. both list and vector have o(1) growth. list also has o(1) splicing. that's important. we get to the point where we realize the two r fundamentally different structures built around fundamentally different tradeoffs. they do satisfy the same interface. just ain't the vector interface. it's a sequential-access interface. not a random-access interface.
 But the interface shouldn't necessarily make any complexity guarantees.

problem is many said so til stl came and said enuff is enuff. for fundamental data structures & algos complexity /must/ be part of the spec & design. otherwise all u get is a mishmash of crap & u simply can't do generic stuff w/ a mishmash of crap. as other cont/algo libs've copiously shown. that approach's impressive coz it challenged some stupid taboos & proved them worthless. it was contrarian & to great effect. for that alone stl puts to shame previous container/algo libs. i know i'd used half a dozen and wrote a couple. thot the whole container/algo design is old hat. when stl came along i was like, holy effin' guacamole. that's why i say. even if u don't use c++ for its many faults. understand stl coz it's d shiznit.
 The
 implementations should. And any programmer worth his salt will be able to
 use this wisely and choose the right sorting algorithm for the right
 data-structure.

here's where the thing blows apart. i agree with choosing manually if i didn't want to do generic programming. if u wanna do generic programming u want help from the compiler in mixing n matching stuff. it's not about the saltworthiness. it's about writing generic code.
 There are other algorithms, I'm sure, that work equally
 well on either. Of course, any algorithm should give its time-complexity in
 terms of the complexity of the operations it uses.
 
 I do understand your point, however. And I believe my argument would be
 stronger if there were some sort of automatic complexity analysis tool.

stl makes-do without an automatic tool.
 This could either warn a programmer in case he makes the wrong choice, or
 even take the choice out of the programmer's hands and automatically choose
 the right sorting algorithm for the job. That's a bit ambitious. I guess a
 profiler is the next best thing.

i have no doubt stl has had big ambitions. for what i can tell it fulfilled them tho c++ makes higher-level algorithms look arcane. so i'm happy with the lambdas in std.algorithm. & can't figure why containers don't come along. walt?
Aug 27 2008
next sibling parent reply Michiel Helvensteijn <nomail please.com> writes:
superdan wrote:

 This of course means that a linked list cannot define opIndex, since a
 random access operation on it will take O(n) (there are tricks that can
 make it faster in most practical cases, but I digress).

it oughtn't. & you digress in the wrong direction. you can't prove a majority of "practical cases" will not suffer a performance hit.

Perhaps. It's been a while since I've worked with data-structures on this level, but I seem to remember there are ways.

What if you maintain a linked list of small arrays? Say each node in the list contains around log(n) of the elements in the entire list. Wouldn't that bring random access down to O(log n)? Of course, this would also bring insertions up to O(log n).

And what if you save an index/pointer pair after each access. Then with each new access request, you can choose from three locations to start walking:
* The start of the list.
* The end of the list.
* The last access-point of the list.

In a lot of practical cases a new access is close to the last access. Of course, the general case would still be O(n).
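The second trick (remembering the last access point) could be sketched like this; CursorList and its at() method are hypothetical names, and the wrapper assumes a non-empty doubly-linked list:

```cpp
#include <cstddef>
#include <initializer_list>
#include <iterator>
#include <list>

// A list wrapper that caches the last (index, iterator) pair, so a new
// access walks from whichever is closest: the front, the back, or the
// previous access point. Worst case is still O(n); sequential access
// patterns cost O(1) per step.
template <typename T>
class CursorList {
    std::list<T> data;
    mutable typename std::list<T>::const_iterator last_it;
    mutable std::size_t last_idx = 0;
public:
    CursorList(std::initializer_list<T> init)
        : data(init), last_it(data.begin()) {}

    const T& at(std::size_t i) const {
        std::size_t n = data.size();
        // distances from the three candidate starting points
        std::size_t from_front = i;
        std::size_t from_back  = n - 1 - i;
        std::size_t from_last  = i > last_idx ? i - last_idx : last_idx - i;
        auto it = last_it;
        if (from_front <= from_back && from_front <= from_last) {
            it = data.begin();
            std::advance(it, static_cast<long>(i));
        } else if (from_back <= from_last) {
            it = data.end();
            std::advance(it, -static_cast<long>(n - i));
        } else {
            std::advance(it, static_cast<long>(i) - static_cast<long>(last_idx));
        }
        last_it = it;   // remember this access point for next time
        last_idx = i;
        return *it;
    }
};
```

Whether this deserves to be called a list any more is exactly the point under dispute below.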
 That, in turn, means that a linked list and a dynamic array can not share
 a common interface that includes opIndex.

so what. they can share a common interface that includes nth(). what exactly is yer problem with that.

That's simple. a[i] looks much nicer than a.nth(i). By the way, I suspect that if opIndex is available only on arrays and nth() is available on all sequence types, algorithm writers will forget about opIndex and use nth(), to make their algorithm more widely compatible. And I wouldn't blame them, though I guess you would.
 * A list has faster insertions and growth.

o(1) insertion if u have the insertion point. both list and vector have o(1) growth.

Yeah, but dynamic arrays have to re-allocate once in a while. Lists don't.
 we get to the point where we realize the two r fundamentally different
 structures built around fundamentally different tradeoffs. they do satisfy
 the same interface. just ain't the vector interface. it's a
 sequential-access interface. not a random-access interface.

I believe we agree in principle, but are just confused about each other's definitions. If the "random-access interface" guarantees O(1) for nth/opIndex/whatever, of course you are right. But if time-complexity is not taken into consideration, the sequential-access interface and the random-access interface are equivalent, no?

I'm not opposed to complexity guarantees in public contracts. Far from it, in fact. Just introduce both interfaces and let the algorithm writers choose which one to accept. But give both interfaces opIndex, since it's just good syntax.

I do think it's a good idea for algorithms to support the interface with the weakest constraints (sequential-access). As long as they specify their time-complexity in terms of the complexities of the interface operations, not in absolute terms. Then, when a programmer writes 'somealgorithm(someinterfaceobject)', a hypothetical analysis tool could tell him the time-complexity of the resulting operation. The programmer might even assert a worst-case complexity at that point and the compiler could bail out if it doesn't match.
 even if u don't use c++ for its many faults. understand stl coz it's d
 shiznit.

I use C++. I use STL. I love both. But that doesn't mean there is no room for improvement. The STL is quite complex, and maybe it doesn't have to be. -- Michiel
Aug 27 2008
parent reply Dee Girl <deegirl noreply.com> writes:
Michiel Helvensteijn Wrote:

 superdan wrote:
 
 This of course means that a linked list cannot define opIndex, since a
 random access operation on it will take O(n) (there are tricks that can
 make it faster in most practical cases, but I digress).

it oughtn't. & you digress in the wrong direction. you can't prove a majority of "practical cases" will not suffer a performance hit.

Perhaps. It's been a while since I've worked with data-structures on this level, but I seem to remember there are ways.

I write example with findLast a little time ago. But there are many examples. For example move to front algorithm. It is linear but with "trick" opIndex it is O(n*n) even with optimization. Bad!
 What if you maintain a linked list of small arrays? Say each node in the
 list contains around log(n) of the elements in the entire list. Wouldn't
 that bring random access down to O(log n)? Of course, this would also bring
 insertions up to O(log n).
 
 And what if you save an index/pointer pair after each access. Then with each
 new access request, you can choose from three locations to start walking:
 * The start of the list.
 * The end of the list.
 * The last access-point of the list.
 
 In a lot of practical cases a new access is close to the last access. Of
 course, the general case would still be O(n).

Michiel-san, this is new data structure very different from list! If I want list I never use this structure. It is like joking. Because when you write this you agree that list is not vector.
 That, in turn, means that a linked list and a dynamic array can not share
 a common interface that includes opIndex.

so what. they can share a common interface that includes nth(). what exactly is yer problem with that.

That's simple. a[i] looks much nicer than a.nth(i).

It is not nicer. It is more deceiving (correct spell?). If you look at code it looks like array code.

    foreach (i; 0 .. a.length) {
        a[i] += 1;
    }

For array works nice. But for list it is terrible! Many operations for incrementing only small list.
 By the way, I suspect that if opIndex is available only on arrays and nth()
 is available on all sequence types, algorithm writers will forget about
 opIndex and use nth(), to make their algorithm more widely compatible. And
 I wouldn't blame them, though I guess you would.

I do not agree with this. I am sorry! I think nobody should write find() that uses nth().
 * A list has faster insertions and growth.

o(1) insertion if u have the insertion point. both list and vector have o(1) growth.

Yeah, but dynamic arrays have to re-allocate once in a while. Lists don't.

Lists allocate memory each insert. Array allocate memory some time. With doubling cost of allocation+copy converge to zero.
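The doubling argument can be checked with a small simulation (the function name copies_for is made up for the sketch): count the element copies that reallocation causes while appending n elements to a doubling array. The total stays below 2*n, so the amortized copy cost per append is O(1).

```cpp
#include <cstddef>

// Simulate appending n elements to a dynamic array that doubles its
// capacity when full, counting how many element copies reallocation costs.
std::size_t copies_for(std::size_t n) {
    std::size_t size = 0, capacity = 1, copies = 0;
    for (std::size_t i = 0; i < n; ++i) {
        if (size == capacity) {  // grow: copy every element to new storage
            copies += size;
            capacity *= 2;
        }
        ++size;                  // append the new element in place
    }
    return copies;               // total is 1 + 2 + 4 + ... < 2 * n
}
```

For a million appends the copy total is under two million, i.e. fewer than two copies per element on average, which is the "converge to zero" extra cost per append.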
 we get to the point where we realize the two r fundamentally different
 structures built around fundamentally different tradeoffs. they do satisfy
 the same interface. just ain't the vector interface. it's a
 sequential-access interface. not a random-access interface.

I believe we agree in principle, but are just confused about each others definitions. If the "random-access interface" guarantees O(1) for nth/opIndex/whatever, of course you are right. But if time-complexity is not taken into consideration, the sequential-access interface and the random-access interface are equivalent, no?

I think it is mistake to not taken into consideration time complexity. For basic data structures specially. I do not think you can call them equivalent.
 I'm not opposed to complexity guarantees in public contracts. Far from it,
 in fact. Just introduce both interfaces and let the algorithm writers
 choose which one to accept. But give both interfaces opIndex, since it's
 just good syntax.

I think is convenient syntax. Maybe too convenient ^_^.
 I do think it's a good idea for algorithms to support the interface with the
 weakest constraints (sequential-access). As long as they specify their
 time-complexity in terms of the complexities of the interface operations,
 not in absolute terms.
 
 Then, when a programmer writes 'somealgorithm(someinterfaceobject)', a
 hypothetical analysis tool could tell him the time-complexity of the
 resulting operation. The programmer might even assert a worst-case
 complexity at that point and the compiler could bail out if it doesn't
 match.

The specification I think is with types. If that works tool is the compiler.
 even if u don't use c++ for its many faults. understand stl coz it's d
 shiznit.

I use C++. I use STL. I love both. But that doesn't mean there is no room for improvement. The STL is quite complex, and maybe it doesn't have to be.

Many things in STL can be better with D. But iterators and complexity is beautiful in STL.
Aug 27 2008
next sibling parent reply Michiel Helvensteijn <nomail please.com> writes:
Dee Girl wrote:

 What if you maintain a linked list of small arrays? Say each node in the
 list contains around log(n) of the elements in the entire list. Wouldn't
 that bring random access down to O(log n)? Of course, this would also
 bring insertions up to O(log n).
 
 And what if you save an index/pointer pair after each access. Then with
 each new access request, you can choose from three locations to start
 walking: * The start of the list.
 * The end of the list.
 * The last access-point of the list.
 
 In a lot of practical cases a new access is close to the last access. Of
 course, the general case would still be O(n).

Michiel-san, this is new data structure very different from list! If I want list I never use this structure. It is like joking. Because when you write this you agree that list is not vector.

Yes, the first 'trick' makes it a different datastructure. The second does not. Would you still be opposed to using opIndex if its time-complexity is O(log n)?
 That's simple. a[i] looks much nicer than a.nth(i).

It is not nicer. It is more deceiving (correct spell?). If you look at code it looks like array code.

    foreach (i; 0 .. a.length) {
        a[i] += 1;
    }

For array works nice. But for list it is terrible! Many operations for incrementing only small list.

With that second trick the loop would have the same complexity for lists. But putting that aside for the moment, are you saying you would allow yourself to be deceived by a syntax detail? No, mentally attaching O(1) to the *subscripting operator* is simply a legacy from C, where it is syntactic sugar for pointer arithmetic.
 By the way, I suspect that if opIndex is available only on arrays and
 nth() is available on all sequence types, algorithm writers will forget
 about opIndex and use nth(), to make their algorithm more widely
 compatible. And I wouldn't blame them, though I guess you would.

I do not agree with this. I am sorry! I think nobody should write find() that uses nth().

Of course not. Find should be written with an iterator, which has optimal complexity for both data-structures. My point is that an algorithm should be generic first and foremost. Then you use the operations that have the lowest complexity over all targeted data-structures if possible.
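An iterator-based find, written as described here, serves both structures with the same linear pass; the wrapper name contains is assumed for the sketch:

```cpp
#include <algorithm>
#include <list>
#include <vector>

// find() against iterators: one forward pass, never an index, so the
// same O(n) algorithm is optimal for vector and list alike.
template <typename Range, typename T>
bool contains(const Range& r, const T& value) {
    return std::find(r.begin(), r.end(), value) != r.end();
}
```

This is the generic-first design: the algorithm asks only for the weakest operation it needs, and every sequence container qualifies honestly.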
 * A list has faster insertions and growth.

o(1) insertion if u have the insertion point. both list and vector have o(1) growth.

Yeah, but dynamic arrays have to re-allocate once in a while. Lists don't.

Lists allocate memory each insert. Array allocate memory some time. With doubling cost of allocation+copy converge to zero.

Lists allocate memory for bare nodes, but never have to copy their elements. Arrays have to move their whole content to a larger memory location each time they are outgrown. For more complex data-types that means potentially very expensive copies.
 But if time-complexity is not taken into consideration, the
 sequential-access interface and the random-access interface are
 equivalent, no?

I think it is mistake to not taken into consideration time complexity. For basic data structures specially. I do not think you can call them equivalent.

I did say 'if'. You have to agree that if you disregard complexity issues (for the sake of argument), the two ARE equivalent.
 I do think it's a good idea for algorithms to support the interface with
 the weakest constraints (sequential-access). As long as they specify
 their time-complexity in terms of the complexities of the interface
 operations, not in absolute terms.
 
 Then, when a programmer writes 'somealgorithm(someinterfaceobject)', a
 hypothetical analysis tool could tell him the time-complexity of the
 resulting operation. The programmer might even assert a worst-case
 complexity at that point and the compiler could bail out if it doesn't
 match.

The specification I think is with types. If that works tool is the compiler.

But don't you understand that if this tool did exist, and the language had a standard notation for time/space-complexity, I could simply write:

    sequence<T> s;
    /* fill sequence */
    sort(s);

And the compiler (in cooperation with this 'tool') could automatically find the most effective combination of data-structure and algorithm. The code would be more readable and efficient.

-- 
Michiel
Aug 27 2008
parent reply Dee Girl <deegirl noreply.com> writes:
Michiel Helvensteijn Wrote:

 Dee Girl wrote:
 
 What if you maintain a linked list of small arrays? Say each node in the
 list contains around log(n) of the elements in the entire list. Wouldn't
 that bring random access down to O(log n)? Of course, this would also
 bring insertions up to O(log n).
 
 And what if you save an index/pointer pair after each access. Then with
 each new access request, you can choose from three locations to start
 walking: * The start of the list.
 * The end of the list.
 * The last access-point of the list.
 
 In a lot of practical cases a new access is close to the last access. Of
 course, the general case would still be O(n).

Michiel-san, this is new data structure very different from list! If I want list I never use this structure. It is like joking. Because when you write this you agree that list is not vector.

Yes, the first 'trick' makes it a different datastructure. The second does not. Would you still be opposed to using opIndex if its time-complexity is O(log n)?

This is different question. And tricks are not answer for problem. Problem is list has other access method than array.
 That's simple. a[i] looks much nicer than a.nth(i).

It is not nicer. It is more deceiving (correct spell?). If you look at code it looks like array code.

    foreach (i; 0 .. a.length) {
        a[i] += 1;
    }

For array works nice. But for list it is terrible! Many operations for incrementing only small list.

With that second trick the loop would have the same complexity for lists.

Not for singly linked lists. I think name "trick" is very good. It is trick like prank to a friend. It does not do real thing. It only fools for few cases.
 But putting that aside for the moment, are you saying you would allow
 yourself to be deceived by a syntax detail? No, mentally attaching O(1) to
 the *subscripting operator* is simply a legacy from C, where it is
 syntactic sugar for pointer arithmetic.

I do not think so. I am sorry. If a[n] is not allowed then other array access primitive is allowed. Give index(a, n) as example. If language say index(a, n) is array access then it is big mistake for list to also define index(a, n). List maybe should define findAt(a, n). Then array also can define findAt(a, n). It is not mistake.
 By the way, I suspect that if opIndex is available only on arrays and
 nth() is available on all sequence types, algorithm writers will forget
 about opIndex and use nth(), to make their algorithm more widely
 compatible. And I wouldn't blame them, though I guess you would.

I do not agree with this. I am sorry! I think nobody should write find() that uses nth().

Of course not. Find should be written with an iterator, which has optimal complexity for both data-structures. My point is that an algorithm should be generic first and foremost. Then you use the operations that have the lowest complexity over all targeted data-structures if possible.

Maybe I think "generic" word different than you. For me generic is that algorithm asks minimum from structure to do its work. For example find ask only one forward pass. Input iterator does one forward pass. It is mistake if find ask for index. It is also mistake if structure makes an algorithm think it has index as primitive operation.
 * A list has faster insertions and growth.

o(1) insertion if u have the insertion point. both list and vector have o(1) growth.

Yeah, but dynamic arrays have to re-allocate once in a while. Lists don't.

Lists allocate memory each insert. Array allocate memory some time. With doubling cost of allocation+copy converge to zero.

Lists allocate memory for bare nodes, but never have to copy their elements. Arrays have to move their whole content to a larger memory location each time they are outgrown. For more complex data-types that means potentially very expensive copies.

I think this is mistake. I think you should google "amortized complexity". Maybe that can help much.
 But if time-complexity is not taken into consideration, the
 sequential-access interface and the random-access interface are
 equivalent, no?

I think it is mistake to not taken into consideration time complexity. For basic data structures specially. I do not think you can call them equivalent.

I did say 'if'. You have to agree that if you disregard complexity issues (for the sake of argument), the two ARE equivalent.

But it is useless comparison. Comparison can not forget important aspect. If we ignore the fractional part, floating point is integer. If organism is not alive it is mostly water.
 I do think it's a good idea for algorithms to support the interface with
 the weakest constraints (sequential-access). As long as they specify
 their time-complexity in terms of the complexities of the interface
 operations, not in absolute terms.
 
 Then, when a programmer writes 'somealgorithm(someinterfaceobject)', a
 hypothetical analysis tool could tell him the time-complexity of the
 resulting operation. The programmer might even assert a worst-case
 complexity at that point and the compiler could bail out if it doesn't
 match.

The specification I think is with types. If that works tool is the compiler.

But don't you understand that if this tool did exist, and the language had a standard notation for time/space-complexity, I could simply write:

    sequence<T> s;
    /* fill sequence */
    sort(s);

And the compiler (in cooperation with this 'tool') could automatically find the most effective combination of data-structure and algorithm. The code would be more readable and efficient.

Michiel-san, STL does that. Or I misunderstand you?
Aug 27 2008
next sibling parent reply Michiel Helvensteijn <nomail please.com> writes:
Dee Girl wrote:

 Yes, the first 'trick' makes it a different datastructure. The second
 does not. Would you still be opposed to using opIndex if its
 time-complexity is O(log n)?

This is different question. And tricks are not answer for problem. Problem is list has other access method than array.

And what's the answer?
 With that second trick the loop would have the same complexity for lists.

Not for singly linked lists.

Yeah, also for singly linked lists.
 But putting that aside for the moment, are you saying you would allow
 yourself to be deceived by a syntax detail? No, mentally attaching O(1)
 to the *subscripting operator* is simply a legacy from C, where it is
 syntactic sugar for pointer arithmetic.

I do not think so. I am sorry. If a[n] is not allowed then other array access primitive is allowed. Give index(a, n) as example. If language say index(a, n) is array access then it is big mistake for list to also define index(a, n). List maybe should define findAt(a, n). Then array also can define findAt(a, n). It is not mistake.

Yes, I agree for "array access". That term implies O(1), since it uses the word "array". But I was arguing against the subscripting operator being forced into O(1).
 Lists allocate memory for bare nodes, but never have to copy their
 elements. Arrays have to move their whole content to a larger memory
 location each time they are outgrown. For more complex data-types that
 means potentially very expensive copies.

I think this is mistake. I think you should google "amortized complexity". Maybe that can help much.

Amortized complexity has nothing to do with it. Dynamic arrays have to copy their elements and lists do not. It's as simple as that.
 But don't you understand that if this tool did exist, and the language
 had a standard notation for time/space-complexity, I could simply write:
 
 sequence<T> s;
 /* fill sequence */
 sort(s);
 
 And the compiler (in cooperation with this 'tool') could automatically
 find the most effective combination of data-structure and algorithm. The
 code would be more readable and efficient.

Michiel-san, STL does that. Or I misunderstand you?

STL will choose the right sorting algorithm, given a specific data-structure. But I am saying it may be possible also for the data-structure to be automatically chosen, based on what the programmer does with it. -- Michiel
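How STL pairs algorithm with structure can be made concrete: std::sort demands random-access iterators, so it works on a vector but does not even compile for a list; list instead ships its own member sort() built on splicing. The tiny dispatcher sort_any below is a hypothetical name illustrating the pairing:

```cpp
#include <algorithm>
#include <list>
#include <vector>

// Dispatch the way STL's requirements force you to: std::sort for
// random-access containers, the container's own member sort otherwise.
template <typename T>
void sort_any(std::vector<T>& v) {
    std::sort(v.begin(), v.end());  // requires random-access iterators
}

template <typename T>
void sort_any(std::list<T>& l) {
    l.sort();  // list's O(n log n) merge sort, based on splicing
}
```

Choosing the data-structure itself automatically, as suggested above, would be the step beyond what STL does.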
Aug 27 2008
parent reply Dee Girl <deegirl noreply.com> writes:
Michiel Helvensteijn Wrote:

 Dee Girl wrote:
 
 Yes, the first 'trick' makes it a different datastructure. The second
 does not. Would you still be opposed to using opIndex if its
 time-complexity is O(log n)?

This is different question. And tricks are not answer for problem. Problem is list has other access method than array.

And what's the answer?

I accept logarithm complexity with []. Logarithm grows slow.
 With that second trick the loop would have the same complexity for lists.

Not for singly linked lists.

Yeah, also for singly linked lists.

May be it is not interesting discuss trick more. I am sure many tricks can be done. And many serious things. Can be done and have been done. They make list not a list any more.
 But putting that aside for the moment, are you saying you would allow
 yourself to be deceived by a syntax detail? No, mentally attaching O(1)
 to the *subscripting operator* is simply a legacy from C, where it is
 syntactic sugar for pointer arithmetic.

I do not think so. I am sorry. If a[n] is not allowed then other array access primitive is allowed. Give index(a, n) as example. If language say index(a, n) is array access then it is big mistake for list to also define index(a, n). List maybe should define findAt(a, n). Then array also can define findAt(a, n). It is not mistake.

Yes, I agree for "array access". That term implies O(1), since it uses the word "array". But I was arguing against the subscripting operator being forced into O(1).

I am sorry. I do not understand your logic. My logic was this. Language has a[n] for array index. My opinion was then a[n] should not be linear search. I said also you can replace a[n] with index(a, n) and my reason is the same. How are you arguing? I did not want to get in this discussion. I see how it is confusing fast ^_^.
 Lists allocate memory for bare nodes, but never have to copy their
 elements. Arrays have to move their whole content to a larger memory
 location each time they are outgrown. For more complex data-types that
 means potentially very expensive copies.

I think this is mistake. I think you should google "amortized complexity". Maybe that can help much.

Amortized complexity has nothing to do with it. Dynamic arrays have to copy their elements and lists do not. It's as simple as that.

No, it is not. I am sorry! In STL there is copy. In D there is std.move. I think it only copies data by bits and clears source. And amortized complexity shows that there is o(1) bit copy on many append.
 But don't you understand that if this tool did exist, and the language
 had a standard notation for time/space-complexity, I could simply write:
 
 sequence<T> s;
 /* fill sequence */
 sort(s);
 
 And the compiler (in cooperation with this 'tool') could automatically
 find the most effective combination of data-structure and algorithm. The
 code would be more readable and efficient.

Michiel-san, STL does that. Or I misunderstand you?

STL will choose the right sorting algorithm, given a specific data-structure. But I am saying it may be possible also for the data-structure to be automatically chosen, based on what the programmer does with it.

I think this is interesting. Then why arguing for bad container design? I do not understand. Thank you, Dee Girl.
Aug 27 2008
parent reply Michiel Helvensteijn <nomail please.com> writes:
Dee Girl wrote:

 But putting that aside for the moment, are you saying you would allow
 yourself to be deceived by a syntax detail? No, mentally attaching
 O(1) to the *subscripting operator* is simply a legacy from C, where
 it is syntactic sugar for pointer arithmetic.

I do not think so. I am sorry. If a[n] is not allowed then other array access primitive is allowed. Give index(a, n) as example. If language say index(a, n) is array access then it is big mistake for list to also define index(a, n). List maybe should define findAt(a, n). Then array also can define findAt(a, n). It is not mistake.

Yes, I agree for "array access". That term implies O(1), since it uses the word "array". But I was arguing against the subscripting operator being forced into O(1).

I am sorry. I do not understand your logic. My logic was this. Language has a[n] for array index. My opinion was then a[n] should not be linear search. I said also you can replace a[n] with index(a, n) and my same reason is the same. How are you arguing?

Let me try again. I agree that you may impose complexity-restrictions in function contracts. If you write a function called index(a, n), you may impose an O(log n) restriction, for all I care. But the a[n] syntax is so convenient that I would hate for it to be likewise restricted. I would like to use it for lists and associative containers, where the complexity may be O(n). The programmer should just be careful.
 I did not want to get in this discussion. I see how it is confusing fast
 ^_^.

I find much of this subthread confusing. (Starting with the discussion between Benji and superdan.) It looks to me like 80% of the discussion is based on misunderstandings.
 Amortized complexity has nothing to do with it. Dynamic arrays have to
 copy their elements and lists do not. It's as simple as that.

No, it is not. I am sorry! In STL there is copy. In D there is std.move. I think it only copies data by bits and clears source. And amortized complexity shows that there is o(1) bit copy on many append.

Yes, a bit-copy would be ok. I was thinking of executing the potentially more expensive copy constructor. It's nice that D doesn't have to do this.
 STL will choose the right sorting algorithm, given a specific
 data-structure. But I am saying it may be possible also for the
 data-structure to be automatically chosen, based on what the programmer
 does with it.

I think this is interesting. Then why argueing for bad container design? I do not understand. Thank you, Dee Girl.

Where am I arguing for bad design? All I've been arguing for is looser restrictions for the subscripting operator. You should be able to use it on a list, even though the complexity is O(n). But if it is used often enough (in a deeply nested loop), the compiler will probably automatically use an array instead. -- Michiel
Aug 27 2008
parent reply Dee Girl <deegirl noreply.com> writes:
Michiel Helvensteijn Wrote:

 Dee Girl wrote:
 
 But putting that aside for the moment, are you saying you would allow
 yourself to be deceived by a syntax detail? No, mentally attaching
 O(1) to the *subscripting operator* is simply a legacy from C, where
 it is syntactic sugar for pointer arithmetic.

I do not think so. I am sorry. If a[n] is not allowed then other array access primitive is allowed. Give index(a, n) as example. If language say index(a, n) is array access then it is big mistake for list to also define index(a, n). List maybe should define findAt(a, n). Then array also can define findAt(a, n). It is not mistake.

Yes, I agree for "array access". That term implies O(1), since it uses the word "array". But I was arguing against the subscripting operator being forced into O(1).

I am sorry. I do not understand your logic. My logic was this. Language has a[n] for array index. My opinion was then a[n] should not be linear search. I said also you can replace a[n] with index(a, n) and my same reason is the same. How are you arguing?

Let me try again. I agree that you may impose complexity-restrictions in function contracts. If you write a function called index(a, n), you may impose an O(log n) restriction, for all I care. But the a[n] syntax is so convenient that I would hate for it to be likewise restricted. I would like to use it for lists and associative containers, where the complexity may be O(n). The programmer should just be careful.

Thank you for trying again. Thank you! I understand. Yes, a[n] is very convenient! And I would have agree 100% with you if a[n] was not build in language for array access. But because of that I 100% disagree ^_^. I also think there is objective mistake. In concrete code programmer can be careful. But in generic code programmer can not be careful. I think this must to be explained better. But I am not sure I can.
 I did not want to get in this discussion. I see how it is confusing fast
 ^_^.

I find much of this subthread confusing. (Starting with the discussion between Benji and superdan.) It looks to me like 80% of the discussion is based on misunderstandings.
 Amortized complexity has nothing to do with it. Dynamic arrays have to
 copy their elements and lists do not. It's as simple as that.

No, it is not. I am sorry! In STL there is copy. In D there is std.move. I think it only copies data by bits and clears source. And amortized complexity shows that there is o(1) bit copy on many append.

Yes, a bit-copy would be ok. I was thinking of executing the potentially more expensive copy constructor. It's nice that D doesn't have to do this.
 STL will choose the right sorting algorithm, given a specific
 data-structure. But I am saying it may be possible also for the
 data-structure to be automatically chosen, based on what the programmer
 does with it.

I think this is interesting. Then why arguing for bad container design? I do not understand. Thank you, Dee Girl.

Where am I arguing for bad design? All I've been arguing for is looser restrictions for the subscripting operator. You should be able to use it on a list, even though the complexity is O(n). But if it is used often enough (in a deeply nested loop), the compiler will probably automatically use an array instead.

The idea is nice. But I think it can not be done. Tool is not mind reader. If I make some insert and some index. They want different structure. How does the tool know what I want fast? I say you want bad design because tool does not exist. So we do not have the tool. But we can make good library with what we have. I am sure you can write library with a[n] in O(n). And it works. But I say is more inferior design than STL. Because your library allows things that should not work and does not warn programmer.
Aug 27 2008
parent reply Michiel Helvensteijn <nomail please.com> writes:
Dee Girl wrote:

 Where am I arguing for bad design? All I've been arguing for is looser
 restrictions for the subscripting operator. You should be able to use it
 on a list, even though the complexity is O(n). But if it is used often
 enough (in a deeply nested loop), the compiler will probably
 automatically use an array instead.

The idea is nice. But I think it can not be done. Tool is not mind reader. If I make some insert and some index. They want different structure. How does the tool know what I want fast?

In the future it may be possible to do such analysis. If the indexing is in a deeper loop, it may weigh more than the insertions you are doing. But failing that, the programmer might give the compiler 'hints' on which functions he/she wants faster. -- Michiel
Aug 27 2008
parent bearophile <bearophileHUGS lycos.com> writes:
Michiel Helvensteijn:
 In the future it may be possible to do such analysis. If the indexing is in
 a deeper loop, it may weigh more than the insertions you are doing. But
 failing that, the programmer might give the compiler 'hints' on which
 functions he/she wants faster.

For example you can write a Deque data structure made of a doubly linked list of small arrays. At run time it collects a few simple statistics of its usage, and it can grow or shrink the length of the arrays according to the cache line length and the patterns of its usage.

There's a boolean constant that at compile time can switch off such collection of statistics, to make the data structure a bit faster but not adaptive. You may want the data structure non-adaptive if you know very well what its future usage in the program will be, or in programs that run for a few minutes/seconds. In programs that run for hours or days you may prefer a more adaptive data structure.

You can create similar data structures in statically compiled languages, but those operations are a better fit when there's a virtual machine (HotSpot, for example, compiles and de-optimizes code dynamically). LLVM looks like it can be used in both situations :-)

Bye, bearophile
Aug 27 2008
prev sibling parent superdan <super dan.org> writes:
Dee Girl Wrote:

 Michiel Helvensteijn Wrote:
 
 Dee Girl wrote:
 
 What if you maintain a linked list of small arrays? Say each node in the
 list contains around log(n) of the elements in the entire list. Wouldn't
 that bring random access down to O(log n)? Of course, this would also
 bring insertions up to O(log n).
 
 And what if you save an index/pointer pair after each access. Then with
 each new access request, you can choose from three locations to start
 walking: * The start of the list.
 * The end of the list.
 * The last access-point of the list.
 
 In a lot of practical cases a new access is close to the last access. Of
 course, the general case would still be O(n).

Michiel-san, this is new data structure very different from list! If I want list I never use this structure. It is like joking. Because when you write this you agree that list is not vector.

Yes, the first 'trick' makes it a different data-structure. The second does not. Would you still be opposed to using opIndex if its time-complexity is O(log n)?

This is different question. And tricks are not answer for problem. Problem is list has other access method than array.
 That's simple. a[i] looks much nicer than a.nth(i).

It is not nicer. It is more deceiving (correct spell?). If you look at code it looks like array code. foreach (i; 0 .. a.length) { a[i] += 1; } For array works nice. But for list it is terrible! Many operations for incrementing only small list.

With that second trick the loop would have the same complexity for lists.

Not for singly linked lists. I think name "trick" is very good. It is trick like prank to a friend. It does not do real thing. It only fools for few cases.

guess i'll risk telling which. forward iteration. backward iteration. accessing first k. accessing last k. that's pretty much it. and first/last k are already available in standard list. all else is linear time. so forget about using that as an index table. a naive design at best.
 But putting that aside for the moment, are you saying you would allow
 yourself to be deceived by a syntax detail? No, mentally attaching O(1) to
 the *subscripting operator* is simply a legacy from C, where it is
 syntactic sugar for pointer arithmetic.

I do not think so. I am sorry. If a[n] is not allowed then other array access primitive is allowed. Give index(a, n) as example. If language say index(a, n) is array access then it is big mistake for list to also define index(a, n). List maybe should define findAt(a, n). Then array also can define findAt(a, n). It is not mistake.

boils down to what's primitive access vs. what's actual algorithm. indexing in array is primitive. indexing in list is same algorithm as finding nth element anywhere - singly, doubly, file, you name it. so can't claim indexing is primitive for list.
 By the way, I suspect that if opIndex is available only on arrays and
 nth() is available on all sequence types, algorithm writers will forget
 about opIndex and use nth(), to make their algorithm more widely
 compatible. And I wouldn't blame them, though I guess you would.

I do not agree with this. I am sorry! I think nobody should write find() that uses nth().

Of course not. Find should be written with an iterator, which has optimal complexity for both data-structures. My point is that an algorithm should be generic first and foremost. Then you use the operations that have the lowest complexity over all targeted data-structures if possible.

Maybe I think "generic" word different than you. For me generic is that algorithm asks minimum from structure to do its work. For example find ask only one forward pass. Input iterator does one forward pass. It is mistake if find ask for index. It is also mistake if structure makes an algorithm think it has index as primitive operation.
 * A list has faster insertions and growth.

o(1) insertion if u have the insertion point. both list and vector have o(1) growth.

Yeah, but dynamic arrays have to re-allocate once in a while. Lists don't.

Lists allocate memory each insert. Array allocate memory some time. With doubling cost of allocation+copy converge to zero.

Lists allocate memory for bare nodes, but never have to copy their elements. Arrays have to move their whole content to a larger memory location each time they are outgrown. For more complex data-types that means potentially very expensive copies.

I think this is mistake. I think you should google "amortized complexity". Maybe that can help much.

to expand: array append is o(1) averaged over many appends if you double the capacity each time you need. interesting bit: if you only add k, complexity jumps to quadratic.
 But if time-complexity is not taken into consideration, the
 sequential-access interface and the random-access interface are
 equivalent, no?

I think it is mistake to not taken into consideration time complexity. For basic data structures specially. I do not think you can call them equivalent.

I did say 'if'. You have to agree that if you disregard complexity issues (for the sake of argument), the two ARE equivalent.

But it is useless comparison. Comparison can not forget important aspect. If we ignore fractionary floating point is integer. If organism is not alive it is mostly water.

pwned if u ask me :D
Aug 27 2008
prev sibling parent reply Benji Smith <dlanguage benjismith.net> writes:
Dee Girl wrote:
 Michiel Helvensteijn Wrote:
 That's simple. a[i] looks much nicer than a.nth(i).

It is not nicer. It is more deceiving (correct spell?). If you look at code it looks like array code. foreach (i; 0 .. a.length) { a[i] += 1; } For array works nice. But for list it is terrible! Many operations for incrementing only small list.

Well, that's what you get with operator overloading. The same thing could be said for "+" or "-". They're inherently deceiving, because they look like builtin operations on primitive data types.

For expensive operations (like performing division on an unlimited-precision decimal object), should the author of the code use "opDiv" or should he implement a separate "divide" function?

Forget opIndex for a moment, and ask the more general question about all overloaded operators. Should they imply any sort of asymptotic complexity guarantee?

Personally, I don't think so. I don't like "nth". I'd rather use the opIndex. And if I'm using a linked list, I'll be aware of the fact that it'll exhibit linear-time indexing, and I'll be cautious about which algorithms to use.

--benji
Aug 27 2008
parent reply Dee Girl <deegirl noreply.com> writes:
Benji Smith Wrote:

 Dee Girl wrote:
 Michiel Helvensteijn Wrote:
 That's simple. a[i] looks much nicer than a.nth(i).

It is not nicer. It is more deceiving (correct spell?). If you look at code it looks like array code. foreach (i; 0 .. a.length) { a[i] += 1; } For array works nice. But for list it is terrible! Many operations for incrementing only small list.

Well, that's what you get with operator overloading.

I am sorry. I disagree. I think that is what you get with bad design.
 The same thing could be said for "+" or "-". They're inherently 
 deceiving, because they look like builtin operations on primitive data 
 types.
 
 For expensive operations (like performing division on an 
 unlimited-precision decimal object), should the author of the code use 
 "opDiv" or should he implement a separate "divide" function?

The cost of + and - is proportional to digits in number. For small number of digits computer does fast in hardware. For many digits the cost grows. The number of digits is log n. I think + and - are fine for big integer. I am not surprise.
 Forget opIndex for a moment, and ask the more general question about all 
 overloaded operators. Should they imply any sort of asymptotic 
 complexity guarantee?

I think depends on good design. For example I think ++ or -- for iterator. If it is O(n) it is bad design. Bad design make people say like you "This is what you get with operator overloading".
 Personally, I don't think so.
 
 I don't like "nth".
 
 I'd rather use the opIndex. And if I'm using a linked list, I'll be 
 aware of the fact that it'll exhibit linear-time indexing, and I'll be 
 cautious about which algorithms to use.

But inside algorithm you do not know if you use a linked list or a vector. You lost that information in bad abstraction. Also abstraction is bad because if you change data structure you have concept errors that still compile. And run until tomorrow ^_^. I also like or do not like things. But good reason can convince me? Thank you, Dee Girl.
Aug 27 2008
next sibling parent "Steven Schveighoffer" <schveiguy yahoo.com> writes:
"Dee Girl" wrote
 I think depends on good design. For example I think ++ or -- for iterator. 
 If it is O(n) it is bad design. Bad design make people say like you "This 
 is what you get with operator overloading".

Slightly off topic: when I was developing dcollections, I was a bit annoyed that there was no opInc or opDec; instead you have to use opAddAssign and opSubAssign. What this means is that for a list iterator, if you want to allow the syntax:

iterator it = list.find(x);
(++it).value = 5;

or such, you have to define the operator opAddAssign. This makes it possible to do:

it += 10;

Which I don't like for the same reason we are arguing about this: it suggests this is a simple operation, when in fact it is O(n). But there's no way around it, as you can't define ++it without defining +=.

Of course, I could throw an exception, but I decided against that. Instead, I just warn the user in the docs to only ever use the ++x version. Annoying...

-Steve
Aug 27 2008
prev sibling parent reply "Nick Sabalausky" <a a.a> writes:
"Dee Girl" <deegirl noreply.com> wrote in message 
news:g943oi$11f4$1 digitalmars.com...
 Benji Smith Wrote:

 Dee Girl wrote:
 Michiel Helvensteijn Wrote:
 That's simple. a[i] looks much nicer than a.nth(i).

It is not nicer. It is more deceiving (correct spell?). If you look at code it looks like array code. foreach (i; 0 .. a.length) { a[i] += 1; } For array works nice. But for list it is terrible! Many operations for incrementing only small list.

Well, that's what you get with operator overloading.

I am sorry. I disagree. I think that is what you get with bad design.
 The same thing could be said for "+" or "-". They're inherently
 deceiving, because they look like builtin operations on primitive data
 types.

 For expensive operations (like performing division on an
 unlimited-precision decimal object), should the author of the code use
 "opDiv" or should he implement a separate "divide" function?

The cost of + and - is proportional to digits in number. For small number of digits computer does fast in hardware. For many digits the cost grows. The number of digits is log n. I think + and - are fine for big integer. I am not surprise.
 Forget opIndex for a moment, and ask the more general question about all
 overloaded operators. Should they imply any sort of asymptotic
 complexity guarantee?

I think depends on good design. For example I think ++ or -- for iterator. If it is O(n) it is bad design. Bad design make people say like you "This is what you get with operator overloading".
 Personally, I don't think so.

 I don't like "nth".

 I'd rather use the opIndex. And if I'm using a linked list, I'll be
 aware of the fact that it'll exhibit linear-time indexing, and I'll be
 cautious about which algorithms to use.

But inside algorithm you do not know if you use a linked list or a vector. You lost that information in bad abstraction. Also abstraction is bad because if you change data structure you have concept errors that still compile. And run until tomorrow ^_^.

A generic algorithm has absolutely no business caring about the complexity of the collection it's operating on. If it does, then you've created a concrete algorithm, not a generic one. If an algorithm uses [] and doesn't know the complexity of the []...good! It shouldn't know, and it shouldn't care. It's the code that sends the collection to the algorithm that knows and cares.

Why? Because "what algorithm is best?" depends on far more than just what type of collection is used. It depends on "Will the collection ever be larger than X elements?". It depends on "Is it a standard textbook list, or does it use trick 1 and/or trick 2?". It depends on "Is it usually mostly sorted or mostly random?". It depends on "What do I do with it most often? Sort, append, search, insert or delete?". And it depends on other things, too.

Using "[]" versus "nth()" can't tell the algorithm *any* of those things. But those things *must* be known in order to make an accurate decision of "Is this the right algorithm or not?" Therefore, a generic algorithm *cannot* ever know for certain if it's the right algorithm, *even* if you say "[]" means "O(log n) or better". Therefore, the algorithm should not be designed to only work with certain types of collections. The code that sends the collection to the algorithm is the *only* code that knows the answers to all of the questions above, therefore it is the only code that should ever decide "I should use this algorithm, I shouldn't use that algorithm."
 I also like or do not like things. But good reason can convince me? Thank 
 you, Dee Girl.

Aug 27 2008
next sibling parent reply superdan <super dan.org> writes:
Nick Sabalausky Wrote:

 "Dee Girl" <deegirl noreply.com> wrote in message 
 news:g943oi$11f4$1 digitalmars.com...
 Benji Smith Wrote:

 Dee Girl wrote:
 Michiel Helvensteijn Wrote:
 That's simple. a[i] looks much nicer than a.nth(i).

It is not nicer. It is more deceiving (correct spell?). If you look at code it looks like array code. foreach (i; 0 .. a.length) { a[i] += 1; } For array works nice. But for list it is terrible! Many operations for incrementing only small list.

Well, that's what you get with operator overloading.

I am sorry. I disagree. I think that is what you get with bad design.
 The same thing could be said for "+" or "-". They're inherently
 deceiving, because they look like builtin operations on primitive data
 types.

 For expensive operations (like performing division on an
 unlimited-precision decimal object), should the author of the code use
 "opDiv" or should he implement a separate "divide" function?

The cost of + and - is proportional to digits in number. For small number of digits computer does fast in hardware. For many digits the cost grows. The number of digits is log n. I think + and - are fine for big integer. I am not surprise.
 Forget opIndex for a moment, and ask the more general question about all
 overloaded operators. Should they imply any sort of asymptotic
 complexity guarantee?

I think depends on good design. For example I think ++ or -- for iterator. If it is O(n) it is bad design. Bad design make people say like you "This is what you get with operator overloading".
 Personally, I don't think so.

 I don't like "nth".

 I'd rather use the opIndex. And if I'm using a linked list, I'll be
 aware of the fact that it'll exhibit linear-time indexing, and I'll be
 cautious about which algorithms to use.

But inside algorithm you do not know if you use a linked list or a vector. You lost that information in bad abstraction. Also abstraction is bad because if you change data structure you have concept errors that still compile. And run until tomorrow ^_^.

 A generic algorithm has absolutely no business caring about the complexity of the collection it's operating on.

absolutely. desperate. need. of. chanel.
 If it does, then you've created a concrete 
 algorithm, not a generic one.

sure you don't know what you're talking about. it is generic insofar as it abstracts away the primitives it needs from its iterator. run don't walk and get a dose of stl.
 If an algorithm uses [] and doesn't know the
 complexity of the []...good! It shouldn't know, and it shouldn't care.

nonsense. so wrong i won't even address it.
 It's 
 the code that sends the collection to the algorithm that knows and cares. 
 Why? Because "what algorithm is best?" depends on far more than just what 
 type of collection is used. It depends on "Will the collection ever be 
 larger than X elements?". It depends on "Is it a standard textbook list, or 
 does it use trick 1 and/or trick 2?". It depends on "Is it usually mostly 
 sorted or mostly random?". It depends on "What do I do with it most often? 
 Sort, append, search, insert or delete?". And it depends on other things, 
 too.

sure it does. problem is you have it backwards. types and algos tell you the theoretical properties. then you in knowledge of what's goin' on use the algorithm that does it for you in knowledge that the complexity would work for your situation. or you encode your own specialized algorithm.

thing is stl encoded the most general linear search. you can use it for linear searching everything. moreover it exactly said what's needed for a linear search: a one-pass forward iterator aka input iterator.

now to tie it with what u said: you know in your situation whether linear find cuts the mustard. that don't change the nature of that fundamental algorithm. so you use it or use another. your choice. but find remains universal so long as it has access to the basics of one-pass iteration.
 Using "[]" versus "nth()" can't tell the algorithm *any* of those things.

doesn't have to.
 But 
 those things *must* be known in order to make an accurate decision of "Is 
 this the right algorithm or not?"

sure. you make them decision on the call site.
 Therefore, a generic algorithm *cannot* ever
 know for certain if it's the right algoritm *even* if you say "[]" means 
 "O(log n) or better".

utterly wrong. poppycock. gobbledygook. nonsense. this is so far off i don't have time to even address. if you want to learn stuff go learn stl. then we talk. if you want to teach me i guess we're done.
 Therefore, the algorithm should not be designed to 
 only work with certain types of collections.

it should be designed to work with certain iterator categories.
 The code that sends the 
 collection to the algorithm is the *only* code that knows the answers to all
 of the questions above, therefore it is the only code that should ever 
 decide "I should use this algorithm, I shouldn't use that algorithm."

correct. you just have all of your other hypotheses jumbled. sorry dood don't be hatin' but there's so much you don't know i ain't gonna continue this. last word is yours. call me a pompous prick if you want. go ahead.
Aug 27 2008
parent reply "Nick Sabalausky" <a a.a> writes:
"superdan" <super dan.org> wrote in message 
news:g94g3e$20e9$1 digitalmars.com...
 Nick Sabalausky Wrote:

 "Dee Girl" <deegirl noreply.com> wrote in message
 news:g943oi$11f4$1 digitalmars.com...
 Benji Smith Wrote:

 Dee Girl wrote:
 Michiel Helvensteijn Wrote:
 That's simple. a[i] looks much nicer than a.nth(i).

It is not nicer. It is more deceiving (correct spell?). If you look at code it looks like array code. foreach (i; 0 .. a.length) { a[i] += 1; } For array works nice. But for list it is terrible! Many operations for incrementing only small list.

Well, that's what you get with operator overloading.

I am sorry. I disagree. I think that is what you get with bad design.
 The same thing could be said for "+" or "-". They're inherently
 deceiving, because they look like builtin operations on primitive data
 types.

 For expensive operations (like performing division on an
 unlimited-precision decimal object), should the author of the code use
 "opDiv" or should he implement a separate "divide" function?

The cost of + and - is proportional to digits in number. For small number of digits computer does fast in hardware. For many digits the cost grows. The number of digits is log n. I think + and - are fine for big integer. I am not surprise.
 Forget opIndex for a moment, and ask the more general question about 
 all
 overloaded operators. Should they imply any sort of asymptotic
 complexity guarantee?

I think depends on good design. For example I think ++ or -- for iterator. If it is O(n) it is bad design. Bad design make people say like you "This is what you get with operator overloading".
 Personally, I don't think so.

 I don't like "nth".

 I'd rather use the opIndex. And if I'm using a linked list, I'll be
 aware of the fact that it'll exhibit linear-time indexing, and I'll be
 cautious about which algorithms to use.

But inside algorithm you do not know if you use a linked list or a vector. You lost that information in bad abstraction. Also abstraction is bad because if you change data structure you have concept errors that still compile. And run until tomorrow ^_^.

A generic algorithm has absolutely no business caring about the complexity of the collection it's operating on.

absolutely. desperate. need. of. chanel.
 If it does, then you've created a concrete
 algorithm, not a generic one.

sure you don't know what you're talking about. it is generic insofar as it abstracts away the primitives it needs from its iterator. run don't walk and get a dose of stl.
 If an algorithm uses [] and doesn't know the
 complexity of the []...good! It shouldn't know, and it shouldn't care.

nonsense. so wrong i won't even address it.
 It's
 the code that sends the collection to the algorithm that knows and cares.
 Why? Because "what algorithm is best?" depends on far more than just what
 type of collection is used. It depends on "Will the collection ever be
 larger than X elements?". It depends on "Is it a standard textbook list, 
 or
 does it use trick 1 and/or trick 2?". It depends on "Is it usually mostly
 sorted or mostly random?". It depends on "What do I do with it most 
 often?
 Sort, append, search, insert or delete?". And it depends on other things,
 too.

sure it does. problem is you have it backwards. types and algos tell you the theoretical properties. then you in knowledge of what's goin' on use the algorithm that does it for you in knowledge that the complexity would work for your situation. or you encode your own specialized algorithm. thing is stl encoded the most general linear search. you can use it for linear searching everything. moreover it exactly said what's needed for a linear search: a one-pass forward iterator aka input iterator. now to tie it with what u said: you know in your situation whether linear find cuts the mustard. that don't change the nature of that fundamental algorithm. so you use it or use another. your choice. but find remains universal so long as it has access to the basics of one-pass iteration.
 Using "[]" versus "nth()" can't tell the algorithm *any* of those things.

doesn't have to.
 But
 those things *must* be known in order to make an accurate decision of "Is
 this the right algorithm or not?"

sure. you make them decision on the call site.
 Therefore, a generic algorithm *cannot* ever
 know for certain if it's the right algoritm *even* if you say "[]" means
 "O(log n) or better".

utterly wrong. poppycock. gobbledygook. nonsense. this is so far off i don't have time to even address. if you want to learn stuff go learn stl. then we talk. if you want to teach me i guess we're done.
 Therefore, the algorithm should not be designed to
 only work with certain types of collections.

it should be designed to work with certain iterator categories.
 The code that sends the collection to the algorithm is the *only* code that
 knows the answers to all of the questions above, therefore it is the only
 code that should ever decide "I should use this algorithm, I shouldn't use
 that algorithm."

correct. you just have all of your other hypotheses jumbled. sorry dood don't be hatin' but there's so much you don't know i ain't gonna continue this. last word is yours. call me a pompous prick if you want. go ahead.

I'll agree to drop this issue. There's little point in debating with someone whose arguments frequently consist of things like "You are wrong", "I'm not going to explain my point", and "dood don't be hatin'".
Aug 27 2008
parent reply Fawzi Mohamed <fmohamed mac.com> writes:
On 2008-08-27 23:27:52 +0200, "Nick Sabalausky" <a a.a> said:

 "superdan" <super dan.org> wrote in message
 news:g94g3e$20e9$1 digitalmars.com...
 Nick Sabalausky Wrote:
 
 "Dee Girl" <deegirl noreply.com> wrote in message
 news:g943oi$11f4$1 digitalmars.com...
 Benji Smith Wrote:
 
 Dee Girl wrote:
 Michiel Helvensteijn Wrote:
 That's simple. a[i] looks much nicer than a.nth(i).

It is not nicer. It is more deceiving (correct spell?). If you look at code it looks like array code. foreach (i; 0 .. a.length) { a[i] += 1; } For array works nice. But for list it is terrible! Many operations for incrementing only small list.

Well, that's what you get with operator overloading.

I am sorry. I disagree. I think that is what you get with bad design.
 The same thing could be said for "+" or "-". They're inherently
 deceiving, because they look like builtin operations on primitive data
 types.
 
 For expensive operations (like performing division on an
 unlimited-precision decimal object), should the author of the code use
 "opDiv" or should he implement a separate "divide" function?

The cost of + and - is proportional to digits in number. For small number of digits computer does fast in hardware. For many digits the cost grows. The number of digits is log n. I think + and - are fine for big integer. I am not surprise.
 Forget opIndex for a moment, and ask the more general question about all
 overloaded operators. Should they imply any sort of asymptotic
 complexity guarantee?

I think depends on good design. For example I think ++ or -- for iterator. If it is O(n) it is bad design. Bad design make people say like you "This is what you get with operator overloading".
 Personally, I don't think so.
 
 I don't like "nth".
 
 I'd rather use the opIndex. And if I'm using a linked list, I'll be
 aware of the fact that it'll exhibit linear-time indexing, and I'll be
 cautious about which algorithms to use.

But inside algorithm you do not know if you use a linked list or a vector. You lost that information in bad abstraction. Also abstraction is bad because if you change data structure you have concept errors that still compile. And run until tomorrow ^_^.

A generic algorithm has absolutely no business caring about the complexity of the collection it's operating on.

absolutely. desperate. need. of. chanel.
 If it does, then you've created a concrete algorithm, not a generic one.

sure you don't know what you're talking about. it is generic insofar as it abstracts away the primitives it needs from its iterator. run don't walk and get a dose of stl.
 If an algorithm uses [] and doesn't know the
 complexity of the []...good! It shouldn't know, and it shouldn't care.

nonsense. so wrong i won't even address it.
 It's the code that sends the collection to the algorithm that knows and
 cares. Why? Because "what algorithm is best?" depends on far more than just
 what type of collection is used. It depends on "Will the collection ever be
 larger than X elements?". It depends on "Is it a standard textbook list, or
 does it use trick 1 and/or trick 2?". It depends on "Is it usually mostly
 sorted or mostly random?". It depends on "What do I do with it most often?
 Sort, append, search, insert or delete?". And it depends on other things, too.

sure it does. problem is you have it backwards. types and algos tell you the theoretical properties. then you in knowledge of what's goin' on use the algorithm that does it for you in knowledge that the complexity would work for your situation. or you encode your own specialized algorithm. thing is stl encoded the most general linear search. you can use it for linear searching everything. moreover it exactly said what's needed for a linear search: a one-pass forward iterator aka input iterator. now to tie it with what u said: you know in your situation whether linear find cuts the mustard. that don't change the nature of that fundamental algorithm. so you use it or use another. your choice. but find remains universal so long as it has access to the basics of one-pass iteration.
 Using "[]" versus "nth()" can't tell the algorithm *any* of those things.

doesn't have to.
 But those things *must* be known in order to make an accurate decision of
 "Is this the right algorithm or not?"

sure. you make them decision on the call site.
 Therefore, a generic algorithm *cannot* ever know for certain if it's the
 right algorithm *even* if you say "[]" means "O(log n) or better".

utterly wrong. poppycock. gobbledygook. nonsense. this is so far off i don't have time to even address. if you want to learn stuff go learn stl. then we talk. if you want to teach me i guess we're done.
 Therefore, the algorithm should not be designed to
 only work with certain types of collections.

it should be designed to work with certain iterator categories.
 The code that sends the collection to the algorithm is the *only* code that
 knows the answers to all of the questions above, therefore it is the only
 code that should ever decide "I should use this algorithm, I shouldn't use
 that algorithm."

correct. you just have all of your other hypotheses jumbled. sorry dood don't be hatin' but there's so much you don't know i ain't gonna continue this. last word is yours. call me a pompous prick if you want. go ahead.

I'll agree to drop this issue. There's little point in debating with someone whose arguments frequently consist of things like "You are wrong", "I'm not going to explain my point", and "dood don't be hatin'".

I am with dan dee_girl & co on this issue, the problem is that a generic algorithm "knows" the types it is working on and can easily check the operations they have, and based on this decide the strategy to use. This choice works well if the presence of a given operation is also connected with some performance guarantee. Concepts (or better, categories (Aldor concepts, not C++'s), that are interfaces for types, but interfaces that have to be explicitly assigned to a type) might relax this situation a little, but the need for some guarantees will remain. Fawzi
Aug 27 2008
parent reply "Nick Sabalausky" <a a.a> writes:
"Fawzi Mohamed" <fmohamed mac.com> wrote in message 
news:g94k2b$2a1e$1 digitalmars.com...
 On 2008-08-27 23:27:52 +0200, "Nick Sabalausky" <a a.a> said:

 "superdan" <super dan.org> wrote in message
 news:g94g3e$20e9$1 digitalmars.com...
 Nick Sabalausky Wrote:

 "Dee Girl" <deegirl noreply.com> wrote in message
 news:g943oi$11f4$1 digitalmars.com...
 Benji Smith Wrote:

 Dee Girl wrote:
 Michiel Helvensteijn Wrote:
 That's simple. a[i] looks much nicer than a.nth(i).

It is not nicer. It is more deceiving (correct spell?). If you look at code it looks like array code. foreach (i; 0 .. a.length) { a[i] += 1; } For array works nice. But for list it is terrible! Many operations for incrementing only small list.

Well, that's what you get with operator overloading.

I am sorry. I disagree. I think that is what you get with bad design.
 The same thing could be said for "+" or "-". They're inherently
 deceiving, because they look like builtin operations on primitive data
 types.

 For expensive operations (like performing division on an
 unlimited-precision decimal object), should the author of the code use
 "opDiv" or should he implement a separate "divide" function?

The cost of + and - is proportional to digits in number. For small number of digits computer does fast in hardware. For many digits the cost grows. The number of digits is log n. I think + and - are fine for big integer. I am not surprise.
 Forget opIndex for a moment, and ask the more general question about all
 overloaded operators. Should they imply any sort of asymptotic
 complexity guarantee?

I think depends on good design. For example I think ++ or -- for iterator. If it is O(n) it is bad design. Bad design make people say like you "This is what you get with operator overloading".
 Personally, I don't think so.

 I don't like "nth".

 I'd rather use the opIndex. And if I'm using a linked list, I'll be
 aware of the fact that it'll exhibit linear-time indexing, and I'll be
 cautious about which algorithms to use.

But inside algorithm you do not know if you use a linked list or a vector. You lost that information in bad abstraction. Also abstraction is bad because if you change data structure you have concept errors that still compile. And run until tomorrow ^_^.

A generic algorithm has absolutely no business caring about the complexity of the collection it's operating on.

absolutely. desperate. need. of. chanel.
 If it does, then you've created a concrete algorithm, not a generic one.

sure you don't know what you're talking about. it is generic insofar as it abstracts away the primitives it needs from its iterator. run don't walk and get a dose of stl.
 If an algorithm uses [] and doesn't know the
 complexity of the []...good! It shouldn't know, and it shouldn't care.

nonsense. so wrong i won't even address it.
 It's the code that sends the collection to the algorithm that knows and
 cares. Why? Because "what algorithm is best?" depends on far more than just
 what type of collection is used. It depends on "Will the collection ever be
 larger than X elements?". It depends on "Is it a standard textbook list, or
 does it use trick 1 and/or trick 2?". It depends on "Is it usually mostly
 sorted or mostly random?". It depends on "What do I do with it most often?
 Sort, append, search, insert or delete?". And it depends on other things, too.

sure it does. problem is you have it backwards. types and algos tell you the theoretical properties. then you in knowledge of what's goin' on use the algorithm that does it for you in knowledge that the complexity would work for your situation. or you encode your own specialized algorithm. thing is stl encoded the most general linear search. you can use it for linear searching everything. moreover it exactly said what's needed for a linear search: a one-pass forward iterator aka input iterator. now to tie it with what u said: you know in your situation whether linear find cuts the mustard. that don't change the nature of that fundamental algorithm. so you use it or use another. your choice. but find remains universal so long as it has access to the basics of one-pass iteration.
 Using "[]" versus "nth()" can't tell the algorithm *any* of those things.

doesn't have to.
 But those things *must* be known in order to make an accurate decision of
 "Is this the right algorithm or not?"

sure. you make them decision on the call site.
 Therefore, a generic algorithm *cannot* ever know for certain if it's the
 right algorithm *even* if you say "[]" means "O(log n) or better".

utterly wrong. poppycock. gobbledygook. nonsense. this is so far off i don't have time to even address. if you want to learn stuff go learn stl. then we talk. if you want to teach me i guess we're done.
 Therefore, the algorithm should not be designed to
 only work with certain types of collections.

it should be designed to work with certain iterator categories.
 The code that sends the collection to the algorithm is the *only* code that
 knows the answers to all of the questions above, therefore it is the only
 code that should ever decide "I should use this algorithm, I shouldn't use
 that algorithm."

correct. you just have all of your other hypotheses jumbled. sorry dood don't be hatin' but there's so much you don't know i ain't gonna continue this. last word is yours. call me a pompous prick if you want. go ahead.

I'll agree to drop this issue. There's little point in debating with someone whose arguments frequently consist of things like "You are wrong", "I'm not going to explain my point", and "dood don't be hatin'".

I am with dan dee_girl & co on this issue, the problem is that a generic algorithm "knows" the types it is working on and can easily check the operations they have, and based on this decide the strategy to use. This choice works well if the presence of a given operation is also connected with some performance guarantee.

IMO, a better way to do that would be via C#-style attributes or equivalent named interfaces. I'm not sure if this is what you're referring to below or not.
 Concepts (or better, categories (Aldor concepts, not C++'s), that are 
 interfaces for types, but interfaces that have to be explicitly assigned 
 to a type) might relax this situation a little, but the need for some 
 guarantees will remain.

If this "guarantee" (or mechanism for checking the types of operations that a collection supports) takes the form of a style guideline that says "don't implement opIndex for a collection if it would be O(n) or worse", then that, frankly, is absolutely no guarantee at all. If you *really* need that sort of guarantee (and I can imagine it may be useful in some cases), then the implementation of the guarantee does *not* belong in the realm of "implements vs doesn't-implement a particular operator overload". Doing so is an abuse of operator overloading, since operator overloading is there for defining syntactic sugar, not for acting as a makeshift contract. The correct mechanism for such guarantees is with named interfaces or C#-style attributes, as I mentioned above. True, that can still be abused if the collection author wants to, but they have to actually try (ie, they have to lie and say "implements IndexingInConstantTime" in addition to implementing opIndex). If you instead try to implement that guarantee with the "don't implement opIndex for a collection if it would be O(n) or worse" style-guideline, then it's far too easy for a collection to come along that is ignorant of that "pseudo-contract" and accidentally breaks it. Proper use of interfaces/attributes instead of relying on the existence or absence of an overloaded operator fixes that problem.
Aug 27 2008
next sibling parent reply "Steven Schveighoffer" <schveiguy yahoo.com> writes:
"Nick Sabalausky" wrote
 Concepts (or better categories (aldor concept not C++), that are 
 interfaces for types, but interfaces that have to be explicitly assigned 
 to a type) might relax this situation a little, but the need for some 
 guarantees will remain.

If this "guarantee" (or mechanism for checking the types of operations that a collection supports) takes the form of a style guideline that says "don't implement opIndex for a collection if it would be O(n) or worse", then that, frankly, is absolutely no guarantee at all.

The guarantee is not enforced, but the expectation and convention is implicit. When someone sees an index operator the first thought is that it is a quick lookup. You can force yourself to think differently, but the reality is that most people think that because of the universal usage of square brackets (except for VB, and I feel pity for anyone who needs to use VB) to mean 'lookup by key', and usually this is only useful on objects where the lookup is quick ( < O(n) ). Although there is no requirement, nor enforcement, the 'quick' contract is expected by the user, no matter how much docs you throw at them. Look, for instance, at Tango's now-deprecated LinkMap, which uses a linked list of key/value pairs (copied from Doug Lea's implementation). Nobody in their right mind would use LinkMap because lookups are O(n), and it's just as easy to use a TreeMap or HashMap. Would you ever use it?
 If you *really* need that sort of guarantee (and I can imagine it may be 
 useful in some cases), then the implementation of the guarantee does *not* 
 belong in the realm of "implements vs doesn't-implement a particular 
 operator overload". Doing so is an abuse of operator overloading, since 
 operator overloading is there for defining syntactic sugar, not for acting 
 as a makeshift contract.

I don't think anybody is asking for a guarantee from the compiler or any specific tool. I think what we are saying is that violating the 'opIndex is fast' notion is bad design because you end up with users thinking they are doing something that's quick. You end up with people posting benchmarks on your containers saying 'why does python beat the pants off your list implementation?'. You can say 'hey, it's not meant to be used that way', but then why can the user use it that way? A better design is to nudge the user into using a correct container for the job by only supporting operations that make sense on the collections. And as far as operator semantic meaning, D's operators are purposely named after what they are supposed to do. Notice that the operator for + is opAdd, not opPlus. This is because opAdd is supposed to mean you are performing an addition operation. Assigning a different semantic meaning is not disallowed by the compiler, but is considered bad design. opIndex is supposed to be an index function, not a linear search. It's not called opSearch for a reason. Sure you can redefine it however you want semantically, but it's considered bad design. That's all we're saying. -Steve
Aug 27 2008
parent reply "Nick Sabalausky" <a a.a> writes:
"Steven Schveighoffer" <schveiguy yahoo.com> wrote in message 
news:g95b89$1jii$1 digitalmars.com...
 When someone sees an index operator the first thought is that it is a 
 quick lookup.

This seems to be a big part of the disagreement. Personally, I think it's insane to look at [] and just assume it's a cheap lookup. That was true in pure C where, as someone else said, it was nothing more than a shorthand for a specific pointer arithmetic operation, but carrying that idea over to more modern languages (especially one that also uses [] for associative arrays) is a big mistake. The '+' operator means "add". Addition is typically O(1). But vectors can be added, and that's an O(n) operation. Should opAdd never be used for vectors?
 You can force yourself to think differently, but the reality is that most 
 people think that because of the universal usage of square brackets 
 (except for VB, and I feel pity for anyone who needs to use VB) to mean 
 'lookup by key', and usually this is only useful on objects where the 
 lookup is quick ( < O(n) ).  Although there is no requirement, nor 
 enforcement, the 'quick' contract is expected by the user, no matter how 
 much docs you throw at them.

 Look, for instance, at Tango's now-deprecated LinkMap, which uses a 
 linked-list of key/value pairs (copied from Doug Lea's implementation). 
 Nobody in their right mind would use link map because lookups are O(n), 
 and it's just as easy to use a TreeMap or HashMap.  Would you ever use it?

 If you *really* need that sort of guarantee (and I can imagine it may be 
 useful in some cases), then the implementation of the guarantee does 
 *not* belong in the realm of "implements vs doesn't-implement a 
 particular operator overload". Doing so is an abuse of operator 
 overloading, since operator overloading is there for defining syntactic 
 sugar, not for acting as a makeshift contract.

I don't think anybody is asking for a guarantee from the compiler or any specific tool. I think what we are saying is that violating the 'opIndex is fast' notion is bad design because you end up with users thinking they are doing something that's quick. You end up with people posting benchmarks on your containers saying 'why does python beat the pants off your list implementation?'. You can say 'hey, it's not meant to be used that way', but then why can the user use it that way? A better design is to nudge the user into using a correct container for the job by only supporting operations that make sense on the collections. And as far as operator semantic meaning, D's operators are purposely named after what they are supposed to do. Notice that the operator for + is opAdd, not opPlus. This is because opAdd is supposed to mean you are performing an addition operation. Assigning a different semantic meaning is not disallowed by the compiler, but is considered bad design. opIndex is supposed to be an index function, not a linear search. It's not called opSearch for a reason. Sure you can redefine it however you want semantically, but it's considered bad design. That's all we're saying.

Nobody is suggesting using [] to invoke a search (Although we have talked about using [] *in the search function's implementation*). Search means you want to get the position of a given element, or in other words, "element" -> search -> "key/index". What we're talking about is the reverse: getting the element at a given position, ie, "key/index" -> [] -> "element". It doesn't matter if it's an array, a linked list, or a super-duper-collection-from-2092: that's still indexing, not searching.
Aug 27 2008
next sibling parent reply "Steven Schveighoffer" <schveiguy yahoo.com> writes:
"Nick Sabalausky" wrote
 "Steven Schveighoffer" wrote in message 
 news:g95b89$1jii$1 digitalmars.com...
 When someone sees an index operator the first thought is that it is a 
 quick lookup.

This seems to be a big part of the disagreement. Personally, I think it's insane to look at [] and just assume it's a cheap lookup. That was true in pure C where, as someone else said, it was nothing more than a shorthand for a specific pointer arithmetic operation, but carrying that idea over to more modern languages (especially one that also uses [] for associative arrays) is a big mistake.

Perhaps it is a mistake to assume it, but it is a common mistake. And the expectation is intuitive. You don't see people making light switches that look like outlets, even though it's possible. You might perhaps make a library where opIndex is a linear search in your list, but I would expect that people would not use that indexing feature correctly. Just as if I plug my lamp into the light switch that looks like an outlet, I'd expect it to get power, and be confused when it doesn't. Except the opIndex mistake is more subtle because I *do* get what I actually want, but I just am not realizing the cost of it.
 The '+' operator means "add". Addition is typically O(1). But vectors can 
 be added, and that's an O(n) operation. Should opAdd never be used for 
 vectors?

As long as it's addition, I have no problem with an O(n) operation (and it depends on your view of n). Indexing implies speed; look at all other cases where indexing is used. For addition to be proportional to the size of the element is natural and expected.
 You can force yourself to think differently, but the reality is that most 
 people think that because of the universal usage of square brackets 
 (except for VB, and I feel pity for anyone who needs to use VB) to mean 
 'lookup by key', and usually this is only useful on objects where the 
 lookup is quick ( < O(n) ).  Although there is no requirement, nor 
 enforcement, the 'quick' contract is expected by the user, no matter how 
 much docs you throw at them.

 Look, for instance, at Tango's now-deprecated LinkMap, which uses a 
 linked-list of key/value pairs (copied from Doug Lea's implementation). 
 Nobody in their right mind would use link map because lookups are O(n), 
 and it's just as easy to use a TreeMap or HashMap.  Would you ever use 
 it?

 If you *really* need that sort of guarantee (and I can imagine it may be 
 useful in some cases), then the implementation of the guarantee does 
 *not* belong in the realm of "implements vs doesn't-implement a 
 particular operator overload". Doing so is an abuse of operator 
 overloading, since operator overloading is there for defining syntactic 
 sugar, not for acting as a makeshift contract.

I don't think anybody is asking for a guarantee from the compiler or any specific tool. I think what we are saying is that violating the 'opIndex is fast' notion is bad design because you end up with users thinking they are doing something that's quick. You end up with people posting benchmarks on your containers saying 'why does python beat the pants off your list implementation?'. You can say 'hey, it's not meant to be used that way', but then why can the user use it that way? A better design is to nudge the user into using a correct container for the job by only supporting operations that make sense on the collections. And as far as operator semantic meaning, D's operators are purposely named after what they are supposed to do. Notice that the operator for + is opAdd, not opPlus. This is because opAdd is supposed to mean you are performing an addition operation. Assigning a different semantic meaning is not disallowed by the compiler, but is considered bad design. opIndex is supposed to be an index function, not a linear search. It's not called opSearch for a reason. Sure you can redefine it however you want semantically, but it's considered bad design. That's all we're saying.

Nobody is suggesting using [] to invoke a search (Although we have talked about using [] *in the search function's implementation*). Search means you want to get the position of a given element, or in other words, "element" -> search -> "key/index". What we're talking about is the reverse: getting the element at a given position, ie, "key/index" -> [] -> "element". It doesn't matter if it's an array, a linked list, or a super-duper-collection-from-2092: that's still indexing, not searching.

It is a search: you are searching for the element whose key matches. When the key can be used to look up the element efficiently, we call that indexing. Indexing is a more constrained form of searching. -Steve
Aug 28 2008
parent reply "Nick Sabalausky" <a a.a> writes:
"Steven Schveighoffer" <schveiguy yahoo.com> wrote in message 
news:g9650h$cp9$1 digitalmars.com...
 "Nick Sabalausky" wrote
 "Steven Schveighoffer" wrote in message 
 news:g95b89$1jii$1 digitalmars.com...
 When someone sees an index operator the first thought is that it is a 
 quick lookup.

This seems to be a big part of the disagreement. Personally, I think it's insane to look at [] and just assume it's a cheap lookup. That was true in pure C where, as someone else said, it was nothing more than a shorthand for a specific pointer arithmetic operation, but carrying that idea over to more modern languages (especially one that also uses [] for associative arrays) is a big mistake.

Perhaps it is a mistake to assume it, but it is a common mistake. And the expectation is intuitive.

Why in the world would any halfway competent programmer ever look at a *linked list* and assume that the linked list's [] (if implemented) is O(1)? As for the risk that could create of accidentally sending a linked list to a "search" (ie, a "search for an element which contains data X") that uses [] internally instead of iterators (but then, why wouldn't it just use iterators anyway?): I'll agree that in a case like this there should be some mechanism for automatic selection of an algorithm, but that mechanism should be at a separate level of abstraction. There would be a function "search" that, through either RTTI or template constraints or something else, says "does collection 'c' implement ConstantTimeForwardDirectionIndexing?" or better yet IMO "does the collection have attribute ForwardDirectionIndexingComplexity that is set equal to Complexity.Constant?", and based on that passes control to either IndexingSearch or IteratorSearch.
  You don't see people making light switches that look like outlets, even 
 though it's possible.  You might perhaps make a library where opIndex is a 
 linear search in your list, but I would expect that people would not use 
 that indexing feature correctly.  Just as if I plug my lamp into the light 
 switch that looks like an outlet, I'd expect it to get power, and be 
 confused when it doesn't.  Except the opIndex mistake is more subtle 
 because I *do* get what I actually want, but I just am not realizing the 
 cost of it.

 The '+' operator means "add". Addition is typically O(1). But vectors can 
 be added, and that's an O(n) operation. Should opAdd never be used for 
 vectors?

As long as it's addition, I have no problem with an O(n) operation (and it depends on your view of n). Indexing implies speed; look at all other cases where indexing is used. For addition to be proportional to the size of the element is natural and expected.

When people look at '+' they typically think "integer/float addition". Why would, for example, the risk of mistaking an O(n) "big int" addition for an O(1) integer/float addition be any worse than the risk of mistaking an O(n) linked list "get element at index" for an O(1) array "get element at index"?
 You can force yourself to think differently, but the reality is that 
 most people think that because of the universal usage of square brackets 
 (except for VB, and I feel pity for anyone who needs to use VB) to mean 
 'lookup by key', and usually this is only useful on objects where the 
 lookup is quick ( < O(n) ).  Although there is no requirement, nor 
 enforcement, the 'quick' contract is expected by the user, no matter how 
 much docs you throw at them.

 Look, for instance, at Tango's now-deprecated LinkMap, which uses a 
 linked-list of key/value pairs (copied from Doug Lea's implementation). 
 Nobody in their right mind would use link map because lookups are O(n), 
 and it's just as easy to use a TreeMap or HashMap.  Would you ever use 
 it?

 If you *really* need that sort of guarantee (and I can imagine it may 
 be useful in some cases), then the implementation of the guarantee does 
 *not* belong in the realm of "implements vs doesn't-implement a 
 particular operator overload". Doing so is an abuse of operator 
 overloading, since operator overloading is there for defining syntactic 
 sugar, not for acting as a makeshift contract.

I don't think anybody is asking for a guarantee from the compiler or any specific tool. I think what we are saying is that violating the 'opIndex is fast' notion is bad design because you end up with users thinking they are doing something that's quick. You end up with people posting benchmarks on your containers saying 'why does python beat the pants off your list implementation?'. You can say 'hey, it's not meant to be used that way', but then why can the user use it that way? A better design is to nudge the user into using a correct container for the job by only supporting operations that make sense on the collections. And as far as operator semantic meaning, D's operators are purposely named after what they are supposed to do. Notice that the operator for + is opAdd, not opPlus. This is because opAdd is supposed to mean you are performing an addition operation. Assigning a different semantic meaning is not disallowed by the compiler, but is considered bad design. opIndex is supposed to be an index function, not a linear search. It's not called opSearch for a reason. Sure you can redefine it however you want semantically, but it's considered bad design. That's all we're saying.

Nobody is suggesting using [] to invoke a search (Although we have talked about using [] *in the search function's implementation*). Search means you want to get the position of a given element, or in other words, "element" -> search -> "key/index". What we're talking about is the reverse: getting the element at a given position, ie, "key/index" -> [] -> "element". It doesn't matter if it's an array, a linked list, or a super-duper-collection-from-2092: that's still indexing, not searching.

It is a search: you are searching for the element whose key matches. When the key can be used to look up the element efficiently, we call that indexing. Indexing is a more constrained form of searching.

If you've got a linked list, and you want to get element N, are you *really* going to go reaching for a function named "search"? How often do you really see a generic function named "search" or "find" that takes a numeric index as the "to be found" parameter instead of something to be matched against the element's value? I would argue that that would be confusing for most people. Like I said in a different post farther down, the implementation of a "getAtIndex()" is obviously going to work like a search, but from "outside the box", what you're asking for is not the same.
Aug 28 2008
next sibling parent Dee Girl <deegirl noreply.com> writes:
Nick Sabalausky Wrote:

 "Steven Schveighoffer" <schveiguy yahoo.com> wrote in message 
 news:g9650h$cp9$1 digitalmars.com...
 "Nick Sabalausky" wrote
 "Steven Schveighoffer" wrote in message 
 news:g95b89$1jii$1 digitalmars.com...
 When someone sees an index operator the first thought is that it is a 
 quick lookup.

This seems to be a big part of the disagreement. Personally, I think it's insane to look at [] and just assume it's a cheap lookup. That was true in pure C where, as someone else said, it was nothing more than shorthand for a specific pointer arithmetic operation, but carrying that idea over to more modern languages (especially one that also uses [] for associative arrays) is a big mistake.

Perhaps it is a mistake to assume it, but it is a common mistake. And the expectation is intuitive.

Why in the world would any halfway competent programmer ever look at a *linked list* and assume that the linked list's [] (if implemented) is O(1)?

A programmer may be looking at generic code. Generic code does not know whether it is working on a linked list or on something else. I think it really helps to think in terms of generic code, because it is generic code that makes the problem interesting.
 As for the risk that this could create of accidentally sending a linked list to 
 a "search" (i.e., a "search for an element which contains data X") that uses 
 [] internally instead of iterators (but then, why wouldn't it just use 
 iterators anyway?): I'll agree that in a case like this there should be some 
 mechanism for automatic choosing of an algorithm, but that mechanism should 
 be at a separate level of abstraction. There would be a function "search" 
 that, through either RTTI or template constraints or something else, says 
 "does collection 'c' implement ConstantTimeForwardDirectionIndexing?" or 
 better yet IMO "does the collection have attribute 
 ForwardDirectionIndexingComplexity that is set equal to 
 Complexity.Constant?", and based on that passes control to either 
 IndexingSearch or IteratorSearch.

I think this is an extremely complicated design. What is the advantage of this design over STL?
  You don't see people making light switches that look like outlets, even 
 though it's possible.  You might perhaps make a library where opIndex is a 
 linear search in your list, but I would expect that people would not use 
 that indexing feature correctly.  Just as if I plug my lamp into the light 
 switch that looks like an outlet, I'd expect it to get power, and be 
 confused when it doesn't.  Except the opIndex mistake is more subtle 
 because I *do* get what I actually want, but I just am not realizing the 
 cost of it.

 The '+' operator means "add". Addition is typically O(1). But vectors can 
 be added, and that's an O(n) operation. Should opAdd never be used for 
 vectors?

As long as it's addition, I have no problem with an O(n) operation (and it depends on your view of n). Indexing implies speed; look at all other cases where indexing is used. For addition to be proportional to the size of the element is natural and expected.

When people look at '+' they typically think "integer/float addition". Why would, for example, the risk of mistaking an O(n) "big int" addition for an O(1) integer/float addition be any worse than the risk of mistaking an O(n) linked list "get element at index" for an O(1) array "get element at index"?

This is again wrong, for two reasons. I am sorry! One small thing: I think big int "+" is O(log n), not O(n). But the real problem is that people look at a[] = b[] + c[] and see the operands. It is evident from the operands that the cost is proportional to the input size. If it were any shorter it would be a miracle, because it would mean some elements are not even looked at. You are comparing the wrong situations; I mean a different situation. And whether it is an operator or a function is not important. I said that if array access is index(a, n), and everybody always thinks of it that way, then index(a, n) should not do a linear search.
 You can force yourself to think differently, but the reality is that 
 most people think that because of the universal usage of square brackets 
 (except for VB, and I feel pity for anyone who needs to use VB) to mean 
 'lookup by key', and usually this is only useful on objects where the 
 lookup is quick ( < O(n) ).  Although there is no requirement, nor 
 enforcement, the 'quick' contract is expected by the user, no matter how 
 much docs you throw at them.

 Look, for instance, at Tango's now-deprecated LinkMap, which uses a 
 linked-list of key/value pairs (copied from Doug Lea's implementation). 
 Nobody in their right mind would use link map because lookups are O(n), 
 and it's just as easy to use a TreeMap or HashMap.  Would you ever use 
 it?

 If you *really* need that sort of guarantee (and I can imagine it may 
 be useful in some cases), then the implementation of the guarantee does 
 *not* belong in the realm of "implements vs doesn't-implement a 
 particular operator overload". Doing so is an abuse of operator 
 overloading, since operator overloading is there for defining syntactic 
 sugar, not for acting as a makeshift contract.

I don't think anybody is asking for a guarantee from the compiler or any specific tool. I think what we are saying is that violating the 'opIndex is fast' notion is bad design, because you end up with users thinking they are doing something that's quick. You end up with people posting benchmarks on your containers saying 'why does Python beat the pants off your list implementation?'. You can say 'hey, it's not meant to be used that way', but then why can the user use it that way? A better design is to nudge the user into using the correct container for the job by only supporting operations that make sense on the collections.

As far as operator semantics go, D's operators are purposely named after what they are supposed to do. Notice that the operator for + is opAdd, not opPlus. This is because opAdd is supposed to mean you are performing an addition operation. Assigning a different semantic meaning is not disallowed by the compiler, but it is considered bad design. opIndex is supposed to be an index function, not a linear search. It's not called opSearch for a reason. Sure, you can redefine it however you want semantically, but it's considered bad design. That's all we're saying.

Nobody is suggesting using [] to invoke a search (although we have talked about using [] *in the search function's implementation*). Search means you want to get the position of a given element, or in other words, "element" -> search -> "key/index". What we're talking about is the reverse: getting the element at a given position, i.e., "key/index" -> [] -> "element". It doesn't matter if it's an array, a linked list, or a super-duper-collection-from-2092: that's still indexing, not searching.

It is a search: you are searching for the element whose key matches. When the key can be used to look up the element efficiently, we call that indexing. Indexing is a more constrained form of searching.

If you've got a linked list, and you want to get element N, are you *really* going to go reaching for a function named "search"?

Yes. This is exactly what STL does, and there is nothing wrong with it. Again, I think the STL book by Josuttis is very helpful. Also Stepanov's notes online are very interesting! Thank you Don.
 How often do you really 
 see a generic function named "search" or "find" that takes a numeric index 
 as the "to be found" parameter instead of something to be matched against 
 the element's value? I would argue that that would be confusing for most 
 people.

I think you lose this argument. Experience with STL shows that it is not confusing. STL is the most successful library for C++, even though C++ is now old and has problems.
Aug 28 2008
prev sibling parent reply "Steven Schveighoffer" <schveiguy yahoo.com> writes:
"Nick Sabalausky" wrote
 "Steven Schveighoffer" wrote in message 
 news:g9650h$cp9$1 digitalmars.com...
 "Nick Sabalausky" wrote
 "Steven Schveighoffer" wrote in message 
 news:g95b89$1jii$1 digitalmars.com...
 When someone sees an index operator the first thought is that it is a 
 quick lookup.

This seems to be a big part of the disagreement. Personally, I think it's insane to look at [] and just assume it's a cheap lookup. That was true in pure C where, as someone else said, it was nothing more than shorthand for a specific pointer arithmetic operation, but carrying that idea over to more modern languages (especially one that also uses [] for associative arrays) is a big mistake.

Perhaps it is a mistake to assume it, but it is a common mistake. And the expectation is intuitive.

Why in the world would any halfway competent programmer ever look at a *linked list* and assume that the linked list's [] (if implemented) is O(1)?

You are writing this function:

void foo(IOrderedContainer cont)
{
   ....
}

IOrderedContainer implements opIndex(uint). The problem is that you can't tell whether the object itself is a list or not, so you are powerless to make the decision as to whether the container has fast indexing. In that case, your only choice (if speed is an issue) is to not use opIndex.
 As for the risk that this could create of accidentally sending a linked list 
 to a "search" (i.e., a "search for an element which contains data X") that 
 uses [] internally instead of iterators (but then, why wouldn't it just 
 use iterators anyway?): I'll agree that in a case like this there should 
 be some mechanism for automatic choosing of an algorithm, but that 
 mechanism should be at a separate level of abstraction. There would be a 
 function "search" that, through either RTTI or template constraints or 
 something else, says "does collection 'c' implement 
 ConstantTimeForwardDirectionIndexing?" or better yet IMO "does the 
 collection have attribute ForwardDirectionIndexingComplexity that is set 
 equal to Complexity.Constant?", and based on that passes control to either 
 IndexingSearch or IteratorSearch.

To me, this is a bad design. It's my opinion, but one that is shared among many people. You can do stuff this way, but it is not intuitive. I'd much rather reserve opIndex to only quick lookups, and avoid the possibility of accidentally using it incorrectly. In general, I'd say if you are using lists and frequently looking up the nth value in the list, you have chosen the wrong container for the job.
  You don't see people making light switches that look like outlets, even 
 though it's possible.  You might perhaps make a library where opIndex is 
 a linear search in your list, but I would expect that people would not 
 use that indexing feature correctly.  Just as if I plug my lamp into the 
 light switch that looks like an outlet, I'd expect it to get power, and 
 be confused when it doesn't.  Except the opIndex mistake is more subtle 
 because I *do* get what I actually want, but I just am not realizing the 
 cost of it.

 The '+' operator means "add". Addition is typically O(1). But vectors 
 can be added, and that's an O(n) operation. Should opAdd never be used 
 for vectors?

As long as it's addition, I have no problem with an O(n) operation (and it depends on your view of n). Indexing implies speed; look at all other cases where indexing is used. For addition to be proportional to the size of the element is natural and expected.

When people look at '+' they typically think "integer/float addition". Why would, for example, the risk of mistaking an O(n) "big int" addition for an O(1) integer/float addition be any worse than the risk of mistaking an O(n) linked list "get element at index" for an O(1) array "get element at index"?

What good are integers that can't be added? In this case, it is not possible to have quick addition, no matter how you implement your arbitrary-precision integer. I think the time penalty is understood and accepted. With opIndex, the time penalty is not expected. Like it or not, this is how many users look at it.
 You can force yourself to think differently, but the reality is that 
 most people think that because of the universal usage of square 
 brackets (except for VB, and I feel pity for anyone who needs to use 
 VB) to mean 'lookup by key', and usually this is only useful on objects 
 where the lookup is quick ( < O(n) ).  Although there is no 
 requirement, nor enforcement, the 'quick' contract is expected by the 
 user, no matter how much docs you throw at them.

 Look, for instance, at Tango's now-deprecated LinkMap, which uses a 
 linked-list of key/value pairs (copied from Doug Lea's implementation). 
 Nobody in their right mind would use link map because lookups are O(n), 
 and it's just as easy to use a TreeMap or HashMap.  Would you ever use 
 it?

 If you *really* need that sort of guarantee (and I can imagine it may 
 be useful in some cases), then the implementation of the guarantee 
 does *not* belong in the realm of "implements vs doesn't-implement a 
 particular operator overload". Doing so is an abuse of operator 
 overloading, since operator overloading is there for defining 
 syntactic sugar, not for acting as a makeshift contract.

I don't think anybody is asking for a guarantee from the compiler or any specific tool. I think what we are saying is that violating the 'opIndex is fast' notion is bad design, because you end up with users thinking they are doing something that's quick. You end up with people posting benchmarks on your containers saying 'why does Python beat the pants off your list implementation?'. You can say 'hey, it's not meant to be used that way', but then why can the user use it that way? A better design is to nudge the user into using the correct container for the job by only supporting operations that make sense on the collections.

As far as operator semantics go, D's operators are purposely named after what they are supposed to do. Notice that the operator for + is opAdd, not opPlus. This is because opAdd is supposed to mean you are performing an addition operation. Assigning a different semantic meaning is not disallowed by the compiler, but it is considered bad design. opIndex is supposed to be an index function, not a linear search. It's not called opSearch for a reason. Sure, you can redefine it however you want semantically, but it's considered bad design. That's all we're saying.

Nobody is suggesting using [] to invoke a search (although we have talked about using [] *in the search function's implementation*). Search means you want to get the position of a given element, or in other words, "element" -> search -> "key/index". What we're talking about is the reverse: getting the element at a given position, i.e., "key/index" -> [] -> "element". It doesn't matter if it's an array, a linked list, or a super-duper-collection-from-2092: that's still indexing, not searching.

It is a search: you are searching for the element whose key matches. When the key can be used to look up the element efficiently, we call that indexing. Indexing is a more constrained form of searching.

If you've got a linked list, and you want to get element N, are you *really* going to go reaching for a function named "search"? How often do you really see a generic function named "search" or "find" that takes a numeric index as the "to be found" parameter instead of something to be matched against the element's value? I would argue that that would be confusing for most people. Like I said in a different post farther down, the implementation of a "getAtIndex()" is obviously going to work like a search, but from "outside the box", what you're asking for is not the same.

If you are indexing into a tree, it is considered a binary search; if you are indexing into a hash, it is a search at some point, to deal with collisions. People don't think about indexing as being a search, but in reality it is. A really fast search. And I don't think search would be the name of the member function; it should be something like 'getNth', which returns a cursor that points to the element.

-Steve
Aug 28 2008
parent reply "Nick Sabalausky" <a a.a> writes:
"Steven Schveighoffer" <schveiguy yahoo.com> wrote in message 
news:g96vmq$3cc$1 digitalmars.com...
 "Nick Sabalausky" wrote
 "Steven Schveighoffer" wrote in message 
 news:g9650h$cp9$1 digitalmars.com...
 Perhaps it is a mistake to assume it, but it is a common mistake. And 
 the expectation is intuitive.

Why in the world would any halfway competent programmer ever look at a *linked list* and assume that the linked list's [] (if implemented) is O(1)?

You are writing this function:

void foo(IOrderedContainer cont)
{
   ....
}

IOrderedContainer implements opIndex(uint). The problem is that you can't tell whether the object itself is a list or not, so you are powerless to make the decision as to whether the container has fast indexing. In that case, your only choice (if speed is an issue) is to not use opIndex.

Ok, so you want foo() to be able to tell if the collection has fast or slow indexing. What are you suggesting that foo() does when the collection does have slow indexing?

1. Should it fail to compile because foo's implementation uses [] and the slow-indexing collection doesn't implement []? Well then how does foo know that it's the most important, most frequent thing being done on the collection? Suppose foo is something that needs to access elements in a somewhat random order, i.e., the kind of thing that lists are poorly suited for. Further suppose that collection C is some set of data that *usually* just gets insertions and deletions at nodes that the code already has direct references to. Further suppose that foo does *need* to be run on the collection, *but* very infrequently. So, should I *really* be forced to make C a collection that trades good insertion/deletion complexity for good indexing complexity, just because the occasionally-run foo() doesn't like it? And what if I want to run benchmarks to test what collection works best, in real-world use, for C? Should foo's intolerance of slow-indexing collections really be able to force me to exclude testing of such collections?

2. Should foo revert to an alternate branch of code that doesn't use []? This behavior can be implemented via interfaces like I described. The benefit of that is that [] can still serve as the shorthand it's intended for (see below) and you never need to introduce the inconsistency of "Gee, how do I get the Nth element of a collection?" "Well, on some collections it's getNth(), and on other collections it's []."

3. Something else?
 As for the risk that this could create of accidentally sending a linked list 
 to a "search" (i.e., a "search for an element which contains data X") that 
 uses [] internally instead of iterators (but then, why wouldn't it just 
 use iterators anyway?): I'll agree that in a case like this there should 
 be some mechanism for automatic choosing of an algorithm, but that 
 mechanism should be at a separate level of abstraction. There would be a 
 function "search" that, through either RTTI or template constraints or 
 something else, says "does collection 'c' implement 
 ConstantTimeForwardDirectionIndexing?" or better yet IMO "does the 
 collection have attribute ForwardDirectionIndexingComplexity that is 
 equal to Complexity.Constant?", and based on that passes control to 
 either IndexingSearch or IteratorSearch.

To me, this is a bad design. It's my opinion, but one that is shared among many people. You can do stuff this way, but it is not intuitive. I'd much rather reserve opIndex to only quick lookups, and avoid the possibility of accidentally using it incorrectly.

Preventing a collection from ever being used in a function that would typically perform poorly on that collection just smacks of premature optimization. How do you, as the collection author, know that the collection will never be used in a way such that *occasional* use in a certain specific sub-optimal manner might actually be necessary and/or acceptable? If you omit [] then you've burnt the bridge (so to speak) and your only recourse is to add a standardized "getNth()" to every single collection, which clutters the interface, hinders integration with third-party collections and algorithms, and is likely to still suffer from idiots who think that "get Nth element" is always better than O(n) (see below).
 In general, I'd say if you are using lists and frequently looking up the 
 nth value in the list, you have chosen the wrong container for the job.

If you're frequently looking up random elements in a list, then yes, you're probably using the wrong container. But that's beside the point. Even if you only do it once: If you have a collection with a natural order, and you want to get the nth element, you should be able to use the standard "get element at index X" notation, []. I don't care how many people go around using [] and thinking they're guaranteed to get a cheap computation from it. In a language that supports overloading of [], the [] means "get the element at key/index X". Especially in a language like D where using [] on an associative array can trigger an unbounded allocation and GC run. Using [] in D (and various other languages) can be expensive, period, even in the standard lib (assoc array). So looking at a [] and thinking "guaranteed cheap", is incorrect, period. If most people think 2+2=5, you're not going to redesign arithmetic to work around that mistaken assumption.
 When people look at '+' they typically think "integer/float addition". 
 Why would, for example, the risk of mistaking an O(n) "big int" addition 
 for an O(1) integer/float addition be any worse than the risk of 
 mistaking an O(n) linked list "get element at index" for an O(1) array 
 "get element at index"?

What good are integers that can't be added? In this case, it is not possible to have quick addition, no matter how you implement your arbitrary-precision integer. I think the time penalty is understood and accepted. With opIndex, the time penalty is not expected. Like it or not, this is how many users look at it.
 If you've got a linked list, and you want to get element N, are you 
 *really* going to go reaching for a function named "search"? How often do 
 you really see a generic function named "search" or "find" that takes a 
 numeric index as the "to be found" parameter instead of something to be 
 matched against the element's value? I would argue that that would be 
 confusing for most people. Like I said in a different post farther down, 
 the implementation of a "getAtIndex()" is obviously going to work like a 
 search, but from "outside the box", what you're asking for is not the 
 same.

If you are indexing into a tree, it is considered a binary search; if you are indexing into a hash, it is a search at some point, to deal with collisions. People don't think about indexing as being a search, but in reality it is. A really fast search.

It's implemented as a search, but I'd argue that the input/output specifications are different. And yes, I suppose that does put it into a bit of a grey area. But I wouldn't go so far as to say that, to the caller, it's the same thing, because there are differences. If you want to get an element based on its position in the collection, you call one function. If you want to get an element based on its content instead of its position, that's another function. If you want to get the position of an element based on its content or its identity, that's one or two more functions (depending, of course, on whether the element is a value type or a reference type, respectively).
 And I don't think search would be the name of the member function, it 
 should be something like 'getNth', which returns a cursor that points to 
 the element.

Right, and outside of pure C, [] is the shorthand and the standardized name for "getNth". If someone automatically assumes [] to be a simple lookup, chances are they're going to make the same assumption about anything named along the lines of "getNth". After all, that's what [] does: it gets the Nth.
Aug 28 2008
parent reply "Steven Schveighoffer" <schveiguy yahoo.com> writes:
"Nick Sabalausky" wrote
 Ok, so you want foo() to be able to tell if the collection has fast or 
 slow indexing. What are you suggesting that foo() does when the collection 
 does have slow indexing?

No, I don't want to be able to tell. I don't want to HAVE to be able to tell. In my ideal world, the collection does not implement opIndex unless it is fast, so there is no issue. I.e., you cannot call foo with a linked list.

I'm really tired of this argument: you understand my point of view, and I understand yours. To you, the syntax sugar is more important than the complexity guarantees. To me, what the syntax intuitively means should be what it does. So I'll develop my collections library and you develop yours, fair enough? I don't think either of us is right or wrong in the strict sense of the terms.

To be fair, I'll answer your other points as you took the time to write them. And then I'm done. I can't really be any clearer as to what I believe is the best design.
 1. Should it fail to compile because foo's implementation uses [] and the 
 slow-indexing collection doesn't implement []?

No, foo will always compile because opIndex should always be fast, and then I can specify the complexity of foo without worry. Using an O(n) lookup operation should be more painful because it requires more time. It makes users use it less.
 2. Should foo revert to an alternate branch of code that doesn't use []?

 This behavior can be implemented via interfaces like I described. The 
 benefit of that is that [] can still serve as the shorthand it's intended 
 for (see below) and you never need to introduce the inconsistency of "Gee, 
 how do I get the Nth element of a collection?" "Well, on some collections 
 it's getNth(), and on other collections it's []."

I believe that you shouldn't really ever be calling getNth on a linked list, and if you are, it should be a red flag, like a cast. Furthermore, [] isn't always equivalent to getNth, see below.
 As for the risk that this could create of accidentally sending a linked list 
 to a "search" (i.e., a "search for an element which contains data X") that 
 uses [] internally instead of iterators (but then, why wouldn't it just 
 use iterators anyway?): I'll agree that in a case like this there should 
 be some mechanism for automatic choosing of an algorithm, but that 
 mechanism should be at a separate level of abstraction. There would be a 
 function "search" that, through either RTTI or template constraints or 
 something else, says "does collection 'c' implement 
 ConstantTimeForwardDirectionIndexing?" or better yet IMO "does the 
 collection have attribute ForwardDirectionIndexingComplexity that is 
 set equal to Complexity.Constant?", and based on that passes control to 
 either IndexingSearch or IteratorSearch.

To me, this is a bad design. It's my opinion, but one that is shared among many people. You can do stuff this way, but it is not intuitive. I'd much rather reserve opIndex to only quick lookups, and avoid the possibility of accidentally using it incorrectly.

Preventing a collection from ever being used in a function that would typically perform poorly on that collection just smacks of premature optimization. How do you, as the collection author, know that the collection will never be used in a way such that *occasional* use in a certain specific sub-optimal manner might actually be necessary and/or acceptable?

It's not premature optimization; it's not offering a feature that has little or no use. It's like any contract for any object: you only want to define the interface for which your object is designed. A linked list should not have an opIndex because it's not designed to be indexed. If I designed a new car with which you could steer each front wheel independently, would that make you buy it? It's another feature that the car has that other cars don't. Who cares if it's useful, it's another *feature*! Sometimes a good design is not that a feature is included but that a feature is *not* included.
 If you omit [] then you've burnt the bridge (so to speak) and your only 
 recourse is to add a standardized "getNth()" to every single collection 
 which clutters the interface, hinders integration with third-party 
 collections and algorithms, and is likely to still suffer from idiots who 
 think that "get Nth element" is always better than O(n) (see below).

I'd reserve getNth for linked lists only, if I implemented it at all. It is a useless feature. The only common feature for all containers should be iteration, because 'iterate next element' is always an O(1) operation (amortized in the case of trees).
 In general, I'd say if you are using lists and frequently looking up the 
 nth value in the list, you have chosen the wrong container for the job.

If you're frequently looking up random elements in a list, then yes, you're probably using the wrong container. But that's beside the point. Even if you only do it once: If you have a collection with a natural order, and you want to get the nth element, you should be able to use the standard "get element at index X" notation, [].

I respectfully disagree. For the reasons I've stated above.
 I don't care how many people go around using [] and thinking they're 
 guaranteed to get a cheap computation from it. In a language that supports 
 overloading of [], the [] means "get the element at key/index X". 
 Especially in a language like D where using [] on an associative array can 
 trigger an unbounded allocation and GC run. Using [] in D (and various 
 other languages) can be expensive, period, even in the standard lib (assoc 
 array). So looking at a [] and thinking "guaranteed cheap", is incorrect, 
 period. If most people think 2+2=5, you're not going to redesign 
 arithmetic to work around that mistaken assumption.

Your assumption is that 'get the Nth element' is the only expectation for the opIndex interface. My assumption is that 'get an element efficiently' is an important part of what opIndex implies. We obviously disagree, and as I said above, neither of us is right or wrong, strictly speaking. It's a matter of what is intuitive to you.

Part of the problem I see with many bad designs is that the author thinks they see a fit for an interface, but it's not quite there. They are so excited about fitting into an interface that they forget the importance of leaving out elements of the interface that don't make sense. To me this is one of them. An interface is a fit IMO if it fits exactly. If you have to do things like implement functions that throw exceptions because they don't belong, or break the contract that the interface specifies, then either the interface is too specific, or you are not implementing the correct interface.
 If you've got a linked list, and you want to get element N, are you 
 *really* going to go reaching for a function named "search"? How often 
 do you really see a generic function named "search" or "find" that takes 
 a numeric index as a the "to be found" parameter instead of something to 
 be matched against the element's value? I would argue that that would be 
 confusing for most people. Like I said in a different post farther down, 
 the implementation of a "getAtIndex()" is obviously going to work like a 
 search, but from "outside the box", what you're asking for is not the 
 same.

If you are indexing into a tree, it is considered a binary search; if you are indexing into a hash, it is still a search at some point, to deal with collisions. People don't think about indexing as being a search, but in reality it is. A really fast search.

It's implemented as a search, but I'd argue that the input/output specifications are different. And yes, I suppose that does put it into a bit of a grey area. But I wouldn't go so far as to say that, to the caller, it's the same thing, because there are differences. If you want to get an element based on its position in the collection, you call one function. If you want to get an element based on its content instead of its position, that's another function. If you want to get the position of an element based on its content or its identity, that's one or two more functions (depending, of course, on whether the element is a value type or a reference type, respectively).

I disagree. I view the numeric index of an ordered container as a 'key' into the container. A keyed container has the ability to look up elements quickly with the key. Take a quick look at dcollections' ArrayList. It implements the Keyed interface, with uint as the key. I have no key for LinkList, because I don't see a useful key.
 And I don't think search would be the name of the member function, it 
 should be something like 'getNth', which returns a cursor that points to 
 the element.

Right, and outside of pure C, [] is the shorthand for and the standardized name for "getNth". If someone automatically assumes [] to be a simple lookup, chances are they're going to make the same assumption about anything named along the lines of "getNth". After all, that's what [] does, it gets the Nth.

I view [] as "getByIndex", index being a value that offers quick access to elements. There is no implied 'get the nth element'. Look at an associative array. If I had a string[string] array, what would you expect to get if you passed an integer as the index? So good luck with your linked-list-should-be-indexed battle. I shall not be posting on this again. -Steve
Aug 28 2008
parent "Nick Sabalausky" <a a.a> writes:
 "Steven Schveighoffer" <schveiguy yahoo.com> wrote in message 
news:g983f5$2ns2$1 digitalmars.com...
 "Nick Sabalausky" wrote
 Ok, so you want foo() to be able to tell if the collection has fast or 
 slow indexing. What are you suggesting that foo() does when the 
 collection does have slow indexing?

No, I don't want to be able to tell. I don't want to HAVE to be able to tell.

You're missing the point. Since, as you say below, you want foo to not be callable with the collection since it doesn't implement opIndex, your answer is clearly "#1, The program should fail to compile because foo's implementation uses [] and the slow-indexing collection doesn't implement []".
 In my ideal world, the collection does not implement opIndex unless it is 
 fast, so there is no issue.  i.e. you cannot call foo with a linked list.

 I'm really tired of this argument, you understand my point of view, I 
 understand yours.

..(line split for clarity)..
 To you, the syntax sugar is more important than the complexity guarantees.

Not at all. And to that effect, I've already presented a way that we can have both syntactic sugar and, when desired, complexity guarantees. In fact, the method I presented actually provides more protection against poor complexity than your method (Since the guarantee doesn't break when faced with code from people with my viewpoint on [], which as you admit below is neither more right nor more wrong than your viewpoint on []). Just because I don't agree with your method of implementing complexity guarantees, doesn't mean I don't think they can be valuable.
 To me, what the syntax intuitively means should be what it does.

I absolutely agree that "What the syntax intuitively means should be what it does". Where we disagree is on "what the [] syntax intuitively means".
 So I'll develop my collections library and you develop yours, fair enough? 
 I don't think either of us is right or wrong in the strict sense of the 
 terms.

 To be fair, I'll answer your other points as you took the time to write 
 them.  And then I'm done.  I can't really be any clearer as to what I 
 believe is the best design.

 1. Should it fail to compile because foo's implementation uses [] and the 
 slow-indexing collection doesn't implement []?

No, foo will always compile, because opIndex should always be fast, and then I can specify the complexity of foo without worry. Using an O(n) lookup operation should be more painful to reach for, because it costs more time; that makes users use it less.
 2. Should foo revert to an alternate branch of code that doesn't use []?

 This behavior can be implemented via interfaces like I described. The 
 benefit of that is that [] can still serve as the shorthand it's intended 
 for (see below) and you never need to introduce the inconsistency of 
 "Gee, how do I get the Nth element of a collection?" "Well, on some 
 collections it's getNth(), and on other collections it's []."

I believe that you shouldn't really ever be calling getNth on a linked list, and if you are, it should be a red flag, like a cast. Furthermore [] isn't always equivalent to getNth, see below.

Addressed below...
 As for the risk that could create of accidentally sending a linked 
 list to a "search" (ie, a "search for an element which contains data 
 X") that uses [] internally instead of iterators (but then, why 
 wouldn't it just use iterators anyway?): I'll agree that in a case like 
 this there should be some mechanism for automatic choosing of an 
 algorithm, but that mechanism should be at a separate level of 
 abstraction. There would be a function "search" that, through either 
 RTTI or template constraints or something else, says "does collection 
 'c' implement ConstantTimeForwardDirectionIndexing?" or better yet IMO 
 "does the collection have attribute ForwardDirectionIndexingComplexity 
 that is set equal to Complexity.Constant?", and based on that passes 
 control to either IndexingSearch or IteratorSearch.

To me, this is a bad design. It's my opinion, but one that is shared among many people. You can do stuff this way, but it is not intuitive. I'd much rather reserve opIndex to only quick lookups, and avoid the possibility of accidentally using it incorrectly.

Preventing a collection from ever being used in a function that would typically perform poorly on that collection just smacks of premature optimization. How do you, as the collection author, know that the collection will never be used in a way such that *occasional* use in a certain specific sub-optimal manner might actually be necessary and/or acceptable?

It's not premature optimization; it's declining to offer a feature that has little or no use. It's like any contract for any object: you only want to define the interface for which your object is designed. A linked list should not have an opIndex because it's not designed to be indexed.

Addressed below...
 If I designed a new car with which you could steer each front wheel 
 independently, would that make you buy it?  It's another feature that the 
 car has that other cars don't.  Who cares if it's useful, its another 
 *feature*!  Sometimes a good design is not that a feature is included but 
 that a feature is *not* included.

So, in other words, it sounds like you're saying that in my scenario above, you think that a linked list should not be usable, even if it is faster in the greater context (Without actually saying so directly). Or do you claim that the scenario can never happen?
 If you omit [] then you've burnt the bridge (so to speak) and your only 
 recourse is to add a standardized "getNth()" to every single collection 
 which clutters the interface, hinders integration with third-party 
 collections and algorithms, and is likely to still suffer from idiots who 
 think that "get Nth element" is always better than O(n) (see below).

I'd reserve getNth for linked lists only, if I implemented it at all. It is a useless feature. The only common feature for all containers should be iteration, because 'iterate next element' is always an O(1) operation (amortized in the case of trees).
 In general, I'd say if you are using lists and frequently looking up the 
 nth value in the list, you have chosen the wrong container for the job.

If you're frequently looking up random elements in a list, then yes, you're probably using the wrong container. But that's beside the point. Even if you only do it once: If you have a collection with a natural order, and you want to get the nth element, you should be able to use the standard "get element at index X" notation, [].

I respectfully disagree. For the reasons I've stated above.
 I don't care how many people go around using [] and thinking they're 
 guaranteed to get a cheap computation from it. In a language that 
 supports overloading of [], the [] means "get the element at key/index 
 X". Especially in a language like D where using [] on an associative 
 array can trigger an unbounded allocation and GC run. Using [] in D (and 
 various other languages) can be expensive, period, even in the standard 
 lib (assoc array). So looking at a [] and thinking "guaranteed cheap", is 
 incorrect, period. If most people think 2+2=5, you're not going to 
 redesign arithmetic to work around that mistaken assumption.

Your assumption is that 'get the Nth element' is the only expectation for opIndex interface. My assumption is that opIndex implies 'get an element efficiently' is an important part of the interface. We obviously disagree, and as I said above, neither of us is right or wrong, strictly speaking. It's a matter of what is intuitive to you. Part of the problems I see with many bad designs is the author thinks they see a fit for an interface, but it's not quite there. They are so excited about fitting into an interface that they forget the importance of leaving out elements of the interface that don't make sense. To me this is one of them. An interface is a fit IMO if it fits exactly. If you have to do things like implement functions that throw exceptions because they don't belong, or break the contract that the interface specifies, then either the interface is too specific, or you are not implementing the correct interface.

(From the above "Addressed below..."'s) I fully agree that leaving the wrong things out of an interface is just as important as putting the right things in. But I don't think that's applicable here. An array can do anything a linked list can do (even insert). A linked list can do anything an array can do (even sort). They are both capable of the same exact set of basic operations: insert, delete, get at position, get position of, append, iterate, etc. The only thing that ever differs is how well each type of collection scales on each of those basic operations. The *whole point* of having both arrays and linked lists is that they provide different performance tradeoffs, not that they "implement different interfaces", because obviously they're all capable of doing the same things. It's the performance tradeoffs that are the whole point of "array vs linked list".

But it's rarely as simple as just looking at the basic operations individually... It's rare that a collection would ever be used for just one basic operation. What's the point of sorting a collection if you're never going to insert anything into it? What's the point of inserting data if you're never going to retrieve any? In most cases, you're going to be doing multiple types of operations on the collection, so the choice of collection becomes "Which set of tradeoffs is the most worthwhile for my overall usage patterns?" You can speculate and analyze all you want about the usage patterns and the appropriate tradeoffs, and that's good; you certainly should. But it ultimately comes down to the real-world tests: profiling. And if you're profiling, you're going to want to compare the performance of different types of collections.

And if you're going to do that, why should you prevent yourself from making it a one-line change ("Vector myBunchOfStuff" <-> "List myBunchOfStuff"), just because the fear of someone using an array for an insert-intensive purpose, or a list for a random-access-intensive purpose, drove you to design your code in a way that forces a single change of type to (in many cases) be an all-out refactoring? And it'll be the type of refactoring that no automatic refactoring tool is going to do for you. And suppose you do successfully find that optimal container, through your method or mine. Then a program feature/requirement is changed/added/removed, and all of a sudden, the usage patterns have changed! Now you get to do it all again! Major refactor then profile, or change a line then profile? You're looking at guaranteeing the performance of very narrow slices of a program. I'll agree that can be useful in some cases (hence, my proposal for how to implement performance guarantees). But in many cases, that's effectively a "taken out of context" fallacy and can lead to trouble.
 If you've got a linked list, and you want to get element N, are you 
 *really* going to go reaching for a function named "search"? How often 
 do you really see a generic function named "search" or "find" that 
 takes a numeric index as the "to be found" parameter instead of 
 something to be matched against the element's value? I would argue that 
 that would be confusing for most people. Like I said in a different 
 post farther down, the implementation of a "getAtIndex()" is obviously 
 going to work like a search, but from "outside the box", what you're 
 asking for is not the same.

If you are indexing into a tree, it is considered a binary search; if you are indexing into a hash, it is still a search at some point, to deal with collisions. People don't think about indexing as being a search, but in reality it is. A really fast search.

It's implemented as a search, but I'd argue that the input/output specifications are different. And yes, I suppose that does put it into a bit of a grey area. But I wouldn't go so far as to say that, to the caller, it's the same thing, because there are differences. If you want to get an element based on its position in the collection, you call one function. If you want to get an element based on its content instead of its position, that's another function. If you want to get the position of an element based on its content or its identity, that's one or two more functions (depending, of course, on whether the element is a value type or a reference type, respectively).

I disagree. I view the numeric index of an ordered container as a 'key' into the container. A keyed container has the ability to look up elements quickly with the key. Take a quick look at dcollections' ArrayList. It implements the Keyed interface, with uint as the key. I have no key for LinkList, because I don't see a useful key.
 And I don't think search would be the name of the member function, it 
 should be something like 'getNth', which returns a cursor that points to 
 the element.

Right, and outside of pure C, [] is the shorthand for and the standardized name for "getNth". If someone automatically assumes [] to be a simple lookup, chances are they're going to make the same assumption about anything named along the lines of "getNth". After all, that's what [] does, it gets the Nth.

I view [] as "getByIndex", index being a value that offers quick access to elements. There is no implied 'get the nth element'. Look at an associative array. If I had a string[string] array, what would you expect to get if you passed an integer as the index?

You misunderstand. I'm well aware of the sequentially-indexed array vs associative array issues. I was just using "sequentially-indexed array" terminology to avoid cluttering the explanations with more general terms that would have distracted from bigger points. By "getNth", what I was getting at was "getByPosition". Maybe I should have been saying "getByPosition" from the start, my mistake. As you can see, I still consider the key of an associative array to be its position. I'll explain why:

An associative array is the dynamic/runtime equivalent of a static/compile-time named variable (after all, in many dynamic languages, like PHP (not that I like PHP), named variables literally are keys into an implicit associative array). In a typical static or dynamic language, all variables are essentially made up of two parts: the raw data and a label. The label, obviously, is what's used to refer to the data. The label can be one of two things: an identifier, or (in a non-sandboxed language) a dereferenced memory address. So, borrowing the usual pointer metaphor of "memory as a series of labeled boxes", we can have the data "7" in the 0xA04D6'th "box", which is also labeled with the identifier "myInt". The memory address, obviously, is the position of the data. The identifier is another way to refer to the same position. "CPU: Where should I put this 7?" "High-level Code: In the location labeled with the identifier myInt". The data of a variable corresponds to an element of any collection (array, assoc array, list). The memory addresses not only correspond to, but literally are, sequential indices into the array of addressable memory (ie, the key/position in a sequentially-indexed array). The identifier corresponds to the key of an associative array or other such collection. "CPU: Where, within the assoc array, should I put this 7?" "High-level Code: In the assoc array's box/element labeled myInt". (With a linked list, of course, there's nothing that corresponds to the key of an assoc array, but it does have a natural sequential order.)

Maybe I can explain the distinction I see a little bit better with our terminology hopefully now in closer sync: For any collection, each element has a concept of position (index/key/nth/whatever) and a concept of data. A collection is a series of "boxes". On the outside of each box is a label (position/index/key/nth/whatever). On the inside of each box is data. If the collection's base type is a reference type, then this "inside data" is, of course, a pointer/reference to more data somewhere else. There are two basic conceptual operations: "outside label -> inside data", and "inside data -> outside label". The "inside data -> outside label" is always a search (although if the inside data contains a cached copy of its outside label, then that's somewhat of a grey area. Personally, I would count it as a "cached search": usable just like a search, but faster). The "outside label -> inside data" is, of course, our disputed "getAtPosition". In a linked list, it's a grey area similar to what I called a "cached search" above. It's usable like an ordinary "getAtPosition", but slower. Sure, the implementation is done via a search algorithm, but if you call it a search, that means that for a linked list, "getAtPosition" and search are the same thing (for whatever that implies; I don't have time to go any further on that ATM, so take it as you will).

I do understand, though, that you're defining "index" and "search" essentially as "fast" and "slow" versions (respectively) of "X" -> "Y", regardless of which of X or Y is "outside label" and which is "inside data". Personally, I find that awkward and somewhat less useful, since that means "index" and "search" each have multiple "input vs. output" behaviors (ie, there's still the question of "Am I giving the outside position and getting the inside data, or vice versa?").
Aug 29 2008
prev sibling parent Dee Girl <deegirl noreply.com> writes:
Nick Sabalausky Wrote:

 "Steven Schveighoffer" <schveiguy yahoo.com> wrote in message 
 news:g95b89$1jii$1 digitalmars.com...
 When someone sees an index operator the first thought is that it is a 
 quick lookup.

This seems to be a big part of the disagreement. Personally, I think it's insane to look at [] and just assume it's a cheap lookup. That was true in pure C where, as someone else said, it was nothing more than a shorthand for a specific pointer arithmetic operation, but carrying that idea over to more modern languages (especially one that also uses [] for associative arrays) is a big mistake. The '+' operator means "add". Addition is typically O(1). But vectors can be added, and that's an O(n) operation. Should opAdd never be used for vectors?

This is a big, big mistake! I think I do not know how to explain it. I could not so far because we talk from one mistake to another. If you add vectors a[] = b[] + c[] then that is one operation that is repeated for each element of the vector. The complexity is linear in the cost of one operation. It would be a mistake if each + took O(a.length). Otherwise it is simply, as expected, proportional to the length of the input.
 You can force yourself to think differently, but the reality is that most 
 people think that because of the universal usage of square brackets 
 (except for VB, and I feel pity for anyone who needs to use VB) to mean 
 'lookup by key', and usually this is only useful on objects where the 
 lookup is quick ( < O(n) ).  Although there is no requirement, nor 
 enforcement, the 'quick' contract is expected by the user, no matter how 
 much docs you throw at them.

 Look, for instance, at Tango's now-deprecated LinkMap, which uses a 
 linked-list of key/value pairs (copied from Doug Lea's implementation). 
 Nobody in their right mind would use link map because lookups are O(n), 
 and it's just as easy to use a TreeMap or HashMap.  Would you ever use it?

 If you *really* need that sort of guarantee (and I can imagine it may be 
 useful in some cases), then the implementation of the guarantee does 
 *not* belong in the realm of "implements vs doesn't-implement a 
 particular operator overload". Doing so is an abuse of operator 
 overloading, since operator overloading is there for defining syntactic 
 sugar, not for acting as a makeshift contract.

I don't think anybody is asking for a guarantee from the compiler or any specific tool. I think what we are saying is that violating the 'opIndex is fast' notion is bad design, because you end up with users thinking they are doing something that's quick. You end up with people posting benchmarks on your containers saying 'why does python beat the pants off your list implementation?'. You can say 'hey, it's not meant to be used that way', but then why can the user use it that way? A better design is to nudge the user into using a correct container for the job by only supporting operations that make sense on the collections.

And as far as operator semantic meaning, D's operators are purposely named after what they are supposed to do. Notice that the operator for + is opAdd, not opPlus. This is because opAdd is supposed to mean you are performing an addition operation. Assigning a different semantic meaning is not disallowed by the compiler, but is considered bad design. opIndex is supposed to be an index function, not a linear search. It's not called opSearch for a reason. Sure you can redefine it however you want semantically, but it's considered bad design. That's all we're saying.

Nobody is suggesting using [] to invoke a search (Although we have talked about using [] *in the search function's implementation*). Search means you want to get the position of a given element, or in other words, "element" -> search -> "key/index". What we're talking about is the reverse: getting the element at a given position, ie, "key/index" -> [] -> "element". It doesn't matter if it's an array, a linked list, or a super-duper-collection-from-2092: that's still indexing, not searching.

I think this is also a big mistake. I am so sorry! I can not explain it. But it looks like you think if you call it something different it behaves differently. Sorry, Dee Girl
Aug 28 2008
prev sibling parent Fawzi Mohamed <fmohamed mac.com> writes:
On 2008-08-28 06:06:11 +0200, "Nick Sabalausky" <a a.a> said:

 "Fawzi Mohamed" <fmohamed mac.com> wrote in message
 news:g94k2b$2a1e$1 digitalmars.com...
 
 I am with dan dee_girl & co on this issue, the problem is that a generic
 algorithm "knows" the types he is working on and can easily check the
 operations they have, and based on this decide the strategy to use. This
 choice works well if the presence of a given operation is also connected
 with some performance guarantee.
 

IMO, a better way to do that would be via C#-style attributes or equivalent named interfaces. I'm not sure if this is what you're referring to below or not.

Yes, categories are basically named interfaces for types. Unlike a constraint (which checks whether something is implemented), one has to explicitly say "T implements Interface" (and obviously T has to have all the requested functions/methods). This should be available for all types, and you should be able to request also the existence of free functions, not only of methods. Attributes are a simplified version of this (basically no checks for a given interface). The important thing is that the presence or absence of attributes on a given type is not automatically inferred from the presence of given functions.
 Concepts (or better categories (aldor concept not C++), that are
 interfaces for types, but interfaces that have to be explicitly assigned
 to a type) might relax this situation a little, but the need for some
 guarantees will remain.
 

If this "guarantee" (or mechanism for checking the types of operations that a collection supports) takes the form of a style guideline that says "don't implement opIndex for a collection if it would be O(n) or worse", then that, frankly, is absolutely no guarantee at all.

Well, if it is in the spec and everybody knows it, then breaking it and getting bad behavior is your own fault.
 If you *really* need that sort of guarantee (and I can imagine it may be
 useful in some cases), then the implementation of the guarantee does *not*
 belong in the realm of "implements vs doesn't-implement a particular
 operator overload". Doing so is an abuse of operator overloading, since
 operator overloading is there for defining syntactic sugar, not for acting
 as a makeshift contract.
 
 The correct mechanism for such guarantees is with named interfaces or
 C#-style attributes, as I mentioned above. True, that can still be abused if
 the collection author wants to, but they have to actually try (ie, they have
 to lie and say "implements IndexingInConstantTime" in addition to
 implementing opIndex). If you instead try to implement that guarantee with
 the "don't implement opIndex for a collection if it would be O(n) or worse"
 style-guideline, then it's far too easy for a collection to come along that
 is ignorant of that "pseudo-contract" and accidentally breaks it. Proper
 use of interfaces/attributes instead of relying on the existence or absence
 of an overloaded operator fixes that problem.

I fully agree that with interfaces (or categories or attributes) the correct thing is to use them to enforce extra constraints, so that overloading (or naming the functions) is really just syntactic sugar. But also in this case, it can make reading code (and writing reasonably fast code from the beginning), and understanding the complexity (speed) of code you are reading, easier if some social contract about the speed of operations is respected.

Please note that these "guarantees" are not such that one cannot break them; their purpose is to make the life of those who know and enforce them easier, and also that of the whole community if it chooses to adopt them. As Steve just argued much more fully than me, the STL does it, and I think that D should do it too. Should D get categories or attributes, these things could be relaxed a little, but I think there will still be cases where expecting a given function to have a given complexity will be a good thing. It just makes thinking about the code easier, and simpler to stay at a high level without surprises that you have to find out by looking at the code in detail, and so it makes a programmer more productive. Fawzi
Aug 28 2008
prev sibling next sibling parent reply Derek Parnell <derek psych.ward> writes:
On Wed, 27 Aug 2008 16:33:24 -0400, Nick Sabalausky wrote:

 A generic algorithm has absolutely no business caring about the complexity of 
 the collection it's operating on.

I also believe this to be true. -- Derek Parnell Melbourne, Australia skype: derek.j.parnell
Aug 27 2008
parent Dee Girl <deegirl noreply.com> writes:
Derek Parnell Wrote:

 On Wed, 27 Aug 2008 16:33:24 -0400, Nick Sabalausky wrote:
 
 A generic algorithm has absolutely no business caring about the complexity of 
 the collection it's operating on.

I also believe this to be true.

It is true. But only if you take it out of context. An algorithm does not need to know the complexity of the collection. But it must have a minimal guarantee of what iterator the collection has. Is it forward only, bidirectional, or random access? This is a small interface. And very easy to implement. Inside, the iterator can do what it needs to access the collection. The algorithm must not know it! Only ++, *, [] and comparison. That is why STL algorithms are so general. Because they work with a very small (narrow) interface. Thank you, Dee Girl
Aug 27 2008
prev sibling parent reply Dee Girl <deegirl noreply.com> writes:
Nick Sabalausky Wrote:

 "Dee Girl" <deegirl noreply.com> wrote in message 
 news:g943oi$11f4$1 digitalmars.com...
 Benji Smith Wrote:

 Dee Girl wrote:
 Michiel Helvensteijn Wrote:
 That's simple. a[i] looks much nicer than a.nth(i).

It is not nicer. It is more deceiving. If you look at the code it looks like array code. foreach (i; 0 .. a.length) { a[i] += 1; } For an array this works nicely. But for a list it is terrible! Many operations just to increment a small list.

Well, that's what you get with operator overloading.

I am sorry. I disagree. I think that is what you get with bad design.
 The same thing could be said for "+" or "-". They're inherently
 deceiving, because they look like builtin operations on primitive data
 types.

 For expensive operations (like performing division on an
 unlimited-precision decimal object), should the author of the code use
 "opDiv" or should he implement a separate "divide" function?

The cost of + and - is proportional to the digits in the number. For a small number of digits the computer does it fast in hardware. For many digits the cost grows. The number of digits is log n. I think + and - are fine for big integers. I am not surprised.
 Forget opIndex for a moment, and ask the more general question about all
 overloaded operators. Should they imply any sort of asymptotic
 complexity guarantee?

I think it depends on good design. For example, think of ++ or -- for an iterator. If it is O(n) it is bad design. Bad design makes people say, like you, "This is what you get with operator overloading".
 Personally, I don't think so.

 I don't like "nth".

 I'd rather use the opIndex. And if I'm using a linked list, I'll be
 aware of the fact that it'll exhibit linear-time indexing, and I'll be
 cautious about which algorithms to use.

But inside the algorithm you do not know if you use a linked list or a vector. You lost that information in a bad abstraction. Also the abstraction is bad because if you change the data structure you have concept errors that still compile. And run until tomorrow ^_^.

A generic algorithm has absolutely no business caring about the complexity of the collection it's operating on. If it does, then you've created a concrete algorithm, not a generic one.

I appreciate your view point. Please allow me to explain. The view point is in opposition with the STL. In the STL, each algorithm defines what kind of iterator it operates with. And it requires what iterator complexity. I agree that another design can be made. But the STL has that design. In my opinion it is a large part of what makes the STL so successful. I disagree that an algorithm that knows the complexity of its iterator is concrete. I think exactly the contrary. Maybe it is good that you read the book about the STL by Josuttis. STL algorithms are the most generic I ever find in any language. I hope std.algorithm in D will be better. But right now std.algorithm works only with arrays.
 If an algorithm uses [] and doesn't know the 
 complexity of the []...good! It shouldn't know, and it shouldn't care. It's 
 the code that sends the collection to the algorithm that knows and cares.

I think this is a mistake. The algorithm should know. Otherwise "linear find" is not "linear find"! It is "quadratic find". If you want to define something called linear find then you must know the iterator complexity.
 Why? Because "what algorithm is best?" depends on far more than just what 
 type of collection is used. It depends on "Will the collection ever be 
 larger than X elements?". It depends on "Is it a standard textbook list, or 
 does it use trick 1 and/or trick 2?". It depends on "Is it usually mostly 
 sorted or mostly random?". It depends on "What do I do with it most often? 
 Sort, append, search, insert or delete?". And it depends on other things, 
 too.

I agree it depends on many things. But such practical matters do not change the nature of a generic algorithm. Linear find is the same on 5, 50, or 5 million objects. I have to say I also think you have inverted some ideas. The algorithm is the same. You use it the way you want.
 Using "[]" versus "nth()" can't tell the algorithm *any* of those things.

This is an interface convention. Like any other interface convention! Nobody says that IStack.Push() puts something on the stack. It is described in the documentation. If a concrete stack is wrong it can do anything. The only special thing about [] is that the built-in array has []. So I do not think a list should want to look like an array.
 But 
 those things *must* be known in order to make an accurate decision of "Is 
 this the right algorithm or not?" Therefore, a generic algorithm *cannot* ever 
 know for certain if it's the right algorithm, *even* if you say "[]" means 
 "O(log n) or better". Therefore, the algorithm should not be designed to 
 only work with certain types of collections. The code that sends the 
 collection to the algorithm is the *only* code that knows the answers to all 
 of the questions above, therefore it is the only code that should ever 
 decide "I should use this algorithm, I shouldn't use that algorithm."

I respectfully disagree. For example, binary_search in the STL should never compile on a list, because it would simply be wrong to use it with a list. It makes no sense. So I am happy that the STL does not allow that. I think you can easily build a structure-and-algorithm library that allows wrong combinations. In programming you can do anything ^_^. But then I think I would say: your library is inferior and the STL is superior. I am sorry, Dee Girl
Aug 27 2008
parent reply "Nick Sabalausky" <a a.a> writes:
"Dee Girl" <deegirl noreply.com> wrote in message 
news:g94j7a$2875$1 digitalmars.com...

I think this is a mistake. The algorithm should know. Otherwise "linear find" is not "linear find"! It is "quadratic find". If you want to define something called linear find then you must know the iterator complexity.

If a generic algorithm describes itself as "linear find" then I know damn well that it's referring to the behavior of *just* the function itself, and is not a statement that the function *combined* with the behavior of the collection and/or a custom comparison is always going to be O(n).

A question about STL: If I create a collection that, internally, is like a linked list, but starts each indexing operation from the position of the last indexing operation (so that a "find first" would run in O(n) instead of O(n*n)), is it possible to send that collection to STL's generic "linear find first"? I would argue that it should somehow be possible *even* if the STL's generic "linear find first" guarantees a *total* performance of O(n) (since, in this case, it would still be O(n) anyway). Because otherwise, the STL wouldn't be very extensible, which would be a bad thing for a library of "generic" algorithms.

Another STL question: Is it possible to use STL to do a "linear find" using a custom comparison? If so, is it possible to make STL's "linear find" function use a comparison that just happens to be O(n)? If so, doesn't that violate the linear-time guarantee, too? If not, how does it know that the custom comparison is O(n) instead of O(1) or O(log n)?
Aug 27 2008
next sibling parent reply Don <nospam nospam.com.au> writes:
Nick Sabalausky wrote:

If a generic algorithm describes itself as "linear find" then I know damn well that it's referring to the behavior of *just* the function itself, and is not a statement that the function *combined* with the behavior of the collection and/or a custom comparison is always going to be O(n).

A question about STL: If I create a collection that, internally, is like a linked list, but starts each indexing operation from the position of the last indexing operation (so that a "find first" would run in O(n) instead of O(n*n)), is it possible to send that collection to STL's generic "linear find first"? I would argue that it should somehow be possible *even* if the STL's generic "linear find first" guarantees a *total* performance of O(n) (since, in this case, it would still be O(n) anyway). Because otherwise, the STL wouldn't be very extensible, which would be a bad thing for a library of "generic" algorithms.

Yes, it will work.
 Another STL question: It is possible to use STL to do a "linear find" using 
 a custom comparison? If so, it is possible to make STL's "linear find" 
 function use a comparison that just happens to be O(n)? If so, doesn't that 
 violate the linear-time guarantee, too? If not, how does it know that the 
 custom comparison is O(n) instead of O(1) or O(log n)?

This will work too. IF you follow the conventions THEN the STL gives you the guarantees.
Aug 28 2008
parent reply "Nick Sabalausky" <a a.a> writes:
"Don" <nospam nospam.com.au> wrote in message 
news:g95ks5$2aon$1 digitalmars.com...
 Another STL question: It is possible to use STL to do a "linear find" 
 using a custom comparison? If so, it is possible to make STL's "linear 
 find" function use a comparison that just happens to be O(n)? If so, 
 doesn't that violate the linear-time guarantee, too? If not, how does it 
 know that the custom comparison is O(n) instead of O(1) or O(log n)?

This will work too. IF you follow the conventions THEN the STL gives you the guarantees.

I'm not sure that's really a "guarantee" per se, but that's splitting hairs. In any case, it sounds like we're all arguing more or less the same point. Setting aside the issue of "should opIndex be used and when?", suppose I have the following collection interface and find function (roughly, in D):

    interface ICollection(T)
    {
        T getElement(int index);
        int getSize();
    }

    int find(T)(ICollection!(T) c, T elem)
    {
        for(int i=0; i<c.getSize(); i++)
        {
            if(c.getElement(i) == elem)
                return i;
        }
        return -1;
    }

It sounds like STL's approach is to do something roughly like that and say: "find()'s parameter 'c' should be an ICollection for which getElement() is O(1), in which case find() is guaranteed to be O(n)."

What I've been advocating is, again, doing something like the code above and saying: "find()'s complexity is dependent on the complexity of the ICollection's getElement(). If getElement()'s complexity is O(m), then find()'s complexity is guaranteed to be O(m * n). Of course, this means that the only way to get ideal complexity from find() is to use an ICollection for which getElement() is O(1)."

But, you see, those two statements are effectively equivalent.
Aug 28 2008
parent reply Don <nospam nospam.com.au> writes:
Nick Sabalausky wrote:
But, you see, those two statements are effectively equivalent.

They are. But... if you don't adhere to the conventions, your code gets really hard to reason about. "This class has an opIndex which is in O(n). Is that OK?" Well, that depends on what it's being used for. So you have to look at all of the places where it is used. It's much simpler to use the convention that opIndex _must_ be fast; this way the performance requirements for containers and algorithms are completely decoupled from each other. It's about good design.
Aug 28 2008
parent reply "Nick Sabalausky" <a a.a> writes:
"Don" <nospam nospam.com.au> wrote in message 
news:g95td3$2tu0$1 digitalmars.com...

They are. But... if you don't adhere to the conventions, your code gets really hard to reason about. "This class has an opIndex which is in O(n). Is that OK?" Well, that depends on what it's being used for. So you have to look at all of the places where it is used. It's much simpler to use the convention that opIndex _must_ be fast; this way the performance requirements for containers and algorithms are completely decoupled from each other. It's about good design.

Taking a slight detour, let me ask you this... Which of the following strategies do you consider to be better:

    //-- A --
    value = 0;
    for(int i=1; i<=10; i++)
    {
        value += i*2;
    }

    //-- B --
    value = sum(map(1..10, {n * 2}));

Both strategies compute the sum of the first 10 multiples of 2. Strategy A makes the low-level implementation details very clear, but IMO it comes at the expense of high-level clarity, because the code intermixes the high-level "what do I want to accomplish?" with the low-level details. Strategy B much more closely resembles the high-level desired result, and thus makes the high-level intent more clear. But this comes at the cost of hiding the low-level details behind a layer of abstraction.

I may very well be wrong on this, but from what you've said it sounds like you (as well as the other people who prefer [] to never be O(n)) are the type of coder who would prefer "Strategy A". In that case, I can completely understand your viewpoint on opIndex, even though I don't agree with it (I'm a "Strategy B" kind of person). Of course, if I'm wrong on that assumption, then we're back to square one ;)
Aug 28 2008
next sibling parent reply "Steven Schveighoffer" <schveiguy yahoo.com> writes:
"Nick Sabalausky" wrote

Taking a slight detour, let me ask you this... Which of the following strategies do you consider to be better:

    //-- A --
    value = 0;
    for(int i=1; i<=10; i++)
    {
        value += i*2;
    }

    //-- B --
    value = sum(map(1..10, {n * 2}));

Both strategies compute the sum of the first 10 multiples of 2. Strategy A makes the low-level implementation details very clear, but IMO it comes at the expense of high-level clarity, because the code intermixes the high-level "what do I want to accomplish?" with the low-level details. Strategy B much more closely resembles the high-level desired result, and thus makes the high-level intent more clear. But this comes at the cost of hiding the low-level details behind a layer of abstraction.

I may very well be wrong on this, but from what you've said it sounds like you (as well as the other people who prefer [] to never be O(n)) are the type of coder who would prefer "Strategy A". In that case, I can completely understand your viewpoint on opIndex, even though I don't agree with it (I'm a "Strategy B" kind of person). Of course, if I'm wrong on that assumption, then we're back to square one ;)

For me at least, you are wrong :) In fact, I view it the other way: you shouldn't have to care about the underlying implementation, as long as the runtime is well defined. If you tell me strategy B may or may not take up to O(n^2) to compute, then you bet your ass I'm not going to even touch option B, 'cause I can always get O(n) time with option A :) Your solution FORCES me to care about the details; it's not so much that I want to care about them.

-Steve
Aug 28 2008
parent reply Don <nospam nospam.com.au> writes:
Steven Schveighoffer wrote:
 "Nick Sabalausky" wrote
 "Don" <nospam nospam.com.au> wrote in message 
 news:g95td3$2tu0$1 digitalmars.com...
 Nick Sabalausky wrote:
 "Don" <nospam nospam.com.au> wrote in message 
 news:g95ks5$2aon$1 digitalmars.com...
 Nick Sabalausky wrote:
 "Dee Girl" <deegirl noreply.com> wrote in message 
 news:g94j7a$2875$1 digitalmars.com...
 I appreciate your view point. Please allow me explain. The view point 
 is in opposition with STL. In STL each algorithm defines what kind of 
 iterator it operates with. And it requires what iterator complexity.

 I agree that other design can be made. But STL has that design. In my 
 opinion is much part of what make STL so successful.

 I disagree that algorithm that knows complexity of iterator is 
 concrete. I think exactly contrary. Maybe it is good that you read 
 book about STL by Josuttis. STL algorithms are the most generic I 
 ever find in any language. I hope std.algorithm in D will be better. 
 But right now std.algorithm works only with array.

 If an algoritm uses [] and doesn't know the
 complexity of the []...good! It shouldn't know, and it shouldn't 
 care. It's
 the code that sends the collection to the algoritm that knows and 
 cares.

Otherwise "linear find" is not "linear find"! It is "cuadratic find" (spell?). If you want to define something called linear find then you must know iterator complexity.

damn well that it's referring to the behavior of *just* the function itself, and is not a statement that the function *combined* with the behavior of the collection and/or a custom comparison is always going to be O(n). A question about STL: If I create a collection that, internally, is like a linked list, but starts each indexing operation from the position of the last indexing operation (so that a "find first" would run in O(n) instead of O(n*n)), is it possible to send that collection to STL's generic "linear find first"? I would argue that it should somehow be possible *even* if the STL's generic "linear find first" guarantees a *total* performance of O(n) (Since, in this case, it would still be O(n) anyway). Because otherwise, the STL wouldn't be very extendable, which would be a bad thing for a library of "generic" algorithms.

 Another STL question: It is possible to use STL to do a "linear find" 
 using a custom comparison? If so, it is possible to make STL's "linear 
 find" function use a comparison that just happens to be O(n)? If so, 
 doesn't that violate the linear-time guarantee, too? If not, how does 
 it know that the custom comparison is O(n) instead of O(1) or O(log 
 n)?

IF you follow the conventions THEN the STL gives you the guarantees.

hairs. In any case, it sounds like we're all arguing more or less the same point. Setting aside the issue of "should opIndex be used and when?", suppose I have the following collection interface and find function (not guaranteed to compile):

interface ICollection(T)
{
    T getElement(index);
    int getSize();
}

int find(T)(ICollection(T) c, T elem)
{
    for(int i=0; i<c.size(); i++)
    {
        if(c.getElement(i) == elem)
            return i;
    }
}

It sounds like STL's approach is to do something roughly like that and say: "find()'s parameter 'c' should be an ICollection for which getElement() is O(1), in which case find() is guaranteed to be O(n)". What I've been advocating is, again, doing something like the code above and saying: "find()'s complexity is dependent on the complexity of the ICollection's getElement(). If getElement()'s complexity is O(m), then find()'s complexity is guaranteed to be O(m * n). Of course, this means that the only way to get ideal complexity from find() is to use an ICollection for which getElement() is O(1)". But, you see, those two statements are effectively equivalent.
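Nick's sketch above can be made concrete. Below is a hedged C++ rendering (the names `find_index`, `VectorCollection`, and `ListCollection` are illustrative, not taken from any post): the same generic find performs n element accesses either way, so its total cost is O(n * cost(get)), which is exactly the point under debate.

```cpp
#include <cstddef>
#include <iterator>
#include <list>
#include <vector>

// A minimal "ICollection" analogue: any type with size() and get(i).
// find_index itself makes n calls to get(); total cost is O(n * cost(get)).
template <typename C, typename T>
std::ptrdiff_t find_index(const C& c, const T& elem)
{
    for (std::size_t i = 0; i < c.size(); ++i)
        if (c.get(i) == elem)
            return static_cast<std::ptrdiff_t>(i);
    return -1;
}

// O(1) get(): find_index is O(n).
struct VectorCollection {
    std::vector<int> data;
    std::size_t size() const { return data.size(); }
    int get(std::size_t i) const { return data[i]; }
};

// O(i) get(): the very same find_index silently becomes O(n^2).
struct ListCollection {
    std::list<int> data;
    std::size_t size() const { return data.size(); }
    int get(std::size_t i) const {
        auto it = data.begin();
        std::advance(it, i);   // walks i links, one per call
        return *it;
    }
};
```

Both collections give identical answers; only the hidden constant-versus-linear cost of `get` differs, which the call site cannot see.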

if you don't adhere to the conventions, your code gets really hard to reason about. "This class has an opIndex which is in O(n). Is that OK?" Well, that depends on what it's being used for. So you have to look at all of the places where it is used. It's much simpler to use the convention that opIndex _must_ be fast; this way the performance requirements for containers and algorithms are completely decoupled from each other. It's about good design.
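For reference, the convention Don describes is precisely what the STL encodes in its iterator categories: every iterator type advertises a category tag, and algorithms (and users) can dispatch on it. A minimal sketch using only standard library facilities:

```cpp
#include <iterator>
#include <list>
#include <type_traits>
#include <vector>

// Each iterator advertises its complexity class via a tag type.
// std::advance, for instance, is O(1) for random-access iterators
// and O(n) otherwise -- and the type system records which is which.
template <typename It>
using category_of = typename std::iterator_traits<It>::iterator_category;

static_assert(std::is_same<category_of<std::vector<int>::iterator>,
                           std::random_access_iterator_tag>::value,
              "vector iterators promise O(1) jumps");
static_assert(!std::is_same<category_of<std::list<int>::iterator>,
                            std::random_access_iterator_tag>::value,
              "list iterators make no such promise");
```

So the performance contract is decoupled exactly as Don says: containers declare a category, algorithms state which categories they accept.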

Which of the following strategies do you consider to be better:

//-- A --
value = 0;
for(int i=1; i<=10; i++)
{
    value += i*2;
}

//-- B --
value = sum(map(1..10, {n * 2}));

Both strategies compute the sum of the first 10 multiples of 2. Strategy A makes the low-level implementation details very clear, but IMO, it comes at the expense of high-level clarity, because the code intermixes the high-level "what do I want to accomplish?" with the low-level details. Strategy B much more closely resembles the high-level desired result, and thus makes the high-level intent clearer. But this comes at the cost of hiding the low-level details behind a layer of abstraction. I may very well be wrong on this, but from what you've said it sounds like you (as well as the other people who prefer [] to never be O(n)) are the type of coder who would prefer Strategy A. In that case, I can completely understand your viewpoint on opIndex, even though I don't agree with it (I'm a "Strategy B" kind of person). Of course, if I'm wrong on that assumption, then we're back to square one ;)

For me at least, you are wrong :) In fact, I view it the other way, you shouldn't have to care about the underlying implementation, as long as the runtime is well defined. If you tell me strategy B may or may not take up to O(n^2) to compute, then you bet your ass I'm not going to even touch option B, 'cause I can always get O(n) time with option A :) Your solution FORCES me to care about the details, it's not so much that I want to care about them.

I agree. It's about _which_ details do you want to abstract away. I don't care about the internals. But I _do_ care about the complexity of them.
Aug 29 2008
parent Christopher Wright <dhasenan gmail.com> writes:
Don wrote:
 Steven Schveighoffer wrote:
 "Nick Sabalausky" wrote
 "Don" <nospam nospam.com.au> wrote in message 
 news:g95td3$2tu0$1 digitalmars.com...
 Nick Sabalausky wrote:
 "Don" <nospam nospam.com.au> wrote in message 
 news:g95ks5$2aon$1 digitalmars.com...
 Nick Sabalausky wrote:
 "Dee Girl" <deegirl noreply.com> wrote in message 
 news:g94j7a$2875$1 digitalmars.com...
 I appreciate your view point. Please allow me explain. The view 
 point is in opposition with STL. In STL each algorithm defines 
 what kind of iterator it operates with. And it requires what 
 iterator complexity.

 I agree that other design can be made. But STL has that design. 
 In my opinion is much part of what make STL so successful.

 I disagree that algorithm that knows complexity of iterator is 
 concrete. I think exactly contrary. Maybe it is good that you 
 read book about STL by Josuttis. STL algorithms are the most 
 generic I ever find in any language. I hope std.algorithm in D 
 will be better. But right now std.algorithm works only with array.

 If an algoritm uses [] and doesn't know the
 complexity of the []...good! It shouldn't know, and it 
 shouldn't care. It's
 the code that sends the collection to the algoritm that knows 
 and cares.

"linear find" is not "linear find"! It is "cuadratic find" (spell?). If you want to define something called linear find then you must know iterator complexity.

know damn well that it's referring to the behavior of *just* the function itself, and is not a statement that the function *combined* with the behavior of the collection and/or a custom comparison is always going to be O(n). A question about STL: If I create a collection that, internally, is like a linked list, but starts each indexing operation from the position of the last indexing operation (so that a "find first" would run in O(n) instead of O(n*n)), is it possible to send that collection to STL's generic "linear find first"? I would argue that it should somehow be possible *even* if the STL's generic "linear find first" guarantees a *total* performance of O(n) (Since, in this case, it would still be O(n) anyway). Because otherwise, the STL wouldn't be very extendable, which would be a bad thing for a library of "generic" algorithms.

 Another STL question: It is possible to use STL to do a "linear 
 find" using a custom comparison? If so, it is possible to make 
 STL's "linear find" function use a comparison that just happens 
 to be O(n)? If so, doesn't that violate the linear-time 
 guarantee, too? If not, how does it know that the custom 
 comparison is O(n) instead of O(1) or O(log n)?

IF you follow the conventions THEN the STL gives you the guarantees.

splitting hairs. In any case, it sounds like we're all arguing more or less the same point. Setting aside the issue of "should opIndex be used and when?", suppose I have the following collection interface and find function (not guaranteed to compile):

interface ICollection(T)
{
    T getElement(index);
    int getSize();
}

int find(T)(ICollection(T) c, T elem)
{
    for(int i=0; i<c.size(); i++)
    {
        if(c.getElement(i) == elem)
            return i;
    }
}

It sounds like STL's approach is to do something roughly like that and say: "find()'s parameter 'c' should be an ICollection for which getElement() is O(1), in which case find() is guaranteed to be O(n)". What I've been advocating is, again, doing something like the code above and saying: "find()'s complexity is dependent on the complexity of the ICollection's getElement(). If getElement()'s complexity is O(m), then find()'s complexity is guaranteed to be O(m * n). Of course, this means that the only way to get ideal complexity from find() is to use an ICollection for which getElement() is O(1)". But, you see, those two statements are effectively equivalent.

if you don't adhere to the conventions, your code gets really hard to reason about. "This class has an opIndex which is in O(n). Is that OK?" Well, that depends on what it's being used for. So you have to look at all of the places where it is used. It's much simpler to use the convention that opIndex _must_ be fast; this way the performance requirements for containers and algorithms are completely decoupled from each other. It's about good design.

Which of the following strategies do you consider to be better:

//-- A --
value = 0;
for(int i=1; i<=10; i++)
{
    value += i*2;
}

//-- B --
value = sum(map(1..10, {n * 2}));

Both strategies compute the sum of the first 10 multiples of 2. Strategy A makes the low-level implementation details very clear, but IMO, it comes at the expense of high-level clarity, because the code intermixes the high-level "what do I want to accomplish?" with the low-level details. Strategy B much more closely resembles the high-level desired result, and thus makes the high-level intent clearer. But this comes at the cost of hiding the low-level details behind a layer of abstraction. I may very well be wrong on this, but from what you've said it sounds like you (as well as the other people who prefer [] to never be O(n)) are the type of coder who would prefer Strategy A. In that case, I can completely understand your viewpoint on opIndex, even though I don't agree with it (I'm a "Strategy B" kind of person). Of course, if I'm wrong on that assumption, then we're back to square one ;)

For me at least, you are wrong :) In fact, I view it the other way, you shouldn't have to care about the underlying implementation, as long as the runtime is well defined. If you tell me strategy B may or may not take up to O(n^2) to compute, then you bet your ass I'm not going to even touch option B, 'cause I can always get O(n) time with option A :) Your solution FORCES me to care about the details, it's not so much that I want to care about them.

I agree. It's about _which_ details do you want to abstract away. I don't care about the internals. But I _do_ care about the complexity of them.

We all agree about this. What we disagree about is how to find out about the complexity of an operation -- by whether it overloads an operator or by some metadata. In terms of code, the difference is:

/* Operator overloading */
void foo(T)(T collection)
{
    static if (is (typeof (T[0])))
    {
        ...
    }
}

/* Metadata */
void foo(T)(ICollection!(T) collection)
{
    if ((cast(FastIndexedCollection)collection) !is null)
    {
        ...
    }
}

You do need a metadata solution, whichever you choose. Otherwise you can't differentiate at runtime.
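A hedged C++ rendering of Christopher's second, runtime-metadata strategy (the `FastIndexed` marker type is illustrative, playing the role of D's hypothetical FastIndexedCollection): the null-cast check is spelled `dynamic_cast` in C++.

```cpp
// Marker interface advertising "my at() is O(1)". Implementing it is
// pure metadata; it adds no operations. (All names are illustrative.)
struct Collection {
    virtual ~Collection() {}
    virtual int at(int i) const = 0;
};
struct FastIndexed {
    virtual ~FastIndexed() {}
};

struct Array : Collection, FastIndexed {
    int data[3] = {1, 2, 3};
    int at(int i) const override { return data[i]; }
};

// The runtime query: dynamic_cast is the C++ spelling of D's
// (cast(FastIndexedCollection)collection) !is null.
bool has_fast_indexing(const Collection& c)
{
    return dynamic_cast<const FastIndexed*>(&c) != nullptr;
}
```

An algorithm handed a `Collection&` can branch on `has_fast_indexing` to pick an index-based or iteration-based strategy at runtime.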
Aug 29 2008
prev sibling parent reply Robert Fraser <fraserofthenight gmail.com> writes:
Nick Sabalausky wrote:
 Taking a slight detour, let me ask you this... Which of the following 
 strategies do you consider to be better:
 
 //-- A --
 value = 0;
 for(int i=1; i<=10; i++)
 {
     value += i*2;
 }
 
 //-- B --
 value = sum(map(1..10, {n * 2}));
 
 Both strategies compute the sum of the first 10 multiples of 2.
 
 Strategy A makes the low-level implementation details very clear, but IMO, 
 it comes at the expense of high-level clarity. This is because the code 
 intermixes the high-level "what I want to accomplish?" with the low-level 
 details.
 
 Strategy B much more closely resembles the high-level desired result, and 
 thus makes the high-level intent more clear. But this comes at the cost of 
 hiding the low-level details behind a layer of abstraction.

Didn't read the rest of the discussion, but I disagree here... Most programmers learn iterative languages first, and anyone who's taken Computer Science 101 can figure out what's going on in A. B takes a second to think about. I'm not into the zen of FP for sure, and that probably makes me a worse programmer... but I bet you that if you take a random candidate for a development position, she'll be more likely to figure out (and write) A than B. [That may be projection; I haven't seen/done any studies]

The big problem IMO is the number of primitive things you need to understand. In A, you need to understand variables, looping and arithmetic operations. In B, you need to understand and think about closures/scoping, lists, the "map" function, aggregate functions, function composition, and arithmetic operations. What hit me when first looking at it was "where the **** did n come from?"

I'm not saying the functional style isn't perfect for a lot of things; I'm just saying that this is not one of them.
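For comparison, the two strategies under discussion can be written in C++ with only standard algorithms (an editor's sketch, not code from the thread):

```cpp
#include <numeric>
#include <vector>

// Strategy A: explicit loop; every low-level step is visible.
int sum_doubles_loop()
{
    int value = 0;
    for (int i = 1; i <= 10; ++i)
        value += i * 2;
    return value;
}

// Strategy B: compose library algorithms; the intent ("sum of doubled
// 1..10") is up front, the mechanics are hidden in iota/accumulate.
int sum_doubles_algo()
{
    std::vector<int> v(10);
    std::iota(v.begin(), v.end(), 1);   // fill with 1..10
    return std::accumulate(v.begin(), v.end(), 0,
                           [](int acc, int n) { return acc + n * 2; });
}
```

Both return 110; the disagreement in the thread is only about which form a reader decodes faster.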
Aug 28 2008
parent reply Walter Bright <newshound1 digitalmars.com> writes:
Robert Fraser wrote:
 The big problem IMO is the number of primitive things you need to 
 understand. In A, you need to understand variables, looping and 
 arithmetic operations. In B, you need to understand and think about 
 closures/scoping, lists, the "map" function, aggregate functions, 
 function compositions, and arithmetic operations. What hit me when first 
 looking at it "where the **** did n come from?"

I think B should be clearer and more intuitive, it's just that I'm not used to B at all whereas A style has worn a very deep groove in my brain.
Aug 29 2008
parent bearophile <bearophileHUGS lycos.com> writes:
Walter Bright:

I think B should be clearer and more intuitive, it's just that I'm not used to
B at all whereas A style has worn a very deep groove in my brain.<

Well, if you use D 2 you write it this way:

value = 0;
foreach (i; 1 .. 11)
    value += i * 2;

Using my libs you can write:

auto value = sum(map((int i){return i * 2;}, range(1, 11)));

But that creates two intermediate lists, so you may want to go all lazy instead:

auto value = sum(xmap((int i){return i * 2;}, xrange(1, 11)));

That's short and fast and uses very little (a constant amount of) memory, but you have to count the open and closed brackets to be sure the expression is correct... So for me the most clear solution is the Python (lazy) one:

value = sum(i * 2 for i in xrange(1, 11))

That's why I suggested a similar syntax for D too ;-)

Bye,
bearophile
Aug 30 2008
prev sibling parent Dee Girl <deegirl noreply.com> writes:
Nick Sabalausky Wrote:

 "Dee Girl" <deegirl noreply.com> wrote in message 
 news:g94j7a$2875$1 digitalmars.com...
 Nick Sabalausky Wrote:

 "Dee Girl" <deegirl noreply.com> wrote in message
 news:g943oi$11f4$1 digitalmars.com...
 Benji Smith Wrote:

 Dee Girl wrote:
 Michiel Helvensteijn Wrote:
 That's simple. a[i] looks much nicer than a.nth(i).

It is not nicer. It is more deceiving (correct spell?). If you look at code it looks like array code.

foreach (i; 0 .. a.length)
{
    a[i] += 1;
}

For array works nice. But for list it is terrible! Many operations for incrementing only small list.

Well, that's what you get with operator overloading.

I am sorry. I disagree. I think that is what you get with bad design.
 The same thing could be said for "+" or "-". They're inherently
 deceiving, because they look like builtin operations on primitive data
 types.

 For expensive operations (like performing division on an
 unlimited-precision decimal object), should the author of the code use
 "opDiv" or should he implement a separate "divide" function?

The cost of + and - is proportional to digits in number. For small number of digits computer does fast in hardware. For many digits the cost grows. The number of digits is log n. I think + and - are fine for big integer. I am not surprise.
 Forget opIndex for a moment, and ask the more general question about 
 all
 overloaded operators. Should they imply any sort of asymptotic
 complexity guarantee?

I think depends on good design. For example I think ++ or -- for iterator. If it is O(n) it is bad design. Bad design make people say like you "This is what you get with operator overloading".
 Personally, I don't think so.

 I don't like "nth".

 I'd rather use the opIndex. And if I'm using a linked list, I'll be
 aware of the fact that it'll exhibit linear-time indexing, and I'll be
 cautious about which algorithms to use.

But inside algorithm you do not know if you use a linked list or a vector. You lost that information in bad abstraction. Also abstraction is bad because if you change data structure you have concept errors that still compile. And run until tomorrow ^_^.

A generic algoritm has absolutely no business caring about the complexity of the collection it's operating on. If it does, then you've created a concrete algoritm, not a generic one.

I appreciate your view point. Please allow me explain. The view point is in opposition with STL. In STL each algorithm defines what kind of iterator it operates with. And it requires what iterator complexity. I agree that other design can be made. But STL has that design. In my opinion is much part of what make STL so successful. I disagree that algorithm that knows complexity of iterator is concrete. I think exactly contrary. Maybe it is good that you read book about STL by Josuttis. STL algorithms are the most generic I ever find in any language. I hope std.algorithm in D will be better. But right now std.algorithm works only with array.
 If an algoritm uses [] and doesn't know the
 complexity of the []...good! It shouldn't know, and it shouldn't care. 
 It's
 the code that sends the collection to the algoritm that knows and cares.

I think this is mistake. Algorithm should know. Otherwise "linear find" is not "linear find"! It is "cuadratic find" (spell?). If you want to define something called linear find then you must know iterator complexity.

If a generic algorithm describes itself as "linear find" then I know damn well that it's referring to the behavior of *just* the function itself, and is not a statement that the function *combined* with the behavior of the collection and/or a custom comparison is always going to be O(n).

I think this is wrong. (Maybe I wake up moody! ^_^) Linear find that use another linear find each iteration is not linear find.
 A question about STL: If I create a collection that, internally, is like a 
 linked list, but starts each indexing operation from the position of the 
 last indexing operation (so that a "find first" would run in O(n) instead of 
 O(n*n)), is it possible to send that collection to STL's generic "linear 
 find first"? I would argue that it should somehow be possible *even* if the 
 STL's generic "linear find first" guarantees a *total* performance of O(n) 
 (Since, in this case, it would still be O(n) anyway). Because otherwise, the 
 STL wouldn't be very extendable, which would be a bad thing for a library of 
 "generic" algorithms.

Of course you can design bad collection and bad iterator. Let me ask this.

interface IUnknown
{
    void AddRef();
    void Release();
    int QueryInterface(IID*, void**);
}

Now I come and ask you. If I implement functions bad to do wrong things, can I use my class with COM? Maybe but I have leaks and other bad things. Compiler or STL can not enforce meaning of words. It only can give you a framework to express meanings correctly. Framework can be better or bad. You hide that nth element costs O(n) as detail. Then I can not write find or binary_search with your framework. Then I say STL better than your framework.
 Another STL question: It is possible to use STL to do a "linear find" using 
 a custom comparison? If so, it is possible to make STL's "linear find" 
 function use a comparison that just happens to be O(n)? If so, doesn't that 
 violate the linear-time guarantee, too? If not, how does it know that the 
 custom comparison is O(n) instead of O(1) or O(log n)?

An element of array does not have easy access to all array. But if you really want it can store it as a member or use a global array. So you can make find do O(n*n) or even more bad. But I think it is same mistake. If you can do something bad it does not mean framework is bad. The test is if you can do good thing easy. STL allows you to do good thing easy. Your framework makes doing good thing impossible.

Thank you, Dee Girl
Aug 28 2008
prev sibling parent reply Robert Fraser <fraserofthenight gmail.com> writes:
superdan wrote:
 yeppers. amend that to o(log n). in d, that rule is a social contract derived
from the built-in vector and hash indexing syntax.

I see what you did thar -- you made up a rule you like and called it a "social contract". Whether it _should be_ a rule or not is debatable, but it is neither a written nor an unwritten rule in use right now, so what you said there is a lie.

First, a hash access is already time-unbounded. hash["hello"] where "hello" is not already in the hash will create a hash entry for "hello". This requires heap allocation, which can take arbitrarily long. So having unbounded opIndex is in the language already!

Second, opIndex can be used for things other than data structures. For example, if I had a handle to a folder that had a file "foo.txt" in it, folder["foo.txt"] seems a natural syntax to create a handle to that file (which allocates memory = time unbounded). I can see the opIndex syntax being used for things like properties that may require searching through a parse tree. Maybe this is sort of stretching it, but I wouldn't mind having the opIndex syntax as a shorthand for executing database queries, i.e. `auto result = db["SELECT * FROM posts WHERE from = 'superdan'"];`.

It's a shorthand syntax that makes no guarantees as far as complexity, nor should it.
Aug 27 2008
next sibling parent superdan <super dan.org> writes:
Robert Fraser Wrote:

 superdan wrote:
 yeppers. amend that to o(log n). in d, that rule is a social contract derived
from the built-in vector and hash indexing syntax.

I see what you did thar -- you made up a rule you like and called it a "social contract". Whether it _should be_ a rule or not is debatable, but it is neither a written nor unwritten rule in use right now, so what you said there is a lie.

well i'm exposed. good goin' johnny drama. in c++ it's written. in d it's not yet. lookin' at std.algorithm i have no doubt it will. so my lie is really a prophecy :D
 First, a hash access is already time unbounded. hash["hello"] where 
 "hello" is not already in the hash will create a hash entry for hello. 
 This requires heap allocation, which can take arbitrarily long. So 
 having unbounded opIndex is in the language already!

hash was always an oddball. it is acceptable because it offers constant time [] on average.
 Second, opIndex can be used for things other than data structures. For 
 example, if I had a handle to a folder that had a file "foo.txt" in it, 
 folder["foo.txt"] seems a natural syntax to create a handle to that file 
 (which allocates memory = time unbounded).

guess i wouldn't be crazy about it. but yeah it works no problem. s'pose there's a misunderstanding s'mewhere. i'm not against opIndex usage in various data structs. no problem! i am only against opIndex masquerading as random access in a collection. that would allow algos thinkin' they do some effin' good iteration. when in fact they do linear search each time they make a pass. completely throws the shit towards the fan.
 I can see the opIndex syntax 
 being used for things like properties that may require searching through 
 a parse tree. Maybe this is sort of stretching it, but I wouldn't mind 
 having the opIndex syntax as a shorthand for executing database queries, 
 i.e. `auto result = db["SELECT * FROM posts WHERE from = 'superdan']";`.
 
 It's a shorthand syntax that makes no guarantees as far as complexity 
 nor should it.

kinda cute, but 100% agree.
Aug 27 2008
prev sibling parent Sergey Gromov <snake.scaly gmail.com> writes:
Robert Fraser <fraserofthenight gmail.com> wrote:
 First, a hash access is already time unbounded. hash["hello"] where 
 "hello" is not already in the hash will create a hash entry for hello. 
 This requires heap allocation, which can take arbitrarily long. So 
 having unbounded opIndex is in the language already!

Hash's opIndex() throws an ArrayBoundsError if given an unknown key. It's opIndexAssign() which allocates. -- SnakE
Aug 28 2008
prev sibling next sibling parent reply Christopher Wright <dhasenan gmail.com> writes:
== Quote from Christopher Wright (dhasenan gmail.com)'s article
 WRONG!
 Those sorting algorithms are correct. Their runtime is now O(n^2 log n)
 for this linked list.

My mistake. Merge sort, qsort, and heap sort are all O(n log n) for any list type that allows for efficient iteration (O(n) to go through a list of n elements, or for heap sort, O(n log n)) and O(1) appending (or, for heap sort, O(log n)). So even for a linked list, those three algorithms, which are probably the most common sorting algorithms used, will still be efficient. Unless the person who wrote them was braindead and used indexing to iterate rather than the class's defined opApply.
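The standard library itself illustrates Christopher's point: std::sort demands random-access iterators, so std::list ships its own sort(), a node-relinking merge sort that is O(n log n) with no indexing at all. A minimal demonstration:

```cpp
#include <list>

// std::list provides a member sort() precisely because the generic
// std::sort cannot accept its bidirectional iterators. The member
// version merge-sorts by splicing nodes: O(n log n), zero indexing.
std::list<int> sorted(std::list<int> l)
{
    l.sort();
    return l;
}
```

So a sensibly written merge sort stays efficient on a linked list, exactly as claimed above; only an index-per-element implementation degrades.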
Aug 26 2008
parent reply superdan <super dan.org> writes:
Christopher Wright Wrote:

 == Quote from Christopher Wright (dhasenan gmail.com)'s article
 WRONG!
 Those sorting algorithms are correct. Their runtime is now O(n^2 log n)
 for this linked list.

My mistake. Merge sort, qsort, and heap sort are all O(n log n) for any list type that allows for efficient iteration (O(n) to go through a list of n elements, or for heap sort, O(n log n)) and O(1) appending (or, for heap sort, O(log n)). So even for a linked list, those three algorithms, which are probably the most common sorting algorithms used, will still be efficient. Unless the person who wrote them was braindead and used indexing to iterate rather than the class's defined opApply.

sigh. your mistake indeed. just not where you thot. quicksort needs random access fer the pivot. not fer iterating. quicksort can't guarantee good runtime if pivot is first element. actually any of first k elements. on a forward iterator quicksort does quadratic time if already sorted or almost sorted.
Aug 26 2008
parent reply Christopher Wright <dhasenan gmail.com> writes:
superdan wrote:
 Christopher Wright Wrote:
 
 == Quote from Christopher Wright (dhasenan gmail.com)'s article
 WRONG!
 Those sorting algorithms are correct. Their runtime is now O(n^2 log n)
 for this linked list.

that allows for efficient iteration (O(n) to go through a list of n elements, or for heap sort, O(n log n)) and O(1) appending (or, for heap sort, O(log n)). So even for a linked list, those three algorithms, which are probably the most common sorting algorithms used, will still be efficient. Unless the person who wrote them was braindead and used indexing to iterate rather than the class's defined opApply.

sigh. your mistake indeed. just not where you thot. quicksort needs random access fer the pivot. not fer iterating. quicksort can't guarantee good runtime if pivot is first element. actually any of first k elements. on a forward iterator quicksort does quadratic time if already sorted or almost sorted.

You need to pick a random pivot in order to guarantee that runtime, in fact. And you can do that in linear time, and you're doing a linear scan through the elements anyway, so you get the same asymptotic time. It's going to double your runtime at worst, if you chose a poor datastructure for the task.
Aug 26 2008
parent superdan <super dan.org> writes:
Christopher Wright Wrote:

 superdan wrote:
 Christopher Wright Wrote:
 
 == Quote from Christopher Wright (dhasenan gmail.com)'s article
 WRONG!
 Those sorting algorithms are correct. Their runtime is now O(n^2 log n)
 for this linked list.

that allows for efficient iteration (O(n) to go through a list of n elements, or for heap sort, O(n log n)) and O(1) appending (or, for heap sort, O(log n)). So even for a linked list, those three algorithms, which are probably the most common sorting algorithms used, will still be efficient. Unless the person who wrote them was braindead and used indexing to iterate rather than the class's defined opApply.

sigh. your mistake indeed. just not where you thot. quicksort needs random access fer the pivot. not fer iterating. quicksort can't guarantee good runtime if pivot is first element. actually any of first k elements. on a forward iterator quicksort does quadratic time if already sorted or almost sorted.

You need to pick a random pivot in order to guarantee that runtime, in fact. And you can do that in linear time, and you're doing a linear scan through the elements anyway, so you get the same asymptotic time. It's going to double your runtime at worst, if you chose a poor datastructure for the task.

damn man you're right. yeah it's still o(n log n). i was wrong. 'pologies.
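A sketch of the fix the two posters converge on, in C++ (illustrative code, not from the thread): pick the pivot at random, partition, and recurse. The pivot choice costs at most one linear scan per level (O(1) here on a vector; O(n) if you had to walk a forward iterator), so the expected O(n log n) bound survives even on presorted input.

```cpp
#include <algorithm>
#include <cstdlib>
#include <vector>

// Quicksort with a random pivot. The three-way split via two
// std::partition calls keeps the middle (== pivot) band out of the
// recursion, which also guarantees termination on duplicate keys.
void quicksort(std::vector<int>& v, std::size_t lo, std::size_t hi)
{
    if (hi - lo < 2) return;
    int pivot = v[lo + std::rand() % (hi - lo)];   // random pivot choice
    auto mid1 = std::partition(v.begin() + lo, v.begin() + hi,
                               [=](int x) { return x < pivot; });
    auto mid2 = std::partition(mid1, v.begin() + hi,
                               [=](int x) { return x == pivot; });
    quicksort(v, lo, static_cast<std::size_t>(mid1 - v.begin()));
    quicksort(v, static_cast<std::size_t>(mid2 - v.begin()), hi);
}
```

With a fixed first-element pivot the same code would degrade to quadratic time on sorted input, which is the failure mode superdan originally described.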
Aug 26 2008
prev sibling parent "Bill Baxter" <wbaxter gmail.com> writes:
On Wed, Aug 27, 2008 at 9:49 AM, Michiel Helvensteijn <nomail please.com> wrote:
 Dee Girl wrote:

 Yes, the first 'trick' makes it a different datastructure. The second
 does not. Would you still be opposed to using opIndex if its
 time-complexity is O(log n)?

This is different question. And tricks are not answer for problem. Problem is list has other access method than array.

And what's the answer?

The complexity of STL's std::map indexing operator is O(lg N). So it is not the case even in the STL that [] *always* means O(1). Plus, if the element is not found in the std::map when using [], it triggers an insertion, which can mean an allocation, which means the upper bound for time required for an index operation is whatever the upper bound for 'new' is on your system.

But std::map is kind of an oddball case. I think a lot of people are surprised to find that merely accessing an element can trigger allocation. Not a great design in my opinion, precisely because it fails to have the behavior one would expect out of an [] operator.

--bb
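Bill's observation is easy to demonstrate (an editor's sketch): a plain read through std::map's operator[] changes the map's size when the key is absent.

```cpp
#include <cstddef>
#include <map>
#include <string>

// map::operator[] is O(lg N) *and* default-inserts when the key is
// missing, so a "mere access" can allocate. (map::at(), by contrast,
// only looks up, and throws on a missing key.)
std::size_t size_after_bracket_read(std::map<std::string, int> m,
                                    const std::string& key)
{
    (void)m[key];      // read-looking access; silently inserts if absent
    return m.size();
}
```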
Aug 27 2008
prev sibling next sibling parent reply Michel Fortin <michel.fortin michelf.com> writes:
On 2008-08-25 21:56:18 -0400, Benji Smith <dlanguage benjismith.net> said:

 But if someone else, with special design constraints, needs to 
 implement a custom container template, it's no problem. As long as the 
 container provides a function for getting iterators to the container 
 elements, it can consume any of the STL algorithms too, even if the 
 performance isn't as good as the performance for a vector.

Indeed. But notice that the Standard Template Library containers don't use inheritance, but templates. You can create your own version of std::string by creating a different class and implementing the same functions, but then every function accepting a std::string would have to be a template capable of accepting either one as input, or changed to use your new string class. That's why std::find and std::for_each, like many others, are template functions: those would work with your custom string class.

The situation is no different in D: you can create your own string class or struct, but only functions taking your string class or struct, or template functions where the string type is a template argument, will be able to use it.

If your argument is that string functions in Phobos should be template functions accepting any kind of string as input, then that sounds reasonable to me. But that's not exactly what you said you wanted.
 There's no good reason the same technique couldn't provide both speed 
 and API flexibility for text processing.

This is absolutely right... but unfortunately, virtual dispatch (which interfaces in D imply) isn't the same technique as in the STL at all. Template algorithms parametrized on the container and iterator type are what the STL is all about, and from there comes its speed. -- Michel Fortin michel.fortin michelf.com http://michelf.com/
Aug 25 2008
parent reply superdan <super dan.org> writes:
Michel Fortin Wrote:

 On 2008-08-25 21:56:18 -0400, Benji Smith <dlanguage benjismith.net> said:
 
 But if someone else, with special design constraints, needs to 
 implement a custom container template, it's no problem. As long as the 
 container provides a function for getting iterators to the container 
 elements, it can consume any of the STL algorithms too, even if the 
 performance isn't as good as the performance for a vector.

Indeed. But notice that the Standard Template Library containers don't use inheritance, but templates. You can create your own version of std::string by creating a different class and implementing the same functions, but then every function accepting a std::string would have to be a template capable of accepting either one as input, or changed to use your new string class. That's why std::find and std::for_each, like many others, are template functions: those would work with your custom string class. The situation is no different in D: you can create your own string class or struct, but only functions taking your string class or struct, or template functions where the string type is a template argument, will be able to use it. If your argument is that string functions in Phobos should be template functions accepting any kind of string as input, then that sounds reasonable to me. But that's not exactly what you said you wanted.

perfect answer. u da man. for example look at this fn from std.string.

int cmp(C1, C2)(in C1[] s1, in C2[] s2);

so it looks like cmp accepts arrays of any character type. that is cool but the [] limits the thing to builtin arrays. the correct sig is

int cmp(S1, S2)(in S1 s1, in S2 s2)
    if (isSortaString!(S1) && isSortaString!(S2));

correct?
Aug 25 2008
next sibling parent Michel Fortin <michel.fortin michelf.com> writes:
On 2008-08-25 22:52:52 -0400, superdan <super dan.org> said:

 int cmp(S1, S2)(in S1 s1, in S2 s2)
     if (isSortaString!(S1) && isSortaString!(S2));
 
 correct?

That's sorta what I had in mind. -- Michel Fortin michel.fortin michelf.com http://michelf.com/
Aug 25 2008
prev sibling parent Walter Bright <newshound1 digitalmars.com> writes:
superdan wrote:
 for example look at this fn from std.string.
 
 int cmp(C1, C2)(in C1[] s1, in C2[] s2);
 
 so it looks like cmp accepts arrays of any character type. that is
 cool but the [] limits the thing to builtin arrays. the correct sig
 is
 
 int cmp(S1, S2)(in S1 s1, in S2 s2) if (isSortaString!(S1) &&
 isSortaString!(S2));
 
 correct?

Yes. It's just that template constraints came along later than std.string.cmp :-)
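A sketch of how that constrained signature might be fleshed out, using a hypothetical isSortaString trait built from Phobos range primitives (the in qualifiers are dropped here so the ranges can be consumed):

```d
import std.range;   // isInputRange, ElementType, empty, front, popFront
import std.traits : isSomeChar;

// Hypothetical trait: anything iterable that yields characters.
enum isSortaString(S) = isInputRange!S && isSomeChar!(ElementType!S);

int cmp(S1, S2)(S1 s1, S2 s2)
    if (isSortaString!S1 && isSortaString!S2)
{
    while (!s1.empty && !s2.empty)
    {
        if (s1.front != s2.front)
            return s1.front < s2.front ? -1 : 1;
        s1.popFront();
        s2.popFront();
    }
    // On a common prefix, the shorter string sorts first.
    if (s1.empty && s2.empty) return 0;
    return s1.empty ? -1 : 1;
}

void main()
{
    assert(cmp("abc", "abd"w) < 0);  // mixed encodings are fine
    assert(cmp("abc"d, "abc") == 0);
    assert(cmp("abcd", "abc") > 0);
}
```

Since narrow strings auto-decode to dchar through the range primitives, char[], wchar[], and dchar[] arguments all compare character by character.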
Aug 26 2008
prev sibling parent Walter Bright <newshound1 digitalmars.com> writes:
Benji Smith wrote:
 I don't know a whole lot about the STL,

STL is a piece of brilliance in C++ (and one can reasonably argue that STL saved C++). The design of STL solves the problems you are talking about. Andrei has been hard at work getting equivalent functionality into the D library (see http://www.digitalmars.com/d/2.0/phobos/std_algorithm.html).
Aug 26 2008
prev sibling parent Walter Bright <newshound1 digitalmars.com> writes:
Benji Smith wrote:
 But in this "systems language", it's a O(n) operation to get the nth 
 character from a string, to slice a string based on character offsets, 
 or to determine the number of characters in the string.
 
 I'd gladly pay the price of a single interface vtable lookup to turn all 
 of those into O(1) operations.

I've written internationalized applications that dealt with multibyte utf strings. It looks like one would regularly need all those operations, but interestingly it just doesn't come up. It turns out that one needs to slice with the byte offset, or get the byte length, or get the nth byte. In the very rare case where one wants to do it with characters, one seems to already have the right offsets at hand. If you choose to use dchar's instead, there is a 1:1 mapping between characters and indices, and it doesn't cost you any class overhead. It's also a simple conversion from UTF-8 <==> UTF-32. I can't think of a scenario where using classes would produce any performance advantage.
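The dchar route fits in a few lines: to!dstring transcodes the UTF-8 data into one dchar per character, after which indexing and slicing by character offset are O(1). A minimal sketch:

```d
import std.conv : to;

void main()
{
    string s = "naïve";          // UTF-8: the 'ï' takes two bytes
    assert(s.length == 6);       // length counts code units, not characters

    dstring d = to!dstring(s);   // one dchar per character
    assert(d.length == 5);
    assert(d[2] == 'ï');         // O(1) indexing, 1:1 mapping
    assert(d[0 .. 3] == "naï"d); // slicing by character offset
}
```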
Aug 26 2008
prev sibling next sibling parent "Lionello Lunesu" <lionello lunesu.remove.com> writes:
"superdan" <super dan.org> wrote in message 
news:g8vh9b$fko$1 digitalmars.com...
 Benji Smith Wrote:
 No. Of course not. The compiler complains that you can't concatenate a
 dchar to a char[] array. Even though the "find" functions indicate that
 the array is truly a collection of dchar elements.

that's a bug in the compiler. report it.

I did, a long time ago. #111 if I'm not mistaken. L.
Aug 25 2008
prev sibling next sibling parent Benji Smith <dlanguage benjismith.net> writes:
superdan wrote:
 Benji Smith Wrote:
 A char[] is actually an array of UTF-8 encoded octets, where each 
 character may consume one or more consecutive elements of the array. 
 Retrieving the str.length property may or may not tell you how many 
 characters are in the string. And pretty much any code that tries to 
 iterate character-by-character through the array elements is 
 fundamentally broken.

try this: foreach (dchar c; str) { process c }

Cool. I had no idea that was possible. I was doing this:

void myFunction(T)(T[] array) {
    foreach (T c; array) {
        doStuff(c);
    }
}

--benji
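The difference between the two loops is worth spelling out: foreach over char walks the raw UTF-8 code units, while foreach (dchar c; str) decodes whole characters on the fly. A minimal sketch:

```d
void main()
{
    string s = "héllo";          // 5 characters, 6 UTF-8 code units

    // foreach over char walks raw code units...
    size_t units;
    foreach (char c; s) ++units;
    assert(units == 6);

    // ...while foreach over dchar decodes whole characters.
    size_t chars;
    foreach (dchar c; s) ++chars;
    assert(chars == 5);
}
```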
Aug 26 2008
prev sibling next sibling parent "Denis Koroskin" <2korden gmail.com> writes:
On Tue, 26 Aug 2008 23:29:33 +0400, superdan <super dan.org> wrote:
[snip]
 The D spec certainly doesn't make any guarantees about
 the time/memory complexity of opIndex; it's up to the implementing class
 to do so.

it don't indeed. it should. that's a problem with the spec.

I agree. You can't rely on function invocation, i.e. the following might be slow as death:

auto n = collection.at(i);
auto len = collection.length();

but index operations and property getters should be real-time and have O(1) complexity by design:

auto n = collection[i];
auto len = collection.length;
Aug 26 2008
prev sibling next sibling parent "Denis Koroskin" <2korden gmail.com> writes:
On Tue, 26 Aug 2008 23:58:10 +0400, Denis Koroskin <2korden gmail.com>  
wrote:

 On Tue, 26 Aug 2008 23:29:33 +0400, superdan <super dan.org> wrote:
 [snip]
 The D spec certainly doesn't make any guarantees about
 the time/memory complexity of opIndex; it's up to the implementing  
 class
 to do so.

it don't indeed. it should. that's a problem with the spec.

I agree. You can't rely on function invocation, i.e. the following might be slow as death: auto n = collection.at(i); auto len = collection.length(); but index operations and property getters should be real-time and have O(1) complexity by design. auto n = collection[i]; auto len = collection.length;

The same goes for assignment, casts, comparisons, shifts, i.e. everything that doesn't have a function invocation syntax. BTW, that's one of the main C++ criticisms: you can't say how much time a given line may take. It is predictable in C because it lacks operator overloading.
Aug 26 2008
prev sibling next sibling parent reply "Denis Koroskin" <2korden gmail.com> writes:
On Wed, 27 Aug 2008 00:30:07 +0400, Steven Schveighoffer  
<schveiguy yahoo.com> wrote:

 "Denis Koroskin" wrote
 On Tue, 26 Aug 2008 23:29:33 +0400, superdan <super dan.org> wrote:
 [snip]
 The D spec certainly doesn't make any guarantees about
 the time/memory complexity of opIndex; it's up to the implementing  
 class
 to do so.

it don't indeed. it should. that's a problem with the spec.

I agree. You can't rely on function invocation, i.e. the following might be slow as death: auto n = collection.at(i); auto len = collection.length(); but index operations and property getters should be real-time and have O(1) complexity by design. auto n = collection[i]; auto len = collection.length;

less than O(n) complexity please :) Think of tree map complexity which is usually O(lg n) for lookups. And the opIndex syntax is sooo nice for maps :) In general, opIndex just shouldn't imply 'linear search', as its roots come from array lookup, which is always O(1). The perception is that x[n] should be fast. Otherwise you have coders using x[n] all over the place thinking they are doing quick lookups, and wondering why their code is so damned slow. -Steve
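One way to honor that perception in user code: a container whose lookup is linear can expose a named method instead of opIndex, so the O(n) walk is visible at the call site. A sketch with a hypothetical singly linked list:

```d
// Hypothetical singly linked list that deliberately offers
// nth() instead of opIndex, so the O(n) walk is explicit.
struct SList(T)
{
    static struct Node { T value; Node* next; }
    Node* head;

    void prepend(T v) { head = new Node(v, head); }

    T nth(size_t i)                  // O(n): walks the list
    {
        auto n = head;
        foreach (_; 0 .. i) n = n.next;
        return n.value;
    }
}

void main()
{
    SList!int list;
    foreach (v; [3, 2, 1]) list.prepend(v);   // list is now 1 -> 2 -> 3
    assert(list.nth(0) == 1);
    assert(list.nth(2) == 3);
}
```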

Yes, that was a rash statement.
Aug 26 2008
parent superdan <super dan.org> writes:
Denis Koroskin Wrote:

 On Wed, 27 Aug 2008 00:30:07 +0400, Steven Schveighoffer  
 <schveiguy yahoo.com> wrote:
 
 "Denis Koroskin" wrote
 On Tue, 26 Aug 2008 23:29:33 +0400, superdan <super dan.org> wrote:
 [snip]
 The D spec certainly doesn't make any guarantees about
 the time/memory complexity of opIndex; it's up to the implementing  
 class
 to do so.

it don't indeed. it should. that's a problem with the spec.

I agree. You can't rely on function invocation, i.e. the following might be slow as death: auto n = collection.at(i); auto len = collection.length(); but index operations and property getters should be real-time and have O(1) complexity by design. auto n = collection[i]; auto len = collection.length;

less than O(n) complexity please :) Think of tree map complexity which is usually O(lg n) for lookups. And the opIndex syntax is sooo nice for maps :) In general, opIndex just shouldn't imply 'linear search', as its roots come from array lookup, which is always O(1). The perception is that x[n] should be fast. Otherwise you have coders using x[n] all over the place thinking they are doing quick lookups, and wondering why their code is so damned slow. -Steve

Yes, that was a rash statement.

i'm kool & the gang with log n too. that's like proportional 2 the count of digits in n. undecided about sublinear. like o(n^.5). guess that would be pushin' it. but they come by rarely so why bother makin' a decision :)
Aug 26 2008
prev sibling parent reply "Simen Kjaeraas" <simen.kjaras gmail.com> writes:
Michiel Helvensteijn <nomail please.com> wrote:

 a[i] looks much nicer than a.nth(i).

To me, this is one of the most important points here. I want a language that seems to make sense, more than I want a language that is by default very fast. When I write a short example program, I want to write a[i] not a.getElementAtPosition(i). D is known as a language that does the safe thing by default, and you have to jump through some hoops to do the fast, unsafe thing. I will claim that a[i] is the default, as it is what we're used to, and looks better. a.nth(i), a.getElementAtPosition(i) and whatever other ways one might come up with, is jumping through hoops. Just my 0.02 kr. -- Simen
Aug 27 2008
parent superdan <super dan.org> writes:
Simen Kjaeraas Wrote:

 Michiel Helvensteijn <nomail please.com> wrote:
 
 a[i] looks much nicer than a.nth(i).

To me, this is one of the most important points here. I want a language that seems to make sense, more than I want a language that is by default very fast. When I write a short example program, I want to write a[i] not a.getElementAtPosition(i).

sure. you do so with arrays. i think you confuse "optimized-fast" with "complexity-fast".
 D is known as a language that does the safe thing by default,
 and you have to jump through some hoops to do the fast, unsafe
 thing. I will claim that a[i] is the default, as it is what
 we're used to, and looks better. a.nth(i),
 a.getElementAtPosition(i) and whatever other ways one might
 come up with, is jumping through hoops.

guess i missed the lesson teachin' array indexing was unsafe. you are looking at them wrong tradeoffs. it's not about slow and safe vs. fast and unsafe. safety's nothin' to do with all this. you're lookin' at bad design vs. good design of algos and data structs.
Aug 27 2008
prev sibling next sibling parent reply superdan <super dan.org> writes:
Benji Smith Wrote:

 In another thread (about array append performance) I mentioned that 
 Strings ought to be implemented as classes rather than as simple builtin
 arrays. Superdan asked why. Here's my response...

well then allow me to retort.
 I'll start with a few of the softball, easy reasons.
 
 For starters, with strings implemented as character arrays, writing 
 library code that accepts and operates on strings is a bit of a pain in 
 the neck, since you always have to write templates and template code is 
 slightly less readable than non-template code. You can't distribute your 
 code as a DLL or a shared object, because the template instantiations 
 won't be included (unless you create wrapper functions with explicit 
 template instantiations, bloating your code size, but more importantly 
 tripling the number of functions in your API).

so u mean with a class the encoding char/wchar/dchar won't be an issue anymore. that would be hidden behind the wraps. cool. problem is that means there's an indirection cost for every character access. oops. so then apps that decided to use some particular encoding consistently must pay a price for stuff they don't use. but if u have strings like today it's a no-brainer to define a class that does all that stuff. u can then use that class whenever you feel. it would be madness to put that class in the language definition. at best it's a candidate for the stdlib. so that low-hangin' argument of yers ain't that low-hangin' after all. unless u call a hanged deadman low-hangin'.
 Another good low-hanging argument is that strings are frequently used as 
 keys in associative arrays. Every insertion and retrieval in an 
 associative array requires a hashcode computation. And since D strings 
 are just dumb arrays, they have no way of memoizing their hashcodes. 
 We've already observed that D assoc arrays are less performant than even 
 Python maps, so the extra cost of lookup operations is unwelcome.

again you want to build larger components from smaller components. you can build a string with memoized hashcode from a string without memoized hashcode. but you can't build a string without memoized hashcode from a string with memoized hashcode. but wait there's more. the extra field is paid for regardless. so what numbers do you have to back up your assertion that it's worth paying that cost for everything except hashtables.
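Building the memoizing variant on top of the builtin string is indeed a few lines of user code. A sketch with a hypothetical HashedString type (a real AA key would additionally need a const toHash, so this only shows the caching idea):

```d
// Hypothetical wrapper that memoizes the builtin string's hash.
struct HashedString
{
    string data;
    private size_t cachedHash;
    private bool hashed;

    size_t toHash()
    {
        if (!hashed)                 // compute once, reuse afterwards
        {
            cachedHash = typeid(string).getHash(&data);
            hashed = true;
        }
        return cachedHash;
    }

    bool opEquals(const HashedString rhs) const
    {
        return data == rhs.data;
    }
}

void main()
{
    auto s = HashedString("hello");
    immutable h1 = s.toHash();       // computed here
    immutable h2 = s.toHash();       // served from the cache
    assert(h1 == h2);
    assert(s == HashedString("hello"));
}
```

Note the trade-off superdan raises is visible in the layout: every HashedString carries the extra hash and flag fields whether or not it ever lands in a hashtable.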
 But much more important than either of those reasons is the lack of 
 polymorphism on character arrays. Arrays can't have subclasses, and they 
 can't implement interfaces.

that's why you can always define a class that does all those good things. by the same arg why isn't int a class. the point is you can always create class Int that does what an int does, slower but more flexible. if all you had was class Int you'd be in slowland.
 A good example of what I'm talking about can be seen in the Phobos and 
 Tango regular expression engines. At least the Tango implementation 
 matches against all string types (the Phobos one only works with char[] 
 strings).
 
 But what if I want to consume a 100 MB logfile, counting all lines that 
 match a pattern?

 Right now, to use the either regex engine, I have to read the entire 
 logfile into an enormous array before invoking the regex search function.
 
 Instead, what if there was a CharacterStream interface? And what if all 
 the text-handling code in Phobos & Tango was written to consume and 
 return instances of that interface?

what exactly is the problem there aside from a library issue.
 A regex engine accepting a CharacterStream interface could process text 
 from string literals, file input streams, socket input streams, database 
 records, etc, etc, etc... without having to pollute the API with a bunch 
 of casts, copies, and conversions. And my logfile processing application 
 would consume only a tiny fraction of the memory needed by the character 
 array implementation.

library problem. or maybe you want to build character stream into the language too.
 Most importantly, the contract between the regex engine and its 
 consumers would provide a well-defined interface for processing text, 
 regardless of the source or representation of that text.
 
 Along a similar vein, I've worked on a lot of parsers over the past few 
 years, for domain specific languages and templating engines, and stuff 
 like that. Sometimes it'd be very handy to define a "Token" class that 
 behaves exactly like a String, but with some additional behavior. 
 Ideally, I'd like to implement that Token class as an implementor of the 
 CharacterStream interface, so that it can be passed directly into other 
 text-handling functions.
 
 But, in D, with no polymorphic text handling, I can't do that.

of course you can. you just don't want to for the sake of building a fragile argument.
 As one final thought... I suspect that mutable/const/invariant string 
 handling would be much more conveniently implemented with a 
 MutableCharacterStream interface (as an extended interface of 
 CharacterStream).
 
 Any function written to accept a CharacterStream would automatically 
 accept a MutableCharacterStream, thanks to interface polymorphism, 
 without any casts, conversions, or copies. And various implementors of 
 the interface could provide buffered implementations operating on 
 in-memory strings, file data, or network data.
 
 Coding against the CharacterStream interface, library authors wouldn't 
 need to worry about const-correctness, since the interface wouldn't 
 provide any mutator methods.

sounds great. so then go ahead and make the characterstream thingie. the language gives u everything u need to make it clean and fast.
 But then again, I haven't used any of the const functionality in D2, so 
 I can't actually comment on relative usability of compiler-enforced 
 immutability versus interface-enforced immutability.
 
 Anyhow, those are some of my thoughts... I think there are a lot of 
 compelling reasons for de-coupling the specification of string handling 
 functionality from the implementation of that functionality, primarily 
 for enabling polymorphic text-processing.
 
 But memoized hashcodes would be cool too :-)

sorry dood each and every argument talks straight against your case. if i had any doubts, you just convinced me that a builtin string class would be a mistake.
Aug 25 2008
next sibling parent BCS <ao pathlink.com> writes:
Reply to superdan,

 sorry dood each and every argument talks straight against your case.
 if i had any doubts, you just convinced me that a builtin string class
 would be a mistake.
 

OTOH as standard lib string class...
Aug 25 2008
prev sibling next sibling parent reply Benji Smith <dlanguage benjismith.net> writes:
Sorry, dan. You're wrong.

superdan wrote:
 again you want to build larger components from smaller components.

Good APIs define public interfaces and hide implementation details, usually providing a default general-purpose implementation while allowing third-parties to define special-purpose implementations to suit their needs.

In D, the text handling is defined by the data format (unicode byte sequences) rather than by the interface, while providing no polymorphic mechanism for alternate implementations.

It's the opposite of good API design.

The regular expression engine accepts character arrays. And that's it. You want to match on a regex pattern, character-by-character, from an input stream? Tough nuts. It's not possible.

The new JSON parser in the Tango library operates on templated string arrays. If I want to read from a file or a socket, I have to first slurp the whole thing into a character array, even though the character-streaming would be more practical.

Parsers, formatters, console IO, socket IO... Anything that provides an iterable sequence of characters ought to comply with an interface facilitating polymorphic text processing. In some cases, there might be a slight memory/speed tradeoff. But in many more cases, iterating over the transient characters in a stream would be much faster and require a tiny fraction of the memory of the character array. There are performance benefits to be found on both sides of the coin.

Anyhow, I totally agree with you when you say that "larger components" should be built from "smaller components". But the "small components" are the *interfaces*, not the implementation details.

--benji
Aug 25 2008
next sibling parent reply superdan <super dan.org> writes:
Benji Smith Wrote:

 Sorry, dan. You're wrong.

well just sayin' it ain't makin' it. hope there's some in the way of a proof henceforth.
 superdan wrote:
 again you want to build larger components from smaller components.

Good APIs define public interfaces and hide implementation details, usually providing a default general-purpose implementation while allowing third-parties to define special-purpose implementations to suit their needs.

sure thing. all this gave me a warm fuzzy feelin' right there.
 In D, the text handling is defined by the data format (unicode byte 
 sequences) rather than by the interface, while providing no polymorphic 
 mechanism for alternate implementations.
 
 It's the opposite of good API design.

wait a minute. you got terms all mixed up. and that's a recurring problem that mashes your argument badly. first you say `in d'. so then i assume there's a problem in the language. but then you go on with what looks like a library thing. sure enuff you end with `api' which is 100% a library thingy. so if you have any beef say `in phobos' or `in tango'. `in d' must refer to the language proper.

the core lang is supposed to give you the necessary stuff to do that nice api design that gave me half an erection in ur first paragraph. if the language gives me the classes and apis and stuff then things will be slow for everyone and no viagra and no herbal supplement ain't gonna make stuff hard around here.

i think d strings are the cleverest thing yet. not too low level like pointers. not too high level like classes and stuff. just the right size. pun not intended.
 The regular expression engine accepts character arrays. And that's it. 
 You want to match on a regex pattern, character-by-character, from an 
 input stream? Tough nuts. It's not possible.

there are multiple problems with this argument of yours. first it's that you swap horses in the midstream. that, according to the cowboy proverb, is unrecommended. you switch from strings to streams. don't. about streams. same in perl. it's a classic problem. you must first read as much text as you know could match. then you match. and nobody complains. what you say is possible. it hasn't been written that way 'cause it's damn hard. motherintercoursin' hard. regexes must backtrack and when they do they need pronto access to the stuff behind. if that's in a file, yeah `tough nuts' eh.
 The new JSON parser in the Tango library operates on templated string 
 arrays. If I want to read from a file or a socket, I have to first slurp 
 the whole thing into a character array, even though the 
 character-streaming would be more practical.

non sequitur. see stuff above.
 Parsers, formatters, console IO, socket IO... Anything that provides an 
 iterable sequence of characters ought to comply with an interface 
 facilitating polymorphic text processing. In some cases, there might be 
 a slight memory/speed tradeoff. But in many more cases, the benefit of 
 iterating over the transient characters in a stream would be much faster 
 and require a tiny fraction of the memory of the character array.

so what u want is a streaming interface and an implementation of it that spans a string. for the mother of god i can't figure what this has to do with the language, and what stops you from writing it.
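For the record, the interface-plus-implementation being asked for fits in a page of user code, with no language change. A sketch with hypothetical CharacterStream / StringStream names:

```d
// Hypothetical streaming interface, plus one implementation
// that merely spans an in-memory string.
interface CharacterStream
{
    bool empty();
    dchar front();
    void popFront();
}

class StringStream : CharacterStream
{
    private dstring data;   // dstring keeps front() O(1)
    private size_t pos;

    this(dstring s) { data = s; }

    bool empty() { return pos >= data.length; }
    dchar front() { return data[pos]; }
    void popFront() { ++pos; }
}

// A consumer coded against the interface works with any source.
size_t countMatching(CharacterStream s, dchar target)
{
    size_t n;
    for (; !s.empty; s.popFront())
        if (s.front == target) ++n;
    return n;
}

void main()
{
    assert(countMatching(new StringStream("banana"d), 'a') == 3);
}
```

A file-backed or socket-backed class implementing the same interface could be dropped into countMatching unchanged; that virtual dispatch per character is exactly the cost the rest of the thread argues about.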
 There are performance benefits to be found on both sides of the coin.

no. all benefits only of my side of the coin. my side of the coin includes your side of the coin.
 Anyhow, I totally agree with you when you say that "larger components" 
 should be built from "smaller components".

cool.
 But the "small components" are the *interfaces*, not the implementation 
 details.

quite when i thought i drove a point home... dood we need to talk. you have all core language, primitives, libraries, and app code confused.
Aug 25 2008
parent reply Benji Smith <dlanguage benjismith.net> writes:
superdan wrote:
 But the "small components" are the *interfaces*, not the implementation 
 details.

quite when i thought i drove a point home... dood we need to talk. you have all core language, primitives, libraries, and app code confused.

The standard libraries are in a grey area between the language spec and application code. There are all sorts of implicit "interfaces" exposed by the builtin types (and there's also plenty of core language functionality implemented in the standard lib... take the GC, for example). You act as if there's no such thing as an interface for a builtin language feature. With strings implemented as raw arrays, they take on the array API...

slicing: broken
indexing: busted
iterating: fucked
length: you guessed it

I don't think the internals of the string representation should be any different. UTF-8 arrays? Fine by me. Just don't make me look at the malformed, mis-sliced bytes. Provide an API (yes, implemented in the standard lib, but specified by the language spec) that actually makes sense for text data.

(Incidentally, this is the same reason I think the builtin dynamic arrays should be classes implementing a standard List interface, and the associative arrays should be classes implementing a Map interface. The language implementations are nice, but they're not polymorphic, and that makes it a pain in the ass to extend them.)

--benji
Aug 25 2008
parent reply BCS <ao pathlink.com> writes:
Reply to Benji,

 slicing: broken

works as defined
 indexing: busted

works as defined
 iterating: fucked

works as defined and with foreach(dchar) as you want
 length: you guessed it

works as defined
 Provide an API (yes, implemented in the
 standard lib, but specified by the language spec) that actually makes
 sense for text data.

BTW everything in phobos under std.* is /not part of the D language spec/.
 
 (Incidentally, this is the same reason I think the builtin dynamic
 arrays should be classes implementing a standard List interface, and
 the associative arrays should be classes implementing a Map interface.
 The language implementations are nice, but they're not polymorphic,
 and that makes it a pain in the ass to extend them.)
 

A system language MUST have arrays that are not classes or anything near that thick. If you must have that sort of interface, pick a different language, because D isn't intended to work that way.
 --benji
 

Aug 25 2008
parent reply bearophile <bearophileHUGS lycos.com> writes:
BCS:
 If you must have that sort of interface, pick a different language, 
 because D isn't intended to work that way.

I suggest Benji try C# 3+; despite all the problems it has and the borg-like nature of such software, etc, it will be used way more than D, and it has all the nice things Benji asks for. Bye, bearophile
Aug 25 2008
next sibling parent reply Benji Smith <dlanguage benjismith.net> writes:
bearophile wrote:
 BCS:
 If you must have that sort of interface, pick a different language, 
 because D isn't intended to work that way.

I suggest Benji to try C# 3+, despite all the problems it has and the borg-like nature of such software, etc, it will be used way more than D, and it has all the nice things Benji asks for. Bye, bearophile

Yep, I like C# a lot. I think it's very well-designed, with the language and libraries dovetailing nicely together. I'm using D on my current project because I need to distribute libraries on both windows and linux, with C-linkage. And D is a helluva lot more pleasant than C/C++, even if there is a lot about D that I find lacking. --benji
Aug 25 2008
parent bearophile <bearophileHUGS lycos.com> writes:
Benji Smith:
 Yep, I like C# a lot. I think it's very well-designed, with the language 
 and libraries dovetailing nicely together.

In the past I have said that C# 3.5/4 has some small ideas that D may enjoy copying. But probably having a complex coherent OOP structure from the bottom up isn't one of them. You must understand that D is lower level than C#, which means it's designed for people that like to suffer more :-) D is designed mostly for people coming from C and C++, and it must be fit to be used procedurally/functionally without any OOP too. So D isn't C#, and this means what you ask isn't a great fit for it. Note that the situation isn't set in stone: some time ago, for example, there was a person who wanted to program like in Python on the dot net platform and was unhappy with C#. He created the Boo language. It's not widespread, and it has a few small design mistakes, but overall it's not a bad language; it's quite usable for its purposes. So you can create your own language fit for your purposes... Do you know the Vala language? It looks like C#, but compiles to C... it's probably still in beta, but it may be closer to your dream language. Another approach you may follow is to reinvent just the standard library/runtime of D to make it look more like the C# you like :-) Seen from outside, Tango already seems closer to the Java std lib than Phobos is (but I may be wrong). I like Python, so I am writing a large lib that no one else uses that has partially the purpose of making D look like Python :-) Bye, bearophile
Aug 25 2008
prev sibling parent reply "Nick Sabalausky" <a a.a> writes:
"bearophile" <bearophileHUGS lycos.com> wrote in message 
news:g8vmda$sd4$1 digitalmars.com...
 BCS:
 If you must have that sort of interface, pick a different language,
 because D isn't intended to work that way.

I suggest Benji to try C# 3+, despite all the problems it has and the borg-like nature of such software, etc, it will be used way more than D, and it has all the nice things Benji asks for.

(pet peeve) As much as there is that I like about C#, the lack of an IArithmetic or operator constraints tends to gimp its template system in a number of cases.
Aug 26 2008
parent reply BCS <ao pathlink.com> writes:
Reply to Nick,

 "bearophile" <bearophileHUGS lycos.com> wrote in message
 news:g8vmda$sd4$1 digitalmars.com...
 
 BCS:
 
 If you must have that sort of interface, pick a different language,
 because D isn't intended to work that way.
 

borg-like nature of such software, etc, it will be used way more than D, and it has all the nice things Benji asks for.

IArithmetic or operator constraints tends to gimp its template system in a number of cases.

C# generics are *Crippled*. They more or less do nothing but map types around.
Aug 26 2008
parent Christopher Wright <dhasenan gmail.com> writes:
BCS wrote:
 Reply to Nick,
 
 "bearophile" <bearophileHUGS lycos.com> wrote in message
 news:g8vmda$sd4$1 digitalmars.com...

 BCS:

 If you must have that sort of interface, pick a different language,
 because D isn't intended to work that way.

borg-like nature of such software, etc, it will be used way more than D, and it has all the nice things Benji asks for.

IArithmetic or operator constraints tends to gimp its template system in a number of cases.

C# generics are *Crippled*. They more or less do nothing but map types around.

Yes, but oh, the syntax!
Aug 26 2008
prev sibling parent reply BCS <ao pathlink.com> writes:
Reply to Benji,


 The new JSON parser in the Tango library operates on templated string
 arrays. If I want to read from a file or a socket, I have to first
 slurp the whole thing into a character array, even though the
 character-streaming would be more practical.
 

Unless you are only going to parse the start of the file, or are going to be throwing away most of it *while you parse it, not after*. The best way to parse a file is to load it all in one OS system call and then run a slicing parser (like the Tango XML parser) on that. One memory allocation and one load or an mmap, and then only the meta structures get allocated later.
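The key property of a slicing parser is that its tokens are views into the one loaded buffer, not copies. A minimal sketch, with an in-memory string standing in for the slurped file contents:

```d
void main()
{
    string text = "key=value";     // stand-in for the loaded file buffer

    // A slicing "parser": the tokens are slices, so no copies are made.
    string key = text[0 .. 3];
    string val = text[4 .. $];

    assert(key == "key");
    assert(val == "value");
    assert(key.ptr == text.ptr);   // same underlying buffer, zero extra allocation
}
```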
Aug 25 2008
parent reply Robert Fraser <fraserofthenight gmail.com> writes:
BCS wrote:
 Reply to Benji,
 
 
 The new JSON parser in the Tango library operates on templated string
 arrays. If I want to read from a file or a socket, I have to first
 slurp the whole thing into a character array, even though the
 character-streaming would be more practical.

Unless you are only going to parse the start of the file, or are going to be throwing away most of it *while you parse it, not after*, the best way to parse a file is to load it all in one OS system call and then run a slicing parser (like the Tango XML parser) on that. One memory allocation and one load or a mmap, and then only the meta structures get allocated later.

There are cases where you might want to parse an XML file that won't fit easily in main memory. I think a stream-processing SAX parser would be a good addition to (though perhaps not a replacement for) the existing one.
Aug 25 2008
parent reply BCS <ao pathlink.com> writes:
Reply to Robert,

 BCS wrote:
 
 Reply to Benji,
 
 The new JSON parser in the Tango library operates on templated
 string arrays. If I want to read from a file or a socket, I have to
 first slurp the whole thing into a character array, even though the
 character-streaming would be more practical.
 

to be throwing away most of it *while you parse it, not after* The best way to parse a file is to load it all in one OS system call and then run a slicing parser (like the Tango XML parser) on that. One memory allocation and one load or a mmap, and then only the meta structures get allocated later.

fit easily in main memory. I think a stream-processing SAX parser would be a good addition to (though perhaps not a replacement for) the existing one.

If you can't fit the data file in memory, then I find it hard to believe you will be able to hold the parsed file in memory. If you can program the parser to dump unneeded data on the fly, or to process and discard the data, that might make a difference.
Aug 26 2008
next sibling parent reply Benji Smith <dlanguage benjismith.net> writes:
BCS wrote:
 Reply to Robert,
 
 BCS wrote:

 Reply to Benji,

 The new JSON parser in the Tango library operates on templated
 string arrays. If I want to read from a file or a socket, I have to
 first slurp the whole thing into a character array, even though the
 character-streaming would be more practical.

to be throwing away most of it *while you parse it, not after* The best way to parse a file is to load it all in one OS system call and then run a slicing parser (like the Tango XML parser) on that. One memory allocation and one load or a mmap, and then only the meta structures get allocated later.

fit easily in main memory. I think a stream-processing SAX parser would be a good addition to (though perhaps not a replacement for) the existing one.

If you can't fit the data file in memory, then I find it hard to believe you will be able to hold the parsed file in memory. If you can program the parser to dump unneeded data on the fly, or to process and discard the data, that might make a difference.

Well, for something like a DOM parser, it's pretty much impossible to parse a file that won't fit into memory. But a SAX parser doesn't actually create any objects. It just calls events, while processing XML data from a stream. A good SAX parser can operate without ever allocating anything on the heap, leaving the consumer to create any necessary objects from the parse process. --benji
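A sketch of what Benji is describing: a SAX-style event interface plus a toy tag scanner. All names here are made up for illustration (this is not a real XML parser; it skips attributes, entities, and error handling). The driver hands callbacks slices of its input and allocates nothing on the heap itself.

```d
// Hypothetical SAX-style event interface; the consumer decides what,
// if anything, to allocate.
interface SaxHandler
{
    void onStartElement(const(char)[] name);
    void onEndElement(const(char)[] name);
    void onText(const(char)[] text);
}

// Toy driver: fires events while streaming over <a>hi</a>-style input.
void parse(const(char)[] input, SaxHandler h)
{
    size_t i;
    while (i < input.length)
    {
        if (input[i] == '<')
        {
            bool close = i + 1 < input.length && input[i + 1] == '/';
            size_t start = close ? i + 2 : i + 1;
            size_t j = start;
            while (j < input.length && input[j] != '>') ++j;
            if (close) h.onEndElement(input[start .. j]);
            else       h.onStartElement(input[start .. j]);
            i = j + 1;
        }
        else
        {
            size_t j = i;
            while (j < input.length && input[j] != '<') ++j;
            h.onText(input[i .. j]);
            i = j;
        }
    }
}

// Example consumer: counts opening tags, creates no objects per event.
class TagCounter : SaxHandler
{
    int tags;
    void onStartElement(const(char)[] name) { ++tags; }
    void onEndElement(const(char)[] name) {}
    void onText(const(char)[] text) {}
}

void main()
{
    auto h = new TagCounter;
    parse("<a>hi<b>x</b></a>", h);
    assert(h.tags == 2);
}
```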
Aug 26 2008
parent reply BCS <ao pathlink.com> writes:
Reply to Benji,

 BCS wrote:
 
 Reply to Robert,
 
 BCS wrote:
 
 Reply to Benji,
 
 The new JSON parser in the Tango library operates on templated
 string arrays. If I want to read from a file or a socket, I have
 to first slurp the whole thing into a character array, even though
 the character-streaming would be more practical.
 

going to be throwing away most of it *while you parse it, not after* The best way to parse a file is to load it all in one OS system call and then run a slicing parser (like the Tango XML parser) on that. One memory allocation and one load or a mmap, and then only the meta structures get allocated later.

fit easily in main memory. I think a stream-processing SAX parser would be a good addition to (though perhaps not a replacement for) the existing one.

believe you will be able to hold the parsed file in memory. If you can program the parser to dump unneeded data on the fly or process and discard the data, that might make a difference.

parse a file that won't fit into memory. But a SAX parser doesn't actually create any objects. It just calls events, while processing XML data from a stream. A good SAX parser can operate without ever allocating anything on the heap, leaving the consumer to create any necessary objects from the parse process. --benji

Interesting, I've worked with parsers* that function something like that but never thought of them in that way. OTOH I can think of only a very limited domain where this would be useful. If I needed to process that much data I'd load it into a database and go from there. *In fact my parser generator could be used that way.
Aug 26 2008
parent reply Benji Smith <dlanguage benjismith.net> writes:
BCS wrote:
 Reply to Benji,
 Well, for something like a DOM parser, it's pretty much impossible to
 parse a file that won't fit into memory. But a SAX parser doesn't
 actually create any objects. It just calls events, while processing
 XML data from a stream. A good SAX parser can operate without ever
 allocating anything on the heap, leaving the consumer to create any
 necessary objects from the parse process.

 --benji

Interesting, I've worked with parsers* that function something like that but never thought of them in that way. OTOH I can think of only a very limited domain where this would be useful. If I needed to process that much data I'd load it into a database and go from there. *In fact my parser generator could be used that way.

In fact, that's one of the places where I've used this kind of parsing technique before. I wrote a streaming CSV parser (which takes discipline to do correctly, since a double-quote-enclosed field can legally contain arbitrary newline characters, and quotes are escaped by doubling). It provides a field callback and a record callback, so it's very handy for performing ETL tasks. If I had to load whole CSV files into memory before parsing, it wouldn't work, because sometimes they can be hundreds of megabytes. But the streaming parser takes up almost no memory at all. --benji
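A streaming CSV parser of that shape can be sketched as a small state machine. This is an illustration, not Benji's actual parser: the only state carried along is a three-value enum plus the current field buffer, quoted fields may span newlines, and "" inside quotes yields a literal quote. (It ignores '\r' and assumes the input arrives as one chunk.)

```d
enum State { Unquoted, Quoted, QuoteInQuoted }

// Callback-driven CSV scanner: onField fires per field, onRecord per row.
void parseCsv(const(char)[] chunk,
              void delegate(const(char)[]) onField,
              void delegate() onRecord)
{
    State s = State.Unquoted;
    char[] field;
    void flushField() { onField(field); field.length = 0; }

    foreach (c; chunk)
    {
        final switch (s)
        {
        case State.Unquoted:
            if (c == '"')       s = State.Quoted;
            else if (c == ',')  flushField();
            else if (c == '\n') { flushField(); onRecord(); }
            else                field ~= c;
            break;
        case State.Quoted:
            if (c == '"') s = State.QuoteInQuoted;
            else          field ~= c; // newlines are legal in here
            break;
        case State.QuoteInQuoted:
            if (c == '"') { field ~= '"'; s = State.Quoted; } // "" escape
            else if (c == ',')  { flushField(); s = State.Unquoted; }
            else if (c == '\n') { flushField(); onRecord(); s = State.Unquoted; }
            break;
        }
    }
    if (field.length) { flushField(); onRecord(); } // no trailing newline
}

void main()
{
    const(char)[][] fields;
    int records;
    parseCsv("1,\"he said \"\"hi\"\",\nok\",2\n",
             (const(char)[] f) { fields ~= f.dup; },
             () { ++records; });
    assert(fields.length == 3 && records == 1);
    assert(fields[1] == "he said \"hi\",\nok");
}
```

Note the callback dups the slice it is handed, since the parser reuses its field buffer; a consumer that only inspects fields could skip even that allocation.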
Aug 26 2008
parent reply superdan <super dan.org> writes:
Benji Smith Wrote:

 BCS wrote:
 Reply to Benji,
 Well, for something like a DOM parser, it's pretty much impossible to
 parse a file that won't fit into memory. But a SAX parser doesn't
 actually create any objects. It just calls events, while processing
 XML data from a stream. A good SAX parser can operate without ever
 allocating anything on the heap, leaving the consumer to create any
 necessary objects from the parse process.

 --benji

Interesting, I've worked with parsers* that function something like that but never thought of them in that way. OTOH I can think of only a very limited domain where this would be useful. If I needed to process that much data I'd load it into a database and go from there. *In fact my parser generator could be used that way.

In fact, that's one of the places where I've used this kind of parsing technique before. I wrote a streaming CSV parser (which takes discipline to do correctly, since a double-quote-enclosed field can legally contain arbitrary newline characters, and quotes are escaped by doubling). It provides a field callback and a record callback, so it's very handy for performing ETL tasks. If I had to load whole CSV files into memory before parsing, it wouldn't work, because sometimes they can be hundreds of megabytes. But the streaming parser takes up almost no memory at all. --benji

sure it takes very little memory. i'll tell u how much memory u need in fact. it's the finite state needed by the fsa. u could do that because csv only needs finite state for parsing. soon as you need to backtrack stream parsing becomes very difficult.
Aug 26 2008
parent reply Benji Smith <dlanguage benjismith.net> writes:
superdan wrote:
 Benji Smith Wrote:
 I wrote a streaming CSV parser (which takes discipline to do correctly, 
 since a double-quote enclosed field can legally contain arbitrary 
 newline characters, and quotes are escaped by doubling). It provides a 
 field callback and a record callback, so it's very handy for performing 
 ETL tasks.

 If I had to load the whole CSV files into memory before parsing, it 
 wouldn't work, because sometimes they can be hundreds of megabytes. But 
 the streaming parser takes up almost no memory at all.

 --benji

sure it takes very little memory. i'll tell u how much memory u need in fact. it's the finite state needed by the fsa. u could do that because csv only needs finite state for parsing. soon as you need to backtrack stream parsing becomes very difficult.

Noooooooobody uses backtracking to parse. Most of the time LL(k) token lookahead solves the problem. Sometimes you need a syntactic predicate or (rarely) a semantic predicate. I've never even heard of a parser generator framework that supported backtracking. --benji
Aug 26 2008
next sibling parent reply superdan <super dan.org> writes:
Benji Smith Wrote:

 superdan wrote:
 Benji Smith Wrote:
 I wrote a streaming CSV parser (which takes discipline to do correctly, 
 since a double-quote enclosed field can legally contain arbitrary 
 newline characters, and quotes are escaped by doubling). It provides a 
 field callback and a record callback, so it's very handy for performing 
 ETL tasks.

 If I had to load the whole CSV files into memory before parsing, it 
 wouldn't work, because sometimes they can be hundreds of megabytes. But 
 the streaming parser takes up almost no memory at all.

 --benji

sure it takes very little memory. i'll tell u how much memory u need in fact. it's the finite state needed by the fsa. u could do that because csv only needs finite state for parsing. soon as you need to backtrack stream parsing becomes very difficult.

Noooooooobody uses backtracking to parse.

guess that makes perl regexes et al noooooooobody.
 Most of the time LL(k) token lookahead solves the problem. Sometimes you 
 need a syntactic predicate or (rarely) a semantic predicate.

 I've never even heard of a parser generator framework that supported 
 backtracking.

live & learn. keep lookin'. hint: try antlr.
Aug 26 2008
parent reply Benji Smith <dlanguage benjismith.net> writes:
superdan wrote:
 Noooooooobody uses backtracking to parse.

guess that makes perl regexes et al noooooooobody.

I suppose it depends on your definition of "parse".
 Most of the time LL(k) token lookahead solves the problem. Sometimes you 
 need a syntactic predicate or (rarely) a semantic predicate.

 I've never even heard of a parser generator framework that supported 
 backtracking.

live & learn. keep lookin'. hint: try antlr.

I've used ANTLR a few times. It's nice. I didn't realize it supported backtracking, though. (In my experience writing parsers, backtracking is one of those things you work overtime to eliminate, because it usually destroys performance.) It's funny you should mention ANTLR, actually, in this discussion. A year or so ago, I was considering porting the ANTLR runtime to D. The original runtime is written in Java, and makes full use of the robust string handling capabilities of the Java standard library. Based on the available text processing functionality in D at that time, I quickly gave up on the project as being not worth the effort. --benji
Aug 26 2008
next sibling parent reply BCS <ao pathlink.com> writes:
Reply to Benji,

 I've used ANTLR a few times. It's nice.
 

I've used it. If you gave me the choice of sitting in a small cardboard box all day or using it again, I'd sit in the cardboard box, because I fit in that box better.
Aug 26 2008
next sibling parent reply superdan <super dan.org> writes:
BCS Wrote:

 Reply to Benji,
 
 I've used ANTLR a few times. It's nice.
 

I've used it. If you gave me the choice of sitting in a small cardboard box all day or using it again, I'd sit in the cardboard box, because I fit in that box better.

i've used it too eh. u gotta be talking about a pretty charmin' cozy box there. the effin' mcmansion of boxes in fact. coz antlr is one of the best if not the best period.
Aug 26 2008
parent BCS <ao pathlink.com> writes:
Reply to superdan,

 BCS Wrote:
 
 Reply to Benji,
 
 I've used ANTLR a few times. It's nice.
 

small box all day or using it again, I'll sit in the cardboard box because I fit in that box better.

box there. the effin' mcmansion of boxes in fact. coz antlr is one of the best if not the best period.

The above is intended as a pun: "I don't fit in the ANTLR box". It's like MS Word: as long as you do things the way they're intended to be done, it's clear sailing; as soon as you try something else, rocks and shoals. The other main issue I've had with ANTLR is that the documentation is ABSOLUTELY HORRIBLE! It took me three weeks of working with it to even figure out that it was intended to be used differently than I expected. I was hard pressed to find critical information. Stuff that is, IMHO, only a quarter step less important than the fact that ANTLR is a parser generator; stuff I'd expect to be looking straight at after hitting Google's "I'm feeling lucky" button for ANTLR, no scrolling needed.
Aug 26 2008
prev sibling parent reply Benji Smith <dlanguage benjismith.net> writes:
BCS wrote:
 Reply to Benji,
 
 I've used ANTLR a few times. It's nice.

I've used it. If you gave me the choice of sitting in a small cardboard box all day or using it again, I'd sit in the cardboard box, because I fit in that box better.

I've always been impressed by the capabilities of ANTLR. The ANTLRWorks IDE is a very cool way to develop and debug grammars, and Terence Parr is one of those people who pushes the research into interesting new areas (he wrote something a few months ago about simplifying the deeply-recursive Expression grammar common in most languages that I found very insightful).

The architecture is pretty cool too. Text input is consumed and ASTs are constructed using token grammars, which are then transformed using tree-grammars, and code-generation is performed by output grammars. It's a very elegant system, and I've seen some example projects that used a sequence of those grammars to translate code between different programming languages. It's cool stuff.

So I appreciate ANTLR from that perspective. I think the theory behind the project is top-notch. But the syntax sucks. Badly. The learning curve is waaaay too steep for me, so I've always had to keep the documentation close by. And once the grammars are written, they're hard to read and maintain.

Also, there's a strong bias in the ANTLR community toward ASTs. I prefer to construct a somewhat higher-level parse tree. For example: given the expression "1 + 2", I'd like the parser to construct a BinaryOperator node, with two Expression node children and an enum "operator" field of "PLUS". I'd like it to use a set of pre-defined "parse model" classes that I've written to represent the language elements. It's hard to do that kind of thing in ANTLR, which usually just creates a "+" node with children of "1" and "2".

The majority of my parser-generator experience has been with JavaCC, which leaves model-generation to the user, which works better for me.

--benji
Aug 26 2008
parent BCS <ao pathlink.com> writes:
Reply to Benji,

 BCS wrote:
 
 Reply to Benji,
 
 I've used ANTLR a few times. It's nice.
 

small box all day or using it again, I'll sit in the cardboard box because I fit in that box better.

ANTLRWorks IDE is a very cool way to develop and debug grammars, and Terence Parr is one of those people that pushes the research into interesting new areas (he wrote something a few months ago about simplifying the deeply-recursive Expression grammar common in most languages that I found very insightful). The architecture is pretty cool too. Text input is consumed and ASTs are constructed using token grammars, which are then transformed using tree-grammars, and code-generation is performed by output grammars. It's a very elegant system, and I've seen some example projects that used a sequence of those grammars to translate code between different programming languages. It's cool stuff. So I appreciate ANTLR from that perspective. I think the theory behind the project is top-notch. But the syntax sucks. Badly. The learning curve is waaaay too steep for me, so I've always had to keep the documentation close by. And once the grammars are written, they're hard to read and maintain. Also, there's a strong bias in the ANTLR community toward ASTs. I prefer to construct a somewhat higher-level parse tree. For example: given the expression "1 + 2", I'd like the parser to construct a BinaryOperator node, with two Expression node children and an enum "operator" field of "PLUS". I'd like it to use a set of pre-defined "parse model" classes that I've written to represent the language elements. It's hard to do that kind of thing in ANTLR, which usually just creates a "+" node with children of "1" and "2". The majority of my parser-generator experience has been with JavaCC, which leaves model-generation to the user, which works better for me. --benji

My feeling's exactly (or near enough)
Aug 26 2008
prev sibling parent reply superdan <super dan.org> writes:
Benji Smith Wrote:

 superdan wrote:
 Noooooooobody uses backtracking to parse.

guess that makes perl regexes et al noooooooobody.

I suppose it depends on your definition of "parse".

well since you was gloating about handling a csv file as "parsing" i thot i'd lower my definition accordingly :) p.s. sorry benji. you are cool n all (tho to be brutally honest listenin' more an' talkin' less always helps) but you keep on raisin' those easy balls fer me. what can i do? i keep on dunkin'em ;)
Aug 26 2008
next sibling parent reply Benji Smith <dlanguage benjismith.net> writes:
superdan wrote:
 tho to be brutally honest listenin' more an' talkin' less always helps

LOL Coming from you, dan, that's my favorite ironic quote of the day :) --benji
Aug 26 2008
parent reply superdan <super dan.org> writes:
Benji Smith Wrote:

 superdan wrote:
 tho to be brutally honest listenin' more an' talkin' less always helps

LOL Coming from you, dan, that's my favorite ironic quote of the day :)

meh. whacha sayin'? i ain't talking much.
Aug 26 2008
parent reply "Nick Sabalausky" <a a.a> writes:
"superdan" <super dan.org> wrote in message 
news:g91uku$2l93$1 digitalmars.com...
 Benji Smith Wrote:

 superdan wrote:
 tho to be brutally honest listenin' more an' talkin' less always helps

LOL Coming from you, dan, that's my favorite ironic quote of the day :)

meh. whacha sayin'? i ain't talking much.

missing capitalization that do nothing but hide any kernels of relevance that may or may not exist, yes.
Aug 26 2008
parent superdan <super dan.org> writes:
Nick Sabalausky Wrote:

 "superdan" <super dan.org> wrote in message 
 news:g91uku$2l93$1 digitalmars.com...
 Benji Smith Wrote:

 superdan wrote:
 tho to be brutally honest listenin' more an' talkin' less always helps

LOL Coming from you, dan, that's my favorite ironic quote of the day :)

meh. whacha sayin'? i ain't talking much.

missing capitalization that do nothing but hide any kernels of relevance that may or may not exist, yes.

don't be hat'n' :)
Aug 26 2008
prev sibling parent reply BCS <ao pathlink.com> writes:
Reply to superdan,

 Benji Smith Wrote:
 
 superdan wrote:
 
 Noooooooobody uses backtracking to parse.
 



thot i'd lower my definition accordingly :) p.s. sorry benji. you are cool n all (tho to be brutally honest listenin' more an' talkin' less always helps) but you keep on raisin' those easy balls fer me. what can i do? i keep on dunkin'em ;)

A CSV parser can be interesting if you have high enough performance demands (e.g. a total memory footprint smaller than a single field might be)
Aug 26 2008
parent reply Benji Smith <dlanguage benjismith.net> writes:
 superdan wrote:

thot i'd lower my definition accordingly :)


I don't know about "gloating". I mentioned it because it was relevant to the conversation about places where streaming parsers are useful. But I can't see how it was gloating. Geez. Why is everything a challenge to you? Why can't you just have a conversation without getting all argumentative? BCS wrote:
 A CVS parser can be interesting if you have high enough performance 
 demands (e.g. total memory footprint smaller than a single field might be)

It's also interesting from the perspective that you can write a basic parser, using a dirt-simple grammar, that performs no backtracking. In the world of parsers, it's about as simple and braindead as you get, but it's damn handy nevertheless. It's possible to do the same thing with a regular expression, but it's very tricky to correctly handle all the weird newline issues, and it's even harder to avoid backtracking. I've done it both ways, and the regex solution sucks compared to using a real parser generator. --benji
Aug 26 2008
parent superdan <super dan.org> writes:
Benji Smith Wrote:

 superdan wrote:

thot i'd lower my definition accordingly :)


I don't know about "gloating".

was jesting. 'twas too good a comeback after u switched the definition of parsing on me. twice :)
 I mentioned it, because it was relevant 
 to the conversation about places where streaming parsers are useful. But 
 I can't see how it was gloating. Geez.

 Why is everything a challenge to you? Why can't you just have a 
 conversation, without getting all argumentative?

conversatin's cool. but if you says something wrong and i happen to knows how it is i'll say how it is.
Aug 26 2008
prev sibling parent BCS <ao pathlink.com> writes:
Reply to Benji,

 superdan wrote:
 
 Benji Smith Wrote:
 
 I wrote a streaming CSV parser (which takes discipline to do
 correctly, since a double-quote enclosed field can legally contain
 arbitrary newline characters, and quotes are escaped by doubling).
 It provides a field callback and a record callback, so it's very
 handy for performing ETL tasks.
 
 If I had to load the whole CSV files into memory before parsing, it
 wouldn't work, because sometimes they can be hundreds of megabytes.
 But the streaming parser takes up almost no memory at all.
 
 --benji
 

in fact. it's the finite state needed by the fsa. u could do that because csv only needs finite state for parsing. soon as you need to backtrack stream parsing becomes very difficult.

Most of the time LL(k) token lookahead solves the problem. Sometimes you need a syntactic predicate or (rarely) a semantic predicate. I've never even heard of a parser generator framework that supported backtracking. --benji

Antlr, dparse and (IIRC) eniki all do
Aug 26 2008
prev sibling parent Robert Fraser <fraserofthenight gmail.com> writes:
BCS wrote:
 Reply to Robert,
 
 BCS wrote:

 Reply to Benji,

 The new JSON parser in the Tango library operates on templated
 string arrays. If I want to read from a file or a socket, I have to
 first slurp the whole thing into a character array, even though the
 character-streaming would be more practical.

to be throwing away most of it *while you parse it, not after* The best way to parse a file is to load it all in one OS system call and then run a slicing parser (like the Tango XML parser) on that. One memory allocation and one load or a mmap, and then only the meta structures get allocated later.

fit easily in main memory. I think a stream-processing SAX parser would be a good addition to (though perhaps not a replacement for) the existing one.

If you can't fit the data file in memory, then I find it hard to believe you will be able to hold the parsed file in memory. If you can program the parser to dump unneeded data on the fly, or to process and discard the data, that might make a difference.

I think that's one of the reasons to use a streaming parser -- so you can dump data on the fly.
Aug 26 2008
prev sibling next sibling parent Christopher Wright <dhasenan gmail.com> writes:
superdan wrote:
 but if u have strings like today it's a no-brainer to define a class that does
all that stuff. u can then use that class whenever you feel. it would be
madness to put that class in the language definition. at best it's a candidate
for the stdlib.

Instead, the runtime has to know how to convert between utf8, utf16, and utf32. Encodings are not a trivial matter, either.
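For example, Phobos exposes those conversions directly through std.utf (shown here assuming a reasonably current compiler; Tango's equivalent lives in tango.text.convert.Utf):

```d
import std.utf : toUTF16, toUTF32;

void main()
{
    const(char)[] u8 = "héllo"; // 6 bytes of UTF-8 for 5 code points
    auto u16 = toUTF16(u8);     // wchar-based, 5 UTF-16 code units
    auto u32 = toUTF32(u8);     // dchar-based, 5 code points

    assert(u8.length == 6);     // length counts code units, not characters
    assert(u32.length == 5);
}
```

The length mismatch is exactly the kind of detail the runtime (and any string-handling code) has to get right.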
Aug 25 2008
prev sibling next sibling parent Jesse Phillips <jessekphillips gmail.com> writes:
On Mon, 25 Aug 2008 20:52:04 -0400, Benji Smith wrote:

 superdan wrote:
 But the "small components" are the *interfaces*, not the
 implementation details.

quite when i thought i drove a point home... dood we need to talk. you have all core language, primitives, libraries, and app code confused.

The standard libraries are in a grey area between the language spec and application code. There are all sorts of implicit "interfaces" exposed by the builtin types (and there's also plenty of core language functionality implemented in the standard lib... take the GC, for example). You act like there's no such thing as an interface for a builtin language feature. With strings implemented as raw arrays, they take on the array API...

slicing: broken
indexing: busted
iterating: fucked
length: you guessed it

I don't think the internals of the string representation should be any different. UTF-8 arrays? Fine by me. Just don't make me look at the malformed, mis-sliced bytes. Provide an API (yes, implemented in the standard lib, but specified by the language spec) that actually makes sense for text data.

(Incidentally, this is the same reason I think the builtin dynamic arrays should be classes implementing a standard List interface, and the associative arrays should be classes implementing a Map interface. The language implementations are nice, but they're not polymorphic, and that makes it a pain in the ass to extend them.)

--benji
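For concreteness, the kind of polymorphic text API being argued for might look like this. Everything here is hypothetical (neither Phobos nor Tango defines these names, which are modeled on Java's CharSequence); the point is only that arrays can't implement such an interface, while classes can:

```d
// Hypothetical polymorphic text interface.
interface CharSequence
{
    size_t length();
    dchar charAt(size_t i); // a decoded code point, never a raw byte
    CharSequence slice(size_t lo, size_t hi);
}

// One possible backing store; a UTF-8-backed class, a rope, or a
// memory-mapped file could implement the same interface.
class DcharString : CharSequence
{
    private const(dchar)[] data;
    this(const(dchar)[] d) { data = d; }
    size_t length() { return data.length; }
    dchar charAt(size_t i) { return data[i]; }
    CharSequence slice(size_t lo, size_t hi)
    {
        return new DcharString(data[lo .. hi]);
    }
}

void main()
{
    CharSequence s = new DcharString("hello"d);
    assert(s.length == 5 && s.charAt(1) == 'e');
    assert(s.slice(1, 4).length == 3);
}
```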

On the language spec vs. the standard library: while the GC is implemented in the standard library, I do not believe the spec says it has to be (though I don't think it is possible otherwise). So the spec could state that strings should be implemented your way, but it shouldn't. On another note, I must say this has been quite a turnaround. There have been many posts in the past with people arguing over having a String class; I think those people have been staying out of this one. But nonetheless, it is nothing new.
Aug 25 2008
prev sibling parent reply Benji Smith <dlanguage benjismith.net> writes:

superdan wrote:
 For starters, with strings implemented as character arrays, writing
 library code that accepts and operates on strings is a bit of a pain in
 the neck, since you always have to write templates and template code is
 slightly less readable than non-template code. You can't distribute 


 code as a DLL or a shared object, because the template instantiations
 won't be included (unless you create wrapper functions with explicit
 template instantiations, bloating your code size, but more importantly
 tripling the number of functions in your API).

so u mean with a class the encoding char/wchar/dchar won't be an

 problem is that means there's an indirection cost for every character 

consistently must pay a price for stuff they don't use.

So, I was thinking about the actual costs involved with the String class and CharSequence interface design that I'd like to see (and that exists in languages like Java and C#). There's the cost of the class wrapper itself, the cost of internally representing and converting between encodings, and the cost of routing all method calls through an interface vtable. Characters, if always represented using two bytes, would consume twice the memory. And returning characters from method calls has got to be slower than accessing them directly from arrays. Right?

So I wrote some tests, in Java and in D/Tango. The source code files are attached. Both of the tests perform a common set of string operations (searching, splitting, concatenating, and character-iterating). I tried to make the functionality as identical as possible, though I wasn't sure which technique to use for splitting text in Tango, so I used both the "Util.split" and "Util.delimit" functions.

I ran both tests using a 5MB text file, "The Complete Works of William Shakespeare", from the Project Gutenberg website:

http://www.gutenberg.org/dirs/etext94/shaks12.txt

You can grab it for yourself, or you can just run the code against your favorite large text file.

I compiled and ran the Java code in the 1.6.0_06 JDK, with the "-server" flag. The D code was compiled with DMD 1.034 and Tango 0.99.7, using the "-O -release -inline" flags. My test machine is an AMD Turion 64 X2 dual-core laptop, with 2GB of RAM and running WinXP SP3. I ran the tests eight times each, using fine-resolution timers. These are the median results:

LOADING THE FILE INTO A STRING: D/Tango wins, by 428%
  D/Tango: 0.02960 seconds
  Java:    0.12675 seconds

ITERATING OVER CHARS IN A STRING: Java wins, by 280%
  D/Tango: 0.10093 seconds
  Java:    0.03599 seconds

SEARCHING FOR A SUBSTRING: D/Tango wins, by 218%
  D/Tango: 0.02251 seconds
  Java:    0.04915 seconds

SEARCH & REPLACE INTO A NEW STRING: D/Tango wins, by 226%
  D/Tango: 0.17685 seconds
  Java:    0.39996 seconds

SPLIT A STRING ON WHITESPACE: Java wins, by 681% (against tango.text.Util.delimit()) and by 313% (against tango.text.Util.split())
  D/Tango (delimit): 8.28195 seconds
  D/Tango (split):   3.80465 seconds
  Java (split):      1.21477 seconds

CONCATENATING STRINGS: Java wins, by 884%
  D/Tango (array concat, no pre-alloc): 4.07929 seconds
  Java (StringBuilder, no pre-alloc):   0.46150 seconds

SORT STRINGS (CASE-INSENSITIVE): D/Tango wins, by 226%
  D/Tango: 1.62227 seconds
  Java:    3.66389 seconds

It looks like D mostly falls down when it has to allocate a lot of memory, even if it's just allocating slices. The D performance for string splitting really surprised me.

I was interested to see, though, that Java was so much faster at iterating through the characters in a string, since I used the charAt(i) method of the CharSequence interface, rather than directly iterating through a char[] array, or even calling the charAt method on the String instance. And yet, character iteration is almost 3 times as fast as in D.

Down with premature optimization! Design the best interfaces possible, to enable the most pleasant and flexible programming idioms. The performance problems can be solved. :-P

--benji
Aug 26 2008
parent bearophile <bearophileHUGS lycos.com> writes:
Benji Smith:
 It looks like D mostly falls down when it has to allocate a lot of 
 memory, even if it's just allocating slices. The D performance for 
 string splitting really surprised me.

String splitting requires a lot of work from the GC. HotSpot's GC is light years ahead of the current D GC. You can see that by measuring just the time the D GC takes to deallocate a large array of the split substrings. I have posted several benchmarks here about this topic.
 I was interested to see, though, that Java was so much faster at 
 iterating through the characters in a string, since I used the charAt(i) 
 method of the CharSequence interface, rather than directly iterating 
 through a char[] array, or even calling the charAt method on the String 
 instance.

HotSpot is able to inline a lot of virtual methods too; D can't do those things.
 Down with premature optimization! Design the best interfaces possible, 
 to enable the most pleasant and flexible programing idioms. The 
 performance problems can be solved. :-P

They currently can't be solved by the backends of DMD and GDC; only HotSpot (and maybe the .NET compiler on Windows) is able to do that. I don't know if LLVM will be able to perform some of those things. Bye, bearophile
Aug 26 2008
prev sibling next sibling parent JAnderson <ask me.com> writes:
Benji Smith wrote:
 In another thread (about array append performance) I mentioned that 
 Strings ought to be implemented as classes rather than as simple builtin
 arrays. Superdan asked why. Here's my response...
 
 I'll start with a few of the softball, easy reasons.
 

 polymorphism on character arrays. Arrays can't have subclasses, and they 
 can't implement interfaces.

I don't think polymorphic strings are right for D. This is the sort of thing a library could implement, but D should (and does) provide the basic components from which to build more complex ones. You can already extend D strings by using a string as a component, if necessary. I don't want all this extra overhead in the primitive array type. It seems to me a classic case of feature creep: pretty soon we have something that has been designed for everything but its original purpose.

I'm ok with having features like hash caching as long as they can be implemented without changing the core mechanics of the primitive.

To me it's not even correct design to inherit from a concrete class (there are quite a few books on that: Effective C++ talks about it a bit, and so does Sutter and Alexandrescu's "C++ Coding Standards" with its 101 rules). I think there are *much better* ways to handle this sort of thing. Personally I don't want to encourage that sort of design.

-Joel
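The composition approach Joel alludes to ("using strings as a component") can be sketched quickly; this is a hypothetical illustration in Python (the Token name and fields are invented for the example), not code from the thread:

```python
# Hypothetical sketch: instead of subclassing the string type, a Token
# holds the string as a component and forwards to it, adding only the
# extra behaviour a parser needs (here: source position metadata).
class Token:
    def __init__(self, text, line, column):
        self.text = text      # the wrapped string component
        self.line = line      # extra parser metadata
        self.column = column

    # Forward the string-like operations we actually need.
    def __len__(self):
        return len(self.text)

    def __str__(self):
        return self.text

    def startswith(self, prefix):
        return self.text.startswith(prefix)

tok = Token("while", line=3, column=10)
assert len(tok) == 5 and tok.startswith("wh")
```

The same pattern works in D with a struct or class holding a `char[]` member, without any change to the built-in array type.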
Aug 26 2008
prev sibling next sibling parent reply Walter Bright <newshound1 digitalmars.com> writes:
Benji Smith wrote:
 For starters, with strings implemented as character arrays, writing 
 library code that accepts and operates on strings is a bit of a pain in 
 the neck, since you always have to write templates and template code is 
 slightly less readable than non-template code.
 You can't distribute your 
 code as a DLL or a shared object, because the template instantiations 
 won't be included (unless you create wrapper functions with explicit 
 template instantiations, bloating your code size, but more importantly 
 tripling the number of functions in your API).

Is the problem you're referring to the fact that there are 3 character types?
 Another good low-hanging argument is that strings are frequently used as 
 keys in associative arrays. Every insertion and retrieval in an 
 associative array requires a hashcode computation. And since D strings 
 are just dumb arrays, they have no way of memoizing their hashcodes.

True, but I've written a lot of string processing programs (compilers are just one example of such). This has never been an issue, because the AA itself memoizes the hash, and from then on the dictionary handle is used.
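Hashcode memoization, whether done by the AA's own nodes as Walter describes or by a string wrapper as Benji proposed, amounts to computing the hash once and reusing it. A minimal sketch in Python (class and counter are invented for illustration):

```python
class MemoizedString:
    """Illustrative wrapper: computes its hashcode once, then reuses it."""
    hash_computations = 0  # counter, only to demonstrate the memoization

    def __init__(self, text):
        self.text = text
        self._hash = None   # not yet computed

    def __hash__(self):
        if self._hash is None:
            MemoizedString.hash_computations += 1
            self._hash = hash(self.text)
        return self._hash

    def __eq__(self, other):
        return isinstance(other, MemoizedString) and self.text == other.text

s = MemoizedString("some dictionary key")
d = {s: 1}                   # first insertion computes the hash
for _ in range(100):
    _ = d[s]                 # repeated lookups reuse the stored hash
assert MemoizedString.hash_computations == 1
```

Java's String does exactly this with its cached `hash` field; Walter's point is that when the AA stores the hash in its own nodes, the value type doesn't need to.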
 We've already observed that D assoc arrays are less performant than even 
 Python maps, so the extra cost of lookup operations is unwelcome.

Every one of those benchmarks that purported to show that D AA's were relatively slow turned out to be, on closer examination, D running the garbage collector more often than Python does. It had NOTHING to do with the AA's.
 But much more important than either of those reasons is the lack of 
 polymorphism on character arrays. Arrays can't have subclasses, and they 
 can't implement interfaces.
 
 A good example of what I'm talking about can be seen in the Phobos and 
 Tango regular expression engines. At least the Tango implementation 
 matches against all string types (the Phobos one only works with char[] 
 strings).
 
 But what if I want to consume a 100 MB logfile, counting all lines that 
 match a pattern?
 
 Right now, to use the either regex engine, I have to read the entire 
 logfile into an enormous array before invoking the regex search function.
 
 Instead, what if there was a CharacterStream interface? And what if all 
 the text-handling code in Phobos & Tango was written to consume and 
 return instances of that interface?
 
 A regex engine accepting a CharacterStream interface could process text 
 from string literals, file input streams, socket input streams, database 
 records, etc, etc, etc... without having to pollute the API with a bunch 
 of casts, copies, and conversions. And my logfile processing application 
 would consume only a tiny fraction of the memory needed by the character 
 array implementation.
 
 Most importantly, the contract between the regex engine and its 
 consumers would provide a well-defined interface for processing text, 
 regardless of the source or representation of that text.

I think a better solution is for regexp to accept an Iterator as its source. That doesn't require polymorphic behavior via inheritance, it can do polymorphism by value (which is what templates do).
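Walter's iterator idea can be sketched with duck typing, which plays the same role in Python that template (by-value) polymorphism plays in D; the function name here is invented for illustration. The same routine consumes an in-memory list or a file-like stream, which is Benji's 100 MB logfile scenario without the enormous array:

```python
import io
import re

def count_matches(lines, pattern):
    """Count lines matching `pattern` from ANY iterable of strings:
    a list, a file object, a socket wrapper... No inheritance needed."""
    rx = re.compile(pattern)
    return sum(1 for line in lines if rx.search(line))

# Works on an in-memory list...
assert count_matches(["GET /a", "POST /b", "GET /c"], r"^GET") == 2

# ...and, unchanged, on a file-like stream read line by line,
# never holding the whole "logfile" in one big array.
log = io.StringIO("GET /a\nPOST /b\nGET /c\n")
assert count_matches(log, r"^GET") == 2
```

In D the equivalent would be a function templated on its source type, instantiated once per concrete iterator, with no virtual dispatch at the call site.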
 
 Along a similar vein, I've worked on a lot of parsers over the past few 
 years, for domain specific languages and templating engines, and stuff 
 like that. Sometimes it'd be very handy to define a "Token" class that 
 behaves exactly like a String, but with some additional behavior. 
 Ideally, I'd like to implement that Token class as an implementor of the 
 CharacterStream interface, so that it can be passed directly into other 
 text-handling functions.
 
 But, in D, with no polymorphic text handling, I can't do that.

Templates are the ideal solution to that, and the more specific idiom is to use iterators.
 But then again, I haven't used any of the const functionality in D2, so 
 I can't actually comment on relative usability of compiler-enforced 
 immutability versus interface-enforced immutability.

From my own experience, I didn't 'get' invariant strings until I'd used them for a while.
Aug 26 2008
next sibling parent Benji Smith <dlanguage benjismith.net> writes:
Walter Bright wrote:
 Benji Smith wrote:
 You can't distribute your code as a DLL or a shared object, because 
 the template instantiations won't be included (unless you create 
 wrapper functions with explicit template instantiations, bloating your 
 code size, but more importantly tripling the number of functions in 
 your API).

Is the problem you're referring to the fact that there are 3 character types?

Basically, yeah. With three different character types, and two different array types (static & dynamic), and in D2 with const, invariant, and mutable types (and soon with shared and unshared), the number of ways of representing a "string" in the type system is overwhelming.

This afternoon I was writing some string-processing code that I intend to distribute in a library, and I couldn't help thinking to myself, "This code is probably broken for anything but the most squeaky-clean ASCII text."

I don't mind that there are different character types, or that there are different character encodings. But I want to deal with those issues in exactly *one* place: in my string constructor (and, very rarely, during IO). But 99% of the time, I want to just think of the object as a String, with all the ugly details abstracted away.
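The "one place" Benji wants can be sketched as a constructor that normalizes every representation on the way in; this is a hypothetical illustration (the class name and fields are invented), not a proposal from the thread:

```python
class MyString:
    """Hypothetical sketch: all decoding/validation happens here, in the
    constructor; the rest of the program sees one uniform text type."""
    def __init__(self, data, encoding="utf-8"):
        if isinstance(data, bytes):
            # The only place raw encodings are touched.
            self.text = data.decode(encoding)
        else:
            self.text = str(data)

    def upper(self):
        return self.text.upper()

# The caller never worries about the representation again.
a = MyString(b"caf\xc3\xa9")        # raw UTF-8 bytes
b = MyString("café")                # already-decoded text
assert a.text == b.text == "café"
```

The D analogue would be a struct whose constructor accepts char[], wchar[], or dchar[] and transcodes to one internal representation.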
 Another good low-hanging argument is that strings are frequently used 
 as keys in associative arrays. Every insertion and retrieval in an 
 associative array requires a hashcode computation. And since D strings 
 are just dumb arrays, they have no way of memoizing their hashcodes.

True, but I've written a lot of string processing programs (compilers are just one example of such). This has never been an issue, because the AA itself memoizes the hash, and from then on the dictionary handle is used.

Cool. The hashcode-memoization thing was really just a catalyst to get me thinking. It's really at the periphery of my concerns with Strings.
 We've already observed that D assoc arrays are less performant than 
 even Python maps, so the extra cost of lookup operations is unwelcome.

Every one of those benchmarks that purported to show that D AA's were relatively slow turned out to be, on closer examination, D running the garbage collector more often than Python does. It had NOTHING to do with the AA's.

Ah. Good point. Thanks for clarifying. I didn't remember all the follow-up details.
 Most importantly, the contract between the regex engine and its 
 consumers would provide a well-defined interface for processing text, 
 regardless of the source or representation of that text.

I think a better solution is for regexp to accept an Iterator as its source. That doesn't require polymorphic behavior via inheritance, it can do polymorphism by value (which is what templates do).

That's a great idea. I should clarify that my referring to an "interface" was in the informal sense. (Though I think actual interfaces would be a reasonable solution.) But any sort of contract between text-data-structures and text-processing-routines would fit the bill nicely.
 But then again, I haven't used any of the const functionality in D2, 
 so I can't actually comment on relative usability of compiler-enforced 
 immutability versus interface-enforced immutability.

From my own experience, I didn't 'get' invariant strings until I'd used them for a while.

I actually kind of think I'm on the other side of the issue. I've been primarily a Java programmer (8 years) and secondarily a C# programmer (3 years), so immutable Strings are the only thing I've ever used. Lots of the other JDK classes are like that too. So, from my perspective, it seems like the ideal, low-impact way of enforcing immutability is to have the classes enforce it on themselves. I've never felt the need for compiler-enforced const semantics in any of the work I've done.

Thanks for your replies! I always appreciate hearing from you.

--benji
Aug 26 2008
prev sibling parent reply bearophile <bearophileHUGS lycos.com> writes:
Walter Bright:
 We've already observed that D assoc arrays are less performant than even 
 Python maps, so the extra cost of lookup operations is unwelcome.

Every one of those benchmarks that purported to show that D AA's were relatively slow turned out to be, on closer examination, D running the garbage collector more often than Python does. It had NOTHING to do with the AA's.

Really? I must have missed those conclusions then, despite reading all the posts on the subject. What solutions do you propose for the problem then? I recall that disabling the GC didn't improve the situation much. So the problem now becomes how to improve the D GC?

On my site I am keeping a gallery of tiny benchmarks where D code (with DMD) is 10 or more times slower than very similar Python, C, or Java code (I have about 12 programs so far, very different from each other; there's a benchmark regarding associative arrays too). Hopefully it will become useful once people start tuning D implementations.

Bye,
bearophile
Aug 26 2008
next sibling parent reply Tomas Lindquist Olsen <tomas famolsen.dk> writes:
bearophile wrote:
 Walter Bright:
 We've already observed that D assoc arrays are less performant than even 
 Python maps, so the extra cost of lookup operations is unwelcome.

Every one of those benchmarks that purported to show that D AA's were relatively slow turned out to be, on closer examination, D running the garbage collector more often than Python does. It had NOTHING to do with the AA's.

Really? I must have missed those conclusions then, despite reading all the posts on the subject. What solutions do you propose for the problem then? I recall that disabling the GC didn't improve the situation much. So the problem now becomes how to improve the D GC? In my site I am keeping a gallery of tiny benchmarks where D code (with DMD) is 10 or more times slower than very equivalent Python, C, Java code (I have about 12 programs so far, very different from each other. There's a benchmark regarding the associative arrays too). Hopefully it will become useful once people will start tuning D implementations. Bye, bearophile

Might I ask where that site is? I'd like to compare them against LLVMDC if possible Tomas
Aug 26 2008
parent reply bearophile <bearophileHUGS lycos.com> writes:
Tomas Lindquist Olsen:
 Might I ask where that site is?

I have sent you an email with information and more things, etc. Bye, bearophile
Aug 26 2008
parent Tomas Lindquist Olsen <tomas famolsen.dk> writes:
bearophile wrote:
 Tomas Lindquist Olsen:
 Might I ask where that site is?

I have sent you an email with information and more things, etc. Bye, bearophile

Got it. Thanx :) I'll give it a go over the weekend :) Tomas
Aug 28 2008
prev sibling next sibling parent reply Walter Bright <newshound1 digitalmars.com> writes:
bearophile wrote:
 Walter Bright:
 We've already observed that D assoc arrays are less performant
 than even Python maps, so the extra cost of lookup operations is
 unwelcome.

Every one of those benchmarks that purported to show that D AA's were relatively slow turned out to be, on closer examination, D running the garbage collector more often than Python does. It had NOTHING to do with the AA's.

Really? I must have missed those conclusions then, despite reading all the posts on the subject. What solutions do you propose for the problem then? I recall that disabling the GC didn't improve the situation much. So the problem now becomes how to improve the D GC?

In my experience with such programs, disabling the collection cycles brought the speed up to par.
Aug 26 2008
parent reply bearophile <bearophileHUGS lycos.com> writes:
Walter Bright:

In my experience with such programs, disabling the collection cycles brought
the speed up to par.<

In my experience there's some difference still. The usual disclaimer: benchmarks are tricky things, so anyone is invited to spot problems in my code.

A very simple benchmark:

// D without GC
import std.gc: disable;
void main() {
    int[int] d;
    disable();
    for (int i; i < 10_000_000; ++i)
        d[i] = 0;
}

# Python+Psyco without GC
from gc import disable
def main():
    d = {}
    disable()
    for i in xrange(10000000):
        d[i] = 0
import psyco; psyco.full()
main()

hash without GC, n = 10_000_000:
    D:     9.12 s
    Psyco: 1.45 s

hash2 with GC, n = 10_000_000:
    D:     9.80 s
    Psyco: 1.46 s

If Psyco isn't used, the Python version without GC requires 2.02 seconds. This means 2.02 - 1.45 = 0.57 s are needed by the Python virtual machine just to run those 10_000_000 loops :-)

Warm tests, best of 3, performed with Python 2.5.2, Psyco 1.6, on Win XP, and the latest DMD with -O -release -inline.

Python integers are objects, rather bigger than 4 bytes, and they can grow "naturally" to become multi-precision integers:
>>> a = 2147483647
>>> a
2147483647
>>> a + 1
2147483648L
>>> type(a)
<type 'int'>
>>> type(a + 1)
<type 'long'>
>>> type(7 ** 5)
<type 'int'>
>>> type(7 ** 55)
<type 'long'>
Bye, bearophile
Aug 26 2008
next sibling parent reply Walter Bright <newshound1 digitalmars.com> writes:
bearophile wrote:
 Walter Bright:
 
 In my experience with such programs, disabling the collection
 cycles brought the speed up to par.<

In my experience there's some difference still. The usual disclaimer: benchmarks are tricky things, so anyone is invited to spot problems in my code.

I invite you to look at the code in internal/aaA.d and do some testing!
Aug 26 2008
parent reply Fawzi Mohamed <fmohamed mac.com> writes:
On 2008-08-28 13:33:47 +0200, "Manfred_Nowak" <svv1999 hotmail.com> said:

 Walter Bright wrote:
 
 I invite you to look at the code in internal/aaA.d and do some
 testing!

This invitation is a red herring without the offer to change the language, because there exists no implementation for AA covering all possible use cases. The bare minimum to get anything out of fiddling with the implementations for AA is the possibility to use the results of adaptations without considerable overhead, especially for the declarations. However, currently I do not see any elegant solution, because the types of the implementations of maps are given implicitly. At least something like prototyping seems to be necessary:

|   int[char[]] map;
|   Prototype = typeof(map);
|   Prototype.implementation = MyAA;

where MyAA is some class type implementing the interface required for AA. Are you willing to do something in this direction?

I think that the invitation should be read as the possibility to experiment with some changes to the AA, see their effect, and if worthwhile contribute them back, so that they can be applied to the "official" version.

Making the standard version changeable seems just horrible from the standpoint of portability, maintainability and clarity of the code: if the standard version is not ok for your use you should explicitly use another one, otherwise mixing code that uses two different standard versions becomes a nightmare.

On the other hand, if you think that you can improve the standard version for everybody, changing internal/aaA.d is what you should do...

Fawzi
 
 -manfred

Aug 28 2008
next sibling parent reply "Manfred_Nowak" <svv1999 hotmail.com> writes:
Fawzi Mohamed wrote:

 I think

Sorry. Although I cancelled my posting within seconds, you grabbed it even faster. -manfred -- If life is going to exist in this Universe, then the one thing it cannot afford to have is a sense of proportion. (Douglas Adams)
Aug 28 2008
parent reply Fawzi Mohamed <fmohamed mac.com> writes:
On 2008-08-28 19:33:52 +0200, "Manfred_Nowak" <svv1999 hotmail.com> said:

 Fawzi Mohamed wrote:
 
 I think

Sorry. Although I cancelled my posting within seconds, you grabbed it even faster. -manfred

Well, out of curiosity, how do you cancel a post? (That way I could have removed mine also...)

Fawzi
Aug 29 2008
parent reply "Manfred_Nowak" <svv1999 hotmail.com> writes:
Fawzi Mohamed wrote:

 how do you cancel a post?

By using a news-client, that has this feature. -manfred -- If life is going to exist in this Universe, then the one thing it cannot afford to have is a sense of proportion. (Douglas Adams)
Aug 29 2008
parent Michiel Helvensteijn <nomail please.com> writes:
Manfred_Nowak wrote:

 how do you cancel a post?

By using a news-client, that has this feature.

But the news-server also needs to have this feature, and not all do. (Does this one?) -- Michiel
Aug 29 2008
prev sibling parent reply Walter Bright <newshound1 digitalmars.com> writes:
Fawzi Mohamed wrote:
 I think that the invitation should be read as the possibility to 
 experiment with some changes for AA, see their effect, and if worthwhile 
 provide them back, so that they can be applied to the "official" version.
 
 making the standard version changeable seems just horrible form the 
 portability and maintainability and clarity of the code: if the standard 
 version is not ok for your use you should explicitly use another one, 
 otherwise mixing codes that use two different standard versions becomes 
 a nightmare.

I agree.
 On the other hand if you think that you can improve the standard version 
 for everybody, changing internal/aaA.d is what you should do...

Right.
Aug 29 2008
parent "Manfred_Nowak" <svv1999 hotmail.com> writes:
Walter Bright wrote:

  use two different standard versions becomes a nightmare.
 I agree.

I retracted my posting immediately because it wasn't well thought out. However, the last thing I wanted was to have "several" "standard" versions. So we all agree on this. But even when I read my retracted posting again, I cannot imagine how one could come to the conclusion that I wanted to have several.
 
 On the other hand if you think that you can improve the standard
 version for everybody, changing internal/aaA.d is what you should
 do... 


I wrote about that some years ago and got no answer: what is an improvement for everybody, i.e. what is the general usage? Without an agreed definition of that, every change will make someone else cry. -manfred -- If life is going to exist in this Universe, then the one thing it cannot afford to have is a sense of proportion. (Douglas Adams)
Aug 29 2008
prev sibling parent reply "Manfred_Nowak" <svv1999 hotmail.com> writes:
bearophile wrote:

 anyone is invited to spot problems

Biggest of all: using the wrong tool, i.e. using a hash map for a maximally populated key range. -manfred -- Maybe some knowledge of some types of disagreeing and their relation can turn out to be useful: http://blog.createdebate.com/2008/04/07/writing-strong-arguments/
Aug 27 2008
parent reply bearophile <bearophileHUGS lycos.com> writes:
Manfred_Nowak:
 Biggest of all: using the wrong tool. I.e. using a hash map for a 
 maximal populated key range.

But if the hash machinery is good it must work well in this common situation too.

Anyway, let's see how I can write a benchmark that you may like. I can use a very fast random generator to create random integer keys. This will probably put the Python version at a disadvantage, because such a language isn't fit for doing integer operations as fast as a compiled language (a sum of two integers may be 100 times slower in Python). This problem can be solved by pre-computing the numbers first and putting them into the associative array later. Is this enough to satisfy you?

Note that in the meantime I have created another associative array benchmark; this one is string-based, and this time Python+Psyco comes out only about 2-2.5 times faster. I'll show it when I can...

Bye,
bearophile
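The methodology bearophile describes (pre-compute the random keys, then time only the insertions) can be sketched like this; this is an illustrative harness in modern Python 3 rather than the Python 2 + Psyco setup used in the thread, and the constants are invented for the example:

```python
import random
import time

random.seed(1)
N = 100_000

# Step 1: pre-compute the random integer keys, so that key generation
# is not mixed into the measured time.
keys = [random.randrange(10**9) for _ in range(N)]

# Step 2: time only the associative-array insertions.
t0 = time.perf_counter()
d = {}
for k in keys:
    d[k] = 0
elapsed = time.perf_counter() - t0

# len(d) can be slightly below N if the random keys collide.
assert len(d) <= N
print(f"{len(d)} unique keys inserted in {elapsed:.4f} s")
```

Separating generation from insertion this way keeps the interpreter's integer-arithmetic overhead out of the comparison, which is the fairness concern raised above.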
Aug 27 2008
parent reply "Manfred_Nowak" <svv1999 hotmail.com> writes:
bearophile wrote:

 Biggest of all: using the wrong tool. I.e. using a hash map for a
 maximal populated key range.

But if the hash machinery is good it must work well in this common situation too.

This statement seems to be as true as the statement: "But if a house is good it must perform well in the common situation of an offshore speed boat race."

The problem with your approach is that you have close to no idea whether your candidates are speed boats or houses. The only thing you seem to know is that in both candidates some humans can live for some time. Your first design seems to have placed both candidates offshore. Now you have introduced some randomness in the location. However, you might only be designing some overly complicated tool for computing the percentage of landmass not covered by water on some random planet.

-manfred
Aug 28 2008
next sibling parent reply bearophile <bearophileHUGS lycos.com> writes:
Manfred_Nowak:
 This statement seems to be as true as the statement:<

No, it's a different statement. I'll try to post another, more realistic-looking benchmark later, anyway. In the meantime I'll keep using Python for many of my purposes where I need hash maps and sets (and regular expressions, etc.) instead of D, because in tons of real-world tests I have seen that Python dicts are quite a bit faster.

Bye,
bearophile
Aug 28 2008
parent "Manfred_Nowak" <svv1999 hotmail.com> writes:
bearophile wrote:

 This statement seems to be as true as the statement:


Yes. It is a different statement. So what? If you do not recognize how well the accompanying elaborations tied it to your statement, then there is no value in going any further. Just convince Walter and the case is settled.
 I'll try to post another more real-looking benchmark later

Different people might perceive reality different, don't they? -manfred -- If life is going to exist in this Universe, then the one thing it cannot afford to have is a sense of proportion. (Douglas Adams)
Aug 28 2008
prev sibling parent bearophile <bearophileHUGS lycos.com> writes:
This is another small associative array benchmark, with strings. Please spot
any problem/bug in it.

You can generate the data set with this Python script:


from array import array
from random import randrange, seed
def generate(filename, N):
    fout = file(filename, "w")
    seed(1)
    a = array("B", " ") * 10
    for i in xrange(N):
        n = randrange(3, 11)
        for j in xrange(n):
            a[j] = randrange(97, 123)
        print >>fout, a.tostring()[:n]
import psyco; psyco.full()
generate("words.txt", 1600000)


It generates a text file of about 13.3 MB; each line contains a random word.
Such a dataset isn't exactly like a real one, because real words aren't random:
they contain a lot of redundancy, which may slightly worsen the performance of
a hash function. So this is probably a favourable situation for an associative
array.


The D code:

import std.stream, std.stdio, std.gc;
void main() {
    //disable();
    int[string] d;
    foreach (string line; new BufferedFile("words.txt"))
        d[line.dup] = 0;
}

Note that this program tests the I/O performance too. If you want to avoid that
you can read all the file lines up-front and time just the AA creation. (SEE
BELOW).

This is a first Python+Psyco version; it's not exactly equivalent, because the
Python developers have chosen to keep the newline at the end of each word:

def main():
    d = {}
    for line in file("words.txt"):
        d[line] = 0
import psyco; psyco.full()
main()


This second, slower Python version strips the newline from each line; rstrip()
is similar to std.string.stripr():

def main():
    d = {}
    for line in file("words.txt"):
        d[line.rstrip()] = 0

import psyco; psyco.full()
main()


Few timings:

N = 800_000:
	Psyco:          0.69 s
	Psyco stripped: 0.77 s
	D:              1.26 s
	D no GC:        0.96 s

N = 1_600_000:
	Psyco:          1.19 s
	Psyco stripped: 1.35 s
	D:              2.80 s
	D no GC:        2.08 s

Note that disabling the GC in those two Python programs has no effect on their
running time.

-------------------------------------

To be sure not to compare apples with oranges, I have written two "empty"
benchmarks that measure the time needed just to read the lines:

D code:

import std.stream;
void main() {
    foreach (string line; new BufferedFile("words.txt"))
        line.dup;
}


Python code:

def main():
    for line in file("words.txt"):
        pass
import psyco; psyco.full()
main()


The D version contains a dup to make the comparison more accurate, because the
"line" variable contains an actual copy and not just a slice.

Line reading timings, N = 1_600_000:
  D:     0.58 s
  Psyco: 0.30 s

So the I/O of Phobos is slower and may enjoy some tuning, so my timings of the
hash benchmarks are off.

Removing 0.58 - 0.30 = 0.28 seconds from the timings of the D associative array
benchmarks you have:

N = 1_600_000:
	Psyco:          1.19 s
	Psyco stripped: 1.35 s
	D:              2.80 s
	D no GC:        2.08 s
	D, I/O c:       2.80 - 0.28 = 2.52 s
	D no GC, I/O c: 2.08 - 0.28 = 1.80 s

------------------------------------

To hopefully create a more meaningful benchmark I have then written code that
loads all the lines before creating the hash.

The Python+Psyco code:

from timeit import default_timer as clock
def main():
    words = []
    for line in open("words.txt"):
        words.append(line.rstrip())
    t = clock()
    d = {}
    for line in words:
        d[line] = 0
    print round(clock() - t, 2), "s"
import psyco; psyco.bind(main)
main()


The D code:

import std.stream, std.stdio, std.c.time, std.gc;
void main() {
    string[] words;
    foreach (string line; new BufferedFile("words.txt"))
        words ~= line.dup;
    //disable();
    auto t = clock();
    int[string] d;
    foreach (word; words)
        d[word] = 0;
    writefln((cast(double)(clock()-t))/CLOCKS_PER_SEC, " s");
}


Timings, N = 1_600_000:
	Psyco:          0.61 s  (total running time: 1.36 s)
	D:              1.42 s  (total running time: 2.46 s)
	D no GC:        1.22 s  (total running time: 2.28 s)

Bye,
bearophile
Aug 28 2008
prev sibling parent reply Fawzi Mohamed <fmohamed mac.com> writes:
On 2008-08-26 14:15:28 +0200, bearophile <bearophileHUGS lycos.com> said:

 [...]
 In my site I am keeping a gallery of tiny benchmarks where D code (with 
 DMD) is 10 or more times slower than very equivalent Python, C, Java 
 code (I have about 12 programs so far, very different from each other. 
 There's a benchmark regarding the associative arrays too). Hopefully it 
 will become useful once people will start tuning D implementations.
 
 Bye,
 bearophile

You know, I have the impression that you have a naive view of data structures: each time you find a performance problem, you ask for the data structure to be improved. One cannot expect a single data structure to accommodate all uses; just because something is a container and supports a given operation does not mean that it supports it efficiently.

*If* something is slow for a given purpose, what I do is sit down and think a little about which data structure is optimal for my problem, and then switch to it (maybe taking it from the Tango containers).

Don't get me wrong, it is useful to know which usage patterns give performance problems with the default data structures, and if associative arrays used a tree or some sorted structure for small sizes (avoiding the cost of hashing) I would not complain, but I do not think (for example) that arrays should necessarily be heavily optimized for appending...

Fawzi
Aug 26 2008
parent reply bearophile <bearophileHUGS lycos.com> writes:
Fawzi Mohamed:
You know I have got the impression that you have a naive view of
datastructures,<

I think you are wrong, see below.
One cannot expect to have a single data structure accomodate all uses,<

My views on the topic are:

- Each data structure (DS) is a compromise: it supports some operations with a certain performance while giving you a different performance on other operations, so it gives you a performance profile. Generally you can't have a DS with the best performance for every operation.

- Sometimes you may want to choose a DS with worse performance just because its implementation is simpler, to reduce development time, bug count, etc.

- The standard library of a modern language has to contain most of the most common DSs, to avoid lots of problems, speed up programming, etc.

- If the standard library doesn't contain a certain DS, or the DS required is very uncommon, or the performance profile you need for a time-critical part of your code is very sharp, then your language is supposed to allow you to write your own DS (some scripting languages may require you to drop to a lower-level language to do this).

- A modern language is supposed to have some built-in DSs. Which ones to choose? This is a tricky question, but the answer I give is that a built-in data structure has to be very flexible, so it has to be efficient enough in a large variety of situations without being optimal for any one of them. This allows programmers to use it in most situations, where maximum performance isn't required, so the programmer has to reach for DSs from the standard library (or even his/her own) only once in a while. Creating a very flexible DS is not easy: it requires a lot of tuning and tons of benchmarks done on real code, and such a DS often needs some extra memory to be flexible (you can even add subsystems to such a DS that collect usage statistics at runtime to adapt it to its specific usage in the code).
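The "performance profile" trade-off in the first point can be made concrete with a small illustrative comparison (in Python, since the thread already uses it for benchmarks): two containers hold the same data, but front-insertion is O(n) on a dynamic array and O(1) on a deque, at the price of slower random access on the deque.

```python
from collections import deque
import time

N = 20_000

# A dynamic array: O(1) amortized append at the back,
# but O(n) insertion at the front.
t0 = time.perf_counter()
lst = []
for i in range(N):
    lst.insert(0, i)          # shifts every existing element
list_time = time.perf_counter() - t0

# A deque: O(1) at both ends, at the price of slower random access.
t0 = time.perf_counter()
dq = deque()
for i in range(N):
    dq.appendleft(i)
deque_time = time.perf_counter() - t0

assert list(dq) == lst        # same contents, different cost profile
print(f"list.insert(0,..): {list_time:.4f} s, "
      f"deque.appendleft: {deque_time:.4f} s")
```

Neither container is "wrong"; each is optimal for a different usage pattern, which is exactly the compromise argued for above.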
Don't get me wrong, it is useful to know which usage patterns give performance
problems with the default data structures,<

If you want to write efficient programs such knowledge is very important, even in scripting languages.
and if associative arrays would use a tree or some sorted structure for small
sizes (avoiding the cost of hashing) I would not complain,<

Python's hashes (dicts) are optimized for small sizes too, and they don't use a tree.
but I do not think (for example) that arrays should necessarily be very
optimized for appending...<

From the long thread it seems that allowing fast slices, mutability, and fast append all at once isn't easy. I think all three features are important, so some compromise has to be found, because appending is a common enough operation and at the moment it is slow, or really slow, far too slow. Now you say that the built-in arrays don't need to be very optimized for appending. In the Python world they solve this kind of problem by mining real programs to collect real usage statistics, so they try to learn whether append is actually used in real programs, and how often, where, etc.

Bye,
bearophile
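The standard way to make append amortized O(1) is geometric capacity growth, which is (roughly) what Python lists and Java's StringBuilder do internally; here is a minimal illustrative sketch of the idea (the class is invented for the example, not a proposal for D's runtime):

```python
class GrowableArray:
    """Minimal sketch of amortized O(1) append via capacity doubling."""
    def __init__(self):
        self.capacity = 4
        self.length = 0
        self.storage = [None] * self.capacity
        self.reallocations = 0    # counter, just to show how rare growth is

    def append(self, value):
        if self.length == self.capacity:
            # Grow geometrically: copying happens rarely enough that the
            # total copy cost stays linear in the number of appends.
            self.capacity *= 2
            new_storage = [None] * self.capacity
            new_storage[:self.length] = self.storage[:self.length]
            self.storage = new_storage
            self.reallocations += 1
        self.storage[self.length] = value
        self.length += 1

a = GrowableArray()
for i in range(1000):
    a.append(i)
# 1000 appends need only 8 reallocations: capacity 4 -> 8 -> ... -> 1024.
assert a.length == 1000 and a.reallocations == 8
```

The tension mentioned above is that once other arrays may hold slices into the same buffer, reallocating (or appending in place past a slice) is no longer safe without extra bookkeeping, which is what makes combining fast slices with fast append hard.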
Aug 27 2008
parent reply Fawzi Mohamed <fmohamed mac.com> writes:
On 2008-08-27 13:21:10 +0200, bearophile <bearophileHUGS lycos.com> said:

 Fawzi Mohamed:
 You know I have got the impression that you have a naive view of 
 datastructures,<

I think you are wrong, see below.

good :)
 One cannot expect to have a single data structure accomodate all uses,<

My views on the topic are:
- Each data structure (DS) is a compromise: it lets you do some operations with a certain performance while giving you a different performance on other operations, so it gives you a performance profile. Generally you can't have a DS with the best performance for every operation.
- Sometimes you may want to choose a DS with worse performance just because its implementation is simpler, to reduce development time, bug count, etc.
- The standard library of a modern language has to contain most of the most common DSs, to avoid lots of problems, speed up programming, etc.
- If the standard library doesn't contain a certain DS, or if the DS required is very uncommon, or the performance profile you need for a time-critical part of your code is very sharp, then your language is supposed to allow you to write your own DS (some scripting languages may require you to drop down to a lower-level language to do this).
- A modern language is supposed to have some built-in DSs. Which ones to choose? This is a tricky question, but the answer I give is that a built-in data structure has to be very flexible, so it has to be efficient enough in a large variety of situations without being optimal for any of them. This allows programmers to use it in most situations, where max performance isn't required, so the programmer has to use DSs from the standard library (or even his/her own) only once in a while. Creating a very flexible DS is not easy: it requires lots of tuning and tons of benchmarks done on real code, and your DS often needs some extra memory to be flexible (you can even add subsystems to such a DS that collect its usage statistics at runtime so it adapts itself to the specific usage in the code).
 Don't get me wrong, it is useful to know which usage patterns give 
 performance problems with the default data structures,<

If you want to write efficient programs, such knowledge is very important, even in scripting languages.

on this we agree
 and if associative arrays would use a tree or some sorted structure for 
 small sizes (avoiding the cost of hashing) I would not complain,<

Python hashes are optimized for small sizes too, and they don't use a tree.
 but I do not think (for example) that arrays should necessarily be very 
 optimized for appending...<

From the long thread it seems that allowing fast slices, mutability, and fast append all at once isn't easy. I think all three features are important, so some compromise has to be found, because appending is a common enough operation and at the moment it is slow or really slow, far too slow. Now you say that the built-in arrays don't need to be very optimized for appending. In the Python world they solve this kind of problem by mining real programs to collect real usage statistics. So they try to learn whether append is actually used in real programs, and how often, where, etc.

You know, as you say it depends on the programs, but it also depends on the language: if in a language it is clear that using the default structure you are not supposed to append often, and that if you really have to you use a special method, then the programs written in that language will use that. On the other hand, when you translate code from one language to another you might encounter problems. The standard array as I see it embodies the philosophy of the C array:
- minimal memory overhead (it is ok to have lots of them; D does have some overhead vs C)
- normal memory layout (usable with low-level routines that pass memory around, and with C)
- you can check bounds (C can't)
- you can slice it
- appending to it is difficult
D tries various things to mitigate the fact that appending is difficult; maybe it could do more, but it will *never* be as efficient as a structure that gives up the fact that an array has to be just a chunk of contiguous memory. Now I find the choice of having the basic array be just a chunk of contiguous memory, with minimal structure overhead, very reasonable for a system programming language that has to interact with C, so I also find it ok that appending is not as fast as with other structures. Clearly someone coming from Lisp, any functional language, or even Python might disagree about what is expected from the "default" array container. It isn't that one is right and the other wrong; it is just a question of the priorities of the language, the feel of it, its style... This does not mean that improvements that don't compromise too much shouldn't be made, just that failing some benchmarks might be ok :) Fawzi
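The "chunk of contiguous memory" philosophy above can be sketched as a pointer-plus-length slice. This is a C++ toy model for illustration, not the actual D ABI: slicing is just pointer arithmetic, but append has no reserved space to use, so it must reallocate and copy.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <cstring>

// Toy model of a D-style slice: just a pointer and a length.
struct Slice {
    int*   ptr;
    size_t length;

    // O(1): a sub-slice is just arithmetic on ptr/length, no copy.
    Slice slice(size_t lo, size_t hi) const {
        assert(lo <= hi && hi <= length);
        return Slice{ptr + lo, hi - lo};
    }
};

// Appending has no reserved space to grow into, so it must copy
// everything into a fresh chunk of contiguous memory.
Slice append(Slice s, int value) {
    int* p = static_cast<int*>(std::malloc((s.length + 1) * sizeof(int)));
    std::memcpy(p, s.ptr, s.length * sizeof(int));
    p[s.length] = value;
    return Slice{p, s.length + 1};  // old chunk's ownership is unclear,
                                    // which is exactly where a GC helps
}
```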
 
 Bye,
 bearophile

Aug 27 2008
parent reply bearophile <bearophileHUGS lycos.com> writes:
Fawzi Mohamed:
 it also depend on 
 the language, if in a language it is clear that using the default 
 structure you are not supposed to append often, and if you really have 
 to do it you use a special method if you do, then the programs written 
 in that language will use that.

I agree. But then the D specs have to be updated to say that appending to built-in D arrays is a slow or very slow operation (not amortized O(1)), so people coming from the C++ STL, Python, Ruby, Tcl, Lua, Lisp, Clean, Oz, etc. won't be bitten by it.
 D tries to do different things to mitigate the fact that appending is 
 difficult, maybe it could do more, but it will *never* be as efficient 
 as a structure that gives up that fact that an array has to be just a 
 chunk of contiguous memory.

I agree, a deque will probably always be faster at appending than a dynamic array. But I think D may do more here :-)
 Now I find the choice of having the basic array being just a chunk of 
 contiguous memory, and that the overhead of the structure should be 
 minimal very reasonable for a system programming language that has to 
 interact with C

Note that both the C++ STL vector and Python's "list" are generally (or always) implemented as a chunk of contiguous memory.
 Clearly someone coming from lisp, any functional language or even 
 python, might disagree about what is requested from the "default" array 
 container.
 It isn't that one is right and the other wrong, it is just a question 
 of priorities of the language, the feel of it, its style...

I agree, that's why I have suggested collecting statistics from real D code, and not from Lisp programs, to see what a good performance profile compromise for D built-in dynamic arrays is :-) This means that I'll stop caring about a fast array append in D if most D programmers don't need fast appends much.
 This does not mean that if improvements can be done without 
 compromising too much it shouldn't be done, just that failing some 
 benchmarks might be ok :)

Well, in my opinion that's okay, but there's a limit. So I think a built-in data structure has to be optimized for flexibility: while not being very good at anything, it has to be not terrible at any commonly performed operation. On the other hand, I can see what you say: in a low-level language an overly flexible data structure may not fit well as a built-in, while a simpler and less flexible one may fit better. You may be right and I may be wrong on this point :-) Bye, bearophile
Aug 27 2008
parent Fawzi Mohamed <fmohamed mac.com> writes:
On 2008-08-27 16:16:49 +0200, bearophile <bearophileHUGS lycos.com> said:

 Fawzi Mohamed:
 it also depend on
 the language, if in a language it is clear that using the default
 structure you are not supposed to append often, and if you really have
 to do it you use a special method if you do, then the programs written
 in that language will use that.

I agree. But then the D specs have to be updated to say that appending to built-in D arrays is a slow or very slow operation (not amortized O(1)), so people coming from the C++ STL, Python, Ruby, Tcl, Lua, Lisp, Clean, Oz, etc. won't be bitten by it.
 D tries to do different things to mitigate the fact that appending is
 difficult, maybe it could do more, but it will *never* be as efficient
 as a structure that gives up that fact that an array has to be just a
 chunk of contiguous memory.

I agree, a deque will probably always be faster at appending than a dynamic array. But I think D may do more here :-)
 Now I find the choice of having the basic array being just a chunk of
 contiguous memory, and that the overhead of the structure should be
 minimal very reasonable for a system programming language that has to
 interact with C

Note that both the C++ STL vector and Python's "list" are generally (or always) implemented as a chunk of contiguous memory.
 Clearly someone coming from lisp, any functional language or even
 python, might disagree about what is requested from the "default" array
 container.
 It isn't that one is right and the other wrong, it is just a question
 of priorities of the language, the feel of it, its style...

I agree, that's why I have suggested collecting statistics from real D code, and not from Lisp programs, to see what a good performance profile compromise for D built-in dynamic arrays is :-) This means that I'll stop caring about a fast array append in D if most D programmers don't need fast appends much.
 This does not mean that if improvements can be done without
 compromising too much it shouldn't be done, just that failing some
 benchmarks might be ok :)

Well, in my opinion that's okay, but there's a limit. So I think a built-in data structure has to be optimized for flexibility: while not being very good at anything, it has to be not terrible at any commonly performed operation. On the other hand, I can see what you say: in a low-level language an overly flexible data structure may not fit well as a built-in, while a simpler and less flexible one may fit better. You may be right and I may be wrong on this point :-)

Well, it is funny, because now I am not so sure anymore that adding an extra field is such a bad idea (pointing to the end of the reserved data if the actual array is the "owner" of it, and to before the start if it isn't and one should reallocate, i.e. for slices). I started out with the idea of "slightly improved C characteristics", and so for me it was clear that appending would be bad, and I was actually surprised (using it) to see that it was less bad than I supposed. The way it is has the advantage of allowing bound checks with very little overhead, but it has no concept of "extra grow space". So for me it was clear that if I wanted to insert something I had to make space first and then insert it (keeping in mind the old size). The vector approach (knowing about the capacity) can be useful in high-level code where one wants a.length to always be the length of the array (not maybe longer because of some reserved memory). So maybe it is worthwhile, even if it will for sure add some bloat, I do not know... For most of my code it will just add bloat, but probably a tolerable one. Fawzi
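A minimal sketch of the extra-field idea (hypothetical layout and names, in C++ for illustration): the extra pointer marks the end of the reserved block when this reference owns the allocation, and is null for slices, which must reallocate before growing.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <cstring>

// Hypothetical array header: 'end_of_reserved' points past the reserved
// block when this reference owns the allocation, and is null for slices,
// which have no grow space and must reallocate before growing.
struct Arr {
    char*  ptr;
    size_t length;
    char*  end_of_reserved;  // null => not the owner (e.g. a slice)

    bool owner() const { return end_of_reserved != nullptr; }

    void append(char c) {
        if (owner() && ptr + length < end_of_reserved) {
            ptr[length++] = c;   // grow in place, O(1)
            return;
        }
        // Slice, or reserve exhausted: reallocate with doubled reserve.
        size_t cap = length ? length * 2 : 8;
        char* p = static_cast<char*>(std::malloc(cap));
        std::memcpy(p, ptr, length);
        p[length] = c;
        ptr = p;
        end_of_reserved = p + cap;
        ++length;
    }
};
```

The cost Fawzi mentions is visible in the struct itself: one extra pointer per array reference, whether or not the code ever appends.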
Aug 27 2008
prev sibling parent reply Yigal Chripun <yigal100 gmail.com> writes:
Benji suggested run-time inheritance and at least from a design
perspective I like some of his thoughts.
I've got a few questions though:

a) people here said that a virtual call will make it slow. How much
slow? how much of an overhead is it on modern hardware considering also
that this is a place where hardware manufacturers spend time on
optimizations?

b) can't a string class use implicit casts and maybe some sugar to
pretend to be a regular array in such a way to avoid that virtual call
and still be useful?
you already can do array.func(...) instead of func(array, ...) so this
can be used with a string class that implicitly converts to a char array.

C) compile-time interfaces? (aka concepts)
Aug 26 2008
parent reply Walter Bright <newshound1 digitalmars.com> writes:
Yigal Chripun wrote:
 a) people here said that a virtual call will make it slow. How much
 slow? how much of an overhead is it on modern hardware considering also
 that this is a place where hardware manufacturers spend time on
 optimizations?

Virtual function calls have been a problem for hardware optimization. Direct function calls can be speculatively executed, but not virtual ones, because the hardware cannot predict where it will go. This means virtual calls can be much slower than direct function calls.
Aug 26 2008
next sibling parent reply "Jb" <jb nowhere.com> writes:
"Walter Bright" <newshound1 digitalmars.com> wrote in message 
news:g90iia$2jc4$3 digitalmars.com...
 Yigal Chripun wrote:
 a) people here said that a virtual call will make it slow. How much
 slow? how much of an overhead is it on modern hardware considering also
 that this is a place where hardware manufacturers spend time on
 optimizations?

Virtual function calls have been a problem for hardware optimization. Direct function calls can be speculatively executed, but not virtual ones, because the hardware cannot predict where it will go. This means virtual calls can be much slower than direct function calls.

Modern x86 branch prediction treats indirect calls the same as conditional branches. They get a slot in the branch target buffer, so they do get speculatively executed. And if correctly predicted it's only a couple of cycles more costly than a direct call. See the thread "Feature Request: nontrivial functions and vtable optimizations" about 2 weeks ago. I cited the technical docs and a few doubters ran benchmarks, which proved that virtual methods are not as evil as many people think. In fact they are no more evil than a conditional branch.
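Setting the cycle counts aside, the difference under discussion is just this (a C++ sketch): a direct call has its target encoded in the instruction stream, while a virtual call first loads the target from a table, and that loaded address is what the branch target buffer has to predict.

```cpp
#include <cassert>

int addOne(int x) { return x + 1; }
int addTwo(int x) { return x + 2; }

// Direct call: the target address is fixed in the instruction stream,
// so the front end can fetch ahead without any prediction at all.
int callDirect(int x) { return addOne(x); }

// Indirect call: the target is loaded from memory first (like a vtable
// slot); the CPU must predict the destination to keep the pipeline full.
using Fn = int (*)(int);
Fn table[2] = {addOne, addTwo};

int callIndirect(int which, int x) { return table[which](x); }
```

When `which` is the same value call after call, the predictor's "same place as last time" guess is exactly right, which is why the benchmarks in that thread came out so close.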
Aug 26 2008
next sibling parent superdan <super dan.org> writes:
Jb Wrote:

 
 "Walter Bright" <newshound1 digitalmars.com> wrote in message 
 news:g90iia$2jc4$3 digitalmars.com...
 Yigal Chripun wrote:
 a) people here said that a virtual call will make it slow. How much
 slow? how much of an overhead is it on modern hardware considering also
 that this is a place where hardware manufacturers spend time on
 optimizations?

Virtual function calls have been a problem for hardware optimization. Direct function calls can be speculatively executed, but not virtual ones, because the hardware cannot predict where it will go. This means virtual calls can be much slower than direct function calls.

Modern x86 branch prediction treats indirect calls the same as conditional branches. They get a slot in the branch target buffer, so they do get speculatively executed. And if correctly predicted it's only a couple of cycles more costly than a direct call. See the thread "Feature Request: nontrivial functions and vtable optimizations" about 2 weeks ago. I cited the technical docs and a few doubters ran benchmarks, which proved that virtual methods are not as evil as many people think. In fact they are no more evil than a conditional branch.

you're right. but direct calls don't speculate. they don't need speculation because they're direct jumps. so they are loaded straight into the pipeline. so walter was right but used the wrong term.
Aug 26 2008
prev sibling next sibling parent reply superdan <super dan.org> writes:
Jb Wrote:

 
 "Walter Bright" <newshound1 digitalmars.com> wrote in message 
 news:g90iia$2jc4$3 digitalmars.com...
 Yigal Chripun wrote:
 a) people here said that a virtual call will make it slow. How much
 slow? how much of an overhead is it on modern hardware considering also
 that this is a place where hardware manufacturers spend time on
 optimizations?

Virtual function calls have been a problem for hardware optimization. Direct function calls can be speculatively executed, but not virtual ones, because the hardware cannot predict where it will go. This means virtual calls can be much slower than direct function calls.

Modern x86 branch prediction treats indirect calls the same as conditional branches. They get a slot in the branch target buffer, so they do get speculatively executed. And if correctly predicted it's only a couple of cycles more costly than a direct call. See the thread "Feature Request: nontrivial functions and vtable optimizations" about 2 weeks ago. I cited the technical docs and a few doubters ran benchmarks, which proved that virtual methods are not as evil as many people think. In fact they are no more evil than a conditional branch.

you're right. but direct calls don't speculate. they don't need speculation because they're direct jumps. so they are loaded straight into the pipeline. walt was right but used the wrong term.
Aug 26 2008
parent reply "Jb" <jb nowhere.com> writes:
"superdan" <super dan.org> wrote in message 
news:g912vh$mbe$1 digitalmars.com...
 Jb Wrote:

 "Walter Bright" <newshound1 digitalmars.com> wrote in message
 news:g90iia$2jc4$3 digitalmars.com...
 Yigal Chripun wrote:
 a) people here said that a virtual call will make it slow. How much
 slow? how much of an overhead is it on modern hardware considering 
 also
 that this is a place where hardware manufacturers spend time on
 optimizations?

Virtual function calls have been a problem for hardware optimization. Direct function calls can be speculatively executed, but not virtual ones, because the hardware cannot predict where it will go. This means virtual calls can be much slower than direct function calls.

Modern x86 branch prediction treats indirect calls the same as conditional branches. They get a slot in the branch target buffer, so they do get speculatively executed. And if correctly predicted it's only a couple of cycles more costly than a direct call. See the thread "Feature Request: nontrivial functions and vtable optimizations" about 2 weeks ago. I cited the technical docs and a few doubters ran benchmarks, which proved that virtual methods are not as evil as many people think. In fact they are no more evil than a conditional branch.

you're right. but direct calls don't speculate. they don't need speculation because they're direct jumps. so they are loaded straight into the pipeline. walt was right but used the wrong term.

Walter said "the hardware cannot predict where a virtual call will go". It does in fact predict them, and speculatively execute them, and as pretty much any benchmark will show it gets it right the vast majority of the time. (On x86 anyway.) That's what I was saying.
Aug 26 2008
parent reply Walter Bright <newshound1 digitalmars.com> writes:
Jb wrote:
 Walter said "the hardware cannot predict where a virtual call will go".
 
 It does in fact predict them, and speculatively execute them, and as pretty 
 much any benchmark will show it gets it right the vast majority of the time. 
 (On x86 anyway.)
 
 That's what I was saying. 

Looks like I keep falling behind on what modern CPUs are doing :-( In any case, throughout all the revolutions in how CPUs work, there have been a few invariants that hold true well enough as an optimization guide:
1. fewer instructions ==> faster execution
2. fewer memory accesses ==> faster execution
3. fewer conditional branches ==> faster execution
Aug 26 2008
next sibling parent reply bearophile <bearophileHUGS lycos.com> writes:
Walter Bright:
 Looks like I keep falling behind on what modern CPUs are doing :-(

The 5 good PDF files on this page are probably enough to put you back in shape: http://www.agner.org/optimize/ (especially the ones regarding CPUs and microarchitecture; the first document is the simplest one). Bye, bearophile
Aug 26 2008
parent "Jb" <jb nowhere.com> writes:
"bearophile" <bearophileHUGS lycos.com> wrote in message 
news:g91vf8$2mmk$1 digitalmars.com...
 Walter Bright:
 Looks like I keep falling behind on what modern CPUs are doing :-(

The 5 good PDF files on this page are probably enough to put you back in shape: http://www.agner.org/optimize/ (especially the ones regarding CPUs and microarchitecture; the first document is the simplest one).

Agner Fog's guides are the best optimization info you can get. They're actually a lot better than Intel's and AMD's own optimization guides imo.
Aug 26 2008
prev sibling next sibling parent "Jb" <jb nowhere.com> writes:
"Walter Bright" <newshound1 digitalmars.com> wrote in message 
news:g91kah$1rvb$1 digitalmars.com...
 Jb wrote:
 Walter said "the hardware cannot predict where a virtual call will go".

 It does in fact predict them, and speculatively execute them, and as 
 pretty much any benchmark will show it gets it right the vast majority of 
 the time. (On x86 anyway.)

 That's what I was saying.

Looks like I keep falling behind on what modern CPUs are doing :-( In any case, throughout all the revolutions in how CPUs work, there have been a few invariants that hold true well enough as an optimization guide:
1. fewer instructions ==> faster execution
2. fewer memory accesses ==> faster execution
3. fewer conditional branches ==> faster execution

True. I'd add this to the list as well:
4. shorter dependency chains ==> faster execution
Although it's more relevant for floating point, where most ops have at least a few cycles of latency.
Aug 26 2008
prev sibling parent JAnderson <ask me.com> writes:
Walter Bright wrote:
 Jb wrote:
 Walter said "the hardware cannot predict where a virtual call will go".

 It does in fact predict them, and speculatively execute them, and as 
 pretty much any benchmark will show it gets it right the vast majority 
 of the time. (On x86 anyway.)

 That's what I was saying. 

Looks like I keep falling behind on what modern CPUs are doing :-( In any case, throughout all the revolutions in how CPUs work, there have been a few invariants that hold true well enough as an optimization guide:
1. fewer instructions ==> faster execution
2. fewer memory accesses ==> faster execution
3. fewer conditional branches ==> faster execution

Also you can't inline virtual calls (well, a smart compiler could, but that's another discussion). That means the compiler can't optimize as well by removing unnecessary operations. -Joel
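A small C++ sketch of the inlining point (hypothetical class names): through a base pointer the target comes from the vtable, so the compiler generally can't inline it; when the concrete type is statically known, here forced with a qualified call, the vtable is bypassed and inlining becomes possible.

```cpp
#include <cassert>

struct Shape {
    virtual int area() const { return 0; }
    virtual ~Shape() = default;
};

struct Square : Shape {
    int side;
    explicit Square(int s) : side(s) {}
    int area() const override { return side * side; }
};

// Virtual dispatch: the target is fetched from the vtable at runtime,
// so the compiler generally cannot inline the body here.
int areaVirtual(const Shape* s) { return s->area(); }

// Qualified call: the concrete type is known statically, the vtable is
// bypassed entirely, and the compiler is free to inline side * side.
int areaDevirtualized(const Square& s) { return s.Square::area(); }
```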
Aug 26 2008
prev sibling next sibling parent JAnderson <ask me.com> writes:
Jb wrote:
 "Walter Bright" <newshound1 digitalmars.com> wrote in message 
 news:g90iia$2jc4$3 digitalmars.com...
 Yigal Chripun wrote:
 a) people here said that a virtual call will make it slow. How much
 slow? how much of an overhead is it on modern hardware considering also
 that this is a place where hardware manufacturers spend time on
 optimizations?

Direct function calls can be speculatively executed, but not virtual ones, because the hardware cannot predict where it will go. This means virtual calls can be much slower than direct function calls.

Modern x86 branch prediction treats indirect calls the same as conditional branches. They get a slot in the branch target buffer, so they do get speculatively executed. And if correctly predicted it's only a couple of cycles more costly than a direct call. See the thread "Feature Request: nontrivial functions and vtable optimizations" about 2 weeks ago. I cited the technical docs and a few doubters ran benchmarks, which proved that virtual methods are not as evil as many people think. In fact they are no more evil than a conditional branch.

That's x86 hardware. Try something like the PS3. That system has little or no cache. It has to jump to the vtable, which is in a totally different location from the class. Note I'm not in the camp that thinks they should never be used on these systems; however, I think you should use them smartly and profile, profile, profile. One technique I've used in C++ to help improve things a little is to switch the vtable with one that's in the same location as, or close to, the class. The wrapper function looked something like this:

class A {...}
A a = new LocalVirtualTable<A>(); // i.e. LocalVirtualTable is a bolt-in template

However, performance only improved in cases where I could flush the cache. In many cases it was slightly worse on x86, so you had to try it, profile, and see if it had a positive or negative effect in each case. I imagine when you've got hundreds of these classes it's simply more memory to process, so on a system with a big cache it can be counterproductive. I never tried it on a PS3, so it might be more effective there. -Joel
Aug 26 2008
prev sibling parent reply "Nick Sabalausky" <a a.a> writes:
"Jb" <jb nowhere.com> wrote in message news:g90mm6$2tk9$1 digitalmars.com...
 "Walter Bright" <newshound1 digitalmars.com> wrote in message 
 news:g90iia$2jc4$3 digitalmars.com...
 Yigal Chripun wrote:
 a) people here said that a virtual call will make it slow. How much
 slow? how much of an overhead is it on modern hardware considering also
 that this is a place where hardware manufacturers spend time on
 optimizations?

Virtual function calls have been a problem for hardware optimization. Direct function calls can be speculatively executed, but not virtual ones, because the hardware cannot predict where it will go. This means virtual calls can be much slower than direct function calls.

Modern x86 branch prediction treats indirect calls the same as conditional branches. They get a slot in the branch target buffer, so they do get speculatively executed. And if correctly predicted it's only a couple of cycles more costly than a direct call.

Just curious: how "modern" do you mean by "modern" here?
Aug 26 2008
parent "Jb" <jb nowhere.com> writes:
"Nick Sabalausky" <a a.a> wrote in message 
news:g91m3t$222a$1 digitalmars.com...
 "Jb" <jb nowhere.com> wrote in message 
 news:g90mm6$2tk9$1 digitalmars.com...
 "Walter Bright" <newshound1 digitalmars.com> wrote in message 
 news:g90iia$2jc4$3 digitalmars.com...
 Yigal Chripun wrote:
 a) people here said that a virtual call will make it slow. How much
 slow? how much of an overhead is it on modern hardware considering also
 that this is a place where hardware manufacturers spend time on
 optimizations?

Virtual function calls have been a problem for hardware optimization. Direct function calls can be speculatively executed, but not virtual ones, because the hardware cannot predict where it will go. This means virtual calls can be much slower than direct function calls.

Modern x86 branch prediction treats indirect calls the same as conditional branches. They get a slot in the branch target buffer, so they do get speculatively executed. And if correctly predicted it's only a couple of cycles more costly direct calls.

 Just curious: how "modern" do you mean by "modern" here?

Well, I thought it was the Pentium II, but according to Agner Fog, it's been there since the PMMX. So pretty much all Pentiums. Although that's "predict it goes the same place it did last time"; more recent ones do remember multiple targets and recognize some patterns.
Aug 26 2008
prev sibling parent reply Benji Smith <dlanguage benjismith.net> writes:
Walter Bright wrote:
 Yigal Chripun wrote:
 a) people here said that a virtual call will make it slow. How much
 slow? how much of an overhead is it on modern hardware considering also
 that this is a place where hardware manufacturers spend time on
 optimizations?

Virtual function calls have been a problem for hardware optimization. Direct function calls can be speculatively executed, but not virtual ones, because the hardware cannot predict where it will go. This means virtual calls can be much slower than direct function calls.

What about for software optimization?

I seem to remember reading something about the Objective-C compiler maybe six or eight months ago, talking about some of its optimization techniques. Obj-C uses a message-passing idiom, and all messages use dynamic dispatch, since the list of messages an object can receive is not fixed at compile-time.

If I remember correctly, the article said that the dynamic-dispatch expense only had to be incurred once, upon the first invocation of each message type. After that, the address of the appropriate function was rewritten in memory, so that it pointed directly to the correct code. No more dynamic dispatch. Although the message handlers aren't resolved until runtime, once invoked, they'll always use the same target.

Or something like that. It was an interesting read. I'll see if I can find it.

--benji
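The technique described here sounds like an inline cache: pay the dynamic lookup once, then patch the call site so later calls jump straight to the resolved target. A toy C++ version, with a function pointer standing in for the patched call site (all names hypothetical, not the actual Objective-C runtime):

```cpp
#include <cassert>
#include <map>
#include <string>
#include <utility>

using Handler = int (*)(int);

int doubler(int x) { return 2 * x; }

// Stand-in for the slow, dynamic message lookup.
std::map<std::string, Handler>& registry() {
    static std::map<std::string, Handler> r{{"double", doubler}};
    return r;
}

// Each call site owns a cached target. The first call resolves the
// message and patches the cache; later calls skip the lookup entirely.
struct CallSite {
    std::string selector;
    Handler cached;
    int lookups;  // for illustration: counts slow-path hits

    explicit CallSite(std::string sel)
        : selector(std::move(sel)), cached(nullptr), lookups(0) {}

    int invoke(int arg) {
        if (!cached) {                      // slow path, taken once
            ++lookups;
            cached = registry().at(selector);
        }
        return cached(arg);                 // fast path thereafter
    }
};
```

The real runtime patches machine-level call sites rather than a struct field, but the shape of the optimization is the same.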
Aug 26 2008
parent Michel Fortin <michel.fortin michelf.com> writes:
On 2008-08-26 13:25:40 -0400, Benji Smith <dlanguage benjismith.net> said:

 I seem to remember reading something about the Objective-C compiler 
 maybe six or eight months ago talking about some of its optimization 
 techniques.
 
 Obj-C uses a message-passing idiom, and all messages use dynamic 
 dispatch, since the list of messages an object can receive is not fixed 
 at compile-time.
 
 If I remember correctly, this article said that the dynamic dispatch 
 expense only had to be incurred once, upon the first invocation of each 
 message type. After that, the address of the appropriate function was 
 re-written in memory, so that it pointed directly to the correct code. 
 No more dynamic dispatch. Although the message handlers aren't resolved 
 until runtime, once invoked, they'll always use the same target.
 
 Or some thing like that.
 
 It was an interesting read. I'll see if I can find it.

Hum, I believe you're talking about the cache for method calls. What Objective-C does is that it caches methods by selector in a lookup table. There is one such table for each class, and it gets populated as methods are called on that class. Once a method is in the cache, it's very efficient to find where to branch: you take the selector's pointer and apply the mask to get value n, then branch on the method pointer from the nth bucket in the table. All messages are passed by calling the objc_msgSend function. Here's how you can implement some of that in D:

id objc_msgSend(id, SEL, ...)
{
    auto n = cast(uint)SEL & id.isa.cache.mask;
    auto func = cast(id function(id, SEL, ...))id.isa.cache.buckets[n];
    if (func != null)
    {
        <set instruction pointer to func>
        // never returns, the function pointed by func returns instead
    }
    <find func pointer by other means, fill cache, etc.>
}

I've read somewhere that it's almost as fast as virtual functions. While I haven't verified that, it's much more flexible: you can add functions at runtime to any class. That's how Objective-C allows you to add methods to classes you do not control, and you can still override them in derived classes.

-- 
Michel Fortin
michel.fortin michelf.com
http://michelf.com/
Aug 26 2008