digitalmars.D.learn - How to use sets in D?

mark (10/10) Feb 07 2020 I am porting code from other languages to D as part of learning

Jan =?UTF-8?B?SMO2bmln?= (6/16) Feb 07 2020 I have not that much experience, but i have asked the exact same
H. S. Teoh (17/28) Feb 07 2020 bool[E] works just fine.

mark (9/16) Feb 07 2020 [snip]

mark (97/97) Mar 08 2020 I use sets a lot and am now working on a program that will need

Johan (10/12) Mar 08 2020 Please remember that performance testing is not trivial.

Jesse Phillips (9/19) Mar 08 2020 I think I've usually used the associative arrays, but I also

mark (8/17) Mar 09 2020 I can see the sense in that for use with union & intersection.

mark <mark qtrac.eu> writes:

I am porting code from other languages to D as part of learning 
D, and I find I've used sets quite a lot. AFAIK D doesn't have a 
built-in set type or one in the std. lib.

However, I've been perfectly successfully using int[E] where E is 
my ElementType, and adding with set[element] = 0. I mostly only 
need add, remove, iteration, and in, with uniqueness what I care 
most about.

I know I could use bool[E] and set[element] = false, or I suppose 
container.rbtree. Would either of these--or something else 
built-in or in the std. lib.--be better?

Feb 07 2020

Jan =?UTF-8?B?SMO2bmln?= <hrominium gmail.com> writes:

On Friday, 7 February 2020 at 19:37:08 UTC, mark wrote:
 I am porting code from other languages to D as part of learning 
 D, and I find I've used sets quite a lot. AFAIK D doesn't have 
 a built-in set type or one in the std. lib.

 However, I've been perfectly successfully using int[E] where E 
 is my ElementType, and adding with set[element] = 0. I mostly 
 only need add, remove, iteration, and in, with uniqueness what 
 I care most about.

 I know I could use bool[E] and set[element] = false, or I 
 suppose container.rbtree. Would either of these--or something 
 else built-in or in the std. lib.--be better?

I have not that much experience, but i have asked the exact same 
thing in the IRC.
rbtree was recommended. However, if it is not performance 
critical (in terms of speed or size), your solution works fine as 
well.

Feb 07 2020

"H. S. Teoh" <hsteoh quickfur.ath.cx> writes:

On Fri, Feb 07, 2020 at 07:37:08PM +0000, mark via Digitalmars-d-learn wrote:
 I am porting code from other languages to D as part of learning D, and
 I find I've used sets quite a lot. AFAIK D doesn't have a built-in set
 type or one in the std. lib.
 
 However, I've been perfectly successfully using int[E] where E is my
 ElementType, and adding with set[element] = 0. I mostly only need add,
 remove, iteration, and in, with uniqueness what I care most about.
 
 I know I could use bool[E] and set[element] = false, or I suppose
 container.rbtree. Would either of these--or something else built-in or
 in the std. lib.--be better?

bool[E] works just fine.

The bool does take up 1 byte, though, so if you're *really* want to
optimize that away, you could do this:

	alias Unit = void[0];
	enum unit = Unit.init;

	// Look, ma! A bona fide set!
	Unit[E] mySet;

	mySet[...] = unit;
	mySet.remove(...);
	... // etc.

Or you can wrap void[0][E] in a nice user-defined type that gives nice
set-like syntax.  But IMO, this is all overkill, and adds needless
complexity. Just use bool[E] or std.container.rbtree. :-D


T

-- 
If you look at a thing nine hundred and ninety-nine times, you are perfectly
safe; if you look at it the thousandth time, you are in frightful danger of
seeing it for the first time. -- G. K. Chesterton

Feb 07 2020

mark <mark qtrac.eu> writes:

On Friday, 7 February 2020 at 22:03:00 UTC, H. S. Teoh wrote:
 On Fri, Feb 07, 2020 at 07:37:08PM +0000, mark via 
 Digitalmars-d-learn wrote:

[snip]
 bool[E] works just fine.

[snip]
 Or you can wrap void[0][E] in a nice user-defined type that 
 gives nice set-like syntax.  But IMO, this is all overkill, and 
 adds needless complexity. Just use bool[E] or 
 std.container.rbtree. :-D

I didn't think bool[E] would be a win because although it is only 
one byte per item, it won't align so wouldn't it end up taking 4 
bytes of space anyway. The void[0][E] you showed is good, but, 
I'll try using the rbtree since I'd rather use an out-of-the-box 
collection.

Thanks for the replies.

Feb 07 2020

mark <mark qtrac.eu> writes:

I use sets a lot and am now working on a program that will need 
to hold sets of 65,000+ items, so I thought I do some timings for 
the different approaches.

Here are some timings (uset uses the AA Unit approach, tset uses 
an rbtree, and aset uses an AA with bool values):

$ ./sets.d
size 50,000
uset 1 ms, 340 μs, and 8 hnsecs
tset 4 ms, 637 μs, and 1 hnsec
aset 1 ms, 402 μs, and 6 hnsecs
$ ./sets.d
size 100,000
uset 2 ms, 338 μs, and 4 hnsecs
tset 12 ms, 262 μs, and 6 hnsecs
aset 2 ms and 991 μs
$ ./sets.d
size 200,000
uset 5 ms, 971 μs, and 5 hnsecs
tset 30 ms, 675 μs, and 5 hnsecs
aset 6 ms, 74 μs, and 6 hnsecs
$ ./sets.d
size 400,000
uset 11 ms, 823 μs, and 4 hnsecs
tset 74 ms, 146 μs, and 2 hnsecs
aset 12 ms, 560 μs, and 5 hnsecs

What seems pretty clear is that for my purposes (checking 
presence or absence of membership in a set), AAs are much faster 
than rbtrees. (This is to be expected since if an AA uses a hash 
is should be O(1) vs O(lg n) for an rbtree).

Here is the test code I used:



enum SIZE = 400_000;
enum SUB_SIZE = SIZE / 10;
enum MIN_LEN = 10;
enum MAX_LEN = 50;
enum AZ = "abcdefghijklmnopqrstuvwxyz";

void main() {
     import std.stdio: writefln;

     auto data = Data.populate();
     auto sets = Sets(data.words);

     writefln("size %,d", SIZE);
     check(sets.uset, data.present, data.absent, "uset");
     check(sets.tset, data.present, data.absent, "tset");
     check(sets.aset, data.present, data.absent, "aset");
}

struct Sets {
     import std.container.rbtree: RedBlackTree;

     alias Unit = void[0];
     enum unit = Unit.init;

     Unit[string] uset;
     RedBlackTree!string tset;
     bool[string] aset;

     this(string[] words) {
         tset = new RedBlackTree!string;
         foreach (word; words) {
             uset[word] = unit;
             tset.insert(word);
             aset[word] = false;
         }
     }
}

struct Data {
     string[] words;
     string[] present;
     string[] absent;

     static Data populate() {
         Data data;
         for (int i = 0; i < SIZE; i++) {
             auto word = makeWord;
             data.words ~= word;
             if (data.present.length < SUB_SIZE)
                 data.present ~= word;
             if (data.absent.length < SUB_SIZE)
                 data.absent ~= word ~ "9";
         }
         return data;
     }
}

string makeWord() {
     import std.random: randomCover, uniform;
     import std.range: take;
     import std.string: join, split;

     enum AZS = (AZ ~ AZ ~ AZ ~ AZ ~ AZ ~ AZ).split("");
     return randomCover(AZS).take(uniform(MIN_LEN, 
MAX_LEN)).join("");
}

void check(T)(T set, string[] present, string[] absent, string 
name) {
     import std.datetime.stopwatch: AutoStart, StopWatch;
     import std.stdio: writeln;

     auto timer = StopWatch(AutoStart.yes);
     foreach (string p; present)
         assert(p in set);
     foreach (string a; absent)
         assert(a !in set);
     writeln(name, " ", timer.peek);
}

Mar 08 2020

Johan <j j.nl> writes:

On Sunday, 8 March 2020 at 08:43:10 UTC, mark wrote:
 Here are some timings ...

[...]


Please remember that performance testing is not trivial.
At the very least, you should be testing optimized code (-O) and 
preferably with LDC or GDC because they have a much stronger 
optimizer than DMD.
Also, `assert` is not a good way to check something in 
performance testing code. See 
https://forum.dlang.org/thread/bpbyhmrsfzirfqggnlyw forum.dlang.org

-Johan

Mar 08 2020

Jesse Phillips <Jesse.K.Phillips+D gmail.com> writes:

On Friday, 7 February 2020 at 19:37:08 UTC, mark wrote:
 I am porting code from other languages to D as part of learning 
 D, and I find I've used sets quite a lot. AFAIK D doesn't have 
 a built-in set type or one in the std. lib.

 However, I've been perfectly successfully using int[E] where E 
 is my ElementType, and adding with set[element] = 0. I mostly 
 only need add, remove, iteration, and in, with uniqueness what 
 I care most about.

 I know I could use bool[E] and set[element] = false, or I 
 suppose container.rbtree. Would either of these--or something 
 else built-in or in the std. lib.--be better?

I think I've usually used the associative arrays, but I also 
think I tend to avoid using this approach but couldn't quite 
remember what I do instead.

I believe I have started just using an array.

arr ~= addMyData;
arr.sort.uniq

Then I make use of the algorithms here.

https://dlang.org/phobos/std_algorithm_setops.html

Mar 08 2020

mark <mark qtrac.eu> writes:

On Sunday, 8 March 2020 at 17:58:16 UTC, Jesse Phillips wrote:
 On Friday, 7 February 2020 at 19:37:08 UTC, mark wrote:

[snip]
 I think I've usually used the associative arrays, but I also 
 think I tend to avoid using this approach but couldn't quite 
 remember what I do instead.

 I believe I have started just using an array.

 arr ~= addMyData;
 arr.sort.uniq

 Then I make use of the algorithms here.

 https://dlang.org/phobos/std_algorithm_setops.html

I can see the sense in that for use with union & intersection.
However, AA's (assuming they are hashes under the hood) give O(1) 
for `in` whereas those algorithms like the rbtree are O(lg n) for 
`in`, and most of what I'm doing at the moment is `in`. I'll keep 
the idea in mind though, because I have use cases for 
intersection.

Mar 09 2020

D Programming

C/C++ Programming

Other

digitalmars.D.learn - How to use sets in D?