digitalmars.D - Making generalized Trie type in D
- Dmitry Olshansky (122/122) Jun 04 2012 Cross-posting from my GSOC list as I think we had a lack of D rox posts
- Denis Shelomovskij (13/20) Jun 04 2012 A nitpick: constant arrays should be defined as `immutable` instead of
- Dmitry Olshansky (5/22) Jun 04 2012 Thanks, the mutable part is then a consequence of being another rvalue.
- Roman D. Boiko (3/20) Jun 04 2012 Well, in my case they were only used for mixin code generation
- Andrei Alexandrescu (4/6) Jun 04 2012 [snip]
- Dmitry Olshansky (5/11) Jun 04 2012 Sounds good, will do once I fix few issues that were mentioned
- Roman D. Boiko (38/52) Jun 04 2012 Would be interesting to see some more examples, along with
- Dmitry Olshansky (46/97) Jun 04 2012 Aye, the good thing about them - the amount of data affected by any
- Roman D. Boiko (33/123) Jun 04 2012 Actually, Microsoft went this way (immutable hash tables) in
- Dmitry Olshansky (53/124) Jun 04 2012 Sorry my Trie implementation focused on "constructe once - read
- Roman D. Boiko (10/44) Jun 04 2012 My cases will likely have quite low amortized number of reads per
- Roman D. Boiko (5/15) Jun 04 2012 Another counter-example is searching for strings with specified
- Dmitry Olshansky (15/30) Jun 04 2012 I believe that once encoidng is established your compiler tool should
- Roman D. Boiko (9/13) Jun 04 2012 I didn't plan to convert input into some other encoding. But
- Roman D. Boiko (8/15) Jun 04 2012 Will it be difficult to adapt your API for immutable tries? E.g.,
- Dmitry Olshansky (8/21) Jun 04 2012 Linked list? I'm horrified. Though I'd need some info on where and why
- Dmitry Olshansky (5/19) Jun 04 2012 And you can fake immutability, buy always picking up an unused slot btw....
- Roman D. Boiko (6/8) Jun 04 2012 That's applicable in some cases, not in general. But I agree that
- Dmitry Olshansky (9/17) Jun 04 2012 Keep in mind Ranges are temporary object most of the time. They are
- Roman D. Boiko (5/18) Jun 04 2012 I'm currently trying this approach (instead of working with
- Roman D. Boiko (7/24) Jun 04 2012 That was for illustration only, as the most trivial example of
- Roman D. Boiko (10/23) Jun 04 2012 Basically, the most important aspect of immutability is returning
- Roman D. Boiko (5/9) Jun 04 2012 Simplest and possibly the best approach is to provide an
- Dmitry Olshansky (26/34) Jun 05 2012 I suspect I would have to add another policy like Persistent that will
- Roman D. Boiko (4/45) Jun 05 2012 Yes,something like that should work. I finished support request
- Roman D. Boiko (11/13) Jun 11 2012 Would it be difficult to introduce two optimizations:
- Dmitry Olshansky (9/21) Jun 11 2012 The sample I listed did just that: for length >= 64 it goes to bucket of...
- Roman D. Boiko (18/41) Jun 11 2012 That's wonderful :)
- Dmitry Olshansky (8/14) Jun 11 2012 I meant an operation pseudo-XOR x P^ y where x is part of snapshot and y...
- Roman D. Boiko (3/22) Jun 11 2012 I understood, just mentioned some alternatives (maybe not
- Roman D. Boiko (3/11) Jun 11 2012 Or probably not understood :)
- Roman D. Boiko (4/16) Jun 11 2012 Is this article somehow related, or just uses the same term?
- Dmitry Olshansky (10/25) Jun 11 2012 0 ^ 0 = 1 while T.init P^ T.init == T.init :)
- Roman D. Boiko (9/13) Jun 24 2012 Just a remark that DCT no longer needs tries to be immutable,
- Dmitry Olshansky (7/19) Jun 24 2012 Storing new values directly is no easy trick as you need crawl page
- Marco Leise (10/14) Jul 19 2012 Looks, like you are back in control.
Cross-posting from my GSOC list as I think we've had a lack of "D rox" posts lately :)

I rarely am surprised by the flexibility and power of generic code these days. But once I fleshed out the last bits of the generalized Trie, I was amazed. In short: it allows something like "make a multistage integer lookup table using these x upper bits, these y bits in the middle, and those last z bits as the final offset" as a one-liner. Better yet, strings and other keys are just as easy, and better still, the user can supply custom ways of getting X bits out of any key. And while I write this, bit-packing is on its way to being integrated, which gives ultra-small tables (we can pack even index entries, since we know exactly how many bits they take).

And last but not least, it's easy to strap a Trie on top of some other container (say arrays, sets, hashtables etc.). (Currently it doesn't play nice with GC-ed objects; I'll fix that later.) Say, a book catalog that goes by the first letter (or two) as a Trie stage, then uses a sorted array as the final slot. The advantage of such schemes is that re-sorting huge arrays can get costly; moreover, insertion can reallocate, and that might even run out of memory while duping the array. Smaller arrays, on the other hand, keep the sorting bounded at an acceptable level. (In fact, a hashtable is a degenerate case of a 1-level Trie on top of e.g. a list, just not space-optimized and with a weird index transform.)

To give you a taste of this power, consider a simple example: a keyword recognizer. I'm showing two versions. The first just checks whether a token is a keyword; the second catalogs it to its specific kind. (Sample material taken from the DCT project; thanks go to Roman Boiko.)

enum keywords = [
    "abstract",
    "alias",
    "align",
    //... all of them, down to "__traits"
];

//here is a bare minimum map type usable as a Trie slot
//(SmallSet, used below, is the analogous set type)
struct SmallMap(size_t N, V, K)
{
    void insert(Tuple!(V, K) t){ _set.insert(t); }

    V opBinaryRight(string op, T)(T key)
        if(op == "in")
    {
        auto idx = map!"a[1]"(_set.items[]).countUntil(key);
        return idx < 0 ? V.init : _set.items[idx][0];
    }
private:
    SmallSet!(N, Tuple!(V, K)) _set;
}

//define how we get bits for our index
size_t useLength(T)(T[] arr)
{
    return arr.length > 63 ? 0 : arr.length; //cap the length: 64 values -> 6 bits
}

template useLastItem(T)
{
    size_t entity(in T[] arr){ return arr[$-1]; }
    enum bitSize = 8*T.sizeof;
}
//... useItemAt is similar (see the sketch at the end of this post)

//now the usage:
auto keyTrie = Trie!(SetAsSlot!(SmallSet!(2, string))
                    , string
                    , assumeSize!(6, useLength)
                    , useItemAt!(0, char)
                    , useLastItem!(char)
                    )(keywords);
foreach(key; keywords)
    //opIndex gives us a slot; here the slot is a set, so test it with 'in'
    assert(key in keyTrie[key]);

And that's it. Importantly, there is not a single bit of hardcoding here:
- we choose to make a Trie of sets by passing SetAsSlot for the value type (a simple Trie may just use, say, bool)
- the key type is string, obviously
- assumeSize takes a user-defined functor and "trusts" the user that its result indeed fits into an X-bit-wide integer
- the table as constructed goes by levels: by length -> by first char -> by last char -> <small set>

As you can observe, "alias" and "align" do collide after the first 3 levels, so we go for fixed 2-slot sets at the end to resolve this and a few other collisions. Of course there are better ways to choose the levels that would prevent such "collisions" and reduce the size; this is just to show how a Trie on top of a Set works.
(Not to mention that all keywords are ASCII alphas, hinting at better compression in, say, 6 bits per char.)

And here is how it can work with maps as slots (the above was mapping to booleans, in fact):

auto keywordsMap = [
    "abstract" : TokenKind.Abstract,
    "alias" : TokenKind.Alias,
    "align" : TokenKind.Align,
    //... all others, cataloged per their kind
    "__thread" : TokenKind._Thread,
    "__traits" : TokenKind._Traits,
];

//now the usage:
auto keyTrie2 = Trie!(SetAsMap!(SmallMap!(2, TokenKind, string), TokenKind, string)
                //the duplication is because not every Map type's Key-Value pair
                //can be derived (we have no standard protocol for AAs)
                , string
                , assumeSize!(6, useLength)
                , useItemAt!(0, char)
                , useLastItem!(char)
                )(keywordsMap);
foreach(k, v; keywordsMap)
    assert((k in keyTrie2[k]) == v);

And the only big change is the data type:

struct SmallMap(size_t N, V, K)
{
    void insert(Tuple!(V, K) t){ _set.insert(t); }

    V opBinaryRight(string op, T)(T key)
        if(op == "in")
    {
        auto idx = map!"a[1]"(_set.items[]).countUntil(key);
        return idx < 0 ? V.init : _set.items[idx][0];
    }
private:
    SmallSet!(N, Tuple!(V, K)) _set;
}

P.S. Sorry that the code for the Trie itself isn't presented, but you can find the whole thing in my Phobos fork:
https://github.com/blackwhale/phobos/blob/gsoc-uni/std/uni.d

--
Dmitry Olshansky
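For completeness, useItemAt (elided above) can plausibly be sketched by analogy with useLastItem. This is a guess at its shape, not the actual code from the fork:

template useItemAt(size_t idx, T)
{
    size_t entity(in T[] arr){ return arr[idx]; } //bits come from the idx-th element
    enum bitSize = 8*T.sizeof;
}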
Jun 04 2012
04.06.2012 13:46, Dmitry Olshansky wrote:
> enum keywords = [ "abstract", "alias", "align", //... all of them, down to "__traits" ];

A nitpick: constant arrays should be defined as `immutable` instead of `enum`. `enum` means that every time `keywords` is used, a new mutable array is dynamically allocated:
---
auto x = keywords;
auto y = keywords;
assert(x !is y); // passes
x[0] = ""; // legal, it's mutable
---

--
Денис В. Шеломовский
Denis V. Shelomovskij
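For comparison, a minimal sketch of the `immutable` variant (same example, only the storage class changed):
---
immutable string[] keywords = [ "abstract", "alias", "align" /*, ... */ ];

void main()
{
    auto x = keywords;
    auto y = keywords;
    assert(x is y);   // both refer to the same immutable array, no allocation
    // x[0] = "";     // no longer compiles: the elements are immutable
}
---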
Jun 04 2012
On 04.06.2012 16:15, Denis Shelomovskij wrote:
> 04.06.2012 13:46, Dmitry Olshansky wrote:
>> enum keywords = [ "abstract", "alias", "align", //... ];
> A nitpick: constant arrays should be defined as `immutable` instead of `enum`. `enum` means that every time `keywords` is used, a new mutable array is dynamically allocated:
> ---
> auto x = keywords;
> auto y = keywords;
> assert(x !is y); // passes
> x[0] = ""; // legal, it's mutable
> ---

Thanks, the mutable part is then a consequence of being another rvalue. Obviously the table should have been immutable.

--
Dmitry Olshansky
Jun 04 2012
On Monday, 4 June 2012 at 12:15:33 UTC, Denis Shelomovskij wrote:
> A nitpick: constant arrays should be defined as `immutable` instead of `enum`. `enum` means that every time `keywords` is used, a new mutable array is dynamically allocated: [...]

Well, in my case they were only used for mixin code generation (IIRC). Dmitry obviously has a different situation.
Jun 04 2012
On 6/4/12 4:46 AM, Dmitry Olshansky wrote:
> Cross-posting from my GSOC list as I think we've had a lack of "D rox" posts lately :)
[snip]

I think it would be great if you converted this post into an article.

Andrei
Jun 04 2012
On 04.06.2012 18:19, Andrei Alexandrescu wrote:
> I think it would be great if you converted this post into an article.

Sounds good, will do once I fix a few issues that were mentioned (bit-packing, GC types etc.)

--
Dmitry Olshansky
Jun 04 2012
On Monday, 4 June 2012 at 15:35:31 UTC, Dmitry Olshansky wrote:
> Sounds good, will do once I fix a few issues that were mentioned (bit-packing, GC types etc.)

Would be interesting to see some more examples, along with the rationale/motivation for various aspects of your API, and possible usage scenarios.

Tries are fundamental (both the data structures and the respective algorithms) for lookup problems, in the same way as arrays are fundamental for indexed access. For example, they have several advantages over hash tables. Hash calculation requires roughly const * key.length operations, which is comparable to the number of comparisons needed for a trie lookup; but hash tables may be less space efficient, or suffer hash collisions which increase lookup time. Also, it is possible to create persistent (immutable) tries with efficient (O(log N)) inserting/deleting (this scenario is very important for my DCT project). Immutable hash tables would require O(N) copying for each insert/delete.

It is difficult to create a good API for fundamental data structures, because various use cases would motivate different trade-offs. The same is true for implementation. This is why I like your decision to introduce policies for configuration. Rationale and use cases would help in analyzing the design of your API and implementation, so you will get better community feedback :)

Below are some notes related to my DCT use cases.

Your examples deal with lookup by the whole word (the first/last characters and the length are needed). Are your API and implementation adaptable for character-by-character trie lookup?

Will compile-time generation of lookup code based on tries be supported? The example currently in DCT (first implemented by Brian Schott in his Dscanner project) uses switch statements (which means lookup linear in the number of possible characters at each position). A trivial improvement might be using if statements and binary lookup (e.g., if there are 26 possible characters used at some position, use only 5 comparisons, not 26).

I wanted to analyse your regex implementation, but that's no easy task and requires a lot of effort... It looks like the most promising alternative to the binary trie lookup I described in the previous paragraph. Similarities and differences with your regex design might also help us understand tries better.
Jun 04 2012
On 04.06.2012 22:42, Roman D. Boiko wrote:
> [snip]
> Also, it is possible to create persistent (immutable) tries with efficient (O(log N)) inserting/deleting (this scenario is very important for my DCT project). Immutable hash tables would require O(N) copying for each insert/delete.

Aye, that's the good thing about them - the amount of data affected by any change is localized to one "page". So you just copy it over and remap the indices/pointers.

> It is difficult to create a good API for fundamental data structures, because various use cases would motivate different trade-offs. The same is true for implementation. This is why I like your decision to introduce policies for configuration. Rationale and use cases would help in analyzing the design of your API and implementation, so you will get better community feedback :)

Well, I guess I'll talk about them in depth in the article, as the material exceeds the sane limits of a single NG post. In brief:
- multiple levels are stored in one memory chunk, one after another, which helps a bit with cache locality (the first level goes first)
- constructors minimize the number of "pages" on each level by constructing it outwards from the last level and checking for duplicates (costs ~ O(N^2) though, IIRC) - see the dedup sketch after this post
- I learned the hard way not to introduce extra conditionals anywhere, so there is no "out of range, max index, not existent" crap; in all cases it's a clean-cut memory access. The extra bits lost on having at least one "default" page per level can be saved by going an extra level - extra levels aren't as bad as they look, since the memory is close anyway.

> Below are some notes related to my DCT use cases. Your examples deal with lookup by the whole word (the first/last characters and the length are needed). Are your API and implementation adaptable for character-by-character trie lookup?

I would say that one-by-one won't help you much, since the speed is almost the same if not worse. The problematic thing with one-by-one: say you want to stop early, right? Now you have to check the *NOT FOUND* case, and that implies extra branching (if(...)) on _each level_, and maybe reusing certain valid values as a "not found" marker (extra configuration?). Branching and testing are the things that kill the speed advantage of Tries - the ones I overlooked in my previous attempt, see std/internal/uni.d. The other is separate locations for data and index: pointer-happy disjoint node(page) locality is another way into the same fault.

> Will compile-time generation of lookup code based on tries be supported? The example currently in DCT (first implemented by Brian Schott in his Dscanner project) uses switch statements (which means lookup linear in the number of possible characters at each position).

Nope, the days of linear lookup for switch are over (were there even such days?) - the compiler always does a binary search nowadays, unless linear is more efficient, e.g. for a small number of values (it even weighs which is better and uses a combination of them). However, you'd better check the asm code afterwards. The compiler is like a nasty stepchild: it will give up on generating good old jump tables given any reason it finds justifiable. (But it may use a few small jump tables + binary search; that could be fine... if not in a tight loop!)

> A trivial improvement might be using if statements and binary lookup (e.g., if there are 26 possible characters used at some position, use only 5 comparisons, not 26).

Moreover, you'd be surprised, but such a leap-frog binary search loses by a big margin to a _multilayer_ lookup table. (I for one was quite shocked back then.)

> I wanted to analyse your regex implementation, but that's no easy task and requires a lot of effort...

Yeah, sorry for some encrypted Klingon here and there ;)

> It looks like the most promising alternative to the binary trie lookup I described in the previous paragraph. Similarities and differences with your regex design might also help us understand tries better.

Well, if you are in no rush you might as well just take my latest development in the ways of Trie from my GSoC fork :) Skipping some gory days of try and trial (fail) in the process ;)

--
Dmitry Olshansky
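To illustrate the page-deduplication bullet above, here's a rough sketch; the helper name internPage is made up, and the real construction code (see the fork linked earlier) packs pages into one chunk rather than keeping an array of arrays:

import std.algorithm : countUntil;

// When a freshly built page equals an existing one, reuse its index instead of
// appending. The linear scan per page is what makes construction ~O(N^2).
size_t internPage(T)(ref T[][] pages, T[] page)
{
    auto idx = pages.countUntil!(p => p == page);
    if (idx >= 0)
        return idx;          // duplicate found - share it
    pages ~= page.dup;       // genuinely new page
    return pages.length - 1;
}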
Jun 04 2012
On Monday, 4 June 2012 at 19:18:32 UTC, Dmitry Olshansky wrote:
> Aye, that's the good thing about them - the amount of data affected by any change is localized to one "page". So you just copy it over and remap the indices/pointers.

Actually, Microsoft went this way (immutable hash tables) in their Roslyn preview implementation. However, I still believe that tries will work better here. Will check... Would bulk pre-allocation of the memory used by the trie improve locality? With some heuristics for copying when it goes out of control.

> constructors minimize the number of "pages" on each level by constructing it outwards from the last level and checking for duplicates (costs ~ O(N^2) though, IIRC)

So this price is paid only on construction, right? Are there alternatives to choose from (when needed)? If yes, which?

> I learned the hard way not to introduce extra conditionals anywhere, so there is no "out of range, max index, not existent" crap; in all cases it's a clean-cut memory access. The extra bits lost on having at least one "default" page per level can be saved by going an extra level.

Could you please elaborate? How do you handle situations when "not existent", etc. is needed?

> I would say that one-by-one won't help you much, since the speed is almost the same if not worse.

I guess in general your statement is true, especially because a known length could improve speed significantly. Not sure (but I can easily believe) that it is true in my particular situation. For example, one-by-one would allow ignoring the key encoding (and thus using multiple encodings simultaneously just as easily as a single one).

> The problematic thing with one-by-one: say you want to stop early, right?

Why? I'd like to lex inout as TokenKind.InOut, not TokenKind.In followed by TokenKind.Out. Did I misunderstand your question?

> Now you have to check the *NOT FOUND* case, and that implies extra branching (if(...)) on _each level_, and maybe reusing certain valid values as a "not found" marker (extra configuration?).

This should be easy: if something is not a keyword, it is likely an identifier. But I agree in general, and probably even in my case.

> Branching and testing are the things that kill the speed advantage of Tries - the ones I overlooked in my previous attempt, see std/internal/uni.d. The other is separate locations for data and index: pointer-happy disjoint node(page) locality is another way into the same fault.

This concern has been disturbing me for some time already, and it slightly demotivates me, because implementing something this way will likely lead to a failure. I don't have enough experience with the alternatives to know their advantages and trade-offs. I'll check your code. I did plan to try table lookup instead of branching. I guess making my own mistakes is necessary anyway.

> Nope, the days of linear lookup for switch are over (were there even such days?) - the compiler always does a binary search nowadays, unless linear is more efficient, e.g. for a small number of values (it even weighs which is better and uses a combination of them).

I thought this *might* be the case, but didn't know and never checked. I also wanted to do a linear search for some empirically chosen small number of items.

> However, you'd better check the asm code afterwards. The compiler is like a nasty stepchild: it will give up on generating good old jump tables given any reason it finds justifiable. (But it may use a few small jump tables + binary search; that could be fine... if not in a tight loop!)

Thanks.

> Moreover, you'd be surprised, but such a leap-frog binary search loses by a big margin to a _multilayer_ lookup table. (I for one was quite shocked back then.)

Thanks again :) Are any numbers or perf. tests available?

> Well, if you are in no rush you might as well just take my latest development in the ways of Trie from my GSoC fork :) Skipping some gory days of try and trial (fail) in the process ;)

OK
Jun 04 2012
[snip]
> Actually, Microsoft went this way (immutable hash tables) in their Roslyn preview implementation. However, I still believe that tries will work better here. Will check... Would bulk pre-allocation of the memory used by the trie improve locality? With some heuristics for copying when it goes out of control.

Sorry, my Trie implementation is focused on the "construct once - read everywhere" case. Like I noted, mutation is not hard, but it's easy to blow up the size if used unknowingly. I think along the lines of:

auto trie = ...; //create it
for(...)
{
    trie[x] = ...; //modify repeatedly
}
trie.compact(); //redo the same algorithm as during construction, O(N^2)

> So this price is paid only on construction, right? Are there alternatives to choose from (when needed)? If yes, which?

See above. The price is paid every time you want to squeeze it into as little memory as possible by removing duplicate pages.

> Could you please elaborate? How do you handle situations when "not existent", etc. is needed?

Easy: say you have 2 levels, 4 entries each (for the sake of simplicity), and say the final value is int. Then, in the highly redundant case, or just when there is a small number of values in the initial data set (both cases are equivalent; thus tries are easily invertible, btw):

LVL 0: [0, 1, 0, 0]
LVL 1: [0, 0, 0, 0] [1, 0, 2, 3]

The first level always occupies its full size (it's asserted that there is no index >= 4). Note again - only full pages, no cheating with half-pages, but we can save on the number of them (note the almost obligatory zero page). So it contains only 1, 2 and 3, at indexes 4, 6, 7 respectively; T.init is our way of saying NOT EXISTS... yes, I think that should be user-definable. This way there are no checks anywhere, only shift, add, dereference, shift, add, dereference... (see the lookup sketch after this post)

> I guess in general your statement is true, especially because a known length could improve speed significantly. Not sure (but I can easily believe) that it is true in my particular situation.

That, and some simple-minded hashing of, say, the first char and the length ;) In any case you need to munch the whole symbol, I think.

> For example, one-by-one would allow ignoring the key encoding (and thus using multiple encodings simultaneously just as easily as a single one).

It's just as easy with the whole thing. Treat it as bytes ;)

> Why? I'd like to lex inout as TokenKind.InOut, not TokenKind.In followed by TokenKind.Out. Did I misunderstand your question?

No, I meant stopping on, say, level 2 of a 5-level Trie because the element was not found. A stupid idea, generally.

> This concern has been disturbing me for some time already [...] I guess making my own mistakes is necessary anyway.

It could be enlightening - just don't give up too quickly, and don't jump to conclusions. In fact, try to be sympathetic with the "losing party", as in... "hm, this way is much slower, too bad - I have to optimize it somehow". In other words, make sure you've squeezed all you can from the "slow" method.

> I thought this *might* be the case, but didn't know and never checked. I also wanted to do a linear search for some empirically chosen small number of items.

Compare with C's memchr and such, just in case; it was ~4x faster than Phobos' find last time I checked.

> Thanks again :) Are any numbers or perf. tests available?

Well, I might recompile regex to use binary search vs Trie again; off the record, it was ~2 times slower minimum. That was a sorted array vs an oldish Trie; note however that it's responsible for only ~40% of the actual matching time... so it's around 2.5 * 2 = 5. The compiler's switch I *believe* could be at least ~2x better, as it's a fully unrolled binary search, but it's hard to guess by how much exactly, and it depends on the CPU model at the end of the day. The older the CPU, the better lookup tables generally are, except these Celeron mules with half the cache ;)

--
Dmitry Olshansky
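To make the branch-free lookup concrete, here is a minimal sketch of the 2-level example above. The layout and names are illustrative only; the real code packs all levels into one memory chunk and derives the shifts from the configured bit sizes:

// 2 bits of the key per level; int.init (0) doubles as "not found".
size_t[] level0 = [0, 1, 0, 0];  // page numbers
int[]    level1 = [0, 0, 0, 0,   // page 0: the obligatory zero page
                   1, 0, 2, 3];  // page 1

int lookup(uint key)
{
    // shift, add, dereference - no branches anywhere
    return level1[level0[(key >> 2) & 3] * 4 + (key & 3)];
}

unittest
{
    assert(lookup(4) == 1 && lookup(6) == 2 && lookup(7) == 3);
    assert(lookup(0) == 0); // hits the zero page: not present
    assert(lookup(5) == 0); // page 1, but the slot holds int.init
}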
Jun 04 2012
On Monday, 4 June 2012 at 20:40:03 UTC, Dmitry Olshansky wrote:
> [snip]
> Sorry, my Trie implementation is focused on the "construct once - read everywhere" case.

My cases will likely have a quite low amortized number of reads per insert/delete.

> Easy: say you have 2 levels, 4 entries each [...] This way there are no checks anywhere, only shift, add, dereference, shift, add, dereference...

Smart.

> It's just as easy with the whole thing. Treat it as bytes ;)

Except when equivalent keys in different encodings should be treated as equal. But now I can see that my counter-example is only partially valid - walkLength could be used instead of length (more expensive, though), and dchars everywhere.

> It could be enlightening - just don't give up too quickly, and don't jump to conclusions. [...] In other words, make sure you've squeezed all you can from the "slow" method.

This deserves quoting somewhere :) Thanks a lot and have a good night! (It's late in Russia, isn't it?)
Jun 04 2012
On Monday, 4 June 2012 at 21:07:02 UTC, Roman D. Boiko wrote:
> Except when equivalent keys in different encodings should be treated as equal. But now I can see that my counter-example is only partially valid - walkLength could be used instead of length (more expensive, though), and dchars everywhere.

Another counter-example is searching for strings with a specified prefix. One-by-one fits better here. I didn't understand whether such use cases are supported at both the API and implementation levels.
Jun 04 2012
On 05.06.2012 1:16, Roman D. Boiko wrote:
> Except when equivalent keys in different encodings should be treated as equal.

I believe that once the encoding is established, your compiler tool should use the best code for that encoding. And that means templates, tailored per encoding, in my book. If anything, I plan to use Tries on strings without decoding code points, just using the length (as the first stage; might need some tweaking).

> Another counter-example is searching for strings with a specified prefix. One-by-one fits better here. I didn't understand whether such use cases are supported at both the API and implementation levels.

They are not... *yawn*. Okay, I'll make it support an InputRange of typeof(Key.init[0]) along with the specific key type, iff the key type is a RandomAccessRange :) It will not work with SetAsSlot and MapAsSlot, however.

And before you run away with that horrible idea of ever decoding UTF in a lexer... Just don't do that. Trust me, it's not as small a price as it seems at first. At least keep it only at the prototype stage, as it simplifies things.

--
Dmitry Olshansky
Jun 04 2012
On Monday, 4 June 2012 at 21:39:50 UTC, Dmitry Olshansky wrote:
> And before you run away with that horrible idea of ever decoding UTF in a lexer... Just don't do that. Trust me, it's not as small a price as it seems at first. At least keep it only at the prototype stage, as it simplifies things.

I didn't plan to convert the input into some other encoding. But I missed the idea that it is possible to create the finite automata as a template and avoid decoding altogether. IIRC, I rejected this approach when I decided to convert everything into UTF-8 long ago, and didn't reconsider after discarding that idea following your previous suggestion to avoid converting. Thus your idea was used only partially, and now I wonder how I did not discover this myself! :)
Jun 04 2012
On Monday, 4 June 2012 at 21:39:50 UTC, Dmitry Olshansky wrote:
> I believe that once the encoding is established, your compiler tool should use the best code for that encoding. And that means templates, tailored per encoding, in my book. If anything, I plan to use Tries on strings without decoding code points, just using the length (as the first stage; might need some tweaking).

Will it be difficult to adapt your API for immutable tries? E.g., it is not possible to implement an immutable sequence (a linked list) as a range, so the range API doesn't fit it (but could, with a tweak - returning the tail instead of mutating in popFront). If the trie API has similar problems, then I'll need to invent my own. I understand that immutability is not your priority for GSoC, though.
Jun 04 2012
On 05.06.2012 1:56, Roman D. Boiko wrote:
> Will it be difficult to adapt your API for immutable tries? E.g., it is not possible to implement an immutable sequence (a linked list) as a range,

Linked list? I'm horrified. Though I'd need some info on where and why you'd need that ;) Despite some claims to the contrary, small arrays (like, e-hm, 10_000 elements) are faster in nearly all possible operations.

> so the range API doesn't fit it (but could, with a tweak - returning the tail instead of mutating in popFront). If the trie API has similar problems, then I'll need to invent my own. I understand that immutability is not your priority for GSoC, though.

Well, I might remove the obstacles, if you outline your design more clearly.

--
Dmitry Olshansky
Jun 04 2012
On 05.06.2012 2:06, Dmitry Olshansky wrote:
> Linked list? I'm horrified. Though I'd need some info on where and why you'd need that ;) Despite some claims to the contrary, small arrays (like, e-hm, 10_000 elements) are faster in nearly all possible operations.

And you can fake immutability by always picking an unused slot, btw. No need to go beyond logical immutability.

--
Dmitry Olshansky
Jun 04 2012
On Monday, 4 June 2012 at 22:21:42 UTC, Dmitry Olshansky wrote:
> And you can fake immutability by always picking an unused slot, btw. No need to go beyond logical immutability.

That's applicable in some cases, but not in general. But I agree that it is often possible to optimize if the use cases are known. My example concern was about a fundamental problem of range APIs for immutable data structures, which cannot be emulated away: popFront is mutating by design.
Jun 04 2012
On 05.06.2012 2:28, Roman D. Boiko wrote:
> That's applicable in some cases, but not in general. But I agree that it is often possible to optimize if the use cases are known. My example concern was about a fundamental problem of range APIs for immutable data structures, which cannot be emulated away: popFront is mutating by design.

Keep in mind that ranges are temporary objects most of the time. They are grease for the wheels of algorithms. Given a data structure S, its range is R(element of S). Thus, for an immutable data structure, the range will be a mutable entity over an immutable element type. An interesting example is immutable strings, which still have ranges over them - ranges that even return dchar, not immutable(char).

--
Dmitry Olshansky
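A minimal sketch of this idea for the linked-list case discussed above (the types are hypothetical, not from the Trie code): the nodes are immutable, while the range over them is an ordinary mutable struct.

// The list itself never changes; only the range's cursor does.
struct Node(T)
{
    T head;
    immutable(Node!T)* tail;
}

struct ListRange(T)
{
    immutable(Node!T)* cur;
    @property bool empty() const { return cur is null; }
    @property T front() const { return cur.head; }
    void popFront() { cur = cur.tail; } // advances the cursor, list untouched
}

unittest
{
    immutable Node!int second = Node!int(2, null);
    immutable Node!int first  = Node!int(1, &second);
    auto r = ListRange!int(&first);
    assert(r.front == 1);
    r.popFront();           // r mutates; first and second never do
    assert(r.front == 2);
}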
Jun 04 2012
On Tuesday, 5 June 2012 at 04:42:07 UTC, Dmitry Olshansky wrote:
> Keep in mind that ranges are temporary objects most of the time. They are grease for the wheels of algorithms. [...] Thus, for an immutable data structure, the range will be a mutable entity over an immutable element type.

I'm currently trying this approach (instead of working with the immutable data structures directly, which would require recursion everywhere) in my experimental functional data structures, and it looks promising. :)
Jun 04 2012
On Monday, 4 June 2012 at 22:06:49 UTC, Dmitry Olshansky wrote:
> Linked list? I'm horrified. Though I'd need some info on where and why you'd need that ;)

That was for illustration only, as the most trivial example of an immutable data structure (and probably the most widely used structure in functional programming).

> Well, I might remove the obstacles, if you outline your design more clearly.

OK, thanks! I'll go through your code first to understand it better. But even before that I need to finish an important support request from a past customer...
Jun 04 2012
On Monday, 4 June 2012 at 22:22:34 UTC, Roman D. Boiko wrote:
> OK, thanks! I'll go through your code first to understand it better.

Basically, the most important aspect of immutability is returning a new instance of the data structure on each insert/delete/update, keeping the old one unchanged, instead of performing the update in place. This may not fit your most common use cases, though. It would be great to enable such semantics via policies, but without deep analysis I can't come up with a good API/design for that (without overcomplicating it). Probably keeping the mutable and immutable APIs separate is the best choice. I'll return to this problem once I get a bit of free time.
Jun 04 2012
On Tuesday, 5 June 2012 at 05:28:48 UTC, Roman D. Boiko wrote:
> ... without deep analysis I can't come up with a good API/design for that (without overcomplicating it). Probably keeping the mutable and immutable APIs separate is the best choice. I'll return to this problem once I get a bit of free time.

The simplest and possibly best approach is to provide an immutable wrapper over the mutable implementation, but that may be difficult to make efficient, given the need to support insert/delete as common operations.
Jun 04 2012
On 05.06.2012 9:33, Roman D. Boiko wrote:
> The simplest and possibly best approach is to provide an immutable wrapper over the mutable implementation, but that may be difficult to make efficient, given the need to support insert/delete as common operations.

I suspect I would have to add another policy, like Persistent, that will implicitly preallocate some slack space, so that some pages can be shallowly copied on each assign (a la COW) and the immutable parts still reused. Another, simpler way would be to use a separate "pointer-to-actual-base" field for each level, allowing it to be redirected to a new (modified) copy. Still looks like a policy, as it _may_ be slightly slower. Anyway, as a start this should work:

auto modifyDupGlobalImmutableTrie(T, Args...)(Trie!(immutable(T), Args) t,
        scope void delegate(ref Trie!(immutable(T), Args)) pure dg) pure
{
    auto copy = t.dup; //this would one day be a shallow copy
    dg(copy);
    return copy;
}

//later on:
immutable newTrie = modifyDupGlobalImmutableTrie(yourTrie, (ref t){ /*...*/ });

--
Dmitry Olshansky
Jun 05 2012
On Tuesday, 5 June 2012 at 12:57:10 UTC, Dmitry Olshansky wrote:
> I suspect I would have to add another policy, like Persistent, that will implicitly preallocate some slack space, so that some pages can be shallowly copied on each assign (a la COW) and the immutable parts still reused. [...]

Yes, something like that should work. I finished the support request and will investigate this and your std.uni. Maybe it is better to avoid immutability... or to do bulk inserts/deletes before a copy.
Jun 05 2012
On Tuesday, 5 June 2012 at 13:27:12 UTC, Roman D. Boiko wrote:
> Maybe it is better to avoid immutability... or to do bulk inserts/deletes before a copy.

Would it be difficult to introduce two optimizations:
* have strings of all lengths above a configurable threshold go to the same bucket (is that reasonable?)
* have the ability to store a checkpoint, then do a series of modifications, and store another checkpoint, reusing the data which has not been affected.

For the second optimization, a possible design would be to store the internal state as a snapshot + diffs, and apply the diffs when creating another snapshot. The diff format should allow performing the trie interface operations efficiently.
Jun 11 2012
On 11.06.2012 16:11, Roman D. Boiko wrote:
> Would it be difficult to introduce two optimizations:
> * have strings of all lengths above a configurable threshold go to the same bucket (is that reasonable?)

The sample I listed did just that: for length >= 64 it goes to the bucket of 0-length strings. After all, it's your prefix function, with any distribution you like ;)

> * have the ability to store a checkpoint, then do a series of modifications, and store another checkpoint, reusing the data which has not been affected.

Yea, that's a rough sketch of what I want to support.

> For the second optimization, a possible design would be to store the internal state as a snapshot + diffs, and apply the diffs when creating another snapshot. The diff format should allow performing the trie interface operations efficiently.

It may be. The diff could be a bunch of pages that are XOR-ed on top of the snapshot. Dunno if it's a worthwhile trick yet.

--
Dmitry Olshansky
Jun 11 2012
On Monday, 11 June 2012 at 14:48:27 UTC, Dmitry Olshansky wrote:
> The sample I listed did just that: for length >= 64 it goes to the bucket of 0-length strings. After all, it's your prefix function, with any distribution you like ;)

Great.

> Yea, that's a rough sketch of what I want to support.

That's wonderful :)

> It may be. The diff could be a bunch of pages that are XOR-ed on top of the snapshot. Dunno if it's a worthwhile trick yet.

It should be, provided that a single insert/delete doesn't affect many pages and the page size is reasonably small. One only needs to create the pages which are affected; and there is no need to track separate diffs between snapshots, so they can be combined. Another option might be creating a separate trie for insertions and tracking deletions somehow, provided that tries can be merged efficiently. Actually, in my case deletions could be deferred and performed in bulk. OTOH, I would need to count the number of times a string is inserted minus the number of times it has been deleted. Alternatively, I could just check from time to time which strings are not needed any more (a micro-GC :) ). There are many possible data structures, but the one you mentioned seems the most reasonable.
Jun 11 2012
On 11.06.2012 20:43, Roman D. Boiko wrote:
[snip]
> There are many possible data structures, but the one you mentioned seems the most reasonable.

I meant an operation pseudo-XOR, x P^ y, where x is part of the snapshot and y is part of the diff page:

x P^ y == x when y == T.init
x P^ y == y when y != T.init

--
Dmitry Olshansky
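A minimal sketch of this combine rule (the name pcombine is made up for illustration):

// The diff entry wins whenever it is set; T.init marks "no change".
T pcombine(T)(T x, T y)
{
    return y == T.init ? x : y;
}

unittest
{
    assert(pcombine(5, 0) == 5); // y == int.init: the snapshot value survives
    assert(pcombine(5, 7) == 7); // y != int.init: the diff overrides
    assert(pcombine(7, 5) == 5); // unlike real XOR, not symmetric
}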
Jun 11 2012
On Monday, 11 June 2012 at 18:16:45 UTC, Dmitry Olshansky wrote:
> I meant an operation pseudo-XOR, x P^ y, where x is part of the snapshot and y is part of the diff page:
> x P^ y == x when y == T.init
> x P^ y == y when y != T.init

I understood; I just mentioned some alternatives (maybe not reasonable ones, given your trie implementation).
Jun 11 2012
On Monday, 11 June 2012 at 18:23:39 UTC, Roman D. Boiko wrote:
> I understood; I just mentioned some alternatives (maybe not reasonable ones, given your trie implementation).

Or probably I didn't understand :) Why "pseudo"?
Jun 11 2012
On Monday, 11 June 2012 at 18:27:58 UTC, Roman D. Boiko wrote:
> Or probably I didn't understand :) Why "pseudo"?

Is this article somehow related, or does it just use the same term?
http://iwct.sjtu.edu.cn/Personal/mxtao/paper/liujc_icc11_PID1075548.pdf
I couldn't google anything more useful so far.
Jun 11 2012
On 11.06.2012 22:30, Roman D. Boiko wrote:
> Or probably I didn't understand :) Why "pseudo"?

0 ^ 0 == 0 and T.init P^ T.init == T.init :) - so far they agree - but 1 ^ 1 == 0 while 1 P^ 1 == 1, and in general x P^ y == y for y != T.init. So it's not symmetric when x, y != T.init, for starters ;)

> Is this article somehow related, or does it just use the same term?
> http://iwct.sjtu.edu.cn/Personal/mxtao/paper/liujc_icc11_PID1075548.pdf

Looks like it might be, but somehow I can't penetrate this scientese (I usually can). In any case, I believe it's more of a coincidence.

--
Dmitry Olshansky
Jun 11 2012
On Monday, 11 June 2012 at 18:42:14 UTC, Dmitry Olshansky wrote:
> 0 ^ 0 == 0 and T.init P^ T.init == T.init :) - so far they agree - but 1 ^ 1 == 0 while 1 P^ 1 == 1, and in general x P^ y == y for y != T.init. So it's not symmetric when x, y != T.init, for starters ;)

Just a remark that DCT no longer needs tries to be immutable, since I decided not to backtrack which strings affect which syntax elements. OTOH, it is important to be able to do updates and reads at any time (intermixed). Another note is that XOR makes sense only if you plan to do reverse transforms, or expect to pack the diffs (if they are sparse). Otherwise it would be more efficient to store the new values directly.
Jun 24 2012
On 24-Jun-12 14:50, Roman D. Boiko wrote:
> Another note is that XOR makes sense only if you plan to do reverse transforms, or expect to pack the diffs (if they are sparse). Otherwise it would be more efficient to store the new values directly.

Storing new values directly is no easy trick, as you need to crawl the page tables all the way up to see whether the value corresponds to multiple indexes. Well, something similar happens with a diff trie anyway. But the diffs are indeed very sparse, and the trie itself is usually sparse as well.

--
Dmitry Olshansky
Jun 24 2012
On Mon, 04 Jun 2012 23:18:31 +0400, Dmitry Olshansky <dmitry.olsh gmail.com> wrote:
> The compiler is like a nasty stepchild: it will give up on generating good old jump tables given any reason it finds justifiable. (But it may use a few small jump tables + binary search; that could be fine... if not in a tight loop!)

Looks like you are back in control. From http://gcc.gnu.org/gcc-4.7/changes.html :

** General Optimizer Improvements **

Support for a new parameter --param case-values-threshold=n was added to allow users to control the cutoff between doing switch statements as a series of if statements and using a jump table.

--
Marco
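For example, a hypothetical invocation (the threshold value 8 and the file name are arbitrary; per the changelog above, the parameter sets the cutoff between an if-series and a jump table):

gcc -O2 --param case-values-threshold=8 lexer.c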
Jul 19 2012