digitalmars.D.learn - Efficiently streaming data to associative array

reply Guillaume Chatelet <chatelet.guillaume gmail.com> writes:
Let's say I'm processing megabytes of data: I'm lazily iterating 
over the incoming lines and storing data in an associative array. 
I don't want to copy unless I have to.

Contrived example follows:

input file
----------
a,b,15
c,d,12
...

Efficient ingestion
-------------------
import std.format : formattedRead;
import std.stdio : stdin, writeln;

void main() {

   size_t[string][string] indexed_map;

   foreach(char[] line ; stdin.byLine) {
     char[] a;
     char[] b;
     size_t value;
     line.formattedRead!"%s,%s,%d"(a,b,value);

     // Copy the key (idup) only when it is not in the map yet.
     auto pA = a in indexed_map;
     if(pA is null) {
       pA = &(indexed_map[a.idup] = (size_t[string]).init);
     }

     auto pB = b in (*pA);
     if(pB is null) {
       pB = &((*pA)[b.idup] = size_t.init);
     }

     // Technically unneeded but let's say we have more than 2 dimensions.
     (*pB) = value;
   }

   indexed_map.writeln;
}


I'd describe this code as ugly but fast. Any ideas on how to make 
it less ugly? Is there something in Phobos to help?
Aug 08
parent reply Steven Schveighoffer <schveiguy yahoo.com> writes:
I wouldn't use formattedRead, as I think this is going to allocate 
temporaries for a and b.

Note, this is very close to Jon Degenhardt's blog post in May: 
https://dlang.org/blog/2017/05/24/faster-command-line-tools-in-d/

-Steve
Aug 08
next sibling parent reply Guillaume Chatelet <chatelet.guillaume gmail.com> writes:
I haven't dug into formattedRead yet, but thanks for letting me 
know : )
I was mostly talking about the pattern with the AA. I guess the 
best I can do is a templated function to hide the ugliness.

import std.format : formattedRead;
import std.stdio : stdin, writeln;

// Returns a reference to map[key], inserting Value.init first if the
// key is missing. The key is copied (idup) only on insertion.
ref Value GetWithDefault(Value)(ref Value[string] map, const (char[]) key) {
  auto pValue = key in map;
  if(pValue) return *pValue;
  return map[key.idup] = Value.init;
}

void main() {

  size_t[string][string] indexed_map;

  foreach(char[] line ; stdin.byLine) {
    char[] a;
    char[] b;
    size_t value;
    line.formattedRead!"%s,%s,%d"(a,b,value);

    indexed_map.GetWithDefault(a).GetWithDefault(b) = value;
  }

  indexed_map.writeln;
}

Not too bad actually!
Aug 08
parent reply kerdemdemir <kerdemdemir hotmail.com> writes:
As a total beginner, I don't feel entirely comfortable with basic 
operations on AAs.

First, although I'm very happy we have pointers, using them in a 
common operation like this makes the language feel a bit unsafe, 
IMHO.

Second, the "in" keyword has always seemed very specific to me.

I think I will use your "ref Value GetWithDefault(Value)" solution 
very often, since it hides both of these things.
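
For readers with the same concern, here is a minimal sketch of the basic AA operations that need no explicit pointer handling. It relies only on built-in AA behavior (`in`, `get`, and nested index assignment); the keys and values are made up for illustration.

```d
void main()
{
    size_t[string][string] indexed_map;

    // Nested slots are created on assignment when the keys are already strings.
    indexed_map["a"]["b"] = 15;

    // `in` yields a pointer to the stored value, or null when the key is absent.
    auto inner = "a" in indexed_map;
    assert(inner !is null);
    assert((*inner)["b"] == 15);
    assert(("z" in indexed_map) is null);

    // `get` returns a copy of the value, or the given default when the key is absent.
    int[string] counts = ["x": 1];
    assert(counts.get("x", 0) == 1);
    assert(counts.get("missing", 0) == 0);
}
```

The pointer returned by `in` only matters when the lookup and the update should share a single hash lookup, which is what GetWithDefault wraps.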
Aug 09
parent Guillaume Chatelet <chatelet.guillaume gmail.com> writes:
You don't need this most of the time. If you already have the 
correct type, it's easy:

  size_t[string][string] indexed_map;
  string a, b; // a and b are strings not char[]
  indexed_map[a][b] = value; // this will create the AA slots if needed

In my specific case the data is streamed from stdin and is not kept 
in memory. byLine returns a view of the stdin buffer which may be 
overwritten at the next loop iteration, so I can't use the index 
operator directly: I need a string that does not change over time.

I could have used this code:

  void main() {

    size_t[string][string] indexed_map;

    foreach(char[] line ; stdin.byLine) {
      char[] a;
      char[] b;
      size_t value;
      line.formattedRead!"%s,%s,%d"(a,b,value);

      indexed_map[a.idup][b.idup] = value;
    }

    indexed_map.writeln;
  }

It's perfectly OK if the data is small. In my case the data is huge 
and creating a copy of the strings at each iteration is costly.
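
The buffer-reuse problem can be illustrated without any input file. This is a minimal sketch: the `buffer` array below is a stand-in (an assumption for illustration) for the internal buffer that byLine reuses, and the strings are made up.

```d
import std.stdio : writeln;

void main()
{
    // Stand-in for the buffer that byLine reuses between iterations
    // (an assumption, so that the example needs no input).
    char[] buffer = "alpha".dup;

    char[] view = buffer[0 .. $]; // like a line from byLine: a slice of the buffer
    string copy = view.idup;      // an immutable copy that owns its own memory

    size_t[string] seen;
    seen[copy] = 1;

    // The "next line" overwrites the buffer in place.
    buffer[] = "omega";

    assert(view == "omega"); // the view changed along with the buffer
    assert(copy == "alpha"); // the copy did not
    assert(("alpha" in seen) !is null);

    writeln(seen);
}
```

This is why the key needs an idup when it is first stored, and why skipping the idup for keys that already exist saves the allocation on every later line.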
Aug 09
prev sibling parent reply Anonymouse <asdf asdf.net> writes:
On Tuesday, 8 August 2017 at 16:00:17 UTC, Steven Schveighoffer 
wrote:
 I wouldn't use formattedRead, as I think this is going to 
 allocate temporaries for a and b.
What would you suggest to use in its stead? My use-case is similar 
to the OP's in that I have a string of tokens that I want split 
into variables.

import std.stdio;
import std.format;

void main()
{
    string abc, def;
    int ghi, jkl;

    string s = "abc,123,def,456";
    s.formattedRead!"%s,%d,%s,%d"(abc, ghi, def, jkl);

    writeln(abc);
    writeln(def);
    writeln(ghi);
    writeln(jkl);
}
Aug 08
parent reply Steven Schveighoffer <schveiguy yahoo.com> writes:
Use splitter(","), and then parse each field with the appropriate 
function (e.g. to!).

For example, for the OP's code, I would do:

auto r = line.splitter(",");
a = r.front;
r.popFront;
b = r.front;
r.popFront;
c = r.front.to!int;

It would be nice if formattedRead didn't use appender, and instead 
sliced, but I'm not sure it can be fixed.

Note, one could make a template that does this automatically in one 
line.

-Steve
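
One possible shape for such a template is sketched below. The name `splitInto` and its signature are hypothetical (nothing like it is claimed to exist in Phobos); it only combines `splitter`, `std.conv.to`, and `enforce`. Targets of type char[] receive slices of the input line, so they must still be copied with idup before being stored as AA keys.

```d
import std.algorithm.iteration : splitter;
import std.conv : to;
import std.exception : enforce;

// Hypothetical helper: splits `line` on `delim` and fills one target per field.
// char[] targets alias `line` (no copy); other types are converted with to!.
void splitInto(Args...)(char[] line, char delim, ref Args args)
{
    auto fields = line.splitter(delim);
    foreach (i, Arg; Args)
    {
        enforce(!fields.empty, "too few fields in line");
        static if (is(Arg == char[]))
            args[i] = fields.front;
        else
            args[i] = fields.front.to!Arg;
        fields.popFront();
    }
}

void main()
{
    char[] line = "a,b,15".dup;

    char[] a, b;
    size_t value;
    line.splitInto(',', a, b, value);

    assert(a == "a");
    assert(b == "b");
    assert(value == 15);

    // `a` is a slice of `line`, not a copy.
    assert(a.ptr is line.ptr);
}
```

Extra trailing fields are ignored in this sketch; a stricter version could enforce that `fields` is empty after the loop.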
Aug 09
parent Jon Degenhardt <jond noreply.com> writes:
The blog post Steve referred to has examples of this type of 
processing while iterating over lines in a file. A couple of 
different ways to access the elements are shown. AA access is also 
addressed: 
https://dlang.org/blog/2017/05/24/faster-command-line-tools-in-d/

--Jon
Aug 10