digitalmars.D.learn - Saving and loading large data sets easily and efficiently

Brett <Brett gmail.com> writes:
I have done some large computations where the data set is around 
10GB and takes several minutes to generate. Rather than rerunning 
the computation every time and regenerating the same data, can I 
simply save it to disk easily?

The data is ordered in arrays and structs. It's just numbers/POD 
except some arrays use pointers to elements in other arrays(so 
the structs are not duplicated).

Hence any saving routine would have to take this into account 
and properly reference the referenced structs rather than 
duplicate them...

That is why it is not as simple as writing out the array data. 
Ideally it would write binary to save space.

Essentially pointers will become file offsets and so, in some 
sense, it is as if one had a memory map where ptr = 0 is the 
start of the data structure and ptr = 34 would reference the 34th 
byte. All pointers in the data structure are relative to the 
main structure (no heap allocations except for the arrays of 
struct pointers).

So it is not much more difficult than POD but would still be a 
little more work to write... I'm hoping that there is something 
already out there that can do this. It should be


The way the data is structured is that I have a master array of 
non-ptr structs.

E.g.,

S[] Data;
S*[] OtherStuff;

then every pointer points to an element in Data. I did not use 
ints as "pointers" for a specific non-relevant reason, but I 
should be able to convert every pointer to an index by simply 
subtracting the base address. [Technically I do not know whether 
this holds, but it should.]

OtherStuff's elements just reference Data's elements.


I imagine it wouldn't be that difficult to write out the Data. 
Save Data to a file, then append the rest of the info, and all 
pointers from memory can be converted to pointers to disk by 
simply computing ptr - Data.ptr. It still requires managing some 
other things though, as I do have some associative arrays:

S*[int] MoreStuff;

So I'm looking for a more robust solution that will handle any 
future expansions.
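
A minimal sketch of the ptr - Data.ptr idea described above, assuming every pointer really does point into one array. The struct layout of S, the function names (toIndices, toPointers, save, load) and the file layout (two lengths, then the elements, then the indices) are hypothetical and only for illustration. Note that in D, subtracting two S* values already yields an element count rather than a byte count.

```d
import std.file : append, read, write;

// Hypothetical POD element type; the thread never shows the real S.
struct S
{
    int id;
    double value;
}

/// Convert pointers into `data` to element indices.
size_t[] toIndices(const(S)[] data, const(S*)[] ptrs)
{
    auto result = new size_t[ptrs.length];
    foreach (i, p; ptrs)
    {
        // Every pointer is assumed to point at an element of `data`.
        assert(p >= data.ptr && p < data.ptr + data.length);
        // Pointer subtraction gives the distance in elements.
        result[i] = cast(size_t)(p - data.ptr);
    }
    return result;
}

/// Rebuild pointers from indices once `data` is back in memory.
S*[] toPointers(S[] data, const(size_t)[] indices)
{
    auto result = new S*[indices.length];
    foreach (i, idx; indices)
        result[i] = &data[idx];
    return result;
}

/// Write a header with both lengths, then the raw elements, then the indices.
void save(string path, const(S)[] data, const(S*)[] otherStuff)
{
    size_t[2] header = [data.length, otherStuff.length];
    write(path, header[]);
    append(path, data);
    append(path, toIndices(data, otherStuff));
}

/// Read the file back and reinterpret the bytes using the same layout.
void load(string path, out S[] data, out S*[] otherStuff)
{
    auto bytes = cast(ubyte[]) read(path);
    size_t pos = 2 * size_t.sizeof;
    auto header = cast(size_t[]) bytes[0 .. pos];
    data = cast(S[]) bytes[pos .. pos + header[0] * S.sizeof];
    pos += header[0] * S.sizeof;
    auto indices = cast(size_t[]) bytes[pos .. pos + header[1] * size_t.sizeof];
    otherStuff = toPointers(data, indices);
}
```

This reads the whole file into memory in one go and is only readable by a build with the same size_t width, struct layout and endianness; for a data set of around 10GB a memory-mapped file (std.mmfile) may be a better fit.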
Sep 30 2019
JN <666total wp.pl> writes:
On Monday, 30 September 2019 at 20:10:21 UTC, Brett wrote:
 So it is not much more difficult than POD but would still be a 
 little more work to write... I'm hoping that there is something 
 already out there that can do this. It should be
I'm afraid there's nothing like this available. Of the serialization libraries that I know, msgpack-d and cerealed don't store references and instead duplicate the pointed-to content. Orange does preserve references, but it doesn't support a binary output format, only XML, so it isn't a good fit for your data.
Oct 01 2019
Bastiaan Veelo <Bastiaan Veelo.net> writes:
On Monday, 30 September 2019 at 20:10:21 UTC, Brett wrote:
[...]
 The way the data is structured is that I have a master array of 
 non-ptr structs.

 E.g.,

 S[] Data;
 S*[] OtherStuff;

 then every pointer points to an element in Data. I did not 
 use ints as "pointers" for a specific non-relevant reason [...]
I would seriously consider turning that around and working with indices primarily, then taking the address of an indexed element whenever you do need a pointer for that specific non-relevant reason. It makes I/O trivial, and it is safer too.

size_t[] OtherStuff;
size_t[int] MoreStuff;

Bastiaan.
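
A rough sketch of what the index-first layout suggested here could look like. The Store and Entry types, their member functions and the layout of S are hypothetical; only the field names Data, OtherStuff and MoreStuff come from the thread. The associative array is flattened into pointer-free key/index records, which can then be written to disk as one plain array like any other POD data.

```d
// Hypothetical POD element type; the thread never shows the real S.
struct S
{
    int id;
    double value;
}

// Hypothetical container holding indices instead of pointers.
struct Store
{
    S[] Data;
    size_t[] OtherStuff;    // indices into Data
    size_t[int] MoreStuff;  // key -> index into Data

    // Take the address only when a pointer is really needed.
    // If Data is later reallocated (for example by appending),
    // a pointer obtained earlier refers to the old copy.
    S* other(size_t i) { return &Data[OtherStuff[i]]; }
    S* more(int key) { return &Data[MoreStuff[key]]; }
}

// One flat, pointer-free record per associative-array entry.
struct Entry
{
    int key;
    size_t index;
}

/// Turn the associative array into a plain array that can be written as-is.
Entry[] flatten(size_t[int] aa)
{
    Entry[] entries;
    entries.reserve(aa.length);
    foreach (key, index; aa)
        entries ~= Entry(key, index);
    return entries;
}

/// Rebuild the associative array after loading the flat records.
size_t[int] unflatten(const(Entry)[] entries)
{
    size_t[int] aa;
    foreach (e; entries)
        aa[e.key] = e.index;
    return aa;
}
```

With this layout, Data, OtherStuff and the flattened MoreStuff entries contain no pointers at all, so nothing has to be translated when saving or loading.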
Oct 03 2019
Brett <Brett gmail.com> writes:
On Thursday, 3 October 2019 at 14:38:35 UTC, Bastiaan Veelo wrote:
 On Monday, 30 September 2019 at 20:10:21 UTC, Brett wrote:
 [...]
 The way the data is structured is that I have a master array 
 of non-ptr structs.

 E.g.,

 S[] Data;
 S*[] OtherStuff;

 then every pointer points to an element in Data. I did not 
 use ints as "pointers" for a specific non-relevant reason 
 [...]
 I would seriously consider turning that around and working with indices primarily, then taking the address of an indexed element whenever you do need a pointer for that specific non-relevant reason. It makes I/O trivial, and it is safer too.

 size_t[] OtherStuff;
 size_t[int] MoreStuff;

 Bastiaan.
No it doesn't.
Oct 03 2019