digitalmars.D.learn - randomIO, std.file, core.stdc.stdio

reply Charles Hixson via Digitalmars-d-learn writes:
Are there reasons why one would use rawRead and rawWrite rather than 
fread and fwrite when doing binary random io?  What are the advantages?

In particular, if one is reading and writing structs rather than arrays 
or ranges, are there any advantages?
Jul 25 2016
parent reply ketmar <ketmar ketmar.no-ip.org> writes:
On Monday, 25 July 2016 at 18:54:27 UTC, Charles Hixson wrote:
 Are there reasons why one would use rawRead and rawWrite rather 
 than fread and fwrite when doiing binary random io?  What are 
 the advantages?

 In particular, if one is reading and writing structs rather 
 than arrays or ranges, are there any advantages?
yes: keeping the API consistent. ;-) for example, my stream i/o modules work with anything that has `rawRead`/`rawWrite` methods, but don't bother to check for any others. besides, `rawRead` just looks cleaner, even with all the `(&a)[0..1]` noise. so, a question of style.
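
A minimal sketch of the two styles being compared, writing one struct with the C `fwrite` and reading it back with `File.rawRead`. The `Rec` type, the helper names `writeC` and `readD`, and the temporary file name are assumptions made for illustration, not anything from the thread.

```d
import core.stdc.stdio : FILE, fclose, fopen, fwrite;
import std.file : remove, tempDir;
import std.path : buildPath;
import std.stdio : File;
import std.string : toStringz;

struct Rec { int id; double value; }  // hypothetical record type

// C style: raw pointer plus explicit element size and count.
void writeC(string path, ref const Rec r)
{
    FILE* fp = fopen(path.toStringz, "wb");
    if (fp is null) throw new Exception("cannot open " ~ path);
    scope (exit) fclose(fp);
    if (fwrite(&r, Rec.sizeof, 1, fp) != 1)
        throw new Exception("short write");
}

// Phobos style: a one-element slice carries pointer and length together.
Rec readD(string path)
{
    Rec r;
    auto f = File(path, "rb");   // closed automatically when f goes out of scope
    if (f.rawRead((&r)[0 .. 1]).length != 1)
        throw new Exception("short read");
    return r;
}

void main()
{
    auto path = buildPath(tempDir, "rawread_vs_fread.bin");
    scope (exit) remove(path);

    auto original = Rec(7, 2.5);
    writeC(path, original);
    assert(readD(path) == original);
}
```

Both calls move the same `Rec.sizeof` bytes; the difference is only in how the buffer is described to the API.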
Jul 25 2016
parent reply Charles Hixson via Digitalmars-d-learn writes:
OK. If it's just a question of "looking cleaner" and "style", then I will prefer the core.stdc.stdio approach. I find its appearance extremely much cleaner...except that that's understating things. I'll probably wrap those routines in a struct to ensure things like files being properly closed, and not have explicit pointers persisting over large areas of code. (I said a lot more, but it was just a rant about how ugly I find rawRead/rawWrite syntax, so I deleted it.)
Jul 25 2016
next sibling parent reply ketmar <ketmar ketmar.no-ip.org> writes:
On Tuesday, 26 July 2016 at 01:19:49 UTC, Charles Hixson wrote:
 then I will prefer the core.stdc.stdio approach.  I find its 
 appearance extremely much cleaner...
only if you are really used to writing C code. when you see a pointer or an explicit type size argument in D, it is a sign of C disease.
 I'll probably wrap those routines in a struct to ensure things 
 like files being properly closed, and not have explicit 
 pointers persisting over large areas of code.
exactly what std.stdio.File did! ;-)
Jul 25 2016
parent reply Charles Hixson via Digitalmars-d-learn writes:
Yes, but I really despise the syntax they came up with. It's probably good if most of your I/O is ranges, but mine hasn't yet ever been. (Combining ranges with random I/O?)
Jul 25 2016
parent reply ketmar <ketmar ketmar.no-ip.org> writes:
On Tuesday, 26 July 2016 at 04:05:22 UTC, Charles Hixson wrote:
 Yes, but I really despise the syntax they came up with.  It's 
 probably good if most of your I/O is ranges, but mine hasn't 
 yet ever been.  (Combining ranges with random I/O?)
that's why i wrote iv.stream, and then iv.vfs, with convenient things like `readNum!T`, for example. you absolutely don't need to reimplement the whole std.stdio.File if all you need is a better API. thanks to UFCS, you can write your new API as free functions accepting std.stdio.File as first arg. or even a generic stream, like i did in iv.stream:

    enum isReadableStream(T) = is(typeof((inout int=0) {
      auto t = T.init;
      ubyte[1] b;
      auto v = cast(void[])b;
      t.rawRead(v);
    }));

    enum isWriteableStream(T) = is(typeof((inout int=0) {
      auto t = T.init;
      ubyte[1] b;
      t.rawWrite(cast(void[])b);
    }));

    T readInt(T : ulong, ST) (auto ref ST st) if (isReadableStream!ST) {
      T res;
      ubyte* b = cast(ubyte*)&res;
      foreach (immutable idx; 0..T.sizeof) {
        if (st.rawRead(b[idx..idx+1]).length != 1) throw new Exception("read error");
      }
      return res;
    }

and then:

    auto fl = File("myfile");
    auto i = fl.readInt!uint;

something like that.
Jul 25 2016
parent reply Charles Hixson via Digitalmars-d-learn writes:
That's sort of what I have in mind, but I want to do what in Fortran would be (would have been?) called record I/O, except that I want a file header that specifies a few things like magic number, records allocated, head of free list, etc. In practice I don't see any need for record size not known at compile time...except that if there are different versions of the program, they might include different things, so, e.g., the size of the file header might need to be variable. This is a design problem I'm still trying to wrap my head around. Efficiency seems to say "you need to know the size at compile time", but flexibility says "you can't depend on the size at compile time". The only compromise position seems to compromise safety (by depending on void * and record size parameters that aren't guaranteed safe). I'll probably eventually decide in favor of "size fixed at compile time", but I'm still dithering. But clearly efficiency dictates that the read size not be a basic type. I'm currently thinking of a struct that's about 1 KB in size. As far as the I/O routines are concerned this will probably all be uninterpreted bytes, unless I throw in some sequencing for error recovery...but that's probably making things too complex, and should be left for a higher level. Clearly this is a bit of a specialized case, so I wouldn't be considering implementing all of stdio, only the relevant bits, and those wrapped with an interpretation based around record number. The thing is, I'd probably be writing this wrapper anyway, what I was wondering originally is whether there was any reason to use std.file as the underlying library rather than going directly to core.stdc.stdio.
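
One way the record layout described here could look on top of std.stdio.File, as a rough sketch only. The `Header` field names, the `RecordFile` template, its `recordOffset` helper, and the `Rec` payload are all hypothetical. The sketch assumes the record size is fixed at compile time and leaves out the free list and header versioning entirely.

```d
import std.exception : enforce;
import std.file : remove, tempDir;
import std.path : buildPath;
import std.stdio : File;

// Hypothetical header; the fields follow the description in the post.
struct Header
{
    ulong magic;
    ulong recordsAllocated;
    ulong freeListHead;   // unused in this sketch
}

struct RecordFile(T, ulong magic)
{
    private File f;
    private Header hdr;

    this(string path)
    {
        f = File(path, "w+b");        // create or truncate, open for read + write
        hdr = Header(magic, 0, 0);
        writeHeader();
    }

    private void writeHeader()
    {
        f.seek(0);
        f.rawWrite((&hdr)[0 .. 1]);
    }

    // Byte offset of a record: the header occupies the start of the file.
    private static long recordOffset(size_t recNo)
    {
        return cast(long)(Header.sizeof + recNo * T.sizeof);
    }

    void write(size_t recNo, ref const T val)
    {
        f.seek(recordOffset(recNo));
        f.rawWrite((&val)[0 .. 1]);
        if (recNo >= hdr.recordsAllocated)
        {
            hdr.recordsAllocated = recNo + 1;
            writeHeader();
        }
    }

    void read(size_t recNo, ref T val)
    {
        enforce(recNo < hdr.recordsAllocated, "record number out of range");
        f.seek(recordOffset(recNo));
        enforce(f.rawRead((&val)[0 .. 1]).length == 1, "short read");
    }
}

struct Rec { int id; double[4] payload; }   // hypothetical record type

void main()
{
    auto path = buildPath(tempDir, "record_file_demo.bin");
    scope (exit) remove(path);

    auto rf = RecordFile!(Rec, 0xD1CE)(path);
    auto a = Rec(1, [1.0, 2.0, 3.0, 4.0]);
    auto b = Rec(2, [5.0, 6.0, 7.0, 8.0]);
    rf.write(0, a);
    rf.write(1, b);

    Rec got;
    rf.read(1, got);
    assert(got == b);
    rf.read(0, got);
    assert(got == a);
}
```

Every read and write is preceded by a seek, so the pointer noise stays inside three small methods.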
Jul 26 2016
parent ketmar <ketmar ketmar.no-ip.org> writes:
On Tuesday, 26 July 2016 at 16:35:26 UTC, Charles Hixson wrote:
 That's sort of what I have in mind, but I want to do what in 
 Fortran would be (would have been?) called record I/O, except 
 that I want a file header that specifies a few things like 
 magic number, records allocated, head of free list, etc.  In 
 practice I don't see any need for record size not known at 
 compile time...except that if there are different
 versions of the program, they might include different things, 
 so, e.g., the size of the file header might need to be variable.
it looks like you want a serialization library. there are some: http://wiki.dlang.org/Serialization_Libraries
Jul 26 2016
prev sibling parent reply Steven Schveighoffer <schveiguy yahoo.com> writes:
On 7/25/16 9:19 PM, Charles Hixson via Digitalmars-d-learn wrote:
OK. If it's just a question of "looking cleaner" and "style", then I will prefer the core.stdc.stdio approach. I find it's appearance extremely much cleaner...except that that's understating things. I'll probably wrap those routines in a struct to ensure things like files being properly closed, and not have explicit pointers persisting over large areas of code.
It's more than just that. Having a bounded array is safer than separate pointer and length parameters. Literally, rawRead and rawWrite are inferred @safe, whereas fread and fwrite are not. But D is so nice with UFCS, you don't have to live with APIs you don't like. Allow me to suggest adding a helper function to your code:

    void rawReadItem(T)(File f, ref T item) @trusted
    {
        f.rawRead((&item)[0 .. 1]); // one-element slice over item
    }

-Steve
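
A self-contained sketch of how such a helper might be used through UFCS. The matching `rawWriteItem`, the end-of-file check, and the `Point` type are additions assumed here for illustration; they do not appear in the thread.

```d
import std.file : remove, tempDir;
import std.path : buildPath;
import std.stdio : File;

struct Point { int x; int y; }   // hypothetical plain-data record

// Reads exactly one T; throws if the file ends early.
void rawReadItem(T)(File f, ref T item) @trusted
{
    if (f.rawRead((&item)[0 .. 1]).length != 1)
        throw new Exception("unexpected end of file");
}

// Writes exactly one T.
void rawWriteItem(T)(File f, ref const T item) @trusted
{
    f.rawWrite((&item)[0 .. 1]);
}

void main()
{
    auto path = buildPath(tempDir, "raw_item_demo.bin");
    scope (exit) remove(path);

    auto p = Point(3, 4);
    {
        auto f = File(path, "wb");
        f.rawWriteItem(p);        // UFCS: reads like a member of File
    }

    Point q;
    {
        auto f = File(path, "rb");
        f.rawReadItem(q);
    }
    assert(q == p);
}
```

The slice expression appears once in each helper, and call sites never touch a pointer.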
Jul 26 2016
parent reply Charles Hixson via Digitalmars-d-learn writes:
That *does* make the syntax a lot nicer, and I understand the safety advantage of not using pointer/length separated parameters. But I'm going to be wrapping the I/O anyway, and the external interface is going to be more like:

    struct RF (T, long magic) {
        ....
        void read (size_t recNo, ref T val){...}
        size_t read (ref T val){...}
        ...
    }

where a sequential read returns the record number, or you specify the record number and get an indexedIO read. So the length will be T.sizeof, and will be specified at the time the file is opened. To me this seems to eliminate the advantage of stdfile, and stdfile seems to add a level of indirection.

Ranges aren't free, are they? If so then I should probably use stdfile, because that is probably less likely to change than core.stdc.stdio.

When I see "f.rawRead((&item)[0 .. 1])" it looks to me as if unneeded code is being generated explicitly to be thrown away. (I don't like using pointer/length either, but it's actually easier to understand than this kind of thing, and this LOOKS like it's generating extra code.)

That said, perhaps I should use stdio anyway. When doing I/O it's the disk speed that's the really slow part, and that so dominates things that worrying about trivialities is foolish. And since it's going to be wrapped anyway, the ugly will be confined to a very small routine.
Jul 26 2016
parent reply Steven Schveighoffer <schveiguy yahoo.com> writes:
On 7/26/16 12:58 PM, Charles Hixson via Digitalmars-d-learn wrote:

 Ranges aren't free, are they? If so then I should probably use stdfile,
 because that is probably less likely to change than core.stdc.stdio.
Do you mean slices?
 When I see "f.rawRead((&item)[0 .. 1])" it looks to me as if unneeded code
 is being generated explicitly to be thrown away.  (I don't like using
 pointer/length either, but it's actually easier to understand than this
 kind of thing, and this LOOKS like it's generating extra code.)
This is probably a misunderstanding on your part. &item is accessing the item as a pointer. Since the compiler already has it as a reference, this is a noop -- just an expression to change the type. [0 .. 1] is constructing a slice out of a pointer. It's done all inline by the compiler (there is no special _d_constructSlice function), so that is very very quick. There is no bounds checking, because pointers do not have bounds checks. So there is pretty much zero overhead for this. Just push the pointer and length onto the stack (or registers, not sure of ABI), and call rawRead.
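
A small check of that claim, as a sketch with a hypothetical `Item` type: slicing a pointer yields a view over the same storage, with no copy and no allocation.

```d
struct Item { int a; int b; }   // hypothetical record type

void main()
{
    Item item;
    Item[] one = (&item)[0 .. 1];   // just a pointer plus a length of 1

    assert(one.ptr is &item);       // same address, nothing was copied
    assert(one.length == 1);
    assert((cast(ubyte[]) one).length == Item.sizeof);

    one[0].a = 42;                  // writes through to item itself
    assert(item.a == 42);
}
```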
 That said, perhaps I should use stdio anyway.  When doing I/O it's the
 disk speed that's the really slow part, and that so dominates things
 that worrying about trivialities is foolish.  And since it's going to be
 wrapped anyway, the ugly will be confined to a very small routine.
Having written a very templated io library (https://github.com/schveiguy/iopipe), I can tell you that in my experience, the slowdown comes from 2 things: 1) spending time calling the kernel, and 2) not being able to inline. This of course assumes that proper buffering is done. Buffering should mitigate most of the slowdown from the disk. It is expensive, but you amortize the expense by buffering. C's i/o is pretty much as good as it gets for an opaque non-inlinable system, as long as your requirements are simple enough. The std.stdio code should basically inline into the calls you should be making, and it handles a bunch of stuff that optimizes the calls (such as locking the file handle for one complex operation). -Steve
Jul 26 2016
parent reply Charles Hixson via Digitalmars-d-learn writes:
Thanks. Since there isn't any excess overhead I guess I'll use stdio. Buffering, however, isn't going to help at all since I'm doing randomIO. I know that most of the data the system reads from disk is going to end up getting thrown away, since my records will generally be smaller than 8K, but there's no help for that.
Jul 26 2016
parent reply Steven Schveighoffer <schveiguy yahoo.com> writes:
On 7/26/16 1:57 PM, Charles Hixson via Digitalmars-d-learn wrote:

 Thanks.  Since there isn't any excess overhead I guess I'll use stdio.
 Buffering, however, isn't going to help at all since I'm doing
 randomIO.  I know that most of the data the system reads from disk is
 going to end up getting thrown away, since my records will generally be
 smaller than 8K, but there's no help for that.
Even for doing random I/O buffering is helpful. It depends on the size of your items. Essentially, to read 10 bytes from a file probably costs the same as reading 100,000 bytes from a file. So may as well buffer that in case you need it. Now, C i/o's buffering may not suit your exact needs. So I don't know how it will perform. You may want to consider mmap which tells the kernel to link pages of memory directly to disk access. Then the kernel is doing all the buffering for you. Phobos has support for it, but it's pretty minimal from what I can see: http://dlang.org/phobos/std_mmfile.html -Steve
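
For reference, a minimal sketch of what record access through std.mmfile could look like. The 1024-byte record size, the count of 16, and the file name are assumptions picked for illustration; this says nothing about how it would perform.

```d
import std.file : remove, tempDir;
import std.mmfile : MmFile;
import std.path : buildPath;

void main()
{
    auto path = buildPath(tempDir, "mmfile_demo.bin");
    scope (exit) remove(path);

    enum recordSize = 1024;   // assumed fixed record size
    enum recordCount = 16;    // assumed initial capacity

    // Create the file and map all of it; pages are loaded only when touched.
    auto mmf = new MmFile(path, MmFile.Mode.readWriteNew,
                          recordSize * recordCount, null);
    scope (exit) destroy(mmf);   // unmap and close before the file is removed

    auto bytes = cast(ubyte[]) mmf[];
    assert(bytes.length == recordSize * recordCount);

    // Record 5 is simply a slice of the mapping.
    auto rec = bytes[5 * recordSize .. 6 * recordSize];
    rec[0] = 0xAB;
    assert(mmf[5 * recordSize] == 0xAB);
}
```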
Jul 26 2016
parent reply Charles Hixson via Digitalmars-d-learn writes:
I've considered mmapfile often, but when I read the documentation I end up realizing that I don't understand it. So I look up memory mapped files in other places, and I still don't understand it. It looks as if the entire file is stored in memory, which is not at all what I want, but I also can't really believe that's what's going on. I know that there was an early form of this in a version of BASIC (the version that RISS was written in, but I don't remember which version that was) and in *that* version array elements were read in as needed. (It wasn't spectacularly efficient.) But memory mapped files don't seem to work that way, because people keep talking about how efficient they are. Do you know a good introductory tutorial? I'm guessing that "window size" might refer to the number of bytes available, but what if you need to append to the file? Etc. A part of the problem is that I don't want this to be a process with an arbitrarily high memory use. Buffering would be fine, if I could use it, but for my purposes sequential access is likely to be rare, and the working layout of the data in RAM doesn't (can't reasonably) match the layout on disk. IIUC (this is a few decades old) the system buffer size is about 8K. I expect to never need to read that large a chunk, but I'm going to try to keep the chunks in multiples of 1024 bytes, and if it's reasonable to exactly 1024 bytes. So I should never need two reads or writes for a chunk. I guess to be sure of this I'd better make sure the file header is also 1024 bytes. (I'm guessing that the seek to position results in the appropriate buffer being read into the system buffer, so if my header were 512 bytes I might occasionally need to do double reads or writes.) I'm guessing that memory mapped files trade off memory use against speed of access, and for my purposes that's probably a bad trade, even though databases are doing that more and more. 
I'm likely to need all the memory I can lay my hands on, and even then thrashing wouldn't surprise me. So a fixed buffer size seems a huge advantage.
Jul 26 2016
next sibling parent reply Adam D. Ruppe <destructionator gmail.com> writes:
On Tuesday, 26 July 2016 at 19:30:35 UTC, Charles Hixson wrote:
 It looks as if the entire file is stored in memory, which is 
 not at all what I want, but I also can't really believe that's 
 what's going on.
It is just mapped to virtual memory without actually being loaded into physical memory, so when you access the array it returns, the kernel loads a page of the file into memory, but it doesn't do that until it actually has to. Think of it as being like this:

    struct MagicFile {
        FILE* fp; // from core.stdc.stdio
        ubyte[] opIndex(size_t idx) {
            auto buffer = new ubyte[](some_block_length);
            fseek(fp, idx, SEEK_SET);
            fread(buffer.ptr, 1, buffer.length, fp); // (ptr, size, count, stream)
            return buffer;
        }
    }

And something analogous for writing, but instead of being done with overloaded operators in D, it is done with the MMU hardware by the kernel (and the kernel also does smarter buffering than this little example).
 A part of the problem is that I don't want this to be a process 
 with an arbitrarily high memory use.
The kernel will automatically handle physical memory usage too, similarly to a page file. If you haven't read a portion of the file recently, it will discard that page, since it can always read it again off disk if needed, but if you do have memory to spare, it will keep the data in memory for faster access later. So basically the operating system handles a lot of the details which makes it efficient. Growing a memory mapped file is a bit tricky though, you need to unmap and remap. Since it is an OS concept, you can always look for C or C++ examples too, like here: http://stackoverflow.com/questions/4460507/appending-to-a-memory-mapped-file/4461462#4461462
Jul 26 2016
parent reply Charles Hixson via Digitalmars-d-learn writes:
O, dear. It was sounding like such an excellent approach until this last paragraph, but growing the file is going to be one of the common operations. (Certainly at first.) It sounds as if that means the file needs to be closed and re-opened for extensions. And I quote from https://www.gnu.org/software/libc/manual/html_node/Memory_002dmapped-I_002fO.html:

    Function: void * mremap (void *address, size_t length, size_t new_length, int flag)

    Preliminary: | MT-Safe | AS-Safe | AC-Safe | See POSIX Safety Concepts <https://www.gnu.org/software/libc/manual/html_node/POSIX-Safety-Concepts.html#POSIX-Safety-Concepts>.

    This function can be used to change the size of an existing memory area. address and length must cover a region entirely mapped in the same mmap statement. A new mapping with the same characteristics will be returned with the length new_length.
    ...
    This function is only available on a few systems. Except for performing optional optimizations one should not rely on this function.

So I'm probably better off sticking to using a seek based i/o system.
Jul 26 2016
parent reply Rene Zwanenburg <renezwanenburg gmail.com> writes:
On Wednesday, 27 July 2016 at 02:20:57 UTC, Charles Hixson wrote:
 O, dear.  It was sounding like such an excellent approach until this
 last paragraph, but growing the file is going to be one of the common
 operations.  (Certainly at first.) (...)
 So I'm probably better off sticking to using a seek based i/o system.
Not necessarily. The usual approach is to over-allocate your file so you don't need to grow it that often. This is the exact same strategy used by D's dynamic arrays and grow-able array-backed lists in other languages - the difference between list length and capacity. There is no built-in support for this in std.mmfile afaik. But it's not hard to do yourself.
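
A sketch of the length-versus-capacity bookkeeping described above. The `grownCapacity` helper, the starting size of 16, and the doubling factor are all assumptions; the actual file extension and remapping are only marked with a comment.

```d
// Hypothetical helper: capacity to preallocate so that `needed` records fit.
size_t grownCapacity(size_t current, size_t needed)
{
    size_t cap = current == 0 ? 16 : current;   // 16 is an arbitrary starting size
    while (cap < needed)
        cap *= 2;                                // geometric growth
    return cap;
}

void main()
{
    size_t capacity = 0;   // records the file has room for
    size_t count = 0;      // records actually in use
    size_t remaps = 0;

    foreach (i; 0 .. 1000)
    {
        if (count + 1 > capacity)
        {
            capacity = grownCapacity(capacity, count + 1);
            ++remaps;      // here the file would be extended and remapped
        }
        ++count;
    }

    assert(capacity == 1024);
    assert(remaps == 7);   // 16, 32, 64, 128, 256, 512, 1024
}
```

With doubling, appending 1000 records one at a time triggers only seven extensions, which is why the remap cost is usually tolerable.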
Jul 27 2016
parent Charles Hixson via Digitalmars-d-learn writes:
Well, that would mean I didn't need to reopen the file so often, but that sure wouldn't mean I wouldn't need to re-open the file. And it would add considerable complexity. Possibly that would be an optimal approach once the data was mainly collected, but I won't want to re-write this bit at that point.
Jul 27 2016
prev sibling parent Steven Schveighoffer <schveiguy yahoo.com> writes:
On 7/26/16 3:30 PM, Charles Hixson via Digitalmars-d-learn wrote:
 On 07/26/2016 11:31 AM, Steven Schveighoffer via Digitalmars-d-learn wrote:
 Now, C i/o's buffering may not suit your exact needs. So I don't know
 how it will perform. You may want to consider mmap which tells the
 kernel to link pages of memory directly to disk access. Then the
 kernel is doing all the buffering for you. Phobos has support for it,
 but it's pretty minimal from what I can see:
 http://dlang.org/phobos/std_mmfile.html
I've considered mmapfile often, but when I read the documentation I end up realizing that I don't understand it. So I look up memory mapped files in other places, and I still don't understand it. It looks as if the entire file is stored in memory, which is not at all what I want, but I also can't really believe that's what's going on.
Of course that isn't what is happening :) What happens is that the kernel says memory page 0x12345 (or whatever) is mapped to the file. Then when you access a mapped page, the system memory management unit gets a page fault (because that memory isn't loaded), which triggers the kernel to load that page of memory. Kernel sees that the memory is really mapped to that file, and loads the page from the file instead. As you write to the memory location, the page is marked dirty, and at some point, the kernel flushes that page back to disk. Everything is done behind the scenes and is in tune with the filesystem itself, so you get a little extra benefit from that.
 I know that
 there was an early form of this in a version of BASIC (the version that
 RISS was written in, but I don't remember which version that was) and in
 *that* version array elements were read in as needed.  (It wasn't
 spectacularly efficient.)  But memory mapped files don't seem to work
 that way, because people keep talking about how efficient they are.  Do
 you know a good introductory tutorial?  I'm guessing that "window size"
 might refer to the number of bytes available, but what if you need to
 append to the file?  Etc.
To be honest, I'm not super familiar with actually using them, I just have a rough idea of how they work. The actual usage you will have to look up.
 A part of the problem is that I don't want this to be a process with an
 arbitrarily high memory use.
You should know that you can allocate as much memory as you want, as long as you have address space for it, and you won't actually map that to physical memory until you use it. So the management of the memory is done lazily, all supported by the MMU hardware. This is true for actual memory too! Note that the only "memory" you are using for the mmaped file are page buffers in the kernel which are likely already being used to buffer the disk reads. It's not like it's loading the entire file into memory, and probably doesn't even load all sequential pages into memory. It only loads the ones you use. I'm pretty much at my limit for knowledge of this subject (and maybe I have a few things incorrect), I'm sure others here know much more. I suggest you play a bit with it to see what the performance is like. I have also heard that it's very fast. -Steve
Jul 26 2016