
digitalmars.D.learn - Reading a structured binary file?

reply "Gary Willoughby" <dev nomad.so> writes:
What library functions do I use to read from a structured binary 
file? I want to read the byte stream 1, 2, or maybe 4 bytes at a time 
and cast these to bytes, shorts, and ints respectively. I can't 
seem to find anything like readByte().
Aug 02 2013
next sibling parent "Dicebot" <public dicebot.lv> writes:
On Friday, 2 August 2013 at 17:49:55 UTC, Gary Willoughby wrote:
 What library commands do i use to read from a structured binary 
 file? I want to read the byte stream 1, 2 maybe 4 bytes at a 
 time and cast these to bytes, shorts and ints respectively. I 
 can't seem to find anything like readByte().
?
Aug 02 2013
prev sibling next sibling parent Justin Whear <justin economicmodeling.com> writes:
On Fri, 02 Aug 2013 19:49:54 +0200, Gary Willoughby wrote:

 What library commands do i use to read from a structured binary file? I
 want to read the byte stream 1, 2 maybe 4 bytes at a time and cast these
 to bytes, shorts and ints respectively. I can't seem to find anything
 like readByte().
You can use File.rawRead:

    ushort[1] myShort;
    file.rawRead(myShort);

Or if you have structures in the file:

    struct Foo
    {
    align(1):
        int bar;
        short k;
        char[7] str;
    }

    Foo[1] foo;
    file.rawRead(foo);
Aug 02 2013
prev sibling next sibling parent reply "John Colvin" <john.loughran.colvin gmail.com> writes:
On Friday, 2 August 2013 at 17:49:55 UTC, Gary Willoughby wrote:
 What library commands do i use to read from a structured binary 
 file? I want to read the byte stream 1, 2 maybe 4 bytes at a 
 time and cast these to bytes, shorts and ints respectively. I 
 can't seem to find anything like readByte().
How big is the file?

If it's not too huge I'd just read it in with std.file.read and then sort out splitting it up from there.
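For example, a minimal sketch of that whole-file approach (the file name and the choice of a little-endian uint are made up for illustration):

    import std.bitmanip : peek;
    import std.file : read;
    import std.stdio : writeln;
    import std.system : Endian;

    void main()
    {
        // Pull the entire file into memory as raw bytes.
        auto data = cast(ubyte[]) read("file.dat");

        // Interpret the first four bytes as a little-endian uint
        // without consuming the buffer.
        auto first = data.peek!(uint, Endian.littleEndian)(0);
        writeln(first);
    }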
Aug 02 2013
parent "Gary Willoughby" <dev nomad.so> writes:
 How big is the file?

 If it's not too huge i'd just read it in with std.file.read and 
 then sort out splitting it up from there.
Quite large, so I'll probably stream it. Thanks, guys.
Aug 02 2013
prev sibling next sibling parent "Jesse Phillips" <Jesse.K.Phillips+D gmail.com> writes:
On Friday, 2 August 2013 at 17:49:55 UTC, Gary Willoughby wrote:
 What library commands do i use to read from a structured binary 
 file? I want to read the byte stream 1, 2 maybe 4 bytes at a 
 time and cast these to bytes, shorts and ints respectively. I 
 can't seem to find anything like readByte().
You've gotten some help already around functions D provides. But I thought I would mention I'd recently tried to do some large file parsing for binary data, and decided to try and blog about it:

http://he-the-great.livejournal.com/47550.html

I can't say this is the best solution, but it worked. I was parsing a 20 gig OpenStreetMap planet file.
Aug 02 2013
prev sibling next sibling parent reply "Jonathan M Davis" <jmdavisProg gmx.com> writes:
On Friday, August 02, 2013 19:49:54 Gary Willoughby wrote:
 What library commands do i use to read from a structured binary
 file? I want to read the byte stream 1, 2 maybe 4 bytes at a time
 and cast these to bytes, shorts and ints respectively. I can't
 seem to find anything like readByte().
I'd probably use std.mmfile and std.bitmanip to do it. MmFile will allow you to efficiently operate on the file as a ubyte[] in memory thanks to mmap, and std.bitmanip's peek and read functions make it easy to convert multiple bytes into integral values.

- Jonathan M Davis
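As a rough sketch of that combination (the file name is hypothetical, and big-endian data is assumed):

    import std.bitmanip : peek;
    import std.mmfile : MmFile;
    import std.stdio : writeln;

    void main()
    {
        auto mmf = new MmFile("file.dat");
        // A view of the mapped file; pages are faulted in on access.
        auto bytes = cast(ubyte[]) mmf[];

        // peek reads at an explicit offset without consuming anything.
        auto value = bytes.peek!uint(0);   // big-endian by default
        writeln(value);
    }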
Aug 02 2013
next sibling parent reply captaindet <2krnk gmx.net> writes:
On 2013-08-02 17:13, Jonathan M Davis wrote:
 On Friday, August 02, 2013 19:49:54 Gary Willoughby wrote:
 What library commands do i use to read from a structured binary
 file? I want to read the byte stream 1, 2 maybe 4 bytes at a time
 and cast these to bytes, shorts and ints respectively. I can't
 seem to find anything like readByte().
I'd probably use std.mmfile and std.bitmanip to do it. MmFile will allow you to efficiently operate on the file as a ubyte[] in memory thanks to mmap, and std.bitmanip's peek and read functions make it easy to convert multiple bytes into integral values. - Jonathan M Davis
FWIW, I have to deal with big data files that can be a few GB. For some data analysis software I wrote in C a while back, I did some testing with caching and such. It turns out that for Win7-64 the automatic caching done by the OS is really good, and any attempt to speed things up actually slowed it down. No kidding, I have seen more than 2GB of data being automatically cached. Of course, the system RAM must be larger than the file size (if I remember my tests correctly, by a factor of ~2, but this is maybe not a linear relationship; I did not actually change the RAM, just the size of the data file), and it will hold it in the cache only as long as there are no concurrent applications requiring RAM or caching.

I guess my point is: if your target is Win7 and your files are >5x smaller than the installed RAM, I would not bother at all trying to optimize file access. I suppose a *nix machine will do a similarly good job these days.

/det
Aug 02 2013
parent reply "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Fri, Aug 02, 2013 at 06:38:20PM -0500, captaindet wrote:
[...]
 FWIW
 i have to deal with big data files that can be a few GB. for some data
 analysis software i wrote in C a while back i did some testing with
 caching and such. turns out that for Win7-64 the automatic caching
 done by the OS is really good and any attempt to speed things up
 actually slowed it down. no kidding, i have seen more than 2GB of data
 being automatically cached. of course the system RAM must be larger
 than the file size (if i remember my tests correctly by a factor of
 ~2, but this is maybe not a linear relationship, i did not actually
 change the RAM just the size of the data file) and it will hold it in
 the cache only as long as there are no concurrent applications
 requiring RAM or caching. i guess my point is, if your target is Win7
 and your files are >5x smaller than the installed RAM i would not
 bother at all trying to optimize file access. i suppose -nix machine
 will do a similar good job these days.
[...]

IIRC, Linux has been caching files (or disk blocks, rather) in memory since the days of Win95. Of course, memory in those days was much scarcer, but file sizes were smaller too. :)

There's still a cost to copy the kernel buffers into userspace, though, which should not be disregarded. But if you use mmap, then you're essentially accessing that memory cache directly, which is as good as it gets. I don't know how well mmap works on Windows, though; IIRC it doesn't have the same semantics as Posix, so you could accidentally run into performance issues by using it the wrong way on Windows.

T

-- 
There is no gravity. The earth sucks.
Aug 02 2013
parent reply "monarch_dodra" <monarchdodra gmail.com> writes:
On Friday, 2 August 2013 at 23:51:27 UTC, H. S. Teoh wrote:
 On Fri, Aug 02, 2013 at 06:38:20PM -0500, captaindet wrote:
 [...]
 FWIW i have to deal with big data files that can be a few GB. for some data analysis software i wrote in C a while back i did some testing with caching and such. turns out that for Win7-64 the automatic caching done by the OS is really good and any attempt to speed things up actually slowed it down. no kidding, i have seen more than 2GB of data being automatically cached. of course the system RAM must be larger than the file size (if i remember my tests correctly by a factor of ~2, but this is maybe not a linear relationship, i did not actually change the RAM just the size of the data file) and it will hold it in the cache only as long as there are no concurrent applications requiring RAM or caching. i guess my point is, if your target is Win7 and your files are >5x smaller than the installed RAM i would not bother at all trying to optimize file access. i suppose -nix machine will do a similar good job these days.
[...]

IIRC, Linux has been caching files (or disk blocks, rather) in memory since the days of Win95. Of course, memory in those days was much scarcer, but file sizes were smaller too. :)

There's still a cost to copy the kernel buffers into userspace, though, which should not be disregarded. But if you use mmap, then you're essentially accessing that memory cache directly, which is as good as it gets. I don't know how well mmap works on windows, though, IIRC it doesn't have the same semantics as Posix, so you could accidentally run into performance issues by using it the wrong way on windows.

T
I did some benching a while back with user bioinfornatics. He had to do some pretty large file reads, preferably in very little time. Observations showed my algo was *much* faster under Windows than Linux.

What we observed is that under Windows, as soon as you open a file for reading, Windows starts buffering the file in a parallel thread.

What we did was create two threads. The first did nothing but read the file, store it into chunks of memory, and then pass it to a worker thread. The worker thread did the parsing proper.

Doing this *halved* the Linux runtime, tying it with the "monothreaded" Windows run time. Windows saw no change.

FYI, the full thread is here:
forum.dlang.org/thread/gmfqwzgtjfnqiajghmsx forum.dlang.org
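A very rough sketch of that reader/worker split using std.concurrency (the chunk size and file name are invented for illustration, and the "parsing" is just a byte count):

    import std.concurrency : receive, send, spawn;
    import std.stdio : File, writeln;

    void worker()
    {
        size_t total;
        bool done;
        while (!done)
        {
            receive(
                // A chunk of file data to parse.
                (immutable(ubyte)[] chunk) { total += chunk.length; },
                // End-of-input signal from the reader.
                (bool stop) { done = true; }
            );
        }
        writeln("worker saw ", total, " bytes");
    }

    void main()
    {
        auto tid = spawn(&worker);
        auto file = File("file.dat", "rb");

        // Reader thread: nothing but I/O; hand each chunk to the worker.
        foreach (chunk; file.byChunk(1024 * 1024))
            tid.send(chunk.idup);   // copy, since byChunk reuses its buffer

        tid.send(true);             // tell the worker we're done
    }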
Aug 03 2013
parent "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Sat, Aug 03, 2013 at 11:29:01PM +0200, monarch_dodra wrote:
 On Friday, 2 August 2013 at 23:51:27 UTC, H. S. Teoh wrote:
On Fri, Aug 02, 2013 at 06:38:20PM -0500, captaindet wrote:
[...]
FWIW
i have to deal with big data files that can be a few GB. for some
data analysis software i wrote in C a while back i did some testing
with caching and such. turns out that for Win7-64 the automatic
caching done by the OS is really good and any attempt to speed
things up actually slowed it down. no kidding, i have seen more than
2GB of data being automatically cached. of course the system RAM
must be larger than the file size (if i remember my tests correctly
by a factor of ~2, but this is maybe not a linear relationship, i
did not actually change the RAM just the size of the data file) and
it will hold it in the cache only as long as there are no concurrent
applications requiring RAM or caching. i guess my point is, if your
target is Win7 and your files are >5x smaller than the installed RAM
i would not bother at all trying to optimize file access. i suppose
-nix machine will do a similar good job these days.
[...]

IIRC, Linux has been caching files (or disk blocks, rather) in memory since the days of Win95. Of course, memory in those days was much scarcer, but file sizes were smaller too. :)

There's still a cost to copy the kernel buffers into userspace, though, which should not be disregarded. But if you use mmap, then you're essentially accessing that memory cache directly, which is as good as it gets. I don't know how well mmap works on windows, though, IIRC it doesn't have the same semantics as Posix, so you could accidentally run into performance issues by using it the wrong way on windows.
[...]
 I did some benching a while back with user bioinfornatics. He had to
 do some pretty large file reads, preferably in very little time.
 Observations showed my algo was *much* faster under windows then
 linux.
Sorry, I lost the context of this discussion; what algo are you referring to?
 What we observed is that under windows, as soon as you open a file
 for reading, windows starts buffering the file in a parallel thread.
 
 What we did was create two threads. The first did nothing but read
 the file, store it into chunks of memory, and then pass it to a
 worker thread. The worker thread did the parsing proper.
 
 Doing this *halved* the linux runtime, tying it with the
 "monothreaded" windows run time. Windows saw no change.
Interesting. I wonder if you could, under Linux, mmap a file then have one thread access the first byte of each file block while another thread does the real work with the data.
 FYI, the full thread is here:
 forum.dlang.org/thread/gmfqwzgtjfnqiajghmsx forum.dlang.org
I'll take a look, thanks.

T

-- 
The diminished 7th chord is the most flexible and fear-instilling chord. Use it often, use it unsparingly, to subdue your listeners into submission!
Aug 03 2013
prev sibling parent reply "Gary Willoughby" <dev nomad.so> writes:
On Friday, 2 August 2013 at 22:13:28 UTC, Jonathan M Davis wrote:
 I'd probably use std.mmfile and std.bitmanip to do it. MmFile 
 will allow you to
 efficiently operate on the file as a ubyte[] in memory thanks 
 to mmap, and
 std.bitmanip's peek and read functions make it easy to convert 
 multiple bytes
 into integral values.

 - Jonathan M Davis
This sounds like a great idea, but once the file has been opened as an MmFile, how do I convert it to a ubyte[] so the std.bitmanip functions work with it?
Aug 03 2013
parent reply "Gary Willoughby" <dev nomad.so> writes:
On Saturday, 3 August 2013 at 18:14:47 UTC, Gary Willoughby wrote:
 This sounds a great idea but once the file has been opened as a 
 MmFile how to i convert this to a ubyte[] so the std.bitmanip 
 functions work with it?
I'm currently doing this:

    auto file = new MmFile("file.dat");
    ubyte[] buffer = cast(ubyte[]) file[];
    buffer.read!uint(); // etc.

Is this how you would recommend?
Aug 03 2013
next sibling parent reply "John Colvin" <john.loughran.colvin gmail.com> writes:
On Saturday, 3 August 2013 at 18:23:58 UTC, Gary Willoughby wrote:
 On Saturday, 3 August 2013 at 18:14:47 UTC, Gary Willoughby 
 wrote:
 This sounds a great idea but once the file has been opened as 
 a MmFile how to i convert this to a ubyte[] so the 
 std.bitmanip functions work with it?
I'm currently doing this:

    auto file = new MmFile("file.dat");
    ubyte[] buffer = cast(ubyte[]) file[];
    buffer.read!uint(); // etc.

Is this how you would recommend?
That defeats the object of memory mapping, as the [] at the end of cast(ubyte[])file[] implies copying the whole file into memory.

Three options I can think of:

1) copy read from std.bitmanip and modify it to work nicely with MmFile
2) write a wrapper for MmFile to let it work nicely with read (see the sketch below)
3) rewrite/modify MmFile

I would love to do 3) at some point, but I'm too busy at the moment.
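As an illustration of option 2 only (a sketch, not code from the thread; the name MmFileRange and "file.dat" are made up), a minimal byte-at-a-time input range over an MmFile is enough for std.bitmanip.read to consume lazily:

    import std.bitmanip : read;
    import std.mmfile : MmFile;
    import std.stdio : writeln;

    // An input range of ubyte backed by a memory-mapped file,
    // so std.bitmanip.read can consume it without a whole-file slice.
    struct MmFileRange
    {
        MmFile file;
        ulong pos;

        @property bool empty() { return pos >= file.length; }
        @property ubyte front() { return file[pos]; }
        void popFront() { ++pos; }
    }

    void main()
    {
        auto mmf = new MmFile("file.dat");
        auto range = MmFileRange(mmf);
        auto first = range.read!uint();   // reads 4 bytes and advances
        writeln(first);
    }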
Aug 03 2013
next sibling parent Jonathan M Davis <jmdavisProg gmx.com> writes:
On Saturday, August 03, 2013 23:10:12 John Colvin wrote:
 On Saturday, 3 August 2013 at 18:23:58 UTC, Gary Willoughby wrote:
 On Saturday, 3 August 2013 at 18:14:47 UTC, Gary Willoughby
 
 wrote:
 This sounds a great idea but once the file has been opened as
 a MmFile how to i convert this to a ubyte[] so the
 std.bitmanip functions work with it?
I'm currently doing this:

    auto file = new MmFile("file.dat");
    ubyte[] buffer = cast(ubyte[]) file[];
    buffer.read!uint(); // etc.

Is this how you would recommend?
That defeats the object of memory mapping, as the [] at the end of cast(ubyte[])file[] implies copying the whole file in to memory.
Are you sure about that? Maybe I'm just not familiar enough with mmap, but I don't see anything in MmFile which would result in it copying the whole file into memory. I guess that I'll have to do some more reading up on mmap. Certainly, if slicing it like that copies it all into memory, that's a big problem.

- Jonathan M Davis
Aug 03 2013
prev sibling parent "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Sat, Aug 03, 2013 at 02:25:23PM -0700, Jonathan M Davis wrote:
 On Saturday, August 03, 2013 23:10:12 John Colvin wrote:
 On Saturday, 3 August 2013 at 18:23:58 UTC, Gary Willoughby wrote:
 On Saturday, 3 August 2013 at 18:14:47 UTC, Gary Willoughby
 
 wrote:
 This sounds a great idea but once the file has been opened as
 a MmFile how to i convert this to a ubyte[] so the
 std.bitmanip functions work with it?
I'm currently doing this:

    auto file = new MmFile("file.dat");
    ubyte[] buffer = cast(ubyte[]) file[];
    buffer.read!uint(); // etc.

Is this how you would recommend?
That defeats the object of memory mapping, as the [] at the end of cast(ubyte[])file[] implies copying the whole file in to memory.
Are you sure about that? Maybe I'm just not familiar enough with mmap, but I don't see anything in MmFile which would result in it copying the whole file into memory. I guess that I'll have to do some more reading up on mmap. Certainly, if slicing it like that copies it all into memory, that's a big problem.
[...]

I think he meant that the OS will have to load the entire file into memory if you sliced the mmap'ed file, not that you'll copy all the data. I'm not certain this is true, though, because slicing as I understand it only returns the address of the start of the mmap'ed addresses coupled with its length. I don't think the OS will actually load anything into memory until you reference an address within that mmap'ed range. And even then, only those disk blocks that correspond with the referenced addresses will actually be loaded -- this is the point of virtual memory, after all.

T

-- 
The computer is only a tool. Unfortunately, so is the user. -- Armaphine, K5
Aug 03 2013
prev sibling next sibling parent reply "Jesse Phillips" <Jesse.K.Phillips+D gmail.com> writes:
On Saturday, 3 August 2013 at 18:23:58 UTC, Gary Willoughby wrote:
 On Saturday, 3 August 2013 at 18:14:47 UTC, Gary Willoughby 
 wrote:
 This sounds a great idea but once the file has been opened as 
 a MmFile how to i convert this to a ubyte[] so the 
 std.bitmanip functions work with it?
I'm currently doing this:

    auto file = new MmFile("file.dat");
    ubyte[] buffer = cast(ubyte[]) file[];
    buffer.read!uint(); // etc.

Is this how you would recommend?
You will need to slice the size of the data you want, otherwise you're effectively doing std.file.read(). It doesn't need to be for a single value (as in the example); it could be a block of data which is then individually parsed for the pieces.

    auto file = new MmFile("file.dat");
    ubyte[] buffer = cast(ubyte[]) file[indexInFile .. indexInFile + uint.sizeof];
    indexInFile += uint.sizeof;
    buffer.read!uint(); // etc.

The only way I'm seeing to advance through the file is to keep an index on where you're currently reading from. This actually works perfectly for the FileRange I mentioned in the previous post. I'm not familiar with how MmFile manages its memory, but hopefully there isn't buffer reuse; otherwise a stored slice could be overwritten (not an issue for value data, but it would be for string data).
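As a variant of the same idea (just a sketch; the file name is hypothetical), std.bitmanip.peek accepts an explicit index, so you can walk the mapped bytes with a running offset and no per-value slicing:

    import std.bitmanip : peek;
    import std.mmfile : MmFile;
    import std.stdio : writeln;

    void main()
    {
        auto mmf = new MmFile("file.dat");
        auto bytes = cast(ubyte[]) mmf[];

        size_t indexInFile = 0;
        auto value = bytes.peek!uint(indexInFile);   // big-endian by default
        indexInFile += uint.sizeof;
        writeln(value);
    }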
Aug 05 2013
next sibling parent "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Tue, Aug 06, 2013 at 06:48:12AM +0200, Jesse Phillips wrote:
[...]
 The only way I'm seeing to advance through the file is to keep an
 index on where you're currently reading from. This actually works
 perfect for the FileRange I mentioned in the previous post. Though
 I'm not familiar with how mmfile manages its memory, but hopefully
 there isn't buffer reuse or storing the slice could be overridden
 (not an issue for value data, but string data).
I don't know about D's MmFile, but AFAIK, it maps directly to the OS mmap(), which basically maps a portion of your program's address space to the data on the disk. Meaning that the memory is managed by the OS, and addresses will not change from under you. In the underlying physical memory, pages may get swapped out and reused, but this is invisible to your program, since referencing them will cause the OS to swap the pages back in, so you'll never end up with invalid pointers. The worst that could happen is the I/O performance hit associated with swapping. Such is the utility of virtual memory.

T

-- 
Error: Keyboard not attached. Press F1 to continue. -- Yoon Ha Lee, CONLANG
Aug 05 2013
prev sibling parent Jonathan M Davis <jmdavisProg gmx.com> writes:
On Monday, August 05, 2013 23:04:58 H. S. Teoh wrote:
 On Tue, Aug 06, 2013 at 06:48:12AM +0200, Jesse Phillips wrote:
 [...]
 
 The only way I'm seeing to advance through the file is to keep an
 index on where you're currently reading from. This actually works
 perfect for the FileRange I mentioned in the previous post. Though
 I'm not familiar with how mmfile manages its memory, but hopefully
 there isn't buffer reuse or storing the slice could be overridden
 (not an issue for value data, but string data).
I don't know about D's Mmfile, but AFAIK, it maps directly to the OS mmap(), which basically maps a portion of your program's address space to the data on the disk. Meaning that the memory is managed by the OS, and addresses will not change from under you.

In the underlying physical memory, pages may get swapped out and reused, but this is invisible to your program, since referencing them will cause the OS to swap the pages back in, so you'll never end up with invalid pointers. The worst that could happen is the I/O performance hit associated with swapping. Such is the utility of virtual memory.
mmap is awesome. It makes handling large files _way_ easier, especially when you have to worry about performance. It was a huge performance boost for one of our video recorder programs where I work when we switched to using mmap on it (this device is recording multiple video streams from cameras 24/7, and performance is critical). Trying to do what mmap does on your own is incredibly bug-prone and bound to be worse for performance (since you're doing it instead of the kernel). One of our older products tries to do it on its own (probably because the developers didn't know about mmap), and it's a royal mess.

- Jonathan M Davis
Aug 05 2013
prev sibling parent Jonathan M Davis <jmdavisProg gmx.com> writes:
On Saturday, August 03, 2013 20:23:55 Gary Willoughby wrote:
 On Saturday, 3 August 2013 at 18:14:47 UTC, Gary Willoughby wrote:
 This sounds a great idea but once the file has been opened as a
 MmFile how to i convert this to a ubyte[] so the std.bitmanip
 functions work with it?
I'm currently doing this:

    auto file = new MmFile("file.dat");
    ubyte[] buffer = cast(ubyte[]) file[];
    buffer.read!uint(); // etc.

Is this how you would recommend?
Yeah. That's how you do it. - Jonathan M Davis
Aug 03 2013
prev sibling parent Jonathan M Davis <jmdavisProg gmx.com> writes:
On Saturday, August 03, 2013 14:31:16 H. S. Teoh wrote:
 On Sat, Aug 03, 2013 at 02:25:23PM -0700, Jonathan M Davis wrote:
 On Saturday, August 03, 2013 23:10:12 John Colvin wrote:
 On Saturday, 3 August 2013 at 18:23:58 UTC, Gary Willoughby wrote:
 On Saturday, 3 August 2013 at 18:14:47 UTC, Gary Willoughby
 
 wrote:
 This sounds a great idea but once the file has been opened as
 a MmFile how to i convert this to a ubyte[] so the
 std.bitmanip functions work with it?
I'm currently doing this:

    auto file = new MmFile("file.dat");
    ubyte[] buffer = cast(ubyte[]) file[];
    buffer.read!uint(); // etc.

Is this how you would recommend?
That defeats the object of memory mapping, as the [] at the end of cast(ubyte[])file[] implies copying the whole file in to memory.
Are you sure about that? Maybe I'm just not familiar enough with mmap, but I don't see anything in MmFile which would result in it copying the whole file into memory. I guess that I'll have to do some more reading up on mmap. Certainly, if slicing it like that copies it all into memory, that's a big problem.
[...]

I think he meant that the OS will have to load the entire file into memory if you sliced the mmap'ed file, not that you'll copy all the data. I'm not certain this is true, though, because slicing as I understand it only returns the address of the start of the mmap'ed addresses coupled with its length. I don't think the OS will actually load anything into memory until you reference an address within that mmap'ed range. And even then, only those disk blocks that correspond with the referenced addresses will actually be loaded -- this is the point of virtual memory, after all.
That's what I thought that mmap did, but it's not something that I've studied in detail.

Aside from that though, my main complaint about MmFile is the fact that it's a class when it really should be a reference-counted struct. At some point, we should probably create MMFile or somesuch which _is_ a reference-counted struct and then deprecate MmFile. But if we do that, then we should be sure of whatever other changes the implementation needs and do those with it.

- Jonathan M Davis
Aug 03 2013