
digitalmars.D - GZip File Reading

reply dsimcha <dsimcha yahoo.com> writes:
I noticed last night that Phobos actually has all the machinations 
required for reading gzipped files, buried in etc.c.zlib.  I've wanted a 
high-level D interface for reading and writing compressed files with an 
API similar to "normal" file I/O for a while.  I'm thinking about what 
the easiest/best design would be.  At a high level there are two designs:

1.  Hack std.stdio.File to support gzipped formats.  This would allow an 
identical interface for "normal" and compressed I/O.  It would also 
allow reuse of things like ByLine.  However, it would require major 
refactoring of File to decouple it from the C file I/O routines so that 
it could call either the C or GZip ones depending on how it's 
configured.  Probably, it would make sense to make an interface that 
wraps I/O functions and make an instance for C and one for gzip, with 
bzip2 and other goodies possibly being added later.

2.  Write something completely separate.  This would keep std.stdio.File 
doing one thing well (wrapping C file I/O) but would be more of a PITA 
for the user and possibly result in code duplication.

I'd like to get some comments on what an appropriate API design and 
implementation for writing gzipped files would be.  Two key requirements 
are that it must be as easy to use as std.stdio.File and it must be easy 
to extend to support other single-file compression formats like bz2.
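
For illustration, design 1 (an interface that wraps the I/O functions, with one instance for C stdio and one for gzip) could be sketched roughly as below. This is only a sketch under stated assumptions: `ByteSource`, `CFileSource`, `GzipSource` and `readAll` are hypothetical names, not Phobos types, and the gzip backend simply forwards to the C functions declared in etc.c.zlib.

```d
import core.stdc.stdio : FILE, fopen, fread, fclose;
import etc.c.zlib : gzFile, gzopen, gzread, gzclose;
import std.string : toStringz;

/// Hypothetical abstraction over "read some bytes" (not a Phobos type).
interface ByteSource
{
    /// Fills buf and returns the slice actually read (empty at end of input).
    ubyte[] read(ubyte[] buf);
    void close();
}

/// Backend that forwards to the C stdio routines.
final class CFileSource : ByteSource
{
    private FILE* fp;

    this(string path)
    {
        fp = fopen(path.toStringz, "rb");
        if (fp is null) throw new Exception("cannot open " ~ path);
    }

    ubyte[] read(ubyte[] buf)
    {
        return buf[0 .. fread(buf.ptr, 1, buf.length, fp)];
    }

    void close() { if (fp !is null) { fclose(fp); fp = null; } }
}

/// Backend that forwards to zlib's gz* routines via etc.c.zlib.
final class GzipSource : ByteSource
{
    private gzFile gz;

    this(string path)
    {
        gz = gzopen(path.toStringz, "rb");
        if (gz is null) throw new Exception("cannot open " ~ path);
    }

    ubyte[] read(ubyte[] buf)
    {
        immutable n = gzread(gz, buf.ptr, cast(uint) buf.length);
        if (n < 0) throw new Exception("gzip read error");
        return buf[0 .. n];
    }

    void close() { if (gz !is null) { gzclose(gz); gz = null; } }
}

/// Consumer code is identical for both backends.
ubyte[] readAll(ByteSource src)
{
    ubyte[] result;
    auto buf = new ubyte[4096];
    for (;;)
    {
        auto got = src.read(buf);
        if (got.length == 0) break;
        result ~= got; // copies, so reusing buf is safe
    }
    src.close();
    return result;
}
```

With something like this, `readAll(new CFileSource("a.txt"))` and `readAll(new GzipSource("a.txt.gz"))` would go through the same code path, and a bzip2 backend would be one more class implementing the interface.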
Mar 09 2011
next sibling parent reply Daniel Gibson <metalcaedes gmail.com> writes:
Am 10.03.2011 05:53, schrieb dsimcha:
 I noticed last night that Phobos actually has all the machinations
 required for reading gzipped files, buried in etc.c.zlib. I've wanted a
 high-level D interface for reading and writing compressed files with an
 API similar to "normal" file I/O for a while. I'm thinking about what
 the easiest/best design would be. At a high level there are two designs:

 1. Hack std.stdio.file to support gzipped formats. This would allow an
 identical interface for "normal" and compressed I/O. It would also allow
 reuse of things like ByLine. However, it would require major refactoring
 of File to decouple it from the C file I/O routines so that it could
 call either the C or GZip ones depending on how it's configured.
 Probably, it would make sense to make an interface that wraps I/O
 functions and make an instance for C and one for gzip, with bzip2 and
 other goodies possibly being added later.

 2. Write something completely separate. This would keep std.stdio.File
 doing one thing well (wrapping C file I/O) but would be more of a PITA
 for the user and possibly result in code duplication.

 I'd like to get some comments on what an appropriate API design and
 implementation for writing gzipped files would be. Two key requirements
 are that it must be as easy to use as std.stdio.File and it must be easy
 to extend to support other single-file compression formats like bz2.

Maybe a proper stream API would help. It could provide ByLine etc, could be used for any kind of compression format (as long as an appropriate input-stream is provided), ... (analogous for writing) Cheers, - Daniel
Mar 09 2011
next sibling parent Jonathan M Davis <jmdavisProg gmx.com> writes:
On Wednesday 09 March 2011 21:10:59 Daniel Gibson wrote:
 Am 10.03.2011 05:53, schrieb dsimcha:
 I noticed last night that Phobos actually has all the machinations
 required for reading gzipped files, buried in etc.c.zlib. I've wanted a
 high-level D interface for reading and writing compressed files with an
 API similar to "normal" file I/O for a while. I'm thinking about what
 the easiest/best design would be. At a high level there are two designs:
 
 1. Hack std.stdio.file to support gzipped formats. This would allow an
 identical interface for "normal" and compressed I/O. It would also allow
 reuse of things like ByLine. However, it would require major refactoring
 of File to decouple it from the C file I/O routines so that it could
 call either the C or GZip ones depending on how it's configured.
 Probably, it would make sense to make an interface that wraps I/O
 functions and make an instance for C and one for gzip, with bzip2 and
 other goodies possibly being added later.
 
 2. Write something completely separate. This would keep std.stdio.File
 doing one thing well (wrapping C file I/O) but would be more of a PITA
 for the user and possibly result in code duplication.
 
 I'd like to get some comments on what an appropriate API design and
 implementation for writing gzipped files would be. Two key requirements
 are that it must be as easy to use as std.stdio.File and it must be easy
 to extend to support other single-file compression formats like bz2.

Maybe a proper stream API would help. It could provide ByLine etc, could be used for any kind of compression format (as long as an appropriate input-stream is provided), ... (analogous for writing)

That was my thought. We really need proper streams... The other potential issue with compressed files is that they can contain directories and such. A gzipped/bzipped file is not necessarily a file that you can read, even once it's been uncompressed. That may or may not matter for this particular application of them, but it is something to be aware of. - Jonathan M Davis
Mar 09 2011
prev sibling parent "Lars T. Kyllingstad" <public kyllingen.NOSPAMnet> writes:
On Wed, 09 Mar 2011 21:34:29 -0800, Jonathan M Davis wrote:

 On Wednesday 09 March 2011 21:10:59 Daniel Gibson wrote:
 Am 10.03.2011 05:53, schrieb dsimcha:
 I noticed last night that Phobos actually has all the machinations
 required for reading gzipped files, buried in etc.c.zlib. I've wanted
 a high-level D interface for reading and writing compressed files
 with an API similar to "normal" file I/O for a while. I'm thinking
 about what the easiest/best design would be. At a high level there
 are two designs:
 
 1. Hack std.stdio.file to support gzipped formats. This would allow
 an identical interface for "normal" and compressed I/O. It would also
 allow reuse of things like ByLine. However, it would require major
 refactoring of File to decouple it from the C file I/O routines so
 that it could call either the C or GZip ones depending on how it's
 configured. Probably, it would make sense to make an interface that
 wraps I/O functions and make an instance for C and one for gzip, with
 bzip2 and other goodies possibly being added later.
 
 2. Write something completely separate. This would keep
 std.stdio.File doing one thing well (wrapping C file I/O) but would
 be more of a PITA for the user and possibly result in code
 duplication.
 
 I'd like to get some comments on what an appropriate API design and
 implementation for writing gzipped files would be. Two key
 requirements are that it must be as easy to use as std.stdio.File and
 it must be easy to extend to support other single-file compression
 formats like bz2.

Maybe a proper stream API would help. It could provide ByLine etc, could be used for any kind of compression format (as long as an appropriate input-stream is provided), ... (analogous for writing)

That was my thought. We really need proper streams... The other potential issue with compressed files is that they can contain directories and such.

Not gzip and bzip2 compressed files. They only contain a single file. -Lars
Mar 10 2011
prev sibling next sibling parent Jonathan M Davis <jmdavisProg gmx.com> writes:
On Thursday 10 March 2011 00:15:34 Lars T. Kyllingstad wrote:
 On Wed, 09 Mar 2011 21:34:29 -0800, Jonathan M Davis wrote:
 On Wednesday 09 March 2011 21:10:59 Daniel Gibson wrote:
 Am 10.03.2011 05:53, schrieb dsimcha:
 I noticed last night that Phobos actually has all the machinations
 required for reading gzipped files, buried in etc.c.zlib. I've wanted
 a high-level D interface for reading and writing compressed files
 with an API similar to "normal" file I/O for a while. I'm thinking
 about what the easiest/best design would be. At a high level there
 are two designs:
 
 1. Hack std.stdio.file to support gzipped formats. This would allow
 an identical interface for "normal" and compressed I/O. It would also
 allow reuse of things like ByLine. However, it would require major
 refactoring of File to decouple it from the C file I/O routines so
 that it could call either the C or GZip ones depending on how it's
 configured. Probably, it would make sense to make an interface that
 wraps I/O functions and make an instance for C and one for gzip, with
 bzip2 and other goodies possibly being added later.
 
 2. Write something completely separate. This would keep
 std.stdio.File doing one thing well (wrapping C file I/O) but would
 be more of a PITA for the user and possibly result in code
 duplication.
 
 I'd like to get some comments on what an appropriate API design and
 implementation for writing gzipped files would be. Two key
 requirements are that it must be as easy to use as std.stdio.File and
 it must be easy to extend to support other single-file compression
 formats like bz2.

Maybe a proper stream API would help. It could provide ByLine etc, could be used for any kind of compression format (as long as an appropriate input-stream is provided), ... (analogous for writing)

That was my thought. We really need proper streams... The other potential issue with compressed files is that they can contain directories and such.

Not gzip and bzip2 compressed files. They only contain a single file.

Ah. True. I'm too used to always using tar with them. ;) Actually, the fact that they're that way makes them _way_ more pleasant to deal with programmatically than zip... - Jonathan M Davis
Mar 10 2011
prev sibling next sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 3/9/2011 8:53 PM, dsimcha wrote:
 I'd like to get some comments on what an appropriate API design and
 implementation for writing gzipped files would be. Two key requirements
 are that it must be as easy to use as std.stdio.File and it must be easy
 to extend to support other single-file compression formats like bz2.

Use ranges.
Mar 10 2011
next sibling parent reply dsimcha <dsimcha yahoo.com> writes:
On 3/10/2011 4:59 AM, Walter Bright wrote:
 On 3/9/2011 8:53 PM, dsimcha wrote:
 I'd like to get some comments on what an appropriate API design and
 implementation for writing gzipped files would be. Two key requirements
 are that it must be as easy to use as std.stdio.File and it must be easy
 to extend to support other single-file compression formats like bz2.

Use ranges.

Ok, obviously. The point was trying to figure out how to maximize the reuse of the infrastructure from std.stdio.File.
Mar 10 2011
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 3/10/2011 6:24 AM, dsimcha wrote:
 On 3/10/2011 4:59 AM, Walter Bright wrote:
 On 3/9/2011 8:53 PM, dsimcha wrote:
 I'd like to get some comments on what an appropriate API design and
 implementation for writing gzipped files would be. Two key requirements
 are that it must be as easy to use as std.stdio.File and it must be easy
 to extend to support other single-file compression formats like bz2.

Use ranges.

Ok, obviously. The point was trying to figure out how to maximize the reuse of the infrastructure from std.stdio.File.

It's not so obvious based on my reading of the other comments. For example, we should not be inventing a streaming interface.
Mar 10 2011
next sibling parent dsimcha <dsimcha yahoo.com> writes:
On 3/10/2011 8:29 PM, Walter Bright wrote:
 On 3/10/2011 6:24 AM, dsimcha wrote:
 On 3/10/2011 4:59 AM, Walter Bright wrote:
 On 3/9/2011 8:53 PM, dsimcha wrote:
 I'd like to get some comments on what an appropriate API design and
 implementation for writing gzipped files would be. Two key requirements
 are that it must be as easy to use as std.stdio.File and it must be easy
 to extend to support other single-file compression formats like bz2.

Use ranges.

Ok, obviously. The point was trying to figure out how to maximize the reuse of the infrastructure from std.stdio.File.

It's not so obvious based on my reading of the other comments. For example, we should not be inventing a streaming interface.

Ok, I see what you're saying. I was making the assumption that the streaming interface would be based on ranges, and it was more a matter of working out other details, like what decorators to provide.
Mar 10 2011
prev sibling parent dsimcha <dsimcha yahoo.com> writes:
On 3/11/2011 8:04 AM, Steven Schveighoffer wrote:
 On Thu, 10 Mar 2011 20:29:55 -0500, Walter Bright
 <newshound2 digitalmars.com> wrote:

 On 3/10/2011 6:24 AM, dsimcha wrote:
 On 3/10/2011 4:59 AM, Walter Bright wrote:
 On 3/9/2011 8:53 PM, dsimcha wrote:
 I'd like to get some comments on what an appropriate API design and
 implementation for writing gzipped files would be. Two key requirements
 are that it must be as easy to use as std.stdio.File and it must be easy
 to extend to support other single-file compression formats like bz2.

Use ranges.

Ok, obviously. The point was trying to figure out how to maximize the reuse of the infrastructure from std.stdio.File.

It's not so obvious based on my reading of the other comments. For example, we should not be inventing a streaming interface.

C's FILE * interface is too limiting/low performing. I'm working to create a streaming interface to replace it, and then we can compare the differences. I think it's pretty obvious from Tango's I/O performance that a D-based stream interface is a better approach. Ranges should be built on top of that interface. I won't continue the debate, since it's difficult to argue from a position of theory. However, I don't think it will be long before I can show some real numbers. I'm not expecting Phobos to adopt, based on my experience with dcollections, but it should be seamlessly usable with Phobos, especially since range-based functions are templated. -Steve

Well, I certainly appreciate your efforts. IMHO the current state of file I/O for anything but uncompressed plain text in D is pretty sad. Even uncompressed plain text is pretty bad on Windows due to various bugs. IMHO one huge improvement that could be made to Phobos would be to create modules for reading the most common file formats (my personal list would be gzip, bzip2, png, bmp, jpeg and csv) with a nice high-level D interface.
Mar 11 2011
prev sibling parent "Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Thu, 10 Mar 2011 20:29:55 -0500, Walter Bright  
<newshound2 digitalmars.com> wrote:

 On 3/10/2011 6:24 AM, dsimcha wrote:
 On 3/10/2011 4:59 AM, Walter Bright wrote:
 On 3/9/2011 8:53 PM, dsimcha wrote:
 I'd like to get some comments on what an appropriate API design and
 implementation for writing gzipped files would be. Two key requirements
 are that it must be as easy to use as std.stdio.File and it must be easy
 to extend to support other single-file compression formats like bz2.

Use ranges.

Ok, obviously. The point was trying to figure out how to maximize the reuse of the infrastructure from std.stdio.File.

It's not so obvious based on my reading of the other comments. For example, we should not be inventing a streaming interface.

C's FILE * interface is too limiting/low performing. I'm working to create a streaming interface to replace it, and then we can compare the differences. I think it's pretty obvious from Tango's I/O performance that a D-based stream interface is a better approach. Ranges should be built on top of that interface. I won't continue the debate, since it's difficult to argue from a position of theory. However, I don't think it will be long before I can show some real numbers. I'm not expecting Phobos to adopt, based on my experience with dcollections, but it should be seamlessly usable with Phobos, especially since range-based functions are templated. -Steve
Mar 11 2011
prev sibling next sibling parent reply Russel Winder <russel russel.org.uk> writes:

On Wed, 2011-03-09 at 23:53 -0500, dsimcha wrote:
 I noticed last night that Phobos actually has all the machinations
 required for reading gzipped files, buried in etc.c.zlib.  I've wanted a
 high-level D interface for reading and writing compressed files with an
 API similar to "normal" file I/O for a while.  I'm thinking about what
 the easiest/best design would be.  At a high level there are two designs:

But isn't a gzip (or zip, 7z, bzip2, etc., etc.) file actually a container: a tree of files. So isn't it more a persistent data structure that has a rendering as a single flat file on the filestore, than being a partitioned flat file which is what you will end up with if you head directly down the file/stream route?

 1.  Hack std.stdio.file to support gzipped formats.  This would allow an
 identical interface for "normal" and compressed I/O.  It would also
 allow reuse of things like ByLine.  However, it would require major
 refactoring of File to decouple it from the C file I/O routines so that
 it could call either the C or GZip ones depending on how it's
 configured.  Probably, it would make sense to make an interface that
 wraps I/O functions and make an instance for C and one for gzip, with
 bzip2 and other goodies possibly being added later.

 2.  Write something completely separate.  This would keep std.stdio.File
 doing one thing well (wrapping C file I/O) but would be more of a PITA
 for the user and possibly result in code duplication.

 I'd like to get some comments on what an appropriate API design and
 implementation for writing gzipped files would be.  Two key requirements
 are that it must be as easy to use as std.stdio.File and it must be easy
 to extend to support other single-file compression formats like bz2.

-- 
Russel.
=============================================================================
Dr Russel Winder      t: +44 20 7585 2200   voip: sip:russel.winder ekiga.net
41 Buckmaster Road    m: +44 7770 465 077   xmpp: russel russel.org.uk
London SW11 1EN, UK   w: www.russel.org.uk  skype: russel_winder
Mar 10 2011
parent reply dsimcha <dsimcha yahoo.com> writes:
On 3/10/2011 5:57 AM, Lars T. Kyllingstad wrote:
 Nope, a gzip or bzip2 file only contains a single file.  To zip several
 files, you first make a tar archive, and then you run gzip or bzip2 on
 it.  That's why most compressed archives targeted at the Linux platform
 have extensions like .tar.gz, .tar.bz2, and so on.

 -Lars

This is **exactly** my point. These single-file gzip and bzip2 files should be usable with exactly the same API as uncompressed file I/O. My personal use case for this is files that contain large amounts of DNA sequence. This compresses very well, since besides a little meta-info it's just a bunch of A's, C's, G's and T's. I want to be able to read in these huge files and decompress them transparently on the fly. Another example (and the one that brought the subject of these non-tarred gzips to my attention) is the svgz format. This is an image format, and is literally just a gzipped SVG. Uncompressed SVG is a ridiculously bloated format but compresses very well, so the SVG standard requires that gzipped SVG files "just work" transparently with any SVG-compliant program. I recently added svgz support to plot2kill, and it was somewhat of a PITA because I had to find the C API buried in etc.c.zlib and then I got stuck using it instead of a nice D API. The bigger point, though, is that use cases for non-tarred single-file gzips do exist and they should be handled transparently via an interface identical to normal file I/O.
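
To make the use case concrete, here is a minimal sketch of reading a gzipped sequence file line by line and decompressing on the fly, using only the C API that etc.c.zlib exposes (`gzopen`, `gzgets`, `gzclose`). The function name `countBases` is hypothetical, and treating lines that start with '>' as meta-info is an assumption (a FASTA-like layout), not something stated above.

```d
import core.stdc.string : strlen;
import etc.c.zlib : gzFile, gzopen, gzgets, gzclose;
import std.string : toStringz;

/// Counts A/C/G/T in a gzipped text file without unpacking it to disk.
/// Assumption: lines starting with '>' are metadata and are skipped.
/// Caveat: lines longer than the buffer are processed in pieces.
size_t[char] countBases(string path)
{
    gzFile gz = gzopen(path.toStringz, "rb");
    if (gz is null) throw new Exception("cannot open " ~ path);
    scope (exit) gzclose(gz);

    size_t[char] counts;
    char[4096] buf;
    // gzgets returns null at end of file or on error.
    while (gzgets(gz, buf.ptr, cast(int) buf.length) !is null)
    {
        auto line = buf[0 .. strlen(buf.ptr)];
        if (line.length && line[0] == '>') continue;
        foreach (c; line)
        {
            switch (c)
            {
                case 'A', 'C', 'G', 'T': ++counts[c]; break;
                default: break;
            }
        }
    }
    return counts;
}
```

This works, but it is exactly the kind of manual buffer and pointer handling that a File-like D interface would hide.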
Mar 10 2011
parent reply dsimcha <dsimcha yahoo.com> writes:
On 3/10/2011 9:45 AM, Lars T. Kyllingstad wrote:
 Although I agree this would be nice, I don't think std.stdio.File is the
 right place to put it.  I think a general streaming framework should be
 in place first, and File be made to work with it.  Then, working with a
 gzipped/bzipped file should be as simple as wrapping the raw File stream
 in a compression/decompression stream.

 -Lars

Ok, this seems to be the general consensus based on the replies I've gotten. Unfortunately, I have neither the time, the desire, nor the knowledge to design and implement a full-fledged stream API, whereas I have enough of all three of these to bolt gzip support onto std.stdio. I guess I'll solve my specific use case at a higher level by wrapping the C gzip stuff for my DNA sequence reader class, and let someone who knows something about good stream design solve the more general problem.
Mar 10 2011
parent dsimcha <dsimcha yahoo.com> writes:
== Quote from Jonathan M Davis (jmdavisProg gmx.com)'s article
 There _have_ been some threads on designing a new stream API, and there are
 some preliminary designs, but as far as I know, they have yet to really go
 anywhere. I'm unaware of the stream API really having a champion per se.
 Andrei has done some preliminary design work, but I don't know if he intends
 to actually implement anything, and as far as I know, no one else has
 volunteered. So, a new std.stream is one of those things that we all agree on
 that we want but which hasn't happened yet, because no one has stepped up to
 do it.
 And I do agree that the ability to deal with a compressed file should be part
 of the stream API (probably as some sort of adapter/wrapper which uncompresses
 the stream as you iterate through it). But the stream API needs to be designed
 and _implemented_ before we'll have that.
 - Jonathan M Davis

So I guess in such a design we would still have things like a decorator to iterate through a stream by chunk, by line, etc., with a range interface?
Mar 10 2011
prev sibling next sibling parent "Lars T. Kyllingstad" <public kyllingen.NOSPAMnet> writes:
On Thu, 10 Mar 2011 10:17:17 +0000, Russel Winder wrote:

 On Wed, 2011-03-09 at 23:53 -0500, dsimcha wrote:
 I noticed last night that Phobos actually has all the machinations
 required for reading gzipped files, buried in etc.c.zlib.  I've wanted
 a high-level D interface for reading and writing compressed files with
 an API similar to "normal" file I/O for a while.  I'm thinking about
 what the easiest/best design would be.  At a high level there are two
 designs:

But isn't a gzip (or zip, 7z, bzip2, etc., etc.) file actually a container: a tree of files. So isn't it more a persistent data structure that has a rendering as a single flat file on the filestore, than being a partitioned flat file which is what you will end up with if you head directly down the file/stream route?

Nope, a gzip or bzip2 file only contains a single file. To zip several files, you first make a tar archive, and then you run gzip or bzip2 on it. That's why most compressed archives targeted at the Linux platform have extensions like .tar.gz, .tar.bz2, and so on. -Lars
Mar 10 2011
prev sibling next sibling parent Russel Winder <russel russel.org.uk> writes:

On Thu, 2011-03-10 at 10:57 +0000, Lars T. Kyllingstad wrote:
[ . . . ]
 Nope, a gzip or bzip2 file only contains a single file.  To zip several
 files, you first make a tar archive, and then you run gzip or bzip2 on
 it.  That's why most compressed archives targeted at the Linux platform
 have extensions like .tar.gz, .tar.bz2, and so on.

Obviously ;-) I confused myself thinking of files with extension tgz. Zip, Gzip so similar, so different. Sorry for the noise. Everyone should go back to thinking of a transforming stream architecture for this problem.

-- 
Russel.
=============================================================================
Dr Russel Winder      t: +44 20 7585 2200   voip: sip:russel.winder ekiga.net
41 Buckmaster Road    m: +44 7770 465 077   xmpp: russel russel.org.uk
London SW11 1EN, UK   w: www.russel.org.uk  skype: russel_winder
Mar 10 2011
prev sibling next sibling parent spir <denis.spir gmail.com> writes:
On 03/10/2011 09:15 AM, Lars T. Kyllingstad wrote:
 On Wed, 09 Mar 2011 21:34:29 -0800, Jonathan M Davis wrote:

 On Wednesday 09 March 2011 21:10:59 Daniel Gibson wrote:
 Am 10.03.2011 05:53, schrieb dsimcha:
 I noticed last night that Phobos actually has all the machinations
 required for reading gzipped files, buried in etc.c.zlib. I've wanted
 a high-level D interface for reading and writing compressed files
 with an API similar to "normal" file I/O for a while. I'm thinking
 about what the easiest/best design would be. At a high level there
 are two designs:

 1. Hack std.stdio.file to support gzipped formats. This would allow
 an identical interface for "normal" and compressed I/O. It would also
 allow reuse of things like ByLine. However, it would require major
 refactoring of File to decouple it from the C file I/O routines so
 that it could call either the C or GZip ones depending on how it's
 configured. Probably, it would make sense to make an interface that
 wraps I/O functions and make an instance for C and one for gzip, with
 bzip2 and other goodies possibly being added later.

 2. Write something completely separate. This would keep
 std.stdio.File doing one thing well (wrapping C file I/O) but would
 be more of a PITA for the user and possibly result in code
 duplication.

 I'd like to get some comments on what an appropriate API design and
 implementation for writing gzipped files would be. Two key
 requirements are that it must be as easy to use as std.stdio.File and
 it must be easy to extend to support other single-file compression
 formats like bz2.

Maybe a proper stream API would help. It could provide ByLine etc, could be used for any kind of compression format (as long as an appropriate input-stream is provided), ... (analogous for writing)

That was my thought. We really need proper streams... The other potential issue with compressed files is that they can contain directories and such.

Not gzip and bzip2 compressed files. They only contain a single file.

Yop, but the underlying 'file' is a tar-ed pack of files... Denis -- _________________ vita es estrany spir.wikidot.com
Mar 10 2011
prev sibling next sibling parent reply "Lars T. Kyllingstad" <public kyllingen.NOSPAMnet> writes:
On Thu, 10 Mar 2011 09:20:34 -0500, dsimcha wrote:

 On 3/10/2011 5:57 AM, Lars T. Kyllingstad wrote:
 Nope, a gzip or bzip2 file only contains a single file.  To zip several
 files, you first make a tar archive, and then you run gzip or bzip2 on
 it.  That's why most compressed archives targeted at the Linux platform
 have extensions like .tar.gz, .tar.bz2, and so on.

 -Lars

This is **exactly** my point. These single-file gzip and bzip2 files should be usable with exactly the same API as uncompressed file I/O. My personal use case for this is files that contain large amounts of DNA sequence. This compresses very well, since besides a little meta-info it's just a bunch of A's, C's, G's and T's. I want to be able to read in these huge files and decompress them transparently on the fly. Another example (and the one that brought the subject of these non-tarred gzips to my attention) is the svgz format. This is an image format, and is literally just a gzipped SVG. Uncompressed SVG is a ridiculously bloated format but compresses very well, so the SVG standard requires that gzipped SVG files "just work" transparently with any SVG-compliant program. I recently added svgz support to plot2kill, and it was somewhat of a PITA because I had to find the C API buried in etc.c.zlib and then I got stuck using it instead of a nice D API. The bigger point, though, is that use cases for non-tarred single-file gzips do exist and they should be handled transparently via an interface identical to normal file I/O.

Although I agree this would be nice, I don't think std.stdio.File is the right place to put it. I think a general streaming framework should be in place first, and File be made to work with it. Then, working with a gzipped/bzipped file should be as simple as wrapping the raw File stream in a compression/decompression stream. -Lars
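
As a rough illustration of "wrapping the raw File stream in a decompression stream" with what Phobos already ships, the sketch below layers std.zlib.UnCompress over File.byChunk. The function name `decompressedChunks` is hypothetical, and this is an assumption about how such a wrapper might look, not an existing Phobos facility. `cache` is used so that each chunk is fed to the decompressor exactly once even if `front` is read repeatedly.

```d
import std.algorithm.iteration : cache, map;
import std.stdio : File;
import std.zlib : UnCompress;

/// Lazily decompresses a gzip (or zlib) file, one raw chunk at a time.
/// Yields an input range of const(ubyte)[] holding decompressed data.
auto decompressedChunks(File file, size_t chunkSize = 64 * 1024)
{
    // The default header format lets zlib detect gzip vs. zlib framing.
    auto inflater = new UnCompress;
    return file.byChunk(chunkSize)
               .map!(chunk => cast(const(ubyte)[]) inflater.uncompress(chunk))
               .cache;
}
```

A caller would then write `foreach (chunk; decompressedChunks(File("data.gz", "rb")))` and never touch the C API. Swapping in another format would mean swapping the decompressor, while the file handling stays the same.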
Mar 10 2011
next sibling parent Jonathan M Davis <jmdavisProg gmx.com> writes:
On Thursday 10 March 2011 07:14:32 dsimcha wrote:
 On 3/10/2011 9:45 AM, Lars T. Kyllingstad wrote:
 Although I agree this would be nice, I don't think std.stdio.File is the
 right place to put it.  I think a general streaming framework should be
 in place first, and File be made to work with it.  Then, working with a
 gzipped/bzipped file should be as simple as wrapping the raw File stream
 in a compression/decompression stream.
 
 -Lars

Ok, this seems to be the general consensus based on the replies I've gotten. Unfortunately, I have neither the time, the desire, nor the knowledge to design and implement a full-fledged stream API, whereas I have enough of all three of these to bolt gzip support onto std.stdio. I guess I'll solve my specific use case at a higher level by wrapping the C gzip stuff for my DNA sequence reader class, and let someone who knows something about good stream design solve the more general problem.

There _have_ been some threads on designing a new stream API, and there are some preliminary designs, but as far as I know, they have yet to really go anywhere. I'm unaware of the stream API really having a champion per se. Andrei has done some preliminary design work, but I don't know if he intends to actually implement anything, and as far as I know, no one else has volunteered. So, a new std.stream is one of those things that we all agree on that we want but which hasn't happened yet, because no one has stepped up to do it. And I do agree that the ability to deal with a compressed file should be part of the stream API (probably as some sort of adapter/wrapper which uncompresses the stream as you iterate through it). But the stream API needs to be designed and _implemented_ before we'll have that. - Jonathan M Davis
Mar 10 2011
prev sibling next sibling parent Jonathan M Davis <jmdavisProg gmx.com> writes:
On Thursday, March 10, 2011 09:16:35 dsimcha wrote:
 == Quote from Jonathan M Davis (jmdavisProg gmx.com)'s article
 
 There _have_ been some threads on designing a new stream API, and there
 are some preliminary designs, but as far as I know, they have yet to
 really go anywhere. I'm unaware of the stream API really having a
 champion per se. Andrei has done some preliminary design work, but I
 don't know if he intends to actually implement anything, and as far as I
 know, no one else has volunteered. So, a new std.stream is one of those
 things that we all agree on that we want but which hasn't happened yet,
 because no one has stepped up to do it.
 And I do agree that the ability to deal with a compressed file should be
 part of the stream API (probably as some sort of adapter/wrapper which
 uncompresses the stream as you iterate through it). But the stream API
 needs to be designed and _implemented_ before we'll have that.
 - Jonathan M Davis

So I guess in such a design we would still have things like a decorator to iterate through a stream by chunk, by line, etc., with a range interface?

I don't remember exactly what it's going to look like. IIRC, streams are the reason that Andrei was looking at a range API that gave you a T[] instead of T. Whatever you were doing would be built on top of that. So, grabbing it by line or whatever. But I would fully expect that you would be able to put a wrapper range in there which took the stream of bytes, treated them as if they were gzipped or bzipped or whatever, and gave you bytes (or chars or whatever was appropriate) as if you were reading from an uncompressed version of the file. So, the reading would be identical whether the file was compressed or not. It's just that in the case of a compressed stream/file, you'd have a decorator/wrapper which uncompressed it for you.

- Jonathan M Davis
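A minimal sketch of the kind of wrapper range described above, assuming std.zlib's UnCompress class does the actual decompression. Inflated and inflated are hypothetical names used only for illustration; they are not part of Phobos or of any proposed stream design. The wrapper takes any input range of compressed chunks and yields decompressed chunks, so the consumer reads the same way whether or not the data was gzipped.

```d
import std.range : chunks;
import std.range.primitives : ElementType, empty, front, isInputRange, popFront;
import std.zlib : Compress, HeaderFormat, UnCompress;

/// Hypothetical adapter: wraps a range of gzip-compressed chunks and
/// yields the decompressed bytes chunk by chunk.
struct Inflated(R)
if (isInputRange!R && is(ElementType!R : const(ubyte)[]))
{
    private R source;
    private UnCompress inflater;
    private const(ubyte)[] current;
    private bool flushed, done;

    this(R source)
    {
        this.source = source;
        inflater = new UnCompress(HeaderFormat.gzip);
        popFront(); // prime the first decompressed chunk
    }

    @property bool empty() const { return done; }
    @property const(ubyte)[] front() const { return current; }

    void popFront()
    {
        current = null;
        // A compressed chunk may yield no output yet (e.g. header bytes),
        // so keep feeding input until something comes out or input ends.
        while (current.length == 0)
        {
            if (flushed) { done = true; return; }
            if (source.empty)
            {
                current = cast(const(ubyte)[]) inflater.flush();
                flushed = true;
            }
            else
            {
                current = cast(const(ubyte)[]) inflater.uncompress(source.front);
                source.popFront();
            }
        }
    }
}

/// Hypothetical convenience function so the adapter can be used via UFCS.
auto inflated(R)(R source) { return Inflated!R(source); }

void main()
{
    string text = "ACGT\nTTGA\n";

    // Build a small gzip stream in memory to stand in for a file.
    auto gz = new Compress(HeaderFormat.gzip);
    auto packed = cast(const(ubyte)[]) gz.compress(text);
    packed ~= cast(const(ubyte)[]) gz.flush();

    ubyte[] unpacked;
    foreach (piece; packed.chunks(8).inflated)
        unpacked ~= piece;
    assert(cast(const(char)[]) unpacked == text);
}
```

With a real file, something like File("foo.txt.gz").byChunk(4096).inflated should feed the same adapter from disk, and a by-line range could then be layered on top of it.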
Mar 10 2011
prev sibling next sibling parent reply Stewart Gordon <smjg_1998 yahoo.com> writes:
On 10/03/2011 04:53, dsimcha wrote:
<snip>
 I'd like to get some comments on what an appropriate API design and
 implementation for writing gzipped files would be. Two key requirements
 are that it must be as easy to use as std.stdio.File and it must be easy
 to extend to support other single-file compression formats like bz2.

You don't seem to get how std.stream works. The API is defined in the InputStream and OutputStream interfaces. Various classes implement these interfaces, generally through the Stream abstract class, to provide the functionality for a specific kind of stream. File is just one of these classes. Another is MemoryStream, to read from and write to a buffer in memory. A stream class used to work with gzipped files would be just another.

Indeed, we have FilterStream, which is a base class for stream classes that wrap a stream, such as a file or memory stream, to modify the data in some way as it goes in and out. Compressing or decompressing is an example of this - so I guess that GzipStream would be a subclass of FilterStream.

Stewart.
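To make the filter idea concrete, here is a rough sketch of that layering. It deliberately does not use std.stream itself: ByteSource, MemorySource and GzipSource are hypothetical stand-ins for InputStream, MemoryStream and the suggested GzipStream, with a much smaller method set than the real interfaces. Decompression is assumed to come from std.zlib's UnCompress.

```d
import std.algorithm.comparison : min;
import std.zlib : Compress, HeaderFormat, UnCompress;

/// Hypothetical minimal read interface (stand-in for InputStream).
interface ByteSource
{
    /// Fills buf with up to buf.length bytes; returns the count, 0 at end.
    size_t read(ubyte[] buf);
}

/// Hypothetical stand-in for MemoryStream: reads from a buffer in memory.
final class MemorySource : ByteSource
{
    private const(ubyte)[] data;
    this(const(ubyte)[] data) { this.data = data; }

    size_t read(ubyte[] buf)
    {
        immutable n = min(buf.length, data.length);
        buf[0 .. n] = data[0 .. n];
        data = data[n .. $];
        return n;
    }
}

/// Hypothetical filter: wraps another source and decompresses what it reads.
final class GzipSource : ByteSource
{
    private ByteSource inner;
    private UnCompress inflater;
    private const(ubyte)[] pending; // decompressed bytes not yet handed out
    private bool finished;

    this(ByteSource inner)
    {
        this.inner = inner;
        inflater = new UnCompress(HeaderFormat.gzip);
    }

    size_t read(ubyte[] buf)
    {
        ubyte[256] raw;
        // Pull compressed bytes from the wrapped source until some
        // decompressed output is available or the input is exhausted.
        while (pending.length == 0 && !finished)
        {
            immutable got = inner.read(raw[]);
            if (got == 0)
            {
                pending = cast(const(ubyte)[]) inflater.flush();
                finished = true;
            }
            else
                pending = cast(const(ubyte)[]) inflater.uncompress(raw[0 .. got]);
        }
        immutable n = min(buf.length, pending.length);
        buf[0 .. n] = pending[0 .. n];
        pending = pending[n .. $];
        return n;
    }
}

void main()
{
    auto gz = new Compress(HeaderFormat.gzip);
    auto packed = cast(const(ubyte)[]) gz.compress("hello, stream");
    packed ~= cast(const(ubyte)[]) gz.flush();

    // The caller only sees ByteSource; the gzip layer is just a wrapper.
    ByteSource src = new GzipSource(new MemorySource(packed));
    ubyte[] text;
    ubyte[5] buf;
    for (size_t n; (n = src.read(buf[])) != 0;)
        text ~= buf[0 .. n];
    assert(cast(const(char)[]) text == "hello, stream");
}
```

Swapping MemorySource for a file-backed class would not change GzipSource at all, which is the point of the filter design.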
Mar 11 2011
next sibling parent reply dsimcha <dsimcha yahoo.com> writes:
On 3/11/2011 7:12 PM, Stewart Gordon wrote:
 On 10/03/2011 04:53, dsimcha wrote:
 <snip>
 I'd like to get some comments on what an appropriate API design and
 implementation for
 writing gzipped files would be. Two key requirements are that it must
 be as easy to use as
 std.stdio.File and it must be easy to extend to support other
 single-file compression
 formats like bz2.

You don't seem to get how std.stream works. The API is defined in the InputStream and OutputStream interfaces. Various classes implement these interfaces, generally through the Stream abstract class, to provide the functionality for a specific kind of stream. File is just one of these classes. Another is MemoryStream, to read from and write to a buffer in memory. A stream class used to work with gzipped files would be just another. Indeed, we have FilterStream, which is a base class for stream classes that wrap a stream, such as a file or memory stream, to modify the data in some way as it goes in and out. Compressing or decompressing is an example of this - so I guess that GzipStream would be a subclass of FilterStream. Stewart.

But:

1. std.stream is scheduled for deprecation IIRC.
2. std.stdio.File is what's now idiomatic to use.
3. Streams in D should be based on input ranges, not whatever crufty old stuff std.stream is based on.
Mar 11 2011
parent reply dsimcha <dsimcha yahoo.com> writes:
On 3/14/2011 8:22 AM, Steven Schveighoffer wrote:
 BTW, that crufty old stuff
 probably way outperforms anything you could ever do with ranges as the
 base.

I don't get the concern about performance, for file I/O at least. Isn't the main bottleneck reading it off the disk platter?
Mar 14 2011
next sibling parent Daniel Gibson <metalcaedes gmail.com> writes:
Am 14.03.2011 14:17, schrieb dsimcha:
 On 3/14/2011 8:22 AM, Steven Schveighoffer wrote:
 BTW, that crufty old stuff
 probably way outperforms anything you could ever do with ranges as the
 base.

I don't get the concern about performance, for file I/O at least. Isn't the main bottleneck reading it off the disk platter?

SSDs, RAID, RAM-disks, 10GBit (and faster) networking, ...
Mar 14 2011
prev sibling parent dsimcha <dsimcha yahoo.com> writes:
== Quote from Steven Schveighoffer (schveiguy yahoo.com)'s article
 On Mon, 14 Mar 2011 09:17:04 -0400, dsimcha <dsimcha yahoo.com> wrote:
 On 3/14/2011 8:22 AM, Steven Schveighoffer wrote:
 BTW, that crufty old stuff
 probably way outperforms anything you could ever do with ranges as the
 base.

I don't get the concern about performance, for file I/O at least. Isn't the main bottleneck reading it off the disk platter?

With ranges, you likely have to copy things more than if you use a proper stream interface, or else make the interface very awkward. I don't know about you, but having to set the amount I want to read before I call front seems awkward. The library I'm writing optimizes the copying so there is very little copying from the buffer. If you look at Tango's I/O performance, it outperforms even C I/O, and uses a class/interface hierarchy w/ delegates for reading data. I think the range concept is good to paste on top of a buffered I/O stream, but not to use as the base. For example, byLine is a good example of an I/O range that would use a buffered I/O stream to do its work. See this message I posted a few months back:

 A couple replies later I outline how to do a byLine function (and easily a
 range) on top of such a
 stream.
 This is the basis for my current I/O library I'm writing.
 -Steve

Ok, makes sense. I sincerely hope your I/O library is good enough to get adopted, then, because Phobos is in **serious** need of better I/O functionality.
Mar 14 2011
prev sibling next sibling parent Jonathan M Davis <jmdavisProg gmx.com> writes:
On Friday, March 11, 2011 16:27:21 dsimcha wrote:
 On 3/11/2011 7:12 PM, Stewart Gordon wrote:
 On 10/03/2011 04:53, dsimcha wrote:
 <snip>
 
 I'd like to get some comments on what an appropriate API design and
 implementation for
 writing gzipped files would be. Two key requirements are that it must
 be as easy to use as
 std.stdio.File and it must be easy to extend to support other
 single-file compression
 formats like bz2.

You don't seem to get how std.stream works. The API is defined in the InputStream and OutputStream interfaces. Various classes implement these interfaces, generally through the Stream abstract class, to provide the functionality for a specific kind of stream. File is just one of these classes. Another is MemoryStream, to read from and write to a buffer in memory. A stream class used to work with gzipped files would be just another. Indeed, we have FilterStream, which is a base class for stream classes that wrap a stream, such as a file or memory stream, to modify the data in some way as it goes in and out. Compressing or decompressing is an example of this - so I guess that GzipStream would be a subclass of FilterStream. Stewart.

But: 1. std.stream is scheduled for deprecation IIRC.

Technically speaking, I think that it's intended to be scheduled for deprecation as opposed to actually being scheduled for deprecation, but whatever. It's going to be phased out as soon as we have a replacement.
 2.  std.stdio.File is what's now idiomatic to use.

Well, more like it's the only solution we have which will be sticking around. Once we have a new std.stream, it may be the preferred solution.
 3.  Streams in D should be based on input ranges, not whatever crufty
 old stuff std.stream is based on.

Indeed. But the new API still needs to be fleshed out and implemented before we actually have even a _proposed_ new std.stream, let alone actually have it.

- Jonathan M Davis
Mar 11 2011
prev sibling next sibling parent "Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Fri, 11 Mar 2011 19:27:21 -0500, dsimcha <dsimcha yahoo.com> wrote:

 3.  Streams in D should be based on input ranges, not whatever crufty  
 old stuff std.stream is based on.

No. I/O Ranges should be based on streams. A stream is a low level construct that can read and/or write data. In essence it is an abstraction of the capabilities of the OS.

BTW, that crufty old stuff probably way outperforms anything you could ever do with ranges as the base. The range interface simply isn't built to deal with I/O properly. For example, std.stdio.File is based on FILE *, which is an opaque stream interface.

There should probably be RangeStream which would wrap a range in a stream interface, if you want to go that route.

-Steve
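A rough sketch of what such a RangeStream might look like. Only the name comes from the post; the single read(ubyte[]) method is an assumed, hypothetical stream interface chosen for illustration and is not taken from std.stream or any proposed design.

```d
import std.range.primitives : ElementType, empty, front, isInputRange, popFront;

/// Hypothetical adapter: exposes an input range of bytes through a
/// stream-style read call that fills a caller-supplied buffer.
final class RangeStream(R)
if (isInputRange!R && is(ElementType!R : ubyte))
{
    private R source;
    this(R source) { this.source = source; }

    /// Copies up to buf.length bytes out of the range.
    /// Returns the number of bytes copied, 0 once the range is exhausted.
    size_t read(ubyte[] buf)
    {
        size_t n;
        while (n < buf.length && !source.empty)
        {
            buf[n++] = source.front;
            source.popFront();
        }
        return n;
    }
}

void main()
{
    ubyte[] data = [1, 2, 3, 4, 5];
    auto stream = new RangeStream!(ubyte[])(data);

    ubyte[3] buf;
    assert(stream.read(buf[]) == 3);
    assert(buf[] == cast(ubyte[]) [1, 2, 3]);
    assert(stream.read(buf[]) == 2);
    assert(buf[0 .. 2] == cast(ubyte[]) [4, 5]);
    assert(stream.read(buf[]) == 0);
}
```

Note that this direction copies one element at a time, which illustrates why a range makes a poor base for bulk I/O compared to a buffer-filling stream call.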
Mar 14 2011
prev sibling parent "Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Mon, 14 Mar 2011 09:17:04 -0400, dsimcha <dsimcha yahoo.com> wrote:

 On 3/14/2011 8:22 AM, Steven Schveighoffer wrote:
 BTW, that crufty old stuff
 probably way outperforms anything you could ever do with ranges as the
 base.

I don't get the concern about performance, for file I/O at least. Isn't the main bottleneck reading it off the disk platter?

That is solved by buffering, which would be done in either case.

With ranges, you likely have to copy things more than if you use a proper stream interface, or else make the interface very awkward. I don't know about you, but having to set the amount I want to read before I call front seems awkward.

The library I'm writing optimizes the copying so there is very little copying from the buffer. If you look at Tango's I/O performance, it outperforms even C I/O, and uses a class/interface hierarchy w/ delegates for reading data.

I think the range concept is good to paste on top of a buffered I/O stream, but not to use as the base. For example, byLine is a good example of an I/O range that would use a buffered I/O stream to do its work.

See this message I posted a few months back:

http://www.digitalmars.com/webnews/newsgroups.php?art_group=digitalmars.D&article_id=119400

A couple replies later I outline how to do a byLine function (and easily a range) on top of such a stream. This is the basis for my current I/O library I'm writing.

-Steve
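For illustration only, here is one way a by-line range could sit on top of a stream-style read call. This is not the design from the linked message or from the library mentioned above; LineReader and the size_t delegate(ubyte[]) signature are hypothetical, and the sketch rescans and appends to its buffer rather than minimizing copies.

```d
import std.algorithm.comparison : min;
import std.algorithm.searching : countUntil;

/// Hypothetical line range layered over any "fill this buffer" read call.
struct LineReader
{
    private size_t delegate(ubyte[]) readMore; // returns 0 at end of input
    private ubyte[] buffer;                    // bytes read but not yet consumed
    private const(char)[] line;
    private bool done;

    this(size_t delegate(ubyte[]) readMore)
    {
        this.readMore = readMore;
        popFront(); // load the first line
    }

    @property bool empty() const { return done; }
    @property const(char)[] front() const { return line; }

    void popFront()
    {
        ubyte[64] chunk;
        ptrdiff_t nl;
        // Keep reading until the buffer holds a full line or input ends.
        while ((nl = buffer.countUntil(cast(ubyte) '\n')) < 0)
        {
            immutable got = readMore(chunk[]);
            if (got == 0)
                break;
            buffer ~= chunk[0 .. got];
        }
        if (nl < 0)
        {
            // End of input: hand out a final unterminated line, if any.
            if (buffer.length == 0) { done = true; return; }
            line = cast(const(char)[]) buffer;
            buffer = null;
        }
        else
        {
            line = cast(const(char)[]) buffer[0 .. nl];
            buffer = buffer[nl + 1 .. $];
        }
    }
}

void main()
{
    // An in-memory source standing in for a buffered file stream.
    auto data = cast(const(ubyte)[]) "first\nsecond\nthird";
    size_t source(ubyte[] buf)
    {
        immutable n = min(buf.length, data.length);
        buf[0 .. n] = data[0 .. n];
        data = data[n .. $];
        return n;
    }

    string[] lines;
    foreach (l; LineReader(&source))
        lines ~= l.idup;
    assert(lines == ["first", "second", "third"]);
}
```

A real implementation would presumably hand out slices of the stream's own buffer instead of appending to a second one, which is where the copying savings described above would come from.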
Mar 14 2011
prev sibling parent dsimcha <dsimcha yahoo.com> writes:
Since it seems like the consensus is that streaming gzip support belongs 
in a stream API, I guess we have yet another reason to get busy with the 
stream API.  However, I'm wondering if std.file should support gzip and, 
if license issues can be overcome, bzip2.

I'd love to be able to write code like this:

// Read and transparently decompress foo.txt, which is UTF-8 encoded.
auto foo = cast(string) gzippedRead("foo.txt.gz");

// Write a buffer to a gzipped file.
gzippedWrite("foo.txt.gz", buf);

This stuff would be trivial to implement in std.file and, IMHO, belongs 
there.  What's the consensus on whether it belongs?
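For what it's worth, a rough sketch of how the two helpers could be put together from pieces that already exist, assuming a reasonably recent Phobos where std.zlib's Compress and UnCompress accept HeaderFormat.gzip. The names gzippedRead and gzippedWrite are the hypothetical ones from the example above, and the signatures are guesses modelled on std.file.read and std.file.write; nothing like this is actually in std.file.

```d
import std.file : read, remove, tempDir, write;
import std.path : buildPath;
import std.zlib : Compress, HeaderFormat, UnCompress;

/// Hypothetical helper: reads a whole gzip file and returns its
/// decompressed contents.
void[] gzippedRead(string path)
{
    auto inflater = new UnCompress(HeaderFormat.gzip);
    // The buffers returned by UnCompress are freshly allocated, so
    // treating them as mutable bytes here is safe.
    auto result = cast(ubyte[]) inflater.uncompress(read(path));
    result ~= cast(ubyte[]) inflater.flush();
    return result;
}

/// Hypothetical helper: gzip-compresses data and writes it to path.
void gzippedWrite(string path, const(void)[] data)
{
    auto deflater = new Compress(HeaderFormat.gzip);
    auto packed = cast(const(ubyte)[]) deflater.compress(data);
    packed ~= cast(const(ubyte)[]) deflater.flush();
    write(path, packed);
}

void main()
{
    // Round-trip through a temporary file (the file name is arbitrary).
    auto path = buildPath(tempDir, "gzip_sketch_demo.txt.gz");
    scope (exit) remove(path);

    gzippedWrite(path, "hello, gzip\n");
    auto text = cast(string) gzippedRead(path);
    assert(text == "hello, gzip\n");
}
```

Both helpers hold the entire file in memory, which matches the whole-file style of std.file but is exactly what a streaming interface would avoid for large inputs.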
Mar 12 2011