digitalmars.D - GZip File Reading

dsimcha <dsimcha yahoo.com> writes:

I noticed last night that Phobos actually has all the machinery 
required for reading gzipped files, buried in etc.c.zlib.  I've wanted a 
high-level D interface for reading and writing compressed files with an 
API similar to "normal" file I/O for a while.  I'm thinking about what 
the easiest/best design would be.  At a high level there are two designs:

1.  Hack std.stdio.File to support gzipped formats.  This would allow an 
identical interface for "normal" and compressed I/O.  It would also 
allow reuse of things like ByLine.  However, it would require major 
refactoring of File to decouple it from the C file I/O routines so that 
it could call either the C or GZip ones depending on how it's 
configured.  Probably, it would make sense to make an interface that 
wraps I/O functions and make an instance for C and one for gzip, with 
bzip2 and other goodies possibly being added later.

2.  Write something completely separate.  This would keep std.stdio.File 
doing one thing well (wrapping C file I/O) but would be more of a PITA 
for the user and possibly result in code duplication.

I'd like to get some comments on what an appropriate API design and 
implementation for writing gzipped files would be.  Two key requirements 
are that it must be as easy to use as std.stdio.File and it must be easy 
to extend to support other single-file compression formats like bz2.
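
For illustration, design 1 might look roughly like the sketch below. It is only
a sketch under stated assumptions: ByteSource, PlainSource, GzipSource and
countBytes are hypothetical names that do not exist in Phobos, and the gzip
backend simply calls the gzopen/gzread/gzclose bindings from etc.c.zlib.

```d
import etc.c.zlib : gzFile, gzopen, gzread, gzclose;
import std.exception : enforce;
import std.stdio : File, writeln;
import std.string : toStringz;

/// Hypothetical abstraction (not part of Phobos): the single read primitive
/// that a refactored File would call instead of the C routines directly.
interface ByteSource
{
    /// Fills as much of `buf` as it can and returns the filled slice.
    /// An empty slice means end of input.
    ubyte[] read(ubyte[] buf);
    void close();
}

/// Backend for "normal" files, going through C stdio via std.stdio.File.
final class PlainSource : ByteSource
{
    private File file;
    this(string path) { file = File(path, "rb"); }
    // rawRead rejects empty buffers, so hand those back untouched.
    ubyte[] read(ubyte[] buf) { return buf.length ? file.rawRead(buf) : buf; }
    void close() { file.close(); }
}

/// Backend for gzip files, going through the zlib gz* functions.
final class GzipSource : ByteSource
{
    private gzFile handle;
    this(string path)
    {
        handle = gzopen(path.toStringz, "rb");
        enforce(handle !is null, "cannot open " ~ path);
    }
    ubyte[] read(ubyte[] buf)
    {
        immutable n = gzread(handle, buf.ptr, cast(uint) buf.length);
        enforce(n >= 0, "gzread failed");
        return buf[0 .. n];
    }
    void close()
    {
        if (handle !is null) { gzclose(handle); handle = null; }
    }
}

/// Code written against ByteSource does not care which backend it gets.
size_t countBytes(ByteSource src)
{
    auto buf = new ubyte[](64 * 1024);
    size_t total;
    for (auto got = src.read(buf); got.length; got = src.read(buf))
        total += got.length;
    src.close();
    return total;
}

void main(string[] args)
{
    import std.algorithm.searching : endsWith;
    foreach (path; args[1 .. $])
    {
        ByteSource src;
        if (path.endsWith(".gz"))
            src = new GzipSource(path);
        else
            src = new PlainSource(path);
        writeln(path, ": ", countBytes(src), " bytes after decoding");
    }
}
```

Picking the backend by file extension is just a shortcut for the example; a
real design would more likely let the caller choose, or sniff the gzip header.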

Mar 09 2011

Daniel Gibson <metalcaedes gmail.com> writes:

On 10.03.2011 05:53, dsimcha wrote:
 I'd like to get some comments on what an appropriate API design and
 implementation for writing gzipped files would be. Two key requirements
 are that it must be as easy to use as std.stdio.File and it must be easy
 to extend to support other single-file compression formats like bz2.

Maybe a proper stream API would help. It could provide ByLine etc, could 
be used for any kind of compression format (as long as an appropriate 
input-stream is provided), ...
(analogous for writing)

Cheers,
- Daniel

Mar 09 2011

Jonathan M Davis <jmdavisProg gmx.com> writes:

On Wednesday 09 March 2011 21:10:59 Daniel Gibson wrote:
 Maybe a proper stream API would help. It could provide ByLine etc, could
 be used for any kind of compression format (as long as an appropriate
 input-stream is provided), ...
 (analogous for writing)

That was my thought. We really need proper streams...

The other potential issue with compressed files is that they can contain 
directories and such. A gzipped/bzipped file is not necessarily a file that you 
can read, even once it's been uncompressed. That may or may not matter for this 
particular application of them, but it is something to be aware of.

- Jonathan M Davis

Mar 09 2011

"Lars T. Kyllingstad" <public kyllingen.NOSPAMnet> writes:

On Wed, 09 Mar 2011 21:34:29 -0800, Jonathan M Davis wrote:

 The other potential issue with compressed files is that they can contain
 directories and such.

Not gzip and bzip2 compressed files.  They only contain a single file.

-Lars

Mar 10 2011

Jonathan M Davis <jmdavisProg gmx.com> writes:

On Thursday 10 March 2011 00:15:34 Lars T. Kyllingstad wrote:
 Not gzip and bzip2 compressed files.  They only contain a single file.

Ah. True. I'm too used to always using tar with them. ;)

Actually, the fact that they're that way makes them _way_ more pleasant to deal 
with programmatically than zip...

- Jonathan M Davis

Mar 10 2011

spir <denis.spir gmail.com> writes:

On 03/10/2011 09:15 AM, Lars T. Kyllingstad wrote:
 Not gzip and bzip2 compressed files.  They only contain a single file.

Yop, but the underlying 'file' is a tar-ed pack of files...

Denis
-- 
_________________
vita es estrany
spir.wikidot.com

Mar 10 2011

Walter Bright <newshound2 digitalmars.com> writes:

On 3/9/2011 8:53 PM, dsimcha wrote:
 I'd like to get some comments on what an appropriate API design and
 implementation for writing gzipped files would be. Two key requirements are
 that it must be as easy to use as std.stdio.File and it must be easy to extend
 to support other single-file compression formats like bz2.

Use ranges.

Mar 10 2011

dsimcha <dsimcha yahoo.com> writes:

On 3/10/2011 4:59 AM, Walter Bright wrote:
 Use ranges.

Ok, obviously.  The point was trying to figure out how to maximize the 
reuse of the infrastructure from std.stdio.File.

Mar 10 2011

Walter Bright <newshound2 digitalmars.com> writes:

On 3/10/2011 6:24 AM, dsimcha wrote:
 Ok, obviously. The point was trying to figure out how to maximize the reuse of
 the infrastructure from std.stdio.File.

It's not so obvious based on my reading of the other comments. For example, we 
should not be inventing a streaming interface.

Mar 10 2011

dsimcha <dsimcha yahoo.com> writes:

On 3/10/2011 8:29 PM, Walter Bright wrote:
 It's not so obvious based on my reading of the other comments. For
 example, we should not be inventing a streaming interface.

Ok, I see what you're saying.  I was making the assumption that the 
streaming interface would be based on ranges, and it was more a matter 
of working out other details, like what decorators to provide.

Mar 10 2011

"Steven Schveighoffer" <schveiguy yahoo.com> writes:

On Thu, 10 Mar 2011 20:29:55 -0500, Walter Bright  
<newshound2 digitalmars.com> wrote:

 It's not so obvious based on my reading of the other comments. For  
 example, we should not be inventing a streaming interface.

C's FILE * interface is too limiting/low performing.  I'm working to  
create a streaming interface to replace it, and then we can compare the  
differences.  I think it's pretty obvious from Tango's I/O performance  
that a D-based stream interface is a better approach.

Ranges should be built on top of that interface.

I won't continue the debate, since it's difficult to argue from a position  
of theory.  However, I don't think it will be long before I can show some  
real numbers.  I'm not expecting Phobos to adopt it, based on my experience  
with dcollections, but it should be seamlessly usable with Phobos,  
especially since range-based functions are templated.

-Steve

Mar 11 2011

dsimcha <dsimcha yahoo.com> writes:

On 3/11/2011 8:04 AM, Steven Schveighoffer wrote:
 C's FILE * interface is too limiting/low performing. I'm working to
 create a streaming interface to replace it, and then we can compare the
 differences. I think it's pretty obvious from Tango's I/O performance
 that a D-based stream interface is a better approach.

 Ranges should be built on top of that interface.

 I won't continue the debate, since it's difficult to argue from a
 position of theory. However, I don't think it will be long before I can
 show some real numbers. I'm not expecting Phobos to adopt, based on my
 experience with dcollections, but it should be seamlessly usable with
 Phobos, especially since range-based functions are templated.

 -Steve

Well, I certainly appreciate your efforts.  IMHO the current state of 
file I/O for anything but uncompressed plain text in D is pretty sad. 
Even uncompressed plain text is pretty bad on Windows due to various 
bugs.  IMHO one huge improvement that could be made to Phobos would be 
to create modules for reading the most common file formats (my personal 
list would be gzip, bzip2, png, bmp, jpeg and csv) with a nice 
high-level D interface.

Mar 11 2011

Russel Winder <russel russel.org.uk> writes:

On Wed, 2011-03-09 at 23:53 -0500, dsimcha wrote:
 I noticed last night that Phobos actually has all the machinations
 required for reading gzipped files, buried in etc.c.zlib.  I've wanted a
 high-level D interface for reading and writing compressed files with an
 API similar to "normal" file I/O for a while.  I'm thinking about what
 the easiest/best design would be.  At a high level there are two designs:

But isn't a gzip (or zip, 7z, bzip2, etc., etc.) file actually a
container:  a tree of files.  So isn't it more a persistent data
structure that has a rendering as a single flat file on the filestore,
than being a partitioned flat file which is what you will end up with if
you head directly down the file/stream route?

--
Russel.
=============================================================================
Dr Russel Winder      t: +44 20 7585 2200   voip: sip:russel.winder ekiga.net
41 Buckmaster Road    m: +44 7770 465 077   xmpp: russel russel.org.uk
London SW11 1EN, UK   w: www.russel.org.uk  skype: russel_winder

Mar 10 2011

"Lars T. Kyllingstad" <public kyllingen.NOSPAMnet> writes:

On Thu, 10 Mar 2011 10:17:17 +0000, Russel Winder wrote:

 But isn't a gzip (or zip, 7z, bzip2, etc., etc.) file actually a
 container:  a tree of files.  So isn't it more a persistent data
 structure that has a rendering as a single flat file on the filestore,
 than being a partitioned flat file which is what you will end up with if
 you head directly down the file/stream route?

Nope, a gzip or bzip2 file only contains a single file.  To zip several 
files, you first make a tar archive, and then you run gzip or bzip2 on 
it.  That's why most compressed archives targeted at the Linux platform 
have extensions like .tar.gz, .tar.bz2, and so on.

-Lars

Mar 10 2011

Russel Winder <russel russel.org.uk> writes:

On Thu, 2011-03-10 at 10:57 +0000, Lars T. Kyllingstad wrote:
[ . . . ]
 Nope, a gzip or bzip2 file only contains a single file.  To zip several
 files, you first make a tar archive, and then you run gzip or bzip2 on
 it.  That's why most compressed archives targeted at the Linux platform
 have extensions like .tar.gz, .tar.bz2, and so on.

Obviously ;-)

I confused myself thinking of files with extension tgz.  Zip, Gzip so
similar, so different.

Sorry for the noise.  Everyone should go back to thinking of a
transforming stream architecture for this problem.
--
Russel.

Mar 10 2011

dsimcha <dsimcha yahoo.com> writes:

On 3/10/2011 5:57 AM, Lars T. Kyllingstad wrote:
 Nope, a gzip or bzip2 file only contains a single file.  To zip several
 files, you first make a tar archive, and then you run gzip or bzip2 on
 it.  That's why most compressed archives targeted at the Linux platform
 have extensions like .tar.gz, .tar.bz2, and so on.

 -Lars

This is **exactly** my point.  These single-file gzip and bzip2 files 
should be usable with exactly the same API as uncompressed file I/O.  My 
personal use case for this is files that contain large amounts of DNA 
sequence.  This compresses very well, since besides a little meta-info 
it's just a bunch of A's, C's, G's and T's.  I want to be able to read 
in these huge files and decompress them transparently on the fly.

Another example (and the one that brought the subject of these 
non-tarred gzips to my attention) is the svgz format.  This is an image 
format, and is literally just a gzipped SVG.  Uncompressed SVG is a 
ridiculously bloated format but compresses very well, so the SVG 
standard requires that gzipped SVG files "just work" transparently with 
any SVG-compliant program.  I recently added svgz support to plot2kill, 
and it was somewhat of a PITA because I had to find the C API buried in 
etc.c.zlib and then I got stuck using it instead of a nice D API.

The bigger point, though, is that use cases for non-tarred single-file 
gzips do exist and they should be handled transparently via an interface 
identical to normal file I/O.

Mar 10 2011

"Lars T. Kyllingstad" <public kyllingen.NOSPAMnet> writes:

On Thu, 10 Mar 2011 09:20:34 -0500, dsimcha wrote:

 The bigger point, though, is that use cases for non-tarred single-file
 gzips do exist and they should be handled transparently via an interface
 identical to normal file I/O.

Although I agree this would be nice, I don't think std.stdio.File is the 
right place to put it.  I think a general streaming framework should be 
in place first, and File be made to work with it.  Then, working with a 
gzipped/bzipped file should be as simple as wrapping the raw File stream 
in a compression/decompression stream.
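
As a rough illustration of that wrapping, here is a minimal sketch that assumes
a current Phobos in which std.zlib.UnCompress accepts HeaderFormat.gzip. The
helper name gunzipTo and its sink delegate are hypothetical; only File.byChunk,
UnCompress.uncompress and UnCompress.flush are existing library calls.

```d
import std.stdio : File, stdout;
import std.zlib : HeaderFormat, UnCompress;

/// Hypothetical helper: reads `path` in fixed-size chunks, inflates each
/// chunk and passes every decompressed piece to `sink`.
void gunzipTo(string path, scope void delegate(const(ubyte)[]) sink,
              size_t chunkSize = 64 * 1024)
{
    auto inflater = new UnCompress(HeaderFormat.gzip);

    // byChunk reuses its buffer, so each chunk is inflated before the next read.
    foreach (ubyte[] chunk; File(path, "rb").byChunk(chunkSize))
    {
        auto piece = cast(const(ubyte)[]) inflater.uncompress(chunk);
        if (piece.length)
            sink(piece);
    }

    // Collect whatever the decompressor still holds at end of input.
    auto rest = cast(const(ubyte)[]) inflater.flush();
    if (rest.length)
        sink(rest);
}

void main(string[] args)
{
    // Behaves roughly like `gunzip -c file.gz`.
    foreach (path; args[1 .. $])
        gunzipTo(path, (const(ubyte)[] piece) { stdout.rawWrite(piece); });
}
```

The raw File knows nothing about compression here; the decompression step is a
separate layer on top of the chunks it delivers.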

-Lars

Mar 10 2011

dsimcha <dsimcha yahoo.com> writes:

On 3/10/2011 9:45 AM, Lars T. Kyllingstad wrote:
 Although I agree this would be nice, I don't think std.stdio.File is the
 right place to put it.  I think a general streaming framework should be
 in place first, and File be made to work with it.  Then, working with a
 gzipped/bzipped file should be as simple as wrapping the raw File stream
 in a compression/decompression stream.

 -Lars

Ok, this seems to be the general consensus based on the replies I've 
gotten.  Unfortunately, I have neither the time, the desire, nor the 
knowledge to design and implement a full-fledged stream API, whereas I 
have enough of all three of these to bolt gzip support onto std.stdio. 
I guess I'll solve my specific use case at a higher level by wrapping 
the C gzip stuff for my DNA sequence reader class, and let someone who 
knows something about good stream design solve the more general problem.

Mar 10 2011

Jonathan M Davis <jmdavisProg gmx.com> writes:

On Thursday 10 March 2011 07:14:32 dsimcha wrote:
 Ok, this seems to be the general consensus based on the replies I've
 gotten.  Unfortunately, I have neither the time, the desire, nor the
 knowledge to design and implement a full-fledged stream API, whereas I
 have enough of all three of these to bolt gzip support onto std.stdio.
 I guess I'll solve my specific use case at a higher level by wrapping
 the C gzip stuff for my DNA sequence reader class, and let someone who
 knows something about good stream design solve the more general problem.

There _have_ been some threads on designing a new stream API, and there are some 
preliminary designs, but as far as I know, they have yet to really go anywhere. 
I'm unaware of the stream API really having a champion per se. Andrei has done 
some preliminary design work, but I don't know if he intends to actually 
implement anything, and as far as I know, no one else has volunteered. So, a new 
std.stream is one of those things that we all agree on that we want but which 
hasn't happened yet, because no one has stepped up to do it.

And I do agree that the ability to deal with a compressed file should be part of 
the stream API (probably as some sort of adapter/wrapper which uncompresses the 
stream as you iterate through it). But the stream API needs to be designed and 
_implemented_ before we'll have that.

- Jonathan M Davis

Mar 10 2011

dsimcha <dsimcha yahoo.com> writes:

== Quote from Jonathan M Davis (jmdavisProg gmx.com)'s article
 There _have_ been some threads on designing a new stream API, and there
 are some preliminary designs, but as far as I know, they have yet to
 really go anywhere. I'm unaware of the stream API really having a
 champion per se. Andrei has done some preliminary design work, but I
 don't know if he intends to actually implement anything, and as far as I
 know, no one else has volunteered. So, a new std.stream is one of those
 things that we all agree on that we want but which hasn't happened yet,
 because no one has stepped up to do it.
 And I do agree that the ability to deal with a compressed file should be
 part of the stream API (probably as some sort of adapter/wrapper which
 uncompresses the stream as you iterate through it). But the stream API
 needs to be designed and _implemented_ before we'll have that.
 - Jonathan M Davis

So I guess in such a design we would still have things like a decorator to
iterate through a stream by chunk, by line, etc., with a range interface?

Mar 10 2011

Jonathan M Davis <jmdavisProg gmx.com> writes:

On Thursday, March 10, 2011 09:16:35 dsimcha wrote:
 == Quote from Jonathan M Davis (jmdavisProg gmx.com)'s article
 
 There _have_ been some threads on designing a new stream API, and there
 are some preliminary designs, but as far as I know, they have yet to
 really go anywhere. I'm unaware of the stream API really having a
 champion per se. Andrei has done some preliminary design work, but I
 don't know if he intends to actually implement anything, and as far as I
 know, no one else has volunteered. So, a new std.stream is one of those
 things that we all agree on that we want but which hasn't happened yet,
 because no one has stepped up to do it.
 And I do agree that the ability to deal with a compressed file should be
 part of the stream API (probably as some sort of adapter/wrapper which
 uncompresses the stream as you iterate through it). But the stream API
 needs to be designed and _implemented_ before we'll have that.
 - Jonathan M Davis

 
 So I guess in such a design we would still have things like a decorator to
 iterate through a stream by chunk, by line, etc., with a range interface?

I don't remember exactly what it's going to look like. IIRC, streams are the
reason that Andrei was looking at a range API that gave you a T[] instead of
T. Whatever you were doing would be built on top of that. So, grabbing it by
line or whatever. But I would fully expect that you would be able to put a
wrapper range in there which took the stream of bytes, treated them as if they
were gzipped or bzipped or whatever, and gave you bytes (or chars or whatever
was appropriate) as if you were reading from an uncompressed version of the
file. So, the reading would be identical whether the file was compressed or
not. It's just that in the case of a compressed stream/file, you'd have a
decorator/wrapper which uncompressed it for you.

- Jonathan M Davis

Mar 10 2011
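
A minimal sketch of the kind of wrapper range described above, assuming
current Phobos, where std.zlib provides UnCompress and HeaderFormat.gzip. The
names GunzipChunks and gunzipChunks are hypothetical and are not part of
Phobos or of any proposed stream API. The wrapper takes any input range of
compressed chunks (File.byChunk, for instance) and yields decompressed chunks:

```d
import std.range.primitives : isInputRange;
import std.zlib : HeaderFormat, UnCompress;

/// Hypothetical adapter: an input range of decompressed chunks layered on
/// top of an input range of gzip-compressed chunks.
struct GunzipChunks(R) if (isInputRange!R)
{
    private R source;
    private UnCompress decomp;
    private const(ubyte)[] current; // decompressed bytes exposed by front
    private bool flushed;           // flush() may only be called once
    private bool done;

    this(R source)
    {
        this.source = source;
        decomp = new UnCompress(HeaderFormat.gzip);
        popFront(); // prime the first decompressed chunk
    }

    @property bool empty() { return done; }
    @property const(ubyte)[] front() { return current; }

    void popFront()
    {
        current = null;
        // Feed compressed chunks until the decompressor produces output.
        while (current.length == 0 && !source.empty)
        {
            current = cast(const(ubyte)[]) decomp.uncompress(source.front);
            source.popFront();
        }
        // Input exhausted: collect whatever the decompressor still holds.
        if (current.length == 0 && !flushed)
        {
            current = cast(const(ubyte)[]) decomp.flush();
            flushed = true;
        }
        done = current.length == 0;
    }
}

/// Convenience constructor (hypothetical name).
auto gunzipChunks(R)(R source) if (isInputRange!R)
{
    return GunzipChunks!R(source);
}
```

With this in place, something like
foreach (chunk; gunzipChunks(File("foo.txt.gz").byChunk(4096))) would read the
same way as iterating over an uncompressed file by chunk. A by-line layer
could be stacked on top in the same fashion.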

Stewart Gordon <smjg_1998 yahoo.com> writes:

On 10/03/2011 04:53, dsimcha wrote:
<snip>
 I'd like to get some comments on what an appropriate API design and
 implementation for writing gzipped files would be. Two key requirements are
 that it must be as easy to use as std.stdio.File and it must be easy to
 extend to support other single-file compression formats like bz2.

You don't seem to get how std.stream works.

The API is defined in the InputStream and OutputStream interfaces.  Various
classes implement this interface, generally through the Stream abstract class,
to provide the functionality for a specific kind of stream.  File is just one
of these classes.  Another is MemoryStream, to read to and write from a buffer
in memory.  A stream class used to work with gzipped files would be just
another.

Indeed, we have FilterStream, which is a base class for stream classes that
wrap a stream, such as a file or memory stream, to modify the data in some way
as it goes in and out.  Compressing or decompressing is an example of this - so
I guess that GzipStream would be a subclass of FilterStream.

Stewart.

Mar 11 2011
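
The filter pattern described above can be sketched without std.stream itself,
which is no longer shipped with Phobos. Everything below is an assumption made
for illustration: ByteSource, MemorySource and GunzipSource are hypothetical
stand-ins for InputStream, MemoryStream and a FilterStream subclass, and only
the std.zlib calls are real Phobos API.

```d
import std.zlib : HeaderFormat, UnCompress;

/// Hypothetical minimal stand-in for an input stream interface.
interface ByteSource
{
    /// Copies up to buf.length bytes into buf; returns the count, 0 at end.
    size_t read(ubyte[] buf);
}

/// Serves bytes from memory (the MemoryStream role).
class MemorySource : ByteSource
{
    private const(ubyte)[] data;

    this(const(ubyte)[] data) { this.data = data; }

    size_t read(ubyte[] buf)
    {
        immutable n = buf.length < data.length ? buf.length : data.length;
        buf[0 .. n] = data[0 .. n];
        data = data[n .. $];
        return n;
    }
}

/// Wraps another source and inflates gzip data on the fly (the filter role).
class GunzipSource : ByteSource
{
    private ByteSource inner;
    private UnCompress decomp;
    private const(ubyte)[] pending; // decompressed bytes not yet handed out
    private bool finished;

    this(ByteSource inner)
    {
        this.inner = inner;
        decomp = new UnCompress(HeaderFormat.gzip);
    }

    size_t read(ubyte[] buf)
    {
        ubyte[4096] raw;
        // Refill from the wrapped source until there is something to return.
        while (pending.length == 0 && !finished)
        {
            immutable got = inner.read(raw[]);
            if (got == 0)
            {
                pending = cast(const(ubyte)[]) decomp.flush();
                finished = true;
            }
            else
                pending = cast(const(ubyte)[]) decomp.uncompress(raw[0 .. got]);
        }
        immutable n = buf.length < pending.length ? buf.length : pending.length;
        buf[0 .. n] = pending[0 .. n];
        pending = pending[n .. $];
        return n;
    }
}
```

The caller only ever sees ByteSource, so code written against a plain file or
memory source works unchanged when a GunzipSource is slotted in front of it.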

dsimcha <dsimcha yahoo.com> writes:

On 3/11/2011 7:12 PM, Stewart Gordon wrote:
 On 10/03/2011 04:53, dsimcha wrote:
 <snip>
 I'd like to get some comments on what an appropriate API design and
 implementation for
 writing gzipped files would be. Two key requirements are that it must
 be as easy to use as
 std.stdio.File and it must be easy to extend to support other
 single-file compression
 formats like bz2.

 You don't seem to get how std.stream works.

 The API is defined in the InputStream and OutputStream interfaces.
 Various classes implement this interface, generally through the Stream
 abstract class, to provide the functionality for a specific kind of
 stream. File is just one of these classes. Another is MemoryStream, to
 read to and write from a buffer in memory. A stream class used to work
 with gzipped files would be just another.

 Indeed, we have FilterStream, which is a base class for stream classes
 that wrap a stream, such as a file or memory stream, to modify the data
 in some way as it goes in and out. Compressing or decompressing is an
 example of this - so I guess that GzipStream would be a subclass of
 FilterStream.

 Stewart.

But:

1.  std.stream is scheduled for deprecation IIRC.

2.  std.stdio.File is what's now idiomatic to use.

3.  Streams in D should be based on input ranges, not whatever crufty 
old stuff std.stream is based on.

Mar 11 2011

Jonathan M Davis <jmdavisProg gmx.com> writes:

On Friday, March 11, 2011 16:27:21 dsimcha wrote:
 On 3/11/2011 7:12 PM, Stewart Gordon wrote:
 On 10/03/2011 04:53, dsimcha wrote:
 <snip>
 
 I'd like to get some comments on what an appropriate API design and
 implementation for
 writing gzipped files would be. Two key requirements are that it must
 be as easy to use as
 std.stdio.File and it must be easy to extend to support other
 single-file compression
 formats like bz2.

 
 You don't seem to get how std.stream works.
 
 The API is defined in the InputStream and OutputStream interfaces.
 Various classes implement this interface, generally through the Stream
 abstract class, to provide the functionality for a specific kind of
 stream. File is just one of these classes. Another is MemoryStream, to
 read to and write from a buffer in memory. A stream class used to work
 with gzipped files would be just another.
 
 Indeed, we have FilterStream, which is a base class for stream classes
 that wrap a stream, such as a file or memory stream, to modify the data
 in some way as it goes in and out. Compressing or decompressing is an
 example of this - so I guess that GzipStream would be a subclass of
 FilterStream.
 
 Stewart.

 
 But:
 
 1.  std.stream is scheduled for deprecation IIRC.

Technically speaking, I think that it's intended to be scheduled for
deprecation as opposed to actually being scheduled for deprecation, but
whatever. It's going to be phased out as soon as we have a replacement.

 2.  std.stdio.File is what's now idiomatic to use.

Well, more like it's the only solution we have which will be sticking around. 
Once we have a new std.stream, it may be the preferred solution.

 3.  Streams in D should be based on input ranges, not whatever crufty
 old stuff std.stream is based on.

Indeed. But the new API still needs to be fleshed out and implemented before we 
actually have even a _proposed_ new std.stream, let alone actually have it.

- Jonathan M Davis

Mar 11 2011

"Steven Schveighoffer" <schveiguy yahoo.com> writes:

On Fri, 11 Mar 2011 19:27:21 -0500, dsimcha <dsimcha yahoo.com> wrote:

 3.  Streams in D should be based on input ranges, not whatever crufty  
 old stuff std.stream is based on.

No.  I/O Ranges should be based on streams.  A stream is a low level  
construct that can read and/or write data.  In essence it is an  
abstraction of the capabilities of the OS.  BTW, that crufty old stuff  
probably way outperforms anything you could ever do with ranges as the  
base.  The range interface simply isn't built to deal with I/O properly.

For example, std.stdio.File is based on FILE *, which is an opaque stream  
interface.

There should probably be RangeStream which would wrap a range in a stream  
interface, if you want to go that route.

-Steve

Mar 14 2011

dsimcha <dsimcha yahoo.com> writes:

On 3/14/2011 8:22 AM, Steven Schveighoffer wrote:
 BTW, that crufty old stuff
 probably way outperforms anything you could ever do with ranges as the
 base.

I don't get the concern about performance, for file I/O at least.  Isn't 
the main bottleneck reading it off the disk platter?

Mar 14 2011

Daniel Gibson <metalcaedes gmail.com> writes:

Am 14.03.2011 14:17, schrieb dsimcha:
 On 3/14/2011 8:22 AM, Steven Schveighoffer wrote:
 BTW, that crufty old stuff
 probably way outperforms anything you could ever do with ranges as the
 base.

 I don't get the concern about performance, for file I/O at least. Isn't
 the main bottleneck reading it off the disk platter?

SSDs, RAID, RAM-disks, 10GBit (and faster) networking, ...

Mar 14 2011

"Steven Schveighoffer" <schveiguy yahoo.com> writes:

On Mon, 14 Mar 2011 09:17:04 -0400, dsimcha <dsimcha yahoo.com> wrote:

 On 3/14/2011 8:22 AM, Steven Schveighoffer wrote:
 BTW, that crufty old stuff
 probably way outperforms anything you could ever do with ranges as the
 base.

 I don't get the concern about performance, for file I/O at least.  Isn't  
 the main bottleneck reading it off the disk platter?

That is solved by buffering, which would be done in either case.

With ranges, you likely have to copy things more than if you use a proper
stream interface, or else make the interface very awkward.  I don't know
about you, but having to set the amount I want to read before I call front
seems awkward.  The library I'm writing optimizes the copying so there is
very little copying from the buffer.

If you look at Tango's I/O performance, it outperforms even C I/O, and
uses a class/interface hierarchy w/ delegates for reading data.

I think the range concept is good to paste on top of a buffered I/O
stream, but not to use as the base.  For example, byLine is a good example
of an I/O range that would use a buffered I/O stream to do its work.

See this message I posted a few months back:

http://www.digitalmars.com/webnews/newsgroups.php?art_group=digitalmars.D&article_id=119400

A couple replies later I outline how to do a byLine function (and easily a
range) on top of such a stream.

This is the basis for my current I/O library I'm writing.

-Steve

Mar 14 2011
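
As an illustration of a range pasted on top of a stream rather than the other
way round, here is a minimal sketch. It is not the library mentioned above,
whose interface is not shown in this thread. It assumes only that the
underlying stream type has a method size_t read(ubyte[] buf) that returns 0 at
end of input. LineRange, byLineOver and MemorySource are hypothetical names.

```d
import std.algorithm.searching : countUntil;

/// Hypothetical in-memory stream used to keep the example self-contained.
struct MemorySource
{
    const(ubyte)[] data;

    size_t read(ubyte[] buf)
    {
        immutable n = buf.length < data.length ? buf.length : data.length;
        buf[0 .. n] = data[0 .. n];
        data = data[n .. $];
        return n;
    }
}

/// Input range of lines layered on any S with `size_t read(ubyte[] buf)`.
/// As with File.byLine, front is only valid until the next popFront.
struct LineRange(S)
{
    private S stream;
    private ubyte[] buffer;      // bytes read from the stream, not yet consumed
    private const(char)[] line;
    private bool done;

    this(S stream)
    {
        this.stream = stream;
        popFront(); // prime the first line
    }

    @property bool empty() { return done; }
    @property const(char)[] front() { return line; }

    void popFront()
    {
        while (true)
        {
            immutable nl = buffer.countUntil(cast(ubyte) '\n');
            if (nl >= 0)
            {
                line = cast(const(char)[]) buffer[0 .. nl];
                buffer = buffer[nl + 1 .. $];
                return;
            }
            ubyte[4096] tmp;
            immutable got = stream.read(tmp[]);
            if (got == 0)
            {
                // End of input: emit a final unterminated line, if any.
                done = buffer.length == 0;
                line = cast(const(char)[]) buffer;
                buffer = null;
                return;
            }
            buffer ~= tmp[0 .. got];
        }
    }
}

/// Convenience constructor (hypothetical name).
auto byLineOver(S)(S stream)
{
    return LineRange!S(stream);
}
```

The stream does the reading and the range only decides where a line ends. This
simple version still copies each block into its own buffer, which is exactly
the cost a carefully designed buffered stream would try to avoid.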

dsimcha <dsimcha yahoo.com> writes:

== Quote from Steven Schveighoffer (schveiguy yahoo.com)'s article
 On Mon, 14 Mar 2011 09:17:04 -0400, dsimcha <dsimcha yahoo.com> wrote:
 On 3/14/2011 8:22 AM, Steven Schveighoffer wrote:
 BTW, that crufty old stuff
 probably way outperforms anything you could ever do with ranges as the
 base.

 I don't get the concern about performance, for file I/O at least. Isn't
 the main bottleneck reading it off the disk platter?

 That is solved by buffering, which would be done in either case.
 With ranges, you likely have to copy things more than if you use a proper
 stream interface, or else make the interface very awkward. I don't know
 about you, but having to set the amount I want to read before I call front
 seems awkward. The library I'm writing optimizes the copying so there is
 very little copying from the buffer.
 If you look at Tango's I/O performance, it outperforms even C I/O, and
 uses a class/interface hierarchy w/ delegates for reading data.
 I think the range concept is good to paste on top of a buffered I/O
 stream, but not to use as the base. For example, byLine is a good example
 of an I/O range that would use a buffered I/O stream to do its work.
 See this message I posted a few months back:

 http://www.digitalmars.com/webnews/newsgroups.php?art_group=digitalmars.D&article_id=119400
 A couple replies later I outline how to do a byLine function (and easily a
 range) on top of such a stream.
 This is the basis for my current I/O library I'm writing.
 -Steve

Ok, makes sense. I sincerely hope your I/O library is good enough to get
adopted, then, because Phobos is in **serious** need of better I/O
functionality.

Mar 14 2011

dsimcha <dsimcha yahoo.com> writes:

Since it seems like the consensus is that streaming gzip support belongs 
in a stream API, I guess we have yet another reason to get busy with the 
stream API.  However, I'm wondering if std.file should support gzip and, 
if license issues can be overcome, bzip2.

I'd love to be able to write code like this:

// Read and transparently decompress foo.txt, which is UTF-8 encoded.
auto foo = cast(string) gzippedRead("foo.txt.gz");

// Write a buffer to a gzipped file.
gzippedWrite("foo.txt.gz", buf);

This stuff would be trivial to implement in std.file and, IMHO, belongs 
there.  What's the consensus on whether it belongs?

Mar 12 2011
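
A minimal sketch of what the two helpers proposed above might look like,
assuming current Phobos (std.file.read, std.file.write, and std.zlib's
Compress and UnCompress with HeaderFormat.gzip). gzippedRead and gzippedWrite
are the hypothetical names from the post. Neither exists in std.file, and the
signatures below are a guess at what was intended.

```d
import std.file : read, write;
import std.zlib : Compress, HeaderFormat, UnCompress;

/// Hypothetical helper: read a whole gzip file and return the decompressed
/// bytes. The entire file is held in memory, as with std.file.read.
ubyte[] gzippedRead(string name)
{
    auto decomp = new UnCompress(HeaderFormat.gzip);
    ubyte[] result = (cast(const(ubyte)[]) decomp.uncompress(read(name))).dup;
    // Collect anything the decompressor is still holding.
    result ~= cast(ubyte[]) decomp.flush();
    return result;
}

/// Hypothetical helper: gzip-compress a buffer and write it to a file.
void gzippedWrite(string name, const(void)[] buf)
{
    auto comp = new Compress(HeaderFormat.gzip);
    ubyte[] data = (cast(const(ubyte)[]) comp.compress(buf)).dup;
    // flush() finishes the stream and emits the gzip trailer.
    data ~= cast(ubyte[]) comp.flush();
    write(name, data);
}
```

With these, the usage shown in the post works as written:
auto foo = cast(string) gzippedRead("foo.txt.gz"); and
gzippedWrite("foo.txt.gz", buf);. Because both helpers buffer the whole file,
they are only suitable where std.file.read would be. Large inputs would still
want the streaming approach discussed earlier in the thread.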
