digitalmars.D - How to continue work on std.zip

berni44 (189/189) Oct 20 2019 Recently I fixed some bugs in std.zip, mainly concerning malware

Jonathan M Davis (73/146) Oct 21 2019 What would be the point of having something so low-level? The ZipMembers...

Patrick Schluter (6/13) Oct 21 2019 The examples would be much more useful if they illustrated how to
berni44 (85/132) Oct 21 2019 On Monday, 21 October 2019 at 08:44:30 UTC, Jonathan M Davis

Jonathan M Davis (92/197) Oct 22 2019 The constants don't look even vaguely magical. They're clearly simply ba...

Walter Bright (4/5) Oct 22 2019 Within the last year, I recall a post on reddit or hackernews that had a...

berni44 (5/11) Oct 23 2019 That's allready done. std.zip meanwhile checks for overlapping

Walter Bright (7/11) Oct 23 2019 How about nailing all that stuff? Then you can put in the documentation ...

berni44 (11/17) Oct 23 2019 That's the plan (I have to wait for stable to be remerged back

Petar Kirov [ZombineDev] (3/8) Oct 23 2019 Here you go: https://github.com/dlang/phobos/pull/7246

berni44 (3/12) Oct 23 2019 Thanks. ;)

berni44 <dlang d-ecke.de> writes:

Recently I fixed some bugs in std.zip, mainly concerning malware 
attacks. I would like to continue there, but I think its a good 
idea to clearify the roadmap, before I continue with more PRs. 
Here's a list of items, I've got in my mind now:

Extracting zip files
--------------------

* Add structs that directly map the data in the zip records. As 
Atila pointed
out, this should be done with serialisation instead of unions 
(what I planed
and what he convinced me not to do, because this results in 
endianess problems).
See below (*) for some suggestions, on how this might be done.

* use enforce instead of if ... throw like Vladimir suggested. Do 
you think I
should use enforce!zipException("message") or is it better to get 
rid of
zipException?

* fix bugs 20289 - 20294 and bug 18950 (this is part of bug 
20290).

* Add more unittests, especially for zip64 and to reach (almost) 
100%
coverage.


Documentation
-------------

* Reversing the order of the four items in this file would bring 
the more
important ones to the beginning. This would make for better 
documentation.

* Using unittests for examples (maybe replacing the two examples 
at the top of
the module).

* Rewriting some of the documentation for better ddoc 
conformation and clearer
descriptions.


Extensions
----------

* Support for ranges. For unzipping, a RandomAccessRange could be 
used, for
zipping an OutputRange could be provided. Also, the 
ArchiveMembers could be
accessed via an other RandomAccessRange. These ranges would (in 
some cases)
allow for lazy evaluation, which might result in a speed gain, 
when not all
elements of an archive need to be read (or kept in memory).

* Multidisk? I think, it's *not* worth it. Just causes trouble 
and no one
needs this nowadays.

* Support for versions < 2.0 (called version 20 in the docs)? 
Again I think
that is *not* worth it.

* Support for other compression methods? Maybe bzip2 and LZMA 
would be nice.
All other methods look history to me.

* Support for encryption? I'm not sure about legal issues here. 
Looks like
PKWARE has some copyrights here. Therefore I'd skip encryption. 
In general I
personally think, that it's better to use a separate encryption 
program,
instead of using encryption mechanisms inside of a fixed format. 
Sooner or
later a fixed format will produce security issues.

* Files larger than 4GB? I see no reason, why we should block 
them. All it
would probably need is to move from uint to ulong.


Concerning the order of all these points, I'd roughly go from top 
to bottom.

What do you think? Is all of this worth it? Or should part of it 
be removed?
Do you have other stuff, that you would like to see here?

=============================================================================

(*) Currently the mapping is done like this:

{
     // Read end record data
     _diskNumber = getUshort(i + 4);
     _diskStartDir = getUshort(i + 6);

     _numEntries = getUshort(i + 8);
     _totalEntries = getUshort(i + 10);

     if (numEntries != totalEntries)
         throw new ZipException("multiple disk zips not 
supported");

     directorySize = getUint(i + 12);
     directoryOffset = getUint(i + 16);

     if (directoryOffset + directorySize > i)
         throw new ZipException("corrupted directory");
}

using functions like

 safe  nogc pure nothrow ushort getUshort(uint i)
{
     ubyte[2] result = data[i .. i + 2];
     return littleEndianToNative!ushort(result);
}

I dislike this approach, because it's using magic constants I'd 
like to get rid of.

As D has no build in serialisation mechanism and neither Phobos 
has, there is the need to add a specialised one.

One idea:

struct EndOfCentralDirRecord
{
     ushort numberOfThisDisk;
     ushort numberOfCDDisk;
     ushort entriesOnThisDisk;
     ushort entries;
     uint size;
     uint offset;
     ushort commentLength;
}

{
     auto eocdr = conv!Test(data[offset .. offset + 
EndOfCentralDirLength);

     // Read end record data
     _diskNumber = eocdr.numberOfThisDisk;
     _diskStartDir = eocdr.numberOfCDDisk;

     _numEntries = eocdr.entriesOnThisDisk;
     _totalEntries = eocdr.entries;

     if (numEntries != totalEntries)
         throw new ZipException("multiple disk zips not 
supported");

     directorySize = eocdr.size;
     directoryOffset = eocdr.offset;

     if (directoryOffset + directorySize > i)
         throw new ZipException("corrupted directory");
}

using

T conv(T)(ubyte[] a)
{
     auto tmp = T();

     foreach (member; __traits(allMembers, T))
     {
         alias type = typeof(mixin("T."~member));
         static if (is(type == ushort))
             mixin("tmp."~member~" = cast(ushort) (a[0] | 
(a[1]<<8));");
         else static if (is(type == uint))
             mixin("tmp."~member~" = cast(uint) (a[0] | (a[1]<<8) 
| (a[2]<<16) | (a[3]<<24));");
         else static if (is(type == ulong))
             mixin("tmp."~member~" = a[0] | (a[1]<<8) | (a[2]<<16) 
| (a[3]<<24)"~
                   "| (cast(ulong)(a[4] | (a[5]<<8) | (a[6]<<16) | 
(a[7]<<24))<<32);");
         else
             assert(false, "type "~member~" not supported");
         a = a[type.sizeof .. $];
     }

     return tmp;
}

One other idea motivated by some comment from Vladimir:

alias fromLE = littleEndianToNative;

align (1) // just to make sure there are no gaps
struct EndOfCentralDirRecord
{
     ubyte[2] numberOfThisDisk;
     ubyte[2] numberOfCDDisk;
     ubyte[2] entriesOnThisDisk;
     ubyte[2] entries;
     ubyte[4] size;
     ubyte[4] offset;
     ubyte[2] commentLength;
}

{
     auto eocdr = getRecord!EndOfCentralDirRecord(offset);

     // Read end record data
     _diskNumber = fromLE!ushort(eocdr.numberOfThisDisk);
     _diskStartDir = fromLE!ushort(eocdr.numberOfCDDisk);

     _numEntries = fromLE!ushort(eocdr.entriesOnThisDisk);
     _totalEntries = fromLE!ushort(eocdr.entries);

     if (numEntries != totalEntries)
         throw new ZipException("multiple disk zips not 
supported");

     directorySize = fromLE!uint(eocdr.size);
     directoryOffset = fromLE!uint(eocdr.offset);

     if (directoryOffset + directorySize > i)
         throw new ZipException("corrupted directory");
}

using

private T getRecord(T)(ulong offset)
{
     union U
     {
         ubyte[T.sizeof] data;
         T result;
     }

     U tmp;
     tmp.data = data[offset, offset + T.sizeof];
     return tmp.result;
}

Oct 20 2019

Jonathan M Davis <newsgroup.d jmdavisprog.com> writes:

On Sunday, October 20, 2019 8:03:06 AM MDT berni44 via Digitalmars-d wrote:
 Recently I fixed some bugs in std.zip, mainly concerning malware
 attacks. I would like to continue there, but I think its a good
 idea to clearify the roadmap, before I continue with more PRs.
 Here's a list of items, I've got in my mind now:

 Extracting zip files
 --------------------

 * Add structs that directly map the data in the zip records. As
 Atila pointed
 out, this should be done with serialisation instead of unions
 (what I planed
 and what he convinced me not to do, because this results in
 endianess problems).
 See below (*) for some suggestions, on how this might be done.

What would be the point of having something so low-level? The ZipMembers are
already put in an AA in the ZipArchive. So, the basic design of std.zip is
pretty high-level, whereas something like accessing the low-level zip
records would fit better in a solution that was lower level and potentially
lazy.

 * use enforce instead of if ... throw like Vladimir suggested. Do
 you think I
 should use enforce!zipException("message") or is it better to get
 rid of
 zipException?

Getting rid of ZipException would break code that currently uses std.zip and
wouldn't necessarily be a desirable change anyway. enforce!ZipException is
perfectly fine.

 * fix bugs 20289 - 20294 and bug 18950 (this is part of bug
 20290).

 * Add more unittests, especially for zip64 and to reach (almost)
 100%
 coverage.


 Documentation
 -------------

 * Reversing the order of the four items in this file would bring
 the more
 important ones to the beginning. This would make for better
 documentation.

That would be terrible for diffs and really wouldn't help much with the
documentation. There are already links to the top-level symbols at the top
of the documentation, and the example at the top shows how the various
types are related. It might make sense to add a table at the top with a
short description of each top-level symbol, but rearranging the code just to
change the documentation is needless churn - especially in such a small
file.

 * Using unittests for examples (maybe replacing the two examples
 at the top of
 the module).

The example at the top of the module is useful and really shouldn't be
removed. However, adding more documented unittest blocks as examples for
specific symbols would be an improvement.

 * Rewriting some of the documentation for better ddoc
 conformation and clearer
 descriptions.

I have no clue what you mean by better ddoc conformation. As far as ddoc
goes, what's there looks perfectly fine to me. If you can come up with
better descriptions for stuff, that's fine, but I don't see anything
obviously wrong with what's there.

 Extensions
 ----------

 * Support for ranges. For unzipping, a RandomAccessRange could be
 used, for
 zipping an OutputRange could be provided. Also, the
 ArchiveMembers could be
 accessed via an other RandomAccessRange. These ranges would (in
 some cases)
 allow for lazy evaluation, which might result in a speed gain,
 when not all
 elements of an archive need to be read (or kept in memory).

Given that ZipArchive already sticks everything in an AA, I'm not sure what
you'd gain by trying to add a random access range as well. And random-access
would be incompatible with lazy evaluation. In order to have random access,
you'd basically have to stick everything in a dynamic array instead of an
AA, which doesn't fit with the current design and wouldn't be lazy either.

In order to actually have lazy evaluation, you'd have to have a completely
different design, because ZipArchive creates the AA up front - and changing
that would cause problems for existing code. There's certainly something to
be said for a design that lazily iterates through a zip archive, but from
what I can tell, that would require a completely different type from
ZipArchive, and I'd argue that if that's the sort of thing that you're
looking to do, it would make a lot more sense to just create a fresh
solution and put it on code.dlang.org instead of changing std.zip, since it
really doesn't fit into the current design, and moving away from the AA
design would break existing code.

 * Multidisk? I think, it's *not* worth it. Just causes trouble
 and no one
 needs this nowadays.

Maybe it a more complex solution, but std.zip is very basic. Also, looking
at the wikipedia page for zip, it looks like the current design of std.zip
is probably pretty similar to what ISO/IEC 21320-1:2015 requires, and
spanning multiple volumes is not supported by that standard. I don't know
that we need to stick to that standard, but it might make sense to just
target what the ISO standard does for std.zip and let anything more complex
be in a more complex solution code.dlang.org if someone feels the need to
write such a solution.

 * Support for versions < 2.0 (called version 20 in the docs)?
 Again I think
 that is *not* worth it.

Wikipedia doesn't even inclued versions older than that in the versions that
it lists, and 2.0 was released in 1993. Really, the question should be
whether we include support for stuff newer than that, not whether we support
anything older.

 * Support for other compression methods? Maybe bzip2 and LZMA
 would be nice.
 All other methods look history to me.

That could be useful, but we'd like to avoid having Phobos depend on any
more external libraries (having curl has caused us enough problems already),
so we'd need implementations for them inside of Phobos. It also wouldn't be
ISO compatible, so if we decided that that's what we were targeting, then it
would not be appropriate.

 * Support for encryption? I'm not sure about legal issues here.
 Looks like
 PKWARE has some copyrights here. Therefore I'd skip encryption.
 In general I
 personally think, that it's better to use a separate encryption
 program,
 instead of using encryption mechanisms inside of a fixed format.
 Sooner or
 later a fixed format will produce security issues.

I'd suggest skipping it at this point. If nothing else, doing it correctly
will likely require external libraries. It would be better for such a
solution to exist on code.dlang.org rather than be part of Phobos. It's also
explicitly banned by the ISO standard.

 * Files larger than 4GB? I see no reason, why we should block
 them. All it
 would probably need is to move from uint to ulong.

From what little I understand, that would involve being zip64 rather than

normal zip, which would involve more than simply going from uint to ulong,
and non-zip64 files do not support larger than 4 GiB. So, while adding
support for larger files wouldn't necessarily be a problem, it needs to be
done in a way that's standards compatible and without requiring that zip
files be zip64 if they don't need to be.

 Concerning the order of all these points, I'd roughly go from top
 to bottom.

 What do you think? Is all of this worth it? Or should part of it
 be removed?
 Do you have other stuff, that you would like to see here?

Improvements to std.zip which expand its capabilitiues without breaking
existing code are fine, but at some point, if you want a full, flexible
implementation, it's probably better to come up with something new and put
it on code.dlang.org. Also including stuff like bzip2 or encryption would
likely be better in a 3rd party library that can use existing C libraries.
Without digging into the whole issue in detail, I'd be tempted to argue that
we should just target the ISO standard with std.zip and leave anything
fancier to code.dlang.org.

- Jonathan M Davis

Oct 21 2019

Patrick Schluter <Patrick.Schluter bbox.fr> writes:

On Monday, 21 October 2019 at 08:44:30 UTC, Jonathan M Davis 
wrote:
 * Using unittests for examples (maybe replacing the two 
 examples
 at the top of
 the module).

 The example at the top of the module is useful and really 
 shouldn't be removed. However, adding more documented unittest 
 blocks as examples for specific symbols would be an improvement.

The examples would be much more useful if they illustrated how to 
add or read more than 1 entry in the zipfile. I remember that I 
struggled a bit at the beginning to understand how it worked to 
go from 1 file to several files.

Oct 21 2019

berni44 <dlang d-ecke.de> writes:

On Monday, 21 October 2019 at 08:44:30 UTC, Jonathan M Davis 
wrote:

First of all, thanks for pointing me to the ISO standard. I was 
not aware of this standard (the german wikipedia does not mention 
it and I must have overlooked it, when I skimmed through the 
english one). I agree, that we should stick to that standard 
which answers several of my questions.

Second, my goal is to improve Phobos. I'm pretty much aware, that 
this means to avoid breaking code at (almost) all costs. In case 
of std.zip the only "breaking" change will be to get rid of some 
constants (digiSignLength, diskNumber) that should have been 
private from the very beginning or are only usefull with 
multidisk support. And this is done with a full deprecation cycle.

 * Add structs that directly map the data in the zip records. 
 As Atila pointed
 out, this should be done with serialisation instead of unions 
 (what I planed
 and what he convinced me not to do, because this results in 
 endianess problems).
 See below (*) for some suggestions, on how this might be done.

 What would be the point of having something so low-level? The 
 ZipMembers are already put in an AA in the ZipArchive.

Sorry, I think this was a missunderstanding. I don't want to 
change anything of the AA nor the members in that AA. For me the 
question is, how to get the data from this void[]-array into the 
AA. The current approach uses lot's of lines like "_diskNumber = 
getUshort(i + 4);" with magic constants (here 4), which is error 
prone. I'd prefere to get rid of these in favour of some structs 
to avoid adding bugs when fixing the existing issues.

 * Reversing the order of the four items in this file would 
 bring the more
 important ones to the beginning. This would make for better 
 documentation.

 That would be terrible for diffs and really wouldn't help much 
 with the documentation. There are already links to the 
 top-level symbols at the top of the documentation, and the 
 example at the top shows how the various types are related. 
 [...]

I disagree here. First of all, documentation is made for humans 
and this should be more important than diffs. (But maybe there is 
some aspect of diffs, that I'm not aware of. For me, this would 
be a larger change once, and after that change, the diffs would 
again be pretty small.)

 From my experience, the links at the top are not usefull for 
people looking the first time into the docs of some module. At 
that time, one often does not know, which item is the one they 
are looking for.

For example. When I first looked at the documentation of std.zip, 
I was looking for a way to extract an existing archive. I 
expected that to be done by a function called unzip, but found no 
such function. Skimming the example at the top I thought, this 
was about creating zip archives (The comment in the first line 
tells that this is about reading zip files; I don't know exactly, 
why I overlooked this. Maybe, because it's small letters in light 
gray; maybe, because it's english, maybe I did not expect it 
there; I can't tell). Therefore I scrolled down. I found 
ZipException which is obviously of no use. After that I found 
CompressionMethod which is of little use either. Next there is 
ArchiveMember, which is more interesting, but yet does not show a 
way on how to get to the content of a zip file.

Finally, if you do not overlook it (which is quite possible 
between all those members of ArchiveMember and ZipArchive), 
you'll find ZipArchive. But again you don't know how to use it. 
The constructor I was looking for is actually the second to last 
item, very well hidden.

I went into detail here, because I think this is important. I had 
this experience over and over again with the documentation of 
Phobos. Comparing this to the documentation of PHP [1], where I 
never had any problems with, I conclude that there is space for 
improvement. ;-)

But it's getting a little bit off topic. Maybe Phobos 
documentation should be discussed in a separate threat.

[1] https://www.php.net/manual/de/book.zip.php

 * Using unittests for examples (maybe replacing the two 
 examples at the top of the module).

 The example at the top of the module is useful and really 
 shouldn't be removed.

No, I don't want to remove them. Just moving them inside of a 
unittest to make sure, that documentation and implementation does 
not diverge too much.

 * Rewriting some of the documentation for better ddoc 
 conformation and clearer descriptions.

 I have no clue what you mean by better ddoc conformation.

Probably also a missunderstanding. What I mean is: As far as I 
know, all parameters of all functions should be listed; the 
return value should be mentioned and a "see also" section is 
recommended and stuff like this. Most of the functions do not 
have that information. For example: The documentation of 
"extractVersion" is "Read Only" which is not really useful, isn't 
it?

 * Support for ranges. For unzipping, a RandomAccessRange could 
 be used, for
 zipping an OutputRange could be provided. Also, the 
 ArchiveMembers could be
 accessed via an other RandomAccessRange. These ranges would 
 (in some cases)
 allow for lazy evaluation, which might result in a speed gain, 
 when not all
 elements of an archive need to be read (or kept in memory).

 Given that ZipArchive already sticks everything in an AA, I'm 
 not sure what you'd gain by trying to add a random access range 
 as well. And random-access would be incompatible with lazy 
 evaluation.

This is probably only about the last range I mentioned above, 
isn't it? That's the one I myself are most unsure if it's going 
to be usefull. One could use AA.byValue etc instead and that's 
probably sufficient.

I added support for ranges in the list, because it was explicitly 
mentioned in the vision 2016H2 [2]. I think, there are use cases, 
where laziness can be applied. Think of an archive with lots of 
files, where only some of them are going to be extracted. The 
underlying file-access-range can use fseek and the like to only 
load the part of the archive, that's gona be needed. zip's 
structure allowes to fill in the AA (almost) completely by 
scanning the directory at the end of the file. I write "almost", 
because the real data content is still missing. But that can be 
retrieved, when it's going to be used.

[2] https://wiki.dlang.org/Vision/2016H2

 In order to actually have lazy evaluation, you'd have to have a 
 completely different design, because ZipArchive creates the AA 
 up front - and changing that would cause problems for existing 
 code.

I think, it can be done with the current (public) design. I would 
add a new constructor that takes a range instead of void[]. All 
else would happen underneath.

 * Files larger than 4GB? I see no reason, why we should block
 them. All it
 would probably need is to move from uint to ulong.

From what little I understand, that would involve being zip64 
rather than normal zip,


Indeed, but zip64 is allready implemented. There is just still 
the limit on the filesize.

Oct 21 2019

Jonathan M Davis <newsgroup.d jmdavisprog.com> writes:

On Monday, October 21, 2019 7:35:37 AM MDT berni44 via Digitalmars-d wrote:
 On Monday, 21 October 2019 at 08:44:30 UTC, Jonathan M Davis
 wrote:
 * Add structs that directly map the data in the zip records.
 As Atila pointed
 out, this should be done with serialisation instead of unions
 (what I planed
 and what he convinced me not to do, because this results in
 endianess problems).
 See below (*) for some suggestions, on how this might be done.

 What would be the point of having something so low-level? The
 ZipMembers are already put in an AA in the ZipArchive.

 Sorry, I think this was a missunderstanding. I don't want to
 change anything of the AA nor the members in that AA. For me the
 question is, how to get the data from this void[]-array into the
 AA. The current approach uses lot's of lines like "_diskNumber =
 getUshort(i + 4);" with magic constants (here 4), which is error
 prone. I'd prefere to get rid of these in favour of some structs
 to avoid adding bugs when fixing the existing issues.

The constants don't look even vaguely magical. They're clearly simply based
on the sizes of the values. That being said, if I were to fix it, I'd
probably use append and read from std.bitmanip to replace all of the put*
and get* calls, since they would take care of the sizes for you without
having to deal with them directly at all. So, instead of

    putUshort(i + 4,  de._madeVersion);
    putUshort(i + 6,  de.extractVersion);
    putUshort(i + 8,  de.flags);
    putUshort(i + 10, de._compressionMethod);
    putUint  (i + 12, cast(uint) de.time);
    putUint  (i + 16, de.crc32);
    putUint  (i + 20, de.compressedSize);
    putUint  (i + 24, de.expandedSize);
    putUshort(i + 28, cast(ushort) de.name.length);
    putUshort(i + 30, cast(ushort) de.extra.length);
    putUshort(i + 32, cast(ushort) de.comment.length);
    putUshort(i + 34, de.diskNumber);
    putUshort(i + 36, de.internalAttributes);
    putUint  (i + 38, de._externalAttributes);
    putUint  (i + 42, de.offset);

you'd get something like

    buffer.append!ushort(de._madeVersion);
    buffer.append!ushort(de.extractVersion);
    buffer.append!ushort(de.flags);
    buffer.append!ushort(de._compressionMethod);
    buffer.append!uint  (cast(uint) de.time);
    buffer.append!uint  (de.crc32);
    buffer.append!uint  (de.compressedSize);
    buffer.append!uint  (de.expandedSize);
    buffer.append!ushort(cast(ushort) de.name.length);
    buffer.append!ushort(cast(ushort) de.extra.length);
    buffer.append!ushort(cast(ushort) de.comment.length);
    buffer.append!ushort(de.diskNumber);
    buffer.append!ushort(de.internalAttributes);
    buffer.append!uint  (de._externalAttributes);
    buffer.append!uint  (de.offset);

Then all of the constants disappear. Also, then std.zip isn't trying to
replace functionality that's already in std.bitmanip (like it's doing right
now). The sort of thing that std.zip is doing here is pretty much exactly
what read, peek, write, and append in std.bitmanip were written to handle.

 * Reversing the order of the four items in this file would
 bring the more
 important ones to the beginning. This would make for better
 documentation.

 That would be terrible for diffs and really wouldn't help much
 with the documentation. There are already links to the
 top-level symbols at the top of the documentation, and the
 example at the top shows how the various types are related.
 [...]

 I disagree here. First of all, documentation is made for humans
 and this should be more important than diffs. (But maybe there is
 some aspect of diffs, that I'm not aware of. For me, this would
 be a larger change once, and after that change, the diffs would
 again be pretty small.)

  From my experience, the links at the top are not usefull for
 people looking the first time into the docs of some module. At
 that time, one often does not know, which item is the one they
 are looking for.

 For example. When I first looked at the documentation of std.zip,
 I was looking for a way to extract an existing archive. I
 expected that to be done by a function called unzip, but found no
 such function. Skimming the example at the top I thought, this
 was about creating zip archives (The comment in the first line
 tells that this is about reading zip files; I don't know exactly,
 why I overlooked this. Maybe, because it's small letters in light
 gray; maybe, because it's english, maybe I did not expect it
 there; I can't tell). Therefore I scrolled down. I found
 ZipException which is obviously of no use. After that I found
 CompressionMethod which is of little use either. Next there is
 ArchiveMember, which is more interesting, but yet does not show a
 way on how to get to the content of a zip file.

 Finally, if you do not overlook it (which is quite possible
 between all those members of ArchiveMember and ZipArchive),
 you'll find ZipArchive. But again you don't know how to use it.
 The constructor I was looking for is actually the second to last
 item, very well hidden.

 I went into detail here, because I think this is important. I had
 this experience over and over again with the documentation of
 Phobos. Comparing this to the documentation of PHP [1], where I
 never had any problems with, I conclude that there is space for
 improvement. ;-)

Reordering stuff makes it harder to track changes over time, and with a file
as small as std.zip, I have a very hard time believing that changing the
order is going to help much - especially since you need to be aware of all
of the types in std.zip when using it anyway. The main way to make it easier
for people to figure out where they need to know to start is by improving
the module-level documentation. For instance, std.zip could certainly use a
table similar to what modules like std.array and std.range have:

https://dlang.org/phobos/std_array.html
https://dlang.org/phobos/std_range.html

Having an overview of that explains what the pieces are and provides links
to them is going to work far better than simply reorganizing the file will
do. If the module-level documentation doesn't give enough information for
people to know what they should be looking for in the module for the
functionality they want, then the module-level documentation needs to be
improved, regardless of what order the symbols are in in the file. And if
the module-level documentation does its job properly, the order of the
symbols in the documentation should be irrelevant.

 * Support for ranges. For unzipping, a RandomAccessRange could
 be used, for
 zipping an OutputRange could be provided. Also, the
 ArchiveMembers could be
 accessed via an other RandomAccessRange. These ranges would
 (in some cases)
 allow for lazy evaluation, which might result in a speed gain,
 when not all
 elements of an archive need to be read (or kept in memory).

 Given that ZipArchive already sticks everything in an AA, I'm
 not sure what you'd gain by trying to add a random access range
 as well. And random-access would be incompatible with lazy
 evaluation.

 This is probably only about the last range I mentioned above,
 isn't it? That's the one I myself are most unsure if it's going
 to be usefull. One could use AA.byValue etc instead and that's
 probably sufficient.

 I added support for ranges in the list, because it was explicitly
 mentioned in the vision 2016H2 [2]. I think, there are use cases,
 where laziness can be applied. Think of an archive with lots of
 files, where only some of them are going to be extracted. The
 underlying file-access-range can use fseek and the like to only
 load the part of the archive, that's gona be needed. zip's
 structure allowes to fill in the AA (almost) completely by
 scanning the directory at the end of the file. I write "almost",
 because the real data content is still missing. But that can be
 retrieved, when it's going to be used.

 [2] https://wiki.dlang.org/Vision/2016H2

You'd have to be careful with this, because right now, if you call
ZipArchive's directory, you get the same AA that it has internally, and
accessing any values in it does not cause any calculations to be done.
Having them be done lazily might be nice, but you also wouldn't want to end
up having stuff calculated every time you access one of the ArchiveMembers
in the AA. And making it so that ArchiveMember is somehow lazy but doesn't
redo any work means that it won't work when it's const. Fortunately,
functions like expandedData and compressedData are not const, so maybe some
of what's done there could be made lazy without breaking code, but it would
have to be done carefully.

The fact that std.zip is designed around filling in an AA, and it provides
direct access to that AA pretty much destroys most of the possible
range-based solutions - especially the lazy ones. If you can make some
aspect of how the internals work lazier without causing problems, then
that's fine, but I don't see much point in trying to provide a different
range API on top of what's there when it's basically just an AA.

I could easily see having a lazy, range-based solution which did something
like iterate through the files in the zip file in whatever order they're in
in the zip file, but that would require a very different design from what we
currently have with std.zip. Also, whether it would be better or worse would
largely depend on what the user code was trying to do.

Certainly, if you can come up with something clever that improves things
while not breaking existing code, then that's great, but the fact that the
code provides direct access to the AA (and as the primary way to access the
data no less) makes it pretty hard to change. Maybe it would make sense to
provide some sort of range-based solution beside what we currently have in
std.zip, but that's the sort of thing that should probably be put on
code.dlang first and battle-tested a bit before being put into Phobos, since
it's essentially designing a new set of functionality for accessing and
processing zip files rather than simply improving what's already there.

 In order to actually have lazy evaluation, you'd have to have a
 completely different design, because ZipArchive creates the AA
 up front - and changing that would cause problems for existing
 code.

 I think, it can be done with the current (public) design. I would
 add a new constructor that takes a range instead of void[]. All
 else would happen underneath.

 * Files larger than 4GB? I see no reason, why we should block
 them. All it
 would probably need is to move from uint to ulong.

From what little I understand, that would involve being zip64
rather than normal zip,


 Indeed, but zip64 is allready implemented. There is just still
 the limit on the filesize.

Just make sure that whatever you're doing doesn't result in a file being
treated as a zip file rather than a zip64 file when it can't legally be a
zip file.

- Jonathan M Davis

Oct 22 2019

Walter Bright <newshound2 digitalmars.com> writes:

On 10/20/2019 7:03 AM, berni44 wrote:
 Recently I fixed some bugs in std.zip, mainly concerning malware attacks.

Within the last year, I recall a post on reddit or hackernews that had a long 
list of ways to make zip bombs. It'd be a great initiative to defeat all of 
these with std.zip.

Oct 22 2019

berni44 <dlang d-ecke.de> writes:

On Tuesday, 22 October 2019 at 20:18:44 UTC, Walter Bright wrote:
 On 10/20/2019 7:03 AM, berni44 wrote:
 Recently I fixed some bugs in std.zip, mainly concerning 
 malware attacks.

 Within the last year, I recall a post on reddit or hackernews 
 that had a long list of ways to make zip bombs. It'd be a great 
 initiative to defeat all of these with std.zip.

That's allready done. std.zip meanwhile checks for overlapping 
data and rejects any such file. Actually this (namely an issue 
you filed) was the reason, why I started work on std.zip. Some 
minor stuff, like path traversal attacks, remains though.

Oct 23 2019

Walter Bright <newshound2 digitalmars.com> writes:

On 10/23/2019 12:24 AM, berni44 wrote:
 That's allready done. std.zip meanwhile checks for overlapping data and
rejects 
 any such file. Actually this (namely an issue you filed) was the reason, why I 
 started work on std.zip.

That's good!

 Some minor stuff, like path traversal attacks, remains though.

How about nailing all that stuff? Then you can put in the documentation for 
std.zip, with links to the zip bomb problems, that std.zip doesn't have those 
problems.

There's no reason not to do an std.zip that's better than anyone else's, 
especially when the problems are known.

Oct 23 2019

berni44 <dlang d-ecke.de> writes:

On Wednesday, 23 October 2019 at 09:31:04 UTC, Walter Bright 
wrote:
 Some minor stuff, like path traversal attacks, remains though.

 How about nailing all that stuff? Then you can put in the 
 documentation for std.zip, with links to the zip bomb problems, 
 that std.zip doesn't have those problems.

 There's no reason not to do an std.zip that's better than 
 anyone else's, especially when the problems are known.

That's the plan (I have to wait for stable to be remerged back 
into master though, because there's been a regression fix). I 
thought of first implementing everything and going for 
documentation later, but meanwhile I think, I could do the 
documentation stuff in parallel with implementing, starting 
(which can be done immediately) with the idea of an overview 
table at the top, like Jonathan M. Davis suggests. Currently, 
this would only mention zip-bombs and chameleon-files. But with 
every further fix it can be extended.

Oct 23 2019

Petar Kirov [ZombineDev] <petar.p.kirov gmail.com> writes:

On Wednesday, 23 October 2019 at 14:54:20 UTC, berni44 wrote:
 [..]
 
 That's the plan (I have to wait for stable to be remerged back 
 into master though, because there's been a regression fix).
 [..]

Here you go: https://github.com/dlang/phobos/pull/7246
;)

Oct 23 2019

berni44 <dlang d-ecke.de> writes:

On Wednesday, 23 October 2019 at 15:42:08 UTC, Petar Kirov 
[ZombineDev] wrote:
 On Wednesday, 23 October 2019 at 14:54:20 UTC, berni44 wrote:
 [..]
 
 That's the plan (I have to wait for stable to be remerged back 
 into master though, because there's been a regression fix).
 [..]

 Here you go: https://github.com/dlang/phobos/pull/7246
 ;)

Thanks. ;)

Oct 23 2019

D Programming

C/C++ Programming

Other

digitalmars.D - How to continue work on std.zip