digitalmars.D - How to continue work on std.zip

berni44 <dlang d-ecke.de> writes:
Recently I fixed some bugs in std.zip, mainly concerning malware
attacks. I would like to continue there, but I think it's a good
idea to clarify the roadmap before I continue with more PRs.
Here's a list of items I have in mind now:

Extracting zip files
--------------------

* Add structs that directly map the data in the zip records. As Atila pointed
out, this should be done with serialisation instead of unions (what I planned
and what he convinced me not to do, because this results in endianness
problems). See below (*) for some suggestions on how this might be done.

* Use enforce instead of if ... throw, like Vladimir suggested. Do you think I
should use enforce!ZipException("message"), or is it better to get rid of
ZipException?

* fix bugs 20289 - 20294 and bug 18950 (this is part of bug 
20290).

* Add more unittests, especially for zip64 and to reach (almost) 
100%
coverage.


Documentation
-------------

* Reversing the order of the four items in this file would bring 
the more
important ones to the beginning. This would make for better 
documentation.

* Using unittests for examples (maybe replacing the two examples 
at the top of
the module).

* Rewriting some of the documentation for better ddoc 
conformation and clearer
descriptions.


Extensions
----------

* Support for ranges. For unzipping, a RandomAccessRange could be used; for
zipping, an OutputRange could be provided. Also, the ArchiveMembers could be
accessed via another RandomAccessRange. These ranges would (in some cases)
allow for lazy evaluation, which might result in a speed gain when not all
elements of an archive need to be read (or kept in memory).

* Multidisk? I think, it's *not* worth it. Just causes trouble 
and no one
needs this nowadays.

* Support for versions < 2.0 (called version 20 in the docs)? 
Again I think
that is *not* worth it.

* Support for other compression methods? Maybe bzip2 and LZMA would be nice.
All other methods look like history to me.

* Support for encryption? I'm not sure about legal issues here. 
Looks like
PKWARE has some copyrights here. Therefore I'd skip encryption. 
In general I
personally think, that it's better to use a separate encryption 
program,
instead of using encryption mechanisms inside of a fixed format. 
Sooner or
later a fixed format will produce security issues.

* Files larger than 4GB? I see no reason, why we should block 
them. All it
would probably need is to move from uint to ulong.


Concerning the order of all these points, I'd roughly go from top 
to bottom.

What do you think? Is all of this worth it? Or should part of it 
be removed?
Do you have other stuff, that you would like to see here?

=============================================================================

(*) Currently the mapping is done like this:

{
     // Read end record data
     _diskNumber = getUshort(i + 4);
     _diskStartDir = getUshort(i + 6);

     _numEntries = getUshort(i + 8);
     _totalEntries = getUshort(i + 10);

     if (numEntries != totalEntries)
         throw new ZipException("multiple disk zips not supported");

     directorySize = getUint(i + 12);
     directoryOffset = getUint(i + 16);

     if (directoryOffset + directorySize > i)
         throw new ZipException("corrupted directory");
}

using functions like

@safe @nogc pure nothrow ushort getUshort(uint i)
{
     ubyte[2] result = data[i .. i + 2];
     return littleEndianToNative!ushort(result);
}

I dislike this approach because it uses magic constants, which I'd
like to get rid of.

As neither D nor Phobos has a built-in serialisation mechanism,
a specialised one needs to be added.

One idea:

struct EndOfCentralDirRecord
{
     ushort numberOfThisDisk;
     ushort numberOfCDDisk;
     ushort entriesOnThisDisk;
     ushort entries;
     uint size;
     uint offset;
     ushort commentLength;
}

{
     auto eocdr = conv!EndOfCentralDirRecord(
         data[offset .. offset + EndOfCentralDirLength]);

     // Read end record data
     _diskNumber = eocdr.numberOfThisDisk;
     _diskStartDir = eocdr.numberOfCDDisk;

     _numEntries = eocdr.entriesOnThisDisk;
     _totalEntries = eocdr.entries;

     if (numEntries != totalEntries)
         throw new ZipException("multiple disk zips not supported");

     directorySize = eocdr.size;
     directoryOffset = eocdr.offset;

     if (directoryOffset + directorySize > i)
         throw new ZipException("corrupted directory");
}

using

T conv(T)(ubyte[] a)
{
     auto tmp = T();

     foreach (member; __traits(allMembers, T))
     {
         alias type = typeof(mixin("T."~member));
         static if (is(type == ushort))
             mixin("tmp."~member~" = cast(ushort) (a[0] | (a[1]<<8));");
         else static if (is(type == uint))
             mixin("tmp."~member~" = cast(uint) (a[0] | (a[1]<<8) | (a[2]<<16) | (a[3]<<24));");
         else static if (is(type == ulong))
             // The low half is cast to uint first; as a plain int, a set
             // top bit would be sign-extended into the upper 32 bits.
             mixin("tmp."~member~" = cast(uint) (a[0] | (a[1]<<8) | (a[2]<<16) | (a[3]<<24))"~
                   " | (cast(ulong) cast(uint) (a[4] | (a[5]<<8) | (a[6]<<16) | (a[7]<<24)) << 32);");
         else
             // reject unsupported field types at compile time
             static assert(false, "type of "~member~" not supported");
         a = a[type.sizeof .. $];
     }

     return tmp;
}

One other idea motivated by some comment from Vladimir:

alias fromLE = littleEndianToNative;

align (1) // just to make sure there are no gaps
struct EndOfCentralDirRecord
{
     ubyte[2] numberOfThisDisk;
     ubyte[2] numberOfCDDisk;
     ubyte[2] entriesOnThisDisk;
     ubyte[2] entries;
     ubyte[4] size;
     ubyte[4] offset;
     ubyte[2] commentLength;
}

{
     auto eocdr = getRecord!EndOfCentralDirRecord(offset);

     // Read end record data
     _diskNumber = fromLE!ushort(eocdr.numberOfThisDisk);
     _diskStartDir = fromLE!ushort(eocdr.numberOfCDDisk);

     _numEntries = fromLE!ushort(eocdr.entriesOnThisDisk);
     _totalEntries = fromLE!ushort(eocdr.entries);

     if (numEntries != totalEntries)
         throw new ZipException("multiple disk zips not supported");

     directorySize = fromLE!uint(eocdr.size);
     directoryOffset = fromLE!uint(eocdr.offset);

     if (directoryOffset + directorySize > i)
         throw new ZipException("corrupted directory");
}

using

private T getRecord(T)(ulong offset)
{
     union U
     {
         ubyte[T.sizeof] data;
         T result;
     }

     U tmp;
     tmp.data = data[offset .. offset + T.sizeof];
     return tmp.result;
}
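
A third variant, given here only as a minimal sketch: the struct keeps ordinary integer fields (as in the first idea), and the decoding is left to read from std.bitmanip, walking the fields via tupleof. The helper name readRecord is an assumption made for illustration and is not part of std.zip. If the slice is shorter than the record, read fails with a range error, so a real implementation would have to check the length first.

```d
import std.bitmanip : Endian, read;

// Field order and widths follow the end of central directory record,
// without its leading 4-byte signature.
struct EndOfCentralDirRecord
{
    ushort numberOfThisDisk;
    ushort numberOfCDDisk;
    ushort entriesOnThisDisk;
    ushort entries;
    uint size;
    uint offset;
    ushort commentLength;
}

// Hypothetical helper: decodes the fields of T one after another as
// little-endian values, consuming the local slice as it goes.
T readRecord(T)(const(ubyte)[] bytes)
{
    T result;
    foreach (ref field; result.tupleof)
        field = bytes.read!(typeof(field), Endian.littleEndian)();
    return result;
}

unittest
{
    const(ubyte)[] raw = [0, 0,  0, 0,  2, 0,  2, 0,
                          0x34, 0x12, 0, 0,  0x78, 0x56, 0, 0,  0, 0];
    auto eocdr = readRecord!EndOfCentralDirRecord(raw);
    assert(eocdr.entries == 2);
    assert(eocdr.size == 0x1234);
    assert(eocdr.offset == 0x5678);
}
```

No offsets appear in the calling code, and the result does not depend on the endianness of the host.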
Oct 20 2019
Jonathan M Davis <newsgroup.d jmdavisprog.com> writes:
On Sunday, October 20, 2019 8:03:06 AM MDT berni44 via Digitalmars-d wrote:
 Recently I fixed some bugs in std.zip, mainly concerning malware
 attacks. I would like to continue there, but I think its a good
 idea to clearify the roadmap, before I continue with more PRs.
 Here's a list of items, I've got in my mind now:

 Extracting zip files
 --------------------

 * Add structs that directly map the data in the zip records. As
 Atila pointed
 out, this should be done with serialisation instead of unions
 (what I planed
 and what he convinced me not to do, because this results in
 endianess problems).
 See below (*) for some suggestions, on how this might be done.
What would be the point of having something so low-level? The ZipMembers are already put in an AA in the ZipArchive. So, the basic design of std.zip is pretty high-level, whereas something like accessing the low-level zip records would fit better in a solution that was lower level and potentially lazy.
 * use enforce instead of if ... throw like Vladimir suggested. Do
 you think I
 should use enforce!zipException("message") or is it better to get
 rid of
 zipException?
Getting rid of ZipException would break code that currently uses std.zip and wouldn't necessarily be a desirable change anyway. enforce!ZipException is perfectly fine.
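
For comparison, a minimal sketch of what the switch from if ... throw to enforce could look like. ArchiveException is a stand-in defined here so that the example is self-contained, and checkDirectory is a hypothetical function modelled on the "corrupted directory" check quoted earlier in the thread; neither is std.zip code. Note that the condition is inverted: enforce states what must hold.

```d
import std.exception : assertThrown, basicExceptionCtors, enforce;

// Stand-in for std.zip's ZipException (an assumption for this sketch).
class ArchiveException : Exception
{
    mixin basicExceptionCtors;
}

// Hypothetical check: the central directory must end before the
// end of central directory record starts.
void checkDirectory(ulong directoryOffset, ulong directorySize,
                    ulong endRecordOffset)
{
    // before:
    //   if (directoryOffset + directorySize > endRecordOffset)
    //       throw new ArchiveException("corrupted directory");
    enforce!ArchiveException(
        directoryOffset + directorySize <= endRecordOffset,
        "corrupted directory");
}

unittest
{
    checkDirectory(10, 20, 30);  // directory fits, nothing is thrown
    assertThrown!ArchiveException(checkDirectory(10, 21, 30));
}
```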
 * fix bugs 20289 - 20294 and bug 18950 (this is part of bug
 20290).

 * Add more unittests, especially for zip64 and to reach (almost)
 100%
 coverage.


 Documentation
 -------------

 * Reversing the order of the four items in this file would bring
 the more
 important ones to the beginning. This would make for better
 documentation.
That would be terrible for diffs and really wouldn't help much with the documentation. There are already links to the top-level symbols at the top of the documentation, and the example at the top shows how the various types are related. It might make sense to add a table at the top with a short description of each top-level symbol, but rearranging the code just to change the documentation is needless churn - especially in such a small file.
 * Using unittests for examples (maybe replacing the two examples
 at the top of
 the module).
The example at the top of the module is useful and really shouldn't be removed. However, adding more documented unittest blocks as examples for specific symbols would be an improvement.
 * Rewriting some of the documentation for better ddoc
 conformation and clearer
 descriptions.
I have no clue what you mean by better ddoc conformation. As far as ddoc goes, what's there looks perfectly fine to me. If you can come up with better descriptions for stuff, that's fine, but I don't see anything obviously wrong with what's there.
 Extensions
 ----------

 * Support for ranges. For unzipping, a RandomAccessRange could be
 used, for
 zipping an OutputRange could be provided. Also, the
 ArchiveMembers could be
 accessed via an other RandomAccessRange. These ranges would (in
 some cases)
 allow for lazy evaluation, which might result in a speed gain,
 when not all
 elements of an archive need to be read (or kept in memory).
Given that ZipArchive already sticks everything in an AA, I'm not sure what you'd gain by trying to add a random access range as well. And random-access would be incompatible with lazy evaluation. In order to have random access, you'd basically have to stick everything in a dynamic array instead of an AA, which doesn't fit with the current design and wouldn't be lazy either. In order to actually have lazy evaluation, you'd have to have a completely different design, because ZipArchive creates the AA up front - and changing that would cause problems for existing code. There's certainly something to be said for a design that lazily iterates through a zip archive, but from what I can tell, that would require a completely different type from ZipArchive, and I'd argue that if that's the sort of thing that you're looking to do, it would make a lot more sense to just create a fresh solution and put it on code.dlang.org instead of changing std.zip, since it really doesn't fit into the current design, and moving away from the AA design would break existing code.
 * Multidisk? I think, it's *not* worth it. Just causes trouble
 and no one
 needs this nowadays.
Maybe in a more complex solution, but std.zip is very basic. Also, looking at the Wikipedia page for zip, it looks like the current design of std.zip is probably pretty similar to what ISO/IEC 21320-1:2015 requires, and spanning multiple volumes is not supported by that standard. I don't know that we need to stick to that standard, but it might make sense to just target what the ISO standard does for std.zip and let anything more complex be in a more complex solution on code.dlang.org if someone feels the need to write such a solution.
 * Support for versions < 2.0 (called version 20 in the docs)?
 Again I think
 that is *not* worth it.
Wikipedia doesn't even include versions older than that in the versions that it lists, and 2.0 was released in 1993. Really, the question should be whether we include support for stuff newer than that, not whether we support anything older.
 * Support for other compression methods? Maybe bzip2 and LZMA
 would be nice.
 All other methods look history to me.
That could be useful, but we'd like to avoid having Phobos depend on any more external libraries (having curl has caused us enough problems already), so we'd need implementations for them inside of Phobos. It also wouldn't be ISO compatible, so if we decided that that's what we were targeting, then it would not be appropriate.
 * Support for encryption? I'm not sure about legal issues here.
 Looks like
 PKWARE has some copyrights here. Therefore I'd skip encryption.
 In general I
 personally think, that it's better to use a separate encryption
 program,
 instead of using encryption mechanisms inside of a fixed format.
 Sooner or
 later a fixed format will produce security issues.
I'd suggest skipping it at this point. If nothing else, doing it correctly will likely require external libraries. It would be better for such a solution to exist on code.dlang.org rather than be part of Phobos. It's also explicitly banned by the ISO standard.
 * Files larger than 4GB? I see no reason, why we should block
 them. All it
 would probably need is to move from uint to ulong.
From what little I understand, that would involve being zip64 rather than normal zip, which would involve more than simply going from uint to ulong, and non-zip64 files do not support files larger than 4 GiB. So, while adding support for larger files wouldn't necessarily be a problem, it needs to be done in a way that's standards compatible and without requiring that zip files be zip64 if they don't need to be.
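
As a rough sketch of that constraint: in the zip format, a 32-bit field set to 0xFFFFFFFF (or a 16-bit field set to 0xFFFF) signals that the real value is stored in the zip64 structures, so only values below those markers fit into a classic archive. The function name needsZip64 and its parameters are assumptions for illustration, not std.zip API.

```d
// Hypothetical helper: decides whether an entry forces the zip64 format.
bool needsZip64(ulong compressedSize, ulong expandedSize,
                ulong localHeaderOffset, ulong entryCount)
{
    enum ulong marker32 = uint.max;    // 0xFFFF_FFFF is reserved as a marker
    enum ulong marker16 = ushort.max;  // 0xFFFF is reserved as a marker

    return compressedSize >= marker32
        || expandedSize >= marker32
        || localHeaderOffset >= marker32
        || entryCount >= marker16;
}

unittest
{
    assert(!needsZip64(100, 200, 0, 3));        // ordinary small entry
    assert(needsZip64(100, 5_000_000_000, 0, 3)); // larger than 4 GiB expanded
    assert(needsZip64(1, 1, 0, 70_000));        // too many entries
}
```

With a check like this, an archive would only be written as zip64 when one of its values does not fit the classic fields.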
 Concerning the order of all these points, I'd roughly go from top
 to bottom.

 What do you think? Is all of this worth it? Or should part of it
 be removed?
 Do you have other stuff, that you would like to see here?
Improvements to std.zip which expand its capabilities without breaking existing code are fine, but at some point, if you want a full, flexible implementation, it's probably better to come up with something new and put it on code.dlang.org. Also including stuff like bzip2 or encryption would likely be better in a 3rd party library that can use existing C libraries. Without digging into the whole issue in detail, I'd be tempted to argue that we should just target the ISO standard with std.zip and leave anything fancier to code.dlang.org.

- Jonathan M Davis
Oct 21 2019
Patrick Schluter <Patrick.Schluter bbox.fr> writes:
On Monday, 21 October 2019 at 08:44:30 UTC, Jonathan M Davis 
wrote:
 * Using unittests for examples (maybe replacing the two 
 examples
 at the top of
 the module).
The example at the top of the module is useful and really shouldn't be removed. However, adding more documented unittest blocks as examples for specific symbols would be an improvement.
The examples would be much more useful if they illustrated how to add or read more than 1 entry in the zipfile. I remember that I struggled a bit at the beginning to understand how it worked to go from 1 file to several files.
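
A minimal sketch of such a multi-entry example, using std.zip's ArchiveMember and ZipArchive. The function names packFiles and unpackFiles are made up for illustration; the key point is that each file gets its own ArchiveMember, and that reading goes through the directory associative array.

```d
import std.zip : ArchiveMember, CompressionMethod, ZipArchive;

// Builds an in-memory archive with one member per entry of `files`
// (member name => text content).
void[] packFiles(string[string] files)
{
    auto zip = new ZipArchive();
    foreach (name, text; files)
    {
        auto member = new ArchiveMember();
        member.name = name;
        member.expandedData = cast(ubyte[]) text.dup;
        member.compressionMethod = CompressionMethod.deflate;
        zip.addMember(member);
    }
    return zip.build();
}

// Reads every member of an archive back into name => text content.
string[string] unpackFiles(void[] archive)
{
    auto zip = new ZipArchive(archive);
    string[string] result;
    foreach (name, member; zip.directory)
        result[name] = (cast(const(char)[]) zip.expand(member)).idup;
    return result;
}

unittest
{
    auto files = ["a.txt": "first file", "dir/b.txt": "second file"];
    assert(unpackFiles(packFiles(files)) == files);
}
```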
Oct 21 2019
berni44 <dlang d-ecke.de> writes:
On Monday, 21 October 2019 at 08:44:30 UTC, Jonathan M Davis 
wrote:

First of all, thanks for pointing me to the ISO standard. I was
not aware of this standard (the German Wikipedia does not mention
it, and I must have overlooked it when I skimmed through the
English one). I agree that we should stick to that standard,
which answers several of my questions.

Second, my goal is to improve Phobos. I'm well aware that
this means avoiding breaking code at (almost) all costs. In the case
of std.zip, the only "breaking" change will be to get rid of some
constants (digiSignLength, diskNumber) that should have been
private from the very beginning or are only useful with
multidisk support. And this is done with a full deprecation cycle.

 * Add structs that directly map the data in the zip records. 
 As Atila pointed
 out, this should be done with serialisation instead of unions 
 (what I planed
 and what he convinced me not to do, because this results in 
 endianess problems).
 See below (*) for some suggestions, on how this might be done.
What would be the point of having something so low-level? The ZipMembers are already put in an AA in the ZipArchive.
Sorry, I think this was a misunderstanding. I don't want to change anything about the AA or the members in that AA. For me the question is how to get the data from this void[] array into the AA. The current approach uses lots of lines like "_diskNumber = getUshort(i + 4);" with magic constants (here 4), which is error-prone. I'd prefer to get rid of these in favour of some structs, to avoid adding bugs when fixing the existing issues.
 * Reversing the order of the four items in this file would 
 bring the more
 important ones to the beginning. This would make for better 
 documentation.
That would be terrible for diffs and really wouldn't help much with the documentation. There are already links to the top-level symbols at the top of the documentation, and the example at the top shows how the various types are related. [...]
I disagree here. First of all, documentation is made for humans, and this should be more important than diffs. (But maybe there is some aspect of diffs that I'm not aware of. For me, this would be one larger change, and after that change, the diffs would be pretty small again.)

From my experience, the links at the top are not useful for people looking into the docs of some module for the first time. At that time, one often does not know which item is the one to look for.

For example: When I first looked at the documentation of std.zip, I was looking for a way to extract an existing archive. I expected that to be done by a function called unzip, but found no such function. Skimming the example at the top, I thought this was about creating zip archives. (The comment in the first line says that this is about reading zip files; I don't know exactly why I overlooked this. Maybe because it's small letters in light gray; maybe because it's English; maybe I did not expect it there; I can't tell.) Therefore I scrolled down. I found ZipException, which is obviously of no use. After that I found CompressionMethod, which is of little use either. Next there is ArchiveMember, which is more interesting, but still does not show how to get to the content of a zip file. Finally, if you do not overlook it (which is quite possible between all those members of ArchiveMember and ZipArchive), you'll find ZipArchive. But again you don't know how to use it. The constructor I was looking for is actually the second to last item, very well hidden.

I went into detail here because I think this is important. I had this experience over and over again with the documentation of Phobos. Comparing this to the documentation of PHP [1], which I never had any problems with, I conclude that there is room for improvement. ;-) But it's getting a little bit off topic. Maybe Phobos documentation should be discussed in a separate thread.

[1] https://www.php.net/manual/de/book.zip.php
 * Using unittests for examples (maybe replacing the two 
 examples at the top of the module).
The example at the top of the module is useful and really shouldn't be removed.
No, I don't want to remove them. I just want to move them into a unittest to make sure that documentation and implementation do not diverge too much.
 * Rewriting some of the documentation for better ddoc 
 conformation and clearer descriptions.
I have no clue what you mean by better ddoc conformation.
Probably also a misunderstanding. What I mean is: As far as I know, all parameters of all functions should be listed, the return value should be mentioned, a "see also" section is recommended, and so on. Most of the functions do not have that information. For example: The documentation of "extractVersion" is "Read Only", which is not really useful, is it?
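
To illustrate the kind of documentation meant here, a small sketch of a ddoc comment with Params, Returns and See_Also sections, followed by a documented unittest that ddoc renders as an example. The function countBySuffix is invented for this illustration and does not exist in std.zip; the $(REF ...) macro is the cross-reference convention used in Phobos.

```d
import std.algorithm.searching : count, endsWith;

/**
Counts the member names that end in `suffix`.

Params:
    names  = member names, for example the keys of an archive directory
    suffix = ending to look for, for example `".txt"`

Returns: The number of entries in `names` that end in `suffix`.

See_Also: $(REF endsWith, std,algorithm,searching)
*/
size_t countBySuffix(string[] names, string suffix)
{
    return names.count!(name => name.endsWith(suffix));
}

/// The `///` turns this unittest into the example shown in the docs.
unittest
{
    assert(countBySuffix(["a.txt", "b.d", "dir/c.txt"], ".txt") == 2);
}
```

Because the example is compiled and run with the test suite, it cannot silently drift away from the implementation.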
 * Support for ranges. For unzipping, a RandomAccessRange could 
 be used, for
 zipping an OutputRange could be provided. Also, the 
 ArchiveMembers could be
 accessed via an other RandomAccessRange. These ranges would 
 (in some cases)
 allow for lazy evaluation, which might result in a speed gain, 
 when not all
 elements of an archive need to be read (or kept in memory).
Given that ZipArchive already sticks everything in an AA, I'm not sure what you'd gain by trying to add a random access range as well. And random-access would be incompatible with lazy evaluation.
This is probably only about the last range I mentioned above, isn't it? That's the one where I myself am most unsure whether it's going to be useful. One could use AA.byValue etc. instead, and that's probably sufficient. I added support for ranges to the list because it was explicitly mentioned in the vision 2016H2 [2].

I think there are use cases where laziness can be applied. Think of an archive with lots of files, where only some of them are going to be extracted. The underlying file-access range can use fseek and the like to load only the part of the archive that's going to be needed. zip's structure allows filling in the AA (almost) completely by scanning the directory at the end of the file. I write "almost" because the real data content is still missing. But that can be retrieved when it's going to be used.

[2] https://wiki.dlang.org/Vision/2016H2
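
The "directory at the end of the file" is found via the end of central directory record, which starts with the signature bytes "PK" 0x05 0x06, has a fixed part of 22 bytes and may be followed by a comment of up to 65535 bytes. A minimal sketch of the backward scan follows; the function name is hypothetical and this is not std.zip's implementation. A robust version would also verify that the stored comment length matches the position found.

```d
// Hypothetical helper: returns the offset of the end of central directory
// record inside `data`, or -1 if no signature is found.
ptrdiff_t findEndOfCentralDir(const(ubyte)[] data)
{
    enum minRecordLength = 22;       // fixed part of the record
    enum maxCommentLength = 0xFFFF;  // the comment length is a 16-bit field
    static immutable ubyte[4] signature = [0x50, 0x4B, 0x05, 0x06];

    if (data.length < minRecordLength)
        return -1;

    // Only the tail of the data can contain the record.
    const last = data.length - minRecordLength;
    const first = last > maxCommentLength ? last - maxCommentLength : 0;

    for (size_t i = last; ; --i)
    {
        if (data[i .. i + 4] == signature[])
            return i;
        if (i == first)
            break;
    }
    return -1;
}

unittest
{
    auto buf = new ubyte[27];  // 5 leading bytes + 22-byte record
    ubyte[4] sig = [0x50, 0x4B, 0x05, 0x06];
    buf[5 .. 9] = sig[];
    assert(findEndOfCentralDir(buf) == 5);
    assert(findEndOfCentralDir(new ubyte[10]) == -1);
}
```

Only the last 22 + 65535 bytes at most have to be loaded for this step, which is what makes a seek-based, lazy reader plausible.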
 In order to actually have lazy evaluation, you'd have to have a 
 completely different design, because ZipArchive creates the AA 
 up front - and changing that would cause problems for existing 
 code.
I think, it can be done with the current (public) design. I would add a new constructor that takes a range instead of void[]. All else would happen underneath.
 * Files larger than 4GB? I see no reason, why we should block
 them. All it
 would probably need is to move from uint to ulong.
From what little I understand, that would involve being zip64 
rather than normal zip,
Indeed, but zip64 is already implemented. There is just still the limit on the file size.
Oct 21 2019
Jonathan M Davis <newsgroup.d jmdavisprog.com> writes:
On Monday, October 21, 2019 7:35:37 AM MDT berni44 via Digitalmars-d wrote:
 On Monday, 21 October 2019 at 08:44:30 UTC, Jonathan M Davis
 wrote:
 * Add structs that directly map the data in the zip records.
 As Atila pointed
 out, this should be done with serialisation instead of unions
 (what I planed
 and what he convinced me not to do, because this results in
 endianess problems).
 See below (*) for some suggestions, on how this might be done.
What would be the point of having something so low-level? The ZipMembers are already put in an AA in the ZipArchive.
Sorry, I think this was a missunderstanding. I don't want to change anything of the AA nor the members in that AA. For me the question is, how to get the data from this void[]-array into the AA. The current approach uses lot's of lines like "_diskNumber = getUshort(i + 4);" with magic constants (here 4), which is error prone. I'd prefere to get rid of these in favour of some structs to avoid adding bugs when fixing the existing issues.
The constants don't look even vaguely magical. They're clearly simply based on the sizes of the values. That being said, if I were to fix it, I'd probably use append and read from std.bitmanip to replace all of the put* and get* calls, since they would take care of the sizes for you without having to deal with them directly at all. So, instead of

     putUshort(i + 4,  de._madeVersion);
     putUshort(i + 6,  de.extractVersion);
     putUshort(i + 8,  de.flags);
     putUshort(i + 10, de._compressionMethod);
     putUint  (i + 12, cast(uint) de.time);
     putUint  (i + 16, de.crc32);
     putUint  (i + 20, de.compressedSize);
     putUint  (i + 24, de.expandedSize);
     putUshort(i + 28, cast(ushort) de.name.length);
     putUshort(i + 30, cast(ushort) de.extra.length);
     putUshort(i + 32, cast(ushort) de.comment.length);
     putUshort(i + 34, de.diskNumber);
     putUshort(i + 36, de.internalAttributes);
     putUint  (i + 38, de._externalAttributes);
     putUint  (i + 42, de.offset);

you'd get something like

     buffer.append!ushort(de._madeVersion);
     buffer.append!ushort(de.extractVersion);
     buffer.append!ushort(de.flags);
     buffer.append!ushort(de._compressionMethod);
     buffer.append!uint  (cast(uint) de.time);
     buffer.append!uint  (de.crc32);
     buffer.append!uint  (de.compressedSize);
     buffer.append!uint  (de.expandedSize);
     buffer.append!ushort(cast(ushort) de.name.length);
     buffer.append!ushort(cast(ushort) de.extra.length);
     buffer.append!ushort(cast(ushort) de.comment.length);
     buffer.append!ushort(de.diskNumber);
     buffer.append!ushort(de.internalAttributes);
     buffer.append!uint  (de._externalAttributes);
     buffer.append!uint  (de.offset);

Then all of the constants disappear. Also, then std.zip isn't trying to replace functionality that's already in std.bitmanip (like it's doing right now). The sort of thing that std.zip is doing here is pretty much exactly what read, peek, write, and append in std.bitmanip were written to handle.
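
One detail to watch when trying this out: append and read in std.bitmanip default to Endian.bigEndian, while zip records are little-endian, so the endianness has to be passed explicitly. A minimal round-trip sketch follows; the function names writeFields and readUshortThenUint are placeholders for illustration only.

```d
import std.array : appender;
import std.bitmanip : Endian, append, read;

// Serialises two fields in little-endian byte order.
ubyte[] writeFields(ushort extractVersion, uint crc32)
{
    auto buffer = appender!(ubyte[])();
    buffer.append!(ushort, Endian.littleEndian)(extractVersion);
    buffer.append!(uint, Endian.littleEndian)(crc32);
    return buffer.data;
}

// Reads the two fields back; read consumes the slice as it goes, so no
// offsets are needed. `input` is a by-value copy of the caller's slice.
void readUshortThenUint(ubyte[] input, out ushort extractVersion, out uint crc32)
{
    extractVersion = input.read!(ushort, Endian.littleEndian)();
    crc32 = input.read!(uint, Endian.littleEndian)();
}

unittest
{
    auto bytes = writeFields(20, 0xCAFEBABE);
    assert(bytes == [20, 0, 0xBE, 0xBA, 0xFE, 0xCA]);

    ushort ver;
    uint crc;
    readUshortThenUint(bytes, ver, crc);
    assert(ver == 20 && crc == 0xCAFEBABE);
}
```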
 * Reversing the order of the four items in this file would
 bring the more
 important ones to the beginning. This would make for better
 documentation.
That would be terrible for diffs and really wouldn't help much with the documentation. There are already links to the top-level symbols at the top of the documentation, and the example at the top shows how the various types are related. [...]
I disagree here. First of all, documentation is made for humans and this should be more important than diffs. (But maybe there is some aspect of diffs, that I'm not aware of. For me, this would be a larger change once, and after that change, the diffs would again be pretty small.) From my experience, the links at the top are not usefull for people looking the first time into the docs of some module. At that time, one often does not know, which item is the one they are looking for. For example. When I first looked at the documentation of std.zip, I was looking for a way to extract an existing archive. I expected that to be done by a function called unzip, but found no such function. Skimming the example at the top I thought, this was about creating zip archives (The comment in the first line tells that this is about reading zip files; I don't know exactly, why I overlooked this. Maybe, because it's small letters in light gray; maybe, because it's english, maybe I did not expect it there; I can't tell). Therefore I scrolled down. I found ZipException which is obviously of no use. After that I found CompressionMethod which is of little use either. Next there is ArchiveMember, which is more interesting, but yet does not show a way on how to get to the content of a zip file. Finally, if you do not overlook it (which is quite possible between all those members of ArchiveMember and ZipArchive), you'll find ZipArchive. But again you don't know how to use it. The constructor I was looking for is actually the second to last item, very well hidden. I went into detail here, because I think this is important. I had this experience over and over again with the documentation of Phobos. Comparing this to the documentation of PHP [1], where I never had any problems with, I conclude that there is space for improvement. ;-)
Reordering stuff makes it harder to track changes over time, and with a file as small as std.zip, I have a very hard time believing that changing the order is going to help much - especially since you need to be aware of all of the types in std.zip when using it anyway. The main way to make it easier for people to figure out where they need to start is by improving the module-level documentation. For instance, std.zip could certainly use a table similar to what modules like std.array and std.range have:

https://dlang.org/phobos/std_array.html
https://dlang.org/phobos/std_range.html

Having an overview that explains what the pieces are and provides links to them is going to work far better than simply reorganizing the file will do. If the module-level documentation doesn't give enough information for people to know what they should be looking for in the module for the functionality they want, then the module-level documentation needs to be improved, regardless of what order the symbols are in in the file. And if the module-level documentation does its job properly, the order of the symbols in the documentation should be irrelevant.
 * Support for ranges. For unzipping, a RandomAccessRange could
 be used, for
 zipping an OutputRange could be provided. Also, the
 ArchiveMembers could be
 accessed via an other RandomAccessRange. These ranges would
 (in some cases)
 allow for lazy evaluation, which might result in a speed gain,
 when not all
 elements of an archive need to be read (or kept in memory).
Given that ZipArchive already sticks everything in an AA, I'm not sure what you'd gain by trying to add a random access range as well. And random-access would be incompatible with lazy evaluation.
This is probably only about the last range I mentioned above, isn't it? That's the one I myself are most unsure if it's going to be usefull. One could use AA.byValue etc instead and that's probably sufficient. I added support for ranges in the list, because it was explicitly mentioned in the vision 2016H2 [2]. I think, there are use cases, where laziness can be applied. Think of an archive with lots of files, where only some of them are going to be extracted. The underlying file-access-range can use fseek and the like to only load the part of the archive, that's gona be needed. zip's structure allowes to fill in the AA (almost) completely by scanning the directory at the end of the file. I write "almost", because the real data content is still missing. But that can be retrieved, when it's going to be used. [2] https://wiki.dlang.org/Vision/2016H2
You'd have to be careful with this, because right now, if you call ZipArchive's directory, you get the same AA that it has internally, and accessing any values in it does not cause any calculations to be done. Having them be done lazily might be nice, but you also wouldn't want to end up having stuff calculated every time you access one of the ArchiveMembers in the AA. And making it so that ArchiveMember is somehow lazy but doesn't redo any work means that it won't work when it's const. Fortunately, functions like expandedData and compressedData are not const, so maybe some of what's done there could be made lazy without breaking code, but it would have to be done carefully. The fact that std.zip is designed around filling in an AA, and it provides direct access to that AA pretty much destroys most of the possible range-based solutions - especially the lazy ones. If you can make some aspect of how the internals work lazier without causing problems, then that's fine, but I don't see much point in trying to provide a different range API on top of what's there when it's basically just an AA. I could easily see having a lazy, range-based solution which did something like iterate through the files in the zip file in whatever order they're in in the zip file, but that would require a very different design from what we currently have with std.zip. Also, whether it would be better or worse would largely depend on what the user code was trying to do. Certainly, if you can come up with something clever that improves things while not breaking existing code, then that's great, but the fact that the code provides direct access to the AA (and as the primary way to access the data no less) makes it pretty hard to change. 
Maybe it would make sense to provide some sort of range-based solution beside what we currently have in std.zip, but that's the sort of thing that should probably be put on code.dlang first and battle-tested a bit before being put into Phobos, since it's essentially designing a new set of functionality for accessing and processing zip files rather than simply improving what's already there.
 In order to actually have lazy evaluation, you'd have to have a
 completely different design, because ZipArchive creates the AA
 up front - and changing that would cause problems for existing
 code.
I think it can be done with the current (public) design. I would add a new constructor that takes a range instead of void[]. Everything else would happen underneath.
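To make the "range instead of void[]" idea a bit more concrete, here is a rough sketch of one step such a constructor would need: locating the end-of-central-directory record through a generic random-access range rather than an array. This is not code from std.zip; the names findEndOfCentralDirectory and eocdMinLength are invented for the example. The signature bytes 50 4B 05 06 and the 22-byte minimum record size come from the zip format.

```d
import std.range.primitives : ElementType, hasLength, isRandomAccessRange;
import std.traits : Unqual;

/// Size of an end-of-central-directory record without a trailing comment.
enum size_t eocdMinLength = 22;

/// Hypothetical helper: returns the index of the last end-of-central-directory
/// signature in data, or -1 if there is none.
ptrdiff_t findEndOfCentralDirectory(R)(R data)
if (isRandomAccessRange!R && hasLength!R && is(Unqual!(ElementType!R) == ubyte))
{
    if (data.length < eocdMinLength)
        return -1;

    // The record sits at the end of the archive, possibly followed by a
    // comment, so scan backwards from the last position where it could start.
    size_t i = data.length - eocdMinLength;
    while (true)
    {
        if (data[i] == 0x50 && data[i + 1] == 0x4B
            && data[i + 2] == 0x05 && data[i + 3] == 0x06)
            return cast(ptrdiff_t) i;
        if (i == 0)
            return -1;
        --i;
    }
}

unittest
{
    import std.range : chain;

    ubyte[] signature = [0x50, 0x4B, 0x05, 0x06];
    ubyte[] record = signature ~ new ubyte[](18);
    ubyte[] prefix = new ubyte[](10);

    assert(findEndOfCentralDirectory(record) == 0);
    // chain is a random-access range with length, but not an array.
    assert(findEndOfCentralDirectory(chain(prefix, record)) == 10);
    assert(findEndOfCentralDirectory(prefix) == -1);
}
```

Only indexing and length are used, so the same function accepts an array today and could accept some other random-access source later.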
 * Files larger than 4GB? I see no reason, why we should block them. All it
 would probably need is to move from uint to ulong.

From what little I understand, that would involve being zip64
rather than normal zip,
Indeed, but zip64 is already implemented. There is just still the limit on the file size.
Just make sure that whatever you're doing doesn't result in a file being treated as a zip file rather than a zip64 file when it can't legally be a zip file. - Jonathan M Davis
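As an illustration of that check, a small sketch of deciding when an entry can no longer be written as classic zip. In the zip format a 32-bit size or offset field holding 0xFFFFFFFF means "look in the zip64 extra field", so any value at or above that has to go the zip64 route. The function name needsZip64 and the constant are invented for the example and are not part of std.zip.

```d
/// Value that a 32-bit zip field reserves to signal "stored in zip64 extra field".
enum ulong zip32Sentinel = 0xFFFF_FFFF;

/// Hypothetical helper: true if any of the values no longer fits
/// into the 32-bit fields of a classic zip record.
bool needsZip64(ulong compressedSize, ulong expandedSize, ulong localHeaderOffset)
    @safe pure nothrow @nogc
{
    return compressedSize >= zip32Sentinel
        || expandedSize >= zip32Sentinel
        || localHeaderOffset >= zip32Sentinel;
}

unittest
{
    assert(!needsZip64(10, 100, 0));
    assert(needsZip64(0xFFFF_FFFF, 1, 0));
    // A 5 GB file must be written as zip64.
    assert(needsZip64(1, 5UL * 1024 * 1024 * 1024, 0));
}
```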
Oct 22 2019
prev sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 10/20/2019 7:03 AM, berni44 wrote:
 Recently I fixed some bugs in std.zip, mainly concerning malware attacks.
Within the last year, I recall a post on reddit or hackernews that had a long list of ways to make zip bombs. It'd be a great initiative to defeat all of these with std.zip.
Oct 22 2019
parent reply berni44 <dlang d-ecke.de> writes:
On Tuesday, 22 October 2019 at 20:18:44 UTC, Walter Bright wrote:
 On 10/20/2019 7:03 AM, berni44 wrote:
 Recently I fixed some bugs in std.zip, mainly concerning 
 malware attacks.
 Within the last year, I recall a post on reddit or hackernews that had a long list of ways to make zip bombs. It'd be a great initiative to defeat all of these with std.zip.
That's already done. std.zip meanwhile checks for overlapping data and rejects any such file. Actually, this (namely an issue you filed) was the reason why I started work on std.zip. Some minor stuff, like path traversal attacks, remains though.
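For anyone wondering what such an overlap check can look like, here is a generic sketch. It is not the code in std.zip. It assumes each record of the archive has already been reduced to a half-open byte interval; Segment and hasOverlap are names made up for the example.

```d
import std.algorithm.sorting : sort;

/// Hypothetical description of the bytes one record occupies: [start, end).
struct Segment
{
    ulong start;
    ulong end;
}

/// Returns true if any two segments share at least one byte.
/// Assumes start <= end for every segment.
bool hasOverlap(const(Segment)[] segments)
{
    auto sorted = segments.dup;
    sorted.sort!((a, b) => a.start < b.start);

    // Once sorted by start, any overlap shows up between neighbours.
    foreach (i; 1 .. sorted.length)
    {
        if (sorted[i].start < sorted[i - 1].end)
            return true;
    }
    return false;
}

unittest
{
    // Records that merely touch are fine.
    assert(!hasOverlap([Segment(0, 10), Segment(10, 20), Segment(20, 30)]));
    // Two entries pointing at the same data, as in an overlapping zip bomb.
    assert(hasOverlap([Segment(0, 100), Segment(50, 150)]));
    assert(!hasOverlap([]));
}
```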
Oct 23 2019
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 10/23/2019 12:24 AM, berni44 wrote:
 That's already done. std.zip meanwhile checks for overlapping data and rejects
 any such file. Actually, this (namely an issue you filed) was the reason why I
 started work on std.zip.
That's good!
 Some minor stuff, like path traversal attacks, remains though.
How about nailing all that stuff? Then you can put in the documentation for std.zip, with links to the zip bomb problems, that std.zip doesn't have those problems. There's no reason not to do an std.zip that's better than anyone else's, especially when the problems are known.
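As a sketch of what a path traversal guard could look like: the function below rejects member names that are absolute, carry a drive letter, or contain a ".." component. It is an assumption about one reasonable policy, not something std.zip does or plans to do; the name isSafeMemberName is invented for the example.

```d
import std.algorithm.iteration : splitter;
import std.array : replace;

/// Hypothetical check: true if extracting name below a target directory
/// cannot escape that directory.
bool isSafeMemberName(string name)
{
    // Zip names use '/', but treat '\' the same way to be on the safe side.
    const normalized = name.replace("\\", "/");

    if (normalized.length == 0)
        return false;
    if (normalized[0] == '/')
        return false; // absolute path
    if (normalized.length >= 2 && normalized[1] == ':')
        return false; // drive letter such as C:

    foreach (part; normalized.splitter('/'))
    {
        if (part == "..")
            return false; // would climb out of the target directory
    }
    return true;
}

unittest
{
    assert(isSafeMemberName("docs/readme.txt"));
    assert(isSafeMemberName("a..b/c"));
    assert(!isSafeMemberName("../etc/passwd"));
    assert(!isSafeMemberName("/etc/passwd"));
    assert(!isSafeMemberName("a\\..\\b"));
    assert(!isSafeMemberName("C:/x"));
}
```

Whether to reject such names or to strip the offending parts is a design decision; rejecting is the simpler behaviour to document.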
Oct 23 2019
parent reply berni44 <dlang d-ecke.de> writes:
On Wednesday, 23 October 2019 at 09:31:04 UTC, Walter Bright 
wrote:
 Some minor stuff, like path traversal attacks, remains though.
 How about nailing all that stuff? Then you can put in the documentation for std.zip, with links to the zip bomb problems, that std.zip doesn't have those problems. There's no reason not to do an std.zip that's better than anyone else's, especially when the problems are known.
That's the plan (though I have to wait for stable to be merged back into master, because there's been a regression fix). I had thought of implementing everything first and doing the documentation later, but I now think I could do the documentation in parallel with the implementation. I can start immediately with the overview table at the top, as Jonathan M. Davis suggested. Currently, this would only mention zip bombs and chameleon files, but it can be extended with every further fix.
Oct 23 2019
parent reply Petar Kirov [ZombineDev] <petar.p.kirov gmail.com> writes:
On Wednesday, 23 October 2019 at 14:54:20 UTC, berni44 wrote:
 [..]
 
 That's the plan (I have to wait for stable to be remerged back 
 into master though, because there's been a regression fix).
 [..]
Here you go: https://github.com/dlang/phobos/pull/7246 ;)
Oct 23 2019
parent berni44 <dlang d-ecke.de> writes:
On Wednesday, 23 October 2019 at 15:42:08 UTC, Petar Kirov 
[ZombineDev] wrote:
 On Wednesday, 23 October 2019 at 14:54:20 UTC, berni44 wrote:
 [..]
 
 That's the plan (I have to wait for stable to be remerged back 
 into master though, because there's been a regression fix).
 [..]
 Here you go: https://github.com/dlang/phobos/pull/7246 ;)
Thanks. ;)
Oct 23 2019