
digitalmars.D.announce - unzip parallel, 3x faster than 7zip

reply "Jay Norwood" <jayn prismnet.com> writes:
I uploaded a parallel unzip here, and the main in the examples 
folder.  Testing on my ssd drive, unzips a 2GB directory 
structure in 17.5 secs.  7zip took 55 secs on the same file.  
This restores timestamps on the regular files.  There is also a 
loop which will restore timestamps on folders; it can be 
uncommented once the fix is added to std.file.setTimes that allows 
timestamp updates on folders.  I documented a fix that I tested 
in issue 7819.

https://github.com/jnorwood/file_parallel

http://d.puremagic.com/issues/show_bug.cgi?id=7819
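The ordering matters when restoring timestamps: folder times have to be set after all the files inside them are written, because writing a file bumps the parent folder's mtime.  A minimal sketch of that idea in Python (the original tool is D; this is illustrative, not the uzp code):

```python
import os
import time
import zipfile

def restore_times(zip_path, dest):
    """Restore mtimes from the zip entries after extraction.
    Folders are done last: writing files into a folder bumps
    the folder's own mtime, so file times must be set first."""
    with zipfile.ZipFile(zip_path) as zf:
        folder_times = []
        for info in zf.infolist():
            # a zip entry stores its modification time as a
            # (year, month, day, hour, minute, second) tuple
            mtime = time.mktime(info.date_time + (0, 0, -1))
            path = os.path.join(dest, info.filename)
            if info.is_dir():
                folder_times.append((path, mtime))
            else:
                os.utime(path, (mtime, mtime))
        for path, mtime in folder_times:
            os.utime(path, (mtime, mtime))
```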


This has similar limitations to std.zip: it only does inflate or 
store, and doesn't do decryption.  There is a 4GB limit based on 
the 32-bit offsets of the zip format used.  It processes files in 
40MB blocks, using a std.parallelism foreach loop.  If an 
archived entry is larger than 40MB it will attempt to load the 
whole entry into memory; there is currently no technique in there 
to split a large single entry into blocks.

I used the streams I/O to avoid the 2GB file limits still in stdio.
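The block scheme described above (group entries into ~40MB units, expand one unit per parallel foreach iteration) can be sketched with Python's stdlib; this is an illustrative sketch of the technique, not the D implementation, and the names and block size are assumptions:

```python
import os
import zipfile
from concurrent.futures import ThreadPoolExecutor

BLOCK_SIZE = 40 * 1024 * 1024  # group entries into ~40MB work units

def plan_blocks(infos):
    """Greedily pack zip entries into blocks of roughly
    BLOCK_SIZE uncompressed bytes each."""
    blocks, cur, cur_size = [], [], 0
    for info in infos:
        cur.append(info)
        cur_size += info.file_size
        if cur_size >= BLOCK_SIZE:
            blocks.append(cur)
            cur, cur_size = [], 0
    if cur:
        blocks.append(cur)
    return blocks

def parallel_unzip(zip_path, dest):
    with zipfile.ZipFile(zip_path) as zf:
        infos = zf.infolist()
        # create the directory tree up front, single-threaded
        for info in infos:
            d = os.path.join(dest, os.path.dirname(info.filename))
            os.makedirs(d, exist_ok=True)

        def expand(block):
            # each worker reopens the archive so reads
            # don't contend on one shared file handle
            with zipfile.ZipFile(zip_path) as worker_zf:
                for info in block:
                    if not info.is_dir():
                        worker_zf.extract(info, dest)

        with ThreadPoolExecutor() as pool:
            list(pool.map(expand, plan_blocks(infos)))
```

Creating the directories serially first means the workers never race to create the same parent folder.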
Apr 05 2012
next sibling parent "Jay Norwood" <jayn prismnet.com> writes:
On Thursday, 5 April 2012 at 14:04:57 UTC, Jay Norwood wrote:
 I uploaded a parallel unzip here, and the main in the examples 
 folder.

So, below is a demo of how to use the example app in windows, where I unzipped a 2GB directory structure from a 1GB zip file, tzip.zip.

02/18/2012 03:23 PM <DIR> test
03/30/2012 11:28 AM 968,727,390 tzip.zip
04/05/2012 08:07 AM 462,364 uzp.exe
03/21/2012 10:26 AM 1,603,584 wc.exe
03/06/2012 12:20 AM <DIR> xx8
13 File(s) 1,071,302,938 bytes
14 Dir(s) 49,315,860,480 bytes free

H:\>uzp tzip.zip tz
unzipping: .\tzip.zip
finished! time: 17183 ms

02/18/2012 03:23 PM <DIR> test
04/05/2012 08:12 AM <DIR> tz
03/30/2012 11:28 AM 968,727,390 tzip.zip
04/05/2012 08:07 AM 462,364 uzp.exe
03/21/2012 10:26 AM 1,603,584 wc.exe
03/06/2012 12:20 AM <DIR> xx8
13 File(s) 1,071,302,938 bytes
15 Dir(s) 47,078,543,360 bytes free

The example supports several forms of command line:

uzp zipFilename - unzip into the current folder
uzp zipFilename destFoldername - unzip into the destination folder
uzp zipf1 zipf2 zipf3 destFoldername - unzip multiple zip files into the dest folder
uzp zipf* destFoldername - unzip multiple zip files (wildcard expansion) into the dest folder

It overwrites existing directory entries without asking in the current form.
Apr 05 2012
prev sibling next sibling parent "Jay Norwood" <jayn prismnet.com> writes:
On Thursday, 5 April 2012 at 15:07:47 UTC, Jay Norwood wrote:
........

so, a few comments about std.zip...

I attempted to use it and found that its way of unzipping is a 
memory hog, keeping the full original and all the unzipped data 
in memory.  It quickly ran out of memory on my test case.

The classes didn't lend themselves to parallel execution, so I 
broke them into a few pieces ... one creates the directory 
structure, one reads in compressed archive entries, one expands 
archive entries.

The app creates the directory structure non-parallel, using 
recursive mkdir.

I found that creating the directory structure only took about 0.4 
secs of the total time in that 2GB test.

I found that creating the directory structure, reading the zip 
entries, and expanding the data, without writing to disk, took 
less than 4 secs, with the expansion done in parallel.

The other 13 to 14 secs were all taken up by writing out the 
files, with less than a half sec of that required to update the 
timestamps.  This is on about 39k directory entries.
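A phase breakdown like this (0.4 secs for directories, under 4 secs for parallel expansion, the rest in file writes) just needs a timer wrapped around each stage.  A minimal sketch in Python (the original tool is D; the stage names are illustrative):

```python
import time
from contextlib import contextmanager

@contextmanager
def phase(name, results):
    """Record the wall-clock duration of one stage under `name`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results[name] = time.perf_counter() - start

# example: time two stages of an unzip-like job
results = {}
with phase("mkdir", results):
    time.sleep(0.01)   # stand-in for creating the directory tree
with phase("expand", results):
    time.sleep(0.02)   # stand-in for parallel expansion
```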

The 17 sec result is on the intel 510 series ssd drive.  On a 
hard drive, 7zip took 128 secs and uzp took about 70 secs.

G:\>uzp tzip.zip tz
unzipping: .\tzip.zip
finished! time: 69440 ms


It is interesting that win7 takes longer to delete these 
directories than it does to create them.
Apr 05 2012
prev sibling next sibling parent reply dennis luehring <dl.soluz gmx.net> writes:
On 05.04.2012 16:04, Jay Norwood wrote:
 I uploaded a parallel unzip here, and the main in the examples
 folder.  Testing on my ssd drive, unzips a 2GB directory
 structure in 17.5 secs.  7zip took 55 secs on the same file.

It makes no sense to benchmark different algorithms (zip <-> 7zip).  Compare only unzip against parallel unzip - nothing else makes sense.
Apr 05 2012
parent reply Timon Gehr <timon.gehr gmx.ch> writes:
On 04/05/2012 06:37 PM, dennis luehring wrote:
 On 05.04.2012 16:04, Jay Norwood wrote:
 I uploaded a parallel unzip here, and the main in the examples
 folder. Testing on my ssd drive, unzips a 2GB directory
 structure in 17.5 secs. 7zip took 55 secs on the same file.

It makes no sense to benchmark different algorithms (zip <-> 7zip).  Compare only unzip against parallel unzip - nothing else makes sense.

I think he is talking about 7zip the standalone software, not 7zip the compression algorithm.
 7zip took 55 secs _on the same file_.

Apr 05 2012
next sibling parent reply Sean Cavanaugh <WorksOnMyMachine gmail.com> writes:
On 4/5/2012 6:53 PM, Jay Norwood wrote:

I'm curious why win7 is such a dog when removing directories.  I see a lot of disk read activity going on which seems to dominate the delete time.  This doesn't make any sense to me unless there is some file caching being triggered on the files being deleted.  I don't see any virus checker app being triggered ... it all seems to be system read activity.  Maybe I'll try non-cached flags, or truncating to 0 length before deleting, and see if that results in faster execution when the files are deleted...

If you delete a directory containing several hundred thousand directories (each with 4-5 files inside, don't ask), you can see windows freeze for long periods (10+ seconds) until it is finished, which affects everything up to and including the audio mixing (it starts looping etc).
Apr 06 2012
parent Rainer Schuetze <r.sagitario gmx.de> writes:
On 4/7/2012 12:32 AM, Jay Norwood wrote:
 I got procmon to see what is going on. Win7 was doing indexing and
 thumbnails, and there was some virus checker going on, but you can get
 rid of those. Still, most of the problem just boils down to the duration
 of the delete on close being proportional to the size of the file, and
 apparently related to the access times of the disk. I sometimes see .25
 sec duration for a single file during the close of the delete operations
 on the hard drive.

Maybe it is the trim command being executed on the sectors previously occupied by the file.
 I've been using an intel 510 series 120GB drive for recording concerts.
 It is hooked up with an ineo usb3 adaptor to the front panel port of an
 rme ufx recorder. The laptop is just used as a controller ... the ufx
 does all the mixing and recording to the hard drive.

Apr 07 2012
prev sibling next sibling parent dennis luehring <dl.soluz gmx.net> writes:
On 05.04.2012 19:04, Timon Gehr wrote:
 On 04/05/2012 06:37 PM, dennis luehring wrote:
  On 05.04.2012 16:04, Jay Norwood wrote:
  I uploaded a parallel unzip here, and the main in the examples
  folder. Testing on my ssd drive, unzips a 2GB directory
  structure in 17.5 secs. 7zip took 55 secs on the same file.

It makes no sense to benchmark different algorithms (zip <-> 7zip).  Compare only unzip against parallel unzip - nothing else makes sense.

I think he is talking about 7zip the standalone software, not 7zip the compression algorithm.
  7zip took 55 secs _on the same file_.


That is OK, but he is still comparing different implementations.
Apr 06 2012
prev sibling parent dennis luehring <dl.soluz gmx.net> writes:
On 06.04.2012 01:53, Jay Norwood wrote:
 I'm curious why win7 is such a dog when removing directories.  I
 see a lot of disk read activity going on which seems to dominate
 the delete time.

Try windows safe mode (without networking - your virus scanner is disabled); press F8 before windows starts.  That seems to remove many strange pauses, blockings etc.  Still no idea why, but it's a good test environment.
Apr 06 2012
prev sibling next sibling parent "Jay Norwood" <jayn prismnet.com> writes:
 I think he is talking about 7zip the standalone software, not 
 7zip the compression algorithm.

 7zip took 55 secs _on the same file_.


Yes, that's right, both 7zip and this uzp program are using the same standard deflate format of zip for this test.  It is the only expand format that is supported in std.zip.  7zip was used to create the zip file used in the test.

7zip already has multi-core compression capability, but no multi-core uncompress.  I haven't seen any multi-core uncompress for deflate format, but I did see one for bzip2, named pbzip2.  In general, though, inflate/deflate are the fastest algorithms I've seen, when comparing the ones that are available in 7zip.  I'm happy with the 7zip performance on compress with the deflate format, but not on the uncompress, so I will be using this uzp app.

I'm curious why win7 is such a dog when removing directories.  I see a lot of disk read activity going on which seems to dominate the delete time.  This doesn't make any sense to me unless there is some file caching being triggered on the files being deleted.  I don't see any virus checker app being triggered ... it all seems to be system read activity.  Maybe I'll try non-cached flags, or truncating to 0 length before deleting, and see if that results in faster execution when the files are deleted...
Apr 05 2012
prev sibling next sibling parent "Jay Norwood" <jayn prismnet.com> writes:
On Friday, 6 April 2012 at 14:55:14 UTC, Sean Cavanaugh wrote:
 If you delete a directory containing several hundred thousand 
 directories (each with 4-5 files inside, don't ask), you can 
 see windows freeze for long periods (10+seconds) of time until 
 it is finished, which affects everything up to and including 
 the audio mixing (it starts looping etc).

Yeah, I saw posts by people doing video complaining about such things.  One good suggestion was to create many small volumes for separate projects and just do a fast format on them rather than trying to delete folders.

I got procmon to see what is going on.  Win7 was doing indexing and thumbnails, and there was some virus checker going on, but you can get rid of those.  Still, most of the problem just boils down to the duration of the delete-on-close being proportional to the size of the file, and apparently related to the access times of the disk.  I sometimes see .25 sec duration for a single file during the close of the delete operations on the hard drive.

I've been using an intel 510 series 120GB drive for recording concerts.  It is hooked up with an ineo usb3 adaptor to the front panel port of an rme ufx recorder.  The laptop is just used as a controller ... the ufx does all the mixing and recording to the hard drive.
Apr 06 2012
prev sibling next sibling parent "Jay Norwood" <jayn prismnet.com> writes:
On Saturday, 7 April 2012 at 05:02:04 UTC, dennis luehring wrote:
 7zip took 55 secs _on the same file_.


that is ok but he still compares different implementations

7zip is the program.  It unzips many formats, with the standard zip format being one of them.  The parallel D program is three times faster at decoding the zip format than 7zip decodes the same file on the same ssd drive.  That is an appropriate comparison, since 7zip has been my utility of choice for unzipping zip format files on windows for many years.

I provided the source code in the examples folder for the complete command line utility that I used, so you may build it and compare it to whatever you like and report the results.
Apr 06 2012
prev sibling next sibling parent "Jay Norwood" <jayn prismnet.com> writes:
On Saturday, 7 April 2012 at 11:41:41 UTC, Rainer Schuetze wrote:
  >
 Maybe it is the trim command being executed on the sectors 
 previously occupied by the file.

No, perhaps I didn't make it clear that the rmdir slowness is only an issue on hard drives.  I can unzip the 2GB archive in about 17.5 sec on the ssd drive, and delete it using the rmd multi-thread delete example program in less than 17 secs on the ssd drive.  The same operations on a hard drive take around 60 seconds to extract, but 1.5 to 3 minutes to delete.

H:\>uzp tzip.zip tz
unzipping: .\tzip.zip
finished! time: 17405 ms

H:\>rmd tz
removing: .\tz
finished! time:16671 ms

I've been doing some reading on the web and studying the procmon logs.  I am convinced the slow hard drive delete is an issue with seek times, since it is not an issue on the ssd.  It may be caused by fragmentation of the stored data or of the mft itself, or else it could be that ntfs is doing some book-keeping journaling.  You are right that it could be doing delete notifications to any application watching the disk activity.  I've already turned off the virus checker and the indexing, but I'm going to try the tweaks in the second link, and also try the mydefrag program in the third link, and see if anything improves the hd delete times.

http://ixbtlabs.com/articles/ntfs/index3.html
http://www.gilsmethod.com/speed-up-vista-with-these-simple-ntfs-tweaks
http://www.mydefrag.com/index.html

That mydefrag has some interesting ideas about sorting folders by full pathname on the disk as one of the defrag algorithms.  Perhaps using it, and also using unzip and zip algorithms that match the defrag algorithm, would be a nice combination.  In other words, if the zip algorithm processes the files in sorted-by-pathname order, and if the defrag algorithm has created folders that are sorted on disk by the same order, then you would expect optimally short seeks while processing the files in the order they are stored.

The mydefrag program uses the ntfs defrag api.  There is an article at the following link showing how to access it to get the Logical Cluster Numbers on disk for a file.  I suppose you could sort your file operations by start LCN of the file, for example during compression, and that might reduce the seek-related delays.

http://blogs.msdn.com/b/jeffrey_wall/archive/2004/09/13/229137.aspx
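The multi-thread delete idea (the rmd example above) splits the work the same way the unzip does: file deletions fan out to workers, then the now-empty directories are removed serially.  A hedged sketch in Python (the original is D; this is the technique, not the rmd code):

```python
import os
from concurrent.futures import ThreadPoolExecutor

def parallel_rmtree(root):
    """Delete every file under `root` from a thread pool, then
    remove the emptied directories deepest-first."""
    all_dirs, all_files = [], []
    for dirpath, dirnames, filenames in os.walk(root):
        all_dirs.append(dirpath)
        all_files.extend(os.path.join(dirpath, f) for f in filenames)
    with ThreadPoolExecutor() as pool:
        list(pool.map(os.remove, all_files))
    # a child path is always longer than its parent path, so sorting
    # by length descending removes children before their parents
    for d in sorted(all_dirs, key=len, reverse=True):
        os.rmdir(d)
```

Whether the parallelism helps depends on the drive: the thread overlaps the per-file close/delete latencies, which is where the hard-drive time goes in the measurements above.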
Apr 07 2012
prev sibling next sibling parent "Jay Norwood" <jayn prismnet.com> writes:
On Saturday, 7 April 2012 at 17:08:33 UTC, Jay Norwood wrote:
 The mydefrag program uses the ntfs defrag api.  There is an 
 article at the following link showing how to access it to get 
 the Logical Cluster Numbers on disk for a file.  I suppose you 
 could sort your file operations  by start LCN, of the file, for 
 example during compression, and that might reduce the seek 
 related delays.

 http://blogs.msdn.com/b/jeffrey_wall/archive/2004/09/13/229137.aspx

I did a complete defrag of the G hard drive, then removed the tz folder with rmd, then unzipped it again with the parallel uzp.  Then I analyzed the disk again with mydefrag.  The analysis shows the unzip resulted in over 300 fragmented files, even though I wrote each expanded file in a single operation.

So, I did a complete defrag again, then removed the folder again, and got about the same 109 secs for the delete operation on the hd (vs about 17 sec on the ssd for the same operation).  The uzp parallel unzip is about 85 secs vs about 17.5 sec on the ssd.

G:\>rmd tz
removing: .\tz
finished! time:109817 ms

G:\>uzp tzip.zip tz
unzipping: .\tzip.zip
finished! time: 85405 ms

G:\>rmd tz
removing: .\tz
finished! time:108387 ms

So ... it looks like the defrag helps, as the 109 sec values are at the low end of the range I've seen previously.  Still, it is totally surprising to me that deleting files should take longer than creating the same files.

btw, here are the windows rmdir times on the defragged hd and on the ssd drive, and the third measurement is the D parallel rmd on the ssd ... much faster in D.

G:\>cmd /v:on /c "echo !TIME! & rmdir /q /s tz & echo !TIME!"
14:34:09.06
14:36:23.36

H:\>cmd /v:on /c "echo !TIME! & rmdir /q /s tz & echo !TIME!"
14:38:44.69
14:40:02.16

H:\>rmd tz
removing: .\tz
finished! time:17536 ms
Apr 07 2012
prev sibling next sibling parent Marco Leise <Marco.Leise gmx.de> writes:
On Sat, 07 Apr 2012 21:45:04 +0200,
"Jay Norwood" <jayn prismnet.com> wrote:

 So ... it looks like the defrag helps, as the 109 sec values are 
 at the low end of the range I've seen previously.  Still it is 
 totally surprising to me that deleting files should take longer 
 than creating the same files.

Maybe the kernel caches writes, but synchronizes deletes?  (So the seek times become apparent there, and not in the writes.)  Also check the file creation flags; maybe you can hint Windows at the final file size so the files won't be fragmented?
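Marco's hint about pre-sizing files can be sketched as truncate-then-write; whether this actually avoids fragmentation depends on the filesystem and OS, and the function name here is illustrative:

```python
import os

def write_preallocated(path, data):
    """Set the file to its final size before writing, hinting the
    filesystem at the full extent up front.  Whether this prevents
    fragmentation depends on the filesystem."""
    with open(path, "wb") as f:
        f.truncate(len(data))  # reserve the final size first
        f.write(data)          # then fill it in from offset 0
```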
Apr 08 2012
prev sibling parent "Jay Norwood" <jayn prismnet.com> writes:
On Sunday, 8 April 2012 at 13:55:21 UTC, Marco Leise wrote:
 Maybe the kernel caches writes, but synchronizes deletes? (So 
 the seek times become apparent there, and not in the writes)
 Also check the file creation flags, maybe you can hint Windows 
 to the final file size and they wont be fragmented?

My understanding is that a delete operation occurs after all the file handles associated with a file are closed, assuming the other handles were opened with FILE_SHARE_DELETE.  I believe otherwise you get an error from the attempt to delete.

I'm doing some experiments with the mydefrag SortByName(), and it indicates to me that there will be huge improvements in delete efficiency available on a hard drive if you can figure out some way to get the os to arrange the files and directories in LCNs in that by-name order.  Below are the delete times from win7 rmdir on the same 2GB folder, with and without a defrag using the mydefrag SortByName().

This is win7 rmdir following the mydefrag SortByName() defrag ... less than 7 seconds:

G:\>cmd /v:on /c "echo !TIME! & rmdir /q /s tz & echo !TIME!"
9:06:33.79
9:06:40.47

This is the same rmdir without defrag of the folder ... 2 minutes 14 secs:

G:\>cmd /v:on /c "echo !TIME! & rmdir /q /s tz & echo !TIME!"
14:34:09.06
14:36:23.36

This is all on win7 ntfs, and I have no idea if similar gains are available for linux.  So, yes, whatever tricks you can play with the win api in order to get it to organize the unzipped archive into this particular order are going to make huge improvements in the speed of delete.
Apr 08 2012