How to store a file once in a zip file instead of duplicating it in 50 folders

I have a directory structure that I need to write into a zip file, and it contains a single file that is duplicated in 50 subdirectories. When users download the zip file, the duplicated file needs to appear in every directory. Is there a way to store the file once in the zip file, yet have it extracted into every subdirectory when the archive is unpacked? I cannot use shortcuts.
It would seem like ZIP would be smart enough to recognize that I have 50 duplicate files and automatically store the file once... It would be silly to make the archive 50 times larger than necessary!

It is possible within the ZIP specification to have multiple entries in the central directory point to the same local header offset. A ZIP application would have to precalculate the CRC of the file it was about to add and look for a matching entry in the central directory of the existing ZIP file. That CRC lookup would be expensive against a ZIP file containing a huge number of entries, and precalculating the CRC of huge files is also costly (CRC calculation is usually done during the compression routine).
I have not heard of a specific ZIP application that makes this optimization. However, it does look like the StuffIt X format supports duplicate file optimization:
The StuffIt X format supports "Duplicate Detection". When adding files to an archive, StuffIt detects if there are duplicate items (even if they have different file names), and only compresses the duplicates once, no matter how many copies there are. When expanded, StuffIt recreates all the duplicates from that one instance. Depending on the data being compressed, it can offer significant reductions in size and compression time.
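As a rough illustration of the CRC-lookup idea above (the archive name is a placeholder), duplicate candidates can at least be found by grouping central-directory entries on CRC-32 and uncompressed size; actually collapsing them into one stored copy would require rewriting the archive at a lower level than Python's standard zipfile module exposes:

import zipfile
from collections import defaultdict

def find_duplicate_candidates(zip_path):
    # Group members whose CRC-32 and uncompressed size match. Matching
    # CRC + size strongly suggests identical content, though only a
    # byte-for-byte comparison would prove it.
    groups = defaultdict(list)
    with zipfile.ZipFile(zip_path) as zf:
        for info in zf.infolist():
            if not info.is_dir():
                groups[(info.CRC, info.file_size)].append(info.filename)
    return [names for names in groups.values() if len(names) > 1]

# Hypothetical usage:
# for names in find_duplicate_candidates("bundle.zip"):
#     print("probable duplicates:", names)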

I just wanted to clarify that the StuffIt solution only removes duplicate files when compressing to its own proprietary format, not to ZIP.

Related

Compression container-format with arbitrary file operations

Is there a mature compression/container format that allows arbitrary file operations on the items inside, like delete/insert/update, without requiring full archive recreation?
I'm aware of Sqlar, based on the SQLite file format, which naturally supports this since the mentioned operations are just deleting/inserting/updating records containing blobs. But it is more of an experimental project created with other goals in mind, and it is not widely adopted.
UPDATE: to be more precise about what I have in mind, this is more like a file system inside the archive, where inserted files may occupy different "sectors" of the container depending on the preceding delete and update operations. But the "chain" of the file is compressed as it is added, so it effectively occupies less space than the original file.
The .zip format. You may need to copy the zip file contents to do a delete, but you don't need to recreate the archive.
Update:
The .zip format can, in principle, support the deletion and addition of entries without copying the entire zip file, as well as the re-use of the space from deleted entries. The central directory at the end can be updated and cheaply rewritten. I have heard of it being done. You would have to deal with fragmentation, as with any file system. I am not aware of an open-source library that supports using a zip file as a file system. The .zip format does not support breaking an entry into sectors that could be scattered across the zip file, as file systems do. A single entry has to be contiguous in a zip file.
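To illustrate the addition side with Python's zipfile (archive and file names are placeholders): opening an archive in append mode writes the new member after the existing data and rewrites only the central directory at the end, so the existing compressed entries are not touched. Deletion and in-place space reuse are the parts the standard library does not handle.

import zipfile

# Mode "a" leaves existing local headers and compressed data in place;
# only the new entry plus a fresh central directory are written.
with zipfile.ZipFile("container.zip", mode="a",
                     compression=zipfile.ZIP_DEFLATED) as zf:
    zf.write("new_report.csv", arcname="reports/new_report.csv")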

Caching for faster recompression of folder given edit/add/delete

Summary
Let's say I have a large number of files in a folder that I want to compress/zip before I send to a server. After I've zipped them together, I realize I want to add/remove/modify a file. Can going through the entire compression process from scratch be avoided?
Details
I imagine there might be some way to cache part of the compression process (whether it is .zip, .gz or .bzip2), to make the compression incremental, even if it results in sub-optimal compression. For example, consider the naive dictionary encoding compression algorithm. I imagine it should be possible to use the encoding dictionary on a single file without re-processing all the files. I also imagine that the loss in compression provided by this caching mechanism would grow as more files are added/removed/edited.
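That preset-dictionary intuition can be illustrated with zlib (the shared-content sample here is made up): if every file is compressed independently against the same dictionary, editing one file means recompressing only that file, at the cost of somewhat weaker compression than one solid archive.

import zlib

# Hypothetical dictionary built from content the files have in common.
SHARED_DICT = b"common headers, boilerplate and field names seen in most files"

def compress_one(data: bytes) -> bytes:
    # Each file is a self-contained compression unit that merely starts
    # from the shared dictionary, so it can be replaced on its own.
    c = zlib.compressobj(zdict=SHARED_DICT)
    return c.compress(data) + c.flush()

def decompress_one(blob: bytes) -> bytes:
    d = zlib.decompressobj(zdict=SHARED_DICT)
    return d.decompress(blob) + d.flush()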
Similar Questions
There are several questions related to this problem:
A C implementation, which implies it's possible
A C# related question, which implies it's possible by zipping individual files first?
A PHP implementation, which implies it isn't possible without a special file-system
A Java-specific adjacent question, which implies it's semi-possible?
Consulting the man page of zip, there are several relevant options:
Update
-u
--update
Replace (update) an existing entry in the zip archive only if it has
been modified more recently than the version already in the zip
archive. For example:
zip -u stuff *
will add any new files in the current directory, and update any files
which have been modified since the zip archive stuff.zip was last
created/modified (note that zip will not try to pack stuff.zip into
itself when you do this).
Note that the -u option with no input file arguments acts like the -f
(freshen) option.
Delete
-d
--delete
Remove (delete) entries from a zip archive. For example:
zip -d foo foo/tom/junk foo/harry/\* \*.o
will remove the entry foo/tom/junk, all of the files that start with
foo/harry/, and all of the files that end with .o (in any path).
Note that shell pathname expansion has been inhibited with
backslashes, so that zip can see the asterisks, enabling zip to match
on the contents of the zip archive instead of the contents of the
current directory.
Yes. The entries in a zip file are all compressed individually. You can select and copy just the compressed entries you want from any zip file to make a new zip file, and you can add new entries to a zip file.
There is no need for any caching.
As an example, the zip command does this.
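This can be seen by listing the per-entry metadata from the central directory (reusing the stuff.zip name from the man page example): each member carries its own compression method and compressed size, which is why a tool can drop, replace, or copy one member without re-deflating the rest, exactly what zip -d and zip -u do above.

import zipfile

# Every member is an independent compression unit with its own method
# and sizes recorded in the central directory.
with zipfile.ZipFile("stuff.zip") as zf:
    for info in zf.infolist():
        method = "deflate" if info.compress_type == zipfile.ZIP_DEFLATED else str(info.compress_type)
        print(f"{info.filename}: {info.file_size} -> {info.compress_size} bytes ({method})")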

Compressing large, near-identical files

I have a bunch of large HDF5 files (all around 1.7G), which share a lot of their content – I guess that more than 95% of the data of each file is found repeated in every other.
I would like to compress them in an archive.
My first attempt using GNU tar with the -z option (gzip) failed: the process was terminated when the archive reached 50G (probably a file size limitation imposed by the sysadmin). Apparently, gzip wasn't able to take advantage of the fact that the files are near-identical in this setting.
Compressing these particular files obviously doesn't require a very fancy compression algorithm, but a veeery patient one.
Is there a way to make gzip (or another tool) detect these large repeated blobs and avoid repeating them in the archive?
Sounds like what you need is a binary diff program. You can search for one, then try taking a binary diff between two of the files and compressing one file plus the resulting diff. You could get fancy and try diffing all combinations, picking the smallest ones to compress, and sending only one original.
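One concrete way to try this, assuming the xdelta3 tool is installed (bsdiff or rdiff would work similarly; file names are placeholders), is to keep one file in full and store only deltas for the others:

import subprocess

REFERENCE = "first.h5"                 # the one file kept in full
OTHERS = ["second.h5", "third.h5"]     # near-identical siblings

for path in OTHERS:
    delta = path + ".vcdiff"
    # Encode each sibling as a delta against the reference; with ~95%
    # shared content the delta should be a small fraction of 1.7 GB.
    subprocess.run(["xdelta3", "-e", "-s", REFERENCE, path, delta], check=True)

# To reconstruct a sibling later:
#   xdelta3 -d -s first.h5 second.h5.vcdiff second.h5

The reference plus the deltas can then be tarred and gzipped as usual without running into the size wall.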

Can zip files be sparse/non-contiguous?

The zip file format ends with a central directory section that then points to the individual zip entries within the file. This appears to allow zip entries to occur anywhere within the zip file itself. Indeed, self-extracting zip files are a good example: they start with an executable and all the zip entries occur after the executable bytes.
The question is: does the zip file format really allow sparse or non-contiguous zip entries, e.g. with empty or otherwise unaccounted-for bytes between entries? Both the definitive PKWARE APPNOTE and the Wikipedia article seem to allow this. Will all/most typical zip utilities work with such sparse zip files?
The use case is this: I want to be able to delete or replace zip entries in a zip file. To do this, the typical minizip etc. libraries want you to copy out the entire zip file while not copying out the deleted or replaced zip entry, which seems wasteful and slow.
Wouldn't it be better to over-allocate, say 1.5x the storage for an entry, then when deleting or replacing an entry you could figure out where the unallocated bytes were and use those directly? Using 1.5x the storage means that if the zip entry grew linearly, the reallocations should also happen amortized linearly. It would be similar to file system block allocation though probably not as sophisticated.
This would also help with a lot of the zip-based file formats out there: instead of keeping a temp directory somewhere (or in memory) with the temporarily unzipped files for editing and then rezipping the lot back into the file format, this would lessen the need for rezipping and rewriting portions of the zip file.
Are there any C/C++ libraries out there that do this?
No. Reading the central directory is optional. Zip decoders can, and some do, simply read the zip file sequentially from the beginning, expecting to see the local headers and entry data contiguously. They can complete the job of decoding without ever having looked at the central directory.
In order to do what you want, you would need to put in dummy zip entries between the useful entries in order to hold that space. At least if you want to be compatible with the rest of the zip world.
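A sketch of the dummy-entry approach with Python's zipfile (names and padding size are made up): hold slack space as a stored, uncompressed filler member next to the real entry, so every standard extractor still sees a valid archive. The later in-place rewrite of that space is not something the standard library supports; this only shows how the space can be reserved compatibly.

import zipfile

PAD_NAME = "__padding__/slot0.bin"   # hypothetical throwaway member name

with zipfile.ZipFile("editable.zip", "w") as zf:
    zf.write("document.xml", compress_type=zipfile.ZIP_DEFLATED)
    # Reserve extra room after the entry with a stored (uncompressed)
    # filler member; extractors just see a junk file they can ignore.
    zf.writestr(zipfile.ZipInfo(PAD_NAME), b"\0" * 4096,
                compress_type=zipfile.ZIP_STORED)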

Quick file access in a directory with 500,000 files

I have a directory with 500,000 files in it. I would like to access them as quickly as possible. The algorithm requires me to repeatedly open and close them (I can't have 500,000 files open simultaneously).
How can I do that efficiently? I had originally thought that I could cache the inodes and open the files that way, but *nix doesn't provide a way to open files by inode (security or some such).
The other option is to just not worry about it and hope the file system does a good job at looking up files in a directory. If that is the best option, which file systems would work best? Do certain filename patterns look up faster than others, e.g. 01234.txt vs foo.txt?
BTW this is all on Linux.
Assuming your file system is ext3, your directory is indexed with a hashed B-tree if dir_index is enabled. That's going to give you as much of a boost as anything you could code into your app.
If the directory is indexed, your file naming scheme shouldn't matter.
http://lonesysadmin.net/2007/08/17/use-dir_index-for-your-new-ext3-filesystems/
A couple of ideas:
a) If you can control the directory layout then put the files into subdirectories.
b) If you can't move the files around, then you might try different file systems; I think XFS might be good for directories with lots of entries?
If you've got enough memory, you can use ulimit to increase the maximum number of files that your process can have open at one time; I have successfully done this with 100,000 files, so 500,000 should work as well.
If that isn't an option for you, try to make sure that your dentry cache has enough room to store all the entries. The dentry cache is the filename -> inode mapping that the kernel uses to speed up file access based on filename; accessing huge numbers of different files can effectively eliminate the benefit of the dentry cache as well as introduce an additional performance hit. A stock 2.6 kernel keeps a hash of up to 256 entries per MB of RAM at a time, so if you have 2 GB of memory you should be okay for a little over 500,000 files.
Of course, make sure you perform the appropriate profiling to determine if this really causes a bottleneck.
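If raising the limit from inside the process is more convenient than a shell ulimit, the resource module can do it on Linux; the target of 500,000 still has to fit under the hard limit the administrator allows:

import resource

TARGET = 500_000
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
# The soft limit can be raised up to the hard limit without privileges;
# going beyond the hard limit requires root or an admin-raised limit.
new_soft = TARGET if hard == resource.RLIM_INFINITY else min(TARGET, hard)
resource.setrlimit(resource.RLIMIT_NOFILE, (new_soft, hard))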
The traditional way to do this is with hashed subdirectories. Assume your file names are all uniformly-distributed hashes, encoded in hexadecimal. You can then create 256 directories based on the first two characters of the file name (so, for instance, the file 012345678 would be named 01/2345678). You can use two or even more levels if one is not enough.
As long as the file names are uniformly distributed, this will keep the directory sizes manageable, and thus make any operations on them faster.
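A small sketch of that layout in Python (the storage root and single level of fan-out are arbitrary choices):

from pathlib import Path

def hashed_path(root: Path, name: str) -> Path:
    # e.g. "012345678" -> root/01/2345678, matching the example above.
    return root / name[:2] / name[2:]

root = Path("data")                      # hypothetical storage root
target = hashed_path(root, "012345678")
target.parent.mkdir(parents=True, exist_ok=True)
target.write_bytes(b"...")               # create the file in its bucket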
Another question is how much data is in the files? Is an SQL back end an option?