I wrote code to delete a file in a zip with minizip, referencing http://www.winimage.com/zLibDll/del.cpp.
I have to delete and modify files in the zip frequently, and the zip file I'm working with is 1.6 GB.
Deleting a file in the zip means:
copy the whole zip file, except the file to delete, to a new zip file;
delete the old zip file;
rename the new zip file to the old name.
So deleting and modifying (delete and add) is too slow.
How can I make deleting and modifying a file in the zip faster?
Is there any idea?
You can write your own code to delete a zip entry in place. That is a little more risky, since if there is a problem or the system goes down in the middle of the operation, you will have lost the file. Your current approach, copying the zip file, assures that you always have one good zip file available if something goes south.
The .ZIP File Format Specification provides all the information you need to write your own deleter. The structure of a zip file is relatively straightforward, but it will take some attention to detail to work through all the possibilities.
The deletion operation will still require copying all of the zip file content that follows the deleted entry down.
Having done that, adding a file in place will be relatively fast, since the new entry just goes at the end and the central directory is rewritten. If the deletions and additions involve the same file or files, they will naturally end up at the end, and the in-place operations should be faster than copying the whole zip file.
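The cheap-append behavior described above can be seen with Python's zipfile as a stand-in (a sketch, not the minizip API): opening an existing archive in mode "a" writes the new entry at the end and rewrites only the central directory, without copying the existing entries' data.

```python
import os
import tempfile
import zipfile

path = os.path.join(tempfile.mkdtemp(), "demo.zip")

# Build an archive with one larger entry.
with zipfile.ZipFile(path, "w", zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("big.bin", b"x" * 100_000)

# Append a second entry in place: big.bin's compressed data is not
# copied; only the central directory at the end is rewritten.
with zipfile.ZipFile(path, "a", zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("new.txt", b"appended without rewriting big.bin")

with zipfile.ZipFile(path) as zf:
    names = zf.namelist()

os.remove(path)
print(names)  # ['big.bin', 'new.txt']
```

The same principle is what makes in-place addition fast in any zip library: the data region is append-only and only the small directory at the tail needs rewriting.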
Related
Is there a mature compression format that allows arbitrary file operations on the items inside, like Delete/Insert/Update, without requiring full archive recreation for this?
I'm aware of Sqlar, based on the SQLite file format, which naturally supports this since the mentioned operations are just deleting/inserting/updating records containing blobs. But it is more of an experimental project created with other goals in mind, and it is not widely adopted.
UPDATE: to be more precise about what I have in mind: this is more like a file system inside the archive, where inserted files might occupy different "sectors" inside the container, depending on the history of previous delete and update operations. But each file's "chain" is compressed as it is added, so it effectively occupies less space than the original file.
The .zip format. You may need to copy the zip file contents to do a delete, but you don't need to recreate the archive.
Update:
The .zip format can, in principle, support the deletion and addition of entries without copying the entire zip file, as well as the re-use of the space from deleted entries. The central directory at the end can be updated and cheaply rewritten. I have heard of it being done. You would have to deal with fragmentation, as with any file system. I am not aware of an open-source library that supports using a zip file as a file system. The .zip format does not support breaking an entry into sectors that could be scattered across the zip file, as file systems do. A single entry has to be contiguous in a zip file.
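That the central directory sits at the end and is cheap to locate and rewrite can be demonstrated by parsing the End of Central Directory (EOCD) record directly. This sketch assumes a small archive with no ZIP64 extensions and no archive comment; the field layout comes from the .ZIP File Format Specification.

```python
import io
import struct
import zipfile

buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("a.txt", b"hello")
    zf.writestr("b.txt", b"world")

data = buf.getvalue()

# The EOCD record starts with the signature PK\x05\x06 near the end.
eocd_pos = data.rfind(b"PK\x05\x06")
# Layout: signature, disk numbers, entry counts, directory size, offset,
# comment length -- 22 bytes total for the fixed part.
fields = struct.unpack("<4s4H2LH", data[eocd_pos:eocd_pos + 22])
total_entries = fields[4]
cd_offset = fields[6]

print(total_entries)                                   # 2
print(data[cd_offset:cd_offset + 4] == b"PK\x01\x02")  # True: central dir header
```

Rewriting the directory after an in-place change means regenerating everything from `cd_offset` onward, which is small compared to the entry data.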
Summary
Let's say I have a large number of files in a folder that I want to compress/zip before I send to a server. After I've zipped them together, I realize I want to add/remove/modify a file. Can going through the entire compression process from scratch be avoided?
Details
I imagine there might be some way to cache part of the compression process (whether it is .zip, .gz or .bzip2), to make the compression incremental, even if it results in sub-optimal compression. For example, consider the naive dictionary encoding compression algorithm. I imagine it should be possible to use the encoding dictionary on a single file without re-processing all the files. I also imagine that the loss in compression provided by this caching mechanism would grow as more files are added/removed/edited.
Similar Questions
There are two questions related to this problem:
A C implementation, which implies it's possible
A C# related question, which implies it's possible by zipping individual files first?
A PHP implementation, which implies it isn't possible without a special file-system
A Java-specific adjacent question, which implies it's semi-possible?
Consulting the man page of zip, there are several relevant commands:
Update
-u
--update
Replace (update) an existing entry in the zip archive only if it has
been modified more recently than the version already in the zip
archive. For example:
zip -u stuff *
will add any new files in the current directory, and update any files
which have been modified since the zip archive stuff.zip was last
created/modified (note that zip will not try to pack stuff.zip into
itself when you do this).
Note that the -u option with no input file arguments acts like the -f
(freshen) option.
Delete
-d
--delete
Remove (delete) entries from a zip archive. For example:
zip -d foo foo/tom/junk foo/harry/\* \*.o
will remove the entry foo/tom/junk, all of the files that start with
foo/harry/, and all of the files that end with .o (in any path).
Note that shell pathname expansion has been inhibited with
backslashes, so that zip can see the asterisks, enabling zip to match
on the contents of the zip archive instead of the contents of the
current directory.
Yes. The entries in a zip file are all compressed individually. You can select and copy just the compressed entries you want from any zip file to make a new zip file, and you can add new entries to a zip file.
There is no need for any caching.
As an example, the zip command does this.
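A minimal sketch of that selection step, using Python's zipfile: build a new archive containing only the entries you want to keep. (Note an assumption here: zipfile re-compresses each kept entry when copying this way; the `zip` command copies the raw compressed bytes instead, which is faster, but the selection principle is the same.)

```python
import io
import zipfile

src_buf = io.BytesIO()
with zipfile.ZipFile(src_buf, "w", zipfile.ZIP_DEFLATED) as src:
    src.writestr("keep.txt", b"keep me")
    src.writestr("drop.txt", b"drop me")

dst_buf = io.BytesIO()
with zipfile.ZipFile(src_buf) as src, \
     zipfile.ZipFile(dst_buf, "w", zipfile.ZIP_DEFLATED) as dst:
    for info in src.infolist():
        if info.filename != "drop.txt":   # skip the entry to delete
            dst.writestr(info, src.read(info))

with zipfile.ZipFile(dst_buf) as dst:
    kept = dst.namelist()
print(kept)  # ['keep.txt']
```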
I have written a zip class that uses functions and code from miniz to: Open an archive, Close an archive, Open a file in the archive, Close a file in the archive, and write to the currently open file in the archive.
Currently opening a file in an archive overwrites it if it already exists. I would like to know if it is possible to APPEND to a file within a zip archive that has already been closed?
I want to say that it is possible, but I would have to edit the offsets in each of the other files' internal states and within the central directory. If it is possible, is this the right path to look into?
Note:
I deal with large files so decompressing and compressing again is not ideal and neither is doing any copying of files. I would just like to "open" a file in the zip archive to continue writing compressed data to it.
I would just like to "open" a file in the zip archive to continue writing compressed data to it.
Zip archives don't work like a file system or folder, where you can change individual files in place. Each entry records its compressed size and a CRC-32 checksum, and the central directory records every entry's offset, so appending data to an entry that has already been written would invalidate all of those.
So no, you can't do this in place: you have to unpack the entry, apply your changes, and compress it again, rewriting everything that follows it.
The zip file format ends with a central directory section that then points to the individual zip entries within the file. This appears to allow zip entries to occur anywhere within the zip file itself. Indeed, self-extracting zip files are a good example: they start with an executable and all the zip entries occur after the executable bytes.
The question is: does the zip file format really allow sparse or non-contiguous zip entries, e.g. if there are empty or otherwise unaccounted-for bytes between zip entries? Both the definitive PKWARE APPNOTE and the Wikipedia article seem to allow this. Will all/most typical zip utilities work with such sparse zip files?
The use case is this: I want to be able to delete or replace zip entries in a zip file. To do this, the typical minizip etc. libraries want you to copy out the entire zip file while not copying out the deleted or replaced zip entry, which seems wasteful and slow.
Wouldn't it be better to over-allocate, say 1.5x the storage for an entry, then when deleting or replacing an entry you could figure out where the unallocated bytes were and use those directly? Using 1.5x the storage means that if the zip entry grew linearly, the reallocations should also happen amortized linearly. It would be similar to file system block allocation though probably not as sophisticated.
This also helps with a lot of the zip-based file formats out there. Instead of having to have some temp directory somewhere (or even in memory) with the temporarily unzipped files for editing/changing and then having to rezip the lot back into the file format, this would lessen the need for rezipping and rewriting portions of the zip file.
Are there any C/C++ libraries out there that do this?
No. Reading the central directory is optional. zip decoders can, and some do, simply read the zip file sequentially from the beginning, expecting to see the local headers and entry data contiguously. They can complete the job of decoding, never having even looked at the central directory.
In order to do what you want, you would need to put in dummy zip entries between the useful entries in order to hold that space. At least if you want to be compatible with the rest of the zip world.
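The dummy-entry trick can be sketched with Python's zipfile: reserve a gap between useful entries as a real, stored (uncompressed) zip member, so sequential decoders that never read the central directory still see a valid stream. (The padding name `_pad0` is made up for illustration.)

```python
import io
import zipfile

buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("first.txt", b"first entry")
    # A stored entry full of zeros holds the reserved space as a real
    # zip member, keeping the file contiguous for sequential readers.
    pad = zipfile.ZipInfo("_pad0")
    pad.compress_type = zipfile.ZIP_STORED
    zf.writestr(pad, b"\x00" * 4096)
    zf.writestr("second.txt", b"second entry")

with zipfile.ZipFile(buf) as zf:
    names = zf.namelist()
print(names)  # ['first.txt', '_pad0', 'second.txt']
```

Reusing the space later would mean overwriting the padding member's region with a new local header and entry, then rewriting the central directory to match.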
I wonder what the fastest way to erase part of a file in C++ is.
I know the approach of writing a second file and skipping the part you want removed, but I think that is slow when you work with big files.
What about database systems, how do they remove records so fast?
A database keeps an index, with metadata listing which parts of the file are valid and which aren't. To delete data, just the index is updated to mark that section invalid, and the main file content doesn't have to be changed at all.
Database systems typically just mark deleted records as deleted, without physically recovering the unused space. They may later reuse the space occupied by deleted records. That's why they can delete parts of a database quickly.
The ability to quickly delete a portion of a file depends on the portion of the file you wish to delete. If the portion of the file that you are deleting is at the end of the file, you can simply truncate the file, using OS calls.
Deleting a portion of a file from the middle is potentially time consuming. Your choice is to either move the remainder of the file forward, or to copy the entire file to a new location, skipping the deleted portion. Either way could be time consuming for a large file.
The fastest way I know is to open the data file as a persisted memory-mapped file and simply move the remaining data over the part you don't need. That would be faster than copying to a second file, but still not very fast with big files.
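The memory-map-and-shift approach above can be sketched as follows (Python's mmap rather than C++ for brevity; the equivalent would use mmap/MapViewOfFile plus memmove and ftruncate): map the file, move the tail forward over the region to delete, then truncate.

```python
import mmap
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "data.bin")
with open(path, "wb") as f:
    f.write(b"AAAA" + b"DELETE" + b"BBBB")

start, length = 4, 6                 # region to erase
size = os.path.getsize(path)
with open(path, "r+b") as f:
    with mmap.mmap(f.fileno(), size) as mm:
        # Shift the tail over the deleted region: move(dest, src, count).
        mm.move(start, start + length, size - start - length)
    # Drop the now-duplicated bytes at the end (after the map is closed).
    f.truncate(size - length)

with open(path, "rb") as f:
    result = f.read()
os.remove(path)
print(result)  # b'AAAABBBB'
```

This avoids a second file, but the cost is still proportional to the size of the tail being moved, which is why deleting near the start of a large file remains expensive.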