Compression container format with arbitrary file operations

Is there a mature compression format that allows arbitrary file operations on the items inside, like delete/insert/update, without requiring full archive recreation?
I'm aware of Sqlar, based on the SQLite file format, which naturally supports this, since the mentioned operations are just deleting/inserting/updating records containing blobs. But it is more of an experimental project, created with other goals in mind, and it is not widely adopted.
UPDATE: to be more precise about what I have in mind: this is more like a file system inside the archive, where inserted files might occupy different "sectors" inside the container, depending on the history of previous delete and update operations. But the "chain" of the file is compressed while being added, so it effectively occupies less space than the original file.

The .zip format. You may need to copy the zip file contents to do a delete, but you don't need to recreate the archive.
Update:
The .zip format can, in principle, support the deletion and addition of entries without copying the entire zip file, as well as the re-use of the space from deleted entries. The central directory at the end can be updated and cheaply rewritten. I have heard of it being done. You would have to deal with fragmentation, as with any file system. I am not aware of an open-source library that supports using a zip file as a file system. The .zip format does not support breaking an entry into sectors that could be scattered across the zip file, as file systems do. A single entry has to be contiguous in a zip file.
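For the append case, minizip (the zlib contrib library) already supports opening an existing archive and adding entries without rewriting the existing data; only the central directory at the end is rewritten. A minimal sketch, assuming an existing archive.zip (the archive and entry names are placeholders):

```cpp
// Sketch: append an entry to an existing zip without rewriting the
// existing entries, using minizip (zlib/contrib). Only the central
// directory at the end of the file is rewritten on close.
#include <cstring>
#include <minizip/zip.h>  // header location varies by install

int main() {
    zipFile zf = zipOpen("archive.zip", APPEND_STATUS_ADDINZIP);
    if (zf == nullptr) return 1;

    // Open a new entry with default deflate compression.
    if (zipOpenNewFileInZip(zf, "new-entry.txt", nullptr,
                            nullptr, 0, nullptr, 0, nullptr,
                            Z_DEFLATED, Z_DEFAULT_COMPRESSION) != ZIP_OK) {
        zipClose(zf, nullptr);
        return 1;
    }

    const char data[] = "hello, appended entry\n";
    zipWriteInFileInZip(zf, data, static_cast<unsigned>(std::strlen(data)));
    zipCloseFileInZip(zf);
    zipClose(zf, nullptr);  // rewrites the central directory
    return 0;
}
```

Deletion and re-use of the freed space, as described above, would still have to be implemented on top of this by hand.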

Related

Is it possible to store processed files where they were stored initially, using Google-provided utility templates?

One of the Google Dataflow utility templates lets us compress files in GCS (Bulk Compress Cloud Storage Files).
While it is possible to have multiple inputs for the parameter, consisting of different folders (e.g. inputFilePattern=gs://YOUR_BUCKET_NAME/uncompressed/**.csv), is it actually possible to store the 'compressed'/processed files in the same folder where they were stored initially?
If you have a look at the documentation:
The extensions appended will be one of: .bzip2, .deflate, .gz.
Therefore, the new compressed files won't match the provided pattern (*.csv), and thus you can store them in the same folder without conflict.
In addition, this is a batch process. If you look deeper into the Dataflow IO components, especially the one that reads from GCS with a pattern, you'll see that the file list (of files to compress) is read at the beginning of the job and thus doesn't evolve during the job.
Therefore, if new files arrive that match the pattern while a job is running, they won't be taken into account by the current job. You will have to run another job to pick up these new files.
Finally, one last thing: the existing uncompressed files aren't replaced by the compressed ones. That means you will have each file twice: a compressed and an uncompressed version. To save space (and money), I recommend you delete one of the two versions.

concatenate/append/merge files in C++ (Windows) without copying

How can I concatenate a few large files (total size ~3 TB) into one file using C/C++ on Windows?
I can't copy the data, because it takes too much time, so I can't use:
cmd copy
Appending One File to Another File (https://msdn.microsoft.com/en-us/library/windows/desktop/aa363778%28v=vs.85%29.aspx)
and so on (stream::readbuf(), ...)
I just need to represent a few files as one.
If this is inside your own program only, then you can create a class that virtually glues the files together, so you can read over it and make it appear as a single file (see the sketch after this answer).
If you want to physically have a single file, then no, that is not possible without copying: it requires opening the first file and appending the others, or creating a new file and appending all the files.
Neither the C/C++ library nor the Windows API has a means to concatenate files in place.
Even if such an API were available, it would be restrictive in that the first file (and, in general, every file but the last) would have to have a size that is a multiple of the disk allocation unit.
Going really, really low level, and assuming the multiple-of-allocation-size requirement is fulfilled... yes, if you unmount the drive and manipulate the file system structures directly, you could "stitch" the files together, but that would be a challenge to do for FAT, and near impossible for NTFS.
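A minimal sketch of the "virtually glue" idea from the first suggestion (the class name ConcatReader is made up; error handling is kept deliberately thin):

```cpp
// Sketch: present several files as one logical, read-only stream.
#include <fstream>
#include <string>
#include <vector>

class ConcatReader {
public:
    explicit ConcatReader(std::vector<std::string> paths)
        : paths_(std::move(paths)) {}

    // Read up to n bytes, crossing file boundaries transparently;
    // returns the number of bytes actually read.
    std::size_t read(char* buf, std::size_t n) {
        std::size_t total = 0;
        while (total < n && index_ < paths_.size()) {
            if (!stream_.is_open()) {
                stream_.open(paths_[index_], std::ios::binary);
                if (!stream_) break;  // unreadable file: give up
            }
            stream_.read(buf + total, static_cast<std::streamsize>(n - total));
            total += static_cast<std::size_t>(stream_.gcount());
            if (stream_.eof()) {      // current file exhausted: advance
                stream_.close();
                stream_.clear();
                ++index_;
            }
        }
        return total;
    }

private:
    std::vector<std::string> paths_;
    std::size_t index_ = 0;
    std::ifstream stream_;
};
```

Supporting random access (a seek method) would additionally need the cumulative sizes of the files, to map a logical offset to a (file, offset) pair.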

Should I use .tar.gz?

In the Unix world, there is a famous format called "tar.gz".
But now, I want to develop a game, and random access to files will be more efficient. If everything is archived into one compressed stream first, access becomes sequential.
I know that there are alternative formats like zip or 7z, but what about other formats?
Besides tar.gz, I'd like a small compression library that also gives me archiving features.
Should I use *.tar, or are other solutions available?
PS: I'm using C++.
"Random" access is not good on a .tar.gz, since that is a .tar file that has been wrapped in a .gz compression, so to get to things in the .tar file, you'd first have to decompress the .tar file.
It would be possible to use a .tar file that contains individual files compressed with .gz. You can read the table of content of the .tar file and find/store where all the files are in the archive, and then extract as you need. However, you may find that using your own format is "better" (for example, if I remember correctly, the "header" for a tar-archive is a file at a time, you may want to build your header in one lump, before you store the files [which does mean at least enumerating all the relevant files first, then forming the compressed variant and "patching up" the header with the offsets in compressed form]
For a game, one critical factor would probably be the decompression speed, so you may want to look at different libraries and which one has the best decompression speed. I found this when searching for a comparison:
http://catchchallenger.first-world.info//wiki/Quick_Benchmark:_Gzip_vs_Bzip2_vs_LZMA_vs_XZ_vs_LZ4_vs_LZO
You may also care about memory usage, which also varies a bit depending on algorithm.
And I'm guessing your individual files will be much smaller than the entire tar-ball of Linux, so you may want to do your own benchmark, with your own data - after all, the speed of different compression formats does, to some degree, depend on the format of the data.
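To illustrate the "read the table of contents" idea above: a sketch that walks the 512-byte headers of an (uncompressed) .tar file and records each member's name, size, and data offset. It handles only the classic header layout, ignores extensions, and uses fseek's long offset, so it is limited to smaller archives:

```cpp
// Sketch: index the members of an uncompressed .tar file by walking
// its 512-byte headers. The size field is octal ASCII at offset 124;
// each member's data is padded to the next 512-byte boundary.
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <string>
#include <vector>

struct TarEntry {
    std::string name;
    long long size;
    long long data_offset;
};

std::vector<TarEntry> indexTar(std::FILE* f) {
    std::vector<TarEntry> entries;
    char header[512];
    long long offset = 0;
    while (std::fread(header, 1, 512, f) == 512) {
        offset += 512;
        if (header[0] == '\0') break;  // zero block: end of archive
        long long size = std::strtoll(header + 124, nullptr, 8);
        entries.push_back({std::string(header, strnlen(header, 100)),
                           size, offset});
        long long padded = (size + 511) / 512 * 512;  // skip the data
        std::fseek(f, static_cast<long>(padded), SEEK_CUR);
        offset += padded;
    }
    return entries;
}
```

With such an index in hand, you can seek straight to a member's data_offset and decompress just that member (provided the members were compressed individually before being stored).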
Normally, for computer games, what you need is a format where each file is compressed individually before being assembled into one file. This is the crucial difference between the .tar.gz and .zip / .7z formats: .tar.gz is a "compressed archive", while .zip / .7z are "archives of compressed files". In fact, .tar.gz and .zip use the same compression algorithm by default (DEFLATE), and the main reason .tar.gz files are typically smaller is that they compress the entire archive instead of file-by-file, which increases the overall compression ratio.
AFAIK, most computer games use the zip format or a custom format that closely matches it, because it does per-file compression. For instance, the Quake engines (.pak, .pk3, .pk4) have long relied on an off-the-shelf zip format with a few minor additions (like a built-in checksum, I think).
The .tar.gz format is created by first making an archive that puts all the (uncompressed) files into one .tar file. Then, that big archive file is compressed with the gzip method to create the final .tar.gz file. The point is that to get any one file out of the archive, you have to decompress the entire thing. This is very appropriate for backups or large transfers, but not appropriate at all for a game engine's media archive.
That said, you could technically do the reverse of tar.gz, which is to compress each file individually with gzip and then put them together in a .tar archive. But this is probably not worth the extra trouble, as it is pretty much exactly what zip files are (in "one easy step"). So, it will be a lot easier to use an off-the-shelf all-in-one format like zip that allows you to extract individual files. There are many off-the-shelf libraries for extracting and manipulating files in zip archives; just start with libzip (not to be confused with zlib, which handles gzip / .gz).
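For example, pulling a single file out of an archive with libzip looks roughly like this (a sketch; assets.zip and textures/wall.png are placeholder names):

```cpp
// Sketch: read one member out of a zip archive with libzip, without
// touching the other entries.
#include <vector>
#include <zip.h>

int main() {
    int err = 0;
    zip_t* archive = zip_open("assets.zip", ZIP_RDONLY, &err);
    if (archive == nullptr) return 1;

    // Look up the member's uncompressed size first.
    zip_stat_t st;
    if (zip_stat(archive, "textures/wall.png", 0, &st) != 0) {
        zip_close(archive);
        return 1;
    }

    std::vector<char> buf(st.size);
    zip_file_t* member = zip_fopen(archive, "textures/wall.png", 0);
    if (member != nullptr) {
        zip_fread(member, buf.data(), buf.size());  // decompresses only this entry
        zip_fclose(member);
    }
    zip_close(archive);
    return 0;
}
```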
In the Unix world, there is a famous format called "tar.gz".
Probably the biggest reason why tarballs are so popular and famously used in Unix-like systems is that they preserve file permissions (and other metadata, I guess). I think some implementations of zip and 7z might provide that feature as an extension to the format, but most don't have it. The convenient thing with tar archives is that whatever you put in there comes out exactly the same at the other end, with all permissions and whatever else preserved. And the gzip compression (from zlib) has historically been an industry-standard compression algorithm, although there are now better ones that tar also supports, such as .tar.lzma (or .tlz) or .tar.xz.
but what about other formats?
There aren't really that many other formats. Compressed archive formats mostly reuse the same few algorithms (DEFLATE, LZ77 / LZMA / LZMA2, bzip2, etc.), and formats like zip / 7z / rar are really just container formats that can employ any of those compression algorithms (and even mix and match, depending on the individual file types). The point is that you won't really find much that is better than zip or 7z, and their competitors are more or less gone today (like rar?).
Should I use *.tar or other solutions are available?
No, use zip or 7z. Tarballs are for backups. They are optimized for that purpose (e.g., dump a large folder full of files into a tarball and recover it later, with everything preserved and with the best full-archive compression). For your application, zip or 7z is more appropriate.

How to store a file once in a zip file instead of duplicating it in 50 folders

I have a directory structure that I need to write into a zip file that contains a single file that is duplicated in 50 sub directories. When users download the zip file, the duplicated file needs to appear in every directory. Is there a way to store the file once in a zip file, yet have it downloaded into the subdirectories when it is extracted? I cannot use shortcuts.
It would seem like Zip would be smart enough to recognize that I have 50 duplicate files and automatically store the file once... It would be silly to make this file 50 times larger than necessary!
It is possible within the ZIP specification to have multiple entries in the central directory point to the same local header offset. The ZIP application would have to precalculate the CRC of the file it was going to add and find a matching entry in the central directory of the existing ZIP file. A query for the CRC lookup against a ZIP file that contains a huge number of entries would be an expensive operation. It would also be costly to precalculate the CRC on huge files (CRC calculations are usually done during the compression routine).
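A sketch of that precalculation step, using zlib's crc32() to key a duplicate lookup. The map here stands in for a query against the central directory of an existing archive, and a real implementation would also compare sizes/bytes, since CRC-32 can collide:

```cpp
// Sketch: precompute a file's CRC-32 with zlib so duplicates can be
// detected before adding a new entry.
#include <cstdio>
#include <string>
#include <unordered_map>
#include <zlib.h>

uLong fileCrc32(const char* path) {
    std::FILE* f = std::fopen(path, "rb");
    if (f == nullptr) return 0;
    uLong crc = crc32(0L, Z_NULL, 0);   // zlib's initial CRC value
    unsigned char buf[65536];
    std::size_t n;
    while ((n = std::fread(buf, 1, sizeof buf, f)) > 0)
        crc = crc32(crc, buf, static_cast<uInt>(n));
    std::fclose(f);
    return crc;
}

// CRC -> local header offset of the entry already holding that data.
std::unordered_map<uLong, long long> seen;

// If a matching CRC exists, the new central directory entry could be
// pointed at the existing local header offset instead of storing the
// data again. (CRC collisions make a byte-for-byte check advisable.)
bool isDuplicate(const char* path, long long* existing_offset) {
    auto it = seen.find(fileCrc32(path));
    if (it == seen.end()) return false;
    *existing_offset = it->second;
    return true;
}
```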
I have not heard of a specific ZIP application that makes this optimization. However, it does look like StuffIt X format supports duplicate file optimization:
The StuffIt X format supports "Duplicate Detection". When adding files to an archive, StuffIt detects if there are duplicate items (even if they have different file names), and only compresses the duplicates once, no matter how many copies there are. When expanded, StuffIt recreates all the duplicates from that one instance. Depending on the data being compressed, it can offer significant reductions in size and compression time.
I just wanted to clarify that the StuffIt solution only removes duplicate files when compressing to their own proprietary format, not ZIP.

Can zip files be sparse/non-contiguous?

The zip file format ends with a central directory section that then points to the individual zip entries within the file. This appears to allow zip entries to occur anywhere within the zip file itself. Indeed, self-extracting zip files are a good example: they start with an executable and all the zip entries occur after the executable bytes.
The question is: does the zip file format really allow sparse or non-contiguous zip entries, e.g. where there are empty or otherwise unaccounted-for bytes between zip entries? Both the definitive PKWARE APPNOTE and the Wikipedia article seem to allow this. Will all/most typical zip utilities work with such sparse zip files?
The use case is this: I want to be able to delete or replace zip entries in a zip file. To do this, the typical libraries (minizip etc.) want you to copy out the entire zip file, skipping the deleted or replaced zip entry, which seems wasteful and slow.
Wouldn't it be better to over-allocate, say 1.5x the storage for an entry, then when deleting or replacing an entry you could figure out where the unallocated bytes were and use those directly? Using 1.5x the storage means that if the zip entry grew linearly, the reallocations should also happen amortized linearly. It would be similar to file system block allocation though probably not as sophisticated.
This would also help with a lot of the zip-based file formats out there. Instead of keeping a temp directory somewhere (or even in memory) with the temporarily unzipped files for editing, and then rezipping the lot back into the file format, this would lessen the need for rezipping and rewriting portions of the zip file.
Are there any C/C++ libraries out there that do this?
No. Reading the central directory is optional. zip decoders can, and some do, simply read the zip file sequentially from the beginning, expecting to see the local headers and entry data contiguously. They can complete the job of decoding, never having even looked at the central directory.
In order to do what you want, you would need to put in dummy zip entries between the useful entries in order to hold that space. At least if you want to be compatible with the rest of the zip world.
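A sketch of what such a dummy entry could look like: a stored (method 0) entry whose data fills the gap, so sequential readers still see a contiguous, well-formed stream. The field layout follows the public zip APPNOTE; the entry name "_pad_" is made up for illustration:

```cpp
// Sketch: build a "dummy" stored zip entry to hold free space, so
// streaming decoders that ignore the central directory still parse
// the file. CRC of the padding bytes is computed with zlib.
#include <cstdint>
#include <vector>
#include <zlib.h>

static void put16(std::vector<uint8_t>& v, uint16_t x) {
    v.push_back(x & 0xFF);
    v.push_back(x >> 8);
}
static void put32(std::vector<uint8_t>& v, uint32_t x) {
    put16(v, x & 0xFFFF);
    put16(v, x >> 16);
}

// Produce a stored entry named "_pad_" whose total size, header
// included, is exactly gap_size bytes (gap_size must cover the
// 30-byte fixed header plus the 5-byte name).
std::vector<uint8_t> makePadding(uint32_t gap_size) {
    const char name[] = "_pad_";
    const uint32_t name_len = sizeof(name) - 1;
    const uint32_t overhead = 30 + name_len;
    std::vector<uint8_t> v;
    if (gap_size < overhead) return v;  // gap too small to pad this way
    const uint32_t data_size = gap_size - overhead;
    std::vector<uint8_t> data(data_size, 0);

    put32(v, 0x04034b50);               // local file header signature
    put16(v, 20);                       // version needed to extract
    put16(v, 0);                        // general purpose bit flags
    put16(v, 0);                        // method 0 = stored
    put16(v, 0);                        // DOS mod time
    put16(v, 0);                        // DOS mod date
    put32(v, crc32(0L, data.data(), data_size));  // CRC-32 of padding
    put32(v, data_size);                // compressed size
    put32(v, data_size);                // uncompressed size
    put16(v, name_len);                 // file name length
    put16(v, 0);                        // extra field length
    v.insert(v.end(), name, name + name_len);
    v.insert(v.end(), data.begin(), data.end());
    return v;
}
```

Such a padding entry could be left out of the central directory, so directory-based readers never see it, while streaming readers still parse past it cleanly.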