Make zstd compressed files 'rsyncable' like gzip does with --rsyncable option - compression

Is there a way to make zstd compressed files 'rsyncable' like gzip does with --rsyncable option?
I've tried splitting input files into fixed length chunks and compressing them separately with no luck.
About the --rsyncable option:
When you synchronize a compressed file between two computers, this option allows rsync to transfer only files that were changed in the archive instead of the entire archive. Normally, after a change is made to any file in the archive, the compression algorithm can generate a new version of the archive that does not match the previous version of the archive. In this case, rsync transfers the entire new version of the archive to the remote computer. With this option, rsync can transfer only the changed files as well as a small amount of metadata that is required to update the archive structure in the area that was changed.

With version 1.3.8 zstd introduced --rsyncable mode.

I've tried splitting input files into fixed length chunks and compressing them separately with no luck.
This should work NP provided that you only change the bytes without moving them.
That is, if you split "The hog crawled under the high fence" into fixed-size chunks ["The hog ", "crawled ", "under th", "e high f", "ence"] and then independently compress them, then changing "hog" to "dog" will be rsync-friendly, because the compressed version of the remaining chunks, ["crawled ", "under th", "e high f", "ence"], will still be the same.
If, on the other hand, you move the bytes, like when you replace the "hog" with the "caterpillar", then splitting will no longer help, because the chunks ["The cat", "erpillar", " crawled", " under t", "he high ", "fence"] are now different and so also is different the compressed version of them.
Rsync will help with the former but not with the latter.
If you want arbitrary modifications, you'd need a smart chunk splitting alrogithm that gravitates towards certain points of the file. For example, if you split "The hog crawled under the high fence" on space, into "The ", "hog ", "crawled ", "under ", "the ", "high ", "fence", then replacing "hog" with "caterpillar" will only change one compressed chunk, allowing rsync not to transfer the rest of them.
P.S. Looks like LBFS uses such a chunk splitting scheme: "by sliding a 48 byte window over the file and computing the Rabin fingerprint of each window. When the low 13 bits of the fingerprint are zero LBFS calls those 48 bytes a breakpoint and ends the current block and begins a new one"

Related

Is double compression less effective?

Let's say we have multiple packages stored as .tar.gz files and we want to combine them into one bundle. Everything I know about lossless file compression is that it attempts to find patterns in the data. From that, my intuition is that it would be able to find more patterns and therefore produce smaller bundle if I first decompress the packages into .tar files and then combine them into one bundle.tar.gz. Is my intuition correct? Or is it not worth the hassle and creating the bundle from the .tar.gz files directly would produce similar results?
I tested it with a random collection of txts (RFC 1-500 from https://www.rfc-editor.org/retrieve/bulk/) and compressing each of them individually and then creating the final .tar.gz from the compressed files yields a 15% bigger result, which supports my intuition but maybe not to an extent I expected.
total size of txts: 5.6M
total size of individually compressed txts: 2.7M
size of .tar.gz from txts: 1.4M
size of .tar.gz from compressed txts: 1.6M
I would like to understand more how it behaves in general.
Compressing something with gzip that is already compressed will generally expand the data, but only by a very small amount, multiplying the size by about 1.0003.
The fact that you are getting a 15% benefit from decompressing the pieces and recompressing the bundle means that your pieces must be relatively small in order for gzip's 32K byte matching distance to find more matches and increase the compression by that much. (You did not say how many of these individually compressed texts there were.)
By the way, it is easy to combine several .tar files into a single .tar file. Each .tar file is terminated with 1024 zero bytes. Strip that from every .tar file except the last one, and concatenate them. Then you have one .tar file to compress.

What's the smallest possible file size on disk?

I'm trying to find a solution to store a binary file in it's smallest size on disk. I'm reading vehicles VIN and plate number from a database that is 30 Bytes and when I put it in a txt file and save it, its size is 30B, but its size on disk is 4KB, which means if I save 100000 files or more, it would kill storage space.
So my question is that how can I write this 30B to an individual binary file to its smallest size on disk, and what is the smallest possible size of 30B on disk including other info such as file name and permissions?
Note: I do not want to save those text in database, just I want to make separate binary files.
the smallest size of a file is always the cluster size of your disk, which is typically 4k. for data like this, having many records in a single file is really the only reasonable solution.
although another possibility would be to store those files in an archive, a zip file for example. under windows you can even access the zip contents pretty similar to ordinary files in explorer.
another creative possibility: store all the data in the filename only. a zero byte file takes only 1024 bytes in the MFT. (assuming NTFS)
edit: reading up on resident files, i found that on the newer 4k sector drives, the MFT entry is actually 4k, too. so it doesn't get smaller than this, whether the data size is 0 or not.
another edit: huge directories, with tens or hundreds of thousands of entries, will become quite unwieldy. don't try to open one in explorer, or be prepared to go drink a coffee while it loads.
Most file systems allocate disk space to files in chunks. It is not possible to take less than one chunk, except for possibly a zero-length file.
Google 'Cluster size'
You should consider using some indexed file library like gdbm: it is associating to arbitrary key some arbitrary data. You won't spend a file for each association (only a single file for all of them).
You should reconsider your opposition to "databases". Sqlite is a library giving you SQL and database abilities. And there are noSQL databases like mongodb
Of course, all this is horribly operating system and file system specific (but gdbm and sqlite should work on many systems).
AFAIU, you can configure and use both gdbm and sqlite to be able to store millions of entries of a few dozen bytes each quite efficienty.
on filesystems you have the same problem. the smallest allocate size is one data-node and also a i-node. For example in IBM JFS2 is the smallest blocksize 4k and you have a inode to allocate. The second problem is you will write many file in short time. It makes a performance problems, to write in short time many inodes.
Every write operation must jornaled and commit. Or you us a old not jornaled filesystem.
A Idear is, grep many of your data recorders put a separator between them and write 200-1000 in one file.
for example:
0102030400506070809101112131415;;0102030400506070809101112131415;;...
you can index dem with the file name. Sequence numbers or so ....

Remove beginning of file without rewriting the whole file

I have an embedded Linux system, that stores data in a very large file, appending new data to the end. As the file size grows near filling available storage space, I need to remove oldest data.
Problem is, I can't really accept the disruption it would take to move the massive bulk of data "up" the file, like normal - lock the file for an extended period of time just to rewrite it (plus this being a flash medium, it would cause unnecessary wear to the flash).
Probably the easiest way would be to split the file into multiple smaller ones, but this has several downsides related to how the data is handled and processed - all the 'client end' software expects single file. OTOH it can handle 'corruption' of having the first record cut in half, so the file doesn't need to be trimmed at record offsets, just 'somewhere up there', e.g. first few iNodes freed. Oldest data is obsolete anyway so even more severe corruption of the beginning of the file is completely acceptable, as long as the 'tail' remains clean, and liberties can be taken with how much exactly is removed - 'roughly several first megabytes' is okay, no need for 'first 4096KB exactly' precision.
Is there some method, API, trick, hack to truncate beginning of file like that?
You can achieve the goal with Linux kernel v3.15 above for ext4/xfs file system.
int ret = fallocate(fd, FALLOC_FL_COLLAPSE_RANGE, 0, 4096);
See here
Truncating the first 100MB of a file in linux
The easiest solution for your old applications would be a FUSE filesystem which gives them access to the underlying file, but with the offset cyclically shifted. This would allow you to implement a ringbuffer at the physical level. The FUSE layer would be fairly trivial as it only needs to adjust all filepositions by a constant, modulo filesize.
What about setting up a separate process that renames the output file when it reaches a predefined size (for instance by adding the linux time at the end of the file name).
This would allow you to keep the old data and the main process will recreate the output file the next time it writes to it.
Another cron job may remove the old file every now and then.

tar.Z file format, structure, header

I am trying to figure out the file layout of
tar.Z file. (so called .taz file. compressed tar file).
this file can be produced with tar -Z option or
using unix compress utility(result are same)
I tried to google some document about this file structure
but there is no documentation about this file structure.
I know that this is LZW compressed file and starts with
its magic number "1F 9D" but thats all I can figure out.
someone please tell me more details about the file header or
anything.
I am not interested about how to uncompress this file, or
what linux command can process this file.
I want to know is internal file structure/header/format/layout.
thank you in advance
A .Z file is compressed using compress and can be uncompressed with uncompress (or on some machines this is called uncompress.real). This .Z file can hold any data. .tar.Z or .taz is just a .tar file that is compressed with compress.
The first 2 bytes (MAGIC_1 and MAGIC_2) are used to check if the .Z file really is a .Z file, and not something else with accidentally the same extension. These bytes are hardcoded in the sources.
The third byte is a settings byte and holds 2 values:
The most significant bit is the block mode.
The last 5 bits indicate the maximum size of the code table (the code table is used for lzw compression).
From the original code: BLOCK_MODE=0x80; byte3=(BIT|BLOCK_MODE); and BIT is in an if/else block where it is 12..16.
If block mode is turned on, in the code table a entity will be added at place 256 (remember 0..255 are filled with the values 0..255) and this will contain the CLEAR sign. So whenever the CLEAR sign is gotten from the data stream from the file, the code table has to be reverted to it's initial state (so it has only 0..256 in it).
The maximum code size indicates the amount of bits the code table can be. When the maximum is hit, there are no entities added to the code table anymore. So if the maximum code size is 0b00001100, it means that the code table can only hold 12 bits, so a maximum of 2^12=4096 entities.
The highest amount possible that is used by compress is 16 bit. That means that there are 2 bits in this settings field that are unused.
After these 3 bytes the raw LZW data starts. Because the LZW table starts at 9 bits, the 4th byte will be the same as the first byte of the input (in case of a .tar.Z file, or taz file, this byte will be the first byte of the uncompressed .tar file).
A tar.Z file is just a compressed tar file, so you will only find the 1F 9D magic number telling you to uncompress it.
When uncompressed you can read the tar file header:
http://www.fileformat.info/format/tar/corion.htm
Q: this file can be produced with tar -Z option or using unix compress utility(result are same)
A: Yes. "tar -cvf myfile.tar myfiles; compress myfile.tar" is equivalent to using "-Z". An even better choice is often "j" (using BZip, instead of Zip)
Q: What is the layout of a tar file?
A: There are many references, and much freely available source. For example:
http://en.wikipedia.org/wiki/Tar_%28file_format%29
Q: What is the format of a Unix compressed file?
A: Again: many references; easy to find sample source code:
http://en.wikipedia.org/wiki/Compress
Fot a .tgz (compressed tar file) you'll need both formats: you must first uncompress it, then untar it. The "tar" utility will do both for you, automagically :)

Parallelization of PNG file creation with C++, libpng and OpenMP

I am currently trying to implement a PNG encoder in C++ based on libpng that uses OpenMP to speed up the compression process.
The tool is already able to generate PNG files from various image formats.
I uploaded the complete source code to pastebin.com so you can see what I have done so far: http://pastebin.com/8wiFzcgV
So far, so good! Now, my problem is to find a way how to parallelize the generation of the IDAT chunks containing the compressed image data. Usually, the libpng function png_write_row gets called in a for-loop with a pointer to the struct that contains all the information about the PNG file and a row pointer with the pixel data of a single image row.
(Line 114-117 in the Pastebin file)
//Loop through image
for (i = 0, rp = info_ptr->row_pointers; i < png_ptr->height; i++, rp++) {
png_write_row(png_ptr, *rp);
}
Libpng then compresses one row after another and fills an internal buffer with the compressed data. As soon as the buffer is full, the compressed data gets flushed in a IDAT chunk to the image file.
My approach was to split the image into multiple parts and let one thread compress row 1 to 10 and another thread 11 to 20 and so on. But as libpng is using an internal buffer it is not as easy as I thought first :) I somehow have to make libpng write the compressed data to a separate buffer for every thread. Afterwards I need a way to concatenate the buffers in the right order so I can write them all together to the output image file.
So, does someone have an idea how I can do this with OpenMP and some tweaking to libpng? Thank you very much!
This is too long for a comment but is not really an answer either--
I'm not sure you can do this without modifying libpng (or writing your own encoder). In any case, it will help if you understand how PNG compression is implemented:
At the high level, the image is a set of rows of pixels (generally 32-bit values representing RGBA tuples).
Each row can independently have a filter applied to it -- the filter's sole purpose is to make the row more "compressible". For example, the "sub" filter makes each pixel's value the difference between it and the one to its left. This delta encoding might seem silly at first glance, but if the colours between adjacent pixels are similar (which tends to be the case) then the resulting values are very small regardless of the actual colours they represent. It's easier to compress such data because it's much more repetitive.
Going down a level, the image data can be seen as a stream of bytes (rows are no longer distinguished from each other). These bytes are compressed, yielding another stream of bytes. The compressed data is arbitrarily broken up into segments (anywhere you want!) written to one IDAT chunk each (along with a little bookkeeping overhead per chunk, including a CRC checksum).
The lowest level brings us to the interesting part, which is the compression step itself. The PNG format uses the zlib compressed data format. zlib itself is just a wrapper (with more bookkeeping, including an Adler-32 checksum) around the real compressed data format, deflate (zip files use this too). deflate supports two compression techniques: Huffman coding (which reduces the number of bits required to represent some byte-string to the optimal number given the frequency that each different byte occurs in the string), and LZ77 encoding (which lets duplicate strings that have already occurred be referenced instead of written to the output twice).
The tricky part about parallelizing deflate compression is that in general, compressing one part of the input stream requires that the previous part also be available in case it needs to be referenced. But, just like PNGs can have multiple IDAT chunks, deflate is broken up into multiple "blocks". Data in one block can reference previously encoded data in another block, but it doesn't have to (of course, it may affect the compression ratio if it doesn't).
So, a general strategy for parallelizing deflate would be to break the input into multiple large sections (so that the compression ratio stays high), compress each section into a series of blocks, then glue the blocks together (this is actually tricky since blocks don't always end on a byte boundary -- but you can put an empty non-compressed block (type 00), which will align to a byte boundary, in-between sections). This isn't trivial, however, and requires control over the very lowest level of compression (creating deflate blocks manually), creating the proper zlib wrapper spanning all the blocks, and stuffing all this into IDAT chunks.
If you want to go with your own implementation, I'd suggest reading my own zlib/deflate implementation (and how I use it) which I expressly created for compressing PNGs (it's written in Haxe for Flash but should be comparatively easy to port to C++). Since Flash is single-threaded, I don't do any parallelization, but I do split the encoding up into virtually independent sections ("virtually" because there's the fractional-byte state preserved between sections) over multiple frames, which amounts to largely the same thing.
Good luck!
I finally got it to parallelize the compression process.
As mentioned by Cameron in the comment to his answer I had to strip the zlib header from the zstreams to combine them. Stripping the footer was not required as zlib offers an option called Z_SYNC_FLUSH which can be used for all chunks (except the last one which has to be written with Z_FINISH) to write to a byte boundary. So you can simply concatenate the stream outputs afterwards. Eventually, the adler32 checksum has to be calculated over all threads and copied to the end of the combined zstreams.
If you are interested in the result you can find the complete proof of concept at https://github.com/anvio/png-parallel