Is it possible to increase the size of the sliding window beyond 32KB for zlib? - compression

I would like to increase the size of the sliding window for zlib beyond the maximum 32KB (I would like to match the window size to the length of the string that I am trying to compress). This is because I want to make sure that if a match exists, it will be found. Can this be done easily? Or are there subtleties in the implementation that I should consider?

It would require a redesign of the deflate format, which inherently only allows distances of 32768 or less, and a rewrite of the deflate code in zlib.
The redesign of the deflate format was already done once, resulting in deflate64 which permits distances up to 65536 (maybe not enough for you?), which the zlib code could in principle be rewritten to accommodate.
Alternatively, you can use other LZ compressors already written and tested with larger windows (often much larger windows), such as lzma or brotli.
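To see the effect of the window limit, here is a rough sketch in Python using the standard-library zlib and lzma modules. The 100,000-byte block size and the fixed random seed are arbitrary choices for illustration. The buffer's only redundancy is a repeat that lies farther back than 32 KB, so deflate should not be able to use it, while lzma's much larger dictionary can.

```python
import lzma
import random
import zlib

# Hypothetical test data: one incompressible 100,000-byte block, repeated once.
# The only redundancy is a match at distance 100,000, beyond deflate's 32,768.
rng = random.Random(0)
block = bytes(rng.getrandbits(8) for _ in range(100_000))
data = block + block

deflated = zlib.compress(data, 9)  # match distances are capped at 32 KB
xz = lzma.compress(data)           # default preset uses a multi-megabyte dictionary

print(len(data), len(deflated), len(xz))
```

On data like this, the deflate output stays close to the input size, while the lzma output is close to the size of a single block.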

Related

Which decompression algorithms are safe to use on attacker-supplied buffers?

I want to save network bandwidth by using compression, such as bzip2 or gzip.
Attackers, as well as normal users, may send compressed messages.
Are there sequences of bytes which will cause some decompression functions to become stuck in an infinite loop, or to use vast amounts of memory?
If so, is this a fundamental property of those algorithms, or just an implementation bug?
I can only speak for zlib's inflate. There is no input that would result in an infinite loop or uncontrolled memory consumption.
Since the maximum compression ratio of deflate is less than 1032:1, inflate, when working normally, can expand its input by a factor of up to almost 1032. You just need to be able to handle that possibility.
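One way to handle that possibility is to cap the output size while inflating. The following is a minimal sketch using Python's zlib module. The function name bounded_inflate and the limit values are made up for illustration; the max_length argument of decompress() is what stops the expansion early.

```python
import zlib

def bounded_inflate(payload: bytes, limit: int) -> bytes:
    """Inflate a zlib stream, refusing to produce more than `limit` bytes."""
    d = zlib.decompressobj()
    # Ask for at most limit + 1 bytes, so an oversized stream is detected
    # without ever materialising its full output.
    out = d.decompress(payload, limit + 1)
    if len(out) > limit:
        raise ValueError("decompressed data exceeds limit")
    if not d.eof:
        raise ValueError("truncated or incomplete stream")
    return out

# 10 MB of zeros compresses to about a thousandth of its size.
bomb = zlib.compress(b"\0" * 10_000_000, 9)
print(len(bomb))
```

Calling bounded_inflate(bomb, 1_000_000) raises ValueError after producing only slightly more than 1 MB, rather than the full 10 MB.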

Does GZIP Compression Level Have Any Impact On Decompression

I understand that GZIP is a combination of LZ77 and Huffman coding and can be configured with a level between 1-9 where 1 indicates the fastest compression (less compression) and 9 indicates the slowest compression method (best compression).
My question is, does the choice of level only impact the compression process or is there an additional cost also incurred in decompression depending on the level used to compress?
I ask because typically many web servers will GZIP responses on the fly if the client supports it, e.g. Accept-Encoding: gzip. I appreciate that when doing this on the fly a level such as 6 might be a good choice for the average case, since it gives a good balance between speed and compression.
However, if I have a bunch of static assets that I can GZIP just once ahead of time - and never need to do this again - would there be any downside to using the highest but slowest compression level? I.e., is there now an additional overhead for the client that would not have been incurred had a lower compression level been used?
Great question, and an underexposed issue. Your intuition is solid – for some compression algorithms, choosing the max level of compression can require more work from the decompressor when it's unpacked.
Luckily, that's not true for gzip – there's no extra overhead for the client/browser to decompress more heavily compressed gzip files (e.g. choosing 9 for compression instead of 6, assuming the standard zlib codebase that most servers use). The best measure for this is decompression rate, which for present purposes is in units of MB/sec, while also monitoring overhead like memory and CPU. Simply going by decompression time is no good because the file is smaller at higher compression settings, and we're not controlling for that factor if we're only using a stopwatch.
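A rough way to measure decompression rate rather than raw time is sketched below in Python. It uses the zlib module (the same deflate data that gzip wraps), and the sample input, function name and repeat count are assumptions for illustration only. The rate is computed from the uncompressed size, so smaller compressed files do not skew the comparison.

```python
import time
import zlib

def decompression_rate(data: bytes, level: int, repeats: int = 5):
    """Return (compressed size, MB of output produced per second)."""
    packed = zlib.compress(data, level)
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        unpacked = zlib.decompress(packed)
        best = min(best, time.perf_counter() - start)
    assert unpacked == data
    # Rate is based on the uncompressed size, not the compressed size.
    return len(packed), len(data) / best / 1e6

# Hypothetical sample input; substitute a real asset for meaningful numbers.
sample = b"".join(b"line %d of some repetitive text\n" % i for i in range(50_000))
for level in (1, 6, 9):
    size, rate = decompression_rate(sample, level)
    print(f"level {level}: {size} bytes, {rate:.0f} MB/s")
```

The numbers depend heavily on the machine and the input, so treat any single run as indicative only.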
gzip decompression quickly gets asymptotic in terms of both time-to-decompress and memory usage once you get past level 6 compressed content. The time-to-decompress flatlines for levels 7, 8, and 9 in the test results linked by Marcus Müller, though that's coarse-grained data given in whole seconds.
You'll also notice in those results that the memory requirements for decompression are flat for all levels of compression at 0.1 MiB. That's almost unbelievable, just a degree of excellence in software that we rarely see. Mark Adler and colleagues deserve massive props for what they achieved. gzip is a very nice format.
The memory use gets at your question about overhead. There really is none. You don't gain much with level 9 in terms of browser decompression speed, but you don't lose anything.
Now, check out these test results for a bit more texture. You'll see how the gzip decompression rate is slightly faster with level 9 compressed content than with lower levels (at level 9, decomp rate is about 0.9% faster than at level 6, for example). That is interesting and surprising. I wouldn't expect the rate to increase. That was just one set of test results – it may not hold for other scenarios (and the difference is quite small in any case).
Parting note: Precompressing static files is a good idea, but I don't recommend gzip at level 9. You'll get smaller files than gzip-9 by instead using zopfli or libdeflate. Zopfli is a well-established gzip compressor from Google. libdeflate is new but quite excellent. In my testing it consistently beats gzip-9, but still trails zopfli. You can also use 7-Zip to create gzip files, and it will consistently beat gzip-9. (In the foregoing, gzip-9 refers to using the canonical gzip or zlib application that Apache and nginx use).
No, there is no downside on the decompression side when using the maximum compression level. In fact, there is a slight upside, in that better-compressed data decompresses faster. The reason is simply that the decompressor has fewer compressed bits to process.
Actually, in real-world measurements a higher compression level yields lower decompression times (which might be primarily because there is less data to read from permanent storage and less RAM access).
Since, actually, most things that happen at a client with the data are rather expensive compared to gunzipping, you shouldn't really care about that, at all.
Also be advised that static assets that are images have usually already had Huffman/zlib coding applied (PNG simply uses zlib!), so you won't gain much by gzipping them. In fact, small images (for example, icons) often fit into a single TCP packet (ignoring the HTTP header, which is sometimes bigger than the image itself), so you don't get any speed gain (though you do save money on transfer volume if you deliver terabytes of small images). Now, may I presume you're not Google itself...
Also, I'd like to point you to higher-level optimization, like tools that can transform your JavaScript code into a more compact form (e.g. removing whitespace, renaming private variables from my_mother_really_likes_this_number_of_unicorns to m1); also, things like jQuery come in a "precompressed" form. The same exists for HTML. It doesn't make things easier to debug, but since you seem to be interested in ultimate space saving...

Can reading zipped files be faster than uncompressed?

Is there any chance that packing a large file with some simple algorithm enables me to read the data faster than from an uncompressed file (due to the hard drive being slower than uncompressing)?
What kind of compression rate would I need to have? Can any fast compression algorithm do that?
Yes. That is often the case with deflate compression, used by zip, gzip, and zlib, when reading from hard drives with a typical compression factor of, say, four.
From SSDs, you may need to go to something with faster decompression. One you could try is lz4.
Your mileage may vary.
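As a back-of-the-envelope check, the compression factor needed can be estimated with a simplified model in which the file is read completely and then decompressed, with no overlap between the two. The function name and the throughput figures below are hypothetical, and a real pipeline that overlaps I/O with decompression would do better than this model suggests.

```python
def break_even_factor(disk_mb_s: float, inflate_mb_s: float) -> float:
    """Smallest compression factor at which reading compressed data wins.

    Simplified model, for S megabytes of uncompressed data:
      uncompressed time = S / disk
      compressed time   = S / (factor * disk) + S / inflate
    where `inflate` is measured in MB of output per second.
    """
    if inflate_mb_s <= disk_mb_s:
        return float("inf")  # decompression alone is slower than raw reading
    return inflate_mb_s / (inflate_mb_s - disk_mb_s)

# Hypothetical figures: a 100 MB/s hard drive and 400 MB/s decompression.
print(break_even_factor(100, 400))
```

With those assumed figures, any compression factor above roughly 1.33 already pays off, which is consistent with a factor of four being comfortably enough on a hard drive.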
You could also try Density; its command-line client "sharc" is benchmarked here.

Compression library for C / C++ able to deal with more than 32 bit elements in the array

My problem is that I need to compress a std::vector() of around 6 GB (1.5 billion floats in it), and up to now I have used lz4, but it only accepts an int count of chars. Since I have 6 billion chars in my vector, the count would need 33 bits to represent, so compression with LZ4 does not work as I need it to.
From what I saw of the zlib library, it also takes an int for the length of the data to be compressed.
Do I need to segment my data, or is there a framework around that can deal with more than 32 bits' worth of chars, or am I missing something?
Use zlib, and pass the array in as several chunks. The DEFLATE algorithm used by zlib has a window size of about 32 KB, and it already buffers the compressed data, so passing the data in as multiple chunks will not affect the compression efficiency.
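The chunked pattern can be sketched in Python with zlib.compressobj, which keeps a single deflate stream open across calls. In C the equivalent would be repeated deflate() calls on one z_stream. The function name, the 1 GiB default chunk size and the in-memory output list are illustrative assumptions; for 6 GB of input the compressed pieces would more sensibly be written to a file as they are produced.

```python
import zlib

def compress_in_chunks(buffer, chunk_size=1 << 30, level=6):
    """Feed a large buffer to one deflate stream in pieces."""
    # View the buffer as raw bytes without copying (works for bytes, array, etc.).
    view = memoryview(buffer).cast("B")
    comp = zlib.compressobj(level)
    parts = []
    for start in range(0, len(view), chunk_size):
        parts.append(comp.compress(view[start:start + chunk_size]))
    parts.append(comp.flush())  # emit whatever is still buffered
    return b"".join(parts)
```

The result is one ordinary zlib stream, so zlib.decompress() or a streaming decompressobj can read it back without knowing how the input was split.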
Take a look at XZ, it seems to handle really big sizes. The CLI executables themselves are thin wrappers over a library, so this should fit your bill.
OTOH, a stream of binary floats shouldn't compress that well...

What is the best compression algorithm that allows random reads/writes in a file? [closed]

What is the best compression algorithm that allows random reads/writes in a file?
I know that any adaptive compression algorithm would be out of the question.
And I know Huffman encoding would be out of the question.
Does anyone have a better compression algorithm that would allow random reads/writes?
I think you could use any compression algorithm if you write it in blocks, but ideally I would not like to have to decompress a whole block at a time. But if you have suggestions on an easy way to do this and how to know the block boundaries, please let me know. If this is part of your solution, please also let me know what you do when the data you want to read spans a block boundary.
In the context of your answers, please assume the file in question is 100GB, and sometimes I'll want to read the first 10 bytes, sometimes the last 19 bytes, and sometimes 17 bytes in the middle.
I am stunned at the number of responses that imply that such a thing is impossible.
Have these people never heard of "compressed file systems", which have been around since before Microsoft was sued in 1993 by Stac Electronics over compressed file system technology?
I hear that LZS and LZJB are popular algorithms for people implementing compressed file systems, which necessarily require both random-access reads and random-access writes.
Perhaps the simplest and best thing to do is to turn on file system compression for that file, and let the OS deal with the details.
But if you insist on handling it manually, perhaps you can pick up some tips by reading about NTFS transparent file compression.
Also check out:
"StackOverflow: Compression formats with good support for random access within archives?"
A dictionary-based compression scheme, with each dictionary entry's code being encoded with the same size, will result in being able to begin reading at any multiple of the code size, and writes and updates are easy if the codes make no use of their context/neighbors.
If the encoding includes a way of distinguishing the start or end of codes then you do not need the codes to be the same length, and you can start reading anywhere in the middle of the file. This technique is more useful if you're reading from an unknown position in a stream.
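For the fixed-size-code variant, a toy sketch in Python might look like the following. It assumes whitespace-separated tokens and 2-byte codes (so at most 65,536 distinct tokens), and all function names are made up for illustration. Because every code has the same width, token i sits at byte offset 2 * i and can be read or overwritten without touching its neighbours.

```python
import struct

def encode_tokens(tokens):
    """Map each distinct token to a fixed 2-byte code; return (codes, table)."""
    table = {}
    codes = bytearray()
    for tok in tokens:
        code = table.setdefault(tok, len(table))
        codes += struct.pack("<H", code)
    # Dict keys keep insertion order, so list position equals the code.
    return codes, list(table)

def read_token(codes, table, i):
    """Random read: token i lives at byte offset 2 * i."""
    (code,) = struct.unpack_from("<H", codes, 2 * i)
    return table[code]

def write_token(codes, table, i, tok):
    """Random write: overwrite one code in place, growing the table if needed."""
    if tok not in table:
        table.append(tok)
    struct.pack_into("<H", codes, 2 * i, table.index(tok))
```

The linear table.index() lookup keeps the sketch short; a real implementation would keep a reverse mapping as well.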
I think Stephen Denne might be onto something here. Imagine:
- zip-like compression of sequences to codes
- a dictionary mapping code -> sequence
- file will be like a filesystem
  - each write generates a new "file" (a sequence of bytes, compressed according to dictionary)
  - "filesystem" keeps track of which "file" belongs to which bytes (start, end)
  - each "file" is compressed according to dictionary
  - reads work filewise, uncompressing and retrieving bytes according to "filesystem"
  - writes make "files" invalid, new "files" are appended to replace the invalidated ones
- this system will need:
  - defragmentation mechanism of filesystem
  - compacting dictionary from time to time (removing unused codes)
- done properly, housekeeping could be done when nobody is looking (idle time) or by creating a new file and "switching" eventually
One positive effect would be that the dictionary would apply to the whole file. If you can spare the CPU cycles, you could periodically check for sequences overlapping "file" boundaries and then regroup them.
This idea is for truly random reads. If you are only ever going to read fixed sized records, some parts of this idea could get easier.
I don't know of any compression algorithm that allows random reads, never mind random writes. If you need that sort of ability, your best bet would be to compress the file in chunks rather than as a whole.
For example, take the read-only case first. Let's say you break up your file into 8K chunks. You compress each chunk and store the compressed chunks sequentially. You will need to record where each compressed chunk is stored and how big it is. Then, say you need to read N bytes starting at offset O. You will need to figure out which chunk it's in (O / 8K), decompress that chunk and grab those bytes. The data you need may span multiple chunks, so you have to deal with that scenario.
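A minimal sketch of the read-only case in Python, using zlib for the per-chunk compression, could look like this. The names pack and read_range and the in-memory blob are assumptions for illustration; a real implementation would keep the blob and the index on disk.

```python
import zlib

CHUNK = 8 * 1024  # uncompressed chunk size, as in the 8K example above

def pack(data: bytes):
    """Compress each chunk independently; return (blob, index).

    index[i] is the (offset, length) of compressed chunk i inside blob.
    """
    blob, index = bytearray(), []
    for start in range(0, len(data), CHUNK):
        piece = zlib.compress(data[start:start + CHUNK])
        index.append((len(blob), len(piece)))
        blob += piece
    return bytes(blob), index

def read_range(blob: bytes, index, offset: int, n: int) -> bytes:
    """Read n bytes at uncompressed `offset`, touching only the needed chunks."""
    if n <= 0:
        return b""
    first, last = offset // CHUNK, (offset + n - 1) // CHUNK
    out = bytearray()
    # A request that spans a boundary simply decompresses every chunk it touches.
    for i in range(first, min(last, len(index) - 1) + 1):
        pos, length = index[i]
        out += zlib.decompress(blob[pos:pos + length])
    skip = offset - first * CHUNK
    return bytes(out[skip:skip + n])
```

Smaller chunks mean less wasted decompression per read but a worse compression ratio and a larger index, so the 8K figure is a tuning knob rather than a fixed choice.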
Things get complicated when you want to be able to write to the compressed file. You have to deal with compressed chunks getting bigger and smaller. You may need to add some extra padding to each chunk in case it expands (it's still the same size uncompressed, but different data will compress to different sizes). You may even need to move chunks if the compressed data is too big to fit back in the original space it was given.
This is basically how compressed file systems work. You might be better off turning on file system compression for your files and just reading from and writing to them normally.
Compression is all about removing redundancy from the data. Unfortunately, it's unlikely that the redundancy is going to be distributed with monotonous evenness throughout the file, and that's about the only scenario in which you could expect compression and fine-grained random access.
However, you could get close to random access by maintaining an external list, built during the compression, which shows the correspondence between chosen points in the uncompressed datastream and their locations in the compressed datastream. You'd obviously have to choose a method where the translation scheme between the source stream and its compressed version does not vary with the location in the stream (i.e. no LZ77 or LZ78; instead you'd probably want to go for Huffman or byte-pair encoding.) Obviously this would incur a lot of overhead, and you'd have to decide on just how you wanted to trade off between the storage space needed for "bookmark points" and the processor time needed to decompress the stream starting at a bookmark point to get the data you're actually looking for on that read.
As for random-access writing... that's all but impossible. As already noted, compression is about removing redundancy from the data. If you try to replace data that could be and was compressed because it was redundant with data that does not have the same redundancy, it's simply not going to fit.
However, depending on how much random-access writing you're going to do, you may be able to simulate it by maintaining a sparse matrix representing all data written to the file after the compression. On all reads, you'd check the matrix to see if you were reading an area that you had written to after the compression. If not, then you'd go to the compressed file for the data.
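The overlay idea can be sketched in Python as follows. The class name PatchedReader and the read_base callback are hypothetical; the callback stands in for whatever returns bytes from the compressed file. The sketch stores patches per byte for simplicity and does not handle writes past the end of the original data.

```python
class PatchedReader:
    """Overlay of post-compression writes on top of a read-only base.

    `read_base(offset, n)` is a hypothetical callback that returns bytes
    from the compressed file (for example, via a chunk index).
    """

    def __init__(self, read_base):
        self.read_base = read_base
        self.patches = {}  # byte offset -> new byte value (sparse)

    def write(self, offset: int, data: bytes) -> None:
        # Record the new bytes without touching the compressed file.
        for i, b in enumerate(data):
            self.patches[offset + i] = b

    def read(self, offset: int, n: int) -> bytes:
        out = bytearray(self.read_base(offset, n))
        # Apply any later writes that fall inside the requested range.
        for i in range(len(out)):
            if offset + i in self.patches:
                out[i] = self.patches[offset + i]
        return bytes(out)
```

Once the overlay grows large, recompressing the file with the patches applied and starting with an empty overlay would be the natural housekeeping step.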