I know that gzip supports 9 compression levels, from fast to strong.
The decompression algorithm does not care about the compression level at all.
Is it possible to reach a "higher" level than 9 with a tool other than the standard gzip application?
I mean, someone could have created a modified gzip compressor which is more effective than gzip level 9.
The background is that I have a webserver which hosts compressed gz files. It would be nice to reduce the sizes of those files and I do not care how long my server has to work in order to reduce those files even by 1 byte at the end. It is a one-time task, so it does not matter.
Is there something like a hacked version of gzip supporting higher levels or offering higher compression?
Yes. It's called zopfli. It is painfully slow, but will compress about 5% better than zlib level 9. zopfli is built in to pigz, which is a gzip equivalent that makes use of multiple processors and cores. Compression level 11 in pigz invokes the zopfli compressor. (pigz goes up to 11. Get it?) Using multiple cores on large inputs helps mitigate the slowness of zopfli.
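For a one-time batch job like the one in the question, the pigz call can be scripted. This is only a minimal sketch, assuming pigz is installed and on the PATH. The helper names `pigz_zopfli_command` and `recompress` are made up for illustration, and `bundle.js` is a placeholder file name. `--keep` and `-p` are the pigz options for keeping the input file and limiting the number of processes.

```python
import shutil
import subprocess


def pigz_zopfli_command(path, processes=None):
    """Build the pigz command line that selects the zopfli compressor (-11)."""
    cmd = ["pigz", "-11", "--keep"]  # -11: zopfli; --keep: leave the input file in place
    if processes is not None:
        cmd += ["-p", str(processes)]  # cap the number of compression processes
    cmd.append(path)
    return cmd


def recompress(path, processes=None):
    """Run pigz -11 on `path`; pigz writes `path` + '.gz' next to it."""
    if shutil.which("pigz") is None:
        raise RuntimeError("pigz not found on PATH")
    subprocess.run(pigz_zopfli_command(path, processes), check=True)


# Only builds and shows the command; call recompress() to actually run it.
print(" ".join(pigz_zopfli_command("bundle.js", processes=8)))
```

Since the result is an ordinary gzip stream, any gzip decompressor should read it unchanged.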
Many/most compression algorithms have a parallel-decompression implementation (like pigz for gzip, etc).
However, rarely does one see a reduction in time proportional to the number of processors thrown at the task, with most not benefiting at all from more than 6 processors.
I'm curious to know if there are any compression formats with parallel decompression built into the design - i.e. would be theoretically 100x faster with 100 cpus than with 1.
Thank you and all the best :)
You're probably I/O bound. At some point more processors won't help if they're waiting for input or output. You just get more processors waiting.
Or maybe your input files aren't big enough.
pigz will in fact be 100x faster with 100 cpus, for a sufficiently large input, if it is not I/O bound. By default, pigz sends 128K blocks to each processor to work on, so you would need the input to be at least 13 MB in order to provide work for all 100 processors. Ideally a good bit more than that to get all the processors running at full steam at the same time.
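The arithmetic behind the 13 MB figure can be written out as a small sketch. It assumes only the 128K default block size mentioned above; the function names are illustrative and are not part of pigz.

```python
import math

PIGZ_DEFAULT_BLOCK = 128 * 1024  # bytes per block, as stated in the answer above


def min_input_bytes(processors, block_size=PIGZ_DEFAULT_BLOCK):
    """Smallest input that gives every processor at least one block."""
    return processors * block_size


def busy_processors(input_bytes, processors, block_size=PIGZ_DEFAULT_BLOCK):
    """How many processors can receive a block for a given input size."""
    blocks = math.ceil(input_bytes / block_size)
    return min(processors, blocks)


print(min_input_bytes(100))             # 13107200 bytes, about 13 MB
print(busy_processors(1_000_000, 100))  # 8: a 1 MB input keeps only 8 of 100 busy
```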
I understand that GZIP is a combination of LZ77 and Huffman coding and can be configured with a level from 1 to 9, where 1 is the fastest method (least compression) and 9 is the slowest (best compression).
My question is, does the choice of level only impact the compression process or is there an additional cost also incurred in decompression depending on the level used to compress?
I ask because typically many web servers will GZIP responses on the fly if the client supports it, e.g. Accept-Encoding: gzip. I appreciate that when doing this on the fly a level such as 6 might be the good choice for the average case, since it gives a good balance between speed and compression.
However, if I have a bunch of static assets that I can GZIP just once ahead of time - and never need to do this again - would there be any downside to using the highest but slowest compression level? I.e. is there now an additional overhead for the client that would not have been incurred had a lower compression level been used.
Great question, and an underexposed issue. Your intuition is solid – for some compression algorithms, choosing the max level of compression can require more work from the decompressor when it's unpacked.
Luckily, that's not true for gzip – there's no extra overhead for the client/browser to decompress more heavily compressed gzip files (e.g. choosing 9 for compression instead of 6, assuming the standard zlib codebase that most servers use). The best measure for this is decompression rate, which for present purposes is in units of MB/sec, while also monitoring overhead like memory and CPU. Simply going by decompression time is no good because the file is smaller at higher compression settings, and we're not controlling for that factor if we're only using a stopwatch.
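If you want to check the decompression rate yourself, here is a rough sketch using Python's zlib module. The generated sample data and the helper names are assumptions for illustration; real assets and a more careful benchmark will give different numbers.

```python
import time
import zlib


def make_sample(n_lines=20000):
    """Build some repetitive, text-like bytes to stand in for a static asset."""
    lines = [
        f"<li class='item' data-id='{i}'>entry number {i % 97}</li>\n"
        for i in range(n_lines)
    ]
    return "".join(lines).encode()


def decompression_rate(data, level, repeats=5):
    """Return (compressed size, MB/s of uncompressed output) for one level."""
    compressed = zlib.compress(data, level)
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        restored = zlib.decompress(compressed)
        best = min(best, time.perf_counter() - start)
    assert restored == data  # every level must round-trip to the same bytes
    # Rate is measured against the uncompressed size, not the stopwatch alone.
    return len(compressed), len(data) / best / 1e6


sample = make_sample()
for level in (1, 6, 9):
    size, rate = decompression_rate(sample, level)
    print(f"level {level}: {size} bytes compressed, {rate:.0f} MB/s decompressed")
```

Dividing by the uncompressed size is what controls for the smaller file at higher levels.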
gzip decompression quickly gets asymptotic in terms of both time-to-decompress and memory usage once you get past level 6 compressed content. The time-to-decompress flatlines for levels 7, 8, and 9 in the test results linked by Marcus Müller, though that's coarse-grained data given in whole seconds.
You'll also notice in those results that the memory requirements for decompression are flat for all levels of compression at 0.1 MiB. That's almost unbelievable, just a degree of excellence in software that we rarely see. Mark Adler and colleagues deserve massive props for what they achieved. gzip is a very nice format.
The memory use gets at your question about overhead. There really is none. You don't gain much with level 9 in terms of browser decompression speed, but you don't lose anything.
Now, check out these test results for a bit more texture. You'll see how the gzip decompression rate is slightly faster with level 9 compressed content than with lower levels (at level 9, decomp rate is about 0.9% faster than at level 6, for example). That is interesting and surprising. I wouldn't expect the rate to increase. That was just one set of test results – it may not hold for other scenarios (and the difference is quite small in any case).
Parting note: Precompressing static files is a good idea, but I don't recommend gzip at level 9. You'll get smaller files than gzip-9 by instead using zopfli or libdeflate. Zopfli is a well-established gzip compressor from Google. libdeflate is new but quite excellent. In my testing it consistently beats gzip-9, but still trails zopfli. You can also use 7-Zip to create gzip files, and it will consistently beat gzip-9. (In the foregoing, gzip-9 refers to using the canonical gzip or zlib application that Apache and nginx use).
No, there is no downside on the decompression side when using the maximum compression level. In fact, there is a slight upside, in that better-compressed data decompresses faster. The reason is simply fewer compressed bits that the decompressor has to process.
Actually, in real world measurements a higher compression level yields lower decompression times (which might be primarily caused by the fact that you need to handle less permanent storage and less RAM access).
Since, actually, most things that happen at a client with the data are rather expensive compared to gunzipping, you shouldn't really care about that, at all.
Also be advised that static assets that are images usually already have Huffman/zlib coding applied (PNG simply uses zlib!), so you won't gain much by gzipping them. In fact, small images (for example, icons) often fit into a single TCP packet (ignoring the HTTP header, which is sometimes bigger than the image itself), so you don't get any speed gain (though you do save money on transfer volume if you deliver terabytes of small images). Now, may I presume you're not Google itself...
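The point about already-compressed data is easy to see with a small experiment. This is just a sketch: the pseudo-random word text is a stand-in for real content, and compressing zlib output a second time only approximates what happens when you gzip a PNG.

```python
import random
import zlib


def fake_text(n_words=50000, seed=0):
    """Generate reproducible word salad as a stand-in for real content."""
    rng = random.Random(seed)
    vocab = ["gzip", "level", "server", "client", "static", "asset", "header",
             "packet", "image", "icon", "script", "style", "cache", "bytes",
             "stream", "block"]
    return " ".join(rng.choice(vocab) for _ in range(n_words)).encode()


original = fake_text()
once = zlib.compress(original, 9)   # like the zlib stream inside a PNG
twice = zlib.compress(once, 9)      # like gzipping that PNG again

print(f"original:         {len(original)} bytes")
print(f"compressed once:  {len(once)} bytes")
print(f"compressed twice: {len(twice)} bytes")
```

The second pass should leave the size roughly unchanged, because the first pass has already removed most of the redundancy.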
Also, I'd like to point you to higher-level optimization, like tools that transform your JavaScript code into a more compact form (e.g. removing whitespace, renaming private variables from my_mother_really_likes_this_number_of_unicorns to m1); also, things like jQuery come in a "precompressed" form. The same exists for HTML. It doesn't make things easier to debug, but since you seem to be interested in ultimate space saving...
How would one predict the execution time and/or the resulting compression ratio when compressing a file with a certain lossless compression algorithm? I am mainly concerned with local compression, since if you know the time and compression ratio for local compression, you can easily calculate the time for network compression based on the currently available network throughput.
Let's say you have some information about the file, such as size, redundancy, and type (we can say text to keep it simple). Maybe we have some statistical data from actual prior measurements. What else would be needed to predict the execution time and/or compression ratio (even if only very roughly)?
For just local compression, the size of the file would have an effect, since actually reading and writing data to/from storage media (SD card, hard drive) would take up the dominant portion of total execution time.
The actual compression portion will probably depend on redundancy/type, since most compression algorithms work by compressing small blocks of data (100 KB or so). For example, larger HTML/JavaScript files compress better since they have higher redundancy.
I guess there is also a problem of scheduling, but this could probably be ignored for rough estimation.
This is a question that has been in my head for quite some time. I have been wondering whether some low-overhead code (say, on the server) can predict how long it would take to compress a file before performing the actual compression.
Sample the file by taking 10-100 small pieces from random locations. Compress them individually. This should give you a lower bound on compression ratio.
This only returns meaningful results if the chunks are not too small. The compression algorithm must be able to make use of a certain size of history to predict the next bytes.
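A minimal sketch of the sampling idea, assuming zlib as the compressor and data already held in memory. The function name, the 64 KiB chunk size, and the sample count are arbitrary choices for illustration; for a large file you would seek to random offsets instead of slicing a bytes object.

```python
import random
import zlib


def estimate_ratio(data, n_samples=20, chunk_size=64 * 1024, level=6, seed=0):
    """Estimate uncompressed/compressed ratio from randomly placed chunks."""
    if len(data) <= chunk_size:
        # Too small to sample: just compress the whole thing.
        return len(data) / len(zlib.compress(data, level))
    rng = random.Random(seed)
    raw_total = 0
    packed_total = 0
    for _ in range(n_samples):
        start = rng.randrange(0, len(data) - chunk_size + 1)
        chunk = data[start:start + chunk_size]
        raw_total += len(chunk)
        # Each chunk is compressed on its own, with no shared history.
        packed_total += len(zlib.compress(chunk, level))
    return raw_total / packed_total


sample = b"the quick brown fox jumps over the lazy dog\n" * 50000
print(f"estimated ratio: {estimate_ratio(sample):.1f}")
```

Timing the same loop would give a similarly rough estimate of throughput on the target machine.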
It depends on the data, but with images you can take small samples. Downsampling would change the result. Here is an example: PHP - Compress Image to Meet File Size Limit.
The compression ratio can be calculated with these formulas:
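For reference, the usual definitions (assumed here, since the answer does not spell them out) are:

```latex
\text{compression ratio} = \frac{\text{uncompressed size}}{\text{compressed size}},
\qquad
\text{space saving} = 1 - \frac{\text{compressed size}}{\text{uncompressed size}}
```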
And the performance benchmarking can be done using V8 or Sunspider.
You can also use algorithms like DEFLATE or LZMA to compute the mechanism. PPM (Prediction by Partial Matching) can be used for predicting.
Is there any chance that packing a large file with some simple algorithm enables me to read the data faster than from an uncompressed file (due to the hard drive being slower than uncompressing)?
What kind of compression rate would I need to have? Can any fast compression algorithm do that?
Yes. That is often the case with deflate compression, used by zip, gzip, and zlib, when reading from hard drives with a typical compression factor of, say, four.
From SSDs, you may need to go to something with faster decompression. One you could try is lz4.
Your mileage may vary.
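A back-of-the-envelope model can help decide whether it is worth trying. This is only a sketch: the function name is made up, and the disk and decompression speeds in the example calls are placeholder numbers, not measurements.

```python
def effective_read_rate(disk_mb_s, compression_factor, decompress_mb_s, pipelined=True):
    """Rate, in MB/s of uncompressed data, when reading a compressed file.

    decompress_mb_s is the decompressor's output rate in uncompressed MB/s.
    """
    # Each MB read from disk expands to `compression_factor` MB of data.
    from_disk = disk_mb_s * compression_factor
    if pipelined:
        # Reading and decompressing overlap, so the slower stage sets the pace.
        return min(from_disk, decompress_mb_s)
    # Strictly sequential: the time per MB is the sum of both stages.
    return 1.0 / (1.0 / from_disk + 1.0 / decompress_mb_s)


# Hypothetical hard drive at 100 MB/s, factor 4, decompressor at 300 MB/s.
print(effective_read_rate(100, 4, 300))  # compare against 100 MB/s uncompressed
# Hypothetical SSD at 500 MB/s with the same decompressor.
print(effective_read_rate(500, 4, 300))  # compare against 500 MB/s uncompressed
```

In the pipelined case compression only pays off when the decompressor is faster than the raw disk, which is why a faster codec matters more on SSDs.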
You could also try Density; its command-line client "sharc" is benchmarked here.
Sometimes MPI is used to send low-entropy data in messages, so it can be useful to try to compress messages before sending them. I know that MPI can work on very fast networks (10 Gbit/s and more), but many MPI programs are used with cheap networks like 0.1 Gbit/s or 1 Gbit/s Ethernet and with cheap (slow, low-bisection) network switches. There is a very fast compression algorithm, Snappy, which has
Compression speed is 250 MB/s and decompression speed is 500 MB/s
so on compressible data and a slow network it should give some speedup.
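To make that estimate concrete, here is a rough model using the Snappy speeds quoted above. The compression ratio of 2 and the 100 MB message size are assumptions for illustration, and the model ignores latency and any overlap between the steps.

```python
def send_time(size_mb, network_mb_s):
    """Seconds to send the message uncompressed."""
    return size_mb / network_mb_s


def send_time_compressed(size_mb, network_mb_s, ratio,
                         compress_mb_s=250.0, decompress_mb_s=500.0):
    """Seconds to compress, send, and decompress, done one after another."""
    compress = size_mb / compress_mb_s
    transfer = (size_mb / ratio) / network_mb_s
    decompress = size_mb / decompress_mb_s
    return compress + transfer + decompress


# 0.1 Gbit/s is 12.5 MB/s; 10 Gbit/s is 1250 MB/s.
for label, rate in (("0.1 Gbit/s", 12.5), ("10 Gbit/s", 1250.0)):
    plain = send_time(100, rate)
    packed = send_time_compressed(100, rate, ratio=2.0)
    print(f"{label}: plain {plain:.2f} s, compressed {packed:.2f} s")
```

Under these assumptions compression helps on the slow link and hurts on the fast one.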
Is there any MPI library which can compress MPI messages (at the MPI layer, not compression of IP packets as in PPP)?
MPI messages are also structured, so there could be some special methods, like compressing the exponent part in an array of doubles.
PS: There is also the LZ4 compression method, with comparable speed.
I won't swear that there's none out there, but there's none in common use.
There are a couple of reasons why it's not common:
MPI is often used for sending lots of floating point data which is hard (but not impossible) to compress well, and often has relatively high entropy after a while.
In addition, MPI users are often as concerned with latency as bandwidth, and adding a compression/decompression step into the message-passing critical path wouldn't be attractive to those users.
Finally some operations (like reduction collectives, or scatter gather) would be very hard to implement efficiently with compression.
However, it sounds like your use case could benefit from this for point-to-point communications, so there's no reason why you couldn't do it yourself. If you were going to send a message of size N and the receiver expected it, then:
sender calls the compression routine, receives a buffer and new length M;
if M >= N, sends the original data, with an initial byte of, say, 0, as N+1 bytes to the receiver;
otherwise sends an initial byte of 1 followed by the compressed data;
receiver receives the data into a buffer of length N+1;
if the first byte is 1, calls MPI_Get_count to determine the amount of data received, then calls the decompression routine;
otherwise uses the uncompressed data.
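The framing part of these steps can be sketched without any MPI calls. This assumes zlib as a stand-in for whatever compression routine you pick, and the names `pack_message` and `unpack_message` are made up. The actual send and receive (and the MPI_Get_count step) are left out, since in Python the length of the received bytes object is already known.

```python
import zlib

RAW = 0         # flag byte: payload follows uncompressed
COMPRESSED = 1  # flag byte: payload follows compressed


def pack_message(payload: bytes) -> bytes:
    """Sender side: prepend a flag byte, compressing only when it helps."""
    compressed = zlib.compress(payload, 1)  # level 1: favour speed over ratio
    if len(compressed) >= len(payload):
        return bytes([RAW]) + payload       # M >= N: send the original, N+1 bytes
    return bytes([COMPRESSED]) + compressed


def unpack_message(message: bytes) -> bytes:
    """Receiver side: inspect the flag byte and undo the compression if needed."""
    flag, body = message[0], message[1:]
    if flag == COMPRESSED:
        return zlib.decompress(body)
    if flag == RAW:
        return body
    raise ValueError(f"unknown flag byte {flag}")


message = pack_message(b"0.0 " * 1000)
print(len(message), unpack_message(message) == b"0.0 " * 1000)
```

The message never exceeds N+1 bytes, so the receiver's buffer of length N+1 is always large enough.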
I can't give you much guidance as to the compression routines, but it does look like people have tried this before, e.g. http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.91.7936 .
I'll be happy to be told otherwise but I don't think many of us users of MPI are concerned with having a transport layer that compresses data.
Why the heck not?
1) We already design our programs to do as little communication as possible, so we (like to think we) are sending the bare minimum across the interconnect.
2) The bulk of our larger messages comprise arrays of floating-point numbers which are relatively difficult (and therefore relatively expensive in time) to compress to any degree.
There's an ongoing project at the University of Edinburgh: http://link.springer.com/chapter/10.1007%2F978-3-642-32820-6_72?LI=true