Any key-value storages with emphasis on compression?

Are there any key-value storages which fit the following criteria?
are open source
use persistent file storage
have replication and an oplog
have configurable compression suitable for storing 10-100 megabytes of raw text per second
work on Windows and Linux
The desired interface should contain at least:
store a record by a text or numeric ID
retrieve a record by ID

WiredTiger does support different kinds of compression:
Compression considerations
WiredTiger compresses data at several stages to preserve memory and disk space. Applications can configure these different compression algorithms to tailor their requirements between memory, disk and CPU consumption. Compression algorithms other than block compression work by modifying how the keys and values are represented, and hence reduce data size in-memory and on-disk. Block compression, on the other hand, compresses the data in its binary representation while saving it to disk.
Configuring compression may change application throughput. For example, in applications using solid-state drives (where I/O is less expensive), turning off compression may increase application performance by reducing CPU costs; in applications where I/O costs are more expensive, turning on compression may increase application performance by reducing the overall number of I/O operations.
WiredTiger also uses some internal compression algorithms that are not configurable but always on. For example, run-length encoding reduces the size requirement by storing sequential, duplicate values in the store only a single time (with an associated count).
WiredTiger supports several kinds of compression:
key prefix compression
dictionary compression
Huffman encoding
and block compression, which supports, among other things, LZ4, Snappy, zlib and Zstd.
Have a look at the documentation for full coverage of the subject.
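For concreteness (this is not from the original answer), here is a minimal sketch of configuring block and dictionary compression through WiredTiger's Python API, assuming a build that includes the snappy extension; the table name, keys and values are made up:

    from wiredtiger import wiredtiger_open

    # Open (or create) a database home directory. Snappy must be available
    # in this build, either compiled in or loaded as an extension.
    conn = wiredtiger_open("wt_home", "create")
    session = conn.open_session()

    # Per-table compression settings: snappy for block compression,
    # plus a value dictionary of up to 1000 entries.
    session.create("table:records",
                   "key_format=S,value_format=S,"
                   "block_compressor=snappy,dictionary=1000")

    cursor = session.open_cursor("table:records")
    cursor.set_key("doc-42")
    cursor.set_value("some raw text payload ...")
    cursor.insert()

    cursor.set_key("doc-42")
    if cursor.search() == 0:
        print(cursor.get_value())

    conn.close()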

Related

Relation between DISTSTYLE and Compression encoding in Redshift

Is there any relation between DISTSTYLE and compression encoding in Redshift? Whenever we use compression encoding, the compute nodes do extra work encoding and decoding the data; with DISTSTYLE set to ALL, wouldn't every node have to do that decoding and encoding work?
Any conceptual help here is highly appreciated.
The Distribution Style determines which node/slice will store the data. This has no relationship or impact on compression type. It is simply saying where to store the data.
Compression, however, is closely related to the Sort Key, which determines the order in which data is stored. Some compression methods use 'offsets' from previous values, or even storing the number of repeated values, which can significantly compress data (eg "repeat this value 1000 times" rather than storing 1000 values).
Compression within Amazon Redshift has two benefits:
Less storage space (thus, less cost)
More data can be retrieved for each disk access
The slowest operation of any database is disk access. Therefore, any reduction in disk access will speed operations. The time taken to decompress data is minor compared to the time required for an additional disk read operation.
The second most 'expensive' operation is sending data between nodes. While network traffic is faster than disk access, it is best avoided.
When using DISTSTYLE ALL, it simply means that the data is available on every node, which avoids the need to transfer data across the network.
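Not part of the original answer, but to make the two settings concrete, here is a rough sketch of a small dimension table using DISTSTYLE ALL together with per-column encodings, issued from Python via psycopg2 (the cluster endpoint, credentials, table and encodings are all illustrative):

    import psycopg2  # Redshift speaks the PostgreSQL wire protocol

    ddl = """
    CREATE TABLE dim_country (
        country_id   INT          ENCODE az64,
        country_name VARCHAR(64)  ENCODE zstd
    )
    DISTSTYLE ALL            -- a full copy of this small table on every node
    SORTKEY (country_id);    -- sort order is what column encodings exploit
    """

    conn = psycopg2.connect(host="my-cluster.example.com", port=5439,
                            dbname="dev", user="admin", password="...")
    with conn, conn.cursor() as cur:
        cur.execute(ddl)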

Does GZIP Compression Level Have Any Impact On Decompression

I understand that GZIP is a combination of LZ77 and Huffman coding and can be configured with a level between 1-9 where 1 indicates the fastest compression (less compression) and 9 indicates the slowest compression method (best compression).
My question is, does the choice of level only impact the compression process or is there an additional cost also incurred in decompression depending on the level used to compress?
I ask because typically many web servers will GZIP responses on the fly if the client supports it, e.g. Accept-Encoding: gzip. I appreciate that when doing this on the fly, a level such as 6 might be a good choice for the average case, since it gives a good balance between speed and compression.
However, if I have a bunch of static assets that I can GZIP just once ahead of time - and never need to do this again - would there be any downside to using the highest but slowest compression level? I.e. is there now an additional overhead for the client that would not have been incurred had a lower compression level been used.
Great question, and an underexposed issue. Your intuition is solid – for some compression algorithms, choosing the max level of compression can require more work from the decompressor when it's unpacked.
Luckily, that's not true for gzip – there's no extra overhead for the client/browser to decompress more heavily compressed gzip files (e.g. choosing 9 for compression instead of 6, assuming the standard zlib codebase that most servers use). The best measure for this is decompression rate, which for present purposes is in units of MB/sec, while also monitoring overhead like memory and CPU. Simply going by decompression time is no good because the file is smaller at higher compression settings, and we're not controlling for that factor if we're only using a stopwatch.
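If you want to check this on your own assets, a quick sketch along those lines (not from the answer; the file name is a placeholder) uses Python's gzip module and reports the decompression rate in MB of output per second, so higher levels aren't unfairly favoured by their smaller files:

    import gzip
    import time

    data = open("asset.js", "rb").read()   # any reasonably large text asset

    for level in (1, 6, 9):
        blob = gzip.compress(data, compresslevel=level)
        start = time.perf_counter()
        for _ in range(200):                 # repeat so the timing is measurable
            gzip.decompress(blob)
        elapsed = time.perf_counter() - start
        rate = len(data) * 200 / elapsed / 1e6   # MB of decompressed output per second
        print(f"level {level}: {len(blob)} compressed bytes, decompress ~{rate:.0f} MB/s")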
gzip decompression quickly gets asymptotic in terms of both time-to-decompress and memory usage once you get past level 6 compressed content. The time-to-decompress flatlines for levels 7, 8, and 9 in the test results linked by Marcus Müller, though that's coarse-grained data given in whole seconds.
You'll also notice in those results that the memory requirements for decompression are flat for all levels of compression at 0.1 MiB. That's almost unbelievable, just a degree of excellence in software that we rarely see. Mark Adler and colleagues deserve massive props for what they achieved. gzip is a very nice format.
The memory use gets at your question about overhead. There really is none. You don't gain much with level 9 in terms of browser decompression speed, but you don't lose anything.
Now, check out these test results for a bit more texture. You'll see how the gzip decompression rate is slightly faster with level 9 compressed content than with lower levels (at level 9, decomp rate is about 0.9% faster than at level 6, for example). That is interesting and surprising. I wouldn't expect the rate to increase. That was just one set of test results – it may not hold for other scenarios (and the difference is quite small in any case).
Parting note: Precompressing static files is a good idea, but I don't recommend gzip at level 9. You'll get smaller files than gzip-9 by instead using zopfli or libdeflate. Zopfli is a well-established gzip compressor from Google. libdeflate is new but quite excellent. In my testing it consistently beats gzip-9, but still trails zopfli. You can also use 7-Zip to create gzip files, and it will consistently beat gzip-9. (In the foregoing, gzip-9 refers to using the canonical gzip or zlib application that Apache and nginx use).
No, there is no downside on the decompression side when using the maximum compression level. In fact, there is a slight upside, in that better-compressed data decompresses faster. The reason is simply fewer compressed bits that the decompressor has to process.
Actually, in real-world measurements a higher compression level yields lower decompression times (which might primarily be caused by the fact that you need to touch less permanent storage and less RAM).
Since most things that happen at a client with the data are rather expensive compared to gunzipping, you shouldn't really care about that at all.
Also be advised that for static assets that are images, Huffman/zlib coding is usually already applied (PNG simply uses zlib!), and you won't gain much by gzipping these. In fact, small images (for example, icons) often fit into a single TCP packet (ignoring the HTTP header, which is sometimes bigger than the image itself), so you don't get any speed gain (but you do save money on transfer volume -- if you deliver terabytes of small images; now, may I presume you're not Google itself...).
Also, I'd like to point you to higher-level optimizations, like tools that can transform your JavaScript code into a more compact shape (e.g. removing whitespace, renaming private variables from my_mother_really_likes_this_number_of_unicorns to m1); also, things like jQuery come in a "precompressed" form. The same exists for HTML. It doesn't make things easier to debug, but since you seem to be interested in ultimate space saving...

Predicting time or compression ratio for lossless compression of a file?

How would one be able to predict execution time and/or the resulting compression ratio when compressing a file with a certain lossless compression algorithm? I am especially concerned with local compression, since if you know the time and compression ratio for local compression, you can easily calculate the time for compression over a network based on the currently available network throughput.
Let's say you have some information about the file, such as size, redundancy, and type (we can say text to keep it simple). Maybe we have some statistical data from actual prior measurements. What else would be needed to perform a prediction of execution time and/or compression ratio (even a very rough one)?
For local compression alone, the size of the file would have an effect, since actually reading and writing data to/from storage media (SD card, hard drive) would take the dominant portion of the total execution time.
The actual compression portion will probably depend on redundancy/type, since most compression algorithms work by compressing small blocks of data (100 KB or so). For example, larger HTML/JavaScript files compress better since they have higher redundancy.
I guess there is also a problem of scheduling, but this could probably be ignored for a rough estimate.
This is a question that has been in my head for quite some time. I have been wondering whether some low-overhead code (say, on the server) could predict how long it would take to compress a file before performing the actual compression.
Sample the file by taking 10-100 small pieces from random locations. Compress them individually. This should give you a lower bound on compression ratio.
This only returns meaningful results if the chunks are not too small. The compression algorithm must be able to make use of a certain size of history to predict the next bytes.
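A minimal sketch of that sampling approach (not from the answer itself), using zlib as a stand-in compressor; the sample count and chunk size are arbitrary choices:

    import os
    import random
    import zlib

    def estimate_ratio(path, samples=50, chunk_size=64 * 1024):
        """Estimate the compression ratio by compressing random chunks of the file."""
        size = os.path.getsize(path)
        raw = comp = 0
        with open(path, "rb") as f:
            for _ in range(samples):
                f.seek(random.randrange(max(1, size - chunk_size)))
                chunk = f.read(chunk_size)
                raw += len(chunk)
                comp += len(zlib.compress(chunk, 6))
        return raw / comp   # e.g. 3.0 means the sampled data shrinks to about a third

    print(estimate_ratio("big_log.txt"))   # file name is a placeholder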
It depends on the data, but with images you can take small samples. Downsampling would change the result. Here is an example: PHP - Compress Image to Meet File Size Limit.
The compression ratio can be calculated as uncompressed size / compressed size (equivalently, the space savings is 1 - compressed size / uncompressed size).
Performance benchmarking can be done using V8 or SunSpider.
You can also use algorithms like DEFLATE or LZMA to model the mechanism. PPM (Prediction by Partial Matching) can be used for prediction.

Packing and compressing resource data

I am trying to pack and compress game client resource data using zlib. Compressing the data reduces disk I/O because of the smaller file size, but it increases CPU usage when decompressing.
Question1
If a resource used for rendering is compressed, processing (rendering and decompressing) uses the CPU, so I think it may be rather slow. Is that right?
Without compression, disk I/O is unchanged and no additional CPU usage occurs. And if you read only a portion of the file, disk I/O can be reduced by using the CreateFileMapping() and MapViewOfFile() functions.
Question2
In the case of a resource such as an uncompressed image (for example TGA, not PNG), where we have to read the whole file, we can't take advantage of CreateFileMapping() and MapViewOfFile(), so I think compressing the resource is better. What do you think?
Question3
What do you think about compressing resource data when packing?
Resources for games are not only packed to reduce size, but also to reduce the number of seeks by collapsing many small files into one, which matters a lot more than the size on disk. A single unnecessary seek on a conventional hard disk costs as much time as reading several megabytes of data. Even if your "compression" consists of only concatenating small files together, you already gain performance.
As a small bonus, having resources packed in an archive somewhat obscures them from computer-unsavvy people, deterring them from modifying game assets (though admittedly, this is not a very big hurdle!).
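To illustrate the packing idea above (this is not the answerer's code; the archive format is made up), a bare-bones Python packer that concatenates per-entry zlib-compressed files and appends a small index might look like this:

    import json
    import os
    import struct
    import zlib

    def pack(paths, archive):
        """Concatenate many small files into one archive, compressed per entry,
        with a JSON index and its length stored at the end of the file."""
        index = {}
        with open(archive, "wb") as out:
            for path in paths:
                blob = zlib.compress(open(path, "rb").read(), 6)
                index[os.path.basename(path)] = [out.tell(), len(blob)]
                out.write(blob)
            table = json.dumps(index).encode()
            out.write(table)
            out.write(struct.pack("<I", len(table)))   # last 4 bytes: index size

    def read_entry(archive, name):
        """Look up one entry in the trailing index and decompress just that entry."""
        with open(archive, "rb") as f:
            f.seek(-4, os.SEEK_END)
            (table_len,) = struct.unpack("<I", f.read(4))
            f.seek(-4 - table_len, os.SEEK_END)
            offset, length = json.loads(f.read(table_len))[name]
            f.seek(offset)
            return zlib.decompress(f.read(length))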
Q1: Depending on what compression algorithm you use, you can easily get upwards of 1 GB/s decompression (close to 2 GB/s with a fast CPU). Sequential disk I/O is still around 300-400 MB/s maximum even on solid state (and usually less). Random access disk I/O is 5-20 times slower, depending on the disk and the access pattern.
On the other hand, you can get as little as a few dozen kilobytes per second in decompression speed if you choose a slow algorithm, which is much worse than just loading more data from disk. The secret is to choose an algorithm that compresses reasonably well (not perfectly, just reasonably) and runs at good decompression speed. Compression speed usually does not matter, since this is done offline once. Candidate algorithms are for example LZF, Snappy, or LZ4.
File mapping can generally be used regardless of whether the contents are compressed. Also, filemapping is not only an advantage for very small portions, on the contrary. The larger your reads, the more advantageous it becomes (very small views may actually be faster using conventional reads).
Q2: Uncompressed images do not normally occur in a game. Most of the time you will want to use DXT compression, not so much to reduce disk I/O but to reduce memory and PCIe bandwidth requirements and GPU memory consumption. DXT is a very poor compression, but it works in hardware and has an exactly predictable compression ratio. You can compress DXT-compressed textures again with a conventional general-purpose compressor (with varying ratios depending on which compressor you use; there are some that are especially optimized for that purpose).
Q3: Packing resources is definitely advisable for any non-trivial game.

Difference between stateless and stateful compression?

In the chapter Filters (scroll down ~50%) of an article about the Remote Call Framework, two kinds of compression are mentioned:
ZLib stateless compression
ZLib stateful compression
What is the difference between the two? Is it ZLib-related, or are these common compression methods?
While searching I could only find stateful and stateless web services. Aren't the attributes stateless/stateful meant to describe the compression method?
From Transport Layer Security Protocol Compression Methods:
Compression methods used with TLS can be either stateful (the compressor maintains its state through all compressed records) or stateless (the compressor compresses each record independently), but there seems to be little known benefit in using a stateless compression method within TLS.
Some compression methods have the ability to maintain history information when compressing and decompressing packet payloads. The compression history allows a higher compression ratio to be achieved on a stream as compared to per-packet compression, but maintaining a history across packets implies that a packet might contain data needed to completely decompress data contained in a different packet. History maintenance thus requires both a reliable link and sequenced packet delivery. Since TLS and lower-layer protocols provide reliable, sequenced packet delivery, compression history information MAY be maintained and exploited if supported by the compression method.
In general, stateless describes any process that does not have a memory of past events, and stateful describes any process that does have such a memory (and uses it to make decisions.)
In compression, then, stateless means that whatever chunk of data the compressor sees, it compresses without depending on previous inputs. It's faster but usually compresses less. Stateful compression looks at previous data to decide how to compress the current data; it's slower but compresses much better.
Zlib is a compression algorithm that's adaptive. All compression algorithms work because the data they work on isn't entirely random; instead, their input data has a non-uniform distribution that can be exploited. Take English text as a simple example: the letter e is far more common than the letter q. Zlib will detect this and use fewer bits for the letter e.
Now, when you send a lot of short text messages and you know they're all in English, you should use ZLib stateful compression. It would keep that low-bit representation of the letter e across all messages. But if messages in Chinese, Japanese, French, etc. are intermixed, stateful compression is no longer that smart: there will be few letter e's in a Japanese text. Stateless compression would check, for each message, which letters are common. A well-known example of ZLib stateless compression is the PNG file format, which keeps no state between two distinct images.
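A small Python sketch of the difference (the messages are made up and zlib defaults are used): compressing each message independently versus keeping one compressor's history across messages:

    import zlib

    messages = [b"the quick brown fox ", b"the quick brown fox ",
                b"jumps over the lazy dog "] * 50

    # Stateless: every message is compressed on its own, with no shared history.
    stateless = sum(len(zlib.compress(m)) for m in messages)

    # Stateful: one compressor object keeps its history across messages,
    # flushing after each one so every piece can be sent immediately.
    c = zlib.compressobj()
    stateful = sum(len(c.compress(m)) + len(c.flush(zlib.Z_SYNC_FLUSH))
                   for m in messages)

    print("stateless:", stateless, "bytes")
    print("stateful: ", stateful, "bytes")   # usually much smaller for repetitive streams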