Difference between stateless and stateful compression? - C++

In the chapter Filters (about halfway down) of an article about the Remote Call Framework, two modes of compression are mentioned:
ZLib stateless compression
ZLib stateful compression
What is the difference between the two? Is it ZLib-specific, or are these common compression concepts?
While searching, I could only find material on stateful and stateless web services. Aren't the attributes stateless/stateful meant to describe the compression method?

From Transport Layer Security Protocol Compression Methods:

Compression methods used with TLS can be either stateful (the compressor maintains its state through all compressed records) or stateless (the compressor compresses each record independently), but there seems to be little known benefit in using a stateless compression method within TLS.
Some compression methods have the ability to maintain history information when compressing and decompressing packet payloads. The compression history allows a higher compression ratio to be achieved on a stream as compared to per-packet compression, but maintaining a history across packets implies that a packet might contain data needed to completely decompress data contained in a different packet. History maintenance thus requires both a reliable link and sequenced packet delivery. Since TLS and lower-layer protocols provide reliable, sequenced packet delivery, compression history information MAY be maintained and exploited if supported by the compression method.
Some compression methods have the
ability to maintain history
information when compressing and
decompressing packet payloads. The
compression history allows a higher
compression ratio to be achieved on
a stream as compared to per-packet
compression, but maintaining a
history across packets implies that a
packet might contain data needed to
completely decompress data contained
in a different packet. History
maintenance thus requires both a
reliable link and sequenced packet
delivery. Since TLS and lower-layer
protocols provide reliable,
sequenced packet delivery, compression
history information MAY be
maintained and exploited if supported
by the compression method.

In general, stateless describes any process that does not have a memory of past events, and stateful describes any process that does have such a memory (and uses it to make decisions.)
In compression, then, stateless means the compressor takes whatever chunk of data it sees and compresses it without depending on previous inputs. It's faster but usually compresses less. Stateful compression looks at previous data to decide how to compress the current data; it's slower but compresses much better.

Zlib is an adaptive compression algorithm. All compression algorithms work because the data they operate on isn't entirely random; instead, their input has a non-uniform distribution that can be exploited. Take English text as a simple example: the letter e is far more common than the letter q. Zlib will detect this and use fewer bits for the letter e.
Now, when you send a lot of short text messages and you know they're all in English, you should use ZLib stateful compression. It will keep that low-bit representation of the letter e across all messages. But if messages in Chinese, Japanese, French, etc. are intermixed, stateful compression is no longer that smart: there will be few letters e in a Japanese text. Stateless compression checks, for each message, which letters are common. A well-known example of ZLib stateless compression is the PNG file format, which keeps no state between two distinct images.

Related

Any key-value storages with emphasis on compression?

Are there any key-value storages which fit the following criteria?
are open-source
persistent file storage
have replication and oplog
have configurable compression usable for storing 10-100 megabytes of raw text per second
work on windows and linux
Desired interface should contain at least:
store a record by a text or numeric ID
retrieve a record by ID
WiredTiger does support different kinds of compression:
Compression considerations

WiredTiger compresses data at several stages to preserve memory and disk space. Applications can configure these different compression algorithms to tailor their requirements between memory, disk and CPU consumption. Compression algorithms other than block compression work by modifying how the keys and values are represented, and hence reduce data size in-memory and on-disk. Block compression, on the other hand, compresses the data in its binary representation while saving it on the disk.
Configuring compression may change application throughput. For example, in applications using solid-state drives (where I/O is less expensive), turning off compression may increase application performance by reducing CPU costs; in applications where I/O costs are more expensive, turning on compression may increase application performance by reducing the overall number of I/O operations.
WiredTiger uses some internal algorithms to compress the amount of data stored that are not configurable, but always on. For example, run-length encoding reduces the size requirement by storing sequential, duplicate values in the store only a single time (with an associated count).
WiredTiger supports several kinds of compression:
key prefix
dictionary
huffman
and block compression, which supports, among others, lz4, snappy, zlib and zstd.
Have a look at the documentation for full coverage of the subject.
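As a sketch of what the configuration looks like (option names per the WiredTiger documentation; the table name is invented), block compression and the key/value-level methods are selected in the WT_SESSION::create configuration string:

```
# configuration string for a hypothetical table "table:records"
key_format=S,value_format=S,block_compressor=zstd,prefix_compression=true
```

Swap block_compressor for lz4, snappy or zlib as needed; the chosen compressor has to be built into (or loaded by) the library for table creation to succeed.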

advantage of serialization over sockets c++

Currently we have integrated networking into our game using the UDP protocol, and it works fine. But we are sending strings over the network to the server, e.g. "10,10,23 - 23,9,10 - 9,23,23".
I came across the idea that I need to serialize the data, as this is supposedly the right way to do it. What are the benefits? Does it reduce performance? Or is sending strings fine?
You're already serialising it.
I think what you're asking is whether it is beneficial to serialise to a compact, binary format rather than human-readable strings. The answer is yes, since you can reduce bandwidth requirements and parsing time.
Sometimes you can simply copy the bytes that make up your objects straight into the communications medium, though watch out for endianness, padding, width, alignment and other implementation-defined quantities. Generally you want to define a single, universal format for your data, and some translation may be required on one or more endpoints to produce it. That said, in most cases that's still going to be cheaper than string parsing and stringification.
The downside is you cannot snoop on the communications channel and immediately see with your eyes what's going on, when debugging your networking.

MPI: is there mpi libraries capable of message compression?

Sometimes MPI is used to send low-entropy data in messages, so it can be useful to try to compress messages before sending them. I know that MPI can work on very fast networks (10 Gbit/s and more), but many MPI programs are used with cheap networks like 0.1 or 1 Gbit/s Ethernet and with cheap (slow, low-bisection) network switches. There is the very fast Snappy (wikipedia) compression algorithm, whose
Compression speed is 250 MB/s and decompression speed is 500 MB/s
so on compressible data and a slow network it will give some speedup.
Is there any MPI library which can compress MPI messages (at the MPI layer; not compression of IP packets as in PPP)?
MPI messages are also structured, so there could be some special method, like compressing the exponent parts in an array of doubles.
PS: There is also the LZ4 compression method, with comparable speed.
I won't swear that there's none out there, but there's none in common use.
There are a couple of reasons why it's not common:
MPI is often used for sending lots of floating point data which is hard (but not impossible) to compress well, and often has relatively high entropy after a while.
In addition, MPI users are often as concerned with latency as bandwidth, and adding a compression/decompression step into the message-passing critical path wouldn't be attractive to those users.
Finally some operations (like reduction collectives, or scatter gather) would be very hard to implement efficiently with compression.
However, it sounds like your use case could benefit from this for point-to-point communications, so there's no reason why you couldn't do it yourself. If you were going to send a message of size N and the receiver expected it, then:
the sender calls the compression routine, getting back a buffer and a new length M;
if M >= N, the sender sends the original data, prefixed with an initial byte of (say) 0, as N+1 bytes;
otherwise, it sends an initial byte of 1 followed by the compressed data;
the receiver receives the data into a buffer of length N+1;
if the first byte is 1, it calls MPI_Get_count to determine the amount of data received, then calls the decompression routine;
otherwise, it uses the uncompressed data as-is.
I can't give you much guidance as to the compression routines, but it does look like people have tried this before, e.g. http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.91.7936 .
I'll be happy to be told otherwise, but I don't think many of us users of MPI are concerned with having a transport layer that compresses data.
Why the heck not?
1) We already design our programs to do as little communication as possible, so we (like to think we) are sending the bare minimum across the interconnect.
2) The bulk of our larger messages comprise arrays of floating-point numbers, which are relatively difficult (and therefore relatively expensive in time) to compress to any degree.
There's an ongoing project at the University of Edinburgh: http://link.springer.com/chapter/10.1007%2F978-3-642-32820-6_72?LI=true

Efficient Packet types/transfer protocol

Using boost::asio in C++, I'm trying to determine the best way to encrypt packets in my program. I thought of defining the packets myself by type number, each with a different fixed packet size. The system reads the header (type, and quantity of entries for lists of data), creates the appropriate structure to receive the data, then reacts according to the data received.
However, when I look at this method, I wonder if there would be a simpler way to accomplish this without sacrificing efficiency.
These packets are to be sent between different applications through TCP. Ideally, I'm aiming for both applications to use as little bandwidth and CPU as possible while also being as simple to modify as possible. Any suggestions?
TCP uses streams of data, not packets. I highly suggest thinking of your data transmission as a stream of data instead of sequence of packets. This will make it easier to abstract into your code. Take a look at Boost.Serialization or Google Protocol Buffers.
Boost.Asio has SSL encryption capabilities, so it's trivial to encrypt the stream of data. It also has an example using serialization.
Have you considered google protobufs? While it doesn't actually do the encryption (you'll have to do that yourself), it does provide a way of encoding the structured data allowing you to send it over the wire efficiently. Additionally, there are many language bindings for it (C++, Java, and Python off the top of my head).
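As a sketch of what such a message definition looks like (the message and field names are invented), a .proto file describes the structured data once, and the protobuf compiler generates the C++ encoding/decoding code from it:

```proto
// hypothetical packet definitions; compile with: protoc --cpp_out=. packets.proto
syntax = "proto3";

message Entry {
  uint32 x = 1;
  uint32 y = 2;
  uint32 z = 3;
}

message Packet {
  uint32 type = 1;             // your packet type number
  repeated Entry entries = 2;  // variable-length list, no fixed packet sizes needed
}
```

Because repeated fields carry their own length information on the wire, this removes the need to hand-maintain fixed sizes per packet type, which is most of the "simple to modify" requirement.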

compressing socket send data

I'm trying to send a lot of data (basically data records converted to strings) over a socket, and it's slowing down the performance of the rest of my program. Is it possible to compress the data using gzip etc. and uncompress it at the other end?
Yes. The easiest way to implement this is to use the venerable zlib library.
The compress() and uncompress() utility functions may be what you're after.
Yes, but compression and decompression have their costs as well.
You might want to consider using another process or thread to handle the data transfer; this is probably harder than merely compressing, but will scale better when your data load increases n-fold.
Yes, it's possible. zlib is one library for doing this sort of compression and decompression. However, you may be better served by serializing your data records in a binary format rather than as a string; that should improve performance, possibly even more so than using compression.
Of course you can do that. When sending binary data, you have to take care of the endianness of the platform.
However, are you sure your performance problems will be solved by compressing the sent data? You'll still have additional steps (compression/decompression, possibly solving endianness issues).
Think about how the communication through sockets is done. Are you using synchronous or asynchronous communication? If you do the reads and writes synchronously, you can incur performance penalties...
You may use AdOC, a library to transparently overload socket system calls:
http://www.labri.fr/perso/ejeannot/adoc/adoc.html
It does compression on the fly if it finds that it would be profitable.