I'd like a compression API that allows me to store its state information separately from its compressed data.
I recognize that, to work well, this would probably require an API with two passes: pass 1 over the data to build up its symbol replacement tables, and pass 2 to actually compress the data.
Does such an API exist?
Any usable compression library has a state, so that you can compress arbitrarily long streams. Without a state, you would have to give it all of the data in one call.
In addition, any usable compression library is streaming, which means there is only one pass.
zlib is one example, where both deflate() and inflate() maintain an opaque state from call to call, referenced through the z_stream structure.
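For illustration, a minimal sketch of that state being carried across calls (error handling omitted; assumes dst was sized with deflateBound()):

    #include <zlib.h>
    #include <cstring>

    // Compress src in two halves to show that the z_stream carries its
    // state (dictionary, pending bits, etc.) between deflate() calls.
    // Assumes dst is large enough, e.g. sized with deflateBound().
    size_t compress_in_two_calls(const unsigned char *src, size_t len,
                                 unsigned char *dst, size_t dstlen) {
        z_stream strm;
        memset(&strm, 0, sizeof(strm));
        deflateInit(&strm, Z_DEFAULT_COMPRESSION);

        strm.next_out = dst;
        strm.avail_out = (uInt)dstlen;

        // First call: feed the first half, keep the stream open.
        strm.next_in = (Bytef *)src;
        strm.avail_in = (uInt)(len / 2);
        deflate(&strm, Z_NO_FLUSH);        // state is retained in strm

        // Second call: feed the rest, then finish the stream.
        strm.next_in = (Bytef *)(src + len / 2);
        strm.avail_in = (uInt)(len - len / 2);
        deflate(&strm, Z_FINISH);

        size_t produced = dstlen - strm.avail_out;
        deflateEnd(&strm);
        return produced;                   // compressed byte count
    }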
Related
I'd like to be able to generate a gzip (.gz) file using concurrent CPU threads, i.e., deflating separate chunks from the input file with separately initialized z_stream records.
The resulting file should be readable by zlib's inflate() function in a classic single threaded operation.
Is that possible, even if it requires customized zlib code? The only requirement would be that existing zlib inflate() code could handle it.
Update
The pigz source code demonstrates how it works. It uses some sophisticated optimizations to share the dictionary between chunks, keeping the compression rate optimal. It further handles bit packing if a more recent zlib version is used.
However, I'd like to understand how to roll my own, keeping things simple, without the optimizations pigz uses.
And while many consider source code to be the ultimate documentation (Ed Post, anyone?), I'd rather have it explained in plain words to avoid misunderstandings. (While the docs actually describe what happens pretty well, they do not explain too well what needs to be done to roll one's own.)
From browsing the code, I figured out this much so far:
It appears that one simply creates each compressed chunk using deflate(..., Z_SYNC_FLUSH) instead of using Z_FINISH. However, deflateEnd() then reports an error; I am not sure whether that can be ignored. One also needs to calculate the final checksum over all chunks manually, though I wonder how to add the checksum at the end. There is also a rather complex put_trailer() function for writing the gzip trailer - I wonder if that could also be handled by zlib's own code for simple cases?
Any clarification on this is appreciated.
Also, I realize that I should have asked about writing a zlib stream the same way, in order to write multithreaded-compressed files to a zip archive. There, I suspect, more simplifications are possible due to the lack of the more complex gzip header.
The answer is in your question. Each thread has its own deflate instance to produce raw deflate data (see deflateInit2()), which compresses the chunk of the data fed to it, ending with Z_SYNC_FLUSH instead of Z_FINISH. Except for the last chunk of data, which you end with a Z_FINISH. Either way, this ends each resulting stream of compressed data on a byte boundary. Make sure that you get all of the generated data out of deflate(). Then you can concatenate all the compressed data streams. (In the correct order!) Precede that with a gzip header that you generate yourself. It is trivial to do that (see RFC 1952). It can just be a constant 10-byte sequence if you don't need any additional information included in the header (e.g. file name, modification date). The gzip header is not complex.
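By way of illustration, here is a minimal sketch of what each worker thread could do per chunk (the function name and buffer handling are mine, not zlib's; error handling is omitted, and the output buffer is assumed to be sized with deflateBound()):

    #include <zlib.h>
    #include <cstring>

    // Constant 10-byte gzip header: magic, CM=deflate, no flags,
    // zero mtime, no extra flags, OS=255 (unknown). See RFC 1952.
    static const unsigned char GZIP_HEADER[10] =
        { 0x1f, 0x8b, 8, 0, 0, 0, 0, 0, 0, 255 };

    // Compress one chunk to a raw deflate stream (no zlib/gzip wrapper),
    // ending with Z_SYNC_FLUSH (or Z_FINISH for the last chunk) so the
    // output ends on a byte boundary. Returns the compressed size.
    size_t compress_chunk(const unsigned char *in, size_t inlen,
                          unsigned char *out, size_t outlen, bool last) {
        z_stream strm;
        memset(&strm, 0, sizeof(strm));
        // windowBits = -15 selects a raw deflate stream (deflateInit2).
        deflateInit2(&strm, Z_DEFAULT_COMPRESSION, Z_DEFLATED,
                     -15, 8, Z_DEFAULT_STRATEGY);
        strm.next_in = (Bytef *)in;   strm.avail_in = (uInt)inlen;
        strm.next_out = out;          strm.avail_out = (uInt)outlen;
        deflate(&strm, last ? Z_FINISH : Z_SYNC_FLUSH);
        // (Assumes outlen was sized with deflateBound(); otherwise loop
        // until avail_out is nonzero to be sure all output came out.)
        size_t n = outlen - strm.avail_out;
        deflateEnd(&strm);   // the error on unfinished streams is ignorable
        return n;
    }

Write GZIP_HEADER once, then the chunks' compressed outputs in order.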
You can also compute the CRC-32 of each uncompressed chunk in the same thread or a different thread, and combine those CRC-32's using crc32_combine(). You need that for the gzip trailer.
After all of the compressed streams are written, ending with the compressed stream that was ended with a Z_FINISH, you append the gzip trailer. All that is is the four-byte CRC-32 and the low four bytes of the total uncompressed length, both in little-endian order. Eight bytes total.
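A sketch of that combination and the trailer write, with hypothetical per-chunk crcs[] and lens[] arrays gathered from the worker threads:

    #include <zlib.h>
    #include <cstdio>

    // Combine per-chunk CRC-32s and write the 8-byte gzip trailer.
    void write_trailer(FILE *f, const uLong *crcs, const size_t *lens,
                       int nchunks) {
        uLong crc = crc32(0L, Z_NULL, 0);
        unsigned long total = 0;
        for (int i = 0; i < nchunks; i++) {
            crc = crc32_combine(crc, crcs[i], (z_off_t)lens[i]);
            total += lens[i];
        }
        unsigned char trailer[8];
        for (int i = 0; i < 4; i++) {             // both little-endian
            trailer[i]     = (unsigned char)(crc   >> (8 * i)); // CRC-32
            trailer[4 + i] = (unsigned char)(total >> (8 * i)); // ISIZE mod 2^32
        }
        fwrite(trailer, 1, 8, f);
    }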
In each thread you can either use deflateEnd() when done with each chunk, or if you are reusing threads for more chunks, use deflateReset(). I found in pigz that it is much more efficient to leave threads open and deflate instances open in them when processing multiple chunks. Just make sure to use deflateEnd() for the last chunk that thread processes, before closing the thread. Yes, the error from deflateEnd() can be ignored. Just make sure that you've run deflate() until avail_out is not zero to get all of the compressed data.
Doing this, each thread compresses its chunk with no reference to any other uncompressed data, where such references would normally improve the compression when doing it serially. If you want to get more advanced, you can feed each thread the chunk of uncompressed data to compress, and the last 32K of the previous chunk to provide history for the compressor. You do this with deflateSetDictionary().
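A sketch of that, as a helper you would call after deflateInit2() and before deflate() on each chunk (prev/prevlen, naming the previous uncompressed chunk, are mine):

    #include <zlib.h>

    // Give the compressor up to 32K of the previous chunk as history so
    // matches can reach back across the chunk boundary. Since the chunks
    // are concatenated in order, a serial inflate naturally has this
    // history available and needs no special handling.
    void seed_history(z_stream *strm, const unsigned char *prev,
                      size_t prevlen) {
        size_t hist = prevlen < 32768 ? prevlen : 32768;
        deflateSetDictionary(strm, (const Bytef *)(prev + prevlen - hist),
                             (uInt)hist);
    }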
Still more advanced, you can reduce the number of bytes inserted between compressed streams by sometimes using Z_PARTIAL_FLUSH until getting to a byte boundary. See pigz for the details of that.
Even more advanced, but slower, you can append compressed streams at the bit level instead of the byte level. That would require shifting every byte of the compressed stream twice to build a new shifted stream. At least for seven out of every eight preceding compressed streams. This eliminates all of the extra bits inserted between compressed streams.
A zlib stream can be generated in exactly the same way, using adler32_combine() for the checksum.
Your question about zlib implies a confusion. The zip format does not use the zlib header and trailer. zip has its own structure, within which raw deflate streams are embedded. You can use the above approach for those raw deflate streams as well.
Sure.
http://zlib.net/pigz/
A parallel implementation of gzip for modern multi-processor, multi-core machines
I have a block of data I want to compress, say, C structures of variable sizes. I want to compress the data, but access specific fields of structures on the fly in application code without having to decompress the entire data.
Is there an algorithm which can take the offset (for the original data), decompress and return the data?
Compression methods generally achieve compression by making use of the preceding data. At any point in the compressed data, you need to know at least some amount of the preceding uncompressed data in order to decompress what follows.
You can deliberately forget the history at select points in the compressed data in order to have random access at those points. This reduces the compression by some amount, but that can be small with sufficiently distant random access points. A simple approach would be to compress pieces with gzip and concatenate the gzip streams, keeping a record of the offsets of each stream. For less overhead, you can use Z_FULL_FLUSH in zlib to do the same thing.
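As a sketch of the Z_FULL_FLUSH variant, each access point is just a full flush plus a note of where it landed in the compressed output (assumes avail_out is large enough for the flush and that output is written to the file in order, so total_out matches the file offset):

    #include <zlib.h>
    #include <vector>

    // After feeding a block's worth of input, force a full flush so that
    // decompression can restart here with no history, and record where in
    // the compressed output the access point falls.
    void mark_access_point(z_stream *strm, std::vector<size_t> &index) {
        deflate(strm, Z_FULL_FLUSH);   // byte-aligns output, forgets history
        index.push_back((size_t)strm->total_out);
    }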
Alternatively, you can save the history at each random access point in a separate file. An example of building such a random access index to a zlib or gzip stream can be found in zran.c.
You can construct compression methods that do not depend on previous history for decompression, such as simple Huffman coding. However the compression ratio will be poor compared to methods that do depend on previous history.
An example is a compressed file system: there, the filesystem API doesn't need to know about the compression that happens before the data is written to disk. There are a few algorithms out there.
Check here for more details.
However, maybe there is more gain in trying to optimize the data structures used, so there is no need to compress them?
For efficient access, an index is needed. So between arrays, multimaps, and sparse arrays, there should be a way to model the data such that no further compression is needed, because the data is already represented efficiently.
Of course this depends largely on the use case which is quite ambiguous.
It is possible to imagine a use case where a compression layer is needed to access the data, but it's likely that there are better ways to solve the problem.
I am in a dilemma deciding between the scenarios below; expert help would be appreciated.
Scenario: There is TCP/IP communication between two processes running on two boxes.
Communication Method 1: Stream-based communication on the socket. The receiver reads the byte stream, interprets the first few fixed bytes as a header, deserializes it to learn the message length, takes a message of that length off the stream and deserializes it, then proceeds to the next message header, and so on.
Communication Method 2: Put all the messages in a vector, with the vector residing in a class object. Serialize the class object in one go and send it to the receiver. The receiver deserializes the class object and reads the vector entries one by one.
Please let me know which approach is more efficient, and if there is a better approach, please guide me.
Also, what are the pros and cons of class-based versus structure-based data transmission, and which is suitable for which scenario?
Your question lacks some key details, and mixes different concerns, frustrating any attempt to provide a good answer.
Specifically, Method 2 mysteriously "serialises" and "deserialises" the object and contained vector without specifying any details of how that's being done. In practice, the details are of the kind alluded to in Method 1. So, 1 and 2 aren't alternatives unless you're choosing between using a serialisation library and doing it from scratch (in which case I'd say use the library as you're new to this and the library's more likely to get it right).
What I can say:
at a TCP level, it's most efficient to read into a decent-sized block (given I tend to work on PC/server hardware, I'd just use 64k, though smaller may be enough to get the same kind of throughput) and have each read() or recv() read as much data from the socket as possible
after reading enough bytes (in however many read/recvs) to attempt some interpretation of the data, it's necessary to recognise the end of particular parts of the serialised input: sometimes that's implicit in the data type involved, other times it's communicated using some sentinel (e.g. a linefeed or NUL), and other times there can be a prefixed fixed-size "expect N bytes" header. This aspect/consideration often applies hierarchically to the stream of objects and nested sub-objects, etc.
the TCP read/recvs may deliver more data than were sent in any single request, so you may have 1 or more bytes that are logically part of the subsequent but incomplete logical message at the end of the block assembled above
the process of reading larger blocks then accessing various fixed and variable sized elements inside the buffers is already supported by C++ iostreams, but you can roll your own if you want
So, let me emphasise this: do NOT assume you will receive any more than 1 byte from any given read of the socket: if you have say a 20 byte header you should loop reading until you hit either an error or have assembled all 20 bytes. Sending 20 bytes in a single write() or send() does not mean the 20 bytes will be presented to a single read() / recv(). TCP is a byte stream protocol, and you have to take arbitrary numbers of bytes as and when they're provided, waiting until you have enough data to interpret it. Similarly, be prepared to get more data than the client could write in a single write()/send().
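In code, that loop could look like this (POSIX sockets; a sketch without production-grade error handling):

    #include <sys/types.h>
    #include <sys/socket.h>
    #include <cstddef>

    // Keep calling recv() until exactly len bytes have been assembled,
    // because TCP may deliver them in arbitrarily small pieces.
    bool read_exactly(int fd, char *buf, size_t len) {
        size_t got = 0;
        while (got < len) {
            ssize_t n = recv(fd, buf + got, len - got, 0);
            if (n <= 0)             // error or orderly shutdown
                return false;
            got += (size_t)n;
        }
        return true;
    }

Read the fixed-size header with read_exactly(), decode the length from it, then read_exactly() the body.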
Also, what are the pros and cons of class-based versus structure-based data transmission, and which is suitable for which scenario?
These terms are completely bogus. classes and structures are almost identical things in C++ - mechanisms for grouping data and related functions (they differ only in how they - by default - expose the base classes and data members to client code). Either can have or not have member functions or support code that helps serialise and deserialise the data. For example, the simplest and most typical support are operator<< and/or operator>> streaming functions.
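For instance, with a made-up Message type, the simplest human-readable streaming support could look like this:

    #include <iostream>
    #include <string>

    struct Message {
        int id;
        std::string text;
    };

    // Human-readable serialisation: "id size text", easy to debug.
    std::ostream &operator<<(std::ostream &os, const Message &m) {
        return os << m.id << ' ' << m.text.size() << ' ' << m.text;
    }

    std::istream &operator>>(std::istream &is, Message &m) {
        std::size_t n;
        is >> m.id >> n;
        is.get();                    // consume the separating space
        m.text.resize(n);
        is.read(&m.text[0], (std::streamsize)n);  // read exactly n bytes
        return is;
    }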
If you want to contrast this kind of streaming function with an ad-hoc "write a binary block, read a binary block" approach (perhaps from thinking of structs as being POD without support code), then I'd say prefer streaming functions where possible, starting with streaming to human-readable representations as they'll make your system easier and quicker to develop, debug and support. Once you're really comfortable with that, if the runtime performance requires it then optimise with a binary representation. If you write the serialisation code well, you won't notice much difference in performance between a cruder void*/#bytes model of data and proper per-member serialisation, but the latter can more easily support unusual cases - portability across systems with different size ints/longs etc., different byte ordering, intentional choices re shallow vs. deep copying of pointed-to data, etc.
I'd also recommend looking at the boost serialisation library. Even if you don't use it, it should give you a better understanding of how this kind of thing is reasonably implemented in C++.
Both methods are equivalent. In both, you must send a header with the message size and an identifier in order to deserialize. If you consider that the first option is composed of serialized 'classes' like normal messages, you end up implementing much the same code either way.
Another thing to keep in mind is message size, so as to fill the TCP buffers and optimize communications. If your Method 1 messages are very small, try to improve the communication ratio with bigger messages, as in the second option you describe.
Keep in mind that it's not safe to simply stream out a struct or class directly by interpreting it as a sequence of bytes, even if it's a simple POD - there are issues like endianness (which is unlikely to be a real-world problem for most of us) and structure alignment/padding (which is a potential problem).
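A quick illustration of the padding issue (the exact sizes are implementation-defined, which is precisely the problem):

    #include <cstdint>
    #include <cstdio>

    struct Record {
        uint8_t  flag;    // 1 byte
        uint32_t value;   // 4 bytes, typically aligned to 4
    };

    int main() {
        // Commonly prints 8, not 5: the compiler inserted 3 padding
        // bytes, and another compiler/platform may pad differently,
        // so the raw bytes of a Record are not a portable wire format.
        printf("%zu\n", sizeof(Record));
    }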
C++ doesn't have any built-in serialization/deserialization, you'll either have to roll your own or take a look at things like boost Serialization, or Google's protobuf.
If it is not a homework or study project, there may be little point in fiddling with IPC at the TCP stream level, especially if that's not something you are familiar with.
Use a messaging library, like ØMQ to send and receive complete messages rather than streams of bytes.
In the chapter Filters (scroll down ~50%) in an article about the Remote Call Framework are mentioned 2 ways of compression:
ZLib stateless compression
ZLib stateful compression
What is the difference between those? Is it ZLib-related or are these common compression methods?
While searching, I could only find stateful and stateless web services. Aren't the attributes stateless/stateful meant to describe the compression method?
From Transport Layer Security Protocol Compression Methods:
Compression methods used with TLS can be either stateful (the compressor maintains its state through all compressed records) or stateless (the compressor compresses each record independently), but there seems to be little known benefit in using a stateless compression method within TLS.

Some compression methods have the ability to maintain history information when compressing and decompressing packet payloads. The compression history allows a higher compression ratio to be achieved on a stream as compared to per-packet compression, but maintaining a history across packets implies that a packet might contain data needed to completely decompress data contained in a different packet. History maintenance thus requires both a reliable link and sequenced packet delivery. Since TLS and lower-layer protocols provide reliable, sequenced packet delivery, compression history information MAY be maintained and exploited if supported by the compression method.
In general, stateless describes any process that does not have a memory of past events, and stateful describes any process that does have such a memory (and uses it to make decisions.)
In compression, then, stateless means that whatever chunk of data the compressor sees, it compresses without depending on previous inputs. That's faster but usually compresses less. Stateful compression looks at previous data to decide how to compress the current data; it's slower but compresses much better.
zlib implements an adaptive compression algorithm. All compression algorithms work because the data they operate on isn't entirely random. Instead, their input data has a non-uniform distribution that can be exploited. Take English text as a simple example: the letter e is far more common than the letter q. zlib will detect this and use fewer bits for the letter e.
Now, when you send a lot of short text messages and you know they're all in English, you should use zlib stateful compression. It will keep that low-bit representation of the letter e across all messages. But if messages in Chinese, Japanese, French, etc. are intermixed, stateful compression is no longer that smart; there will be few letters e in a Japanese text. Stateless compression checks for each message which letters are common. A well-known example of zlib stateless compression is the PNG file format, which keeps no state between two distinct images.
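In zlib terms, the difference boils down to whether the stream is reset between messages. A sketch (stream initialisation omitted; assumes out was sized with deflateBound()):

    #include <zlib.h>

    // Stateless: reset the stream for every message, so each message is
    // compressed independently, with no memory of earlier ones.
    size_t send_stateless(z_stream *strm, const Bytef *msg, uInt len,
                          Bytef *out, uInt outlen) {
        deflateReset(strm);
        strm->next_in = (Bytef *)msg;  strm->avail_in = len;
        strm->next_out = out;          strm->avail_out = outlen;
        deflate(strm, Z_FINISH);       // self-contained compressed message
        return outlen - strm->avail_out;
    }

    // Stateful: keep the stream open across messages. Z_SYNC_FLUSH emits
    // a complete, decodable record while the shared history (e.g. that
    // cheap code for the letter e) carries over to the next message.
    size_t send_stateful(z_stream *strm, const Bytef *msg, uInt len,
                         Bytef *out, uInt outlen) {
        strm->next_in = (Bytef *)msg;  strm->avail_in = len;
        strm->next_out = out;          strm->avail_out = outlen;
        deflate(strm, Z_SYNC_FLUSH);
        return outlen - strm->avail_out;
    }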
I'm trying to send a lot of data (basically data records converted to a string) over a socket, and it's slowing down the performance of the rest of my program. Is it possible to compress the data using gzip etc. and uncompress it at the other end?
Yes. The easiest way to implement this is to use the venerable zlib library.
The compress() and uncompress() utility functions may be what you're after.
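For example, a sketch of a one-shot round trip (the helper name is mine; both zlib calls return Z_OK on success):

    #include <zlib.h>
    #include <vector>

    // One-shot compression of a whole buffer using zlib's utility API.
    std::vector<unsigned char> deflate_buffer(const unsigned char *src,
                                              uLong len) {
        uLongf outlen = compressBound(len);  // worst-case compressed size
        std::vector<unsigned char> out(outlen);
        compress(out.data(), &outlen, src, len);
        out.resize(outlen);                  // shrink to actual size
        return out;
    }

    // To decompress, the receiver must know (or be sent) the original
    // length, e.g.:
    //   uLongf n = original_len;
    //   uncompress(dst, &n, compressed, compressed_len);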
Yes, but compression and decompression have their costs as well.
You might want to consider using another process or thread to handle the data transfer; this is probably harder than merely compressing, but will scale better when your data load increases n-fold.
Yes, it's possible. zlib is one library for doing this sort of compression and decompression. However, you may be better served by serializing your data records in a binary format rather than as a string; that should improve performance, possibly even more so than using compression.
Of course you can do that. When sending binary data, you have to take care of the endianness of the platform.
However, are you sure your performance problems will be solved by compressing the sent data? You'll still have additional steps (compression/decompression, possibly solving endianness issues).
Think about how the communication through sockets is done. Are you using synchronous or asynchronous communication? If you do the reads and writes synchronously, you can feel performance penalties...
You may use AdOC, a library to transparently overload socket system calls:
http://www.labri.fr/perso/ejeannot/adoc/adoc.html
It does compression on the fly if it finds that it would be profitable.