Is there a compression implementation with external metadata? Like Gzip or Brotli with external Huffman tree and LZW table

I have a lot of messages in Kafka and want to use compression. When a message is compressed, it consists of some metadata and the compressed message itself.
My goal is to save space not for a single message but across all messages.
Maybe there is an implementation of Gzip that can take a bunch of messages, compress them, save the meta (Huffman tree and LZW table), and then use it for all messages?
This is what I have after compression, pseudocode describes data:
Huffman tree for message 1, LZW table for message 1, compressed message 1 in queue;
Huffman tree for message 2, LZW table for message 2, compressed message 2 in queue;
Huffman tree for message 3, LZW table for message 3, compressed message 3 in queue;
This is what I want to have after compression, pseudocode describes data:
Common huffman tree and LZW table for all messages, stored separately.
compressed message 1 in queue;
compressed message 2 in queue;
compressed message 3 in queue;

What you're calling "metadata" is not metadata. It is the current description of Huffman codes or LZW codes to enable decompression. Metadata means associated information about the data that is not contained in the data, such as what is stored in a file system directory, e.g. the name of the data, modification date, permissions, author, source, etc. The compression coding information you are referring to is derived entirely from the data.
For LZW, that information is deeply intertwined with the compressed data and updated every symbol, and so cannot be saved and used on different data.
While you could in principle save the Huffman codes for an LZ77 compressor, that wouldn't do you much good. Most of the compression comes not from Huffman coding, but rather from matching strings. What you can do instead is save a "dictionary" of strings that occur commonly in your messages, and have that dictionary known on both the compression and decompression ends. Then you can use zlib to feed that dictionary to the compressor and decompressor. The dictionary can be up to 32K bytes, and can contain simply a concatenation of however many example messages will fit. Or you can try to be more sophisticated and extract common strings, putting the most common at the end of the dictionary.
zstd also provides dictionary facilities.
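As a minimal sketch of the preset-dictionary idea described above, the following uses Python's standard zlib module, whose compressobj and decompressobj accept a zdict argument. The sample messages and the helper names compress_with_dict and decompress_with_dict are made up for illustration; a real dictionary would be built from your own traffic.

```python
import zlib

def compress_with_dict(message: bytes, zdict: bytes) -> bytes:
    # The dictionary primes the LZ77 window, so matches can point into it.
    comp = zlib.compressobj(level=9, zdict=zdict)
    return comp.compress(message) + comp.flush()

def decompress_with_dict(blob: bytes, zdict: bytes) -> bytes:
    # The decompressor must be given exactly the same dictionary bytes.
    decomp = zlib.decompressobj(zdict=zdict)
    return decomp.decompress(blob) + decomp.flush()

# Hypothetical example messages; the dictionary is just their concatenation,
# truncated to the last 32 KiB (the most useful strings go at the end).
samples = [
    b'{"user_id": 1001, "event": "login", "status": "ok"}',
    b'{"user_id": 1002, "event": "logout", "status": "ok"}',
    b'{"user_id": 1003, "event": "login", "status": "denied"}',
]
shared_dict = b"".join(samples)[-32768:]

message = b'{"user_id": 1004, "event": "login", "status": "ok"}'
with_dict = compress_with_dict(message, shared_dict)
without_dict = zlib.compress(message, 9)
restored = decompress_with_dict(with_dict, shared_dict)

print(len(message), len(without_dict), len(with_dict))
```

The dictionary itself is stored once, outside the queue, and each message carries only its own compressed bytes. Both sides have to agree on the dictionary version, so changing it means old messages need the old dictionary.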

Related

Most efficient way to use AWS SQS (with Golang)

When using the AWS SQS (Simple Queue Service) you pay for each request you make to the service (push, pull, ...). There is a maximum of 256kb for each message you can send to a queue.
To save money I'd like to buffer messages sent to my Go application before I send them out to SQS until I have enough data to efficiently use the 256kb limit.
Since my Go application is a webserver, my current idea is to use a mutex-protected string and append messages as long as I would not exceed the 256kb limit, and then issue the SQS push event. To save even more space I could gzip every single message before appending it to the string.
I wonder if there is some kind of gzip stream that I could use for this. My assumption is that gzipping all concatenated messages together will result in a smaller size than gzipping every message before appending it to the string. One way would be to gzip the string after every append to validate its size. But that might be very slow.
Is there a better way? Or is there an altogether better approach involving channels? I'm still new to Go, I have to admit.
I'd take the following approach
Use a channel to accept incoming "internal" messages to a go routine
In that go routine keep the messages in a "raw" format, so 10 messages is 10 raw uncompressed items
Each time a new raw item arrives, compress all the raw messages into one. If the size with the new message > 256k then compress messages EXCEPT the last one and push to SQS
This is computationally expensive. Each individual message causes a full compression of all pending messages. However it is efficient for SQS use
You could guesstimate the size of the gzipped messages and calculate whether you've reached the max size threshold. Keep track of a message size counter and for every new message increment the counter by its expected compressed size. Do the actual compression and send to SQS only if your counter will exceed 256kb. That way you avoid compressing every time a new message comes in.
For a use-case like this, running a few tests on a sample set of messages should give the rough percentage of compression expected.
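The estimate-then-compress idea can be sketched roughly as follows. Everything here is an assumption for illustration: the Batcher class, the send callback (standing in for the SQS push), the default ratio of 0.3, and the base64 step, which is included only because SQS bodies must be text. The ratio is meant to be the expected size of the final payload divided by the raw size, measured on sample messages.

```python
import base64
import zlib

MAX_BYTES = 256 * 1024  # the SQS message size limit mentioned above

class Batcher:
    """Buffers raw messages and compresses only when the estimate hits the limit."""

    def __init__(self, send, ratio=0.3, limit=MAX_BYTES):
        self.send = send        # hypothetical callback taking one text payload
        self.ratio = ratio      # assumed payload-to-raw size ratio from tests
        self.limit = limit
        self.pending = []
        self.estimated = 0

    def add(self, message: bytes) -> None:
        estimate = int(len(message) * self.ratio) + 1
        # Flush first if this message would push the estimate over the limit.
        if self.pending and self.estimated + estimate > self.limit:
            self.flush()
        self.pending.append(message)
        self.estimated += estimate

    def flush(self) -> None:
        if not self.pending:
            return
        # One compression pass per batch; messages must not contain newlines.
        compressed = zlib.compress(b"\n".join(self.pending))
        self.send(base64.b64encode(compressed).decode("ascii"))
        self.pending = []
        self.estimated = 0

sent = []
batcher = Batcher(sent.append, ratio=0.3, limit=10_000)
for i in range(1000):
    batcher.add(b"message number %d" % i + b" x" * 400)
batcher.flush()

recovered = []
for payload in sent:
    recovered.extend(zlib.decompress(base64.b64decode(payload)).split(b"\n"))
```

Because the size is only estimated, a production version would still need to check the real payload size before sending and split the batch if the guess was too optimistic.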
Before you get focused on compression, eliminate redundant data that is known on both sides. This is what encodings like msgpack, protobuf, AVRO, and so on do.
Let's say all of your messages are a struct like this:
type Foo struct {
	Bar string `json:"bar"`
	Qux int    `json:"qux"`
}
and you were thinking of encoding it as JSON. Then the most efficient you could do is:
{"bar":"whatever","qux":123}
If you wanted to just append all of those together in memory, you might get something like this:
{"bar":"whatever","qux":123}{"bar":"hello world","qux":0}{"bar":"answer to life, the universe, and everything","qux":42}{"bar":"nice","qux":69}
A really good compression algorithm might look at hundreds of those messages and identify the repetitiveness of {"bar":" and ","qux":.
But compression has to do work to figure that out from your data each time.
If the receiving code already knows what "schema" (the {"bar": some_string, "qux": some_int} "shape" of your data) each message has, then you can just serialize the messages like this:
"whatever"123"hello world"0"answer to life, the universe, and everything"42"nice"69
Note that in this example encoding, you can't just start in the middle of the data and unambiguously find your place. If you have a bunch of messages such as {"bar":"1","qux":2}, {"bar":"2","qux":3}, {"bar":"3","qux":4}, then the encoding will produce: "1"2"2"3"3"4, and you can't just start in the middle and know for sure if you're looking at a number or a string - you have to count from the ends. Whether or not this matters will depend on your use case.
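To make the schema-based encoding concrete, here is a small sketch in Python. The function names encode_records and decode_records are hypothetical, and the decoder assumes every record is exactly one JSON-quoted string followed by one decimal integer, as in the example above.

```python
import json

def encode_records(records):
    # Field names are dropped: each record becomes "<quoted bar><qux digits>".
    return "".join(json.dumps(r["bar"]) + str(r["qux"]) for r in records)

def decode_records(data):
    decoder = json.JSONDecoder()
    records = []
    pos = 0
    while pos < len(data):
        # raw_decode parses one JSON value at pos and returns where it ended.
        bar, pos = decoder.raw_decode(data, pos)
        end = pos
        # The integer runs until the opening quote of the next record.
        while end < len(data) and data[end] != '"':
            end += 1
        records.append({"bar": bar, "qux": int(data[pos:end])})
        pos = end
    return records

records = [
    {"bar": "whatever", "qux": 123},
    {"bar": "hello world", "qux": 0},
    {"bar": "nice", "qux": 69},
]
compact = encode_records(records)
as_json = "".join(json.dumps(r, separators=(",", ":")) for r in records)
print(len(as_json), len(compact))
```

The decoder has to walk the data from the start, which is the "can't start in the middle" limitation described above.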
You can come up with other simple schemes that are more unambiguous or make the code for writing or reading messages easier or simpler, like using a field separator or message separator character which is escaped in your encoding of the other data (just like how \ and " would be escaped in quoted JSON strings).
If you can't have the receiver just know/hardcode the expected message schema - if you need the full flexibility of something like JSON and you always unmarshal into something like a map[string]interface{} or whatever - then you should consider using something like BSON.
Of course, you can't use msgpack, protobuf, AVRO, or BSON directly - they need a medium that allows arbitrary bytes like 0x0. And according to the AWS SQS FAQ:
Q: What kind of data can I include in a message?
Amazon SQS messages can contain up to 256 KB of text data, including XML, JSON and unformatted text. The following Unicode characters are accepted:
#x9 | #xA | #xD | [#x20 to #xD7FF] | [#xE000 to #xFFFD] | [#x10000 to #x10FFFF]
So if you want to aim for maximum space efficiency for your exact use case, you'd have to write your own code which uses the techniques from those encoding schemes, but only uses bytes which are allowed in SQS messages.
Relatedly, if you have a lot of integers, and you know most of them are small (or clump around a certain spot of the number line, so that by adding a constant offset to all of them you can make most of them small), you can use one of the variable length quantity techniques to encode all of those integers. In fact several of those common encoding schemes mentioned above use variable length quantities in their encoding of integers. If you use a "piece size" of six (6) bits (instead of the standard implicitly assumed piece size of eight (8) bits to match a byte) then you can use base64. Not full base64 encoding, because the padding will completely defeat the purpose - just map from the 64 possible values that fit in six bits to the 64 distinct ASCII characters that base64 uses.
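One possible reading of the six-bit variable length quantity idea is sketched below. It assumes five payload bits plus one continuation bit per character and non-negative integers only; the names ALPHABET, encode_vlq6 and decode_vlq6 are made up for this example.

```python
import string

# The 64 characters of the standard base64 alphabet, used as 6-bit digits.
ALPHABET = string.ascii_uppercase + string.ascii_lowercase + string.digits + "+/"

def encode_vlq6(value: int) -> str:
    """Encode one non-negative integer, least significant piece first."""
    if value < 0:
        raise ValueError("this sketch handles non-negative integers only")
    chars = []
    while True:
        piece = value & 0x1F          # low five bits carry data
        value >>= 5
        if value:
            piece |= 0x20             # sixth bit set: more pieces follow
        chars.append(ALPHABET[piece])
        if not value:
            return "".join(chars)

def decode_vlq6(text: str) -> list:
    """Decode a concatenation of encoded integers back into a list."""
    values, current, shift = [], 0, 0
    for ch in text:
        piece = ALPHABET.index(ch)
        current |= (piece & 0x1F) << shift
        if piece & 0x20:
            shift += 5                # continuation: keep accumulating
        else:
            values.append(current)
            current, shift = 0, 0
    return values

numbers = [0, 7, 31, 32, 1000, 123456]
encoded = "".join(encode_vlq6(n) for n in numbers)
print(encoded, len(encoded))
```

Values below 32 take a single character, and no separators or padding are needed because the continuation bit marks where each number ends.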
Anyway, unless you know your data has a lot of repetition (but not the kind that you can just not send, like the same field names in every message) I would start with all of that, and only then would I look at compression.
Even so, if you want minimal size, I would aim for LZMA, and if you want minimal computing overhead, I would use LZ4. Gzip is not bad per se - if it's much easier to use gzip then just use it - but if you're optimizing for either size or for speed, there are better options. I don't know if gzip is even a good "middle ground" of speed and output size and working memory size - it's pretty old and maybe there are compression algorithms which are just strictly superior in speed and output and memory size by now. I think gzip, depending on implementation, also includes headers and framing information (like version metadata, size, checksums, and so on), which if you really need to minimize for size you probably don't want, and in the context of SQS messages you probably don't need.
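The framing overhead mentioned above can be measured with a short sketch using Python's standard gzip and zlib modules. The payload is invented for the example, and raw_deflate is a hypothetical helper name; a negative wbits value asks zlib for a bare DEFLATE stream with no header or checksum.

```python
import gzip
import zlib

def raw_deflate(data: bytes, level: int = 9) -> bytes:
    # wbits=-15 selects raw DEFLATE: no zlib or gzip wrapper around the data.
    comp = zlib.compressobj(level, zlib.DEFLATED, -15)
    return comp.compress(data) + comp.flush()

payload = b'{"bar":"whatever","qux":123}' * 50

gz = gzip.compress(payload)        # gzip header and trailer around DEFLATE
zl = zlib.compress(payload, 9)     # smaller zlib wrapper around DEFLATE
raw = raw_deflate(payload)         # DEFLATE only

print("raw:", len(payload), "gzip:", len(gz), "zlib:", len(zl), "deflate:", len(raw))
```

The difference is a fixed number of bytes per message, so it matters mostly when messages are small and numerous.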

How to efficiently decompress huffman coded file

I've found a lot of questions asking this but some of the explanations were very difficult to understand and I couldn't quite grasp the concept of how to efficiently decompress the file.
I have found these related questions:
Huffman code with lookup table
How to decode huffman code quickly?
But I fail to understand the explanation. I know how to encode and decode a huffman tree regularly. Right now in my compression program I can write any of the following information to file
symbol
huffman code (unsigned long)
huffman code length
What I plan to do is get a text file, separate it into small text files and compress each individually and then decompress that file by sending all the small compressed files with their respective lookup table (don't know how to do this part) to a Nvidia GPU to try to decompress the file in parallel using some sort of look up table.
I have 3 questions:
What information should I write to file in the header to construct the look up table?
How do I recreate this table from file?
How do I use it to decode the huffman encoded file quickly?
Don't bother writing it yourself, unless this is a didactic exercise. Use zlib, lz4, or any of several other free compression/decompression libraries out there that are far better tested than anything you'll be able to do.
You are only talking about Huffman coding, indicating that you would only get a small portion of the available compression. Most of the compression in the libraries mentioned comes from matching strings. Look up "LZ77".
As for efficient Huffman decoding, you can look at how zlib's inflate does it. It creates a lookup table for the most-significant nine bits of the code. Each entry in the table has either a symbol and numbers of bits for that code (less than or equal to nine), or if the provided nine bits is a prefix of a longer code, that entry has a pointer to another table to resolve the rest of the code and the number of bits needed for that secondary table. (There are several of these secondary tables.) There are multiple entries for the same symbol if the code length is less than nine. In fact, 2^(9-n) entries for an n-bit code.
So to decode you get nine bits from the input and get the entry from the table. If it is a symbol, then you remove the number of bits indicated for the code from your stream and emit the symbol. If it is a pointer to a secondary table, then you remove nine bits from the stream, get the number of bits indicated by the table, and look it up there. Now you will definitely get a symbol to emit, and the number of remaining bits to remove from the stream.
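The table lookup described above can be illustrated with a simplified, single-level sketch. This is not zlib's code: it assumes every code fits within the table width, so the secondary tables are left out, and it works on a string of '0' and '1' characters for readability. The names build_table and decode_bits and the tiny example code are hypothetical.

```python
def build_table(codes, table_bits):
    """codes maps symbol -> (code value, code length), codes read MSB first."""
    table = [None] * (1 << table_bits)
    for symbol, (code, length) in codes.items():
        spare = table_bits - length
        # An n-bit code fills 2^(table_bits - n) consecutive entries.
        for fill in range(1 << spare):
            table[(code << spare) | fill] = (symbol, length)
    return table

def decode_bits(bits, table, table_bits, count):
    """Decode `count` symbols from a string of '0'/'1' characters."""
    padded = bits + "0" * table_bits   # lets the last lookup read a full index
    out, pos = [], 0
    for _ in range(count):
        index = int(padded[pos:pos + table_bits], 2)
        symbol, length = table[index]
        out.append(symbol)
        pos += length                  # consume only the bits of this code
    return out

# A small prefix code: a=0, b=10, c=110, d=111.
CODES = {"a": (0b0, 1), "b": (0b10, 2), "c": (0b110, 3), "d": (0b111, 3)}
TABLE_BITS = 3
TABLE = build_table(CODES, TABLE_BITS)

encoded = "0" + "10" + "0" + "110" + "111"   # the symbols a, b, a, c, d
decoded = decode_bits(encoded, TABLE, TABLE_BITS, 5)
print(decoded)
```

Each symbol costs one table lookup regardless of its code length, which is the point of the technique. Codes longer than the table width would need the second-level tables described above.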

Storing table of codes in a compressed file after Huffman compression and building tree for decompression from this table

I was writing a Huffman compression program in C++, but I ran into a problem with the structure of the compressed file. The new file needs to store some structure that helps me decode it. I decided to write a table of codes at the beginning of the file and then build a tree from this table to decode the content that follows, but I do not know the best way to store the table (I mean I do not know the structure of the table; I know how to write things in binary mode) or how to build the tree from this table. Sorry for my English. Thank you in advance.
You do not need to transmit the probabilities or the tree. All the decoder needs is the number of bits assigned to each symbol, and a canonical way to assign the bit values to each symbol that is agreed to by both the encoder and decoder. See Canonical Huffman Code.
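A minimal sketch of canonical code assignment, assuming the header stores only a code length per symbol and that ties are broken by symbol order. The function name canonical_codes is hypothetical, and the lengths used are the example from the DEFLATE specification (RFC 1951).

```python
def canonical_codes(lengths):
    """Map symbol -> (code value, code length) from code lengths alone."""
    codes = {}
    code = 0
    previous_length = 0
    # Shorter codes come first; symbols of equal length go in symbol order.
    used = [(sym, n) for sym, n in lengths.items() if n > 0]
    for symbol, length in sorted(used, key=lambda item: (item[1], item[0])):
        code <<= length - previous_length   # append zeros when length grows
        codes[symbol] = (code, length)
        code += 1
        previous_length = length
    return codes

# Only these lengths need to be written to the file header.
lengths = {"A": 3, "B": 3, "C": 3, "D": 3, "E": 3, "F": 2, "G": 4, "H": 4}
codes = canonical_codes(lengths)
for symbol in sorted(codes):
    value, length = codes[symbol]
    print(symbol, format(value, "0{}b".format(length)))
```

Because the encoder and decoder run the same assignment, the decoder rebuilds identical codes from the lengths and never needs the tree or the probabilities.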
You could try writing a header in the compressed file with the sequence of characters according to their probability of appearing in the text. Or writing the letters followed by their probabilities. With that, you use the same process for building the tree for compressing and decompressing. As for how to build the tree itself, I suppose you'll have to do a little research and come back if you have problems.

writing to gz using fstream

How can I write the output to a compressed file (gz, bz2, ...) using fstream? It seems that Boost library can do that, but I am looking for a non Boost solution. I saw example only for reading from a compressed file.
To write compressed data to a file, you would run your uncompressed data through a compression library such as zlib (for DEFLATE, the compression algorithm used with .zip and .gz files) or xz utils (for LZMA, the compression algorithm used with 7zip and .xz files), then write the result as usual using ofstream or fwrite.
The two major pieces to implement are the encoding/compression and framing/encapsulation/file format.
From Wikipedia, the DEFLATE algorithm:
Stream format
A Deflate stream consists of a series of blocks. Each block is preceded by a 3-bit header:
1 bit: Last-block-in-stream marker:
  1: this is the last block in the stream.
  0: there are more blocks to process after this one.
2 bits: Encoding method used for this block type:
  00: a stored/raw/literal section, between 0 and 65,535 bytes in length.
  01: a static Huffman compressed block, using a pre-agreed Huffman tree.
  10: a compressed block complete with the Huffman table supplied.
  11: reserved, don't use.
Most blocks will end up being encoded using method 10, the dynamic Huffman encoding, which produces an optimised Huffman tree customised for each block of data individually. Instructions to generate the necessary Huffman tree immediately follow the block header.
Compression is achieved through two steps:
The matching and replacement of duplicate strings with pointers.
Replacing symbols with new, weighted symbols based on frequency of use.
From Wikipedia, the gzip file format:
"gzip" is often also used to refer to the gzip file format, which is:
a 10-byte header, containing a magic number, a version number and a timestamp
optional extra headers, such as the original file name
a body, containing a DEFLATE-compressed payload
an 8-byte footer, containing a CRC-32 checksum and the length of the original uncompressed data
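The framing described above can be checked with a short sketch. It is written in Python rather than C++ purely for brevity, uses an invented payload, and writes the compressed bytes with an ordinary file write, which mirrors the "compress first, then write as usual" approach. The variable names are arbitrary.

```python
import gzip
import os
import struct
import tempfile
import zlib

data = b"hello, gzip framing\n" * 10
blob = gzip.compress(data)   # header + DEFLATE body + footer, all in memory

# The fixed 10-byte header: magic, method, flags, mtime, extra flags, OS.
magic, method, flags, mtime, xfl, os_id = struct.unpack("<2sBBIBB", blob[:10])
# The 8-byte footer: CRC-32 and length of the uncompressed data.
crc32, isize = struct.unpack("<II", blob[-8:])

# Writing the result needs nothing special: it is just bytes.
path = os.path.join(tempfile.mkdtemp(), "example.gz")
with open(path, "wb") as out:
    out.write(blob)

# Any gzip reader can now open the file.
with gzip.open(path, "rb") as src:
    round_trip = src.read()

print(magic.hex(), method, hex(crc32), isize)
```

In C++ the same shape applies: run the data through a library such as zlib configured for gzip output, then write the resulting buffer with ofstream.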

Best compression library/format for compressing on the fly and binary search?

I'm looking for a compression library/format with the following abilities:
Can compress my data as I write it.
Will let me efficiently binary search through the file.
Will let me efficiently traverse the file in reverse.
Context: I'm writing a C++ app that listens for incoming data, normalizes it, and then needs to persist the normalized output to disk. The data already compresses pretty well when I run gzip on the files by hand. However, the amount of incoming data is potentially massive, and I'd like to do the compression on the fly. Each entry in the file has a timestamp associated with it and I may be only interested in the chunk of data between time X and time Y, so to quickly find that chunk I'd like to be able to binary search. And even iterate in reverse if possible. Do any particular compression libraries/formats stick out as being particularly good for my project? I've found libraries that satisfy #1, but often whether #2 or #3 will work is undocumented.
You can just compress a few chunks at a time so that you can decompress them separately, then keep an (uncompressed but small) index to the beginning of each block of chunks in the compressed data. That will allow almost random access to the chunks and still keep them in order by timestamp. The limit case to this is to compress each chunk individually, although that might hurt your compression ratio.
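A rough sketch of the block-plus-index layout follows, in Python for brevity. The ChunkedLog class, its method names, and the tab-separated record format are all assumptions made for the example. An in-memory bytearray stands in for the file on disk, timestamps are assumed to be strictly increasing integers, and payloads are assumed to contain no tabs or newlines.

```python
import bisect
import zlib

class ChunkedLog:
    def __init__(self, entries_per_block=100):
        self.entries_per_block = entries_per_block
        self.blob = bytearray()   # stands in for the compressed file
        self.index = []           # (first timestamp, offset, length) per block
        self.pending = []         # entries not yet compressed

    def append(self, timestamp, payload):
        self.pending.append((timestamp, payload))
        if len(self.pending) >= self.entries_per_block:
            self.flush()

    def flush(self):
        if not self.pending:
            return
        raw = "\n".join("%d\t%s" % entry for entry in self.pending).encode()
        block = zlib.compress(raw)          # each block decompresses alone
        self.index.append((self.pending[0][0], len(self.blob), len(block)))
        self.blob += block
        self.pending = []

    def read_block(self, i):
        _, offset, length = self.index[i]
        raw = zlib.decompress(bytes(self.blob[offset:offset + length])).decode()
        rows = (line.split("\t", 1) for line in raw.split("\n"))
        return [(int(ts), payload) for ts, payload in rows]

    def query(self, start, end):
        """Return flushed entries with start <= timestamp <= end."""
        firsts = [entry[0] for entry in self.index]
        # Binary search the small uncompressed index for the starting block.
        i = max(bisect.bisect_right(firsts, start) - 1, 0)
        found = []
        while i < len(self.index) and self.index[i][0] <= end:
            found.extend(e for e in self.read_block(i) if start <= e[0] <= end)
            i += 1
        return found

log = ChunkedLog(entries_per_block=10)
for ts in range(100):
    log.append(ts, "event %d" % ts)
log.flush()
result = log.query(25, 34)
```

Only the blocks overlapping the requested range are decompressed. Reverse traversal works the same way by walking the index backwards and reversing each decoded block. Larger blocks improve the compression ratio at the cost of decompressing more data per lookup.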