Parallelization of PNG file creation with C++, libpng and OpenMP

I am currently trying to implement a PNG encoder in C++ based on libpng that uses OpenMP to speed up the compression process.
The tool is already able to generate PNG files from various image formats.
I uploaded the complete source code to pastebin.com so you can see what I have done so far: http://pastebin.com/8wiFzcgV
So far, so good! Now, my problem is finding a way to parallelize the generation of the IDAT chunks containing the compressed image data. Usually, the libpng function png_write_row gets called in a for loop with a pointer to the struct that contains all the information about the PNG file and a row pointer with the pixel data of a single image row.
(Lines 114-117 in the Pastebin file)
//Loop through image
for (i = 0, rp = info_ptr->row_pointers; i < png_ptr->height; i++, rp++) {
png_write_row(png_ptr, *rp);
}
Libpng then compresses one row after another and fills an internal buffer with the compressed data. As soon as the buffer is full, the compressed data gets flushed to an IDAT chunk in the image file.
My approach was to split the image into multiple parts and let one thread compress rows 1 to 10, another thread rows 11 to 20, and so on. But since libpng uses an internal buffer, it is not as easy as I first thought :) I somehow have to make libpng write the compressed data to a separate buffer for every thread. Afterwards I need a way to concatenate the buffers in the right order so I can write them all together to the output image file.
So, does anyone have an idea how I can do this with OpenMP and some tweaking of libpng? Thank you very much!

This is too long for a comment but is not really an answer either--
I'm not sure you can do this without modifying libpng (or writing your own encoder). In any case, it will help if you understand how PNG compression is implemented:
At the high level, the image is a set of rows of pixels (generally 32-bit values representing RGBA tuples).
Each row can independently have a filter applied to it -- the filter's sole purpose is to make the row more "compressible". For example, the "sub" filter makes each pixel's value the difference between it and the one to its left. This delta encoding might seem silly at first glance, but if the colours between adjacent pixels are similar (which tends to be the case) then the resulting values are very small regardless of the actual colours they represent. It's easier to compress such data because it's much more repetitive.
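For illustration, here is a minimal sketch of the Sub filter applied to one row (the function name and std::vector interface are just for this example; libpng applies the filters internally):
#include <cstddef>
#include <cstdint>
#include <vector>

// Sub filter from the PNG spec: each output byte is the raw byte minus the
// corresponding byte of the pixel to its left (modulo 256); the first pixel
// has no left neighbour and is stored unchanged.
std::vector<uint8_t> sub_filter_row(const std::vector<uint8_t>& raw, size_t bpp /* bytes per pixel */)
{
    std::vector<uint8_t> filtered(raw.size());
    for (size_t i = 0; i < raw.size(); ++i) {
        uint8_t left = (i >= bpp) ? raw[i - bpp] : 0;
        filtered[i] = static_cast<uint8_t>(raw[i] - left);
    }
    return filtered;
}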
Going down a level, the image data can be seen as a stream of bytes (rows are no longer distinguished from each other). These bytes are compressed, yielding another stream of bytes. The compressed data is arbitrarily broken up into segments (anywhere you want!) written to one IDAT chunk each (along with a little bookkeeping overhead per chunk, including a CRC checksum).
The lowest level brings us to the interesting part, which is the compression step itself. The PNG format uses the zlib compressed data format. zlib itself is just a wrapper (with more bookkeeping, including an Adler-32 checksum) around the real compressed data format, deflate (zip files use this too). deflate supports two compression techniques: Huffman coding (which reduces the number of bits required to represent some byte-string to the optimal number given the frequency that each different byte occurs in the string), and LZ77 encoding (which lets duplicate strings that have already occurred be referenced instead of written to the output twice).
The tricky part about parallelizing deflate compression is that in general, compressing one part of the input stream requires that the previous part also be available in case it needs to be referenced. But, just like PNGs can have multiple IDAT chunks, deflate is broken up into multiple "blocks". Data in one block can reference previously encoded data in another block, but it doesn't have to (of course, it may affect the compression ratio if it doesn't).
So, a general strategy for parallelizing deflate would be to break the input into multiple large sections (so that the compression ratio stays high), compress each section into a series of blocks, then glue the blocks together (this is actually tricky since blocks don't always end on a byte boundary -- but you can put an empty non-compressed block (type 00), which will align to a byte boundary, in-between sections). This isn't trivial, however, and requires control over the very lowest level of compression (creating deflate blocks manually), creating the proper zlib wrapper spanning all the blocks, and stuffing all this into IDAT chunks.
If you want to go with your own implementation, I'd suggest reading my own zlib/deflate implementation (and how I use it) which I expressly created for compressing PNGs (it's written in Haxe for Flash but should be comparatively easy to port to C++). Since Flash is single-threaded, I don't do any parallelization, but I do split the encoding up into virtually independent sections ("virtually" because there's the fractional-byte state preserved between sections) over multiple frames, which amounts to largely the same thing.
Good luck!

I finally got it to parallelize the compression process.
As mentioned by Cameron in the comment to his answer, I had to strip the zlib header from the zstreams to combine them. Stripping the footer was not required, as zlib offers a flush option called Z_SYNC_FLUSH which can be used for all chunks (except the last one, which has to be written with Z_FINISH) to end the output on a byte boundary. So you can simply concatenate the stream outputs afterwards. Finally, the adler32 checksum has to be calculated over all threads' data and appended to the end of the combined zstreams.
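For reference, here is a rough sketch of that approach with zlib. This is not the code from the repository linked below; buffer sizing, error handling and the actual OpenMP work-sharing are stripped down, and slice_ptr/slice_len in the commented loop are placeholders for whatever your row splitting produces.
#include <zlib.h>
#include <cstdint>
#include <cstring>
#include <vector>

struct Part {
    std::vector<unsigned char> out;   // compressed bytes of this slice
    uLong adler = 0;                  // adler32 of the uncompressed slice
    uLong in_len = 0;                 // uncompressed length, needed by adler32_combine()
};

// Compress one slice of filtered rows into its own buffer.  Z_SYNC_FLUSH ends
// the slice on a byte boundary; only the last slice is finished with Z_FINISH.
static void compress_part(const unsigned char* data, uLong len, bool last, Part& p)
{
    z_stream zs;
    std::memset(&zs, 0, sizeof(zs));
    deflateInit(&zs, Z_BEST_SPEED);

    p.out.resize(deflateBound(&zs, len) + 16);
    zs.next_in   = const_cast<Bytef*>(data);
    zs.avail_in  = static_cast<uInt>(len);
    zs.next_out  = p.out.data();
    zs.avail_out = static_cast<uInt>(p.out.size());
    deflate(&zs, last ? Z_FINISH : Z_SYNC_FLUSH);
    p.out.resize(zs.total_out);
    deflateEnd(&zs);

    p.adler  = adler32(adler32(0L, Z_NULL, 0), data, static_cast<uInt>(len));
    p.in_len = len;
}

// Join the slices: keep the 2-byte zlib header of the first one, strip it from
// the others, drop the per-slice adler32 that Z_FINISH appended to the last
// one, then append the adler32_combine()d checksum of the whole image.
std::vector<unsigned char> join_parts(std::vector<Part>& parts)
{
    std::vector<unsigned char> zdata;
    uLong adler = adler32(0L, Z_NULL, 0);
    for (size_t i = 0; i < parts.size(); ++i) {
        const unsigned char* p = parts[i].out.data();
        size_t n = parts[i].out.size();
        if (i > 0)                 { p += 2; n -= 2; }   // strip zlib header
        if (i + 1 == parts.size()) { n -= 4; }           // strip per-slice adler32
        zdata.insert(zdata.end(), p, p + n);
        adler = adler32_combine(adler, parts[i].adler, parts[i].in_len);
    }
    for (int s = 24; s >= 0; s -= 8)                     // zlib stores adler32 big-endian
        zdata.push_back(static_cast<unsigned char>((adler >> s) & 0xff));
    return zdata;   // split this into IDAT chunks (length + "IDAT" + data + CRC) as usual
}

// The OpenMP side is then just a parallel loop over the slices, e.g.:
//   #pragma omp parallel for
//   for (int i = 0; i < nparts; ++i)
//       compress_part(slice_ptr[i], slice_len[i], i == nparts - 1, parts[i]);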
If you are interested in the result you can find the complete proof of concept at https://github.com/anvio/png-parallel

Related

Is it possible to compress a compressed file by mixin and/or 'XOR'?

I want to compress a compressed file. But when I mix the compressed file and compress it again, the result is larger than before every time.
I have 2 methods for mixing. First, XOR the compressed file with another (generated) file.
Second, swap some bits in the compressed file.
Do you have any ideas for new methods, or is it impossible?
As mentioned in the comments, the first compression has already taken the extra "air" out so it cannot be "squeezed" smaller anymore. It is now close to random data, and compressing random data will always make it at least slightly larger due to compression headers.
The situation is different with lossy compression, though: there you can discard as much data as you wish.

How to efficiently decompress huffman coded file

I've found a lot of questions asking this, but some of the explanations were very difficult to understand and I couldn't quite grasp the concept of how to efficiently decompress the file.
I have found these related questions:
Huffman code with lookup table
How to decode huffman code quickly?
But I fail to understand the explanations. I know how to encode and decode a Huffman tree regularly. Right now, in my compression program, I can write any of the following information to the file:
symbol
Huffman code (unsigned long)
Huffman code length
What I plan to do is take a text file, split it into small text files, and compress each one individually. Then, to decompress, I would send all the small compressed files with their respective lookup tables (I don't know how to do this part) to an Nvidia GPU and decompress them in parallel using some sort of lookup table.
I have 3 questions:
What information should I write to file in the header to construct the look up table?
How do I recreate this table from file?
How do I use it to decode the Huffman-encoded file quickly?
Don't bother writing it yourself, unless this is a didactic exercise. Use zlib, lz4, or any of several other free compression/decompression libraries out there that are far better tested than anything you'll be able to do.
You are only talking about Huffman coding, indicating that you would only get a small portion of the available compression. Most of the compression in the libraries mentioned comes from matching strings. Look up "LZ77".
As for efficient Huffman decoding, you can look at how zlib's inflate does it. It creates a lookup table for the most significant nine bits of the code. Each entry in the table has either a symbol and the number of bits for that code (less than or equal to nine), or, if the provided nine bits are a prefix of a longer code, that entry has a pointer to another table to resolve the rest of the code, along with the number of bits needed for that secondary table. (There are several of these secondary tables.) There are multiple entries for the same symbol if the code length is less than nine: in fact, 2^(9-n) entries for an n-bit code.
So to decode you get nine bits from the input and get the entry from the table. If it is a symbol, then you remove the number of bits indicated for the code from your stream and emit the symbol. If it is a pointer to a secondary table, then you remove nine bits from the stream, get the number of bits indicated by the table, and look it up there. Now you will definitely get a symbol to emit, and the number of remaining bits to remove from the stream.
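To make the two-level idea concrete, here is a rough sketch of just the decode step (table construction is omitted, and the Entry layout is invented for this example; zlib's real "code" struct packs op/bits/val differently):
#include <cstddef>
#include <cstdint>

struct Entry {
    uint16_t value;    // decoded symbol, or index of a secondary table
    uint8_t  bits;     // bits consumed by this entry
    uint8_t  sub_bits; // for links: how many further bits index the secondary table
    bool     is_link;  // true -> value selects a secondary table
};

// Minimal LSB-first bit reader (deflate bit order); no end-of-input handling.
struct BitReader {
    const uint8_t* data;
    size_t pos = 0;    // bit position
    uint32_t peek(unsigned n) const {
        uint32_t v = 0;
        for (unsigned i = 0; i < n; ++i)
            v |= static_cast<uint32_t>((data[(pos + i) >> 3] >> ((pos + i) & 7)) & 1u) << i;
        return v;
    }
    void drop(unsigned n) { pos += n; }
};

// root: 512-entry table indexed by 9 bits; sub: array of secondary tables.
int decode_symbol(BitReader& br, const Entry* root, const Entry* const* sub)
{
    Entry e = root[br.peek(9)];
    if (!e.is_link) {                 // short code: entry is repeated 2^(9-n) times
        br.drop(e.bits);              // consume only the real code length
        return e.value;
    }
    br.drop(9);                       // consume the 9-bit prefix
    Entry e2 = sub[e.value][br.peek(e.sub_bits)];
    br.drop(e2.bits);                 // remaining bits of the long code
    return e2.value;
}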

fast encode bitmap buffer to png using libpng

My objective is to convert a 32-bit bitmap (BGRA) buffer into a PNG image in real time using C/C++. To achieve it, I used the libpng library to convert the bitmap buffer and then write it into a PNG file. However, it seemed to take a huge amount of time (~5 secs) to execute on the target ARM board (quad-core processor) in a single thread. On profiling, I found that the libpng compression process (deflate algorithm) was taking more than 90% of the time. So I tried to reduce it by using parallelization in some way. The end goal here is to get it done in less than 0.5 secs.
Now, since a PNG can have multiple IDAT chunks, I thought of writing the PNG with multiple IDATs in parallel. To write a custom PNG file with multiple IDATs, the following methodology is adopted:
1. Write PNG IHDR chunk
2. Write IDAT chunks in parallel
i. Split the input buffer into 4 parts.
ii. Compress each part in parallel using the zlib "compress" function.
iii. Compute the CRC of the chunk { "IDAT" + zlib compressed data }.
iv. Create the IDAT chunk, i.e. { "IDAT" + zlib compressed data + CRC }.
v. Write the length of the IDAT chunk created.
vi. Write the complete chunks in sequence.
3. Write IEND chunk
Now the problem is that the PNG file created by this method is not valid, or is corrupted. Can somebody point out:
What am I doing wrong?
Is there any fast implementation of zlib compress or multi-threaded PNG creation, preferably in C/C++?
Any other alternate way to achieve the target goal?
Note: The PNG specification is followed in creating the chunks.
Update:
This method works for creating IDATs in parallel:
1. add one filter byte before each row of the input image.
2. split the image into four equal parts. <-- may not be required; passing pointers into the buffer with offsets works as well
3. Compress Image Parts in parallel
(A)for first image part
--deflateInit(zstrm, Z_BEST_SPEED)
--deflate(zstrm, Z_FULL_FLUSH)
--deflateEnd(zstrm)
--store compressed buffer and its length
--store adler32 for current chunk, {a1=zstrm->adler} <--adler is of uncompressed data
(B)for second and third image part
--deflateInit(zstrm, Z_BEST_SPEED)
--deflate(zstrm, Z_FULL_FLUSH)
--deflateEnd(zstrm)
--store compressed buffer and its length
--strip first 2-bytes, reduce length by 2
--store adler32 for current chunk zstrm->adler,{a2,a3 similar to A} <--adler is of uncompressed data
(C) for last image part
--deflateInit(zstrm, Z_BEST_SPEED)
--deflate(zstrm, Z_FINISH)
--deflateEnd(zstrm)
--store compressed buffer and its length
--strip first 2-bytes and last 4-bytes of buffer, reduce length by 6
--here the last 4 bytes should be equal to zstrm->adler, {a4=zstrm->adler} <--adler is of uncompressed data
4. adler32_combine() all four parts i.e. a1,a2,a3 & a4 <--last arg is length of uncompressed data used to calculate adler32 of 2nd arg
5. store total length of compressed buffers <--to be used in calculating CRC of complete IDAT & to be written before IDAT in file
6. Append "IDAT" to Final chunk
7. Append all four compressed parts in sequence to Final chunk
8. Append adler32 checksum computed in step 4 to Final chunk
9. Append CRC of Final chunk i.e.{"IDAT"+data+adler}
To be written to the PNG file in this manner: [PNG_HEADER][PNG_DATA][PNG_END]
where [PNG_DATA] ->Length(4-bytes)+{"IDAT"(4-bytes)+data+adler(4-bytes)}+CRC(4-bytes)
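For reference, a rough sketch of steps 5-9 above, i.e. wrapping the spliced stream in a single IDAT chunk (here parts stands for the concatenated compressed buffers and adler for the combined checksum from step 4; crc32() is zlib's, and the chunk layout follows the PNG spec):
#include <zlib.h>
#include <cstdint>
#include <vector>

static void put_be32(std::vector<unsigned char>& v, uint32_t x)
{
    v.push_back(static_cast<unsigned char>(x >> 24));
    v.push_back(static_cast<unsigned char>(x >> 16));
    v.push_back(static_cast<unsigned char>(x >> 8));
    v.push_back(static_cast<unsigned char>(x));
}

std::vector<unsigned char> make_idat(const std::vector<unsigned char>& parts, uLong adler)
{
    std::vector<unsigned char> data(parts);                  // step 7: the four compressed parts
    for (int s = 24; s >= 0; s -= 8)                         // step 8: append adler32, big-endian
        data.push_back(static_cast<unsigned char>((adler >> s) & 0xff));

    std::vector<unsigned char> chunk;
    put_be32(chunk, static_cast<uint32_t>(data.size()));     // step 5: length of the data field only
    const unsigned char tag[4] = { 'I', 'D', 'A', 'T' };     // step 6
    chunk.insert(chunk.end(), tag, tag + 4);
    chunk.insert(chunk.end(), data.begin(), data.end());

    uLong crc = crc32(0L, Z_NULL, 0);                        // step 9: CRC over type + data,
    crc = crc32(crc, tag, 4);                                // never over the length field
    crc = crc32(crc, data.data(), static_cast<uInt>(data.size()));
    put_be32(chunk, static_cast<uint32_t>(crc));
    return chunk;                                            // written between IHDR and IEND
}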
Even when there are multiple IDAT chunks in a PNG datastream, they still contain a single zlib compressed datastream. The first two bytes of the first IDAT are the zlib header, and the final four bytes of the final IDAT are the zlib "adler32" check value, computed over the entire uncompressed data (i.e. before compression).
There is a parallel gzip (pigz) under development at zlib.net/pigz. It will generate zlib datastreams instead of gzip datastreams when invoked as "pigz -z".
For that you won't need to split up your input file because the parallel compression happens internally to pigz.
In your step ii, you need to use deflate(), not compress(). Use Z_FULL_FLUSH on the first three parts, and Z_FINISH on the last part. Then you can concatenate them to a single stream, after pulling off the two-byte header from the last three (keep the header on the first one), and pulling the four-byte check values off of the last one. For all of them, you can get the check value from strm->adler. Save those.
Use adler32_combine() to combine the four check values you saved into a single check value for the complete input. You can then tack that on to the end of the stream.
And there you have it.

Best compression library/format for compressing on the fly and binary search?

I'm looking for a compression library/format with the following abilities:
Can compress my data as I write it.
Will let me efficiently binary search through the file.
Will let me efficiently traverse the file in reverse.
Context: I'm writing a C++ app that listens for incoming data, normalizes it, and then needs to persist the normalized output to disk. The data already compresses pretty well when I run gzip on the files by hand. However, the amount of incoming data is potentially massive, and I'd like to do the compression on the fly. Each entry in the file has a timestamp associated with it and I may be only interested in the chunk of data between time X and time Y, so to quickly find that chunk I'd like to be able to binary search. And even iterate in reverse if possible. Do any particular compression libraries/formats stick out as being particularly good for my project? I've found libraries that satisfy #1, but often whether #2 or #3 will work is undocumented.
You can just compress a few chunks at a time so that you can decompress them separately, then keep an (uncompressed but small) index to the beginning of each block of chunks in the compressed data. That will allow almost random access to the chunks and still keep them in order by timestamp. The limit case to this is to compress each chunk individually, although that might hurt your compression ratio.
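A rough sketch of that idea with zlib's one-shot helpers (the block and index layout here are invented for illustration; a real implementation would also persist the index and handle I/O errors):
#include <zlib.h>
#include <algorithm>
#include <cstdint>
#include <cstdio>
#include <vector>

struct BlockRef {
    uint64_t first_timestamp;   // timestamp of the first entry in the block
    long     file_offset;       // where the compressed block starts in the file
    uLong    comp_len;          // compressed size
    uLong    raw_len;           // uncompressed size (needed by uncompress())
};

std::vector<BlockRef> index_;   // small, kept uncompressed

// Flush one batch of already-serialized entries as an independently compressed block.
void write_block(std::FILE* f, uint64_t first_ts, const unsigned char* raw, uLong raw_len)
{
    uLong comp_len = compressBound(raw_len);
    std::vector<unsigned char> comp(comp_len);
    compress2(comp.data(), &comp_len, raw, raw_len, Z_DEFAULT_COMPRESSION);

    BlockRef ref{ first_ts, std::ftell(f), comp_len, raw_len };
    std::fwrite(comp.data(), 1, comp_len, f);
    index_.push_back(ref);
}

// Binary-search the index for the block that may contain `ts`, then inflate
// just that block.  Reverse traversal works the same way, block by block.
std::vector<unsigned char> read_block_for(std::FILE* f, uint64_t ts)
{
    auto it = std::upper_bound(index_.begin(), index_.end(), ts,
        [](uint64_t t, const BlockRef& b) { return t < b.first_timestamp; });
    if (it != index_.begin()) --it;

    std::vector<unsigned char> comp(it->comp_len), raw(it->raw_len);
    std::fseek(f, it->file_offset, SEEK_SET);
    std::fread(comp.data(), 1, comp.size(), f);
    uLong raw_len = it->raw_len;
    uncompress(raw.data(), &raw_len, comp.data(), comp.size());
    return raw;   // scan the entries inside for the exact timestamp
}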

What is the best compression algorithm for small 4 KB files?

I am trying to compress TCP packets, each about 4 KB in size. The packets can contain any byte (from 0 to 255). All of the benchmarks on compression algorithms that I found were based on larger files. I did not find anything that compares the compression ratio of different algorithms on small files, which is what I need. I need it to be open source so it can be implemented in C++, so no RAR, for example. What algorithm can be recommended for small files of about 4 kilobytes in size? LZMA? HACC? ZIP? gzip? bzip2?
Choose the algorithm that is the quickest, since you probably care about doing this in real time. Generally for smaller blocks of data, the algorithms compress about the same (give or take a few bytes) mostly because the algorithms need to transmit the dictionary or Huffman trees in addition to the payload.
I highly recommend Deflate (used by zlib and Zip) for a number of reasons. The algorithm is quite fast, well tested, BSD licensed, and is the only compression required to be supported by Zip (as per the infozip Appnote). Aside from the basics, when it determines that the compressed output would be larger than the uncompressed input, there's a STORE mode which only adds 5 bytes for every block of data (max block is 64K bytes). Aside from the STORE mode, Deflate supports two different types of Huffman tables (or dictionaries): dynamic and fixed. A dynamic table means the Huffman tree is transmitted as part of the compressed data and is the most flexible (for varying types of non-random data). The advantage of a fixed table is that the table is known by all decoders and thus doesn't need to be contained in the compressed stream. The decompression (or Inflate) code is relatively easy. I've written both Java and JavaScript versions based directly off of zlib and they perform rather well.
The other compression algorithms mentioned have their merits. I prefer Deflate because of its runtime performance, both in the compression step and particularly in the decompression step.
A point of clarification: Zip is not a compression type, it is a container. For doing packet compression, I would bypass Zip and just use the deflate/inflate APIs provided by zlib.
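For example, a minimal raw-deflate sketch of one packet with zlib might look like this (error handling omitted; windowBits = -15 drops the 2-byte zlib header and 4-byte check value, which is worth having at this packet size):
#include <zlib.h>
#include <cstring>
#include <vector>

std::vector<unsigned char> deflate_packet(const unsigned char* data, size_t len)
{
    z_stream zs;
    std::memset(&zs, 0, sizeof(zs));
    // -15 = raw deflate (no zlib wrapper), 8 = default memLevel
    deflateInit2(&zs, Z_DEFAULT_COMPRESSION, Z_DEFLATED, -15, 8, Z_DEFAULT_STRATEGY);

    std::vector<unsigned char> out(deflateBound(&zs, static_cast<uLong>(len)));
    zs.next_in   = const_cast<Bytef*>(data);
    zs.avail_in  = static_cast<uInt>(len);
    zs.next_out  = out.data();
    zs.avail_out = static_cast<uInt>(out.size());
    deflate(&zs, Z_FINISH);        // single shot; a 4 KB packet fits in one call
    out.resize(zs.total_out);
    deflateEnd(&zs);
    return out;                    // the receiver inflates with inflateInit2(&zs, -15)
}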
This is a follow-up to Rick's excellent answer which I've upvoted. Unfortunately, I couldn't include an image in a comment.
I ran across this question and decided to try deflate on a sample of 500 ASCII messages that ranged in size from 6 to 340 bytes. Each message is a bit of data generated by an environmental monitoring system that gets transported via an expensive (pay-per-byte) satellite link.
The most fun observation is that the crossover point at which messages become smaller after compression is the same as the Answer to the Ultimate Question of Life, the Universe, and Everything: 42 bytes.
To try this out on your own data, here's a little bit of node.js to help:
const zlib = require('zlib')
const sprintf = require('sprintf-js').sprintf
// data_packet is a stand-in here; substitute your own message bytes
const data_packet = Buffer.from('station=42;temp=21.4;rh=55;batt=12.8')
const inflate_len = data_packet.length
const deflate_len = zlib.deflateRawSync(data_packet).length
const delta = +((inflate_len - deflate_len) / -inflate_len * 100).toFixed(0)
console.log(`inflated,deflated,delta(%)`)
console.log(sprintf(`%03i,%03i,%3i`, inflate_len, deflate_len, delta))
If you want to "compress TCP packets", you might consider using a RFC standard technique.
RFC1978 PPP Predictor Compression Protocol
RFC2394 IP Payload Compression Using DEFLATE
RFC2395 IP Payload Compression Using LZS
RFC3173 IP Payload Compression Protocol (IPComp)
RFC3051 IP Payload Compression Using ITU-T V.44 Packet Method
RFC5172 Negotiation for IPv6 Datagram Compression Using IPv6 Control Protocol
RFC5112 The Presence-Specific Static Dictionary for Signaling Compression (Sigcomp)
RFC3284 The VCDIFF Generic Differencing and Compression Data Format
RFC2118 Microsoft Point-To-Point Compression (MPPC) Protocol
There are probably other relevant RFCs I've overlooked.
All of those algorithms are reasonable to try. As you say, they aren't optimized for tiny files, but your next step is to simply try them. It will likely take only 10 minutes to test-compress some typical packets and see what sizes result. (Try different compress flags too). From the resulting files you can likely pick out which tool works best.
The candidates you listed are all good first tries. You might also try bzip2.
Sometimes a simple "try them all" approach is a good solution when the tests are easy to do; thinking too much can sometimes slow you down.
I don't think the file size matters - if I remember correctly, the LZW in GIF resets its dictionary every 4K.
ZLIB should be fine. It is used in MCCP.
However, if you really need good compression, I would do an analysis of common patterns and include a dictionary of them in the client, which can yield even higher levels of compression.
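To sketch the preset-dictionary idea with zlib (the dictionary bytes below are placeholders; both ends must agree on the exact same dictionary out of band, and the decoder calls inflateSetDictionary() right after inflateInit2() for a raw stream):
#include <zlib.h>
#include <cstring>
#include <vector>

// Placeholder dictionary: put your most common packet fragments here,
// with the most frequent ones towards the end.
static const char kDict[] = "GET / HTTP/1.1\r\nHost: \r\nContent-Type: application/json\r\n";

std::vector<unsigned char> compress_with_dict(const unsigned char* data, size_t len)
{
    z_stream zs;
    std::memset(&zs, 0, sizeof(zs));
    deflateInit2(&zs, Z_BEST_COMPRESSION, Z_DEFLATED, -15, 8, Z_DEFAULT_STRATEGY);
    deflateSetDictionary(&zs, reinterpret_cast<const Bytef*>(kDict),
                         static_cast<uInt>(sizeof(kDict) - 1));   // before the first deflate()

    std::vector<unsigned char> out(deflateBound(&zs, static_cast<uLong>(len)));
    zs.next_in   = const_cast<Bytef*>(data);
    zs.avail_in  = static_cast<uInt>(len);
    zs.next_out  = out.data();
    zs.avail_out = static_cast<uInt>(out.size());
    deflate(&zs, Z_FINISH);
    out.resize(zs.total_out);
    deflateEnd(&zs);
    return out;
}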
I've had luck using the zlib compression library directly and not using any file containers. ZIP and RAR have overhead to store things like filenames. I've seen compression this way yield positive results (compressed size less than original size) for packets down to 200 bytes.
You may test bicom.
This algorithm is forbidden for commercial use.
If you want it for professional or commercial usage, look at the "range coding" algorithm.
You can try delta compression. Compression will depend on your data. If you have any encapsulation on the payload, then you can compress the headers.
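A minimal sketch of the delta idea, assuming consecutive packets of the same kind differ only in a few fields (the function and its interface are just for illustration; the mostly-zero output still goes through a general-purpose compressor afterwards):
#include <cstddef>
#include <cstdint>
#include <vector>

// Subtract the previous packet from the current one byte by byte (modulo 256).
// The decoder reverses it by adding the same reference back.
std::vector<uint8_t> delta_encode(const std::vector<uint8_t>& prev,
                                  const std::vector<uint8_t>& cur)
{
    std::vector<uint8_t> out(cur.size());
    for (size_t i = 0; i < cur.size(); ++i) {
        uint8_t ref = (i < prev.size()) ? prev[i] : 0;   // pad a short reference with zeros
        out[i] = static_cast<uint8_t>(cur[i] - ref);
    }
    return out;
}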
I did what Arno Setagaya suggested in his answer: made some sample tests and compared the results.
The compression tests were done using 5 files, each of them 4096 bytes in size. Each byte inside of these 5 files was generated randomly.
IMPORTANT: In real life, the data would not likely be all random, but would tend to have quite a bit of repeating bytes. Thus, in a real-life application, the compression would tend to be a bit better than the following results.
NOTE: Each of the 5 files was compressed by itself (i.e. not together with the other 4 files, which would result in better compression). In the following results I just use the sum of the sizes of the 5 files together for simplicity.
I included RAR just for comparison reasons, even though it is not open source.
Results: (from best to worst)
LZOP: 20775 / 20480 * 100 = 101.44% of original size
RAR : 20825 / 20480 * 100 = 101.68% of original size
LZMA: 20827 / 20480 * 100 = 101.69% of original size
ZIP : 21020 / 20480 * 100 = 102.64% of original size
BZIP: 22899 / 20480 * 100 = 111.81% of original size
Conclusion: To my surprise, ALL of the tested algorithms produced a larger size than the originals!!! I guess they are only good for compressing larger files, or files that have a lot of repeating bytes (not random data like the above). Thus I will not be using any type of compression on my TCP packets. Maybe this information will be useful to others who are considering compressing small pieces of data.
EDIT:
I forgot to mention that I used default options (flags) for each of the algorithms.