How to efficiently decompress a Huffman-coded file - C++

I've found a lot of questions asking this, but some of the explanations were very difficult to understand and I couldn't quite grasp how to decompress the file efficiently.
I have found these related questions:
Huffman code with lookup table
How to decode huffman code quickly?
But I fail to understand the explanations. I know how to encode and decode with a Huffman tree the regular way. Right now my compression program can write any of the following information to the file:
symbol
Huffman code (unsigned long)
Huffman code length
What I plan to do is take a text file, split it into small text files, and compress each one individually. Then, to decompress, I would send all the small compressed files with their respective lookup tables (I don't know how to do this part) to an Nvidia GPU and decompress them in parallel using some sort of lookup table.
I have 3 questions:
What information should I write to file in the header to construct the look up table?
How do I recreate this table from file?
How do I use it to decode the Huffman-encoded file quickly?

Don't bother writing it yourself, unless this is a didactic exercise. Use zlib, lz4, or any of several other free compression/decompression libraries out there that are far better tested than anything you'll be able to do.
You are only talking about Huffman coding, which indicates that you would only get a small portion of the available compression. Most of the compression in the libraries mentioned comes from matching strings. Look up "LZ77".
As for efficient Huffman decoding, you can look at how zlib's inflate does it. It creates a lookup table for the most-significant nine bits of the code. Each entry in the table has either a symbol and the number of bits for that code (less than or equal to nine), or, if the provided nine bits are a prefix of a longer code, that entry has a pointer to another table to resolve the rest of the code, along with the number of bits needed for that secondary table. (There are several of these secondary tables.) There are multiple entries for the same symbol if the code length is less than nine. In fact, there are 2^(9-n) entries for an n-bit code.
So to decode you get nine bits from the input and get the entry from the table. If it is a symbol, then you remove the number of bits indicated for the code from your stream and emit the symbol. If it is a pointer to a secondary table, then you remove nine bits from the stream, get the number of bits indicated by the table, and look it up there. Now you will definitely get a symbol to emit, and the number of remaining bits to remove from the stream.
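To make that concrete, here is a minimal single-level sketch (my own names, not zlib's code): one table indexed by the next MAXBITS input bits, with each n-bit code duplicated into 2^(MAXBITS-n) slots. It assumes no code is longer than MAXBITS; zlib's two-level scheme exists precisely so the root table can stay at nine bits even when codes are longer.

    #include <cstdint>
    #include <vector>

    const int MAXBITS = 9;                     // assumed upper bound on code length

    struct Code  { uint16_t symbol; uint32_t code; uint8_t length; };  // one header entry
    struct Entry { uint16_t symbol; uint8_t length; };                 // length == 0 -> unused slot

    // Build the decode table from the (symbol, code, length) triples read from the header.
    std::vector<Entry> build_table(const std::vector<Code>& codes) {
        std::vector<Entry> table(1u << MAXBITS, Entry{0, 0});
        for (const Code& c : codes) {
            // Every MAXBITS-bit value whose leading 'length' bits equal the code maps to
            // the same symbol: 2^(MAXBITS - length) duplicate entries, as described above.
            uint32_t first = c.code << (MAXBITS - c.length);
            uint32_t count = 1u << (MAXBITS - c.length);
            for (uint32_t i = 0; i < count; ++i)
                table[first + i] = Entry{c.symbol, c.length};
        }
        return table;
    }

    // Decoding (bit reader omitted): peek the next MAXBITS bits of input, look up
    // table[bits], emit entry.symbol, then consume only entry.length bits.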

Related

Storing table of codes in a compressed file after Huffman compression and building tree for decompression from this table

I was writing a Huffman compression program in C++, but I ran into a problem with the compressed file's structure. I need to store some structure in my new file that helps me decode it. I decided to write a table of codes at the beginning of the file and then build a tree from this table to decode the rest of the content, but I do not know how best to store the table (I mean I do not know the structure of the table; I know how to write things in binary mode) or how to build the tree from this table. Sorry for my English. Thank you in advance.
You do not need to transmit the probabilities or the tree. All the decoder needs is the number of bits assigned to each symbol, and a canonical way to assign the bit values to each symbol that is agreed to by both the encoder and decoder. See Canonical Huffman Code.
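A compact sketch of that canonical construction (essentially the scheme used by deflate in RFC 1951, written here from scratch): both encoder and decoder run the same function over the stored code lengths and get identical code values, which is why the lengths are all the header needs to contain.

    #include <cstddef>
    #include <cstdint>
    #include <vector>

    // Rebuild canonical code values from the per-symbol code lengths alone.
    // lengths[sym] is the code length in bits for each symbol (0 = symbol unused).
    std::vector<uint32_t> canonical_codes(const std::vector<uint8_t>& lengths) {
        int max_len = 0;
        for (uint8_t l : lengths) if (l > max_len) max_len = l;

        // How many codes exist of each length.
        std::vector<int> count(max_len + 1, 0);
        for (uint8_t l : lengths) if (l) ++count[l];

        // First code value for each length (the standard canonical construction).
        std::vector<uint32_t> next(max_len + 1, 0);
        uint32_t code = 0;
        for (int len = 1; len <= max_len; ++len) {
            code = (code + count[len - 1]) << 1;
            next[len] = code;
        }

        // Assign codes to symbols in symbol order within each length.
        std::vector<uint32_t> result(lengths.size(), 0);
        for (std::size_t sym = 0; sym < lengths.size(); ++sym)
            if (lengths[sym]) result[sym] = next[lengths[sym]]++;
        return result;
    }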
You could try writing a header in the compressed file with the sequence of characters ordered by their probability of appearing in the text, or writing the letters followed by their probabilities. With that, you use the same process for building the tree when compressing and decompressing. As for how to build the tree itself, I suppose you'll have to do a little research and come back if you have problems.

Openssl Message Digest One-Way Brute-force attack

I am learning cryptography and using OpenSSL to implement whatever I am learning. Recently, I found one of the assignment questions and am trying to solve it. I don't have a problem understanding most of the questions, but I do with this one.
4 Task 2: One-Way Property versus Collision-Free Property
In this task, we will investigate the difference between two properties of common hash functions: one-way property versus collision-free property. We will use the brute-force method to see how long it takes to break each of these properties. Instead of using openssl's command-line tools, you are required to write your own C program to invoke the message digest functions in openssl's crypto library. Docs can be found at http://www.openssl.org/docs/crypto/EVP_DigestInit.html.
Since most of the hash functions are quite strong against the brute-force attack on those two properties, it will take us years to break them using the brute-force method. To make the task feasible, in all of this project we reduce the length of the hash value to 24 bits. We can use any one-way hash function, but we only use the first 24 bits of the hash value.
Write a program that, given a 24-bit hash value, finds a matching text (only lower-case ASCII characters). Your program will have to repeatedly 1) generate a random text, 2) hash it, 3) compare the lower 24 bits to the input.
Your program (source must be called task2.c) will be called as follows:
./task2 <digest name> <hash value>
e.g., ./task2 sha256 2612c7... and your program must write the winning text to task2.out.
Please ensure the output is readable and writable, i.e.:
open("task2.out", O_WRONLY | O_CREAT, 0644);
We will verify with command line tools, e.g., openssl dgst -sha256 task2.out.
Question: How many texts did you have to hash to find a specific hash? (give average of three trials)
I am not able to understand how to start writing my program. Any input is greatly appreciated. As I am not solving it as homework, I am looking for some pointers and not the code.
Well, reading the text, it's clear to me what the task is, and unclear which part you do not get. Where to start?
create a skeleton program, like hello world
create a function that generates a random text
create a function that takes a text and a hash-id, and uses openssl to hash it, returning the hash
create a function that extracts the lower 24 bits of the hash
create a function that takes the command-line params and converts them to the 24-bit number that is the looked-for hash and the hash-id to hand to openssl (or exits with an error indication)
run a loop that keeps feeding in new random strings until the resulting hash matches the requested one, counting the attempts
write the winning text to the file and the count to the output
do all the remaining tasks from the assignment... A rough sketch of the main loop is shown below.
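The sketch below is only a hedged outline of those steps using OpenSSL's EVP interface, not a reference solution: error handling is minimal, the random text is a fixed-length lower-case string, writing task2.out is left out, and whether "lower 24 bits" means the first or the last three digest bytes (the example hash suggests the first) is something to confirm against the assignment.

    #include <openssl/evp.h>
    #include <cstdio>
    #include <cstdlib>

    int main(int argc, char** argv) {
        if (argc != 3) {
            fprintf(stderr, "usage: %s <digest name> <6 hex digits>\n", argv[0]);
            return 1;
        }
        OpenSSL_add_all_digests();
        const EVP_MD* md = EVP_get_digestbyname(argv[1]);
        if (!md) { fprintf(stderr, "unknown digest %s\n", argv[1]); return 1; }

        unsigned long target = strtoul(argv[2], NULL, 16);   // the 24-bit value to match
        unsigned char digest[EVP_MAX_MD_SIZE];
        unsigned int digest_len = 0;
        char text[9];
        unsigned long long tries = 0;

        srand(1234);
        for (;;) {
            ++tries;
            for (int i = 0; i < 8; ++i) text[i] = 'a' + rand() % 26;  // random lower-case text
            text[8] = '\0';

            EVP_MD_CTX* ctx = EVP_MD_CTX_create();
            EVP_DigestInit_ex(ctx, md, NULL);
            EVP_DigestUpdate(ctx, text, 8);
            EVP_DigestFinal_ex(ctx, digest, &digest_len);
            EVP_MD_CTX_destroy(ctx);

            // Take the first three digest bytes as the 24-bit value and compare.
            unsigned long first24 = ((unsigned long)digest[0] << 16)
                                  | ((unsigned long)digest[1] << 8)
                                  |  (unsigned long)digest[2];
            if (first24 == target) break;
        }
        printf("found \"%s\" after %llu hashes\n", text, tries);
        return 0;   // writing task2.out with open()/write() is left as in the assignment
    }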
The algorithm is well laid out by Balog Pal. Just to add a few things:
For the one-way property, you are given a hash and you search for a text that produces the same (truncated) hash.
For the collision-free property, you just need to find two texts with the same hash. So you start by generating two texts and comparing their corresponding hashes. If they are the same, you have found a collision. If not, you store the already generated hashes, then generate a new text, hash it, and compare it with the stored hashes. If any stored hash matches, you have found a collision; otherwise add it to the list of stored hashes. Repeat the cycle until you find a collision.
A Python implementation of the same can be found at the link below. It includes minimal comments, so you have to figure out everything from the code. Once that is done, try implementing it in C or Java.
https://github.com/arafat1/One-Way-Property-versus-Collision-Free-Property/blob/master/HashProperty.py
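For the collision search specifically, here is a rough C++ sketch of the store-and-compare loop described above (not the linked Python code). The two callbacks stand in for the random-text generator and the 24-bit hash helper from the earlier steps:

    #include <functional>
    #include <string>
    #include <unordered_map>
    #include <utility>

    // Keep generating texts until two different ones hash to the same 24-bit value.
    std::pair<std::string, std::string>
    find_collision(const std::function<unsigned long(const std::string&)>& hash24,
                   const std::function<std::string()>& random_text) {
        std::unordered_map<unsigned long, std::string> seen;   // 24-bit hash -> text
        for (;;) {
            std::string text = random_text();
            unsigned long h = hash24(text);
            auto it = seen.find(h);
            if (it != seen.end() && it->second != text)
                return {it->second, text};                      // two texts, same 24-bit hash
            seen.emplace(h, text);
        }
    }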

File Binary vs Text

Are there situations where I should prefer a binary file to a text file? I'm using C++ as the programming language.
For example, if I have to store a large text file, is it better to use a text file or a binary file?
Edit
For the moment the file has no requirement to be readable by humans. Are there performance differences, security differences, and so on?
Edit
Sorry for omitting the other requirements (thanks to Carey Gregory):
The records to save are in ASCII encoding
The file must be encrypted (AES)
The machine can power off at any time, so I have to try to prevent errors
I have to know if the file changes outside the program; I think I'll use a SHA-1 digest of the file
As a general rule, define a text format, and use it. It's much easier to develop and debug, and it's much easier to see what is going wrong if it doesn't work.
If you find that the files are becoming too big, or taking too much time to transfer over the wire, consider compressing them. A compressed text file is often smaller than what you can achieve with binary. Or consider a less verbose text format; it's possible to reliably transmit a text representation of your data with far fewer characters than XML uses.
And finally, if you do end up having to use binary, try to choose an existing format (e.g. Google's protocol buffers), or base your format on an existing one. Just remember that:
Binary is a lot more work than text, since you practically have to write all of the << operators again, including those in the standard library.
Binary is a lot more difficult to debug, because you can't easily see what you've actually done.
Concerning your last edit:
Once you've encrypted, the results will be binary. You can use a text representation of the binary (base64 or some such), but the results won't be any more readable than the binary, so it's not worth the bother. If you're encrypting in process, before writing to disk, you automatically lose all of the advantages of text.
The issues concerning powering off mean that you cannot use ofstream directly. You must open or create the file with the necessary options for full transactional integrity (O_SYNC as a flag to open under Unix). You must write each record as a single write request to the system.
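A minimal sketch of what that looks like with POSIX calls (Unix-only, hypothetical function names):

    #include <fcntl.h>
    #include <unistd.h>
    #include <cstddef>

    // Open with O_SYNC so each write() returns only once the data has reached the disk;
    // O_APPEND keeps records contiguous if the file is reopened after a crash.
    int open_log(const char* path) {
        return open(path, O_WRONLY | O_CREAT | O_APPEND | O_SYNC, 0644);
    }

    // One write request per record, as suggested above (a production version
    // would also retry short writes and check errno).
    bool append_record(int fd, const void* record, size_t size) {
        return write(fd, record, size) == (ssize_t)size;
    }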
It's always a good idea to have a checksum, just in case. If you're worried about security, SHA1 is a good choice. But keep in mind that if someone has access to the file, and wants to intentionally change it, they can recalculate the SHA1 and insert the new value as well.
All files are binary; the data within them is a binary representation of some information. If you have to store a large amount of text then the file will contain the binary representation of that text. The difference between a "binary file" and a "text file" is that creating the latter involves converting data to a text form before saving it. This is typically done so humans can read it.
The distinction between binary and text is usually made when storing data that is for computer consumption. Typically this data would not be text - it might be a list of numerical configuration values, for example: 1, 2, 3.
If you stored this in text format, your file could contain a list of human-readable numbers, and if you opened the file in Notepad you might see one number per line. But what you're actually saving here is not the binary values 1, 2, 3 - you're saving the string "1\n2\n3\n". Note that this string is 6 characters long, and its bytes (assuming ASCII) would actually be 49, 10, 50, 10, 51, 10!
If the same data were stored in binary format, you would store the numbers in the smallest useful space, and write the file as individual bytes that can often only be read by the code that created them. Opening this file in Notepad would likely display junk characters, because the data makes no sense as text. In this case you would be saving a byte array with actual values { 1, 2, 3 } - or even a single byte with the three values embedded. This could be much smaller than the human-readable equivalent.
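To make the size difference concrete, here is a small sketch (hypothetical file names) that writes the same three values both ways:

    #include <fstream>

    int main() {
        // Text form: the characters '1' '\n' '2' '\n' '3' '\n' -> 6 bytes (49 10 50 10 51 10)
        std::ofstream text("values.txt");
        text << 1 << '\n' << 2 << '\n' << 3 << '\n';

        // Binary form: the raw byte values 1, 2, 3 -> 3 bytes
        unsigned char raw[3] = {1, 2, 3};
        std::ofstream bin("values.bin", std::ios::binary);
        bin.write(reinterpret_cast<const char*>(raw), sizeof raw);
    }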
Binary files store a sequence of bytes, like all other files. You can store numeric values such as integers in 4 bytes each, characters in a single byte each, or even serialized class objects and anything you want.
When you know how to read a binary file (i.e. you know what is stored in it), you can extract all the information from it. Text files, however, use text encodings like UTF-8, ANSI, etc., and they are intended to encode text characters to be processed by text editors.
Binary files are meant to be interpreted only by machines, whereas a human can also open a text file and interpret its content.
So it depends on whether you want your file to be readable by a human or not.
It depends on a lot of factors. I can think of two right now:
Do you require the file to be readable by humans?
Is compression a factor? A 10-digit number will take at least 10 bytes as text, but might take as little as four as binary.
All data is binary; you always need a machine to interpret it for you. Even if the data is encoded with protocol buffers, Avro, Thrift, etc., it is binary, and if it is not encoded that way, it is still binary. If you want to read protocol buffers in Notepad, there is a two-step process: decode, then read. In the case of text, that decoding step is not needed. The same goes for encrypted data: first decrypt, then read. Humans cannot read binary directly (as some commenters are mentioning); we still need Notepad to interpret and display the bytes as so-called text.
All data stored in a text file consists of human-readable graphic characters. Each line of data ends with a newline character.
In the case of a binary file, data is stored in the same format as it is stored in memory. There are no lines or newline characters; the file is just a stream of bytes.
Moreover, binary files are usually more space-efficient.

Parallelization of PNG file creation with C++, libpng and OpenMP

I am currently trying to implement a PNG encoder in C++ based on libpng that uses OpenMP to speed up the compression process.
The tool is already able to generate PNG files from various image formats.
I uploaded the complete source code to pastebin.com so you can see what I have done so far: http://pastebin.com/8wiFzcgV
So far, so good! Now, my problem is to find a way how to parallelize the generation of the IDAT chunks containing the compressed image data. Usually, the libpng function png_write_row gets called in a for-loop with a pointer to the struct that contains all the information about the PNG file and a row pointer with the pixel data of a single image row.
(Lines 114-117 in the Pastebin file)
//Loop through image
for (i = 0, rp = info_ptr->row_pointers; i < png_ptr->height; i++, rp++) {
    png_write_row(png_ptr, *rp);
}
Libpng then compresses one row after another and fills an internal buffer with the compressed data. As soon as the buffer is full, the compressed data gets flushed in an IDAT chunk to the image file.
My approach was to split the image into multiple parts and let one thread compress rows 1 to 10, another thread rows 11 to 20, and so on. But as libpng uses an internal buffer, it is not as easy as I first thought :) I somehow have to make libpng write the compressed data to a separate buffer for every thread. Afterwards I need a way to concatenate the buffers in the right order so I can write them all together to the output image file.
So, does someone have an idea how I can do this with OpenMP and some tweaking to libpng? Thank you very much!
This is too long for a comment but is not really an answer either--
I'm not sure you can do this without modifying libpng (or writing your own encoder). In any case, it will help if you understand how PNG compression is implemented:
At the high level, the image is a set of rows of pixels (generally 32-bit values representing RGBA tuples).
Each row can independently have a filter applied to it -- the filter's sole purpose is to make the row more "compressible". For example, the "sub" filter makes each pixel's value the difference between it and the one to its left. This delta encoding might seem silly at first glance, but if the colours between adjacent pixels are similar (which tends to be the case) then the resulting values are very small regardless of the actual colours they represent. It's easier to compress such data because it's much more repetitive.
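As a tiny illustration of that delta encoding (a simplified, hypothetical helper; real PNG filtering subtracts per byte across each pixel's channels):

    #include <cstddef>
    #include <vector>

    // Apply the "sub" filter to one row treated as single-byte samples.
    std::vector<unsigned char> sub_filter(const std::vector<unsigned char>& row) {
        std::vector<unsigned char> out(row.size());
        unsigned char left = 0;                                   // nothing to the left of the first byte
        for (std::size_t i = 0; i < row.size(); ++i) {
            out[i] = static_cast<unsigned char>(row[i] - left);   // difference to the byte on the left
            left = row[i];
        }
        return out;   // e.g. {100, 101, 102, 103} -> {100, 1, 1, 1}: much more repetitive
    }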
Going down a level, the image data can be seen as a stream of bytes (rows are no longer distinguished from each other). These bytes are compressed, yielding another stream of bytes. The compressed data is arbitrarily broken up into segments (anywhere you want!) written to one IDAT chunk each (along with a little bookkeeping overhead per chunk, including a CRC checksum).
The lowest level brings us to the interesting part, which is the compression step itself. The PNG format uses the zlib compressed data format. zlib itself is just a wrapper (with more bookkeeping, including an Adler-32 checksum) around the real compressed data format, deflate (zip files use this too). deflate supports two compression techniques: Huffman coding (which reduces the number of bits required to represent some byte-string to the optimal number given the frequency that each different byte occurs in the string), and LZ77 encoding (which lets duplicate strings that have already occurred be referenced instead of written to the output twice).
The tricky part about parallelizing deflate compression is that in general, compressing one part of the input stream requires that the previous part also be available in case it needs to be referenced. But, just like PNGs can have multiple IDAT chunks, deflate is broken up into multiple "blocks". Data in one block can reference previously encoded data in another block, but it doesn't have to (of course, it may affect the compression ratio if it doesn't).
So, a general strategy for parallelizing deflate would be to break the input into multiple large sections (so that the compression ratio stays high), compress each section into a series of blocks, then glue the blocks together (this is actually tricky since blocks don't always end on a byte boundary -- but you can put an empty non-compressed block (type 00), which will align to a byte boundary, in-between sections). This isn't trivial, however, and requires control over the very lowest level of compression (creating deflate blocks manually), creating the proper zlib wrapper spanning all the blocks, and stuffing all this into IDAT chunks.
If you want to go with your own implementation, I'd suggest reading my own zlib/deflate implementation (and how I use it) which I expressly created for compressing PNGs (it's written in Haxe for Flash but should be comparatively easy to port to C++). Since Flash is single-threaded, I don't do any parallelization, but I do split the encoding up into virtually independent sections ("virtually" because there's the fractional-byte state preserved between sections) over multiple frames, which amounts to largely the same thing.
Good luck!
I finally managed to parallelize the compression process.
As mentioned by Cameron in the comment on his answer, I had to strip the zlib header from the zstreams to combine them. Stripping the footer was not required, as zlib offers an option called Z_SYNC_FLUSH which can be used for all chunks (except the last one, which has to be written with Z_FINISH) to end on a byte boundary. So you can simply concatenate the stream outputs afterwards. Finally, the adler32 checksum has to be calculated over all threads and copied to the end of the combined zstreams.
If you are interested in the result you can find the complete proof of concept at https://github.com/anvio/png-parallel
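For anyone attempting the same thing, here is a hedged sketch of the per-section compression step (not the code from that repository): each section is deflated as a raw stream (windowBits = -15, so there is no per-section zlib header to strip) ended with Z_SYNC_FLUSH, the last one with Z_FINISH, and the per-section Adler-32 values are later merged with adler32_combine() before the whole thing is wrapped in a single zlib stream.

    #include <zlib.h>
    #include <vector>

    struct Section {
        std::vector<unsigned char> out;  // raw deflate output of this section
        uLong adler = 0;                 // Adler-32 of this section's *uncompressed* input
        uLong in_len = 0;                // uncompressed length, needed by adler32_combine()
    };

    void compress_section(const unsigned char* in, uLong in_len, bool last, Section& s) {
        z_stream strm = {};
        // windowBits = -15 -> raw deflate, i.e. no per-section zlib header/trailer.
        deflateInit2(&strm, Z_BEST_COMPRESSION, Z_DEFLATED, -15, 8, Z_DEFAULT_STRATEGY);

        s.out.resize(deflateBound(&strm, in_len) + 64);      // a little slack for the flush
        strm.next_in   = const_cast<unsigned char*>(in);
        strm.avail_in  = (uInt)in_len;
        strm.next_out  = s.out.data();
        strm.avail_out = (uInt)s.out.size();

        // Z_SYNC_FLUSH ends the section on a byte boundary; the last section uses Z_FINISH.
        deflate(&strm, last ? Z_FINISH : Z_SYNC_FLUSH);
        s.out.resize(s.out.size() - strm.avail_out);
        deflateEnd(&strm);

        s.adler  = adler32(adler32(0L, Z_NULL, 0), in, (uInt)in_len);
        s.in_len = in_len;
    }

    // Usage idea: run compress_section() over the sections in a
    // "#pragma omp parallel for" loop, then write the 2-byte zlib header, the
    // concatenated section buffers, and the combined checksum as the trailer:
    //   uLong adler = adler32(0L, Z_NULL, 0);
    //   for (const Section& s : sections) adler = adler32_combine(adler, s.adler, s.in_len);
    // written big-endian, with all of it split into IDAT chunks.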

Best compression library/format for compressing on the fly and binary search?

I'm looking for a compression library/format with the following abilities:
Can compress my data as I write it.
Will let me efficiently binary search through the file.
Will let me efficiently traverse the file in reverse.
Context: I'm writing a C++ app that listens for incoming data, normalizes it, and then needs to persist the normalized output to disk. The data already compresses pretty well when I run gzip on the files by hand. However, the amount of incoming data is potentially massive, and I'd like to do the compression on the fly. Each entry in the file has a timestamp associated with it and I may be only interested in the chunk of data between time X and time Y, so to quickly find that chunk I'd like to be able to binary search. And even iterate in reverse if possible. Do any particular compression libraries/formats stick out as being particularly good for my project? I've found libraries that satisfy #1, but often whether #2 or #3 will work is undocumented.
You can just compress a few chunks at a time so that you can decompress them separately, then keep an (uncompressed but small) index to the beginning of each block of chunks in the compressed data. That will allow almost random access to the chunks and still keep them in order by timestamp. The limiting case of this is to compress each chunk individually, although that might hurt your compression ratio.
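A minimal sketch of that idea, assuming zlib as the codec and with purely hypothetical names; the index is what makes the binary search over timestamps (and reverse traversal) possible without decompressing everything:

    #include <zlib.h>
    #include <cstdint>
    #include <vector>

    // One index record per compressed block; the index itself stays uncompressed
    // (or lives in a side file) so it can be binary searched directly.
    struct IndexEntry {
        uint64_t first_timestamp;   // timestamp of the first entry in the block
        uint64_t file_offset;       // where the block's compressed bytes start
        uint32_t compressed_size;
    };

    // Compress one block of already-serialized entries independently of the others.
    std::vector<unsigned char> compress_block(const std::vector<unsigned char>& raw) {
        uLongf out_len = compressBound(raw.size());
        std::vector<unsigned char> out(out_len);
        compress(out.data(), &out_len, raw.data(), raw.size());
        out.resize(out_len);
        return out;
    }

    // To read the range [X, Y]: binary-search the index for the last block whose
    // first_timestamp <= X, then uncompress() blocks forward until an entry's
    // timestamp exceeds Y. Reverse traversal walks the index backwards instead.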