Which type of files can be compressed with Huffman coding? - compression

I know that we use Huffman coding to compress .txt files. What I want to know is which other extensions can be compressed using Huffman coding. For example, can we compress .pdf, .xls, .jpg, .gif, or .mp4 files with it?

In principle, you can compress any type of file with Huffman coding. Huffman coding works on the assumption that the input is a stream of symbols of some sort, and all files are represented as individual bytes, so any file is a valid input to a Huffman coder.
In practice, though, Huffman coding likely wouldn't work well for many other formats, for a couple of reasons. First, many file formats (PDF, MP4, JPG, etc.) already employ some compression method to reduce their size, so hitting them with a secondary compressor isn't likely to achieve much. Second, Huffman coding is based on the assumption that each symbol is sampled from some fixed probability distribution independently of every other symbol, and therefore doesn't do well when there are correlations between which symbols appear where. For example, a raw bitmap image is likely to have correlations between each pixel's color and the colors of its neighbors, but Huffman coding can't take advantage of this.
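To illustrate the first point, here is a small sketch (my own, not from the answer above) showing that re-compressing an already-compressed file rarely helps. It uses Python's zlib, whose DEFLATE algorithm includes Huffman coding as one of its stages; "photo.jpg" is only a placeholder filename.
# Hypothetical sketch: re-compressing an already-compressed file rarely helps.
# zlib's DEFLATE combines LZ77 with Huffman coding; "photo.jpg" is a placeholder.
import zlib

with open("photo.jpg", "rb") as f:   # any already-compressed file
    data = f.read()

recompressed = zlib.compress(data, 9)
print(len(data), "->", len(recompressed))  # typically little or no reduction;
                                           # the "compressed" copy may even be larger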
That being said, Huffman coding is often used as one of many steps in various encoding algorithms. For example, if memory serves me correctly, bzip2 works by breaking the input into blocks, using the Burrows-Wheeler transform on each block, then using move-to-front coding, then run-length encoding, and then finally using Huffman encoding at the very end.
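As a rough sketch of one of those intermediate stages (move-to-front coding, assuming the pipeline described above; real bzip2 differs in the details):
# Hypothetical sketch of move-to-front (MTF) coding. After a Burrows-Wheeler
# transform, runs of identical bytes are common; MTF turns them into runs of
# small numbers (mostly zeros), which run-length and Huffman coding handle well.
def mtf_encode(data: bytes) -> list:
    table = list(range(256))      # recently seen byte values drift to the front
    out = []
    for byte in data:
        index = table.index(byte)
        out.append(index)
        table.pop(index)
        table.insert(0, byte)
    return out

print(mtf_encode(b"aaaabbbbcc"))  # [97, 0, 0, 0, 98, 0, 0, 0, 99, 0]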
Hope this helps!

Related

Why combine Huffman and LZ77?

I'm reverse engineering a Game Boy Advance game, and I noticed that the original developers wrote code that makes two system calls to decompress a level, using Huffman and then LZ77 (in that order).
But why use Huffman + LZ77? What's the advantage of this approach?
Using available libraries
It's possible that the developers are using DEFLATE (or some similar algorithm), simply to be able to re-use tested and debugged software rather than writing something new from scratch (and taking who-knows-how-long to test and fix all the quirky edge cases).
Why both Huffman and LZ77?
But why do DEFLATE, Zstandard, LZHAM, LZHUF, LZH, etc. use both Huffman and LZ77?
Because these 2 algorithms detect and remove 2 different kinds of redundancy common to many data files (video game levels, English and other natural-language text, etc.), and they can be combined to get better net compression than either one alone.
(Unfortunately, most data compression algorithms cannot be combined like this).
Details
In English, the 2 most common letters are (usually) 'e' and then 't'.
So what is the most common pair? You might guess "ee", "et", or "te" -- nope, it's "th".
LZ77 is good at detecting and compressing these kinds of common words and syllables that occur far more often than you might guess from the letter frequencies alone.
Letter-oriented Huffman is good at detecting and compressing files using the letter frequencies alone, but it cannot detect correlations between consecutive letters (common words and syllables).
LZ77 compresses an original file into an intermediate sequence of literal letters and "copy items".
Then Huffman further compresses that intermediate sequence.
Often those "copy items" are already much shorter than the original substring would have been if we had skipped the LZ77 step and simply Huffman compressed the original file.
And Huffman does just as well compressing the literal letters in the intermediate sequence as it would have done compressing those same letters in the original file.
So already this 2-step process creates smaller files than either algorithm alone.
As a bonus, typically the copy items are also Huffman compressed for more savings in storage space.
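To make the "literal letters and copy items" concrete, here is a toy sketch of the LZ77 stage (my own illustration, not from the answer); a real implementation such as the one inside DEFLATE uses a bounded window and hash chains rather than this brute-force search.
# Hypothetical sketch of LZ77 tokenization: emit either a literal byte or a
# "copy item" (distance, length) pointing back into data already seen.
def lz77_tokenize(data: bytes, min_match: int = 3):
    tokens, i = [], 0
    while i < len(data):
        best_len, best_dist = 0, 0
        for j in range(i):        # brute-force search for the longest earlier match
            length = 0
            while i + length < len(data) and data[j + length] == data[i + length]:
                length += 1
            if length > best_len:
                best_len, best_dist = length, i - j
        if best_len >= min_match:
            tokens.append(("copy", best_dist, best_len))   # copy item
            i += best_len
        else:
            tokens.append(("lit", chr(data[i])))           # literal letter
            i += 1
    return tokens

print(lz77_tokenize(b"the cat and the hat"))
# The repeated "the " becomes a single ('copy', 12, 4) item; Huffman coding
# then shrinks the remaining literals and the copy-item fields further.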
In general, most data compression software is made up of these 2 parts.
First they run the original data through a "transformation" or multiple transformations, also called "decorrelators", typically highly tuned to the particular kind of redundancy in the particular kind of data being compressed (JPEG's DCT transform, MPEG's motion-compensation, etc.) or tuned to the limitations of human perception (MP3's auditory masking, etc.).
Next they run the intermediate data through a single "entropy coder" (arithmetic coding, or Huffman coding, or asymmetric numeral system coding) that's pretty much the same for every kind of data.

What to do first to implement a JPG decoder

I need to understand JPG decompression so that I don't need other libraries that just do it.
After being able to identify the different parts of a JPG file in terms of file format, what do I need to do, understand or learn first in mathematical or algorithmic terms so I can start implementing decoding primitives?
Look at this answer to find all the specifications you need to read, and then read them. Several times. Front to back. Then start to implement, testing often along the way with many example JPEG files.
It wouldn't hurt to know a little bit about Fourier transforms and then the discrete cosine transform, and also how Huffman codes work, though you could pick up much of what you need from the specifications.
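For a feel of the mathematics involved, here is a sketch of my own (assuming nothing beyond the standard JPEG 8x8 DCT definition) of the forward transform; the decoder applies the inverse, and real codecs use fast factorized versions rather than this textbook quadruple loop.
# Hypothetical sketch of the 8x8 forward DCT used in JPEG (the decoder runs the
# inverse). "block" is an 8x8 list of sample values, usually level-shifted by -128.
import math

def dct_8x8(block):
    def c(k):
        return 1 / math.sqrt(2) if k == 0 else 1.0
    out = [[0.0] * 8 for _ in range(8)]
    for u in range(8):
        for v in range(8):
            s = 0.0
            for x in range(8):
                for y in range(8):
                    s += (block[x][y]
                          * math.cos((2 * x + 1) * u * math.pi / 16)
                          * math.cos((2 * y + 1) * v * math.pi / 16))
            out[u][v] = 0.25 * c(u) * c(v) * s
    return out

flat = [[100] * 8 for _ in range(8)]
print(dct_8x8(flat)[0][0])   # 800.0: a flat block has only a DC coefficient;
                             # all other coefficients come out (numerically) zero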

Choosing Symbols in an efficient way, in Arithmetic Code Algorithm, C++

I'm trying to implement the arithmetic coding algorithm to compress binary images (JPG images converted to a binary representation using OpenCV). The problem is that I have to store in the compressed file the encoded string, the symbols I used to generate it, and their frequencies, so that I can decode it later. The symbols take a lot of space even when I convert them to ASCII, and if I use fewer characters per symbol, the encoded string gets bigger. So I wonder whether there's an efficient way to store the symbols in the compressed file with the minimum possible size, and I'd also like to know the most efficient way to choose the symbols from the original file.
Thanks in advance :)
325,592,005 bytes is 310 megabytes. You managed to compress this image into 2.8 + 6.1 = 8.9 megabytes, so you decreased the size by 97%. That's a good result and I wouldn't worry here. Besides, 6.1 megabytes of 64-bit symbols means that you have around 800K of them, which is much less than the maximum possible number of symbols, i.e. 2^64 - 1. Again, a good result.
As to your question about using multiple compression algorithms: firstly, you have to know that in the case of lossless compression, the optimal number of bits per symbol is equal to the entropy. Arithmetic coding is close to optimal (see this, this or this). That means there is not much point in chaining more than one algorithm if one of them is arithmetic coding.
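As a concrete way to check how close you already are to that bound, here is a small sketch (my own illustration) that estimates the order-0 entropy of a symbol stream; comparing entropy * symbol_count / 8 with your compressed size shows how much headroom remains.
# Hypothetical sketch: estimate the order-0 entropy (bits per symbol) of a byte
# stream, the lower bound that arithmetic coding approaches for independent symbols.
import math
from collections import Counter

def entropy_bits_per_symbol(data: bytes) -> float:
    counts = Counter(data)
    total = len(data)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

data = b"example data to be measured"
h = entropy_bits_per_symbol(data)
print(h, "bits/symbol ->", h * len(data) / 8, "bytes minimum for", len(data), "symbols")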
As to arithmetic coding vs. Huffman codes: the latter is actually a special case of the former, and as far as I know, arithmetic coding is always at least as good as Huffman codes.
It is also worth adding one thing: if you can consider lossy compression, there is effectively no limit on the compression ratio. In other words, you can compress data as much as you want, as long as the quality loss is still acceptable to you. However, even in this case, using multiple compression algorithms is not required; one is enough.

Data Compression Algorithms

I was wondering if anyone has a list of data compression algorithms. I know basically nothing about data compression, and I was hoping to learn more about the different algorithms and see which ones are the newest and have not yet been implemented on many ASICs.
I'm hoping to implement a data compression ASIC which is independent of the type of data coming in (audio,video,images,etc.)
If my question is too open ended, please let me know and I'll revise. Thank you
There are a ton of compression algorithms out there. What you need here is a lossless compression algorithm. A lossless compression algorithm compresses data such that it can be decompressed to achieve exactly what was given before compression. The opposite would be a lossy compression algorithm. Lossy compression can remove data from a file. PNG images use lossless compression while JPEG images can and often do use lossy compression.
Some of the most widely known compression algorithms include:
RLE
Huffman
LZ77
ZIP archives use a combination of Huffman coding and LZ77 to give fast compression and decompression times and reasonably good compression ratios.
LZ77 is pretty much a generalized form of RLE and it will often yield much better results.
Huffman coding lets the most frequently occurring bytes be represented by the fewest bits.
Imagine a text file that looked like this:
aaaaaaaabbbbbcccdd
A typical implementation of Huffman would result in the following map:
Bits Character
0 a
10 b
110 c
111 d
So the file would be compressed to this:
00000000 10101010 10110110 11011111 10000000
                                     ^^^^^^^
Padding bits required
18 bytes go down to 5. Of course, the table must be included in the file. This algorithm works better with more data :P
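If you want to reproduce the table above, here is a minimal sketch of my own using Python's heapq; the resulting code lengths are 1, 2, 3 and 3 bits, though the exact 0/1 patterns depend on tie-breaking.
# Hypothetical sketch: build a Huffman code for the example string above.
import heapq
from collections import Counter

def huffman_code(text: str) -> dict:
    freq = Counter(text)
    # Heap entries: (frequency, tie_breaker, tree); a tree is a symbol (leaf)
    # or a (left, right) pair of subtrees.
    heap = [(f, i, sym) for i, (sym, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, counter, (left, right)))
        counter += 1
    codes = {}
    def walk(node, prefix):
        if isinstance(node, tuple):
            walk(node[0], prefix + "0")
            walk(node[1], prefix + "1")
        else:
            codes[node] = prefix or "0"
    walk(heap[0][2], "")
    return codes

text = "aaaaaaaabbbbbcccdd"
codes = huffman_code(text)
bits = "".join(codes[ch] for ch in text)
print(codes)                                      # code lengths: a=1, b=2, c=3, d=3 bits
print(len(bits), "bits ->", (len(bits) + 7) // 8, "bytes")  # 33 bits -> 5 bytes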
Alex Allain has a nice article on the Huffman Compression Algorithm in case the Wiki doesn't suffice.
Feel free to ask for more information. This topic is pretty darn wide.
My paper A Survey Of Architectural Approaches for Data Compression in Cache and Main Memory Systems (permalink here) reviews many compression algorithms and also techniques for using them in modern processors. It reviews both research-grade and commercial-grade compression algorithms/techniques, so you may find one which has not yet been implemented in ASIC.
Here are some lossless algorithms (can perfectly recover the original data using these):
Huffman code
LZ78 (and LZW variation)
LZ77
Arithmetic coding
Sequitur
Prediction by partial matching (PPM)
Many of the well known formats like png or gif use variants or combinations of these.
On the other hand, there are lossy algorithms too (they compromise accuracy to compress your data further, but often work pretty well). State-of-the-art lossy techniques combine ideas from differential coding, quantization, and the DCT, among others.
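As an illustration of the quantization idea mentioned above (a sketch of my own, not tied to any particular codec):
# Hypothetical sketch of uniform quantization, the step where lossy codecs throw
# away precision: divide each transform coefficient by a step size and round.
# Larger steps give smaller files and lower quality.
def quantize(coeffs, step):
    return [round(c / step) for c in coeffs]

def dequantize(qcoeffs, step):
    return [q * step for q in qcoeffs]

coeffs = [812.0, -31.7, 4.2, 1.1, -0.6]   # e.g. DCT coefficients of a block
q = quantize(coeffs, 10)
print(q)                  # [81, -3, 0, 0, 0]: long runs of zeros compress very well
print(dequantize(q, 10))  # [810, -30, 0, 0, 0]: close to the original, but not exact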
To learn more about data compression, I recommend https://www.elsevier.com/books/introduction-to-data-compression/sayood/978-0-12-809474-7. It is a very accessible introductory text. The 3rd edition is available as a PDF online.
There are an awful lot of data compression algorithms around. If you're looking for something encyclopedic, I recommend the Handbook of Data Compression by Salomon et al, which is about as comprehensive as you're likely to get (and has good sections on the principles and practice of data compression, as well).
My best guess is that ASIC-based compression is usually implemented for a particular application, or as a specialized element of a SoC, rather than as a stand-alone compression chip. I also doubt that looking for a "latest and greatest" compression format is the way to go here -- I would expect standardization, maturity, and fitness for a particular purpose to be more important.
LZW (Lempel-Ziv-Welch) is a great lossless one. Pseudocode here: http://oldwww.rasip.fer.hr/research/compress/algorithms/fund/lz/lzw.html

Huffman decompression

Well, I'm trying to implement a function which will decide whether a given tree matches a compressed file. Well, 'trying' is a bit of a misrepresentation: I just don't know how to implement such functionality.
I can't figure it out, because sometimes the bytes of a compressed file can match a tree from another file. Any ideas are welcome.
I don't understand what you're trying to ask.
I suggest reading up on Huffman compression -- perhaps Wikipedia: Huffman coding and a few of the pages it links to.
Then edit your question to describe what you're trying to understand.
Could you use an actual question mark?
Huffman compression algorithms typically produce compressed files that contain a "header" with all the information necessary to reconstruct the tree, and a "body" with the compressed bitstream.
If you splice the "header" from one compressed file with the "body" of some other compressed file, the decompressor can't tell that anything is wrong -- the decompressor will happily produce "decompressed" gibberish.
Every possible sequence of bits can be "decoded" by every possible Huffman tree.
But the correct original file for some compressed bitstream can only be produced by the one correct Huffman tree.
It is usually not possible to decide, given only the compressed "body" bitstream and a "header" (or the full Huffman tree reconstructed from that header), whether they are the real body and real header from a single real compressed file, or whether one came from one compressed file and the other came from some other compressed file.
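A small sketch (my own, assuming a simple nested-tuple tree representation) of why a mismatched header and body still "decode" cleanly:
# Hypothetical sketch: walking any prefix-code tree with any bitstream always
# lands on some leaf, so the decoder happily produces output either way; only
# the one correct tree reproduces the original file.
def huffman_decode(bits: str, tree) -> str:
    out, node = [], tree
    for bit in bits:
        node = node[0] if bit == "0" else node[1]
        if not isinstance(node, tuple):   # reached a leaf
            out.append(node)
            node = tree
    return "".join(out)

tree1 = ("a", ("b", ("c", "d")))          # codes: a=0, b=10, c=110, d=111
tree2 = (("e", "t"), ("h", ("s", " ")))   # some other file's (hypothetical) tree
bits = "0101110110"
print(huffman_decode(bits, tree1))        # "abdac"
print(huffman_decode(bits, tree2))        # "ttss" -- equally "valid"-looking gibberish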