I was wondering if anyone has a list of data compression algorithms. I know basically nothing about data compression and I was hoping to learn more about different algorithms and see which ones are the newest and have yet to be developed on a lot of ASICs.
I'm hoping to implement a data compression ASIC which is independent of the type of data coming in (audio, video, images, etc.).
If my question is too open ended, please let me know and I'll revise. Thank you
There are a ton of compression algorithms out there. What you need here is a lossless compression algorithm. A lossless compression algorithm compresses data such that it can be decompressed to achieve exactly what was given before compression. The opposite would be a lossy compression algorithm. Lossy compression can remove data from a file. PNG images use lossless compression while JPEG images can and often do use lossy compression.
Some of the most widely known compression algorithms include:
RLE
Huffman
LZ77
ZIP archives use a combination of Huffman coding and LZ77 to give fast compression and decompression times and reasonably good compression ratios.
LZ77 is pretty much a generalized form of RLE and it will often yield much better results.
Huffman coding assigns the shortest bit codes to the most frequent bytes and longer codes to rarer ones.
Imagine a text file that looked like this:
aaaaaaaabbbbbcccdd
A typical implementation of Huffman would result in the following map:
Bits Character
0 a
10 b
110 c
111 d
So the file would be compressed to this:
00000000 10101010 10110110 11011111 10000000
                                     ^^^^^^^
                                     Padding bits required
18 bytes go down to 5. Of course, the table must be included in the file. This algorithm works better with more data :P
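As a rough illustration of the idea above, here is a minimal Python sketch that builds Huffman code lengths for the example string. The function name `huffman_code_lengths` is made up for this example, and it computes only the length of each code, since the exact bit patterns depend on how ties are broken.

```python
import heapq
from collections import Counter

def huffman_code_lengths(data: str) -> dict[str, int]:
    """Return the Huffman code length (in bits) for each symbol in data."""
    freq = Counter(data)
    if len(freq) == 1:
        # A single distinct symbol still needs one bit per occurrence.
        return {next(iter(freq)): 1}
    # Heap entries are (weight, tie_breaker, {symbol: depth_so_far}).
    heap = [(w, i, {s: 0}) for i, (s, w) in enumerate(sorted(freq.items()))]
    heapq.heapify(heap)
    tie_breaker = len(heap)
    while len(heap) > 1:
        # Merge the two lightest subtrees; every symbol in them gets one bit deeper.
        w1, _, d1 = heapq.heappop(heap)
        w2, _, d2 = heapq.heappop(heap)
        merged = {s: depth + 1 for s, depth in {**d1, **d2}.items()}
        heapq.heappush(heap, (w1 + w2, tie_breaker, merged))
        tie_breaker += 1
    return heap[0][2]

text = "aaaaaaaabbbbbcccdd"
lengths = huffman_code_lengths(text)
total_bits = sum(lengths[ch] for ch in text)
total_bytes = (total_bits + 7) // 8  # round up to whole bytes (padding)
print(lengths, total_bits, total_bytes)
```

The sketch ignores the cost of storing the code table, which a real file format would have to include.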
Alex Allain has a nice article on the Huffman Compression Algorithm in case the Wiki doesn't suffice.
Feel free to ask for more information. This topic is pretty darn wide.
My paper A Survey Of Architectural Approaches for Data Compression in Cache and Main Memory Systems (permalink here) reviews many compression algorithms and also techniques for using them in modern processors. It reviews both research-grade and commercial-grade compression algorithms/techniques, so you may find one which has not yet been implemented in ASIC.
Here are some lossless algorithms (can perfectly recover the original data using these):
Huffman code
LZ78 (and LZW variation)
LZ77
Arithmetic coding
Sequitur
Prediction by partial matching (PPM)
Many of the well-known formats like PNG or GIF use variants or combinations of these.
On the other hand, there are lossy algorithms too (they sacrifice accuracy to compress your data, but often work pretty well). State-of-the-art lossy techniques combine ideas from differential coding, quantization, and DCT, among others.
To learn more about data compression, I recommend Introduction to Data Compression by Sayood: https://www.elsevier.com/books/introduction-to-data-compression/sayood/978-0-12-809474-7. It is a very accessible introductory text. The 3rd edition is available online as a PDF.
There are an awful lot of data compression algorithms around. If you're looking for something encyclopedic, I recommend the Handbook of Data Compression by Salomon et al, which is about as comprehensive as you're likely to get (and has good sections on the principles and practice of data compression, as well).
My best guess is that ASIC-based compression is usually implemented for a particular application, or as a specialized element of a SoC, rather than as a stand-alone compression chip. I also doubt that looking for a "latest and greatest" compression format is the way to go here -- I would expect standardization, maturity, and fitness for a particular purpose to be more important.
LZW (Lempel-Ziv-Welch) is a great lossless algorithm. Pseudocode here: http://oldwww.rasip.fer.hr/research/compress/algorithms/fund/lz/lzw.html
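For readers who prefer runnable code to pseudocode, here is a minimal LZW sketch in Python. It is not taken from the linked page: the function names are made up for this example, codes are kept as plain integers rather than packed into bits, and the dictionary is allowed to grow without limit, which a real implementation (and certainly a hardware one) would not do.

```python
def lzw_compress(data: bytes) -> list[int]:
    """Encode data as a list of LZW codes (codes 0-255 are single bytes)."""
    table = {bytes([i]): i for i in range(256)}
    next_code = 256
    current = b""
    out = []
    for byte in data:
        candidate = current + bytes([byte])
        if candidate in table:
            current = candidate           # keep extending the match
        else:
            out.append(table[current])    # emit the longest known string
            table[candidate] = next_code  # learn the new string
            next_code += 1
            current = bytes([byte])
    if current:
        out.append(table[current])
    return out

def lzw_decompress(codes: list[int]) -> bytes:
    """Rebuild the original bytes; the table is reconstructed on the fly."""
    if not codes:
        return b""
    table = {i: bytes([i]) for i in range(256)}
    next_code = 256
    previous = table[codes[0]]
    out = [previous]
    for code in codes[1:]:
        if code in table:
            entry = table[code]
        elif code == next_code:
            # The code refers to the string that is being defined right now.
            entry = previous + previous[:1]
        else:
            raise ValueError("invalid LZW code")
        out.append(entry)
        table[next_code] = previous + entry[:1]
        next_code += 1
        previous = entry
    return b"".join(out)

message = b"TOBEORNOTTOBEORTOBEORNOT"
codes = lzw_compress(message)
print(len(message), "bytes ->", len(codes), "codes")
```

Note that the decoder never receives the dictionary; it rebuilds the same table the encoder built, one entry per code read.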
Related
I want to compress .txt files that contain dates in yyyy-mm-dd hh:mm:ss format and English words that sometimes tend to be repeated in different lines.
I read some articles about compression algorithms and found that in my case dictionary-based encoding is better than entropy-based encoding. Since I want to implement the algorithm myself, I need something that isn't very complicated. So I looked at LZW and LZ77, but can't choose between them, because the conclusions of the articles I found are contradictory. According to some articles LZW has the better compression ratio, and according to others LZ77 is the leader. So the question is: which one is most likely to be better in my case? Are there other easy-to-implement algorithms that would suit my purpose?
LZW is obsolete. Modern, and even pretty old, LZ77 compressors outperform LZW.
In any case, you are the only one who can answer your question, since only you have examples of the data you want to compress. Simply experiment with various compression methods (zstd, xz, lz4, etc.) on your data and see what combination of compression ratio and speed meets your needs.
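A quick way to run that kind of experiment is sketched below in Python. It uses only the standard-library codecs (zlib, bz2 and lzma, the last of which implements the xz format), not the zstd or lz4 tools named above, which need third-party packages. The sample data is invented for illustration, so substitute your own files.

```python
import bz2
import lzma
import time
import zlib

def compare_codecs(data: bytes) -> dict[str, tuple[float, float]]:
    """Return {codec name: (compression ratio, seconds to compress)}."""
    codecs = {
        "zlib (DEFLATE)": zlib.compress,
        "bz2": bz2.compress,
        "lzma (xz)": lzma.compress,
    }
    results = {}
    for name, compress in codecs.items():
        start = time.perf_counter()
        packed = compress(data)
        elapsed = time.perf_counter() - start
        results[name] = (len(data) / len(packed), elapsed)
    return results

# Hypothetical sample resembling timestamped log lines; use real data instead.
sample = ("2024-01-15 08:30:00 the quick brown fox\n" * 2000).encode()

for name, (ratio, seconds) in compare_codecs(sample).items():
    print(f"{name}: ratio {ratio:.1f}:1 in {seconds * 1000:.2f} ms")
```

Highly repetitive test data like this flatters every codec, so the numbers only mean something when measured on representative input.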
Currently, I am looking for a lossless compression algorithm that is suitable for a large amount of text, which will then be encrypted with AES and used as the payload in steganography.
EDIT:
Based on A Comparative Study Of Text Compression Algorithms, it seems that
Arithmetic coding is preferable in Statistical compression techniques, while LZB is recommended for Dictionary compression techniques.
So now I am wondering whether statistical compression or dictionary compression is more suitable for large English text in terms of compression ratio and ease of implementation.
I have searched around but still barely have an idea of which algorithm is suitable. Thank you very much for your time in answering. Have a nice day. :)
A lot of the algorithms that you are describing in this question are called entropy coders (Shannon-Fano, Huffman, arithmetic, etc.). Entropy coders are used to compress sequences of symbols (often bytes), where some symbols are much more frequent than others. Simple entropy coding of symbols (letters) for compressing natural language will only yield about a 2:1 compression.
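To see where a figure like that comes from, one can measure the zero-order entropy of a text, that is, the average number of bits per symbol an ideal per-symbol entropy coder would need. The Python sketch below is an illustration only: the function name is made up, and the result depends entirely on the text you feed it.

```python
import math
from collections import Counter

def zero_order_entropy(data: bytes) -> float:
    """Shannon entropy in bits per byte, treating each byte independently."""
    if not data:
        return 0.0
    counts = Counter(data)
    n = len(data)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

text = b"the quick brown fox jumps over the lazy dog and the cat"
bits_per_byte = zero_order_entropy(text)
# Best ratio a simple per-byte entropy coder could approach on this input.
print(f"{bits_per_byte:.2f} bits/byte, ideal ratio about {8 / bits_per_byte:.2f}:1")
```

Because this model ignores context (which letter tends to follow which), it says nothing about what the dictionary or context-based methods can achieve.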
Instead, popular modern lossless compression techniques for text include methods like LZ77, LZW, and BWT. Loosely speaking, the LZ family involves building up a dictionary of recurring short symbol sequences (we'll call them "words") and then using pointers to reference those words. Some LZ variants, like LZ77 and LZW, can be fairly simple to code up but probably do not yield the highest compression ratios. See for example this video: https://www.youtube.com/watch?v=j2HSd3HCpDs. At the other end of the spectrum, LZMA2 is a more complicated variant with a higher compression ratio.
The Burrows-Wheeler transform (BWT) provides a clever alternative to the dictionary methods. I'll refer you to the Wikipedia article, https://en.wikipedia.org/wiki/Burrows%E2%80%93Wheeler_transform
In a nutshell, though, it produces a (invertible) permutation of the original sequence of bytes that can often be compressed very effectively by run-length encoding followed by an entropy coder.
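A naive Python sketch of the transform and its inverse is shown below. It sorts all rotations explicitly, which is far too slow for real inputs (practical implementations use suffix arrays), and it assumes an end marker character that does not occur in the text. The function names and the choice of marker are assumptions made for this example.

```python
def bwt(text: str, end: str = "\0") -> str:
    """Burrows-Wheeler transform via explicitly sorted rotations (naive)."""
    if end in text:
        raise ValueError("end marker must not occur in the text")
    s = text + end
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    # The output is the last column of the sorted rotation table.
    return "".join(rotation[-1] for rotation in rotations)

def inverse_bwt(last_column: str, end: str = "\0") -> str:
    """Invert the transform by rebuilding the rotation table column by column."""
    n = len(last_column)
    table = [""] * n
    for _ in range(n):
        # Prepend the last column to every row, then re-sort.
        table = sorted(last_column[i] + table[i] for i in range(n))
    original = next(row for row in table if row.endswith(end))
    return original[:-1]

transformed = bwt("banana", end="$")
print(transformed)                         # equal letters end up next to each other
print(inverse_bwt(transformed, end="$"))
```

The output has the same length as the input plus the marker, so nothing is compressed yet; the gain comes from the later run-length and entropy coding stages.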
If I had to code a compression technique from scratch, for simplicity, I'd probably go with LZW or LZ77.
Shannon-Fano coding, Huffman coding, Arithmetic coding, Range coding, and Asymmetric Numeral System coding are all zero-order entropy coders applied after you have first modeled your data, taking advantage of the inherent redundancy.
For text, that redundancy is repeated strings and higher-order correlations in the data. There are several ways to model text. The most common are Lempel-Ziv 77, which looks for matching strings, the Burrows-Wheeler Transform (look it up for a description), and prediction by partial matching.
Look to the Large Text Compression Benchmark to see comparisons in compression, compression speed, memory used, and decompression speed.
I am looking for a compression algorithm (for a programming competition) and I need a full description of how to implement it (all technical details). Any lossless and patent-free algorithm will do, but ease of implementation is a bonus :)
(Although possibly irrelevant) I plan to implement the algorithm in C++...
Thanks in advance.
EDIT:
I will be compressing text files only, no other file types...
Well, I can't go so far as to complete the competition for you, but please check out this article on wiki: Run Length Encoding. It is by far one of the simplest ways to compress data, albeit not always an efficient one. Compression is also domain specific, even amongst lossless algorithms you will find that what you are compressing determines how best to encode it.
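To show how little code run-length encoding needs, here is a minimal byte-oriented sketch in Python. The (count, value) pair format and the 255 cap on run length are choices made for this example, not a standard.

```python
def rle_encode(data: bytes) -> bytes:
    """Encode data as (run length, byte value) pairs; runs are capped at 255."""
    out = bytearray()
    i = 0
    while i < len(data):
        run = 1
        while i + run < len(data) and data[i + run] == data[i] and run < 255:
            run += 1
        out += bytes([run, data[i]])
        i += run
    return bytes(out)

def rle_decode(packed: bytes) -> bytes:
    """Expand (run length, byte value) pairs back into the original bytes."""
    out = bytearray()
    for k in range(0, len(packed), 2):
        out += bytes([packed[k + 1]]) * packed[k]
    return bytes(out)

print(rle_encode(b"aaaabbbcc"))   # three pairs: 4 x 'a', 3 x 'b', 2 x 'c'
print(rle_encode(b"abc"))         # no runs: the output is twice as long as the input
```

The second call shows the inefficiency mentioned above: on data without runs, such as ordinary text, this format doubles the size.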
RFC 1951 describes inflate/deflate, including a brief description of the compressor's algorithm. Antaeus Feldspar's An Explanation of the Deflate Algorithm provides a bit more background.
Also, the zlib source distribution contains a simplified reference inflater in contrib/puff/puff.c that can be helpful reading to understand exactly how the bits are arranged (but it doesn't contain a deflate, only inflate).
I'd start here on Wikipedia.
There's a whole lot to choose from, but without knowing more about what you want it's difficult to help more. Are you compressing text, images, video or just random files? Each one has its own set of techniques and challenges for optimal results.
If ease of implementation is the sole criterion I'd use "filecopy" compression. Guaranteed compression ratio of exactly 1:1, and trivial implementation...
Huffman is good if you're compressing plain text. And all the commenters below assure me it's a joy to implement ;D
Ease of implementation: Huffman, as stated before. I believe LZW is no longer under patent, but I don't know for sure. It's a relatively simple algorithm. LZ77 should be available, though. Lastly, the Burrows-Wheeler transform allows for compression, but it's significantly more difficult to implement.
I like this introduction to the Burrows-Wheeler Transform.
If you go under "View" in your internet browser, there should be an option to either "Zoom Out" or make the text smaller.
Select one of those and...
BAM!
You just got more text on the same screen! Yay compression!
The Security Now! podcast recently put out an episode highlighting data compression algorithms. Steve Gibson gives a pretty good explanation of the basics of Huffman and Lempel-Ziv compression techniques. You can listen to the audio podcast or read the transcript for Episode 205.
In honor of the Hutter Prize,
what are the top algorithms (and a quick description of each) for text compression?
Note: The intent of this question is to get a description of compression algorithms, not of compression programs.
The boundary-pushing compressors combine algorithms for insane results. Common algorithms include:
The Burrows-Wheeler Transform - shuffles characters (or other bit blocks) with a predictable algorithm to increase repeated blocks, which makes the source easier to compress. Decompression occurs as normal and the result is un-shuffled with the reverse transform. Note: BWT alone doesn't actually compress anything. It just makes the source easier to compress.
Prediction by Partial Matching (PPM) - an evolution of arithmetic coding where the prediction model (context) is created by crunching statistics about the source versus using static probabilities. Even though its roots are in arithmetic coding, the result can be represented with Huffman encoding or a dictionary as well as arithmetic coding.
Context Mixing - Arithmetic coding uses a static context for prediction, PPM dynamically chooses a single context, Context Mixing uses many contexts and weighs their results. PAQ uses context mixing. Here's a high-level overview.
Dynamic Markov Compression - related to PPM but uses bit-level contexts versus byte or longer.
In addition, the Hutter prize contestants may replace common text with small-byte entries from external dictionaries and differentiate upper and lower case text with a special symbol versus using two distinct entries. That's why they're so good at compressing text (especially ASCII text) and not as valuable for general compression.
Maximum Compression is a pretty cool text and general compression benchmark site. Matt Mahoney publishes another benchmark. Mahoney's may be of particular interest because it lists the primary algorithm used per entry.
There's always lzip.
All kidding aside:
Where compatibility is a concern, PKZIP (DEFLATE algorithm) still wins.
bzip2 is the best compromise between enjoying a relatively broad install base and achieving a rather good compression ratio, but it requires a separate archiver.
7-Zip (LZMA algorithm) compresses very well and is available for under the LGPL. Few operating systems ship with built-in support, however.
rzip is a variant of bzip2 that in my opinion deserves more attention. It could be particularly interesting for huge log files that need long-term archiving. It also requires a separate archiver.
If you want to use PAQ as a program, you can install the zpaq package on Debian-based systems. Usage is as follows (see also man zpaq):
zpaq c archivename.zpaq file1 file2 file3
Compression was to about 1/10th of a zip file's size. (1.9M vs 15M)
What is the best compression algorithm with the following features:
should take less time to decompress (can take reasonably more time to compress)
should be able to compress sorted data (approx list of 3,000,000 strings/integers ...)
Please suggest along with metrics: compression ratio, algorithmic complexity for compression and decompression (if possible)?
Entire site devoted to compression benchmarking here
Well, if you just want speed, then standard ZIP compression is just fine, and it's most likely integrated into your language/framework already (e.g. .NET has it, Java has it). Sometimes the most universal solution is the best: ZIP is a very mature format, and any ZIP library or application will work with any other.
But if you want better compression, I'd suggest 7-Zip as the author is very smart, easy to get ahold of and encourages people to use the format.
Providing you with compression times is impossible, as it's directly related to your hardware. If you want a benchmark, you have to do it yourself.
You don't have to worry about decompression time. The extra time spent at a higher compression level goes mostly into finding the longest matching pattern.
Decompression either
1) Writes the literal
2) for (backward position, length)=(m,n) pair,
goes back, in the output buffer, m bytes,
reads n bytes and
writes n bytes at the end of the buffer.
So the decompression time is independent of the compression level. And, from my experience with the Universal Decompression Virtual Machine (RFC 3320), I guess the same is true for any decompression algorithm.
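The two decompression steps listed above can be sketched in a few lines of Python. The token format here is an assumption made for illustration (an int is a literal byte, a tuple is a (backward position, length) pair) and does not correspond to any particular file format.

```python
def lz_decode(tokens) -> bytes:
    """Decode a stream of literals (int) and (distance, length) back-references."""
    out = bytearray()
    for token in tokens:
        if isinstance(token, int):
            out.append(token)                  # 1) write the literal
        else:
            distance, length = token           # 2) copy from earlier output
            if distance <= 0 or distance > len(out):
                raise ValueError("back-reference points outside the buffer")
            start = len(out) - distance
            # Copy byte by byte so a reference may overlap the bytes it produces.
            for k in range(length):
                out.append(out[start + k])
    return bytes(out)

print(lz_decode([ord("a"), ord("b"), (2, 4)]))   # b'ababab'
print(lz_decode([ord("a"), (1, 5)]))             # b'aaaaaa'
```

The loop does a fixed amount of work per output byte and never searches for matches, which is why the effort the compressor put into finding them does not show up here.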
This is an interesting question.
On such sorted data of strings and integers, I would expect difference (delta) coding approaches to outperform any out-of-the-box text compression approach such as LZ77 or LZ78 in terms of compression ratio. General-purpose encoders do not exploit the special properties of the data.
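For the integer case, the idea can be sketched as follows in Python: store the gaps between consecutive sorted values instead of the values themselves, then write each gap with a variable-length byte encoding so that small gaps take one byte. The function names and the particular varint format (7 data bits per byte, high bit as a continuation flag) are choices made for this example.

```python
def delta_encode(sorted_values: list[int]) -> list[int]:
    """Replace each value by its difference from the previous one."""
    deltas = []
    previous = 0
    for value in sorted_values:
        deltas.append(value - previous)
        previous = value
    return deltas

def delta_decode(deltas: list[int]) -> list[int]:
    """Rebuild the original values with a running sum."""
    values = []
    total = 0
    for delta in deltas:
        total += delta
        values.append(total)
    return values

def pack_varints(numbers: list[int]) -> bytes:
    """Encode non-negative ints with 7 bits per byte, low-order group first."""
    out = bytearray()
    for n in numbers:
        if n < 0:
            raise ValueError("varints here only support non-negative numbers")
        while n >= 0x80:
            out.append((n & 0x7F) | 0x80)   # high bit set: more bytes follow
            n >>= 7
        out.append(n)
    return bytes(out)

values = [100, 103, 110, 110, 125]
deltas = delta_encode(values)
packed = pack_varints(deltas)
print(deltas, len(packed), "bytes")
```

The packed gaps can still be passed through a general-purpose compressor afterwards. For sorted strings, a comparable trick is to store only the part of each string that differs from its predecessor.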