What is the current state of text-only compression algorithms?

In honor of the Hutter Prize,
what are the top algorithms (and a quick description of each) for text compression?
Note: The intent of this question is to get a description of compression algorithms, not of compression programs.

The boundary-pushing compressors combine algorithms for insane results. Common algorithms include:
The Burrows-Wheeler Transform - reorders the characters (or other blocks) of the input with a reversible permutation that tends to group identical symbols into runs, which makes the source easier to compress. Decompression runs as normal, and the result is then un-shuffled with the inverse transform. Note: BWT alone doesn't actually compress anything. It just makes the source easier to compress.
Prediction by Partial Matching (PPM) - an evolution of arithmetic coding where the prediction model (context) is built by gathering statistics about the source rather than using static probabilities. Even though its roots are in arithmetic coding, the result can be represented with Huffman encoding or a dictionary as well as arithmetic coding.
Context Mixing - arithmetic coding uses a static context for prediction, and PPM dynamically chooses a single context; Context Mixing uses many contexts and weights their results. PAQ uses context mixing.
Dynamic Markov Compression - related to PPM but uses bit-level contexts versus byte or longer.
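To make the Burrows-Wheeler Transform from the list above concrete, here is a minimal sketch. It is the naive textbook construction (sort all rotations, keep the last column), not how production compressors implement it, and the sentinel character and the function names are assumptions chosen for illustration.

```python
def bwt(text: str, eos: str = "\0") -> str:
    """Naive Burrows-Wheeler transform: sort all rotations, keep the last column."""
    # Assumption: the sentinel `eos` never occurs in the input.
    assert eos not in text, "sentinel must not occur in the input"
    s = text + eos
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)


def inverse_bwt(last_column: str, eos: str = "\0") -> str:
    """Undo the transform by repeatedly prepending the last column and sorting."""
    table = [""] * len(last_column)
    for _ in range(len(last_column)):
        table = sorted(ch + row for ch, row in zip(last_column, table))
    # Exactly one row ends with the sentinel: the original text plus sentinel.
    original = next(row for row in table if row.endswith(eos))
    return original[:-1]


transformed = bwt("banana", "$")
print(transformed)                    # identical letters end up next to each other
print(inverse_bwt(transformed, "$"))  # recovers the input
```

The output is the same length as the input plus one sentinel, which is why the transform needs a later stage (run-length coding, an entropy coder, and so on) to actually save space.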
In addition, the Hutter prize contestants may replace common text with small-byte entries from external dictionaries and differentiate upper and lower case text with a special symbol versus using two distinct entries. That's why they're so good at compressing text (especially ASCII text) and not as valuable for general compression.
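The capitalization trick can be sketched roughly as follows. This is only an illustration of the idea, not any particular contestant's scheme: the escape symbol `CAP`, the function names, and the restriction to ASCII letters are all assumptions.

```python
CAP = "\x01"  # hypothetical escape symbol; assumed never to occur in the input


def fold_case(text: str) -> str:
    """Replace each ASCII capital with CAP followed by its lowercase form."""
    out = []
    for ch in text:
        if "A" <= ch <= "Z":
            out.append(CAP + ch.lower())
        else:
            out.append(ch)
    return "".join(out)


def unfold_case(folded: str) -> str:
    """Invert fold_case: CAP means 'capitalize the next character'."""
    out = []
    chars = iter(folded)
    for ch in chars:
        if ch == CAP:
            out.append(next(chars).upper())
        else:
            out.append(ch)
    return "".join(out)


print(repr(fold_case("The Hutter Prize")))
```

After folding, "The" and "the" share the same letters, so a later modeling stage sees one word instead of two distinct ones.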
Maximum Compression is a pretty cool text and general compression benchmark site. Matt Mahoney publishes another benchmark. Mahoney's may be of particular interest because it lists the primary algorithm used per entry.

There's always lzip.
All kidding aside:
Where compatibility is a concern, PKZIP (DEFLATE algorithm) still wins.
bzip2 is the best compromise between a relatively broad install base and a rather good compression ratio, but requires a separate archiver.
7-Zip (LZMA algorithm) compresses very well and is available for under the LGPL. Few operating systems ship with built-in support, however.
rzip is a variant of bzip2 that in my opinion deserves more attention. It could be particularly interesting for huge log files that need long-term archiving. It also requires a separate archiver.

If you want to use PAQ as a program, you can install the zpaq package on Debian-based systems. Usage is as follows (see also man zpaq):
zpaq c archivename.zpaq file1 file2 file3
The result was roughly one-eighth the size of the equivalent zip file (1.9M vs 15M).

Related

Why combine Huffman and LZ77?

I'm reverse engineering a Game Boy Advance game, and I noticed that the original developers wrote code that makes two system calls to decompress a level, using Huffman and LZ77 (in this order).
But why use Huffman + LZ77? What's the advantage of this approach?

Using available libraries
It's possible that the developers are using DEFLATE (or some similar algorithm), simply to be able to re-use tested and debugged software rather than writing something new from scratch (and taking who-knows-how-long to test and fix all the quirky edge cases).
Why both Huffman and LZ77?
But why does DEFLATE, Zstandard, LZHAM, LZHUF, LZH, etc. use both Huffman and LZ77?
Because these 2 algorithms detect and remove 2 different kinds of redundancy common to many data files (video game levels, English and other natural-language text, etc.), and they can be combined to get better net compression than either one alone.
(Unfortunately, most data compression algorithms cannot be combined like this).
details
In English, the 2 most common letters are (usually) 'e' and then 't'.
So what is the most common pair? You might guess "ee", "et", or "te" -- nope, it's "th".
LZ77 is good at detecting and compressing these kinds of common words and syllables that occur far more often than you might guess from the letter frequencies alone.
Letter-oriented Huffman is good at detecting and compressing files using the letter frequencies alone, but it cannot detect correlations between consecutive letters (common words and syllables).
LZ77 compresses an original file into an intermediate sequence of literal letters and "copy items".
Then Huffman further compresses that intermediate sequence.
Often those "copy items" are already much shorter than the original substring would have been if we had skipped the LZ77 step and simply Huffman compressed the original file.
And Huffman does just as well compressing the literal letters in the intermediate sequence as it would have done compressing those same letters in the original file.
So already this 2-step process creates smaller files than either algorithm alone.
As a bonus, typically the copy items are also Huffman compressed for more savings in storage space.
In general, most data compression software is made up of these 2 parts.
First they run the original data through a "transformation" or multiple transformations, also called "decorrelators", typically highly tuned to the particular kind of redundancy in the particular kind of data being compressed (JPEG's DCT transform, MPEG's motion-compensation, etc.) or tuned to the limitations of human perception (MP3's auditory masking, etc.).
Next they run the intermediate data through a single "entropy coder" (arithmetic coding, or Huffman coding, or asymmetric numeral system coding) that's pretty much the same for every kind of data.
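One way to see the benefit of LZ77 plus Huffman over Huffman alone is to ask zlib (a DEFLATE implementation) to skip string matching. The sketch below uses Python's standard zlib module with its Z_HUFFMAN_ONLY strategy; the sample text and the helper name `deflate_size` are assumptions for illustration, and real data will give different numbers.

```python
import zlib


def deflate_size(data: bytes, strategy: int) -> int:
    """Compressed size in bytes using DEFLATE with the given zlib strategy."""
    compressor = zlib.compressobj(
        9, zlib.DEFLATED, zlib.MAX_WBITS, zlib.DEF_MEM_LEVEL, strategy
    )
    return len(compressor.compress(data) + compressor.flush())


# Hypothetical, highly repetitive sample text.
sample = b"the quick brown fox jumps over the lazy dog. " * 200

huffman_only = deflate_size(sample, zlib.Z_HUFFMAN_ONLY)          # no LZ77 matches
lz77_plus_huffman = deflate_size(sample, zlib.Z_DEFAULT_STRATEGY)  # both stages

print("original:         ", len(sample))
print("Huffman only:     ", huffman_only)
print("LZ77 then Huffman:", lz77_plus_huffman)
```

Huffman alone can only exploit the skewed letter frequencies, while the combined mode also replaces repeated phrases with short copy items before entropy coding them.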

Compression ratio of LZW, LZ77 and other easy-to-implement algorithms

I want to compress .txt files that contain dates in yyyy-mm-dd hh:mm:ss format and English words that tend to be repeated across lines.
I read some articles about compression algorithms and found out that in my case dictionary-based encoding is better than entropy-based encoding. Since I want to implement the algorithm myself, I need something that isn't very complicated. So I looked at LZW and LZ77, but I can't choose between them, because the conclusions of the articles I found are contradictory: according to some, LZW has the better compression ratio, and according to others, LZ77 leads. So the question is: which one is most likely to be better in my case? Are there other easy-to-implement algorithms that would suit my purpose?

LZW is obsolete. Modern, and even pretty old, LZ77 compressors outperform LZW.
In any case, you are the only one who can answer your question, since only you have examples of the data you want to compress. Simply experiment with various compression methods (zstd, xz, lz4, etc.) on your data and see what combination of compression ratio and speed meets your needs.
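A small harness for that kind of experiment might look like the following. It is only a sketch: it uses the codecs that ship with Python (zlib, bz2, lzma) as stand-ins, since zstd and lz4 need third-party packages, and `make_sample` generates hypothetical log-like text that you would replace with your real files.

```python
import bz2
import lzma
import time
import zlib
from datetime import datetime, timedelta


def make_sample(n_lines: int = 2000) -> bytes:
    """Build hypothetical log-like text: a timestamp plus a few repeated words."""
    words = ["error", "warning", "connection", "timeout", "retry", "ok"]
    start = datetime(2020, 1, 1)
    lines = []
    for i in range(n_lines):
        stamp = (start + timedelta(seconds=37 * i)).strftime("%Y-%m-%d %H:%M:%S")
        lines.append(f"{stamp} {words[i % 6]} {words[(i * 5) % 6]}")
    return "\n".join(lines).encode("ascii")


CODECS = {
    "zlib": (zlib.compress, zlib.decompress),
    "bz2": (bz2.compress, bz2.decompress),
    "lzma": (lzma.compress, lzma.decompress),
}


def compare(data: bytes) -> dict:
    """Return {codec name: (compressed size, seconds spent compressing)}."""
    results = {}
    for name, (compress, decompress) in CODECS.items():
        started = time.perf_counter()
        packed = compress(data)
        elapsed = time.perf_counter() - started
        assert decompress(packed) == data  # sanity check: lossless round trip
        results[name] = (len(packed), elapsed)
    return results


sample = make_sample()
for name, (size, seconds) in compare(sample).items():
    ratio = len(sample) / size
    print(f"{name:5s} {size:7d} bytes  ratio {ratio:5.1f}  {seconds * 1000:.1f} ms")
```

Timings depend on the machine and the data, so treat the output as a relative comparison only.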

Which algorithm is most suitable for large text compression?

Currently, I am looking for a lossless compression algorithm suitable for a large amount of text, which will then be encrypted with AES and used as the payload in steganography.
EDIT:
Based on A Comparative Study Of Text Compression Algorithms, it seems that
Arithmetic coding is preferable in Statistical compression techniques, while LZB is recommended for Dictionary compression techniques.
So now I am wondering whether Statistical compression or Dictionary compression is more suitable for large English text compression in terms of compression ratio and ease of implementation.
I have searched around but still barely have an idea of which algorithm is suitable. Thank you very much for your time in answering. Have a nice day. :)

A lot of the algorithms that you are describing in this question are called entropy coders (Shannon-Fano, Huffman, arithmetic, etc.). Entropy coders are used to compress sequences of symbols (often bytes), where some symbols are much more frequent than others. Simple entropy coding of symbols (letters) for compressing natural language will only yield about a 2:1 compression.
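The rough 2:1 figure can be checked on your own text by measuring the zero-order entropy, which is the best any symbol-by-symbol entropy coder can achieve without modeling context. A minimal sketch (the function name and the sample sentence are assumptions for illustration):

```python
import math
from collections import Counter


def zero_order_entropy(data: bytes) -> float:
    """Shannon entropy in bits per byte, ignoring any context between bytes."""
    counts = Counter(data)
    total = len(data)
    return -sum(n / total * math.log2(n / total) for n in counts.values())


text = b"it was the best of times, it was the worst of times"
bits_per_byte = zero_order_entropy(text)
print(f"{bits_per_byte:.2f} bits per byte")
print(f"best possible ratio for a pure entropy coder: {8 / bits_per_byte:.2f}:1")
```

Dictionary and context-based methods beat this bound because they exploit correlations between neighbouring symbols, which this measurement ignores.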
Instead, popular modern lossless compression techniques for text include methods like LZ77, LZW, and BWT. Loosely speaking, the LZ family involves building up a dictionary of recurring short symbol sequences (we'll call them "words") and then uses pointers to reference those words. Some of the implementations of LZ like LZ77 and LZW can be fairly simple to code up but probably do not yield the highest compression ratios. See for example this video: https://www.youtube.com/watch?v=j2HSd3HCpDs. On the other end of the spectrum, LZMA2 is a relatively more complicated variant with a higher compression ratio.
The Burrows-Wheeler transform (BWT) provides a clever alternative to the dictionary methods. I'll refer you to the Wikipedia article, https://en.wikipedia.org/wiki/Burrows%E2%80%93Wheeler_transform
In a nutshell, though, it produces an invertible permutation of the original sequence of bytes that can often be compressed very effectively by run-length encoding followed by an entropy coder.
If I had to code a compression technique from scratch, for simplicity, I'd probably go with LZW or LZ77.

Shannon-Fano coding, Huffman coding, Arithmetic coding, Range coding, and Asymmetric Numeral System coding are all zero-order entropy coders applied after you have first modeled your data, taking advantage of the inherent redundancy.
For text, that redundancy is repeated strings and higher-order correlations in the data. There are several ways to model text. The most common are Lempel-Ziv 77, which looks for matching strings, the Burrows-Wheeler Transform (look it up for a description), and prediction by partial matching.
Look to the Large Text Compression Benchmark to see comparisons in compression, compression speed, memory used, and decompression speed.

Data Compression Algorithms

I was wondering if anyone has a list of data compression algorithms. I know basically nothing about data compression and I was hoping to learn more about different algorithms and see which ones are the newest and have yet to be developed on a lot of ASICs.
I'm hoping to implement a data compression ASIC which is independent of the type of data coming in (audio,video,images,etc.)
If my question is too open ended, please let me know and I'll revise. Thank you

There are a ton of compression algorithms out there. What you need here is a lossless compression algorithm. A lossless compression algorithm compresses data such that it can be decompressed to achieve exactly what was given before compression. The opposite would be a lossy compression algorithm. Lossy compression can remove data from a file. PNG images use lossless compression while JPEG images can and often do use lossy compression.
Some of the most widely known compression algorithms include:
RLE
Huffman
LZ77
ZIP archives use a combination of Huffman coding and LZ77 to give fast compression and decompression times and reasonably good compression ratios.
LZ77 is pretty much a generalized form of RLE and it will often yield much better results.
Huffman coding lets the most frequent bytes be represented by the fewest bits.
Imagine a text file that looked like this:
aaaaaaaabbbbbcccdd
A typical implementation of Huffman would result in the following map:
Bits Character
0 a
10 b
110 c
111 d
So the file would be compressed to this:
00000000 10101010 10110110 11011111 10000000
                                     ^^^^^^^
Padding bits required
18 bytes go down to 5. Of course, the table must be included in the file. This algorithm works better with more data :P
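For anyone who wants to reproduce the byte count, here is a small sketch of the standard Huffman construction using a priority queue. It computes only the code length of each symbol (the exact bit patterns depend on tie-breaking), and the function name is an assumption for illustration. For this input it assigns 1, 2, 3 and 3 bits to a, b, c and d, for 33 bits in total, which rounds up to 5 bytes.

```python
import heapq
from collections import Counter


def huffman_code_lengths(text: str) -> dict:
    """Return {symbol: code length in bits} for a Huffman code of `text`."""
    counts = Counter(text)
    if len(counts) == 1:
        # A single distinct symbol still needs one bit per occurrence.
        return {next(iter(counts)): 1}
    # Heap entries are (weight, tie_breaker, {symbol: depth so far}).
    heap = [(n, i, {sym: 0}) for i, (sym, n) in enumerate(sorted(counts.items()))]
    heapq.heapify(heap)
    tie_breaker = len(heap)
    while len(heap) > 1:
        weight_a, _, left = heapq.heappop(heap)
        weight_b, _, right = heapq.heappop(heap)
        # Merging two subtrees pushes every symbol in them one level deeper.
        merged = {sym: depth + 1 for sym, depth in {**left, **right}.items()}
        heapq.heappush(heap, (weight_a + weight_b, tie_breaker, merged))
        tie_breaker += 1
    return heap[0][2]


text = "aaaaaaaabbbbbcccdd"
lengths = huffman_code_lengths(text)
total_bits = sum(lengths[ch] for ch in text)
print(lengths)
print(total_bits, "bits ->", (total_bits + 7) // 8, "bytes")
```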
Alex Allain has a nice article on the Huffman Compression Algorithm in case the Wiki doesn't suffice.
Feel free to ask for more information. This topic is pretty darn wide.

My paper A Survey Of Architectural Approaches for Data Compression in Cache and Main Memory Systems (permalink here) reviews many compression algorithms and also techniques for using them in modern processors. It reviews both research-grade and commercial-grade compression algorithms/techniques, so you may find one which has not yet been implemented in ASIC.

Here are some lossless algorithms (can perfectly recover the original data using these):
Huffman code
LZ78 (and LZW variation)
LZ77
Arithmetic coding
Sequitur
prediction with partial match (ppm)
Many of the well known formats like png or gif use variants or combinations of these.
On the other hand there are lossy algorithms too (compromise accuracy to compress your data, but often works pretty well). State of the art lossy techniques combine ideas from differential coding, quantization, and DCT, among others.
To learn more about data compression, I recommend https://www.elsevier.com/books/introduction-to-data-compression/sayood/978-0-12-809474-7. It is a very accessible introductory text. The 3rd edition is available online as a PDF.

There are an awful lot of data compression algorithms around. If you're looking for something encyclopedic, I recommend the Handbook of Data Compression by Salomon et al, which is about as comprehensive as you're likely to get (and has good sections on the principles and practice of data compression, as well).
My best guess is that ASIC-based compression is usually implemented for a particular application, or as a specialized element of a SoC, rather than as a stand-alone compression chip. I also doubt that looking for a "latest and greatest" compression format is the way to go here -- I would expect standardization, maturity, and fitness for a particular purpose to be more important.

LZW (Lempel-Ziv-Welch) is a great lossless algorithm. Pseudocode here: http://oldwww.rasip.fer.hr/research/compress/algorithms/fund/lz/lzw.html
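As a rough illustration of how little code LZW needs, here is a minimal sketch. It is not taken from the linked pseudocode: it emits a plain list of integer codes, lets the dictionary grow without limit, and leaves out the bit-packing and dictionary resets that real implementations add. The function names are assumptions.

```python
from typing import List


def lzw_compress(data: bytes) -> List[int]:
    """Encode bytes as a list of dictionary codes (0-255 are single bytes)."""
    table = {bytes([i]): i for i in range(256)}
    current = b""
    codes = []
    for byte in data:
        candidate = current + bytes([byte])
        if candidate in table:
            current = candidate          # keep extending the current match
        else:
            codes.append(table[current])
            table[candidate] = len(table)  # learn the new, longer string
            current = bytes([byte])
    if current:
        codes.append(table[current])
    return codes


def lzw_decompress(codes: List[int]) -> bytes:
    """Rebuild the same dictionary on the fly while decoding."""
    if not codes:
        return b""
    table = {i: bytes([i]) for i in range(256)}
    previous = table[codes[0]]
    out = bytearray(previous)
    for code in codes[1:]:
        if code in table:
            entry = table[code]
        elif code == len(table):
            # The code refers to the entry that is about to be created.
            entry = previous + previous[:1]
        else:
            raise ValueError("invalid LZW code")
        out += entry
        table[len(table)] = previous + entry[:1]
        previous = entry
    return bytes(out)


message = b"TOBEORNOTTOBEORTOBEORNOT"
codes = lzw_compress(message)
print(len(message), "bytes ->", len(codes), "codes")
```

Note that the decoder never receives the dictionary: it reconstructs it from the code stream, which is the main appeal of the scheme.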

best compression algorithm with the following features

What is the best compression algorithm with the following features:
should take less time to decompress (can take reasonably more time to compress)
should be able to compress sorted data (approx list of 3,000,000 strings/integers ...)
Please suggest along with metrics: compression ratio, algorithmic complexity for compression and decompression (if possible)?

Entire site devoted to compression benchmarking here

Well, if you just want speed, then standard ZIP compression is just fine, and it's most likely integrated into your language/framework already (e.g. .NET has it, Java has it). Sometimes the most universal solution is the best: ZIP is a very mature format, and any ZIP library or application will work with any other.
But if you want better compression, I'd suggest 7-Zip as the author is very smart, easy to get ahold of and encourages people to use the format.
Providing you with compression times is impossible, as it's directly related to your hardware. If you want a benchmark, you have to do it yourself.

You don't have to worry about decompression time. The extra time spent at a higher compression level goes mostly into finding the longest matching pattern.
Decompression either
1) writes the literal, or
2) for a (backward position, length) = (m, n) pair,
   goes back m bytes in the output buffer,
   reads n bytes, and
   writes those n bytes at the end of the buffer.
So the decompression time is independent of the compression level. And, from my experience with the Universal Decompression Virtual Machine (RFC3320), I guess the same is true for any decompression algorithm.
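The two decoding cases can be sketched in a few lines. The token representation here (an int for a literal byte, a tuple for a back-reference) and the function name are assumptions for illustration; real formats such as DEFLATE pack these into a bit stream.

```python
from typing import Iterable, Tuple, Union

# A token is either a literal byte value or a (distance back, length) pair.
Token = Union[int, Tuple[int, int]]


def lz77_decode(tokens: Iterable[Token]) -> bytes:
    out = bytearray()
    for token in tokens:
        if isinstance(token, int):
            out.append(token)  # case 1: write the literal
        else:
            distance, length = token  # case 2: copy from earlier output
            if not 0 < distance <= len(out):
                raise ValueError("back-reference points outside the buffer")
            start = len(out) - distance
            # Copy byte by byte so a copy may overlap the bytes it is writing.
            for i in range(length):
                out.append(out[start + i])
    return bytes(out)


print(lz77_decode([ord("a"), ord("b"), (2, 4)]))
```

The decoder does a fixed amount of work per output byte regardless of how hard the encoder searched for matches, which is the point made above.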

This is an interesting question.
On such sorted data of strings and integers, I would expect difference coding approaches to outperform any out-of-the-box text compression approach such as LZ77 or LZ78 in terms of compression ratio. General-purpose encoders do not exploit the special properties of the data.
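For the integer case, difference (delta) coding could look like the sketch below. The data is synthetic (seeded random integers), the variable-length byte encoding and all function names are assumptions for illustration, and zlib is used only as a convenient general-purpose back end for the comparison.

```python
import random
import zlib
from typing import Iterable, List


def delta_encode(sorted_values: Iterable[int]) -> List[int]:
    """Store each value as its difference from the previous one."""
    deltas, previous = [], 0
    for value in sorted_values:
        deltas.append(value - previous)
        previous = value
    return deltas


def delta_decode(deltas: Iterable[int]) -> List[int]:
    """Running sum restores the original values."""
    values, total = [], 0
    for delta in deltas:
        total += delta
        values.append(total)
    return values


def varint(n: int) -> bytes:
    """Non-negative integer as 7 bits per byte; high bit marks 'more follows'."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)


# Hypothetical data set: 50,000 sorted integers below ten million.
values = sorted(random.Random(0).sample(range(10_000_000), 50_000))
raw = b"".join(v.to_bytes(4, "big") for v in values)
deltas = delta_encode(values)
packed_deltas = b"".join(varint(d) for d in deltas)

raw_size = len(zlib.compress(raw, 9))
delta_size = len(zlib.compress(packed_deltas, 9))
print("zlib on raw 4-byte integers:", raw_size)
print("zlib on varint deltas:      ", delta_size)
```

Decoding is a single pass with one addition per value, so this also fits the requirement that decompression be fast. Sorted strings would need a different scheme, such as storing the length of the prefix shared with the previous string.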
