Is it possible to compress text using natural language processing?

I was thinking about compressing large blocks of text using the most frequent English words, but now I doubt it would be efficient, since LZW seems to achieve just this in a better way.
Still, I can't shake the feeling that compressing characters one by one is a little "brutal", since one could analyze the structure of sentences to better organize the text into smaller chunks of data and, if the structure is not exactly the same when decompressed, classic compression methods could be used.
Does "basic" NLP allow that?

NLP?
Standard compression techniques can be applied to words instead of characters. These techniques would assign probabilities to what the next word is, based on the preceding words. I have not seen this in practice though, since there are so many more words than characters, resulting in prohibitive memory usage and excessive execution time for even low-order models.
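
As a rough illustration of the idea above, here is a minimal sketch of an order-1 word model in Python. The function names and the sample sentence are made up for this example. It only builds the probability table an entropy coder would consume; it is not a compressor, and it ignores the problem of words never seen before.

```python
import math
from collections import Counter, defaultdict

def train_word_model(words):
    """Count, for every word, which words follow it (an order-1 model)."""
    model = defaultdict(Counter)
    for prev, nxt in zip(words, words[1:]):
        model[prev][nxt] += 1
    return model

def next_word_probability(model, prev, nxt):
    """Estimated probability that `nxt` follows `prev`."""
    counts = model.get(prev, Counter())
    total = sum(counts.values())
    return counts[nxt] / total if total else 0.0

words = "the cat sat on the mat the cat ran".split()
model = train_word_model(words)

p = next_word_probability(model, "the", "cat")
# An ideal entropy coder would spend about -log2(p) bits on this word.
print(f"P(cat | the) = {p:.3f}, ideal cost = {-math.log2(p):.3f} bits")
print("contexts stored:", len(model))
```

The table needs one entry per distinct (previous word, next word) pair, which is where the memory cost mentioned above comes from once the vocabulary runs to tens of thousands of words.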

Related

Why combine Huffman and LZ77?

I'm reverse engineering a Game Boy Advance game, and I noticed that the original developers wrote code that makes two system calls to decompress a level, using Huffman and LZ77 (in this order).
But why use Huffman + LZ77? What's the advantage of this approach?

Using available libraries
It's possible that the developers are using DEFLATE (or some similar algorithm), simply to be able to re-use tested and debugged software rather than writing something new from scratch (and taking who-knows-how-long to test and fix all the quirky edge cases).
Why both Huffman and LZ77?
But why do DEFLATE, Zstandard, LZHAM, LZHUF, LZH, etc. use both Huffman and LZ77?
Because these 2 algorithms detect and remove 2 different kinds of redundancy common to many data files (video game levels, English and other natural-language text, etc.), and they can be combined to get better net compression than either one alone.
(Unfortunately, most data compression algorithms cannot be combined like this).
Details
In English, the 2 most common letters are (usually) 'e' and then 't'.
So what is the most common pair? You might guess "ee", "et", or "te" -- nope, it's "th".
LZ77 is good at detecting and compressing these kinds of common words and syllables that occur far more often than you might guess from the letter frequencies alone.
Letter-oriented Huffman is good at detecting and compressing files using the letter frequencies alone, but it cannot detect correlations between consecutive letters (common words and syllables).
LZ77 compresses an original file into an intermediate sequence of literal letters and "copy items".
Then Huffman further compresses that intermediate sequence.
Often those "copy items" are already much shorter than the original substring would have been if we had skipped the LZ77 step and simply Huffman compressed the original file.
And Huffman does just as well compressing the literal letters in the intermediate sequence as it would have done compressing those same letters in the original file.
So already this 2-step process creates smaller files than either algorithm alone.
As a bonus, typically the copy items are also Huffman compressed for more savings in storage space.
In general, most data compression software is made up of these 2 parts.
First they run the original data through a "transformation" or multiple transformations, also called "decorrelators", typically highly tuned to the particular kind of redundancy in the particular kind of data being compressed (JPEG's DCT transform, MPEG's motion-compensation, etc.) or tuned to the limitations of human perception (MP3's auditory masking, etc.).
Next they run the intermediate data through a single "entropy coder" (arithmetic coding, or Huffman coding, or asymmetric numeral system coding) that's pretty much the same for every kind of data.
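
One way to see part of this effect is with Python's zlib bindings. This is a sketch of my own, not the GBA developers' method. The `Z_HUFFMAN_ONLY` strategy turns off string matching, so the same data can be compressed with Huffman coding alone and with LZ77 plus Huffman. zlib has no LZ77-only mode, so this shows only one side of the comparison, and the sample text is deliberately repetitive.

```python
import zlib

def deflate_size(data, strategy):
    """Size of a raw DEFLATE stream produced with the given zlib strategy."""
    # wbits=-15 asks for raw DEFLATE without the zlib header and checksum.
    comp = zlib.compressobj(level=9, wbits=-15, strategy=strategy)
    return len(comp.compress(data) + comp.flush())

# Made-up sample text, repeated so that LZ77 has something to find.
sample = (b"the quick brown fox jumps over the lazy dog while "
          b"the other fox thinks about the weather in the north ") * 40

raw = len(sample)
huffman_only = deflate_size(sample, zlib.Z_HUFFMAN_ONLY)
lz77_plus_huffman = deflate_size(sample, zlib.Z_DEFAULT_STRATEGY)

print("raw bytes:        ", raw)
print("Huffman only:     ", huffman_only)
print("LZ77 + Huffman:   ", lz77_plus_huffman)
```

On real data the gap is smaller than on this artificial sample, but the ordering is usually the same.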

Compressing individual lines of text separately using common phrases in a global dictionary

Is there any open source library or algorithm available to look at what phrases or words are most common among individual lines of text in a file and create a global dictionary that would then be used to compress the lines of text separately? Preferably the code if available would be in C or C++.
I found this question that I think was similar but did not have an answer that meets what I am looking for:
compressing a huge set of similar strings

There are three important things to recognize here.
The value of replacing a word by a code depends on its frequency and its length. Replacing "a" isn't worth a lot, even if it appears very often.
Once you've identified the most common words, phrases can be found by looking for occurrences of two common words appearing side by side. (In most grammars, word repetition is fairly rare.)
However, one of the biggest sources of redundancy in text is how predictable the next letter is: given the preceding text, it typically takes only about 2 bits to encode it. Do you really need word-based compression when letter-based compression is so much easier?
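
To make the first point concrete, here is a small Python sketch that ranks words by an estimate of the bytes saved when each one is replaced by a fixed-size code. The 2-byte code length and the sample text are assumptions for illustration, and the cost of storing the dictionary itself is ignored.

```python
from collections import Counter

def replacement_savings(text, code_len=2):
    """Estimate bytes saved per word if each occurrence becomes a code_len-byte code."""
    counts = Counter(text.split())
    savings = {word: n * (len(word) - code_len) for word, n in counts.items()}
    # Best candidates first.
    return sorted(savings.items(), key=lambda item: item[1], reverse=True)

text = "a compression a dictionary a compression a a dictionary a compression"
ranking = replacement_savings(text)
for word, saved in ranking:
    print(f"{word:12s} {saved:+d} bytes")
```

Here "a" is the most frequent word but ends up last, because a 2-byte code is longer than the word it replaces.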

I did some more searching, and I think I have found my answer.
I came across this page discussing improving compression by using boosters
http://mainroach.blogspot.com/2013/08/boosting-text-compression-with-dense.html
That page provided a link to the research paper
http://www.dcc.uchile.cl/~gnavarro/ps/tcj11.pdf
and also to the source code used to do the compression
http://vios.dc.fi.udc.es/codes/download.html

Yes. zlib, an open source compression library in C, provides the deflateSetDictionary() and inflateSetDictionary() routines for this purpose. You provide up to 32K of seed data in which the compressor will look for matching strings. The same dictionary needs to reside on both ends. If you are compressing lots of small pieces of data with a lot of commonality, this can greatly improve compression. Your "lines of text" certainly qualify as small pieces of data.
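
For a quick experiment before writing the C code, Python's zlib module exposes the same mechanism through the `zdict` argument of `compressobj()` and `decompressobj()`. The sketch below uses a hand-written dictionary and two made-up log lines; a real dictionary would be built from your own data.

```python
import zlib

# Hypothetical shared dictionary: phrases expected to recur across lines.
# The compressing and decompressing sides must hold exactly the same bytes.
DICTIONARY = (b"connection timed out while contacting server "
              b"user logged in from address ")

def compress_line(line, zdict=DICTIONARY):
    """Compress one line on its own, primed with the shared dictionary."""
    comp = zlib.compressobj(level=9, zdict=zdict)
    return comp.compress(line) + comp.flush()

def decompress_line(blob, zdict=DICTIONARY):
    """Reverse compress_line(); fails if the dictionary differs."""
    decomp = zlib.decompressobj(zdict=zdict)
    return decomp.decompress(blob) + decomp.flush()

lines = [
    b"connection timed out while contacting server db-7",
    b"user logged in from address 10.0.0.12",
]
for line in lines:
    packed = compress_line(line)
    plain = zlib.compress(line, 9)
    print(len(line), "bytes ->", len(plain), "without dictionary,",
          len(packed), "with dictionary")
```

Each line is compressed independently, so any single line can be decompressed without touching the others.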

How to compute a good preset dictionary for deflate compression

I have the opportunity to use a preset dictionary for deflate compression. It makes sense in my case, because the data to be compressed is relatively small (1 KB to 3 KB) and I have a large sample of representative examples. The data to be compressed consists of arbitrary sequences of bytes, so tokenization etc. is not a good way to go. Also, the data shows a lot of repetition (between data examples), so a good dictionary could potentially give very good results.
The question is how to calculate a good dictionary. Is there an algorithm that calculates an optimal dictionary (given sample data)?
I started looking at prefix trees, but it is not clear how to use them in this context.
Best regards,
Jarek

I am not aware of an algorithm to generate an optimal or even a good dictionary. This is generally done by hand. I think that a suffix tree would be a good approach to finding common strings for a dictionary, but I have never tried it.
The first thing to try is to simply concatenate 32K worth of your 1-3K examples and see how much gain that provides over no dictionary. Then you mess with it from there, changing the ordering of examples or pulling out repeated pieces in the examples to the end of the dictionary.
Note that the most common strings should be put at the end, since shorter distances take fewer bits.
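
The "first thing to try" can be scripted in a few lines. This Python sketch concatenates sample messages into a dictionary of at most 32K and compares the total compressed size with and without it. The record format is a synthetic stand-in for real samples, and the helper names are made up.

```python
import zlib

MAX_DICT = 32 * 1024  # deflate cannot reference data further back than 32K

def concat_dictionary(samples, limit=MAX_DICT):
    """Join the samples and keep only the trailing `limit` bytes."""
    return b"".join(samples)[-limit:]

def total_compressed_size(messages, zdict=None):
    """Sum of the sizes of each message compressed on its own."""
    total = 0
    for msg in messages:
        if zdict:
            comp = zlib.compressobj(level=9, zdict=zdict)
        else:
            comp = zlib.compressobj(level=9)
        total += len(comp.compress(msg) + comp.flush())
    return total

def make_record(i):
    """Synthetic example message (an assumption, not real data)."""
    return (b'{"type":"position","player":%d,"x":%d,"y":%d,"status":"active"}'
            % (i, i * 7 % 100, i * 13 % 100))

training = [make_record(i) for i in range(200)]
held_out = [make_record(i) for i in range(200, 260)]  # not in the dictionary

dictionary = concat_dictionary(training)
baseline = total_compressed_size(held_out)
with_dict = total_compressed_size(held_out, dictionary)
print("without dictionary:", baseline, "bytes; with dictionary:", with_dict, "bytes")
```

Measuring on messages that are not part of the dictionary gives a fairer estimate of the gain. From there you can reorder the samples, or move the most common pieces to the end, and rerun the same measurement.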

I don't know how good this is, but it's a dictionary creator: https://github.com/vkrasnov/dictator

Why don't we use word ranks for string compression?

I have 3 main questions:
Let's say I have a large text file. (1) Is replacing the words with their rank an effective way to compress the file? (Got an answer to this question. This is a bad idea.)
Also, I have come up with a new compression algorithm. I read about some widely used compression models and found that they use some pretty advanced concepts like statistical redundancy and probabilistic prediction. My algorithm does not use any of these concepts and is a rather simple set of rules to follow while compressing and decompressing. (2) My question is: am I wasting my time trying to come up with a new compression algorithm without having enough knowledge about existing compression schemes?
(3) Furthermore, if I manage to successfully compress a string, can I extend my algorithm to other content like videos, images etc.?
(I understand that the third question is difficult to answer without knowledge about the compression algorithm. But I am afraid the algorithm is so rudimentary and nascent I feel ashamed about sharing it. Please feel free to ignore the third question if you have to)

1. Your question doesn't make sense as it stands (see answer #2), but I'll try to rephrase and you can let me know if I capture your question. Would modeling text using the probability of individual words make for a good text compression algorithm? Answer: No. That would be a zeroth order model, and would not be able to take advantage of higher order correlations, such as the conditional probability of a given word following the previous word. Simple existing text compressors that look for matching strings and varied character probabilities would perform better.
2. Yes, you are wasting your time trying to come up with a new compression algorithm without having enough knowledge about existing compression schemes. You should first learn about the techniques that have been applied over time to model data, textual and others, and the approaches to use the modeled information to compress the data. You need to study what has already been researched for decades before developing a new approach.
3. The compression part may extend, but the modeling part won't.
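
To put a number on the zeroth-order point, the following Python sketch estimates the entropy of the next word with and without knowledge of the previous word. The sample sentence is invented, and on such a tiny sample the order-1 figure is overly optimistic. It only shows the direction of the effect.

```python
import math
from collections import Counter

def entropy(counts):
    """Shannon entropy in bits of the empirical distribution in `counts`."""
    total = sum(counts.values())
    return -sum(c / total * math.log2(c / total) for c in counts.values())

def word_entropies(words):
    """Return (order-0, order-1) entropy estimates in bits per word."""
    pairs = list(zip(words, words[1:]))
    h0 = entropy(Counter(nxt for _, nxt in pairs))
    # Chain rule: H(next | prev) = H(prev, next) - H(prev)
    h1 = entropy(Counter(pairs)) - entropy(Counter(prev for prev, _ in pairs))
    return h0, h1

words = ("the cat sat on the mat and the dog sat on the log "
         "and the cat saw the dog on the mat").split()
h0, h1 = word_entropies(words)
print(f"order-0: {h0:.2f} bits/word, order-1: {h1:.2f} bits/word")
```

The order-1 estimate can never exceed the order-0 one, and the gap is the information a rank-based (zeroth-order) scheme leaves on the table.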

Do you mean like having a ranking table of words sorted by frequency and assign smaller "symbols" to those words that are repeated the most, therefore reducing the amount of information that needs to be transmitted?
That's basically how Huffman coding works. The problem with compression is that you always hit a limit somewhere along the road. Of course, if the data you are trying to compress follows a particular pattern or distribution, it's possible to be really efficient about it, but for general purposes (audio, video, text, or encrypted data that appears to be random) there is no "best" compression technique (and I believe there can't be one).

Huffman coding uses letter frequencies. You can do the same with words, or with letter frequencies in more dimensions, i.e. combinations of letters and their frequencies.
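
As a sketch of what "the same with words" could look like, this Python snippet builds Huffman code lengths over word frequencies instead of letter frequencies. It only computes the code lengths, not the bit strings, and the sample sentence is arbitrary.

```python
import heapq
from collections import Counter
from itertools import count

def huffman_code_lengths(frequencies):
    """Return {symbol: code length in bits} for a Huffman code."""
    if len(frequencies) == 1:
        return {symbol: 1 for symbol in frequencies}
    tiebreak = count()  # keeps heap entries comparable when frequencies tie
    heap = [(freq, next(tiebreak), {symbol: 0})
            for symbol, freq in frequencies.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        # Merge the two least frequent subtrees; their symbols go one level deeper.
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        merged = {s: depth + 1 for s, depth in {**left, **right}.items()}
        heapq.heappush(heap, (f1 + f2, next(tiebreak), merged))
    return heap[0][2]

text = "to be or not to be that is the question to be"
lengths = huffman_code_lengths(Counter(text.split()))
for word, bits in sorted(lengths.items(), key=lambda item: item[1]):
    print(f"{word:10s} {bits} bits")
```

The frequent words "to" and "be" get the shortest codes, but the decoder also needs the word table, which is the overhead a letter-based code avoids.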

Binary parser or serialization?

I want to store a graph of different objects for a game. Their classes may or may not be related, and they may or may not contain vectors of simple structures.
I want the parsing operation to be fast; the data can be pretty big.
Adding new things should not be hard, and it should not break backward compatibility.
A smaller file size is kind of important.
Readability counts.
By serialization I mean making objects serialize themselves, which is effective, but I would need to write different serialization methods for different objects.
By binary parsing/composing I mean creating a new tree of parsers/composers that holds and reads data for these objects, and passing it around so that my objects push/pull their data.
I can also use JSON, but it can be pretty slow to read, and it is not very size-efficient when it comes to pretty big sets of matrices and numbers.

Point by point:
Fast Parsing: binary (since you don't necessarily have to "parse", you can just deserialize)
Adding New Things: text
Smaller: text (even if gzipped text is larger than binary, it won't be much larger).
Readability: text
So that's three votes for text, one for binary. Personally, I'd go with text for everything except images (and other data which is "naturally" binary). Then, store everything in a big zip file (I can think of several games that do this or something close to it).
Good reads: The Importance of Being Textual and Power Of Plain Text.
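
If you want to measure the trade-off on your own data, a sketch along these lines may help. It writes the same hypothetical object (an id plus a 4x4 matrix, both invented here) as compressed JSON text and as a fixed binary record, and prints both sizes. Which one is smaller depends heavily on the data, so no result is claimed.

```python
import json
import struct
import zlib

# Hypothetical game object: an id plus a 4x4 transform matrix.
entity = {
    "id": 42,
    "matrix": [[float(r * 4 + c) / 3 for c in range(4)] for r in range(4)],
}

def to_text(obj):
    """JSON text, then zlib-compressed."""
    return zlib.compress(json.dumps(obj).encode("utf-8"))

def from_text(blob):
    return json.loads(zlib.decompress(blob).decode("utf-8"))

# Little-endian fixed layout: one uint32 id followed by 16 doubles.
RECORD = struct.Struct("<I16d")

def to_binary(obj):
    flat = [value for row in obj["matrix"] for value in row]
    return RECORD.pack(obj["id"], *flat)

def from_binary(blob):
    ident, *flat = RECORD.unpack(blob)
    return {"id": ident, "matrix": [flat[i:i + 4] for i in range(0, 16, 4)]}

print("compressed JSON:", len(to_text(entity)), "bytes")
print("binary record:  ", len(to_binary(entity)), "bytes")
```

Note that the binary layout here has no version field or field names, so adding a member later breaks old files unless you design for it, whereas the JSON reader can simply ignore or default unknown keys.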

Check out Protocol Buffers from Google or Thrift from Apache. Although billed as ways to write wire protocols easily, they are basically object serialization mechanisms that can create bindings in a dozen languages, have efficient binary representations, easy versioning, and fast performance, and are well-supported.

We're using Boost.Serialization. Don't know how it performs next to those offered by samkass.
