compressing individual lines of text separately using common phrases in a global dictionary - c++

Is there any open source library or algorithm available to look at which phrases or words are most common among individual lines of text in a file and create a global dictionary that would then be used to compress the lines of text separately? Preferably, any available code would be in C or C++.
I found this question that I think was similar but did not have an answer that meets what I am looking for:
compressing a huge set of similar strings

There are three important things to recognize here.
The value of replacing a word by a code depends on its frequency and its length. Replacing "a" isn't worth a lot, even if it appears very often.
Once you've identified the most common words, phrases can be found by looking for occurrences of two common words appearing side by side. (In most grammars, word repetition is fairly rare.)
However, one of the biggest sources of redundancy in text is actually the number of bits needed to predict the next letter. That's typically about 2, given the preceding text. Do you really need word-based compression when letter-based compression is so much easier?

I did some more searching, and I think I have found my answer.
I came across this page discussing improving compression by using boosters
http://mainroach.blogspot.com/2013/08/boosting-text-compression-with-dense.html
That page provided a link to the research paper
http://www.dcc.uchile.cl/~gnavarro/ps/tcj11.pdf
and also to the source code used to do the compression
http://vios.dc.fi.udc.es/codes/download.html

Yes. zlib, an open source compression library in C, provides the deflateSetDictionary() and inflateSetDictionary() routines for this purpose. You provide up to 32K of seed data in which the compressor will look for matching strings. The same dictionary needs to reside on both ends. If you are compressing lots of small pieces of data with a lot of commonality, this can greatly improve compression. Your "lines of text" certainly qualify as small pieces of data.
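For reference, here is a minimal sketch (my own, not code from the answer) of how those calls fit together when compressing a single line; the helper name compressLine is hypothetical, error handling is reduced to asserts, and the dictionary is assumed to be a pre-built string of representative text:

    #include <cassert>
    #include <string>
    #include <vector>
    #include <zlib.h>

    // Compress one line against a shared dictionary of up to 32K of
    // representative text. The decompressor must load the same dictionary:
    // inflate() returns Z_NEED_DICT, after which inflateSetDictionary()
    // is called with identical data before inflating again.
    std::vector<unsigned char> compressLine(const std::string& line,
                                            const std::string& dictionary)
    {
        z_stream strm = {};
        int rc = deflateInit(&strm, Z_BEST_COMPRESSION);
        assert(rc == Z_OK);
        rc = deflateSetDictionary(&strm,
                 reinterpret_cast<const Bytef*>(dictionary.data()),
                 static_cast<uInt>(dictionary.size()));
        assert(rc == Z_OK);

        std::vector<unsigned char> out(deflateBound(&strm, line.size()));
        strm.next_in   = reinterpret_cast<Bytef*>(const_cast<char*>(line.data()));
        strm.avail_in  = static_cast<uInt>(line.size());
        strm.next_out  = out.data();
        strm.avail_out = static_cast<uInt>(out.size());

        rc = deflate(&strm, Z_FINISH);   // one shot: a single line fits in memory
        assert(rc == Z_STREAM_END);
        out.resize(strm.total_out);
        deflateEnd(&strm);
        return out;
    }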

Related

Compress many versions of a text with fast access to each

Let's say I store many versions of a source code file in a source code repository - maybe 500 historic versions of a 50k source file. So storing the versions directly would take about 12.5 MB (assuming the file grew linearly over time). Naturally though, there is ample room for compression as there will only be slight differences between most successive versions.
What I want is compact storage as well as reasonably quick extraction of any of the versions at any time.
So we would probably store a list of oft-occurring text chunks, and each version would just contain pointers to the chunks it is made of. To make this really compact, text chunks could themselves be defined as concatenations of other chunks.
Is there a well-established compression algorithm that produces this kind of structure? I was not sure what term to search for.
(Bonus points if adding a new version is faster than recompressing the whole set of versions.)
What you want is called "git". In fact, that is exactly what you want. Including bonus points.
Seeing as there were no usable answers, I came up with my own format today to demonstrate what I mean. I am storing 850 versions of a source file about 20k in size. Usually from one version to the next just one line was added (but there were other changes as well).
If I store these 850 versions in a .zip, it is 4.2 MB big. I want less than that, way less.
My format is line-based. Basically each file version is stored as a list of pointers into a table. Each table entry is either:
a literal line,
or a pair of pointers into the table.
In the second case, in decompression, the two pointers have to be followed successively.
Not sure if this description makes sense to you right away, but the thing works.
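To make the description concrete, here is a minimal sketch of my reading of that table format (hypothetical names, not the poster's actual code): each table entry is either a literal line or a pair of indices into the same table, and a version is extracted by expanding its entries recursively.

    #include <cstddef>
    #include <string>
    #include <variant>
    #include <vector>

    struct Pair { std::size_t first, second; };      // two pointers into the table
    using Entry = std::variant<std::string, Pair>;   // literal line or pair

    // Append the expansion of table[index] to out.
    void expand(const std::vector<Entry>& table, std::size_t index, std::string& out)
    {
        if (const auto* line = std::get_if<std::string>(&table[index])) {
            out += *line;
            out += '\n';
        } else {
            const auto& p = std::get<Pair>(table[index]);
            expand(table, p.first, out);    // follow the two pointers in order
            expand(table, p.second, out);
        }
    }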
The compressor generates a single text file from which each of the 850 versions can be extracted instantly. This text file has a size of 45k.
Finally we can simply gzip this file which gets us down to 18.5k. Quite an improvement from 4.2 MB!
The compressor uses a very simple but effective way to find repeating combinations of lines.
So the answer to the initial question is that there is an algorithm that combines inter-file compression (like .tar.gz) with instant extraction of any contained file (like .zip).
I still don't know what you would call this class of compression algorithms.

Why combine Huffman and LZ77?

I'm reverse engineering a Game Boy Advance game, and I noticed that the original developers wrote code that makes two system calls to decompress a level using Huffman and LZ77 (in this order).
But why use Huffman + LZ77? What's the advantage of this approach?
Using available libraries
It's possible that the developers are using DEFLATE (or some similar algorithm), simply to be able to re-use tested and debugged software rather than writing something new from scratch (and taking who-knows-how-long to test and fix all the quirky edge cases).
Why both Huffman and LZ77?
But why does DEFLATE, Zstandard, LZHAM, LZHUF, LZH, etc. use both Huffman and LZ77?
Because these 2 algorithms detect and remove 2 different kinds of redundancy in common to many data files (video game levels, English and other natural-language text, etc.), and they can be combined to get better net compression than either one alone.
(Unfortunately, most data compression algorithms cannot be combined like this).
Details
In English, the 2 most common letters are (usually) 'e' and then 't'.
So what is the most common pair? You might guess "ee", "et", or "te" -- nope, it's "th".
LZ77 is good at detecting and compressing these kinds of common words and syllables that occur far more often than you might guess from the letter frequencies alone.
Letter-oriented Huffman is good at detecting and compressing files using the letter frequencies alone, but it cannot detect correlations between consecutive letters (common words and syllables).
LZ77 compresses an original file into an intermediate sequence of literal letters and "copy items".
Then Huffman further compresses that intermediate sequence.
Often those "copy items" are already much shorter than the original substring would have been if we had skipped the LZ77 step and simply Huffman compressed the original file.
And Huffman does just as well compressing the literal letters in the intermediate sequence as it would have done compressing those same letters in the original file.
So already this 2-step process creates smaller files than either algorithm alone.
As a bonus, typically the copy items are also Huffman compressed for more savings in storage space.
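As an illustration of that intermediate sequence, here is a toy, greedy LZ77 tokenizer (my own sketch, not DEFLATE's actual algorithm, which uses a 32K window, lazy matching, and separate Huffman tables for literal/length and distance symbols):

    #include <cstddef>
    #include <string>
    #include <vector>

    struct Token {
        bool        isCopy;   // false: literal character, true: (offset, length) copy
        char        literal;
        std::size_t offset;
        std::size_t length;
    };

    std::vector<Token> lz77Tokenize(const std::string& in, std::size_t window = 4096)
    {
        std::vector<Token> out;
        std::size_t pos = 0;
        while (pos < in.size()) {
            std::size_t bestLen = 0, bestOff = 0;
            std::size_t start = pos > window ? pos - window : 0;
            for (std::size_t cand = start; cand < pos; ++cand) {   // scan the window
                std::size_t len = 0;
                while (pos + len < in.size() && in[cand + len] == in[pos + len])
                    ++len;
                if (len > bestLen) { bestLen = len; bestOff = pos - cand; }
            }
            if (bestLen >= 3) {                        // emit a copy item
                out.push_back({true, 0, bestOff, bestLen});
                pos += bestLen;
            } else {                                   // emit a literal
                out.push_back({false, in[pos], 0, 0});
                ++pos;
            }
        }
        return out;   // a Huffman coder would then compress this token stream
    }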
In general, most data compression software is made up of these 2 parts.
First they run the original data through a "transformation" or multiple transformations, also called "decorrelators", typically highly tuned to the particular kind of redundancy in the particular kind of data being compressed (JPEG's DCT transform, MPEG's motion-compensation, etc.) or tuned to the limitations of human perception (MP3's auditory masking, etc.).
Next they run the intermediate data through a single "entropy coder" (arithmetic coding, or Huffman coding, or asymmetric numeral system coding) that's pretty much the same for every kind of data.

Why don't we use word ranks for string compression?

I have 3 main questions:
Let's say I have a large text file. (1) Is replacing the words with their rank an effective way to compress the file? (Got answer to this question. This is a bad idea.)
Also, I have come up with a new compression algorithm. I read some existing compression models that are used widely and I found out they use some pretty advanced concepts like statistical redundancy and probabilistic prediction. My algorithm does not use all these concepts and is a rather simple set of rules that need to be followed while compressing and decompressing. (2) My question is am I wasting my time trying to come up with a new compression algorithm without having enough knowledge about existing compression schemes?
(3) Furthermore, if I manage to successfully compress a string can I extend my algorithm to other content like videos, images etc.?
(I understand that the third question is difficult to answer without knowledge about the compression algorithm. But I am afraid the algorithm is so rudimentary and nascent I feel ashamed about sharing it. Please feel free to ignore the third question if you have to)
Your question doesn't make sense as it stands (see answer #2), but I'll try to rephrase and you can let me know if I capture your question. Would modeling text using the probability of individual words make for a good text compression algorithm? Answer: No. That would be a zeroth order model, and would not be able to take advantage of higher order correlations, such as the conditional probability of a given word following the previous word. Simple existing text compressors that look for matching strings and varied character probabilities would perform better.
Yes, you are wasting your time trying to come up with a new compression algorithm without having enough knowledge about existing compression schemes. You should first learn about the techniques that have been applied over time to model data, textual and others, and the approaches to use the modeled information to compress the data. You need to study what has already been researched for decades before developing a new approach.
The compression part may extend, but the modeling part won't.
Do you mean having a ranking table of words sorted by frequency and assigning smaller "symbols" to the words that are repeated the most, thereby reducing the amount of information that needs to be transmitted?
That's basically how Huffman coding works. The problem with compression is that you always hit a limit somewhere along the road. Of course, if the set of things you are trying to compress follows a particular pattern or distribution, it's possible to be really efficient about it, but for general purposes (audio/video/text/encrypted data that appears to be random) there is no (and I believe there can't be) "best" compression technique.
Huffman coding uses letter frequencies. You can do the same with words, or with letter frequencies in more dimensions, i.e. combinations of letters and their frequencies.
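For illustration only, here is a minimal sketch of the ranking idea (hypothetical helper name, not a full coder): count word frequencies and give the most frequent words the smallest ranks. A real scheme would still need an entropy coder (e.g. Huffman) over the ranks, plus a way to transmit the dictionary itself.

    #include <algorithm>
    #include <cstddef>
    #include <sstream>
    #include <string>
    #include <unordered_map>
    #include <utility>
    #include <vector>

    std::unordered_map<std::string, std::size_t> buildRanks(const std::string& text)
    {
        std::unordered_map<std::string, std::size_t> counts;
        std::istringstream iss(text);
        for (std::string word; iss >> word; ) ++counts[word];

        // Sort words by descending frequency.
        std::vector<std::pair<std::string, std::size_t>> byFreq(counts.begin(), counts.end());
        std::sort(byFreq.begin(), byFreq.end(),
                  [](const auto& a, const auto& b) { return a.second > b.second; });

        std::unordered_map<std::string, std::size_t> ranks;
        for (std::size_t i = 0; i < byFreq.size(); ++i) ranks[byFreq[i].first] = i;
        return ranks;   // most frequent word gets the smallest rank / shortest code
    }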

Is it possible to compress text using natural language processing?

I was thinking about compressing large blocks of text using the most frequent English words, but now I doubt it would be efficient, since LZW seems to achieve just this in a better way.
Still, I can't shake the feeling that compressing characters one by one is a little "brutal", since one could analyze the structure of sentences to better organize them into smaller chunks of data, and where the structure is not exactly the same when decompressed, classic compression methods could be used.
Does "basic" NLP allow that?
Standard compression techniques can be applied to words instead of characters. These techniques would assign probabilities to what the next word is, based on the preceding words. I have not seen this in practice though, since there are so many more words than characters, resulting in prohibitive memory usage and excessive execution time for even low-order models.
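A minimal sketch of what such an order-1 word model could look like (my own illustration under simple assumptions: whitespace tokenization, no smoothing). The practical problem mentioned above shows up directly as the table of distinct word pairs, which grows very large.

    #include <cstddef>
    #include <sstream>
    #include <string>
    #include <unordered_map>

    using Counts = std::unordered_map<std::string, std::size_t>;

    // For each word, count which words follow it. An entropy coder could then
    // code each word using the distribution conditioned on its predecessor.
    std::unordered_map<std::string, Counts> buildBigramModel(const std::string& text)
    {
        std::unordered_map<std::string, Counts> model;
        std::istringstream iss(text);
        std::string prev, word;
        while (iss >> word) {
            if (!prev.empty()) ++model[prev][word];   // "word follows prev"
            prev = word;
        }
        return model;
    }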

Identifying keywords of a (programming) language

This is a follow-up to my recent question (Code for identifying programming language in a text file). I'm really thankful for all the answers I got; they helped me very much. My code for this task is complete and it works fairly well - quick and reasonably accurate.
The method I used is the following: I have a "learning" Perl script that identifies the most frequently used words in a language by doing a word histogram over a set of sample files. These data are then loaded by the C++ program, which checks the given text, accumulates a score for each language based on the words found, and then simply checks which language accumulated the highest score.
Now I would like to make it even better and work a bit on the quality of identification. The problem is I often get "unknown" as the result (many languages accumulate a small score, but none anything bigger than my threshold). After some debugging, research etc. I found out that this is probably due to the fact that all words are considered equal. This means that seeing a "#include", for example, has the same effect as seeing a "while" - both of which indicate that it might be C/C++ (I'm now ignoring the fact that "while" is used in many other languages), but of course in larger .cpp files there might be a ton of "while" but most of the time only a few "#include".
So the fact that a "#include" is more important is ignored, because I could not come up with a good way to identify whether one word is more important than another. Now bear in mind that the script which creates the data is fairly stupid; it's only a word histogram, and for every chosen word it assigns a score of 1. It does not even look at the words (so if there is a "#&|?/" in a file very often, it might get chosen as a good word).
Also, I would like to have the data creation part fully automated, so nobody should have to look at the data and alter them, change scores, change words, etc. All the "brainz" should be in the script and the C++ program.
Does somebody have a suggestion for how to identify keywords, or more generally, important words? Some things that might help: I have the number of occurrences of each word and the total number of words (so a ratio may be calculated). I have also thought about wiping out characters like ";" etc., since the histogram script often puts, for example, "continue;" in the result, but the important word is "continue". Last note: all checks for equality are exact matches - no substrings, case sensitive. This is mainly because of speed, but substrings might help (or hurt, I don't know)...
NOTE: thanks to all who bothered to answer; you helped me a lot.
My work on this is almost finished, so I will describe what I did to get good results.
1) Get a decent training set, about 30-50 files per language from various sources to avoid coding style bias
2) Write a Perl script that does a word histogram. Implement a blacklist and a whitelist (more about them below)
3) Add bogus words to the blacklist, like "license", "the", etc. These are often found at the start of a file in license information.
4) Add the five or so most important words per language to the whitelist. These are words that are found in most source code of a given language but are not frequent enough to get into the histogram. For example, for C/C++ I had #include, #define, #ifdef, #ifndef and #endif in the whitelist.
5) Emphasize the start of a file, so give more points to words found in the first 50-100 lines
6) When doing the word histogram, tokenize the file using @words = split(/[\s\(\){}\[\];.,=]+/, $_); - this should be OK for most languages, I think (it gives me the best results). For each language, keep about the 10-20 most frequent words in the final results.
7) When the histogram is complete, remove all words that are found in the blacklist and add all those that are found in the whitelist
8) Write a program which processes a text file in the same way as the script - tokenize using the same rules. If a word is found in the histogram data, add points to the right language. Words in the histogram which correspond to only one language should add more points; those which belong to multiple languages should add fewer.
Comments are welcome. Currently, on about 1000 text files, I get 80 unknowns (mostly on extremely short files - mainly JavaScript with just one or two lines). About 20 files are recognized wrongly. The average file size is about 11 kB, ranging from 100 bytes to 100 kB (almost 11 MB total). It takes one second to process them all, which is good enough for me.
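For what it's worth, here is a rough sketch of the scoring step (8), under my own assumptions about the data layout (a word-to-per-language-weight table; the names are hypothetical, not the poster's actual program), with tokenization mirroring the Perl split in step (6):

    #include <regex>
    #include <string>
    #include <unordered_map>

    using ScoreTable = std::unordered_map<std::string,
                                          std::unordered_map<std::string, double>>;

    std::string classify(const std::string& text, const ScoreTable& table,
                         double threshold)
    {
        std::unordered_map<std::string, double> scores;
        // Same delimiters as the Perl histogram script: whitespace ( ) { } [ ] ; . , =
        std::regex delims(R"([\s(){}\[\];.,=]+)");
        std::sregex_token_iterator it(text.begin(), text.end(), delims, -1), end;
        for (; it != end; ++it) {
            std::string token = *it;
            auto hit = table.find(token);
            if (hit == table.end()) continue;
            // Words exclusive to one language carry a larger weight (see step 8).
            for (const auto& [lang, weight] : hit->second) scores[lang] += weight;
        }
        std::string best = "unknown";
        double bestScore = threshold;
        for (const auto& [lang, score] : scores)
            if (score > bestScore) { bestScore = score; best = lang; }
        return best;
    }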
I think you're approaching this from the wrong viewpoint. From your description, it sounds like you are building a classifier. A good classifier needs to discriminate between different classes; it doesn't need to precisely estimate the correspondence between the input and the most likely class.
Practically: your classifier doesn't need to assess precisely how close to C++ a certain input is; it merely needs to determine if the input is more like C than C++. This makes your work a lot easier - most of your current "unknown" cases will be close to one or two languages, even though they don't exceed your basic threshold.
Now, once you realize this, you will also see what training your classifier needs: not some random aspect of the sample files, but what sets two languages apart. Hence, when you have parsed your C samples, and your C++ samples, you will see that #include does not set them apart. However, class and template will be far more common in C++. On the other hand, #include does distinguish between C++ and Java.
There are of course other aspects besides keywords that you can use. For instance, the most obvious would be the frequency of {, and ; is similarly distinguishing. Another very useful feature for your classifier would be the comment tokens for the different languages. The basic problem of course would be automatically identifying them. Again, hardcoding //, /*, ', --, # and ! as pseudo-keywords would help.
This also identifies another classification rule: SQL will often have -- at the beginning of a line, whereas in C it will often appear somewhere else. Thus it may be useful for your classifier to take the context into account as well.
Use Google Code Search to learn weights for the set of keywords: #include in C++ gets 672,000 hits, in Python only ~5,000.
You can normalize the results by looking at the number of results for the language in total:
C++ gives about 770,000 files whereas Python returns 120,000.
Thus "#include" is extremely rare in Python files, but exists in almost every C++ file. (Now you still have to learn to distinguish C++ and C of course.) All that is left is to do the correct reasoning about probabilities.
You need to get some exclusiveness into your lookup data.
When teaching the programming languages you expect, you should search for words typical of one or a few languages. If a word appears in several code files of the same language but appears in few or none of the other languages' files, it's a strong indication of that language.
So the score of a word could be calculated at the lookup side by selecting the words that are exclusive to a language or a group of languages. Find several of these words, take the intersection by adding up the scores, and found your language you will have.
In an answer to your other question, someone recommended a naïve Bayes classifier. You should implement this suggestion because the technique is good at separating according to distinguishing features. You mentioned the while keyword, but that's not likely to be useful because so many languages use it—and a Bayes classifier won't treat it as useful.
An interesting part of your problem is how to tokenize an unknown program. Whitespace-separated chunks are a decent rough start, but going meaningfully beyond that will be tricky.
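To make the naive Bayes suggestion concrete, here is a minimal sketch (my own illustration with add-one smoothing and uniform priors; the model layout and function names are hypothetical): each language's model is just per-word counts over its training files.

    #include <cmath>
    #include <limits>
    #include <string>
    #include <unordered_map>
    #include <vector>

    struct LanguageModel {
        std::string name;
        std::unordered_map<std::string, double> wordCounts;
        double totalWords = 0;
    };

    // Log-probability of the token sequence under one language's word model.
    double logLikelihood(const std::vector<std::string>& tokens,
                         const LanguageModel& m, double vocabularySize)
    {
        double logp = 0.0;
        for (const auto& t : tokens) {
            auto it = m.wordCounts.find(t);
            double count = (it == m.wordCounts.end()) ? 0.0 : it->second;
            // Add-one smoothing so unseen words don't zero out the whole score.
            logp += std::log((count + 1.0) / (m.totalWords + vocabularySize));
        }
        return logp;
    }

    // Pick the language with the highest log-likelihood (uniform priors assumed).
    std::string classifyBayes(const std::vector<std::string>& tokens,
                              const std::vector<LanguageModel>& models,
                              double vocabularySize)
    {
        std::string best = "unknown";
        double bestLogp = -std::numeric_limits<double>::infinity();
        for (const auto& m : models) {
            double lp = logLikelihood(tokens, m, vocabularySize);
            if (lp > bestLogp) { bestLogp = lp; best = m.name; }
        }
        return best;
    }

A keyword that nearly every language shares, like "while", contributes almost the same log-probability to every language's score, so it barely moves the decision - which is exactly why the Bayes classifier won't treat it as a useful feature.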