Why don't we use word ranks for string compression? - compression

I have 3 main questions:
Let's say I have a large text file. (1)Is replacing the words with their rank an effective way to compress the file? (Got answer to this question. This is a bad idea.)
Also, I have come up with a new compression algorithm. I read some existing compression models that are used widely and I found out they use some pretty advanced concepts like statistical redundancy and probabilistic prediction. My algorithm does not use all these concepts and is a rather simple set of rules that need to be followed while compressing and decompressing. (2)My question is am I wasting my time trying to come up with a new compression algorithm without having enough knowledge about existing compression schemes?
(3)Furthermore, if I manage to successfully compress a string can I extend my algorithm to other content like videos, images etc.?
(I understand that the third question is difficult to answer without knowledge about the compression algorithm. But I am afraid the algorithm is so rudimentary and nascent I feel ashamed about sharing it. Please feel free to ignore the third question if you have to)

Your question doesn't make sense as it stands (see answer #2), but I'll try to rephrase and you can let me know if I capture your question. Would modeling text using the probability of individual words make for a good text compression algorithm? Answer: No. That would be a zeroth order model, and would not be able to take advantage of higher order correlations, such as the conditional probability of a given word following the previous word. Simple existing text compressors that look for matching strings and varied character probabilities would perform better.
Yes, you are wasting your time trying to come up with a new compression algorithm without having enough knowledge about existing compression schemes. You should first learn about the techniques that have been applied over time to model data, textual and others, and the approaches to use the modeled information to compress the data. You need to study what has already been researched for decades before developing a new approach.
The compression part may extend, but the modeling part won't.

Do you mean like having a ranking table of words sorted by frequency and assign smaller "symbols" to those words that are repeated the most, therefore reducing the amount of information that needs to be transmitted?
That's basically how Huffman Coding works, the problem with compression is that you always hit a limit somewhere along the road, of course, if the set of things that you try to compress follows a particular pattern/distribution then it's possible to be really efficient about it, but for general purposes (audio/video/text/encrypted data that appears to be random) there is no (and I believe that there can't be) "best" compression technique.

Huffman Coding uses frequency on letters. You can do the same with words or with letter frequency in more dimensions, i.e. combinations of letters and their frequency.

Related

Sentiment analysis feature extraction

I am new to NLP and feature extraction, i wish to create a machine learning model that can determine the sentiment of stock related social media posts. For feature extraction of my dataset I have opted to use Word2Vec. My question is:
Is it important to train my word2vec model on a corpus of stock related social media posts - the datasets that are available for this are not very large. Should I just use a much larger pretrained word vector ?
The only way to to tell what will work better for your goals, within your constraints of data/resources/time, is to try alternate approaches & compare the results on a repeatable quantititave evaluation.
Having training texts that are properly representative of your domain-of-interest can be quite important. You may need your representation of the word 'interest', for example, to represent that of stock/financial world, rather than the more general sense of the word.
But quantity of data is also quite important. With smaller datasets, none of your words may get great vectors, and words important to evaluating new posts may be missing or of very-poor quality. In some cases taking some pretrained set-of-vectors, with its larger vocabulary & sharper (but slightly-mismatched to domain) word-senses may be a net help.
Because these pull in different directions, there's no general answer. It will depend on your data, goals, limits, & skills. Only trying a range of alternative approaches, and comparing them, will tell you what should be done for your situation.
As this iterative, comparative experimental pattern repeats endlessly as your projects & knowledge grow – it's what the experts do! – it's also important to learn, & practice. There's no authority you can ask for any certain answer to many of these tradeoff questions.
Other observations on what you've said:
If you don't have a large dataset of posts, and well-labeled 'ground truth' for sentiment, your results may not be good. All these techniques benefit from larger training sets.
Sentiment analysis is often approached as a classification problem (assigning texts to bins of 'positive' or 'negative' sentiment, operhaps of multiple intensities) or a regression problem (assigning texts a value on numerical scale). There are many more-simple ways to create features for such processes that do not involve word2vec vectors – a somewhat more-advanced technique, which adds complexity. (In particular, word-vectors only give you features for individual words, not texts of many words, unless you add some other choices/steps.) If new to the sentiment-analysis domain, I would recommend against starting with word-vector features. Only consider adding them later, after you've achieved some initial baseline results without their extra complexity/choices. At that point, you'll also be able to tell if they're helping or not.

compressing individual lines of text separately using common phrases in a global dictionary

Is there any open source library or algorithm available to look at what phrases or words are most common among individual lines of text in a file and create a global dictionary that would then be used to compress the lines of text separately? Preferably the code if available would be in C or C++.
I found this question that I think was similar but did not have an answer that meets what I am looking for:
compressing a huge set of similar strings
There are three important things to recognize here.
The value of replacing a word by a code depends on its frequency and its length. Replacing "a" isn't worth a lot, even if it appears very often.
Once you've identified the most common words, phrases can be found by looking for occurrences of two common words appearing side by side. (In most grammars, word repetition is fairly rare.)
However, one of the biggest sources of redundancy in text is actually the amount of bits needed to predict the next letter. That's typically about 2, given the preceding text. Do you really need word-based compression when letter-based compression is so much easier?
I did some more searching, and I think I have found my answer.
I came across this page discussing improving compression by using boosters
http://mainroach.blogspot.com/2013/08/boosting-text-compression-with-dense.html
That page provided a link to the research paper
http://www.dcc.uchile.cl/~gnavarro/ps/tcj11.pdf
and also to the source code used to do the compression
http://vios.dc.fi.udc.es/codes/download.html
Yes. zlib, an open source compression library in C, provides the deflateSetDictonary() and inflateSetDictionary() routines for this purpose. You provide up to 32K of seed data in which the compressor will look for matching strings. The same dictionary needs to reside on both ends. If you are compressing lots of small pieces of data with a lot of commonality, this can greatly improve compression. Your "lines of text" certainly qualify as small pieces of data.

How to compute good preset dictionary for deflate compression

I have an opportunity to preset dictionary for deflate compression. It makes sense in my case, because data to be compressed is relatively small 1kb-3kb and I have a large sample of representative examples. Data to be compressed consists of arbitrary sequence of bytes, so tokenization etc. is not a good way to go. Also, data shows a lot of repetition (between data examples), so good dictionary could potentially give very good results.
The question is how calculate good dictionary? Is there an algorithm which calculates optimal dictionary (given sample data)?
I started looking at prefix trees, but it is not clear how to use them in this context.
Best regards,
Jarek
I am not aware of an algorithm to generate an optimal or even a good dictionary. This is generally done by hand. I think that a suffix tree would be a good approach to finding common strings for a dictionary, but I have never tried it.
The first thing to try is to simply concatenate 32K worth of your 1-3K examples and see how much gain that provides over no dictionary. Then you mess with it from there, changing the ordering of examples or pulling out repeated pieces in the examples to the end of the dictionary.
Note that the most common strings should be put at the end, since shorter distances take fewer bits.
I don't know how good this is, but it's a dictionary creator: https://github.com/vkrasnov/dictator

Is it possible to compress text using natural language processing?

I was thinking about compressing large blocks of text using most frequent english words, but now I doubt it would be efficient, since lzw seems to be achieving just this in a better way.
Still, I can't shake the feeling compressing character one by one is a little "brutal", since one could just analyze the structure of sentences to better organize it into smaller chunks of data, and the structure is not exactly the same when decompressed, it could use classic compression methods.
Does "basic" NLP allows that ?
NLP?
Standard compression techniques can be applied to words instead of characters. These techniques would assign probabilities to what the next word is, based on the preceding words. I have not seen this in practice though, since there are so many more words than characters, resulting in prohibitive memory usage and excessive execution time for even low-order models.

File compression with C++

I want to make my own text file compression program. I don't know much about C++ programming, but I have learned all the basics and writing/reading a file.
I have searched on google a lot about compression, and saw many different kind of methods to compress a file like LZW and Huffman. The problem is that most of them don't have a source code, or they have a very complicated one.
I want to ask if you know any good webpages where I can learn and make a compression program myself?
EDIT:
I will let this topic be open for a little longer, since I plan to study this the next few days, and if I have any questions, I'll ask them here.
Most of the algorithms are pretty complex. But they all have in common that they are taking data that is repeating and only store it once and have a system of knowing how to uncompress them (putting the repeated segments back in place)
Here is a simple example you can try to implement.
We have this data file
XXXXFGGGJJ
DDDDDDDDAA
XXXXFGGGJJ
Here we have chars that repeat and two lines that repeat. So you could start with finding a way to reduce the filesize.
Here's a simple compression algorithm.
4XF3G2J
8D2A
4XF3G2J
So we have 4 of X, one of F, 3 of G etc.
You can try this page it contains a clear walk through of the basics of compression and the first principles.
Compression is not the most easy task. I took a college class to learn about compression algorithms like LZW and Huffman, and I can tell you that they're not that easy. If C++ is your first language and you're just starting into this sort of thing, I wouldn't recommend trying to write your own sorting algorithm just yet. If you are more experienced, then I would try writing source without any code being provided to you - this shows that you truly understand the compression algorithm.
This is how I was taught - the professor explained the algorithm in very broad terms, and then either we would implement it (in Java, mind you) or we would answer questions about how the algorithm would behave under certain circumstances. If we could do either of those, then we really knew the algorithm - without him showing us any source at all - it's a good skill to develop ;)
Huffman encoding tree's are not too complicated, I'd start with them. Here's a link: Example: Huffman Encoding Trees