I have an opportunity to preset a dictionary for deflate compression. It makes sense in my case, because the data to be compressed is relatively small (1 KB to 3 KB) and I have a large sample of representative examples. The data consists of arbitrary sequences of bytes, so tokenization etc. is not a good way to go. Also, the data shows a lot of repetition between examples, so a good dictionary could potentially give very good results.
The question is: how do I calculate a good dictionary? Is there an algorithm that computes an optimal dictionary given the sample data?
I started looking at prefix trees, but it is not clear how to use them in this context.
Best regards,
Jarek
I am not aware of an algorithm to generate an optimal or even a good dictionary. This is generally done by hand. I think that a suffix tree would be a good approach to finding common strings for a dictionary, but I have never tried it.
The first thing to try is to simply concatenate 32K worth of your 1-3K examples and see how much gain that provides over no dictionary. Then mess with it from there, changing the ordering of the examples or pulling repeated pieces out of the examples and moving them to the end of the dictionary.
Note that the most common strings should be put at the end, since shorter distances take fewer bits.
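For reference, here is a minimal sketch of how a preset dictionary is plugged into zlib's deflate once you have built one. The dictionary buffer is whatever you assemble with the concatenation approach above; error handling is reduced to asserts, and the function name is just an example:

    #include <cassert>
    #include <vector>
    #include <zlib.h>

    // Compress `src` with a preset dictionary. The dictionary is assumed to be
    // up to 32K of concatenated representative samples, with the most common
    // strings placed at the end.
    std::vector<unsigned char> compress_with_dict(const std::vector<unsigned char>& src,
                                                  const std::vector<unsigned char>& dict)
    {
        z_stream strm{};
        int rc = deflateInit(&strm, Z_BEST_COMPRESSION);
        assert(rc == Z_OK);

        // Install the preset dictionary before any data is compressed.
        rc = deflateSetDictionary(&strm, dict.data(), static_cast<uInt>(dict.size()));
        assert(rc == Z_OK);

        std::vector<unsigned char> out(deflateBound(&strm, src.size()));
        strm.next_in   = const_cast<Bytef*>(src.data());
        strm.avail_in  = static_cast<uInt>(src.size());
        strm.next_out  = out.data();
        strm.avail_out = static_cast<uInt>(out.size());

        rc = deflate(&strm, Z_FINISH);
        assert(rc == Z_STREAM_END);
        out.resize(strm.total_out);
        deflateEnd(&strm);
        return out;
    }

On the decompression side, inflate() will return Z_NEED_DICT, at which point you supply the same dictionary with inflateSetDictionary() and call inflate() again.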
I don't know how good this is, but it's a dictionary creator: https://github.com/vkrasnov/dictator
I have 3 main questions:
Let's say I have a large text file. (1) Is replacing the words with their rank an effective way to compress the file? (I got an answer to this question: it is a bad idea.)
Also, I have come up with a new compression algorithm. I read about some existing compression models that are widely used, and I found out they use some pretty advanced concepts like statistical redundancy and probabilistic prediction. My algorithm does not use these concepts and is a rather simple set of rules that need to be followed while compressing and decompressing. (2) My question is: am I wasting my time trying to come up with a new compression algorithm without having enough knowledge about existing compression schemes?
(3) Furthermore, if I manage to successfully compress a string, can I extend my algorithm to other content like videos, images, etc.?
(I understand that the third question is difficult to answer without knowledge of the compression algorithm. But I am afraid the algorithm is so rudimentary and nascent that I feel ashamed about sharing it. Please feel free to ignore the third question if you have to.)
Your question doesn't make sense as it stands (see answer #2), but I'll try to rephrase it and you can let me know if I capture your question. Would modeling text using the probability of individual words make for a good text compression algorithm? Answer: no. That would be a zeroth-order model, and would not be able to take advantage of higher-order correlations, such as the conditional probability of a given word following the previous word. Simple existing text compressors that look for matching strings and exploit varying character probabilities would perform better.
Yes, you are wasting your time trying to come up with a new compression algorithm without having enough knowledge about existing compression schemes. You should first learn about the techniques that have been applied over time to model data, textual and others, and the approaches to use the modeled information to compress the data. You need to study what has already been researched for decades before developing a new approach.
The compression part may extend, but the modeling part won't.
Do you mean having a ranking table of words sorted by frequency and assigning smaller "symbols" to the words that are repeated the most, thereby reducing the amount of information that needs to be transmitted?
That's basically how Huffman coding works. The problem with compression is that you always hit a limit somewhere along the road. Of course, if the set of things you try to compress follows a particular pattern/distribution, then it's possible to be really efficient about it, but for general purposes (audio/video/text/encrypted data that appears to be random) there is no (and I believe there can't be) "best" compression technique.
Huffman coding uses the frequency of letters. You can do the same with words, or with letter frequencies in more dimensions, i.e. combinations of letters and their frequencies.
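To make that concrete, here is a minimal sketch of building a Huffman code from byte frequencies (not an optimized implementation; the same idea applies to word frequencies if you swap the byte table for a word table):

    #include <cstdint>
    #include <map>
    #include <memory>
    #include <queue>
    #include <string>
    #include <vector>

    struct Node {
        uint64_t freq;
        int symbol;                       // -1 for internal nodes
        std::unique_ptr<Node> left, right;
    };

    struct ByFreq {
        bool operator()(const Node* a, const Node* b) const { return a->freq > b->freq; }
    };

    // Build a Huffman code (symbol -> bit string) from the byte frequencies in `data`.
    std::map<int, std::string> huffman_codes(const std::vector<uint8_t>& data)
    {
        uint64_t freq[256] = {};
        for (uint8_t b : data) ++freq[b];

        std::priority_queue<Node*, std::vector<Node*>, ByFreq> pq;
        for (int s = 0; s < 256; ++s)
            if (freq[s]) pq.push(new Node{freq[s], s, nullptr, nullptr});

        std::map<int, std::string> codes;
        if (pq.empty()) return codes;

        // Repeatedly merge the two least frequent nodes.
        while (pq.size() > 1) {
            Node* a = pq.top(); pq.pop();
            Node* b = pq.top(); pq.pop();
            pq.push(new Node{a->freq + b->freq, -1,
                             std::unique_ptr<Node>(a), std::unique_ptr<Node>(b)});
        }

        // Walk the tree: left edges emit '0', right edges emit '1'.
        std::vector<std::pair<Node*, std::string>> stack{{pq.top(), std::string()}};
        while (!stack.empty()) {
            auto [n, code] = stack.back();
            stack.pop_back();
            if (n->symbol >= 0) codes[n->symbol] = code.empty() ? "0" : code;
            if (n->left)  stack.push_back({n->left.get(),  code + "0"});
            if (n->right) stack.push_back({n->right.get(), code + "1"});
        }
        delete pq.top();                  // the unique_ptr children free the rest
        return codes;
    }

The limit mentioned above shows up here too: if the byte distribution is close to uniform, every code ends up close to 8 bits and almost nothing is gained.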
I was thinking about compressing large blocks of text using the most frequent English words, but now I doubt it would be efficient, since LZW seems to achieve just this in a better way.
Still, I can't shake the feeling that compressing characters one by one is a little "brutal", since one could analyze the structure of sentences to better organize the text into smaller chunks of data, and even if the structure is not exactly the same when decompressed, classic compression methods could still be used on top.
Does "basic" NLP allow that?
NLP?
Standard compression techniques can be applied to words instead of characters. These techniques would assign probabilities to what the next word is, based on the preceding words. I have not seen this in practice though, since there are so many more words than characters, resulting in prohibitive memory usage and excessive execution time for even low-order models.
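As a sketch of the modeling step only (the coding step that would consume these counts is omitted), here is how a first-order word model could gather its statistics; the sample sentence is just a placeholder:

    #include <iostream>
    #include <map>
    #include <sstream>
    #include <string>

    // Count how often each word follows each other word. An entropy coder could
    // then assign short codes to the likely successors of the previous word.
    std::map<std::string, std::map<std::string, int>>
    bigram_counts(const std::string& text)
    {
        std::map<std::string, std::map<std::string, int>> counts;
        std::istringstream in(text);
        std::string prev, word;
        while (in >> word) {
            if (!prev.empty()) ++counts[prev][word];
            prev = word;
        }
        return counts;
    }

    int main() {
        for (const auto& [prev, next] : bigram_counts("the cat sat on the mat the cat ran"))
            for (const auto& [word, n] : next)
                std::cout << prev << " -> " << word << " : " << n << "\n";
    }

The size of this table is exactly the problem pointed out above: with a realistic vocabulary it grows far faster than the 256-entry tables a character-level model needs.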
I'm looking for a good algorithm / method to check the data quality in a data warehouse.
Therefore I want to have some algorithm that "knows" the possible structure of the values, checks whether each value is a member of this structure, and then decides whether it is correct or not.
I thought about defining a regexp and then checking each value against it to see whether it fits or not.
Is this a good way? Are there some good alternatives? (Any research papers?)
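For illustration, this is roughly what I have in mind; the pattern and sample values below are made-up examples, not a real schema:

    #include <iostream>
    #include <regex>
    #include <string>
    #include <vector>

    // Classify a value as inside or outside the expected structure.
    bool matches_expected_structure(const std::string& value, const std::regex& pattern)
    {
        return std::regex_match(value, pattern);
    }

    int main() {
        const std::regex date_pattern(R"(\d{4}-\d{2}-\d{2})");  // hypothetical date column
        const std::vector<std::string> values = {"2011-03-15", "15/03/2011", "n/a"};
        for (const auto& v : values)
            std::cout << v << " : "
                      << (matches_expected_structure(v, date_pattern) ? "correct" : "not correct")
                      << "\n";
    }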
I have seen some authors suggest adding a special dimension, called a data quality dimension, to further describe each fact-table record.
Typical values in a data quality dimension could then be “Normal value,” “Out-of-bounds value,” “Unlikely value,” “Verified value,” “Unverified value,” and “Uncertain value.”
I would recommend using a dedicated data quality tool, like DataCleaner (http://datacleaner.eobjects.org), which I have been doing quite a lot of work on.
You need a tool that not only checks strict rules like constraints, but also gives you a profile of your data and makes it easy for you to explore and identify inconsistencies on your own. Try, for example, the "Pattern finder", which will tell you the patterns of your string values - something that will often reveal the outliers and erroneous values. You can also use the tool for actually cleansing the data, by transforming values, extracting information from them, or enriching them using third-party services. Good luck improving your data quality!
I want to store a graph of different objects for a game; their classes may or may not be related, and they may or may not contain vectors of simple structures.
I want the parsing operation to be fast, and the data can be pretty big.
Adding new things should not be hard, and it should not break backward compatibility.
Smaller file size is kind of important
Readability counts
By serialization I mean making objects serialize themselves, which is effective, but I will need to write different serialization methods for different objects for that.
By binary parsing/composing I mean creating a new tree of parsers/composers that holds and reads data for these objects, and passing it around to have my objects push/pull their data.
I can also use JSON, but it can be pretty slow for reading, and it is not very size-efficient when it comes to pretty big sets of matrices and numbers.
Point by point:
Fast Parsing: binary (since you don't necessarily have to "parse", you can just deserialize)
Adding New Things: text
Smaller: text (even if gzipped text is larger than binary, it won't be much larger).
Readability: text
So that's three votes for text, one vote for binary. Personally, I'd go with text for everything except images (and other data that is "naturally" binary). Then, store everything in a big zip file (I can think of several games that do this or something close to it).
Good reads: The Importance of Being Textual and Power Of Plain Text.
Check out Protocol Buffers from Google or Thrift from Apache. Although billed as a way to write wire protocols easily, each is basically an object serialization mechanism that can create bindings in a dozen languages, has an efficient binary representation, supports easy versioning, performs well, and is well supported.
We're using Boost.Serialization. Don't know how it performs next to those offered by samkass.
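For anyone who hasn't seen it, here is a minimal sketch of the "objects serialize themselves" approach with Boost.Serialization; the GameObject type is a made-up example, not taken from the question:

    #include <fstream>
    #include <string>
    #include <vector>
    #include <boost/archive/text_iarchive.hpp>
    #include <boost/archive/text_oarchive.hpp>
    #include <boost/serialization/string.hpp>
    #include <boost/serialization/vector.hpp>

    // A hypothetical game object; each serializable type provides one serialize().
    struct GameObject {
        std::string name;
        std::vector<float> transform;   // e.g. a flattened matrix

        template <class Archive>
        void serialize(Archive& ar, const unsigned int /*version*/) {
            ar & name;
            ar & transform;
        }
    };

    int main() {
        const GameObject obj{"player", {1, 0, 0, 1}};
        {   // write
            std::ofstream out("object.txt");
            boost::archive::text_oarchive oa(out);
            oa << obj;
        }
        {   // read back
            GameObject loaded;
            std::ifstream in("object.txt");
            boost::archive::text_iarchive ia(in);
            ia >> loaded;
        }
    }

Swapping text_oarchive for binary_oarchive changes only the two archive lines, which is one way to keep both a readable format and a compact one around.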
This data is stored in an array (using C++) and consists of repeated 125-bit records, each one varying from the others. It also has 8 messages of 12 ASCII characters each at the end. Please suggest whether I should use differential compression within the array, and if so, how.
Or should I apply some other compression scheme onto the whole array?
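For illustration only, a sketch of what "differential compression within the array" could look like before handing the result to a general-purpose compressor. It assumes each 125-bit record is stored padded to 16 bytes, which is an assumption about the layout, not something stated in the question:

    #include <cstddef>
    #include <cstdint>
    #include <vector>

    // Assumed record size: 125 bits padded to 16 bytes.
    constexpr std::size_t RECORD_BYTES = 16;

    // XOR each full record with the previous one, so near-identical records
    // become runs of zero bytes; any trailing partial record is left untouched.
    std::vector<uint8_t> delta_encode(const std::vector<uint8_t>& data)
    {
        std::vector<uint8_t> out(data);
        for (std::size_t i = RECORD_BYTES; i + RECORD_BYTES <= data.size(); i += RECORD_BYTES)
            for (std::size_t j = 0; j < RECORD_BYTES; ++j)
                out[i + j] = data[i + j] ^ data[i - RECORD_BYTES + j];
        return out;
    }

    // Inverse transform: rebuild each record from the previously decoded one.
    std::vector<uint8_t> delta_decode(const std::vector<uint8_t>& data)
    {
        std::vector<uint8_t> out(data);
        for (std::size_t i = RECORD_BYTES; i + RECORD_BYTES <= out.size(); i += RECORD_BYTES)
            for (std::size_t j = 0; j < RECORD_BYTES; ++j)
                out[i + j] ^= out[i - RECORD_BYTES + j];
        return out;
    }

Whether this actually beats compressing the raw array depends on how the records differ from one another, so it is worth measuring both against plain deflate.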
Generally you can compress data that has some sort of predictability or redundancy. Dictionary based compression (e.g. ZIP style algorithms) traditionally don't work well on small chunks of data because of the need to share the selected dictionary.
In the past, when I have compressed very small chunks of data with somewhat predictable patterns, I have used SharpZipLib with a custom dictionary. Rather than embed the dictionary in the actual data, I hard-coded the dictionary in every program that needs to (de)compress the data. SharpZipLib gives you both options: custom dictionary, and keep dictionary separate from the data.
Again this will only work well if you can predict some patterns to your data ahead of time so that you can create an appropriate compression dictionary, and it's feasible for the dictionary itself to be separate from the compressed data.
You haven't given us enough information to help you. However, I can highly recommend the book Text Compression by Bell, Cleary, and Witten. Don't be fooled by the title; "Text" here just means "lossless"—all the techniques apply to binary data. Because the book is expensive you might try to get it on interlibrary loan.
Also, don't overlook the obvious Burrows-Wheeler (bzip2) or Lempel-Ziv (gzip, zlib) techniques. It's quite possible that one of these techniques will work well for your application, so before investigating alternatives, try compressing your data with standard tools.
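A quick way to run that last experiment with zlib's one-shot API (the all-'A' buffer is a placeholder for your real data):

    #include <iostream>
    #include <vector>
    #include <zlib.h>

    int main() {
        // Placeholder input; substitute the real data here.
        std::vector<unsigned char> input(4096, 'A');

        uLongf compressed_size = compressBound(input.size());
        std::vector<unsigned char> output(compressed_size);

        // One-shot compression at the highest level, just to measure the ratio.
        int rc = compress2(output.data(), &compressed_size,
                           input.data(), input.size(), Z_BEST_COMPRESSION);
        if (rc != Z_OK) {
            std::cerr << "compress2 failed: " << rc << "\n";
            return 1;
        }
        std::cout << input.size() << " bytes -> " << compressed_size << " bytes\n";
    }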