Are algorithms that can compress data dramatically (like 70%-80%) a big deal? Are there any? (Question) - compression

Note: I apologize if it isn't right to post this here
Hello,
As the title says, are algorithms that can compress data by a significant amount a big deal? Are there any data compressors out there that can compress any type of data by 70%-80%, or even 99% in some cases? I know of JPEG, but it is only for images.
I think mine would be able to do that; however, it is still a prototype and currently very slow (90 kB -> 11.284 kB takes 3 minutes) in both compression and decompression. You are welcome to say whether you think this is a fraud or a fake, because I was told this was impossible as well. However, I will not explain how I was able to build mine, as I am afraid I would lose my leverage.
If I can make this algorithm much, much faster, would it be worth anything? I would like to make some money with it; are there ways to monetize it? I am currently in financial need, and since I am still in college, it would let me drop out and start a small startup I have in mind.
Also, if this is worth anything, and if I manage to fix its flaws and decide to monetize it or even sell it, I would also like it to be open source, as I think it would be of great help to the public. Is that possible?
Any insight about this would be appreciated! :)

Yes, most lossless compression algorithms can compress by a factor of a thousand or more if presented with, for example, a long sequence of zero bytes.
No compressor can compress "any type of data" by even one bit, and then decompress it losslessly. If some inputs are compressed, then necessarily some other inputs are expanded, by at least one bit.
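Just to make both points concrete, here is a small sketch assuming zlib is available (this is plain off-the-shelf DEFLATE, not any particular new algorithm): a buffer of zero bytes shrinks by orders of magnitude, while a buffer of pseudo-random bytes barely compresses at all, and may even grow slightly.

```cpp
// Build with: g++ demo.cpp -lz
#include <zlib.h>
#include <cstdio>
#include <cstdlib>
#include <vector>

// Compress a buffer with zlib and return the compressed size.
static uLongf deflated_size(const std::vector<Bytef>& in) {
    uLongf outLen = compressBound(in.size());            // worst-case output size
    std::vector<Bytef> out(outLen);
    compress(out.data(), &outLen, in.data(), in.size()); // default compression level
    return outLen;                                        // actual compressed size
}

int main() {
    std::vector<Bytef> zeros(1 << 20, 0);                 // 1 MiB of zero bytes
    std::vector<Bytef> noise(1 << 20);
    for (auto& b : noise) b = static_cast<Bytef>(rand()); // looks incompressible

    std::printf("zeros : %lu -> %lu bytes\n",
                (unsigned long)zeros.size(), (unsigned long)deflated_size(zeros));
    std::printf("random: %lu -> %lu bytes\n",
                (unsigned long)noise.size(), (unsigned long)deflated_size(noise));
}
```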

Related

Have MonetDB's developers tested any other compression algorithms on it?

Have MonetDB's developers tested any other compression algorithms on it before?
Perhaps they have tested other compression algorithms, but those had a negative performance impact.
So why haven't they improved this database's compression performance?
I am a student from China. MonetDB really interests me, and I want to try to improve its compression performance.
So I want to make sure whether anybody has done this before.
I would be grateful if you could answer my question.
I really need this.
Thank you so much.
MonetDB only compresses string (varchar and char) types, using dictionary compression, and only if the number of unique strings in a column is small.
Integrating any other kind of compression (even simple ones like prefix coding, run-length encoding, delta compression, ...) would need a complete rewrite of the system, because the operators have to be made compression-aware (which pretty much means creating a new operator).
The only thing that may be feasible without a complete rewrite is having dedicated compression operators that compress/decompress data instead of spilling to disk. However, this would be very close to the memory compression Apple implemented in Mavericks.
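For illustration only (this is not MonetDB's actual implementation), a toy sketch of dictionary encoding for a string column: each distinct value is stored once, and the column itself becomes a vector of small integer codes, which only pays off when there are few distinct strings.

```cpp
#include <cstdint>
#include <iostream>
#include <string>
#include <unordered_map>
#include <vector>

struct DictColumn {
    std::vector<std::string> dict;                  // code -> string
    std::unordered_map<std::string, uint32_t> ids;  // string -> code
    std::vector<uint32_t> codes;                    // the encoded column

    void append(const std::string& s) {
        auto it = ids.find(s);
        if (it == ids.end()) {                      // first time we see this value
            it = ids.emplace(s, static_cast<uint32_t>(dict.size())).first;
            dict.push_back(s);
        }
        codes.push_back(it->second);
    }
    const std::string& get(size_t row) const { return dict[codes[row]]; }
};

int main() {
    DictColumn country;
    for (auto s : {"NL", "DE", "NL", "NL", "CN", "DE"}) country.append(s);
    std::cout << country.get(4) << ", distinct values: " << country.dict.size() << "\n";
}
```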
MonetDB compresses columns using PFOR compression. See http://paperhub.s3.amazonaws.com/7558905a56f370848a04fa349dd8bb9d.pdf for details. This also answers your question about whether other compression methods were checked.
PFOR was chosen because of the way modern CPUs work, but really any algorithm that works with (only) arithmetic rather than branches will do just fine. I've hit similar speeds with arithmetic coding in the past.
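As a rough illustration of the idea (not MonetDB's code, and leaving out PFOR's "patched" handling of outliers), frame-of-reference coding stores a block's minimum once and keeps only small per-value offsets, so the decode loop is pure arithmetic with no branches.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

struct ForBlock {
    int32_t base;                   // block minimum (the "frame of reference")
    std::vector<uint16_t> offsets;  // value - base; assumed to fit in 16 bits here
};

ForBlock for_encode(const std::vector<int32_t>& values) {
    ForBlock b;
    b.base = *std::min_element(values.begin(), values.end());
    b.offsets.reserve(values.size());
    for (int32_t v : values)
        b.offsets.push_back(static_cast<uint16_t>(v - b.base));
    return b;
}

std::vector<int32_t> for_decode(const ForBlock& b) {
    std::vector<int32_t> out(b.offsets.size());
    for (size_t i = 0; i < out.size(); ++i)
        out[i] = b.base + b.offsets[i];   // pure arithmetic in the hot loop
    return out;
}
```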

Best approach to storing scientific data sets on disk C++

I'm currently working on a project that requires working with gigabytes of scientific data sets. The data sets are in the form of very large arrays (30,000 elements) of integers and floating-point numbers. The problem is that they are too large to fit into memory, so I need an on-disk solution for storing and working with them. To make this problem even more fun, I am restricted to a 32-bit architecture (as this is for work), and I need to maximize performance for this solution.
So far, I've worked with HDF5, which worked okay, but I found it a little too complicated to work with. So, I thought the next best thing would be to try a NoSQL database, but I couldn't find a good way to store the arrays in the database short of casting them to character arrays and storing them like that, which caused a lot of bad pointer headaches.
So, I'd like to know what you guys recommend. Maybe you have a less painful way of working with HDF5 while at the same time maximizing performance. Or maybe you know of a NoSQL database that works well for storing this type of data. Or maybe I'm going in the totally wrong direction with this and you'd like to smack some sense into me.
Anyway, I'd appreciate any words of wisdom you guys can offer me :)
Smack some sense into yourself and use a production-grade library such as HDF5. So you found it too complicated, but did you try its high-level APIs?
If you don't like that answer, try one of the emerging array databases such as SciDB, rasdaman or MonetDB. I suspect, though, that if you have baulked at HDF5 you'll baulk at any of these.
In my view, and experience, it is worth the effort to learn how to properly use a tool such as HDF5 if you are going to be working with large scientific data sets for any length of time. If you pick up a tool such as a NoSQL database, which was not designed for the task at hand, then, while it may initially be easier to use, eventually (before very long would be my guess) it will lack features you need or want and you will find yourself having to program around its deficiencies.
Pick one of the right tools for the job and learn how to use it properly.
Assuming your data sets really are large enough to merit it (e.g., instead of 30,000 elements, a 30,000x30,000 array of doubles), you might want to consider STXXL. It provides interfaces that are intended to (and largely succeed at) imitating those of the collections in the C++ standard library, but are designed to work with data too large to fit in memory.
I have been working in scientific computing for years, and I think HDF5 or NetCDF is a good data format for you to work with. They provide efficient parallel read/write, which is important for dealing with big data.
An alternative solution is to use an array database, like SciDB, MonetDB, or RasDaMan. However, it will be kind of painful if you try to load HDF5 data into an array database. I once tried to load HDF5 data into SciDB, but it required a series of data transformations. You need to know whether you will query the data often or not. If not often, the time-consuming loading may not be worth it.
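To show that the basic HDF5 workflow is not too bad, here is a minimal sketch of writing a 1-D array of doubles with the plain HDF5 C API (usable from C++; the higher-level wrappers sit on top of the same calls). The file and dataset names are placeholders for this example.

```cpp
// Link with -lhdf5
#include <hdf5.h>
#include <vector>

int main() {
    std::vector<double> values(30000, 1.5);
    hsize_t dims[1] = { values.size() };

    hid_t file  = H5Fcreate("data.h5", H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);
    hid_t space = H5Screate_simple(1, dims, NULL);                 // 1-D dataspace
    hid_t dset  = H5Dcreate2(file, "/values", H5T_NATIVE_DOUBLE, space,
                             H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);

    H5Dwrite(dset, H5T_NATIVE_DOUBLE, H5S_ALL, H5S_ALL, H5P_DEFAULT, values.data());

    H5Dclose(dset);
    H5Sclose(space);
    H5Fclose(file);
    return 0;
}
```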
You may be interested in this paper.
It allows you to query HDF5 data directly using SQL.

Why don't we use word ranks for string compression?

I have 3 main questions:
Let's say I have a large text file. (1) Is replacing the words with their rank an effective way to compress the file? (I got an answer to this question: it is a bad idea.)
Also, I have come up with a new compression algorithm. I read about some widely used compression models and found that they use some pretty advanced concepts like statistical redundancy and probabilistic prediction. My algorithm does not use these concepts and is rather a simple set of rules to be followed while compressing and decompressing. (2) My question is: am I wasting my time trying to come up with a new compression algorithm without having enough knowledge about existing compression schemes?
(3) Furthermore, if I manage to successfully compress a string, can I extend my algorithm to other content like videos, images, etc.?
(I understand that the third question is difficult to answer without knowledge of the compression algorithm. But the algorithm is so rudimentary and nascent that I feel ashamed to share it. Please feel free to ignore the third question if you have to.)
Your question doesn't make sense as it stands (see answer #2), but I'll try to rephrase, and you can let me know if I capture your question. Would modeling text using the probabilities of individual words make for a good text compression algorithm? Answer: no. That would be a zeroth-order model, and it would not be able to take advantage of higher-order correlations, such as the conditional probability of a given word following the previous word. Simple existing text compressors that look for matching strings and exploit varying character probabilities would perform better.
Yes, you are wasting your time trying to come up with a new compression algorithm without having enough knowledge about existing compression schemes. You should first learn about the techniques that have been applied over time to model data, textual and others, and the approaches to use the modeled information to compress the data. You need to study what has already been researched for decades before developing a new approach.
The compression part may extend, but the modeling part won't.
Do you mean having a ranking table of words sorted by frequency and assigning smaller "symbols" to the words that are repeated the most, thereby reducing the amount of information that needs to be transmitted?
That's basically how Huffman coding works. The problem with compression is that you always hit a limit somewhere along the road. Of course, if the set of things you try to compress follows a particular pattern/distribution, then it's possible to be really efficient about it, but for general purposes (audio/video/text/encrypted data that appears to be random) there is no (and I believe there can't be) "best" compression technique.
Huffman coding uses letter frequencies. You can do the same with words, or with letter frequencies in more dimensions, i.e. combinations of letters and their frequencies.
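To make the idea in the question concrete, here is a small, purely illustrative sketch that ranks words by frequency so the most common words get the smallest ranks, which could then be written with the fewest bits. As noted above, this is a zeroth-order word model: it ignores which word tends to follow which, which is why real text compressors do better.

```cpp
#include <algorithm>
#include <iostream>
#include <map>
#include <string>
#include <utility>
#include <vector>

int main() {
    std::vector<std::string> text = {"the","cat","sat","on","the","mat","the","cat"};

    // Count word frequencies.
    std::map<std::string, int> freq;
    for (const auto& w : text) ++freq[w];

    // Sort words by descending frequency; the position in this list is the rank.
    std::vector<std::pair<std::string, int>> ranked(freq.begin(), freq.end());
    std::sort(ranked.begin(), ranked.end(),
              [](const std::pair<std::string, int>& a,
                 const std::pair<std::string, int>& b) { return a.second > b.second; });

    std::map<std::string, size_t> rank;
    for (size_t i = 0; i < ranked.size(); ++i) rank[ranked[i].first] = i;

    // Replace each word by its rank; frequent words get small numbers.
    for (const auto& w : text) std::cout << rank[w] << ' ';
    std::cout << '\n';
}
```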

File compression/checking for data corruption in C/C++

For large files or other files that are not necessarily text, how can I compress them, and what are the most efficient methods to check for data corruption? Any tutorials on these kinds of algorithms would be greatly appreciated.
For compression, LZO should be helpful. It is easy to use and the library is easily available.
For a data corruption check, a CRC can help:
http://cppgm.blogspot.in/2008/10/calculation-of-crc.html
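For reference, a minimal bitwise CRC-32 (the common IEEE/zlib polynomial, reflected form) looks like this. Table-driven versions are much faster; this is just to show the mechanics of the check.

```cpp
#include <cstdint>
#include <cstdio>
#include <cstring>

// Standard CRC-32 (reflected polynomial 0xEDB88320), processed bit by bit.
uint32_t crc32(const uint8_t* data, size_t len) {
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < len; ++i) {
        crc ^= data[i];
        for (int bit = 0; bit < 8; ++bit)
            crc = (crc & 1u) ? (crc >> 1) ^ 0xEDB88320u : (crc >> 1);
    }
    return crc ^ 0xFFFFFFFFu;
}

int main() {
    const char* msg = "123456789";
    // The well-known check value for "123456789" with this CRC is CBF43926.
    std::printf("%08X\n",
                (unsigned)crc32(reinterpret_cast<const uint8_t*>(msg), std::strlen(msg)));
}
```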
For general compression, I would recommend Huffman coding. It's very easy to learn, a full-featured (2-pass) coder/decoder can be written in <4 hours if you understand it. It is part of DEFLATE which is part of the .zip format. Once you have that down, learn LZ77, then put them together and make your own DEFLATE implementation.
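As a rough sketch of the first step (building the Huffman code from symbol frequencies; the bit-level I/O and the container format are left out), something like the following: count byte frequencies, repeatedly merge the two lightest trees, then read the codes off the tree.

```cpp
#include <iostream>
#include <memory>
#include <queue>
#include <string>
#include <unordered_map>
#include <vector>

struct Node {
    long freq;
    int symbol;                          // -1 for internal nodes
    std::unique_ptr<Node> left, right;
};

struct ByFreq {
    bool operator()(const Node* a, const Node* b) const { return a->freq > b->freq; }
};

// Walk the tree, appending "0" for left edges and "1" for right edges.
void collect(const Node* n, const std::string& prefix,
             std::unordered_map<int, std::string>& codes) {
    if (n->symbol >= 0) { codes[n->symbol] = prefix.empty() ? "0" : prefix; return; }
    collect(n->left.get(),  prefix + "0", codes);
    collect(n->right.get(), prefix + "1", codes);
}

int main() {
    std::string input = "abracadabra";
    long freq[256] = {0};
    for (unsigned char c : input) ++freq[c];      // pass 1: count symbol frequencies

    std::priority_queue<Node*, std::vector<Node*>, ByFreq> pq;
    for (int s = 0; s < 256; ++s)
        if (freq[s]) pq.push(new Node{freq[s], s, nullptr, nullptr});

    while (pq.size() > 1) {                       // merge the two least frequent trees
        Node* a = pq.top(); pq.pop();
        Node* b = pq.top(); pq.pop();
        pq.push(new Node{a->freq + b->freq, -1,
                         std::unique_ptr<Node>(a), std::unique_ptr<Node>(b)});
    }

    std::unique_ptr<Node> root(pq.top());
    std::unordered_map<int, std::string> codes;
    collect(root.get(), "", codes);
    for (const auto& kv : codes)
        std::cout << static_cast<char>(kv.first) << " -> " << kv.second << '\n';
}
```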
Alternatively, use zlib, the library everyone uses for zip files.
For large files, I wouldn't recommend CRC32 like everyone is telling you. Larger files suffer from birthday-style collisions pretty easily. What I mean is that as a file gets larger, a 32-bit checksum can only catch an increasingly limited share of possible errors. A fast implementation of a hash, say MD5, would serve you well. Yes, MD5 is cryptographically broken, but I'm assuming, considering your question, that you're not working on a security-conscious problem.
Hamming codes are a possibility. The idea is to insert a few parity ("sum") bits for every N bits of data, and set each of them to 0 or 1 such that the sum of a particular subset of the data bits and parity bits is always 1. When one of those sums is not 1, looking at which checks failed tells you which data bit was corrupted.
There are lots of other possibilities, as the previous post says.
http://en.wikipedia.org/wiki/Hamming_code#General_algorithm
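Here is a minimal Hamming(7,4) sketch using the usual even-parity convention (the description above uses an odd-parity variant; the mechanism is the same): three parity bits each cover a different subset of positions, and the pattern of failed checks points directly at the flipped bit.

```cpp
#include <cstdint>
#include <cstdio>

// Encode 4 data bits d3 d2 d1 d0 (d0 = LSB) into a 7-bit codeword,
// where codeword bit i (0-based) holds Hamming position i+1.
uint8_t hamming74_encode(uint8_t data) {
    uint8_t d0 = data & 1, d1 = (data >> 1) & 1, d2 = (data >> 2) & 1, d3 = (data >> 3) & 1;
    uint8_t p1 = d0 ^ d1 ^ d3;   // covers positions 3, 5, 7
    uint8_t p2 = d0 ^ d2 ^ d3;   // covers positions 3, 6, 7
    uint8_t p3 = d1 ^ d2 ^ d3;   // covers positions 5, 6, 7
    // positions:      1        2         3         4         5         6         7
    return p1 | (p2 << 1) | (d0 << 2) | (p3 << 3) | (d1 << 4) | (d2 << 5) | (d3 << 6);
}

// Correct a single-bit error (if any) and return the 4 data bits.
uint8_t hamming74_decode(uint8_t cw) {
    auto bit = [&](int pos) { return (cw >> (pos - 1)) & 1; };   // 1-based position
    int s1 = bit(1) ^ bit(3) ^ bit(5) ^ bit(7);
    int s2 = bit(2) ^ bit(3) ^ bit(6) ^ bit(7);
    int s3 = bit(4) ^ bit(5) ^ bit(6) ^ bit(7);
    int syndrome = s1 | (s2 << 1) | (s3 << 2);   // 0 = no error, else the error position
    if (syndrome) cw ^= 1 << (syndrome - 1);     // flip the bad bit back
    return (uint8_t)(((cw >> 2) & 1) | (((cw >> 4) & 1) << 1) |
                     (((cw >> 5) & 1) << 2) | (((cw >> 6) & 1) << 3));
}

int main() {
    uint8_t cw = hamming74_encode(0xB);          // data bits 1011
    cw ^= 1 << 4;                                // flip one bit "in transit"
    std::printf("recovered: %u\n", hamming74_decode(cw));   // prints 11 (binary 1011)
}
```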

Best compression algorithm with the following features

What is the best compression algorithm with the following features:
should take less time to decompress (it can take reasonably more time to compress)
should be able to compress sorted data (an approx. list of 3,000,000 strings/integers ...)
Please suggest options along with metrics: compression ratio, and algorithmic complexity for compression and decompression (if possible).
Entire site devoted to compression benchmarking here
Well, if you just want speed, then standard ZIP compression is just fine, and it's most likely integrated into your language/framework already (e.g., .NET has it, Java has it). Sometimes the most universal solution is the best: ZIP is a very mature format, and any ZIP library and application will work with any other.
But if you want better compression, I'd suggest 7-Zip, as the author is very smart, easy to get hold of, and encourages people to use the format.
Providing you with compression times is impossible, as it's directly related to your hardware. If you want a benchmark, you have to do it yourself.
You don't have to worry about decompression time. The extra time spent at a higher compression level mostly goes into finding the longest matching pattern.
Decompression either:
1) writes the literal, or
2) for a (backward position, length) = (m, n) pair, goes back m bytes in the output buffer, reads n bytes, and writes those n bytes at the end of the buffer.
So the decompression time is independent of the compression level. And, from my experience with the Universal Decompression Virtual Machine (RFC 3320), I guess the same is true for any decompression algorithm.
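A sketch of those two decode cases, using a made-up in-memory token format just for the example (real formats such as DEFLATE or LZ4 differ in how tokens are encoded, not in the idea):

```cpp
#include <cstddef>
#include <string>
#include <vector>

struct Token {
    bool is_literal;
    unsigned char literal;    // used when is_literal is true
    size_t distance, length;  // otherwise: copy `length` bytes from `distance` back
};

std::string lz_decode(const std::vector<Token>& tokens) {
    std::string out;
    for (const Token& t : tokens) {
        if (t.is_literal) {
            out.push_back(static_cast<char>(t.literal));   // case 1: emit the literal
        } else {
            size_t start = out.size() - t.distance;        // case 2: back-reference
            for (size_t i = 0; i < t.length; ++i)
                out.push_back(out[start + i]);             // byte by byte so overlaps work
        }
    }
    return out;
}

int main() {
    // "abab" followed by "copy 4 bytes starting 4 back"  ->  "abababab"
    std::vector<Token> toks = {
        {true, 'a', 0, 0}, {true, 'b', 0, 0}, {true, 'a', 0, 0}, {true, 'b', 0, 0},
        {false, 0, 4, 4},
    };
    return lz_decode(toks) == "abababab" ? 0 : 1;
}
```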
This is an interesting question.
On such sorted data of strings and integers, I would expect difference-coding approaches to outperform any out-of-the-box text compression approach such as LZ77 or LZ78 in terms of compression ratio. General-purpose encoders do not exploit the special properties of the data.
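For the sorted-integer case, the core of difference (delta) coding is tiny: store the first value and then only the gaps between neighbours. On sorted data the gaps are small, so a varint or bit-packed encoding of the gaps (not shown here) compresses far better than handing the raw list to a general-purpose LZ coder.

```cpp
#include <cstdint>
#include <vector>

// Turn a sorted list into gaps: {1000, 1003, 1004} -> {1000, 3, 1}.
std::vector<uint64_t> delta_encode(const std::vector<uint64_t>& sorted) {
    std::vector<uint64_t> gaps;
    gaps.reserve(sorted.size());
    uint64_t prev = 0;
    for (uint64_t v : sorted) { gaps.push_back(v - prev); prev = v; }
    return gaps;
}

// Rebuild the original list by running a prefix sum over the gaps.
std::vector<uint64_t> delta_decode(const std::vector<uint64_t>& gaps) {
    std::vector<uint64_t> out;
    out.reserve(gaps.size());
    uint64_t acc = 0;
    for (uint64_t g : gaps) { acc += g; out.push_back(acc); }
    return out;
}

int main() {
    std::vector<uint64_t> ids = {1000, 1003, 1004, 1100, 1101};
    auto gaps = delta_encode(ids);          // {1000, 3, 1, 96, 1}
    return delta_decode(gaps) == ids ? 0 : 1;
}
```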