Have the MonetDB developers tested any other compression algorithms on it? - compression

Have the MonetDB developers tested any other compression algorithms on it before?
Perhaps they have tested other compression algorithms, but found that they had a negative performance impact.
If so, why haven't they improved this database's compression performance?
I am a student from China. MonetDB really interests me, and I want to try to improve its compression performance.
So I want to make sure whether anybody has done this before.
I would be grateful if you could answer my question, because I really need this information.
Thank you so much.

MonetDB only compresses string types (VARCHAR and CHAR), using dictionary compression, and only if the number of unique strings in a column is small.
Integrating any other kind of compression (even simple ones like prefix coding, run-length encoding, delta compression, ...) needs a complete rewrite of the system, because the operators have to be made compression-aware (which pretty much means creating a new operator).
The only thing that may be feasible without a complete rewrite is having dedicated compression operators that compress/decompress data instead of spilling to disk. However, this would be very close to the memory compression Apple implemented in OS X Mavericks.
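To make the dictionary-compression idea concrete, here is a minimal sketch in C++ (illustrative only, not MonetDB's internal format): each distinct string is stored once in a dictionary and the column itself keeps only small integer codes, which is why it only pays off when the number of unique strings is small.

    #include <cstdint>
    #include <string>
    #include <unordered_map>
    #include <vector>

    // Minimal dictionary encoder for a string column (illustrative only,
    // not MonetDB's internal format). Each distinct string is stored once;
    // the column itself holds small integer codes into the dictionary.
    struct DictEncodedColumn {
        std::vector<std::string> dictionary;             // code -> string
        std::unordered_map<std::string, uint32_t> index; // string -> code
        std::vector<uint32_t> codes;                     // one code per row

        void append(const std::string& value) {
            auto it = index.find(value);
            if (it == index.end()) {
                uint32_t code = static_cast<uint32_t>(dictionary.size());
                dictionary.push_back(value);
                it = index.emplace(value, code).first;
            }
            codes.push_back(it->second);
        }

        const std::string& get(size_t row) const { return dictionary[codes[row]]; }
    };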

MonetDB compresses columns using PFOR compression. See http://paperhub.s3.amazonaws.com/7558905a56f370848a04fa349dd8bb9d.pdf for details. This also answers your question about whether other compression methods were checked.
PFOR was chosen because of the way modern CPUs work, but really any algorithm that avoids branches and relies only on arithmetic will do just fine. I've hit similar speeds with arithmetic coding in the past.
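For intuition, here is a heavily simplified frame-of-reference sketch in the spirit of PFOR (the real algorithm bit-packs the deltas tightly; the names and the 8-bit delta width here are just illustrative assumptions). The decode loop is branch-free, with rare large values "patched" in afterwards.

    #include <algorithm>
    #include <cstdint>
    #include <utility>
    #include <vector>

    // Heavily simplified frame-of-reference encoding in the spirit of PFOR.
    // Assumes a non-empty block of values.
    struct ForBlock {
        uint32_t base;                 // minimum value of the block
        std::vector<uint8_t> deltas;   // small offsets from the base
        std::vector<std::pair<size_t, uint32_t>> exceptions; // values too large for 8 bits
    };

    ForBlock for_encode(const std::vector<uint32_t>& values) {
        ForBlock block;
        block.base = *std::min_element(values.begin(), values.end());
        for (size_t i = 0; i < values.size(); ++i) {
            uint32_t delta = values[i] - block.base;
            if (delta > 0xFF) {                      // rare large value: store as exception
                block.exceptions.emplace_back(i, values[i]);
                block.deltas.push_back(0);
            } else {
                block.deltas.push_back(static_cast<uint8_t>(delta));
            }
        }
        return block;
    }

    std::vector<uint32_t> for_decode(const ForBlock& block) {
        std::vector<uint32_t> out(block.deltas.size());
        for (size_t i = 0; i < out.size(); ++i)      // tight, predictable, branch-free loop
            out[i] = block.base + block.deltas[i];
        for (const auto& e : block.exceptions)       // patch the exceptions afterwards
            out[e.first] = e.second;
        return out;
    }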

Related

Best approach to storing scientific data sets on disk C++

I'm currently working on a project that requires working with gigabytes of scientific data sets. The data sets are in the form of very large arrays (30,000 elements) of integers and floating point numbers. The problem here is that they are too large to fit into memory, so I need an on-disk solution for storing and working with them. To make this problem even more fun, I am restricted to using a 32-bit architecture (as this is for work) and I need to try to maximize performance for this solution.
So far, I've worked with HDF5, which worked okay, but I found it a little too complicated to work with. So, I thought the next best thing would be to try a NoSQL database, but I couldn't find a good way to store the arrays in the database short of casting them to character arrays and storing them like that, which caused a lot of bad pointer headaches.
So, I'd like to know what you guys recommend. Maybe you have a less painful way of working with HDF5 while at the same time maximizing performance. Or maybe you know of a NoSQL database that works well for storing this type of data. Or maybe I'm going in the totally wrong direction with this and you'd like to smack some sense into me.
Anyway, I'd appreciate any words of wisdom you guys can offer me :)
Smack some sense into yourself and use a production-grade library such as HDF5. So you found it too complicated, but did you try its high-level APIs?
If you don't like that answer, try one of the emerging array databases such as SciDB, rasdaman or MonetDB. I suspect though, that if you have baulked at HDF5 you'll baulk at any of these.
In my view, and experience, it is worth the effort to learn how to properly use a tool such as HDF5 if you are going to be working with large scientific data sets for any length of time. If you pick up a tool such as a NoSQL database, which was not designed for the task at hand, then, while it may initially be easier to use, eventually (before very long would be my guess) it will lack features you need or want and you will find yourself having to program around its deficiencies.
Pick one of the right tools for the job and learn how to use it properly.
Assuming your data sets really are large enough to merit it (e.g., instead of 30,000 elements, a 30,000x30,000 array of doubles), you might want to consider STXXL. It provides interfaces that are intended to (and largely succeed at) imitate those of the collections in the C++ standard library, but are designed to work with data too large to fit in memory.
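As a rough illustration of what that looks like in code (a sketch based on STXXL's documented VECTOR_GENERATOR interface; check your version's documentation for the exact template parameters):

    #include <cstddef>
    #include <stxxl/vector>

    int main() {
        // External-memory vector: the data lives mostly on disk, with STXXL
        // managing block-wise caching in RAM.
        typedef stxxl::VECTOR_GENERATOR<double>::result ext_vector;
        ext_vector values;

        for (std::size_t i = 0; i < 100 * 1000 * 1000; ++i)
            values.push_back(static_cast<double>(i) * 0.5);

        double sum = 0.0;
        for (ext_vector::const_iterator it = values.begin(); it != values.end(); ++it)
            sum += *it;   // iteration looks like std::vector, but streams from disk
        return sum > 0 ? 0 : 1;
    }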
I have been working on scientific computing for years, and I think HDF5 or NetCDF is a good data format for you to work with. It can provide efficient parallel read/write, which is important for dealing with big data.
An alternative solution is to use an array database, like SciDB, MonetDB, or RasDaMan. However, it will be somewhat painful if you try to load HDF5 data into an array database. I once tried to load HDF5 data into SciDB, but it required a series of data transformations. You need to know whether you will query the data often or not. If not often, then the time-consuming loading may not be worth it.
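For a flavor of what writing such an array with the plain HDF5 C API looks like, here is a minimal sketch (the file and dataset names are made up for illustration; error checking omitted):

    #include <vector>
    #include "hdf5.h"

    // Write a 1-D array of doubles to an HDF5 dataset using the C API.
    int main() {
        const hsize_t n = 30000;
        std::vector<double> data(n, 1.0);

        hid_t file  = H5Fcreate("measurements.h5", H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);
        hid_t space = H5Screate_simple(1, &n, NULL);
        hid_t dset  = H5Dcreate2(file, "/values", H5T_NATIVE_DOUBLE, space,
                                 H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);

        H5Dwrite(dset, H5T_NATIVE_DOUBLE, H5S_ALL, H5S_ALL, H5P_DEFAULT, data.data());

        H5Dclose(dset);
        H5Sclose(space);
        H5Fclose(file);
        return 0;
    }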
You may be interested in this paper.
It allows you to query HDF5 data directly using SQL.

Why don't we use word ranks for string compression?

I have 3 main questions:
Let's say I have a large text file. (1) Is replacing the words with their rank an effective way to compress the file? (I got an answer to this question: it is a bad idea.)
Also, I have come up with a new compression algorithm. I read about some existing compression models that are widely used, and I found out they use some pretty advanced concepts like statistical redundancy and probabilistic prediction. My algorithm does not use these concepts and is a rather simple set of rules that need to be followed while compressing and decompressing. (2) My question is: am I wasting my time trying to come up with a new compression algorithm without having enough knowledge about existing compression schemes?
(3) Furthermore, if I manage to successfully compress a string, can I extend my algorithm to other content like videos, images, etc.?
(I understand that the third question is difficult to answer without knowledge about the compression algorithm. But I am afraid the algorithm is so rudimentary and nascent I feel ashamed about sharing it. Please feel free to ignore the third question if you have to)
Your question doesn't make sense as it stands (see answer #2), but I'll try to rephrase and you can let me know if I capture your question. Would modeling text using the probability of individual words make for a good text compression algorithm? Answer: No. That would be a zeroth order model, and would not be able to take advantage of higher order correlations, such as the conditional probability of a given word following the previous word. Simple existing text compressors that look for matching strings and varied character probabilities would perform better.
Yes, you are wasting your time trying to come up with a new compression algorithm without having enough knowledge about existing compression schemes. You should first learn about the techniques that have been applied over time to model data, textual and others, and the approaches to use the modeled information to compress the data. You need to study what has already been researched for decades before developing a new approach.
The compression part may extend, but the modeling part won't.
Do you mean like having a ranking table of words sorted by frequency and assign smaller "symbols" to those words that are repeated the most, therefore reducing the amount of information that needs to be transmitted?
That's basically how Huffman coding works. The problem with compression is that you always hit a limit somewhere along the road. Of course, if the set of things you try to compress follows a particular pattern/distribution, then it's possible to be really efficient about it, but for general purposes (audio/video/text/encrypted data that appears to be random) there is no (and I believe there can't be) "best" compression technique.
Huffman coding uses letter frequencies. You can do the same with words, or with letter frequencies in more dimensions, i.e., combinations of letters and their frequencies.
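To make that idea concrete, here is a toy sketch that builds Huffman codes over whole words instead of letters, so frequent words get shorter codes (illustrative only; a real compressor would also have to transmit the code table and pack the bits):

    #include <iostream>
    #include <map>
    #include <memory>
    #include <queue>
    #include <sstream>
    #include <string>
    #include <vector>

    // Toy Huffman coder over whole words: frequent words get shorter codes.
    struct Node {
        long freq;
        std::string word;                 // empty for internal nodes
        std::unique_ptr<Node> left, right;
    };

    struct Compare {
        bool operator()(const Node* a, const Node* b) const { return a->freq > b->freq; }
    };

    void collect(const Node* n, const std::string& prefix,
                 std::map<std::string, std::string>& codes) {
        if (!n->left && !n->right) { codes[n->word] = prefix.empty() ? "0" : prefix; return; }
        if (n->left)  collect(n->left.get(),  prefix + "0", codes);
        if (n->right) collect(n->right.get(), prefix + "1", codes);
    }

    int main() {
        std::string text = "the cat sat on the mat the cat ran";
        std::map<std::string, long> freq;
        std::istringstream in(text);
        for (std::string w; in >> w; ) ++freq[w];

        // Build the Huffman tree by repeatedly merging the two rarest nodes.
        std::priority_queue<Node*, std::vector<Node*>, Compare> pq;
        for (const auto& p : freq) pq.push(new Node{p.second, p.first, nullptr, nullptr});
        while (pq.size() > 1) {
            Node* a = pq.top(); pq.pop();
            Node* b = pq.top(); pq.pop();
            pq.push(new Node{a->freq + b->freq, "",
                             std::unique_ptr<Node>(a), std::unique_ptr<Node>(b)});
        }
        std::unique_ptr<Node> root(pq.top());

        std::map<std::string, std::string> codes;
        collect(root.get(), "", codes);
        for (const auto& c : codes)
            std::cout << c.first << " -> " << c.second << "\n";  // e.g. "the" gets a short code
    }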

Fast implementation of MD5 in C++

First of all, to be clear, I'm aware that a huge number of MD5 implementations exist in C++. The problem here is I'm wondering if there is a comparison of which implementation is faster than the others. Since I'm using this MD5 hash function on files with size larger than 10GB, speed indeed is a major concern here.
I think the point avakar is trying to make is: with modern processing power, the IO speed of your hard drive is the bottleneck, not the calculation of the hash. Getting a more efficient algorithm will not help you, as that is not (likely) the slowest part.
If you are doing anything special (thousands of rounds, for example) then it may be different, but if you are just calculating the hash of a file, you need to speed up your IO, not your math.
I don't think it matters much (on the same hardware; but GPGPUs are indeed different, and perhaps faster, hardware for that kind of problem). The main part of MD5 is a rather complex loop of arithmetic operations. What does matter is the quality of compiler optimizations.
And what also matters is how you read the file. On Linux, mmap, madvise and readahead could be relevant. Disk speed is probably the bottleneck (use an SSD if you can).
And are you sure you want MD5 specifically? There are simpler and faster hashing algorithms (MD4, etc.). Still, your problem is more I/O-bound than CPU-bound.
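To illustrate the I/O-bound point, here is a minimal sketch that streams a large file through MD5 in big chunks, assuming OpenSSL 1.1+ and its EVP API are available (error handling trimmed):

    #include <cstdio>
    #include <vector>
    #include <openssl/evp.h>

    // Hash a large file by streaming it in big chunks. On files of tens of
    // gigabytes the reads, not the MD5 arithmetic, dominate the run time.
    bool md5_file(const char* path, unsigned char digest[EVP_MAX_MD_SIZE], unsigned int* len) {
        std::FILE* f = std::fopen(path, "rb");
        if (!f) return false;

        EVP_MD_CTX* ctx = EVP_MD_CTX_new();
        EVP_DigestInit_ex(ctx, EVP_md5(), nullptr);

        std::vector<unsigned char> buf(1 << 20);          // 1 MiB read buffer
        size_t n;
        while ((n = std::fread(buf.data(), 1, buf.size(), f)) > 0)
            EVP_DigestUpdate(ctx, buf.data(), n);

        EVP_DigestFinal_ex(ctx, digest, len);
        EVP_MD_CTX_free(ctx);
        std::fclose(f);
        return true;
    }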
I'm sure there are plenty of CUDA/OpenCL adaptations of the algorithm out there which should give you a definite speedup. You could also take the basic algorithm, think a bit, and get a CUDA/OpenCL implementation going.
Block-ciphers are perfect candidates for this type of implementation.
You could also get a C implementation of it and grab a copy of the Intel C compiler and see how good that is. The vectorization extensions in Intel CPUs are amazing for speed boosts.
A table is available here:
http://www.golubev.com/gpuest.htm
It looks like your bottleneck will probably be your hard drive IO.

Packet oriented lossless compression library

Does anyone know of a free (non-GPL), decently performing compression library that supports packet oriented compression in C/C++?
By packet oriented, I mean the kind of feature QuickLZ (GPL) has, where multiple packets of a stream can be compressed and decompressed individually while a history is maintained across packets to achieve sensible compression.
I'd favor compression ratio over CPU usage as long as the CPU usage isn't ridiculous, but I've had a hard time finding this feature at all, so anything is of interest.
zlib's main deflate() function takes a flush parameter, which allows various flushing modes. If you pass Z_SYNC_FLUSH at the end of each packet, that should produce the desired effect.
The details are explained in the zlib manual.
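A minimal sketch of that approach, assuming a single deflate stream kept alive for the whole connection (error handling trimmed):

    #include <cstring>
    #include <vector>
    #include <zlib.h>

    // One z_stream lives for the whole connection; each packet is flushed with
    // Z_SYNC_FLUSH so it can be decompressed as soon as it arrives, while later
    // packets still benefit from the history of earlier data.
    class PacketCompressor {
    public:
        PacketCompressor() {
            std::memset(&strm_, 0, sizeof(strm_));
            deflateInit(&strm_, Z_DEFAULT_COMPRESSION);
        }
        ~PacketCompressor() { deflateEnd(&strm_); }

        std::vector<unsigned char> compress(const unsigned char* data, size_t len) {
            std::vector<unsigned char> out(deflateBound(&strm_, len) + 16);
            strm_.next_in  = const_cast<unsigned char*>(data);
            strm_.avail_in = static_cast<uInt>(len);
            strm_.next_out  = out.data();
            strm_.avail_out = static_cast<uInt>(out.size());
            deflate(&strm_, Z_SYNC_FLUSH);     // emit a complete, decodable packet
            out.resize(out.size() - strm_.avail_out);
            return out;
        }

    private:
        z_stream strm_;
    };

On the receiving side you would keep a single inflate stream alive for the connection in the same way.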
bzip2 has flushing functionality as well, which might let you do this kind of thing. See http://www.bzip.org/1.0.5/bzip2-manual-1.0.5.html#bzCompress
Google's Snappy may be a good option, if you need speed more than compression and are just looking to save a moderate amount of space.
Alternatively, Ilia Muraviev placed a small piece of compression code called BALZ in the public domain some time ago. It is quite decent for many kinds of data.
Both of these support stream flushing and independent state variables to do multiple, concurrent streams across packets.
Google's new SPDY protocol uses zlib to compress individual messages, and maintains the zlib state for the life of the connection to achieve better compression. I don't think there's a standalone library that handles this behavior exactly, but there are several open-source implementations of SPDY that could show you how it's done.
The public domain Crush algorithm by Ilia Muraviev has a similar performance and compression ratio to QuickLZ, with Crush being a bit more powerful. The algorithms are conceptually similar too, Crush containing a few more tricks. The BALZ algorithm mentioned earlier is also by Ilia Muraviev. See http://compressme.net/
Maybe you could use the LZMA compression SDK; it's written and placed in the public domain by Igor Pavlov.
And since it can compress stream files and has memory-to-memory compression, I think it's possible to compress a packet stream (maybe with some changes), but I'm not sure.

best compression algorithm with the following features

What is the best compression algorithm with the following features:
should take less time to decompress (it can take reasonably more time to compress)
should be able to compress sorted data (approx list of 3,000,000 strings/integers ...)
Please suggest one along with metrics: compression ratio and algorithmic complexity for compression and decompression (if possible).
Entire site devoted to compression benchmarking here
Well if you just want speed, then standard ZIP compression is just fine and it's most likely integrated into your language/framework already (ex: .NET has it, Java has it). Sometimes the most universal solution is the best, ZIP is a very mature format, any ZIP library and application will work with any other.
But if you want better compression, I'd suggest 7-Zip as the author is very smart, easy to get ahold of and encourages people to use the format.
Providing you with compression times is impossible, as it's directly related to your hardware. If you want a benchmark, you have to do it yourself.
You don't have to worry about decompression time. The time spent at higher compression levels is mostly in finding the longest matching pattern.
Decompression either:
1) writes the literal, or
2) for a (backward position, length) = (m, n) pair, goes back m bytes in the output buffer, reads n bytes, and writes those n bytes at the end of the buffer.
So the decompression time is independent of the compression level. And, from my experience with the Universal Decompression Virtual Machine (RFC 3320), I guess the same is true for any decompression algorithm.
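A toy sketch of the copy-back loop described above (not any specific on-disk format; requires C++17):

    #include <cstddef>
    #include <string>
    #include <utility>
    #include <variant>
    #include <vector>

    // A token is either a literal byte or a (distance, length) pair that copies
    // bytes the decoder has already produced.
    using Token = std::variant<unsigned char, std::pair<std::size_t, std::size_t>>;

    std::string lz_decode(const std::vector<Token>& tokens) {
        std::string out;
        for (const Token& t : tokens) {
            if (std::holds_alternative<unsigned char>(t)) {
                out.push_back(static_cast<char>(std::get<unsigned char>(t)));   // 1) literal
            } else {
                auto [dist, len] = std::get<std::pair<std::size_t, std::size_t>>(t);
                std::size_t start = out.size() - dist;                          // 2) go back m bytes
                for (std::size_t i = 0; i < len; ++i)
                    out.push_back(out[start + i]);   // byte-by-byte so overlapping copies work
            }
        }
        return out;
    }

    // e.g. literals 'a', 'b', 'c' followed by the pair (distance 3, length 6)
    // reproduce "abcabcabc".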
This is an interesting question.
On such sorted data of strings and integers, I would expect difference-coding compression approaches to outperform any out-of-the-box text compression approach such as LZ77 or LZ78 in terms of compression ratio. General-purpose encoders do not exploit the special properties of the data.
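A minimal illustration of the difference-coding idea on a sorted integer list (a varint or entropy stage would then encode the small gaps in very few bits):

    #include <cstdint>
    #include <vector>

    // Delta-encode a sorted list: store only the gaps between consecutive values.
    // On sorted data the gaps are small numbers, which compress far better than
    // the raw values.
    std::vector<uint64_t> delta_encode(const std::vector<uint64_t>& sorted) {
        std::vector<uint64_t> gaps;
        gaps.reserve(sorted.size());
        uint64_t prev = 0;
        for (uint64_t v : sorted) {
            gaps.push_back(v - prev);
            prev = v;
        }
        return gaps;
    }

    std::vector<uint64_t> delta_decode(const std::vector<uint64_t>& gaps) {
        std::vector<uint64_t> values;
        values.reserve(gaps.size());
        uint64_t running = 0;
        for (uint64_t g : gaps) {
            running += g;
            values.push_back(running);
        }
        return values;
    }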