MonetDB is a very efficient column-oriented database. I came to know that it uses lightweight compression algorithms to speed things up. Can someone tell me more about the implementation of these compression/decompression algorithms in MonetDB?
There is currently no compression on primitive values such as integers and floating point numbers. Thus, choosing the appropriate type for your data will make a difference once your tables get large.
The string storage uses pointers to a string heap. Hence, for categorical string values that only contain few distinct values, storage will generally be efficient. More advanced compression methods are in the works, but I do not expect them to be available in the next six months.
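To illustrate why a column with only a few distinct strings stays small when each row holds just a reference into a shared heap, here is a minimal Python sketch of dictionary encoding. It is not MonetDB's actual implementation; the names encode_column and decode_column are made up for this example.

```python
from array import array

def encode_column(values):
    """Store each distinct string once; rows hold small integer positions."""
    heap = []            # distinct strings, in first-seen order
    index = {}           # string -> position in heap
    codes = array("I")   # one unsigned int per row
    for v in values:
        pos = index.get(v)
        if pos is None:
            pos = len(heap)
            index[v] = pos
            heap.append(v)
        codes.append(pos)
    return heap, codes

def decode_column(heap, codes):
    """Rebuild the original column by looking each code up in the heap."""
    return [heap[c] for c in codes]

column = ["red", "green", "red", "blue", "green", "red"]
heap, codes = encode_column(column)
```

Six rows are stored as three heap entries plus six small integers, and the saving grows with the number of repeated values.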
Finally, we had great experiences running MonetDB on a force-compressed file system (e.g. BTRFS). This greatly reduces the storage footprint of databases and also reduces the IO time, especially on spinning hard disks.
I'm familiar with the column- vs row-store distinction in how a database internally persists data to disk. My question is whether, for a dataset that is entirely in memory with no storage to disk, the row- vs column-orientation makes much of a difference.
The things I can think of that may make a difference would be:
For fields under 8 bytes, it would involve less memory accesses for columns than for rows.
Compression would also be easier on a column-store regardless of whether in memory or not (seems like a non-issue if not saving back to storage I suppose? does compression ever matter on in-memory operations?)
Possible to vectorize operations.
Much, much easier to work with a struct on a row-by-row basis of course.
Are both of those accurate, and are there any more? Given this, would there be substantial performance improvements on using an in-memory colstore vs rowstore on a read-only dataset, or just a marginal improvement?
I'm familiar with the column- vs row-store distinction in how a database internally persists data to disk. My question is whether, for a dataset that is entirely in memory with no storage to disk, the row- vs column-orientation makes much of a difference.
A lot depends on the size of the dataset, what the contents of each row are, how you need to search in it, whether you want to add items to or remove items from the dataset, and so on.
There is also the CPU and memory architecture to consider; how big are your caches, what is the size of a cache line, and how intelligent is your CPU's prefetcher.
For fields under 8 bytes, it would involve less memory accesses for columns than for rows.
Memory is not accessed a register at a time, but rather a cache line at a time. On most contemporary machines, cache lines are 64 bytes.
Compression would also be easier on a column-store regardless of whether in memory or not
Not really. You can compress/decompress a column even if it is not stored in memory consecutively. It might be faster though.
does compression ever matter on in-memory operations?
That depends. If it's in-memory, then it's likely that compression will reduce performance, but on the other hand, the amount of data that you need to store is smaller, so you will be able to fit more into memory.
Possible to vectorize operations.
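Vectorization is possible with either layout. It's only the loading/storing to memory that might be slower if data is grouped by rows, because the values of a single field are then not contiguous.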
Much, much easier to work with a struct on a row-by-row basis of course.
It's easy to use a pointer to a struct with a row-by-row store, but with C++ you can make classes that hide the fact that data is stored column-by-column. That's a bit more work up front, but might make it as easy as row-by-row once you have set that up.
Also, column-by-column store is often used in the entity-component-system pattern, and there are libraries such as EnTT that make it quite easy to work with.
Are both of those accurate, and are there any more? Given this, would there be substantial performance improvements on using an in-memory colstore vs rowstore on a read-only dataset, or just a marginal improvement?
Again, it heavily depends on the size of the dataset and how you want to access it. If you frequently use all columns in a row, then row-by-row store is preferred. If you frequently just use one column, and need to access that column of many consecutive rows, then a column-by-column store is best.
Also, there are hybrid solutions possible. You could have one column on its own, and then all the other columns stored in row-by-row fashion.
How you will search in a read-only dataset matters a lot. Is it going to be sorted, or is it more like a hash map? In the former case, you want the index to be as compact as possible, and possibly ordered like a B-tree as Alex Guteniev already mentioned. If it's going to be like a hash map, then you probably want row-by-row.
For in-memory arrays, this is called AoS vs SoA (array of structs vs struct of arrays).
I think the main advantage of SoA for a read-only database is that searches need to access a smaller memory range. This is more cache friendly and less prone to page faults.
The amount of improvement depends on how you use the database. There may be a more significant improvement from using a more targeted structure (sorted array, B-tree).
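As a rough illustration of the AoS vs SoA difference, here is a small Python sketch using only the standard library. The row layout (id, price, qty) and the 64-byte cache line are assumptions for the example, and Python itself will not show the real speed difference; the point is only how many bytes a single-column scan has to touch.

```python
import struct
from array import array

ROW = struct.Struct("<qdi")   # hypothetical row: id (8 bytes), price (8), qty (4)

rows = [(1, 9.5, 3), (2, 20.0, 1), (3, 4.25, 7)]

# AoS / row store: rows packed one after another in one buffer.
aos = bytearray(ROW.size * len(rows))
for i, r in enumerate(rows):
    ROW.pack_into(aos, i * ROW.size, *r)

# SoA / column store: one contiguous array per field.
ids = array("q", (r[0] for r in rows))
prices = array("d", (r[1] for r in rows))
qtys = array("i", (r[2] for r in rows))

def total_qty_aos(buf):
    # Walks every row's bytes even though only qty is needed.
    return sum(ROW.unpack_from(buf, off)[2] for off in range(0, len(buf), ROW.size))

def total_qty_soa(col):
    # Scans only the qty column.
    return sum(col)

CACHE_LINE = 64                                  # assumed cache line size in bytes
qty_per_line_soa = CACHE_LINE // qtys.itemsize   # qty values per cache line
rows_per_line_aos = CACHE_LINE / ROW.size        # whole rows per cache line
```

With a 4-byte int, one cache line holds 16 quantities in the column layout but only about 3 of these 20-byte rows, so a scan over one field pulls in far fewer lines with SoA.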
I'm developing a tool for wavelet image analysis and machine learning on Linux machines in C++.
It is limited by the size of the images, the number of scales and their corresponding filters (up to 2048x2048 doubles) for each of N orientations as well as additional memory and processing overhead by a machine learning algorithm.
Unfortunately my Linux system programming skills are shallow at best, so I'm currently using no swap, but I figure it should be possible somehow?
I'm required to keep the imaginary and real part of the filtered images of each scale and orientation, as well as the corresponding wavelets for reconstruction purposes. I keep them in memory for additional speed for small images.
Regarding the memory use: I already
store everything no more than once,
only what is needed,
cut out any double entries or redundancy,
pass by reference only,
use pointers over temporary objects,
free memory as soon as it is not required any more and
limit the number of calculations to the absolute minimum.
As with most data processing tools, speed is of the essence. As long as there is enough memory, the tool is about 3x as fast as the same implementation in Matlab code.
But as soon as I'm out of memory nothing goes any more. Unfortunately most of the images I'm training the algorithm on are huge (raw data 4096x4096 double entries, after symmetric padding even larger), therefore I hit the ceiling quite often.
Would it be bad practice to temporarily write data that is not needed for the current calculation / processing step from memory to the disk?
What approach / data format would be most suitable to do that?
I was thinking of using rapidXML to read and write an XML to a binary file and then read out only the required data. Would this work?
Is a memory-mapped file what I need? https://en.wikipedia.org/wiki/Memory-mapped_file
I'm aware that this will result in performance loss, but it is more important that the software runs smoothly and does not freeze.
I know that there are libraries out there that can do wavelet image analysis, so please spare the "Why reinvent the wheel, just use XYZ instead". I'm using very specific wavelets, I'm required to do it myself and I'm not supposed to use external libraries.
Yes, writing data to the disk to save memory is bad practice.
There is usually no need to manually write your data to the disk to save memory, unless you are reaching the limits of what you can address (4GB on 32bit machines, much more in 64bit machines).
The reason for this is that the OS is already doing exactly the same thing. It is very possible that your own solution would be slower than what the OS is doing. Read this Wikipedia article if you are not familiar with the concept of paging and virtual memory.
Did you look into using mmap and munmap to bring the images (and temporary results) into your address space and discard them when you no longer need them? mmap allows you to map the content of a file directly into memory. No more fread/fwrite, just direct memory access. Writes to the memory region are written back to the file too, and bringing back that intermediate state later on is no harder than redoing an mmap.
The big advantages are:
no encoding in a bloated format like XML
perfectly suitable for transient results such as matrices that are represented in contiguous memory regions.
Dead simple to implement.
Completely delegate to the OS the decision of when to swap in and out.
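The idea can be sketched with Python's standard mmap module, which wraps the same system calls you would use from C++. This is only a sketch under assumptions: the file name scale0.bin, the array size, and the values written are all made up.

```python
import mmap
import os
import struct
import tempfile

N = 1024                              # number of doubles (small for the example)
SIZE = N * struct.calcsize("d")       # 8 bytes per double

path = os.path.join(tempfile.mkdtemp(), "scale0.bin")   # hypothetical scratch file

# Create a file of the right size, then map it into the address space.
with open(path, "wb") as f:
    f.truncate(SIZE)

with open(path, "r+b") as f:
    mm = mmap.mmap(f.fileno(), SIZE)
    view = memoryview(mm).cast("d")   # treat the mapping as an array of doubles
    for i in range(N):
        view[i] = i * 0.5             # plain memory writes, no fwrite
    view.release()
    mm.flush()                        # ask the OS to write dirty pages back
    mm.close()                        # drop the mapping (the munmap step)

# Later: map the same file again to get the intermediate result back.
with open(path, "r+b") as f:
    mm = mmap.mmap(f.fileno(), SIZE)
    view = memoryview(mm).cast("d")
    restored = view[10]
    view.release()
    mm.close()
```

Between the two mappings the data lives only in the file, so it takes up no address space until it is mapped again.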
This doesn't solve your fundamental problem, but: Are you sure you need to be doing everything in double precision? You may not be able to use integer coefficient wavelets, but storing the image data itself in doubles is usually pretty wasteful. Also, 4k images aren't very big ... I'm assuming you are actually using frames of some sort so have redundant entries, otherwise your numbers don't seem to add up (and are you storing them sparsely?) ... or maybe you are just using a large number at once.
As for "should I write to disk"? This can help, particularly if you are getting a 4x increase (or more) by taking image data to double precision. You can answer it for yourself though: just measure the time to load and compare it to your compute time to see if this is worth pursuing. The wavelet transform itself should be very cheap, so I'm guessing you're mostly dominated by your learning algorithm. In that case, go ahead and throw out the original data or whatever else until you need it again.
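To put rough numbers on the precision question, here is a back-of-the-envelope calculation in Python. The image side comes from the question; the 16-bit raw sample size and the counts of scales and orientations are assumptions made up for the example.

```python
from array import array

def mib(n_bytes):
    """Convert a byte count to mebibytes."""
    return n_bytes / (1024 * 1024)

side = 4096
pixels = side * side

double_bytes = pixels * array("d").itemsize   # 8 bytes per value
float_bytes = pixels * array("f").itemsize    # 4 bytes per value
uint16_bytes = pixels * 2                     # assuming 16-bit raw samples

# Hypothetical filter bank: these counts are made up for illustration.
scales, orientations = 5, 8
# Real and imaginary part are kept for every filtered image.
bank_double = scales * orientations * 2 * double_bytes
bank_float = scales * orientations * 2 * float_bytes
```

Under these assumptions one 4096x4096 image is 128 MiB in doubles (4x the 16-bit raw data), and the full set of filtered images is 10 GiB in doubles versus 5 GiB in single precision.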
I'd like to ask fellow SO'ers for their opinions regarding best of breed data structures to be used for indexing time-series (aka column-wise data, aka flat linear).
Two basic types of time-series exist based on the sampling/discretisation characteristic:
Regular discretisation (Every sample is taken with a common frequency)
Irregular discretisation (Samples are taken at arbitrary time-points)
Queries that will be required:
All values in the time range [t0,t1]
All values in the time range [t0,t1] that are greater/less than v0
All values in the time range [t0,t1] that are in the value range [v0,v1]
The data sets consist of summarized time-series (which sort of gets over the Irregular discretisation), and multivariate time-series. The data set(s) in question are about 15-20TB in size, hence processing is performed in a distributed manner - because some of the queries described above will result in datasets larger than the physical amount of memory available on any one system.
Distributed processing in this context also means dispatching the required data specific computation along with the time-series query, so that the computation can occur as close to the data as is possible - so as to reduce node to node communications (somewhat similar to map/reduce paradigm) - in short proximity of computation and data is very critical.
Another issue that the index should be able to cope with is that the overwhelming majority of data is static/historic (99.999...%), however on a daily basis new data is added, think of "in the field sensors" or "market data". The idea/requirement is to be able to update any running calculations (averages, GARCH models etc.) with as low a latency as possible. Some of these running calculations require historical data, some of which will be more than what can be reasonably cached.
I've already considered HDF5. It works well/efficiently for smaller datasets but starts to drag as the datasets become larger, and there are no native parallel processing capabilities from the front-end.
Looking for suggestions, links, further reading etc. (C or C++ solutions, libraries)
You would probably want to use some type of large, balanced tree. Like Tobias mentioned, B-trees would be the standard choice for solving the first problem. If you also care about getting fast insertions and updates, there is a lot of new work being done at places like MIT and CMU into these new "cache oblivious B-trees". For some discussion of the implementation of these things, look up Tokutek DB, they've got a number of good presentations like the following:
http://tokutek.com/downloads/mysqluc-2010-fractal-trees.pdf
Questions 2 and 3 are in general a lot harder, since they involve higher dimensional range searching. The standard data structure for doing this would be the range tree (which gives O(log^d(n) + k) query time for k reported points, or O(log^{d-1}(n) + k) with fractional cascading, at the cost of O(n log^{d-1}(n)) storage). You generally would not want to use a k-d tree for something like this. While it is true that k-d trees have optimal, O(n), storage costs, it is a fact that you can't evaluate range queries any faster than O(n^{(d-1)/d}) if you only use O(n) storage. For d=2, this would be O(sqrt(n)) time complexity; and frankly that isn't going to cut it if you have 10^10 data points (who wants to wait for O(10^5) disk reads to complete on a simple range query?)
Fortunately, it sounds like in your situation you really don't need to worry too much about the general case. Because all of your data comes from a time series, you only ever have at most one value per time coordinate. Hypothetically, what you could do is just use a range query to pull some interval of points, then as a post-process go through and apply the v constraints pointwise. This would be the first thing I would try (after getting a good database implementation), and if it works then you are done! It really only makes sense to try optimizing the latter two queries if you keep running into situations where the number of points in [t0, t1] x [-infty,+infty] is orders of magnitude larger than the number of points in [t0,t1] x [v0, v1].
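A minimal in-memory sketch of that "range query on time, then filter on value" approach, in Python. A sorted list with binary search stands in for the B-tree index, and the data and the function name query are made up for the example.

```python
from bisect import bisect_left, bisect_right

# Parallel, time-sorted columns (toy data).
times = [1, 2, 4, 7, 8, 10, 13]
values = [5.0, 9.0, 3.0, 12.0, 6.0, 7.5, 1.0]

def query(t0, t1, v0=float("-inf"), v1=float("inf")):
    """Return (time, value) pairs with t0 <= time <= t1 and v0 <= value <= v1."""
    lo = bisect_left(times, t0)     # first index with time >= t0
    hi = bisect_right(times, t1)    # one past the last index with time <= t1
    # Post-process: apply the value constraints pointwise.
    return [(times[i], values[i]) for i in range(lo, hi) if v0 <= values[i] <= v1]

in_range = query(2, 10)
filtered = query(2, 10, v0=6.0, v1=9.0)
```

The time lookup costs two binary searches; the value filter is linear in the number of points inside [t0, t1], which is fine as long as that interval is not vastly larger than the final result.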
General ideas:
Problem 1 is fairly common: Create an index that fits into your RAM and has links to the data on the secondary storage (datastructure: B-Tree family).
Problem 2 / 3 are quite complicated since your data is so large. You could partition your data into time ranges and calculate the min / max for that time range. Using that information, you can filter out time ranges (e.g. max value for a range is 50 and you search for v0>60 then the interval is out). The rest needs to be searched by going through the data. The effectiveness greatly depends on how fast the data is changing.
You can also do multiple indices by combining the time ranges of lower levels to do the filtering faster.
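The min/max partitioning idea can be sketched as follows in Python. The block size, the data, and the function name greater_than are made up for the example; real blocks would cover far more rows and live on secondary storage.

```python
BLOCK = 4   # rows per block; real blocks would be far larger

values = [5, 9, 3, 12, 40, 55, 48, 61, 7, 2, 8, 6]

# Precompute (min, max) for each block of consecutive rows.
summaries = [
    (min(values[i:i + BLOCK]), max(values[i:i + BLOCK]))
    for i in range(0, len(values), BLOCK)
]

def greater_than(v0):
    """Return indices of values > v0 and the number of blocks actually scanned."""
    hits, scanned = [], 0
    for b, (_, block_max) in enumerate(summaries):
        if block_max <= v0:
            continue                      # whole block ruled out by its max
        scanned += 1
        start = b * BLOCK
        end = min(start + BLOCK, len(values))
        hits.extend(i for i in range(start, end) if values[i] > v0)
    return hits, scanned

hits, scanned = greater_than(50)
```

Here only one of the three blocks has to be read. A second level of summaries over groups of blocks would skip even more, at the cost of keeping them up to date when new data arrives.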
It is going to be really time consuming and complicated to implement this yourself. I recommend you use Cassandra.
Cassandra can give you horizontal scalability and redundancy, and allow you to run complicated map reduce functions in the future.
To learn how to store time series in Cassandra please take a look at:
http://www.datastax.com/dev/blog/advanced-time-series-with-cassandra
and http://www.youtube.com/watch?v=OzBJrQZjge0.
A bit more detail: we're already trying to take the most advantage of zipmaps, ziplists, etc, and I'm wondering whether these representations are already compressed, or are just serialized hashes and lists; does compression significantly reduce memory usage?
Also, does compression overhead at the app server layer get offset by lower network usage? StackOverflow's experience suggests it does, any other opinions?
In brief, does it make sense - for both short and longer strings?
Redis does not compress your values, and whether you should compress them yourself depends a lot on the size of the strings you are going to store. For big strings, hundreds of kilobytes and more, it's probably worth the extra CPU cycles on the client side, just like it is when you serve web pages, but for shorter strings it's likely a waste of time. Short strings generally don't compress much, so the gain would be too small.
There's a practical way to get good compression, even for very small strings (50 bytes!) -
If your values are somewhat similar to each other - for example, they're JSON representations of a few related classes of objects - you can precompute a compressor/decompressor dictionary based on some example text.
It sounds complicated, but it's simple in practice - and simpler still with the right wrapper code to handle it.
Here's a Python implementation:
https://github.com/internetarchive/openlibrary/blob/master/openlibrary/utils/compress.py
and here's a wrapper for compressing a specific class of strings: (short JSON records)
https://github.com/internetarchive/openlibrary/blob/master/openlibrary/utils/olcompress.py
One catch: to do this efficiently, your compression library must support 'cloning' the internal state. (The Python library does) You can implement something similar by prepending the example text when compressing, but this means paying an extra computation cost.
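As a rough sketch of the same idea, Python's zlib module also accepts a preset dictionary through the zdict argument. Note that this is a different mechanism from the state-cloning approach in the linked code, and the sample text and record below are made up for the example.

```python
import zlib

# Hypothetical sample text containing the keys and fragments the records share.
SAMPLE = b'{"type": "/type/edition", "title": "", "authors": [{"key": "/authors/"}], "publish_date": ""}'

def compress(value):
    """Compress one short record against the shared preset dictionary."""
    c = zlib.compressobj(zdict=SAMPLE)
    return c.compress(value) + c.flush()

def decompress(blob):
    """Decompress a record; the same dictionary must be supplied."""
    d = zlib.decompressobj(zdict=SAMPLE)
    return d.decompress(blob) + d.flush()

record = b'{"type": "/type/edition", "title": "Dune", "authors": [{"key": "/authors/OL1A"}], "publish_date": "1965"}'
with_dict = compress(record)
without_dict = zlib.compress(record)
```

Comparing len(with_dict) and len(without_dict) on your own records shows how much the dictionary buys you. The dictionary has to be stored and versioned alongside the data, since values compressed with one dictionary cannot be read with another.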
Thanks to solrize for this awesome trick.
Redis and its clients are typically IO bound, and the IO costs are typically at least two orders of magnitude higher than the rest of the request/reply sequence. Smaller payloads will give you higher throughput and lower latencies.
I do not believe there are any hard and fast rules beyond: cost of compression << IO gains. You should benchmark it and find the sweet spot for the lower bound on payload size, but the MTU of your network is not a bad starting point for that lower bound.
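One way to apply such a lower bound on the client side is sketched below in Python. The threshold of 1400 bytes, the one-byte marker scheme, and the function names are assumptions for this example, not anything Redis prescribes.

```python
import zlib

MIN_COMPRESS_SIZE = 1400        # assumed threshold, roughly one Ethernet MTU of payload

RAW, ZLIB = b"\x00", b"\x01"    # one-byte marker prefixed to each stored value

def encode(value):
    """Compress only payloads at or above the threshold, and only if it helps."""
    if len(value) >= MIN_COMPRESS_SIZE:
        packed = zlib.compress(value)
        if len(packed) < len(value):
            return ZLIB + packed
    return RAW + value

def decode(stored):
    """Undo encode() by checking the marker byte."""
    marker, body = stored[:1], stored[1:]
    return zlib.decompress(body) if marker == ZLIB else body

small = b'{"id": 1}'
large = b'{"id": 1, "tags": ["redis", "cache"]}' * 100
```

The value returned by encode is what you would pass to SET, and decode is applied to what GET returns. Benchmarking with a few different thresholds against your real payloads is the only reliable way to pick the number.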
I need to decide whether to use STM in a Clojure system I am involved with, which needs several GB of data to be stored in a single STM ref.
I would like to hear from anyone who has any advice in using Clojure STM with large indexed datasets to hear their experiences.
I've been using Clojure for some fairly large-scale data processing tasks (definitely gigabytes of data, typically lots of largish Java arrays stored inside various Clojure constructs/STM refs).
As long as everything fits in available memory, you shouldn't have a problem with extremely large amounts of data in a single ref. The ref itself applies only a small fixed amount of STM overhead that is independent of the size of whatever is contained within it.
A nice extra bonus comes from the structural sharing that is built into Clojure's standard data structures (maps, vectors etc.) - you can take a complete copy of a 10GB data structure, change one element anywhere in the structure, and be guaranteed that both data structures will together only require a fraction more than 10GB. This is very helpful, particularly if you consider that due to STM/concurrency you will potentially have several different versions of the data being created at any one time.
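Structural sharing is easiest to see in a toy example. The Python sketch below implements path copying on a small tree of tuples; it is not how Clojure implements its data structures in detail (Clojure's vectors use much wider nodes), and the names build, assoc and get are made up here.

```python
def build(items, width=4):
    """Build a tree of tuples with up to `width` children per node."""
    level = [tuple(items[i:i + width]) for i in range(0, len(items), width)]
    depth = 1
    while len(level) > 1:
        level = [tuple(level[i:i + width]) for i in range(0, len(level), width)]
        depth += 1
    return level[0], depth

def assoc(node, depth, index, value, width=4):
    """Return a new tree with one element replaced; copies only the root-to-leaf path."""
    if depth == 1:
        return node[:index] + (value,) + node[index + 1:]
    span = width ** (depth - 1)          # elements covered by each child
    child, rest = divmod(index, span)
    new_child = assoc(node[child], depth - 1, rest, value, width)
    return node[:child] + (new_child,) + node[child + 1:]

def get(node, depth, index, width=4):
    """Look up one element by walking from the root down to a leaf."""
    while depth > 1:
        span = width ** (depth - 1)
        child, index = divmod(index, span)
        node = node[child]
        depth -= 1
    return node[index]

old, depth = build(list(range(64)))
new = assoc(old, depth, 21, "changed")
```

After the update, old still contains 21 at index 21, new contains "changed", and every subtree off the modified path (for example new[0] and old[0]) is the very same object in both versions. Only three small nodes were copied for a 64-element structure.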
The performance isn't going to be any worse or any better than STM involving a single ref with a small dataset. Performance is more hindered by the number of updates to a dataset than the actual size of the dataset.
If you have one writer to the dataset and many readers, then performance will still be quite good. However, if you have one reader and many writers, performance will suffer, because competing transactions on the same ref have to retry.
Perhaps more information would help us help you more.