Efficient 'rolling/moving hash' computation (like moving average) - c++

I'm trying to optimize a program which needs to compute a hash for a constant-size window in a data stream at every position (byte) of the stream. It is needed to look up repetitions in disk files much larger than the available RAM. Currently I compute a separate md5 hash for every window, but it costs a lot of time (the window size is a few kilobytes, so each byte of data is processed a few thousand times). I wonder if there exists a way to compute every subsequent hash in constant (window-size-independent) time (like the addition and subtraction of one element in a moving average)? The hash function may be anything as long as it gives not-too-long hashes (50-100 bits is OK) and its computation is reasonably fast. It also must give virtually no collisions on up to trillions of not-so-random windows (TBs of data) - every collision means a disk access in my case (crc32 is much too weak, md5 is OK in this aspect).
I'll be thankful if you point me to an existing library function available on linux if there is one.
This is my first question here, so please be tolerant if I did something wrong.
regards,
bartosz

The Wikipedia article on rolling hashes has a link to ngramhashing which implements a few different techniques in C++, including:
Randomized Karp-Rabin (sometimes called Rabin-Karp)
Hashing by Cyclic Polynomials (also known as Buzhash)
Hashing by Irreducible Polynomials
(Also available on GitHub)
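For illustration, here is a minimal C++ sketch of the Karp-Rabin rolling idea (a simplified, non-randomized version written for this answer, not the library's code): sliding the window by one byte costs a constant number of operations regardless of the window size. A 64-bit hash like this is far weaker than md5, so for the use case above it could only serve as a first-pass filter.

    #include <cstddef>
    #include <cstdint>

    // Karp-Rabin-style rolling hash: h = p[0]*B^(w-1) + ... + p[w-1], mod 2^64.
    // Rolling the oldest byte out and the newest byte in is O(1).
    struct RollingHash {
        static constexpr uint64_t B = 0x100000001B3ULL; // arbitrary odd multiplier
        uint64_t hash = 0;      // hash of the current window
        uint64_t b_pow_w = 1;   // B^window, used to remove the outgoing byte
        std::size_t window;

        explicit RollingHash(std::size_t w) : window(w) {
            for (std::size_t i = 0; i < w; ++i) b_pow_w *= B;   // mod 2^64
        }
        void init(const unsigned char* p) {      // hash the first full window
            hash = 0;
            for (std::size_t i = 0; i < window; ++i) hash = hash * B + p[i];
        }
        void roll(unsigned char out, unsigned char in) {  // slide by one byte
            hash = hash * B + in - b_pow_w * out;
        }
    };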

What you describe is pretty close to the basic approach used in data deduplication storage.
In data deduplication systems, we usually use Rabin's fingerprinting method as a fast, rolling hash function.
However, while Rabin fingerprints have good and well-understood collision properties, they are not cryptographically secure, i.e., there will be collisions. Check e.g. how Bentley et al. used such a method in their compression scheme. The question is if and how many collisions you can tolerate. If you can tolerate an occasional collision, a good Rabin fingerprint implementation might be a good idea. Good implementations can process more than 200 MB per second per core.
I am not aware of any approach with virtually no collisions (aka cryptographically secure) that is rolling at the same time. Like PlasmaHH, I have serious doubts that this is actually possible.
Think about whether you can relax your restrictions. Maybe you can afford to miss some duplicates. In that case, faster approaches are possible.
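To make the "tolerate an occasional collision" idea concrete, here is a rough sketch of the usual pattern (the index probe and the verification step are hypothetical placeholders, and it reuses the RollingHash sketch from the answer above): the weak rolling hash only nominates candidate windows, and each candidate is confirmed by a strong hash or byte-for-byte compare, so a collision costs one extra verification - in the asker's case, one disk access - rather than a false duplicate.

    #include <cstddef>
    #include <cstdint>

    // Stub placeholders for the real index probe and the expensive check
    // (e.g. an md5 compare or a disk read); illustration only.
    inline bool lookup_candidate(uint64_t /*weak_hash*/) { return false; }
    inline bool confirm_match(const unsigned char*, std::size_t) { return true; }

    void scan(const unsigned char* data, std::size_t n, std::size_t w) {
        RollingHash rh(w);                  // sketch from the previous answer
        rh.init(data);
        for (std::size_t i = 0; i + w <= n; ++i) {
            if (lookup_candidate(rh.hash) && confirm_match(data + i, w)) {
                // genuine duplicate window at offset i: the weak hash only
                // nominated it, the strong check made the final decision
            }
            if (i + w < n) rh.roll(data[i], data[i + w]);
        }
    }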

Related

Why does CouchDB use an append-only B+ tree and not a HAMT

I'm reading up on datastructures, especially immutable ones like the append-only B+ tree used in CouchDB and the Hash array mapped trie used in Clojure and some other functional programming languages.
The main reason data structures that work well in memory might not work well on disk appears to be time spent on disk seeks due to fragmentation, as with a normal binary tree.
However, a HAMT is also very shallow, so it doesn't require any more seeks than a B tree.
Another suggested reason is that deletions from an array mapped trie are more expensive than from a B tree. This is based on the assumption that we're talking about a dense vector, and doesn't apply when using either as a hash map.
What's more, it seems that a B tree does more rebalancing, so using it in an append-only manner produces more garbage.
So why do CouchDB and practically every other database and filesystem use B trees?
[edit] fractal trees? log-structured merge tree? mind = blown
[edit] Real-life B trees use a degree in the thousands, while a HAMT has a degree of 32. A HAMT of degree 1024 would be possible, but slower due to popcnt handling 32 or 64 bits at a time.
B-trees are used because they are a well-understood algorithm that achieves "ideal" sorted-order read-cost. Because keys are sorted, moving to the next or previous key is very cheap.
HAMTs and other hash storage store keys in random order. Keys are retrieved by their exact value, and there is no efficient way to find the next or previous key.
Regarding degree, it is normally selected indirectly, by selecting page size. HAMTs are most often used in memory, with pages sized for cache lines, while B-trees are most often used with secondary storage, where page sizes are related to IO and VM parameters.
Log Structured Merge (LSM) is a different approach to sorted-order storage which achieves better write-efficiency by trading off some read efficiency. That hit to read efficiency can be a problem for read-modify-write workloads, but the fewer uncached reads there are, the better LSM's overall throughput compared to a B-tree - at the cost of higher worst-case read latency.
LSM also offers the promise of a wider-performance envelope. Putting new data into its proper place is "deferred", offering the possibility to tune read-to-write efficiency by controlling the proportion of deferred cleanup work to live work. In theory, an ideal-LSM with zero-deferral is a B-tree and with 100%-deferral is a log.
However, LSM is more of a "family" of algorithms than a specific algorithm like a B-tree. Their usage is growing in popularity, but it is hindered by the lack of a de-facto optimal LSM design. LevelDB/RocksDB is one of the more practical LSM implementations, but it is far from optimal.
Another approach to achieving write-throughput efficiency is to write-optimize B-trees through write-deferral, while attempting to maintain their optimal read-throughput.
Fractal-trees, shuttle-trees, stratified-trees are this type of design, and represent a hybrid gray area between B-tree and LSM. Rather than deferring writes to an offline process, they amortize write-deferral in a fixed way. For example, such a design might represent a fixed 60%-write-deferral fraction. This means they can't achieve the 100% write-deferral performance of an LSM, but they also have a more predictable read-performance, making them more practical drop-in replacements for B-trees. (As in the commercial Tokutek MySQL and MongoDB fractal-tree backends)
B-trees are ordered by their key, while in a hash map similar keys have very different hash values and so are stored far from each other. Now think of a query that does a range scan, "give me yesterday's sales": with a hash map you have to scan the whole map to find them; with a B-tree on the sales_dtm column you'll find them nicely clustered, and you know exactly where to start and stop reading.
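A toy C++ illustration of that difference, with std::map standing in for a sorted (B-tree-like) index and std::unordered_map for hash storage; the keys and data are made up:

    #include <map>
    #include <string>
    #include <unordered_map>

    int main() {
        std::map<int, std::string> by_dtm;            // sorted by key
        std::unordered_map<int, std::string> by_hash; // hashed keys
        for (int t = 0; t < 1000; ++t)
            by_dtm[t] = by_hash[t] = "sale";

        // Sorted index: jump straight to the keys in [100, 200].
        auto first = by_dtm.lower_bound(100);
        auto last  = by_dtm.upper_bound(200);  // one past the last key <= 200
        for (auto it = first; it != last; ++it) { /* process it->second */ }

        // Hash storage: no choice but to visit every entry and test its key.
        for (const auto& kv : by_hash)
            if (kv.first >= 100 && kv.first <= 200) { /* process kv.second */ }
    }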

Best of breed indexing data structures for Extremely Large time-series

I'd like to ask fellow SO'ers for their opinions regarding best of breed data structures to be used for indexing time-series (aka column-wise data, aka flat linear).
Two basic types of time-series exist based on the sampling/discretisation characteristic:
Regular discretisation (Every sample is taken with a common frequency)
Irregular discretisation (samples are taken at arbitrary time-points)
Queries that will be required:
All values in the time range [t0,t1]
All values in the time range [t0,t1] that are greater/less than v0
All values in the time range [t0,t1] that are in the value range [v0,v1]
The data sets consist of summarized time-series (which sort of gets over the Irregular discretisation), and multivariate time-series. The data set(s) in question are about 15-20TB in size, hence processing is performed in a distributed manner - because some of the queries described above will result in datasets larger than the physical amount of memory available on any one system.
Distributed processing in this context also means dispatching the required data specific computation along with the time-series query, so that the computation can occur as close to the data as is possible - so as to reduce node to node communications (somewhat similar to map/reduce paradigm) - in short proximity of computation and data is very critical.
Another issue that the index should be able to cope with is that the overwhelming majority of the data is static/historic (99.999...%); however, new data is added on a daily basis - think of "in the field sensors" or "market data". The idea/requirement is to be able to update any running calculations (averages, GARCH models etc.) with as low a latency as possible; some of these running calculations require historical data, some of which will be more than can reasonably be cached.
I've already considered HDF5; it works well/efficiently for smaller datasets but starts to drag as the datasets become larger, and there are no native parallel-processing capabilities from the front end.
Looking for suggestions, links, further reading etc. (C or C++ solutions, libraries)
You would probably want to use some type of large, balanced tree. Like Tobias mentioned, B-trees would be the standard choice for solving the first problem. If you also care about getting fast insertions and updates, there is a lot of new work being done at places like MIT and CMU on these new "cache-oblivious B-trees". For some discussion of the implementation of these things, look up Tokutek DB; they've got a number of good presentations like the following:
http://tokutek.com/downloads/mysqluc-2010-fractal-trees.pdf
Questions 2 and 3 are in general a lot harder, since they involve higher dimensional range searching. The standard data structure for doing this would be the range tree (which gives O(log^{d-1}(n)) query time, at the cost of O(n log^d(n)) storage). You generally would not want to use a k-d tree for something like this. While it is true that kd trees have optimal, O(n), storage costs, it is a fact that you can't evaluate range queries any faster than O(n^{(d-1)/d}) if you only use O(n) storage. For d=2, this would be O(sqrt(n)) time complexity; and frankly that isn't going to cut it if you have 10^10 data points (who wants to wait for O(10^5) disk reads to complete on a simple range query?)
Fortunately, it sounds like in your situation you really don't need to worry too much about the general case. Because all of your data comes from a time series, you only ever have at most one value for each time coordinate. Hypothetically, what you could do is just use a range query to pull some interval of points, then as a post-process go through and apply the v constraints pointwise. This would be the first thing I would try (after getting a good database implementation), and if it works then you are done! It really only makes sense to try optimizing the latter two queries if you keep running into situations where the number of points in [t0, t1] x [-infty,+infty] is orders of magnitude larger than the number of points in [t0,t1] x [v0, v1].
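A minimal sketch of that "range query on time, then filter on value" post-process, using an in-memory std::map as a stand-in for whatever on-disk time index ends up being used (types and names are illustrative only):

    #include <map>
    #include <utility>
    #include <vector>

    using Time  = long long;
    using Value = double;

    std::vector<std::pair<Time, Value>>
    query(const std::map<Time, Value>& series,
          Time t0, Time t1, Value v0, Value v1) {
        std::vector<std::pair<Time, Value>> out;
        // 1. One-dimensional range query on the time key.
        for (auto it = series.lower_bound(t0);
             it != series.end() && it->first <= t1; ++it) {
            // 2. Pointwise value constraint applied as a post-process.
            if (it->second >= v0 && it->second <= v1)
                out.push_back(*it);
        }
        return out;
    }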
General ideas:
Problem 1 is fairly common: create an index that fits into your RAM and has links to the data on secondary storage (data structure: B-tree family).
Problems 2 and 3 are quite complicated since your data is so large. You could partition your data into time ranges and calculate the min/max for each time range. Using that information, you can filter out time ranges (e.g. if the max value for a range is 50 and you search for v0 > 60, the interval is out). The rest needs to be searched by going through the data. The effectiveness greatly depends on how fast the data is changing.
You can also do multiple indices by combining the time ranges of lower levels to do the filtering faster.
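A rough sketch of that min/max filtering idea, with an illustrative in-memory Partition type; in a real deployment the samples would sit on secondary storage and only the small summaries would be kept in RAM:

    #include <utility>
    #include <vector>

    struct Partition {
        long long t0, t1;     // time range covered by this partition
        double min_v, max_v;  // precomputed summary for the value dimension
        std::vector<std::pair<long long, double>> samples;
    };

    void query(const std::vector<Partition>& parts,
               long long t0, long long t1, double v0, double v1,
               std::vector<std::pair<long long, double>>& out) {
        for (const Partition& p : parts) {
            if (p.t1 < t0 || p.t0 > t1) continue;        // outside the time range
            if (p.max_v < v0 || p.min_v > v1) continue;  // summary rules it out
            for (const auto& s : p.samples)              // only now touch the data
                if (s.first >= t0 && s.first <= t1 &&
                    s.second >= v0 && s.second <= v1)
                    out.push_back(s);
        }
    }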
It is going to be really time-consuming and complicated to implement this yourself. I recommend you use Cassandra.
Cassandra can give you horizontal scalability and redundancy, and allow you to run complicated MapReduce functions in the future.
To learn how to store time series in Cassandra, please take a look at:
http://www.datastax.com/dev/blog/advanced-time-series-with-cassandra
and http://www.youtube.com/watch?v=OzBJrQZjge0.

how to improve multi-dimentional bit array comparison performance in c or c++

I have the following three-dimensional bit array (for a Bloom filter):
unsigned char P_bit_table_[P_ROWS][ROWS][COLUMNS];
the P_ROWS dimension represents independent two-dimensional bit arrays (i.e., P_ROWS[0], P_ROWS[1], P_ROWS[2] are independent bit arrays) which could be as large as 100MB and contain data that is populated independently. The data that I am looking for could be in any of these P_ROWS, and right now I am searching through them independently, i.e. P_ROWS[0], then P_ROWS[1] and so on until I get a positive or until the end (P_ROWS[n-1]). This implies that if n is 100 I have to do this search (bit comparison) 100 times (and this search is done very often). Somebody suggested that I can improve the search performance if I do bit grouping (use a column-major order on the row-major order array - I DON'T KNOW HOW).
I really need to improve the performance of the search because the program does a lot of it.
I will be happy to give more details of my bit table implementation if required.
Sorry for the poor language.
Thanks for your help.
EDIT:
The bit grouping could be done in the following format:
Assume the array to be :
unsigned char P_bit_table_[P_ROWS][ROWS][COLUMNS]={{(a1,a2,a3),(b1,b2,b3),(c1,c2,c3)},
                                                   {(a1,a2,a3),(b1,b2,b3),(c1,c2,c3)},
                                                   {(a1,a2,a3),(b1,b2,b3),(c1,c2,c3)}};
As you can see, all the rows - on the third dimension - have similar data. What I want after the grouping is something like: all the a1's in one group (as just one entity, so that I can compare them with another bit to check whether they are on or off), all the b1's in another group, and so on.
Re-use Other People's Algorithms
There are a ton of bit-calculation optimizations out there including many that are non-obvious, like Hamming Weights and specialized algorithms for finding the next true or false bit, that are rather independent of how you structure your data.
Reusing algorithms that other people have written can really speed up computation and lookups, not to mention development time. Some algorithms are so specialized and use computational magic that will have you scratching your head: in that case, you can take the author's word for it (after you confirm their correctness with unit tests).
Take Advantage of CPU Caching and Multithreading
I personally reduce my multidimensional bit arrays to one dimension, optimized for expected traversal.
This way, there is a greater chance of hitting the CPU cache.
In your case, I would also think deeply about the mutability of the data and whether you want to put locks on blocks of bits. With 100MBs of data, you have the potential of running your algorithms in parallel using many threads, if you can structure your data and algorithms to avoid contention.
You may even have a lockless model if you divide up ownership of the blocks of data by thread so no two threads can read or write to the same block. It all depends on your requirements.
Now is a good time to think about these issues. But since no one knows your data and usage better than you do, you must consider design options in the context of your data and usage patterns.
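For the bit-grouping part of the question specifically, one possible layout (a sketch under the assumptions that there are at most 64 planes and that the filter's probe positions are already computed) is to pack the same bit position from every plane into one machine word; a lookup then ANDs a handful of words instead of walking all P_ROWS planes one by one:

    #include <cstddef>
    #include <cstdint>
    #include <vector>

    // "Column-major" regrouping: words[pos] holds bit position pos of every
    // plane, with bit p of the word belonging to plane p (assumes <= 64 planes).
    struct GroupedBitTable {
        std::size_t positions;        // ROWS * COLUMNS * 8 positions per plane
        std::vector<uint64_t> words;  // one word per bit position

        explicit GroupedBitTable(std::size_t pos) : positions(pos), words(pos, 0) {}

        void set(unsigned plane, std::size_t pos) {
            words[pos] |= uint64_t(1) << plane;
        }

        // idx: the bit positions produced by the filter's hash functions.
        // Returns a mask of candidate planes; bit p set => plane p may match.
        uint64_t candidate_planes(const std::vector<std::size_t>& idx) const {
            uint64_t mask = ~uint64_t(0);
            for (std::size_t pos : idx) {
                mask &= words[pos];
                if (mask == 0) break;   // no plane can match, stop early
            }
            return mask;
        }
    };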

Fastest C++ map for worst-case time execution in a real time app?

I'm trying to figure out how I want to store timed events in a real-time audio app that may hop around in time a lot, and needs to run with the lowest latency possible. Basically the engine knows what time 'now' is, but 'now' may be non-linear, and there may be multiple 'nows' in the future. I'm wondering:
a) whether a C++ map of some type keyed by time values is even feasible, when there could be thousands of entries
b) which map or hash table implementation will give me the best performance, where best means lowest worst-case execution time, not lowest average. An implementation that even once in a while takes a really long time will be unusable; something with a more deterministic result would be better.
c) for a bunch of events sharing the same 'now', should one use some sort of hash multimap or link a list of all events at a given time?
I'm open to any other suggestions of how to do this too, or pointers to resources. Time is encoded in its own format, representing sections:bars:beats:ticks
thanks!
iain
Nothing can save you from having to profile your code and see for yourself.
Make the data type as easy to change as possible, keep everything modular and parameterised, and then just run some tests.
Start with std::multimap and std::unordered_multimap, with time as the key. Both should have pretty good performance. Try a few different allocators, too.
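A small sketch of the multimap option from question (c), with the time code simplified to a single tick count (the real app encodes sections:bars:beats:ticks):

    #include <cstdint>
    #include <map>
    #include <string>

    using TimeCode = std::uint64_t;          // simplified stand-in
    struct Event { std::string payload; };

    int main() {
        std::multimap<TimeCode, Event> timeline;
        timeline.insert({480, {"note on"}});
        timeline.insert({480, {"cc 7"}});    // two events share the same 'now'
        timeline.insert({960, {"note off"}});

        // All events at a given 'now'.
        auto range = timeline.equal_range(480);
        for (auto it = range.first; it != range.second; ++it) {
            // dispatch it->second
        }

        // First event at or after an arbitrary 'now' (non-linear transport).
        auto next = timeline.lower_bound(700);   // points at the 960 entry
        (void)next;
    }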

Compressing strings before putting them in redis - does it make sense?

A bit more detail: we're already trying to take the most advantage of zipmaps, ziplists, etc, and I'm wondering whether these representations are already compressed, or are just serialized hashes and lists; does compression significantly reduce memory usage?
Also, does compression overhead at the app server layer get offset by lower network usage? StackOverflow's experience suggests it does, any other opinions?
In brief, does it make sense - for both short and longer strings?
Redis does not compress your values, and whether you should compress them yourself depends a lot on the size of the strings you are going to store. For big strings, hundreds of KBs and more, it's probably worth the extra CPU cycles on the client side, just like it is when you serve web pages, but for shorter strings it's likely a waste of time. Short strings generally don't compress much, so the gain would be too small.
There's a practical way to get good compression, even for very small strings (50 bytes!) -
If your values are somewhat similar to each other - for example, they're JSON representations of a few related classes of objects - you can precompute a compressor/decompressor dictionary based on some example text.
It sounds complicated, but it's simple in practice - and simpler still with the right wrapper code to handle it.
Here's a Python implementation:
https://github.com/internetarchive/openlibrary/blob/master/openlibrary/utils/compress.py
and here's a wrapper for compressing a specific class of strings: (short JSON records)
https://github.com/internetarchive/openlibrary/blob/master/openlibrary/utils/olcompress.py
One catch: to do this efficiently, your compression library must support 'cloning' the internal state. (The Python library does) You can implement something similar by prepending the example text when compressing, but this means paying an extra computation cost.
Thanks to solrize for this awesome trick.
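The same trick is reachable from C/C++ through zlib's preset-dictionary support (deflateSetDictionary / inflateSetDictionary); a rough sketch with error handling omitted, where the dictionary is whatever example text resembles your real values:

    #include <string>
    #include <vector>
    #include <zlib.h>

    // Compress one short string using a preset dictionary built from example
    // text; the shared dictionary is what makes ~50-byte values compressible.
    std::vector<unsigned char> compress_with_dict(const std::string& value,
                                                  const std::string& dict) {
        z_stream zs{};
        deflateInit(&zs, Z_BEST_COMPRESSION);
        deflateSetDictionary(&zs,
                             reinterpret_cast<const Bytef*>(dict.data()),
                             static_cast<uInt>(dict.size()));

        std::vector<unsigned char> out(deflateBound(&zs, value.size()));
        zs.next_in   = reinterpret_cast<Bytef*>(const_cast<char*>(value.data()));
        zs.avail_in  = static_cast<uInt>(value.size());
        zs.next_out  = out.data();
        zs.avail_out = static_cast<uInt>(out.size());
        deflate(&zs, Z_FINISH);                 // one shot: buffer is big enough
        out.resize(out.size() - zs.avail_out);
        deflateEnd(&zs);
        return out;
    }
    // Decompression mirrors this: inflateInit, then call inflateSetDictionary
    // with the exact same bytes when inflate() returns Z_NEED_DICT.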
Redis and its clients are typically IO bound, and the IO costs are typically at least two orders of magnitude larger than the rest of the request/reply sequence. Smaller payloads will give you higher throughput and lower latencies.
I do not believe there are any hard and fast rules beyond: cost of compression << IO gains. You should benchmark it and find the sweet spot in setting the lower bound, but the MTU of your network is not a bad starting point for the lower bound.