How should radix sort be implemented on multiple GPUs – the same way as on a single GPU, i.e. by splitting the data, building histograms on the separate GPUs, and then merging the data back (like shuffling a bunch of cards)?
That method would work, but I don't think it would be the fastest approach. Specifically, merging histograms for every K bits (K=4 is currently best) would require the keys to be exchanged between GPUs 32/K = 8 times to sort 32-bit integers. Since the memory bandwidth between GPUs (~5GB/s) is much lower than the memory bandwidth on a GPU (~150GB/s) this will kill performance.
A better strategy would be to split the data into multiple parts, sort each part in parallel on a different GPU, and then merge the parts once at the end. This approach requires only one inter-GPU transfer (vs. 8 above) so it will be considerably faster.
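For illustration, here is a rough sketch of that strategy with two GPUs, using Thrust for the on-device sorts and a single host-side merge at the end. The function name is made up, error handling is omitted, and in practice you would launch the two sorts from separate host threads so they overlap:

// Sketch: sort half of the keys on each of two GPUs, then merge once on the host.
// Assumes two CUDA devices and Thrust; error handling omitted.
#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/copy.h>
#include <cuda_runtime.h>
#include <algorithm>
#include <vector>

std::vector<unsigned int> sort_on_two_gpus(const std::vector<unsigned int>& keys)
{
    const size_t half = keys.size() / 2;
    std::vector<unsigned int> lo(keys.begin(), keys.begin() + half);
    std::vector<unsigned int> hi(keys.begin() + half, keys.end());

    cudaSetDevice(0);                                  // sort the first half on GPU 0
    thrust::device_vector<unsigned int> d_lo = lo;     // Thrust uses a radix sort for integer keys
    thrust::sort(d_lo.begin(), d_lo.end());
    thrust::copy(d_lo.begin(), d_lo.end(), lo.begin());

    cudaSetDevice(1);                                  // sort the second half on GPU 1
    thrust::device_vector<unsigned int> d_hi = hi;
    thrust::sort(d_hi.begin(), d_hi.end());
    thrust::copy(d_hi.begin(), d_hi.end(), hi.begin());

    // One merge pass on the host: the keys cross the inter-GPU/host bus only once per GPU.
    std::vector<unsigned int> merged(keys.size());
    std::merge(lo.begin(), lo.end(), hi.begin(), hi.end(), merged.begin());
    return merged;
}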
Unfortunately this question is not adequately posed. It depends on element size, where the elements begin life in memory, and where you want the sorted elements to end up residing.
Sometimes it's possible to compress the sorted list by storing elements in groups that share a common prefix, or you can unique (deduplicate) elements on the fly, storing each distinct element once in the sorted list with an associated count. For example, you might sort a huge list of 32-bit integers into 64K distinct lists of 16-bit values, cutting your memory requirement in half.
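As a small illustration of that idea (not the answerer's actual code), here is a sketch that buckets 32-bit keys by their high 16 bits and keeps only the low 16 bits per bucket:

#include <cstdint>
#include <vector>
#include <algorithm>

// Bucket 32-bit keys by their high 16 bits; each bucket stores only the low 16 bits,
// halving the key storage. Sorting each bucket reproduces the fully sorted order,
// since the full key is (bucket_index << 16) | value.
std::vector<std::vector<uint16_t>> bucket_by_prefix(const std::vector<uint32_t>& keys)
{
    std::vector<std::vector<uint16_t>> buckets(1u << 16);
    for (uint32_t k : keys)
        buckets[k >> 16].push_back(static_cast<uint16_t>(k & 0xFFFFu));
    for (auto& b : buckets)
        std::sort(b.begin(), b.end());
    return buckets;
}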
The general principle is that you want to make as few passes over the data as possible, and that your throughput will almost always be determined by the bandwidth constraints associated with your storage policy.
If your data set exceeds the size of fast memory, you probably want to finish with a merge pass rather than continue to radix sort, as another person has already answered.
I'm just getting into GPU architecture and I don't understand the K=4 comment above. I've never seen an architecture yet where such a small K would prove optimal.
I suspect merging histograms is also the wrong approach. I'd probably let the elements fragment in memory rather than merge histograms. Is it that hard to manage meso-scale scatter/gather lists in the GPU fabric? I sure hope not.
Finally, it's hard to conceive of a reason why you would want to involve multiple GPUs for this task. Say your card has 2GB of memory and 60GB/s write bandwidth (that's what my mid-range card is showing). A three pass radix sort (11-bit histograms) requires 6GB of write bandwidth (likely your rate limiting factor), or about 100ms to sort a 2GB list of 32-bit integers. Great, they're sorted, now what? If you need to ship them anywhere else without some kind of preprocessing or compression, the sorting time will be small fish.
In any case, I just compiled my first example programs today. There's still a lot to learn. My target application is permutation-intensive, which is closely related to sorting. I'm sure I'll weigh in on this subject again in the future.
Given an array A of size 10^5.
Then m operations are given (m is very large, m >> the size of A); each operation specifies a position p and an increment t:
A[p] += t
Finally, I output the value at each position of the whole array.
Is there any constant-factor optimization that can speed up the intermediate update operations?
For example, if I sort the positions, I can modify them sequentially to avoid random access. However, this operation will incur an additional sorting cost. Is there any other way to speed it up?
Re-executing all the operations after sorting them by position can be an order of magnitude faster than executing them directly, but the cost of the sort itself is too high.
On architectures with many cores, the best solution is certainly to perform atomic updates of A[p] in parallel. This assumes the number of cores is large enough that the parallelism not only offsets the overhead of the atomic operations but also beats the serial implementation. It can be done fairly easily with OpenMP or with native C++ threads/atomics. The number of cores should not be too large, though; otherwise the number of conflicts can grow significantly, causing contention and reducing performance. That should be fine here since the array is fairly big. This solution also assumes the accessed positions are roughly uniformly random. If they are not (e.g. normally distributed), the contention can be too high for the method to be efficient.
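A minimal sketch of this atomic approach with OpenMP might look like the following; the Op struct and function name are purely illustrative:

#include <cstddef>
#include <cstdint>
#include <vector>

struct Op { int p; int64_t t; };   // one update: A[p] += t

// Apply all operations in parallel with atomic updates (compile with -fopenmp).
void apply_atomic(std::vector<int64_t>& A, const std::vector<Op>& ops)
{
    #pragma omp parallel for schedule(static)
    for (std::ptrdiff_t i = 0; i < static_cast<std::ptrdiff_t>(ops.size()); ++i) {
        #pragma omp atomic
        A[ops[i].p] += ops[i].t;
    }
}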
An alternative solution is to split the accesses between N threads spatially. The array range can be statically split into N (roughly equal) parts. All the threads read the input operations, but only the thread that owns the target range of the output array writes to it; the array parts are then combined afterwards. This method works well with few threads and when the data distribution is uniform. When the distribution is far from uniform (e.g. a normal distribution), a pre-computation step may be needed to adjust the array ranges owned by the threads. For example, one can compute the median, or even the quartiles, to better balance the work between threads. Computing quartiles can be done with a selection/partitioning algorithm like Floyd-Rivest (std::nth_element should not be too bad either, although I expect it to use a kind of IntroSelect algorithm that is often a bit slower). The pre-computation may be expensive, but it should still be significantly cheaper than a full sort. OpenMP is certainly a good fit for implementing this.
Another alternative implementation is simply to perform the reduction separately in each thread and then sum the per-thread arrays into the final global array. This solution works well in your case (since "m >> the size of A") as long as the number of cores is not too big; if it is, one needs to mix this method with the first one. This last method is probably the simplest efficient one.
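A sketch of this per-thread-copy approach, reusing the illustrative Op struct and headers from the sketch above:

// Each thread accumulates into its own private copy of A, then the copies are
// combined; with m >> |A| the final O(threads * |A|) reduction is cheap.
void apply_private_copies(std::vector<int64_t>& A, const std::vector<Op>& ops)
{
    const std::size_t n = A.size();
    #pragma omp parallel
    {
        std::vector<int64_t> local(n, 0);            // per-thread accumulator
        #pragma omp for schedule(static) nowait
        for (std::ptrdiff_t i = 0; i < static_cast<std::ptrdiff_t>(ops.size()); ++i)
            local[ops[i].p] += ops[i].t;
        #pragma omp critical                          // fold into the shared array
        for (std::size_t j = 0; j < n; ++j)
            A[j] += local[j];
    }
}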
Besides, @Jérôme Richard's answer focuses on parallel multi-threaded computation.
I would suggest the idea of a partial sort, something like "merge sort for just a few iterations" or "bucket sort, only into buckets" (note that these are different). Preferably, set the block size to the page size for better overall performance at the OS level, especially considering that m is extraordinarily big. The cost of the partial sort would be amortized by the cache misses and page swaps it saves.
And if this is an interview question, I would ask for more details about m, p, t, the data sparsity and distribution, the hardware, CPU, memory, power consumption, latency, etc., and for each new condition tailor the design accordingly.
I have searched for an answer to this question but have not found anything that can directly help me.
I am working on a 3D numerical integrator for a non-linear PDE using the parallel FFT library included in MKL.
My arrays consist of 2^30 data points which is much much larger than the cache. This results in ~50% of cache references being misses, which appears to add a massive amount of overhead accessing memory.
Is there a clever way I can deal with this? Is it expected to have 50% cache misses using an array this large?
Any help would be much appreciated.
Thanks,
Dylan
2^30 data points in a single FFT counts as being quite big!
The data plus the exponentials and the output array are several thousand times bigger than the L3 cache, and millions of times bigger than L1.
Given that disparity one might argue that a 50% cache miss rate is actually quite good, especially for an algorithm like an FFT which accesses memory in non-sequential ways.
I don't think that there will be much you can do about it. The MKL is quite good, and I'm sure that they've taken advantage of whatever cache hinting instructions there are.
You might try contacting Mercury Systems Inc. (www.mrcy.com) and ask them about their Scientific Algorithms Library (SAL). They have a habit of writing their own math libraries, and in my experience they are pretty good at it. Their FFT on PowerPC was 30% quicker than the next best one; quite an achievement. You can try an un-optimised version of SAL for free (http://sourceforge.net/projects/opensal/). The real optimised for Intel SAL is definitely not free though.
Also bear in mind that no matter how clever the algorithm is, with a data set that size you're always going to be fundamentally stuck with main memory bandwidths, not cache bandwidths.
GPUs might be worth a look, but you'd need one with a lot of memory to hold 2^30 data points (32-bit complex values = 2 GB, the same again for the output array, plus exponentials, etc.).
I think the problem of excessive misses is due to a failure of the cache prefetch mechanism, but not knowing the details of the memory accesses I can't tell you exactly why.
It does not matter that your arrays are very large; 50% misses are excessive. The processor should avoid misses by detecting that you are iterating over an array and loading ahead of time the data elements you are likely to use.
Either the pattern of array accesses is not regular, so the prefetcher in the processor cannot figure out a pattern to prefetch, or you have a cache associativity problem, that is, elements in your iteration keep being mapped to the same cache slots.
For example, assume a cache size of 1 MB and an associativity of 4. Roughly speaking, such a cache maps an address to an internal slot using its lower 20 bits. If you stride by 1 MB, that is, your iterations are exactly 1 MB apart, then the lower 20 bits are always the same, so every new element goes to the same cache slot as the previous ones. Once you reach the fifth element, all four ways are used up and from then on every access is a miss; in that case your cache effectively has a single slot. If you stride by half the cache size, the effective number of slots is 2, which might be enough to avoid misses entirely, or give 100% misses, or anything in between, depending on whether your access pattern needs both slots simultaneously or not.
To convince yourself of this, write a toy program with varying stride sizes and you'll see that strides that divide, or are multiples of, the cache size increase misses; you can measure this with valgrind --tool=cachegrind.
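Something along these lines (illustrative only; the exact strides and buffer size are arbitrary) makes the effect visible, either by timing it or by running it under cachegrind:

#include <chrono>
#include <cstdio>
#include <vector>

int main()
{
    const size_t size = 64 * 1024 * 1024;        // 64 MB working set
    std::vector<char> buf(size, 1);
    // Same number of accesses for every stride; only the access pattern changes.
    for (size_t stride : {4096u, 4160u, 1u << 20, (1u << 20) + 64u}) {
        auto t0 = std::chrono::steady_clock::now();
        long long sum = 0;
        for (size_t start = 0; start < stride; ++start)
            for (size_t i = start; i < size; i += stride)
                sum += buf[i];
        double ms = std::chrono::duration<double, std::milli>(
                        std::chrono::steady_clock::now() - t0).count();
        std::printf("stride %zu: %.1f ms (sum=%lld)\n", stride, ms, sum);
    }
}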
You should first make sure you know what is causing the cache misses; they may be the fault of other code you've written rather than the FFT library. In fact, I expect that is very likely the case.
The rest of this post assumes that the FFT is really at fault and we need to optimize.
The standard trick to get data locality out of an FFT is to
Arrange the data in a two-dimensional array
Do an FFT along each row
Apply twiddle factors
Do a matrix transpose
Do an FFT along each row
This is the Cooley-Tukey algorithm, in the case where we factor 2^(m+n) = 2^m * 2^n.
The point of this is that the recursive calls to the FFT are much much smaller, and may very well fit in cache. And if not, you can apply this method recursively until things do fit in cache. And if you're ambitious, you do a lot of benchmarking to figure out the optimal way to do the splitting.
Thus, assuming you also use a good matrix transpose algorithm, the end result is a relatively cache-friendly FFT.
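To make the splitting concrete, here is a minimal sketch of the four-step decomposition for N = R*C. The dft() helper is a deliberately naive stand-in for the small, cache-resident FFTs a real implementation would use (e.g. library FFTs of length R and C), and the transposes are left as explicit copy loops:

#include <complex>
#include <vector>

using cd = std::complex<double>;
static const double PI = 3.14159265358979323846;

// Naive O(n^2) DFT of a strided sequence -- a stand-in for a real small FFT.
static void dft(const cd* in, cd* out, size_t n, size_t stride)
{
    for (size_t k = 0; k < n; ++k) {
        cd acc = 0;
        for (size_t j = 0; j < n; ++j)
            acc += in[j * stride] * std::polar(1.0, -2.0 * PI * double(j) * double(k) / double(n));
        out[k] = acc;
    }
}

// Four-step FFT of length N = R*C:
//   X[C*a + b] = sum_r W_R^{ra} * ( W_N^{rb} * sum_c x[r + R*c] * W_C^{cb} )
std::vector<cd> fft_four_step(const std::vector<cd>& x, size_t R, size_t C)
{
    const size_t N = R * C;
    std::vector<cd> B(N), X(N), col(R), out(R);

    // Step 1: length-C FFTs (the "row" FFTs of the R x C view of the data).
    for (size_t r = 0; r < R; ++r)
        dft(&x[r], &B[r * C], C, R);   // B[r*C + b] = sum_c x[r + R*c] * W_C^{cb}

    // Step 2: twiddle factors W_N^{rb}.
    for (size_t r = 0; r < R; ++r)
        for (size_t b = 0; b < C; ++b)
            B[r * C + b] *= std::polar(1.0, -2.0 * PI * double(r) * double(b) / double(N));

    // Steps 3-4: length-R FFTs; a real implementation would transpose B first
    // so that each of these FFTs reads contiguous, cache-friendly memory.
    for (size_t b = 0; b < C; ++b) {
        for (size_t r = 0; r < R; ++r) col[r] = B[r * C + b];
        dft(col.data(), out.data(), R, 1);
        for (size_t a = 0; a < R; ++a) X[C * a + b] = out[a];
    }
    return X;
}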
The library you're using really should be doing this already. If it's not, then some options are:
Maybe it exposes enough lower-level functionality that you can make it apply Cooley-Tukey efficiently yourself, even if the high-level routines don't.
You could implement Cooley-Tukey yourself, using the given library to do the smaller FFTs.
I'd like to ask fellow SO'ers for their opinions regarding best of breed data structures to be used for indexing time-series (aka column-wise data, aka flat linear).
Two basic types of time-series exist based on the sampling/discretisation characteristic:
Regular discretisation (every sample is taken at a common frequency)
Irregular discretisation (samples are taken at arbitrary time points)
Queries that will be required:
All values in the time range [t0,t1]
All values in the time range [t0,t1] that are greater/less than v0
All values in the time range [t0,t1] that are in the value range [v0,v1]
The data sets consist of summarized time-series (which sort of gets over the Irregular discretisation), and multivariate time-series. The data set(s) in question are about 15-20TB in size, hence processing is performed in a distributed manner - because some of the queries described above will result in datasets larger than the physical amount of memory available on any one system.
Distributed processing in this context also means dispatching the required data-specific computation along with the time-series query, so that the computation occurs as close to the data as possible and node-to-node communication is reduced (somewhat similar to the map/reduce paradigm) - in short, proximity of computation and data is critical.
Another issue that the index should be able to cope with is that the overwhelming majority of the data is static/historic (99.999...%), but new data is added on a daily basis; think of "sensors in the field" or "market data". The idea/requirement is to be able to update any running calculations (averages, GARCH models, etc.) with as low a latency as possible; some of these running calculations require historical data, more than can reasonably be cached.
I've already considered HDF5; it works well/efficiently for smaller datasets but starts to drag as the datasets become larger, and it has no native parallel-processing capability from the front end.
Looking for suggestions, links, further reading etc. (C or C++ solutions, libraries)
You would probably want to use some type of large, balanced tree. As Tobias mentioned, B-trees would be the standard choice for solving the first problem. If you also care about getting fast insertions and updates, there is a lot of new work being done at places like MIT and CMU on "cache-oblivious B-trees". For some discussion of how these are implemented, look up Tokutek DB; they have a number of good presentations, such as the following:
http://tokutek.com/downloads/mysqluc-2010-fractal-trees.pdf
Questions 2 and 3 are in general a lot harder, since they involve higher-dimensional range searching. The standard data structure for this would be the range tree, which gives O(log^{d-1}(n)) query time (with fractional cascading) at the cost of O(n log^{d-1}(n)) storage. You generally would not want to use a k-d tree for something like this: while it is true that k-d trees have optimal, O(n), storage costs, they cannot answer range queries faster than O(n^{(d-1)/d}) in the worst case with only O(n) storage. For d=2, this would be O(sqrt(n)) time complexity, and frankly that isn't going to cut it if you have 10^10 data points (who wants to wait for ~10^5 disk reads to complete on a simple range query?).
Fortunately, it sounds like in your situation you really don't need to worry too much about the general case. Because all of your data comes from a time series, you only ever have at most one value per time coordinate. Hypothetically, what you could do is just use a range query to pull some interval of points, then go through it as a post-process and apply the v constraints pointwise. This would be the first thing I would try (after getting a good database implementation), and if it works then you are done! It really only makes sense to try optimizing the latter two queries if you keep running into situations where the number of points in [t0, t1] x [-infty, +infty] is orders of magnitude larger than the number of points in [t0, t1] x [v0, v1].
General ideas:
Problem 1 is fairly common: create an index that fits into your RAM and has links to the data on secondary storage (data structure: the B-tree family).
Problems 2 and 3 are quite complicated since your data is so large. You could partition your data into time ranges and calculate the min/max for each range. Using that information, you can filter out whole time ranges (e.g. if the max value for a range is 50 and you search for v0 > 60, the interval is out); the rest has to be searched by going through the data. The effectiveness greatly depends on how quickly the data changes.
You can also build multiple index levels by combining the time ranges of lower levels, to make the filtering faster.
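A rough sketch of the per-block min/max filtering described above (the Sample/Block layout and function name are purely illustrative; in a real system the raw samples would live on secondary storage and only the summaries in RAM):

#include <cstdint>
#include <vector>

struct Sample { int64_t t; double v; };

struct Block {
    int64_t t_begin, t_end;       // time span covered by the block
    double  v_min, v_max;         // value summary used for pruning
    std::vector<Sample> samples;  // in practice these live on secondary storage
};

std::vector<Sample> range_query(const std::vector<Block>& blocks,
                                int64_t t0, int64_t t1, double v0, double v1)
{
    std::vector<Sample> out;
    for (const Block& b : blocks) {
        if (b.t_end < t0 || b.t_begin > t1) continue;  // outside the time range
        if (b.v_max < v0 || b.v_min > v1) continue;    // summary rules the block out
        for (const Sample& s : b.samples)              // only now touch the raw data
            if (s.t >= t0 && s.t <= t1 && s.v >= v0 && s.v <= v1)
                out.push_back(s);
    }
    return out;
}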
It is going to be really time-consuming and complicated to implement this by yourself. I recommend you use Cassandra.
Cassandra can give you horizontal scalability, redundancy and allow you to run complicated map reduce functions in future.
To learn how to store time series in Cassandra, please take a look at:
http://www.datastax.com/dev/blog/advanced-time-series-with-cassandra
and http://www.youtube.com/watch?v=OzBJrQZjge0.
I have the following three-dimensional bit array (for a Bloom filter):
unsigned char P_bit_table_[P_ROWS][ROWS][COLUMNS];
The P_ROWS dimension represents independent two-dimensional bit arrays (i.e. P_ROWS[0], P_ROWS[1], P_ROWS[2] are independent bit arrays); each can be as large as 100 MB and contains data that is populated independently. The data I am looking for could be in any of these P_ROWS, and right now I search through them one at a time: P_ROWS[0], then P_ROWS[1], and so on until I get a positive or reach the end (P_ROWS[n-1]). This means that if n is 100 I have to do this search (bit comparison) 100 times, and this search is done very often. Somebody suggested that I could improve the search performance with bit grouping (use a column-major order on the row-major order array -- I DON'T KNOW HOW).
I really need to improve the performance of the search because the program does a lot of it.
I will be happy to give more details of my bit table implementation if required.
Sorry for the poor language.
Thanks for your help.
EDIT:
The bit grouping could be done in the following format:
Assume the array to be:
unsigned char P_bit_table_[P_ROWS][ROWS][COLUMNS] = {{(a1,a2,a3),(b1,b2,b3),(c1,c2,c3)},
                                                     {(a1,a2,a3),(b1,b2,b3),(c1,c2,c3)},
                                                     {(a1,a2,a3),(b1,b2,b3),(c1,c2,c3)}};
As you can see, all the rows -- on the third dimension -- hold similar data. What I want after the grouping is that all the a1's are in one group (as just one entity, so that I can compare them against another bit to check whether they are on or off), all the b1's are in another group, and so on.
Re-use Other People's Algorithms
There are a ton of bit-calculation optimizations out there, including many non-obvious ones like Hamming weights and specialized algorithms for finding the next true or false bit, that are rather independent of how you structure your data.
Reusing algorithms that other people have written can really speed up computation and lookups, not to mention development time. Some algorithms are so specialized, and use such computational magic, that they will have you scratching your head; in that case, you can take the author's word for it (after you confirm their correctness with unit tests).
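As a small example of the kind of reusable primitives meant here (GCC/Clang builtins shown; other compilers and C++20's <bit> header have equivalents):

#include <cstdint>

// Population count (Hamming weight) of a 64-bit word.
inline int popcount64(uint64_t x) { return __builtin_popcountll(x); }

// Index of the lowest set bit at or after position pos (0..63), or -1 if none.
inline int next_set_bit(uint64_t word, int pos)
{
    uint64_t masked = word & (~uint64_t(0) << pos);  // clear bits below pos
    return masked ? __builtin_ctzll(masked) : -1;
}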
Take Advantage of CPU Caching and Multithreading
I personally reduce my multidimensional bit arrays to one dimension, optimized for expected traversal.
This way, there is a greater chance of hitting the CPU cache.
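Here is a sketch of what that flattening/regrouping could look like for the question's table, under the simplifying assumption that there are at most 64 planes so that a single machine word can hold a given (row, column) bit for every plane; with more planes you would use several words per position:

#include <cstdint>
#include <vector>

// All planes' copies of bit (row, col) live in the same 64-bit word, so one
// load answers "which planes have this bit set?" instead of n separate probes.
struct GroupedBitTable {
    size_t planes, rows, cols;            // assumes planes <= 64 in this sketch
    std::vector<uint64_t> words;          // one word per (row, col) position

    GroupedBitTable(size_t p, size_t r, size_t c)
        : planes(p), rows(r), cols(c), words(r * c, 0) {}

    void set(size_t plane, size_t row, size_t col) {
        words[row * cols + col] |= uint64_t(1) << plane;
    }
    uint64_t test_all_planes(size_t row, size_t col) const {
        return words[row * cols + col];   // non-zero bits = planes with a hit
    }
};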
In your case, I would also think deeply about the mutability of the data and whether you want to put locks on blocks of bits. With 100MBs of data, you have the potential of running your algorithms in parallel using many threads, if you can structure your data and algorithms to avoid contention.
You may even have a lockless model if you divide up ownership of the blocks of data by thread so no two threads can read or write to the same block. It all depends on your requirements.
Now is a good time to think about these issues. But since no one knows your data and usage better than you do, you must consider design options in the context of your data and usage patterns.
I need to read from a dataset which is very large, highly interlinked, the data is fairly localized, and reads are fairly expensive. Specifically:
The data sets are 2 GB to 30 GB in size, so I have to map sections of the file into memory to read them. This is very expensive compared to the rest of the work I do in the algorithm. From profiling I've found that roughly 60% of the time is spent reading memory, so this is the right place to start optimizing.
When operating on a piece of this dataset, I have to follow links inside of it (imagine something similar to a linked list), and while those reads aren't guaranteed to be anywhere near sequential, they are fairly localized. This means:
Let's say, for example, we operate on 2 megs of memory at a time. If you read 2 megs of data into memory, roughly 40% of the reads I will have to subsequently do will be in that same 2 megs of memory. Roughly 20% of the reads will be purely random access in the rest of the data, and the other 40% very likely links back into the 2meg segment which pointed to this one.
From knowledge of the problem and from profiling, I believe that introducing a cache to the program will help greatly. What I want to do is create a cache which holds N chunks of X megs of memory (N and X configurable so I can tune it) which I can check first, before having to map another section of memory. Additionally, the longer something has been in the cache, the less likely it is that we will request that memory in the short term, and so the oldest data will need to be expired.
After all that, my question is very simple: What data structure would be best to implement a cache of this nature?
I need to have very fast lookups to see if a given address is in the cache. With every "miss" of the cache, I'll want to expire the oldest member of it, and add a new member. However, I plan to try to tune it (by changing the amount that's cached) such that 70% or more of reads are hits.
My current thinking is that an AVL tree (log2 n for search/insert/delete) would be the safest choice (no degenerate cases). My other option is a sparse hashtable, where lookups would be O(1) in the best case. In theory this could degenerate to O(n), but in practice I could keep collisions low. The concern there is how long it takes to find and remove the oldest entry in the hashtable.
Does anyone have any thoughts or suggestions on what data structure would be best here, and why?
Put the cache into two sorted trees (AVL or any other reasonably balanced tree implementation is fine--you're better off using one from a library than creating your own).
One tree should sort by position in the file. This lets you do log(n) lookups to see if your cache is there.
The other tree should sort by time used (which can be represented by a number that increments by one on each use). When you use a cached block, you remove it, update the time, and insert it again. This will take log(n) also. When you miss, remove the smallest element of the tree, and add the new block as the largest. (Don't forget to also remove/add that block to the by-position-in-file tree.)
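A minimal sketch of this two-tree scheme using std::map for both orderings (the Block type and the capacity policy are placeholders, and insert() is assumed to be called only after a miss):

#include <cstdint>
#include <map>
#include <utility>
#include <vector>

struct Block { std::vector<char> bytes; };

class TwoTreeCache {
    size_t capacity_;
    uint64_t clock_ = 0;                                     // bumped on every use
    std::map<uint64_t, std::pair<uint64_t, Block>> by_pos_;  // file offset -> (last use, block)
    std::map<uint64_t, uint64_t> by_time_;                   // last use -> file offset
public:
    explicit TwoTreeCache(size_t capacity) : capacity_(capacity) {}

    Block* lookup(uint64_t offset) {                 // log(n) search by position
        auto it = by_pos_.find(offset);
        if (it == by_pos_.end()) return nullptr;     // miss
        by_time_.erase(it->second.first);            // re-stamp the time of use
        it->second.first = ++clock_;
        by_time_[it->second.first] = offset;
        return &it->second.second;
    }

    void insert(uint64_t offset, Block block) {      // call only after a miss
        if (by_pos_.size() >= capacity_) {           // evict the oldest block
            auto oldest = by_time_.begin();
            by_pos_.erase(oldest->second);
            by_time_.erase(oldest);
        }
        const uint64_t stamp = ++clock_;
        by_pos_[offset] = {stamp, std::move(block)};
        by_time_[stamp] = offset;
    }
};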
If your cache doesn't have very many items in it, you'll be better off still by just storing everything in a sorted array (using insertion sort to add new elements). Moving 16 items down one spot in an array is incredibly fast.
Seems like you are looking for an LRU (Least Recently Used) cache: LRU cache design
If 60% of your algorithm is I/O, I suggest that the actual cache design doesn't really matter that much - any sort of cache could be a substantial speed-up.
However, the design depends a lot on what data you're using to access your chunks: string, int, etc. If you had an int key, you could use a hashmap pointing into a linked list: on a cache miss, erase the back and push the new block on top; on a cache hit, erase the entry from its place and push it back on top.
Hashmaps are provided under varying names (most commonly, unordered_map) in many implementations: Boost has one, there's one in TR1, etc. A big advantage of a hash map is less performance loss as the number of entries grows, and more flexibility about key values.
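A minimal sketch of the hashmap-plus-linked-list LRU described above, using std::unordered_map and std::list (the key and block types are placeholders):

#include <cstdint>
#include <list>
#include <unordered_map>
#include <vector>

class LruCache {
    using Key = uint64_t;
    struct Entry { Key key; std::vector<char> block; };
    size_t capacity_;
    std::list<Entry> order_;                                   // front = most recently used
    std::unordered_map<Key, std::list<Entry>::iterator> index_;
public:
    explicit LruCache(size_t capacity) : capacity_(capacity) {}

    std::vector<char>* get(Key key) {
        auto it = index_.find(key);
        if (it == index_.end()) return nullptr;                // miss
        order_.splice(order_.begin(), order_, it->second);     // move to the front
        return &it->second->block;
    }

    void put(Key key, std::vector<char> block) {
        auto it = index_.find(key);
        if (it != index_.end()) {                              // already cached: refresh it
            it->second->block = std::move(block);
            order_.splice(order_.begin(), order_, it->second);
            return;
        }
        if (order_.size() >= capacity_) {                      // evict the least recently used
            index_.erase(order_.back().key);
            order_.pop_back();
        }
        order_.push_front({key, std::move(block)});
        index_[key] = order_.begin();
    }
};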