SIMD Implementation of std::nth_element - c++

I have an algorithm that runs on my dual-core, 3 GHz Intel processor in 250 ms on average, and I am trying to optimize it. Currently, I have an std::nth_element call that is invoked around 6,000 times on std::vectors of between 150 and 300 elements, taking 50 ms on average. I've spent some time optimizing the comparator I use, which currently looks up two doubles from a vector and does a simple < comparison. The comparator takes a negligible fraction of the time spent in std::nth_element. The comparator's copy constructor is also simple.
Since this call is currently taking 20% of the time for my algorithm, and since the time is mostly spent in the code for nth_element that I did not write (i.e. not the comparator), I'm wondering if anyone knows of a way of optimizing nth_element using SIMD or any other approach? I've seen some questions on parallelizing std::nth_element using OpenCL and multiple threads, but since the vectors are pretty short, I'm not sure how much benefit I would get from that approach, though I'm open to being told I'm wrong.
If there is an SSE approach, I can use any SSE instruction up to (the current, I think) SSE4.2.
Thanks!

Two thoughts:
Multithreading probably won't speed up processing for any single vector, but might help you as the number of vectors grows large.
Sorting is too powerful a tool for your problem: you're computing the entire order of the vector, but you don't care about anything except the top few elements. You know for each vector how many elements make up the top 5%, so instead of sorting the whole thing, make one pass through the array and find the k largest. You can do this in O(n) time with k extra space, so it's probably a win overall.
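For illustration, here is a minimal sketch of that one-pass selection, assuming plain doubles and a hypothetical top_k helper (in your case you would select indices and reuse your existing comparator). It keeps a scratch buffer of at most 2k candidates and prunes it with std::nth_element, which gives amortized O(n) time and O(k) extra space:
#include <algorithm>
#include <functional>
#include <vector>

std::vector<double> top_k(const std::vector<double>& values, std::size_t k)
{
    std::vector<double> buf;
    buf.reserve(2 * k);
    auto shrink_to_k = [&] {
        // Move the k largest to the front, then drop the rest.
        std::nth_element(buf.begin(), buf.begin() + k, buf.end(),
                         std::greater<double>());
        buf.resize(k);
    };
    for (double v : values) {
        buf.push_back(v);
        if (buf.size() == 2 * k)
            shrink_to_k();   // O(k) work every k insertions -> O(n) total
    }
    if (buf.size() > k)
        shrink_to_k();
    return buf;              // the k largest, in no particular order
}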

Related

Any optimization about random access array modification?

Given an array A of size 10^5.
Then m operations are given (m is very large, m >> the size of A); each operation specifies a position p and an increment t:
A[p] += t
Finally, I output the value of each position of the whole array.
Is there any constant optimization to speed up the intermediate modification operations?
For example, if I sort the positions, I can modify them sequentially to avoid random access. However, this operation will incur an additional sorting cost. Is there any other way to speed it up?
Re-executing all the operations after sorting them by position can be an order of magnitude faster than executing them directly, but the cost of the sort itself is too high.
On architectures with many cores, the best solution is certainly to perform the A[p] updates atomically in parallel. This assumes the number of cores is big enough that the parallelism not only mitigates the overhead of the atomic operations but also beats the serial implementation. This can be done fairly easily with OpenMP or with native C++ threads/atomics. The number of cores should not be too large, though; otherwise the number of conflicts may grow significantly, causing contention and thus decreasing performance. That should be fine here since the number of items is pretty big. This solution also assumes the accesses are roughly uniformly random. If they are not (e.g. a normal distribution), the contention can be too high for the method to be efficient.
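A minimal OpenMP sketch of this first approach, assuming the operations arrive as two arrays pos and val of length m (these names and types are illustrative, not from the question):
#include <cstdint>
#include <vector>

// Parallel atomic updates: compile with -fopenmp (or equivalent).
void apply_atomic(std::vector<std::int64_t>& A,
                  const std::vector<std::int32_t>& pos,
                  const std::vector<std::int64_t>& val)
{
    const std::int64_t m = static_cast<std::int64_t>(pos.size());
    #pragma omp parallel for schedule(static)
    for (std::int64_t i = 0; i < m; ++i) {
        // The atomic update makes concurrent writes to the same position
        // safe; uniformly random positions keep contention low.
        #pragma omp atomic
        A[pos[i]] += val[i];
    }
}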
An alternative solution is to split the accesses between N threads spatially. The array range can be statically split into N (roughly equal) parts. All the threads read the inputs, but only the thread owning the target range of the output array writes into it. The array parts can then be combined afterwards. This method works well with few threads and a uniform data distribution. When the distribution is not uniform at all (e.g. a normal distribution), a pre-computation step may be needed to adjust the array ranges owned by the threads. For example, one can compute the median, or even the quartiles, to better balance the work between threads. Computing quartiles can be done with a partitioning algorithm like Floyd-Rivest (std::nth_element should not be too bad either, though I expect it to use a kind of introselect algorithm, which is often a bit slower). The pre-computation may be expensive, but it should still be significantly faster than a full sort. Using OpenMP is certainly a good idea for implementing this.
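A sketch of this range-ownership variant under the same illustrative pos/val assumption: each thread scans all operations but writes only into the part of A it owns, so no synchronization is needed.
#include <cstdint>
#include <vector>
#include <omp.h>

void apply_range_owned(std::vector<std::int64_t>& A,
                       const std::vector<std::int32_t>& pos,
                       const std::vector<std::int64_t>& val)
{
    const std::size_t n = A.size();
    const std::size_t m = pos.size();
    #pragma omp parallel
    {
        const std::size_t t  = static_cast<std::size_t>(omp_get_thread_num());
        const std::size_t nt = static_cast<std::size_t>(omp_get_num_threads());
        // Static, equally sized ownership ranges (fine for a uniform
        // distribution; skewed data would need the rebalancing step above).
        const std::size_t lo = n * t / nt;
        const std::size_t hi = n * (t + 1) / nt;
        for (std::size_t i = 0; i < m; ++i) {
            const std::size_t p = static_cast<std::size_t>(pos[i]);
            if (p >= lo && p < hi)
                A[p] += val[i];   // only the owner writes, so no races
        }
    }
}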
Another alternative is simply to perform the reduction separately in each thread and then sum the per-thread arrays into a global array at the end. This solution works well in your case (since "m >> the size of A"), assuming the number of cores is not too big; if it is, one needs to mix this method with the first one. This last method is probably the simplest efficient one.
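A sketch of this last method, again with the illustrative pos/val arrays: each thread accumulates into a private copy of A, and the copies are merged at the end (cheap because the array itself is small compared to m).
#include <cstdint>
#include <vector>

void apply_private_copies(std::vector<std::int64_t>& A,
                          const std::vector<std::int32_t>& pos,
                          const std::vector<std::int64_t>& val)
{
    const std::size_t n = A.size();
    const std::int64_t m = static_cast<std::int64_t>(pos.size());
    #pragma omp parallel
    {
        std::vector<std::int64_t> local(n, 0);   // private accumulator
        #pragma omp for schedule(static) nowait
        for (std::int64_t i = 0; i < m; ++i)
            local[pos[i]] += val[i];             // no sharing, no atomics
        // Merge: one thread at a time adds its copy into the shared array.
        #pragma omp critical
        for (std::size_t j = 0; j < n; ++j)
            A[j] += local[j];
    }
}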
Besides, @Jérôme Richard's answer targeted parallel multi-threaded computing.
I would suggest the idea of a partial sort, something like "merge sort, but only a few iterations" or "bucket sort, but only within each bucket" (note that these are different). Preferably, set the bucket size to the OS page size to get better overall performance at the OS level, especially considering that m is extraordinarily big. The cost of the partial sort would be amortized by the savings in cache misses and page swaps.
And if this is an interview question, I would ask for more details about m, p, t, data sparsity, distribution, hardware, CPU, memory, power consumption, latency, etc., and for each new condition I would customize a more detailed design accordingly.

concurrency::parallel_sort overhead and performance hit (rule of thumb)?

Recently I stumbled across a very large performance improvement -- I'm talking about a 4x improvement -- with a one-line code change. I just changed a std::sort call to concurrency::parallel_buffered_sort
// Get a contiguous vector copy of the pixels from the image.
std::vector<float> vals = image.copyPixels();
// New, fast way. Takes 7 seconds on a test image.
concurrency::parallel_buffered_sort(vals.begin(), vals.end());
// Old, slow way -- takes 30 seconds on a test image
// std::sort(vals.begin(), vals.end());
This was for a large image and dropped my processing time from 30 seconds to 7 seconds. However, some cases will involve small images. I don't know if I can or should just do this blindly.
I would like to make some judicious use of parallel_sort, parallel_for and the like but I'm wondering about what threshold needs to be crossed (in terms of number of elements to be sorted/iterated through) before it becomes a help and not a hindrance.
I will eventually do some lengthy performance testing, but at the moment I don't have a lot of time to do that. I would like to get this working better "most" of the time and not hurting performance any of the time (or at least only rarely).
Can someone out there with some experience in this area give me a reasonable rule of thumb that will help in "most" cases? Does one exist?
The requirement of RandomIterator and the presence of overloads with a const size_t _Chunk_size = 2048 parameter, which controls the threshold below which the sort runs serially, imply that the library authors are aware of this concern. So probably just using concurrency::parallel_* as drop-in replacements for std::* will do fine.
Here is how I think about it: the Windows thread-scheduling quantum is roughly 20-60 ms on a workstation and 120 ms on a server, so anything that can be done within that much time doesn't need concurrency.
So I am guessing that up to about 1k-10k elements you are fine with std::sort, since the latency of launching multiple threads would be overkill, but beyond 10k there is a distinct advantage in using parallel sort or parallel buffered sort (if you can afford it), and a parallel radix sort would probably be great for very, very large inputs.
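As an illustration of that rule of thumb (the 10k cutoff is the guess above, not a measured constant), one could wrap the choice in a small helper along these lines:
#include <ppl.h>
#include <algorithm>
#include <vector>

template <typename T>
void sort_adaptive(std::vector<T>& v)
{
    // Below the cutoff, thread start-up costs more than it saves.
    const std::size_t kParallelThreshold = 10000;
    if (v.size() < kParallelThreshold)
        std::sort(v.begin(), v.end());
    else
        concurrency::parallel_buffered_sort(v.begin(), v.end());
}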
Caveats apply. :o)
I don’t know about that concurrency namespace, but any sane implementation of a parallel algorithm will adapt appropriately to the size of the input. You shouldn’t have to worry about the details of the underlying thread implementation. Just do it.

Is it possible to implement median of medians introselect with no swaps or heap allocations?

So I've run into a problem in some code I'm writing in C++. I need to find the median of an array of points with an offset and step size (example).
This code will be executed millions of times, as it's part of one of my core data structures, so I'm trying to make it as fast as possible.
Research has led me to believe that, for the best worst-case time complexity, introselect is the fastest way to find a median in a set of unordered values. I have some additional limitations that have to do with optimization:
I can't swap any values in the array. The values are all exactly where they need to be given their context in the program, but I still need the median.
I can't make any "new" allocations or call anything that does heap allocation. Or, if I have to, they need to be kept to a minimum, because they are costly.
I've tried implementing the following in C++:
First Second Third
Is this possible? Or are there alternatives that are just as fast at finding the median and fit those requirements?
You could consider using the same heap allocation for all operations and avoid freeing it until you're done. That way, rather than creating millions of arrays, you create just one.
Of course, this approach is more complex if you're doing these find-median operations in parallel; you'd need one array per thread.
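A minimal sketch of that idea, assuming a strided view of doubles (the function name and parameters are illustrative): a thread_local scratch buffer is allocated once per thread and reused across the millions of calls, the original array is never modified, and std::nth_element does the selection in expected O(n).
#include <algorithm>
#include <vector>

double median_strided(const double* data, std::size_t count,
                      std::size_t offset, std::size_t step)
{
    thread_local std::vector<double> scratch;  // one reusable buffer per thread
    scratch.clear();
    scratch.reserve(count);                    // grows once, then reused
    for (std::size_t i = 0; i < count; ++i)
        scratch.push_back(data[offset + i * step]);

    // Select the middle element of the copy; the caller's array is untouched.
    const std::size_t mid = scratch.size() / 2;
    std::nth_element(scratch.begin(), scratch.begin() + mid, scratch.end());
    return scratch[mid];
}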

Counting the length of the Collatz sequence - custom iterator produces slowdown?

I've been solving the UVA problem #100 - "The 3n + 1 problem". This is their "sample" problem, with a very forgiving time limit (limit of 3 sec; their sample solution with no caching at all runs in 0.738 sec, and my best solution so far runs in 0.016 sec), so I thought that however I experimented with the code, I would always fit within the limit. Well, I was mistaken.
The problem specification is simple: each line of input has two numbers i and j, and the output should print these numbers followed by the maximum length of the Collatz sequence whose starting value lies between i and j inclusive.
I've prepared four solutions. They are quite similar and all produce good answers.
The first solution caches up to 0x100000 sequence lengths in a vector. Length of 0 means the length for the sequence that starts at this particular number has not yet been calculated. It runs fast enough - 0.03 sec.
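Roughly, the caching in the first solution looks like this (a simplified sketch, not my exact code):
#include <cstdint>
#include <vector>

const std::uint64_t kCacheSize = 0x100000;
std::vector<std::uint32_t> cache(kCacheSize, 0);   // 0 = not computed yet

std::uint32_t collatz_length(std::uint64_t n)
{
    if (n < kCacheSize && cache[n] != 0)
        return cache[n];
    std::uint32_t len;
    if (n == 1)
        len = 1;
    else if (n % 2 == 0)
        len = 1 + collatz_length(n / 2);
    else
        len = 1 + collatz_length(3 * n + 1);
    if (n < kCacheSize)
        cache[n] = len;    // cache only the values that fit
    return len;
}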
The second solution is quite similar, only it caches every single length in a sparse array implemented with an unordered_map. It runs noticeably slower than the previous solution, but still fits well within the limit: 0.28 sec.
As an exercise, I also wrote the third solution, which is based on the second one. The goal was to use the max_element function, which only accepts iterators. I couldn't use unordered_map::iterator, since incrementing such an iterator is, AFAIK, linear in the map size; therefore, I wrote a custom iterator operating on an abstract "container" that "holds" the sequence length of every possible number (but in fact calculates and caches them only as needed). At its core, it's the same unordered_map solution - there's just an extra layer of iterators added on top. The solution did not fit within the 3 sec limit.
And now comes the thing I can't understand. While the third solution is obviously overcomplicated on purpose, I could hardly believe that an extra layer of iterators could produce such a slowdown. To check this, I added the same iterator layer to the vector solution. This is my fourth solution. Judging from what this iterator idea did to my unordered_map solution, I expected a considerable slowdown here as well; but oddly enough, this did not happen at all. This solution runs almost as fast as the plain vector one, in 0.035 sec.
How is this possible? What exactly is responsible for the slowdown in the third solution? How can overcomplicating two similar solutions in exactly the same way slow one of them down so much while hardly harming the other at all? Why did adding the iterator layer to the unordered_map solution make it miss the time limit, while doing the same to the vector solution hardly slowed it down?
EDIT:
I found out that the problem seems to be most visible if the input contains many repetitive lines. I tested all four solutions on my machine against the input 1 1000000, repeated 200 times. The solution with a plain vector processed all of them in 1.531 sec. The solution with the vector and the additional iterator layer took 3.087 sec. The solution with the plain unordered_map took 33.821 sec. And the solution with the unordered_map and the additional iterator layer took more than half an hour - I halted it after 31 minutes 0.482 sec! I tested on Linux Mint 17.2 64-bit, g++ version Ubuntu 4.8.4-2ubuntu1~14.04 with flags -std=c++11 -O2, processor Celeron 2955U @ 1.4 GHz x 2.
It appears to be a problem in GCC 4.8. It doesn't occur in 4.9 and up. For some reason, subsequent outer loops (with the populated unordered_map cache) run progressively slower, not faster. I'm not sure why, since the unordered_map isn't getting bigger.
If you follow that link and switch GCC 4.8 to 4.9, then you'll see the expected behavior where subsequent runs over the same numeric range add little time because they leverage the cache.
Altogether, the philosophy of being "conservative" with compiler updates has been outdated for a long time. Compilers today are rigorously tested and you should use the latest (or at least some recent) release for everyday development.
For an online judge to subject you to long-fixed bugs is just cruel.

What is the time complexity of CUDA's 'thrust::min_element' function?

The Thrust library's documentation doesn't provide time complexities for its functions. I need to know the time complexity of this particular function. How can I find it out?
The min-element algorithm just finds the minimum value in an unsorted range. If there is any way to do this in less than linear O(n) time-complexity, then my name is Mickey Mouse. And any implementation that would do worse than linear would have to be extremely badly written.
When it comes to the time complexities of algorithms in CUDA Thrust, well, they are mainly CUDA-based parallelized implementations of the STL algorithms, so you can generally just refer to the STL documentation.
The fact that the algorithms are parallelized does not change the time complexity. At least, it generally cannot make the time complexity any better. Running things in parallel simply divides the overall execution time by the number of parallel executions. In other words, it only affects the "constant factor" that is omitted from the "Big-O" analysis. You get a certain speed-up factor, but the complexity remains the same. And there are usually difficulties and overheads associated with parallelizing, so the speedup is rarely "ideal". Complexity is reduced only very rarely, and then only for carefully crafted dynamic programming algorithms, not the kind of thing you'll find in CUDA Thrust. So, for Thrust, it's safe to assume all complexities are the same as those of the corresponding or closest-matching STL algorithm.
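For what it's worth, here is a small usage sketch (built with nvcc; the values are made up) showing that thrust::min_element mirrors std::min_element: it performs a single O(n) pass over the data, no matter how many GPU threads share the work.
#include <cstddef>
#include <thrust/device_vector.h>
#include <thrust/extrema.h>

int main()
{
    thrust::device_vector<float> d(1 << 20, 1.0f);
    d[12345] = -3.0f;                        // plant a known minimum
    auto it = thrust::min_element(d.begin(), d.end());
    std::size_t idx = static_cast<std::size_t>(it - d.begin());  // 12345
    return idx == 12345 ? 0 : 1;
}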