What is the time complexity of CUDA's 'thrust::min_element' function?

Thrust library's documentation doesn't provide the time complexities for the functions. I need to know the time complexity of this particular function. How can I find it out?

The min-element algorithm just finds the minimum value in an unsorted range. If there is any way to do this in less than linear O(n) time-complexity, then my name is Mickey Mouse. And any implementation that would do worse than linear would have to be extremely badly written.
When it comes to the time complexities of algorithms in CUDA Thrust, well, they are mainly CUDA-based parallelized implementations of the STL algorithms. So, you can generally just refer to the STL documentation.
The fact that the algorithms are parallelized does not change the time complexity. At least, it generally cannot make the complexity any better. Running things in parallel simply divides the overall execution time by the number of parallel executions; in other words, it only affects the "constant factor" that is omitted from big-O analysis. You get a certain speed-up factor, but the complexity remains the same. And there are usually difficulties and overhead associated with parallelizing, so the speedup is rarely "ideal". Only very rarely does parallelization reduce the complexity itself, and then only for some carefully crafted dynamic programming algorithms, not the kind of thing you'll find in CUDA Thrust. So, for Thrust, it's safe to assume all complexities are the same as those of the corresponding or closest-matching STL algorithm.
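For reference, here is a minimal sketch of what a call looks like (the data is made up; compile with nvcc). The work done is linear in the input size, consistent with the reasoning above:

#include <thrust/device_vector.h>
#include <thrust/extrema.h>
#include <vector>
#include <iostream>

int main()
{
    // Example data; the contents and size here are arbitrary.
    std::vector<float> h = { 3.0f, -1.5f, 7.0f, 0.25f };
    thrust::device_vector<float> d(h.begin(), h.end());

    // min_element does O(n) total work (effectively a parallel reduction),
    // just like the sequential std::min_element.
    thrust::device_vector<float>::iterator it =
        thrust::min_element(d.begin(), d.end());

    std::cout << "min value: " << *it
              << " at index " << (it - d.begin()) << std::endl;
    return 0;
}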

Related

How one would implement a 2-for particle-interaction loop using CUDA, and what is the resulting complexity?

This algorithm receives a world (list) of particles (3-dimensional vectors) and calls an interacting function between them. Or, in pseudocode:
function tick(world)
    for i in range(world)
        for j in range(world)
            world[i] = interact(world[i], world[j])
Where interact is a function that takes 2 particles and returns another one, and could be anything, for example:
function interact(a,b) = (a + b)*0.5
You can easily determine this algorithm is O(N^2) on the CPU. In my attempt to learn CUDA, I'm not sure how that could be implemented on the GPU. What would be the general structure of such algorithm, and what would be the resulting complexity? What if we knew the interact function didn't do anything if 2 particles were distant enough? Could we optimize it for locality?
What would be the general structure of such algorithm, and what would be the resulting complexity?
This is essentially the n-body problem, solved using a direct particle-particle approach. It has been written about a lot. The order of the algorithm is O(N^2) on the GPU, just as it is on the CPU.
The core algorithm as implemented in CUDA doesn't change a lot, except to take advantage of local (shared) block memory and optimize for it. Essentially, the implementation still comes down to two loops.
The following chapter is a good place to start: GPU Gems 3, Chapter 31, Fast N-Body Simulation with CUDA.
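To make the structure concrete, here is a minimal sketch of the direct particle-particle approach in CUDA, using the averaging interact rule from the question. The kernel and buffer names are made up, it writes to a separate output buffer to avoid read/write races, and it leaves out the shared-memory tiling that the chapter describes:

#include <cuda_runtime.h>

// One thread per particle i; each thread loops over all j.
// The total work is still O(N^2); the GPU only divides it across threads.
__global__ void tick_kernel(const float3* in, float3* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    float3 p = in[i];
    for (int j = 0; j < n; ++j) {
        float3 q = in[j];
        // interact(a, b) = (a + b) * 0.5, as in the question's example
        p.x = (p.x + q.x) * 0.5f;
        p.y = (p.y + q.y) * 0.5f;
        p.z = (p.z + q.z) * 0.5f;
    }
    out[i] = p;  // separate output buffer: threads never write what others read
}

// Host-side launch sketch: d_in and d_out are device buffers of n float3s.
void tick(const float3* d_in, float3* d_out, int n)
{
    int block = 256;
    int grid  = (n + block - 1) / block;
    tick_kernel<<<grid, block>>>(d_in, d_out, n);
}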
Could we optimize it for locality?
Yes. Many n-body algorithms attempt to optimize for locality: gravitational and electromagnetic forces decrease as a power of the distance between particles, so distant particles can be ignored or their contribution can be approximated. Which of these approximation approaches to take largely depends on the type of system you are trying to simulate.
The following is a good overview of some of the more popular approaches:
Seminar presentation, N-body algorithms

SIMD Implementation of std::nth_element

I have an algorithm that runs on my dual-core, 3 GHz Intel processor in 250 ms on average, and I am trying to optimize it. Currently, I have an std::nth_element call that is invoked around 6,000 times on std::vectors of between 150 and 300 elements, taking on average 50 ms. I've spent some time optimizing the comparator I use, which currently looks up two doubles from a vector and does a simple < comparison. The comparator takes a negligible fraction of the time spent in std::nth_element, and its copy constructor is also simple.
Since this call is currently taking 20% of the time for my algorithm, and since the time is mostly spent in the code for nth_element that I did not write (i.e. not the comparator), I'm wondering if anyone knows of a way of optimizing nth_element using SIMD or any other approach? I've seen some questions on parallelizing std::nth_element using OpenCL and multiple threads, but since the vectors are pretty short, I'm not sure how much benefit I would get from that approach, though I'm open to being told I'm wrong.
If there is an SSE approach, I can use any SSE instruction up to (the current, I think) SSE4.2.
Thanks!
Two thoughts:
Multithreading probably won't speed up processing for any single vector, but might help you as the number of vectors grows large.
Sorting is too powerful a tool for your problem: you're computing the entire order of the vector, but you don't care about anything but the top few. You know for each vector how many elements make up the top 5%, so instead of sorting the whole thing you should make one pass through the array and find the k largest. You can do this in O(n) time with O(k) extra space, so it's probably a win overall.
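A minimal sketch of that one-pass idea, using a small heap to keep only the k largest values (the function name is made up). With k small and fixed, the log k factor is effectively a constant, so in practice this behaves like the linear pass described above:

#include <vector>
#include <queue>
#include <functional>

// Return the k largest values of v, in no particular order.
// One pass over v, O(k) extra space, O(n log k) comparisons.
std::vector<double> top_k(const std::vector<double>& v, std::size_t k)
{
    // Min-heap of the best k seen so far: the smallest "keeper" sits on
    // top so it can be evicted cheaply when a larger value comes along.
    std::priority_queue<double, std::vector<double>, std::greater<double>> heap;

    for (double x : v) {
        if (heap.size() < k) {
            heap.push(x);
        } else if (x > heap.top()) {
            heap.pop();
            heap.push(x);
        }
    }

    std::vector<double> out;
    out.reserve(heap.size());
    while (!heap.empty()) {
        out.push_back(heap.top());
        heap.pop();
    }
    return out;
}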

Does divide-and-conquer really win against the increased memory allocation?

I've just finished coding some classical divide-and-conquer algorithms, and I came up with the following question (more out of curiosity):
Admittedly, in many cases a divide-and-conquer algorithm is faster than the traditional algorithm; for example, the Fast Fourier Transform improves the complexity from N^2 to N log2 N. However, while coding I found out that, because of the "dividing", we have more subproblems, which means we have to create more containers and allocate more memory for those subproblems. Just think about it: in merge sort we have to create left and right arrays in each recursion, and in the Fast Fourier Transform we have to create odd and even arrays in each recursion. This means we have to allocate more memory during the algorithm.
So, my question is: in reality, such as in C++, do algorithms like divide-and-conquer really win when we also have to increase the cost of memory allocation? (Or does memory allocation take no run time at all, i.e. is its cost zero?)
Thanks for helping me out!
Almost everything when it comes to optimisation is a compromise between one resource vs. another - in traditional engineering it's typically "cost vs. material".
In computing, it's often "time vs. memory usage" that is the compromise.
I don't think there is one simple answer to your actual question - it really depends on the algorithm - and in real life, this may lead to compromise solutions where a problem is divided into smaller pieces, but not ALL the way down to the minimal size, only "until it's no longer efficient to divide it".
Memory allocation isn't a zero-cost operation, if we are talking about new and delete. Stack memory is near zero cost once the actual stack memory has been populated with physical memory by the OS - it's at most one extra instruction on most architectures to make some space on the stack, and sometimes one extra instruction at exit to give the memory back.
The real answer is, as nearly always when it comes to performance, to benchmark the different solutions.
It is useful to understand that getting "one level better" in big-O terms (like going from n^2 to n, or from n to log n) usually matters a lot. Consider your Fourier example.
At O(n^2), with n=100 you're looking at 10000 operations, and with n=1000 you get a whole million, 1000000. On the other hand, with O(n*log(n)) you get about 664 for n=100 and about 9965 for n=1000. The slower growth should be obvious.
Of course memory allocation costs resources, as does the other code needed in divide-and-conquer, such as combining the parts. But the whole idea is that the overhead from the extra allocations and such is far, far smaller than the extra time the asymptotically slower algorithm would need.
The time for extra allocations isn't usually a concern, but the memory use itself can be. That is one of the fundamental programming tradeoffs. You have to choose between speed and memory usage. Sometimes you can afford the extra memory to get faster results, sometimes you must save all the memory. This is one of the reasons why there's no 'ultimate algorithm' for many problems. Say, mergesort is great, running in O(n * log(n)) even in the worst-case scenario, but it needs extra memory. Unless you use the in-place version, which then runs slower. Or maybe you know your data is likely already near-sorted and then something like smoothsort suits you better.
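To make the allocation point concrete, here is a rough sketch of a merge sort that allocates one scratch buffer up front and reuses it at every level of the recursion, instead of creating fresh left/right arrays in each call (the function names are made up):

#include <vector>
#include <cstddef>

// Merge the two sorted halves [lo, mid) and [mid, hi) of a, using scratch.
static void merge(std::vector<int>& a, std::vector<int>& scratch,
                  std::size_t lo, std::size_t mid, std::size_t hi)
{
    std::size_t i = lo, j = mid, k = lo;
    while (i < mid && j < hi)
        scratch[k++] = (a[j] < a[i]) ? a[j++] : a[i++];
    while (i < mid) scratch[k++] = a[i++];
    while (j < hi)  scratch[k++] = a[j++];
    for (std::size_t t = lo; t < hi; ++t) a[t] = scratch[t];
}

static void sort_range(std::vector<int>& a, std::vector<int>& scratch,
                       std::size_t lo, std::size_t hi)
{
    if (hi - lo < 2) return;
    std::size_t mid = lo + (hi - lo) / 2;
    sort_range(a, scratch, lo, mid);
    sort_range(a, scratch, mid, hi);
    merge(a, scratch, lo, mid, hi);
}

// One O(n) allocation for the whole sort, instead of new allocations
// spread across the entire recursion tree.
void merge_sort(std::vector<int>& a)
{
    std::vector<int> scratch(a.size());
    sort_range(a, scratch, 0, a.size());
}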

Can function overhead slow down a program by a factor of 50x?

I have a code that I'm running for a project. It is O(N^2), where N is 200 for my case. There is an algorithm that turns this O(N^2) to O(N logN). This means that, with this new algorithm, it should be ~100 times faster. However, I'm only getting a factor of 2-fold increase (aka 2x faster).
I'm trying to narrow down things to see if I messed something up, or whether it's something inherent to the way I coded this program. For starters, I have a lot of function overhead within nested classes. For example, I have a lot of this (within many loops):
energy = globals->pair_style->LJ->energy();
Since I'm getting the right results when it comes to actual data, just wrong speed increase, I'm wondering if function overhead can actually cause that much speed decrease, by as much as 50-fold.
Thanks!
Firstly, your interpretation that O(N logN) is ~100 times faster than O(N^2) for N=200 is incorrect. The big-Oh notation deals with upper bounds and behaviour in the limit, and doesn't account for any multiplicative constants in the complexity.
Secondly, yes, on modern hardware function calls tend to be relatively expensive due to pipeline disruption. To find out how big a factor this is in your case, you'd have to come up with some microbenchmarks.
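For example, here is a rough microbenchmark sketch along those lines, comparing the same arithmetic done inline against reaching it through a chain of pointers and a virtual call, loosely modelled on the questioner's globals->pair_style->LJ->energy(). All of the names and the dummy energy formula are hypothetical, and an aggressive optimizer may still inline the indirect version, so treat the numbers with care:

#include <chrono>
#include <cstdio>

// Dummy Lennard-Jones-style expression; the actual math is irrelevant here,
// it only has to be identical in both loops.
struct LJ {
    virtual double energy(double r2) const { return 4.0 / (r2 * r2 * r2); }
    virtual ~LJ() {}
};
struct PairStyle { LJ* lj; };
struct Globals   { PairStyle* pair_style; };

int main()
{
    LJ lj;
    PairStyle ps = { &lj };
    Globals g = { &ps };

    const long iters = 20000000;
    double sum = 0.0;

    auto t0 = std::chrono::steady_clock::now();
    for (long i = 1; i <= iters; ++i) {
        double r2 = (double)i;
        sum += 4.0 / (r2 * r2 * r2);                  // arithmetic done inline
    }
    auto t1 = std::chrono::steady_clock::now();
    for (long i = 1; i <= iters; ++i) {
        sum += g.pair_style->lj->energy((double)i);   // pointer chase + virtual call
    }
    auto t2 = std::chrono::steady_clock::now();

    using ms = std::chrono::milliseconds;
    std::printf("inline:   %ld ms\n", (long)std::chrono::duration_cast<ms>(t1 - t0).count());
    std::printf("indirect: %ld ms\n", (long)std::chrono::duration_cast<ms>(t2 - t1).count());
    std::printf("checksum: %g\n", sum);  // keeps the loops from being optimized away
    return 0;
}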
The absolute biggest hit is cache misses. An L1 cache miss is relatively cheap, but when you miss in L2 (or L3 if you have it) you may be losing hundreds or even thousands of cycles to the resulting stall.
The thing is, though, this may only be part of the problem. Do not optimise your code until you have profiled it. Identify the slow areas and then figure out WHY they are slow. Once you have an understanding of why it's running slowly, you have a good chance of optimising it.
As an aside, O notation is very handy, but it is not the be-all and end-all. I've seen O(n^2) algorithms work significantly faster than O(n log n) ones for small amounts of data (and small may mean less than several thousand elements) because they cache far more effectively.
The important thing about Big O notation is that it only specifies the limit of the execution time, as the data set size increases - any constants are thrown away. While O(N^2) is indeed slower than O(N log N), the actual run times might be N^2 vs. 1000N log N - that is, an O(N^2) can be faster than O(N log N) on some data sets.
Without more details, it's hard to say more - yes, function calls do indeed have a fair amount of overhead, and that might be why you're not seeing a bigger increase in performance - or it might just be the case that your O(N log N) doesn't perform quite as well on a data set of your size.
I've worked on image processing algorithms, and calling a function per pixel (i.e. 307,200 calls for a 640x480 image) can significantly reduce performance. Try declaring your function inline, or making the function a macro; this can quickly show you whether the function calls are the cause. Also try looking at some profiling tools. VS 2010 comes with some nice ones, and there are also Intel VTune and GlowCode. They can help show where you are spending your time.
IMHO, I don't think that roughly 1600 function calls (about 200 log2 200) should reduce performance much at all.
I suggest profiling it, using for example:
gprof (requires compile-time instrumentation)
valgrind --tool=callgrind together with kcachegrind - an excellent tool with excellent visualization
The big FAQ topic on profiling is here: How can I profile C++ code running in Linux?

Memory Allocation in std::map

I am doing a report on the various C++ dictionary implementations (map, dictionary, vectors etc).
The results for insertions using a std::map illustrate that the performance is O(log n). There are also consistent spikes in the performance. I am not 100% sure what's causing them; I think they are caused by memory allocation, but I have been unsuccessful in finding any literature/documentation to prove this.
Can anyone clear this matter up or point me in the right direction?
Cheers.
You are right: it is O(log n) complexity. But this is due to the sorted nature of map (normally binary tree based).
Also see http://www.sgi.com/tech/stl/UniqueSortedAssociativeContainer.html - there is a note on insert: its worst case is O(log n), and it is amortized O(1) if you can hint where to do the insert.
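For example, here is a small sketch of a hinted insert: when keys arrive already in order, passing the position where the element belongs (here m.end()) lets each insertion run in amortized constant time rather than O(log n):

#include <map>
#include <cstdio>

int main()
{
    std::map<int, int> m;

    // Keys inserted in ascending order: hinting with m.end() tells the map
    // the new element belongs just before the hint, so each insert is
    // amortized O(1) instead of O(log n).
    for (int k = 0; k < 1000; ++k)
        m.emplace_hint(m.end(), k, k * k);

    std::printf("size = %zu\n", m.size());
    return 0;
}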
Maps are normally based on binary trees and need to be balanced to keep good performance. The load spikes you are observing probably correspond to this balancing process.
The empirical approach isn't strictly necessary when it comes to the STL. There's no point in experimenting when the standard clearly dictates the complexity guarantees of operations such as std::map insertion.
I urge you to read the standard so you're aware of those complexity guarantees before continuing with experiments. Of course, there might be bugs in whatever STL implementation you happen to be testing; but the popular STLs are pretty well-debugged creatures and very widely used, so I'd doubt it.
If I remember correctly, std::map is usually implemented as a balanced red-black tree. Some of the spikes could be caused when the std::map determines that the underlying tree needs rebalancing. Also, when a new node is allocated, the OS could contribute to some spikes during the allocation itself.