Running a single block with multiple threads, CUDA - c++

I know that you should generally have at least 32 threads running per block on CUDA since threads are executed in groups of 32. However I was wondering if it is considered an acceptable practice to have only one block with a bunch of threads (I know there is a limit on the number of threads). I am asking this because I have some problems which require the shared memory of threads and synchronization across every element of the computation. I want to launch my kernel like
computeSomething<<< 1, 256 >>>(...)
and just use the threads to do the computation.
Is it efficient to have just one block, or would I be better off doing the computation on the CPU?

If you care about performance, it's a bad idea.
The principal reason is that a given threadblock can only occupy the resources of a single SM on a GPU. Since most GPUs have 2 or more SMs, this means you're leaving somewhere between 50% and over 90% of the GPU performance untouched.
For performance, both of these kernel configurations are bad:
kernel<<<1, N>>>(...);
and
kernel<<<N, 1>>>(...);
The first is the case you're asking about. The second is the case of a single thread per threadblock; this leaves about 97% of the GPU horsepower untouched.
In addition to the above considerations, GPUs are latency-hiding machines: they like to have a lot of threads, warps, and threadblocks available to select work from while other work is waiting. Having lots of available threads therefore generally results in higher efficiency (work accomplished per unit time).
It's impossible to tell if it would be faster on the CPU. You would have to benchmark and compare. If all of the data is already on the GPU, and you would have to move it back to the CPU to do the work, and then move the results back to the GPU, then it might still be faster to use the GPU in a relatively inefficient way, in order to avoid the overhead of moving data around.
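The percentages quoted above can be reproduced with some rough arithmetic. The sketch below is a simplified model, not a measurement: it assumes a single block keeps exactly one SM busy, and that the 97% figure reflects one active lane out of a 32-wide warp. The function names are made up for illustration.

```cpp
#include <cassert>

// Fraction of the GPU's SMs left idle when a kernel is launched as a single
// block: one block occupies one SM, so (num_sms - 1) of num_sms sit unused.
double idle_fraction_single_block(int num_sms) {
    assert(num_sms >= 1);
    return 1.0 - 1.0 / num_sms;
}

// Fraction of each warp's lanes left idle when every block holds one thread:
// a full warp is still scheduled, but only 1 of warp_size lanes does work.
double idle_fraction_one_thread_per_block(int warp_size = 32) {
    assert(warp_size >= 1);
    return 1.0 - 1.0 / warp_size;
}
```

With 2 SMs the first function gives 0.5, with 10 SMs roughly 0.9, and the second gives 0.96875 for a warp size of 32. The model ignores occupancy and latency hiding, so real losses can differ.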

Related

What's the "real world" performance improvement for multithreading I can expect?

I'm programming a recursive tree search with multiple branches, and it works fine. To speed it up I'm implementing simple multithreading: I distribute the search into main branches and scatter them among the threads. The threads don't have to interact with each other, and when a solve is found I add it to a common std::vector using a mutex, this way:
if (CubeTest.IsSolved())
{   // Solve algorithm found
    std::lock_guard<std::mutex> guard(SearchMutex); // Thread safe code
    Solves.push_back(Alg); // Add the solve
}
I don't allocate variables in dynamic store (heap) with new and delete, since the memory needs are small.
The maximum number of threads I use is the quantity I get from: std::thread::hardware_concurrency()
I did some tests, always the same search but changing the number of threads used, and I found things that I didn't expect.
I know that if you double the number of threads (if the processor has enough capacity) you can't expect to double the performance, because of context switching and things like that.
For example, I have an old Intel Xeon X5650 with 6 cores / 12 threads. If I execute my code, up to the sixth thread things are as expected, but if I use an additional thread the performance is worse. Using more threads increases the performance very little, to the point that using all available threads (12) barely improves on using only 6:
Threads vs processing time chart for Xeon X5650:
(I repeated the test several times and show the average times of all the tests.)
I repeated the tests on another computer with an Intel i7-4600U (2 cores / 4 threads) and I found this:
Threads vs processing time chart for i7-4600U:
I understand that with fewer cores the performance gain from using more threads is worse.
I also think that when you start to use the second thread in the same core the performance is penalized in some way. Am I right? How can I improve the performance in this situation?
So my question is whether these performance gains from multithreading are what I can expect in the real world, or whether these numbers are telling me that I'm doing things wrong and I should learn more about multithreaded programming.
What's the “real world” performance improvement for multithreading I can expect?
It depends on many factors. In general, the most optimistic improvement that one can hope for is reduction of runtime by a factor of the number of cores [1]. In most cases this is unachievable because of the need for threads to synchronise with one another.
In the worst case, not only is there no improvement due to lack of parallelism, but the overhead of synchronisation as well as cache contention can make the runtime much worse than that of the single-threaded program.
Peak memory use often increases linearly with the number of threads because each thread needs to operate on data of its own.
Total CPU time usage, and therefore energy use, also increases due to extra time spent on synchronisation. This is relevant to systems that operate on battery power as well as those that have poor heat management (both apply to phones and laptops).
Binary size would be marginally larger due to extra code that deals with threads.
[1] Whether you get all of the performance out of "logical" cores, i.e. "hyper threading" or "clustered multi threading", also depends on many factors. Often, one executes the same function in all threads, in which case they tend to use the same parts of the CPU, so sharing the core among multiple threads doesn't necessarily yield a benefit.
A CPU which uses hyperthreading claims to be able to execute two threads simultaneously on one core. But actually it doesn't. It just pretends to be able to do that. Internally it performs preemptive multitasking: Execute a bit of thread A, then switch to thread B, execute a bit of B, back to A and so on.
So what's the point of hyperthreading at all?
The thread switches inside the CPU are faster than thread switches managed by the thread scheduler of the operating system. So the performance gains are mostly through avoiding overhead of thread switches. But it does not allow the CPU core to perform more operations than it did before.
Conclusion: The performance gain you can expect from concurrency depends on the number of physical cores of the CPU, not logical cores.
Also keep in mind that thread synchronization methods like mutexes can become pretty expensive. So the less locking you can get away with the better. When you have multiple threads filling the same result set, then it can sometimes be better to let each thread build their own result set and then merge those sets later when all threads are finished.
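A minimal sketch of the "each thread builds its own result set, merge afterwards" idea follows. The search itself is a hypothetical stand-in (`search_range` just reports multiples of 7), and `parallel_search` is an illustrative name, not something from the question's code.

```cpp
#include <algorithm>
#include <cassert>
#include <thread>
#include <vector>

// Hypothetical stand-in for the real search: reports the multiples of 7
// in [begin, end) as "solves".
std::vector<int> search_range(int begin, int end) {
    std::vector<int> found;
    for (int i = begin; i < end; ++i)
        if (i % 7 == 0) found.push_back(i);
    return found;
}

// Each thread fills only its own vector; the vectors are merged after all
// threads have been joined, so no mutex is needed.
std::vector<int> parallel_search(int total, int num_threads) {
    assert(total >= 0 && num_threads >= 1);
    std::vector<std::vector<int>> partial(num_threads);
    std::vector<std::thread> workers;
    const int chunk = (total + num_threads - 1) / num_threads;  // ceiling division
    for (int t = 0; t < num_threads; ++t) {
        const int begin = std::min(total, t * chunk);
        const int end = std::min(total, begin + chunk);
        workers.emplace_back([&partial, t, begin, end] {
            partial[t] = search_range(begin, end);
        });
    }
    for (auto& w : workers) w.join();

    std::vector<int> merged;
    for (const auto& p : partial)
        merged.insert(merged.end(), p.begin(), p.end());
    return merged;
}
```

Calling `parallel_search(100, 4)` returns the same solves as a single-threaded run over `[0, 100)`. Whether this beats the mutex version depends on how often solves are found; with rare solves the lock is seldom contended anyway.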

OpenCL: how lightweight are GPU threads?

I keep reading that GPU threads are lightweight and you can throw many tasks at them to complete in parallel....but how lightweight are they, exactly?
Let's say I have a million-member float3 array, and I want to calculate the length of each float3 value.
Does it make sense to send essentially 1 million tasks to the GPU (so the kernel calculates a single float3 length of the global array and returns)? Or something more like 1000 tasks, and each kernel execution loops through 1000 members of the array? If there is a benefit to grouping tasks like that, is there a way to calculate the optimal size of each grouping?
If we're talking about GPUs only, the answer is - very lightweight.
Does it make sense to send essentially 1 million tasks to the GPU
You're not "sending a million tasks" to the GPU. You're sending a single request, which is a few dozen bytes, which essentially says "please launch a million copies of this code with the grid coordinates I give you here". Those "copies" are created on the fly by hardware inside the GPU, and yes, it's very efficient.
1000 tasks, and each kernel execution loops through 1000 members of the array
On a GPU, you almost certainly don't want to do this. A modern high-end GPU has easily 4000+ processing units, so you need at minimum that amount of concurrency. But usually much higher. There is a scheduler which picks one hardware thread to run on each of those processing units, and usually there are several dozen hardware threads per processing unit. So it's not unusual to see a GPU with 100K+ hardware threads. This is required to hide memory latencies.
So if you launch a kernel with 1000x1 grid size, easily 3/4 of your GPU could be unused, and the used part will spend 90% of its time waiting for memory. Go ahead and try it out. The GPU has been designed to handle ridiculous amounts of threads - don't be afraid to use them.
Now, if you're talking about CPU, that's a slightly different matter. CPUs obviously don't have 1000s of hardware threads. Here, it depends on the OpenCL implementation - but I think most reasonable CPU OpenCL implementations today will handle this for you, by processing work in loops, in just enough hardware threads for your CPU.
TL;DR: use the "1 million tasks" solution, and perhaps try tuning the local work size.
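To make the CPU remark concrete, here is a plain C++ sketch (not OpenCL) of how a runtime might execute one logical work-item per array element while only using as many OS threads as the hardware reports. `Float3`, `length_of` and `lengths` are hypothetical names, and real OpenCL CPU implementations may work quite differently.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <thread>
#include <vector>

struct Float3 { float x, y, z; };

// Body of the hypothetical per-element kernel: one work-item, one length.
inline float length_of(const Float3& v) {
    return std::sqrt(v.x * v.x + v.y * v.y + v.z * v.z);
}

// One logical work-item per element, executed in loops on a small number
// of OS threads, each handling a contiguous slice of the array.
std::vector<float> lengths(const std::vector<Float3>& in) {
    const std::size_t n = in.size();
    std::vector<float> out(n);
    // hardware_concurrency() may return 0 when the count is unknown.
    const std::size_t workers = std::max(1u, std::thread::hardware_concurrency());
    const std::size_t chunk = (n + workers - 1) / workers;  // ceiling division
    std::vector<std::thread> pool;
    for (std::size_t w = 0; w < workers; ++w) {
        const std::size_t begin = std::min(n, w * chunk);
        const std::size_t end = std::min(n, begin + chunk);
        if (begin == end) continue;  // nothing left for this worker
        pool.emplace_back([&in, &out, begin, end] {
            for (std::size_t i = begin; i < end; ++i) out[i] = length_of(in[i]);
        });
    }
    for (auto& t : pool) t.join();
    return out;
}
```

The caller still thinks in terms of "one task per element"; the grouping into loops is an implementation detail, which is why the kernel does not need to do it by hand.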

Will multithreading improve performance significantly if I have a fixed amount of calculations that are independent from each other?

I am programming a raycasting game engine.
Each ray can be calculated without knowing anything about the other rays (I'm only calculating distances).
Since there is no waiting time between calculations, I wonder whether it's worth the effort to make the ray calculations multithreaded or not.
Is it likely that there will be a performance boost?
Most likely multi-threading will improve performance if done correctly. The way you've stated your problem, it is a perfect candidate for multi-threading since the computations are independent, reducing the need for coordination between threads to a minimum.
Some reasons you still might not get a speed up, or may not get the full speed up you expect could include:
1) The bottleneck may not be on-die CPU execution resources (e.g., ALU-bound operations), but rather something shared like memory or shared LLC bandwidth.
For example, on some architectures, a single thread may be able to saturate memory bandwidth, so adding more cores may not help. A more common case is that a single core can saturate some fraction, 1/N < 1 of main memory bandwidth, and this value is larger than 1/C where C is the core count. For instance, on a 4 core box, one core may be able to consume 50% of the bandwidth. Then, for a memory-bound computation, you'll get good scaling to 2 cores (using 100% of bandwidth), but little to none above that.
Other resources which are shared among cores include disk and network IO, GPU, snoop bandwidth, etc. If you have a hyper-threaded platform, this list increases to include all levels of cache and ALU resources for logical cores sharing the same physical core.
2) Contention "in practice" between operations which are "theoretically" independent.
You mention that your operations are independent. Typically this means that they are logically independent - they don't share any data (other than perhaps immutable input) and they can write to separate output areas. That doesn't exclude the possibility, however, that any given implementation actually has some hidden sharing going on.
One classic example is false-sharing - where independent variables fall in the same cache line, so logically independent writes to different variables from different threads end up thrashing the cache line between cores.
Another example, frequently encountered in practice, is contention via library - if your routines use malloc heavily, you may find that all the threads spend most of their time waiting on a lock inside the allocator, as malloc is a shared resource. This can be remedied by reducing reliance on malloc (perhaps via fewer, larger mallocs) or with a good concurrent malloc such as hoard or tcmalloc.
3) Implementation of the distribution and collection of the computation across threads may overwhelm the advantage you get from multiple threads. For example, if you spin up a new thread for every individual ray, the thread creation overhead would dominate your runtime and you would likely see a negative benefit. Even if you use a thread-pool of persistent threads, choosing a "work unit" that is too fine grained will impose a lot of coordination overhead which may eliminate your benefits.
Similarly, if you have to copy the input data to and from the worker threads, you may not see the scaling you expect. Where possible, use pass-by-reference for read-only data.
4) You don't have more than 1 core, or you do have more than 1 core but they are already occupied running other threads or processes. In these cases, the effort to coordinate multiple threads is pure overhead.
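The point in 3) about work-unit granularity can be illustrated with a small sketch: a fixed set of worker threads claims rays in batches from a shared counter, so coordination happens once per batch instead of once per ray. `cast_ray` is a hypothetical placeholder for the real distance calculation, and the batch size is a tuning parameter to be found by measurement.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical per-ray computation; stands in for the real distance calculation.
inline float cast_ray(std::size_t ray_index) {
    return static_cast<float>(ray_index) * 0.5f;
}

// Workers repeatedly claim `batch` consecutive rays from a shared counter.
// One atomic operation per batch, not per ray, keeps coordination cost low.
std::vector<float> cast_all(std::size_t num_rays, std::size_t batch, unsigned num_workers) {
    assert(batch >= 1 && num_workers >= 1);
    std::vector<float> distances(num_rays);
    std::atomic<std::size_t> next{0};
    auto worker = [&] {
        for (;;) {
            const std::size_t begin = next.fetch_add(batch);
            if (begin >= num_rays) return;  // all rays have been claimed
            const std::size_t end = std::min(num_rays, begin + batch);
            for (std::size_t i = begin; i < end; ++i) distances[i] = cast_ray(i);
        }
    };
    std::vector<std::thread> pool;
    for (unsigned w = 0; w < num_workers; ++w) pool.emplace_back(worker);
    for (auto& t : pool) t.join();
    return distances;
}
```

A batch of 1 degenerates into per-ray coordination; a batch of `num_rays / num_workers` gives static partitioning. Something in between is often a reasonable starting point when rays vary in cost.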
In general, it depends. Given that the calculations are independent, it sounds like this is a good candidate for potential performance improvements due to threading. Ray calculations typically can benefit from this.
However, there are many other factors, such as memory access requirements, as well as the underlying system on which this runs, which will have a tremendous impact on this. It's often possible to have multithreaded versions run slower than single threaded versions if not written correctly, so profiling is the only way to answer this definitively.
Probably yes, multithreading (e.g. with pthreads) could improve performance; but you surely want to benchmark (and you might be disappointed if your program is memory bound, not CPU bound). And you could also consider OpenCL (to run some regular numeric computations on the GPGPU) and OpenMP (to explicitly ask the compiler, through pragmas, to parallelize some of your code).
Maybe Open-MPI might be considered to run on several communicating processes. And if you are brave (or crazy) you could mix several approaches.
In reality, it depends upon the algorithm and the system (both hardware and operating system), and you should benchmark (e.g. some micro-prototype related to your needs).
If on some particular system the bottleneck is the memory bandwidth (not the CPU), multi-threading or multi-processing won't help much (and probably could degrade performance).
Also, the cost of synchronization may vary widely (e.g. locking a mutex can be very fast on some systems and 50x slower on others).
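The memory-bandwidth ceiling mentioned in these answers can be put into rough numbers. The function below is an idealised model, assumed here for illustration only: scaling is perfect until the cores together saturate memory bandwidth, and flat afterwards. The name `max_memory_bound_speedup` is made up.

```cpp
#include <algorithm>
#include <cassert>

// Idealised upper bound on speedup for a purely memory-bound workload.
// single_core_fraction is the share of total memory bandwidth that one core
// can consume (0 < f <= 1). Assumes perfect scaling until saturation.
double max_memory_bound_speedup(int cores, double single_core_fraction) {
    assert(cores >= 1);
    assert(single_core_fraction > 0.0 && single_core_fraction <= 1.0);
    return std::min(static_cast<double>(cores), 1.0 / single_core_fraction);
}
```

For a 4-core machine where one core can consume 50% of the bandwidth, the bound is 2: two cores already use all of it, and the third and fourth add little. Real machines rarely behave this cleanly, which is why benchmarking is still needed.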
Very likely. Independent calculations are a perfect candidate for parallelization. In the case of raycasting, there are so many of them that they would spread nicely across as many parallel threads as the hardware permits.
An unexpected bottleneck for calculations that would otherwise have perfect data-independence can be concurrent writes to nearby locations (false sharing of cache lines).
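One common way to avoid false sharing is to pad per-thread data out to a full cache line. The sketch below assumes a 64-byte line, which is typical on x86-64 but not universal; `PaddedCounter` and `count_in_parallel` are illustrative names.

```cpp
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Assumed cache-line size; 64 bytes is common on x86-64 but not universal.
constexpr std::size_t kCacheLine = 64;

// alignas pads each counter to a full cache line, so consecutive counters
// in an array are 64 bytes apart and never share a line.
struct alignas(kCacheLine) PaddedCounter {
    long value = 0;
};

// Each thread increments only its own counter; the total is summed afterwards.
long count_in_parallel(int num_threads, long increments_per_thread) {
    assert(num_threads >= 1);
    std::vector<PaddedCounter> counters(num_threads);
    std::vector<std::thread> workers;
    for (int t = 0; t < num_threads; ++t) {
        workers.emplace_back([&counters, t, increments_per_thread] {
            for (long i = 0; i < increments_per_thread; ++i) ++counters[t].value;
        });
    }
    for (auto& w : workers) w.join();
    long total = 0;
    for (const auto& c : counters) total += c.value;
    return total;
}
```

Without the `alignas`, eight `long` counters would fit into one 64-byte line and every increment could bounce that line between cores. In a raycaster the equivalent fix is usually to have each thread write to a contiguous block of output rather than to interleaved elements.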

the meaning of multiple threading on single core cpu

Once I thought the only occasion multiple threads should be used is when IO processing is needed.
But I heard it's also useful without IO processing, because it helps to occupy more CPU resources.
In my understanding, this would be because a process with more threads is given more CPU time.
Is this why multiple threads help improve performance even on single core?
One possible reason you can see greater performance from multiple threads on a single CPU is that CPUs tend to be really good at instruction reordering and making use of instruction-level parallelism. Threads have fewer data and control dependencies with respect to one another than any two sequential instructions within a single thread, and therefore they offer more possibilities for the CPU and OS-level schedulers and re-ordering mechanisms to be very clever.
Don't forget that things like "reads and writes in memory" are still "I/O" when viewed in a particular way. These are relatively slow operations, and much of the pipelining in modern CPUs is used to hide memory latency - having multiple threads executing at once can be useful for filling up time that would otherwise have to be filled with delay slots where there are data hazards within a single thread.
That said, threads are often not a good solution to increase performance, and can have precisely the opposite effect. It can be very easy to saturate all available memory bandwidth using a single thread on some problems.

Thread limit in Unix before affecting performance

I have some questions regarding threads:
1) What is the maximum number of threads allowed for a process before it decreases the performance of the application?
2) If there's a limit, how can this be changed?
3) Is there an ideal number of threads that should be running in a multi-threaded application? If it depends on what the application is doing, can you cite an example?
4) What are the factors to consider that affect these performance/thread limits?
This is actually a hard set of questions to which there are no absolute answers, but the following should serve as decent approximations:
1) It is a function of your application behavior and your runtime environment, and can only be deduced by experimentation. There is usually a threshold after which your performance actually degrades as you increase the number of threads.
2) Usually, after you find your limits, you have to figure out how to redesign your application such that the cost-per-thread is not as high. (Note that for some domains, you can get better performance by redesigning your algorithm and reducing the number of threads.)
3) There is no general "ideal" number of threads, but you can sometimes find the optimal number of threads for an application on a specific runtime environment. This is usually done by experimentation, and graphing the results of benchmarks while varying the following:
   - Number of threads.
   - Buffer sizes (if the data is not in RAM), incremented by some reasonable value (e.g., block size, packet size, cache size, etc.).
   - Chunk sizes (if you can process the data incrementally).
   - Various tuning knobs for the OS or language runtime.
   - Pinning threads to CPUs to improve locality.
4) There are many factors that affect thread limits, but the most common ones are:
   - Per-thread memory usage (the more memory each thread uses, the fewer threads you can spawn).
   - Context-switching cost (the more threads you use, the more CPU time is spent switching).
   - Lock contention (if you rely on a lot of coarse-grained locking, then increasing the number of threads simply increases the contention).
   - The threading model of the OS. (How does it manage the threads? What are the per-thread costs?)
   - The threading model of the language runtime. (Coroutines, green threads, OS threads, sparks, etc.)
   - The hardware. (How many CPUs/cores? Is it hyperthreaded? Does the OS load-balance the threads appropriately, etc.)
   - Etc. (There are many more, but the above are the most important ones.)
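The kind of experiment described above (vary the number of threads, record the time, graph the result) might be sketched like this. `busy_work` is a hypothetical CPU-bound placeholder for the application's real work, and the harness is deliberately minimal: a serious benchmark would repeat each run and discard warm-up iterations.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>
#include <thread>
#include <vector>

// Hypothetical CPU-bound workload; replace with the real per-thread work.
std::uint64_t busy_work(std::uint64_t iterations) {
    std::uint64_t acc = 0;
    for (std::uint64_t i = 0; i < iterations; ++i) acc += i * i;
    return acc;
}

struct Sample {
    int threads;
    double seconds;
};

// Splits a fixed amount of work across 1..max_threads threads and times
// each configuration once with a monotonic clock.
std::vector<Sample> sweep_thread_counts(int max_threads, std::uint64_t total_iterations) {
    assert(max_threads >= 1);
    std::vector<Sample> samples;
    for (int n = 1; n <= max_threads; ++n) {
        std::vector<std::uint64_t> sink(n);  // stores results so the work is observable
        const auto start = std::chrono::steady_clock::now();
        std::vector<std::thread> workers;
        for (int t = 0; t < n; ++t)
            workers.emplace_back([&sink, t, n, total_iterations] {
                sink[t] = busy_work(total_iterations / n);
            });
        for (auto& w : workers) w.join();
        const std::chrono::duration<double> elapsed =
            std::chrono::steady_clock::now() - start;
        samples.push_back({n, elapsed.count()});
    }
    return samples;
}
```

Plotting `seconds` against `threads` for the real workload usually shows where adding threads stops helping on a given machine.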
The answer to your questions 1, 3, and 4 is "it's application dependent". Depending on what your threads do, you may need a different number to maximize your application's efficiency.
As to question 2, there's almost certainly a limit, and it's not necessarily something you can change easily. The number of concurrent threads might be limited per-user, or there might be a maximum number of allowed threads in the kernel.
1) There's nothing fixed: it depends what they are doing. Sometimes adding more threads to do asynchronous I/O can increase the performance of another thread with no bad side effects.
2) This is likely fixed at compile time.
3) No, it's a process architecture decision. But having at least one listener-scheduler thread besides the one or more threads doing the heavy lifting suggests the number should normally be at least two.
4) Almost certainly, your ability to really grasp what is going on. Threaded code chokes easily and in the most unexpected ways: making sure the code has no races/deadlocks is hard. Study different ways of handling concurrency, such as shared-nothing (cf. Erlang).
As long as you never have more threads using CPU time than you have cores, you will have optimal performance. But as soon as a thread has to wait for I/O, there will be unused CPU cycles, so you may want to profile your application and see what portion of its time it spends maxing out the CPU and what portion it spends waiting for RAM, hard disk, network, and other I/O. In general, if you are waiting for I/O, you could have one more thread (provided that you are primarily CPU bound).
For the hard and absolute limit, check out PTHREAD_THREADS_MAX in limits.h; this may be what you are looking for. It might be POSIX_THREAD_MAX on some systems.
Any app with more busy threads than the number of processors will cause some overall slowdown. There's an upper limit, but it varies system to system. For some, it used to be 256 and you could recompile the OS to get it a bit higher.
As long as the threads are designed to do separate tasks, there is not much of an issue. The problems start when these threads contend for the same resources and a locking mechanism has to be implemented.