Measuring parallel computation time for interdependent threads - c++

I have a question concerning runtime measurements in parallel programs (I used C++ but I think the question is more general).
Some short explanations: 3 threads are running parallel (pthread), solving the same problem in different ways. Each thread may pass information to the other thread (e.g. partial solutions obtained by the one thread but not by the other, yet) for speeding up the other threads, depending on his own status / available information in his own calculation. The whole process stops as soon as the first thread is ready.
Now I would like to have a unique time measurement for evaluating the runtime from start until the problem is solved. ( In the end, I want to determine if using synergy effects through a parallel calculation is faster then calculation on a single thread).
In my eyes, the problem is, that (because of the operating system pausing / unpausing the single threads), the point when information is passed in the process is not deterministic in each process' state. That means, a certain information is acquired after xxx units of cpu time on thread 1, but it can not be controlled, whether thread 2 receives this information after yyy or zzz units of cpu time spent in its calculations. Assumed that this information would have finished thread 2's calculation in any case, the runtime of thread 2 was either yyy or zzz, depending on the operating system's action.
What can I do for obtaining a deterministic behaviour for runtime comparisons? Can I order the operation system to run each thread "undisturbed" (on a multicore machine)? Is there something I can do on implementation (c++) - basis?
Or are there other concepts for evaluating runtime (time gain) of such implementations?
Best regards
Martin

Any time someone uses the terms 'deterministic' and 'multicore' in the same sentence, it sets alarm bells ringing :-)
There are two big sources of non-determinism in your program: 1) the operating system, which adds noise to thread timings through OS jitter and scheduling decisions; and 2) the algorithm, because the program follows a different path depending on the order in which communication (of the partial solutions) occurs.
As a programmer, there's not much you can do about OS noise. A standard OS adds a lot of noise even for a program running on a dedicated (quiescent) node. Special purpose operating systems for compute nodes go some way to reducing this noise, for example Blue Gene systems exhibit significantly less OS noise and therefore less variation in timings.
Regarding the algorithm, you can introduce determinism to your program by adding synchronisation. If two threads synchronise, for example to exchange partial solutions, then the ordering of the computation before and after the synchronisation is deterministic. Your current code is asynchronous, as one thread 'sends' a partial solution but does not wait for it to be 'received'. You could convert this to a deterministic code by dividing the computation into steps and synchronising between threads after each step. For example, for each thread:
Compute one step
Record partial solution (if any)
Barrier - wait for all other threads
Read partial solutions from other threads
Repeat 1-4
Of course, we would not expect this code to perform as well, because now each thread has to wait for all the other threads to complete their computation before proceeding to the next step.
The best approach is probably to just accept the non-determinism, and use statistical methods to compare your timings. Run the program many times for a given number of threads and record the range, mean and standard deviation of the timings. It may be enough for you to know e.g. the maximum computation time across all runs for a given number of threads, or you may need a statistical test such as Student's t-test to answer more complicated questions like 'how certain is it that increasing from 4 to 8 threads reduces the runtime?'. As DanielKO says, the fluctuations in timings are what will actually be experienced by a user, so it makes sense to measure these and quantify them statistically, rather than aiming to eliminate them altogether.

What's the use of such a measurement?
Suppose you can, by some contrived method, set up the OS scheduler in a way that the threads run undisturbed (even by indirect events such as other processes using caches, MMU, etc), will that be realistic for the actual usage of the parallel program?
It's pretty rare for a modern OS to let an application take control over general interrupts handling, memory management, thread scheduling, etc. Unless you are talking directly to the metal, your deterministic measurements will not only be impractical, but the users of your program will never experience them (unless they are equally close to the metal as when you did the measurements.)
So my question is, why do you need such strict conditions for measuring your program? In the general case, just accept the fluctuations, as that is what the users will most likely see. If the speed up of a certain algorithm/implementation is so insignificant as to be indistinguishable from the background noise, that's more useful information to me than knowing the actual speedup fraction.

Related

Finding out how many threads make sense to create

I have some calculation task on a large amount of the data - so it can be quite easily to parallel. Next question is how many threads does it make sense to create. Of course I can measure time for different number of thread on my machine, but what if a program will be run on different machines, so I can't really make manual measurement. Is just get number of threads from std::thread::hardware_concurrency() good enough, or there are some other ways?
That function (std::thread::hardware_concurrency()) will give you the total core count, including hyperthreading.
If your program does intensive number crunching I would say using only physical cores and setting processor affinity is the best choice.
You can know the current processor topology with hwloc library which works in most platforms.
You may find an comprehensible explanation (though a bit old) here.
If there is lot of I/O then you may run two threads for processor to allow one to process data while other is waiting for input, or one extra thread without affinity so it can take processor time while others are waiting for I/O, but this is a very rough estimation: better measure in your machine.
If you can test in other processors, you may have different strategy for each processor.

What's the "real world" performance improvement for multithreading I can expect?

I'm programming a recursive tree search with multiple branches and works fine. To speed up I'm implementing a simple multithreading: I distribute the search into main branches and scatter them among the threads. Each thread doesn't have to interact with the others, and when a solve is found I add it to a common std::vector using a mutex this way:
if (CubeTest.IsSolved())
{ // Solve algorithm found
std::lock_guard<std::mutex> guard(SearchMutex); // Thread safe code
Solves.push_back(Alg); // Add the solve
}
I don't allocate variables in dynamic store (heap) with new and delete, since the memory needs are small.
The maximum number of threads I use is the quantity I get from: std::thread::hardware_concurrency()
I did some tests, always the same search but changing the amount or threads used, and I found things that I don't expected.
I know that if you double the amount of threads (if the processor has enougth capacity) you can't expect to double the performance, because of context switching and things like that.
For example, I have an old Intel Xeon X5650 with 6 cores / 12 threads. If I execute my code, until the sixth thread things are as expected, but if I use an additional thread the performace is worst. Using more threads increase the performace very little, to the point that the use of all avaliable threads (12) barely compensates for the use of only 6:
Threads vs processing time chart for Xeon X5650:
(I repeat the test several times and I show the average times of all the tests).
I repeat the tests in other computer with an Intel i7-4600U (2 cores / 4 threads) and I found this:
Threads vs processing time chart for i7-4600U:
I understand that with less cores the performance gain using more threads is worst.
I think also that when you start to use the second thread in the same core the performance is penalized in some way. Am I right? How can I improve the performance in this situation?
So my question is if this performance gains for multithreading is what I can expect in the real world, or on the other hand, this numbers are telling me that I'm doing things wrong and I should learn more about mutithreading programming.
What's the “real world” performance improvement for multithreading I can expect?
It depends on many factors. In general, the most optimistic improvement that one can hope for is reduction of runtime by factor of number of cores1. In most cases this is unachievable because of the need for threads to synchronise with one another.
In worst case, not only is there no improvement due to lack of parallelism, but also the overhead of synchronisation as well as cache contention can make the runtime much worse than the single threaded program.
Peak memory use often increases linearly by number of threads because each thread needs to operate on data of their own.
Total CPU time usage, and therefore energy use also increases due to extra time spent on synchronisation. This is relevant to systems that operate on battery power as well as those that have poor heat management (both apply to phones and laptops).
Binary size would be marginally larger due to extra code that deals with threads.
1 Whether you get all of the performance out of "logical" cores i.e. "hyper threading" or "clustered multi threading" also depends on many factors. Often, one executes the same function in all threads, in which case they tend to use the same parts of the CPU, in which case sharing the core with multiple threads doesn't necessarily yield benefit.
A CPU which uses hyperthreading claims to be able to execute two threads simultaneously on one core. But actually it doesn't. It just pretends to be able to do that. Internally it performs preemptive multitasking: Execute a bit of thread A, then switch to thread B, execute a bit of B, back to A and so on.
So what's the point of hyperthreading at all?
The thread switches inside the CPU are faster than thread switches managed by the thread scheduler of the operating system. So the performance gains are mostly through avoiding overhead of thread switches. But it does not allow the CPU core to perform more operations than it did before.
Conclusion: The performance gain you can expect from concurrency depend on the number of physical cores of the CPU, not logical cores.
Also keep in mind that thread synchronization methods like mutexes can become pretty expensive. So the less locking you can get away with the better. When you have multiple threads filling the same result set, then it can sometimes be better to let each thread build their own result set and then merge those sets later when all threads are finished.

Multithreading crowds out other processes

I have added multithreading to a raytracer I am writing, and while it does run much faster now, when it's running, my computer is almost unusably slow. Obviously I want to use all my PC's compute power, but I don't want it to prevent any other application from getting access to the CPUs.
I thought about having the threads sleep, but unless they all sleep at the same time, then the other threads would just eat up the extra time. Also, I don't necessarily want to give up a certain percentage of available compute power if I'm not going to use it.
Also, (This is not my official question) I've noticed that for some reason the first thread launched does more work than the second, and the second more than the third, and so on until like the last 5 threads (out of 32) won't actually get a crack at any work, despite the fact that there's plenty to go a around (there's at least 0.5M work items for them to chew through). If someone would like to venture a guess in the comments, it would be appreciated.
If you use the standard threads, you could try to use thread::hardware_concurrency to find out an estimate of the maximul number of threads that are really supported by hardware, in order not to overload your cpu.
If it returns 0 the information is not available. In other cases you could limit yourself to this number or a little bit below (thinking that other processes might use these as well).
If limiting the number of threads does not improve responsiveness, you can also consider calling from time to time this_thread::yield() to give opportunity to reschedule threads. But depending on the kind of job and synchronisation you use, this second alternative might decrease performance.
As requested, my comment as an answer:
It sounds like you've oversubscribed your poor CPU. Try reducing the number of threads?
If there's significantly more threads than hardware cores, a lot of time is going to be wasted switching between threads, scheduling them in the OS, and in contention over shared variables. It would also cause the general slowdown of the other running programs, because they have to contend with the high number of threads from your program (which by default all have the same priority as the other programs' threads in the eyes of the OS scheduler).

Multithreading efficiency in C++

I am trying to learn threading in C++, and just had a few questions about it (more specifically <thread>.
Let's say the machine this code will run on has 4 cores, should I split up an operation into 4 threads? If I were to create 8 threads instead of 4, would this run slower on a 4 core machine? What if the processor has hyperthreading, should I try and make the threads match the number of physical cores or logical cores?
Should I just not worry about the number of cores a machine has, and try to create as many threads as possible?
I apologize if these questions have been already answered; I've been looking for information about threading with <thread>, which was introduced in c11 so I haven't been able to find too much about it.
The program in question is going to run many independent simulations.
If anybody has any insight about <thread> or just multithreading in general, I would be glad to hear it.
If you are performing pure calculations with no I/O - and those calculations are freestanding and not relying on results from other calculations happening in another thread, the maximum number of such threads should be the number of cores (possibly one or two less if the system is also loaded with other tasks).
If you are doing network I/O or similar, more threads are certainly a possibility.
If you are doing disk-I/O, a single thread reading from the disk is often best, because disk reads from multiple threads leads to moving the read/write head around on the disk, which just makes things slower.
If you're using threads for to make the code simpler, then the number of threads will probably depend on what you are doing.
It also depends on how "freestanding" each thread is. If they need to share data in complex ways, the sharing/waiting for other thread/etc, may well make it slower with more threads.
And as others have said, try to make your framework for this flexible and test different options. Preferably on multiple machines (unless you only have one kind of machine that you will ever run your code on).
There is no such thing as <threads.h>, you mean <thread>, the thread support library introduced in C++11.
The only answer to your question is "test and see". You can make your code flexible enough, so that it can be run by passing an N parameter (where N is the desired number of threads).
If you are CPU-bound, the answer will be very different from the case when you are IO bound.
So, test and see! For your reference, this link can be helpful. And if you are serious, then go ahead and get this book. Multithreading, concurrency, and the like are hairy topics.
Let's say the machine this code will run on has 4 cores, should I split up an operation into 4 threads?
If some portions of your code can be run in parallel, then yes it can be made to go faster, but this is very tricky to do since loading threads and switching data between them takes a ton of time.
If I were to create 8 threads instead of 4, would this run slower on a 4 core machine?
It depends on the context switching it has to do. Sometimes the execution will switch between threads very often and sometimes it will not but this is very difficult to control. It will not in any case run faster than 4 threads doing the same work.
What if the processor has hyperthreading, should I try and make the threads match the number of physical cores or logical cores?
Hyperthreading works nearly the same as having more cores. When you will notice the differences between a real core and an execution core, you will have enough knowledge to work around the caveats.
Should I just not worry about the number of cores a machine has, and try to create as many threads as possible?
NO, threads are hard to manage, avoid them as much as you can.
The program in question is going to run many independent simulations.
You should look into openmp. It is a library in C made to parallelize computation when your program can be split up. Do not confuse parallel with concurrent. Concurrent is simply multiple threads working together while parallel is made specifically to speed up your application. Maybe openmp is overkill for your thing, but it is a good thing to know when you are approaching parallel computing
Don't think of the number of threads you need as in comparison to the machine you're running on. Threading is valuablue any time you have a process that:
A: There is some very slow operation, that the rest of the process need not wait for.
B: Certain functions can run faster than one another and don't need to be executed inline.
C: There is a lot of non-order dependant I/O going on(web servers).
These are just a few of the obvious examples when launching a thread makes sense. The number of threads you launch is therefore more dependant on the number of these scenarios that pop up in your code, than the architecture you expect to run on. In fact unless you're running a process that really really needs to be optimized, it is likely that you can only eek out a few percentage points of additional performance by benchmarking for your architecture in comparison to the number of threads that you launch, and in modern computers this number shouldn't vary much at all.
Let's take the I/O example, as it is the scenario that will see the most benefit. Let's assume that some program needs to interract with 200 users over the network. Network I/O is very very slow. Thousands of times slower than the CPU. If we were to handle each user in turn we would waste thousands of processor cycles just waiting for data to come from the first user. Could we not have been processing information from more than one user at a time? In this case since we have roughly 200 users, and the data that we're waiting for we know to be 1000s of times slower than what we can handle(assuming we have a minimal amount of processing to do on this data), we should launch as many threads as the operating system allows. A web server that takes advantage of threading can serve hundreds of more people per second than one that does not.
Now, let's consider a less I/O intensive example, where say we have several functions that execute in turn, but are independant of one another and some of them might run faster, say because there is disk I/O in one, and no disk I/O in another. In this case, our I/O is still fairly fast, but we will certainly waste processing time waiting for the disk to catch up. As such we can launch a few threads, just to take advantage of our processing power, and minimize wasted cycles. However, if we launch as many threads as the operating system allows we are likely to cuase memory management issues for branch predictors, etc... and launching too many threads in this case is actually sub optimal and can slow the program down. Note that in this, I never mentioned how many cores the machine has! NOt that optimizing for different architectures isn't valuable, but if you optimize for one architecture you are likely very close to optimal for most. Assuming, again, that you're dealing with all reasonably modern processors.
I think most people would say that large scale threading projects are better supported by languages other than c++ (go, scala,cuda). Task parallelism as opposed to data parallelism works better in c++. I would say that you should create as many threads as you have tasks to dole out but if data parallelism is more related to your problem consider maybe using cuda and linking to the rest of your project at a later time
NOTE: if you look at some sort of system monitor you will notice that there are likely far more than 8 threads running, I looked at my computer and it had hundreds of threads running at once so don't worry too much about the overhead. The main reason I choose to mention the other languages is that managing many threads in c++ or c tends to be very difficult and error prone, I did not mention it because the c++ program will run slower(which unless you use cuda it probably won't)
In regards to hyper-threading let me comment on what I have found from experience.
In large dense matrix multiplication hyper-threading actually gives worse performance. For example Eigen and MKL both use OpenMP (at least the way I have used them) and get better results on my system which has four cores and hyper-threading using only four threads instead of eight. Also, in my own GEMM code which gets better performance than Eigen I also get better results using four threads instead of eight.
However, in my Mandelbrot drawing code I get a big performance increase using hyper-threading with OpenMP (eight threads instead of four). The general trend (so far) seems to be that if the code works well using schedule(static) in OpenMP then hyper-threading does not help and may even be worse. If the code works better using schedule(dynamic) then hyper-threading may help.
In other words, my observation so far is that if the run time of each thread can vary a lot hyper-threading can help. If the run time of each thread is constant then it may even make performance worse. But YOU have to test and see for each case.

Thread limit in Unix before affecting performance

I have some questions regarding threads:
What is the maximum number of threads allowed for a process before it decreases the performance of the application?
If there's a limit, how can this be changed?
Is there an ideal number of threads that should be running in a multi-threaded application? If it depends on what the application is doing, can you cite an example?
What are the factors to consider that affects these performance/thread limit?
This is actually a hard set of questions to which there are no absolute answers, but the following should serve as decent approximations:
It is a function of your application behavior and your runtime environment, and can only be deduced by experimentation. There is usually a threshold after which your performance actually degrades as you increase the number of threads.
Usually, after you find your limits, you have to figure out how to redesign your application such that the cost-per-thread is not as high. (Note that for some domains, you can get better performance by redesigning your algorithm and reducing the number of threads.)
There is no general "ideal" number of threads, but you can sometimes find the optimal number of threads for an application on a specific runtime environment. This is usually done by experimentation, and graphing the results of benchmarks while varying the following:
Number of threads.
Buffer sizes (if the data is not in RAM) incrementing at some reasonable value (e.g., block size, packet size, cache size, etc.)
Varying chunk sizes (if you can process the data incrementally).
Various tuning knobs for the OS or language runtime.
Pinning threads to CPUs to improve locality.
There are many factors that affect thread limits, but the most common ones are:
Per-thread memory usage (the more memory each thread uses, the fewer threads you can spawn)
Context-switching cost (the more threads you use, the more CPU-time is spent switching).
Lock contention (if you rely on a lot of coarse grained locking, the increasing the number of threads simply increases the contention.)
The threading model of the OS (How does it manage the threads? What are the per-thread costs?)
The threading model of the language runtime. (Coroutines, green-threads, OS threads, sparks, etc.)
The hardware. (How many CPUs/cores? Is it hyperthreaded? Does the OS loadbalance the threads appropriately, etc.)
Etc. (there are many more, but the above are the most important ones.)
The answer to your questions 1, 3, and 4 is "it's application dependent". Depending on what your threads do, you may need a different number to maximize your application's efficiency.
As to question 2, there's almost certainly a limit, and it's not necessarily something you can change easily. The number of concurrent threads might be limited per-user, or there might be a maximum number of a allowed threads in the kernel.
There's nothing fixed: it depends what they are doing. Sometimes adding more threads to do asynchronous I/O can increase the performance of another thread with no bad side effects.
This is likely fixed at compile time.
No, it's a process architecture decision. But having at least one listener-scheduler thread besides the one or more threads doing the heavy lifting suggests the number should normally be at least two.
Almost certainly, your ability to really grasp what is going on. Threaded code chokes easily and in the most unexpected ways: making sure the code has no races/deadlocks is hard. Study different ways of handling concurrency, such as shared-nothing (cf. Erlang).
As long as you never have more threads using CPU time than you have cores, you will have optimal performance, but then as soon as you have to wait for I/O There will be unused CPU cycles, so you may want to profile you applications, and see wait portion of the time it spends maxing out the CPU and what portion waiting for RAM, Hard Disk, Network, and other IO, in general if you are waiting for I/O you could have 1 more thread (Provided that you are primarily CPU bound).
For the hard and absolute limit Check out PTHREAD_THREADS_MAX in limits.h this may be what you are looking for. Might be POSIX_THREAD_MAX on some systems.
Any app with more busy threads than the number of processors will cause some overall slowdown. There's an upper limit, but it varies system to system. For some, it used to be 256 and you could recompile the OS to get it a bit higher.
As long as the threads are designed to do separate tasks, then there is not so much issue. However, the problem starts when these threads intersect the resources when locking mechanism should be implemented.