No performance improvement when multithreading linear regression using boost c++ libraries - c++

I am performing calls on a method using multiple threads via boost libraries. I received quite a performance enhancement doing so. I've recently introduced linear regression calculations into the method and am having a severe per thread performance penalty.
For instance, if I run a single thread, the average method call takes 2 seconds. If I use two threads, I register twice as much CPU activity, but the average method call takes 5-6 seconds. This continues as I increase threads. There are no known race conditions or (I think) significant shared memory.
It almost seems as if some cache or other CPU hardware feature is being used by all the threads and becoming a bottleneck. But I don't know enough about CPU architecture to be sure. I am running an Intel Xeon 25-2620 CPU.
Help is desperately needed.

Related

What's the "real world" performance improvement for multithreading I can expect?

I'm programming a recursive tree search with multiple branches, and it works fine. To speed it up I'm implementing simple multithreading: I split the search into its main branches and distribute them among the threads. The threads don't have to interact with each other, and when a solve is found I add it to a common std::vector using a mutex, this way:
if (CubeTest.IsSolved())
{   // Solve algorithm found
    std::lock_guard<std::mutex> guard(SearchMutex); // Thread safe code
    Solves.push_back(Alg); // Add the solve
}
I don't allocate variables in dynamic store (heap) with new and delete, since the memory needs are small.
The maximum number of threads I use is the quantity I get from: std::thread::hardware_concurrency()
I did some tests, always with the same search but changing the number of threads used, and I found things that I didn't expect.
I know that if you double the number of threads (and the processor has enough capacity) you can't expect to double the performance, because of context switching and things like that.
For example, I have an old Intel Xeon X5650 with 6 cores / 12 threads. If I run my code, things are as expected up to the sixth thread, but if I use one more thread the performance gets worse. Using more threads increases the performance very little, to the point that using all available threads (12) barely improves on using only 6:
Threads vs processing time chart for Xeon X5650:
(I repeated the test several times and show the average times of all the runs.)
I repeated the tests on another computer with an Intel i7-4600U (2 cores / 4 threads) and found this:
Threads vs processing time chart for i7-4600U:
I understand that with fewer cores the performance gain from using more threads is worse.
I also think that when you start to use the second thread on the same core, the performance is penalized in some way. Am I right? How can I improve the performance in this situation?
So my question is whether these performance gains from multithreading are what I can expect in the real world, or whether these numbers are telling me that I'm doing things wrong and should learn more about multithreaded programming.
What's the “real world” performance improvement for multithreading I can expect?
It depends on many factors. In general, the most optimistic improvement that one can hope for is reduction of runtime by a factor of the number of cores [1]. In most cases this is unachievable because of the need for threads to synchronise with one another.
In worst case, not only is there no improvement due to lack of parallelism, but also the overhead of synchronisation as well as cache contention can make the runtime much worse than the single threaded program.
Peak memory use often increases linearly by number of threads because each thread needs to operate on data of their own.
Total CPU time usage, and therefore energy use also increases due to extra time spent on synchronisation. This is relevant to systems that operate on battery power as well as those that have poor heat management (both apply to phones and laptops).
Binary size would be marginally larger due to extra code that deals with threads.
[1] Whether you get all of the performance out of "logical" cores, i.e. "hyper threading" or "clustered multi threading", also depends on many factors. Often, one executes the same function in all threads, in which case they tend to use the same parts of the CPU, so sharing the core between multiple threads doesn't necessarily yield a benefit.
A CPU which uses hyperthreading presents two logical cores per physical core, but that is not the same as having two full cores. The two hardware threads share the core's execution units and caches, so when both want the same resources they have to take turns.
So what's the point of hyperthreading at all?
The core can fill cycles that one thread would otherwise waste, for example while it is stalled waiting for memory, and it does so without involving the thread scheduler of the operating system. So the performance gains come mostly from keeping the existing execution units busy. It does not raise the peak number of operations the core can perform.
Conclusion: The performance gain you can expect from concurrency depends mainly on the number of physical cores of the CPU, not logical cores.
Also keep in mind that thread synchronization methods like mutexes can become pretty expensive. So the less locking you can get away with the better. When you have multiple threads filling the same result set, then it can sometimes be better to let each thread build their own result set and then merge those sets later when all threads are finished.
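As a rough illustration of the "one result set per thread, merge at the end" idea, here is a minimal sketch. The names search_branch and search_all are hypothetical, and the integer "solves" merely stand in for whatever the real search produces; the point is only that no mutex is needed while the threads are running.

```cpp
#include <cassert>
#include <thread>
#include <vector>

// Hypothetical stand-in for the real search: returns the "solves" found in one branch.
std::vector<int> search_branch(int branch) {
    std::vector<int> found;
    for (int i = 0; i < 1000; ++i)
        if (i % 250 == 0) found.push_back(branch * 1000 + i);
    return found;
}

// Each thread writes only to its own slot; the slots are merged once, after join().
std::vector<int> search_all(int num_branches) {
    assert(num_branches >= 0);
    std::vector<std::vector<int>> per_thread(num_branches);
    std::vector<std::thread> workers;
    for (int b = 0; b < num_branches; ++b)
        workers.emplace_back([&per_thread, b] { per_thread[b] = search_branch(b); });
    for (auto& w : workers) w.join();

    // Single-threaded merge: no locking required at this point.
    std::vector<int> solves;
    for (const auto& part : per_thread)
        solves.insert(solves.end(), part.begin(), part.end());
    return solves;
}
```

If solves are rare, as in the question, the mutex is probably not the bottleneck, so it is worth measuring before restructuring.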

How to optimize code for Simultaneous Multithreading?

Currently, I am learning parallel processing using CPU, which is a well-covered topic with plenty of tutorials and books.
However, I could not find a single tutorial or resource that talks about programming techniques for hyper threaded CPU. Not a single code sample.
I know that to utilize hyper threading, the code must be implemented such that different parts of the CPU can be used at the same time (simplest example is calculating integer and float at the same time), so it's not plug-and-play.
Which book or resource should I look at if I want to learn more about this topic? Thank you.
EDIT: when I said hyper threading, I meant Simultaneous Multithreading in general, not Intel's hyper threading specifically.
Edit 2: for example, if I have an 8-core i7 CPU, I can write a sorting algorithm that runs 8 times faster when it uses all 8 cores instead of 1. But it will run the same on a 4-core CPU and a 4c-8t CPU, so in my case SMT does nothing.
Meanwhile, Cinebench will run much better on a 4c-8t CPU than on a 4c-4t CPU.
SMT is generally most effective when one thread is loading something from memory. Depending on where the data lives (L1, L2 or L3 cache, or RAM), read/write latency can span a lot of CPU cycles, which would be wasted doing nothing if only one thread were executed per core.
So, if you want to maximize the impact of SMT, try to interleave memory access of two threads so that one of them can execute instructions, while the other reads data. Theoretically, you can also use a thread just for cache warming, i.e. loading data from RAM or main storage into cache for subsequent use by other threads.
The way of successfully applying this can vary from one system to another because the access latency of cache, RAM and main storage as well as their size may differ by a lot.

Gflop vendor specification vs practical results

I have a question that I can't figure out.
I have an Nvidia 750M GPU, and according to the specification it should deliver 722.7 GFlop/s (GPU specification), but when I run the test from the CUDA samples it gives me about 67.64 GFlop/s.
Why such a big difference?
Thanks.
The peak performance can only be achieved when every core is busy executing an FMA on every cycle, which is impossible in a real task.
Besides, FMA is the only operation that is counted as 2 operations; every other operation counts as one.
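To make the arithmetic behind such a vendor figure concrete, here is a small sketch of the usual peak formula: cores × clock × 2 (one FMA per core per cycle). The core count (384) and clock (0.941 GHz) used in the usage note are my assumptions, not values from the question; they are chosen because they reproduce the quoted 722.7 GFlop/s. The function names are hypothetical.

```cpp
#include <cassert>
#include <cmath>

// Theoretical peak in GFLOP/s, assuming every core retires one FMA
// (counted as 2 floating-point operations) on every cycle.
double peak_gflops(int cores, double clock_ghz, int flops_per_core_per_cycle = 2) {
    assert(cores > 0 && clock_ghz > 0.0);
    return cores * clock_ghz * flops_per_core_per_cycle;
}

// Fraction of the theoretical peak that a measured result reaches.
double efficiency(double measured_gflops, double peak) {
    assert(peak > 0.0);
    return measured_gflops / peak;
}
```

With the assumed figures, peak_gflops(384, 0.941) comes out at about 722.7, and efficiency(67.64, 722.7) at a little over 9%, which is the gap the question is asking about.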
For a single kernel launch, if you do some sampling in Visual Profiler you will notice something called a stall. Each operation takes time to finish, and if another operation relies on the result of the previous one, it has to wait. This will eventually create "gaps" in which a core is left idle, waiting until a new operation is ready to be executed. Among these, device memory operations have HUGE latencies. If you don't do it right, your code will end up waiting for memory operations all the time.
Some tasks can be optimized well. If you test gemm in cuBLAS, it can reach over 80% of the peak FLOPS, on some devices even 90%. Other tasks simply cannot be optimized for FLOPS. For example, if you add one vector to another, the performance will always be limited by the memory bandwidth, and you will never see high FLOPS.

Should I "bind" a "spinning" thread to a certain core?

My application contains several latency-critical threads that "spin", i.e. never block.
Such a thread is expected to take 100% of one CPU core. However, it seems modern operating systems often move threads from one core to another. So, for example, with this Windows code:
void Processor::ConnectionThread()
{
    while (work)
    {
        Iterate();
    }
}
I do not see "100% occupied" core in Task manager, overall system load is 36-40%.
But if I change it to this:
void Processor::ConnectionThread()
{
    SetThreadAffinityMask(GetCurrentThread(), 2);
    while (work)
    {
        Iterate();
    }
}
Then I do see that one of the CPU cores is 100% occupied, also overall system load is reduced to 34-36%.
Does this mean that I should generally call SetThreadAffinityMask for "spin" threads? Did I improve latency by adding SetThreadAffinityMask in this case? What else should I do for "spin" threads to improve latency?
I'm in the middle of porting my application to Linux, so this question is more about Linux if this matters.
Update: I found this slide, which shows that binding a busy-waiting thread to a CPU may help:
Running a thread locked to a single core gives the best latency for that thread in most circumstances if this is the most important thing in your code.
The reasons(R) are
your code is likely to be in your iCache
the branch predictors are tuned to your code
your data is likely to be ready in your dCache
the TLB points to your code and data.
Unless
You're running an SMT system (e.g. hyperthreaded), in which case the evil twin will "help" you by causing your code to be washed out of the iCache, by tuning the branch predictors to its own code, by pushing your data out of the dCache, and by affecting your TLB through its own use.
The cost is unknown: each cache miss costs ~4ns, ~15ns or ~75ns for data, and this quickly adds up to several 1000ns.
You still save for each reason (R) mentioned above that remains in effect.
If the evil twin also just spins the costs should be much lower.
Or you're allowing interrupts on your core, in which case you get the same problems and
your TLB is flushed
you take a 1000ns-20000ns hit on the context switch; most should be at the low end if the drivers are well programmed.
Or you allow the OS to switch your process out, in which case you have the same problems as with the interrupt, just at the high end of the range.
Switching out could also cause the thread to pause for an entire time slice, as it can only be run on one (or two) hardware threads.
Or you use any system calls that cause context switches.
No disk IO at all.
only async IO else.
Having more active (non-paused) threads than cores increases the likelihood of problems.
So if you need less than 100ns latency to keep your application from exploding, you need to prevent or lessen the impact of SMT, interrupts and task switching on your core.
The perfect solution would be a real-time operating system with static scheduling. This is a nearly perfect match for your target, but it's a new world if you have mostly done server and desktop programming.
The disadvantages of locking a thread to a single core are:
It will cost some total throughput.
as some other threads might have run if the context could have been switched.
but the latency is more important in this case.
If the thread gets context switched out, it will take some time before it can be scheduled again, potentially one or more time slices (typically 10-16ms), which is unacceptable in this application.
Locking it to a core and its SMT will lessen this problem, but not eliminate it. Each added core will lessen the problem.
setting its priority higher will lessen the problem, but not eliminate it.
Scheduling with SCHED_FIFO and the highest priority will prevent most context switches; interrupts can still cause temporary switches, as do some system calls.
If you have a multi-CPU setup you might be able to take exclusive ownership of one of the CPUs through cpuset. This prevents other applications from using it.
Using pthread_setschedparam with SCHED_FIFO and the highest priority, running as superuser, and locking the thread to the core and its evil twin should secure the best latency of all of these. Only a real-time operating system can eliminate all context switches.
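A minimal Linux sketch of the combination described above might look like this. The helper names pin_current_thread and make_current_thread_fifo are hypothetical; the calls themselves (pthread_setaffinity_np, pthread_setschedparam, sched_get_priority_max) are the glibc/POSIX ones. Treat it as a starting point under those assumptions rather than a tuned setup: it pins to a single logical CPU only and does nothing about interrupts or the SMT sibling.

```cpp
#include <cassert>
#include <pthread.h>
#include <sched.h>

// Pin the calling thread to one logical CPU.
// Returns 0 on success, otherwise an errno-style error code.
int pin_current_thread(int cpu) {
    assert(cpu >= 0 && cpu < CPU_SETSIZE);
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

// Request SCHED_FIFO at the highest priority for the calling thread.
// This normally needs root (or CAP_SYS_NICE) and fails with EPERM otherwise.
// Returns 0 on success, otherwise an errno-style error code.
int make_current_thread_fifo() {
    sched_param param{};
    param.sched_priority = sched_get_priority_max(SCHED_FIFO);
    return pthread_setschedparam(pthread_self(), SCHED_FIFO, &param);
}
```

Both helpers would be called at the top of the spinning thread's function, before entering the loop, and the return values checked, since an unprivileged process will usually be refused SCHED_FIFO.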
Other links:
Discussion on interrupts.
Your Linux might let you call sched_setscheduler with SCHED_FIFO, but this requires that the thread has its own PID, not just a TID, or that your threads multitask cooperatively.
This might not be ideal, as all your threads would only be switched "voluntarily", which removes flexibility for the kernel to schedule them.
Interprocess communication in 100ns
Pinning a task to specific processor will generally give better performance for the task. But, there are a lot of nuances and costs to consider when doing so.
When you force affinity, you restrict the operating system's scheduling choices. You increase cpu contention for the remaining tasks. So EVERYTHING else on the system is impacted including the operating system itself. You also need to consider that if tasks need to communicate across memory, and affinities are set to cpus that don't share cache, you can drastically increase latency for communication across tasks.
One of the biggest reasons setting task CPU affinity is beneficial, though, is that it gives more predictable cache and TLB (translation lookaside buffer) behavior. When a task switches CPUs, the operating system can move it to a CPU that doesn't have access to the last CPU's cache or TLB. This can increase cache misses for the task. It's particularly an issue when communicating across tasks, as it takes more time to communicate across higher-level caches and, worst of all, main memory. To measure cache statistics on Linux (and performance in general) I recommend using perf.
The best suggestion is really to measure before you try to fix affinities. A good way to quantify latency would be by using the rdtsc instruction (at least on x86). This reads the cpu's time source, which will generally give the highest precision. Measuring across events will give roughly nanosecond accuracy.
#include <stdint.h>

// Reads the time-stamp counter (x86, GCC/Clang inline asm); the result arrives in EDX:EAX.
static inline uint64_t rdtsc() {
    uint32_t eax, edx;
    asm volatile ("rdtsc" : "=d"(edx), "=a"(eax));
    return ((uint64_t) edx << 32) | (uint64_t) eax;
}
note - the rdtsc instruction needs to be combined with a load fence to ensure all previous instructions have completed (or use rdtscp)
also note - if rdtsc is used without an invariant time source (on Linux, grep constant_tsc /proc/cpuinfo), you may get unreliable values across frequency changes and if the task switches CPU (time source)
So, in general, yes, setting the affinity does give lower latency, but this is not always true, and there are very serious costs when you do it.
Some additional reading...
Intel 64 Architecture Processor Topology Enumeration
What Every Programmer Needs to Know About Memory (Parts 2, 3, 4, 6, and 7)
Intel Software Developer Reference (Vol. 2A/2B)
Acquire and Release Fences
TCMalloc
I came across this question because I'm dealing with exactly the same design problem. I'm building HFT systems where each nanosecond counts.
After reading all the answers, I decided to implement and benchmark 4 different approaches
busy wait with no affinity set
busy wait with affinity set
observer pattern
signals
The unbeatable winner was "busy wait with affinity set". No doubt about it.
Now, as many have pointed out, make sure to leave a couple of cores free so that the OS can run freely.
My only concern at this point is whether there is any physical harm to cores that run at 100% for hours.
Binding a thread to a specific core is probably not the best way to get the job done. You can do that, it will not harm a multi core CPU.
The best way to reduce latency is really to raise the priority of the process and the polling thread(s). Normally the OS will interrupt your threads hundreds of times a second and let other threads run for a while. Your thread may not run for several milliseconds.
Raising the priority will reduce the effect (but not eliminate it).
Read more about SetThreadPriority and SetProcessPriorityBoost.
There are some details in the docs you need to understand.
This is simply foolish. All it does is reduce the scheduler's flexibility. Whereas before it could run it on whatever core it thought was best, now it can't. Unless the scheduler was written by idiots, it would only move the thread to a different core if it had a good reason to do that.
So you're just saying to the scheduler, "even if you have a really good reason to do this, don't do it anyway". Why would you say that?

What's the meaning of thread concurrency overhead time in the profiler output?

I'd really appreciate it if someone with good experience of Intel VTune Amplifier could tell me about this.
Recently I received a performance analysis report from some other people who ran Intel VTune Amplifier against my program. It says there is high overhead time in the thread concurrency area.
What's the meaning of the Overhead Time? They don't know (they asked me), and I don't have access to Intel VTune Amplifier.
I have vague ideas. This program has many thread sleep calls, because the pthread condition is unstable on the target platform (or I used it badly), so I changed many routines to do their work in a loop that looks like this:
while (true)
{
    mutex.lock();
    if (event changed)
    {
        mutex.unlock();
        // do something
        break;
    }
    else
    {
        mutex.unlock();
        usleep(3 * 1000);
    }
}
Can this be flagged as Overhead Time?
Any advice?
I found help documentation about Overhead Time from Intel site.
http://software.intel.com/sites/products/documentation/hpc/amplifierxe/en-us/win/ug_docs/olh/common/overhead_time.html#overhead_time
Excerpt:
Overhead time is a duration that starts with the release of a shared resource and ends with the receipt of that resource. Ideally, the duration of Overhead time is very short because it reduces the time a thread has to wait to acquire a resource. However, not all CPU time in a parallel application may be spent on doing real pay load work. In cases when parallel runtime (Intel® Threading Building Blocks, OpenMP*) is used inefficiently, a significant portion of time may be spent inside the parallel runtime wasting CPU time at high concurrency levels. For example, this may result from low granularity of work split in recursive parallel algorithms: when the workload size becomes too low, the overhead on splitting the work and performing the housekeeping work becomes significant.
Still confusing.. Could it mean "you made unnecessary/too frequent lock"?
I am also not much of an expert on that, though I have tried to use pthread a bit myself.
To demonstrate my understanding of overhead time, let us take the example of a simple single-threaded program to compute an array sum:
for(i=0;i<NUM;i++) {
sum += array[i];
}
In a simple [reasonably done] multi-threaded version of that code, the array could be broken into one piece per thread, each thread keeps its own sum, and after the threads are done, the sums are summed.
In a very poorly written multi-threaded version, the array could be broken down as before, and every thread could atomicAdd to a global sum.
In this case, the atomic addition can only be done by one thread at a time. I believe that overhead time is a measure of how long all of the other threads spend while waiting to do their own atomicAdd (you could try writing this program to check if you want to be sure).
Of course, it also takes into account the time it takes to deal with switching the semaphores and mutexes around. In your case, it probably means a significant amount of time is spent on the internals of the mutex.lock and mutex.unlock.
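For anyone who wants to try the two array-sum variants described above, here is a minimal sketch. The names for_each_slice, sum_partial and sum_atomic are hypothetical, and std::atomic with fetch_add stands in for the "atomicAdd" mentioned in the text. Both functions return the same result; the difference would only show up in timing or in a profiler.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Runs body(thread_index, begin, end) on num_threads threads, one slice of [0, n) each.
template <typename Body>
void for_each_slice(std::size_t n, unsigned num_threads, Body body) {
    assert(num_threads > 0);
    const std::size_t chunk = (n + num_threads - 1) / num_threads;
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < num_threads; ++t) {
        const std::size_t begin = std::min(n, t * chunk);
        const std::size_t end = std::min(n, begin + chunk);
        workers.emplace_back([=, &body] { body(t, begin, end); });
    }
    for (auto& w : workers) w.join();
}

// Reasonable version: each thread keeps its own sum; the sums are added after the join.
long long sum_partial(const std::vector<int>& a, unsigned num_threads) {
    std::vector<long long> partial(num_threads, 0);
    for_each_slice(a.size(), num_threads, [&](unsigned t, std::size_t begin, std::size_t end) {
        long long local = 0;  // accumulate locally, write the shared slot once
        for (std::size_t i = begin; i < end; ++i) local += a[i];
        partial[t] = local;
    });
    long long total = 0;
    for (long long p : partial) total += p;
    return total;
}

// Poor version: every element is added to one shared atomic, so threads contend on it.
long long sum_atomic(const std::vector<int>& a, unsigned num_threads) {
    std::atomic<long long> total{0};
    for_each_slice(a.size(), num_threads, [&](unsigned, std::size_t begin, std::size_t end) {
        for (std::size_t i = begin; i < end; ++i)
            total.fetch_add(a[i], std::memory_order_relaxed);
    });
    return total.load();
}
```

How the time spent in the contended version is attributed (overhead, wait or plain CPU time) depends on the profiler, so this is something to check rather than assume.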
I parallelized a piece of software a while ago (using pthread_barrier), and had issues where it took longer to run the barriers than it did to just use one thread. It turned out that the loop that had to have 4 barriers in it was executed quickly enough to make the overhead not worth it.
Sorry, I'm not an expert on pthread or Intel VTune Amplifier, but yes, locking a mutex and unlocking it will probably count as overhead time.
Locking and unlocking mutexes can be implemented as system calls, which the profiler probably would just lump under threading overhead.
I'm not familiar with VTune, but there is an overhead in the OS for switching between threads. Each time a thread stops and another is loaded on a processor, the current thread's context needs to be stored so that it can be restored when the thread next runs, and then the new thread's context needs to be restored so it can carry on processing.
The problem may be that you have too many threads and so the processor is spending most of its time switching between them. Multi threaded applications will run most efficiently if there are the same number of threads as processors.