General OpenMP - thread speeds are different - C++

I have an openMP program, where a for loop is parallelised.
Everything works as it should, except the master thread is many, many times faster than the rest of the threads... For example, when running with 4 threads, thread 0 finishes long before the other ones, but they are executing the same code, with almost the same amount of work.
Can this be because of resource handling by Windows, swapping tasks in and out of the threads used by the program, causing the slowdown?
Or is it more likely that my code is the problem? I just want to make sure I don't waste time looking for an error in my program if this is an unavoidable problem caused by the OS...

As for why the master thread seems to get priority, it could be an issue between the OpenMP runtime and the OS. Which compiler are you using? How are you measuring when the threads terminate?
To improve the performance of your OpenMP parallel for in this case, I would use the dynamic scheduling policy via the schedule clause, as sketched below. If the master thread is getting more cycles from the CPU, it will simply do more of the work. In general you can't count on each thread being equally fast, but if you are observing order-of-magnitude differences, it sounds like a bad clash between the runtime and the OS.
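A minimal sketch of what that looks like (the loop body, chunk size, and iteration count are placeholders, not from the question):

#include <cstdio>

int main() {
    const int n = 1000000;        // placeholder iteration count
    double sum = 0.0;

    // schedule(dynamic, 1000): iterations are handed out in chunks of 1000
    // to whichever thread finishes its previous chunk first, so a thread
    // that is starved of CPU time simply receives fewer chunks.
    #pragma omp parallel for schedule(dynamic, 1000) reduction(+:sum)
    for (int i = 0; i < n; ++i) {
        sum += i * 0.5;           // placeholder for the real per-iteration work
    }

    printf("sum = %f\n", sum);
    return 0;
}

Compile with your compiler's OpenMP flag (e.g. -fopenmp on GCC).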

Thread execution depends on many things, so there are many possibilities: whether any locking mechanism is involved, resource availability, whether a thread releases its resources when it completes its job, and so on. What I suggest is to profile your code with a tool such as Intel VTune; it will give you a clear idea of where your threads are wasting time and why. I hope it helps.

Related

How does thread waiting affect the execution time of the program?

In my C++ program, I am using boost libraries for parallel programming. Several threads are made to join() on other threads in a part of the program.
The program runs pretty slowly for some inputs... In an attempt to improve my program, I tried finding hotspots using Intel VTune. The most time-consuming hotspot is shown to occur due to boost::this_thread::interruptible_wait.
When I checked the portion of the source code where this hotspot occurs, it shows the call to join(). I was under the impression that waiting threads do not take CPU time. Can someone help me understand why the thread join() operation takes up so much CPU time?
Any insights on how to fix such a hotspot will be very helpful too! One way I can think of to fix such a hotspot would be to somehow detach() the threads and not join() them.
Thanks in advance!
I was under the impression that waiting threads do not take CPU Time
It really depends on how the threads wait. They may be busy waiting (i.e. spinning) to react as quickly as possible to whatever they are waiting for. The alternative of yielding execution after every check means potentially higher delays from operating system scheduling (and thread switching overhead).
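A minimal sketch of the two waiting styles (the done flag is a made-up stand-in for whatever the threads actually wait on):

#include <atomic>
#include <thread>

std::atomic<bool> done{false};   // hypothetical completion flag

void spin_wait() {
    // Busy-wait: reacts almost instantly, but burns a full core while waiting.
    while (!done.load()) { /* spin */ }
}

void yielding_wait() {
    // Yielding wait: frees the CPU between checks, but wake-up latency now
    // depends on the OS scheduler.
    while (!done.load())
        std::this_thread::yield();
}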
VTune will mercilessly pick up on all your threading library overhead; you will need to filter appropriately to figure out where your serial hotspots are and whether your parallelization has mitigated them.
If your threads spend a significant amount of time waiting on the join, your parallel section is probably not well-balanced. Without more information on your problem it's hard to tell what the reason is or how to mitigate it, but you should probably try to distribute the work more evenly.
On another note, the recent spectre/meltdown fixes appear to have increased VTune's profiling overhead. I would be careful taking the results at face value (does your program run close to the same amount of time with and without profiling?).
Edit: Related material here and here. Following the instructions in the linked page for disabling the kernel protections helped in my case, although I have not tested it on the latest VTune update.

C++ Pthreads - Multithreading slower than single-threading [closed]

I am developing a C++ application using the pthreads library. Every thread in the program accesses a common unordered_map. The program runs slower with 4 threads than with 1. I commented out all the code in the thread function and left only the part that tokenizes a string. The single-threaded execution was still faster, so I concluded that the map wasn't the problem.
After that I printed the threads' IDs to the screen, and they seemed to execute sequentially.
In the function that spawns the threads, I have a while loop that creates threads into an array whose size is the number of threads (say 'tn'). Every time tn threads have been created, I execute a for loop to join them (pthread_join). The while loop runs many times (not only 4). Roughly, the structure looks like the sketch below.
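(A simplified sketch of that structure, not my actual code; the names are made up:)

#include <pthread.h>

void* worker(void*) {               // tokenizes a string (details omitted)
    return NULL;
}

void run_batches(int tn, int batches) {
    pthread_t threads[64];          // assume tn <= 64 for the sketch
    while (batches-- > 0) {         // the while loop runs many times
        for (int i = 0; i < tn; ++i)
            pthread_create(&threads[i], NULL, worker, NULL);
        for (int i = 0; i < tn; ++i)   // join the whole batch before the next
            pthread_join(threads[i], NULL);
    }
}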
What may be wrong?
If you are running a small, trivial program, this tends to be the case: the work to start the threads, schedule them, run, context-switch, and then synchronize can take more time than running the job single-threaded.
The point here is that for trivial problems multithreading can run slower. BUT another factor is how many cores your CPU actually has.
When you run a multithreaded program, the threads are time-sliced onto the available cores by the scheduler.
You only have true parallelism if you have multiple cores, and even then parallelism is limited to one thread per core.
Now, given that (most likely) both your threads share one core, keep in mind the overhead the CPU incurs for:
allocating time slices to each thread
synchronizing thread accesses to shared internal CPU resources
other thread-priority operations
So, in other words, for a simple application multithreading is actually a downgrade in terms of performance.
Multithreading comes in handy when you need an asynchronous operation, meaning you don't want to wait for a result - such as loading an image from a URL or streaming geometry from an HDD, which is slower than RAM.
In such scenarios, applying multithreading leads to a better user experience, because your program won't hang while a slow operation occurs.
Without seeing the code it's difficult to tell for sure, but there could be a number of issues.
Your threads might not be doing enough work to justify their creation. Creating and running threads is expensive, so if your workload is too small, they won't pay for themselves.
Execution time could be spent mostly doing memory accesses on the map, in which case mutually excluding the threads means that you aren't really doing much parallel work in practice (Amdahl's Law).
If most of your code runs under a mutex, it will run serially, not in parallel.
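For illustration, a contrived sketch (names made up) of how a whole-body lock serializes the threads:

#include <mutex>
#include <string>
#include <unordered_map>

std::mutex m;
std::unordered_map<std::string, int> shared_map;

void worker_body() {
    std::lock_guard<std::mutex> lock(m);   // held for the entire body...
    shared_map["token"]++;                 // ...so every thread's work here
                                           // runs one thread at a time
}

Adding threads to code like this only adds creation and context-switch overhead on top of an effectively serial program.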

Parallel Thread Execution to achieve performance

I am a little bit confused about multithreading. Normally we create multiple threads to break the main task into subtasks, to achieve responsiveness, and to remove waiting time.
But here I have a situation where I have to execute the same task using multiple threads in parallel.
My processor can execute 4 threads in parallel, so will it improve performance if I create more than 4 threads (10 or more)? When I put this question to my colleague, he said nothing bad would happen; we already run many threads in many other places, like browser threads, kernel threads, etc., so he suggested creating multiple threads for the same task.
But won't creating more than 4 threads that are supposed to execute in parallel just cause more context switches and decrease performance?
Or, even though we create multiple threads to execute in parallel, will they execute one after the other, so the performance stays the same?
So what should I do in the above situation, and are these assumptions correct?
Edit:
1 thread: time to process was 120 seconds.
2 threads: time to process was about 60 seconds.
3 threads: time to process was still about 60 seconds (no change from the 2-thread time).
Is it because my hardware can only run 2 threads in parallel (being dual-core)?
Software thread = a piece of code.
Hardware thread = a core (processor) that runs a software thread.
So my CPU supports only 2 concurrent threads; if I purchase an AMD CPU with 8 or 12 cores, can I achieve higher performance?
Multi-Tasking is pretty complex and performance gains usually depend a lot on the problem itself:
Only a part of the application can be worked in parallel (there is always a first part that splits up the work into multiple tasks). So the first question is: How much of the work can be done in parallel and how much of it needs to be synchronized (in some cases, you can stop here because so little can be done in parallel that the whole work isn't worth it).
Multiple tasks may depend on each other (one task may need the result of another task). These tasks cannot be executed in parallel.
Multiple tasks may work on the same data/resources (read/write situations). Here we need to synchronize access to those data/resources. If all tasks need write access to the same object during the WHOLE process, we cannot work in parallel.
Basically this means that without the exact definition of the problem (dependencies between tasks, dependencies on data, amount of parallel tasks, ...) it's very hard to tell how much performance you'll gain by using multiple threads (and if it's really worth it).
http://en.wikipedia.org/wiki/Amdahl%27s_law
Amdahl's law states, in a nutshell, that the performance boost you receive from parallel execution is limited by the part of your code that must run sequentially.
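In formula form: if a fraction p of the work can be parallelized across n threads, the speedup is

S(n) = \frac{1}{(1 - p) + p/n}

For example, with p = 0.9 and n = 4, S = 1 / (0.1 + 0.225) ≈ 3.1, and no number of threads can ever push it past 1 / (1 - p) = 10.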
Without knowing your problem space here are some general things you should look at:
Refactor to eliminate mutex/locks. By definition they force code to run sequentially.
Reduce context-switch overhead by pinning threads to physical cores. This becomes more complicated when threads must wait for work (i.e. block on I/O), but in general you want to keep each core as busy as possible running your program, not switching threads in and out.
Unless you absolutely need raw threads and sync primitives, try to use a task scheduler or a parallel algorithms library to parallelize your work. Examples would be Intel TBB, Thrust, or Apple's libdispatch. A sketch with TBB follows this list.
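As a sketch of that approach with Intel TBB (the container and per-element work are placeholders):

#include <tbb/blocked_range.h>
#include <tbb/parallel_for.h>
#include <vector>

void double_all(std::vector<double>& data) {
    // TBB splits the range into tasks and load-balances them across its
    // worker pool; there is no manual thread creation, joining, or pinning.
    tbb::parallel_for(tbb::blocked_range<size_t>(0, data.size()),
                      [&](const tbb::blocked_range<size_t>& r) {
                          for (size_t i = r.begin(); i != r.end(); ++i)
                              data[i] *= 2.0;   // placeholder per-element work
                      });
}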

Boost threads: is it possible to limit the run time of a thread before moving to another thread

I have a program with a main thread and a diagnostics thread. The main thread is basically a while(1) loop that performs various tasks. One of these tasks is to provide a diagnostics engine with information about the system and then check back later (i.e. in the next loop) to see if there are any problems that should be dealt with. An iteration of the main loop should take no longer than 0.1 seconds. If all is well, then the diagnostic engine takes almost no time to come back with an answer. However, if there is a problem, the diagnostic engine can take seconds to isolate the problem. For this reason each time the diagnostic engine receives new information it spins up a new diagnostics thread.
The problem we're having is that the diagnostics thread is stealing time away from the main thread. Effectively, even though we have two threads, the main thread is not able to run as often as I would like because the diagnostic thread is still spinning.
Using Boost threads, is it possible to limit the amount of time a thread can run before moving on to another thread? Also important here: the diagnostic algorithm we are using is a black box, so we can't put any threading code inside it. Thanks!
If you run multiple threads they will indeed consume CPU time. If you only have a single processor, and one thread is doing processor intensive work then that thread will slow down the work done on other threads. If you use OS-specific facilities to change the thread priority then you can make the diagnostic thread have a lower priority than the main thread. Also, you mention that the diagnostic thread is "spinning". Do you mean it literally has the equivalent of a spin-wait like this:
while(!check_done()) ; // loop until done
If so, I would strongly suggest that you try and avoid such a busy-wait, as it will consume CPU time without achieving anything.
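A minimal sketch of the blocking alternative, using standard-library primitives (Boost's mutex/condition_variable work the same way; the done flag is made up):

#include <condition_variable>
#include <mutex>

std::mutex m;
std::condition_variable cv;
bool done = false;   // hypothetical "work finished" flag, protected by m

void waiter() {
    std::unique_lock<std::mutex> lock(m);
    cv.wait(lock, [] { return done; });   // sleeps; consumes no CPU while waiting
}

void finisher() {
    {
        std::lock_guard<std::mutex> lock(m);
        done = true;
    }
    cv.notify_one();                      // wakes the waiter
}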
However, though multiple threads can cause each other to slow-down, if you are seeing an actual delay of several seconds this would suggest there is another problem, and that the main thread is actually waiting for the diagnostic thread to complete. Check that the call to join() for the diagnostic thread is outside the main loop.
Another possibility is that the diagnostic thread is locking a mutex needed by the main thread loop. Check which mutexes are locked and where.
To really help, I'd need to see some code.
It looks like your threads are interlocked, so your main thread waits until the background thread finishes its work. Check any multithreading synchronization that could cause this.
To check that it's nothing related to OS scheduling, run your program on a dual-core system, so both threads can execute truly in parallel.
From the way you've worded your question, it appears that you're not quite sure how threads work. I assume by "the amount of time that a thread can run before moving on to another thread" you mean the time slice the scheduler gives each thread before switching; such switches happen many times per second.
Boost.Thread does not have support for thread priorities, although your OS-specific thread API will. However, your problem seems to indicate the necessity for a fundamental redesign -- or at least heavy profiling to find bottlenecks.
You can't do this generally at the OS level, so I doubt Boost has anything specific for limiting execution time. You can sort of fake it with small-block operations and waits, but it's not clean.
I would suggest looking into processor affinity, either at the thread or process level (this will be OS-specific). If you can isolate your diagnostic processing to a limited subset of [logical] processors on a multi-core machine, it gives you a very coarse mechanism for capping its execution share relative to the main process. That's the best solution I have found when trying to do this sort of thing; a sketch follows.
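On Windows, for instance, it could look like this (a sketch; hDiag and the 4-core mask are assumptions, not from the question):

#include <windows.h>

// Restrict the diagnostic thread to logical processor 3, leaving
// processors 0-2 free for the main thread.
void pin_diagnostics(HANDLE hDiag) {
    DWORD_PTR mask = DWORD_PTR(1) << 3;   // one bit per logical processor
    SetThreadAffinityMask(hDiag, mask);   // returns 0 on failure
}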
Hope that helps.

My multithreaded program runs slowly or appears to deadlock on a dual-core machine, please help

I have a program with several threads: one thread changes a global when it exits, and another thread repeatedly polls that global. There is no protection on the globals.
The program works fine on a uniprocessor. On a dual-core machine it works for a while and then halts on either Sleep(0) or SuspendThread(). Would anyone be able to help me out with this?
The code would be like this:
Thread 1:
// do something ...
while (1)
{
    // .....
    flag_thread1_running = false;        // unsynchronized global flag
    SuspendThread(GetCurrentThread());   // suspend self; thread 2 must resume us
    continue;
}
Thread 2:
flag_thread1_running = true;
ResumeThread(thread1);
// ... do some other work here ...
while (flag_thread1_running)
    Sleep(0);                            // poll the flag, yielding each time
// ...
The fact that you don't see any problem on a uniprocessor machine, but see problems on a multiproc machine is an artifact of the relatively large granularity of thread context switching on a uniprocessor machine. A thread will execute for N amount of time (milliseconds, nanoseconds, whatever) before the thread scheduler switches execution to a different thread. A lot of CPU instructions can execute in the typical thread timeslice. You can think of it as having a fairly large chunk of "free play" exclusive processor time during which you probably won't run into resource collisions because nothing else is executing on the processor.
When running on a multiproc machine, though, CPU instructions in two threads execute exactly at the same time. The size of the "free play" chunk of time is near zero.
To reproduce a resource contention issue between two threads, you need to get thread 1 to be accessing the resource and thread 2 to be accessing the resource at the same time, or very nearly the same time.
In the large-granularity thread switching that takes place on a uniprocessor machine, the chances that a thread switch will happen exactly in the right spot are slim, so the program may never exhibit a failure under normal use on a uniproc machine.
In a multiproc machine, the instructions are executing at the same time in the two threads, so the chances of thread 1 and thread 2 accessing the same resource at the same time are much, much greater - thousands of times more likely than the uniprocessor scenario.
I've seen it happen many times: an app that has been running fine for years on uniproc machines suddenly starts failing all over the place when executed on a new multiproc machine. The cause is a latent threading bug in the original code that simply never hit the right coincidence of timeslicing to repro on the uniproc machines.
When working with multithreaded code, it is absolutely imperative to test the code on multiproc hardware. If you have thread collision issues in your code, they will quickly present themselves on a multiproc machine.
As others have noted, don't use SuspendThread() unless you are a debugger. Use mutexes or other synchronization objects to coordinate between threads.
Try using something more like WaitForSingleObjectEx instead of SuspendThread.
You are hitting a race condition. Thread 2 may execute flag_thread1_running = true; before thread 1 executes flag_thread1_running = false;.
This is not likely to happen on a single CPU, because with the usual scheduling quantum of 10-20 ms you are unlikely to hit the problem. It will still happen there, but very rarely.
Using proper synchronization primitives is a must here. Instead of a bool, use an event. Instead of checking the bool in a loop, use WaitForSingleObject (or WaitForMultipleObjects for more elaborate setups later). A sketch follows.
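A minimal sketch of that change (error handling omitted; names are placeholders):

#include <windows.h>

HANDLE hThread1Done;   // auto-reset event replacing the bool flag

void init() {
    // auto-reset (second arg FALSE), initially unsignaled (third arg FALSE)
    hThread1Done = CreateEvent(NULL, FALSE, FALSE, NULL);
}

void thread1_finished() {      // replaces: flag_thread1_running = false;
    SetEvent(hThread1Done);
}

void thread2_wait() {          // replaces: while (flag_thread1_running) Sleep(0);
    WaitForSingleObject(hThread1Done, INFINITE);   // blocks without spinning
}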
It is possible to synchronize threads using plain variables, but it is rarely a good idea and quite hard to do right - cf. How can I write a lock free structure?. It is definitely not a good idea to perform scheduling using Sleep, Suspend, or Resume.
I guess that you already know that polling a global flag is a "Bad Idea™", so I'll skip that little speech. Try adding volatile to the flag declaration; that should force each read of it to come from memory. Without volatile, the implementation could be keeping the flag in a register and never re-fetching it from memory.
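A minimal sketch of that change (note: volatile only fixes the register-caching issue described above; it is not a synchronization mechanism by itself):

volatile bool flag_thread1_running = false;   // every read now goes to memory

// Thread 2's poll loop re-reads the flag from memory on each iteration:
while (flag_thread1_running)
    Sleep(0);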