C++ Pthreads - Multithreading slower than single-threading [closed] - c++

Closed 8 years ago. This question needs debugging details and is not currently accepting answers.
I am developing a C++ application using the pthreads library. Every thread in the program accesses a common unordered_map. The program runs slower with 4 threads than with 1. I commented out all the code in the thread function and left only the part that tokenizes a string. The single-threaded execution was still faster, so I concluded that the map wasn't the problem.
After that I printed the threads' IDs to the screen, and they seemed to execute sequentially.
In the function that spawns the threads, I have a while loop that creates threads into an array whose size is the number of threads (say 'tn'). Every time tn threads have been created, I run a for loop to join them (pthread_join). The while loop runs many times (not only 4).
What may be wrong?
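For reference, the pattern described sounds roughly like the sketch below. The names, workload, and input handling are my guesses, not the actual code; link with -pthread.

    #include <cstddef>
    #include <pthread.h>
    #include <string>
    #include <vector>

    const int tn = 4;  // number of threads per batch

    // Placeholder worker: only tokenizes a string (the real work is unknown).
    void* worker(void* arg) {
        std::string* input = static_cast<std::string*>(arg);
        // ... tokenize *input ...
        return nullptr;
    }

    int main() {
        std::vector<std::string> inputs(100, "some text to tokenize");
        std::size_t next = 0;
        while (next < inputs.size()) {            // the outer while loop
            pthread_t threads[tn];
            int created = 0;
            for (int i = 0; i < tn && next < inputs.size(); ++i, ++next) {
                pthread_create(&threads[i], nullptr, worker, &inputs[next]);
                ++created;
            }
            for (int i = 0; i < created; ++i)     // join each batch before the next
                pthread_join(threads[i], nullptr);
        }
        return 0;
    }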

If you are running a small, trivial program, this tends to be the case: the work to start the threads, schedule them, run, context-switch, and then synchronize can take more time than running the whole thing single-threaded.
The point here is that for trivial problems multithreading can run slower. BUT another factor is how many cores your CPU actually has.
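A quick sanity check is to time both variants of a deliberately trivial workload; a minimal sketch (std::thread used for brevity, the function names are mine):

    #include <chrono>
    #include <cstdio>
    #include <thread>

    void do_work(int n) {            // trivially small workload
        volatile long sink = 0;
        for (int i = 0; i < n; ++i) sink += i;
    }

    int main() {
        using clock = std::chrono::steady_clock;

        auto t0 = clock::now();
        for (int i = 0; i < 1000; ++i) do_work(1000);   // single-threaded
        auto t1 = clock::now();

        for (int i = 0; i < 1000; ++i) {                // one thread per item:
            std::thread t(do_work, 1000);               // pays create+join each time
            t.join();
        }
        auto t2 = clock::now();

        std::printf("serial:   %lld us\n", (long long)
            std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0).count());
        std::printf("threaded: %lld us\n", (long long)
            std::chrono::duration_cast<std::chrono::microseconds>(t2 - t1).count());
    }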

When you run a multithreaded program on a single core, the threads are processed sequentially, time-sliced by the CPU scheduler.
You only get true parallelism with multiple cores, and in that scenario at most one thread runs per core at any instant.
Now, given that you (most likely) have several threads sharing one core, keep in mind the overhead the CPU incurs for:
allocating a time slice to each thread
synchronizing thread accesses for various internal CPU operations
other thread-priority operations
So, in other words, for a simple application multithreading is actually a downgrade in terms of performance.
Multithreading comes in handy when you need an asynchronous operation, meaning you don't want to wait for a result, such as loading an image from a URL or streaming geometry from an HDD (which is slower than RAM).
In such scenarios, multithreading leads to a better user experience, because your program won't hang when a slow operation occurs.

Without seeing the code it's difficult to tell for sure, but there could be a number of issues.
Your threads might not be doing enough work to justify their creation. Creating and running threads is expensive, so if your workload is too small, they won't pay for themselves.
Execution time could be spent mostly doing memory accesses on the map, in which case mutually excluding the threads means that you aren't really doing much parallel work in practice (Amdahl's Law).

If most of your code runs under a mutex, it will run serially, not in parallel.
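For example, the scope of the lock makes all the difference. A sketch, where shared_map and the tokenizing step stand in for whatever the real code does:

    #include <mutex>
    #include <string>
    #include <unordered_map>

    std::mutex m;
    std::unordered_map<std::string, int> shared_map;

    // Effectively serial: the lock is held across all of the work.
    void worker_serialized(const std::string& text) {
        std::lock_guard<std::mutex> lock(m);
        // ... tokenize text (CPU-heavy) ...
        shared_map["token"]++;               // shared update
    }

    // Mostly parallel: only the shared update is under the lock.
    void worker_parallel(const std::string& text) {
        // ... tokenize text into local results (no lock needed) ...
        std::lock_guard<std::mutex> lock(m);
        shared_map["token"]++;               // brief critical section
    }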

Related

C++ multi threads vs multi processes - which is faster? [closed]

Closed last month. This question needs to be more focused and is not currently accepting answers.
I have around 50 similar tasks; each one performs network calls to some server and writes the response to the DB. Each task has its own set of calls it should make, and each task talks to one server, if that's relevant.
I could have one process with a thread per task, each thread making its own set of calls (so no memory sharing is needed between the threads). Another option is a separate process for each task. Or a combination.
Should I choose multiple threads because switching between processes is more costly?
It runs on Ubuntu.
I don't care about the cost of the creation of the thread/process. I want to start them and let them run for a long time.
Strictly in terms of performance, it comes down to initialization time, because as far as execution time goes, in this particular case, where no interprocess communication is required, execution will take very similar time either way.
Threads are faster to initialize, since there is less duplication of resources to set up.
The much-advertised claim that "threads switch faster than processes", as in this Stack Overflow question/answer, is a preposterous non sequitur here. If a batch is running at 100% CPU and there are enough free cores to meet demand, the OS will never bother to switch threads onto new cores. Switches only happen when threads block waiting for input from (or a chance to output to) the network or disk, in which case the 3-20 microsecond price of the block completely overshadows the few nanoseconds of advantage from one fewer memory-cache miss.
The only reason the operating system would preempt a thread or process, other than an I/O block, is when the machine is running out of computing resources, i.e. when it is demanding more than 100% of every available CPU. But this is such a special, degenerate case that it does not belong in a generic answer.
However, if the running time is much larger than the initialization time, say the processes run for several minutes while startup takes a few milliseconds, then performance will be essentially the same, and your deciding factor should be something else, such as how easy the batch is to administer. Another point to keep in mind: if one process dies, the others keep running, whereas if one thread crashes, the whole process, and with it the entire batch, dies.
There might be some extra benefit to threads from their lighter use of system resources, but I believe it would be negligible.
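In the threads variant that could look roughly like this sketch (the network and DB calls are placeholders, not real APIs):

    #include <thread>
    #include <vector>

    // Hypothetical per-task worker: the call/DB functions are placeholders.
    void run_task(int task_id) {
        while (true) {
            // response = make_network_call(task_id);  // placeholder
            // write_to_db(response);                  // placeholder
            break;  // the real loop would keep going for a long time
        }
    }

    int main() {
        std::vector<std::thread> workers;
        for (int id = 0; id < 50; ++id)        // one long-lived thread per task
            workers.emplace_back(run_task, id);
        for (auto& t : workers)                // let them run, then wait
            t.join();
    }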

How does thread waiting affect the execution time of the program?

In my C++ program, I am using boost libraries for parallel programming. Several threads are made to join() on other threads in a part of the program.
The program runs pretty slowly for some inputs... In an attempt to improve it, I tried finding hotspots with Intel VTune. The most time-consuming hotspot is shown to occur due to boost::this_thread::interruptible_wait.
When I checked the portion of the source code where this hotspot occurs, it points to the call to join(). I was under the impression that waiting threads do not take CPU time. Can someone help me understand why the join() operation takes up so much CPU time?
Any insights on how to fix such a hotspot would be very helpful too! One fix I can think of is to somehow detach() the threads instead of join()ing them.
Thanks in advance!
I was under the impression that waiting threads do not take CPU Time
It really depends on how the threads wait. They may be busy waiting (i.e. spinning) to react as quickly as possible to whatever they are waiting for. The alternative of yielding execution after every check means potentially higher delays from operating system scheduling (and thread switching overhead).
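A minimal sketch of the two kinds of waiting (my own example, not the asker's code): the spinning version shows up as CPU time in a profiler, the condition-variable version does not.

    #include <atomic>
    #include <chrono>
    #include <condition_variable>
    #include <mutex>
    #include <thread>

    std::atomic<bool> ready{false};

    // Busy wait: reported as CPU time even though the thread is "just waiting".
    void spin_wait() {
        while (!ready.load()) { /* burn cycles */ }
    }

    std::mutex m;
    std::condition_variable cv;
    bool ready2 = false;

    // Blocking wait: the thread sleeps in the kernel and uses almost no CPU.
    void blocking_wait() {
        std::unique_lock<std::mutex> lock(m);
        cv.wait(lock, [] { return ready2; });
    }

    int main() {
        std::thread a(spin_wait), b(blocking_wait);
        std::this_thread::sleep_for(std::chrono::milliseconds(100));
        ready = true;                                         // release the spinner
        { std::lock_guard<std::mutex> lk(m); ready2 = true; }
        cv.notify_one();                                      // release the blocked waiter
        a.join(); b.join();
    }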
VTune will mercilessly pick up on all your threading-library overhead; you will need to filter appropriately to figure out where your serial hotspots are and whether your parallelization has mitigated them.
If your threads spend a significant amount of time waiting on the join, your parallel section is probably not well-balanced. Without more information on your problem it's hard to tell what the reason is or how to mitigate it, but you should probably try to distribute the work more evenly.
On another note, the recent Spectre/Meltdown fixes appear to have increased VTune's profiling overhead. I would be careful about taking the results at face value (does your program run in close to the same amount of time with and without profiling?).
Edit: Related material here and here. Following the instructions in the linked page for disabling the kernel protections helped in my case, although I have not tested it on the latest VTune update.

General OpenMP - thread speeds are different

I have an OpenMP program where a for loop is parallelised.
Everything works as it should, except that the master thread is many, many times faster than the rest of the threads... For example, when running with 4 threads, thread 0 finishes long before the others, even though they all execute the same code with almost the same amount of work.
Can this be caused by resource handling in Windows, swapping tasks in and out of the threads used by the program and causing the slowdown?
Or is it more likely that my code is the problem? I just want to make sure I don't waste time looking for an error in my program if this is an unavoidable problem caused by the OS...
As for why the master thread seems to get priority, it could be an interaction between the OpenMP runtime and the OS. Which compiler are you using? How are you measuring when the threads terminate?
To improve the performance of your OpenMP parallel for in this case, I would use the dynamic scheduling policy via the schedule clause. If the master thread is getting more cycles from the CPU, it will then simply pick up more of the work. In general you can't count on every thread being equally fast, but if you are observing order-of-magnitude differences, it sounds like a bad clash between the runtime and the OS.
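A sketch of what that looks like (the loop body is a stand-in; compile with OpenMP enabled, e.g. -fopenmp):

    #include <cmath>

    double result[1000];

    void process(int i) { result[i] = std::sqrt(double(i)); }  // stand-in work

    int main() {
        // schedule(dynamic) hands out iterations in chunks as threads become
        // free, so one slow thread does not hold the others back.
        #pragma omp parallel for schedule(dynamic, 16)
        for (int i = 0; i < 1000; ++i)
            process(i);
    }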
Thread execution depends on many things, so there are many possibilities: whether any locking mechanism is involved, and resource availability, since a thread must release its resources when it completes its job, among other factors. What I suggest is to profile your code with a tool such as VTune; it will give you a clear idea of where your threads are wasting time and why. I hope that helps.

Parallel Thread Execution to achieve performance

I am a little bit confused about multithreading. We usually create multiple threads to break the main task into subtasks, to achieve responsiveness and eliminate waiting time.
But here I have a situation where I have to execute the same task in parallel using multiple threads.
My processor can execute 4 threads in parallel, so will it improve performance if I create more than 4 threads (10 or more)? When I put this question to a colleague, he said nothing bad will happen; we already run many threads in many other applications (browser threads, kernel threads, etc.), so he suggested creating multiple threads for the same task.
But won't creating more than 4 supposedly parallel threads just cause more context switches and decrease performance?
Or, even though we create multiple threads to execute in parallel, will they simply execute one after the other, leaving performance the same?
So what should I do in this situation, and is my reasoning correct?
Edit:
1 thread: time to process is 120 seconds.
2 threads: time to process is about 60 seconds.
3 threads: time to process is still about 60 seconds (no change from 2 threads).
Is it because my hardware can only run 2 threads concurrently (being dual-core)?
A software thread is a piece of code; a hardware thread is a core (processor) that runs a software thread.
So my CPU supports only 2 concurrent threads. If I buy an AMD CPU with 8 or 12 cores, can I achieve higher performance?
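You can ask the standard library how many threads the hardware can actually run at once; a quick check:

    #include <iostream>
    #include <thread>

    int main() {
        // Number of concurrent threads the hardware supports (0 if unknown).
        std::cout << std::thread::hardware_concurrency() << "\n";
    }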
Multitasking is pretty complex, and performance gains usually depend a lot on the problem itself:
Only part of the application can be run in parallel (there is always a first part that splits the work into multiple tasks). So the first question is: how much of the work can be done in parallel, and how much of it needs synchronization? (In some cases you can stop here, because so little can be done in parallel that the whole effort isn't worth it.)
Multiple tasks may depend on each other (one task may need the result of another). Such tasks cannot be executed in parallel.
Multiple tasks may work on the same data/resources (read/write situations). Here we need to synchronize access to that data/those resources. If all tasks need write access to the same object during the WHOLE process, then we cannot work in parallel at all.
Basically, without an exact definition of the problem (dependencies between tasks, dependencies on data, number of parallel tasks, ...) it's very hard to tell how much performance you'll gain from multiple threads (and whether it's really worth it).
http://en.wikipedia.org/wiki/Amdahl%27s_law
Amdahl's Law states, in a nutshell, that the performance boost you get from parallel execution is limited by the portion of your code that must run sequentially.
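In symbols, with p the fraction of the work that can run in parallel and n the number of threads:

    speedup(n) = 1 / ((1 - p) + p / n)

    e.g. p = 0.9, n = 4:  speedup = 1 / (0.1 + 0.9/4) = 1 / 0.325 ≈ 3.1
    and even as n -> infinity:  speedup -> 1 / (1 - p) = 10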
Without knowing your problem space here are some general things you should look at:
Refactor to eliminate mutex/locks. By definition they force code to run sequentially.
Reduce context-switch overhead by pinning threads to physical cores (see the sketch after this list). This gets more complicated when threads must wait for work (i.e. block on I/O), but in general you want to keep each core as busy as possible running your program, not switching between threads.
Unless you absolutely need to use raw threads and sync primitives, try using a task scheduler or a parallel-algorithms library to parallelize your work. Examples are Intel TBB, Thrust, or Apple's libdispatch.
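For the pinning point above, a Linux-specific sketch (pthread_setaffinity_np is a non-portable glibc extension; the helper name is mine):

    #include <pthread.h>
    #include <sched.h>

    // Pin the calling thread to one core. Linux/glibc only: the _np suffix
    // means "non-portable"; g++ on glibc defines _GNU_SOURCE by default.
    void pin_to_core(int core) {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(core, &set);
        pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
    }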

Benefits of a multi thread program in a unicore system [duplicate]

This question already has answers here:
How can multithreading speed up an application (when threads can't run concurrently)?
(9 answers)
Closed 9 years ago.
My professor casually mentioned that we should write multithreaded programs even when using a unicore processor; however, for lack of time, he did not elaborate on it.
I would like to know: what are the benefits of a multithreaded program on a unicore processor?
It won't be as significant as on a multi-core system, but it can still provide some benefits.
Mainly, the benefits come from the context switch that happens when the currently executing thread stalls. The executing thread may be waiting for something: a hardware resource, a branch-misprediction penalty, or a data transfer after a cache miss.
At that point a waiting thread can be executed to make use of this "waiting time". Of course, the context switch itself takes some time, and managing threads in code rather than computing sequentially adds complexity to your program. And, as has been said, some applications simply need to be multithreaded, so in some cases there is no escaping the context switch.
Some applications need to be multi-threaded. Multi-threading isn't just about improving performance by using more cores, it's also about performing multiple tasks at once.
Take Skype for example: the GUI needs to be able to accept the text you're entering, display it on the screen, listen for new messages coming from the user you're talking to, and display them. This wouldn't be a trivial task in a single-threaded application.
Even if there's only one core available, the OS thread scheduler will give you the illusion of parallelism.
Usually it is about not blocking. Running many threads on a single core still gives the illusion of concurrency. So you can have, say, a thread doing I/O while another handles user interaction. The user-interaction thread is not blocked while the other does I/O, so the user is free to carry on interacting.
The benefits can vary.
One widely used example is a GUI application that also performs some kind of computation. With a single thread, the user has to wait for the result before doing anything else with the application; if you run the computation in a separate thread, the user interface stays available during the computation. So a multithreaded program can emulate a multitasking environment even on a unicore system. That's one of the points.
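A bare-bones sketch of that idea (a stand-in computation instead of a real GUI loop):

    #include <atomic>
    #include <chrono>
    #include <iostream>
    #include <thread>

    std::atomic<bool> done{false};

    void long_computation() {    // stand-in for the heavy work
        std::this_thread::sleep_for(std::chrono::seconds(2));
        done = true;
    }

    int main() {
        std::thread worker(long_computation);
        while (!done) {
            // handle input, repaint, etc. -- the "UI" loop stays responsive
            std::this_thread::sleep_for(std::chrono::milliseconds(16));
        }
        worker.join();
        std::cout << "result ready\n";
    }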
As others have already mentioned, not blocking is one application. Another is separating the logic of unrelated tasks that are to be executed simultaneously. Using threads for that leaves the scheduling of these tasks to the OS.
However, note that it may also be possible to implement similar behavior using asynchronous operations in a single thread. Futures and boost::asio provide ways of doing non-blocking work without necessarily resorting to multiple threads.
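For instance, a std::future need not spawn a thread at all; with std::launch::deferred the task runs lazily on the calling thread (a minimal sketch):

    #include <future>
    #include <iostream>

    int slow_sum() { return 1 + 2 + 3; }   // stand-in for slow work

    int main() {
        // std::launch::deferred creates no thread: slow_sum() runs lazily,
        // on this thread, at the moment get() is called.
        std::future<int> f = std::async(std::launch::deferred, slow_sum);
        // ... do other things first ...
        std::cout << f.get() << "\n";
    }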
I think it depends a bit on how exactly you design your threads and what logic actually lives in each thread. Some benefits you get even on a single core:
A thread can wrap a blocking or long-running call that you can't circumvent otherwise. For some operations there are polling mechanisms, but not for all.
A thread can wrap an almost standalone part of your application that has virtually no interaction with other code, for example background polling for updates, monitoring some resource (e.g. free storage), or checking internet connectivity. Kept in a separate thread, that code stays relatively simple in its own 'runtime' without you caring too much about its impact on the main program; the sole communication with the main logic is usually a single 'event'.
In some environments you might even get more processing time. This mainly depends on how your OS's scheduler works, but if it allocates time per thread, then the more threads you have, the more often your app gets scheduled.
Some long-term benefits:
Where it's not hard to do, you benefit when your hardware evolves. You never know what's going to happen: today your app runs on a single-core embedded device, tomorrow that device gets a quad core. Programming with threads from the beginning improves your future scalability.
One example is an environment where you can deterministically assign work to a thread, e.g. based on some hash, so all related operations end up in the same thread. The advantage on a single core is small, but it's not hard to do, since you need few synchronization primitives, so the overhead stays small.
That said, I think there are situations where it's very ill-advised:
As soon as the synchronization you need with other threads becomes complex (e.g. multiple locks, lots of critical sections, ...). It may still be that multithreading pays off when you effectively move to multiple CPUs, but the overhead is huge, both for your single core and for your programming time.
For instance, think about operations that block because of slow peripheral devices (hard disk access, etc.). While these are waiting, even a single core can do other things asynchronously.
In a lot of applications the bottleneck is not CPU processing power. So while the program flow is waiting for I/O requests to complete (user input, network/disk I/O), for critical resources to become available, or for any sort of asynchronously triggered event, the CPU can be scheduled to do other work instead of just blocking.
In this case you don't necessarily need multiple threads that actually run in parallel. Cooperative multitasking concepts like asynchronous I/O, coroutines, or fibers come to mind.
If, however, the application's bottleneck is CPU processing power (constantly 100% CPU usage), then it makes sense to increase the number of CPUs available to the application. At that point it is easier to scale the application up to more CPUs if it was designed to run in parallel from the start.
As far as I can see, one answer has not been given yet:
You will have to write multithreaded applications in the future!
If the trend continues, the average number of cores will keep doubling every 18 months or so. People have practiced single-threaded programming for 50 years and are now confronted with devices that have multiple cores. The programming style in a multithreaded environment differs significantly from single-threaded programming, in low-level aspects like avoiding race conditions and synchronizing properly, as well as in high-level aspects like overall algorithm design.
So in addition to the points already mentioned, it's also about writing future-proof software, about scalability, and about developing the skills required to achieve those goals.