Reduce Context Switches Between Threads With Same Priority - c++

I am writing an application that use a third-party library to perform heavy computations.
This library implements parallelism internally and spawn given number threads. I want to run several (dynamic count) instances of this library and therefore end up with quite heavily oversubscribing the cpu.
Is there any way I can increase the "time quantum" of all the threads in a process so that e.g. all the threads with normal priority rarely context switch (yield) unless they are explicitly yielded through e.g. semaphores?
That way I could possibly avoid most of the performance overhead of oversubscribing the cpu. Note that in this case I don't care if a thread is starved for a few seconds.
EDIT:
One complicated way of doing this is to perform thread scheduling manually.
Enumerate all the threads with a specific priority (e.g. normal).
Suspend all of them.
Create a loop which resumes/suspends the threads every e.g. 40 ms and makes sure no mor threads than the current cpu count is run.
Any major drawbacks with this approach? Not sure what the overhead of resume/suspending a thread is?

There is nothing special you need to do. Any decent scheduler will not allow unforced context switches to consume a significant fraction of CPU resources. Any operating system that doesn't have a decent scheduler should not be used.
The performance overhead of oversubscribing the CPU is not the cost of unforced context switches. Why? Because the scheduler can simply avoid those. The scheduler only performs an unforced context switch when that has a benefit. The performance costs are:
It can take longer to finish a job because more work will be done on other jobs between when the job is started and when the job finishes.
Additional threads consume memory for their stacks and related other tracking information.
More threads generally means more contention (for example, when memory is allocated) which can mean more forced context switches where a thread has to be switched out because it can't make forward progress.
You only want to try to change the scheduler's behavior when you know something significant that the scheduler doesn't know. There is nothing like that going on here. So the default behavior is what you want.

Any major drawbacks with this approach? Not sure what the overhead of
resume/suspending a thread is?
Yes,resume/suspend the thread is very very dangerous activity done in user mode of program. So it should not be used(almost never). Moreover we should not use these concepts to achieve something which any modern scheduler does for us. This too is mentioned in other post of this question.
The above is applicable for any operating system, but from SO post tag it appears to me that it has been asked for Microsoft Windows based system. Now if we read about the SuspendThread() from MSDN, we get the following:
"This function is primarily designed for use by debuggers. It is not intended to be used for thread synchronization. Calling SuspendThread on a thread that owns a synchronization object, such as a mutex or critical section, can lead to a deadlock if the calling thread tries to obtain a synchronization object owned by a suspended thread".
So consider the scenario in which thread has acquired some resource(implicitly .i.e. part of not code..by library or kernel mode), and if we suspend the thread this would result into mysterious deadlock situation as other threads of that process would be waiting for that particular resource. The fact is we are not sure(at any time) in our program that what sort of resources are acquired by any running thread, suspend/resume thread is not good idea.

Related

Futex throughput on Linux

I have an async API which wraps some IO library. The library uses C style callbacks, the API is C++, so natural choice (IMHO) was to use std::future/std::promise to build this API. Something like std::future<void> Read(uint64_t addr, byte* buff, uint64_t buffSize). However, when I was testing the implementation I saw that the bottleneck is the future/promise, more precisely, the futex used to implement promise/future. Since the futex, AFAIK, is user space and the fastest mechanism I know to sync two threads, I just switched to use raw futexes, which somewhat improved the situation, but not something drastic. The performance floating somewhere around 200k futex WAKEs per second. Then I stumbled upon this article - Futex Scaling for Multi-core Systems which quite matches the effect I observe with futexes. My questions is, since the futex too slow for me, what is the fastest mechanism on Linux I can use to wake the waiting side. I dont need anything more sophisticated than binary semaphore, just to signal IO operation completion. Since IO operations are very fast (tens of microseconds) switching to kernel mode not an option. Busy wait not an option too, since CPU time is precious in my case.
Bottom line, user space, simple synchronization primitive, shared between two threads only, only one thread sets the completion, only one thread waits for completion.
EDIT001:
What if... Previously I said, no spinning in busy wait. But futex already spins in busy wait, right? But the implementation covers more general case, which requests global hash table, to hold the futexes, queues for all subscribers etc. Is it a good idea to mimic same behavior on some simple entity (like int), no locks, no atomics, no global datastructures and busy wait on it like futex already does?
In my experience, the bottleneck is due to linux's poor support for IPC. This probably isn't a multicore scaling issue, unless you have a large number of threads.
When one thread wakes another (by futex or any other mechanism), the system tries to run the 'wakee' thread immediately. But the waker thread is still running and using a core, so the system will usually put the wakee thread on a different core. If that core was previously idle, then the system will have to wake the core up from a power-down state, which takes some time. Any data shared between the threads must now be transferred between the cores.
Then, the waker thread will usually wait for a response from the wakee (it sounds like this is what you are doing). So it immediately goes to sleep, and puts its core to idle.
Then a similar thing happens again when the response comes. The continuous CPU wakes and migrations cause the slowdown. You may well discover that if you launch many instances of your process simultaneously, so that all your cores are busy, you see increased performance as the CPUs no longer have to wake up, and the threads may stop migrating between cores. You can get a similar performance increase if you pin the two threads to one core - it will do more than 1 million 'pings'/sec in this case.
So isn't there a way of saying 'put this thread to sleep and then wake that one'? Then the OS could run the wakee on the same core as the waiter? Well, Google proposed a solution to this with a FUTEX_SWAP api that does exactly this, but has yet to be accepted into the linux kernel. The focus now seems to be on user-space thread control via User Managed Concurrency Groups which will hopefully be able to do something similar. However at the time of writing this is yet to be merged into the kernel.
Without these changes to the kernel, as far as I can tell there is no way around this problem. 'You are on the fastest route'! UNIX sockets, TCP loopback, pipes all suffer from the same issue. Futexes have the lowest overhead, which is why they go faster than the others. (with TCP you get about 100k pings per sec, about half the speed of a futex impl). Fixing this issue in a general way would benefit a lot of applications/deployments - anything that uses connections to localhost could benefit.
(I did try a DIY approach where the waker thread pins the wakee thread to the same core that the waker is on, but if you don't want to to pin the waker, then every time you post the futex you need to pin the wakee to the current thread, and the system call to do this has too much overhead)

Ensure that each thread gets a chance to execute in a given time period using C++11 threads

Suppose I have a multi-threaded program in C++11, in which each thread controls the behavior of something displayed to the user.
I want to ensure that for every time period T during which one of the threads of the given program have run, each thread gets a chance to execute for at least time t, so that the display looks as if all threads are executing simultaneously. The idea is to have a mechanism for round robin scheduling with time sharing based on some information stored in the thread, forcing a thread to wait after its time slice is over, instead of relying on the operating system scheduler.
Preferably, I would also like to ensure that each thread is scheduled in real time.
In case there is no way other than relying on the operating system, is there any solution for Linux?
Is it possible to do this? How?
No that's not cross-platform possible with C++11 threads. How often and how long a thread is called isn't up to the application. It's up to the operating system you're using.
However, there are still functions with which you can flag the os that a special thread/process is really important and so you can influence this time fuzzy for your purposes.
You can acquire the platform dependent thread handle to use OS functions.
native_handle_type std::thread::native_handle //(since C++11)
Returns the implementation defined underlying thread handle.
I just want to claim again, this requires a implementation which is different for each platform!
Microsoft Windows
According to the Microsoft documentation:
SetThreadPriority function
Sets the priority value for the specified thread. This value, together
with the priority class of the thread's process determines the
thread's base priority level.
Linux/Unix
For Linux things are more difficult because there are different systems how threads can be scheduled. Under Microsoft Windows it's using a priority system but on Linux this doesn't seem to be the default scheduling.
For more information, please take a look on this stackoverflow question(Should be the same for std::thread because of this).
I want to ensure that for every time period T during which one of the threads of the given program have run, each thread gets a chance to execute for at least time t, so that the display looks as if all threads are executing simultaneously.
You are using threads to make it seem as though different tasks are executing simultaneously. That is not recommended for the reasons stated in Arthur's answer, to which I really can't add anything.
If instead of having long living threads each doing its own task you can have a single queue of tasks that can be executed without mutual exclusion - you can have a queue of tasks and a thread pool dequeuing and executing tasks.
If you cannot, you might want to look into wait free data structures and algorithms. In a wait free algorithm/data structure, every thread is guaranteed to complete its work in a finite (and even specified) number of steps. I can recommend the book The Art of Multiprocessor Programming where this topic is discussed in length. The gist of it is: every lock free algorithm/data structure can be modified to be wait free by adding communication between threads over which a thread that's about to do work makes sure that no other thread is starved/stalled. Basically, prefer fairness over total throughput of all threads. In my experience this is usually not a good compromise.

How do I automatically add threads to a pool based on the computational needs of the program?

We have a C++ program which, depending on the way the user configures it, may be CPU bound or IO bound. For the purpose of loose coupling with the program configuration, I'd like to have my thread pool automatically realize when the program would benefit from more threads (i.e. CPU bound). It would be nice if it realized when it was I/O bound and reduced the number of workers, but that would just be a bonus (i.e. I'd be happy with something that just automatically grows without automatic shrinkage).
We use Boost so if there's something there that would help we can use it. I realize that any solution would probably be platform specific, so we're mainly interested in Windows and Linux, with a tertiary interest in OS X or any other *nix.
Short answer: use distinct fixed-size thread pools for CPU intensive operations and for IOs. In addition to the pool sizes, further regulation of the number of active threads will be done by the bounded-buffer (Producer/Consumer) that synchronizes the computer and IO steps of your workflow.
For compute- and data-intensive problems where the bottlenecks are a moving target between different resources (e.g. CPU vs IO), it can be useful to make a clear distinction between a thread and a thread, particularly, as a first approximation:
A thread that is created to use more CPU cycles ("CPU thread")
A thread that is created to handle an asynchronous IO operation ("IO thread")
More generally, threads should be segregated by the type of resources that they need. The aim should be to ensure that a single thread doesn't use more than one resource (e.g. avoiding switching between reading data and processing data in the same thread). When a tread uses more than one resource, it should be split and the two resulting threads should be synchronized through a bounded-buffer.
Typically there should be exactly as many CPU threads as needed to saturate the instruction pipelines of all the cores available on the system. To ensure that, simply have a "CPU thread pool" with exactly that many threads that are dedicated to computational work only. That would be boost:: or std::thread::hardware_concurrency() if that can be trusted. When the application needs less, there will simply be unused threads in the CPU thread pool. When it needs more, the work is queued. Instead of a "CPU thread pool", you could use c++11 std::async but you would need to implement a thread throttling mechanism with your selection of synchronization tools (e.g. a counting semaphore).
In addition to the "CPU thread pool", there can be another thread pool (or several other thread pools) dedicated to asynchronous IO operations. In your case, it seems that IO resource contention is potentially a concern. If that's the case (e.g. a local hard drive) the maximum number of threads should be carefully controlled (e.g. at most 2 read and 2 write threads on a local hard drive). This is conceptually the same as with CPU threads and you should have one fixed size thread pool for reading and another one for writing. Unfortunately, there will probably not be any good primitive available to decide on the size of these thread pools (measuring might be simple though, if your IO patterns are very regular). If resource contention is not an issue (e.g. NAS or small HTTP requests) then boost::asio or c++11 std::async would probably be a better option than a thread pool; in which case, thread throttling can be entirely left to the bounded-buffers.

Multi-threaded Event Dispatching

I am developing a C++ application that will use Lua scripts for external add-ons. The add-ons are entirely event-driven; handlers are registered with the host application when the script is loaded, and the host calls the handlers as the events occur.
What I want to do is to have each Lua script running in its own thread, to prevent scripts from locking up the host application. My current intention is to spin off a new thread to execute the Lua code, and allow the thread to terminate on its own once the code has completed. What are the potential pitfalls of spinning off a new thread as a form of multi-threaded event dispatching?
Here are a few:
Unless you take some steps to that effect, you are not in control of the lifetime of the threads (they can stay running indefinitely) or the resources they consume (CPU, etc)
Messaging between threads and synchronized access to commonly accessible data will be harder to implement
If you are expecting a large number of add-ons, the overhead of creating a thread for each one might be too great
Generally speaking, giving event-driven APIs a new thread to run on strikes me as a bad decision. Why have threads running when they don't have anything to do until an event is raised? Consider spawning one thread for all add-ons, and managing all event propagation from that thread. It will be massively easier to implement and when the bugs come, you will have a fighting chance.
Creating a new thread and destroying it frequently is not really a good idea. For one, you should have a way to bound this so that it doesn't consume too much memory (think stack space, for example), or get to the point where lots of pre-emption happens because the threads are competing for time on the CPU. Second, you will waste a lot of work associated with creating new threads and tearing them down. (This depends on your operating system. Some OSs might have cheap thread creation and others might have that be expensive.)
It sounds like what you are seeking to implement is a work queue. I couldn't find a good Wikipedia article on this but this comes close: Thread pool pattern.
One could go on for hours talking about how to implement this, and different concurrent queue algorithms that can be used. But the idea is that you create N threads which will drain a queue, and do some work in response to items being enqueued. Typically you'll also want the threads to, say, wait on a semaphore while there are no items in the queue -- the worker threads decrement this semaphore and the enqueuer will increment it. To prevent enqueuers from enqueueing too much while worker threads are busy and hence taking up too much resources, you can also have them wait on a "number of queue slots available" semaphore, which the enqueuer decrements and the worker thread increments. These are just examples, the details are up to you. You'll also want a way to tell the threads to stop waiting for work.
My 2 cents: depending on the number and rate of events generated by the host application, the main problem I can see is in term of performances. Creating and destroyng thread has a cost [performance-wise] I'm assuming that each thread once spawned do not need to share any resource with the other threads, so there is no contention.
If all threads are assigned on a single core of your CPU and there is no load balancing, you can easily overload one CPU and have the others [on a multcore system] unloaded. I'll consider some thread affinity + load balancing policy.
Other problem could be in term of resource [read memory] How much memory each LUA thread will consume?
Be very careful to memory leaks in the LUA threads as well: if events are frequent and threads are created/destroyed frequently leaving leacked memory, you can consume your host memory quite soon ;)

Impact of hundreds of idle threads

I am considering the use of potentially hundreds of threads to implement tasks that manage devices over a network.
This is a C++ application running on a powerpc processor with a linux kernel.
After an initial phase when each task does synchronization to copy data from the device into the task, the task becomes idle, and only wakes up when it receives an alarm, or needs to change some data (configuration), which is rare after the start phase. Once all tasks reach the "idle" phase, I expect that only a few per second will need to wake.
So, my main concern is, if I have hundreds of threads will they have a negative impact on the system once they become idle?
Thanks.
amso
edit:
I'm updating the question based on the answers that I got. Thanks guys.
So it seems that having a ton of threads idling (IO blocked, waiting, sleeping, etc), per se , will not have an impact on the system in terms of responsiveness.
Of course, they will spend extra money for each thread's stack and TLS data but that's okay as long as we throw more memory at the thing (making it more €€€)
But then, other issues have to be accounted for. Having 100s of threads waiting will likely increase memory usage on the kernel, due to the need of wait queues or other similar resources. There's also a latency issue, which looks non-deterministic. To check the responsiveness and memory usage of each solution one should measure it and compare.
Finally, the whole idea of hundreds of threads that will be mostly idling may be modeled like a thread pool. This reduces a bit of code linearity but dramatically increases the scalability of the solution and with propper care can be easily tunable to adjust the compromise between performance and resource usage.
I think that's all. Thanks everyone for their input.
--
amso
Each thread has overhead - most importantly each one has its own stack and TLS. Performance is not that much of a problem since they will not get any time slices unless they actually do anything. You may still want to consider using thread pools.
Chiefly they will use up address space and memory for stacks; once you get, say, 1000 threads, this gets quite significant as I've seen that 10M per thread is typical for stacks (on x86_64). It is changable, but only with care.
If you have a 32-bit processor, address space will be the main limitation once you hit 1000s of threads, you can easily exhaust the AS.
They use up some kernel memory, but probably not as much as userspace.
Edit: of course threads share address space with each other only if they are in the same process; I am assuming that they are.
I'm not a Linux hacker, but assuming that Linux's thread scheduling is similar to Windows'...
Yes, of course the will be some impact. Every bit of memory you consume will potentially have some impact.
However, in a time-sliced environment, threads that are in a Wait/Sleep/Join state will not consume CPU cycles until they are awoken.
I would be worried about offering 1:1 thread-connections mappings, if nothing else because it leaves you rather exposed to denial of service attacks. (pthread_create() is a fairly expensive operation compared to just a call to accept())
EboMike has already answered the question directly - provided threads are blocked and not busy-waiting then they won't consume much in the way of resources although they will occupy memory and swap for all the per-thread state.
I'm learning the basics of the kernel now. I can't give you a specific answer yet; I'm still a noob... but here are some things for you to chew on.
Linux implements each POSIX thread as a unique process. This will create overhead as others have mentioned. In addition to this, your waiting model appears flawed any way you do it. If you create one conditional variable for each thread, then I think (based off of my interpretation of the website below) that you'll actually be expending a lot of kernel memory, as each thread would be placed into its own wait queue. If instead you break your threads up for each group of X threads to share a conditional variable, then you've got problems as well because every time the variable signals, you must wake up _EVERY_DARN_PROCESS_ in that variable's wait queue.
I also assume that you will need to do some object sharing an synchronization. In this case, your code may get slower because of the need to wake up all processes waiting on a resource, as I mentioned earlier.
I know this wasn't much help, but as I said, I'm a kernel noob. Hope it helped a little.
http://book.chinaunix.net/special/ebook/PrenticeHall/PrenticeHallPTRTheLinuxKernelPrimer/0131181637/ch03lev1sec7.html
I'm not sure what "device" you are talking about, but if it's a file descriptor, I'd suggest that you look at starting to migrate to using either poll or epoll (Id suggest the latter given the description of how active you expect each file descriptor to be). That way, you could use one process which would be responsible for all the fds.