A synchronization primitive with increased owner thread priority

I have a program where bursts sometimes happen in which the threads would load the CPU above 100% if that were possible; in reality, they fight for the CPU. It is critical that a thread obtaining ownership of a synchronization primitive gets a higher priority than the other threads of the application, to prevent the case where a thread obtains ownership and is then paused by the scheduler. Is there a suitable synchronization primitive in C++ (up to the latest draft) or WinAPI, or do I have to wrap the mutex locking code in SetThreadPriority() calls?

This isn't actually a problem. If a thread that owns a synchronization primitive gets paused by the scheduler, it would only be because there were enough ready-to-run threads to keep all the cores busy. In that case, there's no particular reason to care which thread runs.
Threads that are waiting for the synchronization primitive aren't ready to run. So if you have four cores and the thread that holds the synchronization primitive isn't running, it would only be because there are four threads, all ready-to-run, that can make forward progress without holding the synchronization primitive. In that case, running those four threads is just as good as running the thread that holds the synchronization primitive.
I strongly urge you not to mess with thread priorities unless you really have no choice. Once you start messing with thread priorities, the argument above can stop holding because you can get issues like priority inversion. But if you don't mess with thread priorities, then you can't run into those kinds of issues and the scheduler will be smart enough to do the right thing 99% of the time. And trying to mess with priorities to get it to do the right thing that last 1% of the time will likely backfire.

The mechanism you are looking for is called a priority inheritance protocol. Pthreads offers support for this sort of configuration, and the idea is that if a high priority task is waiting for a resource held by a low priority task, the low priority task is boosted to that high priority until it relinquishes the resource.
Search for Liu and Layland, they wrote most of this up in the early 70s. As for C++, I am afraid it is a few versions away from 1973's state of the art.
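For what it's worth, here is a minimal sketch of how the Pthreads configuration mentioned above could be wrapped for use from C++. It is POSIX only, the class name PiMutex is made up for this example, and how much the protocol actually changes scheduling depends on the platform and on the scheduling policy the threads run under.

```cpp
#include <cassert>
#include <mutex>
#include <pthread.h>
#include <stdexcept>

// Hypothetical wrapper: a pthread mutex created with the priority
// inheritance protocol, usable with std::lock_guard / std::unique_lock.
class PiMutex {
public:
    PiMutex() {
        pthread_mutexattr_t attr;
        if (pthread_mutexattr_init(&attr) != 0)
            throw std::runtime_error("pthread_mutexattr_init failed");
        // Request priority inheritance: while a higher-priority thread is
        // blocked on this mutex, the owner is boosted to that priority.
        int rc = pthread_mutexattr_setprotocol(&attr, PTHREAD_PRIO_INHERIT);
        if (rc == 0) rc = pthread_mutex_init(&mutex_, &attr);
        pthread_mutexattr_destroy(&attr);
        if (rc != 0)
            throw std::runtime_error("priority inheritance mutex not available");
    }
    ~PiMutex() { pthread_mutex_destroy(&mutex_); }
    PiMutex(const PiMutex&) = delete;
    PiMutex& operator=(const PiMutex&) = delete;

    void lock() {
        int rc = pthread_mutex_lock(&mutex_);
        assert(rc == 0);
        (void)rc;
    }
    void unlock() {
        int rc = pthread_mutex_unlock(&mutex_);
        assert(rc == 0);
        (void)rc;
    }
    bool try_lock() { return pthread_mutex_trylock(&mutex_) == 0; }

private:
    pthread_mutex_t mutex_;
};
```

Because it has lock(), unlock() and try_lock(), it can be dropped in wherever a std::mutex is used with std::lock_guard<PiMutex>.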

Related

Reduce Context Switches Between Threads With Same Priority

I am writing an application that uses a third-party library to perform heavy computations.
This library implements parallelism internally and spawns a given number of threads. I want to run several (dynamic count) instances of this library and therefore end up quite heavily oversubscribing the CPU.
Is there any way I can increase the "time quantum" of all the threads in a process so that e.g. all the threads with normal priority rarely context switch (yield) unless they are explicitly yielded through e.g. semaphores?
That way I could possibly avoid most of the performance overhead of oversubscribing the cpu. Note that in this case I don't care if a thread is starved for a few seconds.
EDIT:
One complicated way of doing this is to perform thread scheduling manually.
Enumerate all the threads with a specific priority (e.g. normal).
Suspend all of them.
Create a loop which resumes/suspends the threads every e.g. 40 ms and makes sure no more threads than the current CPU count are running.
Any major drawbacks with this approach? What is the overhead of suspending and resuming a thread?

There is nothing special you need to do. Any decent scheduler will not allow unforced context switches to consume a significant fraction of CPU resources. Any operating system that doesn't have a decent scheduler should not be used.
The performance overhead of oversubscribing the CPU is not the cost of unforced context switches. Why? Because the scheduler can simply avoid those. The scheduler only performs an unforced context switch when that has a benefit. The performance costs are:
It can take longer to finish a job because more work will be done on other jobs between when the job is started and when the job finishes.
Additional threads consume memory for their stacks and other related tracking information.
More threads generally means more contention (for example, when memory is allocated) which can mean more forced context switches where a thread has to be switched out because it can't make forward progress.
You only want to try to change the scheduler's behavior when you know something significant that the scheduler doesn't know. There is nothing like that going on here. So the default behavior is what you want.

Any major drawbacks with this approach? Not sure what the overhead of
resume/suspending a thread is?
Yes. Suspending and resuming threads from user-mode code is very dangerous and should almost never be done. Moreover, we should not use these mechanisms to achieve something that any modern scheduler already does for us, as the other answer to this question also points out.
The above applies to any operating system, but from the tags on the question it appears to be about Microsoft Windows. The MSDN documentation for SuspendThread() says the following:
"This function is primarily designed for use by debuggers. It is not intended to be used for thread synchronization. Calling SuspendThread on a thread that owns a synchronization object, such as a mutex or critical section, can lead to a deadlock if the calling thread tries to obtain a synchronization object owned by a suspended thread".
So consider a scenario in which a thread has acquired some resource implicitly, that is, not in our own code but inside a library or in kernel mode. If we suspend that thread, the result is a mysterious deadlock, because other threads of the process will be waiting for that resource. Since we can never be sure in our program which resources a running thread holds at any given moment, suspending and resuming threads is not a good idea.

Setting thread priorities from the running process

I've just come across the Get/SetThreadPriority methods and they got me wondering: can a thread priority meaningfully be set higher than the owning process priority (which I don't believe can be changed programmatically in the same way)?
Are there any pitfalls to using these APIs?

Yes, you can set the thread priority to any class, including a class higher than the one of the current process. In fact, these two values are complementary and provide the base priority of the thread. You can read about it in the Remarks section of the link you posted.
You can set the process priority using SetPriorityClass.
Now that we got the technicalities out of the way, I find little use for manipulating the priority of a thread directly. The OS scheduler is sophisticated enough to boost the priority of threads blocked in I/O over threads doing CPU computations (to the point that an I/O thread will preempt a CPU thread when the I/O interrupt arrives). In fact, even I/O threads are differentiated, with keyboard I/O threads getting a priority boost over file I/O threads for example.

On Windows, the thread and process priorities are combined using an algorithm that decides overall scheduling priority:
Windows priorities
Pitfalls? Well:
Raising the priority of a thread is likely to give the greatest overall gain if it is usually blocked on IO but must run ASAP after being signaled by its driver, e.g. video IO that must process buffers quickly.
Raising the priority of threads is likely to have the greatest overall negative impact if they are CPU-bound and raised to a high priority, so preventing the running of normal-priority threads. If taken to extremes, OS threads and utilities like Task Manager will not run.

Relationship between shared memory concurrency algorithms and mutexes/semaphores

I am trying to figure out the relationship between shared memory based concurrency algorithms (Peterson's / Bakery) and the use of semaphores and mutexes.
In the first case, we have a system without OS intervention, and processes can synchronize themselves using shared memory and busy waiting.
In the second case, the OS provides processes/threads with the ability to block, and not have to busy wait.
Is there ever a situation where we'd like to use shared memory in addition to semaphores (to ensure fairness / lack of starvation), or does the OS offer a better way to do this?
(I am wondering about the general concepts, but answers specific to POSIX/Win32/JAVA threads are also interesting).
Thanks a lot!

I can't think of any circumstances where what you actually want is a busy wait. Busy waiting just consumes processor time without achieving anything. That's not to say that "busy wait" algorithms aren't useful (they are), but the "busy wait" part is not the desired property, it is just a necessary consequence of a property that is desired.
Peterson's lock algorithm and Lamport's bakery algorithm are fundamentally just implementations of the mutex concept. OS facilities provide implementations of the same concept, but with different trade-offs.
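To make the "just an implementation of the mutex concept" point concrete, here is a rough sketch of Peterson's algorithm for exactly two threads, written with std::atomic. The names PetersonLock and count_with_peterson are invented for the example. It relies on the default sequentially consistent ordering; with plain variables or weaker orderings the algorithm is not correct on modern hardware.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Peterson's algorithm for two threads with ids 0 and 1 (sketch).
class PetersonLock {
public:
    PetersonLock() {
        interested_[0].store(false);
        interested_[1].store(false);
        turn_.store(0);
    }
    void lock(int self) {
        assert(self == 0 || self == 1);
        const int other = 1 - self;
        interested_[self].store(true);  // announce that we want the lock
        turn_.store(other);             // politely let the other go first
        // Busy wait while the other thread wants in and it is its turn.
        while (interested_[other].load() && turn_.load() == other) {
            std::this_thread::yield();  // still a busy wait, just a kinder one
        }
    }
    void unlock(int self) { interested_[self].store(false); }

private:
    std::atomic<bool> interested_[2];
    std::atomic<int> turn_;
};

// Two threads each add per_thread to a plain counter guarded by the lock.
long count_with_peterson(long per_thread) {
    PetersonLock lock;
    long counter = 0;
    auto body = [&](int id) {
        for (long i = 0; i < per_thread; ++i) {
            lock.lock(id);
            ++counter;
            lock.unlock(id);
        }
    };
    std::thread a(body, 0);
    std::thread b(body, 1);
    a.join();
    b.join();
    return counter;
}
```

The interface is the same lock/unlock pair an OS mutex gives you; the difference is only in what the waiting thread does while it waits.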
The "ideal" implementation of a mutex would have "zero overhead" --- acquiring a lock on a mutex would not take any time at all if it was not currently owned, a waiting thread would wake the instant that the prior owner released the lock, and in the mean time, the waiting thread would not consume any processor time.
A "busy wait" or "spin lock" algorithm trades processor time used by the waiting thread for a reduced wake-up time. Provided the thread is currently scheduled on a processor, a busy-waiter will wake as fast as the processor can transfer the necessary data for acquiring the lock and synchronizing the threads, but whilst it is waiting it will consume its maximum allotment of processor time. If the number of threads exceeds the number of available processors, this may well take time from the thread that currently owns the mutex, thus making the wait longer. However, in some cases the low latency between unlocking and locking is worth the trade-off.
On the other hand, a "blocking" mutex that uses OS facilities to put a waiting thread to sleep has a different trade-off. In this case, the time between unlocking a mutex and a waiting thread acquiring it can be quite large, possibly several hundred times larger than with a busy-wait algorithm. The benefit is that the waiting thread really does consume no processor time whilst waiting, so the OS can schedule other work whilst the thread is waiting. This can thus potentially reduce the overall wait time, and increase the overall throughput of the system.
Some mutex implementations use a combination of busy-waiting and blocking: they busy-wait for a short time, and then switch to blocking if the lock cannot be acquired in the short time. This has the benefits of the fast wake if the lock is released shortly after the thread began waiting, whilst consuming no processor time if the thread has to wait a long time. It also has the downsides of high processor usage for short waits, and slow wake-ups for long waits.
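A minimal sketch of the combined approach, assuming a fixed spin budget: try the lock a bounded number of times, then fall back to a blocking wait. SpinThenBlockMutex, the default budget of 1000 and count_with_hybrid are all made up for illustration; real implementations tune the spin phase far more carefully, and std::mutex itself may already do something similar internally.

```cpp
#include <cassert>
#include <mutex>
#include <thread>
#include <vector>

// Hybrid lock sketch: bounded busy-wait first, then block in the OS.
class SpinThenBlockMutex {
public:
    explicit SpinThenBlockMutex(int spin_limit = 1000)
        : spin_limit_(spin_limit) {
        assert(spin_limit >= 0);
    }
    void lock() {
        // Phase 1: spin. Cheap if the owner releases the lock very soon.
        for (int i = 0; i < spin_limit_; ++i) {
            if (inner_.try_lock()) return;
        }
        // Phase 2: give up spinning and let the OS put us to sleep.
        inner_.lock();
    }
    void unlock() { inner_.unlock(); }
    bool try_lock() { return inner_.try_lock(); }

private:
    std::mutex inner_;
    const int spin_limit_;
};

// Several threads increment a shared counter under the hybrid lock.
long count_with_hybrid(int threads, long per_thread) {
    SpinThenBlockMutex mutex;
    long counter = 0;
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([&] {
            for (long i = 0; i < per_thread; ++i) {
                std::lock_guard<SpinThenBlockMutex> guard(mutex);
                ++counter;
            }
        });
    }
    for (auto& th : pool) th.join();
    return counter;
}
```

The spin budget is the knob for the trade-off described above: a larger budget favours short waits at the cost of more processor time burned when the wait turns out to be long.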

Scheduling of Process(s) waiting for Semaphore

It is always said that when the count of a semaphore is 0, the processes requesting the semaphore are blocked and added to a wait queue.
When some process releases the semaphore, and the count increases from 0->1, a blocked process is activated. This can be any process, randomly picked from the blocked processes.
Now my question is:
If they are added to a queue, why is the activation of blocked processes NOT in FIFO order? I think it would be easy to pick the next process from the queue rather than picking a process at random and granting it the semaphore. If there is some idea behind this random logic, please explain. Also, how does the kernel select a process at random from the queue? Picking a random element is an awkward operation for a queue data structure.
tags: various OSes as each have a kernel usually written in C++ and mutex shares similar concept

A FIFO is the simplest data structure for the waiting list in a system
that doesn't support priorities, but it's not the absolute answer
otherwise. Depending on the scheduling algorithm chosen, different
threads might have different absolute priorities, or some sort of
decaying priority might be in effect, in which case, the OS might choose
the thread which has had the least CPU time in some preceding interval.
Since such strategies are widely used (particularly the latter), the
usual rule is to consider that you don't know (although with absolute
priorities, it will be one of the threads with the highest priority).

When a process is scheduled "at random", it's not that a process is randomly chosen; it's that the selection process is not predictable.
The algorithm used by Windows kernels is that there is a queue of threads (Windows schedules "threads", not "processes") waiting on a semaphore. When the semaphore is released, the kernel schedules the next thread waiting in the queue. However, scheduling the thread does not immediately make that thread start executing; it merely makes the thread able to execute by putting it in the queue of threads waiting to run. The thread will not actually run until a CPU has no threads of higher priority to execute.
While the thread is waiting in the scheduling queue, another thread that is actually executing may wait on the same semaphore. In a traditional queue system, that new thread would have to stop executing and go to the end of the queue waiting in line for that semaphore.
In recent Windows kernels, however, the new thread does not have to stop and wait for that semaphore. If the thread that has been assigned that semaphore is still sitting in the run queue, the semaphore may be reassigned to the new thread, causing the old thread to go back to waiting on the semaphore again.
The advantage of this is that the thread that was about to have to wait in the queue for the semaphore and then wait in the queue to run will not have to wait at all. The disadvantage is that you cannot predict which thread will actually get the semaphore next, and it's not fair so the thread waiting on the semaphore could potentially starve.

It is not that it CAN'T be FIFO; in fact, I'd bet many implementations ARE, for just the reasons that you state. The spec isn't that the process is chosen at random; it is that it isn't specified, so your program shouldn't rely on it being chosen in any particular way. (It COULD be chosen at random; just because it isn't the fastest approach doesn't mean it can't be done.)
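If a program really does need first-come-first-served ordering, it has to impose that itself instead of relying on the OS primitive. One possible sketch is a ticket lock built on a mutex and a condition variable; FifoMutex and count_with_fifo are made-up names, and "first" here means first to take a ticket, not first to call lock().

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <thread>
#include <vector>

// Ticket lock sketch: threads are served strictly in ticket order.
class FifoMutex {
public:
    void lock() {
        std::unique_lock<std::mutex> guard(state_);
        const unsigned long my_ticket = next_ticket_++;
        // Sleep until our number comes up.
        served_changed_.wait(guard,
                             [&] { return now_serving_ == my_ticket; });
    }
    void unlock() {
        {
            std::lock_guard<std::mutex> guard(state_);
            assert(now_serving_ < next_ticket_);  // unlock without lock
            ++now_serving_;
        }
        // Wake everyone; only the holder of the next ticket proceeds.
        served_changed_.notify_all();
    }

private:
    std::mutex state_;
    std::condition_variable served_changed_;
    unsigned long next_ticket_ = 0;
    unsigned long now_serving_ = 0;
};

// Several threads increment a shared counter under the ticket lock.
long count_with_fifo(int threads, long per_thread) {
    FifoMutex mutex;
    long counter = 0;
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([&] {
            for (long i = 0; i < per_thread; ++i) {
                std::lock_guard<FifoMutex> guard(mutex);
                ++counter;
            }
        });
    }
    for (auto& th : pool) th.join();
    return counter;
}
```

The price of the guarantee is that every release wakes all waiters and a descheduled ticket holder stalls everyone behind it, which is one reason general-purpose primitives tend not to promise FIFO.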

All of the other answers here are great descriptions of the basic problem - especially around thread priorities and ready queues. Another thing to consider however is IO. I'm only talking about Windows here, since it is the only platform I know with any authority, but other kernels are likely to have similar issues.
On Windows, when an IO completes, something called a kernel-mode APC (Asynchronous Procedure Call) is queued against the thread which initiated the IO in order to complete it. If the thread happens to be waiting on a scheduler object (such as the semaphore in your example) then the thread is removed from the wait queue for that object which causes the (internal kernel mode) wait to complete with (something like) STATUS_ALERTED. Now, since these kernel-mode APCs are an implementation detail, and you can't see them from user mode, the kernel implementation of WaitForMultipleObjects restarts the wait at that point which causes your thread to get pushed to the back of the queue. From a kernel mode perspective, the queue is still in FIFO order, since the first caller of the underlying wait API is still at the head of the queue, however from your point of view, way up in user mode, you just got pushed to the back of the queue due to something you didn't see and quite possibly had no control over. This makes the queue order appear random from user mode. The implementation is still a simple FIFO, but because of IO it doesn't look like one from a higher level of abstraction.
I'm guessing a bit more here, but I would have thought that unix-like OSes have similar constraints around signal delivery and places where the kernel needs to hijack a process to run in its context.
Now this doesn't always happen, but the documentation has to be conservative and unless the order is explicitly guaranteed to be FIFO (which as described above, for Windows at least, it can't be) then the ordering is described in the documentation as being "random" or "undocumented" or something because a random process controls it. It also gives the OS vendors latitude to change the ordering at some later time.

Process scheduling algorithms are very specific to system functionality and operating system design. It will be hard to give a good answer to this question. If I am on a general PC, I want something with good throughput and average wait/response time. If I am on a system where I know the priority of all my jobs and know I absolutely want all my high priority jobs to run first (and don't care about preemption/starvation), then I want a Priority algorithm.
As far as random selection goes, there could be various motivations. One is an attempt at good throughput, etc., as mentioned above. However, it would be non-deterministic (hypothetically) and impossible to prove. This property could be an exploitation of probability (random samples, etc.), but, again, any proof could only rest on empirical data about whether it really works.

Multi-threaded Event Dispatching

I am developing a C++ application that will use Lua scripts for external add-ons. The add-ons are entirely event-driven; handlers are registered with the host application when the script is loaded, and the host calls the handlers as the events occur.
What I want to do is to have each Lua script running in its own thread, to prevent scripts from locking up the host application. My current intention is to spin off a new thread to execute the Lua code, and allow the thread to terminate on its own once the code has completed. What are the potential pitfalls of spinning off a new thread as a form of multi-threaded event dispatching?

Here are a few:
Unless you take some steps to that effect, you are not in control of the lifetime of the threads (they can stay running indefinitely) or the resources they consume (CPU, etc)
Messaging between threads and synchronized access to commonly accessible data will be harder to implement
If you are expecting a large number of add-ons, the overhead of creating a thread for each one might be too great
Generally speaking, giving event-driven APIs a new thread to run on strikes me as a bad decision. Why have threads running when they don't have anything to do until an event is raised? Consider spawning one thread for all add-ons, and managing all event propagation from that thread. It will be massively easier to implement and when the bugs come, you will have a fighting chance.
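A rough sketch of the one-thread-for-all-add-ons idea, under the assumption that handlers can be represented as std::function callbacks keyed by an event name (in the real application they would call into the Lua scripts). EventDispatcher, subscribe and post are invented names, not part of any library.

```cpp
#include <cassert>
#include <condition_variable>
#include <functional>
#include <map>
#include <mutex>
#include <queue>
#include <string>
#include <thread>
#include <utility>
#include <vector>

// One background thread delivers every posted event to its handlers.
class EventDispatcher {
public:
    using Handler = std::function<void(const std::string& payload)>;

    EventDispatcher() : worker_([this] { run(); }) {}
    ~EventDispatcher() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            stopping_ = true;
        }
        wake_.notify_one();
        worker_.join();  // pending events are delivered before exit
    }
    void subscribe(const std::string& event, Handler handler) {
        assert(handler);
        std::lock_guard<std::mutex> lock(mutex_);
        handlers_[event].push_back(std::move(handler));
    }
    void post(std::string event, std::string payload) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            pending_.push({std::move(event), std::move(payload)});
        }
        wake_.notify_one();
    }

private:
    void run() {
        std::unique_lock<std::mutex> lock(mutex_);
        for (;;) {
            wake_.wait(lock,
                       [this] { return stopping_ || !pending_.empty(); });
            if (pending_.empty()) return;  // stopping and fully drained
            std::pair<std::string, std::string> item =
                std::move(pending_.front());
            pending_.pop();
            // Copy the handler list so handlers run without the lock held.
            std::vector<Handler> targets = handlers_[item.first];
            lock.unlock();
            for (auto& handler : targets) handler(item.second);
            lock.lock();
        }
    }

    std::mutex mutex_;
    std::condition_variable wake_;
    std::queue<std::pair<std::string, std::string>> pending_;
    std::map<std::string, std::vector<Handler>> handlers_;
    bool stopping_ = false;
    std::thread worker_;  // declared last so it starts after the rest exists
};
```

The host calls post() from wherever events originate and never runs handler code itself. Note that with a single dispatcher thread one slow handler delays every event queued behind it, so this only moves the "script locks things up" problem off the host thread.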

Creating a new thread and destroying it frequently is not really a good idea. For one, you should have a way to bound this so that it doesn't consume too much memory (think stack space, for example), or get to the point where lots of pre-emption happens because the threads are competing for time on the CPU. Second, you will waste a lot of work associated with creating new threads and tearing them down. (This depends on your operating system. Some OSs might have cheap thread creation and others might have that be expensive.)
It sounds like what you are seeking to implement is a work queue. I couldn't find a good Wikipedia article on this but this comes close: Thread pool pattern.
One could go on for hours talking about how to implement this, and different concurrent queue algorithms that can be used. But the idea is that you create N threads which will drain a queue, and do some work in response to items being enqueued. Typically you'll also want the threads to, say, wait on a semaphore while there are no items in the queue -- the worker threads decrement this semaphore and the enqueuer will increment it. To prevent enqueuers from enqueueing too much while worker threads are busy and hence taking up too many resources, you can also have them wait on a "number of queue slots available" semaphore, which the enqueuer decrements and the worker thread increments. These are just examples, the details are up to you. You'll also want a way to tell the threads to stop waiting for work.
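Here is one way the two-semaphore scheme described above might look, as a sketch only. The Semaphore class is a hand-rolled stand-in built from a mutex and a condition variable (C++20 has std::counting_semaphore), and WorkQueue, submit and sum_on_pool are made-up names. An empty job is used as the "stop waiting for work" signal, which is just one of several possible choices.

```cpp
#include <atomic>
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <deque>
#include <functional>
#include <mutex>
#include <thread>
#include <vector>

// Minimal counting semaphore (stand-in for std::counting_semaphore).
class Semaphore {
public:
    explicit Semaphore(std::size_t initial) : count_(initial) {}
    void acquire() {  // decrement, blocking while the count is zero
        std::unique_lock<std::mutex> lock(m_);
        cv_.wait(lock, [this] { return count_ > 0; });
        --count_;
    }
    void release() {  // increment and wake one waiter
        {
            std::lock_guard<std::mutex> lock(m_);
            ++count_;
        }
        cv_.notify_one();
    }

private:
    std::mutex m_;
    std::condition_variable cv_;
    std::size_t count_;
};

class WorkQueue {
public:
    using Job = std::function<void()>;

    WorkQueue(std::size_t workers, std::size_t capacity)
        : items_(0), slots_(capacity) {
        assert(capacity > 0);
        for (std::size_t i = 0; i < workers; ++i)
            threads_.emplace_back([this] { work(); });
    }
    ~WorkQueue() {
        // An empty Job tells one worker to stop; send one per worker.
        for (std::size_t i = 0; i < threads_.size(); ++i) submit(Job());
        for (auto& t : threads_) t.join();
    }
    void submit(Job job) {
        slots_.acquire();  // blocks while the queue is full
        {
            std::lock_guard<std::mutex> lock(m_);
            jobs_.push_back(std::move(job));
        }
        items_.release();  // one more item for the workers
    }

private:
    void work() {
        for (;;) {
            items_.acquire();  // sleeps while the queue is empty
            Job job;
            {
                std::lock_guard<std::mutex> lock(m_);
                job = std::move(jobs_.front());
                jobs_.pop_front();
            }
            slots_.release();  // the slot is free again
            if (!job) return;  // stop signal
            job();
        }
    }

    Semaphore items_;  // number of queued jobs
    Semaphore slots_;  // number of free queue slots
    std::mutex m_;
    std::deque<Job> jobs_;
    std::vector<std::thread> threads_;
};

// Usage sketch: sums 1..n on a small pool, one job per number.
long sum_on_pool(int n) {
    std::atomic<long> total{0};
    {
        WorkQueue pool(3, 4);  // 3 workers, at most 4 queued jobs
        for (int i = 1; i <= n; ++i)
            pool.submit([&total, i] { total += i; });
    }  // destructor drains the queue and joins the workers
    return total.load();
}
```

Because the queue is FIFO, the stop signals are only picked up after every job submitted before destruction has been taken by some worker.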

My 2 cents: depending on the number and rate of events generated by the host application, the main problem I can see is performance. Creating and destroying threads has a cost. I'm assuming that each thread, once spawned, does not need to share any resource with the other threads, so there is no contention.
If all threads are assigned to a single core of your CPU and there is no load balancing, you can easily overload one core and leave the others (on a multicore system) idle. I would consider some thread affinity plus load balancing policy.
The other problem could be resources (read: memory). How much memory will each Lua thread consume?
Be very careful about memory leaks in the Lua threads as well: if events are frequent and threads are created and destroyed frequently while leaking memory, you can exhaust your host's memory quite soon ;)
