Bottleneck in Threads C++

So I am just trying to verify my understanding, and I hope you can clear up any misunderstandings. Essentially I have two threads which use the same lock and perform calculations while they hold it. The interesting thing is that, inside the locked section, I make the thread sleep for a short time, and this sleep time is slightly different for the two threads. Because of the way locks work, won't the faster thread be bottlenecked by the slower thread, since it will have to wait for it to complete?
For example:
Thread1() {
    lock();
    usleep(10);
    unlock();
}
-
Thread2() {
    lock();
    sleep(100);
    unlock();
}
Now, because Thread2 holds onto the lock longer, this will cause a bottleneck. And just to be sure: this system should have a back-and-forth on who gets the lock, right?
It should be:
Thread1 gets lock
Thread1 releases lock
Thread2 gets lock
Thread2 releases lock
Thread1 gets lock
Thread1 releases lock
Thread2 gets lock
Thread2 releases lock
and so on, right? Thread1 should never be able to acquire the lock right after it releases it, can it?

Thread1 should never be able to acquire the lock right after it releases it, can it?
No, Thread1 could reacquire the lock right after releasing it, because Thread2 could still be suspended (still sleeping, or not yet picked by the scheduler).
Also, sleep only guarantees that the thread will sleep for at least the requested amount; it can, and often will, be more.
In practice you would not hold a lock while calculating a value. You would take the lock, copy the values needed for the calculation, unlock, do the calculation, then take the lock again, check whether the old input values are still valid/wanted, and only then store/return your calculated result.
For this purpose, the std::future and atomic data types were invented.
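That lock-copy-unlock-compute-relock-validate pattern might be sketched like this; the `Shared` struct and its version counter are invented for illustration:

```cpp
#include <mutex>

// Illustrative shared state; the names here are made up for this sketch.
struct Shared {
    std::mutex m;
    int input = 0;         // value the calculation depends on
    int input_version = 0; // bumped whenever input changes
    int result = 0;
};

// Compute outside the lock; re-lock only to validate and store.
void update_result(Shared& s) {
    int local_input, local_version;
    {
        std::lock_guard<std::mutex> lk(s.m);
        local_input = s.input;           // copy what we need
        local_version = s.input_version;
    }                                    // lock released here

    int computed = local_input * local_input; // "expensive" work, unlocked

    std::lock_guard<std::mutex> lk(s.m);
    if (s.input_version == local_version) // inputs still valid?
        s.result = computed;              // store only if unchanged
}
```

The version counter is one cheap way to detect that the inputs changed while the lock was released; if they did, the computed result is simply discarded.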
...this system should have a back and forth happens on who gets the lock, right?
Most of the time it will be a back and forth, but sometimes there could/will be two lock/unlock cycles in a row by Thread1. It depends on your scheduler, and every execution will probably vary.

Absolutely nothing prevents either thread from immediately reacquiring the lock after releasing it. I have no idea what you think prevents this from happening, but nothing does.
In fact, in many implementations, a thread that is already running has an advantage in acquiring a lock over threads that have to be made ready-to-run. This is a sensible optimization to minimize context switches.
If you're using a sleep as a way to simulate work and think this represents some real world issue with lock fairness, you are wrong. Threads that sleep are voluntarily yielding the remainder of their timeslice and are treated very differently from threads that exhaust their timeslice doing work. If these threads were actually doing work, eventually one thread would exhaust its timeslice.

Depending on what you are trying to achieve there are several possibilities.
If you want your threads to run in a particular order then have a look here.
There are basically 2 options:
- one is to use events, where each thread signals the next one that it has done its job, so the next one can start.
- the other is to have a scheduler thread that handles the ordering with events or semaphores.
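The first option might look like the following sketch, where a condition variable plays the role of the event and the three-stage chain is made up for illustration:

```cpp
#include <condition_variable>
#include <mutex>
#include <thread>
#include <vector>

// Stages run in the order of their index: each stage waits for its
// "turn" event, does its work, then signals the next stage.
std::mutex m;
std::condition_variable cv;
int turn = 0;

void stage(int my_turn, std::vector<int>& out) {
    std::unique_lock<std::mutex> lk(m);
    cv.wait(lk, [&] { return turn == my_turn; }); // wait for our event
    out.push_back(my_turn);                       // "work", done in order
    ++turn;                                       // hand the event on
    cv.notify_all();
}

// Demo: start the stages out of order; output is still 0, 1, 2.
std::vector<int> run_chain() {
    std::vector<int> out;
    std::thread t2([&] { stage(2, out); });
    std::thread t1([&] { stage(1, out); });
    std::thread t0([&] { stage(0, out); });
    t0.join(); t1.join(); t2.join();
    return out;
}
```

No matter which thread the scheduler runs first, only the one whose turn it is can get past the wait, which is exactly the signaling chain described above.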
If you want your threads to run independently but want a lock mechanism where the order of attempting to take the lock is preserved, you can have a look here. The last part of that answer, which uses a queue with one condition variable per thread, seems good.
And as it was said in previous answers and comments, using sleep for scheduling is a bad idea.
Also, a lock is just a mutual exclusion mechanism and gives no guarantee about execution order.
A lock is usually intended for preventing concurrent access on a critical resource so it should just do that. The smaller the critical section is, the better.
Finally, yes, trying to order threads does create "bottlenecks". In this particular case, if all calculations are done in the locked sections, the threads won't do anything in parallel, so you can question the utility of using threads at all.
Edit:
Just one more warning: be careful; with threads, the fact that it worked (was scheduled as you wanted) 10 times on your machine does not mean it always will, especially if you change any of the context (machine, workload...). You have to be sure of it by design.

Related

How expensive is a blocked mutex?

Say I have a mutex and thread 1 locked the mutex. Now, thread 2 tries to acquire the lock but it is blocked, say for a couple of seconds. How expensive is this blocked thread? Can the executing hardware thread be rescheduled to do something computationally more expensive? If yes, then who checks if the mutex gets unlocked?
EDIT: OK, so let me reformulate what I wanted to ask.
What I don't really understand is how the following works. Thread 2 got blocked, so what exactly does thread 2 do? From the answer it seems it is not just constantly checking whether the mutex gets unlocked. If that were the case, I would consider a blocked thread expensive, as I would be using one of my hardware threads just to check whether some boolean value changes.
So am I correct in thinking that when the mutex gets released by thread 1, thread 1 notifies the scheduler, and the scheduler assigns a hardware thread to execute the waiting thread 2?
I am reading your questions as:
How expensive is a locked mutex?
A mutex can be considered as an integer in memory.
A thread trying to lock the mutex has to read its existing state and set it depending on the value read.
test_and_set( &mutex_value, 0, 1 ); // if mutex_value is 0, set to 1
The trick is that both the read and write (also called test-and-set) operation should be atomic. The atomicity is achieved with CPU support.
However, the test-and-set operation doesn't offer any mechanism to block/wait.
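In C++, the test-and-set primitive is exposed as std::atomic_flag, and a spinlock built on it shows exactly this limitation: a losing thread can only retry, never block. The class below is an illustrative sketch, not a replacement for std::mutex:

```cpp
#include <atomic>
#include <thread>

// test_and_set atomically reads the old value and writes `true`.
// If the old value was already true, someone else holds the lock,
// and all we can do is retry -- there is no way to block here.
class Spinlock {
    std::atomic_flag flag = ATOMIC_FLAG_INIT;
public:
    void lock() {
        while (flag.test_and_set(std::memory_order_acquire))
            ; // busy-wait: burns CPU until the holder calls unlock()
    }
    void unlock() { flag.clear(std::memory_order_release); }
};

// Demo: two threads incrementing a shared counter under the spinlock.
int locked_count(int iters_per_thread) {
    Spinlock s;
    int counter = 0;
    auto worker = [&] {
        for (int i = 0; i < iters_per_thread; ++i) {
            s.lock();
            ++counter; // the test-and-set lock makes this safe
            s.unlock();
        }
    };
    std::thread a(worker), b(worker);
    a.join();
    b.join();
    return counter;
}
```

An OS mutex uses the same kind of atomic fast path, but falls back to a system call (futex on Linux) to put the loser to sleep instead of spinning.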
CPU has no knowledge of threads blocking on a mutex. The OS takes the responsibility to manage the blocking by providing system calls to users. The implementation varies from OS to OS. In case of Linux, you can consider futex or pthreads as an example.
The overall costs of using a mutex sums up to the test-and-set operation and the system calls used to implement the mutex.
The test-and-set operation is almost constant in cost and is insignificant compared to the cost the other operations can amount to.
If there are multiple threads trying to acquire the lock, the cost of the mutex can be attributed to the following:
1. Kernel scheduling overhead cost
2. Context switch overhead cost
Kernel scheduling overhead
What happens to other threads, if one thread has already acquired lock on a mutex?
The other threads will continue. If any other thread attempts to lock a mutex that is already locked, the OS will (re)schedule that thread to wait. As soon as the original thread unlocks the mutex, the kernel will wake up one of the threads waiting on it.
Context switch overhead
User-space code should be designed so that a thread spends as little time as possible trying to lock a mutex. If you have multiple threads competing for a mutex in multiple places, it may result in a disaster, and performance may be as poor as a single thread serving all requests.
Can the executing hardware thread be rescheduled to do something computationally more expensive?
If I am getting your question correctly, the thread which has acquired the lock can be context-switched out, depending on the scheduling mechanism. But that is an overhead of multi-threaded programming itself.
Can you provide a use case, to define this problem clearly?
who checks if the mutex gets unlocked?
Definitely the OS scheduler. Note, it is not just a blind sleep().
Threads are just a logical OS concept. There are no "hardware threads". Hardware has cores. The OS schedules a core to run a thread for a certain amount of time. If a thread gets blocked, there are always plenty left to run.
Taking your example, with mutexes, if thread 2 is blocked, the OS takes it off schedule and puts it in a queue associated with the mutex. When thread 1 releases the lock, it notifies the scheduler, which takes thread 2 off the queue and puts it back on the schedule. A blocked thread isn't using compute resources. However, there is overhead involved in the actual lock/unlock operation, which is an OS scheduling call.
That overhead is not insignificant, so you would generally use mutexes if you have longer tasks (reasonably longer than a scheduling time slice) and not too much lock competition.
So if a lock goes out of scope, there is some code in the destructor that tells the OS that the locked mutex is now unlocked?
If a std::mutex goes out of scope while locked, that is undefined behavior. (https://en.cppreference.com/w/cpp/thread/mutex/~mutex) Even with non-std mutex implementations, it's reasonable to expect one to unlock before going out of scope.
Keep in mind that there are other kinds of "lock" (like spinlock...which itself has many versions) but we're only talking about mutexes here.
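On the scope question above: in C++ the usual pattern is to let an RAII guard do the unlocking, so what goes out of scope is the guard, never the locked mutex itself. A minimal sketch:

```cpp
#include <mutex>

std::mutex m;         // must outlive every lock of it
int shared_value = 0;

int increment() {
    std::lock_guard<std::mutex> guard(m); // locks in the constructor
    return ++shared_value;
}   // guard's destructor unlocks; the mutex itself stays alive
```

It is the guard's destructor that contains the "code that tells the OS the mutex is now unlocked"; the mutex's own destructor must only ever run on an unlocked mutex.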

Multiple vs single condition variable

I'm working on an application where I currently have only one condition variable and lots of wait statements with different conditions. Whenever one of the queues or some other state is changed I just call cv.notify_all(); and know that all threads waiting for this state change will be notified (and possibly other threads which are waiting for a different condition).
I'm wondering whether it makes sense to use separate condition variables for each queue or state. All my queues are usually separate, so a thread waiting for data in queue x doesn't care at all about new data in queue y. So I won't have situations where I have to notify or wait for multiple condition variables (which is what most questions I've found are about).
Is there any downside, from a performance point of view, for having dozens of condition variables around? Of course the code may be more prone to errors because one has to notify the correct condition variable.
Edit: All condition variables are still going to use the same mutex.
I would recommend having one condition_variable per actual condition. In producer/consumer, for example, there would be a cv for buffer empty and another cv for buffer full.
It doesn't make sense to notify all when you could just notify one. That makes me think that your one condition_variable is in reality managing several actual conditions.
You're asking about optimization. On one hand, we want to avoid premature optimization and do the profiling when it becomes a problem. On the other hand, we want to design code that's going to scale well algorithmically from the beginning.
From a speed performance point of view, it's hard to say without knowing the situation, but the potential for significant negative impact on speed from adding cv's is very low, completely dwarfed by the cost of a wait or notify. If adding more cv's can get you out of calling notify-all, the potential for positive impact on speed is high.
Notify-all makes for simple-to-understand code. However, it shouldn't be used when performance matters. When a thread waiting on cv is awoken, it is guaranteed to hold the mutex. If you notify-all, every thread waiting on that cv will be woken up and immediately try to grab the mutex. At that point, all but one will go back to sleep (this time waiting for the mutex to be free). When the lucky thread releases the mutex, the next thread will get it and the next thing it will do is check the cv predicate. Depending on the condition, it could be very likely that the predicate evaluates to false (because the first thread already took care of it, or because you have multiple actual conditions for the one cv) and the thread immediately goes back to sleep waiting on the cv. One-by-one, every thread now waiting on the mutex will wake up, check the predicate, and probably go back to sleep on the cv. It's likely that only one actually ends up doing something.
This is horrible performance because mutex lock, mutex unlock, cv wait, and cv signal are all relatively expensive.
Even when you call notify-one, you might consider doing it after the critical section to try to prevent the one waking thread from blocking on the mutex. And if you actually want to notify-all, you can simply call notify-one and have each thread notify-one once it finishes the critical section. This creates a nice linear cascade of turn-taking versus an explosion of contention.
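Put together, the producer/consumer arrangement described in this answer (one cv per actual condition, notify-one outside the critical section) might look like this sketch; the fixed capacity of 4 is an arbitrary choice for illustration:

```cpp
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <queue>

// Bounded queue with one condition variable per actual condition:
// "not empty" for consumers, "not full" for producers.
class BoundedQueue {
    std::mutex m;
    std::condition_variable not_empty, not_full;
    std::queue<int> q;
    const std::size_t capacity = 4;
public:
    void push(int v) {
        {
            std::unique_lock<std::mutex> lk(m);
            not_full.wait(lk, [&] { return q.size() < capacity; });
            q.push(v);
        }                       // release the mutex first...
        not_empty.notify_one(); // ...then wake exactly one consumer
    }
    int pop() {
        int v;
        {
            std::unique_lock<std::mutex> lk(m);
            not_empty.wait(lk, [&] { return !q.empty(); });
            v = q.front();
            q.pop();
        }
        not_full.notify_one(); // wake exactly one producer
        return v;
    }
};
```

Notifying after the unlock is the point made above: the woken thread finds the mutex free instead of immediately blocking on it.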

How to prevent threads from starvation in C++11

I am just wondering if there is any locking policy in C++11 which would prevent threads from starvation.
I have a bunch of threads which are competing for one mutex. Now my problem is that the thread which leaves the critical section immediately starts competing for the same mutex again and most of the time wins. Therefore the other threads waiting on the mutex are starving.
I do not want to let the thread, leaving a critical section, sleep for some minimal amount of time to give other threads a chance to lock the mutex.
I thought that there must be some parameter which would enable fair locking for threads waiting on the mutex but I wasn't able to find any appropriate solution.
Well, I found the std::this_thread::yield() function, which is supposed to reschedule the order of thread execution, but it is only a hint to the scheduler, and whether the threads are actually rescheduled depends on the scheduler implementation.
Is there any way how to provide fair locking policy for the threads waiting on the same mutex in C++11?
What are the usual strategies?
Thanks
This is a common optimization in mutexes designed to avoid wasting time switching tasks when the same thread can take the mutex again. If the current thread still has time left in its time slice then you get more throughput in terms of user-instructions-executed-per-second by letting it take the mutex rather than suspending it, and switching to another thread (which likely causes a big reload of cache lines and various other delays).
If you have so much contention on a mutex that this is a problem then your application design is wrong. You have all these threads blocked on a mutex, and therefore not doing anything: you are probably better off without so many threads.
You should design your application so that if multiple threads compete for a mutex then it doesn't matter which thread gets the lock. Direct contention should also be a rare thing, especially direct contention with lots of threads.
The only situation where I can think this is an OK scenario is where every thread is waiting on a condition variable, which is then broadcast to wake them all. Every thread will then contend for the mutex, but if you are doing this right then they should all do a quick check that this isn't a spurious wake and then release the mutex. Even then, this is called a "thundering herd" situation, and is not ideal, precisely because it serializes all these threads.
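That said, if strict FIFO fairness really is required, it can be layered on top of std::mutex and std::condition_variable with a so-called ticket lock. This is a sketch of one common approach, not a standard facility:

```cpp
#include <condition_variable>
#include <mutex>
#include <thread>

// FIFO "ticket lock": threads take a ticket on arrival and are admitted
// strictly in ticket order, so a thread that unlocks and immediately
// re-locks goes to the back of the line instead of winning again.
class TicketLock {
    std::mutex m;
    std::condition_variable cv;
    unsigned long next_ticket = 0; // next ticket to hand out
    unsigned long now_serving = 0; // ticket currently allowed in
public:
    void lock() {
        std::unique_lock<std::mutex> lk(m);
        const unsigned long my_ticket = next_ticket++;
        cv.wait(lk, [&] { return my_ticket == now_serving; });
    }
    void unlock() {
        std::lock_guard<std::mutex> lk(m);
        ++now_serving;
        cv.notify_all(); // wake the waiters; only the next ticket proceeds
    }
};

// Demo: two threads incrementing a counter under the ticket lock.
int fair_count(int iters_per_thread) {
    TicketLock lock;
    int counter = 0;
    auto worker = [&] {
        for (int i = 0; i < iters_per_thread; ++i) {
            lock.lock();
            ++counter;
            lock.unlock();
        }
    };
    std::thread a(worker), b(worker);
    a.join();
    b.join();
    return counter;
}
```

notify_all keeps the sketch short but wakes every waiter (the thundering herd just mentioned); a one-condition-variable-per-thread variant avoids that at the cost of more bookkeeping.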

sleeping a thread in the middle of execution

What happens when a thread is put to sleep by another thread, possibly the main thread, in the middle of its execution?
Assume I have a function Producer. What if the Consumer sleep()s the Producer in the middle of producing one unit?
Suppose the unit is half produced and the thread is then put to sleep. The integrity of the system may be in trouble.
The thread that sleep is invoked on is put in the idle queue by the thread scheduler and is context-switched off the CPU it is running on, so other threads can take its place.
All context (registers, stack pointer, base pointer, etc.) is saved on the thread stack, so when the thread runs next time, it can continue from where it left off.
The OS is constantly doing context switches between threads in order to make your system seem like it's doing multiple things. The OS thread scheduler algorithm takes care of that.
Thread scheduling and threading is a big subject, if you want to really understand it, I suggest you start reading up on it. :)
EDIT: Using sleep for thread synchronization purposes is not advised; you should use proper synchronization mechanisms to make a thread wait for other threads, etc.
There is no problem associated with this, unless some state is mutated while the thread sleeps, so it wakes up with a different set of values than before going to sleep.
Threads are switched in and out of execution by the CPU all the time, but that does not affect the overall outcome of their execution, assuming no data races or other bugs are present.
It would be unadvisable for one thread to forcibly and synchronously interfere with the execution of another thread. One thread could send an asynchronous message to another requesting that it reschedule itself in some way, but that would be handled by the other thread when it was in a suitable state to do so.
Assuming they communicate using channels that are thread-safe, nothing bad should happen, as the sleeping thread will wake up eventually and grab data from its task queue, or see that some semaphore has been set, and read the produced data.
If the threads communicate using unsynchronized shared variables or direct function calls that change state, that's when Bad Things occur.
I don't know of a way for a thread to forcibly cause another thread to sleep. If two threads are accessing a shared resource (like an input/output queue, which seems likely for your Producer/Consumer example), then both threads may contend for the same lock. The losing thread must wait for the other thread to release the lock if the contention is not of the "trylock" variety. The waiting thread is placed into a waiting queue associated with the lock and is removed from the scheduler's run queue. When the winning thread releases the lock, the code checks the queue to see if there are threads still waiting to acquire it. If there are, one is chosen as the winner, given the lock, and placed in the scheduler's run queue.

Conditional wait overhead

When using boost::conditional_variable, ACE_Conditional or directly pthread_cond_wait, is there any overhead for the waiting itself? These are the more specific issues that trouble me:
After the waiting thread is unscheduled, will it be scheduled back before the wait expires and then unscheduled again, or will it stay unscheduled until signaled?
Does wait periodically acquire the mutex? In that case, I guess it wastes some CPU time each iteration on system calls to lock and release the mutex. Is it the same as constantly acquiring and releasing a mutex?
Also, then, how much time passes between the signal and the return from wait?
Afaik, when using semaphores, the responsiveness of acquire calls depends on the scheduler time slice size. How does it work in pthread_cond_wait? I assume this is platform-dependent. I am more interested in Linux, but if someone knows how it works on other platforms, that will help too.
And one more question: are there any additional system resources allocated for each conditional? I won't create 30000 mutexes in my code, but should I worry about 30000 conditionals that use the same one mutex?
Here's what is written in the pthread_cond man page:
pthread_cond_wait atomically unlocks the mutex and waits for the condition variable cond to be signaled. The thread execution is suspended and does not consume any CPU time until the condition variable is signaled.
So from here I'd answer to the questions as following:
The waiting thread won't be scheduled back before the wait was signaled or canceled.
There are no periodic mutex acquisitions. The mutex is reacquired only once before wait returns.
The time that passes between the signal and the wait return is similar to that of thread scheduling due to mutex release.
Regarding the resources, on the same man page:
In the LinuxThreads implementation, no resources are associated with condition variables, thus pthread_cond_destroy actually does nothing except checking that the condition has no waiting threads.
Update: I dug into the sources of pthread_cond_* functions and the behavior is as follows:
All the pthread conditionals in Linux are implemented using futex.
When a thread calls wait it is suspended and unscheduled. The thread id is inserted at the tail of a list of waiting threads.
When a thread calls signal the thread at the head of the list is scheduled back.
So, the waking is as efficient as the scheduler, no OS resources are consumed and the only memory overhead is the size of the waiting list (see futex_wake function).
You should only call pthread_cond_wait if the variable is already in the "wrong" state. Since it always waits, there is always the overhead associated with putting the current thread to sleep and switching.
When the thread is unscheduled, it is unscheduled. It should not use any resources, but of course an OS can in theory be implemented badly. It is allowed to re-acquire the mutex, and even to return, before the signal (which is why you must double-check the condition), but the OS will be implemented so this doesn't impact performance much, if it happens at all. It doesn't happen spontaneously, but rather in response to another, possibly-unrelated signal.
30000 mutexes shouldn't be a problem, but some OSes might have a problem with 30000 sleeping threads.
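The double-check mentioned above is why condition-variable waits are normally written as a predicate loop. In C++, the predicate overload of wait does the loop for you; this minimal sketch assumes a simple boolean flag:

```cpp
#include <condition_variable>
#include <mutex>

std::mutex m;
std::condition_variable cv;
bool ready = false;

void set_ready() {
    {
        std::lock_guard<std::mutex> lk(m);
        ready = true;  // change the state under the mutex
    }
    cv.notify_one();   // then signal the waiter
}

void wait_until_ready() {
    std::unique_lock<std::mutex> lk(m);
    // Equivalent to: while (!ready) cv.wait(lk);
    // Re-checking after every wakeup means a spurious (or early)
    // wakeup simply goes back to sleep.
    cv.wait(lk, [] { return ready; });
}
```

Note that if the flag is already set when wait_until_ready() is called, the predicate is checked first and the function returns without sleeping at all, which matches the advice to only wait when the state is actually "wrong".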