synchronization between threads using tbb - c++

I am using TBB in C++. I am not supposed to use message queues, FIFOs, pipes, etc., as they are platform specific; I am supposed to use TBB-specific APIs.
Thread 1: // pseudocode as below
    // I take the mutex here
    m_bIsNewSubsArrived = true;
    StartSubscriptionTimer();
    m_bIsFirstSubsArrived = true;
    // Spawn a thread here.
    if (m_tbbTimerThread == NULL)
    {
        m_bIsTimerMutexTaken = true;
        m_timerMutex.lock();
        m_tbbTimerThread = new tbb::tbb_thread(&WaitForTimerMutex, this);
        if (m_tbbTimerThread->native_handle() == 0)
        {
            // report error and return
            return;
        }
    }
// Thread 1 exits.
In another thread I am releasing the mutex that was taken above.
Thread 2:
    m_timerMutex.unlock();
    m_bIsTimerMutexTaken = false;
Thread 3:
    // I am waiting for the mutex
    m_timerMutex.lock();
In the above code, I think the problem is that thread 1, which locked m_timerMutex, never releases it, so thread 2 is not able to unlock it, and thread 3 is blocked forever.
I think I could use a semaphore, but what are the APIs for semaphores in TBB?
What is the best way to do this without sleeping, using TBB-specific APIs?
Thanks for your time and help.

There is currently no support for semaphores in TBB. The reason is that TBB is intended to raise the level of abstraction above raw threads to the level of tasks, and semaphores are considered a low-level "goto" of threaded programming. Note that C++11 does not have semaphores either.
Using tasks instead of threads often allows synchronization to become implicit, and often permits automatic load balancing via the work-stealing scheduler in TBB.
If your problem is not amenable to a task-based solution, consider using condition variables, which are part of C++11. Recent versions of TBB ship with a partial implementation of C++11 condition variables.
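For illustration, here is a minimal sketch of how the timer handshake from the question could be expressed with a C++11 condition variable, instead of locking a mutex in one thread and unlocking it in another (the names are illustrative, not from the question's codebase):

    #include <condition_variable>
    #include <mutex>

    std::mutex timerMutex;
    std::condition_variable timerCv;
    bool timerReleased = false; // protected by timerMutex

    // Thread 3: instead of blocking on m_timerMutex.lock(), wait on the condition.
    void WaitForTimerRelease()
    {
        std::unique_lock<std::mutex> lock(timerMutex);
        timerCv.wait(lock, [] { return timerReleased; }); // no busy waiting
    }

    // Thread 2: instead of unlocking a mutex it does not own, signal the waiters.
    void ReleaseTimer()
    {
        {
            std::lock_guard<std::mutex> lock(timerMutex);
            timerReleased = true;
        }
        timerCv.notify_all();
    }

This sidesteps the original bug entirely: no thread ever tries to unlock a mutex held by a different thread.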

I've never used TBB, but typically you can only release a mutex from the thread currently holding it (which is rather the point of a mutex). From the code shown here, Thread 1 is still holding the mutex, Thread 2 can't release it because it doesn't hold it (I don't know what the behavior of the Intel implementation is in that case; I'd have to look it up), and Thread 3 waits forever because Thread 1 still holds the mutex.
(But I'm just guessing, because you didn't tell us what's actually happening with the code you have so far, or what exactly you are trying to do.)

Related

Can std::atomic be used sometimes instead of std::mutex in C++?

I suppose std::atomic can sometimes replace uses of std::mutex. But is it always safe to use an atomic instead of a mutex? Example code:
    std::atomic_flag f, ready; // shared

    // ..... Thread 1 (and others) ....
    while (true) {
        // ... Do some stuff in the beginning ...
        while (f.test_and_set()); // spin, acquire system lock
        if (ready.test()) {
            UseSystem(); // .... use our system for 50-200 nanoseconds ....
        }
        f.clear(); // release lock
        // ... Do some stuff at the end ...
    }

    // ...... Thread 2 .....
    while (true) {
        // ... Do some stuff in the beginning ...
        InitSystem();
        ready.test_and_set(); // signify system ready
        // .... sleep for 10-30 milliseconds ....
        while (f.test_and_set()); // acquire system lock
        ready.clear(); // signify system shutdown
        f.clear(); // release lock
        DeInitSystem(); // finalize/destroy system
        // ... Do some stuff at the end ...
    }
Here I use std::atomic_flag to protect use of my system (some complex library). But is this code safe? I assume that if ready is false the system is not available and I can't use it, and if it is true the system is available and I can use it. For simplicity, suppose the code above doesn't throw exceptions.
Of course I could use std::mutex to protect reads/modifications of my system. But right now I need very high-performance code in Thread 1, which should use atomics very often instead of mutexes (Thread 2 can be slow and use mutexes if needed).
In Thread 1 the system-usage code (inside the while loop) runs very often, each iteration taking around 50-200 nanoseconds, so using extra mutexes would be too heavy. But Thread 2's iterations are quite long: as you can see, in each iteration of its while loop, when the system is ready it sleeps for 10-30 milliseconds, so using mutexes only in Thread 2 is quite all right.
Thread 1 is an example of one thread; in my real project there are several threads running the same (or very similar) code as Thread 1.
I'm concerned about memory-operation ordering: it can probably happen that the system is not yet in a fully consistent state (not yet fully initialized) when ready becomes true as observed by Thread 1. It may also happen that ready becomes false in Thread 1 too late, when the system has already performed some destruction (deinit) operations. And as you can see, the system can be initialized/destroyed many times in Thread 2's loop and used many times in Thread 1 whenever it is ready.
Can my task be solved without std::mutex and other heavy stuff in Thread 1, using only std::atomic (or std::atomic_flag)? Thread 2 can use heavy synchronization, mutexes, etc. if needed.
Basically Thread 2 should somehow propagate the whole initialized state of the system to all cores and other threads before ready becomes true, and Thread 2 should also propagate ready == false before any single small system-destruction (deinit) operation is done. By propagating state I mean that all of the system's initialized data should be written consistently to global memory and to the caches of other cores, so that other threads see a fully consistent system whenever ready is true.
It is even allowed to make a small (milliseconds) pause after system init and before ready is set to true, if that improves the guarantees, and likewise to pause after ready is set to false and before starting system destruction (deinit). Doing some expensive CPU operations in Thread 2 is also all right, if there exists some operation like "propagate all Thread 2 writes to global memory and to the caches of all other CPU cores and threads".
Update: As a solution to my question above, in my project I decided to use the following code with std::atomic_flag to replace std::mutex:
    std::atomic_flag f = ATOMIC_FLAG_INIT; // shared

    // .... Later in all threads ....
    while (f.test_and_set(std::memory_order_acquire)) // try acquiring
        std::this_thread::yield();
    shared_value += 5; // Any code, it is lock-protected.
    f.clear(std::memory_order_release); // release
This solution runs in 9 nanoseconds on average (measured over 2^25 operations) in a single thread (release build) on my Windows 10 64-bit 2 GHz 2-core laptop, while using std::unique_lock<std::mutex> lock(mux); for the same protection takes 100-120 nanoseconds on the same PC. If threads should spin instead of sleeping while waiting, then instead of std::this_thread::yield(); in the code above I just use a semicolon ;. Full online example of usage and time measurements.
I'll ignore your code for the sake of the answer; the answer is, generally, yes.
A lock does the following things:
allows only one thread to acquire it at any given time
when the lock is acquired, a read barrier is placed
right before the lock is released, a write barrier is placed
The combination of the 3 points above makes the critical section thread safe: only one thread can touch the shared memory, all prior changes are observed by the locking thread because of the read barrier, and all changes become visible to other locking threads because of the write barrier.
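To make those points concrete, here is a minimal sketch of the publication idiom this question is really about: a release store makes all prior writes visible to any thread whose acquire load observes the flag (this assumes the data is written only before the store and read only after the load):

    #include <atomic>

    int payload = 0;                 // stands in for "the system": plain, non-atomic data
    std::atomic<bool> ready{false};

    void publisher()                 // Thread 2's role
    {
        payload = 42;                                 // init the system
        ready.store(true, std::memory_order_release); // "write barrier" before publishing
    }

    void consumer()                  // Thread 1's role
    {
        if (ready.load(std::memory_order_acquire))    // "read barrier" after observing true
        {
            int x = payload;                          // guaranteed to see 42
            (void)x;
        }
    }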
Can you use atomics to achieve it? Yes. Real-life locks (provided, for example, by Win32/POSIX) ARE implemented with atomics and lock-free programming under the hood.
Now, realistically speaking, should you use a self-written lock instead of the standard locks? Absolutely not.
Many concurrency tutorials preserve the notion that spin-locks are "more efficient" than regular locks. I can't stress enough how foolish that is. A user-mode spinlock is NEVER more efficient than a lock the OS provides. The reason is simple: OS locks are wired into the OS scheduler, so if a thread tries to acquire a lock and fails, the OS knows to freeze that thread and not reschedule it until the lock has been released.
With user-mode spinlocks, this doesn't happen. The OS can't know that the relevant thread is trying to acquire the lock in a tight loop. Yielding is just a patch, not a solution: we want to spin for a short time, then go to sleep until the lock is released. With user-mode spinlocks, we might waste the entire thread quantum trying to lock the spinlock and yielding.
I will say, for the sake of honesty, that recent C++ standards (C++20's atomic wait/notify) do give us the ability to sleep on an atomic, waiting for it to change its value. So we can, in a very lame way, implement our own "real" locks that spin for a while and then sleep until the lock is released. However, implementing a correct and efficient lock when you're not a concurrency expert is pretty much impossible.
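As a hedged sketch only (using C++20's std::atomic_flag::wait/notify_one; this is an illustration, not a production lock), the spin-then-sleep idea looks roughly like this:

    #include <atomic>

    class SpinThenSleepLock {
        std::atomic_flag flag = ATOMIC_FLAG_INIT;
    public:
        void lock() {
            // Spin a bounded number of times first: cheap if the lock is released quickly.
            for (int i = 0; i < 1000; ++i)
                if (!flag.test_and_set(std::memory_order_acquire))
                    return;
            // Then let the OS put us to sleep until the flag changes.
            while (flag.test_and_set(std::memory_order_acquire))
                flag.wait(true, std::memory_order_relaxed); // blocks while flag == true
        }
        void unlock() {
            flag.clear(std::memory_order_release);
            flag.notify_one(); // wake one sleeping waiter, if any
        }
    };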
My own philosophical opinion is that in 2021, developers should rarely deal with such low-level concurrency topics. Leave those things to the kernel guys.
Use a high-level concurrency library and focus on the product you want to develop rather than micro-optimizing your code. This is concurrency, where correctness >>> efficiency.
A related rant by Linus Torvalds

InterviewQ: How do you code a mutex?

I'm unfortunately out of a job and have been interviewing around lately. I faced this same question twice now, and was lost both times I was asked this question.
"How do you code a mutex"?
Conceptually I understand that a mutex locks a certain part of code so multiple threads cannot enter the critical section at the same time, eliminating data races. The first time I was asked to conceptually describe how I would code it; the second time I was asked to actually code it. I've been googling and haven't found any answers... can anyone help?
Thanks.
There are lots of ways to implement a mutex lock, but they typically start with the basic premise that the CPU architecture offers some concept of atomic add and atomic subtract. That is, an addition can be performed on an integer variable in memory (returning the result) without being corrupted by another thread accessing the same memory location. Or, at the very least, "atomic increment" and "atomic decrement".
On modern Intel chips, for example, there's an instruction called XADD. When combined with the LOCK prefix, it executes atomically and invalidates cached copies of the value on other cores. GCC implements a wrapper for this instruction called __sync_add_and_fetch, and Win32 provides a similar function called InterlockedIncrement; both are just calling LOCK XADD under the hood. Other CPU architectures offer something similar.
So the most basic mutex lock could be implemented something like this. This is often called a "spin" lock, and this cheap version offers no ability to recursively enter the lock.
    // A correct, but poorly performing, mutex implementation.
    // LOCK_XADD stands for an atomic fetch-and-add; on GCC you could define it as:
    //   #define LOCK_XADD(p, v) __sync_add_and_fetch((p), (v))
    void EnterLock(int* lock)
    {
        while (true)
        {
            int result = LOCK_XADD(lock, 1); // atomically increment the value in lock and return the result
            if (result == 1)
            {
                // The value in lock was successfully incremented from 0 to 1
                // by this thread. It means this thread "acquired" the lock.
                return;
            }
            LOCK_XADD(lock, -1); // we didn't get the lock - atomically decrement it back to what it was
            sleep(0); // give the thread quantum back before trying again
        }
    }

    void LeaveLock(int* lock)
    {
        LOCK_XADD(lock, -1); // release the lock. Assumes we successfully acquired it with EnterLock above
    }
The above suffers from the poor performance of "spinning" and doesn't guarantee any fairness: a higher-priority thread could keep winning the EnterLock battle over a lower-priority thread. And the programmer could make a mistake and call LeaveLock from a thread that did not previously call EnterLock. You could expand the above to operate on a data structure that includes not only the lock integer but also record keeping for the owner thread id and a recursion count.
The second concept for implementing a mutex is that the operating system can offer a wait-and-notify service, so that a thread doesn't have to spin until the owner thread releases the lock. The thread or process waiting on a lock can register itself with the OS to be put to sleep until the owner thread releases it. In OS terms, this is called a semaphore. Additionally, an OS-level semaphore can be used to implement locks across different processes, covers the cases where the CPU doesn't offer an atomic add, and can be used to guarantee fairness between multiple threads trying to acquire the lock.
Most implementations will try spinning for several attempts before falling back to making a system call.
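If C++20 is available, the OS-backed wait described above is essentially what std::binary_semaphore gives you. As a rough sketch, a semaphore initialized to 1 already behaves like a (non-recursive, non-owner-checked) mutex:

    #include <semaphore>

    std::binary_semaphore lockSem{1}; // 1 = lock available

    void EnterLockBlocking()
    {
        lockSem.acquire(); // decrements the count; blocks in the OS if it is already 0
    }

    void LeaveLockBlocking()
    {
        lockSem.release(); // increments the count; wakes one blocked thread, if any
    }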
I wouldn't say that this is a stupid question; the expected answer depends on the level of abstraction for the position. At the high level you just say you'd use the standard library, or a threading library. If you apply for a position as a compiler developer, you need to understand how it actually works and what the implementation needs.
To implement a mutex, you need a locking mechanism, that is, a resource that can be marked as taken across all threads. This is not trivial. You need to remember that two cores share memory, but they have their own caches, and this piece of information must be guaranteed to be current. So you do need hardware support to ensure atomicity.
If you look at clang's implementation, it offloads (at least in one case) to pthreads; see the typedefs in its threading support header:
    #if defined(_LIBCPP_HAS_THREAD_API_PTHREAD)
    # include <pthread.h>
    # include <sched.h>
    #elif defined(_LIBCPP_HAS_THREAD_API_WIN32)
    # include <Windows.h>
    # include <process.h>
    # include <fibersapi.h>
    #endif
And if you dig through a pthreads repo, you can find assembly implementations of the interlocked operations. They rely on the x86 lock instruction prefix, which makes the operation atomic, i.e. no other thread can execute it at the same time. This eliminates race conditions and guarantees coherency.
Based on this, you can build a lock, which you can use for a mutex implementation.
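As a sketch of that last step, here is a minimal lock built on std::atomic's compare-and-swap, which is the portable C++ face of those interlocked instructions (a pure spinlock, so not suitable where waits can be long):

    #include <atomic>

    class SpinMutex {
        std::atomic<bool> taken{false};
    public:
        void lock() {
            bool expected = false;
            // Atomically: if taken == false, set it to true; otherwise retry.
            while (!taken.compare_exchange_weak(expected, true,
                                                std::memory_order_acquire)) {
                expected = false; // compare_exchange wrote the observed value back
            }
        }
        void unlock() {
            taken.store(false, std::memory_order_release);
        }
    };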

Do mutexes guarantee ordering of acquisition? Unlocking thread takes it again while others are still waiting

A coworker had an issue recently that boiled down to what we believe was the following sequence of events in a C++ application with two threads:
Thread A holds a mutex.
While thread A is holding the mutex, thread B attempts to lock it. Since it is held, thread B is suspended.
Thread A finishes the work that it was holding the mutex for, thus releasing the mutex.
Very shortly thereafter, thread A needs to touch a resource that is protected by the mutex, so it locks it again.
It appears that thread A is given the mutex again; thread B is still waiting, even though it "asked" for the lock first.
Does this sequence of events fit with the semantics of, say, C++11's std::mutex and/or pthreads? I can honestly say I've never thought about this aspect of mutexes before.
Are there any fairness guarantees to prevent starvation of other threads for too long, or any way to get such guarantees?
Known problem. C++ mutexes are a thin layer on top of OS-provided mutexes, and OS-provided mutexes are often not fair; they do not care about FIFO order.
The other side of the same coin is that threads are usually not preempted until they run out of their time slice. As a result, thread A in this scenario was likely to continue executing, and got the mutex right away because of that.
The guarantee of a std::mutex is exclusive access to shared resources. Its sole purpose is to eliminate the race condition when multiple threads attempt to access shared resources.
The implementer of a mutex may choose to favor the current thread acquiring a mutex (over another thread) for performance reasons. Allowing the current thread to acquire the mutex and make forward progress without requiring a context switch is often a preferred implementation choice supported by profiling/measurements.
Alternatively, the mutex could be constructed to prefer another (blocked) thread for acquisition (perhaps chosen according to FIFO order). This likely requires a thread context switch (on the same or another processor core), increasing latency/overhead. NOTE: FIFO mutexes can behave in surprising ways. E.g. thread priorities must be considered in FIFO support, so acquisition won't be strictly FIFO unless all competing threads have the same priority.
Adding a FIFO requirement to a mutex's definition constrains implementers to provide suboptimal performance in nominal workloads. (see above)
Protecting a queue of callable objects (std::function) with a mutex would enable sequenced execution: multiple threads can acquire the mutex, enqueue a callable object, and release the mutex; the callable objects can then be executed by a single thread (or by a pool of threads if strict sequencing is not required), as sketched below.
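A minimal sketch of that idea (the names are illustrative), with producers enqueueing under the mutex and a single worker draining the queue in order:

    #include <condition_variable>
    #include <functional>
    #include <mutex>
    #include <queue>

    std::mutex qMutex;
    std::condition_variable qCv;
    std::queue<std::function<void()>> tasks; // protected by qMutex

    void Enqueue(std::function<void()> task) // called from many threads
    {
        {
            std::lock_guard<std::mutex> lock(qMutex);
            tasks.push(std::move(task));
        }
        qCv.notify_one();
    }

    void WorkerLoop() // a single thread executes tasks in enqueue order
    {
        for (;;) {
            std::unique_lock<std::mutex> lock(qMutex);
            qCv.wait(lock, [] { return !tasks.empty(); });
            std::function<void()> task = std::move(tasks.front());
            tasks.pop();
            lock.unlock(); // run the task without holding the mutex
            task();
        }
    }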
• Thread A finishes the work that it was holding the mutex for, thus releasing the mutex.
• Very shortly thereafter, thread A needs to touch a resource that is protected by the mutex, so it locks it again.
In the real world, while the program is running, there is no such guarantee from any threading library or the OS. Here "shortly thereafter" can mean a lot to the OS and the hardware: if it means 2 minutes, thread B will definitely get the mutex; if it means 200 ms or less, there is no promise of either A or B getting it.
The number of cores, the load on different processors/cores/threading units, contention, thread switching, kernel/user switches, preemption, priorities, deadlock-detection schemes, etc. all make a difference. Seeing a green signal from afar doesn't guarantee it will still be green when you reach it.
If thread B must get the resource, use an explicit signaling mechanism to tell thread B when it can take it.
You are inadvertently suggesting that threads should synchronise access to the synchronisation primitive itself. Mutexes are, as the name suggests, about mutual exclusion; they are not designed for control flow. If you want to signal a thread to run from another thread, you need a synchronisation primitive designed for control flow, i.e. a signal.
You can use a fair mutex to solve your task, i.e. a mutex that guarantees FIFO order of acquisition. Unfortunately, the C++ standard library doesn't have one.
Thankfully, there are open-source implementations, for example yamc (a header-only library).
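If pulling in a library is not an option, the classic way to get FIFO acquisition is a ticket lock. Here is a hedged sketch built on std::atomic (it spins rather than sleeping, so it is only appropriate for short critical sections):

    #include <atomic>
    #include <thread>

    class TicketMutex {
        std::atomic<unsigned> nextTicket{0}; // ticket dispenser
        std::atomic<unsigned> nowServing{0}; // whose turn it is
    public:
        void lock() {
            const unsigned my = nextTicket.fetch_add(1, std::memory_order_relaxed);
            while (nowServing.load(std::memory_order_acquire) != my)
                std::this_thread::yield(); // strictly FIFO: we run when our number comes up
        }
        void unlock() {
            nowServing.fetch_add(1, std::memory_order_release); // hand off to the next ticket
        }
    };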
The logic here is very simple: a thread is not preempted when it releases a mutex, because that would impose a cost on every mutex operation, which is definitely not what you want. The cost of grabbing a mutex is high enough without forcing the scheduler to look for other threads to run.
If you want to fix this you can always yield the current thread with std::this_thread::yield() - http://en.cppreference.com/w/cpp/thread/yield - which might give thread B the chance to take the mutex. But before you do that, let me tell you that this is a very fragile way of doing things and offers no guarantee. You could, alternatively, investigate the issue more deeply:
Ask why it is a problem that thread B is not scheduled when A releases the resource; your code should not depend on such logic.
Consider alternative thread-synchronization objects such as barriers (boost::barrier or http://linux.die.net/man/3/pthread_barrier_wait) if you really need this sort of lock-step logic; see the sketch after this list.
Investigate whether you really need to release the mutex from A at that point - I consider repeatedly locking and quickly releasing a mutex a code smell; it usually hurts performance badly. See if you can group the extraction of data into immutable structures which you can then work with freely.
More ambitiously, try to work without mutexes: use lock-free structures and a more functional approach, including plenty of immutable structures. I have often seen quite a performance gain from updating my code to not use mutexes (while remaining correct from the multithreading point of view).
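As promised above, here is a sketch of the barrier alternative (shown with C++20's std::barrier for brevity; boost::barrier and pthread_barrier_wait work similarly):

    #include <barrier>

    std::barrier<> syncPoint{2}; // two participating threads

    void threadA()
    {
        // ... produce data for this phase ...
        syncPoint.arrive_and_wait(); // both threads must reach this point to continue
        // ... next phase ...
    }

    void threadB()
    {
        syncPoint.arrive_and_wait(); // blocks until thread A has also arrived
        // ... consume the data produced in A's phase ...
    }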
How do you know this:
• While thread A is holding the mutex, thread B attempts to lock it. Since it is held, thread B is suspended.
How do you know thread B is suspended? How do you know it hasn't just finished the line of code before trying to grab the lock, but not yet grabbed it:
Thread B:
    x = 17;     // is the thread here?
                // or here? ('between' lines of code)
    mtx.lock(); // or suspended in here?
                // how can you tell?
You can't tell, at least not in theory.
Thus the order of acquiring the lock is, as far as the abstract machine (i.e. the language) is concerned, not definable.

mutex lock priority

In a multithreaded (2-thread) program, I have this code:
    while(-1)
    {
        m.lock();
        (...)
        m.unlock();
    }
m is a mutex (in my case a C++11 std::mutex, but I think it doesn't change if I use a different library).
Assume the first thread owns the mutex and has done something in the (...) part, while the second thread tried to acquire the mutex and is waiting for the first thread to release m.
The question is: when thread 1 finishes its (...) section and unlocks the mutex, can we be sure that thread 2 acquires the mutex, or can thread 1 re-acquire the mutex before thread 2, leaving thread 2 stuck in lock()?
The C++ standard doesn't make any guarantee about the order in which locks on a mutex are granted. Thus it is entirely possible that the active thread keeps unlock()ing and lock()ing the std::mutex m without another thread that is trying to acquire the lock ever getting it. I don't think the C++ standard provides a way to control thread priorities. I don't know what you are trying to do, but possibly there is another approach that avoids the problem you encounter.
If both threads have equal priority, standard mutex implementations make no such guarantee. Some OSes keep a list of "who's waiting" and will pick the longest-waiting thread when you release the lock, but that is an implementation detail, not something you can reliably depend on.
And imagine that you have two threads, each running something like this:
    m.lock();
    (...)
    m.unlock();
    (...) // Clearly not the same code as above (...)
    m.lock();
    (...) // Some other code that needs locking against updates.
    m.unlock();
Would you want the above code to switch threads on the second lock, every time?
By the way, if both threads hold the lock for their entire loop, what is the point of the lock?
There are no guarantees, as the threads are not ordered in any way with respect to each other. In fact, the only synchronisation point is the mutex locking.
It's entirely possible that the first thread reacquires the lock immediately if for example it is running the function in a tight loop. Typical implementations have a notification and wakeup mechanism if any thread is sleeping on a mutex, but there may also be a bias for letting the running thread continue rather than performing a context switch... it's very much up to the implementation and the details of the platform at the time.
There are no guarantees provided by C++ or the underlying OS.
However, there is some reasonable degree of fairness, determined by the threads' arrival times at the critical region (the mutex in this case). This fairness can be expressed as a statistical probability, but not as a strict guarantee; most likely the choice comes down to the OS scheduler, which also considers many other factors.
It's not a good idea to rely on such code, so you should probably change your design.
However, on some operating systems, sleep(0) will yield the thread. (Sleep(0) on Windows)
Again, it's best not to rely on this.
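One way to change the design, sketched below with a condition variable: encode whose turn it is as explicit state guarded by the mutex, instead of relying on lock fairness (the turn-based protocol here is illustrative):

    #include <condition_variable>
    #include <mutex>

    std::mutex m;
    std::condition_variable cv;
    int turn = 1; // 1 or 2: whose turn it is; protected by m

    void iteration(int me, int next)
    {
        std::unique_lock<std::mutex> lock(m);
        cv.wait(lock, [&] { return turn == me; }); // sleep until it is our turn
        // (...) critical-section work
        turn = next;     // hand the mutex "logically" to the other thread
        cv.notify_all(); // wake it; it cannot be starved by a fast re-lock
    }

    // Thread 1 calls iteration(1, 2) in its loop; thread 2 calls iteration(2, 1).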

Boost (v1.33.1) Thread Interruption

How can I interrupt a sleeping/blocked boost::thread?
I am using Boost v1.33.1, upgrading is not an option.
Thank you.
A quick perusal of the documentation for Boost.Thread in 1.33 suggests that there is no portable way to achieve interruption. Thread interruption (for threads at one of the Boost "interruption points") was introduced in 1.35.
As a result, the only option I can think of is to use signals (which aren't in 1.33 either, so you'll need to fall back on, for example, pthreads) combined with timeouts on any blocking calls: use signals to wake threads that sleep waiting for the signal, and timeouts on blocking calls so threads wake up and check whether they should exit. Unfortunately this is a highly undesirable solution, and to some extent amounts to what newer versions of Boost do internally anyway.
If you're using Boost.Thread, you should consider upgrading to a more recent version for other projects, because 1.33 lacks many constructs that are essential for multi-threading.
I agree with begray: look into condition variables. If you have threads you want to wake up from time to time, they are what Boost expects you to use. If you expect threads to block in other calls (like BSD socket calls or something similar), this doesn't help; you will need to use the timeout facilities of those calls directly, if they exist.
Here's an example, using only facilities available in Boost 1.33.1. I haven't compiled it, so there may be small errors. I've included the use of a nebulous Work class, but you don't need shared data at all to use this pattern; only the mutex and the condition variable are needed.
    Work work;
    boost::condition workAvailable;
    boost::mutex workMutex;

    void Producer()
    {
        {
            boost::mutex::scoped_lock lock(workMutex);
            UpdateWork(work);
            workAvailable.notify_one();
        }
        boost::mutex::scoped_lock lock(workMutex);
        work.SetOver();
        workAvailable.notify_one();
    }

    void Consumer()
    {
        // This thread uses data protected by the work mutex.
        boost::mutex::scoped_lock lock(workMutex);
        while(true)
        {
            // Wait in a loop: this guards against spurious wakeups, and against
            // a notification that fired before we started waiting. Pending() is
            // another nebulous Work method: "is there work or a termination
            // request that we haven't handled yet?"
            while(!work.Pending())
            {
                // wait() releases the work mutex; it is re-acquired when
                // this thread is notified.
                workAvailable.wait(lock);
            }
            // Once we hold the mutex we can inspect the shared data, which
            // might require this thread to terminate.
            if(work.Over())
            {
                return;
            }
            DoWork(work); // presumably consumes the pending unit of work
        }
    }
The producer thread will create one unit of work, and then block. The consumer thread will do the work, and then block. Then the producer thread will set the termination condition and exit. The consumer will then exit.
There is no way to interrupt a blocked thread in boost::thread. You need to implement proper thread interruption yourself, using boost::condition for example.
AFAIK, any existing way to forcibly terminate a running thread (TerminateThread in the Windows API, for example) only leads to problems (memory leaks being one of them).
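A hedged sketch of such do-it-yourself interruption, assuming (as I believe is the case) that boost::condition::timed_wait and the old boost::xtime API are available in 1.33: the worker wakes periodically to re-check a stop flag, and can also be woken immediately.

    #include <boost/thread/mutex.hpp>
    #include <boost/thread/condition.hpp>
    #include <boost/thread/xtime.hpp>

    boost::mutex stopMutex;
    boost::condition stopCond;
    bool stopRequested = false; // protected by stopMutex

    void Worker()
    {
        boost::mutex::scoped_lock lock(stopMutex);
        while(!stopRequested)
        {
            boost::xtime deadline;
            boost::xtime_get(&deadline, boost::TIME_UTC);
            deadline.sec += 1; // wake at least once per second to re-check the flag
            stopCond.timed_wait(lock, deadline); // releases the mutex while sleeping
            // ... do a slice of periodic work here ...
        }
    }

    void RequestStop() // called from another thread
    {
        boost::mutex::scoped_lock lock(stopMutex);
        stopRequested = true;
        stopCond.notify_all(); // wake the worker now instead of waiting out the timeout
    }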