Why shouldn't I unlock a mutex from a different thread?

Why shouldn't I unlock a mutex from a different thread? The C++ standard says it pretty clearly: if the mutex is not currently locked by the calling thread, the behavior is undefined. But as far as I can see, everything works as expected on Linux (Fedora 31 with GCC). I seriously tried everything, but I could not get it to behave strangely.
All I'm asking for is an example where something, literally anything is affected by unlocking a mutex from a different thread.
Here is a quick test I wrote which is super wrong and shouldn't work, but it does:
#include <cassert>
#include <cstdint>
#include <iostream>
#include <mutex>
#include <thread>

std::mutex* testEvent;

int main()
{
    testEvent = new std::mutex[1000];
    for(uint32_t i = 0; i < 1000; ++i) testEvent[i].lock();

    std::thread threads[2000];
    auto lock   = [](uint32_t index) -> void { testEvent[index].lock(); assert(!testEvent[index].try_lock()); };
    auto unlock = [](uint32_t index) -> void { testEvent[index].unlock(); };

    for(uint32_t j = 0; j < 1000; ++j)
    {
        for(uint32_t i = 0; i < 1000; ++i)
        {
            threads[i]      = std::thread(lock, i);
            threads[i+1000] = std::thread(unlock, i);
        }
        for(uint32_t i = 0; i < 2000; ++i)
        {
            threads[i].join();
        }
        std::cout << j << std::endl;
    }
    delete[] testEvent;
}

As you already said, it is UB. UB means it may work. Or not. Or randomly switch between working and making your computer sing itself a lullaby. (See also "nasal demons".)
Here are just a few ways someone can break your program on Fedora 31 with GCC on x86-64:
Compile with -fsanitize=thread. It will now crash every time, which is still a valid C++ implementation, because UB.
Run under helgrind (valgrind --tool=helgrind ./a.out). It will crash every time -- still a valid way to host a C++ program, because UB.
The libstdc++/glibc/pthread implementation on the target system switches from using "fast" mutexes by default to "error checking" or "recursive" mutexes (https://manpages.debian.org/jessie/glibc-doc/pthread_mutex_init.3.en.html). Note that this is probably possible in a manner that is ABI-compatible with your program, meaning that it does not even have to be recompiled for it to suddenly stop working.
That being said, since you are on a platform where the C++ mutex boils down to a futex-backed "fast" pthread mutex, it is no accident that this appears to work. It is just not guaranteed to keep working for any length of time, or under any circumstance that actually checks whether you are doing the right thing.
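To make the third point concrete, here is a minimal sketch (my own illustration, not from the original answer) of what an error-checking pthread mutex does with this pattern: pthread_mutex_unlock from a non-owning thread fails with EPERM instead of silently "working". Compile with -pthread.

#include <cstdio>
#include <cstring>
#include <pthread.h>
#include <thread>

int main()
{
    pthread_mutexattr_t attr;
    pthread_mutexattr_init(&attr);
    pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ERRORCHECK);

    pthread_mutex_t m;
    pthread_mutex_init(&m, &attr);

    pthread_mutex_lock(&m); // owned by the main thread

    std::thread([&] {
        // Unlock from a thread that does not own the mutex:
        // an error-checking mutex reports this instead of allowing it.
        int rc = pthread_mutex_unlock(&m);
        std::printf("unlock from other thread: %s\n", std::strerror(rc)); // "Operation not permitted"
    }).join();

    pthread_mutex_unlock(&m); // the owner may unlock
    pthread_mutex_destroy(&m);
    pthread_mutexattr_destroy(&attr);
}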

I really wonder why you would want to do this in the first place ;)
Normally you would want to have something like
lock();
do_critical_task();
unlock();
(In c++, the lock/unlock is often hidden by use of std::lock_guard or similar.)
Let's assume one thread (let's say thread A) called this code and is inside the critical task, i.e. it is also holding the lock.
Then if you unlock the same mutex from another thread, any thread other than A can also enter the critical section simultaneously.
The main purpose of mutexes is to have mutual exclusion (hence their name), so all you would do is to erase the purpose of the mutex ;)
That said: you should always believe the standard. Just because something works on a certain system doesn't mean it's portable. And especially in a concurrent context, a lot of things can work out a thousand times and then fail the 1001st time because of race conditions.
In mathematics your attempt would be comparable to "proof by example".

Concurrency Model C++

Suppose you are given the following code:
class FooBar {
    public void foo() {
        for (int i = 0; i < n; i++) {
            print("foo");
        }
    }

    public void bar() {
        for (int i = 0; i < n; i++) {
            print("bar");
        }
    }
}
The same instance of FooBar will be passed to two different threads. Thread A will call foo() while thread B will call bar(). Modify the given program to output "foobar" n times.
For the following problem on leetcode we have to write two functions
void foo(function<void()> printFoo);
void bar(function<void()> printBar);
where printFoo and printBar are callables that print "foo" and "bar" respectively. The functions foo and bar are called in a multithreaded environment, and there is no ordering guarantee on how foo and bar are called.
My solution was
class FooBar {
private:
    int n;
    mutex m1;
    condition_variable cv;
    condition_variable cv2;
    bool flag;
public:
    FooBar(int n) {
        this->n = n;
        flag = false;
    }

    void foo(function<void()> printFoo) {
        for (int i = 0; i < n; i++) {
            unique_lock<mutex> lck(m1);
            cv.wait(lck, [&]{ return !flag; });
            printFoo();
            flag = true;
            lck.unlock();
            cv2.notify_one();
        }
    }

    void bar(function<void()> printBar) {
        for (int i = 0; i < n; i++) {
            unique_lock<mutex> lck(m1);
            cv2.wait(lck, [&]{ return flag; });
            // printBar() outputs "bar". Do not change or remove this line.
            printBar();
            flag = false;
            lck.unlock();
            cv.notify_one();
        }
    }
};
Let us assume that at time t = 0 bar is called, and then at time t = 10 foo is called; foo goes through the critical section protected by the mutex m1.
My questions are:
Does the C++ memory model, because of its fencing properties, guarantee that when the bar function resumes from waiting on cv2, the value of flag will be true?
Am I right in assuming that locks shared among threads enforce a before/after relationship, in the manner of Leslie Lamport's clocking system? The compiler and C++ guarantee that everything before the end of a critical section (here, the release of the lock) will be observed by any thread that re-enters the lock, so common locks, atomics, and semaphores can be visualised as enforcing before/after behavior by establishing time in a multithreaded environment.
Can we solve this problem using just one condition variable?
Is there a way to do this without locks, using just atomics? What performance improvements do atomics give over locks?
What happens if I do cv.notify_one() and correspondingly cv2.notify_one() within the critical region; is there a chance of a missed (lost) wakeup?
Original Problem
https://leetcode.com/problems/print-foobar-alternately/.
Leslie Lamport's paper
https://lamport.azurewebsites.net/pubs/time-clocks.pdf
Does the C++ memory model, because of its fencing properties, guarantee that when the bar function resumes from waiting on cv2, the value of flag will be true?
By itself, a condition variable is prone to spurious wake-ups. A cv.wait(lck) call without a predicate clause can return for all kinds of reasons. That's why it's always important to check the predicate condition in a while loop before entering wait. You should never assume that when wait(lck) returns, the thing you were waiting for has actually happened. But with the predicate you added to the wait, cv2.wait(lck, [&]{ return flag; }), this check is taken care of for you. So yes, when wait(lck, predicate) returns, flag will be true.
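For reference, the predicate overload is specified to behave exactly like the recommended while loop:

// cv2.wait(lck, pred) is equivalent to:
while (!pred())
    cv2.wait(lck);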
Can we solve this problem using just one condition variable?
Absolutely. Just get rid of cv2 and have both threads wait (and notify) on the first cv.
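A minimal sketch of that single-cv variant (my rewrite of the methods above; keep the same n, m1, flag, and cv members, and delete cv2):

void foo(function<void()> printFoo) {
    for (int i = 0; i < n; i++) {
        unique_lock<mutex> lck(m1);
        cv.wait(lck, [&]{ return !flag; }); // foo's turn: flag is false
        printFoo();
        flag = true;
        lck.unlock();
        cv.notify_one(); // the only other possible waiter is bar
    }
}

void bar(function<void()> printBar) {
    for (int i = 0; i < n; i++) {
        unique_lock<mutex> lck(m1);
        cv.wait(lck, [&]{ return flag; }); // bar's turn: flag is true
        printBar();
        flag = false;
        lck.unlock();
        cv.notify_one();
    }
}

With only two threads this is safe: whenever one thread notifies, the only possible waiter is the other thread, and the predicates ensure that a spurious or "wrong-turn" wake-up simply goes back to waiting.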
Is there a way to do this without using locks and just atomics. What performance improvements do atomics give over locks?
Atomics are great when you can get away with polling on one thread instead of waiting. Imagine a UI thread that wants to show you the current speed of your car, and polls the speed variable on every frame refresh, while another thread, the "engine thread", sets that atomic<int> speed variable with every rotation of the tire. That's where they shine: when you already have a polling loop in place. And on x86, atomics are mostly implemented with the LOCK opcode prefix (i.e. the correctness of the concurrent access is handled by the CPU).
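In a sketch (compute_speed and the surrounding thread setup are hypothetical):

std::atomic<int> speed{0};

// engine thread: store on every rotation of the tire
speed.store(compute_speed(), std::memory_order_relaxed);

// UI thread: poll once per frame refresh
int shown = speed.load(std::memory_order_relaxed);
// relaxed ordering suffices here because the int itself is the only shared data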
As for an implementation with just atomics... well, it's late for me. An easy solution: both threads just sleep and poll on an atomic integer that increments with each thread's turn. Each thread waits for the value to reach "last + 2" and polls every few milliseconds. Not efficient, but it would work.
It's a bit late in the evening for me to think about how to do this with a single mutex or a pair of mutexes.
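Here is a sketch of that polling idea (my own illustration; I spin with yield() rather than sleeping for milliseconds): an atomic turn counter where foo runs on even values, bar on odd ones, and each increment hands the turn over.

#include <atomic>
#include <functional>
#include <thread>

class FooBarAtomic {
    int n;
    std::atomic<int> turn{0}; // even: foo's turn, odd: bar's turn
public:
    explicit FooBarAtomic(int n) : n(n) {}

    void foo(std::function<void()> printFoo) {
        for (int i = 0; i < n; i++) {
            while (turn.load(std::memory_order_acquire) % 2 != 0)
                std::this_thread::yield(); // poll until it's foo's turn
            printFoo();
            turn.fetch_add(1, std::memory_order_release); // hand over to bar
        }
    }

    void bar(std::function<void()> printBar) {
        for (int i = 0; i < n; i++) {
            while (turn.load(std::memory_order_acquire) % 2 != 1)
                std::this_thread::yield();
            printBar();
            turn.fetch_add(1, std::memory_order_release);
        }
    }
};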
What happens if I do cv.notify_one() and correspondingly cv2.notify_one() within the critical region; is there a chance of a missed (lost) wakeup?
No, you're fine, as long as all your threads hold the lock and check their predicate condition before entering the wait call. You can do the notify call inside or outside of the critical region. I always recommend notify_all over notify_one, but here that might even be unnecessary.

Can one function operate on a few threads?

I have a question about threads.
For example, I've got code like this:
void xyz(int x){
    // ...
}

int main(){
    for(int i = 0; i < n; i++){
        xyz(i);
    }
}
The question is whether (and how) I can modify the code so that the first thread calls the function with arguments 1 to n/2 and the second thread calls it with arguments from n/2 to n.
Thank you in advance.
Here is a simple solution using std::async and a lambda function capturing your n:
#include <future>

void xyz(int x); // defined elsewhere

int main() {
    size_t n = 666;
    auto f1 = std::async(std::launch::async, [n]() {
        for (size_t i = 0; i < n / 2; ++i)
            xyz(i);
    });
    auto f2 = std::async(std::launch::async, [n]() {
        for (size_t i = n / 2; i < n; ++i)
            xyz(i);
    });
    f1.wait();
    f2.wait();
    return 0;
}
Each call to std::async with the std::launch::async policy creates a new thread, and calling wait() on the std::futures returned by async makes sure the program doesn't return before those threads finish.
Sure. You can use <thread> for this:
#include <thread>

// The function we want to execute on the new threads.
void xyz(int start, int end)
{
    for (int i = start; i < end; ++i) {
        // ...
    }
}

// Start threads.
std::thread t1(xyz, 1, n / 2);
std::thread t2(xyz, n / 2, n);

// Wait for threads to finish.
t1.join();
t2.join();
If you're using GCC or Clang, don't forget to append -pthread to your link command if you get a link error (example: g++ -std=c++14 myfile.cpp -pthread).
You should read some tutorial about multi-threading. I recommend this Pthread tutorial, since you could apply the concepts to C++11 threads (whose model is close to POSIX threads).
A function can be used in several threads if it is reentrant.
You might synchronize with mutexes your critical sections, and use condition variables. You should avoid data races.
In some cases, you could use atomic operations.
(your question is too broad and unclear; an entire book is needed to answer it)
You might also be interested by OpenMP, OpenACC, OpenCL.
Be aware that threads are quite expensive resources. Each has its own call stack (of one or a few megabytes), and you generally don't want many more runnable threads than you have available cores. As a rule of thumb, avoid (on a common desktop or laptop computer) having more than a dozen runnable threads (and more than a few hundred idle ones, probably fewer). But YMMV; I prefer to have fewer than a dozen threads, and I try to have fewer threads than what std::thread::hardware_concurrency reports.
Both answers, from Nikos C. and from Maroš Beťko, use two threads. You could use a few more, but it would probably be unreasonable and inefficient to use many more (e.g. a hundred threads), at least on an ordinary computer. The optimal number of threads is computer- (and software- and system-) specific. You might make it a configurable parameter of your program. On supercomputers, you could mix multi-threading with an MPI approach. On datacenters or clusters, you might consider a MapReduce approach.
When benchmarking, don't forget to enable compiler optimizations. If using GCC or Clang, compile with -O2 -march=native at least.

Which Memory Order Should I use for a Host Thread waiting on Worker Threads?

I've got code that dispatches tasks to an asio io_service object to be remotely processed. As far as I can tell, the code behaves correctly, but unfortunately, I don't know much about memory ordering, and I'm not sure which memory orders I should be using when checking the atomic flags to ensure optimal performance.
// boost::asio::io_service io_service;
// ^^ Declared outside this scope.
std::vector<std::atomic_bool> flags(num_of_threads);
// ^^ Value-initialized, i.e. all false. (The fill constructor
// flags(num_of_threads, false) would not compile, because
// std::atomic_bool is not copyable.)

// std::vector<std::thread> threads(num_of_threads);
// ^^ Declared outside this scope; they all simply call run() on io_service.

for (int i = 0; i < num_of_threads; i++) {
    io_service.post([&, i] {
        /*...*/
        flags[i].store(true, /*[[[1]]]*/);
    });
}

for (std::atomic_bool& atm_bool : flags)
    while (!atm_bool.load(/*[[[2]]]*/))
        std::this_thread::yield();
So basically, what I want to know is, what should I substitute in for [[[1]]] and [[[2]]]?
If it helps, the code is functionally similar to the following:
std::vector<std::thread> threads;
for (int i = 0; i < num_of_threads; i++)
    threads.emplace_back([]{ /*...*/ });
for (std::thread& thread : threads)
    thread.join();
You want to establish a happens-before relation between the thread setting the flag and the thread seeing that it was set. This means that once the thread sees the flag is set, it will also see the effects of everything that the other thread did before setting it (this is not guaranteed otherwise).
This can be done using release-acquire semantics:
flags[i].store(true, std::memory_order_release);
// ...
while (!atm_bool.load(std::memory_order_acquire)) ...
Note that in this case it might be cleaner to use a blocking OS-level semaphore than to spin-wait on an array of flags. Failing that, it would still be slightly more efficient to spin on a single count of completed tasks instead of checking an array of flags one by one; a sketch of that follows below.
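A sketch of that counter variant (my own illustration, reusing the question's io_service and num_of_threads, with the same release/acquire pairing): every task increments one atomic counter, and the host spins on a single acquire load instead of checking num_of_threads flags.

std::atomic<int> tasks_done{0};

for (int i = 0; i < num_of_threads; i++) {
    io_service.post([&, i] {
        /*...*/
        tasks_done.fetch_add(1, std::memory_order_release); // publish this task's effects
    });
}

while (tasks_done.load(std::memory_order_acquire) < num_of_threads)
    std::this_thread::yield();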

Why is std::mutex faster than std::atomic?

I want to put objects into a std::vector from multiple threads, so I decided to compare two approaches: one uses std::atomic and the other std::mutex. I see that the second approach is faster than the first one. Why?
I use GCC 4.8.1 and, on my machine (8 threads), I see that the first solution requires 391502 microseconds and the second solution requires 175689 microseconds.
#include <vector>
#include <omp.h>
#include <atomic>
#include <mutex>
#include <thread> // for std::this_thread::yield
#include <iostream>
#include <chrono>

int main(int argc, char* argv[]) {
    const size_t size = 1000000;
    std::vector<int> first_result(size);
    std::vector<int> second_result(size);
    std::atomic<bool> sync(false);

    {
        auto start_time = std::chrono::high_resolution_clock::now();
        #pragma omp parallel for schedule(static, 1)
        for (int counter = 0; counter < size; counter++) {
            while (sync.exchange(true)) {
                std::this_thread::yield();
            }
            first_result[counter] = counter;
            sync.store(false);
        }
        auto end_time = std::chrono::high_resolution_clock::now();
        std::cout << std::chrono::duration_cast<std::chrono::microseconds>(end_time - start_time).count() << std::endl;
    }

    {
        auto start_time = std::chrono::high_resolution_clock::now();
        std::mutex mutex;
        #pragma omp parallel for schedule(static, 1)
        for (int counter = 0; counter < size; counter++) {
            std::unique_lock<std::mutex> lock(mutex);
            second_result[counter] = counter;
        }
        auto end_time = std::chrono::high_resolution_clock::now();
        std::cout << std::chrono::duration_cast<std::chrono::microseconds>(end_time - start_time).count() << std::endl;
    }
    return 0;
}
I don't think your question can be answered by referring only to the standard: mutexes are as platform-dependent as they can be. However, there is one thing that should be mentioned.
Mutexes are not slow. You may have seen articles that compare their performance against custom spin-locks and other "lightweight" stuff, but that's not the right approach: these are not interchangeable.
Spin locks are considerably fast when they are locked (acquired) for a relatively short amount of time. Acquiring them is very cheap, but other threads that are also trying to lock stay active (spinning in a loop) the whole time.
Custom spin-lock could be implemented this way:
class SpinLock
{
private:
    std::atomic_flag _lockFlag;

public:
    SpinLock()
        : _lockFlag {ATOMIC_FLAG_INIT}
    { }

    void lock()
    {
        while(_lockFlag.test_and_set(std::memory_order_acquire))
        { }
    }

    bool try_lock()
    {
        return !_lockFlag.test_and_set(std::memory_order_acquire);
    }

    void unlock()
    {
        _lockFlag.clear(std::memory_order_release); // release: publish the critical section's writes
    }
};
A mutex is a much more complicated primitive. In particular, on Windows, we have two such primitives: the Critical Section, which works on a per-process basis, and the Mutex, which doesn't have that limitation.
Locking a mutex (or critical section) is much more expensive, but the OS has the ability to really put the waiting threads to "sleep", which improves performance and helps the task scheduler manage resources efficiently.
Why do I write this? Because modern mutexes are often so-called "hybrid mutexes". When such a mutex is locked, it behaves like a normal spin-lock: the other waiting threads perform some number of "spins", and only then is the heavy mutex locked, to avoid wasting resources. A toy sketch of the idea follows below.
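Here is that toy sketch (my own illustration, not how any particular OS implements it): spin briefly with try_lock, then give up and let the OS block the thread.

#include <mutex>
#include <thread>

void hybrid_lock(std::mutex& m)
{
    for (int k = 0; k < 64; ++k) {  // bounded spinning phase
        if (m.try_lock())
            return;                 // acquired cheaply, no OS involvement
        std::this_thread::yield();
    }
    m.lock();                       // fall back to a real blocking lock
}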
In your case, the mutex is locked in each loop iteration just to perform this one instruction:
second_result[counter] = counter;
It looks like a fast one, so the "real" mutex may never be locked. That means that, in this case, your "mutex" can be as fast as the atomic-based solution (because it becomes an atomic-based solution itself).
Also, in the first solution you used a kind of spin-lock-like behaviour, but I am not sure this behaviour is predictable in a multi-threaded environment. I am pretty sure that "locking" should have acquire semantics, while unlocking is a release operation; relaxed memory ordering would be too weak for this use case.
I edited the code to be more compact and correct. It uses std::atomic_flag, which is the only type guaranteed to be lock-free (unlike the std::atomic<> specializations; even std::atomic<bool> does not give you that).
Also, referring to the comment below about "not yielding": it is a matter of the specific case and requirements. Spin locks are a very important part of multi-threaded programming, and their performance can often be improved by slightly modifying their behavior. For example, the Boost library implements spinlock::lock() as follows:
void lock()
{
    for( unsigned k = 0; !try_lock(); ++k )
    {
        boost::detail::yield( k );
    }
}
source: boost/smart_ptr/detail/spinlock_std_atomic.hpp
Where detail::yield() is (Win32 version):
inline void yield( unsigned k )
{
    if( k < 4 )
    {
    }
#if defined( BOOST_SMT_PAUSE )
    else if( k < 16 )
    {
        BOOST_SMT_PAUSE
    }
#endif
#if !BOOST_PLAT_WINDOWS_RUNTIME
    else if( k < 32 )
    {
        Sleep( 0 );
    }
    else
    {
        Sleep( 1 );
    }
#else
    else
    {
        // Sleep isn't supported on the Windows Runtime.
        std::this_thread::yield();
    }
#endif
}
[source: http://www.boost.org/doc/libs/1_66_0/boost/smart_ptr/detail/yield_k.hpp]
First, the thread spins for some fixed number of iterations (4 in this case). If the mutex is still locked, the pause instruction is used (if available) or Sleep(0) is called, which basically causes a context switch and allows the scheduler to give another blocked thread a chance to do something useful. Then, Sleep(1) is called to perform an actual (short) sleep. Very nice!
Also, this statement:
The purpose of a spinlock is busy waiting
is not entirely true. The purpose of a spinlock is to serve as a fast, easy-to-implement lock primitive, but it still needs to be written properly, with certain possible scenarios in mind. For example, Intel says (regarding Boost's usage of _mm_pause() as a method of yielding inside lock()):
In the spin-wait loop, the pause intrinsic improves the speed at which the code detects the release of the lock and provides especially significant performance gain.
So, implementations like
void lock() { while(m_flag.test_and_set(std::memory_order_acquire)); }
may not be as good as they seem.
There is an additional important issue related to your problem. An efficient spinlock never "spins" on an operation that involves (even potentially) modifying a memory location (such as exchange or test_and_set). On typical modern architectures, these operations generate instructions that force the cache line holding the lock into the exclusive state, which is extremely time-consuming (especially when multiple threads are spinning at the same time). Always spin on a load/read only, and try to acquire the lock only when there is a chance that the operation will succeed; a sketch of this "test and test-and-set" pattern follows below.
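Here is that sketch (my own illustration; it uses std::atomic<bool> because std::atomic_flag only gains a plain load/test in C++20):

#include <atomic>

class TtasSpinLock
{
    std::atomic<bool> _locked{false};

public:
    void lock()
    {
        for (;;) {
            // Attempt the (expensive) read-modify-write only when the lock looks free.
            if (!_locked.exchange(true, std::memory_order_acquire))
                return;
            // Otherwise spin on a plain load: the cache line can stay in the shared state.
            while (_locked.load(std::memory_order_relaxed))
            { /* optionally pause/yield here */ }
        }
    }

    void unlock()
    {
        _locked.store(false, std::memory_order_release);
    }
};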
A nice relevant article is, for instance, here: Correctly implementing a spinlock in C++

C++ - Threads without a coordinating mechanism like mutex_Lock

I attended an interview two days back. The interviewer was good at C++, but not at multithreading. He asked me to write multithreaded code with two threads, where one thread prints 1,3,5,... and the other prints 2,4,6,..., such that the combined output is 1,2,3,4,5,.... So I gave the pseudocode below:
mutex_Lock LOCK;
int last = 2;
int last_Value = 0;

void function_Thread_1()
{
    while(1)
    {
        mutex_Lock(&LOCK);
        if(last == 2)
        {
            cout << ++last_Value << endl;
            last = 1;
        }
        mutex_Unlock(&LOCK);
    }
}

void function_Thread_2()
{
    while(1)
    {
        mutex_Lock(&LOCK);
        if(last == 1)
        {
            cout << ++last_Value << endl;
            last = 2;
        }
        mutex_Unlock(&LOCK);
    }
}
After this, he said: "These threads will work correctly even without those locks; the locks will only reduce the efficiency." My point was that without the lock there will be situations where one thread checks (last == 1 or 2) at the same time the other thread is trying to change the value to 2 or 1. So my conclusion is that it may happen to work without the lock, but that is not a correct/standard way. Now I want to know who is correct, and on what basis?
Without the lock, running the two functions concurrently is undefined behaviour, because there's a data race on the accesses to last and last_Value. Moreover (though not causing UB), the printing would be unpredictable.
With the lock, the program becomes essentially single-threaded and is probably slower than the naive single-threaded code. But that's just the nature of the problem (i.e. producing a serialized sequence of events).
I think the interviewer might have thought about using atomic variables.
Each instantiation and full specialization of the std::atomic template defines an atomic type. Objects of atomic types are the only C++ objects that are free from data races; that is, if one thread writes to an atomic object while another thread reads from it, the behavior is well-defined.
In addition, accesses to atomic objects may establish inter-thread synchronization and order non-atomic memory accesses as specified by std::memory_order.
[Source]
By this I mean the only thing you should change is to remove the locks and change the last variable to std::atomic<int> last{2}; instead of int last = 2;
This should make it safe to access the last variable concurrently.
Out of curiosity I edited your code a bit and ran it on my Windows machine:
#include <iostream>
#include <atomic>
#include <thread>
#include <Windows.h>

std::atomic<int> last{2};
std::atomic<int> last_Value{0};
std::atomic<bool> running{true};

void function_Thread_1()
{
    while(running)
    {
        if(last == 2)
        {
            last_Value = last_Value + 1;
            std::cout << last_Value << std::endl;
            last = 1;
        }
    }
}

void function_Thread_2()
{
    while(running)
    {
        if(last == 1)
        {
            last_Value = last_Value + 1;
            std::cout << last_Value << std::endl;
            last = 2;
        }
    }
}

int main()
{
    std::thread a(function_Thread_1);
    std::thread b(function_Thread_2);

    while(last_Value != 6) { } // we want to print 1 to 6
    running = false;           // inform threads we are about to stop
    a.join();
    b.join();

    while(!GetAsyncKeyState('Q')) { } // wait for 'Q' press
    return 0;
}
and the output is always:
1
2
3
4
5
6
Note that Ideone refused to compile the original version of this snippet: std::atomic<int> last = 2; is copy-initialization, which GCC rejects before C++17 because std::atomic is not copyable; the brace-initialization used above avoids that.
Edit: And here is a working Linux version :) (thanks to soon)
The interviewer doesn't know what he is talking about. Without the locks you get data races on both last and last_Value. The compiler could, for example, reorder the assignment to last before the print and increment of last_Value, which could lead to the other thread executing on stale data. Furthermore, you could get interleaved output, meaning things like two numbers not being separated by a line break.
Another thing that could go wrong is that the compiler might decide not to reload last and (less importantly) last_Value on each iteration, since they can't (legally) change between those iterations anyway (data races are illegal by the C++11 standard and weren't acknowledged in previous standards). This means that the code suggested by the interviewer actually has a good chance of turning into an infinite loop that does absolutely nothing.
While it is possible to make that code correct without mutexes, that absolutely requires atomic operations with appropriate ordering constraints (release semantics on the assignment to last and acquire semantics on the load of last inside the if statement).
Of course your solution does lower efficiency, since it effectively serializes the whole execution. However, since the runtime is almost completely spent inside the stream output operation, which is almost certainly internally synchronized by locks anyway, your solution doesn't lower the efficiency any more than it already is lowered. Waiting on the lock in your code might actually be faster than busy-waiting for it, depending on the available resources (a non-locking version using atomics would absolutely tank when executed on a single-core machine).
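To illustrate that last point about ordering constraints, here is a minimal sketch of the lock-free variant (my own illustration, not code from the interview): the release store on last publishes the increment of the plain int last_Value, and the acquire load on the other side makes it visible.

#include <atomic>
#include <iostream>
#include <thread>

std::atomic<int> last{2};
int last_Value = 0; // plain int: ordered entirely by the release/acquire pair on last

void thread_1() // prints 1, 3, 5
{
    for (int i = 0; i < 3; ++i) {
        while (last.load(std::memory_order_acquire) != 2)
            ; // spin until it is our turn
        std::cout << ++last_Value << std::endl;   // safe: ordered by the acquire above
        last.store(1, std::memory_order_release); // publish and hand over
    }
}

void thread_2() // prints 2, 4, 6
{
    for (int i = 0; i < 3; ++i) {
        while (last.load(std::memory_order_acquire) != 1)
            ;
        std::cout << ++last_Value << std::endl;
        last.store(2, std::memory_order_release);
    }
}

int main()
{
    std::thread a(thread_1);
    std::thread b(thread_2);
    a.join();
    b.join();
}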