I attended an interview two days back. The interviewer was good at C++, but not at multithreading. He asked me to write code for two threads, where one thread prints 1, 3, 5, ... and the other prints 2, 4, 6, ..., but the combined output should be 1, 2, 3, 4, 5, .... So I gave the code below (pseudocode):
#include <iostream>
#include <mutex>

std::mutex LOCK;
int last = 2;
int last_Value = 0;

void function_Thread_1()
{
    while (true)
    {
        LOCK.lock();
        if (last == 2)          // it is this thread's turn
        {
            std::cout << ++last_Value << std::endl;
            last = 1;           // hand the turn to thread 2
        }
        LOCK.unlock();
    }
}

void function_Thread_2()
{
    while (true)
    {
        LOCK.lock();
        if (last == 1)          // it is this thread's turn
        {
            std::cout << ++last_Value << std::endl;
            last = 2;           // hand the turn back to thread 1
        }
        LOCK.unlock();
    }
}
After this, he said, "These threads will work correctly even without those locks. The locks only reduce the efficiency." My point was that without the lock, one thread may check last == 1 (or 2) at the same time the other thread is trying to change it to 2 (or 1). So my conclusion is that it may appear to work without the lock, but that is not a correct/standard way. Now I want to know: who is correct, and on what basis?
Without the lock, running the two functions concurrently would be undefined behaviour, because there is a data race on last and last_Value. Moreover (though not causing UB), the interleaving of the printed output would be unpredictable.
With the lock, the program becomes essentially single-threaded, and is probably slower than the naive single-threaded code. But that's just in the nature of the problem (i.e. to produce a serialized sequence of events).
I think the interviewer might have thought about using atomic variables.
Each instantiation and full specialization of the std::atomic template defines an atomic type. Objects of atomic types are the only C++ objects that are free from data races; that is, if one thread writes to an atomic object while another thread reads from it, the behavior is well-defined.
In addition, accesses to atomic objects may establish inter-thread synchronization and order non-atomic memory accesses as specified by std::memory_order.
[Source]
By this I mean the only thing you should change is to remove the locks and declare the last variable as std::atomic<int> last = 2; instead of int last = 2;
This should make it safe to access the last variable concurrently.
Out of curiosity I have edited your code a bit, and ran it on my Windows machine:
#include <iostream>
#include <atomic>
#include <thread>
#include <Windows.h>

std::atomic<int> last{2};        // brace-init: atomics aren't copy-initializable before C++17
std::atomic<int> last_Value{0};
std::atomic<bool> running{true};

void function_Thread_1()
{
    while (running)
    {
        if (last == 2)   // busy-wait until it is this thread's turn
        {
            last_Value = last_Value + 1;
            std::cout << last_Value << std::endl;
            last = 1;    // hand the turn to thread 2
        }
    }
}

void function_Thread_2()
{
    while (running)
    {
        if (last == 1)   // busy-wait until it is this thread's turn
        {
            last_Value = last_Value + 1;
            std::cout << last_Value << std::endl;
            last = 2;    // hand the turn back to thread 1
        }
    }
}

int main()
{
    std::thread a(function_Thread_1);
    std::thread b(function_Thread_2);
    while (last_Value != 6) {}   // we want to print 1 to 6
    running = false;             // inform the threads we are about to stop
    a.join();
    b.join();
    while (!GetAsyncKeyState('Q')) {}   // wait for a 'Q' press
    return 0;
}
and the output is always:
1
2
3
4
5
6
Ideone refuses to compile this code (it is Windows-specific, because of <Windows.h> and GetAsyncKeyState).
Edit: but here is a working Linux version :) (thanks to soon)
The interviewer doesn't know what he is talking about. Without the locks you get races on both last and last_Value. The compiler could, for example, reorder the assignment to last before the print and increment of last_Value, which could lead to the other thread executing on stale data. Furthermore, you could get interleaved output, e.g. two numbers not being separated by a line break.
Another thing that could go wrong is that the compiler might decide not to reload last (and, less importantly, last_Value) on each iteration, since they can't legally change between those iterations anyway (data races are illegal by the C++11 standard and aren't acknowledged in previous standards). This means the code suggested by the interviewer actually has a good chance of creating infinite loops that do absolutely nothing.
While it is possible to make that code correct without mutexes, doing so absolutely requires atomic operations with appropriate ordering constraints (release semantics on the assignment to last and acquire semantics on the load of last inside the if statement); see the sketch below.
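For illustration, here is a minimal sketch of such a lock-free variant (my own code, not from the interview). last acts as a turn flag: the release store publishes the update of last_Value, and the acquire load in the other thread picks it up, so the plain int needs no lock of its own:

#include <atomic>
#include <iostream>
#include <thread>

std::atomic<int> last{2};   // turn flag: 2 means the odd-printing thread may go
int last_Value = 0;         // plain int: ordered by the acquire/release pair on last

void print_odd()            // prints 1, 3, 5
{
    for (int i = 0; i < 3; ++i)
    {
        while (last.load(std::memory_order_acquire) != 2) {}  // spin until our turn
        std::cout << ++last_Value << std::endl;
        last.store(1, std::memory_order_release);  // publish last_Value, pass the turn
    }
}

void print_even()           // prints 2, 4, 6
{
    for (int i = 0; i < 3; ++i)
    {
        while (last.load(std::memory_order_acquire) != 1) {}
        std::cout << ++last_Value << std::endl;
        last.store(2, std::memory_order_release);
    }
}

int main()
{
    std::thread t1(print_odd);
    std::thread t2(print_even);
    t1.join();
    t2.join();
}

Note that both threads still spin, so this trades blocking for burned CPU cycles.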
Of course your solution does lower efficiency, due to effectively serializing the whole execution. However, since the runtime is almost completely spent inside the stream output operation, which is almost certainly internally synchronized by locks anyway, your solution doesn't lower the efficiency any more than it already is. Blocking on the lock in your code might actually be faster than busy-waiting for it, depending on the available resources (the non-locking version using atomics would absolutely tank when executed on a single-core machine).
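If you want the threads to actually block instead of spinning, std::condition_variable is the idiomatic tool. A minimal sketch of that variant (again my own illustration, not from the original exchange):

#include <condition_variable>
#include <iostream>
#include <mutex>
#include <thread>

std::mutex m;
std::condition_variable cv;
int last = 2;        // turn flag, protected by m
int last_Value = 0;  // also protected by m

void worker(int my_turn, int next_turn, int count)
{
    for (int i = 0; i < count; ++i)
    {
        std::unique_lock<std::mutex> lock(m);
        cv.wait(lock, [&] { return last == my_turn; });  // sleep until our turn
        std::cout << ++last_Value << std::endl;
        last = next_turn;
        cv.notify_one();   // wake the other thread
    }
}

int main()
{
    std::thread t1(worker, 2, 1, 3);  // prints 1, 3, 5
    std::thread t2(worker, 1, 2, 3);  // prints 2, 4, 6
    t1.join();
    t2.join();
}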
I have made multiple runs of the program below, and I never see incorrect output, even though I do not use the mutex. My goal is to demonstrate the need for a mutex. My expectation was that output from different threads, with different "num" values, would get mixed up.
Is it because the objects are different?
#include <chrono>
#include <cmath>
#include <future>
#include <iostream>
#include <mutex>
#include <thread>
#include <vector>

using namespace std;

using VecI = std::vector<int>;

class UseMutexInClassMethod {
    mutex m;   // note: present, but never actually locked in compute()
public:
    VecI compute(int num, VecI veci)
    {
        VecI v;
        num = 2 * num - 1;
        for (auto& x : veci) {
            v.emplace_back(pow(x, num));
            std::this_thread::sleep_for(std::chrono::seconds(1));
        }
        return v;
    }
};

void TestUseMutexInClassMethodUsingAsync()
{
    const int nthreads = 5;
    UseMutexInClassMethod useMutexInClassMethod;
    VecI vec{ 1,2,3,4,5 };
    std::vector<std::future<VecI>> futures(nthreads);
    std::vector<VecI> outputs(nthreads);
    for (decltype(futures)::size_type i = 0; i < nthreads; ++i) {
        futures[i] = std::async(&UseMutexInClassMethod::compute,
                                &useMutexInClassMethod,
                                i, vec);
    }
    for (decltype(futures)::size_type i = 0; i < nthreads; ++i) {
        outputs[i] = futures[i].get();
        for (auto& x : outputs[i])
            cout << x << " ";
        cout << endl;
    }
}
If you want an example that fails with a high degree of certainty, look at the code below. It sets up a variable called accumulator that is shared by reference with all the futures. This is what is missing in your example: you are not actually sharing any memory. Make sure you understand the difference between passing by reference and passing by value.
#include <vector>
#include <memory>
#include <thread>
#include <future>
#include <iostream>
#include <cmath>
#include <mutex>
#include <chrono>
#include <stdexcept>

struct UseMutex {
    int compute(std::mutex& m, int& num)
    {
        for (size_t j = 0; j < 1000; j++)
        {
            //////////////////////
            // CRITICAL SECTION //
            //////////////////////
            // This code currently doesn't throw the exception
            // because of the lock on the mutex. If you comment
            // out the single line below then the exception *may*
            // get thrown.
            std::scoped_lock lock{m};
            num++;
            std::this_thread::sleep_for(std::chrono::nanoseconds(1));
            num++;
            if (num % 2 != 0)
                throw std::runtime_error("bad things happened");
        }
        return 0;
    }
};

template <typename T> struct F;

void TestUseMutexInClassMethodUsingAsync()
{
    const int nthreads = 16;
    int accumulator = 0;
    std::mutex m;
    std::vector<UseMutex> vs{nthreads};
    std::vector<std::future<int>> futures(nthreads);
    for (auto i = 0; i < nthreads; ++i) {
        futures[i] = std::async([&, i]() { return vs[i].compute(m, accumulator); });
    }
    for (auto i = 0; i < nthreads; ++i) {
        futures[i].get();
    }
}

int main() {
    TestUseMutexInClassMethodUsingAsync();
}
You can comment / uncomment the line
std::scoped_lock lock{m};
which protects the increment of the shared variable num. The rule for this mini program is that at the line
if(num%2!=0)
throw std::runtime_error("bad things happened");
num should be a multiple of two. But as multiple threads are accessing this variable without a lock, you can't guarantee this. However, if you add a lock around the double increment and test, then you can be sure no other thread is accessing this memory for the duration of the increment and test.
Failing
https://godbolt.org/z/sojcs1WK9
Passing
https://godbolt.org/z/sGdx3x3q3
Of course the failing one is not guaranteed to fail but I've set it up so that it has a high probability of failing.
Notes
[&,i](){return vs[i].compute(m,accumulator);};
is a lambda (an inline function). The notation [&,i] means it captures everything by reference except i, which it captures by value. This is important because i changes on each loop iteration, and we want each future to get a unique value of i; the snippet below illustrates the difference.
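As an illustration (hypothetical code, not from the answer above), this is the classic bug that the by-value capture of i avoids:

#include <functional>
#include <iostream>
#include <vector>

int main()
{
    std::vector<std::function<int()>> fs;
    for (int i = 0; i < 3; ++i) {
        fs.push_back([&, i] { return i; });   // i captured by value: snapshots 0, 1, 2
        // with plain [&] instead, each call would read the loop variable
        // after it has gone out of scope (undefined behaviour)
    }
    for (auto& f : fs)
        std::cout << f() << " ";              // prints "0 1 2"
    std::cout << std::endl;
}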
Is it because the objects are different?
Yes.
Your code is actually perfectly thread-safe; there is no need for a mutex here. You never share any state between threads, except for copying vec from TestUseMutexInClassMethodUsingAsync to compute via std::async (and copying is thread-safe) and moving the computation result from compute's return value to futures[i].get()'s return value. .get() is also thread-safe: it blocks until the compute() method terminates and then returns its computation result.
It's actually nice to see that even a deliberate attempt to get a race condition failed :)
You probably have to fully redo your example to demonstrate how simultaneous* access to a shared object breaks things. Get rid of std::async and std::future, use a simple std::thread with capture-by-reference, remove sleep_for (so both threads do a lot of operations instead of one per second), significantly increase the number of operations, and you will get a visible race; a sketch follows the footnote below. It may look like a crash, though.
* - yes, I'm aware that "wall-clock simultaneous access" does not exist in multithreaded systems, strictly speaking. However, it helps to get a rough idea of where to look for visible race conditions for demonstration purposes.
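For instance, a minimal sketch of such a reworked demo (my own code, hypothetical names): two plain std::threads hammer a shared counter through a capture-by-reference lambda, with no mutex, and the lost updates are usually visible immediately:

#include <iostream>
#include <thread>

int main()
{
    int counter = 0;                       // shared and unsynchronized: data race (UB)
    auto bump = [&counter] {
        for (int i = 0; i < 1000000; ++i)
            ++counter;                     // racy read-modify-write
    };
    std::thread t1(bump);
    std::thread t2(bump);
    t1.join();
    t2.join();
    // Expected 2000000; typically prints a smaller number because increments are lost.
    std::cout << counter << std::endl;
}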
Comments have already called out the fact that merely leaving a critical section unprotected does not guarantee that the risked behavior actually occurs.
That also applies across multiple runs: while you are not allowed to test a few times and then rely on the repeatedly observed behavior, optimization mechanisms are likely to make the same observation recur often enough for it to be perceived as reproducible.
If you intend to demonstrate the need for synchronization, you need to employ synchronization yourself, to poise things toward a near-guaranteed, observable misbehavior in the unprotected case.
Allow me to only outline a sequence for that, with a few assumptions about the scheduling mechanism (this is based on a rather simple, single-core, priority-based scheduling environment I encountered professionally in an embedded setting), just to give an insight via a simplified example:
start a lower-priority context
optionally set up proper protection before entering the critical section
start the critical section, e.g. by outputting the first half of the to-be-continuous output
asynchronously trigger a higher-priority context which does the thing that can violate your critical section, e.g. output something which should not appear in the middle of the two-part output of the critical section
(in the protected case the other context is not executed yet, in spite of its higher priority)
(in the unprotected case the other context is executed now, because of its higher priority)
end the critical section, e.g. by outputting the second half of the to-be-continuous output
optionally remove the protection after leaving the critical section
(in the protected case the other context is executed now, now that it is allowed to)
(in the unprotected case the other context was already executed)
Note:
I am using the term "critical section" to mean a piece of code which is vulnerable to being interrupted/preempted/descheduled by another piece of code, or by another execution of the same code. Specifically, for me a critical section can exist without protection being applied, though that is not a good thing. I state this explicitly because I am aware of the term being used with the meaning "piece of code inside applied protection/synchronization". I disagree, but I accept that the term is used differently and requires clarification in case of potential conflicts.
I have code at work that starts multiple threads which do some operations, and if any of them fails it sets a shared variable to false.
Then the main thread joins all the worker threads. A simulation of this looks roughly like the following (I commented out the possible fix, which I don't know whether it's needed):
#include <thread>
#include <atomic>
#include <vector>
#include <iostream>
#include <cassert>

using namespace std;

//atomic_bool success = true;
bool success = true;

int main()
{
    vector<thread> v;
    for (int i = 0; i < 10; ++i)
    {
        v.emplace_back([=]
        {
            if (i == 5 || i == 6)
            {
                //success.store(false, memory_order_release);
                success = false;
            }
        });
    }
    for (auto& t : v)
        t.join();
    //assert(success.load(memory_order_acquire) == false);
    assert(success == false);
    cout << "Finished" << endl;
    cin.get();
    return 0;
}
Is there a possibility that the main thread will read the success variable as true even though one of the workers set it to false?
I found that thread::join() is a full memory barrier (source), but does that imply a synchronizes-with relationship with the subsequent read of the success variable from the main thread, so that we're guaranteed to get the newest value?
Is the fix I posted (in the commented code) necessary in this case (or maybe another fix, if this one is wrong)?
Is there a possibility that the read of the success variable will be optimized away (since it's not volatile) and we will get the old value, regardless of the implicit memory barrier that is supposed to exist on thread::join?
The code is supposed to work on multiple architectures (I cannot remember all of them; I don't have the makefile in front of me), but there are at least x86, amd64, itanium and arm7.
Thanks for any help with this.
Edit: I've modified the example, because in the real situation more than one thread can try to write to the success variable.
The code above represents a data race, and the use of join cannot change that fact. If only one thread wrote to the variable, it would be fine. But you have two threads writing to it, with no synchronization between them. That's a data race.
join simply means "all side effects of that thread's operation have completed and are now visible to you." That does not create ordering or synchronization between that thread and any thread other than your own.
If you used an atomic_bool, then it wouldn't be UB; it would be guaranteed to be false. But because there is a data race, you get pure UB. It might be true, false, or nasal demons.
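For completeness, here is a minimal sketch of the fixed version, i.e. the question's commented-out lines made active. I believe even relaxed ordering suffices here, because join() itself provides the synchronization with the main thread; the atomic is only needed to remove the write-write race between the workers:

#include <thread>
#include <atomic>
#include <vector>
#include <cassert>

std::atomic<bool> success{true};   // atomic: concurrent writers no longer race

int main()
{
    std::vector<std::thread> v;
    for (int i = 0; i < 10; ++i)
        v.emplace_back([=]
        {
            if (i == 5 || i == 6)
                success.store(false, std::memory_order_relaxed);  // join() supplies the ordering
        });
    for (auto& t : v)
        t.join();   // completion of each thread synchronizes-with its join
    assert(success.load(std::memory_order_relaxed) == false);
    return 0;
}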
The following is a snippet of a larger program and is done using Pthreads.
UpdaterFunction reads from a text file. FunctionMap is just used to output (key,1). Essentially, UpdaterFunction and FunctionMap run on different threads.
#include <fstream>
#include <iostream>
#include <list>
#include <queue>
#include <sstream>
#include <string>
#include <cstdio>
#include <pthread.h>
#include <unistd.h>

using namespace std;

queue<list<string>::iterator> mapperpool;

void *UpdaterFunction(void* fn) {
    std::string *x = static_cast<std::string*>(fn);
    string filename = *x;
    ifstream file(filename.c_str());
    string word;
    list<string> letterwords[50];
    char alphabet = '0';
    bool times = true;
    int charno = 0;
    while (file >> word) {
        if (times) {
            alphabet = *(word.begin());
            times = false;
        }
        if (alphabet != *(word.begin())) {
            alphabet = *(word.begin());
            mapperpool.push(letterwords[charno].begin());
            letterwords[charno].push_back("xyzzyspoon");  // end-of-list sentinel
            charno++;
        }
        letterwords[charno].push_back(word);
    }
    file.close();
    cout << "UPDATER DONE!!" << endl;
    pthread_exit(NULL);
}

void *FunctionMap(void *i) {
    long num = (long)i;
    stringstream updaterword;
    string toQ;
    int charno = 0;
    fprintf(stderr, "Print me %ld\n", num);
    sleep(1);
    while (!mapperpool.empty()) {
        list<string>::iterator it = mapperpool.front();
        while (*it != "xyzzyspoon") {   // walk the list up to the sentinel
            cout << "(" << *it << ",1)" << "\n";
            cout << *it << "\n";
            it++;
        }
        mapperpool.pop();
    }
    pthread_exit(NULL);
}
If I add the while(!mapperpool.empty()) loop to UpdaterFunction then it gives me the perfect output. But when I move it back to FunctionMap it gives me weird output and segfaults later.
Output when used in UpdaterFunction:
Print me 0
course
cap
class
culture
class
cap
course
course
cap
culture
concurrency
.....
[Each word in separate line]
Output when used in FunctionMap (snippet shown above):
Print me 0
UPDATER DONE!!
(course%0+�0#+�0�+�05P+�0����cap%�+�0�+�0,�05�+�0����class5P?�0
����xyzzyspoon%�+�0�+�0(+�0%P,�0,�0�,�05+�0����class%p,�0�,�0-�05�,�0����cap%�,�0�,�0X-�05�,�0����course%-�0 -�0�-�050-�0����course%-�0p-�0�-�05�-�0����cap%�-�0�-�0H.�05�-�0����culture%.�0.�0�.�05 .�0
����concurrency%P.�0`.�0�.�05p.�0����course%�.�0�.�08/�05�.�0����cap%�.�0/�0�/�05/�0Segmentation fault (core dumped)
How do I fix this issue?
list<string> letterwords[50] is local to UpdaterFunction. When UpdaterFunction finishes, all its local variables get destroyed. When FunctionMap then inspects an iterator, that iterator already points to freed memory.
When you insert while(!mapperpool.empty()), UpdaterFunction waits for FunctionMap to complete, so letterwords stays alive.
Here essentially UpdateFunction and FunctionMap run on different threads.
And since they both manipulate the same object (mapperpool) and neither of them uses pthread_mutex or std::mutex (C++11), you have a data race. If you have a data race, you have undefined behaviour, and the program may do whatever it wants. Most likely it will write garbage all over memory until it eventually crashes, exactly as you see.
How do I fix this issue?
By locking the mapperpool object.
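A minimal sketch of what that could look like in the posted code (the mutex and the two helper functions are my additions; mapperpool is from the question):

#include <pthread.h>
#include <queue>
#include <list>
#include <string>

std::queue<std::list<std::string>::iterator> mapperpool;
pthread_mutex_t pool_mutex = PTHREAD_MUTEX_INITIALIZER;  // guards mapperpool

// producer side (called from UpdaterFunction)
void push_entry(std::list<std::string>::iterator it) {
    pthread_mutex_lock(&pool_mutex);
    mapperpool.push(it);
    pthread_mutex_unlock(&pool_mutex);
}

// consumer side (called from FunctionMap): check and pop under the same lock
bool pop_entry(std::list<std::string>::iterator& out) {
    pthread_mutex_lock(&pool_mutex);
    bool has = !mapperpool.empty();
    if (has) {
        out = mapperpool.front();
        mapperpool.pop();
    }
    pthread_mutex_unlock(&pool_mutex);
    return has;
}

This alone fixes only the race on mapperpool; the lifetime problem with letterwords described above still has to be solved separately.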
Why is list not thread-safe?
Well, in the vast majority of use-cases a single list (or any other collection) won't be used by more than one thread. In a significant part of the rest, the lock will have to extend over more than one operation on the collection, so the client will have to do its own locking anyway. The remaining tiny percentage of cases, where locking inside the operations themselves would help, is not worth adding the overhead for everyone; a key C++ design principle is that you only pay for what you use.
The collections are only reentrant, meaning that using different instances in parallel is safe.
Note on pthreads
C++11 introduced a threading library that integrates well with the language. Most notably, it uses RAII for locking std::mutex, via std::lock_guard, std::unique_lock and (since C++14) std::shared_lock for reader-writer locking. Consistently using these can eliminate a large class of locking bugs that otherwise take considerable time to debug.
If you can't use C++11 yet (on desktop you can, but some embedded platforms have not received a compiler update yet), you should first consider Boost.Thread, as it provides the same benefits.
If you can't use even that, still try to find, or write, a simple RAII wrapper for locking like the ones C++11/Boost have. The basic wrapper is just a couple of lines, but it will save you a lot of debugging; an example sketch follows below.
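For example, a bare-bones RAII wrapper over pthread_mutex_t could look like this (a sketch of the idea, not production code):

#include <pthread.h>

// Locks in the constructor, unlocks in the destructor, so the mutex is
// released on every path out of the scope (including exceptions).
class ScopedLock {
public:
    explicit ScopedLock(pthread_mutex_t& m) : m_(m) { pthread_mutex_lock(&m_); }
    ~ScopedLock() { pthread_mutex_unlock(&m_); }
private:
    ScopedLock(const ScopedLock&);            // non-copyable (pre-C++11 style)
    ScopedLock& operator=(const ScopedLock&);
    pthread_mutex_t& m_;
};

// Usage:
//   pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
//   { ScopedLock guard(m); /* critical section */ }   // unlocked automatically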
Note that C++11 and Boost also have an atomic operations library, which pthreads sorely lacks.
I am working on a recursive algorithm which we want to parallelize to improve the performance.
I implemented multithreading using Visual C++ 12.0 and the <thread> library. However, I don't see any performance improvement: the time taken is either less by a few milliseconds or more than the time with a single thread.
Kindly let me know if I am doing something wrong and what corrections I should make to the code.
Here is my code:
void nonRecursiveFoo(<className>& data, int first, int last)
{
    // process the data between first and last index and set its value to true
    // based on some condition; no threads are created here
}

void recursiveFoo(<className>& data, int first, int last)
{
    int partitionIndex = -1;
    data[first] = true;
    data[last] = true;
    for (int i = first + 1; i < last; i++)
    {
        // some logic setting the index
        if (/* some condition is true */)
            partitionIndex = i;
    }
    // no dependency of partitions on one another, so this can be parallelized
    if (partitionIndex != -1)
    {
        data[partitionIndex] = true;
        // assume some thread limit
        if (Commons::GetCurrentThreadCount() < Commons::GetThreadLimit())
        {
            std::thread t1(recursiveFoo, std::ref(data), first, partitionIndex);
            Commons::IncrementCurrentThreadCount();
            recursiveFoo(data, partitionIndex, last);
            t1.join();
        }
        else
        {
            nonRecursiveFoo(data, first, partitionIndex);
            nonRecursiveFoo(data, partitionIndex, last);
        }
    }
}

//main
int main()
{
    recursiveFoo(data, 0, data.size - 1);
}

//commons
std::mutex threadCountMutex;

static void Commons::IncrementCurrentThreadCount()
{
    threadCountMutex.lock();
    CurrentThreadCount++;
    threadCountMutex.unlock();
}

static int GetCurrentThreadCount()
{
    return CurrentThreadCount;
}

static void SetThreadLimit(int count)
{
    ThreadLimit = count;
}

static int GetThreadLimit()
{
    return ThreadLimit;
}

static int GetMinPointsPerThread()
{
    return MinimumPointsPerThread;
}
Without further information (see comments) this is mostly guesswork, but there are a few things you should watch out for:
First of all, make sure that your partitioning logic is very short and fast compared to the processing. Otherwise, you are just creating more work than you gain processing power.
Make sure there is enough work to begin with, or the speedup might not be enough to pay for the additional overhead of thread creation.
Check that your work gets evenly distributed among the different threads and don't spawn more threads than you have cores on your computer (print the number of total threads at the end - don't rely on your ThreadLimit).
Don't let your partitions get too small (especially not less than 64 bytes), or you will end up with false sharing.
It would be MUCH more efficient to implement CurrentThreadCount as a std::atomic<int>, in which case you don't need a mutex.
Put the increment of the counter before the creation of the thread. Otherwise, the newly created thread might read the counter before it is incremented and spawn a new thread again, even if the max number of threads is already reached (this is still not a perfect solution, but I would only invest more time in it once you have verified that overcommitting is your actual problem). A sketch combining this point and the previous one follows this list.
If you really must use a mutex (for reasons outside of the example code), you have to use it for every access to CurrentThreadCount (read and write access). Otherwise this is, strictly speaking, a race condition and thus UB.
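To combine the two points above, here is a minimal sketch of the atomic counter (my own illustration; the Commons name is from the question, the limit value and helper are made up):

#include <atomic>
#include <thread>

namespace Commons {
    std::atomic<int> CurrentThreadCount{1};   // the main thread counts as one
    const int ThreadLimit = 8;                // assumed limit

    // Atomically reserve a slot for a new thread. fetch_add returns the
    // previous value, so the check and the increment form one atomic step:
    // no window for another thread to sneak past the limit.
    bool TryReserveThread() {
        if (CurrentThreadCount.fetch_add(1) < ThreadLimit)
            return true;
        CurrentThreadCount.fetch_sub(1);      // over the limit: undo and decline
        return false;
    }
}

// Usage inside recursiveFoo:
//   if (Commons::TryReserveThread()) {
//       std::thread t1(recursiveFoo, std::ref(data), first, partitionIndex);
//       ...
//   } else { nonRecursiveFoo(...); }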
By using t1.join you're basically waiting for the other thread to finish - i.e. not doing anything in parallel.
Looking at your algorithm, I don't see how it can be parallelized (and thus improved) by using threads - you have to wait for each single recursive call to end.
First of all, you are not doing anything in parallel, as every thread creation blocks until the created thread has finished. Hence your multithreaded code will always be slower than the non-multithreaded version.
In order to parallelize, you could spawn threads for the part where the non-recursive function is called, put the thread handles into a vector, and join them at the highest level of the recursion by walking through the vector. (There are more elegant ways to do that, but for a first shot this should be OK, I think.)
Thus all non-recursive calls will run in parallel. But you should use a different condition than the max number of threads - namely the size of the problem, e.g. last - first < threshold. A sketch follows below.
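A rough sketch of that idea (my own code; Data, findPartition, the threshold and the forward-declared workers are stand-ins for the question's types and logic):

#include <thread>
#include <vector>

using Data = std::vector<char>;   // stand-in for the question's <className>

// stand-ins for the question's processing and partitioning logic
void nonRecursiveFoo(Data& data, int first, int last);
int findPartition(Data& data, int first, int last);

void parallelFoo(Data& data, int first, int last, std::vector<std::thread>& workers)
{
    const int threshold = 1024;   // assumed cutoff: split by problem size, not thread count
    if (last - first < threshold) {
        // small enough: hand the actual work to a thread and return
        workers.emplace_back(nonRecursiveFoo, std::ref(data), first, last);
        return;
    }
    int partitionIndex = findPartition(data, first, last);  // the question's loop
    if (partitionIndex == -1)
        return;
    data[partitionIndex] = true;
    parallelFoo(data, first, partitionIndex, workers);      // recurse serially;
    parallelFoo(data, partitionIndex, last, workers);       // threads only at the leaves
}

void run(Data& data)
{
    std::vector<std::thread> workers;
    parallelFoo(data, 0, (int)data.size() - 1, workers);
    for (auto& t : workers)   // join everything once, at the top level
        t.join();
}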
In the following code example, program execution never ends.
It creates a thread which waits for a global bool to be set to true before terminating. There is only one writer and one reader. I believe that the only situation that allows the loop to continue running is if the bool variable is false.
How is it possible that the bool variable ends up in an inconsistent state with just one writer?
#include <iostream>
#include <pthread.h>
#include <unistd.h>

bool done = false;

void * threadfunc1(void *) {
    std::cout << "t1:start" << std::endl;
    while(!done);
    std::cout << "t1:done" << std::endl;
    return NULL;
}

int main()
{
    pthread_t threads;
    pthread_create(&threads, NULL, threadfunc1, NULL);
    sleep(1);
    done = true;
    std::cout << "done set to true" << std::endl;
    pthread_exit(NULL);
    return 0;
}
There's a problem in the sense that this statement in threadfunc1():
while(!done);
can be implemented by the compiler as something like:
a_register = done;
label:
if (a_register == 0) goto label;
So updates to done will never be seen.
There is really nothing that prevents the compiler from optimizing the while-loop away. Use an atomic or a mutex to access the bool from more than one thread. That is the only supported and correct solution. As you are using POSIX, a mutex would be the right solution in this case.
And don't use volatile. There is a POSIX standard that states what has to work, and volatile is not a solution that is guaranteed to work.
And there is another problem: there is no guarantee that your newly created thread ever gets to run before you set the flag to true.
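For reference, here is a minimal fixed version of the question's code using std::atomic (a sketch; a mutex or condition variable would work just as well):

#include <atomic>
#include <iostream>
#include <pthread.h>
#include <unistd.h>

std::atomic<bool> done{false};   // atomic: the loop must now really reload it

void * threadfunc1(void *) {
    std::cout << "t1:start" << std::endl;
    while (!done.load(std::memory_order_acquire)) {}  // can no longer be optimized away
    std::cout << "t1:done" << std::endl;
    return NULL;
}

int main()
{
    pthread_t t;
    pthread_create(&t, NULL, threadfunc1, NULL);
    sleep(1);
    done.store(true, std::memory_order_release);
    std::cout << "done set to true" << std::endl;
    pthread_exit(NULL);
}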
For such a simple example, volatile is enough. But for the vast majority of real-world situations it is not. Use a condition variable for this task. They look weird at first glance, but they are actually quite logical. On x86, a bool IS atomic to read/write (on ARM, probably not). Also, there is an obstacle with vector<bool>: it is NOT a vector of bools, it is a bitfield. To write a vector of flags from several threads, use vector<char> (or bool arr[SIZE]) instead.
Also, you don't join the thread; that is wrong.
A race condition means two threads are accessing the same object, and at least one of them is a write.
That means there are two types of races: write-write conflicts and read-write conflicts.
Back to your code: you essentially have two threads, one being the main thread and the other being the one you created with pthread_create.
One of them does a read, while(!done), and one of them does a write, done = true.
So you have a race condition for sure.
Is a race condition possible when only one thread writes to a bool variable in C++?
Yes. In your case, the main thread is also a thread (i.e. you have one thread writing and one thread reading).
How is it possible that the bool variable ends up in an inconsistent state with just one writer?
The compiler is (or should be) an optimizing compiler. It will probably optimize the reading of the done variable away, unless you take care to avoid that (use std::atomic<bool> done instead).
It's not guaranteed that the assignment to a bool, which is one byte, is atomic.