C++ Multithreading, Mutex - c++

Some time ago I was trying to speed up one of my functions with multithreading. The base function takes around 15 seconds to finish and I would like to reduce that, but I cannot figure out how to write a correct, working multithreaded version.
Base function, before any changes:
void FirstCall()
{
MainFunction1();
MainFunction2();
}
void MainFunction1()
{
//Call another functions, MainFunction3-10 for example
}
void MainFunction2()
{
//Call other, different functions, in a for loop
}
In this case, the function needs around 15 seconds to finish.
The approach I found for speeding it up was multithreading.
Let me show how it looks right now, and what my problem with it is.
//Way 1 of multithreading
void FirstCall()
{
std::vector<std::thread> threads;
threads.push_back(std::thread(&MainFunction1, this));
threads.push_back(std::thread(&MainFunction2, this));
for (auto& th : threads)
{
if (th.joinable())
{
th.join();
}
}
}
The other functions are exactly the same, so they should not affect the runtime. With the version shown above the runtime is around 8-10 seconds, so it seems to work, but sometimes the application simply closes when this function is called.
//Way 2 of multithreading
void FirstCall()
{
static std::mutex s_mutex;
static std::atomic<int> thread_number = 0;
auto MainFunctions = [&](int index)
{
SwitchMainFunctions(index);
};
auto ThreadFunction = [&]()
{
std::lock_guard<std::mutex> lGuard (s_mutex);
MainFunctions(thread_number++);
};
int thread_count = std::thread::hardware_concurrency(); //8
//thread count > function count (2 functions)
std::vector<std::thread> threads;
for (int i = 0; i < 2; i++)
{
threads.push_back(std::thread(ThreadFunction));
}
for (auto& th : threads)
{
if (th.joinable())
{
th.join();
}
}
threads.clear();
}
void SwitchMainFunctions(int index)
{
switch(index)
{
case 0:
{
MainFunction1();
}
break;
case 1:
{
MainFunction2();
}
break;
default:
{
return;
}
break;
}
}
The version presented as way 2 works, and my application no longer crashes, but the runtime is the same as for the untouched function: ~15 seconds.
I think the mutex lock forces each thread to wait until the other one finishes, so it behaves exactly like the original code, but I would really like to speed up the function.
To summarize: I tried to speed up my function with multithreading, but the two approaches I tried have different problems.
The first one sometimes crashes my application when the function is called.
The second one has the same runtime as the original function without multithreading.

Your second option is far more complicated than the first one. Here is a simple
FirstCall():
void FirstCall()
{
std::vector<std::thread> threads;
threads.push_back(std::thread(MainFunction1)); // `this` removed, assuming MainFunction1 is a free function taking no arguments
threads.push_back(std::thread(MainFunction2));
for (auto& th : threads)
{
if (th.joinable())
{
th.join();
}
}
}
In this simple scenario the main thread blocks until both threads have finished and been joined. In your first option you passed this as an argument for MainFunction1 in the thread constructor. That implies FirstCall() is a member function of a class.
In that case you should add the whole class definition to your question, or at least the scope of MainFunction1/2. That would help explain why the application simply closes.
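For reference, a minimal sketch of how threads are usually started on member functions, assuming MainFunction1/2 really are members. The class name Processor and the result1/result2 members are hypothetical, since the question does not show the real class:

```cpp
#include <cassert>
#include <thread>

// Hypothetical class; the real one is not shown in the question.
class Processor {
public:
    void FirstCall() {
        // A pointer to member function must be class-qualified, and the
        // object it should run on is passed as the next argument.
        std::thread t1(&Processor::MainFunction1, this);
        std::thread t2(&Processor::MainFunction2, this);
        t1.join();
        t2.join();
    }

    int result1 = 0;
    int result2 = 0;

private:
    // Each thread writes only its own member, so no lock is needed here.
    void MainFunction1() { result1 = 1; }
    void MainFunction2() { result2 = 2; }
};
```

If the two member functions touch the same data without synchronization, that would be a data race and could explain sporadic crashes, but that is only a guess without the real class.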
Your second option is worse than the single-threaded version, since lGuard unlocks only when the thread has finished executing its whole function, so the two threads run one after the other.
auto ThreadFunction = [&]()
{
std::lock_guard<std::mutex> lGuard (s_mutex);
MainFunctions(thread_number++);
//MainFunctions calls SwitchMainFunctions
//SwitchMainFunctions calls MainFunction
//when done lGuard unlocks on destruction
};
Another problem with the second option is that it is unclear why you need a mutex at all. If you insist on mapping the std::atomic thread_number to a specific function, simply pass the result of the atomic increment to SwitchMainFunctions in the thread constructor.
void FirstCall()
{
std::atomic<int> thread_number{0}; // brace-init: copy-initialization of std::atomic is ill-formed before C++17
std::vector<std::thread> threads;
for (int i = 0; i < 2; i++)
{
threads.push_back(std::thread(SwitchMainFunctions, thread_number++));
}
for (auto& th : threads)
{
if (th.joinable())
{
th.join();
}
}
}
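If the threads do have to share some data, the lock only needs to cover the shared update, not the whole task. A minimal sketch under that assumption; SumRange, Worker and ParallelSum are hypothetical stand-ins for the real functions:

```cpp
#include <cassert>
#include <mutex>
#include <thread>

std::mutex g_total_mutex;
long long g_total = 0;  // shared between the worker threads

// Independent work: touches no shared state, so it needs no lock.
long long SumRange(int from, int to) {
    long long sum = 0;
    for (int i = from; i < to; ++i) sum += i;
    return sum;
}

void Worker(int from, int to) {
    long long partial = SumRange(from, to);  // runs in parallel with the other thread
    std::lock_guard<std::mutex> guard(g_total_mutex);  // held only for the update
    g_total += partial;
}

// Sums 0..n-1 using two threads.
long long ParallelSum(int n) {
    g_total = 0;
    std::thread a(Worker, 0, n / 2);
    std::thread b(Worker, n / 2, n);
    a.join();
    b.join();
    return g_total;
}
```

Here the lock_guard is taken after the expensive part, so the two threads only serialize on a single addition.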

Related

Handle mutex lock in callback c++

I've got a Timer class that can run with both an initial time and an interval. There's an internal function, internalQuit, that performs thread.join() before a thread is started again in resetCallback. Each public function has its own std::lock_guard on the mutex to prevent concurrent writes to the data. I'm now running into an issue when the callback itself calls into the timer, for example to stop it: the mutex cannot be locked by stop(). I'm hoping to get some help on how to tackle this issue.
class Timer
{
public:
Timer(string_view identifier, Function &&timeoutHandler, Duration initTime, Duration intervalTime);
void start();
void stop() // for example
{
std::lock_guard lock{mutex};
running = false;
sleepCv.notify_all();
}
void setInitTime();
void setIntervalTime();
void resetCallback(Function &&timeoutHandler)
{
internalQuit();
{
std::lock_guard lock{mutex};
quit = false;
}
startTimerThread(std::forward<Function>(timeoutHandler));
}
private:
void internalQuit() // performs thread join
{
{
std::lock_guard lock {mutex};
quit = true;
running = false;
sleepCv.notify_all();
}
thread.join();
}
void mainLoop(Function &&timeoutHandler)
{
while(!quit)
{
std::unique_lock lock{mutex};
// wait for running with sleepCv.wait()
// handle initTimer with sleepCv.wait_until()
timeoutHandler(); // callback
// handle intervalTimer with sleepCv.wait_until()
timeoutHandler(); // callback
}
}
void startTimerThread(Function &&timeoutHandler)
{
thread = std::thread([&, timeoutHandler = std::forward<Function>(timeoutHandler)](){
mainLoop(timeoutHandler);
});
}
std::thread thread{};
std::mutex mutex{};
std::condition_variable sleepCv{};
// initTime, intervalTime and some booleans for updating with sleepCv.notify_all();
};
For testing this, I have the following test case in GoogleTest. I'm expecting the timer to stop in the callback. Unfortunately, the timer hangs while acquiring the mutex lock in the stop() function.
std::atomic<int> callbackCounter;
void timerCallback()
{
callbackCounter.fetch_add(1, std::memory_order_acq_rel);
}
TEST(timerTest, timerShouldStopWhenStoppedInNewCallback)
{
std::atomic<int> testCounter{0};
Timer<std::chrono::steady_clock > t{"timerstop", &timerCallback, std::chrono::milliseconds(0), std::chrono::milliseconds(100)};
t.resetCallback([&]{
testCounter += 1;
t.stop();
});
t.start();
sleepMilliSeconds(100);
ASSERT_EQ(testCounter.load(), 1); // trigger due to original interval timeout
sleepMilliSeconds(100);
ASSERT_EQ(testCounter.load(), 1); // no trigger, because stopped in new callback
}
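The hang is consistent with mainLoop holding the std::unique_lock while it calls timeoutHandler(): stop() then tries to lock the same non-recursive mutex on the same thread. One common way around this is to release the lock for the duration of the callback. A minimal sketch of just that locking pattern, assuming a hypothetical MiniTimer that has no thread and no condition variable:

```cpp
#include <cassert>
#include <functional>
#include <mutex>
#include <utility>

// Hypothetical, stripped-down stand-in for the Timer above.
class MiniTimer {
public:
    explicit MiniTimer(std::function<void()> handler)
        : handler_(std::move(handler)) {}

    void stop() {
        std::lock_guard<std::mutex> lock(mutex_);
        running_ = false;
    }

    bool running() {
        std::lock_guard<std::mutex> lock(mutex_);
        return running_;
    }

    int fired() {
        std::lock_guard<std::mutex> lock(mutex_);
        return fired_;
    }

    // One iteration of what mainLoop would do when a timeout expires.
    void fireOnce() {
        std::unique_lock<std::mutex> lock(mutex_);
        if (!running_) return;
        lock.unlock();  // release before running user code
        handler_();     // may call stop() without deadlocking
        lock.lock();    // re-acquire before touching shared state again
        ++fired_;
    }

private:
    std::mutex mutex_;
    std::function<void()> handler_;
    bool running_ = true;
    int fired_ = 0;
};
```

After the lock is re-acquired the state may have changed (for example running_ may now be false), so a real loop would need to re-check its flags at that point.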
Removing all the mutexes in each of the public functions fixes the issue, but that could lead to race conditions when data is written to the variables. Hence each function takes a lock before writing to, for example, the booleans.
I've tried looking into std::move to move the thread into a different variable during resetCallback and then call join on that one. I'm also investigating recursive_mutex but have no experience with using it.
void resetCallback(Function &&timeoutHandler)
{
internalQuit();
{
std::lock_guard lock{mutex};
quit = false;
}
auto prevThread = std::thread(std::move(this->thread));
// didn't know how to continue from here, requiring more selfstudy.
startTimerThread(std::forward<Function>(timeoutHandler));
}
It's a new subject for me; I have worked with mutexes and timers before, but only with relatively simple stuff.
Thank you in advance.

A race condition in a custom implementation of recursive mutex

UPD: It seems that the problem I explain below does not exist. I have not been able to reproduce it for a week now, and I have started to suspect that it was caused by a compiler bug or corrupted memory.
I tried to implement my own recursive mutex in C++, but for some reason it fails. I tried to debug it, but I got stuck. (I know that there is a recursive mutex in std, but I need a custom implementation in a project where the STL is not available; this implementation was just a check of an idea.) I haven't thought about efficiency yet, but I don't understand why my straightforward implementation doesn't work.
First of all, here's the implementation of the RecursiveMutex:
class RecursiveMutex
{
std::mutex critical_section;
std::condition_variable cv;
std::thread::id id;
int recursive_calls = 0;
public:
void lock() {
auto thread = std::this_thread::get_id();
std::unique_lock<std::mutex> lock(critical_section);
cv.wait( lock, [this, thread]() {
return id == thread || recursive_calls == 0;
});
++recursive_calls;
id = thread;
}
void unlock() {
std::unique_lock<std::mutex> lock( critical_section );
--recursive_calls;
if( recursive_calls == 0 ) {
lock.unlock();
cv.notify_all();
}
}
};
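For completeness, a minimal sketch of how the re-entrant behaviour of such a class could be exercised, assuming the class works as intended. The class is repeated so the sketch compiles on its own; NestedIncrement and RunNested are hypothetical helpers:

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <thread>
#include <vector>

// Same class as above, repeated so that the sketch is self-contained.
class RecursiveMutex {
    std::mutex critical_section;
    std::condition_variable cv;
    std::thread::id id;
    int recursive_calls = 0;
public:
    void lock() {
        auto thread = std::this_thread::get_id();
        std::unique_lock<std::mutex> lock(critical_section);
        cv.wait(lock, [this, thread]() {
            return id == thread || recursive_calls == 0;
        });
        ++recursive_calls;
        id = thread;
    }
    void unlock() {
        std::unique_lock<std::mutex> lock(critical_section);
        --recursive_calls;
        if (recursive_calls == 0) {
            lock.unlock();
            cv.notify_all();
        }
    }
};

// Hypothetical helper: locks `depth` times on the same thread.
void NestedIncrement(RecursiveMutex& m, int& counter, int depth) {
    m.lock();
    ++counter;  // protected: this thread owns the mutex at every depth
    if (depth > 1) NestedIncrement(m, counter, depth - 1);
    m.unlock();
}

// Hypothetical driver: returns the final counter value.
int RunNested(int thread_count, int iterations, int depth) {
    RecursiveMutex m;
    int counter = 0;
    std::vector<std::thread> threads;
    for (int t = 0; t < thread_count; ++t) {
        threads.emplace_back([&m, &counter, iterations, depth] {
            for (int i = 0; i < iterations; ++i)
                NestedIncrement(m, counter, depth);
        });
    }
    for (auto& th : threads) th.join();
    return counter;
}
```

With two threads, 1000 iterations and depth 3, every increment happens under the lock, so the counter should end at 6000.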
The failing test is straightforward, it just runs two threads, both of them are locking and unlocking the same mutex (the recursive nature of the mutex is not tested here). Here it is:
std::vector<std::thread> threads;
void initThreads( int num_of_threads, std::function<void()> func )
{
threads.resize( num_of_threads );
for( auto& thread : threads )
{
thread = std::thread( func );
}
}
void waitThreads()
{
for( auto& thread : threads )
{
thread.join();
}
}
void test () {
RecursiveMutex mutex;
while (true) {
int count = 0;
initThreads(2, [&mutex, &count] () {
for( int i = 0; i < 100000; ++i ) {
try {
mutex.lock();
++count;
mutex.unlock();
}
catch (...) {
// Extremely rarely.
// Exception is "Operation not permitted"
assert(false);
}
}
});
waitThreads();
// Happens often
assert(count == 200000);
}
}
In this code I have two kinds of errors:
Extremely rarely I get an exception in RecursiveMutex::lock() which contains message "Operation not permitted" and is thrown from cv.wait. As far as I understand, this exception is thrown when wait is called on a mutex which is not owned by the thread. At the same time, I lock it just above calling the wait so this cannot be the case.
In most situations I just get the message "terminate called without an active exception" in the console.
My main question is what the bug is, but I'll also be happy to know how to debug and provoke race condition in such a code in general.
P.S. I use Desktop Qt 5.4.2 MinGW 32 bit.

C++: Thread pool slower than single threading?

First of all I did look at the other topics on this website and found they don't relate to my problem as those mostly deal with people using I/O operations or thread creation overheads. My problem is that my threadpool or worker-task structure implementation is (in this case) a lot slower than single threading. I'm really confused by this and not sure if it's the ThreadPool, the task itself, how I test it, the nature of threads or something out of my control.
// Sorry for the long code
#include <vector>
#include <queue>
#include <thread>
#include <mutex>
#include <condition_variable>
#include <chrono>
#include <future>
#include "task.hpp"
class ThreadPool
{
public:
ThreadPool()
{
for (unsigned i = 0; i < std::thread::hardware_concurrency() - 1; i++)
m_workers.emplace_back(this, i);
m_running = true;
for (auto&& worker : m_workers)
worker.start();
}
~ThreadPool()
{
m_running = false;
m_task_signal.notify_all();
for (auto&& worker : m_workers)
worker.terminate();
}
void add_task(Task* task)
{
{
std::unique_lock<std::mutex> lock(m_in_mutex);
m_in.push(task);
}
m_task_signal.notify_one();
}
private:
class Worker
{
public:
Worker(ThreadPool* parent, unsigned id) : m_parent(parent), m_id(id)
{}
~Worker()
{
terminate();
}
void start()
{
m_thread = new std::thread(&Worker::work, this);
}
void terminate()
{
if (m_thread)
{
if (m_thread->joinable())
{
m_thread->join();
delete m_thread;
m_thread = nullptr;
m_parent = nullptr;
}
}
}
private:
void work()
{
while (m_parent->m_running)
{
std::unique_lock<std::mutex> lock(m_parent->m_in_mutex);
m_parent->m_task_signal.wait(lock, [&]()
{
return !m_parent->m_in.empty() || !m_parent->m_running;
});
if (!m_parent->m_running) break;
Task* task = m_parent->m_in.front();
m_parent->m_in.pop();
// Fixed the mutex being locked while the task is executed
lock.unlock();
task->execute();
}
}
private:
ThreadPool* m_parent = nullptr;
unsigned m_id = 0;
std::thread* m_thread = nullptr;
};
private:
std::vector<Worker> m_workers;
std::mutex m_in_mutex;
std::condition_variable m_task_signal;
std::queue<Task*> m_in;
bool m_running = false;
};
class TestTask : public Task
{
public:
TestTask() {}
TestTask(unsigned number) : m_number(number) {}
inline void Set(unsigned number) { m_number = number; }
void execute() override
{
if (m_number <= 3)
{
m_is_prime = m_number > 1;
return;
}
else if (m_number % 2 == 0 || m_number % 3 == 0)
{
m_is_prime = false;
return;
}
else
{
for (unsigned i = 5; i * i <= m_number; i += 6)
{
if (m_number % i == 0 || m_number % (i + 2) == 0)
{
m_is_prime = false;
return;
}
}
m_is_prime = true;
return;
}
}
public:
unsigned m_number = 0;
bool m_is_prime = false;
};
int main()
{
ThreadPool pool;
unsigned num_tasks = 1000000;
std::vector<TestTask> tasks(num_tasks);
for (auto&& task : tasks)
task.Set(randint(0, 1000000000));
auto s = std::chrono::high_resolution_clock::now();
#if MT
for (auto&& task : tasks)
pool.add_task(&task);
#else
for (auto&& task : tasks)
task.execute();
#endif
auto e = std::chrono::high_resolution_clock::now();
double seconds = std::chrono::duration_cast<std::chrono::nanoseconds>(e - s).count() / 1000000000.0;
}
Benchmarks with VS2013 Profiler:
10,000,000 tasks:
MT:
13 seconds of wall clock time
93.36% is spent in msvcp120.dll
3.45% is spent in Task::execute() // Not good here
ST:
0.5 seconds of wall clock time
97.31% is spent with Task::execute()
Usual disclaimer in such answers: the only way to tell for sure is to measure it with a profiler tool.
But I will try to explain your results without it. First of all, you have one mutex shared across all your threads, so only one thread at a time can execute a task. That kills any gains you might have had: in spite of your threads, your code is perfectly serial. So at the very least, move your task execution out of the mutex. You need to lock the mutex only to get a task out of the queue; you don't need to hold it while the task executes.
Next, your tasks are so simple that a single thread will execute them in no time. You just can't measure any gains with such tasks. Create some heavier tasks that could produce more interesting results (tasks that are closer to the real world, not so contrived).
And the third point: threads are not without their cost: context switching, mutex contention, etc. To see real gains, as the previous two points say, you need tasks that take more time than the overhead the threads introduce, and the code should be truly parallel instead of waiting on some resource that makes it serial.
UPD: I looked at the wrong part of the code. The task is complex enough provided you create tasks with sufficiently large numbers.
UPD2: I've played with your code and found a good prime number to show how the MT code is better. Use the following prime number: 1019048297. It gives enough computational complexity to show the difference.
But why doesn't your code produce good results? It is hard to tell without seeing the implementation of randint(), but I take it that it is pretty simple: in half of the cases it returns even numbers, and the other cases do not produce many big prime numbers either. So the tasks are so simple that context switching and the other overhead of your particular implementation, and of threads in general, consume more time than the computation itself. Using the prime number I gave you leaves the tasks no choice but to spend time computing: there is no easy answer, since the number is big and actually prime. That's why the big number will give you the answer you seek: a better time for the MT code.
You should not hold the mutex while the task is getting executed, otherwise other threads will not be able to get a task:
void work() {
while (m_parent->m_running) {
Task* currentTask = nullptr;
std::unique_lock<std::mutex> lock(m_parent->m_in_mutex);
m_parent->m_task_signal.wait(lock, [&]() {
return !m_parent->m_in.empty() || !m_parent->m_running;
});
if (!m_parent->m_running) continue;
currentTask = m_parent->m_in.front();
m_parent->m_in.pop();
lock.unlock(); //<- Release the lock so that other threads can get tasks
currentTask->execute();
currentTask = nullptr;
}
}
For MT, how much time is spent in each phase of the "overhead": std::unique_lock, m_task_signal.wait, front, pop, unlock?
Based on your results of only 3% useful work, this means the above consumes 97%. I'd get numbers for each part of the above (e.g. add timestamps between each call).
It seems to me, that the code you use to [merely] dequeue the next task pointer is quite heavy. I'd do a much simpler queue [possibly lockless] mechanism. Or, perhaps, use atomics to bump an index into the queue instead of the five step process above. For example:
void
work()
{
while (m_parent->m_running) {
// NOTE: this is just an example, not necessarily the real function
int curindex = atomic_increment(&global_index);
if (curindex >= max_index)
break;
Task *task = m_parent->m_in[curindex];
task->execute();
}
}
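The atomic_increment call above is a placeholder. With the standard library the same idea can be written with std::atomic and fetch_add. A minimal sketch, assuming the numbers are all known up front; IsPrime and TestAll are hypothetical helpers, not part of the ThreadPool from the question:

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Same primality check as TestTask::execute(), with a 64-bit loop variable
// so that i * i cannot overflow for large inputs.
bool IsPrime(unsigned n) {
    if (n <= 3) return n > 1;
    if (n % 2 == 0 || n % 3 == 0) return false;
    for (unsigned long long i = 5; i * i <= n; i += 6) {
        if (n % i == 0 || n % (i + 2) == 0) return false;
    }
    return true;
}

// Each worker claims the next unprocessed index from a shared atomic counter.
// The result uses char rather than bool because std::vector<bool> packs bits,
// which would make writes to neighbouring elements race with each other.
std::vector<char> TestAll(const std::vector<unsigned>& numbers,
                          unsigned worker_count) {
    std::vector<char> results(numbers.size(), 0);
    std::atomic<std::size_t> next{0};

    auto work = [&] {
        for (;;) {
            std::size_t i = next.fetch_add(1, std::memory_order_relaxed);
            if (i >= numbers.size()) break;  // nothing left to claim
            results[i] = IsPrime(numbers[i]) ? 1 : 0;
        }
    };

    std::vector<std::thread> workers;
    for (unsigned w = 0; w < worker_count; ++w) workers.emplace_back(work);
    for (auto& t : workers) t.join();  // join makes the results visible to the caller
    return results;
}
```

This only works when the whole batch exists before the workers start; a pool that accepts tasks while running still needs a queue of some kind.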
Also, maybe you should pop [say] ten at a time instead of just one.
You might also be memory bound and/or "task switch" bound. (e.g.) For threads that access an array, more than four threads usually saturates the memory bus. You could also have heavy contention for the lock, such that the threads get starved because one thread is monopolizing the lock [indirectly, even with the new unlock call]
Interthread locking usually involves a "serialization" operation where other cores must synchronize their out-of-order execution pipelines.
Here's a "lockless" implementation:
void
work()
{
// assume m_id is 0,1,2,...
int curindex = m_id;
while (m_parent->m_running) {
if (curindex >= max_index)
break;
Task *task = m_parent->m_in[curindex];
task->execute();
curindex += NUMBER_OF_WORKERS;
}
}

C++11 get a task finished by one of two algorithms

I have two algorithms to solve a task X.
How can I start one thread for algorithm 1 and another for algorithm 2, wait for whichever finishes first, then kill the other one and proceed?
I have seen that join from std::thread will make me wait for it to finish but I can't do join for both threads, otherwise I will wait for both to complete. I want to issue both of them and wait until one of them completes. What's the best way to achieve this?
You can't kill threads in C++11 so you need to orchestrate their demise.
This could be done by having them loop on an std::atomic<bool> variable and getting the winner to std::call_once() in order to set the return value and flag the other threads to end.
Perhaps a bit like this:
std::once_flag once; // for std::call_once()
void algorithm1(std::atomic<bool>& done, int& result)
{
// Do some randomly timed work
for(int i = 0; !done && i < 3; ++i) // end if done is true
std::this_thread::sleep_for(std::chrono::seconds(std::rand() % 3));
// Only one thread gets to leave a result
std::call_once(once, [&]
{
done = true; // stop other threads
result = 1;
});
}
void algorithm2(std::atomic<bool>& done, int& result)
{
// Do some randomly timed work
for(int i = 0; !done && i < 3; ++i) // end if done is true
std::this_thread::sleep_for(std::chrono::seconds(std::rand() % 3));
// Only one thread gets to leave a result
std::call_once(once, [&]
{
done = true; // stop other threads
result = 2;
});
}
int main()
{
std::srand(std::time(0));
std::atomic<bool> done(false);
int result = 0;
std::thread t1(algorithm1, std::ref(done), std::ref(result));
std::thread t2(algorithm2, std::ref(done), std::ref(result));
t1.join(); // this will end if t2 finishes
t2.join();
std::cout << "result : " << result << '\n';
}
Firstly, don't kill the losing algorithm. Just let it run to completion and ignore the result.
Now, the closest thing to what you asked for is to have a mutex+condvar+result variable (or more likely two results, one for each algorithm).
Code would look something like
X result1, result2;
bool complete1 = false;
bool complete2 = false;
std::mutex result_mutex;
std::condition_variable result_cv;
// simple wrapper to signal when algoN has finished
std::thread t1([&]() { result1 = algo1();
std::unique_lock<std::mutex> lock(result_mutex); // explicit template argument: CTAD needs C++17
complete1 = true;
result_cv.notify_one();
});
std::thread t2([&]() { result2 = algo2();
std::unique_lock<std::mutex> lock(result_mutex);
complete2 = true;
result_cv.notify_one();
});
t1.detach();
t2.detach();
// wait until one of the algos has completed
int winner;
{
std::unique_lock<std::mutex> lock(result_mutex);
result_cv.wait(lock, [&]() { return complete1 || complete2; });
if (complete1) winner=1;
else winner=2;
}
With the other mechanisms, such as one future per algorithm, the main thread has to poll to find out which one finished first. The only other non-busy-waiting alternative is to move the post-success processing into a call_once: in this case the master thread should just join both children, and the second child will simply return when it finishes processing and realises it lost.
The new C++11 standard offers some facilities for solving such problems, e.g. futures and promises.
Please have a look at http://de.cppreference.com/w/cpp/thread/future and When is it a good idea to use std::promise over the other std::thread mechanisms?.
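As a rough illustration of the promise-based route, here is a minimal sketch in which both workers share one std::promise. The first set_value wins and the second throws std::future_error, which the loser swallows. Race and FirstToFinish are hypothetical names, and the sleeps stand in for the real algorithms:

```cpp
#include <cassert>
#include <chrono>
#include <functional>
#include <future>
#include <thread>

// Hypothetical worker: "computes" for delay_ms, then tries to publish its id.
void Race(std::promise<int>& winner, int id, int delay_ms) {
    std::this_thread::sleep_for(std::chrono::milliseconds(delay_ms));
    try {
        winner.set_value(id);  // succeeds only for the first caller
    } catch (const std::future_error&) {
        // promise_already_satisfied: the other algorithm finished first
    }
}

// Returns the id (1 or 2) of whichever worker finished first.
int FirstToFinish(int delay1_ms, int delay2_ms) {
    std::promise<int> winner;
    std::future<int> result = winner.get_future();

    std::thread t1(Race, std::ref(winner), 1, delay1_ms);
    std::thread t2(Race, std::ref(winner), 2, delay2_ms);

    int id = result.get();  // blocks, without polling, until one worker sets the value

    // The loser is not killed; it is simply allowed to finish.
    t1.join();
    t2.join();
    return id;
}
```

In this sketch the caller still waits for the slower thread in join(); a real losing algorithm would need to check a stop flag, as in the std::atomic<bool> example earlier, to return promptly.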

2 threads left hanging waiting on QWaitCondition in spite of wakeAll calls

I have threaded iterative generation of some geometries. I use VTK for rendering. After each iteration I would like to display (render) the current progress. My approach works as expected until the last 2 threads are left hanging, waiting for the QWaitCondition. They are blocked, even though their status in the QWaitCondition's queue is wokenUp (inspected through the debugger). I suspect that the number of stuck threads (2) is somehow connected with my processor's 4 cores.
Simplified code is below. What am I doing wrong and how can I fix it?
class Logic
{
QMutex threadLock, renderLock;
//SOLUTION: renderLock should be per thread, not global like this!
QWaitCondition wc;
bool render;
...
}
Logic::Logic()
{
...
renderLock.lock(); //needs to be locked for QWaitCondition
}
void Logic::timerProc()
{
static int count=0;
if (render||count>10) //render wanted or not rendered in a while
{
threadLock.lock();
vtkRenderWindow->Render();
render=false;
count=0;
wc.wakeAll();
threadLock.unlock();
}
else
count++;
}
double Logic::makeMesh(int meshIndex)
{
while (notFinished)
{
...(calculate g)
threadLock.lock(); //lock scene
mesh[meshIndex]->setGeometry(g);
render=true;
threadLock.unlock();
wc.wait(&renderLock); //wait until rendered
}
return g.size;
}
void Logic::makeAllMeshes()
{
vector<QFuture<double>> r;
for (int i=0; i<meshes.size(); i++)
{
QFuture<double> future = QtConcurrent::run<double>(this, &Logic::makeMesh, i);
r.push_back(future);
}
while(any r is not finished)
QApplication::processEvents(); //give timer a chance
}
There is at least one defect in your code. count and render belong to the critical section, which means they need to be protected from concurrent access.
Assume there are several threads waiting on wc.wait(&renderLock);. Someone somewhere executes wc.wakeAll();. ALL the threads are woken up. Assume at least one thread sees notFinished as true (if your code makes any sense, this must be possible) and goes back to execute:
threadLock.lock(); //lock scene
mesh[meshIndex]->setGeometry(g);
render=true;
threadLock.unlock();
wc.wait(&renderLock) <----OOPS...
The second time the thread gets here, it does not hold renderLock. So Kamil Klimek is right: you call wait on a mutex you don't hold.
You should remove the lock from the constructor and lock right before waiting on the condition. Wherever you lock renderLock, the thread should not hold threadLock.
The catch was that I needed one QMutex per thread, and not just one global QMutex. The corrected code is below. Thanks for the help, UmNyobe!
class Logic
{
QMutex threadLock;
QWaitCondition wc;
bool render;
...
}
//nothing in constructor related to threading
void Logic::timerProc()
{
//count was a debugging workaround and is not needed
if (render)
{
threadLock.lock();
vtkRenderWindow->Render();
render=false;
wc.wakeAll();
threadLock.unlock();
}
}
double Logic::makeMesh(int meshIndex)
{
QMutex renderLock; //fix
renderLock.lock(); //fix
while (notFinished)
{
...(calculate g)
threadLock.lock(); //lock scene
mesh[meshIndex]->setGeometry(g);
render=true;
threadLock.unlock();
wc.wait(&renderLock); //wait until rendered
}
return g.size;
}
void Logic::makeAllMeshes()
{
vector<QFuture<double>> r;
for (int i=0; i<meshes.size(); i++)
{
QFuture<double> future = QtConcurrent::run<double>(this, &Logic::makeMesh, i);
r.push_back(future);
}
while(any r is not finished)
QApplication::processEvents(); //give timer a chance
}