I'm trying to implement some algorithm using threads that must be synchronized at some moment. More or less the sequence for each thread should be:
1. Try to find a solution with current settings.
2. Synchronize solution with other threads.
3. If any of the threads found a solution, end work.
4. (empty - to be inline with example below)
5. Modify parameters for algorithm and jump to 1.
Here is a toy example with the algorithm replaced by random number generation: all threads should end as soon as at least one of them finds 0.
#include <condition_variable>
#include <cstdlib>
#include <ctime>
#include <iostream>
#include <mutex>
#include <thread>
#include <vector>
const int numOfThreads = 8;
std::condition_variable cv1, cv2;
std::mutex m1, m2;
int lockCnt1 = 0;
int lockCnt2 = 0;
int solutionCnt = 0;
void workerThread()
{
while(true) {
// 1. do some important work
int r = rand() % 1000;
// 2. synchronize and get results from all threads
{
std::unique_lock<std::mutex> l1(m1);
++lockCnt1;
if (r == 0) ++solutionCnt; // gather solutions
if (lockCnt1 == numOfThreads) {
// last thread ends here
lockCnt2 = 0;
cv1.notify_all();
}
else {
cv1.wait(l1, [&] { return lockCnt1 == numOfThreads; });
}
}
// 3. if solution found then quit all threads
if (solutionCnt > 0) return;
// 4. if not, then set lockCnt1 to 0 to have section 2. working again
{
std::unique_lock<std::mutex> l2(m2);
++lockCnt2;
if (lockCnt2 == numOfThreads) {
// last thread ends here
lockCnt1 = 0;
cv2.notify_all();
}
else {
cv2.wait(l2, [&] { return lockCnt2 == numOfThreads; });
}
}
// 5. Setup new algorithm parameters and repeat.
}
}
int main()
{
srand(time(NULL));
std::vector<std::thread> v;
for (int i = 0; i < numOfThreads ; ++i) v.emplace_back(std::thread(workerThread));
for (int i = 0; i < numOfThreads ; ++i) v[i].join();
return 0;
}
The questions I have are about sections 2. and 4. from code above.
A) Section 2 synchronizes all threads and gathers solutions (if any were found), all driven by the lockCnt1 variable. Compared with a one-shot use of condition_variable, I found it hard to reset lockCnt1 to zero safely so that section 2 can be reused in the next iteration. That is why I introduced section 4. Is there a better way to do this (without introducing section 4)?
B) It seems that all examples show condition_variable in the context of a 'producer-consumer' scenario. Is there a better way to synchronize all threads when all of them are 'producers'?
Edit: Just to be clear, I didn't want to describe algorithm details since this is not important here - anyway this is necessary to have all solution(s) or none from given loop execution and mixing them is not allowed. Described sequence of execution must be followed and the question is how to have such synchronization between threads.
A) You could simply not reset lockCnt1 to 0 and keep incrementing it instead. The condition lockCnt1 == numOfThreads then changes to lockCnt1 % numOfThreads == 0, and you can drop block 4. In the future you could also use std::experimental::barrier to get the threads to meet.
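One caveat with a plain modulo test: a fast thread may re-enter the barrier and bump the counter before a slow waiter has re-checked its predicate, so the waiter can miss its wakeup. A common way around this is to wait on a round ("generation") number instead of the counter. The following is only a minimal sketch of that idea; ReusableBarrier and runBarrierDemo are hypothetical names, not part of the question's code. C++20 also ships std::barrier, which covers the same use case.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical helper: a barrier that can be reused in every loop iteration.
class ReusableBarrier {
public:
    explicit ReusableBarrier(std::size_t count)
        : threshold_(count), waiting_(0), generation_(0) {
        assert(count > 0);
    }

    // Blocks until threshold_ threads have arrived in the current round.
    void arriveAndWait() {
        std::unique_lock<std::mutex> lock(mutex_);
        const std::size_t myGeneration = generation_;
        if (++waiting_ == threshold_) {
            // Last arrival: reset the counter and open the next round.
            waiting_ = 0;
            ++generation_;
            cv_.notify_all();
        } else {
            // Wait on the round number, not on the counter, so a fast thread
            // that re-enters the barrier cannot confuse slow waiters.
            cv_.wait(lock, [&] { return generation_ != myGeneration; });
        }
    }

private:
    std::mutex mutex_;
    std::condition_variable cv_;
    const std::size_t threshold_;
    std::size_t waiting_;
    std::size_t generation_;
};

// Demo: every thread passes the barrier `rounds` times.
// Returns the total number of barrier passes over all threads.
int runBarrierDemo(int numThreads, int rounds) {
    ReusableBarrier barrier(numThreads);
    std::vector<int> passes(numThreads, 0);
    std::vector<std::thread> threads;
    for (int t = 0; t < numThreads; ++t) {
        threads.emplace_back([&, t] {
            for (int r = 0; r < rounds; ++r) {
                barrier.arriveAndWait();
                ++passes[t];  // each thread only touches its own slot
            }
        });
    }
    for (auto& th : threads) th.join();
    int total = 0;
    for (int p : passes) total += p;
    return total;
}
```

In the worker loop from the question, one arriveAndWait() call after gathering the solutions would replace both counter blocks, provided the solution counter is only written before the barrier and only read after it.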
B) I would suggest using std::atomic for solutionCnt and then you can drop all other counters, the mutex and the condition variable. Just atomically increase it by one in the thread that found solution and then return. In all threads after every iteration check if the value is bigger than zero. If it is, then return. The advantage is that the threads do not have to meet regularly, but can try to solve it at their own pace.
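A minimal sketch of this atomic-counter variant could look as follows. tryToSolve is a hypothetical stand-in for one attempt of the real algorithm (here only worker 0 ever "succeeds", to keep the demo deterministic), and runWorkers is an assumed driver function.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

std::atomic<int> solutionCount{0};

// Hypothetical stand-in for one attempt of the real algorithm:
// only worker 0 finds a solution, on its 1000th attempt.
bool tryToSolve(int workerId, int attempt) {
    return workerId == 0 && attempt == 1000;
}

void worker(int workerId) {
    // Every iteration checks whether some thread has already succeeded.
    for (int attempt = 0; solutionCount.load() == 0; ++attempt) {
        if (tryToSolve(workerId, attempt)) {
            solutionCount.fetch_add(1);  // publish the success and stop
            return;
        }
        // ... modify parameters for the next attempt here
    }
}

// Starts numThreads workers and returns how many solutions were reported.
int runWorkers(int numThreads) {
    assert(numThreads > 0);
    solutionCount.store(0);
    std::vector<std::thread> threads;
    for (int i = 0; i < numThreads; ++i) threads.emplace_back(worker, i);
    for (auto& t : threads) t.join();
    return solutionCount.load();
}
```

Note that the threads no longer run in lock-step here, so this only fits if solutions from different iterations may be mixed.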
Out of curiosity, I tried to solve your problem using std::async. For every attempt to find a solution, we call async. Once all parallel attempts have finished, we process feedback, adjust parameters, and repeat. An important difference with your implementation is that feedback is processed in the calling (main) thread. If processing feedback takes too long — or if we don't want to block the main thread at all — then the code in main() can be adjusted to also call std::async.
The code should be quite efficient, provided that the implementation of async uses a thread pool (e.g. Microsoft's implementation does that).
#include <algorithm>
#include <chrono>
#include <cstdlib>
#include <ctime>
#include <future>
#include <iostream>
#include <vector>
const int numOfThreads = 8;
struct Parameters{};
struct Feedback {
int result;
};
Feedback doTheWork(const Parameters &){
// do the work and provide result and feedback for future runs
return Feedback{rand() % 1000};
}
bool isSolution(const Feedback &f){
return f.result == 0;
}
// Runs doTheWork in parallel. Number of parallel tasks is same as size of params vector
std::vector<Feedback> findSolutions(const std::vector<Parameters> &params){
// 1. Run async tasks to find solutions. Normally threads are not created each time but re-used from a pool
std::vector<std::future<Feedback>> futures;
for (auto &p: params){
futures.push_back(std::async(std::launch::async,
[&p](){ return doTheWork(p); }));
}
// 2. Synchronize: wait for all tasks
std::vector<Feedback> feedback(futures.size());
for (auto nofRunning = futures.size(), iFuture = size_t{0}; nofRunning > 0; ){
// Check if the task has finished (future is invalid if we already handled it during an earlier iteration)
auto &future = futures[iFuture];
if (future.valid() && future.wait_for(std::chrono::milliseconds(1)) != std::future_status::timeout){
// Collect feedback for next attempt
// Alternatively, we could already check if solution has been found and cancel other tasks [if our algorithm supports cancellation]
feedback[iFuture] = std::move(future.get());
--nofRunning;
}
if (++iFuture == futures.size())
iFuture = 0;
}
return feedback;
}
int main()
{
srand(time(NULL));
std::vector<Parameters> params(numOfThreads);
// 0. Set initial parameter values here
// If we don't want to block the main thread while the algorithm is running, we can use std::async here too
while (true){
auto feedbackVector = findSolutions(params);
auto itSolution = std::find_if(std::begin(feedbackVector), std::end(feedbackVector), isSolution);
// 3. If any of the threads has found a solution, we stop
if (itSolution != feedbackVector.end())
break;
// 5. Use feedback to re-configure parameters for next iteration
}
return 0;
}
Related
Suppose I have some tasks (Monte Carlo simulations) that I want to run in parallel. I want to complete a given number of tasks, but tasks take different amount of time so not easy to divide the work evenly over the threads. Also: I need the results of all simulations in a single vector (or array) in the end.
So I come up with below approach:
int Max{1000000};
//SimResult is some struct with well-defined default value.
std::vector<SimResult> vec(/*length*/Max);//Initialize with default values of SimResult
int LastAdded{0};
void fill(int RandSeed)
{
Simulator sim{RandSeed};
while(LastAdded < Max)
{
// Do some work to bring foo to the desired state
//The duration of this work is subject to randomness
vec[LastAdded++]
= sim.GetResult();//Produces SimResult.
}
}
main()
{
//launch a bunch of std::async that start
auto fut1 = std::async(fill,1);
auto fut2 = std::async(fill,2);
//maybe some more tasks.
fut1.get();
fut2.get();
//do something with the results in vec.
}
The above code will give race conditions I guess. I am looking for a performant approach to avoid that. Requirements: avoid race conditions (fill the entire array, no skips) ; final result is immediately in array ; performant.
Reading on various approaches, it seems atomic is a good candidate, but I am not sure what settings will be most performant in my case? And not even sure whether atomic will cut it; maybe a mutex guarding LastAdded is needed?
One thing I would say is that you need to be very careful with the standard library random number functions. If your 'Simulator' class creates an instance of a generator, you should not run Monte Carlo simulations in parallel using the same object, because you will likely get repeated patterns of random numbers between the runs, which will give you inaccurate results.
The best practice in this area is to create N Simulator objects with the same properties and give each one a different random seed. You can then distribute these objects over multiple threads using OpenMP, a common parallel programming model for scientific software development.
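std::vector<SimResult> generateResults(size_t N_runs, double seed)
{
std::vector<SimResult> results(N_runs);
// Each iteration writes only to its own slot, so no locking is needed
#pragma omp parallel for
for(auto i = 0; i < N_runs; i++)
{
// One simulator per run, each with its own seed
auto sim = Simulator(seed + i);
results[i] = sim.GetResult();
}
return results;
}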
Edit: With OpenMP, you can choose different scheduling models, which allow you to, for example, split work between threads dynamically. You can do this with:
#pragma omp parallel for schedule(dynamic, 16)
which would give each thread chunks of 16 items to work on at a time.
Since you already know how many elements you are going to work with and never change the size of the vector, the easiest solution is to let each thread work on its own part of the vector (see the simple version at the end of this answer).
Update
To accommodate vastly varying calculation times, you should keep your current code, but avoid race conditions via a std::lock_guard. You will need a std::mutex that is the same for all threads, for example a global variable, or pass a reference to the mutex to each thread.
void fill(int RandSeed, std::mutex &nextItemMutex)
{
Simulator sim{RandSeed};
size_t workingIndex;
while(true)
{
{
// enter critical area
std::lock_guard<std::mutex> nextItemLock(nextItemMutex);
// Acquire next item
if(LastAdded < Max)
{
workingIndex = LastAdded;
LastAdded++;
}
else
{
break;
}
// lock is released when nextItemLock goes out of scope
}
// Do some work to bring foo to the desired state
// The duration of this work is subject to randomness
vec[workingIndex] = sim.GetResult();//Produces SimResult.
}
}
The problem with this is that synchronisation is quite expensive. But it is probably not that expensive in comparison to the simulation you run, so it shouldn't be too bad.
Version 2:
To reduce the amount of synchronisation that is required, you could acquire blocks to work on, instead of single items:
void fill(int RandSeed, std::mutex &nextItemMutex, size_t blockSize)
{
Simulator sim{RandSeed};
size_t workingIndex;
while(true)
{
{
std::lock_guard<std::mutex> nextItemLock(nextItemMutex);
if(LastAdded < Max)
{
workingIndex = LastAdded;
LastAdded += blockSize;
}
else
{
break;
}
}
for(size_t i = workingIndex; i < workingIndex + blockSize && i < Max; i++)
vec[i] = sim.GetResult();//Produces SimResult.
}
}
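Since the question asks about atomics: the "acquire next item" step can also be done without a mutex by claiming indices with fetch_add on an atomic counter. The sketch below is one possible way to do that, not a measured recommendation. SimResult is reduced to a hypothetical placeholder struct, the simulator call is replaced by a dummy assignment, and fillAtomic / runFill are assumed names.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <functional>
#include <future>
#include <vector>

// Hypothetical placeholder for the real SimResult; -1 means "not filled yet".
struct SimResult { int value = -1; };

std::atomic<std::size_t> nextIndex{0};

void fillAtomic(int seed, std::vector<SimResult>& out) {
    for (;;) {
        // fetch_add returns the old value, so every index is handed out
        // exactly once, even when several threads call it concurrently.
        const std::size_t i = nextIndex.fetch_add(1, std::memory_order_relaxed);
        if (i >= out.size()) break;
        out[i].value = seed;  // placeholder for sim.GetResult()
    }
}

// Fills `count` slots with two tasks and returns how many slots were written.
std::size_t runFill(std::size_t count) {
    assert(count > 0);
    nextIndex.store(0);
    std::vector<SimResult> results(count);
    auto f1 = std::async(std::launch::async, fillAtomic, 1, std::ref(results));
    auto f2 = std::async(std::launch::async, fillAtomic, 2, std::ref(results));
    f1.get();  // get() also makes the workers' writes visible here
    f2.get();
    std::size_t filled = 0;
    for (const auto& r : results) {
        if (r.value != -1) ++filled;
    }
    return filled;
}
```

The vector is sized up front and never resized, so concurrent writes to distinct elements do not race. Claiming blocks instead of single items works the same way with fetch_add(blockSize).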
Simple Version
void fill(int RandSeed, size_t partitionStart, size_t partitionEnd)
{
Simulator sim{RandSeed};
for(size_t i = partitionStart; i < partitionEnd; i++)
{
// Do some work to bring foo to the desired state
// The duration of this work is subject to randomness
vec[i] = sim.GetResult();//Produces SimResult.
}
}
main()
{
//launch a bunch of std::async that start
auto fut1 = std::async(fill,1, 0, Max / 2);
auto fut2 = std::async(fill,2, Max / 2, Max);
// ...
}
I'm making a parallel password cracker for an assignment. When I launch more than one thread, cracking takes longer the more threads I add. What is the problem here?
Secondly, what resource sharing techniques can I use for optimal performance? I'm required to use either mutexes, atomic operations or barriers while also using semaphores, condition variables or channels. Mutexes seem to slow my program down quite drastically.
Here is an example of my code for context:
std::mutex mtx;
std::condition_variable cv;
void run()
{
std::unique_lock<std::mutex> lck(mtx);
ready = true;
cv.notify_all();
}
crack()
{
std::lock_guard<std::mutex> lk(mtx);
...do cracking stuff
}
main()
{
....
std::thread *t = new std::thread[uiThreadCount];
for(int i = 0; i < uiThreadCount; i++)
{
t[i] = std::thread(crack, params);
}
run();
for(int i = 0; i < uiThreadCount; i++)
{
t[i].join();
}
}
When writing multi-threaded code, it's generally a good idea to share as few resources as possible, so you can avoid having to synchronize using a mutex or an atomic.
There are a lot of different ways to do password cracking, so I'll give a slightly simpler example. Let's say you have a hash function, and a hash, and you're trying to guess what input produces the hash (this is basically how a password would get cracked).
We can write the cracker like this. It'll take the hash function and the password hash, check a range of values, and invoke the callback function if it found a match.
auto cracker = [](auto passwdHash, auto hashFunc, auto min, auto max, auto callback) {
for(auto i = min; i < max; i++) {
auto output = hashFunc(i);
if(output == passwdHash) {
callback(i);
}
}
};
Now, we can write a parallel version. This version only has to synchronize when it finds a match, which is pretty rare.
auto parallel_cracker = [](auto passwdHash, auto hashFunc, auto min, auto max, int num_threads) {
// Get a vector of threads
std::vector<std::thread> threads;
threads.reserve(num_threads);
// Make a vector of all the matches it discovered
using input_t = decltype(min);
std::vector<input_t> matches;
std::mutex match_lock;
// Whenever a match is found, this function gets called
auto callback = [&](input_t match) {
std::unique_lock<std::mutex> _lock(match_lock);
std::cout << "Found match: " << match << '\n';
matches.push_back(match);
};
for(int i = 0; i < num_threads; i++) {
auto sub_min = min + ((max - min) * i) / num_threads;
auto sub_max = min + ((max - min) * (i + 1)) / num_threads;
threads.push_back(std::thread(cracker, passwdHash, hashFunc, sub_min, sub_max, callback));
}
// Join all the threads
for(auto& thread : threads) {
thread.join();
}
return matches;
};
Yes, that is not surprising given the way it is written: by locking a mutex at the beginning of your thread function (crack) and holding it for the whole function, you effectively make the threads run sequentially.
I understand you want to achieve a "synchronous start" of the threads (judging by the condition variable cv), but you don't use it properly. Without a call to one of its wait methods, cv.notify_all() is useless: it does not do what you intended, and your threads simply run one after another.
Calling wait() on the std::condition_variable in crack() is imperative: it releases mtx (which you just acquired through lk) and blocks the thread until cv.notify_all() is called. After the notification, each woken thread re-acquires mtx before wait() returns, so the threads still leave the wait one at a time; if you really want "parallel" execution, you then need to unlock mtx.
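Here is how your crack thread should look:
crack()
{
std::unique_lock<std::mutex> lk(mtx);
// Wait on the ready flag so that a notification sent before this thread
// starts waiting (or a spurious wakeup) cannot break the start signal
cv.wait(lk, []{ return ready; });
lk.unlock();
...do cracking stuff
}
By the way, keep the ready flag that run() sets: it is what the wait predicate checks. With a bare cv.wait(lk), a thread that reaches the wait after run() has already called notify_all() would block forever.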
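For reference, a self-contained sketch of such a synchronized start might look like this. It reuses the names mtx, cv, ready, run and crack from the question; the finished counter and startAll are hypothetical additions used only to make the demo checkable, and the actual cracking work is left out.

```cpp
#include <atomic>
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <thread>
#include <vector>

std::mutex mtx;
std::condition_variable cv;
bool ready = false;            // start signal, protected by mtx
std::atomic<int> finished{0};  // hypothetical counter, only for the demo

void crack(int /*id*/) {
    {
        std::unique_lock<std::mutex> lk(mtx);
        // The predicate guards against spurious wakeups and against a
        // notification that was sent before this thread started waiting.
        cv.wait(lk, [] { return ready; });
    }  // the lock is released here, so the work below runs in parallel
    // ... do cracking stuff
    finished.fetch_add(1);
}

void run() {
    {
        std::lock_guard<std::mutex> lk(mtx);
        ready = true;
    }
    cv.notify_all();
}

// Starts threadCount workers, releases them together, and
// returns how many of them ran to completion.
int startAll(int threadCount) {
    assert(threadCount > 0);
    ready = false;
    finished.store(0);
    std::vector<std::thread> threads;
    for (int i = 0; i < threadCount; ++i) threads.emplace_back(crack, i);
    run();
    for (auto& t : threads) t.join();
    return finished.load();
}
```

The mutex is held only around the wait, never around the cracking work itself, which is what keeps the threads from serializing.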
I'm required to use either mutexes, atomic operations or barriers
while also using semaphores, conditional variables or channels
- different tools/techniques are good for different things, the question is too general
So I'm trying to create a program that implements a function that generates a random number (n) and, based on n, creates n threads. The main thread is responsible for printing the minimum and maximum of the leaves. The depth of the hierarchy, including the main thread, is 3.
I have written the code below:
#include <iostream>
#include <thread>
#include <time.h>
#include <string>
#include <sstream>
using namespace std;
// a structure to keep the needed information of each thread
struct ThreadInfo
{
long randomN;
int level;
bool run;
int maxOfVals;
double minOfVals;
};
// The start address (function) of the threads
void ChildWork(void* a) {
ThreadInfo* info = (ThreadInfo*)a;
// Generate random value n
srand(time(NULL));
double n=rand()%6+1;
// initialize the thread info with n value
info->randomN=n;
info->maxOfVals=n;
info->minOfVals=n;
// the depth of recursion should not be more than 3
if(info->level > 3)
{
info->run = false;
}
// Create n threads and run them
ThreadInfo* childInfo = new ThreadInfo[(int)n];
for(int i = 0; i < n; i++)
{
childInfo[i].level = info->level + 1;
childInfo[i].run = true;
std::thread tt(ChildWork, &childInfo[i]) ;
tt.detach();
}
// checks if any child threads are working
bool anyRun = true;
while(anyRun)
{
anyRun = false;
for(int i = 0; i < n; i++)
{
anyRun = anyRun || childInfo[i].run;
}
}
// once all child threads are done, we find their max and min value
double maximum=1, minimum=6;
for( int i=0;i<n;i++)
{
// cout<<childInfo[i].maxOfVals<<endl;
if(childInfo[i].maxOfVals>=maximum)
maximum=childInfo[i].maxOfVals;
if(childInfo[i].minOfVals< minimum)
minimum=childInfo[i].minOfVals;
}
info->maxOfVals=maximum;
info->minOfVals=minimum;
// we set the info->run value to false, so that the parrent thread of this thread will know that it is done
info->run = false;
}
int main()
{
ThreadInfo info;
srand(time(NULL));
double n=rand()%6+1;
cout<<"n is: "<<n<<endl;
// initializing thread info
info.randomN=n;
info.maxOfVals=n;
info.minOfVals=n;
info.level = 1;
info.run = true;
std::thread t(ChildWork, &info) ;
t.join();
while(info.run);
info.maxOfVals= max<unsigned long>(info.randomN,info.maxOfVals);
info.minOfVals= min<unsigned long>(info.randomN,info.minOfVals);
cout << "Max is: " << info.maxOfVals <<" and Min is: "<<info.minOfVals;
}
The code compiles with no error, but when I execute it, it gives me this :
libc++abi.dylib: terminating with uncaught exception of type
std::__1::system_error: thread constructor failed: Resource
temporarily unavailable Abort trap: 6
You spawn too many threads. It looks a bit like a fork() bomb. Threads are a very heavy-weight system resource. Use them sparingly.
Within the function void ChildWork I see two mistakes:
1. As someone already pointed out in the comments, you check the level of a thread, but then go on to create more threads regardless of the result of that check. The function needs to return right after setting info->run = false.
2. Inside the for loop that spawns the new threads, the child entries come from ThreadInfo* childInfo = new ThreadInfo[(int)n], which leaves their members uninitialized. Make sure each child's level is really set from the parent's level before the thread starts, otherwise the depth check can never trigger.
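To illustrate the first point, here is a rough sketch of a depth-limited version that returns at the maximum level and joins its children instead of detaching them and spinning on a flag. MinMax, kMaxLevel, childWork (lower-case) and runTree are hypothetical names, and each thread gets its own seeded generator instead of calling srand/rand, which is my substitution rather than something from the question.

```cpp
#include <algorithm>
#include <cassert>
#include <functional>
#include <random>
#include <thread>
#include <vector>

struct MinMax { int min; int max; };

constexpr int kMaxLevel = 3;  // depth of the hierarchy, main thread included

// Draws n in [1, 6]; below the maximum level it spawns n children and
// folds their min/max into `out`.
void childWork(int level, unsigned seed, MinMax& out) {
    assert(level >= 1);
    std::mt19937 rng(seed);
    std::uniform_int_distribution<int> dist(1, 6);
    const int n = dist(rng);
    out = MinMax{n, n};
    if (level >= kMaxLevel) return;  // leaf: stop here, spawn nothing

    std::vector<MinMax> results(n);  // sized up front, never resized
    std::vector<std::thread> children;
    for (int i = 0; i < n; ++i) {
        const unsigned childSeed = seed * 31u + static_cast<unsigned>(i) + 1u;
        children.emplace_back(childWork, level + 1, childSeed,
                              std::ref(results[i]));
    }
    // join() blocks without burning CPU and makes the children's writes
    // to `results` visible to this thread.
    for (auto& t : children) t.join();
    for (const auto& r : results) {
        out.min = std::min(out.min, r.min);
        out.max = std::max(out.max, r.max);
    }
}

// The calling (main) thread acts as level 1 of the hierarchy.
MinMax runTree(unsigned seed) {
    MinMax root{0, 0};
    childWork(1, seed, root);
    return root;
}
```

With a depth of 3 and at most 6 children per node, this creates at most 6 + 36 threads, well within what a system can provide.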
In general avoid using threads to achieve concurrency for I/O bound operations (*). Just use threads to achieve concurrency for independent CPU bound operations. As a rule of thumb you never need more threads than you have CPU cores in your system (**). Having more does not improve concurrency and does not improve performance.
(*) You should always use direct function calls and an event based system to run pseudo concurrent I/O operations. You do not need any threading to do so. For example a TCP server does not need any threads to serve thousands of clients.
(**) This is the ideal case. In practice your software is composed of multiple parts, developed by independent developers and maintained in different modes, so it is ok to have some threads which could be theoretically avoided.
Multithreading is still rocket science in 2019. Especially in C++. Do not do it unless you know exactly what you are doing. Here is a good series of blog posts that handle threads.
With the new C++17 standard, I wonder if there is a good way to start a process with a fixed number of threads that runs until a batch of jobs is finished.
Can you tell me how I can achieve the desired functionality of this code:
std::vector<std::future<std::string>> futureStore;
const int batchSize = 1000;
const int maxNumParallelThreads = 10;
int threadsTerminated = 0;
while(threadsTerminated < batchSize)
{
const int& threadsRunning = futureStore.size();
while(threadsRunning < maxNumParallelThreads)
{
futureStore.emplace_back(std::async(someFunction));
}
for(std::future<std::string>& readyFuture: std::when_any(futureStore.begin(), futureStore.end()))
{
auto retVal = readyFuture.get();
// (possibly do something with the ret val)
threadsTerminated++;
}
}
I read that there was supposed to be a std::when_any function, but it is a feature that did not make it into the standard.
Is there any support for this functionality (not necessarily for std::future-s) in the current standard libraries? Is there a way to easily implement it, or do I have to resort to something like this?
This does not seem to me to be the ideal approach:
All your main thread does is wait for your other threads to finish, polling the results of your futures. That almost wastes this thread...
I don't know to what extent std::async re-uses the threads' infrastructure, so you risk creating entirely new threads each time (apart from that, you might not create any threads at all if you do not specify std::launch::async explicitly).
I personally would prefer another approach:
Create all the threads you want to use at once.
Let each thread run a loop, repeatedly calling someFunction(), until you have reached the number of desired tasks.
The implementation might look similar to this example:
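#include <cstdio>
#include <mutex>
#include <string>
#include <thread>
#include <vector>
#include <unistd.h> // sleep() (POSIX)
const int BatchSize = 20;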
int tasksStarted = 0;
std::mutex mutex;
std::vector<std::string> results;
std::string someFunction()
{
puts("worker started"); fflush(stdout);
sleep(2);
puts("worker done"); fflush(stdout);
return "";
}
void runner()
{
{
std::lock_guard<std::mutex> lk(mutex);
if(tasksStarted >= BatchSize)
return;
++tasksStarted;
}
for(;;)
{
std::string s = someFunction();
{
std::lock_guard<std::mutex> lk(mutex);
results.push_back(s);
if(tasksStarted >= BatchSize)
break;
++tasksStarted;
}
}
}
int main(int argc, char* argv[])
{
const int MaxNumParallelThreads = 4;
std::thread threads[MaxNumParallelThreads - 1]; // main thread is one, too!
for(int i = 0; i < MaxNumParallelThreads - 1; ++i)
{
threads[i] = std::thread(&runner);
}
runner();
for(int i = 0; i < MaxNumParallelThreads - 1; ++i)
{
threads[i].join();
}
// use results...
return 0;
}
This way, you do not recreate each thread newly, but just continue until all tasks are done.
If these tasks are not all alike as in the example above, you might create a base class Task with a pure virtual function (e.g. "execute" or "operator ()") and create subclasses with the required implementation (holding any necessary data).
You could then place the instances into a std::vector or std::list (well, we won't iterate, so a list might be appropriate here...) as pointers (otherwise, you get object slicing!) and let each thread remove one of the tasks when it has finished its previous one (do not forget to protect against race conditions!) and execute it. As soon as no more tasks are left, return...
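A minimal sketch of this task-queue idea, under my own naming assumptions: Task, EchoTask, TaskQueue and runAll are all hypothetical, and a std::deque of std::unique_ptr stands in for the vector or list of pointers.

```cpp
#include <cassert>
#include <deque>
#include <memory>
#include <mutex>
#include <string>
#include <thread>
#include <utility>
#include <vector>

// Base class with a pure virtual function, as described above.
struct Task {
    virtual ~Task() = default;
    virtual std::string execute() = 0;
};

// Hypothetical example subclass holding its own data.
struct EchoTask : Task {
    explicit EchoTask(std::string t) : text(std::move(t)) {}
    std::string execute() override { return text; }
    std::string text;
};

// Mutex-protected queue of task pointers (pointers avoid object slicing).
class TaskQueue {
public:
    void add(std::unique_ptr<Task> task) {
        std::lock_guard<std::mutex> lk(mutex_);
        tasks_.push_back(std::move(task));
    }
    // Returns the next task, or nullptr when no tasks are left.
    std::unique_ptr<Task> take() {
        std::lock_guard<std::mutex> lk(mutex_);
        if (tasks_.empty()) return nullptr;
        std::unique_ptr<Task> task = std::move(tasks_.front());
        tasks_.pop_front();
        return task;
    }
private:
    std::mutex mutex_;
    std::deque<std::unique_ptr<Task>> tasks_;
};

// Runs all queued tasks on numThreads threads (the caller counts as one).
std::vector<std::string> runAll(TaskQueue& queue, int numThreads) {
    assert(numThreads > 0);
    std::vector<std::string> results;
    std::mutex resultsMutex;
    auto worker = [&] {
        while (auto task = queue.take()) {
            std::string r = task->execute();  // run outside of any lock
            std::lock_guard<std::mutex> lk(resultsMutex);
            results.push_back(std::move(r));
        }
    };
    std::vector<std::thread> threads;
    for (int i = 0; i < numThreads - 1; ++i) threads.emplace_back(worker);
    worker();
    for (auto& t : threads) t.join();
    return results;
}
```

The order of the results depends on scheduling; if the order matters, each task would need to carry its own index.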
If you don't care about the exact number of threads, the simplest solution would be:
std::vector<std::future<std::string>> futureStore(
batchSize
);
std::generate(futureStore.begin(), futureStore.end(), [](){return std::async(someTask);});
for(auto& future : futureStore) {
std::string value = future.get();
doWork(value);
}
In my experience, std::async will reuse threads once a certain number of threads has been spawned. It will not spawn 1000 threads. Also, you will not gain much of a performance boost (if any) from using a thread pool. I did measurements in the past, and the overall runtime was nearly identical.
The only reason I use thread pools now is to avoid the delay of creating threads inside the computation loop. If you have timing constraints, you may miss deadlines when using std::async for the first time, since it creates the threads on the first calls.
There is a good thread pool library for these applications. Have a look here:
https://github.com/vit-vit/ctpl
#include <ctpl.h>
const unsigned int numberOfThreads = 10;
const unsigned int batchSize = 1000;
ctpl::thread_pool pool(numberOfThreads /* number of threads in the pool */);
std::vector<std::future<std::string>> futureStore(
batchSize
);
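// CTPL passes the worker's thread id as the first argument of the pushed callable
// (this assumes someTask itself takes no arguments)
std::generate(futureStore.begin(), futureStore.end(), [&](){ return pool.push([](int /*threadId*/){ return someTask(); }); });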
for(auto& future : futureStore) {
std::string value = future.get();
doWork(value);
}
I am trying to write a simple task class. It is a wrapper around std::future, it holds its state (not_started, running, completed), can start processing of given job on demand and it can repeatedly return result of its processing.
I can also offer some global functions for work with these tasks. But I am a little bit stuck in writing size_t wait_any(std::vector<task<T>>& tasks) function. This function is given a vector of tasks and should return index of the first completed task. If there are more tasks completed at the beginning, one of them must be returned (but this is not the problem).
A simple implementation using active waiting is following:
template <typename T>
size_t wait_any(std::vector<task<T>>& tasks) {
if (tasks.size() == 0) throw std::exception("Waiting for empty vector of tasks!");
for (auto i = tasks.begin(); i != tasks.end(); ++i) {
(*i).try_start();
}
while (true) {
for (size_t i = 0; i != tasks.size(); ++i) {
if (tasks[i].is_completed()) return i;
}
}
}
I would appreciate passive waiting for any completion. A std::this_thread::yield function is available, but I would rather not use it. As mentioned in documentation:
The exact behavior of this function depends on the implementation, in particular on the mechanics of the OS scheduler in use and the state of the system.
It seems that I should use std::condition_variable and std::mutex to get the whole thing working. There are a lot of examples showing use of these things, but I do not understand it at all and I have not found solution for this particular problem.
I would guess that I should create a std::condition_variable (just cv further) in the wait_any function. Then this cv (pointer) should be registered to all tasks from given vector. Once any of the tasks is completed (I can handle the moment when a task is done) it should call std::condition_variable::notify_one for all cv's registered in this task. These notified cv's should be also removed from all tasks which are holding them.
Now, I do not know how to use mutexes. I probably need to prevent multiple calls of notification and many other problems.
Any help appreciated!
I was thinking that since you only need one notification, you can use std::call_once to set the task_id which you require.
A naive way to go about it would be:
#include <atomic>
#include <chrono>
#include <iostream>
#include <mutex>
#include <thread>
#include <vector>
std::once_flag lala;
std::atomic_int winner( -1 );
void silly_task( int task_id )
{
//do nothing
std::call_once ( lala, [&]()
{
std::cout << "thread " << task_id << " wins" << std::endl;
winner = task_id;
} );
}
int main(){
std::vector<std::thread> vt;
for ( int i=0; i < 10 ; i ++ )
{
vt.push_back( std::thread( &silly_task, i) );
}
while ( winner == -1 )
{
std::this_thread::sleep_for(std::chrono::seconds(1));
}
for ( int i=0; i < 10 ; i ++ )
{
vt[i].join();
}
return 0;
} // end main
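If you want to avoid the sleep-and-poll loop in main, the same "first one wins" idea can be combined with a std::condition_variable so that the waiter blocks until a task reports completion. This is only a sketch: CompletionSignal and raceTasks are hypothetical names, and the tasks are simulated by sleeping for a given number of milliseconds rather than using the task<T> class from the question.

```cpp
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <thread>
#include <vector>

// Shared between the waiting thread and all tasks.
struct CompletionSignal {
    std::mutex mutex;
    std::condition_variable cv;
    bool done = false;       // set by the first task that completes
    std::size_t winner = 0;  // index of that task

    // Called by a task when it finishes; only the first call is recorded.
    void notifyCompleted(std::size_t index) {
        {
            std::lock_guard<std::mutex> lk(mutex);
            if (done) return;
            done = true;
            winner = index;
        }
        cv.notify_one();
    }

    // Blocks (without polling) until some task has completed.
    std::size_t waitAny() {
        std::unique_lock<std::mutex> lk(mutex);
        cv.wait(lk, [this] { return done; });
        return winner;
    }
};

// Starts one simulated task per entry (the value is its duration in ms)
// and returns the index of the task that reported completion first.
std::size_t raceTasks(const std::vector<int>& delaysMs) {
    assert(!delaysMs.empty());
    CompletionSignal signal;
    std::vector<std::thread> threads;
    for (std::size_t i = 0; i < delaysMs.size(); ++i) {
        threads.emplace_back([&signal, &delaysMs, i] {
            std::this_thread::sleep_for(std::chrono::milliseconds(delaysMs[i]));
            signal.notifyCompleted(i);
        });
    }
    const std::size_t first = signal.waitAny();
    for (auto& t : threads) t.join();
    return first;
}
```

The mutex protects done and winner, and the predicate in wait() covers both spurious wakeups and the case where a task finishes before the waiter starts waiting.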