Boost Threading Conceptualization / Questions - c++

I've got a function that is typically run 50 times (to run 50 simulations). Usually this is done sequentially, single-threaded, but I'd like to speed things up using multiple threads. The threads don't need to access each other's memory or data, so I don't think race conditions are an issue. Essentially each thread should just complete its task, signal main that it's finished, and return a double value.
First of all, looking through all the Boost documentation and examples has really confused me, and I'm not sure what I'm looking for anymore. boost::thread? Boost futures? Could someone give an example of what is applicable in my case? Additionally, I don't understand how to specify how many threads to run. Is it more like I would start 50 threads and the OS handles when to execute them?

If your code is completely CPU-bound (no network/disk IO), then you benefit most from starting as many background threads as you have CPUs. Use Boost's hardware_concurrency() function to determine that number and/or allow the user to set it. Just starting a bunch of threads is not helpful, as that only increases the overhead of creating, switching and terminating threads.
The code starting the threads is a simple loop, followed by another loop that waits for the threads' completion. You can also use the thread_group class for that; a minimal sketch follows. If the number of jobs is not known up front and can't be distributed at thread startup, consider a thread pool, where you start a sensible number of threads and then hand them jobs as they become available.
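A minimal sketch of that loop-plus-join pattern with boost::thread_group; the worker function here is a made-up placeholder for whatever processes one share of the jobs:
#include <boost/bind.hpp>
#include <boost/thread/thread.hpp>

void worker(unsigned id)
{
    // hypothetical: run the simulations assigned to thread `id`
}

int main()
{
    unsigned nthreads = boost::thread::hardware_concurrency();
    if (nthreads == 0)
        nthreads = 2; // could not detect the core count

    boost::thread_group group;
    for (unsigned i = 0; i < nthreads; ++i)
        group.create_thread(boost::bind(&worker, i)); // the starting loop

    group.join_all(); // the waiting loop
}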

Read the Boost.Thread Futures docs for an idea of using futures and async to achieve this. They also show how to do it manually (the hard way) using thread objects.
Given this serial code:
struct Data { /* simulation inputs */ };
double run_sim(Data*);

int main()
{
    const unsigned ntasks = 50;
    double results[ntasks];
    Data data[ntasks];
    for (unsigned i = 0; i < ntasks; ++i)
        results[i] = run_sim(&data[i]);
}
A naive parallel version would be:
#define BOOST_THREAD_PROVIDES_FUTURE
#include <boost/thread/future.hpp>
#include <boost/bind.hpp>

struct Data { /* simulation inputs */ };
double run_sim(Data*);

int main()
{
    const unsigned nsim = 50;
    Data data[nsim];
    boost::future<double> futures[nsim];
    for (unsigned i = 0; i < nsim; ++i)
        futures[i] = boost::async(boost::bind(&run_sim, &data[i]));
    double results[nsim];
    for (unsigned i = 0; i < nsim; ++i)
        results[i] = futures[i].get();
}
Because boost::async doesn't yet support deferred functions, every async call will create a new thread, so this will spawn 50 threads at once. This might perform quite badly, so you could split it up into smaller blocks:
#define BOOST_THREAD_PROVIDES_FUTURE
#include <boost/thread/future.hpp>
#include <boost/thread/thread.hpp>
#include <boost/bind.hpp>

struct Data { /* simulation inputs */ };
double run_sim(Data*);

int main()
{
    const unsigned nsim = 50;
    unsigned nprocs = boost::thread::hardware_concurrency();
    if (nprocs == 0)
        nprocs = 2; // cannot determine number of cores, let's say 2
    Data data[nsim];
    boost::future<double> futures[nsim];
    double results[nsim];
    for (unsigned i = 0; i < nsim; ++i)
    {
        if (((i + 1) % nprocs) != 0)
            futures[i] = boost::async(boost::bind(&run_sim, &data[i]));
        else
            results[i] = run_sim(&data[i]);
    }
    for (unsigned i = 0; i < nsim; ++i)
        if (((i + 1) % nprocs) != 0)
            results[i] = futures[i].get();
}
If hardware_concurrency() returns 4, this will create three new threads, then call run_sim synchronously in the main() thread, then create another three new threads, and so on. This prevents 50 threads all being created at once: the main thread stops to do some of the work itself, which allows some of the other threads to complete in the meantime.
The code above requires quite a recent version of Boost. It's slightly easier using the standard library if you can use C++11:
#include <future>
#include <thread>

struct Data { /* simulation inputs */ };
double run_sim(Data*);

int main()
{
    const unsigned nsim = 50;
    Data data[nsim];
    std::future<double> futures[nsim];
    double results[nsim];
    unsigned nprocs = std::thread::hardware_concurrency();
    if (nprocs == 0)
        nprocs = 2;
    for (unsigned i = 0; i < nsim; ++i)
    {
        if (((i + 1) % nprocs) != 0)
            futures[i] = std::async(std::launch::async, &run_sim, &data[i]);
        else
            results[i] = run_sim(&data[i]);
    }
    for (unsigned i = 0; i < nsim; ++i)
        if (((i + 1) % nprocs) != 0)
            results[i] = futures[i].get();
}


C++ Wait in main thread for future without while(true)

Question
I want to know if it is possible to wait in the main thread without any while(1) loop.
I launch a few threads via std::async() and do calculations on each thread. After I start the threads I want to receive the results back. I do that with std::future<>::get().
My problem
To receive a result I call std::future::get(), which blocks the main thread until the calculation on that thread is done. This leads to slower overall execution if one thread needs considerably more time than the following ones: I could be doing some calculation with the finished results in the meantime, and then do the remaining work once the slowest thread is done.
Is there a way to idle the main thread until ANY of the threads has finished running? I have thought of a callback function which wakes the main thread up, but I still don't know how to idle the main function without making it unresponsive for, say, a second at a time, and without running a while(true) loop instead.
Current code
#include <cinttypes>
#include <cstdio>
#include <future>

uint64_t calc_factorial(int start, int number);

int main()
{
    uint64_t n = 1;
    // the user-entered number
    uint64_t number = 0;
    // get the user input
    printf("Enter number (uint64_t): ");
    scanf("%" SCNu64, &number);
    std::future<uint64_t> results[4];
    for (int i = 0; i < 4; i++)
    {
        // push to different cores
        results[i] = std::async(std::launch::async, calc_factorial, i + 2, number);
    }
    for (int i = 0; i < 4; i++)
    {
        // retrieve result... I don't want to wait here if one thread needs more time than usual
        n *= results[i].get();
    }
    // print n or the time needed
    return 0;
}

uint64_t calc_factorial(int start, int number)
{
    uint64_t n = 1;
    for (int i = start; i <= number; i += 4)
        n *= i;
    return n;
}
I prepared a code snippet which runs fine. I am using the GMP library for the big results, but the code runs with uint64_t instead if you enter small numbers.
Note
If you have already compiled the GMP library on your PC for whatever reason, you could replace every uint64_t with mpz_class.
I'd approach this somewhat differently.
Unless I have a fairly specific reason to do otherwise, I tend to approach most multithreaded code the same general way: use a (thread-safe) queue to transmit results. So create an instance of a thread-safe queue, and pass a reference to it to each of the threads that's going to generate the data. Then have whatever thread is going to collect the results grab them from the queue.
This makes it automatic (and trivial) that you process each result as it's produced, rather than getting stuck waiting for one result while others are already available; a minimal sketch follows.
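A minimal sketch of that idea, assuming C++11 and reusing calc_factorial from the question; the result_queue type here is hand-rolled scaffolding, not a standard container:
#include <condition_variable>
#include <cstdint>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// Hand-rolled thread-safe queue: push() from the workers, pop() blocks
// the consumer until any result is available.
class result_queue {
    std::queue<uint64_t> q_;
    std::mutex m_;
    std::condition_variable cv_;
public:
    void push(uint64_t v) {
        { std::lock_guard<std::mutex> lk(m_); q_.push(v); }
        cv_.notify_one();
    }
    uint64_t pop() {
        std::unique_lock<std::mutex> lk(m_);
        cv_.wait(lk, [this] { return !q_.empty(); });
        uint64_t v = q_.front();
        q_.pop();
        return v;
    }
};

uint64_t calc_factorial(int start, int number) { // from the question
    uint64_t n = 1;
    for (int i = start; i <= number; i += 4) n *= i;
    return n;
}

int main() {
    result_queue results;
    std::vector<std::thread> threads;
    int number = 20; // stand-in for the user input
    for (int i = 0; i < 4; i++)
        threads.emplace_back([&results, i, number] {
            results.push(calc_factorial(i + 2, number));
        });
    uint64_t n = 1;
    for (int i = 0; i < 4; i++)
        n *= results.pop(); // handled in arrival order, not thread order
    for (auto& t : threads) t.join();
    return 0;
}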

How to run all threads in sequence as static without using OpenMP for?

I'm new to OpenMP and multi-threading.
I have been given a task to run a method with static, dynamic, and guided scheduling without using the OpenMP for construct, which means I can't use schedule clauses!
I could create parallel threads with #pragma omp parallel and could assign loop iterations to the threads equally, but how do I make it static, dynamic (with a block size of 1000), and guided?
#include <omp.h>

void static_scheduling_function(const int start_count,
                                const int upper_bound,
                                int *results)
{
    int i, tid, numt;
    #pragma omp parallel private(i, tid, numt)
    {
        int from, to;
        tid = omp_get_thread_num();
        numt = omp_get_num_threads();
        from = (upper_bound / numt) * tid;
        to = (upper_bound / numt) * (tid + 1) - 1;
        if (tid == numt - 1)
            to = upper_bound - 1;
        for (i = from; i <= to; i++) // `to` is inclusive
        {
            // compute one iteration (i)
            int start = i;
            int end = i + 1;
            compute_iterations(start, end, results);
        }
    }
}
======================================
For dynamic scheduling I have tried something like this:
void chunk_scheduling_function(const int start_count, const int upper_bound, int *results)
{
    for (int shared_lower_iteration_counter = start_count;
         shared_lower_iteration_counter < upper_bound;)
    {
        #pragma omp parallel shared(shared_lower_iteration_counter)
        {
            int from, to;
            int chunk = 1000;
            #pragma omp critical
            {
                from = shared_lower_iteration_counter;       // e.g. 10, 1010, ...
                to = shared_lower_iteration_counter + chunk; // e.g. 1010, 2010, ...
                // critical is important while incrementing the shared variable
                // which decides the next iteration
                shared_lower_iteration_counter = shared_lower_iteration_counter + chunk;
            }
            // i < upper_bound prevents threads from running past the end
            for (int i = from; i < to && i < upper_bound; i++)
            {
                int start = i;
                int end = i + 1;
                compute_iterations(start, end, results);
            }
        }
    }
}
This looks like a university assignment (and a very good one, IMO), so I will not provide the complete solution; instead, I will point out what you should be looking for.
The static scheduler looks okay; nonetheless, it can be improved by taking the chunk size into account as well.
The dynamic and guided schedulers can be implemented using a variable (let us name it shared_iteration_counter) that marks the next loop iteration to be picked up by the threads. When a thread needs a new task to work on (i.e., a new loop iteration), it queries that variable. In pseudo code this would look like the following:
int thread_current_iteration = shared_iteration_counter++;
while (thread_current_iteration < MAX_SIZE)
{
    // do work
    thread_current_iteration = shared_iteration_counter++;
}
The pseudo code assumes a chunk size of 1 (i.e., shared_iteration_counter++); you will have to adapt it to your use case. Now, because that variable will be shared among threads, and every thread will be updating it, you need to ensure mutual exclusion during the updates of that variable. Fortunately, OpenMP offers several means to achieve that, for instance #pragma omp critical, explicit locks, and atomic operations. The last is the best option for your use case:
#pragma omp atomic
shared_iteration_counter = shared_iteration_counter + 1;
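One subtlety: the pseudo code's shared_iteration_counter++ must read the old value and advance the counter in one indivisible step, which is what the capture form of the atomic construct provides. A minimal compilable fragment, keeping the full scheduler as the exercise intends:
#include <omp.h>

// Reads the old value and advances the shared counter in one indivisible
// step (what the pseudo code's shared_iteration_counter++ needs).
int grab_next_iteration(int *shared_iteration_counter)
{
    int thread_current_iteration;
    #pragma omp atomic capture
    thread_current_iteration = (*shared_iteration_counter)++;
    return thread_current_iteration;
}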
For the guided scheduler:
Similar to dynamic scheduling, but the chunk size starts off large and
decreases to better handle load imbalance between iterations. The
optional chunk parameter specifies the minimum chunk size to use. By
default the chunk size is approximately loop_count/number_of_threads.
In this case, not only do you have to guarantee mutual exclusion on the variable counting the current loop iteration, but also on the chunk size variable, since it changes as well.
Without giving too much away, bear in mind that you may need to consider how to deal with edge cases such as thread_current_iteration = 1000 and chunk_size = 1000 with MAX_SIZE = 1500. Hence, thread_current_iteration + chunk_size > MAX_SIZE, but there are still 500 iterations to be computed.
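For illustration only, a hypothetical helper (names made up) showing how that edge case could be clamped; the full guided scheduler is left as the exercise intends:
// Computes a guided-style chunk: large at first, shrinking with the
// remaining work, clamped to the minimum size and to the end of the range.
// The caller must read the counter, call this, and advance the counter
// inside one critical section (or equivalent).
int next_guided_chunk(int current, int max_size, int num_threads, int min_chunk)
{
    int remaining = max_size - current;
    if (remaining <= 0)
        return 0;                             // nothing left to do
    int chunk = remaining / num_threads;      // shrinks as work runs out
    if (chunk < min_chunk) chunk = min_chunk; // never below the minimum
    if (chunk > remaining) chunk = remaining; // the edge case described above
    return chunk;
}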

Running fixed number of threads

With the new standard of C++17, I wonder if there is a good way to start a process with a fixed number of threads until a batch of jobs is finished.
Can you tell me how I can achieve the desired functionality of this code:
std::vector<std::future<std::string>> futureStore;
const int batchSize = 1000;
const int maxNumParallelThreads = 10;
int threadsTerminated = 0;

while (threadsTerminated < batchSize)
{
    const int threadsRunning = futureStore.size();
    while (threadsRunning < maxNumParallelThreads)
    {
        futureStore.emplace_back(std::async(someFunction));
    }
    for (std::future<std::string>& readyFuture : std::when_any(futureStore.begin(), futureStore.end()))
    {
        auto retVal = readyFuture.get();
        // (possibly do something with the ret val)
        threadsTerminated++;
    }
}
I read that there used to be an std::when_any proposal, but it was a feature that didn't make it into the standard.
Is there any support for this functionality (not necessarily for std::futures) in the current standard library? Is there a way to easily implement it, or do I have to resort to something like the above?
This does not seem to me to be the ideal approach:
All your main thread does is wait for the other threads to finish, polling the results of your futures. You are almost wasting this thread.
I don't know to what extent std::async re-uses thread infrastructure, so you risk creating entirely new threads each time (apart from the fact that you might not create any threads at all if you do not specify std::launch::async explicitly; see the snippet below).
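A small illustration of that last point; the default launch policy lets the implementation choose between async and deferred execution:
#include <future>
#include <string>

std::string someFunction() { return "done"; } // stand-in task

int main()
{
    // Default policy: the implementation may defer the call and run it
    // on this thread only when get() is invoked.
    auto lazy = std::async(someFunction);
    // std::launch::async: the call runs as if on a new thread.
    auto eager = std::async(std::launch::async, someFunction);
    std::string a = eager.get();
    std::string b = lazy.get();
}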
I personally would prefer another approach:
Create all the threads you want to use at once.
Let each thread run a loop, repeatedly calling someFunction(), until the desired number of tasks has been reached.
The implementation might look similar to this example:
#include <chrono>
#include <cstdio>
#include <mutex>
#include <string>
#include <thread>
#include <vector>

const int BatchSize = 20;
int tasksStarted = 0;
std::mutex mutex;
std::vector<std::string> results;

std::string someFunction()
{
    puts("worker started"); fflush(stdout);
    std::this_thread::sleep_for(std::chrono::seconds(2)); // simulate work
    puts("worker done"); fflush(stdout);
    return "";
}

void runner()
{
    { // claim the first task, or bail out if none are left
        std::lock_guard<std::mutex> lk(mutex);
        if (tasksStarted >= BatchSize)
            return;
        ++tasksStarted;
    }
    for (;;)
    {
        std::string s = someFunction();
        {
            std::lock_guard<std::mutex> lk(mutex);
            results.push_back(s);
            if (tasksStarted >= BatchSize)
                break;
            ++tasksStarted;
        }
    }
}

int main(int argc, char* argv[])
{
    const int MaxNumParallelThreads = 4;
    std::thread threads[MaxNumParallelThreads - 1]; // main thread is one, too!
    for (int i = 0; i < MaxNumParallelThreads - 1; ++i)
    {
        threads[i] = std::thread(&runner);
    }
    runner();
    for (int i = 0; i < MaxNumParallelThreads - 1; ++i)
    {
        threads[i].join();
    }
    // use results...
    return 0;
}
This way, you do not repeatedly create new threads; each thread just continues until all tasks are done.
If these tasks are not all alike as in the example above, you might create a base class Task with a pure virtual function (e.g. "execute" or "operator()") and create subclasses with the required implementation (holding any necessary data).
You could then place the instances into a std::vector or std::list (well, we won't iterate, so a list might be appropriate here) as pointers (otherwise, you get object slicing!) and let each thread remove one of the tasks when it has finished its previous one (do not forget to protect against race conditions!) and execute it. As soon as no tasks are left, return. A sketch of this idea follows.
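A minimal sketch of that idea, assuming C++11; the Task/PrintTask names are made up for this example:
#include <cstdio>
#include <list>
#include <memory>
#include <mutex>
#include <thread>
#include <vector>

struct Task {
    virtual ~Task() {}
    virtual void operator()() = 0; // do the actual work
};

struct PrintTask : Task { // example subclass holding its own data
    int id;
    explicit PrintTask(int i) : id(i) {}
    void operator()() override { std::printf("task %d\n", id); }
};

std::mutex task_mutex;
std::list<std::unique_ptr<Task>> tasks; // pointers avoid slicing

void worker()
{
    for (;;) {
        std::unique_ptr<Task> t;
        { // protect removal against races
            std::lock_guard<std::mutex> lk(task_mutex);
            if (tasks.empty()) return; // no tasks left: we are done
            t = std::move(tasks.front());
            tasks.pop_front();
        }
        (*t)(); // execute outside the lock
    }
}

int main()
{
    for (int i = 0; i < 20; ++i)
        tasks.push_back(std::unique_ptr<Task>(new PrintTask(i)));
    std::vector<std::thread> threads;
    for (int i = 0; i < 4; ++i)
        threads.emplace_back(worker);
    for (auto& th : threads) th.join();
}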
If you don't care about the exact number of threads, the simplest solution would be:
std::vector<std::future<std::string>> futureStore(batchSize);

std::generate(futureStore.begin(), futureStore.end(),
              [] { return std::async(someTask); });

for (auto& future : futureStore) {
    std::string value = future.get();
    doWork(value);
}
From my experience, std::async will reuse threads after a certain number of threads has been spawned; it will not spawn 1000 threads. Also, you will not gain much of a performance boost (if any) from using a thread pool. I did measurements in the past, and the overall runtime was nearly identical.
The only reason I use thread pools now is to avoid the delay of creating threads in the computation loop. If you have timing constraints, you may miss deadlines when using std::async for the first time, since it creates the threads on the first calls.
There is a good thread pool library for these applications. Have a look here:
https://github.com/vit-vit/ctpl
#include <algorithm>
#include <ctpl.h>
#include <future>
#include <string>
#include <vector>

const unsigned int numberOfThreads = 10;
const unsigned int batchSize = 1000;

ctpl::thread_pool pool(numberOfThreads); // ten threads in the pool
std::vector<std::future<std::string>> futureStore(batchSize);

// note: CTPL passes the worker's thread id as the first argument,
// so someTask needs the signature std::string someTask(int id)
std::generate(futureStore.begin(), futureStore.end(),
              [] { return pool.push(someTask); });

for (auto& future : futureStore) {
    std::string value = future.get();
    doWork(value);
}

TBB task_arena & task_group usage for scaling parallel_for work

I am trying to use the Threading Building Blocks task_arena. There is a simple array full of '0'. The arena's threads put '1' in the array at the odd places. The main thread puts '2' in the array at the even places.
/* Odd-even arenas TBB test */
#include <tbb/parallel_for.h>
#include <tbb/blocked_range.h>
#include <tbb/task_arena.h>
#include <tbb/task_group.h>
#include <iostream>
using namespace std;

const int SIZE = 100;

int main()
{
    tbb::task_arena limited(1); // no more than 1 thread in this arena
    tbb::task_group tg;
    int myArray[SIZE] = {0};

    //! Main thread creates another thread, then immediately returns
    limited.enqueue([&]{
        //! Created thread continues here
        tg.run([&]{
            tbb::parallel_for(tbb::blocked_range<int>(0, SIZE),
                [&](const tbb::blocked_range<int> &r)
                {
                    for (int i = 0; i != SIZE; i++)
                        if (i % 2 == 0)
                            myArray[i] = 1;
                }
            );
        });
    });

    //! Main thread does this work
    tbb::parallel_for(tbb::blocked_range<int>(0, SIZE),
        [&](const tbb::blocked_range<int> &r)
        {
            for (int i = 0; i != SIZE; i++)
                if (i % 2 != 0)
                    myArray[i] = 2;
        }
    );

    //! Main thread waits for the 'tg' group
    // (it does not create any threads here, does it?)
    limited.execute([&]{
        tg.wait();
    });

    for (int i = 0; i < SIZE; i++) {
        cout << myArray[i] << " ";
    }
    cout << endl;
    return 0;
}
The output is:
0 2 0 2 ... 0 2
So the limited.enqueue{tg.run{...}} block doesn't work.
What's the problem? Any ideas? Thank you.
You have created the limited arena for one thread only, and by default this slot is reserved for the master thread. Although enqueuing into such a serializing arena will temporarily boost its concurrency level to 2 (in order to satisfy the 'fire-and-forget' promise of enqueue), enqueue() does not guarantee synchronous execution of the submitted task. So tg.wait() can start before tg.run() executes, and thus the program will not wait while the worker thread is created, joins the limited arena, and fills the array with '1' (BTW, the whole array is filled in each of the 100 parallel_for body invocations, because the loops ignore the range r).
So, in order to wait for tg.run() to complete, use limited.execute instead. But that will prevent the automatic raising of limited's concurrency level, and the task will be deferred until tg.wait() is executed by the master thread.
If you want to see asynchronous execution, set the arena's concurrency to 2 manually: tbb::task_arena limited(2);
or disable slot reservation for the master thread: tbb::task_arena limited(1,0) (but note that this implies additional overhead for dynamic balancing of the number of threads in the arena).
P.S. TBB has no points where threads are guaranteed to arrive (unlike OpenMP). Only the enqueue methods guarantee the creation of at least one worker thread, but they say nothing about when it will arrive. See the local observer feature to get a notification when threads actually join an arena; a sketch follows.
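A hedged sketch of such a local observer, assuming a TBB version that supports arena-bound task_scheduler_observer (header names vary between TBB releases):
#include <cstdio>
#include <tbb/task_arena.h>
#include <tbb/task_scheduler_observer.h>

// Logs when threads join/leave the observed arena.
class arena_logger : public tbb::task_scheduler_observer {
public:
    explicit arena_logger(tbb::task_arena& arena)
        : tbb::task_scheduler_observer(arena)
    {
        observe(true); // start receiving callbacks
    }
    void on_scheduler_entry(bool is_worker) override {
        std::printf("%s thread joined the arena\n", is_worker ? "worker" : "master");
    }
    void on_scheduler_exit(bool is_worker) override {
        std::printf("%s thread left the arena\n", is_worker ? "worker" : "master");
    }
};

// Usage (sketch): construct it right after the arena, e.g.
//   tbb::task_arena limited(1);
//   arena_logger logger(limited);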

Spawn a set of threads iteratively in C++11?

I have a function that populates entries in a large matrix. As the computations are independent, I was thinking about exploiting std::thread so that chunks of the matrix can be processed by separate threads.
Instead of dividing the matrix into n chunks, where n is the limit on the maximum number of threads allowed to run simultaneously, I would like to make finer chunks, so that I can spawn a new thread whenever an existing thread finishes. (As the compute time will differ widely between entries, dividing the matrix equally will not be very efficient here; hence the latter idea.)
What are the concepts in std::thread I should look into for doing this? (I came across async and condition_variable, although I don't clearly see how they can be exploited for this kind of spawning.) Some example pseudo code would help greatly!
Why tax the OS scheduler with thread creation and destruction? (Assume these operations are expensive.) Instead, make your threads do more work.
EDIT: If you do not want to split the work into equal chunks, then the best solution really is a thread pool. FYI, there is a thread_pool library in the works for C++14.
What is below assumes that you can split the work into equal chunks, so it is not exactly applicable to your question. END OF EDIT.
#include <thread>
#include <vector>

struct matrix
{
    int nrows, ncols;
    // assuming row-based processing; adjust for column-based processing
    void fill_rows(int first, int last);
};

int main()
{
    int num_threads = std::thread::hardware_concurrency();
    std::vector<std::thread> threads(num_threads);

    matrix m; // must be initialized...

    // here - every thread will process as many rows as needed
    int nrows_per_thread = m.nrows / num_threads;
    for (int i = 0; i != num_threads; ++i)
    {
        // thread i will process these rows:
        int first = i * nrows_per_thread;
        int last = first + nrows_per_thread;
        // last thread gets the remaining rows
        last += (i == num_threads - 1) ? m.nrows % nrows_per_thread : 0;
        threads[i] = std::thread([&m, first, last]{ m.fill_rows(first, last); });
    }
    for (int i = 0; i != num_threads; ++i)
    {
        threads[i].join();
    }
}
If this is an operation you do very frequently, then use a worker pool, as @Igor Tandetnik suggests in the comments. For one-offs, it's not worth the trouble.