I want to learn how to adapt pseudocode I have for multithreading, line by line, to C++. I understand the pseudocode, but I am not very experienced with C++ or with std::thread.
This is the pseudocode I have and that I've often used:
myFunction
{
    int threadNr = previous;
    int numberProcs = countProcessors();
    // Every thread calculates a different line
    for (y = y_start + threadNr; y < y_end; y += numberProcs) {
        // Horizontal lines
        for (int x = x_start; x < x_end; x++) {
            psetp(x, y, RGB(255, 128, 0));
        }
    }
}
int numberProcs = countProcessors();
// Launch threads: e.g. for 1 processor launch no other thread, for 2 processors launch 1 thread, for 4 processors launch 3 threads
for (i = 0; i < numberProcs - 1; i++)
    triggerThread(50, FME_CUSTOMEVENT, i); // The last parameter is the thread number
triggerEvent(50, FME_CUSTOMEVENT, numberProcs - 1); // The last thread is used for progress
// Wait for all threads to finish
waitForThread(0, 0xffffffff, -1);
I know I can call my current function using one thread via std::thread like this:
std::thread t1(FilterImage,&size_param, cdepth, in_data, input_worldP, output_worldP);
t1.join();
But this is not efficient as it is calling the entire function over and over again per thread.
I would expect every processor to tackle a horizontal line on its own.
Any example code would be highly appreciated, as I tend to learn best through example.
Invoking thread::join() forces the calling thread to wait for the child thread to finish executing. For example, if I use it to create a number of threads in a loop, and call join() on each one, it'll be the same as though everything happened in sequence.
Here's an example. I have two methods that print out the numbers 1 through n. The first one does it single threaded, and the second one joins each thread as they're created. Both have the same output, but the threaded one is slower because you're waiting for each thread to finish before starting the next one.
#include <iostream>
#include <thread>
void printN_nothreads(int n) {
    for (int i = 0; i < n; i++) {
        std::cout << i << "\n";
    }
}
void printN_threaded(int n) {
    for (int i = 0; i < n; i++) {
        std::thread t([=]() { std::cout << i << "\n"; });
        t.join(); // This forces synchronization
    }
}
Doing threading better.
To gain benefit from using threads, you have to start all the threads before joining them. In addition, to avoid false sharing, each thread should work on a separate region of the image (ideally a section that's far away in memory).
Let's look at how this would work. I don't know what library you're using, so instead I'm going to show you how to write a multi-threaded transform on a vector.
auto transform_section = [](auto func, auto begin, auto end) {
    for (; begin != end; ++begin) {
        func(*begin);
    }
};
This transform_section function will be called once per thread, each on a different section of the vector. Let's write transform so it's multithreaded.
template<class Func, class T>
void transform(Func func, std::vector<T>& data, int num_threads) {
    size_t size = data.size();
    auto section_start = [size, num_threads](int thread_index) {
        return size * thread_index / num_threads;
    };
    auto section_end = [size, num_threads](int thread_index) {
        return size * (thread_index + 1) / num_threads;
    };
    std::vector<std::thread> threads(num_threads);
    // Each thread works on a different section
    for (int i = 0; i < num_threads; i++) {
        // data.data() + offset forms the one-past-the-end pointer of the last
        // section without indexing past the final element
        T* start = data.data() + section_start(i);
        T* end = data.data() + section_end(i);
        threads[i] = std::thread(transform_section, func, start, end);
    }
    // We only join AFTER all the threads are started
    for (std::thread& t : threads) {
        t.join();
    }
}
Related
The program I am implementing involves iterating over a medium amount of independent data, performing some computation, collecting the result, and then looping back over again. This loop needs to execute very quickly. To solve this, I am trying to implement the following thread pattern.
Rather than spawn threads in setup and join them in collect, I would like to spawn all threads initially, and keep them synchronized throughout their loops. This question regarding thread barriers had initially seemed to point me in the right direction, but my implementation of them is not working. Below is my example
int main() {
    int counter = 0;
    int threadcount = 10;
    auto on_completion = [&]() noexcept {
        ++counter; // Increment the counter
    };
    std::barrier sync_point(threadcount, on_completion);
    auto work = [&]() {
        while (true)
            sync_point.arrive_and_wait(); // Keep cycling the sync point
    };
    std::vector<std::thread> threads;
    for (int i = 0; i < threadcount; ++i)
        threads.emplace_back(work); // Start every thread
    for (auto& thread : threads)
        thread.join();
}
To keep things as simple as possible, there is no computation being done in the worker threads, and I have done away with the setup thread. I am simply cycling the threads, syncing them after each cycle, and keeping a count of how many times they have looped. However, this code is deadlocking very quickly. More threads = faster deadlock. Adding work/delay inside the compute threads slows down the deadlock, but does not stop it.
Am I abusing the thread barrier? Is this unexpected behavior? Is there a cleaner way to implement this pattern?
Edit
It looks like removing the on_completion gets rid of the deadlock. I tried a different approach to meet the synchronization requirements without using the function, but it still deadlocks fairly quickly.
int threadcount = 10;
std::barrier start_point(threadcount + 1);
std::barrier stop_point(threadcount + 1);
auto work = [&](int i) {
    while (true) {
        start_point.arrive_and_wait();
        stop_point.arrive_and_wait();
    }
};
std::vector<std::thread> threads;
for (int i = 0; i < threadcount; ++i) {
    threads.emplace_back(work, i);
}
while (true) {
    std::cout << "Setup" << std::endl;
    start_point.arrive_and_wait(); // Sync to start
    // Workers do work here
    stop_point.arrive_and_wait(); // Sync to end
    std::cout << "Collect" << std::endl;
}
I'm trying to create a multithreaded part in my program, where a loop creates multiple threads that each get a vector of objects along with some integers, plus the vector which holds the results.
The problem is I can't seem to wrap my head around how the thread part works; I tried different things, but they all end in the same three errors.
This is where I don't know how to proceed:
std::thread thread_superdiecreator;
for (int64_t i = 0; i < dicewithside.back().sides; i++) {
    thread_superdiecreator(func_multicreator(dicewithside, i, amount, lastdiepossibilities, superdie));
}
term does not evaluate to a function taking 1 arguments
I tried this:
thread_superdiecreator(func_multicreator, dicewithside, i, amount, lastdiepossibilities, superdie);
call of an object of a class type without appropriate operator() or conversion functions to pointer-to-function type
And this:
std::thread thread_superdiecreator(func_multicreator, dicewithside, i, amount, lastdiepossibilities, superdie);
Invoke error in thread.
The whole code snippet:
#pragma once
#include <mutex>
#include <thread>
#include <algorithm>
#include "class_Diewithside.h"
#include "struct_Sortedinput.h"
#include "func_maximumpossibilities.h"

std::mutex superdielock;

void func_multicreator(std::vector<Diewithside> dicewithside, int64_t lastdieside, int64_t size, int64_t lastdiepossibilities, std::vector<int64_t> &superdie) {
    // Set the last die side to the number of the thread
    dicewithside[size - 1].dieside = lastdieside;
    std::vector<int64_t> partsuperdie;
    partsuperdie.reserve(lastdiepossibilities);
    // Calculate all possible results of all dice thrown with the last one set
    for (int64_t i = 0; i < lastdiepossibilities; i++) {
        // Reset the result
        int64_t result = 0;
        for (int64_t j = 0; j < size; j++) {
            result += dicewithside[j].alleyes[dicewithside[j].dieside];
        }
        partsuperdie.push_back(result);
        // Advance the die sides to the next combination
        for (int64_t j = 0; j < size - 1; j++) {
            if (dicewithside[j].dieside == dicewithside[j].sides - 1) {
                dicewithside[j].dieside = 0;
            }
            else {
                dicewithside[j].dieside += 1;
                break;
            }
        }
    }
    superdielock.lock();
    for (int64_t i = 0; i < lastdiepossibilities; i++) {
        superdie.push_back(partsuperdie[i]);
    }
    superdielock.unlock();
}
// The function superdiecreator creates an array that holds all possible outcomes of the dice thrown
std::vector<int64_t> func_superdiecreator(sortedinput varsortedinput) {
    // Get the size of the diceset vector and create a new vector of class Diewithside
    int64_t size = varsortedinput.dicesets.size();
    std::vector<Diewithside> dicewithside;
    // Initialize the integer amount and add together the amounts of all dice sets to set the vector dicewithside's reserve
    int64_t amount = 0;
    for (int64_t i = 0; i < size; i++) {
        amount += varsortedinput.dicesets[i].amount;
    }
    dicewithside.reserve(amount);
    // Fill the new vector dicewithside with each single die, with a starting value of 0
    for (int64_t i = 0; i < size; i++) {
        for (int64_t j = 0; j < varsortedinput.dicesets[i].amount; j++) {
            dicewithside.push_back(Diewithside{varsortedinput.dicesets[i].plusorminus, varsortedinput.dicesets[i].sides, varsortedinput.dicesets[i].alleyes, 0});
        }
    }
    // Get the maximum possibilities and divide by the sides of the last die to get the number of iterations each thread has to run
    int64_t maximumpossibilities = func_maximumpossibilities(varsortedinput.dicesets, size);
    int64_t lastdiepossibilities = maximumpossibilities / dicewithside[amount - 1].sides;
    // Multithread: calculate all possibilities and save them in the array
    std::vector<int64_t> superdie;
    superdie.reserve(maximumpossibilities);
    std::thread thread_superdiecreator;
    for (int64_t i = 0; i < dicewithside.back().sides; i++) {
        thread_superdiecreator(func_multicreator(dicewithside, i, amount, lastdiepossibilities, superdie));
    }
    thread_superdiecreator.join();
    return superdie;
}
Thanks for any help!
You indeed need to create the thread using the third alternative mentioned in the question, i.e. use the constructor of std::thread to start the thread.
The issue with this approach is that the last parameter of func_multicreator is an lvalue reference: std::thread creates copies of its parameters and moves those copies when calling the function on the background thread, and an rvalue cannot be implicitly bound to an lvalue reference. You need std::reference_wrapper here to be able to "pass" an lvalue reference to the thread.
You should join every thread created, so you need to create a collection of threads.
Simplified example:
(The interesting stuff is between the ---... comments.)
struct Diewithside
{
    int64_t sides;
};

void func_multicreator(std::vector<Diewithside> dicewithside, int64_t lastdieside, int64_t size, int64_t lastdiepossibilities, std::vector<int64_t>& superdie)
{
}

std::vector<int64_t> func_superdiecreator() {
    std::vector<Diewithside> dicewithside;
    // Initialize the integer amount and iterate through all the amounts of vector dicesets adding them together to set the vector dicewithside reserve
    int64_t amount = 0;
    int64_t lastdiepossibilities = 0;
    std::vector<int64_t> superdie;
    // -----------------------------------------------
    std::vector<std::thread> threads;
    for (int64_t i = 0; i < dicewithside.back().sides; i++) {
        // create thread using the constructor std::thread(func_multicreator, dicewithside, i, amount, lastdiepossibilities, std::reference_wrapper(superdie));
        threads.emplace_back(func_multicreator, dicewithside, i, amount, lastdiepossibilities, std::reference_wrapper(superdie));
    }
    for (auto& t : threads)
    {
        t.join();
    }
    // -----------------------------------------------
    return superdie;
}
std::thread thread_superdiecreator;
A single std::thread object always represents a single execution thread. You seem to be trying to use this single object to represent multiple execution threads. No matter what you try, it won't work. You need multiple std::thread objects, one for each execution thread.
thread_superdiecreator(func_multicreator, dicewithside, i, amount, lastdiepossibilities, superdie);
An actual execution thread gets created by constructing a new std::thread object, and not by invoking it as a function.
Constructing an execution thread object corresponds to the creation of a new execution thread, it's just that simple. And the simplest way to have multiple execution threads is to have a vector of them.
std::vector<std::thread> all_execution_threads;
With that in place, creating a new execution thread involves nothing more than constructing a new std::thread object and moving it into the vector. Or, better yet, emplace it directly:
all_execution_threads.emplace_back(
    func_multicreator, dicewithside, i,
    amount, lastdiepossibilities, superdie
);
This presumes that everything else is correct: func_multicreator accepts the parameters that follow it; none of them are passed by reference (you need to fix this at least, since as written the attempt to pass a reference into a thread function will not work and leaves dangling references behind); all access to objects shared by multiple execution threads is correctly synchronized, with mutexes; and all the other usual pitfalls of working with multiple execution threads are avoided. But this covers the basics of creating some unspecified number of concurrent execution threads. When all is said and done, you end up with a std::vector of std::threads, one for each actual execution thread.
I have some class objects and want to hand them over to several threads. The number of threads is given by the command line.
When I write it the following way, it works fine:
thread t1(thread(thread(tasks[0], parts[0])));
thread t2(thread(thread(tasks[1], parts[1])));
thread t3(thread(thread(tasks[2], parts[2])));
thread t4(thread(thread(tasks[3], parts[3])));
t1.join();
t2.join();
t3.join();
t4.join();
But as I mentioned, the number of threads shall be given by the command line, so it must be more dynamic. I tried the following code, which doesn't work, and I have no idea what is wrong with it:
for (size_t i = 0; i < threads.size(); i++) {
    threads.push_back(thread(tasks[i], parts[i]));
}
for (auto& t : threads) {
    t.join();
}
I hope someone has an idea on how to correct it.
In this statement:
thread t1(thread(thread(tasks[0], parts[0])));
You don't need to move a thread into another thread and then move that into another thread. Just pass your task parameters directly to t1's constructor:
thread t1(tasks[0], parts[0]);
Same with t2, t3, and t4.
As for your loop:
for (size_t i = 0; i < threads.size(); i++) {
    threads.push_back(thread(tasks[i], parts[i]));
}
Assuming you are using std::vector<std::thread> threads, then your loop is populating threads wrong. At best, the loop simply won't do anything at all if threads is initially empty, because i < threads.size() will be false when size()==0. At worst, if threads is not initially empty then the loop will run and continuously increase threads.size() with each call to threads.push_back(), causing an endless loop because i < threads.size() will never be false, thus pushing more and more threads into threads until memory blows up.
Try something more like this instead:
size_t numThreads = ...; // taken from cmd line...
std::vector<std::thread> threads(numThreads);
for (size_t i = 0; i < numThreads; i++) {
    threads[i] = std::thread(tasks[i], parts[i]);
}
for (auto& t : threads) {
    t.join();
}
Or this:
size_t numThreads = ...; // taken from cmd line...
std::vector<std::thread> threads;
threads.reserve(numThreads);
for (size_t i = 0; i < numThreads; i++) {
    threads.emplace_back(tasks[i], parts[i]);
}
for (auto& t : threads) {
    t.join();
}
Threads are not copyable, but they are movable; try this:
threads.emplace_back(std::thread(task));
or, more simply, let emplace_back construct the thread in place:
threads.emplace_back(task);
I want to run n instances of an algorithm in parallel and compute the mean of a function f of the results. If I'm not terribly wrong, the following code achieves this goal:
struct X {};
int f(X) { return /* ... */; }
int main()
{
    std::size_t const n = /* ... */;
    std::vector<std::future<X>> results;
    results.reserve(n);
    for (std::size_t i = 0; i < n; ++i)
        results.push_back(std::async([]() -> X { /* ... */ }));
    int mean = 0;
    for (std::size_t i = 0; i < n; ++i)
        mean += f(results[i].get());
    mean /= n;
}
However, is there a better way to do this? The obvious problem with the code above is the following: The order of summation in the line mean += f(results[i].get()); doesn't matter. Thus, it would be better to add the results to mean as soon as they are available. If in the code above, the result of the ith task is not yet available, the program waits for that result, while it might be possible that all results of task i + 1 to n - 1 are already available.
So, how can we do this in a better way?
You're blocking on the future, which is one operation too early.
Why not update the accumulated sum in the async thread and then block on all threads being complete?
#include <condition_variable>
#include <thread>
#include <mutex>

struct X {};
int f(X);
X make_x(int);

struct algo_state
{
    std::mutex m;
    std::condition_variable cv;
    int remaining_tasks;
    int accumulator;
};

void task(X x, algo_state& state)
{
    auto part = f(x);
    auto lock = std::unique_lock(state.m);
    state.accumulator += part;
    if (--state.remaining_tasks == 0)
    {
        lock.unlock();
        state.cv.notify_one();
    }
}

int main()
{
    int get_n();
    auto n = get_n();
    algo_state state = {
        {},
        {},
        n,
        0
    };
    for (int i = 0; i < n; ++i)
        std::thread([&] { task(make_x(i), state); }).detach();
    auto lock = std::unique_lock(state.m);
    state.cv.wait(lock, [&] { return state.remaining_tasks == 0; });
    auto mean = state.accumulator / n;
    return mean;
}
Couldn't fit this into a comment:
Instead of passing N functions to M threads for N data points (X), you can have:
K queues of N/K data elements each
M threads in a pool (producers, all ready with the same function)
1 consumer (adder) thread (main?)
and pass only the N data points between threads. Passing functions around and executing them can have more overhead than just passing data.
Also, those functions can add into a shared variable, so no extra summation is needed outside; then only the M producers need suitable synchronization, such as atomics or lock guards.
What is the sizeof of that struct?
Easiest way
What about making the lambda return f(x) instead of x:
for (std::size_t i = 0; i < n; ++i)
    results.push_back(std::async([]() -> int { /* ... */ }));
In this case, f() could be performed as soon as possible and without waiting. The average computation would still need to wait in sequential order, but this is a false problem since there's nothing faster than summing integers, and anyway, you would not be able to finish the calculation of the average before having summed each part.
Easy alternative
Still another approach could be to use atomic<int> mean;, capture it in the lambda, and update the sum there. In the end, you'd only need to be sure that all futures have delivered before doing the division. But as said, considering the cost of an integer addition, this might be overkill here.
std::vector<std::future<void>> results;
...
std::atomic<int> mean{0};
for (std::size_t i = 0; i < n; ++i)
    results.push_back(std::async([&mean]() -> void
        { X x = ...; int i = f(x); mean += i; return; }));
for (std::size_t i = 0; i < n; ++i)
    results[i].get();
mean = mean / n; // attention: not an atomic operation, but all concurrent work is done
First of all, I think it is important to say that I am new to multithreading and know very little about it. I was trying to write some programs in C++ using threads and ran into a problem (question) that I will try to explain to you now:
I wanted to use several threads to fill an array, here is my code:
static const int num_threads = 5;
int A[50], n;
//------------------------------------------------------------
void ThreadFunc(int tid)
{
    for (int q = 0; q < 5; q++)
    {
        A[n] = tid;
        n++;
    }
}
//------------------------------------------------------------
int main()
{
    thread t[num_threads];
    n = 0;
    for (int i = 0; i < num_threads; i++)
    {
        t[i] = thread(ThreadFunc, i);
    }
    for (int i = 0; i < num_threads; i++)
    {
        t[i].join();
    }
    for (int i = 0; i < n; i++)
        cout << A[i] << endl;
    return 0;
}
As a result of this program I get:
0
0
0
0
0
1
1
1
1
1
2
2
2
2
2
and so on.
As I understand it, the second thread starts writing elements to the array only when the first thread finishes writing all of its elements.
The question is: why don't the threads work concurrently? I mean, why don't I get something like this:
0
1
2
0
3
1
4
and so on.
Is there any way to solve this problem?
Thank you in advance.
Since n is accessed from more than one thread, those accesses need to be synchronized so that changes made in one thread don't conflict with changes made in another. There are (at least) two ways to do this.
First, you can make n an atomic variable. Just change its definition, and do the increment where the value is used:
std::atomic<int> n;
...
A[n++] = tid;
Or you can wrap all the accesses inside a critical section:
std::mutex mtx;
int next_n() {
    std::unique_lock<std::mutex> lock(mtx);
    return n++;
}
And in each thread, instead of directly incrementing n, call that function:
A[next_n()] = tid;
This is much slower than the atomic access, so not appropriate here. In more complex situations it will be the right solution.
The worker function is so short, i.e., finishes executing so quickly, that it's possible that each thread is completing before the next one even starts. Also, you may need to link with a thread library to get real threads, e.g., -lpthread. Even with that, the results you're getting are purely by chance and could appear in any order.
There are two corrections you need to make for your program to be properly synchronized. Change:
int n;
// ...
A[n] = tid; n++;
to
std::atomic_int n;
// ...
A[n++] = tid;
Often it's preferable to avoid synchronization issues altogether and split the workload across threads. Since the work done per iteration is the same here, it's as easy as dividing the work evenly:
void ThreadFunc(int tid, int first, int last)
{
    for (int i = first; i < last; i++)
        A[i] = tid;
}
Inside main, modify the thread-creation loop:
for (int first = 0, i = 0; i < num_threads; i++) {
    // possibly num_threads does not evenly divide the array size
    int last = (i != num_threads - 1) ? std::size(A) / num_threads * (i + 1) : std::size(A);
    t[i] = thread(ThreadFunc, i, first, last);
    first = last;
}
Of course by doing this, even though the array may be written out of order, the values will be stored to the same locations every time.