Fill an array from different threads concurrently in C++

First of all, I think it is important to say that I am new to multithreading and know very little about it. I was trying to write some programs in C++ using threads and ran into a problem (question) that I will try to explain to you now:
I wanted to use several threads to fill an array, here is my code:
static const int num_threads = 5;
int A[50], n;
//------------------------------------------------------------
void ThreadFunc(int tid)
{
    for (int q = 0; q < 5; q++)
    {
        A[n] = tid;
        n++;
    }
}
//------------------------------------------------------------
int main()
{
    thread t[num_threads];
    n = 0;
    for (int i = 0; i < num_threads; i++)
    {
        t[i] = thread(ThreadFunc, i);
    }
    for (int i = 0; i < num_threads; i++)
    {
        t[i].join();
    }
    for (int i = 0; i < n; i++)
        cout << A[i] << endl;
    return 0;
}
As a result of this program I get:
0
0
0
0
0
1
1
1
1
1
2
2
2
2
2
and so on.
As I understand, the second thread starts writing elements to an array only when the first thread finishes writing all elements to an array.
The question is why don't the threads work concurrently? I mean, why don't I get something like this:
0
1
2
0
3
1
4
and so on.
Is there any way to solve this problem?
Thank you in advance.

Since n is accessed from more than one thread, those accesses need to be synchronized so that changes made in one thread don't conflict with changes made in another. There are (at least) two ways to do this.
First, you can make n an atomic variable. Just change its definition, and do the increment where the value is used:
std::atomic<int> n;
...
A[n++] = tid;
Or you can wrap all the accesses inside a critical section:
std::mutex mtx;
int next_n() {
    std::unique_lock<std::mutex> lock(mtx);
    return n++;
}
And in each thread, instead of directly incrementing n, call that function:
A[next_n()] = tid;
This is much slower than the atomic access, so it's not appropriate here; in more complex situations, though, it will be the right solution.
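For completeness, here is a minimal sketch of the original program with the atomic fix applied. The fill logic is wrapped in a `fill()` helper so the result is easy to inspect; that wrapper is my addition, not part of the question:

```cpp
#include <atomic>
#include <thread>

static const int num_threads = 5;
int A[50];
std::atomic<int> n{0};

void ThreadFunc(int tid)
{
    for (int q = 0; q < 5; q++)
        A[n++] = tid;   // atomic fetch-and-increment reserves a unique slot
}

// Launches the threads, joins them, and returns how many slots were written.
int fill()
{
    std::thread t[num_threads];
    for (int i = 0; i < num_threads; i++)
        t[i] = std::thread(ThreadFunc, i);
    for (int i = 0; i < num_threads; i++)
        t[i].join();
    return n;
}
```

Whatever interleaving occurs on a given run, each thread ends up with exactly five entries; only their positions in A vary.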

The worker function is so short, i.e., finishes executing so quickly, that it's possible that each thread is completing before the next one even starts. Also, you may need to link with a thread library to get real threads, e.g., -lpthread. Even with that, the results you're getting are purely by chance and could appear in any order.
There are two corrections you need to make for your program to be properly synchronized. Change:
int n;
// ...
A[n] = tid; n++;
to
std::atomic_int n;
// ...
A[n++] = tid;
Often it's preferable to avoid synchronization issues altogether and split the workload across threads. Since the work done per iteration is the same here, it's as easy as dividing the work evenly:
void ThreadFunc(int tid, int first, int last)
{
    for (int i = first; i < last; i++)
        A[i] = tid;
}
Inside main, modify the thread create loop:
for (int first = 0, i = 0; i < num_threads; i++) {
    // The last thread picks up the remainder in case num_threads
    // does not evenly divide the array size.
    int last = (i != num_threads-1) ? std::size(A)/num_threads*(i+1) : std::size(A);
    t[i] = thread(ThreadFunc, i, first, last);
    first = last;
}
Of course by doing this, even though the array may be written out of order, the values will be stored to the same locations every time.
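Put together, the partitioned version might look like the following sketch (assuming C++17 for std::size; the fill_partitioned wrapper name is mine):

```cpp
#include <iterator>
#include <thread>

static const int num_threads = 5;
int A[50];

void ThreadFunc(int tid, int first, int last)
{
    // Each thread owns [first, last); no index is shared, so no synchronization.
    for (int i = first; i < last; i++)
        A[i] = tid;
}

void fill_partitioned()
{
    const int size = static_cast<int>(std::size(A));
    std::thread t[num_threads];
    for (int first = 0, i = 0; i < num_threads; i++) {
        // The last thread absorbs the remainder if num_threads doesn't divide size.
        int last = (i != num_threads - 1) ? size / num_threads * (i + 1) : size;
        t[i] = std::thread(ThreadFunc, i, first, last);
        first = last;
    }
    for (int i = 0; i < num_threads; i++)
        t[i].join();
}
```

With 50 elements and 5 threads, thread i deterministically owns indices [10*i, 10*i+10).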

Related

How do I implement std::thread the right way with loops?

I'm trying to create a multithreaded part in my program, where a loop creates multiple threads that each get a vector of objects, some integers, and the vector which holds the results.
The problem is I can't seem to wrap my head around how the thread part works; I tried different things, but they all end in the same three errors.
This is where I don't know how to proceed:
std::thread thread_superdiecreator;
for (int64_t i = 0; i < dicewithside.back().sides; i++) {
    thread_superdiecreator(func_multicreator(dicewithside, i, amount, lastdiepossibilities, superdie));
}
term does not evaluate to a function taking 1 arguments
I tried this:
thread_superdiecreator(func_multicreator, dicewithside, i, amount, lastdiepossibilities, superdie);
call of an object of a class type without appropriate operator() or conversion functions to pointer-to-function type
And this:
std::thread thread_superdiecreator(func_multicreator, dicewithside, i, amount, lastdiepossibilities, superdie);
Invoke error in thread.
The whole code snippet:
#pragma once
#include <mutex>
#include <thread>
#include <algorithm>
#include "class_Diewithside.h"
#include "struct_Sortedinput.h"
#include "func_maximumpossibilities.h"

std::mutex superdielock;

void func_multicreator(std::vector<Diewithside> dicewithside, int64_t lastdieside, int64_t size, int64_t lastdiepossibilities, std::vector<int64_t> &superdie) {
    // Set the last die side to the number of the thread
    dicewithside[size-1].dieside = lastdieside;
    std::vector<int64_t> partsuperdie;
    partsuperdie.reserve(lastdiepossibilities);
    // Calculate all possible results of all dice thrown with the last one set
    for (int64_t i = 0; i < lastdiepossibilities; i++) {
        // Reset the result
        int64_t result = 0;
        for (int64_t j = 0; j < size; j++) {
            result += dicewithside[j].alleyes[dicewithside[j].dieside];
        }
        partsuperdie.push_back(result);
        for (int64_t j = 0; j < size - 1; j++) {
            if (dicewithside[j].dieside == dicewithside[j].sides - 1) {
                dicewithside[j].dieside = 0;
            }
            else {
                dicewithside[j].dieside += 1;
                break;
            }
        }
    }
    superdielock.lock();
    for (int64_t i = 0; i < lastdiepossibilities; i++) {
        superdie.push_back(partsuperdie[i]);
    }
    superdielock.unlock();
}
// The function superdie creates an array that holds all possible outcomes of the dice thrown
std::vector<int64_t> func_superdiecreator(sortedinput varsortedinput) {
    // Get the size of the diceset vector and create a new vector out of class Diewithside
    int64_t size = varsortedinput.dicesets.size();
    std::vector<Diewithside> dicewithside;
    // Add up the amounts in vector dicesets to size the dicewithside reserve
    int64_t amount = 0;
    for (int64_t i = 0; i < size; i++) {
        amount += varsortedinput.dicesets[i].amount;
    }
    dicewithside.reserve(amount);
    // Fill the new vector dicewithside with each single die, starting at side 0
    for (int64_t i = 0; i < size; i++) {
        for (int64_t j = 0; j < varsortedinput.dicesets[i].amount; j++) {
            dicewithside.push_back(Diewithside{varsortedinput.dicesets[i].plusorminus, varsortedinput.dicesets[i].sides, varsortedinput.dicesets[i].alleyes, 0});
        }
    }
    // Get the maximum possibilities and divide by the sides of the last die
    // to get the number of iterations each thread has to run
    int64_t maximumpossibilities = func_maximumpossibilities(varsortedinput.dicesets, size);
    int64_t lastdiepossibilities = maximumpossibilities / dicewithside[amount-1].sides;
    // Multithread-calculate all possibilities and save them in the array
    std::vector<int64_t> superdie;
    superdie.reserve(maximumpossibilities);
    std::thread thread_superdiecreator;
    for (int64_t i = 0; i < dicewithside.back().sides; i++) {
        thread_superdiecreator(func_multicreator(dicewithside, i, amount, lastdiepossibilities, superdie));
    }
    thread_superdiecreator.join();
    return superdie;
}
Thanks for any help!
You indeed need to create the thread using the third alternative mentioned in the question, i.e. use the constructor of std::thread to start the thread.
The issue with this approach is that the last parameter of func_multicreator is an lvalue reference: std::thread copies its arguments and forwards those copies as rvalues when calling the function on the background thread, and an rvalue cannot be implicitly bound to a non-const lvalue reference. You need std::reference_wrapper (std::ref) here to be able to "pass" an lvalue reference to the thread.
You should join every thread you create, so you need to create a collection of threads.
Simplified example:
(The interesting stuff is between the ---... comments.)
struct Diewithside
{
    int64_t sides;
};

void func_multicreator(std::vector<Diewithside> dicewithside, int64_t lastdieside, int64_t size, int64_t lastdiepossibilities, std::vector<int64_t>& superdie)
{
}

std::vector<int64_t> func_superdiecreator() {
    std::vector<Diewithside> dicewithside;
    // Initialize the integer amount and iterate through all the amounts of vector dicesets adding them together to set the vector dicewithside reserve
    int64_t amount = 0;
    int64_t lastdiepossibilities = 0;
    std::vector<int64_t> superdie;
    // -----------------------------------------------
    std::vector<std::thread> threads;
    for (int64_t i = 0; i < dicewithside.back().sides; i++) {
        // create each thread with the std::thread constructor
        threads.emplace_back(func_multicreator, dicewithside, i, amount, lastdiepossibilities, std::reference_wrapper(superdie));
    }
    for (auto& t : threads)
    {
        t.join();
    }
    // -----------------------------------------------
    return superdie;
}
std::thread thread_superdiecreator;
A single std::thread object always represents a single execution thread. You seem to be trying to use this single object to represent multiple execution threads. No matter what you try, it won't work. You need multiple std::thread objects, one for each execution thread.
thread_superdiecreator(func_multicreator, dicewithside, i, amount, lastdiepossibilities, superdie);
An actual execution thread gets created by constructing a new std::thread object, and not by invoking it as a function.
Constructing an execution thread object corresponds to the creation of a new execution thread, it's just that simple. And the simplest way to have multiple execution threads is to have a vector of them.
std::vector<std::thread> all_execution_threads;
With that in place, creating a new execution thread involves nothing more than constructing a new std::thread object and moving it into the vector. Or, better yet, emplace it directly:
all_execution_threads.emplace_back(
    func_multicreator, dicewithside, i,
    amount, lastdiepossibilities, superdie
);
This presumes that everything else is correct: that func_multicreator's parameters match, that none of them are passed by reference (you need to fix this at least — your attempt to pass a reference into a thread function will not work as written and would otherwise leave dangling references behind), that all access to shared objects from multiple execution threads is correctly synchronized with mutexes, and that all the other usual pitfalls of working with multiple execution threads are avoided. But this covers the basics of creating some unspecified number of concurrent execution threads. When all is said and done, you end up with a std::vector of std::threads, one for each actual execution thread.
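A minimal, compilable sketch of the pattern both answers describe — a vector of threads plus std::ref for the shared output. The append_squares worker here is a hypothetical stand-in for func_multicreator, not the asker's code:

```cpp
#include <cstdint>
#include <functional>
#include <mutex>
#include <thread>
#include <vector>

std::mutex result_lock;

// Hypothetical worker: computes a value and appends it under a lock,
// the way func_multicreator appends to superdie.
void append_squares(int64_t base, std::vector<int64_t>& out)
{
    int64_t value = base * base;
    std::lock_guard<std::mutex> guard(result_lock);
    out.push_back(value);
}

std::vector<int64_t> run_workers(int64_t n)
{
    std::vector<int64_t> results;
    std::vector<std::thread> threads;
    for (int64_t i = 0; i < n; i++)
        // std::ref wraps the lvalue reference; passing `results` bare won't compile.
        threads.emplace_back(append_squares, i, std::ref(results));
    for (auto& t : threads)
        t.join();   // join every thread before touching the results
    return results;
}
```

The results arrive in an unspecified order, but the vector is intact because every push_back happens under the mutex.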

How to multithread line by line pixels using std::thread?

I want to learn how to adapt the pseudocode I have for line-by-line multithreading to C++. I understand the pseudocode, but I am not very experienced with C++ nor with std::thread.
This is the pseudocode I have and that I've often used:
myFunction
{
    int threadNr = previous;
    int numberProcs = countProcessors();
    // Every thread calculates a different line
    for (y = y_start + threadNr; y < y_end; y += numberProcs) {
        // Horizontal lines
        for (int x = x_start; x < x_end; x++) {
            psetp(x, y, RGB(255,128,0));
        }
    }
}
int numberProcs = countProcessors();
// Launch threads: e.g. for 1 processor launch no other thread,
// for 2 processors launch 1 thread, for 4 processors launch 3 threads
for (i = 0; i < numberProcs - 1; i++)
    triggerThread(50, FME_CUSTOMEVENT, i); // The last parameter is the thread number
triggerEvent(50, FME_CUSTOMEVENT, numberProcs - 1); // The last thread is used for progress
// Wait for all threads to finish
waitForThread(0, 0xffffffff, -1);
I know I can call my current function using one thread via std::thread like this:
std::thread t1(FilterImage,&size_param, cdepth, in_data, input_worldP, output_worldP);
t1.join();
But this is not efficient as it is calling the entire function over and over again per thread.
I would expect every processor to tackle a horizontal line on its own.
Any example code would be highly appreciated, as I tend to learn best through example.
Invoking thread::join() forces the calling thread to wait for the child thread to finish executing. For example, if I use it to create a number of threads in a loop, and call join() on each one, it'll be the same as though everything happened in sequence.
Here's an example. I have two methods that print out the numbers 0 through n-1. The first one does it single-threaded, and the second one joins each thread as it's created. Both have the same output, but the threaded one is slower because you're waiting for each thread to finish before starting the next one.
#include <iostream>
#include <thread>
void printN_nothreads(int n) {
    for (int i = 0; i < n; i++) {
        std::cout << i << "\n";
    }
}

void printN_threaded(int n) {
    for (int i = 0; i < n; i++) {
        std::thread t([=](){ std::cout << i << "\n"; });
        t.join(); // This forces synchronization
    }
}
Doing threading better.
To gain benefit from using threads, you have to start all the threads before joining them. In addition, to avoid false sharing, each thread should work on a separate region of the image (ideally a section that's far away in memory).
Let's look at how this would work. I don't know what library you're using, so instead I'm going to show you how to write a multi-threaded transform on a vector.
auto transform_section = [](auto func, auto begin, auto end) {
    for (; begin != end; ++begin) {
        func(*begin);
    }
};
This transform_section function will be called once per thread, each on a different section of the vector. Let's write transform so it's multithreaded.
template<class Func, class T>
void transform(Func func, std::vector<T>& data, int num_threads) {
    size_t size = data.size();
    auto section_start = [size, num_threads](int thread_index) {
        return size * thread_index / num_threads;
    };
    auto section_end = [size, num_threads](int thread_index) {
        return size * (thread_index + 1) / num_threads;
    };
    std::vector<std::thread> threads(num_threads);
    // Each thread works on a different section.
    // Note: data.data() + size is a valid past-the-end pointer,
    // whereas &data[size] would dereference out of bounds.
    for (int i = 0; i < num_threads; i++) {
        T* start = data.data() + section_start(i);
        T* end = data.data() + section_end(i);
        threads[i] = std::thread(transform_section, func, start, end);
    }
    // We only join AFTER all the threads are started
    for (std::thread& t : threads) {
        t.join();
    }
}
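As a usage sketch, here is the same idea reproduced in a self-contained form (the section bookkeeping is folded into a lambda so the snippet compiles on its own), doubling every element of a vector of "pixels" across four threads; the double_pixels demo function is my addition:

```cpp
#include <cstddef>
#include <thread>
#include <vector>

// Multi-threaded in-place transform, as in the answer above:
// each thread applies func to its own contiguous section of data.
template<class Func, class T>
void transform(Func func, std::vector<T>& data, int num_threads)
{
    auto worker = [&](std::size_t begin, std::size_t end) {
        for (; begin != end; ++begin)
            func(data[begin]);
    };
    std::vector<std::thread> threads;
    for (int i = 0; i < num_threads; i++)
        threads.emplace_back(worker,
                             data.size() * i / num_threads,
                             data.size() * (i + 1) / num_threads);
    for (auto& t : threads)
        t.join();   // join only after every thread has been started
}

// Doubles 1000 "pixels" on 4 threads and returns their sum.
int double_pixels()
{
    std::vector<int> pixels(1000, 1);
    transform([](int& p) { p *= 2; }, pixels, 4);
    int sum = 0;
    for (int p : pixels)
        sum += p;
    return sum;
}
```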

How can we run n instances of an algorithm in parallel and compute the mean of a function of the results in an efficient way?

I want to run n instances of an algorithm in parallel and compute the mean of a function f of the results. If I'm not terribly wrong, the following code achieves this goal:
struct X {};
int f(X) { return /* ... */; }

int main()
{
    std::size_t const n = /* ... */;
    std::vector<std::future<X>> results;
    results.reserve(n);
    for (std::size_t i = 0; i < n; ++i)
        results.push_back(std::async([]() -> X { /* ... */ }));
    int mean = 0;
    for (std::size_t i = 0; i < n; ++i)
        mean += f(results[i].get());
    mean /= n;
}
However, is there a better way to do this? The obvious problem with the code above is the following: the order of summation in the line mean += f(results[i].get()); doesn't matter, so it would be better to add the results to mean as soon as they become available. In the code above, if the result of the ith task is not yet available, the program waits for it, even though the results of tasks i + 1 through n - 1 may already be available.
So, how can we do this in a better way?
You're blocking on the future, which is one operation too early.
Why not update the accumulated sum in the async thread and then block on all threads being complete?
#include <condition_variable>
#include <thread>
#include <mutex>

struct X {};
int f(X);
X make_x(int);

struct algo_state
{
    std::mutex m;
    std::condition_variable cv;
    int remaining_tasks;
    int accumulator;
};

void task(X x, algo_state& state)
{
    auto part = f(x);
    auto lock = std::unique_lock(state.m);
    state.accumulator += part;
    if (--state.remaining_tasks == 0)
    {
        lock.unlock();
        state.cv.notify_one();
    }
}

int main()
{
    int get_n();
    auto n = get_n();
    algo_state state = { {}, {}, n, 0 };
    for (int i = 0; i < n; ++i)
        std::thread([&] { task(make_x(i), state); }).detach();
    auto lock = std::unique_lock(state.m);
    state.cv.wait(lock, [&] { return state.remaining_tasks == 0; });
    auto mean = state.accumulator / n;
    return mean;
}
Couldn't fit this into comment:
Instead of passing N functions to M threads for N data points (X), you can have:
K queues of N/K data elements each,
M threads in a pool (producers, all ready with the same function),
1 consumer (adder) thread (main?),
and pass only the N data points between threads. Passing functions and executing them can have more overhead than passing just data.
Also, those functions can add into a shared variable, so no extra summation is needed outside; then the M producers only need suitable synchronization, such as atomics or lock guards.
What is sizeof that struct?
Easiest way
What about making the lambda return f(x) instead of x:
for (std::size_t i = 0; i < n; ++i)
    results.push_back(std::async([]() -> int { /* ... */ }));
In this case, f() could be performed as soon as possible and without waiting. The average computation would still need to wait in sequential order, but this is a false problem: there's nothing faster than summing integers, and in any case you cannot finish computing the average before having summed every part.
Easy alternative
Still another approach could be to use std::atomic<int> mean;, capture it in the lambda, and update the sum there. In the end you'd only need to make sure all futures have been delivered before doing the division. But as said, considering the cost of an integer addition, this might be overkill here.
std::vector<std::future<void>> results;
...
std::atomic<int> mean{0};
for (std::size_t i = 0; i < n; ++i)
    results.push_back(std::async([&mean]() -> void
        { X x = ...; int i = f(x); mean += i; }));
for (std::size_t i = 0; i < n; ++i)
    results[i].get();
mean = mean / n; // attention: not an atomic operation, but all concurrent work is done
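A compilable version of that atomic-accumulator idea is sketched below. The X payload, f, and the workload are placeholder stand-ins for the asker's elided code, and std::launch::async is added so the tasks really run concurrently rather than deferred:

```cpp
#include <atomic>
#include <future>
#include <vector>

struct X { int value; };
int f(X x) { return x.value; }   // placeholder for the real function

int parallel_mean(int n)
{
    std::atomic<int> sum{0};
    std::vector<std::future<void>> results;
    results.reserve(n);
    for (int i = 0; i < n; ++i)
        results.push_back(std::async(std::launch::async, [&sum, i] {
            X x{i + 1};          // placeholder for the real computation
            sum += f(x);         // accumulate as soon as this task finishes
        }));
    for (auto& r : results)
        r.get();                 // only synchronizes; summation order no longer matters
    return sum / n;
}
```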

openMP C++ simple parallel region - inconsistent output

As stated above, I have been trying to craft a simple parallel loop, but it behaves inconsistently for different numbers of threads. Here is my code (testable!):
#include <iostream>
#include <stdio.h>
#include <vector>
#include <utility>
#include <string>
using namespace std;

int row = 5, col = 5;
int token = 1;
int ar[20][20] = {0};

int main (void)
{
    unsigned short j_end = 1, k = 1;
    unsigned short mask;
    for (unsigned short i = 1; i <= (row + col - 1); i++)
    {
        #pragma omp parallel default(none) shared(ar) firstprivate(k, row, col, i, j_end, token) private(mask)
        {
            if (i > row) {
                mask = row;
            }
            else {
                mask = i;
            }
            #pragma omp for schedule(static, 2)
            for (unsigned short j = k; j <= j_end; j++)
            {
                ar[mask][j] = token;
                if (mask > 1) {
                    #pragma omp critical
                    {
                        mask--;
                    }
                }
            } // inner loop - barrier
        } // end parallel
        token++;
        if (j_end == col) {
            k++;
            j_end = col;
        }
        else {
            j_end++;
        }
    } // outer loop

    // print the array
    for (int i = 0; i < row + 2; i++)
    {
        for (int j = 0; j < col + 2; j++)
        {
            cout << ar[i][j] << " ";
        }
        cout << endl;
    }
    return 0;
} // main
I believe most of the code is self explanatory, but to sum it up, I have 2 loops, the inner one iterates through the inverse-diagonals of the square matrix ar[row][col], (row & col variables can be used to change the total size of ar).
Visual aid: desired output for 5x5 ar (serial version)
(Note: This does happen when OMP_NUM_THREADS=1 too.)
But when OMP_NUM_THREADS=2 or OMP_NUM_THREADS=4 the output looks like this:
The serial (and for 1 thread) code is consistent so I don't think the implementation is problematic. Also, given the output of the serial code, there shouldn't be any dependencies in the inner loop.
I have also tried:
Vectorizing
threadprivate counters for the inner loop
But nothing seems to work so far...
Is there a fault in my approach, or did I miss something API-wise that led to this behavior?
Thanks for your time in advance.
Analyzing the algorithm
As you noted, the algorithm itself has no dependencies in the inner or outer loop. An easy way to show this is to move the parallelism "up" to the outer loop so that you can iterate across all the different inverse diagonals simultaneously.
Right now, the main problem with the algorithm you've written is that it's presented as a serial algorithm in both the inner and outer loop. If you're going to parallelize across the inner loop, then mask needs to be handled specially. If you're going to parallelize across the outer loop, then j_end, token, and k need to be handled specially. By "handled specially," I mean they need to be computed independently of the other threads. If you try adding critical regions into your code, you will kill all performance benefits of adding OpenMP in the first place.
Fixing the problem
In the following code, I parallelize over the outer loop. i corresponds to what you call token. That is, it is both the value to be added to the inverse diagonal and the assumed starting length of this diagonal. Note that for this to parallelize correctly, length, startRow, and startCol must be calculated as a function of i independently from other iterations.
Finally note that once the algorithm is re-written this way, the actual OpenMP pragma is incredibly simple. Every variable is assumed to be shared by default because they're all read-only. The only exception is ar in which we are careful never to overwrite another thread's value of the array. All variables that must be private are only created inside the parallel loop and thus are thread-private by definition. Lastly, I've changed the schedule to dynamic to showcase that this algorithm exhibits load-imbalance. In your example if you had 9 threads (the worst case scenario), you can see how the thread assigned to i=5 has to do much more work than the thread assigned to i=1 or i=9.
Example code
#include <iostream>
#include <omp.h>

int row = 5;
int col = 5;
#define MAXSIZE 20
int ar[MAXSIZE][MAXSIZE] = {0};

int main(void)
{
    // What an easy pragma!
    #pragma omp parallel for default(shared) schedule(dynamic)
    for (unsigned short i = 1; i < (row + col); i++)
    {
        // Calculates the length of the current diagonal to consider
        // INDEPENDENTLY from other i iterations!
        unsigned short length = i;
        if (i > row) {
            length -= (i-row);
        }
        if (i > col) {
            length -= (i-col);
        }
        // Calculates the starting coordinate to start at
        // INDEPENDENTLY from other i iterations!
        unsigned short startRow = i;
        unsigned short startCol = 1;
        if (startRow > row) {
            startCol += (startRow-row);
            startRow = row;
        }
        for (unsigned short offset = 0; offset < length; offset++) {
            ar[startRow-offset][startCol+offset] = i;
        }
    } // outer loop

    // print the array
    for (int i = 0; i <= row; i++)
    {
        for (int j = 0; j <= col; j++)
        {
            std::cout << ar[i][j] << " ";
        }
        std::cout << std::endl;
    }
    return 0;
} // main
Final points
I want to leave with a few last points.
If you are only adding parallelism on a small array (row,col < 1e6), you will most likely not get any benefits from OpenMP. On a small array, the algorithm itself will take microseconds, while setting up the threads could take milliseconds... slowing down execution time considerably from your original serial code!
While I did rewrite this algorithm and change around variable names, I tried to keep the spirit of your implementation as best as I could. Thus, the inverse-diagonal scanning and nested loop pattern remains.
There is a better way to parallelize this algorithm to avoid load imbalance, though. If you instead give each thread a row and have it iterate its token value (i.e. row/thread 2 places the numbers 2->6), then each thread will work on exactly the same amount of numbers and you can change the pragma to schedule(static).
As I mentioned in the comments above, don't use firstprivate when you mean shared. A good rule of thumb is that all read-only variables should be shared.
It is erroneous to assume that getting correct output when running parallel code on 1 thread implies the implementation is correct. In fact, barring disastrous use of OpenMP, you are incredibly unlikely to get the wrong output with only 1 thread. Testing with multiple threads reveals that your previous implementation was not correct.
Hope this helps.
EDIT: The output I get is the same as yours for a 5x5 matrix.
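The diagonal bookkeeping is the part most worth checking in isolation. Here is the same length/startRow/startCol arithmetic extracted into a plain serial function (no OpenMP; the fixed 8x8 storage is purely for illustration) so it can be verified against the expected pattern ar[r][c] == r + c - 1:

```cpp
// Same index arithmetic as the answer's example, extracted so the
// inverse-diagonal fill can be tested without OpenMP.
void fill_diagonals(int ar[][8], int row, int col)
{
    for (int i = 1; i < row + col; i++) {
        // Length of diagonal i, clipped at the matrix edges
        int length = i;
        if (i > row) length -= (i - row);
        if (i > col) length -= (i - col);
        // Starting coordinate of diagonal i
        int startRow = i;
        int startCol = 1;
        if (startRow > row) {
            startCol += startRow - row;
            startRow = row;
        }
        for (int offset = 0; offset < length; offset++)
            ar[startRow - offset][startCol + offset] = i;
    }
}
```

Because the loop body depends only on i, every iteration can safely run on a different thread, which is exactly what the OpenMP pragma above exploits.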

threading program in C++ not faster

I have a program which reads a file line by line and then stores each possible substring of length 50 in a hash table along with its frequency. I tried to use threads in my program so that it reads 5 lines and then uses five different threads to process them. The processing involves reading each substring of a line and putting it into the hash map with its frequency. But something seems to be wrong that I could not figure out, because the program is not faster than the serial approach. Also, for a large input file it aborts. Here is the piece of code I am using:
unordered_map<string, int> m;
mutex mtx;

void parseLine(char *line, int subLen) {
    int no_substr = strlen(line) - subLen;
    for (int i = 0; i <= no_substr; i++) {
        char *subStr = (char*) malloc(sizeof(char) * subLen + 1);
        strncpy(subStr, line + i, subLen);
        subStr[subLen] = '\0';
        mtx.lock();
        string s(subStr);
        if (m.find(s) != m.end()) m[s]++;
        else {
            pair<string, int> ret(s, 1);
            m.insert(ret);
        }
        mtx.unlock();
        free(subStr); // s holds a copy, so the buffer can be released
    }
}
int main() {
    char **Array = (char **) malloc(sizeof(char *) * num_th + 1);
    int num = 0;
    while (NOT END OF FILE) {
        if (num < num_th) {
            if (num == 0)
                for (int x = 0; x < num_th; x++)
                    Array[x] = (char*) malloc(sizeof(char) * strlen(line) + 1);
            strcpy(Array[num], line);
            num++;
        }
        else {
            vector<thread> threads;
            for (int i = 0; i < num_th; i++) {
                threads.push_back(thread(parseLine, Array[i], subLen));
            }
            for (int i = 0; i < num_th; i++) {
                if (threads[i].joinable()) {
                    threads[i].join();
                }
            }
            for (int x = 0; x < num_th; x++) free(Array[x]);
            num = 0;
        }
    }
}
It's a myth that just by virtue of using threads, the end result must be faster. In general, in order to take advantage of multithreading, two conditions must be met (*):
1) You actually have to have sufficient physical CPU cores, that can run the threads at the same time.
2) The threads have independent tasks to do, that they can do on their own.
From a cursory examination of the shown code, it seems to fail on the second part. It seems to me that, most of the time all of these threads will be fighting each other in order to acquire the same mutex. There's little to be gained from multithreading, in this situation.
(*) Of course, you don't always use threads purely for performance reasons. Multithreading also comes in useful in many other situations, for example in a program with a GUI: having a separate thread update the GUI keeps the UI responsive even while the main execution thread is chewing on something for a while...
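One common way to restore independence, sketched here as an assumption about how the asker's workload could be restructured: count substrings into a thread-local map first, and take the mutex only once per line to merge, instead of once per substring:

```cpp
#include <cstddef>
#include <mutex>
#include <string>
#include <thread>
#include <unordered_map>

std::unordered_map<std::string, int> global_counts;
std::mutex global_lock;

// Counts every substring of length subLen into a local map, then merges once.
// The mutex is taken once per line instead of once per substring, so the
// threads spend almost no time fighting over it.
void parseLine(const std::string& line, std::size_t subLen)
{
    std::unordered_map<std::string, int> local;
    for (std::size_t i = 0; i + subLen <= line.size(); i++)
        ++local[line.substr(i, subLen)];

    std::lock_guard<std::mutex> guard(global_lock);
    for (const auto& kv : local)
        global_counts[kv.first] += kv.second;
}
```

Each thread now does the bulk of its work without any shared state, which is exactly condition (2) above.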