I'm having trouble coming up with an algorithm to ensure mutual exclusion while threads are accessing a shared global variable. I'm trying to write a threaded function that can use a global variable, instead of switching it to a local variable.
I have this code so far:
int sumGlobal = 0;

void sumArray(int** array, int size){
    for (int i=0; i<size; i++){
        for (int j=0; j<size; j++){
            sumGlobal += array[i][j];
        }
    }
}

int main(){
    int size = 4000;
    int numThreads = 4;
    int** a2DArray = new int*[size];
    for (int i=0; i<size; i++){
        a2DArray[i] = new int[size];
        for (int j=0; j<size; j++){
            a2DArray[i][j] = genRandNum(0,100);
        }
    }
    std::vector<std::future<void>> thread_Pool;
    for (int i = 0; i < numThreads; ++i) {
        thread_Pool.push_back(std::async(std::launch::async,
                                         sumArray, a2DArray, size));
    }
}
I'm unsure of how to guarantee that sumGlobal is not rewritten with every thread. I want to update it correctly, so that each thread adds its value to the global variable when it's finished. I'm just trying to learn threading, and not be restricted to non-void functions.
Make the variable atomic:
#include <atomic>
...
std::atomic<int> sumGlobal {0};
An atomic variable is exempt from data races: it behaves well even when several threads read and write it concurrently. Whether the atomicity is implemented through mutual exclusion or in a lock-free manner is left to the implementation. Since you use += , which is an atomic read-modify-write operation on std::atomic, there is no risk of inconsistency in your example.
This nice video explains in much more detail what atomics are, why they are needed, and how they work.
You could also keep your variables as they are and use a mutex/lock_guard to protect them, as explained by @Miles Budnek. The problem is that only one thread at a time can execute code protected by the mutex. In your example, this would mean that the processing in the different threads would not really run concurrently. The atomic approach should have superior performance: one thread can still compute indexes and read the array while another is updating the global variable.
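You can also combine the two ideas: let each thread accumulate into a local variable and publish its subtotal with a single atomic add at the end, so the shared counter is touched only once per thread. A minimal sketch (the sumRows helper and the row-slicing are illustrative, not part of your code):

```cpp
#include <atomic>
#include <cstddef>
#include <thread>
#include <vector>

std::atomic<long> sumGlobal{0};

// Illustrative helper: each thread sums its own slice of rows locally,
// then adds the subtotal to the shared counter exactly once.
void sumRows(const std::vector<std::vector<int>>& a, std::size_t begin, std::size_t end) {
    long local = 0;                 // no shared state touched in the hot loop
    for (std::size_t i = begin; i < end; ++i)
        for (int v : a[i])
            local += v;
    sumGlobal.fetch_add(local);     // one atomic read-modify-write per thread
}
```

With this pattern the atomic operation happens a handful of times in total instead of once per array element, which is usually the bigger win.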
If you don't want to use a synchronized object like std::atomic<int> as @Christophe suggests, you can use std::mutex and std::lock_guard to manually synchronize access to your sum.
int sumGlobal = 0;
std::mutex sumMutex;

void sumArray(...) {
    for (...) {
        for (...) {
            std::lock_guard<std::mutex> lock(sumMutex);
            sumGlobal += ...;
        }
    }
}
Keep in mind that all that locking and unlocking will incur quite a bit of overhead.
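One way to cut that overhead is to have each thread build up a local sum and take the lock only once, when it folds its subtotal into the global. A sketch, assuming a flat std::vector<int> and a hypothetical sumPart helper rather than the question's 2D array:

```cpp
#include <cstddef>
#include <mutex>
#include <thread>
#include <vector>

int sumGlobal = 0;
std::mutex sumMutex;

// Hypothetical helper: sum a half-open range [begin, end) locally,
// then update the shared total under the lock exactly once.
void sumPart(const std::vector<int>& data, std::size_t begin, std::size_t end) {
    int local = 0;
    for (std::size_t i = begin; i < end; ++i)
        local += data[i];
    std::lock_guard<std::mutex> lock(sumMutex); // held for a single addition
    sumGlobal += local;
}
```

Locking once per thread instead of once per element makes the mutex cost negligible for large arrays.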
Related
I'm trying to create a multithreaded part in my program, where a loop creates multiple threads that each get a vector of objects, along with some integers and the vector which holds the results.
The problem is I can't seem to wrap my head around how the thread part works; I tried different things, but they all end in the same three errors.
This is where I don't know how to proceed:
std::thread thread_superdiecreator;
for (int64_t i = 0; i < dicewithside.back().sides; i++) {
    thread_superdiecreator(func_multicreator(dicewithside, i, amount, lastdiepossibilities, superdie));
}
term does not evaluate to a function taking 1 arguments
I tried this:
thread_superdiecreator(func_multicreator, dicewithside, i, amount, lastdiepossibilities, superdie);
call of an object of a class type without appropriate operator() or conversion functions to pointer-to-function type
And this:
std::thread thread_superdiecreator(func_multicreator, dicewithside, i, amount, lastdiepossibilities, superdie);
Invoke error in thread.
The whole code snippet:
#pragma once
#include <mutex>
#include <thread>
#include <algorithm>
#include "class_Diewithside.h"
#include "struct_Sortedinput.h"
#include "func_maximumpossibilities.h"
std::mutex superdielock;
void func_multicreator(std::vector<Diewithside> dicewithside, int64_t lastdieside, int64_t size, int64_t lastdiepossibilities, std::vector<int64_t> &superdie) {
    // Set the last die side to the number of the thread
    dicewithside[size-1].dieside = lastdieside;
    std::vector<int64_t> partsuperdie;
    partsuperdie.reserve(lastdiepossibilities);
    // Calculate all possible results of all dice thrown with the last one set
    for (int64_t i = 0; i < lastdiepossibilities; i++) {
        // Reset the result
        int64_t result = 0;
        for (int64_t j = 0; j < size; j++) {
            result += dicewithside[j].alleyes[dicewithside[j].dieside];
        }
        partsuperdie.push_back(result);
        for (int64_t j = 0; j < size - 1; j++) {
            if (dicewithside[j].dieside == dicewithside[j].sides - 1) {
                dicewithside[j].dieside = 0;
            }
            else {
                dicewithside[j].dieside += 1;
                break;
            }
        }
    }
    superdielock.lock();
    for (int64_t i = 0; i < lastdiepossibilities; i++) {
        superdie.push_back(partsuperdie[i]);
    }
    superdielock.unlock();
}
// The function superdie creates an array that holds all possible outcomes of the dice thrown
std::vector<int64_t> func_superdiecreator(sortedinput varsortedinput) {
    // Get the size of the diceset vector and create a new vector out of class Diewithside
    int64_t size = varsortedinput.dicesets.size();
    std::vector<Diewithside> dicewithside;
    // Initialize the integer amount and iterate through all the amounts of vector dicesets, adding them together to set the vector dicewithside reserve
    int64_t amount = 0;
    for (int64_t i = 0; i < size; i++) {
        amount += varsortedinput.dicesets[i].amount;
    }
    dicewithside.reserve(amount);
    // Fill the new vector dicewithside with each single die and add the starting value of 0
    for (int64_t i = 0; i < size; i++) {
        for (int64_t j = 0; j < varsortedinput.dicesets[i].amount; j++) {
            dicewithside.push_back(Diewithside{varsortedinput.dicesets[i].plusorminus, varsortedinput.dicesets[i].sides, varsortedinput.dicesets[i].alleyes, 0});
        }
    }
    // Get the maximum possibilities and divide by the sides of the last die to get the number of iterations each thread has to run
    int64_t maximumpossibilities = func_maximumpossibilities(varsortedinput.dicesets, size);
    int64_t lastdiepossibilities = maximumpossibilities / dicewithside[amount-1].sides;
    // Multithread: calculate all possibilities and save them in the array
    std::vector<int64_t> superdie;
    superdie.reserve(maximumpossibilities);
    std::thread thread_superdiecreator;
    for (int64_t i = 0; i < dicewithside.back().sides; i++) {
        thread_superdiecreator(func_multicreator(dicewithside, i, amount, lastdiepossibilities, superdie));
    }
    thread_superdiecreator.join();
    return superdie;
}
Thanks for any help!
You indeed need to create the thread using the third alternative mentioned in the question, i.e. use the constructor of std::thread to start the thread.
The issue with this approach is that the last parameter of func_multicreator is an lvalue reference: std::thread makes copies of its arguments and moves those copies when calling the function on the background thread, and an rvalue cannot be implicitly bound to a non-const lvalue reference. You need to use std::reference_wrapper (std::ref) here to be able to "pass" an lvalue reference to the thread.
You should join every thread you create, so you need to keep a collection of threads.
Simplified example:
(The interesting stuff is between the ---... comments.)
struct Diewithside
{
    int64_t sides;
};

void func_multicreator(std::vector<Diewithside> dicewithside, int64_t lastdieside, int64_t size, int64_t lastdiepossibilities, std::vector<int64_t>& superdie)
{
}

std::vector<int64_t> func_superdiecreator() {
    std::vector<Diewithside> dicewithside;
    // Initialize the integer amount and iterate through all the amounts of vector dicesets adding them together to set the vector dicewithside reserve
    int64_t amount = 0;
    int64_t lastdiepossibilities = 0;
    std::vector<int64_t> superdie;
    // -----------------------------------------------
    std::vector<std::thread> threads;
    for (int64_t i = 0; i < dicewithside.back().sides; i++) {
        // create thread using the constructor std::thread(func_multicreator, dicewithside, i, amount, lastdiepossibilities, std::reference_wrapper(superdie));
        threads.emplace_back(func_multicreator, dicewithside, i, amount, lastdiepossibilities, std::reference_wrapper(superdie));
    }
    for (auto& t : threads)
    {
        t.join();
    }
    // -----------------------------------------------
    return superdie;
}
std::thread thread_superdiecreator;
A single std::thread object always represents a single execution thread. You seem to be trying to use this single object to represent multiple execution threads. No matter what you try, it won't work. You need multiple std::thread objects, one for each execution thread.
thread_superdiecreator(func_multicreator, dicewithside, i, amount, lastdiepossibilities, superdie);
An actual execution thread gets created by constructing a new std::thread object, and not by invoking it as a function.
Constructing an execution thread object corresponds to the creation of a new execution thread, it's just that simple. And the simplest way to have multiple execution threads is to have a vector of them.
std::vector<std::thread> all_execution_threads;
With that in place, creating a new execution thread involves nothing more than constructing a new std::thread object and moving it into the vector. Or, better yet, emplace it directly:
all_execution_threads.emplace_back(
func_multicreator, dicewithside, i,
amount, lastdiepossibilities, superdie
);
This presumes that everything else is correct: that func_multicreator's parameter list matches the arguments, that none of them are passed by reference (you need to fix this at least; as written, your attempt to pass a reference into the thread function will not work, and carelessly passed references can be left dangling), that all access to objects shared between execution threads is correctly synchronized with mutexes, and that all the other usual pitfalls of working with multiple execution threads are avoided. But this covers the basics of creating some unspecified number of concurrent execution threads. When all is said and done, you end up with a std::vector of std::threads, one for each actual execution thread.
I have read various articles on C++ threading, among others the GeeksForGeeks article. I have also read this question, but none of these has an answer for my need. In my project (which is too complex to mention here), I would need something along these lines:
#include <iostream>
#include <thread>
using namespace std;
class Simulate{
public:
    int Numbers[100][100];
    thread Threads[100][100];

    // Method to be passed to the thread - in the same way as a function pointer?
    void DoOperation(int i, int j) {
        Numbers[i][j] = i + j;
    }

    // Method to start the threads from
    void Update(){
        // Start executing threads
        for (int i = 0; i < 100; i++) {
            for (int j = 0; j < 100; j++) {
                Threads[i][j] = thread(DoOperation, i, j);
            }
        }
        // Wait till all of the threads finish
        for (int i = 0; i < 100; i++) {
            for (int j = 0; j < 100; j++) {
                if (Threads[i][j].joinable()) {
                    Threads[i][j].join();
                }
            }
        }
    }
};

int main()
{
    Simulate sim;
    sim.Update();
}
How can I do this please? Any help is appreciated, and alternative solutions are welcomed. I am a mathematician by training, learning C++ for less than a week, so simplicity is preferred. I desperately need something along these lines to make my research simulations faster.
The easiest way to call member functions and pass arguments is to use a lambda expression:
Threads[i][j] = std::thread([this, i, j](){ this->DoOperation(i, j); });
The variables listed in [] are captured, and their values can be used by the code inside {}. The lambda itself has a unique anonymous type, but it can be passed directly to the std::thread constructor, which accepts any callable object.
However, starting 100x100 = 10000 threads will quickly exhaust memory on most systems. Adding more threads than there are CPU cores does not improve performance for computational tasks. Instead it is a better idea to start e.g. 10 threads that each process 1000 items in a loop.
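As a sketch of that idea, here is one way to cover the 100x100 grid with a small, fixed number of threads, each handling a strided subset of rows (the fillGrid helper and the striding scheme are illustrative, not from the question):

```cpp
#include <cstddef>
#include <thread>
#include <vector>

// Illustrative: numThreads threads cooperate on the whole grid,
// thread t handling rows t, t + numThreads, t + 2*numThreads, ...
void fillGrid(std::vector<std::vector<int>>& numbers, int numThreads) {
    const int rows = static_cast<int>(numbers.size());
    std::vector<std::thread> threads;
    for (int t = 0; t < numThreads; ++t) {
        threads.emplace_back([&numbers, t, numThreads, rows]() {
            for (int i = t; i < rows; i += numThreads)        // this thread's rows
                for (std::size_t j = 0; j < numbers[i].size(); ++j)
                    numbers[i][j] = i + static_cast<int>(j);  // same work as DoOperation
        });
    }
    for (auto& th : threads)
        th.join();   // wait for every worker before using the grid
}
```

Four or eight threads launched this way do the same work as the 10000 in the question, without the per-thread creation cost dominating.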
I need to iterate over an array and assign each element according to a calculation that requires some iteration itself. Removing all unnecessary details the program boils down to something like this.
float output[n];
const float input[n] = ...;

for (int i = 0; i < n; ++i) {
    output[i] = 0.0f;
    for (int j = 0; j < n; ++j) {
        output[i] += some_calculation(i, input[j]);
    }
}
some_calculation does not alter its arguments, nor does it have internal state, so it is thread-safe. Looking at the loops, I understand that the outer loop is thread-safe because different iterations write to different memory locations (different output[i]) and the shared elements of input are never altered while the loop runs, but the inner loop is not thread-safe because it has a race condition on output[i], which is altered in every iteration.
Consequently, I'd like to spawn threads and get them working for different values of i but the whole iteration over input should be local to each thread so as not to introduce a race condition on output[i]. I think the following achieves this.
std::array<float, n> output;
const std::array<float, n> input = ...;

#pragma omp parallel for
for (int i = 0; i < n; ++i) {
    output[i] = 0.0f;
    for (int j = 0; j < n; ++j) {
        output[i] += some_calculation(i, input[j]);
    }
}
I'm not sure how this handles the inner loop. Threads working on different values of i should be able to run the loop in parallel, but I don't understand whether I'm allowing them to without another #pragma omp directive. On the other hand, I don't want to accidentally allow threads to run for different values of j over the same i, because that introduces a race condition. I'm also not sure whether I need some extra specification on how the two arrays should be handled.
Lastly, if this code is in a function that is going to get called repeatedly, does it need the parallel directive or can that be called once before my main loop begins like so.
void iterative_step(const std::array<float, n> &input, std::array<float, n> &output) {
    // Threads have already been spawned
    #pragma omp for
    for (int i = 0; i < n; ++i) {
        output[i] = 0.0f;
        for (int j = 0; j < n; ++j) {
            output[i] += some_calculation(i, input[j]);
        }
    }
}

int main() {
    ...
    // spawn threads once, but not for this loop
    #pragma omp parallel
    while (...) {
        iterative_step(input, output);
    }
    ...
}
I looked through various other questions but they were about different problems with different race conditions and I'm confused as to how to generalize the answers.
You don't want the omp parallel in main: a #pragma omp for that is not enclosed in a parallel region does nothing on its own. Put #pragma omp parallel for on the for (int i loop inside iterative_step; that creates (or reuses) the thread team just for that loop. For any particular value of i, the j loop will run entirely on one thread.
One other thing that would help a little is to compute your output[i] result in a local variable, then store it into output[i] once you're done with the j loop.
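A sketch of that suggestion, with the pragma on the loop itself and a thread-local accumulator (some_calculation here is a stand-in for the question's function, and compute is a hypothetical wrapper, not code from the question):

```cpp
#include <cstddef>
#include <vector>

// Stand-in for the question's stateless, thread-safe function.
float some_calculation(int i, float x) { return x + i; }

std::vector<float> compute(const std::vector<float>& input) {
    const int n = static_cast<int>(input.size());
    std::vector<float> output(n);
    #pragma omp parallel for            // the i iterations are split across threads
    for (int i = 0; i < n; ++i) {
        float acc = 0.0f;               // local: no repeated writes to output[i]
        for (int j = 0; j < n; ++j)
            acc += some_calculation(i, input[j]);
        output[i] = acc;                // single store once the j loop is done
    }
    return output;
}
```

Without -fopenmp the pragma is simply ignored and the loop runs serially, so the sketch is easy to test either way.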
I am trying to modify some strings from threads (each thread has its own string), but all the strings are stored in a vector, because I need to be able to access them after the threads have done their thing.
I haven't used threads in C++, so if this is a terrible thing to do, all suggestions are welcome :)
Basically the only thing the program does now is:
create some threads
send a string and an id to each thread
thread function modifies the string to add the id to it
end
This gives a segfault :(
Is this just a bad approach? How else could I do this?
static const int cores = 8;

void bmh_t(std::string & resr, int tid){
    resr.append(std::to_string(tid));
    resr.append(",");
    return;
}

std::vector<std::string> parbmh(std::string text, std::string pat){
    std::vector<std::string> distlists;
    std::thread t[cores];
    //Launch a group of threads
    for (int i = 0; i < cores; ++i) {
        distlists.push_back(" ");
        t[i] = std::thread(bmh_t, std::ref(distlists[i]), i);
    }
    for (int i = 0; i < cores; ++i) {
        t[i].join();
    }
    return distlists;
}
Your basic approach is fine. The main thing you need to consider when writing parallel code is that any data shared between threads is done so in a safe way. Because your algorithm uses a different string for each thread, it's a good approach.
The reason you're seeing a crash is because you're calling push_back on your vector of strings after you've already given each thread a reference to data stored within the vector. This is a problem because push_back needs to grow your vector, when its size reaches its capacity. That growth can invalidate the references that you've dispatched to each thread, causing them to write to freed memory.
The fix is very simple: just make sure ahead of time that your vector doesn't need to grow. This can be accomplished with a constructor argument specifying an initial number of elements; a call to reserve(); or a call to resize().
Here's an implementation that doesn't crash:
static const int cores = 8;

void bmh_t(std::string & resr, int tid){
    resr.append(std::to_string(tid));
    resr.append(",");
    return;
}

std::vector<std::string> parbmh(){
    std::vector<std::string> distlists;
    std::thread t[cores];
    distlists.reserve(cores);
    //Launch a group of threads
    for (int i = 0; i < cores; ++i) {
        distlists.push_back(" ");
        t[i] = std::thread(bmh_t, std::ref(distlists[i]), i);
    }
    for (int i = 0; i < cores; ++i) {
        t[i].join();
    }
    return distlists;
}
The vector of strings is being destructed before the threads can act on the contained strings. You'll want to join the threads before returning so that the vector of strings isn't destroyed.
I am writing a program where a bunch of different classes, all stored in a vector, do parallel operations on private members using public data structures. I'd like to parallelize it for multiple processors using OpenMP, but I have two questions about two of the operations in the code, both of which are indicated in comments in the example below that shows a reduced form of the program's logic.
#include <omp.h>
#include <iostream>
#include <sys/timeb.h>
#include <vector>
class A {
private:
    long _i;
public:
    void add_i(long &i) { _i += i; }
    long get_i() const { return _i; }
};

int main()
{
    timeb then;
    ftime(&then);
    unsigned int BIG = 1000000;
    int N = 4;
    std::vector<long> foo(BIG, 1);
    std::vector<A *> bar;
    for (unsigned int i = 0; i < N; i++)
    {
        bar.push_back(new A());
    }
    #pragma omp parallel num_threads(4)
    {
        for (long i = 0; i < BIG; i++)
        {
            int thread_n = omp_get_thread_num();
            // read a global variable
            long *to_add = &foo[i];
            // write to a private variable
            bar[thread_n]->add_i(*to_add);
        }
    }
    timeb now;
    ftime(&now);
    for (int i = 0; i < N; i++)
    {
        std::cout << bar[i]->get_i() << std::endl;
    }
    std::cout << now.millitm - then.millitm << std::endl;
}
The first comment addresses the read from the global foo. Is this "false sharing" (or data sloshing)? Most of the resources I read talk about false sharing in terms of write operations, but I don't know if the same applies to read operations.
The second comment addresses writing operations to classes in bar. Same question: is this false sharing? They are writing to elements in the same global data structure (which is, from what I've read, sloshing), but only ever acting on private data within the elements.
When I replace the OpenMP pragma with a plain for loop, the program is faster by about 25%, so I'm guessing I'm doing something wrong...
Modern memory allocators are thread-aware. To prevent false sharing when it comes to modifying each instance of class A pointed to by the elements of bar, you should move the memory allocation inside the parallel region, e.g.:
const int N = 4;
std::vector<long> foo(BIG, 1);
std::vector<A *> bar(N);

#pragma omp parallel num_threads(N)
{
    int thread_n = omp_get_thread_num();
    bar[thread_n] = new A();
    for (long i = 0; i < BIG; i++)
    {
        // read a global variable
        long *to_add = &foo[i];
        // write to a private variable
        bar[thread_n]->add_i(*to_add);
    }
}
Note also that in this case omp_get_thread_num() is called only once as opposed to BIG times as in your code. The overhead of calling a function is relatively low, but it adds up when you do that many times.
Your biggest sharing problem is the bar[thread_n]. The reading of foo[i] is less of an issue.
Edit: Since bar[thread_n] holds pointers, and the pointee is what gets updated, there is little or no sharing. You may still benefit from loading a "lump at a time" into each CPU core, rather than reading one or two items per CPU core from each cache line. So the code below may still help. As always when it's a matter of performance, benchmark a lot (with optimisation enabled), as different systems will behave differently (depending on compiler, CPU architecture, memory subsystem, etc.).
It would be better to "lump" a few items at a time in each thread. Something like this, perhaps:
const long STEP = 16;
for (long i = 0; i < BIG; i += STEP)
{
    int thread_n = omp_get_thread_num();
    long stop = std::min(BIG - i, STEP); // Don't go over the edge.
    for (long j = 0; j < stop; j++)
    {
        // read a global variable
        long *to_add = &foo[i + j];
        // write to a private variable
        bar[thread_n]->add_i(*to_add);
    }
}
You may need to adjust "STEP" a bit to find the right chunk size.