ThreadSanitizer data race when multiple threads write the same value - c++

I have a program where many threads do some computations and write a boolean true value in a shared array to tag the corresponding item as "dirty". This is a data race, reported by ThreadSanitizer. Nevertheless, the flag is never read from these threads, and since the same value is written by all threads, I wonder if it is an actually problematic data race.
Here is a minimal working example:
#include <array>
#include <cstdio>
#include <thread>
#include <vector>
int
main()
{
constexpr int N = 64;
std::array<bool, N> dirty{};
std::vector<std::thread> threads;
threads.reserve(3 * N);
for (int j = 0; j != 3; ++j)
for (int i = 0; i != N; ++i)
threads.emplace_back([&dirty, i]() -> void {
if (i % 2 == 0)
dirty[i] = true; // data race here.
});
for (std::thread& t : threads)
if (t.joinable())
t.join();
for (int i = 0; i != N; ++i)
if (dirty[i])
printf("%d\n", i);
return 0;
}
Compiled with g++ -fsanitize=thread, a data race is reported on the marked line. Under which conditions can this be an actual problem, i.e. the dirty flag for an item i would not be the expected value?

Related

Can't tell if Mutex Lock is kicking in or not?

I'm working on a college assignment and have been tasked with showing a basic mutex lock example. I've never worked with threads in any form, so I'm a total beginner working with POSIX threads in C++.
What I'm trying to get the program to do is create 1000 threads that increment a global integer by 1000.
#include <iostream>
#include <stdlib.h>
#include <pthread.h>
#include <sys/types.h>
#include <unistd.h>
#include <thread>
pthread_t threadArr[1000];
pthread_mutex_t lock;
// Global int to increment
int numberToInc = 0;
void* incByTwo(void*)
{
pthread_mutex_lock(&lock);
for(int j = 0; j < 1000; j++){
numberToInc += 1;
}
pthread_mutex_unlock(&lock);
return NULL;
}
int main()
{
//Creates 1000 threads with incByTwo func
for(int i = 0; i < 1000; i++){
pthread_create(&threadArr[i], NULL, incByTwo, NULL);
}
std::cout << "\n" << numberToInc << "\n";
return 0;
}
The following produces a series of different results, obviously because the threads are executing concurrently, right?
Now, I've gotten it to work correctly by inserting
for(int i = 0; i < 1000; i++){
pthread_join(threadArr[i], NULL);
}
After the thread creation loop, but then removing the mutex locks, it still works. I've been trying to piece out how pthread_join works but I'm a little lost. Any advice?
Sorted a way to show the mutex lock in action. So when I output the global var in the function, without mutex locks it has the potential to show the results out of order.
Running the number range with mutex locks, out looks like:
1000
2000
3000
... (etc)
10000
With mutex locks removed, the output can vary in the order.
E.g.
1000
2000
4000
6000
3000
5000
7000
8000
9000
10000
While the final result of the three threads is correct, the sequence is out of order. In the context of this program it doesn't really matter but I'd imagine if it's passing inconsistently sequenced values it messes things up?
pthread_t threadArr[10];
pthread_mutex_t lock;
int numberToInc = 0;
void* incByTwo(void*)
{
pthread_mutex_lock(&lock);
for(int j = 0; j < 1000; j++){
numberToInc += 1;
}
std::cout << numberToInc << "\n";
pthread_mutex_unlock(&lock);
return NULL;
}
int main()
{
if (pthread_mutex_init(&lock, NULL) != 0)
{
printf("\n mutex init failed\n");
return 1;
}
for(int i = 0; i < 10; i++){
pthread_create(&threadArr[i], NULL, incByTwo, NULL);
}
pthread_join(threadArr[0], NULL);
return 0;
}

synchronizing 10 threads with atomic bool

I'm trying to use 10 threads and each one needs to print his number and the printing needs to be synchronized. I'm doing it as homework and I have to use atomic variables to do it (no locks).
Here what I tried so far:
#include <atomic>
#include <thread>
#include <iostream>
#include <vector>
using namespace std;
atomic<bool> turn = true;
void print(int i);
int main()
{
vector<thread> threads;
for (int i = 1; i <= 10; i++)
{
threads.push_back(thread(print, i));
}
for (int i = 0; i < 10; i++)
{
threads[i].join();
}
return 0;
}
void print(int i)
{
bool f = true;
for (int j = 0; j < 100; j++)
{
while((turn.compare_exchange_weak(f, false)) == false)
{ }
cout << i << endl;
turn = turn.exchange(true);
}
}
output example:
24
9143
541
2
8
expected output:
2
4
9
1
4
3
1
5
4
10
8
You have 2 bugs in your use of atomic.
When compare_exchange_weak fails it stores the current value in the first parameter. If you want to keep trying the same value you need to set it back to the original value:
while ((turn.compare_exchange_weak(f, false)) == false)
{
f = true;
}
The second issue is that exchange returns the currently stored value so:
turn = turn.exchange(true);
Sets the value of turn back to false, you need just:
turn.exchange(true);
Or even just:
turn = true;
Synchronisation isn't actually necessary in this case as std::cout will do the synchronisation for you, single output operations wont overlap so you can just change your print function to the following and it will just work:
void print(int i)
{
for (int j = 0; j < 100; j++)
{
cout << std::to_string(i) + "\n";
}
}
Atomics aren't the right approach to this problem, your code is incredibly slow. Mutexes would probably be quicker.

How to create a certain number of threads based on a value a variable contains?

I have a integer variable, that contains the number of threads to execute. Lets call it myThreadVar. I want to execute myThreadVar threads, and cannot think of any way to do it, without a ton of if statements. Is there any way I can create myThreadVar threads, no matter what myThreadVar is?
I was thinking:
for (int i = 0; i < myThreadVar; ++i) { std::thread t_i(myFunc); }, but that obviously won't work.
Thanks in advance!
Make an array or vector of threads, put the threads in, and then if you want to wait for them to finish have a second loop go over your collection and join them all:
std::vector<std::thread> myThreads;
myThreads.reserve(myThreadVar);
for (int i = 0; i < myThreadVar; ++i)
{
myThreads.push_back(std::thread(myFunc));
}
While other answers use vector::push_back(), I prefer vector::emplace_back(). Possibly more efficient. Also use vector::reserve(). See it live here.
#include <thread>
#include <vector>
void func() {}
int main() {
int num = 3;
std::vector<std::thread> vec;
vec.reserve(num);
for (auto i = 0; i < num; ++i) {
vec.emplace_back(func);
}
for (auto& t : vec) t.join();
}
So, obvious the best solution is not to wait previous thread to done. You need to run all of them in parallel.
In this case you can use vector class to store all of instances and after that make join to all of them.
Take a look at my example.
#include <thread>
#include <vector>
void myFunc() {
/* Some code */
}
int main()
{
int myThreadVar = 50;
std::vector <thread> threadsToJoin;
threadsToJoin.resize(myThreadVar);
for (int i = 0; i < myThreadVar; ++i) {
threadsToJoin[i] = std::thread(myFunc);
}
for (int i = 0; i < threadsToJoin.size(); i++) {
threadsToJoin[i].join();
}
}
#include <iostream>
#include <thread>
void myFunc(int n) {
std::cout << "myFunc " << n << std::endl;
}
int main(int argc, char *argv[]) {
int myThreadVar = 5;
for (int i = 0; i < myThreadVar; ++i) {
std::cout << "Launching " << i << std::endl;
std::thread t_i(myFunc,i);
t_i.detach();
}
}
g++ -std=c++11 -o 35106568 35106568.cpp
./35106568
Launching 0
myFunc 0
Launching 1
myFunc 1
Launching 2
myFunc 2
Launching 3
myFunc 3
Launching 4
myFunc 4
You need to store the thread so you can send it to join.
std::thread t[myThreadVar];
for (int i = 0; i < myThreadVar; ++i) { t[i] = std::thread(myFunc); }//Start all threads
for (int i = 0; i < myThreadVar; ++i) {t[i].join;}//Wait for all threads to finish
I think this is valid syntax, but I'm more used to c so I am unsure if I initialized the array correctly.

Threads failing to affect performance

Below is a small program meant to parallelize the approximation of the 1/(n^2) series. Note the global parameter NUM_THREADS.
My issue is that increasing the number of threads from 1 to 4 (the number of processors my computer has is 4) does not significantly affect the outcomes of timing experiments. Do you see a logical flaw in the ThreadFunction? Is there false sharing or misplaced blocking that ends up serializing the execution?
#include <iostream>
#include <thread>
#include <vector>
#include <mutex>
#include <string>
#include <future>
#include <chrono>
std::mutex sum_mutex; // This mutex is for the sum vector
std::vector<double> sum_vec; // This is the sum vector
int NUM_THREADS = 1;
int UPPER_BD = 1000000;
/* Thread function */
void ThreadFunction(std::vector<double> &l, int beg, int end, int thread_num)
{
double sum = 0;
for(int i = beg; i < end; i++) sum += (1 / ( l[i] * l[i]) );
std::unique_lock<std::mutex> lock1 (sum_mutex, std::defer_lock);
lock1.lock();
sum_vec.push_back(sum);
lock1.unlock();
}
void ListFill(std::vector<double> &l, int z)
{
for(int i = 0; i < z; ++i) l.push_back(i);
}
int main()
{
std::vector<double> l;
std::vector<std::thread> thread_vec;
ListFill(l, UPPER_BD);
int len = l.size();
int lower_bd = 1;
int increment = (UPPER_BD - lower_bd) / NUM_THREADS;
for (int j = 0; j < NUM_THREADS; ++j)
{
thread_vec.push_back(std::thread(ThreadFunction, std::ref(l), lower_bd, lower_bd + increment, j));
lower_bd += increment;
}
for (auto &t : thread_vec) t.join();
double big_sum;
for (double z : sum_vec) big_sum += z;
std::cout << big_sum << std::endl;
return 0;
}
From looking at your code, I suspect that ListFill is taking longer than ThreadFunction. Why pass a list of values to the thread instead of the bounds each thread should loop over? Something like:
void ThreadFunction( int beg, int end ) {
double sum = 0.0;
for(double i = beg; i < end; i++)
sum += (1.0 / ( i * i) );
std::unique_lock<std::mutex> lock1 (sum_mutex);
sum_vec.push_back(sum);
}
To maximize parallelism, you need to push as much work as possible onto the threads. See Amdahl's Law
In addition to dohashi's nice improvement, you can remove the need for the mutex by populating the sum_vec in advance in the main thread:
sum_vec.resize(4);
then writing directly to it in ThreadFunction:
sum_vec[thread_num] = sum;
since each thread writes to a distinct element and doesn't modify the vector itself there is no need to lock anything.

OpenMP vs C++11 threads

In the following example the C++11 threads take about 50 seconds to execute, but the OMP threads only 5 seconds. Any ideas why? (I can assure you it still holds true if you are doing real work instead of doNothing, or if you do it in a different order, etc.) I'm on a 16 core machine, too.
#include <iostream>
#include <omp.h>
#include <chrono>
#include <vector>
#include <thread>
using namespace std;
void doNothing() {}
int run(int algorithmToRun)
{
auto startTime = std::chrono::system_clock::now();
for(int j=1; j<100000; ++j)
{
if(algorithmToRun == 1)
{
vector<thread> threads;
for(int i=0; i<16; i++)
{
threads.push_back(thread(doNothing));
}
for(auto& thread : threads) thread.join();
}
else if(algorithmToRun == 2)
{
#pragma omp parallel for num_threads(16)
for(unsigned i=0; i<16; i++)
{
doNothing();
}
}
}
auto endTime = std::chrono::system_clock::now();
std::chrono::duration<double> elapsed_seconds = endTime - startTime;
return elapsed_seconds.count();
}
int main()
{
int cppt = run(1);
int ompt = run(2);
cout<<cppt<<endl;
cout<<ompt<<endl;
return 0;
}
OpenMP thread-pools for its Pragmas (also here and here). Spinning up and tearing down threads is expensive. OpenMP avoids this overhead, so all it's doing is the actual work and the minimal shared-memory shuttling of the execution state. In your Threads code you are spinning up and tearing down a new set of 16 threads every iteration.
I tried a code of an 100 looping at
Choosing the right threading framework and it took
OpenMP 0.0727, Intel TBB 0.6759 and C++ thread library 0.5962 mili-seconds.
I also applied what AruisDante suggested;
void nested_loop(int max_i, int band)
{
for (int i = 0; i < max_i; i++)
{
doNothing(band);
}
}
...
else if (algorithmToRun == 5)
{
thread bristle(nested_loop, max_i, band);
bristle.join();
}
This code looks like taking less time than your original C++ 11 thread section.