C++ threads race condition simulation

Here is a C++ program that runs 10 times with 5 different threads, and each thread increments the value of counter, so the final output should be 500, which is exactly what the program prints. But I can't understand why it prints 500 every time: the increment operation is not atomic and there are no locks, so the output should vary from run to run.
Edit: to increase the probability of a race condition I increased the loop count, but I still couldn't see any varying output.
#include <iostream>
#include <thread>
#include <vector>

struct Counter {
    int value;
    Counter() : value(0) {}
    void increment() {
        value = value + 1000;
    }
};

int main() {
    int n = 50000;
    while (n--) {
        Counter counter;
        std::vector<std::thread> threads;
        for (int i = 0; i < 5; ++i) {
            threads.push_back(std::thread([&counter]() {
                for (int i = 0; i < 1000; ++i) {
                    counter.increment();
                }
            }));
        }
        for (auto& thread : threads) {
            thread.join();
        }
        std::cout << counter.value << std::endl;
    }
    return 0;
}

You're just lucky :)
Compiling with clang++, my output is not always 500:
500
425
470
500
500
500
500
500
432
440

Note
Using g++ with -fsanitize=thread -static-libtsan:
WARNING: ThreadSanitizer: data race (pid=13871)
Read of size 4 at 0x7ffd1037a9c0 by thread T2:
#0 Counter::increment() <null> (Test+0x000000509c02)
#1 main::{lambda()#1}::operator()() const <null> (Test+0x000000507ed1)
#2 _M_invoke<> /usr/include/c++/5/functional:1531 (Test+0x0000005097d7)
#3 operator() /usr/include/c++/5/functional:1520 (Test+0x0000005096b2)
#4 _M_run /usr/include/c++/5/thread:115 (Test+0x0000005095ea)
#5 <null> <null> (libstdc++.so.6+0x0000000b8c7f)
Previous write of size 4 at 0x7ffd1037a9c0 by thread T1:
#0 Counter::increment() <null> (Test+0x000000509c17)
#1 main::{lambda()#1}::operator()() const <null> (Test+0x000000507ed1)
#2 _M_invoke<> /usr/include/c++/5/functional:1531 (Test+0x0000005097d7)
#3 operator() /usr/include/c++/5/functional:1520 (Test+0x0000005096b2)
#4 _M_run /usr/include/c++/5/thread:115 (Test+0x0000005095ea)
#5 <null> <null> (libstdc++.so.6+0x0000000b8c7f)
shows the race condition. (Also, on my system the output shows results different from 500.)
The options for g++ are explained in its documentation (e.g. man g++). See also: https://github.com/google/sanitizers/wiki#threadsanitizer.

Just because your code has race conditions does not mean they occur. That is the hard part about them: a lot of the time they only show up when something else changes and the timing is different.
There are several issues here: incrementing to 100 can be done really fast, so the first thread may already be halfway done before the second one is even started, and the same goes for the next thread, etc. So you never know whether you really have 5 threads running in parallel.
You should create a barrier at the beginning of each thread to make sure they all start at the same time; a sketch of this idea is shown below.
Also, maybe try a bit more work than "100" iterations with only 5 threads. But it all depends on the system, load, timing, etc.
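For illustration (my addition, not from the original answer), here is a minimal sketch of such a start barrier: the threads spin on an atomic flag until all of them have been created, so they all start hammering the counter at roughly the same time. The iteration count of 100000 is arbitrary, and with C++20 you could use std::barrier instead of the hand-rolled gate.

#include <atomic>
#include <iostream>
#include <thread>
#include <vector>

int main() {
    int counter = 0;              // intentionally unsynchronized, as in the question
    std::atomic<bool> go{false};  // simple start gate

    std::vector<std::thread> threads;
    for (int i = 0; i < 5; ++i) {
        threads.emplace_back([&] {
            // spin until main has created all threads and opens the gate
            while (!go.load(std::memory_order_acquire)) { }
            for (int j = 0; j < 100000; ++j) {
                counter = counter + 1;  // data race: unsynchronized read-modify-write
            }
        });
    }
    go.store(true, std::memory_order_release);  // release all threads at once
    for (auto& t : threads) t.join();
    std::cout << counter << std::endl;  // often noticeably less than 500000
    return 0;
}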

to increase probability of race condition i increased the loop count
but still couldn't see any varying output
Strictly speaking, you have a data race in this code, which is undefined behavior, so you cannot reliably reproduce it.
But you can rewrite Counter into "equivalent" code with an artificial delay in increment:
#include <chrono>
#include <thread>

struct Counter {
    int value;
    Counter() : value(0) {}
    void increment() {
        int val = value;                                             // racy read
        std::this_thread::sleep_for(std::chrono::milliseconds(1));   // widen the race window
        ++val;
        value = val;                                                 // racy write
    }
};
I've got the following output with this counter, which is far less than 500:
100
100
100
100
100
101
100
100
101
100
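For completeness (my addition, not part of the original answers): if the increment is actually synchronized, for example with std::atomic, the result becomes deterministic again. A minimal sketch:

#include <atomic>
#include <iostream>
#include <thread>
#include <vector>

struct Counter {
    std::atomic<int> value{0};
    void increment() {
        ++value;   // atomic read-modify-write, no data race
    }
};

int main() {
    Counter counter;
    std::vector<std::thread> threads;
    for (int i = 0; i < 5; ++i) {
        threads.emplace_back([&counter] {
            for (int j = 0; j < 100; ++j) {
                counter.increment();
            }
        });
    }
    for (auto& t : threads) t.join();
    std::cout << counter.value << std::endl;   // always 500
    return 0;
}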

Related

Why does similar code running in multiple threads have different running times?

I've run into a very strange problem with a C++ multi-threaded program, shown below.
#include <iostream>
#include <thread>
#include <ctime>
using namespace std;

int* counter = new int[1024];

void updateCounter(int position)
{
    for (int j = 0; j < 100000000; j++)
    {
        counter[position] = counter[position] + 8;
    }
}

int main() {
    time_t begin, end;
    begin = clock();
    thread t1(updateCounter, 1);
    thread t2(updateCounter, 2);
    thread t3(updateCounter, 3);
    thread t4(updateCounter, 4);
    t1.join();
    t2.join();
    t3.join();
    t4.join();
    end = clock();
    cout << end - begin << endl;  // 1833

    begin = clock();
    thread t5(updateCounter, 16);
    thread t6(updateCounter, 32);
    thread t7(updateCounter, 48);
    thread t8(updateCounter, 64);
    t5.join();
    t6.join();
    t7.join();
    t8.join();
    end = clock();
    cout << end - begin << endl;  // 358
}
The first code block takes about 1833 (as reported by clock()), but the second, which is almost the same as the first, takes only about 358. I'd really appreciate an explanation. Thank you!
Writing to nearby variables from multiple threads is slow due to "false sharing" which is described here: What is "false sharing"? How to reproduce / avoid it?
Your offsets of 16/32/48/64 are 64 bytes apart because the int values are (on most common platforms) 4 bytes each. And 64 bytes is a common cache line size, so this puts each target value on its own cache line.
The performance difference is not nearly as large if you compile with optimization, which of course you should always do when measuring performance. But there's still a difference, and it may get worse the more threads you have.
Finally, your benchmark is unfair because you always run the "slow" code first. That means the code and data are "cold" for the first experiment and "hot" for the second experiment. This is a common mistake in benchmarking, and may even be the dominant factor in the performance difference you're seeing, depending on your system.
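As an illustration of the cache-line point (my addition, not from the original answer), padding each counter to a 64-byte cache line, assuming that line size, makes the "adjacent" case behave like the "spread out" case:

#include <iostream>
#include <thread>
#include <vector>

constexpr std::size_t kCacheLine = 64;       // assumed cache line size

struct alignas(kCacheLine) PaddedCounter {   // sizeof == 64, so each element gets its own line
    int value = 0;
};

int main() {
    PaddedCounter counters[4];               // no two counters share a cache line
    std::vector<std::thread> threads;
    for (int t = 0; t < 4; ++t) {
        threads.emplace_back([&counters, t] {
            for (int j = 0; j < 100000000; j++) {
                counters[t].value += 8;      // no false sharing between threads
            }
        });
    }
    for (auto& th : threads) th.join();
    std::cout << counters[0].value << std::endl;
    return 0;
}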

Race condition in OpenMP outside parallel block (ThreadSanitizer); false positive?

The following minimal example computes the sum of all the numbers from 1 to 1000 and is parallelized with OpenMP:
#include <iostream>

double sum;

void do_it() {
    const size_t n = 1000;
#pragma omp parallel
    {
#pragma omp for
        for (size_t i = 1; i <= n; ++i) {
#pragma omp atomic
            sum += static_cast<double>(i);
        }
    }
}

int main() {
    sum = 0.;
    do_it();
    std::cout << sum << std::endl;
    return 0;
}
I tried compiling this with both clang++-6.0.0 and g++-5.4.0, with ThreadSanitizer. Both compilers produce a few warnings about race conditions in libomp.so/libgomp.so, which I am assuming are false positives, and the following warning about my code:
==================
WARNING: ThreadSanitizer: data race (pid=22081)
Read of size 8 at 0x000001555f48 by main thread:
#0 main /home/arekfu/src/foo/openmp.cc:20 (openmp+0x4be0ce)
Previous atomic write of size 8 at 0x000001555f48 by thread T11:
#0 __tsan_atomic64_compare_exchange_val ??:? (openmp+0x476470)
#1 .omp_outlined._debug__ /home/arekfu/src/foo/openmp.cc:12 (openmp+0x4be011)
#2 .omp_outlined. /home/arekfu/src/foo/openmp.cc:8 (openmp+0x4be011)
#3 __kmp_invoke_microtask ??:? (libomp.so.5+0x994b2)
Location is global '<null>' at 0x000000000000 (openmp+0x000001555f48)
Thread T11 (tid=22093, running) created by main thread at:
#0 pthread_create ??:? (openmp+0x4284db)
#1 __kmpc_threadprivate_register_vec ??:? (libomp.so.5+0x5bc1f)
#2 __libc_start_main /build/glibc-LK5gWL/glibc-2.23/csu/../csu/libc-start.c:291 (libc.so.6+0x2082f)
SUMMARY: ThreadSanitizer: data race /home/arekfu/src/foo/openmp.cc:20 in main
==================
I cannot see any data race in my code though!
I have also tried replacing the atomic updates with a critical section, like this:
#pragma omp critical
{
    sum += static_cast<double>(i);
}
This changes the warning, but the new one does not make much more sense:
==================
WARNING: ThreadSanitizer: data race (pid=27477)
Write of size 8 at 0x000001555f48 by thread T4:
#0 .omp_outlined._debug__ /home/arekfu/src/foo/openmp.cc:13 (openmp+0x4be0a2)
#1 .omp_outlined. /home/arekfu/src/foo/openmp.cc:8 (openmp+0x4be0a2)
#2 __kmp_invoke_microtask ??:? (libomp.so.5+0x994b2)
Previous write of size 8 at 0x000001555f48 by thread T3:
#0 .omp_outlined._debug__ /home/arekfu/src/foo/openmp.cc:13 (openmp+0x4be0a2)
#1 .omp_outlined. /home/arekfu/src/foo/openmp.cc:8 (openmp+0x4be0a2)
#2 __kmp_invoke_microtask ??:? (libomp.so.5+0x994b2)
Location is global '<null>' at 0x000000000000 (openmp+0x000001555f48)
Thread T4 (tid=27482, running) created by main thread at:
#0 pthread_create ??:? (openmp+0x42857b)
#1 __kmpc_threadprivate_register_vec ??:? (libomp.so.5+0x5bc1f)
#2 __libc_start_main /build/glibc-LK5gWL/glibc-2.23/csu/../csu/libc-start.c:291 (libc.so.6+0x2082f)
Thread T3 (tid=27481, running) created by main thread at:
#0 pthread_create ??:? (openmp+0x42857b)
#1 __kmpc_threadprivate_register_vec ??:? (libomp.so.5+0x5bc1f)
#2 __libc_start_main /build/glibc-LK5gWL/glibc-2.23/csu/../csu/libc-start.c:291 (libc.so.6+0x2082f)
SUMMARY: ThreadSanitizer: data race /home/arekfu/src/foo/openmp.cc:13 in .omp_outlined._debug__
==================
Are these warnings an indication of real data races, or are they false positives?
The "problem" is the read operation on sum in line 20:
std::cout << sum << std::endl; // here you are reading the value of sum
TSAN cannot infer an inter-thread happens-before relation between this read and the (atomic) updates in the loop. But of course such a relation exists, since all threads are synchronized at the end of the omp block. So yes, this is a false positive.
This post provides more information on how to avoid such false positives with OpenMP: Can I use Thread Sanitizer for OpenMP programs?
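As a side note (my addition, not part of the original answer), the per-iteration atomic can be avoided altogether with an OpenMP reduction: each thread accumulates into a private copy and the copies are combined once at the end. This does not by itself address the false-positive issue with an uninstrumented OpenMP runtime, but it is the more idiomatic way to write the sum:

#include <iostream>

double sum;

void do_it() {
    const size_t n = 1000;
#pragma omp parallel for reduction(+:sum)
    for (size_t i = 1; i <= n; ++i) {
        sum += static_cast<double>(i);   // each thread updates its private copy of sum
    }
}

int main() {
    sum = 0.;
    do_it();
    std::cout << sum << std::endl;   // 500500
    return 0;
}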

GCC 8.1.0/MinGW64-compiled OpenMP program crashes looking for cygwin.s?

I'm learning OpenMP in C++ using gcc 8.1.0 and MinGW64 (latest version as of this month), and I'm running into a weird debug error when my program encounters a segmentation fault.
I know the cause of the crash, attempting to create too many OpenMP threads (50,000), but it's the error itself that has me puzzled. I didn't compile gcc or MinGW64 from source, I just used the installers, and I'm on Windows.
Why is it looking for cygwin.s, and why use that file structure on Windows? My code and the error message from gdb are below the closing.
I'm learning OpenMP in the process of programming a path tracer, and I think I have a workaround for the thread limit (using while (threads < runs) and letting OpenMP set the thread count automatically), but I am stumped as to the error. Is there a workaround or solution for this?
It works fine with ~10,000 threads. I know it's not actually creating 10,000 threads simultaneously, but it's what I was doing before I thought of the workaround.
Thank you for the heads up about rand() and thread safety. I ended up replacing my RNG code with some that appears to be working fine in OpenMP, and it's literally a night and day difference visually. I will try the other changes and report back. Thanks!
WOW! It runs so much faster and the image is artifact-free! Thank you!
Jadan Bliss
Final code:
#pragma omp parellel
for (j = options.height - 1; j >= 0; j--) {
    for (i = 0; i < options.width; i++) {
        #pragma omp parallel for reduction(Vector3Add:col)
        for (int s = 0; s < options.samples; s++)
        {
            float u = (float(i) + scene_drand()) / float(options.width);
            float v = (float(j) + scene_drand()) / float(options.height);
            Ray r = cam.get_ray(u, v); // was: origin, lower_left_corner + u*horizontal + v*vertical);
            col += color(r, world, 0);
        }
        col /= real(options.samples);
        render.set(i, j, col);
        col = Vector3(0.0);
    }
}
Error:
Starting program: C:\Users\Jadan\Documents\CBProjects\learnOMP\bin\Debug\learnOMP.exe
[New Thread 22136.0x6620]
[New Thread 22136.0x80a8]
[New Thread 22136.0x8008]
[New Thread 22136.0x5428]

Thread 1 received signal SIGSEGV, Segmentation fault.
___chkstk_ms () at ../../../../../src/gcc-8.1.0/libgcc/config/i386/cygwin.S:126
126     ../../../../../src/gcc-8.1.0/libgcc/config/i386/cygwin.S: No such file or directory.
Here are some remarks on your code.
Using a huge number of threads will not bring you any gain and is the probable cause of your problems. Thread creation has a time and a resource cost. The time cost means thread creation will probably dominate your program's runtime, so the parallel program will end up far slower than its sequential version. As for the resource cost, each thread has its own stack segment; its size is system dependent, but typical values are measured in MB. I do not know the characteristics of your system, but with 100000 threads this is probably why your code is crashing. I have no explanation for the message about cygwin.S, but after a stack overflow the behavior can be weird.
Threads are a means to parallelize code, and for data parallelism it is usually pointless to have more threads than the number of logical processors on your system. Let OpenMP set the thread count, and experiment later if you want to tune it.
Besides that, there are other problems.
rand() is not thread safe, as it uses a global state that is modified concurrently by the threads. rand_r() is thread safe, because the generator state is not global and can be kept separately in each thread.
You should not modify a shared variable like result without atomic access, as concurrent accesses can lead to unexpected results. While safe, an atomic modification for every value is not a very efficient solution, though: atomic accesses are expensive, and it is better to use a reduction that accumulates locally in every thread and performs a single atomic access at the end.
#include <omp.h>
#include <iostream>
#include <random>
#include <stdlib.h>
#include <time.h>

int main()
{
    int runs = 100000;
    double result = 0.0;
#pragma omp parallel
    {
        // per-thread initialisation of the rand_r seed
        unsigned int rand_state = omp_get_thread_num() * time(NULL);
        // or whatever thread-dependent seed
#pragma omp for reduction(+:result)
        for (int i = 0; i < runs; i++)
        {
            double d = double(rand_r(&rand_state)) / double(RAND_MAX);
            result += d;
        }
    }
    result /= double(runs);
    std::cout << "The computed average over " << runs << " runs was "
              << result << std::endl;
    return 0;
}
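A possible modern alternative (my addition, not part of the original answer): since the snippet already pulls in <random>, each thread could simply keep its own C++ engine as a local variable inside the parallel region, which avoids both the global state of rand() and the POSIX-specific rand_r(). The seed scheme below is arbitrary:

#include <omp.h>
#include <iostream>
#include <random>

int main()
{
    const int runs = 100000;
    double result = 0.0;
#pragma omp parallel
    {
        // each thread gets its own engine; plain locals inside the region are already per-thread
        std::mt19937 gen(12345u + omp_get_thread_num());
        std::uniform_real_distribution<double> dist(0.0, 1.0);
#pragma omp for reduction(+:result)
        for (int i = 0; i < runs; i++)
        {
            result += dist(gen);
        }
    }
    result /= double(runs);
    std::cout << "Average over " << runs << " runs: " << result << std::endl;
    return 0;
}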

posting future and new threads in a loop multiple times

I am writing a program that does calculations in multiple threads and returns the results using C++ futures. Here's a simplified version of my code:
int main()
{
    int length = 64;
    vector<std::future<float>> threads(length);
    vector<float> results(length);
    int blockLength = 8;
    int blockCount = length / blockLength;
    for (int j = 0; j < blockCount; j++)
    {
        for (int i = 0; i < blockLength; i++)
        {
            threads[i + j * blockLength] = std::async(func1, i * j);
        }
        for (int i = 0; i < blockLength; i++)
        {
            results[i + j * blockLength] = threads[i].get();
        }
    }
}
The definition of func1 is simplified as follows:
float func1(int input)
{
    // calculations...
    return result;
}
I would like the program above to do 64 calculations, 8 threads at a time, so that processor and memory usage stay reasonable.
The idea is that the program launches blockLength threads at a time, waits until their calculation results are obtained, and then proceeds to the next loop iteration.
So it launches blockLength threads, blockCount times; for example, 8 threads, 8 times.
But the program is not working: there is always an EXC_BAD_ACCESS exception when the first loop of blockLength threads finishes. Besides, the calculation time of each thread is not guaranteed; any thread can run for a long time or finish quickly.
A screenshot (not reproduced here) shows the CPU usage dropping as some of the threads finish, but an exception is thrown as soon as the second loop starts.
Would you please point out what is wrong with my usage of future?
How can we correct it?
Thank you very much!
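No answer is included here, but as one reading of the code (mine, not confirmed by the original thread): the second inner loop calls threads[i].get() instead of threads[i + j * blockLength].get(), so from the second block onward it calls get() on futures whose results were already retrieved, which is invalid and can crash. A self-contained sketch of the intended batching pattern, with a placeholder func1, might look like this:

#include <future>
#include <iostream>
#include <vector>

// placeholder for the real computation (hypothetical)
float func1(int input) {
    return static_cast<float>(input) * 0.5f;
}

int main()
{
    const int length = 64;
    const int blockLength = 8;
    const int blockCount = length / blockLength;

    std::vector<std::future<float>> futures(length);
    std::vector<float> results(length);

    for (int j = 0; j < blockCount; j++)
    {
        // launch one block of asynchronous tasks
        for (int i = 0; i < blockLength; i++)
        {
            futures[i + j * blockLength] = std::async(std::launch::async, func1, i * j);
        }
        // wait for exactly the futures launched in this block
        for (int i = 0; i < blockLength; i++)
        {
            results[i + j * blockLength] = futures[i + j * blockLength].get();
        }
    }

    std::cout << "first result: " << results[0] << std::endl;
    return 0;
}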

Code runs 6 times slower with 2 threads than with 1

Original Problem:
So I have written some code to experiment with threads and do some testing.
The code should create some numbers and then find the mean of those numbers.
I think it is just easier to show you what I have so far. I was expecting that with two threads the code would run about 2 times as fast. Measuring it with a stopwatch, I think it runs about 6 times slower! EDIT: now using the clock() function on the computer to tell the time.
#include <cstdint>
#include <cstdlib>
#include <iostream>
#include <thread>
#include <vector>

void findmean(std::vector<double>*, std::size_t, std::size_t, double*);

int main(int argn, char** argv)
{
    // Program entry point
    std::cout << "Generating data..." << std::endl;

    // Create a vector containing many variables
    std::vector<double> data;
    for (uint32_t i = 1; i <= 1024 * 1024 * 128; i++) data.push_back(i);

    // Calculate mean using 1 core
    double mean = 0;
    std::cout << "Calculating mean, 1 Thread..." << std::endl;
    findmean(&data, 0, data.size(), &mean);
    mean /= (double)data.size();

    // Print result
    std::cout << " Mean=" << mean << std::endl;

    // Repeat, using two threads
    std::vector<std::thread> thread;
    std::vector<double> result;
    result.push_back(0.0);
    result.push_back(0.0);
    std::cout << "Calculating mean, 2 Threads..." << std::endl;

    // Run threads
    uint32_t halfsize = data.size() / 2;
    uint32_t A = 0;
    uint32_t B, C, D;

    // Split the data into two blocks
    if (data.size() % 2 == 0)
    {
        B = C = D = halfsize;
    }
    else if (data.size() % 2 == 1)
    {
        B = C = halfsize;
        D = halfsize + 1;
    }

    // Run with two threads
    thread.push_back(std::thread(findmean, &data, A, B, &(result[0])));
    thread.push_back(std::thread(findmean, &data, C, D, &(result[1])));

    // Join threads
    thread[0].join();
    thread[1].join();

    // Calculate result
    mean = result[0] + result[1];
    mean /= (double)data.size();

    // Print result
    std::cout << " Mean=" << mean << std::endl;

    // Return
    return EXIT_SUCCESS;
}

void findmean(std::vector<double>* datavec, std::size_t start, std::size_t length, double* result)
{
    for (uint32_t i = 0; i < length; i++) {
        *result += (*datavec).at(start + i);
    }
}
I don't think this code is exactly wonderful; if you could suggest ways of improving it, I would be grateful for that also.
Register Variable:
Several people have suggested making a local variable for the function 'findmean'. This is what I have done:
void findmean(std::vector<double>* datavec, std::size_t start, std::size_t length, double* result)
{
    register double holding = *result;
    for (uint32_t i = 0; i < length; i++) {
        holding += (*datavec).at(start + i);
    }
    *result = holding;
}
I can now report: The code runs with almost the same execution time as with a single thread. That is a big improvement of 6x, but surely there must be a way to make it nearly twice as fast?
Register Variable and O2 Optimization:
I have set the optimization to 'O2' - I will create a table with the results.
Results so far:
Original Code with no optimization or register variable:
1 thread: 4.98 seconds, 2 threads: 29.59 seconds
Code with added register variable:
1 Thread: 4.76 seconds, 2 Threads: 4.76 seconds
With reg variable and -O2 optimization:
1 Thread: 0.43 seconds, 2 Threads: 0.6 seconds. 2 Threads is now slower?
With Dameon's suggestion, which was to put a large block of memory in between the two result variables:
1 Thread: 0.42 seconds, 2 Threads: 0.64 seconds
With TAS 's suggestion of using iterators to access contents of the vector:
1 Thread: 0.38 seconds, 2 Threads: 0.56 seconds
Same as above on Core i7 920 (single channel memory 4GB):
1 Thread: 0.31 seconds, 2 Threads: 0.56 seconds
Same as above on Core i7 920 (dual channel memory 2x2GB):
1 Thread: 0.31 seconds, 2 Threads: 0.35 seconds
Why are 2 threads 6x slower than 1 thread?
You are getting hit by a bad case of false sharing.
After getting rid of the false sharing, why are 2 threads not faster than 1 thread?
You are bottlenecked by your memory bandwidth.
False Sharing:
The problem here is that each thread is accessing the result variable at adjacent memory locations. It's likely that they fall on the same cacheline so each time a thread accesses it, it will bounce the cacheline between the cores.
Each thread is running this loop:
for (uint32_t i = 0; i < length; i++) {
    *result += (*datavec).at(start + i);
}
And you can see that the result variable is being accessed very often (each iteration). So each iteration, the threads are fighting for the same cacheline that's holding both values of result.
Normally, the compiler should put *result into a register thereby removing the constant access to that memory location. But since you never turned on optimizations, it's very likely the compiler is indeed still accessing the memory location and thus incurring false-sharing penalties at every iteration of the loop.
Memory Bandwidth:
Once you have eliminated the false sharing and got rid of the 6x slowdown, the reason why you're not getting improvement is because you've maxed out your memory bandwidth.
Sure your processor may be 4 cores, but they all share the same memory bandwidth. Your particular task of summing up an array does very little (computational) work for each memory access. A single thread is already enough to max out your memory bandwidth. Therefore going to more threads is not likely to get you much improvement.
In short, no you won't be able to make summing an array significantly faster by throwing more threads at it.
As stated in other answers, you are seeing false sharing on the result variable, but there is also one other location where this is happening. The std::vector<T>::at() function (as well as std::vector<T>::operator[]()) accesses the length of the vector on each element access. To avoid this you should switch to using iterators. Also, using std::accumulate() will allow you to take advantage of optimizations in the standard library implementation you are using.
Here are the relevant parts of the code:
thread.push_back(std::thread(findmean, std::begin(data)+A, std::begin(data)+B, &(result[0])));
thread.push_back(std::thread(findmean, std::begin(data)+B, std::end(data), &(result[1])));
and
void findmean(std::vector<double>::const_iterator start, std::vector<double>::const_iterator end, double* result)
{
    *result = std::accumulate(start, end, 0.0);   // requires #include <numeric>
}
This consistently gives me better performance for two threads on my 32-bit netbook.
More threads doesn't mean faster! There is an overhead in creating and context-switching threads, and even the hardware on which this code runs influences the results. For trivial work like this, a single thread is probably better.
This is probably because the cost of launching and waiting for two threads is much greater than that of computing the result in a single loop. Your data size is 128MB, which is not a lot for a modern processor to process in a single loop.