Is seq_cst needed for synchronization of OpenMP atomic updates?

I read here that sequential memory consistency (seq_cst) "might be needed" to make sure an atomic update is viewed by all threads consistently in an OpenMP parallel region.
Consider the following MWE, which is admittedly trivial and could be realized with a reduction rather than atomics, but which illustrates my question that arose in a more complex piece of code:
#include <iostream>

int main()
{
    double a = 0;
    #pragma omp parallel for
    for (int i = 0; i < 10000000; ++i)
    {
        #pragma omp atomic
        a += 5.5;
    }
    std::cout.precision(17);
    std::cout << a << std::endl;
    return 0;
}
I compiled this with g++ -fopenmp -O3 using GCC versions 6 to 12 on an Intel Core i9-9880H CPU, and then ran it using 4 or 8 threads, which always correctly prints:
55000000
When adding seq_cst to the atomic directive, the result is exactly the same. I would have expected the code without seq_cst to (occasionally) produce smaller results due to race conditions / outdated memory view. Is this hardware dependent? Is the code guaranteed to be free of race conditions even without seq_cst, and if so, why? Would the answer be different when using a compiler that was still based on OpenMP 3.1, as that apparently worked somewhat differently?
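For reference, the seq_cst variant reads (OpenMP 4.0 and later syntax):

#pragma omp atomic seq_cst
a += 5.5;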

Related

Why std::for_each is faster than __gnu_parallel::for_each

I'm trying to understand why std::for_each, which runs on a single thread, is ~3 times faster than __gnu_parallel::for_each in the example below:
Time =0.478101 milliseconds
vs
Time =0.166421 milliseconds
Here is the code I'm using to benchmark:
#include <iostream>
#include <chrono>
#include <vector>
#include <parallel/algorithm>

// The struct I'm using for timing
struct TimerAvrg
{
    std::vector<double> times;
    size_t curr = 0, n;
    std::chrono::high_resolution_clock::time_point begin, end;

    TimerAvrg(int _n = 30)
    {
        n = _n;
        times.reserve(n);
    }

    inline void start()
    {
        begin = std::chrono::high_resolution_clock::now();
    }

    inline void stop()
    {
        end = std::chrono::high_resolution_clock::now();
        double duration = double(std::chrono::duration_cast<std::chrono::microseconds>(end - begin).count()) * 1e-6;
        if (times.size() < n)
            times.push_back(duration);
        else {
            times[curr] = duration;
            curr++;
            if (curr >= times.size()) curr = 0;
        }
    }

    double getAvrg()
    {
        double sum = 0;
        for (auto t : times)
            sum += t;
        return sum / double(times.size());
    }
};
int main(int argc, char** argv)
{
    float sum = 0;
    for (int alpha = 0; alpha < 5000; alpha++)
    {
        TimerAvrg Fps;
        Fps.start();
        std::vector<float> v(1000000);
        std::for_each(v.begin(), v.end(), [](auto v){ v = 0; });
        Fps.stop();
        sum = sum + Fps.getAvrg() * 1000;
    }
    std::cout << "\rTime =" << sum / 5000 << " milliseconds" << std::endl;
    return 0;
}
This is my configuration:
gcc version 7.3.0 (Ubuntu 7.3.0-21ubuntu1~16.04)
Intel® Core™ i7-7600U CPU @ 2.80GHz × 4
I used htop to check whether the program runs on a single thread or multiple threads.
g++ -std=c++17 -fomit-frame-pointer -Ofast -march=native -ffast-math -mmmx -msse -msse2 -msse3 -DNDEBUG -Wall -fopenmp benchmark.cpp -o benchmark
The same code doesn't compile with gcc 8.1.0. I get this error message:
/usr/include/c++/8/tr1/cmath:1163:20: error: ‘__gnu_cxx::conf_hypergf’ has not been declared
using __gnu_cxx::conf_hypergf;
I already checked a couple of posts, but they're either very old or not about the same issue.
My questions are:
Why is it slower in parallel?
Am I using the wrong functions?
cppreference says the Parallelism TS is not supported by GCC (marked in red in the table), and yet my code is running in parallel!?
Your function [](auto v){ v=0;} is extremely simple.
The call may be replaced with a single call to memset, or with SIMD instructions for single-threaded parallelism. With the knowledge that it overwrites the same state the vector initially had, the entire loop could be optimised away. It may be easier for the optimiser to do this with std::for_each than with a parallel implementation.
Furthermore, assuming the parallel loop uses threads, one must remember that thread creation and the final synchronisation (in this case there is no need for synchronisation during processing) have overhead, which may be significant relative to your trivial operation.
Threaded parallelism is often only worth it for computationally expensive tasks. v = 0 is one of the least computationally expensive operations there are.
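Note also that the lambda takes its parameter by value, so as written it never modifies the vector at all. A sketch of the by-reference version, and of roughly what the optimiser is allowed to turn it into (not actual GCC output):

#include <algorithm>
#include <cstring>
#include <vector>

// What the loop body should look like to actually touch the vector:
void zero_loop(std::vector<float>& v)
{
    std::for_each(v.begin(), v.end(), [](auto& x){ x = 0; });
}

// ...and roughly what the optimiser may collapse it into:
void zero_memset(std::vector<float>& v)
{
    std::memset(v.data(), 0, v.size() * sizeof(float));
}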
Your benchmark is faulty; I'm even surprised it takes any time to run at all.
You wrote:
std::for_each(v.begin(), v.end(),[](auto v){ v=0;});
As v is a by-value parameter of the lambda's operator() that is never read, I would expect the compiler to remove the assignment entirely.
That leaves a loop with an empty body, which can be removed as well, as it has no observable effect.
Similarly, the vector itself can be removed, as it has no readers.
So, having no side effects, all of this can be removed. If you use a parallel algorithm, chances are there is some kind of synchronization, which makes optimizing this much harder, as there might be side effects in another thread; proving there are none is more complex, not to mention the side effects of the thread management itself.
To solve this, a lot of benchmark frameworks have tricks in macros to force the compiler to assume side effects. Use them in the lambda so the compiler doesn't remove the work.
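A sketch of such an escape hatch (Chandler Carruth's well-known trick; GCC/Clang-specific, and benchmark libraries wrap something similar in a macro):

#include <algorithm>
#include <vector>

// The empty asm claims to read p and clobber memory, so the compiler
// must assume the pointed-to buffer is observed and keep the writes.
inline void escape(void* p) {
    asm volatile("" : : "g"(p) : "memory");
}

void benchmark_body(std::vector<float>& v)
{
    std::for_each(v.begin(), v.end(), [](auto& x) { x = 0; });
    escape(v.data());  // the writes above can no longer be removed
}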

This code prints the value of x around 5000 but not 10000, why is that?

This code creates 2 threads and a for loop that iterates 10000 times, but the value of x at the end comes out near 5000 instead of 10000. Why is that happening?
#include <unistd.h>
#include <stdio.h>
#include <sys/time.h>
#include "omp.h"

using namespace std;

int x = 0;

int main(){
    omp_set_num_threads(2);
    #pragma omp parallel for
    for(int i = 0; i < 10000; i++){
        x += 1;
    }
    printf("x is: %d\n", x);
}
x is not an atomic type and is read and written in different threads. (Thinking that int is an atomic type is a common misconception.)
The behaviour of your program is therefore undefined.
Using std::atomic<int> x; is the fix.
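A minimal sketch of that fix (but note the caveat in the last answer below about mixing C++11 atomics with OpenMP):

#include <atomic>
#include <cstdio>
#include <omp.h>

std::atomic<int> x{0};

int main(){
    omp_set_num_threads(2);
    #pragma omp parallel for
    for(int i = 0; i < 10000; i++){
        x.fetch_add(1, std::memory_order_relaxed);  // atomic read-modify-write
    }
    std::printf("x is: %d\n", x.load());
}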
The reason is that when multiple threads access the same variable, race conditions can occur.
The operation x += 1 can be understood as x = x + 1: you first read the value of x and then write x + 1 back to x. When two threads run and operate on the same value of x, the following happens: thread A reads the value of x, which is 0. Thread B reads the value of x, which is still 0. Then thread A writes 0 + 1 to x. And then thread B also writes 0 + 1 to x. Now you have missed one increment, and x is just 1 instead of 2. A fix for this problem is to use an atomic_int.
Modifying one (shared) value by multiple threads is a race condition and leads to wrong results. If multiple threads work with one value, all of them must only read the value.
The idiomatic solution is to use an OpenMP reduction, as follows:
#pragma omp parallel for reduction(+:x)
for(int i = 0; i < 10000; i++){
    x += 1;
}
Internally, each thread has its own x, and the per-thread copies are added together after the loop.
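Roughly, the reduction behaves like this hand-written equivalent (a sketch of the mechanism, not what the runtime literally generates):

#include <stdio.h>
#include <omp.h>

int main(){
    int x = 0;
    #pragma omp parallel num_threads(2)
    {
        int x_private = 0;               // each thread's own copy of x
        #pragma omp for
        for(int i = 0; i < 10000; i++){
            x_private += 1;              // no sharing, so no race
        }
        #pragma omp atomic               // one combine per thread, not per iteration
        x += x_private;
    }
    printf("x is: %d\n", x);
}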
Using atomics is an alternative, but it will perform significantly worse: atomic operations are more costly in themselves and are also very bad for caches.
If you use atomics, you should use OpenMP atomics which are applied to the operation, not the variable. I.e.
#pragma omp parallel for
for (int i = 0; i < 10000; i++){
    #pragma omp atomic
    x += 1;
}
You should not, as other answers suggest, use C++11 atomics. Using them is explicitly unspecified behavior in OpenMP. See this question for details.

Is OpenMP vectorization guaranteed?

Does the OpenMP standard guarantee #pragma omp simd to work, i.e. should the compilation fail if the compiler can't vectorize the code?
#include <cstdint>

void foo(uint32_t r[8], uint16_t* ptr)
{
    const uint32_t C = 1000;
    #pragma omp simd
    for (int j = 0; j < 8; ++j)
        if (r[j] < C)
            r[j] = *(ptr++);
}
gcc and clang fail to vectorize this but do not complain at all (unless you use -fopt-info-vec-optimized-missed and the like).
No, it is not guaranteed. Relevant portions of the OpenMP 4.5 standard that I could find (emphasis mine):
(1.3) When any thread encounters a simd construct, the iterations of the loop associated with the construct may be executed concurrently using the SIMD lanes that are available to the thread.
(2.8.1) The simd construct can be applied to a loop to indicate that the loop can be transformed into a SIMD loop (that is, multiple iterations of the loop can be executed concurrently using SIMD instructions).
(Appendix C) The number of iterations that are executed concurrently at any given time is implementation defined.
(1.2.7) implementation defined: Behavior that must be documented by the implementation, and is allowed to vary among different compliant implementations. An implementation is allowed to define this behavior as unspecified.

Why does code mutating a shared variable across threads apparently NOT suffer from a race condition?

I'm using Cygwin GCC and run this code:
#include <iostream>
#include <thread>
#include <vector>

using namespace std;

unsigned u = 0;

void foo()
{
    u++;
}

int main()
{
    vector<thread> threads;
    for (int i = 0; i < 1000; i++) {
        threads.push_back(thread(foo));
    }
    for (auto& t : threads) t.join();
    cout << u << endl;
    return 0;
}
Compiled with the line: g++ -Wall -fexceptions -g -std=c++14 -c main.cpp -o main.o.
It prints 1000, which is correct. However, I expected a smaller number due to threads overwriting a previously incremented value. Why does this code not suffer from a race?
My test machine has 4 cores, and I put no restrictions on the program that I know of.
The problem persists when replacing the content of the shared foo with something more complex, e.g.
if (u % 3 == 0) {
    u += 4;
} else {
    u -= 1;
}
foo() is so short that each thread probably finishes before the next one even gets spawned. If you add a sleep for a random time in foo() before the u++, you may start seeing what you expect.
It is important to understand that a race condition does not guarantee the code will run incorrectly, merely that it could do anything, as it is undefined behavior, including running as expected.
On x86 and AMD64 machines in particular, race conditions in some cases rarely cause issues, as many of the instructions are atomic and the coherency guarantees are very high. These guarantees are somewhat reduced on multiprocessor systems, where the lock prefix is needed for many instructions to be atomic.
If increment is an atomic operation on your machine, this will likely run correctly, even though according to the language standard it is undefined behavior.
Specifically, I expect the code here is compiled to a fetch-and-add instruction (ADD or XADD in x86 assembly), which is atomic on single-processor systems; on multiprocessor systems, however, this is not guaranteed to be atomic, and a lock prefix would be required to make it so. If you are running on a multiprocessor system, there will be a window in which threads can interfere and produce incorrect results.
Specifically I compiled your code to assembly using https://godbolt.org/ and foo() compiles to:
foo():
    add DWORD PTR u[rip], 1
    ret
This means it is solely performing an add instruction, which on a single processor is atomic (though, as mentioned above, this is not so on a multiprocessor system).
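For contrast, a sketch of the std::atomic version, which makes GCC emit the lock-prefixed form of the same instruction (exact codegen may vary):

#include <atomic>

std::atomic<unsigned> u{0};

void foo()
{
    // typically lowered to "lock add DWORD PTR u[rip], 1"; the lock
    // prefix makes the read-modify-write atomic across cores
    u.fetch_add(1, std::memory_order_relaxed);
}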
I think it is not so much a question of whether you put a sleep before or after the u++. It is rather that the operation u++ translates to code that, compared to the overhead of spawning the threads that call foo, is performed very quickly, so it is unlikely to be intercepted. However, if you "prolong" the operation u++, then the race condition becomes much more likely:
void foo()
{
    unsigned i = u;
    for (int s = 0; s < 10000; s++);  // artificial delay between the read and the write
    u = i + 1;
}
result: 694
BTW: I also tried
if (u % 2) {
    u += 2;
} else {
    u -= 1;
}
and most of the time it gave me 1997, but sometimes 1995.
It does suffer from a race condition. Put usleep(1000); before u++; in foo and I see different output (< 1000) each time.
The likely answer to why the race condition didn't manifest for you, though it does exist, is that foo() is so fast, compared to the time it takes to start a thread, that each thread finishes before the next can even start. But...
Even with your original version, the result varies by system: I tried it your way on a (quad-core) Macbook, and in ten runs, I got 1000 three times, 999 six times, and 998 once. So the race is somewhat rare, but clearly present.
You compiled with '-g', which has a way of making bugs disappear. I recompiled your code, still unchanged but without the '-g', and the race became much more pronounced: I got 1000 once, 999 three times, 998 twice, 997 twice, 996 once, and 992 once.
Re. the suggestion of adding a sleep -- that helps, but (a) a fixed sleep time leaves the threads still skewed by start time (subject to timer resolution), and (b) a random sleep spreads them out when what we want is to pull them closer together. Instead, I'd code them to wait for a start signal, so I can create them all before letting them get to work. With this version (with or without '-g'), I get results all over place, as low as 974, and no higher than 998:
#include <iostream>
#include <thread>
#include <vector>

using namespace std;

unsigned u = 0;
bool start = false;   // strictly speaking, this flag should be std::atomic<bool> as well

void foo()
{
    while (!start) {
        std::this_thread::yield();
    }
    u++;
}

int main()
{
    vector<thread> threads;
    for (int i = 0; i < 1000; i++) {
        threads.push_back(thread(foo));
    }
    start = true;
    for (auto& t : threads) t.join();
    cout << u << endl;
    return 0;
}

Fetch-and-add using OpenMP atomic operations

I’m using OpenMP and need to use the fetch-and-add operation. However, OpenMP doesn’t provide an appropriate directive/call. I’d like to preserve maximum portability, hence I don’t want to rely on compiler intrinsics.
Rather, I’m searching for a way to harness OpenMP’s atomic operations to implement this but I’ve hit a dead end. Can this even be done? N.B., the following code almost does what I want:
#pragma omp atomic
x += a;
Almost – but not quite, since I really need the old value of x. fetch_and_add should be defined to produce the same result as the following (only non-locking):
template <typename T>
T fetch_and_add(volatile T& value, T increment) {
    T old;
    #pragma omp critical
    {
        old = value;
        value += increment;
    }
    return old;
}
(An equivalent question could be asked for compare-and-swap but one can be implemented in terms of the other, if I’m not mistaken.)
As of OpenMP 3.1 there is support for capturing atomic updates: you can capture either the old value or the new value. Since we have to bring the value in from memory to increment it anyway, it only makes sense that we should be able to access it from, say, a CPU register and put it into a thread-private variable.
There's a nice workaround if you're using gcc (or g++): look up the atomic builtins:
http://gcc.gnu.org/onlinedocs/gcc-4.1.2/gcc/Atomic-Builtins.html
I think Intel's C/C++ compiler also has support for this, but I haven't tried it.
For now (until OpenMP 3.1 is implemented), I've used inline wrapper functions in C++ where you can choose which version to use at compile time:
template <class T>
inline T my_fetch_add(T *ptr, T val) {
#ifdef GCC_EXTENSION
    return __sync_fetch_and_add(ptr, val);
#endif
#ifdef OPENMP_3_1
    T t;
    #pragma omp atomic capture
    { t = *ptr; *ptr += val; }
    return t;
#endif
}
Update: I just tried Intel's C++ compiler; it currently has support for OpenMP 3.1 (atomic capture is implemented). Intel offers free use of its compilers on Linux for non-commercial purposes:
http://software.intel.com/en-us/articles/non-commercial-software-download/
GCC 4.7 will support OpenMP 3.1 when it is eventually released... hopefully soon :)
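For completeness, with any OpenMP 3.1+ compiler the capture clause alone suffices; a minimal self-contained sketch:

#include <cstdio>
#include <omp.h>

int counter = 0;

int main(){
    #pragma omp parallel num_threads(4)
    {
        int old;
        #pragma omp atomic capture
        { old = counter; counter += 1; }   // old receives the pre-increment value
        std::printf("thread %d got ticket %d\n", omp_get_thread_num(), old);
    }
    return 0;
}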
If you want to get the old value of x and a is not changed, use (x - a) as the old value:

int fetch_and_add(int *x, int a) {
    #pragma omp atomic
    *x += a;
    return (*x - a);
}
UPDATE: this was not really an answer, because x can be modified by another thread right after the atomic update.
So it seems to be impossible to make a universal fetch-and-add using OpenMP pragmas. By universal I mean an operation which can easily be used from any place in OpenMP code.
You can use the omp_*_lock functions to simulate atomics:

typedef struct { omp_lock_t lock; int value; } atomic_simulated_t;

int fetch_and_add(atomic_simulated_t *x, int a)
{
    int ret;
    omp_set_lock(&x->lock);   // the lock must have been set up once with omp_init_lock
    ret = x->value;           // capture the old value (the "fetch")
    x->value += a;            // then add
    omp_unset_lock(&x->lock);
    return ret;
}
This is ugly and slow (doing 2 atomic ops instead of 1). But if you want your code to be very portable, it will not be the fastest in all cases.
You say "as the following (only non-locking)". But what is the difference between "non-locking" operations (using the CPU's LOCK prefix, LL/SC, etc.) and locking operations (which are themselves implemented with several atomic instructions: a busy loop for a short wait on the unlock, and OS sleeping for long waits)?