OpenMP atomic on return values - C++

I have some trouble understanding the definition of scalar expression in OpenMP.
In particular, I thought that calling a function and using its return value in an atomic expression is not allowed.
Looking at the Compiler Explorer asm output, it seems to me that it is atomic, though.
Maybe someone can clarify this.
#include <cmath>

int f() { return sin(2); }

int main()
{
    int i = 5;
    #pragma omp atomic
    i += f();
    return i;
}

@paleonix's comment is a good answer, so I'm expanding on it a little here.
The important point is that the atomicity is an issue only for the read/modify/write operation expressed by the += operator.
How you generate the value to be added is irrelevant, and what happens while generating that value is unaffected by the atomic pragma, so it can still contain races (should you want them :-)).
Of course, this is unlike the use of a critical section, where the whole scope of the critical region is protected.
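To illustrate (a minimal sketch of my own, not from the original post): with atomic, the call to f() is evaluated outside the protected operation, whereas critical serializes the whole block:

// My own sketch contrasting the two constructs. With "atomic" only the
// read-modify-write of i is protected; the call to f() is not. With
// "critical" the entire block, including the call, is serialized.
#include <cmath>

int f() { return static_cast<int>(std::sin(2.0)); }

int i = 5;

void atomic_version()
{
    #pragma omp atomic
    i += f(); // f() runs unprotected; only "i += <result>" is atomic
}

void critical_version()
{
    #pragma omp critical
    {
        i += f(); // everything here is serialized, including f()
    }
}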

Related

Achieving the effect of 'volatile' without the added MOV* instructions?

I have a piece of code that must run under all circumstances, as it modifies things outside of its own scope. Let's define that piece of code as:
// Extremely simplified C++ as an example.
#include <iostream>
#include <map>
#include <cstdint>

#if defined(__GNUC__) || defined(__GNUG__) || defined(__clang__)
#include <x86intrin.h>
#elif defined(_MSC_VER)
#include <intrin.h>
#endif

uint64_t some_time_function() {
    unsigned int waste;
    return __rdtscp(&waste);
}

void insert_into_map(std::map<uint64_t, float>& data, uint64_t t1, uint64_t t0, float v) {
    data.emplace((t1 - t0), v);
}

void fn(std::map<uint64_t, float>& map_outside_of_this_scope) {
    const float a = 1;
    const float b = 2;
    float v = 0;
    for (uint32_t i = 0; i < 1000000; i++) {
        uint64_t t0 = some_time_function();
        v = (v + b) - a;
        uint64_t t1 = some_time_function();
        insert_into_map(map_outside_of_this_scope, t1, t0, v);
    }
}

int main(int argc, const char** argv) {
    std::map<uint64_t, float> my_map;
    fn(my_map);
    std::cout << my_map.begin()->first << std::endl;
    return 0;
}
This looks like a prime target for compiler optimizers, and that is what I have observed with my code as well: map_outside_of_this_scope ends up empty. Unfortunately, map_outside_of_this_scope is critical to operation and must contain data; otherwise the application crashes. The only way I have found to fix this is by marking v as volatile; however, that makes the application significantly slower than an equivalent assembly-based function.
Is there a way to achieve the effect of volatile, without the MOV instructions of volatile?
Inasmuch as you assert in comments that you are most interested in a narrow answer to the question ...
Is there a way to achieve the effect of volatile, without the MOV instructions of volatile?
... the only thing we can say is that C and C++ do not specify the involvement of MOV or any other specific assembly instructions anywhere, for any purpose. If you observe such instructions in compiled binaries then those reflect implementation decisions by your compiler's developers. What's more, where you see them, the MOVs are most likely essential to implementing volatile's semantics.
Additionally, neither C nor C++ specifies any alternative feature that duplicates the rather specific semantics of volatile access (why would they?). You might be able to use inline assembly to achieve custom, different effects that serve your purpose, however.
With respect to the more general problem that inspired the above question, all we can really say is that the multiple compilers that perform the unwanted optimization are likely justified in doing so, for reasons that are not clear from the code presented. With that in mind, I recommend that you broaden your problem-solving focus to search for why the compilers think they can perform the optimization when volatile is not involved. To that end, construct an MRE (minimal reproducible example) -- not for us, but because the exercise of MRE construction is a powerful debugging technique in its own right.
volatile is directly relevant to optimizers because reads and writes of volatile variables are observable behavior. That means the reads and writes cannot be removed.
Similarly, optimizers cannot remove writes of variables that are observable by other means - whether you write the variable to std::cout, file or socket. And the burden of proof is on the compiler - the write can only be eliminated if the write is provably dead.
In the example above, for instance, my_map.begin()->first is written to std::cout. That is observable behavior, so even in the absence of volatile that behavior must be kept. But the exact details do matter: an optimizer may spot that only the ->first member is observed in this particular example. Hence v (the ->second value) is not observed and can legally be optimized out.
But if you copy my_map.begin()->second to a volatile float sink, then that write to sink is observable behavior, and the compiler must make sure the right value is written. That pretty much means that your v calculation inside the loop needs to be preserved, even though v itself is not volatile.
The compiler could still perform loop transformations (such as unrolling) that affect how v is read and written, because the individual v updates are not observable; only the value that is eventually written to the volatile float sink counts.
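A minimal sketch of that idea, reusing the question's code (the variable name "sink" is mine, not from the answer):

// Sketch of the "volatile sink" technique described above; the variable
// name "sink" is my own. The write to a volatile object is observable
// behavior, so the computation of v must be preserved.
volatile float sink;

void fn(std::map<uint64_t, float>& map_outside_of_this_scope) {
    const float a = 1;
    const float b = 2;
    float v = 0;
    for (uint32_t i = 0; i < 1000000; i++) {
        uint64_t t0 = some_time_function();
        v = (v + b) - a;
        uint64_t t1 = some_time_function();
        insert_into_map(map_outside_of_this_scope, t1, t0, v);
    }
    sink = v; // observable write: the loop's work on v cannot be elided
}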

Constant folding/propagation optimization with memory barriers

I have been reading for a while in order to better understand what's going on in multithreaded programming with a modern (multicore) CPU. However, while I was reading this, I noticed the code below in the "Explicit Compiler Barriers" section, which does not use volatile for the IsPublished global.
#define COMPILER_BARRIER() asm volatile("" ::: "memory")

int Value;
int IsPublished = 0;

void sendValue(int x)
{
    Value = x;
    COMPILER_BARRIER(); // prevent reordering of stores
    IsPublished = 1;
}

int tryRecvValue()
{
    if (IsPublished)
    {
        COMPILER_BARRIER(); // prevent reordering of loads
        return Value;
    }
    return -1; // or some other value to mean "not yet received"
}
The question is: is it safe to omit volatile for IsPublished here? Many people mention that the volatile keyword has little to do with multithreaded programming, and I agree with them. However, during compiler optimizations "constant folding/propagation" can be applied, and as the wiki page shows, it is possible to change if (IsPublished) into if (false) if the compiler does not know much about who can change the value of IsPublished. Am I missing or misunderstanding something here?
Memory barriers can prevent compiler reordering and out-of-order execution on the CPU, but as I said in the previous paragraph, do I still need volatile to avoid "constant folding/propagation", which is a dangerous optimization, especially when using globals as flags in lock-free code?
If tryRecvValue() is called once, it is safe to omit volatile for IsPublished. The same is true when, between calls to tryRecvValue(), there is a function call for which the compiler cannot prove that it does not change the false value of IsPublished.
// Example 1 (safe)
int v = tryRecvValue();
if (v == -1) exit(1);

// Example 2 (unsafe): tryRecvValue() may be inlined and 'IsPublished'
// may not be re-read between iterations.
int v;
while (true)
{
    v = tryRecvValue();
    if (v != -1) break;
}

// Example 3 (safe)
int v;
while (true)
{
    v = tryRecvValue();
    if (v != -1) break;
    some_extern_call(); // possibly changes 'IsPublished'
}
Constant propagation can be applied only when the compiler can prove the value of the variable. Because IsPublished is declared as non-constant, its value can be proven only if:

1. The variable is assigned the given value, or a read of the variable is followed by a branch executed only when the variable has the given value.
2. The variable is read (again) in the same program thread.
3. Between points 1 and 2 the variable is not changed within the given program thread.

Unless you call tryRecvValue() in some sort of .init function, the compiler will never see the initialization of IsPublished in the same thread as its reading. So proving the false value of this variable from its initialization is not possible.
Proving the false value of IsPublished from the false (empty) branch in tryRecvValue() is possible; see Example 2 in the code above.
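To make the Example 2 hazard concrete, here is a rough illustration (my own, not from the answer) of what the optimizer may produce after inlining tryRecvValue():

// My own illustration of the hoisting hazard in Example 2. On the path
// where IsPublished is 0, no barrier executes and nothing in the loop
// can change the flag, so the compiler may prove it stays 0 forever.
int v;
if (IsPublished)
{
    COMPILER_BARRIER();
    v = Value;
}
else
{
    while (true) { /* IsPublished never re-read: spins forever */ }
}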

Prevent C++11 removal of endless loops

As discussed in this question, C++11 optimizes endless loops away.
However, in embedded devices which have a single purpose, endless loops make sense and are actually quite often used. Even a completely empty while(1); is useful for a watchdog-assisted reset. Terminating but empty loops can also be useful in embedded development.
Is there an elegant way to specifically tell the compiler to not remove empty or endless loops, without disabling optimization altogether?
One of the requirements for a loop to be removed (as mentioned in that question) is that it
does not access or modify volatile objects
So,
void wait_forever(void)
{
    volatile int i = 1;
    while (i) ;
}
should do the trick, although I would certainly verify this by looking at the disassembly of a program produced with your particular toolchain.
A function like this would be a good candidate for GCC's noreturn attribute as well.
void wait_forever(void) __attribute__ ((noreturn));

void wait_forever(void)
{
    volatile int i = 1;
    while (i) ;
}

int main(void)
{
    if (something_bad_happened) // placeholder condition, assumed defined elsewhere
        wait_forever();
}
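As an aside (my addition, not part of the answer above): on GCC and Clang, an empty asm statement with a "memory" clobber inside the loop body is another common way to keep an otherwise-empty loop alive, since the optimizer must assume the statement has side effects:

// My addition: an empty asm statement with a "memory" clobber is opaque
// to the optimizer, so the loop cannot be proven free of side effects
// and is not removed (GCC/Clang syntax).
void wait_forever(void)
{
    while (1)
        __asm__ volatile ("" ::: "memory");
}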

Optimization and multithreading in B.Stroustrup's new book

Please refer to section 41.2.2 Instruction Reordering of "TCPL" 4th edition by B.Stroustrup, which I transcribe below:
To gain performance, compilers, optimizers, and hardware reorder
instructions. Consider:
// thread 1:
int x;
bool x_init;

void init()
{
    x = initialize(); // no use of x_init in initialize()
    x_init = true;
    // ...
}
For this piece of code there is no stated reason to assign to x before
assigning to x_init. The optimizer (or the hardware instruction
scheduler) may decide to speed up the program by executing x_init =
true first. We probably meant for x_init to indicate whether x had
been initialized by initializer() or not. However, we did not say
that, so the hardware, the compiler, and the optimizer do not know
that.
Add another thread to the program:
// thread 2:
extern int x;
extern bool x_init;

void f2()
{
    int y;
    while (!x_init) // if necessary, wait for initialization to complete
        this_thread::sleep_for(milliseconds{10});
    y = x;
    // ...
}
Now we have a problem: thread 2 may never wait and thus will assign an
uninitialized x to y. Even if thread 1 did not set x_init and x in
"the wrong order," we still may have a problem. In thread 2, there
are no assignments to x_init, so an optimizer may decide to lift the
evaluation of !x_init out of the loop, so that thread 2 either never
sleeps or sleeps forever.
Does the Standard allow the reordering in thread 1? (A supporting quote from the Standard would be welcome.) Why would that speed up the program?
Both answers in this discussion on SO seem to indicate that no such optimization occurs when there are global variables in the code, as x_init above.
What does the author mean by "to lift the evaluation of !x_init out of the loop"? Is it something like this?

if (!x_init) while (true) this_thread::sleep_for(milliseconds{10});
y = x;
This is not so much an issue of the C++ compiler/standard as of modern CPUs. Have a look here. The compiler isn't going to emit memory-barrier instructions between the assignments of x and x_init unless you tell it to.
For what it is worth, prior to C++11 the standard had no notion of multithreading in its abstract machine model. Things are a bit nicer these days.
The C++11 standard does not "allow" or "prevent" reordering as such. It specifies ways to force a "barrier" that, in turn, prevents the compiler from moving instructions across it. A compiler, in this example, might reorder the assignments because that might be more efficient on a CPU with multiple execution units (ALUs, hyperthreading, etc.), even with a single core. Typically, if your CPU has two ALUs that can work in parallel, there is no reason the compiler would not try to feed them with as much work as it can.
I'm not speaking of the out-of-order execution of CPU instructions that is done internally in Intel CPUs (for example), but of compile-time ordering to ensure all the computing resources are kept busy.
I think it depends on the compilation flags. Typically, unless you tell it otherwise, the compiler must assume that another compilation unit (say B.cpp, which is not visible at compile time) can declare "extern bool x_init;" and change it at any time. Then the reordering optimization would break the expected behavior (B could define the initialize() function). This example is trivial and unlikely to break. The linked SO answers are not related to this "optimization"; it is simply that, in their case, the compiler cannot assume that the global array is not modified externally, and as such cannot make the optimization. That is not like your example.
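For completeness (my addition, not part of the answers above): since C++11, the intended way to express this publish pattern is std::atomic, which both forbids the problematic store reordering in thread 1 and forces thread 2 to re-read the flag. A minimal sketch:

// My addition, not from the answers above: the C++11 way to write
// Stroustrup's example so that neither the store reordering in thread 1
// nor the hoisting in thread 2 is allowed to break it.
#include <atomic>
#include <chrono>
#include <thread>

int initialize(); // assumed defined elsewhere, as in the book's example

int x;
std::atomic<bool> x_init{false};

void init() // thread 1
{
    x = initialize();
    x_init.store(true, std::memory_order_release); // publishes x
}

void f2() // thread 2
{
    while (!x_init.load(std::memory_order_acquire)) // re-read every iteration
        std::this_thread::sleep_for(std::chrono::milliseconds{10});
    int y = x; // guaranteed to see the initialized value
    (void)y;
}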
Yes, it's a very common optimization trick (loop unswitching). Instead of:

// test is a bool
for (int i = 0; i < 345; i++) {
    if (test) do_something();
}

the compiler might do:

if (test) for (int i = 0; i < 345; i++) { do_something(); }

and save 344 useless tests.

Fetch-and-add using OpenMP atomic operations

I’m using OpenMP and need to use the fetch-and-add operation. However, OpenMP doesn’t provide an appropriate directive/call. I’d like to preserve maximum portability, hence I don’t want to rely on compiler intrinsics.
Rather, I’m searching for a way to harness OpenMP’s atomic operations to implement this but I’ve hit a dead end. Can this even be done? N.B., the following code almost does what I want:
#pragma omp atomic
x += a;
Almost – but not quite, since I really need the old value of x. fetch_and_add should be defined to produce the same result as the following (only non-locking):
template <typename T>
T fetch_and_add(volatile T& value, T increment) {
    T old;
    #pragma omp critical
    {
        old = value;
        value += increment;
    }
    return old;
}
(An equivalent question could be asked for compare-and-swap but one can be implemented in terms of the other, if I’m not mistaken.)
As of OpenMP 3.1 there is support for capturing atomic updates; you can capture either the old value or the new value. Since we have to bring the value in from memory to increment it anyway, it only makes sense that we should be able to access it from, say, a CPU register and put it into a thread-private variable.
There's a nice work-around if you're using gcc (or g++); look up the atomic builtins:
http://gcc.gnu.org/onlinedocs/gcc-4.1.2/gcc/Atomic-Builtins.html
I think Intel's C/C++ compiler also has support for this, but I haven't tried it.
For now (until OpenMP 3.1 is implemented), I've used inline wrapper functions in C++ that let you choose which version to use at compile time:
template <class T>
inline T my_fetch_add(T *ptr, T val) {
#ifdef GCC_EXTENSION
    return __sync_fetch_and_add(ptr, val);
#endif
#ifdef OPENMP_3_1
    T t;
    #pragma omp atomic capture
    { t = *ptr; *ptr += val; }
    return t;
#endif
}
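For example (a usage sketch of my own, assuming one of the two macros above is defined), the wrapper can hand out unique indices to threads:

// My own usage sketch for my_fetch_add(): each thread atomically claims
// a distinct index from a shared counter.
void example()
{
    int counter = 0;
    #pragma omp parallel num_threads(4)
    {
        int my_index = my_fetch_add(&counter, 1); // returns the old value
        // each thread gets a distinct my_index in [0, 4)
    }
}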
Update: I just tried Intel's C++ compiler; it currently has support for OpenMP 3.1 (atomic capture is implemented). Intel offers free use of its compilers on Linux for non-commercial purposes:
http://software.intel.com/en-us/articles/non-commercial-software-download/
GCC 4.7 will support OpenMP 3.1 when it is eventually released... hopefully soon :)
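As a present-day aside (my addition, not part of the original answer): if C++11 is available, std::atomic provides a portable fetch-and-add directly, independent of the OpenMP version:

// My addition: C++11 std::atomic gives fetch-and-add without relying on
// OpenMP 3.1 or compiler builtins.
#include <atomic>

std::atomic<int> x{0};

int my_fetch_and_add(int increment)
{
    return x.fetch_add(increment); // atomically adds and returns the old value
}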
If you want to get the old value of x and a is not changed, use (x - a) as the old value:

int fetch_and_add(int *x, int a) {
    #pragma omp atomic
    *x += a;
    return (*x - a);
}

UPDATE: this was not really an answer, because *x can be modified by another thread after the atomic update, in which case (*x - a) no longer yields the old value.
So it seems to be impossible to make a universal fetch-and-add using OpenMP pragmas. By universal I mean an operation which can easily be used from any place in OpenMP code.
You can use the omp_*_lock functions to simulate an atomic fetch-and-add:

typedef struct { omp_lock_t lock; int value; } atomic_simulated_t;

int fetch_and_add(atomic_simulated_t *x, int a)
{
    int ret;
    omp_set_lock(&x->lock);  // the lock must have been set up with omp_init_lock()
    ret = x->value;          // capture the old value
    x->value += a;
    omp_unset_lock(&x->lock);
    return ret;
}
This is ugly and slow (doing two atomic operations instead of one). But if you want your code to be very portable, it will not be the fastest in all cases.
You say "as the following (only non-locking)". But what is the difference between "non-locking" operations (using the CPU's LOCK prefix, or LL/SC, etc.) and locking operations (which are themselves implemented with several atomic instructions, a busy loop for short waits on the unlock, and OS sleeping for long waits)?