Difference between atomic<int> and int - c++

I want to know if there is any difference between std::atomic<int> and int if we are just doing load and store. I am not concerned about the memory ordering. For example, consider the code below:
#include <chrono>
#include <iostream>
#include <thread>
using namespace std;

int x{1};

void f(int myid) {
    while (1) {
        while (x != myid) {}
        //cout << "thread : " << myid << "\n";
        //this_thread::sleep_for(std::chrono::duration(3s));
        x = (x % 3) + 1;
    }
}

int main() {
    thread t[3];
    for (int i = 0; i < 3; i++) {
        t[i] = thread(f, i + 1);
    }
    for (int i = 0; i < 3; i++) {
        t[i].join();
    }
}
Now the output (if you uncomment the cout) will be
thread : 1
thread : 2
thread : 3
...
I want to know if there is any benefit in changing the int x to atomic<int> x?

Consider your code:
void f(int myid) {
    while (1) {
        while (x != myid) {}
        //cout << "thread : " << myid << "\n";
        //this_thread::sleep_for(std::chrono::duration(3s));
        x = (x % 3) + 1;
    }
}
If the program didn't have undefined behaviour, you could expect that when f was called, x would be read from memory at least once. But having done that, the compiler has no reason to think that any changes to x will happen outside the function, or that any changes to x made within the function need to be visible outside the function until after the function returns. So it's entitled to read x into a CPU register and keep comparing that same register value to myid, which means the inner loop will either pass through instantly or spin forever.
Then, compilers are allowed to assume they'll make progress (see "Forward progress" in the C++ Standard), so they could conclude that because the thread would never make progress while x != myid, x must already be equal to myid, and remove the inner while loop. Similarly, the outer loop - simplified to while (1) x = (x % 3) + 1; where x might be a register - doesn't make progress either and could also be eliminated. Or, the compiler could leave the loop but remove the seemingly pointless operations on x.
Putting your code into the online Godbolt compiler explorer and compiling with GCC trunk at -O3 optimisation, the code for f(int) is:
f(int):
.L2:
jmp .L2
If you make x atomic, then the compiler can't simply cache it in a register while accessing or modifying it and assume there will be a good time to write it back before the function returns. It actually has to modify the variable in memory and propagate that change so other threads can read the updated value.
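For comparison, here is a minimal sketch of the question's program with x changed to std::atomic<int>; every load and store of x now really goes through memory, so each thread observes the updates made by the others:

#include <atomic>
#include <thread>

std::atomic<int> x{1};

void f(int myid) {
    while (true) {
        // load() really re-reads x from memory on every iteration,
        // so the busy-wait sees updates made by the other threads.
        while (x.load() != myid) {}
        // Only the thread whose turn it is writes x, so a plain
        // load-compute-store (rather than a read-modify-write) is enough here.
        x.store((x.load() % 3) + 1);
    }
}

int main() {
    std::thread t[3];
    for (int i = 0; i < 3; i++)
        t[i] = std::thread(f, i + 1);
    for (int i = 0; i < 3; i++)
        t[i].join();
}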

I want to know if there is any benefit in changing the int x to atomic<int> x?
You could say that. Turning int into atomic<int> in your example will turn your program from incorrect to correct (*).
Accessing the same int from multiple threads at the same time (without any form of access synchronization) is Undefined Behavior.
*) Well, the program might still be incorrect, but at least it avoids this particular problem.

Related

Question regarding race conditions while multithreading

So I'm reading through a book, and in the chapter that covers multithreading and concurrency it gives me a question that does not really make sense to me.
I'm supposed to create 3 functions with a parameter x that simply calculate x * x: one using a mutex, one using atomic types, and one using neither. And create 3 global variables holding the values.
The first two functions will prevent race conditions but the third might not.
After that I create N threads and then loop through and tell each thread to calculate its x function (3 separate loops, one for each function, so I'm creating N threads 3 times).
Now the book tells me that using functions 1 & 2 I should always get the correct answer, but using function 3 I won't always get the right answer. However, I am always getting the right answer for all of them. I assume this is because I am just calculating x * x, which is all it does.
As an example, when N=3, the correct value is 0 * 0 + 1 * 1 + 2 * 2 = 5.
this is the atomic function:
void squareAtomic(atomic<int> x)
{
    accumAtomic += x * x;
}
And this is how I call the function:
thread threadsAtomic[N];
for (int i = 0; i < N; i++) //i will be the current thread that represents x
{
    threadsAtomic[i] = thread(squareAtomic, i);
}
for (int i = 0; i < N; i++)
{
    threadsAtomic[i].join();
}
This is the function that should sometimes create race conditions:
void squareNormal(int x)
{
    accumNormal += x * x;
}
Here's how I call that:
thread threadsNormal[N];
for (int i = 0; i < N; i++) //i will be the current thread that represents x
{
    threadsNormal[i] = thread(squareNormal, i);
}
for (int i = 0; i < N; i++)
{
    threadsNormal[i].join();
}
This is all my own code so I might not be doing this question correctly, and in that case I apologize.
One problem with race conditions (and with undefined behavior in general) is that their presence doesn't guarantee that your program will behave incorrectly. Rather, undefined behavior only voids the guarantee that your program will behave according to the rules of the C++ language spec. That can make undefined behavior very difficult to detect via empirical testing. (Every multithreading programmer's worst nightmare is the bug that was never seen once during the program's intensive three-month testing period, and only appears as a mysterious crash during the big on-stage demo in front of a live audience.)
In this case your racy program's race condition comes in the form of multiple threads reading and writing accumNormal simultaneously; in particular, you might get an incorrect result if thread A reads the value of accumNormal, and then thread B writes a new value to accumNormal, and then thread A writes a new value to accumNormal, overwriting thread B's value.
If you want to be able to demonstrate to yourself that race conditions really can cause incorrect results, you'd want to write a program where multiple threads hammer on the same shared variable for a long time. For example, you might have half the threads increment the variable 1 million times, while the other half decrement the variable 1 million times, and then check afterwards (i.e. after joining all the threads) to see if the final value is zero (which is what you would expect it to be), and if not, run the test again, and let that test run all night if necessary. (and even that might not be enough to detect incorrect behavior, e.g. if you are running on hardware where increments and decrements are implemented in such a way that they "just happen to work" for this use case)
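As a rough sketch of that kind of stress test (my own illustration with a made-up iteration count, not code from the question): one thread increments and another decrements both a plain int and a std::atomic<int>; the atomic counter is guaranteed to end up at zero, while the plain one very often does not:

#include <atomic>
#include <iostream>
#include <thread>

int plainCounter = 0;               // racy: unsynchronized concurrent access is undefined behaviour
std::atomic<int> atomicCounter{0};  // safe: each ++/-- is an atomic read-modify-write

int main() {
    constexpr int kIterations = 1000000;  // made-up constant for this sketch

    std::thread inc([] {
        for (int i = 0; i < kIterations; i++) { plainCounter++; atomicCounter++; }
    });
    std::thread dec([] {
        for (int i = 0; i < kIterations; i++) { plainCounter--; atomicCounter--; }
    });
    inc.join();
    dec.join();

    // atomicCounter is guaranteed to be 0; plainCounter usually is not.
    std::cout << "plain: " << plainCounter << ", atomic: " << atomicCounter << "\n";
}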

Why doesn't my C++ compiler optimize these memory writes away?

I created this program. It does nothing of interest but use processing power.
Looking at the output with objdump -d, I can see the three rand calls and corresponding mov instructions near the end, even when compiling with -O3.
Why doesn't the compiler realize that memory isn't going to be used and just replace the bottom half with while(1){}? I'm using gcc, but I'm mostly interested in what is required by the standard.
/*
 * Create a program that does nothing except slow down the computer.
 */
#include <cstdlib>
#include <unistd.h>

int getRand(int max) {
    return rand() % max;
}

int main() {
    for (int thread = 0; thread < 5; thread++) {
        fork();
    }
    int len = 1000;
    int *garbage = (int*)malloc(sizeof(int) * len);
    for (int x = 0; x < len; x++) {
        garbage[x] = x;
    }
    while (true) {
        garbage[getRand(len)] = garbage[getRand(len)] - garbage[getRand(len)];
    }
}
Because GCC isn't smart enough to perform this optimization on dynamically allocated memory. However, if you change garbage to be a local array instead, GCC compiles the loop to this:
.L4:
call rand
call rand
call rand
jmp .L4
This just calls rand repeatedly (which is needed because the call has side effects), but optimizes out the reads and writes.
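For reference, a variant along these lines (a sketch of what the answer describes, with the fork() loop omitted and the question's getRand helper kept) replaces the malloc'd buffer with a local array:

#include <cstdlib>

int getRand(int max) {
    return rand() % max;
}

int main() {
    const int len = 1000;
    int garbage[1000];  // local array instead of dynamically allocated memory
    for (int x = 0; x < len; x++) {
        garbage[x] = x;
    }
    while (true) {
        // The rand() calls survive (they mutate the PRNG state), but since the
        // array is never observed, the loads and stores themselves can be dropped.
        garbage[getRand(len)] = garbage[getRand(len)] - garbage[getRand(len)];
    }
}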
If GCC was even smarter, it could also optimize out the rand calls, because their side effects only affect later rand calls, and in this case there aren't any. However, this sort of optimization would probably be a waste of compiler writers' time.
It can't, in general, tell that rand() doesn't have observable side-effects here, and it isn't required to remove those calls.
It could remove the writes, but it may be that the use of arrays is enough to suppress that.
The standard neither requires nor prohibits what it is doing. As long as the program has the correct observable behaviour any optimisation is purely a quality of implementation matter.
This code causes undefined behaviour because it has an infinite loop with no observable behaviour. Therefore any result is permissible.
In C++14 the text is 1.10/27:
The implementation may assume that any thread will eventually do one of the following:
terminate,
make a call to a library I/O function,
access or modify a volatile object, or
perform a synchronization operation or an atomic operation.
[Note: This is intended to allow compiler transformations such as removal of empty loops, even when termination cannot be proven. —end note ]
I wouldn't say that rand() counts as an I/O function.
Related question
Leave it a chance to crash through an array overflow! The compiler won't speculate on the range of values getRand can return.

Constant folding/propagation optimization with memory barriers

I have been reading for a while in order to better understand what's going on with multithreaded programming on a modern (multicore) CPU. However, while I was reading this, I noticed the code below in the "Explicit Compiler Barriers" section, which does not use volatile for the IsPublished global.
#define COMPILER_BARRIER() asm volatile("" ::: "memory")

int Value;
int IsPublished = 0;

void sendValue(int x)
{
    Value = x;
    COMPILER_BARRIER(); // prevent reordering of stores
    IsPublished = 1;
}

int tryRecvValue()
{
    if (IsPublished)
    {
        COMPILER_BARRIER(); // prevent reordering of loads
        return Value;
    }
    return -1; // or some other value to mean not yet received
}
The question is: is it safe to omit volatile for IsPublished here? Many people mention that the volatile keyword has nothing much to do with multithreaded programming, and I agree with them. However, during compiler optimizations "constant folding/propagation" can be applied, and as the wiki page shows it is possible to change if (IsPublished) into if (false) if the compiler does not know much about who can change the value of IsPublished. Did I miss or misunderstand something here?
Memory barriers can prevent compiler reordering and out-of-order execution by the CPU, but as I said in the previous paragraph: do I still need volatile in order to avoid "constant folding/propagation", which is a dangerous optimization especially when using globals as flags in lock-free code?
If tryRecvValue() is called once, it is safe to omit volatile for IsPublished. The same is true when, between calls to tryRecvValue(), there is a function call for which the compiler cannot prove that it does not change the false value of IsPublished.
// Example 1 (safe)
int v = tryRecvValue();
if (v == -1) exit(1);

// Example 2 (unsafe): tryRecvValue may be inlined and 'IsPublished' may not be re-read between iterations.
int v;
while (true)
{
    v = tryRecvValue();
    if (v != -1) break;
}

// Example 3 (safe)
int v;
while (true)
{
    v = tryRecvValue();
    if (v != -1) break;
    some_extern_call(); // Possibly can change 'IsPublished'
}
Constant propagation can be applied only when the compiler can prove the value of the variable. Because IsPublished is declared as non-constant, its value can be proven only if:
1. The variable is assigned the given value, or a read of the variable is followed by a branch executed only when the variable has that value.
2. The variable is read (again) in the same program thread.
3. Between (1) and (2) the variable is not changed within the given program thread.
Unless you call tryRecvValue() in some sort of .init function, the compiler will never see the initialization of IsPublished in the same thread as its read, so proving a false value of this variable from its initialization is not possible.
Proving a false value of IsPublished from the false (empty) branch in the tryRecvValue function is possible, though; see Example 2 in the code above.
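As a sketch of one way to make Example 2 safe in the same compiler-barrier style as the question (my own illustration; the asm syntax assumes GCC/Clang): putting the barrier inside the polling loop forces the compiler to re-read IsPublished on every iteration, even if tryRecvValue is inlined:

#define COMPILER_BARRIER() asm volatile("" ::: "memory")

int tryRecvValue();  // as defined in the question; reads IsPublished and Value

int recvValueBlocking()
{
    int v;
    while (true)
    {
        v = tryRecvValue();
        if (v != -1) break;
        COMPILER_BARRIER();  // compiler must assume memory changed here,
                             // so IsPublished is re-read on the next iteration
    }
    return v;
}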

Question about optimization in C++

I've read that the C++ standard allows optimization to a point where it can actually interfere with expected functionality. When I say this, I'm talking about return value optimization, where you might actually have some logic in the copy constructor, yet the compiler optimizes the call out.
I find this somewhat bad, as someone who doesn't know about it might spend quite some time fixing a bug resulting from it.
What I want to know is whether there are any other situations where over-optimization from the compiler can change functionality.
For example, something like:
int x = 1;
x = 1;
x = 1;
x = 1;
might be optimized to a single x=1;
Suppose I have:
class A;
A a = b;
a = b;
a = b;
Could this possibly also be optimized? Probably not the best example, but I hope you know what I mean...
Eliding copy operations is the only case where a compiler is allowed to optimize to the point where side effects visibly change. Do not rely on copy constructors being called, the compiler might optimize away those calls.
For everything else, the "as-if" rule applies: The compiler might optimize as it pleases, as long as the visible side effects are the same as if the compiler had not optimized at all.
("Visible side effects" include, for example, stuff written to the console or the file system, but not runtime and CPU fan speed.)
It might be optimized, yes. But you still have some control over the process; for example, consider this code:
int x = 1;
x = 1;
x = 1;
x = 1;
volatile int y = 1;
y = 1;
y = 1;
y = 1;
Provided that neither x nor y is used below this fragment, VS 2010 generates this code:
int x = 1;
x = 1;
x = 1;
x = 1;
volatile int y = 1;
010B1004 xor eax,eax
010B1006 inc eax
010B1007 mov dword ptr [y],eax
y = 1;
010B100A mov dword ptr [y],eax
y = 1;
010B100D mov dword ptr [y],eax
y = 1;
010B1010 mov dword ptr [y],eax
That is, optimization strips all the lines with "x" and leaves all four lines with "y". This is how volatile works, but the point is that you still have control over what the compiler does for you.
Whether it is a class or a primitive type, it all depends on the compiler and how sophisticated its optimization capabilities are.
Another code fragment for study:
class A
{
private:
    int c;
public:
    A(int b)
    {
        *this = b;
    }
    A& operator = (int b)
    {
        c = b;
        return *this;
    }
};

int _tmain(int argc, _TCHAR* argv[])
{
    int b = 0;
    A a = b;
    a = b;
    a = b;
    return 0;
}
Visual Studio 2010 optimization strips all of this code away: in a release build with "full optimization", _tmain does nothing and immediately returns zero.
This will depend on how class A is implemented, whether the compiler can see the implementation, and whether it is smart enough. For example, if operator=() in class A has side effects, then optimizing the calls out would change the program's behavior and is not allowed.
Strictly speaking, optimization does not "remove calls to copy constructors or assignments".
It converts one finite state machine into another finite state machine with the same external behaviour.
Now, if you repeatedly call
a = b; a = b; a = b;
what the compiler does depends on what operator= actually is.
If the compiler finds that a call has no chance of altering the state of the program (where the "state of the program" is everything that outlives a scope and that the scope can access), it will strip the call out.
If this cannot be "demonstrated", the call will stay in place.
Whatever the compiler does, don't worry too much about it: the compiler cannot (by contract) change the external logic of a program or of any part of it.
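As a small sketch of the point above and of the earlier remark about side effects (my own illustration, not code from the answers above): if operator= writes to the console, every call affects the program's observable behaviour, so none of the calls can be stripped:

#include <iostream>

class A
{
    int c = 0;
public:
    A& operator=(int b)
    {
        std::cout << "assigning " << b << "\n";  // observable side effect
        c = b;
        return *this;
    }
};

int main()
{
    A a;
    a = 1;  // each of these calls must still print,
    a = 1;  // so the compiler cannot remove any of them
    a = 1;
}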
I don't know C++ that much, but I am currently reading Compilers: Principles, Techniques, and Tools.
Here is a snippet from its section on code optimization:
The machine-independent code-optimization phase attempts to improve
intermediate code so that better target code will result. Usually
better means faster, but other objectives may be desired, such as
shorter code, or target code that consumes less power. For example, a
straightforward algorithm generates the intermediate code (1.3), using
an instruction for each operator in the tree representation that comes
from the semantic analyzer. A simple intermediate code generation
algorithm followed by code optimization is a reasonable way to
generate good target code. The optimizer can deduce that the
conversion of 60 from integer to floating point can be done once and
for all at compile time, so the inttofloat operation can be eliminated
by replacing the integer 60 by the floating-point number 60.0.
Moreover, t3 is used only once to transmit its value to id1, so the
optimizer can transform (1.3) into the shorter sequence (1.4):
(1.3)
t1 = inttofloat(60)
t2 = id3 * t1
t3 = id2 + t2
id1 = t3

(1.4)
t1 = id3 * 60.0
id1 = id2 + t1
All in all, I mean to say that code optimization happens at a much deeper level, and because the code here is in such a simple state it doesn't affect what your code does.
I had some trouble with const variables and const_cast. The compiler produced incorrect results when the variable was used to calculate something else. The const variable was optimized away; its old value was made into a compile-time constant. Truly "unexpected behavior". Okay, perhaps not ;)
Example:
const int x = 2;
const_cast<int&>(x) = 3;
int y = x * 2;
cout << y << endl;

Memory ordering issues

I'm experimenting with C++0x support and there is a problem that I guess shouldn't be there. Either I don't understand the subject or gcc has a bug.
I have the following code, initially x and y are equal. Thread 1 always increments x first and then increments y. Both are atomic integer values, so there is no problem with the increment at all. Thread 2 is checking whether the x is less than y and displays an error message if so.
This code fails sometimes, but why? The issue here is probably memory reordering, but all atomic operations are sequentially consistent by default and I didn't explicitly relax any of those operations. I'm compiling this code on x86, which as far as I know shouldn't have any ordering issues. Can you please explain what the problem is?
#include <iostream>
#include <atomic>
#include <thread>

std::atomic_int x;
std::atomic_int y;

void f1()
{
    while (true)
    {
        ++x;
        ++y;
    }
}

void f2()
{
    while (true)
    {
        if (x < y)
        {
            std::cout << "error" << std::endl;
        }
    }
}

int main()
{
    x = 0;
    y = 0;
    std::thread t1(f1);
    std::thread t2(f2);
    t1.join();
    t2.join();
}
The result can be viewed here.
There is a problem with the comparison:
x < y
The order of evaluation of subexpressions (in this case, of x and y) is unspecified, so y may be evaluated before x or x may be evaluated before y.
If x is read first, you have a problem:
x = 0; y = 0;
t2 reads x (value = 0);
t1 increments x; x = 1;
t1 increments y; y = 1;
t2 reads y (value = 1);
t2 compares x < y as 0 < 1; test succeeds!
If you explicitly ensure that y is read first, you can avoid the problem:
int yval = y;
int xval = x;
if (xval < yval) { /* ... */ }
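Putting that into the question's f2, a sketch of the fixed checking loop might look like this (reading y before x means a concurrent increment between the two reads can only make the x value look larger, never smaller):

void f2()
{
    while (true)
    {
        int yval = y;     // read y first...
        int xval = x;     // ...then x, so xval >= yval (ignoring wrap-around)
        if (xval < yval)
        {
            std::cout << "error" << std::endl;
        }
    }
}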
The problem could be in your test:
if (x < y)
the thread could evaluate x and not get around to evaluating y until much later.
Every now and then, x will wrap around to 0 just before y wraps around to zero. At this point y will legitimately be greater than x.
First, I agree with Michael Burr and James McNellis. Your test is not fair, and there is a legitimate possibility for it to fail. However, even if you rewrite the test the way James McNellis suggests, the test may still fail.
The first reason for this is that you don't use volatile semantics, hence the compiler may apply optimizations to your code (ones that are supposed to be OK in the single-threaded case).
But even with volatile your code is not guaranteed to work.
I think you don't fully understand the concept of memory reordering. Actually, memory read/write reordering can occur at two levels:
Compiler may exchange the order of the generated read/write instructions.
CPU may execute memory read/write instructions in arbitrary order.
Using volatile prevents (1). However, you've done nothing to prevent (2), memory access reordering by the hardware.
To prevent this you should put special memory fence instructions in the code (these are directed at the CPU, unlike volatile, which affects only the compiler).
On x86/x64 there are many different memory fence instructions. Also, every instruction with lock semantics issues a full memory fence by default.
More information here:
http://en.wikipedia.org/wiki/Memory_barrier