Memory ordering issues - c++

I'm experimenting with C++0x support and there is a problem, that I guess shouldn't be there. Either I don't understand the subject or gcc has a bug.
I have the following code, initially x and y are equal. Thread 1 always increments x first and then increments y. Both are atomic integer values, so there is no problem with the increment at all. Thread 2 is checking whether the x is less than y and displays an error message if so.
This code fails sometimes, but why? The issue here is probably memory reordering, but all atomic operations are sequentially consistent by default and I didn't explicitly relax of those any operations. I'm compiling this code on x86, which as far as I know shouldn't have any ordering issues. Can you please explain what the problem is?
#include <iostream>
#include <atomic>
#include <thread>
std::atomic_int x;
std::atomic_int y;
void f1()
{
while (true)
{
++x;
++y;
}
}
void f2()
{
while (true)
{
if (x < y)
{
std::cout << "error" << std::endl;
}
}
}
int main()
{
x = 0;
y = 0;
std::thread t1(f1);
std::thread t2(f2);
t1.join();
t2.join();
}
The result can be viewed here.

There is a problem with the comparison:
x < y
The order of evaluation of subexpressions (in this case, of x and y) is unspecified, so y may be evaluated before x or x may be evaluated before y.
If x is read first, you have a problem:
x = 0; y = 0;
t2 reads x (value = 0);
t1 increments x; x = 1;
t1 increments y; y = 1;
t2 reads y (value = 1);
t2 compares x < y as 0 < 1; test succeeds!
If you explicitly ensure that y is read first, you can avoid the problem:
int yval = y;
int xval = x;
if (xval < yval) { /* ... */ }

The problem could be in your test:
if (x < y)
the thread could evaluate x and not get around to evaluating y until much later.

Every now and then, x will wrap around to 0 just before y wraps around to zero. At this point y will legitimately be greater than x.

First, I agree with "Michael Burr" and "James McNellis". Your test is not fair, and there's a legitime possibility to fail. However even if you rewrite the test the way "James McNellis" suggests the test may fail.
First reason for this is that you don't use volatile semantics, hence the compiler may do optimizations to your code (that are supposed to be ok in a single-threaded case).
But even with volatile your code is not guaranteed to work.
I think you don't fully understand the concept of memory reordering. Actually memory read/write reorder can occur at two levels:
Compiler may exchange the order of the generated read/write instructions.
CPU may execute memory read/write instructions in arbitrary order.
Using volatile prevents the (1). However you've done nothing to prevent (2) - memory access reordering by the hardware.
To prevent this you should put special memory fence instructions in the code (that are designated for CPU, unlike volatile which is for compiler only).
In x86/x64 there're many different memory fence instructions. Also every instruction with lock semantics by default issues full memory fence.
More information here:
http://en.wikipedia.org/wiki/Memory_barrier

Related

Difference between atomic<int> and int

I want to know if there is any different between std::atomic<int> and int if we are just doing load and store. I am not concerned about the memory ordering. For example consider the below code
int x{1};
void f(int myid) {
while(1){
while(x!= myid){}
//cout<<"thread : "<< myid<<"\n";
//this_thread::sleep_for(std::chrono::duration(3s));
x = (x % 3) + 1;
}
}
int main(){
thread x[3];
for(int i=0;i<3;i++){
x[i] = thread(f,i+1);
}
for(int i=0;i<3;i++){
x[i].join();
}
}
Now the output (if you uncomment the cout) will be
Thread :1
Thread :2
Thread :3
...
I want to know if there is any benefit in changing the int x to atomic<int> x?
Consider your code:
void f(int myid) {
while(1){
while(x!= myid){}
//cout<<"thread : "<< myid<<"\n";
//this_thread::sleep_for(std::chrono::duration(3s));
x = (x % 3) + 1;
}
}
If the program didn't have undefined behaviour, then you could expect that when f was called, x would be read from the stack at least once, but having done that, the compiler has no reason to think that any changes to x will happen outside the function, or that any changes to x made within the function need to be visible outside the function until after the function returns, so it's entitled to read x into a CPU register, keep looking at the same register value and comparing it to myid - which means it'll either pass through instantly or be stuck forever.
Then, compilers are allowed to assume they'll make progress (see Forward Progress in the C++ Standard), so they could conclude that because they'd never progress if x != myid, x can't possibly be equal to myid, and remove the inner while loop. Similarly, an outer loop simplified to while (1) x = (x % 3) + 1; where x might be a register - doesn't make progress and could also be eliminated. Or, the compiler could leave the loop but remove the seemingly pointless operations on x.
Putting your code into the online Godbolt compiler explorer and compiling with GCC trunk at -O3 optimisation, f(int) code is:
f(int):
.L2:
jmp .L2
If you make x atomic, then the compiler can't simply use a register while accessing/modifying it, and assume that there will be a good time to update it before the function returns. It will actually have to modify the variable in memory and propagate that change so other threads can read the updated value.
I want to know if there is any benefit in changing the int x to atomic x?
You could say that. Turning int into atomic<int> in your example will turn your program from incorrect to correct (*).
Accessing the same int from multiple threads at the same time (without any form of access synchronization) is Undefined Behavior.
*) Well, the program might still be incorrect, but at least it avoids this particular problem.

c++ multithread atomic load/store

When I read the 5th chapter of the book CplusplusConcurrencyInAction, the example code as follows, multithread load/store some atomic values concurrently with the momery_order_relaxed.Three array save the value of x、y and z respectively at each round.
#include <thread>
#include <atomic>
#include <iostream>
​
std::atomic<int> x(0),y(0),z(0); // 1
std::atomic<bool> go(false); // 2
​
unsigned const loop_count=10;
​
struct read_values
{
int x,y,z;
};
​
read_values values1[loop_count];
read_values values2[loop_count];
read_values values3[loop_count];
read_values values4[loop_count];
read_values values5[loop_count];
​
void increment(std::atomic<int>* var_to_inc,read_values* values)
{
while(!go)
std::this_thread::yield();
for(unsigned i=0;i<loop_count;++i)
{
values[i].x=x.load(std::memory_order_relaxed);
values[i].y=y.load(std::memory_order_relaxed);
values[i].z=z.load(std::memory_order_relaxed);
var_to_inc->store(i+1,std::memory_order_relaxed); // 4
std::this_thread::yield();
}
}
​
void read_vals(read_values* values)
{
while(!go)
std::this_thread::yield();
for(unsigned i=0;i<loop_count;++i)
{
values[i].x=x.load(std::memory_order_relaxed);
values[i].y=y.load(std::memory_order_relaxed);
values[i].z=z.load(std::memory_order_relaxed);
std::this_thread::yield();
}
}
​
void print(read_values* v)
{
for(unsigned i=0;i<loop_count;++i)
{
if(i)
std::cout<<",";
std::cout<<"("<<v[i].x<<","<<v[i].y<<","<<v[i].z<<")";
}
std::cout<<std::endl;
}
​
int main()
{
std::thread t1(increment,&x,values1);
std::thread t2(increment,&y,values2);
std::thread t3(increment,&z,values3);
std::thread t4(read_vals,values4);
std::thread t5(read_vals,values5);
​
go=true;
​
t5.join();
t4.join();
t3.join();
t2.join();
t1.join();
​
print(values1);
print(values2);
print(values3);
print(values4);
print(values5);
}
one of the valid output mentioned in this chapter:
(0,0,0),(1,0,0),(2,0,0),(3,0,0),(4,0,0),(5,7,0),(6,7,8),(7,9,8),(8,9,8),(9,9,10)
(0,0,0),(0,1,0),(0,2,0),(1,3,5),(8,4,5),(8,5,5),(8,6,6),(8,7,9),(10,8,9),(10,9,10)
(0,0,0),(0,0,1),(0,0,2),(0,0,3),(0,0,4),(0,0,5),(0,0,6),(0,0,7),(0,0,8),(0,0,9)
(1,3,0),(2,3,0),(2,4,1),(3,6,4),(3,9,5),(5,10,6),(5,10,8),(5,10,10),(9,10,10),(10,10,10)
(0,0,0),(0,0,0),(0,0,0),(6,3,7),(6,5,7),(7,7,7),(7,8,7),(8,8,7),(8,8,9),(8,8,9)
The 3rd output of values1 is (2,0,0),at this point it reads x=2,and y=z=0.It means when y=0,the x is already equals to 2, Why the 3rd output of the values2 it reads x=0 and y=2,which means x is the old value because x、y、z is increasing, so when y=2 that x is at least 2.
And I test the code in my PC,I can't reproduce the result like that.
The reason is that reading via x.load(std::memory_order_relaxed) guarantees only that you never see x decrease within the same thread (in this example code). (It also guarantees that a thread writing to x will read that same value again in the next iteration.)
In general, different threads can read different values from the same variable at the same time. That is, there need not be a consistent "global state" that all threads agree on. The example output is supposed to demonstrate that: The first thread might still see y = 0 when it already wrote x = 4, while the second thread might still see x = 0 when it already writes y = 2. The standard allows this because real hardware may work that way: Consider the case when the threads are on different CPU cores, each with its own private L1 cache.
However, it is not possible that the second thread sees x = 5 and then later sees x = 2 - the atomic object always guarantees that there is a consistent global modification order (that is, all writes to the variable are observed to happen in the same order by all the threads).
But when using std::memory_order_relaxed there are no guarantees about when a thread finally does "see" those writes*, or how the observations of different threads relate to each other. You need stronger memory ordering to get those guarantees.
*In fact, a valid output would be all threads reading only 0 all the time, except the writer threads reading what they wrote the previous iteration to their "own" variable (and 0 for the others). On hardware that never flushed caches unless prompted, this might actually happen, and it would be fully compliant with the C++ standard!
And I test the code in my PC,I can't reproduce the result like that.
The "example output" shown is highly artificial. The C++ standard allows for this output to happen. This means you can write efficient and correct multithreaded code even on hardware with no inbuilt guarantees on cache coherency (see above). But common hardware today (x86 in particular) brings a lot of guarantees that actually make certain behavior impossible to observe (including the output in the question).
Also, note that x, y and z are extremely likely to be adjacent (depends on the compiler), meaning they will likely all land on the same cache line. This will lead to massive performance degradation (look up "false sharing"). But since memory can only be transferred between cores at cache line granularity, this (together with the x86 coherency guarantees) makes it essentially impossible that an x86 CPU (which you most likely performed your tests with) reads outdated values of any of the variables. Allocating these values more than 1-2 cache lines apart will likely lead to more interesting/chaotic results.

C++11 Atomic memory order with non-atomic variables

I am unsure about how the memory ordering guarantees of atomic variables in c++11 affect operations to other memory.
Let's say I have one thread which periodically calls the write function to update a value, and another thread which calls read to get the current value. Is it guaranteed that the effects of d = value; will not be seen before effects of a = version;, and will be seen before the effects of b = version;?
atomic<int> a {0};
atomic<int> b {0};
double d;
void write(int version, double value) {
a = version;
d = value;
b = version;
}
double read() {
int x,y;
double ret;
do {
x = b;
ret = d;
y = a;
} while (x != y);
return ret;
}
The rule is that, given a write thread that executes once, and nothing else that modifies a, b or d,
You can read a and b from a different thread at any time, and
if you read b and see version stored in it, then
You can read d; and
What you read will be value.
Note that whether the second part is true depends on the memory ordering; it is true with the default one (memory_order_seq_cst).
Your object d is written and read by two threads and it's not atomic. This is unsafe, as suggested in the C++ standard on multithreading:
1.10/4 Two expression evaluations conflict if one of them modifies a memory location and the other one accesses or modifies the same memory location.
1.10/21 The execution of a program contains a data race if it contains two conflicting actions in different threads,at least one of
which is not atomic, and neither happens before the other. Any such
data race results in undefined behavior.
Important edit:
In your non-atomic case, you have no guarantees about the ordering between the reading and the writing. You don't even have guarantee that the reader will read a value that was written by the writer (this short article explains the risk for non-atomic variables).
Nevertheless, your reader's loop finishes based on a test of the surrounding atomic variables, for which there are strong guarantees. Assuming that version never repeats between writer different calls, and given the reverse order in which you aquire their value:
the order of the d read compared to the d write can't be unfortunate if the two atomics are equal.
similarly, the read value can't be inconsistent if the two atomics are equal.
This means that in case of an adverse race condition on your non-atomic, thanks to the loop, you'll end-up reading the last value.
Is it guaranteed that the effects of d = value; will not be seen before effects of a = version;, and will be seen before the effects of b = version;?
Yes, it is. This is because sequensial consistency barrier is implied when read or write atomic<> variable.
Instead of storing version tag into two atomic variables before value's modification and after it, you can increment single atomic variable before and after modification:
atomic<int> a = {0};
double d;
void write(double value)
{
a = a + 1; // 'a' become odd
d = value; //or other modification of protected value(s)
a = a + 1; // 'a' become even, but not equal to the one before modification
}
double read(void)
{
int x;
double ret;
do
{
x = a;
ret = value; // or other action with protected value(s)
} while((x & 2) || (x != a));
return ret;
}
This is known as seqlock in the Linux kernel: http://en.wikipedia.org/wiki/Seqlock

Question about optimization in C++

I've read that the C++ standard allows optimization to a point where it can actually hinder with expected functionality. When I say this, I'm talking about return value optimization, where you might actually have some logic in the copy constructor, yet the compiler optimizes the call out.
I find this to be somewhat bad, as in someone who doesn't know this might spend quite some time fixing a bug resulting from this.
What I want to know is whether there are any other situations where over-optimization from the compiler can change functionality.
For example, something like:
int x = 1;
x = 1;
x = 1;
x = 1;
might be optimized to a single x=1;
Suppose I have:
class A;
A a = b;
a = b;
a = b;
Could this possibly also be optimized? Probably not the best example, but I hope you know what I mean...
Eliding copy operations is the only case where a compiler is allowed to optimize to the point where side effects visibly change. Do not rely on copy constructors being called, the compiler might optimize away those calls.
For everything else, the "as-if" rule applies: The compiler might optimize as it pleases, as long as the visible side effects are the same as if the compiler had not optimized at all.
("Visible side effects" include, for example, stuff written to the console or the file system, but not runtime and CPU fan speed.)
It might be optimized, yes. But you still have some control over the process, for example, suppose code:
int x = 1;
x = 1;
x = 1;
x = 1;
volatile int y = 1;
y = 1;
y = 1;
y = 1;
Provided that neither x, nor y are used below this fragment, VS 2010 generates code:
int x = 1;
x = 1;
x = 1;
x = 1;
volatile int y = 1;
010B1004 xor eax,eax
010B1006 inc eax
010B1007 mov dword ptr [y],eax
y = 1;
010B100A mov dword ptr [y],eax
y = 1;
010B100D mov dword ptr [y],eax
y = 1;
010B1010 mov dword ptr [y],eax
That is, optimization strips all lines with "x", and leaves all four lines with "y". This is how volatile works, but the point is that you still have control over what compiler does for you.
Whether it is a class, or primitive type - all depends on compiler, how sophisticated it's optimization caps are.
Another code fragment for study:
class A
{
private:
int c;
public:
A(int b)
{
*this = b;
}
A& operator = (int b)
{
c = b;
return *this;
}
};
int _tmain(int argc, _TCHAR* argv[])
{
int b = 0;
A a = b;
a = b;
a = b;
return 0;
}
Visual Studio 2010 optimization strips all the code to nothing, in release build with "full optimization" _tmain does just nothing and immediately returns zero.
This will depend on how class A is implemented, whether the compiler can see the implementation and whether it is smart enough. For example, if operator=() in class A has some side effects such optimizing out would change the program behavior and is not possible.
Optimization does not (in proper term) "remove calls to copy or assignments".
It convert a finite state machine in another finite state, machine with a same external behaviour.
Now, if you repeadly call
a=b; a=b; a=b;
what the compiler do depends on what operator= actually is.
If the compiler founds that a call have no chances to alter the state of the program (and the "state of the program" is "everything lives longer than a scope that a scope can access") it will strip it off.
If this cannot be "demonstrated" the call will stay in place.
Whatever the compiler will do, don't worry too much about: the compiler cannot (by contract) change the external logic of a program or of part of it.
i dont know c++ that much but am currently reading Compilers-Principles, techniques and tools
here is a snippet from their section on code optimization:
the machine-independent code-optimization phase attempts to improve
intermediate code so that better target code will result. Usually
better means faster, but other objectives may be desired, such as
shorter code, or target code that consumes less power. for example a
straightforward algorithm generates the intermediate code (1.3) using
an instruction for each operator in the tree representation that comes
from semantic analyzer. a simple intermediate code generation
algorithm followed by code optimization is a reasonable way to
generate good target code. the optimizar can duduce that the
conversion of 60 from integer to floating point can be done once and
for all at compile time, so the inttofloat operation can be eliminated
by replacing the integer 6- by the floating point number 60.0.
moreover t3 is used only once to trasmit its value to id1 so the
optimizer can transform 1.3 into the shorter sequence (1.4)
1.3
t1 - intoffloat(60
t2 -- id3 * id1
ts -- id2 + t2
id1 t3
1.4
t1=id3 * 60.0
id1 = id2 + t1
all and all i mean to say that code optimization should come at a much deeper level and because the code is at such a simple state is doesnt effect what your code does
I had some trouble with const variables and const_cast. The compiler produced incorrect results when it was used to calculate something else. The const variable was optimized away, its old value was made into a compile-time constant. Truly "unexpected behavior". Okay, perhaps not ;)
Example:
const int x = 2;
const_cast<int&>(x) = 3;
int y = x * 2;
cout << y << endl;

How to cause an intentional division by zero?

For testing reasons I would like to cause a division by zero in my C++ code. I wrote this code:
int x = 9;
cout << "int x=" << x;
int y = 10/(x-9);
y += 10;
I see "int =9" printed on the screen, but the application doesn't crash. Is it because of some compiler optimizations (I compile with gcc)? What could be the reason?
Make the variables volatile. Reads and writes to volatile variables are considered observable:
volatile x = 1;
volatile y = 0;
volatile z = x / y;
Because y is not being used, it's getting optimized away.
Try adding a cout << y at the end.
Alternatively, you can turn off optimization:
gcc -O0 file.cpp
Division by zero is an undefined behavior. Not crashing is also pretty much a proper subset of the potentially infinite number of possible behaviors in the domain of undefined behavior.
Typically, a divide by zero will throw an exception. If it is unhandled, it will break the program, but it will not crash.