Instruction reordering with lock - C++

Will the compiler reorder instructions which are guarded with a mutex? I am using a boolean variable to decide whether one thread has updated some struct. If the compiler reorders the instructions, it might happen that the boolean variable is set before all the fields of the struct are updated.
struct A {
    int x;
    int y;
    // Many other variables; it is a big struct
};

std::mutex m_;
bool updated;
A* first_thread_copy_;

// first_thread_copy_ has already been allocated
m_.lock();
first_thread_copy_->x = 1;
first_thread_copy_->y = 2;
// Other variables of the struct are updated here
updated = true;
m_.unlock();
And in the other thread I just check if the struct has been updated and swap the pointers.
while (!stop_running) {
    if (updated) {
        m_.lock();
        updated = false;
        A* tmp = second_thread_copy_;
        second_thread_copy_ = first_thread_copy_;
        first_thread_copy_ = tmp;
        m_.unlock();
    }
    ....
}
My goal is to keep the second thread as fast as possible. If it sees that the update has not happened, it continues and uses the old values for the remaining work.
One solution to avoid reordering would be to use memory barriers, but I'm trying to avoid them inside the mutex-protected block.

You can safely assume that instructions are not reordered between the lock/unlock and the instructions inside the lock. updated = true will not happen after the unlock or before the lock. Both are barriers, and prevent reordering.
You cannot assume that the updates inside the lock happen without reordering. It is possible that the update to updated takes place before the updates to x or y. If all your accesses are also under the lock, that should not be a problem.
With that in mind, please note that it is not only the compiler that might reorder instructions. The CPU also might execute instructions out of order.
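If the whole point is to check the flag on the fast path without taking the lock (as the loop in the question does), the usual fix is to make the flag itself a std::atomic<bool>; the release store then publishes the field writes to any thread whose acquire load sees the flag set. A minimal sketch along those lines, reusing the names from the question (the function names are illustrative, not a drop-in patch for your real code):
std::mutex m_;
std::atomic<bool> updated{false};
A* first_thread_copy_;
A* second_thread_copy_;

// Writer: publish the fields first, then set the flag with release
// semantics, so a reader that observes updated == true also observes
// the field writes that precede the store.
void publish() {
    std::lock_guard<std::mutex> lock(m_);
    first_thread_copy_->x = 1;
    first_thread_copy_->y = 2;
    updated.store(true, std::memory_order_release);
}

// Reader: the acquire load outside the lock is no longer a data race.
void poll_and_swap() {
    if (updated.load(std::memory_order_acquire)) {
        std::lock_guard<std::mutex> lock(m_);
        updated.store(false, std::memory_order_relaxed);
        A* tmp = second_thread_copy_;
        second_thread_copy_ = first_thread_copy_;
        first_thread_copy_ = tmp;
    }
}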

That's what locks guarantee, so you don't have to use updated. On a side note, you should lock the data rather than the code: every time you access the struct, you should lock it first.
struct A {
    int x;
    int y;
};

A a;
std::mutex m_; // Share this lock amongst threads

m_.lock();
a.x = 1;
a.y = 2;
m_.unlock();
On the second thread you can do:
while (!stop)
{
    if (m_.try_lock()) {
        A* tmp = second_thread_copy_;
        second_thread_copy_ = first_thread_copy_;
        first_thread_copy_ = tmp;
        m_.unlock();
    }
}
EDIT: since you are overwriting the whole struct, putting the mutex inside it doesn't make sense.

Related

Updating two atomic variables under a condition in C++

I want to update two atomic variables under an if condition and the if condition uses one of the atomic variable. I am not sure if both these atomic variables will be updated together or not.
I have the multithreaded code below. In if(local > a1), a1 is an atomic variable, so will reading it in the if condition be atomic across threads? In other words, if thread t1 is at the if condition, will thread t2 wait for a1 to be updated by thread t1? Is it possible that a2 is updated by one thread and a1 by another?
// constructing atomics
#include <iostream> // std::cout
#include <atomic>   // std::atomic
#include <thread>   // std::thread
#include <vector>   // std::vector

std::atomic<int> a1{0};
std::atomic<int> a2{0};

void count1m(int id) {
    double local = id;
    double local2 = id * 3;
    if (local > a1) { // a1 is an atomic variable, so is reading it in the if condition atomic across threads or not?
        a1 = local;
        a2 = local2;
    }
}

int main()
{
    std::vector<std::thread> threads;
    std::cout << "spawning 20 threads that count to 1 million...\n";
    for (int i = 20; i >= 0; --i) {
        threads.push_back(std::thread(count1m, i));
    }
    for (auto& th : threads) th.join();
    std::cout << "a1 = " << a1 << std::endl;
}
I am not sure if both these atomic variables will be updated together or not.
Not.
Atomic means indivisible, in that writes to an atomic can't be read half-done, in an intermediate or incomplete state.
However, updates to one atomic aren't somehow batched with updates to another atomic. How could the compiler tell which updates were supposed to be batched like this?
If you have two atomic variables, you have two independent objects neither of which can individually be observed to have a part-written state. You can still read them both and see a state where another thread has updated one but not the other, even if the stores are adjacent in the code.
Possibilities are:
Just use a mutex.
You ruled this out in a comment, but I'm going to mention it for completeness and because it's by far the easiest way.
Pack both objects into a single atomic.
Note that a 128-bit object (large enough for two binary64 doubles) may have to use a mutex or similar synchronization primitive internally, if your platform doesn't have native 128-bit atomics. You can check with the is_lock_free() member function (or, in C++17, std::atomic<DoublePair>::is_always_lock_free) to find out (for a suitable struct DoublePair containing a pair of doubles).
Whether a non-lock-free atomic is acceptable under your mutex prohibition I cannot guess.
Concoct an elaborate lock-free synchronization protocol, such as:
storing the index into a circular array of DoublePair objects and atomically updating that (there are various schemes for this with multiple producers, but single producer is definitely simpler - and don't forget A-B-A protection)
using a raw futex, or a semaphore, or some other technically-not-a-mutex synchronization primitive that already exists
using atomics to write a spinlock (again not technically a mutex, but again I can't guess whether it's actually suitable for you; a minimal sketch follows below)
The main issue is that you've said you're not allowed to use a mutex, but haven't said why. Does the code have to be lock-free? Wait-free? Does someone just really hate std::mutex but will accept any other synchronization primitive?
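For the spinlock route in the last bullet above, a minimal sketch built on std::atomic_flag (technically not a mutex; whether spinning under contention is acceptable is exactly the open question):
class SpinLock {
    std::atomic_flag flag = ATOMIC_FLAG_INIT;
public:
    void lock() {
        // Busy-wait until the flag was previously clear;
        // C++20 adds flag.wait()/notify_one() for politer backoff.
        while (flag.test_and_set(std::memory_order_acquire)) {
        }
    }
    void unlock() {
        flag.clear(std::memory_order_release);
    }
};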
There are basically two ways to do this and they are different.
The first way is to create an atomic struct that is updated all at once. Note that with this approach there is still a race condition: the value of aip.a1 may change between the comparison and the store of the new pair.
struct IntPair {
    int a1;
    int a2;
};

std::atomic<IntPair> aip{IntPair{0, 0}};

void count1m(int id) {
    double local = id;
    double local2 = id * 3;
    if (local > aip.load().a1) {
        aip = IntPair{int(local), int(local2)};
    }
}
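If that check-then-store window matters, it can be closed with a compare-exchange loop on the packed atomic. A sketch, assuming the IntPair and aip from above (note that compare_exchange compares the struct bytewise, and that a pair of ints is lock-free on most 64-bit platforms, which is_always_lock_free can confirm in C++17):
void count1m(int id) {
    IntPair expected = aip.load();
    IntPair desired{id, id * 3};
    // Retry while the condition still holds for the value we last saw
    // but another thread changed aip under us; compare_exchange_weak
    // reloads expected on failure, so the condition is re-evaluated
    // against a fresh value each iteration.
    while (desired.a1 > expected.a1 &&
           !aip.compare_exchange_weak(expected, desired)) {
    }
}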
The second approach is to use a mutex to synchronize the entire section, like below. This guarantees that no race condition occurs and that the comparison and both stores happen atomically with respect to other threads. A std::lock_guard is used for exception safety rather than calling m.lock() and m.unlock() manually.
IntPair ip{0, 0};
std::mutex m;

void count1m(int id) {
    double local = id;
    double local2 = id * 3;
    std::lock_guard<std::mutex> g(m);
    if (local > ip.a1) {
        ip = IntPair{int(local), int(local2)};
    }
}

Using a mutex to block execution from outside the critical section

I'm not sure I got the terminology right, but here goes - I have this function that is used by multiple threads to write data (pseudocode in the comments illustrates what I want):
// these are initialized in the constructor
int* data;
std::atomic<size_t> size;

void write(int value) {
    // wait here while "read_lock"
    // set "write_lock" to "write_lock" + 1
    auto slot = size.fetch_add(1, std::memory_order_acquire);
    data[slot] = value;
    // set "write_lock" to "write_lock" - 1
}
The order of the writes is not important; all I need is for each write to go to a unique slot.
Every once in a while, though, I need one thread to read the data using this function:
int* read() {
    // set "read_lock" to true
    // wait here while "write_lock"
    int* ret = data;
    data = new int[capacity];
    size = 0;
    // set "read_lock" to false
    return ret;
}
So it basically swaps out the buffer and returns the old one (I've removed the capacity logic to keep the snippets shorter).
In theory this should lead to two operating scenarios:
1 - just a bunch of threads writing into the container
2 - when some thread executes the read function, all new writers have to wait; the reader waits until all in-flight writes are finished, then does the read logic, and scenario 1 can continue.
The question part is that I don't know what kind of barrier to use for the locks:
A spinlock would be wasteful, since there are many containers like this and they all need CPU cycles.
I don't know how to apply std::mutex, since I only want the write function to be in a critical section when the read function is triggered. Wrapping the whole write function in a mutex would cause unnecessary slowdown in operating scenario 1.
So what would be the optimal solution here?
If you have C++14 capability then you can use a std::shared_timed_mutex (or std::shared_mutex in C++17) to separate out readers and writers. In this scenario it seems you need to give your writer threads shared access (allowing other writer threads at the same time) and your reader threads unique access (kicking all other threads out).
So something like this may be what you need:
class MyClass
{
public:
    using mutex_type  = std::shared_timed_mutex;
    using shared_lock = std::shared_lock<mutex_type>;
    using unique_lock = std::unique_lock<mutex_type>;

private:
    mutable mutex_type mtx;

public:
    // All updater threads can operate at the same time
    auto lock_for_updates() const
    {
        return shared_lock(mtx);
    }

    // Reader threads need to kick all the updater threads out
    auto lock_for_reading() const
    {
        return unique_lock(mtx);
    }
};

// many threads can call this
void do_writing_work(std::shared_ptr<MyClass> sptr)
{
    auto lock = sptr->lock_for_updates();
    // update the data here
}

// access the data from one thread only
void do_reading_work(std::shared_ptr<MyClass> sptr)
{
    auto lock = sptr->lock_for_reading();
    // read the data here
}
The shared_locks allow other threads to gain a shared_lock at the same time but prevent a unique_lock from gaining simultaneous access. When a reader thread tries to gain a unique_lock, all shared_locks will be vacated before the unique_lock gets exclusive control.
You can also do this with regular mutexes and condition variables rather than a shared mutex. Supposedly shared_mutex has higher overhead, so I'm not sure which will be faster. With Gallik's solution you'd presumably be paying to lock the shared mutex on every write call; I got the impression from your post that write gets called far more often than read, so maybe this is undesirable.
int* data; // initialized somewhere
std::atomic<size_t> size = 0;
std::atomic<bool> reading = false;
std::atomic<int> num_writers = 0;
std::mutex entering;
std::mutex leaving;
std::condition_variable cv;

void write(int x) {
    ++num_writers;
    if (reading) {
        --num_writers;
        if (num_writers == 0)
        {
            std::lock_guard l(leaving);
            cv.notify_one();
        }
        // blocks until the reader releases the entering mutex
        { std::lock_guard l(entering); }
        ++num_writers;
    }
    auto slot = size.fetch_add(1, std::memory_order_acquire);
    data[slot] = x;
    --num_writers;
    if (reading && num_writers == 0)
    {
        std::lock_guard l(leaving);
        cv.notify_one();
    }
}

int* read() {
    int* other_data = new int[capacity];
    {
        std::unique_lock enter_lock(entering);
        reading = true;
        std::unique_lock leave_lock(leaving);
        cv.wait(leave_lock, [] () { return num_writers == 0; });
        std::swap(data, other_data);
        size = 0;
        reading = false;
    }
    return other_data;
}
It's a bit complicated and took me some time to work out, but I think this should serve the purpose pretty well.
In the common case where only writing is happening, reading is always false. So you do the usual work and pay for two additional atomic increments and two untaken branches. The common path therefore does not need to lock any mutexes, unlike the solution involving a shared mutex, which is supposedly expensive: http://permalink.gmane.org/gmane.comp.lib.boost.devel/211180.
Now, suppose read is called. The expensive, slow heap allocation happens first, while writing continues uninterrupted. Next, the entering lock is acquired, which has no immediate effect. Now reading is set to true. Immediately, any new calls to write take the first branch and eventually hit the entering lock, which they are unable to acquire (as it's already taken), and those threads are put to sleep.
Meanwhile, the read thread is waiting on the condition that the number of writers is 0. If we're lucky, this could go through right away. If, however, there are threads in write at either of the two points between incrementing and decrementing num_writers, it will not. Each time a write thread decrements num_writers, it checks whether it has reduced that number to zero, and when it has, it signals the condition variable. Because num_writers is atomic, which prevents various reordering shenanigans, it is guaranteed that the last thread will see num_writers == 0; the reader could also be notified more than once, but that is fine and cannot result in bad behavior.
Once the condition variable has been signalled, all writers are either trapped in the first branch or done modifying the array, so the read thread can safely swap the data, unlock everything, and return what it needs to.
As mentioned before, in typical operation there are no locks, just increments and untaken branches. Even when a read does occur, the read thread will have one lock and one condition-variable wait, whereas a typical write thread will have about one lock/unlock of a mutex and that's all (one, or a small number of, write threads will also perform a condition-variable notification).

How do fences actually work in C++

I've been struggling to understand how fences actually force code to synchronize.
For instance, say I have this code:
bool x = false;
std::atomic<bool> y;
std::atomic<int> z;

void write_x_then_y()
{
    x = true;
    std::atomic_thread_fence(std::memory_order_release);
    y.store(true, std::memory_order_relaxed);
}

void read_y_then_x()
{
    while (!y.load(std::memory_order_relaxed));
    std::atomic_thread_fence(std::memory_order_acquire);
    if (x)
        ++z;
}

int main()
{
    x = false;
    y = false;
    z = 0;
    std::thread a(write_x_then_y);
    std::thread b(read_y_then_x);
    a.join();
    b.join();
    assert(z.load() != 0);
}
Because the release fence is followed by an atomic store operation, and the acquire fence is preceded by an atomic load, everything synchronizes as it's supposed to and the assert won't fire.
But if y were not an atomic variable, like this:
bool x;
bool y;
std::atomic<int> z;

void write_x_then_y()
{
    x = true;
    std::atomic_thread_fence(std::memory_order_release);
    y = true;
}

void read_y_then_x()
{
    while (!y);
    std::atomic_thread_fence(std::memory_order_acquire);
    if (x)
        ++z;
}
then, I hear, there might be a data race. But why is that?
Why must release fences be followed by an atomic store, and acquire fences be preceded by an atomic load in order for the code to synchronize properly?
I would also appreciate it if anyone could provide an execution scenario in which a data race causes the assert to fire
No real data race is a problem for your second snippet. This snippet would be OK ... if the compiler literally generated machine code from what is written.
But the compiler is free to generate any machine code that is equivalent to the original in the case of a single-threaded program.
E.g., the compiler can notice that the y variable doesn't change within the while(!y) loop, so it can load the variable into a register once and use only that register in subsequent iterations. So, if initially y = false, you get an infinite loop.
Another possible optimization is simply removing the while(!y) loop, as it doesn't access volatile or atomic variables and doesn't perform synchronization actions. (The C++ Standard says that any correct program should eventually do one of those things, so the compiler may rely on that fact when optimizing.)
And so on.
More generally, the C++ Standard specifies that concurrent access to any non-atomic variable leads to undefined behavior, which is like "the warranty is void". That is why you should use an atomic y variable.
On the other hand, the variable x doesn't need to be atomic, as accesses to it are not concurrent thanks to the memory fences.
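Concretely, the usual repair of the second snippet is to make only the flag atomic; the release/acquire ordering can then live on the store and load themselves, making the standalone fences unnecessary. A sketch of that fix:
bool x = false;             // plain variable: only ever accessed when ordered by y
std::atomic<bool> y{false}; // the flag shared between threads must be atomic
std::atomic<int> z{0};

void write_x_then_y()
{
    x = true;                                  // A
    y.store(true, std::memory_order_release);  // B: release store publishes A
}

void read_y_then_x()
{
    while (!y.load(std::memory_order_acquire)) // C: acquire load
        ;
    if (x)   // guaranteed to see A, because B synchronizes-with C
        ++z;
}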

Spinning thread barrier using Atomic Builtins

I'm trying to implement a spinning thread barrier using atomics, specifically __sync_fetch_and_add. https://gcc.gnu.org/onlinedocs/gcc-4.4.5/gcc/Atomic-Builtins.html
I basically want an alternative to the pthread barrier. I'm using Ubuntu on a system that can run about a hundred threads in parallel.
int bar = 0;         // global variable
int P = MAX_THREADS; // number of threads

__sync_fetch_and_add(&bar, 1); // each thread comes and adds atomically
while (bar < P) {}             // threads spin until bar increments to P
bar = 0;                       // a thread sets bar = 0 to be used in the next spinning barrier
This does not work for obvious reasons (a thread may set bar = 0, and another thread then gets stuck in an infinite while loop, etc.). I saw an implementation here: Writing a (spinning) thread barrier using c++11 atomics. However, it seems too complex, and I think its performance might be worse than that of a pthread barrier.
This implementation is also expected to produce more traffic within the memory hierarchy due to bar's cache line being ping-ponged among threads.
Any ideas on how to use these atomic instructions to make a simple barrier? A communication-optimal scheme would be helpful as well.
Instead of spinning on the counter of threads, it is better to spin on the number of barriers passed, which is incremented only by the last thread to reach the barrier. That way you also reduce memory-cache pressure, as the spin variable is now updated by only a single thread.
int P = MAX_THREADS;
int bar = 0;             // Number of threads that have reached the barrier.
volatile int passed = 0; // Number of barriers passed by all threads.

void barrier_wait()
{
    int passed_old = passed; // Must be read before incrementing bar!
    if (__sync_fetch_and_add(&bar, 1) == (P - 1))
    {
        // The last thread to reach the barrier.
        bar = 0;
        // bar must be reset strictly before the barrier counter is updated.
        __sync_synchronize();
        passed++; // Mark the barrier as passed.
    }
    else
    {
        // Not the last thread; wait for the others.
        while (passed == passed_old) {}
        // Synchronize with the threads that have passed the barrier.
        __sync_synchronize();
    }
}
Note that you need to use the volatile modifier for the spin variable.
C++ code could be somewhat faster than the C version, as it can use acquire/release memory barriers instead of the full barrier, which is the only one available from the __sync functions:
int P = MAX_THREADS;
std::atomic<int> bar{0};    // Number of threads that have reached the barrier.
std::atomic<int> passed{0}; // Number of barriers passed by all threads.

void barrier_wait()
{
    int passed_old = passed.load(std::memory_order_relaxed);
    if (bar.fetch_add(1) == (P - 1))
    {
        // The last thread to reach the barrier.
        bar = 0;
        // Synchronize and store in one operation.
        passed.store(passed_old + 1, std::memory_order_release);
    }
    else
    {
        // Not the last thread; wait for the others.
        while (passed.load(std::memory_order_relaxed) == passed_old) {}
        // Synchronize with the threads that have passed the barrier.
        std::atomic_thread_fence(std::memory_order_acquire);
    }
}
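For illustration, a hypothetical driver showing the usual way such a barrier is used between phases of work (it assumes the barrier_wait and P defined above; the phase count is arbitrary):
#include <thread>
#include <vector>

void worker(int id)
{
    for (int phase = 0; phase < 10; ++phase) {
        // ... this thread's share of the work for the current phase ...
        barrier_wait(); // no thread starts phase + 1 until all finish phase
    }
}

int main()
{
    std::vector<std::thread> threads;
    for (int i = 0; i < P; ++i)
        threads.emplace_back(worker, i);
    for (auto& t : threads)
        t.join();
}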

Do I need to use volatile keyword if I declare a variable between mutexes and return it?

Let's say I have the following function.
std::mutex mutex;

int getNumber()
{
    mutex.lock();
    int size = someVector.size();
    mutex.unlock();
    return size;
}
Is this a place to use the volatile keyword when declaring size? Will return value optimization or something else break this code if I don't use volatile? The size of someVector can be changed from any of the numerous threads the program has, and it is assumed that only one thread (other than the modifiers) calls getNumber().
No. But beware that the size may not reflect the actual size AFTER the mutex is released.
Edit: If you need to do some work that relies on the size being correct, you will need to wrap that whole task in a mutex.
You haven't mentioned what the type of the mutex variable is, but assuming it is a std::mutex (or something similar meant to guarantee mutual exclusion), the compiler is prevented from performing a lot of optimizations. So you don't need to worry about return value optimization or some other optimization moving the size() query outside of the mutex block.
However, as soon as the mutex lock is released, another waiting thread is free to access the vector and possibly mutate it, thus changing the size. Now, the number returned by your function is outdated. As Mats Petersson mentions in his answer, if this is an issue, then the mutex lock needs to be acquired by the caller of getNumber(), and held until the caller is done using the result. This will ensure that the vector's size does not change during the operation.
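For instance, a sketch of that caller-held-lock pattern (assuming the caller can see the same mutex and someVector; note the caller must not also call getNumber() while holding the lock, since std::mutex is not recursive):
{
    std::lock_guard<std::mutex> lock(mutex);
    std::size_t n = someVector.size();
    // ... use n here; the size cannot change while the lock is held ...
}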
Explicitly calling mutex::lock followed by mutex::unlock quickly becomes unfeasible for more complicated functions involving exceptions, multiple return statements, etc. A much easier alternative is to use std::lock_guard to acquire the mutex lock:
int getNumber()
{
    std::lock_guard<std::mutex> l(mutex); // lock is acquired
    int size = someVector.size();
    return size;
} // lock is released automatically when l goes out of scope
volatile is a keyword that you use to tell the compiler to actually perform each read and write of the variable, rather than optimizing any of them away. Here is an example:
int example_function() {
    int a;
    volatile int b;
    a = 1; // this is ignored because nothing reads it before it is assigned again
    a = 2; // same here
    a = 3; // this is the last one, so a write takes place
    b = 1; // b gets written here, because b is volatile
    b = 2; // and again
    b = 3; // and again
    return a + b;
}
What is the real use of this? I've seen it in delay functions (keeping the CPU busy for a bit by making it count up to a number) and in systems where several threads might look at the same variable. It can sometimes help a bit with multithreaded code, but it isn't really a threading mechanism, and it is certainly not a silver bullet.
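As an example of the delay-function use mentioned above, a sketch of the busy-wait idiom (the loop bound is arbitrary): without volatile the compiler may delete the whole loop, since it has no observable effect.
void crude_delay()
{
    for (volatile int i = 0; i < 1000000; ++i) {
        // empty body: each increment is a real read and write because i is volatile
    }
}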