Correct usage of C++ atomics around flag & ISR usage

I have a case where I need to ensure that an ISR cannot interact with a library when it is unloaded in an embedded system (ARM Cortex-M4 based).
The library can be loaded or deloaded at any time, and the interrupt / ISR can also fire at any time. These library load / deload / process-ISR calls are wrapped inside project-specific routines, so I have full control over what is executed at the boundary.
For this, the natural way to protect against the problem seems to be an "is loaded" flag (operated on with acquire / release orderings), so that the process-ISR library routine is only called when I can be sure that the library is currently loaded.
However, the deload wrapper routine presents a problem. The release memory ordering (my initial thought) does not guarantee that the cleared "is loaded" flag is seen before the library is actually deloaded: C++ release semantics only prevent loads/stores before the atomic operation from being reordered after it, not loads/stores after it from being reordered before it. What I really want is a "store with acquire ordering" - the flag must be cleared and visible to the ISR before the library is deloaded (no loads/stores may be reordered to before the operation) - but the C++ standard seems to imply this isn't possible, at least with plain acquire / release.
I have explored the idea of using SeqCst operations / fences because of their additional total-ordering property, but I am not confident that I am using them correctly. Inspecting the assembly of the fence version, it appears correct, with DMB appearing in the right places; however, reading the standard about atomic_thread_fence and fence-fence synchronization, it seems like this is technically incorrect usage, as "FA is not sequenced before X in thread A".
Is someone able to walk through the process of figuring out whether this usage is correct or not (using C++ standard terms, i.e.: https://en.cppreference.com/w/cpp/atomic/memory_order, https://en.cppreference.com/w/cpp/atomic/atomic_thread_fence)?
Equivalent code link: https://godbolt.org/z/v9nzqvjxa - this uses SeqCst fences.
The internals of the library code are irrelevant and can be treated as an opaque box.
Edit:
Copy of the equivalent code link above:
#include <atomic>
#include <thread>
#include <iostream>
#include <stdexcept>
#include <chrono>
#include <random>
/// Library code
bool non_atomic_state;
void load_library()
{
    if (!non_atomic_state)
        std::cout << "loaded" << std::endl;
    non_atomic_state = true;
}
void deload_library()
{
    if (non_atomic_state)
        std::cout << "deloaded library" << std::endl;
    non_atomic_state = false;
}
void library_isr_routine()
{
    if (!non_atomic_state)
        throw std::runtime_error("crash");
    std::cout << "library routine" << std::endl;
}
/// MCU project code
std::atomic<bool> loaded;
void mcu_library_isr()
{
    if (loaded.load(std::memory_order_relaxed))
    {
        std::atomic_thread_fence(std::memory_order_seq_cst);
        library_isr_routine();
    }
}
void mcu_load_library()
{
    load_library();
    std::atomic_thread_fence(std::memory_order_seq_cst);
    loaded.store(true, std::memory_order_relaxed);
}
void mcu_deload_library()
{
    loaded.store(false, std::memory_order_relaxed);
    std::atomic_thread_fence(std::memory_order_seq_cst);
    deload_library();
}
/// Test harness code
void t1()
{
    std::random_device rd;
    for (int i = 0; i < 10000; i++)
    {
        auto sleep_duration = std::chrono::milliseconds(rd() % 10);
        std::this_thread::sleep_for(sleep_duration);
        mcu_library_isr();
    }
}
void t2()
{
    std::random_device rd;
    for (int i = 0; i < 10000; i++)
    {
        auto random_value = rd();
        auto sleep_duration = std::chrono::milliseconds(random_value % 10);
        std::this_thread::sleep_for(sleep_duration);
        if (random_value % 2 == 0)
            mcu_load_library();
        else
            mcu_deload_library();
    }
}
int main()
{
    std::thread t1_handle(t1);
    std::thread t2_handle(t2);
    t1_handle.join();
    t2_handle.join();
    return 0;
}

(More a point of discussion than a canonical answer)
Wouldn't an acquire-release operation with subsequent control-flow dependency resolve all ambiguity?
Like this:
if(loaded.exchange(false, std::memory_order_acq_rel))
    deload_library();
Since it is an acquire-release operation, the deloading cannot be reordered before the exchange. As a bonus, it protects against two threads attempting to deload at the same time, though not against a second thread loading while the deload is still in progress.
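For concreteness, here is a minimal sketch (my own, not from the question) of what the three wrappers might look like with this exchange-based approach, reusing the question's library functions; whether the ISR side still needs a fence is exactly what the question is asking about, so treat the orderings as a starting point rather than a verdict:
std::atomic<bool> loaded{false};

void mcu_load_library()
{
    load_library();
    // release: everything load_library() wrote is visible to whoever
    // observes loaded == true with an acquire load.
    loaded.store(true, std::memory_order_release);
}

void mcu_deload_library()
{
    // acq_rel exchange: the deload cannot be reordered before the flag is
    // cleared, and only the caller that actually flips true -> false deloads.
    if (loaded.exchange(false, std::memory_order_acq_rel))
        deload_library();
}

void mcu_library_isr()
{
    // acquire pairs with the release store in mcu_load_library().
    if (loaded.load(std::memory_order_acquire))
        library_isr_routine();
}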

Related

Why the following program does not mix the output when mutex is not used?

I have made multiple runs of the program. I do not see that the output is incorrect, even though I do not use the mutex. My goal is to demonstrate the need for a mutex. My thinking was that output from different threads, which have different "num" values, would be mixed.
Is it because the objects are different?
#include <vector>
#include <future>
#include <thread>
#include <mutex>
#include <chrono>
#include <cmath>
#include <iostream>
using namespace std;

using VecI = std::vector<int>;

class UseMutexInClassMethod {
    mutex m;
public:
    VecI compute(int num, VecI veci)
    {
        VecI v;
        num = 2 * num - 1;
        for (auto &x : veci) {
            v.emplace_back(pow(x, num));
            std::this_thread::sleep_for(std::chrono::seconds(1));
        }
        return v;
    }
};

void TestUseMutexInClassMethodUsingAsync()
{
    const int nthreads = 5;
    UseMutexInClassMethod useMutexInClassMethod;
    VecI vec{ 1,2,3,4,5 };
    std::vector<std::future<VecI>> futures(nthreads);
    std::vector<VecI> outputs(nthreads);
    for (decltype(futures)::size_type i = 0; i < nthreads; ++i) {
        futures[i] = std::async(&UseMutexInClassMethod::compute,
                                &useMutexInClassMethod,
                                i, vec);
    }
    for (decltype(futures)::size_type i = 0; i < nthreads; ++i) {
        outputs[i] = futures[i].get();
        for (auto& x : outputs[i])
            cout << x << " ";
        cout << endl;
    }
}
If you want an example that does fail with a high degree of certainty you can look at the below. It sets up a variable called accumulator to be shared by reference to all the futures. This is what is missing in your example. You are not actually sharing any memory. Make sure you understand the difference between passing by reference and passing by value.
#include <vector>
#include <memory>
#include <thread>
#include <future>
#include <iostream>
#include <cmath>
#include <mutex>
#include <chrono>
#include <stdexcept>

struct UseMutex {
    int compute(std::mutex & m, int & num)
    {
        for (size_t j = 0; j < 1000; j++)
        {
            //////////////////////
            // CRITICAL SECTION //
            //////////////////////
            // this code currently doesn't trigger the exception
            // because of the lock on the mutex. If you comment
            // out the single line below then the exception *may*
            // get called.
            std::scoped_lock lock{m};
            num++;
            std::this_thread::sleep_for(std::chrono::nanoseconds(1));
            num++;
            if (num % 2 != 0)
                throw std::runtime_error("bad things happened");
        }
        return 0;
    }
};

template <typename T> struct F;

void TestUseMutexInClassMethodUsingAsync()
{
    const int nthreads = 16;
    int accumulator = 0;
    std::mutex m;
    std::vector<UseMutex> vs{nthreads};
    std::vector<std::future<int>> futures(nthreads);
    for (auto i = 0; i < nthreads; ++i) {
        futures[i] = std::async([&, i]() { return vs[i].compute(m, accumulator); });
    }
    for (auto i = 0; i < nthreads; ++i) {
        futures[i].get();
    }
}

int main() {
    TestUseMutexInClassMethodUsingAsync();
}
You can comment / uncomment the line
std::scoped_lock lock{m};
which protects the increment of the shared variable num. The rule for this mini program is that at the line
if (num % 2 != 0)
    throw std::runtime_error("bad things happened");
num should be a multiple of two. But as multiple threads are accessing this variable without a lock you can't guarantee this. However if you add a lock around the double increment and test then you can be sure no other thread is accessing this memory during the duration of the increment and test.
Failing
https://godbolt.org/z/sojcs1WK9
Passing
https://godbolt.org/z/sGdx3x3q3
Of course the failing one is not guaranteed to fail but I've set it up so that it has a high probability of failing.
Notes
[&,i](){return vs[i].compute(m,accumulator);};
is a lambda or inline function. The notation [&,i] means it captures everything by reference except i, which it captures by value. This is important because i changes on each loop iteration and we want each future to get a unique value of i.
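As a tiny illustration (hypothetical, not from the original post) of why i must be captured by value: each future gets its own copy of i at the moment the lambda is created, so the results come out as expected even though the loop variable keeps changing and is destroyed when the loop ends:
#include <future>
#include <iostream>
#include <vector>

int main()
{
    std::vector<std::future<int>> futures;
    for (int i = 0; i < 5; ++i)
        futures.push_back(std::async([&, i] { return i * i; })); // i is copied here
    for (auto& f : futures)
        std::cout << f.get() << ' '; // prints 0 1 4 9 16
}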
Is it because the objects are different?
Yes.
Your code is actually perfectly thread safe; there is no need for a mutex here. You never share any state between threads except for copying vec from TestUseMutexInClassMethodUsingAsync to compute via std::async (and copying is thread-safe) and moving the computation result from compute's return value to futures[i].get()'s return value. .get() is also thread-safe: it blocks until the compute() method terminates and then returns its computation result.
It's actually nice to see that even a deliberate attempt to get a race condition failed :)
You would probably have to fully redo your example to demonstrate how simultaneous* access to a shared object breaks things. Get rid of std::async and std::future, use a simple std::thread with capture-by-reference, remove sleep_for (so both threads do a lot of operations instead of one per second), significantly increase the number of operations, and you will get a visible race; a sketch along those lines follows below. It may look like a crash, though.
* - yes, I'm aware that "wall-clock simultaneous access" does not exist in multithreaded systems, strictly speaking. However, it helps to get a rough idea of where to look for visible race conditions for demonstration purposes.
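A minimal sketch along those lines (my own code, not the poster's): two plain std::threads increment a shared counter captured by reference, with no sleeps and no lock. Strictly speaking this is a deliberate data race, i.e. undefined behaviour, which is exactly the point of the demonstration:
#include <iostream>
#include <thread>

int main()
{
    int counter = 0; // shared, non-atomic, unprotected
    auto work = [&counter] {
        for (int i = 0; i < 1000000; ++i)
            ++counter; // racy read-modify-write
    };
    std::thread t1(work), t2(work);
    t1.join();
    t2.join();
    // Typically prints less than 2000000 because increments get lost;
    // the exact behaviour is unspecified since this is a data race.
    std::cout << counter << '\n';
}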
Comments have called out the fact that merely not protecting a critical section does not guarantee that the risked behavior actually occurs.
That also applies to multiple runs: you are not allowed to test a few times and then rely on the repeatedly observed behavior, yet optimization mechanisms are likely to make the same observation recur often enough that it is perceived as reproducible.
If you intend to demonstrate the need for synchronization, you need to employ synchronization yourself to steer things toward a near-guaranteed, observable misbehavior in the unprotected case.
Allow me to only outline a sequence for that, with a few assumptions about scheduling (this is based on a rather simple, single-core, priority-based scheduling environment I have encountered professionally in embedded work), just to give an insight with a simplified example (a portable sketch follows after the note below):
start a lower priority context.
optionally set up proper protection before entering the critical section
start critical section, e.g. by outputting the first half of a to-be-continuous output
asynchronously trigger a higher priority context, which does that which can violate your critical section, e.g. outputs something which should not appear in the middle of the two-part output of the critical section
(in protected case the other context is not executed, in spite of being higher priority)
(in unprotected case the other context is now executed, because of being higher priority)
end critical section, e.g. by outputting the second half of the to-be-continuous output
optionally remove the protection after leaving the critical section
(in protected case the other context is now executed, now that it is allowed)
(in unprotected case the other context was already executed)
Note:
I am using the term "critical section" with the meaning of a piece of code which is vulnerable to being interrupted/preempted/descheduled by another piece of code or by another execution of the same code. Specifically, for me a critical section can exist without applied protection, though that is not a good thing. I state this explicitly because I am aware of the term also being used with the meaning "piece of code inside applied protection/synchronization". I disagree, but I accept that the term is used differently and requires clarification in case of potential conflicts.
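Since most readers won't have that embedded scheduler at hand, here is a rough, portable sketch of the same sequence (my own construction; the condition variable and the sleep stand in for "trigger a higher priority context", and all names are illustrative):
#include <chrono>
#include <condition_variable>
#include <iostream>
#include <mutex>
#include <thread>

std::mutex protection;          // the optional protection around the critical section
std::mutex trigger_m;
std::condition_variable trigger_cv;
bool triggered = false;

// Stands in for the asynchronously triggered higher-priority context.
void other_context()
{
    std::unique_lock<std::mutex> lk(trigger_m);
    trigger_cv.wait(lk, [] { return triggered; });
    std::lock_guard<std::mutex> guard(protection); // comment out to see the violation
    std::cout << "output that should NOT appear in the middle\n";
}

int main()
{
    std::thread other(other_context);
    {
        // optionally set up proper protection before entering the critical section
        std::lock_guard<std::mutex> guard(protection);
        std::cout << "first half of to-be-continuous output\n";
        // asynchronously trigger the other context in the middle of the section
        {
            std::lock_guard<std::mutex> lk(trigger_m);
            triggered = true;
        }
        trigger_cv.notify_one();
        // give the other context a real chance to run right now
        std::this_thread::sleep_for(std::chrono::milliseconds(100));
        std::cout << "second half of to-be-continuous output\n";
    } // protection removed after leaving the critical section
    other.join();
}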

Is there an implicit memory barrier with synchronized-with relationship on thread::join?

I have code at work that starts multiple threads that do some operations, and if any of them fails it sets a shared variable to false.
Then the main thread joins all the worker threads. A simulation of this looks roughly like the following (I commented out the possible fix, which I don't know whether it is needed):
#include <thread>
#include <atomic>
#include <vector>
#include <iostream>
#include <cassert>
using namespace std;

//atomic_bool success = true;
bool success = true;

int main()
{
    vector<thread> v;
    for (int i = 0; i < 10; ++i)
    {
        v.emplace_back([=]
        {
            if (i == 5 || i == 6)
            {
                //success.store(false, memory_order_release);
                success = false;
            }
        });
    }
    for (auto& t : v)
        t.join();
    //assert(success.load(memory_order_acquire) == false);
    assert(success == false);
    cout << "Finished" << endl;
    cin.get();
    return 0;
}
Is there a possibility that the main thread will read the success variable as true even though one of the workers set it to false?
I found that thread::join() is a full memory barrier (source), but does that imply a synchronizes-with relationship with the subsequent read of the success variable from the main thread, so that we're guaranteed to get the newest value?
Is the fix I posted (in the commented code) necessary in this case (or maybe another fix if this one is wrong)?
Is there a possibility that the read of the success variable will be optimized away (since it's not volatile) and we will get an old value regardless of the supposed implicit memory barrier on thread::join?
The code is supposed to work on multiple architectures (I cannot remember all of them, I don't have the makefile in front of me) but there are at least x86, amd64, itanium, arm7.
Thanks for any help with this.
Edit: I've modified the example, because in the real situation more than one thread can try to write to the success variable.
The code above represents a data race, and the use of join cannot change that fact. If only one thread wrote to the variable, it would be fine. But you have two threads writing to it, with no synchronization between them. That's a data race.
join simply means "all side effects of that thread's operation have completed and are now visible to you." That does not create ordering or synchronization between that thread and any thread other than your own.
If you used an atomic_bool, then it wouldn't be UB; it would be guaranteed to be false. But because there is a data race, you get pure UB. It might be true, false, or nasal demons.
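For completeness, here is a sketch of the fix the question already hints at in its commented-out lines (my reading, not the answerer's code): with std::atomic<bool> the two workers no longer race with each other, and because each thread's completion synchronizes-with the corresponding join(), even relaxed ordering is enough for main to be guaranteed to read false here.
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

std::atomic<bool> success{true};

int main()
{
    std::vector<std::thread> v;
    for (int i = 0; i < 10; ++i)
        v.emplace_back([=] {
            if (i == 5 || i == 6)
                success.store(false, std::memory_order_relaxed); // no race: atomic
        });
    for (auto& t : v)
        t.join(); // thread completion synchronizes-with the return from join()
    assert(success.load(std::memory_order_relaxed) == false);
}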

Many reads/one write in atomic variable

Could you help me please.
Suppose I have p - 1 reader threads and one writer thread. They all read and write one atomic int variable. Could it be that, if all the reads and the write occur simultaneously, the write operation will have to wait p - 1 times? I have doubts because when an atomic operation happens there is some strange lock (in the assembler output) and I am afraid that it locks the memory where the variable lives. So it could happen that the write operation waits for the p - 1 reads. Could that happen?
Here is some simple code:
#include <atomic>
#include <iostream>
#include <thread>
#include <vector>
#include <algorithm>   // std::for_each
#include <chrono>      // std::chrono::milliseconds
#include <functional>  // std::mem_fn

std::atomic<int> val;

void writer()
{
    val.store(7);
}

void read()
{
    int tmp = val.load();
    while (tmp == 0)
    {
        std::cout << std::this_thread::get_id() << ": wait" << std::endl;
        tmp = val.load();
    }
    std::cout << std::this_thread::get_id() << " Operation: " << tmp * tmp << std::endl;
}

int main()
{
    val.store(0);
    std::vector<std::thread> v;
    for (int i = 0; i < 1; ++i)
        v.push_back(std::thread(read));
    std::this_thread::sleep_for(std::chrono::milliseconds(77));
    writer();
    std::for_each(v.begin(), v.end(), std::mem_fn(&std::thread::join));
    return 0;
}
Thank you!
The processor instruction that locks the memory bus (the one with the LOCK prefix) does not use locking in the usual, high-level sense. It makes threads (the calling one and, possibly, concurrent threads that access the same or nearby memory blocks) a bit slower.
The upper limit of this slowdown depends only on the machine and its architecture.
Normal locks also make threads slower, but the amount of slowdown depends heavily on lock contention, properties of the locking implementation (e.g. fairness), and the code under lock protection. You shouldn't worry about locked memory accesses except for performance reasons.
Actually, the LOCK prefix isn't needed for atomic loads/stores. I guess it comes from the compiler providing sequentially consistent memory order, which is what the .store() and .load() atomic methods enforce by default, but it is unnecessary in your example. The most commonly used pattern is:
use relaxed memory order for initialization:
val.store(0, std::memory_order_relaxed);
use acquire memory order to read the value:
tmp = val.load(std::memory_order_acquire);
use release memory order to write (change) the value:
val.store(7, std::memory_order_release);
This will prevent the compiler from using instructions with the LOCK prefix.
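Applied to the question's code, the three snippets slot in roughly like this (a sketch only; writer() is the question's function, and the two comments show where the load in read() and the initial store in main() would change):
void writer()
{
    val.store(7, std::memory_order_release);            // publish the value
}

// in read():  int tmp = val.load(std::memory_order_acquire);  // pairs with the release store
// in main():  val.store(0, std::memory_order_relaxed);        // initialization, before the threads start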

Why is std::mutex faster than std::atomic?

I want to put objects in std::vector in multi-threaded mode. So I decided to compare two approaches: one uses std::atomic and the other std::mutex. I see that the second approach is faster than the first one. Why?
I use GCC 4.8.1 and, on my machine (8 threads), I see that the first solution requires 391502 microseconds and the second solution requires 175689 microseconds.
#include <vector>
#include <omp.h>
#include <atomic>
#include <mutex>
#include <thread>   // std::this_thread::yield
#include <iostream>
#include <chrono>

int main(int argc, char* argv[]) {
    const size_t size = 1000000;
    std::vector<int> first_result(size);
    std::vector<int> second_result(size);
    std::atomic<bool> sync(false);
    {
        auto start_time = std::chrono::high_resolution_clock::now();
        #pragma omp parallel for schedule(static, 1)
        for (int counter = 0; counter < size; counter++) {
            while (sync.exchange(true)) {
                std::this_thread::yield();
            }
            first_result[counter] = counter;
            sync.store(false);
        }
        auto end_time = std::chrono::high_resolution_clock::now();
        std::cout << std::chrono::duration_cast<std::chrono::microseconds>(end_time - start_time).count() << std::endl;
    }
    {
        auto start_time = std::chrono::high_resolution_clock::now();
        std::mutex mutex;
        #pragma omp parallel for schedule(static, 1)
        for (int counter = 0; counter < size; counter++) {
            std::unique_lock<std::mutex> lock(mutex);
            second_result[counter] = counter;
        }
        auto end_time = std::chrono::high_resolution_clock::now();
        std::cout << std::chrono::duration_cast<std::chrono::microseconds>(end_time - start_time).count() << std::endl;
    }
    return 0;
}
I don't think your question can be answered by referring only to the standard: mutexes are as platform-dependent as they can be. However, there is one thing that should be mentioned.
Mutexes are not slow. You may have seen articles that compare their performance against custom spin locks and other "lightweight" constructs, but that's not the right approach: these are not interchangeable.
Spin locks are very fast when they are held (acquired) for a relatively short amount of time: acquiring them is very cheap, but the other threads that are also trying to lock stay active the whole time (constantly running in a loop).
A custom spin lock could be implemented this way:
class SpinLock
{
private:
    std::atomic_flag _lockFlag;

public:
    SpinLock()
        : _lockFlag {ATOMIC_FLAG_INIT}
    { }

    void lock()
    {
        while (_lockFlag.test_and_set(std::memory_order_acquire))
        { }
    }

    bool try_lock()
    {
        return !_lockFlag.test_and_set(std::memory_order_acquire);
    }

    void unlock()
    {
        _lockFlag.clear();
    }
};
A mutex is a much more complicated primitive. In particular, on Windows we have two such primitives: the Critical Section, which works on a per-process basis, and the Mutex, which doesn't have that limitation.
Locking a mutex (or critical section) is much more expensive, but the OS can genuinely put other waiting threads to sleep, which improves performance and helps the task scheduler manage resources efficiently.
Why do I write this? Because modern mutexes are often so-called "hybrid mutexes". When such a mutex is contended, it behaves like a normal spin lock: the other waiting threads perform some number of "spins", and only then is the heavy mutex locked, to avoid wasting resources.
In your case, the mutex is locked in each loop iteration to perform this statement:
second_result[counter] = omp_get_thread_num();
It looks fast, so the "real" mutex may never be locked. That means that, in this case, your "mutex" can be as fast as the atomic-based solution (because it effectively becomes an atomic-based solution itself).
Also, in the first solution you implemented some kind of spin-lock-like behaviour, but I am not sure this behaviour is predictable in a multi-threaded environment. I am pretty sure that "locking" should have acquire semantics, while unlocking is a release operation. Relaxed memory ordering may be too weak for this use case.
I edited the code to be more compact and correct. It uses std::atomic_flag, which is the only type (unlike the std::atomic<> specializations) that is guaranteed to be lock-free (even std::atomic<bool> does not give you that).
Also, regarding the comment below about "not yielding": it is a matter of the specific case and requirements. Spin locks are a very important part of multi-threaded programming, and their performance can often be improved by slightly modifying their behaviour. For example, the Boost library implements spinlock::lock() as follows:
void lock()
{
    for( unsigned k = 0; !try_lock(); ++k )
    {
        boost::detail::yield( k );
    }
}
source: boost/smart_ptr/detail/spinlock_std_atomic.hpp
Where detail::yield() is (Win32 version):
inline void yield( unsigned k )
{
    if( k < 4 )
    {
    }
#if defined( BOOST_SMT_PAUSE )
    else if( k < 16 )
    {
        BOOST_SMT_PAUSE
    }
#endif
#if !BOOST_PLAT_WINDOWS_RUNTIME
    else if( k < 32 )
    {
        Sleep( 0 );
    }
    else
    {
        Sleep( 1 );
    }
#else
    else
    {
        // Sleep isn't supported on the Windows Runtime.
        std::this_thread::yield();
    }
#endif
}
[source: http://www.boost.org/doc/libs/1_66_0/boost/smart_ptr/detail/yield_k.hpp]
First, the thread spins a fixed number of times (4 in this case). If the mutex is still locked, the pause instruction is used (if available) or Sleep(0) is called, which basically causes a context switch and allows the scheduler to give another blocked thread a chance to do something useful. Then Sleep(1) is called to perform an actual (short) sleep. Very nice!
Also, this statement:
The purpose of a spinlock is busy waiting
is not entirely true. The purpose of a spinlock is to serve as a fast, easy-to-implement lock primitive, but it still needs to be written properly, with certain possible scenarios in mind. For example, Intel says (regarding Boost's use of _mm_pause() as a method of yielding inside lock()):
In the spin-wait loop, the pause intrinsic improves the speed at which
the code detects the release of the lock and provides especially
significant performance gain.
So, implementations like
void lock() { while(m_flag.test_and_set(std::memory_order_acquire)); }
may not be as good as they seem.
There is an additional important issue related to your problem. An efficient spinlock never "spins" on an operation that involves (even potential) modification of a memory location (such as exchange or test_and_set). On typical modern architectures, these operations generate instructions that require the cache line holding the lock's memory location to be in the exclusive state, which is extremely time-consuming (especially when multiple threads are spinning at the same time). Always spin on a load/read only, and try to acquire the lock only when there is a chance that this operation will succeed.
A nice relevant article is, for instance, here: Correctly implementing a spinlock in C++
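To make that last point concrete, here is a sketch of a "spin on a plain load, only then retry the read-modify-write" (test-and-test-and-set) variant; it reuses the shape of the SpinLock above but uses std::atomic<bool> so the waiting loop can be a simple load:
#include <atomic>

class TtasSpinLock
{
    std::atomic<bool> _locked{false};
public:
    void lock()
    {
        for (;;)
        {
            // Spin on plain loads while the lock looks taken; this keeps the
            // cache line shared instead of bouncing it between cores.
            while (_locked.load(std::memory_order_relaxed))
                ; // optionally pause / yield here, as in the Boost code above
            // The lock looked free, so attempt the actual read-modify-write.
            if (!_locked.exchange(true, std::memory_order_acquire))
                return;
        }
    }

    bool try_lock()
    {
        return !_locked.load(std::memory_order_relaxed) &&
               !_locked.exchange(true, std::memory_order_acquire);
    }

    void unlock()
    {
        _locked.store(false, std::memory_order_release);
    }
};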

Understanding c++11 memory fences

I'm trying to understand memory fences in C++11. I know there are better ways to do this (atomic variables and so on), but I wondered whether this usage was correct. I realize that this program doesn't do anything useful; I just wanted to make sure that the usage of the fence functions did what I thought they did.
Basically, does the release fence ensure that any changes made in this thread before the fence are visible to other threads after the fence, and does the acquire fence in the second thread ensure that any changes to the variables are visible in that thread immediately after the fence?
Is my understanding correct? Or have I missed the point entirely?
#include <iostream>
#include <atomic>
#include <thread>

int a;

void func1()
{
    for (int i = 0; i < 1000000; ++i)
    {
        a = i;
        // Ensure that changes to a to this point are visible to other threads
        atomic_thread_fence(std::memory_order_release);
    }
}

void func2()
{
    for (int i = 0; i < 1000000; ++i)
    {
        // Ensure that this thread's view of a is up to date
        atomic_thread_fence(std::memory_order_acquire);
        std::cout << a;
    }
}

int main()
{
    std::thread t1 (func1);
    std::thread t2 (func2);
    t1.join(); t2.join();
}
Your usage does not actually ensure the things you mention in your comments. That is, your usage of fences does not ensure that your assignments to a are visible to other threads or that the value you read from a is 'up to date.' This is because, although you seem to have the basic idea of where fences should be used, your code does not actually meet the exact requirements for those fences to "synchronize".
Here's a different example that I think demonstrates correct usage better.
#include <iostream>
#include <atomic>
#include <thread>

std::atomic<bool> flag(false);
int a;

void func1()
{
    a = 100;
    atomic_thread_fence(std::memory_order_release);
    flag.store(true, std::memory_order_relaxed);
}

void func2()
{
    while (!flag.load(std::memory_order_relaxed))
        ;
    atomic_thread_fence(std::memory_order_acquire);
    std::cout << a << '\n'; // guaranteed to print 100
}

int main()
{
    std::thread t1 (func1);
    std::thread t2 (func2);
    t1.join(); t2.join();
}
The load and store on the atomic flag do not synchronize, because they both use the relaxed memory ordering. Without the fences this code would be a data race: we're performing conflicting operations on a non-atomic object in different threads, and without the fences and the synchronization they provide, there would be no happens-before relationship between the conflicting operations on a.
However with the fences we do get synchronization because we've guaranteed that thread 2 will read the flag written by thread 1 (because we loop until we see that value), and since the atomic write happened after the release fence and the atomic read happens-before the acquire fence, the fences synchronize. (see § 29.8/2 for the specific requirements.)
This synchronization means anything that happens-before the release fence happens-before anything that happens-after the acquire fence. Therefore the non-atomic write to a happens-before the non-atomic read of a.
Things get trickier when you're writing a variable in a loop, because you might establish a happens-before relation for some particular iteration, but not other iterations, causing a data race.
std::atomic<int> f(0);
int a;

void func1()
{
    for (int i = 0; i < 1000000; ++i) {
        a = i;
        atomic_thread_fence(std::memory_order_release);
        f.store(i, std::memory_order_relaxed);
    }
}

void func2()
{
    int prev_val = 0;
    while (prev_val < 1000000) {
        while (true) {
            int new_val = f.load(std::memory_order_relaxed);
            if (prev_val < new_val) {
                prev_val = new_val;
                break;
            }
        }
        atomic_thread_fence(std::memory_order_acquire);
        std::cout << a << '\n';
    }
}
This code still causes the fences to synchronize, but it does not eliminate data races. For example, if f.load() happens to return 10, then we know that a = 1, a = 2, ..., a = 10 have all happened-before that particular cout << a, but we don't know that cout << a happens-before a = 11. Those are conflicting operations on different threads with no happens-before relation: a data race.
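One possible follow-up (my suggestion, not part of the original answer): making a itself atomic removes the undefined behaviour while keeping the fences. The reader may then print a value written by a later iteration than the one it observed in f, but that is an ordinary stale/newer read, no longer a data race.
std::atomic<int> a{0};   // was: int a;

// writer (func1): a.store(i, std::memory_order_relaxed);
// reader (func2): std::cout << a.load(std::memory_order_relaxed) << '\n';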
Your usage is correct, but insufficient to guarantee anything useful.
For example, the compiler is free to internally implement a = i; like this if it wants to:
while (a != i)
{
    ++a;
    atomic_thread_fence(std::memory_order_release);
}
So the other thread may see any values at all.
Of course, the compiler would never implement a simple assignment like that. However, there are cases where similarly perplexing behavior actually is an optimization, so it's a very bad idea to rely on ordinary code being implemented internally in any particular way. This is why we have things like atomic operations, and why fences only produce guaranteed results when used together with such operations.