Can I use no thread synchronization here?

I am researching mutexes.
I came up with this example, which seems to work without any synchronization.
#include <cstdint>
#include <thread>
#include <iostream>

constexpr size_t COUNT = 10000000;

int g_x = 0;

void p1() {
    for (size_t i = 0; i < COUNT; ++i) {
        ++g_x;
    }
}

void p2() {
    int a = 0;
    for (size_t i = 0; i < COUNT; ++i) {
        if (a > g_x) {
            std::cout << "Problem detected" << '\n';
        }
        a = g_x;
    }
}

int main() {
    std::thread t1{ p1 };
    std::thread t2{ p2 };
    t1.join();
    t2.join();
    std::cout << g_x << '\n';
}
My assumptions are the following:
Thread 1 changes the value of g_x, but it is the only thread that changes it, so theoretically this should be OK.
Thread 2 reads the value of g_x. Reads are supposed to be atomic on x86 and ARM, so there should be no problem there either. I have an example with several reader threads, and it works OK too.
In other words, the write is not shared and the reads are atomic.
Are these assumptions correct?

There's certainly a data race here: g_x is not a std::atomic; it is written by one thread and read by another, so the results are undefined.
Note that the CPU memory model is only part of the deal. The compiler might do all sorts of optimizations (caching values in registers, reordering, etc.) if you don't declare your shared variables properly.
As for mutexes, you do not need one here. Declaring g_x as atomic removes the UB and guarantees proper communication between the threads. By the way, the for loop in p2 could probably be optimized out even if you use atomics, but I assume this is just reduced code and not the real thing.
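
For illustration, here is a minimal sketch of the program with g_x made atomic (my code, not the question's). Even with relaxed ordering, reads of a single atomic variable can never observe its value going backwards in its modification order, so "Problem detected" should never print:

#include <atomic>
#include <cstddef>
#include <thread>
#include <iostream>

constexpr std::size_t COUNT = 10000000;

std::atomic<int> g_x{0}; // atomic: no data race, no UB

void p1() {
    for (std::size_t i = 0; i < COUNT; ++i)
        g_x.fetch_add(1, std::memory_order_relaxed); // single writer incrementing
}

void p2() {
    int a = 0;
    for (std::size_t i = 0; i < COUNT; ++i) {
        int cur = g_x.load(std::memory_order_relaxed); // atomic read
        if (a > cur)
            std::cout << "Problem detected" << '\n';
        a = cur;
    }
}

int main() {
    std::thread t1{ p1 };
    std::thread t2{ p2 };
    t1.join();
    t2.join();
    std::cout << g_x << '\n';
}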

How to implement simple custom lock using acquire-release memory order model?

I'm trying to understand acquire-release memory order through implementing my custom lock.
#include <atomic>
#include <vector>
#include <thread>
#include <iostream>

class my_lock {
    static std::atomic<bool> flag;
public:
    void lock() {
        bool expected = false;
        while (!flag.compare_exchange_strong(expected, true, std::memory_order_acq_rel))
            expected = false;
    }
    void unlock() {
        flag.store(false, std::memory_order_release);
    }
};

std::atomic<bool> my_lock::flag(false);

static int num0 = 0;
static int num1 = 0;
my_lock lk;

void increase() {
    for (int i = 0; i < 100000; ++i) {
        lk.lock();
        num0++;
        num1++;
        lk.unlock();
    }
}

void read() {
    for (int i = 0; i < 100000; ++i) {
        lk.lock();
        if (num0 > num1) {
            std::cout << "num0:" << num0 << " > " << "num1:" << num1 << std::endl;
        }
        lk.unlock();
    }
}

int main() {
    std::thread t1(increase);
    std::thread t2(read);
    t1.join();
    t2.join();
    std::cout << "finished! num0:" << num0 << ", num1:" << num1 << std::endl;
}
Question 1: Am I correct to use acquire-release memory order?
Below is a paragraph from C++ Concurrency in Action:
Despite the potentially non-intuitive outcomes, anyone who’s used locks has had to deal with the same ordering issues: locking a mutex is an acquire operation, and unlocking the mutex is a release operation. With mutexes, you learn that you must ensure that the same mutex is locked when you read a value as was locked when you wrote it, and the same applies here; your acquire and release operations have to be on the same variable to ensure an ordering. If data is protected with a mutex, the exclusive nature of the lock means that the result is indistinguishable from what it would have been had the lock and unlock been sequentially consistent operations. Similarly, if you use acquire and release orderings on atomic variables to build a simple lock, then from the point of view of code that uses the lock, the behavior will appear sequentially consistent, even though the internal operations are not.
This paragraph says that "the result is indistinguishable from ..... sequentially consistent operations".
Question 2: Why is the result indistinguishable? From my understanding, if we use a lock, the result should be distinguishable.
Edit:
I'm adding one more question.
Below is the std::atomic_flag example from C++ Concurrency in Action.
class spinlock_mutex
{
    std::atomic_flag flag;
public:
    spinlock_mutex():
        flag(ATOMIC_FLAG_INIT)
    {}
    void lock()
    {
        while (flag.test_and_set(std::memory_order_acquire));
    }
    void unlock()
    {
        flag.clear(std::memory_order_release);
    }
};
Question 3: Why doesn't this code use std::memory_order_acq_rel? flag.test_and_set is an RMW operation, so I think it should be used with std::memory_order_acq_rel, as in my first example.
Acquire/release is strong enough to confine the instructions within the critical section, and it provides sufficient synchronization that a happens-before relation is established between a release of the lock and a subsequent acquire of the same lock.
The lock will give you a sequential order on the lock acquire/release operations, just as with sequential consistency.
Why are you using compare_exchange_strong? You are already in a loop; you can use compare_exchange_weak.
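
As a concrete sketch of that suggestion (hypothetical class name, not the question's code): compare_exchange_weak may fail spuriously, but the retry loop absorbs that, and acquire on success plus release in unlock is all a lock needs.

#include <atomic>

class spin_lock {
    std::atomic<bool> flag{false};
public:
    void lock() {
        bool expected = false;
        // acquire on success; relaxed on failure, since a failed CAS
        // does not enter the critical section and publishes nothing
        while (!flag.compare_exchange_weak(expected, true,
                                           std::memory_order_acquire,
                                           std::memory_order_relaxed))
            expected = false;
    }
    void unlock() {
        flag.store(false, std::memory_order_release); // pairs with the acquire in lock()
    }
};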

Peterson's Algorithm on C++ multithreading

I've written a simple implementation of Peterson's algorithm in C++ with multithreading. This program changes the string through two threads, but I'm not getting the final result. Where am I wrong?
#include <iostream>
#include <string>
#include <cstdlib>
#include <pthread.h>
using namespace std;

int flag[2] = {0, 1};
int turn;

void* first(void* data) {
    flag[0] = 1;
    turn = 1;
    while (flag[1] && turn == 1) {}
    string& str = *(static_cast<string*>(data));
    if (str != "") {
        if (str == "abcd") {
            str = "Hello";
        }
    }
    flag[0] = 0;
    pthread_exit(NULL);
}

void* second(void* data) {
    flag[1] = 1;
    turn = 0;
    while (flag[0] && turn == 0) {}
    string& str = *(static_cast<string*>(data));
    if (str != "") {
        if (str == "wxyz") {
            str = "abcd";
        }
    }
    flag[1] = 0;
    pthread_exit(NULL);
}

int main() {
    int rc = 0;
    string s = "wxyz";
    pthread_t t;
    rc = pthread_create(&t, NULL, first, static_cast<void*>(&s));
    if (rc != 0) {
        cout << "error!";
        exit(rc);
    }
    rc = pthread_create(&t, NULL, second, static_cast<void*>(&s));
    if (rc != 0) {
        cout << "error!";
        exit(rc);
    }
    while (flag[0] && flag[1] != 0) {}
    cout << s;
    pthread_exit(NULL);
    return 0;
}
Prior to C++11 there was no threading model in C++ at all. Under C++11, your code performs unordered accesses to the same variables, causing race conditions.
Race conditions result in undefined behavior.
Changing a std::string is not atomic. You cannot do it safely while other threads are reading or writing it.
In C++11, the threading primitives of std are a better idea than the raw pthread code above, excluding the very rare features you cannot emulate.
Here it is refactored to use atomics. Note the explicit fences to guarantee correct ordering of reads/writes to the (non-atomic) string across threads.
Maybe someone would like to sanity-check my logic?
#include <iostream>
#include <thread>
#include <atomic>
#include <string>
using namespace std;

// atomic types require the compiler to issue appropriate
// store-release/load-acquire ordering
std::atomic<int> flag[2] = {{0}, {1}};
std::atomic<int> turn;

void first(std::string& str) {
    flag[0] = 1;
    turn = 1;
    while (flag[1] && turn == 1) {}
    std::atomic_thread_fence(std::memory_order_acquire);
    if (str != "") {
        if (str == "abcd") {
            str = "Hello";
            std::atomic_thread_fence(std::memory_order_release);
        }
    }
    flag[0] = 0;
}

void second(std::string& str) {
    flag[1] = 1;
    turn = 0;
    while (flag[0] && turn == 0) {}
    std::atomic_thread_fence(std::memory_order_acquire);
    if (str != "") {
        if (str == "wxyz") {
            str = "abcd";
            std::atomic_thread_fence(std::memory_order_release);
        }
    }
    flag[1] = 0;
}

int main() {
    string s = "wxyz";
    auto t1 = std::thread(first, std::ref(s));
    auto t2 = std::thread(second, std::ref(s));
    for (; flag[0] && flag[1]; )
        ;
    std::atomic_thread_fence(std::memory_order_acquire);
    cout << s << endl;
    t1.join();
    t2.join();
    return 0;
}
expected output:
wxyz
Endnote:
Modern memory architectures are not what they were when this algorithm was invented. Reads and writes to memory don't happen when you expect on a modern chip, and sometimes don't happen at all.
Cancel your appointments for the next 3 hours and watch this fantastic talk on the subject:
https://channel9.msdn.com/Shows/Going+Deep/Cpp-and-Beyond-2012-Herb-Sutter-atomic-Weapons-1-of-2
I had to make main wait until the threads finished, so I created separate thread handles, t for first and u for second, and made main wait via pthread_join until both threads are done. This removed the need for main to spin.
Changes:
pthread_t t,u;
pthread_create(&t,NULL,first,static_cast<void*>(&s));
pthread_create(&u,NULL,second,static_cast<void*>(&s));
pthread_join(u,NULL);
pthread_join(t,NULL);
//while(flag[0] && flag[1]!=0){}
cout<<s;
The atomic fences in the functions remain, as they ensure ordered execution of the reads and writes.
OUTPUT
abcd
and
Hello
And although changing the order of the pthread_create calls for first and second always outputs Hello, it kills the very idea of the first thread waiting for the second thread to complete. So I think the changes above answer the question.
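
For comparison, here is a sketch of Peterson's algorithm using std::thread and std::atomic with the default sequentially consistent ordering, which the algorithm's correctness argument assumes (my code, not from the answers above):

#include <atomic>
#include <iostream>
#include <string>
#include <thread>

std::atomic<bool> flag[2] = {{false}, {false}};
std::atomic<int> turn{0};

void enter(int me) {               // Peterson's entry protocol
    int other = 1 - me;
    flag[me].store(true);          // seq_cst by default
    turn.store(other);
    while (flag[other].load() && turn.load() == other)
        ; // busy-wait until it is our turn
}

void leave(int me) {
    flag[me].store(false);
}

int main() {
    std::string s = "wxyz";
    std::thread t1([&] { enter(0); if (s == "abcd") s = "Hello"; leave(0); });
    std::thread t2([&] { enter(1); if (s == "wxyz") s = "abcd"; leave(1); });
    t1.join();
    t2.join();
    std::cout << s << '\n'; // "abcd" or "Hello", depending on which thread entered first
    return 0;
}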

Passing an object to two std::threads

If I create an object which is going to be accessed by two different std::threads, do I need to make any special provisions when I create the object or pass it to the threads?
For example:
#include <iostream>
#include <thread>

class Alpha
{
public:
    int x;
};

void Foo(Alpha* alpha)
{
    while (true)
    {
        alpha->x++;
        std::cout << "Foo: alpha.x = " << alpha->x << std::endl;
    }
}

void Bar(Alpha* alpha)
{
    while (true)
    {
        alpha->x++;
        std::cout << "Bar: alpha.x = " << alpha->x << std::endl;
    }
}

int main(int argc, char* argv[])
{
    Alpha alpha;
    alpha.x = 0;
    std::thread t1(Foo, &alpha);
    std::thread t2(Bar, &alpha);
    t1.join();
    t2.join();
    return 0;
}
This compiles fine, and seems to run fine too. But I haven't explicitly told my program that alpha needs to be accessed by two different threads. Should I be doing this differently?
You have a data race on alpha.x, as one thread may write it while the other reads or writes its value. You can fix that by changing the type of x to std::atomic<int>, or by protecting read/write access with a mutex.
If an object is going to be accessed by multiple threads, then you must make provisions for synchronization. In your case, it will suffice to declare the variable x as an atomic:
#include <atomic>

class Alpha
{
public:
    std::atomic<int> x;
};
This will guarantee that any code which increments x actually uses an atomic read-modify-write (fetch_add), so each thread that increments x gets a unique incremented value.
In the code you posted, it is entirely possible for both threads to get a value of 1 for x if the executions interleave just right.
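
If you prefer the mutex route mentioned above, a minimal sketch could look like this (the lock and the bounded loops are mine, added so the example terminates; the increment and the print happen under the same lock, so each printed pair is consistent):

#include <iostream>
#include <mutex>
#include <thread>

class Alpha
{
public:
    int x = 0;
    std::mutex m; // protects x
};

void Foo(Alpha* alpha)
{
    for (int i = 0; i < 5; ++i)
    {
        std::lock_guard<std::mutex> lock(alpha->m); // RAII lock/unlock
        alpha->x++;
        std::cout << "Foo: alpha.x = " << alpha->x << std::endl;
    }
}

int main()
{
    Alpha alpha;
    std::thread t1(Foo, &alpha);
    std::thread t2(Foo, &alpha);
    t1.join();
    t2.join();
    return 0;
}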

Many reads/one write in atomic variable

Could you help me, please?
Suppose I have p - 1 reader threads and one writer thread, all reading and writing a single atomic int variable. Could it be that, if all the reads and the write occur simultaneously, the write operation has to wait p - 1 times? I have doubts because atomic operations involve a strange lock (in the assembly), and I am afraid it locks the memory where the variable lives. So the write might end up waiting for the p - 1 reads. Could that happen?
Here is some simple code:
#include <algorithm>
#include <atomic>
#include <chrono>
#include <functional>
#include <iostream>
#include <thread>
#include <vector>

std::atomic<int> val;

void writer()
{
    val.store(7);
}

void read()
{
    int tmp = val.load();
    while (tmp == 0)
    {
        std::cout << std::this_thread::get_id() << ": wait" << std::endl;
        tmp = val.load();
    }
    std::cout << std::this_thread::get_id() << " Operation: " << tmp * tmp << std::endl;
}

int main()
{
    val.store(0);
    std::vector<std::thread> v;
    for (int i = 0; i < 1; ++i)
        v.push_back(std::thread(read));
    std::this_thread::sleep_for(std::chrono::milliseconds(77));
    writer();
    std::for_each(v.begin(), v.end(), std::mem_fn(&std::thread::join));
    return 0;
}
Thank you!
A processor instruction that locks the memory bus (one with the LOCK prefix) does not use locking in the usual, high-level sense. It makes threads (the calling one and, possibly, concurrent threads that access the same or nearby memory blocks) a bit slower.
The upper limit of this slowdown depends only on the machine and its architecture.
Normal locks also make threads slower, but the amount of slowdown depends heavily on lock contention, the locking implementation's properties (e.g., fairness), and the code under lock protection. You shouldn't worry about locked memory accesses except for performance reasons.
Actually, the LOCK prefix is not needed for atomic loads and stores. I suspect it appears because the compiler must provide the sequentially consistent memory order, which the atomic .store() and .load() methods enforce by default but which is unnecessary in your example. The most commonly used pattern is:
use relaxed memory order for initialization:
    val.store(0, std::memory_order_relaxed);
use acquire memory order to read the value:
    tmp = val.load(std::memory_order_acquire);
use release memory order to write (change) the value:
    val.store(7, std::memory_order_release);
This will prevent the compiler from emitting instructions with the LOCK prefix.
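
Putting the pattern together, the question's example might look like this (a sketch; the single reader and the sleep are kept from the question):

#include <algorithm>
#include <atomic>
#include <chrono>
#include <functional>
#include <iostream>
#include <thread>
#include <vector>

std::atomic<int> val;

void writer()
{
    val.store(7, std::memory_order_release); // pairs with the acquire loads
}

void read()
{
    int tmp = val.load(std::memory_order_acquire);
    while (tmp == 0)
        tmp = val.load(std::memory_order_acquire); // spin until the writer publishes
    std::cout << std::this_thread::get_id() << " Operation: " << tmp * tmp << std::endl;
}

int main()
{
    val.store(0, std::memory_order_relaxed); // no other thread is running yet
    std::vector<std::thread> v;
    for (int i = 0; i < 1; ++i)
        v.push_back(std::thread(read));
    std::this_thread::sleep_for(std::chrono::milliseconds(77));
    writer();
    std::for_each(v.begin(), v.end(), std::mem_fn(&std::thread::join));
    return 0;
}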

Understanding c++11 memory fences

I'm trying to understand memory fences in C++11. I know there are better ways to do this, atomic variables and so on, but I wondered if this usage was correct. I realize that this program doesn't do anything useful; I just wanted to make sure that the fence functions do what I think they do.
Basically, does the release ensure that any changes made in this thread before the fence are visible to other threads after the fence, and does the fence in the second thread make any changes to the variables visible immediately after it?
Is my understanding correct, or have I missed the point entirely?
#include <iostream>
#include <atomic>
#include <thread>

int a;

void func1()
{
    for (int i = 0; i < 1000000; ++i)
    {
        a = i;
        // Ensure that changes to a to this point are visible to other threads
        atomic_thread_fence(std::memory_order_release);
    }
}

void func2()
{
    for (int i = 0; i < 1000000; ++i)
    {
        // Ensure that this thread's view of a is up to date
        atomic_thread_fence(std::memory_order_acquire);
        std::cout << a;
    }
}

int main()
{
    std::thread t1(func1);
    std::thread t2(func2);
    t1.join(); t2.join();
}
Your usage does not actually ensure the things you mention in your comments. That is, your usage of fences does not ensure that your assignments to a are visible to other threads or that the value you read from a is 'up to date.' This is because, although you seem to have the basic idea of where fences should be used, your code does not actually meet the exact requirements for those fences to "synchronize".
Here's a different example that I think demonstrates correct usage better.
#include <iostream>
#include <atomic>
#include <thread>

std::atomic<bool> flag(false);
int a;

void func1()
{
    a = 100;
    atomic_thread_fence(std::memory_order_release);
    flag.store(true, std::memory_order_relaxed);
}

void func2()
{
    while (!flag.load(std::memory_order_relaxed))
        ;
    atomic_thread_fence(std::memory_order_acquire);
    std::cout << a << '\n'; // guaranteed to print 100
}

int main()
{
    std::thread t1(func1);
    std::thread t2(func2);
    t1.join(); t2.join();
}
The load and store on the atomic flag do not synchronize, because they both use the relaxed memory ordering. Without the fences, this code would be a data race: we're performing conflicting operations on a non-atomic object in different threads, and without the fences and the synchronization they provide, there would be no happens-before relationship between the conflicting operations on a.
However, with the fences we do get synchronization, because we've guaranteed that thread 2 will read the flag written by thread 1 (we loop until we see that value), and since the atomic write is sequenced after the release fence and the atomic read is sequenced before the acquire fence, the fences synchronize. (See §29.8/2 for the specific requirements.)
This synchronization means anything that happens-before the release fence happens-before anything that happens-after the acquire fence. Therefore the non-atomic write to a happens-before the non-atomic read of a.
Things get trickier when you're writing a variable in a loop, because you might establish a happens-before relation for some particular iteration, but not other iterations, causing a data race.
std::atomic<int> f(0);
int a;

void func1()
{
    for (int i = 0; i < 1000000; ++i) {
        a = i;
        atomic_thread_fence(std::memory_order_release);
        f.store(i, std::memory_order_relaxed);
    }
}

void func2()
{
    int prev_val = 0;
    while (prev_val < 1000000) {
        while (true) {
            int new_val = f.load(std::memory_order_relaxed);
            if (prev_val < new_val) {
                prev_val = new_val;
                break;
            }
        }
        atomic_thread_fence(std::memory_order_acquire);
        std::cout << a << '\n';
    }
}
This code still causes the fences to synchronize, but it does not eliminate data races. For example, if f.load() happens to return 10, then we know that a=1, a=2, ..., a=10 have all happened-before that particular cout<<a, but we don't know that cout<<a happens-before a=11. Those are conflicting operations on different threads with no happens-before relation: a data race.
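
One way to remove that remaining race, as a sketch (my change, not from the answer above): make a itself atomic, so the unpaired iterations are no longer undefined behavior. Relaxed ordering suffices, because the fences still provide the cross-thread ordering.

#include <atomic>
#include <iostream>
#include <thread>

std::atomic<int> f(0);
std::atomic<int> a{0}; // now atomic: concurrent reads/writes are defined

void func1()
{
    for (int i = 1; i <= 1000000; ++i) {
        a.store(i, std::memory_order_relaxed);
        atomic_thread_fence(std::memory_order_release);
        f.store(i, std::memory_order_relaxed);
    }
}

void func2()
{
    int prev_val = 0;
    while (prev_val < 1000000) {
        int new_val = f.load(std::memory_order_relaxed);
        if (prev_val < new_val) {
            prev_val = new_val;
            atomic_thread_fence(std::memory_order_acquire);
            // Reading a concurrently is now well-defined; the fence pairing
            // still guarantees we see at least the value stored before f was set.
            std::cout << a.load(std::memory_order_relaxed) << '\n';
        }
    }
}

int main()
{
    std::thread t1(func1);
    std::thread t2(func2);
    t1.join(); t2.join();
}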
Your usage is correct, but insufficient to guarantee anything useful.
For example, the compiler is free to internally implement a = i; like this if it wants to:
while (a != i)
{
    ++a;
    atomic_thread_fence(std::memory_order_release);
}
So the other thread may see any values at all.
Of course, the compiler would never implement a simple assignment like that. However, there are cases where similarly perplexing behavior actually is an optimization, so it's a very bad idea to rely on ordinary code being implemented internally in any particular way. This is why we have things like atomic operations, and why fences only produce guaranteed results when used together with such operations.