Many reads/one write in atomic variable - c++

Could you help me please.
Suppose I have p - 1 read threads and one write thread. They all read and write in one atomic int variable. Could it be that if all reads and write occur simultaneously the write operation will wait p - 1 time? I have doubts because when atomic operation happens there is some strange lock(in assembler) and I afraid that it locks memory(where variable is). So it could happen that write operation will wait for p-1 reads. Could it happen?
Here is some simple code:
#include <atomic>
#include <iostream>
#include <thread>
#include <vector>
std::atomic<int> val;
void writer()
{
val.store(7);
}
void read()
{
int tmp = val.load();
while (tmp == 0)
{
std::cout << std::this_thread::get_id() << ": wait" << std::endl;
tmp = val.load();
}
std::cout << std::this_thread::get_id() << " Operation: " << tmp * tmp << std::endl;
}
int main()
{
val.store(0);
std::vector<std::thread> v;
for (int i = 0; i < 1; ++i)
v.push_back(std::thread(read));
std::this_thread::sleep_for(std::chrono::milliseconds(77));
writer();
std::for_each(v.begin(), v.end(), std::mem_fn(&std::thread::join));
return 0;
}
Thank you!

Processor's instruction, which locks memory bus(has LOCK prefix), does not use locking in usual, high-level sence. It makes threads(caller one and, probably, some concurrent threads which access same or near memory blocks) a bit slower.
The upper limit of this bit is only depends from machine and its architecture.
Normal locks also make threads slower, but amount of this slowerness highly depends from lock contention, locking implementation properties(e.g., fairness), and code under lock protection. You shouldn't bother about locked memory access except because of perfomance reason.
Actually, LOCK prefix doesn't need for atomic loads/stores. I guess, it is a compiler optimization, which provides sequential consistent memory order. This order is enforced by .store() and .load() atomic's methods by default, but it is unnecessary in your example. The mostly used pattern is:
use relaxed memory order for initialization:
val.store(0, std::memory_order_relaxed);
use acquire memory order for read value:
tmp = val.load(std::memory_order_acquire);
use release memory order for write(change) value:
val.store(7, std::memory_order_release);
This will prevent compiler from using instructions with LOCK prefix.

Related

Can I use no thread synchronization here?

I am researching mutexes.
I come up with this example that seems to work without any synchronization.
#include <cstdint>
#include <thread>
#include <iostream>
constexpr size_t COUNT = 10000000;
int g_x = 0;
void p1(){
for(size_t i = 0; i < COUNT; ++i){
++g_x;
}
}
void p2(){
int a = 0;
for(size_t i = 0; i < COUNT; ++i){
if (a > g_x){
std::cout << "Problem detected" << '\n';
}
a = g_x;
}
}
int main(){
std::thread t1{ p1 };
std::thread t2{ p2 };
t1.join();
t2.join();
std::cout << g_x << '\n';
}
My assumptions are following:
Thread 1 change the value of g_x, but it is the only thread that change the value, so theoretically this suppose to be OK.
Thread 2 reads the value of g_x. Reads suppose to be atomic on x86 and ARM. So there must be no problem there too. I have example with several read threads and it works OK too.
With other words, write is not shared and reads are atomic.
Are the assumptions correct?
There's certainly a data race here: g_x is not an std::atomic; it is written to by one thread, and read from by another. So the results are undefined.
Note that the CPU memory model is only part of the deal. The compiler might do all sorts of optimizations (using registers, reordering etc.) if you don't declare your shared variables properly.
As for mutexes, you do not need one here. Declaring g_x as atomic should remove the UB and guarantee proper communication between the threads. Btw, the for in p2 can probably be optimized out even if you're using atomics, but I assume this is just a reduced code and not the real thing.

Why is std::mutex faster than std::atomic?

I want to put objects in std::vector in multi-threaded mode. So I decided to compare two approaches: one uses std::atomic and the other std::mutex. I see that the second approach is faster than the first one. Why?
I use GCC 4.8.1 and, on my machine (8 threads), I see that the first solution requires 391502 microseconds and the second solution requires 175689 microseconds.
#include <vector>
#include <omp.h>
#include <atomic>
#include <mutex>
#include <iostream>
#include <chrono>
int main(int argc, char* argv[]) {
const size_t size = 1000000;
std::vector<int> first_result(size);
std::vector<int> second_result(size);
std::atomic<bool> sync(false);
{
auto start_time = std::chrono::high_resolution_clock::now();
#pragma omp parallel for schedule(static, 1)
for (int counter = 0; counter < size; counter++) {
while(sync.exchange(true)) {
std::this_thread::yield();
};
first_result[counter] = counter;
sync.store(false) ;
}
auto end_time = std::chrono::high_resolution_clock::now();
std::cout << std::chrono::duration_cast<std::chrono::microseconds>(end_time - start_time).count() << std::endl;
}
{
auto start_time = std::chrono::high_resolution_clock::now();
std::mutex mutex;
#pragma omp parallel for schedule(static, 1)
for (int counter = 0; counter < size; counter++) {
std::unique_lock<std::mutex> lock(mutex);
second_result[counter] = counter;
}
auto end_time = std::chrono::high_resolution_clock::now();
std::cout << std::chrono::duration_cast<std::chrono::microseconds>(end_time - start_time).count() << std::endl;
}
return 0;
}
I don't think your question can be answered referring only to the standard- mutexes are as platform-dependent as they can be. However, there is one thing, that should be mentioned.
Mutexes are not slow. You may have seen some articles, that compare their performance against custom spin-locks and other "lightweight" stuff, but that's not the right approach - these are not interchangeable.
Spin locks are considerably fast, when they are locked (acquired) for a relatively short amount of time - acquiring them is very cheap, but other threads, that are also trying to lock, are active for whole this time (running constantly in loop).
Custom spin-lock could be implemented this way:
class SpinLock
{
private:
std::atomic_flag _lockFlag;
public:
SpinLock()
: _lockFlag {ATOMIC_FLAG_INIT}
{ }
void lock()
{
while(_lockFlag.test_and_set(std::memory_order_acquire))
{ }
}
bool try_lock()
{
return !_lockFlag.test_and_set(std::memory_order_acquire);
}
void unlock()
{
_lockFlag.clear();
}
};
Mutex is a primitive, that is much more complicated. In particular, on Windows, we have two such primitives - Critical Section, that works in per-process basis and Mutex, which doesn't have such limitation.
Locking mutex (or critical section) is much more expensive, but OS has the ability to really put other waiting threads to "sleep", which improves performance and helps tasks scheduler in efficient resources management.
Why I write this? Because modern mutexes are often so-called "hybrid mutexes". When such mutex is locked, it behaves like a normal spin-lock - other waiting threads perform some number of "spins" and then heavy mutex is locked to prevent from wasting resources.
In your case, mutex is locked in each loop iteration to perform this instruction:
second_result[counter] = omp_get_thread_num();
It looks like a fast one, so "real" mutex may never be locked. That means, that in this case your "mutex" can be as fast as atomic-based solution (because it becomes an atomic-based solution itself).
Also, in the first solution you used some kind of spin-lock-like behaviour, but I am not sure if this behaviour is predictable in multi-threaded environment. I am pretty sure, that "locking" should have acquire semantics, while unlocking is a release op. Relaxed memory ordering may be too weak for this use case.
I edited the code to be more compact and correct. It uses the std::atomic_flag, which is the only type (unlike std::atomic<> specializations), that is guaranteed to be lock-free (even std::atomic<bool> does not give you that).
Also, referring to the comment below about "not yielding": it is a matter of specific case and requirements. Spin locks are very important part of multi-threaded programming and their performance can often be improved by slightly modifying its behavior. For example, Boost library implements spinlock::lock() as follows:
void lock()
{
for( unsigned k = 0; !try_lock(); ++k )
{
boost::detail::yield( k );
}
}
source: boost/smart_ptr/detail/spinlock_std_atomic.hpp
Where detail::yield() is (Win32 version):
inline void yield( unsigned k )
{
if( k < 4 )
{
}
#if defined( BOOST_SMT_PAUSE )
else if( k < 16 )
{
BOOST_SMT_PAUSE
}
#endif
#if !BOOST_PLAT_WINDOWS_RUNTIME
else if( k < 32 )
{
Sleep( 0 );
}
else
{
Sleep( 1 );
}
#else
else
{
// Sleep isn't supported on the Windows Runtime.
std::this_thread::yield();
}
#endif
}
[source: http://www.boost.org/doc/libs/1_66_0/boost/smart_ptr/detail/yield_k.hpp]
First, thread spins for some fixed number of times (4 in this case). If mutex is still locked, pause instruction is used (if available) or Sleep(0) is called, which basically causes context-switch and allows scheduler to give another blocked thread a chance to do something useful. Then, Sleep(1) is called to perform actual (short) sleep. Very nice!
Also, this statement:
The purpose of a spinlock is busy waiting
is not entirely true. The purpose of spinlock is to serve as a fast, easy-to-implement lock primitive - but it still needs to be written properly, with certain possible scenarios in mind. For example, Intel says (regarding Boost's usage of _mm_pause() as a method of yielding inside lock()):
In the spin-wait loop, the pause intrinsic improves the speed at which
the code detects the release of the lock and provides especially
significant performance gain.
So, implementations like
void lock() { while(m_flag.test_and_set(std::memory_order_acquire)); }
may not be as good as it seems.
There is an additional important issue related to your problem. An efficient spinlock never "spins" on an operation that involves (even potential) modification of a memory location (such as exchange or test_and_set). On typical modern architectures, these operations generate instructions that require the cache line with a lock memory location to be in the exclusive state, which is extremely time-consuming (especially, when multiple threads are spinning at the same time). Always spin on load/read only and try to acquire the lock only when there is a chance that this operation will succeed.
A nice relevant article is, for instance, here: Correctly implementing a spinlock in C++

Boost Mutex Scoped Lock

I was reading through a Boost Mutex tutorial on drdobbs.com, and found this piece of code:
#include <boost/thread/thread.hpp>
#include <boost/thread/mutex.hpp>
#include <boost/bind.hpp>
#include <iostream>
boost::mutex io_mutex;
void count(int id)
{
for (int i = 0; i < 10; ++i)
{
boost::mutex::scoped_lock
lock(io_mutex);
std::cout << id << ": " <<
i << std::endl;
}
}
int main(int argc, char* argv[])
{
boost::thread thrd1(
boost::bind(&count, 1));
boost::thread thrd2(
boost::bind(&count, 2));
thrd1.join();
thrd2.join();
return 0;
}
Now I understand the point of a Mutex is to prevent two threads from accessing the same resource at the same time, but I don't see the correlation between io_mutex and std::cout. Does this code just lock everything within the scope until the scope is finished?
Now I understand the point of a Mutex is to prevent two threads from accessing the same resource at the same time, but I don't see the correlation between io_mutex and std::cout.
std::cout is a global object, so you can see that as a shared resource. If you access it concurrently from several threads, those accesses must be synchronized somehow, to avoid data races and undefined behavior.
Perhaps it will be easier for you to notice that concurrent access occurs by considering that:
std::cout << x
Is actually equivalent to:
::operator << (std::cout, x)
Which means you are calling a function that operates on the std::cout object, and you are doing so from different threads at the same time. std::cout must be protected somehow. But that's not the only reason why the scoped_lock is there (keep reading).
Does this code just lock everything within the scope until the scope is finished?
Yes, it locks io_mutex until the lock object itself goes out of scope (being a typical RAII wrapper), which happens at the end of each iteration of your for loop.
Why is it needed? Well, although in C++11 individual insertions into cout are guaranteed to be thread-safe, subsequent, separate insertions may be interleaved when several threads are outputting something.
Keep in mind that each insertion through operator << is a separate function call, as if you were doing:
std::cout << id;
std::cout << ": ";
std::cout << i;
std::cout << endl;
The fact that operator << returns the stream object allows you to chain the above function calls in a single expression (as you have done in your program), but the fact that you are having several separate function calls still holds.
Now looking at the above snippet, it is more evident that the purpose of this scoped lock is to make sure that each message of the form:
<id> ": " <index> <endl>
Gets printed without its parts being interleaved with parts from other messages.
Also, in C++03 (where insertions into cout are not guaranteed to be thread-safe) , the lock will protect the cout object itself from being accessed concurrently.
A mutex has nothing to do with anything else in the program
(except a conditional variable), at least at a higher level.
A mutex has two effeccts: it controls program flow, and prevents
multiple threads from executing the same block of code
simultaneously. It also ensures memory synchronization. The
important issue here, is that mutexes aren't associated with
resources, and don't prevent two threads from accessing the same
resource at the same time. A mutex defines a critical section
of code, which can only be entered by one thread at a time. If
all of the use of a particular resource is done in critical
sections controled by the same mutex, then the resource is
effectively protected by the mutex. But the relationship is
established by the coder, by ensuring that all use does take
place in the critical sections.

C++ - Threads without coordinating mechanism like mutex_Lock

I attended one interview two days back. The interviewed guy was good in C++, but not in multithreading. When he asked me to write a code for multithreading of two threads, where one thread prints 1,3,5,.. and the other prints 2,4,6,.. . But, the output should be 1,2,3,4,5,.... So, I gave the below code(sudo code)
mutex_Lock LOCK;
int last=2;
int last_Value = 0;
void function_Thread_1()
{
while(1)
{
mutex_Lock(&LOCK);
if(last == 2)
{
cout << ++last_Value << endl;
last = 1;
}
mutex_Unlock(&LOCK);
}
}
void function_Thread_2()
{
while(1)
{
mutex_Lock(&LOCK);
if(last == 1)
{
cout << ++last_Value << endl;
last = 2;
}
mutex_Unlock(&LOCK);
}
}
After this, he said "these threads will work correctly even without those locks. Those locks will reduce the efficiency". My point was without the lock there will be a situation where one thread will check for(last == 1 or 2) at the same time the other thread will try to change the value to 2 or 1. So, My conclusion is that it will work without that lock, but that is not a correct/standard way. Now, I want to know who is correct and in which basis?
Without the lock, running the two functions concurrently would be undefined behaviour because there's a data race in the access of last and last_Value Moreover (though not causing UB) the printing would be unpredictable.
With the lock, the program becomes essentially single-threaded, and is probably slower than the naive single-threaded code. But that's just in the nature of the problem (i.e. to produce a serialized sequence of events).
I think the interviewer might have thought about using atomic variables.
Each instantiation and full specialization of the std::atomic template defines an atomic type. Objects of atomic types are the only C++ objects that are free from data races; that is, if one thread writes to an atomic object while another thread reads from it, the behavior is well-defined.
In addition, accesses to atomic objects may establish inter-thread synchronization and order non-atomic memory accesses as specified by std::memory_order.
[Source]
By this I mean the only thing you should change is remove the locks and change the lastvariable to std::atomic<int> last = 2; instead of int last = 2;
This should make it safe to access the last variable concurrently.
Out of curiosity I have edited your code a bit, and ran it on my Windows machine:
#include <iostream>
#include <atomic>
#include <thread>
#include <Windows.h>
std::atomic<int> last=2;
std::atomic<int> last_Value = 0;
std::atomic<bool> running = true;
void function_Thread_1()
{
while(running)
{
if(last == 2)
{
last_Value = last_Value + 1;
std::cout << last_Value << std::endl;
last = 1;
}
}
}
void function_Thread_2()
{
while(running)
{
if(last == 1)
{
last_Value = last_Value + 1;
std::cout << last_Value << std::endl;
last = 2;
}
}
}
int main()
{
std::thread a(function_Thread_1);
std::thread b(function_Thread_2);
while(last_Value != 6){}//we want to print 1 to 6
running = false;//inform threads we are about to stop
a.join();
b.join();//join
while(!GetAsyncKeyState('Q')){}//wait for 'Q' press
return 0;
}
and the output is always:
1
2
3
4
5
6
Ideone refuses to run this code (compilation errors)..
Edit: But here is a working linux version :) (thanks to soon)
The interviewer doesn't know what he is talking about. Without the locks you get races on both last and last_value. The compiler could for example reorder the assignment to last before the print and increment of last_value, which could lead to the other thread executing on stale data. Furthermore you could get interleaved output, meaning things like two numbers not being seperated by a linebreak.
Another thing, which could go wrong is that the compiler might decide not to reload last and (less importantly) last_value each iteration, since it can't (safely) change between those iterations anyways (since data races are illegal by the C++11 standard and aren't acknowledged in previous standards). This means that the code suggested by the interviewer actually has a good chance of creating infinite loops of doing absoulutely doing nothing.
While it is possible to make that code correct without mutices, that absolutely needs atomic operations with appropriate ordering constraints (release-semantics on the assignment to last and acquire on the load of last inside the if statement).
Of course your solution does lower efficiency due to effectivly serializing the whole execution. However since the runtime is almost completely spent inside the streamout operation, which is almost certainly internally synchronized by the use of locks, your solution doesn't lower the efficiency anymore then it already is. Waiting on the lock in your code might actually be faster then busy waiting for it, depending on the availible resources (the nonlocking version using atomics would absolutely tank when executed on a single core machine)

Understanding c++11 memory fences

I'm trying to understand memory fences in c++11, I know there are better ways to do this, atomic variables and so on, but wondered if this usage was correct. I realize that this program doesn't do anything useful, I just wanted to make sure that the usage of the fence functions did what I thought they did.
Basically that the release ensures that any changes made in this thread before the fence are visible to other threads after the fence, and that in the second thread that any changes to the variables are visible in the thread immediately after the fence?
Is my understanding correct? Or have I missed the point entirely?
#include <iostream>
#include <atomic>
#include <thread>
int a;
void func1()
{
for(int i = 0; i < 1000000; ++i)
{
a = i;
// Ensure that changes to a to this point are visible to other threads
atomic_thread_fence(std::memory_order_release);
}
}
void func2()
{
for(int i = 0; i < 1000000; ++i)
{
// Ensure that this thread's view of a is up to date
atomic_thread_fence(std::memory_order_acquire);
std::cout << a;
}
}
int main()
{
std::thread t1 (func1);
std::thread t2 (func2);
t1.join(); t2.join();
}
Your usage does not actually ensure the things you mention in your comments. That is, your usage of fences does not ensure that your assignments to a are visible to other threads or that the value you read from a is 'up to date.' This is because, although you seem to have the basic idea of where fences should be used, your code does not actually meet the exact requirements for those fences to "synchronize".
Here's a different example that I think demonstrates correct usage better.
#include <iostream>
#include <atomic>
#include <thread>
std::atomic<bool> flag(false);
int a;
void func1()
{
a = 100;
atomic_thread_fence(std::memory_order_release);
flag.store(true, std::memory_order_relaxed);
}
void func2()
{
while(!flag.load(std::memory_order_relaxed))
;
atomic_thread_fence(std::memory_order_acquire);
std::cout << a << '\n'; // guaranteed to print 100
}
int main()
{
std::thread t1 (func1);
std::thread t2 (func2);
t1.join(); t2.join();
}
The load and store on the atomic flag do not synchronize, because they both use the relaxed memory ordering. Without the fences this code would be a data race, because we're performing conflicting operations a non-atomic object in different threads, and without the fences and the synchronization they provide there would be no happens-before relationship between the conflicting operations on a.
However with the fences we do get synchronization because we've guaranteed that thread 2 will read the flag written by thread 1 (because we loop until we see that value), and since the atomic write happened after the release fence and the atomic read happens-before the acquire fence, the fences synchronize. (see ยง 29.8/2 for the specific requirements.)
This synchronization means anything that happens-before the release fence happens-before anything that happens-after the acquire fence. Therefore the non-atomic write to a happens-before the non-atomic read of a.
Things get trickier when you're writing a variable in a loop, because you might establish a happens-before relation for some particular iteration, but not other iterations, causing a data race.
std::atomic<int> f(0);
int a;
void func1()
{
for (int i = 0; i<1000000; ++i) {
a = i;
atomic_thread_fence(std::memory_order_release);
f.store(i, std::memory_order_relaxed);
}
}
void func2()
{
int prev_value = 0;
while (prev_value < 1000000) {
while (true) {
int new_val = f.load(std::memory_order_relaxed);
if (prev_val < new_val) {
prev_val = new_val;
break;
}
}
atomic_thread_fence(std::memory_order_acquire);
std::cout << a << '\n';
}
}
This code still causes the fences to synchronize but does not eliminate data races. For example if f.load() happens to return 10 then we know that a=1,a=2, ... a=10 have all happened-before that particular cout<<a, but we don't know that cout<<a happens-before a=11. Those are conflicting operations on different threads with no happens-before relation; a data race.
Your usage is correct, but insufficient to guarantee anything useful.
For example, the compiler is free to internally implement a = i; like this if it wants to:
while(a != i)
{
++a;
atomic_thread_fence(std::memory_order_release);
}
So the other thread may see any values at all.
Of course, the compiler would never implement a simple assignment like that. However, there are cases where similarly perplexing behavior is actually an optimization, so it's a very bad idea to rely on ordinary code being implemented internally in any particular way. This is why we have things like atomic operations and fences only produce guaranteed results when used with such operations.