I have a few threads writing into a vector. It's possible that different threads try to write to the same byte. There are no reads. Can I use only an atomic_fetch_or(), as in the example, so that the vector becomes thread safe? It compiled with GCC without errors or warnings.
std::vector<std::atomic<uint8_t>> MapVis(1024 * 1024);

void threador()
{
    ...
    std::atomic_fetch_or(&MapVis[i], testor1);
}
It compiled with GCC without errors or warnings
That doesn't mean anything because compilers don't perform that sort of concurrency analysis. There are dedicated static analysis tools that may do this with varying levels of success.
Can I use only an atomic_fetch_or ...
You certainly can, and it will be safe at the level of each individual std::atomic<uint8_t>.
... the vector will become thread safe?
It's not sufficient that each element is accessed safely. You specifically need to avoid any operation that invalidates iterators (swap, resize, insert, push_back, etc.).
I'd hesitate to say the vector is thread-safe in this context - but you're limiting yourself to a thread-safe subset of its interface, so it will work correctly.
Note that as VTT suggests, keeping a separate partial vector per thread is better if possible. Partly because it's easier to prove correct, and partly because it avoids false sharing between cores.
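For illustration, a rough sketch of that per-thread approach (the thread count, fill_part, and the merge step are all assumptions of mine, not part of the original code):

#include <cstdint>
#include <functional>
#include <thread>
#include <vector>

constexpr std::size_t kSize = 1024 * 1024;
constexpr unsigned kThreads = 4;

// Hypothetical worker: each thread sets bits only in its own private buffer,
// so there is no sharing (true or false) while the threads run.
void fill_part(std::vector<std::uint8_t>& local, unsigned id)
{
    for (std::size_t i = id; i < local.size(); i += kThreads)
        local[i] |= 0x01;
}

std::vector<std::uint8_t> merged_map()
{
    std::vector<std::vector<std::uint8_t>> parts(kThreads, std::vector<std::uint8_t>(kSize));
    std::vector<std::thread> pool;
    for (unsigned t = 0; t < kThreads; ++t)
        pool.emplace_back(fill_part, std::ref(parts[t]), t);
    for (auto& th : pool)
        th.join();

    std::vector<std::uint8_t> result(kSize);
    for (auto const& part : parts)          // single-threaded merge: no races
        for (std::size_t i = 0; i < kSize; ++i)
            result[i] |= part[i];
    return result;
}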
Yes, this is guaranteed to be thread safe, because atomic operations come with a guarantee of:
Isolation from interrupts, signals, concurrent processes and threads
Thus when you access an element of MapVis atomically, you're guaranteed that any other thread writing to it has already completed, and that your thread will not be interrupted until it finishes writing.
The concern if you were using non-atomic variables would be that:
Thread A fetches the value of MapVis[i]
Thread B fetches the value of MapVis[i]
Thread A writes the ored value to MapVis[i]
Thread B writes the ored value to MapVis[i]
As you can see, Thread B needed to wait until Thread A had finished writing; otherwise it will simply stomp on Thread A's change to MapVis[i]. With atomic variables, the fetch and the write cannot be interrupted by concurrent threads, meaning that Thread B cannot interrupt Thread A's read-modify-write operation.
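For comparison, a minimal sketch of the safe version of that read-modify-write (the relaxed ordering is an assumption of mine that holds only because, as in the question, nothing reads the bytes until the writer threads are joined):

#include <atomic>
#include <cstdint>

std::atomic<std::uint8_t> cell{0};

// fetch_or performs the read-modify-write as one indivisible step, so a bit
// set by one thread can never be lost to another thread's simultaneous OR.
void set_bit(std::uint8_t mask)
{
    cell.fetch_or(mask, std::memory_order_relaxed);
    // relaxed suffices here only because no thread reads the result
    // until after all writers have been joined (as in the question)
}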
I'm just exploring the use of acquire and release memory fences, and I don't understand why I sometimes get an output of zero instead of the value of 2 every time.
I ran the program a number of times, and assumed the atomic store before the release barrier and the atomic load after the acquire barrier would ensure the values always synchronise.
#include <iostream>
#include <thread>
#include <atomic>

std::atomic<int> x;

void write()
{
    x.store(2, std::memory_order_relaxed);
    std::atomic_thread_fence(std::memory_order_release);
}

void read()
{
    std::atomic_thread_fence(std::memory_order_acquire);
    // THIS DOES NOT GIVE THE EXPECTED VALUE OF 2 SOMETIMES
    std::cout << x.load(std::memory_order_relaxed) << std::endl;
}

int main()
{
    std::thread t1(write);
    std::thread t2(read);
    t1.join();
    t2.join();
    return 0;
}
The atomic variable x sometimes gives a value of 0.
I think you are misunderstanding the purpose of fences. Fences only enforce a certain ordering of memory operations for the compiler and processor within a single thread of execution. Your acquire fence will not magically make the thread wait until the other thread performs the release.
Some literature will describe that a release operation in one thread "synchronizes with" a subsequent acquire operation in another thread. The key to this is that the acquire action is a subsequent action (i.e. the acquire is ordered "after" the release). If the release action is ordered after your acquire action, then there is no synchronizes-with relation between the write and read operations.
The reason why your code doesn't consistently return what you're expecting is because the thread interleavings sometimes order the write before the read, sometimes the read before the write.
If you want to guarantee that thread t2 reads the value 2 that thread t1 publishes, you're going to have to force t2 to wait for the publish to happen. The textbook example almost invariably uses a guard variable that notifies t2 that the data is ready to be consumed.
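A minimal sketch of that textbook guard-variable pattern, applied to the code above (the ready flag and the busy-wait are my additions, not part of the original code):

#include <atomic>
#include <iostream>
#include <thread>

std::atomic<int> x{0};
std::atomic<bool> ready{false};   // guard variable

void write()
{
    x.store(2, std::memory_order_relaxed);
    ready.store(true, std::memory_order_release);   // publish
}

void read()
{
    while (!ready.load(std::memory_order_acquire))  // wait for the publish
        ;                                           // busy-wait, for brevity
    std::cout << x.load(std::memory_order_relaxed) << std::endl; // always 2
}

int main()
{
    std::thread t1(write);
    std::thread t2(read);
    t1.join();
    t2.join();
}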
I recommend you read a very well-written blog post about release and acquire semantics and the synchronizes-with relation at Preshing on Programming's The Synchronizes-With Relation.
Looks like you are misusing the fence. You are trying to use it as a mutex, right? If you expect the code to always output 2, you are assuming that the load operation can never be executed before the store. But that is not what a memory fence does; that is what synchronization primitives do.
Fences are much trickier: they merely prevent the compiler/processor from reordering certain types of operations within one thread. At the end of the day, the order of execution of two separate threads is undefined.
The reason is simple: your fences accomplish exactly nothing, and cannot have any use here anyway, because there is no write that the fence would make visible (on the release side) to the acquiring side.
The simple answer is that the reading thread can run first and obviously will not see any write if it does.
The longer answer is that when your code has a race, as does any code that uses mutexes or atomics in a non-trivial way, it must be prepared for any race outcome! So you have to make sure that not reading the value written by the write will not break your code.
ADDITIONAL EXPLANATION
A way to explain release/acquire semantics is:
release means "I have accomplished something" and I set that atomic object to some value to publish that claim;
acquire means "have you accomplished something?" and I read that atomic object to see if it contains the claim.
So a release before you have accomplished anything is meaningless, and an acquire that throws away the information containing the claim, as in (void)x.load(memory_order_acquire), is generally meaningless, as there is no knowledge (in general) of what was acquired, which is to say of what was accomplished. (The exception to that rule is a thread that has done relaxed loads or RMW operations.)
Assume that I have code like:
void InitializeComplexClass(ComplexClass* c);

class Foo {
public:
    Foo() {
        i = 0;
        InitializeComplexClass(&c);
    }
private:
    ComplexClass c;
    int i;
};
If I now do something like Foo f; and hand a pointer to f over to another thread, what guarantees do I have that any stores done by InitializeComplexClass() will be visible to the CPU executing the other thread that accesses f? What about the store writing zero into i? Would I have to add a mutex to the class, take a writer lock on it in the constructor, and take corresponding reader locks in any methods that access the members?
Update: Assume I hand a pointer over to a bunch of other threads once the constructor has returned. I'm not assuming that the code is running on x86; it could instead be running on something like PowerPC, which has a lot of freedom to reorder memory operations. I'm essentially interested in what sorts of memory barriers the compiler has to inject into the code when the constructor returns.
In order for the other thread to know about your new object, you have to hand over the object / signal the other thread somehow. For signaling a thread you write to memory. Both x86 and x64 perform all memory writes in order; the CPU does not reorder these operations with respect to each other. This is called "Total Store Ordering", so the CPU write queue works "first in, first out".
Given that you create an object first and then pass it on to another thread, these changes to memory also occur in order, and the other thread will always see them in the same order. By the time the other thread learns about the new object, the contents of this object are guaranteed to have been available to that thread even earlier (if the thread only somehow knew where to look).
In conclusion, you do not have to synchronise anything this time. Handing over the object after it has been initialised is all the synchronisation you need.
Update: On non-TSO architectures you do not have this TSO guarantee, so you need to synchronise. Use the MemoryBarrier() macro (or any interlocked operation), or some synchronisation API. Signalling the other thread through the corresponding API causes synchronisation as well; otherwise it would not be a synchronisation API.
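As a hedged illustration of that update, here is one portable way to do the hand-over with C++11 atomics instead of a platform macro (the Foo stand-in and the spin loop are my illustrative assumptions, not part of the original answer):

#include <atomic>
#include <thread>

struct Foo { int i = 0; };           // stand-in for the class in the question

std::atomic<Foo*> shared_foo{nullptr};

void producer()
{
    Foo* f = new Foo;                                // fully construct first
    shared_foo.store(f, std::memory_order_release);  // then publish: all prior
                                                     // writes become visible
}

void consumer()
{
    Foo* f = nullptr;
    while (!(f = shared_foo.load(std::memory_order_acquire)))
        ;                            // spin until published (for brevity only)
    // f->i and every store made by the constructor are visible here
    delete f;
}

int main()
{
    std::thread t1(producer), t2(consumer);
    t1.join();
    t2.join();
}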
x86 and x64 CPUs may reorder writes past reads, but that is not relevant here. Just for better understanding: writes can be ordered after reads because writes to memory go through a write queue, and flushing that queue may take some time. On the other hand, the read cache is always consistent with the latest updates from other processors (which have gone through their own write queues).
This topic has been made unbelievably confusing for so many, but in the end there are only a couple of things an x86-x64 programmer has to worry about:
- First, the existence of the write queue (and one should not be worried about the read cache at all!).
- Secondly, concurrent writing and reading of the same variable from different threads when the variable is not of atomic length, which may cause data tearing; in that case you need synchronisation mechanisms.
- And finally, concurrent updates to the same variable from multiple threads, for which we have interlocked operations, or again synchronisation mechanisms.
If you do:
Foo f;
// HERE: InitializeComplexClass() and the "i" member init are guaranteed to be completed
passToOtherThread(&f);
/* From this point on, you cannot guarantee the state/members
   of 'f', since another thread can modify it */
If you're passing an instance pointer to another thread, you need to implement guards in order for both threads to interact safely with the same instance. If you ONLY plan to use the instance on the other thread, you do not need to implement guards. However, do not pass a stack pointer as in your example; pass a new instance, like this:
passToOtherThread(new Foo());
And make sure to delete it when you are done with it.
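One hedged way to make that "delete it when you are done" automatic is to move ownership into the thread; this is my sketch, not part of the answer (the init-capture needs C++14, and the lambda body is an assumption):

#include <memory>
#include <thread>

struct Foo { /* as in the question */ };

void spawn()
{
    auto f = std::make_unique<Foo>();
    std::thread t([owned = std::move(f)] {
        // *owned is used exclusively by this thread and is destroyed
        // automatically when the lambda (and thus the thread) finishes
    });
    t.detach();   // or keep the handle and join(), depending on the design
}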
Consider the following implementation of a trivial thread pool written in C++14.
threadpool.h
threadpool.cpp
Observe that each thread sleeps until it's been notified to awaken -- or receives some spurious wake-up call -- and the following predicate evaluates to true:
std::unique_lock<mutex> lock(this->instance_mutex_);
this->cond_handle_task_.wait(lock, [this] {
return (this->destroy_ || !this->tasks_.empty());
});
Furthermore, observe that a ThreadPool object uses the data member destroy_ to determine whether it's being destroyed -- that is, whether the destructor has been called. Toggling this data member to true notifies each worker thread that it's time to finish its current task and any other queued tasks, then synchronize with the thread that's destroying the object; it also prohibits calls to the enqueue member function.
For your convenience, the implementation of the destructor is below:
ThreadPool::~ThreadPool() {
    {
        std::lock_guard<mutex> lock(this->instance_mutex_); // this line.
        this->destroy_ = true;
    }
    this->cond_handle_task_.notify_all();
    for (auto &worker : this->workers_) {
        worker.join();
    }
}
Q: I do not understand why it's necessary to lock the object's mutex while toggling destroy_ to true in the destructor. Furthermore, is it only necessary for setting its value or is it also necessary for accessing its value?
BQ: Can this thread pool implementation be improved or optimized while maintaining its original purpose: a thread pool that can pool N threads and distribute tasks among them to be executed concurrently?
This thread pool implementation is forked from Jakob Progsch's C++11 thread pool repository with a thorough code step through to understand the purpose behind its implementation and some subjective style changes.
I am introducing myself to concurrent programming and there is still much to learn -- I am a novice concurrent programmer as it stands right now. If my questions are not worded correctly then please make the appropriate correction(s) in your provided answer. Moreover, if the answer can be geared towards a client who is being introduced to concurrent programming for the first time then that would be best -- for myself and any other novices as well.
If the owning thread of the ThreadPool object is the only thread that atomically writes to the destroy_ variable, and the worker threads only atomically read from it, then no, a mutex is not needed to protect destroy_ in the ThreadPool destructor. Typically a mutex is necessary when an atomic set of operations must take place that can't be accomplished with a single atomic instruction on the platform (i.e., operations beyond an atomic swap, etc.). That being said, the author of the thread pool may be trying to force some type of acquire semantics on the destroy_ variable without resorting to atomic operations (i.e., a memory fence operation), and/or the setting of the flag itself may not be an atomic operation on the platform. Some other options include declaring the variable volatile to prevent it from being cached, etc. You can see this thread for more info.
Without some sort of synchronization operation in place, the worst-case scenario could end up with a worker that never completes because the destroy_ variable is cached on its thread. On platforms with weaker memory ordering models, that's always a possibility if you allow a benign memory race condition to exist...
C++ defines a data race as multiple threads potentially accessing an object simultaneously with at least one of those accesses being a write. Programs with data races have undefined behavior. If you were to write to destroy_ in your destructor without holding the mutex, your program would have undefined behavior and we could not predict what would happen.
If you were to read destroy_ elsewhere without holding the mutex, that read could potentially happen while the destructor is writing to it, which is also a data race.
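To make that concrete, here is a condensed sketch of the worker side (names follow the question's code; the exit check at the end is my illustrative assumption):

// Worker loop body: destroy_ is only ever read while instance_mutex_ is held,
// so this read and the destructor's write can never form a data race.
std::unique_lock<mutex> lock(this->instance_mutex_);
this->cond_handle_task_.wait(lock, [this] {
    return (this->destroy_ || !this->tasks_.empty());
});
if (this->destroy_ && this->tasks_.empty())   // still under the lock
    return;                                   // worker thread exits cleanly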
In my program I have some threads running. Each thread gets a pointer to some object (in my program, a vector), and each thread modifies the vector.
Sometimes my program fails with a segmentation fault. I thought it occurred because thread A begins doing something with the vector while thread B hasn't finished operating on it. Can that be true?
How am I supposed to fix it? Thread synchronization? Or maybe a flag VectorIsInUse, set to true while the vector is being operated on?
vector, like all STL containers, is not thread-safe. You have to explicitly manage the synchronization yourself. A std::mutex or boost::mutex could be used to synchronize access to the vector.
Do not use a flag as this is not thread-safe:
Thread A checks value of isInUse flag and it is false
Thread A is suspended
Thread B checks value of isInUse flag and it is false
Thread B sets isInUse to true
Thread B is suspended
Thread A is resumed
Thread A still thinks isInUse is false and sets it to true
Thread A and Thread B now both have access to the vector
Note that each thread will have to lock the vector for the entire time it needs to use it. This includes both modifying the vector and using the vector's iterators, since iterators can become invalidated if the element they refer to is erase()'d or if the vector undergoes an internal reallocation. For example, do not:
mtx.lock();
std::vector<std::string>::iterator i = the_vector.begin();
mtx.unlock();
// 'i' can become invalid if the `vector` is modified.
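Instead, hold the lock across the whole operation, iterators included. A minimal sketch (the function name and loop body are my illustrative assumptions):

#include <mutex>
#include <string>
#include <vector>

std::mutex mtx;
std::vector<std::string> the_vector;

void append_and_scan(const std::string& s)
{
    std::lock_guard<std::mutex> lock(mtx);   // held until end of scope
    the_vector.push_back(s);                 // may reallocate: safe, lock is held
    for (std::vector<std::string>::iterator i = the_vector.begin();
         i != the_vector.end(); ++i) {
        // 'i' stays valid: no other thread can modify the_vector while we hold mtx
    }
}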
If you want a container that is safe to use from many threads, you need to use a container that is explicitly designed for the purpose. The interface of the Standard containers is not designed for concurrent mutation or any kind of concurrency, and you cannot just throw a lock at the problem.
You need something like TBB or PPL, which have concurrent_vector in them.
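A minimal sketch with TBB, assuming it is installed and linked (e.g. with -ltbb); the worker function and counts are my illustrative assumptions:

#include <tbb/concurrent_vector.h>
#include <thread>

tbb::concurrent_vector<int> cv;

// concurrent_vector allows concurrent growth: push_back is safe to call from
// many threads, and existing elements are never relocated as it grows.
void worker(int id)
{
    for (int i = 0; i < 1000; ++i)
        cv.push_back(id);
}

int main()
{
    std::thread a(worker, 1), b(worker, 2);
    a.join();
    b.join();
    // cv.size() == 2000; the interleaving of elements is unspecified
}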
That's why pretty much every class library that offers threads also has synchronization primitives such as mutexes/locks. You need to set one of these up and acquire/release the lock around every operation on the shared item (read AND write operations, since you need to prevent reads from occurring during a write too, not just prevent multiple writes from happening concurrently).
If I lock a std::mutex, will I always get a memory fence? I am unsure whether locking implies or enforces a fence.
Update:
Found this reference following up on RMF's comments.
Multithreaded programming and memory visibility
As I understand it, this is covered in:
1.10 Multi-threaded executions and data races
Para 5:
The library defines a number of atomic operations (Clause 29) and operations on mutexes (Clause 30)
that are specially identified as synchronization operations. These operations play a special role in making assignments in one thread visible to another. A synchronization operation on one or more memory locations is either a consume operation, an acquire operation, a release operation, or both an acquire and release operation. A synchronization operation without an associated memory location is a fence and can be either an acquire fence, a release fence, or both an acquire and release fence. In addition, there are relaxed atomic operations, which are not synchronization operations, and atomic read-modify-write operations, which have special characteristics. [Note: For example, a call that acquires a mutex will perform an acquire operation on the locations comprising the mutex. Correspondingly, a call that releases the same mutex will perform a release operation on those same locations. Informally, performing a release operation on A forces prior side effects on other memory locations to become visible to other threads that later perform a consume or an acquire operation on A. “Relaxed” atomic operations are not synchronization operations even though, like synchronization operations, they cannot contribute to data races. —end note]
Unlocking a mutex synchronizes with locking the mutex. I don't know what options the compiler has for the implementation, but you get the same effect as a fence.
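A small sketch of what that buys you in practice (the variable names and functions are my illustrative assumptions):

#include <mutex>

std::mutex m;
int shared_data = 0;      // protected by m
bool data_ready = false;  // protected by m

void producer()
{
    std::lock_guard<std::mutex> lock(m);
    shared_data = 42;     // both writes happen-before the unlock (a release)
    data_ready = true;
}

bool consumer(int& out)
{
    std::lock_guard<std::mutex> lock(m);  // the lock is an acquire operation
    if (!data_ready)
        return false;
    out = shared_data;    // guaranteed to see 42 whenever data_ready is true
    return true;
}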
A mutex operation (lock or unlock) on a particular mutex M is only useful for any purpose related to synchronization or memory visibility if M is shared by different threads and they perform these operations. A mutex defined locally and only used by one thread does not provide any meaningful synchronization.
[Note: The optimizations I describe here are probably not done by many compilers, which might view these mutex and atomic synchronization operations as "black boxes" that cannot be optimized (or even that should not be optimized, in order to preserve the predictability of code generation and some particular patterns, which is a bogus argument). I would not be surprised if no compiler did the optimization even in the simpler case, but there is no doubt that the optimizations are legal.]
A compiler can easily determine that some variables are never used by multiple threads (or by any asynchronous execution), notably for an automatic variable whose address is not taken (and no reference to it made). Such objects are called "thread private" here. (All automatic variables that are candidates for register allocation are thread private.)
For a thread-private mutex, no meaningful code needs to be generated for lock/unlock operations: no atomic compare-and-exchange, no fencing, and often no state needs to be kept at all, except in the case of a "safe mutex" where the behavior of recursive locking is well defined and should fail (to make the sequence m.lock(); bool locked = m.try_lock(); work, you need to keep at least a boolean of state).
This is also true for any thread-private atomic object: only the naked non-atomic type is needed, and normal operations can be performed (so a fetch-add of 1 becomes a regular post-increment).
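A hedged sketch of such a thread-private atomic that a compiler could, in principle, lower to plain operations (the function is my illustrative assumption):

#include <atomic>

// The atomic below is thread private: its address never escapes this function,
// so a compiler could legally drop all atomicity and fencing and compile the
// fetch_add as an ordinary increment on a plain int.
int count_even(const int* v, int n)
{
    std::atomic<int> count{0};   // never visible to another thread
    for (int i = 0; i < n; ++i)
        if (v[i] % 2 == 0)
            count.fetch_add(1);  // may legally become a plain ++count
    return count.load();
}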
The reasons why these transformations are legal:
the obvious observation that if an object is accessed by only one thread or parallel execution (it isn't even accessed by an asynchronous signal handler), the use of non-atomic operations in assembly is not detectable by any means;
the less obvious remark that no ordering/memory visibility is implied by any use of a thread-private synchronization object.
All synchronization objects are specified as tools for inter thread communication: they can guarantee that the side effects in one thread are visible in another thread; they cause a well defined order of operations to exist not just in one thread (the sequential order of execution of operations of one thread) but in multiple threads.
A common example is the publication of information with an atomic pointer type:
The shared data is:
atomic<T*> shared; // null pointer by default
The publishing thread does:
T *p = new T;
*p = load_info();
shared.store(p, memory_order_release);
The consuming thread can check whether the data is available by loading the atomic object value, as a consumer:
T *p = shared.load(memory_order_acquire);
if (p) use *p;
(There is no defined way to wait for availability here, it's a simple example to illustrate publication and consumption of the published value.)
The publishing thread only needs to set the atomic variable after it has finished the initialization of all fields; the memory order is a release, to communicate the fact that the memory manipulations are finished.
The other threads only need an acquire memory order to "connect" with the release operation, if there was one. If the value is still null, the thread has learned nothing about the world and the acquire is meaningless; it can't act on it. (By the time the thread checks the pointer and sees a null, the shared variable might have been changed already. It doesn't matter, as the designer considered that not having a value in that thread is manageable, or it would have performed the operations in a sequence.)
All atomic operations are intended to be possibly lock-free, that is, to finish in a short finite time whatever other threads are doing, even if they are stuck. It means that you can't depend on another thread having finished a job.
At the other end of the spectrum of thread-communication primitives, mutexes don't hold a value that can be used to carry information between threads (*), but they ensure that one thread can enter a lock-ops-unlock sequence only after another thread has finished its own lock-ops-unlock sequence.
(*) not even a boolean value, as using a mutex as a general boolean signal (= binary semaphore) between threads is specifically prohibited
A mutex is always used in connection with a set of shared variables: the protected variables or objects V. These V are used to carry information between threads, and the mutex makes access to that information well ordered (or serialized) between threads. In technical terms, all but the first lock operation on M pairs with the previous unlock operation on M:
a lock of M is an acquire operation on M
an unlock of M is a release operation on M
The semantics of locking/unlocking are defined on a single M, so let's stop repeating "on M"; we have threads A and B. The lock by B is an acquire that pairs with the unlock by A. Both operations together form an inter-thread synchronization.
[What about a thread that happens to lock M often and will often re-lock M without any other thread acting on M in the meantime? Nothing interesting: the acquire is still paired with a release, but A = B, so nothing is accomplished. The unlock was sequenced in the same thread of execution, so it's pointless in that particular case, but in general a thread can't tell that it's pointless. It isn't even special-cased by the language semantics.]
The synchronization that occurs is between the set of threads T acting on a mutex: no other thread is guaranteed to be able to view any memory operation performed by those in T. Note that in practice, on most real computers, once a memory modification hits the cache, all CPUs will see it if they check that same address, by the power of cache coherency. But C/C++ threads(#) are not specified in terms of a globally consistent cache, nor in terms of the ordering visible on a CPU, as the compiler itself can assume that non-atomic objects are not mutated in arbitrary ways by the program without synchronization (the CPU can assume no such thing, as it has no concept of atomic vs. non-atomic memory locations). That means that a guarantee available on the CPU/memory system you are targeting is not in general available in the C/C++ high-level model. You absolutely cannot use normal C/C++ code as high-level assembly; only by dousing your code with volatile (almost everywhere) can you even vaguely approach high-level assembly (and even then not quite).
(#) "C/C++ threads" here means the thread semantics shared by C and C++, not "the C/C++ programming language's thread semantics": C and C++ are based on the same specification for synchronization primitives, but that doesn't mean there is a C/C++ language.
Since the effect of mutex operations on M is only to serialize access to some data by the threads that use M, it's clear that other threads don't see any effect. In technical terms, the synchronizes-with relation is between threads using the same synchronization objects (mutexes in this context; the same atomic objects in the context of atomics).
Even when the compiler emits memory fences in assembly language, it doesn't have to assume that an unlock operation makes the changes performed before the unlock visible to threads outside the set T.
That allows the decomposition of sets of threads for program analysis: if a program runs two sets of threads U and V in parallel, and U and V are created such that they can't access any common synchronization object (though they can access common non-atomic objects), then you can analyse the interactions of U and of V separately from the point of view of thread semantics, as U and V cannot exchange information in well-defined inter-thread ways (they can still exchange information via the system, for example via disk files, sockets, or system-specific shared memory).
(That observation might allow the compiler to optimize some threads without doing a whole-program analysis, even if some common mutable objects are "pulled in" via a third-party class that has static members.)
Another way to explain this is to say that the semantics of these primitives do not leak: only those threads that participate get a defined result.
Note that this is only true at the specification level for acquire and release operations, not for sequentially consistent operations (which are the default ordering for operations on atomic objects when you don't specify a memory order): all sequentially consistent actions (operations on an atomic object, or fences) occur in a well-defined global order. That, however, doesn't mean anything for independent threads that have no atomic objects in common.
An order of operations is unlike an order of elements in a container, where you can really navigate the container, or like saying that files are presented ordered by name. Only objects are observable; orders of operations are not. Saying that there is a well-defined order only means that values don't appear to have provably changed backward (vs. some abstract order).
If you have two unrelated ordered sets (say, the integers with the usual order and the words with lexicographical order), you can define the sum of these sets as having an order compatible with both orders. You might put the numbers before, after, or alternated with the words. You are free to do what you want, because the elements in the sum of the two sets don't have any relation with each other when they don't come from the same set.
You could say that there is a global order of all mutex operations; it's just not useful, like defining an order on the sum of unrelated sets.