How to lock element of array using TBB/OpenMP - c++

I have a very big array that is read from and written to by several threads. Each thread reads or writes only one element at a time, so locking the whole array would be a bad idea. What I am expecting is something like
// before threads
lock_t Lock[NUM_THREADS];
...
// during threads
get_lock(Lock[thread_id], element_id);
array[element_id]+=10; // some operations
release_lock(Lock[thread_id]);
So my question is: what is the best strategy for designing get_lock and release_lock?

When using OpenMP you can obtain an equivalent behavior using the atomic construct:
// during threads
#pragma omp atomic
array[element_id]+=10; // some operations
Just to give you an idea of its semantics, here is an excerpt from the OpenMP 3.1 standard:
The atomic construct ensures that a specific storage location is
accessed atomically, rather than exposing it to the possibility of
multiple, simultaneous reading and writing threads that may result in
indeterminate values.
On the other hand, if you decide to use Intel TBB, you can take a look at the atomic template class.
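For illustration, here is a minimal sketch of the per-element update using the classic tbb::atomic template (note: tbb::atomic has since been deprecated in oneTBB in favor of std::atomic, and the array size here is an assumption):

#include "tbb/atomic.h"

const int NUM_ELEMENTS = 1000000;     // assumed size
tbb::atomic<int> array[NUM_ELEMENTS]; // zero-initialized at namespace scope

void add_ten(int element_id)
{
    // fetch_and_add performs the whole read-modify-write atomically,
    // so no per-element lock is needed
    array[element_id].fetch_and_add(10);
}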

Related

Is it possible to create an atomic vector or array in C++?

I have some code which uses an array of int (int[]) in a thread which is activated every second.
I use lock() from std::mutex to lock this array in this thread.
However I wonder if there is a way to create an atomic array (or vector) to avoid using a mutex? I tried a couple of ways, but the compiler always complains somehow?
I know there is a way to create an array of atomics but this is not the same.
In practice, at the CPU level, there are instructions which can atomically update an int, and a good compiler will use these for std::atomic<int>. In contrast, there are no instructions which can atomically update a vector of ints (on any architecture I am aware of), so there has got to be a mutex of some sort somewhere. You might as well let it be your mutex.
For future readers who haven't yet written code with the mutex:
You can't create a std::atomic of int[10], because that leads to a function which returns an array - and you can't have those. What you can do is have a std::atomic<std::array<int,10>>:
#include <array>
#include <atomic>

int main()
{
    std::atomic<std::array<int, 10>> myArray; // compiles, but see the caveats below
}
Note that the compiler/library will create a mutex under the hood to make this atomic. Note further that this doesn't do what you want. It allows you to set the value of the whole array atomically.
It doesn't allow you to read the whole array, update one element, and write the whole array back atomically.
The reads and the writes will be individually atomic, but another thread can get in between the read and the write.
You need the mutex!
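For future reference, here is a minimal sketch of the mutex-based version (the names are illustrative):

#include <array>
#include <cstddef>
#include <mutex>

std::array<int, 10> sharedArray{};
std::mutex arrayMutex;

void addTo(std::size_t i, int delta)
{
    std::lock_guard<std::mutex> guard(arrayMutex); // released when guard leaves scope
    sharedArray[i] += delta;                       // the whole read-modify-write is protected
}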
You can put arrays in atomics, but not directly. As the other answer explains, you can use std::array. I answered this question and explained how to do something similar for a struct.
Having said that and explained the technical viability, I have to tell you something else:
PLEASE DON'T DO THAT
The power of atomic variables comes from the fact that processors can perform their operations in a single instruction. The C++ compiler will try to make your atomic operations happen in one instruction. If it can't - and for a whole array it can't - the implementation falls back to a hidden lock that serializes every access to that object, which is much like a mutex around the entire array. If you're concerned about performance, don't do that!
So for your case, a mutex is not a bad idea. At least you can control what is critical and improve performance.
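If in doubt whether a given std::atomic specialization is lock-free on your platform, you can ask it at run time; a small sketch:

#include <array>
#include <atomic>
#include <iostream>

int main()
{
    std::atomic<int> single{0};
    std::atomic<std::array<int, 10>> whole{};
    std::cout << single.is_lock_free() << '\n'; // typically 1: a single CPU instruction
    std::cout << whole.is_lock_free() << '\n';  // typically 0: a hidden lock is used
}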

How do I make memory stores in one thread "promptly" visible in other threads?

Suppose I wanted to copy the contents of a device register into a variable that would be read by multiple threads. Is there a good general way of doing this? Here are two possible methods:
#include <atomic>
volatile int * const Device_reg_ptr = reinterpret_cast<int *>(0x666);
// This variable is read by multiple threads.
std::atomic<int> device_reg_copy;
// ...
// Method 1
const_cast<volatile std::atomic<int> &>(device_reg_copy)
.store(*Device_reg_ptr, std::memory_order_relaxed);
// Method 2
device_reg_copy.store(*Device_reg_ptr, std::memory_order_relaxed);
std::atomic_thread_fence(std::memory_order_release);
More generally, in the face of possible whole program optimization, how does one correctly control the latency of memory writes in one thread being visible in other threads?
EDIT: In your answer, please consider the following scenario:
The code is running on a CPU in an embedded system.
A single application is running on the CPU.
The application has far fewer threads than the CPU has processor cores.
Each core has a massive number of registers.
The application is small enough that whole program optimization is successfully used when building its executable.
How do we make sure that a store in one thread does not remain invisible to other threads indefinitely?
If you would like to update the value of device_reg_copy in atomic fashion, then device_reg_copy.store(*Device_reg_ptr, std::memory_order_relaxed); suffices.
There is no need to apply volatile to atomic variables; it adds nothing here.
std::memory_order_relaxed store is supposed to incur the least amount of synchronization overhead. On x86 it is just a plain mov instruction.
However, if you would like to update it in such a way, that the effects of any preceding stores become visible to other threads along with the new value of device_reg_copy, then use std::memory_order_release store, i.e. device_reg_copy.store(*Device_reg_ptr, std::memory_order_release);. The readers need to load device_reg_copy as std::memory_order_acquire in this case. Again, on x86 std::memory_order_release store is a plain mov.
If you use the most expensive std::memory_order_seq_cst store, on the other hand, it does insert a memory barrier for you on x86.
This is why they say that the x86 memory model is a bit too strong for C++11: a plain mov instruction already provides std::memory_order_release on stores and std::memory_order_acquire on loads. There is no cheaper relaxed store or load on x86.
I cannot recommend the CPU Cache Flushing Fallacy article highly enough.
The C++ standard is rather vague about making atomic stores visible to other threads:
29.3.12
Implementations should make atomic stores visible to atomic loads within a reasonable amount of time.
That is as detailed as it gets: there is no definition of 'reasonable', and it does not have to be immediate.
Using a stand-alone fence to force a certain memory ordering is not necessary since you can specify those orderings on the atomic operations themselves; the question is rather what you expect a memory fence to achieve.
Fences are designed to enforce ordering on memory operations (between threads), but they do not guarantee visibility in a timely manner.
You can store a value to an atomic variable with the strongest memory ordering (i.e. seq_cst), but even when another thread executes load() at a later time than the store(), you might still get an old value from the cache, and yet (surprisingly) this does not violate the happens-before relationship.
Using a stronger fence might make a difference wrt. timing and visibility, but there are no guarantees.
If prompt visibility is important, I would consider using a Read-Modify-Write (RMW) operation to load the value.
These are atomic operations that read and modify atomically (i.e. in a single call), and they have the additional property that they are guaranteed to operate on the latest value.
But since they have to reach a little further than the local cache, these calls also tend to be more expensive to execute.
As pointed out by Maxim Egorushkin, whether or not you can use weaker memory orderings than the default (seq_cst) depends on whether other memory operations need to be synchronized (made visible) between threads.
That is not clear from your question, but it is generally considered safe to use the default (sequential consistency).
If you are on an unusually weak platform, if performance is problematic, and if you need data synchronization between threads, you could consider using acquire/release semantics:
// thread 1
device_reg_copy.store(*Device_reg_ptr, std::memory_order_release);
// thread 2
int value = device_reg_copy.fetch_add(0, std::memory_order_acquire);
If thread 2 sees the value written by thread 1, it is guaranteed that memory operations prior to the store in thread 1 are visible after the load in thread 2.
Acquire/Release operations form a pair and they synchronize based on a run-time relationship between the store and load. In other words, if thread 2 does not see the value stored by thread 1,
there are no ordering guarantees.
If the atomic variable has no dependencies on any other data, you can use std::memory_order_relaxed; the modification order of a single atomic variable is always guaranteed.
As mentioned by others, there is no need for volatile when it comes to inter-thread communication with std::atomic.
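Pulling the fragments above together, here is a minimal compilable sketch of the release-store/acquire-RMW pairing (the device read is simulated, since the register address in the question is only a placeholder):

#include <atomic>
#include <thread>

std::atomic<int> device_reg_copy{0};

void producer()
{
    int raw = 42; // stand-in for *Device_reg_ptr
    // release store: earlier writes become visible together with the new value
    device_reg_copy.store(raw, std::memory_order_release);
}

void consumer()
{
    // acquire RMW read: pairs with the release store above
    int value = device_reg_copy.fetch_add(0, std::memory_order_acquire);
    (void)value;
}

int main()
{
    std::thread t1(producer);
    std::thread t2(consumer);
    t1.join();
    t2.join();
}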

Synchronization in OpenMP

I am trying to implement a parallel algorithm with OpenMP.
In principle I should have many threads writing and reading different components of a shared vector in an asynchronous way.
There is a FOR loop in which the threads cycle; when a thread is at, say, row A of the loop it writes to a random component of the shared vector, and when it is at row B it reads a random component of the same shared vector.
It may happen that a thread would try to read a component of the shared vector while this component is written by another thread.
How to avoid inconsistency?
I read about locks and critical sections, but I think they are not the solution. For example, I can set a lock around row A, in which the threads write to the shared vector, but does this prevent inconsistency if at the same time a thread at row B is trying to read that component?
If the vector modifications are very simple single-value assignment operations and are not actually function calls, what you need are probably atomic reads and writes. With atomic operations, a read from an array element that is simultaneously being written to will either return the new value or the previous value; it will never return some kind of a bit mash of the old and the new value. OpenMP provides the atomic construct for that purpose. On certain architectures, including x86, atomics are far more lightweight than critical sections.
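For example, assuming the shared vector holds plain ints (v, idxA, idxB, tmp and newValue are illustrative names; the read and write clauses require OpenMP 3.1):

// row A: write one component atomically
#pragma omp atomic write
v[idxA] = newValue;

// row B: read one component atomically
#pragma omp atomic read
tmp = v[idxB];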
With more complex modifications you have to use critical sections. Those could be either named or anonymous. The latter are created using
#pragma omp critical
code block
Anonymous critical sections all map to the same synchronisation object, no matter what the position of the construct in the source code, therefore it is possible for unrelated code sections to get synchronised with all possible ill effects, like performance degradation or even unexpected deadlocks. That's why it is advisable to always use named critical sections. For example, the following two code segments will not get synchronised:
// -- thread i --                   // -- thread j --
...                                 ...
#pragma omp critical(foo)     <     #pragma omp critical(foo)
do_something();               <     do_something();
...                                 ...
#pragma omp critical(bar)           #pragma omp critical(bar)    <
do_something_else();                do_something_else();          <
...                                 ...
(the code currently being executed by each thread is marked with <)
Note that critical sections bind to all threads of the program, without regard to the team to which the threads belong. It means that even code that executes in different parallel regions (a situation that mainly arises when nested parallelism is used) gets synchronised.

OpenMP shared data

I'm somewhat new to OpenMP but have experience with parallel processing in general. I worked with boost::threads before and now I'm experimenting with OpenMP.
The problem is that I don't know how to handle shared data access because I don't really know what OpenMP does internally with shared data objects inside parallel loops.
What I'm doing right now (this is working so far): I read files from disk into memory with mmap. After the memory mapping I receive a char pointer to the data.
OpenMP can now use this pointer inside an OpenMP parallel for loop and share the data between threads. I'm now able to search for regular expression matches inside the mapped and shared file with multiple threads checking every string against a (pretty long) list of regular expressions.
I made this list (a vector containing regexes) private inside the OpenMP loop, so every thread has its own copy of it.
Here comes the problem:
To dramatically increase the performance of my application I need to be able to remove (regex-)items from this vector once they match a string.
All the other active threads then need to have this item removed from their lists too, as soon as possible.
So I made this list a shared data object inside the OpenMP loop, but now I get segmentation faults at runtime when I try to write (vector.erase(item#)) to the list.
With boost::threads I would just have used a mutex for locking this object while writing/reading it.
But OpenMP seems to handle most of the synchronization itself, so I wonder what the correct approach to this problem is in OpenMP, which is new to me.
For synchronization, you may use #pragma omp critical or you may use OpenMP lock routines (omp_{init,set,unset,destroy}_lock).
The benefits of #pragma omp critical are simplicity and the ability to ignore the pragma when the parallel region is known to be executed by a single thread. Drawbacks are applicability only to a single parallel region and its global effect within that region: no other thread can execute any other unnamed critical section in the region at the same time.
OpenMP lock routines are similar to most other available locks, e.g. those of pthreads or Boost (RAII aside). You initialize a lock object, then use it to protect certain critical sections, and destroy it when it is no longer needed. These locks can be used to protect access to data from different parallel regions, to build a distributed locking scheme, etc.; but a certain amount of overhead is always incurred, and usage is certainly more "hairy" compared to #pragma omp critical.
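For illustration, here is the typical life cycle of an OpenMP lock protecting a shared container (a sketch; sharedList stands in for your regex vector):

#include <omp.h>
#include <vector>

std::vector<int> sharedList; // stands in for the shared regex vector
omp_lock_t listLock;

void example()
{
    omp_init_lock(&listLock);        // once, before the parallel region

    #pragma omp parallel for
    for (int i = 0; i < 100; i++)
    {
        omp_set_lock(&listLock);     // blocks until the lock is acquired
        if (!sharedList.empty())
            sharedList.pop_back();   // reads and erases both happen under the lock
        omp_unset_lock(&listLock);
    }

    omp_destroy_lock(&listLock);     // once, when the lock is no longer needed
}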
However, I would challenge the design of the parallel solution. Erasing an element from the middle of a vector invalidates all iterators and moves elements. Erase is supposedly a rare operation (otherwise the choice of vector would be questionable even in serial code, I think), but because of those effects you have to protect all reads of the vector too, and that will likely be costly. Read/write locks could give some relief, but those are not available in OpenMP, so you would need to use either platform-specific interfaces or a 3rd-party library.
I think the following will potentially work better:
You keep regex vectors private, and add a same-size shared vector of flags that indicate whether a certain regex is still valid or not.
Before applying a certain regex from a private vector, the code checks in the shared vector if this regex was not "erased" by some other thread. If it was, the regex is skipped.
After finding a match, the code marks the element of the shared vector that corresponds to the current regex as "erased" so that it will be ignored from now on.
In this scheme there exist races for reading and writing the flags: a flag might be set to "erased" the very next moment after it was read as "valid" by another thread. As a result, two different threads may concurrently find a match for the same regex. However, I believe this problem also exists in your current solution, where all regex containers are private, as well as in a solution with a shared container and locks or RW locks, unless a non-RW lock also protects the operation with a given regex. If multiple matches are a problem, it all should be rethought.
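A sketch of that scheme (std::regex stands in for whatever regex type you actually use; the flag accesses use the OpenMP 3.1 atomic read/write clauses to keep them well-defined):

#include <cstddef>
#include <regex>
#include <string>
#include <vector>

void scan(const std::vector<std::string>& strings, std::vector<std::regex> regexes)
{
    std::vector<int> erased(regexes.size(), 0);    // shared: 1 = regex already matched

    #pragma omp parallel for firstprivate(regexes) // each thread gets its own regex copy
    for (long s = 0; s < (long)strings.size(); ++s)
    {
        for (std::size_t r = 0; r < erased.size(); ++r)
        {
            int skip;
            #pragma omp atomic read
            skip = erased[r];
            if (skip) continue;                    // "erased" by some thread: skip it
            if (std::regex_search(strings[s], regexes[r]))
            {
                #pragma omp atomic write
                erased[r] = 1;                     // marked as erased for everyone
            }
        }
    }
}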
You can achieve this by creating a critical section.
#pragma omp critical
{
...some synchronized code...
}
EDIT:
Removed the part about '#pragma omp atomic' as it cannot atomically perform the operations needed.

Proper use of "atomic directive" to lock STL container

I have a large number of sets of integers, which I have, in turn, put into a vector of pointers. I need to be able to update these sets of integers in parallel without causing a race condition. More specifically, I am using OpenMP's "parallel for" construct.
For dealing with shared resources, OpenMP offers a handy "atomic directive," which allows one to avoid a race condition on a specific piece of memory without using locks. It would be convenient if I could use the "atomic directive" to prevent simultaneous updating of my integer sets; however, I'm not sure whether this is possible.
Basically, I want to know whether the following code could lead to a race condition
vector< set<int>* > membershipDirectory(numSets, new set<int>); // note: this fills every slot with the same pointer
#pragma omp for schedule(guided, expandChunksize)
for (int i = 0; i < 100; i++)
{
    set<int>* sp = membershipDirectory[rand() % numSets]; // any thread might access any index
    #pragma omp atomic
    sp->insert(45);
}
Note that I use a random integer for the index, because in my application, any thread might access any index (there is a random element in my larger application, but I need not go into details).
I have seen a similar example of this for incrementing an integer, but I'm not sure whether it works when working with a pointer to a container as in my case.
After searching around, I found the OpenMP C and C++ API manual on openmp.org, and in section 2.6.4, the limitations of the atomic construct are described.
Basically, the atomic directive can only be used with the following operators:
Unary:
++, -- (prefix and postfix)
Binary:
+, -, *, /, ^, &, |, <<, >>
So I will just use locks!
(In some situations critical sections might be preferable, but in my case locks will provide fine grained access to the shared resource, yielding better performance than a critical section.)
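A sketch of that fine-grained locking with one OpenMP lock per set (names are illustrative; rand() is kept only to mirror the question and is itself not guaranteed to be thread-safe):

#include <omp.h>
#include <cstddef>
#include <cstdlib>
#include <set>
#include <vector>

void update(std::vector<std::set<int>*>& dir)
{
    std::vector<omp_lock_t> locks(dir.size());
    for (std::size_t k = 0; k < locks.size(); ++k)
        omp_init_lock(&locks[k]);

    #pragma omp parallel for
    for (int i = 0; i < 100; i++)
    {
        int idx = std::rand() % (int)dir.size(); // any thread may hit any index
        omp_set_lock(&locks[idx]);               // lock only this one set
        dir[idx]->insert(45);
        omp_unset_lock(&locks[idx]);
    }

    for (std::size_t k = 0; k < locks.size(); ++k)
        omp_destroy_lock(&locks[k]);
}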
You should not use atomic where the expression is a function call; it only applies to simple expressions (possibly with built-ins: power, square root).
Instead, use a critical section (either named or default).
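For example, a named critical section around the insertion from the question (membership_update is just an illustrative name):

set<int>* sp = membershipDirectory[rand() % numSets];
#pragma omp critical(membership_update)
sp->insert(45);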
Your code is not clear. Assuming that membershipDirectory[rand() % numSets] is actually membershipDirectory[i], the atomic directive is not needed. For two processors, for example, OpenMP produces two threads: one handles the interval i = 0-49, the other i = 50-99. In that case there is no need to protect membershipDirectory[i]. The atomic directive is required to protect some common resource which does not depend on the loop index, for example a total sum.