Synchronization in OpenMP - C++

I am trying to implement a parallel algorithm with OpenMP.
In principle, I have many threads writing and reading different components of a shared vector asynchronously.
There is a for loop in which the threads iterate; when a thread is at, say, line A of the loop it writes to a random component of the shared vector, and when it is at line B it reads a random component of the same shared vector.
It may happen that a thread tries to read a component of the shared vector while that component is being written by another thread.
How do I avoid inconsistency?
I have read about locks and critical sections, but I don't think they are the solution. For example, I can put a lock around line A, where the threads write to the shared vector, but does this prevent inconsistency if at the same time a thread at line B is trying to read that component?

If the vector modifications are very simple single-value assignment operations and are not actually function calls, what you need are probably atomic reads and writes. With atomic operations, a read from an array element that is simultaneously being written to will either return the new value or the previous value; it will never return some kind of a bit mash of the old and the new value. OpenMP provides the atomic construct for that purpose. On certain architectures, including x86, atomics are far more lightweight than critical sections.
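To make that concrete, here is a minimal, hedged sketch of atomic element access (the vector name, size, and indices are made up, not from the question); note that the separate atomic read and atomic write forms require OpenMP 3.1 or later:
#include <vector>

std::vector<int> shared_vec(1024, 0);   // hypothetical shared vector

void work(int writeIdx, int readIdx) {
    // "line A": write one element atomically
    #pragma omp atomic write
    shared_vec[writeIdx] = 42;

    // "line B": read one element atomically; the result is either the old
    // or the new value, never a mix of the two
    int value;
    #pragma omp atomic read
    value = shared_vec[readIdx];
    (void)value;
}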
With more complex modifications you have to use critical sections. Those could be either named or anonymous. The latter are created using
#pragma omp critical
{
    // code block
}
Anonymous critical sections all map to the same synchronisation object, no matter what the position of the construct in the source code, therefore it is possible for unrelated code sections to get synchronised with all possible ill effects, like performance degradation or even unexpected deadlocks. That's why it is advisable to always use named critical sections. For example, the following two code segments will not get synchronised:
// -- thread i --                   // -- thread j --
...                                 ...
#pragma omp critical(foo)    <      #pragma omp critical(foo)
   do_something();           <         do_something();
...                                 ...
#pragma omp critical(bar)           #pragma omp critical(bar)    <
   do_something_else();                do_something_else();      <
...                                 ...
(the code currently being executed by each thread is marked with <)
Note that critical sections bind to all threads of the program, without regard to the team to which the threads belong. This means that even code executing in different parallel regions (a situation that mainly arises when nested parallelism is used) gets synchronised.
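As a small illustration of why names matter, here is a hedged sketch (all names are hypothetical) in which two unrelated shared structures are guarded by differently named critical sections, so updates to one never serialise against updates to the other:
#include <string>
#include <vector>

std::vector<int> histogram(64, 0);          // hypothetical shared data #1
std::vector<std::string> event_log;         // hypothetical shared data #2

void record(int bin, const std::string& msg) {
    #pragma omp critical(update_histogram)  // serialises only histogram updates
    histogram[bin]++;

    #pragma omp critical(update_log)        // serialises only log updates
    event_log.push_back(msg);
}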

Related

How to lock element of array using TBB/OpenMP

I have a very big array that is read and written by several threads. Each thread will only read/write one element of it at a time, so it would be a bad idea to lock the whole array. What I am expecting is something like
// before threads
lock_t Lock[NUM_THREADS];
...
// during threads
get_lock(Lock[thread_id], element_id);
array[element_id]+=10; // some operations
release_lock(Lock[thread_id]);
So my question is, what is the best strategy of designing get_lock and release_lock?
When using OpenMP you can obtain an equivalent behavior using the atomic construct:
// during threads
#pragma omp atomic
array[element_id]+=10; // some operations
Just to give you an idea of its semantics, here is an excerpt from the OpenMP 3.1 standard:
The atomic construct ensures that a specific storage location is accessed atomically, rather than exposing it to the possibility of multiple, simultaneous reading and writing threads that may result in indeterminate values.
On the other hand, if you decide to use Intel TBB, you can take a look at the atomic template class.
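For comparison, here is a rough sketch of the same per-element update without pragmas, using std::atomic; tbb::atomic exposes a similar fetch-and-add interface (the array name and size below are made up):
#include <atomic>
#include <vector>

constexpr int NUM_ELEMENTS = 1000;                    // hypothetical size
std::vector<std::atomic<int>> counters(NUM_ELEMENTS); // each element is individually atomic

void update(int element_id) {
    // same effect as "#pragma omp atomic" on array[element_id] += 10,
    // but expressed on the element type itself
    counters[element_id].fetch_add(10, std::memory_order_relaxed);
}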

OpenMP(C/C++): Efficient way of sharing an unordered_map<string, vector<int>> and a vector<int> between threads

I have a for loop that I would like to make parallel; however, the threads must share an unordered_map and a vector.
Because the for loop is somewhat big, I will post a concise overview of it here so that I can make my main problem clear. Please read the comments.
unordered_map<string, vector<int>> sharedUM;
/*
here I call a function that updates the unordered_map with some
initial data, however the unordered_map will need to be updated by
the threads inside the for loop
*/
vector<int> sharedVector;
/*
the shared vector initially is empty, the threads will
fill it with integers, the order of these integers should be in ascending
order, however I can simply sort the array after all the
threads finish executing so I guess we can assume that the order
does not matter
*/
#pragma omp parallel for
for(int i=0; i<N; i++){
key = generate_a_key_value_according_to_an_algorithm();
std::unordered_map<string, vector<int>>::iterator it = sharedUM.find(key);
/*
according to the data inside it->second (the value),
the thread draws some conclusions, which it then
uses in order to figure out whether
it should run a high complexity algorithm
or not.
*/
bool conclusion = make_conclusion();
if(conclusion == true){
auto results = run_expensive_algorithm();
/*
According to the results,
the thread updates some values of
the key that it previously searched for inside the unordered_map;
this update may help other threads avoid running
the expensive algorithm
*/
}
sharedVector.push_back(i);
}
Initially I left the code as it is, with just that #pragma over the for loop; however, I got some problems with the updates to sharedVector. So I decided to use simple locks in order to force a thread to acquire the lock before writing to the vector. So in my implementation I had something like this:
omp_lock_t sharedVectorLock;
omp_init_lock(&sharedVectorLock);
...
for(...)
...
omp_set_lock(&sharedVectorLock);
sharedVector.push_back(i);
omp_unset_lock(&sharedVectorLock);
...
omp_destroy_lock(&sharedVectorLock);
I ran my application many times and everything seemed to work fine, until I started rerunning it automatically many times and eventually got wrong results. Because I'm very new to the world of OpenMP and threads in general, I wasn't aware that readers must also be locked out while a writer is updating shared data. As you can see, in my application the threads always read some data from the unordered_map in order to make some conclusions and learn things about the key that was assigned to them. What happens, though, if two threads have to work with the same key, and while one thread is trying to read the values of this key, another has reached the point of updating those values? I believe that's where my problem occurs.
However, my main problem right now is that I'm not sure what the best way to avoid such things would be. My system works 99% of the time, but that 1% ruins everything; two threads are rarely assigned the same key, because my unordered_map is usually big.
Would locking the unordered_map do the job? Most likely, but that wouldn't be efficient, because a thread A that wants to work with key x would have to wait for a thread B that is already working with key y, even when y is different from x.
So my main question is, how should I approach this problem? How can I lock the unordered_map if and only if two threads are working with the same key?
Thank you in advance
1 On using locks and mutexes. You must declare and initialise the lock variables outside of the parallel block (before #pragma omp parallel) and then use them inside the parallel block: (1) acquire the lock (this may block if another thread holds it), (2) change the variable with the race condition, (3) release the lock. Finally, destroy the lock after exiting the parallel block. A lock declared inside the parallel block is local to the thread and hence cannot provide synchronisation.
This may explain your problems.
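As a minimal sketch of that lifecycle, reusing the names from the question (assuming a simple parallel for and nothing else touching the vector):
#include <omp.h>
#include <vector>

std::vector<int> sharedVector;
omp_lock_t sharedVectorLock;

void run(int N) {
    omp_init_lock(&sharedVectorLock);          // initialise before the parallel region
    #pragma omp parallel for
    for (int i = 0; i < N; ++i) {
        omp_set_lock(&sharedVectorLock);       // (1) acquire (may block)
        sharedVector.push_back(i);             // (2) touch the shared container
        omp_unset_lock(&sharedVectorLock);     // (3) release
    }
    omp_destroy_lock(&sharedVectorLock);       // destroy after the parallel region
}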
2 On writing into complicated C++ containers. OpenMP was designed originally for simple FORTRAN do loops (similar to C/C++ for loops with integer control variables). Everything more complicated will give you a headache. To be on the safe side, any non-const operation on a C++ container must be performed within a lock (use the same lock for every such operation on the same container) or an omp critical region (use the same name for every such operation on the same container). This includes pop(), push(), etc.; anything but simple reads. This can only remain efficient if such non-const container operations take only a tiny fraction of the total time.
3 If I were you, I wouldn't bother with OpenMP (I have used it but am now regretting it). With C++ you could use TBB, which also comes with some thread-safe but lock-free containers. It also allows you to think in terms of tasks, not threads, which are executed recursively (a parent task spawns child tasks, etc.), but TBB also has simple implementations for parallel for loops, for instance.
An alternative approach would be to use TBB's concurrent_unordered_map.
You don't have to use the rest of TBB's parallelism support (though if you're starting from scratch in C++ it's certainly more "c++-ish" than OpenMP).
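A hedged sketch of what that could look like (names are illustrative, not from the question); note that concurrent insertion and lookup are safe in tbb::concurrent_unordered_map, but mutating the mapped vector of an existing key from several threads would still need its own protection:
#include <tbb/concurrent_unordered_map.h>
#include <string>
#include <vector>

tbb::concurrent_unordered_map<std::string, std::vector<int>> sharedUM;

void record(const std::string& key, int value) {
    // find() and insert() are safe to call concurrently
    auto it = sharedUM.find(key);
    if (it == sharedUM.end())
        it = sharedUM.insert({key, {}}).first;
    // WARNING: push_back on the mapped vector is NOT made thread-safe by the map;
    // guard it separately (e.g. with a lock) if two threads can hit the same key
    it->second.push_back(value);
}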
Maybe this could help:
vector<char> sv(N, 0); // char rather than bool: vector<bool> packs bits, so concurrent writes to different elements could race
replace
sharedVector.push_back(i);
by
sv[i]=true;
this allows you to avoid locks (which are very time consuming), and sharedVector
can easily be rebuilt in sorted order afterwards, e.g.
for(int i=0; i<N; i++){
    if(sv[i]) sharedVector.push_back(i);
}

C++ OpenMP directives

I have a loop that I'm trying to parallelize, and in it I am filling a container, say an STL map. Consider then the simple pseudocode below, where T1 and T2 are arbitrary types, while f and g are functions of an integer argument returning types T1 and T2 respectively:
#pragma omp parallel for schedule(static) private(i) shared(c)
for(i = 0; i < N; ++i) {
    c.insert(std::make_pair<T1,T2>(f(i), g(i)));
}
This looks rather straightforward and seems like it should be trivially parallelizable, but it doesn't speed up as I expected. On the contrary, it leads to run-time errors in my code due to unexpected values being filled in the container, likely due to race conditions. I've even tried putting barriers and what not, but all to no avail. The only thing that allows it to work is to use a critical directive as below:
#pragma omp parallel for schedule(static) private(i) shared(c)
for(i = 0; i < N; ++i) {
    #pragma omp critical
    {
        c.insert(std::make_pair<T1,T2>(f(i), g(i)));
    }
}
But this sort of defeats the whole point of using OpenMP in the above example, since only one thread at a time executes the bulk of the loop (the container insert statement). What am I missing here? Short of changing the way the code is written, can somebody kindly explain?
This particular example you have is not a good candidate for parallelism unless f() and g() are extremely expensive function calls.
STL containers are not thread-safe. That's why you're getting the race conditions. So accessing them needs to be synchronized - which makes your insertion process inherently sequential.
As the other answer mentions, there's a LOT of overhead for parallelism. So unless f() and g() are extremely expensive, your loop doesn't do enough work to offset the overhead of parallelism.
Now assuming f() and g() are extremely expensive calls, then your loop can be parallelized like this:
#pragma omp parallel for schedule(static) private(i) shared(c)
for(i = 0; i < N; ++i) {
    std::pair<T1,T2> p = std::make_pair<T1,T2>(f(i), g(i));
    #pragma omp critical
    {
        c.insert(p);
    }
}
Running multithreaded code makes you think about thread safety and shared access to your variables. As soon as you start inserting into c from multiple threads, the collection must be prepared to handle such "simultaneous" calls and keep its data consistent; are you sure it is made this way?
Another thing is that parallelization has its own overhead, and you are not going to gain anything by running a very small task on multiple threads: with the cost of splitting and synchronization, you might end up with an even higher total execution time for the task.
c will obviously have data races, as you guessed. An STL map is not thread-safe. Calling its insert method concurrently from multiple threads will have very unpredictable behavior, mostly just a crash.
Yes, to avoid the data races, you must have either (1) a mutex like #pragma omp critical, or (2) a concurrent data structure (a.k.a. a lock-free data structure). However, not all data structures can be lock-free on current hardware. For example, TBB provides tbb::concurrent_hash_map. If you don't need the keys to be ordered, you may use it and could get some speedup, as it does not use a conventional mutex.
In the case where you can use just a hash table and the table is very big, you could take a reduction-like approach (see this link for the concept of reduction). Hash tables do not care about the order of insertion. In this case, you allocate one hash table per thread and let each thread insert N/#threads items in parallel, which will give a speedup. Lookups can also easily be done by accessing these tables in parallel.
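Here is a minimal sketch of that reduction-like approach under the assumptions above (the key/value types and the stand-in functions f() and g() are made up): each thread fills its own private table, and the tables are merged sequentially afterwards:
#include <omp.h>
#include <unordered_map>
#include <vector>

static int f(int i) { return i; }          // stand-ins for the expensive
static int g(int i) { return i * i; }      // key/value computations

std::unordered_map<int, int> parallel_fill(int N) {
    // one private table per thread; no locking needed while filling
    std::vector<std::unordered_map<int, int>> partial(omp_get_max_threads());

    #pragma omp parallel
    {
        std::unordered_map<int, int>& mine = partial[omp_get_thread_num()];
        #pragma omp for schedule(static)
        for (int i = 0; i < N; ++i)
            mine.insert({f(i), g(i)});
    }

    // sequential merge; cheap compared to N expensive f()/g() calls
    std::unordered_map<int, int> result;
    for (const auto& m : partial)
        result.insert(m.begin(), m.end());
    return result;
}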

OpenMP shared data

I'm somewhat new to OpenMP, but I have experience with parallel processing in general. I worked with boost::threads before and now I'm testing around with OpenMP.
The problem is that I don't know how to handle shared data access, because I don't really know what OpenMP does internally with shared data objects inside parallel loops.
What I'm doing right now (and this is working so far): I read files from disk into memory with mmap. I get a char pointer back from the memory-mapping step.
OpenMP can now use this pointer inside an OpenMP parallel for loop and share the data between threads. I'm now able to search for regular-expression matches inside the mapped and shared file, with multiple threads checking every string against a (pretty long) list of regular expressions.
I made this list (a vector containing regexes) private inside the OpenMP loop, so every thread has its own copy of this list.
Here comes the problem:
To dramatically increase the performance of my application, I need to be able to remove (regex) items from this vector once they match a string.
All the other active threads then need to have this item removed from their lists too, as soon as possible.
So I made this list a shared data object inside the OpenMP loop, but now I get segmentation faults at runtime when I try to write (vector.erase(item#)) to the list.
With boost::threads I would just have used a mutex to lock this object while writing/reading it.
But OpenMP seems to handle most of the synchronization itself, so now I wonder what the correct approach to this problem would be when using OpenMP, which is new to me.
For synchronization, you may use #pragma omp critical or you may use OpenMP lock routines (omp_{init,set,unset,destroy}_lock).
The benefits of #pragma omp critical are simplicity and the ability to ignore the pragma when the parallel region is known to be executed by a single thread. Drawbacks are applicability only to a single parallel region, and the global effect within this region: no other thread can execute any other critical section in the region.
OpenMP lock routines are similar to most other available locks, e.g. those of pthreads or Boost (RAII aside). You initialize a lock object, then use it to protect certain critical sections, and destroy it when it is no longer necessary. These locks can be used to protect access to data from different parallel regions, to build a distributed locking scheme, etc.; but a certain amount of overhead is always incurred, and usage is certainly more "hairy" compared to #pragma omp critical.
However, I would challenge the design of the parallel solution. Erasing an element from the middle of a vector invalidates all iterators and moves elements. Erase is supposedly a rare operation (otherwise the choice of vector would be questionable even in serial code, I think), but due to the above effects you have to protect all reads of the vector too, and that will likely be costly. Read/write locks could give some relief, but those are not available in OpenMP, so you would need to use either platform-specific interfaces or a 3rd-party library.
I think the following will potentially work better:
You keep the regex vectors private, and add a shared vector of flags, of the same size, that indicates whether a certain regex is still valid or not.
Before applying a certain regex from a private vector, the code checks in the shared vector whether this regex has been "erased" by some other thread. If it has, the regex is skipped.
After finding a match, the code marks the element of the shared vector that corresponds to the current regex as "erased" so that it will be ignored from now on.
In this scheme, there exist races for reading and writing the flags: a flag might be set to "erased" the very moment after it was read as "valid" by another thread. As a result, two different threads may concurrently find a match for the same regex. However, I believe this problem also exists in your current solution, where all regex containers are private, as well as in a solution with a shared container and locks or RW locks, unless a non-RW lock also protects the operation with a given regex. In case multiple matches are a problem, it all should be rethought.
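A hedged sketch of this scheme (all names are illustrative): each thread works on a private copy of the regex list, and a shared array of flags marks patterns that have already matched; atomic reads and writes keep the flag accesses well defined while accepting the benign race described above:
#include <cstddef>
#include <regex>
#include <string>
#include <vector>

void scan(const std::vector<std::string>& lines,
          const std::vector<std::regex>& patterns) {
    std::vector<int> erased(patterns.size(), 0);        // shared flags, 0 = still valid

    #pragma omp parallel
    {
        std::vector<std::regex> myPatterns = patterns;  // private copy, as in the question
        #pragma omp for
        for (long i = 0; i < static_cast<long>(lines.size()); ++i) {
            for (std::size_t p = 0; p < myPatterns.size(); ++p) {
                int done;
                #pragma omp atomic read                 // OpenMP 3.1 atomic read form
                done = erased[p];
                if (done) continue;                     // skip regexes "erased" by any thread
                if (std::regex_search(lines[i], myPatterns[p])) {
                    #pragma omp atomic write
                    erased[p] = 1;                      // mark as erased for everyone
                }
            }
        }
    }
}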
You can achieve this by creating a critical section.
#pragma omp critical
{
...some synchronized code...
}
EDIT:
Removed the part about '#pragma omp atomic' as it cannot atomically perform the operations needed.

Proper use of "atomic directive" to lock STL container

I have a large number of sets of integers, which I have, in turn, put into a vector of pointers. I need to be able to update these sets of integers in parallel without causing a race condition. More specifically, I am using OpenMP's "parallel for" construct.
For dealing with shared resources, OpenMP offers a handy "atomic directive", which allows one to avoid a race condition on a specific piece of memory without using locks. It would be convenient if I could use the atomic directive to prevent simultaneous updating of my integer sets; however, I'm not sure whether this is possible.
Basically, I want to know whether the following code could lead to a race condition
vector< set<int>* > membershipDirectory(numSets, new set<int>);
#pragma omp for schedule(guided, expandChunksize)
for(int i = 0; i < 100; i++)
{
    set<int>* sp = membershipDirectory[rand() % numSets];
    #pragma omp atomic
    sp->insert(45);
}
Note that I use a random integer for the index, because in my application, any thread might access any index (there is a random element in my larger application, but I need not go into details).
I have seen a similar example of this for incrementing an integer, but I'm not sure whether it works when working with a pointer to a container as in my case.
After searching around, I found the OpenMP C and C++ API manual on openmp.org, and in section 2.6.4, the limitations of the atomic construct are described.
Basically, the atomic directive can only be used with the following operators:
Unary:
++, -- (prefix and postfix)
Binary:
+,-,*,/,^,&,|,<<,>>
So I will just use locks!
(In some situations critical sections might be preferable, but in my case locks will provide fine-grained access to the shared resource, yielding better performance than a critical section.)
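For completeness, here is a hedged sketch of that fine-grained locking, assuming one omp_lock_t per set (names are illustrative; std::rand() is used only to mirror the question and is not guaranteed to be thread-safe):
#include <omp.h>
#include <cstdlib>
#include <set>
#include <vector>

void fill(std::vector<std::set<int>>& membershipDirectory) {
    std::vector<omp_lock_t> locks(membershipDirectory.size());
    for (auto& l : locks) omp_init_lock(&l);     // one lock per set

    #pragma omp parallel for schedule(guided)
    for (int i = 0; i < 100; ++i) {
        auto idx = std::rand() % membershipDirectory.size();
        omp_set_lock(&locks[idx]);               // only this one set is locked
        membershipDirectory[idx].insert(45);
        omp_unset_lock(&locks[idx]);
    }

    for (auto& l : locks) omp_destroy_lock(&l);
}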
You should not use atomic where the expression is a function call; it only applies to simple expressions (possibly including built-ins such as power and square root).
Instead, use a critical section (either named or default).
Your code is not clear. Assuming that membershipDirectory[5] is actually membershipDirectory[i], the atomic directive is not needed. With two processors, for example, OpenMP produces two threads: one handles the interval i = 0-49, the other 50-99. In this case, there is no need to protect membershipDirectory[i]. The atomic directive is required to protect some common resource that does not depend on the loop index, for example a running total.