OpenMP shared data - C++

I'm somewhat new to OpenMP but have experience with parallel processing in general. I worked with boost::threads before and am now experimenting with OpenMP.
The problem is that I don't know how to handle shared data access, because I don't really know what OpenMP does internally with shared data objects inside parallel loops.
What I'm doing right now (and it works so far): I read files from disk into memory with mmap and receive a pointer to char after the memory mapping.
OpenMP can now use this pointer inside an OpenMP parallel for loop and share the data between threads. I'm now able to search for regular expression matches inside the mapped and shared file, with multiple threads checking every string against a (pretty long) list of regular expressions.
I made this list (a vector containing regexes) private inside the OpenMP loop, so every thread has its own copy of it.
Here comes the problem:
To dramatically increase performance of my application I need to be able to remove (regex-)items from this vector once they match a string.
Now all the other active threads need to have this item removed from their lists too, as soon as possible.
So I made this list a shared data object inside the OpenMP loop, but now I get segmentation faults at runtime when I try to write (vector.erase(item#)) to the list.
With boost::threads I would just have used a mutex to lock this object while writing/reading it.
But OpenMP seems to handle most of the synchronization itself, so I wonder what the correct approach to this problem is when using OpenMP, which is new to me.

For synchronization, you may use #pragma omp critical or you may use OpenMP lock routines (omp_{init,set,unset,destroy}_lock).
The benefits of #pragma omp critical are simplicity and the ability to ignore the pragma when the parallel region is known to be executed by a single thread. The drawbacks are that it applies only to a single parallel region and that it has a global effect within that region: no other thread can execute any other critical section in that region at the same time.
OpenMP lock routines are similar to most other available locks, e.g. those of pthreads or Boost (RAII aside). You initialize a lock object, then use it to protect certain critical sections, and destroy it when it's no longer needed. These locks can be used to protect access to data from different parallel regions, to build a distributed locking scheme, etc.; but a certain amount of overhead is always incurred, and usage is certainly more "hairy" compared to #pragma omp critical.
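A minimal sketch of the lock-routine pattern (the container and the work inside the loop are placeholders, not your actual code):

#include <omp.h>
#include <vector>

std::vector<int> shared_items = {1, 2, 3, 4};
omp_lock_t items_lock;

int main() {
    omp_init_lock(&items_lock);          // initialize once, before the parallel region

    #pragma omp parallel for
    for (int i = 0; i < 100; ++i) {
        // ... per-thread work ...
        omp_set_lock(&items_lock);       // blocks until the lock is acquired
        if (!shared_items.empty())
            shared_items.pop_back();     // any mutation of the shared container goes here
        omp_unset_lock(&items_lock);
    }

    omp_destroy_lock(&items_lock);       // release resources when the lock is no longer needed
    return 0;
}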
However, I would challenge the design of the parallel solution. Erasing an element from the middle of a vector invalidates all iterators and moves elements. Erase is supposedly a rare operation (otherwise the choice of vector would be questionable even in serial code, I think), but because of those effects you have to protect all reads of the vector too, and that will likely be costly. Read/write locks could give some relief, but they are not available in OpenMP, so you would need to use either platform-specific interfaces or a 3rd-party library.
I think the following will potentially work better:
You keep the regex vectors private, and add a shared vector of flags of the same size that indicates whether a given regex is still valid or not.
Before applying a regex from its private vector, the code checks in the shared vector whether this regex has been "erased" by some other thread. If it has, the regex is skipped.
After finding a match, the code marks the element of the shared vector that corresponds to the current regex as "erased", so that it will be ignored from then on.
In this scheme there are races on reading/writing the flags: a flag might be set to "erased" the very moment after it was read as "valid" by another thread. As a result, two different threads may concurrently find a match for the same regex. However, I believe this problem also exists in your current solution, where all regex containers are private, as well as in a solution with a shared container and locks or RW locks, unless a non-RW lock also protects the work done with a given regex. If multiple matches are a problem, the whole approach should be rethought.
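A minimal sketch of this flag scheme with hypothetical pattern and input data; the flags are stored as char rather than bool so that concurrent writes to distinct elements are well defined, and each flag is accessed with atomic read/write so the remaining race stays benign:

#include <omp.h>
#include <regex>
#include <string>
#include <vector>

int main() {
    // Hypothetical data; in the question these come from the mmap'ed file.
    std::vector<std::string> patterns = { "foo.*", "[0-9]+", "bar$" };
    std::vector<std::string> lines    = { "foo123", "bar", "hello" };

    // One flag per regex; char (not vector<bool>) so that writes to distinct
    // elements from different threads are not a data race.
    std::vector<char> erased(patterns.size(), 0);

    #pragma omp parallel
    {
        // Each thread compiles its own private copy of the regex list.
        std::vector<std::regex> my_regexes;
        for (const auto& p : patterns) my_regexes.emplace_back(p);

        #pragma omp for
        for (int i = 0; i < (int)lines.size(); ++i) {
            for (std::size_t r = 0; r < my_regexes.size(); ++r) {
                char gone;
                #pragma omp atomic read
                gone = erased[r];
                if (gone) continue;                  // some thread already matched this regex

                if (std::regex_search(lines[i], my_regexes[r])) {
                    #pragma omp atomic write
                    erased[r] = 1;                   // mark it so all threads skip it from now on
                }
            }
        }
    }
    return 0;
}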

You can achieve this by creating a critical section.
#pragma omp critical
{
...some synchronized code...
}
EDIT:
Removed the part about '#pragma omp atomic' as it cannot atomically perform the operations needed.

Related

Efficient parallel union of sets with OpenMP

I need to calculate a global std::set (or, equivalently, a global std::unordered_set) in an OpenMP-parallelised program. At the moment every thread has a local std::set, and the union of these is later calculated using
#pragma omp critical //critical as std container inserting is not thread safe
global_set.insert(local_set.begin(), local_set.end());
However this creates an effectively serial section of code, where each thread inserts its local set into the global set one after the other.
How can I improve on this by parallelising the union of the sets? The union is preceded by a large block of work, is there a convenient way to give all threads different amounts of work to let the others work while one is inserting the elements in the set? Or can the union itself be efficiently parallelised (for example by unioning sets in a 'binary tree' fashion)?
You should read up on OpenMP reductions and, in particular, user-defined reductions. That lets you pass the problem to the OpenMP implementation, which will very likely perform the reduction up a tree.
Of course, whether that is beneficial is not clear; it may be that it simply introduces a lot of copying and memory allocation which is still slower than the style of code you show.
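For illustration, a user-defined reduction for this case might look roughly like the following (a sketch assuming a compiler with OpenMP 4.0 support; whether it actually beats the critical-section version has to be measured):

#include <set>

// How two partial sets are merged, and what a thread-private copy starts out as.
#pragma omp declare reduction(set_union : std::set<int> : \
        omp_out.insert(omp_in.begin(), omp_in.end()))      \
        initializer(omp_priv = std::set<int>())

int main() {
    std::set<int> global_set;

    #pragma omp parallel for reduction(set_union : global_set)
    for (int i = 0; i < 1000; ++i) {
        // ... large block of work producing values ...
        global_set.insert(i % 97);   // each thread actually inserts into its private copy
    }
    // After the loop the implementation has merged the private sets
    // (very likely up a tree), and global_set holds the union.
    return 0;
}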

Synchronization in OpenMP

I am trying to implement a parallel algorithm with OpenMP.
In principle I should have many threads writing and reading different components of a shared vector in an asynchronous way.
There is a FOR loop in which the threads cycle, and when a thread is in, let's say, row A of the loop it writes to a random component of the shared vector, while when it is in row B it reads a random component of the same shared vector.
It may happen that a thread would try to read a component of the shared vector while this component is written by another thread.
How to avoid inconsistency?
I read about locks and critical sections, but I think this is not the solution. For example, I can set a lock around row A in which the threads write in the shared vector, but does this prevent inconsistency if at the same time a thread in row B is trying to read that component?
If the vector modifications are very simple single-value assignment operations and are not actually function calls, what you need are probably atomic reads and writes. With atomic operations, a read from an array element that is simultaneously being written to will either return the new value or the previous value; it will never return some kind of a bit mash of the old and the new value. OpenMP provides the atomic construct for that purpose. On certain architectures, including x86, atomics are far more lightweight than critical sections.
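A minimal sketch of what that looks like (the index expressions are just stand-ins for the random components):

#include <vector>

int main() {
    std::vector<double> v(1000, 0.0);

    #pragma omp parallel for
    for (int i = 0; i < 1000; ++i) {
        int w = (i * 31) % 1000;      // stand-in for the "random component" being written (row A)
        int r = (i * 17) % 1000;      // stand-in for the "random component" being read (row B)

        #pragma omp atomic write
        v[w] = 1.5 * i;               // atomic store: readers never see a half-written value

        double x;
        #pragma omp atomic read
        x = v[r];                     // atomic load of a possibly concurrently written element

        (void)x;                      // x would be used in further per-thread work
    }
    return 0;
}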
With more complex modifications you have to use critical sections. Those could be either named or anonymous. The latter are created using
#pragma omp critical
code block
Anonymous critical sections all map to the same synchronisation object, regardless of the position of the construct in the source code, so it is possible for unrelated code sections to get synchronised, with all the possible ill effects, like performance degradation or even unexpected deadlocks. That's why it is advisable to always use named critical sections. For example, the following two code segments will not get synchronised:
// -- thread i --                 // -- thread j --
...                               ...
#pragma omp critical(foo)   <     #pragma omp critical(foo)
do_something();             <     do_something();
...                               ...
#pragma omp critical(bar)         #pragma omp critical(bar)   <
do_something_else();              do_something_else();        <
...                               ...
(the code currently being executed by each thread is marked with <)
Note that critical sections bind to all threads of the program, without regard to the team to which the threads belong. It means that even code that executes in different parallel regions (a situation that mainly arises when nested parallelism is used) gets synchronised.
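For completeness, a small sketch with two hypothetical shared resources protected by differently named critical sections, so updates to one never block updates to the other:

#include <vector>

int main() {
    std::vector<int> results;     // shared resource A
    long long counter = 0;        // shared resource B

    #pragma omp parallel for
    for (int i = 0; i < 1000; ++i) {
        #pragma omp critical(results_lock)   // only serialises against other results_lock sections
        results.push_back(i);

        #pragma omp critical(counter_lock)   // independent of results_lock
        counter += i;
    }
    return 0;
}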

OpenMP(C/C++): Efficient way of sharing an unordered_map<string, vector<int>> and a vector<int> between threads

I have a for loop that I would like to make parallel, however the threads must share an unordered_map and a vector.
Because the for loop is somewhat big I will post here a concise overview of it so that I can make my main problem clear. Please read the comments.
unordered_map<string, vector<int>> sharedUM;
/*
here I call a function that updates the unordered_map with some
initial data, however the unordered_map will need to be updated by
the threads inside the for loop
*/
vector<int> sharedVector;
/*
the shared vector initially is empty, the threads will
fill it with integers, the order of these integers should be in ascending
order, however I can simply sort the array after all the
threads finish executing so I guess we can assume that the order
does not matter
*/
#pragma omp parallel for
for(int i=0; i<N; i++){
    string key = generate_a_key_value_according_to_an_algorithm();
    std::unordered_map<string, vector<int>>::iterator it = sharedUM.find(key);
    /*
        according to the data inside it->second (the value),
        the thread draws some conclusions which it then
        uses in order to figure out whether
        it should run a high-complexity algorithm
        or not.
    */
    bool conclusion = make_conclusion();
    if(conclusion == true){
        auto results = run_expensive_algorithm();
        /*
            According to the results,
            the thread updates some values of
            the key that it previously searched for inside the unordered_map;
            this update may help other threads avoid running
            the expensive algorithm
        */
    }
    sharedVector.push_back(i);
}
Initially I left the code as it is, so I just used that #pragma over the for loop; however, I got a few problems regarding the update of sharedVector. So I decided to use simple locks in order to force a thread to acquire the lock before writing to the vector. So in my implementation I had something like this:
omp_lock_t sharedVectorLock;
omp_init_lock(&sharedVectorLock);
...
for(...)
    ...
    omp_set_lock(&sharedVectorLock);
    sharedVector.push_back(i);
    omp_unset_lock(&sharedVectorLock);
    ...
omp_destroy_lock(&sharedVectorLock);
I ran my application many times and everything seemed to be working great, until I decided to rerun it automatically many times and eventually got wrong results. Because I'm very new to the world of OpenMP and threads in general, I wasn't aware of the fact that we should lock out all readers while a writer is updating some shared data. As you can see, in my application the threads always read some data from the unordered_map in order to draw some conclusions about the key that was assigned to them. What happens, though, if two threads have to work with the same key, and while one thread is trying to read the values of this key, another has reached the point of updating those values? I believe that's where my problem occurs.
However, my main problem right now is that I'm not sure what the best way to avoid such things from happening would be. My system works for 99% of the time, but that 1% ruins everything; two threads are only rarely assigned the same key because my unordered_map is usually big.
Would locking the unordered_map do the job? Most likely, but that wouldn't be efficient, because a thread A that wants to work with key x would have to wait for a thread B that is already working with a different key y to finish.
So my main question is, how should I approach this problem? How can I lock the unordered_map if and only if two threads are working with the same key?
Thank you in advance
1. On using locks and mutexes: you must declare and initialise the lock variables outside of the parallel block (before #pragma omp parallel) and then use them inside the parallel block: (1) acquire the lock (this may block if another thread has locked it), (2) change the variable with the race condition, (3) release the lock. Finally, destroy the lock after exiting the parallel block. A lock declared inside the parallel block is local to the thread and hence cannot provide synchronisation.
This may explain your problems.
2. On writing into complicated C++ containers: OpenMP was originally designed for simple FORTRAN do loops (similar to C/C++ for loops with integer control variables). Everything more complicated will give you a headache. To be on the safe side, any non-const operation on a C++ container must be performed within a lock (use the same lock for every such operation on the same container) or an omp critical region (use the same name for every such operation on the same container). This includes pop(), push(), etc.; anything but simple reads. This can only remain efficient if such non-const container operations take only a tiny fraction of the time (see the sketch after these points).
3. If I were you, I wouldn't bother with OpenMP (I have used it but regret that now). With C++ you could use TBB, which also comes with some thread-safe but lock-free containers. It also allows you to think in terms of tasks, not threads, which are executed recursively (a parent task spawns child tasks, etc.), but TBB also has some simple implementations for parallel for loops, for instance.
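A sketch of point 2 applied to the map in the question (the key generation and the update are stand-ins for the real algorithm; the important part is that every access to sharedUM goes through the same named critical region):

#include <string>
#include <unordered_map>
#include <vector>

int main() {
    std::unordered_map<std::string, std::vector<int>> sharedUM;
    sharedUM["k0"] = {1, 2, 3};    // hypothetical initial data

    #pragma omp parallel for
    for (int i = 0; i < 100; ++i) {
        std::string key = "k" + std::to_string(i % 4);   // stand-in for the key generation

        std::vector<int> snapshot;
        #pragma omp critical(shared_um)          // same name for every access to sharedUM
        {
            auto it = sharedUM.find(key);
            if (it != sharedUM.end())
                snapshot = it->second;           // copy out, then work on it outside the lock
        }

        // ... decide outside the critical region whether to run the expensive algorithm ...

        #pragma omp critical(shared_um)
        sharedUM[key].push_back(i);              // any modification also goes under the same name
    }
    return 0;
}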
An alternative approach would be to use TBB's concurrent_unordered_map.
You don't have to use the rest of TBB's parallelism support (though if you're starting from scratch in C++ it's certainly more "c++-ish" than OpenMP).
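A rough sketch of that alternative, assuming TBB is installed; note that concurrent_unordered_map makes insert and find safe to call concurrently, but it does not protect the vector<int> stored as the mapped value, which still needs its own synchronisation:

#include <tbb/concurrent_unordered_map.h>
#include <string>
#include <vector>

int main() {
    tbb::concurrent_unordered_map<std::string, std::vector<int>> sharedUM;

    #pragma omp parallel for
    for (int i = 0; i < 100; ++i) {
        std::string key = "k" + std::to_string(i % 4);   // stand-in for the key generation

        // insert() and find() may be called concurrently from many threads.
        auto it = sharedUM.find(key);
        if (it == sharedUM.end())
            sharedUM.insert({key, std::vector<int>{}});

        // Mutating the mapped vector is NOT protected by the container; that
        // still needs a lock, an atomic scheme, or per-thread buffers.
    }
    return 0;
}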
Maybe this could help:
vector<char> sv(N, 0);
(char rather than bool, since vector<bool> packs its elements into bits and concurrent writes to different elements from different threads would be a data race.)
Replace
sharedVector.push_back(i);
by
sv[i] = 1;
This avoids locks (which are very time consuming), and sharedVector can easily be built in sorted order afterwards, e.g.
for(int i=0; i<N; i++){
    if(sv[i]) sharedVector.push_back(i);
}

Proper use of "atomic directive" to lock STL container

I have a large number of sets of integers, which I have, in turn, put into a vector of pointers. I need to be able to update these sets of integers in parallel without causing a race condition. More specifically, I am using OpenMP's "parallel for" construct.
For dealing with shared resources, OpenMP offers a handy "atomic directive," which allows one to avoid a race condition on a specific piece of memory without using locks. It would be convenient if I could use the "atomic directive" to prevent simultaneous updating to my integer sets, however, I'm not sure whether this is possible.
Basically, I want to know whether the following code could lead to a race condition
vector< set<int>* > membershipDirectory(numSets, new set<int>);

#pragma omp for schedule(guided, expandChunksize)
for(int i=0; i<100; i++)
{
    set<int>* sp = membershipDirectory[rand() % numSets]; // keep the random index in range
    #pragma omp atomic
    sp->insert(45);
}
Note that I use a random integer for the index, because in my application, any thread might access any index (there is a random element in my larger application, but I need not go into details).
I have seen a similar example of this for incrementing an integer, but I'm not sure whether it works when working with a pointer to a container as in my case.
After searching around, I found the OpenMP C and C++ API manual on openmp.org, and in section 2.6.4, the limitations of the atomic construct are described.
Basically, the atomic directive can only be used with the following operators:
Unary:
++, -- (prefix and postfix)
Binary:
+,-,*,/,^,&,|,<<,>>
So I will just use locks!
(In some situations critical sections might be preferable, but in my case locks will provide fine grained access to the shared resource, yielding better performance than a critical section.)
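For example, a fine-grained scheme could pair each set with its own OpenMP lock (a sketch; the index expression is a stand-in for the random element in the real application, and each slot gets its own set rather than sharing one):

#include <omp.h>
#include <set>
#include <vector>

int main() {
    const int numSets = 100;

    // One set and one lock per slot, so threads touching different slots never contend.
    std::vector<std::set<int>> membershipDirectory(numSets);
    std::vector<omp_lock_t>    locks(numSets);
    for (auto& l : locks) omp_init_lock(&l);

    #pragma omp parallel for schedule(guided)
    for (int i = 0; i < 1000; ++i) {
        int idx = (i * 31) % numSets;        // stand-in for the random index

        omp_set_lock(&locks[idx]);           // serialises access to this one set only
        membershipDirectory[idx].insert(45);
        omp_unset_lock(&locks[idx]);
    }

    for (auto& l : locks) omp_destroy_lock(&l);
    return 0;
}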
You should not use atomic where the expression is a function call; it only applies to simple expressions (possibly with built-ins such as power and square root).
Instead, use a critical section (either named or default).
Your code is not clear. Assuming that membershipDirectory[5] is actually membershipDirectory[i], the atomic directive is not needed. For two processors, for example, OpenMP creates two threads; one handles the interval i = 0-49, the other 50-99. In this case, there is no need to protect membershipDirectory[i]. The atomic directive is required to protect some common resource that does not depend on the loop index, for example a total sum.

Any issues with large numbers of critical sections?

I have a large array of structures, like this:
typedef struct
{
    int a;
    int b;
    int c;
    etc...
} data_type;

data_type data[100000];
I have a bunch of separate threads, each of which will want to make alterations to elements within data[]. I need to make sure that no two threads attempt to access the same data element at the same time. To be precise: one thread performing data[475].a = 3; and another thread performing data[475].b = 7; at the same time is not allowed, but one thread performing data[475].a = 3; while another thread performs data[476].a = 7; is allowed. The program is highly speed-critical. My plan is to make a separate critical section for each data element, like so:
typedef struct
{
    CRITICAL_SECTION critsec;
    int a;
    int b;
    int c;
    etc...
} data_type;
In one way I guess it should all work and I should have no real questions, but not having had much experience in multithreaded programming I am just feeling a little uneasy about having so many critical sections. I'm wondering if the sheer number of them could be creating some sort of inefficiency. I'm also wondering if perhaps some other multithreading technique could be faster? Should I just relax and go ahead with plan A?
With this many objects, most of their critical sections will be unlocked, and there will be almost no contention. As you already know (other comment), critical sections don't require a kernel-mode transition if they're unowned. That makes critical sections efficient for this situation.
The only other consideration would be whether you want the critical sections inside your objects or in a separate array. Locality of reference is a good reason to put the critical sections inside the objects: when you've entered the critical section, the entire cache line holding it (typically 64 bytes on current x86 CPUs) will already have been fetched. With a bit of padding, you can make sure each object starts on a cache line boundary. As a result, the object will be (partially) in cache once its critical section is entered.
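A sketch of that layout idea (assuming 64-byte cache lines, which is typical on current x86 hardware):

#include <windows.h>

// Align each element to a cache line so the lock and the data it protects are
// fetched together, and neighbouring elements do not share a line.
struct alignas(64) data_type
{
    CRITICAL_SECTION critsec;
    int a;
    int b;
    int c;
};

static data_type data[100000];

int main()
{
    for (auto& d : data) InitializeCriticalSection(&d.critsec);

    EnterCriticalSection(&data[475].critsec);
    data[475].a = 3;                          // the data is likely already in cache here
    LeaveCriticalSection(&data[475].critsec);

    for (auto& d : data) DeleteCriticalSection(&d.critsec);
    return 0;
}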
Your plan is worth trying, but I think you will find that Windows is unhappy creating that many Critical Sections. Each CS contains some kernel handle(s) and you are using up precious kernel space. I think, depending on your version of Windows, you will run out of handle memory and InitializeCriticalSection() or some other function will start to fail.
What you might want to do is have a pool of CSs available for use, and store a pointer to the 'in use' CS inside your struct. But then this gets tricky quite quickly and you will need to use Atomic operations to set/clear the CS pointer (to atomically flag the array entry as 'in use'). Might also need some reference counting, etc...
Gets complicated.
So try your way first, and see what happens. We had a similar situation once, and we had to go with a pool, but maybe things have changed since then.
Depending on the data member types in your data_type structure (and also depending on the operations you want to perform on those members), you might be able to forgo using a separate synchronization object, using the Interlocked functions instead.
In your sample code, all the data members are integers, and all the operations are assignments (and presumably reads), so you could use InterlockedExchange() to set the values atomically and InterlockedCompareExchange() to read the values atomically.
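A Windows-only sketch of that approach (it assumes the members are changed to LONG, since that is the type the Interlocked functions operate on):

#include <windows.h>

typedef struct
{
    volatile LONG a;
    volatile LONG b;
    volatile LONG c;
} data_type;

data_type data[100000];

int main()
{
    InterlockedExchange(&data[475].a, 3);                   // atomic write of data[475].a

    // Atomic read: a compare-exchange with equal exchange/comparand values
    // leaves the variable unchanged and returns its current value.
    LONG current_b = InterlockedCompareExchange(&data[475].b, 0, 0);
    (void)current_b;
    return 0;
}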
If you need to use non-integer data member types, or if you need to perform more complex operations, or if you need to coordinate atomic access to more than one operation at a time (e.g., read data[1].a and then write data[1].b), then you will have to use a synchronization object, such as a CRITICAL_SECTION.
If you must use a synchronization object, I recommend that you consider partitioning your data set into subsets and use a single synchronization object per subset. For example, you might consider using one CRITICAL_SECTION for each span of 1000 elements in the data array.
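A sketch of that partitioning idea, written with portable OpenMP locks instead of CRITICAL_SECTIONs (the stripe size of 1000 elements per lock is just the figure suggested above):

#include <omp.h>

#define NUM_ELEMENTS 100000
#define STRIPE 1000                                 // elements sharing one lock
#define NUM_LOCKS (NUM_ELEMENTS / STRIPE)

typedef struct { int a; int b; int c; } data_type;

data_type data[NUM_ELEMENTS];
omp_lock_t stripe_locks[NUM_LOCKS];

int main()
{
    for (int i = 0; i < NUM_LOCKS; ++i)
        omp_init_lock(&stripe_locks[i]);

    #pragma omp parallel for
    for (int i = 0; i < NUM_ELEMENTS; ++i) {
        omp_lock_t* lock = &stripe_locks[i / STRIPE];   // the lock covering this element's stripe
        omp_set_lock(lock);
        data[i].a = 3;                                  // element update protected by its stripe lock
        omp_unset_lock(lock);
    }

    for (int i = 0; i < NUM_LOCKS; ++i)
        omp_destroy_lock(&stripe_locks[i]);
    return 0;
}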
You could also consider a mutex.
This is a nice method: each client reserves the resource by itself with a mutex (mutual exclusion).
This is more common, and some libraries also support it together with threads.
Read about boost::thread and its mutexes.
With your approach:
data_type data[100000];
I'd be afraid of stack overflow, unless you're allocating it on the heap.
EDIT:
boost::mutex uses Win32 critical sections.
As others have pointed out, yes, there is an issue, and it is called too-fine-grained locking. It is wasteful of resources, and even though the chances are small, you will start creating a lot of backing primitives and data when the locks do experience the occasional longer-than-usual contention. Plus you are wasting resources, since a critical section is not really a trivial data structure (as it might be in VM implementations, for example).
If I recall correctly, you will have a higher chance of an SEH exception from that point onwards on Win32, or simply higher memory usage. Partitioning and pooling the locks is probably the way to go, but it is a more complex implementation. Partitioning on something else (re: the action) and accepting some short-lived contention is another way to deal with it.
In any case, it is a problem of resource management with what you have right now.