Efficient parallel union of sets with OpenMP

Efficient parallel union of sets with OpenMP - c++

I need to calculate a global std::set (or equivalently a global std::unordered_set) in an OpenMP parallelised programm. At the moment every thread has a local std::set which then later the union is calculated from using
#pragma omp critical //critical as std container inserting is not thread safe
global_set.insert(local_set.begin(), local_set.end());
However this creates an effectively serial section of code, where each thread inserts its local set into the global set one after the other.
How can I improve on this by parallelising the union of the sets? The union is preceded by a large block of work, is there a convenient way to give all threads different amounts of work to let the others work while one is inserting the elements in the set? Or can the union itself be efficiently parallelised (for example by unioning sets in a 'binary tree' fashion)?

You should read up on OpenMP reductions, and, in particular user-defined reductions. That lets you pass the problem to the OpenMP implementation, which will very likely perform the reduction up a tree.
Of course, whether that is beneficial is not clear; it may be that it simply introduces a lot of copying and memory allocation which is still slower than the style of code you show.

Related

How to collect data for each thread OpenMP

I'm new to OpenMP and try to sort out the issue of collecting data from threads. I study the example of applying OpenMP on Monte-Carlo method (square of a circle inscribed into a square).
I understood how the following code works:
unsigned pointsInside = 0;
#pragma omp parallel for num_threads(threadNum) shared(threadNum) reduction(+: pointsInside)
for (unsigned i = 0; i < threadNum; i++) { ... }
Am I right that originally pointsInsideis a variable but OpenMP represents it as an array and than the mantra reduction(+: pointsInside) sums over the elements of the "array"?
But the main question is how to collect information directly into an array or vector? I tried to declare array or vector and provide pointer or address into OpenMP via shared and collect information for each thread at corresponding index. But it works slower than the way with the variable and reduction. Such the approach with vector or array is needed for me for my current project. Thanks a lot!
UPD:
When I said above that "it works slower" I meant comparison of two realizations of the Monte-Carlo method: 1) via shared and a vector/array, and 2) via a scalar variable and reduction. The first case is faster. My guess and question about it below.
I would like to rephrase my question more clear. I create a vector/array and provide it into OpenMP via shared. I want to collect data for each thread at corresponding index in vector/array. Under this approach I don't need any synchronization of access to the vector/array. Is it true that OpenMP enable synchronization by default when I use shared. If it is so, then how to disable it. Or may be another approaches exist. If it is not so, then how to share vector/array into the parallel part correctly and without synchronization of access.
I'd like to apply this technique for my project where I want to sort through different permutations in parallel part, collect each permutation and scalar result for it outside of the parallel part. Then sort the results and choose the best one.

A partial answer:
Am I right that originally pointsInside is a variable but OpenMP represents it as an array and than the mantra reduction(+: pointsInside) sums over the elements of the "array"?
I think it is better to think of pointsInside as a scalar. When the parallel region starts the run-time takes care of creating individual scalars, perhaps you might think of them as myPointsInside, one such scalar for each thread. When the parallel region finishes the run-time reduces the values of all the thread scalars onto the original scalar pointsInside. This is just about what OpenMP actually does behind the scenes.
As to the rest of your question:
Yes, you can perform reductions onto arrays - but this was only added to OpenMP, for C and C++ programs, in OpenMP 4.5 (I think). What goes on is much the same as for the scalar case. This Q&A provides some assistance - Is it possible to do a reduction on an array with openmp?
As to the speed, it's difficult to answer that without a much clearer understanding of what comparisons you are making. But it's very easy to write parallel reductions on arrays which incur a significant penalty in performance from the phenomenon of false sharing, about which you may wish to inform yourself.

OpenMP(C/C++): Efficient way of sharing an unordered_map<string, vector<int>> and a vector<int> between threads

I have a for loop that I would like to make parallel, however the threads must share an unordered_map and a vector.
Because the for loop is somewhat big I will post here a concise overview of it so that I can make my main problem clear. Please read the comments.
unordered_map<string, vector<int>> sharedUM;
/*
here I call a function that updates the unordered_map with some
initial data, however the unordered_map will need to be updated by
the threads inside the for loop
*/
vector<int> sharedVector;
/*
the shared vector initially is empty, the threads will
fill it with integers, the order of these integers should be in ascending
order, however I can simply sort the array after all the
threads finish executing so I guess we can assume that the order
does not matter
*/
#pragma omp parallel for
for(int i=0; i<N; i++){
key = generate_a_key_value_according_to_an_algorithm();
std::unordered_map<string, vector<int>::iterator it = sharedUM.find(key);
/*
according to the data inside it->second(the value),
the thread makes some conclusions which then
uses in order to figure out whether
it should run a high complexity algorithm
or not.
*/
bool conclusion = make_conclusion();
if(conclusion == true){
results = run_expensive_algorithm();
/*
According to the results,
the thread updates some values of
the key that it previously searched for inside the unordered_map
this update may help other threads avoid running
the expensive algorithm
*/
}
sharedVector.push_back(i);
}
Initially I left the code as it is, so I just used that #pragma over the for loop, however I got a few problems regarding the update of the sharedVector. So I decided to use simple locks in order to force a thread acquire the lock before writing to the vector. So in my implementation I had something like this:
omp_lock_t sharedVectorLock;
omp_init_lock(&sharedVectorLock);
...
for(...)
...
omp_set_lock(&sharedVectorLock);
sharedVector.push_back(i);
omp_unset_lock(&sharedVectorLock);
...
omp_destroy_lock(&sharedVectorLock);
I had run my application many times and everything seemed to be working great, and that's until I decided to rerun it automatically too many times until I got wrong results. Because I'm very new to the world of OpenMP and the threads in general, I wasn't aware of the fact that we should lock all the readers when a writer is updating some shared data. As you can see here in my application the threads always read some data from the unordered_map in order make some conclusions and learn things about the key that was assigned to them. What happens though if two threads have to work with the same key, and while some other thread is trying to read the values of this key, another one has reached the point of updating those values? I believe that's where my problem occurs.
However my main problem right now is that I'm not sure what would be the best way to avoid such things from happening. It's like my system works for 99% of the time, but that 1% ruins everything because two threads are rarely assigned with the same key which in turn is because my unordered_map is usually big.
Would locking the unordered_map do my job? Most likely, but that wouldn't be efficient because a thread A that wants to work with the key x would have to wait for a thread B that is already working with the key y where y can be different than x to finish.
So my main question is, how should I approach this problem? How can I lock the unordered_map if and only if two threads are working with the same key?
Thank you in advance

1 on using locks and mutexes. You must declare and initialise the lock variables outside of the parallel block (before #pragma omp parallel) and then use them inside the parallel block: (1) acquire a lock (this may block if another thread has locked it), (2) change the variable with the race condition, (3) release the lock. Finally, destroy it after exiting the parallel block. A lock declared inside the parallel block is local to the thread and hence cannot provide synchronisation.
This may explain your problems.
2 on writing into complicated C++ containers. OpenMP was designed originally for simple FORTRAN do loops (similar to C/C++ for loops with integer control variables). Everything more complicated will give you headache. To be on the safe side, any non-constant operation on a C++ container must be performed within a lock (use the same lock for any such operation on the same container) or omp critical region (use the same name for any such operation on the same container). This includes pop() and push() etc, anything but simple reads. This can only remain efficient if such non-constant container operations take only a tiny fraction of the time.
3 If I were you, I wouldn't bother with openMP (I have used it but am regretting this now). With C++ you could use TBB, which also comes with some threadsafe but lock-free containers. It also allows you to think in terms of tasks, not threads, which are executed recursively (a parent task spawns child tasks, etc), but TBB has some simple implementations for parallel for loops, for instance.

An alternative approach would be to use TBB's concurrent_unordered_map.
You don't have to use the rest of TBB's parallelism support (though if you're starting from scratch in C++ it's certainly more "c++-ish" than OpenMP).

May be this could help:
vector<bool> sv(N);
replace
sharedVector.push_back(i);
by
sv[i]=true;
this allows to avoid locks (very time consuming) and sharedVector
can easily be sorted, e.g
for(int i=0; i<N;i++){
if(sv[i])sharedVector.push_back(i);
}

OpenMP shared data

I'm somewhat new to OpenMP but have experience with parallel processing in general. I worked with boost::threads before and now I'm testing around with openmp.
The problem is that I don't know how to handle shared data access because I don't really know what openmp does internally with shared data objects inside parallel loops.
What I'm doing right now (this is working so far): I read files from disk into memory with mmap. I receive a pointer on char after the memory map part.
OpenMP can now use this pointer inside an OpenMP parallel for loop and share the data between threads. I'm now able to search for regular expression matches inside the mapped and shared file with multiple threads checking every string against a (pretty long) list of regular expressions.
I made this list (a vector containing regex) private inside the openmp loop, so every thread has it's own copy of this list.
Here comes the problem:
To dramatically increase performance of my application I need to be able to remove (regex-)items from this vector once they match a string.
Now all the other active threads need to have this item removed from their list too asap.
So I made this list a shared data object inside the openmp loop but now I get segmentation faults at runtime when I try to write (vector.erase(item#)) to the list.
With boost::threads I would just have used a mutex for locking this object while writing/reading it.
But openmp seems to handle most of the synchronization itself so now I wonder what would be the correct approach to handle this problem when using openmp which is new to me.

For synchronization, you may use #pragma omp critical or you may use OpenMP lock routines (omp_{init,set,unset,destroy}_lock).
The benefits of #pragma omp critical are simplicity and ability to ignore the pragma when the parallel region is known to execute by a single thread. Drawbacks are applicability only to a single parallel region, and global effect within this region: no other thread can execute any other critical section in the region.
OpenMP lock routines are similar to most other available locks, e.g. those of pthreads or Boost (RAII aside). You initialize a lock object, then use it to protect certain critical sections, and destroy when it's unnecessary. These locks can be used to protect access to data from different parallel regions, to build a distributed locking scheme, etc.; but certain amount of overhead is always incurred, and surely usage is more "hairy" comparing to #pragma omp critical.
However, I would challenge the design of the parallel solution. Erasing an element from middle of the vector makes all iterators invalidated, and moves elements. Erase is supposedly a rare operation (otherwise, the choice of vector would be questionable even in serial code I think), but due to the above effects you have to protect all reads of the vector too, and that will likely be costly. Read/write locks could give some relief, but those are not available in OpenMP, so you would need to use either platform-specific interfaces or a 3rd party library.
I think the following will potentially work better:
You keep regex vectors private, and add a same-size shared vector of flags that indicate whether a certain regex is still valid or not.
Before applying a certain regex from a private vector, the code checks in the shared vector if this regex was not "erased" by some other thread. If it was, the regex is skipped.
After finding a match, the code marks the element of the shared vector that corresponds to the current regex as "erased" so that it will be ignored from now on.
In this scheme, there exist races for reading/writing flags: a flag might be set to "erased" the very next moment it was read as "valid" by another thread. As the result, two different threads may concurrently find a match for the same regex. However this problem I believe exists in your current solution where all regex containers are private, as well as in a solution with a shared container and locks or RW locks, unless a non-RW lock protects also the operation with a given regex. In case multiple matches are a problem, it all should be re-thought.

You can achieve this by creating a critical section.
#pragma omp critical
{
...some synchronized code...
}
EDIT:
Removed the part about '#pragma omp atomic' as it cannot atomically perform the operations needed.

Proper use of "atomic directive" to lock STL container

I have a large number of sets of integers, which I have, in turn, put into a vector of pointers. I need to be able to update these sets of integers in parallel without causing a race condition. More specifically. I am using OpenMP's "parallel for" construct.
For dealing with shared resources, OpenMP offers a handy "atomic directive," which allows one to avoid a race condition on a specific piece of memory without using locks. It would be convenient if I could use the "atomic directive" to prevent simultaneous updating to my integer sets, however, I'm not sure whether this is possible.
Basically, I want to know whether the following code could lead to a race condition
vector< set<int>* > membershipDirectory(numSets, new set<int>);
#pragma omp for schedule(guided,expandChunksize)
for(int i=0; i<100; i++)
{
set<int>* sp = membershipDirectory[rand()];
#pragma omp atomic
sp->insert(45);
}
Note that I use a random integer for the index, because in my application, any thread might access any index (there is a random element in my larger application, but I need not go into details).
I have seen a similar example of this for incrementing an integer, but I'm not sure whether it works when working with a pointer to a container as in my case.

After searching around, I found the OpenMP C and C++ API manual on openmp.org, and in section 2.6.4, the limitations of the atomic construct are described.
Basically, the atomic directive can only be used with the following operators:
Unary:
++, -- (prefix and postfix)
Binary:
+,-,*,/,^,&,|,<<,>>
So I will just use locks!
(In some situations critical sections might be preferable, but in my case locks will provide fine grained access to the shared resource, yielding better performance than a critical section.)

you should not use atomic where expression is a function call, it only applies to simple expressions (with possibly built-ins: power, square root).
Instead use critical section (either named or default)

Your code is not clear. Assuming that membershipDirectory[5] is actually membershipDirectory[i], atomic directive is not needed. For two processors, for example, OpenMP produces two threads, one handles i = 0-49, another 50-99 intervals. In this case, there is no need to protect membershipDirectory[i]. atomic directive is required to protect some common resource which does not depend on the loop index, for example, total sum.

Any issues with large numbers of critical sections?

I have a large array of structures, like this:
typedef struct
{
int a;
int b;
int c;
etc...
}
data_type;
data_type data[100000];
I have a bunch of separate threads, each of which will want to make alterations to elements within data[]. I need to make sure that no to threads attempt to access the same data element at the same time. To be precise: one thread performing data[475].a = 3; and another thread performing data[475].b = 7; at the same time is not allowed, but one thread performing data[475].a = 3; while another thread performs data[476].a = 7; is allowed. The program is highly speed critical. My plan is to make a separate critical section for each data element like so:
typedef struct
{
CRITICAL_SECTION critsec;
int a;
int b;
int c;
etc...
}
data_type;
In one way I guess it should all work and I should have no real questions, but not having had much experience in multithreaded programming I am just feeling a little uneasy about having so many critical sections. I'm wondering if the sheer number of them could be creating some sort of inefficiency. I'm also wondering if perhaps some other multithreading technique could be faster? Should I just relax and go ahead with plan A?

With this many objects, most of their critical sections will be unlocked, and there will be almost no contention. As you already know (other comment), critical sections don't require a kernel-mode transition if they're unowned. That makes critical sections efficient for this situation.
The only other consideration would be whether you would want the critical sections inside your objects or in another array. Locality of reference is a good reason to put the critical sections inside the object. When you've entered the critical section, an entire cacheline (e.g. 16 or 32 bytes) will be in memory. With a bit of padding, you can make sure each object starts on a cacheline. As a result, the object will be (partially) in cache once its critical section is entered.

Your plan is worth trying, but I think you will find that Windows is unhappy creating that many Critical Sections. Each CS contains some kernel handle(s) and you are using up precious kernel space. I think, depending on your version of Windows, you will run out of handle memory and InitializeCriticalSection() or some other function will start to fail.
What you might want to do is have a pool of CSs available for use, and store a pointer to the 'in use' CS inside your struct. But then this gets tricky quite quickly and you will need to use Atomic operations to set/clear the CS pointer (to atomically flag the array entry as 'in use'). Might also need some reference counting, etc...
Gets complicated.
So try your way first, and see what happens. We had a similar situation once, and we had to go with a pool, but maybe things have changed since then.

Depending on the data member types in your data_type structure (and also depending on the operations you want to perform on those members), you might be able to forgo using a separate synchronization object, using the Interlocked functions instead.
In your sample code, all the data members are integers, and all the operations are assignments (and presumably reads), so you could use InterlockedExchange() to set the values atomically and InterlockedCompareExchange() to read the values atomically.
If you need to use non-integer data member types, or if you need to perform more complex operations, or if you need to coordinate atomic access to more than one operation at a time (e.g., read data[1].a and then write data[1].b), then you will have to use a synchronization object, such as a CRITICAL_SECTION.
If you must use a synchronization object, I recommend that you consider partitioning your data set into subsets and use a single synchronization object per subset. For example, you might consider using one CRITICAL_SECTION for each span of 1000 elements in the data array.

You could also consider MUTEX.
This is nice method.
Each client could reserve the resource by itself with mutex (mutual-exclusion).
This is more common, some libraries also support this with threads.
Read about boost::thread and it's mutexes
With Your approach:
data_type data[100000];
I'd be afraid of stack overflow, unless You're allocating it at the heap.
EDIT:
Boost::MUTEX
uses win32 Critical Sections

As others have pointed out, yes there is an issue and it is called too fine-grained locking.. it's resource wasteful and even though the chances are small you will start creating a lot of backing primitives and data when the things do get an occasional, call it longer than usual or whatever, contention. Plus you are wasting resources as it is not really a trivial data structure as for example in VM impls..
If I recall correctly you will have a higher chance of a SEH exception from that point onwards on Win32 or just higher memory usage. Partitioning and pooling them is probably the way to go but it is a more complex implementation. Paritioning on something else (re:action) and expecting some short-lived contention is another way to deal with it.
In any case, it is a problem of resource management with what you have right now.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js