Shared vectors in OpenMP - c++

I am trying to parallelize a program I am using and have the following question.
Will I lose performance if multiple threads need to read/write the same vector, but different elements of that vector? I have the feeling that's the reason my program hardly gets any faster when I parallelize it. Take the following code:
#include <cmath>
#include <vector>

int main() {
    std::vector<double> numbers;
    std::vector<double> results(10);
    double x;

    // write 10 values into vector numbers
    for (int i = 0; i < 10; i++) {
        numbers.push_back(std::cos(i));
    }

    #pragma omp parallel for \
        private(x) \
        shared(numbers, results)
    for (int j = 0; j < 10; j++) {
        x = 2 * numbers[j] + 5;
        #pragma omp critical // do I need this ?
        {
            results[j] = x;
        }
    }
    return 0;
}
Obviously the actual program does far more expensive operations; this example is only meant to illustrate my question. So can the for loop run fast and completely in parallel, or do the different threads have to wait for each other because only one thread at a time can access the vector numbers, for instance, even though they are all reading different elements of it?
Same question for the write operation: do I need the critical pragma, or is it no problem since every thread writes into a different element of the vector results?
I am grateful for any help I can get, and it would also be good to know whether there is a better way to do this (maybe not use vectors at all, but plain arrays and pointers, etc.?)
I also read that vectors aren't thread-safe in certain cases and that using a pointer is recommended: OpenMP and STL vector
Thanks a lot for your help!

I imagine that most of the issues with vectors in multiple threads arise when the vector has to resize: it then copies its entire contents into a new, larger chunk of memory, and if you are accessing it in parallel at that moment you may read an object that has just been deleted.
If you are not resizing your vector, I have never had any trouble with concurrent reads and writes into it (as long as I'm not writing to the same element twice, obviously).
As for the lack of performance boost, the OpenMP critical section will slow your program down to roughly the same speed as just using one thread (depending on how much work is actually done outside that critical section).
You can remove the critical section statement (with the conditions above in mind).

You get no speedup precisely because of the critical section, which is superfluous, since the same element will never be modified by two threads at the same time. Remove the critical section and it will work just fine.
You can play with the schedule strategy as well, because if memory access is not linear (it is in the example you gave), threads might fight over the cache (writing elements that share a cache line). OTOH, if the number of elements is fixed as in your case and there is no branching in the loop (so all iterations execute at about the same speed), static, which is IIRC the default, should work best anyway.
(BTW, you can declare x inside the loop to avoid private(x), and the shared clause is implied IIRC; I never use it.)
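For illustration, a minimal sketch of the loop with these changes applied (same numbers and results vectors as in the question):
#pragma omp parallel for schedule(static)
for (int j = 0; j < 10; j++) {
    double x = 2 * numbers[j] + 5;  // declared inside the loop, so x is automatically private
    results[j] = x;                 // each iteration writes a distinct element: no critical needed
}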

Related

how can I get good speedup for a parallel write to memory?

I'm new to OpenMP and trying to get some very basic loops in my code parallelized with OpenMP, with good speedup on multiple cores. Here's a function in my program:
bool Individual::_SetFitnessScaling_1(double source_value, EidosObject **p_values, size_t p_values_size)
{
    if ((source_value < 0.0) || (std::isnan(source_value)))
        return true;

    #pragma omp parallel for schedule(static) default(none) shared(p_values_size) firstprivate(p_values, source_value) if(p_values_size >= EIDOS_OMPMIN_SET_FITNESS_S1)
    for (size_t value_index = 0; value_index < p_values_size; ++value_index)
        ((Individual *)(p_values[value_index]))->fitness_scaling_ = source_value;

    return false;
}
So the goal is to set the fitnessScaling ivar of every object pointed to by pointers in the buffer that p_values points to, to the same double value source_value. Those various objects might be more or less anywhere in memory, so each write probably hits a different cache line; that's an aspect of the code that would be difficult to change, but I'm hoping that by spreading it across multiple cores I can at least divide that pain by a good speedup factor. The cast to (Individual *) is safe, by the way; checks were already done external to this function that guarantee its safety.
You can see my first attempt at parallelizing this, using the default static schedule (so each thread gets its own contiguous block in p_values), making the loop limit shared, and making p_values and source_value be firstprivate so each thread gets its own private copy of those variables, initialized to the original value. The threshold for parallelization, EIDOS_OMPMIN_SET_FITNESS_S1, is set to 900. I test this with a script that passes in a million values, with between 1 and 8 cores (and a max thread count to match), so the loop should certainly run in parallel. I have followed these same practices in some other places in the code and have seen a good speedup.
[EDIT: I should say that the speedup I observe for this, for 2/4/6/8 cores/threads, is always about 1.1x-1.2x the single-threaded performance, so there's a very small win, but it is realized already with 2 cores and does not get any better with 8 cores.]
The notable difference with this code is that this loop spends its time writing to memory; the other loops I have successfully parallelized spend their time doing things like reading values from a buffer and summing across them, so they might be limited by memory read speeds, but not by memory write speeds.
It occurred to me that with all of this writing through a pointer, my loop might be thrashing due to things like aliasing (making the compiler force a flush of the cache after each write), or some such. I attempted to solve that kind of issue as follows, using const and __restrict:
bool Individual::_SetFitnessScaling_1(double source_value, EidosObject **p_values, size_t p_values_size)
{
    if ((source_value < 0.0) || (std::isnan(source_value)))
        return true;

    #pragma omp parallel default(none) shared(p_values_size) firstprivate(p_values, source_value) if(p_values_size >= EIDOS_OMPMIN_SET_FITNESS_S1)
    {
        EidosObject * const * __restrict local_values = p_values;

        #pragma omp for schedule(static)
        for (size_t value_index = 0; value_index < p_values_size; ++value_index)
            ((Individual *)(local_values[value_index]))->fitness_scaling_ = source_value;
    }

    return false;
}
This made no difference to the performance, however. I still suspect that some kind of memory contention, cache thrash, or aliasing issue is preventing the code from parallelizing effectively, but I don't know how to solve it. Or maybe I'm barking up the wrong tree?
These tests are done with Xcode 13 (i.e., using Apple clang 13.0.0) on macOS, on an M1 Mac mini (2020).
[EDIT: In reply to comments below, a few points.
(1) There is nothing fancy going on inside the class here, no operator= or similar; the assignment of source_value into fitness_scaling_ is, in effect, simply the assignment of a double into a field in a struct.
(2) The use of firstprivate(p_values, source_value) is to ensure that repeated reading from those values across threads doesn't introduce some kind of between-thread contention that slows things down. It is recommended in Mattson, He, & Koniges' book "The OpenMP Common Core"; see section 6.3.2, figure 6.10 with the corrected Mandelbrot code using firstprivate, and the quote on p. 111: "An easy solution is to change the storage attribute for eps to firstprivate. This gives each thread its own copy of the variable but with a specified value. Notice that eps is read-only. It is not updated inside the parallel region. Therefore, another solution is to let it be shared (shared(eps)) or not specify eps in a data environment clause and let its default, shared behavior be used. While this would result in correct code, it would potentially increase overhead. If eps is shared, every thread will be reading the same address in memory... Some compilers will optimize for such read-only variables by putting them into registers, but we should not rely on that behavior." I have observed this change speeding up parallelized loops in other contexts, so I have adopted it as my standard practice in such cases; if I have misunderstood, please do let me know.
(3) No, keeping the fitness_scaling_ values in their own buffer is not a workable solution for several reasons. Most importantly, this method may be called with any arbitrary buffer of pointers to Individual; it is not necessarily setting the fitness_scaling_ of all Individual objects, just an effectively random subset of them, so this operation will never be reducible to a simple memset(). Also, I am going to need to similarly optimize the setting of many other properties on Individual and on other classes in my code, so a general solution is needed; I can't very well put all of the ivars of all of my classes into separately allocated buffers external to the objects themselves. And third, Individual objects are being dynamically allocated and deallocated independently of each other, so an external buffer of fitness_scaling_ values for the objects would have big implementation problems.]

What is the best way to parallelise tasks sharing an object but otherwise independent?

I'm coding a physics simulation consisting mainly of a central loop of hundreds of billions of repetitions of operations on an array. These operations are independent of one another (well, actually the array changes along the way), so I'm thinking about parallelising my code, as I can make it run on the 4- or 8-core computers in my lab.
It's my first time doing something like this, and I've been advised to look at OpenMP. I've started to code some toy programs with it, but I'm really unsure about how it works, and the documentation is quite cryptic to me. For example, the following code:
int a = 0;
#pragma omp parallel
{
    a++;
}
cout << a << endl;
launched on my computer (a 4-core CPU) sometimes gives me 4, other times 3 or 2. Is it because it doesn't wait for all the cores to execute the instructions? Because I sure need to know how many iterations were done in my case. Should I look for something other than OpenMP, considering what I want in the end?
When writing concurrently to a shared variable (a in your code), you have a data race. To avoid different threads writing "simultaneously", you must either use an atomic assignment or protect the assignment with a mutex (= mutual exclusion). In OpenMP, the latter is done via a critical region
int a = 0;
#pragma omp parallel
{
    #pragma omp critical
    {
        a++;
    }
}
cout << a << endl;
(of course, this particular program does nothing in parallel, hence will be slower than a serial one doing the same).
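For completeness, a sketch of the atomic alternative mentioned above; an atomic update of a is cheaper than a full critical region:
int a = 0;
#pragma omp parallel
{
    #pragma omp atomic   // the read-increment-write of a++ is performed indivisibly
    a++;
}
cout << a << endl;       // reliably prints the number of threads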
For more info, read the OpenMP documentation! However, I would advise you not to use OpenMP but TBB if you're using C++. It's much more flexible.
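If you do go the TBB route, a minimal sketch of a loop over independent array elements might look like this (assuming TBB is installed; the vector and the doubling are just placeholders):
#include <tbb/parallel_for.h>
#include <vector>

int main() {
    std::vector<double> a(1000000, 1.0);

    // each index is processed exactly once, by whichever worker thread picks it up
    tbb::parallel_for(std::size_t(0), a.size(), [&](std::size_t i) {
        a[i] *= 2.0;   // independent per-element work: no shared writes, no lock needed
    });
    return 0;
}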
What you are seeing is a typical example of a race condition. Four threads are trying to increment variable a and they are fighting for it. Some 'lose' and are not able to increment, so you see a result lower than 4.
What happens is that the a++ statement is actually a set of three instructions: read a from memory and put it in a register, increment the value in the register, then write the value back to memory. If thread 1 reads the value of a after thread 2 has read it but before thread 2 has written the new value back, the increment done by thread 2 will be overwritten. Using #pragma omp critical is a way to ensure that the read/increment/write sequence is not interrupted by another thread.
If you need to parallelize iterations, you can use omp parallel for, for instance to increment all the elements in an array.
Typical use:
#pragma omp parallel for
for (i = 0; i < N; i++)
    a[i]++;

OpenMP(C/C++): Efficient way of sharing an unordered_map<string, vector<int>> and a vector<int> between threads

I have a for loop that I would like to make parallel, however the threads must share an unordered_map and a vector.
Because the for loop is somewhat big I will post here a concise overview of it so that I can make my main problem clear. Please read the comments.
unordered_map<string, vector<int>> sharedUM;
/*
    here I call a function that updates the unordered_map with some
    initial data; however, the unordered_map will need to be updated by
    the threads inside the for loop
*/
vector<int> sharedVector;
/*
    the shared vector is initially empty; the threads will
    fill it with integers. The order of these integers should be
    ascending, but I can simply sort the vector after all the
    threads finish executing, so I guess we can assume that the order
    does not matter
*/
#pragma omp parallel for
for (int i = 0; i < N; i++) {
    key = generate_a_key_value_according_to_an_algorithm();
    std::unordered_map<string, vector<int>>::iterator it = sharedUM.find(key);
    /*
        according to the data inside it->second (the value),
        the thread draws some conclusions which it then
        uses to figure out whether
        it should run a high-complexity algorithm
        or not
    */
    bool conclusion = make_conclusion();
    if (conclusion == true) {
        results = run_expensive_algorithm();
        /*
            according to the results,
            the thread updates some values of
            the key that it previously searched for inside the unordered_map;
            this update may help other threads avoid running
            the expensive algorithm
        */
    }
    sharedVector.push_back(i);
}
Initially I left the code as it is, so I just used that #pragma over the for loop; however, I got a few problems regarding the update of the sharedVector. So I decided to use simple locks to force a thread to acquire the lock before writing to the vector. In my implementation I had something like this:
omp_lock_t sharedVectorLock;
omp_init_lock(&sharedVectorLock);
...
for(...)
    ...
    omp_set_lock(&sharedVectorLock);
    sharedVector.push_back(i);
    omp_unset_lock(&sharedVectorLock);
    ...
omp_destroy_lock(&sharedVectorLock);
I ran my application many times and everything seemed to be working great, until I decided to rerun it automatically many times in a row and eventually got wrong results. Because I'm very new to the world of OpenMP and threads in general, I wasn't aware of the fact that readers must also be locked out while a writer is updating some shared data. As you can see, in my application the threads always read some data from the unordered_map in order to make some conclusions and learn things about the key that was assigned to them. But what happens if two threads have to work with the same key, and while one thread is trying to read the values of this key, another one has reached the point of updating those values? I believe that's where my problem occurs.
However, my main problem right now is that I'm not sure what the best way would be to avoid such things from happening. It's like my system works 99% of the time, but that 1% ruins everything, because two threads are rarely assigned the same key, which in turn is because my unordered_map is usually big.
Would locking the whole unordered_map do the job? Most likely, but that wouldn't be efficient, because a thread A that wants to work with key x would have to wait for a thread B that is already working with key y (where y can be different from x) to finish.
So my main question is, how should I approach this problem? How can I lock the unordered_map if and only if two threads are working with the same key?
Thank you in advance
1. On using locks and mutexes: you must declare and initialise the lock variables outside of the parallel block (before #pragma omp parallel) and then use them inside the parallel block: (1) acquire the lock (this may block if another thread holds it), (2) change the variable with the race condition, (3) release the lock. Finally, destroy the lock after exiting the parallel block. A lock declared inside the parallel block is local to the thread and hence cannot provide synchronisation.
This may explain your problems.
2. On writing into complicated C++ containers: OpenMP was designed originally for simple FORTRAN do loops (similar to C/C++ for loops with integer control variables). Everything more complicated will give you a headache. To be on the safe side, any non-constant operation on a C++ container must be performed within a lock (use the same lock for any such operation on the same container) or an omp critical region (use the same name for any such operation on the same container). This includes pop() and push(), etc., anything but simple reads. This can only remain efficient if such non-constant container operations take only a tiny fraction of the total time.
3. If I were you, I wouldn't bother with OpenMP (I have used it but am regretting this now). With C++ you could use TBB, which also comes with some thread-safe but lock-free containers. It also allows you to think in terms of tasks rather than threads, which are executed recursively (a parent task spawns child tasks, etc.), but TBB also has simple implementations of parallel for loops, for instance.
An alternative approach would be to use TBB's concurrent_unordered_map.
You don't have to use the rest of TBB's parallelism support (though if you're starting from scratch in C++ it's certainly more "c++-ish" than OpenMP).
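For illustration, a minimal sketch of that approach (the key generation here is a hypothetical stand-in for the question's generate_a_key_value_according_to_an_algorithm(); compile with both TBB and OpenMP):
#include <string>
#include <vector>
#include <tbb/concurrent_unordered_map.h>
#include <tbb/concurrent_vector.h>

// Thread-safe replacements for the plain containers in the question.
// Note: concurrent_unordered_map makes concurrent insert/find on the map itself safe,
// but it does NOT protect the vector<int> stored as the mapped value; updates to
// it->second still need their own synchronisation (e.g. a per-key lock).
tbb::concurrent_unordered_map<std::string, std::vector<int>> sharedUM;
tbb::concurrent_vector<int> sharedVector;

int main() {
    const int N = 100000;
    #pragma omp parallel for
    for (int i = 0; i < N; i++) {
        std::string key = std::to_string(i % 100);   // hypothetical stand-in for the real key generation
        std::vector<int> &bucket = sharedUM[key];    // operator[] does a concurrent-safe insert-if-absent
        (void)bucket;                                // ... conclusions / expensive algorithm from the question would go here ...
        sharedVector.push_back(i);                   // concurrent push_back is safe, no lock needed
    }
    return 0;
}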
Maybe this could help:
vector<char> sv(N);
(note: not vector<bool>, which is bit-packed, so concurrent writes to different elements could still race)
Replace
sharedVector.push_back(i);
by
sv[i] = 1;
This avoids locks (which are very time-consuming), and sharedVector can easily be rebuilt in sorted order afterwards, e.g.
for (int i = 0; i < N; i++) {
    if (sv[i]) sharedVector.push_back(i);
}

C++ OpenMP directives

I have a loop that I'm trying to parallelize and in it I am filling a container, say an STL map. Consider then the simple pseudo code below where T1 and T2 are some arbitrary types, while f and g are some functions of integer argument, returning T1, T2 types respectively:
#pragma omp parallel for schedule(static) private(i) shared(c)
for (i = 0; i < N; ++i) {
    c.insert(std::make_pair<T1,T2>(f(i), g(i)));
}
This looks rather straightforward and seems like it should be trivially parallelized, but it doesn't speed up as I expected. On the contrary, it leads to run-time errors in my code, due to unexpected values being filled in the container, likely because of race conditions. I've even tried putting barriers and what-not, but all to no avail. The only thing that allows it to work is to use a critical directive as below:
#pragma omp parallel for schedule(static) private(i) shared(c)
for (i = 0; i < N; ++i) {
    #pragma omp critical
    {
        c.insert(std::make_pair<T1,T2>(f(i), g(i)));
    }
}
But this rather defeats the whole point of using OpenMP in the above example, since only one thread at a time executes the bulk of the loop (the container insert statement). What am I missing here? Short of changing the way the code is written, can somebody kindly explain?
This particular example you have is not a good candidate for parallelism unless f() and g() are extremely expensive function calls.
STL containers are not thread-safe. That's why you're getting the race conditions. So accessing them needs to be synchronized - which makes your insertion process inherently sequential.
As the other answer mentions, there's a LOT of overhead for parallelism. So unless f() and g() are extremely expensive, your loop doesn't do enough work to offset the overhead of parallelism.
Now assuming f() and g() are extremely expensive calls, then your loop can be parallelized like this:
#pragma omp parallel for schedule(static) private(i) shared(c)
for (i = 0; i < N; ++i) {
    std::pair<T1,T2> p = std::make_pair<T1,T2>(f(i), g(i));
    #pragma omp critical
    {
        c.insert(p);
    }
}
Running multithreaded code makes you think about thread safety and shared access to your variables. As soon as you start inserting into c from multiple threads, the collection has to be prepared to take such "simultaneous" calls and keep its data consistent; are you sure it is made this way?
Another thing is that parallelization has its own overhead and you are not going to gain anything when you try to run a very small task on multiple threads - with the cost of splitting and synchronization you might end up with even higher total execution time for the task.
c will obviously have data races, as you guessed. An STL map is not thread-safe. Calling its insert method concurrently from multiple threads has very unpredictable behavior, mostly just a crash.
Yes, to avoid the data races, you must have either (1) a mutex such as #pragma omp critical, or (2) a concurrent data structure (aka a lock-free data structure). However, not all data structures can be lock-free on current hardware. For example, TBB provides tbb::concurrent_hash_map. If you don't need ordering of the keys, you may use it and could get some speedup, as it does not use a conventional mutex.
In case you can use just a hash table and the table is very large, you could take a reduction-like approach (see this link for the concept of reduction). Hash tables do not care about the order of insertion. In this case, you allocate a separate hash table for each thread and let each thread insert N/#threads items in parallel, which will give a speedup. Lookups can also easily be done by accessing these tables in parallel.
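A minimal sketch of that reduction-like, per-thread-table idea (the int/double key and value types and the trivial f()/g() bodies here are just stand-ins for the question's arbitrary T1/T2 and expensive calls):
#include <omp.h>
#include <unordered_map>
#include <vector>

// stand-ins for the question's expensive f() and g()
static int    f(int i) { return i; }
static double g(int i) { return i * 0.5; }

std::unordered_map<int, double> build_map(int N) {
    const int nthreads = omp_get_max_threads();
    // one private hash table per thread: inserts need no locking at all
    std::vector<std::unordered_map<int, double>> local(nthreads);

    #pragma omp parallel
    {
        const int tid = omp_get_thread_num();
        #pragma omp for schedule(static)
        for (int i = 0; i < N; ++i)
            local[tid].emplace(f(i), g(i));
    }

    // sequential merge afterwards; cheap as long as f()/g() dominate the run time
    std::unordered_map<int, double> result;
    for (const auto &m : local)
        result.insert(m.begin(), m.end());
    return result;
}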

Concurrency and optimization using OpenMP

I'm learning OpenMP. To do so, I'm trying to make an existing piece of code parallel. But I seem to get a worse time when using OpenMP than when I don't.
My inner loop:
#pragma omp parallel for
for (unsigned long j = 0; j < c_numberOfElements; ++j)
{
    //int th_id = omp_get_thread_num();
    //printf("thread %d, j = %d\n", th_id, (int)j);

    Point3D current;
    #pragma omp critical
    {
        current = _points[j];
    }

    Point3D next = getNext(current);
    if (!hasConstraint(next))
    {
        continue;
    }

    #pragma omp critical
    {
        _points[j] = next;
    }
}
_points is a pointMap_t, defined as:
typedef boost::unordered_map<unsigned long, Point3D> pointMap_t;
Without OpenMP my running time is 44.904 s. With OpenMP enabled, on a computer with two cores, it is 64.224 s. What am I doing wrong?
Why have you wrapped your reads and writes to _points[j] in critical sections? I'm not much of a C++ programmer, but it doesn't look to me as if you need those sections at all. As you've written it (unnamed critical sections), each thread is going to wait while the other goes through each of those sections. This could easily make the program slower.
It seems possible that the lookup and write to _points in critical sections are dragging down the performance when you use OpenMP. Single-threaded, this will not result in any contention.
Sharing seed data like this seems counterproductive in a parallel programming context. Can you restructure to avoid these contention points?
You need to show the rest of the code. From a comment to another answer, it seems you are using a map. That is really a bad idea, especially if you are mapping 0..n numbers to values: why don't you use an array?
If you really need to use containers, consider using the ones from the Intel's Thread Building Blocks library.
I agree that it would be best to see some working code.
The ultimate issue here is that there are criticals within the parallel region, and criticals are (a) enormously expensive in and of themselves, and (b) by definition, kill parallelism. The assignment to current certainly doesn't need to be inside a critical, as it is private; I wouldn't have thought the _points[j] assignment needed to be either, but I don't know what the map stuff does, so there you go.
But you have a loop in which you have a huge amount of overhead, which grows linearly in the number of threads (the two critical regions), in order to do a tiny amount of actual work (walk along a linked list, it looks like). That's never going to be a good trade...
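To make that concrete: assuming every key 0 .. c_numberOfElements-1 is already present in _points (so the lookup can never insert, and the map is never rehashed during the loop), a sketch of the loop without the criticals might look like this:
#pragma omp parallel for
for (unsigned long j = 0; j < c_numberOfElements; ++j)
{
    // at() only looks the key up; it never inserts or rehashes,
    // so no thread modifies the map's structure
    Point3D &slot = _points.at(j);

    Point3D next = getNext(slot);
    if (!hasConstraint(next))
        continue;

    slot = next;   // each j is handled by exactly one thread, so distinct
                   // mapped values are written without a data race
}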