Simultaneous writing in the same memory in parallel omp loop - c++

I want to implement the following function, which marks some elements of an array with 1.
void mark(std::vector<signed char>& marker)
{
    #pragma omp parallel for schedule(dynamic, M)
    for (int i = 0; i < (int)marker.size(); i++)
        marker[i] = 0;

    #pragma omp parallel for schedule(dynamic, M)
    for (int i = 0; i < (int)marker.size(); i++)
        marker[getIndex(i)] = 1; // is it ok ?
}
What will happen if we try to set the same element to 1 from different threads at the same time? Will it simply be set to 1, or may this loop lead to unexpected behavior?

This answer is wrong in one fundamental part (emphasis mine):
If you write with different threads to the very same location, you get a race condition. This is not necessarily undefined behaviour, but nevertheless it needs to be avoided.
Having a look at the OpenMP standard, section 1.4.1 says (also emphasis mine):
If multiple threads write without synchronization to the same memory unit, including cases due to atomicity considerations as described above, then a data race occurs. Similarly, if at least one thread reads from a memory unit and at least one thread writes without synchronization to that same memory unit, including cases due to atomicity considerations as described above, then a data race occurs. If a data race occurs then the result of the program is unspecified.
Technically, the OP's snippet is in undefined-behavior territory, which means there is no guarantee on the behavior of the program until the data race is removed.
The simplest way to remove it is to protect the memory access with an atomic operation:
#pragma omp parallel for schedule(dynamic, M)
for (int i = 0; i < (int)marker.size(); i++)
{
    #pragma omp atomic write seq_cst
    marker[getIndex(i)] = 1;
}
but that will probably hurt performance considerably (as @schorsch312 correctly notes).
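A race-free alternative (a sketch, not part of the original answers) is to give each thread its own private marker array and merge afterwards; getIndex and M are stand-ins for the question's helper and chunk size, and whether this wins depends on how large marker is and how contended the hot entries are:

#include <omp.h>

#include <algorithm>
#include <vector>

int getIndex(int i);      // assumed helper from the question
constexpr int M = 64;     // assumed chunk size from the question

void mark(std::vector<signed char>& marker)
{
    const int n = static_cast<int>(marker.size());
    std::fill(marker.begin(), marker.end(), 0);

    #pragma omp parallel
    {
        // Each thread marks into its own buffer: no shared writes, no race.
        std::vector<signed char> local(n, 0);

        #pragma omp for schedule(dynamic, M) nowait
        for (int i = 0; i < n; i++)
            local[getIndex(i)] = 1;

        // Merge: every conflicting write stores the same value, so a plain
        // atomic write per set element is enough.
        for (int j = 0; j < n; j++)
        {
            if (local[j])
            {
                #pragma omp atomic write
                marker[j] = 1;
            }
        }
    }
}

The merge adds an extra O(n) pass per thread, so this is only worthwhile if the atomic writes in the simple version are the actual bottleneck.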

If you write with different threads to the very same location, you get a race condition. This is not necessarily undefined behaviour, but nevertheless it needs to be avoided.
Since you write a "1" with all threads, it might be OK, but if you write real data it probably is not.
Side note: in order to get good numerical performance, you need to work on memory that is not too close together. If two threads write to two different elements in the same cache line, that chunk of memory is invalidated for all other threads. This leads to cache misses and will spoil your performance gain (the parallel execution might even be slower than the single-threaded one).
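To illustrate the side note, here is a sketch with invented names (PaddedSum, parallel_sum), assuming a 64-byte cache line: padding each thread's result so that no two threads ever write to the same cache line removes the invalidation traffic described above.

#include <omp.h>

#include <cstddef>
#include <vector>

// Each thread accumulates into its own cache-line-sized slot, so writes by
// one thread never invalidate the line another thread is using.
struct alignas(64) PaddedSum {
    double value = 0.0;
};

double parallel_sum(const std::vector<double>& data)
{
    std::vector<PaddedSum> partial(omp_get_max_threads());

    #pragma omp parallel
    {
        const int tid = omp_get_thread_num();
        #pragma omp for
        for (std::size_t i = 0; i < data.size(); ++i)
            partial[tid].value += data[i];
    }

    double total = 0.0;
    for (const PaddedSum& p : partial)
        total += p.value;
    return total;
}

Without the alignas(64) padding, roughly eight neighbouring doubles would sit in one cache line and the per-thread writes would keep invalidating each other.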

Related

how can I get good speedup for a parallel write to memory?

I'm new to OpenMP and trying to get some very basic loops in my code parallelized with OpenMP, with good speedup on multiple cores. Here's a function in my program:
bool Individual::_SetFitnessScaling_1(double source_value, EidosObject **p_values, size_t p_values_size)
{
    if ((source_value < 0.0) || (std::isnan(source_value)))
        return true;

    #pragma omp parallel for schedule(static) default(none) shared(p_values_size) firstprivate(p_values, source_value) if(p_values_size >= EIDOS_OMPMIN_SET_FITNESS_S1)
    for (size_t value_index = 0; value_index < p_values_size; ++value_index)
        ((Individual *)(p_values[value_index]))->fitness_scaling_ = source_value;

    return false;
}
So the goal is to set the fitnessScaling ivar of every object pointed to by pointers in the buffer that p_values points to, to the same double value source_value. Those various objects might be more or less anywhere in memory, so each write probably hits a different cache line; that's an aspect of the code that would be difficult to change, but I'm hoping that by spreading it across multiple cores I can at least divide that pain by a good speedup factor. The cast to (Individual *) is safe, by the way; checks were already done external to this function that guarantee its safety.
You can see my first attempt at parallelizing this, using the default static schedule (so each thread gets its own contiguous block in p_values), making the loop limit shared, and making p_values and source_value be firstprivate so each thread gets its own private copy of those variables, initialized to the original value. The threshold for parallelization, EIDOS_OMPMIN_SET_FITNESS_S1, is set to 900. I test this with a script that passes in a million values, with between 1 and 8 cores (and a max thread count to match), so the loop should certainly run in parallel. I have followed these same practices in some other places in the code and have seen a good speedup. [EDIT: I should say that the speedup I observe for this, for 2/4/6/8 cores/threads, is always about 1.1x-1.2x the single-threaded performance, so there's a very small win but it is realized already with 2 cores and does not get any better with 8 cores.] The notable difference with this code is that this loop spends its time writing to memory; the other loops I have successfully parallelized spend their time doing things like reading values from a buffer and summing across them, so they might be limited by memory read speeds, but not by memory write speeds.
It occurred to me that with all of this writing through a pointer, my loop might be thrashing due to things like aliasing (making the compiler force a flush of the cache after each write), or some such. I attempted to solve that kind of issue as follows, using const and __restrict:
bool Individual::_SetFitnessScaling_1(double source_value, EidosObject **p_values, size_t p_values_size)
{
    if ((source_value < 0.0) || (std::isnan(source_value)))
        return true;

    #pragma omp parallel default(none) shared(p_values_size) firstprivate(p_values, source_value) if(p_values_size >= EIDOS_OMPMIN_SET_FITNESS_S1)
    {
        EidosObject * const * __restrict local_values = p_values;

        #pragma omp for schedule(static)
        for (size_t value_index = 0; value_index < p_values_size; ++value_index)
            ((Individual *)(local_values[value_index]))->fitness_scaling_ = source_value;
    }
    return false;
}
This made no difference to the performance, however. I still suspect that some kind of memory contention, cache thrash, or aliasing issue is preventing the code from parallelizing effectively, but I don't know how to solve it. Or maybe I'm barking up the wrong tree?
These tests are done with Xcode 13 (i.e., using Apple clang 13.0.0) on macOS, on an M1 Mac mini (2020).
[EDIT: In reply to comments below, a few points.

(1) There is nothing fancy going on inside the class here, no operator= or similar; the assignment of source_value into fitness_scaling_ is, in effect, simply the assignment of a double into a field of a struct.

(2) The use of firstprivate(p_values, source_value) is to ensure that repeated reading from those values across threads doesn't introduce some kind of between-thread contention that slows things down (a minimal sketch of this idiom follows below). It is recommended in Mattson, He, & Koniges' book "The OpenMP Common Core"; see section 6.3.2, figure 6.10 with the corrected Mandelbrot code using firstprivate, and the quote on p. 111: "An easy solution is to change the storage attribute for eps to firstprivate. This gives each thread its own copy of the variable but with a specified value. Notice that eps is read-only. It is not updated inside the parallel region. Therefore, another solution is to let it be shared (shared(eps)) or not specify eps in a data environment clause and let its default, shared behavior be used. While this would result in correct code, it would potentially increase overhead. If eps is shared, every thread will be reading the same address in memory... Some compilers will optimize for such read-only variables by putting them into registers, but we should not rely on that behavior." I have observed this change speeding up parallelized loops in other contexts, so I have adopted it as my standard practice in such cases; if I have misunderstood, please do let me know.

(3) No, keeping the fitness_scaling_ values in their own buffer is not a workable solution, for several reasons. Most importantly, this method may be called with any arbitrary buffer of pointers to Individual; it is not necessarily setting the fitness_scaling_ of all Individual objects, just an effectively random subset of them, so this operation will never be reducible to a simple memset(). Also, I am going to need to similarly optimize the setting of many other properties on Individual and on other classes in my code, so a general solution is needed; I can't very well put all of the ivars of all of my classes into separately allocated buffers external to the objects themselves. And third, Individual objects are being dynamically allocated and deallocated independently of each other, so an external buffer of fitness_scaling_ values for the objects would have big implementation problems.]
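For reference, a minimal sketch of the firstprivate idiom from point (2), with invented names (clamp_below, values, eps); this illustrates the pattern only and is not code from the project:

#include <cmath>
#include <cstddef>

// eps and values are read-only inside the loop, so firstprivate gives every
// thread its own initialized copy instead of a shared address to re-read.
void clamp_below(double* values, std::size_t n, double eps)
{
    #pragma omp parallel for schedule(static) default(none) \
            shared(n) firstprivate(values, eps)
    for (std::size_t i = 0; i < n; ++i)
        values[i] = std::fmax(values[i], eps);
}

As the quoted passage itself says, whether this actually beats shared depends on the compiler; some compilers keep a shared read-only scalar in a register anyway.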

OpenMP: for loop avoid data race without using critical

Suppose I have the following C code and want to parallelize it using OpenMP.
for (int i = 0; i < n; ++i)
{
    int key = get_key(i);
    toArray[count[key]++] = fromArray[i];
}
I know that if I directly use a parallel for construct it could cause a data race and produce a wrong answer, but if I use critical, the performance would be very poor.
#pragma omp parallel for schedule(static)
for (int i = 0; i < n; ++i)
{
    int key = get_key(i);
    #pragma omp critical
    toArray[count[key]++] = fromArray[i];
}
I wonder if there is a way to parallelize it with good performance?
I'm afraid your assumption is wrong. The version with a critical section does not produce a correct answer - well, at least not a deterministic one.
For simplicity, take the case where get_key always returns 0. The serial version would copy the array; the parallel one would perform an arbitrary reshuffle. There is an ordering dependency between all iterations for which get_key returns the same value.
Generally speaking: simple critical sections can often be replaced by a reduction, which allows independent execution at the cost of some merge overhead after the parallel part. Atomics can also be an option for simple operations, but they too carry a general performance penalty and often additional negative cache effects. Technically, your incorrect critical-section code would be equivalent to this slightly more efficient atomic code:
int index;
#pragma omp atomic capture
index = count[key]++;

#pragma omp atomic write
toArray[index] = fromArray[i];
I wonder if there is a way to parallelize it with good performance?
Any question about performance needs more specific information. What are the involved types, data sizes, parallelism level, ...? There is no general answer to "this is the best way for performance".
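To make the reduction idea from above concrete, here is a sketch of a two-pass, counting-sort-style version. The element type (int), the key range [0, NUM_KEYS) and the helper get_key(i) are assumptions for illustration; the second pass is race-free and reproduces the serial ordering because OpenMP guarantees that two schedule(static) loops with the same iteration count in the same parallel region assign the same iterations to the same threads.

#include <omp.h>

#include <cstddef>
#include <vector>

int get_key(int i);           // assumed helper from the question
constexpr int NUM_KEYS = 256; // assumed key range

void bucket_copy(const int* fromArray, int* toArray, int n)
{
    std::vector<std::vector<int>> counts;

    #pragma omp parallel
    {
        // One counter array per thread, allocated once the team size is known
        // (single has an implicit barrier at its end).
        #pragma omp single
        counts.assign(omp_get_num_threads(), std::vector<int>(NUM_KEYS, 0));

        const int tid = omp_get_thread_num();

        // Pass 1: each thread counts the keys in its own static chunk.
        #pragma omp for schedule(static)
        for (int i = 0; i < n; ++i)
            ++counts[tid][get_key(i)];

        // Turn counts into disjoint output offsets: key 0 of thread 0,
        // key 0 of thread 1, ..., then key 1, and so on.
        #pragma omp single
        {
            int offset = 0;
            for (int k = 0; k < NUM_KEYS; ++k)
                for (std::size_t t = 0; t < counts.size(); ++t)
                {
                    const int c = counts[t][k];
                    counts[t][k] = offset;
                    offset += c;
                }
        }

        // Pass 2: scatter. Each (key, thread) output range is disjoint,
        // so there are no races and no critical section is needed.
        #pragma omp for schedule(static)
        for (int i = 0; i < n; ++i)
            toArray[counts[tid][get_key(i)]++] = fromArray[i];
    }
}

This pays off when NUM_KEYS is small compared to n; for very large key ranges the per-thread counter arrays themselves become the bottleneck.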

omp parallel for: Thread fails to assign value

I have a for loop parallelized by OpenMP. I want the parallel threads to fill an std::vector<bool> of falses with trues. Each thread should write to its own entry of the vector. But sometimes one assignment fails. How can this happen?
int size = 10;
std::vector<bool> vec(10); // all entries contain false

#pragma omp parallel for
for (int i = 0; i < size; ++i)
    vec[i] = true; // sometimes this assignment fails for a thread
The vector may end up looking like this:
true true true true false true true true true true
The problem is due to how std::vector<bool> is stored. It is designed for space efficiency, and stores elements in individual bits, not bytes. This means that assignments to vec[0], vec[1], ... vec[7] all write to the same byte in memory. Since you have multiple threads writing to the same address (it will actually be a read-modify-write sequence), there is a race condition. This can cause a write by one thread to be "undone" by a write of a later thread.
In addition, the potential performance benefit of a memory-intensive loop like this is not large, due to the bandwidth limits of main memory. Combined with the cache invalidation caused by each thread, the threaded version will likely be slower than the single-threaded one.
With the fundamental operation being done here (setting memory to true), just code it up as a standard non-OpenMP single-threaded loop. The compiler will be able to optimize it to something reasonably performant.
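For instance, the single-threaded version (a sketch reusing the question's names) is simply:

std::vector<bool> vec(size);               // all false
std::fill(vec.begin(), vec.end(), true);   // plain loop; the library can set whole words at once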
As per the standard, all standard containers are required to avoid data races when the contents of contained objects in different elements of the same container are modified concurrently.
Except std::vector<bool>.
1201ProgramAlarm explains why.
std::vector<bool> is infamous for many issues - the most straightforward solution is to use std::vector<char> instead. If you do that, your program is just fine, although you should not expect any performance boost from using OpenMP in this particular case.
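A sketch of that fix, reusing the loop from the question with only the element type swapped (fill_flags is an invented wrapper name):

#include <vector>

void fill_flags()
{
    const int size = 10;
    std::vector<char> vec(size);   // one full byte per element: no bit packing,
                                   // so no shared read-modify-write between threads

    #pragma omp parallel for
    for (int i = 0; i < size; ++i)
        vec[i] = 1;                // each thread writes only its own bytes
}

Neighbouring chars still share a cache line, so this is correct but, as noted above, not necessarily any faster than the serial loop.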
My OpenMP is rusty, but what comes to mind is #pragma omp flush [(list)].
The flush directive in OpenMP may be used to identify a synchronization point, which is defined as a point in the execution of the program where the executing thread needs to have a consistent view of memory. A consistent view of memory has two requirements: all memory read/write operations before and after the flush directive must be performed either before or after it. This is often called a memory fence. (Page 163, Parallel Programming in OpenMP by Chandra, Dagum, Kohr, Maydan, McDonald, Menon; 2001.)
There is a much more detailed description of this in the book, as well as mention of variables not being retained in a buffer/register across any synchronization point, which, judging by your posted code, you are not using.
With very little code posted, and without knowing how or when you are checking vec[i], this would be my first guess... and then I would check the versions of the Linux operating system, C compiler, and OpenMP involved...
But I don't expect an OS upgrade, a compiler upgrade, or an OpenMP version upgrade to correct what you are experiencing; it's likely your code.
It is much like what happens in standard C when doing printf("hello") without a following fflush(stdout), as opposed to printf("hello\n"): the \n causes an implicit flush on most systems, which is taken for granted.

Shared vectors in OpenMP

I am trying to parallelize a program I am using and have the following question.
Will I lose performance if multiple threads need to read/write the same vector, but different elements of that vector? I have the feeling that's the reason my program hardly gets any faster when I parallelize it. Take the following code:
#include <cmath>
#include <vector>

int main(){
    std::vector<double> numbers;
    std::vector<double> results(10);
    double x;

    // write 10 values into vector numbers
    for (int i = 0; i < 10; i++){
        numbers.push_back(std::cos(i));
    }

    #pragma omp parallel for \
            private(x) \
            shared(numbers, results)
    for (int j = 0; j < 10; j++){
        x = 2 * numbers[j] + 5;
        #pragma omp critical // do I need this ?
        {
            results[j] = x;
        }
    }
    return 0;
}
Obviously the actual program does far more expensive operations, but this example is only meant to illustrate my question. Can the for loop run completely in parallel, or do the different threads have to wait for each other because only one thread at a time can access the vector numbers, for instance, even though they are all reading different elements of it?
Same question for the write operation: do I need the critical pragma, or is it no problem since every thread writes into a different element of the vector results?
I am happy with any help I can get, and it would also be good to know whether there is a better way to do this (maybe not use vectors at all, but plain arrays and pointers, etc.?).
I also read that vectors aren't thread-safe in certain cases and that using a pointer is recommended: OpenMP and STL vector
Thanks a lot for your help!
I imagine that most of the issues with vectors across multiple threads arise when a vector has to resize: it then copies its entire contents into a new place in memory (a larger allocated chunk), and if you are accessing it in parallel at that moment you may read an object that has already been deleted.
If you are not resizing the vector, then I have never had any trouble with concurrent reads/writes into it (obviously as long as I'm not writing to the same element twice).
As for the lack of a performance boost: the OpenMP critical section will probably slow your program down to about the speed of a single thread (depending on how much work is actually done outside that critical section).
You can remove the critical section statement (with the conditions above in mind).
You get no speedup precisely because of the critical section, which is superfluous, since the same elements will never be modified at the same time. Remove the critical section and it will work just fine.
You can play with the schedule strategy as well, because if memory access is not linear (it is in the example you gave), threads might fight for cache (writing elements in the same cache line). OTOH if the number of elements is given as in your case and there is no branching in the loop (therefore they will execute at about the same speed), static, which is IIRC the default, should work the best anyway.
(BTW you can declare x inside the loop to avoid private(x), and the shared clause is implied IIRC (I have never used it); see the sketch below.)
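Putting those two suggestions together, a sketch (reusing numbers and results from the question): drop the critical section and declare x inside the loop so it is automatically private.

#pragma omp parallel for shared(numbers, results)
for (int j = 0; j < 10; j++) {
    double x = 2 * numbers[j] + 5;   // x declared inside the loop: private to each iteration
    results[j] = x;                  // each iteration writes a distinct element: no race
}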

C++ OpenMP directives

I have a loop that I'm trying to parallelize, and in it I am filling a container, say an STL map. Consider the simple pseudo-code below, where T1 and T2 are arbitrary types, while f and g are functions taking an integer argument and returning T1 and T2 respectively:
#pragma omp parallel for schedule(static) private(i) shared(c)
for(i = 0; i < N; ++i) {
    c.insert(std::make_pair<T1,T2>(f(i), g(i)));
}
This looks rather straightforward and seems like it should be trivially parallelizable, but it doesn't speed up as I expected. On the contrary, it leads to run-time errors in my code, due to unexpected values being filled in the container, likely caused by race conditions. I've even tried putting barriers and what-not, but all to no avail. The only thing that makes it work is to use a critical directive as below:
#pragma omp parallel for schedule(static) private(i) shared(c)
for(i = 0; i < N; ++i) {
    #pragma omp critical
    {
        c.insert(std::make_pair<T1,T2>(f(i), g(i)));
    }
}
But this sort of defeats the whole point of using OpenMP in the above example, since only one thread at a time executes the bulk of the loop (the container insert). What am I missing here? Short of changing the way the code is written, can somebody kindly explain?
This particular example you have is not a good candidate for parallelism unless f() and g() are extremely expensive function calls.
STL containers are not thread-safe. That's why you're getting the race conditions. So accessing them needs to be synchronized - which makes your insertion process inherently sequential.
As the other answer mentions, there's a LOT of overhead for parallelism. So unless f() and g() are extremely expensive, your loop doesn't do enough work to offset that overhead.
Now assuming f() and g() are extremely expensive calls, then your loop can be parallelized like this:
#pragma omp parallel for schedule(static) private(i) shared(c)
for(i = 0; i < N; ++i) {
    std::pair<T1,T2> p = std::make_pair<T1,T2>(f(i), g(i));

    #pragma omp critical
    {
        c.insert(p);
    }
}
Running multithreaded code makes you think about thread safety and shared access to your variables. Once you start inserting into c from multiple threads, the collection must be prepared to take such "simultaneous" calls and keep its data consistent; are you sure it is made this way?
Another thing is that parallelization has its own overhead and you are not going to gain anything when you try to run a very small task on multiple threads - with the cost of splitting and synchronization you might end up with even higher total execution time for the task.
c will obviously have data races, as you guessed. An STL map is not thread-safe: calling its insert method concurrently from multiple threads will have very unpredictable behavior, most likely just a crash.
Yes, to avoid the data races you must have either (1) a mutex such as #pragma omp critical, or (2) a concurrent data structure (a.k.a. lock-free data structures). However, not all data structures can be lock-free on current hardware. For example, TBB provides tbb::concurrent_hash_map. If you don't need the keys to be ordered, you may use it and could get some speedup, as it does not use a conventional mutex.
In cases where a hash table is sufficient and the table is very large, you can take a reduction-like approach (see this link for the concept of reduction). Hash tables do not care about the order of insertion. In this approach, you allocate one hash table per thread and let each thread insert N/#threads items in parallel, which will give a speedup. Lookups can also be done easily by querying these per-thread tables in parallel.
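A sketch of that per-thread-table idea (C++17, using std::unordered_map as the hash table and the question's T1/T2/f/g as placeholders): each thread fills its own table without any synchronization, and the tables are either merged afterwards or queried side by side.

#include <omp.h>

#include <unordered_map>
#include <vector>

// Placeholders standing in for the question's types and functions.
using T1 = int;
using T2 = double;
T1 f(int i);
T2 g(int i);

std::unordered_map<T1, T2> build_map(int N)
{
    std::vector<std::unordered_map<T1, T2>> partial;

    #pragma omp parallel
    {
        #pragma omp single
        partial.resize(omp_get_num_threads());   // one table per thread

        // Each thread inserts ~N/#threads items into its own table: no races.
        #pragma omp for schedule(static)
        for (int i = 0; i < N; ++i)
            partial[omp_get_thread_num()].emplace(f(i), g(i));
    }

    // Serial merge; for lookups only, the partial tables could instead be
    // queried in parallel without merging at all.
    std::unordered_map<T1, T2> result;
    for (auto& m : partial)
        result.merge(m);
    return result;
}

Whether this beats the critical-section version depends on how expensive f() and g() are relative to the final merge.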