OpenMP: avoid a data race in a for loop without using critical (C++)

Suppose I have the following C code and want to parallelize it using OpenMP.
for (int i = 0; i < n; ++i)
{
    int key = get_key(i);
    toArray[count[key]++] = fromArray[i];
}
I know that if I directly use the parallel for construct it will cause a data race and produce a wrong answer, but if I use critical, the performance is very poor.
#pragma omp parallel for schedule(static)
for (int i = 0; i < n; ++i)
{
    int key = get_key(i);
    #pragma omp critical
    toArray[count[key]++] = fromArray[i];
}
I wonder if there is a way to parallelize it with good performance?

I'm afraid your assumption is wrong. The version with a critical section does not produce a correct answer - well, at least not a deterministic one.
For simplicity, take the case where get_key always returns 0. The serial version would simply copy the array; the parallel one would perform an arbitrary reshuffle. There is an ordering dependency between all iterations for which get_key returns the same value.
Generally speaking, simple critical sections can often be replaced by a reduction, which allows independent execution at the cost of some merge overhead after the parallel part. Atomics can also be an option for simple operations, but they too suffer from a general performance penalty and often cause additional cache problems. Technically, your incorrect critical-section code would be equivalent to this slightly more efficient atomic code:
int index;
#pragma omp atomic capture
index = count[key]++;
#pragma omp atomic write
toArray[index] = fromArray[i];
I wonder if there is a way to parallelize it with good performance?
Any question about performance needs more specific information: what are the involved types, data sizes, level of parallelism, and so on? There is no general "this is the best way for performance" answer.
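Since the answer mentions replacing critical sections with a reduction-style approach, here is a minimal sketch of what that could look like for this scatter-by-key pattern (my addition, not part of the answer). The function name scatter_by_key, the int element type and the num_keys parameter are assumptions; get_key, fromArray and toArray follow the question. Each thread counts the keys of its own block, a short serial prefix sum turns the counts into disjoint output offsets, and the scatter then needs no synchronization at all; because schedule(static) hands each thread one contiguous block of i, the output should even match the serial order.

#include <omp.h>
#include <vector>

int get_key(int i); // from the question

void scatter_by_key(const int* fromArray, int* toArray, int n, int num_keys)
{
    const int nthreads = omp_get_max_threads();
    // per-thread key counts, laid out as counts[thread * num_keys + key]
    std::vector<int> counts((size_t)nthreads * num_keys, 0);

    #pragma omp parallel
    {
        int* my = &counts[(size_t)omp_get_thread_num() * num_keys];

        // pass 1: each thread counts the keys of its own iteration block
        #pragma omp for schedule(static)
        for (int i = 0; i < n; ++i)
            my[get_key(i)]++;

        // merge: an exclusive prefix sum over (key, thread) turns the counts
        // into disjoint output offsets, one private range per thread and key
        #pragma omp single
        {
            int sum = 0;
            for (int k = 0; k < num_keys; ++k)
                for (int t = 0; t < nthreads; ++t) {
                    int c = counts[(size_t)t * num_keys + k];
                    counts[(size_t)t * num_keys + k] = sum;
                    sum += c;
                }
        } // the implicit barrier of single keeps pass 2 from starting early

        // pass 2: the same static schedule assigns the same iterations to the
        // same thread, so every write lands in that thread's private range
        #pragma omp for schedule(static)
        for (int i = 0; i < n; ++i)
            toArray[my[get_key(i)]++] = fromArray[i];
    }
}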


Why do they prefer dynamic scheduling in this code snippet

I solved an exercise in which I had to implement SGD (stochastic gradient descent) with momentum; the follow-up task was to parallelize it.
My suggestion was as follows:
#pragma omp for schedule(static) nowait
for (int i = 0; i < size; i++)
{
    const double nablaE_w = (1.0/double(B)) * (grad[i] + lambda * param[i]);
    mom_w[i] = beta * mom_w[i] - eta * nablaE_w;
    param[i] = param[i] + mom_w[i];
}
I use nowait because the calculation is finished after the for loop.
But the solution uses:
#pragma omp for schedule (dynamic, 64/sizeof(double)) nowait
Also, after reading several Stack Overflow answers, I still cannot see the advantage of schedule(dynamic) here, or why they use a chunk size of 8.
I am careful when using schedule(dynamic) - in fact I never use it, because I have no idea how to do it correctly.
The solution likely did it for one of two reasons: either there is another nowait loop nearby that is imbalanced and might make threads reach this loop out of order, or they were worried about running on a system with a great deal of per-core interference. Using schedule(dynamic) is materially more expensive (at least for now, until runtimes actually implement the nonmonotonic modifier properly), so static is the clear choice as long as the workload is balanced, which it is in this snippet. The only reason to use dynamic here would be if imbalance between threads were expected to come from somewhere else, either elsewhere in the code or from interfering workloads.
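Two side notes of my own, not part of the answer above: the chunk size 64/sizeof(double) evaluates to 8 because a cache line is typically 64 bytes, so each chunk covers exactly one cache line of param and mom_w and no two threads write into the same line. And if you do want dynamic scheduling, OpenMP 4.5 lets you request the nonmonotonic behaviour explicitly; a minimal sketch, assuming an OpenMP 4.5+ compiler and the variable names from the question:

// same loop as above, but with the nonmonotonic schedule modifier spelled out
#pragma omp for schedule(nonmonotonic : dynamic, 64 / sizeof(double)) nowait
for (int i = 0; i < size; i++)
{
    const double nablaE_w = (1.0 / double(B)) * (grad[i] + lambda * param[i]);
    mom_w[i] = beta * mom_w[i] - eta * nablaE_w;
    param[i] = param[i] + mom_w[i];
}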

Simultaneous writes to the same memory in a parallel OMP loop

I want to implement the following function, which marks some elements of an array with 1.
void mark(std::vector<signed char>& marker)
{
    #pragma omp parallel for schedule(dynamic, M)
    for (int i = 0; i < (int)marker.size(); i++)
        marker[i] = 0;

    #pragma omp parallel for schedule(dynamic, M)
    for (int i = 0; i < (int)marker.size(); i++)
        marker[getIndex(i)] = 1; // is it ok ?
}
What will happen if we try to set the value of the same element to 1 from different threads at the same time? Will it simply be set to 1, or may this loop lead to unexpected behavior?
This answer is wrong in one fundamental part (emphasis mine):
If you write with different threads to the very same location, you get a race condition. This is not necessarily undefined behaviour, but it nevertheless needs to be avoided.
Having a look at the OpenMP standard, section 1.4.1 says (also emphasis mine):
If multiple threads write without synchronization to the same memory unit, including cases due to atomicity considerations as described above, then a data race occurs. Similarly, if at least one thread reads from a memory unit and at least one thread writes without synchronization to that same memory unit, including cases due to atomicity considerations as described above, then a data race occurs. If a data race occurs then the result of the program is unspecified.
Technically, the OP's snippet is in undefined-behavior territory. This implies there is no guarantee about the program's behavior until the UB is removed from it.
The simplest way to remove it is to protect the memory access with an atomic operation:
#pragma omp parallel for schedule(dynamic, M)
for (int i = 0; i < (int)marker.size(); i++)
{
    #pragma omp atomic write seq_cst
    marker[getIndex(i)] = 1;
}
but that will probably hurt performance noticeably (as @schorsch312 correctly noted).
If you write with different threads to the very same location, you get a race condition. This is not necessarily undefined behaviour, but it nevertheless needs to be avoided.
Since you write a "1" with all threads, it might be OK, but if you write real data it probably is not.
Side note: to get good numerical performance, threads need to work on memory that is not too close together. If two threads write to two different elements in the same cache line, that chunk of memory is invalidated for all other threads. This leads to cache misses and will spoil your performance gain (the parallel execution might even be slower than the single-threaded one).
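To make that side note concrete, here is a minimal sketch (my addition, not from either answer) that sizes the chunks of the first, race-free loop so each thread owns whole 64-byte cache lines of marker; the cache-line constant and the switch to schedule(static, chunk) are assumptions:

#include <vector>

void clear_markers(std::vector<signed char>& marker)
{
    // one chunk = one 64-byte cache line of signed char, i.e. 64 elements,
    // so neighbouring chunks handled by different threads never share a line
    const int chunk = 64 / (int)sizeof(signed char);

    #pragma omp parallel for schedule(static, chunk)
    for (int i = 0; i < (int)marker.size(); i++)
        marker[i] = 0;
}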

How can I convert a sequential program to parallel using OpenMP?

I'm getting started with OpenMP and I want to parallelize this portion of code:
for (i = 0; i < n; i++)
    for (j = 1; j < n; j++)
        A[i][j] += A[i][j-1];
How can I make this loop parallel?
I recommend you start by checking out the following link: http://bisqwit.iki.fi/story/howto/openmp/.
It gives a brief overview of what you can achieve using OpenMP.
For your code fragment, parallelising can be as easy as writing a single pragma:
#pragma omp parallel for private(i, j) shared(A, n)
for (i = 0; i < n; ++i)
    for (j = 1; j < n; ++j)
        A[i][j] += A[i][j-1];
That is the idea behind OpenMP: you annotate your program with pragmas that let the code be compiled and linked with OpenMP and then run in parallel, or compiled with the pragmas ignored, in which case it should remain a valid sequential program.
In this case the pragma leaves the decision about how many threads to run to the runtime, which generally decides based on the number of cores in the machine. The outer loop is parallelised and each i iteration is conceptually performed by a different thread. This is important because you have data dependencies between the various j iterations, and communication/synchronisation in parallel code is tricky; keeping the inner loop within a single thread deals with that problem. The shared clause could be left out because variables are shared by default, but for exactly that reason you shouldn't leave it out: be explicit about what you want shared and what you want private. That is a good way to avoid many of the bugs that happen when writing parallel code.
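For completeness, here is a minimal self-contained version of the parallelised fragment (my addition, not part of the answer); the 4x4 test matrix filled with ones and the use of std::vector are arbitrary choices, and every output row should read 1 2 3 4:

#include <cstdio>
#include <vector>

int main()
{
    const int n = 4;
    std::vector<std::vector<int>> A(n, std::vector<int>(n, 1));
    int i, j;

    #pragma omp parallel for private(i, j) shared(A)
    for (i = 0; i < n; ++i)           // rows are independent: safe to split
        for (j = 1; j < n; ++j)       // within a row there is a dependency,
            A[i][j] += A[i][j-1];     // so the inner loop stays sequential

    for (i = 0; i < n; ++i) {
        for (j = 0; j < n; ++j)
            std::printf("%d ", A[i][j]);
        std::printf("\n");
    }
    return 0;
}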

C++ OpenMP directives

I have a loop that I'm trying to parallelize, and in it I am filling a container, say an STL map. Consider the simple pseudo-code below, where T1 and T2 are arbitrary types, while f and g are functions of an integer argument returning T1 and T2 respectively:
#pragma omp parallel for schedule(static) private(i) shared(c)
for(i = 0; i < N; ++i) {
    c.insert(std::make_pair<T1,T2>(f(i), g(i)));
}
This looks rather straightforward and seems like it should be trivially parallelizable, but it doesn't speed up as I expected. On the contrary, it leads to run-time errors in my code due to unexpected values being filled in the container, likely because of race conditions. I've even tried putting barriers and whatnot, but all to no avail. The only thing that makes it work is to use a critical directive as below:
#pragma omp parallel for schedule(static) private(i) shared(c)
for(i = 0; i < N; ++i) {
    #pragma omp critical
    {
        c.insert(std::make_pair<T1,T2>(f(i), g(i)));
    }
}
But this rather defeats the whole point of using OpenMP in the above example, since only one thread at a time executes the bulk of the loop (the container insert statement). What am I missing here? Short of changing the way the code is written, can somebody kindly explain?
This particular example you have is not a good candidate for parallelism unless f() and g() are extremely expensive function calls.
STL containers are not thread-safe. That's why you're getting the race conditions. So accessing them needs to be synchronized - which makes your insertion process inherently sequential.
As the other answer mentions, there's a LOT of overhead for parallelism. So unless f() and g() are extremely expensive, your loop doesn't do enough work to offset that overhead.
Now, assuming f() and g() are extremely expensive calls, your loop can be parallelized like this:
#pragma omp parallel for schedule(static) private(i) shared(c)
for(i = 0; i < N; ++i) {
    std::pair<T1,T2> p = std::make_pair<T1,T2>(f(i), g(i));
    #pragma omp critical
    {
        c.insert(p);
    }
}
Running multithreaded code forces you to think about thread safety and shared access to your variables. As soon as you start inserting into c from multiple threads, the collection has to be prepared to take such "simultaneous" calls and keep its data consistent; are you sure it is made this way?
Another thing is that parallelization has its own overhead, and you are not going to gain anything if you run a very small task on multiple threads - with the cost of splitting and synchronization you might end up with an even higher total execution time for the task.
c will obviously have data races, as you guessed. An STL map is not thread-safe: calling its insert method concurrently from multiple threads has very unpredictable behavior, most often a crash.
Yes, to avoid the data races, you must have either (1) a mutex such as #pragma omp critical, or (2) a concurrent data structure (a.k.a. a lock-free data structure). However, not all data structures can be lock-free on current hardware. For example, TBB provides tbb::concurrent_hash_map. If you don't need the keys to be ordered, you can use it and could get some speedup, since it does not use a conventional mutex.
In case you can use just a hash table and the table is very large, you could take a reduction-like approach (see this link for the concept of reduction). Hash tables do not care about the order of insertion. In this case, you allocate one hash table per thread and let each thread insert N/#threads items in parallel, which gives a speedup. Lookups can also be done easily by accessing these tables in parallel.
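A minimal sketch of that reduction-like idea (my addition, not from the answer), using one std::map per thread for illustration rather than a hash table; the function name build_map is made up, while T1, T2, f, g and N are the question's placeholders:

#include <map>
#include <omp.h>
#include <vector>

std::map<T1, T2> build_map(int N)
{
    // one private map per thread; nothing is shared inside the parallel loop
    std::vector<std::map<T1, T2>> partial(omp_get_max_threads());

    #pragma omp parallel
    {
        std::map<T1, T2>& local = partial[omp_get_thread_num()];

        #pragma omp for schedule(static)
        for (int i = 0; i < N; ++i)
            local.emplace(f(i), g(i));   // no critical section needed here
    }

    // serial merge: pays the insertion cost once, without any contention
    std::map<T1, T2> c;
    for (const auto& m : partial)
        c.insert(m.begin(), m.end());
    return c;
}

Whether the merge pays off still depends on how expensive f() and g() are, as the first answer points out.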

Concurrency and optimization using OpenMP

I'm learning OpenMP. To do so, I'm trying to parallelize an existing piece of code, but I seem to get a worse time when using OpenMP than when I don't.
My inner loop:
#pragma omp parallel for
for(unsigned long j = 0; j < c_numberOfElements; ++j)
{
    //int th_id = omp_get_thread_num();
    //printf("thread %d, j = %d\n", th_id, (int)j);
    Point3D current;
    #pragma omp critical
    {
        current = _points[j];
    }
    Point3D next = getNext(current);
    if (!hasConstraint(next))
    {
        continue;
    }
    #pragma omp critical
    {
        _points[j] = next;
    }
}
_points is a pointMap_t, defined as:
typedef boost::unordered_map<unsigned long, Point3D> pointMap_t;
Without OpenMP my running time is 44.904s. With OpenMP enabled, on a computer with two cores, it is 64.224s. What am I doing wrong?
Why have you wrapped your reads and writes to _points[j] in critical sections? I'm not much of a C++ programmer, but it doesn't look to me as if you need those sections at all. As you've written it (unnamed critical sections), each thread is going to wait while another goes through either of the sections. This could easily make the program slower.
It seems possible that the lookup and write to _points in critical sections is dragging down the performance when you use OpenMP. Single-threaded, this will not result in any contention.
Sharing seed data like this seems counterproductive in a parallel programming context. Can you restructure to avoid these contention points?
You need to show the rest of the code. From a comment to another answer, it seems you are using a map. That is really a bad idea, especially if you are mapping 0..n numbers to values: why don't you use an array?
If you really need to use containers, consider using the ones from Intel's Threading Building Blocks library.
I agree that it would be best to see some working code.
The ultimate issue here is that there are criticals inside a parallel region, and criticals are (a) enormously expensive in themselves and (b) by definition kill parallelism. The assignment to current certainly doesn't need to be inside a critical, as current is private; I wouldn't have thought the _points[j] assignment would need to be either, but I don't know what the map machinery does, so there you go.
But you have a loop with a huge amount of overhead, which grows with the number of threads (the two critical regions), in order to do a tiny amount of actual work (walking along a linked list, it looks like). That's never going to be a good trade-off...
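To make the last two answers concrete, here is a minimal sketch (my addition, not from any of the answers) of the loop without critical sections. It assumes every key 0..c_numberOfElements-1 is already present in _points, so the lookups never modify the map's structure and each iteration only writes to its own entry; the switch from operator[] to at() is also my assumption:

#pragma omp parallel for
for (unsigned long j = 0; j < c_numberOfElements; ++j)
{
    Point3D& slot = _points.at(j);   // lookup only: the map structure is
    Point3D next = getNext(slot);    // never modified by any thread
    if (hasConstraint(next))
        slot = next;                 // each j touches a distinct entry: no race
}

If the keys really are just 0..n-1, replacing the boost::unordered_map with a std::vector<Point3D>, as suggested above, would make the loop both simpler and faster.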