How can I convert a sequential program to parallel using OpenMP? - C++

I'm a beginner with OpenMP and I want to parallelize this portion of code:
for (i = 0; i < n; ++i)
    for (j = 1; j < n; ++j)
        A[i][j] += A[i][j-1];
How can I make this loop parallel?

I recommend you start by checking out the following link: http://bisqwit.iki.fi/story/howto/openmp/.
It gives a brief overview of what you can achieve using OpenMP.
For your code fragment, parallelising can be as easy as writing a single pragma:
#pragma omp parallel for private(i, j) shared(A, n)
for (i = 0; i < n; ++i)
    for (j = 1; j < n; ++j)
        A[i][j] += A[i][j-1];
That is the idea behind OpenMP: you annotate your program with pragmas that allow the code to be compiled and linked against an OpenMP runtime and then run in parallel, or compiled with the pragmas ignored, in which case the program should remain a valid sequential program.
In this case the pragma leaves the decision about how many threads to run to the runtime, which generally decides based on the number of cores in the machine. The outer loop is parallelized, so each i iteration is conceptually performed by a different thread. This matters because you have data dependencies between the j iterations, and communication/synchronization in parallel code is tricky; keeping the inner loop within a single thread sidesteps that problem.
The shared clause could be left out, because variables are shared by default. But that is exactly why you should not leave it out: be explicit about what you want shared and what you want private. Being explicit avoids many of the bugs that creep into parallel code.
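For completeness, here is a minimal self-contained sketch of the same loop. The matrix size, its contents and the use of std::vector are invented for illustration; with GCC or Clang, compile with the -fopenmp flag (without it the pragma is simply ignored and the program runs sequentially, as described above).
#include <omp.h>
#include <cstdio>
#include <vector>

int main()
{
    int n = 1000;
    // n x n matrix filled with 1.0, just so there is something to sum
    std::vector<std::vector<double>> A(n, std::vector<double>(n, 1.0));

    int i, j;
    #pragma omp parallel for private(i, j) shared(A, n)
    for (i = 0; i < n; ++i)          // rows are independent: one row per thread
        for (j = 1; j < n; ++j)      // the prefix-sum dependency stays inside one thread
            A[i][j] += A[i][j-1];

    // each row is now a running sum, so the last entry of any row equals n
    std::printf("A[0][%d] = %g\n", n - 1, A[0][n-1]);
    return 0;
}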

Related

Why do they prefer dynamic scheduling in this code snippet

I solved an exercise where I had to implement SGD (stochastic gradient descent) with momentum; the follow-up task was to parallelize it.
My suggestion was as follows:
#pragma omp for schedule(static) nowait
for (int i = 0; i < size; i++)
{
    const double nablaE_w = (1.0/double(B)) * (grad[i] + lambda * param[i]);
    mom_w[i] = beta * mom_w[i] - eta * nablaE_w;
    param[i] = param[i] + mom_w[i];
}
I use nowait because the calculation is finished after this for loop.
But the solution uses:
#pragma omp for schedule (dynamic, 64/sizeof(double)) nowait
Even after reading several Stack Overflow answers, I still cannot see the advantage of schedule(dynamic), or why they use a chunk size of 8.
I am careful when using schedule(dynamic); actually, I never use it, because I have no idea how to use it correctly.
The solution probably did it for one of two reasons: there is another nowait loop nearby that is imbalanced and might let threads reach this loop out of order, or they were worried about running on a system with a great deal of per-core interference. Using schedule(dynamic) is materially more expensive, at least until runtimes actually implement nonmonotonic scheduling properly, so static is the clear choice as long as the workload is balanced, which this snippet is. The only reason to use dynamic here would be if imbalance between threads were expected to come from somewhere else, either elsewhere in the code or from interfering workloads.
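As for the chunk size itself, my reading (an interpretation, not something stated in the solution) is that 64 is a common cache-line size in bytes, so 64/sizeof(double) == 8 means each thread grabs roughly one cache line of param and mom_w at a time, which limits false sharing to chunk boundaries when dynamic scheduling hands out neighbouring chunks to different threads:
// same loop as above, with the solution's clause spelled out
#pragma omp for schedule(dynamic, 64 / sizeof(double)) nowait   // chunk = 8 doubles = 64 bytes
for (int i = 0; i < size; i++)
{
    const double nablaE_w = (1.0/double(B)) * (grad[i] + lambda * param[i]);
    mom_w[i] = beta * mom_w[i] - eta * nablaE_w;   // momentum update
    param[i] = param[i] + mom_w[i];                // parameter step
}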

OpenMP: for loop avoid data race without using critical

Suppose I have the following C code and want to parallelize it using OpenMP.
for (int i = 0; i < n; ++i)
{
    int key = get_key(i);
    toArray[count[key]++] = fromArray[i];
}
I know that if I directly use the parallel for syntax it could cause a data race and give a wrong answer, but if I use critical, the performance would be very poor.
#pragma omp parallel for schedule(static)
for (int i = 0; i < n; ++i)
{
    int key = get_key(i);
    #pragma omp critical
    toArray[count[key]++] = fromArray[i];
}
I wonder if there is a way to parallelize it with good performance?
I'm afraid your assumption is wrong. The version with a critical section does not produce a correct answer, or at least not a deterministic one.
For simplicity, take the case where get_key always returns 0. The serial version would copy the array; the parallel one would perform an arbitrary reshuffle. There is an ordering dependency between all iterations for which get_key returns the same value.
Generally speaking, simple critical sections can often be replaced by a reduction, which allows independent execution at the cost of some merge overhead after the parallel part. Atomics can also be an option for simple operations, but they too carry a general performance penalty and often additional negative cache effects. Technically, your incorrect critical-section code would be equivalent to this slightly more efficient atomic code:
int index;
#pragma omp atomic capture
index = count[key]++;            // atomically reserve a slot: read the old count, then increment
#pragma omp atomic write
toArray[index] = fromArray[i];   // write into the reserved slot
I wonder if there is a way to parallelize it with good performance?
Any question about performance needs more specific information. What are the involved types, data sizes, parallelism level, ...? There is no general answer to "this is the best way for performance".
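To illustrate the reduction idea mentioned above, here is one possible sketch (not from the question): each thread first counts the keys in its own block of iterations, the per-thread counts are turned into starting offsets, and then every thread scatters its elements without any synchronization. It assumes keys lie in [0, num_keys) and that get_key is a pure function, as in the question.
#include <omp.h>
#include <vector>

int get_key(int i);   // as in the question

void parallel_scatter(const int *fromArray, int *toArray, int n, int num_keys)
{
    int num_threads = omp_get_max_threads();
    // local_count[t][k]: number of elements with key k owned by thread t,
    // later reused to hold thread t's write offset for key k
    std::vector<std::vector<int>> local_count(num_threads,
                                              std::vector<int>(num_keys, 0));

    #pragma omp parallel num_threads(num_threads)
    {
        int t = omp_get_thread_num();

        // Phase 1: count keys per thread; no shared writes at all
        #pragma omp for schedule(static)
        for (int i = 0; i < n; ++i)
            ++local_count[t][get_key(i)];

        // Phase 2: exclusive prefix sum over (key, thread) gives each thread a
        // private, disjoint output range per key; ordering by key first and
        // thread second reproduces the serial order
        #pragma omp single
        {
            int offset = 0;
            for (int k = 0; k < num_keys; ++k)
                for (int th = 0; th < num_threads; ++th) {
                    int c = local_count[th][k];
                    local_count[th][k] = offset;
                    offset += c;
                }
        }   // implicit barrier

        // Phase 3: scatter; schedule(static) hands each thread the same
        // iterations as in phase 1, so the offsets line up
        #pragma omp for schedule(static)
        for (int i = 0; i < n; ++i) {
            int key = get_key(i);
            toArray[local_count[t][key]++] = fromArray[i];
        }
    }
}
Whether this pays off still depends on the sizes and types involved, as noted above; the point is only that the per-element critical section can be traded for one counting pass plus a small serial prefix sum.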

Performance problems using OpenMP in nested loops

I'm using the following code, which contains an OpenMP parallel for loop nested in another for loop. Somehow the performance of this code is 4 times slower than the sequential version (omitting #pragma omp parallel for).
Is it possible that OpenMP has to create threads every time the method is called? In my test it is called 10000 times in a row.
I heard that sometimes OpenMP will keep the threads spinning, so I also tried setting OMP_WAIT_POLICY=active and GOMP_SPINCOUNT=INFINITE. When I remove the OpenMP pragmas, the code is about 10 times faster. Note that the method containing this code will be called 10000 times.
for (round k = 1; k < processor.max; ++k) {
    initialise_round(k);
    for (std::vector<int> bucket : color_buckets) {
        #pragma omp parallel for schedule(dynamic)
        for (int i = 0; i < bucket.size(); ++i) {
            if (processor.mark.is_marked_item(bucket[i])) {
                processor.process(k, bucket[i]);
            }
        }
        processor.finish_round(k);
    }
}
You say that your sequential code is much faster, which makes me think that processor.process does too little work per call. In that case handing the data out to each thread does not pay off: the overhead of distributing the work is simply larger than the actual computation done by the thread.
Other than that, I think that parallelizing the middle loop instead would not change the algorithm but would increase the amount of work per thread.
I think you are creating a team of threads on each pass through the inner loops. It would probably be better to separate the parallel from the for, so that forking and creating the threads is done just once rather than repeated inside the loops: put a parallel pragma around your outermost loop and keep a plain omp for on the inner loop, as sketched below.
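A sketch of that separation, reusing the names from the question; it assumes processor.max and color_buckets do not change while the region runs, and that initialise_round and finish_round must each execute exactly once per iteration (hence the single constructs):
#pragma omp parallel
{
    for (round k = 1; k < processor.max; ++k) {
        #pragma omp single
        initialise_round(k);                       // one thread only; implicit barrier after

        for (const std::vector<int> &bucket : color_buckets) {
            #pragma omp for schedule(dynamic)
            for (int i = 0; i < (int)bucket.size(); ++i) {
                if (processor.mark.is_marked_item(bucket[i]))
                    processor.process(k, bucket[i]);
            }                                      // implicit barrier after the for

            #pragma omp single
            processor.finish_round(k);             // once per bucket, as in the question
        }
    }
}
Taking the bucket by const reference also avoids copying each vector, which the original range-for does on every iteration.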
The actual problem was not related to OpenMP directly.
Since the system has two CPUs, half of the threads were spawned on one CPU and the other half on the other, so there was no shared L3 cache. Combined with the fact that the algorithm doesn't scale well, this led to a performance decrease, especially when using 2-4 threads.
The solution was to use thread pinning, for example via the Linux tool taskset.

C++ OpenMP directives

I have a loop that I'm trying to parallelize, and in it I am filling a container, say an STL map. Consider the simple pseudo-code below, where T1 and T2 are arbitrary types, while f and g are functions of an integer argument returning T1 and T2 respectively:
#pragma omp parallel for schedule(static) private(i) shared(c)
for(i = 0; i < N; ++i) {
    c.insert(std::make_pair<T1,T2>(f(i), g(i)));
}
This looks rather straightforward and seems like it should be trivially parallelized, but it doesn't speed up as I expected. On the contrary, it leads to run-time errors in my code, due to unexpected values being filled in the container, likely because of race conditions. I've even tried putting barriers and what-not, but all to no avail. The only thing that allows it to work is to use a critical directive as below:
#pragma omp parallel for schedule(static) private(i) shared(c)
for(i = 0; i < N; ++i) {
    #pragma omp critical
    {
        c.insert(std::make_pair<T1,T2>(f(i), g(i)));
    }
}
But this rather defeats the point of using OpenMP in the above example, since only one thread at a time executes the bulk of the loop (the container insert). What am I missing here? Short of changing the way the code is written, can somebody kindly explain?
This particular example you have is not a good candidate for parallelism unless f() and g() are extremely expensive function calls.
STL containers are not thread-safe. That's why you're getting the race conditions. So accessing them needs to be synchronized - which makes your insertion process inherently sequential.
As the other answer mentions, there's a LOT of overhead for parallelism. So unless f() and g() are extremely expensive, your loop doesn't do enough work to offset that overhead.
Now, assuming f() and g() are extremely expensive calls, your loop can be parallelized like this:
#pragma omp parallel for schedule(static) private(i) shared(c)
for(i = 0; i < N; ++i) {
    std::pair<T1,T2> p = std::make_pair<T1,T2>(f(i), g(i));
    #pragma omp critical
    {
        c.insert(p);
    }
}
Running multithreaded code forces you to think about thread safety and shared access to your variables. As soon as you start inserting into c from multiple threads, the collection has to be prepared to take such "simultaneous" calls and keep its data consistent; are you sure it is designed that way?
Another thing is that parallelization has its own overhead, and you are not going to gain anything by running a very small task on multiple threads: with the cost of splitting the work and synchronizing, you might end up with an even higher total execution time for the task.
c will obviously have data races, as you guessed. An STL map is not thread-safe; calling its insert method concurrently from multiple threads leads to very unpredictable behavior, usually a crash.
Yes, to avoid the data races you must use either (1) a mutex such as #pragma omp critical, or (2) a concurrent data structure (a.k.a. a lock-free data structure). However, not all data structures can be made lock-free on current hardware. For example, TBB provides tbb::concurrent_hash_map. If you don't need the keys to be ordered, you may use it and could get some speedup, as it does not take a conventional mutex.
In case you can use just a hash table and the table is very large, you could take a reduction-like approach (see this link for the concept of reduction). Hash tables do not care about the order of insertion, so you can allocate a separate hash table for each thread and let each thread insert N/#threads items in parallel, which will give a speedup. Lookups can also easily be done by accessing these tables in parallel.
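Here is a rough sketch of that per-thread-table idea. The concrete T1, T2, f and g are stand-ins invented for illustration, and std::unordered_map stands in for "a hash table":
#include <omp.h>
#include <unordered_map>
#include <vector>

// stand-ins for the question's T1, T2, f and g (invented for illustration)
using T1 = int;
using T2 = double;
static T1 f(int i) { return i; }
static T2 g(int i) { return i * 0.5; }

std::unordered_map<T1, T2> parallel_fill(int N)
{
    // one private table per thread, sized up front so no locking is needed
    std::vector<std::unordered_map<T1, T2>> partial(omp_get_max_threads());

    #pragma omp parallel
    {
        std::unordered_map<T1, T2> &local = partial[omp_get_thread_num()];

        // each thread inserts roughly N / num_threads items into its own table
        #pragma omp for schedule(static)
        for (int i = 0; i < N; ++i)
            local.emplace(f(i), g(i));
    }

    // the merge overhead happens sequentially, after the parallel part
    std::unordered_map<T1, T2> result;
    for (const auto &table : partial)
        result.insert(table.begin(), table.end());
    return result;
}
Each insert during the parallel phase is uncontended; the price is the sequential merge at the end, which only pays off when f() and g() are expensive enough.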

Concurrency and optimization using OpenMP

I'm learning OpenMP. To do so, I'm trying to parallelize an existing code. But I seem to get a worse time when using OpenMP than when I don't.
My inner loop:
#pragma omp parallel for
for(unsigned long j = 0; j < c_numberOfElements; ++j)
{
    //int th_id = omp_get_thread_num();
    //printf("thread %d, j = %d\n", th_id, (int)j);
    Point3D current;
    #pragma omp critical
    {
        current = _points[j];
    }
    Point3D next = getNext(current);
    if (!hasConstraint(next))
    {
        continue;
    }
    #pragma omp critical
    {
        _points[j] = next;
    }
}
_points is a pointMap_t, defined as:
typedef boost::unordered_map<unsigned long, Point3D> pointMap_t;
Without OpenMP my running time is 44.904s. With OpenMP enabled, on a computer with two cores, it is 64.224s. What am I doing wrong?
Why have you wrapped your reads and writes to _points[j] in critical sections? I'm not much of a C++ programmer, but it doesn't look to me as if you need those sections at all. As you've written it (unnamed critical sections), each thread is going to wait while another goes through either of the sections. This could easily make the program slower.
It seems possible that the lookup and write to _points in critical sections is dragging down the performance when you use OpenMP. Single-threaded, this will not result in any contention.
Sharing seed data like this seems counterproductive in a parallel programming context. Can you restructure to avoid these contention points?
You need to show the rest of the code. From a comment on another answer, it seems you are using a map. That is really a bad idea, especially if you are mapping 0..n numbers to values: why don't you use an array?
If you really need to use containers, consider the ones from Intel's Threading Building Blocks library.
I agree that it would be best to see some working code.
The ultimate issue here is that there are criticals within a parallel region, and criticals are (a) enormously expensive in themselves, and (b) by definition kill parallelism. The assignment to current certainly doesn't need to be inside a critical, as it is private; I wouldn't have thought the _points[j] assignment would need to be either, but I don't know what the map does under the hood, so there you go.
But you have a loop with a huge amount of overhead, which grows linearly with the number of threads (the two critical regions), in order to do a tiny amount of actual work (walking along a linked list, it looks like). That is never going to be a good trade...
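Pulling the suggestions above together (drop the criticals and store the points in a vector indexed by j instead of a map), the loop could look roughly like this. Point3D, getNext, hasConstraint and c_numberOfElements come from the question; whether the surrounding code permits swapping pointMap_t for a vector is an assumption:
// _points replaced by a plain vector indexed by j (assumed to be possible here)
std::vector<Point3D> points(c_numberOfElements);

#pragma omp parallel for schedule(static)
for (unsigned long j = 0; j < c_numberOfElements; ++j)
{
    // each iteration touches only its own element, so no critical is needed
    Point3D next = getNext(points[j]);
    if (hasConstraint(next))
        points[j] = next;
}
With no shared state inside the loop body, the threads no longer serialize on a lock, and whether the parallel version wins then depends only on how much work getNext and hasConstraint actually do.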