Why do they prefer dynamic scheduling in this code snippet - c++

I solved an exercise where I had to implement SGD (stochastic gradient descent) with momentum; the exercise was then to parallelize it.
My suggestion was as follows:
#pragma omp for schedule(static) nowait
for (int i = 0; i < size; i++)
{
    const double nablaE_w = (1.0/double(B)) * (grad[i] + lambda * param[i]);
    mom_w[i] = beta * mom_w[i] - eta * nablaE_w;
    param[i] = param[i] + mom_w[i];
}
I use nowait because after the for-loop the calculation is finished.
But the solution uses:
#pragma omp for schedule (dynamic, 64/sizeof(double)) nowait
Also, after reading several Stack Overflow answers, I still cannot see the advantage of schedule(dynamic), or why they use a chunk size of 8 (64/sizeof(double)).
I am wary of schedule(dynamic) - in fact I never use it, because I have no idea how to use it correctly.

The solution likely did it for one of two reasons: either there is another imbalanced nowait loop nearby, which could make threads reach this loop at different times, or they were worried about running on a system with a great deal of per-core interference. The chunk size 64/sizeof(double) = 8 simply makes each chunk span one 64-byte cache line, so chunks handed to different threads never write to the same cache line (no false sharing). As for the schedule itself: schedule(dynamic) is materially more expensive than schedule(static), at least until runtimes actually implement the nonmonotonic modifier properly, so static is the clear choice as long as the workload is balanced - which this snippet is. The only reason to use dynamic here would be if imbalance between threads were expected to come from somewhere else, either elsewhere in the code or from interfering workloads.
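For reference, here is a minimal self-contained sketch of the same update step with the cache-line-sized chunk written out as an expression; the function signature and hyper-parameter names are assumptions for illustration, not taken from the exercise:

#include <vector>

void sgd_momentum_step(std::vector<double>& param, std::vector<double>& mom_w,
                       const std::vector<double>& grad,
                       double eta, double beta, double lambda, int B)
{
    const int size = static_cast<int>(param.size());
    // 64 / sizeof(double) == 8: each chunk covers exactly one 64-byte cache
    // line, so two threads never write to the same line (no false sharing).
    const int chunk = 64 / sizeof(double);

    #pragma omp parallel
    {
        // schedule(static, chunk) is cheaper for this balanced loop;
        // dynamic only pays off if threads arrive here at different times.
        #pragma omp for schedule(dynamic, chunk) nowait
        for (int i = 0; i < size; i++)
        {
            const double nablaE_w = (1.0 / double(B)) * (grad[i] + lambda * param[i]);
            mom_w[i] = beta * mom_w[i] - eta * nablaE_w;
            param[i] = param[i] + mom_w[i];
        }
    }
}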

Related

OpenMP: for loop avoid data race without using critical

Suppose I have the following C code and want to parallelize it using OpenMP.
for (int i = 0; i < n; ++i)
{
    int key = get_key(i);
    toArray[count[key]++] = fromArray[i];
}
I know that if I directly use the parallel for construct it could cause a data race and produce a wrong answer, but if I use critical, the performance would be very poor.
#pragma omp parallel for schedule(static)
for (int i = 0; i < n; ++i)
{
    int key = get_key(i);
    #pragma omp critical
    toArray[count[key]++] = fromArray[i];
}
I wonder if there is a way to parallelize it with good performance?
I'm afraid your assumption is wrong. The version with a critical section does not produce a correct answer - or at least not a deterministic one.
For simplicity, take the case where get_key always returns 0. The serial version would copy the array; the parallel one would perform an arbitrary reshuffle. There is an ordering dependency between all iterations in which get_key returns the same value.
Generally speaking, simple critical sections can often be replaced by a reduction, which allows independent execution at the cost of some merge overhead after the parallel part. Atomics can also be an option for simple operations, but they too carry a general performance penalty and often additional negative cache effects. Technically, your (incorrect) critical-section code would be equivalent to this slightly more efficient atomic code:
int index;
#pragma omp atomic capture
index = count[key]++;
#pragma omp atomic write
toArray[index] = fromArray[i];
I wonder if there is a way to parallelize it with good performance?
Any question about performance needs more specific information. What are the involved types, data sizes, parallelism level, ...? There is no general answer to "this is the best way for performance".
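To illustrate the reduction idea mentioned above on a deliberately simpler problem (it is not a drop-in fix for the bucketing loop in the question), a minimal sketch:

// Each thread accumulates into its own private copy of 'total'; the copies
// are merged once at the end instead of serialising every single addition.
double sum_with_reduction(const double* data, long n)
{
    double total = 0.0;
    #pragma omp parallel for reduction(+:total)
    for (long i = 0; i < n; ++i)
        total += data[i];
    return total;
}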

how can i convert a sequential program to parallel using openMP?

I'm beginning with OpenMP and I want to parallelize this portion of code:
for (i = 0; i < n; i++)
    for (j = 1; j < n; j++)
        A[i][j] += A[i][j-1];
How can I make this for-loop parallel?
I recommend you start by checking out the following link: http://bisqwit.iki.fi/story/howto/openmp/.
It gives a brief overview of what you can achieve using OpenMP.
For your code fragment parallelising can be as easy as writing one pragma:
#pragma omp parallel for private(i, j) shared(A, n)
for (i = 0; i < n; ++i)
    for (j = 1; j < n; ++j)
        A[i][j] += A[i][j-1];
That is the idea behind OpenMP: you annotate your program with directives that allow the code to be compiled and linked against OpenMP and then run in parallel, or compiled with the pragmas ignored, in which case the program remains a valid sequential program.
In this case the pragma leaves the decision of how many threads to run to the runtime, which generally decides based on the number of cores in the machine. The outer loop is parallelized, so each i iteration is conceptually performed by a different thread. This matters because you have data dependencies between the j iterations, and communication/synchronization in parallel code is tricky; keeping the inner loop within a single thread deals with that problem. The shared clause could be left out, because variables are shared by default - but that is exactly why you shouldn't leave it out: be explicit about what you want shared and what you want private. This is a good way to avoid many of the bugs that creep into parallel code.
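One way to enforce that explicitness is default(none), which makes the compiler reject any variable whose sharing is not spelled out. A minimal sketch (the function wrapper is an assumption added only to make the snippet self-contained):

void row_prefix_sums(double** A, int n)
{
    int i, j;
    // default(none): every variable used inside must appear in a clause.
    #pragma omp parallel for default(none) private(i, j) shared(A, n)
    for (i = 0; i < n; ++i)
        for (j = 1; j < n; ++j)
            A[i][j] += A[i][j-1];
}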

Scheduling system for fitting

I would like to parallelise a linear operation (fitting a complicated mathematical function to some dataset) with multiple processors.
Assume I have 8 cores in my machine and I want to fit 1000 datasets. What I expect is some system that treats the 1000 datasets as a queue and sends them to the 8 cores for processing, starting with the first 8 of the 1000, FIFO. The fitting time of each dataset is in general different from the others, so some of the 8 datasets being fitted could take longer than the rest. What I want from the system is to save the result of each fitted dataset and then hand a new dataset from the big queue (1000 datasets) to whichever thread is done. This has to continue until all 1000 datasets are processed, after which I can move on with my program.
What is such a system called? And are there models for that in C++?
I parallelise with OpenMP, and use advanced C++ techniques like templates and polymorphism.
Thank you for any efforts.
You can either use OpenMP parallel for with dynamic scheduling or OpenMP tasks. Both can be used to parallelise cases where each iteration takes a different amount of time to complete. With a dynamically scheduled for:
#pragma omp parallel
{
    Fitter fitter;
    fitter.init();

    #pragma omp for schedule(dynamic,1)
    for (int i = 0; i < numFits; i++)
        fitter.fit(..., &results[i]);
}
schedule(dynamic,1) makes each thread execute one iteration at a time and threads are never left idle unless there are no more iterations to process.
With tasks:
#pragma omp parallel
{
    Fitter fitter;
    fitter.init();

    #pragma omp single
    for (int i = 0; i < numFits; i++)
    {
        #pragma omp task
        fitter.fit(..., &results[i]);
    }

    #pragma omp taskwait
    // ^^^ only necessary if more code follows before the end of the parallel region
}
Here one of the threads runs the for-loop, which produces 1000 OpenMP tasks. The tasks are kept in a queue and processed by idle threads. This works somewhat similarly to a dynamic for-loop but allows greater freedom in the code constructs (e.g. with tasks you can parallelise recursive algorithms). The taskwait construct waits for all pending tasks to be done. It is implied at the end of the parallel region, so it is really necessary only if more code follows before the end of the parallel region.
In both cases each invocation to fit() will be done in a different thread. You have to make sure that fitting one set of parameters does not affect fitting other sets, e.g. that fit() is a thread-safe method/function. Both cases also require that the time to execute fit() is much higher than the overhead of the OpenMP constructs.
OpenMP tasking requires OpenMP 3.0 compliant compiler. This rules out all versions of MS VC++ (even the one in VS2012), should you happen to develop on Windows.
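As an aside, here is a minimal sketch of the "tasks can parallelise recursive algorithms" point above - a parallel tree sum. The Node type is an assumption made purely for illustration:

struct Node {
    long value;
    Node* left;
    Node* right;
};

long tree_sum(const Node* n)
{
    if (n == nullptr) return 0;
    long left_sum = 0, right_sum = 0;
    // Each subtree becomes a task that any idle thread in the team can pick up.
    #pragma omp task shared(left_sum)
    left_sum = tree_sum(n->left);
    #pragma omp task shared(right_sum)
    right_sum = tree_sum(n->right);
    #pragma omp taskwait   // wait for both child tasks before combining
    return n->value + left_sum + right_sum;
}

// Call it once from inside a parallel region, e.g.:
// #pragma omp parallel
// #pragma omp single
// total = tree_sum(root);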
If you'd like to have only one instance of fitter ever initialised per thread, then you should take a somewhat different approach, e.g. make the fitter object global and threadprivate:
#include <omp.h>

Fitter fitter;
#pragma omp threadprivate(fitter)
...
int main()
{
    // Disable dynamic teams
    omp_set_dynamic(0);

    // Initialise all fitters once per thread
    #pragma omp parallel
    {
        fitter.init();
    }
    ...
    #pragma omp parallel
    {
        #pragma omp for schedule(dynamic,1)
        for (int i = 0; i < numFits; i++)
            fitter.fit(..., &results[i]);
    }
    ...
    return 0;
}
Here fitter is a global instance of the Fitter class. The omp threadprivate directive instructs the compiler to put it in thread-local storage, i.e. to make it a per-thread global variable. These instances persist between the different parallel regions. You can also use omp threadprivate on static local variables. These too persist between the different parallel regions (but only within the same function):
#include <omp.h>

int main()
{
    // Disable dynamic teams
    omp_set_dynamic(0);

    static Fitter fitter; // must be static
    #pragma omp threadprivate(fitter)

    // Initialise all fitters once per thread
    #pragma omp parallel
    {
        fitter.init();
    }
    ...
    #pragma omp parallel
    {
        #pragma omp for schedule(dynamic,1)
        for (int i = 0; i < numFits; i++)
            fitter.fit(..., &results[i]);
    }
    ...
    return 0;
}
The omp_set_dynamic(0) call disables dynamic teams, i.e. each parallel region will always execute with as many threads as specified by the OMP_NUM_THREADS environment variable.
What you basically want is a pool of workers (a thread pool) that takes a job from a queue, processes it, and then proceeds with the next job. OpenMP provides different constructs to coordinate such work, e.g. barriers (all workers run until a certain point and only proceed when a certain requirement is fulfilled) or reductions to accumulate values into a global variable after the workers have computed their respective parts.
Your question is very broad, but one more hint I can give you is to take a look at the MapReduce paradigm. Here, a function is mapped over a dataset and the results are sorted into buckets that are then reduced using another function (possibly the same function again). In your case this would mean that each of your processors/cores/nodes maps a given function over its assigned share of the data and sends the result buckets to another node responsible for combining them. I guess you would have to look into MPI if you wanted to use MapReduce with C++ without a dedicated MapReduce framework. As you are running the program on one node, you may be able to do something similar with OpenMP, so searching the web for that might help; a sketch of the idea follows below.
TL;DR search for pool of workers (thread pool), barriers and MapReduce.
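A minimal sketch of that map-then-reduce pattern with plain OpenMP; the Result type and the map_one/merge functions are placeholders standing in for the actual fitting code, not part of the answers above:

#include <omp.h>
#include <vector>

struct Result { double score = 0.0; };

// Placeholder "map" step: stands in for fitting one dataset.
Result map_one(int dataset_id)
{
    Result r;
    r.score = static_cast<double>(dataset_id);
    return r;
}

// Placeholder "reduce" step: combines two partial results.
void merge(Result& into, const Result& r) { into.score += r.score; }

Result map_reduce(int num_datasets)
{
    std::vector<Result> partials(omp_get_max_threads());

    #pragma omp parallel
    {
        Result local;                          // per-thread accumulator
        #pragma omp for schedule(dynamic, 1) nowait
        for (int i = 0; i < num_datasets; ++i)
            merge(local, map_one(i));          // "map" in parallel
        partials[omp_get_thread_num()] = local;
    }

    Result total;                              // sequential "reduce" of the partials
    for (const Result& r : partials)
        merge(total, r);
    return total;
}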

C++ OpenMP directives

I have a loop that I'm trying to parallelize and in it I am filling a container, say an STL map. Consider then the simple pseudo code below where T1 and T2 are some arbitrary types, while f and g are some functions of integer argument, returning T1, T2 types respectively:
#pragma omp parallel for schedule(static) private(i) shared(c)
for (i = 0; i < N; ++i) {
    c.insert(std::make_pair<T1,T2>(f(i), g(i)));
}
This looks rather straightforward and seems like it should be trivially parallelized, but it doesn't speed up as I expected. On the contrary, it leads to run-time errors in my code due to unexpected values being filled in the container, likely because of race conditions. I've even tried putting barriers and what-not, but all to no avail. The only thing that allows it to work is to use a critical directive as below:
#pragma omp parallel for schedule(static) private(i) shared(c)
for (i = 0; i < N; ++i) {
    #pragma omp critical
    {
        c.insert(std::make_pair<T1,T2>(f(i), g(i)));
    }
}
But this rather defeats the whole point of using OpenMP in the above example, since only one thread at a time executes the bulk of the loop (the container insert). What am I missing here? Short of changing the way the code is written, can somebody kindly explain?
This particular example you have is not a good candidate for parallelism unless f() and g() are extremely expensive function calls.
STL containers are not thread-safe. That's why you're getting the race conditions. So accessing them needs to be synchronized - which makes your insertion process inherently sequential.
As the other answer mentions, there's a LOT of overhead for parallelism. So unless f() and g() are extremely expensive, your loop doesn't do enough work to offset that overhead.
Now assuming f() and g() are extremely expensive calls, then your loop can be parallelized like this:
#pragma omp parallel for schedule(static) private(i) shared(c)
for (i = 0; i < N; ++i) {
    std::pair<T1,T2> p = std::make_pair<T1,T2>(f(i), g(i));
    #pragma omp critical
    {
        c.insert(p);
    }
}
Running multithreaded code makes you think about thread safety and shared access to your variables. As soon as you start inserting into c from multiple threads, the collection has to be prepared to take such "simultaneous" calls and keep its data consistent - are you sure it is built that way?
Another thing is that parallelization has its own overhead and you are not going to gain anything when you try to run a very small task on multiple threads - with the cost of splitting and synchronization you might end up with even higher total execution time for the task.
c will obviously have data races, as you guessed. An STL map is not thread-safe: calling its insert method concurrently from multiple threads has very unpredictable behaviour, most often a crash.
Yes, to avoid the data races you must have either (1) a mutex-like construct such as #pragma omp critical, or (2) a concurrent data structure (a.k.a. a lock-free data structure). However, not all data structures can be made lock-free on current hardware. For example, TBB provides tbb::concurrent_hash_map. If you don't need ordering of the keys, you may use it and could get some speedup, as it does not use a conventional mutex.
In case you can use just a hash table and the table is very large, you could take a reduction-like approach (see this link for the concept of reduction). Hash tables do not care about the order of insertion. In this case you allocate a separate hash table per thread and let each thread insert N/#threads items in parallel, which will give a speedup; lookups can likewise be done by accessing these tables in parallel. A sketch of this follows below.
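A minimal sketch of that per-thread-table approach, assuming std::unordered_map and a placeholder key/value computation (the real f(i)/g(i) would go where the emplace arguments are):

#include <omp.h>
#include <unordered_map>
#include <vector>

std::unordered_map<int, double> build_table(int N)
{
    const int num_threads = omp_get_max_threads();
    std::vector<std::unordered_map<int, double>> partial(num_threads);

    #pragma omp parallel
    {
        // Each thread fills only its own table: no locking, no data race.
        std::unordered_map<int, double>& mine = partial[omp_get_thread_num()];
        #pragma omp for schedule(static)
        for (int i = 0; i < N; ++i)
            mine.emplace(i, 2.0 * i);      // placeholder key/value computation
    }

    // Sequential merge; cheap compared to the parallel fill if the per-item
    // computation is expensive.
    std::unordered_map<int, double> result;
    for (auto& m : partial)
        result.insert(m.begin(), m.end());
    return result;
}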

Concurrency and optimization using OpenMP

I'm learning OpenMP. To do so, I'm trying to make an existing piece of code parallel. But I seem to get a worse time when using OpenMP than when I don't.
My inner loop:
#pragma omp parallel for
for (unsigned long j = 0; j < c_numberOfElements; ++j)
{
    //int th_id = omp_get_thread_num();
    //printf("thread %d, j = %d\n", th_id, (int)j);
    Point3D current;
    #pragma omp critical
    {
        current = _points[j];
    }

    Point3D next = getNext(current);
    if (!hasConstraint(next))
    {
        continue;
    }

    #pragma omp critical
    {
        _points[j] = next;
    }
}
_points is a pointMap_t, defined as:
typedef boost::unordered_map<unsigned long, Point3D> pointMap_t;
Without OpenMP my running time is 44.904s. With OpenMP enabled, on a computer with two cores, it is 64.224s. What am I doing wrong?
Why have you wrapped your reads and writes to _points[j] in critical sections? I'm not much of a C++ programmer, but it doesn't look to me as if you need those sections at all. As you've written it (unnamed critical sections), each thread is going to wait while the other goes through each of the sections. This could easily make the program slower.
It seems possible that the lookup and write to _points in critical sections is dragging down the performance when you use OpenMP. Single-threaded, this will not result in any contention.
Sharing seed data like this seems counterproductive in a parallel programming context. Can you restructure to avoid these contention points?
You need to show the rest of the code. From a comment to another answer, it seems you are using a map. That is really a bad idea, especially if you are mapping 0..n numbers to values: why don't you use an array?
If you really need to use containers, consider using the ones from the Intel's Thread Building Blocks library.
I agree that it would be best to see some working code.
The ultimate issue here is that there are criticals within a parallel region, and criticals are (a) enormously expensive in and of themselves, and (b) by definition kill parallelism. The assignment to current certainly doesn't need to be inside a critical, as it is private; I wouldn't have thought the _points[j] assignment would need to be either, but I don't know what the map operations do internally, so there you go.
But you have a loop in which you have a huge amount of overhead, which grows linearly in the number of threads (the two critical regions) in order to do a tiny amount of actual work (walk along a linked list, it looks like). That's never going to be a good trade...
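If restructuring is an option, here is one hedged sketch in the spirit of the advice above (Point3D, getNext() and hasConstraint() are taken from the question; the snapshot/write-back structure is an assumption, not the original design): copy the map into a plain array, do the expensive work in parallel without any locking, and write the results back serially.

#include <utility>
#include <vector>

void update_points(pointMap_t& _points)
{
    // Snapshot the map into a contiguous, index-addressable buffer.
    std::vector<std::pair<unsigned long, Point3D>> buf(_points.begin(), _points.end());
    const long n = static_cast<long>(buf.size());

    #pragma omp parallel for schedule(static)
    for (long i = 0; i < n; ++i)
    {
        Point3D next = getNext(buf[i].second);   // each thread touches only buf[i]
        if (hasConstraint(next))
            buf[i].second = next;
    }

    // Single-threaded write-back: no concurrent access to the map at all.
    for (const auto& kv : buf)
        _points[kv.first] = kv.second;
}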