I am trying to achieve something contrived using OpenMP.
I have a multi-core system with N available processors. I want a vector of objects of length k*P to be populated in batches of P by a single thread (by reading a file), i.e. a single thread reads this file and writes to vecObj[0 to P-1], then vecObj[P to 2P-1], etc. To keep things simple, this vector is pre-sized (i.e. elements are assigned using the = operator, with no push_backs; the length is constant as far as we are concerned).
After a batch is written into the vector, I want the remaining N-1 threads to work on the available data. Since every object can take different time to be worked upon, it would be good to have dynamic scheduling for the remaining threads. The below snippet works really well when all the threads are working on the data.
#pragma omp parallel for schedule(dynamic, per_thread)
for(size_t i = 0; i < dataLength(); ++i) {
threadWorkOnElement(vecObj, i);
}
Now, as I see it, the main issue I am facing in thinking up a solution is this: how can I have N-1 threads dynamically scheduled over the range of available data while another thread keeps reading and populating the vector with data?
I am guessing that publishing new data and signalling the remaining threads can be achieved using std::atomic.
I think that what I am trying to achieve is along the lines of the following pseudo code:
std::atomic<size_t> freshDataEnd;
size_t dataWorkStart = 0;
size_t dataWorkEnd;
#pragma omp parallel
{
#pragma omp task
{
//increment freshDataEnd atomically upon reading every P objects
//return when end of file is reached
readData(vecObj, freshDataEnd);
}
#pragma omp task
{
omp_set_num_threads(N-1);
while(freshDataEnd <= MAX_VEC_LEN) {
if (dataWorkStart < freshDataEnd) {
dataWorkEnd = freshDataEnd;
#pragma omp parallel for schedule(dynamic, per_thread)
for(size_t i = dataWorkStart; i < dataWorkEnd; ++i) {
threadWorkOnElement(vecObj, i);
}
dataWorkStart = dataWorkEnd;
}
}
}
}
Is this the correct approach to achieve what I am trying to do? How can I handle this sort of nested parallelism? Less important: I would prefer to stick with OpenMP directives and not use std::atomic. Is that possible? How?
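One directive-only way to express the pattern described above is to let a single thread do the reading and spawn a task for every element it has just read, so that the other threads of the team pick the tasks up as they appear. The following is only a minimal sketch under assumptions: readBatch and workOnElement are hypothetical stand-ins for the file reader and for threadWorkOnElement, and the vector holds plain ints.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for reading one batch from the file.
static void readBatch(std::vector<int>& vecObj, std::size_t begin, std::size_t end) {
    for (std::size_t i = begin; i < end; ++i) vecObj[i] = static_cast<int>(i);
}

// Hypothetical stand-in for threadWorkOnElement.
static void workOnElement(std::vector<int>& vecObj, std::size_t i) {
    vecObj[i] *= 2;
}

// One thread reads batches of P elements; idle threads execute the tasks.
void readAndProcess(std::vector<int>& vecObj, std::size_t P) {
    assert(P > 0);
    const std::size_t total = vecObj.size();
    #pragma omp parallel
    #pragma omp single
    {
        for (std::size_t begin = 0; begin < total; begin += P) {
            const std::size_t end = (begin + P < total) ? begin + P : total;
            readBatch(vecObj, begin, end);  // serial: only this thread reads
            for (std::size_t i = begin; i < end; ++i) {
                // Each element becomes a task; i is copied into the task.
                #pragma omp task firstprivate(i)
                workOnElement(vecObj, i);
            }
        }
    }  // implicit barrier: all tasks have finished here
}
```

No std::atomic is needed in this sketch because a task can only start after the reading thread has created it, which happens after the batch was written. If the per-element work is small, creating one task per chunk of elements instead of one per element would probably reduce the overhead.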
Related
I have the following piece of code.
for (i = 0; i < n; ++i) {
++cnt[offset[i]];
}
where offset is an array of size n containing values in the range [0, m) and cnt is an array of size m initialized to 0. I use OpenMP to parallelize it as follows.
#pragma omp parallel for shared(cnt, offset) private(i)
for (i = 0; i < n; ++i) {
++cnt[offset[i]];
}
According to the discussion in this post, if offset[i1] == offset[i2] for i1 != i2, the above piece of code may result in incorrect cnt. What can I do to avoid this?
This code:
#pragma omp parallel for shared(cnt, offset) private(i)
for (i = 0; i < n; ++i) {
++cnt[offset[i]];
}
contains a race condition in the updates of the array cnt. To solve it, you need to guarantee mutual exclusion of those updates. That can be achieved with (for instance) #pragma omp atomic update, but as already pointed out in the comments:
However, this resolves just correctness and may be terribly
inefficient due to heavy cache contention and synchronization needs
(including false sharing). The only solution then is to have each
thread its private copy of cnt and reduce these copies at the end.
The alternative solution is to have a private array per thread and, at the end of the parallel region, manually reduce all those arrays into one. An example of such an approach can be found here.
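A minimal sketch of such a manual reduction could look as follows. The function name countPrivate and the use of std::vector are assumptions made for illustration; the question itself uses plain arrays.

```cpp
#include <cassert>
#include <vector>

// Counts how often each value in [0, m) occurs in offset.
std::vector<int> countPrivate(const std::vector<int>& offset, int m) {
    assert(m > 0);
    std::vector<int> cnt(m, 0);
    const long n = static_cast<long>(offset.size());
    #pragma omp parallel
    {
        std::vector<int> local(m, 0);  // private copy for this thread
        #pragma omp for nowait
        for (long i = 0; i < n; ++i) {
            ++local[offset[i]];        // no contention: local is private
        }
        // Merge the private copies, one thread at a time.
        #pragma omp critical
        for (int j = 0; j < m; ++j) {
            cnt[j] += local[j];
        }
    }
    return cnt;
}
```

Each thread pays for one extra array of size m and one merge pass, so this tends to be attractive when m is small compared to n.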
Fortunately, with OpenMP 4.5 you can reduce arrays using a dedicated clause, namely:
#pragma omp parallel for reduction(+:cnt)
You can have a look at this example to see how to apply that feature.
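As a hedged illustration of the array reduction, the sketch below applies the clause to a pointer, which in C and C++ needs an array section such as c[:m] so that the runtime knows how many elements to reduce. The function name countWithReduction and the std::vector wrapper are assumptions of this sketch, and a compiler with OpenMP 4.5 support is required.

```cpp
#include <cassert>
#include <vector>

// Counts how often each value in [0, m) occurs in offset.
std::vector<int> countWithReduction(const std::vector<int>& offset, int m) {
    assert(m > 0);
    std::vector<int> cnt(m, 0);
    int* c = cnt.data();  // the clause takes an array or pointer, not a std::vector
    const long n = static_cast<long>(offset.size());
    // Every thread gets a zero-initialised private copy of c[0..m-1];
    // the copies are summed into the original at the end of the loop.
    #pragma omp parallel for reduction(+ : c[:m])
    for (long i = 0; i < n; ++i) {
        ++c[offset[i]];
    }
    return cnt;
}
```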
It is worth mentioning, regarding the reduction of arrays versus the atomic approach, what @Jérôme Richard kindly pointed out:
Note that this is fast only if the array is not huge (the atomic based
solution could be faster in this specific case regarding the platform
and if the values are not conflicting). So that is m << n.
As always, profiling is key! Hence, you should test your code with the aforementioned approaches to find out which one is the most efficient.
Suppose I have the following function, which makes use of #pragma omp parallel internally.
void do_heavy_work(double * input_array);
I now want to do_heavy_work on many input_arrays thus:
void do_many_heavy_work(double ** input_arrays, int num_arrays)
{
for (int i = 0; i < num_arrays; ++i)
{
do_heavy_work(input_arrays[i]);
}
}
Let's say I have N hardware threads. The implementation above would cause num_arrays invocations of do_heavy_work to occur in a serial fashion, each using all N threads internally to do whatever parallel thing it wants.
Now assume that when num_arrays > 1 it is actually more efficient to parallelise over this outer loop than it is to parallelise internally in do_heavy_work. I now have the following options.
Put #pragma omp parallel for on the outer loop and set OMP_NESTED=1. However, by setting OMP_NUM_THREADS=N I will end up with a large total number of threads (N*num_arrays) to be spawned.
As above but turn off nested parallelism. This wastes available cores when num_arrays < N.
Ideally I want OpenMP to split its team of OMP_NUM_THREADS threads into num_arrays subteams, and then each do_heavy_work can thread over its allocated subteam if given some.
What's the easiest way to achieve this?
(For the purpose of this discussion let's assume that num_arrays is not necessarily known beforehand, and also that I cannot change the code in do_heavy_work itself. The code should work on a number of machines so N should be freely specifiable.)
OMP_NUM_THREADS can be set to a list, thus specifying the number of threads at each level of nesting. E.g. OMP_NUM_THREADS=10,4 will tell the OpenMP runtime to execute the outer parallel region with 10 threads and each nested region will execute with 4 threads for a total of up to 40 simultaneously running threads.
Alternatively, you can make your program adaptive with code similar to this one:
void do_many_heavy_work(double ** input_arrays, int num_arrays)
{
#pragma omp parallel num_threads(num_arrays)
{
int nested_team_size = omp_get_max_threads() / num_arrays;
omp_set_num_threads(nested_team_size);
#pragma omp for
for (int i = 0; i < num_arrays; ++i)
{
do_heavy_work(input_arrays[i]);
}
}
}
This code will not use all available threads if the value of OMP_NUM_THREADS is not divisible by num_arrays. If having a different number of threads per nested region is fine (it could result in some arrays being processed faster than others), come up with a way to distribute the threads and set nested_team_size in each thread accordingly. Calling omp_set_num_threads() from within a parallel region only affects nested regions started by the calling thread, so you can have different nested team sizes.
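One possible way to distribute the threads unevenly is to give every outer thread the integer share of the budget and hand the remainder out one by one to the first few threads. The helper nestedTeamSize below is hypothetical and is only a sketch of that arithmetic; it also falls back to a team size of 1 when there are more outer threads than available threads.

```cpp
#include <cassert>

// Returns the nested team size for outer thread team_id when total_threads
// are split over num_teams outer threads as evenly as possible.
int nestedTeamSize(int total_threads, int num_teams, int team_id) {
    assert(num_teams > 0 && team_id >= 0 && team_id < num_teams);
    const int base = total_threads / num_teams;   // share every team gets
    const int extra = total_threads % num_teams;  // leftover threads
    const int size = base + (team_id < extra ? 1 : 0);
    return size > 0 ? size : 1;                   // never request 0 threads
}
```

Inside the outer parallel region, each thread could then call omp_set_num_threads(nestedTeamSize(total, num_arrays, omp_get_thread_num())), where total is the thread budget, for example the value returned by omp_get_max_threads().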
I have a long-running simulation program and I plan to use OpenMP to parallelize some of the code for a speedup. I'm new to OpenMP and have the following question.
Given that the simulation is a stochastic one, I have the following data structure and I need to capture the age-specific count of seeded agents [Edited: some code edited]:
class CAgent {
int ageGroup;
bool isSeed;
/* some other stuff */
};
class Simulator {
std::vector<int> seed_by_age;
std::vector<CAgent> agents;
void initEnv();
/* some other stuff */
};
void Simulator::initEnv() {
std::fill(seed_by_age.begin(), seed_by_age.end(), 0);
#pragma omp parallel
{
#pragma omp for
for (size_t i = 0; i < agents.size(); i++)
{
agents[i].setup(); // (a)
if (someRandomCondition())
{
agents[i].isSeed = true;
/* (b) */
seed_by_age[0]++; // index = 0 -> overall
seed_by_age[ agents[i].ageGroup - 1 ]++;
}
}
} // end #parallel
} // end Simulator::initEnv()
As the variable seed_by_age is shared across threads, I know I have to protect it properly. So in (b), I used #pragma omp flush(seed_by_age[agents[i].ageGroup]), but the compiler complains: "error: expected ')' before '[' token".
I'm not doing a reduction, and I am trying to avoid the 'critical' directive if possible. So, am I missing something here? How can I properly protect a particular element of the vector?
Many thanks and I appreciate any suggestions.
Development box: 2 core CPU, target platform 4-6 cores
Platform: Windows 7, 64bits
MinGW 4.7.2 64 bits (rubenvb build)
You can only use flush with variables, not elements of arrays and definitely not with elements of C++ container classes. The indexing operator for std::vector results in a call to operator[], an inline function, but still a function.
Because in your case std::vector::operator[] returns a reference to a simple scalar type, you can use the atomic update construct to protect the updates:
#pragma omp atomic update
seed_by_age[0]++; // index = 0 -> overall
#pragma omp atomic update
seed_by_age[ agents[i].ageGroup - 1 ]++;
As for not using a reduction: each thread touches seed_by_age[0] when the condition inside the loop is met, thereby invalidating the same cache line in all other cores. Access to the other vector elements also leads to mutual cache invalidation, but assuming that agents are more or less equally distributed among the age groups, it would not be as severe as in the case of the first element in the vector. Therefore I would propose that you do something like:
int total_seed_by_age = 0;
#pragma omp parallel for schedule(static) reduction(+:total_seed_by_age)
for (size_t i = 0; i < agents.size(); i++)
{
agents[i].setup(); // (a)
if (someRandomCondition())
{
agents[i].isSeed = true;
/* (b) */
total_seed_by_age++;
#pragma omp atomic update
seed_by_age[ agents[i].ageGroup - 1 ]++;
}
}
seed_by_age[0] = total_seed_by_age;
#pragma omp flush(seed_by_age[agents[i]].ageGroup)
Try to close all your brackets; that will fix the compiler error.
I am afraid that your #pragma omp flush statement is not sufficient to protect your data and prevent a race condition here.
If someRandomCondition() is true in only a very limited number of cases, you could use a critical section for the update of your vector without losing too much speed. Alternatively, if the size of your vector seed_by_age is not too large (which I assume), then it could be efficient to have a private version of the vector for each thread, which you merge right before leaving the parallel block.
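The critical-section variant mentioned above could be sketched roughly like this. Everything here is an assumption made for illustration: Agent is a reduced version of CAgent, shouldSeed is a deterministic stand-in for someRandomCondition(), and the indexing (index 0 for the overall count, index g for age group g >= 1) is chosen for the sketch and may differ from the original code.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Agent {     // hypothetical, reduced version of CAgent
    int ageGroup;  // assumed to be in [1, numGroups]
    bool isSeed;
};

// Hypothetical deterministic stand-in for someRandomCondition().
static bool shouldSeed(std::size_t i) { return i % 3 == 0; }

// Returns counts: index 0 is the overall count, index g the count of group g.
std::vector<int> seedAgents(std::vector<Agent>& agents, int numGroups) {
    assert(numGroups > 0);
    std::vector<int> seed_by_age(numGroups + 1, 0);
    const long n = static_cast<long>(agents.size());
    #pragma omp parallel for
    for (long i = 0; i < n; ++i) {
        if (shouldSeed(static_cast<std::size_t>(i))) {
            agents[i].isSeed = true;  // each thread touches only its own agents
            // Only one thread at a time may update the shared counters.
            #pragma omp critical(seed_update)
            {
                ++seed_by_age[0];
                ++seed_by_age[agents[i].ageGroup];
            }
        }
    }
    return seed_by_age;
}
```

This stays cheap only as long as the condition is rarely true; otherwise the threads spend most of their time waiting for the critical section.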
Performance wise, which of the following is more efficient?
Assigning in the master thread and copying the value to all threads:
int i = 0;
#pragma omp parallel for firstprivate(i)
for( ; i < n; i++){
...
}
Declaring and assigning the variable in each thread
#pragma omp parallel for
for(int i = 0; i < n; i++){
...
}
Declaring the variable in the master thread but assigning it in each thread.
int i;
#pragma omp parallel for private(i)
for(i = 0; i < n; i++){
...
}
It may seem a silly question and/or the performance impact may be negligible. But I'm parallelizing a loop that does a small amount of computation and is called a large number of times, so any optimization I can squeeze out of this loop is helpful.
I'm looking for a more low level explanation and how OpenMP handles this.
For example, when parallelizing over a large number of threads, I assume the second implementation would be more efficient, since initializing a variable using xor is far more efficient than copying the variable to all the threads.
There is not much of a difference in terms of performance among the 3 versions you presented, since each one of them uses #pragma omp parallel for. Hence, OpenMP will automatically distribute the loop iterations among the threads. Thus, the variable i will become private to each thread, and each thread will have a different range of iterations to work with. The variable i is automatically made private in order to avoid race conditions when updating it. Since the variable i will be private in the parallel for anyway, there is no need to put private(i) on the #pragma omp parallel for.
Nevertheless, your first version will produce an error, since OpenMP expects the loop directly beneath #pragma omp parallel for to have the following format:
for(init-expr; test-expr;incr-expr)
in order to precompute the range of work.
The for directive places restrictions on the structure of all
associated for-loops. Specifically, all associated for-loops must
have the following canonical form:
for (init-expr; test-expr; incr-expr) structured-block (OpenMP Application Program Interface, pp. 39-40.)
Edit: I tested your last two versions and inspected the generated assembly. Both versions produce the same assembly, as you can see -> version 2 and version 3.
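To illustrate the canonical loop form, here is a minimal sketch with the loop variable declared and initialised in the init-expr, as in the second version. The function sumOfSquares is a made-up example and not part of the question.

```cpp
#include <cassert>

// Sums i*i for i in [0, n) with a parallel loop in canonical form.
long sumOfSquares(int n) {
    assert(n >= 0);
    long total = 0;
    // i is private to each thread automatically; no private(i) is needed.
    // total is combined across threads by the reduction clause.
    #pragma omp parallel for reduction(+ : total)
    for (int i = 0; i < n; ++i) {
        total += static_cast<long>(i) * i;
    }
    return total;
}
```

Because the init, test and increment expressions are all present, the runtime can compute the iteration count up front and hand each thread its share.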
I need to parallelize this loop. I thought that using OpenMP was a good idea, but I have never studied it before.
#pragma omp parallel for
for(std::set<size_t>::const_iterator it=mesh->NEList[vid].begin();
it!=mesh->NEList[vid].end(); ++it){
worst_q = std::min(worst_q, mesh->element_quality(*it));
}
In this case the loop is not parallelized because it uses an iterator and the compiler cannot work out how to split it.
Can you help me?
OpenMP requires that the controlling predicate in parallel for loops use one of the following relational operators: <, <=, > or >=. Only random-access iterators provide these operators, and hence OpenMP parallel loops work only with containers that provide random-access iterators. std::set provides only bidirectional iterators. You can overcome that limitation using explicit tasks. The reduction can be performed by first reducing partially into variables private to each thread, followed by a global reduction over the partial values.
double *t_worst_q;
// Cache line size on x86/x64 in number of t_worst_q[] elements
const int cb = 64 / sizeof(*t_worst_q);
#pragma omp parallel
{
#pragma omp single
{
t_worst_q = new double[omp_get_num_threads() * cb];
for (int i = 0; i < omp_get_num_threads(); i++)
t_worst_q[i * cb] = worst_q;
}
// Perform partial min reduction using tasks
#pragma omp single
{
for(std::set<size_t>::const_iterator it=mesh->NEList[vid].begin();
it!=mesh->NEList[vid].end(); ++it) {
size_t elem = *it;
#pragma omp task
{
int tid = omp_get_thread_num();
t_worst_q[tid * cb] = std::min(t_worst_q[tid * cb],
mesh->element_quality(elem));
}
}
}
// Perform global reduction
#pragma omp critical
{
int tid = omp_get_thread_num();
worst_q = std::min(worst_q, t_worst_q[tid * cb]);
}
}
delete [] t_worst_q;
(I assume that mesh->element_quality() returns double)
Some key points:
The loop is executed serially by one thread only, but each iteration creates a new task. These are most likely queued for execution by the idle threads.
Idle threads waiting at the implicit barrier of the single construct begin consuming tasks as soon as they are created.
The value pointed to by it is dereferenced before the task body. If it were dereferenced inside the task body, it would be firstprivate and a copy of the iterator would be created for each task (i.e. on each iteration). This is not what you want.
Each thread performs partial reduction in its private part of the t_worst_q[].
In order to prevent performance degradation due to false sharing, the elements of t_worst_q[] that each thread accesses are spaced out so as to end up in separate cache lines. On x86/x64 the cache line is 64 bytes, therefore the thread number is multiplied by cb = 64 / sizeof(double).
The global min reduction is performed inside a critical construct to protect worst_q from being accessed by several threads at once. This is for illustrative purposes only since the reduction could also be performed by a loop in the main thread after the parallel region.
Note that explicit tasks require a compiler that supports OpenMP 3.0 or later. At the time of writing, this rules out all versions of the Microsoft C/C++ compiler (it only supports OpenMP 2.0).
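The variant where the global reduction is a plain loop after the parallel region, instead of a critical construct, could be sketched as follows. This is not the code above but a self-contained rewrite under assumptions: minQualityWithTasks and Slot are hypothetical names, the quality function is passed in as a function pointer instead of calling mesh->element_quality(), and small stubs replace the OpenMP runtime calls when the code is compiled without OpenMP.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <limits>
#include <set>
#include <vector>
#ifdef _OPENMP
#include <omp.h>
#else
// Serial fallbacks so that the sketch also builds without OpenMP.
static inline int omp_get_thread_num() { return 0; }
static inline int omp_get_num_threads() { return 1; }
#endif

// One partial minimum per thread, padded to 64 bytes to avoid false sharing.
struct Slot {
    double value;
    char pad[64 - sizeof(double)];
};

double minQualityWithTasks(const std::set<std::size_t>& elems,
                           double (*quality)(std::size_t)) {
    assert(quality != nullptr);
    const double inf = std::numeric_limits<double>::infinity();
    std::vector<Slot> partial;
    #pragma omp parallel
    {
        #pragma omp single
        partial.assign(static_cast<std::size_t>(omp_get_num_threads()), Slot{inf});
        // The implicit barrier of the first single makes partial visible to all.
        #pragma omp single
        for (std::set<std::size_t>::const_iterator it = elems.begin();
             it != elems.end(); ++it) {
            const std::size_t elem = *it;  // dereference before creating the task
            #pragma omp task firstprivate(elem)
            {
                Slot& s = partial[omp_get_thread_num()];
                s.value = std::min(s.value, quality(elem));
            }
        }
    }  // all tasks are complete once the parallel region ends
    double worst = inf;
    for (const Slot& s : partial) worst = std::min(worst, s.value);  // serial merge
    return worst;
}
```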
Random-Access Container
The simplest solution is to just throw everything into a random-access container (like std::vector) and use the index-based loops that are favoured by OpenMP:
// Copy elements
std::vector<size_t> neListVector(mesh->NEList[vid].begin(), mesh->NEList[vid].end());
// Process in a standard OpenMP index-based for loop
#pragma omp parallel for reduction(min : worst_q)
for (size_t i = 0; i < neListVector.size(); i++) {
worst_q = std::min(worst_q, mesh->element_quality(neListVector[i]));
}
Apart from being incredibly simple, in your situation (tiny elements of type size_t that can easily be copied) this is also the solution with the best performance and scalability.
Avoiding copies
However, in a different situation than yours you may have elements that aren't copied as easily (larger elements) or cannot be copied at all. In this case you can just throw the corresponding pointers in a random-access container:
// Collect pointers
std::vector<const nonCopiableObjectType *> neListVector;
for (const auto &entry : mesh->NEList[vid]) {
neListVector.push_back(&entry);
}
// Process in a standard OpenMP index-based for loop
#pragma omp parallel for reduction(min : worst_q)
for (size_t i = 0; i < neListVector.size(); i++) {
worst_q = std::min(worst_q, mesh->element_quality(*neListVector[i]));
}
This is slightly more complex than the first solution, but it still has the same good performance on small elements and increased performance on larger elements.
Tasks and Dynamic Scheduling
Since someone else brought up OpenMP Tasks in their answer, I want to comment on that too. Tasks are a very powerful construct, but they have a huge overhead (that even increases with the number of threads) and in this case just make things more complex.
For the min reduction the use of Tasks is never justified because the creation of a Task in the main thread costs much more than just doing the std::min itself!
For the more complex operation mesh->element_quality you might think that the dynamic nature of Tasks can help you with load-balancing problems, in case the execution time of mesh->element_quality varies greatly between iterations and you don't have enough iterations to even it out. But even in that case, there is a simpler solution: simply use dynamic scheduling by adding the schedule(dynamic) clause to the parallel for line in one of my previous solutions. It achieves the same behaviour with far less overhead.
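Put together, the dynamically scheduled version might look like the sketch below. The function worstQuality and the quality function pointer are assumptions that stand in for the mesh object, and the min reduction needs a compiler with OpenMP 3.1 support.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <limits>
#include <set>
#include <vector>

double worstQuality(const std::set<std::size_t>& elems,
                    double (*quality)(std::size_t)) {
    assert(quality != nullptr);
    // Copy the set into a random-access container for an index-based loop.
    const std::vector<std::size_t> ids(elems.begin(), elems.end());
    const long n = static_cast<long>(ids.size());
    double worst = std::numeric_limits<double>::infinity();
    // schedule(dynamic) hands out iterations on demand, which helps when
    // the cost of quality() varies a lot between elements.
    #pragma omp parallel for schedule(dynamic) reduction(min : worst)
    for (long i = 0; i < n; ++i) {
        worst = std::min(worst, quality(ids[i]));
    }
    return worst;
}
```

If single iterations are very cheap, a chunk size such as schedule(dynamic, 16) may reduce the scheduling overhead; the value 16 is only a placeholder to tune by measurement.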