I wanted to parallelize this code with the help of OpenMP with something like
#pragma omp parallel for to divide the work amongst the different threads.
What would be an efficient way? Here level is shared between the various threads.
Here make is a set.
for(iter=make.at(level).begin();iter!=make.at(level).end();iter++)
{
Function(*iter);
}
If the type returned by make.at(level) has random-access iterators with constant access time and your compiler supports a recent enough OpenMP version (read: it is not MSVC++), then you can use the parallel for worksharing directive directly:
obj = make.at(level);
#pragma omp parallel for
for (iter = obj.begin(); iter < obj.end(); iter++)
{
Function(*iter);
}
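As a rough sketch of the random-access case, assuming a std::vector and a hypothetical double_in_place() standing in for Function(), the loop might look like this. The loop condition uses < rather than !=, since OpenMP versions before 5.0 only accept relational operators there.

```cpp
#include <vector>

// Hypothetical stand-in for Function(): doubles one element in place.
static void double_in_place(int& x) { x *= 2; }

// Applies double_in_place() to every element. With OpenMP 3.0 or later the
// iterations are divided among the threads; without OpenMP the pragma is
// ignored and the loop simply runs serially.
void double_all(std::vector<int>& values) {
    #pragma omp parallel for
    for (std::vector<int>::iterator it = values.begin(); it < values.end(); ++it) {
        double_in_place(*it);  // each iteration touches a different element
    }
}
```

Calling double_all(v) on a vector holding 1, 2, 3, 4 should leave it holding 2, 4, 6, 8, whatever the number of threads.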
If the type does not provide random-access iterators but your compiler still supports OpenMP 3.0 or newer, then you can use OpenMP tasks:
#pragma omp parallel
{
#pragma omp single
{
obj = make.at(level);
for (iter = obj.begin(); iter != obj.end(); iter++)
{
#pragma omp task firstprivate(iter)
Function(*iter);
}
}
}
Here a single thread executes the for loop and creates a number of OpenMP tasks. Each task will make a single call to Function() using the corresponding value of *iter. Then each idle thread will start picking from the list of unfinished tasks. At the end of the parallel region there is an implicit barrier so the master thread will dutifully wait for all tasks to finish before continuing execution past the parallel region.
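The task pattern described above can be sketched as a small self-contained function. This is only an illustration: square() is a hypothetical stand-in for Function(), and the results are summed just to have something to check.

```cpp
#include <set>

// Hypothetical stand-in for Function(): returns the square of its argument.
static long square(int x) { return static_cast<long>(x) * x; }

// One thread walks the set and creates a task per element; the other
// threads in the team execute the tasks.
long sum_of_squares(const std::set<int>& items) {
    long total = 0;
    #pragma omp parallel
    {
        #pragma omp single
        {
            for (std::set<int>::const_iterator it = items.begin();
                 it != items.end(); ++it) {
                // firstprivate(it): every task gets its own copy of the iterator
                #pragma omp task firstprivate(it) shared(total)
                {
                    const long s = square(*it);
                    #pragma omp atomic
                    total += s;  // atomic: many tasks update the same sum
                }
            }
        }  // implicit barrier of single: all tasks have finished here
    }
    return total;
}
```

For a set holding 1, 2 and 3 this should return 14.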
If you are unfortunate enough to use MS Visual C++, then you don't have much choice other than to create an array of object pointers and iterate over it using a simple integer loop:
obj = make.at(level);
obj_type** elements = new obj_type*[obj.size()];
for (i = 0, iter = obj.begin(); i < obj.size(); i++)
{
elements[i] = &(*iter++);
}
#pragma omp parallel for
for (i = 0; i < obj.size(); i++)
{
Function(*elements[i]);
}
delete [] elements;
It's not the most elegant solution but it should work.
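A variation on the same idea, sketched with std::vector instead of new[] so that nothing has to be deleted by hand. The names map_set() and times_ten() are made up for this example, and the pointers are to const because set elements cannot be modified.

```cpp
#include <set>
#include <vector>

// Hypothetical stand-in for Function(): computes a result for one element.
static int times_ten(int x) { return 10 * x; }

std::vector<int> map_set(const std::set<int>& items) {
    // Collect a pointer to every element (serial, one pass over the set).
    std::vector<const int*> elements;
    elements.reserve(items.size());
    for (std::set<int>::const_iterator it = items.begin(); it != items.end(); ++it)
        elements.push_back(&*it);

    std::vector<int> out(elements.size());
    // Signed int index: OpenMP 2.0 (e.g. MSVC) does not accept unsigned ones.
    const int n = static_cast<int>(elements.size());
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        out[i] = times_ten(*elements[i]);  // each iteration writes its own slot
    return out;
}
```

Since a set iterates in sorted order, a set holding 3, 1, 2 should give 10, 20, 30.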
Edit: If I understand correctly from the title of your question, you are working with sets. That rules out the first algorithm since sets do not support random-access iterators. Use either the second or the third algorithm depending on your compiler's support for OpenMP tasks.
It seems that the loop variable in a parallel for must be a signed int, but I'm not sure. Here is a topic about this: Why must loop variables be signed in a parallel for?
To use this iterator pattern with OpenMP probably requires some rethinking of how to perform the loop - you can't use #pragma omp for since your loop isn't a simple integer loop. I wonder if the following would work:
iter = make.at(level).begin();
end = make.at(level).end();
#pragma omp parallel firstprivate(iter) shared(make,level,end)
{
#pragma omp single
func(iter); /* only one thread does the first item */
while (1)
{
#pragma omp critical
iter = make.at(level).next(); /* each thread gets a different item */
if (iter == end)
break;
func(iter);
}
} /* end parallel block */
Note that I had to change your iter++ into a next() call in a critical section to make it work. The reason for this is that the shared make.at(level) object needs to remember which items have already been processed.
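Since std::set has no next() member, one way to get the same effect is to keep a shared iterator and advance it inside a critical section. The following is only a sketch under that assumption, with a made-up cube() in place of func():

```cpp
#include <set>

// Hypothetical stand-in for func().
static long cube(int x) { return static_cast<long>(x) * x * x; }

long sum_of_cubes(const std::set<int>& items) {
    std::set<int>::const_iterator cursor = items.begin();  // shared position
    const std::set<int>::const_iterator end = items.end();
    long total = 0;
    #pragma omp parallel shared(cursor, total)
    {
        long local = 0;  // per-thread partial sum
        for (;;) {
            bool have_item = false;
            int value = 0;
            // Only one thread at a time may read and advance the cursor.
            #pragma omp critical(cursor_advance)
            {
                if (cursor != end) {
                    value = *cursor;
                    ++cursor;
                    have_item = true;
                }
            }
            if (!have_item) break;  // nothing left to process
            local += cube(value);   // the actual work runs outside the lock
        }
        #pragma omp atomic
        total += local;
    }
    return total;
}
```

Every element is handed out exactly once, so a set holding 1, 2 and 3 should give 36. The critical section serializes only the iterator step, which pays off when the work per item is much more expensive than advancing the iterator.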
Related
I am trying to achieve something contrived using OpenMP.
I have a multi-core system with N available processors. I want a vector of objects of length k*P to be populated in batches of P by a single thread (by reading a file), i.e. a single thread reads this file and writes into vecObj[0 to P-1], then vecObj[P to 2P-1], etc. To keep things simple, this vector is pre-resized (i.e. elements are assigned using the = operator, there are no push_backs, and the length is constant as far as we are concerned).
After a batch is written into the vector, I want the remaining N-1 threads to work on the available data. Since every object can take different time to be worked upon, it would be good to have dynamic scheduling for the remaining threads. The below snippet works really well when all the threads are working on the data.
#pragma omp parallel for schedule(dynamic, per_thread)
for(size_t i = 0; i < dataLength(); ++i) {
threadWorkOnElement(vecObj, i);
}
Now, the main issue I face in thinking up a solution is this: how can I have N-1 threads dynamically scheduled over the range of available data while another thread just keeps reading and populating the vector with data?
I am guessing that the issue of writing new data and messaging the remaining threads can be achieved using std atomic.
I think that what I am trying to achieve is along the lines of the following pseudo code
std::atomic<size_t> freshDataEnd;
size_t dataWorkStart = 0;
size_t dataWorkEnd;
#pragma omp parallel
{
#pragma omp task
{
//increment freshDataEnd atomically upon reading every P objects
//return when end of file is reached
readData(vecObj, freshDataEnd);
}
#pragma omp task
{
omp_set_num_threads(N-1);
while(freshDataEnd <= MAX_VEC_LEN) {
if (dataWorkStart < freshDataEnd) {
dataWorkEnd = freshDataEnd;
#pragma omp parallel for schedule(dynamic, per_thread)
for(size_t i = dataWorkStart; i < dataWorkEnd; ++i) {
threadWorkOnElement(vecObj, i);
}
dataWorkStart = dataWorkEnd;
}
}
}
}
Is this the correct approach to achieve what I am trying to do? How can I handle this sort of nested parallelism? Not so important : I would have preferred to stick with openmp directives and not use std atomics, is that possible? How?
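One possible way to stay with OpenMP directives only, sketched here under several assumptions: the reader runs inside a single construct and creates one task per element as soon as a batch has been filled. fill_batch() and work_on_element() are made-up stand-ins for readData() and threadWorkOnElement(). Unlike the pseudocode above, the reading thread is not strictly dedicated to reading, since it may also execute tasks at scheduling points.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for readData(): fills one batch [begin, end).
static void fill_batch(std::vector<int>& data, std::size_t begin, std::size_t end) {
    for (std::size_t i = begin; i < end; ++i) data[i] = static_cast<int>(i);
}

// Hypothetical stand-in for threadWorkOnElement().
static void work_on_element(std::vector<int>& data, std::size_t i) { data[i] *= 2; }

void read_and_process(std::vector<int>& data, std::size_t batch) {
    if (batch == 0) return;  // avoid an endless loop on a zero batch size
    #pragma omp parallel
    {
        #pragma omp single
        {
            for (std::size_t begin = 0; begin < data.size(); begin += batch) {
                std::size_t end = std::min(begin + batch, data.size());
                fill_batch(data, begin, end);  // reading is done by one thread
                for (std::size_t i = begin; i < end; ++i) {
                    // One task per freshly read element; idle threads pick
                    // the tasks up while the reader continues.
                    #pragma omp task firstprivate(i)
                    work_on_element(data, i);
                }
            }
        }  // implicit barrier: all tasks have finished here
    }
}
```

No atomics are needed in this form because a task is only created after its element has been written, and each task touches a different element.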
So I have a loop where I iterate over elements of a vector, call a function on each element, and if it meets a certain criterion, I push it onto a list.
my_list li;
for (auto itr = Obj.begin(); itr != Obj.end(); ++itr) {
if ((*itr).function_call())
li.push_back(*itr);
}
I've been thinking of ways to optimize my program, and I came across OpenMP, but a lot of the sample code is hard to follow.
Could someone walk me through how to convert the above loop to utilize multiple cores in parallel?
Thanks.
There are a few points you need to take care of to parallelize that code snippet:
If you're using OpenMP 3.0 (or above) you can parallelize your iterator-based for loop with #pragma omp parallel for; with an older version of OpenMP, you need a for loop that accesses the vector by index.
You need to guard the li.push_back(*itr); statement with a lock or put it in a critical section.
If function_call is not a really slow function, or your vector does not contain many items, parallelizing may not be worthwhile because thread creation introduces overhead.
So a pseudo-code implementation would be
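my_list li;
// parallel for: create the thread team and split the loop among its threads
#pragma omp parallel for
for (auto itr = Obj.begin(); itr < Obj.end(); ++itr) {
    if ((*itr).function_call())
    {
        // only one thread at a time may modify the shared list
        #pragma omp critical(CRIT_1)
        {
            li.push_back(*itr);
        }
    }
}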
The time has come to discuss efficient ways to use container classes such as std::list or std::vector with OpenMP (since the OP wants to optimize his code using lists with OpenMP). Let me list four ways in increasing level of efficiency.
Fill the container in a parallel section in a critical block
Make private versions of the container for each thread, fill them in parallel, and then merge them in a critical section
Don't use STL containers. STL was not designed with efficiency in mind. Instead either write your own or use something like Agner Fog's containers, which are designed for efficiency. For example, instead of using a heap for memory allocation they use a memory pool.
In some special cases it's possible to merge the private versions of the containers in parallel as well.
Example code for the first case is given in the accepted answer. This defeats most of the purpose of using threaded code since each iteration fills the container in critical section.
Example code for the second case can be found at C++ OpenMP Parallel For Loop - Alternatives to std::vector. Rather than re-post the code here let me give an example for the third case using Agner Fog's container classes
DynamicArray<int> vec;
#pragma omp parallel
{
DynamicArray<int> vec_private;
#pragma omp for nowait //fill vec_private in parallel
for(int i=0; i<100; i++) {
vec_private.Push(i);
}
//merging here is probably not optimal
//Dynamic array needs an append function
//vec should reserve a size equal to the sum of size each vec_private
//then use memcpy to append vec_private into vec in a critical section
#pragma omp critical
{
for(int i=0; i<vec_private.GetNum(); i++) {
vec.Push(vec_private[i]);
}
}
}
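For comparison, the second approach (private containers merged in a critical section) can be sketched with plain std::vector. This is an illustration only; is_even() is a made-up predicate standing in for function_call().

```cpp
#include <vector>

// Hypothetical predicate standing in for function_call().
static bool is_even(int x) { return x % 2 == 0; }

std::vector<int> filter_even(const std::vector<int>& input) {
    std::vector<int> result;
    const int n = static_cast<int>(input.size());
    #pragma omp parallel
    {
        std::vector<int> local;  // private to each thread, filled without locking
        #pragma omp for nowait
        for (int i = 0; i < n; i++) {
            if (is_even(input[i])) local.push_back(input[i]);
        }
        // One merge per thread instead of one lock per element.
        #pragma omp critical(merge_results)
        {
            result.insert(result.end(), local.begin(), local.end());
        }
    }
    return result;
}
```

The order of the elements in result depends on which thread merges first, so sort afterwards if the order matters.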
Finally, in special cases for example with histograms (probably the most common data structure in experimental particle physics), it's possible to merge the private arrays in parallel as well. For the histograms this is equivalent to an array reduction. This is a bit tricky. An example showing how to do this can be found at Fill histograms (array reduction) in parallel with OpenMP without using a critical section
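If your compiler supports OpenMP 4.5 or later, a histogram can also be filled with a reduction over an array section, which leaves the private copies and the merge to the OpenMP runtime. A minimal sketch under that assumption (histogram() and its parameters are names made up for this example):

```cpp
#include <vector>

// Counts how often each value in [0, nbins) occurs in samples.
std::vector<int> histogram(const std::vector<int>& samples, int nbins) {
    std::vector<int> hist(nbins, 0);
    int* h = hist.data();  // array sections need a pointer or a plain array
    const int n = static_cast<int>(samples.size());
    // Each thread gets a private, zero-initialized copy of h[0..nbins-1];
    // the copies are added into the original at the end of the loop.
    #pragma omp parallel for reduction(+ : h[:nbins])
    for (int i = 0; i < n; i++) {
        const int bin = samples[i];
        if (bin >= 0 && bin < nbins) h[bin]++;  // out-of-range samples are skipped
    }
    return hist;
}
```

This is not the hand-written parallel merge from the linked answer, just a shorter route to the same result on newer compilers.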
I have a long-running simulation program and I plan to use OpenMP to parallelize some of the code for speedup. I'm new to OpenMP and have the following question.
Given that the simulation is a stochastic one, I have the following data structure and I need to capture age-specific count of seeded agents [Edited: some code edited]:
class CAgent {
int ageGroup;
bool isSeed;
/* some other stuff */
};
class Simulator {
std::vector<int> seed_by_age;
std::vector<CAgent> agents;
void initEnv();
/* some other stuff */
};
void Simulator::initEnv() {
std::fill(seed_by_age.begin(), seed_by_age.end(), 0);
#pragma omp parallel
{
#pragma omp for
for (size_t i = 0; i < agents.size(); i++)
{
agents[i].setup(); // (a)
if (someRandomCondition())
{
agents[i].isSeed = true;
/* (b) */
seed_by_age[0]++; // index = 0 -> overall
seed_by_age[ agents[i].ageGroup - 1 ]++;
}
}
} // end #parallel
} // end Simulator::initEnv()
As the variable seed_by_age is shared across threads, I know I have to protect it properly. So in (b), I used #pragma omp flush(seed_by_age[agents[i].ageGroup]), but the compiler complains: "error: expected ')' before '[' token"
I'm not doing reduction, and I try to avoid 'critical' directive if possible. So, am I missing something here? How can I properly protect a particular element of the vector?
Many thanks and I appreciate any suggestions.
Development box: 2 core CPU, target platform 4-6 cores
Platform: Windows 7, 64bits
MinGW 4.7.2 64 bits (rubenvb build)
You can only use flush with variables, not elements of arrays and definitely not with elements of C++ container classes. The indexing operator for std::vector results in a call to operator[], an inline function, but still a function.
Because in your case std::vector::operator[] returns a reference to a simple scalar type, you can use the atomic update construct to protect the updates:
#pragma omp atomic update
seed_by_age[0]++; // index = 0 -> overall
#pragma omp atomic update
seed_by_age[ agents[i].ageGroup - 1 ]++;
As for not using reduction, each thread touches seed_by_age[0] when the condition inside the loop is met, thereby invalidating the same cache line in all other cores. Access to the other vector elements also leads to mutual cache invalidation, but assuming that agents are more or less equally distributed among the age groups, it would not be as severe as in the case of the first element in the vector. Therefore I would propose that you do something like:
int total_seed_by_age = 0;
#pragma omp parallel for schedule(static) reduction(+:total_seed_by_age)
for (size_t i = 0; i < agents.size(); i++)
{
agents[i].setup(); // (a)
if (someRandomCondition())
{
agents[i].isSeed = true;
/* (b) */
total_seed_by_age++;
#pragma omp atomic update
seed_by_age[ agents[i].ageGroup - 1 ]++;
}
}
seed_by_age[0] = total_seed_by_age;
#pragma omp flush(seed_by_age[agents[i]].ageGroup)
Try closing all your brackets; that should fix the compiler error.
I am afraid that your #pragma omp flush statement is not sufficient to protect your data and prevent a race condition here.
If someRandomCondition() is true in only a very limited number of cases, you could use a critical section for the update of your vector without losing too much speed. Alternatively, if your vector seed_by_age is not too large (which I assume), then it could be efficient to have a private version of the vector for each thread, which you merge right before leaving the parallel block.
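The private-copy idea can be sketched roughly as follows. The function count_by_group() and its parameters are made up for the example; groups holds values from 1 to ngroups, and slot 0 holds the overall count as in the question.

```cpp
#include <cstddef>
#include <vector>

std::vector<int> count_by_group(const std::vector<int>& groups, int ngroups) {
    std::vector<int> counts(ngroups + 1, 0);  // counts[0] is the overall total
    const int n = static_cast<int>(groups.size());
    #pragma omp parallel
    {
        std::vector<int> local(counts.size(), 0);  // private copy per thread
        #pragma omp for nowait
        for (int i = 0; i < n; i++) {
            local[0]++;          // no protection needed: local is private
            local[groups[i]]++;
        }
        // Merge once per thread, right before leaving the parallel block.
        #pragma omp critical(merge_counts)
        {
            for (std::size_t k = 0; k < counts.size(); k++)
                counts[k] += local[k];
        }
    }
    return counts;
}
```

The critical section is entered once per thread rather than once per seeded agent, so its cost should be negligible for a short vector.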
I need to parallelize this loop. I thought that using OpenMP was a good idea, but I have never studied it before.
#pragma omp parallel for
for(std::set<size_t>::const_iterator it=mesh->NEList[vid].begin();
it!=mesh->NEList[vid].end(); ++it){
worst_q = std::min(worst_q, mesh->element_quality(*it));
}
In this case the loop is not parallelized because it uses an iterator and the compiler cannot work out how to split it.
Can you help me?
OpenMP requires that the controlling predicate in parallel for loops uses one of the following relational operators: <, <=, > or >=. Only random-access iterators provide these operators, and hence OpenMP parallel loops work only with containers that provide random-access iterators. std::set provides only bidirectional iterators. You may overcome that limitation using explicit tasks. The reduction can be performed by first reducing partially into variables private to each thread, followed by a global reduction over the partial values.
double *t_worst_q;
// Cache line size on x86/x64 in number of t_worst_q[] elements
const int cb = 64 / sizeof(*t_worst_q);
#pragma omp parallel
{
#pragma omp single
{
t_worst_q = new double[omp_get_num_threads() * cb];
for (int i = 0; i < omp_get_num_threads(); i++)
t_worst_q[i * cb] = worst_q;
}
// Perform partial min reduction using tasks
#pragma omp single
{
for(std::set<size_t>::const_iterator it=mesh->NEList[vid].begin();
it!=mesh->NEList[vid].end(); ++it) {
size_t elem = *it;
#pragma omp task
{
int tid = omp_get_thread_num();
t_worst_q[tid * cb] = std::min(t_worst_q[tid * cb],
mesh->element_quality(elem));
}
}
}
// Perform global reduction
#pragma omp critical
{
int tid = omp_get_thread_num();
worst_q = std::min(worst_q, t_worst_q[tid * cb]);
}
}
delete [] t_worst_q;
(I assume that mesh->element_quality() returns double)
Some key points:
The loop is executed serially by one thread only, but each iteration creates a new task. These are most likely queued for execution by the idle threads.
Idle threads waiting at the implicit barrier of the single construct begin consuming tasks as soon as they are created.
The value pointed to by it is dereferenced before the task body. If dereferenced inside the task body, it would be firstprivate and a copy of the iterator would be created for each task (i.e. on each iteration). This is not what you want.
Each thread performs partial reduction in its private part of the t_worst_q[].
In order to prevent performance degradation due to false sharing, the elements of t_worst_q[] that each thread accesses are spaced out so as to end up in separate cache lines. On x86/x64 the cache line is 64 bytes, therefore the thread number is multiplied by cb = 64 / sizeof(double).
The global min reduction is performed inside a critical construct to protect worst_q from being accessed by several threads at once. This is for illustrative purposes only since the reduction could also be performed by a loop in the main thread after the parallel region.
Note that explicit tasks require a compiler that supports OpenMP 3.0 or later. This rules out all versions of Microsoft C/C++ Compiler (it only supports OpenMP 2.0).
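A simpler variant of the same task pattern, given only as a sketch: instead of per-thread slots, each task computes its value unlocked and then updates the shared minimum inside a named critical section. quality() is a made-up stand-in for mesh->element_quality(), and this form is only reasonable when that call is much more expensive than the lock.

```cpp
#include <algorithm>
#include <cstddef>
#include <set>

// Hypothetical stand-in for mesh->element_quality().
static double quality(std::size_t id) { return 1.0 / static_cast<double>(id + 1); }

double worst_quality(const std::set<std::size_t>& elements, double worst_q) {
    #pragma omp parallel
    {
        #pragma omp single
        {
            for (std::set<std::size_t>::const_iterator it = elements.begin();
                 it != elements.end(); ++it) {
                std::size_t elem = *it;  // dereference before the task body
                #pragma omp task firstprivate(elem) shared(worst_q)
                {
                    const double q = quality(elem);  // expensive part, no lock held
                    #pragma omp critical(update_worst)
                    worst_q = std::min(worst_q, q);
                }
            }
        }  // implicit barrier: all tasks have finished here
    }
    return worst_q;
}
```

With elements 0, 1 and 3 and an initial worst_q of 1.0, the result should be 0.25.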
Random-Access Container
The simplest solution is to just throw everything into a random-access container (like std::vector) and use the index-based loops that are favoured by OpenMP:
// Copy elements
std::vector<size_t> neListVector(mesh->NEList[vid].begin(), mesh->NEList[vid].end());
// Process in a standard OpenMP index-based for loop
#pragma omp parallel for reduction(min : worst_q)
for (int i = 0; i < neListVector.size(); i++) {
worst_q = std::min(worst_q, mesh->element_quality(neListVector[i]));
}
Apart from being incredibly simple, in your situation (tiny elements of type size_t that can easily be copied) this is also the solution with the best performance and scalability.
Avoiding copies
However, in a different situation than yours you may have elements that aren't copied as easily (larger elements) or cannot be copied at all. In this case you can just throw the corresponding pointers in a random-access container:
// Collect pointers
std::vector<const nonCopiableObjectType *> neListVector;
for (const auto &entry : mesh->NEList[vid]) {
neListVector.push_back(&entry);
}
// Process in a standard OpenMP index-based for loop
#pragma omp parallel for reduction(min : worst_q)
for (int i = 0; i < neListVector.size(); i++) {
worst_q = std::min(worst_q, mesh->element_quality(*neListVector[i]));
}
This is slightly more complex than the first solution, but it has the same good performance on small elements and better performance on larger elements.
Tasks and Dynamic Scheduling
Since someone else brought up OpenMP Tasks in his answer, I want to comment on that too. Tasks are a very powerful construct, but they have a huge overhead (that even increases with the number of threads) and in this case just make things more complex.
For the min reduction the use of Tasks is never justified because the creation of a Task in the main thread costs much more than just doing the std::min itself!
For the more complex operation mesh->element_quality you might think that the dynamic nature of Tasks can help you with load-balancing problems, in case the execution time of mesh->element_quality varies greatly between iterations and you don't have enough iterations to even it out. But even in that case, there is a simpler solution: simply use dynamic scheduling by adding the schedule(dynamic) clause to your parallel for line in one of my previous solutions. It achieves the same behaviour with far less overhead.
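Put together, the copy-to-vector solution with dynamic scheduling might look like this self-contained sketch. quality() is a made-up stand-in for mesh->element_quality(), and the min reduction assumes OpenMP 3.1 or later.

```cpp
#include <algorithm>
#include <cstddef>
#include <set>
#include <vector>

// Hypothetical stand-in for mesh->element_quality().
static double quality(std::size_t id) { return 1.0 / static_cast<double>(id + 1); }

double worst_quality_dynamic(const std::set<std::size_t>& elements, double worst_q) {
    // Copy the (small) elements into a random-access container.
    std::vector<std::size_t> ids(elements.begin(), elements.end());
    const int n = static_cast<int>(ids.size());
    // schedule(dynamic): a thread fetches the next iteration when it becomes
    // idle, which evens out iterations of very different duration.
    #pragma omp parallel for schedule(dynamic) reduction(min : worst_q)
    for (int i = 0; i < n; i++)
        worst_q = std::min(worst_q, quality(ids[i]));
    return worst_q;
}
```

For many cheap iterations a chunk size, e.g. schedule(dynamic, 16), usually reduces the scheduling overhead further.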
I'm trying to iterate over a map in C++ using OpenMP, but I got three error messages saying that the initialization, termination and increment of my loop have improper form. I'm quite new to OpenMP, so is there any way to get around this problem while getting the same results as the serial version? The following is the code I used:
map< int,string >::iterator datIt;
#pragma omp parallel for
for(datIt=dat.begin();datIt!=dat.end();datIt++) //construct the distance matrix
{
...............
}
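This can also be done with a simple index-based for loop combined with std::advance to reach a particular map element. OpenMP 2.0 supports index-based for loops very well. Note that std::advance on a map iterator takes time linear in i, so every iteration walks the map from the beginning.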
#pragma omp parallel for
for(int i = 0; i < dat.size(); i++) {
auto datIt = dat.begin();
advance(datIt, i);
//construct the distance matrix using iterator datIt
}
In each thread the iterator datIt will point to a map item and can be used to perform operations on it.
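If walking the map from the start in every iteration is too slow, one alternative is to collect the iterators once and index into that vector. A rough sketch, where value_lengths() is a made-up example that just records the length of every mapped string:

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <vector>

std::vector<std::size_t> value_lengths(const std::map<int, std::string>& dat) {
    typedef std::map<int, std::string>::const_iterator Iter;
    // One serial pass to remember where every element lives.
    std::vector<Iter> entries;
    entries.reserve(dat.size());
    for (Iter it = dat.begin(); it != dat.end(); ++it) entries.push_back(it);

    std::vector<std::size_t> lengths(entries.size());
    const int n = static_cast<int>(entries.size());  // signed index for OpenMP 2.0
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        lengths[i] = entries[i]->second.size();  // constant-time access per element
    return lengths;
}
```

This costs one extra vector of iterators but keeps the plain index-based loop that OpenMP 2.0 accepts.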
It's likely your implementation of OpenMP is incompatible with STL iterators. While there have been some changes to the standard to make OMP more compatible with the STL, I think you'll find your implementation doesn't support such behaviour. Most OpenMP implementations I've encountered are at most version 2.5, Microsoft C++ is 2.0. The only compiler I'm aware of that supports 3.0 is the Intel C++ compiler.
A few other points: you should use std::begin and std::end. Also, you either need to declare your loop variable as private, or have OpenMP figure that out by itself, like so:
#pragma omp parallel for
for(map< int,string >::iterator datIt = std::begin(dat);
datIt != std::end(dat);
datIt++)
{
//construct the distance matrix...
}
But without 3.0 support, this is beside the point.
OpenMP 3.0, now available in gcc and the Intel compiler, has a task directive that allows a thread to delegate tasks to a pool of threads.
Inspired by this response and this course, I wrote this kind of code, which works fine for me:
map< int,string >::iterator datIt;
...
#pragma omp parallel
#pragma omp single nowait
{
for(datIt=dat.begin();datIt!=dat.end();datIt++) //construct the distance matrix
{
#pragma omp task firstprivate(datIt)
{
...............
}
}
}
One thread (single directive) loops over the whole map and puts one task per map element into a pool of tasks. The other OpenMP threads process the tasks waiting in this pool. They do not need to wait for the end of the for loop before they start processing tasks (nowait). Every task has its own copy of the iterator to the map element it processes (firstprivate(datIt)).
Constraint: every task must be independent, and the map must not change before all tasks have finished.
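A self-contained version of this pattern could look like the sketch below. The per-element work here (adding up string lengths in a made-up total_length() function) is just a placeholder for building the distance matrix.

```cpp
#include <cstddef>
#include <map>
#include <string>

std::size_t total_length(const std::map<int, std::string>& dat) {
    std::size_t total = 0;
    #pragma omp parallel
    {
        #pragma omp single nowait
        {
            for (std::map<int, std::string>::const_iterator datIt = dat.begin();
                 datIt != dat.end(); ++datIt) {
                // Each task receives its own copy of the iterator.
                #pragma omp task firstprivate(datIt) shared(total)
                {
                    const std::size_t len = datIt->second.size();
                    #pragma omp atomic
                    total += len;  // placeholder for the real per-element work
                }
            }
        }
    }  // barrier at the end of the parallel region: all tasks are done
    return total;
}
```

The map is only read inside the tasks, which satisfies the constraint that it must not change while they run.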
Try this if it's helpful.
#pragma omp parallel for shared(dat) private(datIt)
for(map< int,string >::iterator datIt=dat.begin();datIt!=dat.end();datIt++)
{
...............
}