I wanted to create a map assigning vectors of integers to pairs of integers, and my goal is to fill it in parallel. To make sure that multiple threads never push_back to the same memory entity at the same time, the second coordinate of the map's key pairs holds the current thread number. I encountered problems, however: it seems that some of the values are not inserted properly. Instead of getting 10 values altogether, I always get fewer (sometimes 9, sometimes 8, 6, etc.).
map<pair<int, int>, vector<int> > test;
#pragma omp parallel num_threads(8)
{
#pragma omp for
for (int i = 0;i < 10;i++)
{
test[make_pair(i % 3, omp_get_thread_num())].push_back(i);
}
}
I have also tried test.at(make_pair(i % 3, omp_get_thread_num())).push_back(i) and it didn't work either; in that case, however, the execution was terminated with an exception.
I thought that #pragma omp for distributes the for loop into disjoint subsequences of (0,...,9), so there shouldn't be a problem with my code... I am a bit confused. Could someone explain this issue to me?
As stated, the standard library containers are not thread safe.
The appropriate solution for this case is to initialize n maps (one for each thread) and then join them at the end.
As stated before, using a mutex (and thereby making access to the map safe) would be a valid solution, but it would also hurt performance: every time the map is accessed, a thread may have to wait for the mutex to be released.
It should also be noted that a loop of 10 iterations is not enough work to make multithreading worthwhile; using multiple threads here will most likely degrade performance.
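For reference, a minimal sketch of that mutex/critical-section variant (correct, but every insertion serializes the threads):
#pragma omp parallel num_threads(8)
{
    #pragma omp for
    for (int i = 0; i < 10; i++)
    {
        // only one thread at a time may touch the map
        #pragma omp critical
        test[make_pair(i % 3, omp_get_thread_num())].push_back(i);
    }
}
The per-thread-map approach looks like this instead: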
map<pair<int, int>, vector<int> > test[8];
#pragma omp parallel num_threads(8)
{
#pragma omp for
for (int i = 0;i < large_number; i++)
{
int thread_id = omp_get_thread_num();
test[thread_id][make_pair(i % 3, thread_id)].push_back(i);
}
}
// no explicit barrier is needed here: the parallel region above already ends with an implicit one
map<pair<int,int>, vector<int>> combined;
for (int i = 0; i < 8; ++i)
combined.insert(test[i].begin(), test[i].end());
That's because a map is not thread-safe (nor is a vector, but that part is not threaded here). You have to add a mutex, use lock-free containers, or prepare your map first.
In this case, start on a single thread by creating all your entries.
#pragma omp single
for (int i = 0;i < 3;++i)
{
for (int j = 0;j < 8;++j)
{
test.insert(make_pair(make_pair(i, j), vector<int>()));
}
}
Then do your parallel for (the implicit barrier at the end of the single construct already ensures that all entries exist before any thread starts the loop).
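Putting it together, a sketch of this prepare-then-fill pattern, using the same test map as in the question:
map<pair<int, int>, vector<int> > test;
#pragma omp parallel num_threads(8)
{
    // one thread creates every (i % 3, thread) entry; the implicit barrier at
    // the end of 'single' guarantees all keys exist before the loop below
    #pragma omp single
    for (int i = 0; i < 3; ++i)
        for (int j = 0; j < 8; ++j)
            test.insert(make_pair(make_pair(i, j), vector<int>()));

    // only existing elements are touched now, so the map structure is never
    // modified concurrently, and each thread writes only to its own vectors
    #pragma omp for
    for (int i = 0; i < 10; i++)
        test[make_pair(i % 3, omp_get_thread_num())].push_back(i);
}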
So I have a function, let's call it dostuff(), for which it is sometimes beneficial for my application to parallelize within it, and sometimes to run it multiple times and parallelize the whole thing. The function itself does not change between the two use cases.
Note: object is large enough that it cannot viably be stored in a list, and so it must be discarded with each iteration.
So, let's say our code looks like this:
bool parallelize_within = std::atoi(argv[1]) != 0; // argv[1] is a char*, so parse it rather than converting the pointer to bool
if (parallelize_within) {
// here we assume parallelization is handled within the function dostuff()
for (int i = 0; i < 100; ++i) {
object randomized = function_that_accesses_rand();
dostuff(i, randomized, parallelize_within);
}
} else {
#pragma omp parallel for
for (int i = 0; i < 100; ++i) {
object randomized = function_that_accesses_rand();
dostuff(i, randomized, parallelize_within);
}
}
Obviously, we run into the issue that dostuff() will have threads access the random object at different times in different iterations of the same program. This is not the case when parallelize_within == true, but when we run dostuff() in parallel individually per thread, is there a way to guarantee that the random object is accessed in order based on the iteration? I know that I could do:
#pragma omp parallel for schedule(dynamic)
which will guarantee that eventually, as iterations are assigned to threads at runtime dynamically, the objects will access rand in order with the iteration number, but for the first set of iterations it will be totally random. Any suggestions on how to avoid this?
First of all you have to make sure that both function_that_accesses_rand and dostuff are thread-safe.
You do not have to duplicate your code if you use the if clause:
#pragma omp parallel for if(!parallelize_within)
To make sure that the first argument of dostuff(...) reflects the order in which the randomized objects were created, you have to do something like this:
int j = 0;
#pragma omp parallel for if(!parallelize_within)
for (int i = 0; i < 100; ++i) {
int k;
object randomized;
#pragma omp critical
{
k = j++;
randomized = function_that_accesses_rand();
}
dostuff(k, randomized, parallelize_within);
}
You may eliminate the critical section if your function_that_accesses_rand makes it possible, but I cannot be more specific without knowing your function. One solution is to have this function return a value representing the order. Do not forget that this function has to be thread-safe!
#pragma omp parallel for if(!parallelize_within)
for (int i = 0; i < 100; ++i) {
int k;
object randomized = function_that_accesses_rand(k);
dostuff(k, randomized, parallelize_within);
}
... function_that_accesses_rand(int& k){
...
#pragma omp atomic capture
k = some_internal_counter++;
...
}
You could pre-generate the random objects and store them in a list, then index into that list with the loop counter inside the omp loop. Roughly:
// generate the random objects up front, in iteration order
std::vector<object> rand_obj;
rand_obj.reserve(100);
for (int i = 0; i < 100; ++i)
    rand_obj.push_back(function_that_accesses_rand());

// now the parallel loop can consume them by index
#pragma omp parallel for
for (int i = 0; i < 100; ++i)
    dostuff(i, rand_obj[i], parallelize_within);
I would like to parallelize a big loop using OpenMP to improve its efficiency. Here is the main part of the toy code:
vector<int> config;
config.resize(indices.size());
omp_set_num_threads(2);
#pragma omp parallel for schedule(static, 5000) firstprivate(config)
for (int i = 0; i < 10000; ++i) { // the outer loop that I would like to parallelize
#pragma omp simd
for (int j = 0; j < indices.size(); ++j) { // pick some columns from a big ref_table
config[j] = ref_table[i][indices[j]];
}
int index = GetIndex(config); // do simple computations on the picked values to get the index
#pragma omp atomic
result[index]++;
}
Then I found I cannot get improvements in efficiency if I use 2, 4, or 8 threads. The execution time of the parallel versions is generally greater than that of the sequential version. The outer loop has 10000 iterations and they are independent so I want multiple threads to execute those iterations in parallel.
I guess the reasons for the performance decrease may include: private copies of config? Random accesses of ref_table? The expensive atomic operation? So what are the exact reasons for the performance decrease, and more importantly, how can I get a shorter execution time?
Private copies of config and random accesses of ref_table are not the problem. I think the workload is very small, and there are two potential issues preventing efficient parallelization:
the atomic operation is too expensive;
the overheads are bigger than the workload (which simply means that it is not worth parallelizing with OpenMP).
I do not know which one is more significant in your case, so it is worth trying to get rid of the atomic operation. There are two cases:
a) If the result array is zero-initialized, you can use
#pragma omp parallel for reduction(+:result[0:N]) schedule(static, 5000) firstprivate(config)
where N is the size of the result array, and delete #pragma omp atomic. Note that this requires OpenMP 4.5 or later. It is also worth removing #pragma omp simd for a loop of 2-10 iterations. So, your code should look like this:
#pragma omp parallel for reduction(+:result[0:N]) schedule(static, 5000) firstprivate(config)
for (int i = 0; i < 10000; ++i) { // the outer loop that I would like to parallelize
for (int j = 0; j < indices.size(); ++j) { // pick some columns from a big ref_table
config[j] = ref_table[i][indices[j]];
}
int index = GetIndex(config); // do simple computations on the picked values to get the index
result[index]++;
}
b) If the result array is not zero-initialized, the solution is very similar: reduce into a temporary zero-initialized array and add it to the result array afterwards, roughly as sketched below.
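A sketch of that, with hypothetical names (N is assumed to be the size of result, tmp is the temporary):
std::vector<int> tmp(N, 0);   // zero-initialized partial counts
int *tmp_p = tmp.data();      // raw pointer, needed for the array-section reduction

#pragma omp parallel for reduction(+:tmp_p[0:N]) schedule(static, 5000) firstprivate(config)
for (int i = 0; i < 10000; ++i) {
    for (size_t j = 0; j < indices.size(); ++j)
        config[j] = ref_table[i][indices[j]];
    tmp_p[GetIndex(config)]++;
}

for (int k = 0; k < N; ++k)   // merge the partial counts into result
    result[k] += tmp[k];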
If the speed still does not increase, then your code is not worth parallelizing with OpenMP on your hardware.
I am looking for a better way to cancel my threads.
In my approach, I use a shared variable, and if this variable is set, I simply continue. This finishes my threads quickly, but the threads theoretically keep spawning and ending, which does not seem elegant.
So, is there a better way to solve the issue (break is not supported by my OpenMP)?
I have to work with Visual Studio, so my OpenMP implementation is outdated and there is no way around that. Consequently, I think #pragma omp cancel will not work.
int progress_state = RunExport;
#pragma omp parallel
{
#pragma omp for
for (int k = 0; k < foo.z; k++)
for (int j = 0; j < foo.y; j++)
for (int i = 0; i < foo.x; i++) {
if (progress_state == StopExport) {
continue;
}
// do some fancy shit
// yeah here is a condition for speed due to the critical
#pragma omp critical
if (condition) {
progress_state = StopExport;
}
}
}
You should do it the simple way of "just continue in all remaining iterations if cancellation is requested". That can just be the first check in the outermost loop (and given that you have several nested loops, that will probably not have any measurable overhead).
std::atomic<int> progress_state{RunExport};
// You could just write #pragma omp parallel for instead of these two nested blocks.
#pragma omp parallel
{
#pragma omp for
for (int k = 0; k < foo.z; k++)
{
if (progress_state == StopExport)
continue;
for (int j = 0; j < foo.y; j++)
{
// You can add break statements in these inner loops.
// OMP only parallelizes the outermost loop (at least given the way you wrote this)
// so it won't care here.
for (int i = 0; i < foo.x; i++)
{
// ...
if (condition) {
progress_state = StopExport;
}
}
}
}
}
Generally speaking, OMP will not suddenly spawn new threads or end existing ones, especially not within one parallel region. This means there is little overhead associated with running a few more tiny iterations. This is even more true given that the default scheduling in your case is most likely static, meaning that each thread knows its start and end index right away. Other scheduling modes would have to call into the OMP runtime every iteration (or every few iterations) to request more work, but that won't happen here. The compiler will basically see this code for the threaded work:
// Not real omp functions.
int myStart = __omp_static_for_my_start();
int myEnd = __omp_static_for_my_end();
for (int k = myStart; k < myEnd; ++k)
{
if (progress_state == StopExport)
continue;
// etc.
}
You might try a non-atomic thread-local "should I cancel?" flag that starts as false and can only be changed to true (which the compiler may understand and fold into the loop condition). But I doubt you will see significant overhead either way, at least on x86 where int is atomic anyway.
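One possible reading of that suggestion, as a rough sketch (the per-thread flag is purely local, so once a thread has seen the cancellation it no longer touches the shared variable):
#pragma omp parallel
{
    bool stop = false;                        // thread-local, non-atomic
    #pragma omp for
    for (int k = 0; k < foo.z; k++)
    {
        if (stop || progress_state == StopExport)
        {
            stop = true;                      // remaining iterations only test the local flag
            continue;
        }
        // ... nested j/i loops as before ...
    }
}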
which seems not elegant
OMP 2.0 does not exactly shine with respect to elegance. I mean, iterating over a std::vector requires at least one static_cast to silence signed -> unsigned conversion warnings. So unless you have specific evidence of this pattern causing a performance problem, there is little reason not to use it.
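As a trivial illustration of that remark (not code from the question):
std::vector<float> v(1000, 1.0f);
// OpenMP 2.0 only accepts a signed loop variable, hence the cast on v.size()
#pragma omp parallel for
for (int i = 0; i < static_cast<int>(v.size()); ++i)
    v[i] *= 2.0f;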
I'm implementing Boruvka's algorithm in C++ to find a minimum spanning tree of a graph. This algorithm finds a minimum-weight edge for each supervertex (a supervertex is a connected component; in the first iteration it is simply a vertex) and adds these edges to the MST. Once an edge is added, we update the connected components and repeat the find-min-edge and merge-supervertices process until all the vertices in the graph are in one connected component.
Since find-min-edge for each supervertex can be done in parallel, I want to use OpenMP to do this. Here is the omp for loop I would like to use for parallel find-min.
int index[NUM_VERTICES];
#pragma omp parallel private(nthreads, tid, index, min) shared(minedgeindex, setcount, forest, EV, umark)
{
#pragma omp for
for(int k = 0; k < setcount; k++){ //iterate over supervertices, omp for here
min = 9999;
std::fill_n(index, NUM_VERTICES, -1);
/* Gets minimum edge for each supervertex */
for(int i = 0; i < NUM_VERTICES; i++) {
if(forest[i]->mark == umark[k]){ //find vertices with mark k
for(int j = 0; j < NUM_EDGES; j++) {
//check min edge for each vertex in the supervertex k
if(EV[j].v1==i){
if(Find(forest[EV[j].v1])!= Find(forest[EV[j].v2])){
if(EV[j].w <= min ){
min = EV[j].w;
index[k] = j;
break; //break looping over edges for current vertex i, go to next vertex i+1
}
}
}
}
}
} //end finding min disjoint-connecting edge for the supervertex with mark k
if(index[k] != -1){
minedgeindex.insert(minedgeindex.begin(), index[k]);
}
} //omp for end
}
Since I'm new to OpenMP, I currently cannot make it work as I expected.
Let me briefly explain what I'm doing in this block of code:
setcount is the number of supervertices. EV is a vector containing all edges (Edge is a struct I defined previously, with attributes v1, v2, w corresponding to the two nodes an edge connects and its weight). minedgeindex is a vector: I want each thread to find the min edge for a connected component and add the index of that edge (the index j in EV) to minedgeindex, so I think minedgeindex should be shared. forest holds a struct for each vertex with a set mark umark indicating which supervertex it is in. I use Union-Find to mark all supervertices, but that is not relevant to this block of omp code.
The ultimate goal of this block of code is to give me the correct vector minedgeindex containing the min edge for each supervertex.
To be clearer and ignore the graph background: I just have a large vector of numbers that I separate into a bunch of sets, and I need parallel threads to find the min of each set and give me back the indices of those mins, stored in a vector minedgeindex.
If you need more clarification just ask me. Please help me make this work; I think the main issue is the declaration of private and shared variables, which I'm not sure I'm doing right.
Thank you in advance!
Allocating an array outside of a parallel block and then declaring it private is not going to work.
Edit: After reading through your code again, it does not appear that index should even be private. In that case you could just declare it outside the parallel block (as you did) but not declare it private. But I am not sure you even need index to be an array; I think you can just declare it as a private int.
Additionally, you can't fill minedgeindex the way you did; that causes a race condition. You need to put it in a critical section. Personally, I would use push_back rather than insert at the beginning of the vector, since the latter is inefficient.
Some people prefer to explicitly declare everything shared or private. In pre-C99 C you more or less have to do this, at least for private, because you cannot declare variables inside the loop body. But for C99/C++ this is not necessary. I prefer to only declare shared/private when it's necessary. Everything declared outside of the parallel region is shared (unless it's an index used in a parallel loop) and everything declared inside is private. If you keep that in mind, you rarely have to explicitly declare data shared or private.
//int index[NUM_VERTICES]; //index is shared
//std::fill_n(index, NUM_VERTICES, -1);
#pragma omp parallel
{
#pragma omp for
for(int k = 0; k < setcount; k++){ //iterate over supervertices, omp for here
int min = 9999; // min is private
int index = -1;
//iterate over supervertices
if(index != -1){
#pragma omp critical
minedgeindex.insert(minedgeindex.begin(), index);
//minedgeindex.insert(minedgeindex.begin(), index[k]);
}
}
}
Now that the code is working here are some suggestions to perhaps speed it up.
Using a critical section inside the loop could be very inefficient. I suggest filling a private std::vector per thread and then merging them after the parallel loop (but still inside the parallel block). The loop has an implicit barrier which is not necessary here; it can be removed with nowait.
Independent of the critical section the time to find each minimum can vary per iteration so you may want to consider schedule(dynamic). The following code does all this. Some variation of these suggestions, if not all, may improve your performance.
#pragma omp parallel
{
vector<int> minedgeindex_private;
#pragma omp for schedule(dynamic) nowait
for(int k = 0; k < setcount; k++){ //iterate over supervertices, omp for here
int min = 9999;
int index = -1;
//iterate over supervertices
if(index != -1){
minedgeindex_private.push_back(index);
}
}
#pragma omp critical
minedgeindex.insert(
minedgeindex.end(),
minedgeindex_private.begin(), minedgeindex_private.end());
}
This is not going to work efficiently with OpenMP, because omp for by default simply splits the work statically between all threads, i.e. each thread gets a fair share of the supervertices. However, the work per supervertex may be uneven, so the work-sharing between threads may not be even.
You can try the dynamic or guided schedule with OpenMP, but a better option is to avoid OpenMP altogether and use TBB, whose tbb::parallel_for() avoids this issue.
OpenMP has several disadvantages:
1) it is pre-processor based
2) it has rather limited functionality (this is what I highlighted above)
3) it isn't standardised for C++ (in particular C++11)
TBB is a pure C++ library (no preprocessor hack) with full C++11 support. For more details, see my answer to this question
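For illustration, a rough sketch of what the find-min loop could look like with TBB; the find_min_edge_for_supervertex helper is hypothetical and stands in for the inner loops of the original code:
#include <tbb/parallel_for.h>
#include <mutex>

std::mutex m;  // protects minedgeindex
tbb::parallel_for(0, setcount, [&](int k) {
    int index = find_min_edge_for_supervertex(k);  // hypothetical helper wrapping the j/i loops
    if (index != -1) {
        std::lock_guard<std::mutex> lock(m);
        minedgeindex.push_back(index);
    }
});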
I'm trying to make a for loop multi-threaded in C++ so that the calculation gets divided among multiple threads. Yet the data it produces needs to be joined together in the original order.
So the idea is to first join the small bits on many cores (25,000+ iterations) and then join the combined data once more at the end.
std::vector<int> ids; // mappings
std::map<int, myData> combineData; // data per id
myData outputData; // combined data based on the mappings
myData threadData; // data per thread
#pragma omp parallel for default(none) private(data, threadData) shared(combineData)
for (int i=0; i<30000; i++)
{
threadData += combineData[ids[i]];
}
// Then here I would like to get all the separate thread data and combine them in a similar manner
// I.e.: for each threadData: outputData += threadData
What would be an efficient and clean way to approach this?
How can I schedule the OpenMP loop so that the iterations are split evenly into contiguous chunks?
For example for 2 threads:
[0, 1, 2, 3, 4, .., 14999] & [15000, 15001, 15002, 15003, 15004, .., 29999]
If there's a better way to join the data (which involves joining a lot of std::vectors together and some matrix math) while preserving the order of the additions, pointers to that would help as well.
Added information
The addition is associative, though not commutative.
myData is not an intrinsic type. It's a class containing data as multiple std::vectors (and some data related to the Autodesk Maya API.)
Each cycle is doing a similar matrix multiplication to many points and adds these points to a vector (in theory the calculation time should stay roughly similar per cycle)
Basically it's adding mesh data (consisting of vectors of data) to each other (combining meshes), though the order of the whole thing determines the index values of the vertices. The vertex indices should be consistent and rebuildable.
This depends on a few properties of the addition operator of myData. If the operator is both associative, (A + B) + C = A + (B + C), and commutative, A + B = B + A, then you can use a critical section or, if the data is plain old data (e.g. a float, int, ...), a reduction.
However, if it's not commutative as you say (the order of operations matters) but still associative, you can fill an array with one element of combined data per thread in parallel and then merge them in order serially (see the code below). Using schedule(static) will split the iterations into more or less even chunks, assigned in order of increasing thread number, as you want.
If the operator is neither associative nor commutative then I don't think you can parallelize it (efficiently - e.g. try parallelizing a Fibonacci series efficiently).
std::vector<int> ids; // mappings
std::map<int, myData> combineData; // data per id
myData outputData; // combined data based on the mappings
myData *threadData;
int nthreads;
#pragma omp parallel
{
#pragma omp single
{
nthreads = omp_get_num_threads();
threadData = new myData[nthreads];
}
myData tmp;
#pragma omp for schedule(static)
for (int i=0; i<30000; i++) {
tmp += combineData[ids[i]];
}
threadData[omp_get_thread_num()] = tmp;
}
for(int i=0; i<nthreads; i++) {
outputData += threadData[i];
}
delete[] threadData;
Edit: I'm not 100% sure at this point whether the chunks will be assigned in order of increasing thread number with #pragma omp for schedule(static) (though I would be surprised if they were not). There is an ongoing discussion on this issue. Meanwhile, if you want to be 100% sure, then instead of
#pragma omp for schedule(static)
for (int i=0; i<30000; i++) {
tmp += combineData[ids[i]];
}
you can do
const int nthreads = omp_get_num_threads();
const int ithread = omp_get_thread_num();
const int start = ithread*30000/nthreads;
const int finish = (ithread+1)*30000/nthreads;
for(int i = start; i<finish; i++) {
tmp += combineData[ids[i]];
}
Edit:
I found a more elegant way to fill in parallel but merge in order
#pragma omp parallel
{
myData tmp;
#pragma omp for schedule(static) nowait
for (int i=0; i<30000; i++) {
tmp += combineData[ids[i]];
}
#pragma omp for schedule(static) ordered
for(int i=0; i<omp_get_num_threads(); i++) {
#pragma omp ordered
outputData += tmp;
}
}
This avoids allocating data for each thread (threadData) and merging outside the parallel region.
If you really want to preserve the same order as in the serial case, then there is no other way than doing it serially. In that case you can maybe try to parallelize the operations done in operator+=.
If the operations can be done in any order, but the reduction of the blocks has a specific order, then it may be worth having a look at TBB's parallel_reduce. It will require you to write more code, but if I remember well you can define complex custom reduction operations.
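A rough sketch of how that could look with the functional form of tbb::parallel_reduce, assuming myData is default-constructible and copyable and that operator+= is associative:
#include <tbb/parallel_reduce.h>
#include <tbb/blocked_range.h>

myData outputData = tbb::parallel_reduce(
    tbb::blocked_range<int>(0, 30000),
    myData(),                                          // identity element
    [&](const tbb::blocked_range<int>& r, myData acc)  // accumulate one block in order
    {
        for (int i = r.begin(); i < r.end(); ++i)
            acc += combineData[ids[i]];
        return acc;
    },
    [](myData left, const myData& right)               // join adjacent blocks, left before right
    {
        left += right;
        return left;
    });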
If the order of the operations doesn't matter, then your snippet is almost complete. What it lacks is possibly a critical construct to aggregate private data:
std::vector<int> ids; // mappings
std::map<int, myData> combineData; // data per id
myData outputData; // combined data based on the mappings
#pragma omp parallel
{
myData threadData; // data per thread
#pragma omp for nowait
for (int ii =0; ii < total_iterations; ii++)
{
threadData += combineData[ids[ii]];
}
#pragma omp critical
{
outputData += threadData;
}
#pragma omp barrier
// From here on you are ensured that every thread sees
// the correct value of outputData
}
The schedule of the for loop in this case is not important for the semantic. If the overload of operator+= is a relatively stable operation (in terms of the time needed to compute it), then you can use schedule(static) which divides the iterations evenly among threads. Otherwise you can resort to other scheduling to balance the computational burden (e.g. schedule(guided)).
Finally if myData is a typedef of an intrinsic type, then you can avoid the critical section and use a reduction clause:
#pragma omp for reduction(+:outputData)
for (int ii =0; ii < total_iterations; ii++)
{
outputData += combineData[ids[ii]];
}
In this case you don't need to declare anything explicitly as private.