Using consistent RNG with OpenMP - c++

So I have a function, let's call it dostuff() in which it's beneficial for my application to sometimes parallelize within, or to do it multiple times and parallelize the whole thing. The function does not change though between both use cases.
Note: object is large enough that it cannot viably be stored in a list, and so it must be discarded with each iteration.
So, let's say our code looks like this:
bool parallelize_within = argv[1];
if (parallelize_within) {
// here we assume parallelization is handled within the function dostuff()
for (int i = 0; i < 100; ++i) {
object randomized = function_that_accesses_rand();
dostuff(i, randomized, parallelize_within);
}
} else {
#pragma omp parallel for
for (int i = 0; i < 100; ++i) {
object randomized = function_that_accesses_rand();
dostuff(i, randomized, parallelize_within);
}
}
Obviously, we run into the issue that dostuff() will have threads access the random object at different times in different iterations of the same program. This is not the case when parallelize_within == true, but when we run dostuff() in parallel individually per thread, is there a way to guarantee that the random object is accessed in order based on the iteration? I know that I could do:
#pragma omp parallel for schedule(dynamic)
which will guarantee that eventually, as iterations are assigned to threads at runtime dynamically, the objects will access rand in order with the iteration number, but for the first set of iterations it will be totally random. Any suggestions on how to avoid this?

First of all you have to make sure that both function_that_accesses_rand and do_stuff are threadsafe.
You do not have to duplicate your code if you use the if clause:
#pragma omp parallel for if(!parallelize_within)
To make sure that in function dostuff(i, randomized,...); i reflects the order of creation of randomized object you have to do something like this:
int j = 0;
#pragma omp parallel for if(!parallelize_within)
for (int i = 0; i < 100; ++i) {
int k;
object randomized;
#pragma omp critical
{
k = j++;
randomized = function_that_accesses_rand();
}
dostuff(k, randomized, parallelize_within);
}
You may eliminate the use of the critical section if your function_that_accesses_rand makes it possible, but I cannot be more specific without knowing your function. One solution is that this function returns the value representing the order. Do not forget that this function has to be threadsafe!
#pragma omp parallel for if(!parallelize_within)
for (int i = 0; i < 100; ++i) {
int k;
object randomized = function_that_accesses_rand(k);
dostuff(k, randomized, parallelize_within);
}
... function_that_accesses_rand(int& k){
...
#pragma omp atomic capture
k = some_internal_counter++;
...
}

You could pre generate the random object and store it in a list. Then have a variable in the omp loop, that is incremented per thread.
// generate random objects
i=0
#pragma omp parallel for
for( ... ){
do_stuff(...,rand_obj[i],...)

Related

How to optimize omp parallelization when batching

I am generating class Objects and putting them into std::vector. Before adding, I need to check if they intersect with the already generated objects. As I plan to have millions of them, I need to parallelize this function as it takes a lot of time (The function must check each new object against all previously generated).
Unfortunately, the speed increase is not significant. The profiler also shows very low efficiency (all overhead). Any advise would be appreciated.
bool
Generator::_check_cube (std::vector<Cube> &cubes, const cube &cube)
{
auto ptr_cube = &cube;
auto npol = cubes.size();
auto ptr_cubes = cubes.data();
const auto nthreads = omp_get_max_threads();
bool check = false;
#pragma omp parallel shared (ptr_cube, ptr_cubes, npol, check)
{
#pragma omp single nowait
{
const auto batch_size = npol / nthreads;
for (int32_t i = 0; i < nthreads; i++)
{
const auto bstart = batch_size * i;
const auto bend = ((bstart + batch_size) > npol) ? npol : bstart + batch_size;
#pragma omp task firstprivate(i, bstart, bend) shared (check)
{
struct bd bd1{}, bd2{};
bd1 = allocate_bd();
bd2 = allocate_bd();
for (auto j = bstart; j < bend; j++)
{
bool loc_check;
#pragma omp atomic read
loc_check = check;
if (loc_check) break;
if (ptr_cube->cube_intersecting(ptr_cubes[j], &bd1, &bd2))
{
#pragma omp atomic write
check = true;
break;
}
}
free_bd(&bd1);
free_bd(&bd2);
}
}
}
}
return check;
}
UPDATE: The Cube is actually made of smaller objects Cuboids, each of them have size (L, W, H), position coordinates and rotation. The intersect function:
bool
Cube::cube_intersecting(Cube &other, struct bd *bd1, struct bd *bd2) const
{
const auto nom = number_of_cuboids();
const auto onom = other.number_of_cuboids();
for (int32_t i = 0; i < nom; i++)
{
get_mcoord(i, bd1);
for (int32_t j = 0; j < onom; j++)
{
other.get_mcoord(j, bd2);
if (check_gjk_intersection(bd1, bd2))
{
return true;
}
}
}
return false;
}
//get_mcoord calculates vertices of the cuboids
void
Cube::get_mcoord(int32_t index, struct bd *bd) const
{
for (int32_t i = 0; i < 8; i++)
{
for (int32_t j = 0; j < 3; j++)
{
bd->coord[i][j] = _cuboids[index].get_coord(i)[j];
}
}
}
inline struct bd
allocate_bd()
{
struct bd bd{};
bd.numpoints = 8;
bd.coord = (double **) malloc(8 * sizeof(double *));
for (int32_t i = 0; i < 8; i++)
{
bd.coord[i] = (double *) malloc(3 * sizeof(double));
}
return bd;
}
Typical values: npol > 1 million, threads 32, and each npol Cube consists of 1 - 3 smaller cuboids which are directly checked against other if intersect.
The problem with your search is that OpenMP really likes static loops, where the number of iterations is predetermined. Thus, maybe one task will break early, but all the other will go through their full search.
With recent versions of OpenMP (5, I think) there is a solution for that.
(Not sure about this one: Make your tasks much more fine-grained, for instance one for each intersection test);
Spawn your tasks in a taskloop;
Once you find your intersection (or any condition that causes you to break), do cancel taskloop.
Small problem: cancelling is disabled by default. Set the environment variable OMP_CANCELLATION to true.
Do you have more intersections being true or more being false ? If most are true, you're flooding your hardware with requests to write to a shared resource, and what you are doing is essentially sequential. One way to address this is to avoid using a shared resource so there is no mutex and you let all threads run and at the end you take a decision given the results; this will likely run faster but the benefit depends also on arbitrary choices such as few metrics (eg., nthreads, ncuboids).
It is possible that on another architecture (eg., gpu), your algorithm works well as it is. I may be worth it to benchmark it on a gpu, and see if you will benefit from that migration, given the production sizes (millions of cuboids, 24 dimensions).
You also have a complexity problem, which is, for every new cuboid you compare up to the whole set of existing cuboids. One way to address this is to gather all the cuboids size (range) by dimension and order them, and add the new cuboids ranges ordered. If there is intersection in one dimension, you test the next one etc. You also can runs them in parallel. Before running through the ranges, you test if you are hitting inside the global range, if not it's useless to test locally the intersection.
Here and in general you want to parallelize with minimum of dependency (shared resources, mutex). So you want to try to find a point of view where this will happen. Parallelising over dimensions over ordered ranges (segments) might be better that parallelizing over cuboids.
Algorithms and benefits of parallelism also depend on the values of your objects. This does not mean that complexity predictions are not relevant, but that one may find a smarter approach given those values.
I think your code is memory bound, so its bottleneck is memory read/write not calculations. This can be the main reason of poor speed increase. As already mentioned by #Soleil a different hardware (GPU) can be beneficial here.
You mentioned in the comments that Generator::_check_cub called many times. To reduce OpenMP overheads my suggestion is moving the parallel region out of this function, you can even use it in your main function:
main(){
#pragma omp parallel
#pragma omp single nowait
{
//your code
}
}
In this case you have to use #pragma omp taskwait to wait for the tasks to complete.
for (int32_t i = 0; i < nthreads; i++)
{
#pragma omp task default(none) firstprivate(...) shared (..)
{
//your code comes here
}
}
#pragma omp taskwait
I also suggest using default(none) clause in #pragma omp task directive so you have to explicitly tell the sharing attribute of all your variables.
Do you really need function get_mcoord? It seems a redunant memory copy to me. I think it may be better to write a check_gjk_intersection function which takes _cuboids or its indices as parameters. In this case you get rid of many memory allocations/deallocations of bd1 and bd2, which also can be time consuming as #Victor pointed out.

Aspects that affects the efficiency of OpenMP parallelism

I would like to parallel a big loop using OpenMP to improve its efficiency. Here is the main part of the toy code:
vector<int> config;
config.resize(indices.size());
omp_set_num_threads(2);
#pragma omp parallel for schedule(static, 5000) firstprivate(config)
for (int i = 0; i < 10000; ++i) { // the outer loop that I would like to parallel
#pragma omp simd
for (int j = 0; j < indices.size(); ++j) { // pick some columns from a big ref_table
config[j] = ref_table[i][indices[j]];
}
int index = GetIndex(config); // do simple computations on the picked values to get the index
#pragma omp atomic
result[index]++;
}
Then I found I cannot get improvements in efficiency if I use 2, 4, or 8 threads. The execution time of the parallel versions is generally greater than that of the sequential version. The outer loop has 10000 iterations and they are independent so I want multiple threads to execute those iterations in parallel.
I guess the reasons for performance decrease maybe include: private copies of config? or, random access of ref_table? or, expensive atomic operation? So what are the exact reasons for the performance decrease? More importantly, how can I get a shorter execution time?
Private copies of config or, random access of ref_tables are not problematic, I think the workload is very small, there are 2 potential issues which prevent efficient parallelization:
atomic operation is too expensive.
overheads are bigger than workload (it simply means that it is not worth parallelizing with OpenMP)
I do not know which one is more significant in your case, so it is worth trying to get rid of atomic operation. There are 2 cases:
a) If the results array is zero initialized you have to use:
#pragma omp parallel for reduction(+:result[0:N]) schedule(static, 5000) firstprivate(config) where N is the size of result array and delete #pragma omp atomic. Note that this works on OpenMP 4.5 or later. It is also worth removing #parama omp simd for a loop of 2-10 iterations. So, your code should look like this:
#pragma omp parallel for reduction(+:result[0:N]) schedule(static, 5000) firstprivate(config)
for (int i = 0; i < 10000; ++i) { // the outer loop that I would like to parallel
for (int j = 0; j < indices.size(); ++j) { // pick some columns from a big ref_table
config[j] = ref_table[i][indices[j]];
}
int index = GetIndex(config); // do simple computations on the picked values to get the index
result[index]++;
}
b) If the result array is not zero initialized the solution is very similar, but use a temporary zero initialized array in the loop and after that add it to result array.
If the speed will not increase then your code is not worth parallelizing with OpenMP on your hardware.

How to stop parallel region of OpenMP 2.0

I look for a better way to cancel my threads.
In my approach, I use a shared variable and if this variable is set, I just throw a continue. This finishes my threads fast, but threads keep theoretically spawning and ending, which seems not elegant.
So, is there a better way to solve the issue (break is not supported by my OpenMP)?
I have to work with Visual, so my OpenMP Lib is outdated and there is no way around that. Consequently, I think #omp cancel will not work
int progress_state = RunExport;
#pragma omp parallel
{
#pragma omp for
for (int k = 0; k < foo.z; k++)
for (int j = 0; j < foo.y; j++)
for (int i = 0; i < foo.x; i++) {
if (progress_state == StopExport) {
continue;
}
// do some fancy shit
// yeah here is a condition for speed due to the critical
#pragma omp critical
if (condition) {
progress_state = StopExport;
}
}
}
You should do it the simple way of "just continue in all remaining iterations if cancellation is requested". That can just be the first check in the outermost loop (and given that you have several nested loops, that will probably not have any measurable overhead).
std::atomic<int> progress_state = RunExport;
// You could just write #pragma omp parallel for instead of these two nested blocks.
#pragma omp parallel
{
#pragma omp for
for (int k = 0; k < foo.z; k++)
{
if (progress_state == StopExport)
continue;
for (int j = 0; j < foo.y; j++)
{
// You can add break statements in these inner loops.
// OMP only parallelizes the outermost loop (at least given the way you wrote this)
// so it won't care here.
for (int i = 0; i < foo.x; i++)
{
// ...
if (condition) {
progress_state = StopExport;
}
}
}
}
}
Generally speaking, OMP will not suddenly spawn new threads or end existing ones, especially not within one parallel region. This means there is little overhead associated with running a few more tiny iterations. This is even more true given that the default scheduling in your case is most likely static, meaning that each thread knows its start and end index right away. Other scheduling modes would have to call into the OMP runtime every iteration (or every few iterations) to request more work, but that won't happen here. The compiler will basically see this code for the threaded work:
// Not real omp functions.
int myStart = __omp_static_for_my_start();
int myEnd = __omp_static_for_my_end();
for (int k = myStart; k < myEnd; ++k)
{
if (progress_state == StopExport)
continue;
// etc.
}
You might try a non-atomic thread-local "should I cancel?" flag that starts as false and can only be changed to true (which the compiler may understand and fold into the loop condition). But I doubt you will see significant overhead either way, at least on x86 where int is atomic anyway.
which seems not elegant
OMP 2.0 does not exactly shine with respect to elegance. I mean, iterating over a std::vector requires at least one static_cast to silence signed -> unsigned conversion warnings. So unless you have specific evidence of this pattern causing a performance problem, there is little reason not to use it.

Parallel insertion to map

I wanted to create a map ascribing pairs of integers vectors of integers. My purpose is to do it in parallel way. To ensure that I am not trying to push_back at the same time to the same memory entity (by multiple threads), the second coordinate of map's key pairs is responsible for the current thread number. I encountered, however, problems. It seems that some of the values are not inserting properly. Instead of getting 10 values all together, I get always less (sometimes 9, sometimes 8, 6, etc.)
map<pair<int, int>, vector<int> > test;
#pragma omp parallel num_threads(8)
{
#pragma omp for
for (int i = 0;i < 10;i++)
{
test[make_pair(i % 3, omp_get_thread_num())].push_back(i);
}
}
I have also tried test.at(make_pair(i % 3, omp_get_thread_num())).push_back(i) and it didn't work either. In this case, however, the execution interrupted with an exception.
I thought that #pragma omp for distributes the for loop into disjoint subsequences of (0,...,9) so that there shouldn't be problem with my code... I am a bit confused. Could someone explain this issue to me?
As stated, the standard library containers are not thread safe.
The appropriate solution for this case is to initialize n-maps (one for each thread) and than join them at the end.
As stated prior, using a mutex (and making access to the map safe) would be a valid solution, however it would also result in worse performance. As every time the map would be accessed each thread would have to wait on the mutex unlocking the data.
It should be noted that a size of 10 is not sufficient to make multithreading worth it, using multiple-threads here would most likely degrade performance.
map<pair<int, int>, vector<int> > test[8];
#pragma omp parallel num_threads(8)
{
#pragma omp for
for (int i = 0;i < large_number; i++)
{
int thread_id = omp_get_thread_id();
test[thread_id][make_pair(i % 3, omp_get_thread_num())].push_back(i);
}
}
#pragma omp barrier
map<pair<int,int>, vector<int>> combined;
for (int i = 0; i < 8; ++i)
combined.insert(test[i].begin(), test[i].end());
That's because a map is not thread safe (nor is a vector, but this part is not threaded). You have to add mutex, use lock-free containers or prepare your map first.
In this case, start on a single thread by creating all your entries.
#pragma omp single
for (int i = 0;i < 3;++i)
{
for (int j = 0;i < 8;++j)
{
test.insert(make_pair(make_pair(i, j), vector<int>()));
}
}
Then do your parallel for (add a barrier).

openmp with nested loops and function call

I have a few nested loops and I put the first one in parallel mode. apar and mpar are structs whose values are modified in the loop and then function breakLogic is called which generates a struct which i store in a pre created vector of those structs.
one, two ... have been declared earlier in the function.
I have tried to include ordered and critical to ensure accuracy but i am still getting incorrect results.
#pragma omp parallel for ordered private(appFlip, atur, apar, mpar, i, j, k, l, m, n) shared(rawFlip)
for(i=0; i<oneL; i++)
{
initialize mpar
#pragma omp critical
apar.one = one[i];
for(j=0; j<twoL; j++)
{
apar.two = two[j];
for(k=0; k<threeL; k++)
{
apar.three = floor(three[k]*apar.two);
appFlip = applyParamSin(rawFlip, apar);
for(l=0; l< fourL; l++)
{
mpar.four = four[l];
for(m=0; m<fiveL; m++)
{
mpar.five = five[m];
for(n=0; n<sixL; n++)
{
mpar.six = add[n];
atur = breakLogic(appFlip, mpar, dt);
#pragma omp ordered
{
sinResVec[itr] = atur;
itr++;
}
}
}
}
r0(appFlip);
}
}
}
Or is this code not conducive for parallelism? Are there any tools for g++ which can profile code for parallel processing and indicate potential issues?
This modified code works but gives no performance improvement.
You original code can be paralleled by a few modifications.
set apar and mpar as firstprivate. apar and mpar should be thread local variables and be initialized when entering the parallel for region;
remove all critical and ordered clauses, including the one in the parallel for directive. they are not working as your expected;
calculate iter with i,j,k,l,m,n to remove the dependency.
.
iter=(((i*twoL+j)*threeL+k)*fourL+m)*fiveL+n;
sinResVec[itr] = atur;
update
See here for more details of OpenMP, especially the differences between private and firstprivate.
http://msdn.microsoft.com/en-us/library/tt15eb9t.aspx