Parallel Processing Collision Pairs - c++

I am writing some code for parallel processing of collisions. I expected a speedup that scales with the number of threads, but I'm not getting any, because I have a critical section inside parallel_reduce() and I believe it serializes access to the objects too much. This is how the code looks:
do {
totalVel = 0.;
#pragma omp parallel for
for (unsigned long i = 0; i < bodyContact.size(); i++) {
totalVel += bodyContact.at(i).bodyA()->parallel_reduce();
totalVel += bodyContact.at(i).bodyB()->parallel_reduce();
}
} while (totalVel >= 0.00001);
Is there any way to gain speed by making this parallel, or is the serialized access too costly?
Observations:
bodyA() and bodyB() are objects that repeat themselves a lot inside the bodyContact container.
For now parallel_reduce() only does one multiplication (the critical section), but will get more complex.
double parallel_reduce(){
#pragma omp critical
this->vel_ *= 0.99;
return vel_.length();
}
Actual timings:
serial, 25.635
parallel, 123.559

OpenMP constructs always have a cost, so avoid opening a parallel region inside a loop: depending on the implementation, it may launch new threads on every iteration instead of waking the previously launched ones.
In fact, if bodyContact.size() is small, the do {} while loop runs for many steps, and parallel_reduce() is very quick, it is very hard to get scalability from just a few OpenMP pragmas.
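#pragma omp parallel shared(totalVel) shared(bodyContact)
{
do {
// Wait until every thread has evaluated the loop condition before resetting.
#pragma omp barrier
// Only one thread resets the shared sum; the implicit barrier after single keeps the others waiting.
#pragma omp single
totalVel = 0.;
#pragma omp for reduction(+:totalVel)
for (unsigned long i = 0; i < bodyContact.size(); i++) {
totalVel += bodyContact.at(i).bodyA()->parallel_reduce();
totalVel += bodyContact.at(i).bodyB()->parallel_reduce();
}
// The implicit barrier after the loop makes totalVel complete for all threads.
} while (totalVel >= 0.00001);
}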

The code in the question is likely not only slower but also wrong: all the threads try to update the same totalVel. That means tonnes of race conditions, plus contention, cache invalidation, etc.
Assuming the parallel_reduce() stuff is ok, you'd like something more like
do {
totalVel = 0.;
#pragma omp parallel for default(none) shared(bodyContact) reduction(+:totalVel)
for (unsigned long i = 0; i < bodyContact.size(); i++) {
totalVel += bodyContact.at(i).bodyA()->parallel_reduce();
totalVel += bodyContact.at(i).bodyB()->parallel_reduce();
}
} while (totalVel >= 0.00001);
which will do the reduction on totalVel correctly.
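For reference, here is a self-contained sketch of the same reduction pattern that can be compiled and timed on its own. Body and damp() are hypothetical stand-ins for the question's body class and parallel_reduce(), and the loop visits each body once rather than once per contact (an assumption that removes the need for the critical section; it is not what the question's code does).

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Hypothetical stand-in for the question's body type (1-D velocity for brevity).
struct Body {
    double vel;
    // Damps the velocity and returns the new speed; touches only this body.
    double damp() {
        vel *= 0.99;
        return std::fabs(vel);
    }
};

// One damping sweep over all bodies; returns the summed speed.
double damp_all(std::vector<Body>& bodies) {
    double totalVel = 0.0;
    const long n = static_cast<long>(bodies.size());
    // Each thread sums into a private copy of totalVel and OpenMP combines
    // the copies when the loop ends. Without -fopenmp the pragma is ignored
    // and the loop simply runs serially.
    #pragma omp parallel for reduction(+:totalVel)
    for (long i = 0; i < n; ++i) {
        totalVel += bodies[i].damp();
    }
    return totalVel;
}
```

Calling damp_all() inside the do/while loop would replace the body of the original loop; whether iterating over bodies instead of contacts is acceptable depends on the solver.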

Related

How to optimize omp parallelization when batching

I am generating class objects and putting them into a std::vector. Before adding one, I need to check whether it intersects any of the already generated objects. As I plan to have millions of them, I need to parallelize this function because it takes a lot of time (the function must check each new object against all previously generated ones).
Unfortunately, the speed increase is not significant. The profiler also shows very low efficiency (all overhead). Any advice would be appreciated.
bool
Generator::_check_cube (std::vector<Cube> &cubes, const Cube &cube)
{
auto ptr_cube = &cube;
auto npol = cubes.size();
auto ptr_cubes = cubes.data();
const auto nthreads = omp_get_max_threads();
bool check = false;
#pragma omp parallel shared (ptr_cube, ptr_cubes, npol, check)
{
#pragma omp single nowait
{
const auto batch_size = npol / nthreads;
for (int32_t i = 0; i < nthreads; i++)
{
const auto bstart = batch_size * i;
const auto bend = (i == nthreads - 1) ? npol : bstart + batch_size; // last batch takes the remainder
#pragma omp task firstprivate(i, bstart, bend) shared (check)
{
struct bd bd1{}, bd2{};
bd1 = allocate_bd();
bd2 = allocate_bd();
for (auto j = bstart; j < bend; j++)
{
bool loc_check;
#pragma omp atomic read
loc_check = check;
if (loc_check) break;
if (ptr_cube->cube_intersecting(ptr_cubes[j], &bd1, &bd2))
{
#pragma omp atomic write
check = true;
break;
}
}
free_bd(&bd1);
free_bd(&bd2);
}
}
}
}
return check;
}
UPDATE: The Cube is actually made of smaller Cuboid objects, each of which has a size (L, W, H), position coordinates and a rotation. The intersect function:
bool
Cube::cube_intersecting(Cube &other, struct bd *bd1, struct bd *bd2) const
{
const auto nom = number_of_cuboids();
const auto onom = other.number_of_cuboids();
for (int32_t i = 0; i < nom; i++)
{
get_mcoord(i, bd1);
for (int32_t j = 0; j < onom; j++)
{
other.get_mcoord(j, bd2);
if (check_gjk_intersection(bd1, bd2))
{
return true;
}
}
}
return false;
}
//get_mcoord calculates vertices of the cuboids
void
Cube::get_mcoord(int32_t index, struct bd *bd) const
{
for (int32_t i = 0; i < 8; i++)
{
for (int32_t j = 0; j < 3; j++)
{
bd->coord[i][j] = _cuboids[index].get_coord(i)[j];
}
}
}
inline struct bd
allocate_bd()
{
struct bd bd{};
bd.numpoints = 8;
bd.coord = (double **) malloc(8 * sizeof(double *));
for (int32_t i = 0; i < 8; i++)
{
bd.coord[i] = (double *) malloc(3 * sizeof(double));
}
return bd;
}
Typical values: npol > 1 million, 32 threads, and each Cube consists of 1 - 3 smaller cuboids, which are checked directly against the other Cube's cuboids for intersection.
The problem with your search is that OpenMP really likes static loops, where the number of iterations is predetermined. Thus, one task may break early, but all the others will still go through their full search.
With recent versions of OpenMP (5, I think) there is a solution for that.
(Not sure about this one: Make your tasks much more fine-grained, for instance one for each intersection test);
Spawn your tasks in a taskloop;
Once you find your intersection (or any condition that causes you to break), do cancel taskgroup (a taskloop runs inside an implicit taskgroup, so that is the construct to cancel).
Small problem: cancelling is disabled by default. Set the environment variable OMP_CANCELLATION to true.
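A minimal sketch of the taskloop-plus-cancellation idea, assuming a compiler with OpenMP 5.0 support. The function names, the int payload and the grainsize are hypothetical stand-ins for cube_intersecting and Cube. The result is the same when cancellation is disabled or OpenMP is switched off; the search is just not cut short.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for the expensive intersection test.
static bool intersects(int existing, int candidate) {
    return existing == candidate;
}

// Returns true if any element of items "intersects" the candidate.
bool any_match(const std::vector<int>& items, int candidate) {
    bool found = false;
    const long n = static_cast<long>(items.size());
    #pragma omp parallel shared(found)
    #pragma omp single
    {
        // The taskloop splits the range into tasks of about 1024 iterations.
        #pragma omp taskloop grainsize(1024) shared(found)
        for (long j = 0; j < n; ++j) {
            // Tasks that are already running stop here once a cancel was requested.
            #pragma omp cancellation point taskgroup
            if (intersects(items[j], candidate)) {
                #pragma omp atomic write
                found = true;
                // Ends this task and discards tasks that have not started yet
                // (only has an effect when OMP_CANCELLATION=true).
                #pragma omp cancel taskgroup
            }
        }
    }
    return found;
}
```

The grainsize trades scheduling overhead against how quickly a hit stops the remaining work, so it is worth measuring a few values.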
Are more of your intersection tests true or false? If most are true, you're flooding your hardware with requests to write to a shared resource, and what you are doing is essentially sequential. One way to address this is to avoid the shared resource: with no shared flag there is nothing to lock, every thread runs independently, and you take the decision at the end from the combined results. This will likely run faster, but the benefit also depends on parameters such as nthreads and ncuboids.
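As a rough illustration of "no shared resource, decide at the end", the sketch below gives each thread a private flag and lets an OpenMP reduction combine them after the loop. Interval and overlaps() are hypothetical 1-D stand-ins for Cube and cube_intersecting.

```cpp
#include <cassert>
#include <vector>

// Hypothetical 1-D interval standing in for a Cube.
struct Interval {
    double lo;
    double hi;
};

static bool overlaps(const Interval& a, const Interval& b) {
    return a.lo <= b.hi && b.lo <= a.hi;
}

// True if the candidate overlaps any existing interval.
bool intersects_any(const std::vector<Interval>& existing, const Interval& candidate) {
    bool hit = false;
    const long n = static_cast<long>(existing.size());
    // Each thread works on a private copy of hit; the copies are OR-ed
    // together once, after the loop, so there is no atomic or lock inside.
    #pragma omp parallel for reduction(||:hit)
    for (long j = 0; j < n; ++j) {
        // Short-circuit: once this thread has a hit, it skips further tests.
        hit = hit || overlaps(existing[j], candidate);
    }
    return hit;
}
```

The trade-off is that a hit in one thread does not stop the others, so this fits best when hits are rare or the per-thread ranges are short.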
It is possible that on another architecture (e.g., a GPU) your algorithm works well as it is. It may be worth benchmarking it on a GPU to see whether you would benefit from that migration, given the production sizes (millions of cuboids, 24 dimensions).
You also have a complexity problem: for every new cuboid, you compare against up to the whole set of existing cuboids. One way to address this is to collect the extents (ranges) of all cuboids in each dimension, keep them ordered, and insert each new cuboid's ranges in order. If there is an intersection in one dimension, you test the next one, and so on. You can also run these tests in parallel. Before going through the ranges, test whether the new cuboid falls inside the global range; if it does not, there is no point in testing for local intersections.
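One possible reading of the ordered-ranges idea, sketched for axis-aligned bounding boxes: keep the boxes sorted by their lower x bound and use binary search to narrow the candidates before testing the remaining dimensions. Box and SortedBoxes are hypothetical names, and the exact GJK test would still have to run on the surviving candidates.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical axis-aligned bounding box around a Cube.
struct Box {
    double lo[3];
    double hi[3];
};

static bool boxes_overlap(const Box& a, const Box& b) {
    for (int d = 0; d < 3; ++d) {
        if (a.hi[d] < b.lo[d] || b.hi[d] < a.lo[d]) return false;
    }
    return true;
}

class SortedBoxes {
public:
    bool intersects_any(const Box& c) const {
        // Boxes that start after c ends on x cannot overlap c.
        auto last = std::upper_bound(boxes_.begin(), boxes_.end(), c.hi[0],
            [](double x, const Box& b) { return x < b.lo[0]; });
        // Boxes that start more than max_width_ before c end before c starts.
        auto first = std::lower_bound(boxes_.begin(), last, c.lo[0] - max_width_,
            [](const Box& b, double x) { return b.lo[0] < x; });
        for (auto it = first; it != last; ++it) {
            if (boxes_overlap(*it, c)) return true;
        }
        return false;
    }

    void insert(const Box& c) {
        auto pos = std::upper_bound(boxes_.begin(), boxes_.end(), c.lo[0],
            [](double x, const Box& b) { return x < b.lo[0]; });
        boxes_.insert(pos, c);  // O(n) shift; a tree or grid would scale better
        max_width_ = std::max(max_width_, c.hi[0] - c.lo[0]);
    }

private:
    std::vector<Box> boxes_;   // kept sorted by lo[0]
    double max_width_ = 0.0;   // widest x extent inserted so far
};
```

With millions of objects a spatial grid or bounding volume hierarchy would likely replace the sorted vector, but the pruning principle is the same.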
Here, and in general, you want to parallelize with a minimum of dependencies (shared resources, mutexes), so try to find a point of view where that is the case. Parallelizing over dimensions and ordered ranges (segments) might be better than parallelizing over cuboids.
Algorithms and benefits of parallelism also depend on the values of your objects. This does not mean that complexity predictions are not relevant, but that one may find a smarter approach given those values.
I think your code is memory bound, so its bottleneck is memory reads/writes, not calculations. This may be the main reason for the poor speedup. As already mentioned by @Soleil, different hardware (a GPU) can be beneficial here.
You mentioned in the comments that Generator::_check_cube is called many times. To reduce OpenMP overhead, I suggest moving the parallel region out of this function; you can even put it in your main function:
int main(){
#pragma omp parallel
#pragma omp single nowait
{
//your code
}
}
In this case you have to use #pragma omp taskwait to wait for the tasks to complete.
for (int32_t i = 0; i < nthreads; i++)
{
#pragma omp task default(none) firstprivate(...) shared (..)
{
//your code comes here
}
}
#pragma omp taskwait
I also suggest using default(none) clause in #pragma omp task directive so you have to explicitly tell the sharing attribute of all your variables.
Do you really need the function get_mcoord? It seems like a redundant memory copy to me. I think it may be better to write a check_gjk_intersection function that takes _cuboids or its indices as parameters. That way you get rid of many memory allocations/deallocations of bd1 and bd2, which can also be time consuming, as @Victor pointed out.
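If check_gjk_intersection has to keep taking a struct bd with a double** member, one way to drop the malloc/free pairs is to back the row pointers with storage that lives inside a wrapper object. This is only a sketch: the struct bd below mirrors just the two fields used in the question (the real one may have more), and BdBuffer is a hypothetical name.

```cpp
#include <cassert>

// Mirrors the two fields the question uses; the real struct may differ (assumption).
struct bd {
    int numpoints;
    double **coord;
};

// Owns the 8x3 vertex array inline, so creating one needs no heap allocation.
struct BdBuffer {
    double data[8][3];
    double *rows[8];
    bd view;  // pass &view wherever a struct bd* is expected

    BdBuffer() : data{} {
        for (int i = 0; i < 8; ++i) rows[i] = data[i];
        view.numpoints = 8;
        view.coord = rows;
    }

    // The row pointers refer to this object, so a copy would leave them dangling.
    BdBuffer(const BdBuffer&) = delete;
    BdBuffer& operator=(const BdBuffer&) = delete;
};
```

Declaring two BdBuffer objects at the top of each task would then stand in for the allocate_bd()/free_bd() calls.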

How to optimize OpenMP code, for example a histogram

I am dealing with huge point cloud data and am trying to use OpenMP, but I have found that optimizing the code is very hard for a beginner.
For example, to get the histogram of the point cloud (each point carries other information beyond x, y, z), I wrote the code below:
#pragma omp parallel num_threads(N_THREAD) shared(hist,partHist)
{
int tId = omp_get_thread_num();
int index = tId * partCount;
#pragma omp for nowait
for(int i =0;i<partCount;++i)
{
if (index + i < size)
#pragma omp atomic
partHist[tId][(int)floor((array[index + i] - minValue) / stride)]++;
}
#pragma omp critical
{
for (int i = 0; i < binCount; ++i)
hist[i] += partHist[tId][i];
}
}
The code is run on Linux with an i7-9700k, compiled with g++ and using OpenMP 4.0.
I have two questions:
The data set has at least 10^8 points and I use 128 threads, but it is slower than the serial version. How can I optimize the code?
Are there rules I can follow to optimize the code when other problems like this come up?
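For comparison, a common way to structure such a histogram is to let the worksharing loop split the whole index range, give each thread a private histogram that needs no atomics, and merge once per thread at the end. The sketch below assumes that layout; the function name and the choice to skip out-of-range values are assumptions, not part of the question. The thread count is left to the runtime, since the i7-9700k has 8 cores and 128 threads would mostly add scheduling overhead.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Counts values into binCount bins of width stride, starting at minValue.
std::vector<long> histogram(const std::vector<double>& values,
                            double minValue, double stride, int binCount) {
    assert(stride > 0.0 && binCount > 0);
    std::vector<long> hist(binCount, 0);
    const long n = static_cast<long>(values.size());
    #pragma omp parallel
    {
        // Private to this thread, so increments need no atomic.
        std::vector<long> local(binCount, 0);
        // The worksharing loop splits 0..n among the threads.
        #pragma omp for nowait
        for (long i = 0; i < n; ++i) {
            const int bin = static_cast<int>(std::floor((values[i] - minValue) / stride));
            if (bin >= 0 && bin < binCount) ++local[bin];  // out-of-range values are skipped
        }
        // One short merge per thread.
        #pragma omp critical
        for (int b = 0; b < binCount; ++b) hist[b] += local[b];
    }
    return hist;
}
```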

How to stop parallel region of OpenMP 2.0

I am looking for a better way to cancel my threads.
In my approach, I use a shared variable, and if this variable is set, I just continue. This finishes my threads quickly, but in theory the threads keep spawning and ending, which does not seem elegant.
So, is there a better way to solve the issue (break is not supported by my OpenMP)?
I have to work with Visual Studio, so my OpenMP library is outdated and there is no way around that. Consequently, I think #pragma omp cancel will not work.
int progress_state = RunExport;
#pragma omp parallel
{
#pragma omp for
for (int k = 0; k < foo.z; k++)
for (int j = 0; j < foo.y; j++)
for (int i = 0; i < foo.x; i++) {
if (progress_state == StopExport) {
continue;
}
// do some fancy shit
// yeah here is a condition for speed due to the critical
#pragma omp critical
if (condition) {
progress_state = StopExport;
}
}
}
You should do it the simple way of "just continue in all remaining iterations if cancellation is requested". That can just be the first check in the outermost loop (and given that you have several nested loops, that will probably not have any measurable overhead).
std::atomic<int> progress_state(RunExport); // direct initialization also compiles before C++17
// You could just write #pragma omp parallel for instead of these two nested blocks.
#pragma omp parallel
{
#pragma omp for
for (int k = 0; k < foo.z; k++)
{
if (progress_state == StopExport)
continue;
for (int j = 0; j < foo.y; j++)
{
// You can add break statements in these inner loops.
// OMP only parallelizes the outermost loop (at least given the way you wrote this)
// so it won't care here.
for (int i = 0; i < foo.x; i++)
{
// ...
if (condition) {
progress_state = StopExport;
}
}
}
}
}
Generally speaking, OMP will not suddenly spawn new threads or end existing ones, especially not within one parallel region. This means there is little overhead associated with running a few more tiny iterations. This is even more true given that the default scheduling in your case is most likely static, meaning that each thread knows its start and end index right away. Other scheduling modes would have to call into the OMP runtime every iteration (or every few iterations) to request more work, but that won't happen here. The compiler will basically see this code for the threaded work:
// Not real omp functions.
int myStart = __omp_static_for_my_start();
int myEnd = __omp_static_for_my_end();
for (int k = myStart; k < myEnd; ++k)
{
if (progress_state == StopExport)
continue;
// etc.
}
You might try a non-atomic, thread-local "should I cancel?" flag that starts as false and can only be changed to true (which the compiler may understand and fold into the loop condition). But I doubt you will see significant overhead either way, at least on x86, where aligned int loads and stores are atomic anyway.
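A small self-contained version of the "skip the remaining iterations" pattern, using only OpenMP 2.0 constructs plus std::atomic. run_until and its parameters are hypothetical; the function counts how many outer iterations did real work so the effect of the stop flag can be observed.

```cpp
#include <atomic>
#include <cassert>

// Runs nz outer iterations and requests a stop when iteration stop_at is reached.
// Returns how many iterations did real work before they began to be skipped.
int run_until(int nz, int stop_at) {
    std::atomic<bool> stop(false);
    std::atomic<int> worked(0);
    #pragma omp parallel for
    for (int k = 0; k < nz; ++k) {
        // Cheap check: after a stop request the remaining iterations are no-ops.
        if (stop.load(std::memory_order_relaxed)) continue;
        ++worked;  // placeholder for the real work of this iteration
        if (k == stop_at) stop.store(true, std::memory_order_relaxed);
    }
    return worked.load();
}
```

In a parallel run the count depends on timing, because other threads may finish some iterations before they see the flag; serially it is exactly the number of iterations up to and including stop_at.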
"which seems not elegant"
OMP 2.0 does not exactly shine with respect to elegance. I mean, iterating over a std::vector requires at least one static_cast to silence signed -> unsigned conversion warnings. So unless you have specific evidence of this pattern causing a performance problem, there is little reason not to use it.

OpenMP loop runs code slower than serial loop

I'm running this neat little gravity simulation. In serial execution it takes a little more than 4 minutes; when I parallelize one loop inside it, the run time increases to about 7 minutes, and if I try parallelizing more loops it increases to more than 20 minutes. I'm posting a slightly shortened version without some initializations, but I think they don't matter. This is the 7-minute version, with some comments where I wanted to add parallelization to loops. Thank you for helping me with my messy code.
#include <stdio.h>
#include <math.h>
#include <stdlib.h>
#include <string.h>
#include <omp.h>
#define numb 1000
int main(){
double pos[numb][3],a[numb][3],a_local[3],v[numb][3];
memset(v, 0.0, numb*3*sizeof(double));
double richtung[3];
double t,deltat=0.0,r12 = 0.0,endt=10.;
unsigned seed;
int tcount=0;
#pragma omp parallel private(seed) shared(pos)
{
seed = 25235 + 16*omp_get_thread_num();
#pragma omp for
for(int i=0;i<numb;i++){
for(int j=0;j<3;j++){
pos[i][j] = (double) (rand_r(&seed) % 100000 - 50000);
}
}
}
for(t=0.;t<endt;t+=deltat){
printf("\r%le", t);
tcount++;
#pragma omp parallel for shared(pos,v)
for(int id=0; id<numb; id++){
for(int l=0;l<3;l++){
pos[id][l] = pos[id][l]+(0.5*deltat*v[id][l]);
v[id][l] = v[id][l]+a[id][l]*(deltat);
}
}
memset(a, 0.0, numb*3*sizeof(double));
memset(a_local, 0.0, 3*sizeof(double));
#pragma omp parallel for private(r12,richtung) shared(a,pos)
for(int id=0; id <numb; ++id){
for(int id2=0; id2<id; id2++){
for(int k=0;k<3;k++){
r12 += sqrt((pos[id][k]-pos[id2][k])*(pos[id][k]-pos[id2][k]));
}
for(int k=0; k<3;k++){
richtung[k] = (-1.e10)*(pos[id][k]-pos[id2][k])/r12;
a[id][k] += richtung[k]/(((r12)*(r12)));
a_local[k] += (-1.0)*richtung[k]/(((r12)*(r12)));
#pragma omp critical
{
a[id2][k] += a_local[k];
}
}
r12=0.0;
}
}
#pragma omp parallel for shared(pos)
for(int id =0; id<numb; id++){
for(int k=0;k<3;k++){
pos[id][k] = pos[id][k]+(0.5*deltat*v[id][k]);
}
}
deltat= 0.01;
}
return 0;
}
I'm using
g++ -fopenmp -o test_grav test_grav.c
to compile the code and I'm measuring time in the shell just by
time ./test_grav.
When I used
omp_get_num_threads()
to get the number of threads, it displayed 4. top also shows more than 300% (sometimes ~380%) CPU usage. An interesting little fact: if I start the parallel region before the time loop (the outermost for loop) and without any actual #pragma omp for, the run time is equivalent to making one parallel region for each major loop (the three second-outermost loops). So I think it is an optimization issue, but I don't know how to solve it. Can anyone help me?
Edit: I made the example verifiable and lowered numbers like numb to make it easier to test, but the problem still occurs. It occurs even when I remove the critical region as suggested by TheQuantumPhysicist, just not as severely.
I believe that critical section is the cause of the problem. Consider taking all critical sections outside the parallelized loop and running them after the parallelization is over.
Try this:
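#pragma omp parallel shared(a,pos)
{
// Per-thread accumulator for the reaction on id2, so no lock is needed per pair.
double a_local[numb][3] = {{0.0}};
#pragma omp for
for(int id=0; id <numb; ++id){
for(int id2=0; id2<id; id2++){
double richtung[3], r12 = 0.0;
for(int k=0;k<3;k++){
r12 += sqrt((pos[id][k]-pos[id2][k])*(pos[id][k]-pos[id2][k]));
}
for(int k=0; k<3;k++){
// epsilon: small softening term, assumed to be defined elsewhere.
richtung[k] = (-1.e10)*(pos[id][k]-pos[id2][k])/r12;
a[id][k] += richtung[k]/(((r12)*(r12))+epsilon);
a_local[id2][k] += richtung[k]/(((r12)*(r12))+epsilon)*(-1.0);
}
}
}
// The implicit barrier after the loop guarantees that all a[id] updates are done.
// Merge once per thread instead of locking once per pair.
#pragma omp critical
for(int id2=0; id2<numb; id2++){
for(int k=0;k<3;k++){
a[id2][k] += a_local[id2][k];
}
}
}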
Critical sections lead to locking and blocking. If you can keep these sections serial and outside the parallel loop, you'll win a lot in performance.
Note that this is a syntactic solution, and I don't know whether it works for your case. But to be clear: if every point in your series depends on the next one, then parallelizing is not a solution for you, at least not simple parallelization using OpenMP.
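Another way to get rid of the critical section altogether is to give up the pairwise symmetry: each thread computes the full acceleration of its own bodies by looping over all other bodies, so it only ever writes to a[i]. This doubles the arithmetic but needs no synchronization. The sketch below is an assumption-laden illustration, not the question's code: the function name and the strength parameter are hypothetical, and it uses the Euclidean distance, which differs from the r12 sum in the question.

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <vector>

using Vec3 = std::array<double, 3>;

// Fills a[i] with the attraction of all other bodies on body i.
// strength plays the role of the 1.e10 factor in the question (assumption).
void accelerations(const std::vector<Vec3>& pos, std::vector<Vec3>& a, double strength) {
    const long n = static_cast<long>(pos.size());
    a.assign(pos.size(), Vec3{{0.0, 0.0, 0.0}});
    // Each iteration writes only a[i], so threads never touch the same element.
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < n; ++i) {
        Vec3 acc{{0.0, 0.0, 0.0}};
        for (long j = 0; j < n; ++j) {
            if (j == i) continue;
            double d[3];
            double r2 = 0.0;
            for (int k = 0; k < 3; ++k) {
                d[k] = pos[j][k] - pos[i][k];  // points from body i towards body j
                r2 += d[k] * d[k];
            }
            const double inv_r3 = 1.0 / (r2 * std::sqrt(r2));
            for (int k = 0; k < 3; ++k) acc[k] += strength * d[k] * inv_r3;
        }
        a[i] = acc;
    }
}
```

For two bodies at x = 0 and x = 2 with strength 8, this gives accelerations of +2 and -2 along x, i.e. equal and opposite as expected.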

Why my C code is slower using OpenMP

I'm trying to do multi-threaded programming on the CPU using OpenMP. I have lots of for loops that are good candidates for parallelization. I have attached part of my code here. When I use the first #pragma omp parallel for reduction, my code is faster, but when I use the same directive to parallelize other loops, it gets slower. Does anyone have any idea why?
.
.
.
omp_set_dynamic(0);
omp_set_num_threads(4);
float *h1=new float[nvi];
float *h2=new float[npi];
while(tol>0.001)
{
std::fill_n(h2, npi, 0);
int k,i;
float h222=0;
#pragma omp parallel for private(i,k) reduction (+: h222)
for (i=0;i<npi;++i)
{
int p1=ppi[i];
int m = frombus[p1];
for (k=0;k<N;++k)
{
h222 += v[m-1]*v[k]*(G[m-1][k]*cos(del[m-1]-del[k])
+ B[m-1][k]*sin(del[m-1]-del[k]));
}
h2[i]=h222;
}
//*********** h3*****************
std::fill_n(h3, nqi, 0);
float h333=0;
#pragma omp parallel for private(i,k) reduction (+: h333)
for (int i=0;i<nqi;++i)
{
int q1=qi[i];
int m = frombus[q1];
for (int k=0;k<N;++k)
{
h333 += v[m-1]*v[k]*(G[m-1][k]*sin(del[m-1]-del[k])
- B[m-1][k]*cos(del[m-1]-del[k]));
}
h3[i]=h333;
}
.
.
.
}
I don't think your OpenMP code gives the same result as without OpenMP. Let's just concentrate on the h2[i] part of the code (since the h3[i] part has the same logic). There is a dependency of h2[i] on the index i (i.e. h2[1] = h2[1] + h2[0]). With the reduction on the outer loop, each thread accumulates into its own private copy of h222, so h2[i] receives a thread-local partial sum and the result is not correct. If you want to do the reduction with OpenMP, you need to do it on the inner loop like this:
float h222 = 0;
for (int i=0; i<npi; ++i) {
int p1=ppi[i];
int m = frombus[p1];
#pragma omp parallel for reduction(+:h222)
for (int k=0;k<N; ++k) {
h222 += v[m-1]*v[k]*(G[m-1][k]*cos(del[m-1]-del[k])
+ B[m-1][k]*sin(del[m-1]-del[k]));
}
h2[i] = h222;
}
However, I don't know if that will be very efficient. An alternative method is to fill h2[i] in parallel on the outer loop without a reduction and then take care of the dependency in serial. Even though the serial loop is not parallelized, it should still have only a small effect on the computation time, since it does not have the inner loop over k. This should give the same result with and without OpenMP and still be fast.
#pragma omp parallel for
for (int i=0; i<npi; ++i) {
int p1=ppi[i];
int m = frombus[p1];
float h222 = 0;
for (int k=0;k<N; ++k) {
h222 += v[m-1]*v[k]*(G[m-1][k]*cos(del[m-1]-del[k])
+ B[m-1][k]*sin(del[m-1]-del[k]));
}
h2[i] = h222;
}
//take care of the dependency serially
for(int i=1; i<npi; i++) {
h2[i] += h2[i-1];
}
Keep in mind that creating and destroying threads is time consuming; time the execution and see for yourself. You only use a parallel reduction twice, which may be faster than a serial reduction, but the initial cost of creating the threads may still outweigh the gain. Try parallelizing the outermost loop (if possible) to see if you can obtain a speedup.
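To time a section, a wall-clock timer is the safer choice, since processor-time functions such as std::clock can add up the time of all threads. A minimal helper, with seconds_taken as a hypothetical name, might look like this:

```cpp
#include <cassert>
#include <chrono>

// Runs work() once and returns the elapsed wall-clock time in seconds.
template <typename F>
double seconds_taken(F&& work) {
    const auto start = std::chrono::steady_clock::now();
    work();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count();
}
```

Wrapping the serial and the parallel version of the same loop in a lambda and comparing the two numbers over several runs shows whether the threading overhead pays off for a given problem size.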