Optimal way to handle array indexing in parallel? - c++

I have the following situation: I have a list of particles in a box of size L, where L is the length of one of the sides.
Next, I split the box into cells, where L/cell_dim = 7. So there are 7*7*7 cells.
Finally, I read through all the particles, note their position, and calculate which cell they are in.
I accomplish the above in an openMP parallel for loop. However, I need to capture the information in a thread safe fashion such that I don't have to loop through all the particles for each cell. So I need some way to record an arbitrary subset of the particles into each cell, in parallel.
The method I have right now makes use of the OpenMP critical code block. I have an array size [7][7][7][max_particles], where max_particles is the highest number of particles per cell, but which is much less than the total number of particles. I record the index of the last particle added in a counter array size [7][7][7], and update the cell array according to the latest count in my parallel loop:
int cube[7][7][7][10];
int cube_counts[7][7][7]={0};
#pragma omp parallel for num_threads(a lot)
for (int i = 0; i < num_particles; i++){
cell_x = //cell calculation;
cell_y = //ditto;
cell_z = //...;
#pragma omp critical
{
cube_counts[cell_x][cell_y][cell_z] += 1;
// for readability
int index = cube_counts[cell_x][cell_y][cell_z];
cube[cell_x][cell_y][cell_z][index] = i;
}
}
// rest in pseudo code:
foreach cell:
adjacent_cell = cell2
particle_countA = cube_counts[cellx][celly][cellz]
particle_countB = cube_counts[cell2x][cell2y][cell2z]
// these two for loops will cover ~2-4 particles,
// so super small...as a result of the cell analysis above.
for particle in cell:
for particle in cell2:
...do stuff
Although this works, it increases in speed by a factor of more than 2 when I am able to eliminate the critical block (I am on an intel coprocessor with 60 physical, 240 logical).
How would I accomplish this without need for the critical block? I thought of doing a big array...but then I lose everything I gained when I iterate through the 7*7*7*257 (where 257 is the particle count) array. Linked lists still have the race conditions.
Maybe some kind of unordered, thread safe list...?

Using a lock instead of the critical section can be driven further:
You may use atomic increment and atomic assignment pseudo calls ("intrinsics") that the compiler will translate to the correct x86 specific assembler instructions. This is however platform or even compiler dependent.
If your use a modern c++ compiler (C++11) then std::atomic_* might be the best way to do it.

Related

Clean and efficient way of iterating over ranges of indices in a single array

I have a contiguous array of particles in 3D space for a fluid simulation and I need to do a lot of neighbor searches on it. I've found that partitioning the search space into cubic cells and sorting the particles in-place by the cell they are in works well for my problem. That way for any given cell, its particles are in a contiguous span so you can iterate over them all easily if you know the begin and end indices. eg cell N might occupy [N_begin, N_end) of the array used to store the particles.
However, no matter how you divide the space, a particle might have neighbors not only in its own cell but also in every neighboring cell (imagine a particle that is almost touching the boundary between cells to understand why). Because of this, a neighbor search needs to return all particles in the same cell as well as all of its neighbors in 3D space, a total of up to 27 cells when not on the edge of the simulation space. There is no ordering of the cells into the array (which is by its nature 1D) that can get all 27 of those spans to be adjacent for any requested cell. To work around this, I keep track of where each cell begins and ends in the particle array and I have a function that determines which cells hold potential neighbors. To represent multiple ranges of indices, it has to return up to 27 pairs of indices signifying the begin and end of those ranges.
std::vector<std::pair<int, int>> get_neighbor_indices(const Vec3f &position);
The index is actually required at some point so it works better this way than a pair of iterators or some other abstraction. The problem is that it forces me to use a loop structure that is tied to the implementation. I would like something like the following, using some pseudocode and omissions to simplify.
for(int i = 0; i < num_of_particles; ++i) {
auto neighbor_indices = get_neighbor_indices(particle[i].position);
for (int j : neighbor_indices) {
// do stuff with particle[i] and particle[j]
}
}
That would only work if neighbor_indices was a complete list of all indices, but that is a significant number of indices that are trivial to calculate on the fly and therefore a huge waste of memory. So the best I can get without compromising the performance of the code is the following.
for(int i = 0; i < num_of_particles; ++i) {
auto neighbor_indices = get_neighbor_indices(particle[i].position);
for (const auto& indices_pair : neighbor_indices) {
for (int j = indices_pair.first; j < indices_pair.second; ++j) {
// do stuff with particle[i] and particle[j]
}
}
}
The loss of genericity is a setback for my project because I have to test and measure performance a lot and make adjustments when I come across a performance problem. Implementation-specific code significantly slows down this process.
I'm looking for something like an iterator, except it will return an index instead of a reference. It will allow me to use it as follows.
for(int i = 0; i < num_of_particles; ++i) {
auto neighbor_indices = get_neighbor_indices(particle[i].position);
for (int j : neighbor_indices) {
// do stuff with particle[i] and particle[j]
}
}
The only issue with this iterator-like approach is that incrementing it is cumbersome. I have to manually keep track of the range I'm in, as well as continuously check when I'm at its end to switch to the next. It's a lot of code to get rid of just that one line that breaks the genericity of the iteration loop. So I'm looking for either a cleaner way to implement the "iterator" or just a better way to iterate over a number of ranges as if they were one.
Keep in mind that this is in a bottleneck computation loop so abstractions have to be zero or negligible cost.

How to use multi-threading within a loop that iterates through a point cloud in C++?

I have made a function that estimates the normal vectors of a 3D Point Cloud and it takes a lot of time to run on a cloud of size 2 million. I want to multi-thread by calling the same function on two different points at the same time but it didn't work (it was creating hundreds of threads). Here is what I tried:
// kd-tree used for finding neighbours
pcl::KdTreeFLANN<pcl::PointXYZRGB> kdt;
// cloud iterators
pcl::PointCloud<pcl::PointXYZRGB>::iterator cloud_it = pt_cl->points.begin();
pcl::PointCloud<pcl::PointXYZRGB>::iterator cloud_it1;
pcl::PointCloud<pcl::PointXYZRGB>::iterator cloud_it2;
pcl::PointCloud<pcl::PointXYZRGB>::iterator cloud_it3;
pcl::PointCloud<pcl::PointXYZRGB>::iterator cloud_it4;
// initializing tree
kdt.setInputCloud(pt_cl);
// loop exit condition
bool it_completed = false;
while (!it_completed)
{
// initializing cloud iterators
cloud_it1 = cloud_it;
cloud_it2 = cloud_it++;
cloud_it3 = cloud_it++;
if (cloud_it3 != pt_cl->points.end())
{
// attaching threads
boost::thread thread_1 = boost::thread(geom::vectors::find_normal, pt_cl, cloud_it1, kdt, radius, max_neighbs);
boost::thread thread_2 = boost::thread(geom::vectors::find_normal, pt_cl, cloud_it2, kdt, radius, max_neighbs);
boost::thread thread_3 = boost::thread(geom::vectors::find_normal, pt_cl, cloud_it3, kdt, radius, max_neighbs);
// joining threads
thread_1.join();
thread_2.join();
thread_3.join();
cloud_it++;
}
else
{
it_completed = true;
}
}
As you can see I am trying to call the same function on 3 different points at the same time. Any suggestions for how to make this work? Sorry for the poor code, I'm tired and thank you in advance.
EDIT: here is the find_normal function
Here are the parameters:
#param pt_cl is a pointer to the point cloud to be treated (pcl::PointCloud<PointXYZRGB>::Ptr)
#param cloud_it is an iterator of this cloud (pcl::PointCloud<PointXYZRGB>::iterator)
#param kdt is the kd_tree used to find the closest neighbours of a point
#param radius defines the range in which to search for the neighbours of a point
#param max_neighbs is the maximum number of neighbours to be returned by the radius search
// auxilliary vectors for the k-tree nearest search
std::vector<int> pointIdxRadiusSearch; // neighbours ids
std::vector<float> pointRadiusSquaredDistance; // distances from the source to the neighbours
// the vectors of which the cross product calculates the normal
geom::vectors::vector3 *vect1;
geom::vectors::vector3 *vect2;
geom::vectors::vector3 *cross_prod;
geom::vectors::vector3 *abs_cross_prod;
geom::vectors::vector3 *normal;
geom::vectors::vector3 *normalized_normal;
// vectors to average
std::vector<geom::vectors::vector3> vct_toavg;
// if there are neighbours left
if (kdt.radiusSearch(*cloud_it, radius, pointIdxRadiusSearch, pointRadiusSquaredDistance, max_neighbs) > 0)
{
for (int pt_index = 0; pt_index < (pointIdxRadiusSearch.size() - 1); pt_index++)
{
// defining the first vector
vect1 = geom::vectors::create_vect2p((*cloud_it), pt_cl->points[pointIdxRadiusSearch[pt_index + 1]]);
// defining the second vector; making sure there is no 'out of bounds' error
if (pt_index == pointIdxRadiusSearch.size() - 2)
vect2 = geom::vectors::create_vect2p((*cloud_it), pt_cl->points[pointIdxRadiusSearch[1]]);
else
vect2 = geom::vectors::create_vect2p((*cloud_it), pt_cl->points[pointIdxRadiusSearch[pt_index + 2]]);
// adding the cross product of the two previous vectors to our list
cross_prod = geom::vectors::cross_product(*vect1, *vect2);
abs_cross_prod = geom::aux::abs_vector(*cross_prod);
vct_toavg.push_back(*abs_cross_prod);
// freeing memory
delete vect1;
delete vect2;
delete cross_prod;
delete abs_cross_prod;
}
// calculating the normal
normal = geom::vectors::vect_avg(vct_toavg);
// calculating the normalized normal
normalized_normal = geom::vectors::normalize_normal(*normal);
// coloring the point
geom::aux::norm_toPtRGB(&(*cloud_it), *normalized_normal);
// freeing memory
delete normal;
delete normalized_normal;
// clearing vectors
vct_toavg.clear();
pointIdxRadiusSearch.clear();
pointRadiusSquaredDistance.clear();
// shrinking vectors
vct_toavg.shrink_to_fit();
pointIdxRadiusSearch.shrink_to_fit();
pointRadiusSquaredDistance.shrink_to_fit();
}
Since I don't quite get it how the result data is being stored, I'm going to suggest a solution based on OpenMP that matches the code you've posted.
// kd-tree used for finding neighbours
pcl::KdTreeFLANN<pcl::PointXYZRGB> kdt;
#pragma openmp parallel for schedule(static)
for (pcl::PointCloud<pcl::PointXYZRGB>::iterator cloud_it = pt_cl->points.begin();
cloud_it < pt_cl.end();
++cloud_it) {
geom::vectors::find_normal, pt_cl, cloud_it, kdt, radius, max_neighbs);
}
Note that you should be using the < comparison, and not the != one, -that's how OpenMP works (it wants random access iterators). I'm using the static schedule since every element should take more or less identical time to process. If that's not the case, try using schedule(dynamic) instead.
This solution uses OpenMP, and you may investigate e.g. TBB as well, though it has a higher entrance barrier than OpenMP and uses an OOP-style API.
Also, repeating what I've said in the comments already: OpenMP as well as TBB are going to handle thread management and load distribution for you. You only pass them hints (such as schedule(static)) on how to do it to so as to better suit your needs.
Other than that, please, do get in the habit of repeating as little code as you can; ideally, no code should be duplicated. E.g. when you declare many variables of the same type, or call a certain function a few times in a row with a similar pattern, etc. I also see excessive commenting in the code, with an unclear reason behind it.

Function calls in OpenMP loop?

I have been trying to parallelize a particle simulation code I wrote. But in my parallelization, I came away with no increase in performance when moving from 1 processor to 12, and even worse the code is no longer returning accurate results. I have been banging my head against the wall and can't figure this out. Below is the loop being parallelized:
#pragma omp parallel
{
omp_set_dynamic(1);
omp_set_num_threads(12);
#pragma omp for
// Loop over azimuth ejection angle, from 0-360.
for(int i=0; i<360; i++)
{
// Declare temporary variables
double *y = new double[12];
vector<double> ejecVel(3);
vector<double> colLoc(7);
double azimuth, inc;
bool collision;
// Loop over inclincation ejection angle from 1-max_angle, increasing by 1 degree.
for(int j=1; j<=15; j++)
{
// Update azimuth and inclination angle and get velocity direction vector.
azimuth = (double) i;
inc = (double) j;
ejecVel = Jet::GetEjecVelocity(azimuth,inc);
collision = false;
// Update initial conditions.
y[0] = m_parPos[0];
y[1] = m_parPos[1];
... (define pointer values)
// Simulate particle
systemSolver.ParticleSim(simSteps,dt,y,collision,colLoc);
if(collision == true)
{
cout << "Collision! " << endl;
}
}
delete [] y;
}
The goal is to loop through, simulating particles for different initial conditions over the loops, and store where they have gone and their state vector upon collision in master variables densCount and collisionStates. The simulation takes place in a function from another class (systemSolver.ParticleSim() ), and it seems like each solve from a different thread is not independent. Everything I've read suggests that it should be, but I can't figure out why else the result would not be right only if I have Open MP implemented. Any thoughts are greatly appreciated.
-ben
SOLUTION: The simulation was modifying a member variable of a separate (systemSolver) class. Since I provided a single class object to all threads, they were all simultaneously modifying an important member variable. Thought I would post this in case any other n00bs encounter a similar problem.
I believe one mistake is the call to omp_set_* functions inside the parallel region. In the best case, they take effect on subsequent regions only. Try to reorder as following:
omp_set_dynamic(1);
omp_set_num_threads(12);
#pragma omp parallel

Shift vector in thrust

I'm looking at a project involving online (streaming) data. I want to work with a sliding window of that data. For example, say that I want to hold 10 values in my vector. When value 11 comes in, I want to drop value 1, shift everything over, and then place value 11 where value 10 was.
The long way would be something like the following:
int n = 9;
thrust::device_vector<float> val;
val.resize(n+1,0);
// Shift left
for(int i=0; i != n-1; i++){
val[i] = val[i+1];
}
// add the new value to the last position
val[n] = newValue;
Is there a "fast" way to do this with thrust? The project I'm looking at will have around 500 vectors that will need this operation done simultaneously.
Thanks!
As I have said, Ring buffer is what you need. No need to shift there, only one counter and a fixed size array.
Let's think how we may deal with 500 of ring buffers.
If you want to have 500 (let it be 512) sliding windows and process them all on the GPU, then you might pack them into one big 2D texture, where each column is an array of samples for the same moment.
If you're getting new samples for each of the vector at once (I mean one new sample for each 512 buffers at one processing step), then this "ring texture" (like a cylinder) only needs to be updated once (upload the array of new samples at each step) and you need just one counter.
I highly recommend using a different, yet still free, library for this problem. In 4 lines of ArrayFire code, you can do all 500 vectors, as follows:
array val = array(window_width, num_vectors);
val = shift(val, 0, 1);
array newValue = array(1,num_vectors);
val(span,end) = newValue;
I benchmarked against Thrust code for the same and ArrayFire is getting about a 10X speedup over Thrust.
Downside is that ArrayFire is not open source, but it is still free for this sort of problem.
Want you want is simply thrust::copy. You can't do a shift in place in parallel, because you can't guarantee a value is read before it is written.
int n = 9;
thrust::device_vector<float> val_in(n);
thrust::device_vector<float> val_out(n+1);
thrust::copy(val_in.begin() + 1, val_in.end(), val_out.begin());
// add the new value to the last position
val_out[n] = newValue;

improving performance for graph connectedness computation

I am writing a program to generate a graph and check whether it is connected or not. Below is the code. Here is some explanation: I generate a number of points on the plane at random locations. I then connect the nodes, NOT based on proximity only. By that I mean to say that a node is more likely to be connected to nodes that are closer, and this is determined by a random variable that I use in the code (h_sq) and the distance. Hence, I generate all links (symmetric, i.e., if i can talk to j the viceversa is also true) and then check with a BFS to see if the graph is connected.
My problem is that the code seems to be working properly. However, when the number of nodes becomes greater than ~2000 it is terribly slow, and I need to run this function many times for simulation purposes. I even tried to use other libraries for graphs but the performance is the same.
Does anybody know how could I possibly speed everything up?
Thanks,
int Graph::gen_links() {
if( save == true ) { // in case I want to store the structure of the graph
links.clear();
links.resize(xy.size());
}
double h_sq, d;
vector< vector<luint> > neighbors(xy.size());
// generate links
double tmp = snr_lin / gamma_0_lin;
// xy is a std vector of pairs containing the nodes' locations
for(luint i = 0; i < xy.size(); i++) {
for(luint j = i+1; j < xy.size(); j++) {
// generate |h|^2
d = distance(i, j);
if( d < d_crit ) // for sim purposes
d = 1.0;
h_sq = pow(mrand.randNorm(0, 1), 2.0) + pow(mrand.randNorm(0, 1), 2.0);
if( h_sq * tmp >= pow(d, alpha) ) {
// there exists a link between i and j
neighbors[i].push_back(j);
neighbors[j].push_back(i);
// options
if( save == true )
links.push_back( make_pair(i, j) );
}
}
if( neighbors[i].empty() && save == false ) {
// graph not connected. since save=false i dont need to store the structure,
// hence I exit
connected = 0;
return 1;
}
}
// here I do BFS to check whether the graph is connected or not, using neighbors
// BFS code...
return 1;
}
UPDATE:
the main problem seems to be the push_back calls within the inner for loops. It's the part that takes most of the time in this case. Shall I use reserve() to increase efficiency?
Are you sure the slowness is caused by the generation but not by your search algorithm?
The graph generation is O(n^2) and you can't do too much to it. However, you can apparently use memory in exchange of some of the time if the point locations are fixed for at least some of the experiments.
First, distances of all node pairs, and pow(d, alpha) can be precomputed and saved into memory so that you don't need to compute them again and again. The extra memory cost for 10000 nodes will be about 800mb for double and 400mb for float..
In addition, sum of square of normal variable is chi-square distribution if I remember correctly.. Probably you can have some precomputed table lookup if the accuracy allowed?
At last, if the probability that two nodes will be connected are so small if the distance exceeds some value, then you don't need O(n^2) and probably you can only calculate those node pairs that have distance smaller than some limits?
As a first step you should try to use reserve for both inner and outer vectors.
If this does not bring performance up to your expectations I believe this is because memory allocations that are still happening.
There is a handy class I've used in similar situations, llvm::SmallVector (find it in Google). It provides a vector with few pre-allocated items, so you can have decrease number of allocations by one per vector.
It still can grow when it is running out of items in pre-allocated space.
So:
1) Examine the number of items you have in your vectors on average during runs (I'm talking about both inner and outer vectors)
2) Put in llvm::SmallVector with a pre-allocation of such size (as vector is allocated on the stack you might need to increase stack size, or reduce pre-allocation if you are restricted on available stack memory).
Another good thing about SmallVector is that it has almost the same interface as std::vector (could be easily put instead of it)