Why Is My Spatial Hash So Slow? - c++

Why is my spatial hash so slow? I am working on a code that uses smooth particle hydrodynamics to model the movement of landslides. In smooth particle hydrodynamics each particle influences the particles that are within a distance of 3 "smoothing lengths". I am trying to implement a spatial hash function in order to have a fast look up of the neighboring particles.
For my implementation I made use of the "set" datatype from the stl. At each time step the particles are hashed into their bucket using the function below. "bucket" is a vector of sets, with one set for each grid cell (the spatial domain is limited). Each particle is identified by an integer.
To look for collisions the function below entitled "getSurroundingParticles" is used which takes an integer (corresponding to a particle) and returns a set that contains all the grid cells that are within 3 support lengths of the particle.
The problem is that this implementation is really slow, slower even than just checking each particle against every other particles, when the number of particles is 2000. I was hoping that someone could spot a glaring problem in my implementation that I'm not seeing.
//put each particle into its bucket(s)
void hashParticles()
{
int grid_cell0;
cellsWithParticles.clear();
for(int i = 0; i<N; i++)
{
//determine the four grid cells that surround the particle, as well as the grid cell that the particle occupies
//using the hash function int grid_cell = ( floor(x/cell size) ) + ( floor(y/cell size) )*width
grid_cell0 = ( floor( (Xnew[i])/cell_spacing) ) + ( floor(Ynew[i]/cell_spacing) )*cell_width;
//keep track of cells with particles, cellsWithParticles is an unordered set so duplicates will automatically be deleted
cellsWithParticles.insert(grid_cell0);
//since each of the hash buckets is an unordered set any duplicates will be deleted
buckets[grid_cell0].insert(i);
}
}
set<int> getSurroundingParticles(int particleOfInterest)
{
set<int> surroundingParticles;
int numSurrounding;
float divisor = (support_length/cell_spacing);
numSurrounding = ceil( divisor );
int grid_cell;
for(int i = -numSurrounding; i <= numSurrounding; i++)
{
for(int j = -numSurrounding; j <= numSurrounding; j++)
{
grid_cell = (int)( floor( ((Xnew[particleOfInterest])+j*cell_spacing)/cell_spacing) ) + ( floor((Ynew[particleOfInterest]+i*cell_spacing)/cell_spacing) )*cell_width;
surroundingParticles.insert(buckets[grid_cell].begin(),buckets[grid_cell].end());
}
}
return surroundingParticles;
}
The code that looks calls getSurroundingParticles:
set<int> nearbyParticles;
//for each bucket with particles in it
for ( int i = 0; i < N; i++ )
{
nearbyParticles = getSurroundingParticles(i);
//for each particle in the bucket
for ( std::set<int>::iterator ipoint = nearbyParticles.begin(); ipoint != nearbyParticles.end(); ++ipoint )
{
//do stuff to check if the smaller subset of particles collide
}
}
Thanks a lot!

Your performance is getting eaten alive by the sheer amount of stl heap allocations caused by repeatedly creating and populating all those Sets. If you profiled the code (say with a quick and easy non-instrumenting tool like Sleepy), I'm certain you'd find that to be the case. You're using Sets to avoid having a given particle added to a bucket more than once - I get that. If Duck's suggestion doesn't give you what you need, I think you could dramatically improve performance by using preallocated arrays or vectors, and getting uniqueness in those containers by adding an "added" flag to the particle that gets set when the item is added. Then just check that flag before adding, and be sure to clear the flags before the next cycle. (If the number of particles is constant, you can do this extremely efficiently with a preallocated array dedicated to storing the flags, then memsetting to 0 at the end of the frame.)
That's the basic idea. If you decide to go this route and get stuck somewhere, I'll help you work out the details.

Related

Is using for loops faster than saving things in a vector in C++?

Sorry for the bad title, but I can actually not think of a better one (open to suggestions).
I got a big grid (1000*1000*1000).
for (int k = 0; k<dims.nz; k++);
{
for (int i = 0; i < dims.nx; i++)
{
for (int j = 0; j < dims.ny; j++)
{
if (inputLabel->evalReg(i, j, 0) == 0)
{
sum = sum + anotherField->evalReg(i, j, 0);
}
}
}
}
I go through all grid points to find which grid points have the value 0 in my labelfield and sum up the corresponding values of another field.
After this I want to set all the points that I detected above to a certain value.
Would it be faster to do basically do the same for loop again (while this time setting values instead of reading them), or should I write all the positions that I got into a separate vector (which would have to change size in ever step of the loop in which we detect something) and simply build a for loop like
for(int p=0; p<size_vec_1,p++)
{
anotherField->set(vec_1[p],vec_2[p],vec_3[p], random value);
}
The point is that I do not know how much of the grid will be affected by my routien due to different data. Might be half of the data or something completly different. Can I do a genereal estimation of the speed of the methods or is it soley depending on the distribution of my values ?
The point is that I do not know how much of the grid will be affected by my routien due to different data.
Here's a trick, which may work: sample inputLabel randomly, to make an approximation how many entries are 0. If a few, then go the "putting indices into a vector" way. If a lot, then go the "scan the array again" way.
It needs fine tuning for a specific computer, what should be the threshold between the two cases, how many samples to take (it should not be too large, as the approximation will take too much time, but should not be too small to have a good approximation), etc.
Bonus trick: take cache-line-aligned and cache-line-sized samples. This way the approximation will take the similar amount of time (because it is memory bound), but the approximation will be better.

Clean and efficient way of iterating over ranges of indices in a single array

I have a contiguous array of particles in 3D space for a fluid simulation and I need to do a lot of neighbor searches on it. I've found that partitioning the search space into cubic cells and sorting the particles in-place by the cell they are in works well for my problem. That way for any given cell, its particles are in a contiguous span so you can iterate over them all easily if you know the begin and end indices. eg cell N might occupy [N_begin, N_end) of the array used to store the particles.
However, no matter how you divide the space, a particle might have neighbors not only in its own cell but also in every neighboring cell (imagine a particle that is almost touching the boundary between cells to understand why). Because of this, a neighbor search needs to return all particles in the same cell as well as all of its neighbors in 3D space, a total of up to 27 cells when not on the edge of the simulation space. There is no ordering of the cells into the array (which is by its nature 1D) that can get all 27 of those spans to be adjacent for any requested cell. To work around this, I keep track of where each cell begins and ends in the particle array and I have a function that determines which cells hold potential neighbors. To represent multiple ranges of indices, it has to return up to 27 pairs of indices signifying the begin and end of those ranges.
std::vector<std::pair<int, int>> get_neighbor_indices(const Vec3f &position);
The index is actually required at some point so it works better this way than a pair of iterators or some other abstraction. The problem is that it forces me to use a loop structure that is tied to the implementation. I would like something like the following, using some pseudocode and omissions to simplify.
for(int i = 0; i < num_of_particles; ++i) {
auto neighbor_indices = get_neighbor_indices(particle[i].position);
for (int j : neighbor_indices) {
// do stuff with particle[i] and particle[j]
}
}
That would only work if neighbor_indices was a complete list of all indices, but that is a significant number of indices that are trivial to calculate on the fly and therefore a huge waste of memory. So the best I can get without compromising the performance of the code is the following.
for(int i = 0; i < num_of_particles; ++i) {
auto neighbor_indices = get_neighbor_indices(particle[i].position);
for (const auto& indices_pair : neighbor_indices) {
for (int j = indices_pair.first; j < indices_pair.second; ++j) {
// do stuff with particle[i] and particle[j]
}
}
}
The loss of genericity is a setback for my project because I have to test and measure performance a lot and make adjustments when I come across a performance problem. Implementation-specific code significantly slows down this process.
I'm looking for something like an iterator, except it will return an index instead of a reference. It will allow me to use it as follows.
for(int i = 0; i < num_of_particles; ++i) {
auto neighbor_indices = get_neighbor_indices(particle[i].position);
for (int j : neighbor_indices) {
// do stuff with particle[i] and particle[j]
}
}
The only issue with this iterator-like approach is that incrementing it is cumbersome. I have to manually keep track of the range I'm in, as well as continuously check when I'm at its end to switch to the next. It's a lot of code to get rid of just that one line that breaks the genericity of the iteration loop. So I'm looking for either a cleaner way to implement the "iterator" or just a better way to iterate over a number of ranges as if they were one.
Keep in mind that this is in a bottleneck computation loop so abstractions have to be zero or negligible cost.

Trying to scan a vector of vehicles an extract information for lane switching C++

I am trying to build a simulation which contains certain objects. I have vehicles and lanes. I have an engine which allows vehicles to advance, based on their velocity and acceleration.
bool Lane::allowedOvertake(double pos, double mindist)
{
for (unsigned int iV = 0; iV < getNVehiclesinLane() - 1; iV++)
{
if ((fVehicles[iV]->getPosition() > pos - mindist) // If inside rear safety distance.
|| fVehicles[iV]->getPosition() < pos + mindist) // If inside front safety distance.
{}//continue
//else {return false;}
}
return true;
}
I would like this for loop to scan over all the vehicles in a lane, so that a vehicle from a neighbouring lane can check whether it can move into this scanned lane. As a note, the pos and mindist parameters are the positions and the minimum distance the lane seeking vehicle needs to safely switch lanes. Also, fVehicles is a vector of vehicles. If the result is true, I then use an if statement in my 'master' object, the road, which allows for an actual switch to take place (using vector.insert()).
I currently get vehicles switching lanes without regard. At first glance, I would suspect it is the above function's logic which is incorrect. Any help in providing a fix, or even a better solution, would be appreciated.
Note: I have a vector of vehicles, and a vector of lanes. However, the vehicles are not ordered in the vector by their positions. I have been advised to re-design this so that the order of the vehicles in the vector are more significant and one can benefit from this when developing the code. However, for now, I would like to fix the design I currently have. Then I will look into redesigning the simulation to make the order more significant. Besides this, my problem above would still exist, just in a slightly different form.
Given an unsorted vector, you have to check if all of them are distant enough from the passed position or, in other words, if none of them is too close:
#include <algorithm>
bool Lane::allowedOvertake(double pos, double mindist)
{
return std::none_of(
fVehicles.begin(), fVehicles.end(), [pos, mindist] (auto & v) {
return v->getPosition() <= pos + mindist
and v->getPosition() >= pos - mindist;
}
);
}

Optimizing the Dijkstra's algorithm

I need a graph-search algorithm that is enough in our application of robot navigation and I chose Dijkstra's algorithm.
We are given the gridmap which contains free, occupied and unknown cells where the robot is only permitted to pass through the free cells. The user will input the starting position and the goal position. In return, I will retrieve the sequence of free cells leading the robot from starting position to the goal position which corresponds to the path.
Since executing the dijkstra's algorithm from start to goal would give us a reverse path coming from goal to start, I decided to execute the dijkstra's algorithm backwards such that I would retrieve the path from start to goal.
Starting from the goal cell, I would have 8 neighbors whose cost horizontally and vertically is 1 while diagonally would be sqrt(2) only if the cells are reachable (i.e. not out-of-bounds and free cell).
Here are the rules that should be observe in updating the neighboring cells, the current cell can only assume 8 neighboring cells to be reachable (e.g. distance of 1 or sqrt(2)) with the following conditions:
The neighboring cell is not out of bounds
The neighboring cell is unvisited.
The neighboring cell is a free cell which can be checked via the 2-D grid map.
Here is my implementation:
#include <opencv2/opencv.hpp>
#include <algorithm>
#include "Timer.h"
/// CONSTANTS
static const int UNKNOWN_CELL = 197;
static const int FREE_CELL = 255;
static const int OCCUPIED_CELL = 0;
/// STRUCTURES for easier management.
struct vertex {
cv::Point2i id_;
cv::Point2i from_;
vertex(cv::Point2i id, cv::Point2i from)
{
id_ = id;
from_ = from;
}
};
/// To be used for finding an element in std::multimap STL.
struct CompareID
{
CompareID(cv::Point2i val) : val_(val) {}
bool operator()(const std::pair<double, vertex> & elem) const {
return val_ == elem.second.id_;
}
private:
cv::Point2i val_;
};
/// Some helper functions for dijkstra's algorithm.
uint8_t get_cell_at(const cv::Mat & image, int x, int y)
{
assert(x < image.rows);
assert(y < image.cols);
return image.data[x * image.cols + y];
}
/// Some helper functions for dijkstra's algorithm.
bool checkIfNotOutOfBounds(cv::Point2i current, int rows, int cols)
{
return (current.x >= 0 && current.y >= 0 &&
current.x < cols && current.y < rows);
}
/// Brief: Finds the shortest possible path from starting position to the goal position
/// Param gridMap: The stage where the tracing of the shortest possible path will be performed.
/// Param start: The starting position in the gridMap. It is assumed that start cell is a free cell.
/// Param goal: The goal position in the gridMap. It is assumed that the goal cell is a free cell.
/// Param path: Returns the sequence of free cells leading to the goal starting from the starting cell.
bool findPathViaDijkstra(const cv::Mat& gridMap, cv::Point2i start, cv::Point2i goal, std::vector<cv::Point2i>& path)
{
// Clear the path just in case
path.clear();
// Create working and visited set.
std::multimap<double,vertex> working, visited;
// Initialize working set. We are going to perform the djikstra's
// backwards in order to get the actual path without reversing the path.
working.insert(std::make_pair(0, vertex(goal, goal)));
// Conditions in continuing
// 1.) Working is empty implies all nodes are visited.
// 2.) If the start is still not found in the working visited set.
// The Dijkstra's algorithm
while(!working.empty() && std::find_if(visited.begin(), visited.end(), CompareID(start)) == visited.end())
{
// Get the top of the STL.
// It is already given that the top of the multimap has the lowest cost.
std::pair<double, vertex> currentPair = *working.begin();
cv::Point2i current = currentPair.second.id_;
visited.insert(currentPair);
working.erase(working.begin());
// Check all arcs
// Only insert the cells into working under these 3 conditions:
// 1. The cell is not in visited cell
// 2. The cell is not out of bounds
// 3. The cell is free
for (int x = current.x-1; x <= current.x+1; x++)
for (int y = current.y-1; y <= current.y+1; y++)
{
if (checkIfNotOutOfBounds(cv::Point2i(x, y), gridMap.rows, gridMap.cols) &&
get_cell_at(gridMap, x, y) == FREE_CELL &&
std::find_if(visited.begin(), visited.end(), CompareID(cv::Point2i(x, y))) == visited.end())
{
vertex newVertex = vertex(cv::Point2i(x,y), current);
double cost = currentPair.first + sqrt(2);
// Cost is 1
if (x == current.x || y == current.y)
cost = currentPair.first + 1;
std::multimap<double, vertex>::iterator it =
std::find_if(working.begin(), working.end(), CompareID(cv::Point2i(x, y)));
if (it == working.end())
working.insert(std::make_pair(cost, newVertex));
else if(cost < (*it).first)
{
working.erase(it);
working.insert(std::make_pair(cost, newVertex));
}
}
}
}
// Now, recover the path.
// Path is valid!
if (std::find_if(visited.begin(), visited.end(), CompareID(start)) != visited.end())
{
std::pair <double, vertex> currentPair = *std::find_if(visited.begin(), visited.end(), CompareID(start));
path.push_back(currentPair.second.id_);
do
{
currentPair = *std::find_if(visited.begin(), visited.end(), CompareID(currentPair.second.from_));
path.push_back(currentPair.second.id_);
} while(currentPair.second.id_.x != goal.x || currentPair.second.id_.y != goal.y);
return true;
}
// Path is invalid!
else
return false;
}
int main()
{
// cv::Mat image = cv::imread("filteredmap1.jpg", CV_LOAD_IMAGE_GRAYSCALE);
cv::Mat image = cv::Mat(100,100,CV_8UC1);
std::vector<cv::Point2i> path;
for (int i = 0; i < image.rows; i++)
for(int j = 0; j < image.cols; j++)
{
image.data[i*image.cols+j] = FREE_CELL;
if (j == image.cols/2 && (i > 3 && i < image.rows - 3))
image.data[i*image.cols+j] = OCCUPIED_CELL;
// if (image.data[i*image.cols+j] > 215)
// image.data[i*image.cols+j] = FREE_CELL;
// else if(image.data[i*image.cols+j] < 100)
// image.data[i*image.cols+j] = OCCUPIED_CELL;
// else
// image.data[i*image.cols+j] = UNKNOWN_CELL;
}
// Start top right
cv::Point2i goal(image.cols-1, 0);
// Goal bottom left
cv::Point2i start(0, image.rows-1);
// Time the algorithm.
Timer timer;
timer.start();
findPathViaDijkstra(image, start, goal, path);
std::cerr << "Time elapsed: " << timer.getElapsedTimeInMilliSec() << " ms";
// Add the path in the image for visualization purpose.
cv::cvtColor(image, image, CV_GRAY2BGRA);
int cn = image.channels();
for (int i = 0; i < path.size(); i++)
{
image.data[path[i].x*cn*image.cols+path[i].y*cn+0] = 0;
image.data[path[i].x*cn*image.cols+path[i].y*cn+1] = 255;
image.data[path[i].x*cn*image.cols+path[i].y*cn+2] = 0;
}
cv::imshow("Map with path", image);
cv::waitKey();
return 0;
}
For the algorithm implementation, I decided to have two sets namely the visited and working set whose each elements contain:
The location of itself in the 2D grid map.
The accumulated cost
Through what cell did it get its accumulated cost (for path recovery)
And here is the result:
The black pixels represent obstacles, the white pixels represent free space and the green line represents the path computed.
On this implementation, I would only search within the current working set for the minimum value and DO NOT need to scan throughout the cost matrix (where initially, the initially cost of all cells are set to infinity and the starting point 0). Maintaining a separate vector of the working set I think promises a better code performance because all the cells that have cost of infinity is surely to be not included in the working set but only those cells that have been touched.
I also took advantage of the STL which C++ provides. I decided to use the std::multimap since it can store duplicating keys (which is the cost) and it sorts the lists automatically. However, I was forced to use std::find_if() to find the id (which is the row,col of the current cell in the set) in the visited set to check if the current cell is on it which promises linear complexity. I really think this is the bottleneck of the Dijkstra's algorithm.
I am well aware that A* algorithm is much faster than Dijkstra's algorithm but what I wanted to ask is my implementation of Dijkstra's algorithm optimal? Even if I implemented A* algorithm using my current implementation in Dijkstra's which is I believe suboptimal, then consequently A* algorithm will also be suboptimal.
What improvement can I perform? What STL is the most appropriate for this algorithm? Particularly, how do I improve the bottleneck?
You're using a std::multimap for 'working' and 'visited'. That's not great.
The first thing you should do is change visited into a per-vertex flag so you can do your find_if in constant time instead of linear times and also so that operations on the list of visited vertices take constant instead of logarithmic time. You know what all the vertices are and you can map them to small integers trivially, so you can use either a std::vector or a std::bitset.
The second thing you should do is turn working into a priority queue, rather than a balanced binary tree structure, so that operations are a (largish) constant factor faster. std::priority_queue is a barebones binary heap. A higher-radix heap---say quaternary for concreteness---will probably be faster on modern computers due to its reduced depth. Andrew Goldberg suggests some bucket-based data structures; I can dig up references for you if you get to that stage. (They're not too complicated.)
Once you've taken care of these two things, you might look at A* or meet-in-the-middle tricks to speed things up even more.
Your performance is several orders of magnitude worse than it could be because you're using graph search algorithms for what looks like geometry. This geometry is much simpler and less general than the problems that graph search algorithms can solve. Also, with a vertex for every pixel your graph is huge even though it contains basically no information.
I heard you asking "how can I make this better without changing what I'm thinking" but nevertheless I'll tell you a completely different and better approach.
It looks like your robot can only go horizontally, vertically or diagonally. Is that for real or just a side effect of you choosing graph search algorithms? I'll assume the latter and let it go in any direction.
The algorithm goes like this:
(0) Represent your obstacles as polygons by listing the corners. Work in real numbers so you can make them as thin as you like.
(1) Try for a straight line between the end points.
(2) Check if that line goes through an obstacle or not. To do that for any line, show that all corners of any particular obstacle lie on the same side of the line. To do that, translate all points by (-X,-Y) of one end of the line so that that point is at the origin, then rotate until the other point is on the X axis. Now all corners should have the same sign of Y if there's no obstruction. There might be a quicker way just using gradients.
(3) If there's an obstruction, propose N two-segment paths going via the N corners of the obstacle.
(4) Recurse for all segments, culling any paths with segments that go out of bounds. That won't be a problem unless you have obstacles that go out of bounds.
(5) When it stops recursing, you should have a list of locally optimised paths from which you can choose the shortest.
(6) If you really want to restrict bearings to multiples of 45 degrees, then you can do this algorithm first and then replace each segment by any 45-only wiggly version that avoids obstacles. We know that such a version exists because you can stay extremely close to the original line by wiggling very often. We also know that all such wiggly paths have the same length.

improving performance for graph connectedness computation

I am writing a program to generate a graph and check whether it is connected or not. Below is the code. Here is some explanation: I generate a number of points on the plane at random locations. I then connect the nodes, NOT based on proximity only. By that I mean to say that a node is more likely to be connected to nodes that are closer, and this is determined by a random variable that I use in the code (h_sq) and the distance. Hence, I generate all links (symmetric, i.e., if i can talk to j the viceversa is also true) and then check with a BFS to see if the graph is connected.
My problem is that the code seems to be working properly. However, when the number of nodes becomes greater than ~2000 it is terribly slow, and I need to run this function many times for simulation purposes. I even tried to use other libraries for graphs but the performance is the same.
Does anybody know how could I possibly speed everything up?
Thanks,
int Graph::gen_links() {
if( save == true ) { // in case I want to store the structure of the graph
links.clear();
links.resize(xy.size());
}
double h_sq, d;
vector< vector<luint> > neighbors(xy.size());
// generate links
double tmp = snr_lin / gamma_0_lin;
// xy is a std vector of pairs containing the nodes' locations
for(luint i = 0; i < xy.size(); i++) {
for(luint j = i+1; j < xy.size(); j++) {
// generate |h|^2
d = distance(i, j);
if( d < d_crit ) // for sim purposes
d = 1.0;
h_sq = pow(mrand.randNorm(0, 1), 2.0) + pow(mrand.randNorm(0, 1), 2.0);
if( h_sq * tmp >= pow(d, alpha) ) {
// there exists a link between i and j
neighbors[i].push_back(j);
neighbors[j].push_back(i);
// options
if( save == true )
links.push_back( make_pair(i, j) );
}
}
if( neighbors[i].empty() && save == false ) {
// graph not connected. since save=false i dont need to store the structure,
// hence I exit
connected = 0;
return 1;
}
}
// here I do BFS to check whether the graph is connected or not, using neighbors
// BFS code...
return 1;
}
UPDATE:
the main problem seems to be the push_back calls within the inner for loops. It's the part that takes most of the time in this case. Shall I use reserve() to increase efficiency?
Are you sure the slowness is caused by the generation but not by your search algorithm?
The graph generation is O(n^2) and you can't do too much to it. However, you can apparently use memory in exchange of some of the time if the point locations are fixed for at least some of the experiments.
First, distances of all node pairs, and pow(d, alpha) can be precomputed and saved into memory so that you don't need to compute them again and again. The extra memory cost for 10000 nodes will be about 800mb for double and 400mb for float..
In addition, sum of square of normal variable is chi-square distribution if I remember correctly.. Probably you can have some precomputed table lookup if the accuracy allowed?
At last, if the probability that two nodes will be connected are so small if the distance exceeds some value, then you don't need O(n^2) and probably you can only calculate those node pairs that have distance smaller than some limits?
As a first step you should try to use reserve for both inner and outer vectors.
If this does not bring performance up to your expectations I believe this is because memory allocations that are still happening.
There is a handy class I've used in similar situations, llvm::SmallVector (find it in Google). It provides a vector with few pre-allocated items, so you can have decrease number of allocations by one per vector.
It still can grow when it is running out of items in pre-allocated space.
So:
1) Examine the number of items you have in your vectors on average during runs (I'm talking about both inner and outer vectors)
2) Put in llvm::SmallVector with a pre-allocation of such size (as vector is allocated on the stack you might need to increase stack size, or reduce pre-allocation if you are restricted on available stack memory).
Another good thing about SmallVector is that it has almost the same interface as std::vector (could be easily put instead of it)