I am writing a program that generates a graph and checks whether it is connected. Below is the code, with some explanation: I generate a number of points at random locations on the plane. I then connect the nodes, NOT based on proximity alone; rather, a node is more likely to be connected to nodes that are closer, as determined by a random variable used in the code (h_sq) together with the distance. I generate all links (symmetric, i.e., if i can talk to j then vice versa also holds) and then run a BFS to check whether the graph is connected.
My problem is that, although the code seems to work correctly, it becomes terribly slow once the number of nodes exceeds ~2000, and I need to run this function many times for simulation purposes. I even tried other graph libraries, but the performance is the same.
Does anybody know how I could speed everything up?
Thanks,
int Graph::gen_links() {
    if( save == true ) { // in case I want to store the structure of the graph
        links.clear();
        links.reserve(xy.size()); // reserve, not resize: links is filled with push_back below
    }
    double h_sq, d;
    vector< vector<luint> > neighbors(xy.size());
    // generate links
    double tmp = snr_lin / gamma_0_lin;
    // xy is a std vector of pairs containing the nodes' locations
    for(luint i = 0; i < xy.size(); i++) {
        for(luint j = i+1; j < xy.size(); j++) {
            // generate |h|^2
            d = distance(i, j);
            if( d < d_crit ) // for sim purposes
                d = 1.0;
            h_sq = pow(mrand.randNorm(0, 1), 2.0) + pow(mrand.randNorm(0, 1), 2.0);
            if( h_sq * tmp >= pow(d, alpha) ) {
                // there exists a link between i and j
                neighbors[i].push_back(j);
                neighbors[j].push_back(i);
                // options
                if( save == true )
                    links.push_back( make_pair(i, j) );
            }
        }
        if( neighbors[i].empty() && save == false ) {
            // graph not connected; since save == false I don't need to store
            // the structure, so I can exit early
            connected = 0;
            return 1;
        }
    }
    // here I do BFS to check whether the graph is connected or not, using neighbors
    // BFS code...
    return 1;
}
UPDATE:
The main problem seems to be the push_back calls within the inner for loops; that is the part that takes most of the time in this case. Shall I use reserve() to increase efficiency?
Are you sure the slowness is caused by the generation and not by your search algorithm?
The graph generation is O(n^2) and you can't do much about that. However, you can trade some memory for time if the point locations are fixed for at least some of the experiments.
First, the distances of all node pairs, and pow(d, alpha), can be precomputed and stored so that you don't compute them again and again. The extra memory cost for 10000 nodes will be about 800 MB for double and 400 MB for float.
In addition, the sum of squares of two standard normal variables follows a chi-squared distribution (with 2 degrees of freedom), so you could replace the two randNorm calls with a table lookup or a direct draw, if the accuracy allows.
At last, if the probability that two nodes are connected becomes negligible once the distance exceeds some threshold, you don't need the full O(n^2) pass: you can restrict the calculation to node pairs whose distance is below that limit.
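On the second point, here is a minimal sketch of the direct draw, using <random> instead of the question's mrand wrapper; it relies on the fact that a chi-squared variable with 2 degrees of freedom is exponential with mean 2:

#include <cmath>
#include <random>

// one uniform draw and a log replace two randNorm() calls and two pow() calls
double draw_h_sq(std::mt19937 &rng)
{
    // uniform in (0, 1); the positive lower bound avoids log(0)
    std::uniform_real_distribution<double> uni(std::nextafter(0.0, 1.0), 1.0);
    return -2.0 * std::log(uni(rng)); // Exp(rate 1/2) == chi-squared(2 dof)
}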
As a first step you should try to use reserve for both inner and outer vectors.
If this does not bring performance up to your expectations, I believe that is because memory allocations are still happening.
There is a handy class I've used in similar situations: llvm::SmallVector (find it on Google). It provides a vector with a few pre-allocated items, so you can decrease the number of allocations by one per vector.
It still can grow when it is running out of items in pre-allocated space.
So:
1) Examine the number of items you have in your vectors on average during runs (I'm talking about both inner and outer vectors)
2) Use llvm::SmallVector with a pre-allocation of that size (since the pre-allocated buffer lives inside the vector object, which may sit on the stack, you might need to increase the stack size, or reduce the pre-allocation if you are restricted on available stack memory).
Another good thing about SmallVector is that it has almost the same interface as std::vector, so it can easily be dropped in as a replacement.
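As a rough sketch of how the reserve() variant would look in the question's code (expected_degree is a placeholder for the average neighbour count you measure in step 1):

// pre-allocate before the O(n^2) loop; expected_degree is hypothetical
const size_t expected_degree = 16; // tune from measured runs
vector< vector<luint> > neighbors(xy.size());
for (luint i = 0; i < xy.size(); i++)
    neighbors[i].reserve(expected_degree);
if (save)
    links.reserve(xy.size() * expected_degree / 2); // each link stored once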
I wrote some code that clusters floor areas on a map. It works well; however, I would like to optimize its computation time.
Do you have any advice on how to reduce the computation time?
The idea behind the optimization was to use only one vector for the clusters and to go through all the pixels only once.
The algorithm works as follows: it scans all the cells, starting in the top left corner and analyzing one row after another. If a cell is detected as floor, it analyses the neighbors on the left and checks whether one of them already belongs to a cluster. If so, the same cluster number is assigned; if not, a new cluster number is assigned. If two neighbors have different cluster numbers, the clusters are merged together with the "replace" operation.
std::vector<int> PathPlanning::clustering(){
    const int unknown = -1;
    int clusterCount = 0;
    std::vector<int> clusters(m_map.size(), unknown);
    for (size_t i = 0; i < clusters.size(); ++i) {
        if(m_map.at(i) != e_mapState::FLOOR)
            continue;
        for(auto& nb : neighborLeft(i)) {
            if(clusters.at(nb) != unknown){
                if(clusters.at(i) == unknown){
                    clusters.at(i) = clusters.at(nb);
                }else{
                    std::replace(clusters.begin(), clusters.end(), clusters.at(nb), clusters.at(i));
                }
            }
        }
        if(clusters.at(i) == unknown){
            clusterCount += 1;
            clusters.at(i) = clusterCount;
        }
    }
    return clusters;
}
This looks like a disjoint-set problem, where the running time is O(α(n)) per element, effectively O(1) in practice, so O(n α(n)) overall, effectively O(n).
Your code, by contrast, is O(n × neighbours × n): n for the outer loop, neighbours for the inner loop, and another factor of n for the std::replace.
There is also some Boost code available (boost::disjoint_sets), so you might save some time implementing it.
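A minimal sketch of the structure (not the Boost version), assuming your cluster labels are small integers; each std::replace merge in the inner loop would become a unite call, and a final pass would relabel every cell with find:

#include <utility>
#include <vector>

// disjoint-set (union-find) with path compression and union by size;
// near-constant amortized time per operation
struct DisjointSet {
    std::vector<int> parent, size;
    explicit DisjointSet(int n) : parent(n), size(n, 1) {
        for (int i = 0; i < n; ++i) parent[i] = i;
    }
    int find(int x) {
        while (parent[x] != x) {
            parent[x] = parent[parent[x]]; // path halving
            x = parent[x];
        }
        return x;
    }
    void unite(int a, int b) {
        a = find(a); b = find(b);
        if (a == b) return;
        if (size[a] < size[b]) std::swap(a, b);
        parent[b] = a; // attach the smaller tree under the larger
        size[a] += size[b];
    }
};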
I have a contiguous array of particles in 3D space for a fluid simulation and I need to do a lot of neighbor searches on it. I've found that partitioning the search space into cubic cells and sorting the particles in-place by the cell they are in works well for my problem. That way, for any given cell, its particles lie in a contiguous span, so you can iterate over them all easily if you know the begin and end indices; e.g. cell N might occupy [N_begin, N_end) of the array used to store the particles.
However, no matter how you divide the space, a particle might have neighbors not only in its own cell but also in every neighboring cell (imagine a particle that is almost touching the boundary between cells to understand why). Because of this, a neighbor search needs to return all particles in the same cell as well as all of its neighbors in 3D space, a total of up to 27 cells when not on the edge of the simulation space. There is no ordering of the cells into the array (which is by its nature 1D) that can get all 27 of those spans to be adjacent for any requested cell. To work around this, I keep track of where each cell begins and ends in the particle array and I have a function that determines which cells hold potential neighbors. To represent multiple ranges of indices, it has to return up to 27 pairs of indices signifying the begin and end of those ranges.
std::vector<std::pair<int, int>> get_neighbor_indices(const Vec3f &position);
The index is actually required at some point, so this works better than a pair of iterators or some other abstraction. The problem is that it forces me to use a loop structure that is tied to the implementation. I would like something like the following, using some pseudocode and omissions to simplify:
for(int i = 0; i < num_of_particles; ++i) {
    auto neighbor_indices = get_neighbor_indices(particle[i].position);
    for (int j : neighbor_indices) {
        // do stuff with particle[i] and particle[j]
    }
}
That would only work if neighbor_indices were a complete list of all the indices, but that is a significant number of indices that are trivial to calculate on the fly, so storing them all would be a huge waste of memory. So the best I can get without compromising the performance of the code is the following:
for(int i = 0; i < num_of_particles; ++i) {
    auto neighbor_indices = get_neighbor_indices(particle[i].position);
    for (const auto& indices_pair : neighbor_indices) {
        for (int j = indices_pair.first; j < indices_pair.second; ++j) {
            // do stuff with particle[i] and particle[j]
        }
    }
}
The loss of genericity is a setback for my project because I have to test and measure performance a lot and make adjustments when I come across a performance problem. Implementation-specific code significantly slows down this process.
I'm looking for something like an iterator, except it will return an index instead of a reference. It will allow me to use it as follows.
for(int i = 0; i < num_of_particles; ++i) {
    auto neighbor_indices = get_neighbor_indices(particle[i].position);
    for (int j : neighbor_indices) {
        // do stuff with particle[i] and particle[j]
    }
}
The only issue with this iterator-like approach is that incrementing it is cumbersome. I have to manually keep track of the range I'm in, as well as continuously check when I'm at its end to switch to the next. It's a lot of code to get rid of just that one line that breaks the genericity of the iteration loop. So I'm looking for either a cleaner way to implement the "iterator" or just a better way to iterate over a number of ranges as if they were one.
Keep in mind that this is in a bottleneck computation loop so abstractions have to be zero or negligible cost.
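One way to get that loop shape back is a small range adaptor whose iterator walks the indices covered by the list of [begin, end) pairs as if they were one sequence. This is only a sketch, and multi_range is an illustrative name; the bookkeeping is a couple of integers, so a decent optimizer should keep the cost negligible, but do measure it in your bottleneck loop:

#include <cstddef>
#include <utility>
#include <vector>

class multi_range
{
    const std::vector<std::pair<int, int>>& ranges_;
public:
    explicit multi_range(const std::vector<std::pair<int, int>>& r) : ranges_(r) {}

    class iterator {
        const std::vector<std::pair<int, int>>* ranges_;
        std::size_t r_; // current range
        int i_;         // current index within that range
        void skip_exhausted() {
            // hop to the next non-empty range once the current one is done
            while (r_ < ranges_->size() && i_ >= (*ranges_)[r_].second) {
                ++r_;
                if (r_ < ranges_->size()) i_ = (*ranges_)[r_].first;
            }
        }
    public:
        iterator(const std::vector<std::pair<int, int>>* r, std::size_t pos)
            : ranges_(r), r_(pos), i_(pos < r->size() ? (*r)[pos].first : 0)
        { skip_exhausted(); }
        int operator*() const { return i_; }
        iterator& operator++() { ++i_; skip_exhausted(); return *this; }
        bool operator!=(const iterator& o) const {
            return r_ != o.r_ || (r_ < ranges_->size() && i_ != o.i_);
        }
    };

    iterator begin() const { return iterator(&ranges_, 0); }
    iterator end()   const { return iterator(&ranges_, ranges_.size()); }
};

With that, the desired loop becomes for (int j : multi_range(neighbor_indices)), and the flattening logic lives in one reusable place instead of being repeated at every call site.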
I have made a function that estimates the normal vectors of a 3D point cloud, and it takes a lot of time to run on a cloud of 2 million points. I wanted to multi-thread it by calling the same function on several points at the same time, but it didn't work (it was creating hundreds of threads). Here is what I tried:
// kd-tree used for finding neighbours
pcl::KdTreeFLANN<pcl::PointXYZRGB> kdt;

// cloud iterators
pcl::PointCloud<pcl::PointXYZRGB>::iterator cloud_it = pt_cl->points.begin();
pcl::PointCloud<pcl::PointXYZRGB>::iterator cloud_it1;
pcl::PointCloud<pcl::PointXYZRGB>::iterator cloud_it2;
pcl::PointCloud<pcl::PointXYZRGB>::iterator cloud_it3;
pcl::PointCloud<pcl::PointXYZRGB>::iterator cloud_it4;

// initializing tree
kdt.setInputCloud(pt_cl);

// loop exit condition
bool it_completed = false;

while (!it_completed)
{
    // initializing cloud iterators
    cloud_it1 = cloud_it;
    cloud_it2 = cloud_it++;
    cloud_it3 = cloud_it++;
    if (cloud_it3 != pt_cl->points.end())
    {
        // attaching threads
        boost::thread thread_1 = boost::thread(geom::vectors::find_normal, pt_cl, cloud_it1, kdt, radius, max_neighbs);
        boost::thread thread_2 = boost::thread(geom::vectors::find_normal, pt_cl, cloud_it2, kdt, radius, max_neighbs);
        boost::thread thread_3 = boost::thread(geom::vectors::find_normal, pt_cl, cloud_it3, kdt, radius, max_neighbs);

        // joining threads
        thread_1.join();
        thread_2.join();
        thread_3.join();

        cloud_it++;
    }
    else
    {
        it_completed = true;
    }
}
As you can see, I am trying to call the same function on 3 different points at the same time. Any suggestions on how to make this work? Sorry for the poor code, I'm tired; thank you in advance.
EDIT: here is the find_normal function
Here are the parameters:
@param pt_cl is a pointer to the point cloud to be treated (pcl::PointCloud<PointXYZRGB>::Ptr)
@param cloud_it is an iterator of this cloud (pcl::PointCloud<PointXYZRGB>::iterator)
@param kdt is the kd-tree used to find the closest neighbours of a point
@param radius defines the range in which to search for the neighbours of a point
@param max_neighbs is the maximum number of neighbours to be returned by the radius search
// auxiliary vectors for the kd-tree nearest search
std::vector<int> pointIdxRadiusSearch;         // neighbours' ids
std::vector<float> pointRadiusSquaredDistance; // distances from the source to the neighbours

// the vectors whose cross product gives the normal
geom::vectors::vector3 *vect1;
geom::vectors::vector3 *vect2;
geom::vectors::vector3 *cross_prod;
geom::vectors::vector3 *abs_cross_prod;
geom::vectors::vector3 *normal;
geom::vectors::vector3 *normalized_normal;

// vectors to average
std::vector<geom::vectors::vector3> vct_toavg;

// if there are neighbours left
if (kdt.radiusSearch(*cloud_it, radius, pointIdxRadiusSearch, pointRadiusSquaredDistance, max_neighbs) > 0)
{
    for (size_t pt_index = 0; pt_index < (pointIdxRadiusSearch.size() - 1); pt_index++)
    {
        // defining the first vector
        vect1 = geom::vectors::create_vect2p((*cloud_it), pt_cl->points[pointIdxRadiusSearch[pt_index + 1]]);

        // defining the second vector; making sure there is no 'out of bounds' error
        if (pt_index == pointIdxRadiusSearch.size() - 2)
            vect2 = geom::vectors::create_vect2p((*cloud_it), pt_cl->points[pointIdxRadiusSearch[1]]);
        else
            vect2 = geom::vectors::create_vect2p((*cloud_it), pt_cl->points[pointIdxRadiusSearch[pt_index + 2]]);

        // adding the cross product of the two previous vectors to our list
        cross_prod = geom::vectors::cross_product(*vect1, *vect2);
        abs_cross_prod = geom::aux::abs_vector(*cross_prod);
        vct_toavg.push_back(*abs_cross_prod);

        // freeing memory
        delete vect1;
        delete vect2;
        delete cross_prod;
        delete abs_cross_prod;
    }

    // calculating the normal
    normal = geom::vectors::vect_avg(vct_toavg);

    // calculating the normalized normal
    normalized_normal = geom::vectors::normalize_normal(*normal);

    // coloring the point
    geom::aux::norm_toPtRGB(&(*cloud_it), *normalized_normal);

    // freeing memory
    delete normal;
    delete normalized_normal;

    // clearing vectors
    vct_toavg.clear();
    pointIdxRadiusSearch.clear();
    pointRadiusSquaredDistance.clear();

    // shrinking vectors
    vct_toavg.shrink_to_fit();
    pointIdxRadiusSearch.shrink_to_fit();
    pointRadiusSquaredDistance.shrink_to_fit();
}
Since I don't quite get how the result data is being stored, I'm going to suggest a solution based on OpenMP that matches the code you've posted.
// kd-tree used for finding neighbours
pcl::KdTreeFLANN<pcl::PointXYZRGB> kdt;

#pragma omp parallel for schedule(static)
for (pcl::PointCloud<pcl::PointXYZRGB>::iterator cloud_it = pt_cl->points.begin();
     cloud_it < pt_cl->points.end();
     ++cloud_it) {
    geom::vectors::find_normal(pt_cl, cloud_it, kdt, radius, max_neighbs);
}
Note that you should use the < comparison and not the != one; that's how OpenMP works (it wants random access iterators). I'm using the static schedule since every element should take more or less identical time to process. If that's not the case, try schedule(dynamic) instead.
This solution uses OpenMP, and you may investigate e.g. TBB as well, though it has a higher entrance barrier than OpenMP and uses an OOP-style API.
Also, repeating what I've said in the comments already: OpenMP, as well as TBB, will handle thread management and load distribution for you. You only pass them hints (such as schedule(static)) on how to do it so as to better suit your needs.
Other than that, please, do get in the habit of repeating as little code as you can; ideally, no code should be duplicated. E.g. when you declare many variables of the same type, or call a certain function a few times in a row with a similar pattern, etc. I also see excessive commenting in the code, with an unclear reason behind it.
I'm not sure whether there is an actual performance increase to be had, or whether my computer is just old and slow, but I'll ask anyway.
So I've tried making a program to plot the Mandelbrot set using the cairo library.
The loop that draws the pixels looks as follows:
vector<point_t*>::iterator it;

for(unsigned int i = 0; i < iterations; i++){
    it = points->begin();
    //cout << points->size() << endl;

    double r,g,b;
    r = (double)i+1 / (double)iterations;
    g = 0;
    b = 0;

    while(it != points->end()){
        point_t *p = *it;
        p->Z = (p->Z * p->Z) + p->C;
        if(abs(p->Z) > 2.0){
            cairo_set_source_rgba(cr, r, g, b, 1);
            cairo_rectangle (cr, p->x, p->y, 1, 1);
            cairo_fill (cr);
            it = points->erase(it);
        } else {
            it++;
        }
    }
}
The idea is to color all points that have just escaped the set, and then remove them from the list to avoid evaluating them again.
It does render the set correctly, but the rendering seems to take a lot longer than necessary.
Can someone spot any performance issues with the loop? or is it as good as it gets?
Thanks in advance :)
SOLUTION
Very nice answers, thanks :) - I ended up with a kind of hybrid of the answers. Thinking about what was suggested, I realized that calculating each point, putting it in a vector and then extracting it again was a huge waste of CPU time and memory. So instead, the program now just calculates the Z value of each point without even using point_t or the vector. It now runs A LOT faster!
Edit: I think the suggestion in the answer of kuroi neko is also a very good idea if you do not care about "incremental" computation, but have a fixed number of iterations.
You should use vector<point_t> instead of vector<point_t*>.
A vector<point_t*> is a list of pointers to point_t. Each point is stored at some random location in the memory. If you iterate over the points, the pattern in which memory is accessed looks completely random. You will get a lot of cache misses.
By contrast, vector<point_t> uses contiguous memory to store the points, so the next point is stored directly after the current one. This allows efficient caching.
You should not call erase(it); in your inner loop.
Each call to erase has to move all elements after the one you remove. This has O(n) runtime. For example, you could add a flag to point_t to indicate that it should not be processed any longer. It may be even faster to remove all the "inactive" points after each iteration.
It is probably not a good idea to draw individual pixels using cairo_rectangle. I would suggest you create an image and store the color for each pixel. Then draw the whole image with one draw call.
Your code could look like this:
for(unsigned int i = 0; i < iterations; i++){
    double r,g,b;
    r = (double)(i+1) / (double)iterations; // parenthesized so the whole i+1 is divided
    g = 0;
    b = 0;
    for(vector<point_t>::iterator it=points->begin(); it!=points->end(); ++it) {
        point_t& p = *it;
        if(!p.active) {
            continue;
        }
        p.Z = (p.Z * p.Z) + p.C;
        if(abs(p.Z) > 2.0) {
            cairo_set_source_rgba(cr, r, g, b, 1);
            cairo_rectangle (cr, p.x, p.y, 1, 1);
            cairo_fill (cr);
            p.active = false;
        }
    }
    // perhaps remove all points where p.active == false
}
If you can not change point_t, you can use an additional vector<char> to store if a point has become "inactive".
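For the earlier suggestion of drawing the whole image with one call, a sketch using cairo's image-surface API could look like the following; width, height, iterations and cr are taken from the surrounding program, and escape_count is a hypothetical helper returning the iteration at which a pixel escaped:

#include <stdint.h>

cairo_surface_t *img = cairo_image_surface_create(CAIRO_FORMAT_RGB24, width, height);
cairo_surface_flush(img); // required before touching the raw pixels
unsigned char *data = cairo_image_surface_get_data(img);
int stride = cairo_image_surface_get_stride(img);
for (int y = 0; y < height; ++y) {
    uint32_t *row = (uint32_t *)(data + y * stride);
    for (int x = 0; x < width; ++x) {
        // map the escape iteration to a red intensity, as in the original loop
        uint32_t red = escape_count(x, y) * 255 / iterations; // hypothetical helper
        row[x] = red << 16; // RGB24 layout: 0x00RRGGBB, green/blue left at 0
    }
}
cairo_surface_mark_dirty(img); // tell cairo the pixels changed behind its back
cairo_set_source_surface(cr, img, 0, 0);
cairo_paint(cr);               // one draw call for the whole frame
cairo_surface_destroy(img);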
The Zn divergence computation is what makes the algorithm slow (depending on the area you're working on, of course). In comparison, pixel drawing is mere background noise.
Your loop is flawed because it makes the Zn computation slow.
The way to go is to compute divergence for each point in a tight, optimized loop, and then take care of the display.
Besides, it's useless and wasteful to store Z permanently.
You just need C as an input and the number of iterations as an output.
Assuming your points array only holds C values (you don't really need all this vector machinery, but it won't hurt performance either), you could do something like this:
double g = 0, b = 0; // green and blue stay at 0, as in the original code
for(vector<point_t>::iterator it = points->begin(); it != points->end(); ++it)
{
    complex<double> Z = 0;   // assuming Z and C are complex values
    complex<double> C = it->C;
    unsigned int i;          // declared outside so the escape iteration survives the loop
    for(i = 0; i < iterations; i++) // <-- this is the CPU burner
    {
        Z = Z * Z + C;
        if(abs(Z) > 2.0) break;
    }
    cairo_set_source_rgba(cr, (double)(i+1) / (double)iterations, g, b, 1);
    cairo_rectangle (cr, it->x, it->y, 1, 1);
    cairo_fill (cr);
}
Try to run this with and without the cairo thing and you should see no noticeable difference in execution time (unless you're looking at an empty spot of the set).
Now if you want to go faster, try to break down the Z = Z * Z + C computation in real and imaginary parts and optimize it. You could even use mmx or whatever to do parallel computations.
And of course the way to gain another significant speed factor is to parallelize your algorithm over the available CPU cores (i.e. split your display area into subsets and have different worker threads compute these parts in parallel).
This is not as obvious as it might seem, though, since each sub-picture will have a different computation time (black areas are very slow to compute while white areas are computed almost instantly).
One way to do it is to split the area into a large number of rectangles and have all worker threads pick a random rectangle from a common pool until all rectangles have been processed.
This simple load-balancing scheme makes sure no CPU core will be left idle while its buddies are busy on other parts of the display.
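A minimal sketch of that rectangle pool, using a shared atomic counter instead of a random pick (the effect is the same: fast threads simply claim more tiles); render_tile is a hypothetical function that computes one sub-rectangle into a shared pixel buffer:

#include <atomic>
#include <thread>
#include <vector>

void render_tile(int tx, int ty); // hypothetical: renders one tile of the image

void render_parallel(int tiles_x, int tiles_y, unsigned num_threads)
{
    std::atomic<int> next_tile(0);
    const int tile_count = tiles_x * tiles_y;
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < num_threads; ++t) {
        workers.emplace_back([&]() {
            for (;;) {
                int tile = next_tile.fetch_add(1); // claim the next unprocessed tile
                if (tile >= tile_count) break;     // pool exhausted
                render_tile(tile % tiles_x, tile / tiles_x);
            }
        });
    }
    for (auto &w : workers) w.join();
}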
The first step to optimizing performance is to find out what is slow. Your code mixes three tasks: iterating to calculate whether a point escapes, manipulating a vector of points to test, and plotting the point.
Separate these three operations and measure their contribution. The escape calculation can be parallelised with SIMD operations. The vector operations can be optimised by not erasing the points you want to remove but instead appending the points you want to keep to another vector (erase is O(n), appending is O(1)), and locality improves if you store a vector of points rather than pointers to points; a sketch of the erase-free pattern follows below. If the plotting is slow, use an off-screen bitmap and set points by manipulating the backing memory rather than by using cairo functions.
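A rough sketch of the erase-free pattern, assuming a vector of points by value and a hypothetical plot helper standing in for the cairo calls:

// survivors are appended to a second vector, which is then swapped in;
// each pass costs O(n) total instead of the O(n^2) worst case with erase()
std::vector<point_t> survivors;
survivors.reserve(points.size());
for (point_t &p : points) {
    p.Z = p.Z * p.Z + p.C;
    if (abs(p.Z) > 2.0)
        plot(p);               // hypothetical: color the escaped point
    else
        survivors.push_back(p);
}
points.swap(survivors);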
(I was going to post this, but @Werner Henze already made the same point in a comment, hence community wiki.)
Why is my spatial hash so slow? I am working on code that uses smoothed particle hydrodynamics to model the movement of landslides. In smoothed particle hydrodynamics, each particle influences the particles within a distance of 3 "smoothing lengths". I am trying to implement a spatial hash function in order to have fast look-up of neighboring particles.
For my implementation I made use of the "set" datatype from the STL. At each time step the particles are hashed into their buckets using the function below. "buckets" is a vector of sets, with one set for each grid cell (the spatial domain is limited). Each particle is identified by an integer.
To look for collisions, the function below entitled "getSurroundingParticles" is used; it takes an integer (corresponding to a particle) and returns a set containing all the particles in the grid cells within 3 support lengths of the particle.
The problem is that this implementation is really slow, slower even than just checking each particle against every other particle, when the number of particles is 2000. I was hoping someone could spot a glaring problem in my implementation that I'm not seeing.
//put each particle into its bucket
void hashParticles()
{
    int grid_cell0;
    cellsWithParticles.clear();
    for(int i = 0; i < N; i++)
    {
        // determine the grid cell that the particle occupies,
        // using the hash function grid_cell = floor(x/cell_spacing) + floor(y/cell_spacing)*cell_width
        grid_cell0 = ( floor(Xnew[i]/cell_spacing) ) + ( floor(Ynew[i]/cell_spacing) )*cell_width;

        // keep track of cells with particles; cellsWithParticles is an unordered set, so duplicates are discarded automatically
        cellsWithParticles.insert(grid_cell0);

        // since each of the hash buckets is an unordered set, any duplicates will be discarded too
        buckets[grid_cell0].insert(i);
    }
}
set<int> getSurroundingParticles(int particleOfInterest)
{
    set<int> surroundingParticles;
    int numSurrounding;
    float divisor = (support_length/cell_spacing);
    numSurrounding = ceil( divisor );
    int grid_cell;
    for(int i = -numSurrounding; i <= numSurrounding; i++)
    {
        for(int j = -numSurrounding; j <= numSurrounding; j++)
        {
            grid_cell = (int)( floor( (Xnew[particleOfInterest] + j*cell_spacing)/cell_spacing) )
                      + ( floor( (Ynew[particleOfInterest] + i*cell_spacing)/cell_spacing) )*cell_width;
            surroundingParticles.insert(buckets[grid_cell].begin(), buckets[grid_cell].end());
        }
    }
    return surroundingParticles;
}
The code that calls getSurroundingParticles:
set<int> nearbyParticles;

//for each particle
for ( int i = 0; i < N; i++ )
{
    nearbyParticles = getSurroundingParticles(i);

    //for each particle in the surrounding cells
    for ( std::set<int>::iterator ipoint = nearbyParticles.begin(); ipoint != nearbyParticles.end(); ++ipoint )
    {
        //do stuff to check if the smaller subset of particles collide
    }
}
Thanks a lot!
Your performance is getting eaten alive by the sheer number of STL heap allocations caused by repeatedly creating and populating all those sets. If you profiled the code (say, with a quick and easy non-instrumenting tool like Sleepy), I'm certain you'd find that to be the case.

You're using sets to avoid having a given particle added to a bucket more than once - I get that. If Duck's suggestion doesn't give you what you need, I think you could dramatically improve performance by using preallocated arrays or vectors, and getting uniqueness in those containers by adding an "added" flag to the particle that gets set when the item is added. Then just check that flag before adding, and be sure to clear the flags before the next cycle. (If the number of particles is constant, you can do this extremely efficiently with a preallocated array dedicated to storing the flags, then memsetting it to 0 at the end of the frame.)
That's the basic idea. If you decide to go this route and get stuck somewhere, I'll help you work out the details.
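A rough sketch of the flag-based gather, assuming the buckets become plain std::vector<int> per cell and candidateCells holds the grid cells around the particle of interest (both are hypothetical adaptations of your code):

std::vector<char> added(N, 0);      // one reusable flag per particle
std::vector<int> surroundingParticles;
surroundingParticles.reserve(64);   // rough estimate; tune from measurements

for (int cell : candidateCells)
    for (int p : buckets[cell])
        if (!added[p]) {            // replaces the set's uniqueness check
            added[p] = 1;
            surroundingParticles.push_back(p);
        }

// ... collision checks on surroundingParticles ...

for (int p : surroundingParticles)  // clear only the flags that were set
    added[p] = 0;
surroundingParticles.clear();       // capacity is kept for the next particle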