C++ plotting Mandelbrot set, bad performance - c++

I'm not sure whether there is an actual performance increase to be had, or whether my computer is just old and slow, but I'll ask anyway.
So I've tried making a program to plot the Mandelbrot set using the cairo library.
The loop that draws the pixels looks as follows:
vector<point_t*>::iterator it;
for(unsigned int i = 0; i < iterations; i++){
    it = points->begin();
    //cout << points->size() << endl;
    double r,g,b;
    r = (double)i+1 / (double)iterations;
    g = 0;
    b = 0;
    while(it != points->end()){
        point_t *p = *it;
        p->Z = (p->Z * p->Z) + p->C;
        if(abs(p->Z) > 2.0){
            cairo_set_source_rgba(cr, r, g, b, 1);
            cairo_rectangle (cr, p->x, p->y, 1, 1);
            cairo_fill (cr);
            it = points->erase(it);
        } else {
            it++;
        }
    }
}
The idea is to color all points that just escaped the set, and then remove them from the list to avoid evaluating them again.
It does render the set correctly, but the rendering seems to take a lot longer than it should.
Can someone spot any performance issues with the loop? Or is it as good as it gets?
Thanks in advance :)
SOLUTION
Very nice answers, thanks :) - I ended up with a kind of hybrid of the answers. Thinking about what was suggested, I realized that calculating each point, putting the points in a vector and then extracting them was a huge waste of CPU time and memory. So instead, the program now just calculates the Z value of each point without even using the point_t or the vector. It now runs A LOT faster!

Edit: I think the suggestion in kuroi neko's answer is also a very good idea if you do not care about "incremental" computation but have a fixed number of iterations.
You should use vector<point_t> instead of vector<point_t*>.
A vector<point_t*> is a list of pointers to point_t. Each point is stored at some random location in memory, so when you iterate over the points the memory access pattern looks essentially random and you get a lot of cache misses.
By contrast, vector<point_t> stores the points in contiguous memory, so the next point is stored directly after the current one. This allows efficient caching.
You should not call erase(it); in your inner loop.
Each call to erase has to move all elements after the one you remove. This has O(n) runtime. For example, you could add a flag to point_t to indicate that it should not be processed any longer. It may be even faster to remove all the "inactive" points after each iteration.
It is probably not a good idea to draw individual pixels using cairo_rectangle. I would suggest you create an image and store the color for each pixel. Then draw the whole image with one draw call.
Your code could look like this:
for(unsigned int i = 0; i < iterations; i++){
    double r,g,b;
    r = (double)(i+1) / (double)iterations; // note the parentheses: divide (i+1), not just 1
    g = 0;
    b = 0;
    for(vector<point_t>::iterator it=points->begin(); it!=points->end(); ++it) {
        point_t& p = *it;
        if(!p.active) {
            continue;
        }
        p.Z = (p.Z * p.Z) + p.C;
        if(abs(p.Z) > 2.0) {
            cairo_set_source_rgba(cr, r, g, b, 1);
            cairo_rectangle (cr, p.x, p.y, 1, 1);
            cairo_fill (cr);
            p.active = false;
        }
    }
    // perhaps remove all points where p.active == false
}
If you cannot change point_t, you can use an additional vector<char> to store whether a point has become "inactive".
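To illustrate the earlier point about drawing into an image instead of individual rectangles, here is a rough, untested sketch using a cairo image surface (it assumes width and height are the dimensions of your plot and cr is the cairo context from the question):
cairo_surface_t *img = cairo_image_surface_create(CAIRO_FORMAT_ARGB32, width, height);
cairo_surface_flush(img);                       // required before touching the raw pixels
unsigned char *data = cairo_image_surface_get_data(img);
int stride = cairo_image_surface_get_stride(img);

// inside the escape test, instead of cairo_rectangle()/cairo_fill():
// uint32_t *row = (uint32_t *)(data + p.y * stride);
// row[p.x] = 0xFF000000u | ((uint32_t)(r * 255.0) << 16);   // opaque, red channel only

cairo_surface_mark_dirty(img);                  // tell cairo the raw data changed
cairo_set_source_surface(cr, img, 0, 0);
cairo_paint(cr);                                // one draw call for the whole image
cairo_surface_destroy(img);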

The Zn divergence computation is what makes the algorithm slow (depending on the area you're working on, of course). In comparison, pixel drawing is mere background noise.
Your loop is flawed because it makes the Zn computation slow.
The way to go is to compute divergence for each point in a tight, optimized loop, and then take care of the display.
Besides, it's useless and wasteful to store Z permanently.
You just need C as an input and the number of iterations as an output.
Assuming your points array only holds C values (basically you don't need all this vector crap, but it won't hurt performance either), you could do something like this:
for(vector<point_t>::iterator it = points->begin(); it != points->end(); ++it)
{
    complex<double> Z = 0;          // Z stays local: no need to store it in the point
    complex<double> C = it->C;
    unsigned int i;
    for(i = 0; i < iterations; i++) // <-- this is the CPU burner
    {
        Z = Z * Z + C;
        if(abs(Z) > 2.0) break;
    }
    cairo_set_source_rgba(cr, (double)(i+1) / (double)iterations, 0, 0, 1);
    cairo_rectangle (cr, it->x, it->y, 1, 1);
    cairo_fill (cr);
}
Try to run this with and without the cairo thing and you should see no noticeable difference in execution time (unless you're looking at an empty spot of the set).
Now if you want to go faster, try to break down the Z = Z * Z + C computation into real and imaginary parts and optimize it; something like the sketch below. You could even use MMX/SSE intrinsics to do parallel computations.
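For instance (c_re and c_im stand for the real and imaginary parts of the point's C; the names are illustrative):
double z_re = 0.0, z_im = 0.0;
unsigned int i;
for (i = 0; i < iterations; i++)
{
    double z_re2 = z_re * z_re;
    double z_im2 = z_im * z_im;
    if (z_re2 + z_im2 > 4.0) break;     // |Z| > 2 tested without abs() or a square root
    z_im = 2.0 * z_re * z_im + c_im;    // must use the old z_re, so update z_im first
    z_re = z_re2 - z_im2 + c_re;
}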
And of course the way to gain another significant speed factor is to parallelize your algorithm over the available CPU cores (i.e. split your display area into subsets and have different worker threads compute these parts in parallel).
This is not as obvious as it might seem, though, since each sub-picture will have a different computation time (black areas are very slow to compute while white areas are computed almost instantly).
One way to do it is to split the area into a large number of rectangles, and have all worker threads pick a random rectangle from a common pool until all rectangles have been processed (see the sketch below).
This simple load-balancing scheme makes sure no CPU core will be left idle while its buddies are busy on other parts of the display.
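A rough sketch of that tile-pool idea with std::thread (Tile and renderTile() are hypothetical; renderTile would compute one rectangle of the image):
#include <atomic>
#include <thread>
#include <vector>

void render_all(std::vector<Tile>& tiles, unsigned n_threads)
{
    std::atomic<std::size_t> next(0);           // shared index of the next unprocessed tile
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < n_threads; ++t)
        workers.emplace_back([&] {
            for (std::size_t i = next++; i < tiles.size(); i = next++)
                renderTile(tiles[i]);           // each thread grabs whatever tile is free next
        });
    for (auto& w : workers)
        w.join();
}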

The first step to optimizing performance is to find out what is slow. Your code mixes three tasks: iterating to calculate whether a point escapes, manipulating a vector of points to test, and plotting the point.
Separate these three operations and measure their contribution. You can optimise the escape calculation by parallelising it using SIMD operations. You can optimise the vector operations by not erasing from the vector: instead of removing a point, append the points you want to keep to a second vector (erase is O(N) while appending is O(1); see the sketch below), and improve locality by having a vector of points rather than pointers to points. And if the plotting is slow, use an off-screen bitmap and set points by manipulating the backing memory rather than using cairo functions.
(I was going to post this but @Werner Henze already made the same point in a comment, hence community wiki)
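For reference, a sketch of the "keep the survivors in a second vector" idea, written against a vector<point_t> as suggested in the other answer (untested):
std::vector<point_t> still_iterating;
still_iterating.reserve(points.size());
for (const point_t& p : points)
    if (std::abs(p.Z) <= 2.0)            // has not escaped yet: keep it for the next pass
        still_iterating.push_back(p);
points.swap(still_iterating);            // one O(n) pass instead of an O(n) erase per removed point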

Related

C++: Unsure if code is multithreadable

I'm working on a small piece of code which takes a very long time to complete, so I was thinking of multithreading it, either with pthreads (which I hardly understand but think I can master a lot quicker) or with some GPGPU implementation (probably OpenCL, as I have an AMD card at home and the PCs I use at my office have various NVIDIA cards).
while(sDead < (unsigned long) nrPoints*nrPoints) {
    pPoint1 = distrib(*rng);
    pPoint2 = distrib(*rng);
    outAxel = -1;
    if(pPoint1 != pPoint2) {
        point1 = space->getPointRef(pPoint1);
        point2 = space->getPointRef(pPoint2);
        outAxel = point1->influencedBy(point2, distThres);
        if(outAxel == 0 || outAxel == 1)
            sDead++;
        else
            sDead = 0;
    }
    i++;
}
Where distrib is a uniform_int_distribution with a = 0 and b = nrPoints-1.
For clarity, here is the structure I'm working with:
class Space {
    vector<Point> points;
    // (more stuff)
};
class Point {
    vector<Coords> coordinates;
    // (more stuff)
};
struct Coords {
    char Range;
    bool TypeOfCoord;
    char Coord;
};
The length of coordinates is the same for all Points and Point[x].Coord[y].Range == Point[z].Coord[y].Range for all x, y and z. The same goes for TypeOfCoord.
Some background: during each run of the while loop, two randomly drawn Points from space are tested for interaction. influencedBy() checks whether or not point1 and point2 are close enough to each other to interact (the distance depends on some metric, but it boils down to similarity in Coord; if the distance is smaller than distThres, interaction is possible). Interaction means that one of the Coord variables which doesn't equal the corresponding Coord in the other object is flipped to equal it. This decreases the distance between the Points but also changes the distance of the changed point to every other point in Space, hence my question of whether or not this is multithreadable. As I said, I'm a complete newbie to multithreading and I'm not sure if I can safely implement a function that chops this up, so I was looking for your input. Suggestions are also very welcome.
E: The influencedBy() function (and the functions it in turn calls) can be found here. Functions that I did not include, such as getFeature() and getNrFeatures(), are tiny and cannot possibly contribute much. Note that I used generalised names for objects in this question, but I might mess up or make it more confusing if I replace them in the other code, so I've left the original names there. For the record:
Space = CultSpace
Point = CultVec
Points = Points
Coordinates = Feats
Coords = Feature
TypeOfCoord = Nomin
Coord = Trait
(Choosing "Answer" because the format permits better presentation. Not quite what your're asking for, but let's clarify this first.)
Later
How often is the loop executed until this condition becomes true?
while(sDead < (unsigned long) nrPoints*nrPoints) {
Probably not a big gain, but:
pPoint1 = distrib(*rng);
do {
    pPoint2 = distrib(*rng);
} while( pPoint1 == pPoint2 );
outAxel = -1;
How costly is getPointRef? Linear search in Space?
point1 = space->getPointRef(pPoint1);
point2 = space->getPointRef(pPoint2);
outAxel = point1->influencedBy(point2, distThres);
Is it really necessary to recompute the "distance of the changed point to every other point in Space" immediately after a "flip"?

How do I most efficiently perform collision detection on a group of spheres

Suppose I have a CPU with several cores, on which I want to find which spheres are touching. Any set of spheres where each sphere is connected (i.e. they're all touching at least one of the spheres in the set) is called a "group" and is to be organized into a vector called, in the example below, "group_members". To achieve this I am currently using a rather expensive operation that looks conceptually like this:
vector<Sphere*> unallocated_spheres = all_spheres;        // start with a copy of all spheres
vector<vector<Sphere*>> group_sequence;                   // groups will be collected here
while (unallocated_spheres.size() > 0U) // each iteration of this will represent the creation of a new group
{
    std::vector<Sphere*> group_members;                   // this will store all members of the current group
    group_members.push_back(unallocated_spheres.back());  // start with the last sphere (pop_back requires fewer resources than erase)
    unallocated_spheres.pop_back();                       // it has been allocated to a group so remove it from the unallocated list
    // compare each sphere in the new group to every other sphere, and continue to do so until no more spheres are added to the current group
    for (size_t i = 0U; i != group_members.size(); ++i)   // iterators would be unsuitable in this case
    {
        Sphere const * const sphere = group_members[i];   // the sphere to which all others will be compared to check if they should be added to the group
        auto it = unallocated_spheres.begin();
        while (it != unallocated_spheres.end())
        {
            // check if the iterator sphere belongs to the same group
            if ((*it)->IsTouching(sphere))
            {
                // it does belong to the same group; add it, remove it from the unallocated_spheres vector and repair the iterator
                group_members.push_back(*it);
                it = unallocated_spheres.erase(it);       // repair the iterator
            }
            else ++it; // if no others were found, increment the iterator manually
        }
    }
    group_sequence.push_back(group_members);
}
Does anyone have any suggestions for improving the efficiency of this code in terms of wall time? My program spends a significant fraction of the time running through these loops, and any advice on how to structurally change it to make it more efficient would be appreciated.
Note that as these are spheres, "IsTouching()" is a very quick floating-point operation (comparing the positions and radii of the two spheres). It looks like this (note that x, y and z are the position of the sphere in that Euclidean dimension):
// returns whether this cell is touching the input cell (or whether they are the same cell; both return true)
bool const Sphere::IsTouching(Sphere const * const that) const
{
    // Apply Pythagoras' theorem in 3 dimensions
    double const dx = this->x - that->x;
    double const dy = this->y - that->y;
    double const dz = this->z - that->z;
    // get the sum of the radii of the two cells
    double const rad_sum = this->radius + that->radius;
    // to avoid taking the square root to get actual distances, we instead compare
    // the square of the Pythagorean distance with the square of the radii sum
    return dx*dx + dy*dy + dz*dz < rad_sum*rad_sum;
}
Does anyone have any suggestions for improving the efficiency of this code in terms of wall time?
Change the algorithm. Low-level optimization won't help you (although you'll achieve a very small speedup if you move group_members outside of the while loop).
You need to use space partitioning (BSP tree, octree) or a sweep-and-prune algorithm.
Sweep and prune (Wikipedia has links to the original article, plus you can google it) can easily handle 100000 moving and potentially colliding spheres on a single-core machine (well, as long as you don't put them all at the same coordinates) and is a bit easier to implement than space partitioning. If you know the maximum possible size of a colliding object, sweep and prune will be more suitable/simpler to implement.
If you're going to use the sweep-and-prune algorithm, you should learn the insertion sort algorithm. It is faster than pretty much any other sorting algorithm when you work on "almost" sorted data, which is the case with sweep and prune. Of course, you'll also need some implementation of quicksort or heapsort, but the standard library provides that.
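For illustration, a minimal sketch of the sweep-and-prune broad phase along the x axis (it assumes the Sphere's x and radius members are accessible, and reuses IsTouching from the question as the narrow phase):
#include <algorithm>
#include <utility>
#include <vector>

void find_contacts(std::vector<Sphere*>& spheres,
                   std::vector<std::pair<Sphere*, Sphere*>>& contacts)
{
    // sort by the lower end of each sphere's x interval
    std::sort(spheres.begin(), spheres.end(),
              [](const Sphere* a, const Sphere* b)
              { return a->x - a->radius < b->x - b->radius; });
    for (std::size_t i = 0; i < spheres.size(); ++i)
    {
        const double max_x = spheres[i]->x + spheres[i]->radius;
        // only spheres whose x interval starts before max_x can possibly touch sphere i
        for (std::size_t j = i + 1;
             j < spheres.size() && spheres[j]->x - spheres[j]->radius <= max_x; ++j)
            if (spheres[j]->IsTouching(spheres[i]))
                contacts.push_back({spheres[i], spheres[j]});
    }
}
The groups can then be built from the contact pairs instead of the quadratic scan over unallocated_spheres.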

random access to buffer optimisation

I have a color buffer, Color colorBuffer[width*height] (most likely 800*600),
and during rasterization I call:
void setPixel(int x, int y, Color & color)
{
    colorBuffer[y * width + x] = color;
}
It turns out that this random access to the color buffer is really inefficient and slows my application down.
I think it is caused by the way I use it: I calculate some pixel (with rasterization algorithms) and call setPixel.
So I think my buffer is not in cache and this is the main problem. When trying to write into the whole buffer at once, it is much much faster.
Is there any way, how to optimize this?
edit
I do not use it to fill the buffer with two for loops.
I use it to paint "random" pixels.
e.g. when rasterizing a line I use it like
setPixel(10,10);
calculate next point
setPixel(10,11);
calculate next point
setPixel(next point)
...
The way I see it, the access pattern to the buffer depends on the order in which your algorithm processes the pixels. Can you not simply change that order so that it creates a sequential access pattern to your buffer?
Yes, you should try to be cache-friendly,
but the first thing I would do is find out what's taking time.
It's simple enough. Just pause it several times and see what it's doing.
If it's mostly in calculate next point, you should see what it's doing in there, because that's where the time is going.
(I assume you understand that by "in" I mean "on the stack".)
If it's mostly in SetPixel, when you pause it, look at the disassembly window.
If it's spending much time in the prologue/epilogue of the routine, it should be inlined.
If it's spending much time in the actual move instruction into colorBuffer, then you're hitting the cache issue.
If it's spending much time in the code for the index calculation y * width + x, then you might want to see if you could somehow use an initialized pointer that you step along.
If you fix anything, you should do it all again, because you may have uncovered another opportunity to speed it up further.
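To illustrate the "pointer you step along" idea for a run of pixels on one scanline (x_start and x_end are illustrative names), something like:
Color* row = colorBuffer + y * width;   // index computed once per scanline
for (int x = x_start; x < x_end; ++x)
    row[x] = color;                     // no per-pixel multiply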
The first thing to notice is that the way you process your pixels makes a huge difference to speed. If you do
for (int x = 0; x < width; ++x)
{
    for (int y = 0; y < height; ++y)
    {
        setPixel(x,y,Color());
    }
}
this will be really bad for performance because you're jumping through memory a whole row at a time (note that you index with y*width + x).
If you simply change the order of processing to
for (int y = 0; y < height; ++y)
{
    for (int x = 0; x < width; ++x)
    {
        setPixel(x,y,Color());
    }
}
you should already notice a performance gain, as the processor now gets a chance to cache memory accesses (which it didn't before).
Furthermore, you should check whether you can determine that entire blocks of pixels will have the same color value before actually setting the memory. Then you can copy those constant color values block-wise to your image array, which can also save you a good deal of time.
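For example, a constant-colored horizontal run can be written in one call (x0, x1 and fillColor are illustrative names):
#include <algorithm>
std::fill_n(&colorBuffer[y * width + x0], x1 - x0, fillColor);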

improving performance for graph connectedness computation

I am writing a program to generate a graph and check whether it is connected or not. Below is the code. Here is some explanation: I generate a number of points on the plane at random locations. I then connect the nodes, NOT based on proximity only. By that I mean that a node is more likely to be connected to nodes that are closer, and this is determined by a random variable that I use in the code (h_sq) and the distance. Hence, I generate all links (symmetric, i.e., if i can talk to j then j can also talk to i) and then check with a BFS whether the graph is connected.
My problem is that the code seems to be working properly, but when the number of nodes becomes greater than ~2000 it is terribly slow, and I need to run this function many times for simulation purposes. I even tried other graph libraries, but the performance is the same.
Does anybody know how could I possibly speed everything up?
Thanks,
int Graph::gen_links() {
    if( save == true ) { // in case I want to store the structure of the graph
        links.clear();
        links.resize(xy.size());
    }
    double h_sq, d;
    vector< vector<luint> > neighbors(xy.size());
    // generate links
    double tmp = snr_lin / gamma_0_lin;
    // xy is a std vector of pairs containing the nodes' locations
    for(luint i = 0; i < xy.size(); i++) {
        for(luint j = i+1; j < xy.size(); j++) {
            // generate |h|^2
            d = distance(i, j);
            if( d < d_crit ) // for sim purposes
                d = 1.0;
            h_sq = pow(mrand.randNorm(0, 1), 2.0) + pow(mrand.randNorm(0, 1), 2.0);
            if( h_sq * tmp >= pow(d, alpha) ) {
                // there exists a link between i and j
                neighbors[i].push_back(j);
                neighbors[j].push_back(i);
                // options
                if( save == true )
                    links.push_back( make_pair(i, j) );
            }
        }
        if( neighbors[i].empty() && save == false ) {
            // graph not connected. since save=false i dont need to store the structure,
            // hence I exit
            connected = 0;
            return 1;
        }
    }
    // here I do BFS to check whether the graph is connected or not, using neighbors
    // BFS code...
    return 1;
}
UPDATE:
the main problem seems to be the push_back calls within the inner for loops. It's the part that takes most of the time in this case. Shall I use reserve() to increase efficiency?
Are you sure the slowness is caused by the generation and not by your search algorithm?
The graph generation is O(n^2) and you can't do much about that. However, you can apparently trade memory for some of the time if the point locations are fixed for at least some of the experiments.
First, the distances of all node pairs, and pow(d, alpha), can be precomputed and saved in memory so that you don't need to compute them again and again. The extra memory cost for 10000 nodes will be about 800 MB for double and 400 MB for float.
In addition, the sum of squares of two standard normal variables follows a chi-square distribution, if I remember correctly. Perhaps you can use a precomputed lookup table if the accuracy allows?
Finally, if the probability that two nodes are connected becomes negligible once the distance exceeds some value, then you don't need O(n^2): you can calculate only those node pairs whose distance is below that limit.
As a first step you should try to use reserve for both the inner and the outer vectors.
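Something along these lines (expected_degree is just a guess you would have to tune for your setup):
vector< vector<luint> > neighbors(xy.size());
for(luint i = 0; i < xy.size(); i++)
    neighbors[i].reserve(expected_degree);          // avoid repeated reallocations in push_back
if( save == true )
    links.reserve(xy.size() * expected_degree / 2); // same idea for the links vector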
If this does not bring performance up to your expectations, I believe it is because memory allocations are still happening.
There is a handy class I've used in similar situations, llvm::SmallVector (find it in Google). It provides a vector with a few pre-allocated items, so you can decrease the number of allocations by one per vector.
It can still grow when it runs out of pre-allocated space.
So:
1) Examine the number of items you have in your vectors on average during runs (I'm talking about both inner and outer vectors)
2) Use llvm::SmallVector with a pre-allocation of that size (as the pre-allocated storage lives on the stack, you might need to increase the stack size, or reduce the pre-allocation if you are restricted on available stack memory).
Another good thing about SmallVector is that it has almost the same interface as std::vector (it can easily be dropped in as a replacement).

Help with code optimization

I've written a little particle system for my 2D application. Here is the rain code:
// HPP -----------------------------------
struct Data
{
    float x, y, x_speed, y_speed;
    int timeout;
    Data();
};
std::vector<Data> mData;
bool mFirstTime;
void processDrops(float windPower, int i);
// CPP -----------------------------------
Data::Data()
    : x(rand()%ScreenResolutionX), y(0)
    , x_speed(0), y_speed(0), timeout(rand()%130)
{ }
void Rain::processDrops(float windPower, int i)
{
    int posX = rand() % mWindowWidth;
    mData[i].x = posX;
    mData[i].x_speed = WindPower*0.1; // WindPower is float
    mData[i].y_speed = Gravity*0.1;   // Gravity is 9.8 * 19.2
    // If this is the first time, spread the drops randomly over the window height
    if (mFirstTime)
    {
        mData[i].timeout = 0;
        mData[i].y = rand() % mWindowHeight;
    }
    else
    {
        mData[i].timeout = rand() % 130;
        mData[i].y = 0;
    }
}
void update(float windPower, float elapsed)
{
    // If this is the first time - create the array with new Data structure objects
    if (mFirstTime)
    {
        for (int i=0; i < mMaxObjects; ++i)
        {
            mData.push_back(Data());
            processDrops(windPower, i);
        }
        mFirstTime = false;
    }
    for (int i=0; i < mMaxObjects; i++)
    {
        // Sleep until timeout reaches 0 (to make drops fall with random timeouts)
        if (mData[i].timeout > 0)
        {
            mData[i].timeout--;
        }
        else
        {
            // Find new x/y positions
            mData[i].x += mData[i].x_speed * elapsed;
            mData[i].y += mData[i].y_speed * elapsed;
            // Find new speeds
            mData[i].x_speed += windPower * elapsed;
            mData[i].y_speed += Gravity * elapsed;
            // Drawing here ...
            // If the drop has fallen out of the screen
            if (mData[i].y > mWindowHeight) processDrops(windPower, i);
        }
    }
}
So the main idea is: I have a structure which consists of a drop's position and speed, and a function for (re)initializing the drop at a given index in the vector. On the first run I fill the vector to its maximum size and then process it in a loop.
But this code runs slower than everything else I have. Please help me optimize it.
I tried to replace all int with uint16_t but I think it doesn't matter.
Replacing int with uint16_t shouldn't make any difference (it'll take less memory, but shouldn't affect running time on most machines).
The code shown already seems pretty fast (it does only what it needs to do, and there are no particular mistakes); I don't see how you could optimize it much further (at most you could remove the check on mFirstTime, but that should make no difference).
If it's slow, it's because of something else. Maybe you've got too many drops, or the rest of your code is so slow that update gets called only a few times per second.
I'd suggest you to profile your program and see where most time is spent.
EDIT:
one thing that could speed up such an algorithm, especially if your system has no FPU (which is not the case for a personal computer...), would be to replace your floating-point values with integers.
Just multiply the elapsed variable (and your constants, like those 0.1) by 1000 so that they will represent milliseconds, and use only integers everywhere.
A few points:
The physics is incorrect: the wind force should decrease as the drop's speed gets close to the wind speed; also, for simplicity I would assume that the initial value of x_speed is the speed of the wind.
You don't account for friction with the air at all, so the drops keep getting faster and faster, but that depends on what you want to model.
I would simply assume that a drop falls at constant speed in a constant direction, because that is really what happens very quickly.
Also, you can optimize all this very simply: you don't need to solve the motion equation by integration, as it can be solved directly:
x(t):= x_0 + wind_speed * t
y(t):= y_0 - fall_speed * t
This is the case of steady fall, where the force of gravity is balanced by friction.
x(t):= x_0 + wind_speed * t;
y(t):= y_0 - 0.5 * g * t^2;
Use this pair instead if you want to model drops that fall faster and faster.
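For illustration, a tiny sketch of evaluating these closed-form equations directly each frame (drop, x0, y0 and age are hypothetical names; wind_speed and fall_speed correspond to the formulas above, and the fall term is added because screen y grows downward):
float t = drop.age;                              // time since this drop was (re)spawned
drop.x = drop.x0 + wind_speed * t;               // constant horizontal drift
drop.y = drop.y0 + fall_speed * t;               // constant-speed fall
// accelerating variant: drop.y = drop.y0 + 0.5f * Gravity * t * t;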
A few things to consider:
In your processDrops function, you pass in windPower but use some class member or global called WindPower; is that a typo? If the value of Gravity does not change, then precompute Gravity*0.1 once and use it directly.
In your update function, rather than calculating windPower * elapsed and Gravity * elapsed for every iteration, calculate and save them before the loop, then add. Also, reorganise the loop: there is no need to do the speed calculation and render if the drop is out of the screen; do the check first, and only if the drop is still on screen update the speed and render!
Interestingly, you never check whether the drop is out of the screen in terms of its x coordinate: you check the height but not the width. You could save yourself some calculations and rendering time if you did this check as well!
In the loop, introduce a reference Data& current = mData[i] and use it instead of mData[i]. Use a reference instead of an index in processDrops as well.
BTW, I thought that checking mFirstTime in processDrops served no purpose because it would never be true there, but I missed the processDrops call in the initialization loop. Never mind this.
This looks pretty fast to me already.
You could get some tiny speedup by removing the "first time" code and putting it in its own function to call once, rather than testing on every call.
You are doing the same calculation on lots of similar data, so maybe you could look into using SSE intrinsics to process several items at once. You'll likely have to rearrange your data structure for that, though, to be a structure of arrays rather than a vector of structures as it is now (see the sketch below). I doubt it would help too much, though. How many items are in your vector anyway?
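For what it's worth, a sketch of what that structure-of-arrays layout might look like (illustrative only, not the actual types from the question):
#include <vector>
struct Drops
{
    std::vector<float> x, y, x_speed, y_speed;   // one contiguous array per field
    std::vector<int>   timeout;
};
// The position update then becomes a tight loop over contiguous floats, which is
// what SSE-style processing (and the cache) prefers:
// for (int i = 0; i < n; ++i) { drops.x[i] += drops.x_speed[i] * elapsed; }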
It looks like maybe all your time goes into ... Drawing Here.
It's easy enough to find out for sure where the time is going.