This non-member function drawPoly() draws an n-sided polygon in 3D space from a list of vertices.
This function typically gets called thousands of times during normal execution and speed is critical.
Ignoring the effects of the functions called within drawPoly(), does the allocation of the 25-element vertex array have any negative effects on speed?
void drawPoly(const meshx::Face& face, gen::Vector position,
              ALLEGRO_COLOR color, bool filled)
{
    ALLEGRO_VERTEX vertList[25];
    std::size_t k = 0;
    // ...For every vertex in the polygon...
    for (; k < face.getNumVerts(); ++k) {
        vertList[k].x = position.x + face.alVerts[k].x;
        vertList[k].y = position.y + face.alVerts[k].y;
        vertList[k].z = position.z + face.alVerts[k].z;
        vertList[k].u = 0;
        vertList[k].v = 0;
        vertList[k].color = color;
    }
    // Draw with ALLEGRO_VERTEXs and no textures.
    if (filled) {
        al_draw_prim(vertList, nullptr, nullptr,
                     0, k, ALLEGRO_PRIM_TRIANGLE_LIST);
    } else {
        al_draw_prim(vertList, nullptr, nullptr,
                     0, k, ALLEGRO_PRIM_LINE_LOOP);
    }
}
The only way to tell for sure is to measure. But what would you compare against? Allocating on the heap would obviously be slower. Using a global variable to hold the vertices could be an option, though only for the benchmark.
Given that stack allocation of trivially constructible objects usually translates to a simple adjustment of the stack pointer, the allocation itself probably isn't a big deal. What could have an observable effect, though, is touching extra cache lines: the fewer cache lines the code writes, the better for performance. You could therefore experiment with splitting vertList[25] into cache-line-sized arrays and calling al_draw_prim multiple times. A benchmark would show whether there's a difference.
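For the heap-allocation comparison point mentioned above, a minimal sketch might look like this (drawPolyHeap is a hypothetical name; it reuses the question's types and does the same work, only with the vertex list on the heap):

#include <vector>

// Hypothetical benchmark variant: same body as drawPoly(), but the vertex
// list lives on the heap so the two allocation strategies can be timed
// against each other.
void drawPolyHeap(const meshx::Face& face, gen::Vector position,
                  ALLEGRO_COLOR color, bool filled)
{
    std::vector<ALLEGRO_VERTEX> vertList(face.getNumVerts());
    for (std::size_t k = 0; k < vertList.size(); ++k) {
        vertList[k].x = position.x + face.alVerts[k].x;
        vertList[k].y = position.y + face.alVerts[k].y;
        vertList[k].z = position.z + face.alVerts[k].z;
        vertList[k].u = 0;
        vertList[k].v = 0;
        vertList[k].color = color;
    }
    al_draw_prim(vertList.data(), nullptr, nullptr, 0,
                 static_cast<int>(vertList.size()),
                 filled ? ALLEGRO_PRIM_TRIANGLE_LIST : ALLEGRO_PRIM_LINE_LOOP);
}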
I have a for loop that I use to draw a grid of tiles with SDL in a game. Since the grid is quite huge, with more than 50k elements, I want to optimize it.
There is this function that I use to check whether I should draw a tile; if it's outside of the screen, I ignore it.
bool Camera::isInViewport(int &x, int &y, int &w, int &h) {
    int translatedX = x + offsetX;
    int translatedY = y + offsetY;
    if (translatedX + w >= 0 && translatedX <= sdl.windowWidth) {
        if (translatedY + h >= 0 && translatedY <= sdl.windowHeight) {
            return true;
        }
    }
    return false;
}
I checked, and this function alone is eating 15% of the CPU when the grid is big. Would it be possible to make this faster? I can't think of a way to make it use fewer resources.
There is not a lot that you can do with this function. Don't pass ints by reference: internally they are passed as pointers, which adds the cost of a dereference. Merge the conditions into one if statement and put first the ones most likely to evaluate to false, so short-circuiting can bail out early.
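A minimal sketch of both suggestions (the declaration in the class would need the matching by-value signature):

bool Camera::isInViewport(int x, int y, int w, int h) {
    const int tx = x + offsetX;
    const int ty = y + offsetY;
    // One expression; short-circuiting exits at the first false test.
    return tx <= sdl.windowWidth && tx + w >= 0 &&
           ty <= sdl.windowHeight && ty + h >= 0;
}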
What I would do instead to solve this performance issue is to organize your tiles in a 2D array where the index and the coordinates can be calculated from each other. In that case you only need the index boundaries of the tiles covered by your viewport: instead of checking the result of this function for every cell, you can determine the left and right X indices and the top and bottom Y indices, and then draw the visible tiles in two nested loops like this (the boundary computation is sketched after the loop):
for (int y = topY; y <= bottomY; ++y)
    for (int x = leftX; x <= rightX; ++x)
        // do drawing with tile[y][x];
Another approach would be to cache the previous results. If the camera is not moving and the tiles are not moving, the result of this function is not going to change. Storing a flag per tile that indicates whether it is visible could work here (though it's not good practice in a big game): update the flags every time the camera moves, and recalculate a tile when it moves (if that's possible in your app). Still, recalculating all visibility flags on every camera movement will be expensive, so prefer the first optimization and reduce the task by finding which tile range is affected by the camera at all.
I'm not sure if there is an actual performance increase to achieve, or if my computer is just old and slow, but I'll ask anyway.
So I've tried making a program to plot the Mandelbrot set using the cairo library.
The loop that draws the pixels looks as follows:
vector<point_t*>::iterator it;
for (unsigned int i = 0; i < iterations; i++) {
    it = points->begin();
    //cout << points->size() << endl;
    double r, g, b;
    r = (double)(i + 1) / (double)iterations;
    g = 0;
    b = 0;
    while (it != points->end()) {
        point_t *p = *it;
        p->Z = (p->Z * p->Z) + p->C;
        if (abs(p->Z) > 2.0) {
            cairo_set_source_rgba(cr, r, g, b, 1);
            cairo_rectangle(cr, p->x, p->y, 1, 1);
            cairo_fill(cr);
            it = points->erase(it);
        } else {
            it++;
        }
    }
}
The idea is to color all points that have just escaped the set, and then remove them from the list to avoid evaluating them again.
It does render the set correctly, but it seems that the rendering takes a lot longer than needed.
Can someone spot any performance issues with the loop, or is it as good as it gets?
Thanks in advance :)
SOLUTION
Very nice answers, thanks :) - I ended up with a kind of hybrid of the answers. Thinking about what was suggested, I realized that calculating each point, putting the points in a vector and then extracting them again was a huge waste of CPU time and memory. So instead, the program now just calculates the Z value of each point without even using point_t or the vector. It now runs A LOT faster!
Edit: I think the suggestion in the answer of kuroi neko is also a very good idea if you do not care about "incremental" computation, but have a fixed number of iterations.
You should use vector<point_t> instead of vector<point_t*>.
A vector<point_t*> is a list of pointers to point_t. Each point is stored at some arbitrary location in memory, so if you iterate over the points, the pattern in which memory is accessed looks completely random. You will get a lot of cache misses.
By contrast, vector<point_t> uses contiguous memory to store the points, so the next point is stored directly after the current one. This allows efficient caching.
You should not call erase(it); in your inner loop.
Each call to erase has to move all elements after the one you remove, which is O(n). Instead, you could for example add a flag to point_t to indicate that it should not be processed any longer. It may be even faster to remove all the "inactive" points in one pass after each iteration.
It is probably not a good idea to draw individual pixels using cairo_rectangle. I would suggest you create an image and store the color for each pixel. Then draw the whole image with one draw call.
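A sketch of that idea with a cairo image surface: write the pixels into the surface's backing memory, then paint the whole surface in one call (width and height here are assumed image dimensions):

cairo_surface_t* img = cairo_image_surface_create(CAIRO_FORMAT_RGB24,
                                                  width, height);
unsigned char* data = cairo_image_surface_get_data(img);
int stride = cairo_image_surface_get_stride(img);
// ... write one 4-byte pixel per point at data[y * stride + 4 * x] ...
cairo_surface_mark_dirty(img);           // tell cairo the raw data changed
cairo_set_source_surface(cr, img, 0, 0);
cairo_paint(cr);                         // one draw call for the whole image
cairo_surface_destroy(img);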
Your code could look like this:
for (unsigned int i = 0; i < iterations; i++) {
    double r, g, b;
    r = (double)(i + 1) / (double)iterations;
    g = 0;
    b = 0;
    for (vector<point_t>::iterator it = points->begin(); it != points->end(); ++it) {
        point_t& p = *it;
        if (!p.active) {
            continue;
        }
        p.Z = (p.Z * p.Z) + p.C;
        if (abs(p.Z) > 2.0) {
            cairo_set_source_rgba(cr, r, g, b, 1);
            cairo_rectangle(cr, p.x, p.y, 1, 1);
            cairo_fill(cr);
            p.active = false;
        }
    }
    // perhaps remove all points where p.active == false
}
If you can not change point_t, you can use an additional vector<char> to store if a point has become "inactive".
The Zn divergence computation is what makes the algorithm slow (depending on the area you're working on, of course). In comparison, pixel drawing is mere background noise.
Your loop is flawed because it makes the Zn computation slow.
The way to go is to compute divergence for each point in a tight, optimized loop, and then take care of the display.
Besides, it's useless and wasteful to store Z permanently.
You just need C as an input and the number of iterations as an output.
Assuming your points array only holds C values (you don't really need all this vector crap, but it won't hurt performance either), you could do something like this:
for (vector<point_t>::iterator it = points->begin(); it != points->end(); ++it)
{
    point_t& p = *it;
    complex<double> Z = 0;            // Z is local; no need to store it
    unsigned int i;
    for (i = 0; i < iterations; i++)  // <-- this is the CPU burner
    {
        Z = Z * Z + p.C;
        if (abs(Z) > 2.0) break;
    }
    cairo_set_source_rgba(cr, (double)(i + 1) / (double)iterations, 0, 0, 1);
    cairo_rectangle(cr, p.x, p.y, 1, 1);
    cairo_fill(cr);
}
Try to run this with and without the cairo thing and you should see no noticeable difference in execution time (unless you're looking at an empty spot of the set).
Now if you want to go faster, try breaking the Z = Z * Z + C computation down into real and imaginary parts and optimizing it. You could even use SIMD instructions to do parallel computations.
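For example, with C split into real and imaginary parts the inner loop needs no complex-number calls and no square root (a sketch; cRe and cIm are assumed to hold the parts of C):

double zr = 0.0, zi = 0.0;           // real and imaginary parts of Z
unsigned int i;
for (i = 0; i < iterations; ++i) {
    const double zr2 = zr * zr;
    const double zi2 = zi * zi;
    if (zr2 + zi2 > 4.0) break;      // |Z| > 2  <=>  |Z|^2 > 4, no sqrt needed
    zi = 2.0 * zr * zi + cIm;        // (zr + zi*i)^2 = zr^2 - zi^2 + 2*zr*zi*i
    zr = zr2 - zi2 + cRe;
}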
And of course the way to gain another significant speed factor is to parallelize your algorithm over the available CPU cores (i.e. split your display area into subsets and have different worker threads compute these parts in parallel).
This is not as obvious as it might seem, though, since each sub-picture will have a different computation time (black areas are very slow to compute while white areas are computed almost instantly).
One way to do it is to split the area into a large number of rectangles and have all worker threads pick a random rectangle from a common pool until all rectangles have been processed.
This simple load-balancing scheme makes sure no CPU core will be left idle while its buddies are busy on other parts of the display.
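A minimal sketch of that scheme with standard threads: an atomic counter stands in for the common pool (handing rectangles out in order rather than randomly, which balances load the same way), and renderRect() is a hypothetical worker that computes one sub-rectangle:

#include <atomic>
#include <thread>
#include <vector>

void renderParallel(int numRects, unsigned numThreads)
{
    std::atomic<int> next{0};           // the shared pool of rectangles
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < numThreads; ++t)
        workers.emplace_back([&] {
            // Each thread grabs the next unprocessed rectangle until none remain.
            for (int r = next++; r < numRects; r = next++)
                renderRect(r);          // hypothetical per-rectangle worker
        });
    for (auto& w : workers) w.join();
}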
The first step in optimizing performance is to find out what is slow. Your code mixes three tasks: iterating to calculate whether a point escapes, manipulating a vector of points to test, and plotting the point.
Separate these three operations and measure the contribution of each. You can optimize the escape calculation by parallelizing it with SIMD operations. You can optimize the vector operations by not erasing points you want to remove, but instead appending the points you want to keep to a second vector (erase is O(N) while appending is amortized O(1)), and you can improve locality by using a vector of points rather than of pointers to points. If the plotting is slow, use an off-screen bitmap and set points by manipulating its backing memory rather than through cairo functions.
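A sketch of the two-vector idea (plot() is a hypothetical drawing helper, and points is taken to be a vector<point_t> as suggested above):

std::vector<point_t> keep;
keep.reserve(points.size());
for (point_t& p : points) {
    p.Z = p.Z * p.Z + p.C;
    if (abs(p.Z) > 2.0)
        plot(p);            // hypothetical: draw the escaped point
    else
        keep.push_back(p);  // survivors: amortized O(1) per append
}
points.swap(keep);          // one O(N) pass instead of O(N) per erase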
(I was going to post this but @Werner Henze already made the same point in a comment, hence community wiki)
I'm working on a small piece of code that takes a very long time to complete, so I was thinking of multithreading it, either with pthreads (which I hardly understand, but think I could master a lot quicker) or with some GPGPU implementation (probably OpenCL, as I have an AMD card at home and the PCs at my office have various NVIDIA cards).
while (sDead < (unsigned long) nrPoints * nrPoints) {
    pPoint1 = distrib(*rng);
    pPoint2 = distrib(*rng);
    outAxel = -1;
    if (pPoint1 != pPoint2) {
        point1 = space->getPointRef(pPoint1);
        point2 = space->getPointRef(pPoint2);
        outAxel = point1->influencedBy(point2, distThres);
        if (outAxel == 0 || outAxel == 1)
            sDead++;
        else
            sDead = 0;
    }
    i++;
}
Where distrib is a uniform_int_distribution with a = 0 and b = nrPoints-1.
For clarity, here is the structure I'm working with:
class Space {
    vector<Point> points;
    // (more stuff)
};

class Point {
    vector<Coords> coordinates;
    // (more stuff)
};

struct Coords {
    char Range;
    bool TypeOfCoord;
    char Coord;
};
The length of coordinates is the same for all Points and Point[x].Coord[y].Range == Point[z].Coord[y].Range for all x, y and z. The same goes for TypeOfCoord.
Some background: during each run of the while loop, two randomly drawn Points from space are tested for interaction. influencedBy() checks whether point1 and point2 are close enough to each other to interact (the distance depends on some metric, but it boils down to similarity in Coord; if the distance is smaller than distThres, interaction is possible). Interaction means that one of the Coord variables that doesn't equal the corresponding Coord in the other object is flipped to equal it. This decreases the distance between the two Points, but it also changes the distance of the changed point to every other point in Space, hence my question of whether or not this is multithreadable. As I said, I'm a complete newbie to multithreading and I'm not sure if I can safely implement a function that chops this up, so I was looking for your input. Suggestions are also very welcome.
E: The influencedBy() function (and the functions it in turn calls) can be found here. Functions that I did not include, such as getFeature() and getNrFeatures(), are tiny and cannot possibly contribute much. Note that I used generalised names for objects in this question, but I might mess things up or make them more confusing if I replaced them in the other code, so I've left the original names there. For the record:
Space = CultSpace
Point = CultVec
Points = Points
Coordinates = Feats
Coords = Feature
TypeOfCoord = Nomin
Coord = Trait
(Choosing "Answer" because the format permits better presentation. Not quite what you're asking for, but let's clarify this first.)
Later
How many times is the loop executed before this condition becomes false?
while(sDead < (unsigned long) nrPoints*nrPoints) {
Probably not a big gain, but:
pPoint1 = distrib(*rng);
do {
    pPoint2 = distrib(*rng);
} while (pPoint1 == pPoint2);
outAxel = -1;
How costly is getPointRef? Linear search in Space?
point1 = space->getPointRef(pPoint1);
point2 = space->getPointRef(pPoint2);
outAxel = point1->influencedBy(point2, distThres);
Is it really necessary to recompute the "distance of the changed point to every other point in Space" immediately after a "flip"?
Over the past few days I made my first "engine" thingy. A central object with a window object, graphics object, and an input object - all nice and encapsulated and happy.
In this setup I also included some objects in the graphics object that handle some 'utility' functions, like a camera and a 'vindex' manager.
The Vertex/Index Manager stores all vertices and indices in std::vectors, that are called upon and sent to graphics when it's time to create the buffers.
The only problem is that I get ~8 frames a second with only 8-10 rectangles.
I think the problem is in the 'Vindex' object (my shader is nothing spectacular, and the pipeline is pretty vanilla).
Is storing Vertices in this way a plum bad idea, or is there just some painfully obvious thing I'm missing?
I did a little evolution sim project a few years ago that was pretty messy code-wise, but it rendered 20,000 vertices at 100s of frames a second on this machine, so it's not my machine that's slow.
I've been kind of staring at this for several hours, any and all input is VERY much appreciated :)
Example from my object that stores my vertices:
for (int i = 0; i < 24; ++i)
{
    mVertList.push_back(Vertex(v[i], n[i], col));
}
For Clarity's sake
std::vector<Vertex> mVertList;
std::vector<int> mIndList;
and
std::vector<Vertex> VindexPile::getVerts()
{
    return mVertList;
}

std::vector<int> VindexPile::getInds()
{
    return mIndList;
}
In my graphics.cpp file:
md3dDevice->CreateVertexBuffer(mVinds.getVerts().size() * sizeof(Vertex),
                               D3DUSAGE_WRITEONLY, 0, D3DPOOL_MANAGED, &mVB, 0);
Vertex* v = 0;
mVB->Lock(0, 0, (void**)&v, 0);
std::vector<Vertex> vList = mVinds.getVerts();
for (int i = 0; i < mVinds.getVerts().size(); ++i)
{
    v[i] = vList[i];
}
mVB->Unlock();

md3dDevice->CreateIndexBuffer(mVinds.getInds().size() * sizeof(WORD),
                              D3DUSAGE_WRITEONLY, D3DFMT_INDEX16, D3DPOOL_MANAGED, &mIB, 0);
WORD* ind = 0;
mIB->Lock(0, 0, (void**)&ind, 0);
std::vector<int> iList = mVinds.getInds();
for (int i = 0; i < mVinds.getInds().size(); ++i)
{
    ind[i] = iList[i];
}
mIB->Unlock();
There is quite a bit of copying going on in here. I cannot tell for sure without running a profiler and seeing more code, but this seems like the first culprit:
std::vector<Vertex> vList = mVinds.getVerts();
std::vector<int> iList = mVinds.getInds();
Those two calls create copies of your vertex/index buffers, which is most probably not what you want: declare them as const references instead. The copies also ruin cache coherency, which slows your program down further.
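A sketch of the const-reference version of those accessors (the declarations in the class would need matching changes):

const std::vector<Vertex>& VindexPile::getVerts() const { return mVertList; }
const std::vector<int>&    VindexPile::getInds()  const { return mIndList; }

// At the call site, bind references instead of copying:
const std::vector<Vertex>& vList = mVinds.getVerts();
const std::vector<int>&    iList = mVinds.getInds();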
mVertList.push_back(Vertex(v[i], n[i], col));
This moves and resizes the vector quite a lot as well; you should probably call reserve (or resize) before pushing data into your vectors, to avoid repeated reallocation and moving your data through memory.
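For example, with the known element count from the loop above:

mVertList.reserve(24);   // one allocation up front, no reallocation below
for (int i = 0; i < 24; ++i)
    mVertList.push_back(Vertex(v[i], n[i], col));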
If I have to give you one big piece of advice, however, it is this: profile. I don't know what tools you have access to, but there are plenty of profilers available; pick one and learn it, and it will provide much more valuable insight into why your program is slow.
I have a color buffer, Color colorBuffer[width * height] (most likely 800*600), and during rasterization I call:
void setPixel(int x, int y, Color& color)
{
    colorBuffer[y * width + x] = color;
}
It turns out that this random access to the color buffer is really inefficient and slows my application down.
I think it is caused by the way I use it: I calculate some pixel (with rasterization algorithms) and call setPixel.
So I think my buffer is not in cache, and this is the main problem. When writing to the whole buffer at once, it is much, much faster.
Is there any way, how to optimize this?
edit
I do not use it to fill the buffer with two for loops.
I use it to paint "random" pixels.
E.g. when rasterizing a line, I use it like:
setPixel(10,10);
calculate next point
setPixel(10,11);
calculate next point
setPixel(next point)
...
The way I see it, the access pattern to the buffer depends on the order in which your algorithm processes the pixels. Can you not simply change that order so that it accesses your buffer sequentially?
Yes, you should try to be cache-friendly,
but the first thing I would do is find out what's taking time.
It's simple enough. Just pause it several times and see what it's doing.
If it's mostly in calculate next point, you should see what it's doing in there, because that's where the time is going.
(I assume you understand that by "in" I mean "on the stack".)
If it's mostly in setPixel, when you pause it, look at the disassembly window.
If it's spending much time in the prologue/epilogue of the routine, it should be inlined.
If it's spending much time in the actual move instruction into colorBuffer, then you're hitting the cache issue.
If it's spending much time in the code for the index calculation y * width + x, then you might want to see if you could somehow use an initialized pointer that you step along.
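For instance, for a vertical line the index y * width + x grows by exactly width per pixel, so a stepped pointer removes the per-pixel multiply (a sketch using the question's buffer; y0, y1 and x are assumed line endpoints):

Color* p = colorBuffer + y0 * width + x;  // index computed once
for (int y = y0; y <= y1; ++y) {
    *p = color;
    p += width;                           // next row: one add, no multiply
}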
If you fix anything, you should do it all again, because you may have uncovered another opportunity to speed it up further.
The first thing to notice is that the way you process your pixels makes a huge difference to speed. If you do
for (int x = 0; x < width; ++x)
{
    for (int y = 0; y < height; ++y)
    {
        setPixel(x, y, Color());
    }
}
this will be really bad for performance because you're literally jumping around in memory width-wise on every step (note that the index is y * width + x).
If you simply change the order of processing to
for (int y = 0; y < height; ++y)
{
    for (int x = 0; x < width; ++x)
    {
        setPixel(x, y, Color());
    }
}
you should already notice a performance gain, as the processor now gets a chance to cache memory accesses (which it couldn't before).
Furthermore, you should check whether you can determine that entire blocks of pixels will have the same color value before actually setting the memory. Then you can copy those constant color values block-wise to your image array, which can also save you a good deal of performance.
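A sketch of that block-wise idea for a horizontal run of constant color (setRun is a hypothetical helper built on the question's buffer):

#include <algorithm>

void setRun(int x, int y, int len, const Color& color)
{
    Color* row = colorBuffer + y * width + x;
    std::fill(row, row + len, color);  // sequential, cache-friendly writes
}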