Understanding when to use CCSpriteBatchNodes? - cocos2d-iphone

I have seen in several places, including the source code of CCSpriteBatchNode, that it is "expensive" to add/remove children from it. My understanding is that the whole point of using batch nodes is to prevent expensive OpenGL calls from happening over and over when many sprites from the same sprite sheet are added to the same container.
What I am wondering is 1) how "expensive" is adding/removing children to a sprite batch node, and 2) when is it considered appropriate to make use of one?
For example, I have a laser object which creates ten sprites; as it moves across the screen, it shows/hides the current sprite for the given screen position. When it reaches the far right edge of the screen, the laser object is discarded, and so are the ten sprites. So, I was wondering, is this a case where a sprite batch node would not be appropriate, because it's only 10 sprites and it happens so fast? The move animation is 0.2 seconds, so if the player were to rapidly fire, that would mean adding/removing 10 sprites to a batch node over and over...
In other cases, I have a CCSpriteBatchNode already set up for various objects, and occasionally I come across a one-off sprite that needs to be added, and it just happens to be part of the same sprite sheet, so I am tempted to add it to that batch node since it's there and already designated to that particular sprite sheet... Anyway, I'd love to get some clarification on this topic.

The main difference between a CCSpriteBatchNode and a normal CCSprite is that a CCSpriteBatchNode sends the data of all its sprites to the GPU at once, instead of doing it once per sprite.
A CCSprite draw call works in the following way:
// offset points to this sprite's quad; before each call, diff is set to the
// attribute's offset within the interleaved ccV3F_C4B_T2F vertex
// (offsetof(ccV3F_C4B_T2F, vertices), texCoords, and colors respectively)
glVertexAttribPointer(kCCVertexAttrib_Position, 3, GL_FLOAT, GL_FALSE, kQuadSize, (void*) (offset + diff));
glVertexAttribPointer(kCCVertexAttrib_TexCoords, 2, GL_FLOAT, GL_FALSE, kQuadSize, (void*)(offset + diff));
glVertexAttribPointer(kCCVertexAttrib_Color, 4, GL_UNSIGNED_BYTE, GL_TRUE, kQuadSize, (void*)(offset + diff));
glDrawArrays(GL_TRIANGLE_STRIP, 0, 4);
Basically, three calls are made to set the data of the sprite, and then a call to glDrawArrays is made. If you have 100 sprites, this code is executed 100 times.
Now let's look at CCSpriteBatchNode (I chose the implementation without VAOs, which are another possible optimization):
glVertexAttribPointer(kCCVertexAttrib_Position, 3, GL_FLOAT, GL_FALSE, kQuadSize, (GLvoid*) offsetof( ccV3F_C4B_T2F, vertices));
glVertexAttribPointer(kCCVertexAttrib_Color, 4, GL_UNSIGNED_BYTE, GL_TRUE, kQuadSize, (GLvoid*) offsetof( ccV3F_C4B_T2F, colors));
glVertexAttribPointer(kCCVertexAttrib_TexCoords, 2, GL_FLOAT, GL_FALSE, kQuadSize, (GLvoid*) offsetof( ccV3F_C4B_T2F, texCoords));
glBindBuffer(GL_ARRAY_BUFFER, 0);
glBindBuffer(GL_ELEMENT_ARRAY_BUFFER, buffersVBO_[1]);
// a single draw call covers all n quads (6 indices per quad)
glDrawElements(GL_TRIANGLE_STRIP, (GLsizei) n*6, GL_UNSIGNED_SHORT, (GLvoid*) (start*6*sizeof(indices_[0])) );
This code sets the data of all the sprites at once, since it is stored in contiguous memory. The call is the same whether there are 1, 10, or 100 sprites.
That's why it is more efficient. At the same time, since the data is stored contiguously in memory, when a child is removed, added, or modified, the array must be changed accordingly and re-uploaded to the GPU. That's where the cost of adding and removing comes from (and also why a hidden standalone CCSprite simply skips the rendering phase, while a hidden CCSprite inside a batch node doesn't).
From personal experience I can tell you that the cost is usually negligible, and you should generally use a CCSpriteBatchNode when you can (keeping in mind its limits, such as blending being applied to the whole node rather than per sprite) and when you are drawing more than a handful of sprites of the same kind.
Benchmarking it yourself should be easy, though.
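For instance, here is a rough sketch that times repeated add/remove cycles against a batch node; the sheet name, frame rect, and iteration count are placeholders, not from the question:
// Hypothetical benchmark: time 1000 add/remove cycles on a batch node.
// "sprites.png" and the 32x32 frame rect are placeholders for your atlas.
// CACurrentMediaTime() comes from QuartzCore, which cocos2d already links.
CCSpriteBatchNode *batch = [CCSpriteBatchNode batchNodeWithFile:@"sprites.png" capacity:100];
CFTimeInterval start = CACurrentMediaTime();
for (int i = 0; i < 1000; i++) {
    CCSprite *s = [CCSprite spriteWithTexture:batch.texture rect:CGRectMake(0, 0, 32, 32)];
    [batch addChild:s];
    [batch removeChild:s cleanup:YES];
}
NSLog(@"1000 add/remove cycles took %f s", CACurrentMediaTime() - start);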

1) how "expensive" is adding / removing childs to a sprite batch node
The only scenario I am aware of where it can be "expensive" is when you have to increase the atlas capacity. You see, batch nodes have a capacity, and if you add a child that exceeds it, the node has to grow its capacity and recalculate the texture coordinates of every sprite.
To fix this, simply give your batch node a reasonable capacity to begin with: not too little and not too much. It's up to you to identify that number, depending on your needs.
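In cocos2d-iphone the capacity can be passed at creation time; a minimal sketch (the file name and count are placeholders):
// Reserve room for ~50 quads up front so the atlas never has to grow
// mid-game ("coins.png" is a placeholder for your own sprite sheet).
CCSpriteBatchNode *coins = [CCSpriteBatchNode batchNodeWithFile:@"coins.png" capacity:50];
[self addChild:coins];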
2) when is it considered appropriate to make use of one?
Whenever you have several sprites that can use the same texture source. In a Mario game, for instance, you will clearly need several coins on screen. This is a good use case for a batch node: create one batch node for the coin image, and add all your coin sprites to it.
Sometimes you can pack several elements into the same texture. Say you could fit a coin image, a monster image, and a mushroom image all in the same texture. That way, all your coins, monsters, and mushrooms could share the same batch node.
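A sketch of that setup with a packed sheet (the file and frame names are placeholders), assuming the frames have been registered with CCSpriteFrameCache:
// One sheet holds all three sprite types, so one batch node covers them all.
// "items.plist"/"items.png" and the frame names are placeholders.
[[CCSpriteFrameCache sharedSpriteFrameCache] addSpriteFramesWithFile:@"items.plist"];
CCSpriteBatchNode *batch = [CCSpriteBatchNode batchNodeWithFile:@"items.png"];
[batch addChild:[CCSprite spriteWithSpriteFrameName:@"coin.png"]];
[batch addChild:[CCSprite spriteWithSpriteFrameName:@"monster.png"]];
[batch addChild:[CCSprite spriteWithSpriteFrameName:@"mushroom.png"]];
[self addChild:batch];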
You shouldn't need batch nodes for things like background textures, because you probably only need one background sprite anyway.
So, I was wondering, is this a case where a sprite batch node would not be appropriate, because it's only 10 sprites and it happens so fast? The move animation is 0.2 seconds, so if the player were to rapidly fire, that would mean adding/removing 10 sprites to a batch node over and over...
This is a valid use case for a batch node; 10 sprites are drawn simultaneously, after all. And if you know that you won't be using the laser object anymore, you can always unload the corresponding batch node. I imagine you may have several laser objects in your game, so a batch node is a good idea.
Frankly, don't worry much about performance. I use dozens of batch nodes in my game for all sorts of things (characters, weather particles, map objects, collectibles, interface, etc.), and thanks to them the frame rate rarely falls below 55 fps.
In fact, I find it hard to argue against using batch nodes. They rarely cause any harm.

As previously said, a sprite batch node batches the GPU calls for all its children (since they share the same texture). For that to make an impact on performance, however, a good number of sprites must be involved. For 10 sprites, I do not think it would make a difference.
That said, please note that if you are using a newer version of cocos2d (such as 3.0), version 3.1, currently in beta, offers automatic batching, so you do not need to spend time managing CCSpriteBatchNode yourself; cocos2d will batch the data sent to the GPU automatically.

Related

OpenGL: Repeated use of transform feedback buffers overwrites already established textures

I have a working implementation of this technique for view frustum culling of instanced geometry. The gist of the technique is that we use a vertex shader to check whether the bounds of an object lie within the view frustum; if they do, we output the position of that object to a texture, via a geometry shader and a transform feedback buffer. During the actual rendering pass we can then use that texture, along with a query of how many positions we emitted, to fetch the relevant position data for the objects we're rendering and to specify the instance count in our call to glDrawElementsInstanced. One difference between what I do and what the article does is that I emit a full transformation matrix, rather than a simple position vector, to the texture, but I doubt that has any bearing on my problem.
The actual problem: currently I have this set up so that, for each object type being rendered (i.e. tree, box, rock, whatever), the actual rendering pass follows immediately after the frustum-cull pass. This works and gives the intended results. What I want to do instead, however, is to run all the frustum culls for the various objects first, and only then do all the actual rendering, to avoid a bunch of unnecessary state changes (i.e. switching back and forth between shader programs). When I do this, however, the previously established textures (the ones I read positions from during the actual rendering passes) all seem to be overwritten by the latest call to the frustum-culling function, meaning that every texture seemingly contains only the position information from the last cull call.
For example: I render, in order, 4 trees, 10 boxes, and 3 rocks, and what I see instead is a tree, a box, and a rock at the (three) positions where I would expect only the 3 rocks to be. I cannot for the life of me figure out why, because I quite clearly bind new buffers and textures to GL_TRANSFORM_FEEDBACK_BUFFER every time I call the function. Why are the previously used textures still receiving the new data from the latest call?
Code, in C, for the frustum culling function:
void fcullidraw(drawcommand *tar) {
    /* printf("Fculling %s\n", tar->res->name); */
    mesh *rmesh = &tar->res->amod->meshes[0];
    /* glDeleteTextures(1, &rmesh->ctex); */

    /* (Re)create and size the buffer that transform feedback writes into */
    if (rmesh->ctbuf == 0)
        glGenBuffers(1, &rmesh->ctbuf);
    glBindBuffer(GL_TEXTURE_BUFFER, rmesh->ctbuf);
    glBufferData(GL_TEXTURE_BUFFER, sizeof(instancedata) * tar->nodraws, NULL, GL_DYNAMIC_COPY);

    /* Buffer texture through which the render pass reads the culled data */
    if (rmesh->ctex == 0)
        glGenTextures(1, &rmesh->ctex);
    glBindTexture(GL_TEXTURE_BUFFER, rmesh->ctex);
    glTexBuffer(GL_TEXTURE_BUFFER, GL_RGBA32F, rmesh->ctbuf);

    if (rmesh->cquery == 0)
        glGenQueries(1, &rmesh->cquery);

    /* Cull pass: no rasterization, just feedback of surviving instances */
    checkactiveshader(tar->tar, findshader("icull"));
    glEnable(GL_RASTERIZER_DISCARD);
    glUniform1f(activeshader->radius, tar->res->amesh->bbox.radius);
    glUniform3fv(activeshader->extent, 1, (const GLfloat*)&tar->res->amesh->bbox.ext);
    glUniform3fv(activeshader->cp, 1, (const GLfloat*)&tar->res->amesh->bbox.cp);
    glBindVertexArray(tar->res->amod->meshes[0].vao);
    glBindBuffer(GL_ARRAY_BUFFER, tar->res->amod->meshes[0].posarray);
    glBufferData(GL_ARRAY_BUFFER, sizeof(mat4_t) * tar->nodraws, tar->posarray, GL_DYNAMIC_DRAW);
    glBindBufferBase(GL_TRANSFORM_FEEDBACK_BUFFER, 0, rmesh->ctbuf);
    glBeginTransformFeedback(GL_POINTS);
    glBeginQuery(GL_PRIMITIVES_GENERATED, rmesh->cquery);
    glDrawArrays(GL_POINTS, 0, tar->nodraws);
    glEndQuery(GL_PRIMITIVES_GENERATED);
    glEndTransformFeedback();
    glDisable(GL_RASTERIZER_DISCARD);
    glGetQueryObjectuiv(rmesh->cquery, GL_QUERY_RESULT, &rmesh->visibleinstances);
}
tar and rmesh obviously vary between calls to this function. Note that I have left in a few commented-out lines that delete the buffers and textures between rendering cycles, rather than simply overwriting them; using that code instead has no effect on the error mode.
I'm stumped. I feel the textures and buffers are well defined and clearly kept separate, so I do not understand how the textures from previous calls to fcullidraw are somehow still bound to, and being overwritten by, the transform feedback, if that is indeed what is happening. It certainly seems to be, because the earlier objects read in the entire transformation matrix of the rock quite neatly, with the "right" rotation, translation, and everything.
The linked article does the operations in the order I want (first all the frustum culls, then all the rendering), and I'm not sure what I do differently. It might be some small and obvious thing, and I might be an idiot, but in that case I'd love to know why and how I am that.
EDIT: I pushed on and updated my implementation with a refinement of the original technique, suggested here, which gets rid of the write-to-texture method altogether in favor of simply writing to a buffer bound to the VAO, set to update once per rendered instance with a vertex attribute divisor. This method looks a lot cleaner on the whole and, incidentally, had the side effect of avoiding my original problem entirely, as I'm no longer writing to and uploading textures. It is thus no longer a practical problem for me, but the answer to the theoretical question still eludes me, so if anyone has ideas I'm all ears.
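For reference, the per-instance attribute setup that the edit describes looks roughly like this (the buffer and location names are illustrative, not taken from the question's code):
/* Sketch: bind the culled-matrix buffer as a per-instance mat4 attribute.
 * A mat4 occupies four consecutive attribute locations (loc .. loc+3);
 * glVertexAttribDivisor(loc + i, 1) advances it once per instance.
 * 'matbuf' and 'loc' are illustrative names. */
glBindBuffer(GL_ARRAY_BUFFER, matbuf);
for (int i = 0; i < 4; i++) {
    glEnableVertexAttribArray(loc + i);
    glVertexAttribPointer(loc + i, 4, GL_FLOAT, GL_FALSE,
                          16 * sizeof(GLfloat),
                          (void*)(4 * i * sizeof(GLfloat)));
    glVertexAttribDivisor(loc + i, 1);
}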

How can I draw multiple vertex arrays in a single draw call in OpenGL?

I want to simulate the movement of a bunch of balls. Each ball is rendered as 12 vertices with GL_TRIANGLE_FAN. In my design, every ball object maintains an array of its own vertex attributes. However, each call to glDrawElements draws only one vertex buffer object, so 1,000 balls need 1,000 draw calls, which is not efficient. If I drew points instead of triangle fans, I could do it as follows:
class Ball
{
public:
    Ball(GLfloat x, GLfloat y) : x(x), y(y) {}
    GLfloat x, y;
    // ... other members (velocity, color, ...)
};

Ball balls[] = {Ball(100, 100), Ball(80, 120) /* , ... */};

void display()
{
    glEnableClientState(GL_VERTEX_ARRAY);
    // stride = sizeof(Ball) skips the non-position members of each Ball;
    // client-side arrays need a real pointer, hence &balls[0].x rather than 0
    glVertexPointer(2, GL_FLOAT, sizeof(Ball), &balls[0].x);
    glDrawArrays(GL_POINTS, 0, vertex_num);  // vertex_num: one point per ball
    // ...
}
With the stride parameter, I can 'choose' x and y from the balls array and skip the other members. I wonder how I could do the same thing when every object maintains its own array of vertex attributes. Or should I extract all the vertex attributes and combine them into a single VBO? That needs extra maintenance code and increases coupling.
If all the objects are (exactly or mostly) the same, you should look into "instanced" rendering or drawing (i.e. glDrawArraysInstanced, etc.)
If your objects are mostly different and mostly dynamic but have the same render state (textures, shaders, etc.,) and you can use OpenGL 4+, you should use "multi-draw indirect" facilities (i.e. glMultiDrawArraysIndirect, etc.) Note that the version requirement is a little tricky, when considering all the prior extensions, etc.
If your objects are not the same but are static, you should combine them into one (or a few) objects and draw those. This will need more code and will result in less flexibility, and you have to figure out whether it's worth the performance benefits or not.
If none of the above fits (i.e. your objects are all different, very dynamic, and you can't use modern OpenGL), then you don't have any easy way to improve your drawing efficiency. You'll probably need to employ ubershaders (actually, very few shaders that do everything), texture arrays (or bindless texturing), and other tricks.
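As an illustration of the first option, an instanced draw replaces the per-ball loop with a single call: the 12-vertex fan geometry is shared, and each ball's center is uploaded once as a per-instance attribute, which a matching vertex shader (assumed here) adds to each fan vertex. The names centerVBO, centerLoc, centers, and ballCount are illustrative:
// One draw call for all balls: the shared 12-vertex fan is instanced,
// and each instance reads its own center from this attribute.
// 'centerVBO', 'centerLoc', 'centers', and 'ballCount' are illustrative.
glBindBuffer(GL_ARRAY_BUFFER, centerVBO);
glBufferData(GL_ARRAY_BUFFER, ballCount * 2 * sizeof(GLfloat),
             centers, GL_DYNAMIC_DRAW);
glEnableVertexAttribArray(centerLoc);
glVertexAttribPointer(centerLoc, 2, GL_FLOAT, GL_FALSE, 0, nullptr);
glVertexAttribDivisor(centerLoc, 1);  // advance once per instance
glDrawArraysInstanced(GL_TRIANGLE_FAN, 0, 12, ballCount);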

Zoom window in OpenGL

I've implemented the Game of Life using OpenGL buffers (as specified here: http://www.glprogramming.com/red/chapter14.html#name20). In this implementation, each pixel is a cell in the game.
My program receives the initial state of the game (a 2D array). The array size, in my implementation, is the size of the window. This of course makes it "unplayable" if the array is 5x5 or some other small value.
At each iteration I'm reading the content of the framebuffer into a 2D array (its size is the window size):
glReadPixels(0, 0, win_x, win_y, GL_RGB, GL_UNSIGNED_BYTE, image);
Then I do the necessary steps to calculate the living and dead cells, and draw a rectangle covering the whole window, using:
glRectf(0, 0, win_x, win_y);
I want to zoom (or enlarge) the window without affecting the correctness of my code. If I resize the window, the framebuffer content won't fit inside image (the array). Is there a way to zoom the window (so that each cell is drawn as several pixels) without affecting the framebuffer?
First, you seem to be learning OpenGL 2; I would suggest learning a newer version instead, as it is more powerful and efficient. A good tutorial can be found here: http://www.opengl-tutorial.org/
If I understand correctly, you read in an initial state and draw it, then continuously read the pixels back from the screen, update the array based on the Game of Life logic, and draw it back? That seems overly complicated.
Reading the pixels back from the screen is unnecessary, and it will cause complications if you try to enlarge the rects to more than a pixel each.
A good solution would be to keep a bit array (1 is an organism, 0 is not), possibly as a 2D array in memory, update the logic at a fixed rate (say, 30 updates per second), and then draw all the rects to the screen, black for 1 and white for 0, using glColor4f tied to an if statement in a nested for loop, as sketched below.
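A minimal sketch of that loop (the grid, W, H, and CELL names are illustrative):
/* Draw one rect per cell, colored by its state.
 * 'grid', 'W', 'H', and 'CELL' are illustrative names. */
for (int y = 0; y < H; y++) {
    for (int x = 0; x < W; x++) {
        if (grid[y][x])
            glColor4f(0.0f, 0.0f, 0.0f, 1.0f);  /* organism: black */
        else
            glColor4f(1.0f, 1.0f, 1.0f, 1.0f);  /* empty: white */
        glRectf(x * CELL, y * CELL, (x + 1) * CELL, (y + 1) * CELL);
    }
}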
Then, if you give your rects a negative z coordinate, you can "zoom in" using glTranslatef(x, y, z) triggered by a keyboard button.
Of course, in a newer version of OpenGL, vertex buffers would make the code much cleaner and more efficient.
You can't store your game state directly in the window framebuffer and then resize it for rendering, since what is stored in the framebuffer is by definition what is about to be rendered. (You could overwrite it, but then you would lose your game state.) The simplest solution is to store the game state in an array (on the client side) and update a texture based on it: for each cell that is set, you set the corresponding texture pixel to the appropriate color. Each frame, you then render a full-screen quad with that texture (with GL_NEAREST filtering).
However, if you want to take advantage of your GPU, there are tricks that can massively speed up the simulation by using a fragment shader to generate the texture. In this case you would have two textures that you ping-pong between: one containing the current game state and one containing the next game state. Each frame you use your fragment shader (along with an FBO) to generate the next-state texture from the current-state texture. Afterwards, the two textures are swapped, so the next state becomes the current state. The current-state texture is then rendered to the screen the same way as above.
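A rough sketch of that ping-pong step (fbo, current, next, lifeShader, and the drawFullScreenQuad helper are illustrative, not an existing API):
// Render the next state into 'next' using 'current' as input, then swap.
// 'fbo', 'current', 'next', 'lifeShader', and drawFullScreenQuad()
// are illustrative names; <utility> provides std::swap.
glBindFramebuffer(GL_FRAMEBUFFER, fbo);
glFramebufferTexture2D(GL_FRAMEBUFFER, GL_COLOR_ATTACHMENT0,
                       GL_TEXTURE_2D, next, 0);
glUseProgram(lifeShader);          // fragment shader applies the Game of Life rules
glBindTexture(GL_TEXTURE_2D, current);
drawFullScreenQuad();              // assumed helper that draws two triangles
glBindFramebuffer(GL_FRAMEBUFFER, 0);
std::swap(current, next);          // next state becomes current state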
I tried to give an overview of how you might be able to offload the computation onto the GPU, but if I was unclear anywhere just ask! For a more detailed explanation feel free to ask another question.

C++ OpenGL Threaded Terrain Crashing

What the aim is:
I'm relatively new to threading. I've been trying to make quad-tree rendered terrain which renders fast and efficiently. The amount of terrain currently rendered would cause major lag if it were all at maximum detail, which is why I use a quad-tree to render it. The engine also supports input and physics, so I decided to use a separate rendering thread. This has caused lots of problems.
The problem(s):
When I wasn't threading, there was a bit of lag due to the other systems in the engine, the main one being the loading and deletion of terrain in the quad-tree (I'm not even sure this is the optimal way to do it). Now rendering happens very fast and doesn't seem to lag. When the camera stands still, the game runs fine; I left it running for an hour and no crashes occurred.
When terrain is loaded, it uses several of the variables the rendering code uses, namely when binding the buffers:
glBindBuffer(GL_ARRAY_BUFFER, vertexbuffer);
glBufferData(GL_ARRAY_BUFFER, vertices.size() * sizeof(glm::vec3), &vertices[0], GL_STATIC_DRAW);
glBindBuffer(GL_ARRAY_BUFFER, normalbuffer);
glBufferData(GL_ARRAY_BUFFER, normals.size() * sizeof(glm::vec3), &normals[0], GL_STATIC_DRAW);
glBindBuffer(GL_ARRAY_BUFFER, uvbuffer);
glBufferData(GL_ARRAY_BUFFER, uvs.size() * sizeof(glm::vec2), &uvs[0], GL_STATIC_DRAW);
These buffers, I believe, are being accessed by both threads at the same time, which causes a crash. How does one fix this? I have tried using mutexes, but that doesn't seem to work. Where would I lock and unlock the mutex to fix this?
Another variable that seems to cause the same error is "IsLeaf".
Another crash (std::bad_alloc) occurs after loading a lot of terrain, even though it's being cleaned up. I assume this is due to my deletion code, but I don't know what's wrong.
The way I currently add and delete tiles is by checking the range from the camera and deleting/creating the tile accordingly. I want to render the tile I'm on and the ones around it. However, this doesn't work when transitioning from one of the four main tiles; creating by range doesn't work because the range is measured to the center of the big tile rather than the smaller ones. I've also tried deleting the whole map every few seconds, and that seems to work too, but with more lag. Is there a better way to do the creation and destruction?
Between different resolutions there are gaps. Is there any way to reduce these? Currently I render the tiles a little larger than they need to be, but this doesn't help on major resolution changes.
If you have any idea how to fix one of these errors it'd be much appreciated.
The code (Too much to upload here)
http://pastebin.com/MwXaymG0
http://pastebin.com/2tRbqtEB
An OpenGL context can only be bound to one thread at a time (through wglMakeCurrent() on Windows).
Therefore you should not be using gl* functions across threads; even if you use mutexes to secure access to certain variables in memory, the calls will fail.
What I would suggest is to move your gl* calls into your rendering thread, but keep things such as terrain loading, frustum calculations, clipping, etc. in your other thread. The rendering thread just needs to check whether an object has new data and then perform the appropriate GL calls as part of its update/render.
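A minimal sketch of that hand-off, under the assumption that tiles are queued by the loader and uploaded by the renderer (the types and names are illustrative, not from the question's code):
#include <mutex>
#include <queue>
#include <vector>
#include <glm/glm.hpp>

// Illustrative hand-off between loader and render threads: only the
// render thread ever touches OpenGL (GL headers assumed included).
struct PendingTile {
    std::vector<glm::vec3> vertices;  // filled by the loader thread
    std::vector<glm::vec2> uvs;
};

std::mutex pendingMutex;
std::queue<PendingTile> pending;      // loader pushes, renderer pops

// Called on the render thread once per frame.
void uploadPendingTiles() {
    std::lock_guard<std::mutex> lock(pendingMutex);
    while (!pending.empty()) {
        PendingTile &t = pending.front();
        GLuint vbo;
        glGenBuffers(1, &vbo);        // safe: this thread owns the GL context
        glBindBuffer(GL_ARRAY_BUFFER, vbo);
        glBufferData(GL_ARRAY_BUFFER,
                     t.vertices.size() * sizeof(glm::vec3),
                     t.vertices.data(), GL_STATIC_DRAW);
        pending.pop();
    }
}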

OpenGL 2D performance tips

I'm currently developing a Touhou-esque bullet hell shooter game. The screen will be absolutely filled with bullets (so instancing is what I want here), but I want this to work on older hardware, so I'm doing something along the lines of the following at the moment; there are no colors, textures, etc. yet until I figure this out.
glVertexPointer(3, GL_FLOAT, 0, SQUARE_VERTICES);
for (int i = 0; i < info.centers.size(); i += 3) {
    glPushMatrix();
    glTranslatef(info.centers.get(i), info.centers.get(i + 1), info.centers.get(i + 2));
    glScalef(info.sizes.get(i), info.sizes.get(i + 1), info.sizes.get(i + 2));
    glDrawElements(GL_QUADS, 4, GL_UNSIGNED_SHORT, SQUARE_INDICES);
    glPopMatrix();
}
Because I want this to work on old hardware, I'm trying to avoid shaders and whatnot. The setup above starts failing me at about 80 polygons; I'm looking to get at least a few hundred out of this. info is a struct which has all the goodies for rendering; nothing much to it besides a few vectors.
I'm pretty new to OpenGL, but I have at least heard of and tried out most of what can be done, though I'm not claiming to be good with any of it. This is a 2D game; I switched from SDL to OpenGL because it makes fancier effects easier. Obviously SDL works differently, but I never had this problem using it.
It boils down to this: I'm clearly doing something wrong here, so how can I implement instancing correctly for old hardware (OpenGL 1.x)? Also, give me any tips for increasing performance.
Also, give me any tips for increasing performance.
If you're going to use sprites...
Load all sprites into a single huge texture. If they don't fit, use several textures, but keep the number of textures low, to avoid texture switching.
Switch textures and change OpenGL state as infrequently as possible. Ideally, you should set the texture once and draw everything you can with it.
Use texture fonts for text. FTGL fonts might look nice, but they can hit performance very hard with complex fonts.
Avoid alpha blending when possible; use alpha testing instead.
When you do alpha-blend, also use alpha testing to reduce the number of pixels you draw. When your texture has many pixels with alpha == 0, cut them out with the alpha test (see the sketch after this list).
Reduce the number of very big sprites. A huge screen-aligned/pixel-aligned sprite (1024x1024) will drop FPS even on very good hardware.
Don't use non-power-of-2 sized textures. They (used to) produce a huge performance drop on certain ATI cards.
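For the alpha-testing advice above, a sketch in fixed-function OpenGL (the 0.5 threshold is a tunable assumption, not a universal value):
/* Discard (mostly) transparent pixels before they hit the blender.
 * The 0.5f threshold is an assumption; tune it for your art. */
glEnable(GL_ALPHA_TEST);
glAlphaFunc(GL_GREATER, 0.5f);
glEnable(GL_BLEND);
glBlendFunc(GL_SRC_ALPHA, GL_ONE_MINUS_SRC_ALPHA);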
glTranslatef
For a 2D sprite-based (that's important) game, you can avoid matrices completely (with the exception of the camera/projection matrices, perhaps). I don't think matrices will benefit you much in a 2D game; for example, you can pre-translate each sprite's vertices on the CPU, as sketched below.
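A sketch of what that looks like in immediate mode (the sprite 's' with x/y/w/h fields is an illustrative stand-in):
/* Emit the quad at its final position, so no per-sprite matrix ops
 * are needed. The sprite 's' with x/y/w/h fields is illustrative. */
glBegin(GL_QUADS);
glTexCoord2f(0, 0); glVertex2f(s->x,        s->y);
glTexCoord2f(1, 0); glVertex2f(s->x + s->w, s->y);
glTexCoord2f(1, 1); glVertex2f(s->x + s->w, s->y + s->h);
glTexCoord2f(0, 1); glVertex2f(s->x,        s->y + s->h);
glEnd();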
With a 2D game, your main bottleneck will be GPU memory transfer speed: transferring data from texture to screen. So "use as few draw calls as possible" and "put everything in vertex arrays" won't help you; you can kill performance with a single sprite.
However, if you're going to use vector graphics (see area2048 (YouTube) or Rez) that do not use textures, then most of the advice above will not apply, and such a game won't be very different from a 3D game. In that case it is reasonable to use vertex arrays, vertex buffer objects, or display lists (depending on what is available) and to utilize the matrix functions, because your bottleneck will be vertex processing. You'll still have to minimize the number of state switches.