OpenGL 2D performance tips - C++

I'm currently developing a Touhou-esque bullet hell shooter. The screen will be absolutely filled with bullets (so instancing is what I want here), but I want this to work on older hardware, so I'm doing something along the lines of this at the moment. There are no colors, textures, etc. yet; those can wait until I figure this out.
glVertexPointer(3, GL_FLOAT, 0, SQUARE_VERTICES);
for (int i = 0; i < info.centers.size(); i += 3) {
    glTranslatef(info.centers.get(i), info.centers.get(i + 1), info.centers.get(i + 2));
    glScalef(info.sizes.get(i), info.sizes.get(i + 1), info.sizes.get(i + 2));
    // ... draw the square
}
Because I want this to work on old hardware I'm trying to avoid shaders and whatnot. The setup above fails me at about 80 polygons. I'm looking to get at least a few hundred out of this. info is a struct which has all the goodies for rendering; there's nothing much to it besides a few vectors.
I'm pretty new to OpenGL, but I have at least heard of and tried out everything that can be done, though I'm not saying I'm good with it at all. This is a 2D game. I switched from SDL to OpenGL because it makes some fancier effects easier. Obviously SDL works differently, but I never had this problem using it.
It boils down to this: I'm clearly doing something wrong here, so how can I implement instancing for old hardware (OpenGL 1.x) correctly? Also, any tips for increasing performance are welcome.

If you're going to use sprites....
Load all sprites into a single huge texture. If they don't fit, use several textures, but keep the number of textures low to avoid texture switching.
Switch textures and change OpenGL state as infrequently as possible. Ideally, you should set texture once, and draw everything you can with it.
Use texture fonts for text. FTGL font might look nice, but it can hit performance very hard with complex fonts.
Avoid alpha-blending when possible and use alpha-testing.
When alpha-blending, always use alpha-testing to reduce number of pixels you draw. When your texture has many pixels with alpha==0, cut them out with alpha-test.
Reduce the number of very big sprites. A huge screen-aligned/pixel-aligned sprite (1024*1024) will drop FPS even on very good hardware.
Don't use non-power-of-2 sized textures. They (used to) produce huge performance drop on certain ATI cards.
For a 2D sprite-based (that's important) game you could avoid matrices completely (with the exception of camera/projection matrices, perhaps). I don't think that matrices will benefit you very much in a 2D game.
With a 2D game your main bottleneck will be GPU memory transfer speed: transferring data from texture to screen. So "use as few draw calls as possible" and "put everything in a VA" won't help you; you can kill performance with one sprite.
However, if you're going to use vector graphics that do not use textures (see Area 2048 on YouTube, or Rez), then most of the advice above will not apply, and such a game won't be very different from a 3D game. In this case it'll be reasonable to use vertex arrays, vertex buffer objects or display lists (depending on what is available) and to use the matrix functions, because your bottleneck will be vertex processing. You'll still have to minimize the number of state switches.
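For untextured squares like the ones in the question, one way to combine the "avoid matrices" and "use vertex arrays" advice on OpenGL 1.1 is to compute each quad's corners on the CPU and submit all of them with a single vertex-array draw. The following is only a rough sketch: the `Bullet` struct and `buildQuadVertices` are made-up names (the question's `info` struct is not shown in full), and the GL calls appear only as comments so the block compiles without a GL context.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical per-bullet data; not the question's actual `info` layout.
struct Bullet {
    float x, y;          // center
    float halfW, halfH;  // half extents
};

// Appends 4 corners (x, y) per bullet, counter-clockwise, laid out for GL_QUADS.
std::vector<float> buildQuadVertices(const std::vector<Bullet>& bullets) {
    std::vector<float> verts;
    verts.reserve(bullets.size() * 8);
    for (const Bullet& b : bullets) {
        const float left = b.x - b.halfW, right = b.x + b.halfW;
        const float bottom = b.y - b.halfH, top = b.y + b.halfH;
        const float quad[8] = { left, bottom,  right, bottom,  right, top,  left, top };
        verts.insert(verts.end(), quad, quad + 8);
    }
    return verts;
}

// Per frame (OpenGL 1.1 calls, shown as comments):
//   std::vector<float> v = buildQuadVertices(bullets);
//   glEnableClientState(GL_VERTEX_ARRAY);
//   glVertexPointer(2, GL_FLOAT, 0, v.data());
//   glDrawArrays(GL_QUADS, 0, static_cast<GLsizei>(v.size() / 2));
```

This replaces one glTranslatef/glScalef/draw sequence per bullet with one draw call per frame; whether that is the actual bottleneck on a given machine is something to measure.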


C++, OpenGL - Rendering large amount of... teapots

I am quite new to OpenGL programming. My goal was to set up object-oriented graphics programming, and I can proudly say that I have made some progress. Now I have a different problem.
Let's say we have a working program that can draw one, two or many rotating teapots. I did this by using a list inside my class. The raw code for the drawing function is here:
void Draw(void)
{
    for (list<teapot>::iterator it = teapots.begin(); it != teapots.end(); ++it) { /* ... per-teapot drawing ... */ }
}
Everything is great, but when I draw a large number of teapots (say, 128 in two rows) my FPS drops. I don't know if it is just a hardware limit or if I am doing something wrong. Maybe glPushMatrix() and glPopMatrix() should happen more often? Or less often?
You're using an old, deprecated part of OpenGL (called "immediate mode") in which all the graphics data is sent from the CPU to the GPU every frame: inside glutSolidTeapot() is code that does something like glBegin(GL_TRIANGLES) followed by lots of glVertex3f(...) and finally glEnd(). The reason that's deprecated is because it's a bottleneck. GPUs are highly parallel and are capable of processing many triangles at the same time, but they can't do that if your program is sending the vertices one-at-a-time with glVertex3f.
You should learn about the modern OpenGL API, in which you start by creating a "buffer object" and loading your vertex data into it — basically uploading your shape into the GPU's memory once, up-front — and then you can issue lots of calls telling the GPU to draw triangles using the vertices in that buffer object, instead of having to send all the vertices again every time.
(Unfortunately, this means you won't be able to use glutSolidTeapot(), since that draws in immediate mode and doesn't know how to produce vertex data for a buffer object. But I'm sure you can find a teapot model somewhere on the web.) is a decent tutorial that I know of for modern-style OpenGL, but I'm sure there are others as well.
Wyzard is right, partially. Besides the fact that you are using an old, deprecated API where on each draw call you submit all your data again and again from the CPU to the GPU, you also expect to maintain a decent frame rate while rendering the same geometry multiple times. So in fact, keeping such an approach to geometry rendering while using the programmable pipeline will not gain you much either. You will start noticing an FPS drop after roughly 40-60 objects (depending on your GPU). What you really need is called batched drawing. Batched drawing covers several techniques, all of which imply using modern OpenGL, as we are talking here about data buffers (arrays of vertices, in your case, which you upload to the GPU). You can either push all the geometry into a single vertex buffer or use instanced rendering commands. In your case, if all you are after is drawing the same mesh multiple times, the second technique is the perfect solution. There are more complex techniques, like indirect multi-draw commands, which allow you to draw very large quantities of different geometry with a single draw call, but those are pretty advanced for beginners. Anyway, the bottom line is that you must move to modern OpenGL and start using geometry batching if you want to keep your app's FPS high while drawing large numbers of meshes.
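To make the "push all the geometry into a single vertex buffer" option a bit more concrete, here is a small CPU-side sketch of merging several copies of an indexed mesh into one batch. The `Mesh` struct and `appendTranslated` are hypothetical names, not part of any library; the only subtle point is that each copy's indices have to be rebased by the number of vertices already in the batch.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical minimal mesh: xyz positions plus triangle indices.
struct Mesh {
    std::vector<float> positions;        // 3 floats per vertex
    std::vector<std::uint32_t> indices;  // 3 indices per triangle
};

// Appends a translated copy of `src` to `batch`, rebasing the copied indices
// so they keep pointing at the copied vertices. `src` must not be `batch`.
void appendTranslated(Mesh& batch, const Mesh& src, float dx, float dy, float dz) {
    const std::uint32_t base = static_cast<std::uint32_t>(batch.positions.size() / 3);
    for (std::size_t i = 0; i + 2 < src.positions.size(); i += 3) {
        batch.positions.push_back(src.positions[i] + dx);
        batch.positions.push_back(src.positions[i + 1] + dy);
        batch.positions.push_back(src.positions[i + 2] + dz);
    }
    for (std::uint32_t idx : src.indices) {
        batch.indices.push_back(base + idx);
    }
}
```

The merged arrays would then be uploaded once and drawn with a single indexed draw call. This suits copies that do not move relative to each other; for many independently rotating teapots, the instanced rendering route mentioned above is probably the closer fit.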

Techniques for drawing tiles with OpenGL

I've been using XNA for essentially all of my programming so far and would like to move on to OpenGL (along with SFML for IO, creating the window etc.) with C++. For starters I'd like to create a tile-based game, and I've mostly looked at LazyFoo's tutorials.
I just have two questions:
How should I draw the tiles? Should I use immediate drawing, arrays, VBOs or what? VBOs feel like overkill for this but I'm not sure. It's very tempting to use immediate drawing but apparently it's deprecated. Maybe it's fine for this purpose since it's 2D and only for a bunch of quads.
I'd like a lot of different tiles and thus all of my tiles will not fit into a single texture without making it massive. I've read that using bindTexture isn't very cheap and thus I should avoid as many calls as I can. I thought that maybe I can create a manager for my textures and stitch them all together into one big texture and bind that but then the dimensions of that is an issue.
Don't use immediate mode! It's cumbersome to work with and has been removed from the core profile of recent OpenGL versions. Use vertex arrays, ideally through VBOs. In the end they're much easier to use, believe me.
Regarding that switching of textures. We're talking about optimizing the texture switch patterns in very complex scenes. In your case it will hardly matter at all.
Right now you are worrying about things without having even used them. That's worse than premature optimization. I suggest you first get a good grip on OpenGL, then start worrying about state switch management.
With regards to the texture atlas; this is usually done by stitching textures into groups of power-of-two sized textures. For example in a tile-based game you might have a particular tile set (say, tiles for an ice world) grouped together on 2 or 3 textures. When you want to render them you would determine what tiles are visible, then you bind each texture once and render the tiles from that texture for any tiles that are visible on screen.
This requires quite a lot of set-up time to get right; you need to keep information on each sub-texture of the atlas so you can find the right texture and render the appropriate region of that texture whenever a tile is referenced. You also need a good way of grouping rendering operations so that they occur when the appropriate texture is bound.
Like datenwolf said, I wouldn't focus too much on complicated texture systems early on; eager binding of textures will be plenty fast enough until you get further down the road.
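For reference, the "information on each sub-texture" can be very small when the atlas is a uniform grid. The sketch below assumes a hypothetical layout (equally sized tiles, numbered row by row from the first row of the image) and made-up names `UVRect` and `atlasRect`; a real atlas with differently sized tiles would need a lookup table instead.

```cpp
// Texture-coordinate rectangle of one tile inside an atlas.
struct UVRect { float u0, v0, u1, v1; };

// Assumes a uniform grid of `cols` x `rows` equally sized tiles,
// numbered row by row starting at cell (0, 0).
UVRect atlasRect(int tileIndex, int cols, int rows) {
    const int col = tileIndex % cols;
    const int row = tileIndex / cols;
    const float du = 1.0f / static_cast<float>(cols);
    const float dv = 1.0f / static_cast<float>(rows);
    return UVRect{ col * du, row * dv, (col + 1) * du, (row + 1) * dv };
}
```

The four corners of the returned rectangle are what would be passed as texture coordinates for the tile's quad instead of the usual 0..1 range.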

Dynamic tile display optimization in OpenGL

I am working on a tile-based, top-down 2D game with dynamically generated terrain, and started (re)writing the graphics engine in OpenGL. The game is written in Java using LWJGL, and I'd prefer it to stay relatively platform-independent, and playable on older computers too.
Currently I'm using immediate mode for drawing, but obviously this is too slow for anything but the simplest scenes.
There are two basic types of objects that are drawn: Tiles, which make up the world, and Sprites, which are pretty much everything else (entities, items, effects, etc.).
The tiles are 20*20 px, and are stored in chunks (40*40 tiles). Terrain generation is done in full chunks, like in Minecraft.
The method I use now is iterating over the 9 chunks near the player, and then iterating over each tile inside, drawing one quad for the tile texture, and optional extra quads for features depending on the material.
This ends up quite slow, but a simple out-of-view check gives a 5-10x FPS boost.
For optimizing this, I looked into using VBOs and quad strips, but I have a problem when terrain changes. This doesn't happen every frame, but not a very rare event either.
A simple method would be dropping and rebuilding a chunk's VBO every time it changes. This doesn't seem the best way though. I read that VBOs can be "dynamic" allowing their content to be changed. How can this be done, and what data can be changed inside them efficiently? Are there any other ways for efficiently drawing the world?
The other type, sprites, are currently drawn as a quad with a texture mapped from a sprite sheet. So by changing texture coordinates, I can even animate them later. Is this the correct way to do the animation though?
Currently even a very high number of sprites won't slow the game down much, and by understanding VBOs, I'll be able to speed them up even more, but I haven't seen any solid and reliable tutorials for an efficient way of doing this. Does anyone know one perhaps?
Thanks for the help!
I disagree. Unless you are drawing a lot of tiles (tens of thousands per frame), immediate mode should be just fine for you.
The key is something you will have to be doing to get good performance anyway: texture atlases. All of your tiles should be stored in a single texture. You use texture coordinates to pull different tiles out of that texture when rendering. So if this is what your render loop looks like now:
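for(tile in tileList) //Pseudocode. Not actual C++
{
    glBindTexture(GL_TEXTURE_2D, tile.texture);
    glBegin(GL_QUADS);
        //Corner names below are placeholders for the tile's four vertices.
        glTexCoord2f(0.0f, 0.0f); glVertex2fv(tile.lowerLeft);
        glTexCoord2f(0.0f, 1.0f); glVertex2fv(tile.upperLeft);
        glTexCoord2f(1.0f, 1.0f); glVertex2fv(tile.upperRight);
        glTexCoord2f(1.0f, 0.0f); glVertex2fv(tile.lowerRight);
    glEnd();
}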
You can convert it into this:
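glBindTexture(GL_TEXTURE_2D, allTilesTexture);
glBegin(GL_QUADS);
for(tile in tileList) { /* four glTexCoord2f/glVertex2fv pairs, using the tile's coordinates inside the atlas */ } //Still pseudocode.
glEnd();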
If you are already using a texture atlas and still aren't getting acceptable performance, then you can move on to buffer objects and the like. But you won't get any better performance from buffer objects if you don't do this first.
If all of your tiles cannot fit into a single texture, then you will need to do one of two things: use multiple textures (rendering as many tiles with each texture in one glBegin/glEnd pair as possible), or use a texture array. Texture arrays are available in OpenGL 3.0-level hardware only. That means any Radeon HDxxxx or GeForce 8xxxx or better.
You mentioned that you sometimes render "features" on top of tiles. These features likely use blending and different glTexEnv modes from regular tiles. In this case, you need to find ways to group similar features into a single glBegin/glEnd pair.
As you may be gathering from this, the key to performance is minimizing the number of times you call glBindTexture and glBegin/glEnd. Do as much work as possible in each glBegin/glEnd.
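If several textures are unavoidable, one simple way to keep the number of glBindTexture calls and glBegin/glEnd pairs down is to sort the draw list by texture before rendering. This is only a sketch with hypothetical names (`TileDraw`, `sortByTexture`, `countTextureSwitches`), and it assumes opaque tiles whose draw order does not matter; blended "features" would need their own ordering.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical draw record: which texture a tile uses and where it goes.
struct TileDraw {
    unsigned texture;  // OpenGL texture name
    float x, y;        // lower-left corner
};

// Sorts tiles so that all tiles sharing a texture end up adjacent.
void sortByTexture(std::vector<TileDraw>& tiles) {
    std::stable_sort(tiles.begin(), tiles.end(),
                     [](const TileDraw& a, const TileDraw& b) { return a.texture < b.texture; });
}

// Number of texture binds (and glBegin/glEnd pairs) a renderer would issue
// if it only rebinds when the texture differs from the previous tile's.
int countTextureSwitches(const std::vector<TileDraw>& tiles) {
    int switches = 0;
    for (std::size_t i = 0; i < tiles.size(); ++i) {
        if (i == 0 || tiles[i].texture != tiles[i - 1].texture) ++switches;
    }
    return switches;
}
```

Calling `countTextureSwitches` before and after `sortByTexture` gives a quick measure of how much the sort actually saves for a given frame.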
If you wish to proceed with a buffer-based approach (and you should only bother if the texture atlas approach didn't get your performance up to par), it's fairly simple. Put all of your tile "chunks" into a single buffer object. Don't make a buffer for each one; there's no real reason to do so, and 40x40 tiles worth of vertex data is only 12,800 bytes. You can put 81 such chunks in a single 1MB buffer. This way, you only have to call glBindBuffer for your terrain. Which again, saves you performance.
I would need to know more about these "features" you sometimes use to suggest a way to optimize them. But as for dynamic buffers, I wouldn't worry. Just use glBufferSubData to update the part of the buffer in question. If this turns out to be slow, there are several options for making it faster that you can employ. But you shouldn't bother unless you know that it is necessary, since they're complex.
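The bookkeeping for a shared terrain buffer is mostly offset arithmetic. The sketch below uses the figures from the answer above (40x40 tiles and 12,800 bytes per chunk, a 1 MB buffer); the names `chunkOffset`, `chunkCapacity`, `terrainVbo`, `slot` and `chunkVertexData` are made up for illustration, and the GL calls are shown only as comments.

```cpp
#include <cstddef>

// Figures from the answer: 40x40 tiles per chunk, 12,800 bytes per chunk
// (which works out to 8 bytes per tile), and a 1 MB buffer.
constexpr std::size_t kTilesPerChunk = 40 * 40;
constexpr std::size_t kBytesPerChunk = 12800;
constexpr std::size_t kBufferBytes   = 1024 * 1024;

// Byte offset of a chunk's slot inside the shared buffer.
constexpr std::size_t chunkOffset(std::size_t chunkSlot) {
    return chunkSlot * kBytesPerChunk;
}

// How many whole chunks fit in the buffer.
constexpr std::size_t chunkCapacity() {
    return kBufferBytes / kBytesPerChunk;
}

// Updating one changed chunk (shown as comments):
//   glBindBuffer(GL_ARRAY_BUFFER, terrainVbo);
//   glBufferSubData(GL_ARRAY_BUFFER, chunkOffset(slot), kBytesPerChunk, chunkVertexData);
```

A chunk that changes is rewritten in place at its own offset, so the other chunks in the buffer are left untouched.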
Sprites are probably something that benefits the absolute least from a buffer object approach. There's really nothing to be gained by it over immediate mode. Even if you're rendering hundreds of them, each one will have its own transformation matrix. Which means that each one will have to be a separate draw call. So it may as well be glBegin/glEnd.

2D engine with OpenGL: Use Z buffer or own implementation for sprite sorting?

If I was making a 3D engine, the answer to this question would be clear: I'd go for using the depth buffer instead of thinking of sorting all my polygons on my own.
However, this is a different situation with 2D, because here layers can be implemented easily without the help of OpenGL - and you then could even sort and move sprites within layers. (Which isn't possible in OpenGL afaik)
(Why) should I use the OpenGL depth buffer instead of a C++ layer system running on the CPU?
How much slower would the depth buffer version be?
It is clear to me that a layer system in C++ would have practically no performance impact, as I have to iterate over the sprites for rendering in any case.
I would suggest doing it in software, since you probably want to use transparency on your sprites, and that implies rendering them from back to front. Also, sorting a couple of sprites shouldn't be that CPU demanding.
Use both, if you can.
Depth information is nice for post-processing and stuff like 3D-glasses, so you shouldn't throw it away. These kinds of effects can be very nice for 2D games.
Also, if you draw your (opaque) layers front to back, you can save fill-rate because the Z-Buffer can do the clipping for you (Depth tests are faster than actual drawing).
Depth testing is usually almost free, especially when you have hierarchical Z info. Because of this and the fill-rate savings, using depth testing will probably be even faster.
On the other hand, the software sorting is nice so you can actually do front to back rendering for opaque sprites and it's mandatory to do alpha-blending right (of course, you draw these sprites back to front).
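The "use both" ordering can be sketched on the CPU like this: opaque sprites first, nearest to farthest, then translucent ones, farthest to nearest. The `Sprite` struct and `sortForDrawing` are hypothetical, and the sketch assumes that a larger `depth` value means farther from the viewer.

```cpp
#include <algorithm>
#include <vector>

// Hypothetical sprite record; larger `depth` means farther from the viewer.
struct Sprite {
    int id;
    float depth;
    bool translucent;
};

// Orders sprites for drawing: opaque ones first, front to back (so the depth
// test can reject hidden pixels), then translucent ones, back to front (so
// alpha blending composites correctly).
void sortForDrawing(std::vector<Sprite>& sprites) {
    auto firstTranslucent = std::stable_partition(
        sprites.begin(), sprites.end(),
        [](const Sprite& s) { return !s.translucent; });
    std::stable_sort(sprites.begin(), firstTranslucent,
                     [](const Sprite& a, const Sprite& b) { return a.depth < b.depth; });
    std::stable_sort(firstTranslucent, sprites.end(),
                     [](const Sprite& a, const Sprite& b) { return a.depth > b.depth; });
}
```

Sprites with equal depth keep their original relative order because both sorts are stable, which is handy when layers are expressed as ties in depth.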
Direct answers:
allowing the GPU to use the depth buffer would allow you to dynamically adjust the draw order of things without any on-CPU shuffling and would free you from having to assign things to different layers in situations where doing so is a bit of a fiction — for example, you could have effects like projectiles that come from the background towards and then in front of the player, without having to figure out which layer to assign them to all the time
on the GPU, the use of a depth buffer would have no measurable effect, even if you're on an embedded chip, a plug-in card from more than a decade ago or an integrated part; depth buffers are so fundamental to modern GPUs that they've been optimised down to costing nothing in practical terms
However, I'd imagine you actually want to do it on the CPU for the simple reason of treating transparency correctly. A depth buffer stores one depth per pixel, so if you draw a near transparent object then attempt to draw something behind it, the thing behind won't be drawn even though it should be visible. In a 2d game it's likely that anti-aliasing will give your sprites partially transparent edges; if you submit drawing to the GPU in draw order then your partial transparencies will always be composited correctly. If you leave the z-buffer to do it then you risk weird looking fringing.

OpenGL voxel engine slow

I'm making a voxel engine in C++ and OpenGL (à la Minecraft) and can't get decent fps on my 3GHz with ATI X1600... I'm all out of ideas.
When I have about 12000 cubes on the screen it falls to under 20fps - pathetic.
So far the optimizations I have are: frustum culling, back face culling (via OpenGL's glEnable(GL_CULL_FACE)), the engine draws only the visible faces (except the culled ones of course) and they're in an octree.
I've tried VBO's, I don't like them and they do not significantly increase the fps.
How can Minecraft's engine be so fast... I struggle with 10000 cubes, whereas Minecraft can easily draw many more at a higher fps.
Any ideas?
@genpfault: I analyze the connectivity and just generate faces for the outer, visible surface. The VBO had a single cube that I glTranslate()d
I'm not an expert at OpenGL, but as far as I understand this is going to save very little time because you still have to send every cube to the card.
Instead what you should do is generate faces for all of the outer visible surface, put that in a VBO, and send it to the card and continue to render that VBO until the geometry changes. This saves you a lot of the time your card is actually waiting on your processor to send it the geometry information.
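As a rough illustration of "generate faces for the outer visible surface", the sketch below walks a dense voxel grid and counts only the cube faces that border empty space; a real mesher would emit four vertices for each such face into the VBO instead of counting. `VoxelGrid` and `countExposedFaces` are hypothetical names, and the storage layout (x varying fastest) is an assumption.

```cpp
#include <cstddef>
#include <vector>

// Dense voxel grid; `solid` holds nx*ny*nz flags, x varying fastest.
struct VoxelGrid {
    int nx, ny, nz;
    std::vector<bool> solid;

    // Cells outside the grid count as empty.
    bool isSolid(int x, int y, int z) const {
        if (x < 0 || y < 0 || z < 0 || x >= nx || y >= ny || z >= nz) return false;
        return solid[static_cast<std::size_t>((z * ny + y) * nx + x)];
    }
};

// Counts cube faces that border empty space; only these need to be meshed.
int countExposedFaces(const VoxelGrid& g) {
    static const int dirs[6][3] = {
        {1, 0, 0}, {-1, 0, 0}, {0, 1, 0}, {0, -1, 0}, {0, 0, 1}, {0, 0, -1}};
    int faces = 0;
    for (int z = 0; z < g.nz; ++z)
        for (int y = 0; y < g.ny; ++y)
            for (int x = 0; x < g.nx; ++x) {
                if (!g.isSolid(x, y, z)) continue;
                for (const auto& d : dirs)
                    if (!g.isSolid(x + d[0], y + d[1], z + d[2])) ++faces;
            }
    return faces;
}
```

For a fully solid 2x2x2 block this yields 24 faces instead of the 48 that eight separate cubes would have, and the saving grows quickly with larger solid regions.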
You should profile your code to find out if the bottleneck in your application is on the CPU or the GPU. For instance, it might be that your culling/octree algorithms are slow, and in that case it is not an OpenGL problem at all.
I would also keep count of the number of cubes you draw on each frame and display that on screen. Just so you know your culling routines work as expected.
Finally you don't mention if your cubes are textured. Try using smaller textures or disable textures and see how much the framerate increases.
gDEBugger is a great tool that will help you find bottlenecks with OpenGL.
I don't know if it's ok here to "bump" an old question but a few things came up my mind:
If your voxels are static you can speed up the whole rendering process by using an octree for frustum culling, etc. Furthermore, you can also compile a static scene into a potentially visible set (PVS) in the octree. The main principle of a PVS is to precompute, for every node in the tree, which other nodes are potentially visible from it, and to store pointers to them in a vector. When it comes to rendering, you first check which node the camera is in and then run frustum culling against all nodes in that node's PVS vector. (Carmack used something like that in the Quake engines, but with binary space partitioning trees.)
If the shading of your voxels is fairly complex, it also pays to do a depth-only pre-pass, without writing into the color buffer, just to fill the depth buffer. After that you render a second pass: disable writing to the depth buffer and render only to the color buffer while testing against the depth buffer. That way you avoid expensive shader computations which are later overwritten by a new fragment that is closer to the viewer. (Carmack used that in Quake3)
Another thing which will definitely speed things up is the use of instancing. You store only the position of each voxel and, if necessary, its scale and other parameters in a texture buffer object (TBO). In the vertex shader you can then read the positions of the voxels to be spawned and create an instance of the voxel (i.e. a cube which is given to the shader in a vertex buffer object). So you send the 8 vertices + 8 normals (3 * sizeof(float) * 8 + 3 * sizeof(float) * 8, plus floats for color/texture etc.) to the card only once in the VBO, and then only the positions of the instances of the cube (3 * sizeof(float) * number of voxels) in the TBO.
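The byte counts in the instancing paragraph can be written out as a small sketch. It follows the same simplification (8 positions and 8 normals per cube, colour and texture data left out); `instancedBytes` and `duplicatedBytes` are made-up helper names.

```cpp
#include <cstddef>

// Shared cube data: 8 positions + 8 normals, 3 floats each
// (colour/texture data left out, as in the paragraph above).
constexpr std::size_t kCubeBytes = 3 * sizeof(float) * 8 + 3 * sizeof(float) * 8;

// Bytes sent when each voxel is only a position in the instance buffer.
constexpr std::size_t instancedBytes(std::size_t voxelCount) {
    return kCubeBytes + 3 * sizeof(float) * voxelCount;
}

// Bytes sent when every voxel carries its own copy of the cube data.
constexpr std::size_t duplicatedBytes(std::size_t voxelCount) {
    return kCubeBytes * voxelCount;
}
```

With 4-byte floats and the 12000 cubes from the question, that is 144,192 bytes with instancing against 2,304,000 bytes with one full cube per voxel, roughly a 16x difference under these assumptions.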
Maybe it is possible to parallelize things between the GPU and the CPU by combining all 3 steps in 2 threads: in the CPU thread you check the octree's PVS and update a TBO for instancing in the next frame, while the GPU thread meanwhile renders the 2 passes using a TBO that was created by the CPU thread in the previous step. After that you switch TBOs. If the camera has not moved you don't even have to redo the CPU calculations.
Another kind of tree you may be interested in is the so-called k-d tree, which is more general than octrees.
PS: sorry for my english, it's not the clearest....
There are 3rd-party libraries you could use to make the rendering more efficient. For example the C++ PolyVox library can take a volume and generate the mesh for you in an efficient way. It has built-in methods for reducing triangle count and helping to generate things like ambient occlusion. It's got a good community around it so getting support on the forum should be easy.
Have you used a common display list for all your cubes?
Do you skip calling the drawing code for cubes which are not visible to the user?