OpenGL slows down when a lot of stuff is on the screen - C++

So I just started switching over from SDL to OpenGL today, and I'm having a problem I didn't have when I was using SDL.
When there's a lot of stuff on the screen, the whole thing goes into slow motion. And when I say a lot, I mean 200+ objects, though it starts to become noticeable at maybe 50.
This is how things are rendered: I have a class Renderable with a virtual void render(), which the RenderManager calls for every on-screen Renderable inside its void manage() loop.
The main loop looks like this
glClear(GL_COLOR_BUFFER_BIT | GL_DEPTH_BUFFER_BIT);
glLoadIdentity();
_renderManager.manage();
glFlush();
SDL_GL_SwapBuffers();
and since the objects I'm using are only squares, render() is just
glBegin(GL_QUADS);
// Draw square with colors
glEnd();
Neither my CPU usage nor my memory usage seems high at all; it's just like... the game is slowing down.

I guess the problem is that you are using immediate mode; you should use vertex arrays if you have performance problems.
We don't know all of your code, so it's difficult to give a complete answer, but switching to vertex arrays is surely the first thing you should try if things are slow.
Take a look here: http://www.songho.ca/opengl/gl_vertexarray.html
Basically, with glBegin...glEnd you end up making a separate driver call for every vertex, while with vertex arrays you precompute your shapes, store them in arrays (or buffers), and draw them directly, reducing the number of calls by a significant amount.
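For example, here is a minimal sketch of drawing a single quad with a vertex array instead of glBegin/glEnd (the vertex values are made up for illustration):
// Four 2D vertices of one quad, tightly packed
GLfloat quad[] = {
    -0.5f, -0.5f,
     0.5f, -0.5f,
     0.5f,  0.5f,
    -0.5f,  0.5f
};
glEnableClientState(GL_VERTEX_ARRAY);  // read vertices from a client array
glVertexPointer(2, GL_FLOAT, 0, quad); // 2 floats per vertex, tightly packed
glDrawArrays(GL_QUADS, 0, 4);          // one call draws all 4 vertices
glDisableClientState(GL_VERTEX_ARRAY);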

Sorry, I've never used OpenGL before and I really had no idea what I was doing, but I've found a solution (and am working on it now!). Also, thank you Jack for your answer! I'm definitely using vertex arrays now; I've found that glBegin() and glEnd() are deprecated. I might even try vertex buffer objects.
The problem was that each individual Renderable was calling glBegin() and glEnd() when the RenderManager called render() for it each loop. That adds up to a huge number of API calls per frame.
At the time of writing this answer, I have a sendVertices(GLfloat vertices[]); in my RenderManager which appends all vertices to an std::vector<GLfloat>. During the RenderManager's loop I set up a vertex pointer
glVertexPointer(2, GL_FLOAT, 0, &vertices[0]);
then call
glDrawArrays(GL_QUADS, 0, vertices.size() / 2);
So instead of issuing a draw for each object, it renders everything at once from the accumulated vertices. Now I only start to see slowdown at 800+ objects. Not great yet, but there's still a lot of work to do: right now the vertices vector is recreated each loop rather than modified, and this doesn't account for colors. Still, I'm on the right track!
Originally, switching to vertex arrays didn't make much of a difference, because rendering still happened per object, so glDrawArrays() was being called once for each of the 200+ Renderables.
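To sketch the batched approach (the member names and the color handling here are illustrative assumptions, not my exact code):
std::vector<GLfloat> vertices; // x,y pairs, refilled each frame
std::vector<GLfloat> colors;   // r,g,b per vertex, kept parallel to vertices
void RenderManager::manage() {
    vertices.clear(); // reuses capacity instead of reallocating
    colors.clear();
    for (Renderable* r : _renderables)
        r->render();  // each render() calls sendVertices(...) and pushes colors
    glEnableClientState(GL_VERTEX_ARRAY);
    glEnableClientState(GL_COLOR_ARRAY);
    glVertexPointer(2, GL_FLOAT, 0, &vertices[0]);
    glColorPointer(3, GL_FLOAT, 0, &colors[0]);
    glDrawArrays(GL_QUADS, 0, vertices.size() / 2); // one draw call for everything
    glDisableClientState(GL_COLOR_ARRAY);
    glDisableClientState(GL_VERTEX_ARRAY);
}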
EDIT:
Also, sorry for not giving enough information; I guess I assumed the problem would be obvious. What confidence I have in myself, huh? Haha.

Related

OpenGL: Repeated use of transform feedback buffers overwrites already established textures

I have a working implementation of this technique for view frustum culling of instanced geometry. The gist of the technique is that we use a vertex shader to check whether the bounds of an object lie within the view frustum, and if they do, we output the position of that object to a texture, using a transform feedback buffer and a geometry shader. We can then, during an actual rendering pass, use that texture, along with a query of how many positions we emitted, to acquire the relevant position data for the objects we're rendering and the instance count to specify in our call to glDrawElementsInstanced. One difference between what I do and what the article does is that I emit a full transformation matrix, rather than a simple position vector, to the texture, but I doubt that has any bearing on my problem.
The actual problem: Currently I have this setup so that, for each object type being rendered (i.e. tree, box, rock, whatever), the actual rendering pass follows immediately upon the frustum cull rendering pass. This works, and gives the intended results. What I want to do instead, however, is to go over all my drawcommands and do all the frustum culling for the various objects first, and only thereafter do all the actual rendering, to avoid a bunch of unnecessary state changes (i.e. switching back and forth between shader programs). When I do this, however, I encounter the problem that previously established textures -- the ones I use for reading positions from during the actual rendering passes -- all seem to be overwritten by the latest call to the frustum culling function, meaning that all textures established seemingly contain only the position information from the last frustum cull call.
For example: I render, in order, 4 trees, 10 boxes and 3 rocks, and what I will see instead is a tree, a box, and a rock, at all the (three) positions where I would expect only the 3 rocks to be. I cannot for the life of me figure out why this is, because I quite clearly bind new buffers and textures to the TRANSFORM_FEEDBACK_BUFFER every time I call the function. Why are the previously used textures still receiving the new data from the latest call?
Code, in C, for the frustum culling function:
void fcullidraw(drawcommand *tar) {
    /* printf("Fculling %s\n", tar->res->name); */
    mesh *rmesh = &tar->res->amod->meshes[0];
    /* glDeleteTextures(1, &rmesh->ctex); */
    if(rmesh->ctbuf == 0)
        glGenBuffers(1, &rmesh->ctbuf);
    glBindBuffer(GL_TEXTURE_BUFFER, rmesh->ctbuf);
    glBufferData(GL_TEXTURE_BUFFER, sizeof(instancedata) * tar->nodraws, NULL, GL_DYNAMIC_COPY);
    if(rmesh->ctex == 0)
        glGenTextures(1, &rmesh->ctex);
    glBindTexture(GL_TEXTURE_BUFFER, rmesh->ctex);
    glTexBuffer(GL_TEXTURE_BUFFER, GL_RGBA32F, rmesh->ctbuf);
    if(rmesh->cquery == 0)
        glGenQueries(1, &rmesh->cquery);
    checkactiveshader(tar->tar, findshader("icull"));
    glEnable(GL_RASTERIZER_DISCARD);
    glUniform1f(activeshader->radius, tar->res->amesh->bbox.radius);
    glUniform3fv(activeshader->extent, 1, (const GLfloat*)&tar->res->amesh->bbox.ext);
    glUniform3fv(activeshader->cp, 1, (const GLfloat*)&tar->res->amesh->bbox.cp);
    glBindVertexArray(tar->res->amod->meshes[0].vao);
    glBindBuffer(GL_ARRAY_BUFFER, tar->res->amod->meshes[0].posarray);
    glBufferData(GL_ARRAY_BUFFER, sizeof(mat4_t) * tar->nodraws, tar->posarray, GL_DYNAMIC_DRAW);
    glBindBufferBase(GL_TRANSFORM_FEEDBACK_BUFFER, 0, rmesh->ctbuf);
    glBeginTransformFeedback(GL_POINTS);
    glBeginQuery(GL_PRIMITIVES_GENERATED, rmesh->cquery);
    glDrawArrays(GL_POINTS, 0, tar->nodraws);
    glEndQuery(GL_PRIMITIVES_GENERATED);
    glEndTransformFeedback();
    glDisable(GL_RASTERIZER_DISCARD);
    glGetQueryObjectuiv(rmesh->cquery, GL_QUERY_RESULT, &rmesh->visibleinstances);
}
tar and rmesh obviously vary between each call to this function. Do note that I have left in a few lines of comments here containing code to delete the buffers and textures between each rendering cycle, rather than simply overwriting them, but using that code instead has no effect on the error mode.
I'm stumped. I feel that the textures and buffers are well defined and clearly kept separate, so I do not understand how the textures from previous calls to fcullidraw are somehow still bound to, and being overwritten by, the transform feedback, if that is indeed what is happening. It certainly seems to be, because the earlier objects will read in the entire transformation matrix of the rock quite neatly, with the "right" rotation, translation, and everything.
The article linked does do the operations in the order I want to do them -- i.e. first repeated frustum culls, and then repeated rendering -- and I'm not sure I see what I do differently. Might be some small and obvious thing, and I might be an idiot, but in that case I'd love to know why and how I am that.
EDIT: I pushed on and updated my implementation with a refinement of the original technique, suggested here, which gets rid of the writing-to-texture method altogether, in favor of simply writing to a buffer bound to the VAO and set to update once per rendered instance with a VertexAttribDivisor. This method looks a lot cleaner on the whole, and incidentally has the additional side effect of avoiding my original problem entirely, as I'm no longer writing to and uploading textures. This is thus no longer a practical problem for me, but the answer to the theoretical question still eludes me, so if anyone has ideas, I'm still all ears.
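For reference, the buffer-based variant of the setup looks roughly like this (a sketch under assumed names; attribute locations 4..7 for the matrix columns are an assumption, and a mat4 attribute always occupies four consecutive attribute slots):
/* instbuf is assumed to hold one mat4 per visible instance (e.g. filled
   by the transform feedback pass); the mesh's VAO is assumed bound. */
glBindBuffer(GL_ARRAY_BUFFER, instbuf);
for (int col = 0; col < 4; col++) {
    glEnableVertexAttribArray(4 + col);
    glVertexAttribPointer(4 + col, 4, GL_FLOAT, GL_FALSE,
                          sizeof(GLfloat) * 16,
                          (void*)(sizeof(GLfloat) * 4 * col));
    glVertexAttribDivisor(4 + col, 1); /* advance once per instance, not per vertex */
}
glDrawElementsInstanced(GL_TRIANGLES, indexcount, GL_UNSIGNED_INT, 0, visibleinstances);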

C++, OpenGL - Rendering a large number of... teapots

I am quite new to OpenGL programming. My goal was to set up object-oriented graphics programming, and I can proudly say that I've made some progress. Now I have a different problem.
Let's say we have a working program that can draw one, two, or many rotating teapots. I did this using a list inside my class. The raw code for the drawing function is here:
void Draw(void)
{
    glClear(GL_COLOR_BUFFER_BIT | GL_DEPTH_BUFFER_BIT);
    glMatrixMode(GL_MODELVIEW);
    glPushMatrix();
    for(list<teapot>::iterator it = teapots.begin(); it != teapots.end(); it++) {
        glTranslatef(it->pos.x, it->pos.y, it->pos.z);
        glRotatef(angle, it->ang.x, it->ang.y, it->ang.z);
        glutSolidTeapot(it->size);
        glRotatef(angle, -it->ang.x, -it->ang.y, -it->ang.z);
        glTranslatef(-it->pos.x, -it->pos.y, -it->pos.z);
    }
    glPopMatrix();
    glutSwapBuffers();
}
Everything is great, but when I draw a large number of teapots - say, 128 in two rows - my FPS drops. I don't know if it's just a hardware limit or if I'm doing something wrong. Maybe glPushMatrix() and glPopMatrix() should happen more often? Or less often?
You're using an old, deprecated part of OpenGL (called "immediate mode") in which all the graphics data is sent from the CPU to the GPU every frame: inside glutSolidTeapot() is code that does something like glBegin(GL_TRIANGLES) followed by lots of glVertex3f(...) and finally glEnd(). The reason that's deprecated is because it's a bottleneck. GPUs are highly parallel and are capable of processing many triangles at the same time, but they can't do that if your program is sending the vertices one-at-a-time with glVertex3f.
You should learn about the modern OpenGL API, in which you start by creating a "buffer object" and loading your vertex data into it — basically uploading your shape into the GPU's memory once, up-front — and then you can issue lots of calls telling the GPU to draw triangles using the vertices in that buffer object, instead of having to send all the vertices again every time.
(Unfortunately, this means you won't be able to use glutSolidTeapot(), since that draws in immediate mode and doesn't know how to produce vertex data for a buffer object. But I'm sure you can find a teapot model somewhere on the web.)
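For illustration, here is a minimal sketch of that buffer-object workflow; for brevity it uses the older fixed-function pointer setup on top of a buffer object, while fully modern GL would use glVertexAttribPointer with shaders:
// One-time setup: copy the vertex data into GPU memory
GLuint vbo;
glGenBuffers(1, &vbo);
glBindBuffer(GL_ARRAY_BUFFER, vbo);
GLfloat verts[] = { -1.0f, 0.0f, 0.0f,
                     1.0f, 0.0f, 0.0f,
                     0.0f, 1.0f, 0.0f };
glBufferData(GL_ARRAY_BUFFER, sizeof(verts), verts, GL_STATIC_DRAW);
// Every frame: draw from the buffer, with no per-vertex CPU work
glBindBuffer(GL_ARRAY_BUFFER, vbo);
glEnableClientState(GL_VERTEX_ARRAY);
glVertexPointer(3, GL_FLOAT, 0, 0); // offset 0 into the bound buffer
glDrawArrays(GL_TRIANGLES, 0, 3);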
Open.gl is one decent tutorial I know of for modern-style OpenGL, but I'm sure there are others as well.
Wyzard is right, partially. Besides the fact that you are using the old, deprecated API, where on each draw call you submit all your data again and again from the CPU to the GPU, you also expect to maintain a decent frame rate while rendering the same geometry multiple times. So in fact, keeping this approach to geometry rendering while using the programmable pipeline will not gain you much either. You will start noticing an FPS drop after roughly 40-60 objects (depending on your GPU).
What you really need is called batched drawing. Batched drawing covers different techniques, all of which imply using modern OpenGL, as we are talking here about data buffers (arrays of vertices, in your case, which you upload to the GPU). You can either push all the geometry into a single vertex buffer or use instanced rendering commands. In your case, if all you are after is drawing the same mesh multiple times, the second technique is the perfect solution (see the sketch below). There are more complex techniques, like indirect multi-draw commands, which allow you to draw very large quantities of different geometry with a single draw call, but those are pretty advanced for beginners. Anyway, the bottom line is that you must move to modern OpenGL and start using geometry batching if you want to keep your app's FPS high while drawing large numbers of meshes.
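As an illustration of instanced rendering, a minimal sketch (teapotVAO, indexCount, and teapotCount are assumed names; the per-instance transforms would come from a uniform array, a UBO, or an instanced vertex attribute read in the vertex shader):
// Upload the mesh once into a VAO/VBO during setup (not shown), then
// draw all copies with a single call; the vertex shader tells the
// instances apart with gl_InstanceID and applies each one's transform.
glBindVertexArray(teapotVAO);
glDrawElementsInstanced(GL_TRIANGLES, indexCount, GL_UNSIGNED_INT, 0, teapotCount);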

OpenGL what do I have to do before drawing a triangle?

Most of the tutorials, guides and books that I've found out there related to OpenGL explain how to draw a triangle and how to initialize OpenGL. That's fine. But when they try to explain it, they just list a bunch of functions and parameters like:
glClear()
glClearColor()
glBegin()
glEnd()
...
Since I'm not very good at learning things by memory, I always need an answer to "why are we doing this?", so that I write that bunch of functions because I remember that I have to set certain things up before doing something else, and so on, not because the tutorial told me so.
Could someone please explain to me what I have to define to OpenGL (only pure OpenGL; I'm using SFML as the underlying library, but that really doesn't matter) before starting to draw something with glBegin() and glEnd()?
Sample answer:
You have to first tell OpenGL what color it needs to clear the screen with, because the previous frame needs to be cleared before we start to draw the current one...
First you should know that OpenGL is a state machine. That means that, apart from creating the OpenGL context (which is done by SFML), there's no such thing as initialization!
Since I'm not very good at learning things by memory,
This is good…
I always need an answer to "why are we doing this?"
This is excellent!
Could someone please explain to me what I have to define to OpenGL (only pure OpenGL; I'm using SFML as the underlying library, but that really doesn't matter) before starting to draw something with glBegin() and glEnd()?
As I already said: OpenGL is a state machine. That basically means there are two kinds of calls you can make: setting state and executing operations.
For example, glClearColor sets a state variable: the clear color, whose value is used to clear the active framebuffer's color when glClear is called with the GL_COLOR_BUFFER_BIT flag set. There is a similar function, glClearDepth, for the depth value (GL_DEPTH_BUFFER_BIT flag to glClear).
glBegin and glEnd belong to the immediate mode of OpenGL, which has been deprecated, so there's little reason to learn them. You should use vertex arrays instead, preferably through vertex buffer objects.
But here it goes: glBegin puts OpenGL in a state where it expects geometry, of the kind of primitive selected as the parameter to glBegin. GL_TRIANGLES, for example, means that OpenGL will interpret every 3 calls to glVertex as forming a triangle. glEnd tells OpenGL that you've finished that batch of triangles. Within a glBegin…glEnd block, certain state changes are disallowed: among them everything that has to do with transforming the geometry and generating the picture, which includes matrices, shaders, textures, and some others.
One common misconception is that OpenGL is initialized. This is due to badly written tutorials which have an initGL function or similar. It's good practice to set all state from scratch when beginning to render a scene. But since a single frame may contain several scenes (think of a HUD or split-screen gaming), this happens several times per frame.
Update:
So how do you draw a triangle? Well, it's simple enough. First you need the geometry data. For example this:
GLfloat triangle[] = {
    -1, 0, 0,
    +1, 0, 0,
     0, 1, 0
};
In the render function we tell OpenGL that the next calls to glDrawArrays or glDrawElements shall fetch the data from there (for the sake of simplicity I'll use OpenGL-2 functions here):
glVertexPointer(3,        /* there are three scalars per vertex element */
                GL_FLOAT, /* element scalars are float */
                0,        /* elements are tightly packed (could as well be sizeof(GLfloat)*3) */
                triangle  /* and there you find the data */ );
/* Note that glVertexPointer does not make a copy of the data!
   If using a VBO, the data is copied when calling glBufferData. */

/* this switches OpenGL into a state in which it will
   actually access the data at the place we pointed it
   to with glVertexPointer */
glEnableClientState(GL_VERTEX_ARRAY);

/* glDrawArrays takes data from the supplied arrays and draws it
   as if it were submitted sequentially in a for loop to immediate
   mode functions. Has some valid applications. Better to use index
   based drawing for models with a lot of shared vertices. */
glDrawArrays(GL_TRIANGLES, /* draw triangles */
             0,            /* start at index 0 */
             3             /* process 3 elements (of 3 scalars each) */ );
What I didn't include yet is setting up the transformation and viewport mapping.
The viewport defines how the readily projected and normalized geometry is placed in the window. This state is set using glViewport(pos_left, pos_bottom, width, height).
Transformation today happens in a vertex shader. Essentially, a vertex shader is a small program written in a special language (GLSL) that takes the vertex attributes and calculates the clip-space position of the resulting vertex. The usual approach is to emulate the fixed-function pipeline, which is a two-stage process: first transform the geometry into view space (some calculations, like illumination, are easier in this space), then project it into clip space, which is kind of the lens of the renderer. In the fixed-function pipeline there are two transformation matrices for this: modelview and projection. You set them to whatever is required for the desired outcome. In the case of just a triangle, we leave the modelview at identity and use an ortho projection from -1 to 1 in each dimension.
glMatrixMode(GL_PROJECTION);
/* the following function multiplies onto what's already on the stack,
   so reset it to identity */
glLoadIdentity();
/* our clip volume is defined by 6 orthogonal planes with normals X, Y, Z
   and distance 1 from the origin in each direction */
glOrtho(-1, 1, -1, 1, -1, 1);
glMatrixMode(GL_MODELVIEW);
/* now an identity matrix is loaded onto the modelview */
glLoadIdentity();
Having set up the transformation we can now draw the triangle as outlined above:
draw_triangle();
Finally we need to tell OpenGL that we're done sending commands and it should finish its rendering.
if(singlebuffered)
glFinish();
However, most of the time your window is double buffered, so you need to swap the buffers to make things visible. Since swapping makes no sense without finishing, the swap implies a finish:
else
SwapBuffers();
You're using the API to set and change the OpenGL state machine.
You're not actually programming directly to the GPU, you're using a medium between your application and your GPU to do whatever you're trying to do.
The reason it is like this, and doesn't work the same way as a CPU and memory, is because OpenGL was intended to be OS- and hardware-independent, so that your code can run on any OS and on any hardware, not just the one you're programming on.
Hence, because of this, you need to learn to use their preset API, which makes sure that whatever you're trying to do will be able to run on all systems/OS/hardware within a reasonable range.
For example, if you were to create your application on Windows 8.1 with a certain graphics card (say AMD's), you still want your application to be able to run on Android/iOS/Linux/other Windows systems/other hardware (GPUs) such as Nvidia's.
Hence why Khronos, when they created the API, made it as system/hardware-independent as possible, so that it can run on everything and be a standard for everyone.
This is the price we have to pay for it: we have to learn their API instead of learning how to directly write to GPU memory and directly utilize the GPU to process information/data.
Although with the introduction of Vulkan (also from Khronos), things might be different when it is released, and we will find out how it works.

2D Particle System - Performance

I have implemented a 2D particle system based on the ideas and concepts outlined in "Building an Advanced Particle System" (John van der Burg, Game Developer Magazine, March 2000).
Now I am wondering what performance I should expect from this system. I am currently testing it within the context of my simple (unfinished) SDL/OpenGL platformer, where all particles are updated every frame. Drawing is done as follows:
// Bind Texture
glBindTexture(GL_TEXTURE_2D, *texture);
// for all particles
glBegin(GL_QUADS);
glTexCoord2d(0,0); glVertex2f(x,y);
glTexCoord2d(1,0); glVertex2f(x+w,y);
glTexCoord2d(1,1); glVertex2f(x+w,y+h);
glTexCoord2d(0,1); glVertex2f(x,y+h);
glEnd();
where one texture is used for all particles.
It runs smoothly up to about 3000 particles. To be honest I was expecting a lot more, particularly since this is meant to be used with more than one system on screen. What number of particles should I expect to be displayed smoothly?
PS: I am relatively new to C++ and OpenGL likewise, so it might well be that I messed up somewhere!?
EDIT Using POINT_SPRITE
glEnable(GL_POINT_SPRITE);
glBindTexture(GL_TEXTURE_2D, *texture);
glTexEnvi(GL_POINT_SPRITE, GL_COORD_REPLACE, GL_TRUE);
glPointSize(size); // note: glPointSize may not be called inside glBegin/glEnd
// for all particles
glBegin(GL_POINTS);
glVertex2f(x,y);
glEnd();
glDisable(GL_POINT_SPRITE);
I can't see any performance difference compared to using GL_QUADS at all!?
EDIT Using VERTEX_ARRAY
// Setup
glEnable(GL_POINT_SPRITE);
glTexEnvi(GL_POINT_SPRITE, GL_COORD_REPLACE, GL_TRUE);
glPointSize(20);
// A big array to hold all the points
const int NumPoints = 2000;
Vector2 ArrayOfPoints[NumPoints];
for (int i = 0; i < NumPoints; i++) {
    ArrayOfPoints[i].x = 350 + rand() % 201;
    ArrayOfPoints[i].y = 350 + rand() % 201;
}
// Rendering
glEnableClientState(GL_VERTEX_ARRAY);           // Enable vertex arrays
glVertexPointer(2, GL_FLOAT, 0, ArrayOfPoints); // Specify data
glDrawArrays(GL_POINTS, 0, NumPoints);          // Draw points, starting from the 0th point in the array, exactly NumPoints of them
Using VAs made a performance difference compared to the above. I've since tried VBOs, but I don't really see a performance difference there?
I can't say how much you can expect from that solution, but there are some ways to improve it.
Firstly, by using glBegin() and glEnd() you are using immediate mode, which is, as far as I know, the slowest way of doing things. Furthermore, it isn't even present in the current OpenGL standard anymore.
For OpenGL 2.1
Point Sprites:
You might want to use point sprites. I implemented a particle system using them and got nice performance (by my standards back then, at least). Using point sprites, you make fewer OpenGL calls per frame and send less data to the graphics card (you may even have the data stored on the graphics card; I'm not sure about that). A short google search should even give you some implementations to look at.
Vertex Arrays:
If using point sprites doesn't help, you should consider using vertex arrays in combination with point sprites (to save a bit of memory). Basically, you store the vertex data of the particles in an array. You then enable vertex array support by calling glEnableClientState() with GL_VERTEX_ARRAY as the parameter. After that, you call glVertexPointer() (the parameters are explained in the OpenGL documentation) and call glDrawArrays() to draw the particles. This will reduce your OpenGL calls to only a handful instead of 3000 per frame.
For OpenGL 3.3 and above
Instancing:
If you are programming against OpenGL 3.3 or above, you can even consider using instancing to draw your particles, which should speed that up even further. Again, a short google search will let you look at some code about that.
In General:
Using SSE:
In addition, some time might be lost updating your vertex positions. If you want to speed that up, you can take a look at using SSE for updating them; done correctly, you will gain a lot of performance (with large numbers of particles, at least).
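A minimal sketch of such an SSE update, assuming positions and velocities live in 16-byte-aligned float arrays whose length is a multiple of 4 (names are illustrative):
#include <xmmintrin.h> // SSE intrinsics
// Advance four floats per iteration: pos += vel * dt
void update_positions(float* pos, const float* vel, float dt, int count)
{
    __m128 vdt = _mm_set1_ps(dt);              // broadcast dt into all 4 lanes
    for (int i = 0; i < count; i += 4) {
        __m128 p = _mm_load_ps(pos + i);       // load 4 positions
        __m128 v = _mm_load_ps(vel + i);       // load 4 velocities
        p = _mm_add_ps(p, _mm_mul_ps(v, vdt)); // p += v * dt
        _mm_store_ps(pos + i, p);              // store back
    }
}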
Data Layout:
Finally, I recently found a link (divergentcoder.com/programming/aos-soa-explorations-part-1, thanks Ben) about structures of arrays (SoA) and arrays of structures (AoS), comparing how they affect performance using a particle system as the example.
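To make the AoS/SoA distinction concrete, a schematic sketch (not taken from the linked article):
// Array of Structures (AoS): each particle's fields are interleaved in memory.
struct ParticleAoS { float x, y, vx, vy; };
ParticleAoS particlesAoS[2000];
// Structure of Arrays (SoA): each field is contiguous, which is friendlier
// to caches and SIMD (e.g. the SSE loop above) when updating one field
// across all particles.
struct ParticlesSoA {
    float x[2000], y[2000];
    float vx[2000], vy[2000];
} particlesSoA;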
Consider using vertex arrays instead of immediate mode (glBegin/End): http://www.songho.ca/opengl/gl_vertexarray.html
If you are willing to get into shaders, you could also search for "vertex shader" and consider using that approach for your project.

Dynamic tile display optimization in OpenGL

I am working on a tile-based, top-down 2D game with dynamically generated terrain, and have started (re)writing the graphics engine in OpenGL. The game is written in Java using LWJGL, and I'd prefer it to stay relatively platform-independent and playable on older computers too.
Currently I'm using immediate mode for drawing, but obviously this is too slow for anything but the simplest scenes.
There are two basic types of objects that are drawn: tiles, which make up the world, and sprites, which are pretty much everything else (entities, items, effects, etc.).
The tiles are 20*20 px, and are stored in chunks (40*40 tiles). Terrain generation is done in full chunks, like in Minecraft.
The method I use now is iterating over the 9 chunks near the player, and then iterating over each tile inside, drawing one quad for the tile texture, and optional extra quads for features depending on the material.
This ends up quite slow, but a simple out-of-view check gives a 5-10x FPS boost.
For optimizing this, I looked into using VBOs and quad strips, but I ran into a problem when the terrain changes. This doesn't happen every frame, but it's not a very rare event either.
A simple method would be dropping and rebuilding a chunk's VBO every time it changes. This doesn't seem like the best way, though. I read that VBOs can be "dynamic", allowing their content to be changed. How can this be done, and what data can be changed inside them efficiently? Are there any other ways to draw the world efficiently?
The other type, sprites, are currently drawn as a quad with a texture mapped from a sprite sheet, so by changing texture coordinates I can even animate them later. Is this the correct way to do the animation, though?
Currently even a very high number of sprites won't slow the game down much, and by understanding VBOs, I'll be able to speed them up even more, but I haven't seen any solid and reliable tutorials for an efficient way of doing this. Does anyone know one perhaps?
Thanks for the help!
Currently I'm using immediate mode for drawing, but obviously this is too slow for anything but the simplest scenes.
I disagree. Unless you are drawing a lot of tiles (tens of thousands per frame), immediate mode should be just fine for you.
The key is something you will have to be doing to get good performance anyway: texture atlases. All of your tiles should be stored in a single texture. You use texture coordinate to pull different tiles out of that texture when rendering. So if this is what your render loop looks like now:
for(tile in tileList) //Pseudocode. Not actual C++
{
    glBindTexture(GL_TEXTURE_2D, tile.texture);
    glBegin(GL_QUADS);
    glTexCoord2f(0.0f, 0.0f);
    glVertex2fv(tile.lowerLeft);
    glTexCoord2f(0.0f, 1.0f);
    glVertex2fv(tile.upperLeft);
    glTexCoord2f(1.0f, 1.0f);
    glVertex2fv(tile.upperRight);
    glTexCoord2f(1.0f, 0.0f);
    glVertex2fv(tile.lowerRight);
    glEnd();
}
You can convert it into this:
glBindTexture(GL_TEXTURE_2D, allTilesTexture);
glBegin(GL_QUADS);
for(tile in tileList) //Still pseudocode.
{
    glTexCoord2f(tile.texCoord.lowerLeft);
    glVertex2fv(tile.lowerLeft);
    glTexCoord2f(tile.texCoord.upperLeft);
    glVertex2fv(tile.upperLeft);
    glTexCoord2f(tile.texCoord.upperRight);
    glVertex2fv(tile.upperRight);
    glTexCoord2f(tile.texCoord.lowerRight);
    glVertex2fv(tile.lowerRight);
}
glEnd();
If you are already using a texture atlas and still aren't getting acceptable performance, then you can move on to buffer objects and the like. But you won't get any better performance from buffer objects if you don't do this first.
If all of your tiles cannot fit into a single texture, then you will need to do one of two things: use multiple textures (rendering as many tiles with each texture in one glBegin/glEnd pair as possible), or use a texture array. Texture arrays are available in OpenGL 3.0-level hardware only. That means any Radeon HDxxxx or GeForce 8xxxx or better.
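For reference, creating a texture array looks roughly like this (a sketch; the 64x64 tile size, layerCount, and tilePixels are assumptions):
// Allocate a 64x64 texture array with layerCount layers, then upload
// each tile image into its own layer.
GLuint texArray;
glGenTextures(1, &texArray);
glBindTexture(GL_TEXTURE_2D_ARRAY, texArray);
glTexImage3D(GL_TEXTURE_2D_ARRAY, 0, GL_RGBA8,
             64, 64, layerCount,                 // width, height, layers
             0, GL_RGBA, GL_UNSIGNED_BYTE, NULL);
for (int layer = 0; layer < layerCount; layer++)
    glTexSubImage3D(GL_TEXTURE_2D_ARRAY, 0,
                    0, 0, layer,                 // x, y, layer offset
                    64, 64, 1,                   // one full layer
                    GL_RGBA, GL_UNSIGNED_BYTE, tilePixels[layer]);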
You mentioned that you sometimes render "features" on top of tiles. These features likely use blending and different glTexEnv modes from regular tiles. In this case, you need to find ways to group similar features into a single glBegin/glEnd pair.
As you may be gathering from this, the key to performance is minimizing the number of times you call glBindTexture and glBegin/glEnd. Do as much work as possible in each glBegin/glEnd.
If you wish to proceed with a buffer-based approach (and you should only bother if the texture atlas approach didn't get your performance up to par), it's fairly simple. Put all of your tile "chunks" into a single buffer object. Don't make a buffer for each one; there's no real reason to do so, and 40x40 tiles worth of vertex data is only 12,800 bytes. You can put 81 such chunks in a single 1MB buffer. This way, you only have to call glBindBuffer for your terrain. Which again, saves you performance.
I would need to know more about these "features" you sometimes use to suggest a way to optimize them. But as for dynamic buffers, I wouldn't worry. Just use glBufferSubData to update the part of the buffer in question. If this turns out to be slow, there are several options for making it faster that you can employ. But you shouldn't bother unless you know that it is necessary, since they're complex.
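A sketch of such an update for one chunk inside the shared buffer (the offsets and names here are illustrative):
// When a chunk's terrain changes, rebuild only its vertex data on the CPU
// and overwrite that chunk's slice of the shared VBO in place.
const size_t chunkBytes = CHUNK_VERTEX_COUNT * 2 * sizeof(GLfloat); // assumed 2D layout
glBindBuffer(GL_ARRAY_BUFFER, terrainVBO);
glBufferSubData(GL_ARRAY_BUFFER,
                chunkIndex * chunkBytes, // byte offset of this chunk
                chunkBytes,              // size of the updated region
                chunkVertices);          // freshly rebuilt CPU-side data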
Sprites are probably the thing that benefits least from a buffer-object approach. There's really nothing to be gained by it over immediate mode: even if you're rendering hundreds of them, each one will have its own transformation matrix, which means each one has to be a separate draw call. So it may as well be glBegin/glEnd.