OpenGL Display List Optimizing - c++

I am currently running some speed tests for my applications and I am trying to find more ways to optimize my program, specifically with my display lists. Currently I am getting:
12 FPS with 882,000 vertices
40 FPS with 234,000 vertices
95 FPS with 72,000 vertices
I know that I need to minimize the number of calls made, so instead of:
for(int i = 0; i < Number; i++) {
    glBegin(GL_QUADS);
    ...normal and vertex declarations here
    glEnd();
}
A better way would be to do this:
glBegin(GL_QUADS);
for(int i = 0; i < Number; i++) {
    ...normal and vertex declarations here
}
glEnd();
This did help increase my FPS to the results listed above; however, are there other ways I can optimize my display lists? Perhaps by using something other than nested vertex arrays to store my model data?

You'll get a significant speed boost by switching to VBOs, or at least vertex arrays.
Immediate mode (glBegin()...glEnd()) has a lot of function call overhead. I've managed to render ~1 million vertices at several hundred FPS on a laptop (it would be faster still without the physics engine/entity system overhead) by using more modern OpenGL.
If you're wondering about compatibility, about 98% of people support the VBO extension (GL_ARB_vertex_buffer_object) http://feedback.wildfiregames.com/report/opengl/
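As a rough sketch of the idea (the names, vertex layout, and counts are illustrative, not your code), the quad data could be uploaded to a VBO once and then drawn with a single call:
// One-time setup: interleaved position (3 floats) + normal (3 floats) per vertex.
GLuint vbo = 0;
glGenBuffers(1, &vbo);
glBindBuffer(GL_ARRAY_BUFFER, vbo);
glBufferData(GL_ARRAY_BUFFER, vertexCount * 6 * sizeof(float), vertexData, GL_STATIC_DRAW);

// Per frame: point the fixed-function arrays at the buffer and draw everything at once.
glBindBuffer(GL_ARRAY_BUFFER, vbo);
glEnableClientState(GL_VERTEX_ARRAY);
glEnableClientState(GL_NORMAL_ARRAY);
glVertexPointer(3, GL_FLOAT, 6 * sizeof(float), (void*)0);
glNormalPointer(GL_FLOAT, 6 * sizeof(float), (void*)(3 * sizeof(float)));
glDrawArrays(GL_QUADS, 0, vertexCount);
glDisableClientState(GL_NORMAL_ARRAY);
glDisableClientState(GL_VERTEX_ARRAY);
The per-vertex work then happens on the GPU from buffer memory instead of through thousands of glVertex/glNormal calls.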

Related

In OpenGL, what is a good target for how many vertices in a VBO while maintaining a good frame rate

I am working on making a 2D game engine from scratch, mostly for fun. Recently I've been really concerned about the performance of the whole engine. I keep reading articles on a good target number of polygons to try to reach, and I've seen talk in the millions, while I've only managed to get to 40,000 without horrible frame rate drops.
I've tried to use a mapped buffer from the graphics card instead of my own, but that actually gives me worse performance. I've read about techniques like triple-buffered rendering, and while I can see how they may theoretically speed things up, I can't imagine them speeding my code up into the millions I've read about.
The format I use is 28-byte vertices (three floats for position, two floats for texture coordinates, one for color, and one for which texture to read from). I've thought about trimming this down, but once again it doesn't seem worth it.
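For reference, that layout corresponds to something like the struct below (the field names are guesses for illustration, not taken from the poster's code):
struct Vertex {
    float x, y, z;      // position            (12 bytes)
    float u, v;         // texture coordinates ( 8 bytes)
    float color;        // packed color        ( 4 bytes)
    float textureSlot;  // which texture to read from (4 bytes)
};                      // 7 floats = 28 bytes per vertex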
Looking through my code, almost 98% of the time is spent allocating, filling up, and handing the VAO to the graphics card, so that's currently my only bottleneck.
All the sprites are just four-sided polygons, and I'm just using GL_QUADS to render the whole object. 40,000 sprites just feels really low. I only have one draw call for them, so I was expecting at least ten times that from what I've read. I've heard some 3D models have nearly 40k polygons in them alone!
Here is some relevant code to how I render it all:
//This is the main render loop, currently it's only called once per frame
for (int i = 0; i < l_Layers.size(); i++) {
    glUseProgram(l_Layers[i]->getShader().getShaderProgram());
    GLint loc = glGetUniformLocation(l_Layers[i]->getShader().getShaderProgram(), "MVT");
    glUniformMatrix4fv(loc, 1, GL_FALSE, mat.data);
    l_Layers[i]->getVertexBuffer().Bind();
    glDrawArrays(GL_QUADS, 0, l_Layers[i]->getVertexBuffer().getSize());
    l_Layers[i]->getVertexBuffer().Unbind();
}
//These lines of code take up by far the most compute time
void OP::VertexBuffer::startBuffer(int size)
{
    flush();
    Vertices = new Vertex[size * 4];
}

void OP::VertexBuffer::submit(Vertex vertex)
{
    Vertices[Index] = vertex;
    Index++;
}
void Layer::Render() {
    l_VertexBuffer.startBuffer(l_Sprites.size());
    for (size_t i = 0; i < l_Sprites.size(); i++) {
        Vertex* vert = l_Sprites[i]->getVertexArray();
        l_VertexBuffer.submit(vert[0]);
        l_VertexBuffer.submit(vert[1]);
        l_VertexBuffer.submit(vert[2]);
        l_VertexBuffer.submit(vert[3]);
    }
}
I don't know what I've been doing wrong, but I just don't understand how people are getting orders of magnitude more polygons on the screen, especially when they have far more complex models than I have with GL_QUADS.
98% of the time is spent allocating, filling up, and giving the VAO to the graphics card. So that's currently my only bottleneck.
Creating the VAO and filling it up should actually only happen once, and therefore should not affect the frame rate; you should only need to bind the VAO before calling render.
Obviously I can't see all of your code, so I may have the wrong idea, but it looks like you're creating a new vertex array every time Render is called.
It doesn't surprise me that you're spending all of your time in here:
//These lines of code take up by far the most compute time
void OP::VertexBuffer::startBuffer(int size)
{
    flush();
    Vertices = new Vertex[size * 4];
}
Calling new for a large array on every render call is going to considerably impact your performance; you're also spending time assigning to that array every frame.
On top of that you appear to be leaking memory.
Every time you call:
Vertices = new Vertex[size * 4];
You're failing to free the array that you allocated on the previous call to Render. What you're doing is similar to the example below:
foo = new Foo();
foo = new Foo();
Memory is allocated to foo in the first call; the first Foo created is never destructed or deallocated, and there is now no way to do so because foo has been reassigned, so the first Foo has leaked.
So I think you have a combination of issues going on here.
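If the sprite data really does have to be rebuilt every frame, one sketch of how the per-frame path could avoid both the leak and the repeated allocation looks like this (the Capacity and BufferId members and the upload() function are made up for illustration; it assumes the GPU buffer was created once with glBufferData):
// Reuse the CPU-side staging array; only reallocate when it needs to grow.
void OP::VertexBuffer::startBuffer(int size)
{
    if (size * 4 > Capacity) {
        delete[] Vertices;          // free the previous allocation (no leak)
        Capacity = size * 4;
        Vertices = new Vertex[Capacity];
    }
    Index = 0;                      // rewind instead of reallocating
}

// After Layer::Render has filled the array, update the existing GPU buffer
// instead of creating a new one every frame.
void OP::VertexBuffer::upload()
{
    glBindBuffer(GL_ARRAY_BUFFER, BufferId);
    glBufferSubData(GL_ARRAY_BUFFER, 0, Index * sizeof(Vertex), Vertices);
}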

Which one is the proper method of writing this GL code?

I have been doing some experiments with OpenGL and handling textures.
In my experiment I have a 2D array of ints which are randomly generated:
int mapskeleton[300][300];
Then I have my own OBJ file loader for loading models with textures:
m2d wall, floor; // I initialize and load those files at start
For recording statistics of render times I used:
bool Once = 1;
int secs = 0;
Now to the render code, where I did my experiment:
// Code A: Benchmarked on radeon 8670D
// Takes 232 (average) millisecs for drawing 300*300 tiles
if(Once)
    secs = glutGet(GLUT_ELAPSED_TIME);
for(int i=0;i<mapHeight;i++){
    for(int j=0;j<mapWidth;j++){
        if(mapskeleton[j][i] == skel_Wall){
            glBindTexture(GL_TEXTURE_2D,wall.texture);
            glPushMatrix();
            glTranslatef(j*10,i*10,0);
            wall.Draw();//Draws 10 textured triangles
            glPopMatrix();
        }
        if(mapskeleton[j][i] == skel_floor){
            glBindTexture(GL_TEXTURE_2D,floor.texture);
            glPushMatrix();
            glTranslatef(j*10,i*10,0);
            floor.Draw();//Draws 2 textured triangles
            glPopMatrix();
        }
    }
}
if(Once){
    secs = glutGet(GLUT_ELAPSED_TIME)-secs;
    printf("time taken for rendering %i msecs",secs);
    Once = 0;
}
And the other code is:
// Code B: Benchmarked on radeon 8670D
// Takes 206 (average) millisecs for drawing 300*300 tiles
if(Once)
    secs = glutGet(GLUT_ELAPSED_TIME);
glBindTexture(GL_TEXTURE_2D,floor.texture);
for(int i=0;i<mapHeight;i++){
    for(int j=0;j<mapWidth;j++){
        if(mapskeleton[j][i] == skel_floor){
            glPushMatrix();
            glTranslatef(j*10,i*10,0);
            floor.Draw();
            glPopMatrix();
        }
    }
}
glBindTexture(GL_TEXTURE_2D,wall.texture);
for(int i=0;i<mapHeight;i++){
    for(int j=0;j<mapWidth;j++){
        if(mapskeleton[j][i] == skel_Wall){
            glPushMatrix();
            glTranslatef(j*10,i*10,0);
            wall.Draw();
            glPopMatrix();
        }
    }
}
if(Once){
    secs = glutGet(GLUT_ELAPSED_TIME)-secs;
    printf("time taken for rendering %i msecs",secs);
    Once = 0;
}
To me, code A looks good from the point of view of a person (a beginner) reading the code, but the benchmarks say different.
My GPU seems to like code B. I don't understand why code B takes less time to render.
Changes to OpenGL state can generally be expensive - the driver's and/or GPU's data structures and caches can become invalidated. In your case, the change in question is binding a different texture. In code B, you're doing it twice. In code A, you're easily doing it thousands of times.
When programming OpenGL rendering, you'll generally want to set up the pipeline for settings A, render everything which needs settings A, re-set the pipeline for settings B, render everything which needs settings B, and so on.
@Angew covered why one option is more efficient than the other. But there is an important point that needs to be stated very clearly. Based on the text of your question, particularly here:
for recording statistics of render times
my gpu seems to like code B
you seem to attempt to measure rendering/GPU performance.
You are NOT AT ALL measuring GPU performance!
You measure the time for setting up the state and making the draw calls. OpenGL lets the GPU operate asynchronously from the code executed on the CPU. The picture you should keep in mind when you make (most) OpenGL calls is that you're submitting work to the GPU for later execution. There's no telling when the GPU completes that work. It most definitely (except for very few calls that you want to avoid in speed critical code) does not happen by the time the call returns.
What you're measuring in your code is purely the CPU overhead for making these calls. This includes what's happening in your own code, and what happens in the driver code for handling the calls and preparing the work for later submission to the GPU.
I'm not saying that the measurement is not useful. Minimizing CPU overhead is very important. You just need to be very aware of what you are in fact measuring, and make sure that you draw the right conclusions.
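If you do want to measure how long the GPU itself spends on the work, one option is a timer query (OpenGL 3.3 / ARB_timer_query); this is just a sketch, not code from the question:
GLuint query;
glGenQueries(1, &query);

glBeginQuery(GL_TIME_ELAPSED, query);
// ... issue the draw calls you want to measure ...
glEndQuery(GL_TIME_ELAPSED);

// Reading GL_QUERY_RESULT waits until the GPU has finished and reports nanoseconds.
GLuint64 gpuTimeNs = 0;
glGetQueryObjectui64v(query, GL_QUERY_RESULT, &gpuTimeNs);
printf("GPU time: %.3f ms\n", gpuTimeNs / 1000000.0);
A cruder alternative is to call glFinish() before reading the clock, which forces the CPU to wait until the GPU has drained all pending work.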

glDrawArrays first few calls very slow using my shader and then very fast

I am using my own shader that does quite advanced calculations and outputs the results into a frame buffer.
I call glFinish to make sure previous OpenGL commands have executed on the graphics card. Then I call glDrawArrays, and this single call takes 5 seconds!
After calling glDrawArrays a few more times, the calls finally start running in under 1 ms each. So only the first few glDrawArrays calls are super slow.
There is no correlation with the size of the textures used; that doesn't affect performance. If I simplify the shader source code, it does make the first glDrawArrays calls faster, but not dramatically. Sometimes very benign changes in the shader source code lead to serious changes in performance (e.g. commenting out a few additions or subtractions). But all these code changes can speed up the first glDrawArrays calls from 5 seconds to, say, 1 second, not more. Those changes do not affect the performance of glDrawArrays calls after the first few are made; those still run at 1 ms each, a thousand times faster than the first 2-3 calls.
I am baffled by this problem. What could possibly be happening here? Is there a way to extract at least some info about what is really happening inside the GPU?
Ok, the shader code that affects performance is like this:
if (aType<18){
    if (aType < 9){
        if (aType < 6){
            if (aType==2)
            {
                res.x = EndX1;
                res.y = EndY1;
            }
            else if (aType==3)
            {
                res.x = EndX2;
                res.y = EndY2;
            }
.......... //continues with all these if 36 times
Replacing code above with for loop solved the performance problem:
for (int i=1; i <= 36; i++){
    if ((y < EndY[i]) || ((y== EndY[i])&&(x<=EndX[i])))
    {
        res.xy = SubXY(x,y,EndX[i-1],EndY[i-1]);
        res.z= 2;
        return res;
    }
}
Ironically, I wanted to avoid a for loop for performance reasons :)
Your driver is delaying the serious optimization steps until after the shader has been used a few times, and the non-optimized shader may be software emulated.
There are various reasons for this, but chief among them is that optimization takes time.
To fix this you can force the shader to run a few times with less data (a smaller output buffer via glViewport). This tells the driver to optimize the shader before you actually need it to handle larger loads.
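A rough sketch of that warm-up (the program and window size names are placeholders, not from the question):
// At load time, after compiling and linking the shader:
glUseProgram(myProgram);
glViewport(0, 0, 1, 1);                  // tiny output so the warm-up draws are cheap
for (int i = 0; i < 4; ++i)
    glDrawArrays(GL_TRIANGLES, 0, 3);    // a few dummy draws using the same state
glFinish();                              // make the driver actually finish the work now
glViewport(0, 0, windowWidth, windowHeight); // restore the real viewport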

Rewriting a simple Pygame 2D drawing function in C++

I have a 2D list of vectors (say 20x20 / 400 points) and I am drawing these points on a screen like so:
for row in grid:
    for point in row:
        pygame.draw.circle(window, white, (point.x, point.y), 2, 0)
pygame.display.flip() # redraw the screen
This works perfectly; however, it's much slower than I expected.
I want to rewrite this in C++ and hopefully learn some stuff on the way (I am doing a unit on C++ at the moment, so it'll help). What's the easiest way to approach this? I have looked at DirectX, and have so far followed a bunch of tutorials and drawn some rudimentary triangles. However, I can't find a simple way to draw a single point.
DirectX doesn't have functions for drawing just one point. It operates on vertex and index buffers only. If you want a simpler way to plot just one point, you'll need to write a wrapper.
For drawing lists of points you'll need to use DrawPrimitive(D3DPT_POINTLIST, ...). However, there will be no easy way to just plot a point: you'll have to prepare a buffer, lock it, fill it with data, then draw the buffer. Or you could use dynamic vertex buffers to optimize performance. There is a DrawPrimitiveUP call that is supposed to render primitives stored in system memory (instead of using buffers), but as far as I know it doesn't work (it may silently discard primitives) with pure devices, so you'd have to use software vertex processing.
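As a rough sketch (assuming a Direct3D 9 device pointer named device, software vertex processing, and pre-transformed screen-space vertices), a point-list draw could look like this:
struct PointVertex { float x, y, z, rhw; DWORD color; };
const DWORD POINT_FVF = D3DFVF_XYZRHW | D3DFVF_DIFFUSE;

std::vector<PointVertex> points;
// ... fill points, one entry per dot, e.g. color = D3DCOLOR_XRGB(255, 255, 255) ...

device->SetFVF(POINT_FVF);
device->DrawPrimitiveUP(D3DPT_POINTLIST, (UINT)points.size(),
                        points.data(), sizeof(PointVertex));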
In OpenGL you have glVertex2f and glVertex3f. Your call would look like this (there might be a typo or syntax error - I didn't compile/run it):
glBegin(GL_POINTS);
glColor3f(1.0, 1.0, 1.0); // white
for (int y = 0; y < height; y++)
    for (int x = 0; x < width; x++)
        glVertex2f(points[y][x].x, points[y][x].y); // plot point
glEnd();
OpenGL is MUCH easier for playing around and experimenting with than DirectX. I'd recommend taking a look at SDL and using it in conjunction with OpenGL. Or you could use GLUT instead of SDL.
Or you could try using Qt 4. It has very good 2D rendering routines.
When I first dabbled with game/graphics programming I became fond of Allegro. It's got a huge range of features and a pretty easy learning curve.

Slow C++ DirectX 2D Game

I'm new to C++ and DirectX; I come from XNA.
I have developed a game like Fly The Copter.
What I've done is create a class named Wall.
While the game is running I draw all the walls.
In XNA I stored the walls in an ArrayList, and in C++ I've used a vector.
In XNA the game runs fast, but in C++ it's really slow.
Here's the C++ code:
void GameScreen::Update()
{
    //Update Walls
    int len = walls.size();
    for(int i = wallsPassed; i < len; i++)
    {
        walls.at(i).Update();
        if (walls.at(i).pos.x <= -40)
            wallsPassed += 2;
    }
}

void GameScreen::Draw()
{
    //Draw Walls
    int len = walls.size();
    for(int i = wallsPassed; i < len; i++)
    {
        if (walls.at(i).pos.x < 1280)
            walls.at(i).Draw();
        else
            break;
    }
}
In the Update method I decrease the X value by 4.
In the Draw method I call sprite->Draw (ID3DXSprite).
That's the only code that runs in the game loop.
I know this is bad code; if you have an idea how to improve it, please help.
Thanks, and sorry about my English.
Try replacing all occurrences of at() with the [] operator. For example:
walls[i].Draw();
and then turn on all optimisations. Both [] and at() are function calls - to get the maximum performance you need to make sure that they are inlined, which is what upping the optimisation level will do.
You can also do some minimal caching of a wall object - for example:
for(int i = wallsPassed; i < len; i++)
{
    Wall & w = walls[i];
    w.Update();
    if (w.pos.x <= -40)
        wallsPassed += 2;
}
Try to narrow down the cause of the performance problem (also termed profiling). I would try drawing only one object while continuing to update all the objects. If it's suddenly faster, then it's a DirectX drawing problem.
Otherwise, try drawing all the objects but updating only one wall. If it's faster, then your Update() function may be too expensive.
How fast is 'fast'?
How slow is 'really slow'?
How many sprites are you drawing?
How big is each one as an image file, and in pixels drawn on-screen?
How does performance scale (in XNA/C++) as you change the number of sprites drawn?
What difference do you get if you draw without updating, or vice versa?
Maybe you have just forgotten to turn on release mode :) I had some problems with that in the past - I thought my code was very slow because of debug mode. If that's not it, you could have a problem with the rendering part, or with a huge number of objects. The code you provided looks good...
Have you tried multiple buffers (a.k.a. Double Buffering) for the bitmaps?
The typical scenario is to draw in one buffer, then while the first buffer is copied to the screen, draw in a second buffer.
Another technique is to have a huge "logical" screen in memory. The portion drawn on the physical display is a viewport or view into a small area of the logical screen. Moving the background (or screen) then just requires a copy on the part of the graphics processor.
You can aid batching of sprite draw calls. Presumably your Draw call calls your only instance of ID3DXSprite::Draw with the relevant parameters.
You can get much improved performance by calling ID3DXSprite::Begin (with the D3DXSPRITE_SORT_TEXTURE flag set) and then calling ID3DXSprite::End when you've done all your rendering. ID3DXSprite will then sort all your sprite calls by texture to decrease the number of texture switches and batch the relevant calls together. This will improve performance massively.
It's difficult to say more, however, without seeing the internals of your Update and Draw calls. The above is only a guess ...
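In sketch form (assuming the walls all draw through a single ID3DXSprite* called sprite), the batched frame would look something like this:
sprite->Begin(D3DXSPRITE_ALPHABLEND | D3DXSPRITE_SORT_TEXTURE);
for (int i = wallsPassed; i < len; i++)
{
    if (walls[i].pos.x < 1280)
        walls[i].Draw();   // each wall's Draw() calls sprite->Draw(...) internally
    else
        break;
}
sprite->End();             // queued sprites are sorted by texture and flushed here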
Drawing every single wall with a separate draw call is a bad idea. Try to batch the data into a single vertex buffer/index buffer and send it in a single draw call. That's a more sane approach.
Anyway, to get an idea of WHY it runs slowly, try some CPU and GPU profilers (PerfHUD, Intel GPA, etc.) to find out first of all WHAT the bottleneck is (the CPU or the GPU). Then you can work on alleviating the problem.
The lookups into your list of walls are unlikely to be the source of your slowdown. The cost of drawing objects in 3D will typically be the limiting factor.
The important parts are your draw code, the flags you used to create the DirectX device, and the flags you use to create your textures. My stab in the dark... check that you initialize the device as HAL (hardware 3D) rather than REF (software 3D).
Also, how many sprites are you drawing? Each draw call has a fair amount of overhead. If you make more than a couple hundred per frame, that will be your limiting factor.