Which one is the proper method of writing this GL code? (C++)

I have been doing some experiments with OpenGL and texture handling.
In my experiment I have a 2D array of ints which are randomly generated:
int mapskeleton[300][300];
I also have my own OBJ file loader for loading OBJ files with textures:
m2d wall, floor; // I initialize and load these files at start
For recording statistics of render times I used:
bool Once = 1;
int secs = 0;
Now to the render code, where I did my experiment.
// Code A: benchmarked on a Radeon 8670D
// Takes 232 ms (average) for drawing 300*300 tiles
if(Once)
    secs = glutGet(GLUT_ELAPSED_TIME);
for(int i = 0; i < mapHeight; i++){
    for(int j = 0; j < mapWidth; j++){
        if(mapskeleton[j][i] == skel_Wall){
            glBindTexture(GL_TEXTURE_2D, wall.texture);
            glPushMatrix();
            glTranslatef(j*10, i*10, 0);
            wall.Draw(); // draws 10 textured triangles
            glPopMatrix();
        }
        if(mapskeleton[j][i] == skel_floor){
            glBindTexture(GL_TEXTURE_2D, floor.texture);
            glPushMatrix();
            glTranslatef(j*10, i*10, 0);
            floor.Draw(); // draws 2 textured triangles
            glPopMatrix();
        }
    }
}
if(Once){
    secs = glutGet(GLUT_ELAPSED_TIME) - secs;
    printf("time taken for rendering %i msecs\n", secs);
    Once = 0;
}
And the other code is:
// Code B: benchmarked on a Radeon 8670D
// Takes 206 ms (average) for drawing 300*300 tiles
if(Once)
    secs = glutGet(GLUT_ELAPSED_TIME);
glBindTexture(GL_TEXTURE_2D, floor.texture);
for(int i = 0; i < mapHeight; i++){
    for(int j = 0; j < mapWidth; j++){
        if(mapskeleton[j][i] == skel_floor){
            glPushMatrix();
            glTranslatef(j*10, i*10, 0);
            floor.Draw();
            glPopMatrix();
        }
    }
}
glBindTexture(GL_TEXTURE_2D, wall.texture);
for(int i = 0; i < mapHeight; i++){
    for(int j = 0; j < mapWidth; j++){
        if(mapskeleton[j][i] == skel_Wall){
            glPushMatrix();
            glTranslatef(j*10, i*10, 0);
            wall.Draw();
            glPopMatrix();
        }
    }
}
if(Once){
    secs = glutGet(GLUT_ELAPSED_TIME) - secs;
    printf("time taken for rendering %i msecs\n", secs);
    Once = 0;
}
To me, Code A looks good from a beginner's point of view, but the benchmarks say otherwise. My GPU seems to like Code B. Why does Code B take less time to render?

Changes to OpenGL state can generally be expensive - the driver's and/or GPU's data structures and caches can become invalidated. In your case, the state change in question is binding a different texture. In Code B, you're doing it twice. In Code A, you're easily doing it thousands of times.
When programming OpenGL rendering, you'll generally want to set up the pipeline for settings A, render everything that needs settings A, re-set the pipeline for settings B, render everything that needs settings B, and so on.
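Taking that one step further, a minimal sketch (assuming the mapskeleton, wall and floor objects from the question) could bucket tile positions per texture once, after the map is generated, so each frame only walks the pre-built lists and binds each texture exactly once:
// Hypothetical pre-bucketing; requires <vector> and <utility>.
std::vector<std::pair<int,int>> wallTiles, floorTiles;
for(int i = 0; i < mapHeight; i++)
    for(int j = 0; j < mapWidth; j++){
        if(mapskeleton[j][i] == skel_Wall)  wallTiles.push_back({j, i});
        if(mapskeleton[j][i] == skel_floor) floorTiles.push_back({j, i});
    }

// Per frame: one bind per texture, then every tile that uses it.
glBindTexture(GL_TEXTURE_2D, floor.texture);
for(auto& t : floorTiles){
    glPushMatrix();
    glTranslatef(t.first*10, t.second*10, 0);
    floor.Draw();
    glPopMatrix();
}
glBindTexture(GL_TEXTURE_2D, wall.texture);
for(auto& t : wallTiles){
    glPushMatrix();
    glTranslatef(t.first*10, t.second*10, 0);
    wall.Draw();
    glPopMatrix();
}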

@Angew covered why one option is more efficient than the other. But there is an important point that needs to be stated very clearly. Based on the text of your question, particularly here:
for recording statistics of render times
my gpu seems to like code B
you seem to be attempting to measure rendering/GPU performance.
You are NOT AT ALL measuring GPU performance!
You are measuring the time it takes to set up the state and issue the draw calls. OpenGL lets the GPU operate asynchronously from the code executed on the CPU. The picture you should keep in mind when you make (most) OpenGL calls is that you're submitting work to the GPU for later execution. There's no telling when the GPU completes that work. It most definitely (except for very few calls that you want to avoid in speed-critical code) does not happen by the time the call returns.
What you're measuring in your code is purely the CPU overhead for making these calls. This includes what's happening in your own code, and what happens in the driver code for handling the calls and preparing the work for later submission to the GPU.
I'm not saying that the measurement is not useful. Minimizing CPU overhead is very important. You just need to be very aware of what you are in fact measuring, and make sure that you draw the right conclusions.
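If the goal really is to time how long a frame takes to render on the GPU, one crude but common approach is to drain the pipeline with glFinish() before stopping the clock; a minimal sketch based on the timing code from the question (GL timer queries are the more precise option on newer GL versions):
secs = glutGet(GLUT_ELAPSED_TIME);

// ... issue all the draw calls for the frame ...

glFinish(); // block until the GPU has actually finished the submitted work
secs = glutGet(GLUT_ELAPSED_TIME) - secs;
printf("CPU + GPU time for the frame: %i msecs\n", secs);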

Related

How to reduce OpenGL CPU usage and/or how to use OpenGL properly

I'm working on a Micromouse simulation application built with OpenGL, and I have a hunch that I'm not doing things properly. In particular, I'm suspicious about the way I am getting my (mostly static) graphics to refresh at a close-to-constant framerate (60 FPS). My approach is as follows:
1) Start a timer
2) Draw my shapes and text (about a thousand of them):
glBegin(GL_POLYGON);
for (Cartesian vertex : polygon.getVertices()) {
    std::pair<float, float> coordinates = getOpenGlCoordinates(vertex);
    glVertex2f(coordinates.first, coordinates.second);
}
glEnd();
and
glPushMatrix();
glScalef(scaleX, scaleY, 0);
glTranslatef(coordinates.first * 1.0/scaleX, coordinates.second * 1.0/scaleY, 0);
for (int i = 0; i < text.size(); i += 1) {
    glutStrokeCharacter(GLUT_STROKE_MONO_ROMAN, text.at(i));
}
glPopMatrix();
3) Call
glFlush();
4) Stop the timer
5) Sleep for (1/FPS - duration) seconds
6) Call
glutPostRedisplay();
The "problem" is that the above approach really hogs my CPU - the process is using something like 96-100%. I know that there isn't anything inherently wrong with using lots of CPU, but I feel like I shouldn't be using that much all of the time.
The kicker is that most of the graphics don't change from frame to frame. It's really just a single polygon moving over (and covering up) some static shapes. Is there any way to tell OpenGL to only redraw what has changed since the previous frame (with the hope it would reduce the number of glxxx calls, which I've deemed to be the source of the "problem")? Or, better yet, is my approach to getting my graphics to refresh even correct?
First and foremost, the biggest CPU hog with OpenGL is immediate mode... and you're using it (glBegin, glEnd). The problem with IM is that every single vertex requires a whole couple of OpenGL calls to be made; and because OpenGL uses thread-local state, each and every OpenGL call must go through some indirection. So the first step would be getting rid of that.
The next issue is how you're timing your display. If low latency between user input and display is not your ultimate goal, the standard approach would be setting up the window for double buffering, enabling V-Sync, setting a swap interval of 1 and doing a buffer swap (glutSwapBuffers) once the frame is rendered. Exactly what will block, and where, is implementation dependent (unfortunately), but you're more or less guaranteed to hit your screen refresh frequency exactly, as long as your renderer is able to keep up (i.e. rendering a frame takes less time than a screen refresh interval).
glutPostRedisplay merely sets a flag for the main loop to call the display function if no further events are pending, so timing a frame redraw through that is not very accurate.
Last but not least, you may simply be misled by the way Windows accounts CPU time: time spent in driver context, which includes blocking while waiting for V-Sync, is accounted to the consumed CPU time, while it is in fact an interruptible sleep. However, you wrote that you already sleep in your code, which would rule that out, because the go-to approach for getting a more reasonable accounting would be adding a Sleep(1) before or after the buffer swap.
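For reference, a minimal sketch of the double-buffered GLUT setup described above (drawScene() stands in for the existing polygon/text drawing; setting the swap interval itself is platform-specific, e.g. wglSwapIntervalEXT on Windows or glXSwapIntervalEXT on X11):
void display() {
    glClear(GL_COLOR_BUFFER_BIT);
    drawScene();            // hypothetical: the drawing code from the question
    glutSwapBuffers();      // with V-Sync on, this paces the loop to the refresh rate
    glutPostRedisplay();    // queue the next frame
}

int main(int argc, char** argv) {
    glutInit(&argc, argv);
    glutInitDisplayMode(GLUT_DOUBLE | GLUT_RGB);   // double buffering instead of glFlush()
    glutCreateWindow("Micromouse");
    glutDisplayFunc(display);
    glutMainLoop();
}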
I found that putting the render thread to sleep helps reduce CPU usage, in my case from 26% to around 8%:
#include <chrono>
#include <iostream>
#include <thread>

void render_loop(){
    ...
    auto const start_time = std::chrono::steady_clock::now();
    auto const wait_time = std::chrono::milliseconds{ 17 };
    auto next_time = start_time + wait_time;
    while(true){
        ...
        // execute once after the thread wakes up, every 17 ms, which is roughly 60 frames per second
        auto then = std::chrono::high_resolution_clock::now();
        std::this_thread::sleep_until(next_time);
        ... // rendering jobs
        auto elapsed_time =
            std::chrono::duration_cast<std::chrono::milliseconds>(std::chrono::high_resolution_clock::now() - then);
        std::cout << "ms: " << elapsed_time.count() << '\n';
        next_time += wait_time;
    }
}
I thought about attempting to measure the frame rate while the thread is asleep, but there isn't any reason for my use case to attempt that. The result averaged around 16 ms, so I thought it was good enough.
Inspired by this post

glDrawArrays first few calls very slow using my shader and then very fast

I am using my own shader that does quite advanced calculations and outputs the results into a framebuffer.
I call glFinish to make sure the previous OpenGL commands have been executed on the graphics card. Then I call glDrawArrays, and this single call takes 5 seconds!
After calling glDrawArrays a few more times, the calls finally start running in under 1 ms each. So only the first few glDrawArrays calls are super slow.
There is no correlation with the size of the textures used; that doesn't affect performance. If I simplify the shader source code, it does make the first glDrawArrays calls faster, but not dramatically. Sometimes very benign changes in the shader source code lead to serious changes in performance (e.g. commenting out a few additions or subtractions). But all these code changes can only speed up the first glDrawArrays calls from 5 seconds to, say, 1 second, not more, and they do not affect the performance of the glDrawArrays calls made after the first few. Those still run in 1 ms each, a thousand times faster than the first 2-3 calls.
I am baffled by this problem. What could possibly be happening here? Is there a way to extract at least some info about what is really happening inside the GPU?
OK, the shader code that affects performance looks like this:
if (aType < 18){
    if (aType < 9){
        if (aType < 6){
            if (aType == 2)
            {
                res.x = EndX1;
                res.y = EndY1;
            }
            else
            if (aType == 3)
            {
                res.x = EndX2;
                res.y = EndY2;
            }
            .......... // continues with these ifs 36 times
Replacing the code above with a for loop solved the performance problem:
for (int i = 1; i <= 36; i++){
    if ((y < EndY[i]) || ((y == EndY[i]) && (x <= EndX[i])))
    {
        res.xy = SubXY(x, y, EndX[i-1], EndY[i-1]);
        res.z = 2;
        return res;
    }
}
Ironically, I wanted to avoid the for loop for performance reasons :)
Your driver is delaying the serious optimization steps until after the shader has been used a few times, and the non-optimized shader may be software emulated.
There are various reasons for this, but chief among them is that optimization takes time.
To fix this you can force the shader to run a few times with less data (a smaller output buffer via glViewport). This tells the driver to optimize the shader before you actually need it, so it can handle larger loads.
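A minimal sketch of that warm-up idea (myProgram, quadVertexCount, windowWidth and windowHeight are placeholder names):
glUseProgram(myProgram);
glViewport(0, 0, 4, 4);                       // tiny output so the warm-up draws are cheap
for (int i = 0; i < 3; ++i) {
    glDrawArrays(GL_TRIANGLES, 0, quadVertexCount);
    glFinish();                               // force the driver to actually compile/optimize and run the shader
}
glViewport(0, 0, windowWidth, windowHeight);  // restore the real viewport before normal rendering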

OpenGL Display List Optimizing

I am currently running some speed tests for my applications and I am trying to find more ways to optimize my program, specifically with my display lists. Currently I am getting:
12 FPS with 882,000 vertices
40 FPS with 234,000 vertices
95 FPS with 72,000 vertices
I know that I need to minimize the number of calls made, so instead of:
for(int i = 0; i < Number; i++) {
    glBegin(GL_QUADS);
    ...normal and vertex declarations here
    glEnd();
}
A better way would be to do this:
glBegin(GL_QUADS);
for(int i = 0; i < Number; i++) {
    ...normal and vertex declarations here
}
glEnd();
This did help increase my FPS to the results listed above, however, are there other ways I can optimize my display lists? Perhaps by using something other than nested vertex arrays to store my model data?
You'll get a significant speed boost by switching to VBOs or at least Vertex arrays.
Immediate mode (glBegin()...glEnd()) has a lot of method call overhead. I've managed to render ~1 million vertices at several hundred fps on a laptop (would be faster without the physics engine/entity system overhead too) by using more modern OpenGL.
If you're wondering about compatibility, about 98% of people support the VBO extension (GL_ARB_vertex_buffer_object) http://feedback.wildfiregames.com/report/opengl/
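A minimal VBO sketch for quads like the ones above (the Vertex struct and the vertices/vertexCount data are assumed placeholders; legacy client-state pointers are used to stay close to the fixed-function code in the question):
#include <cstddef>   // offsetof

struct Vertex { float pos[3]; float normal[3]; };   // hypothetical interleaved layout

// Upload once, at load time.
GLuint vbo;
glGenBuffers(1, &vbo);
glBindBuffer(GL_ARRAY_BUFFER, vbo);
glBufferData(GL_ARRAY_BUFFER, vertexCount * sizeof(Vertex), vertices, GL_STATIC_DRAW);

// Per frame: one draw call instead of Number glBegin/glEnd pairs.
glBindBuffer(GL_ARRAY_BUFFER, vbo);
glEnableClientState(GL_VERTEX_ARRAY);
glEnableClientState(GL_NORMAL_ARRAY);
glVertexPointer(3, GL_FLOAT, sizeof(Vertex), (void*)offsetof(Vertex, pos));
glNormalPointer(GL_FLOAT, sizeof(Vertex), (void*)offsetof(Vertex, normal));
glDrawArrays(GL_QUADS, 0, vertexCount);
glDisableClientState(GL_NORMAL_ARRAY);
glDisableClientState(GL_VERTEX_ARRAY);
glBindBuffer(GL_ARRAY_BUFFER, 0);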

Measure Render-To-Texture Performance in OpenGL ES 2.0

Basically, I'm doing some sort of image processing using a screen-sized rectangle made of two triangles and a fragment shader, which is doing the whole processing stuff. The actual effect is something like an animation as it depends on a uniform variable, called current_frame.
I'm very much interested in measuring the performance in terms of "MPix/s". What I do is something like that:
/* Setup all necessary stuff, including:                 */
/* - getting the location of the `current_frame` uniform */
/* - creating an FBO, adding a color attachment          */
/*   and setting it as the current one                   */

double current_frame = 0;
double step = 1.0f / NUMBER_OF_ITERATIONS;

tic(); /* Start counting the time */
for (i = 0; i < NUMBER_OF_ITERATIONS; i++)
{
    glUniform1f(current_frame_handle, current_frame);
    current_frame += step;
    glDrawArrays(GL_TRIANGLES, 0, NUMBER_OF_INDICES);
    glFinish();
}
double elapsed_time = tac(); /* Get elapsed time in seconds */

/* Calculate achieved pixels per second */
double pps = (OUT_WIDTH * OUT_HEIGHT * NUMBER_OF_ITERATIONS) / elapsed_time;

/* Sanity check by reading the output into a buffer       */
/* using glReadPixels and saving this buffer into a file  */
As far as theory goes, is there anything wrong with my concept?
Also, I've got the impression that glFinish() on mobile hardware doesn't necessarily wait for previous render calls and may do some optimizations.
Of course, I can always force it by doing glReadPixels() after each draw, but that would be quite slow so that this wouldn't really help.
Could you advise me as to whether my testing scenario is sensible and whether there is anything more that can be done?
Concerning speed: glDrawArrays() still duplicates the shared vertices. glDrawElements() is the solution for reducing the number of vertices in the array, as it allows transferring less data to OpenGL. See http://www.songho.ca/opengl/gl_vertexarray.html
Just throwing that in there to help speed up your results. As far as your timing concept, it looks fine to me. Are you getting results similar to what you had hoped?
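As a minimal sketch, indexed drawing of the two-triangle screen rectangle could look like this (position_handle is an assumed attribute location; client-side arrays are used, which ES 2.0 allows):
GLfloat quadVertices[] = {
    -1.0f, -1.0f,   1.0f, -1.0f,   1.0f, 1.0f,   -1.0f, 1.0f   // 4 shared corners
};
GLushort quadIndices[] = { 0, 1, 2,   0, 2, 3 };                // two triangles reuse them

glEnableVertexAttribArray(position_handle);
glVertexAttribPointer(position_handle, 2, GL_FLOAT, GL_FALSE, 0, quadVertices);
glDrawElements(GL_TRIANGLES, 6, GL_UNSIGNED_SHORT, quadIndices);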
I would precalculate all possible frames, and then use glEnableClientState() and glTexCoordPointer() to change which part of the existing texture is drawn in each frame.
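A minimal sketch of that idea in fixed-function style (note that glTexCoordPointer() is not part of ES 2.0 proper; quadVertices, atlasTexCoords and frame are placeholder names, with 8 floats of texture coordinates per precalculated frame):
glEnableClientState(GL_VERTEX_ARRAY);
glEnableClientState(GL_TEXTURE_COORD_ARRAY);
glVertexPointer(2, GL_FLOAT, 0, quadVertices);
glTexCoordPointer(2, GL_FLOAT, 0, atlasTexCoords + frame * 8);  // select this frame's region of the texture
glDrawArrays(GL_TRIANGLE_FAN, 0, 4);
glDisableClientState(GL_TEXTURE_COORD_ARRAY);
glDisableClientState(GL_VERTEX_ARRAY);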

Slow C++ DirectX 2D Game

I'm new to C++ and DirectX; I come from XNA.
I have developed a game like Fly The Copter.
What I've done is create a class named Wall.
While the game is running I draw all the walls.
In XNA I stored the walls in an ArrayList, and in C++ I've used a vector.
In XNA the game runs fast, but in C++ it runs really slowly.
Here's the C++ code:
void GameScreen::Update()
{
    //Update walls
    int len = walls.size();
    for(int i = wallsPassed; i < len; i++)
    {
        walls.at(i).Update();
        if (walls.at(i).pos.x <= -40)
            wallsPassed += 2;
    }
}

void GameScreen::Draw()
{
    //Draw walls
    int len = walls.size();
    for(int i = wallsPassed; i < len; i++)
    {
        if (walls.at(i).pos.x < 1280)
            walls.at(i).Draw();
        else
            break;
    }
}
In the Update method I decrease the X value by 4.
In the Draw method I call sprite->Draw (ID3DXSprite).
That's the only code that runs in the game loop.
I know this is bad code; if you have an idea on how to improve it, please help.
Thanks, and sorry about my English.
Try replacing all occurrences of at() with the [] operator, for example:
walls[i].Draw();
and then turn on all optimisations. Both [] and at() are function calls; to get maximum performance you need to make sure they are inlined, which is what raising the optimisation level will do.
You can also do some minimal caching of a wall object - for example:
for(int i = wallsPassed; i < len; i++)
{
    Wall & w = walls[i];
    w.Update();
    if (w.pos.x <= -40)
        wallsPassed += 2;
}
Try to narrow down the cause of the performance problem (also termed profiling). I would try drawing only one object while continuing to update all the objects. If it's suddenly faster, then it's a DirectX drawing problem.
Otherwise, try drawing all the objects but updating only one wall. If it's faster, then your Update() function may be too expensive.
How fast is 'fast'?
How slow is 'really slow'?
How many sprites are you drawing?
How big is each one as an image file, and in pixels drawn on-screen?
How does performance scale (in XNA/C++) as you change the number of sprites drawn?
What difference do you get if you draw without updating, or vice versa?
Maybe you have just forgotten to turn on release mode :) I had some problems with that in the past - I thought my code was very slow when it was just debug mode. If that's not it, you may have a problem in the rendering part, or simply a huge number of objects. The code you provided looks good...
Have you tried multiple buffers (a.k.a. Double Buffering) for the bitmaps?
The typical scenario is to draw in one buffer, then while the first buffer is copied to the screen, draw in a second buffer.
Another technique is to have a huge "logical" screen in memory. The portion drawn on the physical display is a viewport, a view into a small area of the logical screen. Moving the background (or screen) then just requires a copy on the part of the graphics processor.
You can aid batching of sprite draw calls. Presumably your Draw call calls your only instance of ID3DXSprite::Draw with the relevant parameters.
You can get much improved performance by making a call to ID3DXSprite::Begin (with the D3DXSPRITE_SORT_TEXTURE flag set) and then calling ID3DXSprite::End when you've done all your rendering. ID3DXSprite will then sort all your sprite calls by texture to decrease the number of texture switches and batch the relevant calls together. This will improve performance massively.
It's difficult to say more, however, without seeing the internals of your Update and Draw calls. The above is only a guess...
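A minimal sketch of that Begin/End batching, assuming a valid ID3DXSprite* named sprite and that each Wall stores an IDirect3DTexture9* (the texture member is hypothetical):
sprite->Begin(D3DXSPRITE_ALPHABLEND | D3DXSPRITE_SORT_TEXTURE);
for (size_t i = wallsPassed; i < walls.size(); ++i)
{
    D3DXVECTOR3 pos(walls[i].pos.x, walls[i].pos.y, 0.0f);
    sprite->Draw(walls[i].texture, NULL, NULL, &pos, D3DCOLOR_XRGB(255, 255, 255));
}
sprite->End();   // D3DX sorts by texture here and submits the batched draws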
Drawing every single wall with a different draw call is a bad idea. Try to batch the data into a single vertex buffer/index buffer and send it in a single draw; that's a more sane approach.
Anyway, to get an idea of WHY it is slow, try some CPU and GPU profilers (PerfHUD, Intel GPA, etc...) to find out first of all WHAT the bottleneck is (the CPU or the GPU). Then you can work to alleviate the problem.
The lookups into your list of walls are unlikely to be the source of your slowdown. The cost of drawing objects in 3D will typically be the limiting factor.
The important parts are your draw code, the flags you used to create the DirectX device, and the flags you use to create your textures. My stab in the dark... check that you initialize the device as HAL (hardware 3d) rather than REF (software 3d).
Also, how many sprites are you drawing? Each draw call has a fair amount of overhead. If you make more than a couple hundred per frame, that will be your limiting factor.
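For the HAL-vs-REF point above, a minimal sketch of the device creation to check (d3d, hWnd and d3dpp are assumed to be set up already elsewhere):
IDirect3DDevice9* device = NULL;
HRESULT hr = d3d->CreateDevice(
    D3DADAPTER_DEFAULT,
    D3DDEVTYPE_HAL,                        // hardware rasterization; D3DDEVTYPE_REF is a slow software reference
    hWnd,
    D3DCREATE_HARDWARE_VERTEXPROCESSING,   // try D3DCREATE_SOFTWARE_VERTEXPROCESSING if this fails
    &d3dpp,
    &device);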