How taxing are OpenGL glDrawElements() calls compared to basic logic code? - c++

I'm planning to do some optimization on my OpenGL program (it doesn't need optimizing, but I'm doing it for the sake of it). Out of curiosity, how expensive are OpenGL drawing functions compared to basic logic code? At the moment, I'm making the start of a game where the screen is filled with squares, to represent a 2D blocky landscape. This means that the draw call for a square(two triangles) is called many times. At the moment, I'm planning to add in some code that looks at the positioning of blocks in the current frame, and groups them together. For example, if there is a column that is 7 blocks high, instead of doing 7 separate drawBlock() functions (which contain the glDrawElements() calls) I could call one function, that draws a rectangle that is 1 x 7, and so on, throughout the screen.
I won't bother doing this if the code that calculates what to draw, actually uses up more of the CPU than just drawing the blocks individually would.

The cost of glDrawElements (or any other OpenGL rendering command) cannot really be estimated. This is because its cost depends a great deal on what OpenGL state you changed between draw calls. The cost of calling an OpenGL state changing function (basically, any OpenGL function that isn't a glGet of some form or a glDraw of some form) will be relatively quick. But it will make the next draw call slower.
This video on OpenGL performance shows which state changes are more costly at draw time than others. The really good part starts around 31 minutes in.
Draw calls are relatively fast if you haven't changed any OpenGL state between draw calls. Different pieces of state have different effects on draw calls. From fastest to slowest (according to NVIDIA's presentation above, so take it with a grain of salt):
Non-UBO uniform updates
Vertex buffer bindings (without changing formats)
UBO binding
Vertex format changes
Texture bindings
Fragment post-processing state changes
Shader program changes
Render target switches
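One common way to act on a ranking like this is to sort the draw list so that the most expensive state changes happen least often. The sketch below is only an illustration: `DrawItem`, `sortByState`, and `countProgramSwitches` are hypothetical names, and the key order (program, then texture, then vertex buffer) simply assumes the ranking above holds on your platform.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <tuple>
#include <vector>

// Hypothetical description of one draw: the state it needs bound.
struct DrawItem {
    unsigned program;
    unsigned texture;
    unsigned vertexBuffer;
};

// Order by the (assumed) most expensive state first, so it changes least often.
bool stateLess(const DrawItem& a, const DrawItem& b) {
    return std::tie(a.program, a.texture, a.vertexBuffer) <
           std::tie(b.program, b.texture, b.vertexBuffer);
}

void sortByState(std::vector<DrawItem>& items) {
    std::sort(items.begin(), items.end(), stateLess);
    assert(std::is_sorted(items.begin(), items.end(), stateLess));
}

// Count how many shader program switches a submission order would cause.
int countProgramSwitches(const std::vector<DrawItem>& items) {
    int switches = 0;
    for (std::size_t i = 1; i < items.size(); ++i) {
        if (items[i].program != items[i - 1].program) ++switches;
    }
    return switches;
}
```

Counting switches before and after sorting gives a rough idea of how much redundant state churn the original order contained; whether that translates into a measurable gain still has to be profiled.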
Now, a draw call will be more expensive than "basic logic". They're not cheap, even without state changes between them. If efficiency is important to your code, then grouping your squares is a good idea.
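The column grouping described in the question can be sketched on the CPU in a few lines. This is a minimal version under assumptions: the landscape is a boolean grid, and `Rect` and `mergeColumns` are made-up names, not part of the asker's code.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical rectangle in block units; w is always 1 in this sketch.
struct Rect {
    int x, y, w, h;
};

// Merge vertically adjacent solid blocks in each column into 1 x N rectangles.
// grid[y][x] is true where a block exists; all rows must have the same length.
std::vector<Rect> mergeColumns(const std::vector<std::vector<bool>>& grid) {
    std::vector<Rect> rects;
    if (grid.empty()) return rects;
    const std::size_t height = grid.size();
    const std::size_t width = grid[0].size();
    for (const auto& row : grid) assert(row.size() == width);

    for (std::size_t x = 0; x < width; ++x) {
        std::size_t y = 0;
        while (y < height) {
            if (!grid[y][x]) {
                ++y;
                continue;
            }
            const std::size_t start = y;
            while (y < height && grid[y][x]) ++y;  // extend the run downwards
            rects.push_back({static_cast<int>(x), static_cast<int>(start), 1,
                             static_cast<int>(y - start)});
        }
    }
    return rects;
}
```

The pass is linear in the number of grid cells, so for a screen-sized grid it is likely to be cheap compared with the draw calls it removes, but that is worth confirming with a measurement.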

The actual numbers are highly platform and vendor dependent. Driver architectures on different operating systems vary substantially, and some of them are more efficient than others. On top of that, driver implementations and hardware can cause large performance differences. For example, I've seen 10-20 times higher draw call throughput for one vendor compared to another vendor, on the same platform and with comparable hardware.
Based on this, any numbers below are just a very rough order of magnitude. You really need to measure this yourself on the configurations you care about.
With all these disclaimers, I would expect that a draw call could be processed in the range of 100 instructions (CPU cycles). This is for the case where you just make back to back draw calls, and there are no other bottlenecks in the pipeline.
As @NicolBolas already pointed out, the most expensive part of handling draw calls is normally processing deferred state changes. And most of the time, you will have state changes between draw calls. In this case, for relatively cheap state changes (like binding a texture or buffer, or changing some attributes), a few hundred instructions are typical.
Switching frame buffers is generally quite expensive, and very expensive on some platforms. Other than that, the numbers I measured in the past while optimizing and benchmarking state changes showed an order that is quite different from the list in @NicolBolas' answer. But again, this is highly platform and vendor/hardware dependent.
There are a couple more aspects that make this somewhat tricky to measure:
Most of the CPU time might not be consumed in your thread. Many drivers are multi-threaded, meaning that most of the work needed to process OpenGL calls is offloaded to a secondary thread. If your application does not use all CPU cores, and you're not throttled by power/thermal limits, this means that a lot of the driver work can happen in parallel, without slowing down your application much. But particularly on mobile devices and laptops, performance is often limited by power consumption, so the driver overhead will still slow you down.
CPU time consumed by the driver is only part of what can slow your application code down. Another consideration is cache pollution. If cache content used by your application is evicted while the OpenGL implementation processes your draw calls, your own code will get more cache misses, and will run slower. So measuring the time spent inside the OpenGL calls only shows part of the picture.
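Given those caveats, timing a whole batch of work (ideally a whole frame) tends to be more informative than timing individual calls. The helper below is a rough sketch; `averageMicroseconds` is a hypothetical name, and it only sees wall-clock time on the calling thread, so work done on a driver thread or lost to cache misses elsewhere is not attributed to any particular call.

```cpp
#include <cassert>
#include <chrono>
#include <functional>

// Run `work` repeatedly and return the average wall-clock time per run.
// For OpenGL code, `work` would typically submit a full frame; calling
// glFinish() at the end of it makes the measurement include GPU completion.
double averageMicroseconds(const std::function<void()>& work, int iterations) {
    assert(iterations > 0);
    using Clock = std::chrono::steady_clock;
    const auto start = Clock::now();
    for (int i = 0; i < iterations; ++i) {
        work();
    }
    const auto elapsed = Clock::now() - start;
    return std::chrono::duration<double, std::micro>(elapsed).count() / iterations;
}
```

Comparing the result for "one draw per block" against "grouped blocks" on each target machine is the kind of measurement the answer recommends.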

Related

How do multiple rendering contexts across multiple processes affect GPU performance? [duplicate]

My co-worker and I are working on a video rendering engine.
The whole idea is to parse a configuration file, render each frame to an offscreen FBO, and then fetch the rendered frame using glReadPixels for video encoding.
We tried to optimize the rendering speed by creating two threads, each with an independent OpenGL context. One thread renders odd frames and the other renders even frames. The two threads do not share any GL resources.
The results are quite confusing. On my computer, the rendering speed increased compared to our single-threaded implementation, while on my partner's computer, the overall speed dropped.
How does the number of OpenGL contexts affect overall performance? Is it really a good idea to create multiple OpenGL threads if they do not share anything?
Context switching is certainly not free. As pretty much always for performance related questions, it's impossible to quantify in general terms. If you want to know, you need to measure it on the system(s) you care about. It can be quite expensive.
Therefore, you add a certain amount of overhead by using multiple contexts. Whether that pays off depends on where your bottleneck is. If you were already GPU limited with a single CPU thread, you won't really gain anything, because you can't get the GPU to do the work quicker if it was already fully loaded. In that case you add overhead for the context switches without any gain, and make the whole thing slower.
If you were CPU limited, using multiple CPU threads can reduce your total elapsed time. Whether the parallelization of the CPU work, combined with the added overhead for synchronization and context switches, results in a net gain again depends on your use case and the specific system. Trying both and measuring is the only good thing to do.
Based on your problem description, you might also be able to use multithreading while still sticking with a single OpenGL context, and keeping all OpenGL calls in a single thread. Instead of using glReadPixels() synchronously, you could have it read into PBOs (Pixel Buffer Objects), which allows you to use asynchronous reads. This decouples GPU and CPU work much better. You could also do the video encoding on a separate thread if you're not doing that yet. This approach will need some inter-thread synchronization, but it avoids using multiple contexts, while still using parallel processing to get the job done.
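The inter-thread synchronization mentioned above can be as small as a blocking queue between the GL thread and the encoder thread. The following is a sketch under assumptions: `Frame` and `FrameQueue` are hypothetical names, and the frame bytes are assumed to have been copied out of a mapped PBO by the GL thread before being pushed.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstdint>
#include <mutex>
#include <queue>
#include <utility>
#include <vector>

using Frame = std::vector<std::uint8_t>;  // raw pixel bytes of one frame

// Hands finished frames from the single GL thread to an encoder thread.
class FrameQueue {
public:
    // Called by the GL thread once a frame's pixels are available.
    void push(Frame frame) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            assert(!closed_);
            frames_.push(std::move(frame));
        }
        ready_.notify_one();
    }

    // Called by the GL thread after the last frame.
    void close() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            closed_ = true;
        }
        ready_.notify_all();
    }

    // Called by the encoder thread. Blocks until a frame is available;
    // returns false once the queue is closed and fully drained.
    bool pop(Frame& out) {
        std::unique_lock<std::mutex> lock(mutex_);
        ready_.wait(lock, [this] { return closed_ || !frames_.empty(); });
        if (frames_.empty()) return false;
        out = std::move(frames_.front());
        frames_.pop();
        return true;
    }

private:
    std::mutex mutex_;
    std::condition_variable ready_;
    std::queue<Frame> frames_;
    bool closed_ = false;
};
```

The encoder thread would loop on `pop` until it returns false, while all OpenGL calls stay on the thread that owns the single context.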

Which is a larger performance drain: quantity of vertices in one draw call, or quantity of calls?

I am quickly finding that one of the organisational considerations you must make when preparing rendering in OpenGL is the type of topography and the arrangement of vertices.
Now there are some interesting methods out there for organising vertices into very long arrays, with nice uses of interleaved arrays, indexes, etc, so that you can pour a lot of geometry into one OpenGL call.
But it's much easier in some cases to simply iterate and perform multiple calls with smaller vertex arrays.
While I agree with the notion that premature optimization is somewhat wasteful, just how important of a consideration should it be to minimize OpenGL calls, especially if multiple calls would actually involve far fewer vertexes per call?
I can see that this is one of those decisions that is important early in the development process, since it forms a lot of the structure of how vertexes get created and organized.
There is an overhead for each command you send down to the GPU. Batching the vertices minimizes that overhead and also allows the driver to make small optimizations to your data before sending it to the hardware. It can make quite a difference, and it is the reason glBegin and glEnd were completely removed from newer iterations of OpenGL.
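As an illustration of batching, many quads can be packed into one vertex array and one index array, which could then be uploaded once and drawn with a single glDrawElements(GL_TRIANGLES, ...) call. This is a sketch only; `Vertex`, `Quad`, `Batch`, and `buildBatch` are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>
#include <initializer_list>
#include <vector>

struct Vertex {
    float x, y;
};

struct Quad {
    float x, y, w, h;  // lower-left corner plus size
};

struct Batch {
    std::vector<Vertex> vertices;
    std::vector<std::uint32_t> indices;  // would be drawn as GL_UNSIGNED_INT
};

// Pack every quad into shared arrays: 4 vertices and 6 indices per quad.
Batch buildBatch(const std::vector<Quad>& quads) {
    Batch batch;
    batch.vertices.reserve(quads.size() * 4);
    batch.indices.reserve(quads.size() * 6);
    for (const Quad& q : quads) {
        assert(q.w > 0.0f && q.h > 0.0f);
        const auto base = static_cast<std::uint32_t>(batch.vertices.size());
        batch.vertices.push_back({q.x, q.y});
        batch.vertices.push_back({q.x + q.w, q.y});
        batch.vertices.push_back({q.x + q.w, q.y + q.h});
        batch.vertices.push_back({q.x, q.y + q.h});
        // Two triangles per quad, sharing the diagonal 0-2.
        for (std::uint32_t offset : {0u, 1u, 2u, 0u, 2u, 3u}) {
            batch.indices.push_back(base + offset);
        }
    }
    return batch;
}
```

The cost of rebuilding such a batch each frame is a handful of array writes per quad, which is usually small next to the per-call overhead it replaces, though that is an expectation to verify rather than a guarantee.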
You should try to avoid making many driver state changes and many drawing calls.
EDIT: Consider using degenerate vertices in your triangle strips (this also helps minimize the number of vertices processed) so that you can use just one drawing call to render all your topology (unless you need to change some driver state between parts of the topology).
You can find a balance for your specific needs, but there are many variables in the equation, and there's no simple solution (like "always make the scene one big single batch!"). TraxNet gave you good advice though: always try to minimize API calls (whether drawing or state changes). But it doesn't have to be just a few calls. On a modern PC it could be thousands per frame; on a not-so-modern mobile phone, maybe just a few hundred.
TraxNet also mentioned degenerate triangles (which help form strips). Although they are still triangles (and add to the total triangle count rendered), they cost almost nothing while helping to minimize the number of draw calls.
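Stitching two triangle strips with degenerate triangles amounts to repeating the last index of the first strip and the first index of the second. The sketch below uses a hypothetical helper, `appendStrip`, and adds one extra repeated index when the existing strip has odd length so that the winding of the appended strip is preserved.

```cpp
#include <cassert>
#include <vector>

// Append the index strip `src` to `dst`, joining them with degenerate
// (zero-area) triangles so both can be drawn as one GL_TRIANGLE_STRIP.
void appendStrip(std::vector<unsigned>& dst, const std::vector<unsigned>& src) {
    assert(src.size() >= 3);
    if (!dst.empty()) {
        const bool oddLength = (dst.size() % 2) != 0;  // length before stitching
        dst.push_back(dst.back());  // repeat the last index of the previous strip
        if (oddLength) {
            dst.push_back(src.front());  // extra index keeps the winding parity
        }
        dst.push_back(src.front());  // repeat the first index of the new strip
    }
    dst.insert(dst.end(), src.begin(), src.end());
}
```

For example, joining {0, 1, 2, 3} and {4, 5, 6, 7} gives {0, 1, 2, 3, 3, 4, 4, 5, 6, 7}: four degenerate triangles are generated between the two real pairs, and each has two identical vertices and therefore no area.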

Reducing bandwidth between GPU and CPU (sending raw data or pre-calculating first)

OK, so I am just trying to work out the best way to reduce bandwidth between the GPU and CPU.
Particle Systems.
Should I be pre-calculating most things on the CPU and sending them to the GPU? This includes stuff like positions, rotations, velocity, calculations for alpha, random numbers, etc.
Or should I be doing as much as I can in the shaders and using the geometry shader as much as possible?
My problem is that the sort of app I have written has to send a good few variables to the shaders. For example, a user will select emitter positions and velocity, plus a lot more, at run time. The sort of thing I am not sure how to tackle is this: if a user wants a random velocity and gives a min and max value for the random value to be selected from, should this random value be worked out on the CPU and sent as a single value to the GPU, or should both the min and max values be sent to the GPU and a random number generator on the GPU do it? Any comments on reducing bandwidth and optimization are much appreciated.
Should I be pre-calculating most things on the CPU and sending them to the GPU? This includes stuff like positions, rotations, velocity, calculations for alpha, random numbers, etc.
Or should I be doing as much as I can in the shaders and using the geometry shader as much as possible?
Impossible to answer. Spend too much CPU time and performance will drop. Spend too much GPU time, performance will drop too. Transfer too much data, performance will drop. So, instead of trying to guess (I don't know what app you're writing, what your target hardware is, etc. Hell, you didn't even specify your target API and platform), measure/profile and select the optimal method. PROFILE instead of trying to guess the performance. There are AQTime 7 Standard, gprof, and NVPerfKit for that (plus many other tools).
Do you actually have a performance problem in your application? If you don't have any performance problems, then don't do anything. Do you have, say, ten million particles per frame in real time? If not, there's little reason to worry, since a 600 MHz CPU was capable of handling thousands of them easily 7 years ago. On the other hand, if you have, say, a dynamic 3D environment and particles must interact with it (bounce), then doing it all on the GPU will be MUCH harder.
Anyway, to me it sounds like you don't have to optimize anything and there's no actual NEED to optimize. So the best idea would be to concentrate on some other things.
However, in any case, ensure that you're using the correct way to transfer "dynamic" data that is frequently updated. In DirectX that meant using dynamic write-only vertex buffers with D3DLOCK_DISCARD|D3DLOCK_NOOVERWRITE. With OpenGL that'll probably mean passing a STREAM or DYNAMIC usage hint with DRAW access (GL_STREAM_DRAW or GL_DYNAMIC_DRAW) to glBufferData. That should be sufficient to avoid major performance hits.
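For the min/max velocity example from the question, a CPU-side version is short enough to try first and profile. This is a minimal sketch with hypothetical names (`Particle`, `spawnParticles`, `updateParticles`); the resulting array is what would be uploaded each frame to a buffer created with a streaming usage hint.

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <vector>

struct Particle {
    float x, y;    // position
    float vx, vy;  // velocity
};

// Create `count` particles at the origin with each velocity component
// drawn uniformly between minSpeed and maxSpeed.
std::vector<Particle> spawnParticles(std::size_t count, float minSpeed,
                                     float maxSpeed, std::mt19937& rng) {
    assert(minSpeed <= maxSpeed);
    std::uniform_real_distribution<float> speed(minSpeed, maxSpeed);
    std::vector<Particle> particles;
    particles.reserve(count);
    for (std::size_t i = 0; i < count; ++i) {
        const float vx = speed(rng);
        const float vy = speed(rng);
        particles.push_back({0.0f, 0.0f, vx, vy});
    }
    return particles;
}

// Advance every particle by dt seconds (simple Euler step).
void updateParticles(std::vector<Particle>& particles, float dt) {
    for (Particle& p : particles) {
        p.x += p.vx * dt;
        p.y += p.vy * dt;
    }
}
```

With this layout each particle costs 16 bytes per upload, which makes it easy to estimate the per-frame transfer before deciding whether moving the work into a shader is worth the extra complexity.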
There's no single right answer to this. Here are some things that might help you make up your mind:
Are you sure the volume of data going over the bus is high enough to be a problem? You might want to do the math and see how much data there is per second vs. what's available on the target hardware.
Is the application likely to be CPU bound or GPU bound? If it's already GPU bound there's no point loading it up further.
Particle systems are pretty easy to implement on the CPU and will run on any hardware. A GPU implementation that supports nontrivial particle systems will be more complex and limited to hardware that supports the required functionality (e.g. stream out and an API that gives access to it.)
Consider a mixed approach. Can you split the particle systems into low complexity, high bandwidth particle systems implemented on the GPU and high complexity, low bandwidth systems implemented on the CPU?
All that said, I think I would start with a CPU implementation and move some of the work to the GPU if it proves necessary and feasible.
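The "do the math" step from the first point is a single multiplication. The sketch below uses a hypothetical helper, `uploadBytesPerSecond`; the particle count, per-particle size, and frame rate in the usage note are illustrative assumptions, not figures from the question.

```cpp
#include <cassert>
#include <cstdint>

// Bytes sent over the bus per second if every particle is re-uploaded
// every frame.
std::uint64_t uploadBytesPerSecond(std::uint64_t particleCount,
                                   std::uint64_t bytesPerParticle,
                                   std::uint64_t framesPerSecond) {
    assert(framesPerSecond > 0);
    return particleCount * bytesPerParticle * framesPerSecond;
}
```

For instance, 10,000 particles at 32 bytes each and 60 frames per second come to 19,200,000 bytes per second, a figure that can then be compared against the bus bandwidth of the target hardware.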

Which OpenGL functions are not GPU-accelerated?

I was shocked when I read this (from the OpenGL wiki):
glTranslate, glRotate, glScale
Are these hardware accelerated?
No, there are no known GPUs that execute this. The driver computes the matrix on the CPU and uploads it to the GPU.
All the other matrix operations are done on the CPU as well: glPushMatrix, glPopMatrix, glLoadIdentity, glFrustum, glOrtho.
This is the reason why these functions are considered deprecated in GL 3.0. You should have your own math library, build your own matrix, upload your matrix to the shader.
For a very, very long time I thought most of the OpenGL functions use the GPU to do computation. I'm not sure if this is a common misconception, but after a while of thinking, this makes sense. Old OpenGL functions (2.x and older) are really not suitable for real-world applications, due to too many state switches.
This makes me realise that, possibly, many OpenGL functions do not use the GPU at all.
So, the question is:
Which OpenGL function calls don't use the GPU?
I believe knowing the answer to the above question would help me become a better programmer with OpenGL. Please do share some of your insights.
Edit:
I know this question easily leads to optimisation level. It's good, but it's not the intention of this question.
If anyone knows a set of GL functions on a certain popular implementation (as AshleysBrain suggested, nVidia/ATI, and possibly OS-dependent) that don't use the GPU, that's what I'm after!
Plausible optimisation guides come later. Let's focus on the functions, for this topic.
Edit2:
This topic isn't about how matrix transformations work. There are other topics for that.
Boy, is this a big subject.
First, I'll start with the obvious: Since you're calling the function (any function) from the CPU, it has to run at least partly on the CPU. So the question really is, how much of the work is done on the CPU and how much on the GPU.
Second, in order for the GPU to get to execute some command, the CPU has to prepare a command description to pass down. The minimal set here is a command token describing what to do, as well as the data for the operation to be executed. How the CPU triggers the GPU to do the command is also somewhat important. Since most of the time, this is expensive, the CPU does not do it often, but rather batches commands in command buffers, and simply sends a whole buffer for the GPU to handle.
All this to say that passing work down to the GPU is not a free exercise. That cost has to be pitted against just running the function on the CPU (no matter what we're talking about).
Taking a step back, you have to ask yourself why you need a GPU at all. The fact is, a pure CPU implementation does the job (as AshleysBrain mentions). The power of the GPU comes from its design to handle:
specialized tasks (rasterization, blending, texture filtering, blitting, ...)
heavily parallel workloads (DeadMG is pointing to that in his answer), when a CPU is more designed to handle single-threaded work.
And those are the guiding principles to follow in order to decide what goes in the chip. Anything that can benefit from those ought to run on the GPU. Anything else ought to be on the CPU.
It's interesting, by the way. Some functionality of the GL (prior to deprecation, mostly) is really not clearly delineated. Display lists are probably the best example of such a feature. Each driver is free to push as much as it wants from the display list stream to the GPU (typically in some command buffer form) for later execution, as long as the semantics of the GL display lists are kept (and that is somewhat hard in general). So some implementations choose to push only a limited subset of the calls in a display list to a computed format, and simply replay the rest of the command stream on the CPU.
Selection is another one where it's unclear whether there is value to executing on the GPU.
Lastly, I have to say that in general, there is little correlation between the API calls and the amount of work on either the CPU or the GPU. A state setting API tends to only modify a structure somewhere in the driver data. Its effect is only visible when a Draw, or some such, is called.
A lot of the GL API works like that. At that point, asking whether glEnable(GL_BLEND) is executed on the CPU or GPU is rather meaningless. What matters is whether the blending will happen on the GPU when Draw is called. So, in that sense, most GL entry points are not accelerated at all.
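The deferred behaviour described here can be pictured with a toy model. This is not how any real driver is written; `ToyContext` and its members are hypothetical, and the point is only that state setters flip a flag while the real work is postponed to the draw.

```cpp
#include <cassert>

// Toy stand-in for a driver context: setters only record state, and the
// (potentially expensive) validation runs lazily at the next draw.
class ToyContext {
public:
    // Analogous to glEnable/glDisable(GL_BLEND): cheap, touches CPU data only.
    void setBlend(bool enabled) {
        if (blend_ != enabled) {
            blend_ = enabled;
            dirty_ = true;
        }
    }

    // Analogous to a draw call: pending state is validated here, once.
    void draw(int vertexCount) {
        assert(vertexCount >= 0);
        if (dirty_) {
            ++validations_;  // a real driver would build hardware commands here
            dirty_ = false;
        }
        ++draws_;
    }

    int validations() const { return validations_; }
    int draws() const { return draws_; }

private:
    bool blend_ = false;
    bool dirty_ = false;
    int validations_ = 0;
    int draws_ = 0;
};
```

In this model, toggling the blend state three times and then drawing twice triggers a single validation, which is why timing the setter calls alone says little about their real cost.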
I could also expand a bit on data transfer but Danvil touched on it.
I'll finish with the little "s/w path". Historically, GL had to work to spec no matter what the hardware special cases were. Which meant that if the h/w was not handling a specific GL feature, then it had to emulate it, or implement it fully in software. There are numerous cases of this, but one that struck a lot of people is when GLSL started to show up.
Since there was no practical way to estimate the code size of a GLSL shader, it was decided that the GL was supposed to take any shader length as valid. The implication was fairly clear: either implement h/w that could take arbitrary length shaders (not realistic at the time), or implement a s/w shader emulation (or, as some vendors chose to, simply fail to be compliant). So, if you triggered this condition on a fragment shader, chances were the whole of your GL ended up being executed on the CPU, even when you had a GPU sitting idle, at least for that draw.
The question should perhaps be "What functions eat an unexpectedly high amount of CPU time?"
Keeping a matrix stack for projection and view is not a thing the GPU can handle better than a CPU would (on the contrary ...). Another example would be shader compilation. Why should this run on the GPU? There is a parser, a compiler, ..., which are just normal CPU programs like the C++ compiler.
Potentially "dangerous" function calls are, for example, glReadPixels, because data is copied between device (=GPU) memory and host (=CPU) memory over the limited bus. Functions like glTexImage*D or glBufferData, which copy in the other direction, are also in this category.
So generally speaking, if you want to know how much CPU time an OpenGL call eats, try to understand its functionality. And beware of all functions that copy data from host to device and back!
Typically, if an operation is per-something, it will occur on the GPU. An example is the actual transformation - this is done once per vertex. On the other hand, if it occurs only once per large operation, it'll be on the CPU - such as creating the transformation matrix, which is only done once for each time the object's state changes, or once per frame.
That's just a general answer, and some functionality will occur the other way around, as well as being implementation dependent. However, typically, it shouldn't matter to you, the programmer. As long as you allow the GPU plenty of time to do its work while you're off doing the game sim or whatever, or have a solid threading model, you shouldn't need to worry about it that much.
#sending data to GPU: As far as I know (only used Direct3D) it's all done in-shader, that's what shaders are for.
glTranslate, glRotate and glScale change the currently active transformation matrix. This is of course a CPU operation. The model view and projection matrices just describe how the GPU should transform vertices when you issue a rendering command.
So e.g. by calling glTranslate nothing is translated at all yet. Before rendering the current projection and model view matrices are multiplied (MVP = projection * modelview) then this single matrix is copied to the GPU and then the GPU does the matrix * vertex multiplications ("T&L") for each vertex. So the translation/scaling/projection of the vertices is done by the GPU.
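The CPU-side part of this can be sketched with a tiny matrix helper, in the spirit of the "own math library" that the wiki quote recommends. Everything here is illustrative: `Mat4`, `Vec4`, and the function names are hypothetical, matrices are stored column-major as OpenGL expects, and the identity matrix merely stands in for a real projection.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

using Mat4 = std::array<float, 16>;  // column-major: element (row, col) at [col * 4 + row]
using Vec4 = std::array<float, 4>;

Mat4 identity() {
    Mat4 m{};
    m[0] = m[5] = m[10] = m[15] = 1.0f;
    return m;
}

// What a glTranslate-style call builds on the CPU.
Mat4 translation(float x, float y, float z) {
    Mat4 m = identity();
    m[12] = x;
    m[13] = y;
    m[14] = z;
    return m;
}

// Matrix product a * b, e.g. MVP = projection * modelview.
Mat4 multiply(const Mat4& a, const Mat4& b) {
    Mat4 r{};
    for (std::size_t col = 0; col < 4; ++col)
        for (std::size_t row = 0; row < 4; ++row)
            for (std::size_t k = 0; k < 4; ++k)
                r[col * 4 + row] += a[k * 4 + row] * b[col * 4 + k];
    return r;
}

// The per-vertex step that the GPU performs for every vertex.
Vec4 transform(const Mat4& m, const Vec4& v) {
    Vec4 r{};
    for (std::size_t row = 0; row < 4; ++row)
        for (std::size_t k = 0; k < 4; ++k)
            r[row] += m[k * 4 + row] * v[k];
    return r;
}

// Usage check: translating the point (1, 1, 1) by (1, 2, 3) gives (2, 3, 4).
void translationExample() {
    const Mat4 mvp = multiply(identity(), translation(1.0f, 2.0f, 3.0f));
    const Vec4 p = transform(mvp, Vec4{1.0f, 1.0f, 1.0f, 1.0f});
    assert(p[0] == 2.0f && p[1] == 3.0f && p[2] == 4.0f && p[3] == 1.0f);
}
```

Only `multiply` and the matrix builders run once per object or frame on the CPU; `transform` is the part that is repeated per vertex and therefore belongs on the GPU.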
Also you really should not be worried about the performance if you don't use these functions in an inner loop somewhere. glTranslate results in three additions. glScale and glRotate are a bit more complex.
My advice is that you should learn a bit more about linear algebra. This is essential for working with 3D APIs.
There are software rendered implementations of OpenGL, so it's possible that no OpenGL functions run on the GPU. There's also hardware that doesn't support certain render states in hardware, so if you set such a state, the driver switches to software rendering, and again, nothing will run on the GPU (even though there's one there). So I don't think there's any clear distinction between 'GPU-accelerated functions' and 'non-GPU accelerated functions'.
To be on the safe side, keep things as simple as possible. The straightforward rendering-with-vertices and basic features like Z buffering are most likely to be hardware accelerated, so if you can stick to that with the minimum state changing, you'll be most likely to keep things hardware accelerated. This is also the way to maximize performance of hardware-accelerated rendering - graphics cards like to stay in one state and just crunch a bunch of vertices.

DirectX9, DirectDraw, Optimization?

First off, I'm programming a game. Currently in the render function there are two calls to two different functions. One renders some text, one renders sprites.
On my computer (AMD Phenom(tm) II X4 955 Processor (4 CPUs), ~3.2GHz, 4096MB RAM DDR2, NVIDIA GeForce GTX 285) I have a render speed of ~2200 FPS when rendering around 200 sprites and about 100 FPS when rendering about 14,500.
I'm using a vector to store the information of each object I'm rendering and using one sprite with many draw calls.
VS2008 release mode with full optimization for C++. I know I've heard left and right don't optimize prematurely, but at this point, it's running great for me, but not so well on certain computers.
I can't imagine changing vectors out for arrays since I'm pushing and pulling things from the vector every frame, in an indeterminable method. Nearly randomly.
I've tried floats and doubles and the speed is no different.
Would it be different using DirectDraw rather than DirectX and the Sprite Render method? Since I've no idea what the differences between DirectDraw and DirectX are, I'm not 100% sure what I should be thinking about that.
The game runs fine on average computers, but what I'm comparing my game to is Touhou. Touhou runs at 60 FPS on the weakest computer I've tried, but my game won't run faster than 36~42 FPS. I can't imagine what I'm doing wrong, being so new to DirectX and C++.
Any assistance in this matter would be great. Unfortunately, I won't be around for a while to add information or answer questions.
You need a profiler.
There's some good performance advice in the responses, but it doesn't matter. Trying to optimize a program without a profiler is like trying to write a program without a compiler. Do not guess, measure.
Now with that said, profiling graphics code is an infamous pain in the neck, and there aren't (to my knowledge) any good, free tools to help with it. So never mind that for now: start with an ordinary CPU profiler, and find out which of your calls is really taking up all your time.
I'm using a vector to store the information of each object I'm rendering and using one sprite with many draw calls.
I'm not sure I understand what you're saying, but this sounds like you're drawing essentially the same object in a lot of different places. If that's the case, you probably want to look up DirectX Instancing. The basic idea is you specify 1) the geometry to draw, and 2) a number of places to draw it. This saves re-specifying the geometry every time you draw the object, so it can improve speed considerably.
I can't imagine changing vectors out for arrays since I'm pushing and pulling things from the vector every frame, in an indeterminable method. Nearly randomly.
Are you inserting and/or removing things from positions other than the back of the vector? In a vector, insertions and removals from the middle take O(n) time, that is, the amount of time it takes is proportional to the size of your vector.
If that is the case, then consider using an std::list instead. Note that with 10k+ objects this could easily be causing your performance issues, depending on how often you do it.
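If the order of the objects does not matter (an assumption; the question does not say), another option besides std::list is to keep the vector and remove elements by swapping with the last one, which takes constant time. `unorderedErase` is a hypothetical helper, not a standard library function.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Remove items[index] in O(1) by overwriting it with the last element.
// The relative order of the remaining elements is NOT preserved.
template <typename T>
void unorderedErase(std::vector<T>& items, std::size_t index) {
    assert(index < items.size());
    if (index != items.size() - 1) {
        items[index] = std::move(items.back());
    }
    items.pop_back();
}
```

This keeps the objects contiguous in memory, which tends to be friendlier to the cache than a linked list when iterating over 10k+ objects every frame, but as with the rest of this advice a profiler should have the final word.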
Profile your application, and determine whether your bottleneck is the CPU, the GPU, or the transfer bus between the two.
Once you have determined that, you have a few choices:
1) If it's the CPU, you can try instancing to reduce the number of draw calls. If your target machine does not support hardware instancing, try some kind of batching. To instance or batch a sprite you have to use a quad (two oriented triangles), as the default sprite interface does.
2) If it's the GPU, try to work out whether a shader is causing the slowdown. If so, try to optimize it. If it's not the shader, try to reduce overdraw, for example by drawing front to back if some of your objects are not transparent.
3) If it's the bus, do the same as for the CPU: batching reduces the number of Locks/Unlocks you need to transfer the data (and with instancing you would not need to update the buffer at all).
That's all. :P
P.S. A warning: DO NOT TRY TO PROFILE DirectX calls with a CPU profiler (use PerfHUD from NVIDIA, GPUPerfStudio from ATI, or GPA from Intel instead).
It's just wasted time: DirectX has a command buffer, and you have no guarantee that a call made now is executed right then. Most of the time it returns immediately and does nothing.