DirectX9, DirectDraw, Optimization? - c++

First off, I'm programming a game. Currently in the render function there are two calls to two different functions. One renders some text, one renders sprites.
On my computer (AMD Phenom(tm) II X4 955 Processor (4 CPUs), ~3.2GHz, 4096MB RAM DDR2, NVIDIA GeForce GTX 285) I have a render speed of ~2200 FPS when rendering around 200 sprites and about 100 FPS when rendering about 14,500.
I'm using a vector to store the information of each object I'm rendering and using one sprite with many draw calls.
VS2008 release mode with full optimization for C++. I know I've heard left and right don't optimize prematurely, but at this point, it's running great for me, but not so well on certain computers.
I can't imagine swapping the vectors out for arrays, since I'm pushing and pulling things from the vector every frame in an unpredictable pattern. Nearly randomly.
I've tried floats and doubles and the speed is no different.
Would it be different using DirectDraw rather than DirectX and the sprite render method? Since I have no idea what the differences between DirectDraw and DirectX are, I'm not 100% sure what I should be thinking about that.
The game runs fine on average computers, but what I'm comparing my game to is Touhou. Touhou runs at 60 FPS on the weakest computer I've tried, but my game won't run faster than 36~42 FPS. I can't imagine what I'm doing wrong, being so new to DirectX and C++.
Any assistance in this matter would be great; unfortunately I won't be around for a while to add information or answer questions.

You need a profiler.
There's some good performance advice in the responses, but it doesn't matter. Trying to optimize a program without a profiler is like trying to write a program without a compiler. Do not guess, measure.
Now with that said, profiling graphics code is an infamous pain in the neck, and there aren't (to my knowledge) any good, free tools to help with it. So never mind that for now: start with an ordinary CPU profiler, and find out which of your calls is really taking up all your time.

I'm using a vector to store the information of each object I'm rendering and using one sprite with many draw calls.
I'm not sure I understand what you're saying, but this sounds like you're drawing essentially the same object in a lot of different places. If that's the case, you probably want to look up DirectX Instancing. The basic idea is you specify 1) the geometry to draw, and 2) a number of places to draw it. This saves re-specifying the geometry every time you draw the object, so it can improve speed considerably.
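For reference, here is a minimal sketch of what D3D9 stream-frequency instancing looks like. It needs roughly SM 3.0-class hardware, and the device, buffer, stride and count names below are assumptions, not code from the question:

    #include <d3d9.h>

    void DrawInstancedSprites(IDirect3DDevice9* device,
                              IDirect3DVertexBuffer9* quadVB,     // 4 corner vertices of one quad
                              IDirect3DIndexBuffer9* quadIB,      // 6 indices = two triangles
                              IDirect3DVertexBuffer9* instanceVB, // one entry per sprite
                              UINT quadStride, UINT instanceStride, UINT numSprites)
    {
        // A vertex declaration that reads stream 0 (quad corners) and stream 1
        // (per-instance data) is assumed to have been set via SetVertexDeclaration.
        device->SetStreamSource(0, quadVB, 0, quadStride);
        device->SetStreamSourceFreq(0, D3DSTREAMSOURCE_INDEXEDDATA | numSprites);

        device->SetStreamSource(1, instanceVB, 0, instanceStride);
        device->SetStreamSourceFreq(1, D3DSTREAMSOURCE_INSTANCEDATA | 1u);

        device->SetIndices(quadIB);
        // One call draws the quad numSprites times; the vertex shader combines the
        // quad corners with each sprite's per-instance position/size/colour.
        device->DrawIndexedPrimitive(D3DPT_TRIANGLELIST, 0, 0, 4, 0, 2);

        // Reset the stream frequencies so later draws are not instanced.
        device->SetStreamSourceFreq(0, 1);
        device->SetStreamSourceFreq(1, 1);
    }

The point is that the quad geometry is specified once, and only the small per-sprite stream changes between frames.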

I can't imagine changing vectors out for arrays since I'm pushing and pulling things from the vector every frame, in an indeterminable method. Nearly randomly.
Are you inserting and/or removing things from positions other than the back of the vector? In a vector, insertions and removals from the middle take O(n) time: every element after the modified position has to be shifted, so the cost grows with the size of your vector.
If that is the case, then consider using an std::list instead. Note that with 10k+ objects this could easily be causing your performance issues, depending on how often you do it.
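As a minimal illustration of the difference being described (the container names are hypothetical, and both containers are assumed non-empty):

    #include <iterator>
    #include <list>
    #include <vector>

    void EraseMiddle(std::vector<int>& vec, std::list<int>& lst)
    {
        // O(n): every element after the erased position is shifted down one slot.
        vec.erase(vec.begin() + vec.size() / 2);

        // O(1) once the iterator is known (walking to it is still O(n) for a list).
        std::list<int>::iterator it = lst.begin();
        std::advance(it, lst.size() / 2);
        lst.erase(it);
    }

Also note that if the order of the objects doesn't matter, you can keep the vector and make removal O(1) anyway: swap the element with the back and call pop_back().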

Profile your application and determine whether your bottleneck is the CPU or the GPU (or the transfer bus between the two).
Once you've determined that, you have a few choices:
1) If it's the CPU, you can try instancing to reduce the number of draw calls, or, if your target machine does not support hardware instancing, some kind of batching. To instance or batch a sprite you have to draw it as a quad (two oriented triangles), just as the default sprite interface does; see the sketch after this list.
2) If it's the GPU, try to work out whether a shader is causing the slowdown; if so, optimize it. If it's not the shader, try to reduce overdraw: if some of your objects are not transparent, draw them front to back.
3) If it's the bus, do the same as for the CPU: batching reduces the number of Lock/Unlock calls you need to transfer the data (and with instancing you would not need to update the geometry buffer at all).
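Here is a hedged sketch of the batching idea from point 1: fill one dynamic vertex buffer with every sprite's quad, then issue a single DrawPrimitive call. SpriteInfo and the device/buffer names are assumptions, and the buffer is assumed to have been created with D3DUSAGE_DYNAMIC in D3DPOOL_DEFAULT and sized for all sprites:

    #include <d3d9.h>
    #include <vector>

    struct SpriteInfo   { float x, y, w, h; D3DCOLOR color; };           // assumed layout
    struct SpriteVertex { float x, y, z, rhw; D3DCOLOR color; float u, v; };
    #define SPRITE_FVF (D3DFVF_XYZRHW | D3DFVF_DIFFUSE | D3DFVF_TEX1)

    void DrawSpriteBatch(IDirect3DDevice9* device, IDirect3DVertexBuffer9* vb,
                         const std::vector<SpriteInfo>& sprites)
    {
        SpriteVertex* v = NULL;
        // DISCARD hands back fresh memory instead of stalling on the GPU.
        vb->Lock(0, 0, reinterpret_cast<void**>(&v), D3DLOCK_DISCARD);
        for (size_t i = 0; i < sprites.size(); ++i) {
            const SpriteInfo& s = sprites[i];
            const SpriteVertex quad[4] = {
                { s.x,       s.y,       0.0f, 1.0f, s.color, 0.0f, 0.0f },
                { s.x + s.w, s.y,       0.0f, 1.0f, s.color, 1.0f, 0.0f },
                { s.x + s.w, s.y + s.h, 0.0f, 1.0f, s.color, 1.0f, 1.0f },
                { s.x,       s.y + s.h, 0.0f, 1.0f, s.color, 0.0f, 1.0f },
            };
            // Two triangles per sprite: 0-1-2 and 0-2-3.
            v[i * 6 + 0] = quad[0]; v[i * 6 + 1] = quad[1]; v[i * 6 + 2] = quad[2];
            v[i * 6 + 3] = quad[0]; v[i * 6 + 4] = quad[2]; v[i * 6 + 5] = quad[3];
        }
        vb->Unlock();

        device->SetFVF(SPRITE_FVF);
        device->SetStreamSource(0, vb, 0, sizeof(SpriteVertex));
        // One draw call for the whole batch instead of one per sprite.
        device->DrawPrimitive(D3DPT_TRIANGLELIST, 0, 2 * static_cast<UINT>(sprites.size()));
    }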
That's all. :P
P.S. A warning... DO NOT TRY TO PROFILE DirectX calls with a CPU profiler (use PerfHUD from NVIDIA, GPU PerfStudio from AMD/ATI, or GPA from Intel instead).
It's just time lost: DirectX has a command buffer, and you have no guarantee that a call made now is executed at that moment. Most of the time it returns immediately and does nothing.

Related

How taxing are OpenGL glDrawElements() calls compared to basic logic code?

I'm planning to do some optimization on my OpenGL program (it doesn't need optimizing, but I'm doing it for the sake of it). Out of curiosity, how expensive are OpenGL drawing functions compared to basic logic code? At the moment, I'm making the start of a game where the screen is filled with squares, to represent a 2D blocky landscape. This means that the draw call for a square (two triangles) is made many times. I'm planning to add some code that looks at the positioning of blocks in the current frame and groups them together. For example, if there is a column that is 7 blocks high, instead of making 7 separate drawBlock() calls (each of which contains a glDrawElements() call), I could call one function that draws a rectangle that is 1 x 7, and so on, throughout the screen.
I won't bother doing this if the code that calculates what to draw actually uses up more CPU than just drawing the blocks individually would.
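For what it's worth, the grouping itself is cheap CPU work. A rough sketch of the column-merging idea (the grid layout and names are assumptions about how the landscape might be stored):

    #include <vector>

    struct Rect { int x, y, w, h; };

    // Greedily merges vertical runs of solid blocks into 1xN rectangles, so each
    // run costs one draw call instead of N.
    std::vector<Rect> MergeColumns(const std::vector<std::vector<bool> >& grid)
    {
        std::vector<Rect> rects;
        for (int x = 0; x < static_cast<int>(grid.size()); ++x) {
            const int height = static_cast<int>(grid[x].size());
            for (int y = 0; y < height; ) {
                if (!grid[x][y]) { ++y; continue; }
                const int runStart = y;
                while (y < height && grid[x][y]) ++y;      // extend the vertical run
                Rect r = { x, runStart, 1, y - runStart };
                rects.push_back(r);                         // one 1xN rectangle
            }
        }
        return rects;   // each Rect becomes a single two-triangle draw
    }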
The cost of glDrawElements (or any other OpenGL rendering command) cannot really be estimated. This is because its cost depends a great deal on what OpenGL state you changed between draw calls. The cost of calling an OpenGL state changing function (basically, any OpenGL function that isn't a glGet of some form or a glDraw of some form) will be relatively quick. But it will make the next draw call slower.
This video on OpenGL performance shows which state changes are more costly at draw time than others. The really good part starts around 31 minutes in.
Draw calls are relatively fast if you haven't changed any OpenGL state between draw calls. Different pieces of state have different effects on draw calls. From fastest to slowest (according to NVIDIA's presentation above, so take it with a grain of salt):
Non-UBO uniform updates
Vertex buffer bindings (without changing formats)
UBO binding
Vertex format changes
Texture bindings
Fragment post-processing state changes
Shader program changes
Render target switches
Now, a draw call will be more expensive than "basic logic". They're not cheap, even without state changes between them. If efficiency is important to your code, then grouping your squares is a good idea.
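One common way to act on that cost ordering is to sort your draw items so that the most expensive state changes happen least often. A rough sketch (the DrawItem structure, its fields, and the GLEW header are assumptions):

    #include <algorithm>
    #include <cstddef>
    #include <vector>
    #include <GL/glew.h>   // or any loader that provides the GL function pointers

    struct DrawItem {
        GLuint  program;
        GLuint  texture;
        GLuint  vao;
        GLsizei indexCount;
    };

    static bool ByState(const DrawItem& a, const DrawItem& b)
    {
        if (a.program != b.program) return a.program < b.program;   // costliest change first
        return a.texture < b.texture;
    }

    void DrawSorted(std::vector<DrawItem>& items)
    {
        std::sort(items.begin(), items.end(), ByState);

        GLuint lastProgram = 0, lastTexture = 0;
        for (std::size_t i = 0; i < items.size(); ++i) {
            const DrawItem& it = items[i];
            // Only touch GL state when it actually differs from the previous draw.
            if (it.program != lastProgram) { glUseProgram(it.program); lastProgram = it.program; }
            if (it.texture != lastTexture) { glBindTexture(GL_TEXTURE_2D, it.texture); lastTexture = it.texture; }
            glBindVertexArray(it.vao);
            // Assumes each VAO has its element array buffer attached.
            glDrawElements(GL_TRIANGLES, it.indexCount, GL_UNSIGNED_INT, 0);
        }
    }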
The actual numbers are highly platform and vendor dependent. Driver architectures on different operating systems vary substantially, and some of them are more efficient than others. On top of that, driver implementations and hardware can cause large performance differences. For example, I've seen 10-20 times higher draw call throughput for one vendor compared to another vendor, on the same platform and with comparable hardware.
Based on this, any numbers below are just a very rough order of magnitude. You really need to measure this yourself on the configurations you care about.
With all these disclaimers, I would expect that a draw call could be processed in the range of 100 instructions (CPU cycles). This is for the case where you just make back to back draw calls, and there are no other bottlenecks in the pipeline.
As @NicolBolas already pointed out, the most expensive part of handling draw calls is normally processing deferred state changes. And most of the time, you will have state changes between draw calls. In this case, for relatively cheap state changes (like binding a texture or buffer, or changing some attributes), a few hundred instructions are typical.
Switching framebuffers is generally quite expensive, and very expensive on some platforms. Other than that, the numbers I measured in the past while optimizing and benchmarking state changes showed an ordering quite different from the list in @NicolBolas' answer. But again, this is highly platform and vendor/hardware dependent.
There are a couple more aspects that makes this somewhat tricky to measure:
Most of the CPU time might not be consumed in your thread. Many drivers are multi-threaded, meaning that most of the work needed to process OpenGL calls is offloaded to a secondary thread. If your application does not use all CPU cores, and you're not throttled by power/thermal limits, this means that a lot of the driver work can happen in parallel, without slowing down your application much. But particularly on mobile devices and laptops, performance is often limited by power consumption, so the driver overhead will still slow you down.
CPU time consumed by the driver is only part of what can slow your application code down. Another consideration is cache pollution. If cache content used by your application is evicted while the OpenGL implementation processes your draw calls, your own code will get more cache misses, and will run slower. So measuring the time spent inside the OpenGL calls only shows part of the picture.

Which is the most optimal and correct way to draw many different dynamic 3D models (they are animated and change every frame)?

I need to know how I can render many different 3D models whose geometry changes every frame (they are animated models), with no repeated models or textures.
I load all the models and create an "object" model class for each one.
What is the most optimal way to render them?
To use 1 VBO for each 3D model
To use a single VBO for all models (since they are all different, I don't see how this option is possible)
I work with OpenGL 3.x or higher, C++ on Windows.
TL;DR: there's no silver bullet when it comes to rendering performance.
Why is that? It depends on the complicated process that gets your data, converts it, pushes it to the GPU and then makes pixels on the screen flicker. So, instead of "one best way", a few guidelines have emerged that will usually improve performance:
Keep all the necessary data on the GPU (because the closer to the screen, the shorter the way the electrons have to go :))
Send as little data to the GPU between frames as possible
Don't sync needlessly between CPU and GPU (that's like trying to run two high-speed trains on parallel tracks, but insisting on slowing them down to the point where you can pass something through the window every once in a while).
Now, it's obvious that if you want to have a model that will change, you can't have your cake and eat it too. You have to make tradeoffs. Simply put, dynamic objects will never render as fast as static ones. So, what should you do?
Hint the GPU about the data usage (GL_STREAM_DRAW or GL_DYNAMIC_DRAW); that should help the driver choose a suitable memory arrangement.
Don't use interleaved buffers to mix static vertex attributes with dynamic ones: if you split them into separate buffers, you can batch-update the geometry while leaving the texture coordinates intact, for example (see the sketch after this list).
Try to do as much as you can purely on the GPU: with compute shaders and transform feedback, it may well be possible to store the whole animation data in a buffer and calculate it on the GPU, avoiding expensive syncs.
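As a small sketch of the split-buffer idea from the second hint (names are assumptions, and a current GL context plus a loader such as GLEW are assumed):

    #include <vector>
    #include <GL/glew.h>   // or any other GL loader

    // Re-uploads only the animated positions each frame; the texture-coordinate
    // buffer was filled once with GL_STATIC_DRAW and is never touched again.
    void UploadAnimatedPositions(GLuint vboPositions,
                                 const std::vector<float>& positions)
    {
        const GLsizeiptr bytes =
            static_cast<GLsizeiptr>(positions.size() * sizeof(float));
        glBindBuffer(GL_ARRAY_BUFFER, vboPositions);
        // Orphan the old storage so the driver need not wait for the GPU to
        // finish with last frame's data, then stream in this frame's positions.
        glBufferData(GL_ARRAY_BUFFER, bytes, NULL, GL_STREAM_DRAW);
        glBufferSubData(GL_ARRAY_BUFFER, 0, bytes, positions.data());
    }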
And last but not least, always carefully measure the impact of your change on performance. Going blindly won't help. Measure accurately and thoroughly (even stuff like shader compilation time might matter sometimes!). Then, even if you go by trial-and-error, there's a hope you'll get somewhere.
And to address one of your points in particular: whether it's one large VBO or a few smaller ones doesn't really matter, but a huge one might have trouble fitting in memory. You can still update parts of it, and what matters most is the memory arrangement inside it.

Performance of WebGL and OpenGL

For the past month I've been messing with WebGL, and found that if I create and draw a large vertex buffer it causes low FPS. Does anyone know if it would be the same if I used OpenGL with C++?
Is that a bottleneck with the language used (JavaScript in the case of WebGL) or with the GPU?
WebGL examples like this show that you can draw 150,000 cubes using one buffer with good performance, but with anything more than this I get FPS drops. Would that be the same with OpenGL, or would it be able to handle a larger buffer?
Basically, I've got to decide whether to continue using WebGL and try to optimise my code, or, if OpenGL would perform better and it's a language speed bottleneck, switch to C++ and use OpenGL.
If you only have a single drawArrays call, there should not be much of a difference between OpenGL and WebGL for the call itself. However, setting up the data in JavaScript might be a lot slower, so it really depends on your problem. If the bulk of your data is static (landscape, rooms), WebGL might work well for you; otherwise, setting up the data in JS might be too slow for your purpose.
p.s. If you include more details of what you are trying to do, you'll probably get more detailed / specific answers.
Anecdotally, I wrote a tile-based game in the early 2000s using the old glVertex() style API that ran perfectly smoothly. I recently started porting it to WebGL and glDrawArrays(), and now on my modern PC, which is at least 10 times faster, it gets terrible performance.
The reason seems to be that I was faking a glBegin(GL_QUADS); glVertex() * 4; glEnd(); sequence by using glDrawArrays(). Using glDrawArrays() to draw one polygon is much, much slower in WebGL than doing the same with glVertex() was in C++.
I don't know why this is. Maybe it is because JavaScript is dog slow, or because of some context-switching issues in JavaScript. Anyway, I can only do around 500 one-polygon glDrawArrays() calls while still getting 60 FPS.
Everybody seems to work around this by doing as much on the GPU as possible, and doing as few glDrawArrays() calls per frame as possible. Whether you can do this depends on what you are trying to draw. In the example of cubes you linked they can do everything on the GPU, including moving the cubes, which is why it is fast. Essentially they cheated; typical WebGL apps won't be like that.
Google had a talk where they explained this technique (they also unrealistically calculate the object motion on the GPU): https://www.youtube.com/watch?v=rfQ8rKGTVlg
OpenGL is more flexible and more optimized because of the newer versions of the API it exposes.
It is true that OpenGL is faster and more capable, but it also depends on your needs.
If you need one cube mesh with a texture, WebGL would be sufficient. However, if you intend to build large-scale projects with lots of vertices, post-processing effects and different rendering techniques (any kind of displacement or parallax mapping, per-vertex effects, or maybe tessellation), then OpenGL might actually be the better and wiser choice.
Optimizing buffers into a single call and optimizing their updates can be done, but it has its limits, of course; and yes, OpenGL would most likely perform better anyway.
To answer your question: it is not a language bottleneck, but one of the API version used.
WebGL is based on OpenGL ES, which has some pros but also runs a bit slower and has more abstraction layers for code handling than desktop OpenGL, and that is the reason for the lower performance: more code needs to be evaluated.
If your project doesn't require a web-based solution and you don't care which devices are supported, then OpenGL would be the better and smarter choice.
Hope this helps.
WebGL is much slower on the same hardware compared to equivalent OpenGL, because of the high overhead of each WebGL call.
On desktop OpenGL, this overhead is at least limited, if still relatively expensive.
But in browsers like Chrome, WebGL requires that not only do we cross the FFI barrier to access those native OpenGL API calls (which still incur the same overhead), but we also have the cost of security checks to prevent the GPU being hijacked for computation.
If you are looking at something like glDraw* calls, which are made every frame, this means we are talking about perhaps an order of magnitude (or more) fewer calls you can afford. All the more reason to opt for something like instancing, where the number of calls is drastically reduced.

Which is a larger performance drain: quantity of vertices in one draw call, or quantity of calls?

I am quickly finding that one of the organisational considerations you must make when preparing rendering in OpenGL is the type of topology and the arrangement of vertices.
Now there are some interesting methods out there for organising vertices into very long arrays, with nice uses of interleaved arrays, indexes, etc, so that you can pour a lot of geometry into one OpenGL call.
But it's much easier in some cases to simply iterate and perform multiple calls with smaller vertex arrays.
While I agree with the notion that premature optimization is somewhat wasteful, just how important of a consideration should it be to minimize OpenGL calls, especially if multiple calls would actually involve far fewer vertexes per call?
I can see that this is one of those decisions that is important early in the development process, since it forms a lot of the structure of how vertexes get created and organized.
There is an overhead for each command you send down to the GPU. By batching the vertices you minimize that overhead and also allow the driver to make small optimizations to your data before sending it to the hardware. It can make quite a difference, and it is the reason glBegin and glEnd were completely removed from newer iterations of OpenGL.
You should try to avoid making many driver state changes and many draw calls.
EDIT: Consider using degenerate vertices in your triangle strips (this also helps minimize the number of vertices processed) so that you can use just one draw call to render all your topology (unless you need to change some driver state between parts of it).
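A minimal sketch of the degenerate-vertex trick (stripA and stripB are assumed, non-empty index lists for two separate strips; client-side index arrays as used here are fine for the older GL versions being discussed):

    #include <vector>
    #include <GL/gl.h>

    // Joins two triangle strips into one by repeating the last index of the first
    // strip and the first index of the second; the resulting zero-area triangles
    // are rejected by the rasterizer. If stripA has an odd number of indices, one
    // extra duplicate is needed to keep the winding consistent.
    void DrawJoinedStrips(const std::vector<GLuint>& stripA,
                          const std::vector<GLuint>& stripB)
    {
        std::vector<GLuint> indices;
        indices.insert(indices.end(), stripA.begin(), stripA.end());
        indices.push_back(stripA.back());
        indices.push_back(stripB.front());
        indices.insert(indices.end(), stripB.begin(), stripB.end());

        // Client-side index pointer; with an element array buffer bound, the last
        // argument would be a byte offset instead.
        glDrawElements(GL_TRIANGLE_STRIP, static_cast<GLsizei>(indices.size()),
                       GL_UNSIGNED_INT, &indices[0]);
    }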
You can find a balance for your specific needs, but the thing is that there are many variables in the equation, and there's no simple solution (like "always make the scene one big single batch!"). TraxNet gave you good advice, though: always try to minimize API calls (whether draw calls or state changes). But it doesn't have to be just a few calls; on a modern PC it could be thousands per frame, on a not-so-modern mobile phone maybe just a few hundred.
TraxNet also mentioned degenerate triangles (which help form strips). Though they're still triangles (they do add to the 'total' triangle count rendered), they cost almost nothing while still helping to minimize the number of draw calls.

Which OpenGL functions are not GPU-accelerated?

I was shocked when I read this (from the OpenGL wiki):
glTranslate, glRotate, glScale
Are these hardware accelerated?
No, there are no known GPUs that execute this. The driver computes the matrix on the CPU and uploads it to the GPU.
All the other matrix operations are done on the CPU as well: glPushMatrix, glPopMatrix, glLoadIdentity, glFrustum, glOrtho.
This is the reason why these functions are considered deprecated in GL 3.0. You should have your own math library, build your own matrix, upload your matrix to the shader.
For a very, very long time I thought most of the OpenGL functions use the GPU to do computation. I'm not sure if this is a common misconception, but after a while of thinking, this makes sense. Old OpenGL functions (2.x and older) are really not suitable for real-world applications, due to too many state switches.
This makes me realise that, possibly, many OpenGL functions do not use the GPU at all.
So, the question is:
Which OpenGL function calls don't use the GPU?
I believe knowing the answer to the above question would help me become a better programmer with OpenGL. Please do share some of your insights.
Edit:
I know this question easily leads to optimisation level. It's good, but it's not the intention of this question.
If anyone knows a set of GL functions on a certain popular implementation (as AshleysBrain suggested, nVidia/ATI, and possibly OS-dependent) that don't use the GPU, that's what I'm after!
Plausible optimisation guides come later. Let's focus on the functions, for this topic.
Edit2:
This topic isn't about how matrix transformations work. There are other topics for that.
Boy, is this a big subject.
First, I'll start with the obvious: Since you're calling the function (any function) from the CPU, it has to run at least partly on the CPU. So the question really is, how much of the work is done on the CPU and how much on the GPU.
Second, in order for the GPU to get to execute some command, the CPU has to prepare a command description to pass down. The minimal set here is a command token describing what to do, as well as the data for the operation to be executed. How the CPU triggers the GPU to do the command is also somewhat important. Since most of the time, this is expensive, the CPU does not do it often, but rather batches commands in command buffers, and simply sends a whole buffer for the GPU to handle.
All this to say that passing work down to the GPU is not a free exercise. That cost has to be pitted against just running the function on the CPU (no matter what we're talking about).
Taking a step back, you have to ask yourself why you need a GPU at all. The fact is, a pure CPU implementation does the job (as AshleysBrain mentions). The power of the GPU comes from its design to handle:
specialized tasks (rasterization, blending, texture filtering, blitting, ...)
heavily parallel workloads (DeadMG is pointing to that in his answer), when a CPU is more designed to handle single-threaded work.
And those are the guiding principles to follow in order to decide what goes in the chip. Anything that can benefit from those ought to run on the GPU. Anything else ought to be on the CPU.
It's interesting, by the way, that some functionality of the GL (prior to deprecation, mostly) is really not clearly delineated. Display lists are probably the best example of such a feature. Each driver is free to push as much as it wants from the display list stream to the GPU (typically in some command buffer form) for later execution, as long as the semantics of the GL display lists are kept (and that is somewhat hard in general). So some implementations choose to push only a limited subset of the calls in a display list to a computed format, and simply replay the rest of the command stream on the CPU.
Selection is another one where it's unclear whether there is value to executing on the GPU.
Lastly, I have to say that in general, there is little correlation between the API calls and the amount of work done on either the CPU or the GPU. A state-setting API call tends to only modify a structure somewhere in the driver data. Its effect is only visible when a Draw, or some such, is called.
A lot of the GL API works like that. At that point, asking whether glEnable(GL_BLEND) is executed on the CPU or GPU is rather meaningless. What matters is whether the blending will happen on the GPU when Draw is called. So, in that sense, most GL entry points are not accelerated at all.
I could also expand a bit on data transfer but Danvil touched on it.
I'll finish with the little "s/w path". Historically, GL had to work to spec no matter what the hardware special cases were. Which meant that if the h/w was not handling a specific GL feature, then it had to emulate it, or implement it fully in software. There are numerous cases of this, but one that struck a lot of people is when GLSL started to show up.
Since there was no practical way to estimate the code size of a GLSL shader, it was decided that the GL was supposed to take any shader length as valid. The implication was fairly clear: either implement h/w that could take arbitrary-length shaders (not realistic at the time), or implement a s/w shader emulation (or, as some vendors chose to, simply fail to be compliant). So, if you triggered this condition in a fragment shader, chances were the whole of your GL ended up being executed on the CPU, even when you had a GPU sitting idle, at least for that draw.
The question should perhaps be "What functions eat an unexpectedly high amount of CPU time?"
Keeping a matrix stack for projection and view is not a thing the GPU can handle better than a CPU would (on the contrary ...). Another example would be shader compilation. Why should this run on the GPU? There is a parser, a compiler, ..., which are just normal CPU programs like the C++ compiler.
Potentially "dangerous" function calls are for example glReadPixels, because data can be copied from host (=CPU) memory to device (=GPU) memory over the limited bus. In this category are also functions like glTexImage_D or glBufferData.
So generally speaking, if you want to know how much CPU time an OpenGL call eats, try to understand its functionality. And beware of all functions, which copy data from host to device and back!
Typically, if an operation is per-something, it will occur on the GPU. An example is the actual transformation - this is done once per vertex. On the other hand, if it occurs only once per large operation, it'll be on the CPU - such as creating the transformation matrix, which is only done once for each time the object's state changes, or once per frame.
That's just a general answer, and some functionality will occur the other way around, as well as being implementation-dependent. However, typically it shouldn't matter to you, the programmer. As long as you allow the GPU plenty of time to do its work while you're off doing the game sim or whatever, or have a solid threading model, you shouldn't need to worry about it that much.
Re: sending data to the GPU: as far as I know (I've only used Direct3D), it's all done in-shader; that's what shaders are for.
glTranslate, glRotate and glScale change the currently active transformation matrix. This is of course a CPU operation. The model-view and projection matrices just describe how the GPU should transform vertices when you issue a rendering command.
So, for example, calling glTranslate doesn't translate anything at all yet. Before rendering, the current projection and model-view matrices are multiplied (MVP = projection * modelview), this single matrix is copied to the GPU, and then the GPU does the matrix * vertex multiplication ("T&L") for each vertex. So the translation/scaling/projection of the vertices is done by the GPU.
Also, you really should not be worried about the performance of these functions unless you use them in an inner loop somewhere. glTranslate results in three additions; glScale and glRotate are a bit more complex.
My advice is that you should learn a bit more about linear algebra. This is essential for working with 3D APIs.
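For completeness, here is a small sketch of the "build your own matrix, upload your matrix to the shader" approach the quoted wiki recommends, using GLM (the program handle and the uniform name "uMVP" are assumptions):

    #include <GL/glew.h>                     // or any GL loader
    #include <glm/glm.hpp>
    #include <glm/gtc/matrix_transform.hpp>
    #include <glm/gtc/type_ptr.hpp>

    // Builds the matrices on the CPU (just as glTranslate/glFrustum used to) and
    // uploads only the final MVP; the per-vertex multiply happens in the shader.
    void UploadMVP(GLuint program, float aspect)
    {
        glm::mat4 projection = glm::perspective(glm::radians(60.0f), aspect, 0.1f, 100.0f);
        glm::mat4 modelview  = glm::translate(glm::mat4(1.0f), glm::vec3(0.0f, 0.0f, -5.0f));
        glm::mat4 mvp        = projection * modelview;

        glUseProgram(program);
        // In the vertex shader: gl_Position = uMVP * position;
        glUniformMatrix4fv(glGetUniformLocation(program, "uMVP"), 1, GL_FALSE,
                           glm::value_ptr(mvp));
    }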
There are software-rendered implementations of OpenGL, so it's possible that no OpenGL functions run on the GPU. There's also hardware that doesn't support certain render states, so if you set such a state the driver falls back to software rendering, and again nothing will run on the GPU (even though there's one there). So I don't think there's any clear distinction between 'GPU-accelerated functions' and 'non-GPU-accelerated functions'.
To be on the safe side, keep things as simple as possible. The straightforward rendering-with-vertices and basic features like Z buffering are most likely to be hardware accelerated, so if you can stick to that with the minimum state changing, you'll be most likely to keep things hardware accelerated. This is also the way to maximize performance of hardware-accelerated rendering - graphics cards like to stay in one state and just crunch a bunch of vertices.