If we have several OpenGL contexts, each in its own process, the driver somehow virtualises the device, so that each program thinks it exclusively runs the GPU. That is, if one program calls glEnable, the other one will never notice that.
The same effect could be achieved by hand with a ton of glGet calls to save state and their setter counterparts to restore it afterwards. Obviously, the driver does it more efficiently. In userspace, however, we would need to track which changes we made to the state and handle them selectively. Maybe I'm just missing something, but I thought it would be nice, for one, to adjust the viewport for a framebuffer and then simply undo those changes, restoring whatever state was there before.
Maybe there is a way of achieving the effect of a context switch yet within a single program?
You may create as many OpenGL contexts in a single process as you like and switch between them. Also, with modern GPUs the state of the OpenGL context bears little resemblance to what's actually happening on the GPU.
For pre-Core OpenGL there's glPushAttrib()/glPopAttrib() that will let you store off some GL state.
You're probably better off writing your own client-side state shadowing though.
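For the concrete viewport case from the question, a minimal sketch of doing that shadowing by hand could look like the following (an extension loader such as GLEW or glad is assumed to provide glBindFramebuffer; the fbo handle and dimensions are placeholders):

void renderToFramebuffer(GLuint fbo, GLsizei width, GLsizei height)
{
    GLint previousViewport[4];
    glGetIntegerv(GL_VIEWPORT, previousViewport);      // save whatever was set before

    glBindFramebuffer(GL_FRAMEBUFFER, fbo);
    glViewport(0, 0, width, height);                   // adjust for the framebuffer

    /* ... draw into the framebuffer ... */

    glBindFramebuffer(GL_FRAMEBUFFER, 0);
    glViewport(previousViewport[0], previousViewport[1],
               previousViewport[2], previousViewport[3]);  // restore the old state
}

A full client-side shadow would keep the viewport in a variable of its own instead of paying for the glGetIntegerv round trip on every call.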
The state machine (and command queue, discussed below) are unique to each context. It is much, much higher-level than you are thinking; the state is generally wrapped up nicely in usermode.
As for context-switching in a single process, be aware that each render context in GL is unsynchronized. An implicit flush is generated during a context switch in order to help alleviate this problem. As long as a context is only used by a single thread, this is generally adequate, but it is probably going to negatively impact performance.
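A rough sketch of what "several contexts, one process" can look like, assuming SDL2 (with native WGL/GLX/EGL calls the idea is the same; the function name is made up):

#include <SDL.h>
#include <SDL_opengl.h>

void twoContexts(SDL_Window* window)
{
    SDL_GLContext ctxA = SDL_GL_CreateContext(window);
    SDL_GLContext ctxB = SDL_GL_CreateContext(window);

    SDL_GL_MakeCurrent(window, ctxA);
    glEnable(GL_DEPTH_TEST);            // affects only ctxA's state machine

    SDL_GL_MakeCurrent(window, ctxB);   // switching typically implies a flush of ctxA
    // ctxB still has its own, untouched state here

    SDL_GL_DeleteContext(ctxB);
    SDL_GL_DeleteContext(ctxA);
}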
Recently I'm reading https://github.com/ARM-software/vulkan_best_practice_for_mobile_developers/blob/master/samples/vulkan_basics.md, and it said:
OpenGL ES uses a synchronous rendering model, which means that an API call must behave as if all earlier API calls have already been processed. In reality no modern GPU works this way, rendering workloads are processed asynchronously and the synchronous model is an elaborate illusion maintained by the device driver. To maintain this illusion the driver must track which resources are read or written by each rendering operation in the queue, ensure that workloads run in a legal order to avoid rendering corruption, and ensure that API calls which need a data resource block and wait until that resource is safely available.
Vulkan uses an asynchronous rendering model, reflecting how the modern GPUs work. Applications queue rendering commands into a queue, use explicit scheduling dependencies to control workload execution order, and use explicit synchronization primitives to align dependent CPU and GPU processing.
The impact of these changes is to significantly reduce the CPU overhead of the graphics drivers, at the expense of requiring the application to handle dependency management and synchronization.
Could someone help explain why an asynchronous rendering model reduces CPU overhead? After all, in Vulkan you still have to track state yourself.
Could someone help explain why an asynchronous rendering model reduces CPU overhead?
First of all, let's get back to the original statement you are referring to, emphasis mine:
The impact of these changes is to significantly reduce the CPU overhead of the graphics drivers, [...]
So the claim here is that the driver itself will need to consume less CPU, and that is easy to see, since the driver can forward your requests more directly, "as-is".
However, one overall goal of a low-level rendering API like Vulkan is also a potentially reduced CPU overhead in general, not only in the driver.
Consider the following example: You have a draw call which renders to a texture. And then you have another draw call which samples from this texture.
To get the implicit synchronization right, the driver has to track the usage of this texture, both as render target and as source for texture sampling operations.
It doesn't know in advance whether the next draw call will need any resources which are still to be written by previous draw calls. It has to track every possible conflict of this kind, no matter whether it can actually occur in your application or not, and it must be extremely conservative in its decisions. You might have a texture bound to a framebuffer for a draw call while knowing that, with the actual uniform values you set for these shaders, the texture is not modified. But the GPU driver can't know that. If it can't rule out, with absolute certainty, that a resource is modified, it has to assume it is.
However, your application will more likely know such details. If you have several render passes, and the second pass depends on the texture rendered to in the first, you can (and must) add proper synchronization primitives - but the GPU driver doesn't need to care why any synchronization is necessary at all, and it doesn't need to track any resource usage to find out - it can just do as it is told. And your application also doesn't need to track its own resource usage in many cases; the need for synchronization at some point is simply inherent in the usage as you coded it. There might still be cases where you need to track your own resource usage, though, especially if you write some intermediate layer like a higher-level graphics library where you know less and less of the structure of the rendering - then you are getting into a position similar to what a GL driver has to do (unless you want to push the whole burden of synchronization onto the users of your library, like Vulkan does).
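To make the render-to-texture example concrete, here is a hedged sketch of what that explicit dependency looks like in Vulkan: the application records a single image memory barrier between "render to the image" and "sample from it", and the driver simply executes it without having to discover the dependency on its own (the command buffer and image handles are assumed to exist already):

#include <vulkan/vulkan.h>

void recordRenderThenSampleBarrier(VkCommandBuffer cmd, VkImage image)
{
    VkImageMemoryBarrier barrier{};
    barrier.sType               = VK_STRUCTURE_TYPE_IMAGE_MEMORY_BARRIER;
    barrier.srcAccessMask       = VK_ACCESS_COLOR_ATTACHMENT_WRITE_BIT;
    barrier.dstAccessMask       = VK_ACCESS_SHADER_READ_BIT;
    barrier.oldLayout           = VK_IMAGE_LAYOUT_COLOR_ATTACHMENT_OPTIMAL;
    barrier.newLayout           = VK_IMAGE_LAYOUT_SHADER_READ_ONLY_OPTIMAL;
    barrier.srcQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED;
    barrier.dstQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED;
    barrier.image               = image;
    barrier.subresourceRange    = { VK_IMAGE_ASPECT_COLOR_BIT, 0, 1, 0, 1 };

    vkCmdPipelineBarrier(cmd,
                         VK_PIPELINE_STAGE_COLOR_ATTACHMENT_OUTPUT_BIT,  // wait for the render
                         VK_PIPELINE_STAGE_FRAGMENT_SHADER_BIT,          // before any sampling
                         0, 0, nullptr, 0, nullptr, 1, &barrier);
}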
Suppose I'm trying to build a small OpenGL graphics engine in C++. I've read that accessing OpenGL state via glGet* functions can be quite expensive (while reading OpenGL state is a frequent operation), and it's strongly recommended to store a copy of the OpenGL state somewhere with fast read/write access.
I'm currently thinking of storing the opengl state as a global thread_local variable of some appropriate type. How bad is that design? Are there any pitfalls?
If you want to stick with OpenGL's design (where your context pointer could be considered "thread_local") I guess it's a valid option... Obviously, you will need to have full control over all OpenGL calls in order to keep your state copy in sync with the current context's state.
I personally prefer to wrap the OpenGL state of interest using an "OpenGLState" class with a bunch of settable/gettable properties each mapping to some part of the state. You can then also avoid setting the same state twice. You could make it thread_local, but I couldn't (Visual C++ only supports thread_local for POD types).
You will need to be very careful, as some OpenGL calls indirectly change seemingly unrelated parts of the context's state. For example, glDeleteTextures will reset any binding of the deleted texture(s) to 0. Also, some toolkits are very "helpful" in changing OpenGL state behind your back (for example, QOpenGLContext on OSX changes your viewport for you when made current).
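A rough sketch of that wrapper idea, just for the enable/disable part (class and member names are made up; a real version would map many more pieces of state, and GL headers plus a loader are assumed to be included already):

#include <unordered_map>

class OpenGLState {
public:
    void setEnabled(GLenum cap, bool enabled)
    {
        auto it = m_enabled.find(cap);
        if (it != m_enabled.end() && it->second == enabled)
            return;                                   // redundant change, skip the GL call
        if (enabled) glEnable(cap); else glDisable(cap);
        m_enabled[cap] = enabled;
    }

    bool isEnabled(GLenum cap) const                  // cached read, no glIsEnabled round trip
    {
        auto it = m_enabled.find(cap);
        // Most capabilities default to disabled (GL_DITHER is a notable exception),
        // so "never touched" is treated as false here; a real shadow would seed the defaults.
        return it != m_enabled.end() && it->second;
    }

private:
    std::unordered_map<GLenum, bool> m_enabled;
};

This only stays correct if every enable/disable in the program goes through the wrapper, which is exactly the caveat about keeping the copy in sync with the context.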
Since you can only (reasonably) use a GL context with one thread, why do you need thread local? Yes, you can make a context current in different threads at different times, but this is not a wise design.
You will usually have one context and one thread accessing it. In rare cases, you will have two contexts (often shared) with two threads. In that case, you can simply put any additional state you wish to save into your context class, of which each instance is owned by exactly one thread.
But most of the time, you need not explicitly "remember" states anyway. All states have well-documented initial states, and they only change when you change them (exception being changes made by a "super smart" toolkit, but storing a wrong state doesn't help in that case either).
You will usually try to batch together states and do many "similar" draw calls with one set of states, the reason being that state changes stall the pipeline and require expensive validation before the next draw calls.
So, start off with the defaults, and set everything that needs to be non-default before drawing a batch. Then change what needs to be different for the next batch.
If you can't be bothered to dig through the specs for default values and keep track, you can redundantly set everything all the time. Then run your application in gDEBugger, which will tell you which state changes are redundant, so you can eliminate them.
I currently try to implement a "Loading thread" for a very basic gaming engine, which takes care of loading e.g. textures or audio while the main thread keeps rendering a proper message/screen until the operation is finished or even render regular game scenes while loading of smaller objects occurs in background.
Now, I am by far no OpenGL expert, but as I implemented such a "loading" mechanism I quickly found out that OpenGL doesn't much like its rendering context being accessed from a thread other than the one it was created on. I googled around and the solution seems to be:
"Create a second rendering context on the thread and share it with the context of the main thread"
The problem with this is that I use SDL to take care of my window management and context creation, and as far as I can tell from inspecting the API there is no way to tell SDL to share contexts between each other :(
I came to the conclusion that the best solutions for my case are:
Approach A) Alter the SDL library to support context sharing with the platform specific functions (wglShareLists() and glXCreateContext() I assume)
Approach B) Let the "Loading Thread" only load the data into memory and process it to be in a OpenGL-friendly format and pass it to the main thread which e.g. takes care of uploading the texture to the graphics adapter. This, of course, only applies to data that needs a valid OpenGL context to be done
The first solution is the least efficient one, I guess. I don't really want to meddle with SDL, and besides, I read that context sharing is not a high-performance operation. So my next take would be on the second approach so far.
EDIT: Regarding the "high-performance operation": I read the article wrong, it actually isn't that performance intensive. The article suggested shifting the CPU intensive operations to the second thread with a second context. Sorry for that
After all this introduction I would really appreciate if anyone could give me some hints and comments to the following questions:
1) Is there any way to share contexts with SDL and would it be any good anyway to do so?
2) Is there any other more "elegant" way to load my data in the background that I may have missed or didn't think about?
3) Can my intention of going with approach B be considered a good choice? There would still be slight overhead from the OpenGL operations on my main thread which blocks rendering, or is it so small that it can be ignored?
Is there any way to share contexts with SDL
No.
Yes!
You have to get the current context, using platform-specific calls. From there, you can create a new context and make it shared, also with platform-specific calls.
Is there any other more "elegant" way to load my data in the background that I may have missed or didn't think about?
Not really. You enumerated the options quite well: hack SDL to get the data you need, or load data inefficiently.
However, you can load the data into mapped buffer objects and transfer the data to OpenGL. You can only do the mapping/unmapping on the OpenGL thread, but the pointer you get when you map can be used on any thread. So map a buffer, pass it to the worker thread. It loads data into the mapped memory, and flips a switch. The GL thread unmaps the pointer (the worker thread should forget about the pointer now) and uploads the texture data.
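A hedged sketch of that buffer-mapping handoff (pixel buffer objects are assumed, i.e. OpenGL 2.1+ or GL_ARB_pixel_buffer_object plus an extension loader; all names below are placeholders):

// GL thread: allocate and map a pixel-unpack buffer, hand the pointer to the worker.
void* beginAsyncTextureLoad(GLuint pbo, GLsizeiptr imageSizeBytes)
{
    glBindBuffer(GL_PIXEL_UNPACK_BUFFER, pbo);
    glBufferData(GL_PIXEL_UNPACK_BUFFER, imageSizeBytes, NULL, GL_STREAM_DRAW);
    // The mapped pointer may be written from the worker thread; only the
    // map/unmap calls themselves have to happen on the GL thread.
    return glMapBuffer(GL_PIXEL_UNPACK_BUFFER, GL_WRITE_ONLY);
}

// GL thread, after the worker signals completion: unmap and do the actual upload.
void finishAsyncTextureLoad(GLuint pbo, GLuint texture, GLsizei width, GLsizei height)
{
    glBindBuffer(GL_PIXEL_UNPACK_BUFFER, pbo);
    glUnmapBuffer(GL_PIXEL_UNPACK_BUFFER);
    glBindTexture(GL_TEXTURE_2D, texture);
    glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA, width, height, 0,
                 GL_RGBA, GL_UNSIGNED_BYTE, NULL);    // NULL means: read from the bound PBO
    glBindBuffer(GL_PIXEL_UNPACK_BUFFER, 0);
}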
Can my intention of going with approach B be considered a good choice?
Define "good"? There's no way to answer this without knowing more about your problem domain.
I was shocked when I read this (from the OpenGL wiki):
glTranslate, glRotate, glScale
Are these hardware accelerated? No, there are no known GPUs that execute this. The driver computes the matrix on the CPU and uploads it to the GPU.
All the other matrix operations are done on the CPU as well: glPushMatrix, glPopMatrix, glLoadIdentity, glFrustum, glOrtho. This is the reason why these functions are considered deprecated in GL 3.0. You should have your own math library, build your own matrix, upload your matrix to the shader.
For a very, very long time I thought most of the OpenGL functions use the GPU to do computation. I'm not sure if this is a common misconception, but after a while of thinking, this makes sense. Old OpenGL functions (2.x and older) are really not suitable for real-world applications, due to too many state switches.
This makes me realise that, possibly, many OpenGL functions do not use the GPU at all.
So, the question is:
Which OpenGL function calls don't use the GPU?
I believe knowing the answer to the above question would help me become a better programmer with OpenGL. Please do share some of your insights.
Edit:
I know this question easily leads to optimisation level. It's good, but it's not the intention of this question.
If anyone knows a set of GL functions on a certain popular implementation (as AshleysBrain suggested, nVidia/ATI, and possibly OS-dependent) that don't use the GPU, that's what I'm after!
Plausible optimisation guides come later. Let's focus on the functions, for this topic.
Edit2:
This topic isn't about how matrix transformations work. There are other topics for that.
Boy, is this a big subject.
First, I'll start with the obvious: Since you're calling the function (any function) from the CPU, it has to run at least partly on the CPU. So the question really is, how much of the work is done on the CPU and how much on the GPU.
Second, in order for the GPU to get to execute some command, the CPU has to prepare a command description to pass down. The minimal set here is a command token describing what to do, as well as the data for the operation to be executed. How the CPU triggers the GPU to do the command is also somewhat important. Since most of the time, this is expensive, the CPU does not do it often, but rather batches commands in command buffers, and simply sends a whole buffer for the GPU to handle.
All this to say that passing work down to the GPU is not a free exercise. That cost has to be pitted against just running the function on the CPU (no matter what we're talking about).
Taking a step back, you have to ask yourself why you need a GPU at all. The fact is, a pure CPU implementation does the job (as AshleysBrain mentions). The power of the GPU comes from its design to handle:
specialized tasks (rasterization, blending, texture filtering, blitting, ...)
heavily parallel workloads (DeadMG points to that in his answer), whereas a CPU is designed more for single-threaded work.
And those are the guiding principles to follow in order to decide what goes in the chip. Anything that can benefit from those ought to run on the GPU. Anything else ought to be on the CPU.
It's interesting, by the way: some functionality of the GL (prior to deprecation, mostly) is really not clearly delineated. Display lists are probably the best example of such a feature. Each driver is free to push as much as it wants from the display list stream to the GPU (typically in some command buffer form) for later execution, as long as the semantics of GL display lists are kept (and that is somewhat hard in general). So some implementations only choose to push a limited subset of the calls in a display list to a computed format, and choose to simply replay the rest of the command stream on the CPU.
Selection is another one where it's unclear whether there is value to executing on the GPU.
Lastly, I have to say that in general, there is little correlation between the API calls and the amount of work on either the CPU or the GPU. A state-setting API tends to only modify a structure somewhere in the driver data. Its effect is only visible when a Draw, or some such, is called.
A lot of the GL API works like that. At that point, asking whether glEnable(GL_BLEND) is executed on the CPU or GPU is rather meaningless. What matters is whether the blending will happen on the GPU when Draw is called. So, in that sense, most GL entry points are not accelerated at all.
I could also expand a bit on data transfer but Danvil touched on it.
I'll finish with the little "s/w path". Historically, GL had to work to spec no matter what the hardware special cases were. Which meant that if the h/w was not handling a specific GL feature, then it had to emulate it, or implement it fully in software. There are numerous cases of this, but one that struck a lot of people is when GLSL started to show up.
Since there was no practical way to estimate the code size of a GLSL shader, it was decided that the GL was supposed to take any shader length as valid. The implication was fairly clear: either implement h/w that could take arbitrary-length shaders (not realistic at the time), or implement a s/w shader emulation (or, as some vendors chose to, simply fail to be compliant). So, if you triggered this condition in a fragment shader, chances were the whole of your GL ended up being executed on the CPU, even with a GPU sitting idle, at least for that draw.
The question should perhaps be "What functions eat an unexpectedly high amount of CPU time?"
Keeping a matrix stack for projection and view is not a thing the GPU can handle better than a CPU would (on the contrary ...). Another example would be shader compilation. Why should this run on the GPU? There is a parser, a compiler, ..., which are just normal CPU programs like the C++ compiler.
Potentially "dangerous" function calls are, for example, glReadPixels, because data is copied between host (=CPU) memory and device (=GPU) memory over the limited bus. Functions like glTexImage2D or glBufferData are in this category as well.
So generally speaking, if you want to know how much CPU time an OpenGL call eats, try to understand its functionality. And beware of all functions, which copy data from host to device and back!
Typically, if an operation is per-something, it will occur on the GPU. An example is the actual transformation - this is done once per vertex. On the other hand, if it occurs only once per large operation, it'll be on the CPU - such as creating the transformation matrix, which is only done once for each time the object's state changes, or once per frame.
That's just a general answer and some functionality will occur the other way around - as well as being implementation dependent. However, typically, it shouldn't matter to you, the programmer. As long as you allow the GPU plenty of time to do its work while you're off doing the game sim or whatever, or have a solid threading model, you shouldn't need to worry about it that much.
Regarding sending data to the GPU: as far as I know (I've only used Direct3D), it's all done in-shader; that's what shaders are for.
glTranslate, glRotate and glScale change the currently active transformation matrix. This is of course a CPU operation. The model-view and projection matrices just describe how the GPU should transform vertices when a rendering command is issued.
So, for example, calling glTranslate doesn't actually translate anything yet. Before rendering, the current projection and model-view matrices are multiplied (MVP = projection * modelview), then this single matrix is copied to the GPU, and the GPU does the matrix * vertex multiplications ("T&L") for each vertex. So the translation/scaling/projection of the vertices is done by the GPU.
Also you really should not be worried about the performance if you don't use these functions in an inner loop somewhere. glTranslate results in three additions. glScale and glRotate are a bit more complex.
My advice is that you should learn a bit more about linear algebra. This is essential for working with 3D APIs.
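For illustration, this is roughly what that looks like when you do it yourself in modern GL. This is a sketch under stated assumptions: GLM is used for the CPU-side math, and the shader is assumed to have a "uMVP" uniform; both are placeholders, not anything the answer above prescribes.

#include <glm/glm.hpp>
#include <glm/gtc/matrix_transform.hpp>
#include <glm/gtc/type_ptr.hpp>

void uploadMVP(GLuint program, const glm::vec3& position)
{
    // Built on the CPU, just like the old matrix stack did internally:
    glm::mat4 projection = glm::perspective(glm::radians(60.0f), 16.0f / 9.0f, 0.1f, 100.0f);
    glm::mat4 modelview  = glm::translate(glm::mat4(1.0f), position);
    glm::mat4 mvp        = projection * modelview;

    // Only this single matrix goes to the GPU; the per-vertex
    // multiplication (gl_Position = uMVP * vertex) happens there.
    glUseProgram(program);
    glUniformMatrix4fv(glGetUniformLocation(program, "uMVP"), 1, GL_FALSE, glm::value_ptr(mvp));
}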
There are software rendered implementations of OpenGL, so it's possible that no OpenGL functions run on the GPU. There's also hardware that doesn't support certain render states in hardware, so if you set a certain state, switch to software rendering, and again, nothing will run on the GPU (even though there's one there). So I don't think there's any clear distinction between 'GPU-accelerated functions' and 'non-GPU accelerated functions'.
To be on the safe side, keep things as simple as possible. The straightforward rendering-with-vertices and basic features like Z buffering are most likely to be hardware accelerated, so if you can stick to that with the minimum state changing, you'll be most likely to keep things hardware accelerated. This is also the way to maximize performance of hardware-accelerated rendering - graphics cards like to stay in one state and just crunch a bunch of vertices.
How often should I call OpenGL functions like glEnable() or glEnableClientState() and their corresponding glDisable counterparts? Are they meant to be called once at the beginning of the application, or should I keep them disabled and only enable those features I immediately need for drawing something? Is there a performance difference?
Warning: glPushAttrib / glPopAttrib are DEPRECATED in the modern OpenGL programmable pipeline.
If you find that you are checking the value of state variables often and subsequently calling glEnable/glDisable you may be able to clean things up a bit by using the attribute stack (glPushAttrib / glPopAttrib).
The attribute stack allows you to isolate areas of your code and such that changes to attribute in one sections does not affect the attribute state in other sections.
void drawObject1(){
    glPushAttrib(GL_ENABLE_BIT);
    glEnable(GL_DEPTH_TEST);
    glEnable(GL_LIGHTING);
    /* Isolated Region 1 */
    glPopAttrib();
}

void drawObject2(){
    glPushAttrib(GL_ENABLE_BIT);
    glEnable(GL_FOG);
    glEnable(GL_POINT_SMOOTH);
    /* Isolated Region 2 */
    glPopAttrib();
}

void drawScene(){
    drawObject1();
    drawObject2();
}
Although GL_LIGHTING and GL_DEPTH_TEST are set in drawObject1, their state does not carry over into drawObject2. In the absence of glPushAttrib this would not be the case. Also note that there is no need to call glDisable at the end of the functions; glPopAttrib does the job.
As far as performance goes, the overhead of individual calls to glEnable/glDisable is minimal. If you need to handle lots of state you will probably need to create your own state manager or make numerous calls to glGetInteger... and then act accordingly. The added machinery and control flow could make the code less transparent, harder to debug, and more difficult to maintain. These issues may make other, more fruitful, optimizations more difficult.
The attribute stack can aid in maintaining layers of abstraction and create regions of isolation.
glPushAttrib manpage
"That depends".
If your entire app only uses one combination of enable/disable states, then by all means just set it up at the beginning and go.
Most real-world apps need to mix, and then you're forced to call glEnable() to enable some particular state(s), do the draw calls, then glDisable() them again when you're done to "clear the stage".
State-sorting, state-tracking, and many optimization schemes stem from this, as state switching is sometimes expensive.
First of all, which OpenGL version do you use? And which generation of graphics hardware does your target group have? Knowing this would make it easier to give a more correct answer. My answer assumes OpenGL 2.1.
OpenGL is a state machine, meaning that whenever a state is changed, that state is made "current" until changed again explicitly by the programmer with a new OpenGL API call. Exceptions to this rule exist, like client state array calls making the current vertex color undefined. But those are the exceptions which define the rule.
"once at the beginning of the application" doesn't make much sense, because there are times you need to destroy your OpenGL context while the application is still running. I assume you mean just after every window creation. That works for state you don't need to change later. Example: If all your draw calls use the same vertex array data, you don't need to disable them with glDisableClientState afterwards.
There is a lot of enable/disable state associated with the old fixed-function pipeline. The easy redemption for this is: Use shaders! If you target a generation of cards at most five years old, it probably mimics the fixed-function pipeline with shaders anyway. By using shaders you are in more or less total control of what's happening during the transform and rasterization stages, and you can make your own "states" with uniforms, which are very cheap to change/update.
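A tiny sketch of what such a uniform-based "state" can look like (names are made up): the fragment shader branches on a flag, e.g. "uniform bool uUseFog;" combined with "if (uUseFog) color = mix(color, fogColor, fogFactor);", and the application toggles it with a single cheap call instead of the fixed-function glEnable(GL_FOG)/glDisable(GL_FOG):

void setFogEnabled(GLuint program, bool enabled)
{
    glUseProgram(program);
    glUniform1i(glGetUniformLocation(program, "uUseFog"), enabled ? 1 : 0);
}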
Knowing that OpenGL is a state machine like I said above should make it clear that one should strive to keep the state changes to a minimum, as long as it's possible. However, there are most probably other things which impacts performance much more than enable/disable state calls. If you want to know about them, read down.
State not associated with the old fixed-function calls, and which is not simple enable/disable state, can differ widely in cost. Notably, linking shaders and binding names ("names" of textures, programs, buffer objects) are usually fairly expensive. This is why a lot of games and applications used to sort the draw order of their meshes according to the texture; that way, they didn't have to bind the same texture twice. Nowadays, however, the same applies to shader programs: you don't want to bind the same shader program twice if you don't have to. Also, not all the features in a particular OpenGL version are hardware accelerated on all cards, even if the vendors of those cards claim they are OpenGL compliant. Being compliant means that they follow the specification, not that they necessarily run all the features efficiently. Some of the functions like glHistogram and glMinmax from GL_ARB_imaging should be remembered in this regard.
Conclusion: Unless there is an obvious reason not to, use shaders! It saves you from a lot of unnecessary state calls since you can use uniforms instead. OpenGL shaders have only been around for about six years, you know. Also, the overhead of enable/disable state changes can be an issue, but usually there is a lot more to gain from optimizing other, more expensive state changes, like glUseProgram, glCompileShader, glLinkProgram, glBindBuffer and glBindTexture.
P.S.: OpenGL 3.0 removed the client state enable/disable calls. They are implicitly enabled, as drawing from arrays is the only way to draw in this version; immediate mode was removed. The old gl*Pointer calls were removed as well, since one really just needs glVertexAttribPointer.
A rule of thumb that I was taught said that it's almost always cheaper to just enable/disable at will rather than checking the current state and changing only if needed.
That said, Marc's answer is something that should definitely work.