Storing OpenGL state - C++

Suppose I'm trying to make a small OpenGL graphics engine in C++. I've read that accessing OpenGL state via the glGet* functions can be quite expensive (while reading OpenGL state seems to be a frequent operation), and that it's strongly recommended to keep a copy of the OpenGL state somewhere with fast read/write access.
I'm currently thinking of storing the OpenGL state as a global thread_local variable of some appropriate type. How bad is that design? Are there any pitfalls?

If you want to stick with OpenGL's design (where your context pointer could be considered "thread_local"), I guess it's a valid option... Obviously, you will need to have full control over all OpenGL calls in order to keep your state copy in sync with the current context's state.
I personally prefer to wrap the OpenGL state of interest in an "OpenGLState" class with a bunch of settable/gettable properties, each mapping to some part of the state. You can then also avoid setting the same state twice. You could make it thread_local, but I couldn't (Visual C++ only supports thread_local for POD types).
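A minimal sketch of what such a wrapper could look like, for one piece of state (the class and member names are mine, not from any particular engine):

class OpenGLState {
public:
    void setDepthTest(bool enabled) {
        if (m_depthTest == enabled) return;   // skip redundant state changes
        m_depthTest = enabled;
        if (enabled) glEnable(GL_DEPTH_TEST);
        else         glDisable(GL_DEPTH_TEST);
    }
    bool depthTest() const { return m_depthTest; }  // fast read, no glGet* round trip

private:
    bool m_depthTest = false;   // GL_DEPTH_TEST is initially disabled
};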
You will need to be very careful, as some OpenGL calls indirectly change seemingly unrelated parts of the context's state. For example, glDeleteTextures will reset any binding of the deleted texture(s) to 0. Also, some toolkits are very "helpful" in changing OpenGL state behind your back (for example, QtOpenGLContext on OSX changes your viewport for you when made current).

Since you can only (reasonably) use a GL context with one thread, why do you need thread local? Yes, you can make a context current in different threads at different times, but this is not a wise design.
You will usually have one context and one thread accessing it. In rare cases, you will have two contexts (often shared) with two threads. In that case, you can simply put any additional state you wish to save into your context class, of which each instance is owned by exactly one thread.
But most of the time, you need not explicitly "remember" states anyway. All states have well-documented initial states, and they only change when you change them (exception being changes made by a "super smart" toolkit, but storing a wrong state doesn't help in that case either).
You will usually try to batch together states and issue many "similar" draw calls with one set of states, the reason being that state changes stall the pipeline and require expensive validation before the next draw call.
So, start off with the defaults, and set everything that needs to be non-default before drawing a batch. Then change what needs to be different for the next batch.
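A hedged sketch of that pattern (Mesh, opaqueMeshes, and transparentMeshes are hypothetical):

// Batch 1: opaque geometry; set only the non-default state it needs.
glEnable(GL_DEPTH_TEST);
for (const Mesh& m : opaqueMeshes)
    m.draw();

// Batch 2: change only the delta for transparent geometry.
glEnable(GL_BLEND);
glBlendFunc(GL_SRC_ALPHA, GL_ONE_MINUS_SRC_ALPHA);
glDepthMask(GL_FALSE);
for (const Mesh& m : transparentMeshes)
    m.draw();

// Undo the deltas so the next frame starts from a known state.
glDepthMask(GL_TRUE);
glDisable(GL_BLEND);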
If you can't be bothered to dig through the specs for default values and keep track, you can redundantly set everything all the time. Then run your application in gDEBugger, which will tell you which state changes are redundant, so you can eliminate them.

Related

Should I cache OpenGL state such as currently bound buffers, or does OpenGL do that anyway?

A typical OpenGL call might look like the following:
GLuint buffer;
glGenBuffers(1, &buffer);
glBindBuffer(GL_SOME_BUFFER, buffer);
...
I've read that binding of buffers and other similar functions can be quite expensive. Is it worth saving the currently bound buffer, and checking it before I bind? Such as this:
void StateManager::bindBuffer(GLenum bufType, GLuint bufID) {
    // Only touch the GL when the cached binding actually differs.
    if (this->m_currentBuffer[bufType] != bufID) {
        glBindBuffer(bufType, bufID);
        this->m_currentBuffer[bufType] = bufID;
    }
}
The idea behind this is that if bufID is already bound, the expensive call to glBindBuffer is skipped. Is this a worthwhile approach? I assumed that OpenGL would likely implement such an optimization already, but I have seen this pattern used in a few projects now, so I'm having my doubts. I'm simply interested because it would be a pretty simple thing to implement, but if it doesn't make much/any difference then I will skip it (avoiding premature optimization).
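For illustration, the intended usage would be something like this (stateManager and vbo being placeholders):

stateManager.bindBuffer(GL_ARRAY_BUFFER, vbo);   // cache miss: glBindBuffer is issued
stateManager.bindBuffer(GL_ARRAY_BUFFER, vbo);   // cache hit: the GL call is skipped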
This is highly platform and vendor dependent.
You're asking if "OpenGL would implement...". As you certainly understand already, OpenGL is an API specification. There are many different implementations, and whether they check for redundant state changes is entirely an implementation decision, which can (and will) be different from implementation to implementation.
You shouldn't even expect that a given implementation handles this the same for all pieces of state.
Since this topic is somewhat close to my heart based on past experience, I was tempted to write a small essay, including a few rants. But I decided that it wouldn't belong here, so here is just a list of considerations that could affect if a given OpenGL implementation tests for redundant state changes in specific cases:
How expensive is it to actually change the state? If it's very cheap, checking for redundant changes might simply not be worth it.
How expensive is it to check for redundant changes? Normally not much, but we're looking at pieces of software where every little bit counts.
Are important apps/benchmarks redundantly changing this state on a frequent basis?
What's the philosophy on responsibilities of apps vs. responsibilities of OpenGL implementations?
And yes, this is unfortunate for everybody. For you as an app writer who wants to get ideal performance across vendors/platforms, there's really no easy solution. If you add checks to your code, they will be useless, and add extra overhead, on platforms that have the same checks in the OpenGL implementation. If you do not have checks in your code, and cannot easily avoid having these redundant state changes in the first place, you may leave performance on the table on platforms where the OpenGL implementation does not check.
The reason why state caching is a bad idea is simple: you're doing it wrong. You'll always be in danger of doing it wrong.
Oh sure, you corrected the mistake I pointed out, that different buffer bindings have different state. And maybe you're using a hash-table that makes lookup pretty quick, even if a new extension comes out that adds a new buffer binding point that didn't exist when you wrote your cache.
But that's merely the tip of the iceberg as far as object binding idiosyncrasies.
For example, did you realize that GL_ELEMENT_ARRAY_BUFFER is not actually context state? It's really VAO state, and every time you bind a new VAO, that buffer binding changes. So your VAO cache now has to change the shadowed element buffer binding too.
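A hedged sketch of how that bites a naive cache (vaoA, vaoB, and indexBufA are hypothetical handles):

// The element array binding is VAO state, so a context-level cache goes stale.
glBindVertexArray(vaoA);
glBindBuffer(GL_ELEMENT_ARRAY_BUFFER, indexBufA);   // recorded in vaoA, not the context

glBindVertexArray(vaoB);   // the element array binding silently becomes vaoB's
// A naive cache still believes indexBufA is bound and would skip this call:
glBindBuffer(GL_ELEMENT_ARRAY_BUFFER, indexBufA);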
Also, were you aware that deleting an object automatically unbinds it from any context binding points it is currently bound to? And this is true even for objects that are attached to another object that is bound to the context; the deleted object is automatically detached.
Except that this is only true for certain object types. And even then, it's only true for the context that was current when the object was deleted. Other contexts will be unaffected.
My point is this: proper caching of state is really hard. And if you get it wrong, you will create a multitude of very subtle bugs in your application. Whereas if you just let OpenGL do its thing and structure your code so that multiple binding simply doesn't happen, then you don't have a problem.

Vulkan: Creating and benefit of pipeline derivatives

In Vulkan, you can use vkCreateGraphicsPipelines or vkCreateComputePipelines to create pipeline derivatives, via the basePipelineHandle or basePipelineIndex members of VkGraphicsPipelineCreateInfo/VkComputePipelineCreateInfo. The documentation states that this feature is available for performance reasons:
The goal of derivative pipelines is that they be cheaper to create using the parent as a starting point, and that it be more efficient (on either host or device) to switch/bind between children of the same parent.
This raises quite a few questions for me:
Is there a way to indicate which state is shared between parent and child pipelines, or does the implementation decide?
Is there any way to know whether the implementation is actually getting any benefit from using derived pipelines (other than profiling)?
The parent pipeline needs to be created with VK_PIPELINE_CREATE_ALLOW_DERIVATIVES_BIT. Is there a downside to always using this flag (e.g. in case you may create a derived pipeline from this one in the future)?
I came to this question investigating whether pipeline derivatives provide a benefit. Here are some resources I found from vendors:
Tips and Tricks: Vulkan Dos and Don’ts, Nvidia, June 6, 2019
Don’t expect speedup from Pipeline Derivatives.
Vulkan Usage Recommendations, Samsung
Pipeline derivatives let applications express "child" pipelines as incremental state changes from a similar "parent"; on some architectures, this can reduce the cost of switching between similar states. Many mobile GPUs gain performance primarily through pipeline caches, so pipeline derivatives often provide no benefit to portable mobile applications.
Recommendations
Create pipelines early in application execution. Avoid pipeline creation at draw time.
Use a single pipeline cache for all pipeline creation.
Write the pipeline cache to a file between application runs.
Avoid pipeline derivatives.
Vulkan Best Practice for Mobile Developers - Pipeline Management, Arm Software, Jul 11, 2019
Don't
Create pipelines at draw time without a pipeline cache (introduces performance stutters).
Use pipeline derivatives as they are not supported.
Vulkan Samples, LunarG, API-Samples/pipeline_derivative/pipeline_derivative.cpp
/*
VULKAN_SAMPLE_SHORT_DESCRIPTION
This sample creates pipeline derivative and draws with it.
Pipeline derivatives should allow for faster creation of pipelines.
In this sample, we'll create the default pipeline, but then modify
it slightly and create a derivative. The derivative will be used to
render a simple cube.
We may later find that the pipeline is too simple to show any speedup,
or that replacing the fragment shader is too expensive, so this sample
can be updated then.
*/
It doesn't look like any vendor is actually recommending the use of pipeline derivatives, except maybe to speed up pipeline creation.
To me, it seems like an idea that is good in theory, on a theoretical implementation, but that doesn't amount to much in practice.
Also, if the driver is supposed to benefit from a common parent of multiple pipelines, it should be completely able to automate that ancestor detection. "Common ancestors" could be synthesized based on whichever specific common pipeline states provide the best speed-up. Why specify it explicitly through the API?
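For reference, a hedged sketch of the explicit API usage being discussed (makeBaseCreateInfo is a hypothetical helper; error handling omitted):

// The parent must opt in to being derived from.
VkGraphicsPipelineCreateInfo parentInfo = makeBaseCreateInfo();  // hypothetical helper
parentInfo.flags |= VK_PIPELINE_CREATE_ALLOW_DERIVATIVES_BIT;

VkPipeline parent = VK_NULL_HANDLE;
vkCreateGraphicsPipelines(device, VK_NULL_HANDLE, 1, &parentInfo, nullptr, &parent);

// The child copies the parent's state, tweaks some of it, and names the parent.
VkGraphicsPipelineCreateInfo childInfo = parentInfo;
childInfo.flags = VK_PIPELINE_CREATE_DERIVATIVE_BIT;
childInfo.basePipelineHandle = parent;
childInfo.basePipelineIndex = -1;   // -1 means "use the handle, not an index"
// ... modify e.g. childInfo.pRasterizationState before creating ...

VkPipeline child = VK_NULL_HANDLE;
vkCreateGraphicsPipelines(device, VK_NULL_HANDLE, 1, &childInfo, nullptr, &child);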
Is there a way to indicate which state is shared between parent and child pipelines
No; the pipeline creation API provides no way to tell it what state will change. The idea being that, since the implementation can see the parent's state, and it can see what you ask of the child's state, it can tell what's different.
Also, if there were such a way, it would only represent a way for you to accidentally misinform the implementation as to what changed. Better to just let the implementation figure out the changes.
Is there any way to know whether the implementation is actually getting any benefit from using derived pipelines (other than profiling)?
No.
The parent pipeline needs to be created with VK_PIPELINE_CREATE_ALLOW_DERIVATIVES_BIT. Is there a downside to always using this flag (e.g. in case you may create a derived pipeline from this one in the future)?
Probably. Due to #1, the implementation needs to store at least some form of the parent pipeline's state, so that it can compare it to the child pipeline's state. And it must store this state in an easily readable form, which will probably not be the same form as the GPU memory and tokens to be copied into the command stream. As such, there's a good chance that parent pipelines will allocate additional memory for such data. Though the likelihood of them being slower at binding/command execution time is low.
You can test this easily enough by passing an allocator to the pipeline creation functions. If it allocates the same amount of memory as without the flag, then it probably isn't storing anything.
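A hedged sketch of that test (the helper names are mine; the realloc path is simplified, and std::aligned_alloc assumes C++17):

#include <cstdlib>
#include <vulkan/vulkan.h>

struct AllocStats { size_t totalRequested = 0; };

static VKAPI_ATTR void* VKAPI_CALL countAlloc(void* user, size_t size, size_t alignment,
                                              VkSystemAllocationScope) {
    static_cast<AllocStats*>(user)->totalRequested += size;
    // std::aligned_alloc needs size to be a multiple of alignment, so round up.
    return std::aligned_alloc(alignment, (size + alignment - 1) / alignment * alignment);
}

static VKAPI_ATTR void* VKAPI_CALL countRealloc(void* user, void* original, size_t size,
                                                size_t /*alignment*/, VkSystemAllocationScope) {
    // Simplification: std::realloc does not honor the alignment request.
    static_cast<AllocStats*>(user)->totalRequested += size;
    return std::realloc(original, size);
}

static VKAPI_ATTR void VKAPI_CALL countFree(void*, void* memory) { std::free(memory); }

// Create the pipeline through a counting allocator and report host bytes requested.
size_t hostBytesForPipeline(VkDevice device, const VkGraphicsPipelineCreateInfo& info) {
    AllocStats stats;
    VkAllocationCallbacks cb{};
    cb.pUserData = &stats;
    cb.pfnAllocation = countAlloc;
    cb.pfnReallocation = countRealloc;
    cb.pfnFree = countFree;

    VkPipeline pipeline = VK_NULL_HANDLE;
    vkCreateGraphicsPipelines(device, VK_NULL_HANDLE, 1, &info, &cb, &pipeline);
    vkDestroyPipeline(device, pipeline, &cb);
    return stats.totalRequested;   // compare with and without ALLOW_DERIVATIVES_BIT
}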
I'm no expert in computer graphics, but my understanding (which is partly intuition) is the following:
Is there a way to indicate which state is shared between parent and child pipelines, or does the implementation decide?
There are certain aspects of the pipeline that are not specified at render time (and so are fixed), for example which shaders to use. My speculation is that the parent and the derived pipelines likely share this "read-only" information (or in C terms, they point to the same objects). That's why creation of derived pipelines is faster.
Switching between these pipelines should also be faster, because there is less need to swap resources when changing pipelines: some of the resources are shared and stay the same.
The parent pipeline needs to be created with VK_PIPELINE_CREATE_ALLOW_DERIVATIVES_BIT. Is there a downside to always using this flag (eg. in case you may create a derived pipeline from this one in the future)?
This is very likely implementation-dependent. My speculation is that, when you allow derivatives, you enable resource (e.g. shader) sharing, which means the implementation is likely going to do reference counting for these resources. That would be an unnecessary cost if the resources are not going to be shared. Also, when changing pipelines, the driver wouldn't need to check whether each resource is shared and can stay on the GPU, or is not and needs changing. If there is no sharing, all resources would be changed, and there is no overhead of checking. None of these are that much of an overhead, so either Vulkan is staying on the safe side, or there is another reason I don't know about.

Is it possible to preserve all the state at once in OpenGL?

If we have several OpenGL contexts, each in its own process, the driver somehow virtualises the device, so that each program thinks it exclusively runs the GPU. That is, if one program calls glEnable, the other one will never notice that.
The same could otherwise be done with a ton of glGet calls to save the state and their counterparts to restore it afterwards. Obviously, the driver does it more efficiently. In userspace, however, we would need to track which changes we made to the state and handle them selectively. Maybe it's just me missing something, but I think it would be nice, for example, to adjust the viewport for a framebuffer and then simply undo that change, restoring whatever state was there before.
Maybe there is a way of achieving the effect of a context switch yet within a single program?
Maybe there is a way of achieving the effect of a context switch yet within a single program?
You may create as many OpenGL contexts in a single process as you like and switch between them. Also, with modern GPUs the state of the OpenGL context bears little resemblance to what's actually happening on the GPU.
For pre-Core OpenGL there's glPushAttrib()/glPopAttrib() that will let you store off some GL state.
You're probably better off writing your own client-side state shadowing though.
The state machine (and command queue, discussed below) are unique to each context. It is much, much higher-level than you are thinking; the state is generally wrapped up nicely in usermode.
As for context-switching in a single process, be aware that each render context in GL is unsynchronized. An implicit flush is generated during a context switch in order to help alleviate this problem. As long as a context is only used by a single thread, this is generally adequate but probably going to negatively impact performance.

OpenGL Unsynchronized/Non-blocking Map

I just found the OpenGL extension specification for ARB_map_buffer_range.
I'm wondering if it is possible to do non-blocking map calls using this extension?
Currently in my application I'm rendering to an FBO, which I then map to a host PBO buffer:
glMapBuffer(target_, GL_READ_ONLY);
However, the problem with this is that it blocks the rendering thread while transferring the data.
I could reduce this issue by pipelining the rendering, but latency is a big issue in my application.
My question is whether I can use map_buffer_range with MAP_UNSYNCHRONIZED_BIT and wait for the map operation to finish on another thread, or defer the map operation on the same thread, while the rendering thread renders the next frame.
e.g.
thread 1:
    map();
    render_next_frame();
thread 2:
    wait_for_map();
or
thread 1:
    map();
    while (!is_map_ready())
        do_some_rendering_for_next_frame();
What I'm unsure of is how I know when the map operation is ready; the specification only mentions "other synchronization techniques to ensure correct operation".
Any ideas?
If you map a buffer with GL_MAP_UNSYNCHRONIZED_BIT, the driver will not wait until OpenGL is done with that memory before mapping it for you. So you will get more or less immediate access to it.
The problem is that this does not mean that you can just read/write that memory willy-nilly. If OpenGL is reading from or writing to that buffer and you change it... welcome to undefined behavior. Which can include crashing.
Therefore, in order to actually use unsynchronized mapping, you must synchronize your behavior to OpenGL's access of that buffer. This will involve the use of ARB_sync objects (or NV_fence if you're only on NVIDIA and haven't updated your drivers recently).
That being said, if you're using a fence object to synchronize access to the buffer, then you really don't need GL_MAP_UNSYNCHRONIZED_BIT at all. Once you finish the fence, or detect that it has completed, you can map the buffer normally and it should complete immediately (unless some other operation is reading/writing too).
In general, unsynchronized access is best used when you need fine-grained write access to the buffer. In that case, good use of sync objects will get you what you really need (the ability to tell when the map operation is finished).
Addendum: The above is now outdated (depending on your hardware). Thanks to OpenGL 4.4/ARB_buffer_storage, you can now not only map unsynchronized, you can keep a buffer mapped indefinitely. Yes, you can have a buffer mapped while it is in use.
This is done by creating immutable storage and providing that storage with (among other things) the GL_MAP_PERSISTENT_BIT. Then you glMapBufferRange, also providing the same bit.
Now technically, that changes pretty much nothing. You still need to synchronize your actions with OpenGL. If you write stuff to a region of the buffer, you'll need to either issue a barrier or flush that region of the buffer explicitly. And if you're reading, you still need to use a fence sync object to make sure that the data is actually there before reading it (and unless you use GL_MAP_COHERENT_BIT too, you'll need to issue a barrier before reading).
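A minimal sketch of that persistent-mapping path (the buffer target, size, and flags here are arbitrary assumptions):

// Persistently mapped buffer via GL 4.4 / ARB_buffer_storage.
GLuint buf;
glGenBuffers(1, &buf);
glBindBuffer(GL_ARRAY_BUFFER, buf);

const GLsizeiptr size = 1 << 20;   // 1 MiB, arbitrary
const GLbitfield flags = GL_MAP_WRITE_BIT | GL_MAP_PERSISTENT_BIT | GL_MAP_COHERENT_BIT;
glBufferStorage(GL_ARRAY_BUFFER, size, nullptr, flags);          // immutable storage
void* ptr = glMapBufferRange(GL_ARRAY_BUFFER, 0, size, flags);   // stays mapped

// Each frame: write into ptr, issue draws, then fence so we know when the GPU
// has consumed the data before we overwrite that region again.
GLsync fence = glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0);
glClientWaitSync(fence, GL_SYNC_FLUSH_COMMANDS_BIT, GLuint64(-1));  // or poll instead
glDeleteSync(fence);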
In general, it is not possible to do a "nonblocking map", but you can map without blocking.
The reason why there can be no "nonblocking map" is that the moment the function call returns, you could access the data, so the driver must make sure it is there, positively. If the data has not been transferred, what else can the driver do but block?
Threads don't make this any better, and possibly make it worse (adding synchronisation and context sharing issues). Threads cannot magically remove the need to transfer data.
And this leads to how not to block on mapping: only map when you are sure that the transfer is finished. One safe way to do this is to map the buffer after swapping buffers, after glFinish, or after waiting on a query/fence object. Using a fence is the preferable way if you can't wait until buffers have been swapped. A fence won't stall the pipeline, but it will tell you whether or not your transfer is done (glFinish may or may not stall, but probably will).
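For example, here is a hedged sketch of the fence approach for a PBO readback (pbo, width, and height are assumed to be defined elsewhere):

// Bind the PBO and start the asynchronous readback into it.
glBindBuffer(GL_PIXEL_PACK_BUFFER, pbo);
glReadPixels(0, 0, width, height, GL_RGBA, GL_UNSIGNED_BYTE, nullptr);  // offset 0 into the PBO
GLsync fence = glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0);

// Later (e.g. while rendering the next frame), poll instead of blocking:
GLint status = GL_UNSIGNALED;
glGetSynciv(fence, GL_SYNC_STATUS, sizeof(status), nullptr, &status);
if (status == GL_SIGNALED) {
    glDeleteSync(fence);
    void* data = glMapBuffer(GL_PIXEL_PACK_BUFFER, GL_READ_ONLY);  // completes promptly now
    // ... consume data ...
    glUnmapBuffer(GL_PIXEL_PACK_BUFFER);
}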
Reading after swapping buffers is also 100% safe, but may not be acceptable if you need the data within the same frame (works perfectly for screenshots or for calculating a histogram for tonemapping, though).
A less safe way is to insert "some other stuff" and hope that in the mean time the transfer has completed.
Regarding the comment below:
This answer is not incorrect. It isn't possible to do any better than access data after it's available (this should be obvious). Which means that you must sync/block, one way or the other, there is no choice.
Although, from a very pedantic point of view, you can of course use GL_MAP_UNSYNCHRONIZED_BIT to get a non-blocking map operation, this is entirely irrelevant, as it does not work unless you explicitly reproduce the implicit sync as described above. A mapping that you can't safely access is good for nothing.
Mapping and accessing a buffer that OpenGL is transferring data to without synchronizing/blocking (implicitly or explicitly) means "undefined behavior", which is only a nicer wording for "probably garbage results, maybe crash".
If, on the other hand, you explicitly synchronize (say, with a fence as described above), then it's irrelevant whether or not you use the unsynchronized flag, since no more implicit sync needs to happen anyway.

Should I call glEnable and glDisable every time I draw something?

How often should I call OpenGL functions like glEnable() or glEnableClientState() and their corresponding glDisable counterparts? Are they meant to be called once at the beginning of the application, or should I keep them disabled and only enable those features I immediately need for drawing something? Is there a performance difference?
Warning: glPushAttrib / glPopAttrib are DEPRECATED in the modern OpenGL programmable pipeline.
If you find that you are checking the value of state variables often and subsequently calling glEnable/glDisable you may be able to clean things up a bit by using the attribute stack (glPushAttrib / glPopAttrib).
The attribute stack allows you to isolate areas of your code, so that attribute changes in one section do not affect the attribute state in other sections.
void drawObject1(){
    glPushAttrib(GL_ENABLE_BIT);   // save all enable/disable state
    glEnable(GL_DEPTH_TEST);
    glEnable(GL_LIGHTING);
    /* Isolated Region 1 */
    glPopAttrib();                 // restore the saved enable/disable state
}

void drawObject2(){
    glPushAttrib(GL_ENABLE_BIT);
    glEnable(GL_FOG);
    glEnable(GL_POINT_SMOOTH);
    /* Isolated Region 2 */
    glPopAttrib();
}

void drawScene(){
    drawObject1();
    drawObject2();
}
void drawScene(){
drawObject1();
drawObject2();
}
Although GL_LIGHTING and GL_DEPTH_TEST are enabled in drawObject1, that state does not carry over to drawObject2. Without glPushAttrib this would not be the case. Also note that there is no need to call glDisable at the end of each function; glPopAttrib does the job.
As far as performance goes, the overhead of individual calls to glEnable/glDisable is minimal. If you need to handle lots of state, you will probably need to create your own state manager or make numerous calls to glGetInteger... and then act accordingly. The added machinery and control flow can make the code less transparent, harder to debug, and more difficult to maintain. These issues may make other, more fruitful optimizations more difficult.
The attribute stack can aid in maintaining layers of abstraction and creating regions of isolation.
glPushAttrib manpage
"That depends".
If your entire app only uses one combination of enable/disable states, then by all means just set it up at the beginning and go.
Most real-world apps need to mix, and then you're forced to call glEnable() to enable some particular state(s), do the draw calls, then glDisable() them again when you're done to "clear the stage".
State-sorting, state-tracking, and many optimization schemes stem from this, as state switching is sometimes expensive.
First of all, which OpenGL version do you use? And which generation of graphics hardware does your target group have? Knowing this would make it easier to give a more correct answer. My answer assumes OpenGL 2.1.
OpenGL is a state machine, meaning that whenever a state is changed, that state is made "current" until changed again explicitly by the programmer with a new OpenGL API call. Exceptions to this rule exist, like client state array calls making the current vertex color undefined. But those are the exceptions which define the rule.
"once at the beginning of the application" doesn't make much sense, because there are times you need to destroy your OpenGL context while the application is still running. I assume you mean just after every window creation. That works for state you don't need to change later. Example: If all your draw calls use the same vertex array data, you don't need to disable them with glDisableClientState afterwards.
There is a lot of enable/disable state associated with the old fixed-function pipeline. The easy remedy for this is: use shaders! If you target a generation of cards at most five years old, they probably mimic the fixed-function pipeline with shaders anyway. By using shaders you are in more or less total control of what's happening during the transform and rasterization stages, and you can make your own "states" with uniforms, which are very cheap to change/update.
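As a hedged illustration (program and the useFog uniform are hypothetical):

// Toggle a shader-implemented effect with a uniform instead of glEnable/glDisable.
GLint useFogLoc = glGetUniformLocation(program, "useFog");
glUseProgram(program);
glUniform1i(useFogLoc, 1);   // cheap per-draw "state" change, interpreted by the shader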
Knowing that OpenGL is a state machine, as I said above, should make it clear that one should strive to keep state changes to a minimum, as long as that's possible. However, there are most probably other things which impact performance much more than enable/disable state calls. If you want to know about them, read on.
The expense of state not associated with the old fixed-function state calls, and which is not simple enable/disable state, can differ widely in cost. Notably, linking shaders and binding names ("names" of textures, programs, buffer objects) are usually fairly expensive. This is why a lot of games and applications used to sort the draw order of their meshes according to texture: that way, they didn't have to bind the same texture twice. Nowadays, the same applies to shader programs; you don't want to bind the same shader program twice if you don't have to. Also, not all the features in a particular OpenGL version are hardware accelerated on all cards, even if the vendors of those cards claim they are OpenGL compliant. Being compliant means that they follow the specification, not that they necessarily run all the features efficiently. Functions like glHistogram and glMinmax from GL_ARB_imaging should be remembered in this regard.
Conclusion: Unless there is an obvious reason not to, use shaders! It saves you from a lot of unnecessary state calls, since you can use uniforms instead. OpenGL shaders have only been around for about six years, you know. Also, the overhead of enable/disable state changes can be an issue, but usually there is a lot more to gain from optimizing other, more expensive state changes, like glUseProgram, glCompileShader, glLinkProgram, glBindBuffer and glBindTexture.
P.S.: OpenGL 3.0 removed the client state enable/disable calls. Arrays are implicitly enabled, as drawing from vertex arrays is the only way to draw in this version; immediate mode was removed. The old gl*Pointer calls were removed as well, since one really just needs glVertexAttribPointer.
A rule of thumb I was taught is that it's almost always cheaper to just enable/disable at will rather than to check the current state and change it only if needed.
That said, Marc's answer is something that should definitely work.