If I plan to use multithreading in OpenGL, should I have separate buffers (from glGenBuffers) for each context?
I do not know much about OpenGL multithreading yet (for now I work in "single" thread). I need to know if I can share buffers already pushed to Video Memory (with glBufferData/glBufferSubData), or I have to keep copy of buffer for another thread.
You do not want to use several contexts with several threads. You really don't.
While this sounds like a good idea, in practice muli-context-multi-thread is complicated, troublesome, and badly supported on the driver side, and it only marginally improves (possibly even reduces!) performance.
What you really want is to have only one thread talk to OpenGL (with one context, obviously), map a buffer, and pass the memory pointer to another thread, preferrably using 3 buffers (3 subbuffers of a 3x sized buffer) with immutable storage and persistent mapping, if this is available.
That, and doing indirect render calls, where a second thread feeds the buffers the indirect call reads from.
Further info on the persistent mapping topic: See in particular slides 22-25 of this GDC2014 presentation, which is basically a remake of Cass Everitt's 2013 SIGGRAPH talk.
See also Everitt's original talk: Beyond porting.
Vaos aren't shared so you'll need to generate a new vao for each object per context or else the behavior will become unpredictable and incorrect upon deletion / creation of a new one. This can be a major source of error. Vbos can be shared, so you just need one vbo per object.
Related
I'm experimenting a bit with the new features in DirectX12. So far I really like some of the changes, for example, the pipeline states. At the same time, some other changes are a bit confusing, for example, the descriptor heaps.
Let's start with a quick background so you better understand what I'm asking for.
In DirectX11, we created objects of different shaders and then we had to bind each of them separately during actual runtime when setting up our draw call. Here's a pseudo-example:
deviceContext->VSSetShader(...);
deviceContext->HSSetShader(...);
deviceContext->DSSetShader(...);
deviceContext->PSSetShader(...);
In DirectX12, they've implemented this so much smarter, because now we can configure the pipeline state during initialization instead, and then set all of the above with a single API call:
commandList->SetPipelineState(...);
Very simple, elegant and quicker. And on top of that, very logical. Now let's take a look at the descriptor heaps instead. I kind of expected this to follow the same elegant pattern, and this is basically what my question is about.
In DirectX11, we created objects of different desriptors (views) and then we had to bind each of them separately for each shader during actual runtime when setting up our draw call. Once again a pseudo-example:
deviceContext->PSSetConstantBuffers(0, n, ...);
deviceContext->PSSetShaderResources(0, n, ...);
deviceContext->PSSetSamplers(0, n, ...);
In DirectX12, they've implemented something called descriptor heaps. Basically they're chunks of memory that contain all of the descriptors that we want to bind, and we can also set it up during initialization. So far, it looks equally elegant as the pipeline state, since we can set everything with a single API call:
commandList->SetDescriptorHeaps(n, ...);
Or can we? This is where the confusion arises, because after a search I found this question that states:
Swapping descriptor heaps is a costly operation you want to avoid at all cost.
Meanwhile, the MSDN documentation for SetDesciptorHeaps doesn't state anything about this method behing particularly expensive.
Considering how elegantly they've designed the pipeline state, I was kind of expecting to be able to do this:
commandList->SetPipelineState(...);
commandList->SetDescriptorHeaps(n, ...);
commandList->DrawInstanced(...);
commandList->SetPipelineState(...);
commandList->SetDescriptorHeaps(n, ...);
commandList->DrawInstanced(...);
commandList->SetPipelineState(...);
commandList->SetDescriptorHeaps(n, ...);
commandList->DrawInstanced(...);
But if SetDescriptorHeaps is actually that expensive, this will propably provide a very bad performance. Or will it? As said, I can't find any statement about this actually being a bad idea on MSDN.
So my questions are:
If the above is considered bad practice, how should SetDescriptorHeaps be used?
If this is a Nvidia-only performance problem, how come that they don't fix their drivers?
Basically, what I want to do is to have two descriptor heaps (CBV/SRV/UAV + sampler) for each pipeline state. And judging from how cheap it's to change the pipeline state, it would be logical that changing the descriptor heap would be equally cheap. The pipeline state and the descriptor heap are quite closely related, i.e. changing the pipeline state will most likely require a different set of descriptors.
I'm aware of the strategy of using one massive descriptor heap for each type of descriptor. But that approach feels so overly complicated considering all the work required to keep track of each individual descriptors index. And on top of that, the descriptors in a descriptor table need to be continious in the heap.
Descriptor heaps are independent of pipelines; they don't have to be bound per draw/dispatch. You can also just have a big descriptor heap and bind that instead. This should then be corrected by the root signature though; which should point to the correct offset in this descriptor heap. This means you could have unique textures in one heap and point your root signature to the correct descriptor. You could also suballocate the current heap into one giant heap.
MSDN documentation has now addressed the performance hit on switching heaps:
On some hardware, this can be an expensive operation, requiring a GPU stall to flush all work that depends on the currently bound descriptor heap.
Source: Descriptor Heaps Overview - Switching Heaps
The reason this may happen is that for some hardware, switching between hardware descriptor heaps during execution requires a GPU wait for idle (to ensure that GPU references to the previously descriptor heap are finished).
To avoid being impacted by this possible wait for idle on the descriptor heap switch, applications can take advantage of breaks in rendering that would cause the GPU to idle for other reasons as the time to do descriptor heap switches, since a wait for idle is happening anyway.
Source: Shader Visible Descriptor Heaps - Overview
Recently I looked into improving texture submissions for streaming and whatnot and despite my long searches I have not found any material presenting or even mentioning any way of using PBOs with DSA only functions.
Am I not looking in the right places or is there really no way as of yet?
All of the pixel transfer functions can take either a buffer object+offset or a client CPU pointer (unlike VAO functions, for example, which can only work with buffers now). As such, allowing you to pass a buffer object+offset directly would require having a separate entrypoint for each of the two ways of doing pixel transfer. So they would need glNamedReadPixelsToBuffer and glNamedReadPixelsToClient.
So instead of further proliferating the number of functions (and instead of forbidding using client memory), they make the buffer part work the way it always did: through a binding point. So yes, you're still going to have to bind that buffer to the PACK/UNPACK binding.
Since pixel transfers are not exactly a common operation (relative to the number of other kinds of state changing and rendering commands), and since these particular binds are not directly tied to the GPU, it shouldn't affect your code that much. Plus, there's already a lot of context state tied to pixel transfer operations; what does one more matter?
The OpenGL tradition is to let the user manipulate OpenGL objects using an unsigned int handle. Why not just give a pointer instead? What are the advantages of unique IDs over pointers?
TL;DR: OpenGL IDs don't map bijectively to memory locations. A single OpenGL ID may refer to multiple memory locations at the same time. Also OpenGL has been designed to work for distributed rendering architectures (like X11) as well, and given an indirect context programs running on different machines may use the same OpenGL context.
OpenGL has been designed as an architecture and display system agnostic API. When OpenGL was first developed this happened in light of client-server display architectures (like X11). If you look into the OpenGL specification, even of modern OpenGL-4 it refers to clients and servers.
However in a client/server architectures pointers make no sense. For one the address space of the server is not accessible to the clients without jumping some hoops. And even if you set up a shared memory mapping, the addresses of objects are not the same for client and server. Add to this that on architectures like X11 a single indirect OpenGL context can be used by multiple clients, that may even run on different machines. Pointers simply don't work for that.
Last but not least the OpenGL object model is highly abstract and the OpenGL drawing model is asynchonous Say I do the following:
id = glGenTextures(1)
glBindTexture(id)
glTexStorage(…)
glTexSubImage(image_a)
draw_something()
glTexSubImage(image_b)
draw_someting_b()
When the end of this little snippet has reached, actually nothing at all may have been drawn yet, because no synchronization point has been reached (glFinish, glReadPixels, a buffer swap). Note the two calls to glTexSubImage, which happen on the same id. When the pixels are finally put to the framebuffer, there two different images to be sourced from a single texture ID, because OpenGL guarantees you, that things will appear as if things were drawn synchronously. So at the end of a drawing batch a single object ID may refer to a whole collection of different data sets with different locations in memory.
My first consideration - having pointers would make programmers wonder if they can operate with them in a pointer-arithmetic way, e.g. by pointing to a middle of a texture to update it or something like that. Maybe even more crazy things, such as patching shaders code on-the-fly. That all sounds like a whole new cool degree of freedom, unless you think of additional complications caused by tampering with highly efficient and optimized GPU "black-box" way of operation.
For example - consider inner workings of GPU memory allocation. Just like with OS - pointers you get from OS are not the real "physical" ones, OS memory manager can move things around behind the scenes while keeping the pointers the same (f.e. swapping to HDD). In that case IDs are just the same - GPU can optimize and pack entities with even more freedom, while keeping the nice facade of them being available at 1-2-3.
Another example - OpenGL is not actually the same across manufacturers. In fact OpenGL is just a description of API, where each vendor can make his own implementation the way it works best for him. For example there's no rule on hot to store texture mipmaps, aligned, or interleaved or whatever. Having pointers to a texture would lure developers into tampering with mipmaps, which would cause a lot of trouble to support various implementations or force all the implementations to become strictly unified, which again is a bad idea for performance.
The OpenGL device (GPU) may have its own memory with its own address space, independent of the host (CPU) memory system. (Think of a discrete video card with its own onboard RAM.) The host can't (directly) access that memory, so it's not possible to have a pointer to it.
It's best to think of the GPU as a whole separate computer; it's actually possible to do OpenGL over a network, with a program running on one computer rendering graphics on the video card in another. When you set up your textures and buffers, you're basically uploading data to the GL device for its own internal use.
I just found the following OpenGL specification for ARB_map_buffer_range.
I'm wondering if it is possible to do non-blocking map calls using this extension?
Currently in my application im rendering to an FBO which I then map to a host PBO buffer.
glMapBuffer(target_, GL_READ_ONLY);
However, the problem with this is that it blocks the rendering thread while transferring the data.
I could reduce this issue by pipelining the rendering, but latency is a big issue in my application.
My question is whether i can use map_buffer_range with MAP_UNSYNCHRONIZED_BIT and wait for the map operation to finish on another thread, or defer the map operation on the same thread, while the rendering thread renders the next frame.
e.g.
thread 1:
map();
render_next_frame();
thread 2:
wait_for_map
or
thread 1:
map();
while(!is_map_ready())
do_some_rendering_for_next_frame();
What I'm unsure of is how I know when the map operation is ready, the specification only mentions "other synchronization techniques to ensure correct operation".
Any ideas?
If you map a buffer with GL_MAP_UNSYNCHRONIZED_BIT, the driver will not wait until OpenGL is done with that memory before mapping it for you. So you will get more or less immediate access to it.
The problem is that this does not mean that you can just read/write that memory willy-nilly. If OpenGL is reading from or writing to that buffer and you change it... welcome to undefined behavior. Which can include crashing.
Therefore, in order to actually use unsynchronized mapping, you must synchronize your behavior to OpenGL's access of that buffer. This will involve the use of ARB_sync objects (or NV_fence if you're only on NVIDIA and haven't updated your drivers recently).
That being said, if you're using a fence object to synchronize access to the buffer, then you really don't need GL_MAP_UNSYNCHRONIZED_BIT at all. Once you finish the fence, or detect that it has completed, you can map the buffer normally and it should complete immediately (unless some other operation is reading/writing too).
In general, unsynchronized access is best used for when you need fine-grained write access to the buffer. In this case, good use of sync objects will get you what you really need (the ability to tell when the map operation is finished).
Addendum: The above is now outdated (depending on your hardware). Thanks to OpenGL 4.4/ARB_buffer_storage, you can now not only map unsynchronized, you can keep a buffer mapped indefinitely. Yes, you can have a buffer mapped while it is in use.
This is done by creating immutable storage and providing that storage with (among other things) the GL_MAP_PERSISTENT_BIT. Then you glMapBufferRange, also providing the same bit.
Now technically, that changes pretty much nothing. You still need to synchronize your actions with OpenGL. If you write stuff to a region of the buffer, you'll need to either issue a barrier or flush that region of the buffer explicitly. And if you're reading, you still need to use a fence sync object to make sure that the data is actually there before reading it (and unless you use GL_MAP_COHERENT_BIT too, you'll need to issue a barrier before reading).
In general, it is not possible to do a "nonblocking map", but you can map without blocking.
The reason why there can be no "nonblocking map" is that the moment the function call returns, you could access the data, so the driver must make sure it is there, positively. If the data has not been transferred, what else can the driver do but block.
Threads don't make this any better, and possibly make it worse (adding synchronisation and context sharing issues). Threads cannot magically remove the need to transfer data.
And this leads to how to not block on mapping: Only map when you are sure that the transfer is finished. One safe way to do this is to map the buffer after flipping buffers or after glFinish or after waiting on a query/fence object. Using a fence is the preferrable way if you can't wait until buffers have been swapped. A fence won't stall the pipeline, but will tell you whether or not your transfer is done (glFinish may or may not, but will probably stall).
Reading after swapping buffers is also 100% safe, but may not be acceptable if you need the data within the same frame (works perfectly for screenshots or for calculating a histogram for tonemapping, though).
A less safe way is to insert "some other stuff" and hope that in the mean time the transfer has completed.
In respect of below comment:
This answer is not incorrect. It isn't possible to do any better than access data after it's available (this should be obvious). Which means that you must sync/block, one way or the other, there is no choice.
Although, from a very pedantic point of view, you can of course use GL_MAP_UNSYNCHRONIZED_BIT to get a non-blocking map operation, this is entirely irrelevant, as it does not work unless you explicitly reproduce the implicit sync as described above. A mapping that you can't safely access is good for nothing.
Mapping and accessing a buffer that OpenGL is transferring data to without synchronizing/blocking (implicitly or explicitly) means "undefined behavior", which is only a nicer wording for "probably garbage results, maybe crash".
If, on the other hand, you explicitly synchronize (say, with a fence as described above), then it's irrelevant whether or not you use the unsynchronized flag, since no more implicit sync needs to happen anyway.
I currently try to implement a "Loading thread" for a very basic gaming engine, which takes care of loading e.g. textures or audio while the main thread keeps rendering a proper message/screen until the operation is finished or even render regular game scenes while loading of smaller objects occurs in background.
Now, I am by far no OpenGL expert, but as I implemented such a "Loading" mechanism I quickly found out that OGL doesn't like access to the rendering context from a thread other than the one it was created on very much. I googled around and the solution seems to be:
"Create a second rendering context on the thread and share it with the context of the main thread"
The problem with this is that I use SDL to take care of my window management and context creation, and as far as I can tell from inspecting the API there is no way to tell SDL to share contexts between each other :(
I came to the conclusion that the best solutions for my case are:
Approach A) Alter the SDL library to support context sharing with the platform specific functions (wglShareLists() and glXCreateContext() I assume)
Approach B) Let the "Loading Thread" only load the data into memory and process it to be in a OpenGL-friendly format and pass it to the main thread which e.g. takes care of uploading the texture to the graphics adapter. This, of course, only applies to data that needs a valid OpenGL context to be done
The first solution is the least efficient one I guess. I don't really want to mangle with SDL and beside that I read that context sharing is not a high-performance operation. So my next take would be on the second approach so far.
EDIT: Regarding the "high-performance operation": I read the article wrong, it actually isn't that performance intensive. The article suggested shifting the CPU intensive operations to the second thread with a second context. Sorry for that
After all this introduction I would really appreciate if anyone could give me some hints and comments to the following questions:
1) Is there any way to share contexts with SDL and would it be any good anyway to do so?
2) Is there any other more "elegant" way to load my data in the background that I may have missed or didn't think about?
3) Can my intention of going with approach B considered to be a good choice? There would still be slight overhead from the OpenGL operations on my main thread which blocks rendering, or is it that small that it can be ignored?
Is there any way to share contexts with SDL
No.
Yes!
You have to get the current context, using platform-specific calls. From there, you can create a new context and make it shared, also with platform-specific calls.
Is there any other more "elegant" way to load my data in the background that I may have missed or didn't think about?
Not really. You enumerated the options quite well: hack SDL to get the data you need, or load data inefficiently.
However, you can load the data into mapped buffer objects and transfer the data to OpenGL. You can only do the mapping/unmapping on the OpenGL thread, but the pointer you get when you map can be used on any thread. So map a buffer, pass it to the worker thread. It loads data into the mapped memory, and flips a switch. The GL thread unmaps the pointer (the worker thread should forget about the pointer now) and uploads the texture data.
Can my intention of going with approach B considered to be a good choice?
Define "good"? There's no way to answer this without knowing more about your problem domain.