I'm writing a DirectX application with two threads:
Producer thread grabs desktop frames with DirectX (as in the Desktop Duplication DirectX sample)
IDXGIResource* DesktopResource = nullptr;
ID3D11Texture2D* m_AcquiredDesktopImage = nullptr;
DXGI_OUTDUPL_FRAME_INFO FrameInfo;

HRESULT hr = m_DeskDupl->AcquireNextFrame(500, &FrameInfo, &DesktopResource);
// (error handling omitted)
hr = DesktopResource->QueryInterface(__uuidof(ID3D11Texture2D), reinterpret_cast<void **>(&m_AcquiredDesktopImage));
DesktopResource->Release();
// The texture pointer I'm interested in is m_AcquiredDesktopImage
Consumer thread performs image processing operations on the GPU.
To avoid copies I'd like to keep everything on the GPU as much as possible. From ReleaseFrame's documentation I kinda get that I should call ReleaseFrame on the desktop duplication interface as soon as I'm done processing the frame.
My question: should I copy the m_AcquiredDesktopImage texture into another texture, call ReleaseFrame as soon as the copy is finished, and hand that new texture to the consumer thread for processing? Or can I get away with passing the m_AcquiredDesktopImage pointer itself to the consumer thread? Is it a copy of the framebuffer texture, or is it the framebuffer texture itself, so that I might create a data race by returning it?
Which one is the correct way to handle a producer of grabbed frames and a consumer of GPU textures?
...should I copy the m_AcquiredDesktopImage texture into another texture, call ReleaseFrame as soon as the copy is finished, and hand that new texture to the consumer thread for processing, or...
Yes, this is the way. You copy the texture, you are finished with the acquired frame, and you release it, because its data is no longer valid after the release.
...can I get away with passing the m_AcquiredDesktopImage pointer itself to the consumer thread? Is it a copy of the framebuffer texture, or is it the framebuffer texture itself, so that I might create a data race by returning it?
The API keeps updating this texture. You are promised that between a successful return from AcquireNextFrame and your call to ReleaseFrame the API does not touch the texture, and you are free to use it. If you cannot complete your use between those two calls (which is your case: after all, you created a consumer thread to run asynchronously with the capture), you copy the data and then call ReleaseFrame. Once you have released the frame, the API resumes updating the texture.
An attempt to use the texture after ReleaseFrame results in concurrent access to it: your reads against the API's further updates.
The MSDN documentation on ReleaseFrame is a little convoluted. It specifically states you need to release the current frame before processing the next one, and that the surface state is "invalid" after release, which would indicate it is either not a copy, or not a copy that your process owns (which would yield the same effective result). It also states you should delay the call to ReleaseFrame until right before you call AcquireNextFrame for performance reasons, which can make for some interesting timing issues, especially with the threading model you're using.
I think you'd be better off making a copy (so ReleaseFrame from the previous capture, then AcquireNextFrame, then CopyResource). Unless you're using fences, you have no guarantee that the GPU will have finished consuming the resource before your producer thread calls ReleaseFrame, which could give you undefined results. And if you are using fences, and the AcquireNextFrame call is delayed until the GPU has finished consuming the previous frame's data, you'll introduce stalls and lose a lot of the benefit of the CPU being able to run ahead of the GPU.
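To make that flow concrete, here is a minimal sketch of the copy-then-release pattern. m_SharedCopy (a texture created up front with the same description as the desktop image) and m_Context (the D3D11 immediate context) are assumed members that are not part of the original snippet, and most error handling is trimmed:

// Producer thread: acquire the frame, copy it into our own texture, release immediately.
IDXGIResource* DesktopResource = nullptr;
DXGI_OUTDUPL_FRAME_INFO FrameInfo;

HRESULT hr = m_DeskDupl->AcquireNextFrame(500, &FrameInfo, &DesktopResource);
if (SUCCEEDED(hr))
{
    ID3D11Texture2D* AcquiredImage = nullptr;
    hr = DesktopResource->QueryInterface(__uuidof(ID3D11Texture2D),
                                         reinterpret_cast<void**>(&AcquiredImage));
    DesktopResource->Release();

    if (SUCCEEDED(hr))
    {
        // GPU-to-GPU copy into a texture we own; nothing is read back to the CPU.
        m_Context->CopyResource(m_SharedCopy, AcquiredImage);
        AcquiredImage->Release();
        // Hand m_SharedCopy (or a per-frame texture from a pool) to the consumer thread here.
    }

    // After this call the duplication API may overwrite the acquired texture again,
    // but m_SharedCopy is unaffected.
    m_DeskDupl->ReleaseFrame();
}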
I'm curious why you're going with this threading model, when the work is done on the GPU. I suspect it makes life a little more complicated. Although making a copy of the texture would remove a lot of those complications.
Related
I know that multi threaded OpenGL is a delicate topic and I am not trying here to render from multiple threads. I also do not try to create multiple contexts and share objects with share lists. I have a single context and I issue draw commands and gl state changes only from the main thread.
However, I am dynamically updating parts of a VBO in every frame. I only write to the VBO, I do not need to read it on the CPU side. I use glMapBufferRange so I can compute the changed data on the fly and don't need an additional copy (which would be created by the blocking glBufferSubData).
It works, and now I would like to multi-thread the data update (since it needs to update a lot of vertices at a steady 90 fps) and use a persistently mapped buffer (using GL_MAP_PERSISTENT_BIT). This will require issuing glFlushMappedBufferRange whenever a worker thread has finished updating parts of the mapped buffer.
Is it fine to call glFlushMappedBufferRange on a separate thread? The Ranges the different threads operate on do not overlap. Is there an overhead or implicit synchronisation involved in doing so?
No, you need to call glFlushMappedBufferRange from the thread that owns the OpenGL context.
To overcome this you have two options:
Get the OpenGL context and make it current in the worker thread, which means the OpenGL thread has to relinquish the context for this to work.
Push the relevant range into a thread-safe queue and let the OpenGL thread pop each range from it and call glFlushMappedBufferRange.
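A minimal sketch of the second option, assuming the VBO is mapped persistently with GL_MAP_WRITE_BIT | GL_MAP_PERSISTENT_BIT | GL_MAP_FLUSH_EXPLICIT_BIT; the queue layout and names below are illustrative rather than prescriptive:

#include <GL/glew.h>   // or whichever loader you already use
#include <mutex>
#include <queue>

struct DirtyRange { GLintptr offset; GLsizeiptr length; };

std::mutex             g_rangeMutex;
std::queue<DirtyRange> g_dirtyRanges;

// Worker thread: writes vertices into its slice of the persistent mapping,
// then publishes the range instead of calling GL itself.
void publishRange(GLintptr offset, GLsizeiptr length)
{
    std::lock_guard<std::mutex> lock(g_rangeMutex);
    g_dirtyRanges.push({offset, length});
}

// GL thread, once per frame, with the buffer bound to GL_ARRAY_BUFFER:
void flushPendingRanges()
{
    std::lock_guard<std::mutex> lock(g_rangeMutex);
    while (!g_dirtyRanges.empty())
    {
        DirtyRange r = g_dirtyRanges.front();
        g_dirtyRanges.pop();
        glFlushMappedBufferRange(GL_ARRAY_BUFFER, r.offset, r.length);
    }
}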
I came across the concept of Sync Objects, and decided to test them out. They seem to work as expected, but my current test cases are limited.
What would be a proper test to ensure that these sync objects are performing as intended as a means to synchronize the CPU rendering thread with the GPU?
An example use-case for this would be for video capture programs which "hook" into the OpenGL context of a video game, or some other application using OpenGL.
Your example use-case seems fishy to me.
FRAPS is an example of a program that "hooks" into an OpenGL application to capture video, and it does it very differently. Rather than force a CPU-GPU synchronization, FRAPS inserts an asynchronous pixelbuffer read immediately before SwapBuffers (...) is called. It will then try and read the results back the next time SwapBuffers (...) is called instead of stalling while the result becomes available the first time around. Latency does not matter for FRAPS.
However, even without the async PBO read, there would be no reason for FRAPS to use a sync object. glReadPixels (...) and commands like it will implicitly wait for all pending commands to finish before reading the results and returning control to the CPU. It would really hurt performance, but GL would automatically do the synchronization.
The simplest use-case for sync objects is two or more render contexts running simultaneously.
In OpenGL you can share certain resources (including sync objects) across contexts, but the command stream for each context is completely separate and no synchronization of any sort is enforced. Thus, if you were to upload data to a vertex buffer in one context and use it in another, you would insert a fence sync in the producer (upload context) and wait for it to be signaled in the consumer (draw context). This will ensure that the draw command does not occur until the upload is finished - if the commands were all issued from the same context, GL would actually guarantee this without the use of a sync object.
The example I just gave does not require CPU-GPU synchronization (only GPU-GPU), but you can use glClientWaitSync (...) to block your calling thread until the upload is finished if you had a situation where CPU-GPU made sense.
Here is some pseudo-code to evaluate the effectiveness of a sync object:
Thread 1:

    glBindBuffer    (GL_ARRAY_BUFFER, vbo);
    glBufferSubData (GL_ARRAY_BUFFER, 0, 4096*4096, foo); // Upload a 16 MiB buffer

    GLsync ready = glFenceSync (GL_SYNC_GPU_COMMANDS_COMPLETE, 0);

Thread 0:

    glBindBuffer (GL_ARRAY_BUFFER, vbo);

    // Try with and without synchronization
    if (sync) {
      // Wait up to 1 second for the upload to finish
      glClientWaitSync (ready, GL_SYNC_FLUSH_COMMANDS_BIT, 1000000000UL);
    }

    // Ordinarily mapping a buffer would wait for everything else to finish;
    // we need to eliminate that behavior (GL_MAP_UNSYNCHRONIZED_BIT) for this test.
    void* bar = glMapBufferRange (GL_ARRAY_BUFFER, 0, 4096*4096, GL_MAP_UNSYNCHRONIZED_BIT);

    // When `sync` is true and the sync object is working, bar should be identical to foo.
I'm making a game and I'm currently working on the map generation.
The map is generated procedurally with some algorithms; there are no problems with this.
The problem is that my map can be huge, so I've thought about cutting the map into chunks.
My chunks are OK, they're 512*512 pixels each, but the only problem is that I have to generate a texture (actually a RenderTexture from SFML). It takes around 0.5 ms to generate, which makes the game freeze each time I generate a chunk.
I've thought about a way to fix this: I've made a kind of thread pool with a factory. I just have to send a task to it and it creates the chunk.
Now that it's all implemented, it raises OpenGL warnings like:
"An internal OpenGL call failed in RenderTarget.cpp (219) : GL_INVALID_OPERATION, the specified operation is not allowed in the current state".
I don't know if this is the right way of dealing with chunks. I've also thought about saving the chunks to image files, but I fear it would take too much time to save and load them.
Do you know a better way to deal with this kind of "infinite" map?
It is an invalid operation because you must have a context bound to each thread. More importantly, all of the GL window system APIs enforce a strict 1:1 mapping between threads and contexts... no thread may have more than one context bound and no context may be bound to more than one thread. What you would need to do is use shared contexts (one context for drawing and one for each worker thread), things like buffer objects and textures will be shared between all shared contexts but the state machine and container objects like FBOs and VAOs will not.
Are you using tiled rendering for this map, or is this just one giant texture?
If you do not need to update individual sub-regions of your "chunk" images you can simply create new textures in your worker threads. The worker threads can create new textures and give them data while the drawing thread goes about its business. Only after a worker thread finishes would you actually try to draw using one of the chunks. This may increase the overall latency between the time a chunk starts loading and eventually appears in the finished scene but you should get a more consistent framerate.
If you need to use a single texture for this, I would suggest you double buffer your texture. Have one that you use in the drawing thread and another one that your worker threads issue glTexSubImage2D (...) on. When the worker thread(s) finish updating their regions of the texture you can swap the texture you use for drawing and updating. This will reduce the amount of synchronization required, but again increases the latency before an update eventually appears on screen.
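A rough sketch of that double-buffering idea, assuming SFML is used (as in the question) so the worker thread can activate an sf::Context, which shares resources with the main context. The names, the atomics and the glFinish-based hand-off are mine, not a definitive implementation:

#include <SFML/OpenGL.hpp>
#include <SFML/Window.hpp>
#include <atomic>

// Hypothetical double-buffered chunk texture: the drawing thread samples
// textures[front] while a worker fills textures[1 - front].
GLuint            textures[2];
std::atomic<int>  front{0};
std::atomic<bool> backReady{false};

// Worker thread: SFML contexts share resources, so a texture updated here
// becomes visible to the drawing thread once the upload has completed.
void workerUpload(const unsigned char* pixels, int w, int h)
{
    sf::Context context;   // creates and activates a worker-side GL context
    int back = 1 - front.load();
    glBindTexture(GL_TEXTURE_2D, textures[back]);
    glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, w, h, GL_RGBA, GL_UNSIGNED_BYTE, pixels);
    glFinish();            // crude but simple: make sure the upload has finished
    backReady.store(true);
}

// Drawing thread, once per frame: adopt the freshly filled texture if one is ready.
void maybeSwapChunkTexture()
{
    if (backReady.exchange(false))
        front.store(1 - front.load());
}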
Things to try:
make your chunks smaller
generate the chunks in a separate thread, but pass them to the GPU from the main thread
pass them to the GPU a small piece at a time, spread over a second or two (see the sketch below)
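For that last point, one way to spread the upload over time is to push a small band of rows per frame with glTexSubImage2D, along these lines (a sketch only; the RGBA format, band size and names are assumptions):

// Hypothetical incremental upload: spread one 512x512 chunk over many frames
// by uploading a band of rows per frame from the main (GL) thread.
const int CHUNK_SIZE    = 512;
const int ROWS_PER_CALL = 32;   // tune so one call stays well under a frame
int       rowsUploaded  = 0;    // reset to 0 when starting a new chunk

// Call once per frame until it returns true; `pixels` was filled by a worker thread.
bool uploadChunkSlice(GLuint texture, const unsigned char* pixels)
{
    if (rowsUploaded >= CHUNK_SIZE)
        return true;            // chunk fully uploaded, safe to draw

    glBindTexture(GL_TEXTURE_2D, texture);
    glTexSubImage2D(GL_TEXTURE_2D, 0,
                    0, rowsUploaded,                 // x, y offset
                    CHUNK_SIZE, ROWS_PER_CALL,       // width, height of the band
                    GL_RGBA, GL_UNSIGNED_BYTE,
                    pixels + rowsUploaded * CHUNK_SIZE * 4);
    rowsUploaded += ROWS_PER_CALL;
    return false;
}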
When I call glReadPixels from another thread, it doesn't return any data. I read somewhere that I need to create a new context in the calling thread and copy the memory over. How exactly do I do this?
This is the glReadPixels code I use:
BYTE* pixels = new BYTE[3 * width * height];
glReadPixels(0, 0, width, height, GL_RGB, GL_UNSIGNED_BYTE, pixels);
FIBITMAP* image = FreeImage_ConvertFromRawBits(pixels, width, height, 3 * width, 24, 0xFF0000, 0x00FF00, 0x0000FF, false);
FreeImage_Save(FIF_PNG, image, pngpath.c_str(), 0);
Alternatively, I read in this thread that they suggest another piece of code (see the end), but I don't understand what origX, origY, srcOrigX, and srcOrigY are.
You can create shared contexts, and this will work as you intended. See wglShareLists (the name is chosen badly, it shares more than just lists). Or, use WGL_ARB_create_context, which directly supports sharing contexts too (you have tagged the question "windows", but similar functionality exists for non-WGL too).
However, it is much, much easier to use a pixel buffer object instead, that will have the same net effect as multithreading (the transfer will run asynchronously without blocking the render thread), and it is many times less complex.
You have a couple of different options.
You can call glReadPixels pipelined with the rendering thread. In this case the returned data should be stored in a buffer that can be enqueued to a thread dedicated to saving pictures. This can be done easily with a buffer queue, a mutex and a semaphore: the rendering thread gets the data with glReadPixels, locks the mutex, enqueues the (system memory) pixel buffer, unlocks the mutex and increments the semaphore; the worker thread (blocked on the semaphore) is signaled by the rendering thread, locks the mutex, dequeues the pixel buffer, unlocks the mutex and saves the image.
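A condensed sketch of this first option, using a std::condition_variable in place of the semaphore (the queue layout, names and fixed output file name are illustrative):

#include <windows.h>
#include <GL/gl.h>
#include <FreeImage.h>
#include <condition_variable>
#include <mutex>
#include <queue>
#include <vector>

// Frame queue shared between the render thread and a saver thread.
std::mutex                    g_mutex;
std::condition_variable       g_cv;
std::queue<std::vector<BYTE>> g_frames;

// Render thread, after drawing the frame:
void captureFrame(int width, int height)
{
    std::vector<BYTE> pixels(3 * width * height);
    glReadPixels(0, 0, width, height, GL_RGB, GL_UNSIGNED_BYTE, pixels.data());
    {
        std::lock_guard<std::mutex> lock(g_mutex);
        g_frames.push(std::move(pixels));
    }
    g_cv.notify_one();   // plays the role of the semaphore
}

// Worker thread: blocks until a frame is queued, then saves it with FreeImage.
void saverLoop(int width, int height)
{
    for (;;)
    {
        std::unique_lock<std::mutex> lock(g_mutex);
        g_cv.wait(lock, [] { return !g_frames.empty(); });
        std::vector<BYTE> pixels = std::move(g_frames.front());
        g_frames.pop();
        lock.unlock();

        FIBITMAP* image = FreeImage_ConvertFromRawBits(pixels.data(), width, height,
                                                       3 * width, 24,
                                                       0xFF0000, 0x00FF00, 0x0000FF, false);
        FreeImage_Save(FIF_PNG, image, "frame.png", 0);   // placeholder file name
        FreeImage_Unload(image);
    }
}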
Otherwise, you can copy the current framebuffer into a texture or a pixel buffer object. In this case you must have two different threads, each with its own current OpenGL context (via MakeCurrent), sharing their object space with each other (as suggested by user771921). When the first rendering thread calls glReadPixels (or glCopyPixels), it notifies the second thread about the operation (using a semaphore, for example); the second thread then maps the pixel buffer object (or fetches the texture data).
This method has the advantage of letting the driver pipeline the first thread's read operation, but it effectively doubles the memory copies by introducing an additional staging buffer. Moreover, the glReadPixels operation is flushed when the second thread maps the buffer, which (most probably) happens just after the second thread is signaled.
I would suggest the first option, since it is much cleaner and simpler. The second one is overcomplicated, and I doubt you would get much advantage from it: the image saving operation is a lot slower than glReadPixels.
Even if the glReadPixels call is not pipelined, does your frame rate really drop? Don't optimize before you can profile.
The example you have linked uses GDI functions, which are not OpenGL related. I think that code would cause a repaint event on the form and then capture the window client area contents. It seems much slower compared with glReadPixels, though I haven't actually profiled it.
Well, using OpenGL in a multithreaded program is a bad idea, especially if you use OpenGL functions in a thread that has no context current.
Apart from that, there is nothing wrong with your code example.
I am developing an application that needs to read back the whole frame from the front buffer of an OpenGL application. I can hijack the application's OpenGL library and insert my code on SwapBuffers. At the moment I am successfully using a simple but excruciatingly slow glReadPixels call without PBOs.
Now I have read about using multiple PBOs to speed things up. While I think I've found enough resources to actually program that (it isn't that hard), I have some operational questions left. I would do something like this:
create a series (e.g. 3) of PBOs
use glReadPixels in my SwapBuffers override to read data from the front buffer into a PBO (should be fast and non-blocking, right?)
create a separate thread to call glMapBufferARB, once per PBO after a glReadPixels, because this call will block until the pixels are in client memory
process the data from step 3
Now my main concern is of course with steps 2 and 3. I read that glReadPixels into a PBO is non-blocking; will this be an issue if I issue new OpenGL commands very soon after it? Will those OpenGL commands block, or will they continue (my guess)? If they continue, I suppose only SwapBuffers can be a problem: will it stall, will glReadPixels from the front buffer be many times faster than swapping (roughly every 15-30 ms), or, worst case, will SwapBuffers be executed while glReadPixels is still reading data into the PBO? My current guess is that the logic goes something like this: copy FRONT_BUFFER -> generic place in VRAM, then copy VRAM -> RAM. But I have no idea which of those two is the real bottleneck, and moreover what the influence on the normal OpenGL command stream is.
Then there is step 3. Is it wise to do this asynchronously in a thread separate from the normal OpenGL logic? At the moment I think not: it seems you have to restore buffer operations to normal afterwards, and I can't install synchronization objects in the original code to temporarily block them. So I think my best option is to define a certain swap-buffer delay before reading the data out, e.g. calling glReadPixels on PBO i%3 and glMapBufferARB on PBO (i+2)%3 in the same thread, resulting in a delay of 2 frames. Also, when I call glMapBufferARB to use the data in client memory, will that be the bottleneck, or will the (asynchronous) glReadPixels be?
And finally, if you have better ideas for speeding up frame readback from the GPU in OpenGL, please tell me, because this is a painful bottleneck in my current system.
I hope my question is clear enough. I know the answer is probably somewhere on the internet, but I mostly found results that use PBOs to keep buffers in video memory and do the processing there. I really need to read the front buffer back to RAM, and I cannot find any clear explanation of performance in that case (which I need; I cannot rely on "it's faster", I need to explain why it's faster).
Thank you
Are you sure you want to read from the front buffer? You do not own this buffer, and depending on your OS it might be destroyed, e.g., by another window on top of it.
For your use case, people typically do
draw N
start PBO read N from back buffer
draw N+1
start PBO read N+1
sync PBO read N
process N
...
from a single thread.
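A sketch of that pattern with two PBOs; processFrame is a placeholder for whatever consumes the pixels, and the mapping is delayed by one frame so glReadPixels has time to complete asynchronously:

#include <GL/glew.h>   // assumes a loader is initialized for the PBO entry points

void processFrame(const void* data, int width, int height);   // hypothetical consumer

const int NUM_PBOS = 2;
GLuint    pbos[NUM_PBOS];
int       frame = 0;

void initPbos(int width, int height)
{
    glPixelStorei(GL_PACK_ALIGNMENT, 1);   // tightly packed rows for GL_RGB
    glGenBuffers(NUM_PBOS, pbos);
    for (int i = 0; i < NUM_PBOS; ++i)
    {
        glBindBuffer(GL_PIXEL_PACK_BUFFER, pbos[i]);
        glBufferData(GL_PIXEL_PACK_BUFFER, 3 * width * height, nullptr, GL_STREAM_READ);
    }
    glBindBuffer(GL_PIXEL_PACK_BUFFER, 0);
}

// Called once per frame, just before SwapBuffers. With a double-buffered context
// the default read buffer is the back buffer, as recommended above.
void readbackFrame(int width, int height)
{
    int writeIdx = frame % NUM_PBOS;         // PBO that receives this frame
    int readIdx  = (frame + 1) % NUM_PBOS;   // PBO filled one frame ago

    // Start the asynchronous transfer into the "write" PBO; returns immediately.
    glBindBuffer(GL_PIXEL_PACK_BUFFER, pbos[writeIdx]);
    glReadPixels(0, 0, width, height, GL_RGB, GL_UNSIGNED_BYTE, nullptr);

    // Map the PBO written last frame; by now the copy has usually finished,
    // so this should not stall (much). Skip the very first frame.
    if (frame > 0)
    {
        glBindBuffer(GL_PIXEL_PACK_BUFFER, pbos[readIdx]);
        void* data = glMapBuffer(GL_PIXEL_PACK_BUFFER, GL_READ_ONLY);
        if (data)
        {
            processFrame(data, width, height);
            glUnmapBuffer(GL_PIXEL_PACK_BUFFER);
        }
    }
    glBindBuffer(GL_PIXEL_PACK_BUFFER, 0);
    ++frame;
}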