I have a multithreaded application, in which I'm trying to render with different threads. First I tried to use the same Rendering Context between all threads, but I was getting NULL current contexts for other threads. I've read on the internet that one context can only be current at one thread at a time.
So I decided to try something different. I create a window, get the HDC from it, and create the first RC. After that, I share this HDC between threads, and in every new thread I create a new RC from the same HDC and make it current for that thread. Every time I do this, the RC returned is different (usually the previous value + 1). I assert that wglGetCurrentContext() returns an RC, and it appears to return the one that was just created. But after rendering, nothing is drawn, and if I call GetLastError() I get error 6 (invalid handle??).
So, does this mean that, even though every new call to wglCreateContext() gives me a new value, all these different values are the same "connection channel" for the OpenGL calls?
Does this mean that I will always have to invalidate the Rendering Context on the previous thread and activate it on the new one? Do I really have to do this synchronization all the time, or is there another way to work around this problem?
I have a multithreaded application, in which I'm trying to render with different threads.
DON'T!!!
You will gain nothing from trying to multithread your renderer. Basically you're running into one large race condition and the driver will just be busy synchronizing the threads to somehow make sense of it.
To gain best rendering performance keep all OpenGL operations to only one thread. All parallelization happens for free on the GPU.
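A minimal sketch of what "keep all OpenGL operations on one thread" can look like in practice, assuming the other threads only need to hand work to the renderer. RenderThread, submit and distinct_executing_threads are hypothetical names made up for this illustration, and the commands are plain closures rather than real gl* calls, so it runs without a context:

```cpp
#include <condition_variable>
#include <cstddef>
#include <functional>
#include <mutex>
#include <queue>
#include <set>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical helper: a single thread owns the "context" and runs every command.
class RenderThread {
public:
    RenderThread() : worker_([this] { run(); }) {}

    ~RenderThread() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            done_ = true;
        }
        cv_.notify_one();
        worker_.join();  // remaining commands are drained before the thread exits
    }

    // Callable from any thread; the command executes later on the render thread.
    void submit(std::function<void()> cmd) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            commands_.push(std::move(cmd));
        }
        cv_.notify_one();
    }

private:
    void run() {
        // A real renderer would make its GL context current here, exactly once.
        for (;;) {
            std::function<void()> cmd;
            {
                std::unique_lock<std::mutex> lock(mutex_);
                cv_.wait(lock, [this] { return done_ || !commands_.empty(); });
                if (commands_.empty()) return;  // done_ is set and the queue is empty
                cmd = std::move(commands_.front());
                commands_.pop();
            }
            cmd();  // executed outside the lock
        }
    }

    std::mutex mutex_;
    std::condition_variable cv_;
    std::queue<std::function<void()>> commands_;
    bool done_ = false;
    std::thread worker_;  // declared last so the members above exist before it starts
};

// Four producer threads submit commands; returns how many distinct threads ran them.
std::size_t distinct_executing_threads() {
    std::vector<std::thread::id> ids;  // only written by the render thread
    {
        RenderThread render;
        std::vector<std::thread> producers;
        for (int t = 0; t < 4; ++t) {
            producers.emplace_back([&render, &ids] {
                for (int i = 0; i < 100; ++i)
                    render.submit([&ids] { ids.push_back(std::this_thread::get_id()); });
            });
        }
        for (auto& p : producers) p.join();
    }  // ~RenderThread drains the queue and joins
    return std::set<std::thread::id>(ids.begin(), ids.end()).size();
}
```

Calling distinct_executing_threads() from main should report a single executing thread, which is the property the driver wants: the context never has to migrate between threads.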
I suggest reading the following wiki article from the OpenGL Consortium.
In simple words, it depends a lot on what you mean by multithreading with regard to OpenGL. If you have one thread doing the rendering and one (or more) doing other jobs (e.g. AI, physics, game logic), that is perfectly fine.
If you wish to have multiple threads messing with OpenGL, you cannot; or rather, you could, but it will give you more trouble than advantages.
Try to read the following FAQ on parallel OpenGL usage to have a better idea on this concept:
http://www.equalizergraphics.com/documentation/parallelOpenGLFAQ.html
In some cases it may make sense to use multiple rendering contexts in different threads. I have used such a design to load image data from filesystem and push this data into a texture.
OpenGL on Mac OS X is only single-thread safe by default. To enable its multithreaded OpenGL engine:
#include <OpenGL/OpenGL.h>

CGLError err = 0;
CGLContextObj ctx = CGLGetCurrentContext();

// Enable the multi-threading
err = CGLEnable(ctx, kCGLCEMPEngine);
if (err != kCGLNoError) {
    // Multi-threaded execution is possibly not available
    // Insert your code to take appropriate action
}
See:
Concurrency and OpenGL - Apple Developer
And:
Technical Q&A QA1612: OpenGL ES multithreading and ...
https://www.imaginationtech.com/blog/understanding-opengl-es-multi-thread-multi-window-rendering/
When shouldn’t I use multi-threaded rendering?
When you’re not CPU limited or load times are not a concern.
So, if you are CPU limited, move the CPU-bound jobs (decoding, texture upload, calculations and so on) to separate threads.
I heard of PBOs in OpenGL, they are pretty neat in texture loading / uploading. Using PBO involves using synchronisation fences, like they are used in this PBO tutorial. I've tried this technique and it's a great replacement of glTexImage and glGetTexImage in case of big images. Now I want to apply the same approach to other loading / uploading routines (and possibly some other).
If I understand OpenGL client—server model correctly, it works as follows:
(Italics are the things I am not sure of)
1. Client (my program) 'asks' OGL context to place new commands into OGL command queue. It is done by simply calling the gl* functions in the order that client wants them to execute;
2. If command flushing is enabled (it is by default), commands are immediately flushed to the GPU (e.g. via PCI). Otherwise they are placed in some buffer and flushed afterwards when needed (call to glFlush does this);
3. GPU (server) receives commands from OGL context and executes them in desired order and changes object / context state;
4. GPU sends back a reply that context (client) asked for;
5. Context replies to client with the data received from server.
Fences may be used to indicate whether GPU is done executing previous commands.
(2) implies that command execution is not necessarily done instantly. One can, for example, block command flushing by calling glWaitSync, place new commands into queue and then call glFlush. Commands will be flushed to GPU and executed asynchronously (independently from client). When GPU is busy executing given commands, CPU can focus on doing other stuff (e. g. sending info to a remote TCP server, or receiving input from user, or pretty much anything else). When CPU needs to perform something with OGL context again, it can wait until GPU is done with previous job by calling glClientWaitSync and place new commands in the queue, and the cycle repeats.
Based on all of the above, in case of that PBO tutorial, OGL context receives data from the program, buffers it and then sends it to GPU. Sending large amounts of data takes time, hence fence is used to know when sending is complete.
However, Khronos wiki says that only rendering functions are asynchronous. I understand it: rendering also takes time. But then why does the PBO example above work? And it's not like the image upload to GPU is instant, fence is not signaled instantly. Surely, the time it takes to finish uploading depends on how big is the image (I tested it with different image sizes).
Another example: I send a source code for a shader with glShaderSource and then do glCompileShader. Then I instantly check the GL_COMPILE_STATUS with glGetShaderiv. If the shader is not yet compiled when I do glGetShaderiv (it simply did not have enough time to compile), is it possible that GL_COMPILE_STATUS will state that shader is not compiled? Or is it guaranteed that the GL_COMPILE_STATUS is only returned after the compilation? Or is the compilation performed on CPU and does not need to communicate with GPU? (i. e. compilation does not place any commands in GPU queue). It has never really happened to me that shader compilation failed due to time limits, it failed only because of bad shader code.
The questions are:
Is my understanding of OGL client—server model correct or does it need some adjustments?
If not all functions can be executed asynchronously, what are those functions exactly?
Why does wiki say that only render actions may be performed asynchronously?
If command execution may indeed be 'deferred' (not really the right word for it...) with glWaitSync for example, how can I upload and compile shaders the same way images are uploaded in PBO example? Or how can I perform VBO / EBO upload the same way? Or UBO? Or TBO? Non-buffer objects? Is it just uploading and then waiting for fence to be signaled?
In case it matters, I use OGL with latest GLFW github release, latest GLAD (configuration) and C++ (MinGWx64 11.2.0).
UPD: I found this answer that touches the subject of my question. However, I must specify that my question is not about where and how OGL functions are executed, it's about how to control the flow of them, i. e. control when they are executed, to perform asynchronous work of GPU and CPU if it is even possible (it seems to be, if I understand wiki page right).
Hi I am a newbie learning Direct 3D 12.
So far, I understood that Direct 3D 12 is designed for multithreading and I'm trying to make my own simple multithread demo by following the tutorial by braynzarsoft.
https://www.braynzarsoft.net/viewtutorial/q16390-03-initializing-directx-12
Environment is windows, using C++, Visual Studio.
As far as I understand, multithreading in Direct 3D 12 seems, in a nutshell, populating command lists in multiple threads.
If it is right, it seems
1 Swap Chain
1 Command Queue
N Command Lists (N corresponds to number of threads)
N Command Allocators (N corresponds to number of threads)
1 Fence
is enough for a single window program.
I wonder
Q1. When do we need multiple command queues?
Q2. Why do we need multiple fences?
Q3. When do we submit commands multiple times?
Q4. Does the return value of GetCPUDescriptorHandleForHeapStart() change?
Q3 comes from here.
https://developer.nvidia.com/sites/default/files/akamai/gameworks/blog/GDC16/GDC16_gthomas_adunn_Practical_DX12.pdf
The purpose of Q4 is that I thought of calling the function once and storing the value for reuse; it didn't change when I debugged.
Rendering loop in my mind is (based on Game Loop pattern), for example,
1. A thread (e.g. the main thread) waits for the fence value.
2. Begin multiple threads to populate command lists.
3. Wait until all threads are done populating.
4. ExecuteCommandLists.
5. Swap chain present.
6. Return to 1 in the next loop.
If I am totally misunderstanding, please help.
Q1. When do we need multiple command queues?
Read this https://learn.microsoft.com/en-us/windows/win32/direct3d12/user-mode-heap-synchronization:
Asynchronous and low priority GPU work. This enables concurrent execution of low priority GPU work and atomic operations that enable one GPU thread to consume the results of another unsynchronized thread without blocking.
High priority compute work. With background compute it is possible to interrupt 3D rendering to do a small amount of high priority compute work. The results of this work can be obtained early for additional processing on the CPU.
Background compute work. A separate low priority queue for compute workloads allows an application to utilize spare GPU cycles to perform background computation without negative impact on the primary rendering (or other) tasks.
Streaming and uploading data. A separate copy queue replaces the D3D11 concepts of initial data and updating resources. Although the application is responsible for more details in the Direct3D 12 model, this responsibility comes with power. The application can control how much system memory is devoted to buffering upload data. The app can choose when and how (CPU vs GPU, blocking vs non-blocking) to synchronize, and can track progress and control the amount of queued work.
Increased parallelism. Applications can use deeper queues for background workloads (e.g. video decode) when they have separate queues for foreground work.
Q2. Why do we need multiple fences?
All GPU work is asynchronous, so you can think of fences as low-level tools for achieving the same result as futures/coroutines. You can check whether the work has been completed, wait for the work to complete, or set an event on completion. You need a fence whenever you need to guarantee that a resource holds the output of some work (when resource barriers are insufficient).
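To make the futures analogy concrete, here is a toy CPU-only model of a fence as a monotonically increasing counter. ToyFence and produce_and_consume are hypothetical names for this sketch, not D3D12 types; in real code the three operations correspond roughly to ID3D12Fence::GetCompletedValue, ID3D12CommandQueue::Signal and ID3D12Fence::SetEventOnCompletion followed by a wait on the event.

```cpp
#include <condition_variable>
#include <cstdint>
#include <mutex>
#include <thread>

// Toy stand-in for a GPU fence: a 64-bit value that only ever increases.
class ToyFence {
public:
    // Non-blocking check: which value has been reached so far?
    std::uint64_t completed_value() const {
        std::lock_guard<std::mutex> lock(mutex_);
        return value_;
    }

    // Called by the producer once all work submitted before this point is done.
    void signal(std::uint64_t v) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            if (v > value_) value_ = v;
        }
        cv_.notify_all();
    }

    // Blocks the caller until the fence has reached at least v.
    void wait_until(std::uint64_t v) const {
        std::unique_lock<std::mutex> lock(mutex_);
        cv_.wait(lock, [&] { return value_ >= v; });
    }

private:
    mutable std::mutex mutex_;
    mutable std::condition_variable cv_;
    std::uint64_t value_ = 0;
};

// A second thread plays the GPU: it writes a "resource", then signals the fence.
int produce_and_consume() {
    ToyFence fence;
    int resource = 0;
    std::thread gpu([&] {
        resource = 42;    // the queued work
        fence.signal(1);  // work up to fence value 1 is complete
    });
    fence.wait_until(1);    // the CPU must not read the resource before this returns
    int result = resource;  // safe: ordered after the signal
    gpu.join();
    return result;
}
```

With several independent streams of work (for example a copy queue and a direct queue), each stream typically gets its own counter, which is one reason more than one fence shows up in larger programs.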
Q4. Does the return value of GetCPUDescriptorHandleForHeapStart() change?
No, it doesn't.
store the value for reuse, it didn't change when I debugged.
The direct3d12 samples do this, you should know them intimately if you want to become proficient.
Rendering loop in my mind is (based on Game Loop pattern), for example,
That sounds okay, but I urge you to look at the direct3d12 samples and steal the patterns (and the code) they use there.
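The loop from the question can be modelled without any D3D12 types, which may help to see why one command list per thread is enough. In this sketch CommandList is just a vector of strings and record_frame is a hypothetical helper; the final concatenation stands in for a single ExecuteCommandLists call that receives the lists in a fixed order.

```cpp
#include <string>
#include <thread>
#include <vector>

using CommandList = std::vector<std::string>;  // toy stand-in, not a D3D12 type

// Records one frame: every worker fills only its own list, then the lists are
// "submitted" in index order, so the result does not depend on thread timing.
std::vector<std::string> record_frame(int thread_count, int draws_per_thread) {
    std::vector<CommandList> lists(thread_count);
    std::vector<std::thread> workers;
    for (int t = 0; t < thread_count; ++t) {
        workers.emplace_back([&lists, t, draws_per_thread] {
            for (int d = 0; d < draws_per_thread; ++d)
                lists[t].push_back("draw " + std::to_string(t) + ":" + std::to_string(d));
        });
    }
    for (auto& w : workers) w.join();  // step 3: wait until all lists are populated

    std::vector<std::string> submitted;  // step 4: one submission, fixed order
    for (const auto& list : lists)
        submitted.insert(submitted.end(), list.begin(), list.end());
    return submitted;
}
```

Because no two threads ever touch the same list, recording needs no locks at all; the only synchronization point is the join before submission.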
I can't seem to find a good answer to this:
I'm making a game, and I want the logic loop to be separate from the graphics loop. In other words I want the game to go through a loop every X milliseconds regardless of how many frames/second it is displaying.
Obviously they will both be sharing a lot of variables, so I can't have a thread/timer passing one variable back and forth... I'm basically just looking for a way to have a timer in the background that every X milliseconds sends out a flag to execute the logic loop, regardless of where the graphics loop is.
I'm open to any suggestions. It seems like the best option is to have 2 threads, but I'm not sure what the best way to communicate between them is, without constantly synchronizing large amounts of data.
You can very well do multithreading by having your "world view" exchanged every tick. So here is how it works:
1. Your current world view is pointed to by a single smart pointer and is read only, so no locking is necessary.
2. Your logic creates your (first) world view, publishes it and schedules the renderer.
3. Your renderer grabs a copy of the pointer to your world view and renders it (remember, read-only)
4. In the meantime, your logic creates a new, slightly different world view.
5. When it's done it exchanges the pointer to the current world view, publishing it as the current one.
6. Even if the renderer is still busy with the old world view there is no locking necessary.
7. Eventually the renderer finishes rendering the (old) world. It grabs the new world view and starts another run.
8. In the meantime, ... (goto step 4)
The only locking you need is for the time when you publish or grab the pointer to the world. As an alternative you can do atomic exchange but then you have to make sure you use smart pointers that can do that.
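The steps above can be sketched roughly as follows, using the mutex variant (the lock is held only while the pointer is copied or replaced). WorldView, WorldPublisher and run_logic_and_renderer are hypothetical names for this illustration, and the "rendering" is just a consistency check on the snapshot.

```cpp
#include <atomic>
#include <memory>
#include <mutex>
#include <thread>
#include <utility>
#include <vector>

struct WorldView {  // hypothetical immutable snapshot of the game state
    int tick = 0;
    std::vector<int> positions;
};

class WorldPublisher {
public:
    // Logic thread: make a finished snapshot the current one.
    void publish(std::shared_ptr<const WorldView> next) {
        std::lock_guard<std::mutex> lock(mutex_);
        current_ = std::move(next);
    }

    // Render thread: copy the pointer; the snapshot stays alive while it is in use.
    std::shared_ptr<const WorldView> grab() const {
        std::lock_guard<std::mutex> lock(mutex_);
        return current_;
    }

private:
    mutable std::mutex mutex_;
    std::shared_ptr<const WorldView> current_;
};

// Logic publishes `ticks` snapshots while a renderer thread keeps grabbing the
// latest one. Returns true if every grabbed snapshot was internally consistent.
bool run_logic_and_renderer(int ticks) {
    WorldPublisher publisher;
    publisher.publish(std::make_shared<WorldView>());  // first world view
    std::atomic<bool> logic_done{false};
    bool consistent = true;  // written by the renderer, read after join()

    std::thread renderer([&] {
        while (!logic_done.load()) {
            std::shared_ptr<const WorldView> view = publisher.grab();
            for (int p : view->positions)  // "render": read-only access, no lock held
                if (p != view->tick) consistent = false;
        }
    });

    for (int t = 1; t <= ticks; ++t) {
        auto next = std::make_shared<WorldView>();  // built privately by the logic
        next->tick = t;
        next->positions.assign(8, t);
        publisher.publish(std::move(next));  // exchange the pointer
    }
    logic_done.store(true);
    renderer.join();
    return consistent;
}
```

The old snapshot is freed automatically when the last shared_ptr to it goes away, so the renderer can keep using it for as long as a frame takes.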
Most toolkits have an event loop (built above some multiplexing syscall like poll(2) -or the obsolete select-...), e.g. GTK has g_application_run (which is above:) gtk_main which is built above Glib main event loop (which in fact does a poll or something similar). Likewise, Qt has QApplication and its exec methods.
Very often, you can register timers within the event loop. For GTK, use GTimers, g_timeout_add etc. For Qt learn about its timers.
Very often, you can also register some idle or background processing, i.e. one of your functions that the event loop starts after other events and timeouts have been processed. Your idle function is expected to run quickly (usually it does a small step of some computation in a few milliseconds, to keep the GUI responsive). For GTK, use g_idle_add etc. IIRC, in Qt you can use a timer with a 0 delay.
So you could code even a (conceptually) single threaded application, using timeouts and idle processing.
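If no toolkit event loop is available, a related single-threaded technique (a swapped-in alternative, not something the toolkits above provide) is the fixed-timestep accumulator: the loop measures how long each frame took and runs the logic as many times as whole ticks fit into the accumulated time. The sketch below uses hypothetical names and takes the frame durations as input instead of reading a clock, so the behaviour is easy to check.

```cpp
#include <vector>

// Returns how many logic ticks run for the given frame durations (milliseconds)
// when the logic is supposed to tick every tick_ms milliseconds.
int count_logic_ticks(const std::vector<int>& frame_ms, int tick_ms) {
    int accumulator = 0;  // elapsed time not yet consumed by the logic
    int ticks = 0;
    for (int dt : frame_ms) {
        accumulator += dt;  // time that passed since the previous frame
        while (accumulator >= tick_ms) {
            ++ticks;  // update_logic() would be called here
            accumulator -= tick_ms;
        }
        // render() would be called here, once per frame
    }
    return ticks;
}
```

For example, five frames of 16 ms with a 20 ms tick give four logic updates, and a single slow 50 ms frame gives two, so the logic rate stays the same regardless of the frame rate.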
Of course, you could use multi-threading: generally the main thread is running the event loop, and other threads can do other things. You have synchronization issues. On POSIX systems, a nice synchronization trick could be to use a pipe(7) to self: you set up a pipe before running the event loop, and your computation threads may write a few bytes on it, while the main event loop is "listening" on it (with GTK, using g_source_add_poll or async IO or GUnixInputStream etc.., with Qt, using QSocketNotifier etc....). Then, in the input handler running in the main loop for that pipe, you could access traditional global data with mutexes etc...
Conceptually, read about continuations. It is a relevant notion.
You could have a Draw and Update Method attached to all your game components. That way you can set it that while your game is running the update is called and the draw is ignored or any combination of the two. It also has the benefit of keeping logic and graphics completely separate.
Couldn't you just have a draw method for each object that needs to be drawn and make them globals? Then just run your rendering thread with a sleep delay in it. As long as your rendering thread doesn't write any information to the globals you should be fine. Look up SFML to see an example of it in action.
If you are running on a Unix system you could use usleep(); however, that is not available on Windows, so you might want to look here for alternatives.
I hope the title did not mislead you.
My problem is the following: Currently I try to speed up a raytracer and this is done with the help of the graphics card. It works fine despite the fact that it got slower by this. :)
This is caused by the fact, that I trace one ray on the whole geometry at once on the graphics card(my "tracing server") and then fetch the results, which is awfully slow, so I have to gather some rays and calc them and fetch the results together to speed this up.
The next problem is, that I am not allowed to rewrite the surrounding framework that should know nothing or least possible about this parallelization.
So here is my approach:
I thought about using several threads, where each one gets a ray and requests my "tracing server" to calc the intersections. Then the thread is stopped until enough rays were gathered to calc the intersections on the graphics card and get the results back efficiently. This means that each thread will wait until the results were fetched.
As you can see, I already have a plan, but I do not know the following:
Which threading framework should I use to be platform-independent?
Should I use a threadpool of fixed size or create them as needed?
Can any given thread library handle at least 1000 waiting threads(because that would be the number that I need to gather for my fetch to be efficient)?
But I also could imagine doing this with one thread that
dumps its load (a new ray) to the "tracing server" and fetches the next load until
there is enough to fetch the results.
Then the thread would take the results one by one, do the further calculations until all results are processed and then goes back to step one until all rays are done.
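A rough sketch of the gathering idea, assuming the surrounding framework can tolerate receiving a std::future instead of an immediate result. RayBatcher, Ray, Hit and batches_needed are hypothetical names, the rays are one-dimensional toys, and the trace callback stands in for the single expensive round trip to the graphics card.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <future>
#include <mutex>
#include <utility>
#include <vector>

struct Ray { float origin; float direction; };  // toy 1-D ray, for illustration only
struct Hit { float distance; };

// Collects rays until batch_size is reached, then traces them in one call.
class RayBatcher {
public:
    using TraceFn = std::function<std::vector<Hit>(const std::vector<Ray>&)>;

    RayBatcher(std::size_t batch_size, TraceFn trace)
        : batch_size_(batch_size), trace_(std::move(trace)) {}

    // Queues a ray; the future becomes ready once its batch has been traced.
    std::future<Hit> submit(const Ray& ray) {
        std::lock_guard<std::mutex> lock(mutex_);
        rays_.push_back(ray);
        promises_.emplace_back();
        std::future<Hit> result = promises_.back().get_future();
        if (rays_.size() >= batch_size_) flush_locked();
        return result;
    }

    // Traces whatever is still queued (needed for the last, partial batch).
    void flush() {
        std::lock_guard<std::mutex> lock(mutex_);
        flush_locked();
    }

private:
    // Caller must hold mutex_. The lock stays held during the trace call,
    // which keeps the sketch simple but serializes submitters.
    void flush_locked() {
        if (rays_.empty()) return;
        std::vector<Hit> hits = trace_(rays_);
        assert(hits.size() == rays_.size());
        for (std::size_t i = 0; i < hits.size(); ++i)
            promises_[i].set_value(hits[i]);
        rays_.clear();
        promises_.clear();
    }

    std::size_t batch_size_;
    TraceFn trace_;
    std::mutex mutex_;
    std::vector<Ray> rays_;
    std::vector<std::promise<Hit>> promises_;
};

// Traces n toy rays from one thread and returns how many batched calls were made.
int batches_needed(std::size_t n, std::size_t batch_size) {
    int calls = 0;
    RayBatcher batcher(batch_size, [&calls](const std::vector<Ray>& rays) {
        ++calls;  // one "round trip to the tracing server"
        std::vector<Hit> hits;
        for (const Ray& r : rays) hits.push_back(Hit{r.origin + r.direction});
        return hits;
    });
    std::vector<std::future<Hit>> pending;
    for (std::size_t i = 0; i < n; ++i)
        pending.push_back(batcher.submit(Ray{static_cast<float>(i), 1.0f}));
    batcher.flush();
    for (auto& f : pending) f.get();  // all results are ready at this point
    return calls;
}
```

Used from a single thread like this, the batcher avoids having a thousand blocked threads; the same class also works when several threads call submit() and then block on get(), as long as something eventually calls flush() for the leftover rays.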
Also if you have some better idea how to parallelize this, tell me about it.
Regards,
Nobody
PS
If you need this information: The two platforms I want to use are Linux and Windows.
Use either Threading Building Blocks or boost::thread.
http://www.boost.org/doc/libs/1_46_0/doc/html/thread.html
http://threadingbuildingblocks.org/
As for thread pool vs. on-demand threads: a thread pool is generally the better idea, as it avoids thread creation overhead.
The number of waiting threads you can have depends on the underlying system more than anything else:
Maximum number of threads per process in Linux?
I have a Snake game in-development (up at https://github.com/RobotGymnast/Gingerbread/tree/eventThreaded). Initially, everything (graphics, events, game logic update, physics) were called from a "main" thread. Then I started multithreading (using boost threads). It's been pretty straightforward, but I recently split the graphics display logic into a new thread, which allocated the screen object in its local stack space. Then I split my event-detection and event-handling logic into a new thread. Then my screen stopped appearing. Judging by my command-line output, everything still worked fine, just the screen stopped appearing. It turned out it was hanging on my SDL_SetVideoMode() call.
I fixed this by allocating my screen object in the "main" thread, and passing in a reference to the graphics thread. For some reason, allocating the screen object in a new thread from the event logic was creating problems.
Since this fix, the event-detection and event-handling no longer works. The event checks are still being made, e.g. SDL_PollEvent(), but they're not picking up any events at all (keyboard, mouse, etc.).
My suspicion is that SDL might do some behind-the-scenes thread syncing, but I've been using boost threads. Could this be a problem? SDL threads are rather restrictive, and I'd rather not switch.
Anybody had this issue before? Any recommendations?
I'm not sure about SDL, but on several windowing subsystems (I believe on both X and Win32), you cannot modify ANYTHING related to a graphics object or widget, except from the thread which initially created that graphics object/widget.
It doesn't look like (to my limited 10 second google search) SDL abstracts that bit from you -- you'd need to only modify graphics related objects from the thread that created them. To do otherwise is to invite strange behavior.
Graphics display logic should almost always be in the main thread, due to some technical considerations on various platforms.
Similarly, the event handling (at least at the low level) should be in the main thread, as events can be posted to specific threads rather than processes.
For the most part I would recommend not calling any SDL functions from anything other than the main thread apart from ones that don't operate on shared state.