I heard of PBOs in OpenGL; they are pretty neat for texture loading / uploading. Using a PBO involves synchronisation fences, as shown in this PBO tutorial. I've tried this technique and it's a great replacement for glTexImage and glGetTexImage in the case of big images. Now I want to apply the same approach to other loading / uploading routines (and possibly some others).
If I understand the OpenGL client-server model correctly, it works as follows:
(Italics are the things I am not sure of)
1. The client (my program) 'asks' the OGL context to place new commands into the OGL command queue. This is done by simply calling the gl* functions in the order in which the client wants them executed;
2. If command flushing is enabled (it is by default), commands are immediately flushed to the GPU (e.g. via PCI). Otherwise they are placed in some buffer and flushed later when needed (a call to glFlush does this);
3. The GPU (server) receives commands from the OGL context, executes them in the desired order and changes object / context state;
4. The GPU sends back the reply that the context (client) asked for;
5. The context replies to the client with the data received from the server.
Fences may be used to indicate whether GPU is done executing previous commands.
(2) implies that command execution is not necessarily done instantly. One can, for example, block command flushing by calling glWaitSync, place new commands into the queue and then call glFlush. The commands will be flushed to the GPU and executed asynchronously (independently from the client). While the GPU is busy executing the given commands, the CPU can focus on doing other stuff (e.g. sending info to a remote TCP server, or receiving input from the user, or pretty much anything else). When the CPU needs to do something with the OGL context again, it can wait until the GPU is done with the previous job by calling glClientWaitSync and place new commands in the queue, and the cycle repeats.
Based on all of the above, in the case of that PBO tutorial, the OGL context receives data from the program, buffers it and then sends it to the GPU. Sending large amounts of data takes time, hence a fence is used to know when sending is complete.
However, the Khronos wiki says that only rendering functions are asynchronous. I understand that: rendering also takes time. But then why does the PBO example above work? It's not as if the image upload to the GPU is instant; the fence is not signaled instantly. Surely, the time it takes to finish uploading depends on how big the image is (I tested it with different image sizes).
Another example: I send the source code for a shader with glShaderSource and then call glCompileShader. Then I immediately check GL_COMPILE_STATUS with glGetShaderiv. If the shader is not yet compiled when I call glGetShaderiv (it simply did not have enough time to compile), is it possible that GL_COMPILE_STATUS will state that the shader is not compiled? Or is it guaranteed that GL_COMPILE_STATUS is only returned after the compilation? Or is the compilation performed on the CPU, with no need to communicate with the GPU (i.e. compilation does not place any commands in the GPU queue)? It has never actually happened to me that shader compilation failed due to time limits; it failed only because of bad shader code.
The questions are:
Is my understanding of the OGL client-server model correct or does it need some adjustments?
If not all functions can be executed asynchronously, what are those functions exactly?
Why does the wiki say that only render actions may be performed asynchronously?
If command execution may indeed be 'deferred' (not really the right word for it...) with glWaitSync for example, how can I upload and compile shaders the same way images are uploaded in the PBO example? Or how can I perform a VBO / EBO upload the same way? Or UBO? Or TBO? Non-buffer objects? Is it just uploading and then waiting for the fence to be signaled?
In case it matters, I use OGL with the latest GLFW GitHub release, the latest GLAD (configuration) and C++ (MinGWx64 11.2.0).
UPD: I found this answer that touches on the subject of my question. However, I must specify that my question is not about where and how OGL functions are executed; it's about how to control their flow, i.e. control when they are executed, so that the GPU and CPU can work asynchronously, if that is even possible (it seems to be, if I understand the wiki page right).
Related
I have an application that issues about 100 draw calls per frame, each with an individual VBO. The VBOs are uploaded via glBufferData in a separate thread that has GL context resource sharing. The render thread tests buffer upload state via glClientWaitSync.
Now my question:
According to the documentation, glClientWaitSync with GL_SYNC_FLUSH_COMMANDS_BIT causes a flush at every call, right? This would mean that for every not-yet-finished glBufferData in the upload thread I would have dozens of flushes in the render thread, right? What impact on performance would it have if, in the worst case, I practically issue a flush before every draw call?
The behavior of GL_SYNC_FLUSH_COMMANDS_BIT has changed from its original specification.
In the original, the use of that bit was the equivalent of issuing a glFlush before doing the wait.
However, GL 4.5 changed the wording. Now, it is the equivalent of having performed a flush immediately after you submitted that sync object. That is, instead of doing a flush relative to the current stream, it works as if you had flushed after submitting the sync. Thus, repeated use does not mean repeatedly flushing.
You can get the equivalent behavior of course by manually issuing a flush after you submit the sync object, then not using GL_SYNC_FLUSH_COMMANDS_BIT when you wait for it.
I have some big performance issues here, so I would like to take some measurements on the GPU side.
After reading this thread I wrote this code around my draw functions, including the GL error check and the swapBuffers() call (auto swapping is indeed disabled):
gl4.glBeginQuery(GL4.GL_TIME_ELAPSED, queryId[0]);
{
    draw(gl4);
    checkGlError(gl4);
    glad.swapBuffers();
}
gl4.glEndQuery(GL4.GL_TIME_ELAPSED);
gl4.glGetQueryObjectiv(queryId[0], GL4.GL_QUERY_RESULT, frameGpuTime, 0);
And since OpenGL rendering commands are supposed to be asynchronous (the driver can buffer up to X commands before sending them all together in one batch), my question is essentially whether:
the code above is correct
I am right in assuming that at the beginning of a new frame all the previous GL commands (from the previous frame) have been sent, executed and completed on the GPU
I am right in assuming that when I get the query result with glGetQueryObjectiv and GL_QUERY_RESULT, all the GL commands so far have completed? That is, will OpenGL wait until the result becomes available (from the thread)?
Yes, when you query the timer it will block until the data is available, i.e. until the GPU is finished with everything that happened between beginning and ending the query. To avoid synchronising with the GPU, you can use GL_QUERY_RESULT_AVAILABLE to check whether the results are already available and only then read them. That might require less straightforward code to keep tabs on open queries and periodically check them, but it will have the least performance impact. Waiting for the value every time is a sure way to kill your performance.
Edit: To address your second question, swapping the buffer doesn't necessarily mean it will block until the operation succeeds. You may see that behaviour, but it's just as likely that it is just an implicit glFlush and the command buffer is not empty yet. That is also the more desirable behaviour, because ideally you want to start on your next frame right away and keep the CPU's command buffer filled. Check the implementation's documentation for more info though, as that is implementation defined.
Edit 2: Checking for errors might end up being an implicit synchronization by the way, so you will probably see the command buffer emptying when you wait for error checking in the command stream.
I am trying to find scenarios where sync objects can be used in OpenGL. My understanding is that a sync object, once put in the GL command stream (using glFenceSync()), will be signaled after all the GL commands are executed and realized.
If sync objects are synchronization primitives, why can't we MANUALLY signal them? Where exactly can this functionality help a GL programmer?
Is the following scenario a correct one?
Thread 1:
    Load model
    Draw()
    glFenceSync()
Thread 2:
    glWaitSync();
    ReadPixels
    Use data for subsequent operation.
Does this mean that I can't launch thread 2 unless glFenceSync() is called in Thread 1?
Fences are not so much meant to synchronize threads as to let you know when asynchronous operations are finished. For example, if you do a glReadPixels into a pixel buffer object (PBO), you might want to know that the read has completed before you even attempt to read from the PBO or map it to client address space.
If you do a glReadPixels with a PBO being the target, the call will return immediately, but the data transfer may indeed take some time. That's where fences come in handy.
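To make the idea concrete without needing a GL context, here is a minimal toy model in standard C++. It is not OpenGL and says nothing about how a real driver is built; ToyCommandQueue, insertFence, clientWait and readbackDemo are hypothetical names invented for this sketch. A worker thread stands in for the GPU, submit stands in for issuing a command such as a PBO readback, and the fence is just a counter that becomes "signaled" once everything submitted before it has finished.

```cpp
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <cstdint>
#include <deque>
#include <functional>
#include <mutex>
#include <thread>
#include <utility>
#include <vector>

// Toy stand-in for the GL server: runs submitted commands in order on its own thread.
class ToyCommandQueue {
public:
    ToyCommandQueue() : worker_([this] { run(); }) {}

    ~ToyCommandQueue() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            stopping_ = true;
        }
        wake_.notify_all();
        worker_.join();  // the worker drains any remaining commands first
    }

    // Like issuing a GL command: returns at once, the work happens later.
    void submit(std::function<void()> command) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            commands_.push_back(std::move(command));
            ++submitted_;
        }
        wake_.notify_all();
    }

    // Like glFenceSync: the returned ticket counts as signaled once every
    // command submitted before this call has finished.
    std::uint64_t insertFence() {
        std::lock_guard<std::mutex> lock(mutex_);
        return submitted_;
    }

    // Like glClientWaitSync with a timeout: true if the fence signaled in time.
    bool clientWait(std::uint64_t fence, std::chrono::milliseconds timeout) {
        std::unique_lock<std::mutex> lock(mutex_);
        return done_.wait_for(lock, timeout, [&] { return completed_ >= fence; });
    }

private:
    void run() {
        std::unique_lock<std::mutex> lock(mutex_);
        for (;;) {
            wake_.wait(lock, [&] { return stopping_ || !commands_.empty(); });
            if (commands_.empty()) return;  // only reached when stopping
            std::function<void()> command = std::move(commands_.front());
            commands_.pop_front();
            lock.unlock();
            command();  // the slow part runs without holding the lock
            lock.lock();
            ++completed_;
            done_.notify_all();
        }
    }

    std::mutex mutex_;
    std::condition_variable wake_, done_;
    std::deque<std::function<void()>> commands_;
    std::uint64_t submitted_ = 0, completed_ = 0;
    bool stopping_ = false;
    std::thread worker_;  // declared last so it starts after the other members exist
};

// Hypothetical usage: an asynchronous "readback" followed by a fence.
int readbackDemo() {
    std::vector<int> pixels;  // stands in for the PBO contents
    ToyCommandQueue queue;
    queue.submit([&pixels] {
        std::this_thread::sleep_for(std::chrono::milliseconds(20));  // slow transfer
        pixels.assign(4, 255);
    });
    const std::uint64_t fence = queue.insertFence();
    // The CPU is free to do other work here; the data is not safe to read yet.
    const bool signaled = queue.clientWait(fence, std::chrono::seconds(5));
    assert(signaled);
    return pixels.front();
}
```

The point of the sketch is the ordering: submit returns before the data exists, and only a successful wait on the fence makes it safe to touch pixels. In real code the same shape would be glReadPixels into a bound PBO, glFenceSync, other CPU work, glClientWaitSync, and only then mapping the buffer.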
I have a multithreaded application, in which I'm trying to render with different threads. First I tried to use the same rendering context in all threads, but I was getting NULL current contexts in the other threads. I've read on the internet that a context can only be current in one thread at a time.
So I decided to do something different. I create a window, get the HDC from it and create the first RC. After that, I share this HDC between threads, and in every new thread I create I obtain a new RC from the same HDC and make it current for that thread. Every time I do this, the RC returned is different (usually the previous value + 1). I use an assertion to check that wglGetCurrentContext() returns an RC, and it looks like it returns the one that was just created. But after rendering, I get no output, and if I call GetLastError() I obtain error 6 (invalid handle??)
So, does this mean that, even though every new call to wglCreateContext() gives me a new value, all these different values somehow refer to the same "connection channel" to the OpenGL calls?
Does this mean that I will always have to invalidate the previous rendering context on one thread and activate it on the new one? Do I really have to do this sync all the time, or is there any other way to work around this problem?
I have a multithreaded application, in which I'm trying to render with different threads.
DON'T!!!
You will gain nothing from trying to multithread your renderer. Basically you're running into one large race condition and the driver will just be busy synchronizing the threads to somehow make sense of it.
To gain best rendering performance keep all OpenGL operations to only one thread. All parallelization happens for free on the GPU.
I suggest reading the following wiki article from the OpenGL Consortium.
In simple words, it depends a lot on what you mean by multithreading with regard to OpenGL. If you have one thread doing the rendering and one (or more) doing other jobs (i.e. AI, physics, game logic etc.), that is perfectly fine.
If you wish to have multiple threads messing with OpenGL, you cannot, or rather, you could, but it will really give you more trouble than advantages.
Try to read the following FAQ on parallel OpenGL usage to have a better idea on this concept:
http://www.equalizergraphics.com/documentation/parallelOpenGLFAQ.html
In some cases it may make sense to use multiple rendering contexts in different threads. I have used such a design to load image data from filesystem and push this data into a texture.
OpenGL on Mac OS X is single-thread safe by default; to enable multi-threaded execution:
#include <OpenGL/OpenGL.h>

CGLError err = 0;
CGLContextObj ctx = CGLGetCurrentContext();

// Enable the multi-threading
err = CGLEnable(ctx, kCGLCEMPEngine);
if (err != kCGLNoError) {
    // Multi-threaded execution is possibly not available
    // Insert your code to take appropriate action
}
See:
Concurrency and OpenGL - Apple Developer
And:
Technical Q&A QA1612: OpenGL ES multithreading and ...
https://www.imaginationtech.com/blog/understanding-opengl-es-multi-thread-multi-window-rendering/
When shouldn’t I use multi-threaded rendering?
When you’re not CPU limited or load times are not a concern.
So, if you are CPU limited, use separate threads for the CPU-bound work, such as codecs, texture uploads, calculations...
I hope the title did not mislead you.
My problem is the following: I am currently trying to speed up a raytracer with the help of the graphics card. It works fine, apart from the fact that it got slower as a result. :)
The reason is that I trace one ray against the whole geometry at a time on the graphics card (my "tracing server") and then fetch the results, which is awfully slow. So I have to gather some rays, calculate them and fetch the results together to speed this up.
The next problem is that I am not allowed to rewrite the surrounding framework, which should know nothing, or as little as possible, about this parallelization.
So here is my approach:
I thought about using several threads, where each one gets a ray and requests my "tracing server" to calculate the intersections. The thread is then suspended until enough rays have been gathered to calculate the intersections on the graphics card and fetch the results efficiently. This means that each thread waits until the results have been fetched.
As you can see, I already have a plan, but I do not know the following:
Which threading framework should I use to be platform independent?
Should I use a threadpool of fixed size or create them as needed?
Can any given thread library handle at least 1000 waiting threads (because that is the number I need to gather for my fetch to be efficient)?
But I could also imagine doing this with one thread that dumps its load (a new ray) to the "tracing server" and fetches the next load until there is enough to fetch the results.
The thread would then take the results one by one, do the further calculations until all results are processed, and then go back to step one until all rays are done.
Also, if you have a better idea of how to parallelize this, tell me about it.
Regards,
Nobody
PS
If you need this information: The two platforms I want to use are Linux and Windows.
Use either Threading Building Blocks or boost::thread.
http://www.boost.org/doc/libs/1_46_0/doc/html/thread.html
http://threadingbuildingblocks.org/
As for threadpool vs. on-demand threads: a threadpool is generally the better idea, as it avoids thread creation overhead.
The number of waiting threads is gonna depend on the underlying system more than anything else:
Maximum number of threads per process in Linux?
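For what it's worth, the "gather rays, then fetch together" scheme from the question can be sketched with nothing but the C++11 standard library, which is also portable across Linux and Windows. This is only a rough sketch under assumptions: Ray, Hit, RayBatcher and batchDemo are hypothetical names, the real framework's types are unknown, and the traceBatch callback merely stands in for the actual round trip to the graphics card. It must return exactly one Hit per Ray, in the same order.

```cpp
#include <atomic>
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <functional>
#include <memory>
#include <mutex>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical types; the real framework's ray and hit types are unknown.
struct Ray { float origin[3]; float dir[3]; };
struct Hit { bool found; float distance; };

// Gathers rays from many threads and traces each full batch in one round trip.
class RayBatcher {
public:
    using TraceFn = std::function<std::vector<Hit>(const std::vector<Ray>&)>;

    RayBatcher(std::size_t batchSize, TraceFn traceBatch)
        : batchSize_(batchSize), traceBatch_(std::move(traceBatch)),
          current_(std::make_shared<Batch>()) {
        assert(batchSize_ > 0);
    }

    // Blocks the caller until its ray has been traced as part of a batch.
    Hit trace(const Ray& ray) {
        std::unique_lock<std::mutex> lock(mutex_);
        std::shared_ptr<Batch> batch = current_;
        const std::size_t index = batch->rays.size();
        batch->rays.push_back(ray);
        if (batch->rays.size() >= batchSize_) {
            dispatch(lock);  // this thread completed the batch, so it traces it
        } else {
            ready_.wait(lock, [&] { return batch->done; });
        }
        return batch->hits[index];
    }

    // Traces a partial batch, e.g. for the last few rays of an image.
    void flush() {
        std::unique_lock<std::mutex> lock(mutex_);
        if (!current_->rays.empty()) dispatch(lock);
    }

private:
    struct Batch {
        std::vector<Ray> rays;
        std::vector<Hit> hits;
        bool done = false;
    };

    // Precondition: lock is held. Later callers start filling a fresh batch.
    void dispatch(std::unique_lock<std::mutex>& lock) {
        std::shared_ptr<Batch> batch = current_;
        current_ = std::make_shared<Batch>();
        lock.unlock();
        std::vector<Hit> hits = traceBatch_(batch->rays);  // the expensive call
        lock.lock();
        batch->hits = std::move(hits);
        batch->done = true;
        ready_.notify_all();
    }

    std::size_t batchSize_;
    TraceFn traceBatch_;
    std::shared_ptr<Batch> current_;
    std::mutex mutex_;
    std::condition_variable ready_;
};

// Hypothetical usage: one thread per ray; returns how many batches were traced.
// rayCount must be a multiple of batchSize here, since nobody calls flush().
int batchDemo(int rayCount, std::size_t batchSize) {
    std::atomic<int> batches{0};
    RayBatcher batcher(batchSize, [&batches](const std::vector<Ray>& rays) {
        ++batches;  // stands in for one upload/trace/fetch round trip
        std::vector<Hit> hits;
        for (const Ray& ray : rays) hits.push_back({true, ray.origin[0]});
        return hits;
    });
    std::vector<float> distances(rayCount, -1.0f);
    std::vector<std::thread> threads;
    for (int i = 0; i < rayCount; ++i) {
        threads.emplace_back([&batcher, &distances, i] {
            const Ray ray{{static_cast<float>(i), 0.0f, 0.0f}, {0.0f, 0.0f, 1.0f}};
            distances[i] = batcher.trace(ray).distance;
        });
    }
    for (std::thread& t : threads) t.join();
    for (int i = 0; i < rayCount; ++i) assert(distances[i] == static_cast<float>(i));
    return batches.load();
}
```

The surrounding framework only ever sees a blocking trace(ray) call, so it needs to know nothing about the batching. The weak spot is the tail: if the total number of rays is not a multiple of the batch size, something outside the waiting threads has to call flush() (or a timeout has to be added), otherwise the last few callers wait forever.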