OpenGL, measuring rendering time on GPU

I have some big performance issues here, so I would like to take some measurements on the GPU side.
Based on this thread I wrote the following code around my draw functions, including the GL error check and the swapBuffers() call (auto-swapping is indeed disabled):
gl4.glBeginQuery(GL4.GL_TIME_ELAPSED, queryId[0]);
{
    draw(gl4);
    checkGlError(gl4);
    glad.swapBuffers();
}
gl4.glEndQuery(GL4.GL_TIME_ELAPSED);
gl4.glGetQueryObjectiv(queryId[0], GL4.GL_QUERY_RESULT, frameGpuTime, 0);
And since OpenGL rendering commands are supposed to be asynchronous (the driver can buffer up to X commands before sending them all together in one batch), my questions are essentially whether:
the code above is correct
I am right in assuming that at the beginning of a new frame all the GL commands from the previous frame have been sent, executed and completed on the GPU
I am right in assuming that when I get the query result with glGetQueryObjectiv and GL_QUERY_RESULT, all the GL commands so far have completed? That is, OpenGL will wait until the result becomes available (as stated in the thread)?

Yes, when you query the timer it will block until the data is available, i.e. until the GPU is finished with everything that happened between beginning and ending the query. To avoid synchronising with the GPU, you can use GL_QUERY_RESULT_AVAILABLE to check whether the results are already available and only read them then. That might require less straightforward code to keep tabs on open queries and periodically check them, but it will have the least performance impact. Waiting for the value every time is a sure way to kill your performance.
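A minimal sketch of that polling approach, using plain C-style GL calls rather than the JOGL bindings from the question (queryId mirrors your snippet):

GLint available = 0;
glGetQueryObjectiv(queryId[0], GL_QUERY_RESULT_AVAILABLE, &available);
if (available) {
    GLuint64 elapsedNs = 0;
    glGetQueryObjectui64v(queryId[0], GL_QUERY_RESULT, &elapsedNs); // result in nanoseconds
    // record elapsedNs; the query object can now be reused for a later frame
}
// if not available, keep the query "open" and check again next frame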
Edit: To address your second question, swapping the buffer doesn't necessarily mean it will block until the operation succeeds. You may see that behaviour, but it's just as likely that it is just an implicit glFlush and the command buffer is not empty yet. That is also the more desirable behaviour, because ideally you want to start with your next frame right away and keep the CPU's command buffer filled. Check the implementation's documentation for more info though, as that is implementation-defined.
Edit 2: Checking for errors might end up being an implicit synchronization, by the way, so you will probably see the command buffer emptying while you wait for the error check in the command stream.

Related

OpenGL client-server model, synchronization and 'deferred' loading in OpenGL 3.3

I heard of PBOs in OpenGL; they are pretty neat for texture loading / uploading. Using a PBO involves using synchronisation fences, like they are used in this PBO tutorial. I've tried this technique and it's a great replacement for glTexImage and glGetTexImage in the case of big images. Now I want to apply the same approach to other loading / uploading routines (and possibly some others).
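The PBO-based upload path I'm referring to looks roughly like this (a sketch; pbo, tex, pixels, w and h are placeholder names, not from the tutorial):

glBindBuffer(GL_PIXEL_UNPACK_BUFFER, pbo);
glBufferData(GL_PIXEL_UNPACK_BUFFER, w * h * 4, pixels, GL_STREAM_DRAW);
glBindTexture(GL_TEXTURE_2D, tex);
// with a PBO bound, the last argument is an offset into the PBO, not a client pointer
glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, w, h, GL_RGBA, GL_UNSIGNED_BYTE, (const void*)0);
GLsync fence = glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0); // signalled once the copy has finished
glBindBuffer(GL_PIXEL_UNPACK_BUFFER, 0);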
If I understand the OpenGL client-server model correctly, it works as follows:
(Italics are the things I am not sure of)
1. The client (my program) 'asks' the OGL context to place new commands into the OGL command queue. It does this by simply calling the gl* functions in the order the client wants them to execute;
2. If command flushing is enabled (it is by default), commands are immediately flushed to the GPU (e.g. via PCI). Otherwise they are placed in some buffer and flushed later when needed (a call to glFlush does this);
3. The GPU (server) receives commands from the OGL context, executes them in the desired order and changes object / context state;
4. The GPU sends back any reply that the context (client) asked for;
5. The context replies to the client with the data received from the server.
6. Fences may be used to indicate whether the GPU is done executing previous commands.
(2) implies that command execution is not necessarily done instantly. One can, for example, block command flushing by calling glWaitSync, place new commands into the queue and then call glFlush. Commands will be flushed to the GPU and executed asynchronously (independently from the client). While the GPU is busy executing the given commands, the CPU can focus on doing other stuff (e.g. sending info to a remote TCP server, receiving input from the user, or pretty much anything else). When the CPU needs to perform something with the OGL context again, it can wait until the GPU is done with the previous job by calling glClientWaitSync, place new commands in the queue, and the cycle repeats.
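To illustrate the cycle I have in mind (a sketch, assuming a GL 3.2+ context; doOtherCpuWork is just a placeholder for unrelated CPU work):

GLsync fence = glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0);
glFlush();                      // push the queued commands to the GPU

doOtherCpuWork();               // placeholder: TCP traffic, user input, anything else

// before touching the results, wait (timeout in nanoseconds) for the GPU to catch up
GLenum status = glClientWaitSync(fence, GL_SYNC_FLUSH_COMMANDS_BIT, 16000000);
if (status == GL_ALREADY_SIGNALED || status == GL_CONDITION_SATISFIED) {
    // safe to read back or reuse whatever the previous commands produced
}
glDeleteSync(fence);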
Based on all of the above, in the case of that PBO tutorial, the OGL context receives data from the program, buffers it and then sends it to the GPU. Sending large amounts of data takes time, hence a fence is used to know when the transfer is complete.
However, the Khronos wiki says that only rendering functions are asynchronous. I understand it: rendering also takes time. But then why does the PBO example above work? And it's not like the image upload to the GPU is instant; the fence is not signaled instantly. Surely, the time it takes to finish uploading depends on how big the image is (I tested it with different image sizes).
Another example: I send the source code for a shader with glShaderSource and then call glCompileShader. Then I instantly check GL_COMPILE_STATUS with glGetShaderiv. If the shader is not yet compiled when I call glGetShaderiv (it simply did not have enough time to compile), is it possible that GL_COMPILE_STATUS will state that the shader is not compiled? Or is it guaranteed that GL_COMPILE_STATUS is only returned after the compilation? Or is the compilation performed on the CPU and does not need to communicate with the GPU (i.e. compilation does not place any commands in the GPU queue)? It has never really happened to me that shader compilation failed due to time limits; it has only ever failed because of bad shader code.
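The pattern I mean looks roughly like this (source is a hypothetical null-terminated GLSL string):

GLuint shader = glCreateShader(GL_FRAGMENT_SHADER);
glShaderSource(shader, 1, &source, NULL);
glCompileShader(shader);

GLint status = GL_FALSE;
glGetShaderiv(shader, GL_COMPILE_STATUS, &status); // queried immediately after glCompileShader
if (status != GL_TRUE) {
    char log[1024];
    glGetShaderInfoLog(shader, sizeof(log), NULL, log);
    // log now holds the compiler error text
}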
The questions are:
Is my understanding of the OGL client-server model correct, or does it need some adjustments?
If not all functions can be executed asynchronously, which functions are those exactly?
Why does the wiki say that only rendering actions may be performed asynchronously?
If command execution may indeed be 'deferred' (not really the right word for it...) with glWaitSync for example, how can I upload and compile shaders the same way images are uploaded in the PBO example? Or how can I perform VBO / EBO upload the same way? Or UBO? Or TBO? Non-buffer objects? Is it just uploading and then waiting for the fence to be signaled?
In case it matters, I use OGL with latest GLFW github release, latest GLAD (configuration) and C++ (MinGWx64 11.2.0).
UPD: I found this answer that touches on the subject of my question. However, I must specify that my question is not about where and how OGL functions are executed; it's about how to control the flow of them, i.e. control when they are executed, so that the GPU and CPU can work asynchronously if that is even possible (it seems to be, if I understand the wiki page right).

When is glFlush called too often?

I have an application that issues about 100 draw calls per frame, each with an individual VBO. The VBOs are uploaded via glBufferData in a separate thread that has GL context resource sharing. The render thread tests the buffer upload state via glClientWaitSync.
Now my question:
According to the documentation, glClientWaitSync with GL_SYNC_FLUSH_COMMANDS_BIT causes a flush at every call, right? This would mean that for every not-yet-finished glBufferData in the upload thread I would get dozens of flushes in the render thread, right? What impact on performance would it have if, in the worst case, I thus practically issue a flush before every draw call?
The behavior of GL_SYNC_FLUSH_COMMANDS_BIT has changed from its original specification.
In the original, the use of that bit was the equivalent of issuing a glFlush before doing the wait.
However, GL 4.5 changed the wording. Now, it is the equivalent of having performed a flush immediately after you submitted that sync object. That is, instead of doing a flush relative to the current stream, it works as if you had flushed after submitting the sync. Thus, repeated use does not mean repeatedly flushing.
You can get the equivalent behavior of course by manually issuing a flush after you submit the sync object, then not using GL_SYNC_FLUSH_COMMANDS_BIT when you wait for it.
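A rough sketch of that pattern, applied to the VBO-upload scenario from the question (vbo, size and data are placeholder names): the upload thread flushes once per buffer, and the render thread polls without the flush bit.

// upload thread (shared context):
glBindBuffer(GL_ARRAY_BUFFER, vbo);
glBufferData(GL_ARRAY_BUFFER, size, data, GL_STATIC_DRAW);
GLsync uploadFence = glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0);
glFlush(); // one explicit flush per upload instead of one per wait

// render thread, before the draw call that needs this VBO:
GLenum s = glClientWaitSync(uploadFence, 0 /* no flush bit */, 0 /* just poll */);
bool ready = (s == GL_ALREADY_SIGNALED || s == GL_CONDITION_SATISFIED);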

IDirectXVideoDecoder performance

I am trying to understand some of the nuances of IDirectXVideoDecoder. CAVEAT: The conclusions stated below are not based on the DirectX docs or any other official source, but are my own observations and understandings. That said...
In normal use, IDirectXVideoDecoder is easily fast enough to process frames at any sensible frame rate. However, if you aren't rendering the frames based on timecodes and instead are going "as fast as possible," then you eventually run into a bottleneck in the decoder and IDirectXVideoDecoder::BeginFrame starts returning E_PENDING.
Apparently at any given time, a system can only have X frames active in the decoder. Attempting to submit X + 1 gives you this error until one of the previous frames completes. On my (somewhat older) box, X == 4. On my newer box, X == 8.
Which brings us to my first question:
Q1: How do I find out how many simultaneous decoding operations a system supports? What property/attribute describes this?
Then there's the question of what to do when you hit this error. I can think of 3 different approaches, but they all have drawbacks:
1) Just do a loop waiting for a decoder to free up:
do {
    hr = m_pVideoDecoder->BeginFrame(pSurface9Video[y], NULL);
} while (hr == E_PENDING);
On the plus side, this approach gives the fastest throughput. On the minus side, this causes a massive amount of CPU time to get burned waiting for a decoder to free up (>93% of my execution time gets spent here).
2) Do a loop, and add a Sleep:
do {
    hr = m_pVideoDecoder->BeginFrame(pSurface9Video[y], NULL);
    if (hr == E_PENDING)
        Sleep(1);
} while (hr == E_PENDING);
On the plus side, this significantly drops the CPU utilization. But on the minus side, it ends up slowing down the total throughput.
In trying to figure out why it's slowing things down, I made a few observations:
Normal time to process a frame on my system is ~4 milliseconds.
Sleep(1) can Sleep for as much as 8 milliseconds, even when there are CPUs available to run on.
Frames sent to the decoders aren't being added to a queue and decoded one at a time. It actually performs X decodings at the same time.
The result of all this is that if you try to Sleep, one of the decoders frequently ends up sitting idle.
3) Before submitting the next frame for decoding, wait for one of the previous frames to complete:
// LockRect doesn't return until the surface is ready.
D3DLOCKED_RECT lr;
// I don't think this matters. It may always return the whole frame.
RECT r = {0, 0, 2, 2};
hr = pSurface9Video[old]->LockRect(&lr, &r, D3DLOCK_READONLY);
if (SUCCEEDED(hr))
    pSurface9Video[old]->UnlockRect();
This also drops the CPU usage, but it also has a throughput penalty. Maybe due to the 'surface' being in use longer than the 'decoder', but more likely because of the amount of time it takes to (pointlessly) transfer the frame back to memory.
Which brings us to the second question:
Q2: Is there some way here to maximize throughput without pointlessly pounding on the CPU?
Final thoughts:
It appears that LockRect must be doing a WaitForSingleObject. If I had access to that handle, waiting on it (without also copying the frame back) seems like it would be the best solution. But I can't figure out where to get it. I've tried GetDC, GetPrivateData, even looking at the debug data members for IDirect3DSurface9. I'm not finding it.
IDirectXVideoDecoder::EndFrame outputs a handle in a parameter named pHandleComplete. This sounds like exactly what I need. Unfortunately it is marked as "reserved" and doesn't seem to work. Unless there is a trick?
I'm pretty new to DirectX, so maybe I've got this all wrong?
Update 1:
Re Q1: Turns out both my machines only support 4 decoders (oops). This will make it harder to determine which property I'm looking for. While very few properties (none actually) return 8 on one machine and 4 on the other, there are several that return 4.
Re Q2: Since the (4) decoders are (presumably) shared between apps, the idea of finding out if the decoding is complete by (somehow) querying to see if the decoder is idle is a non-starter.
The call to create surfaces doesn't create handles (handle count stays the same across the call). So the idea of waiting on the "surface's handle" doesn't seem like it's going to pan out either.
The only idea I have left is to see if the surface is available by making some other call (besides LockRect) using it. So far I've tried calling StretchRect and ColorFill on a surface that the decoder is "still using," but they complete without error instead of blocking like LockRect.
There may not be a better answer here. So far it appears that for best performance, I should use #1. If CPU utilization is an issue, #2 is better than #1. If I'm going to be reading the surfaces back to memory anyway, then #3 makes sense; otherwise, stick with #1 or #2.

How does the OpenCL command queue work, and what can I ask of it

I'm working on an algorithm that does pretty much the same operation a bunch of times. Since the operation consists of some linear algebra (BLAS), I thought I would try using the GPU for this.
I've written my kernel and started pushing kernels onto the command queue. Since I don't want to wait after each call, I figured I would try daisy-chaining my calls with events and just start pushing these onto the queue:
call kernel1(return event1)
call kernel2(wait for event 1, return event 2)
...
call kernel1000000(wait for event 999999)
Now my question is: does all of this get pushed to the graphics chip, or does the driver store the queue? Is there a bound on the number of events I can use, or on the length of the command queue? I've looked around but have not been able to find this.
I'm using atMonitor to check the utilization of my GPU and it's pretty hard to push it above 20%. Could this simply be because I'm not able to push the calls out there fast enough? My data is already stored on the GPU and all I'm passing out there is the actual calls.
First, you shouldn't wait for an event from a previous kernel unless the next kernel has data dependencies on that previous kernel. Device utilization (normally) depends on there always being something ready-to-go in the queue. Only wait for an event when you need to wait for an event.
"does all of this get pushed to the graphic chip of does the driver store the queue?"
That's implementation-defined. Remember, OpenCL works on more than just GPUs! In terms of the CUDA-style device/host dichotomy, you should probably consider command queue operations (for most implementations) as living on the "host."
Try queuing up multiple kernel calls without waits in between them. Also, make sure you are using an optimal work group size. If you do both of those, you should be able to max out your device.
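For example, a rough sketch of back-to-back enqueues with no event dependencies (assuming queue, kernel and globalSize are already set up; error handling kept minimal):

cl_int err = CL_SUCCESS;
for (int i = 0; i < 1000000 && err == CL_SUCCESS; ++i) {
    // no wait list and no returned event: the driver can keep the device busy
    err = clEnqueueNDRangeKernel(queue, kernel, 1, NULL,
                                 &globalSize, NULL, 0, NULL, NULL);
}
clFinish(queue); // one blocking point at the very end instead of one per launch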
Unfortunately I don't know the answers to all of your questions, and you've got me wondering about the same things now too, but I can say that I doubt the OpenCL queue will ever become full, since your GPU should finish executing the last queued command before at least 20 more commands are submitted. This is only true, though, if your GPU has a "watchdog", because that would stop ridiculously long kernels (I think 5 seconds or more) from executing.

Platform-independent parallelization without changing the framework?

I hope the title did not mislead you.
My problem is the following: currently I am trying to speed up a raytracer with the help of the graphics card. It works fine, apart from the fact that it got slower this way. :)
This is caused by the fact that I trace one ray at a time over the whole geometry on the graphics card (my "tracing server") and then fetch the result, which is awfully slow. So I have to gather some rays, calculate them and fetch the results together to speed this up.
The next problem is that I am not allowed to rewrite the surrounding framework, which should know nothing, or as little as possible, about this parallelization.
So here is my approach:
I thought about using several threads, where each one gets a ray and asks my "tracing server" to calculate the intersections. Then the thread is stopped until enough rays have been gathered to calculate the intersections on the graphics card and get the results back efficiently. This means that each thread will wait until its results have been fetched.
You see I already have some plan but following I do not know:
Which threading framework should I use to be platform-independent?
Should I use a thread pool of fixed size or create threads as needed?
Can any given thread library handle at least 1000 waiting threads (because that would be the number I need to gather for my fetch to be efficient)?
But I could also imagine doing this with one thread that dumps its load (a new ray) to the "tracing server" and fetches the next load until there is enough to fetch the results. Then the thread would take the results one by one, do the further calculations until all results are processed, and then go back to step one until all rays are done.
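A rough sketch of that single-thread batching loop (Ray, Hit, rays, traceBatchOnGpu and processResult are hypothetical placeholders for my own types and calls):

#include <vector>

std::vector<Ray> batch;
batch.reserve(1000); // roughly the batch size needed for an efficient fetch

for (std::size_t i = 0; i < rays.size(); ++i) {
    batch.push_back(rays[i]);
    if (batch.size() == 1000 || i + 1 == rays.size()) {
        std::vector<Hit> hits = traceBatchOnGpu(batch); // one round trip to the "tracing server"
        for (const Hit& h : hits)
            processResult(h);                           // the further per-ray calculations
        batch.clear();
    }
}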
Also, if you have a better idea of how to parallelize this, tell me about it.
Regards,
Nobody
PS
If you need this information: The two platforms I want to use are Linux and Windows.
Use either Threading Building Blocks or boost::thread.
http://www.boost.org/doc/libs/1_46_0/doc/html/thread.html
http://threadingbuildingblocks.org/
As far as thread pool vs. on-demand threads goes, a thread pool is generally the better idea as it avoids thread-creation overhead.
The number of waiting threads is going to depend on the underlying system more than anything else:
Maximum number of threads per process in Linux?