Asynchronous rendering model in Vulkan vs. OpenGL

Recently I've been reading https://github.com/ARM-software/vulkan_best_practice_for_mobile_developers/blob/master/samples/vulkan_basics.md, which says:
OpenGL ES uses a synchronous rendering model, which means that an API call must behave as if all earlier API calls have already been processed. In reality no modern GPU works this way, rendering workloads are processed asynchronously and the synchronous model is an elaborate illusion maintained by the device driver. To maintain this illusion the driver must track which resources are read or written by each rendering operation in the queue, ensure that workloads run in a legal order to avoid rendering corruption, and ensure that API calls which need a data resource block and wait until that resource is safely available.
Vulkan uses an asynchronous rendering model, reflecting how the modern GPUs work. Applications queue rendering commands into a queue, use explicit scheduling dependencies to control workload execution order, and use explicit synchronization primitives to align dependent CPU and GPU processing.
The impact of these changes is to significantly reduce the CPU overhead of the graphics drivers, at the expense of requiring the application to handle dependency management and synchronization.
Could someone help explain why an asynchronous rendering model reduces CPU overhead? After all, in Vulkan you still have to track state yourself.

Could someone help explain why an asynchronous rendering model reduces CPU overhead?
First of all, let's get back to the original statement you are referring to, emphasis mine:
The impact of these changes is to significantly reduce the CPU overhead of the graphics drivers, [...]
So the claim here is that the driver itself will consume less CPU time, and that is easy to see: the driver can forward your requests more directly, "as-is".
However, one overall goal of a low-level rendering API like Vulkan is also a potentially reduced CPU overhead in general, not only in the driver.
Consider the following example: You have a draw call which renders to a texture. And then you have another draw call which samples from this texture.
To get the implicit synchronization right, the driver has to track the usage of this texture, both as render target and as source for texture sampling operations.
It doesn't know in advance whether the next draw call will need any resources that are still to be written by previous draw calls. It has to track every possible such conflict, whether or not it can actually occur in your application, and it must be extremely conservative in its decisions. You might have a texture bound to a framebuffer for a draw call while knowing that, with the actual uniform values you set for this shader, the texture is never modified. But the GPU driver can't know that. If it can't rule out - with absolute certainty - that a resource is modified, it has to assume it is.
However, your application most likely knows such details. If you have several render passes, and the second pass depends on the texture rendered to in the first, you can (and must) add the proper synchronization primitives - but the GPU driver doesn't need to care why any synchronization is necessary at all, and it doesn't need to track any resource usage to find out - it can just do as it is told. Your application also doesn't need to track its own resource usage in many cases; it is simply inherent in the usage as you coded it that synchronization may be required at some point. There may still be cases where you need to track your own resource usage, especially if you write some intermediate layer such as a higher-level graphics library, where you know less and less about the structure of the rendering - then you get into a position similar to that of a GL driver (unless you want to forward all the burden of synchronization to the users of your library, as Vulkan does).
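To make this concrete, here is a minimal sketch of the kind of dependency the application states explicitly in Vulkan between a pass that writes a texture and a later pass that samples it. The command buffer and image handles are assumed to exist already; this illustrates the idea rather than a complete render loop.

```cpp
#include <vulkan/vulkan.h>

// Tell the GPU: the color-attachment write to 'renderTarget' must complete
// (and become visible) before any fragment shader reads from it.
void addRenderToSampleDependency(VkCommandBuffer cmd, VkImage renderTarget)
{
    VkImageMemoryBarrier barrier{};
    barrier.sType               = VK_STRUCTURE_TYPE_IMAGE_MEMORY_BARRIER;
    barrier.srcAccessMask       = VK_ACCESS_COLOR_ATTACHMENT_WRITE_BIT;
    barrier.dstAccessMask       = VK_ACCESS_SHADER_READ_BIT;
    barrier.oldLayout           = VK_IMAGE_LAYOUT_COLOR_ATTACHMENT_OPTIMAL;
    barrier.newLayout           = VK_IMAGE_LAYOUT_SHADER_READ_ONLY_OPTIMAL;
    barrier.srcQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED;
    barrier.dstQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED;
    barrier.image               = renderTarget;
    barrier.subresourceRange    = { VK_IMAGE_ASPECT_COLOR_BIT, 0, 1, 0, 1 };

    vkCmdPipelineBarrier(cmd,
                         VK_PIPELINE_STAGE_COLOR_ATTACHMENT_OUTPUT_BIT, // producer stage
                         VK_PIPELINE_STAGE_FRAGMENT_SHADER_BIT,         // consumer stage
                         0, 0, nullptr, 0, nullptr, 1, &barrier);
}
```

The driver never has to discover this dependency by tracking the image: the application declares it, because it already knows its own frame structure. An OpenGL driver has to deduce the equivalent barrier itself, for every draw call, every frame.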

Related

What can Vulkan do specifically that OpenGL 4.6+ cannot?

I'm looking into whether it's better for me to stay with OpenGL or consider a Vulkan migration for intensive bottlenecked rendering.
However I don't want to make the jump without being informed about it. I was looking up what benefits Vulkan offers me, but with a lot of googling I wasn't able to come across exactly what gives performance boosts. People will throw around terms like "OpenGL is slow, Vulkan is way faster!" or "Low power consumption!" and say nothing more on the subject.
Because of this, it makes it difficult for me to evaluate whether or not the problems I face are something Vulkan can help me with, or if my problems are due to volume and computation (and Vulkan would in such a case not help me much).
I'm assuming Vulkan does not magically make things in the pipeline faster (as in shading in triangles is going to be approximately the same between OpenGL and Vulkan for the same buffers and uniforms and shader). I'm assuming all the things with OpenGL that cause grief (ex: framebuffer and shader program changes) are going to be equally as painful in either API.
There are a few things off the top of my head that I think Vulkan offers, based on reading through countless things online (and I'm guessing this is certainly not all of the advantages, nor am I sure these are even true):
Texture rendering without [much? any?] binding (or rather a better version of "bindless textures"). I noticed a significant performance boost when I switched to bindless textures, but this might not even be worth mentioning if bindless textures already effectively do this, so I'm not sure whether Vulkan adds anything here
Reduced CPU/GPU communication by composing some kind of command list that you can execute on the GPU without needing to send much data
Being able to interface with the GPU in a multithreaded way that OpenGL somehow can't
However I don't know exactly what cases people run into in the real world that demand these, and how OpenGL limits these. All the examples so far online say "you can run faster!" but I haven't seen how people have been using it to run faster.
Where can I find information that answers this question? Or do you know some tangible examples that would answer this for me? Maybe a better question would be where are the typical pain points that people have with OpenGL (or D3D) that caused Vulkan to become a thing in the first place?
An example of answer that would not be satisfying would be a response like
You can multithread and submit things to Vulkan quicker.
but a response that would be more satisfying would be something like
In Vulkan you can multithread your submissions to the GPU. In OpenGL you can't do this because you rely on the implementation to do the appropriate locking and placing fences on your behalf which may end up creating a bottleneck. A quick example of this would be [short example here of a case where OpenGL doesn't cut it for situation X] and in Vulkan it is solved by [action Y].
The last paragraph above may not be accurate whatsoever, but I was trying to give an example of what I'd be looking for without trying to write something egregiously wrong.
Vulkan really has four main advantages in terms of run-time behavior:
Lower CPU load
Predictable CPU load
Better memory interfaces
Predictable memory load
Lower GPU load, specifically, isn't one of the advantages; the same content using the same GPU features will have very similar GPU performance with both APIs.
In my opinion it also has many advantages in terms of developer usability - the programming model is a lot cleaner than OpenGL's - but there is a steeper learning curve to get to the "something working correctly" stage.
Let's look at each of the advantages in more detail:
Lower CPU load
The lower CPU load in Vulkan comes from multiple areas, but the main ones are:
The API encourages up-front construction of descriptors, so you're not rebuilding state on a draw-by-draw basis.
The API is asynchronous and can therefore move some responsibilities, such as tracking resource dependencies, to the application. A naive application implementation here will be just as slow as OpenGL, but the application has more scope to apply high level algorithmic optimizations because it can know how resources are used and how they relate to the scene structure.
The API moves error checking out to layer drivers, so the release drivers are as lean as possible.
The API encourages multithreading, which is always a great win (especially on mobile, where e.g. four threads running slowly will consume a lot less energy than one thread running fast); see the sketch after this list.
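As a sketch of that multithreading point: each worker thread records into its own command buffer from its own VkCommandPool (pools must not be shared between threads without external synchronization), and a single submit hands everything to the queue. The recordDrawsForChunk helper and the handles passed in are assumptions made for the example; cleanup and error handling are omitted.

```cpp
#include <vulkan/vulkan.h>
#include <thread>
#include <vector>

// Assumed helper that records the draws for one slice of the scene.
void recordDrawsForChunk(VkCommandBuffer cmd, int chunk);

void recordFrameInParallel(VkDevice device, VkQueue queue, uint32_t queueFamily, int threadCount)
{
    std::vector<VkCommandPool>   pools(threadCount);
    std::vector<VkCommandBuffer> cmds(threadCount);
    std::vector<std::thread>     workers;

    for (int i = 0; i < threadCount; ++i) {
        VkCommandPoolCreateInfo poolInfo{VK_STRUCTURE_TYPE_COMMAND_POOL_CREATE_INFO};
        poolInfo.queueFamilyIndex = queueFamily;
        vkCreateCommandPool(device, &poolInfo, nullptr, &pools[i]);

        VkCommandBufferAllocateInfo allocInfo{VK_STRUCTURE_TYPE_COMMAND_BUFFER_ALLOCATE_INFO};
        allocInfo.commandPool        = pools[i];
        allocInfo.level              = VK_COMMAND_BUFFER_LEVEL_PRIMARY;
        allocInfo.commandBufferCount = 1;
        vkAllocateCommandBuffers(device, &allocInfo, &cmds[i]);

        // Each worker records into its own command buffer; no driver-side lock needed.
        workers.emplace_back([&, i] {
            VkCommandBufferBeginInfo begin{VK_STRUCTURE_TYPE_COMMAND_BUFFER_BEGIN_INFO};
            vkBeginCommandBuffer(cmds[i], &begin);
            recordDrawsForChunk(cmds[i], i);
            vkEndCommandBuffer(cmds[i]);
        });
    }
    for (auto& t : workers) t.join();

    // One submit for all command buffers; only queue submission itself is serialized.
    VkSubmitInfo submit{VK_STRUCTURE_TYPE_SUBMIT_INFO};
    submit.commandBufferCount = static_cast<uint32_t>(cmds.size());
    submit.pCommandBuffers    = cmds.data();
    vkQueueSubmit(queue, 1, &submit, VK_NULL_HANDLE);
}
```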
Predictable CPU load
OpenGL drivers do various kinds of "magic", either for performance (specializing shaders based on state only known late at draw time), or to maintain the synchronous rendering illusion (creating resource ghosts on the fly to avoid stalling the pipeline when the application modifies a resource which is still referenced by a pending command).
The Vulkan design philosophy is "no magic". You get what you ask for, when you ask for it. Hopefully this means no random slowdowns because the driver is doing something you didn't expect in the background. The downside is that the application takes on the responsibility for doing the right thing ;)
Better memory interfaces
Many parts of the OpenGL design are based on distinct CPU and GPU memory pools, which require a programming model that gives the driver enough information to keep them in sync. Most modern hardware can do better with hardware-backed coherency protocols, so Vulkan enables a model where you can just map a buffer once, then modify it ad hoc and be guaranteed that the "other process" will see the changes. No more "map" / "unmap" / "invalidate" overhead (provided the platform supports coherent buffers, of course; it's still not universal).
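A minimal sketch of the "map once, write whenever" pattern, assuming the memory was allocated from a type with both HOST_VISIBLE and HOST_COHERENT bits; the uniform struct is made up for the example:

```cpp
#include <vulkan/vulkan.h>
#include <cstring>

struct FrameUniforms { float mvp[16]; };   // illustrative payload

// Map once at startup (memory must come from a HOST_VISIBLE | HOST_COHERENT type).
void* mapPersistently(VkDevice device, VkDeviceMemory memory)
{
    void* ptr = nullptr;
    vkMapMemory(device, memory, 0, VK_WHOLE_SIZE, 0, &ptr);
    return ptr;   // keep this pointer for the lifetime of the allocation
}

// Every frame: just write. No unmap/flush/invalidate dance; the GPU sees the data
// as long as the submit that reads it is properly synchronized.
void updateUniforms(void* mapped, const FrameUniforms& u)
{
    std::memcpy(mapped, &u, sizeof(u));
}
```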
Secondly, Vulkan separates the concept of the memory allocation from how that memory is used (the memory view). This allows the same memory to be recycled for different things in the frame pipeline, reducing the amount of intermediate storage you need to allocate.
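And a sketch of that allocation/usage separation: two transient images that are never live at the same point in the frame can share one allocation. The image names and the memory-type selection are assumptions; for well-defined aliasing the images also need VK_IMAGE_CREATE_ALIAS_BIT (or equivalent care with layouts), plus barriers between their uses.

```cpp
#include <vulkan/vulkan.h>
#include <algorithm>

// 'shadowMap' and 'bloomTarget' are assumed to be used in disjoint parts of the frame.
void aliasTransientImages(VkDevice device, VkImage shadowMap, VkImage bloomTarget,
                          uint32_t memoryTypeIndex /* must be compatible with both images */)
{
    VkMemoryRequirements a{}, b{};
    vkGetImageMemoryRequirements(device, shadowMap, &a);
    vkGetImageMemoryRequirements(device, bloomTarget, &b);

    VkMemoryAllocateInfo alloc{VK_STRUCTURE_TYPE_MEMORY_ALLOCATE_INFO};
    alloc.allocationSize  = std::max(a.size, b.size);
    alloc.memoryTypeIndex = memoryTypeIndex;

    VkDeviceMemory memory = VK_NULL_HANDLE;
    vkAllocateMemory(device, &alloc, nullptr, &memory);

    // Both images share the allocation; only one of them holds valid contents at a time.
    vkBindImageMemory(device, shadowMap, memory, 0);
    vkBindImageMemory(device, bloomTarget, memory, 0);
}
```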
Predictable memory load
Related to the "no magic" comment for CPU performance, Vulkan won't generate random resources (e.g. ghosted textures) on the fly to hide application problems. No more random fluctuations in resource memory footprint, but again the application has to take on the responsibility to do the right thing.
This is at risk of being opinion based. I suppose I will just reiterate the Vulkan advantages that are written on the box, and hopefully uncontested.
You can disable validation in Vulkan. It obviously uses less CPU (or battery/power/noise) that way. In some cases this can be significant.
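For example, a common pattern is to enable the Khronos validation layer only in debug builds, so release builds pay nothing for error checking (a sketch, with error handling omitted):

```cpp
#include <vulkan/vulkan.h>
#include <vector>

VkInstance createInstance()
{
    std::vector<const char*> layers;
#ifndef NDEBUG
    layers.push_back("VK_LAYER_KHRONOS_validation");   // debug builds only
#endif

    VkInstanceCreateInfo info{VK_STRUCTURE_TYPE_INSTANCE_CREATE_INFO};
    info.enabledLayerCount   = static_cast<uint32_t>(layers.size());
    info.ppEnabledLayerNames = layers.data();

    VkInstance instance = VK_NULL_HANDLE;
    vkCreateInstance(&info, nullptr, &instance);        // release builds: zero validation cost
    return instance;
}
```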
OpenGL has poorly defined multi-threading. Vulkan has well-defined multi-threading in the specification, meaning you do not immediately lose your mind trying to code with multiple threads, and you get better performance where a single thread would otherwise be a CPU bottleneck.
Vulkan is more explicit; it does not (or tries not to) expose big magic black boxes. That means you can e.g. do something about micro-stutter and hitching, and apply other micro-optimizations.
Vulkan has a cleaner interface to windowing systems. No more odd contexts and default framebuffers. Vulkan does not even require a window to draw (or it can achieve this without weird hacks).
Vulkan is a cleaner and more conventional API. For me that means it is easier to learn (despite the other things) and more satisfying to use.
Vulkan takes shaders in a binary intermediate code (SPIR-V), while OpenGL historically did not. That should mean faster compilation of such code.
Vulkan treats mobile GPUs as first-class citizens. No more ES.
Vulkan has open source and conventional (GitHub) public trackers, meaning you can improve the ecosystem without jumping through hoops. E.g. you can improve/implement a validation check for an error that often trips you up, or improve the specification so it makes sense for people who are not insiders.

How do multiple rendering contexts across multiple processes affect GPU performance? [duplicate]

My co-worker and I are working on a video rendering engine.
The whole idea is to parse a configuration file and render each frame to an offscreen FBO, then fetch the frame render results using glReadPixels for video encoding.
We tried to optimize rendering speed by creating two threads, each with an independent OpenGL context. One thread renders odd frames and the other renders even frames. The two threads do not share any GL resources.
The results are quite confusing. On my computer, the rendering speed increased compared to our single thread implementation, while on my partner's computer, the entire speed dropped.
I wonder: how does the number of OpenGL contexts affect overall performance? Is it really a good idea to create multiple OpenGL threads if they do not share anything?
Context switching is certainly not free. As pretty much always for performance related questions, it's impossible to quantify in general terms. If you want to know, you need to measure it on the system(s) you care about. It can be quite expensive.
Therefore, you add a certain amount of overhead by using multiple contexts. Whether that pays off depends on where your bottleneck is. If you were already GPU limited with a single CPU thread, you won't really gain anything, because you can't get the GPU to do the work quicker if it is already fully loaded. In that case you add overhead for the context switches without any gain, and make the whole thing slower.
If you were CPU limited, using multiple CPU threads can reduce your total elapsed time. Whether the parallelization of the CPU work, combined with the added overhead for synchronization and context switches, results in a net gain again depends on your use case and the specific system. Trying both and measuring is the only good thing to do.
Based on your problem description, you might also be able to use multithreading while still sticking with a single OpenGL context, and keeping all OpenGL calls in a single thread. Instead of using glReadPixels() synchronously, you could have it read into PBOs (Pixel Buffer Objects), which allows you to use asynchronous reads. This decouples GPU and CPU work much better. You could also do the video encoding on a separate thread if you're not doing that yet. This approach will need some inter-thread synchronization, but it avoids using multiple contexts, while still using parallel processing to get the job done.
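A rough sketch of that PBO readback, assuming the frame has already been rendered to the currently bound FBO and that encodeFrame() is your hand-off to the encoder (both names are placeholders); in practice you would rotate two or more PBOs so the map happens a frame or so after the read was issued:

```cpp
#include <GL/glew.h>   // or your loader of choice

void encodeFrame(const void* rgba);   // assumed hand-off to the encoder thread

void asyncReadback(GLuint pbo, int width, int height)
{
    glBindBuffer(GL_PIXEL_PACK_BUFFER, pbo);
    glBufferData(GL_PIXEL_PACK_BUFFER, width * height * 4, nullptr, GL_STREAM_READ);

    // With a pack PBO bound, glReadPixels returns immediately; the copy is done by DMA.
    glReadPixels(0, 0, width, height, GL_RGBA, GL_UNSIGNED_BYTE, nullptr);

    // ...issue more rendering, then come back (ideally a frame later) to fetch the data.
    const void* pixels = glMapBuffer(GL_PIXEL_PACK_BUFFER, GL_READ_ONLY);
    if (pixels) {
        encodeFrame(pixels);
        glUnmapBuffer(GL_PIXEL_PACK_BUFFER);
    }
    glBindBuffer(GL_PIXEL_PACK_BUFFER, 0);
}
```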

OpenGL: which OpenGL implementations are not pipelined?

In OpenGL wiki on Performance, it says:
"OpenGL implementations are almost always pipelined - that is to say,
things are not necessarily drawn when you tell OpenGL to draw them -
and the fact that an OpenGL call returned doesn't mean it finished
rendering."
Since it says "almost", that means there are some implementations that are not pipelined.
Here I find one:
OpenGL Pixel Buffer Object (PBO)
"Conventional glReadPixels() blocks the pipeline and waits until all
pixel data are transferred. Then, it returns control to the
application. On the contrary, glReadPixels() with PBO can schedule
asynchronous DMA transfer and returns immediately without stall.
Therefore, the application (CPU) can execute other process right away,
while transferring data with DMA by OpenGL (GPU)."
So this means conventional glReadPixels() (not with PBO) blocks the pipeline.
But I cannot tell this from the OpenGL reference for glReadPixels.
Then I am wondering:
which OpenGL implementations are not pipelined?
How about glDrawArrays?
The OpenGL specification itself does not use the term "pipeline", but rather "command stream". The runtime behavior of command stream execution is deliberately left open, to give implementors maximal flexibility.
The important term is "OpenGL synchronization point": https://www.opengl.org/wiki/Synchronization
Here I find one: (Link to songho article)
Note that this is not an official OpenGL specification resource. The wording "blocks the OpenGL pipeline" is a bit unfortunate, because it gets the actual blocking and bottleneck turned "upside down". Essentially it means, that glReadPixels can only return once all the commands leading up to the image it will fetch have been executed.
So this means conventional glReadPixels() (not with PBO) blocks the pipeline. But I cannot tell this from the OpenGL reference for glReadPixels.
Actually it's not the OpenGL pipeline that gets blocked, but the execution of the program on the CPU. It means that the GPU sees no further commands coming from the CPU. So the pipeline doesn't get "blocked" but in fact drained. When a pipeline drains, or needs to be restarted, one says the pipeline has stalled (i.e. the flow in the pipeline came to a halt).
From the GPU's point of view everything happens with maximum throughput: render the stuff up to the point where glReadPixels got called, do a DMA transfer; unfortunately, no further commands are available after initiating the transfer.
How about glDrawArrays?
glDrawArrays returns immediately after the command has been queued and any necessary data has been copied.
Actually it means that this specific operation can't be pipelined, because all data needs to be transferred before the function returns; it doesn't mean other things can't be.
Operations like that are said to stall the pipeline. One function that will always stall the pipeline is glFinish.
Usually, when a function returns a value - such as getting the contents of a buffer - it will induce a stall.
Depending on the driver implementation, creating programs, buffers, and the like can be done without stalling.
Then I am wondering: which OpenGL implementations are not pipelined?
I could imagine that a pure software implementation might not be pipelined. Not much reason to queue up work if you end up executing it on the same CPU. Unless you wanted to take advantage of multi-threading.
But it's probably safe to say that any OpenGL implementation that uses dedicated hardware (commonly called GPU) will be pipelined. This allows the CPU and GPU to work in parallel, which is critical to get good system performance. Also, submitting work to the GPU incurs a certain amount of overhead, so it's beneficial to queue up work, and then submit it in larger batches.
But I cannot tell this from the OpenGL reference for glReadPixels.
True. The man pages don't directly specify which calls cause a synchronization. In general, anything that returns values/data produced by the GPU causes synchronization. Examples that come to mind:
glFinish(). Explicitly requires a full synchronization, which is actually its only purpose.
glReadPixels(), in the non PBO case. The GPU has to finish rendering before you can read back the result.
glGetQueryObjectiv(id, GL_QUERY_RESULT, ...). Blocks until the GPU reaches the point where the query was submitted.
glClientWaitSync(). Waits until the GPU reaches the point where the corresponding glFenceSync() was submitted.
Note that there can be different types of synchronization that are not directly tied to specific OpenGL calls. For example, in the case where the whole workload is GPU limited, the CPU would queue up an infinite amount of work unless there is some throttling. So the driver will block the CPU at more or less arbitrary points to let the GPU catch up to a certain point. This could happen at frame boundaries, but it does not have to. Similar synchronization can be necessary if memory runs low, or if internal driver resources are exhausted.
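For completeness, this is roughly what the glFenceSync()/glClientWaitSync() pair from the list above looks like in use (a minimal sketch, not a full frame loop):

```cpp
#include <GL/glew.h>   // or your loader of choice

void waitForSubmittedWork()
{
    // Insert a fence after submitting GPU work...
    GLsync fence = glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0);

    // ...do unrelated CPU work here, then block (timeout in nanoseconds)
    // until the GPU has reached the fence.
    GLenum status = glClientWaitSync(fence, GL_SYNC_FLUSH_COMMANDS_BIT, 16'000'000);
    if (status == GL_TIMEOUT_EXPIRED) {
        // GPU still busy; wait again or skip this frame's dependent CPU work.
    }
    glDeleteSync(fence);
}
```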

Directx 11 Registers

I'm somewhat confused about registers in DirectX 11. Let me give an example of the situation: assume you have 3 models. They each have a texture that is mapped to register t0. Models 1 and 3 use the same texture, and model 2 uses a different texture. When drawing model 1, I set the texture resource view to register 0 and draw the model. Then I do the same thing for models 2 and 3, but use the same resource view for model 3. When I set the texture for model 2, does the GPU replace the texture in GPU memory with a different one, or does it maintain that texture memory until space is needed and just move some pointers around? I would like to minimize data transfer to the GPU, and I'm wondering if I should handle situations like these myself or whether DX handles it for me. Btw, I am NOT using the Effects 11 framework.
Thanks in advance.
In general, you should assume that once you have created a resource on the GPU (e.g. using CreateTexture2D), that memory is reserved and resident for that resource for use by the 3D pipeline. Note that this is independent of data transfer to the GPU, which is also explicit via Map/Unmap or UpdateSubresource.
There are some cases where the OS will swap memory in and out, but usually this should be avoided if possible. For example, if you create a bunch of large textures but never access them, eventually the video memory manager will page them out to system memory for other tasks (e.g. watching Netflix / browsing the internet on another display). You can also run into real problems if you overcommit video memory (using more than what is available on the system). This used to be impossible (you would just get E_OUTOFMEMORY) but now the memory manager tries to make it work by paging things to system memory or even disk. This is something you should really strive to avoid since if you ever bind and use a paged-out resource, you'll get a glitch waiting for the memory manager to page it back in for use.
Note that the above really just applies to discrete GPU configs. On integrated systems e.g. from Intel or AMD, you get unified memory which has completely different characteristics. But in general you should target discrete configs first, since there are more performance cliffs you have to worry about if you screwed something up, and they would be unlikely to show up on integrated.
Going back to your original question, changing SRVs between draw calls is not that expensive - it's more than a pointer swap, but nowhere near the cost of transferring the entire texture across the bus. You should feel free to swap SRVs at the same frequency as your draw calls and expect no adverse performance impact.
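To illustrate, per-draw SRV changes like the following are the intended usage pattern and move no texture data across the bus; only the binding on slot t0 changes. The context, views, and index counts are stand-ins:

```cpp
#include <d3d11.h>

// Bind a different texture to slot t0 for each draw; the texture data itself
// stays resident in GPU memory, only the view binding changes.
void drawModels(ID3D11DeviceContext* ctx,
                ID3D11ShaderResourceView* texA,   // used by models 1 and 3
                ID3D11ShaderResourceView* texB,   // used by model 2
                UINT indexCount1, UINT indexCount2, UINT indexCount3)
{
    ctx->PSSetShaderResources(0, 1, &texA);
    ctx->DrawIndexed(indexCount1, 0, 0);

    ctx->PSSetShaderResources(0, 1, &texB);
    ctx->DrawIndexed(indexCount2, 0, 0);

    ctx->PSSetShaderResources(0, 1, &texA);       // cheap: no data transfer happens here
    ctx->DrawIndexed(indexCount3, 0, 0);
}
```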
I think what you're asking depends a lot on the hardware and driver implementation. That's also why the DX11 documentation doesn't make any claims about how memory is managed for graphics resources. I can't really give you any authoritative sources, but I believe it's safe to assume that textures/buffers for which you've created a view will reside in the GPU's memory (especially the ones you're accessing more frequently). The graphics driver will do a lot of access optimization. However, it's good practice to change the state of the graphics pipeline as little as possible.
You can read an in-depth discussion of the graphics pipeline here

Which OpenGL functions are not GPU-accelerated?

I was shocked when I read this (from the OpenGL wiki):
glTranslate, glRotate, glScale
Are these hardware accelerated?
No, there are no known GPUs that execute this. The driver computes the matrix on the CPU and uploads it to the GPU.
All the other matrix operations are done on the CPU as well: glPushMatrix, glPopMatrix, glLoadIdentity, glFrustum, glOrtho. This is the reason why these functions are considered deprecated in GL 3.0. You should have your own math library, build your own matrix, upload your matrix to the shader.
For a very, very long time I thought most of the OpenGL functions use the GPU to do computation. I'm not sure if this is a common misconception, but after a while of thinking, this makes sense. Old OpenGL functions (2.x and older) are really not suitable for real-world applications, due to too many state switches.
This makes me realise that, possibly, many OpenGL functions do not use the GPU at all.
So, the question is:
Which OpenGL function calls don't use the GPU?
I believe knowing the answer to the above question would help me become a better programmer with OpenGL. Please do share some of your insights.
Edit:
I know this question easily leads to optimisation level. It's good, but it's not the intention of this question.
If anyone knows a set of GL functions on a certain popular implementation (as AshleysBrain suggested, nVidia/ATI, and possibly OS-dependent) that don't use the GPU, that's what I'm after!
Plausible optimisation guides come later. Let's focus on the functions, for this topic.
Edit2:
This topic isn't about how matrix transformations work. There are other topics for that.
Boy, is this a big subject.
First, I'll start with the obvious: Since you're calling the function (any function) from the CPU, it has to run at least partly on the CPU. So the question really is, how much of the work is done on the CPU and how much on the GPU.
Second, in order for the GPU to get to execute some command, the CPU has to prepare a command description to pass down. The minimal set here is a command token describing what to do, as well as the data for the operation to be executed. How the CPU triggers the GPU to do the command is also somewhat important. Since most of the time, this is expensive, the CPU does not do it often, but rather batches commands in command buffers, and simply sends a whole buffer for the GPU to handle.
All this to say that passing work down to the GPU is not a free exercise. That cost has to be pitted against just running the function on the CPU (no matter what we're talking about).
Taking a step back, you have to ask yourself why you need a GPU at all. The fact is, a pure CPU implementation does the job (as AshleysBrain mentions). The power of the GPU comes from its design to handle:
specialized tasks (rasterization, blending, texture filtering, blitting, ...)
heavily parallel workloads (DeadMG points to that in his answer), whereas a CPU is designed more to handle single-threaded work.
And those are the guiding principles to follow in order to decide what goes in the chip. Anything that can benefit from those ought to run on the GPU. Anything else ought to be on the CPU.
It's interesting, by the way, that some functionality of the GL (prior to deprecation, mostly) is really not clearly delineated. Display lists are probably the best example of such a feature. Each driver is free to push as much as it wants from the display list stream to the GPU (typically in some command buffer form) for later execution, as long as the semantics of GL display lists are kept (and that is somewhat hard in general). So some implementations choose to push only a limited subset of the calls in a display list to a computed format, and simply replay the rest of the command stream on the CPU.
Selection is another one where it's unclear whether there is value to executing on the GPU.
Lastly, I have to say that in general there is little correlation between the API calls and the amount of work on either the CPU or the GPU. A state-setting API tends to only modify a structure somewhere in the driver data. Its effect is only visible when a Draw, or some such, is called.
A lot of the GL API works like that. At that point, asking whether glEnable(GL_BLEND) is executed on the CPU or GPU is rather meaningless. What matters is whether the blending will happen on the GPU when Draw is called. So, in that sense, most GL entry points are not accelerated at all.
I could also expand a bit on data transfer but Danvil touched on it.
I'll finish with the little "s/w path". Historically, GL had to work to spec no matter what the hardware special cases were. Which meant that if the h/w was not handling a specific GL feature, then it had to emulate it, or implement it fully in software. There are numerous cases of this, but one that struck a lot of people is when GLSL started to show up.
Since there was no practical way to estimate the code size of a GLSL shader, it was decided that the GL was supposed to take any shader length as valid. The implication was fairly clear: either implement h/w that could take arbitrary-length shaders - not realistic at the time - or implement a s/w shader emulation (or, as some vendors chose to, simply fail to be compliant). So, if you triggered this condition in a fragment shader, chances were the whole of your GL ended up being executed on the CPU, even with a GPU sitting idle, at least for that draw.
The question should perhaps be "What functions eat an unexpectedly high amount of CPU time?"
Keeping a matrix stack for projection and view is not something the GPU can handle better than a CPU would (on the contrary...). Another example would be shader compilation. Why should this run on the GPU? There is a parser, a compiler, etc., which are just normal CPU programs, like a C++ compiler.
Potentially "dangerous" function calls are for example glReadPixels, because data can be copied from host (=CPU) memory to device (=GPU) memory over the limited bus. In this category are also functions like glTexImage_D or glBufferData.
So generally speaking, if you want to know how much CPU time an OpenGL call eats, try to understand its functionality. And beware of all functions that copy data from host to device and back!
Typically, if an operation is per-something, it will occur on the GPU. An example is the actual transformation - this is done once per vertex. On the other hand, if it occurs only once per large operation, it'll be on the CPU - such as creating the transformation matrix, which is only done once for each time the object's state changes, or once per frame.
That's just a general answer, and some functionality will occur the other way around - as well as being implementation dependent. However, typically it shouldn't matter to you, the programmer. As long as you give the GPU plenty of time to do its work while you're off doing the game sim or whatever, or have a solid threading model, you shouldn't need to worry about it that much.
Regarding sending data to the GPU: as far as I know (I have only used Direct3D), it's all done in-shader; that's what shaders are for.
glTranslate, glRotate and glScale change the currently active transformation matrix. This is of course a CPU operation. The modelview and projection matrices just describe how the GPU should transform vertices when you issue a rendering command.
So, e.g., calling glTranslate does not translate anything at all yet. Before rendering, the current projection and modelview matrices are multiplied (MVP = projection * modelview); then this single matrix is copied to the GPU, and the GPU does the matrix * vertex multiplication ("T&L") for each vertex. So the translation/scaling/projection of the vertices is done by the GPU.
Also, you really should not be worried about performance unless you call these functions in an inner loop somewhere. glTranslate results in three additions. glScale and glRotate are a bit more complex.
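As a sketch of what "build your own matrix and upload it to the shader" looks like in modern GL, here is the same idea using GLM as the CPU-side math library (any math library will do); the program handle and uniform name are assumptions:

```cpp
#include <GL/glew.h>                 // or your loader of choice
#include <glm/glm.hpp>
#include <glm/gtc/matrix_transform.hpp>
#include <glm/gtc/type_ptr.hpp>

void uploadMvp(GLuint program, float aspect)
{
    // All of this runs on the CPU, just like glTranslate/glRotate/glScale did.
    glm::mat4 projection = glm::perspective(glm::radians(60.0f), aspect, 0.1f, 100.0f);
    glm::mat4 view       = glm::translate(glm::mat4(1.0f), glm::vec3(0.0f, 0.0f, -5.0f));
    glm::mat4 model      = glm::rotate(glm::mat4(1.0f), glm::radians(45.0f),
                                       glm::vec3(0.0f, 1.0f, 0.0f));
    glm::mat4 mvp = projection * view * model;

    // Only the final 4x4 matrix goes to the GPU; the per-vertex multiply happens there.
    glUseProgram(program);
    GLint loc = glGetUniformLocation(program, "u_mvp");   // uniform name is an assumption
    glUniformMatrix4fv(loc, 1, GL_FALSE, glm::value_ptr(mvp));
}
```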
My advice is that you should learn a bit more about linear algebra. This is essential for working with 3D APIs.
There are software rendered implementations of OpenGL, so it's possible that no OpenGL functions run on the GPU. There's also hardware that doesn't support certain render states in hardware, so if you set a certain state, switch to software rendering, and again, nothing will run on the GPU (even though there's one there). So I don't think there's any clear distinction between 'GPU-accelerated functions' and 'non-GPU accelerated functions'.
To be on the safe side, keep things as simple as possible. The straightforward rendering-with-vertices and basic features like Z buffering are most likely to be hardware accelerated, so if you can stick to that with the minimum state changing, you'll be most likely to keep things hardware accelerated. This is also the way to maximize performance of hardware-accelerated rendering - graphics cards like to stay in one state and just crunch a bunch of vertices.