I have read the specification from Khronos, and I know that glProgramUniform*() uploads data to the specified program object, while glUniform*() uploads data to the currently bound program object. But I want to know whether there are any other differences between the two, such as performance.
The only difference guaranteed by the spec is the one you already mentioned: There is no need to bind the shader before using the glProgramUniform* command family.
If there are any performance differences, then they are vendor/driver/version specific.
Note that glUniform* has been available since OpenGL 2.0, while glProgramUniform* was introduced in OpenGL 4.1.
Generally speaking, fewer GL calls is a good thing. That's because, on modern hardware, the bottleneck is often the communication between CPU and GPU.
So avoiding the bind here is the key difference.
glProgramUniform*() follows the Direct State Access (DSA) model, which is closer to how hardware works today and is intended to approach the ideal of zero driver overhead.
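To make the "avoiding the bind" point concrete, here is a minimal sketch in plain Python. ToyGL and its methods are hypothetical stand-ins that only count calls; they are not the real OpenGL API, and the sketch says nothing about actual driver cost.

```python
class ToyGL:
    """Toy stand-in for a GL context (hypothetical); it only counts calls."""

    def __init__(self):
        self.bound_program = None
        self.uniforms = {}  # (program, location) -> value
        self.calls = 0

    def use_program(self, program):  # models glUseProgram
        self.calls += 1
        self.bound_program = program

    def uniform(self, location, value):  # models glUniform*
        self.calls += 1
        self.uniforms[(self.bound_program, location)] = value

    def program_uniform(self, program, location, value):  # models glProgramUniform*
        self.calls += 1
        self.uniforms[(program, location)] = value


def update_with_bind(gl, program, location, value):
    previous = gl.bound_program
    gl.use_program(program)      # bind the target program
    gl.uniform(location, value)  # write through the binding point
    gl.use_program(previous)     # restore whatever was bound before


def update_direct(gl, program, location, value):
    gl.program_uniform(program, location, value)  # no bind, no restore


bind_gl, dsa_gl = ToyGL(), ToyGL()
update_with_bind(bind_gl, program=2, location=0, value=1.5)
update_direct(dsa_gl, program=2, location=0, value=1.5)
print(bind_gl.calls, dsa_gl.calls)  # 3 1
```

Both paths end with the same uniform value stored; the direct path simply needs fewer calls and leaves the bound program untouched.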
I'm looking into whether it's better for me to stay with OpenGL or consider a Vulkan migration for intensive bottlenecked rendering.
However I don't want to make the jump without being informed about it. I was looking up what benefits Vulkan offers me, but with a lot of googling I wasn't able to come across exactly what gives performance boosts. People will throw around terms like "OpenGL is slow, Vulkan is way faster!" or "Low power consumption!" and say nothing more on the subject.
This makes it difficult for me to evaluate whether the problems I face are something Vulkan can help me with, or whether they are due to volume and computation (in which case Vulkan would not help me much).
I'm assuming Vulkan does not magically make things in the pipeline faster (as in shading in triangles is going to be approximately the same between OpenGL and Vulkan for the same buffers and uniforms and shader). I'm assuming all the things with OpenGL that cause grief (ex: framebuffer and shader program changes) are going to be equally as painful in either API.
There are a few things off the top of my head that I think Vulkan offers based on reading through countless things online (and I'm guessing this certainly is not all the advantages, or whether these are even true):
Texture rendering without [much? any?] binding (or rather a better version of 'bindless textures'). I gained a significant performance boost when I switched to bindless textures in OpenGL, so if bindless textures already achieve this, I am not sure whether Vulkan adds anything here
Reduced CPU/GPU communication by composing some kind of command list that you can execute on the GPU without needing to send much data
Being able to interface in a multithreaded way that OpenGL can't somehow
However I don't know exactly what cases people run into in the real world that demand these, and how OpenGL limits these. All the examples so far online say "you can run faster!" but I haven't seen how people have been using it to run faster.
Where can I find information that answers this question? Or do you know some tangible examples that would answer this for me? Maybe a better question would be where are the typical pain points that people have with OpenGL (or D3D) that caused Vulkan to become a thing in the first place?
An example of an answer that would not be satisfying would be a response like
You can multithread and submit things to Vulkan quicker.
but a response that would be more satisfying would be something like
In Vulkan you can multithread your submissions to the GPU. In OpenGL you can't do this because you rely on the implementation to do the appropriate locking and placing fences on your behalf which may end up creating a bottleneck. A quick example of this would be [short example here of a case where OpenGL doesn't cut it for situation X] and in Vulkan it is solved by [action Y].
The last paragraph above may not be accurate whatsoever, but I was trying to give an example of what I'd be looking for without trying to write something egregiously wrong.
Vulkan really has four main advantages in terms of run-time behavior:
Lower CPU load
Predictable CPU load
Better memory interfaces
Predictable memory load
Specifically, lower GPU load isn't one of the advantages; the same content using the same GPU features will have very similar GPU performance with both APIs.
In my opinion it also has many advantages in terms of developer usability - the programmer's model is a lot cleaner than OpenGL, but there is a steeper learning curve to get to the "something working correctly" stage.
Let's look at each of the advantages in more detail:
Lower CPU load
The lower CPU load in Vulkan comes from multiple areas, but the main ones are:
The API encourages up-front construction of descriptors, so you're not rebuilding state on a draw-by-draw basis.
The API is asynchronous and can therefore move some responsibilities, such as tracking resource dependencies, to the application. A naive application implementation here will be just as slow as OpenGL, but the application has more scope to apply high level algorithmic optimizations because it can know how resources are used and how they relate to the scene structure.
The API moves error checking out to layer drivers, so the release drivers are as lean as possible.
The API encourages multithreading, which is always a great win (especially on mobile where e.g. four threads running slowly will consume a lot less energy than one thread running fast).
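As a rough illustration of the multithreading point, the sketch below records per-thread command lists in parallel and hands them over at a single submission point, which is loosely how Vulkan command buffers are meant to be used. It is a plain-Python toy: record_commands and record_frame are hypothetical names, and Python threads will not show a real speedup here; only the structure is the point.

```python
from concurrent.futures import ThreadPoolExecutor


def record_commands(chunk):
    """Build a command list for one chunk of the scene; touches no shared state."""
    return [("draw", obj) for obj in chunk]


def record_frame(objects, workers=4):
    # Split the scene so each worker records its own independent list.
    chunks = [objects[i::workers] for i in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        command_lists = list(pool.map(record_commands, chunks))
    # Single submission point: concatenate the finished lists in a fixed order.
    return [cmd for cmds in command_lists for cmd in cmds]


frame = record_frame(list(range(10)))
print(len(frame))  # 10
```

Because no worker touches shared state while recording, no locking is needed until the final hand-over.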
Predictable CPU load
OpenGL drivers do various kinds of "magic", either for performance (specializing shaders based on state only known late at draw time), or to maintain the synchronous rendering illusion (creating resource ghosts on the fly to avoid stalling the pipeline when the application modifies a resource which is still referenced by a pending command).
The Vulkan design philosophy is "no magic". You get what you ask for, when you ask for it. Hopefully this means no random slowdowns because the driver is doing something you didn't expect in the background. The downside is that the application takes on the responsibility for doing the right thing ;)
Better memory interfaces
Many parts of the OpenGL design are based on distinct CPU and GPU memory pools, which require a programming model that gives the driver enough information to keep them in sync. Most modern hardware can do better with hardware-backed coherency protocols, so Vulkan enables a model where you can just map a buffer once, and then modify it ad hoc with a guarantee that the "other process" will see the changes. No more "map" / "unmap" / "invalidate" overhead (provided the platform supports coherent buffers, of course; it's still not universal).
Secondly, Vulkan separates the concept of a memory allocation from how that memory is used (the memory view). This allows the same memory to be recycled for different things in the frame pipeline, reducing the amount of intermediate storage you need allocated.
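The split between an allocation and the views onto it can be sketched with a bytearray and memoryview in plain Python. This is only an analogy under assumed names (pool, shadow_map, blur_target are made up for illustration); it does not use any Vulkan API.

```python
pool = bytearray(1024)   # one allocation, made once up front
view = memoryview(pool)  # "mapped" once; later writes need no remapping

# Pass 1: the first 512 bytes act as a scratch shadow map.
shadow_map = view[0:512]
shadow_map[:4] = b"\x01\x02\x03\x04"

# Pass 2: the shadow map is no longer needed, so the same bytes are
# handed out again as a different resource instead of allocating more.
blur_target = view[0:512]
blur_target[:4] = b"\xff\xff\xff\xff"

print(len(pool))  # 1024: the footprint never grew
```

The catch, as with real aliasing, is that the application must know the first use has finished before the second begins; nothing in this sketch checks that for you.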
Predictable memory load
Related to the "no magic" comment for CPU performance, Vulkan won't generate random resources (e.g. ghosted textures) on the fly to hide application problems. No more random fluctuations in resource memory footprint, but again the application has to take on the responsibility to do the right thing.
This is at risk of being opinion based. I suppose I will just reiterate the Vulkan advantages that are written on the box, and hopefully uncontested.
You can disable validation in Vulkan. It obviously uses less CPU (or battery/power/noise) that way. In some cases this can be significant.
OpenGL has poorly defined multi-threading. Vulkan has well-defined multi-threading in the specification, meaning you do not immediately lose your mind trying to code with multiple threads, and you get better performance where a single thread would otherwise be a CPU bottleneck.
Vulkan is more explicit; it does not (or tries to not) expose big magic black boxes. That means e.g. you can do something about micro-stutter and hitching, and other micro-optimizations.
Vulkan has a cleaner interface to windowing systems. No more odd contexts and default framebuffers. Vulkan does not even require a window to draw (or rather, it can do so without weird hacks).
Vulkan is a cleaner and more conventional API. For me that means it is easier to learn (despite the other things) and more satisfying to use.
Vulkan takes shaders as binary intermediate code (SPIR-V), which OpenGL historically did not. That should mean faster compilation of such code.
Vulkan has mobile GPUs as first-class citizens. No more ES.
Vulkan has open-source components and conventional (GitHub) public tracker(s), meaning you can improve the ecosystem without jumping through hoops. E.g. you can improve or implement a validation check for an error that often trips you up. Or you can improve the specification so that it makes sense to people who are not insiders.
What I want to do is get the render result from one context and do some further rendering in another context that does not share objects with the first one.
The only method I can come up with is to copy the render result from GPU memory to system memory using glReadPixels-like APIs and use the copied data in the other context.
Is there a better way to do this? I mean without copying the data from GPU memory to system memory and system to GPU again.
I am working with GLX under Linux.
I'm not aware of a way to share them correctly. The closest I could find for GLX is the GLX_NV_copy_image extension. In the introduction, it says:
The WGL and GLX versions allow copying between images in different contexts, even if those contexts are in different sharelists or even on different physical devices.
With this extension, you would use the glXCopyImageSubDataNV() function to copy from one context to the other. While this does not allow sharing, it might still be much faster than copying the data yourself.
As you can already tell from the name, this is a vendor specific extension. I don't know how widely supported it is, but you certainly shouldn't count on it being present on all systems.
Other window system bindings have mechanisms for sharing images even between processes. E.g. with EGL, which is used with OpenGL ES, EGLImage can be used for this purpose. But from browsing through the GLX spec and list of extensions, I couldn't spot anything similar there.
If hacks are in order you could use CUDA or OpenCL for this. Works only on GPUs that support either CUDA or OpenCL though.
There are also AMD extensions which might be of some relevance to this question: WGL_AMD_gpu_association and GLX_AMD_gpu_association:
While this extension's focus is assigning GL context to specific GPUs and efficiently copying data between GPUs, it might also be useful for two contexts on the same GPU:
To provide an accelerated path for blitting data from one context to another, the new blit function blitContextFramebufferAMD has been added.
These extensions and also some synchronization techniques for using them are covered in somewhat more detail in this AMD whitepaper.
The OpenGL tradition is to let the user manipulate OpenGL objects using an unsigned int handle. Why not just give a pointer instead? What are the advantages of unique IDs over pointers?
TL;DR: OpenGL IDs don't map bijectively to memory locations. A single OpenGL ID may refer to multiple memory locations at the same time. Also OpenGL has been designed to work for distributed rendering architectures (like X11) as well, and given an indirect context programs running on different machines may use the same OpenGL context.
OpenGL has been designed as an architecture and display system agnostic API. When OpenGL was first developed this happened in light of client-server display architectures (like X11). If you look into the OpenGL specification, even of modern OpenGL-4 it refers to clients and servers.
However, in a client/server architecture pointers make no sense. For one, the address space of the server is not accessible to the clients without jumping through some hoops. And even if you set up a shared memory mapping, the addresses of objects are not the same for client and server. Add to this that on architectures like X11 a single indirect OpenGL context can be used by multiple clients, which may even run on different machines. Pointers simply don't work for that.
Last but not least, the OpenGL object model is highly abstract and the OpenGL drawing model is asynchronous. Say I do the following:
id = glGenTextures(1)
When the end of this little snippet has been reached, nothing at all may actually have been drawn yet, because no synchronization point has been reached (glFinish, glReadPixels, a buffer swap). Note the two calls to glTexSubImage, which happen on the same id. When the pixels are finally put to the framebuffer, there are two different images to be sourced from a single texture ID, because OpenGL guarantees that things will appear as if they were drawn synchronously. So at the end of a drawing batch a single object ID may refer to a whole collection of different data sets with different locations in memory.
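The situation described above can be mimicked with a small toy model in plain Python. ToyServer and all of its methods are hypothetical; a real driver is far more involved, but the sketch shows how one ID can stand for two data sets while draws are still pending.

```python
class ToyServer:
    """Toy model of a GL implementation; every name here is hypothetical."""

    def __init__(self):
        self.next_id = 1
        self.current = {}  # texture ID -> latest contents
        self.pending = []  # queued draws, each with the contents it must see

    def gen_texture(self):
        tex_id, self.next_id = self.next_id, self.next_id + 1
        self.current[tex_id] = None
        return tex_id

    def tex_sub_image(self, tex_id, pixels):
        # Fresh storage rather than an overwrite: queued draws still need the old data.
        self.current[tex_id] = bytes(pixels)

    def draw(self, tex_id):
        # Nothing is rendered yet; the draw is only queued.
        self.pending.append((tex_id, self.current[tex_id]))

    def finish(self):
        # Synchronization point: all queued draws are executed (here: returned).
        done, self.pending = self.pending, []
        return done


server = ToyServer()
tex = server.gen_texture()
server.tex_sub_image(tex, b"image A")
server.draw(tex)
server.tex_sub_image(tex, b"image B")
server.draw(tex)

# Before the sync point, the single ID is backed by two distinct data sets.
live = {data for tex_id, data in server.pending if tex_id == tex}
results = server.finish()
print(len(live))  # 2
```

A raw pointer could only name one of those two storage locations, which is the mismatch the paragraph above describes.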
My first consideration: having pointers would make programmers wonder if they can operate on them with pointer arithmetic, e.g. by pointing to the middle of a texture to update it or something like that. Maybe even crazier things, such as patching shader code on the fly. That all sounds like a whole new cool degree of freedom, until you think of the additional complications caused by tampering with the GPU's highly efficient and optimized "black-box" way of operating.
For example, consider the inner workings of GPU memory allocation. It is just like with an OS: the pointers you get from the OS are not the real "physical" ones, and the OS memory manager can move things around behind the scenes while keeping the pointers the same (e.g. swapping to HDD). IDs work the same way: the GPU can optimize and pack entities with even more freedom, while keeping the nice facade of them being available at 1-2-3.
Another example: OpenGL is not actually the same across manufacturers. In fact OpenGL is just a description of an API, and each vendor can implement it in whatever way works best for them. For example, there's no rule on how to store texture mipmaps: aligned, interleaved or whatever. Having pointers to a texture would lure developers into tampering with mipmaps, which would cause a lot of trouble in supporting various implementations, or force all the implementations to become strictly unified, which again is a bad idea for performance.
The OpenGL device (GPU) may have its own memory with its own address space, independent of the host (CPU) memory system. (Think of a discrete video card with its own onboard RAM.) The host can't (directly) access that memory, so it's not possible to have a pointer to it.
It's best to think of the GPU as a whole separate computer; it's actually possible to do OpenGL over a network, with a program running on one computer rendering graphics on the video card in another. When you set up your textures and buffers, you're basically uploading data to the GL device for its own internal use.
I am trying to better understand how GPUs work, and I am confused about how they handle high-level APIs like Direct3D or OpenGL. It is very common to see graphics cards advertising that they support Direct3D and OpenGL hardware acceleration. Does this mean that they handle Direct3D and OpenGL instructions directly in hardware?
I haven't been able to find clear evidence to this, or to them being compiled to an assembly representation that the GPU can handle. If there is such a conversion who does that? The software library (Direct3D/OpenGL), the driver or the GPU itself?
On that same line, where is the graphics pipeline defined? In the GPU hardware, the driver, or the software library? This confuses me, especially with the idea of programmable pipelines.
Is there a good resource where I can find information about these details?
You have asked a very broad and complicated question. Actually, you have asked several broad, complicated questions.
The software that has final governance over the operation of any hardware is called the hardware's "driver". Naturally, for graphics hardware, this is called the "graphics driver." Like all drivers, the graphics driver is effectively an installable part of the OS; the OS is what allows the graphics driver to do its job and talk to the hardware. The two work hand in hand.
There are effectively two kinds of D3D or OpenGL (henceforth known as "the API") calls: those that talk to the driver and those that do not. Every call that actually draws something needs to (eventually) talk to the driver, but calls that set up later drawing calls may just store data locally.
When you make a drawing call, the API does some checks to make sure that you as the user have made a valid rendering call. If so, the API has some options as to what to do. It turns out that talking directly to the driver takes a long time, regardless of how many commands you give it when you start talking. Therefore, what often happens is that the API stores your rendering call and returns immediately. Then, possibly in another thread, it may look to see how many rendering calls have been stored. If there are "enough", then it will forward them to the driver. This is called "marshalling".
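The store-then-forward behaviour described above can be sketched as a small queue in plain Python. BatchingAPI, its threshold and the driver_submit callback are all hypothetical, and the flush happens inline here rather than in another thread.

```python
class BatchingAPI:
    """Toy API layer that queues draw calls and forwards them in batches."""

    def __init__(self, driver_submit, threshold=3):
        self.driver_submit = driver_submit  # stands in for the expensive driver call
        self.threshold = threshold          # what counts as "enough" queued calls
        self.queue = []

    def draw(self, call):
        self.queue.append(call)  # store the call and return immediately
        if len(self.queue) >= self.threshold:
            self.flush()

    def flush(self):
        if self.queue:
            self.driver_submit(list(self.queue))  # one driver trip for the whole batch
            self.queue.clear()


submissions = []
api = BatchingAPI(submissions.append, threshold=3)
for i in range(7):
    api.draw(f"draw {i}")
api.flush()  # e.g. at a buffer swap

print([len(batch) for batch in submissions])  # [3, 3, 1]
```

Seven draw calls cost only three trips to the driver in this model, which is the whole motivation for queuing them.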
The driver's job is to take these calls that have been forwarded and convert them into stuff that the GPU will do.
On that same line, where is the graphics pipeline defined? In the GPU hardware, the driver, or the software library?
That's actually a pretty tricky question these days, and becoming trickier every hardware generation.
In the old days, the construction of the graphics pipeline was rigidly controlled by the GPU hardware. These days, this is less true, though there is some hardware control. On modern hardware (capable of OpenGL 3.0 or Direct3D10 or better), it would be theoretically possible, if you had direct access to the graphics driver, to design an API that used a somewhat altered version of the graphics pipeline. So the APIs dictate much of what the graphics pipeline looks like.
Each stage in the rendering pipeline takes certain values from the previous stage(s) as input and generates some number of values as output. A stage is "programmable" if the mechanism for generating the outputs from the inputs involves executing a user-supplied program, called a "shader". So there is no such thing as a programmable pipeline (yet); just programmable stages of a fixed pipeline.
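A toy version of "programmable stages of a fixed pipeline" in plain Python might look like the following. run_pipeline is a hypothetical name, and the rounding step is only a stand-in for fixed-function rasterization.

```python
def run_pipeline(vertices, vertex_shader, fragment_shader):
    """The stage order is fixed; only the two callables are user-supplied."""
    # Programmable stage: user code runs once per vertex.
    transformed = [vertex_shader(v) for v in vertices]
    # Fixed-function stand-in for rasterization: snap each point to a pixel.
    fragments = [(round(x), round(y)) for x, y in transformed]
    # Programmable stage: user code runs once per fragment.
    return {pos: fragment_shader(pos) for pos in fragments}


image = run_pipeline(
    [(0.2, 0.8), (1.6, 2.4)],
    vertex_shader=lambda v: (v[0] * 2, v[1] * 2),
    fragment_shader=lambda pos: "red" if pos[0] == 0 else "blue",
)
print(image)  # {(0, 2): 'red', (3, 5): 'blue'}
```

The caller can swap either shader freely, but cannot reorder the stages or remove the rasterization step, which is what "fixed pipeline" means here.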
There's no such thing as D3D or OGL instructions. Direct3D or OpenGL will call into the graphics driver and they will perform whatever they need to do to make it happen. This is not completely true of shaders, which do have a uniform bytecode at the API (D3D/OGL) level, and in this case, the API provides a compiler, but those are, as far as I know, still transformed in hardware-dependent ways before being executed. Of course, Direct3D and OpenGL also include user-mode components to improve performance or provide a better interface- for example, they will batch calls to the kernel to reduce context switches.
The reality of GPU making is that Microsoft and nVidia/ATi get together and think about what they want and what's feasible to implement, and come up with a group specification, as the reality is that none of this would work if the major hardware and software vendors didn't co-operate. Nobody will buy a GPU that doesn't support DirectX- and nobody will buy Windows where no GPU implements DirectX. Of course, "nobody" is relative- but it would be a huge loss for all concerned, and of course, if you have a game that is built to only the D3D10 API, then the driver supporting D3D10 is a must to run the game- effectively increasing the value of the product by increasing the range of software it can run, which is a selling point. This means that the semantic difference between being defined by the hardware vendor or software vendor is minimal, realistically- especially as the only two real 3D rendering API's on the PC, OpenGL and Direct3D, follow very similar models for the graphical pipeline, as far as I know.
However, with the new programmable GPUs, you could argue that the graphical pipeline doesn't really exist- a DX11 device can be used for any graphics pipeline you can conceive of, if you have the patience to program it.
Ultimately, the GPU is protected by a strong driver-level abstraction. It implements a C-style interface, and whatever's permitted or necessary in that implementation goes. Everything after that is completely implementation-defined.
You could check out the MSDN documentation for writing a graphics driver. I've seen it, but don't have a link handy, and it describes the interfaces that you must adhere to and other things.
You already got two very good answers. But maybe the best thing is, reading the actual programming documentation for AMD/ATI's GPUs: http://developer.amd.com/documentation/guides/pages/default.aspx#open_gpu
Unfortunately NVidia won't publish theirs.
I was shocked when I read this (from the OpenGL wiki):
glTranslate, glRotate, glScale
Are these hardware accelerated?
No, there are no known GPUs that execute this. The driver computes the matrix on the CPU and uploads it to the GPU.
All the other matrix operations are done on the CPU as well: glPushMatrix, glPopMatrix, glLoadIdentity, glFrustum, glOrtho.
This is the reason why these functions are considered deprecated in GL 3.0. You should have your own math library, build your own matrix, upload your matrix to the shader.
For a very, very long time I thought most of the OpenGL functions use the GPU to do computation. I'm not sure if this is a common misconception, but after a while of thinking, this makes sense. Old OpenGL functions (2.x and older) are really not suitable for real-world applications, due to too many state switches.
This makes me realise that, possibly, many OpenGL functions do not use the GPU at all.
So, the question is:
Which OpenGL function calls don't use the GPU?
I believe knowing the answer to the above question would help me become a better programmer with OpenGL. Please do share some of your insights.
I know this question easily leads to optimisation level. It's good, but it's not the intention of this question.
If anyone knows a set of GL functions on a certain popular implementation (as AshleysBrain suggested, nVidia/ATI, and possibly OS-dependent) that don't use the GPU, that's what I'm after!
Plausible optimisation guides come later. Let's focus on the functions, for this topic.
This topic isn't about how matrix transformations work. There are other topics for that.
Boy, is this a big subject.
First, I'll start with the obvious: Since you're calling the function (any function) from the CPU, it has to run at least partly on the CPU. So the question really is, how much of the work is done on the CPU and how much on the GPU.
Second, in order for the GPU to get to execute some command, the CPU has to prepare a command description to pass down. The minimal set here is a command token describing what to do, as well as the data for the operation to be executed. How the CPU triggers the GPU to do the command is also somewhat important. Since most of the time, this is expensive, the CPU does not do it often, but rather batches commands in command buffers, and simply sends a whole buffer for the GPU to handle.
All this to say that passing work down to the GPU is not a free exercise. That cost has to be pitted against just running the function on the CPU (no matter what we're talking about).
Taking a step back, you have to ask yourself why you need a GPU at all. The fact is, a pure CPU implementation does the job (as AshleysBrain mentions). The power of the GPU comes from its design to handle:
specialized tasks (rasterization, blending, texture filtering, blitting, ...)
heavily parallel workloads (DeadMG points to that in his answer), whereas a CPU is designed more for single-threaded work.
And those are the guiding principles to follow in order to decide what goes in the chip. Anything that can benefit from those ought to run on the GPU. Anything else ought to be on the CPU.
It's interesting, by the way. Some functionality of the GL (prior to deprecation, mostly) is really not clearly delineated. Display lists are probably the best example of such a feature. Each driver is free to push as much as it wants from the display list stream to the GPU (typically in some command buffer form) for later execution, as long as the semantics of the GL display lists are kept (and that is somewhat hard in general). So some implementations choose to push only a limited subset of the calls in a display list to a computed format, and simply replay the rest of the command stream on the CPU.
Selection is another one where it's unclear whether there is value to executing on the GPU.
Lastly, I have to say that in general, there is little correlation between the API calls and the amount of work on either the CPU or the GPU. A state-setting API tends to only modify a structure somewhere in the driver data. Its effect is only visible when a Draw, or some such, is called.
A lot of the GL API works like that. At that point, asking whether glEnable(GL_BLEND) is executed on the CPU or GPU is rather meaningless. What matters is whether the blending will happen on the GPU when Draw is called. So, in that sense, most GL entry points are not accelerated at all.
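The idea that state setters only touch driver-side data until a draw is issued can be sketched in plain Python. ToyContext and its methods are hypothetical; the list of "GPU commands" merely records what a draw would send.

```python
class ToyContext:
    """Toy driver-side context; state calls are pure CPU bookkeeping."""

    def __init__(self):
        self.state = {"blend": False}
        self.gpu_commands = []

    def enable(self, cap):   # models glEnable: flips a flag, nothing more
        self.state[cap] = True

    def disable(self, cap):  # models glDisable
        self.state[cap] = False

    def draw(self, mesh):
        # Only here is the accumulated state snapshotted and sent along.
        self.gpu_commands.append((mesh, dict(self.state)))


ctx = ToyContext()
ctx.enable("blend")
ctx.disable("blend")
ctx.enable("blend")
commands_before_draw = len(ctx.gpu_commands)
ctx.draw("quad")

print(commands_before_draw, ctx.gpu_commands)  # 0 [('quad', {'blend': True})]
```

Three state calls produced no GPU work at all; only the final state mattered, and only once the draw was issued.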
I could also expand a bit on data transfer but Danvil touched on it.
I'll finish with the little "s/w path". Historically, GL had to work to spec no matter what the hardware special cases were. Which meant that if the h/w was not handling a specific GL feature, then it had to emulate it, or implement it fully in software. There are numerous cases of this, but one that struck a lot of people is when GLSL started to show up.
Since there was no practical way to estimate the code size of a GLSL shader, it was decided that the GL was supposed to take any shader length as valid. The implication was fairly clear: either implement h/w that could take arbitrary-length shaders (not realistic at the time), or implement a s/w shader emulation (or, as some vendors chose to, simply fail to be compliant). So, if you triggered this condition on a fragment shader, chances were the whole of your GL ended up being executed on the CPU, even when you had a GPU sitting idle, at least for that draw.
The question should perhaps be "What functions eat an unexpectedly high amount of CPU time?"
Keeping a matrix stack for projection and view is not a thing the GPU can handle better than a CPU would (on the contrary ...). Another example would be shader compilation. Why should this run on the GPU? There is a parser, a compiler, ..., which are just normal CPU programs like the C++ compiler.
Potentially "dangerous" function calls are for example glReadPixels, because data is copied from device (=GPU) memory to host (=CPU) memory over the limited bus. Also in this category are functions like glTexImage*D or glBufferData, which copy data in the other direction, from host to device.
So generally speaking, if you want to know how much CPU time an OpenGL call eats, try to understand its functionality. And beware of all functions, which copy data from host to device and back!
Typically, if an operation is per-something, it will occur on the GPU. An example is the actual transformation - this is done once per vertex. On the other hand, if it occurs only once per large operation, it'll be on the CPU - such as creating the transformation matrix, which is only done once for each time the object's state changes, or once per frame.
That's just a general answer and some functionality will occur the other way around, as well as being implementation dependent. However, typically, it shouldn't matter to you, the programmer. As long as you allow the GPU plenty of time to do its work while you're off doing the game sim or whatever, or have a solid threading model, you shouldn't need to worry about it that much.
#sending data to GPU: As far as I know (only used Direct3D) it's all done in-shader, that's what shaders are for.
glTranslate, glRotate and glScale change the currently active transformation matrix. This is of course a CPU operation. The modelview and projection matrices just describe how the GPU should transform vertices when you issue a rendering command.
So e.g. by calling glTranslate nothing is translated at all yet. Before rendering the current projection and model view matrices are multiplied (MVP = projection * modelview) then this single matrix is copied to the GPU and then the GPU does the matrix * vertex multiplications ("T&L") for each vertex. So the translation/scaling/projection of the vertices is done by the GPU.
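The split of work described above can be sketched in plain Python with hand-written 4x4 matrices. The helper names are made up for illustration, the projection is an identity stand-in rather than a real perspective matrix, and the "GPU side" is simply emulated by a loop.

```python
def matmul(a, b):
    """4x4 matrix product, row-major nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)] for i in range(4)]


def translation(x, y, z):
    return [[1, 0, 0, x], [0, 1, 0, y], [0, 0, 1, z], [0, 0, 0, 1]]


def scale(x, y, z):
    return [[x, 0, 0, 0], [0, y, 0, 0], [0, 0, z, 0], [0, 0, 0, 1]]


def transform(m, v):
    """Matrix * column vector."""
    return [sum(m[i][k] * v[k] for k in range(4)) for i in range(4)]


# CPU side: done once per object (or once per frame).
projection = scale(1, 1, 1)  # identity stand-in for a real projection matrix
modelview = matmul(translation(1, 2, 3), scale(2, 2, 2))
mvp = matmul(projection, modelview)

# "GPU side" (emulated here): done once per vertex with the uploaded matrix.
vertices = [[0, 0, 0, 1], [1, 1, 1, 1]]
transformed = [transform(mvp, v) for v in vertices]
print(transformed)  # [[1, 2, 3, 1], [3, 4, 5, 1]]
```

The matrix is built with a few dozen arithmetic operations no matter how many vertices there are, while the per-vertex loop is the part that scales with the mesh and therefore belongs on the GPU.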
Also, you really should not be worried about performance unless you use these functions in an inner loop somewhere. glTranslate only updates one column of the current matrix, which amounts to a handful of multiply-adds. glScale and glRotate are a bit more complex.
My advice is that you should learn a bit more about linear algebra. This is essential for working with 3D APIs.
There are software-rendered implementations of OpenGL, so it's possible that no OpenGL functions run on the GPU. There's also hardware that doesn't support certain render states, so if you set such a state, the driver switches to software rendering and, again, nothing will run on the GPU (even though there's one there). So I don't think there's any clear distinction between 'GPU-accelerated functions' and 'non-GPU-accelerated functions'.
To be on the safe side, keep things as simple as possible. The straightforward rendering-with-vertices and basic features like Z buffering are most likely to be hardware accelerated, so if you can stick to that with the minimum state changing, you'll be most likely to keep things hardware accelerated. This is also the way to maximize performance of hardware-accelerated rendering - graphics cards like to stay in one state and just crunch a bunch of vertices.