How much of a modern graphics pipeline uses dedicated hardware?

To put the question another way, if one were to try and reimplement OpenGL or DirectX (or an analogue) using GPGPU (CUDA, OpenCL), where and why would it be slower than the stock implementations on NVIDIA and AMD cards?
I can see how vertex/fragment/geometry/tessellation shaders could be made nice and fast using GPGPU, but what about things like generating the list of fragments to be rendered, clipping, texture sampling and so on?
I'm asking purely for academic interest.

Modern GPUs still have a lot of fixed-function hardware which is hidden from the compute APIs. This includes the blending stages, the triangle rasterization and a lot of on-chip queues. The shaders of course all map well to CUDA/OpenCL -- after all, shaders and the compute languages all use the same part of the GPU: the general-purpose shader cores. Think of those units as a bunch of very wide SIMD CPUs (for instance, a GTX 580 has 16 cores, each with a 32-wide SIMD unit).
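To make that mapping concrete, here is a minimal sketch (hypothetical names) of vertex-shader-style work expressed as a CUDA kernel; each thread stands in for one shader invocation running on those same SIMD cores:

```
// A minimal sketch of vertex-shader-style work written as a CUDA kernel:
// each thread plays the role of one shader invocation, and the warps
// executing it run on the same general-purpose SIMD cores as shaders do.
__global__ void transformVertices(const float4* in, float4* out,
                                  const float* m,  // 4x4 row-major matrix
                                  int count)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= count) return;

    float4 v = in[i];
    out[i] = make_float4(
        m[0]  * v.x + m[1]  * v.y + m[2]  * v.z + m[3]  * v.w,
        m[4]  * v.x + m[5]  * v.y + m[6]  * v.z + m[7]  * v.w,
        m[8]  * v.x + m[9]  * v.y + m[10] * v.z + m[11] * v.w,
        m[12] * v.x + m[13] * v.y + m[14] * v.z + m[15] * v.w);
}
```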
You do get access to the texture units from compute, though, so there's no need to implement texture sampling yourself. If you did, your performance would most likely suffer, as you wouldn't get the benefit of the texture caches, which are optimized for spatial locality.
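For illustration, a hedged sketch of sampling through the texture units from CUDA via a texture object; texObj is assumed to have been created earlier with cudaCreateTextureObject, and all names are illustrative:

```
// Sampling through the texture units from CUDA, so the dedicated filtering
// hardware and spatially-tiled texture cache are still used.
__global__ void sampleTexture(cudaTextureObject_t texObj, float* out,
                              int w, int h)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= w || y >= h) return;

    // tex2D goes through the texture units: hardware filtering and
    // addressing modes apply, unlike a plain global-memory load.
    out[y * w + x] = tex2D<float>(texObj, x + 0.5f, y + 0.5f);
}
```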
You shouldn't underestimate the amount of work required for rasterization. This is a major problem: if you throw the whole GPU at it, you reach roughly 25% of the fixed-function raster hardware's performance (see: High-Performance Software Rasterization on GPUs). That figure includes the blending costs, which are usually also handled by fixed-function units.
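To get a feel for the work involved, here is a deliberately naive coverage-test sketch; the paper's cudaraster uses a far more elaborate tiled, hierarchical scheme, so this only illustrates the per-pixel arithmetic that fixed-function hardware absorbs:

```
// One thread per pixel, testing a single triangle's three edge functions.
__device__ float edgeFn(float2 a, float2 b, float2 p)
{
    return (p.x - a.x) * (b.y - a.y) - (p.y - a.y) * (b.x - a.x);
}

__global__ void coverage(float2 v0, float2 v1, float2 v2,
                         unsigned char* mask, int w, int h)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= w || y >= h) return;

    float2 p = make_float2(x + 0.5f, y + 0.5f);
    // The pixel is covered if it lies on the interior side of all three
    // edges (assuming consistent winding).
    bool inside = edgeFn(v0, v1, p) >= 0.0f &&
                  edgeFn(v1, v2, p) >= 0.0f &&
                  edgeFn(v2, v0, p) >= 0.0f;
    mask[y * w + x] = inside ? 1 : 0;
}
```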
Tessellation also has a fixed-function part which is difficult to emulate efficiently, as it amplifies the input by up to 1:4096, and you surely don't want to reserve that much memory up front.
Next, you incur a lot of performance penalties because you don't have access to framebuffer compression; there is again dedicated hardware for this which is "hidden" from you when you're in compute-only mode. Finally, as you don't have any on-chip queues, it will be difficult to reach the same utilization that the graphics pipeline gets (for instance, it can easily buffer the output of vertex shaders depending on shader load; you can't switch shaders that flexibly).

An interesting source code link: http://code.google.com/p/cudaraster/
and the corresponding research paper: http://research.nvidia.com/sites/default/files/publications/laine2011hpg_paper.pdf
Some researchers at NVIDIA have tried to implement and benchmark exactly what was asked in this post: an open-source implementation of "High-Performance Software Rasterization on GPUs"...
And it is open source, for "purely academic interest": it is a limited subset of OpenGL, mainly for benchmarking the rasterization of triangles.

To put the question another way, if one were to try and reimplement OpenGL or DirectX (or an analogue) using GPGPU (CUDA, OpenCL)
Do you realize that before CUDA and OpenCL existed, GPGPU was done with shaders accessed through DirectX or OpenGL?
Reimplementing OpenGL on top of OpenCL or CUDA would introduce unnecessary complexity. On a system that supports OpenCL or CUDA, the OpenGL and DirectX drivers will share a lot of code with the OpenCL and/or CUDA driver, since they drive the same piece of hardware.
Update
On a modern GPU the entire pipeline runs on the hardware. That's what the whole GPU is for. What's done on the CPU is bookkeeping and data management. Bookkeeping would be the whole transformation-matrix setup (i.e. determining the transformation matrices and assigning them to the proper registers of the GPU), geometry data upload (transferring geometry and image data to GPU memory), shader compilation and, last but not least, "pulling the trigger", i.e. sending the commands that make the GPU execute the prepared program to draw nice things. The GPU will then by itself fetch the geometry and image data from memory and process it as per the shaders and the parameters in the registers (= uniforms).
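A rough sketch of that CPU-side sequence; all handles (program, vao, mvpLocation) are assumed to have been created during setup, and the geometry is assumed to be uploaded already:

```
#include <GL/glew.h>   // any GL loader will do

void drawFrame(GLuint program, GLuint vao, GLint mvpLocation,
               const GLfloat* mvp, GLsizei vertexCount)
{
    glUseProgram(program);                              // pick the shaders
    glUniformMatrix4fv(mvpLocation, 1, GL_FALSE, mvp);  // matrix -> uniform "register"
    glBindVertexArray(vao);                             // geometry already in GPU memory
    glDrawArrays(GL_TRIANGLES, 0, vertexCount);         // "pull the trigger"
}
```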

Related

GPU intrinsics for OpenGL

Are there intrinsics/instructions on GPUs specific to the common operations of OpenGL/DirectX, such as triangle filling, texture mapping, clipping, etc.?
And if so, can they be accessed using OpenCL or CUDA code running on the GPU?
Edit: I was wondering whether operations like triangle filling in OpenGL make use of specific GPU instructions that cannot be accessed from OpenCL or CUDA, so that it would be impossible to implement them as efficiently in OpenCL/CUDA as OpenGL does (with a render-to-texture context).
An OpenGL context gives you access to the graphics pipeline, which is not accessible when you create a compute context. There are no direct intrinsics, as both APIs have their own language which gets mapped to PTX or some hardware-specific instructions.
Triangle filling is definitely not a single instruction. You can implement texture mapping and clipping in compute shaders if you want.
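The clipping part, at least, really is just arithmetic. A minimal sketch (CUDA here; a GLSL compute shader would be analogous) of the per-vertex frustum test, on which full Sutherland-Hodgman polygon clipping builds:

```
// A clip-space vertex (x, y, z, w) is inside the view volume
// iff -w <= x, y, z <= w.
__global__ void clipTest(const float4* clipPos, int* inside, int count)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= count) return;

    float4 p = clipPos[i];
    inside[i] = (-p.w <= p.x && p.x <= p.w &&
                 -p.w <= p.y && p.y <= p.w &&
                 -p.w <= p.z && p.z <= p.w) ? 1 : 0;
}
```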
Please clarify what your intent is, as you may have some fundamental misunderstanding.

OpenGL separable shader programs and pipeline performance on modern hardware

I am porting a small OpenGL framework from 3.3 to 4.3. I have shader mix-and-match implemented in software (i.e. shaders are bound individually and programs are linked lazily when a draw call is issued).
OpenGL 4.1 added this feature with separable programs and pipelines; however, the point of having programs encapsulate all the shader stages was to be able to optimize them as a whole (and only once).
So I would like to know whether using SPOs is slower than standard shader programs on Direct3D 11 hardware. Especially: do current implementations allow you to have one program per shader (so a pipeline with 2-5 separate programs) without significant performance loss?
It is funny you should mention D3D11 hardware by name.
If you talk about D3D, you should know that it has always worked this way. Shader programs in D3D are not immutable objects with every stage linked together, as they are in OpenGL. D3D uses semantics and other goodies to let you swap out the shader attached to each stage whenever you want. The hardware has always worked the way D3D does, and OpenGL just exposes this better now.
Whether you will see a change in performance or not from separable shaders is not a problem with the hardware. Any performance gain or loss will be down to the driver implementation. It cannot be substantial, however, or D3D would have adopted OpenGL's linked program model a long time ago -- that API constantly reinvents itself to lower overhead.
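For reference, a hedged sketch of the GL 4.1 separable-program path the question asks about; vsSource/fsSource are assumed GLSL strings with matching in/out interfaces, and error checking is omitted:

```
#include <GL/glew.h>

GLuint makePipeline(const char* vsSource, const char* fsSource)
{
    // Each stage becomes its own single-stage program...
    GLuint vs = glCreateShaderProgramv(GL_VERTEX_SHADER,   1, &vsSource);
    GLuint fs = glCreateShaderProgramv(GL_FRAGMENT_SHADER, 1, &fsSource);

    // ...and a pipeline object mixes and matches them at bind time,
    // much like D3D's per-stage shader binding.
    GLuint pipeline;
    glGenProgramPipelines(1, &pipeline);
    glUseProgramStages(pipeline, GL_VERTEX_SHADER_BIT,   vs);
    glUseProgramStages(pipeline, GL_FRAGMENT_SHADER_BIT, fs);
    return pipeline;   // bind with glBindProgramPipeline(pipeline)
}
```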

What exactly does CUDA do for OpenGL applications?

I guess some portion of an OpenGL application used to run on the CPU, and now through CUDA people can run it on the GPU and accelerate those portions (portions of an OpenGL application or pipeline).
Can someone explain to me what exactly CUDA does for OpenGL? I mean, exactly which operations are offloaded to the GPU for processing by CUDA?
CUDA is a totally separate API from OpenGL; you can use them at the same time, but CUDA isn't necessary to get GPU acceleration of rendering. In OpenGL you use shaders, which are conceptually somewhat similar to CUDA kernels, to achieve hardware acceleration of many tasks.
CUDA does allow interoperability with OpenGL (and Direct3D), but it is by no means necessary; you'd usually only want that if you need to do both scientific computing AND rendering in the same application. That's becoming even less necessary now, with compute shaders available in both GL and D3D.
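When you do want that interop, it looks roughly like this (a sketch: `simulate` stands in for whatever kernel you actually run, and in real code you would register the buffer once at startup rather than on every call):

```
#include <GL/glew.h>
#include <cuda_gl_interop.h>

// A GL vertex buffer is mapped into CUDA, filled by a kernel,
// then handed back to GL for rendering.
__global__ void simulate(float4* positions, int n);

void updateWithCuda(GLuint vbo, int n)
{
    cudaGraphicsResource* res = nullptr;
    cudaGraphicsGLRegisterBuffer(&res, vbo, cudaGraphicsRegisterFlagsNone);

    cudaGraphicsMapResources(1, &res, 0);        // GL must not touch it now
    float4* dptr = nullptr;
    size_t bytes = 0;
    cudaGraphicsResourceGetMappedPointer((void**)&dptr, &bytes, res);

    simulate<<<(n + 255) / 256, 256>>>(dptr, n); // the compute step

    cudaGraphicsUnmapResources(1, &res, 0);      // GL may render it again
    cudaGraphicsUnregisterResource(res);
}
```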

OpenGL vs. OpenCL, which to choose and why?

What features make OpenCL unique to choose over OpenGL with GLSL for calculations? Despite the graphics-related terminology and impractical datatypes, is there any real caveat to OpenGL?
For example, parallel function evaluation can be done by rendering to a texture using other textures as input. Reduction operations can be done by iteratively rendering to smaller and smaller textures. On the other hand, random write access is not possible in any efficient manner (the only way to do it is to render triangles driven by texture-sourced vertex data). Is this possible with OpenCL? What else is possible or impossible with OpenGL?
OpenCL was created specifically for computing. When you do scientific computing using OpenGL, you always have to think about how to map your computing problem onto the graphics context (i.e. talk in terms of textures and geometric primitives like triangles) in order to get your computation going.
In OpenCL you just formulate your computation as a kernel over a memory buffer and you are good to go. This is actually a BIG win (said from the perspective of having thought through and implemented both variants).
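For comparison, this is the whole "formulation" for something like SAXPY (shown in CUDA here; the OpenCL kernel is line-for-line analogous). No textures, no triangles anywhere:

```
__global__ void saxpy(int n, float a, const float* x, float* y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}
// Launch: saxpy<<<(n + 255) / 256, 256>>>(n, 2.0f, d_x, d_y);
```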
The memory access patterns are the same, though (your calculation still happens on a GPU -- but GPUs are getting more and more flexible these days).
But what more could you ask for than getting to use dozens of parallel "CPUs" without breaking your head over how to translate, e.g. (silly example), a Fourier transform into triangles and quads?
Something that hasn't been mentioned in any of the answers so far is speed of execution. If your algorithm can be expressed in OpenGL graphics terms (e.g. no scattered writes, no local memory, no workgroups, etc.), it will very often run faster than an OpenCL counterpart. My specific experience of this has been with image-filter (gather) kernels across AMD, NVIDIA, IMG and Qualcomm GPUs. The OpenGL implementations invariably run faster, even after hardcore OpenCL kernel optimization. (Aside: I suspect this is due to years of hardware and drivers being specifically tuned to graphics-oriented workloads.)
My advice would be that if your compute program feels like it maps nicely to the graphics domain, then use OpenGL. If not, OpenCL is more general and simpler for expressing compute problems.
Another point to mention (or to ask) is whether you are writing as a hobbyist (i.e. for yourself) or commercially (i.e. for distribution to others). While OpenGL is supported pretty much everywhere, OpenCL totally lacks support on mobile devices and, imho, is highly unlikely to appear on Android or iOS in the next few years. If wide cross-platform compatibility from a single code base is a goal, then OpenGL may be forced upon you.
What features make OpenCL unique to choose over OpenGL with GLSL for calculations? Despite the graphics-related terminology and impractical datatypes, is there any real caveat to OpenGL?
Yes: it's a graphics API. Therefore, everything you do in it has to be formulated in those terms. You have to package your data as some form of "rendering". You have to figure out how to deal with your data in terms of attributes, uniform buffers, and textures.
With OpenGL 4.3 and OpenGL ES 3.1 compute shaders, things become a bit more muddled. A compute shader is able to access memory via SSBOs and image load/store in ways similar to OpenCL compute operations (though OpenCL offers actual pointers, while GLSL does not). Their interop with the rest of OpenGL is also much faster than OpenCL/GL interop.
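Host-side, driving such a compute shader over an SSBO looks roughly like this (a sketch; the program handle is assumed to exist, and binding point 0 is assumed to match a `layout(std430, binding = 0)` block in the GLSL source):

```
#include <GL/glew.h>

void runComputePass(GLuint computeProgram, GLuint ssbo, GLuint groupsX)
{
    glUseProgram(computeProgram);
    glBindBufferBase(GL_SHADER_STORAGE_BUFFER, 0, ssbo);
    glDispatchCompute(groupsX, 1, 1);
    // Make the shader's writes visible to whoever reads the buffer next.
    glMemoryBarrier(GL_SHADER_STORAGE_BARRIER_BIT);
}
```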
Even so, compute shaders do not change one fact: OpenCL compute operations operate at a very different precision than OpenGL's compute shaders. GLSL's floating-point precision requirements are not very strict, and OpenGL ES's are even less strict. So if floating-point accuracy is important to your calculations, OpenGL will not be the most effective way of computing what you need to compute.
Also, OpenGL compute shaders require 4.x-capable hardware, while OpenCL can run on much more modest hardware.
Furthermore, if you're doing compute by co-opting the rendering pipeline, OpenGL drivers will still assume that you're doing rendering. So it's going to make optimization decisions based on that assumption. It will optimize the assignment of shader resources assuming you're drawing a picture.
For example, if you're rendering to a floating-point framebuffer, the driver might just decide to give you an R11_G11_B10 framebuffer, because it detects that you aren't doing anything with the alpha and your algorithm could tolerate the lower precision. If you use image load/store instead of a framebuffer however, you're much less likely to get this effect.
OpenCL is not a graphics API; it's a computation API.
Also, OpenCL just gives you access to more stuff. It gives you access to memory levels that are implicit in GL. Certain memory can be shared between threads, while separate shader instances in GL are unable to affect one another directly (outside of image load/store, and OpenCL runs on hardware that doesn't have access to that).
OpenGL hides what the hardware is doing behind an abstraction. OpenCL exposes you to almost exactly what's going on.
You can use OpenGL to do arbitrary computations. But you don't want to; not while there's a perfectly viable alternative. Compute in OpenGL lives to service the graphics pipeline.
The only reason to pick OpenGL for any kind of non-rendering compute operation is to support hardware that can't run OpenCL. At the present time, this includes a lot of mobile hardware.
One notable feature would be scattered writes; another would be the absence of "Windows 7 smartness". Windows 7 will, as you probably know, kill the display driver if OpenGL does not flush for two seconds or so (don't nail me down on the exact time, but I think it's two seconds). This can be annoying if you have a lengthy operation.
Also, OpenCL obviously works with a much greater variety of hardware than just graphics cards, and it does not have a rigid graphics-oriented pipeline with "artificial constraints". It is also easier (trivial, even) to run several concurrent command streams.
Although currently OpenGL would be the better choice for graphics, this is not permanent.
It could be practical for OpenGL to eventually be merged in as an extension of OpenCL. The two platforms are about 80% the same, but have different syntax quirks and different nomenclature for roughly the same components of the hardware. That means two languages to learn and two APIs to figure out. Graphics driver developers would prefer a merge because they would no longer have to develop for two separate platforms. That leaves more time and resources for driver debugging. ;)
Another thing to consider is that the origins of OpenGL and OpenCL are different: OpenGL began and gained momentum during the early fixed-pipeline-over-a-network days and was slowly appended to and deprecated as the technology evolved. OpenCL, in some ways, is an evolution of OpenGL in the sense that OpenGL started being used for numerical processing as the (unplanned) flexibility of GPUs allowed it. "Graphics vs. computing" is really more of a semantic argument. In both cases you're always trying to map your math operations to hardware with the highest performance possible. There are parts of GPU hardware which vanilla CL won't use, but that won't keep a separate extension from doing so.
So how could OpenGL work under CL? Speculatively, triangle rasterizers could be enqueued as a special CL task. Special GLSL functions could be implemented in vanilla OpenCL, then overridden with hardware-accelerated instructions by the driver during kernel compilation. Writing a shader in OpenCL, provided the library extensions were supplied, doesn't sound like a painful experience at all.
To say that one has more features than the other doesn't make much sense, as they're both gaining 80% of the same features, just under different nomenclature. To claim that OpenCL is not good for graphics because it is designed for computing doesn't make sense, because graphics processing is computing.
Another major reason is that OpenGL/GLSL is supported only on graphics cards. Although multi-core usage started with graphics hardware, there are many hardware vendors working on multi-core hardware platforms targeted at computation. For example, see Intel's Knights Corner.
Developing code for computation using OpenGL/GLSL will prevent you from using any hardware that is not a graphics card.
Well, as of OpenGL 4.5 these are the features OpenCL 2.0 has that OpenGL 4.5 doesn't (as far as I could tell); this does not cover the features that OpenGL has that OpenCL doesn't:
Events
Better Atomics
Blocks
Workgroup functions:
  work_group_all and work_group_any
  work_group_broadcast
  work_group_reduce
  work_group_inclusive/exclusive_scan
Enqueue Kernel from Kernel
Pointers (though if you are executing on the GPU this probably doesn't matter)
A few math functions that OpenGL doesn't have (though you could construct them yourself in OpenGL)
Shared Virtual Memory
(More) Compiler Options for Kernels
Easy to select a particular GPU (or otherwise)
Can run on the CPU when no GPU
More support for niche hardware platforms (e.g. FPGAs)
On some (all?) platforms you do not need a window (and its context binding) to do calculations.
OpenCL allows just a bit more control over precision of calculations (including some through those compiler options).
A lot of the above are mostly for better CPU-GPU interaction: events, shared virtual memory, pointers (although these could potentially benefit other things too).
OpenGL has gained the ability to sort things into different areas of client and server memory since a lot of the other posts here were made.
OpenGL now has better memory-barrier and atomics support and allows you to allocate things to different registers within the GPU (to about the same degree OpenCL can). For example, you can now share registers within a local compute group in OpenGL, using something like the AMD GPU's LDS (local data share), though this particular feature only works with OpenGL compute shaders at this time.
OpenGL has stronger, better-performing implementations on some platforms (such as the open-source Linux drivers).
OpenGL has access to more fixed-function hardware (as other answers have said). While it is true that fixed-function hardware can sometimes be avoided (e.g. Crytek uses a "software" implementation of a depth buffer), it can manage memory just fine (and usually a lot better than someone who isn't working for a GPU hardware company could) and is just vastly superior in most cases. I must admit OpenCL has pretty good fixed-function texture support, which is one of the major OpenGL fixed-function areas.
I would argue that Intel's Knights Corner is an x86 GPU that controls itself.
I would also argue that OpenCL 2.0, with its texture functions (which are actually present in earlier versions of OpenCL), can be used to much the same degree of performance that user2746401 suggested.
In addition to the already existing answers, OpenCL/CUDA not only fits the computational domain better, it also doesn't abstract away the underlying hardware too much. This way you can profit from things like shared memory or coalesced memory access more directly, which would otherwise be buried in the actual implementation of the shader (which itself is nothing more than a special OpenCL/CUDA kernel, if you will).
To profit from such things, though, you also need to be a bit more aware of the specific hardware your kernel will run on; in a shader you can't explicitly take those things into account (even if it were completely possible).
Once you do something more complex than simple level-1 BLAS routines, you will surely appreciate the flexibility and generality of OpenCL/CUDA.
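A small CUDA sketch of the kind of control meant above: staging data in on-chip shared memory with coalesced loads, which a fragment shader gives you no direct way to express. Block-wise reversal is chosen only to keep the example tiny; it assumes the element count is a multiple of the block size (256):

```
__global__ void reverseBlocks(const float* in, float* out)
{
    __shared__ float tile[256];                   // on-chip shared memory

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = in[i];                    // coalesced global load
    __syncthreads();                              // block-wide cooperation

    out[i] = tile[blockDim.x - 1 - threadIdx.x];  // read a neighbour's value
}
// Launch with a block size of 256 and count a multiple of 256:
// reverseBlocks<<<count / 256, 256>>>(d_in, d_out);
```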
The "feature" is that OpenCL is designed for general-purpose computation, while OpenGL is for graphics. You can do anything in GL (it is Turing-complete), but then you are driving in a nail using the handle of a screwdriver as a hammer.
Also, OpenCL can run not just on GPUs, but also on CPUs and various dedicated accelerators.
OpenCL (in version 2.0) describes a heterogeneous computational environment in which every component of the system can both produce and consume tasks generated by other system components. The notions of CPU, GPU (etc.) are no longer needed -- you just have Host & Device(s).
OpenGL, by contrast, has a strict division into the CPU, which is the task producer, and the GPU, which is the task consumer. That's not bad, as less flexibility allows greater performance. OpenGL is simply a narrower-scope instrument.
One thought is to write your program in both and test them with respect to your priorities.
For example: if you're processing a pipeline of images, maybe your implementation in OpenGL or OpenCL will turn out faster than the other.
Good luck.

Which parts of Graphics Pipelines are done using CPU & GPU?

Which parts of the pipeline are done using the CPU and which are done using the GPU?
Reading the Wikipedia article on the graphics pipeline, maybe my question does not precisely represent what I am asking.
Referring to this question, which "steps" are done on the CPU and which are done on the GPU?
Edit:
My question is more about which parts of the logical, high-level steps needed to display terrain + 3D models [from files] use the CPU/GPU, rather than about specific functions.
Which parts of the pipeline are done using the CPU and which are done using the GPU?
As far as I know...
It depends entirely on the hardware, the driver and the OpenGL implementation.
Everything you can do in OpenGL works, but you can't be 100% sure where the processing happens -- it is conveniently hidden from you (which is actually one of OpenGL's benefits).
There are full-software OpenGL emulators, for example Mesa3D (non-certified).
Put such an opengl32.dll (if you're on Windows) into your program folder, and everything will run on the CPU.
Typically, rasterization (the last stage of the pipeline: alpha blending, writing pixels, z-testing) is done in hardware. Take an old video card, and it is possible that the rest of the functionality runs on the CPU. Vertex transformations, vertex manipulations, and even standard lighting can be done on the CPU (and they were done on the CPU for a long time), depending on the video card. However, I must admit that video cards without hardware T&L (transform and lighting) support are quite old. AFAIK, transformation and lighting are always performed on the CPU if the card only supports full hardware acceleration of DirectX 7 (Windows-platform specific).
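What software T&L boils down to is a loop like this running on the CPU before screen-space data ever reaches the rasterizer (an illustrative sketch; types and names are made up):

```
struct Vec4 { float x, y, z, w; };

void transformOnCpu(const Vec4* in, Vec4* out, const float m[16], int count)
{
    for (int i = 0; i < count; ++i) {   // one vertex at a time, on the CPU
        const Vec4& v = in[i];
        out[i] = { m[0]  * v.x + m[1]  * v.y + m[2]  * v.z + m[3]  * v.w,
                   m[4]  * v.x + m[5]  * v.y + m[6]  * v.z + m[7]  * v.w,
                   m[8]  * v.x + m[9]  * v.y + m[10] * v.z + m[11] * v.w,
                   m[12] * v.x + m[13] * v.y + m[14] * v.z + m[15] * v.w };
    }
}
```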
There may be platform-dependent ways to query video card capabilities in OpenGL, though...
Side note:
DirectX gives you more information about the card's capabilities, and ways to control what happens where (i.e. you can choose software T&L even if the video card supports hardware acceleration; in some cases this may be necessary). This comes with a penalty: DirectX is less flexible than OpenGL, harder to use, harder to experiment with, available on fewer platforms than OpenGL, and worrying about which feature is supported and which isn't takes development time.
The answer depends on the GPU (they differ in hardware capabilities, especially the integrated ones), but on modern non-integrated GPUs the entire pipeline is implemented in hardware. This includes culling, z-sort and occlusion.
Integrated chipsets, such as Intel's, have fewer hardware features; for example, they usually do transformation & lighting in software.
It depends a lot on the system configuration. At a minimum, the CPU has to issue the commands to the GPU (send vertex lists, upload textures, etc.), and the GPU usually does all the geometric transformations, texture mapping, lighting, etc.
But in many cases, part of the work that the GPU should be doing might be done by the CPU if the hardware doesn't support certain features.
If you're using a 3D engine, it might also do some pre-optimization on the CPU to reduce the number of triangles the GPU has to process, like discarding beforehand parts of the geometry that are known to be hidden (by using BSPs, as the id Tech engines do, or portals, as Unreal Engine does, for example).
Things like moving and rotating the meshes of the models (e.g. the walking animations) are usually done on the CPU, but nowadays I guess the trend is to hardware-accelerate them too.
Regarding the Wikipedia link: all the steps mentioned there are usually done on the GPU, but it is up to the software running on the CPU to decide which polygons are going to be sent for processing (even if the GPU later discards some of them because they're hidden or outside the viewing frustum).