OpenGL Buffer Texture cache implementation - opengl

I've been playing a bit with OpenGL TBOs today, because they seem to be the only way to have an object shared with OpenCL that an OpenCL kernel can both read and write (it is not an image) and that a fragment shader can read from (and it has fewer size limitations). Pretty nice!
However, after comparing the read performance on the GL side to actual 1D/2D/3D textures, I suspect that texelFetch on a gsamplerBuffer is simply an uncached global memory read, and for my application it is about 2x slower. At least, that is the case on OS X with driver OpenGL 4.1 ATI-1.22.25, GLSL 4.10.
Can anybody confirm this suspicion or provide contrary findings (on other platforms?)?

Related

OpenGL Separable shader programs and pipeline performance on modern hardware

I am porting a small OpenGL framework from 3.3 to 4.3. I have shader mix/match implemented in software (i.e. shaders are bound individually and programs are linked lazily when a draw call is issued).
OpenGL 4.1 added this feature with separable programs and pipelines. However, the point of having programs encapsulate all the shader stages was to be able to optimize them as a whole (and only once).
So I would like to know if using SPOs is slower than standard shader programs on Direct3D 11 hardware. Especially: do current implementations allow you to have one program per shader (so a pipeline with 2-5 separate programs) without significant performance loss?
It is funny you should mention D3D11 hardware by name.
If you talk about D3D, you should know it has always worked this way. Shader programs are not immutable objects with every stage linked together in D3D like they are in OpenGL. D3D uses semantics and other goodies to let you swap out the shader attached to each stage whenever you want. The hardware has always worked the way D3D does and OpenGL just exposes this better now.
Whether you will see a change in performance or not from separable shaders is not a problem with the hardware. Any performance gain or loss will be down to the driver implementation. It cannot be substantial, however, or D3D would have adopted OpenGL's linked program model a long time ago -- that API constantly reinvents itself to lower overhead.

OpenGL - gpu memory exceeded, possible scenarios

I can use glTexImage2D or glBufferData to send some data to GPU memory. Let's assume that I ask the driver to send more data to the GPU but the GPU memory is already full. I will probably get GL_OUT_OF_MEMORY. What might happen to the rendering thread? What are the possible scenarios? Is it possible that the rendering thread will be terminated?
It depends on the actual OpenGL implementation. But the most likely scenario is that you'll just encounter a serious performance drop while things keep working.
OpenGL uses an abstract memory model, in which actual implementations treat the GPU's own memory as a cache. In fact, for most OpenGL implementations, texture data doesn't even go directly to the GPU when you load it. Only when it's actually required for rendering does it get loaded into GPU RAM. If there are more textures in use than fit into GPU RAM, textures are swapped in and out of GPU RAM as needed to complete the rendering.
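To get a feel for why the result is a slowdown rather than a crash, here is a toy model of GPU memory acting as a cache. It is only a sketch: `count_uploads` is a made-up name, and the least-recently-used eviction policy is an assumption, since real drivers do not document how they pick what to swap out.

```python
from collections import OrderedDict


def count_uploads(accesses, sizes, capacity):
    """Count host-to-GPU uploads for a sequence of texture uses.

    accesses: texture names in the order the draws use them
    sizes:    dict mapping name -> size in bytes
    capacity: bytes of GPU memory in this toy model
    Eviction is least-recently-used (an assumption of this sketch).
    """
    resident = OrderedDict()  # name -> size, least recently used first
    used = 0
    uploads = 0
    for name in accesses:
        if name in resident:
            resident.move_to_end(name)  # mark as most recently used
            continue
        size = sizes[name]
        if size > capacity:
            raise MemoryError(f"{name} cannot fit in GPU memory at all")
        # Swap out old textures until the new one fits.
        while used + size > capacity:
            _, evicted_size = resident.popitem(last=False)
            used -= evicted_size
        resident[name] = size
        used += size
        uploads += 1
    return uploads
```

With three equally sized textures that all fit, each one is uploaded once no matter how often it is used. Shrink the capacity so only two fit and cycle through all three, and every single use turns into an upload. Rendering still completes, it just spends its time moving data.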
Older GPU generations required a texture to fit completely into their RAM. The GPUs that came out after 2012 can actually access texture subsets from host memory as required, thereby lifting that limit. In fact, you'll run into maximum texture dimension limits sooner than into memory limits (BT;DT).
Of course other, less well developed OpenGL implementations may bail out with an out of memory error. But at least for AMD and NVidia that's not an issue.

Should I make my raytracer with GLSL or OpenCL, and how do I get a large 1gb buffer?

Right now, I have implemented a GLSL raytracer that uses a buffer texture to access the acceleration structure used for ray tracing.
I'm traversing the texture with a while loop, and it's very costly, but I think there's hope for making it faster. But there seems to be a wall that I'm going to hit that I can't seem to fix. Buffer textures have a limited size, on my GPU it was around 200mb, I forget exactly what it was.
I need my data structure to be around 1gb.
Someone recommended OpenCL to me to solve the problem, so I studied OpenCL and now I'm familiar with the API. However, I discovered that OpenCL has a similar problem with maximum buffer sizes. Most GPUs will only give you access to 1/4 of total VRAM in a single buffer. Most GPUs have 1 or 2 GB of VRAM, so creating one buffer for my structure will not work.
It seems like the only way to get my data structure onto the GPU is to split it up into multiple buffers. My question is: what's the most effective and fastest way to do this, and would it be wise to continue in OpenCL or in GLSL? I know branching buffer/texture reads can be costly, and it seems like that's what I would have to do if I split it up. You could avoid the branch if you somehow put the buffers in an array and indexed into it. However, I have found indexing in GLSL to be EXTREMELY slow, even if it's just indexing a local array (why is this?). I wonder if the same slowness would occur if you grouped buffers into an array in a kernel, if that's even possible.
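For reference, the address arithmetic behind such a split can be sketched on the CPU as below. This is only an illustration of the indexing, not GPU code: `CHUNK_ELEMS`, `split_into_chunks`, `locate` and `fetch` are hypothetical names, and the chunk size is an assumed value that would really come from the device's buffer limit.

```python
# Assumed number of elements per buffer; in practice derive it from the
# device's maximum buffer size.
CHUNK_ELEMS = 1 << 20


def split_into_chunks(data, chunk_elems=CHUNK_ELEMS):
    """Cut a flat sequence into consecutive chunks of at most chunk_elems."""
    return [data[i:i + chunk_elems] for i in range(0, len(data), chunk_elems)]


def locate(global_index, chunk_elems=CHUNK_ELEMS):
    """Map a global element index to (buffer number, offset inside it)."""
    return divmod(global_index, chunk_elems)


def fetch(chunks, global_index, chunk_elems=CHUNK_ELEMS):
    """Read one element through the two-level address."""
    buf, off = locate(global_index, chunk_elems)
    return chunks[buf][off]
```

If the chunk size is a power of two, the division and remainder reduce to a shift and a mask. Selecting the buffer from `buf` is the part that turns into a branch or an array index on the GPU.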
Current devices with updated drivers can access more than that. AMD has an envvar that lets you set it even higher.
OpenCL could be a good solution for you.
Also, OpenGL 4.3 added Compute Shaders which are extremely OpenCL-like and perfect for folks with OpenGL experience and an existing OpenGL application.
Regarding performance, looping in your kernel can be a problem due to work group divergence, and if you don't have many work items active, it can reduce device occupancy.
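The divergence cost can be pictured with a toy model, assuming the lanes of one SIMD group run in lockstep so the whole group keeps stepping until its slowest lane leaves the loop. `simd_utilization` is a made-up helper for illustration, not an API of any GPU toolkit.

```python
def simd_utilization(iterations_per_lane):
    """Fraction of lane-steps doing useful work when lanes run in lockstep.

    Toy model: the group costs max(iterations) steps on every lane, while
    only sum(iterations) of those lane-steps do real work.
    """
    longest = max(iterations_per_lane)
    if longest == 0:
        return 1.0  # nothing to do, so nothing is wasted
    useful = sum(iterations_per_lane)
    return useful / (longest * len(iterations_per_lane))
```

In this model, a group of 32 rays that all loop 10 times is fully utilized, while a single ray that needs 100 iterations next to 31 rays that need 10 drags the group down to roughly 13%.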

Hardware support for non-power-of-two textures

I have been hearing conflicting opinions on whether it is safe to use non-power-of-two textures in OpenGL applications. Some say all modern hardware supports NPOT textures perfectly, others say it doesn't or that there is a big performance hit.
The reason I'm asking is because I want to render something to a frame buffer the size of the screen (which may not be a power of two) and use it as a texture. I want to understand what is going to happen to performance and portability in this case.
Arbitrary texture sizes have been specified as a core part of OpenGL ever since OpenGL-2, which was a long time ago (2004). All GPUs designed since then support NP2 textures just fine. The only question is how good the performance is.
However, ever since GPUs became programmable, optimizations based on the predictable access patterns of fixed-function texture fetches have become more or less obsolete. GPUs now have caches optimized for general data locality, so performance is not much of an issue either. In fact, with P2 textures you may need to upscale the data to fit the texture size, which increases the required memory bandwidth, and memory bandwidth is the #1 bottleneck of modern GPUs. So using a slightly smaller NP2 texture may actually improve performance.
In short: you can use NP2 textures safely, and performance is not much of an issue either.
All modern APIs (except some versions of OpenGL ES, I believe) on modern graphics hardware (the last 10 or so generations from ATi/AMD/nVidia and the last couple from Intel) support NP2 textures just fine. They've been in use, particularly for post-processing, for quite some time.
However, that's not to say they're as convenient as power-of-2 textures. One major case is memory packing; drivers can often pack textures into memory far better when they are powers of two. If you look at a texture with mipmaps, the base and all mips can be packed into an area 150% the original width and 100% the original height. It's also possible that certain texture sizes will line up memory pages with stride (texture row size, in bytes), which would provide an optimal memory access situation. NP2 makes this sort of optimization harder to perform, and so memory usage and addressing may be a hair less efficient. Whether you'll notice any effect is very much driver and application-dependent.
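The mipmap figure can be checked with a little arithmetic. The sketch below (helper names are made up for illustration) lists the mip levels of a texture and compares the texel count of the whole chain to the base level. It assumes each level halves the previous one, rounding down, with a minimum of 1.

```python
def mip_chain(width, height):
    """Return the (width, height) of every mip level down to 1x1."""
    levels = [(width, height)]
    while width > 1 or height > 1:
        width = max(1, width // 2)
        height = max(1, height // 2)
        levels.append((width, height))
    return levels


def mip_overhead(width, height):
    """Total texels in the full chain divided by texels in the base level."""
    total = sum(w * h for w, h in mip_chain(width, height))
    return total / (width * height)
```

For a power-of-two texture the chain holds just under 4/3 of the base texels, and stacking the mips in a column next to the base fits them into 150% of the width. For a size like 1920x1080 the levels pick up odd dimensions after a few halvings (135 becomes 67), which is where tidy packing gets harder.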
Offscreen effects are perhaps the most common use case for NP2 textures, especially screen-sized textures. Almost every game on the market now that performs any kind of post-processing or deferred rendering has 1-15 offscreen buffers, many of which are the same size as the screen (for some effects, half or quarter size is useful). These are generally well supported, even with mipmaps.
Because NP2 textures are widely supported and almost a sure bet on desktops and consoles, using them should work just fine. If you're worried about platforms or hardware where they may not be supported, easy fallbacks include using the nearest power-of-2 size (may cause slightly lower quality, but will work) or dropping the effect entirely (with obvious consequences).
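Picking a power-of-2 fallback size is a one-liner either way. A minimal sketch (function names are hypothetical), rounding up when you want to keep every pixel and down when you would rather save memory at the cost of quality:

```python
def next_pow2(n):
    """Smallest power of two that is >= n (n must be a positive integer)."""
    if n < 1:
        raise ValueError("n must be positive")
    return 1 << (n - 1).bit_length()


def prev_pow2(n):
    """Largest power of two that is <= n (n must be a positive integer)."""
    if n < 1:
        raise ValueError("n must be positive")
    return 1 << (n.bit_length() - 1)


def pot_size(width, height, round_up=True):
    """Power-of-two fallback size for a width x height render target."""
    pick = next_pow2 if round_up else prev_pow2
    return pick(width), pick(height)
```

For a 1920x1080 screen, rounding up gives 2048x2048 and rounding down gives 1024x1024, so the choice is between wasting memory and losing resolution.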
I have a lot of experience making games (4+ years) and using texture atlases for iOS & Android through cross-platform development using OpenGL 2.0.
Stick with PoT textures with a maximum size of 2048x2048, because some devices (especially the cheap ones with cheap hardware) still don't support arbitrary texture sizes. I know this from real-life testers and from seeing it first hand. There are so many devices out there now that you never know what sort of GPU you'll be facing.
Your iOS devices will also show black squares and artefacts if you are not using PoT textures.
Just a tip.
Even if arbitrary texture sizes are required by OpenGL X, certain video cards are still not fully compliant with OpenGL. I had a friend with an Intel card who had problems with NPOT2 textures (I assume Intel cards are fully compliant by now).
Do you have a reason for using NPOT2 textures? Then do it, but remember that some old hardware may not support them, and you'll probably need a software fallback that can convert your textures to POT2.
Don't have any reason for using NPOT2 textures? Then just use POT2 textures. (Certain compressed formats still require POT2 textures.)

How much of a modern graphics pipeline uses dedicated hardware?

To put the question another way, if one were to try to reimplement OpenGL or DirectX (or an analogue) using GPGPU (CUDA, OpenCL), where and why would it be slower than the stock implementations on NVIDIA and AMD cards?
I can see how vertex/fragment/geometry/tessellation shaders could be made nice and fast using GPGPU, but what about things like generating the list of fragments to be rendered, clipping, texture sampling and so on?
I'm asking purely for academic interest.
Modern GPUs still have lots of fixed-function hardware which is hidden from the compute APIs. This includes the blending stages, the triangle rasterization and a lot of on-chip queues. The shaders of course all map well to CUDA/OpenCL. After all, shaders and the compute languages all use the same part of the GPU: the general-purpose shader cores. Think of those units as a bunch of very wide SIMD CPUs (for instance, a GTX 580 has 16 cores with a 32-wide SIMD unit).
You get access to the texture units via shaders though, so there's no need to implement that in "compute". If you did, your performance would most likely suck, as you don't get access to the texture caches, which are optimized for spatial layout.
You shouldn't underestimate the amount of work required for rasterization. This is a major problem, and if you throw all of the GPU at it you get roughly 25% of the raster hardware performance (see: High-Performance Software Rasterization on GPUs). That includes the blending costs, which are usually also handled by fixed-function units.
Tessellation also has a fixed-function part which is difficult to emulate efficiently, as it amplifies the input up to 1:4096, and you surely don't want to reserve that much memory up front.
Next, you take lots of performance penalties because you don't have access to framebuffer compression, as there is again dedicated hardware for this which is "hidden" from you when you're in compute-only mode. Finally, as you don't have any on-chip queues, it will be difficult to reach the same utilization as the "graphics pipeline" gets (for instance, it can easily buffer output from vertex shaders depending on shader load, whereas you can't switch shaders that flexibly).
An interesting source code link:
http://code.google.com/p/cudaraster/
and the corresponding research paper:
http://research.nvidia.com/sites/default/files/publications/laine2011hpg_paper.pdf
Some researchers at Nvidia have tried to implement and benchmark exactly what was asked in this post: an open-source implementation of "High-Performance Software Rasterization on GPUs".
And it is open source for "purely academic interest": it covers a limited subset of OpenGL, mainly for benchmarking the rasterization of triangles.
To put the question another way, if one were to try and reimplement OpenGL or DirectX (or an analogue) using GPGPU (CUDA, OpenCL)
Do you realize that before CUDA and OpenCL existed, GPGPU was done with shaders accessed through DirectX or OpenGL?
Reimplementing OpenGL on top of OpenCL or CUDA would introduce unnecessary complexity. On a system that supports OpenCL or CUDA, the OpenGL and DirectX drivers will share a lot of code with the OpenCL and/or CUDA driver, since they access the same piece of hardware.
Update
On a modern GPU, all of the pipeline runs on the hardware. That's what the whole GPU is for. What's done on the CPU is bookkeeping and data management. Bookkeeping covers the whole transformation matrix setup (i.e. determining the transformation matrices and assigning them to the proper registers of the GPU), geometry data upload (transferring geometry and image data to GPU memory), shader compilation and, last but not least, "pulling the trigger", i.e. sending commands to the GPU that make it execute the prepared program to draw nice things. The GPU will then by itself fetch the geometry and image data from memory and process it as per the shaders and the parameters in the registers (= uniforms).