How many memory accesses are required to render textures with OpenGL?

Using OpenGL, I want to calculate the necessary throughput to render a frame.
Assuming the worst-case scenario, I have (1) a framebuffer and (2) 8 textures that are all fullscreen and end up covering the whole screen (in practice the textures are flat rectangles, though not at 1:1 scale).
So each output pixel is computed from 9 RGB pixels (assuming all textures have some degree of transparency), and in an ideal world you would read exactly 9 pixels and then write the result to the framebuffer. Of course, with filtering more pixels could be read (especially if the textures are not axis-aligned, etc.)
How do I compute an approximation of the number of memory accesses, so that if my video board says it has a limit of 40Gb/s, I can make sure I have enough bandwidth to support the full load?
Some details as requested in comments:
Frames / texture sizes: 3840 x 2160 (4K)
Expected frame rate: 30 FPS
GPU: NVIDIA (a recent architecture, Maxwell or Pascal at least)
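As a rough back-of-envelope figure under exactly these assumptions (one read per texture layer, one framebuffer write, 4 bytes per pixel, no filtering overhead, no overdraw): 3840 x 2160 pixels x 4 bytes x (9 reads + 1 write) ≈ 332 MB per frame, and 332 MB x 30 FPS ≈ 10 GB/s. Bilinear/trilinear filter footprints, read-modify-write on the framebuffer during blending, cache hit rates and framebuffer compression can all move the real number substantially in either direction.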

Related

"Interleaved rendering" in fragment shader

P.S. Yes, I posted this question on Computer Graphics Stack Exchange as well, in the hope that more people will see it.
Intro
I'm trying to render multi-channel images (more than 4 channels, for the purpose of feeding them to a Neural Network). Since OpenGL doesn't support this natively, I have multiple 4-channel render buffers, into which I render a corresponding portion of the channels.
For example, if I need a multi-channel image of size 512 x 512 x 16, in OpenGL I have 4 render buffers of size 512 x 512 x 4. Now the problem is that the Neural Network expects the data with strides 512 x 512 x 16, i.e. the 16 channel values of one pixel are followed by the 16 channel values of the next pixel. Currently I can efficiently read my 4 render buffers via 4 calls to glReadPixels, which effectively gives the data strides of 4 x 512 x 512 x 4. Manually reordering the data on the client side is not an option for me, as it's too slow.
Main question
I've got an idea: render to a single 4-channel render buffer of size (512*4) x 512 x 4, because stride-wise it's equivalent to 512 x 512 x 16; we just treat each group of 4 consecutive pixels in a row as a single pixel of the 16-channel output image. Let's call it "interleaved rendering".
But this requires me to magically adjust my fragment shader so that every group of 4 consecutive fragments gets exactly the same interpolation of vertex attributes. Is there any way to do that?
This rough illustration, with 1 render buffer holding a 1024 x 512 4-channel image, is an example of how it should be rendered. With that I can extract the data with strides 512 x 512 x 8 in a single glReadPixels call.
EDIT: better pictures
What I have now (4 render buffers)
What I want to do natively in OpenGL (this image is done in Python offline)
But this requires me to magically adjust my fragment shader so that every group of 4 consecutive fragments gets exactly the same interpolation of vertex attributes.
No, it would require a bit more than that. You have to fundamentally change how rasterization works.
Rendering at 4x the width is rendering at 4x the width. That means stretching the resulting primitives, relative to a square area. But that's not the effect you want. You need the rasterizer to rasterize at the original resolution, then replicate the rasterization products.
That's not possible.
From the comments:
It just occurred to me that I can try to get a 512 x 512 x 2 image of texture coordinates from the vertex + fragment shaders, then stitch it with itself to make it 4 times wider (thus getting the same interpolation) and form the final image from that
This is a good idea. You'll need to render whatever interpolated values you need to the original size texture, similar to how deferred rendering works. So it may be more than just 2 values. You could just store the gl_FragCoord.xy values, and then use them to compute whatever you need, but it's probably easier to store the interpolated values directly.
I would suggest doing a texelFetch when reading the texture, as you can specify exact integer texel coordinates. The integer coordinates you need can be computed from gl_FragCoord as follows:
ivec2 texCoords = ivec2(int(gl_FragCoord.x * 0.25f), int(gl_FragCoord.y));
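For example, a minimal sketch of the second-pass fragment shader (not from the answer above; the uInterp name and the placeholder output are illustrative, assuming the first pass stored the interpolated texture coordinates in a 512 x 512 texture and this pass fills the 2048 x 512 target):

#version 430
uniform sampler2D uInterp;   // hypothetical: 512 x 512 interpolants written by the first pass
out vec4 fragColor;

void main() {
    // Four horizontally adjacent fragments of the 2048-wide target share one source texel.
    ivec2 src   = ivec2(int(gl_FragCoord.x * 0.25f), int(gl_FragCoord.y));
    int   group = int(gl_FragCoord.x) & 3;          // which 4-channel slice (0..3) this fragment emits
    vec2  uv    = texelFetch(uInterp, src, 0).xy;   // identical for all four fragments of a group
    // Placeholder: compute channels 4*group .. 4*group+3 from uv here.
    fragColor = vec4(uv, float(group), 1.0);
}

A single glReadPixels on the 2048 x 512 buffer then returns the data with the 512 x 512 x 16 strides the network expects.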

OpenGL: How can I render a triangle mesh at 10bit (or 12bit or 16bit) channel depth color?

For a vision research experiment, I have a monitor that supports 10bit/channel (=30bit color). I want to render a triangle mesh in a simple scene that uses the full bit depth, and I want to save this rendering as a .png file. The rendering is just for single, static images, and doesn't need to happen lightning fast.
For the triangle mesh, I have:
List of vertices in xyz coordinates
List of triangles containing the indices of the vertices
List of the vertex normals
List of the triangle/face normals
My hardware includes (possibly irrelevant)
Dell UP3218K monitor - 8k and 10bits/channel
GeForce RTX 2080 Super (but can get a better one if needed)
I tried using the pyrender library, but it outputs the rendered scene as uint8 (limiting it to 8bit).
I can't find any code examples of OpenGL or PyOpenGL rendering meshes at 10bits or higher. With the increasing popularity of >8bit monitors, surely this is possible?
What can I use to render a mesh at 10 bit/channel depth?
Edit with more specific question:
I have a triangle mesh (points, vertices, normals). How can I render it (display unnecessary at this step) in a scene and save this rendering as a 10-bit depth .png file? Later, I would like to display this .png on a 10-bit monitor.
When you create a framebuffer object (FBO) you get to decide what kind of buffer you're rendering to. Most applications would use a GL_RGBA8 texture as the colour buffer, but you don't have to...
Here is a list of all the formats your graphics driver is required to support. It may also support others that aren't on this list, but this list includes some formats that may be interesting to you:
GL_RGB10_A2 - 10 bits each for R/G/B, 2 bits for A - 32 bits per pixel
GL_RGBA16 - 16 bits each for R/G/B/A - 64 bits per pixel
GL_RGBA16F - 16 bits each for R/G/B/A - 64 bits per pixel - but they're floating-point numbers with a 1-bit sign, 5-bit exponent and 10-bit mantissa.
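To put that into practice (a sketch of the path, not specific to pyrender): create the colour texture with glTexImage2D using GL_RGB10_A2 or GL_RGBA16 as the internal format, attach it to the FBO with glFramebufferTexture2D, render, then read it back with glReadPixels using GL_RGBA / GL_UNSIGNED_SHORT so each channel comes back as a 16-bit value. PNG has no 10-bit mode, so 10-bit data is normally stored scaled up into a 16-bit PNG (multiply by 65535/1023).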

OpenGL: Why do square textures take less memory

Question:
Why does the same amount of pixels take dramatically less video memory if stored in a square texture than in a long rectangular texture?
Example:
I'm creating 360 4x16384 size textures with the glTexImage2D command. Internal format is GL_RGBA. Video memory: 1328 MB.
If I'm creating 360 256x256 textures with the same data, the memory usage is less than 100MB.
Using an integrated Intel HD4000 GPU.
It's not about the texture being rectangular. It's about one of the dimensions being extremely small.
In order to select texels from textures in an optimal fashion, hardware will employ what's known as swizzling. The general idea is that it will restructure the bytes in the texture so that pixels that neighbor each other in 2 dimensions will be neighbors in memory too. But doing this requires that the texture be of a certain minimum size in both dimensions.
Now, the texture filtering hardware can ignore this minimum size and only fetch texels that fall within the texture's actual size. But that extra storage is still there, taking up space to no useful purpose.
Given what you're seeing, there's a good chance that Intel's swizzling hardware has a base minimum size of 32 or 64 pixels.
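For a sense of scale (the 64-texel figure is only illustrative; the real tile size is implementation specific): 360 textures of 4 x 16384 RGBA8 texels hold 360 x 4 x 16384 x 4 bytes ≈ 94 MB of raw data, but if the 4-texel dimension gets padded up to 64 texels, the allocation grows 16-fold to roughly 1.5 GB, the same order of magnitude as the 1328 MB you observe. The 360 textures of 256 x 256 contain the same 94 MB of raw data and are already wide enough in both dimensions, which matches the under-100 MB figure.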
In OpenGL, there's not much you can do to detect this incongruity other than what you've done here.

1k vs 4k texture resolution performance impact

This is for realtime graphics.
Let's say that there is a single mesh that we are rendering. We place a 1k (1024x1024) texture on it and it renders fine. Now let's say that we place a 4k texture on it but render only a 1k section of the texture by using different UVs on the same mesh.
Both times, the visible surface has a 1k texture's worth of texels on it, but one comes from a 1k texture map and the other from a 4k texture map. Would there be a difference in performance, not counting the increased VRAM usage of the 4k map?
For all intents and purposes, no, there will be no difference.
By restricting the UVs to the top left 1024x1024 you'll be pulling in the same amount of texture data as if the texture were 1024x1024 and you read the entire thing. The number of texture samples remains the same as well.
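Concretely, at 4 bytes per texel both cases pull the same 1024 x 1024 x 4 bytes ≈ 4 MB worth of texels through the texture cache, no matter how large the texture they come from is.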
It's impossible to rule it out completely of course without having low-level knowledge of every GPU past, present and future, but you should assume the performance will be the same.

OpenGL Gaussian Kernel on 3D texture

I would like to perform a blur on a 3D texture in OpenGL. Since the kernel is separable, I should be able to do it in 3 passes. My question is, what would be the best way to go about it?
I currently have the 3D texture and fill it using imageStore. Should I create 2 additional copies of the texture for the blur, or is there a way to do it while using a single texture?
I am already using a compute shader to build the mip maps of the 3D texture, but there I read from the texture at one level and write to the next level, so there is no conflict, whereas here I would need some kind of copy.
In short, it can't be done in 3 passes, because it is not a 2D image, even if the kernel is separable.
You have to blur each image slice separately, which is 2 passes per slice (if you are using a 256x256x256 texture, that's 512 passes just for blurring along the U and V coordinates). Then you still have to blur along the remaining axis, which means re-slicing the texture and doing at least another 256 passes. You can gain some performance by using bilinear filtering and reading values between texels to save a constant factor of processing cost. The 3D blur will still be very costly.
Performance tip: maybe you don't need to blur the whole texture but only a part of it? (the visible part?)
The problem with such a high number of passes is the number of interactions between GPU and CPU: draw calls and FBO setup are both slow operations that stall the CPU (a different API with lower CPU overhead would probably be faster).
Try to not separate the kernel:
If you have a small kernel (I'd guess up to 5^3; only profiling will show the maximum practical kernel size), the fastest way is probably NOT to separate the kernel (that is, you save a lot of draw calls and FBO bindings and push everything onto GPU fill rate and bandwidth).
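As a sketch of that non-separated approach (assuming a compute shader with image load/store, similar to what you already use for the mip maps; the 3x3x3 binomial weights, the rgba16f format and the binding points are placeholders to adapt to your setup):

#version 430
layout(local_size_x = 4, local_size_y = 4, local_size_z = 4) in;
layout(binding = 0, rgba16f) uniform readonly  image3D uSrc;
layout(binding = 1, rgba16f) uniform writeonly image3D uDst;

// 1D binomial weights (1, 2, 1) / 4; the 3D weight is the product over the three axes.
const float w[3] = float[3](0.25, 0.5, 0.25);

void main() {
    ivec3 p    = ivec3(gl_GlobalInvocationID);
    ivec3 size = imageSize(uSrc);
    if (any(greaterThanEqual(p, size))) return;

    vec4 sum = vec4(0.0);
    for (int z = -1; z <= 1; ++z)
    for (int y = -1; y <= 1; ++y)
    for (int x = -1; x <= 1; ++x) {
        ivec3 q = clamp(p + ivec3(x, y, z), ivec3(0), size - 1);
        sum += w[x + 1] * w[y + 1] * w[z + 1] * imageLoad(uSrc, q);
    }
    imageStore(uDst, p, sum);
}

Bind the source and destination with glBindImageTexture, glDispatchCompute over the texture dimensions, and issue an appropriate glMemoryBarrier before reading the result.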
Spread work over time:
It doesn't matter whether your kernel is separated or not. Instead of computing a Gaussian blur every frame, you could compute it only every second (maybe with a bigger kernel). Then, as the source of "continuous blurring data", you use the interpolation of the previous blur and the next blur (which costs 2 3D texture samples per fetch each frame, much cheaper than continuously re-blurring).
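A sketch of the lookup side of that idea (uBlurPrev, uBlurNext and uBlend are illustrative names; uBlend ramps from 0 to 1 between two blur updates):

uniform sampler3D uBlurPrev;   // blur result from the previous update
uniform sampler3D uBlurNext;   // blur result from the next update
uniform float     uBlend;      // 0..1, how far we are between the two updates

vec4 sampleBlur(vec3 uvw) {
    // Two 3D texture fetches per sample instead of re-blurring every frame.
    return mix(texture(uBlurPrev, uvw), texture(uBlurNext, uvw), uBlend);
}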