"Interleaved rendering" in fragment shader

"Interleaved rendering" in fragment shader - opengl

P.S. Yes, I posted this question on Computer Graphics Stack Exchange. But posting there also in hope more people will see
Intro
I'm trying to render multi-channel images (more than 4 channels, for the purposes of feeding it to a Neural Network). Since OpenGL doesn't support it natively, I have multiple 4-channel render buffers, into which I render a corresponding portion of channels.
For example, I need multi-channel image of size 512 x 512 x 16, in OpenGL I have 4 render buffers of size 512 x 512 x 4. Now the problem is that the Neural Network expects the data with strides 512 x 512 x 16, i.e. 16 values of channels of one pixel are followed by 16 values of channels from the next pixel. However currently I can efficiently read my 4 render buffers via 4 calls to glReadPixels, basically making the data having strides 4 x 512 x 512 x 4. Manual reordering of data on the client side will not suffice me as it's too slow.
Main question
I've got an idea to render to a single 4-channel render buffer of size 512*4 x 512 x 4, because stride-wise it's equivalent to 512 x 512 x 16, we just treat a combination of 4 pixels in a row as a single pixel of 16-channel output image. Let's call it an "interleaved rendering"
But this requires me to magically adjust my fragment shader, so that every group of consequent 4 fragments would have exactly the same interpolation of vertex attributes. Is there any way to do that?
This bad illustration with 1 render buffer of 1024 x 512 4-channel image, is an example of how it should be rendered. With that I can in 1 call glReadPixels extract the data with stride 512 x 512 x 8
EDIT: better pictures
What I have now (4 render buffers)
What I want to do natively in OpenGL (this image is done in Python offline)

But this requires me to magically adjust my fragment shader, so that every group of consequent 4 fragments would have exactly the same interpolation of vertex attributes.
No, it would require a bit more than that. You have to fundamentally change how rasterization works.
Rendering at 4x the width is rendering at 4x the width. That means stretching the resulting primitives, relative to a square area. But that's not the effect you want. You need the rasterizer to rasterize at the original resolution, then replicate the rasterization products.
That's not possible.
From the comments:
It just got to me, that I can try to get a 512 x 512 x 2 image of texture coordinates from vertex+fragment shaders, then stitch it with itself to make 4 times wider (thus we'll get the same interpolation) and from that form the final image
This is a good idea. You'll need to render whatever interpolated values you need to the original size texture, similar to how deferred rendering works. So it may be more than just 2 values. You could just store the gl_FragCoord.xy values, and then use them to compute whatever you need, but it's probably easier to store the interpolated values directly.
I would suggest doing a texelFetch when reading the texture, as you can specify exact integer texel coordinates. The integer coordinates you need can be computed from gl_FragCoord as follows:
ivec2 texCoords = ivec2(int(gl_FragCoord.x * 0.25f), int(gl_FragCoord.y));

Related

opengl tall texture vs wide texture for memory locality

A 2D Texture has two coordinates, x and y. To store a 2D array in 1D memory, the two possible formats are [x + y * width] and [x * height + y]. OpenGL has various confusing row-major/column-major conventions so I am unsure which of the two formats it uses. This is relevant because if a texture is used to store multiple images, such as in a sprite sheet or atlas, it is better to have the parts of an image located close together in memory. For example, if the format is [x + y * width] and we are using a very wide texture, then the GPU will have to skip through long parts of memory to find the texels it needs.
Thus: is a tall texture atlas superior to a wide texture atlas, or is it the other way around? Or do GPUs have no memory locality benefits?

The most important aspect of a texture atlas is how many images can fit inside it. Even when it comes to texture atlases, you are far more likely to access adjacent texels than distant ones.
Think about it. Say you render 2 32x32 sprites. So that's 2 quads, in a single rendering call. Each quad will take up 32x32 pixels on the screen; that's 1024 pixels.
Locality matters; you're rendering from 1024 locally adjacent texels, then rendering from a different set of 1024 locally adjacent texels.
In any case, OpenGL does not expose you to the details of the GPU's image formats. You can ask for a particular size of texel and a number of channels. But you don't get any more details than that. The data you provide will be appropriately converted by the driver into the actual internal GPU data.
Typically, GPUs will swizzle textures in memory. This means rearranging data so that locality is preserved. That is, instead of storing texels as either x + y * width or x * height + y, they get stored in a more complex arrangement.
For example, the first 4 values would be texels 0,0; 0,1; 1,0; and 1,1. So a 2x2 block of texels is store in a single contiguous array of memory. That's an example of how swizzled texture storage works.
But this is all an implementation detail; there's nothing you can do to influence or affect this, and not even a low-level API like Vulkan allows you to directly load pre-swizzled texel data.

Using IBO for animation - good or bad?

I just watching my animated sprite code, and get some idea.
Animation was made by altering tex coords. It have buffer object, which holds current frame texture coords, as new frame requested, new texture coords feed up in buffer by glBufferData().
And what if we pre-calculate all animation frames texture coords, put them in BO and create Index Buffer Object with just a number of frame, which we need to draw
GLbyte cur_frames = 0; //1,2,3 etc
Now then as we need to update animation, all we need is update 1 byte (instead of 4 /quad vertex count/ * 2 /s, t/ * sizeof(GLfloat) bytes for quad drawing with TRIANGLE_STRIP) frame of our IBO with glBufferData, we don't need hold any texture coords after init of our BO.
I am missing something? What are contras?
Edit: of course your vertex data may be not gl_float just for example.

As Tim correctly states, this depends on your application, let us talk some numbers, you mention both IBO's and inserting texture coordinates for all frames into one VBO, so let us take a look at the impact of each.
Suppose a typical vertex looks like this:
struct vertex
{
float x,y,z; //position
float tx,ty; //Texture coordinates
}
I added a z-component but the calculations are similar if you don't use it, or if you have more attributes. So it is clear this attribute takes 20 bytes.
Let's assume a simple sprite: a quad, consisting of 2 triangles. In a very naive mode you just send 2x3 vertices and send 6*20=120 bytes to the GPU.
In comes indexing, you know you have actually only four vertices: 1,2,3,4 and two triangles 1,2,3 and 2,3,4. So we send two buffers to the GPU: one containing 4 vertices (4*20=80 byte) and one containing the list of indices for the triangles ([1,2,3,2,3,4]), let's say we can do this in 2 byte (65535 indices should be enough), so this comes down to 6*2=12 byte. In total 92 byte, we saved 28 byte or about 23%. Also, when rendering the GPU is likely to only process each vertex once in the vertex shader, it saves us some processing power also.
So, now you want to add all texture coordinates for all animations at once. First thing you have to note is that a vertex in indexed rendering is defined by all it's attributes, you can't split it in an index for positions and an index for texture coordinates. So if you want to add extra texture coordinates, you will have to repeat the positions. So each 'frame' that you add will add 80 byte to the VBO and 12 byte to the IBO. Suppose you have 64 frames, you end up with 64*(80+12)=5888byte. Let's say you have 1000 sprites, then this would become about 6MB. That does not seem too bad, but note that it scales quite rapidly, each frame adds to the size, but also each attribute (because they have to be repeated).
So, what does it gain you?
You don't have to send data to the GPU dynamically. Note that updating the whole VBO would require sending 80 bytes or 640 bits. Suppose you need to do this for 1000 sprites per frame at 30 frames per second, you get to 19200000bps or 19.2Mbps (no overhead included). This is quite low (e.g. 16xPCI-e can handle 32Gbps), but it could be worth wile if you have other bandwidth issues (e.g. due to texturing). Also, if you construct your VBO's carefully (e.g. separate VBO's or non-interleaved), you could reduce it to only updating the texture-part, which is only 16 byte per sprite in the above example, this could reduce bandwidth even more.
You don't have to waste time computing the next frame position. However, this is usually just a few additions and few if's to handle the edges of your textures. I doubt you will gain much CPU power here.
Finally, you also have the possibility to simply split the animation image over a lot of textures. I have absolutely no idea how this scales, but in this case you don't even have to work with more complex vertex attributes, you just activate another texture for each frame of animation.
edit: another method could be to pass the frame number in a uniform and do the calculations in your fragment shader, before sampling. Setting a single integer uniform should be that much of an overhead.

For a modern GPU, accessing/unpacking single bytes is not necessarily faster than accessing integer types or even vectors (register sizes & load instructions, etc.). You can just save memory and therefore memory bandwidth, but I wouldn't expect this to give much of a difference in relation to all other vertex attribute array accesses.
I think, the fastest way to supply a frame index for animated sprites is either an uniform, or if multiple sprites have to be rendered with one draw call, the usage of instanced vertex attrib arrays. With the latter, you could provide a single index for fixed-size subsequences of vertices.
For example, when drawing 'sprite-quads', you'd have one frame index fetch per 4 vertices.
A third approach would be a buffer-texture, when using instanced rendering.
I recommend a global (shared) uniform for time/frame index calculation, so you can calculate the animation index on the fly within you shader, which doesn't require you to update the index buffer (which then just represents the relative animation state among sprites)

Which would be faster for pixel format convertion? Pixel shader or compute shader, or maybe OpenCL?

I want to convert frames from YUV420p format(or something like that) to ABGR format on the fly, and put the result frames in video memory as textures.
There are two ways I can think about now:
Let each channel be a source texture, and render to another texture.
Do it "normally" in a compute shader.
I don't quite understand the rules in the GPU. As in my card there are 720 shader cores, 36 texture units, and 16 output units. Does it mean within each cycle, I can at most sampling 40 texture and output 16 pixels, while I can execute 720 shader operations? So if I use method 1, I will be constrained to that 16 pixels output even if I only use 2 or 3 operations for each pixel? If I use method 2, does it mean as long as I can convert one pixel within 45 cycles, it will be faster than using method 1?

OpenGL color index in frag shader?

I have a large sprite library and I'd like to cut GPU memory requirements. Can I store textures on the gpu with only 1 byte per pixel and use that for an RGB color look up in a fragment shader? I see conflicting reports on the use of GL_R8.

I'd say this really depends on whether your hardware supports that texture format or not. How about skipping the whole issue by using a A8R8G8B8 texture instead? It would just be compressed, i.e. using a bit mask (or r/g/b/a members in glsl) to read "sub pixel" values. Like the first pixel is stored in alpha channel, second pixel in red channel, third pixel in green channel, etc.
You could even use this to store up to 4 layers in a single image (cutting max texture width/height); picking just one shouldn't be an issue.

Fast texel settting in OpenGL

I'm in need of rendering an influence map in OpenGL. At present I have 100 x 100 quads rendering with a set color to represent the influence at each point on the map. I've been recommended to change my rendering method to one quad with a texture, then allowing the rendering pipeline to take over in speed.
Basic testing has shown that glTexSubImage2D is too slow for setting 10,000 texels per frame. Do you have any suggestions? Would it better to create an entirely new texture each frame? My influence map is in normalized floats (0.0 to 1.0) and that is converted to grayscale colors (1.0f = white).
Thanks :D

Are you currently updating each of the 10000 texels separately, with 10000 calls of glTexSubImage2D?
Just use one 100x100 grayscale float texture (array of 10000 floats) in RAM, update values directly to that and then send the whole data to GPU with one glTexImage2D call. You could also use buffer objects to allow the transfer happen on background, but it should be unnecessary since you are not moving very large amounts of data.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js