I have one big buffer (object) containing the MNIST dataset: many (tens of thousands) small (28x28) grayscale images, stored one-by-one in row-wise order as floats indicating pixel intensity. I would like to efficiently (i.e. somewhat interactively) blend these many images into one "average" image, where each pixel in the blended image is the average of all the pixels at that same position. Is this possible?
The options I considered are:
Using a compute shader directly on the buffer object. I would spawn imgWidth * imgHeight compute shader invocations/threads, with each invocation looping over all images. This doesn't seem very efficient, as each invocation has to loop over all images, but doing it the other way (i.e. spawning numImages invocations and walking over the pixels) still has invocations waiting on each other.
Using the graphics pipeline to draw the textures one-by-one to a framebuffer, blending them all over each other. This would still result in linear time, as each image has to be rendered to the framebuffer in turn. I'm not very familiar with framebuffers, though.
Doing it all linearly in the CPU, which seems easier and not much slower than doing it on the GPU. I would only be missing out on the parallel processing of the pixels.
Are their other possibilities I'm missing. Is there an optimal way? And if not, what do you think would be the easiest?
Most times we want to parallelize at the pixel level because there are many.
However, in your case there are not that many pixels (28x28).
The biggest number you have seems to be the number of images (thousands of images). So we would like to leverage that.
Using a compute shader, instead of iterating though all the images, you could blend the images in pairs. After each pass you would halve the number of images. Once the number of images gets very small, you might want to change the strategy but that's something that you need to experiment with to see what works best.
You know compute shaders can have 3 dimensions. You could have X and Y index the pixel of the image. And Z can be used to inxed the pair of images in a texture array. So for index Z, you would blend textures 2*Z and 2*Z+1.
Some implementation details you need to take into account:
Most likely, the number of images won't be a power of two. So at some point the number of images will be odd.
Since you are working with lots of images, you could run into float precission issues. You might need to use float textures, or addapt the strategy so this is not a problem.
Usually compute shaders work best when the threads process tiles of 2x2 pixels instead of individual pixels.
This is how i do it.
Render all the textures to the framebuffer , which can also be the default frame buffer.
Once rendering in completed.
Read the data from the Framebuffer.
glReadBuffer(GL_COLOR_ATTACHMENT0);
glBindBuffer(GL_PIXEL_PACK_BUFFER, w_pbo[w_writeIndex]);
// copy from framebuffer to PBO asynchronously. it will be ready in the NEXT frame
glReadPixels(0, 0, SCR_WIDTH, SCR_HEIGHT, GL_RGBA, GL_UNSIGNED_BYTE, nullptr);
// now read other PBO which should be already in CPU memory
glBindBuffer(GL_PIXEL_PACK_BUFFER, w_pbo[w_readIndex]);
unsigned char* Data = (unsigned char*)glMapBuffer(GL_PIXEL_PACK_BUFFER, GL_READ_ONLY);
Related
What I want to do
I want to have a set triangles bleed through, or rather ignore the depth buffer, for another set triangles, but only if they have the same number.
Problem (optional reading)
I do not know how to do this without introducing a ton of bubbles into the pipeline. Right now I have very high throughput because I can throw my geometry onto the GPU, tell it to render, and forget about it. However, if I have to keep toggling the state when drawing, I'm worried I'm going to tank my performance. Other people who have done what I've just said (doing a ton of draw calls and state changes) have much worse performance than me. This performance hit is also significantly worse on older hardware, where we are talking on order of 50 - 100+ times performance loss by doing it the state-change way.
Unfortunately this triangle bleeding scenario happens many thousands of times, so the state machine will be getting flooded with "draw triangles, depth off, draw triangles that bleed through, depth on, ...", except N times, where N can get large (N >= 1000).
A good way of imagining this is having a set of triangles T_i, and a set of triangles that bleed through B_i where B_i only bleeds through T_i, and i ranges from 0...1000+. Note that if we are drawing B_100, then it should only bleed through T_100, not T_99 or T_101.
My next thought is to draw all the triangles with their integer into one framebuffer (along with the integer), then draw the bleed through triangles into another framebuffer (also with the integer), and then merge these framebuffers together. I figure they will have the color, depth, and the integer, so I can hopefully merge them in the fragment shader.
Problem is, I have no idea how to write an integer alongside the out vec4 fragColor in the fragment shader.
Questions (and in short)
This leaves me with two questions:
How do I write an integer into a framebuffer? Do I need to write to 4 separate texture framebuffers? (like one color/depth framebuffer texture, another integer framebuffer texture, and then double this so I can merge the pairs of framebuffers together at some point?)
To make this more clear, the algorithm would look like
Render all the 'could be bled from triangles', described above as set T_i,
write colors and depth info into FB1, and write integers into FB2
Render all the 'bleeding' triangles, described above as set B_i,
write colors and depth into FB3, and write integers to FB4
Bind the textures for FB1, FB2, FB3, FB4
Render each pixel by sampling the RGBA, depth, and integers
from the appropriate texture and write those out into the
final framebuffer
I would need to access the color and depth from the textures in the shader. I would also need to access the integer from the other texture. Then I can do the comparison and choose which pixel to write to the default framebuffer.
Is this idea possible? I assume if (1) is, then the answer is yes. Maybe another question could be whether there's a better way. I tried thinking of doing this with the stencil buffer but had no luck
What you want is theoretically possible, but I can't speak as to its performance. You'll be reading and writing a whole lot of texels in a lot of textures for every program iteration.
Anyway to answer your questions:
A framebuffer can have multiple color attachments by using glFramebufferTexture2D with GL_COLOR_ATTACHMENT0, GL_COLOR_ATTACHMENT1, etc. Each texture can then have its own internal format, in your example you probably want a regular RGB texture for your color output, and a second 1-integer only texture.
Your depth buffer is complicated, because you don't want to let OpenGL handle it as normal. If you want to take over the depth buffer, you probably want to attach it as yet another, float texture that you can check against or not your screen-space fragments.
If you have doubts about your shader, remember that you can bind the any number of textures as input samplers you program in code, and each color bind gets its own output value (your shader runs per-texel, so you output one value at a time). Make sure the format of your output is correct, ie vec3/vec4 for the color buffer, int for your integer buffer and float for the float buffer.
And stencil buffers won't help you turn depth checking on or off in a single (possibly indirect) draw call. I can't visualize what your bleeding thing means, but it can probably help with that? Maybe? But definitely not conditional depth checking.
I am now using FFMPEG to read a high resolution video (6480*1920) and use opengl to show it
after decoding, I get 3 pointer that point to the Y,U,V.
At first, I use swsscale to convert it rgb and show it, but I find it's too slow. So I directly deal with YUV. My second try is generate 3 one channel texture and convert it to rgb in fragment shader. It is faster, but still cannot achieve 60fps
I find the bottleneck is this function : texture(texy, tex_coord.xy). When the texture is large, it cost a lot of time. So instead of call it 3 times, my idea is to put the YUV in one single texture since a texture can have 4 channel. But I wonder that how can I update a certain channel of a texture.
I try the following code, but it seems do not work. Instead of update a channel, glTexSubImage2D changes the whole texture:
glTexImage2D(GL_TEXTURE_2D, 0, GL_RGB, frame->width, frame->height,0, GL_RED, GL_UNSIGNED_BYTE, Y);
glTexSubImage2D(GL_TEXTURE_2D,0,0,0,frame->width, frame->height, GL_GREEN,U);
glTexSubImage2D(GL_TEXTURE_2D,0,0,0,frame->width, frame->height, GL_BLUE,V);
So how can I use one texture to pass the YUV data ? I also try that gather the YUV data into one array then generate the texture. But it does not help since it need a lot of time to generate that array.
Any good idea?
You're approaching this from the wrong angle, since you don't actually understand what is causing the poor performance in the first place. Yes, texture access is a rather expensive operation. But it is not that expensive; I mean, just think about of the amount of texture data that gets pushed around in modern games at very high frame rates.
The problem is not the channel format of the texture, and it is also not the call of GLSL texture.
Your problem is this:
(…) high resolution video (6480*1920)
Plain and simple the dimensions of the frame are outside the range of what the GPU is comfortable working with. Try breaking down the picture into a set of smaller textures. Using glPixelStorei paramters GL_UNPACK_ROW_LENGTH, GL_UNPACK_SKIP_PIXELS and GL_UNPACK_SKIP_ROWS you can select the rectangle inside your source picture to copy.
You don't have to make several draw calls BTW, just select the texture inside the shader based on the target fragment position or texture coordinate.
Unfortunately OpenGL doesn't offer a convenient function to determine the sweet spot, for most GPUs these days the maximum size in either direction for dense textures is 2048. Go above it and in my experience the performance tanks for dense textures.
Sparse textures are an entirely different chapter, and irrelevant for this problem.
And just for the sake of completeness: I take it, that you don't reinitialize the texture for each and every frame with a call to glTexImage2D. Do that only once at the start of the video, then just update the texture(s).
Currently I'm creating a particle system and I would like to transfer most of the work to the GPU using OpenGL, for gaining experience and performance reasons. At the moment, there are multiple particles scattered through the space (these are currently still created on the CPU). I would more or less like to create a histogram of them. If I understand correctly, for this I would first translate all the particles from world coordinates to screen coordinates in a vertex shader. However, now I want to do the following:
So, for each pixel a hit count of how many particles are inside. Each particle will also have several properties (e.g. a colour) and I would like to sum them for every pixel (as shown in the lower-right corner). Would this be possible using OpenGL? If so, how?
The best tool I recomend for having the whole data (if it fits on GPU memory) is the use of SSBO.
Nevertheless, you need data after transforming them (e.g. by a projection). Still SSBO is your best option:
In the fragment shader you read the properties of already handled particles (let's say, the rendered pixel) and write modified properties (number of particles at this pixel, color, etc) to the same index in the buffer.
Due to parallel nature of GPU, several instances coming from different particles may be doing concurrently the work for the same index. Thus you need to handle this on your own. Read Memory model and Atomic operations
Another approach, but limited, is using Blending
The idea is that each fragment increments the actual color value of the frame buffer. This can be done using GL_FUNC_ADD for glBlendEquationSeparate and using as fragment-output-color a value of 1/255 (normalized integer) for each RGB/a component.
Limitations come from the [0-255] range: Only up to 255 particles in the same pixel, the rest amount is clamped to this range and so "lost".
You have four components RGBA, thus four properties can be handled. But can have several renderbuffers in a FBO.
You can read the FBO by glReadPixels. Use glReadBuffer first with a GL_COLOR_ATTACHMENTi if you use a FBO instead of the default frame buffer.
I'm in need of rendering an influence map in OpenGL. At present I have 100 x 100 quads rendering with a set color to represent the influence at each point on the map. I've been recommended to change my rendering method to one quad with a texture, then allowing the rendering pipeline to take over in speed.
Basic testing has shown that glTexSubImage2D is too slow for setting 10,000 texels per frame. Do you have any suggestions? Would it better to create an entirely new texture each frame? My influence map is in normalized floats (0.0 to 1.0) and that is converted to grayscale colors (1.0f = white).
Thanks :D
Are you currently updating each of the 10000 texels separately, with 10000 calls of glTexSubImage2D?
Just use one 100x100 grayscale float texture (array of 10000 floats) in RAM, update values directly to that and then send the whole data to GPU with one glTexImage2D call. You could also use buffer objects to allow the transfer happen on background, but it should be unnecessary since you are not moving very large amounts of data.
I'm rendering a certain scene into a texture and then I need to process that image in some simple way. How I'm doing this now is to read the texture using glReadPixels() and then process it on the CPU. This is however too slow so I was thinking about moving the processing to the GPU.
The simplest setup to do this I could think of is to display a simple white quad that takes up the entire viewport in an orthogonal projection and then write the image processing bit as a fragment shader. This will allow many instances of the processing to run in parallel as well as to access any pixel of the texture it requires for the processing.
Is this a viable course of action? is it common to do things this way?
Is there maybe a better way to do it?
Yes, this is the usual way of doing things.
Render something into a texture.
Draw a fullscreen quad with a shader that reads that texture and does some operations.
Simple effects (e.g. grayscale, color correction, etc.) can be done by reading one pixel and outputting one pixel in the fragment shader. More complex operations (e.g. swirling patterns) can be done by reading one pixel from offset location and outputting one pixel. Even more complex operations can be done by reading multiple pixels.
In some cases multiple temporary textures would be needed. E.g. blur with high radius is often done this way:
Render into a texture.
Render into another (smaller) texture, with a shader that computes each output pixel as average of multiple source pixels.
Use this smaller texture to render into another small texture, with a shader that does proper Gaussian blur or something.
... repeat
In all of the above cases though, each output pixel should be independent of other output pixels. It can use one more more input pixels just fine.
An example of processing operation that does not map well is Summed Area Table, where each output pixel is dependent on input pixel and the value of adjacent output pixel. Still, it is possible to do those kinds on the GPU (example pdf).
Yes, it's the normal way to do image processing. The color of the quad doesn't really matter if you'll be setting the color for every pixel. Depending on your application, you might need to careful about pixel sampling issues (i.e. ensuring that you sample from exactly the correct pixel on the source texture, rather than halfway between two pixels).