I have an application where I need to take the average intensity of an image, for around 1 million images. It "feels" like a job for a GPU fragment shader, but fragment shaders are for per-pixel local computations, while image averaging is a global operation.
One approach I considered is loading the image into a texture, applying a 2x2 box blur, loading the result back into an N/2 x N/2 texture, and repeating until the output is 1x1. However, this would take about log2(N) applications of the shader.
Is there a way to do it in one pass? Or should I just break down and use CUDA/OpenCL?

The summation operation is a specific case of a "reduction," a standard operation in CUDA and OpenCL libraries. A nice writeup on it is available on the CUDA demos page. In CUDA, Thrust and CUDPP are just two examples of libraries that provide reduction. I'm less familiar with OpenCL, but CLPP seems to be a good library that provides reduction. Just copy your color buffer to an OpenGL pixel buffer object and use the appropriate OpenGL interoperability call to make that pixel buffer's memory accessible in CUDA/OpenCL.
If it must be done using the OpenGL API (as the original question required), the solution is to render to a texture, create a mipmap of the texture, and read back the 1x1 level. You have to set the filtering right (bilinear is appropriate, I think), but it should get close to the right answer, modulo precision error.
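To make the idea of a reduction concrete, here is a minimal CPU-side sketch in Python of the pairwise (tree) summation pattern that GPU reductions are built on. It is not the Thrust, CUDPP or CLPP implementation; the function names tree_reduce_sum and mean_intensity are made up for illustration, and each iteration of the inner loop stands in for what one GPU thread would do.

```python
def tree_reduce_sum(values):
    """Sum values with the pairwise (tree) pattern used by parallel reductions."""
    data = list(values)
    n = len(data)
    while n > 1:
        half = (n + 1) // 2            # elements that survive this step
        for i in range(n // 2):        # on a GPU, each i would be one thread
            data[i] += data[i + half]
        n = half
    return data[0] if data else 0


def mean_intensity(pixels):
    """Average intensity of a flat list of pixel values."""
    return tree_reduce_sum(pixels) / len(pixels)
```

The number of steps grows with log2 of the element count, which is why the same pattern shows up in the repeated-halving approaches discussed in this thread.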

My gut tells me to attempt your implementation in OpenCL. You can optimize for your image size and graphics hardware by breaking up the images into bespoke chunks of data that are then summed in parallel. Could be very fast indeed.
Fragment shaders are great for convolutions, but there the result is written to gl_FragColor for each pixel, so the fit is natural. For an average, you ultimately have to loop over every pixel in the texture and sum the result, which is then read back in the main program. Generating image statistics is perhaps not what the fragment shader was designed for, and it's not clear that a major performance gain is to be had, since it's not guaranteed that a particular buffer is located in GPU memory.
It sounds like you may be applying this algorithm to a real-time motion detection scenario, or some other automated feature detection application. It may be faster to compute some statistics from a sample of pixels rather than the entire image and then build a machine learning classifier.
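As a rough illustration of computing a statistic from a sample of pixels instead of the whole image, here is a small Python sketch that averages every stride-th pixel. The function name sampled_mean and the strided sampling scheme are my own assumptions, not something from the question; a regular stride can alias with periodic image content, in which case a random sample may be the safer choice.

```python
def sampled_mean(pixels, stride):
    """Mean of every stride-th pixel of a flat list of intensities."""
    sample = pixels[::stride]
    return sum(sample) / len(sample)
```

With stride 10 this touches a tenth of the data, at the cost of an estimate that only approximates the true mean.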
You don't need CUDA if you'd like to stick to GLSL. As in the CUDA solution mentioned here, it can be done in a fragment shader in a straightforward way. However, you need about log2(resolution) draw calls.
Just set up a shader that takes 2x2 pixel samples from the original image and outputs their average. The result is an image with half the resolution along both axes. Repeat that until the image is 1x1 px.
Some considerations: Use GL_FLOAT luminance textures if available, to get a more precise sum. Use glViewport to quarter the rendering area in each stage. The result then ends up in the corner pixel of your framebuffer at the viewport origin (the lower left in OpenGL window coordinates).
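The repeated 2x2 averaging described above can be sketched on the CPU like this. It is a Python stand-in for the shader passes, assuming a square image whose side is a power of two; the names halve and average_by_pyramid are hypothetical.

```python
def halve(image):
    """Average each 2x2 block; image is a list of rows with even side lengths."""
    h, w = len(image), len(image[0])
    return [
        [
            (image[y][x] + image[y][x + 1]
             + image[y + 1][x] + image[y + 1][x + 1]) / 4.0
            for x in range(0, w, 2)
        ]
        for y in range(0, h, 2)
    ]


def average_by_pyramid(image):
    """Return (mean, passes), halving a square power-of-two image down to 1x1."""
    passes = 0
    while len(image) > 1:
        image = halve(image)   # one shader pass / draw call in the GPU version
        passes += 1
    return image[0][0], passes
```

For a 4x4 image this takes 2 passes and for an 8x8 image 3, which is the log2(resolution) draw-call count mentioned above.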


I'm currently implementing automatically adapting exposure for use with HDR in OpenGL. For this I need to retrieve the average brightness of all pixels in the previous frame.
I've not managed to find any solid explanations of how to do this. As far as I can see there are two ways to go about it.
1. Use glReadPixels to copy the framebuffer to memory and average them on the CPU. This is likely to be painfully slow and doesn't make good use of the GPU.
2. Take the frame and render it to successively smaller FBOs using linear filtering. This lets the GPU do most of the work but it's going to require a lot of FBOs (roughly 10 for a 1080p screen).
There has got to be a better way of getting average scene brightness. Does anyone have any suggestions?
There are two options that come into my mind:
1. Using glGenerateMipmap, which calculates the average of a 2x2 window, leaving you with the average scene brightness at the smallest level. This can be retrieved using the textureLod function in a shader. Since each mipmap level has half the size of the previous one, the 1x1 level is floor(log2(max)), where max is the larger of the texture's width and height (not GL_MAX_TEXTURE_SIZE, which is only the upper limit on texture dimensions).
2. Using compute shaders to do basically the same thing glGenerateMipmap does, but with a bigger window size, which could potentially be faster (although I never tested this).
Your Option 2 is not much different from using glGenerateMipmap on the texture, just that you don't need to hassle with any client side objects like FBOs. So basically, rendering to mipmap level 0 of the texture, letting the GL generate the mipmap pyramid, and reading back just the highest level 1x1 image is probably the easiest way to get some approximation of the average color value.
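To find which mipmap level holds the 1x1 image, the level sizes can be worked out ahead of time. The following Python sketch assumes the usual rule that each level halves the previous one, rounding down and never going below 1; mip_chain and last_mip_level are names I made up for this example.

```python
def mip_chain(width, height):
    """Sizes of every mipmap level, from level 0 down to 1x1."""
    sizes = [(width, height)]
    while width > 1 or height > 1:
        width, height = max(1, width // 2), max(1, height // 2)
        sizes.append((width, height))
    return sizes


def last_mip_level(width, height):
    """Index of the 1x1 level: floor(log2(max(width, height)))."""
    return max(width, height).bit_length() - 1
```

For a 1920x1080 frame this gives 11 levels with the 1x1 image at level 10, in line with the "roughly 10" FBOs estimated in the question.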

I'm trying to find a clever way to render a large spectrogram (say, fullscreen). A spectrogram is a coordinate-system, where the x-axis is time, the y-axis is frequency and the colour intensity is the magnitude of the frequency component, and it looks like this (youtube).
What's interesting to note is that each frame, one new column (1 pixel wide) is added, but the rest of the spectrum is the same, only shifted left by one pixel. Currently I'm just writing to a circular software buffer acting like an image, and drawing that, but it is obviously slow at high framerates and screen sizes.
Is there any obvious solution to this problem, using OpenGL (or some software trick, though it has to be cross-platform)? Perhaps through some use of a buffer in GPU memory, with a shader that fills it (admittedly, I have a very vague understanding of OpenGL beyond drawing simple stuff)? As I see it, it revolves around keeping the old data in GPU memory.
Use a single channel texture for the waterfall (this is what you're drawing, a waterfall plot) in which you update one column or row at a time using glTexSubImage2D. By using the GL_REPEAT wrap mode you can simply advance the texture coordinates beyond the bounds of the texture and it will, well, wrap. By moving the texture in the direction opposite to the updates you get the waterfall effect (i.e. a moving spectrogram, with the updates coming in at the right edge).
To give the whole thing color, use the texture's values as an index into a transfer function LUT texture using a fragment shader.
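The bookkeeping behind the wrapping texture can be sketched in Python. This is only a CPU stand-in: the Waterfall class and its methods are hypothetical, the nested list plays the role of the GPU texture, and the modulo in visible_row mimics what a wrapping texture lookup with an offset coordinate would do.

```python
class Waterfall:
    def __init__(self, width, height):
        self.width = width
        # Stand-in for the single-channel GPU texture.
        self.texture = [[0.0] * width for _ in range(height)]
        self.next_col = 0  # column that the next update overwrites

    def push_column(self, column):
        """Overwrite one column, like a one-column-wide sub-image upload."""
        for y, value in enumerate(column):
            self.texture[y][self.next_col] = value
        self.next_col = (self.next_col + 1) % self.width

    def visible_row(self, y):
        """Read a row with the coordinate offset applied, wrapping at the edge."""
        return [
            self.texture[y][(self.next_col + x) % self.width]
            for x in range(self.width)
        ]
```

Only one column is written per frame; the oldest column appears at the left edge and the newest at the right, without ever moving the stored data.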
You can use the GPU library for spectrogram calculations: nnAudio

I would like to perform a blur on a 3D texture in OpenGL. Since it is separable I should be able to do it in 3 passes. My question is, what would be the best way to cope with it?
I currently have the 3D texture and fill it using imageStore. Should I create other 2 copies of the texture for the blur or is there a way to do it while using a single texture?
I am already using a compute shader to compute the mipmaps of the 3D texture, but in that case I read from the texture at level 0 and write to the next level, so there is no conflict, while for the blur I would need some copy.
In short, it can't be done in 3 passes, because it is not a 2D image, even if the kernel is separable.
You have to blur each image slice separately, which is 2 passes per slice (if you are using a 256x256x256 texture then you have 512 passes just for blurring along the U and V coordinates). Then you still have to blur along T and U (or T and V, it makes no difference), which is another 512 passes. You can gain performance by using the bilinear filter and reading values between texels to save some constant processing cost. The 3D blur will be very costly.
Performance tip: maybe you don't need to blur the whole texture but only a part of it? (the visible part?)
The problem with such a high number of passes is the number of interactions between GPU and CPU: draw calls and FBO setup, which are both slow operations that stall the CPU (a different API with lower CPU overhead would probably be faster).
Try to not separate the kernel:
If you have a small kernel (I guess up to 5^3, only profiling will show the max kernel size), probably the fastest way is to NOT separate the kernel (that is, you save a lot of draw calls and FBO binding and leave everything to GPU fillrate and bandwidth).
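To show what is being traded off, here is a small Python sketch comparing a separated 3-tap box blur (one pass per axis, 9 taps per texel in total) with the unseparated 3x3x3 version (a single pass, 27 taps per texel). The box kernel, the clamped edges and all function names are assumptions chosen to keep the example short; a Gaussian kernel separates the same way.

```python
def clamp(i, size):
    return min(max(i, 0), size - 1)


def blur_axis(vol, axis):
    """3-tap box blur along one axis with clamped edges; vol[z][y][x]."""
    dims = (len(vol), len(vol[0]), len(vol[0][0]))
    d, h, w = dims
    out = [[[0.0] * w for _ in range(h)] for _ in range(d)]
    for z in range(d):
        for y in range(h):
            for x in range(w):
                total = 0.0
                for off in (-1, 0, 1):
                    p = [z, y, x]
                    p[axis] = clamp(p[axis] + off, dims[axis])
                    total += vol[p[0]][p[1]][p[2]]
                out[z][y][x] = total / 3.0
    return out


def blur_separable(vol):
    """Three passes, one per axis."""
    for axis in range(3):
        vol = blur_axis(vol, axis)
    return vol


def blur_full(vol):
    """One pass with the unseparated 3x3x3 kernel."""
    d, h, w = len(vol), len(vol[0]), len(vol[0][0])
    out = [[[0.0] * w for _ in range(h)] for _ in range(d)]
    for z in range(d):
        for y in range(h):
            for x in range(w):
                total = 0.0
                for dz in (-1, 0, 1):
                    for dy in (-1, 0, 1):
                        for dx in (-1, 0, 1):
                            total += vol[clamp(z + dz, d)][clamp(y + dy, h)][clamp(x + dx, w)]
                out[z][y][x] = total / 27.0
    return out
```

Both give the same result up to rounding, so the choice is between more taps per texel in one pass and fewer taps spread over several passes. Which one wins is something only profiling will tell.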
Spread work over time:
It does not matter whether your kernel is separated or not. Instead of computing a Gaussian blur every frame, you could compute it just once every second (maybe with a bigger kernel). Then, as the source of "continuous blurring data", you use the interpolation of the previous blur and the next blur (which is 2 3D texture samples each frame, much cheaper than continuously blurring).
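The per-frame blend between two blurred volumes could look roughly like this. It is a Python sketch with the volumes flattened to plain lists; lerp_volume, blend_factor and the frames_per_blur parameter are names I am assuming for illustration.

```python
def blend_factor(frame, frames_per_blur):
    """Fraction of the way from the previous blur to the next one at this frame."""
    return (frame % frames_per_blur) / frames_per_blur


def lerp_volume(prev_blur, next_blur, t):
    """Blend two flattened volumes; t=0 gives prev_blur, t=1 gives next_blur."""
    return [a + (b - a) * t for a, b in zip(prev_blur, next_blur)]
```

In a shader this would be two 3D texture fetches and a mix per sample. Note that the "next" blur has to exist before you can interpolate towards it, so the displayed result lags behind the source data.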

I understand the idea behind the bloom/glow effect: we downsample the texture to keep our convolution kernels small. Now that I am trying to implement it, I am not quite sure which road I should take.
My first idea was to use glGenerateMipmap to do the downsampling. However, I cannot tell it to stop after, say, 4 steps. It's a bit of a black box for me, and for all I know, it may generate 10 images to sample my screen from 1024*768 down to 1*1. Maybe these last steps are cheap because everything is so small already, but maybe they are not.
I googled around and found that many people were relying on FBOs rather than glGenerateMipmap. I am familiar with FBOs since I use deferred lighting. My second idea is to simply render a 'quad' with a linear sampler into a smaller texture. I would do that four times in a row, halving width and height each time. However, I found that some people preferred using their own fragment shader for downsampling rather than relying on GL_LINEAR and I wonder why; maybe it is faster?
What would be a way to quickly downsample my full-screen texture 4 times in a row, keeping each version? I have no need for fancy edge-preserving sampling algorithms as I am going to blur everything anyway.
"we downsample the texture to keep our convolution kernels small."
Or you simply render the bloom/glow layer at a smaller resolution in the first place. This saves fillrate, and you don't have to minify afterwards.
"My second idea is to simply render a 'quad' with a linear sampler into a smaller texture."
This is not downsampling at all. It's linear interpolation between sampling points and may create artifacts.
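For reference, an explicit downsampling chain that keeps every level, as asked for in the question, could be sketched like this in Python. It uses a plain 2x2 box average per step, which is just one possible filter for a custom downsampling shader and not necessarily what the people mentioned in the question use; box_downsample and bloom_chain are hypothetical names, and the image sides are assumed to stay even for all four steps.

```python
def box_downsample(image):
    """Halve width and height by averaging each 2x2 block of texels."""
    h, w = len(image) // 2, len(image[0]) // 2
    return [
        [
            (image[2 * y][2 * x] + image[2 * y][2 * x + 1]
             + image[2 * y + 1][2 * x] + image[2 * y + 1][2 * x + 1]) / 4.0
            for x in range(w)
        ]
        for y in range(h)
    ]


def bloom_chain(image, steps=4):
    """Return [full, 1/2, 1/4, ...] resolution versions, keeping every level."""
    levels = [image]
    for _ in range(steps):
        levels.append(box_downsample(levels[-1]))
    return levels
```

Each level reads every texel of the level above exactly once, so a single bright texel still contributes its share to the smallest image instead of being skipped.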

I'm rendering a certain scene into a texture and then I need to process that image in some simple way. How I'm doing this now is to read the texture using glReadPixels() and then process it on the CPU. This is however too slow so I was thinking about moving the processing to the GPU.
The simplest setup to do this I could think of is to display a simple white quad that takes up the entire viewport in an orthogonal projection and then write the image processing bit as a fragment shader. This will allow many instances of the processing to run in parallel as well as to access any pixel of the texture it requires for the processing.
Is this a viable course of action? Is it common to do things this way?
Is there maybe a better way to do it?
Yes, this is the usual way of doing things.
1. Render something into a texture.
2. Draw a fullscreen quad with a shader that reads that texture and does some operations.
Simple effects (e.g. grayscale, color correction, etc.) can be done by reading one pixel and outputting one pixel in the fragment shader. More complex operations (e.g. swirling patterns) can be done by reading one pixel from an offset location and outputting one pixel. Even more complex operations can be done by reading multiple pixels.
In some cases multiple temporary textures would be needed. E.g. blur with high radius is often done this way:
1. Render into a texture.
2. Render into another (smaller) texture, with a shader that computes each output pixel as the average of multiple source pixels.
3. Use this smaller texture to render into another small texture, with a shader that does a proper Gaussian blur or something.
4. ... repeat
In all of the above cases though, each output pixel should be independent of other output pixels. It can use one or more input pixels just fine.
An example of a processing operation that does not map well is a Summed Area Table, where each output pixel depends on the input pixel and the value of an adjacent output pixel. Still, it is possible to do those kinds of operations on the GPU (example pdf).
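To see where the dependency between output pixels comes from, here is a straightforward sequential summed area table in Python. This is the plain CPU formulation, not the GPU technique from the linked PDF; summed_area_table and box_sum are names chosen for this example.

```python
def summed_area_table(image):
    """sat[y][x] = sum of image[0..y][0..x]; each entry builds on earlier outputs."""
    h, w = len(image), len(image[0])
    sat = [[0] * w for _ in range(h)]
    for y in range(h):
        row_sum = 0
        for x in range(w):
            row_sum += image[y][x]                       # depends on the pixels to the left
            above = sat[y - 1][x] if y > 0 else 0        # depends on the output above
            sat[y][x] = row_sum + above
    return sat


def box_sum(sat, x0, y0, x1, y1):
    """Sum of image[y0..y1][x0..x1] (inclusive) from at most four table lookups."""
    total = sat[y1][x1]
    if x0 > 0:
        total -= sat[y1][x0 - 1]
    if y0 > 0:
        total -= sat[y0 - 1][x1]
    if x0 > 0 and y0 > 0:
        total += sat[y0 - 1][x0 - 1]
    return total
```

Because sat[y][x] needs values the same pass has already written, it cannot be computed as one independent fragment per output pixel without restructuring the algorithm.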
Yes, it's the normal way to do image processing. The color of the quad doesn't really matter if you'll be setting the color for every pixel. Depending on your application, you might need to be careful about pixel sampling issues (i.e. ensuring that you sample from exactly the correct pixel on the source texture, rather than halfway between two pixels).
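The sampling issue comes down to where texel centers sit in normalized texture coordinates. A small Python sketch, assuming the usual convention that texel i of a size-texel-wide texture covers the range from i/size to (i+1)/size; texel_center and nearest_texel are hypothetical helper names.

```python
def texel_center(i, size):
    """Normalized coordinate of the center of texel i."""
    return (i + 0.5) / size


def nearest_texel(u, size):
    """Texel index that a nearest-neighbour lookup at coordinate u would hit."""
    return min(int(u * size), size - 1)
```

Sampling at i/size instead lands exactly on the border between texel i-1 and texel i, which is the "halfway between two pixels" case: with linear filtering you would get a blend of both.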