I'm in need of rendering an influence map in OpenGL. At present I have 100 x 100 quads rendering with a set color to represent the influence at each point on the map. I've been recommended to change my rendering method to one quad with a texture, then allowing the rendering pipeline to take over in speed.
Basic testing has shown that glTexSubImage2D is too slow for setting 10,000 texels per frame. Do you have any suggestions? Would it be better to create an entirely new texture each frame? My influence map is in normalized floats (0.0 to 1.0), which are converted to grayscale colors (1.0f = white).
Thanks :D
Are you currently updating each of the 10000 texels separately, with 10000 calls of glTexSubImage2D?
Just keep one 100x100 grayscale float texture (an array of 10000 floats) in RAM, update the values directly in that array, and then send the whole thing to the GPU with one glTexImage2D call. You could also use buffer objects to let the transfer happen in the background, but that should be unnecessary since you are not moving very large amounts of data.
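A minimal sketch of the CPU-side half of that idea, in case it helps. The InfluenceMap struct and its member names are made up for illustration, and the upload call in the trailing comment assumes a GL 3.0+ context with GL_R32F available; the 8-bit conversion is only there because the question mentions grayscale colors.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical CPU-side mirror of the influence map: one float per texel.
struct InfluenceMap {
    int width;
    int height;
    std::vector<float> values;  // row-major, expected range 0.0 to 1.0

    InfluenceMap(int w, int h)
        : width(w), height(h), values(static_cast<std::size_t>(w) * h, 0.0f) {
        assert(w > 0 && h > 0);
    }

    // Update a single texel in RAM; no GL call is involved here.
    void set(int x, int y, float v) {
        assert(x >= 0 && x < width && y >= 0 && y < height);
        values[static_cast<std::size_t>(y) * width + x] = v;
    }

    // Convert to 8-bit grayscale (1.0f = white), clamping out-of-range input.
    std::vector<std::uint8_t> toGrayscale() const {
        std::vector<std::uint8_t> out(values.size());
        for (std::size_t i = 0; i < values.size(); ++i) {
            const float c = std::min(1.0f, std::max(0.0f, values[i]));
            out[i] = static_cast<std::uint8_t>(c * 255.0f + 0.5f);
        }
        return out;
    }
};

// Once per frame, with a GL context current and the texture bound, the whole
// float array would go up in a single call (assumed format, adjust as needed):
//   glTexImage2D(GL_TEXTURE_2D, 0, GL_R32F, map.width, map.height, 0,
//                GL_RED, GL_FLOAT, map.values.data());
```

The point is that all 10000 writes happen in plain memory and the driver sees exactly one transfer per frame.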
Related
I have one big buffer (object) containing the MNIST dataset: many (tens of thousands) small (28x28) grayscale images, stored one-by-one in row-wise order as floats indicating pixel intensity. I would like to efficiently (i.e. somewhat interactively) blend these many images into one "average" image, where each pixel in the blended image is the average of all the pixels at that same position. Is this possible?
The options I considered are:
Using a compute shader directly on the buffer object. I would spawn imgWidth * imgHeight compute shader invocations/threads, with each invocation looping over all images. This doesn't seem very efficient, as each invocation has to loop over all images, but doing it the other way (i.e. spawning numImages invocations and walking over the pixels) still has invocations waiting on each other.
Using the graphics pipeline to draw the textures one-by-one to a framebuffer, blending them all over each other. This would still result in linear time, as each image has to be rendered to the framebuffer in turn. I'm not very familiar with framebuffers, though.
Doing it all linearly in the CPU, which seems easier and not much slower than doing it on the GPU. I would only be missing out on the parallel processing of the pixels.
Are there other possibilities I'm missing? Is there an optimal way? And if not, what do you think would be the easiest?
Most times we want to parallelize at the pixel level because there are many.
However, in your case there are not that many pixels (28x28).
The biggest number you have seems to be the number of images (thousands of images). So we would like to leverage that.
Using a compute shader, instead of iterating through all the images, you could blend the images in pairs. After each pass you would halve the number of images. Once the number of images gets very small, you might want to change the strategy, but that's something you need to experiment with to see what works best.
Compute shader dispatches can have 3 dimensions. You could have X and Y index the pixel of the image, and use Z to index the pair of images in a texture array. So for index Z, you would blend textures 2*Z and 2*Z+1.
Some implementation details you need to take into account:
Most likely, the number of images won't be a power of two. So at some point the number of images will be odd.
Since you are working with lots of images, you could run into float precision issues. You might need to use float textures, or adapt the strategy so this is not a problem.
Usually compute shaders work best when the threads process tiles of 2x2 pixels instead of individual pixels.
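To make the pairing and the odd-count case concrete, here is a CPU reference sketch of the reduction. It is not a compute shader: the function name averageByPairs is invented for illustration, each loop over z stands in for one dispatch, and all images are assumed to have the same size. It carries sums between passes and divides once at the end, which is one possible way to keep an unpaired image from skewing the mean.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Image = std::vector<float>;

// CPU reference for the pairwise reduction. Sums are carried between passes
// and divided once at the end, so an odd image count does not skew the mean.
Image averageByPairs(std::vector<Image> images) {
    assert(!images.empty());
    const std::size_t total = images.size();
    const std::size_t pixels = images[0].size();
    std::size_t n = total;
    while (n > 1) {
        const std::size_t half = n / 2;
        // One "dispatch": slot z receives the sum of images 2*z and 2*z+1.
        for (std::size_t z = 0; z < half; ++z) {
            for (std::size_t p = 0; p < pixels; ++p) {
                images[z][p] = images[2 * z][p] + images[2 * z + 1][p];
            }
        }
        if (n % 2 == 1) {
            // Odd count: the unpaired last image moves on to the next pass unchanged.
            images[half] = images[n - 1];
            n = half + 1;
        } else {
            n = half;
        }
    }
    Image mean = images[0];
    for (float& v : mean) {
        v /= static_cast<float>(total);
    }
    return mean;
}
```

On the GPU the in-place write would probably become a ping-pong between two texture arrays, since other invocations may still be reading the slot being written.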
This is how I do it.
Render all the textures to the framebuffer, which can also be the default framebuffer.
Once rendering is completed, read the data back from the framebuffer:
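// select the source: attachment 0 of the bound FBO (use GL_BACK for the default framebuffer)
glReadBuffer(GL_COLOR_ATTACHMENT0);
glBindBuffer(GL_PIXEL_PACK_BUFFER, w_pbo[w_writeIndex]);
// copy from framebuffer to PBO asynchronously; it will be ready in the NEXT frame
glReadPixels(0, 0, SCR_WIDTH, SCR_HEIGHT, GL_RGBA, GL_UNSIGNED_BYTE, nullptr);
// now map the other PBO, whose contents should already have been transferred
glBindBuffer(GL_PIXEL_PACK_BUFFER, w_pbo[w_readIndex]);
unsigned char* Data = (unsigned char*)glMapBuffer(GL_PIXEL_PACK_BUFFER, GL_READ_ONLY);
if (Data) {
    // ... process the pixels here ...
    glUnmapBuffer(GL_PIXEL_PACK_BUFFER);  // unmap before the buffer is written again
}
glBindBuffer(GL_PIXEL_PACK_BUFFER, 0);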
Currently I'm creating a particle system and I would like to transfer most of the work to the GPU using OpenGL, for gaining experience and performance reasons. At the moment, there are multiple particles scattered through the space (these are currently still created on the CPU). I would more or less like to create a histogram of them. If I understand correctly, for this I would first translate all the particles from world coordinates to screen coordinates in a vertex shader. However, now I want to do the following:
So, for each pixel a hit count of how many particles are inside. Each particle will also have several properties (e.g. a colour) and I would like to sum them for every pixel (as shown in the lower-right corner). Would this be possible using OpenGL? If so, how?
The best tool I can recommend for keeping the whole data set on the GPU (if it fits in GPU memory) is an SSBO (Shader Storage Buffer Object).
You need the data after it has been transformed (e.g. by a projection), but an SSBO is still your best option:
In the fragment shader you read the properties already accumulated for the rendered pixel and write the modified properties (number of particles at this pixel, color, etc.) back to the same index in the buffer.
Due to the parallel nature of the GPU, several invocations coming from different particles may be working on the same index concurrently, so you need to handle this yourself. Read up on the OpenGL memory model and atomic operations.
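For reference, here is a serial CPU sketch of what that accumulation is supposed to compute. Particle, PixelBins and accumulate are illustrative names, positions are assumed to be already projected to normalized device coordinates, and the color sum is just one example property. It runs on one thread, so it sidesteps the concurrency issue; the marked lines are where a shader would need atomic operations.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

struct Particle {
    float x, y;     // assumed already projected to normalized device coordinates (-1 to 1)
    float r, g, b;  // example per-particle property to sum per pixel
};

struct PixelBins {
    int width, height;
    std::vector<unsigned> count;          // hit count per pixel
    std::vector<float> sumR, sumG, sumB;  // summed color per pixel

    PixelBins(int w, int h)
        : width(w), height(h),
          count(static_cast<std::size_t>(w) * h, 0u),
          sumR(count.size(), 0.0f), sumG(count.size(), 0.0f), sumB(count.size(), 0.0f) {
        assert(w > 0 && h > 0);
    }
};

void accumulate(PixelBins& bins, const std::vector<Particle>& particles) {
    for (const Particle& p : particles) {
        // Particles outside the view volume contribute nothing.
        if (p.x < -1.0f || p.x >= 1.0f || p.y < -1.0f || p.y >= 1.0f) continue;
        // Map NDC to a pixel index; the clamp guards against float rounding at the edge.
        const int px = std::min(bins.width - 1,
                                static_cast<int>((p.x * 0.5f + 0.5f) * bins.width));
        const int py = std::min(bins.height - 1,
                                static_cast<int>((p.y * 0.5f + 0.5f) * bins.height));
        const std::size_t i = static_cast<std::size_t>(py) * bins.width + px;
        // In a shader these read-modify-write steps need atomic operations,
        // because many invocations can hit the same index at once.
        bins.count[i] += 1;
        bins.sumR[i] += p.r;
        bins.sumG[i] += p.g;
        bins.sumB[i] += p.b;
    }
}
```

A CPU version like this is also handy for checking the GPU result on a small particle set.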
Another, more limited, approach is blending.
The idea is that each fragment increments the current color value in the framebuffer. This can be done using GL_FUNC_ADD with glBlendEquationSeparate and a fragment output color of 1/255 (normalized integer) for each RGB/A component.
The limitation comes from the [0-255] range: only up to 255 particles can be counted in the same pixel; anything beyond that is clamped and so "lost".
You have four components (RGBA), so four properties can be handled, but you can have several renderbuffers in an FBO.
You can read the FBO back with glReadPixels. Call glReadBuffer first with a GL_COLOR_ATTACHMENTi if you use an FBO instead of the default framebuffer.
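The clamping limit is easy to see in a tiny model of a single 8-bit channel. This is only a simulation of the arithmetic, assuming additive blending with source and destination factors of one (the blend factors are an assumption, the text above only names the blend equation); blendHits is a made-up name.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Models one 8-bit normalized channel under additive blending where every
// fragment outputs 1/255, i.e. adds one step to the stored value.
std::uint8_t blendHits(int fragments) {
    assert(fragments >= 0);
    int stored = 0;
    for (int i = 0; i < fragments; ++i) {
        stored = std::min(255, stored + 1);  // the framebuffer clamps to [0, 255]
    }
    return static_cast<std::uint8_t>(stored);
}
```

So blendHits(300) gives the same stored value as blendHits(255): the extra 45 particles are simply not recorded.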
I would like to perform a blur on a 3D texture in OpenGL. Since the blur is separable I should be able to do it in 3 passes. My question is, what would be the best way to go about it?
I currently have the 3D texture and fill it using imageStore. Should I create other 2 copies of the texture for the blur or is there a way to do it while using a single texture?
I am already using glCompute to compute the mip map of the 3D texture, but in this case I read from the texture at level 0 and write to the one at the next level so there is no conflict, while in this case I would need some copy.
In short, it can't be done in 3 passes, because it is not a 2D image, even if the kernel is separable.
You have to blur each image slice separately, which is 2 passes per slice (if you are using a 256x256x256 texture then you have 512 passes just for blurring along the U and V coordinates). Then you still have to blur along the T and U (or T and V, it makes no difference) coordinates, which is another 512 passes. You can gain some performance by using the bilinear filter and reading values between texels to save some constant processing cost. The 3D blur will be very costly.
Performance tip: maybe you don't need to blur the whole texture but only a part of it? (the visible part?)
The problem with such a high number of passes is the number of interactions between GPU and CPU: draw calls and FBO setup, which are both slow operations that stall the CPU (a different API with lower CPU overhead would probably be faster).
Try to not separate the kernel:
If you have a small kernel (I guess up to 5^3; only profiling will show the maximum kernel size), probably the fastest way is to NOT separate the kernel (that is, you save a lot of draw calls and FBO binds and push all the work onto GPU fill rate and bandwidth).
Spread work over time:
It does not matter whether your kernel is separated or not. Instead of computing a Gaussian blur every frame, you could compute it just once per second (maybe with a bigger kernel). Then, as the source of "continuous blurring data", you use the interpolation between the previous blur and the next blur (which is two 3D texture samples each frame, much cheaper than continuously blurring).
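The interpolation step is just a per-voxel lerp. A CPU sketch, assuming both blurred volumes are stored as flat float arrays of equal size and that t runs from 0 to 1 between two blur updates (blendVolumes is an illustrative name; in a shader this would be two texture fetches and a mix):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Linearly interpolates between the previous and the next blurred volume.
// t = 0 returns the previous blur, t = 1 the next one.
std::vector<float> blendVolumes(const std::vector<float>& previous,
                                const std::vector<float>& next,
                                float t) {
    assert(previous.size() == next.size());
    assert(t >= 0.0f && t <= 1.0f);
    std::vector<float> out(previous.size());
    for (std::size_t i = 0; i < previous.size(); ++i) {
        out[i] = previous[i] + (next[i] - previous[i]) * t;
    }
    return out;
}
```

With one blur per second, t would be the fraction of the second elapsed since the last blur finished.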
My game renders lots of cubes, each of which randomly has 1 of 12 textures. I already Z order the geometry, so I can't just render all the cubes with texture 1, then 2, then 3, etc., because that would defeat the Z ordering. I already keep track of the previous texture, and if they are equal I do not call glBindTexture, but there are still way too many calls to it. What else can I do?
Thanks
The ultimate and fastest way would be to have an array of textures (normal ones or cubemaps). Then dynamically fetch the texture slice according to an id stored in each cube's instance data or face data (if you want a different texture per cube face), looked up using the GLSL built-ins gl_InstanceID or gl_PrimitiveID.
With this implementation you would bind your texture array just once.
This would of course require the gpu_shader4 and texture_array extensions:
http://developer.download.nvidia.com/opengl/specs/GL_EXT_gpu_shader4.txt
http://developer.download.nvidia.com/opengl/specs/GL_EXT_texture_array.txt
I have used this mechanism (using D3D10, but principle applies too) and it worked very well.
I had to map different textures onto sprites (3D points of a constant screen size of 9x9 or 15x15 pixels, IIRC), each texture indicating a different meaning for the user.
Edit:
If you don't feel comfortable with all the shader stuff, I would simply sort the cubes by texture and not Z order the geometry, then measure the performance gain.
I would also try adding a pre-Z pass where you render all your cubes into the Z buffer only, then render the normal scene, and see if it speeds things up (if you are fragment bound, it could help).
You can pack your textures into one texture and offset the texture coordinates accordingly
glMatrixMode(GL_TEXTURE) will also allow you to perform transformations on the texture space (to avoid changing all the texture coords)
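A sketch of the coordinate offsetting for a packed texture, assuming the 12 textures sit in a regular grid (say 4 columns by 3 rows) and tiles are numbered row by row. The UV struct, the atlasUV function and the grid layout are assumptions for illustration, not something the question specifies.

```cpp
#include <cassert>

struct UV {
    float u;
    float v;
};

// Maps a per-tile coordinate (u, v in 0..1) into the packed texture.
// Tiles are assumed to be laid out row by row in a cols x rows grid.
UV atlasUV(int tile, int cols, int rows, float u, float v) {
    assert(cols > 0 && rows > 0);
    assert(tile >= 0 && tile < cols * rows);
    const int col = tile % cols;
    const int row = tile / cols;
    return UV{(static_cast<float>(col) + u) / static_cast<float>(cols),
              (static_cast<float>(row) + v) / static_cast<float>(rows)};
}
```

For example, atlasUV(5, 4, 3, 0.5f, 0.5f) lands in the middle of the tile in column 1, row 1. Filtering and mipmapping can bleed between neighbouring tiles, so some padding around each tile is usually worth considering.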
Also from NVIDIA:
Bindless Graphics
I'm rendering a certain scene into a texture and then I need to process that image in some simple way. How I'm doing this now is to read the texture using glReadPixels() and then process it on the CPU. This is however too slow so I was thinking about moving the processing to the GPU.
The simplest setup to do this I could think of is to display a simple white quad that takes up the entire viewport in an orthogonal projection and then write the image processing bit as a fragment shader. This will allow many instances of the processing to run in parallel as well as to access any pixel of the texture it requires for the processing.
Is this a viable course of action? Is it common to do things this way?
Is there maybe a better way to do it?
Yes, this is the usual way of doing things.
Render something into a texture.
Draw a fullscreen quad with a shader that reads that texture and does some operations.
Simple effects (e.g. grayscale, color correction, etc.) can be done by reading one pixel and outputting one pixel in the fragment shader. More complex operations (e.g. swirling patterns) can be done by reading one pixel from offset location and outputting one pixel. Even more complex operations can be done by reading multiple pixels.
In some cases multiple temporary textures would be needed. E.g. blur with high radius is often done this way:
Render into a texture.
Render into another (smaller) texture, with a shader that computes each output pixel as average of multiple source pixels.
Use this smaller texture to render into another small texture, with a shader that does proper Gaussian blur or something.
... repeat
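The "render into a smaller texture" step above can be sketched on the CPU as a plain 2x2 box average. This assumes a single-channel, row-major image with even dimensions; downsample2x is an illustrative name, and a fragment shader version would compute the same thing per output pixel.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Halves a row-major single-channel image; each output pixel is the mean
// of the corresponding 2x2 block of source pixels.
std::vector<float> downsample2x(const std::vector<float>& src, int w, int h) {
    assert(w > 0 && h > 0 && w % 2 == 0 && h % 2 == 0);
    assert(src.size() == static_cast<std::size_t>(w) * h);
    const int ow = w / 2;
    const int oh = h / 2;
    std::vector<float> dst(static_cast<std::size_t>(ow) * oh);
    for (int y = 0; y < oh; ++y) {
        for (int x = 0; x < ow; ++x) {
            // Index of the top-left source pixel of this 2x2 block.
            const std::size_t i = static_cast<std::size_t>(2 * y) * w + 2 * x;
            dst[static_cast<std::size_t>(y) * ow + x] =
                0.25f * (src[i] + src[i + 1] + src[i + w] + src[i + w + 1]);
        }
    }
    return dst;
}
```

Note that every output pixel only reads from the source image, which is what makes this kind of pass map well to a fragment shader.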
In all of the above cases though, each output pixel should be independent of other output pixels. It can use one or more input pixels just fine.
An example of a processing operation that does not map well is a Summed Area Table, where each output pixel depends on an input pixel and the values of adjacent output pixels. Still, it is possible to do those kinds of operations on the GPU (example pdf).
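To show where the dependency comes from, here is the straightforward serial way to build a summed area table on the CPU (summedAreaTable is an illustrative name, single-channel row-major input assumed). Each entry reads three previously written outputs, which is exactly what a one-pixel-in, one-pixel-out fragment shader pass cannot do.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// sat[y][x] holds the sum of all source pixels in the rectangle (0,0)..(x,y).
std::vector<float> summedAreaTable(const std::vector<float>& src, int w, int h) {
    assert(w > 0 && h > 0);
    assert(src.size() == static_cast<std::size_t>(w) * h);
    std::vector<float> sat(src.size());
    for (int y = 0; y < h; ++y) {
        for (int x = 0; x < w; ++x) {
            const std::size_t i = static_cast<std::size_t>(y) * w + x;
            // These three reads are outputs computed earlier in the loop.
            const float left = (x > 0) ? sat[i - 1] : 0.0f;
            const float up = (y > 0) ? sat[i - w] : 0.0f;
            const float diag = (x > 0 && y > 0) ? sat[i - w - 1] : 0.0f;
            sat[i] = src[i] + left + up - diag;
        }
    }
    return sat;
}
```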
Yes, it's the normal way to do image processing. The color of the quad doesn't really matter if you'll be setting the color for every pixel. Depending on your application, you might need to be careful about pixel sampling issues (i.e. ensuring that you sample from exactly the correct pixel on the source texture, rather than halfway between two pixels).
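On the sampling point: with normalized coordinates, the center of texel i in a texture that is size texels wide is at (i + 0.5) / size, not i / size. A small sketch with made-up helper names:

```cpp
#include <cassert>

// Normalized coordinate of the center of texel i in a texture `size` texels wide.
float texelCenter(int i, int size) {
    assert(size > 0 && i >= 0 && i < size);
    return (static_cast<float>(i) + 0.5f) / static_cast<float>(size);
}

// Texel that a normalized coordinate in [0, 1) falls into.
int texelIndex(float u, int size) {
    assert(size > 0 && u >= 0.0f && u < 1.0f);
    return static_cast<int>(u * static_cast<float>(size));
}
```

Sampling at i / size instead sits on the boundary between texel i - 1 and texel i, so with linear filtering you get a 50/50 mix of the two.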