How do I speed up my Offscreen OpenGL pointcloud warp rendering code? - c++

I'm working on a visual odometry algorithm that tracks the movement of the camera between images. An integral part of this algorithm is being able to generate incremental dense warped images of a reference image, where each pixel has a corresponding depth (so it can be considered a point cloud of width x height dimensions).
I haven't had much experience working with OpenGL in the past, but having gone through a few tutorials, I managed to set up an offscreen rendering pipeline that takes in a transformation matrix and renders the point cloud from the new perspective. I'm using VBOs to load the data onto the GPU, renderbuffers to render into, and glReadPixels() to read the result into CPU memory.
On my Nvidia card, I can render at ~1 ms per warp. Is that the fastest I can render the data (640x480 3D points)? This step is proving to be a major bottleneck for my algorithm, so I'd really appreciate any performance tips!
(I thought that one optimization could be rendering only in grayscale, since I don't really care about colour, but it seems like internally OpenGL uses colour anyway)
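For reference, a minimal sketch of the kind of offscreen pipeline described above (this is not the code from the gists linked below; sizes, formats, and names are illustrative):

```cpp
// Minimal sketch of an offscreen render + readback pipeline (illustrative only).
// Assumes a valid GL context and an extension loader (GLEW/GLAD) are already set up.
#include <GL/glew.h>
#include <vector>

const int W = 640, H = 480;
GLuint fbo = 0, colorRbo = 0, depthRbo = 0;

void initOffscreenTarget() {
    glGenFramebuffers(1, &fbo);
    glBindFramebuffer(GL_FRAMEBUFFER, fbo);

    // Colour attachment as a renderbuffer (RGBA8 here; a single-channel or
    // float format could be substituted if only intensity is needed).
    glGenRenderbuffers(1, &colorRbo);
    glBindRenderbuffer(GL_RENDERBUFFER, colorRbo);
    glRenderbufferStorage(GL_RENDERBUFFER, GL_RGBA8, W, H);
    glFramebufferRenderbuffer(GL_FRAMEBUFFER, GL_COLOR_ATTACHMENT0,
                              GL_RENDERBUFFER, colorRbo);

    // Depth attachment so nearer points correctly occlude farther ones.
    glGenRenderbuffers(1, &depthRbo);
    glBindRenderbuffer(GL_RENDERBUFFER, depthRbo);
    glRenderbufferStorage(GL_RENDERBUFFER, GL_DEPTH_COMPONENT24, W, H);
    glFramebufferRenderbuffer(GL_FRAMEBUFFER, GL_DEPTH_ATTACHMENT,
                              GL_RENDERBUFFER, depthRbo);
}

void renderWarpAndReadBack(std::vector<unsigned char>& out) {
    glBindFramebuffer(GL_FRAMEBUFFER, fbo);
    glViewport(0, 0, W, H);
    glClear(GL_COLOR_BUFFER_BIT | GL_DEPTH_BUFFER_BIT);

    // ... bind the shader, upload the 4x4 warp transform as a uniform,
    //     bind the point-cloud VBO and glDrawArrays(GL_POINTS, 0, W * H) ...

    // Synchronous readback: this call blocks until the GPU has finished rendering.
    out.resize(W * H * 4);
    glReadPixels(0, 0, W, H, GL_RGBA, GL_UNSIGNED_BYTE, out.data());
}
```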
My current implementation is at
https://gist.github.com/icoderaven/1212c7623881d8cd5e1f1e0acb7644fb,
and the shaders at
https://gist.github.com/icoderaven/053c9a6d674c86bde8f7246a48e5c033
Thanks!

Related

OpenGL ES - How to improve performance, render to texture, blending

I am here because I'm working on an OpenGL program and I have some performance issues. I'm working with OpenGL ES 3.0 on an i.MX6 SoC.
Here is my algorithm:
I get an image from the camera, which is mapped directly to a texture.
Using an FBO, I render to a texture to map the image onto a specific shape.
I do the same thing (with a second FBO) for another image, which is sent via shared memory by another application. This step is performed only when that image is updated, which is only once per second.
I blend these two textures in the default framebuffer to render the result to the screen.
If I perform these three steps separately, it works well and the screen is updated at 30 FPS. But when I combine the three steps in one program, rendering becomes very slow and I get only 0.5 FPS.
I am wondering whether the GPU on the i.MX6 is powerful enough, but I don't think this is a complex algorithm. I suspect I am doing something the wrong way, but what?
I use three different framebuffers; is that a good approach, or should I use only one?
Can someone give me an answer, clues, anything that can help me? :-)
My images are 1280x1024, RGBA. I also convert from a floating-point texture to integer and back to float in order to perform bitwise operations on the pixels.
Thanks to @Columbo, it turned out the problem came from all those conversions: I now work with floating-point textures and convert only for the bitwise operations, which improves the performance of the algorithm a lot.
Another thing that decreased performance was the texture format. For the first step the image was 1280x1024 but with only one component (a grayscale image). To keep only the grayscale component and not use too much memory I worked with a GL_RED texture, but this wasn't a good idea: when I changed it to GL_RGB, the framerate of the render doubled as well.
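For illustration, a minimal OpenGL ES 3.0-style sketch of the render-to-texture setup used in the steps above, with an RGBA colour target as per the format note in the last paragraph (all names are made up):

```cpp
// Illustrative sketch only: create a texture and attach it to an FBO so an
// image can be rendered into it (OpenGL ES 3.0-style calls).
#include <GLES3/gl3.h>

GLuint makeRenderTarget(GLsizei w, GLsizei h, GLuint* outTex) {
    GLuint tex = 0, fbo = 0;

    glGenTextures(1, &tex);
    glBindTexture(GL_TEXTURE_2D, tex);
    // Four-channel RGBA8 target; per the note above, the single-channel
    // GL_RED format turned out to be slower on this particular GPU.
    glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA8, w, h, 0,
                 GL_RGBA, GL_UNSIGNED_BYTE, nullptr);
    glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_LINEAR);
    glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MAG_FILTER, GL_LINEAR);

    glGenFramebuffers(1, &fbo);
    glBindFramebuffer(GL_FRAMEBUFFER, fbo);
    glFramebufferTexture2D(GL_FRAMEBUFFER, GL_COLOR_ATTACHMENT0,
                           GL_TEXTURE_2D, tex, 0);

    if (glCheckFramebufferStatus(GL_FRAMEBUFFER) != GL_FRAMEBUFFER_COMPLETE) {
        // handle the incomplete-framebuffer case
    }

    *outTex = tex;
    return fbo;
}
```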

Rendering visualization of spectrogram efficiently

I'm trying to find a clever way to render a large spectrogram (say, fullscreen). A spectrogram is a coordinate-system, where the x-axis is time, the y-axis is frequency and the colour intensity is the magnitude of the frequency component, and it looks like this (youtube).
What's interesting to note is that each frame only one new column (1 pixel wide) is added, while the rest of the spectrogram is unchanged, just shifted left by one pixel. Currently I'm writing to a circular software buffer acting like an image and drawing that, but it is obviously slow at high framerates and screen sizes.
Is there any obvious solution to this problem using OpenGL (or some software trick; it has to be cross-platform, though)? Perhaps through some use of a buffer in GPU memory, with a shader that fills it (admittedly, I have a very vague understanding of OpenGL beyond drawing simple stuff)? As I see it, it revolves around keeping the old data in GPU memory.
Use a single-channel texture for the waterfall (this is what you're drawing: a waterfall plot) and update one column or row at a time using glTexSubImage2D. By setting the texture's wrap mode to GL_REPEAT you can simply advance the texture coordinates beyond the bounds of the texture and it will, well, wrap. By moving the texture in the opposite direction to the updates you get the waterfall effect (i.e. a moving spectrogram, with the updates coming in at the right edge).
To give the whole thing color, use the texture's values as an index into a transfer function LUT texture using a fragment shader.
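A rough sketch of that per-frame column update, assuming a single-channel float texture (`waterfallTex`, `writeColumn`, and the column buffer are all illustrative names):

```cpp
// Rough sketch of the per-frame waterfall update (illustrative only).
// waterfallTex is assumed to be a texWidth x texHeight single-channel float
// texture (e.g. GL_R32F); 'column' holds texHeight new magnitude values.
#include <GL/glew.h>
#include <vector>

void uploadColumn(GLuint waterfallTex, int texWidth, int texHeight,
                  int& writeColumn, const std::vector<float>& column) {
    glBindTexture(GL_TEXTURE_2D, waterfallTex);

    // Let texture coordinates wrap (normally set once at texture creation),
    // so the plot can scroll without moving any existing data.
    glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_WRAP_S, GL_REPEAT);

    // Overwrite just one column of the texture with the newest spectrum.
    glTexSubImage2D(GL_TEXTURE_2D, 0,
                    writeColumn, 0,   // x, y offset
                    1, texHeight,     // width of one column
                    GL_RED, GL_FLOAT, column.data());

    // The next frame writes the following column, wrapping around.
    writeColumn = (writeColumn + 1) % texWidth;

    // When drawing, add writeColumn / float(texWidth) to the s texture
    // coordinate so the newest column always appears at the right edge.
}
```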
You can use the GPU library for spectrogram calculations: nnAudio
https://github.com/KinWaiCheuk/nnAudio

Is there any conventional way to do per-voxel shader programming?

I'm looking for a way to do 3D filters in DirectX or OpenGL shaders, analogous to a Gaussian filter for images. In detail, I want to process every voxel of a 3D texture.
Maybe storing the volume data in slices could work, but that is not a friendly way to access the volume data and is not easy to write in shaders.
Sorry for my poor English; any reply will be appreciated.
P.S.: CUDA's texture memory can do this job, but my poor GPU can only run at a very low frame rate in debug mode, and I don't know why.
There is a 3D texture target in both Direct3D and OpenGL. Of course, render-target framebuffers are still 2D, so a compute shader, OpenCL, or DirectCompute may be better suited for pure filtering purposes that don't involve rendering to the screen.
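As a rough illustration of the compute-shader route (desktop OpenGL 4.3+; the trivial "filter" below is a placeholder copy, and all names are made up):

```cpp
// Rough illustration: run one compute-shader invocation per voxel of a 3D
// texture (OpenGL 4.3+). A real Gaussian filter would sample and weight
// neighbouring voxels instead of the plain copy shown here.
#include <GL/glew.h>

const char* kComputeSrc = R"(
#version 430
layout(local_size_x = 4, local_size_y = 4, local_size_z = 4) in;
layout(r32f, binding = 0) uniform image3D srcVolume;
layout(r32f, binding = 1) uniform image3D dstVolume;
void main() {
    ivec3 p = ivec3(gl_GlobalInvocationID.xyz);
    float v = imageLoad(srcVolume, p).r;   // read one voxel
    imageStore(dstVolume, p, vec4(v));     // write the "filtered" value
}
)";

// 'program' is assumed to be compiled and linked from kComputeSrc; the volume
// dimensions are assumed to be multiples of the 4x4x4 work-group size.
void runVoxelFilter(GLuint program, GLuint srcTex, GLuint dstTex,
                    int sx, int sy, int sz) {
    glUseProgram(program);
    glBindImageTexture(0, srcTex, 0, GL_TRUE, 0, GL_READ_ONLY,  GL_R32F);
    glBindImageTexture(1, dstTex, 0, GL_TRUE, 0, GL_WRITE_ONLY, GL_R32F);
    glDispatchCompute(sx / 4, sy / 4, sz / 4);   // one work item per voxel
    glMemoryBarrier(GL_SHADER_IMAGE_ACCESS_BARRIER_BIT);
}
```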

Render Mona Lisa to PBO

After reading this article I wanted to try to do the same, but to speed things up I want the rendering part to be performed on the GPU; needless to say, triangles or any other geometric objects are better rendered on the GPU than on the CPU.
The task:
Render the 'set of vertices'.
Estimate the difference, pixel by pixel, between the rendered 'set of vertices' and the Mona Lisa image (the Mona Lisa lives on the GPU, in a texture or a PBO; no big difference).
The problem:
When using OpenCL or CUDA together with the OpenGL FBO (Frame Buffer Object) extension, the task splits as follows:
Render the 'set of vertices' (handled by OpenGL).
Estimate the difference, pixel by pixel, between the rendered 'set of vertices' and the Mona Lisa image (handled by OpenCL or CUDA).
So in this case I'm forced to copy from the FBO to a PBO (Pixel Buffer Object) to make the rendered 'set of vertices' available to OpenCL/CUDA. I know how fast device-to-device memory copies are, but given that I need to do thousands of these copies, it makes sense not to do so...
This problem leaves three choices:
Render with OpenGL directly to a PBO (somehow; I don't know how, and it might be impossible to do so).
Render the image and estimate the difference between the images entirely with OpenGL (somehow; I don't know how, maybe by using shaders; the only problem is that I've never written a shader in my life, and this might take months of work for me...).
Render the image and estimate the difference between the images entirely with OpenCL/CUDA (I know how to do this, but it will also take months to get a stable and more or less optimized version of the renderer implemented in OpenCL or CUDA).
The question
Can anybody help me with writing a shader for the above process, or maybe point out a way of rendering the Mona Lisa to a PBO without copies from the FBO?
My gut feeling is that the shader approach is also going to have the same I/O problem. You certainly can compare textures in a shader as long as the GPU supports PS 4.0 or higher, but you've still got to get the source texture (the Mona Lisa) onto the device in the first place.
Edit: Been digging around a bit and this forum post might provide some insight:
http://www.opengl.org/discussion_boards/ubbthreads.php?ubb=showflat&Number=221384&page=1.
The poster, Komat, provides an example of the shader on the 2nd page.
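For what it's worth, the core of such a comparison shader is short. A hedged sketch (all names are made up; the vertex stage that supplies `uv` and the fullscreen quad are assumed):

```cpp
// Hedged sketch of a per-pixel difference pass: draw a fullscreen quad with
// this fragment shader, with the rendered candidate on texture unit 0 and the
// Mona Lisa on unit 1. The resulting difference image can then be reduced to a
// single fitness value (e.g. with a mipmap/reduction pass or a readback).
const char* kDiffFragmentSrc = R"(
#version 330 core
uniform sampler2D renderedImage;  // output of the 'set of vertices' render
uniform sampler2D targetImage;    // the Mona Lisa
in  vec2 uv;
out vec4 fragColor;
void main() {
    vec3 a = texture(renderedImage, uv).rgb;
    vec3 b = texture(targetImage,   uv).rgb;
    fragColor = vec4(abs(a - b), 1.0);   // per-pixel absolute difference
}
)";
```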

Sum image intensities in GPU

I have an application where I need to take the average intensity of an image, for around 1 million images. It "feels" like a job for a GPU fragment shader, but fragment shaders are for per-pixel local computations, while image averaging is a global operation.
One approach I considered is loading the image into a texture, applying a 2x2 box blur, loading the result back into an N/2 x N/2 texture, and repeating until the output is 1x1. However, this would take log n applications of the shader.
Is there a way to do it in one pass? Or should I just break down and use CUDA/OpenCL?
The summation operation is a specific case of the "reduction", a standard operation in CUDA and OpenCL libraries. A nice writeup on it is available on the CUDA demos page. In CUDA, Thrust and CUDPP are just two examples of libraries that provide reduction. I'm less familiar with OpenCL, but CLPP seems to be a good library that provides reduction. Just copy your color buffer to an OpenGL pixel buffer object and use the appropriate OpenGL interoperability call to make that pixel buffer's memory accessible in CUDA/OpenCL.
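A minimal sketch of that first step, reading the current colour buffer into a pixel buffer object (the CUDA/OpenCL registration and mapping calls are omitted; names are illustrative):

```cpp
// Illustrative sketch: read the current colour buffer into a pixel buffer
// object so it can later be mapped through CUDA/OpenCL interop (not shown).
#include <GL/glew.h>

GLuint createAndFillPbo(int width, int height) {
    GLuint pbo = 0;
    glGenBuffers(1, &pbo);
    glBindBuffer(GL_PIXEL_PACK_BUFFER, pbo);
    glBufferData(GL_PIXEL_PACK_BUFFER, width * height * 4, nullptr,
                 GL_DYNAMIC_READ);

    // With a PIXEL_PACK buffer bound, the last argument of glReadPixels is an
    // offset into that buffer, so the copy stays on the GPU.
    glReadPixels(0, 0, width, height, GL_RGBA, GL_UNSIGNED_BYTE, nullptr);

    glBindBuffer(GL_PIXEL_PACK_BUFFER, 0);
    return pbo;  // register/map this buffer with the CUDA or OpenCL interop API
}
```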
If it must be done using the OpenGL API (as the original question requires), the solution is to render to a texture, create a mipmap chain for that texture, and read back the 1x1 level. You have to set the filtering right (bilinear is appropriate, I think), but it should get close to the right answer, modulo precision error.
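A hedged sketch of that mipmap trick (desktop GL; assumes the image is already in an RGBA texture `tex` with power-of-two dimensions):

```cpp
// Hedged sketch of the mipmap-average trick: generate the full mip chain and
// read back the 1x1 top level, which holds approximately the mean colour.
#include <GL/glew.h>
#include <algorithm>
#include <cmath>

void averageViaMipmaps(GLuint tex, int width, int height, float rgba[4]) {
    glBindTexture(GL_TEXTURE_2D, tex);
    glGenerateMipmap(GL_TEXTURE_2D);

    // The smallest mip level of a width x height texture is log2(max(w, h)).
    int topLevel = (int)std::log2((double)std::max(width, height));
    glGetTexImage(GL_TEXTURE_2D, topLevel, GL_RGBA, GL_FLOAT, rgba);
}
```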
My gut tells me to attempt your implementation in OpenCL. You can optimize for your image size and graphics hardware by breaking up the images into bespoke chunks of data that are then summed in parallel. Could be very fast indeed.
Fragment shaders are great for convolutions, but there the result is written to gl_FragColor, so it makes sense. Ultimately you will have to loop over every pixel in the texture and sum the results, which are then read back in the main program. Generating image statistics is perhaps not what the fragment shader was designed for, and it's not clear that a major performance gain is to be had, since it's not guaranteed that a particular buffer is located in GPU memory.
It sounds like you may be applying this algorithm to a real-time motion detection scenario, or some other automated feature detection application. It may be faster to compute some statistics from a sample of pixels rather than the entire image and then build a machine learning classifier.
Best of luck to you in any case!
You don't need CUDA if you'd like to stick to GLSL. As in the CUDA solution mentioned here, it can be done quite straightforwardly in a fragment shader. However, you need about log(resolution) draw calls.
Just set up a shader that takes 2x2 pixel samples from the original image and outputs their average. The result is an image with half the resolution in both axes. Repeat until the image is 1x1 px.
Some considerations: use GL_FLOAT luminance textures if available, to get a more precise sum, and use glViewport to quarter the rendering area in each stage. The result then ends up in the top-left pixel of your framebuffer.
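A rough host-side sketch of that loop (the 2x2-averaging shader, the two FBO-attached float textures, and the `drawFullscreenQuad()` helper are all assumed to exist; names are illustrative):

```cpp
// Rough sketch of the ping-pong reduction loop: each pass draws a quad with a
// shader that averages 2x2 texels, halving the image until it is 1x1.
#include <GL/glew.h>
#include <utility>

void drawFullscreenQuad();  // assumed helper that issues the quad draw call

// fbo[i]/tex[i] are two FBO-attached float textures of 'size' x 'size' texels
// (power of two); avgProgram is the 2x2-averaging shader.
void reduceToOnePixel(GLuint fbo[2], GLuint tex[2], GLuint avgProgram,
                      int size, float out[4]) {
    glUseProgram(avgProgram);
    int src = 0, dst = 1;

    while (size > 1) {
        size /= 2;
        glBindFramebuffer(GL_FRAMEBUFFER, fbo[dst]);
        glViewport(0, 0, size, size);    // quarter the rendering area each stage
        glBindTexture(GL_TEXTURE_2D, tex[src]);
        drawFullscreenQuad();
        std::swap(src, dst);
    }

    // The result has collapsed into the top-left pixel of the last target.
    glBindFramebuffer(GL_FRAMEBUFFER, fbo[src]);
    glReadPixels(0, 0, 1, 1, GL_RGBA, GL_FLOAT, out);
}
```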