Cost of using multiple render targets


I am using GLSL as a framework for GPGPU real-time image processing. I am currently trying to shave off a few more milliseconds to make my application real-time. Here's the basic setup:
I take an input image, calculate several transformations of it, and then output a result image. For instance, let the input image be I. Then one fragment shader calculates f(I), a second calculates g(I), and the last one calculates h(f(I),g(I)).
My question is about calculating f(I) and g(I) efficiently: does it matter if I use 2 separate fragment shaders (and therefore 2 rendering passes), or if I use a single fragment shader with 2 outputs? Will the latter run faster? I have mostly found discussions about the "how-to", not about the performance.
Edit
Thanks for the replies so far. Following several remarks, here's an example for my use-case with some more details:
I want to filter the rows of image I with a 1-d filter, and also filter the rows of the squared image (each pixel is squared). f(I) = filter rows and g(I) = square and filter rows:
shader1: (input image) I --> filter rows --> I_rows (output image)
shader2: (input image) I --> square pixels and filter rows--> I^2_rows (output image)
The question is: will writing a single shader that does both operations be faster than running these two shaders one after the other? @derhass suggests that the answer is positive, because both access the same texture locations and benefit from locality. But if it wasn't for the texture-locality: would I still be enjoying a performance boost? Or is a single shader rendering to two outputs basically equivalent to two render passes?

Using multiple render passes is usually slower than using one pass with MRT output, but this will also depend on your situation.
As I understand it, both f(I) and g(I) sample the input image I, and if each samples the same (or closely neighboring) locations, you can greatly profit from the texture cache between the different operations: you have to sample the input texture just once, instead of twice with the multipass approach.
Taking this approach one step further: do you even need the intermediate results f(I) and g(I) separately? Maybe you could just put h(f(I),g(I)) directly into one shader, so that you need neither multiple passes nor MRTs. If you want to be able to combine your operations dynamically, you can still use that approach and programmatically assemble the different shader code parts to implement the operations (where possible), and use multiple passes only where absolutely necessary.
EDIT
As the question has been updated in the meantime, I think I can give some more specific answers:
What I said so far, especially about putting h(f(I),g(I)) into one shader, is only a good idea if h (or f and g) does not need any neighboring pixels. If h is an nxn filter kernel, you would have to access nxn different input texels, and since those inputs are not directly known, you would have to calculate f and g for each of them. If both f and h are filter kernels, the effective filter size of the compound operation will be greater, and it is much better to calculate the intermediate results first and use multiple passes.
Looking at the specific issue you describe, it comes down to this.
If you use two separate shaders in the most naive way, your rendering will look like this:
use the shader1
select some output color buffer
draw a quad
use shader2
select some different color buffer
draw a quad
Every draw call has its overhead. The GL will have to do some extra validation. Switching the shaders might be the most expensive extra step here compared to the combined shader approach, as it might force a GPU pipeline flush. Also, for each draw call, you have the vertex processing, rasterization, and per-fragment attribute interpolation operations. With just one shader, a lot of this overhead goes away, and the per-fragment work described so far is shared by both filters.
But if it wasn't for the texture-locality: would I still be enjoying a performance boost?
Because of what I said so far, and specific to the shaders you presented, I tend to say: yes. But the effect will be very small to negligible if we ignore the texture accesses here, especially if we assume reasonably high-resolution images, so that the relative overhead compared to the total amount of work appears small. I would at least say that using a single-pass MRT setup will not be slower. However, only benchmarking/profiling the very specific implementation on a specific GPU will give a definitive answer.
Why did I say "the shaders you presented"? Because in both cases, you do the image squaring inside the filtering shader. You could also split that into two different shaders and render passes. In that case, you would get additional overhead (on top of what was already mentioned) for writing the intermediate results and having to read them back. However, since you run a filter over the intermediate result, you do not have to square any input texel more than once, but in the combined approach, you do. If the squaring operation is expensive enough, and your filter size is big enough, you could in theory save more time than is introduced by the overhead of multiple passes. Again, only benchmarking/profiling can tell you where the break-even point lies.
I have done some benchmarking with MRT vs. multiple render passes myself in the past, although the image processing operations I was interested in are a bit different from yours. What I found is that in such scenarios, the texture access is the key factor, and you can hide a lot of other calculations (like squaring a color value) in the texture access latency. I think that your "But if it wasn't for the texture-locality" is a bit unrealistic, since texture access is the major contribution to the overall running time. And it isn't just the locality, it is also the total number of texture accesses: with your multiple-shader approach, an image of size w*h, and a 1D filter of size n, you will end up with 2*w*h*n texture accesses overall, while with the combined approach you will have just w*h*n, and that will make a huge difference.
For an AMD FirePro V9800, an image size of 1920x1080, and just copying the pixels to two output buffers by rendering textured quads, I got ~0.320 ms with two passes (even without switching shaders) vs. ~0.230 ms with one MRT pass. So execution time was reduced by "only" about 30%, but this was with just one texture fetch per shader invocation. With filter kernels, I'd expect to see this figure getting closer to a 50% reduction with increasing kernel size (but I haven't measured that).
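To make the access-count argument concrete, here is a minimal C++ sketch of the arithmetic only. The function names are made up for illustration, and it assumes every output pixel reads exactly n texels per filter and ignores caching entirely:

```cpp
#include <cassert>
#include <cstdint>

// Texture fetches when f(I) and g(I) each run as their own pass:
// every pass reads n texels for each of the w*h output pixels.
std::uint64_t fetchesTwoPasses(std::uint64_t w, std::uint64_t h, std::uint64_t n) {
    assert(n > 0);
    return 2 * w * h * n;
}

// Texture fetches when one shader writes both outputs (MRT):
// the n texels are read once and reused for both results.
std::uint64_t fetchesSinglePassMrt(std::uint64_t w, std::uint64_t h, std::uint64_t n) {
    assert(n > 0);
    return w * h * n;
}
```

For a 1920x1080 image and a 9-tap filter, this gives 37,324,800 fetches for two passes against 18,662,400 for the combined shader.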

Let us ignore any potential benefits from hardware-specific things like data cache, register re-use, etc. that might occur if you do your entire algorithm in a single shader invocation and focus entirely on algorithm complexity for a minute.
A Gaussian Blur on a 2D image is a separable filter (X and Y can be blurred as a much simpler series of 1D blurs), and you can actually get better performance if you split the horizontal and vertical application into two passes.
Consider the complexity of two 1D blurs vs. one 2D blur in Big O:
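Two-Pass Gaussian Blur (Two 1D blurs):
     O(w * h * 2n), i.e. 2n samples per pixel for a w*h image and an n-tap 1D kernel
Single-Pass Gaussian Blur (Single 2D blur):
     O(w * h * n^2), i.e. n*n samples per pixel for the same image and an n*n kernel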
Deferred shading is another example. Instead of one massive loop over all lights in a single pass, many implementations will do one pass per light, shading only the area of the screen that each individual light actually covers.
Multi-pass is not always a bad thing. When it simplifies your algorithm, as in the case of a separable filter or light coverage, it is often good.
Your results may vary, but if you can show an appreciable difference in algorithm complexity in Big O notation using one approach over the other, it is worth exploring the run-time performance of both implementations.
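As a rough CPU-side illustration of the separable-filter point, the sketch below uses a box blur instead of a Gaussian to keep it short (the separability argument is the same). The `Image` struct and all function names are hypothetical. It runs the blur both ways and counts the samples each variant takes:

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

struct Image {
    int w, h;
    std::vector<float> px;  // row-major, w * h values

    // Clamp-to-edge lookup, similar in spirit to GL_CLAMP_TO_EDGE sampling.
    float at(int x, int y) const {
        x = std::max(0, std::min(x, w - 1));
        y = std::max(0, std::min(y, h - 1));
        return px[static_cast<std::size_t>(y) * w + x];
    }
};

// 1D box blur of radius r along direction (dx, dy); counts samples in `taps`.
Image blur1D(const Image& in, int r, int dx, int dy, long& taps) {
    assert(r >= 0);
    Image out{in.w, in.h, std::vector<float>(in.px.size())};
    for (int y = 0; y < in.h; ++y)
        for (int x = 0; x < in.w; ++x) {
            float sum = 0.0f;
            for (int k = -r; k <= r; ++k, ++taps)
                sum += in.at(x + k * dx, y + k * dy);
            out.px[static_cast<std::size_t>(y) * in.w + x] = sum / (2 * r + 1);
        }
    return out;
}

// Full 2D box blur of radius r in one pass; counts samples in `taps`.
Image blur2D(const Image& in, int r, long& taps) {
    assert(r >= 0);
    Image out{in.w, in.h, std::vector<float>(in.px.size())};
    const int n = 2 * r + 1;
    for (int y = 0; y < in.h; ++y)
        for (int x = 0; x < in.w; ++x) {
            float sum = 0.0f;
            for (int j = -r; j <= r; ++j)
                for (int i = -r; i <= r; ++i, ++taps)
                    sum += in.at(x + i, y + j);
            out.px[static_cast<std::size_t>(y) * in.w + x] = sum / (n * n);
        }
    return out;
}

// Largest per-pixel difference between two images of equal size.
float maxAbsDiff(const Image& a, const Image& b) {
    assert(a.px.size() == b.px.size());
    float m = 0.0f;
    for (std::size_t i = 0; i < a.px.size(); ++i)
        m = std::max(m, std::fabs(a.px[i] - b.px[i]));
    return m;
}
```

Calling `blur1D` horizontally and then vertically should give the same image as `blur2D` up to floating-point rounding, while taking 2n instead of n*n samples per pixel (n = 2r + 1).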

Related

How to calculate the number of specified colored pixels using GLSL?

I have a grayscale texture (8000*8000) in which the value of each pixel is an ID (actually, this ID is the ID of the triangle to which the fragment belongs; I want to use this method to calculate how many triangles and which triangles are visible in my scene).
Now I need to count how many unique IDs there are and what they are. I want to implement this with GLSL and minimize the data transfer between GPU RAM and RAM.
The initial idea I came up with is to use a shader storage buffer, bind it to an array in GLSL whose size is totalTriangleNum, then iterate through the ID texture in a shader and increment the array element whose index equals the ID in the texture.
After that, I would read the buffer back into the OpenGL application and get what I want. Is this an efficient way to do so? Or are there better solutions, like a compute shader (I'm not familiar with it) or something else?

I want to use this method to calculate how many triangles and which triangles are visible in my scene
Given your description of your data let me rephrase that a bit:
You want to determine how many distinct values there are in your dataset, and how often each value appears.
This is commonly known as a histogram. Unfortunately (for you), generating histograms is among the problems that are not trivially solved on GPUs. Essentially you have to divide your image into smaller and smaller subimages (BSP, quadtree, etc.) until you are down to single pixels, on which you perform the evaluation. Then you backtrack, propagating up the sub-histograms, essentially performing an insertion or merge sort on the histogram.
Generating histograms with GPUs is still actively researched, so I suggest you read up on the published academic works (usually accompanied by source code). Keywords: Histogram, GPU
This one is a nice paper done by the AMD GPU researchers: https://developer.amd.com/wordpress/media/2012/10/GPUHistogramGeneration_preprint.pdf
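For reference, this is what the result should look like when computed naively on the CPU. It is only a sketch of the counting idea from the question (one counter per triangle ID), not a GPU implementation. The function names are made up, and the texture is assumed to be already available as a flat array of integer IDs:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// counts[id] = number of pixels carrying that triangle ID.
std::vector<std::uint32_t> countIds(const std::vector<std::uint32_t>& idTexture,
                                    std::uint32_t totalTriangleNum) {
    std::vector<std::uint32_t> counts(totalTriangleNum, 0);
    for (std::uint32_t id : idTexture) {
        assert(id < totalTriangleNum);
        ++counts[id];  // on a GPU, parallel invocations would need an atomic increment here
    }
    return counts;
}

// IDs that occur at least once, i.e. the visible triangles.
std::vector<std::uint32_t> visibleIds(const std::vector<std::uint32_t>& counts) {
    std::vector<std::uint32_t> ids;
    for (std::uint32_t id = 0; id < counts.size(); ++id)
        if (counts[id] > 0) ids.push_back(id);
    return ids;
}
```

A reference like this is mostly useful for validating whatever GPU histogram variant you end up with against a small test texture.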

I need advice how to improve graphics

I have a file with a table containing 23 million records of the following form: {atomName, x, y, z, transparency}. I decided to use OpenGL for the solution.
My task is to render it. In the first iteration, I used a glBegin/glEnd block and drew every atom as a colored point. This solution worked, but I got 0.002 fps.
Then I tried using VBOs. I created three buffers: vertex, color and index. This solution worked. I got 60 fps, but binding the buffers is not comfortable and I am drawing points, not spheres.
Then I read about VAOs, which can simplify binding buffers. OK, that worked. Binding is comfortable now.
Now I want to draw spheres, not points. I thought of generating, for each point, a set of vertices from which a sphere can be built (with some accuracy). But with 23 million points, I must compute ~12 or more extra vertices for every point. 23 000 000 * 12 * 4 bytes (float) is already about 1 GB of data, so perhaps this is not a good solution.
What is the best next move? I do not fully understand whether shaders are applicable to this task or whether other ways exist.

About your drawing process
My task is to render it. In the first iteration, I used a glBegin/glEnd block and drew every atom as a colored point. This solution worked, but I got 0.002 fps.
Think about it: for every one of your 23 million records you make at least one function call directly (glVertex) and probably several function calls implicitly. Even worse, glVertex likely causes a context switch. This means that your CPU hits several speed bumps for every vertex it has to process. A top-notch CPU these days has a clock rate of about 3 GHz and a pipeline length on the order of 10 instructions. When you make a context switch, that pipeline gets stalled; in the worst case it then takes one pipeline length to actually process one single instruction. Let's assume that you have to perform at least 1000 instructions to process a single glVertex call (which is actually a rather optimistic estimate). That alone means that you are limited to at most 3 million vertices per second. So at 23 million vertices, that's already less than one FPS.
But you also have context switches in there, which add a further penalty, and probably a lot of branching, which creates further pipeline flushes.
And that's just the glVertex call. You also have colors in there.
And you wonder why immediate mode is slow?
Of course it's slow. Using the Immediate Mode has been discouraged for well over 15 years. Vertex Arrays are available since OpenGL-1.1.
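The back-of-the-envelope estimate above can be written down as a small C++ sketch. The function names are made up, and it assumes (as the estimate does) one instruction per clock cycle and a fixed instruction cost per vertex:

```cpp
#include <cassert>

// Upper bound on vertices per second if every vertex costs a fixed
// number of instructions and the CPU retires one instruction per cycle.
double maxVerticesPerSecond(double clockHz, double instructionsPerVertex) {
    assert(instructionsPerVertex > 0.0);
    return clockHz / instructionsPerVertex;
}

// Resulting upper bound on frames per second for a given scene size.
double maxFramesPerSecond(double verticesPerSecond, double verticesPerFrame) {
    assert(verticesPerFrame > 0.0);
    return verticesPerSecond / verticesPerFrame;
}
```

With 3 GHz and 1000 instructions per vertex this gives 3 million vertices per second, and for 23 million vertices roughly 0.13 frames per second, before any stalls or color calls are accounted for.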
This solution worked. I got 60 fps,
Yes, because all the data now resides in the GPU's own memory. GPUs are massively parallel and optimized to crunch exactly this kind of data.
but binding the buffers is not comfortable
Well, OpenGL is not a high level scene graph library. It's a mid to low level drawing API. You use it like a sophisticated pencil to draw on a digital canvas.
Then I read about VAOs
Well, VAOs are meant to coalesce buffer objects that belong together so it makes sense using them.
Now I want to draw spheres, not points.
You have two options:
Using point sprite textures. This means that your points will get area when drawn, and that area gets a texture applied. I think this is the best method for you. Given the right shader you can even give your point sprite the right kind of depth values, so that your "spheres" will actually intersect like spheres in the depth buffer.
The other option is instancing a single sphere geometry, using your atom records as control data for the instancing process. This would then process real sphere geometry. However, I fear that implementing an instanced drawing process might be a bit too advanced for your skill level at the moment.
About drawing 23 million points
Seriously, what kind of display do you have available that can show 23 million distinguishable points? A typical computer screen has about 2000×1500 pixels. The highest-resolution displays you can buy these days have about 4k×2.5k pixels, i.e. 10 million individual pixels. Let's assume your atoms are evenly distributed in a plane: with 23 million atoms to draw, each pixel will be overdrawn multiple times. You simply can't display 23 million individual atoms that way. Another way to look at this is that the display's pixel grid implies a spatial sampling, and you can't reproduce anything smaller than twice the average sampling distance (sampling theorem).
So it absolutely makes sense to draw only a subset of the data, namely the subset that's actually in view. Also, if you're zoomed very far out (i.e. you have the full dataset in view), it makes sense to coalesce nearby atoms.
It definitely makes sense to sort your data into a spatial subdivision structure. In your case I think an octree would be a good choice.
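To get a feeling for the overdraw and for how much coalescing could save, here is a small C++ sketch that bins atoms into a screen-sized grid. It is a much simpler stand-in for an octree, not a replacement for one. The `Atom` struct, the function names, and the assumption that x and y are already normalized to [0, 1) (an orthographic view along z) are all made up for illustration:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

struct Atom { float x, y, z; };

// Counts how many atoms land in each cell of a gridW x gridH grid that
// covers [0,1) x [0,1) in x/y. Atoms outside that range are not in view.
std::vector<int> atomsPerCell(const std::vector<Atom>& atoms, int gridW, int gridH) {
    assert(gridW > 0 && gridH > 0);
    std::vector<int> cells(static_cast<std::size_t>(gridW) * gridH, 0);
    for (const Atom& a : atoms) {
        if (a.x < 0.0f || a.x >= 1.0f || a.y < 0.0f || a.y >= 1.0f) continue;
        // The min() guards against rounding up to the grid size.
        const int cx = std::min(static_cast<int>(a.x * gridW), gridW - 1);
        const int cy = std::min(static_cast<int>(a.y * gridH), gridH - 1);
        ++cells[static_cast<std::size_t>(cy) * gridW + cx];
    }
    return cells;
}

// Number of cells holding at least one atom: one representative per
// occupied cell is all a display of that resolution could show anyway.
std::size_t occupiedCells(const std::vector<int>& cells) {
    return static_cast<std::size_t>(
        std::count_if(cells.begin(), cells.end(), [](int c) { return c > 0; }));
}
```

If `occupiedCells` comes out far below the number of atoms for a grid the size of your viewport, most of the draw work goes into pixels that are overwritten anyway.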

Fastest way to calculate OpenGL texture similarity/distance?

Here's what I have:
I load a texture from disk, say 256x256, say a picture of a penguin
I create another texture with same dimensions and I draw some stuff on it
I want to find distance between the two openGL textures AS FAST AS POSSIBLE (accuracy of the distance function is of LEAST concern).
Actually, they might not be textures in the first place.
I might as well compare a region of the framebuffer to the loaded texture or do some other "magic".
How to do this super-fast?

This has little to do with OpenGL in fact. OpenGL is just a 3D rasterization "driver".
The main task is to choose a distance/similarity algorithm, which is tricky. To begin with, you could use a Root Mean Square Deviation algorithm. It will give you a number that says how far apart the pixel values are.
Once you have implemented it on the CPU, you can benchmark it for your needs and maybe convert it to OpenCL. I don't think loading the "penguin" to the GPU just to compare it to other shapes you prepare there is a super-fast process by itself.
Try to be more specific with your next question and avoid the attitude of "I have a very good microscope, how do I hit nails with it super-fast?"
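A minimal CPU version of the Root Mean Square Deviation mentioned above could look like the following C++ sketch. The function name is made up, and it assumes both images are already in memory as 8-bit values with identical layout (same size, same channel order):

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Root mean square deviation between two equally sized 8-bit images.
// 0 means identical; 255 is the largest possible value.
double rmsd(const std::vector<std::uint8_t>& a, const std::vector<std::uint8_t>& b) {
    assert(a.size() == b.size() && !a.empty());
    double sumSq = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        const double d = static_cast<double>(a[i]) - static_cast<double>(b[i]);
        sumSq += d * d;
    }
    return std::sqrt(sumSq / static_cast<double>(a.size()));
}
```

For a 256x256 texture this is a single pass over a few hundred kilobytes, so it is worth benchmarking before moving anything to the GPU.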

You could use the Pearson product-moment correlation for matching different images. You can find it in an answer of mine. Instead of matching a template smaller than the original image, you could directly correlate two images.
In short, you take the deviation of each texel from the average, which gives you a correlation factor.
But improving that algorithm by using shaders could be a little hard. First, you have to compute the texel average: maybe some OpenGL extension (like histogram) may help you in this task.
Then, you could use fragment shaders to perform the per-component computation (the difference between each texel and the average computed previously). The average texel has to be passed as a uniform, and the result should be stored in a floating-point texture (you can render it to a framebuffer object).
Then, you should sum up all texels of the resulting texture in order to get the correlation between the two source textures.
This may be worth it if the images are very big. Otherwise I think it's better to execute the algorithm on the CPU, using a SIMD instruction set (like MMX, SSE, AVX).
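The steps above (average, per-texel deviation, final sum) map directly onto a plain CPU loop. The following C++ sketch shows the standard Pearson correlation coefficient for two single-channel images. The function name is made up, and returning 0 for a completely flat image is a convention chosen here, since the coefficient is undefined in that case:

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Pearson correlation between two equally sized single-channel images.
// +1: perfectly correlated, -1: inverted, around 0: unrelated.
double pearson(const std::vector<float>& a, const std::vector<float>& b) {
    assert(a.size() == b.size() && !a.empty());
    const double n = static_cast<double>(a.size());

    // Step 1: texel averages.
    double meanA = 0.0, meanB = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        meanA += a[i];
        meanB += b[i];
    }
    meanA /= n;
    meanB /= n;

    // Steps 2 and 3: deviations from the average, summed over all texels.
    double cov = 0.0, varA = 0.0, varB = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        const double da = a[i] - meanA;
        const double db = b[i] - meanB;
        cov += da * db;
        varA += da * da;
        varB += db * db;
    }
    const double denom = std::sqrt(varA * varB);
    return denom > 0.0 ? cov / denom : 0.0;  // flat image: undefined, 0 by convention here
}
```

The two loops are also the natural places to apply SIMD if you stay on the CPU.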

OpenGL - A way to display lot of points dynamically

I have a question regarding a subject that I am now working on.
I have an OpenGL view in which I would like to display points.
So far, this is something I can handle ;)
For every point, I have its coordinates (X ; Y ; Z) and a value (unsigned char).
I have a color array giving the link between one value and a color.
For example, 255 is red, 0 is blue, and so on...
I want to display those points in an OpenGL view.
I want to use a threshold value so that I can modify the transparency of a point's color depending on that point's value.
I also want performance to stay acceptable even with a lot of points (5 billion in the worst case, but 1~2 million in a standard case).
I am now looking for the effective way to handle this.
I am interested in the VBO. I have read that it will allow some good performance and also that I can modify the buffer as I want without recalculating it from scratch (as with display list).
So that I can solve the threshold issue.
However, doing this on a million points dynamically will require some heavy calculations (at least a pretty bad for loop), no?
I am open to any suggestions and I would like to discuss any of your ideas!

Trying to display a billion points or more is generally (forgive the pun) pointless.
Even an extremely high resolution screen has only a few million pixels. Nothing you can do will get it to display more points than that.
As such, your first step is almost undoubtedly to figure out a way to restrict your display to a number of points that's at least halfway reasonable. OpenGL can (and will) oblige if you ask it to display more, but your monitor won't, and neither will mine or much of anybody else's.

Not directly related to the OpenGL part of your question, but if you are looking at rendering massive point clouds you might want to read up on space partitioning hierarchies such as octrees to keep performance in check.

Put everything into one VBO. Draw it as an array of points: glDrawArrays(GL_POINTS,0,num). Calculate alpha in a pixel shader (using threshold passed as uniform).
If you want to change a small subset of points - you can map a sub-range of the VBO. If you need to update large parts frequently - you can use Transform Feedback to utilize GPU.

If you need to simulate something for the updates, you should consider using CUDA or OpenCL to run the update completely on the GPU. This will give you the best performance. Otherwise, you can use a single VBO and update it once per frame from the CPU. If this gets too slow, you could try multiple buffers and distribute the updates across several frames.
For the threshold, you should use a shader uniform variable instead of modifying the vertex buffer. This allows you to set a value per-frame which can be then combined with the data from the vertex buffer (for instance, you set a float minVal; and every vertex with some attribute less than minVal gets discarded in the geometry shader.)
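To illustrate the idea of keeping the point data fixed and only changing a uniform, here is a CPU-side C++ sketch of the per-point logic a shader would run. Everything in it is an assumption for illustration: the `Rgba` struct, the blue-to-red palette, the function names, and the rule that values below the threshold become fully transparent:

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

struct Rgba { float r, g, b, a; };

// Hypothetical palette: value 0 maps to blue, 255 to red, linear in between.
std::array<Rgba, 256> makePalette() {
    std::array<Rgba, 256> p{};
    for (int v = 0; v < 256; ++v) {
        const float t = static_cast<float>(v) / 255.0f;
        p[static_cast<std::size_t>(v)] = Rgba{t, 0.0f, 1.0f - t, 1.0f};
    }
    return p;
}

// What the shader would compute per point: the color comes from the
// palette, the alpha only from the threshold (the "uniform"). The point
// data itself never changes when the threshold changes.
Rgba shadePoint(std::uint8_t value, float threshold, const std::array<Rgba, 256>& palette) {
    Rgba c = palette[value];
    c.a = (static_cast<float>(value) < threshold) ? 0.0f : 1.0f;
    return c;
}
```

In a real shader the palette would typically live in a small 1D texture or uniform array, and changing the threshold would cost one uniform update per frame instead of a loop over all points.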

How to go about benchmarking a software rasterizer

OK, I've been developing a software rasterizer for some time now, but have no idea how to go about benchmarking it to see if it's actually any good. Say you can render X amount of verts at Y frames per second: what would be a good way to analyse this data to see if it's any good, rather than someone just saying
"30 fps with 1 light is good" etc.?

What do you want to measure? I suggest fillrate and triangle rate. Basically fillrate is how many pixels your rasterizer can spit out each second, Triangle rate is how many triangles your rasterizer + affine transformation functions can push out each second, independent of the fillrate. Here's my suggestion for measuring both:
To measure the fillrate without getting noise from the time used for the triangle setup, use only two triangles, which form a quad. Start with a small size, and then increase it in small steps. You should eventually find the size at which the render time reaches one second. If you don't, you can render full-screen triangle pairs with blending, which is a pretty slow operation and only burns fillrate. The fillrate is then the width x height of your rendered quad per second. For example, 4 megapixels / second.
To measure the triangle rate, do the same thing, only with the number of triangles this time. Start with two tiny triangles, and increase the number of triangles until the rendering time reaches one second. The time used by the triangle/transformation setup is much more apparent in small triangles than the time used to fill them. The unit is triangles per second.
Also, the overall time used to render a frame might be comparable too. The render time for a frame is the difference in global time between consecutive frames, i.e. the delta time. The reciprocal of the delta time is the number of frames per second, if that delta time is constant for all frames.
Of course, for these numbers to be half-way comparable across rasterizers, you have to use the same techniques and features. Comparing numbers from a rasterizer which uses per-pixel lighting against another which uses flat-shading doesn't make much sense. Resolution and color depth should also be equal.
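The three figures above are simple ratios once you have the timings. Here is a small C++ sketch with made-up function names. `secondsTaken` is a hypothetical helper that times whatever render loop you pass in, and the quad is assumed to be fully inside the framebuffer so that every pixel of it is actually written:

```cpp
#include <cassert>
#include <chrono>

// Wall-clock seconds taken by `work()`, e.g. a loop that renders some frames.
template <typename F>
double secondsTaken(F&& work) {
    const auto start = std::chrono::steady_clock::now();
    work();
    const std::chrono::duration<double> elapsed = std::chrono::steady_clock::now() - start;
    return elapsed.count();
}

// Pixels written per second when a quadWidth x quadHeight quad is
// rendered `frames` times in `seconds`.
double fillrate(double quadWidth, double quadHeight, double frames, double seconds) {
    assert(seconds > 0.0);
    return quadWidth * quadHeight * frames / seconds;
}

// Triangles processed per second.
double triangleRate(double trianglesPerFrame, double frames, double seconds) {
    assert(seconds > 0.0);
    return trianglesPerFrame * frames / seconds;
}

// Frames per second from a (constant) delta time.
double framesPerSecond(double deltaSeconds) {
    assert(deltaSeconds > 0.0);
    return 1.0 / deltaSeconds;
}
```

For example, a 2000x1000 quad rendered twice in one second corresponds to the 4 megapixels / second figure mentioned above.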
As for optimization, getting a proper profiler should do the trick. GCC has the GNU profiler gprof. If you want an opinion on clever things to optimize in a rasterizer, ask that as a separate question. I'll answer to the best of my ability.

If you want to determine if it's "any good" you will need to compare your rasterizer with other rasterizers. "30 fps with 1 light" might be extremely good, if no-one else has ever managed to go beyond, say, 10 fps.
