I am looking for a sort method to optimize the rendering of a scene (regardless of the number of meshes and their sizes), minimizing state changes and maximizing geometry batching so that the fewest possible glDraw* calls are issued.
In my proposal, the most judicious choice to render a specific batch is glMultiDrawElements, because it takes an array of index arrays (one per model contained in the batch) and renders the whole batch in one call. Plus, it should work well with space partitioning: if a mesh is not visible in the camera frustum, a boolean flag m_IsVisible is set to false, the indices of that geometry are not included, and thus it is not rendered.
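For illustration, here is a minimal sketch of the kind of draw routine I have in mind (Mesh, m_IsVisible and the batch layout are placeholder names; the indices of all models in a batch are assumed to live in one shared element buffer referenced by the batch's VAO):

#include <GL/glew.h>
#include <vector>

struct Mesh
{
    GLsizei     indexCount;    // number of indices of this model
    const void* indexOffset;   // byte offset into the batch's element buffer
    bool        m_IsVisible;   // updated by the frustum culling pass
};

// One batch = one VAO (same VBO/element buffer, same program, material, textures).
void drawBatch(GLuint vao, const std::vector<Mesh>& meshes)
{
    std::vector<GLsizei>     counts;
    std::vector<const void*> offsets;
    for (const Mesh& m : meshes)
    {
        if (!m.m_IsVisible)          // culled meshes simply contribute no indices
            continue;
        counts.push_back(m.indexCount);
        offsets.push_back(m.indexOffset);
    }

    glBindVertexArray(vao);
    glMultiDrawElements(GL_TRIANGLES, counts.data(), GL_UNSIGNED_INT,
                        offsets.data(), static_cast<GLsizei>(counts.size()));
}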
So here is the proposition I want to share with anyone interested in the subject. For the sake of readability, I took some time to lay out my demonstration properly in a Word document:
Here are the criteria I use to sort my geometry during initialization (they are ordered by relevance). They could be gathered in an enum:
enum E__SortingCriteria
{
    GEOMETRY_CRITERION,        // geometry storage -> VBOs
    SHADER_PROGRAM_CRITERION,
    MATERIAL_CRITERION,
    TEXTURE_CRITERION
};
As you can see, the state changes and the geometry batching seem to be optimized. However, I am concerned that something may not be correct or that there is room for improvement.
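To make the proposal concrete, the sort these criteria imply could be a simple lexicographic comparison on per-item state IDs; a rough sketch (RenderItem and its fields are made-up names, not actual engine code):

#include <algorithm>
#include <cstdint>
#include <tuple>
#include <vector>

// Hypothetical per-draw record; each ID mirrors one criterion, in priority order.
struct RenderItem
{
    std::uint32_t vboId;       // GEOMETRY_CRITERION
    std::uint32_t programId;   // SHADER_PROGRAM_CRITERION
    std::uint32_t materialId;  // MATERIAL_CRITERION
    std::uint32_t textureId;   // TEXTURE_CRITERION
};

void sortForBatching(std::vector<RenderItem>& items)
{
    std::sort(items.begin(), items.end(),
              [](const RenderItem& a, const RenderItem& b)
              {
                  return std::tie(a.vboId, a.programId, a.materialId, a.textureId)
                       < std::tie(b.vboId, b.programId, b.materialId, b.textureId);
              });
}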
Related
I have a grayscale texture (8000*8000); the value of each pixel is an ID (actually, this ID is the ID of the triangle to which the fragment belongs; I want to use this method to work out how many triangles, and which ones, are visible in my scene).
Now I need to count how many unique IDs there are and what they are. I want to implement this with GLSL and minimize the data transfer between GPU RAM and CPU RAM.
The initial idea I came up with is to use a shader storage buffer, bound to an array in GLSL of size totalTriangleNum, then iterate over the ID texture in the shader and increment the array element whose index equals the ID read from the texture.
After that, read the buffer back into the OpenGL application and get what I want. Is this an efficient way to do it? Or are there better solutions, such as a compute shader (I'm not familiar with them) or something else?
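To make the idea concrete, here is roughly what I have in mind, assuming OpenGL 4.3+ so the buffer can be incremented with atomicAdd from a fragment (or compute) shader; all the names are placeholders:

#include <GL/glew.h>
#include <vector>

// One uint counter per triangle, zero-initialised before the counting pass.
GLuint createHistogramBuffer(GLsizei totalTriangleNum)
{
    GLuint ssbo = 0;
    glGenBuffers(1, &ssbo);
    glBindBuffer(GL_SHADER_STORAGE_BUFFER, ssbo);
    std::vector<GLuint> zeros(totalTriangleNum, 0u);
    glBufferData(GL_SHADER_STORAGE_BUFFER, totalTriangleNum * sizeof(GLuint),
                 zeros.data(), GL_DYNAMIC_COPY);
    glBindBufferBase(GL_SHADER_STORAGE_BUFFER, 0, ssbo);
    return ssbo;
}

// Fragment shader sketch: each fragment bumps the counter of its triangle ID.
const char* countingFragmentShader = R"GLSL(
#version 430
layout(std430, binding = 0) buffer Histogram { uint counts[]; };
uniform usampler2D idTexture;   // the 8000x8000 ID texture
in vec2 uv;
void main()
{
    uint id = texture(idTexture, uv).r;
    atomicAdd(counts[id], 1u);
}
)GLSL";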
I want to use this method to calculate how many triangles and which triangles are visible in my scene.
Given your description of your data let me rephrase that a bit:
You want to determine how many distinct values there are in your dataset, and how often each value appears.
This is commonly known as a histogram. Unfortunately (for you), generating histograms is among the problems that are not trivially solved on GPUs. Essentially you have to divide your image into smaller and smaller sub-images (BSP, quadtree, etc.) until you are down to single pixels, on which you perform the evaluation. Then you backtrack, propagating the sub-histograms upward, essentially performing an insertion or merge sort on the histogram.
Generating histograms on GPUs is still an active research topic, so I suggest you read up on the published academic work (usually accompanied by source code). Keywords: histogram, GPU.
This one is a nice paper done by the AMD GPU researchers: https://developer.amd.com/wordpress/media/2012/10/GPUHistogramGeneration_preprint.pdf
For the past few weeks, I have been working on an algorithm that finds hidden surfaces of complex meshes and removes them. These hidden surfaces are completely occluded, and will never be seen. Due to the nature of the meshes I'm working with, there are a ton of these hidden triangles. In some cases, there are more hidden surfaces than visible surfaces. As removing them manually is prohibitive for larger meshes, I am looking to automate this with software.
My current algorithm consists of the following steps (a rough sketch follows the list):
Generating several points on the surface of a triangle.
For each point, generate a hemisphere sampler aligned to the normal of the triangle.
Cast rays up into the hemispheres.
If fewer than a certain number of rays are unoccluded, I flag the triangle for deletion.
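In C++-flavoured pseudocode, the core of the current implementation looks roughly like this (samplePointOnTriangle, sampleCosineHemisphere and rayHitsAnything stand in for my actual helpers):

#include <random>

struct Vec3 { float x, y, z; };
struct Triangle { Vec3 a, b, c, normal; };

// Stand-ins for the real helpers (declarations only):
Vec3 samplePointOnTriangle(const Triangle& tri, std::mt19937& rng);
Vec3 sampleCosineHemisphere(const Vec3& normal, std::mt19937& rng);
bool rayHitsAnything(const Vec3& origin, const Vec3& dir);   // true if the ray is blocked

// Returns true if the triangle should be flagged for deletion.
bool isTriangleOccluded(const Triangle& tri, int numPoints, int raysPerPoint,
                        int minUnoccludedRays, std::mt19937& rng)
{
    int unoccluded = 0;
    for (int p = 0; p < numPoints; ++p)
    {
        Vec3 origin = samplePointOnTriangle(tri, rng);
        for (int r = 0; r < raysPerPoint; ++r)
        {
            Vec3 dir = sampleCosineHemisphere(tri.normal, rng);
            if (!rayHitsAnything(origin, dir))
                ++unoccluded;
        }
    }
    return unoccluded < minUnoccludedRays;
}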
However, this algorithm is causing a lot of grief. It's very inconsistent. While some of the "occluded" faces are not found as occluded by the algorithm, I'm more worried about very visible faces that get removed due to issues with the current implementation. Therefore, I'm wondering about two things, mainly:
Is there a better way to find and remove these hidden surfaces than raytracing?
Should I investigate non-random ray generation? I'm currently generating random directions in a cosine-weighted hemisphere, which could be causing issues. The only reason I haven't investigated this is that I have yet to find an algorithm to generate evenly spaced rays over a hemisphere.
Note: this is intended to be an object-space algorithm, i.e. visibility from any angle, not from a fixed camera.
I've actually never implemented ray tracing, but I have a few suggestions anyhow. As your goal is to detect every hidden triangle, you could turn the problem around and instead find every visible triangle.
I'm thinking of something along the lines of either:
Ray trace from the outside and towards the centre/perpendicular to the surface, mark any triangle hit as visible.
Cull all others.
or
Choose a view of your model.
Rasterize the model (for example, using a different colour for each triangle).
Mark any triangle that shows up in the rendered view as visible.
Change the orientation and repeat.
Cull all non-visible triangles.
The advantage of the latter is that it should be relatively cheap to implement using a graphics API, as long as you can read back the pixels reliably.
A disadvantage of both is the resolution needed: triangles inside small openings that should not be culled may still be culled, so the number of rays may be prohibitive (in the first approach), or you will need very large off-screen framebuffers (in the second).
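For what it's worth, the readback step of the second option could look roughly like this sketch, assuming the model is rasterised into an off-screen GL_R32UI attachment with each triangle's ID written as its colour (the "ID + 1, 0 = background" encoding is just an example):

#include <GL/glew.h>
#include <cstddef>
#include <vector>

// Read the ID buffer back and mark every triangle that was hit by at least one pixel.
void markVisibleTriangles(int width, int height, std::vector<bool>& visible)
{
    std::vector<GLuint> ids(static_cast<std::size_t>(width) * height);
    glReadPixels(0, 0, width, height, GL_RED_INTEGER, GL_UNSIGNED_INT, ids.data());

    for (GLuint id : ids)
        if (id != 0 && (id - 1) < visible.size())   // 0 is reserved for "background"
            visible[id - 1] = true;
}

Repeat this for enough orientations, then cull every triangle that was never marked.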
A couple of ideas that may help.
Use a connectivity test to determine what is connected to your main model (if there is one).
Use a variant of Depth Peeling (I've used it to convert shells into voxels; once you know what is inside the models that you want to keep (the voxels), you can intersect the junk that you want to remove.)
Create a connectivity graph and prune the graph based on the complexity of connected groups.
I am using GLSL as a framework for GPGPU real-time image processing. I am currently trying to shave off a few more milliseconds to make my application real-time. Here's the basic setup:
I take an input image, calculate several transformations of it, and then output a result image. For instance, let the input image be I. Then one fragment shader calculates f(I), a second calculates g(I), and the last one calculates h(f(I), g(I)).
My question is about efficiently calculating f(I) and g(I): does it matter whether I use 2 separate fragment shaders (and therefore 2 rendering passes), or a single fragment shader with 2 outputs? Will the latter run faster? I have mostly found discussions about the how-to, not about the performance.
Edit
Thanks for the replies so far. Following several remarks, here's an example for my use-case with some more details:
I want to filter the rows of image I with a 1-D filter, and also filter the rows of the squared image (each pixel squared), i.e. f(I) = filter rows and g(I) = square and filter rows:
shader1: (input image) I --> filter rows --> I_rows (output image)
shader2: (input image) I --> square pixels and filter rows--> I^2_rows (output image)
The question is: will writing a single shader that does both operations be faster than running these two shaders one after the other? derhass suggests that the answer is yes, because both operations access the same texture locations and benefit from locality. But if it weren't for the texture locality, would I still get a performance boost? Or is a single shader rendering to two outputs basically equivalent to two render passes?
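For reference, the single-pass variant with two outputs that I have in mind would look roughly like this (the kernel radius, weights and names are only an example):

// MRT fragment shader: one pass writes both I_rows and I^2_rows.
const char* combinedRowFilterShader = R"GLSL(
#version 330 core
uniform sampler2D inputImage;   // I
uniform float     weights[9];   // 1-D row filter (radius 4 here, as an example)
uniform vec2      texelSize;    // 1.0 / textureSize(inputImage, 0)
in vec2 uv;
layout(location = 0) out vec4 filteredRows;        // f(I)
layout(location = 1) out vec4 filteredSquaredRows; // g(I)
void main()
{
    vec4 sum  = vec4(0.0);
    vec4 sum2 = vec4(0.0);
    for (int i = -4; i <= 4; ++i)
    {
        vec4 c = texture(inputImage, uv + vec2(float(i) * texelSize.x, 0.0));
        sum  += weights[i + 4] * c;       // each texel is fetched once...
        sum2 += weights[i + 4] * c * c;   // ...and reused for the squared filter
    }
    filteredRows        = sum;
    filteredSquaredRows = sum2;
}
)GLSL";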
Using multiple render passes is usually slower than using one pass with MRT output, but this will also depend on your situation.
As I understand it, both f(I) and g(I) sample the input image I, and if each samples the same (or closely neighboring) locations, you can greatly profit from the texture cache between the different operations: you have to sample the input texture just once, instead of twice as with the multipass approach.
Taking this approach one step further: do you even need the intermediate results f(I) and g(I) separately? Maybe you could put h(f(I),g(I)) directly into one shader, so you need neither multiple passes nor MRT. If you want to be able to dynamically combine your operations, you can still use that approach and programmatically combine different shader code parts to implement the operations (where possible), and use multiple passes only where absolutely necessary.
EDIT
As the question has been updated in the meantime, I think I can give some more specific answers:
What I said so far, especially about putting h(f(I),g(I)) into one shader, is only a good idea if h (or f and g) does not need any neighboring pixels. If h is an n×n filter kernel, you would have to access n×n different input texels, and since those inputs are not directly available, you would have to calculate f and g for each of them. If both f and h are filter kernels, the effective filter size of the compound operation will be greater, and it is much better to calculate the intermediate results first and use multiple passes.
Looking at the specific issue you describe, it comes down to this.
If you use two separate shaders in the most naive way, your rendering will look like this:
use shader1
select some output color buffer
draw a quad
use shader2
select some different color buffer
draw a quad
Every draw call has its overhead. The GL will have to do some extra validation. Switching shaders might be the most expensive extra step here compared to the combined-shader approach, as it might force a GPU pipeline flush. Also, for each draw call you have the vertex processing, rasterization, and per-fragment attribute interpolation operations. With just one shader, a lot of this overhead goes away, and the per-fragment calculations described so far can be shared between both filters.
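For comparison, the combined single-pass MRT variant boils down to something like this on the application side (framebuffer and texture setup omitted; the names are placeholders):

#include <GL/glew.h>

// Single-pass MRT: both outputs are attached to one FBO, one draw call fills both.
void drawCombinedPass(GLuint fbo, GLuint combinedProgram, GLuint fullscreenQuadVao)
{
    glBindFramebuffer(GL_FRAMEBUFFER, fbo);
    const GLenum bufs[2] = { GL_COLOR_ATTACHMENT0, GL_COLOR_ATTACHMENT1 };
    glDrawBuffers(2, bufs);                 // route fragment outputs 0 and 1

    glUseProgram(combinedProgram);          // the shader writing f(I) and g(I)
    glBindVertexArray(fullscreenQuadVao);
    glDrawArrays(GL_TRIANGLE_STRIP, 0, 4);  // one quad, one pass
}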
But if it wasn't for the texture-locality: would I still be enjoying a performance boost?
Because of what I said so far, and specific to the shaders you presented, I tend to say: yes. But the effect will be very small to negligible if we ignore the texture accesses, especially if we assume reasonably high-resolution images, so that the relative overhead compared to the total amount of work appears small. I would at least say that using a single-pass MRT setup will not be slower. However, only benchmarking/profiling the very specific implementation on a specific GPU will give a definitive answer.
Why did I say "the shaders you presented"? Because in both cases you do the image squaring in one shader. You could also split that into two different shaders and render passes. In that case you would get additional overhead (on top of what was already mentioned) for writing the intermediate results and reading them back. However, since you run a filter over the intermediate result, you would not have to square any input texel more than once, while in the combined approach you do. If the squaring operation is expensive enough and your filter size is big enough, you could in theory save more time than is introduced by the overhead of multiple passes. Again, only benchmarking/profiling can tell you where the break-even point lies.
I have done some benchmarking with MRT vs. multiple render passes myself in the past, although the image-processing operations I was interested in were a bit different from yours. What I found is that in such scenarios texture access is the key factor, and you can hide a lot of other calculations (like squaring a color value) in the texture access latency. I also think that your "but if it wasn't for the texture-locality" scenario is a bit unrealistic, since texture access is the major contribution to the overall running time. And it isn't just the locality, it is also the total number of texture accesses: with the multiple-shader approach, an image of size w*h and a 1D filter of size n, you end up with 2*w*h*n texture accesses overall, while the combined approach reduces this to w*h*n, and that can make a huge difference.
For an AMD FirePro V9800, an image size of 1920x1080, and just copying the pixels to two output buffers by rendering textured quads, I got ~0.320 ms with two passes (even without switching shaders) vs. ~0.230 ms with a single MRT pass. So execution time was reduced by "only" 30%, but this was with just one texture fetch per shader invocation. With filter kernels, I'd expect this figure to get closer to a 50% reduction as the kernel size increases (but I haven't measured that).
Let us ignore any potential benefits from hardware-specific things like data cache, register re-use, etc. that might occur if you do your entire algorithm in a single shader invocation and focus entirely on algorithm complexity for a minute.
A Gaussian Blur on a 2D image is a separable filter (X and Y can be blurred as a much simpler series of 1D blurs), and you can actually get better performance if you split the horizontal and vertical application into two passes.
Consider the complexity of two 1D blurs vs. one 2D blur in Big O, for an image of size w*h and a kernel of width n:
Two-Pass Gaussian Blur (two 1D blurs): O(w*h*2n)
Single-Pass Gaussian Blur (single 2D blur): O(w*h*n^2)
For a 9-tap kernel, that is 18 samples per pixel over two passes versus 81 samples per pixel in a single pass.
Deferred shading is another example. Instead of one massive loop over all lights in a single pass, many implementations do one pass per light, shading only the area of the screen that each individual light actually covers.
Multi-pass is not always a bad thing; when it simplifies your algorithm, as in the case of a separable filter or light coverage, it is often a good choice.
Your results may vary, but if you can show an appreciable difference in algorithmic complexity in Big O notation between one approach and the other, it is worth exploring the run-time performance of both implementations.
I've read about octrees, and I don't fully understand how they would work or be implemented in a voxel world where the octree's purpose is to lower the number of voxels you render by merging repeated voxels into one big "voxel".
Here are the questions I want clarification about:
What type of data structure would you use? How could you turn a 3-D array of voxels into an array that has different-sized voxels taking up multiple locations in the array?
What are the nodes and what are they used for?
Does the octree connect the voxels so there are ONLY square shapes, or could it be a rectangle, an L shape, an entire Y column of voxels, or what?
Do octrees really improve the performance of a voxel game? If so, usually by how much?
Quick answers:
A tree: each node has 8 children (top-back-left, top-back-right, etc.), down to a certain level. The code for this can get quite complex, especially if the voxels can change at runtime.
The type of voxel (colour, material, a list of items)
Yep, cubes only. More specifically 1x1, 2x2, 4x4, 8x8, etc., and it must be an entire node. If you really want to, you could define some sort of patterns, but then it's no longer an octree.
Yeah, but it depends on your data. Imagine describing 256 identical blocks individually versus describing them once (like air in Minecraft).
I'd start by trying to understand quadtrees first. You can do that on paper, or make a test program with them. You'll answer these questions yourself if you experiment.
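To make the tree part more concrete, here is a minimal sketch of what such a node could look like (purely illustrative, not tied to any particular engine):

#include <array>
#include <cstdint>
#include <memory>

// Illustrative octree node: either a leaf describing one uniform block of voxels,
// or an internal node with 8 children covering the 8 sub-cubes.
struct OctreeNode
{
    std::uint8_t voxelType = 0;   // e.g. 0 = air, 1 = stone, ... (meaningful for leaves)
    bool isLeaf = true;           // a leaf covers its whole cube with voxelType
    std::array<std::unique_ptr<OctreeNode>, 8> children; // unused while isLeaf == true
};

// Collapse: if all 8 children are leaves of the same type, the parent becomes one big leaf.
void tryCollapse(OctreeNode& node)
{
    if (node.isLeaf) return;
    for (auto& c : node.children)
        if (!c || !c->isLeaf || c->voxelType != node.children[0]->voxelType)
            return;
    node.voxelType = node.children[0]->voxelType;
    node.isLeaf = true;
    for (auto& c : node.children) c.reset();
}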
An octree done correctly can also help you with neighbour searches, which let you determine whether a face is considered "visible" (i.e. you end up with only the hull of voxels visible). Once you've established your octree, you use it to store your XYZ coordinates, which you then extract into a single array. You then feed this array into your vertex buffer (GL solutions require this), which you can render in chunks as needed (as the camera moves forward, etc.).
Octrees also, by their very nature, collapse cubes into bigger ones if they are all of the same type, much like Tetris does when you have colours/shapes that "fit" one another. This in turn can reduce your vertex count, so at render time you're really drawing a combination of squares and rectangles.
If done correctly, you will end up with a lot of chunks that only have the outward-facing faces in their vertex buffers. You then also have to build your own occlusion culling algorithm, which reduces visibility on top of this, resulting in even less rendering being required.
I did an example here:
https://vimeo.com/71330826
Notice how only the outside is being rendered, yet the chunks themselves go all the way down to the bottom, even though the chunks' depth faces should cancel each other out (it needs more optimisation). Also note how, as the camera turns around, the faces are removed from the rendering buffers.
I've been following the GPU Gems 3 tutorial on how to blur based on camera movement. However, I want to implement a blur based on object movement too. The solution is presented in the article (see the quote below), but I'm curious how exactly to implement it.
At the moment I'm multiplying the object's matrix by the view-projection matrix, then separately again by the previous frame's view-projection matrix, and then passing both into the pixel shader to calculate the velocity, instead of passing just the view-projection matrices.
If that is in fact the correct method, then why can't I simply pass in the model-view-projection matrix? I would have assumed they would be the same value.
GPU Gems 3 Motion Blur
To generate a velocity texture for rigid dynamic objects, transform the object by using the current frame's view-projection matrix and the last frame's view-projection matrix, and then compute the difference in viewport positions the same way as for the post-processing pass. This velocity should be computed per-pixel by passing both transformed positions into the pixel shader and computing the velocity there.
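In shader terms, what I am doing at the moment looks roughly like this (the uniform names are mine; both matrices are built on the CPU as described above):

const char* velocityVertexShader = R"GLSL(
#version 330 core
uniform mat4 currModelViewProj;   // current view-projection * object matrix
uniform mat4 prevModelViewProj;   // previous view-projection * object matrix
layout(location = 0) in vec3 position;
out vec4 currClipPos;
out vec4 prevClipPos;
void main()
{
    currClipPos = currModelViewProj * vec4(position, 1.0);
    prevClipPos = prevModelViewProj * vec4(position, 1.0);
    gl_Position = currClipPos;
    // The pixel shader divides both by .w and takes the screen-space
    // difference as the per-pixel velocity, as in the article.
}
)GLSL";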
Check out my research I did a few months ago on this topic: https://slu-files.s3.us-east-1.amazonaws.com/Fragment_shader_dynamic_blur.pdf
(images: stevenlu.net)
Sadly I did not implement textured objects when producing this material, but do use your imagination. I am working on a game engine so when that finally sees the light of day in the form of a game, you can be sure that I'll come and post breadcrumbs here.
It primarily addresses how to implement this effect in 2D, and in cases where objects do not overlap. There is not really a good way to use a fragment shader to "sweep" samples in order to generate "accurate" blur. While the effect approaches pixel-perfection as the sample count is cranked up, the geometry that must be generated to cover the swept area has to be assembled manually using some "ugly" techniques.
In full 3D it's a rather difficult problem to know which pixels a dynamic object will sweep over during the course of a frame. Even with static geometry and a moving camera, the solution proposed by the GPU Gems article is incorrect when moving past things quickly, because it cannot handle blending over the area swept out by a moving object.
That said, if this approximation, which neglects the sweep, is sufficient (and it may be), then what you can do to extend it to dynamic objects is to take their motion into account. You'll need to work out the details, of course, but look at lines 2 and 5 in the second code block of the article you linked: they are the current and previous screen-space "positions" of the pixel. You simply have to pass in the matrices that allow you to compute the previous position of each pixel, taking the dynamic motion of your object into account.
It shouldn't be too bad. In the pass where you render your dynamic object you send in an extra matrix that represents its motion over the last frame.
Update:
I found that this paper describes an elegant and performant approach that provides fairly high-quality, physically correct blurring for a 3D pipeline. It will be hard to do much better than this within the constraint of rendering the full scene no more than once, which is required for performance reasons.
I noticed that with some of the examples the quality of the velocity buffer could be better; for example, a rotating wheel should have some curves in velocity space. I believe that if the velocities can be set properly (which may require custom fragment shaders to render the velocity out), they will look intuitively correct, like the spinning cube seen above from my 2D exploration of dynamic motion blurring.