Strategies when vertex attributes use different indexing - OpenGL

I have an OBJ file that I've parsed, and, not surprisingly, the indexing for vertex positions and vertex texture coordinates is separate.
Here are a couple of OBJ lines to illustrate what I mean by different indexing. These are quads, where the first index references the XYZ position and the second index references the UV coordinates:
f 3899/8605 3896/8606 720/8607 3897/8608
f 3898/8609 3899/8610 3897/8611 721/8612
I know that a solution is to do some duplication, but what's the cleverest way to proceed?
As of now I have these two options in mind:
1) Use the indexing to create two big arrays of vertex positions and texture coordinates. This means I duplicate everything, blindly ending up with one vertex for each v/vt pair found in the faces. If, for example, 1/3 appears in the first face and the same 1/3 appears in a different face, I end up with two separate vertices. Then proceed with glDrawArrays, no longer using indices but the newly created arrays (full of duplicates).
2) Examine each face vertex to build unique "GL vertices" (position + texture coordinate pairs, in my specific case) and figure out a new indexing over them. Unlike 1), here I don't treat the same pair found multiple times as separate vertices. I then create a new index buffer for these new vertices and finally use glDrawElements for the draw call with the new indices.
Now I believe the first option is way easier, but I guess each glDrawArrays call will be a bit slower than a glDrawElements call, right? How big is the advantage I'd gain?
The second option looks, at first thought, pretty slow as a preprocessing step and more complicated to implement. But will it give me much better performance overall?
Is there any other way to deal with this issue?

If you have a few low-poly models, go for option #1; it's way easier to implement and the performance difference will be unnoticeable.
Option #2 would be the proper way if you have some high-poly models (looking at the sample, you have at least 9k vertices in there).
Generally you should not worry about model loading time, because that is done only once; after that you can convert/save the model in the most optimal format you need (serialize it just the way it is stored in your code).
Where's the dividing line between these two approaches? It's impossible to say without real-life profiling on the target hardware with your vertex rendering pipeline (skeletal animation, shadows, everything adds its toll).
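For what it's worth, option #2 is just a single pass over the face data with a hash map, so the preprocessing is cheaper than it may sound. A minimal sketch, assuming the OBJ has already been parsed into position/texcoord arrays and a flat list of (v, vt) index pairs per triangulated face corner (all names here are placeholders):
```cpp
#include <cstdint>
#include <unordered_map>
#include <utility>
#include <vector>

struct Vec3 { float x, y, z; };
struct Vec2 { float u, v; };
struct Vertex { Vec3 pos; Vec2 uv; };   // one "GL vertex" = one unique v/vt pair

// Builds deduplicated vertex/index buffers from parsed OBJ data.
// positions/texcoords are the raw OBJ arrays (0-based here for simplicity);
// corners holds one (v, vt) index pair per face corner, already triangulated.
void buildBuffers(const std::vector<Vec3>& positions,
                  const std::vector<Vec2>& texcoords,
                  const std::vector<std::pair<uint32_t, uint32_t>>& corners,
                  std::vector<Vertex>& outVertices,
                  std::vector<uint32_t>& outIndices)
{
    std::unordered_map<uint64_t, uint32_t> cache;   // (v,vt) pair -> new index
    cache.reserve(corners.size());

    for (const auto& c : corners) {
        uint64_t key = (uint64_t(c.first) << 32) | c.second;
        auto it = cache.find(key);
        if (it == cache.end()) {
            // First time this v/vt pair is seen: emit a new GL vertex.
            outVertices.push_back({ positions[c.first], texcoords[c.second] });
            uint32_t newIndex = uint32_t(outVertices.size() - 1);
            cache.emplace(key, newIndex);
            outIndices.push_back(newIndex);
        } else {
            // Seen before: just reuse the existing index.
            outIndices.push_back(it->second);
        }
    }
}
```
outVertices then goes into the VBO and outIndices into the element buffer used by glDrawElements.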

Related

Slow transform feedback-based picking?

I'm trying to implement picking routine using transform feedback. Currently it works ok, but the problem is very low speed (slower than GL_SELECT).
How it works now:
1. Bind the TBO using glBindBufferRange() with an offset (0 at the beginning).
2. Reset the memory (the size of the TF varyings structure) using glBufferSubData(), to be sure picking will be correct. The main problem is here.
3. Draw the objects with a geometry shader that checks for intersection with the picking ray. If an intersection is found, the shader writes it to a TF varying (initially it is set to 'no intersection'; see step 2).
4. Increase the offset and go to step 1 with the next object.
So, at the end I have an array of picking data for each object.
The question is how to avoid calling glBufferSubData() on each iteration? Possible solutions (but I don't know how to implement them) are:
Write only one TF varying, so it is not necessary to reset the others.
Reset the data some other way.
Any ideas?
If all you want to do is clear a region of a buffer, use glClearBufferSubData. That being said, it's not clear why you need to clear it, instead of just overwriting what's there.
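As a rough sketch (assuming a GL 4.3 context, and that tbo, offset and size are your own variables), clearing just that region to zero could look like this; note that offset and size must be multiples of the 4-byte texel size implied by GL_R32UI:
```cpp
// Clear `size` bytes starting at `offset` in the TF buffer to zero,
// without uploading a zero-filled block from the CPU each iteration.
GLuint zero = 0;
glBindBuffer(GL_TRANSFORM_FEEDBACK_BUFFER, tbo);
glClearBufferSubData(GL_TRANSFORM_FEEDBACK_BUFFER, GL_R32UI,
                     offset, size,
                     GL_RED_INTEGER, GL_UNSIGNED_INT, &zero);
```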
FYI: Picking is best implemented by rendering the scene, assigning objects different "colors", and reading the pixel of interest back. Your method is always going to be slower.
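A hedged sketch of that color-ID picking approach; pickFbo, pickProgram, pickColorLocation, objects, mouseX/mouseY and viewportHeight are all assumed to exist in your code:
```cpp
// 1) Render the scene into an offscreen picking FBO, one flat color per object.
glBindFramebuffer(GL_FRAMEBUFFER, pickFbo);
glClearColor(0.0f, 0.0f, 0.0f, 0.0f);
glClear(GL_COLOR_BUFFER_BIT | GL_DEPTH_BUFFER_BIT);
glUseProgram(pickProgram);                    // outputs one uniform color per object
for (uint32_t id = 0; id < objects.size(); ++id) {
    // Encode the object ID in the RGB channels (supports ~16M objects).
    float r = ((id >>  0) & 0xFF) / 255.0f;
    float g = ((id >>  8) & 0xFF) / 255.0f;
    float b = ((id >> 16) & 0xFF) / 255.0f;
    glUniform4f(pickColorLocation, r, g, b, 1.0f);
    objects[id].draw();
}

// 2) Read back the single pixel under the mouse cursor.
unsigned char pixel[4] = {0, 0, 0, 0};
glReadPixels(mouseX, viewportHeight - mouseY - 1, 1, 1,
             GL_RGBA, GL_UNSIGNED_BYTE, pixel);

// 3) Decode the ID; alpha 0 means the background was hit.
int pickedId = (pixel[3] == 0) ? -1
             : pixel[0] | (pixel[1] << 8) | (pixel[2] << 16);
glBindFramebuffer(GL_FRAMEBUFFER, 0);
```
One render of the scene plus a single 1x1 glReadPixels is typically far cheaper than a transform-feedback pass per object.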

Cost of using multiple render targets

I am using GLSL as a framework for GPGPU real-time image processing. I am currently trying to "shave off" a few more milliseconds to make my application real-time. Here's the basic setup:
I take an input image, calculate several transformations of it, and then output a result image. For instance, let the input image be I. Then one fragment shader calculates f(I); the second calculates g(I); and the last one calculates h(f(I),g(I)).
My question is regarding efficiently calculating f(I),g(I): does it matter if I use 2 separate fragment shaders (and therefore 2 rendering passes), or if I use a single fragment shader with 2 outputs? Will the latter run faster? I have mostly found discussions about the "how-to"; not about the performance.
Edit
Thanks for the replies so far. Following several remarks, here's an example for my use-case with some more details:
I want to filter the rows of image I with a 1-D filter, and also filter the rows of the squared image (each pixel squared). So f(I) = filter rows and g(I) = square and filter rows:
shader1: (input image) I --> filter rows --> I_rows (output image)
shader2: (input image) I --> square pixels and filter rows--> I^2_rows (output image)
The question is: will writing a single shader that does both operations be faster than running these two shaders one after the other? @derhass suggests that the answer is positive, because of accessing the same texture locations and enjoying locality. But if it weren't for the texture locality, would I still get a performance boost? Or is a single shader rendering to two outputs basically equivalent to two render passes?
Using multiple render passes is usually slower than using one pass with MRT output, but this will also depend on your situation.
As I understand it, both f(I) and g(I) sample the input image I, and if each samples the same (or closely neighboring) locations, you can greatly profit from the texture cache between the different operations: you have to sample the input texture just once, instead of twice with the multipass approach.
Taking this approach one step further: do you even need the intermediate results f(I) and g(I) separately? Maybe you could put h(f(I),g(I)) directly into one shader, so you need neither multiple passes nor MRTs. If you want to be able to dynamically combine your operations, you can still use that approach and programmatically combine different shader code parts to implement the operations (where possible), using multiple passes only where absolutely necessary.
EDIT
As the question has been updated in the meantime, I think I can give some more specific answers:
What I said so far, especially about putting h(f(I),g(I)) into one shader, is only a good idea if h (or f and g) does not need any neighboring pixels. If h is an nxn filter kernel, you would have to access nxn different input texels, and since those inputs are not directly known, you would have to calculate f and g for each of them. If both f and h are filter kernels, the effective filter size of the compound operation will be greater, and it is much better to calculate the intermediate results first and use multiple passes.
Looking at the specific issue you describe, it comes down to this.
If you use two separate shaders in the most naive way, your rendering will look like this:
use shader1
select some output color buffer
draw a quad
use shader2
select some different color buffer
draw a quad
Every draw call has its overhead. The GL will have to do some extra validation. Switching the shaders might be the most expensive extra step here compared to the combined-shader approach, as it might force a GPU pipeline flush. Also, for each draw call, you have the vertex processing, rasterization, and per-fragment attribute interpolation operations. With just one shader, lots of this overhead goes away, and the per-fragment calculations described so far can be "shared" for both filters.
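For reference, the single-pass alternative is one framebuffer object with two color attachments and a fragment shader with two outputs. A minimal sketch of the MRT plumbing only (texFofI and texGofI are textures you have created elsewhere):
```cpp
// One FBO with two color attachments: f(I) goes to attachment 0, g(I) to 1.
GLuint fbo = 0;
glGenFramebuffers(1, &fbo);
glBindFramebuffer(GL_FRAMEBUFFER, fbo);
glFramebufferTexture2D(GL_FRAMEBUFFER, GL_COLOR_ATTACHMENT0,
                       GL_TEXTURE_2D, texFofI, 0);
glFramebufferTexture2D(GL_FRAMEBUFFER, GL_COLOR_ATTACHMENT1,
                       GL_TEXTURE_2D, texGofI, 0);

// Map the fragment shader's output 0 and 1 to the two attachments.
const GLenum drawBuffers[2] = { GL_COLOR_ATTACHMENT0, GL_COLOR_ATTACHMENT1 };
glDrawBuffers(2, drawBuffers);

// In the fragment shader (GLSL), the two results are written side by side:
//   layout(location = 0) out vec4 outF;   // f(I)
//   layout(location = 1) out vec4 outG;   // g(I)
// Both outputs can reuse the same texture fetches of I, which is where the win comes from.
```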
But if it wasn't for the texture-locality: would I still be enjoying a performance boost?
Because of what I said so far, and specific to the shaders you presented, I tend to say: yes. But the effect will be very small to negligible if we ignore the texture accesses here, especially if we assume reasonably high-resolution images, so that the relative overhead appears small compared to the total amount of work. I would at least say that using a single-pass MRT setup will not be slower. However, only benchmarking/profiling the very specific implementation on a specific GPU will give a definitive answer.
Why did I say "the shaders you presented"? Because in both cases, you do the image squaring in one shader. You could also split that into two different shaders and render passes. In that case, you would get additional overhead (on top of what was already mentioned) for writing the intermediate result and having to read it back. However, since you run a filter over the intermediate result, you do not have to square any input texel more than once, whereas in the combined approach, you do. If the squaring operation is expensive enough, and your filter size is big enough, you could in theory save more time than is introduced by the overhead of multiple passes. Again, only benchmarking/profiling can tell you where the break-even point lies.
I have done some benchmarking with MRT vs. multiple render passes myself in the past, although the image-processing operations I was interested in are a bit different from yours. What I found is that in such scenarios, the texture access is the key factor, and you can hide a lot of other calculations (like squaring a color value) in the texture access latency. I think your "but if it wasn't for the texture-locality" scenario is a bit unrealistic, since texture access is the major contribution to the overall running time. And it isn't just the locality, it is also the total number of texture accesses: with your multiple-shader approach, an image of size w*h, and a 1D filter of size n, you end up with 2*w*h*n texture accesses overall, while the combined approach reduces that to w*h*n, and that can make a huge difference.
On an AMD FirePro V9800, with an image size of 1920x1080, and just copying the pixels to two output buffers by rendering textured quads, I got ~0.320 ms with two passes (even without switching shaders) vs. ~0.230 ms with one MRT pass. So execution time was reduced by "only" 30%, but that was with just one texture fetch per shader invocation. With filter kernels, I'd expect this figure to get closer to a 50% reduction as the kernel size increases (though I haven't measured that).
Let us ignore any potential benefits from hardware-specific things like data caches, register reuse, etc. that might occur if you do your entire algorithm in a single shader invocation, and focus entirely on algorithmic complexity for a minute.
A Gaussian Blur on a 2D image is a separable filter (X and Y can be blurred as a much simpler series of 1D blurs), and you can actually get better performance if you split the horizontal and vertical application into two passes.
Consider the complexity of two 1D blurs vs. one 2D blur in Big O, for a kernel of width n applied to a w x h image:
Two-Pass Gaussian Blur (two 1D blurs): O(w * h * 2n), i.e. 2n samples per pixel.
Single-Pass Gaussian Blur (single 2D blur): O(w * h * n^2), i.e. n^2 samples per pixel.
For a 9-tap kernel, that is 18 samples per pixel over two passes versus 81 samples per pixel in one.
Deferred shading is another example. Instead of one massive loop over all lights in a single pass, many implementations will do one pass per light, shading only the area of the screen that each individual light actually covers.
Multi-pass is not always a bad thing; when it simplifies your algorithm, as in the case of a separable filter or light coverage, it is often a good thing.
Your results may vary, but if you can show an appreciable difference in algorithm complexity in Big O notation using one approach over the other, it is worth exploring the run-time performance of both implementations.

Mesh simplification of a grid-like structure

I'm working on a 3D building app. The building is done on a 3D grid (like a Rubik's Cube), and each cell of the grid is either a solid cube or a 45 degree slope. To illustrate, here's a picture of a chamfered cube I pulled off of google images:
Ignore the image to the right, the focus is the one on the left. Currently, in the building phase, I have each face of each cell drawn separately. When it comes to exporting it, though, I'd like to simplify it. So in the above cube, I'd like the up-down-left-right-back-front faces to be composed of a single quad each (two triangles), and the edges would be reduced from two quads to single quads.
What I've been trying to do most recently is the following:
Iterate through the shape layer by layer, from all directions, and for each layer figure out a good simplification (remove overlapping edges to create a single polygon, then split the polygon to avoid holes, then use ear clipping to triangulate).
I'm clearly overcomplicating things (at least I hope I am). If I've got a list of vertices, normals, and indices (currently with lots of duplicate vertices), is there some tidy way to simplify? The limitation is that indices can't be shared between faces (because I need the normals pointing in different directions), but otherwise I don't mind if it's not the fastest or most optimal solution; I'd rather it be easy to implement and maintain.
EDIT: Just to further clarify, I've already performed hidden face removal, that's not an issue. And secondly, it's of utmost importance that there is no degradation in quality, only simplification of the faces themselves (I need to retain the sharp edges).
Thanks goes to Roger Rowland for the great tips! If anyone else stumbles upon this question, here's a short summary of what I did:
First thing to tackle: ensure that the mesh you are attempting to simplify is a manifold mesh! This is a requirement for traversing halfedge data structures. One instance where I had issues with this was overlapping quads and triangles; I initially resolved to just leave the quads whole, rather than splitting them into triangles, because it was easier, but that resulted in edges that broke the halfedge mesh.
Once the mesh is manifold, create a halfedge mesh out of the vertices and faces.
With that done, decimate the mesh. I did it via edge collapsing, determining which edges to collapse through normal deviation (in my case, if the resulting faces from the collapse had normals not equal to their original values, then the collapse was not performed).
I did this via my own implementation at first, but I started running into frustrating bugs, and thus opted to use OpenMesh instead (it's very easy to get started with).
There's still one issue I have yet to resolve: if there are two cubes diagonally adjacent to and touching one another, the result is an edge with four faces connected to it: a complex edge! I suspect it'd be trivial to iterate through the edges checking the number of connected faces, and then resolve it by duplicating the appropriate vertices. That said, it's not something I'm going to invest time in fixing unless it becomes a critical issue later on.
I am giving a theoretical answer.
For the figure on the left, find all edge-sharing triangles with the same normal (same x, y, z components; use the unit normal, so that positive scaling of the vectors does not affect the comparison). Merge them. Then triangulating the result with the maximum aspect ratio will give the solution you want.
Here is another easy, possible way to do mesh simplification.
Take the normals and divide each by its magnitude (the square root of the sum of the squared components); this gives the unit normal vector. Then take adjacent triangles and compute the DOT PRODUCT of their unit normals (multiply the x, y, z components pairwise and add). This gives the COSINE of the angle between those normals, i.e. between the triangles. Pick a range (like 0.99-1), take all adjacent triangles whose cosine with respect to the reference triangle falls in this range, merge them, and retriangulate. We can safely ignore a few small-area triangles pointing in odd directions.
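A small sketch of that dot-product test (the merge/retriangulation step is left out; the Vec3 type here is just for illustration):
```cpp
#include <cmath>

struct Vec3 { float x, y, z; };

// Normalize to a unit vector so that scaling does not affect the comparison.
Vec3 normalize(const Vec3& v) {
    float len = std::sqrt(v.x * v.x + v.y * v.y + v.z * v.z);
    return { v.x / len, v.y / len, v.z / len };
}

float dot(const Vec3& a, const Vec3& b) {
    return a.x * b.x + a.y * b.y + a.z * b.z;
}

// True if two adjacent faces are "flat enough" to merge: the cosine of the
// angle between their unit normals lies in the chosen range (e.g. 0.99..1).
bool shouldMerge(const Vec3& normalA, const Vec3& normalB, float cosThreshold = 0.99f) {
    return dot(normalize(normalA), normalize(normalB)) >= cosThreshold;
}
```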
There is also another proposal for an even simpler mesh reduction, as in your left figure or building-like shapes. Define a pre-determined set of face directions (here 6 + 8 = 14), i.e. normal values, classify all faces by the direction they are closest to (by dot product), then merge and retriangulate.
Google "mesh simplification". You'll find that this problem is a huge one and is heavily researched. Take a look at these introductory resources: link (p.11 starts the good stuff) and link. CGAL has a good discussion, as well: link.
Once familiar with the issues, you'll have some decisions to make when applying simplification to your problem. How fast should the simplification be? How important is accuracy? (Iterative vertex clustering is a quick-and-dirty approach, but its results can be arbitrarily ugly.) Can you rely on a 3rd-party library? (e.g. CGAL? GTS doesn't appear active any longer, but there are others.)

3D Math - Only keeping positions within a certain amount of yards

I'm trying to determine from a large set of positions how to narrow my list down significantly.
Right now I have around 3000 positions (x, y, z) and I want to basically keep the positions that are furthest apart from each other (I don't need to keep 100 positions that are all within a 2 yard radius from each other).
Besides doing a brute force method and literally doing 3000^2 comparisons, does anyone have any ideas how I can narrow this list down further?
I'm a bit confused on how I should approach this from a math perspective.
Well, I can't remember the name for this algorithm, but I'll tell you a fun technique for handling this. I'll assume that there is a semi-random scattering of points in a 3D environment.
Simple Version: Divide and Conquer
Divide your space into a 3D grid of cubes. Each cube will be X yards on each side.
Declare a multi-dimensional array [x,y,z] such that you have an element for each cube in your grid.
Every element of the array should either be a vertex or a reference to a vertex (x,y,z) structure, and each should default to NULL.
Iterate through each vertex in your dataset and determine which cube the vertex falls in.
How? Well, you might assume that the (5.5, 8.2, 9.1) vertex belongs in MyCubes[5,8,9], assuming X (cube-side-length) is of size 1. Note: I just truncated the decimals/floats to determine which cube.
Check to see if that relevant cube is already taken by a vertex. Check: If MyCubes[5,8,9] == NULL then (inject my vertex) else (do nothing, toss it out! spot taken, buddy)
Let's save some memory
This will give you a nicely simplified dataset in one pass, but at the cost of a potentially large amount of memory.
So, how do you do it without using too much memory?
I'd use a hashtable such that my key is the Grid-Cube coordinate (5,8,9) in my sample above.
If MyHashTable.contains({5,8,9}) then DoNothing else InsertCurrentVertex(...)
Now, you have a one-pass solution with minimal memory usage (no gigantic array with a potentially large number of empty cubes). What is the cost? Well, the programming time to set up your structure/class so that you can perform the .contains check in a HashTable (or your language's equivalent).
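A compact sketch of that hash-table version; cubeSize is the X-yard cube side, and the key just packs the integer cube coordinates (std::floor is used instead of plain truncation so negative coordinates land in the right cube):
```cpp
#include <cmath>
#include <cstdint>
#include <unordered_set>
#include <vector>

struct Point { float x, y, z; };

// Pack the integer cube coordinates (e.g. 5,8,9) into a single hashable key.
static uint64_t cellKey(const Point& p, float cubeSize) {
    int64_t cx = int64_t(std::floor(p.x / cubeSize));
    int64_t cy = int64_t(std::floor(p.y / cubeSize));
    int64_t cz = int64_t(std::floor(p.z / cubeSize));
    // 21 bits per axis is plenty for a few thousand positions.
    return (uint64_t(cx & 0x1FFFFF) << 42) |
           (uint64_t(cy & 0x1FFFFF) << 21) |
            uint64_t(cz & 0x1FFFFF);
}

// Keep the first point that lands in each cube; toss the rest.
std::vector<Point> thin(const std::vector<Point>& input, float cubeSize) {
    std::unordered_set<uint64_t> taken;
    std::vector<Point> kept;
    for (const Point& p : input) {
        if (taken.insert(cellKey(p, cubeSize)).second)   // true if the cube was empty
            kept.push_back(p);
    }
    return kept;
}
```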
Hey, my results are chunky!
That's right, because we took the first result that fit in any cube. On average, we will have achieved X-separation between vertices, but as you can figure out by now, some vertices will still be close to one another (at the edges of the cubes).
So, how do we handle it? Well, let's go back to the array method at the top (memory-intensive!).
Instead of ONLY checking to see if a vertex is already in the cube-in-question, also perform this other check:
If Not ThisCubeIsTaken()
For each SurroundingCube
If not Is_Your_Vertex_Sufficiently_Far_Away_From_Me()
exit_loop_and_outer_if_statement()
end if
Next
//Ok, we got here, we can add the vertex to the current cube because the cube is not only available, but the neighbors are far enough away from me
End If
I think you can probably see the beauty of this, as it is really easy to get neighboring cubes if you have a 3D array.
If you do some smoothing like this, you can probably enforce a "don't add if it's within 0.25X" policy or something. You won't have to be too strict to achieve a noticeable smoothing effect.
Still too chunky, I want it smooth
In this variation, we will change the qualifying action for whether a vertex is permitted to take residence in a cube.
If TheCube is empty OR if ThisVertex is closer to the center of TheCube than the Cube's current vertex
InsertVertex (overwrite any existing vertex in the cube)
End If
Note, we don't have to perform neighbor detection for this one. We just optimize towards the center of each cube.
If you like, you can merge this variation with the previous variation.
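A sketch of that variation, reusing the Point type and cellKey() helper from the previous snippet: each cube simply keeps whichever point lies closest to the cube's center.
```cpp
#include <cmath>
#include <unordered_map>
#include <vector>
// Point and cellKey() as defined in the previous snippet.

// Squared distance from a point to the center of the cube it falls in.
static float distToCellCenterSq(const Point& p, float cubeSize) {
    float cx = (std::floor(p.x / cubeSize) + 0.5f) * cubeSize;
    float cy = (std::floor(p.y / cubeSize) + 0.5f) * cubeSize;
    float cz = (std::floor(p.z / cubeSize) + 0.5f) * cubeSize;
    float dx = p.x - cx, dy = p.y - cy, dz = p.z - cz;
    return dx * dx + dy * dy + dz * dz;
}

std::vector<Point> thinTowardCenters(const std::vector<Point>& input, float cubeSize) {
    std::unordered_map<uint64_t, Point> best;   // cube key -> best point so far
    for (const Point& p : input) {
        uint64_t key = cellKey(p, cubeSize);
        auto it = best.find(key);
        if (it == best.end()) {
            best.emplace(key, p);                       // cube was empty: take it
        } else if (distToCellCenterSq(p, cubeSize) <
                   distToCellCenterSq(it->second, cubeSize)) {
            it->second = p;                             // closer to the center: overwrite
        }
    }
    std::vector<Point> kept;
    kept.reserve(best.size());
    for (const auto& kv : best)
        kept.push_back(kv.second);
    return kept;
}
```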
Cheat Mode
For some people in this situation, you can simply take a 10% random selection of your dataset and that will be a good-enough simplification. However, it will be very chunky with some points very close together. On the bright side, it takes a few minutes max. I don't recommend it unless you are prototyping.

OpenGL - A way to display lot of points dynamically

I have a question regarding a subject that I am now working on.
I have an OpenGL view in which I would like to display points.
So far, this is something I can handle ;)
For every point, I have its coordinates (X ; Y ; Z) and a value (unsigned char).
I have a color array giving the link between one value and a color.
For example, 255 is red, 0 is blue, and so on...
I want to display those points in an OpenGL view.
I want to use a threshold value so that, depending on it, I can modify the transparency of a point's color based on the point's value.
I also want performance to stay good even if I have a lot of points (5 billion in the worst case, but 1-2 million in a standard case).
I am now looking for the effective way to handle this.
I am interested in VBOs. I have read that they give good performance and also that I can modify the buffer as I want without recalculating it from scratch (as with display lists).
That way I can solve the threshold issue.
However, doing this on a million points dynamically will require some heavy calculations (at least a pretty bad for loop), no?
I am open to any suggestions and would like to discuss any of your ideas!
Trying to display a billion points or more is generally (forgive the pun) pointless.
Even an extremely high resolution screen has only a few million pixels. Nothing you can do will get it to display more points than that.
As such, your first step is almost undoubtedly to figure out a way to restrict your display to a number of points that's at least halfway reasonable. OpenGL can (and will) oblige if you ask it to display more, but your monitor won't, and neither will mine or much of anybody else's.
Not directly related to the OpenGL part of your question, but if you are looking at rendering massive point clouds you might want to read up on space partitioning hierarchies such as octrees to keep performance in check.
Put everything into one VBO. Draw it as an array of points: glDrawArrays(GL_POINTS, 0, num). Calculate alpha in the fragment shader (using a threshold passed as a uniform).
If you want to change a small subset of the points, you can map a sub-range of the VBO. If you need to update large parts frequently, you can use transform feedback to utilize the GPU.
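A minimal sketch of that setup, assuming a GL loader header is already included and that the shader exposes a uniform called u_threshold (the PointVertex layout and attribute locations are my own assumptions):
```cpp
#include <cstddef>
#include <cstdint>
#include <vector>
// GL loader header (glad/GLEW/...) assumed to be included already.

struct PointVertex {
    float x, y, z;       // position
    uint8_t value;       // 0..255, mapped to a color in the shader
    uint8_t pad[3];      // keep the stride 4-byte aligned
};

GLuint vbo = 0, vao = 0;

void uploadPoints(const std::vector<PointVertex>& points) {
    glGenVertexArrays(1, &vao);
    glBindVertexArray(vao);

    glGenBuffers(1, &vbo);
    glBindBuffer(GL_ARRAY_BUFFER, vbo);
    glBufferData(GL_ARRAY_BUFFER, points.size() * sizeof(PointVertex),
                 points.data(), GL_STATIC_DRAW);

    // Attribute 0: position; attribute 1: value, normalized to 0..1.
    glEnableVertexAttribArray(0);
    glVertexAttribPointer(0, 3, GL_FLOAT, GL_FALSE, sizeof(PointVertex),
                          (void*)0);
    glEnableVertexAttribArray(1);
    glVertexAttribPointer(1, 1, GL_UNSIGNED_BYTE, GL_TRUE, sizeof(PointVertex),
                          (void*)offsetof(PointVertex, value));
}

void drawPoints(GLuint program, GLsizei count, float threshold) {
    glUseProgram(program);
    // The shader compares each point's value against this uniform and adjusts
    // alpha (or discards), so there is no CPU-side loop over the points.
    glUniform1f(glGetUniformLocation(program, "u_threshold"), threshold);
    glBindVertexArray(vao);
    glDrawArrays(GL_POINTS, 0, count);
}
```
The per-point threshold test then happens entirely on the GPU, so changing the threshold each frame costs only a single uniform update.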
If you need to simulate something for the updates, you should consider using CUDA or OpenCL to run the update completely on the GPU. This will give you the best performance. Otherwise, you can use a single VBO and update it once per frame from the CPU. If this gets too slow, you could try multiple buffers and distribute the updates across several frames.
For the threshold, you should use a shader uniform variable instead of modifying the vertex buffer. This lets you set a value per frame, which can then be combined with the data from the vertex buffer (for instance, you set a float minVal, and every vertex whose attribute is less than minVal gets discarded in the geometry shader).