OpenGL: Is it more efficient to use GL_QUADS or GL_TRIANGLES? - opengl

I know that OpenGL deprecated and eventually removed GL_QUADS in newer releases. I have heard this is because modern GPUs only render triangles, so submitting a quad just makes the GPU work harder to break it into two triangles (that's what I have heard anyway; I am not much of an expert on this topic).
I was wondering whether it is better (assuming the average person's CPU is relatively faster than their GPU) to manually break the rendering of quads into two triangles yourself, or to just let the GPU do it. Again, I have absolutely no real experience with OpenGL, as I am just starting out. I would rather know which is better for most machines these days so I can focus my attention on one rendering method*. Thanks.
*Yet I will probably utilize the 'triangle method' for the sake of it.

Even if you feed OpenGL quads, the triangulation is done by the driver on the CPU side before anything even hits the GPU. Modern GPUs eat nothing except triangles. (Well, and lines and points.) So something will be triangulating, whether it's you or the driver -- it doesn't matter too much where it happens.
Feeding quads would only be less efficient if, say, you don't reuse your vertex buffers and instead refill them anew every time with quads (in which case the driver has to re-triangulate every vertex buffer) rather than refilling them with pre-triangulated triangles, but that's a pretty contrived case (and the problem you should be fixing there is the fact that you're refilling your vertex buffers at all).
I would say, if you have the choice, stick with triangles, since that's what most content pipelines put out anyway, and you're less likely to run into problems with non-planar quads and the like. If you get to choose what format your content comes in, then definitely use triangles, and the triangulation step gets skipped altogether.
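To make the data difference concrete, here is a minimal sketch (my own illustration, not from the answers above) of a single quad submitted directly as two indexed triangles; the four shared vertices are stored once and the index buffer simply names two of them twice. Buffer/VAO setup and a bound shader are assumed.

// Sketch only: VBO/IBO creation and a current GL context are assumed.
GLfloat quadVertices[] = {
    // x,     y,    z
    -0.5f, -0.5f, 0.0f,   // 0: bottom-left
     0.5f, -0.5f, 0.0f,   // 1: bottom-right
     0.5f,  0.5f, 0.0f,   // 2: top-right
    -0.5f,  0.5f, 0.0f    // 3: top-left
};
GLuint quadIndices[] = {
    0, 1, 2,   // first triangle
    0, 2, 3    // second triangle
};

// With the buffers bound, one call draws the whole quad as triangles:
glDrawElements(GL_TRIANGLES, 6, GL_UNSIGNED_INT, 0);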

Any geometry can be represented with triangles, which is why it was decided to use triangles instead of quads. Another reason is that the two triangles of a split quad do not have to be co-planar with each other, whereas a quad is only well-defined when its four vertices lie in a single plane.
Yes, you can choose to render quads, but the driver will convert each quad into two triangles.
Therefore, choosing to render a quad will not make the GPU work less; it will make your CPU work more, because it has to do the conversion.

Related

Rendering translucent triangles in models animated by a vertex shader?

Until now, I've been handling translucency by sorting my translucent triangles back to front. This works very well for quads, but I'd like to incorporate translucency in my models now.
I've thought of separating the translucent tris out and sorting them in the same way as my quads. Sorting by their centroids, then streaming the results into an IBO for just them each frame. But the number of triangles in a model, and the need to transform them on the CPU according to a table of bones, and blend shapes, and some other things in my vertex shader... This doesn't seem like a good solution in performance or sanity.
My models are about 4K tris each, with maybe 20 in a scene at worst, and I'd really like to lean into a simple cute style that relies on translucency, which doesn't have to be physically accurate, or draw objects behind 4 or more layers of translucency.
What technique might work well for my situation, in 2021? I'm using OpenGL 3.3 but I'll use another version if new features exist for this.
Afaik, there's no easy solution to what you want to do. However, there are some things that you can do:
Don't sort triangles at all, and only sort individual draw calls (see the sketch after this list). This is the easiest solution and is used by most games, if they use translucent/transparent objects in the first place. Of course, this might not have the best looks (though it works well with convex shapes), but it will have very good performance.
Order-independent transparency: There are some pixel-shader-based techniques for rendering transparent/translucent objects without having to sort anything at all. These techniques are usually approximations (there are also some non-approximate algorithms) and tend to work best for things like smoke (where small errors aren't as noticeable), or when not many transparent/translucent triangles overlap each other anyway.
Compute shaders: You can use a compute shader to transform your individual vertices according to your animation, and then use another compute shader to sort the triangles on the GPU. That is probably the most straightforward improvement over what you'd otherwise do on the CPU, and there are many examples of sorting on the GPU out there. But if you've never worked with compute shaders before, it might be a bit hard to wrap your head around their strengths and limitations at first.
Ray Tracing: This would probably be the most complex way to solve that problem, since you'll need specific hardware, generate corresponding data structures and a few new shaders, and on top of that, even with modern hardware, ray tracing is quite costly (though probably still faster than sorting triangles on the CPU each frame). But it also doesn't need your triangles to be sorted and will actually work perfectly well even if different translucent objects are intersecting/interleaving each other.
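As a concrete starting point for the first option above, here is a rough sketch (my own, with an invented Object struct and drawMesh() helper, not anything from the question) of sorting whole translucent objects back to front by centroid distance and drawing them with blending enabled. A GL loader header and a current context are assumed.

#include <algorithm>
#include <vector>
#include <glm/glm.hpp>

struct Object {
    glm::vec3 center;   // world-space centroid of the mesh
    // ... mesh handle, transform, etc.
};

void drawMesh(const Object& obj);   // hypothetical per-object draw call, provided elsewhere

void drawTranslucent(std::vector<Object*>& objects, const glm::vec3& cameraPos)
{
    // Farthest first, so nearer objects blend over the ones behind them.
    std::sort(objects.begin(), objects.end(),
              [&](const Object* a, const Object* b) {
                  glm::vec3 da = a->center - cameraPos;
                  glm::vec3 db = b->center - cameraPos;
                  return glm::dot(da, da) > glm::dot(db, db);
              });

    glEnable(GL_BLEND);
    glBlendFunc(GL_SRC_ALPHA, GL_ONE_MINUS_SRC_ALPHA);
    glDepthMask(GL_FALSE);          // test against depth, but don't write it

    for (const Object* obj : objects)
        drawMesh(*obj);

    glDepthMask(GL_TRUE);
}

Sorting per object rather than per triangle is why this stays cheap: the cost is one sort of a few dozen pointers per frame instead of transforming and sorting thousands of skinned triangles on the CPU.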

Performance of GL_POINTS on modern hardware

Is there any difference in performance between drawing a scene with full triangles (GL_TRIANGLES) instead of just drawing their vertices (GL_POINTS), on modern hardware?
Where GL_POINTS is initialized like this:
glPointSize(1.0);
glDisable(GL_POINT_SMOOTH);
I have a somewhat low-end graphics card (9600gt), and drawing vertices only can bring a 2x fps increase in certain scenes. Not sure whether the same applies on more recent GPUs.
drawing vertices only can bring a 2x fps increase in certain scenes
You lose 98% of the picture and get only a 2x fps increase. That's not impressive. If you take into account that you should easily be able to render 300-500 fps on any decent hardware (with vsync disabled and minor optimizations), it's probably not worth it.
Is there any difference in performance between drawing a scene with full triangles (GL_TRIANGLES) instead of just drawing their vertices (GL_POINTS), on modern hardware?
Well, if your scene has a LOT of alpha blending and very "heavy" pixel shaders, then, obviously, displaying the scene as a point cloud will speed things up, because there are fewer pixels to fill.
On the other hand, this kind of "optimization" will be completely useless for any practical task. If you're using blending and shaders, you probably wouldn't want to display your scene as a point list in the first place, unless you're doing some kind of debug render (using glPolygonMode), and in the case of a debug render, you'll probably turn shaders off (because a shaded/lit point will be hard to see) and disable lighting.
Even if you're using point sprites for particles or something, I'd stick with triangles - they give more control and do not have a maximum size limit (unlike point sprites).
I can display more objects?
If you want more objects, you should probably try to optimize things elsewhere first. If you stop trying to draw invisible objects (outside the field of view, etc.), that'll be a start that can improve performance.
you have a mesh which is very far away from the camera. 1 million triangles and you know it is always in view. At this density ratio, triangles can't be bigger than a pixel,
When triangles are smaller than a pixel, and there are many of them, your mesh starts looking like garbage and turns into a pixelated mess of points. It will be ugly. It's roughly the same effect as when you disable mipmapping and texture filtering and then render a checkerboard pattern. Using points instead of triangles might even aggravate the effect.
If you have a 1 million triangle mesh that is always visible, you already need a different kind of optimization: reduce the number of triangles (level of detail, dynamic tessellation, or some solution that can simplify geometry on the fly), use bump mapping (maybe parallax mapping) to simulate extra geometry detail that isn't really there, or even turn it into a static background or a sprite. That'll work much better. Trying to render it using points will simply make it look ugly.
No. If the number of triangles is similar to the number of their shared vertices (assuming the glDrawElements rendering command is used), the geometry part of the rendering pipeline will run at roughly the same speed in both modes. The only benefit you can get from drawing GL_POINTS comes solely from the percentage of empty screen space you gain by not drawing faces, i.e. only at the fragment shader level.
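For reference, the comparison being measured boils down to drawing the same vertex buffer with a different primitive mode. A hedged sketch (vao, indexCount and vertexCount are assumed to be set up elsewhere):

glBindVertexArray(vao);

// Full mesh: rasterizes triangles and runs the fragment shader for every covered pixel.
glDrawElements(GL_TRIANGLES, indexCount, GL_UNSIGNED_INT, 0);

// Vertices only: same vertex processing, but far fewer fragments to fill.
glPointSize(1.0f);
glDrawArrays(GL_POINTS, 0, vertexCount);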

Deferred Rendering with Tile-Based culling Concept Problems

EDIT: I'm still looking for some help with using OpenCL or compute shaders. I would prefer to keep using OGL 3.3 and not have to deal with the bad driver support for OGL 4.3 and OpenCL 1.2, but I can't think of any way to do this type of shading without using one of the two (to match lights and tiles). Is it possible to implement tile-based culling without using GPGPU?
I wrote a deferred renderer in OpenGL 3.3. Right now I don't do any culling for the light pass (I just render a full-screen quad for every light). This (obviously) has a ton of overdraw (sometimes ~100%). Because of this I've been looking into ways to improve performance during the light pass. It seems like the best way, in (almost) everyone's opinion, is to cull the scene using screen-space tiles. This was the method used in Frostbite 2. I read the presentation Andrew Lauritzen gave at SIGGRAPH 2010 (http://download-software.intel.com/sites/default/files/m/d/4/1/d/8/lauritzen_deferred_shading_siggraph_2010.pdf), and I'm not sure I fully understand the concept (and, for that matter, why it's better than anything else, and whether it is better for me).
In the presentation Lauritzen goes over deferred shading with light volumes, quads, and tiles for culling the scene. According to his data, the tile-based deferred renderer was the fastest (by far). I don't understand why that is, though. I'm guessing it has something to do with the fact that for each tile, all the lights are batched together. In the presentation it says to read the G-Buffer once and then compute the lighting, but this doesn't make sense to me. In my mind, I would implement it like this:
for each tile {
    for each light affecting the tile {
        render quad (the tile) and compute lighting
        blend with previous tiles (GL_ONE, GL_ONE)
    }
}
This would still involve sampling the G-Buffer a lot. I would think that doing this would have the same (if not worse) performance as rendering a screen-aligned quad for every light. From how it's worded, though, it seems like this is what's happening:
for each tile {
    render quad (the tile) and compute all lights
}
But I don't see how one would do this without exceeding the instruction limit for the fragment shader on some GPUs. Can anyone help me with this? It also seems like almost every tile-based deferred renderer uses compute shaders or OpenCL (to batch the lights); why is this, and what would happen if I didn't use them?
But I don't see how one would do this without exceeding the instruction limit for the fragment shader on some GPUs.
It rather depends on how many lights you have. The "instruction limits" are pretty high; it's generally not something you need to worry about outside of degenerate cases. Even if 100+ lights affect a tile, odds are fairly good that your lighting computations aren't going to exceed instruction limits.
Modern GL 3.3 hardware can run at least 65536 dynamic instructions in a fragment shader, and likely more. For 100 lights, that's still 655 instructions per light. Even if you take 2000 instructions to compute the camera-space position, that still leaves 635 instructions per light. Even if you were doing Cook-Torrance directly in the GPU, that's probably still sufficient.
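Regarding the edit about avoiding GPGPU entirely: the light-to-tile assignment itself can be done on the CPU. Below is a rough sketch of my own, under simplifying assumptions (point lights already projected to screen space, a fixed 16x16 tile size, invented ScreenLight/binLights names), that bins light indices into per-tile lists. Those lists would then be uploaded (uniform buffer, texture, etc.) and each tile shaded in a single pass that loops over only its own lights.

#include <algorithm>
#include <cmath>
#include <vector>

struct ScreenLight { float x, y, radiusPixels; };   // already projected to screen space

const int TILE = 16;

std::vector<std::vector<int>>
binLights(const std::vector<ScreenLight>& lights, int screenW, int screenH)
{
    int tilesX = (screenW + TILE - 1) / TILE;
    int tilesY = (screenH + TILE - 1) / TILE;
    std::vector<std::vector<int>> tileLights(tilesX * tilesY);

    for (int i = 0; i < (int)lights.size(); ++i) {
        const ScreenLight& l = lights[i];
        // Tiles touched by the light's screen-space bounding box.
        int x0 = std::max(0, (int)std::floor((l.x - l.radiusPixels) / TILE));
        int x1 = std::min(tilesX - 1, (int)std::floor((l.x + l.radiusPixels) / TILE));
        int y0 = std::max(0, (int)std::floor((l.y - l.radiusPixels) / TILE));
        int y1 = std::min(tilesY - 1, (int)std::floor((l.y + l.radiusPixels) / TILE));

        for (int ty = y0; ty <= y1; ++ty)
            for (int tx = x0; tx <= x1; ++tx)
                tileLights[ty * tilesX + tx].push_back(i);
    }
    return tileLights;
}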

Are triangles a gpu restriction or are there other rendering pathways?

To preface this question, I have a competent understanding of OpenGL and the maths behind it, and while I have never touched anything related to DirectX I imagine the concepts are similar.
There is plenty of information around about why triangles are used for 3D graphics (they are necessarily planar, are indivisible except into smaller triangles, etc). However, I would like to know if triangles are merely a convenient way of storing and manipulating 3D data (simpler maths regarding interpolation, etc), or if there is a hardware limitation in the graphics card that only realistically allows the rendering of triangles (e.g. instructions that can essentially ONLY be applied to triangles).
Following on from this, is there any way to achieve pixel-by-pixel control of graphics rendering (as outlined briefly by the answer to this question)? While I appreciate that direct control over individual pixels is done through a driver, is there any way I can get this kind of control over a rendering environment? Is there a way to 'avoid triangles' completely?
Yes and no. Kind of.
Current GPUs are designed to render triangles because triangles are nice to work with. And because current GPUs are designed to work with triangles, people use triangles and so GPUs only need to process triangles, and so they're designed to process only triangles.
As you say, triangles just have advantages that make them convenient to use. GPUs can be made (and have been made) to render other primitives natively, but it's just not really worth it. If you tell a modern GPU to render a quad, it splits it up into two triangles and renders those.
Not because there's a technical reason why a GPU can't render quads natively, but because it's not worth spending transistors on. It's much more useful to focus the GPU on doing triangles as fast as possible, and then just emulate other primitives if they're needed.
So yes, modern GPUs have hardware limitations so they don't work with quads, for example, but not because it's impossible to design a GPU which works with quads. It'd just be less efficient to do so. :)
As for "avoiding triangles", sure, that's basically what the fragment shader does: it fills in one single pixel. The GPU just runs it a few million times in parallel to fill in the entire screen. You could draw two big triangles, which form a quad filling the entire screen, and then just specify a fragment shader which fills that with the content you like.
If you want more control over the process, do it in software instead: paint one pixel at a time to a memory surface, and then load that as a texture on the GPU. But it's slow. :)
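For completeness, the "two big triangles covering the screen" trick amounts to nothing more than this (a sketch; VAO/VBO setup and a fragment shader that decides each pixel's color, e.g. from gl_FragCoord, are assumed):

// Clip-space positions covering all of NDC; the fragment shader then runs once
// per screen pixel.
GLfloat fullscreenTriangles[] = {
    // x,     y
    -1.0f, -1.0f,
     1.0f, -1.0f,
     1.0f,  1.0f,   // first triangle

    -1.0f, -1.0f,
     1.0f,  1.0f,
    -1.0f,  1.0f    // second triangle
};

glDrawArrays(GL_TRIANGLES, 0, 6);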
As far as I know, every modern GPU can render quads, and some can even render N-gons, but comparing the render time of a quad to that of two triangles shows the triangle advantage.
This is mainly because GPUs have been optimized to render triangles, and the actual hardware has far more "stream processors" (which handle triangles) than other kinds, such as texture units. Some other processor types on the GPU can render quads directly, but normally you would find a thousand stream processors for every few texture processors.
Note that getting a texture unit to render a quad is EXTREMELY difficult. It is possible in theory, but nobody has used the principle for a serious case.
Unless you are working very close to the hardware, the software will take care of the triangles (e.g. auto-converting them from quads).

OpenGL voxel engine slow

I'm making a voxel engine in C++ and OpenGL (à la Minecraft) and can't get decent fps on my 3GHz with ATI X1600... I'm all out of ideas.
When I have about 12000 cubes on the screen it falls to under 20fps - pathetic.
So far the optimizations I have are: frustum culling, back face culling (via OpenGL's glEnable(GL_CULL_FACE)), the engine draws only the visible faces (except the culled ones of course) and they're in an octree.
I've tried VBOs; I don't like them, and they do not significantly increase the fps.
How can Minecraft's engine be so fast? I struggle with 10,000 cubes, whereas Minecraft can easily draw far more at a higher fps.
Any ideas?
#genpfault: I analyze the connectivity and just generate faces for the outer, visible surface. The VBO had a single cube that I glTranslate()d
I'm not an expert at OpenGL, but as far as I understand this is going to save very little time because you still have to send every cube to the card.
Instead what you should do is generate faces for all of the outer visible surface, put that in a VBO, and send it to the card and continue to render that VBO until the geometry changes. This saves you a lot of the time your card is actually waiting on your processor to send it the geometry information.
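A minimal sketch of that idea, with invented names (buildVisibleFaceVertices() and chunk stand in for the engine's own surface extraction): build one VBO containing only the visible faces, re-upload it only when the geometry changes, and draw it every frame with a single call.

// Done once, or whenever the chunk's geometry actually changes:
std::vector<GLfloat> verts = buildVisibleFaceVertices(chunk);   // xyz per vertex

GLuint vbo = 0;
glGenBuffers(1, &vbo);
glBindBuffer(GL_ARRAY_BUFFER, vbo);
glBufferData(GL_ARRAY_BUFFER, verts.size() * sizeof(GLfloat),
             verts.data(), GL_STATIC_DRAW);

// Done every frame (no per-cube glTranslate, no re-upload):
glBindBuffer(GL_ARRAY_BUFFER, vbo);
// ... set up the vertex pointers / VAO ...
glDrawArrays(GL_TRIANGLES, 0, (GLsizei)(verts.size() / 3));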
You should profile your code to find out whether the bottleneck in your application is on the CPU or the GPU. For instance, it might be that your culling/octree algorithms are slow, in which case it is not an OpenGL problem at all.
I would also keep count of the number of cubes you draw on each frame and display that on screen. Just so you know your culling routines work as expected.
Finally you don't mention if your cubes are textured. Try using smaller textures or disable textures and see how much the framerate increases.
gDEBugger is a great tool that will help you find bottlenecks with OpenGL.
I don't know if it's OK here to "bump" an old question, but a few things came to my mind:
If your voxels are static, you can speed up the whole rendering process by using an octree for frustum culling, etc. Furthermore, you can also compile a static scene into a potentially visible set (PVS) stored in the octree. The main principle of a PVS is to precompute, for every node in the tree, which other nodes are potentially visible from it, and store pointers to them in a vector. When it comes to rendering, you first check which node the camera is in, and then run frustum culling against all nodes in that node's PVS vector. (Carmack used something like that in the Quake engines, but with binary space partitioning trees.)
If the shading of your voxels is somewhat complex, it is also fast to do a depth-only pre-pass, without writing into the color buffer, just to fill the depth buffer. After that you render a second pass: disable writing to the depth buffer and render only to the color buffer while testing against the depth buffer. That way you avoid expensive shader computations that would later be overwritten by a new fragment closer to the viewer. (Carmack used that in Quake 3.)
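The state changes for such a pre-pass are small; a hedged sketch (drawScene() stands in for the engine's own draw code):

// Pass 1: depth only - no color writes, ideally with a trivial shader.
glColorMask(GL_FALSE, GL_FALSE, GL_FALSE, GL_FALSE);
glDepthMask(GL_TRUE);
glDepthFunc(GL_LESS);
drawScene();

// Pass 2: full shading - only fragments matching the pre-pass depth get shaded.
glColorMask(GL_TRUE, GL_TRUE, GL_TRUE, GL_TRUE);
glDepthMask(GL_FALSE);          // depth buffer is already correct
glDepthFunc(GL_LEQUAL);         // or GL_EQUAL
drawScene();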
Another thing that will definitely speed things up is instancing. You store only the position of each voxel and, if necessary, its scale and other parameters in a texture buffer object. In the vertex shader you can then read the positions of the voxels to be spawned and create an instance of the voxel (i.e. a cube that is given to the shader in a vertex buffer object). So you send the 8 vertices + 8 normals (3 * sizeof(float) * 8 + 3 * sizeof(float) * 8 + floats for color/texture, etc.) only once to the card in the VBO, and then only the positions of the cube instances (3 * sizeof(float) * number of voxels) in the TBO.
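A sketch of the setup on the C++ side (voxelPositions and cubeVao are placeholders of mine; the vertex shader would read the per-instance position with texelFetch(positions, gl_InstanceID) and offset the cube's vertices by it):

// Per-voxel positions stored as vec4 (xyz used) in a texture buffer object.
GLuint tbo = 0, tboTex = 0;
glGenBuffers(1, &tbo);
glBindBuffer(GL_TEXTURE_BUFFER, tbo);
glBufferData(GL_TEXTURE_BUFFER,
             voxelPositions.size() * sizeof(glm::vec4),
             voxelPositions.data(), GL_DYNAMIC_DRAW);

glGenTextures(1, &tboTex);
glBindTexture(GL_TEXTURE_BUFFER, tboTex);
glTexBuffer(GL_TEXTURE_BUFFER, GL_RGBA32F, tbo);   // exposed to the shader as a samplerBuffer

// One draw call spawns every cube; 36 indices describe the unit cube in the VAO.
glBindVertexArray(cubeVao);
glDrawElementsInstanced(GL_TRIANGLES, 36, GL_UNSIGNED_INT, 0,
                        (GLsizei)voxelPositions.size());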
Maybe it is possible to parallelize things between the GPU and CPU by combining all three steps across two threads: in the CPU thread you check the octree's PVS and update a TBO for instancing in the next frame, while the GPU thread meanwhile renders the two passes using the TBO for instancing that was created by the CPU thread in the previous step. After that you swap TBOs. If the camera has not moved, you don't even have to do the CPU calculations again.
Another kind of tree you may be interested in is the so-called k-d tree, which is more general than an octree.
PS: sorry for my english, it's not the clearest....
There are 3rd-party libraries you could use to make the rendering more efficient. For example the C++ PolyVox library can take a volume and generate the mesh for you in an efficient way. It has built-in methods for reducing triangle count and helping to generate things like ambient occlusion. It's got a good community around it so getting support on the forum should be easy.
Have you used a common display list for all your cubes?
Do you skip calling the drawing code for cubes that are not visible to the user?