Draw a bunch of elements generated by CUDA/OpenCL? - opengl

I'm new to graphics programming, and need to add on a rendering backend for a demo we're creating. I'm hoping you guys can point me in the right direction.
Short version: Is there any way to send OpenGL an array of data for distinct elements, without having to issue a draw command for each element distinctly?
Long version: We have a CUDA program (will eventually be OpenCL) which calculates a bunch of data for a bunch of objects for us. We then need to render these objects using, e.g., OpenGL.
The CUDA kernel can generate our vertices, and using OpenGL interop, it can write them into an OpenGL VBO without having to transfer the data back to host memory. But the problem is we have a bunch (upwards of a million is our goal) of distinct objects. It seems like our best bet here is allocating one VBO and putting every object's vertices into it. Then we can call glDrawArrays with the offset and length of each element inside that VBO.
However, each object may have a variable number of vertices (though the total vertices in the scene can be bounded.) I'd like to avoid having to transfer a list of start indices and lengths from CUDA -> CPU every frame, especially given that these draw commands are going right back to the GPU.
Is there any way to pack a buffer with data such that we can issue only one call to OpenGL to render the buffer, and it can render a number of distinct elements from that buffer?
(Hopefully I've also given enough info to avoid a XY problem here.)

One way would be to stop treating these as individual objects and make them a single large object drawn with a single draw call. The question is, what data distinguishes the objects from each other? In other words, what is it that you change between the individual calls to glDrawArrays/glDrawElements?
If it is something simple, like a color, it would probably be easier to supply this as an additional per-vertex attribute. This way you can render all objects as one single large object using a single draw call, with the individual sub-objects (which really only exist conceptually now) colored correctly. The memory cost of the additional attribute may be well worth it.
If it is something a little more complex (like a texture), you may still be able to index it using an additional per-vertex attribute, being either an index into a texture array (as texture arrays should be supported on CUDA/OpenCL-able hardware) or a texture coordinate into a particular subregion of a single large texture (a so-called texture atlas).
But if the difference between those objects is something more complex, such as a different shader, you may really need to render individual objects and make individual draw calls. Even then, you don't necessarily need to make a round-trip to the CPU. With the ARB_draw_indirect extension (core since GL 4.0, I think; it may also be supported on GL 3 hardware, and thus on CUDA/CL hardware, but I don't know) you can source the arguments to a glDrawArrays/glDrawElements call from an additional buffer (into which you can write with CUDA/CL like any other GL buffer). So you can assemble the offset and length information of each individual object on the GPU and store it in a single buffer. Then you run your glDrawArraysIndirect loop, offsetting into this single draw-indirect buffer (with the offset between the individual objects' entries now being constant).
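To make the draw-indirect idea more concrete, here is a minimal CPU-side sketch of what such a buffer would contain. The struct follows the four-field layout that ARB_draw_indirect defines for glDrawArraysIndirect; the helper buildIndirectCommands and the assumption that objects are packed back to back in one VBO are mine, not from the question. On the GPU, a CUDA/CL kernel would write the same 16-byte records.

```cpp
#include <cstdint>
#include <vector>

// Field layout of one glDrawArraysIndirect command as defined by
// ARB_draw_indirect / OpenGL 4.0: four tightly packed 32-bit unsigned ints.
struct DrawArraysIndirectCommand {
    std::uint32_t count;          // number of vertices to draw
    std::uint32_t instanceCount;  // number of instances (1 for a plain draw)
    std::uint32_t first;          // index of the first vertex in the VBO
    std::uint32_t baseInstance;   // must be 0 before GL 4.2
};

// Hypothetical helper: turns per-object vertex counts into one command per
// object, assuming the objects are packed back to back in a single VBO.
std::vector<DrawArraysIndirectCommand>
buildIndirectCommands(const std::vector<std::uint32_t>& vertexCounts) {
    std::vector<DrawArraysIndirectCommand> commands;
    commands.reserve(vertexCounts.size());
    std::uint32_t first = 0;
    for (std::uint32_t count : vertexCounts) {
        commands.push_back({count, 1u, first, 0u});
        first += count;  // running (exclusive) prefix sum over the counts
    }
    return commands;
}
```

With the buffer bound to GL_DRAW_INDIRECT_BUFFER, the draw loop would pass the byte offset i * sizeof(DrawArraysIndirectCommand) as the second argument of glDrawArraysIndirect for object i, so nothing object-specific has to come back to the CPU.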
But if the only reason for issuing multiple draw calls is that you want to render the objects as single GL_TRIANGLE_STRIPs or GL_TRIANGLE_FANs (or, god forbid, GL_POLYGONs), you may want to reconsider and just use a bunch of GL_TRIANGLES so that you can render all objects in a single draw call. The (possible) time and memory savings from using triangle strips are likely to be outweighed by the overhead of multiple draw calls, especially when rendering many small triangle strips. If you really want to use strips or fans, you may want to introduce degenerate triangles (by repeating vertices) to separate them from each other, even when drawn with a single draw call. Or you may look into the glPrimitiveRestartIndex function introduced with GL 3.1.
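As a rough illustration of the "just use GL_TRIANGLES" route, the sketch below expands strips into a plain triangle index list, so that several strips can share one index buffer and one glDrawElements call. The function name appendStripAsTriangles is made up for this example, and it assumes each strip's vertices are stored contiguously starting at vertex base.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical helper: expands one triangle strip of `stripLength` vertices,
// stored contiguously from vertex `base`, into GL_TRIANGLES indices that are
// appended to `out`.
void appendStripAsTriangles(std::uint32_t base, std::uint32_t stripLength,
                            std::vector<std::uint32_t>& out) {
    for (std::uint32_t i = 0; i + 2 < stripLength; ++i) {
        if (i % 2 == 0) {
            out.push_back(base + i);
            out.push_back(base + i + 1);
        } else {
            // Odd triangles swap their first two vertices so the winding
            // order stays the same as in the strip.
            out.push_back(base + i + 1);
            out.push_back(base + i);
        }
        out.push_back(base + i + 2);
    }
}
```

Calling it once per object with that object's base vertex produces a single index list for the whole scene; a strip of n vertices turns into 3 * (n - 2) indices.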

Probably not optimal, but you could make a single glDrawArrays call on your whole buffer...
If you use GL_TRIANGLES, you can fill your buffer with zeroes, and write only the needed vertices in your kernel. This way "empty" regions of your buffer will be drawn as 0-area polygons ( = degenerate polygons -> not drawn at all ).
If you use GL_TRIANGLE_STRIP, you can do the same, but you'll have to duplicate your first vertex in order to make a fake triangle between (0,0,0) and your mesh.
This can seem overkill, but :
- You'll have to be able to handle as many vertices anyway
- degenerate triangles use no fillrate, so they are almost free (the vertex shader is still computed, though)
A probably better solution would be to use glDrawElements instead: in your kernel, you also generate an index list for your whole buffer, which lets you completely skip the unused regions of your buffer.
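A small sketch of that index-list idea, written on the CPU for clarity. It assumes (my assumption, not stated above) that the VBO is divided into fixed-size slots of slotSize vertices, one slot per object, and that usedCounts[i] is the number of vertices object i actually wrote; buildCompactIndices is a hypothetical name. A kernel could produce the same list with a prefix sum over the counts.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: emits indices only for the vertices that were really
// written, skipping the unused tail of every fixed-size slot.
std::vector<std::uint32_t>
buildCompactIndices(std::uint32_t slotSize,
                    const std::vector<std::uint32_t>& usedCounts) {
    std::vector<std::uint32_t> indices;
    for (std::size_t object = 0; object < usedCounts.size(); ++object) {
        const std::uint32_t slotStart =
            static_cast<std::uint32_t>(object) * slotSize;
        for (std::uint32_t v = 0; v < usedCounts[object]; ++v) {
            indices.push_back(slotStart + v);
        }
    }
    return indices;
}
```

The resulting list can be drawn with one glDrawElements(GL_TRIANGLES, ...) call whose count is the size of the list, so the vertex shader never runs for the empty regions.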

Related

Cost of large buffer switch vs. small buffer switch

I'm creating a tile-based renderer where each tile has a vertex model. However, from each vertex model only a small portion is rendered in one frame. These subsets change every frame.
What would be the fastest way to render this? I can think of the following options:
Make one draw call for every model. Every model is stored in full on the gpu. For every draw call, the full vbo is switched every time. Indices are then used to pick the appropriate small portion for the actual rendering.
Make one draw call with one vbo which gets assembled every frame by copying the necessary (small) subset of all the other vbos (the data is copied within vram).
Make one draw call with one vbo, but the vbo is recreated every frame with the (small) subset from CPU data using glBufferData.
Which do you think is fastest, or can you think of something faster?
One deciding factor is obviously if switching between larger VBOs is more expensive than switching between smaller VBOs.
It is a bad idea to make a lot of draw calls. In OpenGL, you will be CPU bound with this method, so it is better to batch a lot of models.
Actually, I would go for this method. All static geometry is inside one and only one VBO and one VAO. It does not mean that you only have "one draw call". However, you should use glMultiDraw*Indirect.
The idea behind that is to use compute shaders to perform culling on the GPU, and to use something like the ARB_indirect_parameters extension with your multi-draw-indirect call.
Indirect Drawing
For all dynamic geometry, you can use a persistent buffer.
To answer your question about changing the VAO/VBO: changing the VAO, or using glBindVertexBuffer, should not add much overhead.
But you should profile it, as it can depend on your driver / hardware :)
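To sketch what the culling pass would write for glMultiDrawElementsIndirect, here is a CPU-side version. The command struct uses the five-field layout from the OpenGL specification; MeshRange, buildVisibleCommands, and the trick of storing the mesh id in baseInstance (which needs GL 4.2 or later) are assumptions made for this example. In the real thing a compute shader would do this compaction on the GPU.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Field layout of one glMultiDrawElementsIndirect command (20 bytes).
struct DrawElementsIndirectCommand {
    std::uint32_t count;          // number of indices to draw
    std::uint32_t instanceCount;  // number of instances
    std::uint32_t firstIndex;     // offset into the index buffer, in indices
    std::int32_t  baseVertex;     // value added to every fetched index
    std::uint32_t baseInstance;   // used here to carry the mesh id
};

// Hypothetical description of where one model lives in the shared buffers.
struct MeshRange {
    std::uint32_t indexCount;
    std::uint32_t firstIndex;
    std::int32_t  baseVertex;
};

// Emits one command per visible mesh, tightly packed (culled meshes are
// simply left out).
std::vector<DrawElementsIndirectCommand>
buildVisibleCommands(const std::vector<MeshRange>& meshes,
                     const std::vector<bool>& visible) {
    std::vector<DrawElementsIndirectCommand> commands;
    for (std::size_t i = 0; i < meshes.size() && i < visible.size(); ++i) {
        if (!visible[i]) continue;
        commands.push_back({meshes[i].indexCount, 1u, meshes[i].firstIndex,
                            meshes[i].baseVertex,
                            static_cast<std::uint32_t>(i)});
    }
    return commands;
}
```

The whole list is then submitted with a single glMultiDrawElementsIndirect(GL_TRIANGLES, GL_UNSIGNED_INT, nullptr, drawCount, 0) call, where a stride of 0 means the commands are tightly packed. With ARB_indirect_parameters the draw count itself can also come from a buffer, so the CPU never needs to know how many meshes survived culling.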

Render multiple models in OpenGL with a single draw call

I built a 2D graphics engine, and I created a batching system for it, so if I have 1000 sprites with the same texture, I can draw them with one single call to OpenGL.
This is achieved by putting in a single vbo vertex array all the vertices of all the sprites with the same texture.
Instead of "print these vertices, print these vertices, print these vertices", I do "put all the vertices together, print", just to be very clear.
Easy enough, but now I'm trying to achieve the same thing in 3D, and I'm having a big problem.
The problem is that I'm using a Model View Projection matrix to place and render my models, which is the common approach to render a model in 3D space.
For each model on screen, I need to pass the MVP matrix to the shader, so that I can use it to transform each vertex to the correct position.
If I did the transformation outside the shader, it would be executed by the CPU, which is not a good idea, for obvious reasons.
But the problem lies there. I need to pass the matrix to the shader, but for each model the matrix is different.
So I cannot do the same thing I did with 2D sprites, because changing a shader uniform requires a separate draw call every time.
I hope I've been clear; maybe you have a good idea I didn't have, or you already had the same problem. I know for a fact that there is a solution somewhere, because in engines like Unity you can use the same shader for multiple models and get away with one draw call.
There exists a feature exactly like what you're looking for, and it's called instancing. With instancing, you store n matrices (or whatever else you need) in a Uniform Buffer and call glDrawElementsInstanced to draw n copies. In the shader, you get an extra input gl_InstanceID, with which you index into the Uniform Buffer to fetch the matrix you need for that particular instance.
You can read more about instancing here: https://www.opengl.org/wiki/Vertex_Rendering#Instancing
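As a minimal sketch of the CPU side of instancing, the code below builds the array of per-instance matrices that would be uploaded to the uniform buffer. Mat4, Vec3, translation and buildInstanceMatrices are names invented for this example, and it only handles translations to keep things short; a real engine would store full model (or MVP) matrices.

```cpp
#include <array>
#include <vector>

// 4x4 matrix stored column-major, which is the layout GLSL uses by default.
using Mat4 = std::array<float, 16>;

struct Vec3 { float x, y, z; };

// Builds a translation matrix; the translation sits in the last column,
// i.e. elements 12..14 of the column-major array.
Mat4 translation(const Vec3& t) {
    return Mat4{1.0f, 0.0f, 0.0f, 0.0f,
                0.0f, 1.0f, 0.0f, 0.0f,
                0.0f, 0.0f, 1.0f, 0.0f,
                t.x,  t.y,  t.z,  1.0f};
}

// Hypothetical helper: one matrix per instance, in instance order, so that
// instance i finds its matrix at index i.
std::vector<Mat4> buildInstanceMatrices(const std::vector<Vec3>& positions) {
    std::vector<Mat4> matrices;
    matrices.reserve(positions.size());
    for (const Vec3& p : positions) {
        matrices.push_back(translation(p));
    }
    return matrices;
}
```

After uploading this array with glBufferData, a single glDrawElementsInstanced call with an instance count equal to the number of matrices draws every copy, and the vertex shader picks its matrix with something like matrices[gl_InstanceID].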
The answer depends on whether the vertex data for each item is identical or not. If it is, you can use instancing as in @orost's answer, using glDrawElementsInstanced and gl_InstanceID within the vertex shader, and that method should be preferred.
However, if each 3D model requires different vertex data (which is frequently the case), you can still render them using a single draw call. To do this, you would add another stream into your vertex data with glVertexAttribPointer (and glEnableVertexAttribArray). This extra stream would contain the index of the matrix within the uniform buffer that vertex should use when rendering - so each mesh within the VBO would have an identical index in the extra stream. The uniform buffer contains the same data as in the instancing setup.
Note this method may require some extra CPU processing, if you need to redo the batching - for example, an object within a batch should not be rendered anymore. If this process is required frequently, it should be determined whether batching items is actually beneficial or not.
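The extra-stream approach could be sketched like this: several different meshes are merged into one vertex array, and every vertex gets the index of the matrix it should use. Batch and mergeMeshes are hypothetical names, and the meshes are assumed to be plain lists of xyz floats.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Result of batching: one position stream plus one matrix index per vertex.
struct Batch {
    std::vector<float> positions;            // x, y, z per vertex
    std::vector<std::uint32_t> matrixIndex;  // one entry per vertex
};

// Hypothetical helper: concatenates the meshes and tags every vertex of
// mesh i with the value i, so the shader can look up matrix i for it.
Batch mergeMeshes(const std::vector<std::vector<float>>& meshes) {
    Batch batch;
    for (std::size_t i = 0; i < meshes.size(); ++i) {
        const std::vector<float>& mesh = meshes[i];
        batch.positions.insert(batch.positions.end(), mesh.begin(), mesh.end());
        const std::size_t vertexCount = mesh.size() / 3;
        batch.matrixIndex.insert(batch.matrixIndex.end(), vertexCount,
                                 static_cast<std::uint32_t>(i));
    }
    return batch;
}
```

One detail worth checking when wiring this up: for an integer attribute like this index, glVertexAttribIPointer is the variant that keeps the values as integers in the shader, whereas glVertexAttribPointer converts them to floats.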
Besides instancing and adding another vertex attribute as some object ID, I'd like to also mention another strategy (which requires modern OpenGL, though):
The extension ARB_multi_draw_indirect (in core since GL 4.3) adds indirect drawing commands. These commands do source their parameters (number of vertices, starting index and so on) directly from another buffer object. With these functions, many different objects can be drawn with a single draw call.
However, as you still want some per-object state like transformation matrices, that feature alone is not enough. But in combination with ARB_shader_draw_parameters (core since GL 4.6), you get the gl_DrawID parameter (gl_DrawIDARB in the extension), which is incremented by one for each object in a single multi-draw-indirect call. That way, you can index into some UBO, TBO, or SSBO (or whatever) where you store per-object data.

Best way to convert OpenGL immediate mode rendering utility methods to using VBOs?

I've written for myself a small utility class containing useful methods for rendering lines, quads, cubes, etc. quickly and easily in OpenGL. Up until now, I've been using almost entirely immediate mode, so I could focus on learning other aspects of OpenGL. It seems prudent to switch over to using VBOs. However, I want to keep much of the same functionality I've been using, for instance my utility class. Is there a good method of converting these simple immediate mode calls to a versatile VBO system?
I am using LWJGL.
Having converted my own code from begin..end blocks and also taught others, this is what I recommend.
I'm assuming that your utility class is mostly static methods, draw a line from this point to that point.
First step is to have each individual drawing operation create a VBO for each attribute. Replace your glBegin..glEnd block with code that creates an array (actually a ByteBuffer) for each vertex attribute: coordinates, colors, tex coords, etc. After what used to be glEnd, copy the ByteBuffers to the VBOs with glBufferData. Then set up the attributes with chunks of glEnableClientState, glBindBuffer, glVertex|Color|whateverPointer calls. Call glDrawArrays to actually draw something, and finally restore client state and delete the VBOs.
Now, this is not good OpenGL code and is horribly inefficient and wasteful. But it should work, it's fairly straightforward to write, and you can change one method at a time.
And if you don't need to draw very much, well modern GPUs are so fast that maybe you won't care that it's inefficient.
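To illustrate the first step, here is a rough sketch of a recorder that stands in for a glBegin..glEnd block and collects one array per attribute. It is written in C++ rather than Java/LWJGL, the class name ImmediateRecorder and its methods are invented for this example, and the actual glBufferData/glDrawArrays calls are left out so the sketch stays independent of a GL context.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a glBegin..glEnd block: instead of sending each
// vertex to GL immediately, it records one array per vertex attribute.
class ImmediateRecorder {
public:
    // Replaces glBegin: start a fresh batch.
    void begin() {
        positions_.clear();
        colors_.clear();
    }
    // Replaces glColor3f: remembers the color for the following vertices.
    void color(float r, float g, float b) {
        current_[0] = r; current_[1] = g; current_[2] = b;
    }
    // Replaces glVertex3f: stores the position plus the current color.
    void vertex(float x, float y, float z) {
        positions_.push_back(x);
        positions_.push_back(y);
        positions_.push_back(z);
        colors_.insert(colors_.end(), current_, current_ + 3);
    }
    // Number of recorded vertices, i.e. the count to pass to glDrawArrays.
    std::size_t vertexCount() const { return positions_.size() / 3; }
    const std::vector<float>& positions() const { return positions_; }
    const std::vector<float>& colors() const { return colors_; }

private:
    float current_[3] = {1.0f, 1.0f, 1.0f};  // white, like GL's default color
    std::vector<float> positions_;
    std::vector<float> colors_;
};
```

Where glEnd used to be, each array would be copied into its VBO with glBufferData and the batch drawn with glDrawArrays using vertexCount(). The later steps can keep this class unchanged and only change how long the VBOs live.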
Second step is to start re-using VBOs. Have your class create one VBO for each possible attribute at init time or first use. The drawing code still creates ByteBuffer data arrays and copies them over, but doesn't delete the VBOs.
Third step, if you want to move into OpenGL 4 and are using shaders, would be to replace glVertexPointer with glVertexAttribPointer(0, ...), glColorPointer with glVertexAttribPointer(1, ...), etc. You should also create a Vertex Array Object along with the VBOs at init time. (You'll still have to enable/disable attrib pointers individually depending on whether each draw operation needs colors, tex coords, etc.)
And the last step, which would require changes elsewhere to your program(s), would be to go for 3D "objects" rather than methods. Your utility class would no longer contain drawing methods. Instead you create a line, quad, or cube object and draw that. Each of these objects would (probably) have its own VBOs. This is more work, but really pays off in the common case when a lot of your 3D geometry doesn't change from frame to frame. But again, you can start with the more "wasteful" approach of replacing each method call to draw a line from P1 to P2 with something like l = new Line3D(P1, P2) ; l.draw().
Hope this helps.

OpenGL- drawarrays or drawelements?

I'm making a small 2D game demo and from what I've read, it's better to use drawElements() to draw an indexed triangle list than using drawArrays() to draw an unindexed triangle list.
But it doesn't seem possible as far as I know to draw multiple elements that are not connected in a single draw call with drawElements().
So for my 2D game demo where I'm only ever going to draw squares made of two triangles, what would be the best approach so I don't end having one draw call per object?
Yes, it's better to use indices in many cases since you don't have to store or transfer duplicate vertices and you don't have to process duplicate vertices (the vertex shader only needs to be run once per vertex). In the case of quads, you reduce 6 vertices to 4, plus a small amount of index data. Cutting the vertex data to two thirds is quite a good improvement really, especially if your vertex data is more than just position.
In summary, glDrawElements results in
Less data (mostly), which means more GPU memory for other things
Faster updating if the data changes
Faster transfer to the GPU
Faster vertex processing (no duplicates)
Indexing can affect cache performance if the referenced vertices aren't near each other in memory. Modellers commonly produce meshes which are optimized with this in mind.
For multiple elements, if you're referring to GL_TRIANGLE_STRIP you could use glPrimitiveRestartIndex to draw multiple strips of triangles with the one glDrawElements call. In your case it's easy enough to use GL_TRIANGLES and reference 4 vertices with 6 indices for each quad. Your vertex array then needs to store all the vertices for all your quads. If they're moving you still need to send that data to the GPU every frame. You could position all the moving quads at the front of the array and only update the active ones. You could also store static vertex data in a separate array.
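The "4 vertices, 6 indices per quad" layout can be generated once up front. This is a small sketch under the assumption that each quad's four corners are stored consecutively, going around the perimeter; buildQuadIndices is a name made up for the example.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical helper: builds GL_TRIANGLES indices for `quadCount` quads
// whose corners are stored as 4 consecutive vertices in perimeter order.
std::vector<std::uint32_t> buildQuadIndices(std::uint32_t quadCount) {
    std::vector<std::uint32_t> indices;
    indices.reserve(static_cast<std::size_t>(quadCount) * 6);
    for (std::uint32_t q = 0; q < quadCount; ++q) {
        const std::uint32_t base = q * 4;
        // First triangle: corners 0, 1, 2.
        indices.push_back(base);
        indices.push_back(base + 1);
        indices.push_back(base + 2);
        // Second triangle: corners 2, 3, 0 (shares the 0-2 diagonal).
        indices.push_back(base + 2);
        indices.push_back(base + 3);
        indices.push_back(base);
    }
    return indices;
}
```

Because the pattern never changes, the index buffer can be static and sized for the maximum number of quads; only the vertex data needs updating each frame, and one glDrawElements call with 6 * activeQuads indices draws them all.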
The typical approach to drawing a 3D model is to provide a list of fixed vertices for the geometry and move the whole thing with the model matrix (as part of the model-view). The confusing part here is that the mesh data is so small that, as you say, the overhead of the draw calls may become quite prominent. I think you'll have to draw a LOT of quads before you get to the stage where it'll be a problem. However, if you do, instancing or some similar idea such as particle systems is where you should look.
Perhaps only go down the following track if the draw calls or data transfer becomes a problem as there's a lot involved. A good way of implementing particle systems entirely on the GPU is to store instance attributes such as position/colour in a texture. Each frame you use an FBO/render-to-texture to "ping-pong" this data between another texture and update the attributes in a fragment shader. To draw the particles, you can set up a static VBO which stores quads with the attribute-data texture coordinates for use in the vertex shader where the particle position can be read and applied. I'm sure there's a bunch of good tutorials/implementations to follow out there (please comment if you know of a good one).

Rendering a mesh in OpenGL as a series of subgroups?

I'm completing a wavefront object parser and I want to use it to construct generic mesh objects. My engine uses OpenGL 4 and shaders to draw everything in my engine.
My question is about how to ensure best rendering efficiency for rendering a mesh.
A wavefront .obj file normally has many object sub-groups specified.
A sub-group might be assigned a specific material (e.g. a shiny red colour).
So a mesh might be a fairly complex collection of sub-groups, each with their own material assigned.
My questions are -
Q. Do I need to draw each sub-group separately e.g. with a call to glDrawElements for each sub-group ? (So if I had 4 separate sub-groups, I'd have to make four glDrawElements calls, thereby invoking the shader 4 times with 4 uniform changes (for the materials/textures) )
glDrawElements( GL_TRIANGLES, nNumIndicesInGroup, GL_UNSIGNED_INT, ((char*)NULL)+ first-vertex-offset );
If this is correct, then I'll have to calculate:
The indices in each sub-group (implying a separate index array and VAO for each sub-group)
The vertex offset of the start of the sub-group
This seems terribly inefficient, am I barking up the wrong tree?
Also, from the Wavefront obj wiki page:
Smooth shading across polygons is enabled by smoothing groups.
s 1
...
# Smooth shading can be disabled as well.
s off
...
Can anyone suggest what smooth shading values indicate? E.g. s1, s2, s4 etc.
Yes, you should draw each sub-group separately from the others. This is required as long as the state differs between sub-groups.
But you are getting ahead of yourself.
To avoid multiple draw calls, you can introduce a vertex attribute indicating an index used for accessing uniform array values (array of materials, array of textures). In this way, you need only one draw call, but you will have the cost of one additional attribute and its relative management.
I would avoid the above approach. What if a sub-group is textured and another one not? How do you discriminate whether to texture or not? Introducing other attributes? Seems confusing.
The first point is that buffer object management is very flexible. Indeed you could have a single element buffer object and a single vertex buffer object: using offsets and interleaving you can handle any level of complexity. And then, on modern hardware, using vertex array objects you can minimize the cost of the different buffer bindings.
The second point is that your software can group together sub-groups that have the same uniform state, joining multiple draw calls into a single one. Remember that you can use the multi-draw entry point variants, and there's also primitive restart, which can help you (in the case of strip primitives).
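Grouping sub-groups by state could look roughly like the following sketch, which merges the index lists of all sub-groups that share a material into one contiguous range of a single index buffer. SubGroup, DrawRange, MergedMesh and mergeByMaterial are hypothetical names, and a material is reduced to an integer id for brevity.

```cpp
#include <cstdint>
#include <map>
#include <vector>

// One parsed .obj sub-group: a material id plus its triangle indices.
struct SubGroup {
    int material;
    std::vector<std::uint32_t> indices;
};

// One draw call: a contiguous range of the merged index buffer.
struct DrawRange {
    int material;
    std::uint32_t firstIndex;
    std::uint32_t indexCount;
};

struct MergedMesh {
    std::vector<std::uint32_t> indices;  // contents of the single index buffer
    std::vector<DrawRange> ranges;       // one entry per distinct material
};

// Hypothetical helper: collects the indices of all sub-groups per material,
// then lays the materials out one after another in a single index buffer.
MergedMesh mergeByMaterial(const std::vector<SubGroup>& groups) {
    std::map<int, std::vector<std::uint32_t>> byMaterial;
    for (const SubGroup& g : groups) {
        std::vector<std::uint32_t>& dst = byMaterial[g.material];
        dst.insert(dst.end(), g.indices.begin(), g.indices.end());
    }
    MergedMesh merged;
    for (const auto& entry : byMaterial) {
        DrawRange range;
        range.material = entry.first;
        range.firstIndex = static_cast<std::uint32_t>(merged.indices.size());
        range.indexCount = static_cast<std::uint32_t>(entry.second.size());
        merged.ranges.push_back(range);
        merged.indices.insert(merged.indices.end(),
                              entry.second.begin(), entry.second.end());
    }
    return merged;
}
```

Rendering then needs one uniform update and one glDrawElements call per entry in ranges, with the last argument being the byte offset firstIndex * sizeof(GLuint) into the bound element buffer, instead of one call per original sub-group.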
Other considerations are not useful at this point, because you have to draw anyway, regardless of whether it's complex or not. Later, once you have correct rendering, you can profile the application and the rendering and eliminate the hot spots.
Smoothing groups are collections of vertices that share the same optional attributes (normals, texture coordinates). This is the case for element-indexed vertices.
To go deeper into the subject, read one of the specifications that can be found by googling.