Fastest way to upload streaming points, and removing occasionally - opengl

I have a system (using OpenGL 4.x) that receives a stream of points (potentially with color and/or normal) from an external source. I need to draw these points as GL_POINTS, running custom switchable shaders for coloring (color could be procedurally generated, or come from vertex color or normal direction).
The stream delivers groups of points (with or without normal or color) of arbitrary size (typically 1k to 70k points) at a fairly regular interval (4 to 10 Hz). I need to add each group to my current points and draw all the points received so far.
I am guaranteed that my vertex type will not change. I am told at the beginning of the streaming which to expect, so I am using an interleaved vertex with one of: pos+normal+color, pos+normal, pos+color, or just pos.
My current solution is to allocate interleaved vertex VBOs (with surrounding VAOs) of the appropriate vertex type, each sized to a maximum vertex count specified in a config file (allocated with the DYNAMIC hint).
As new points come in, I fill up my current partially filled VBO via glBufferSubData. I keep a count (activePoints) of how many vertices the current frontier VBO holds so far, and use glBufferSubData to fill a range starting at activePoints. If the update group has more vertices than fit in the frontier buffer (since I limit the vertex count per VBO), I allocate a new VBO and fill the range starting at 0 with the points that did not fit in the previous buffer; if points are still left over, I repeat this. It is rare that an update group straddles more than 2 buffers.
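For reference, the splitting scheme described above can be sketched as pure bookkeeping, independent of any GL call. This is only an illustration: `UploadChunk` and `planUpload` are hypothetical names, and it assumes buffers are numbered in fill order with no buffer freed in between.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// One contiguous write into one VBO: the range a single glBufferSubData
// call (or one memcpy into a mapped range) would cover.
struct UploadChunk {
    std::size_t bufferIndex;  // which fixed-capacity VBO receives the data
    std::size_t dstOffset;    // first vertex slot written in that VBO
    std::size_t srcOffset;    // first vertex read from the incoming batch
    std::size_t count;        // number of vertices in this chunk
};

// Hypothetical helper: splits a batch of `batchCount` vertices across VBOs
// that each hold `capacity` vertices. `totalStored` is the number of
// vertices already stored across all buffers (assumed to be packed).
std::vector<UploadChunk> planUpload(std::size_t totalStored,
                                    std::size_t batchCount,
                                    std::size_t capacity) {
    assert(capacity > 0);
    std::vector<UploadChunk> chunks;
    std::size_t src = 0;
    while (src < batchCount) {
        const std::size_t pos = totalStored + src;
        const std::size_t offset = pos % capacity;          // activePoints
        const std::size_t room = capacity - offset;         // free slots left
        const std::size_t n = std::min(room, batchCount - src);
        chunks.push_back({pos / capacity, offset, src, n});
        src += n;
    }
    return chunks;
}
```

With 90 vertices stored, a capacity of 100 and a batch of 30, this yields one chunk of 10 vertices at offset 90 of buffer 0 and one chunk of 20 vertices at offset 0 of buffer 1.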
When rendering, I draw every VBO except the frontier with glDrawArrays(m_DrawMode,0,numVertices), where numVertices is the maximum allowed buffer size, and the frontier buffer with glDrawArrays(m_DrawMode,startElem,numElems) to account for it not being completely filled with valid vertices.
Of course at some point I will have more points than I can draw interactively, so I have an LRU mechanism that deallocates the oldest (according to the LRU algorithm) sets of VBOs as needed.
Is there a more optimal method for doing this? Buffer orphaning? Streaming hint? Map vs SubData? Something else?
The second issue is that I am now asked to remove points (at irregular intervals), ranging from 10 to 2000 at a time. These points are irregularly spaced within the order I received them initially. I can find out at which offsets in which buffers they currently exist, but it is more of a scattering than a range. I have been "removing them" by finding their offsets in the right buffers and, one by one, calling glBufferSubData with a range of 1 (it is rare that they are beside each other in a buffer), changing their pos to be somewhere far off where they will never be seen. Eventually I guess buffers should be deleted once these remove requests add up, but I don't currently do that.
What would be a better way to handle that?

Mapping may be more efficient than glBufferSubData, especially when having to "delete" points. Explicit flush may be of particular help. Also, mapping allows you to offload the filling of a buffer to another thread.
Be positively sure to get the access bits correct (or performance is abysmal), in particular do not map a region "read" if all you do is write.
Deleting points from a vertex buffer is not easily possible, as you probably know. For "few" points (e.g. 10 or 20) I would just set w = 0, which moves them to infinity and keep drawing the whole thing as before. If your clip plane is not at infinity, this will just discard them. With explicit flushing, you would not even need to keep a separate copy in memory.
For "many" points (e.g. 1,000), you may consider using glCopyBufferSubData to remove the "holes". Moving memory on the GPU is fast, and for thousands of points it's probably worth the trouble. You then need to maintain a count for every vertex buffer, so you draw fewer points after removing some.
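One way to plan such copies is to fill each hole with the last live vertex of the buffer and shrink the count. This is a sketch of that variant, not necessarily what the answer has in mind: it does not preserve vertex order (harmless for GL_POINTS, as far as I can tell), and `planCompaction` and the struct names are hypothetical. Each planned move would correspond to one glCopyBufferSubData call of one vertex stride, issued in the order given; source and destination never overlap because they are always different slots.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Copy the vertex in slot `src` into slot `dst` (both in the same buffer).
// In GL terms: readOffset = src * stride, writeOffset = dst * stride,
// size = stride.
struct VertexMove {
    std::size_t src;
    std::size_t dst;
};

struct CompactionPlan {
    std::vector<VertexMove> moves;  // apply in this order
    std::size_t newCount;           // vertex count to draw afterwards
};

// Hypothetical helper: `count` live vertices, `removed` lists the slots to
// delete (any order, duplicates allowed, every slot must be < count).
CompactionPlan planCompaction(std::size_t count,
                              std::vector<std::size_t> removed) {
    // Handle the highest holes first so the current last slot is always live.
    std::sort(removed.begin(), removed.end(), std::greater<std::size_t>());
    removed.erase(std::unique(removed.begin(), removed.end()), removed.end());

    CompactionPlan plan{{}, count};
    for (std::size_t hole : removed) {
        assert(hole < plan.newCount);
        const std::size_t last = plan.newCount - 1;
        if (hole != last) {
            plan.moves.push_back({last, hole});  // fill hole from the tail
        }
        --plan.newCount;                         // tail slot is now unused
    }
    return plan;
}
```

After applying the moves, the buffer is drawn with `newCount` vertices instead of the old count.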
To "delete" entire vertex buffers, you should just orphan them (and reuse them). OpenGL will then do the right thing on its own, and it's the most efficient way to keep drawing and reusing memory.
Using glDrawElements instead of glDrawArrays, as suggested in Andon M. Coleman's comment, is usually good advice, but it will not help you in this case. The reason one would want to do that is that the post-transform cache works by tagging vertices by their index, so drawing elements takes advantage of the post-transform cache whereas drawing arrays does not. However, the post-transform cache is only useful on complex geometry such as triangle lists or triangle strips. You're drawing points, so you will not use the post-transform cache in any case -- but using indices increases memory bandwidth use both on the GPU and on the PCIe bus.

Related

Using Index buffers in DirectX 11; how does it know?

Let's say I create two vertex buffers, for two different meshes.
(I'm assuming creating separate buffers for separate meshes is how it's usually done)
Now, let's say I want to draw one of the meshes using an index buffer.
Looking at the book Practical Rendering and Computation with Direct3D 11 it doesnt seem like the creation of an index buffer in any way references a vertex buffer, so how does the index buffer know (during input assembly) what vertex buffer to act on?
I've done some googling without answers, which leads me to assume there's something obvious about it that I'm missing.
You are right, index buffers do not reference specific vertex buffers. During DrawIndexed, the active index buffer is used to supply indices into the active vertex buffers (the ones you set using IASetIndexBuffer/IASetVertexBuffers).
Indeed, Index Buffers and Vertex Buffers are completely independent.
The index buffer only "knows" about the vertex buffer at draw time (i.e. when both are bound to the pipeline).
You can think of an index buffer as a "lookup table", where you keep a list of element indices to draw.
That also means you can attach two completely "logically unrelated" buffers to the pipeline and draw it, nothing will prevent you from doing that, but you will of course have some strange visual results.
Decoupling both has many advantages, here are a few examples:
You can reuse an index buffer (for example, two displaced grids with identical resolution can share the same index buffer). That can be a decent memory gain.
You can draw your vertex buffer on its own and do some processing per vertex (draw a point list for sprites, for example, or apply skinning/displacement into a Stream Output buffer, then draw the resulting vertex buffer using DrawIndexed).
Both vertex and index buffers can also be bound as ByteAddressBuffer, so you can process your geometry in a compute shader and build another optimized index buffer, with culled triangles for example, then issue the indexed draw with the optimized buffer. Applying such culling to indices instead of vertices is often faster, as you move much less memory.
This is a niche case, but sometimes I have to draw a mesh as a set of triangles, and then draw it as a set of lines (some form of wireframe). If, for example, you take a single box, you will not want to draw the diagonals as lines, so I have a shared vertex buffer with the box vertices, then one index buffer dedicated to drawing a triangle list, and another for drawing a line list. For large models, this can also be an effective memory gain.
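To make the last example concrete, here is a minimal sketch of the data involved: one shared set of box corners and two independent index lists over it. The names (`kBoxVertices`, `kBoxTriangleIndices`, `kBoxLineIndices`) are made up for illustration, triangle winding is not made consistent (so assume culling is disabled), and buffer creation is left out. At draw time you would bind the same vertex buffer, then bind one index buffer or the other with IASetIndexBuffer and pick the matching topology before calling DrawIndexed.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

struct Float3 { float x, y, z; };

// Eight corners of a unit box. Corner i has x = bit 0, y = bit 1, z = bit 2.
const std::array<Float3, 8> kBoxVertices = {{
    {0, 0, 0}, {1, 0, 0}, {0, 1, 0}, {1, 1, 0},
    {0, 0, 1}, {1, 0, 1}, {0, 1, 1}, {1, 1, 1},
}};

// Index buffer 1: triangle list, two triangles per face (36 indices).
const std::array<std::uint16_t, 36> kBoxTriangleIndices = {{
    0, 1, 3,  0, 3, 2,   // z = 0
    4, 5, 7,  4, 7, 6,   // z = 1
    0, 1, 5,  0, 5, 4,   // y = 0
    2, 3, 7,  2, 7, 6,   // y = 1
    0, 2, 6,  0, 6, 4,   // x = 0
    1, 3, 7,  1, 7, 5,   // x = 1
}};

// Index buffer 2: line list, the 12 box edges only, no face diagonals.
const std::array<std::uint16_t, 24> kBoxLineIndices = {{
    0, 1,  2, 3,  4, 5,  6, 7,   // edges along x
    0, 2,  1, 3,  4, 6,  5, 7,   // edges along y
    0, 4,  1, 5,  2, 6,  3, 7,   // edges along z
}};

// Sanity check: both index lists have a size matching their topology.
inline void checkBoxIndexBuffers() {
    assert(kBoxTriangleIndices.size() % 3 == 0);
    assert(kBoxLineIndices.size() % 2 == 0);
}
```

Neither index list refers to the vertex data in any way other than by position, which is exactly why the two can be swapped freely at draw time.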

efficient update of GL state given a change to the scene

Suppose we have a scene which consists of a list of n meshes in draw order. The triangle count of each mesh is bounded by a constant (though that constant may be large). Also suppose we have GL state such that all meshes can be rendered with a single draw call (glDrawArrays/Elements).
Given a change to the scene, which may consist of:
Inserting a mesh in the list
Removing a mesh from the list
Changing the geometry of the mesh (which may include changing its triangle count)
Is there a O(1) way to update GL state for a single change to the scene?
Because of the draw-order and single-draw-call constraints, the meshes must be laid out linearly in a VBO in that order. Thus, if a mesh is inserted in the list, the VBO must be resized by moving data, which is not O(1).
This isn't really an OpenGL problem. The work complexity has to do with how you model geometry, which happens well before you start trying to shove it into a GPU.
But there are some GL-specific things you can think about:
Separate vertex array order from draw order by using an index buffer and a DrawElements call. If you're drawing multiple entities out of one vertex buffer and you expect them to change, you can leave some padding in the vertex buffer and address vertices by index.
Think about how you're getting that vertex data to the GPU if it's changing every frame. For example, with double-buffering or MapBufferRange, the CPU can work on filling a buffer with new vertex data for the next frame while the GPU draws the current frame from a different buffer (or range).
The work you do to arrange/modify vertices every frame still can't be O(1). GPU work (and CPU/GPU transfers) tends to be thought of more in ms than in order-analysis terms, but there are things you can do to minimize time.
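As a rough sketch of the index-buffer idea: give every mesh a fixed-size slot in the vertex buffer (possible here because the triangle count per mesh is bounded), and express draw order only through the index list. `SlotPool`, `MeshRef` and `buildDrawOrderIndices` are hypothetical names. Taking or releasing a slot is O(1); rebuilding the index list is still linear, but it moves 4 bytes per vertex instead of whole vertices.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Fixed-size slots inside one vertex buffer, handed out from a free list.
class SlotPool {
public:
    SlotPool(std::uint32_t slotCount, std::uint32_t verticesPerSlot)
        : verticesPerSlot_(verticesPerSlot) {
        for (std::uint32_t s = slotCount; s > 0; --s) free_.push_back(s - 1);
    }

    // O(1): any free slot will do, because draw order lives in the indices.
    std::uint32_t acquire() {
        assert(!free_.empty());
        const std::uint32_t slot = free_.back();
        free_.pop_back();
        return slot;
    }

    // O(1): the slot's vertex data is simply no longer referenced.
    void release(std::uint32_t slot) { free_.push_back(slot); }

    std::uint32_t firstVertex(std::uint32_t slot) const {
        return slot * verticesPerSlot_;
    }

private:
    std::uint32_t verticesPerSlot_;
    std::vector<std::uint32_t> free_;
};

struct MeshRef {
    std::uint32_t slot;         // where the mesh's vertices live
    std::uint32_t vertexCount;  // how many of the slot's vertices are used
};

// Linear in the number of drawn vertices; the result is what you would
// upload as the element array for a single glDrawElements call.
std::vector<std::uint32_t> buildDrawOrderIndices(
        const SlotPool& pool, const std::vector<MeshRef>& drawOrder) {
    std::vector<std::uint32_t> indices;
    for (const MeshRef& m : drawOrder) {
        for (std::uint32_t v = 0; v < m.vertexCount; ++v) {
            indices.push_back(pool.firstVertex(m.slot) + v);
        }
    }
    return indices;
}
```

Inserting a mesh in the middle of the draw order then means acquiring a slot, uploading its vertices there, and regenerating (or patching) the index list, without touching any other mesh's vertex data.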

Draw a bunch of elements generated by CUDA/OpenCL?

I'm new to graphics programming, and need to add on a rendering backend for a demo we're creating. I'm hoping you guys can point me in the right direction.
Short version: Is there any way to send OpenGL an array of data for distinct elements, without having to issue a draw command for each element distinctly?
Long version: We have a CUDA program (will eventually be OpenCL) which calculates a bunch of data for a bunch of objects for us. We then need to render these objects using, e.g., OpenGL.
The CUDA kernel can generate our vertices, and using OpenGL interop, it can shove these in an OpenGL VBO and not have to transfer the data back to host device memory. But the problem is we have a bunch (upwards of a million is our goal) distinct objects. It seems like our best bet here is allocating one VBO and putting every object's vertices into it. Then we can call glDrawArrays with offsets and lengths of each element inside that VBO.
However, each object may have a variable number of vertices (though the total vertices in the scene can be bounded.) I'd like to avoid having to transfer a list of start indices and lengths from CUDA -> CPU every frame, especially given that these draw commands are going right back to the GPU.
Is there any way to pack a buffer with data such that we can issue only one call to OpenGL to render the buffer, and it can render a number of distinct elements from that buffer?
(Hopefully I've also given enough info to avoid a XY problem here.)
One way would be to get away from understanding these as individual objects and making them a single large object drawn with a single draw call. The question is, what data is it that distinguishes the objects from each other, meaning what is it you change between the individual calls to glDrawArrays/glDrawElements?
If it is something simple, like a color, it would probably be easier to supply this as an additional per-vertex attribute. This way you can render all objects as one single large object using a single draw call, with the individual sub-objects (which really only exist conceptually now) colored correctly. The memory cost of the additional attribute may be well worth it.
If it is something a little more complex (like a texture), you may still be able to index it using an additional per-vertex attribute, being either an index into a texture array (as texture arrays should be supported on CUDA/OpenCL-able hardware) or a texture coordinate into a particular subregion of a single large texture (a so-called texture atlas).
But if the difference between those objects is something more complex, such as a different shader, you may really need to render individual objects and make individual draw calls. You still don't necessarily need to make a round-trip to the CPU, though. With the ARB_draw_indirect extension (which is core since GL 4.0, I think, but may be supported on GL 3 hardware (and thus CUDA/CL hardware), I don't know) you can source the arguments to a glDrawArrays/glDrawElements call from an additional buffer (into which you can write with CUDA/CL like any other GL buffer). So you can assemble the offset-length information of each individual object on the GPU and store it in a single buffer. Then you do your glDrawArraysIndirect loop, offsetting into this single draw-indirect buffer (with the offset between the individual objects now being constant).
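For illustration, this is the record glDrawArraysIndirect reads from the draw-indirect buffer, plus a CPU-side sketch of filling one record per object from a list of vertex counts. In the scenario above the same prefix-sum logic would run in the CUDA/OpenCL kernel instead; `buildCommands` is a hypothetical name, and one instance per object is assumed.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>
#include <vector>

// Record layout read by glDrawArraysIndirect: four tightly packed 32-bit
// unsigned integers. The last field must be zero before GL 4.2, where it
// became baseInstance.
struct DrawArraysIndirectCommand {
    std::uint32_t count;          // number of vertices to draw
    std::uint32_t instanceCount;  // number of instances (1 here)
    std::uint32_t first;          // index of the first vertex
    std::uint32_t baseInstance;   // keep 0 for portability
};
static_assert(sizeof(DrawArraysIndirectCommand) == 16,
              "commands must be tightly packed");

// Hypothetical helper: objects are stored back to back in one VBO, so each
// object's `first` is the running sum of the preceding vertex counts.
std::vector<DrawArraysIndirectCommand> buildCommands(
        const std::vector<std::uint32_t>& vertexCounts) {
    std::vector<DrawArraysIndirectCommand> commands;
    commands.reserve(vertexCounts.size());
    std::uint32_t first = 0;
    for (std::uint32_t n : vertexCounts) {
        assert(n <= std::numeric_limits<std::uint32_t>::max() - first);
        commands.push_back({n, 1u, first, 0u});
        first += n;
    }
    return commands;
}
```

Command i then sits at byte offset `i * sizeof(DrawArraysIndirectCommand)` in the buffer bound to GL_DRAW_INDIRECT_BUFFER. If GL 4.3 is available, glMultiDrawArraysIndirect can consume the whole array in one call, which may remove the loop entirely.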
But if the only reason for issuing multiple draw calls is that you want to render the objects as single GL_TRIANGLE_STRIPs or GL_TRIANGLE_FANs (or, god forbid, GL_POLYGONs), you may want to reconsider and just use a bunch of GL_TRIANGLES so that you can render all objects in a single draw call. The (possible) time and memory savings from using triangle strips are likely to be outweighed by the overhead of multiple draw calls, especially when rendering many small triangle strips. If you really want to use strips or fans, you may want to introduce degenerate triangles (by repeating vertices) to separate them from each other, even when drawn with a single draw call. Or you may look into the glPrimitiveRestartIndex function introduced with GL 3.1.
Probably not optimal, but you could make a single glDrawArrays call on your whole buffer...
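A minimal sketch of the degenerate-triangle approach, operating on index lists: between two strips, the last index of the previous strip and the first index of the next one are repeated, and one extra repeat is added when needed so the next strip starts at an even position and keeps its winding. `joinStrips` is a hypothetical helper, and each input strip is assumed to have at least 3 indices.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Concatenates several triangle strips into one strip that can be drawn
// with a single GL_TRIANGLE_STRIP call. The repeated indices only produce
// zero-area triangles, which are not rasterized.
std::vector<std::uint32_t> joinStrips(
        const std::vector<std::vector<std::uint32_t>>& strips) {
    std::vector<std::uint32_t> out;
    for (const std::vector<std::uint32_t>& strip : strips) {
        assert(strip.size() >= 3);
        if (!out.empty()) {
            const std::uint32_t last = out.back();
            out.push_back(last);           // repeat end of previous strip
            out.push_back(strip.front());  // repeat start of next strip
            if (out.size() % 2 != 0) {
                // Odd start position would flip the winding of the next
                // strip, so pad with one more repeated index.
                out.push_back(strip.front());
            }
        }
        out.insert(out.end(), strip.begin(), strip.end());
    }
    return out;
}
```

Joining {0,1,2,3} and {4,5,6,7} gives 0,1,2,3,3,4,4,5,6,7. With primitive restart, the same effect would need only one special index between strips, at the cost of requiring an indexed draw.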
If you use GL_TRIANGLES, you can fill your buffer with zeroes, and write only the needed vertices in your kernel. This way "empty" regions of your buffer will be drawn as 0-area polygons ( = degenerate polygons -> not drawn at all )
If you use GL_TRIANGLE_STRIP, you can do the same, but you'll have to duplicate your first vertex in order to make a fake triangle between (0,0,0) and your mesh.
This can seem overkill, but :
- You'll have to be able to handle as many vertices anyway
- degenerate triangles use no fillrate, so they are almost free (the vertex shader is still computed, though)
A probably better solution would be to use glDrawElements instead: in your kernel, you also generate an index list for your whole buffer, which can then completely skip the unused regions of your buffer.

Using more than one index list in a single VAO

I'm probably going about this all wrong, but hey.
I am rendering a large number of wall segments (for argument's sake, let's say 200). Every segment is one unit high, even and straight with no diagonals. All changes in direction are a change of 90 degrees.
I am representing each one as a four pointed triangle fan, AKA a quad. Each vertex has a three dimensional texture coordinate associated with it, such as 0,0,0, 0,1,7 or 10,1,129.
This all works fine, but I can't help but think it could be so much better. For instance, every point is duplicated at least twice (Every wall is a contiguous line of segments and there are some three & four way intersections) and the starting corner texture coordinates (0,0,X and 0,1,X) are going to be duplicated for every wall with texture number X on it. This could be compressed even further by moving the O coordinate into a third attribute and indexing the S and T coordinates separately.
The problem is, I can't seem to work out how to do this. VAOs only seem to allow one index, and taken as a lump, each position and texture coordinate form a unique snowflake never to be repeated. (Admittedly, this may not be true on certain corners, but that's a very edge case)
Is what I want to do possible, or am I going to have to stick with the (admittedly fine) method I currently use?
It depends on how much work you want to do.
OpenGL does not directly allow you to use multiple indices. But you can get the same effect.
The most direct way is to use a Buffer Texture to access an index list (using gl_VertexID), which you then use to access a second Buffer Texture containing your positions or texture coordinates. Basically, you'd be manually doing what OpenGL automatically does. This will naturally be slower per-vertex, as attributes are designed to be fast to access. You also lose some of the compression features, as Buffer Textures don't support as many formats.
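To show what the shader would be doing, here is the double lookup simulated on the CPU. In GLSL the three arrays would be buffer textures read with texelFetch, and `vertexId` would be gl_VertexID; all names here (`IndexPair`, `fetchVertex`, and so on) are made up for the sketch.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

struct Float3 { float x, y, z; };

// One entry per drawn vertex: separate indices for each attribute stream.
struct IndexPair {
    std::uint32_t position;
    std::uint32_t texCoord;
};

struct Vertex {
    Float3 position;
    Float3 texCoord;
};

// CPU stand-in for the vertex shader prologue: first fetch the index pair
// for this vertex, then fetch each attribute through its own index.
Vertex fetchVertex(std::uint32_t vertexId,
                   const std::vector<IndexPair>& indexPairs,
                   const std::vector<Float3>& positions,
                   const std::vector<Float3>& texCoords) {
    assert(vertexId < indexPairs.size());
    const IndexPair pair = indexPairs[vertexId];
    assert(pair.position < positions.size());
    assert(pair.texCoord < texCoords.size());
    return {positions[pair.position], texCoords[pair.texCoord]};
}
```

The draw call itself would then be a non-indexed glDrawArrays over `indexPairs.size()` vertices with no regular vertex attributes bound, which is where the extra per-vertex cost mentioned above comes from.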
"each vertex and texture coordinate form a unique snowflake never to be repeated"
A vertex is not just the position, but the whole vector formed by position, texture coordinates and all the other attributes. That whole vector is what you referred to as a "snowflake".
And for only 200 walls, I would not worry about memory consumption. It comes down to cache efficiency. And GPUs use vertices – and that means the whole vector of position and other attributes – as the caching key. Any "optimization" of the kind you have in mind will probably have a negative effect on performance.
But having some duplicated vertices doesn't hurt much, as long as they're not too far apart in the primitive index list. Today's GPUs can hold roughly 30 to 1000 vertices (after transformation, i.e. after the vertex shader stage) in their cache, depending on the number of vertex attributes fed in and the number of varying variables delivered to the fragment shader stage. So if a vertex (input key) has been cached, the shader won't be executed again, and the cached result is fed to fragment processing directly.
So the optimization you should really aim for is cache locality, i.e. batching things up in a way, that shared/duplicated vertices are sent to the GPU in quick succession.
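To get a feel for the effect of ordering, one can simulate the cache offline. The sketch below models it as a simple FIFO keyed by vertex index, which is an assumption made for illustration (real hardware differs between vendors and generations), and counts how many times the vertex shader would run for a given index list. `countShaderRuns` is a hypothetical name.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <deque>
#include <vector>

// Counts vertex shader invocations for an indexed draw, assuming a FIFO
// post-transform cache that holds `cacheSize` transformed vertices.
std::size_t countShaderRuns(const std::vector<std::uint32_t>& indices,
                            std::size_t cacheSize) {
    assert(cacheSize > 0);
    std::deque<std::uint32_t> cache;  // oldest entry at the front
    std::size_t runs = 0;
    for (std::uint32_t index : indices) {
        if (std::find(cache.begin(), cache.end(), index) != cache.end()) {
            continue;  // cache hit: transformed result is reused
        }
        ++runs;        // cache miss: the vertex shader runs
        cache.push_back(index);
        if (cache.size() > cacheSize) {
            cache.pop_front();  // evict the oldest entry
        }
    }
    return runs;
}
```

For the index list 0,1,2, 2,1,3, 0,1,2 this model gives 4 shader runs with a cache of 4 entries but 7 with a cache of 2, which is the kind of difference that reordering shared vertices close together is meant to avoid.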

OpenGL: Buffer object performance issue

I have a question related to buffer object performance. I have rendered a mesh using standard vertex arrays (not interleaved) and I wanted to change it to buffer objects to get a performance boost. When I introduced buffer objects, I was shocked to find that using them lowers performance by a factor of four. I thought that buffers should increase performance. Is that true? If so, I think I am doing something wrong...
I render a 3D tiled map, and to reduce the amount of memory needed I use only a single tile (vertex set) to render the whole map. For each tile of the map I change only the texture coordinates and the y value of the vertex positions. The buffers for positions and texture coords are created with the GL_DYNAMIC_DRAW parameter. The buffer for indices is created with GL_STATIC_DRAW because it doesn't change during map rendering. So, for each tile of the map, the buffers are mapped and unmapped at least once. Should I use only one buffer for texture coords and positions?
Thanks,
Try moving your vertex/texture coordinates with the GL_MODELVIEW/GL_TEXTURE matrices, and leave the buffer data alone (keep it GL_STATIC_DRAW). E.g. if a tile is of size 1x1, create the rect (0, 0)-(1, 1) and set its position in the world with glTranslate. Same with the texture coordinates.
VBOs are not there to increase performance of drawing few quads. Their true power is seen when drawing meshes with thousands of polygons using shaders. If you don't need any forward compatibility with newer opengl versions, I see little use in using them to draw dynamically changing data.
If you need to update the buffer(s) each frame you should use GL_STREAM_DRAW (which hints that the buffer contents will likely be used only once) rather than GL_DYNAMIC_DRAW (which hints that they will be used several times before being updated).
As far as my experience goes, buffers created with GL_STREAM_DRAW will be treated similarly to plain ol' arrays, so you should expect about the same performance as for arrays when using it.
Also make sure that you call glMapBuffer with the access parameter set to GL_WRITE_ONLY, assuming you don't need to read the contents of the buffer. Otherwise, if the buffer is in video memory, it will have to be transferred from video memory to main memory and then back again (well, that's up to the driver really...) for each map call. Transferring too much data over the bus is a very real bottleneck that's quite easy to bump into.