Grouping data from Vertex Shader to Geometry Shader - opengl

Let's suppose I have some points p1, p2, p3 and p4. I need to apply some transformations to each of them in the Geometry Shader stage based on its successor, so my GS would require access to the pairs (p1, p2), (p2, p3) and (p3, p4). How can I achieve this? If I use the POINTS primitive I can only access a single point at a time.
Please also note that this is a simplification, since in practice I would need four points at a time, placed like the vertices of a cube. I have thought of using something like a line strip, but it doesn't provide enough points...
EDIT:
To clarify, what I am actually trying to achieve is to have the CPU send a "cubic lattice" (?) to the GPU expressed as a set of points. My GS will have to take four of these points at a time, each representing one of a cube's vertices, and output triangles based on the attributes of those points.

Let's say you have your 3D lattice in a buffer and you know its ordering (e.g. row by row), so you know in advance how to extract the four points needed in each iteration. For a regular grid, you also know the stride between points, so you can use glVertexAttribPointer() with the right stride parameter.
You can also use indexed buffer and glDrawElements.
Another approach, likely slower, is to use four buffers bound in the same VAO and read them with different attributes.
The draw command can be glDrawArrays(GL_POINTS, ...). You can even try instanced drawing and use the instance ID as an indication of the lattice location.
The point is that glDraw* will read from the bound buffers as many times as you specify, and each time you can read your four points.
Whatever you use, you get four points in the VS that you can pass to the GS.
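One common way to deliver four points per geometry-shader invocation is to draw with GL_LINES_ADJACENCY, which feeds the GS four vertices per primitive. As a sketch of the CPU side, here is how the index buffer for glDrawElements could be built, assuming a row-major nx-by-ny slice of the lattice (the function name is illustrative, not from any library):

```cpp
#include <cstdint>
#include <vector>

// Sketch: build an index buffer so that each group of four indices names the
// corners of one lattice cell. Drawing the result with
// glDrawElements(GL_LINES_ADJACENCY, ...) delivers four vertices per
// invocation to the geometry shader.
std::vector<uint32_t> buildCellIndices(uint32_t nx, uint32_t ny) {
    std::vector<uint32_t> idx;
    for (uint32_t y = 0; y + 1 < ny; ++y)
        for (uint32_t x = 0; x + 1 < nx; ++x) {
            idx.push_back(y * nx + x);           // lower-left corner
            idx.push_back(y * nx + x + 1);       // lower-right corner
            idx.push_back((y + 1) * nx + x);     // upper-left corner
            idx.push_back((y + 1) * nx + x + 1); // upper-right corner
        }
    return idx;
}
```

The same pattern extends to a full 3D lattice by adding a z loop and stride.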

Related

Assigning a normal to stl vertex with opengl

In an STL file there are facet normals, followed by a list of vertices. In some STL files I work with there are multiple copies of the same vertex; for example, a file with 5 million vertices usually contains about 30 duplicates of each vertex. For instance, in a cylinder cut out of a cube, a single vertex can belong to 20 other triangles.
For this reason, I like to store the vertices in a hash table, which allows me to upload an index set of vertices for the triangles, reducing a mesh from 5 million vertices to 900k.
This, however, creates a problem for the normals: each vertex ends up with the facet normal of the first facet in which it appears.
What is the fastest way to store a vertex normal that will work for all of the facets it belongs to in the file, or is this just not possible?
A vertex is not just the position, a vertex is the whole tuple of its associated attributes. The normal is a vertex attribute. If vertices differ in any of their attributes, they're different vertices.
While it's perfectly possible to decompose the vertex attributes into multiple sets and use an intermediate indexing structure, this kind of data format is hard or even impossible for GPUs to process, and also very cumbersome to work with. OpenGL, for example, cannot directly use it.
Deduplication of certain vertex attributes (like the normal or other properties shared across vertices) makes sense only for storing the data. When you want to work with it, you normally expand it.
The data structure you have right now is what you want. Don't try to "optimize" it. Besides, even at 5 million vertices, given two attributes (position and normal, 3 floats each), that's only about 115 MiB of data. Modern computers have gigabytes of RAM, so that's not really a problem.
The only straightforward approach in OpenGL is to create a vertex for each unique combination of position and normal. Depending on your data, this can still give you a very substantial reduction in the number of vertices. But if your data does not contain repeated vertices that share both position and normal, it will not help.
To validate if this will work for your data, you can extend the approach you already tried. Instead of using the 3 vertex coordinates as the key into your hash table, you use 6 values: the 3 vertex coordinates, and the 3 normal components.
If the number of entries in your hash table is significantly smaller than the original number of vertices, indexed rendering will be beneficial. You can then assign an index to each unique position/normal combination stored in the hash table, and use these indices to build the index buffer as well as the vertex buffer.
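The suggested validation can be sketched as a small deduplication pass keyed on all six values. The names and types here are illustrative, and an ordered map stands in for the hash table for brevity:

```cpp
#include <array>
#include <cstdint>
#include <map>
#include <vector>

// Sketch: deduplicate a triangle soup by keying each vertex on position AND
// normal, producing a unique-vertex list plus an index buffer.
struct Vertex {
    std::array<float, 6> data; // x, y, z, nx, ny, nz
    bool operator<(const Vertex& o) const { return data < o.data; }
};

struct IndexedMesh {
    std::vector<Vertex>   vertices; // unique position/normal combinations
    std::vector<uint32_t> indices;  // three per triangle
};

IndexedMesh deduplicate(const std::vector<Vertex>& triangleSoup) {
    IndexedMesh mesh;
    std::map<Vertex, uint32_t> seen; // ordered map in place of a hash table
    for (const Vertex& v : triangleSoup) {
        auto it = seen.find(v);
        if (it == seen.end()) {
            // First time we see this position/normal pair: assign a new index.
            it = seen.emplace(v, (uint32_t)mesh.vertices.size()).first;
            mesh.vertices.push_back(v);
        }
        mesh.indices.push_back(it->second);
    }
    return mesh;
}
```

If `mesh.vertices.size()` comes out much smaller than the input count, indexed rendering is worth it for your data.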
Beyond that, AMD defined an extension to support separate indices for different attributes, but this will not be useful if you want to keep your code portable: GL_AMD_interleaved_elements.

Fastest way to upload streaming points, and removing occasionally

I have a system (using OpenGL 4.x) where I am receiving a stream of points (potentially with color and/or normal) from an external source, and I need to draw these points as GL_POINTS, running custom switchable shaders for coloring (the color could be procedurally generated, or come from the vertex color or the normal direction).
The stream consists of groups of points (with or without normal or color) of arbitrary count (typically 1k to 70k points) arriving at a fairly regular interval (4 to 10 Hz). I need to add each group to the points received so far and draw them all.
I am guaranteed that my vertex type will not change; I am told at the beginning of the stream which one to expect, so I use an interleaved vertex with either pos+normal+color, pos+normal, pos+color, or just pos.
My current solution is to allocate interleaved vertex VBOs (with surrounding VAOs) of the appropriate vertex type, sized to a maximum vertex count specified in a config file (allocated with the DYNAMIC hint).
As new points come in, I fill up the current, not-yet-full VBO via glBufferSubData. I keep a count (activePoints) of how many vertices the current frontier VBO holds so far, and use glBufferSubData to fill a range starting at activePoints. If the current update group has more vertices than fit in the frontier buffer (since I limit the vertex count per VBO), I allocate a new VBO and fill the range starting at 0 and ending with the number of points left over from the update group; if points still remain, I repeat. It is rare that an update group straddles more than two buffers.
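The buffer-straddling logic described above can be sketched as pure bookkeeping (all names illustrative); each returned span would become one glBufferSubData call on the GPU side:

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Sketch: split one incoming update group across fixed-capacity VBOs.
// Returns (bufferIndex, offset, count) spans; a span with a new bufferIndex
// implies allocating a fresh VBO.
struct Span { size_t buffer, offset, count; };

std::vector<Span> splitUpdate(size_t frontierBuffer, size_t activePoints,
                              size_t capacity, size_t incoming) {
    std::vector<Span> spans;
    while (incoming > 0) {
        size_t room = capacity - activePoints;
        if (room == 0) {
            ++frontierBuffer;   // frontier is full: move to a new VBO
            activePoints = 0;
            room = capacity;
        }
        size_t n = std::min(room, incoming);
        spans.push_back({frontierBuffer, activePoints, n});
        activePoints += n;
        incoming -= n;
    }
    return spans;
}
```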
When rendering, I draw all my VBOs but the frontier one with glDrawArrays(m_DrawMode, 0, numVertices), where numVertices equals the maximum allowed buffer size, and the frontier buffer with glDrawArrays(m_DrawMode, startElem, numElems) to account for it not being completely filled with valid vertices.
Of course, at some point I will have more points than I can draw interactively, so I have an LRU mechanism that deallocates the oldest sets of VBOs as needed.
Is there a more optimal method for doing this? Buffer orphaning? Streaming hint? Map vs SubData? Something else?
The second issue is that I am now asked to remove points (at irregular intervals), from 10 to 2000 at a time, and these points are irregularly spaced within the order I originally received them. I can find out which offsets in which buffers they currently live at, but it is more of a scattering than a range. I have been "removing" them by finding their offsets in the right buffers and, one by one, calling glBufferSubData with a range of 1 (it is rare that they sit next to each other in a buffer), changing their position to somewhere far off where they will never be seen. I suppose buffers should eventually be deleted as these remove requests add up, but I don't currently do that.
What would be a better way to handle that?
Mapping may be more efficient than glBufferSubData, especially when having to "delete" points. Explicit flush may be of particular help. Also, mapping allows you to offload the filling of a buffer to another thread.
Be positively sure to get the access bits correct (or performance is abysmal); in particular, do not map a region for reading if all you do is write.
Deleting points from a vertex buffer is not easily possible, as you probably know. For "few" points (e.g. 10 or 20) I would just set w = 0, which moves them to infinity and keep drawing the whole thing as before. If your clip plane is not at infinity, this will just discard them. With explicit flushing, you would not even need to keep a separate copy in memory.
For "many" points (e.g. 1,000), you may consider using glCopyBufferSubData to remove the "holes". Moving memory on the GPU is fast, and for thousands of points it's probably worth the trouble. You then need to maintain a count for every vertex buffer, so you draw fewer points after removing some.
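The hole-removal idea can be modeled on the CPU like the following sketch; on the GPU each copy would be one glCopyBufferSubData, and the order of points is not preserved, which is fine for GL_POINTS (names are illustrative):

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <vector>

struct Point { float x, y, z; };

// Sketch: fill each hole with the last live point and shrink the draw count.
// Returns the new vertex count to pass to glDrawArrays.
size_t compact(std::vector<Point>& buf, size_t liveCount,
               std::vector<size_t> dead) {
    // Process holes from the back so a hole is never filled from another hole.
    std::sort(dead.begin(), dead.end(), std::greater<size_t>());
    for (size_t hole : dead) {
        --liveCount;
        if (hole < liveCount)
            buf[hole] = buf[liveCount]; // glCopyBufferSubData equivalent
    }
    return liveCount;
}
```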
To "delete" entire vertex buffers, you should just orphan them (and reuse). OpenGL will do the right thing on its own behalf then, and it's the most efficient way to keep drawing and reusing memory.
Using glDrawElements instead of glDrawArrays, as suggested in Andon M. Coleman's comment, is usually good advice, but will not help you in this case. The reason one would want to do that is that the post-transform cache tags vertices by their index, so drawing elements takes advantage of the post-transform cache whereas drawing arrays does not. However, the post-transform cache is only useful on connected geometry such as triangle lists or triangle strips. You're drawing points, so you will not use the post-transform cache in any case -- and using indices increases memory bandwidth both on the GPU and on the PCIe bus.

Draw a bunch of elements generated by CUDA/OpenCL?

I'm new to graphics programming, and need to add on a rendering backend for a demo we're creating. I'm hoping you guys can point me in the right direction.
Short version: Is there any way to send OpenGL an array of data for distinct elements, without having to issue a draw command for each element distinctly?
Long version: We have a CUDA program (will eventually be OpenCL) which calculates a bunch of data for a bunch of objects for us. We then need to render these objects using, e.g., OpenGL.
The CUDA kernel can generate our vertices, and using OpenGL interop it can shove these into an OpenGL VBO without having to transfer the data back to host memory. But the problem is that we have a bunch (upwards of a million is our goal) of distinct objects. It seems like our best bet here is allocating one VBO and putting every object's vertices into it. Then we can call glDrawArrays with offsets and lengths of each element inside that VBO.
However, each object may have a variable number of vertices (though the total vertices in the scene can be bounded.) I'd like to avoid having to transfer a list of start indices and lengths from CUDA -> CPU every frame, especially given that these draw commands are going right back to the GPU.
Is there any way to pack a buffer with data such that we can issue only one call to OpenGL to render the buffer, and it can render a number of distinct elements from that buffer?
(Hopefully I've also given enough info to avoid a XY problem here.)
One way would be to get away from understanding these as individual objects and making them a single large object drawn with a single draw call. The question is, what data is it that distinguishes the objects from each other, meaning what is it you change between the individual calls to glDrawArrays/glDrawElements?
If it is something simple, like a color, it would probably be easier to supply it as an additional per-vertex attribute. This way you can render all objects as one single large object using a single draw call, with the individual sub-objects (which really only exist conceptually now) colored correctly. The memory cost of the additional attribute may be well worth it.
If it is something a little more complex (like a texture), you may still be able to index it using an additional per-vertex attribute, being either an index into a texture array (as texture arrays should be supported on CUDA/OpenCL-able hardware) or a texture coordinate into a particular subregion of a single large texture (a so-called texture atlas).
But if the difference between those objects is something more complex, like a different shader, you may really need to render individual objects and make individual draw calls. But you still don't necessarily need to make a round-trip to the CPU. With the ARB_draw_indirect extension (which is core since GL 4.0, I think, but may be supported on GL 3 hardware (and thus CUDA/CL hardware), don't know) you can source the arguments to a glDrawArrays/glDrawElements call from an additional buffer (into which you can write with CUDA/CL like any other GL buffer). So you can assemble the offset-length information of each individual object on the GPU and store it in a single buffer. Then you do your glDrawArraysIndirect loop, offsetting into this single draw-indirect buffer (with the offset between the individual objects now being constant).
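The indirect-command layout is fixed by ARB_draw_indirect, so assembling the per-object commands is straightforward. Here is a CPU sketch of what the kernel would compute on the GPU, given per-object vertex counts packed back-to-back in one VBO (function name illustrative):

```cpp
#include <cstdint>
#include <vector>

// Layout of one glDrawArraysIndirect command, as defined by ARB_draw_indirect.
struct DrawArraysIndirectCommand {
    uint32_t count, instanceCount, first, baseInstance;
};

// Sketch: build the draw-indirect buffer contents from per-object vertex
// counts. Each object starts where the previous one ended.
std::vector<DrawArraysIndirectCommand>
buildCommands(const std::vector<uint32_t>& vertexCounts) {
    std::vector<DrawArraysIndirectCommand> cmds;
    uint32_t first = 0;
    for (uint32_t count : vertexCounts) {
        cmds.push_back({count, 1, first, 0}); // one instance per object
        first += count;
    }
    return cmds;
}
```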
But if the only reason for issuing multiple draw calls is that you want to render the objects as single GL_TRIANGLE_STRIPs or GL_TRIANGLE_FANs (or, god beware, GL_POLYGONs), you may want to reconsider and just use a bunch of GL_TRIANGLES so that you can render all objects in a single draw call. The (possible) time and memory savings from using triangle strips are likely to be outweighed by the overhead of multiple draw calls, especially when rendering many small triangle strips. If you really want to use strips or fans, you may want to introduce degenerate triangles (by repeating vertices) to separate them from each other, even when drawn with a single draw call. Or you may look into the glPrimitiveRestartIndex function introduced with GL 3.1.
Probably not optimal, but you could make a single glDrawArray on your whole buffer...
If you use GL_TRIANGLES, you can fill your buffer with zeroes and write only the needed vertices in your kernel. This way "empty" regions of your buffer will be drawn as zero-area polygons (degenerate polygons, so not drawn at all).
If you use GL_TRIANGLE_STRIP, you can do the same, but you'll have to duplicate your first vertex in order to make a fake triangle between (0,0,0) and your mesh.
This can seem overkill, but :
- You'll have to be able to handle as many vertices anyway
- degenerate triangles use no fillrate, so they are almost free (the vertex shader is still computed, though)
A probably better solution would be to use glDrawElements instead: in your kernel, you also generate an index list for your whole buffer, which can completely skip regions of your buffer.

Using more than one index list in a single VAO

I'm probably going about this all wrong, but hey.
I am rendering a large number of wall segments (for argument's sake, let's say 200). Every segment is one unit high, even and straight with no diagonals. All changes in direction are a change of 90 degrees.
I am representing each one as a four-pointed triangle fan, AKA a quad. Each vertex has a three-dimensional texture coordinate associated with it, such as (0,0,0), (0,1,7) or (10,1,129).
This all works fine, but I can't help but think it could be so much better. For instance, every point is duplicated at least twice (Every wall is a contiguous line of segments and there are some three & four way intersections) and the starting corner texture coordinates (0,0,X and 0,1,X) are going to be duplicated for every wall with texture number X on it. This could be compressed even further by moving the O coordinate into a third attribute and indexing the S and T coordinates separately.
The problem is, I can't seem to work out how to do this. VAOs only seem to allow one index, and taken as a lump, each position and texture coordinate form a unique snowflake never to be repeated. (Admittedly, this may not be true on certain corners, but that's a very edge case)
Is what I want to do possible, or am I going to have to stick with the (admittedly fine) method I currently use?
It depends on how much work you want to do.
OpenGL does not directly allow you to use multiple indices. But you can get the same effect.
The most direct way is to use a Buffer Texture to access an index list (using gl_VertexID), which you then use to access a second Buffer Texture containing your positions or texture coordinates. Basically, you'd be manually doing what OpenGL automatically does. This will naturally be slower per-vertex, as attributes are designed to be fast to access. You also lose some of the compression features, as Buffer Textures don't support as many formats.
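A CPU model of that two-level lookup may make it clearer; in GLSL the two lookups would be texelFetch calls on samplerBuffers, driven by gl_VertexID (all names here are illustrative):

```cpp
#include <array>
#include <cstdint>
#include <vector>

// Per-vertex index pair: separate indices into the position and texture
// coordinate tables, emulating "multiple index lists".
struct IndexPair { uint32_t pos, tex; };

struct Expanded {
    std::array<float, 3> pos;
    std::array<float, 2> tex;
};

// Sketch: what the vertex shader would do manually with buffer textures.
Expanded fetchVertex(uint32_t vertexID,
                     const std::vector<IndexPair>& indexList,
                     const std::vector<std::array<float, 3>>& positions,
                     const std::vector<std::array<float, 2>>& texcoords) {
    const IndexPair ip = indexList[vertexID]; // first fetch: the index pair
    return { positions[ip.pos], texcoords[ip.tex] }; // second fetches: data
}
```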
each vertex and texture coordinate form a unique snowflake never to be repeated
A vertex is not just the position, but the whole vector formed by the position, texture coordinates and all the other attributes. It is what you referred to as a "snowflake".
And for only 200 walls, I'd not bother about memory consumption. It comes down to cache efficiency, and GPUs use vertices -- that is, the whole vector of position and other attributes -- as the caching key. Any "optimization" like the one you want to do will probably have a negative effect on performance.
But having some duplicated vertices doesn't hurt much if they're not too far apart in the primitive index list. Today's GPUs can hold roughly between 30 and 1000 vertices (that is, after transformation, i.e. after the shader stage) in their cache, depending on the number of vertex attributes fed in and the number of varying variables delivered to the fragment shader stage. So if a vertex (input key) has been cached, the shader won't be executed; the cached result is fed to fragment processing directly.
So the optimization you should really aim for is cache locality, i.e. batching things up in a way, that shared/duplicated vertices are sent to the GPU in quick succession.

How to batch same square in a single glVertexPointer

I've read that to optimize drawing, one can draw a set of figures using the same texture in one pass. But how do I connect my single squares together to form one figure to send to glVertexPointer?
(read in PowerVR MBX.3D Application Development Recommendations.1.0.67a - page 5)
I take it you are pondering the "software transform" step. I assume you have some kind of vertex array for ONE square and would like to concatenate several instances of that array into one big array containing all the vertices for your squares, and then finally draw the big array.
The big array in this case would then contain pre-transformed vertex data that is sent to the GPU in one draw call. Since you know how many squares you are going to draw, say N, the array must be able to hold N*4*3 elements (assuming you are working in 3D and therefore have 3 coordinates per vertex).
The software transform step would then iterate over all the squares and append the transformed data into the big array, which in turn can be drawn by a call to glVertexPointer().
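The append step might look like the following sketch, using a plain translation in place of a full matrix transform (the names and vertex order are illustrative):

```cpp
#include <vector>

struct Vec3 { float x, y, z; };

// Sketch of the software-transform step: append one square's four vertices,
// translated by a per-square offset, into the big array that will later be
// handed to glVertexPointer()/glDrawArrays in a single call.
void appendSquare(std::vector<float>& big, Vec3 offset) {
    // Unit quad in the XY plane (triangle-strip order; adjust for fans).
    const Vec3 quad[4] = {{0,0,0}, {1,0,0}, {0,1,0}, {1,1,0}};
    for (const Vec3& v : quad) {
        big.push_back(v.x + offset.x);
        big.push_back(v.y + offset.y);
        big.push_back(v.z + offset.z);
    }
}
```

Calling this once per square gives the N*4*3-element array described above.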
However, I'm a bit sceptical about this software transform step. It means you are going to take care of all these transformations on your own, which means you are not using the power of the GPU; that is, you will use CPU power rather than GPU power to get your transformed coordinates. Personally I'd start off by creating the texture atlas and then generating all the texture coordinates for each single square. This is only needed once, since the texture coordinates will not change. You can then supply the texture coordinates with a call to glTexCoordPointer() before you start drawing your squares (push matrix, draw quad, pop matrix, etc.).
Edit:
yes this is what i want. Are you sure it's slower on the CPU? from what they say, the overhead of calling glVertexPointer and glDrawArrays multiple times would be slower.
I'm taking back my scepticism! :) Why don't you do some measurements, just for the heck of it? The trade-off, at least, is that you will be shuffling a whole lot more data, so if the GPU can't hold all that data, the software transformations might be a necessity.
Oh, there's one more thing: as soon as a butterfly moves, its data has to be manually retransformed; this is not needed when you let the GPU transform the data for you. So you must flag some state for each instance when it's dirty and retransform it before doing the draw call.
Looks like you want to use indices? See glDrawElements.