How to batch same square in a single glVertexPointer - opengl

I've read that to optimize drawing, one can draw a set of figures that use the same texture in one pass. But how do I connect my single squares together to form one figure to send to glVertexPointer?
(read in PowerVR MBX.3D Application Development Recommendations.1.0.67a - page 5)

I take it you are pondering the "software transform" step. I assume you have some kind of vertex array for ONE square and would like to concatenate several instances of that array into one big array containing all the vertices for your squares, and then finally draw the big array.
The big array in this case would contain pre-transformed vertex data that is sent to the GPU in one draw call. Since you know how many squares you are going to draw, say N of them, that array must be able to hold N*4*3 elements (assuming you are working in 3D and therefore have 3 coordinates per vertex).
The software transform step would then iterate over all the squares and append the transformed data to the big array, which is then passed to glVertexPointer() and drawn with a single draw call such as glDrawArrays().
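A minimal C++ sketch of that software transform step, assuming each square is described only by a centre and a half edge length (the Square struct and buildBatch are hypothetical names, not from the question). The only "transform" shown is a translation and a scale; a rotation would go in the same inner loop.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical per-square state: centre position and half the edge length.
struct Square {
    float cx, cy, cz;
    float halfSize;
};

// Appends the four pre-transformed corners of every square to one flat
// array laid out as x,y,z per vertex, i.e. N * 4 * 3 floats in total.
std::vector<float> buildBatch(const std::vector<Square>& squares) {
    // Corner offsets of a unit square, in counter-clockwise order.
    static const float corner[4][2] = {
        {-1.0f, -1.0f}, {1.0f, -1.0f}, {1.0f, 1.0f}, {-1.0f, 1.0f}};
    std::vector<float> batch;
    batch.reserve(squares.size() * 4 * 3);
    for (const Square& s : squares) {
        for (const auto& c : corner) {
            batch.push_back(s.cx + c[0] * s.halfSize);  // x
            batch.push_back(s.cy + c[1] * s.halfSize);  // y
            batch.push_back(s.cz);                      // z
        }
    }
    assert(batch.size() == squares.size() * 4 * 3);
    return batch;
}
```

The result would be handed over with something like glVertexPointer(3, GL_FLOAT, 0, batch.data()). Note that OpenGL ES 1.x has no GL_QUADS, so on that hardware the four corners per square would have to be drawn as triangles, for example through an index list.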
However, I'm a bit sceptical of this software transform step. It means you take care of all the transformations on your own, so you spend CPU power rather than GPU power to get your transformed coordinates. Personally I'd start off by creating the texture atlas and then generate the texture coordinates for each single square. This is only needed once, since the texture coordinates will not change. You can then supply the texture coordinates with a call to glTexCoordPointer() before you start drawing your squares (push matrix i, draw quad, pop matrix etc).
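As a rough illustration of generating those texture coordinates once, here is a sketch that assumes the atlas is a regular grid of equally sized tiles (the grid layout and the atlasTexCoords name are assumptions, not something the question specifies):

```cpp
#include <array>
#include <cassert>

// Returns the u,v pairs for the four corners of one atlas tile, in the
// same counter-clockwise order as the square's vertices. Tiles are
// numbered row by row, starting at 0.
std::array<float, 8> atlasTexCoords(int tile, int cols, int rows) {
    assert(cols > 0 && rows > 0 && tile >= 0 && tile < cols * rows);
    const float w = 1.0f / cols;          // width of one tile in UV space
    const float h = 1.0f / rows;          // height of one tile in UV space
    const float u0 = (tile % cols) * w;   // left edge
    const float v0 = (tile / cols) * h;   // bottom edge
    return {u0,     v0,
            u0 + w, v0,
            u0 + w, v0 + h,
            u0,     v0 + h};
}
```

Because the coordinates never change, they can be computed at load time and stored next to the vertex data.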
Edit:
yes this is what i want. Are you sure it's slower on the cpu ? from what they say the overhead of calling multiple times glVertexPointer and glDrawArrays would be slower.
I'm taking back my scepticism! :) Why don't you take some measurements, just for the heck of it? The trade-off is at least that you will be shuffling a whole lot more data. So if the GPU can't hold all that data, normal transformations might be a necessity.
Oh, there's one more thing. As soon as a butterfly moves, its data has to be manually retransformed; this is not needed when you let the GPU transform the data for you. So you must keep a dirty flag for each instance and retransform the dirty ones before doing the draw call.
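One possible shape for that dirty flag, sketched in C++ under the assumption of 2D instances with four corners (8 floats) each in the shared batch; Instance and refreshDirty are made-up names for illustration:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical instance record: a 2D position plus a flag that is set
// whenever the position changes.
struct Instance {
    float x, y;
    bool dirty;
};

// Rewrites the four corners (8 floats) of every dirty instance inside the
// shared batch and clears its flag. Returns how many were re-transformed.
std::size_t refreshDirty(std::vector<Instance>& instances,
                         std::vector<float>& batch, float halfSize) {
    assert(batch.size() == instances.size() * 8);
    std::size_t updated = 0;
    for (std::size_t i = 0; i < instances.size(); ++i) {
        Instance& inst = instances[i];
        if (!inst.dirty) continue;  // untouched instances keep their old data
        float* v = &batch[i * 8];
        v[0] = inst.x - halfSize; v[1] = inst.y - halfSize;
        v[2] = inst.x + halfSize; v[3] = inst.y - halfSize;
        v[4] = inst.x + halfSize; v[5] = inst.y + halfSize;
        v[6] = inst.x - halfSize; v[7] = inst.y + halfSize;
        inst.dirty = false;
        ++updated;
    }
    return updated;
}
```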

Looks like you want to use indices? See glDrawElements.

Related

How would I store vertex, color, and index data separately? (OpenGL)

I'm starting to learn openGL (working with version 3.3) with intent to get a small 3d falling sand simulation up, akin to this:
https://www.youtube.com/watch?v=R3Ji8J2Kprw&t=41s
I have a little experience with setting up a voxel environment like Minecraft from some Udemy tutorials for Unity, but I want to build something simple from the ground up and not deal with all the systems already laid on top of things with Unity.
The first issue I've run into comes early. I want to build a system for rendering quads, because instancing a ton of cubes is ridiculously inefficient. I also want to be efficient with storage of vertices, colors, etc. Thus far in the opengl tutorials I've worked with the way to do this is to store each vertex in a float array with both position and color data, and set up the buffer object to read every set of six entries as three floats for position and three for color, using glVertexAttribPointer. The problem is that for each neighboring quad, the same vertices will be repeated because if they are made of different "blocks" they will be different colors, and I want to avoid this.
What I want to do instead to make things more efficient is store the vertices of a cube in one int array (positions will all be ints), then add each quad out of the terrain to an indices array (which will probably turn into each chunk's mesh later on). The indices array will store each quad's position, and a separate array will store each quad's color. I'm a little confused on how to set this up since I am rather new to opengl, but I know this should be doable based on what other people have done with minecraft clones, if not even easier since I don't need textures.
I just really want to get the framework for the chunks, blocks, world, etc, up and running so that I can get to the fun stuff like adding new elements. Anyone able to verify that this is a sensible way to do this (lol) and offer guidance on how to set this up in the rendering, I would very much appreciate.
Thus far in the opengl tutorials I've worked with the way to do this is to store each vertex in a float array with both position and color data, and set up the buffer object to read every set of six entries as three floats for position and three for color, using glVertexAttribPointer. The problem is that for each neighboring quad, the same vertices will be repeated because if they are made of different "blocks" they will be different colors, and I want to avoid this.
Yes, and perhaps there's a reason for that. You seem to be trying to save.. what, a few bytes of RAM? Your graphics card has 8GB of RAM on it, what it doesn't have is a general processing unit or an unlimited bus to do random lookups in other buffers for every single rendered pixel.
The indices array will store each quad's position, and a separate array will store each quad's color.
If you insist on doing it this way, nothing's stopping you. You don't even need the quad vertices, you can synthesize them in a geometry shader.
Just fill a buffer with X|Y|Width|Height|Color(RGB) entries and describe it with glVertexAttribPointer like you already know. A pass-through vertex shader forwards each entry to a geometry shader, which synthesizes two triangles (a quad) per entry and projects them to world units (you mentioned integers, so you're not in world units initially), and then your fragment shader can color each rasterized pixel according to its color entry.
ridiculously inefficient
Indeed, if that sounds ridiculously inefficient to you, it's because it is. You're essentially packing your data on the CPU, transferring it to the GPU, unpacking it and then processing it as normal. You can skip at least two of the steps, and even more if you consider that vertex shader outputs get cached within rasterized primitives.
There are many more variations of this insanity, like:
- store vertex positions unpacked as normal, and store an index for the colors. Then store the colors in a linear buffer of some kind (texture, SSBO, generic buffer, etc) and look up each color index. That's even more inefficient, but it's closer to the algorithm you were suggesting.
- store vertex positions for one quad and set up instanced rendering with a multi-draw command and a buffer to feed individual instance data (positions and colors). If you also have textures, you can use bindless textures for each quad instance. It's still rendering multiple objects, but it's slightly more optimized by your graphics driver.
- or just store per-vertex data in a buffer and render it. Done. No pre-computations, no unlimited expansions, no crazy code, you have your vertex data and you render it.
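For the last option, a sketch of what "just store per-vertex data" could look like on the CPU side, using the position-plus-color layout the question describes. The Vertex struct, the constants and appendQuad are illustrative names; the quad is assumed axis-aligned in a z = const plane.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Interleaved layout: three floats of position, then three floats of color.
struct Vertex {
    float position[3];
    float color[3];
};

// Values that would be passed as stride and offsets to glVertexAttribPointer.
constexpr std::size_t kStride = sizeof(Vertex);
constexpr std::size_t kPositionOffset = offsetof(Vertex, position);
constexpr std::size_t kColorOffset = offsetof(Vertex, color);

// Appends one single-colored quad as two triangles (six vertices).
void appendQuad(std::vector<Vertex>& out, float x, float y, float z,
                float size, float r, float g, float b) {
    assert(size > 0.0f);
    const float px[4] = {x, x + size, x + size, x};
    const float py[4] = {y, y, y + size, y + size};
    const int order[6] = {0, 1, 2, 2, 3, 0};  // two triangles per quad
    for (int i : order) {
        out.push_back(Vertex{{px[i], py[i], z}, {r, g, b}});
    }
}
```

With attribute locations 0 and 1 (an assumption about the shader), the buffer would be described by glVertexAttribPointer(0, 3, GL_FLOAT, GL_FALSE, kStride, (void*)kPositionOffset) and glVertexAttribPointer(1, 3, GL_FLOAT, GL_FALSE, kStride, (void*)kColorOffset). The duplicated corners cost 24 bytes each, which is the price the answer argues is worth paying.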

Using OpenGL instancing for rendering 2D scene with object depths and alpha blending

Here's what I'm trying to do: I want to render a 2D scene, consisting of a number of objects (quads), using instancing. Objects with a lower y value (towards the bottom of the screen) need to be rendered in front of the ones with higher y values. And alpha blending also needs to work.
So my first idea was to use the Z value for depth, but I soon realized alpha blending will not work unless the objects are drawn in the right order. But I'm not issuing one call for each quad, but use a single instanced call to render the whole scene. Putting the instance data in the correct sorted order seems to work for me, but I doubt this is something I can rely on, since the GPU is supposed to run those computations in parallel as much as possible.
So the question is, is there a way to make this work? The best thing I can think of right now is to issue an instanced call for each separate y value (and issue those in order, back to front). Is there a better way to do this?
Instancing is best used for cases where each instance is medium-sized: hundreds or maybe thousands of triangles. Quads are not a good candidate for instancing.
Just build and render a sequence of triangles. There are even ways to efficiently get around the lack of a GL_QUADS primitive type in modern OpenGL.
Putting the instance data in the correct sorted order seems to work for me, but I doubt this is something I can rely on, since the GPU is supposed to run those computations in parallel as much as possible.
That's not how GPUs work.
When you issue a rendering command, what you (eventually) get is a sequence of primitives. Because the vertices that were given to that command are ordered (first to last), and the instances in that command are ordered, and even the draws within a single draw command are ordered, an order can be assigned to every primitive in the draw call with respect to every other primitive based on the order of vertices, instances, and draws.
This defines the primitive order for a drawing command. GPUs guarantee that blending (and logical operations and other visible post-fragment shader operations) will respect the primitive order of a rendering command and between rendering commands. That is, if you draw 2 triangles in a single call, and the first is behind the second (with depth testing turned off), then blending for the second triangle will respect the data written by the first.
Basically, if you give primitives to the GPU in an order, the GPU will respect that order with regard to blending and such.
So again, just build an ordered stream of triangles to represent your quads and render them.
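A sketch of building such an ordered stream on the CPU, assuming each quad is an axis-aligned rectangle and that a higher y means "further back" as in the question (Quad and buildSortedTriangles are hypothetical names):

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical quad: lower-left corner plus width and height.
struct Quad {
    float x, y, w, h;
};

// Sorts the quads back to front (highest y first, so lower-y quads are
// drawn later and end up on top) and emits two triangles per quad as a
// flat list of x,y pairs.
std::vector<float> buildSortedTriangles(std::vector<Quad> quads) {
    std::stable_sort(quads.begin(), quads.end(),
                     [](const Quad& a, const Quad& b) { return a.y > b.y; });
    std::vector<float> verts;
    verts.reserve(quads.size() * 12);
    for (const Quad& q : quads) {
        const float x0 = q.x, y0 = q.y, x1 = q.x + q.w, y1 = q.y + q.h;
        const float tri[12] = {x0, y0, x1, y0, x1, y1,
                               x1, y1, x0, y1, x0, y0};
        verts.insert(verts.end(), tri, tri + 12);
    }
    assert(verts.size() == quads.size() * 12);
    return verts;
}
```

The whole array can then go out in one GL_TRIANGLES draw call, and blending follows the order in which the triangles appear in it.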

OpenGL- drawarrays or drawelements?

I'm making a small 2D game demo and from what I've read, it's better to use drawElements() to draw an indexed triangle list than using drawArrays() to draw an unindexed triangle list.
But it doesn't seem possible as far as I know to draw multiple elements that are not connected in a single draw call with drawElements().
So for my 2D game demo where I'm only ever going to draw squares made of two triangles, what would be the best approach so I don't end having one draw call per object?
Yes, it's better to use indices in many cases since you don't have to store or transfer duplicate vertices and you don't have to process duplicate vertices (the vertex shader only needs to be run once per vertex). In the case of quads, you reduce 6 vertices to 4, plus a small amount of index data. Cutting the vertex data to two thirds is quite a good improvement really, especially if your vertex data is more than just position.
In summary, glDrawElements results in:
- Less data (mostly), which means more GPU memory for other things
- Faster updating if the data changes
- Faster transfer to the GPU
- Faster vertex processing (no duplicates)
Indexing can affect cache performance if the referenced vertices aren't near each other in memory. Modellers commonly produce meshes which are optimized with this in mind.
For multiple elements, if you're referring to GL_TRIANGLE_STRIP you could use glPrimitiveRestartIndex to draw multiple strips of triangles with the one glDrawElements call. In your case it's easy enough to use GL_TRIANGLES and reference 4 vertices with 6 indices for each quad. Your vertex array then needs to store all the vertices for all your quads. If they're moving you still need to send that data to the GPU every frame. You could position all the moving quads at the front of the array and only update the active ones. You could also store static vertex data in a separate array.
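The "4 vertices with 6 indices for each quad" layout can be generated once for any number of quads. A minimal sketch, assuming 16-bit indices and vertices stored four per quad in a consistent corner order (buildQuadIndices is a made-up helper name):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Builds the index list for quadCount independent quads whose vertices
// are stored consecutively, four per quad.
std::vector<std::uint16_t> buildQuadIndices(std::size_t quadCount) {
    assert(quadCount * 4 <= 65536);  // must fit into 16-bit indices
    static const std::uint16_t pattern[6] = {0, 1, 2, 2, 3, 0};
    std::vector<std::uint16_t> indices;
    indices.reserve(quadCount * 6);
    for (std::size_t q = 0; q < quadCount; ++q) {
        for (std::uint16_t p : pattern) {
            indices.push_back(static_cast<std::uint16_t>(q * 4 + p));
        }
    }
    return indices;
}
```

Since the pattern does not depend on where the quads are, this list never has to be re-uploaded when quads move. All of them are then drawn with a single glDrawElements(GL_TRIANGLES, count, GL_UNSIGNED_SHORT, ...) call, even though they are not connected.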
The typical approach to drawing a 3D model is to provide a list of fixed vertices for the geometry and move the whole thing with the model matrix (as part of the model-view). The confusing part here is that the mesh data is so small that, as you say, the overhead of the draw calls may become quite prominent. I think you'll have to draw a LOT of quads before you get to the stage where it'll be a problem. However, if you do, instancing or some similar idea such as particle systems is where you should look.
Perhaps only go down the following track if the draw calls or data transfer becomes a problem as there's a lot involved. A good way of implementing particle systems entirely on the GPU is to store instance attributes such as position/colour in a texture. Each frame you use an FBO/render-to-texture to "ping-pong" this data between another texture and update the attributes in a fragment shader. To draw the particles, you can set up a static VBO which stores quads with the attribute-data texture coordinates for use in the vertex shader where the particle position can be read and applied. I'm sure there's a bunch of good tutorials/implementations to follow out there (please comment if you know of a good one).

Draw a bunch of elements generated by CUDA/OpenCL?

I'm new to graphics programming, and need to add on a rendering backend for a demo we're creating. I'm hoping you guys can point me in the right direction.
Short version: Is there any way to send OpenGL an array of data for distinct elements, without having to issue a draw command for each element distinctly?
Long version: We have a CUDA program (will eventually be OpenCL) which calculates a bunch of data for a bunch of objects for us. We then need to render these objects using, e.g., OpenGL.
The CUDA kernel can generate our vertices, and using OpenGL interop, it can shove these in an OpenGL VBO and not have to transfer the data back to host memory. But the problem is we have a bunch (upwards of a million is our goal) of distinct objects. It seems like our best bet here is allocating one VBO and putting every object's vertices into it. Then we can call glDrawArrays with offsets and lengths of each element inside that VBO.
However, each object may have a variable number of vertices (though the total vertices in the scene can be bounded.) I'd like to avoid having to transfer a list of start indices and lengths from CUDA -> CPU every frame, especially given that these draw commands are going right back to the GPU.
Is there any way to pack a buffer with data such that we can issue only one call to OpenGL to render the buffer, and it can render a number of distinct elements from that buffer?
(Hopefully I've also given enough info to avoid a XY problem here.)
One way would be to get away from understanding these as individual objects and making them a single large object drawn with a single draw call. The question is, what data is it that distinguishes the objects from each other, meaning what is it you change between the individual calls to glDrawArrays/glDrawElements?
If it is something simple, like a color, it would probably be easier to supply this as an additional per-vertex attribute. This way you can render all objects as one single large object using a single draw call, with the individual sub-objects (which really only exist conceptually now) colored correctly. The memory cost of the additional attribute may be well worth it.
If it is something a little more complex (like a texture), you may still be able to index it using an additional per-vertex attribute, being either an index into a texture array (as texture arrays should be supported on CUDA/OpenCL-able hardware) or a texture coordinate into a particular subregion of a single large texture (a so-called texture atlas).
But if the difference between those objects is something more complex, such as a different shader, you may really need to render individual objects and make individual draw calls. But you still don't necessarily need to make a round trip to the CPU. With the ARB_draw_indirect extension (core since GL 4.0, I think; it may also be supported on GL 3 hardware, and thus CUDA/CL hardware, but I don't know) you can source the arguments of a glDrawArrays/glDrawElements call from an additional buffer (into which you can write with CUDA/CL like any other GL buffer). So you can assemble the offset-length information of each individual object on the GPU and store it in a single buffer. Then you do your glDrawArraysIndirect loop, offsetting into this single draw-indirect buffer (with the offset between the individual objects now being constant).
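To make the buffer layout concrete, here is a sketch of the command record that glDrawArraysIndirect reads, four unsigned 32-bit integers per draw. In the scenario above the CUDA/CL kernel would write these records; the CPU loop below (packCommands is a hypothetical name) only illustrates what has to end up in the buffer.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// One record as consumed by glDrawArraysIndirect.
struct DrawArraysIndirectCommand {
    std::uint32_t count;          // number of vertices of this object
    std::uint32_t instanceCount;  // 1 for a plain, non-instanced draw
    std::uint32_t first;          // index of the object's first vertex
    std::uint32_t baseInstance;   // reserved in GL 4.0/4.1, keep at 0
};

// Packs one command per object, assuming the objects' vertices are stored
// back to back in the vertex buffer.
std::vector<DrawArraysIndirectCommand> packCommands(
        const std::vector<std::uint32_t>& vertexCounts) {
    std::vector<DrawArraysIndirectCommand> cmds;
    cmds.reserve(vertexCounts.size());
    std::uint32_t first = 0;
    for (std::uint32_t count : vertexCounts) {
        cmds.push_back({count, 1u, first, 0u});
        first += count;
    }
    return cmds;
}
```

With this buffer bound to GL_DRAW_INDIRECT_BUFFER, draw i reads its arguments at byte offset i * sizeof(DrawArraysIndirectCommand), which is the constant offset mentioned above. On GL 4.3 hardware, glMultiDrawArraysIndirect might replace the loop altogether, though that is beyond what the answer assumes.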
But if the only reason for issuing multiple draw calls is that you want to render the objects as single GL_TRIANGLE_STRIPs or GL_TRIANGLE_FANs (or, god forbid, GL_POLYGONs), you may want to reconsider and just use a bunch of GL_TRIANGLES so that you can render all objects in a single draw call. The (possible) time and memory savings from using triangle strips are likely to be outweighed by the overhead of multiple draw calls, especially when rendering many small triangle strips. If you really want to use strips or fans, you may want to introduce degenerate triangles (by repeating vertices) to separate them from each other, even when drawn with a single draw call. Or you may look into the glPrimitiveRestartIndex function introduced with GL 3.1.
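The degenerate-triangle trick for strips can be sketched as follows. This version works on index lists and is only one way to do it (joinStrips is a made-up name): it repeats the last index of the previous strip and the first index of the next one, and adds one more repeat when needed so that the next strip keeps its winding order.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Concatenates several triangle strips into one strip by inserting
// zero-area (degenerate) triangles between them.
std::vector<std::uint16_t> joinStrips(
        const std::vector<std::vector<std::uint16_t>>& strips) {
    std::vector<std::uint16_t> out;
    for (const auto& strip : strips) {
        if (strip.empty()) continue;
        if (!out.empty()) {
            const std::uint16_t last = out.back();
            out.push_back(last);           // repeat end of previous strip
            out.push_back(strip.front());  // repeat start of next strip
            // A strip must start at an even position to keep its winding.
            if (out.size() % 2 != 0) out.push_back(strip.front());
        }
        out.insert(out.end(), strip.begin(), strip.end());
    }
    return out;
}
```

Every triangle that touches a repeated index has zero area and is discarded by the rasterizer, so the joined list draws the same picture in one GL_TRIANGLE_STRIP call.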
Probably not optimal, but you could make a single glDrawArrays call on your whole buffer...
If you use GL_TRIANGLES, you can fill your buffer with zeroes, and write only the needed vertices in your kernel. This way "empty" regions of your buffer will be drawn as 0-area polygons ( = degenerate polygons -> not drawn at all )
If you use GL_TRIANGLE_STRIP, you can do the same, but you'll have to duplicate your first vertex in order to make a fake triangle between (0,0,0) and your mesh.
This can seem overkill, but :
- You'll have to be able to handle as many vertices anyway
- degenerate triangles use no fillrate, so they are almost free (the vertex shader is still computed, though)
A probably better solution would be to use glDrawElements instead: in your kernel, you also generate an index list for your whole buffer, which will be able to completely skip regions of your buffer.

Using more than one index list in a single VAO

I'm probably going about this all wrong, but hey.
I am rendering a large number of wall segments (for argument's sake, let's say 200). Every segment is one unit high, even and straight with no diagonals. All changes in direction are a change of 90 degrees.
I am representing each one as a four pointed triangle fan, AKA a quad. Each vertex has a three dimensional texture coordinate associated with it, such as 0,0,0, 0,1,7 or 10,1,129.
This all works fine, but I can't help but think it could be so much better. For instance, every point is duplicated at least twice (Every wall is a contiguous line of segments and there are some three & four way intersections) and the starting corner texture coordinates (0,0,X and 0,1,X) are going to be duplicated for every wall with texture number X on it. This could be compressed even further by moving the O coordinate into a third attribute and indexing the S and T coordinates separately.
The problem is, I can't seem to work out how to do this. VAOs only seem to allow one index, and taken as a lump, each position and texture coordinate form a unique snowflake never to be repeated. (Admittedly, this may not be true on certain corners, but that's a very edge case)
Is what I want to do possible, or am I going to have to stick with the (admittedly fine) method I currently use?
It depends on how much work you want to do.
OpenGL does not directly allow you to use multiple indices. But you can get the same effect.
The most direct way is to use a Buffer Texture to access an index list (using gl_VertexID), which you then use to access a second Buffer Texture containing your positions or texture coordinates. Basically, you'd be manually doing what OpenGL automatically does. This will naturally be slower per-vertex, as attributes are designed to be fast to access. You also lose some of the compression features, as Buffer Textures don't support as many formats.
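To show what "manually doing what OpenGL automatically does" amounts to, here is a CPU-side sketch of the double lookup. In the real setup the two index lists and the two data arrays would live in Buffer Textures and the loop body would run in the vertex shader with gl_VertexID as vertexId; all names below are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Vec3 {
    float x, y, z;
};

struct PackedVertex {
    Vec3 position;
    Vec3 texCoord;
};

// For every vertex id, fetches a position through one index list and a
// texture coordinate through another, then combines the two.
std::vector<PackedVertex> resolveIndices(
        const std::vector<Vec3>& positions,
        const std::vector<Vec3>& texCoords,
        const std::vector<unsigned>& positionIndex,
        const std::vector<unsigned>& texCoordIndex) {
    assert(positionIndex.size() == texCoordIndex.size());
    std::vector<PackedVertex> out;
    out.reserve(positionIndex.size());
    for (std::size_t vertexId = 0; vertexId < positionIndex.size(); ++vertexId) {
        out.push_back({positions.at(positionIndex[vertexId]),
                       texCoords.at(texCoordIndex[vertexId])});
    }
    return out;
}
```

Run once on the CPU, the same loop yields an ordinary single-index vertex array, which is the alternative to doing the lookup per vertex on the GPU.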
each vertex and texture coordinate form a unique snowflake never to be repeated
A vertex is not just a position, but the whole vector formed by position, texture coordinates and all the other attributes. That whole vector is what you referred to as a "snowflake".
And for only 200 walls, I wouldn't bother about memory consumption. It comes down to cache efficiency. GPUs use vertices, meaning the whole vector of position and other attributes, as the caching key. Any "optimization" of the kind you want to do will probably have a negative effect on performance.
But having some duplicated vertices doesn't hurt much if they're not too far apart in the primitive index list. Today's GPUs can hold roughly 30 to 1000 vertices (that is, after transformation, i.e. the vertex shader stage) in their cache, depending on the number of vertex attributes that are fed in and the number of varying variables delivered to the fragment shader stage. So if a vertex (input key) has been cached, the shader won't be executed again, and the cached result is fed to fragment processing directly.
So the optimization you should really aim for is cache locality, i.e. batching things up in a way, that shared/duplicated vertices are sent to the GPU in quick succession.
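A rough way to get a feel for this is to simulate the cache on an index list. The sketch below assumes a simple FIFO cache keyed on the index value, which is a simplification: real GPUs differ in size and replacement policy, so the numbers are only useful for comparing two orderings of the same mesh. countCacheMisses is a made-up helper name.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <deque>
#include <vector>

// Counts how many indices miss a FIFO cache of the given size, i.e. how
// often the vertex shader would have to run in this simplified model.
std::size_t countCacheMisses(const std::vector<unsigned>& indices,
                             std::size_t cacheSize) {
    assert(cacheSize > 0);
    std::deque<unsigned> cache;
    std::size_t misses = 0;
    for (unsigned index : indices) {
        if (std::find(cache.begin(), cache.end(), index) != cache.end()) {
            continue;  // hit: the transformed vertex is reused
        }
        ++misses;
        cache.push_back(index);
        if (cache.size() > cacheSize) cache.pop_front();  // evict the oldest
    }
    return misses;
}
```

For two quads stored as {0,1,2, 2,3,0, 4,5,6, 6,7,4} and a cache of 4 entries, this counts 8 misses, one per distinct vertex. Interleaving the triangles as {0,1,2, 4,5,6, 2,3,0, 6,7,4} raises it to 10, because the shared vertices are evicted before they are reused.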