OpenGL: Buffer object performance issue - opengl

I have a question about buffer object performance. I was rendering a mesh using standard vertex arrays (not interleaved) and wanted to switch to buffer objects to get a performance boost. I was shocked to find that using buffer objects made rendering four times slower. I thought buffers were supposed to increase performance. Is that true? If so, I must be doing something wrong...
I render a 3D tiled map, and to reduce the amount of memory needed I use only a single tile (one set of vertices) to render the whole map. For each tile of the map I change only the texture coordinates and the y value of the vertex positions. The buffers for positions and texture coordinates are created with GL_DYNAMIC_DRAW. The index buffer is created with GL_STATIC_DRAW because it doesn't change during map rendering. So for each tile of the map, the buffers are mapped and unmapped at least once. Should I use only one buffer for texture coordinates and positions?
Thanks,

Try moving your vertex/texture coordinates with the GL_MODELVIEW/GL_TEXTURE matrices and leave the buffer data alone (so it can stay GL_STATIC_DRAW). E.g. if a tile is of size 1x1, create a rect (0, 0)-(1, 1) and set its position in the world with glTranslate. Do the same with the texture coordinates.
VBOs are not there to increase the performance of drawing a few quads. Their true power is seen when drawing meshes with thousands of polygons using shaders. If you don't need forward compatibility with newer OpenGL versions, I see little use in using them to draw dynamically changing data.

If you need to update the buffer(s) each frame you should use GL_STREAM_DRAW (which hints that the buffer contents will be specified once and used at most a few times) rather than GL_DYNAMIC_DRAW (which hints that the contents will be modified repeatedly and used many times between updates).
As far as my experience goes, buffers created with GL_STREAM_DRAW will be treated similarly to plain ol' arrays, so you should expect about the same performance as for arrays when using it.
Also make sure that you call glMapBuffer with the access parameter set to GL_WRITE_ONLY, assuming you don't need to read the contents of the buffer. Otherwise, if the buffer is in video memory, it will have to be transferred from video memory to main memory and then back again (well, that's up to the driver really...) for each map call. Transferring too much data over the bus is a very real bottleneck that's quite easy to bump into.

Related

Using Index buffers in DirectX 11; how does it know?

Let's say I create two vertex buffers, for two different meshes.
(I'm assuming creating separate buffers for separate meshes is how it's usually done)
Now, let's say I want to draw one of the meshes using an index buffer.
Looking at the book Practical Rendering and Computation with Direct3D 11, it doesn't seem like the creation of an index buffer references a vertex buffer in any way, so how does the index buffer know (during input assembly) which vertex buffer to act on?
I've done some googling without answers, which leads me to assume there's something obvious about it that I'm missing.
You are right, index buffers do not reference specific vertex buffers. During DrawIndexed, the active index buffer is used to supply indices into the active vertex buffers (the ones you set using IASetIndexBuffer/IASetVertexBuffers).
Indeed, Index Buffers and Vertex Buffers are completely independent.
The index buffer only "knows" about the vertex buffer at draw time (i.e. when both are bound to the pipeline).
You can think of an index buffer as a "lookup table", where you keep a list of the element indices to draw.
That also means you can attach two completely "logically unrelated" buffers to the pipeline and draw it, nothing will prevent you from doing that, but you will of course have some strange visual results.
Decoupling both has many advantages, here are a few examples:
You can reuse an index buffer (for example, two displaced grids with identical resolution can share the same index buffer). That can be a decent memory gain.
You can draw your vertex buffer on its own and do some processing per vertex (draw a point list for sprites, for example, or apply skinning/displacement into a Stream Output buffer, then draw the resulting vertex buffer using DrawIndexed).
Both vertex and index buffers can also be bound as a ByteAddressBuffer, so you can process your geometry in a compute shader and build another, optimized index buffer (with culled triangles, for example), then issue the indexed draw with the optimized buffer. Culling via indices instead of vertices is often faster, as you move much less memory.
This is a niche case, but sometimes I have to draw a mesh as a set of triangles and then draw it again as a set of lines (some form of wireframe). If, for example, you take a single box, you will not want to draw the face diagonals as lines, so I have a shared vertex buffer with the box vertices, then one index buffer dedicated to drawing a triangle list and another dedicated to drawing a line list. For large models, this can also be an effective memory gain.
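As a rough illustration of the box case above, here is a minimal C++ sketch of one shared vertex array with two independent index arrays. The names (`Vertex`, `kBoxVertices`, `kBoxTriangles`, `kBoxEdges`, `checkIndices`) are made up for this example, and no Direct3D calls are shown; in a real application each array would be uploaded to its own buffer and one of the two index buffers bound before DrawIndexed.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical vertex layout for this sketch: position only.
struct Vertex { float x, y, z; };

// Eight corners of a unit box, shared by both index arrays.
// Bit 0 of the index selects x, bit 1 selects y, bit 2 selects z.
constexpr std::array<Vertex, 8> kBoxVertices = {{
    {0.0f, 0.0f, 0.0f}, {1.0f, 0.0f, 0.0f}, {0.0f, 1.0f, 0.0f}, {1.0f, 1.0f, 0.0f},
    {0.0f, 0.0f, 1.0f}, {1.0f, 0.0f, 1.0f}, {0.0f, 1.0f, 1.0f}, {1.0f, 1.0f, 1.0f},
}};

// Triangle list: 6 faces x 2 triangles x 3 indices = 36 indices.
constexpr std::array<std::uint16_t, 36> kBoxTriangles = {
    0, 2, 1,  1, 2, 3,   // z = 0 face
    4, 5, 6,  5, 7, 6,   // z = 1 face
    0, 1, 4,  1, 5, 4,   // y = 0 face
    2, 6, 3,  3, 6, 7,   // y = 1 face
    0, 4, 2,  2, 4, 6,   // x = 0 face
    1, 3, 5,  3, 7, 5,   // x = 1 face
};

// Line list: only the 12 box edges (no face diagonals) = 24 indices.
constexpr std::array<std::uint16_t, 24> kBoxEdges = {
    0, 1,  2, 3,  4, 5,  6, 7,   // edges along x
    0, 2,  1, 3,  4, 6,  5, 7,   // edges along y
    0, 4,  1, 5,  2, 6,  3, 7,   // edges along z
};

// Sanity check before upload: every index must refer to an existing vertex.
template <std::size_t N>
void checkIndices(const std::array<std::uint16_t, N>& indices) {
    for (std::uint16_t i : indices) assert(i < kBoxVertices.size());
}
```

The vertex data is stored once (8 vertices), while the two index arrays describe two different ways of walking over it.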

OpenGL: efficient way to read sparse pixel data from many framebuffer textures?

I'm writing a program that uses the GPU to calculate stuff, and I want to read data from the framebuffers to be used in my client code. The framebuffers I'm using are about 40 textures, all 1024x1024 in size, all of which contain data that needs to be read, but only very sparsely: 50 or so pixels at arbitrary x/y coordinates from each texture. Using glReadPixels for each texture, for each frame, is proving too costly for me though...
I only need to read a few select pixels from each texture, is there a way to quickly gather their data without needing to download every entire texture from the GPU?
This sounds fairly expensive no matter how you slice it. A couple of approaches come to mind:
What I would try first is glReadPixels(), but with using a PBO. Bind a buffer large enough to hold all the pixels to the GL_PIXEL_PACK_BUFFER target, and then submit the glReadPixels() calls, with offsets to place the results in distinct sections of the buffer. Then call glMapBufferRange() to read back the values.
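To make the "distinct sections of the buffer" idea a bit more concrete, here is a small C++ sketch of the bookkeeping only. It assumes one 1x1 read per requested pixel in GL_RGBA / GL_UNSIGNED_BYTE format; `PixelRequest`, `PackSlot`, `planPackBuffer` and `pixelAt` are hypothetical names for this example, and the actual GL call appears only in a comment.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical description of one pixel to fetch.
struct PixelRequest { int texture, x, y; };

// Where the result of one request lands inside the pack buffer.
struct PackSlot { PixelRequest request; std::size_t byteOffset; };

constexpr std::size_t kBytesPerPixel = 4;  // assumes GL_RGBA / GL_UNSIGNED_BYTE

// Give each request its own 4-byte section of the buffer.
std::vector<PackSlot> planPackBuffer(const std::vector<PixelRequest>& requests) {
    std::vector<PackSlot> slots;
    slots.reserve(requests.size());
    for (std::size_t i = 0; i < requests.size(); ++i)
        slots.push_back({requests[i], i * kBytesPerPixel});
    return slots;
}

// With the PBO bound to GL_PIXEL_PACK_BUFFER and the framebuffer for
// slot.request.texture bound for reading, each slot would be filled by:
//   glReadPixels(x, y, 1, 1, GL_RGBA, GL_UNSIGNED_BYTE, (void*)slot.byteOffset);

// Read one RGBA8 pixel back out of the mapped buffer memory.
std::uint32_t pixelAt(const unsigned char* mapped, const PackSlot& slot) {
    assert(mapped != nullptr);
    std::uint32_t rgba;
    std::memcpy(&rgba, mapped + slot.byteOffset, sizeof rgba);
    return rgba;
}
```

The buffer would need `requests.size() * kBytesPerPixel` bytes in total; for the 40 textures with about 50 pixels each mentioned in the question, that is roughly 8 KB rather than 40 full textures.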
An alternate approach is that you copy all the pixels you want to read into a single texture. You could use glBlitFramebuffer() or glCopyTexSubImage2D(). Then use a single glReadPixels() or glGetTexImage() call to get all the data from this texture.
Both of these approaches should result in about the same amount of work and synchronization overhead. But one or the other could be more efficient, depending on which paths in the driver are better optimized.
As the earlier answer already suggested, I would make very sure that you really need this, and there isn't any way to keep and process the data on the GPU. Any time you read back data, you introduce synchronization between GPU and CPU, which is mostly harmful to performance.
Do you have any restrictions on what OpenGL version you can use? If not, it sounds like you should look into compute shaders. You say that you are calculating data, so I assume that you are "abusing" the rendering pipeline for your application, especially the fragment shader, and store fragment data in the framebuffer that is interpreted as something else than color.
If this is the case, then all you need is a shader storage buffer and an atomic counter. At some point right now you are deciding that fragment (x, y, z [z being the texture index]) should have value v. So in your compute shader, you do your calculation as you would in the fragment shader, but as output, you store a tuple (x, y, z, v). You store this tuple in the shader storage buffer at the index of the atomic counter which you increment after each written element. In the end, you have your data stored compactly in the buffer and only need to read back these elements. The exact number is the value the atomic counter holds after termination. Download the buffer with glGetBufferSubData into an array of location-value pairs, iterate over it and do your CPU magic.
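The append pattern described above can be sketched on the CPU in C++. This is only a stand-in, not shader code: `std::atomic` plays the role of the atomic counter, a `std::vector` plays the role of the shader storage buffer, and `Record` and `CompactBuffer` are names invented for this example.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// One compact output record: texel position (x, y), texture index z, value v.
struct Record { std::uint32_t x, y, z; float v; };

// CPU stand-in for the shader storage buffer plus atomic counter.
struct CompactBuffer {
    std::vector<Record> storage;            // plays the role of the SSBO
    std::atomic<std::uint32_t> counter{0};  // plays the role of the atomic counter

    explicit CompactBuffer(std::size_t capacity) : storage(capacity) {}

    // What each invocation would do when it has a value to report.
    void append(const Record& r) {
        // fetch_add returns the value before the increment, so every
        // caller reserves a unique slot even when running concurrently.
        std::uint32_t slot = counter.fetch_add(1);
        assert(slot < storage.size());
        storage[slot] = r;
    }
};
```

After all invocations have finished, `counter` holds the number of valid records, and only that many elements need to be read back.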
If you need to copy the data from the GPU to the CPU memory, there is no way (AFAIK) around using glReadPixels.
Depending on what platform you're using, and the specifics of your program, you can try several optimizations using FBOs:
Copy only part of the texture, assuming you know the locations of the pixels. Note that in most cases it is still faster to copy the entire texture than to issue several small reads.
If you don't need 32-bit textures, you can render to a lower color resolution. The specifics depend on your platform's extensions.
Maybe you don't really need to copy the pixels since you plan to use them as a texture input to the next stage? In that case you copy the pixels directly on the GPU using glCopyTexImage2D

OpenGL rendering several mesh instances

I started learning OpenGL, now I'm going to try to develop something on my own and got stuck on a doubt.
I'm going to render models that have about 50k primitives (cylinders, cubes, cones, etc.). Fewer than 1/4 of them are 'unique'; the rest have the same dimensions but different positions and rotations. So I thought I could fill a data buffer with only the basic vertices and then draw them with individual transformation matrices.
From what I've read, I should use a buffer for vertices and a buffer for indices, so I don't waste memory storing repeated vertices.
All of them are stored in a single big buffer (I read that a single buffer is more efficient as long as it does not exceed a limit of 1~3 MB).
To draw them I'm trying to use glDrawElements, but since they are all in a single buffer, I cannot update the individual matrices to update the shaders so they can draw each mesh in the correct position.
One solution would be to use thousands of small buffers and then update the matrices between the glDrawElements calls.
Another would be to discard the index buffer and store only the vertices, so I can draw them using glDrawArrays, which allows me to draw only a small part of the buffer.
Is anything I said above wrong? Which option would result in better performance? Is there a better way to do this?

Draw a bunch of elements generated by CUDA/OpenCL?

I'm new to graphics programming, and need to add on a rendering backend for a demo we're creating. I'm hoping you guys can point me in the right direction.
Short version: Is there any way to send OpenGL an array of data for distinct elements, without having to issue a draw command for each element distinctly?
Long version: We have a CUDA program (will eventually be OpenCL) which calculates a bunch of data for a bunch of objects for us. We then need to render these objects using, e.g., OpenGL.
The CUDA kernel can generate our vertices, and using OpenGL interop, it can shove these into an OpenGL VBO without having to transfer the data back to host memory. But the problem is that we have a bunch of distinct objects (upwards of a million is our goal). It seems like our best bet here is allocating one VBO and putting every object's vertices into it. Then we can call glDrawArrays with the offset and length of each element inside that VBO.
However, each object may have a variable number of vertices (though the total vertices in the scene can be bounded.) I'd like to avoid having to transfer a list of start indices and lengths from CUDA -> CPU every frame, especially given that these draw commands are going right back to the GPU.
Is there any way to pack a buffer with data such that we can issue only one call to OpenGL to render the buffer, and it can render a number of distinct elements from that buffer?
(Hopefully I've also given enough info to avoid a XY problem here.)
One way would be to get away from understanding these as individual objects and making them a single large object drawn with a single draw call. The question is, what data is it that distinguishes the objects from each other, meaning what is it you change between the individual calls to glDrawArrays/glDrawElements?
If it is something simple, like a color, it would probably be easier to supply this as an additional per-vertex attribute. This way you can render all objects as one single large object using a single draw call, with the individual sub-objects (which really only exist conceptually now) colored correctly. The memory cost of the additional attribute may be well worth it.
If it is something a little more complex (like a texture), you may still be able to index it using an additional per-vertex attribute, being either an index into a texture array (as texture arrays should be supported on CUDA/OpenCL-able hardware) or a texture coordinate into a particular subregion of a single large texture (a so-called texture atlas).
But if the difference between those objects is something more complex, such as a different shader, you may really need to render individual objects with individual draw calls. Even then you don't necessarily need a round trip to the CPU. With the ARB_draw_indirect extension (core since GL 4.0, I think; it may also be supported on GL 3 hardware and thus on CUDA/CL hardware, I don't know) you can source the arguments of a glDrawArrays/glDrawElements call from an additional buffer (which you can write to with CUDA/CL like any other GL buffer). So you can assemble the offset and length information of each individual object on the GPU and store it in a single buffer. Then you run your glDrawArraysIndirect loop, offsetting into this single draw-indirect buffer (the offset between consecutive objects is now constant).
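For reference, the contents of such a draw-indirect buffer can be sketched in C++ as follows. The struct mirrors the four 32-bit fields OpenGL reads for glDrawArraysIndirect; `buildCommands` is a hypothetical CPU version of what the CUDA/CL kernel would write (a running sum over the per-object vertex counts), shown here only to illustrate the layout.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>
#include <vector>

// Layout read by glDrawArraysIndirect: four tightly packed 32-bit unsigned ints.
struct DrawArraysIndirectCommand {
    std::uint32_t count;          // number of vertices to draw
    std::uint32_t instanceCount;  // number of instances (1 for a plain draw)
    std::uint32_t first;          // index of the first vertex in the VBO
    std::uint32_t baseInstance;   // used from GL 4.2 on; must be 0 before that
};
static_assert(sizeof(DrawArraysIndirectCommand) == 16, "commands must be tightly packed");

// Hypothetical helper: one command per object, objects stored back to back.
std::vector<DrawArraysIndirectCommand>
buildCommands(const std::vector<std::uint32_t>& vertexCounts) {
    std::vector<DrawArraysIndirectCommand> cmds;
    cmds.reserve(vertexCounts.size());
    std::uint32_t first = 0;
    for (std::uint32_t count : vertexCounts) {
        assert(first <= std::numeric_limits<std::uint32_t>::max() - count);
        cmds.push_back({count, 1, first, 0});
        first += count;  // the next object starts right after this one
    }
    return cmds;
}
```

With this buffer bound to GL_DRAW_INDIRECT_BUFFER, object `i` would be drawn by passing the byte offset `i * sizeof(DrawArraysIndirectCommand)` as the `indirect` argument of glDrawArraysIndirect, which is the constant stride mentioned above.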
But if the only reason for issuing multiple draw calls is that you want to render the objects as single GL_TRIANGLE_STRIPs or GL_TRIANGLE_FANs (or, god forbid, GL_POLYGONs), you may want to reconsider and just use a bunch of GL_TRIANGLES so that you can render all objects in a single draw call. The (possible) time and memory savings from using triangle strips are likely to be outweighed by the overhead of multiple draw calls, especially when rendering many small triangle strips. If you really want to use strips or fans, you may want to introduce degenerate triangles (by repeating vertices) to separate them from each other, even when drawn with a single draw call. Or you may look into the glPrimitiveRestartIndex function introduced with GL 3.1.
Probably not optimal, but you could make a single glDrawArrays call on your whole buffer...
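The degenerate-triangle trick can be sketched in C++ for indexed strips. `Strip` and `stitchStrips` are hypothetical names; the extra duplicate that keeps each strip on an even offset is an addition of this sketch (so that the winding order of later strips is not flipped), not something stated in the answer.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

using Strip = std::vector<std::uint32_t>;  // indices of one triangle strip

// Join several strips into one by inserting degenerate (zero-area) triangles.
Strip stitchStrips(const std::vector<Strip>& strips) {
    Strip out;
    for (const Strip& s : strips) {
        assert(s.size() >= 3);  // a strip needs at least one triangle
        if (!out.empty()) {
            out.push_back(out.back());  // repeat the last index of the previous strip
            out.push_back(s.front());   // repeat the first index of the next strip
            if (out.size() % 2 != 0)    // keep the next strip on an even offset
                out.push_back(s.front());
        }
        out.insert(out.end(), s.begin(), s.end());
    }
    return out;
}
```

For example, stitching the strips 0 1 2 3 and 4 5 6 7 gives 0 1 2 3 3 4 4 5 6 7: every triangle that touches a repeated index has zero area and is not rasterized.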
If you use GL_TRIANGLES, you can fill your buffer with zeroes, and write only the needed vertices in your kernel. This way "empty" regions of your buffer will be drawn as 0-area polygons ( = degenerate polygons -> not drawn at all )
If you use GL_TRIANGLE_STRIP, you can do the same, but you'll have to duplicate your first vertex in order to make a fake triangle between (0,0,0) and your mesh.
This may seem like overkill, but:
- You'll have to be able to handle as many vertices anyway
- degenerate triangles use no fillrate, so they are almost free (the vertex shader is still computed, though)
A probably better solution would be to use glDrawElements instead: in your kernel, you also generate an index list for your whole buffer, which can then completely skip the unused regions of your buffer.
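A minimal C++ sketch of that index list, written for the CPU for readability (in the scenario above the kernel would produce it on the GPU). `Range` and `buildIndexList` are hypothetical names, and the sketch assumes GL_TRIANGLES, so each occupied region holds a multiple of three vertices.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// One occupied region of the vertex buffer: first vertex and vertex count.
struct Range { std::uint32_t first, count; };

// Emit indices for the occupied regions only; gaps never appear in the list.
std::vector<std::uint32_t> buildIndexList(const std::vector<Range>& occupied) {
    std::vector<std::uint32_t> indices;
    for (const Range& r : occupied) {
        assert(r.count % 3 == 0);  // whole triangles only (GL_TRIANGLES)
        for (std::uint32_t i = 0; i < r.count; ++i)
            indices.push_back(r.first + i);
    }
    return indices;
}
```

A single glDrawElements call over `indices.size()` indices then draws every object, and the vertex shader never runs for the unused vertices.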

Do VBOs boost performance even when all data changes frequently

I'm doing a 2D turn based RTS game with 32x32 tiles (400-500 tiles per frame). I could use a VBO for this, but I may have to change almost all the VBO data each frame, as the background is a scrolling one and the visible tiles will change every time the map scrolls. Will using VBOs rather than client side vertex arrays still yield a performance benefit here? Also if using VBOs which data format is most efficient (float, or int16, or ...)?
If you are simply scrolling, you can use the vertex shader to manipulate the position rather than update the vertices themselves. Pass in a 'scroll' value as a uniform to your background and simply add that value to the x (or y, or whatever applies to your case) value of each vertex.
Update:
If you intend to modify the VBO often, you can tell the driver this using the usage param of glBufferData. This page has a good description of how that works: http://www.opengl.org/wiki/Vertex_Buffer_Object, under Accessing VBOs. In your case, it looks like you should specify GL_DYNAMIC_DRAW to glBufferData so that the driver puts your VBO in the best place in memory for your application.
The regular approach is to move the camera and perform culling instead of updating the contents of the VBOs. For a 2D game, culling can use a simple rectangle intersection algorithm, which you will need anyway for unit selection in the game. As a bonus, manipulating the camera will allow you to rotate the camera and to zoom in and out. You could also combine several tiles (4, 9 or 16) into one VBO.
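For an axis-aligned, unrotated camera, the rectangle test reduces to computing a range of tile indices. The following C++ sketch assumes square tiles laid out from the world origin; `TileRange` and `visibleTiles` are names made up for this example.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// Half-open range of visible tiles: [x0, x1) by [y0, y1).
struct TileRange { int x0, y0, x1, y1; };

// Camera rectangle is (camX, camY) to (camX + viewW, camY + viewH) in world units.
TileRange visibleTiles(float camX, float camY, float viewW, float viewH,
                       float tileSize, int mapW, int mapH) {
    assert(tileSize > 0.0f);
    // First tile touched by a coordinate, clamped to the map.
    auto lo = [&](float v, int limit) {
        return std::max(0, std::min(limit, static_cast<int>(std::floor(v / tileSize))));
    };
    // One past the last tile touched by a coordinate, clamped to the map.
    auto hi = [&](float v, int limit) {
        return std::max(0, std::min(limit, static_cast<int>(std::ceil(v / tileSize))));
    };
    return {lo(camX, mapW), lo(camY, mapH),
            hi(camX + viewW, mapW), hi(camY + viewH, mapH)};
}
```

Only tiles inside the returned range need to be drawn; the tile vertex data itself never changes when the camera scrolls.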
I would strongly advise against writing logic to move the tiles instead of the camera. It will take you longer, have more bugs, and be less flexible.
The format will depend on what data you are storing in the VBOs. When in doubt, just use uint8 for color and float32 for everything else. Though for a 2d game your VBOs or vertex array are going to be very small compared to 3d applications, so it's highly unlikely VBO will make any difference.