I have a VBO of 1,050,625 vertices representing a height map. I draw the mesh with GL_TRIANGLE_STRIP, using indexed rendering over frustum-culled chunks of 32x32 cells.
Should I care about how my vertices are laid out in the VBO in terms of performance? That is, is there any information on how the distance between indexed elements affects performance, e.g. indices like [100, 101, 102] versus [10, 1017, 2078]?
The distance between indices determines which memory locations are read, and the effect comes down to caching: if a position is not in the current cache, it must be fetched from main memory.
At least in theory. In practice it depends on the hardware and driver implementation; cache size and bus speed both have an influence.
As a starting point, any layout whose working set stays below a few MB should be among the quickest options.
In any case, when performance matters, the only real way to know is to benchmark the different options, on different hardware if possible.
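For illustration, here is a minimal sketch of how chunk indices might map onto a row-major height-map VBO (the helper name and the 1025-vertices-per-row assumption are mine, inferred from the 1,050,625-vertex figure): consecutive strip indices then stay within one row's distance of each other in the buffer, which is the cache-friendly case described above.

    #include <cstdint>
    #include <vector>

    // Indices for one 32x32-cell chunk of a row-major height-map VBO.
    // verticesPerRow would be 1025 for a 1024x1024-cell map.
    std::vector<uint32_t> buildChunkIndices(uint32_t chunkX, uint32_t chunkY,
                                            uint32_t verticesPerRow)
    {
        const uint32_t cells = 32;
        std::vector<uint32_t> indices;
        for (uint32_t y = 0; y < cells; ++y) {
            uint32_t row0 = (chunkY * cells + y) * verticesPerRow + chunkX * cells;
            uint32_t row1 = row0 + verticesPerRow;
            for (uint32_t x = 0; x <= cells; ++x) {
                indices.push_back(row0 + x);   // vertex on the upper row
                indices.push_back(row1 + x);   // vertex on the lower row
            }
            // (primitive restart or degenerate indices between rows omitted)
        }
        return indices;
    }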
I'd like to process an image with CUDA. Each pixel's new value is calculated from the two neighboring pixels in its row. Would it make sense to use __shared__ memory for the pixel values, given that each value will be used only twice? And aren't tiles the wrong tool here, since a 2D tile doesn't match the 1D structure of the problem? My approach would be to run one thread per pixel and have each thread load its neighboring pixel values.
All currently supported CUDA architectures have caches.
From compute capability 3.5 onwards these are particularly efficient for read-only data (read-write data is only cached in L2, while the L1/texture path is reserved for read-only data). If you mark the pointer to the input data as const __restrict__, the compiler will most likely load it via the L1 texture cache. You can also force this by explicitly using the __ldg() builtin.
While it is possible to explicitly manage the reuse of data from neighboring pixels via shared memory, you will probably find this provides no benefit over just relying on the cache.
Of course, whether or not you use shared memory, you want to maximize the block size in the x-direction and use a blockDim.y of 1 for optimal access locality.
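As a minimal sketch of that advice (the kernel name and filter weights are made up for illustration; the question doesn't specify the actual per-pixel computation):

    // Each thread computes one output pixel from its row neighbors, relying
    // on the read-only data cache via const __restrict__ (or __ldg()).
    __global__ void filterRow(const float* __restrict__ in, float* out,
                              int width, int height)
    {
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y;                 // launch with blockDim.y == 1
        if (x >= width || y >= height) return;

        int i = y * width + x;
        float left  = (x > 0)         ? in[i - 1] : in[i];
        float right = (x < width - 1) ? in[i + 1] : in[i];
        out[i] = 0.25f * left + 0.5f * in[i] + 0.25f * right;  // example weights
    }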
Combine shared memory with coalesced memory accesses. All you need to do is ensure the image is stored row-wise, so that each block processes a chunk of a linear array. Because of data reuse (every pixel except the first and last in a row takes part in three calculations), it is beneficial to copy the values of all pixels the block will process into shared memory at the beginning of the kernel.
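A sketch of that shared-memory variant under the same assumptions as above (hypothetical kernel, same illustrative weights): each block stages its chunk of the row plus a one-pixel halo on each side, then every thread reads only from shared memory.

    __global__ void filterRowShared(const float* __restrict__ in, float* out,
                                    int width, int height)
    {
        extern __shared__ float tile[];     // blockDim.x + 2 floats
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y;
        if (y >= height) return;

        tile[threadIdx.x + 1] = in[y * width + min(x, width - 1)];  // coalesced bulk load
        if (threadIdx.x == 0)                                       // left halo
            tile[0] = in[y * width + max(x - 1, 0)];
        if (threadIdx.x == blockDim.x - 1)                          // right halo
            tile[blockDim.x + 1] = in[y * width + min(x + 1, width - 1)];
        __syncthreads();

        if (x < width)
            out[y * width + x] = 0.25f * tile[threadIdx.x]
                               + 0.5f  * tile[threadIdx.x + 1]
                               + 0.25f * tile[threadIdx.x + 2];
    }

It would be launched with a dynamic shared-memory size of (blockDim.x + 2) * sizeof(float), e.g. filterRowShared<<<dim3((width + 255) / 256, height), dim3(256, 1), (256 + 2) * sizeof(float)>>>(in, out, width, height). Whether this actually beats the cache-only version above is exactly the kind of thing to benchmark.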
Currently, in my rendering engine, I have one VBO per kind of mesh data (one VBO for vertices, one for normals, one for texture coordinates, one for tangents and one for bitangents), all bound together with a VAO.
I'm now thinking about changing the system to hold a single VBO containing all the mesh data (vertices, normals, etc.), but how much will I gain from this? I'm asking about both speed and flexibility (because I may not have all the data, and might provide only vertices and normals if my mesh isn't textured).
You'll be seeking to reduce overall memory bandwidth. If your buffer object contains all of your attributes interleaved together, then your entire array object references one single contiguous section of memory, which is much easier for the memory subsystem to cache. It's exactly the same principle as for CPUs: the more local and consecutive memory accesses are, the faster they're likely to be.
There is also a potential detriment: the general rule is that you should align each element to whichever is greater, the size of the element or four bytes, which leads to some wasted space. But the benefit almost always outweighs the detriment.
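For concreteness, a minimal sketch of such an interleaved layout (the struct and attribute locations are illustrative; assumes OpenGL 3+ generic vertex attributes and <cstddef> for offsetof):

    struct Vertex {
        float position[3];
        float normal[3];
        float texCoord[2];   // omit or zero-fill when the mesh isn't textured
        float tangent[3];
        float bitangent[3];
    };                       // 56 bytes; every float lands on a 4-byte boundary

    GLsizei stride = sizeof(Vertex);   // one stride for the whole stream
    glVertexAttribPointer(0, 3, GL_FLOAT, GL_FALSE, stride, (void*)offsetof(Vertex, position));
    glVertexAttribPointer(1, 3, GL_FLOAT, GL_FALSE, stride, (void*)offsetof(Vertex, normal));
    glVertexAttribPointer(2, 2, GL_FLOAT, GL_FALSE, stride, (void*)offsetof(Vertex, texCoord));
    // ...same pattern for tangent and bitangent...
    for (GLuint i = 0; i <= 2; ++i) glEnableVertexAttribArray(i);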
Obviously the only thing that will be affected is the time the GPU takes to fetch vertices. If you're tessellation- or fill-bound, you won't immediately see any improvement.
I'm writing a particle simulation whose logic is updated using Intel AVX. I use an SoA approach to maximize SIMD-friendliness, but I currently shuffle the particle position components into XYZ format when updating the vertex buffer.
Is it possible to skip the shuffle step, pass the vertex data in XXYYZZ format, and construct each vertex in a shader stage?
My first thought was to use three vertex buffers with the x, y and z components separated, and to construct each vertex by using the same index to access its x, y and z components.
I'm aware that this is not the conventional way, but I would like to emphasize that this is just an experiment. Does anyone have knowledge about this approach (if it is even possible), or could point me in the right direction? Perhaps there is a name for it as well?
Thank you!
There is no restriction on how you feed the GPU your vertices. You can customize the input layout to read values from any number of vertex buffers; in your example you would have at least three elements. In the vertex shader you receive the three elements as three scalars and swizzle them back together. The only real limitation is that the values belonging to one vertex must sit at the same index in each buffer.
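A minimal OpenGL-flavored sketch of the idea (buffer and attribute names are mine; the same setup works with a Direct3D input layout):

    const char* vs =
        "#version 330 core\n"
        "layout(location = 0) in float px;   // from the X buffer\n"
        "layout(location = 1) in float py;   // from the Y buffer\n"
        "layout(location = 2) in float pz;   // from the Z buffer\n"
        "void main() { gl_Position = vec4(px, py, pz, 1.0); }\n";

    GLuint bufs[3] = { xBuf, yBuf, zBuf };   // one tightly packed float array each
    for (GLuint i = 0; i < 3; ++i) {
        glBindBuffer(GL_ARRAY_BUFFER, bufs[i]);
        glVertexAttribPointer(i, 1, GL_FLOAT, GL_FALSE, 0, (void*)0);
        glEnableVertexAttribArray(i);        // index i in each buffer = same vertex
    }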
As for performance: unless you are after the top 1% of what the GPU can do, you will see no difference compared to a well-interleaved vertex layout. The layout mostly influences bandwidth and L2 cache misses, so unless you have many millions of particles it is unlikely to matter. And if you do, you can use a compute shader to interleave the data in a pre-pass.
I am implementing a voxel raycaster in OpenGL 4.3.0. I have got a basic version going where I store a 256x256x256 voxel data set of float values in a 3D texture of the same dimensions.
However, I want to make a LOD scheme using an octree. I have the data stored in a 1D array on the host side. The root has index 0, the root's children have indices 1-8, the next level has indices 9-72, and so on. The octree has 9 levels in total (where the last level has the full 256x256x256 resolution). Since the octree is always full, the structure is implicit and there is no need to store pointers, just the one float value per voxel. I have the 1D indexing and traversal algorithms all set.
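For concreteness, the implicit indexing described here can be written as (a sketch; the helper names are illustrative):

    // level L starts at (8^L - 1) / 7: 0, 1, 9, 73, ...
    uint64_t levelStart(unsigned level) {
        return ((1ull << (3 * level)) - 1) / 7;
    }
    // children of node i occupy indices 8*i + 1 through 8*i + 8
    uint64_t childIndex(uint64_t node, unsigned octant) {  // octant in [0, 8)
        return 8 * node + 1 + octant;
    }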
My problem is that I don't know how to store this in a texture. GL_MAX_TEXTURE_SIZE is way too small (16384) for the 1D-array approach for which I have figured out the indexing. I need to store this in a 3D texture, and I don't know what will happen when I try to cram my 1D array in there, nor how to choose a size and a 1D-to-3D conversion scheme that doesn't waste space or time.
My question is if someone has a good strategy for storing this whole octree structure in one 3D texture, and in that case how to choose dimensions and indexing for it.
First some words on porting your 1D-array solution directly:
First of all, as Mortennobel says in his comment, the max texture size is very likely not 3379; that's just the enum value of GL_MAX_TEXTURE_SIZE (how could the OpenGL header that defines this constant know your hardware and driver limits?). To get the actual value from your implementation, use int size; glGetIntegerv(GL_MAX_TEXTURE_SIZE, &size);. But even then this might be too small for you (maybe 8192 or something similar).
But to get much larger 1D arrays into your shaders, you can use buffer textures (core since OpenGL 3.1, and therefore present on DX10-class hardware). Those are textures sourcing their data from standard OpenGL buffer objects. But these textures are always 1D, are accessed by integer texture coordinates (array indices, so to say), and are not filtered. So they are effectively not really textures, but a way to access a buffer object as a linear 1D array inside a shader, which is a perfect fit for your needs (and in fact a much better fit than a normal filtered and normalized 1D texture).
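A minimal sketch of setting one up (variable names like octreeData and nodeCount are placeholders):

    GLuint buf, tex;
    glGenBuffers(1, &buf);
    glBindBuffer(GL_TEXTURE_BUFFER, buf);
    glBufferData(GL_TEXTURE_BUFFER, nodeCount * sizeof(float),
                 octreeData, GL_STATIC_DRAW);

    glGenTextures(1, &tex);
    glBindTexture(GL_TEXTURE_BUFFER, tex);
    glTexBuffer(GL_TEXTURE_BUFFER, GL_R32F, buf);   // one float per octree node
    // GLSL: float v = texelFetch(octreeTex, int(index)).r;  (samplerBuffer)

For the full nine-level octree that is (8^9 - 1) / 7 = 19,173,961 floats, about 73 MB, which should fit under GL_MAX_TEXTURE_BUFFER_SIZE on typical desktop implementations, though it's worth querying.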
EDIT: You might also think about using a straightforward 3D texture like you did before, but with homemade mipmap levels (yes, a 3D texture can have mipmaps, too) for the higher parts of the hierarchy. So mipmap level 0 is the fine 256 grid, level 1 contains the coarser 128 grid, and so on. But to work with this data structure effectively, you will probably need explicit LOD texture access in the shader (using textureLod or, even better given the lack of filtering, texelFetch), which also requires OpenGL 3.
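A sketch of that mapping (assumes OpenGL 4.2+ for glTexStorage3D; octreeLevelData is a placeholder for your per-level host arrays):

    GLuint tex;
    glGenTextures(1, &tex);
    glBindTexture(GL_TEXTURE_3D, tex);
    glTexStorage3D(GL_TEXTURE_3D, 9, GL_R32F, 256, 256, 256);  // levels 0..8
    for (int level = 0; level < 9; ++level) {
        int dim = 256 >> level;                                // 256, 128, ..., 1
        glTexSubImage3D(GL_TEXTURE_3D, level, 0, 0, 0, dim, dim, dim,
                        GL_RED, GL_FLOAT, octreeLevelData[level]);
    }
    // GLSL: float v = texelFetch(volumeTex, ivec3(x, y, z), level).r;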
EDIT: If you don't have support for OpenGL 3, I would still not suggest to use 3D textures to put your 1D array into, but rather 2D textures, like Rahul suggests in his answer (the 1D-2D index magic isn't really that hard). But if you have OpenGL 3, then I would either use buffer textures for using your linear 1D array layout directly, or a 3D texture with mipmaps for a straight-forward octree mapping (or maybe come up with a completely different and more sophisticated data structure for the voxel grid in the first place).
EDIT: Of course a fully subdivided octree doesn't really use the memory-saving features of octrees to their advantage. For a more dynamic and memory-efficient way of packing octrees into 3D textures, you might also take some inspiration from this classic GPU Gems article on octree textures. They basically store all octree cells as 2x2x2 grids placed arbitrarily in a 3D texture, using the internal nodes' values as pointers to their children in this texture. Of course nowadays you can employ all sorts of refinements on this (since it seems you want the internal nodes to store data, too), like storing integers alongside floats and using clever bit encodings, but the basic idea is pretty simple.
Here's a solution sketch/outline:
Use a 2D texture to store your 256x256x256 data set (it'll be 4096x4096; I hope you're using an OpenGL platform that supports 4k x 4k textures).
Now store your 1D data in row-major order. Inside your raycaster, simply do a row/column conversion (from the 1D address to 4k x 4k) and look up the value you need.
I trust that you will figure out the rest yourself :)
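The rest is mostly the address math, which might look like this (a sketch mirroring what the shader would do; dataTex and i are illustrative names):

    // linear index i in [0, 256^3) maps to texel (i % 4096, i / 4096)
    int texelX = i & 4095;   // i % 4096, since 4096 is a power of two
    int texelY = i >> 12;    // i / 4096
    // GLSL: float v = texelFetch(dataTex, ivec2(i & 4095, i >> 12), 0).r;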
Between a TriangleStrip and a TriangleList, which one performs faster?
Something interesting I've just read says: "My method using triangle list got about 780fps, and the one with triangle strip only got 70fps". I don't have details as to what exactly he is making, but according to this he's getting about 10 times the frame rate using a TriangleList. I find this counter-intuitive because the list contains more vertex data.
Does anyone know a technical reason why the TriangleList might be so much faster than a Strip?
Triangle strips are a memory optimization, not a speed optimization. At some point in the past, when bus bandwidth between system memory and video memory was the main bottleneck in data-intensive applications, strips would also have saved time, but this is very rarely the case anymore. The transform cache was also very small in old hardware, so an ordinary strip would cache better than a badly optimized indexed list.
The reason a triangle list can be equally or more efficient than a triangle strip is indices. Indices let the hardware transform and cache vertices in a very predictable fashion, given that you are optimizing your geometry and triangle order correctly. Also, for a very complex mesh requiring a lot of degenerate triangles, strips will be both slower and take more memory than an indexed list.
I must say I'm a little surprised that your example shows an order of magnitude difference though.
A triangle list can be much faster than a strip because it saves draw calls by batching the vertex data together easily. Draw calls are expensive, so the memory you save by using a strip is sometimes not worth the decreased performance.
Indexed triangle lists will generally win.
Here's a simple rule: count the number of vertices you will be uploading to the graphics card. If the triangle list (indexed triangle list, to be precise) has fewer vertices than the same data as a triangle strip, it will likely run faster.
If the vertex counts are very close in both cases, the strip could run faster because it doesn't have the overhead of the index list, but I expect that is also driver-specific.
Non-indexed triangle lists are almost always the worst case (3 vertices per triangle, no sharing), unless you are dealing with disjoint quads, which also cost 6 vertices per quad with degenerate stripping. In that case an indexed triangle list gets each quad for 4 vertices, so it probably wins again, but you'd want to test on your target hardware.
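To make the counting rule concrete, a toy tally for n disjoint quads (pure arithmetic, no API assumptions):

    // vertices processed for n disjoint quads (illustrative counting only)
    int stripVerts(int n) { return 6 * n; }  // 4 verts + 2 degenerates per quad
    int listVerts(int n)  { return 4 * n; }  // 4 unique verts, 6 indices per quad
    // For n = 1000: ~6000 verts as a strip vs ~4000 verts plus 6000 16-bit
    // indices as an indexed list; weigh the index bytes against your vertex size.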