I'm using an octree of axis-aligned bounding boxes to partition the space in my scene, where I run a physics simulation. The problem is that the scene is very large (space), and I need to detect collisions between large objects at large distances as well as small objects at close range. There are only a few objects in the scene, but they are kilometers apart, which means a lot of empty space. So I'm effectively wasting 2 GB of RAM storing bounding boxes for empty sectors. I'd like to allocate memory only for the sectors that actually contain something (i.e. store pointers to AABBs), but that would mean thousands of allocations each frame to rebuild the octree. If I use a pool to counter the slowdown from allocations, I'm still reserving 2 GB of RAM for my application. Is there any other way to achieve this?
Look into Loose Octrees (for dealing with many objects) or a more adaptive system such as AABB-trees built around each object rather than one for the entire space. You can perform general distance/collision using the overall AABB (the root) and get finer collisions using the tree under each object (and eventually a ray-triangle intersection test if you need that fine a resolution). The only disadvantage with AABB-trees is that if the object rotates you need to rebuild the tree (you can adaptively scale and translate the AABB-tree).
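A related alternative for mostly-empty space is to hash only the occupied sectors instead of storing a dense tree. Here is a minimal sketch of that idea; the names (`SectorKey`, `SparseGrid`) are illustrative, not from any particular engine:

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <unordered_map>
#include <vector>

// Key identifying one sector of the world by its integer grid coordinates.
struct SectorKey {
    int32_t x, y, z;
    bool operator==(const SectorKey& o) const {
        return x == o.x && y == o.y && z == o.z;
    }
};

struct SectorKeyHash {
    size_t operator()(const SectorKey& k) const {
        // Simple hash combine; good enough for a sketch.
        size_t h = std::hash<int32_t>()(k.x);
        h = h * 31 + std::hash<int32_t>()(k.y);
        h = h * 31 + std::hash<int32_t>()(k.z);
        return h;
    }
};

class SparseGrid {
public:
    explicit SparseGrid(double sectorSize) : size_(sectorSize) {}

    // Insert an object id at a world position; the sector is allocated
    // lazily, so empty space costs no memory at all.
    void insert(int objectId, double x, double y, double z) {
        sectors_[keyFor(x, y, z)].push_back(objectId);
    }

    // Objects sharing the sector containing (x, y, z), or null if empty.
    const std::vector<int>* query(double x, double y, double z) const {
        auto it = sectors_.find(keyFor(x, y, z));
        return it == sectors_.end() ? nullptr : &it->second;
    }

    size_t allocatedSectors() const { return sectors_.size(); }

private:
    SectorKey keyFor(double x, double y, double z) const {
        return { (int32_t)std::floor(x / size_),
                 (int32_t)std::floor(y / size_),
                 (int32_t)std::floor(z / size_) };
    }
    double size_;
    std::unordered_map<SectorKey, std::vector<int>, SectorKeyHash> sectors_;
};
```

With kilometers of empty space between a handful of objects, memory use scales with the number of occupied sectors rather than the volume of the world; broad-phase queries only need to look at a sector and its neighbors.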
I have a VBO of 1,050,625 vertices representing a height map. I draw the mesh with GL_TRIANGLE_STRIP in frustum-culled chunks of 32*32 cells, using indexed rendering.
Should I care about how my vertices are aligned in the VBO in terms of performance? I mean is there any information about how distance between different elements affects performance, like: [100,101,102] or [10,1017,2078]?
The distance between indices affects which memory locations are read. The effect comes down to caching: if a vertex is not already in the cache, it must be fetched from main memory.
At least in theory. In practice it depends on the hardware and the driver implementation; cache size and bus speed both have an influence.
As a starting point, anything below a few MB in size should be among the fastest options.
In any case, when performance matters, the only reliable way to know is to benchmark the different options, on different hardware if possible.
Lately I have been trying to create a 2D platformer engine in C++ with Direct2D. The problem I am currently having is getting objects that are resting against each other to interact correctly after accelerations such as gravity have been applied to them.
Right now I can detect collisions and respond to them correctly (I think), and when objects collide they remember what other objects they're resting against, so objects can be pushed by other objects (note that there is no bounce in any collisions, so when objects collide they are guaranteed to come to rest against each other until something else happens). Every time the simulation advances, the acceleration for objects is applied to their velocities (for example vx += ax * t, where t is the time elapsed since the last advancement).
After these accelerations are applied, I want to check if any objects that are resting against each other are moving at different speeds than their counterparts (as different objects can have different accelerations) and depending on that difference either unlink the two objects so they are no longer resting, or even out their velocities so they are moving at the same speed once again. I am having trouble creating an algorithm that can do this across many resting objects.
Here's a diagram to help explain my problem
http://i.imgur.com/cYYsWdE.png
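One way to approach the "even out their velocities" step is to treat mutually-resting objects as connected components and give each component a single momentum-conserving velocity, which matches the no-bounce (perfectly inelastic) assumption. This is a sketch under that assumption, not the question's actual code; unlinking objects that accelerate apart is left out:

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

struct Body {
    double mass;
    double vx;       // velocity after "vx += ax * t" has been applied
    int group = -1;  // filled in by a flood fill over resting links
};

// links[i] lists the indices of bodies resting against body i (symmetric).
void resolveRestingGroups(std::vector<Body>& bodies,
                          const std::vector<std::vector<size_t>>& links) {
    // 1. Flood-fill connected components of the resting graph.
    int groupCount = 0;
    for (size_t i = 0; i < bodies.size(); ++i) {
        if (bodies[i].group != -1) continue;
        std::vector<size_t> stack{i};
        bodies[i].group = groupCount;
        while (!stack.empty()) {
            size_t cur = stack.back(); stack.pop_back();
            for (size_t nb : links[cur]) {
                if (bodies[nb].group == -1) {
                    bodies[nb].group = groupCount;
                    stack.push_back(nb);
                }
            }
        }
        ++groupCount;
    }
    // 2. Give each group one momentum-conserving velocity (perfectly
    //    inelastic merge, matching the "no bounce" rule above).
    std::vector<double> momentum(groupCount, 0.0), mass(groupCount, 0.0);
    for (const Body& b : bodies) {
        momentum[b.group] += b.mass * b.vx;
        mass[b.group] += b.mass;
    }
    for (Body& b : bodies)
        b.vx = momentum[b.group] / mass[b.group];
}
```

In a full engine you would do this per contact axis and first test whether the relative velocity separates the pair (in which case you unlink instead of merging), but the flood-fill-then-average structure is the core of handling chains of many resting objects at once.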
Imagine a typical game where objects in the simulated world are created and destroyed. When these objects are created, their vertex data is stored in a VBO. This VBO is rendered once per frame.
Is there a best practice for dealing with dead objects? I.e. when the object is destroyed and thus no longer needs to be rendered, what should happen to its corresponding VBO data?
It seems like you'd want to "free" that memory up for future use by other objects. Otherwise, your VBO would eventually be filled almost entirely with dead data.
I have one possible idea for implementing this: a map of VBO memory wherein individual bytes are marked as free or in use. (This map would live on the CPU as a normal array, not on the GPU.) When an object is created, we buffer its data to a free region as determined by the map. We mark that region as used on the map. Then when the object is destroyed, we mark that same region as free. I'm thinking you would store the map either as an array of booleans if you're lazy, or pack it in as one map bit per VBO byte if you want to do it right.
So far, does this sound like the best approach? Is there a more common approach that I'm not seeing?
I know a lot of these questions hinge on the characteristics of the scene you're rendering, so here's the context. My scene consists of several hundred objects. Each object has about eight vertices. Each vertex has a position and texture coordinate stored as floats. So, we're looking at approximately:
4 bytes per float * 6 floats per vert * 8 verts per object * 500 objects
= 96,000 bytes of vertex data
Sounds like you're thinking of using a pool allocator. There's a lot of existing work done on those, which should apply quite well to allocations inside a VBO also.
It will be pretty straightforward if all elements are the same size. Otherwise, you need to be concerned about fragmentation, but heap managers are quite well known.
The simplest improvement I would offer is to start your scan for a free slot from the last slot filled, instead of always from the beginning.
You can trade space for speed by using a deque-style data structure to store a list of free locations, which eliminates the need to scan for a free spot.
The size of the data stored in the VBO really has no impact on the manager, only the number of slots which can be individually repurposed.
Between a TriangleStrip and a TriangleList, which one performs faster?
Something interesting I've just read says: "My method using triangle list got about 780fps, and the one with triangle strip only got 70fps". I don't have details as to what exactly he is making, but according to this he's getting about 10 times the frame rate using a TriangleList. I find this counter-intuitive because the list contains more vertex data.
Does anyone know a technical reason why the TriangleList might be so much faster than a Strip?
Triangle strips are a memory optimization, not a speed optimization. At some point in the past, when bus bandwidth between system memory and video memory was the main bottleneck in data-intensive applications, they would also save time, but that is very rarely the case anymore. Also, the transform cache was very small on old hardware, so an ordinary strip would cache better than a badly optimized indexed list.
The reason a triangle list can be equally or more efficient than a triangle strip is indices. Indices let the hardware transform and cache vertices in a very predictable fashion, provided you optimize your geometry and triangle order correctly. Also, for a very complex mesh requiring a lot of degenerate triangles, strips will be both slower and take more memory than an indexed list.
I must say I'm a little surprised that your example shows an order of magnitude difference though.
A triangle list can be much faster than a strip because it saves draw calls by batching the vertex data together easily. Draw calls are expensive so the memory you save by using a strip is sometimes not worth the decreased performance.
Indexed triangle lists will generally win.
Here's a simple rule. Count the number of vertices you will be uploading to the graphics card. If the triangle list (indexed triangle list, to be precise) has fewer vertices than the same data as a triangle strip, then it will likely run faster.
If the number of vertices is very close in both cases, then the strip may run faster because it doesn't have the overhead of the index list, but I expect that is also driver specific.
Non-indexed triangle lists are almost always the worst case (3 verts per triangle, no sharing), unless you are dealing only with disjoint quads, which also cost 6 verts per quad using degenerate stripping. In that case you get each quad for 4 verts with indexed triangle lists, so the list probably wins again, but you'd want to test on your target hardware.
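The counting rule above can be sketched for a W x H grid of quads (e.g. a heightmap chunk). This is pure arithmetic for illustration, not a GPU measurement; the strip layout assumed here is one strip per row of quads, joined with two degenerate vertices:

```cpp
#include <cassert>
#include <cstddef>

// Unique vertices uploaded for an indexed triangle list over a grid:
// every vertex is shared via the index buffer.
size_t indexedListVertices(size_t w, size_t h) {
    return (w + 1) * (h + 1);
}

// Index count for that list: two triangles (6 indices) per quad.
size_t indexedListIndices(size_t w, size_t h) {
    return w * h * 6;
}

// Vertices uploaded for a non-indexed strip: one strip per row of
// quads (2*(w+1) verts each), rows joined by 2 degenerate verts.
size_t stripVertices(size_t w, size_t h) {
    if (h == 0) return 0;
    size_t perRow = 2 * (w + 1);
    return h * perRow + (h - 1) * 2;
}
```

For a 32*32 chunk the indexed list uploads 1089 unique vertices (plus 16-bit indices, which are cheap), while the strip uploads 2174 full vertices, so by the rule above the indexed list should win.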
I would like to know what the best practice for efficiently storing (and subsequently accessing) sets of multi-dimensional data arrays with variable length. The focus is on performance, but I also need to be able to handle changing the length of an individual data set during runtime without too much overhead.
Note: I know this is a somewhat lengthy question, but I have looked around quite a lot and could not find a solution or example which describes the problem at hand with sufficient accuracy.
Background
The context is a computational fluid dynamics (CFD) code that is based on the discontinuous Galerkin spectral element method (DGSEM) (cf. Kopriva (2009), Implementing Spectral Methods for Partial Differential Equations). For the sake of simplicity, let us assume a 2D data layout (it is in fact in three dimensions, but the extension from 2D to 3D should be straightforward).
I have a grid that consists of K square elements k (k = 0,...,K-1) that can be of different (physical) sizes. Within each grid element (or "cell") k, I have N_k^2 data points. N_k is the number of data points in each dimension, and can vary between different grid cells.
At each data point n_k,i (where i = 0,...,N_k^2-1) I have to store an array of solution values, which has the same length nVars in the whole domain (i.e. everywhere), and which does not change during runtime.
Dimensions and changes
The number of grid cells K is of O(10^5) to O(10^6) and can change during runtime.
The number of data points N_k in each grid cell is between 2 and 8 and can change during runtime (and may be different for different cells).
The number of variables nVars stored at each grid point is around 5 to 10 and cannot change during runtime (it is also the same for every grid cell).
Requirements
Performance is the key issue here. I need to be able to regularly iterate in an ordered fashion over all grid points of all cells in an efficient manner (i.e. without too many cache misses). Generally, K and N_k do not change very often during the simulation, so for example a large contiguous block of memory for all cells and data points could be an option.
However, I do need to be able to refine or coarsen the grid (i.e. delete cells and create new ones, the new ones may be appended to the end) during runtime. I also need to be able to change the approximation order N_k, so the number of data points I store for each cell can change during runtime as well.
Conclusion
Any input is appreciated. If you have experience yourself, or just know a few good resources that I could look at, please let me know. However, while the solution will be crucial to the performance of the final program, it is just one of many problems, so the solution needs to be of an applied nature and not purely academic.
Should this be the wrong venue to ask this question, please let me know what a more suitable place would be.
Often, these sorts of dynamic mesh structures can be very tricky to deal with efficiently, but in block-structured adaptive mesh refinement codes (common in astrophysics, where complex geometries aren't important) or in your spectral element code, where you have large block sizes, it is often much less of an issue. You have so much work to do per block/element (with at least 10^5 cells x 2 points/cell in your case) that the cost of switching between blocks is comparatively minor.
Keep in mind, too, that you generally can't do much of the hard work on an element or block until a substantial amount of that block's data is already in cache. You're already going to have flushed most of block N's data out of cache before getting much work done on block N+1 anyway. (There might be some operations in your code which are exceptions to this, but those are probably not the ones where you're spending much time anyway, cache or no cache, because there's not a lot of data reuse - e.g., elementwise operations on cell values.) So keeping the blocks/elements next to each other isn't necessarily a huge deal; on the other hand, you definitely want each block/element to be itself contiguous.
Also notice that you can move blocks around to keep them contiguous as things get resized, but not only are all those memory copies also going to wipe your cache, but the memory copies themselves get very expensive. If your problem is filling a significant fraction of memory (and aren't we always?), say 1GB, and you have to move 20% of that around after a refinement to make things contiguous again, that's .2 GB (read + write) / ~20 GB/s ~ 20 ms compared to reloading (say) 16k cache lines at ~100ns each ~ 1.5 ms. And your cache is trashed after the shuffle anyway. This might still be worth doing if you knew that you were going to do the refinement/derefinement very seldom.
But as a practical matter, most adaptive mesh codes in astrophysical fluid dynamics (where I know the codes well enough to say) simply maintain a list of blocks and their metadata and don't worry about their contiguity. YMMV of course. My suggestion would be - before spending too much time crafting the perfect data structure - to first just test the operation on two elements, twice; the first, with the elements in order and computing on them 1-2, and the second, doing the operation in the "wrong" order, 2-1, and timing the two computations several times.
For each cell, store the offset in which to find the cell data in a contiguous array. This offset mapping is very efficient and widely used. You can reorder the cells for cache reuse in traversals. When the order or number of cells changes, create a new array and interpolate, then throw away the old arrays. This storage is much better for external analysis because operations like inner products in Krylov methods and stages in Runge-Kutta methods can be managed without reference to the mesh. It also requires minimal memory per vector (e.g. in Krylov bases and with time integration).
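A minimal sketch of that offset-mapped layout, in 2D as in the question (names like `Mesh` and `build` are illustrative): all solution values live in one contiguous array, and each cell stores only its starting offset and its N_k. On refinement you build a new array with `build` and interpolate into it, then discard the old one.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Mesh {
    size_t nVars = 0;            // values per point; fixed at runtime
    std::vector<size_t> offset;  // start of cell k's data in `data`
    std::vector<size_t> nk;      // points per dimension for cell k
    std::vector<double> data;    // all solution values, contiguous

    // Lay out storage for the given per-cell N_k values.
    void build(const std::vector<size_t>& pointsPerDim) {
        nk = pointsPerDim;
        offset.resize(nk.size());
        size_t total = 0;
        for (size_t k = 0; k < nk.size(); ++k) {
            offset[k] = total;
            total += nk[k] * nk[k] * nVars;  // 2D: N_k^2 points per cell
        }
        data.assign(total, 0.0);
    }

    // Pointer to the nVars solution values at point i of cell k.
    double* point(size_t k, size_t i) {
        return &data[offset[k] + i * nVars];
    }
};
```

Iterating over all points of all cells is then a single linear sweep over `data`, which is the cache-friendly access pattern you want, and vector-wide operations (inner products, Runge-Kutta stages) can ignore the mesh entirely and work on `data` directly.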