I'm creating a game engine using C++ and SFML. I have a class called character that will be the base for entities within the game. The physics class is also going to handle character movement.
My question is: is it faster to keep a vector of pointers to the characters that moved during a frame? Whenever a function moves a character, it places a pointer to that character in the vector; after the physics class has processed the vector, it gets cleared.
Or is it faster to have a bool variable that gets set to true whenever a function moves a character and then have an if statement inside my physics class that tests every character for movement?
EDIT:
OK, I've gone with a different approach where a function inside the Physics class is responsible for character movement. Immediately upon movement it performs collision detection, and if a collision happens it stops the movement in that direction.
Thanks for your help guys
Compared to all the other stuff that is going on in your program (physics, graphics), this will not make a difference. Use the method that makes programming easier because you will not notice a runtime difference at all.
If the total number of characters is relatively small, then you won't notice the difference between the approaches.
Otherwise, if the number of characters is large and most of them move during a frame, the flag-inside-each-character approach seems more appropriate: even with a vector of moved characters you would traverse nearly all of them anyway, and on top of that you get the extra overhead of maintaining the vector.
If the number of characters is large but only a few of them move during a frame, the vector may be better, because it saves you from traversing characters that didn't move.
What counts as a small or large number depends on your application. You should test under which conditions you get better performance with either approach.
This would be the right time to quote Hoare, but I'll abstain. Generally, however, you should profile before you optimize (if, and only if, the time budget is not enough on the minimum spec hardware -- if your game runs at 60fps on the target hardware you will do nothing whatsoever).
It is much more likely that the actual physics calculations will be the limiting factor, not doing the "is this unit moving?" check. Also, it is much more likely that submitting draw calls will bite you rather than checking a few hundred or so units.
As an "obvious" thing, it appears to be faster to hold a vector of object pointers and only process the units that are actually moving. However, the "obvious" is not always correct. Iterating linearly over a greater number of elements can very well be faster than jumping around (due to cache). Again, if this part of your game is identified as the bottleneck (very unlikely) then you will have to measure which is better.
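To make the two candidates concrete, here is a minimal sketch of both (the Character and Physics types and the markMoved function are hypothetical names, not taken from the question):

#include <vector>

struct Character {
    float x = 0, y = 0;
    bool moved = false;   // used by the flag-based variant
};

class Physics {
public:
    // Variant 1: every character carries a flag; the physics pass scans all of them.
    void updateFlagged(std::vector<Character>& all) {
        for (Character& c : all) {
            if (c.moved) {
                resolve(c);
                c.moved = false;
            }
        }
    }

    // Variant 2: movement code registers movers in a "dirty list" that is
    // processed and cleared once per frame.
    void markMoved(Character& c) { moved_.push_back(&c); }

    void updateDirtyList() {
        for (Character* c : moved_)
            resolve(*c);
        moved_.clear();
    }

private:
    void resolve(Character& /*c*/) { /* collision handling, etc. */ }
    std::vector<Character*> moved_;
};

Either way, the interesting work happens inside resolve(); the bookkeeping around it is unlikely to show up in a profile.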
Related
How to avoid cache miss related to 1:N (pirateship-cannon) relationship in entity-component-system (ECS)?
For example, a PirateShip can have 1-100 cannons. (1:N)
Each cannon can detach/attach freely to any pirateship at any time.
For some reasons, PirateShip and Cannon should be entity.
Memory diagram
During the first time-steps, when ships/cannons are gradually created, the ECS memory looks very nice:
Image note:
Left = low address, right = high address
Although there seem to be gaps, ShipCom and CannonCom are actually compact arrays.
It is really fast to access cannon information from a ship and vice versa (pseudo code):
Ptr<ShipCom> shipCom = ....;
EntityPtr ship = shipCom; // implicit conversion
Array<EntityPtr> cannons = getAllCannonFromShip(ship);
for (auto cannon : cannons) {
    Ptr<CannonCom> cannonCom = cannon;
    // cannonCom-> ....
}
Problem
In later time-steps, some ships/cannons are randomly created/destroyed.
As a result, the Entity, ShipCom and CannonCom arrays have gaps scattered around.
When I want to allocate them, I will get a random memory block from the pool.
EntityPtr ship = ..... ; // an old existing ship
EntityPtr cannon = createNewEntity();
Ptr<CannonCom> cannonCom = createNew<CannonCom>(cannon);
attach_Pirate_Cannon(ship, cannon);
// ^ the ship entity and the cannon tend to end up at very different (far-apart) addresses
Thus, the "really fast" code above becomes a bottleneck. (I profiled.)
(Edit) I believe the underlying cache misses also occur because cannons belonging to the same ship end up at scattered addresses.
For example (where # is the address of a turret component):
ShipA has turret#01 to turret#49
ShipB has turret#50 to turret#99
In later time-steps, if turret#99 is moved to ShipA, it becomes:
ShipA has turret#01 to turret#49 + turret#99 (mem jump)
ShipB has turret#50 to turret#98
Question
What is a design pattern / C++ magic to reduce cache miss from frequently-used relation?
More information:
In the real case there are a lot of 1:1 and 1:N relationships. A given relationship binds a specific type of component to another specific type of component.
For example, relation_Pirate_Cannon = (ShipCom:CannonCom), relation_physic_graphic = (PhysicCom:GraphicCom)
Only some of the relations are traversed "indirectly" this often.
Current architecture has no limit on amount of Entity/ShipCom/CannonCom.
I don't want to restrict it at the beginning of the program.
I prefer an improvement that does not make game-logic coding harder.
The first solution that comes to my mind is to enable relocation, but I believe it is a last-resort approach.
A possible solution is to add another layer of indirection. It slows things down a bit, but it helps keep your arrays compact and could speed up the whole thing. Profiling is the only way to know if it really helps.
That being said, how to do that?
Here is a brief introduction to sparse sets; it's worth reading before proceeding, to better understand what I'm saying.
Instead of creating relationships between items within the same array, use a second array to point to.
Let's call the two arrays reverse and direct:
reverse is accessed through the entity identifier (a number, thus just an index in the array). Each and every slot contains an index within the direct array.
direct is accessed, well... directly and each slot contains the entity identifier (that is an index to access the reverse array) and the actual component.
Whenever you add a cannon, get its entity identifier and the first free slot in the direct array. Set slot.entity with your entity identifier and put in reverse[entity] the index of the slot. Whenever you drop something, copy the last element in the direct array to keep it compact and adjust the indexes so that relationships hold up.
The ship will store the indexes to the outer array (reverse), so that you are free to switch back and forth things within the inner array (direct).
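Here is a minimal sketch of such a sparse-set-backed component pool, assuming hypothetical CannonPool, CannonCom and Entity types (the answer only prescribes the reverse/direct arrays, not these names):

#include <cstddef>
#include <cstdint>
#include <vector>

using Entity = std::uint32_t;

struct CannonCom { float angle = 0; /* ... */ };

class CannonPool {
public:
    // Add a component for entity e; reverse[e] remembers where it lives in 'direct'.
    void add(Entity e, CannonCom com) {
        if (e >= reverse.size()) reverse.resize(e + 1);
        reverse[e] = direct.size();
        direct.push_back({e, com});
    }

    // Remove by entity: move the last element into the freed slot to stay compact,
    // then fix the moved element's back-reference.
    void remove(Entity e) {
        std::size_t idx = reverse[e];
        direct[idx] = direct.back();
        reverse[direct[idx].entity] = idx;
        direct.pop_back();
    }

    CannonCom& get(Entity e) { return direct[reverse[e]].com; }

    // Systems iterate 'direct' linearly: compact and cache-friendly.
    auto begin() { return direct.begin(); }
    auto end()   { return direct.end(); }

private:
    struct Slot { Entity entity; CannonCom com; };
    std::vector<std::size_t> reverse;  // entity id -> index into 'direct'
    std::vector<Slot>        direct;   // compact array of (entity, component)
};

Because the ship only holds entity identifiers (indices into reverse), entries of direct can be shuffled freely without breaking the ship's references.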
What are advantages and disadvantages?
Well, whenever you access the cannons through the outer array, you have an extra jump because of the extra layer of indirection. Anyway, as long as you keep the number of accesses made this way low and your systems visit the direct array, you have a compact array to iterate over and the lowest number of cache misses.
How about sorting the entities to minimize cache misses?
Every time a cannon and/or ship is added/destroyed/moved, sort the entities.
I am not sure if this is feasible in your ECS system; it wouldn't be practical if you heavily depend on the entity indices, since they would change every time you sort.
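A rough sketch of that idea, assuming a hypothetical CannonCom that stores its owning ship's id (the answer itself does not specify any layout):

#include <algorithm>
#include <cstdint>
#include <vector>

struct CannonCom {
    std::uint32_t shipId;   // owning ship
    float angle;            // ... other cannon data
};

// Group cannons of the same ship next to each other so a per-ship traversal
// walks memory linearly. Note that this invalidates any stored indices into
// the array, which is exactly the practicality concern raised above.
void sortCannonsByShip(std::vector<CannonCom>& cannons) {
    std::stable_sort(cannons.begin(), cannons.end(),
                     [](const CannonCom& a, const CannonCom& b) {
                         return a.shipId < b.shipId;
                     });
}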
I'm writing a GPGPU program using GLSL shaders and am trying to come up with a few optimizations for an N-body collision detection algorithm. One is performing a 'quick' check to determine whether two objects are even in the same ballpark. The idea is to quickly disqualify lots of possibilities so that I only have to perform a more accurate collision test on a handful of objects. If the quick check decides there's a chance they might collide, the accurate check is performed.
The objects are circles (or spheres). I know the position of their center and their radius. The quick check will see if their square (or cube) bounding boxes overlap:
// make sure A is to the right of and above B
// code for that
if (A_minX > B_maxX) return false; // they definitely don't collide
if (A_minY > B_maxY) return false; // they definitely don't collide

if (length(A_position - B_position) <= A_radius + B_radius) {
    // they definitely do collide
    return true;
}
My question is whether the overhead of performing this quick check (making sure that A and B are in the right order, then checking whether their bounding boxes overlap) is going to be faster than calling length() and comparing that against their combined radii.
It'd be useful to know the relative computational cost of various math operations in GLSL, but I'm not quite sure how to discover them empirically or whether this information is already posted somewhere.
You can avoid using square roots (which are implicitly needed for the length() function) by comparing the squares of the values.
The test could then look like this:
vec3 vDiff = A_position - B_position;
float radSum = A_radius + B_radius;
if (dot(vDiff, vDiff) < radSum * radSum) {
    return true;
}
This reduces it back to a single test, but still uses only simple and efficient operations.
While we're on the topic of costs, you don't need two branches here. You can test the result of a component-wise comparison instead, so this could be collapsed into a single test using any(greaterThan(A_min, B_max)). A good compiler will probably figure this out, but it helps if you consider data parallelism yourself.
Costs are all relative. 15 years ago the arithmetic work necessary to do what length (...) does was such that you could do a cubemap texture lookup in less time - on newer hardware you'd be insane to do that because compute is quicker than memory.
To put this all into perspective, thread divergence can be more of a bottleneck than instruction or memory throughput. That is, if two of your shader invocations running in parallel take separate paths through the shader, you may introduce unnecessary performance penalties. The underlying hardware architecture means that things that were once a safe bet for optimization may not be in the future and might even cause your optimization attempt to hurt performance.
I'm coding a physics simulation and I'm now feeling the need to optimize it. I'm thinking about improving one point: one of the methods of one of my classes (which I call a billion times in several cases) defines a probability distribution every time. Here is the code:
void myClass::myMethod() { // called billions of times in several cases
    uniform_real_distribution<> probd(0, 1);
    uniform_int_distribution<> probh(1, h - 2);
    uniform_int_distribution<> probv(1, v - 2);
    // rest of the code
}
Could I make the distributions members of the class so that I won't have to define them every time, just initialize them in the constructor and redefine them when h and v change? Would that be a worthwhile optimization? And last question: could it be something that is already handled by the compiler (g++ in my case) when compiling with -O3 or -O2?
Thank you in advance!
Update: I coded it and timed both versions: the program actually turned out a bit slower (by a few percent), so I'm back where I started: creating the probability distributions every time.
Answer A: I shouldn't think so, for a uniform distribution it's just going to copy the parameter values into place, maybe with a small amount of arithmetic, and that will be well optimized.
However, I believe distribution objects can have state. They can use part of the random data from a call to the generator, and are permitted to save the rest of the randomness to use the next time the distribution is used, in order to reduce the total number of calls to the generator. So when you destroy a distribution object you might be discarding some possibly-costly random data.
Answer B: stop guessing and test it.
Time your code, then add static to the definition of probd and time it again.
Yes
Yes
Well, there may be some advantage, but AFAIK those objects aren't really heavyweight/expensive to construct. Also, with locals you may gain something in data locality and in assumptions the optimizer can make.
I don't think they are automatically moved as class variables (especially if your class is POD - in that case I doubt the compiler will dare to modify its layout); most probably, instead, they are completely optimized away - only the code of the called methods - in particular operator() - may remain, referring directly to h and v. But this must be checked by looking at the generated assembly.
Incidentally, if you have a performance problem, besides optimizing obvious points (non-optimal algorithms used in inner loops, continuous memory allocations, removing useless copies of big objects, ...) you should really try to use a profiler to find the real "hot spots" in your code, and concentrate to optimize them instead of going randomly through all the code.
uniform_real_distribution maintains a state of type param_type which is two double values (using default template parameters). The constructor assigns to these and is otherwise trivial, the destructor is trivial.
Therefore, constructing a temporary within your function has an overhead of storing 2 double values, as compared to initializing 1 pointer (or reference) or going through an indirection via this. In theory it might therefore be faster (though what appears to be faster, or what would seem to make sense to run faster, isn't necessarily any faster). Since it's not much work, it's certainly worth trying and timing whether there's a difference, even if it is a micro-optimization.
Some 3-4 extra cycles are normally negligible, but since you're saying "billions of times" it may of course very well make a measurable difference. 3 cycles times one billion is 1 second on a 3GHz machine.
Of course, optimization without profiling is always somewhat... awkward. You might very well find that a different part in your code that's called billions of times saves a lot more cycles.
EDIT:
Since you're not going to modify it, and since the first distribution is initialized with literal values, you might actually make it a constant (such as a constexpr or namespace level static const). That should, regardless of the other two, allow the compiler to generate the most efficient code in any case for that one.
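For concreteness, here is a minimal sketch of the member-distribution variant the question asks about (MyClass, the mt19937 member and setDimensions are hypothetical; the standard param() setter is used to re-parameterize the integer distributions when h or v change):

#include <random>

class MyClass {
public:
    MyClass(int h_, int v_)
        : h(h_), v(v_), probh(1, h_ - 2), probv(1, v_ - 2) {}

    // Redefine the parameters only when h or v actually change.
    void setDimensions(int h_, int v_) {
        h = h_;
        v = v_;
        probh.param(std::uniform_int_distribution<>::param_type(1, h - 2));
        probv.param(std::uniform_int_distribution<>::param_type(1, v - 2));
    }

    void myMethod() { // called billions of times
        double r = probd(gen); // in [0, 1)
        int    i = probh(gen); // in [1, h-2]
        int    j = probv(gen); // in [1, v-2]
        (void)r; (void)i; (void)j;
        // rest of the code
    }

private:
    int h, v;
    std::mt19937 gen{std::random_device{}()};
    std::uniform_real_distribution<> probd{0.0, 1.0};
    std::uniform_int_distribution<> probh;
    std::uniform_int_distribution<> probv;
};

As the question's update shows, whether this actually wins anything depends on the implementation; timing both versions is the only way to know.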
I would like to know what the best practice is for efficiently storing (and subsequently accessing) sets of multi-dimensional data arrays of variable length. The focus is on performance, but I also need to be able to handle changing the length of an individual data set during runtime without too much overhead.
Note: I know this is a somewhat lengthy question, but I have looked around quite a lot and could not find a solution or example which describes the problem at hand with sufficient accuracy.
Background
The context is a computational fluid dynamics (CFD) code that is based on the discontinuous Galerkin spectral element method (DGSEM) (cf. Kopriva (2009), Implementing Spectral Methods for Partial Differential Equations). For the sake of simplicity, let us assume a 2D data layout (it is in fact in three dimensions, but the extension from 2D to 3D should be straightforward).
I have a grid that consists of K square elements k (k = 0,...,K-1) that can be of different (physical) sizes. Within each grid element (or "cell") k, I have N_k^2 data points. N_k is the number of data points in each dimension, and can vary between different grid cells.
At each data point n_k,i (where i = 0,...,N_k^2-1) I have to store an array of solution values, which has the same length nVars in the whole domain (i.e. everywhere), and which does not change during runtime.
Dimensions and changes
The number of grid cells K is of O(10^5) to O(10^6) and can change during runtime.
The number of data points N_k in each grid cell is between 2 and 8 and can change during runtime (and may be different for different cells).
The number of variables nVars stored at each grid point is around 5 to 10 and cannot change during runtime (it is also the same for every grid cell).
Requirements
Performance is the key issue here. I need to be able to regularly iterate in an ordered fashion over all grid points of all cells in an efficient manner (i.e. without too many cache misses). Generally, K and N_k do not change very often during the simulation, so for example a large contiguous block of memory for all cells and data points could be an option.
However, I do need to be able to refine or coarsen the grid (i.e. delete cells and create new ones, the new ones may be appended to the end) during runtime. I also need to be able to change the approximation order N_k, so the number of data points I store for each cell can change during runtime as well.
Conclusion
Any input is appreciated. If you have experience yourself, or just know a few good resources that I could look at, please let me know. However, while the solution will be crucial to the performance of the final program, it is just one of many problems, so the solution needs to be of an applied nature and not purely academic.
Should this be the wrong venue to ask this question, please let me know what a more suitable place would be.
Often, these sorts of dynamic mesh structures can be very tricky to deal with efficiently, but in block-structured adaptive mesh refinement codes (common in astrophysics, where complex geometries aren't important) or your spectral element code where you have large block sizes, it is often much less of an issue. You have so much work to do per block/element (with at least 10^5 cells x 2 points/cell in your case) that the cost of switching between blocks is comparatively minor.
Keep in mind, too, that you generally can't do much of the hard work on each element or block until a substantial amount of that block's data is already in cache. You're already going to have had to flush most of block N's data out of cache before getting much work done on block N+1's anyway. (There might be some operations in your code which are exceptions to this, but those are probably not the ones where you're spending much time anyway, cache or no cache, because there's not a lot of data reuse - e.g., elementwise operations on cell values.) So keeping the blocks/elements beside each other isn't necessarily a huge deal; on the other hand, you definitely want the blocks/elements themselves to be contiguous.
Also notice that you can move blocks around to keep them contiguous as things get resized, but not only are all those memory copies also going to wipe your cache, but the memory copies themselves get very expensive. If your problem is filling a significant fraction of memory (and aren't we always?), say 1GB, and you have to move 20% of that around after a refinement to make things contiguous again, that's .2 GB (read + write) / ~20 GB/s ~ 20 ms compared to reloading (say) 16k cache lines at ~100ns each ~ 1.5 ms. And your cache is trashed after the shuffle anyway. This might still be worth doing if you knew that you were going to do the refinement/derefinement very seldom.
But as a practical matter, most adaptive mesh codes in astrophysical fluid dynamics (where I know the codes well enough to say) simply maintain a list of blocks and their metadata and don't worry about their contiguity. YMMV of course. My suggestion would be - before spending too much time crafting the perfect data structure - to first just test the operation on two elements, twice; the first, with the elements in order and computing on them 1-2, and the second, doing the operation in the "wrong" order, 2-1, and timing the two computations several times.
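A minimal sketch of that kind of A/B timing, with a placeholder kernel standing in for the real per-element work (everything below, including doWork, is hypothetical):

#include <chrono>
#include <cstdio>
#include <vector>

// Placeholder for the real per-element computation.
static void doWork(std::vector<double>& elem) {
    for (double& x : elem) x = x * 1.0001 + 0.5;
}

static double timeOrder(std::vector<double>& a, std::vector<double>& b, bool inOrder) {
    auto t0 = std::chrono::steady_clock::now();
    if (inOrder) { doWork(a); doWork(b); }  // elements processed 1-2
    else         { doWork(b); doWork(a); }  // "wrong" order, 2-1
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(t1 - t0).count();
}

int main() {
    std::vector<double> elem1(1 << 16, 1.0), elem2(1 << 16, 2.0);
    // Repeat several times so a single cold run doesn't dominate the comparison.
    for (int rep = 0; rep < 5; ++rep)
        std::printf("in order: %g s, reversed: %g s\n",
                    timeOrder(elem1, elem2, true),
                    timeOrder(elem1, elem2, false));
}

If the two orderings take essentially the same time, the placement of whole elements relative to each other is unlikely to matter much for your code.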
For each cell, store the offset at which to find the cell's data in a contiguous array. This offset mapping is very efficient and widely used. You can reorder the cells for cache reuse in traversals. When the order or number of cells changes, create a new array and interpolate, then throw away the old arrays. This storage is also much better for external analysis, because operations like inner products in Krylov methods and stages in Runge-Kutta methods can be handled without reference to the mesh. It also requires minimal memory per vector (e.g. in Krylov bases and with time integration).
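A minimal sketch of such an offset-based layout for variable-size cells (CellStorage and the field names are hypothetical; it is essentially a CSR-style index into one contiguous value array):

#include <cstddef>
#include <vector>

struct CellStorage {
    std::vector<double>      data;    // all solution values, contiguous
    std::vector<std::size_t> offset;  // offset[k] = start of cell k's values in data
    std::size_t              nVars;   // values per data point (fixed at runtime)

    // Pointer to the first of the nVars values of data point i in cell k.
    double* point(std::size_t k, std::size_t i) {
        return data.data() + offset[k] + i * nVars;
    }
};

// Rebuild the index after refinement/coarsening from the new per-cell point
// counts; interpolating the old values into the new layout is left to the caller.
CellStorage rebuild(const std::vector<std::size_t>& pointsPerCell, std::size_t nVars) {
    CellStorage s;
    s.nVars = nVars;
    s.offset.resize(pointsPerCell.size() + 1);
    s.offset[0] = 0;
    for (std::size_t k = 0; k < pointsPerCell.size(); ++k)
        s.offset[k + 1] = s.offset[k] + pointsPerCell[k] * nVars;
    s.data.resize(s.offset.back());
    return s;
}

Because offset is rebuilt only when the mesh changes and traversals just walk data linearly, iterating over all points of all cells stays cache-friendly even though each cell may have a different number of points.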
In this game I have 3 defense towers (the number is configurable) which fire a "bullet" every 3 seconds at 30 km/h. These defense towers have a radar and they only start firing when the player is under the tower's radar. That's not the issue.
My question is how to store the data for the gunfire. I'm not sure exactly what data I need for each bullet, but one thing that comes to mind is the position of the bullet, of course. Let's assume for now that I only need to store that (I already have a struct defined for a 3D point).
Should I try to figure out the maximum number of bullets the game can have at a particular point and declare an array of that size? Should I use a linked list? Or maybe something else?
I really have no idea how to do this. I don't need anything fancy or complex. Something basic that just works and it's easy to use and implement is more than enough.
P.S: I didn't post this question on the game development website (despite the tag) because I think it fits better here.
Generally, fixed length arrays aren't a good idea.
Given your game model, I wouldn't go for any data structure that doesn't allow O(1) removal. That rules out plain arrays anyway, and might suggest a linked list. However the underlying details should be abstracted out by using a generic container class with the right attributes.
As for what you should store:
Position (as you mentioned)
Velocity
Damage factor (your guns are upgradeable, aren't they?)
Maximum range (ditto)
EDIT: To slightly complicate matters, the STL containers always take copies of the elements put in them, so in practice, if any of the attributes might change over the object's lifetime, you'll need to allocate your structures on the heap and store (smart?) pointers to them in the collection.
I'd probably use a std::vector or std::list. Whatever's easiest.
Caveat: If you are coding for a very constrained platform (slow CPU, little memory), then it might make sense to use a plain-old fixed-size C array. But that's very unlikely these days. Start with whatever is easiest to code, and change it later if and only if it turns out you can't afford the abstractions.
I guess you can start off with std::vector<BulletInfo> and see how it works from there. It provides an array-like interface but is dynamically resizable.
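A minimal sketch of that starting point, assuming a hypothetical BulletInfo holding the attributes listed above, with swap-and-pop removal so that removing a bullet stays O(1) without resorting to a linked list:

#include <cstddef>
#include <vector>

struct Vec3 { float x, y, z; };  // stand-in for the existing 3D point struct

struct BulletInfo {
    Vec3  position;
    Vec3  velocity;
    float damage;
    float maxRange;
};

// Remove bullet i in O(1): overwrite it with the last bullet and pop the back.
// (The order of bullets is not preserved, which usually doesn't matter here.)
void removeBullet(std::vector<BulletInfo>& bullets, std::size_t i) {
    bullets[i] = bullets.back();
    bullets.pop_back();
}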
In instances like this I prefer a slightly more complex method of managing bullets. Since the number of bullets possible on screen is directly related to the number of towers, I would keep a small fixed-length array of bullets inside each tower class. Whenever a tower goes to fire a bullet, it searches through its array, finds an unused bullet, sets it up with a new position/velocity and marks it active.
The slightly more complex part is I like to keep a second list of bullets in an outside manager, say a BulletManager. When each tower is created the tower would add all its bullets to the central manager. Then the central manager can be in charge of updating the bullets.
I like this method because it easily allows me to manage memory constraints related to bullets: just tweak the 'number of active towers' and all of the bullets are created for you. You don't need to allocate bullets on the fly because they are all pooled, and you don't have just one central pool whose size you constantly need to change as you add/remove towers.
It does involve slightly more overhead, because there is a central manager with a list of pointers, and you need to be careful to always remove a destroyed tower's bullets from the central manager. But for me the benefits are worth it.
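A rough sketch of that pooling scheme (Tower, Bullet and BulletManager are hypothetical names; the answer only describes the idea):

#include <array>
#include <cstddef>
#include <vector>

struct Bullet {
    float x = 0, y = 0, z = 0;
    float vx = 0, vy = 0, vz = 0;
    bool  active = false;
};

class BulletManager {
public:
    // Towers register their pooled bullets once, at creation time.
    void registerBullet(Bullet* b) { bullets_.push_back(b); }

    // Must be called when a tower is destroyed, so no dangling pointers remain.
    void unregisterBullet(Bullet* b) {
        for (std::size_t i = 0; i < bullets_.size(); ++i) {
            if (bullets_[i] == b) {
                bullets_[i] = bullets_.back();
                bullets_.pop_back();
                return;
            }
        }
    }

    void update(float dt) {
        for (Bullet* b : bullets_) {
            if (!b->active) continue;
            b->x += b->vx * dt;
            b->y += b->vy * dt;
            b->z += b->vz * dt;
        }
    }

private:
    std::vector<Bullet*> bullets_;
};

class Tower {
public:
    explicit Tower(BulletManager& mgr) : mgr_(mgr) {
        for (Bullet& b : pool_) mgr_.registerBullet(&b);
    }
    ~Tower() {
        for (Bullet& b : pool_) mgr_.unregisterBullet(&b);
    }

    void fire(float vx, float vy, float vz) {
        for (Bullet& b : pool_) {
            if (!b.active) {           // reuse an inactive slot
                b.x = b.y = b.z = 0;
                b.vx = vx; b.vy = vy; b.vz = vz;
                b.active = true;
                return;
            }
        }
        // Pool exhausted; with a 3-second fire interval this shouldn't happen.
    }

private:
    std::array<Bullet, 8> pool_{};  // per-tower pool size is an assumption
    BulletManager&        mgr_;
};

One caveat: a Tower as sketched must not be copied or relocated after registering its pool, since the manager holds raw pointers into it; that mirrors the point above about removing a destroyed tower's bullets from the manager.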