Avoid cache misses related to a 1:N relationship in an entity-component-system - C++

How can I avoid cache misses related to a 1:N (pirate ship to cannon) relationship in an entity-component-system (ECS)?
For example, a PirateShip can have 1-100 cannons. (1:N)
Each cannon can detach from / attach to any pirate ship at any time.
For various reasons, both PirateShip and Cannon should be entities.
Memory diagram
During the first few time-steps, while ships and cannons are gradually created, the ECS memory layout looks very nice:
Image note:
Left = low address, right = high address
Although there appear to be gaps, ShipCom and CannonCom are actually compact arrays.
It is really fast to access cannon information from a ship and vice versa (pseudo-code):
Ptr<ShipCom> shipCom = ....;
EntityPtr ship = shipCom;  // implicit conversion
Array<EntityPtr> cannons = getAllCannonFromShip(ship);
for (auto cannon : cannons) {
    Ptr<CannonCom> cannonCom = cannon;
    // cannonCom-> ....
}
Problem
In later time-steps, some ships / cannons are randomly created and destroyed.
As a result, the Entity, ShipCom and CannonCom arrays have gaps scattered around.
When I allocate a new one, I get a random memory block from the pool.
EntityPtr ship = ..... (an old existing ship)
EntityPtr cannon = createNewEntity();
Ptr<CannonCom> cannonCom = createNew<CannonCom>(cannon);
attach_Pirate_Cannon(ship, cannon);
//^ the ship entity and the cannon tend to end up at very different (far-away) addresses
Thus, the "really fast" code above becomes a bottleneck. (I profiled it.)
(Edit) I believe the underlying cache misses also come from the scattered addresses of cannons that belong to the same ship.
For example (# is the address of a turret component):
ShipA has turret#01 to turret#49
ShipB has turret#50 to turret#99
In later time-steps, if turret#99 is moved to ShipA, it becomes:
ShipA has turret#01 to turret#49 + turret#99 (memory jump)
ShipB has turret#50 to turret#98
Question
What design pattern / piece of C++ magic can reduce cache misses from frequently-used relations?
More information:
In the real case, there are a lot of 1:1 and 1:N relationships. Each relationship binds one specific component type to another specific component type.
For example, relation_Pirate_Cannon = (ShipCom:CannonCom), relation_physic_graphic = (PhysicCom:GraphicCom)
Only some of the relations are accessed "indirectly" this often.
The current architecture has no limit on the number of Entity/ShipCom/CannonCom instances.
I don't want to impose such a limit at the start of the program.
I prefer an improvement that does not make game-logic coding harder.
The first solution that came to my mind is to enable relocation, but I believe it is a last-resort approach.

A possible solution is to add another layer of indirection. It slows things down a bit, but it helps keep your arrays compact and could speed up the whole thing. Profiling is the only way to know if it really helps.
That being said, how to do that?
Here is a brief introduction to sparse sets; it's worth reading before you proceed, to better understand what I'm saying.
Instead of creating relationships between items within the same array, use a second array to point into.
Let's call the two arrays reverse and direct:
reverse is accessed through the entity identifier (a number, thus just an index into the array). Each slot contains an index into the direct array.
direct is accessed, well... directly, and each slot contains the entity identifier (that is, an index into the reverse array) and the actual component.
Whenever you add a cannon, get its entity identifier and the first free slot in the direct array. Set slot.entity to your entity identifier and put the index of the slot into reverse[entity]. Whenever you drop something, copy the last element of the direct array into the freed slot to keep the array compact, and adjust the indexes so that the relationships hold up.
The ship will store the indexes to the outer array (reverse), so that you are free to switch back and forth things within the inner array (direct).
What are advantages and disadvantages?
Well, whenever you access the cannons through the outer array, you pay an extra jump because of the extra layer of indirection. Anyway, as long as you keep the number of accesses made this way low and your systems visit the direct array, you have a compact array to iterate and the lowest number of cache misses.
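A minimal sketch of the reverse/direct layout described above. It assumes the entity identifier is a plain integer index, and CannonCom is a stand-in for the question's component type; the real ECS types will differ:

#include <cstdint>
#include <vector>

using Entity = std::uint32_t;

struct CannonCom { /* cannon data */ };

struct CannonStorage {
    std::vector<std::uint32_t> reverse;   // indexed by entity, holds an index into 'direct'
    struct Slot { Entity entity; CannonCom com; };
    std::vector<Slot> direct;             // compact array that systems iterate

    void add(Entity e, const CannonCom& c) {
        if (reverse.size() <= e) reverse.resize(e + 1);
        reverse[e] = static_cast<std::uint32_t>(direct.size());
        direct.push_back({e, c});
    }

    void remove(Entity e) {
        std::uint32_t idx = reverse[e];
        direct[idx] = direct.back();           // keep 'direct' compact
        reverse[direct[idx].entity] = idx;     // fix the moved element's back-reference
        direct.pop_back();
    }

    CannonCom& get(Entity e) { return direct[reverse[e]].com; }
};

The ship then stores entity identifiers (indices into reverse), so direct can be reshuffled freely while systems iterate it linearly.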

How about sorting the entities to minimize cache misses?
Every time a cannon and/or ship is added/destroyed/moved, sort the entities.
I am not sure if this is feasible in your ECS system;
it wouldn't be practical if you heavily depend on the entity indices, since they would change every time you sort.
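If the indices aren't stored anywhere else, a rough sketch of that idea (all names invented) could be as simple as grouping the cannons of each ship next to each other:

#include <algorithm>
#include <vector>

struct CannonCom { unsigned ownerShip; /* other data */ };

void sortByOwner(std::vector<CannonCom>& cannons) {
    // After this, all cannons of one ship are adjacent in memory.
    std::sort(cannons.begin(), cannons.end(),
              [](const CannonCom& a, const CannonCom& b) { return a.ownerShip < b.ownerShip; });
    // Any stored indices into 'cannons' are now invalid and would have to be rebuilt.
}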

Related

Does a vector help with cache locality? (C++)

Last week I read about great concepts such as cache locality and pipelining in a CPU. Although these concepts are easy to understand, I have two questions. Suppose one can choose between a vector of objects or a vector of pointers to objects (as in this question).
Then an argument for using pointers is that shuffling larger objects may be expensive. However, I'm not able to find out when I should call an object large. Is an object of several bytes already large?
An argument against the pointers is the loss of cache locality. Will it help if one uses two vectors, where the first one contains the objects and will not be reordered, and the second one contains pointers to these objects? Say that we have a vector of 200 objects, create a vector with pointers to these objects, and then randomly shuffle the latter vector. Is cache locality then lost if we loop over the vector of pointers?
This last scenario happens a lot in my programs, where I have City objects and around 200 vectors of pointers to these Cities. To avoid having 200 instances of each City, I use a vector of pointers instead of a vector of Cities.
There is no simple answer to this question. You need to understand how your system behaves with regard to memory, what operations you do on the container, and which of those operations are "important". But by understanding the concepts and what affects what, you can get a better understanding of how things work. So here's some "discussion" on the subject.
"Cache locality" is largely about "keeping things in the cache". In other words, if you look at A, then B, and A is located close to B, they are probably getting loaded into the cache together.
If objects are large enough to fill one or more cache lines (modern CPUs have cache lines of 64-128 bytes; mobile ones are sometimes smaller), the "next object in line" will not be in the cache anyway [1], so the cache locality of the "next element in the vector" is less important. The smaller the object, the more you benefit from this - assuming you are accessing objects in the order they are stored. If you pick elements at random, then other factors start to become important [2], and cache locality matters much less.
On the other hand, as objects get larger, moving them within the vector (including growing, removing, inserting, as well as "random shuffle") becomes more time consuming, as there is more data to copy.
Of course, one further step is always needed to read from a pointer vs. reading an element directly in a vector, since the pointer itself needs to be "read" before we can get to the actual data in the pointee object. Again, this becomes more important when random-accessing things.
I always start with "whatever is simplest" (which depends on the overall structure of the code; e.g. sometimes it's easier to create a vector of pointers because you have to dynamically create the objects in the first place). Most of the code in a system is not performance critical anyway, so why worry about its performance - just get it working and leave it be if it doesn't turn up in your performance measurements.
Of course, also, if you are doing a lot of movement of objects in a container, maybe vector isn't the best container. That's why there are multiple container variants - vector, list, map, tree, deque - as they have different characteristics with regards to their access and insert/remove as well as characteristics for linearly walking the data.
Oh, and in your example, you talk of 200 city objects - well, they are probably all going to fit in the cache of any modern CPU anyway. So stick them in a vector. Unless a city contains a list of every individual living in the city... But that should probably be a vector (or other container) in itself.
As an experiment, make a program that does the same operations on a std::vector<int> and a std::vector<int*> [such as filling with random numbers, then sorting the elements], then make an object that is large [stick an array of integers in there, or some such] plus one integer, so that you can do the very same operations on it. Vary the size of the stored object and see how it behaves: on YOUR system, where is the benefit of having pointers over having plain objects? Of course, also vary the number of elements to see what effect that has.
[1] Well, modern processors use cache-prefetching, which MAY load "next data" into the cache speculatively, but we certainly can't rely on this.
[2] An extreme case of this is a telephone exchange with a large number of subscribers (millions). When placing a call, the caller and callee are looked up in a table. But the chance of either caller or callee being in the cache is nearly zero, because (assuming we're dealing with a large city, say London) the number of calls placed and received every second is quite large. So caches become useless, and it gets worse, because the processor also caches the page-table entries, and they are also, most likely, out of date. For these sorts of applications, CPU designers have added "huge pages", which means that the memory is split into 1 GB pages instead of the usual 4K or 2MB pages that have been around for a while. This reduces the amount of memory reading needed before "we get to the right place". Of course, the same applies to various other "large database, unpredictable pattern" workloads - airlines, facebook, stackoverflow all have these sorts of problems.
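As a starting point for the experiment suggested above, here is a rough sketch that fills and sorts a vector of objects and a vector of pointers to the same kind of object; the payload size and element count are the knobs to vary:

#include <algorithm>
#include <chrono>
#include <cstdio>
#include <memory>
#include <random>
#include <vector>

struct Obj { int key; int payload[16]; };   // vary the payload size

int main() {
    constexpr int N = 1000000;
    std::mt19937 rng(42);
    std::uniform_int_distribution<int> dist(0, N);

    // Same data stored by value and behind pointers.
    std::vector<Obj> byValue(N);
    std::vector<std::unique_ptr<Obj>> byPointer;
    byPointer.reserve(N);
    for (int i = 0; i < N; ++i) {
        byValue[i].key = dist(rng);
        byPointer.push_back(std::make_unique<Obj>(byValue[i]));
    }

    auto time = [](const char* label, auto&& f) {
        auto t0 = std::chrono::steady_clock::now();
        f();
        auto t1 = std::chrono::steady_clock::now();
        std::printf("%s: %lld us\n", label, (long long)
            std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0).count());
    };

    time("by value  ", [&] {
        std::sort(byValue.begin(), byValue.end(),
                  [](const Obj& a, const Obj& b) { return a.key < b.key; });
    });
    time("by pointer", [&] {
        std::sort(byPointer.begin(), byPointer.end(),
                  [](const auto& a, const auto& b) { return a->key < b->key; });
    });
}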

Keep vector members of class contiguous with class instance

I have a class that implements two simple, pre-sized stacks; those are stored as vector members of the class, pre-sized by the constructor. They are small, cache-line-size friendly objects.
Those two stacks are constant in size, persisted and updated lazily, and are often accessed together by some computationally cheap methods that, however, can be called a large number of times (tens to hundreds of thousands of times per second).
All objects are already in a good state (the code is clean and does what it's supposed to do), and all sizes are kept under control (64K to 128K in most cases for the whole chain of ops, including results; rarely do they get close to 256K, so at worst an L2 look-up and often L1).
Some auto-vectorization comes into play, but other than that it's single-threaded code throughout.
The class, minus some minor things and padding, looks like this:
class Curve{
private:
    std::vector<ControlPoint> m_controls;
    std::vector<Segment> m_segments;
    unsigned int m_cvCount;
    unsigned int m_sgCount;
    std::vector<unsigned int> m_sgSampleCount;
    unsigned int m_maxIter;
    unsigned int m_iterSamples;
    float m_lengthTolerance;
    float m_length;
};

Curve::Curve(){
    m_controls = std::vector<ControlPoint>(CONTROL_CAP);
    m_segments = std::vector<Segment>(CONTROL_CAP - 3);
    m_cvCount = 0;
    m_sgCount = 0;
    m_sgSampleCount = std::vector<unsigned int>(CONTROL_CAP - 3); // assign the member, not a shadowing local
    m_maxIter = 3;
    m_iterSamples = 20;
    m_lengthTolerance = 0.001f;
    m_length = 0.0f;
}

Curve::~Curve(){}
Bear with the verbosity, please; I'm trying to educate myself and make sure I'm not operating on some half-arsed knowledge:
Given the operations that are run on those and their actual use, performance is largely memory I/O bound.
I have a few questions related to optimal positioning of the data. Keep in mind this is on Intel CPUs (Ivy Bridge and a few Haswell) with GCC 4.4; I have no other use cases for this:
I'm assuming that if the actual storage of controls and segments is contiguous with the instance of Curve, that's an ideal scenario for the cache (size-wise, the lot can easily fit on my target CPUs).
A related assumption is that if the vectors are distant from the Curve instance, and from each other, then as methods alternately access the contents of those two members there will be more frequent eviction and re-population of the L1 cache.
1) Is that correct (data is pulled in for an entire cache line starting from the address first looked up in a new operation, and not in convenient multiple segments of appropriate size), or am I misunderstanding the caching mechanism, and the cache can pull in and keep multiple smaller stretches of RAM?
2) Following the above: so far, by pure circumstance, all my tests have ended up with the class instance and the vectors contiguous, but I assume that's just dumb luck, however statistically probable. Normally, instancing the class reserves only the space for that object, and the vectors are then allocated in the next free contiguous chunk available, which is not guaranteed to be anywhere near my Curve instance if that instance previously found a small emptier niche in memory.
Is this correct?
3) Assuming 1 and 2 are correct, or close enough functionally speaking, I understand that to guarantee performance I'd have to write an allocator of sorts to make sure the class object itself is large enough on instancing, then copy the vectors in there myself and from there on refer to those.
I can probably hack my way to something like that if it's the only way to work through the problem, but I'd rather not hack it horribly if there are nice/smart ways to go about something like that. Any pointers on best practices and suggested methods would be hugely helpful (beyond "don't use malloc it's not guaranteed contiguous", that one I already have down :) ).
If the Curve instance fits into a cache line and the data of the two vectors also fits into one cache line each, the situation is not that bad, because you then have four constant cache lines. If every element were accessed indirectly and randomly positioned in memory, every access to an element might cost you a fetch operation, which is avoided in that case. In the case that both Curve and its elements fit into fewer than four cache lines, you would reap benefits from putting them into contiguous storage.
True.
If you used std::array, you would have the guarantee that the elements are embedded in the owning class, and you would not have the dynamic allocation (which in and of itself costs you memory space and bandwidth). You would then even avoid the indirect access that you would still have if you used a special allocator to put the vector contents in contiguous storage with the Curve instance.
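A sketch of what that std::array variant might look like, assuming CONTROL_CAP is (or can be made) a compile-time constant and using placeholder ControlPoint/Segment types:

#include <array>
#include <cstddef>

struct ControlPoint { float x, y, z; };   // placeholder
struct Segment { float length; };         // placeholder

constexpr std::size_t CONTROL_CAP = 8;    // assumed compile-time capacity

class Curve {
private:
    std::array<ControlPoint, CONTROL_CAP>     m_controls{};      // embedded in the object,
    std::array<Segment, CONTROL_CAP - 3>      m_segments{};      // no separate heap allocations
    std::array<unsigned int, CONTROL_CAP - 3> m_sgSampleCount{};
    unsigned int m_cvCount = 0;
    unsigned int m_sgCount = 0;
    unsigned int m_maxIter = 3;
    unsigned int m_iterSamples = 20;
    float m_lengthTolerance = 0.001f;
    float m_length = 0.0f;
};

The whole object is then a single allocation, so a Curve and its data are contiguous by construction; the price is that the capacity is fixed at compile time.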
BTW: Short style remark:
Curve::Curve()
{
    m_controls = std::vector<ControlPoint>(CONTROL_CAP, ControlPoint());
    m_segments = std::vector<Segment>(CONTROL_CAP - 3, Segment());
    ...
}
...should be written like this:
Curve::Curve():
    m_controls(CONTROL_CAP),
    m_segments(CONTROL_CAP - 3)
{
    ...
}
This is called a "member initializer list"; search for that term for further explanations. Also, a default-initialized element, which you provide as the second parameter, is already the default, so there is no need to specify it explicitly.

Rapidly instantiate objects - good or bad?

Note: Although this question doesn't directly correlate to games, I've molded the context around game development in order to better visualize a scenario where this question is relevant.
tl;dr: Is rapidly creating objects of the same class memory intensive, inefficient, or is it common practice?
Say we have a "bullet" class - and an instance of said class is created every time the player 'shoots' - anywhere between 1 and 10 times every second. These instances may be destroyed (obviously) upon collision.
Would this be a terrible idea? Is general OOP okay here (i.e. class Bullet { short x; short y; }, etc.), or is there a better way to do this? Are new and delete preferred?
Any input much appreciated. Thank you!
This sounds like a good use case for techniques like memory pools or free lists. The idea in both is that you have memory for a certain number of elements pre-allocated. You can override the new operator of your class to use the pool/list, or use placement new to construct your class at a retrieved address.
The advantages:
no memory fragmentation
pretty quick
The disadvantages:
you must know the maximum number of elements beforehand
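A minimal fixed-capacity pool along those lines (a sketch only; the Bullet type and capacity are made up, and a real implementation would add error handling):

#include <array>
#include <cassert>
#include <cstddef>
#include <new>
#include <utility>

struct Bullet { float x, y, vx, vy; };

template <typename T, std::size_t Capacity>
class Pool {
    alignas(T) unsigned char m_storage[Capacity * sizeof(T)];  // pre-allocated memory
    std::array<T*, Capacity> m_free;                            // stack of free slots
    std::size_t m_freeCount = Capacity;

public:
    Pool() {
        for (std::size_t i = 0; i < Capacity; ++i)
            m_free[i] = reinterpret_cast<T*>(m_storage) + i;
    }

    template <typename... Args>
    T* create(Args&&... args) {
        assert(m_freeCount > 0 && "pool exhausted");            // the known maximum applies here
        T* slot = m_free[--m_freeCount];
        return new (slot) T{std::forward<Args>(args)...};       // placement new, no heap traffic
    }

    void destroy(T* p) {
        p->~T();
        m_free[m_freeCount++] = p;
    }
};

Usage would then be along the lines of Pool<Bullet, 256> pool; Bullet* b = pool.create(0.f, 0.f, 1.f, 2.f); ... pool.destroy(b); with no per-shot heap allocation and no fragmentation.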
Don't just constantly create and delete objects. Instead, keep a constant, resizable array or list of object instances that you can reuse. For example, create an array of 100 bullets; they don't all have to be drawn - give each a boolean that states whether it is "active" or not.
Then whenever you need a new bullet, "activate" an inactive bullet and set its position where you need it. Then whenever it is off screen, you can mark it inactive again and not have to delete it.
If you ever need more than 100 bullets, just expand the array.
Consider reading this article to learn more: Object Pool. It also has several other game pattern related topics.
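A rough sketch of that activate/deactivate pattern (a hypothetical BulletPool; the field names are invented):

#include <cstddef>
#include <vector>

struct Bullet { float x, y, vx, vy; bool active; };

struct BulletPool {
    std::vector<Bullet> bullets = std::vector<Bullet>(100);   // "create an array of 100 bullets"

    std::size_t spawn(float x, float y, float vx, float vy) {
        for (std::size_t i = 0; i < bullets.size(); ++i)
            if (!bullets[i].active) { bullets[i] = {x, y, vx, vy, true}; return i; }
        bullets.push_back({x, y, vx, vy, true});               // "just expand the array"
        return bullets.size() - 1;
    }

    void update(float dt) {
        for (Bullet& b : bullets)
            if (b.active) { b.x += b.vx * dt; b.y += b.vy * dt; }
        // Off-screen bullets are simply marked inactive, never deleted.
    }
};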
The very least that happens when you allocate an object is a function call (its constructor). If that allocation is dynamic, there is also the cost of memory management, which at some point can become drastic due to fragmentation.
Would calling some function 10 times a second be really bad? No. Would creating and destroying many small objects dynamically 10 times a second be bad? Possibly. Should you be doing this? Absolutely not.
Even if the performance penalty is not "felt", it's not ok to have a suboptimal solution while an optimal one is immediately available.
So, instead of, for example, a std::list of objects that are dynamically added and removed, you can simply have a std::vector of bullets, where adding a bullet means appending to the vector (which, once it has reached a large enough size, shouldn't require any memory allocation anymore) and deleting means swapping the element being deleted with the last element and popping it from the vector (effectively just reducing the vector's size variable).
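That removal scheme, sketched out (a hypothetical Bullet type; note that the order of the remaining elements is not preserved):

#include <cstddef>
#include <utility>
#include <vector>

struct Bullet { float x, y; };

// O(1) removal: overwrite the dead element with the last one and shrink the vector.
void removeBullet(std::vector<Bullet>& bullets, std::size_t i) {
    bullets[i] = std::move(bullets.back());
    bullets.pop_back();
}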
Keep in mind that every instantiation needs a place on the heap where memory is allocated and a new instance is constructed, and this may affect performance. Try using a collection instead, and create an instance of that collection with however many bullets you want.

What is the most efficient (yet sufficiently flexible) way to store multi-dimensional variable-length data?

I would like to know the best practice for efficiently storing (and subsequently accessing) sets of multi-dimensional data arrays with variable length. The focus is on performance, but I also need to be able to change the length of an individual data set during runtime without too much overhead.
Note: I know this is a somewhat lengthy question, but I have looked around quite a lot and could not find a solution or example which describes the problem at hand with sufficient accuracy.
Background
The context is a computational fluid dynamics (CFD) code that is based on the discontinuous Galerkin spectral element method (DGSEM) (cf. Kopriva (2009), Implementing Spectral Methods for Partial Differential Equations). For the sake of simplicity, let us assume a 2D data layout (it is in fact in three dimensions, but the extension from 2D to 3D should be straightforward).
I have a grid that consists of K square elements k (k = 0,...,K-1) that can be of different (physical) sizes. Within each grid element (or "cell") k, I have N_k^2 data points. N_k is the number of data points in each dimension, and can vary between different grid cells.
At each data point n_k,i (where i = 0,...,N_k^2-1) I have to store an array of solution values, which has the same length nVars in the whole domain (i.e. everywhere), and which does not change during runtime.
Dimensions and changes
The number of grid cells K is of O(10^5) to O(10^6) and can change during runtime.
The number of data points N_k in each grid cell is between 2 and 8 and can change during runtime (and may be different for different cells).
The number of variables nVars stored at each grid point is around 5 to 10 and cannot change during runtime (it is also the same for every grid cell).
Requirements
Performance is the key issue here. I need to be able to regularly iterate in an ordered fashion over all grid points of all cells in an efficient manner (i.e. without too many cache misses). Generally, K and N_k do not change very often during the simulation, so for example a large contiguous block of memory for all cells and data points could be an option.
However, I do need to be able to refine or coarsen the grid (i.e. delete cells and create new ones, the new ones may be appended to the end) during runtime. I also need to be able to change the approximation order N_k, so the number of data points I store for each cell can change during runtime as well.
Conclusion
Any input is appreciated. If you have experience yourself, or just know a few good resources that I could look at, please let me know. However, while the solution will be crucial to the performance of the final program, it is just one of many problems, so the solution needs to be of an applied nature and not purely academic.
Should this be the wrong venue to ask this question, please let me know what a more suitable place would be.
Often, these sorts of dynamic mesh structures can be very tricky to deal with efficiently, but in block-structured adaptive mesh refinement codes (common in astrophysics, where complex geometries aren't important) or in your spectral element code, where you have large block sizes, it is often much less of an issue. You have so much work to do per block/element (with at least 10^5 cells x 2 points/cell in your case) that the cost of switching between blocks is comparatively minor.
Keep in mind, too, that you generally can't do much of the hard work on each element or block until a substantial amount of that block's data is already in cache. You're already going to have had to flush most of block N's data out of the cache before getting much work done on block N+1 anyway. (There might be some operations in your code which are exceptions to this, but those are probably not the ones where you're spending much time anyway, cache or no cache, because there's not a lot of data reuse - e.g. elementwise operations on cell values.) So keeping the blocks/elements beside each other isn't necessarily a huge deal; on the other hand, you definitely want the blocks/elements to be contiguous in themselves.
Also notice that you can move blocks around to keep them contiguous as things get resized, but not only are all those memory copies going to wipe your cache, the memory copies themselves also get very expensive. If your problem is filling a significant fraction of memory (and aren't we always?), say 1 GB, and you have to move 20% of that around after a refinement to make things contiguous again, that's 0.2 GB (read + write) / ~20 GB/s ~ 20 ms, compared to reloading (say) 16k cache lines at ~100 ns each ~ 1.5 ms. And your cache is trashed after the shuffle anyway. This might still be worth doing if you knew you were going to refine/derefine very seldom.
But as a practical matter, most adaptive mesh codes in astrophysical fluid dynamics (where I know the codes well enough to say) simply maintain a list of blocks and their metadata and don't worry about their contiguity. YMMV, of course. My suggestion would be - before spending too much time crafting the perfect data structure - to first just test the operation on two elements, twice: first with the elements in order, computing on them 1-2, and then doing the operation in the "wrong" order, 2-1, and timing the two computations several times.
For each cell, store the offset at which that cell's data can be found in one contiguous array. This offset mapping is very efficient and widely used. You can reorder the cells for cache reuse in traversals. When the order or number of cells changes, create a new array and interpolate, then throw away the old arrays. This storage is also much better for external analysis, because operations like inner products in Krylov methods and stages in Runge-Kutta methods can be handled without reference to the mesh. It also requires minimal memory per vector (e.g. in Krylov bases and with time integration).
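A sketch of that offset mapping for the 2D case in the question (cell k has nPoints[k] = N_k^2 data points, each holding nVars values); the Mesh type and member names are invented:

#include <cstddef>
#include <vector>

struct Mesh {
    std::size_t nVars = 5;
    std::vector<std::size_t> nPoints;   // N_k^2 for each cell k
    std::vector<std::size_t> offset;    // offset[k] = index of cell k's first value in 'data'
    std::vector<double>      data;      // one contiguous block for all cells

    void buildOffsets() {
        offset.resize(nPoints.size());
        std::size_t running = 0;
        for (std::size_t k = 0; k < nPoints.size(); ++k) {
            offset[k] = running;
            running += nPoints[k] * nVars;
        }
        data.resize(running);
    }

    // Variable v at data point i of cell k.
    double& value(std::size_t k, std::size_t i, std::size_t v) {
        return data[offset[k] + i * nVars + v];
    }
};

When cells are refined or coarsened, build the offsets for the new layout into a fresh data array, interpolate the old values into it, and discard the old arrays, as described above.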

How should I store and use gun fire data in a game?

In this game I have 3 defense towers (the number is configurable) which fire a "bullet" every 3 seconds at 30 km/h. These defense towers have a radar, and they only start firing when the player is under the tower's radar. That's not the issue.
My question is how to store the data for the gun fire. I'm not sure exactly what data I need for each bullet, but one thing that comes to mind is, of course, the position of the bullet. Let's assume for now that this is all I need to store (I already have a struct defined for a 3D point).
Should I try to figure out the maximum number of bullets the game can have at any one time and declare an array of that size? Should I use a linked list? Or maybe something else?
I really have no idea how to do this. I don't need anything fancy or complex. Something basic that just works and it's easy to use and implement is more than enough.
P.S: I didn't post this question on the game development website (despite the tag) because I think it fits better here.
Generally, fixed length arrays aren't a good idea.
Given your game model, I wouldn't go for any data structure that doesn't allow O(1) removal. That rules out plain arrays anyway, and might suggest a linked list. However, the underlying details should be abstracted away by using a generic container class with the right attributes.
As for what you should store:
Position (as you mentioned)
Velocity
Damage factor (your guns are upgradeable, aren't they?)
Maximum range (ditto)
EDIT: To slightly complicate matters, the STL containers always take copies of the elements put into them, so in practice, if any of the attributes might change over the object's lifetime, you'll need to allocate your structures on the heap and store (smart?) pointers to them in the collection.
I'd probably use a std::vector or std::list. Whatever's easiest.
Caveat: If you are coding for a very constrained platform (slow CPU, little memory), then it might make sense to use a plain-old fixed-size C array. But that's very unlikely these days. Start with whatever is easiest to code, and change it later if and only if it turns out you can't afford the abstractions.
I guess you can start off with std::vector<BulletInfo> and see how it works from there. It provides the array like interface but is dynamically re-sizable.
In instances like this I prefer a slightly more complex method of managing bullets. Since the number of bullets possible on screen is directly related to the number of towers, I would keep a small fixed-length array of bullets inside each tower class. Whenever a tower goes to fire a bullet, it searches through its array, finds an unused bullet, sets that bullet up with a new position/velocity, and marks it active.
The slightly more complex part is I like to keep a second list of bullets in an outside manager, say a BulletManager. When each tower is created the tower would add all its bullets to the central manager. Then the central manager can be in charge of updating the bullets.
I like this method because it easily allows me to manage memory constraints related to bullets: just tweak the "number of active towers" value and all of the bullets are created for you. You don't need to allocate bullets on the fly because they are all pooled, and you don't have just one central pool whose size you constantly need to change as you add/remove towers.
It does involve slightly more overhead, because there is a central manager with a list of pointers, and you need to be careful to always remove a destroyed tower's bullets from the central manager. But for me the benefits are worth it.
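A rough sketch of that arrangement (all names and the per-tower capacity are invented):

#include <algorithm>
#include <array>
#include <vector>

struct Bullet { float x, y, vx, vy; bool active = false; };

struct Tower {
    std::array<Bullet, 8> bullets{};    // small fixed pool per tower
};

struct BulletManager {
    std::vector<Bullet*> all;           // pointers into every tower's pool

    void registerTower(Tower& t) {
        for (Bullet& b : t.bullets) all.push_back(&b);
    }

    void unregisterTower(const Tower& t) {   // call this before destroying a tower
        all.erase(std::remove_if(all.begin(), all.end(),
                      [&](Bullet* b) { return b >= t.bullets.data()
                                           && b < t.bullets.data() + t.bullets.size(); }),
                  all.end());
    }

    void update(float dt) {
        for (Bullet* b : all)
            if (b->active) { b->x += b->vx * dt; b->y += b->vy * dt; }
    }
};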