Data-oriented access to several indexed data arrays - C++

I am working on an entity component system for a game engine. One of my goals is to use a data-oriented approach for optimal data processing. In other words, I want to follow the guideline of preferring structs of arrays over arrays of structs. However, my problem is that I haven't managed to figure out a neat way to achieve this.
My idea so far is that every component in the system is responsible for a specific part of the game logic; say the Gravity Component takes care of calculating forces every frame depending on mass, velocity, etc., and other components take care of other things. Hence every component is interested in a different data set. The Gravity Component might be interested in mass and velocity, while the Collision Component might be interested in bounding boxes and position.
So far I figured I could have a data manager which stores one array per attribute. Say that entities may have one or more of weight, position, velocity, etc., and that each entity has a unique ID. The data in the data manager would be represented as follows, where every number represents an entity ID:
weightarray -> [0,1,2,3]
positionarray -> [0,1,2,3]
velocityarray -> [0,1,2,3]
This approach works well if all entities have every one of the attributes. However, if only entities 0 and 2 have all three attributes and the others are entities of a type that does not move, they will not have velocity and the data would look like:
weightarray -> [0,1,2,3]
positionarray -> [0,1,2,3]
velocityarray -> [0,2] //either squash it like this
velocityarray -> [0, ,2, ] //or leave "empty gaps" to keep alignment
Suddenly it isn't as easy to iterate through. A component interested only in iterating over and manipulating the velocity would somehow have to skip the empty gaps if I went with the second approach. The first approach of keeping the array short wouldn't work well either in more complicated situations. Say I have one entity 0 with all three attributes, another entity 1 having only weight and position, and an entity 2 which only has position and velocity. Finally, there is one last entity 3 which only has weight. Squashed, the arrays would look like:
weightarray -> [0,1,3]
positionarray -> [0,1,2]
velocityarray -> [0,2]
The other approach would leave gaps like so:
weightarray -> [0,1, ,3]
positionarray -> [0,1,2, ]
velocityarray -> [0, ,2, ]
Both of these situations are nontrivial to iterate over if you are only interested in the subset of entities that has particular attributes. A given component X might, for instance, want to process all entities with both position and velocity. How can I extract iterable array pointers to hand to this component for its calculations? I would want to give it an array whose elements sit right next to each other, but that seems impossible.
I've been thinking about solutions like having a bit field for every array describing which slots are valid and which are gaps, or a system that copies data over to temporary hole-free arrays that are then handed to the components, and other ideas, but none of them struck me as elegant or free of additional processing overhead (such as extra validity checks, or extra copying of data).
I am asking here because I hope some of you might have experience with something similar, or ideas or thoughts helpful in pursuing this issue. :) Also, if this whole idea is crap and impossible to get right and you have a much better idea instead, please tell me. Hopefully the question isn't too long or cluttered.
Thanks.

Good question. However, as far as I can tell, there is no straightforward solution to this problem. There are multiple solutions (some of which you've mentioned) but I don't see an immediate silver bullet solution.
Let's look at the goal first. The goal isn't to put all data in linear arrays; that's just the means to reach the goal. The goal is optimizing performance by minimizing cache misses. That's all. If you use OOP objects, your entity's data will be surrounded by data you don't necessarily need. If your architecture has a cache line size of 64 bytes and you only need weight (float, 4 bytes), position (vec3, 12 bytes) and velocity (vec3, 12 bytes), you use 28 bytes, but the remaining 36 bytes of the cache line will be loaded anyway. Even worse, if those 3 values are not side by side in memory, or your data structure straddles a cache line boundary, you will load multiple cache lines for just 28 bytes of actually used data.
Now this isn't that bad if you do it a few times. Even if you do it a hundred times, you will hardly notice it. However, if you do it thousands of times each second, it may become an issue. So enter DOD, where you optimize for cache utilization, usually by creating linear arrays for each variable in situations where there are linear access patterns. In your case, arrays for weight, position and velocity. When you load the position of one entity, you again load 64 bytes of data. But because your position data sits side by side in an array, you don't load 1 position value, you load the positions of about 5 adjacent entities. The next iteration of your update loop will probably need the next position value, which is already in cache, and so on; only at around the 6th entity will it need to load new data from main memory.
So the goal of DOD isn't using linear arrays, it's maximizing cache utilization by placing data that is accessed at (about) the same time adjacent in memory. If you nearly always access 3 variables at the same time, you don't need to create 3 separate arrays, one per variable; you could just as easily create a struct which contains only those 3 values and create an array of those structs. The best solution always depends on the way you use the data. If your access patterns are linear but you don't always use all variables, go for separate linear arrays. If your access patterns are more irregular but you always use all variables at the same time, put them in a struct together and create an array of those structs.
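To make the contrast concrete, here is a minimal sketch of the two layouts (the names are illustrative, not from the question):

#include <array>
#include <vector>

// Array of structs (AoS): best when all fields are used together,
// even with irregular access patterns.
struct Body {
    float weight;                  // 4 bytes
    std::array<float, 3> position; // 12 bytes
    std::array<float, 3> velocity; // 12 bytes
};
std::vector<Body> bodies;

// Struct of arrays (SoA): best for linear passes that touch only
// some of the fields; each array streams through the cache on its own.
struct Bodies {
    std::vector<float> weight;
    std::vector<std::array<float, 3>> position;
    std::vector<std::array<float, 3>> velocity;
};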
So there is your answer in short form: it all depends on your data usage. This is the reason I can't answer your question directly. I can give you some ideas on how to deal with your data, and you can decide for yourself which would be the most useful (if any of them are) in your situation, or maybe you can adapt/mix them up.
You could keep the most accessed data in a contiguous array. For instance, position is used often by many different components, so it is a prime candidate for a contiguous array. Weight, on the other hand, is only used by the gravity component, so there can be gaps there. You've optimized for the most used case and accept lower performance for data that is used less often. Still, I'm not a big fan of this solution, for a number of reasons: it's still inefficient, since you will load way too much empty data, and the lower the ratio of # specific components / # total entities, the worse it gets. If only one in 8 entities has a gravity component, and these entities are spread evenly throughout the arrays, you still get one cache miss for each update. It also assumes all entities have a position (or whatever the common variable is), it makes adding and removing entities hard, and it's inflexible and plain ugly (imho anyway). It may be the easiest solution though.
Another way to solve this is using indexes. Every component's array is packed, but there are two extra arrays: one to get the entity id from a component array index, and a second to get the component array index from an entity id. Let's say position is shared by all entities while weight and velocity are only used by Gravity. You can now iterate over the packed weight and velocity arrays, and to get/set the corresponding position you use the gravityIndex -> entityID table, go to the Position component, and use its entityID -> positionIndex table to get the correct index into the position array. The advantage is that your weight and velocity accesses will no longer give you cache misses, but you still get cache misses for the positions if the ratio of # gravity components / # position components is low. You also get 2 extra array lookups, but a 16-bit unsigned int index should be enough in most cases, so these index arrays will fit nicely into the cache, meaning this may not be a very expensive operation. Still, profile profile profile to be sure!
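A minimal sketch of those two index tables, assuming 16-bit indices as suggested (all names here are hypothetical):

#include <array>
#include <cstdint>
#include <vector>

using EntityID = std::uint16_t;
using Index    = std::uint16_t;

struct PositionComponent {
    std::vector<std::array<float, 3>> data; // packed positions
    std::vector<Index> entityToIndex;       // entity id -> index into data
};

struct GravityComponent {
    std::vector<float> weight;           // packed, hole-free
    std::vector<float> velocity;         // packed, hole-free (1D for brevity)
    std::vector<EntityID> indexToEntity; // gravity index -> entity id
};

// To touch the position that belongs to gravity slot i:
// EntityID e = gravity.indexToEntity[i];
// auto& pos  = positions.data[positions.entityToIndex[e]];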
A third option is data duplication. Now, I'm pretty sure this isn't going to be worth the effort for your Gravity component (I think it's more interesting in computationally heavy situations), but let's take it as an example anyway. In this case, the Gravity component has 3 packed arrays for weight, velocity and position, plus an index table similar to the one in the second option. When you start the Gravity update, you first fill the local position array from the original position array in the Position component, using the index table as in example 2. Now you have 3 packed arrays that you can process linearly with maximum cache utilization. When you're done, you copy the positions back to the original Position component using the index table. This won't be faster (in fact, probably slower) than the second option if you use it for something like Gravity, because you only read and write each position once. However, for a component where entities interact with each other and each update pass requires multiple reads and writes, this may be faster. Still, it all depends on access patterns.
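Using the structures from the previous sketch, the copy-in/copy-out step might look like this (again just a sketch, assuming the Gravity component also keeps a packed position array):

// Gather: fill the Gravity component's packed local position array.
for (std::size_t i = 0; i < gravity.indexToEntity.size(); ++i) {
    EntityID e = gravity.indexToEntity[i];
    gravity.position[i] = positions.data[positions.entityToIndex[e]];
}

// ... run the tight update loop over the packed weight/velocity/position ...

// Scatter: write the results back to the Position component.
for (std::size_t i = 0; i < gravity.indexToEntity.size(); ++i) {
    EntityID e = gravity.indexToEntity[i];
    positions.data[positions.entityToIndex[e]] = gravity.position[i];
}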
The last option I'll mention is a change-based system, which you can easily adapt into something like a messaging system. In this case, you only update data that has changed. In your Gravity component, most objects will be lying on the floor unchanged, but a few are falling. The Gravity component has packed arrays for position, velocity and weight. If a position is updated during your update loop, you add the entity ID and the new position to a list of changes. When you're done, you send those changes to any other component that keeps a position value. The same principle applies when any other component (for instance, the player control component) changes positions: it sends the new positions of the changed entities, and the Gravity component listens and updates only those positions in its own array. You duplicate a lot of data just like in the previous example, but instead of rereading all data every update cycle, you only update data when it changes. Very useful in situations where small amounts of data actually change each frame, but it may become ineffective if large amounts of data change.
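A sketch of what such a change list could look like (Vec3 and the listener interface are assumptions for illustration):

struct PositionChange {
    EntityID entity;
    Vec3 newPosition;
};

std::vector<PositionChange> changes; // filled during the update loop

// At the end of the update, publish the changes; every component that
// keeps its own copy of position applies just these entries.
for (const PositionChange& c : changes)
    for (IPositionListener* l : positionListeners)
        l->onPositionChanged(c.entity, c.newPosition);
changes.clear();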
So there is no silver bullet. There are a lot of options. The best solution is entirely dependent on your situation, on your data and the way you process that data. Maybe none of the examples I gave are right for you, maybe all of them are. Not every component has to work the same way; some might use the change/message system while others use the index option. Remember that while many DOD performance guidelines are great when you need the performance, they are only useful in certain situations. DOD is not about always using arrays, nor about always maximizing cache utilization; you should only do this where it actually matters. Profile profile profile. Know your data. Know your data access patterns. Know your (cache) architecture. If you do all of that, solutions will become apparent when you reason about it :)
Hope this helps!

The solution is actually accepting that there are limits on how far you can optimize.
Solving the gap problem will only introduce the following:
If statements (branches) to handle the data exceptions (entities which are missing a component).
Holes, meaning you may as well iterate the lists in random order. The power of DOD is that all data is tightly packed and ordered in the way it will be processed.
What you may want to do:
Create different lists optimized for different systems / cases. Every frame, copy the properties from one system to another, but only for the entities that require it (those which have that specific component).
Say you have the following simplified lists and their attributes:
rigidbody (force, velocity, transform)
collision (boundingbox, transform)
drawable (texture_id, shader_id, transform)
rigidbody_to_collision (rigidbody_index, collision_index)
collision_to_rigidbody (collision_index, rigidbody_index)
rigidbody_to_drawable (rigidbody_index, drawable_index)
etc...
For the processes / jobs you may want the following:
RigidbodyApplyForces(...), apply forces (ex. gravity) to velocities
RigidbodyIntegrate(...), apply velocities to transforms.
RigidbodyToCollision(...), copy rigidbody transforms to collision transforms, but only for entities that have the collision component. The "rigidbody_to_collision" list contains the indices of which rigidbody ID should be copied to which collision ID. This keeps the collision list tightly packed (see the sketch after this list).
RigidbodyToDrawable(...), copy rigidbody transforms to drawable transforms for entities that have the draw component. The "rigidbody_to_drawable" list contains the indices of which rigidbody ID should be copied to which drawable ID. This keeps the drawable list tightly packed.
CollisionUpdateBoundingBoxes(...), update bounding boxes using new transforms.
CollisionRecalculateHashgrid(...), update hashgrid using bounding boxes. You may want to execute this divided over several frames to distribute load.
CollisionBroadphaseResolve(...), calculate possible collisions using hashgrid etc....
CollisionMidphaseResolve(...), calculate collision using bounding boxes for broadphase etc....
CollisionNarrowphaseResolve(...), calculate collision using polygons from midphase etc....
CollisionToRigidbody(...), add reactive forces of colliding objects to rigidbody forces. The "collision_to_rigidbody" list contains the indices from which collision ID the force should be added to which rigidbody ID. You may also create another list called "reactive_forces_to_be_added". You can use that to delay the addition of the forces.
RenderDrawable(...), render the drawables to screen (renderer is just simplified).
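A sketch of one of the copy jobs, say RigidbodyToCollision (Transform is a placeholder for whatever your transform type is):

#include <cstdint>
#include <vector>

struct RigidbodyToCollisionEntry {
    std::uint32_t rigidbody_index;
    std::uint32_t collision_index;
};

// Copy rigidbody transforms into the collision list, but only for the
// entities that actually have a collision component. Both lists stay
// tightly packed; the mapping list carries the indirection.
void RigidbodyToCollisionJob(const std::vector<RigidbodyToCollisionEntry>& mapping,
                             const std::vector<Transform>& rigidbody_transforms,
                             std::vector<Transform>& collision_transforms)
{
    for (const RigidbodyToCollisionEntry& m : mapping)
        collision_transforms[m.collision_index] = rigidbody_transforms[m.rigidbody_index];
}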
Of course you'll need a lot more processes / jobs. You'll probably want to occlude and sort the drawables, add a transform graph system between the physics and drawables (see the Sony presentation about how you might do this), etc. The jobs can be executed distributed over multiple cores. This is very easy when everything is just a list, as lists can be divided into multiple chunks.
When an entity is created, its component data will also be created together and stored in the same order, meaning the lists will stay mostly in the same order.
In the case of the "copy object to object" processes: if the skipping of holes really becomes a problem, you can always create a "reorder objects" process which, at the end of every frame (possibly distributed over multiple frames), reorders objects into the most optimal order, i.e. the order which requires the least skipping of holes. Skipping holes is the price to pay to keep all lists as tightly packed as possible, ordered in the way they are going to be processed.

I rely on two structures for this problem. Hopefully the diagrams are clear enough (I can add further explanation otherwise):
The sparse array allows us to associate data in parallel to another without hogging up too much memory from unused indices and without degrading spatial locality much at all (since each block stores a bunch of elements contiguously).
You might use a smaller block size than 512 since that can be pretty huge for a particular component type. Something like 32 might be reasonable or you might adjust the block size on the fly based on the sizeof(ComponentType). With this you can just associate your components in parallel to your entities without blowing up memory use too much from unoccupied spaces, though I don't use it that way (I use the vertical type of representation, but my system has many component types -- if you only have a few, you might just store everything in parallel).
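A sketch of such a block-based sparse array (block size 32, as discussed above; this is my reading of the structure, not an exact implementation):

#include <array>
#include <cstddef>
#include <memory>
#include <vector>

template <typename T, std::size_t BlockSize = 32>
class SparseArray {
    // Blocks are allocated lazily: an unoccupied range costs only a null
    // pointer, while elements inside a block stay contiguous for locality.
    std::vector<std::unique_ptr<std::array<T, BlockSize>>> blocks;
public:
    T& at(std::size_t i) {
        std::size_t b = i / BlockSize;
        if (b >= blocks.size()) blocks.resize(b + 1);
        if (!blocks[b]) blocks[b] = std::make_unique<std::array<T, BlockSize>>();
        return (*blocks[b])[i % BlockSize];
    }
};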
However, we need another structure when iterating to figure out which indices are occupied. There I use a hierarchical bitset (I love and use this data structure a lot, but I don't know if there's a formal name for it since it's just something I made without knowing what it's called):
This allows the elements which are occupied to always be accessed in sequential order (similar to that of using sorted indices). This structure is extremely fast for sequential iteration since testing a single bit might indicate that a million contiguous elements can be processed without checking a million bits or having to store and access a million indices into the container.
As a bonus, it also allows you to do set intersections in a best-case scenario of Log(N)/Log(64) (ex: being able to find the set intersection between two dense index sets containing a million elements each in 3-4 iterations) if you ever need fast set intersections which can often be pretty handy for an ECS.
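For reference, a two-level version of the idea (the real structure can stack more levels; this sketch only shows the principle):

#include <cstddef>
#include <cstdint>
#include <vector>

class HierarchicalBitset {
    std::vector<std::uint64_t> leaves;  // one bit per element slot
    std::vector<std::uint64_t> summary; // bit w set iff leaves[w] has any bit set
public:
    explicit HierarchicalBitset(std::size_t n)
        : leaves((n + 63) / 64), summary((n + 4095) / 4096) {}

    void set(std::size_t i) {
        leaves[i / 64] |= std::uint64_t(1) << (i % 64);
        summary[i / 4096] |= std::uint64_t(1) << ((i / 64) % 64);
    }

    // Iteration first scans 'summary': a zero word there means 4096
    // consecutive slots are unoccupied and can be skipped with one test.
};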
These two structures are the backbones of my ECS engine. They're pretty fast: I can process 2 million particle entities (accessing two different components), without caching the query for the entities that have both components, at just a little under 30 FPS. Of course that's a crappy frame rate for just 2 million particles, but that's when representing them as entire entities with two components attached each (motion and sprite), with the particle system performing the query every single frame, uncached -- something people would normally never do (better to use something like a ParticleEmitter component which represents many particles for a given entity, rather than making each particle a whole separate entity).
Hopefully the diagrams are clear enough to implement your own version if you're interested.

Rather than addressing the structuring of your data, I'd just like to offer perspective on how I've done stuff like this in the past.
The game engine has a list of managers responsible for various systems in the game (InputManager, PhysicsManager, RenderManager, etc...).
Most things in the 3D world are represented by an Object class, and each Object can have any number of Components. Each component is responsible for different aspects of the object's behavior (RenderComponent, PhysicsComponent, etc...).
The physics component was responsible for loading the physics mesh and giving it all of the necessary properties like mass, density, center of mass, inertia response data, and more. This component also stored information about the physics model once it was in the world, like position, rotation, linear velocity, angular velocity, and more.
The PhysicsManager had knowledge of every physics mesh that had been loaded by any physics component; this allowed the manager to handle all physics-related tasks, such as collision detection, dispatching collision messages, and doing physics ray casts.
If we wanted specialized behavior that only a few objects would need we would create a component for it, and have that component manipulate data like velocity, or friction, and those changes would be seen by the PhysicsManager and accounted for in the physics simulation.
As far as the data structure goes, you can take the system I mentioned above and structure it in several ways. Generally the Objects are kept in either a Vector or a Map, and the Components in a Vector or List on the Object. As far as physics information goes, the PhysicsManager has a list of all physics objects (which can be stored in an Array/Vector), and the PhysicsComponent has a copy of its position, velocity, and other data, so that it can do anything it needs with the data manipulated by the physics manager. For example, if you wanted to alter the velocity of an Object, you'd just tell the PhysicsComponent; it would alter its velocity value and then notify the PhysicsManager.
I talk more about the subject of object/component engine structure here: https://gamedev.stackexchange.com/a/23578/12611

Related

Parallel quadtree construction from Morton-ordered points

I have a collection of points [(x1,y1),(x2,y2), ..., (xn,yn)] which are Morton sorted. I wish to construct a quadtree from these points in parallel. My intuition is to construct a subtree on each core and merge all subtrees to form a complete quadtree. Can anyone provide some high-level insights or pseudocode on how I might do this efficiently?
First some thought on your plan:
Are you sure that parallelizing construction will help? I think there is a risk that you won't get much of a speedup. Quadtree construction is rather cheap on the CPU, so it will be partly bound by your memory bandwidth. Parallelization may not help much unless you have separate memory buses, for example on separate machines.
If you want to parallelize construction across parallel machines, it may be cheapest to simply create separate quadtrees by splitting your point collection into evenly sized chunks. This has one big advantage over other solutions: when you want to insert more points, or look up points, the Morton order allows you to determine quite efficiently which tree contains the point (or, for insertion, should contain it). For window queries you can do a similar optimization: if the Morton codes of the 'min/min' and 'max/max' corners of the query window lie in the same 'chunk' (sub-tree), then you only need to query that one tree. More optimizations are possible.
If you really want to create a single quadtree on a single machine, there are several ways to split your dataset efficiently:
Walk through all points and identify the global min/max. Then walk through all points and assign them (assuming 4 cores) to the cores, where each core represents a quadrant. These steps parallelize well by splitting the dataset into 4 evenly sized chunks, and the result is a quadtree that exactly fits your dataset. You will have to synchronize insertion into the trees, but since the dataset is Morton ordered, there should be relatively few lock collisions.
You can completely avoid lock collisions during insertion by aligning the quadrants with Morton coordinates, such that the Morton curve (a z-curve) crosses each quadrant border only once. Disadvantage: the tree will be imbalanced, i.e. it is unlikely that all quadrants contain the same amount of data. This means your CPUs may have considerably different workloads, unless you split the sub-trees into sub-sub-trees, and so on, to distribute the load better.
The split planes for preventing the z-curve from crossing quadrant borders can be derived from the Morton code (z-code) of your coordinates. Split the z-code into chunks of two bits; each pair of bits tells you which (sub-)quadrant to choose, i.e. 00 is lower/left, 01 is lower/right, 10 is upper/left and 11 is upper/right. Since your points are Morton ordered, you can simply use binary search to find the chunk of points belonging to each quadrant.
I realize this may sound rather cryptic without more explanation, so maybe you can have a look at the PH-Tree; it is essentially a z-ordered (Morton-ordered) quadtree (more a 'trie' than a 'tree'). There are also some in-depth explanations here and here (shameless self-advertisement). The PH-Tree has some nice properties, such as inherently limiting depth to 64 levels (for 64-bit numbers) while guaranteeing small nodes (4 entries max for 2 dimensions); it also guarantees, like the quadtree, that any insert/removal will never affect more than one node, plus possibly adding or removing a second node. There is also a C++ implementation here.
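To make the binary-search part concrete, here is a sketch that splits a Morton-sorted array of 64-bit codes into the four top-level quadrants (assuming the top two bits of each code select the quadrant):

#include <algorithm>
#include <array>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Returns [begin, end) index ranges for quadrants 00, 01, 10, 11.
// Each range can then be handed to its own core (and split recursively).
std::array<std::pair<std::size_t, std::size_t>, 4>
splitTopLevel(const std::vector<std::uint64_t>& codes) // must be sorted
{
    std::array<std::pair<std::size_t, std::size_t>, 4> ranges;
    for (std::uint64_t q = 0; q < 4; ++q) {
        auto lo = std::lower_bound(codes.begin(), codes.end(), q << 62);
        auto hi = (q == 3) ? codes.end()
                           : std::lower_bound(codes.begin(), codes.end(), (q + 1) << 62);
        ranges[q] = { std::size_t(lo - codes.begin()),
                      std::size_t(hi - codes.begin()) };
    }
    return ranges;
}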

How would one look for entities with specific components in an entity component system?

In my current implementation I'm storing components in a
std::unordered_map< entity_id, std::unordered_map<type_index, Component *> >.
So if a system needs access to entities with specific components, what is the most efficient way to access them?
I currently have 2 ideas:
Iterate through the map and skip the entities that don't have those components.
Create "mappers" or "views" that hold a pointer to the entity and update them every time a component is assigned to or removed from an entity.
I saw some approaches with bitmasks and such, but that doesn't seem scalable.
Your situation calls for std::unordered_multimap.
"find" method would return an iterator for the first element, which matches the key in multimap. "equal_range" method would return you a pair, containing the iterators for the first and last object, matching your key.
In effect, what unordered_multimap lets you create is an in-memory key-value store that holds a bunch of objects under the same key.
If your "queries" would get more complicated than "give me all objects with component T" and turn into something like "give me all components that have component T and B at the same time", you would be better suited to create a class that has unordered_multimap as a member and has a bunch of utility methods for querying the stuff.
More on the subject:
http://www.cplusplus.com/reference/unordered_map/unordered_multimap/equal_range/
unordered_multimap - iterating the result of find() yields elements with different value (somewhat related question - the accepted answer could be helpful)
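For illustration, a minimal sketch of the equal_range pattern (the component-keyed multimap layout here is an assumption, inverted from the question's entity-keyed map; Entity is a placeholder type):

#include <typeindex>
#include <unordered_map>

// One entry per (component type, entity) pair.
using ComponentIndex = std::unordered_multimap<std::type_index, Entity*>;

// Visit every entity that has a component of type T.
template <typename T, typename Fn>
void forEachEntityWith(ComponentIndex& index, Fn&& fn)
{
    auto range = index.equal_range(std::type_index(typeid(T)));
    for (auto it = range.first; it != range.second; ++it)
        fn(*it->second);
}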
The way I do it involves storing a back index from the component to the entity (32 bits). It adds a bit of memory overhead, but the total memory overhead of a component in mine is around 8 bytes, which is usually not too bad for my use cases, plus around 4 bytes per entity.
Now when you have a back index to an entity, what you can do when satisfying a query for all entities that have 2 or more component types is to use parallel bit sets.
For example, if you are looking for entities with two component types, Motion, and Sprite, then we start out by iterating through the motion components and set the associated bits for the entities that own them.
Next we iterate through the sprite components and look for the entity bits already set by the pass through motion components. If the entity index appears in both the motion components and the sprite components, then we add the entity to the list of entities that contain both. A diagram of the idea as well as how to multithread it and pool the entity-parallel bit arrays:
That gives you a set intersection in linear time and with a very small constant factor (very, very cheap work per iteration, as we're just setting and inspecting a bit -- much, much cheaper than a hash table, e.g.). I can actually perform a set intersection between two sets with 100 million elements each in under a second using this technique. As a bonus, with some minor effort, you can have it give you the entities back in sorted order for cache-friendly access patterns, if you use the bitset to grab the indices of the entities that belong to 2 or more components.
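A sketch of the two passes (the Motion/Sprite arrays and their 'entity' back index are stand-ins for whatever your components look like):

#include <cstdint>
#include <vector>

// One bit per entity; wordCount = (entityCount + 63) / 64.
std::vector<std::uint64_t> bits(wordCount, 0);

// Pass 1: mark the entities that own a Motion component.
for (const Motion& m : motions)
    bits[m.entity / 64] |= std::uint64_t(1) << (m.entity % 64);

// Pass 2: an entity whose bit is already set also owns a Sprite,
// so it belongs to the result of the query.
std::vector<std::uint32_t> result;
for (const Sprite& s : sprites)
    if (bits[s.entity / 64] & (std::uint64_t(1) << (s.entity % 64)))
        result.push_back(s.entity);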
There are ways to do this in better than linear time (Log(N)/Log(64)) though it gets considerably more involved where you can actually perform a set intersection between two sets containing a hundred million elements each in under a millisecond. Here's a hint:

What is the data structure typically used to map input to the display?

My problem is how to map a given input to an area of the display. You can also see this as mapping an input to a widget of a GUI, but my intention is simply the generic case.
I'm assuming that when an input is triggered by the hardware sensor/OS I get a pair of [x,y] coordinates.
I was convinced that an array used as a lookup table would be enough for this: you create a 2D matrix where each element points to the widget occupying that given pixel.
But with this approach there is a problem: an array is a data structure that is "rusty" and doesn't scale at all. I'm not really buying that someone would do this kind of mapping with a simple array; for example, after a simple rescale of the window you would have to re-create that array, which is expensive in terms of computation and memory allocation, not to mention that you also have to keep both the hierarchy and the layout of the widgets internally. So there is a need for a much more flexible data structure, probably with random-access capability and really low complexity, around O(1) or O(log(N)).
I can't think of a good data structure that I know of that copes well with this scenario, so what is usually used in a GUI system to map input down to the single pixel?
I believe what you want is a quadtree.
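A minimal sketch of how that lookup could work (Rect, Widget, and the tree-building step are all assumed here; this shows only the point query):

#include <array>
#include <memory>
#include <vector>

struct QuadNode {
    Rect bounds;                                       // region this node covers
    std::array<std::unique_ptr<QuadNode>, 4> children; // all null at a leaf
    std::vector<Widget*> widgets;                      // widgets stored at leaves

    Widget* hitTest(int x, int y) {
        for (auto& c : children)               // internal node: descend into
            if (c && c->bounds.contains(x, y)) // the child containing the point
                return c->hitTest(x, y);
        for (Widget* w : widgets)              // leaf: test the few widgets here
            if (w->bounds().contains(x, y))
                return w;
        return nullptr;
    }
};

Rescaling the window then only means updating rectangles in the tree rather than rebuilding a per-pixel array, and lookup stays logarithmic in the tree depth.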

Data structure for handling a list of 3 integers

I'm currently coding a physical simulation on a lattice. I'm interested in describing loops in this lattice; they are closed curves composed of the edges of the lattice cells. I'm storing the information on the lattice edges (by information I mean a Boolean variable saying whether the edge counts towards composing a loop) in a 3-dimensional Boolean array.
I'm now thinking about a good structure to handle these loops. They are basically a list of edges, so I would need something like an array of 3D integer vectors, each edge being defined by 3 coordinates in my current parameterization. I'm already thinking about building a class around this "list" object, as I'll need methods for computing the loop diameter and probably more in the future.
But I'm definitely not sure about the choice of structure; my physics background hasn't taught me enough C++. So I'd like to hear your suggestions for shaping this piece of code. I would really enjoy discovering some new ways of coding this kind of thing.
You want two separate things. One is keeping track of all edges and allowing fast lookup of edge objects by an (int,int,int) index (you probably don't want int there, but something like size_t). This is entirely independent from your second goal of creating ordered subsets of these.
General Collection (1)
Since your edge database is going to be sparse (i.e. only a few of the possible indices will actually correspond to a particular edge), my prior suggestion of using a 3D matrix is unsuitable. Instead, you probably want to look up edges with a hash map.
How easy this is depends on the expected size of the individual integers. If you can manage with no more than 21 bits per integer (for instance if your integers are short int values, which have only 16 bits), then you can concatenate them into one 64-bit value, which already has an std::hash implementation. Otherwise, you will have to implement your own hash specialisation for, e.g., std::hash<std::array<uint32_t,3>> (which is also quite easy, and highly stackable).
Once you can hash your key, you can throw it into an std::unordered_map and be done with it. That thing is fast.
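A sketch of the 21-bits-per-coordinate variant (Edge stands in for your edge type; the shifts assume each coordinate fits in 21 bits):

#include <cstdint>
#include <unordered_map>

std::uint64_t edgeKey(std::uint32_t x, std::uint32_t y, std::uint32_t z) {
    // Concatenate three 21-bit values into one 64-bit key.
    return (std::uint64_t(x) << 42) | (std::uint64_t(y) << 21) | std::uint64_t(z);
}

std::unordered_map<std::uint64_t, Edge> edges;
// edges[edgeKey(i, j, k)] looks up (or creates) the edge at (i, j, k).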
Loop detection (2)
Then you want short-lived data structures for identifying loops, so you want a data structure that grows at one end but never at the other. That means you're probably fine with an std::vector, or possibly an std::deque if you have very large instances (but try the vector first!).
I'd suggest simply keeping the index of an edge in the local vector. You can always look up the edge object in your unordered_map. Then the question is how to represent the index. If Int is your integer type (e.g. int, size_t, short, ...), it's probably most consistent to use an std::array<Int,3>; if the types of the integers differ, you'll want an std::tuple<...>.

Bidirectional data structure for this situation

I'm studying a little part of my game engine and wondering how to optimize some parts.
The situation is quite simple and it is the following:
I have a map of Tiles (stored in a bi-dimensional array) (~260k tiles, but assume many more)
I have a list of Items, each of which is always in exactly one tile
A Tile can logically contain an unlimited number of Items
During game execution many Items are continuously created, and each starts in its own Tile
Every Item continuously changes its Tile to one of the neighbors (up, right, down, left)
So far every Item has a reference to its current Tile, and I just keep a list of items.
Every time an Item moves to an adjacent tile I just update item->tile = ... and I'm fine. This works well, but it's unidirectional.
While extending the engine I realized that I have to find all the items contained in a tile many times, and this is effectively degrading performance (especially in situations where I have to find all the items for a whole range of tiles, one by one).
This means I would like to find a data structure suitable for finding all the items of a specific Tile in better than O(n), but I would like to avoid much overhead in the "moving from one tile to another" phase (right now it's just a pointer assignment, and I would like to avoid doing many operations there, since it's quite frequent).
I'm thinking about a custom data structure to exploit the fact that items always move to a neighboring cell, but I'm currently groping in the dark! Any advice would be appreciated, even tricky or cryptic approaches. Unfortunately I can't just waste memory, so a good trade-off is needed too.
I'm developing it in C++ with the STL but without Boost. (Yes, I do know about multimap; it doesn't satisfy me, but I'll try it if I don't find anything better.)
struct Coordinate {
    int x, y;
    // std::map needs a strict weak ordering on its key type:
    bool operator<(const Coordinate& o) const { return x < o.x || (x == o.x && y < o.y); }
};
map<Coordinate, set<Item*>> tile_items;
This maps coordinates on the tile map to sets of Item pointers indicating which items are on that tile. You wouldn't need an entry for every coordinate, only the ones that actually have items on them. Now, I know you said this:
but I would like to avoid much overhead in the "moving from one tile to another" phase
And this method would involve adding more overhead in that phase. But have you actually tried something like this yet and determined that it is a problem?
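For what it's worth, the move would look something like this, so the overhead is one erase and one insert in logarithmic time (a sketch reusing the declarations above):

void moveItem(map<Coordinate, set<Item*>>& tile_items,
              Item* item, Coordinate from, Coordinate to)
{
    auto it = tile_items.find(from);
    if (it != tile_items.end()) {
        it->second.erase(item);
        if (it->second.empty())
            tile_items.erase(it); // keep the map sparse
    }
    tile_items[to].insert(item);
}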
I would wrap a std::vector in a matrix type (i.e. impose 2D access on a 1D array); this gives you fast random access to any of your tiles (implementing the matrix is trivial).
use
vector_index = y_pos * x_size + x_pos; // row-major: the stride is the row width (x_size)
to index a vector of size
vector_size=y_size*x_size;
Then each tile can have a std::vector of items (if the number of items per tile is very dynamic, maybe a deque); again, these are random-access containers with very minimal overhead.
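A sketch of the matrix wrapper plus per-tile item lists (just one way to write it):

#include <cstddef>
#include <vector>

template <typename T>
class Matrix {
    std::size_t x_size, y_size;
    std::vector<T> data; // flat storage, row-major
public:
    Matrix(std::size_t w, std::size_t h) : x_size(w), y_size(h), data(w * h) {}
    T&       at(std::size_t x, std::size_t y)       { return data[y * x_size + x]; }
    const T& at(std::size_t x, std::size_t y) const { return data[y * x_size + x]; }
};

// Matrix<std::vector<Item*>> tiles(width, height);
// tiles.at(x, y).push_back(item); // the items on the tile at (x, y)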
I would stay away from indirect containers for your use case.
PS: if you want you can have my matrix template.
If you really think having each tile store its items will cost you too much space, consider using a quadtree to store the items instead. This still allows you to efficiently get all the items on a tile, but leaves your Tile grid in place for item movement.