This is a question about good practice.
Consider a situation that is typical e.g. in 3D engines, physics engines, finite element methods, or classical molecular dynamics solvers: you have objects of various types (e.g. vertices, edges, faces, bounded solid volumes) which are cross-linked to each other (e.g. a vertex knows which edges are connected to it and vice versa). For both the performance and the convenience of such an engine, it is crucial to be able to quickly browse this network of connections.
The question is: is it better to refer to a linked object by its index in an array, or by a pointer? ... especially performance-wise.
typedef uint16_t index_t;

class Edge;   // forward declarations, since the classes refer to each other
class Face;

class Vertex{
    Vec3 pos;
#ifdef BY_POINTER
    Edge* edges[nMaxEdgesPerVertex];
    Face* faces[nMaxFacesPerVertex];
#else
    index_t edges[nMaxEdgesPerVertex];
    index_t faces[nMaxFacesPerVertex];
#endif
};
class Edge{
    Vec3 direction;
    double length;
#ifdef BY_POINTER
    Vertex* verts[2];
    Face* faces[nMaxFacesPerEdge];
#else
    index_t verts[2];
    index_t faces[nMaxFacesPerEdge];
#endif
};
class Face{
    Vec3 normal;
    double isoVal; // plane equation: normal.dot(test_point) == isoVal
#ifdef BY_POINTER
    Vertex* verts[nMaxVertsPerFace];
    Edge* edges[nMaxEdgesPerFace];
#else
    index_t verts[nMaxVertsPerFace];
    index_t edges[nMaxEdgesPerFace];
#endif
};
#ifndef BY_POINTER
// we could use another data structure here, such as std::vector or even some hash map
int nVerts, nEdges, nFaces;
Vertex verts[nMaxVerts];
Edge edges[nMaxEdges];
Face faces[nMaxFaces];
#endif
Advantages of indices:
using an index can be more memory-efficient if we use uint8_t or uint16_t for the index instead of a 32-bit or 64-bit pointer;
an index can carry some additional information (e.g. about the orientation of an edge) encoded in some of its bits;
the ordering of objects in the array can carry some information about the structure (e.g. the vertices of a cube could be ordered as {0b000,0b001,0b010,0b011,0b100,0b101,0b110,0b111}). This information is not visible in pointers.
Advantages of pointers:
We don't need to care about the arrays (or other data structures) used to store the objects. The objects can simply be allocated dynamically on the heap by new Vertex().
They may be faster (?) because dereferencing does not need to add the base address of the array (?). But this is probably negligible compared to memory latency (?)
using an index can be more memory-efficient if we use uint8_t or uint16_t for the index instead of a 32-bit or 64-bit pointer
True. A smaller representation reduces the total size of the structure, which means fewer cache misses when traversing it.
an index can carry some additional information (e.g. about the orientation of an edge) encoded in some of its bits
True.
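For instance, here is a minimal sketch (hypothetical names, assuming a uint16_t index with a spare top bit) of encoding an edge's orientation inside the index itself:

#include <cstdint>

typedef uint16_t index_t;

// The top bit stores the edge orientation; the low 15 bits store the
// array index (so at most 32768 edges).
const index_t ORIENT_BIT = 0x8000;

inline index_t makeEdgeRef(index_t i, bool reversed){
    return reversed ? index_t(i | ORIENT_BIT) : i;
}
inline index_t edgeIndex(index_t ref){ return index_t(ref & 0x7FFF); }
inline bool isReversed(index_t ref){ return (ref & ORIENT_BIT) != 0; }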
We don't need to care about the arrays (or other data structures) used to store the objects. The objects can simply be allocated dynamically on the heap by new Vertex().
This is exactly what you don't want to do, performance-wise. You want to be sure the Vertex objects are all packed together, to avoid unnecessary cache misses, and in this case the array saves you from that temptation.
You also want to access them sequentially, at least as much as possible, again to minimize cache misses.
How packed, how small, and how sequentially accessed your data structures are is what actually drives performance.
They may be faster (?) because dereferencing does not need to add the base address of the array (?). But this is probably negligible compared to memory latency (?)
Possibly negligible. It probably depends on the specific hardware and/or compiler.
One more advantage of indices that is missing from the list: they are easier to manage when a reallocation occurs.
Consider a structure that can grow, like the following:
struct VertexList
{
    std::vector<Vertex> vertices;
    Vertex* start; // you can still access via the vector if you prefer; start = &vertices[0];
};
If you are referencing a given vertex using pointers, and a reallocation occurs, you will end up with an invalid pointer.
For performance, what matters is the speed with which you can read the "next" element in whatever traversal order(s) are commonly done in the hot path.
For example, if you have a series of edges which represent some path, you would want them to be stored contiguously in memory (not using new for each one), in the order in which they are connected.
For this case (edges forming a path), it's clear that you do not need pointers, and you also do not need indexes. The connections are implied by the storage locations, so you just need a pointer to the first and perhaps the last edges (i.e. you can store the whole path in a std::vector<Edge>).
A second example illustrating domain knowledge that we can exploit: imagine we have a game supporting up to 8 players and want to store "who has visited each of the edges in a path." Again we need neither pointers nor indexes to refer to the 8 players. Instead, we can simply store a uint8_t inside each Edge and use its bits as flags, one per player. Yes, this is low-level bit banging, but it gives us compact storage and efficient lookup once we have an Edge*. If we need to do the lookup in the other direction, from players to Edges, the most efficient approach is to store, e.g., a vector of uint32_t indices inside each player object and index into the Edge array.
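A minimal sketch of the player-flags idea (hypothetical names, under the 8-player assumption):

#include <cstdint>

// One bit per player, stored directly inside each Edge.
struct Edge {
    // ... geometry members elided ...
    uint8_t visitedBy = 0;   // bit p is set once player p has visited this edge
};

inline void markVisited(Edge& e, unsigned player){ e.visitedBy |= uint8_t(1u << player); }
inline bool hasVisited(const Edge& e, unsigned player){ return (e.visitedBy >> player) & 1u; }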
But what if edges can be added and removed in the middle of a path? Well then we might want a linked list. In this case we should use an intrusive linked list, and allocate the Edges in a pool. Having done this, we can store pointers to the Edges in each player object, and they will never change or need to be updated. We use an intrusive linked list with the understanding that an Edge is only ever part of a single path, so extrinsic storage of the linked-list pointers would be wasteful (std::list needs to store pointers to each object; an intrusive list does not).
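A rough sketch of the pool plus intrusive list (hypothetical names; error handling and pool capacity checks elided):

#include <cstddef>
#include <vector>

// The pool reserves all storage up front, so Edge addresses never change.
struct Edge {
    // ... payload elided ...
    Edge* next = nullptr;   // intrusive link; valid because an Edge belongs to one path only
};

struct EdgePool {
    std::vector<Edge> storage;
    explicit EdgePool(std::size_t capacity){ storage.reserve(capacity); }
    Edge* allocate(){ storage.emplace_back(); return &storage.back(); }
};

// Inserting a new edge after 'prev' in O(1), without invalidating any pointers:
inline void insertAfter(Edge* prev, Edge* e){ e->next = prev->next; prev->next = e; }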
So each case must be considered individually, with as much knowledge of the domain as we can discover in advance. Neither pointers nor indexes should be the automatic first choice.
Related
I have an object-based adjacency list graph that consists of nodes and edges stored in a vector.
class Graph
{
    struct NodePrivate
    {
        QVector<int> m_FromEdges, m_ToEdges;
    };
    struct EdgePrivate
    {
        int m_iFrom, m_iFromIndex, m_iTo, m_iToIndex;
    };
    //...
private:
    QVector<NodePrivate> m_Nodes;
    QVector<EdgePrivate> m_Edges;
};
In order to ensure contiguity (and constant-time removal) of the graph elements, I do removals by swapping the last element with the one to be removed.
Now when the user of the graph accesses the elements, they do so via Node and Edge classes that are really just wrappers around an index into the graph (an int).
class Item
{
    //...
private:
    int m_Index = -1;              // or QSharedPointer<int>, see below
    const Graph *m_Graph = nullptr;
};
class Node : public Item {};
class Edge : public Item {};
Removing a node or an edge can make these indexes invalid. I would like them to be persistent, and so far I have tried (successfully) two strategies, but I do not like either of them very much:
1) Track all objects of type Node and Edge by registering them in the constructor(s) and deregistering them in the destructor. These registrations are then used to update the internal index whenever the relevant index changes. The biggest drawback of this is the large number of unnecessarily registered temporaries.
2) The other option is a smart-pointer approach that makes the index dynamic (std::shared_ptr<int>). The index is then updated through that pointer, which is arguably better than updating all the objects, but at the cost of dynamic memory.
Is there any other option to implement this or improve upon these two designs?
First of all, I must admit that I don't think this problem can be solved perfectly. If you really want to make a lot of small changes to your graphs regularly, then you should switch to storing everything in linked lists instead of arrays. Alternatively, you can simply give up and state explicitly that all Node and Edge handles are invalidated, just as std::vector::iterator-s are invalidated when you add an element to a std::vector.
General discussion
In your case, vertices and adjacency lists are stored in arrays. Also, you have the Node and Edge helpers, which allow the user to point to the real nodes and edges whenever they want. I'll call them handles (they are like C++ iterators without any iteration capabilities). I see two different ways of maintaining the handles across changes.
The first way is to store a direct pointer (or index) to the physical object in each handle, as you do now. In this case you have to update all the handles to an object whenever that object is moved, which is why you absolutely must register somewhere all the handles you give away. This is exactly the first solution you suggest, and it leads to "heavy" handles: creating, deleting and copying handles becomes costly, regardless of whether any objects are actually moved.
The second way is to store a pointer to some intermediate thing inside each handle, and to make sure that this thing never changes during the object's lifetime, even if the object moves. Clearly, the thing a handle points to must be something other than the real physical index of your node or edge, since those change. In this approach you pay for an indirect access each time a handle is dereferenced, so handle access becomes slightly heavier.
The second solution you propose follows this second approach. The intermediate things (pointed to by your handles) are dynamically allocated int-s wrapped in shared_ptr, one never-moving int per object. You suffer at least one separate dynamic allocation (plus deallocation) per object created, as well as reference-counter updates. The reference counters can easily be removed: store unique_ptr-s in the NodePrivate and EdgePrivate objects, and raw pointers in the Node and Edge objects.
New approach
The other solution following the second approach is to use IDs as the intermediate things pointed to by handles. Whenever you create a node, assign it a new node ID; same for edges. Assign IDs sequentially, starting from zero. Now you can maintain a bidirectional correspondence between physical indices and these IDs, and update it in O(1) time on each change.
struct NodePrivate
{
    QVector<int> m_FromEdges, m_ToEdges;
    int id; // getting ID by physical index
};
struct EdgePrivate
{
    int m_iFrom, m_iFromIndex, m_iTo, m_iToIndex;
    int id; // getting ID by physical index
};
private:
    QVector<NodePrivate> m_Nodes;
    QVector<EdgePrivate> m_Edges;
    QVector<int> m_NodeById; // getting physical index by ID
    QVector<int> m_EdgeById; // getting physical index by ID
Note that these new m_NodeById and m_EdgeById vectors grow when objects are created but do not shrink when objects are deleted. So you'll have empty cells in these arrays, which will only be deallocated when you delete the graph. As a result, this solution is only suitable if you are sure that the total number of nodes and edges created during the graph's lifetime is relatively small, since you pay 4 bytes of memory for each such object.
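For illustration, a hypothetical sketch of how a removal could keep both directions of the correspondence valid in O(1) (swap-with-last; incident-edge cleanup and error handling elided):

void Graph::removeNode(int id)
{
    int index = m_NodeById[id];             // physical index of the victim
    int lastIndex = m_Nodes.size() - 1;
    int movedId = m_Nodes[lastIndex].id;    // ID of the node about to be moved

    m_Nodes[index] = m_Nodes[lastIndex];    // swap-with-last removal
    m_Nodes.removeLast();

    m_NodeById[movedId] = index;            // the moved node has a new physical index
    m_NodeById[id] = -1;                    // this ID no longer maps anywhere
}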
Improving memory consumption
You might have already noticed the similarity between the solution just presented and your shared_ptr-based solution. In fact, if we do not distinguish between C pointers and array indices, they are the same, except that in your solution the int-s are allocated on the heap, while in the proposed solution they are allocated from a pool allocator.
A very well-known improvement to a no-free pool allocator is the technique known as free lists, and we can apply it to the solution described above. Instead of always assigning new IDs to created objects, we allow IDs to be reused. To achieve that, we store a stack of free IDs. When an object is removed, we push its ID onto this stack. When a new object is created, we take an ID for it from the stack; if the stack is empty, we assign a new ID.
struct EdgePrivate
{
    int m_iFrom, m_iFromIndex, m_iTo, m_iToIndex;
    int id; // getting ID by physical index
};
private:
    QVector<EdgePrivate> m_Edges;
    QVector<int> m_EdgeById;     // getting physical index by ID
    QVector<int> m_FreeEdgeIds;  // free list: stack of IDs to be reused
This improvement ensures that memory consumption is proportional to the maximum number of objects ever alive simultaneously (not the total number of objects created). But of course it increases the per-object memory overhead even further. It saves you the malloc/free cost, but you can still run into memory fragmentation for large graphs after many operations.
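For completeness, a sketch of how ID allocation and release could look with the free list (hypothetical helper names):

int Graph::allocateEdgeId()
{
    if (!m_FreeEdgeIds.isEmpty())
        return m_FreeEdgeIds.takeLast();  // reuse a previously released ID
    m_EdgeById.append(-1);                // brand-new ID: grow the mapping
    return m_EdgeById.size() - 1;
}

void Graph::releaseEdgeId(int id)
{
    m_EdgeById[id] = -1;                  // no physical edge behind this ID anymore
    m_FreeEdgeIds.append(id);             // push it on the stack for reuse
}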
I have to read a file in which a matrix of cars is stored (1=BlueCar, 2=RedCar, 0=Empty).
I need to write an algorithm to move the cars of the matrix in that way:
blue ones move downward;
red ones move rightward;
there is a turn in which all the blue ones move and a turn to move all the red ones.
Before the file is read I don't know the matrix size and if it's dense or sparse, so I have to implement two data structures (one for dense and one for sparse) and two algorithms.
I need to reach the best time and space complexity possible.
Due to the unknown matrix size, I am thinking of storing the data on the heap.
If the matrix is dense, I am thinking of using something like:
short int** M = new short int*[m];
short int* M_data = new short int[m*n];
for(int i = 0; i < m; ++i)
{
    M[i] = M_data + i * n;
}
With this structure I can allocate one contiguous block of memory, and it is also simple to access with M[i][j].
Now the problem is which structure to choose for the sparse case. I also have to consider how the algorithm can move the cars in the simplest way: for example, when I examine a car, I need to easily find out whether the next position (downward or rightward) holds another car or is empty.
Initially I thought of defining BlueCar and RedCar objects that inherit from a general Car object. In these objects I could save the matrix coordinates and then put them in:
std::vector<BlueCar> sparseBlu;
std::vector<RedCar> sparseRed;
Otherwise I can do something like:
std::vector<std::tuple<int, int, short>> sparseMatrix; // (row, column, value)
But the problem of finding what's in the next position still remains.
Probably this is not the best way to do it, so how can I implement the sparse case efficiently? (Ideally also using a single structure for the sparse data.)
Why not simply create a memory mapping directly over the file? (This assumes your data 0,1,2 is stored in contiguous bytes (or bits) in the file, and that the position of those bytes also encodes the coordinates of the cars.)
This way you don't need to allocate extra memory and read in all the data, and the data can simply and efficiently be accessed with M[i][j].
Going over the rows would be L1-cache friendly.
In case of very sparse data, you could scan through the data once and keep a list of the empty regions/blocks in memory (only need to store startpos and size), which you could then skip (and adjust where needed) in further runs.
With memory mapping, only frequently accessed pages are kept in memory. This means that once you have scanned for the empty regions, memory will only be allocated for the frequently accessed non-empty regions (all this will be done automagically by the kernel - no need to keep track of it yourself).
Another benefit is that you are accessing the OS disk cache directly. Thus no need to keep copying and moving data between kernel space and user space.
To further optimize space and memory usage, the cars could be stored in 2 bits each in the file.
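For illustration, a minimal POSIX sketch of the mapping itself (hypothetical function name, no error handling; assumes one byte per cell):

#include <cstddef>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Maps the whole file read-write so cars can be moved in place.
unsigned char* mapMatrixFile(const char* path, std::size_t& size)
{
    int fd = open(path, O_RDWR);
    struct stat st;
    fstat(fd, &st);
    size = static_cast<std::size_t>(st.st_size);
    void* p = mmap(nullptr, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);   // the mapping remains valid after closing the descriptor
    return static_cast<unsigned char*>(p);
}
// With one byte per cell, cell (i,j) of an m x n matrix is data[i * n + j].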
Update:
I'll have to move cars with OpenMP and MPI... Will the memory mapping work also with concurrent threads?
You could certainly use multithreading, but I'm not sure OpenMP would be the best solution here, because if you work on different parts of the data at the same time you may need to check some overlapping regions (i.e. a car could move from one block to another).
Alternatively, you could let the threads work on the middle parts of the blocks first, and then start other threads to handle the boundaries (with red cars that would be one byte; with blue cars, a full row).
You would also need a locking mechanism for adjusting the list of sparse regions. I think the best approach would be to launch separate threads for that (depending on the size of the data, of course).
In a somewhat similar task, I simply made use of Compressed Row Storage.
The Compressed Row and Column (in the next section) Storage formats are the most general: they make absolutely no assumptions about the sparsity structure of the matrix, and they don't store any unnecessary elements. On the other hand, they are not very efficient, needing an indirect addressing step for every single scalar operation in a matrix-vector product or preconditioner solve.
You will need to be a bit more specific about time and space complexity requirements. CSR requires an extra indexing step for simple operations, but that is a minor amount of overhead if you're just doing simple matrix operations.
There's already an existing C++ implementation available online as well.
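For illustration, a rough sketch of what a CSR layout could look like for this problem (hypothetical names; the build step is elided):

#include <algorithm>
#include <vector>

// Only non-empty cells are stored, with one row-start entry per matrix row.
struct CsrMatrix {
    int rows = 0, cols = 0;
    std::vector<int>  rowStart; // size rows+1; row i occupies [rowStart[i], rowStart[i+1])
    std::vector<int>  col;      // column of each stored cell, sorted within each row
    std::vector<char> val;      // 1 = blue car, 2 = red car
};

// Returns 0 if cell (i,j) is empty, otherwise the stored value.
inline char cellAt(const CsrMatrix& m, int i, int j)
{
    auto first = m.col.begin() + m.rowStart[i];
    auto last  = m.col.begin() + m.rowStart[i + 1];
    auto it = std::lower_bound(first, last, j);
    return (it != last && *it == j) ? m.val[it - m.col.begin()] : 0;
}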
I have a class that implements two simple, pre-sized stacks; they are stored as vector members of the class, pre-sized by the constructor. They are small, cache-line-friendly objects.
Those two stacks are constant in size, persisted, and updated lazily; they are often accessed together by some computationally cheap methods that can, however, be called a large number of times (tens to hundreds of thousands of times per second).
All objects are already in a good state (the code is clean and does what it's supposed to do), and all sizes are kept under control (64 KB to 128 KB in most cases for the whole chain of ops including results; they rarely get close to 256 KB, so at worst an L2 look-up and often L1).
Some auto-vectorization comes into play, but other than that it's single-threaded code throughout.
The class, minus some minor things and padding, looks like this:
class Curve{
private:
    std::vector<ControlPoint> m_controls;
    std::vector<Segment> m_segments;
    unsigned int m_cvCount;
    unsigned int m_sgCount;
    std::vector<unsigned int> m_sgSampleCount;
    unsigned int m_maxIter;
    unsigned int m_iterSamples;
    float m_lengthTolerance;
    float m_length;
};
Curve::Curve(){
    m_controls = std::vector<ControlPoint>(CONTROL_CAP);
    m_segments = std::vector<Segment>(CONTROL_CAP - 3);
    m_cvCount = 0;
    m_sgCount = 0;
    m_sgSampleCount = std::vector<unsigned int>(CONTROL_CAP - 3);
    m_maxIter = 3;
    m_iterSamples = 20;
    m_lengthTolerance = 0.001f;
    m_length = 0.0f;
}
Curve::~Curve(){}
Bear with the verbosity, please; I'm trying to educate myself and make sure I'm not operating on some half-arsed knowledge:
Given the operations that are run on those and their actual use, performance is largely memory I/O bound.
I have a few questions related to optimal positioning of the data. Keep in mind this is on Intel CPUs (Ivy and a few Haswell) and with GCC 4.4; I have no other use cases for this:
I'm assuming that if the actual storage of the controls and segments is contiguous with the instance of Curve, that's an ideal scenario for the cache (size-wise the lot can easily fit on my target CPUs).
A related assumption is that if the vectors are distant from the instance of the Curve, and from each other, then as methods alternately access the contents of those two members there will be more frequent eviction and re-population of the L1 cache.
1) Is that correct (i.e. is data pulled in for an entire cache-sized stretch starting from the address first looked up on a new operation), or am I misunderstanding the caching mechanism, and can the cache pull and preserve multiple smaller stretches of RAM?
2) Following the above: so far, by pure circumstance, all my tests end up with the class instance and the vectors contiguous, but I assume that's just dumb luck, however statistically probable. Normally, instantiating the class reserves only the space for the object itself, and the vectors are then allocated in the next free contiguous chunk available, which is not guaranteed to be anywhere near my Curve instance if that instance previously found a small empty niche in memory.
Is this correct?
3) Assuming 1 and 2 are correct, or close enough functionally speaking, I understand that to guarantee performance I'd have to write an allocator of sorts that makes the class object itself large enough at instantiation, and then copy the vectors into it myself and refer to those from then on.
I can probably hack my way to something like that if it's the only way to work through the problem, but I'd rather not hack it horribly if there are nice/smart ways to go about something like that. Any pointers on best practices and suggested methods would be hugely helpful (beyond "don't use malloc it's not guaranteed contiguous", that one I already have down :) ).
If the Curve instance fits into a cache line and the data of the two vectors also fits into one cache line each, the situation is not that bad, because you then have four constant cache lines. If every element were accessed indirectly and positioned randomly in memory, every access to an element might cost you a fetch operation, which is avoided in that case. If both the Curve and its elements fit into fewer than four cache lines, you would reap further benefits from putting them into contiguous storage.
True.
If you used std::array, you would have the guarantee that the elements are embedded in the owning class, and you would not have the dynamic allocation (which in and of itself costs you memory space and bandwidth). You would then even avoid the indirect access that you would still have if you used a special allocator that puts the vector contents in contiguous storage with the Curve instance.
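For illustration, a sketch of that std::array variant (assuming CONTROL_CAP is a compile-time constant):

#include <array>

// All storage is embedded directly in the Curve object; no heap allocation.
class Curve {
private:
    std::array<ControlPoint, CONTROL_CAP>     m_controls;
    std::array<Segment, CONTROL_CAP - 3>      m_segments;
    std::array<unsigned int, CONTROL_CAP - 3> m_sgSampleCount;
    unsigned int m_cvCount = 0;
    unsigned int m_sgCount = 0;
    // remaining members as before ...
};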
BTW: Short style remark:
Curve::Curve()
{
    m_controls = std::vector<ControlPoint>(CONTROL_CAP, ControlPoint());
    m_segments = std::vector<Segment>(CONTROL_CAP - 3, Segment());
    ...
}
...should be written like this:
Curve::Curve():
    m_controls(CONTROL_CAP),
    m_segments(CONTROL_CAP - 3)
{
    ...
}
This is called an "initializer list"; search for that term for further explanation. Also, a default-initialized element, which you provide as the second parameter, is already the default, so there is no need to specify it explicitly.
Most of the time, the way you would make a grid of objects in C++ is like this:
class Grid {
private:
    GridObject m_objects[5][5];
};
But is there a way to save memory and not use tiles that have nothing in them? Say you had a chess board: what if you didn't want to use memory for the unoccupied spaces? I have been thinking of ways to do this, but they are inefficient.
I would have a structure like:
class GridObjectWithIndex
{
public:
    size_t index;
    GridObject obj;
};
Where index is a value of the form x + y * GRIDSIZE, so that you can easily decode the x and y from it.
And in your Grid class have:
class Grid
{
    std::vector<GridObjectWithIndex> grid_elements;
};
The first thing that comes to my mind is storing pointers (GridObject*) in the grid. This still consumes sizeof(GridObject*) bytes for non-existent elements, but it saves you from the nightmare of managing the data structure. If it's not absolutely critical, don't make it more complex than needed. It's definitely not worth it for chess (8x8).
fritzone's solution will be good if you have a small number of elements over a wide range of indices. However, think about read/write performance.
If you want to save space while sending the data over the network or saving it to a file, you may want to consider archiving algorithms, although those only make sense for large sets of data.
What you are asking for is called a sparse data structure, the most striking examples being sparse matrices.
The idea is to trade memory for CPU: use less memory at the cost of more operations. In general it means:
having a structure only representing the "unusual" values (with a fast access by coordinates)
keeping the "usual" value aside
As a proof of concept, you can use a std::map<std::pair<size_t, size_t>, Object> for the sparse structure and just keep another Object on the side for whenever a cell is not present in the map.
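A minimal sketch of that proof of concept (hypothetical names):

#include <cstddef>
#include <map>
#include <utility>

// Only "unusual" cells are stored; every other coordinate implicitly
// holds the default Object.
struct SparseGrid {
    std::map<std::pair<std::size_t, std::size_t>, Object> cells;
    Object defaultObject;   // the "usual" value, kept aside

    const Object& at(std::size_t x, std::size_t y) const {
        auto it = cells.find(std::make_pair(x, y));
        return it != cells.end() ? it->second : defaultObject;
    }
};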
This is the problem at hand:
I have several tens of thousands of arrays. Each array can be anywhere between 2 and 15 units in length.
The total length of all the elements in all the arrays, and the number of arrays, can be computed using some very low-cost calculations. But the exact number in each array is not known until some fairly expensive computation has completed.
Since I know the total length of all the elements in all the arrays, I would like to allocate data for everything with just one new/malloc and set pointers within this allocation. In my current implementation I use memmove to move the data after an item is inserted, and I update all pointers accordingly.
Is there a better way of doing this?
Thanks,
Sid
It's not clear what you mean by a better way. If you are looking for something that works faster and can afford some extra memory, you can keep two arrays: one with the data, and one with the index of the array each element belongs to. After you have added all the data, you sort by the index, which groups your data by array; finally, you sweep the result and record the pointer to where each array begins.
Regarding memory consumption: depending on how many arrays you have and how big your data is, you can squeeze the index into the top bits of your data, provided the data is bounded by some number. That way you only need to sort the numbers, and while sweeping to retrieve the pointer where each array begins, you can mask off the top bits.
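A sketch of the two-array idea (Tagged and Item are hypothetical stand-ins for your element type):

#include <algorithm>
#include <vector>

// Tag every element with the logical array it belongs to, then sort by tag.
struct Tagged { int arrayIndex; Item data; };

void groupByArray(std::vector<Tagged>& items)
{
    std::stable_sort(items.begin(), items.end(),
        [](const Tagged& a, const Tagged& b){ return a.arrayIndex < b.arrayIndex; });
    // A single sweep over 'items' now yields the start offset of each array.
}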
Since I know the total length of all the elements in all the arrays, I would like to just allocate data for it using just one new/malloc and just set pointers within this allocation.
You can use one large vector. You'll need to manually calculate the offset of each sub-array yourself.
vectors guarantee that their data is stored in contiguous memory, but be careful about keeping references or pointers to individual elements if the vector is used in a way that may make it reallocate. That shouldn't be a problem here, since you're not adding anything beyond the initial size.
int main() {
    std::vector<T> vec;             // T stands in for your element type
    vec.resize(calc_total_size());  // resize (not reserve), so the elements actually exist
    // now you'll need to manually translate the offset of
    // a given "array" and then add the offset of the element to that
    T someElem = vec[array_offset + element_offset];
}
Yes, there is a better way:
std::vector<std::vector<Item>> array;
array.resize(cheap_calc());
for(size_t i = 0; i < array.size(); ++i) {
    array[i].resize(expensive_calc(i));
    for(size_t j = 0; j < array[i].size(); ++j) {
        array[i][j] = Item(some_other_calc());
    }
}
No pointers, no muss, no fuss.
Are you looking for memory efficiency, speed efficiency, or simplicity?
You can always write or download a dead-simple pool allocator and pass it as the allocator to the appropriate data structures. Because you know the total size in advance, and never need to resize vectors or add new ones, this can be even simpler than a typical pool allocator: just malloc all of the storage in one big block and keep a single pointer to the next free byte. To allocate n bytes: T *ret = (T *)nextBlock; nextBlock += n; return ret;. If your objects are trivial and don't need destruction, you can even just do one big free at the end.
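For illustration, a minimal sketch of such a bump allocator (hypothetical names; alignment handling omitted for brevity):

#include <cstddef>
#include <cstdlib>

// One big block, one moving pointer, a single free() at the end.
struct BumpAllocator {
    char* block;
    char* next;
    explicit BumpAllocator(std::size_t total)
        : block(static_cast<char*>(std::malloc(total))), next(block) {}
    ~BumpAllocator(){ std::free(block); }

    void* allocate(std::size_t n){ void* ret = next; next += n; return ret; }
};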
This means you can use any data structure you want, or compare and contrast different ones. A vector of vectors? A giant vector of cells plus a vector of offsets? Something else you came up with that sounds crazy but just might work? You can compare their readability, usability, and performance without worrying about the memory allocation side of things.
(Of course if your goal is speed, packing things this way may not be the best answer. You can often gain a lot of speed by wasting a little space to improve your cache and/or page alignment. You could write a fancy allocator that, e.g., allocates vector space in a transposed way to improve the performance of your algorithm that does column-major where it should do row-major and vice-versa, but at that point, it's probably easier to tweak your algorithms than your allocator.)