An indexed set (for efficient removal in a vector) - C++

I was just about to implement my own class for efficient removal from an array, but thought I'd ask to see if anything like it already exists. What I want is list-like insertion/removal efficiency, but backed by an array. I want to use an array for cache coherence, and so I don't have to keep calling the memory allocator (as std::list does when allocating nodes).
What I thought about doing was creating a class with two arrays. The first is a set of elements and the second array is a set of integers where each integer is a free slot in the first array. So I can add/remove elements from the array fairly easily, without allocating new memory for them, simply by taking an index from the free list and using that for the new element.
Does anything like this exist already? If I do my own, I'll have to also make my own iterators, so you can iterate the set avoiding any empty slots in the array, and I don't fancy that very much.
Thanks.
Note: The kind of operations I want to perform on the set are:
Iteration
Random access of individual elements, by index (or "handle" as I'm thinking of it)
Removal of an element anywhere in the set
Addition of an element to the set (order unimportant)
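For reference, here is a minimal sketch of what I have in mind (the names are just illustrative, and for_each is how I'd sidestep writing real iterators):
#include <cstddef>
#include <vector>

template <class T>
class IndexedSet {
public:
    std::size_t add(const T& value) {            // returns a handle (slot index)
        if (!free_.empty()) {                    // reuse a free slot if one exists...
            std::size_t i = free_.back();
            free_.pop_back();
            slots_[i] = value;
            alive_[i] = true;
            return i;
        }
        slots_.push_back(value);                 // ...otherwise grow the arrays
        alive_.push_back(true);
        return slots_.size() - 1;
    }
    void remove(std::size_t i) {                 // O(1): just mark the slot free
        alive_[i] = false;
        free_.push_back(i);
    }
    T&       operator[](std::size_t i)       { return slots_[i]; }
    const T& operator[](std::size_t i) const { return slots_[i]; }

    template <class F>
    void for_each(F f) {                         // iteration skips the empty slots
        for (std::size_t i = 0; i < slots_.size(); ++i)
            if (alive_[i]) f(slots_[i]);
    }
private:
    std::vector<T>           slots_;             // element storage
    std::vector<bool>        alive_;             // which slots are occupied
    std::vector<std::size_t> free_;              // indices of free slots (the "free list")
};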

std::list<T> actually does sound like the theoretically correct data structure for your job, because it supports all four operations you listed with optimal time complexity. std::list<T>::iterator is a handle that remains valid even if you add/remove other items to/from the list.
It may be that there is a custom allocator (i.e. not std::allocator<T>) that you could use with std::list<T, Allocator> to get the performance you want (internally pool the nodes so that you don't hit the runtime allocator every time you add or remove a node). But that might be overkill.
I would start just using a std::list<T> with the default allocator and then only look at custom allocators or other data structures if you find the performance is too bad for your application.
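If C++17 is available, one low-effort way to get that node pooling without writing an allocator yourself is std::pmr (a sketch, assuming single-threaded use; not necessarily what the answer above had in mind):
#include <list>
#include <memory_resource>

int main() {
    // Node memory comes from (and returns to) the pool instead of hitting the
    // global allocator on every insert/erase.
    std::pmr::unsynchronized_pool_resource pool;
    std::pmr::list<int> values(&pool);

    for (int i = 0; i < 1000; ++i)
        values.push_back(i);
    values.remove_if([](int v) { return v % 2 == 0; });  // freed nodes stay in the pool
    for (int i = 0; i < 500; ++i)
        values.push_back(i);                             // reuses pooled node memory
}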

If maintaining order of elements is irrelevant, use swap-and-pop.
Copy/move the last element over the one to be removed, then pop the back element. Super easy and efficient. You don't even need a special check for when the element being removed is the last one; it'll Just Work(tm) with the standard C++ vector operations.
*iter = std::move(container.back());
container.pop_back();
pop_back() only invalidates iterators and references to the erased last element (and the end iterator), so iter stays valid as long as it wasn't pointing at the back. If you want to be safe anyway, just use indices directly, or recalculate a new valid iterator:
auto delta = iter - container.begin();
// mutate container
iter = container.begin() + delta;
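Put together, a complete helper might look like this (just a sketch; the name unordered_erase is mine, not a standard function):
#include <utility>
#include <vector>

// Erase the element at 'it' without preserving order: O(1) instead of O(n).
template <class T>
typename std::vector<T>::iterator
unordered_erase(std::vector<T>& v, typename std::vector<T>::iterator it)
{
    auto index = it - v.begin();   // keep an index; it survives pop_back()
    *it = std::move(v.back());     // overwrite the victim with the last element
    v.pop_back();                  // drop the (moved-from) last element
    return v.begin() + index;      // element that replaced it (or end() if it was the back)
}

int main() {
    std::vector<int> v = {1, 2, 3, 4};
    unordered_erase(v, v.begin() + 1);   // v is now {1, 4, 3}
}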

You can use a single array by storing the information about the "empty" slots in the space of the empty slots.
For a contiguous block of empty slots in your array A, say of k slots starting from index n, store (k, n') at location A[n] (where n' is the index of the next block of free indexes). You may have to pack the two ints into a single word if your array is storing word-sized objects.
You're essentially storing a linked-list of free blocks, like a memory-manager might do.
It's a bit of a pain to code, but this'll allow you to allocate a free index in O(1) time, and to iterate through the allocated indices in O(n) time, where n is the number of allocated slots. Freeing an index will be O(n) time though in the worst case: this is the same problem as fragmented memory.
For the first free block, you can either store the index separately, or have the convention that you never allocate A[0] so you can always start a free-index search from there.
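For illustration, here is a simplified sketch of the single-array idea: each free slot stores the index of the next free slot in its own storage. Unlike the block-based scheme above, it does not coalesce runs of free slots and gives up cheap iteration over occupied slots, so treat it only as a starting point (SlotArray and all names are mine):
#include <cstddef>
#include <new>
#include <vector>

template <class T>
class SlotArray {
    union Slot {
        std::size_t next_free;                   // meaningful while the slot is free
        T           value;                       // meaningful while the slot is occupied
        Slot() : next_free(0) {}
        ~Slot() {}                               // element lifetimes managed by SlotArray
    };
public:
    explicit SlotArray(std::size_t capacity) : slots_(capacity), head_(0) {
        for (std::size_t i = 0; i < capacity; ++i)
            slots_[i].next_free = i + 1;         // 'capacity' acts as the "no free slot" marker
    }
    std::size_t add(const T& v) {                // precondition: not full (head_ < capacity)
        std::size_t i = head_;
        head_ = slots_[i].next_free;             // pop the head of the free list
        new (&slots_[i].value) T(v);             // construct the element in place
        return i;
    }
    void remove(std::size_t i) {
        slots_[i].value.~T();                    // destroy the element
        slots_[i].next_free = head_;             // push the slot onto the free list
        head_ = i;
    }
    T& operator[](std::size_t i) { return slots_[i].value; }
private:
    std::vector<Slot> slots_;                    // elements and free-list links share storage
    std::size_t       head_;                     // index of the first free slot
    // Note: this sketch does not run destructors of live elements on destruction,
    // and it must not be copied while it holds live elements.
};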

std::map might be useful in your case.

Related

Removing from the beginning of an std::vector in C++

I might be missing something very basic here but here is what I was wondering -
We know removing an element from the beginning of an std::vector ( vector[0] ) in C++ is an O(n) operation because all the other elements have to be shifted one place backwards.
But why isn't it implemented such that the pointer to the first element is moved one position ahead so that now the vector starts from the second element and, in essence, the first element is removed? This would be an O(1) operation.
std::array and C-style arrays are fixed-length; you can't change their length at all, so I think you have a typo there and mean std::vector instead.
"Why was it done that way?" is a bit of a historical question. Perspectively, if your system library allowed for giving back unused memory to the operating system, your "shift backwards" trick would disallow any reuse of the former first elements' memory later on.
Also, std::vector comes from systems (basically still the model used by every operating system) with calls like free and malloc, where you need to keep the pointer to the beginning of an allocated region around in order to free it later. Hence, you'd have to lug around another pointer in the std::vector structure just to be able to free the buffer, and that would only be useful if someone deleted from the front. And if you're deleting from the front, chances are you'd be better off using a reversed vector (and deleting from the end), or a vector isn't the right data structure altogether.
It is not impossible for a vector to be implemented like that (it just wouldn't be std::vector). You would need to keep a pointer to the first element in addition to a pointer to the underlying array (alternatively, an offset could be stored, but however you put it, the vector has to store more data).
Consider that this is useful only for one quite specific use case: erasing the first element. Once you have that, you could also benefit when inserting an element at the front while free space is left at the beginning, and with free space available even inserting into the first half could benefit by shifting only the first half.
However, all this does not fit with the concept of capacity. With std::vector you know exactly how many elements you can add before a reallocation occurs: capacity() - size(). With your proposal this would no longer hold: erasing the first element would affect capacity in an odd way. It would complicate the interface and the use of vectors for all use cases.
Further, erasing elements anywhere else would still not be O(1). In total it would incur a cost and add complexity for every use of the vector, while bringing an advantage only in a very narrow use case.
If you do find yourself in the situation that you need to erase the front element very often, then you can either store the vector in reverse, and erasing the last element is already O(1), or use a different container.
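For illustration, a rough comparison of the alternatives mentioned in these answers (std::deque for cheap front removal, or a reversed vector); this is a sketch, not a benchmark:
#include <deque>
#include <vector>

int main() {
    std::deque<int> d = {1, 2, 3, 4};
    d.pop_front();                         // O(1): nothing is shifted

    std::vector<int> v = {1, 2, 3, 4};
    v.erase(v.begin());                    // O(n): the remaining elements shift left

    // Alternative from the answers above: store the vector reversed and pop from the back.
    std::vector<int> r = {4, 3, 2, 1};     // logical order 1, 2, 3, 4 stored back-to-front
    r.pop_back();                          // removes the logical front element in O(1)
}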

Keeping constant number of elements in vector

I am trying to figure out the fastest way to keep constant number of elements in vector (or maybe there is some ready-made structure that do it automatically).
In my app I am adding multiple elements to the vector and I need to do it fast. Because of the vector's self-resizing, at some point this significantly decreases overall application speed. What I was thinking about is to do something like this:
if (my_vector.size() < 300)
    my_vector.push_back(new_element);
else
{
    my_vector.erase(my_vector.begin()); // std::vector has no pop_front()
    my_vector.push_back(new_element);
}
but after the first few tests I've realized that it might not be the best solution, because I am not sure that erasing the front element and then calling push_back() won't still trigger a resize at some point.
Is there any other solution for this?
Use a std::queue. Its underlying container is a std::deque, and like std::stack it restricts the interface, in this case to the FIFO operations push and pop (push_back and pop_front on the deque), which is exactly what you're doing in your situation. Here's why a deque is better for this situation:
The storage of a deque is automatically expanded and contracted as needed. Expansion of a deque is cheaper than the expansion of a std::vector because it does not involve copying of the existing elements to a new memory location.
The complexity (efficiency) of common operations on deques is as follows:
Random access - constant O(1)
Insertion or removal of elements at the end or beginning - constant O(1)
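Applied to the code in the question, a sketch (the 300-element cap comes from the question; int stands in for the real element type):
#include <queue>

std::queue<int> buffer;                    // backed by std::deque<int> by default

void add(int new_element) {
    if (buffer.size() >= 300)              // keep at most 300 elements, as in the question
        buffer.pop();                      // drops the oldest element: deque pop_front, O(1)
    buffer.push(new_element);              // deque push_back: amortized O(1), no wholesale copying
}

int main() {
    for (int i = 0; i < 1000; ++i)
        add(i);                            // buffer ends up holding the last 300 values
}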
To implement a fixed-size container with push_back/pop_front semantics and minimal memory shuffling, use a std::array of the appropriate size as a ring buffer. To keep track of things you'll need a write index for pushing elements and a read index for popping them (plus a count). To push, store the element at the write index, then advance that index modulo the container size. To pop, read the element at the read index and advance it the same way. With that in place, the code in the question will do what you need; see the sketch below.
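Here is a sketch of that ring buffer (the class and member names are mine; T must be default-constructible for this simple version):
#include <array>
#include <cstddef>

// Fixed-capacity FIFO ring buffer: no allocation after construction, no shifting.
template <class T, std::size_t N>
class RingBuffer {
public:
    bool push(const T& v) {                       // returns false when full
        if (count_ == N) return false;
        buf_[write_] = v;
        write_ = (write_ + 1) % N;                // advance and wrap
        ++count_;
        return true;
    }
    bool pop(T& out) {                            // returns false when empty
        if (count_ == 0) return false;
        out = buf_[read_];
        read_ = (read_ + 1) % N;                  // advance and wrap
        --count_;
        return true;
    }
    std::size_t size() const { return count_; }
private:
    std::array<T, N> buf_{};
    std::size_t read_ = 0, write_ = 0, count_ = 0;
};
Used like the code in the question: if size() has reached the cap, pop() one element, then push() the new one.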
You just need to reserve() capacity for a reasonable number of elements. The vector will not automatically shrink, so it will only grow and, at some point, stop reallocating.
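For example (300 is the cap from the question):
#include <vector>

int main() {
    std::vector<int> my_vector;
    my_vector.reserve(300);  // one up-front allocation; push_back won't reallocate until size() exceeds 300
}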
You might also be interested in growth policies. For example, Facebook did substantial research and created its own vector implementation, folly::fbvector, which claims better performance than std::vector.

Cache efficient way to implement a list of a priori bounded size

I have a list whose size is bounded a priori by a configuration option: it will never grow above N elements. Thus, it will fill up to N elements, and after that the only operation I need to support is moving an element in the list to the end of the list.
Ideally, I would like to have a data structure, where I have list nodes that contain a pointer to the stored object. Since I know there will be N list nodes at the most, I could allocate them contiguously in memory, so that jumping around the list would presumably be a fast operation, since elements in the list would probably be warm in the cache.
Is it possible to have the STL std::list allocate its nodes in this way, i.e. can this be done with a custom allocator? (I've never tried to write a custom allocator for an STL class before.)

Is there an equivalent of vector::reserve() for an std::list?

I have a class that looks like this:
typedef std::list<char*> PtrList;
class Foo
{
public:
    void DoStuff();
private:
    PtrList m_list;
    PtrList::iterator m_it;
};
The function DoStuff() basically adds elements to m_list or erases elements from it, finds an iterator to some special element in it and stores it in m_it. It is important to note that each value of m_it is used in every following call of DoStuff().
So what's the problem?
Everything works, except that profiling shows that operator new is invoked too often, due to the list::push_back() calls made from DoStuff().
To increase performance I want to preallocate memory for m_list in the initialization of Foo as I would do if it were an std::vector. The problem is that this would introduce new problems such as:
Less efficient insert and erase of elements.
m_it becomes invalid as soon as the vector is changed from one call to DoStuff() to the next. EDIT: Alan Stokes suggested to use an index instead of an iterator, solving this issue.
My solution: the simplest solution I could think of is to implement a pool of objects that also has a linked-list functionality. This way I get a linked list and can preallocate memory for it.
Am I missing something or is it really the simplest solution? I'd rather not "re-invent the wheel", and use a standard solution instead, if it exists.
Any thoughts, workarounds or enlightening comments would be appreciated!
I think you are using the wrong container.
If you want a fast push_back, don't automatically assume that you need a linked list. A linked list is a slow container; it is basically suitable for cheap reordering.
A better container is std::deque. A deque is basically an array of arrays: it allocates a block of memory and fills it as you push back, and when it runs out it allocates another block. This means that it allocates very infrequently, and you do not have to know the size of the container ahead of time (as you do with std::vector and reserve()) to get that efficiency.
You can use the splice() function of std::list to implement a pool. Add a new member variable PtrList m_Pool. When you want to add a new object and the pool is not empty, assign the value to the first element in the pool and then splice that node into the list. To erase an element, splice it from the list into the pool. A sketch of the idea is shown below.
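A sketch of that splice-based pool, using the PtrList typedef from the question (the wrapper name PooledList is mine):
#include <list>

typedef std::list<char*> PtrList;

struct PooledList {
    PtrList m_list;                         // the live elements
    PtrList m_Pool;                         // spare nodes kept for reuse

    void add(char* p) {
        if (!m_Pool.empty()) {
            m_Pool.front() = p;                                   // reuse the node's storage
            m_list.splice(m_list.end(), m_Pool, m_Pool.begin());  // relink the node: no allocation
        } else {
            m_list.push_back(p);                                  // allocates a new node
        }
    }
    void erase(PtrList::iterator it) {
        m_Pool.splice(m_Pool.end(), m_list, it);                  // return the node to the pool
    }
};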
But if you don't care about the order of the elements, then a deque can be much faster. If you want to erase an element in the middle, copy the last element onto the element you want to delete, then erase the last element.
My advice is the same as 111111's, try switching to deque before you write any significant code.
However, to directly answer your question: you could use std::list with a custom allocator. It's a bit fiddly, and I'm not going to work through all the details here, but the gist of it is that you write a class that represents the memory allocation strategy for list nodes. The nodes allocated by list will be a small implementation-defined amount larger than char*, but they will all be the same size, which means you can write an optimized allocator just for that size (a pool of memory blocks rather than a pool of objects), and you can add functions to it that let you reserve whatever space you want in the allocator, at the time you want. Then the list can allocate/free quickly. This saves you needing to re-implement any of the actual list functionality.
If you were (for some reason) going to implement a pool of objects with list functionality, then you could start with boost::intrusive. That might also be useful when writing your own allocator, for keeping track of your list of free blocks.
List and vector are completely different in the way they manage objects.
A vector constructs its elements in place in an allocated buffer of a given capacity; a new allocation happens only when the capacity is exhausted.
A list allocates its elements one by one, each in an individually allocated piece of memory.
Vector elements shift when something is inserted or removed, hence vector indexes and element addresses are not stable.
List elements are re-linked when something is inserted or removed, hence list iterators and element addresses are stable.
A way to make a list behave more like a vector is to replace the default allocator (which goes to the system every time it is invoked) with one that allocates memory in larger chunks and hands out sub-chunks to the list on demand.
This is not something the standard library provides by default.
You could potentially use list::get_allocator().allocate(). AFAIK, the default behaviour is to acquire memory as and when needed, due to the non-contiguous nature of lists (hence the lack of reserve()), but no major drawbacks of going through the allocator occur to me immediately. Provided you have a non-critical section in your program, at the start or wherever, you can at least choose to take the damage at that point.

array vs vector vs list

I am maintaining a fixed-length table of 10 entries. Each item is a structure of like 4 fields. There will be insert, update and delete operations, specified by numeric position. I am wondering which is the best data structure to use to maintain this table of information:
array - insert/delete takes linear time due to shifting; update takes constant time; no space is used for pointers; accessing an item using [] is faster.
stl vector - insert/delete takes linear time due to shifting; update takes constant time; no space is used for pointers; accessing an item goes through operator[], which may be marginally slower than raw array indexing.
stl list - insert and delete takes linear time since you need to iterate to a specific position before applying the insert/delete; additional space is needed for pointers; accessing an item is slower than an array since it is a linked list linear traversal.
Right now, my choice is to use an array. Is it justifiable? Or did I miss something?
Which is faster: traversing a list, then inserting a node or shifting items in an array to produce an empty position then inserting the item in that position?
What is the best way to measure this performance? Can I just display the timestamp before and after the operations?
Use STL vector. It provides an equally rich interface as list and removes the pain of managing memory that arrays require.
You will have to try very hard to expose the performance cost of operator[] - it usually gets inlined.
I do not have any number to give you, but I remember reading performance analysis that described how vector<int> was faster than list<int> even for inserts and deletes (under a certain size of course). The truth of the matter is that these processors we use are very fast - and if your vector fits in L2 cache, then it's going to go really really fast. Lists on the other hand have to manage heap objects that will kill your L2.
Premature optimization is the root of all evil.
Based on your post, I'd say there's no reason to make your choice of data structure here a performance based one. Pick whatever is most convenient and return to change it if and only if performance testing demonstrates it's a problem.
It is really worth investing some time in understanding the fundamental differences between lists and vectors.
The most significant difference between the two is the way they store elements and keep track of them.
- Lists -
A list contains elements which store the addresses of the previous and next elements. This means that you can INSERT or DELETE an element anywhere in the list at constant speed, O(1), regardless of the list size. You can also splice (insert another list) into the existing list anywhere at constant speed. The reason is that the list only needs to rewire a couple of pointers (the previous and next links) around the element being inserted.
Lists are not good if you need random access. If one plans to access the nth element in the list, one has to traverse the list one element at a time: O(n) speed.
- Vectors -
A vector stores its elements in sequence, just like an array. This is very convenient for random access: accessing the nth element of a vector is a simple pointer calculation (O(1) speed). Adding elements to a vector is, however, different. If one wants to add an element in the middle of a vector, all the elements that come after it have to be moved to make room for the new entry. The speed depends on the vector size and on the position of the new element: the worst case is inserting at the front, the best is appending at the end. Therefore insert works at speed O(n), where n is the number of elements that need to be moved, not necessarily the size of the vector.
There are other differences that involve memory requirements etc., but understanding these basic principles of how lists and vectors actually work is really worth spending some time on.
As always ... "Premature optimization is the root of all evil" so first consider what is more convenient and make things work exactly the way you want them, then optimize. For 10 entries that you mention - it really does not matter what you use - you will never be able to see any kind of performance difference whatever method you use.
Prefer std::vector over an array. Some advantages of vector are:
They allocate memory from the free store when increasing in size.
They are NOT a pointer in disguise.
They can increase/decrease in size run-time.
They can do range checking using at().
A vector knows its size, so you don't have to count elements.
The most compelling reason to use a vector is that it frees you from explicit memory management, and it does not leak memory. A vector keeps track of the memory it uses to store its elements. When a vector needs more memory for elements, it allocates more; when a vector goes out of scope, it frees that memory. Therefore, the user need not be concerned with the allocation and deallocation of memory for vector elements.
You're making assumptions you shouldn't be making, such as "accessing an item is slower than an array since it is a call to operator[]." I can understand the logic behind it, but neither you nor I can know until we profile it.
If you do, you'll find there is no overhead at all when optimizations are turned on: the compiler inlines the function calls. There is a difference in memory behaviour, though. An array is statically allocated, while a vector allocates dynamically. A list allocates per node, which can thrash the cache if you're not careful.
Some solutions are to have the vector allocate from the stack, and have a pool allocator for a list, so that the nodes can fit into cache.
So rather than worry about unsupported claims, you should worry about making your design as clean as possible. So, which makes more sense? An array, vector, or list? I don't know what you're trying to do so I can't answer you.
The "default" container tends to be a vector. Sometimes an array is perfectly acceptable too.
First a couple of notes:
A good rule of thumb about selecting data structures: Generally, if you examined all the possibilities and determined that an array is your best choice, start over. You did something very wrong.
STL lists don't support operator[], and if they did the reason that it would be slower than indexing an array has nothing to do with the overhead of a function call.
Those things being said, vector is the clear winner here. The call to operator[] is essentially negligible since the contents of a vector are guaranteed to be contiguous in memory. It supports insert() and erase() operations which you would essentially have to write yourself if you used an array. Basically it boils down to the fact that a vector is an upgraded array which already supports all the operations you need.
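For completeness, insert/erase/update at a numeric position with a vector looks like this (Entry and the chosen position are placeholders standing in for the question's 4-field struct and table):
#include <vector>

struct Entry { int a, b, c, d; };                          // stand-in for the 4-field struct

int main() {
    std::vector<Entry> table(10);                          // fixed-length table of 10 entries
    int pos = 3;
    table.insert(table.begin() + pos, Entry{1, 2, 3, 4});  // insert at numeric position (shifts later items)
    table.erase(table.begin() + pos);                      // erase at numeric position (shifts later items)
    table[pos] = Entry{5, 6, 7, 8};                        // update at numeric position, O(1)
}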
I am maintaining a fixed-length table of 10 entries. Each item is a structure of like 4 fields. There will be insert, update and delete operations, specified by numeric position. I am wondering which is the best data structure to use to maintain this table of information:
Based on this description it seems like list might be the better choice, since it is O(1) when inserting and deleting in the middle of the data structure (given an iterator to the position). Unfortunately you cannot use numeric positions to do inserts and deletes on a list the way you can with arrays/vectors. This dilemma leads to a slew of questions which can be used to make an initial decision about which structure may be best to use. This structure can later be changed if testing clearly shows it was the wrong choice.
The questions you need to ask are threefold. First, how often do you plan to do deletes/inserts in the middle relative to random reads? Second, how important is using a numeric position compared to an iterator? Finally, is order in your structure important?
If the answer to the first question is that random reads will be more prevalent, then a vector/array will probably work well. Note that iterating through a data structure is not considered a random read, even if operator[] notation is used. For the second question, if you absolutely require numeric positions, then a vector/array will be required, even though this may lead to a performance hit. Later testing may show this performance hit is negligible relative to the easier coding with numeric positions. Finally, if order is unimportant, you can insert and delete in a vector/array with an O(1) algorithm. A sample algorithm is shown below.
template <class T>
void erase(std::vector<T>& vect, int index) // note: the vector cannot be const since we modify it
{
    vect[index] = vect.back(); // move the item at the back into the slot being erased
    vect.pop_back();           // delete the item at the back
}

template <class T>
void insert(std::vector<T>& vect, int index, T value) // note: the vector cannot be const since we modify it
{
    vect.push_back(vect[index]); // copy the item at index to the back of the vector
    vect[index] = value;         // replace the item at index with the new value
}
I believe it depends on your needs: if you do many inserts/deletes at the beginning or in the middle, use a list (doubly linked internally); if you need random access and mostly add at the end, use an array or a vector (a vector is dynamically allocated, and if you also need operations such as sort, resize, etc., prefer a vector).