Maintaining an Ordered collection of objects

Maintaining an Ordered collection of objects - c++

I have the following requirements for a collection of objects:
Dynamic size (in theory unlimited, but in practice a couple of thousand should be more than enough)
Ordered, but allowing reorder and insertion at arbitrary locations.
Allows for deletion
Indexed Access - Random Access
Count
The objects I am storing are not large, a couple of properties and a small array or two (256 booleans)
Is there any built in classes I should know about before I go writing a linked list?

Original answer: That sounds like std::list (a doubly linked list) from the Standard Library.
New answer:
After your change to the specs, a std::vector might work as long as there aren't more than a few thousand elements and not a huge number of insertions and deletions in the middle of the vector. The linear complexity of insertion and deletion in the middle may be outweighed by the low constants on the vector operations. If you are doing a lot of insertions and deletions just at the beginning and end, std::deque might work as well.

-Insertion and Deletion: This is possible for any STL container, but the question is how long it takes to do it. Any linked-list container (list, map, set) will do this in constant time, while array-like containers (vector) will do it in linear time (with constant-amortized allocation).
-Sorting: Considering that you can keep a sorted collection at all times, this isn't much of an issue, any STL container will allow that. For map and set, you don't have to do anything, they already take care of keeping the collection sorted at all times. For vector or list, you have to do that work, i.e. you have to do binary search for the place where the new elements go and insert them there (but STL Algorithms has all the pieces you need for that).
-Resorting: If you need to take the current collection you have sorted with respect to rule A, and resort the collection with respect to rule B, this might be a problem. Containers like map and set are parametrized (as a type) by the sorting rule, this means that to resort it, you would have to extract every element from the original collection and insert them in a new collection with a new sorting rule. However, if you use a vector container, you can just use the STL sort function anytime to resort with whatever rule you like.
-Random Access: You said you needed random access. Personally, my definition of random access means that any element in the collection can be accessed (by index) in constant time. With that definition (which I think is quite standard), any linked-list implementation does not qualify and it leaves you with the only option of using an array-like container (e.g. std::vector).
Conclusion, to have all those properties, it would probably be best to use a std::vector and implement your own sorted insertion and sorted deletion (performing binary search into the vector to find the element to delete or the place to insert the new element). If your objects that you need to store are of significant size, and the data according to which they are sorted (name, ID, etc.) is small, you might consider splitting the problem by holding a unsorted linked-list of objects (with full information) and keeping a sorted vector of keys along with a pointer to the corresponding node in the linked-list (in that case, of course, use std::list for the former, and std::vector for the latter).
BTW, I'm no grand expert with STL containers, so I might have been wrong in the above, just think for yourself. Explore the STL for yourself, I'm sure you will find what you need, or at least all the pieces that you need. Maybe, look at Boost libraries too.

You haven't specified enough of your requirements to select the best container.
Dynamic size (in theory unlimited, but in practice a couple of thousand should be more than enough)
STL containers are designed to grow as needed.
Ordered, but allowing reorder and insertion at arbitrary locations.
Allowing reorder? A std::map can't be reordered: you can delete from one std::map and insert into another using a different ordering, but as different template instantiations the two variables will have different types. std::list has a sort() member function [thanks Blastfurnace for pointing this out], particularly efficient for large objects. A std::vector can be resorted easily using the non-member std::sort() function, particularly efficient for tiny objects.
Efficient insertion at arbitrary locations can be done in a map or list, but how will you find those locations? In a list, searching is linear (you must start from somewhere you already know about and scan forwards or backwards element by element). std::map provides efficient searching, as does an already-sorted vector, but inserting into a vector involves shifting (copying) all the subsequent elements to make space: that's a costly operation in the scheme of things.
Allows for deletion
All containers allow for deletion, but you have the exact-same efficiency issues as you do for insertion (i.e. fast for list if you already know the location, fast for map, deletion in vectors is slow, though you can "mark" elements deleted without removing them, e.g. making a string empty, having a boolean flag in a struct).
Indexed Access - Random Access
vector is indexed numerically (but can be binary searched), map by an arbitrary key (but no numerical index). list is not indexed and must be searched linearly from a known element.
Count
std::list provides an O(n) size() function (so that it can provide O(1) splice), but you can easily track the size yourself (assuming you won't splice). Other STL containers already have O(1) time for size().
Conclusions
Consider whether using a std::list will result in lots of inefficient linear searches for the element you need. If not, then a list does give you efficient insertion and deletion. Resorting is good.
A map or hash map will allow quick lookup and easy insertion/deletion, can't be resorted, but you can easily move the data out to another map with another sort criteria (with moderate efficiency.
A vector allows fast searching and in-place resorting, but the worst insert/deletion. It's the fastest for random-access lookup using the element index.

Related

iterate ordered versus unordered containers

I want to know which data-structures are more efficient for iterating through their elements between std::set, std::map and std::unordered_set, std::unordered_map.
I searched through SO and I found this question. The answers either propose to copy the elements in a std::vector or to use Boost.Container, which IMHO don't answer my question.
My purpose is to keep in a container a big number of unique elements, that most of the time I want to iterate through them. Insertions and extractions are more rare. I want to avoid std::vector in combination with std::unique.

Lets consider set vs unordered_set.
The main difference here is the 'nature' of the iteration, that is the traversal of the set will give you the elements in order while traversing a range in an unordered set will give you a bunch of values in no particular order.
Suppose you want to traverse a range [it1, it2]. If we exclude the lookup time that's needed to find elements it1 and it2 there can be no direct mapping from one case to another since the elements in between are not guarrandeed to be the same even if you've used the same elements to construct the container.
There are cases however where something like this has meaning when e.g. you want to traverse a fixed number of elements (regardless of what they are) or when you need to traverse the whole container. In such cases you need to consider implementation mechanics :
Sets are usually implemented like Red–black trees (a form of binary search trees). Like all binary search trees allow efficient in-order traversal (LRR: left root right) of their elements. That is to traverse you pay the cost of pointer chasing (just like traversing a list).
Unordered sets on the other hand are hash tables and to my knowledge the STL implementation uses hashing with chaining. That means (in a very very high level) that what's used for the structure is a (contiguous) buffer where each element is the head of a chain (list) that contains the elements. The way the elements are layed out across those chains (buckets) and across the buffer will affect the traversal time, however you'll be chasing pointers once again jumping through differents lists as well this time. I don't think it'll vary significantly from the tree case but won't be any better for sure.
In any case micro tuning and benchmarking will give you the answer for your particular application.

The difference does not lie between the ordering or lack of one but in the backing container. If it's a contiguous memory it should be fast to iterate over, due to simple implementation of iterator and cache friendliness.
Unordered containers are usually stored as a vector of vectors (or a similar thing), while ordered containers are implemented using trees, but it is left for implementation after all. This would suggest that iterating over unordered version should be waster. However this is left for implementation after all, and I saw implementations (which bent rules a little to be fair) with different behaviour.
Generally speaking, container performance is quite a complex topic and usually has to be tested in actual application to get reliable answer. There is plenty on implemention-defined stuff that might affect the performance. I'd go with hash_set if I had to go in blind. Copying into a vector might also turn out a good option.
EDIT: As #TonyD said in it's comment, there is a rule, that disallows invalidating iterators during addition of element when the max_load_factor() is not exceeded, this practically rules out backing containers which are contiguous in memory.
Thus, copying everything into a vector seems like even more reasonable option. If you need to remove duplicates, a feasible option might be to use http://en.cppreference.com/w/cpp/algorithm/sort and have dupes easily ignored. I have heard that using vector and sort to have a sorted array (or vector) is quite often a used option in case of need for a container that needs to be sorter and is being iterated over more often than modified.

iterate from fastest to slowest should be : set > map > unordered_set > unordered_map;
set is a little lighter than map, and they are ordered with binary tree rule, so should be faster than unordered_ containers.

vector vs. list from stl - remove method

std::list has a remove method, while the std::vector doesn't. What is the reason for that?

std::list<>::remove is a physical removal method, which can be implemented by physically destroying list elements that satisfy certain criteria (by physical destruction I mean the end of element's storage duration). Physical removal is only applicable to lists. It cannot be applied to arrays, like std::vector<>. It simply is not possible to physically end storage duration of an individual element of an array. Arrays can only be created and destroyed as a whole. This is why std::vector<> does not have anything similar to std::list<>::remove.
The universal removal method applicable to all modifiable sequences is what one might call logical removal: the target elements are "removed" from the sequence by overwriting their values with values of elements located further down in the sequence. I.e. the sequence is shifted and compacted by copying the persistent data "to the left". Such logical removal is implemented by freestanding functions, like std::remove. Such functions are applicable in equal degree to both std::vector<> and std::list<>.
In cases where the approach based on immediate physical removal of specific elements applies, it will work more efficiently than the generic approach I referred above as logical removal. That is why it was worth providing it specifically for std::list<>.

std::list::remove removes all items in a list that match the provided value.
std::list<int> myList;
// fill it with numbers
myList.remove(10); // physically removes all instances of 10 from the list
It has a similar function, std::list::remove_if, which allows you to specify some other predicate.
std::list::remove (which physically removes the elements) is required to be a member function as it needs to know about the memory structure (that is, it must update the previous and next pointers for each item that needs to be updated, and remove the items), and the entire function is done in linear time (a single iteration of the list can remove all of the requested elements without invalidating any of the iterators pointing to items that remain).
You cannot physically remove a single element from a std::vector. You either reallocate the entire vector, or you move every element after the removed items and adjust the size member. The "cleanest" implementation of that set of operations would be to do
// within some instance of vector
void vector::remove(const T& t)
{
erase(std::remove(t), end());
}
Which would require std::vector to depend on <algorithm> (something that is currently not required).
As the "sorting" is needed to remove the items without multiple allocations and copies being required. (You do not need to sort a list to physically remove elements).
Contrary to what others are saying, it has nothing to do with speed. It has to do with the algorithm needing to know how the data is stored in memory.
As a side note: This is also a similar reason why std::remove (and friends) do not actually remove the items from the container they operate on; they just move all the ones that are not going to be removed to the "front" of the container. Without the knowledge of how to actually remove an object from a container, the generic algorithm cannot actually do the removing.

Consider the implementation details of both containers. A vector has to provide a continuous memory block for storage. In order to remove an element at index n != N (with N being the vector's length), all elements from n+1 to N-1 need to be moved. The various functions in the <algorithm> header implement that behavior, like std::remove or std::remove_if. The advantage of these being free-standing functions is that they can work for any type that offers the needed iterators.
A list on the other hand, is implemented as a linked list structure, so:
It's fast to remove an element from anywhere
It's impossible to do it as efficiently using iterators (since the internal structure has to be known and manipulated).

In general in STL the logic is "if it can be done efficiently - then it's a class member. If it's inefficient - then it's an outside function"
This way they make the distinction between "correct" (i.e. "efficient") use of classes vs. "incorrect" (inefficient) use.
For example, random access iterators have a += operator, while other iterators use the std::advance function.
And in this case - removing elements from an std::list is very efficient as you don't need to move the remaining values like you do in std::vector

It's all about efficiency AND reference/pointer/iterator validity. list items can be removed without disturbing any other pointers and iterators. This is not true for a vector and other containers in all but the most trivial cases. Nothing prevents use the external strategy, but you have a superior options.. That said this fellow said it better than I could on a duplicate question
From another poster on a duplicate question:
The question is not why std::vector does not offer the operation, but
rather why does std::list offer it. The design of the STL is focused
on the separation of the containers and the algorithms by means of
iterators, and in all cases where an algorithm can be implemented
efficiently in terms of iterators, that is the option.
There are, however, cases where there are specific operations that can
be implemented much more efficiently with knowledge of the container.
That is the case of removing elements from a container. The cost of
using the remove-erase idiom is linear in the size of the container
(which cannot be reduced much), but that hides the fact that in the
worst case all but one of those operations are copies of the objects
(the only element that matches is the first), and those copies can
represent quite a big hidden cost.
By implementing the operation as a method in std::list the complexity
of the operation will still be linear, but the associated cost for
each one of the elements removed is very low, a couple of pointer
copies and releasing of a node in memory. At the same time, the
implementation as part of the list can offer stronger guarantees:
pointers, references and iterators to elements that are not erased do
not become invalidated in the operation.
Another example of an algorithm that is implemented in the specific
container is std::list::sort, that uses mergesort that is less
efficient than std::sort but does not require random-access iterators.
So basically, algorithms are implemented as free functions with
iterators unless there is a strong reason to provide a particular
implementation in a concrete container.

std::list is designed to work like a linked list. That is, it is designed (you might say optimized) for constant time insertion and removal ... but access is relatively slow (as it typically requires traversing the list).
std::vector is designed for constant-time access, like an array. So it is optimized for random access ... but insertion and removal are really only supposed to be done at the "tail" or "end", elsewhere they're typically going to be much slower.
Different data structures with different purposes ... therefore different operations.

To remove an element from a container you have to find it first. There's a big difference between sorted and unsorted vectors, so in general, it's not possible to implement an efficient remove method for the vector.

Queue-like data structure with random access element removal

Is there a data structure like a queue which also supports removal of elements at arbitrary points? Enqueueing and dequeueing occur most frequently, but mid-queue element removal must be similar in speed terms since there may be periods where that is the most common operation. Consistency of performance is more important than absolute speed. Time is more important than memory. Queue length is small, under 1,000 elements at absolute peak load.In case it's not obvious I'll state it explicitly: random insertion is not required.
Have tagged C++ since that is my implementation language, but I'm not using (and don't want to use) any STL or Boost. Pure C or C++ only (I will convert C solutions to a C++ class.)
Edit: I think what I want is a kind of dictionary that also has a queue interface (or a queue that also has a dictionary interface) so that I can do things like this:
Container.enqueue(myObjPtr1);
MyObj *myObjPtr2 = Container.dequeue();
Container.remove(myObjPtr3);

I think that double-link list is exactly what you want (assuming you do not want a priority queue):
Easy and fast adding elements to both ends
Easy and fast removal of elements from anywhere
You can use std::list container, but (in your case) it is difficult to remove an element
from the middle of the list if you only have a pointer (or reference) to the element (wrapped in STL's list element), but
you do not have an iterator. If using iterators (e.g. storing them) is not an option - then implementing a double linked list (even with element counter) should be pretty easy. If you implement your own list - you can directly operate on pointers to elements (each of them contains pointers to both of its neighbours). If you do not want to use Boost or STL this is probably the best option (and the simplest), and you have control of everything (you can even write your own block allocator for list elements to speed up things).

One option is to use an order statistic tree, an augmented tree structure that supports O(log n) random access to each element, along with O(log n) insertion and deletion at arbitrary points. Internally, the order statistic tree is implemented as a balanced binary search treewith extra information associated with it. As a result, lookups are a slower than in a standard dynamic array, but the insertions are much faster.
Hope this helps!

You can use a combination of a linked list and a hash table. In java it is called a LinkedHashSet.
The idea is simple, have a linked list of elements, and also maintain a hash map of (key,nodes), where node is a pointer to the relevant node in the linked list, and key is the key representing this node.
Note that the basic implementation is a set, and some extra work will be needed to make this data structure allow dupes.
This data structure allows you both O(1) head/tail access, and both O(1) access to any element in the list. [all on average armotorized]

What is better, a STL list or a STL Map for 20 entries, considering order of insertion is as important as the search speed

I have the following scenario.The implementation is required for a real time application.
1)I need to store at max 20 entries in a container(STL Map, STL List etc).
2)If a new entry comes and 20 entries are already present i have to overwrite the oldest entry with the new entry.
Considering point 2, i feel if the container is full (Max 20 entries) 'list' is the best bet as i can always remove the first entry in the list and add the new one at last (push_back). However, search won't be as efficient.
For only 20 entries, does it really make a big difference in terms of searching efficiency if i use a list in place of a map?
Also considering the cost of insertion in map i feel i should go for a list?
Could you please tell what is a better bet for me ?

1)I need to store at max 20 entries in a container(STL Map, STL List etc). 2)If a new entry comes and 20 entries are already present i have to overwrite the oldest entry with the new entry.
This seems to me the job for boost::circular_buffer.
In general the term circular buffer refers to an area in memory which is used to store incoming data. When the buffer is filled, new data is written starting at the beginning of the buffer and overwriting the old.
The circular_buffer is a STL compliant container. It is a kind of sequence similar to std::list or std::deque. It supports random access iterators, constant time insert and erase operations at the beginning or the end of the buffer and interoperability with std algorithms. The circular_buffer is especially designed to provide fixed capacity storage. When its capacity is exhausted, newly inserted elements will cause elements either at the beginning or end of the buffer (depending on what insert operation is used) to be overwritten.
The circular_buffer only allocates memory when created, when the capacity is adjusted explicitly, or as necessary to accommodate resizing or assign operations. On the other hand, there is also a circular_buffer_space_optimized available. It is an adaptor of the circular_buffer which does not allocate memory at once when created, rather it allocates memory as needed.
For the fast search, I think that with just 20 elements (if their comparison isn't too complicated) you're ok with a "low-cost" container like this and normal linear search, in my opinion it would be difficult to achieve better performance with other STL containers.

Maintain order of insertion, or allow fast searching: choose one.
std::map is not an option here because it doesn't maintain the order of insertion. Besides, it's an associative container. You should choose between a list, a deque and a vector. In terms of performance your best bet is a list, since you can pop off an element from the back and insert a new one at the front (or vice-versa) without any shifting or performance penalty.
The cost of insertion in a map, just as a sidenote, isn't expensive it all: it's in the order of O(log n). Practically irrelevant in the case of 20 elements. The same holds for a std::set.

With only 20 elements, I would not worry much about which container you use. If you determine that the container chosen is in fact a detriment to the performance of your application, it should be relatively easy to swap out the container chosen and replace it with a more-efficient container later.
With that being said, for a large number of elements, the std::deque would probably give you the best all-around efficiency for what you are trying to accomplish. Unlike std::vector, std::deque allows for removal from the front without needing to move all of the other elements. Unlike std::list, std::deque allows for random access of its elements.

You just need to implement a priority queue. STL Map doesn't work.

It depends on the size of the elements.
I know from my own experience that for five integers an unordered array of integers searched with linear search is faster than a set, a list or insertion sort and binary search on an ordered array.
The O() notation of an unordered array may be much worse than any of the other options but the normally unseen C in O(N+C) + C is so much smaller.
A list, set or map (anything that uses dynamic memory and is linked by pointers) will be dominated by cache misses, memory allocations and indirect reference penalties.

You need a Priority Queue implemented on an array.
See the Binary Heap for an implementation.

Do you already know that this is a bottleneck?
My advice would be to first use what is more natural to read while programming and only optimize it when you see that the performance is not what you need.

My suggestion would be to make a circular buffer. But that only works if "old" is determined by when it was inserted, and not some field.
If you need to have a proper LRU, then you should probably go and look at something like http://www.codeproject.com/KB/recipes/LRUCache.aspx?fid=1000025&df=90&mpp=25&noise=3&sort=Position&view=Quick&fr=15
But with 20 entries as your max, it will be very hard to you to find a complex algorithm that is actually faster than the trivial lineary check of every element.

Is the linked list only of limited use?

I was having a nice look at my STL options today. Then I thought of something.
It seems a linked list (a std::list) is only of limited use. Namely, it only really seems
useful if
The sequential order of elements in my container matters, and
I need to erase or insert elements in the middle.
That is, if I just want a lot of data and don't care about its order, I'm better off using an std::set (a balanced tree) if I want O(log n) lookup or a std::unordered_map (a hash map) if I want O(1) expected lookup or a std::vector (a contiguous array) for better locality of reference, or a std::deque (a double-ended queue) if I need to insert in the front AND back.
OTOH, if the order does matter, I am better off using a std::vector for better locality of reference and less overhead or a std::deque if a lot of resizing needs to occur.
So, am I missing something? Or is a linked list just not that great? With the exception of middle insertion/erasure, why on earth would someone want to use one?

Any sort of insertion/deletion is O(1). Even std::vector isn't O(1) for appends, it approaches O(1) because most of the time it is, but sometimes you are going to have to grow that array.
It's also very good at handling bulk insertion, deletion. If you have 1 million records and want to append 1 million records from another list (concat) it's O(1). Every other structure (assuming stadard/naive implementations) are at least O(n) (where n is the number of elements added).

Order is important very often. When it is, linked lists are good. If you have a growing collection, you have the option of linked lists, array lists (vector in C++) and double-ended queues (deque). Use linked lists if you want to modify (add, delete) elements anywhere in the list often. Use array lists if fast retrieval is important. Use double-ended queues if you want to add stuff to both ends of the data structure and fast retrieval is important. For the deque vs vector question: use vector unless inserting/removing things from the beginning is important, in which case use deque. See here for an in-depth look at this.
If order isn't important, linked lists aren't normally ideal.

std::list is notable for its splice() method, which allows you to move one more more elements from one list to another in constant time, without copying or allocating any elements or list nodes.

This question reminds me of this infamous one. Read it for the parallels as to why such simple data structures are important.

Linked List is a fundamental data structure. Other data structures, like hash maps, may use linked lists internally.
Two different algorithms may have O(1) time complexity, for a look up, but that doesn't mean they have the same performance. For example the first one may be 10 or 100 times faster than the second.
Whenever you need to store, iterate and do something with a bunch of data, the normal (and fast) data stucture for that task is the Linked List. More complex data structures are for special cases, ie Set is suitable when you don't want repeated values.

std::list has the following properties:
Sequence
Front Seuqence
Back Seuqence
Forward Container
Reverse Container
Of these properties std::vector does not have (Back Seuqence)
While std::set does not support any sequence properties or (Reverse Container)
So what does this mean?
Will a back sequence supports O(1) for rend() and rbegin() etc
For full information see:
What are the complexity guarantees of the standard containers?

Linked lists are immutable and recursive datastructures whereas arrays are mutable and imperative (=change-based). In functional programming, there are usually no arrays - You don't change elements but transform lists into new lists. While linked lists don't even need additional memory here, this isn't possible efficiently with arrays.
You can easily build or decompose lists without having to change any value.
double [] = []
double (head:rest) = (2 * head):(double rest)
In C++, which is an imperative language, you won't use lists that often. One example could be a list of spaceships in a game from which you can easily remove all spaceships that have been destroyed since the previous frame.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js