How to implement a map with limited size - c++

I would like to implement a map whose number of elements never exceeds a certain limit L. When the L+1-th element is inserted, the oldest entry should be removed from the map to empty the space.
I found something similar: Data Structure for Queue using Map Implementations in Java with Size limit of 5. There it is suggested to use a linked hash map, i.e., a hash map that also keeps a linked list of all elements. Unfortunately, that is for java, and I need a solution for C++. I could not find anything like this in the standard library nor in the boost libraries.
Same story here: Add and remove from MAP with limited size
A possible solution for C++ is given here, but it does not address my questions below: C++ how to mix a map with a circular buffer?
I would implement it in a very similar way to what is described there. An hash map to store the key-value pairs and a linked list, or a double-ended queue, of keys to keep the index of the entries. To insert a new value I would add it to the hash map and its key at the end of the index; if the size of the has at this point exceeds the limit, I would pop the first element of the index and remove the entry with that key from the has. Simple, same complexity as adding to the hash map.
Removing an entry requires an iteration over the index to remove the key from there, which has linear complexity for both linked lists and double-ended queue. (Double-ended queues also have the disadvantage that removing an element has itself linear complexity.) So it looks like the remove operation on such a data structure does not preserve the complexity as the underlying has map.
The question is: Is this increase in complexity necessary, or there are some clever ways to implement a limited map data structure so that both insertion and removal keep the same complexity?
Ooops, just posted and immediately realized something important. The size of the index is also limited. If that limit is constant, then the complexity of iterating over it can also be considered constant.
Well, the limit gives an upper bound to the cost of the removal operation.
If the limit is very high one might still prefer a solution that does not involve a linear iteration over the index.

I would still use an associative container to have direct access and a sequential one to allow easy removal of the older item. Let us look at the required access methods:
access to an element given its key => ok, the associative container allows direct access
add a new key-value pair
if the map is not full it is easy: push_back on the sequence container, and simple addition to the associative one
if the map is full, above action must happen, but the oldest element must be removed => front on the sequence container will give that element, and pop_front and erase will remove it, provided the key is contained in the sequence container
remove an element given by its key => trivial to remove from an associative container, but only list allows for constant time removal of an element provided you have an iterator on it. The good news is that removing or inserting an element in a list does not invalidate iterators pointing on other elements.
As you did not give any requirement for keeping keys sorted, I would use an unordered_map for the associative container and a list for the sequence one. The additional requirements is that the list must contain the key, and that the unordered_map must contain an iterator to its corresponding element in the list. The value can be in either container. As I assume that the major access will be the direct one, I would store the value in the map.
It boils down to:
a list<K> to allow identification of the oldest key
an unordered_map<K, pair<V, list<K>::iterator>>
It doubles the storage for key and adds an additional iterator. But keys are expected not to be too big and list::iterator normally contains little more than a pointer: this changes a small amount of memory for speed.
That should be enough to provide constant time
key access
insertion of a new item
key removal of an item

You may want to take a look at Boost.MultiIndex MRU example.

Related

C++/STL Structure for Indexed Linked List (Indices in Hash Table)

I'm looking for a way to remember locations in a doubly-linked list (in hash tables or other data structures).
In C, I would add prev and next pointers to my struct. Then, I could store references to elements of my struct wherever I wanted, and refer to them later. I need only maintain these prev/next pointers to manipulate my linked list, and stored references to locations in the list will stay updated.
What is the C++ approach to this problem?
The end goal is an data structure (which is sequenced, but not ordered, i.e. no comparison function exists, but they are relatively sequenced based on where they are inserted). I need to cheaply insert, delete, move objects as the structure grows. But I also need to cheaply look up each element by some key unrelated to the ordering, and I look up meaningful locations (like head, tail, and various checkpoints in the structure called slices). I need to be able to traverse the sequenced list after looking up a starting place by key or by slice.
Head and tail will be free. I was planning a hash table that maps the keys to list elements, and another hash table that maps slices to list elements.
I asked a more specific question related to this here:
Using Both Map and List for Same Objects
The conclusion I made was that I would need to maintain both a List and various Maps pointing to the same data to get the performance I need. But doing this by storing iterators in C++ seemed subpar. Instead it seemed easier to reimplement linked list (building it into my class) and using STL maps to point to data.
I was hoping for some input about which is a more fruitful route, or if there is some third plan that better meets my needs. My assumption is that the STL implementation of unordered_map is faster than anything I would implement, but I could match or beat the performance of list since I'm only using a subset of its functionality.
Thanks!
More precise description of my data/performance requirements:
Data will come in with a unique key. I will add it into a queue.
I'll need to update/move/remove/delete this data in O(1) based on its unique key.
I'll need to insert new data/read data based on metadata stored in other data structures.
I was speaking imprecisely when I said very large list above. The list will definitely fit into memory. Space is cheap enough that it is worth using other data structures to index this list.
I understand your requirements as being:
the data has a unique key
update/move/remove/delete this data in constant time, using its unique key
According to this the best fit would be the unodered_map: It works with a key, and uses a hash table to access the elements. In average insert, find, update is constant time (thanks to the hash table), unless the hash function is not appropriate (i.e. worst case if all elements would yield the same hash value, you would have linear time, as in a list, due to the colisions).
This seems also to match your initial intention:
Head and tail will be free. I was planning a hash table that maps the
keys to list elements, and another hash table that maps slices to list
elements.
Edit: If you need also to master sequencing of elements, independently of their key, you'd need to build a combined container, based on a list and an unordered_map which associates the key to an iterator to the element in the list. You'd then have to manage synchronisation, for example:
insert element: get iterator by inserting element into list, then add the iterator to the unordered_map using the element's key.
remove element: find iterator to element by searching for the key in the unordered_map, erase element in the list using this iterator, and finally erase the key in the unordered_map.
find element: find iterator to element by searching for the key in the unordered_map
sequential iteration: use the iterator to the begin of the list.
I'd route you to STL containers to browse... but when you write word 'very large' (and I'm currently Big Data professional) everything changes.
Nobody usually gives you good advice for scalability but ... here are points.
What is 'very large' in your case? Does std::list fit your needs? Before 3rd paragraph everything looks suitable if you are not too large. Do your structure fits in memory?
How about your structure aligned to memory manager? Simply C-like list with 'prev' and 'next' has serious disadvantage - every element usually is allocated from memory manager. If you are large, this matters and gives your memory over-usage.
What do you expect to be element external reference? If you use pointers - you loose ability to perform optimization on your structure. But probably you don't need it.
Actually you definitely need to consider some 'pools' management if you are really large and indices in such pools can be pretty good references if you modify your structure intensively.
Please consider about large twice. If you mean really large - you need special solution. Especially if your data is larger than your memory. If you are not so large - why not start with just std:list? When you answer to this question, probably your life could be much more easy ;-).

How to retrieve the elements from map in the order of insertion?

I have a map which stores <int, char *>. Now I want to retrieve the elements in the order they have been inserted. std::map returns the elements ordered by the key instead. Is it even possible?
If you are not concerned about the order based on the int (IOW you only need the order of insertion and you are not concerned with the normal key access), simply change it to a vector<pair<int, char*>>, which is by definition ordered by the insertion order (assuming you only insert at the end).
If you want to have two indices simultaneously, you need, well, Boost.MultiIndex or something similar. You'd probably need to keep a separate variable that would only count upwards (would be a steady counter), though, because you could use .size()+1 as new "insertion time key" only if you never erased anything from your map.
Now I want to retrieve the elements in the order they have been inserted. [...] Is it even possible?
No, not with std::map. std::map inserts the element pair into an already ordered tree structure (and after the insertion operation, the std::map has no way of knowing when each entry was added).
You can solve this in multiple ways:
use a std::vector<std::pair<int,char*>>. This will work, but not provide the automatic sorting that a map does.
use Boost stuff (#BartekBanachewicz suggested Boost.MultiIndex)
Use two containers and keep them synchronized: one with a sequential insert (e.g. std::vector) and one with indexing by key (e.g. std::map).
use/write a custom container yourself, so that supports both type of indexing (by key and insert order). Unless you have very specific requirements and use this a lot, you probably shouldn't need to do this.
A couple of options (other than those that Bartek suggested):
If you still want key-based access, you could use a map, along with a vector which contains all the keys, in insertion order. This gets inefficient if you want to delete elements later on, though.
You could build a linked list structure into your values: instead of the values being char*s, they're structs of a char* and the previously- and nextly-inserted keys**; a separate variable stores the head and tail of the list. You'll need to do the bookkeeping for that on your own, but it gives you efficient insertion and deletion. It's more or less what boost.multiindex will do.
** It would be nice to store map iterators instead, but this leads to circular definition problems.

Is there a linked hash set in C++?

Java has a LinkedHashSet, which is a set with a predictable iteration order. What is the closest available data structure in C++?
Currently I'm duplicating my data by using both a set and a vector. I insert my data into the set. If the data inserted successfully (meaning data was not already present in the set), then I push_back into the vector. When I iterate through the data, I use the vector.
If you can use it, then a Boost.MultiIndex with sequenced and hashed_unique indexes is the same data structure as LinkedHashSet.
Failing that, keep an unordered_set (or hash_set, if that's what your implementation provides) of some type with a list node in it, and handle the sequential order yourself using that list node.
The problems with what you're currently doing (set and vector) are:
Two copies of the data (might be a problem when the data type is large, and it means that your two different iterations return references to different objects, albeit with the same values. This would be a problem if someone wrote some code that compared the addresses of the "same" elements obtained in the two different ways, expecting the addresses to be equal, or if your objects have mutable data members that are ignored by the order comparison, and someone writes code that expects to mutate via lookup and see changes when iterating in sequence).
Unlike LinkedHashSet, there is no fast way to remove an element in the middle of the sequence. And if you want to remove by value rather than by position, then you have to search the vector for the value to remove.
set has different performance characteristics from a hash set.
If you don't care about any of those things, then what you have is probably fine. If duplication is the only problem then you could consider keeping a vector of pointers to the elements in the set, instead of a vector of duplicates.
To replicate LinkedHashSet from Java in C++, I think you will need two vanilla std::map (please note that you will get LinkedTreeSet rather than the real LinkedHashSet instead which will get O(log n) for insert and delete) for this to work.
One uses actual value as key and insertion order (usually int or long int) as value.
Another ones is the reverse, uses insertion order as key and actual value as value.
When you are going to insert, you use std::map::find in the first std::map to make sure that there is no identical object exists in it.
If there is already exists, ignore the new one.
If it does not, you map this object with the incremented insertion order to both std::map I mentioned before.
When you are going to iterate through this by order of insertion, you iterate through the second std::map since it will be sorted by insertion order (anything that falls into the std::map or std::set will be sorted automatically).
When you are going to remove an element from it, you use std::map::find to get the order of insertion. Using this order of insertion to remove the element from the second std::map and remove the object from the first one.
Please note that this solution is not perfect, if you are planning to use this on the long-term basis, you will need to "compact" the insertion order after a certain number of removals since you will eventually run out of insertion order (2^32 indexes for unsigned int or 2^64 indexes for unsigned long long int).
In order to do this, you will need to put all the "value" objects into a vector, clear all values from both maps and then re-insert values from vector back into both maps. This procedure takes O(nlogn) time.
If you're using C++11, you can replace the first std::map with std::unordered_map to improve efficiency, you won't be able to replace the second one with it though. The reason is that std::unordered map uses a hash code for indexing so that the index cannot be reliably sorted in this situation.
You might wanna know that std::map doesn't give you any sort of (log n) as in "null" lookup time. And using std::tr1::unordered is risky business because it destroys any ordering to get constant lookup time.
Try to bash a boost multi index container to be more freely about it.
The way you described your combination of std::set and std::vector sounds like what you should be doing, except by using std::unordered_set (equivalent to Java's HashSet) and std::list (doubly-linked list). You could also use std::unordered_map to store the key (for lookup) along with an iterator into the list where to find the actual objects you store (if the keys are different from the objects (or only a part of them)).
The boost library does provide a number of these types of combinations of containers and look-up indices. For example, this bidirectional list with fast look-ups example.

Insert, find-min(key) and delete(based on value) functionality in logN with STL containers?

Here's an interesting problem:
Let's say we have a set A for which the following are permitted:
Insert x
Find-min x
Delete the n-th inserted element in A
Create a data structure to permit these in logarithmic time.
The most common solution is with a heap. AFAIK, heaps with decrease-key (based on a value - generally the index when an element was added) keep a table with the Pos[1...N] meaning the i-th added value is now on index Pos[i], so it can find the key to decrease in O(1). Can someone confirm this?
Another question is how we solve the problem with STL containers? i.e. with sets, maps or priority queues. A partial solution i found is to have a priority queue with indexes but ordered by the value to these indexes. I.e. A[1..N] are our added elements in order of insertion. pri-queue with 1..N based on comparison of (A[i],A[j]). Now we keep a table with the deleted indexes and verify if the min-value index was deleted. Unfortunately, Find-min becomes slightly proportional with no. of deleted values.
Any alternative ideas?
Now I thought how to formulate a more general problem.
Create a data structure similar to multimap with <key, value> elements. Keys are not unique. Values are. Insert, find one (based on key), find (based on value), delete one (based on key) and delete (based on value) must be permitted O(logN).
Perhaps a bit oddly, this is possible with a manually implemented Binary Search Tree with a modification: for every node operation a hash-table or a map based on value is updated with the new pointer to the node.
Similar to having a strictly ordered std::set (if equal key order by value) with a hash-table on value giving the iterator to the element containing that value.
Possible with std::set and a (std::map/hash table) as described by Chong Luo.
You can use a combination of two containers to solve your problem - A vector in which you add each consecutive element and a set:
You use the set to execute find_min while
When you insert an element you execute push_back in the vector and insert in the set
When you delete the n-th element, you see it's value in the vector and erase it from the set. Here I assume the number of elements does not change after executing delete n-th element.
I think you can not solve the problem with only one container from STL. However there are some data structures that can solve your problem:
Skip list - can find the minimum in constant time and will perform the other two operations with amortized complexity O(log(n)). It is relatively easy to implement.
Tiered vector is easy to implement and will perform find_min in constant time and the other two operations in O(sqrt(n))
And of course the approach you propose - write your own heap that keeps track of where is the n-th element in it.

Best c++ container to strip items away from?

I have a list of files (stored as c style strings) that I will be performing a search on and I will remove those files that do not match my parameters. What is the best container to use for this purpose? I'm thinking Set as of now. Note the list of files will never be larger than when it is initialized. I'll only be deleting from the container.
I would definitely not use a set - you don't need to sort it so no point in using a set. Set is implemented as a self-balancing tree usually, and the self-balancing algorithm is unnecessary in your case.
If you're going to be doing this operation once, I would use a std::vector with remove_if (from <algorithm>), followed by an erase. If you haven't used remove_if before, what it does is go through and shifts all the relevant items down, overwriting the irrelevant ones in the process. You have to follow it with an erase to reduce the size of the vector. Like so:
std::vector<const char*> files;
files.erase(remove_if(files.begin(), files.end(), RemovePredicate()), files.end());
Writing the code to do the same thing with a std::list would be a little bit more difficult if you wanted to take advantage of its O(1) deletion time property. Seeing as you're just doing this one-off operation which will probably take so little time you won't even notice it, I'd recommend doing this as it's the easiest way.
And to be honest, I don't think you'll see that much difference in terms of speed between the std::list and std::vector approaches. The vector approach only copies each value once so it's actually quite fast, yet takes much less space. In my opinion, going up to a std::list and using three times the space is only really justified if you're doing a lot of addition and deletion throughout the entire application's lifetime.
Elements in a std::set must be unique, so unless the filenames are globally unique this won't suit your needs.
I would probably recommend a std::list.
From SGI:
A vector is a Sequence that supports random access to elements, constant time insertion and removal of elements at the end, and linear time insertion and removal of elements at the beginning or in the middle.
A list is a doubly linked list. That is, it is a Sequence that supports both forward and backward traversal, and (amortized) constant time insertion and removal of elements at the beginning or the end, or in the middle.
An slist is a singly linked list: a list where each element is linked to the next element, but not to the previous element. That is, it is a Sequence that supports forward but not backward traversal, and (amortized) constant time insertion and removal of elements.
Set is a Sorted Associative Container that stores objects of type Key. Set is a Simple Associative Container, meaning that its value type, as well as its key type, is Key. It is also a Unique Associative Container, meaning that no two elements are the same.
Multiset is a Sorted Associative Container that stores objects of type Key. Multiset is a Simple Associative Container, meaning that its value type, as well as its key type, is Key. It is also a Multiple Associative Container, meaning that two or more elements may be identical.
Hash_set is a Hashed Associative Container that stores objects of type Key. Hash_set is a Simple Associative Container, meaning that its value type, as well as its key type, is Key. It is also a Unique Associative Container, meaning that no two elements compare equal using the Binary Predicate EqualKey.
Hash_multiset is a Hashed Associative Container that stores objects of type Key. Hash_multiset is a simple associative container, meaning that its value type, as well as its key type, is Key. It is also a Multiple Associative Container, meaning that two or more elements may compare equal using the Binary Predicate EqualKey.
(Some containers have been omitted.)
I would go with hash_set if all you care to have is a container which is fast and doesn't contain multiple identical keys. hash_multiset if you do, set or multiset if you want the strings to be sorted, or list or slist if you want the strings to keep their insertion order.
After you've built your list/set, use remove_if to filter out your items based on your criteria.
I will start by tossing out vector since it is a sequential container. Set, I believe is close to being sequential or hashed. I would avoid that. A doublely-linked list, the stl list is one of these, has two pointers and the node. Basically, to remove an item, it breaks the chain then rejoins the two parts with the pointers.
Assuming your search criteria does not depend on the filename (ie. you search for content, file sizes etc.), so you cannot make use of a set, I'd go with a list. It will take you O(N) to construct the whole list, and O(1) per one delete.
If you wanted to make it even faster, and didn't insist on using ready-made STL containers, I would:
use a vector
delete using false delete, ie. marking an item as deleted
when the ratio of deleted/all items raises above certain threshold, I would filter the items with remove_if
This should give you the best space/time/cache performance. (Although you should profile it to be sure)
You can use two lists/vectors/whatever:
using namespace std;
vector<const char *> files;
files.push_back("foo.bat");
files.push_back("bar.txt");
vector<const char *> good_files; // Maybe reserve elements given files.size()?
for(vector<const char *>::const_iterator i = files.begin(); i != files.end(); ++i) {
if(file_is_good(*i)) {
new_files.push_back(*i);
}
}