Data structure for quick access with more than one key, or with key and priority - C++

Thanks to std::map and similar data structures, it's easy to do quick insertion, access and deletion of data elements based on a key.
Thanks to std::make_heap and its colleagues, it's easy to maintain a priority queue based on a value.
But very often, the algorithm needs a combination of both. For example, one has the following struct:
struct entry {
    int id;
    char name[20];
    double value;
};
The algorithm needs to quickly find and remove the entry with the highest value. That calls for a priority queue with std's heap functions. It also needs to quickly remove some elements based on name and/or id. That calls for a std::map.
When programming that kind of algorithm, I often end up just using a good data structure for the operation that is needed most (for example, priority access), and then using a linear search through that structure for the operation needed less often, for example removal by key.
But is it possible to implement that kind of algorithm while maintaining quick access both by priority and by two keys?

One way is Boost.MultiIndex.
Another is to create two data structures whose values are shared_ptr<const entry> and which use different orderings, plus a wrapping class that ensures adding/removing occurs in both. When you want to edit an element, you naturally have to remove it and then reinsert it.
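A minimal sketch of that two-container idea, assuming a std::multimap keyed by value for the priority side and a std::map keyed by id for lookup (the EntryIndex name and its member functions are illustrative, not a definitive implementation):

#include <iterator>
#include <map>
#include <memory>

struct entry { int id; char name[20]; double value; };

class EntryIndex {
    std::multimap<double, std::shared_ptr<const entry>> by_value_;
    std::map<int, std::shared_ptr<const entry>> by_id_;
public:
    void add(entry e) {
        auto p = std::make_shared<const entry>(e);
        by_value_.emplace(p->value, p);
        by_id_.emplace(p->id, p);
    }
    std::shared_ptr<const entry> pop_highest() {
        if (by_value_.empty()) return nullptr;
        auto it = std::prev(by_value_.end());   // highest value sorts last
        auto p = it->second;
        by_value_.erase(it);
        by_id_.erase(p->id);
        return p;
    }
    void erase_by_id(int id) {
        auto it = by_id_.find(id);
        if (it == by_id_.end()) return;
        auto p = it->second;
        by_id_.erase(it);
        // several entries may share a value: find the node owning this pointer
        auto range = by_value_.equal_range(p->value);
        for (auto v = range.first; v != range.second; ++v)
            if (v->second == p) { by_value_.erase(v); break; }
    }
};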
Boost's multi-index is more complex to set up, but claims faster performance, as the two data structures are intertwined, giving better cache behaviour and lower memory usage.
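For comparison, a sketch of the Boost.MultiIndex version, with an ordered index on value and a hashed index on id (the EntrySet name and the index order are illustrative):

#include <boost/multi_index_container.hpp>
#include <boost/multi_index/hashed_index.hpp>
#include <boost/multi_index/member.hpp>
#include <boost/multi_index/ordered_index.hpp>

struct entry { int id; char name[20]; double value; };

namespace bmi = boost::multi_index;

using EntrySet = bmi::multi_index_container<
    entry,
    bmi::indexed_by<
        bmi::ordered_non_unique<bmi::member<entry, double, &entry::value>>, // priority
        bmi::hashed_unique<bmi::member<entry, int, &entry::id>>            // key lookup
    >
>;

// EntrySet set;
// auto& by_value = set.get<0>();             // highest value is *std::prev(by_value.end())
// set.get<1>().erase(some_id);               // O(1) average removal by id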

Related

Data structure for FIFO behaviour and fast lookup by value

So I am looking for a data structure that provides FIFO behaviour but also has a quick lookup time by value.
In my current code I have some data duplication: I use a std::unordered_set and a std::queue to achieve the behaviour I want, but there is probably a better way that I'm not thinking of at the moment. I have a function that adds each new entry to both the set and the queue. To check whether an entry exists in the queue, I use find() on the set. Lastly, I have a timer that is set off after an insertion to the queue. After a minute I take the entry at the front of the queue with queue.front(), use this value to erase from the set, and finally pop the queue.
This all works as expected and gives me both the FIFO behaviour and constant-time lookup, but I have data duplication, and I was wondering if there is a data structure (maybe something from Boost?) that does what I want without the duplication.
A solution is to use two containers: store the elements in an unordered set for fast lookup, and upon insertion, store an iterator to the element in a queue. When you pop the queue, erase the corresponding element from the set. (Beware that unordered_set iterators are invalidated by rehashing, so reserve enough buckets up front, or store values or pointers instead.)
A more structured approach is to use a multi-index container. The standard library doesn't provide one, but Boost does. More specifically, you could use a combination of hashed and sequenced indices.
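A minimal sketch of the hashed + sequenced combination (the element type int and the UniqueFifo name are illustrative):

#include <boost/multi_index_container.hpp>
#include <boost/multi_index/hashed_index.hpp>
#include <boost/multi_index/identity.hpp>
#include <boost/multi_index/sequenced_index.hpp>

namespace bmi = boost::multi_index;

// FIFO order comes from the sequenced index; uniqueness and O(1)
// average lookup come from the hashed index on the same elements.
using UniqueFifo = bmi::multi_index_container<
    int,
    bmi::indexed_by<
        bmi::sequenced<>,
        bmi::hashed_unique<bmi::identity<int>>
    >
>;

// UniqueFifo q;
// q.push_back(42);                        // silently refused if 42 is already present
// bool present = q.get<1>().count(42) > 0;
// q.pop_front();                          // FIFO pop; the hashed index updates automatically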
This answer mostly concerns corner cases of the problem as presented.
If your problem is a practical one, and you are able to store the elements in a std::vector - and if you have somewhere in the ballpark of ~10-100 elements in the queue - then you could just use:
std::vector<T> q; // used directly as a small FIFO
That is, use the vector itself as the queue. (Note that std::queue<T, std::vector<T>> would not work: std::vector has no pop_front(), so the adapter's pop() would not compile.) With that small a number of elements (only 10-100), advanced lookup methods are not worth it.
You then only need to check for duplicates when you pop the queue, not on every insertion. Again, that may or may not be useful depending on your specific case; I can imagine cases where this method is superior. E.g. a web server serving pages that gets a lot of hits to just one or a few pages: then it might be faster to add, say, 100,000 elements to the vector and remove the duplicates all in one go when popping.
How about defining your own data structure which can act as a BST (for lookups) and as a min-heap (which you can use to impose FIFO order)?
class node {
public:
    static inline int autoIncrement = 0; // C++17 inline static; pre-C++17, define it out of class
    int order;                           // auto-incremented to impose FIFO
    int data;                            // key used for BST ordering
    node* left_Bst = nullptr;
    node* right_Bst = nullptr;
    node* left_Heap = nullptr;
    node* right_Heap = nullptr;
    node() : order(autoIncrement++) {}
};
By doing this you are basically creating two data structures sharing the same nodes. The BST's ordering is imposed via data, and the heap's via the order variable.
During an insertion you can traverse via the BST pointers and insert your element if it doesn't already exist, then modify the heap pointers accordingly after the insertion.

Unique membership FIFO container

I need a first-in first-out queue which holds IDs with the catch that an ID should only be added if it is not already in the queue.
The best way I can think of is this:
#include <queue>
#include <set>

typedef unsigned ID;

struct UniqueFifo
{
private:
    std::set<ID> ids;    // keep track of what is already in
    std::queue<ID> fifo; // keep FIFO order
public:
    void push(ID x) {
        // only add to the queue if not already in it
        if (ids.find(x) == ids.end()) {
            fifo.push(x);
            ids.insert(x);
        }
    }
    ID pop() {
        // pop from the queue
        ID x = fifo.front();
        fifo.pop();
        // and also remove from the set
        ids.erase(x);
        return x;
    }
};
Is there a more elegant way of doing this with C++ STL containers?
Using a second data structure like that, optimised for insertion and searching, is the most scalable solution; but if the queue never gets particularly large, it might be more efficient (and certainly simpler) to do a linear search in the queue itself. You'll need std::deque itself, rather than the std::queue adapter, in order to access the contents.
If it is large enough to justify a searchable "index", then consider using unordered_set if available, or some other hash-based set if not; that's likely to be faster, if you only require uniqueness and not any particular ordering.
Your code inserts the ID into both the set and the queue. Conveniently, you can combine the insertion with the check for uniqueness, so that only a single search is needed:
if (ids.insert(x).second) {
    fifo.push(x);
}
You might also consider storing set iterators, rather than values, in the queue, to make erasing more efficient.
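For illustration, a sketch of that variant; it relies on the fact that std::set iterators stay valid under insertions and under erasure of other elements:

#include <queue>
#include <set>

typedef unsigned ID;

struct UniqueFifo {
    std::set<ID> ids;
    std::queue<std::set<ID>::iterator> fifo; // iterators instead of values

    void push(ID x) {
        auto r = ids.insert(x);              // single search: check + insert
        if (r.second) fifo.push(r.first);
    }
    ID pop() {
        auto it = fifo.front();
        fifo.pop();
        ID x = *it;
        ids.erase(it);                       // erase by iterator, no second lookup
        return x;
    }
};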
Your solution is not bad, but you need to use a std::set<ID>, not a std::map (a map is used to map keys to values, while you only care about the values here). Also consider using std::unordered_set if C++11 is available.
It's not bad at all. I can see other ways to do this, mostly using one container (because the only "problem" I see is that you're using two containers and so have to ensure at all times that they stay consistent with each other), but they are not more elegant and may even cost more.
IMHO, keep this design (at least this interface), using a std::unordered_set instead of a std::set, until you can benchmark performance. After that you may have no problem, or maybe the extra lookup in std::set will be too costly for you. In that case you could keep a std::queue<std::pair<ID, std::unordered_set<ID>::const_iterator>> to ease erasure (this benefits from the fact that std::set iterators are not invalidated by additions or by removal of other elements; std::unordered_set iterators are likewise safe as long as no rehash occurs).
But do not do this unless you need greater performance. Your code is simple and readable.

Which is the fastest STL container for find?

Alright, as a preface: I need to cache a relatively small subset of rarely modified data to avoid querying the database as frequently, for performance reasons. This data is heavily used in a read-only sense, as it is referenced often by a much larger set of data in other tables.
I've written a class which can store basically the entirety of the two tables in question in memory, while listening for commit changes in conjunction with a thread-safe callback mechanism for updating the cached objects.
My current implementation has two std::vectors, one for the elements of each table. The class provides both access to the entirety of each vector and convenience methods for searching for a specific element of table data via std::find, std::find_if, etc.
Does anyone know if using std::list, std::set, or std::map over std::vector for searching would be preferable? Most of the time that is what will be requested of these containers after populating once from the database when a new connection is made.
I'm also open to using C++0x features supported by VS2010 or Boost.
For searching a particular value, std::set and std::map take O(log N) time, while the other two take O(N) time; so std::set or std::map is probably better. Since you have access to C++0x, you could also use std::unordered_set or std::unordered_map, which take constant time on average.
For find_if, there's little difference between them, because it takes an arbitrary predicate and the containers cannot optimize for an arbitrary one, of course.
However, if you will be calling find_if frequently with a certain predicate, you can optimize yourself: use a std::map or std::set with a custom comparator or specially crafted keys, and use find instead.
A sorted vector using std::lower_bound can be just as fast as std::set if you're not updating very often; they're both O(log n). It's worth trying both to see which is better for your own situation.
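For illustration, a sorted-vector lookup with std::lower_bound might look like this (the Row type and its id field are placeholders for the cached table rows):

#include <algorithm>
#include <vector>

struct Row { int id; /* other cached columns */ };

// rows must be kept sorted by id for this to work
const Row* find_row(const std::vector<Row>& rows, int id) {
    auto it = std::lower_bound(rows.begin(), rows.end(), id,
        [](const Row& r, int key) { return r.id < key; });
    return (it != rows.end() && it->id == id) ? &*it : nullptr;
}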
Since your (extended) requirements say you need to search on multiple fields, I would point you to Boost.MultiIndex.
This Boost library lets you build a single container (holding only one copy of each element) and index it over an arbitrary number of indices. It also lets you choose which index to use for each lookup.
To determine the kind of index to use, you'll need extensive benchmarks. 500 is a relatively low number of entries, so constant factors will dominate asymptotic complexity. Furthermore, there can be a noticeable difference between single-threaded and multi-threaded usage (most hash-table implementations can collapse under multi-threaded usage because they do not use incremental rehashing, so a single thread ends up rehashing the whole table, blocking all others).
I would recommend a sorted index (skip-list-like, if possible) to accommodate range requests (all names beginning with "Abc"?) if the performance difference is either unnoticeable or simply does not matter.
If you only want to search for distinct values in one specific column of the table, then a hash table (std::unordered_set or std::unordered_map) is fastest.
If you want to be able to search using several different predicates, you will need some kind of index structure. It can be implemented by extending your current vector-based approach with several hash tables or maps, one for each field to search on, where the value is either an index into the vector or a direct pointer to the element in the vector.
Going further, if you want to be able to search for ranges, such as all occasions with a date in July, you need an ordered data structure from which you can extract a range.
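A sketch of the index idea from the previous paragraphs, with hypothetical field names; it stores vector positions rather than raw pointers so the indices survive vector reallocation, and either map could be swapped for a std::map where range queries are needed:

#include <string>
#include <unordered_map>
#include <vector>

struct Row { int id; std::string name; /* ... */ };

// Append-only cache: positions stay stable because nothing is erased.
struct Cache {
    std::vector<Row> rows;                           // primary storage
    std::unordered_map<int, size_t> by_id;           // id   -> position in rows
    std::unordered_map<std::string, size_t> by_name; // name -> position in rows

    void add(Row r) {
        size_t pos = rows.size();
        by_id[r.id] = pos;
        by_name[r.name] = pos;
        rows.push_back(std::move(r));
    }
    const Row* find_by_id(int id) const {
        auto it = by_id.find(id);
        return it == by_id.end() ? nullptr : &rows[it->second];
    }
};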
Not an answer per se, but be sure to use a typedef to refer to the container type you use, something like typedef std::vector<itemtype> data_table_cache; then use your typedef everywhere.

Which data structure for a list of objects with a fast lookup feature

I have a data structure and have to perform lookups on it; I would like to optimize things...
struct Data
{
    std::string id_;
    double data_;
};
I currently use a std::vector<Data> and the std::find algorithm, but I'm wondering if another data structure would be more convenient:
hash table?
map?
boost multi index container?
other things?
EDIT:
Each time I receive a message from the network I have to look up in this vector (with id as the key) and update/retrieve some information. (The data structure has more fields than in my example.)
EDIT2:
I don't care about order.
I have to insert/erase elements in this data structure frequently.
It really depends on your requirements, but two possibilities are to sort your vector and do a binary search, or to use a map. Both can be implemented within about 15 minutes, so I suggest you try both of them.
Edit: Given your requirement that you want to add and remove things often, and the size of your data, I'd use an unordered_map (i.e. a hash table) as the first try. You can always change to another container later.
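A first-try sketch of the unordered_map approach, keying by id (the key duplicates the id_ field, a small redundancy that buys O(1) average lookup, insertion, and erasure; function names are illustrative):

#include <string>
#include <unordered_map>

struct Data {
    std::string id_;
    double data_;
};

std::unordered_map<std::string, Data> table;

// On each network message: insert or update in O(1) average.
void on_message(const std::string& id, double value) {
    table[id] = Data{id, value};
}

// Lookup by id; frequent erase via table.erase(id) is cheap as well.
const Data* find(const std::string& id) {
    auto it = table.find(id);
    return it == table.end() ? nullptr : &it->second;
}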
It depends on whether you care about the order of the elements in your container or not. If you do care, you can do no better than now. If you don't, a hashed container should provide the fastest lookup.
But it also depends on other factors. For instance, if you create the container once and never change it, then maybe an ordered vector, with binary search, will be best.

Least Recently Used cache using C++

I am trying to implement an LRU cache using C++. I would like to know the best design for implementing one. I know an LRU cache should provide find(), adding an element, and removing an element, where the removal removes the LRU element. What are the best ADTs to implement this?
For example: if I use a map with the element as value and a time counter as key, I can search in O(log n) time; inserting is O(n) and deleting is O(log n).
One major issue with LRU caches is that there are few "const" operations; most will change the underlying representation (if only because they bump the element accessed).
This is of course very inconvenient, because it means it's not a traditional STL container, and therefore any idea of exhibiting iterators is quite complicated: when the iterator is dereferenced, that is an access, which should modify the list we are iterating on... oh my.
And there are the performance considerations, both in terms of speed and memory consumption.
It is unfortunate, but you'll need some way to organize your data in a queue (LRU order), with the possibility of removing elements from the middle, and this means your elements will have to be independent from one another. A std::list fits, of course, but it's more than you need. A singly-linked list is sufficient here, since you don't need to iterate the list backward (you just want a queue, after all).
However, one major drawback of those is their poor locality of reference; if you need more speed you'll need to provide your own custom (pool?) allocator for the nodes, so that they are kept as close together as possible. This will also alleviate heap fragmentation somewhat.
Next, you obviously need an index structure (for the cache bit). The most natural is to turn toward a hash map. std::tr1::unordered_map, std::unordered_map, and boost::unordered_map are normally good-quality implementations; some should be available to you. They also allocate extra nodes for hash-collision handling; you might prefer other kinds of hash maps - check out Wikipedia's article on the subject and read about the characteristics of the various implementation techniques.
Continuing, there is the (obvious) question of threading support. If you don't need thread support, then it's fine; if you do, however, it's a bit more complicated:
As I said, there are few const operations on such a structure, thus you don't really need to differentiate read/write accesses.
Internal locking is fine, but you might find that it doesn't play nicely with your usage. The issue with internal locking is that it doesn't support the concept of a "transaction", since it relinquishes the lock between each call. If this is your case, transform your object into a mutex and provide a std::unique_ptr<Lock> lock() method (in debug builds, you can assert that the lock is taken at the entry point of each method); see the sketch after this list.
There is (in locking strategies) the issue of reentrance, i.e. the ability to "relock" the mutex from within the same thread; check Boost.Thread for more information about the various locks and mutexes available.
Finally, there is the issue of error reporting. Since it is expected that a cache may not be able to retrieve the data you ask for, I would consider using an exception "poor taste". Consider either pointers (Value*) or Boost.Optional (boost::optional<Value&>). I would prefer Boost.Optional because its semantics are clear.
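A sketch of that external-locking idea using only the standard library; the suggestion above returns a std::unique_ptr<Lock>, but returning a std::unique_lock by value achieves the same scoped-transaction effect (the LruCache name is illustrative):

#include <mutex>

class LruCache {
    mutable std::mutex m_;
public:
    // Hand the caller a scoped lock so several operations can be
    // grouped into one "transaction" under a single critical section.
    std::unique_lock<std::mutex> lock() const {
        return std::unique_lock<std::mutex>(m_);
    }
    // ... find/insert/erase methods, which in debug builds could
    // assert that the caller holds the lock ...
};

// usage:
// LruCache cache;
// { auto guard = cache.lock(); /* find, then insert, atomically */ }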
The best way to implement an LRU is to use the combination of a std::list and a hash map such as stdext::hash_map or std::unordered_map (if you want to use only the older standard library, std::map).
Store the data in the list so that the least recently used element is at the back, and use the map to point to the list nodes.
For "get", use the map to find the list node and retrieve the data, then move that node to the front (since it was just used) and update the map.
For "insert", remove the last element from the list, add the new data to the front, and update the map.
This is the fastest you can get. If you are using a hash_map, almost all operations are done in O(1). If using std::map, all operations take O(log n).
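A compact sketch of that design, assuming C++11; std::list::splice performs the "move to front" in O(1) without invalidating the iterators stored in the map:

#include <cstddef>
#include <list>
#include <unordered_map>
#include <utility>

// front of the list = most recently used, back = least recently used
template <typename K, typename V>
class LruCache {
    std::size_t capacity_;
    std::list<std::pair<K, V>> items_;
    std::unordered_map<K, typename std::list<std::pair<K, V>>::iterator> index_;

public:
    explicit LruCache(std::size_t capacity) : capacity_(capacity) {}

    V* get(const K& key) {
        auto it = index_.find(key);
        if (it == index_.end()) return nullptr;
        // splice moves the node to the front in O(1); the list iterators
        // stored in the map remain valid
        items_.splice(items_.begin(), items_, it->second);
        return &it->second->second;
    }

    void put(const K& key, V value) {
        if (V* existing = get(key)) { *existing = std::move(value); return; }
        if (!items_.empty() && items_.size() >= capacity_) {
            index_.erase(items_.back().first);   // evict least recently used
            items_.pop_back();
        }
        items_.emplace_front(key, std::move(value));
        index_[key] = items_.begin();
    }
};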
This article describes a couple of C++ LRU cache implementations (one using STL, one using boost::bimap).
When you say priority, I think "heap", which naturally leads to increase-key and delete-min.
I would not make the cache visible to the outside world at all if I could avoid it. I'd just have a collection (of whatever) and handle the caching invisibly, adding and removing items as needed, but the external interface would be exactly that of the underlying collection.
As far as the implementation goes, a heap is probably the most obvious. It has complexities roughly similar to a map, but instead of building a tree from linked nodes, it arranges items in an array and the "links" are implicit based on array indices. This increases the storage density of your cache and improves locality in the "real" (physical) processor cache.
I suggest a heap, or maybe a Fibonacci heap.
I'd go with a normal heap in C++.
With std::make_heap (guaranteed by the standard to be O(n)), std::pop_heap, and std::push_heap from #include <algorithm>, implementing it would be a piece of cake. You only have to worry about increase-key.
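For illustration, the standard heap functions in action (a max-heap over a vector):

#include <algorithm>
#include <vector>

int main() {
    std::vector<int> h = {3, 1, 4, 1, 5};
    std::make_heap(h.begin(), h.end()); // O(n); max element now at h.front()

    h.push_back(9);
    std::push_heap(h.begin(), h.end()); // sift the new element up: O(log n)

    std::pop_heap(h.begin(), h.end());  // move the max to h.back()
    int top = h.back();                 // top == 9
    h.pop_back();
    (void)top;
}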