C++11 unordered_map time complexity - c++

I'm trying to figure out the best way to do a cache for resources. I am mainly looking for native C/C++/C++11 solutions (i.e. Boost and the like are not an option).
What I am doing when retrieving from the cache is something like this:
Object *ResourceManager::object_named(const char *name) {
if (_object_cache.find(name) == _object_cache.end()) {
_object_cache[name] = new Object();
}
return _object_cache[name];
}
Where _object_cache is defined something like: std::unordered_map <std::string, Object *> _object_cache;
What I am wondering about is the time complexity of doing this: does find trigger a linear-time search, or is it some kind of constant-time look-up?
I mean if I do _object_cache["something"]; on the given example, it will either return the object, or, if it doesn't exist, it will call the default constructor and insert an object, which is not what I want. I find this a bit counter-intuitive; I would have expected it to report in some way (by returning nullptr, for example) that a value for the key couldn't be retrieved, not to second-guess what I wanted.
But again, if I do a find on the key, does it trigger a big search that in fact runs in linear time (since the key will not be found, will it look at every key)?
Is this a good way to do it, or does anyone have a better suggestion? Perhaps there is a look-up operation that tells me whether the key is available. I will access the cache often, so if time is spent searching I would like to eliminate it, or at least make it as fast as possible.
Thankful for any input on this.

The behaviour triggered by _object_cache["something"] is what you want: operator[] value-initializes the mapped value when the key is absent, and value-initialization of a pointer type (e.g. Object *) gives nullptr (8.5p6b1, footnote 103).
So:
auto &ptr = _object_cache[name];
if (!ptr) ptr = new Object;
return ptr;
You use a reference into the unordered map (auto &ptr) as your local variable so that you assign into the map and set your return value in the same operation. In C++03 or if you want to be explicit, write Object *&ptr (a reference to a pointer).
Note that you should probably be using unique_ptr rather than a raw pointer to ensure that your cache manages ownership.
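For example, here is a minimal sketch of the same lookup with unique_ptr ownership (staying within C++11, hence reset rather than C++14's make_unique):
// needs <memory>, <string> and <unordered_map>
std::unordered_map<std::string, std::unique_ptr<Object>> _object_cache;

Object *ResourceManager::object_named(const char *name) {
    auto &ptr = _object_cache[name]; // value-initialized to an empty unique_ptr on a miss
    if (!ptr) ptr.reset(new Object); // the map now owns the allocation
    return ptr.get();                // callers receive a non-owning raw pointer
}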
By the way, find has the same performance as operator[]: average constant, worst case linear (the worst case arises when keys collide heavily, e.g. when every key in the unordered map has the same hash).

Here's how I'd write this:
auto it = _object_cache.find(name);
return it != _object_cache.end()
? it->second
: _object_cache.emplace(name, new Object).first->second;

The average complexity of find on a std::unordered_map is O(1) (constant), especially with std::string keys, whose standard hash function yields very few collisions. Despite its name, find does not do a linear scan.
If you want to do some kind of caching, this container is definitely a good start.
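For instance, a sketch of a purely non-mutating lookup (using the question's _object_cache) that reports a miss instead of inserting:
auto it = _object_cache.find(name); // average O(1) hash look-up; inserts nothing on a miss
if (it != _object_cache.end())
    return it->second;              // cache hit
return nullptr;                     // cache miss reported explicitly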

Note that a cache is typically not just a fast O(1)-access container but also a bounded data structure. A std::unordered_map will grow dynamically as more and more elements are added; when resources are limited (e.g. when reading huge files from disk into memory), you want a data structure that is both bounded and fast, to keep your system responsive.
A cache, in contrast, applies an eviction strategy whenever size() reaches capacity(), replacing the least valuable element.
You can implement a cache on top of a std::unordered_map. The eviction strategy can then be implemented by redefining the insert() member. If you want to go for an N-way (for small and fixed N) associative cache (i.e. one item can replace at most N other items), you could use the bucket() interface to replace one of the bucket's entries.
For a fully associative cache (i.e. any item can replace any other item), you could use a Least Recently Used eviction strategy by adding a std::list as a secondary data structure:
using key_tracker_type = std::list<K>;
using key_to_value_type = std::unordered_map<
    K, std::pair<V, typename key_tracker_type::iterator>
>;
By wrapping these two structures inside your cache class, you can define the insert() to trigger a replace when your capacity is full. When that happens, you pop_front() the Least Recently Used item and push_back() the current item into the list.
On Tim Day's blog there is an extensive example with full source code that implements the above cache data structure. Its implementation can also be done efficiently using Boost.Bimap or Boost.MultiIndex.
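As an illustration, here is a minimal sketch of that list-plus-map structure (the class and member names are illustrative, not Tim Day's actual code; it assumes K is hashable and keeps things simple by ignoring re-insertion of an existing key):
#include <cstddef>
#include <iterator>
#include <list>
#include <unordered_map>
#include <utility>

template <typename K, typename V>
class lru_cache {
public:
    explicit lru_cache(std::size_t capacity) : capacity_(capacity) {}

    void insert(const K &key, V value) {
        if (map_.count(key)) return;        // sketch: ignore duplicate inserts
        if (map_.size() >= capacity_) {     // full: evict the Least Recently Used item
            map_.erase(tracker_.front());
            tracker_.pop_front();
        }
        tracker_.push_back(key);
        map_.emplace(key, std::make_pair(std::move(value), std::prev(tracker_.end())));
    }

    V *find(const K &key) {                 // returns nullptr on a miss
        auto it = map_.find(key);
        if (it == map_.end()) return nullptr;
        // splice moves the key to the back of the tracker: most recently used
        tracker_.splice(tracker_.end(), tracker_, it->second.second);
        return &it->second.first;
    }

private:
    std::size_t capacity_;
    std::list<K> tracker_;                  // front = least recently used
    std::unordered_map<K, std::pair<V, typename std::list<K>::iterator>> map_;
};
A lru_cache<std::string, Object *> cache(1024); then gives bounded, average O(1) lookups; splice re-links the list node in O(1) without invalidating the iterator stored in the map.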

The insert/emplace interfaces to map/unordered_map are enough to do what you want: find the position, and insert if necessary. Since the mapped values here are pointers, ekatmur's response is ideal. If your values are fully-fledged objects in the map rather than pointers, you could use something like this:
Object& ResourceManager::object_named(const char *name, const Object& initialValue) {
return _object_cache.emplace(name, initialValue).first->second;
}
The values name and initialValue make up the arguments for the key-value pair that needs to be inserted if there is no key equal to name. emplace returns a pair: second indicates whether anything was inserted (i.e. name is a new key) - we don't care about that here - and first is the iterator pointing to the (possibly newly created) entry whose key is equivalent to name. So if the key was already there, dereferencing first gives the original Object for the key, which has not been overwritten with initialValue; otherwise, the key was newly inserted from name, the entry's value portion was copied from initialValue, and first points to that.
ekatmur's response is equivalent to this:
Object& ResourceManager::object_named(const char *name) {
bool res;
auto iter = _object_cache.end();
std::tie(iter, res) = _object_cache.emplace(name, nullptr); // std::tie lives in <tuple>
if (res) {
iter->second = new Object(); // we inserted a null pointer - now replace it
}
return iter->second;
}
but profits from the fact that the value-initialized pointer created by operator[] is null to decide whether a new Object needs to be allocated. It's more succinct and easier to read.

Related

Is There a way to Find the Number of Keys in a multimap Inline?

A multimap's size reports the number of values it contains. I'm interested in the number of keys it contains. For example, given multimap<int, double> foo I'd like to be able to do this:
const auto keyCount = ???
One way to get this is to use a for-loop over a zero-initialized keyCount:
for (auto it = cbegin(foo); it != cend(foo); it = foo.upper_bound(it->first)) ++keyCount;
This, however, does not let me perform the operation inline, so I can't initialize a const auto keyCount.
A solution could be a lambda or function which wraps this for-loop such as:
template <typename Key, typename Value>
size_t getKeyCount(const multimap<Key, Value>& arg) {
size_t result = 0U;
for (auto it = cbegin(arg); it != cend(arg); it = arg.upper_bound(it->first)) ++result;
return result;
}
But I was hoping for something provided by the standard. Does such a thing exist?
No, there is no built-in in the standard to do this. Consider that your count function works because a multimap is internally sorted. Typical implementations such as libstdc++ use red-black trees for the internal representation. You would need to walk the tree in order to count all the (unique) keys. It's unavoidable.
1st, multimap doesn't maintain a key count natively. So your question is then about how to use something from the algorithm library to find the count. The 2 constraints you have placed that preclude everything in the library are:
Must be used inline, so a count, not an iterator must be returned
Must perform at least as well as upper_bound used in a lambda, which has time complexity: O(log n)
Constraint 1 leaves us with: count, count_if, for_each, as well as most of the algorithms in the numeric library
Constraint 2 eliminates all of these from consideration, as each of them has time complexity O(n)
As such your getKeyCount is preferable to any other option that the standard supplies.
Just a comment about another option that may be presented: maintaining a keyCount whenever something is added to or removed from foo. This may seem workable, but it requires a check before each insert whether the inserted key already exists, and a check after each removal whether the removed key still exists. Aside from that, there is the danger of breakage in multi-threaded code, and the loss of readability: it is not obvious that keyCount must be maintained alongside foo. Ultimately this is a bad idea unless the key count is read significantly more often than foo is updated.
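That said, if the goal is merely to initialize a const variable inline, the loop from the question can be wrapped in an immediately invoked lambda (a sketch, not a standard facility):
const auto keyCount = [&foo] {
    std::size_t count = 0;
    for (auto it = cbegin(foo); it != cend(foo); it = foo.upper_bound(it->first))
        ++count; // jump over each run of equal keys, counting once per key
    return count;
}();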
If you are only creating the multimap and inserting values into it, you can keep a companion set recording the distinct keys. The size of that set will give you the key count.

Compression dictionary with very fast int->data lookups, and fast reverse lookups (search/insert/delete data)?

I would like to implement a dictionary that pairs unique heterogeneous data (variant) with unique int so that instead of repeating the value (that may be large) I would repeat the int. When needed, I would convert it via the dictionary to the original value.
The dataset will be large, so (int->data) in O(1) is important. (data->int) and insert/delete should all be O(log n) average case, since these operations are less important. Order of the data is of no concern, but insert/delete must not invalidate existing int keys.
I have tried hash-table and SSTable approaches. With a hash table the required storage is rather high, even when using the hashed value as the index and not storing it with the values. Collisions lower the efficiency, but the amortized complexity is O(1) for all of the operations. The SSTable, on the other hand, offers worse complexity for manipulation and duplicates the values (once in the vector storage, once in the map index). Its overall memory consumption is only slightly lower than that of the hash-table dictionary.
Requirements in summary of Dictionary:
Lookup int->data: O(1)
Lookup data->int: O(log n) at worst
Insertion: O(log n) at worst
Removal: O(log n) at worst [or alternative means like garbage collection that could perform worse if not being run all the time]
Minimum memory requirements possible
Is there a way to improve upon the design of a dictionary to further reduce the memory requirements while retaining O(1) int->data lookup and reasonable insertion/removal/data->int?
If the int->data speed is the most important thing, you should set things up so that's just an array-indexing operation.
Keep your data objects in a std::vector<data> forward_map. int->data is just a forward_map[i] lookup, which is O(1) with constant factors that are about as low as possible.
Use a separate data structure to support the search/insert/delete operations.
Depending on what comparison operations your "data" objects support, a binary search tree or a std::unordered_set might be a good choice. The "value" type of the BST / set is just an int, but comparisons on those ints actually compare forward_map[i] < forward_map[j], not i < j.
So let's say you have a std::unordered_set< forward_map_reference_t > reverse_map. (It's not actually this easy with STL containers; see below.)
We're actually using the set as a map: the key is forward_map[val], and the value is val itself.
To find the reverse_map entry for a given int k, you need to actually search it for forward_map[k].
const data_t & lookup(int k) { return forward_map[k]; }
int search(const data_t &): reverse_map.find() is efficient.
delete(const data_t &): search & delete the reverse_map entry, returning int k. Add k to a LIFO free-list for the forward_map. (Don't touch the forward_map entry. If you need to detect use-after-free of forward_map entries, then zero it at this point or something.)
insert(const data_t &): check the head of the free-list for an entry to reuse, otherwise forward_map.push_back(). k = the position you put the entry in the forward map. Add that k to the reverse map.
To avoid storing another copy of the data_t items, reverse_map needs to refer to the forward_map inside its search operations.
There's a potentially-large advantage to using a reverse_map based on a hash-table, rather than a search tree, because of cache-misses. Normally all the data needed to compare a key with a tree node is present in the node, but in this case it's a reference to forward_map. Not only can loads from reverse_map itself cache-miss, so can forward_map[k]. (Loads from unknown addresses can't get started early, unlike the known-address case on out-of-order CPUs, so this is extra bad). Speculative execution may get the next load from reverse_map started, but things are still bad. A hash table requires significantly fewer total key comparisons, which is a big plus.
Using STL containers?
There's actually a chicken and egg problem for using STL containers here: Consider a std::unordered_set<int>: The Key type is int. We'd use a custom KeyEqual function that compares based on forward_map[i]. But there's only .find(const Key& key), not .find(const data_t&).
An ugly workaround would be to temporarily copy a data_t into a free slot in forward_map so it would have an index that we could pass to unordered_set<int, custom_compare>::find, but this extra copying is dumb.
Another bad option (that probably won't optimize away at compile time) would be a class with a virtual function to access a data_t. The set would hold a class with a single int member; we'd pass .find() a derived class that also holds a data_t & and refers to that, instead of the int array index, in its override of the virtual function used by the Hash and KeyEqual functors.
You might have to build your own custom data structure, or use something other than STL, unless there's a way to get STL to accept keys of a different type from the existing set members.
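Since C++20 there is such a way: the unordered containers gained heterogeneous lookup, enabled when both the hasher and the key-equality functor expose an is_transparent typedef. Here is a sketch with std::string standing in for data_t and illustrative functor names (IndexHash, IndexEqual):
#include <cstddef>
#include <functional>
#include <string>
#include <unordered_set>
#include <vector>

using data_t = std::string;        // stand-in: any hashable, equality-comparable type

std::vector<data_t> forward_map;   // int -> data in O(1), as described above

struct IndexHash {
    using is_transparent = void;   // opt in to heterogeneous lookup (C++20)
    std::size_t operator()(int i) const { return std::hash<data_t>{}(forward_map[i]); }
    std::size_t operator()(const data_t &d) const { return std::hash<data_t>{}(d); }
};

struct IndexEqual {
    using is_transparent = void;
    bool operator()(int a, int b) const { return forward_map[a] == forward_map[b]; }
    bool operator()(int a, const data_t &d) const { return forward_map[a] == d; }
    bool operator()(const data_t &d, int a) const { return d == forward_map[a]; }
};

// Stores only ints; hashing and comparison read through forward_map.
std::unordered_set<int, IndexHash, IndexEqual> reverse_map;
With that, reverse_map.find(some_data) compiles and hashes some_data directly, comparing it against forward_map entries - no temporary copy and no virtual-function trick needed.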
You can use an intrusive linked list. For example:
struct Node {
Node *prev, *next;
variant<int, float, vector<string>> data;
};
Now, instead of storing an int to locate one of these things, just store a Node*. When you want to delete one:
~Node() {
if (prev) prev->next = next;
if (next) next->prev = prev;
}
Now it will remove itself from the list when you call delete node.
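For completeness, here is one way to manage the other end of the problem, keeping a valid entry point into the list. This sketch (an addition, not part of the answer above) uses a circular list with a sentinel node, so prev and next are never null and no external head pointer needs fixing up:
#include <string>
#include <utility>
#include <variant>
#include <vector>

struct Node {
    Node *prev, *next;
    std::variant<int, float, std::vector<std::string>> data;
    ~Node() {                      // unlink on destruction, as described above
        if (prev) prev->next = next;
        if (next) next->prev = prev;
    }
};

struct List {
    Node sentinel;                 // never deleted; points to itself when empty
    List() { sentinel.prev = sentinel.next = &sentinel; }
    Node *insert_back(std::variant<int, float, std::vector<std::string>> v) {
        Node *n = new Node{sentinel.prev, &sentinel, std::move(v)};
        n->prev->next = n;
        n->next->prev = n;
        return n;                  // store this pointer in place of an int id
    }
    // note: this sketch leaks any remaining nodes on destruction; a real
    // List would delete them in its destructor.
};
Calling delete on the pointer returned by insert_back removes that element from the list in O(1).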
Predictably, Boost has an implementation with lots of fancy features: http://www.boost.org/doc/libs/release/doc/html/intrusive.html

Proper Qt data structure for storing and accessing struct pointers

I have a certain struct:
struct MyClass::MyStruct
{
Statistics stats;
Object *objPtr;
bool isActive;
QDateTime expiration;
};
I need to store pointers to these in a private container. Client code will hand me an object for which I need to return the matching MyStruct pointers. For example:
QList<MyStruct*> MyClass::structPtr( Statistics stats )
{
// Return all MyStruct* for which myStruct->stats == stats (== is overloaded)
}
or
QList<MyStruct*> MyClass::structPtr( Object *objPtr )
{
// Return all MyStruct* for which myStruct->objPtr == objPtr
}
Right now I'm storing these in a QLinkedList<MyStruct*> so that I can have fast insertions, and lookups roughly equivalent to QList<MyStruct*>. Ideally I would like to be able to perform lookups faster, without losing my insertion speed. This leads me to look at QHash, but I am not sure how I would use a QHash when I'm only storing values without keys, or even if that is a good idea.
What is the proper Qt/C++ way to address a problem such as this? Ideally, lookup times should be <= log(n). Would a QHash be a good idea here? If so, what should I use for a key and/or value?
If you want to use QHash for fast lookups, the hash's key type must be the same as the search token type. For example, if you want to find elements by Statistics value, your hash should be QHash<Statistics, MyStruct*>.
If you can live with only looking up your data in one specific way, a QHash should be fine for you. Though, in your case where you're pulling lists out, you may want to investigate QMultiHash and its .values() member. However, it's important to note, from the documentation:
The key type of a QHash must provide operator==() and a global hash function called qHash()
If you need to be able to pull these lists based on different information at different times you might just be better off iterating over the lists. All of Qt's containers provide std-style iterators, including its hash maps.
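For instance, a sketch of the QMultiHash route for the Statistics lookup (it assumes Statistics already provides operator==; the qHash overload, someField(), and the member name m_byStats are illustrative):
// Required by QHash/QMultiHash keys: operator==() and a qHash() overload.
inline uint qHash(const Statistics &s, uint seed = 0)
{
    return qHash(s.someField(), seed); // hypothetical field; hash whatever defines equality
}

class MyClass {
    QMultiHash<Statistics, MyStruct*> m_byStats; // hypothetical private member
public:
    // One key can map to several structs.
    void add(MyStruct *s) { m_byStats.insert(s->stats, s); }

    // All MyStruct* whose stats compare equal to the query, average O(1).
    QList<MyStruct*> structPtr(Statistics stats) { return m_byStats.values(stats); }
};
If lookups by Object* are needed as well, a second QMultiHash<Object*, MyStruct*> can be kept in sync alongside it, trading memory for lookup speed.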

64bit array operation by C/C++

I have an efficiency-critical application where I need an array-like data structure A. Its keys are 0, 1, 2, ..., and its values are distinct uint64_t values. I need two constant-time operations:
1. Given i, return A[i];
2. Given val, return i such that A[i] == val
I prefer not to use a hash table, because when I tried GLib's GHashTable it took around 20 minutes to load 60 million values into the table (if I remove the insertion statement, it takes only around 6 seconds). That time is not acceptable for my application. Or maybe somebody can recommend another hash table library? I tried uthash.c; it crashed immediately.
I also tried SDArray, but it doesn't seem to be the right fit.
Does anybody know any data structure that would fulfill my requirements? Or any efficient hash table implementations? I prefer using C/C++.
Thanks.
In general, you need two hash tables for this task. As you know, hash tables give you a key look-up in expected constant time. Searching for a value requires iterating through the whole data structure, since information about the values isn't encoded in the hash look-up table.
Use two hash tables: One for key-value and one (reversed) for value-key look-up. In your particular case, the forward search can be done using a vector as long as your keys are "sequential". But this doesn't change the requirement for a data structure enabling fast reverse look-up.
Regarding the hash table implementation: in C++11, you have the new standard container std::unordered_map available.
An implementation might look like this (of course this is tweakable, e.g. introducing const-correctness, passing by reference, etc.):
std::unordered_map<K,T> kvMap; // hash table for forward search
std::unordered_map<T,K> vkMap; // hash table for backward search
void insert(std::pair<K,T> item) {
kvMap.insert(item);
vkMap.insert(std::make_pair(item.second, item.first));
}
// expected O(1); note that operator[] default-inserts on a miss (use at() to throw instead)
T valueForKey(K key) {
return kvMap[key];
}
// expected O(1)
K keyForValue(T value) {
return vkMap[value];
}
A clean C++11 implementation should "wrap" around the key-value hash map, so you have the "standard" interface in your wrapper class. Always keep the reverse map in sync with your forward map.
Regarding the creation performance: in most implementations there is a way to tell the data structure how many elements are going to be inserted, usually called "reserve". For hash tables this is a huge performance benefit, as dynamically resizing the data structure (which otherwise happens every now and then during insertion) rehashes every element into a new, larger bucket table.
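With the kvMap/vkMap from above and the 60 million entries mentioned in the question, that is a one-liner per map before the load loop:
kvMap.reserve(60000000); // bucket array sized up front: no rehashing while loading
vkMap.reserve(60000000);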
I would go for two vectors (assuming that your values are really distinct and, crucially, small enough to be usable as vector indices; without such a bound the second vector below would be infeasibly large), as vector access is O(1) where a map is O(log n):
vector<uint64_t> values;
vector<size_t> keys;
values.resize(maxSize); // size the vectors first (resize, not just reserve,
keys.resize(maxSize);   // so the indexed writes below are within bounds)
Then, when reading in data
values[keyRead] = data;
keys[valueRead] = key;
Reading information is then the same
data = values[currentKey];
key = keys[currentData];

Address of map value

I have settings stored in a std::map. For example, there is a WorldTime key whose value is updated on each iteration of the main cycle. I don't want to read it from the map every time I need it (it's also processed each frame); I think that's not fast at all. So, can I get a pointer to the map's value and access it through that? The code is:
std::map<std::string, int> mSettings;
// Somewhere in cycle:
mSettings["WorldTime"] += 10; // ms
// Somewhere in another place, also called in cycle
DrawText(mSettings["WorldTime"]); // Is slow to call each frame
So the idea is something like:
int *time = &mSettings["WorldTime"];
// In cycle:
DrawText(*time);
How wrong is it? Should I do something like that?
Best use a reference:
int & time = mSettings["WorldTime"];
If the key doesn't already exist, the []-access will create the element (and value-initialize the mapped value, i.e. 0 for an int). Alternatively (if the key already exists):
int & time = mSettings.find("WorldTime")->second;
As an aside: if you have hundreds of thousands of string keys or look up by string key a lot, you might find that a std::unordered_map<std::string, int> gives better results (but always profile before deciding). The two maps have virtually identical interfaces for your purpose.
According to this answer on StackOverflow, it's perfectly OK to store a pointer to a map element as it will not be invalidated until you delete the element (see note 3).
If you're worried so much about performance then why are you using strings for keys? What if you had an enum? Like this:
enum Settings
{
WorldTime,
...
};
Then your map would use ints for keys rather than strings. std::map has to do comparisons between the keys because it is typically implemented as a balanced tree, and comparisons between ints are much faster than comparisons between strings.
Furthermore, if you're using an enum for keys, you can just use an array, because an enum IS essentially a map from some sort of symbol (e.g. WorldTime) to an integer, starting at zero. So then do this:
enum Settings
{
WorldTime,
...
NumSettings
};
And then declare your mSettings as an array:
int mSettings[NumSettings];
This has a faster lookup time than a std::map. Then reference it like this:
DrawText(mSettings[WorldTime]);
Since you're basically just accessing a value in an array rather than a map, this is going to be a lot faster, and you don't have to worry about the pointer/reference hack you were trying in the first place.