Best practices for searching in unordered_map - c++

I want to create a std::unordered_map<int, std::string>
or a std::unordered_map<std::string, int>.
In this map I will store strings and their integer representations.
I'll fill this map only in the code (hard-coded pairs).
I'll need to convert input strings to their int values, i.e. find them in the map.
So I only need to search the map at run time.
At this point I need the best performance while converting.
In the best case, O(1).
My questions:
Should I use string or int as the key?
Should I define my own hash function?
What is the best-performing find function for both cases, string/int and int/string key-value pairs?

std::map, std::unordered_map and their multi-counterparts are all built the same way: they map a key (the first template parameter) to a value (the second one). If you want O(1) (unordered_map) or O(log n) (map) lookup behaviour, the key must be the type you want to look up by.
As you want to find an integral value for a given string, the way to go in your case is std::unordered_map<std::string, int>.
A counter-example would be looking up names for error codes: there you typically have a specific error code returned by a function (or placed in errno) and want to get a string for it, e.g. for printing the error to the console. Then you'd have std::unordered_map<int, std::string> (provided you could not store the error strings in an array because the error codes are too widely distributed...).
Edit:
Defining your own hash function is the kind of premature optimisation Konstantin mentions in his comment: the standard library already provides std::hash<std::string>, which should fit well for most use cases. Only if you discover that hashing is too slow should you look for a faster alternative.
As all your strings are hard-coded, you might want to have a look at perfect hashing, e. g. in the gperf variant.
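For the hard-coded case, a minimal sketch might look like this (the string-to-int pairs and the input value are made up for illustration):

#include <iostream>
#include <string>
#include <unordered_map>

// Hypothetical hard-coded mapping; the actual strings and values come from your project.
const std::unordered_map<std::string, int> kLookup = {
    {"alpha", 1},
    {"beta",  2},
    {"gamma", 3},
};

int main()
{
    const std::string input = "beta";
    auto it = kLookup.find(input);      // average O(1) lookup
    if (it != kLookup.end())
        std::cout << input << " -> " << it->second << '\n';
    else
        std::cout << input << " not found\n";
}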

Related

Unordered map of unordered set in C++ 11

I wanted to implement something, that maps an unordered set of integers to an integer value. Some kind of C++ equivalent of Python dict, which has sets as keys and ints as values.
So far I used std::map<std::set<int>, int> set_lookup; but from what I understood this is unnecessarily slow as it uses trees. I don't care about the ordering, only speed is important.
From what I understand, the desired structure is std::unordered_map<std::unordered_set<int>, int, hash> set_lookup; which needs a hash function to work.
Is this the right approach? And what would a minimal running example look like? I couldn't find out what the hash part should look like.
It isn't clear whether you ask about the syntax for defining a hash function, or about how to define a mathematically good hash for a set of ints.
Anyway - in case it is the former, here is how you should technically define a hash function for your case:
#include <unordered_map>
#include <unordered_set>

// Place the specialisation in namespace std so that the unqualified hash<...> below refers to std::hash.
namespace std
{
    template <>
    struct hash<std::unordered_set<int>>
    {
        std::size_t operator()(const std::unordered_set<int>& k) const
        {
            // ...
            // Here you should create and return a meaningful hash value:
            return 5;
        }
    };
}

int main()
{
    std::unordered_map<std::unordered_set<int>, int> m;
}
Having written that, I join the other comments about whether it is a good direction to solve your problem.
You haven't described your problem, so I cannot answer that.
I understood [std::map<std::set<int>, int> set_lookup;] is unnecessarily slow as it uses trees.
Is [std::unordered_map<std::unordered_set<int>, int, hash>] the right approach?
It depends. If your keys are created then not changed, and you want to be able to do a lot of lookups very fast, then a hash-table based approach would indeed be good, but you'll need two things for that:
to be able to hash keys
to be able to compare keys
To hash keys, deciding on a good hash function is a bit of an art form. A rarely bad - but sometimes slower than necessary - approach is to use boost hash_combine (which is short enough that you can copy it into your code - see here for the implementation). If your integer values are already quite random across most of their bits, though, simply XORing them together would produce a great hash. If you're not sure, use hash_combine or a better hash (e.g. MURMUR32). The time taken to hash will depend on the time to traverse, and traversing an unordered_set typically involves a linked list traversal (which typically jumps around in memory pages and is CPU cache unfriendly). The best way to store the values for fast traversal is in contiguous memory - i.e. a std::vector<>, or std::array<> if the size is known at compile time.
The other thing you need to do is compare keys for equality: that also works fastest when elements in the key are contiguous in memory, and consistently ordered. Again, a sorted std::vector<> or std::array<> would be best.
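As a rough sketch of that idea (the hash_combine constants follow the well-known boost pattern; SortedVectorHash is a made-up name), hashing a key kept as a sorted std::vector<int> could look like this:

#include <cstddef>
#include <functional>
#include <unordered_map>
#include <vector>

// boost-style hash_combine: mixes v into the running seed.
template <class T>
void hash_combine(std::size_t& seed, const T& v)
{
    seed ^= std::hash<T>{}(v) + 0x9e3779b9 + (seed << 6) + (seed >> 2);
}

// Hash for a key kept as a sorted std::vector<int> (contiguous, cache friendly).
struct SortedVectorHash
{
    std::size_t operator()(const std::vector<int>& key) const
    {
        std::size_t seed = 0;
        for (int element : key)
            hash_combine(seed, element);
        return seed;
    }
};

// Usage: the vectors used as keys must stay sorted so equal sets hash and compare equally.
std::unordered_map<std::vector<int>, int, SortedVectorHash> set_lookup;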
That said, if the sets for your keys are large, and you can compromise on a statistical guarantee of key equality, you could use e.g. a 256-bit hash and code as if hash collisions always correspond to key equality. That's often not an acceptable risk, but with a collision-resistant 256-bit hash a CPU could run flat out for millennia hashing distinct keys and still be unlikely to produce the same hash even once, so it's an approach I've seen even financial firms use in their core in-house database products, as it can save so much time.
If you're tempted by that compromise, you'd want std::unordered_map<HashValue256, std::pair<int, std::vector<int>>>. To find the int associated with a set of integers, you'd hash them first, then do a lookup. It's easy to write a hash function that produces the same output for a set or a sorted vector<> or array<>, as you can present the elements to something like hash_combine in the same sorted order during traversal (i.e. just size_t seed = 0; for (auto& element : any_sorted_container) hash_combine(seed, element);). Storing the vector<int> means you can traverse the unordered_map later if you want to find all the key "sets". If you don't need to do that (e.g. you're only ever looking up the ints by keys known to the code at the time, and you're comfortable with the statistical improbability of a good hash colliding), you don't even need to store the keys/vectors: std::unordered_map<HashValue256, int>.
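As a sketch of that hash-as-key layout, assuming HashValue256 is simply four 64-bit words (the type name comes from the answer above; a pass-through bucket hash is enough because a 256-bit digest is already well mixed):

#include <array>
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <utility>
#include <vector>

// A 256-bit hash value represented as four 64-bit words (hypothetical layout).
using HashValue256 = std::array<std::uint64_t, 4>;

// The digest is already uniformly distributed, so reusing one word as the bucket hash is enough.
struct HashValue256Hasher
{
    std::size_t operator()(const HashValue256& h) const { return static_cast<std::size_t>(h[0]); }
};

// Keeps the original key elements alongside the mapped int so the map stays traversable.
std::unordered_map<HashValue256, std::pair<int, std::vector<int>>, HashValue256Hasher> by_digest;

// Or, if the key sets never need to be recovered:
std::unordered_map<HashValue256, int, HashValue256Hasher> by_digest_only;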

stl map having pair which contains array

I need help in implementing a design:
I have MessageIDs (integer macros) declared in the project. Each MID is associated with one or more sources (enum 0-19). By checking each source of a MID, I want to call different functions. I followed this approach:
typedef std::pair<int, unsigned int *> MIDPair;
- this binds a MID (int) with its sources (array of int)
typedef std::map<MIDPair, fpPtr> mapRSE;
- maps a MIDPair to a function pointer
Initially I am creating different pairs (MIDs and arrays of applicable sources) and pushing them into the map with the applicable function pointers. When I receive a MID, I will check the current source and call the corresponding function.
Please let me know if my choice of containers is correct, or suggest any other approach.
Your approach is workable:
you'll need to use lower_bound or upper_bound to find a key with that MID value in the map, but you won't necessarily have the source enum value you want at that location: you'll have to iterate over all keys with that MID, checking for the source value
you can use a binary search through the array of source integers (if you keep them sorted)
That's likely not too bad efficiency wise, but does involve quite a lot of fiddly coding.
You'd probably find it simpler to use a container like:
std::map<int, std::map<unsigned, fpPtr>> mapRSE;
Then you could call mapRSE[mid][source]() (or use .at or .find if you don't want to crash on an unexpected key).
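A minimal sketch of that nested-map layout (fpPtr's signature and the MID/source values are placeholders; your real handlers would differ):

#include <cstdio>
#include <map>

// Hypothetical function-pointer type and handlers, standing in for the project's own.
using fpPtr = void (*)();

void handleSourceA() { std::puts("source A"); }
void handleSourceB() { std::puts("source B"); }

int main()
{
    std::map<int, std::map<unsigned, fpPtr>> mapRSE;
    mapRSE[0x1001][3] = handleSourceA;   // MID 0x1001, source 3
    mapRSE[0x1001][7] = handleSourceB;   // MID 0x1001, source 7

    int mid = 0x1001;
    unsigned source = 7;

    auto mit = mapRSE.find(mid);
    if (mit != mapRSE.end())
    {
        auto sit = mit->second.find(source);
        if (sit != mit->second.end())
            sit->second();               // call the registered handler
    }
}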

Recursively finding all combinations 'n' of set<int> stored in a map. How do I "choose"?

I understand the overarching idea of what I want to do, but I'm not sure how to begin with the implementation. I have a map of named int sets, meaning the keys are strings and the values are sets of ints.
map<string, set<int> > data;
I want to find the maximum number of unique ints shared between 'n' sets. So I need to check all possible combinations and update some max variable every time I come across a better combination.
From what I gather I need to use iterators to traverse the map values. So am I carrying around 'n' iterators in my recursive function? What would my base case even look like? Would it be it.end() of the first value being chosen? I'm a bit lost, as you can imagine.
I am willing to abandon the map (though I'd prefer to stick with it so that I can keep a name) and use something simpler like a vector for the sets, if that simplifies things (which it seems like it might). Hmm...

std::map's behavior on referring to a key

I am writing a program for numerical simulation, using std::map to store some key-value pairs. The map stores the states evolved during the simulation. The key is an integer, and the value corresponding to a key tells how many copies there are of that key, i.e. std::map<int, int>. For each step of the simulation I need to calculate how many copies there are for a given key, so I check that with the following code:
if (map[key]>0) {do something here with the number of copies}
However, I soon found that this code doesn't work, because even if there is no such key in the map, calling map[key] inserts a placeholder entry for that key with the value zero; therefore I always overcount the total number of keys when using std::map::size(). I later changed the code as follows to search for the key instead:
if (map.find(key)!=map.end()) {...}
So is this the only and fastest way to check whether a key exists in a map? I am going to run the simulation hundreds of millions of times and it will call the above code very often to check the key. Will it be too slow to use map.find() instead? Thanks.
The find member function is probably the fastest way to find whether a key is already in the map. That said, if you don't need to iterate over items in the map in order, you might get better performance with an std::unordered_map instead.
In a std::map or hashtable (std::unordered_map), the find function is very fast, as fast as using the [] subscripting operator. In fact, it's faster when the element is not found, because it doesn't have to insert one.
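A small sketch illustrating that difference (the key values are arbitrary):

#include <cassert>
#include <map>

int main()
{
    std::map<int, int> counts;

    // find() never modifies the map when the key is absent...
    if (counts.find(42) != counts.end()) { /* key exists */ }
    assert(counts.size() == 0);

    // ...whereas operator[] default-constructs a zero-valued entry:
    if (counts[42] > 0) { /* not reached: the freshly inserted value is 0 */ }
    assert(counts.size() == 1);

    // count() is another non-inserting way to test for existence.
    assert(counts.count(7) == 0);
}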
I don't think there is much difference in speed between the various ways to check for the existence of a key. On the other hand, if your keys are integers and the range is known, you might just use an array.
BTW:
I got interested in the speed of a simple array, vector, map and unordered map. I wrote a simple program that does 100000000 container[n]++ operations, where n is a random number in the range 0 to 10000. The results:
array: 1.27s
vector: 1.36s
unordered map: 2.6s
map: 11.6s
The overhead of loop + index calculation in this simple case is ~0.8s.
So it all depends on how much time is spent elsewhere. If it's considerably more (per 100000000 iterations) then it does not matter much what you use. But if it's not, it can be quite a difference.
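The benchmark loop was along these lines (a reconstruction, not the original program; the timing and random-number details are guesses):

#include <cstdlib>
#include <map>
#include <unordered_map>
#include <vector>

int main()
{
    // Swap in int counts[10000], std::vector<int>, std::map<int, int> or
    // std::unordered_map<int, int> to compare the containers.
    std::unordered_map<int, int> container;

    for (long i = 0; i < 100000000; ++i)
    {
        int n = std::rand() % 10000;   // random index in [0, 10000)
        container[n]++;
    }
}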
You can use a hash map (std::unordered_map); it is the fastest data structure for your key-value type.
You can also use std::map, but it is slower than a hash map.

C++ search function

This question refers to C++.
Say I have 10 million records of data, where each piece of data is a 6-digit number. Numbers will be input that need to be matched against this data.
It boils down to two questions:
What would be the best way to store this data? An array?
What would be the best way to search or match this data?
I'm looking for performance more than anything else, memory usage is not a problem. I was looking into hash functions but I'm not sure if that's what I should even be looking for.
For fast lookup, there are basically two options: std::map, which has O(log n) lookup, or std::unordered_map, which has expected O(1) lookup (but possibly worse).
If your key type is literally an integer (which by the sound of it is the case), you have perfect hashing for free, so an unordered map would be available with minimal additional cost, so I'd try that one.
But just make a typedef and try both and compare!
#include <map>
#include <string>
#include <unordered_map>
typedef std::string some_value_type;  // placeholder value type for illustration
typedef unsigned int key_type;        // fine, has <, ==, and std::hash
typedef std::map<key_type, some_value_type> my_map;
// typedef std::unordered_map<key_type, some_value_type> my_map;
int main()
{
    my_map m;                                     // populate
    my_map::const_iterator it = m.find(123456u);  // look up some random key
    if (it != m.end()) { /* found */ }
}
If you don't actually need to associate any data to the keys, i.e. if you don't need a value type, then replace "map" by "set" everywhere. If you need multiple records with the same key, replace "map" by "multimap" everywhere.
With only a 6 digit number to look up, you could keep an array of 1 million elements and do the lookup directly.
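For example, assuming "6-digit" means values below 1,000,000, a direct-indexed table is a sketch of that idea:

#include <vector>

int main()
{
    // One flag (or record slot) per possible 6-digit value.
    std::vector<char> present(1000000, 0);

    // Mark the numbers that exist in the data set.
    present[123456] = 1;
    present[654321] = 1;

    int input = 123456;
    if (present[input])
    {
        // matched: handle the record
    }
}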
If you know right off the bat how many records you're going to have, you can preallocate an array to that size and then start storing the data. Otherwise, some other data structure such as a vector would be better.
For searching, use a binary search. It will significantly cut down on your search time.
Basically, what will happen is this (the data needs to be sorted, by the way):
You'll jump to the middle element of the data structure and see if your input is higher or lower. If it's higher, you go to the upper half of the structure and repeat this process recursively. If it's lower, you go to the lower half and do the same. You do this until you find your matching data.
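With the data kept sorted, the standard library already provides that search; a sketch with made-up values:

#include <algorithm>
#include <vector>

int main()
{
    std::vector<int> records = {250000, 100123, 999999, 654321};
    std::sort(records.begin(), records.end());   // binary search requires sorted data

    int input = 654321;
    bool found = std::binary_search(records.begin(), records.end(), input);

    // lower_bound gives an iterator to the match (or insertion point) instead of a bool.
    auto it = std::lower_bound(records.begin(), records.end(), input);
    if (it != records.end() && *it == input)
    {
        // matched
    }
    (void)found;
}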
Assuming memory is not an issue, why don't you store the data in an STL map or set? Searching them should be among the fastest options.