To map a group of integer variables to values (say I have a1, a2, a3, a4, a5, a6 determining a value v; that is, something like map<tuple<int, int, int, int, int, int>, VALUE_TYPE> or map<struct{int, int, int, int, int, int}, VALUE_TYPE>), we may use
string constructed from integers as key
tuple as key
struct or class as key
and so on ...
I am very curious about the performance of these methods. In my situation,
less insertion, more query
wide range but sparse distribution of integer keys
time concerns me more, memory less
map access is the most consuming part, so even 10% speedup matters
The question is which way performs better in my situation?
If struct or class key chosen, is there any trick to make comparator more efficient?
Does unordered_map fit the situation better? How should I compute the hash of the keys?
I would appreciate any suggestion or comment on solutions; discussion of a more general situation is also welcome. Thanks in advance.
Basically: Implement the different solutions, write a performance test program and measure!
For sure:
Using the integers as integers will be faster than converting them to strings.
A struct (or class) should be faster than a tuple (my experience).
Also try the std::array<> container (already provides operators for comparison too).
Hash map (std::unordered_map<>) is faster than a sorted map (std::map<>), but can of course only be used if you don't need to search with partial keys.
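For example, here is what a struct key might look like, with a std::tie-based comparator for std::map and a hash_combine-style hash for std::unordered_map (a sketch; the field names a1..a6 are from the question):

#include <cstddef>
#include <functional>
#include <map>
#include <tuple>
#include <unordered_map>

// Sketch: a struct key for the a1..a6 example from the question.
// std::tie gives a lexicographic compare that optimizers handle well.
struct Key {
    int a1, a2, a3, a4, a5, a6;
    bool operator<(const Key& o) const {
        return std::tie(a1, a2, a3, a4, a5, a6)
             < std::tie(o.a1, o.a2, o.a3, o.a4, o.a5, o.a6);
    }
    bool operator==(const Key& o) const {
        return std::tie(a1, a2, a3, a4, a5, a6)
            == std::tie(o.a1, o.a2, o.a3, o.a4, o.a5, o.a6);
    }
};

// A simple hash combining all six fields (hash_combine-style mixing).
struct KeyHash {
    std::size_t operator()(const Key& k) const {
        std::size_t seed = 0;
        for (int v : {k.a1, k.a2, k.a3, k.a4, k.a5, k.a6})
            seed ^= std::hash<int>{}(v) + 0x9e3779b9 + (seed << 6) + (seed >> 2);
        return seed;
    }
};

std::map<Key, double> ordered;                    // tree-based: needs operator<
std::unordered_map<Key, double, KeyHash> hashed;  // hash-based: needs == and hash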
Also, consider implementing the trie data structure if your numbers are relatively small and you have a lot of them in a key. Just like std::unordered_map this will allow you to access an element in O(l) time, where l is the length of the key, but it will also allow you to iterate through all the values in ascending order and find a value by a partial key. The problem here is that your memory consumption will be proportional to the alphabet size (you can instead store the inner edges in some more complex data structure, but this will negatively affect performance). A sketch follows below.
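For illustration, a minimal sketch of such a trie (hypothetical names; this is the variant with map-based inner edges mentioned above, so memory grows with the edges actually used rather than the full alphabet):

#include <map>
#include <memory>
#include <optional>
#include <vector>

// Each node maps one integer "letter" of the key to a child node.
struct TrieNode {
    std::map<int, std::unique_ptr<TrieNode>> edges;
    std::optional<double> value;  // set only where a full key ends
};

void insert(TrieNode& root, const std::vector<int>& key, double v) {
    TrieNode* node = &root;
    for (int part : key) {
        auto& child = node->edges[part];
        if (!child) child = std::make_unique<TrieNode>();
        node = child.get();
    }
    node->value = v;
}

const double* find(const TrieNode& root, const std::vector<int>& key) {
    const TrieNode* node = &root;
    for (int part : key) {
        auto it = node->edges.find(part);
        if (it == node->edges.end()) return nullptr;
        node = it->second.get();
    }
    return node->value ? &*node->value : nullptr;
}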
Related
I wanted to implement something, that maps an unordered set of integers to an integer value. Some kind of C++ equivalent of Python dict, which has sets as keys and ints as values.
So far I used std::map<std::set<int>, int> set_lookup; but from what I understood this is unnecessarily slow as it uses trees. I don't care about the ordering, only speed is important.
From what I understand, the desired structure is std::unordered_map<std::unordered_set<int>, int, hash> set_lookup; which needs a hash function to work.
Is this the right approach? And what would a minimal running example look like? I couldn't find how the hash part should look.
It isn't clear whether you ask about the syntax for defining a hash function, or about how to define a mathematically good hash for a set of ints.
Anyway - in case it is the former, here is how you should technically define a hash function for your case:
#include <functional>
#include <unordered_map>
#include <unordered_set>

namespace std
{
    template <>
    struct hash<std::unordered_set<int>>
    {
        std::size_t operator()(const std::unordered_set<int>& k) const
        {
            // Create and return a meaningful hash value here. XOR of the
            // element hashes is order-independent, which suits a set:
            std::size_t h = 0;
            for (int v : k)
                h ^= std::hash<int>{}(v);
            return h;
        }
    };
}

int main()
{
    std::unordered_map<std::unordered_set<int>, int> m;
}
Having written that, I join the other comments in questioning whether this is a good direction for solving your problem.
You haven't described your problem, so I cannot answer that.
I understood [std::map<std::set<int>, int> set_lookup;] is unnecessarily slow as it uses trees.
Is [std::unordered_map<std::unordered_set<int>, int, hash>] the right approach?
It depends. If your keys are created then not changed, and you want to be able to do a lot of lookups very fast, then a hash-table based approach would indeed be good, but you'll need two things for that:
to be able to hash keys
to be able to compare keys
To hash keys, deciding on a good hash function is a bit of an art form. A rarely bad - but sometimes slower than necessary - approach is to use boost hash_combine (which is short enough that you can copy it into your code - see here for the implementation). If your integer values are already quite random across most of their bits, though, simply XORing them together would produce a great hash. If you're not sure, use hash_combine or a better hash (e.g. MURMUR32). The time taken to hash will depend on the time to traverse the key's elements, and traversing an unordered_set typically involves a linked-list traversal (which jumps around in memory pages and is CPU-cache unfriendly). The best way to store the values for fast traversal is in contiguous memory - i.e. a std::vector<>, or std::array<> if the size is known at compile time.
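For reference, the hash_combine referred to above is short enough to reproduce; this is the widely copied Boost formula:

#include <cstddef>
#include <functional>

// Mixes the hash of v into seed (the classic boost::hash_combine body).
template <class T>
inline void hash_combine(std::size_t& seed, const T& v) {
    std::hash<T> hasher;
    seed ^= hasher(v) + 0x9e3779b9 + (seed << 6) + (seed >> 2);
}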
The other thing you need to do is compare keys for equality: that also works fastest when elements in the key are contiguous in memory, and consistently ordered. Again, a sorted std::vector<> or std::array<> would be best.
That said, if the sets for your keys are large, and you can compromise on a statistical guarantee of key equality, you could use e.g. a 256-bit hash and write code as if hash collisions always correspond to key equality. That's often not an acceptable risk, but if your hash is not collision prone and is e.g. 256 bits wide, a CPU could run flat out for millennia hashing distinct keys and still be unlikely to produce the same hash even once. It's an approach I've seen even financial firms use in their core in-house database products, as it can save so much time.
If you're tempted by that compromise, you'd want std::unordered_map<HashValue256, std::pair<int, std::vector<int>>>. To find the int associated with a set of integers, you'd hash them first, then do a lookup. It's easy to write a hash function that produces the same output for a set or a sorted vector<> or array<>, as you can present the elements to something like hash_combine in the same sorted order during traversal (i.e. just size_t seed = 0; for (auto& element : any_sorted_container) hash_combine(seed, element);). Storing the vector<int> means you can traverse the unordered_map later if you want to find all the key "sets". If you don't need to do that (e.g. you're only ever looking up the ints by keys known to the code at the time) and you're comfortable with the statistical improbability of a good hash colliding, you don't even need to store the keys/vectors: std::unordered_map<HashValue256, int>.
So previously I only had 1 key I needed to look up, so I was able to use a map:
std::map <int, double> freqMap;
But now I need to look up 2 different keys. I was thinking of using a vector with std::pair i.e.:
std::vector<std::pair<int, double>> freqMap;
Eventually I need to look up both keys to find the correct value. Is there a better way to do this, or will this be efficient enough (will have ~3k entries). Also, not sure how to search using the second key (first key in the std::pair). Is there a find for the pair based on the first key? Essentially I can access the first key by:
freqMap[key1]
But not sure how to iterate and find the second key in the pair.
Edit: Ok adding the use case for clarification:
I need to look up a val based on 2 keys, a mux selection and a frequency selection. The raw data looks something like this:
Mux, Freq, Val
0, 1000, 1.1
0, 2000, 2.7
0, 10e9, 1.7
1, 1000, 2.2
1, 2500, 0.8
6, 2000, 2.2
The blanket answer to "which is faster" is generally "you have to benchmark it".
But besides that, you have a number of options. A std::map is more efficient than other data structures on paper, but not necessarily in practice. If you truly are in a situation where this is performance critical (i.e. avoid premature optimisation) try different approaches, as sketched below, and measure the performance you get (memory-wise and cpu-wise).
Instead of using a std::map, consider throwing your data into a struct, give it proper names and store all values in a simple std::vector. If you modify the data only seldom, you can optimise retrieval cost at the expense of additional insertion cost by sorting the vector according to the key you are typically using to find an entry. This will allow you to do binary search, which can be much faster than linear search.
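A sketch of that approach, using hypothetical names and the Mux/Freq/Val layout from the question:

#include <algorithm>
#include <cstdint>
#include <tuple>
#include <vector>

// Hypothetical record for the question's Mux/Freq/Val data.
struct Entry {
    int           mux;
    std::uint64_t freq;
    double        val;
};

bool keyLess(const Entry& a, const Entry& b) {
    return std::tie(a.mux, a.freq) < std::tie(b.mux, b.freq);
}

// After loading: std::sort(entries.begin(), entries.end(), keyLess);
// then look up both keys at once with binary search:
const Entry* find(const std::vector<Entry>& entries, int mux, std::uint64_t freq) {
    Entry probe{mux, freq, 0.0};
    auto it = std::lower_bound(entries.begin(), entries.end(), probe, keyLess);
    if (it != entries.end() && it->mux == mux && it->freq == freq)
        return &*it;
    return nullptr;  // no such (mux, freq) pair
}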
However, linear search can be surprisingly fast on a std::vector because of both cache locality and branch prediction. Both of which you are likely to lose when dealing with a map, unordered_map or (binary searched) sorted vector. So, although O(n) sounds much more scary than, say, O(log n) for map or even O(1) for unordered_map, it can still be faster under the right conditions.
Especially if you discover that you don't have a discernible index member you can use to sort your entries, you will have to either stick to linear search in contiguous memory (i.e. vector) or invest into a doubly indexed data structure (effectively something akin to two maps or two unordered_maps). Having two indexes usually prevents you from using a single map/unordered_map.
If you can pack your data more tightly (i.e. do you need an int or would a std::uint8_t do the job?, do you need a double? etc.) you will amplify cache locality and for only 3k entries you have good chances of an unsorted vector to perform best. Although operations on an std::size_t are typically faster themselves than on smaller types, iterating over contiguous memory usually offsets this effect.
Conclusion: Try an unsorted vector, a sorted vector (+binary search), a map and an unordered_map. Do proper benchmarking (with several repetitions) and pick the fastest one. If it doesn't make a difference pick the one that is the most straight-forward to understand.
Edit: Given your example data, it sounds like the first key has an extremely small domain. As far as I can tell, "Mux" seems to be limited to a small number of different values which are near each other. In such a situation you may consider using a std::array as your primary indexing structure and a suitable lookup structure as your second one (a usage sketch follows the examples). For example:
std::array<std::vector<std::pair<std::uint64_t,double>>,10>
std::array<std::unordered_map<std::uint64_t,double>,10>
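A usage sketch of the second variant, with values from the question's table:

#include <array>
#include <cstdint>
#include <unordered_map>

// Mux (small domain, 0..9 here) indexes the array; Freq keys the inner map.
std::array<std::unordered_map<std::uint64_t, double>, 10> table;

void load_and_query() {
    table[0][1000] = 1.1;          // Mux 0, Freq 1000 -> 1.1
    table[1][2500] = 0.8;          // Mux 1, Freq 2500 -> 0.8
    double v = table[1].at(2500);  // lookup by both keys
    (void)v;
}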
I want to use a map to store key-value pairs.
The key of the map should contain information about the coordinates(int) of one point.
One possibility is to convert the ints to a string. For example, coordinate (x,y) can be represented as the string "x#y", and this string stored as the key.
Another possibility is to store the coordinates as a pair<int, int> and use this pair as the key.
Which way is a better approach and why ?
This depends on your definition of efficient, and we very quickly devolve into what might be considered premature optimization. There are a lot of things at play, and by the way you phrased your question I think we should take a very simplistic look:
Your primary considerations are probably:
Storage: how much memory is used by each key
Speed: how complex a key comparison is
Initialization: how complex it is to create a key
And let's assume that on your system:
int is 4 bytes
a pointer is 8 bytes
you are allocating your own memory for strings instead of using std::string (which is implementation-dependent)
Storage
std::pair<int,int> requires 8 bytes
your string requires 8 bytes for the pointer, plus additional memory for the string-representation of a value (up to 10 bytes per integer) and another byte for the separator
Speed
Comparing std::pair<int,int> requires at most two integer comparisons, which is fast on most processors
Comparing two strings is complex. Equality is easy, but less-than will be complicated. You could use a special padded syntax for your strings, requiring more storage, to make this less complex.
Initialization
std::pair<int,int> initialization is simple and fast
Creating a string representation of your two values requires memory allocation of some kind, possibly involving logic to determine the minimum amount of memory required, followed by the allocation itself (slow) and the actual numeric conversion (also slow). This is a double-whammy of "bottleneck".
Already you can see that at face value, using strings might be crazy... That is, unless you have some other important reason for it.
Now, should you even use std::pair<int,int>? It might be overkill. As an example, let's say you only store values that fit in the range [0,65535]. In that case, std::pair<uint16_t,uint16_t> would be sufficient, or you could pack the two values into a single uint32_t.
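If you go the packed route, a minimal sketch (assuming both coordinates fit in 16 bits; names are illustrative):

#include <cstdint>
#include <map>

// Pack two 16-bit coordinates into a single 32-bit key.
inline std::uint32_t packKey(std::uint16_t x, std::uint16_t y) {
    return (static_cast<std::uint32_t>(x) << 16) | y;
}

std::map<std::uint32_t, double> grid;  // usage: grid[packKey(3, 7)] = 1.5;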
And then others have mentioned hashing, which is fine provided you require fast lookup but don't care about iteration order.
I said I'd keep this simplistic, so this is where I'll stop. Hopefully this has given you something to think about.
One final caution is this: Don't overthink your problem -- write it in the simplest way, and then TEST if it suits your needs.
First, coordinates can be floating-point numbers, so pair<double, double> might be the better choice.
Second, if you really want to choose between an int pair and a string key, pair<int, int> would be the better choice, as a string will always allocate more capacity than its actual length.
Basically, you will lose some unused memory for each string key:
string.length() can be equal to or less than string.capacity().
I am currently building a program that relies on multiple vectors and maps to retain previously computed information. My vectors are of the standard format, and not very interesting. The maps are of the form
std::map<std::string, long double>
with the intent of the string being a parseable mapping of one vector onto another, such as
std::map<std::string, long double> rmMap("fvProperty --> fvTime", 3.0234);
where I can later on split the string and compare the substrings to the vector names to figure out which ones were involved in getting the number.
However, I have recently found that std::tuple is available, meaning I can skip the string entirely and use the vector positions instead with
std::tuple<unsigned int, unsigned int, long double>
This allows me (as far as I can tell) to use both the first and the second value as a key, which seems preferable to parsing a string for my indices.
My concern is that I am unaware of the efficiency here. There will be a lot of calls to these tuples/maps, and efficiency is of the essence, as the program is expected to run for weeks before producing an end result.
Thus, I would ask you whether tuples are more efficient than (or as efficient as) maps, in terms of memory, cache, and cycles, when it comes to large and computationally intense programs.
EDIT:
If tuples cannot be used in this way, would the map
std::map<std::pair<unsigned int, unsigned int>, long double>
be an effective substitute to using strings for identification?
The advantage of a map is that it efficiently queries the data associated with a key.
It cannot be compared to tuples, which just pack values together: you would have to traverse the tuples yourself to retrieve the correct value.
Using a map<pair<unsigned int, unsigned int>, long double> as Mike suggests is probably the way to go.
Tuples and maps are used for very different purposes, so it should be primarily about what you want to use them for, not how efficient they are. You'll have as much trouble starting your car with a screwdriver as you will trying to drive a screw with your car keys. I'd advise against both.
A tuple is one little dataset, in your case e.g. the set {1,2,3.0234}. A map is for mapping multiple keys to their values. In fact, a map internally consists of multiple tuples (pairs), each one containing a key and the associated value. Inside the map, the pairs are arranged in a way that makes searching for the keys easy.
In your case, I'd prefer the map<pair<int, int>, double> as your edit suggests. The keys (i.e. the vector index pairings) are much easier to search for and "parse" than those strings.
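For example, a sketch of that structure (indices 0 and 1 standing in for the hypothetical positions of fvProperty and fvTime in your list of vectors):

#include <map>
#include <utility>

std::map<std::pair<unsigned int, unsigned int>, long double> rmMap;

void example() {
    rmMap[{0, 1}] = 3.0234L;           // fvProperty --> fvTime
    long double v = rmMap.at({0, 1});  // look up by the index pair
    (void)v;
}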
I want to condense a list of about 100 non-negative 32-bit integers into a single integer. Ideally the resulting integer is always unique but a few relatively rare collisions is acceptable. How can I do this?
I'm writing a puzzle-solver. Part of my search algorithm is avoiding re-exploring puzzle states already seen. I'll use the integer generated from the list as the key into a statesAlreadySeen table. Currently I'm using strings as the keys. However I've seen noticeable performance improvements when going from string keys to integer keys in a map<,> hence I'd like to switch.
Edit: Thanks for the unordered map suggestions! However I'm still curious about an actual hashing function. IIRC there's a simple function involving basic bit manipulation and xoring. Would be great to see this and have some general understanding of the collision probabilities.
As suggested in the comments, you need to use a hash function. Most easily accomplished with boost::hash_range:
#include <boost/functional/hash.hpp>
#include <vector>

std::size_t vectorhash(const std::vector<int>& f){
    return boost::hash_range(f.begin(), f.end());
}
Having said that, if you don't have a real need to keep the states ordered (and I can't see why you would), I would go with us2012's solution of keeping string keys and switching to unordered_map - thus letting the container take care of the hashing.