I am currently building a program that relies on multiple vectors and maps to retain previously computed information. My vectors are of the standard format, and not very interesting. The maps are of the form
std::map<std::string, long double>
with the intent of the string being a parseable mapping of one vector onto another, such as
std::map<std::string, long double> rmMap{{"fvProperty --> fvTime", 3.0234}};
where I can later on split the string and compare the substrings to the vector names to figure out which ones were involved in getting the number.
However, I have recently found that std::tuple is available, meaning I can skip the string entirely and use the vector positions instead with
std::tuple<unsigned int, unsigned int, long double>
This allows me (as far as I can tell) to use both the first and the second value as a key, which seems preferable to parsing a string for my indices.
The crux is that I am unsure about the efficiency here. There will be a lot of calls to these tuples/maps, and efficiency is of the essence, as the program is expected to run for weeks before producing an end result.
Thus, I would like to know whether tuples are at least as efficient as maps (in terms of memory, cache, and cycles) when it comes to large and computationally intensive programs.
EDIT:
If tuples cannot be used in this way, would the map
std::map<std::pair<unsigned int, unsigned int>, long double>
be an effective substitute to using strings for identification?
The advantage of the map is to efficiently query data associated to a key.
It cannot be compared to a tuple, which just packs values together: you would have to traverse the tuples yourself to retrieve the correct value.
Using a map<pair<unsigned int, unsigned int>, long double> as Mike suggests is probably the way to go.
Tuples and maps are used for very different purposes, so the decision should primarily be about what you want to use them for, not how efficient they are. You'll have as much trouble starting your car with a screwdriver as you will driving a screw with your car keys; I'd advise against both.
A tuple is one little dataset, in your case e.g. the set {1,2,3.0234}. A map is for mapping multiple keys to their values. In fact, a map internally consists of multiple tuples (pairs), each one containing a key and the associated value. Inside the map, the pairs are arranged in a way that makes searching for the keys easy.
In your case, I'd prefer the map<pair<int, int>, double> as your edit suggests. The keys (i.e. the vector index pairings) are much easier to search for and "parse" than those strings.
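For illustration, here is a minimal sketch of that approach (the index values are made up): the pair of vector indices replaces the string key entirely, so no string splitting or parsing is needed at lookup time.

#include <iostream>
#include <map>
#include <utility>

int main()
{
    // Hypothetical convention: key {0, 1} stands for "vector 0 --> vector 1".
    std::map<std::pair<unsigned int, unsigned int>, long double> rmMap;

    rmMap[{0, 1}] = 3.0234L;                      // insert or overwrite
    rmMap.emplace(std::make_pair(2u, 4u), 1.5L);  // insert only if absent

    auto it = rmMap.find({0, 1});                 // O(log n) lookup, no parsing
    if (it != rmMap.end())
        std::cout << it->second << '\n';
}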
Related
I wanted to implement something, that maps an unordered set of integers to an integer value. Some kind of C++ equivalent of Python dict, which has sets as keys and ints as values.
So far I used std::map<std::set<int>, int> set_lookup; but from what I understood this is unnecessarily slow as it uses trees. I don't care about the ordering, only speed is important.
From what I understand, the desired structure is std::unordered_map<std::unordered_set<int>, int, hash> set_lookup; which needs a hash function to work.
Is this the right approach? And what would a minimal running example look like? I couldn't find what the hash part should look like.
It isn't clear whether you ask about the syntax for defining a hash function, or about how to define a mathematically good hash for a set of ints.
Anyway - in case it is the former, here is how you should technically define a hash function for your case:
#include <cstddef>
#include <unordered_map>
#include <unordered_set>

namespace std
{
    template <>
    struct hash<std::unordered_set<int>>
    {
        std::size_t operator()(const std::unordered_set<int>& k) const
        {
            // ...
            // Here you should create and return a meaningful hash value:
            return 5;
        }
    };
}

int main()
{
    std::unordered_map<std::unordered_set<int>, int> m;
}
Having written that, I join the other comments about whether it is a good direction to solve your problem.
You haven't described your problem, so I cannot answer that.
I understood [std::map<std::set<int>, int> set_lookup;] is unnecessarily slow as it uses trees.
Is [std::unordered_map<std::unordered_set<int>, int, hash>] the right approach?
It depends. If your keys are created then not changed, and you want to be able to do a lot of lookups very fast, then a hash-table based approach would indeed be good, but you'll need two things for that:
to be able to hash keys
to be able to compare keys
To hash keys, deciding on a good hash function is a bit of an art form. A rarely bad - but sometimes slower than necessary - approach is to use boost hash_combine (which is short enough that you can copy it into your code; a version appears in the sketch below). If your integer values are already quite random across most of their bits, though, simply XORing them together would produce a great hash. If you're not sure, use hash_combine or a stronger hash (e.g. MurmurHash). The time taken to hash will depend on the time to traverse the key, and traversing an unordered_set typically involves a linked-list traversal (which typically jumps around in memory pages and is CPU-cache unfriendly). The best way to store the values for fast traversal is in contiguous memory - i.e. a std::vector<>, or std::array<> if the size is known at compile time.
The other thing you need to do is compare keys for equality: that also works fastest when elements in the key are contiguous in memory, and consistently ordered. Again, a sorted std::vector<> or std::array<> would be best.
That said, if the sets for your keys are large, and you can compromise on a statistical guarantee of key equality, you could use e.g. a 256-bit hash and write the code as if hash collisions always correspond to key equality. That's often not an acceptable risk, but if your hash is not collision prone and you have e.g. a 256-bit hash, a CPU could run flat out for millennia hashing distinct keys and still be unlikely to produce the same hash even once; I've seen even financial firms use this in their core in-house database products, as it can save so much time.
If you're tempted by that compromise, you'd want std::unordered_map<HashValue256, std::pair<int, std::vector<int>>>. To find the int associated with a set of integers, you'd hash the set first, then do a lookup. It's easy to write a hash function that produces the same output for a set or a sorted vector<> or array<>, as you can present the elements to something like hash_combine in the same sorted order during traversal (i.e. just size_t seed = 0; for (auto& element : any_sorted_container) hash_combine(seed, element);). Storing the vector<int> means you can traverse the unordered_map later if you want to recover all the key "sets". If you don't need to do that (e.g. you're only ever looking up the ints by keys known to the code at the time, and you're comfortable with the statistical improbability of a good hash colliding), you don't even need to store the keys/vectors: std::unordered_map<HashValue256, int>.
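To make the non-compromise variant concrete, here is a minimal sketch under the assumptions above (the functor name VectorHash is made up). Each key is kept as a sorted std::vector<int>, so hashing and equality both traverse contiguous memory:

#include <cstddef>
#include <functional>
#include <iostream>
#include <unordered_map>
#include <vector>

// Boost-style hash_combine, copied here so no Boost dependency is needed.
inline void hash_combine(std::size_t& seed, int value)
{
    seed ^= std::hash<int>{}(value) + 0x9e3779b9 + (seed << 6) + (seed >> 2);
}

// Hashes a sorted vector<int>; keys must be inserted in sorted order so that
// the sets {1, 2, 3} and {3, 2, 1} hash (and compare) identically.
struct VectorHash
{
    std::size_t operator()(const std::vector<int>& key) const
    {
        std::size_t seed = 0;
        for (int element : key)
            hash_combine(seed, element);
        return seed;
    }
};

int main()
{
    std::unordered_map<std::vector<int>, int, VectorHash> set_lookup;
    set_lookup[{1, 2, 3}] = 42;                    // key already sorted
    std::cout << set_lookup.at({1, 2, 3}) << '\n'; // prints 42
}

Equality falls back to std::equal_to<std::vector<int>>, i.e. element-wise comparison, which is exactly the contiguous, consistently ordered comparison described above.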
I want to use a map to store key-value pairs.
The key of the map should contain information about the coordinates(int) of one point.
One possibility is to convert the ints to a string. For example, coordinate (x,y) can be represented as "x#y", and this string "x#y" can be stored as the key.
Another possibility is to use a pair to store the coordinates as pair<int, int> and using this pair as key.
Which is the better approach, and why?
This depends on your definition of efficient, and we very quickly devolve into what might be considered premature optimization. There are a lot of things at play, and by the way you phrased your question I think we should take a very simplistic look:
Your primary considerations are probably:
Storage: how much memory is used by each key
Speed: how complex a key comparison is
Initialization: how complex it is to create a key
And let's assume that on your system:
int is 4 bytes
a pointer is 8 bytes
you are allocating your own memory for strings instead of using std::string (which is implementation-dependent)
Storage
std::pair<int,int> requires 8 bytes
your string requires 8 bytes for the pointer, plus additional memory for the string-representation of a value (up to 10 bytes per integer) and another byte for the separator
Speed
Comparing std::pair<int,int> requires at most two integer comparisons, which is fast on most processors
Comparing two strings is complex. Equality is easy, but less-than will be complicated. You could use a special padded syntax for your strings, requiring more storage, to make this less complex.
Initialization
std::pair<int,int> initialization is simple and fast
Creating a string representation of your two values requires memory allocation of some kind, possibly involving logic to determine the minimum amount of memory required, followed by the allocation itself (slow) and the actual numeric conversion (also slow). This is a double-whammy of "bottleneck".
Already you can see that at face value, using strings might be crazy... That is, unless you have some other important reason for it.
Now, should you even use std::pair<int,int>? It might be overkill. As an example, let's say you only store values that fit in the range [0,65535]. In that case, std::pair<uint16_t,uint16_t> would be sufficient, or you could pack the two values into a single uint32_t.
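As a hedged illustration of that last point, here is a sketch of packing two coordinates into a single 32-bit key (assuming both really do fit in [0,65535]; the function name pack_key is made up):

#include <cstdint>
#include <iostream>
#include <map>

// Packs two coordinates in [0, 65535] into one 32-bit key.
inline std::uint32_t pack_key(std::uint16_t x, std::uint16_t y)
{
    return (static_cast<std::uint32_t>(x) << 16) | y;
}

int main()
{
    std::map<std::uint32_t, double> values;
    values[pack_key(12, 243)] = 7712.0;

    // Unpacking, should the coordinates ever be needed again:
    std::uint32_t key = pack_key(12, 243);
    std::uint16_t x = static_cast<std::uint16_t>(key >> 16);
    std::uint16_t y = static_cast<std::uint16_t>(key & 0xFFFF);
    std::cout << x << ',' << y << " -> " << values[key] << '\n';
}

A single integer key keeps both comparison and storage as cheap as they can get; the trade-off is the packing/unpacking step and the hard limit on the coordinate range.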
And then others have mentioned hashing, which is fine provided you require fast lookup but don't care about iteration order.
I said I'd keep this simplistic, so this is where I'll stop. Hopefully this has given you something to think about.
One final caution is this: Don't overthink your problem -- write it in the simplest way, and then TEST if it suits your needs.
First, a coordinate can be a floating-point number, so I think pair<double, double> might be the better choice.
Second, if the choice is between an int pair and a string key, pair<int, int> would be the better option, as a string will usually reserve more capacity than its actual length.
Basically, you will lose some unused memory for each string key: string.length() can be equal to or less than string.capacity().
I have a huge amount of data with the following structure:
(ID1, ID2) -> value
(12, 243) -> 7712
(311, 63) -> 123
...
The pure, binary, uncompressed data is just about 2.5 GiB, so it should basically fit into my RAM of 16 GiB. But I'm getting into trouble with the allocation overhead when allocating many small objects.
Previously I stored all values with the same ID1 in the same object, so the lists inside the object only stored ID2 and the value. This approach was also useful for the actual context it is used for.
class DataStore {
    int ID1;
    std::unordered_map<int, int> data; // ID2 -> value
};
The problem is that there is seemingly a huge overhead for loading many of those objects (millions of them)... Simply loading all the data this way obviously exceeds my RAM - no matter whether I use
DataStore* ds = new DataStore();
or just
DataStore ds = DataStore();
for storing them inside a vector or another sort of container.
Storing them in a nested hash map reduces the overhead (now it fits into my RAM), but it's still about 9.2 GiB - I wonder if that's not reducible...
So, I tried
google::sparse_hash_map<int, google::sparse_hash_map<int, int>>
as well as std::map and std::unordered_map, but google::sparse_hash_map is currently the best one, at the 9.2 GiB mentioned above...
I also tried
std::map<std::pair<int, int>, int>
with a worse result.
I tried combining both keys into one long int by shifting, but the insertion performance was incredibly slow for some reason... So I have no idea what the memory usage would have been, but it was far too slow, even with google sparsehash...
I read some things about pool allocators, but I'm not sure if I can allocate multiple containers in the same pool, which would be the only useful scenario, I think... I'm also not sure which pool allocator to use - if that would be a good idea - suggestions?
Lookup performance is pretty important, so a vector is not that optimal... Also inserting performance should not be too bad...
EDIT:
boost::flat_map could be as memory efficient as I need, but inserting is incredibly slow... I tried both
boost::container::flat_map<std::pair<unsigned int, unsigned int>, unsigned int>
boost::container::flat_map<unsigned int, boost::container::flat_map<unsigned int, unsigned int>>
Maybe I can only have either good insert & lookup performance or memory efficiency...
EDIT2:
google::sparse_hash_map<std::pair<unsigned int, unsigned int>, unsigned int, boost::hash<std::pair<unsigned int, unsigned int>>>
This uses about 6.3 GiB, and with some optimizations only 4.9 GiB (the values for (123,45) and (45,123) should be equal :). Really good - thanks!
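For reference, a minimal sketch of that symmetric-key optimization, using std::unordered_map and a hand-rolled hash for illustration (the names make_key and KeyHash are made up; the same normalization works with google::sparse_hash_map and boost::hash):

#include <algorithm>
#include <cstddef>
#include <functional>
#include <iostream>
#include <unordered_map>
#include <utility>

using Key = std::pair<unsigned int, unsigned int>;

// Order the two IDs so that (123,45) and (45,123) become the same key.
inline Key make_key(unsigned int a, unsigned int b)
{
    return std::minmax(a, b);
}

// Simple pair hash for illustration (boost::hash<Key> would also do).
struct KeyHash
{
    std::size_t operator()(const Key& k) const
    {
        std::size_t seed = std::hash<unsigned int>{}(k.first);
        seed ^= std::hash<unsigned int>{}(k.second) + 0x9e3779b9 + (seed << 6) + (seed >> 2);
        return seed;
    }
};

int main()
{
    std::unordered_map<Key, unsigned int, KeyHash> values;
    values[make_key(123, 45)] = 7712;
    std::cout << values[make_key(45, 123)] << '\n'; // prints 7712
}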
EDIT3:
In my context, because I now use pairs as keys and no nested hash maps, it's useless for most of my purposes. I often have to do things like "get all pairs which contain ID 12231", so I would have to iterate through all entries... Looking up one specific entry is nice, but searching by "half a pair" is not, and another index consumes almost as much memory as the nested hashmap solution... I will simply stick to my old approach (serialize the data and save it in some LevelDB), because it's incredibly fast, about 50 ms for a specific part I need, compared to about two seconds for a whole iteration :) But thanks for all the advice :)
I want to create a std::unordered_map<int, std::string>
or a std::unordered_map<std::string, int>.
In this map I will store strings and their integer representations.
I'll fill this map only in the code (hard-coded pairs).
I'll need to convert input strings to their int values - i.e. find them in the map.
So I only need to search the map at run time.
At this point I need the best performance while converting.
In the best case - O(1).
My questions:
Should I use string as key or int ?
Should I define my own hash-function ?
What is the best-performing find function for both cases, string/int and int/string key pairs?
std::map, std::unordered_map, and their multi- counterparts are all built the same way - they map a key (the first template parameter) to a value (the second one). If you want O(1) (unordered_map) or O(log n) (map) lookup behaviour, the key must be the data type for which you want to get a value.
As you want to find an integral value for a given string, the way to go in your case is std::unordered_map<std::string, int>.
A counter-example would be looking up names for error codes: there you typically have a specific error code returned by a function (or placed in errno) and want to get a string value, e.g. for printing the error to the console. Then you'd have std::unordered_map<int, std::string> (provided you cannot store the error strings in an array because the error codes are too widely distributed...).
Edit:
Defining your own hash function is the kind of premature optimisation Konstantin mentions in his comment - the standard library already provides std::hash<std::string>, which fits most use cases well. Only if you discover that hashing is too slow should you look for a faster way.
As all your strings are hard-coded, you might also want to have a look at perfect hashing, e.g. with gperf.
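For the common case, though, a plain hard-coded std::unordered_map<std::string, int> with the default std::hash<std::string> is usually all that's needed. A minimal sketch (the strings and values are made up):

#include <iostream>
#include <string>
#include <unordered_map>

int main()
{
    // Hard-coded string -> int table; average O(1) lookup via std::hash<std::string>.
    static const std::unordered_map<std::string, int> codes = {
        {"red",   1},
        {"green", 2},
        {"blue",  3},
    };

    auto it = codes.find("green");
    if (it != codes.end())
        std::cout << it->second << '\n'; // prints 2
    else
        std::cout << "unknown input\n";
}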
In order to map a group of integer variables to values (say I have a1, a2, a3, a4, a5, a6 to determine a value v; that's a map like map<tuple<int, int, int, int, int, int>,VALUE_TYPE> or map<struct{int, int, int, int, int, int},VALUE_TYPE>), we may use
string constructed from integers as key
tuple as key
struct or class as key
and so on ...
I am very curious about the performance of these methods. In my situation,
less insertion, more query
wide range but sparse distribution of integer keys
time concerns me more, memory less
map access is the most time-consuming part, so even a 10% speedup matters
The question is which way performs better in my situation?
If struct or class key chosen, is there any trick to make comparator more efficient?
Does unordered_map fit the situation better? How should I calculate hash of keys?
I would appreciate any suggestions or comments on solutions. Discussion of a more general situation is also welcome. Thanks in advance.
Basically: Implement the different solutions, write a performance test program and measure!
For sure:
Using the integers as integers will be faster than converting them to strings.
A struct (or class) should be faster than a tuple (in my experience); see the sketch after this list.
Also try the std::array<> container (it already provides comparison operators too).
Hash map (std::unordered_map<>) is faster than a sorted map (std::map<>), but can of course only be used if you don't need to search with partial keys.
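To illustrate the struct-key suggestion, here is a minimal, hedged sketch (the names Key and KeyHash are made up) showing both a comparison operator for std::map and a hash for std::unordered_map:

#include <array>
#include <cstddef>
#include <functional>
#include <iostream>
#include <map>
#include <unordered_map>

struct Key
{
    std::array<int, 6> a; // a1..a6

    // Lexicographic comparison, needed by std::map.
    bool operator<(const Key& other) const { return a < other.a; }
    // Equality, needed by std::unordered_map.
    bool operator==(const Key& other) const { return a == other.a; }
};

// hash_combine-style mixing of the six integers.
struct KeyHash
{
    std::size_t operator()(const Key& k) const
    {
        std::size_t seed = 0;
        for (int v : k.a)
            seed ^= std::hash<int>{}(v) + 0x9e3779b9 + (seed << 6) + (seed >> 2);
        return seed;
    }
};

int main()
{
    std::map<Key, double> ordered;
    std::unordered_map<Key, double, KeyHash> hashed;

    Key k{{1, 2, 3, 4, 5, 6}};
    ordered[k] = 1.0;
    hashed[k]  = 1.0;
    std::cout << ordered[k] << ' ' << hashed[k] << '\n';
}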
Also, consider implementing the trie data structure if your numbers are relatively small and you have a lot of them in a key. Just like std::unordered_map, this will allow you to access an element in O(l) time, where l is the length of the key, but it will also allow you to iterate through all the values in ascending order and to find a value by a partial key. The problem here is that your memory consumption will be proportional to the alphabet size (of course, you can store the inner edges in some more complex data structure, but this will negatively affect performance).