Storing frequent keys in a map in C++ - c++

Imagine that I am calculating a double type score for some large number of strings (i.e. 100 millions), and I want to have a sort of caching effect to re-use what I have computed so far whenever it is necessary. Now, the smart way to do this is to store those frequent strings and their scores in the map, to reduce the memory usage. Is there any standard solution for this?

Bloom Filters along with Maps or Hashmaps
I am assuming as you mention that there are 100's of millions of strings not all of them might be present on disk.
What i would suggest is a two layer hierarchy as follows:
Create a hashmap which is of the following format:
std::map<std::string, double> map;
This could contain the last 500 referenced entries. You can have a eviction policy based on TTL if you use a map.
If you want a eviction policy based on least recently used, you will have tweak the datastructure a little bit
std::map<std::string, cached_data> map;
struct cached_data
{
double value;
int times_accessed; //Number of times it has been accessed.
}
Insertion will become slightly complicated.
Now what do we do when cache access fails, you don't want to read the disk unnecessarily if you actually don't have the data. For that you can use a bloomfilter. Bloom filter will tell you the probability of presence of your data, hence preventing the disk read.
I would suggest stories the large number of strings on disk in a sorted manner with index consisting of the Keys --> Disk address mapping.
Answer largely inspired by Cassandra

Related

Most efficient container for uint16_t to uint16_t map

I am developing program for very processing power limited machine where I want to map uint16_t keys to uint16_t values.
I am currently using std::map use non-safe reading:
std::map<uint16_t, uint16_t> m;
//fill m only once
while(true){
auto x = m[y];
}
The performance is still non-acceptable for the requirements. Is there a better solution in term of execution speed?
Edit:
Some information:
Total number of items is less than 500
Insertion is done only once
Querying values is done more than 250 times / sec
Keys and values are unique
Ram is very limited, overall ram is 512KB, free ram for this part of code is less than 50KB
100 MHz single core processor
Without some more context about your map,
if you intend to use a lot of keys, a big array like suggested before would be easy enough to handle since there wouldn't be collisions, but if you're not going to use all the memory, it might be wasteful.
if you intend to use a fair amount of data, but not enough that there'd be too many hash collisions, std::unordered_map has amortized O(1) lookups, and if you don't care about the order they're stored in, that could be a good guess.
if you're using not a lot of data and require it to be flexible, std::vector is a good choice
Seeing as all we know is that it's a uin16_t to uint16_t map, there's no one best answer.
Total number of items is less than 500
Insertion is done only once
Keep a fixed-size (500) array of key-value pairs plus a "valid items count", if the actual size is known only at runtime; populate it with the data and sort it; then you can simply do a binary search.
Given that the elements are few, depending on the specific processor it may even be more convenient to just do a linear search (perhaps keeping the most used elements first if you happen to have clues about the most frequently looked-up values).
Usually a binary tree (std::map which you currently use) provides adequate performance. Your CPU budget must be really small.
A binary search approach probably won't be much faster. A hashtable (std::unordered_map) might be a solution.
Just make sure to size it properly so as to avoid rehashing and stay just below 1.0 load factor.
std::unordered_map<uint16_t, uint16_t> m(503); // 503 is a prime
m.emplace(123, 234);
m.emplace(2345, 345);
std::cout << m.find(123)->second; // don't use operator[] to avoid creating entries
std::cout << m.find(2345)->second;
To check the quality of the hashing function, iterate over all the buckets by calling bucket_size() and add up anything above 1 (i.e. a collision).

Difference between multimap and unordered_multimap in c++? [duplicate]

I have a simple requirement, i need a map of type . however i need fastest theoretically possible retrieval time.
i used both map and the new proposed unordered_map from tr1
i found that at least while parsing a file and creating the map, by inserting an element at at time.
map took only 2 minutes while unordered_map took 5 mins.
As i it is going to be part of a code to be executed on Hadoop cluster and will contain ~100 million entries, i need smallest possible retrieval time.
Also another helpful information:
currently the data (keys) which is being inserted is range of integers from 1,2,... to ~10 million.
I can also impose user to specify max value and to use order as above, will that significantly effect my implementation? (i heard map is based on rb trees and inserting in increasing order leads to better performance (or worst?) )
here is the code
map<int,int> Label // this is being changed to unordered_map
fstream LabelFile("Labels.txt");
// Creating the map from the Label.txt
if (LabelFile.is_open())
{
while (! LabelFile.eof() )
{
getline (LabelFile,inputLine);
try
{
curnode=inputLine.substr(0,inputLine.find_first_of("\t"));
nodelabel=inputLine.substr(inputLine.find_first_of("\t")+1,inputLine.size()-1);
Label[atoi(curnode.c_str())]=atoi(nodelabel.c_str());
}
catch(char* strerr)
{
failed=true;
break;
}
}
LabelFile.close();
}
Tentative Solution: After review of comments and answers, i believe a Dynamic C++ array would be the best option, since the implementation will use dense keys. Thanks
Insertion for unordered_map should be O(1) and retrieval should be roughly O(1), (its essentially a hash-table).
Your timings as a result are way OFF, or there is something WRONG with your implementation or usage of unordered_map.
You need to provide some more information, and possibly how you are using the container.
As per section 6.3 of n1836 the complexities for insertion/retreival are given:
http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2005/n1836.pdf
One issue you should consider is that your implementation may need to continually be rehashing the structure, as you say you have 100mil+ items. In that case when instantiating the container, if you have a rough idea about how many "unique" elements will be inserted into the container, you can pass that in as a parameter to the constructor and the container will be instantiated accordingly with a bucket-table of appropriate size.
The extra time loading the unordered_map is due to dynamic array resizing. The resizing schedule is to double the number of cells each when the table exceeds it's load factor. So from an empty table, expect O(lg n) copies of the entire data table. You can eliminate these extra copies by sizing the hash table upfront. Specifically
Label.reserve(expected_number_of_entries / Label.max_load_factor());
Dividing by the max_load_factor is to account for the empty cells that are necessary for the hash table to operate.
unordered_map (at least in most implementations) gives fast retrieval, but relatively poor insertion speed compared to map. A tree is generally at its best when the data is randomly ordered, and at its worst when the data is ordered (you constantly insert at one end of the tree, increasing the frequency of re-balancing).
Given that it's ~10 million total entries, you could just allocate a large enough array, and get really fast lookups -- assuming enough physical memory that it didn't cause thrashing, but that's not a huge amount of memory by modern standards.
Edit: yes, a vector is basically a dynamic array.
Edit2: The code you've added some some problems. Your while (! LabelFile.eof() ) is broken. You normally want to do something like while (LabelFile >> inputdata) instead. You're also reading the data somewhat inefficiently -- what you apparently expecting is two numbers separated by a tab. That being the case, I'd write the loop something like:
while (LabelFile >> node >> label)
Label[node] = label;

Unordered map vs vector

I'm building a little 2d game engine. Now I need to store the prototypes of the game objects (all type of informations). A container that will have at most I guess few thousand elements all with unique key and no elements will be deleted or added after a first load. The key value is a string.
Various threads will run, and I need to send to everyone a key(or index) and with that access other information(like a texture for the render process or sound for the mixer process) available only to those threads.
Normally I use vectors because they are way faster to accessing a known element. But I see that unordered map also usually have a constant speed if I use the ::at element access. It would make the code much cleaner and also easier to maintain because I will deal with much more understandable man made strings.
So the question is, the difference in speed between a access to a vector[n] compared to a unorderedmap.at("string") is negligible compared to his benefits?
From what I understand accessing various maps in different part of the program, with different threads running just with a "name" for me is a big deal and the speed difference isn't that great. But I'm too inexperienced to be sure of this. Although I found informations about it seem I can't really understand if I'm right or wrong.
Thank you for your time.
As an alternative, you could consider using an ordered vector because the vector itself will not be modified. You can easily write an implementation yourself with STL lower_bound etc, or use an implementation from libraries ( boost::flat_map).
There is a blog post from Scott Meyers about container performance in this case. He did some benchmarks and the conclusion would be that an unordered_mapis probably a very good choice with high chances that it will be the fastest option. If you have a restricted set of keys, you can also compute a minimal optimal hash function, e.g. with gperf
However, for these kind of problems the first rule is to measure yourself.
My problem was to find a record on a container by a given std::string type as Key access. Considering Keys that only EXISTS(not finding them was not a option) and the elements of this container are generated only at the beginning of the program and never touched thereafter.
I had huge fears unordered map was not fast enough. So I tested it, and I want to share the results hoping I've not mistaken everything.
I just hope that can help others like me and to get some feedback because in the end I'm beginner.
So, given a struct of record filled randomly like this:
struct The_Mess
{
std::string A_string;
long double A_ldouble;
char C[10];
int* intPointer;
std::vector<unsigned int> A_vector;
std::string Another_String;
}
I made a undordered map, give that A_string contain the key of the record:
std::unordered_map<std::string, The_Mess> The_UnOrdMap;
and a vector I sort by the A_string value(which contain the key):
std::vector<The_Mess> The_Vector;
with also a index vector sorted, and used to access as 3thrd way:
std::vector<std::string> index;
The key will be a random string of 0-20 characters in lenght(I wanted the worst possible scenario) containing letter both capital and normal and numbers or spaces.
So, in short our contendents are:
Unordered map I measure the time the program get to execute:
record = The_UnOrdMap.at( key ); record is just a The_Mess struct.
Sorted Vector measured statements:
low = std::lower_bound (The_Vector.begin(), The_Vector.end(), key, compare);
record = *low;
Sorted Index vector:
low2 = std::lower_bound( index.begin(), index.end(), key);
indice = low2 - index.begin();
record = The_Vector[indice];
The time is in nanoseconds and is a arithmetic average of 200 iterations. I have a vector that I shuffle at every iteration containing all the keys, and at every iteration I cycle through it and look for the key I have here in the three ways.
So this are my results:
I think the initials spikes are a fault of my testing logic(the table I iterate contains only the keys generated so far, so it only has 1-n elements). So 200 iterations of 1 key search for the first time. 200 iterations of 2 keys search the second time etc...
Anyway, it seem that in the end the best option is the unordered map, considering that is a lot less code, it's easier to implement and will make the whole program way easier to read and probably maintain/modify.
You have to think about caching as well. In case of std::vector you'll have very good cache performance when accessing the elements - when accessing one element in RAM, CPU will cache nearby memory values and this will include nearby portions of your std::vector.
When you use std::map (or std::unordered_map) this is no longer true. Maps are usually implemented as self balancing binary-search trees, and in this case values can be scattered around the RAM. This imposes great hit on cache performance, especially as maps get bigger and bigger as CPU just cannot cache the memory that you're about to access.
You'll have to run some tests and measure performance, but cache misses can greatly hurt the performance of your program.
You are most likely to get the same performance (the difference will not be measurable).
Contrary to what some people seem to believe, unordered_map is not a binary tree. The underlying data structure is a vector. As a result, cache locality does not matter here - it is the same as for vector. Granted, you are going to suffer if you have collissions due to your hashing function being bad. But if your key is a simple integer, this is not going to happen. As a result, access to to element in hash map will be exactly the same as access to the element in the vector with time spent on getting hash value for integer, which is really non-measurable.

Data structure for storing huge number of indices, each pointing to a set

I am using a red black tree implementation in C++ (std::map), but currently, I see that my unsigned long long int indices get bigger and bigger, for larger experiment. I am going for 700,000,000 indices, and each index stores a std::set that contains a few more int elements (about 1-10). We got 128 GB RAM, but I see that we start to run short of it; in fact, if possible, I wanna go down even to 1,000,000,000 indices, if possible, in my experiment.
I gave this some thought, and was thinking about a forest of several maps put together. Basically, after a map hits a certain size threshold (or perhaps when bad_alloc starts to be thrown), save it to disk, clear it off the memory and then create another map and keep on doing until I got all indices. However, during the loading part, this will be very inefficient, as we can only hold one map in the RAM at a time. Worse, we need to check all maps for consistency.
So in this case, what are some of the data structure should I be looking for?
From your description, I think you have this:
typedef std::map<long long, std::set<int>> MyMap;
where the map is very big, and the individual sets are quite small. There are several sources of overhead here:
the individual entries in the map, each of which is a separate allocation;
the individual entries in the sets, ditto;
the structures which describe each set, independent of their contents.
With standard library components, it's not possible to eliminate all of these overheads; the semantics of associative containers pretty well mandates the individual allocation of each entry, and the use of red-black trees requires the addition of several pointers to each entry (in theory, only two pointers are required, but efficient implementation of iterators is difficult without parent pointers.)
However, you can reduce the overhead without losing functionality by combining the map with the sets, using a datastructure like this:
typedef std::set<std::pair<long long, int>> MyMap;
You can still answer all the same queries, although a few of them are slightly less convenient. Remember that std::pair's default comparator sorts in lexicographical order, so all of the elements with the same first value will be contiguous. So you can, for example, query whether a given index has any ints associated with it by using:
it = theMap.lower_bound(std::make_pair(index, INT_MIN));
if (it != theMap.end() && it->first == index) {
// there is at least one int associated with index
}
The same call to lower_bound will give you a begin iterator for the ints associate with the key, while a call toupper_bound(std::make_pair(key, INT_MAX))` will give you the corresponding end iterator, so you can easily iterate over all the values associated with a given key.
That still might not be enough to store 700 million indices with associated sets of integers in 128GB unless the average set size is really small. The next step would have to be a b-tree of some form, which is not in the standard library. B-trees avoid the individual entry overhead by combining a number of entries into a single cluster; that should be sufficient for your needs.
it looks like it is time to switch to B-trees (may be B+ or B*) -- this structure used in databases to manage indices. take a look here -- this is replacement for std-like associative containers w/ btree inside... but btrees can be used to keep indices in memory and on disk...
For such a large scale dataset, you should really work with a proper database server such as an SQL server. These servers are intended to work with cached large-scale datasets. An SQL server saves the data to a permenant cache such as a HDD, while maintaining good read/write performance by caching frequently accessed pages etc.

Difference in performance between map and unordered_map in c++

I have a simple requirement, i need a map of type . however i need fastest theoretically possible retrieval time.
i used both map and the new proposed unordered_map from tr1
i found that at least while parsing a file and creating the map, by inserting an element at at time.
map took only 2 minutes while unordered_map took 5 mins.
As i it is going to be part of a code to be executed on Hadoop cluster and will contain ~100 million entries, i need smallest possible retrieval time.
Also another helpful information:
currently the data (keys) which is being inserted is range of integers from 1,2,... to ~10 million.
I can also impose user to specify max value and to use order as above, will that significantly effect my implementation? (i heard map is based on rb trees and inserting in increasing order leads to better performance (or worst?) )
here is the code
map<int,int> Label // this is being changed to unordered_map
fstream LabelFile("Labels.txt");
// Creating the map from the Label.txt
if (LabelFile.is_open())
{
while (! LabelFile.eof() )
{
getline (LabelFile,inputLine);
try
{
curnode=inputLine.substr(0,inputLine.find_first_of("\t"));
nodelabel=inputLine.substr(inputLine.find_first_of("\t")+1,inputLine.size()-1);
Label[atoi(curnode.c_str())]=atoi(nodelabel.c_str());
}
catch(char* strerr)
{
failed=true;
break;
}
}
LabelFile.close();
}
Tentative Solution: After review of comments and answers, i believe a Dynamic C++ array would be the best option, since the implementation will use dense keys. Thanks
Insertion for unordered_map should be O(1) and retrieval should be roughly O(1), (its essentially a hash-table).
Your timings as a result are way OFF, or there is something WRONG with your implementation or usage of unordered_map.
You need to provide some more information, and possibly how you are using the container.
As per section 6.3 of n1836 the complexities for insertion/retreival are given:
http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2005/n1836.pdf
One issue you should consider is that your implementation may need to continually be rehashing the structure, as you say you have 100mil+ items. In that case when instantiating the container, if you have a rough idea about how many "unique" elements will be inserted into the container, you can pass that in as a parameter to the constructor and the container will be instantiated accordingly with a bucket-table of appropriate size.
The extra time loading the unordered_map is due to dynamic array resizing. The resizing schedule is to double the number of cells each when the table exceeds it's load factor. So from an empty table, expect O(lg n) copies of the entire data table. You can eliminate these extra copies by sizing the hash table upfront. Specifically
Label.reserve(expected_number_of_entries / Label.max_load_factor());
Dividing by the max_load_factor is to account for the empty cells that are necessary for the hash table to operate.
unordered_map (at least in most implementations) gives fast retrieval, but relatively poor insertion speed compared to map. A tree is generally at its best when the data is randomly ordered, and at its worst when the data is ordered (you constantly insert at one end of the tree, increasing the frequency of re-balancing).
Given that it's ~10 million total entries, you could just allocate a large enough array, and get really fast lookups -- assuming enough physical memory that it didn't cause thrashing, but that's not a huge amount of memory by modern standards.
Edit: yes, a vector is basically a dynamic array.
Edit2: The code you've added some some problems. Your while (! LabelFile.eof() ) is broken. You normally want to do something like while (LabelFile >> inputdata) instead. You're also reading the data somewhat inefficiently -- what you apparently expecting is two numbers separated by a tab. That being the case, I'd write the loop something like:
while (LabelFile >> node >> label)
Label[node] = label;