How to initialize an empty unordered map in C++?

I have to run a loop (looping on t) over an unordered map in C++, and each time the loop runs the unordered map gets updated. But what I want is to start with an empty map on every iteration. How do I initialise an empty unordered map?
while (t--) {
    unordered_map<int, int> freq;
    // perform various insertions and deletions in the map
    // print all the elements in the map
}

Unordered maps are a bit tricky in the sense that they use two things:
A chain of {key, value} pairs (in libstdc++ this is a singly linked list of nodes).
An array of pointers into that chain, i.e. the bucket array of the hash table.
When you insert elements into the map, the array fills up (the load factor increases) and hash collisions become more frequent. Eventually the array has to be resized and all of its entries (the pointers into the chain of pairs) recreated; this is called rehashing.
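A small snippet to watch this happen (the exact bucket counts are implementation-dependent):
#include <iostream>
#include <unordered_map>

int main() {
    std::unordered_map<int, int> m;
    for (int i = 0; i < 20; ++i) {
        m[i] = i;
        // bucket_count() jumps whenever the load factor would exceed
        // max_load_factor() (1.0 by default); each jump is a rehash.
        std::cout << m.size() << " elements in " << m.bucket_count()
                  << " buckets, load factor " << m.load_factor() << '\n';
    }
}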
That being said, your code does exactly what you are asking for: declaring a variable of type std::unordered_map<int, int> default-initialises it as an empty map. When the loop body ends, the map goes out of scope (its destructor is called) and a fresh one is initialised at the start of the next iteration.
However, you might consider an alternative: declare the map outside the loop and call clear() at the beginning of each iteration:
std::unordered_map<int, int> freq;
while (t--) {
    freq.clear();
    // do something with freq
}
If all iterations are similar (you insert a similar number of pairs into freq), the first iteration will grow the hash table to the appropriate size (rehashing takes place), but subsequent iterations largely won't pay that cost again: clear() erases all the chain's elements but keeps the bucket array, which is reused for the whole loop.
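A quick way to observe this (note: keeping the bucket array across clear() is what common implementations such as libstdc++ and libc++ do, not something the standard guarantees):
#include <iostream>
#include <unordered_map>

int main() {
    std::unordered_map<int, int> freq;
    for (int i = 0; i < 1000; ++i) freq[i] = i;   // force a few rehashes
    std::cout << "before clear: " << freq.bucket_count() << " buckets\n";
    freq.clear();                                  // removes the elements...
    std::cout << "after clear:  " << freq.bucket_count() << " buckets\n";
    // ...but the bucket array is typically kept, so the next round of
    // insertions does not start from a tiny table again.
}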

Related

What is unordered_set in C++

Below is a block of code from LeetCode, and I'm wondering what the second line means. I've never seen a set initialized this way. Could anyone help? Thanks!
bool containsDuplicate(vector<int>& nums) {
    unordered_set<int> s (nums.begin(), nums.end());
    return s.size() != nums.size();
}
unordered_set<int> s (nums.begin(), nums.end());
This initializes the set by iterating over the specified vector, starting at the beginning of the vector and iterating until it reaches the end, calling s.insert(theVal) for each int in the vector.
return s.size()!=nums.size();
Since an unordered_set, by its nature, does not allow duplicate keys (trying to insert a key that is already a member of the set leaves the set unchanged), we know that if the final size of the set is less than the size of the input vector, the input vector must have contained at least one duplicate value.
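For example, with a small input:
#include <iostream>
#include <unordered_set>
#include <vector>

int main() {
    std::vector<int> nums{1, 2, 3, 1};
    // The range constructor inserts every element; the second 1 is rejected.
    std::unordered_set<int> s(nums.begin(), nums.end());
    std::cout << s.size() << " unique values out of " << nums.size()
              << " -> duplicate: " << std::boolalpha
              << (s.size() != nums.size()) << '\n';
}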

Stable sort a C++ hash map - preserve the insertion order for equal elements

Say I have a std::unordered_map<std::string, int> that represents a word and the number of times that word appeared in a book, and I want to be able to sort it by the value.
The problem is, I want the sorting to be stable, so that in case two items have equal value I want the one who got inserted first to the map to be first.
It is simple to implement by adding an additional field that keeps the insertion time, then creating a comparator that uses both the time and the value. Using plain std::sort gives O(N log N) time complexity.
In my case, space is not an issue as long as time can be improved. I want to take advantage of that and use bucket sort, which should give O(N) time complexity. But with bucket sort there is no comparator, and when iterating the items in the map the insertion order is not preserved.
How can I make it stable and still keep the O(N) time complexity, via bucket sort or something else?
I guess that if I had some kind of hash map that preserves the order of insertion while iterating it, it would solve my issue.
Any other solutions with the same time complexity are acceptable.
Note - I already saw this and that; since they are both from 2009 and my case is, I think, more specific, I opened this question.
Here is a possible solution I came up with using an std::unordered_map and tracking the order of inserting using a std::vector.
Create a hash map with the string as key and count as value.
In addition, create a vector with iterators to that map type.
When counting elements, if the object is not yet in the map, add to both map and vector. Else, just increment the counter. The vector will preserve the order the elements got inserted to the map, and the insertion / update will still be in O(1) time complexity.
Apply bucket sort by iterating over the vector (instead of the map), this ensures the order is preserved and we'll get a stable sort. O(N)
Extract from the buckets to make a sorted array. O(N)
Implementation:
std::unordered_map<std::string, int> map;
std::vector<std::unordered_map<std::string, int>::iterator> order;
int max_count = 0;
// Let's assume this is my string stream
std::vector<std::string> words = {"a","b","a" ... };
// Reserve up front so no rehash happens below and the iterators
// stored in `order` stay valid.
map.reserve(words.size());
// Insert elements into map and the corresponding iterator into order
for (auto& word : words) {
    auto it = map.emplace(word, 1);   // pair<iterator, bool>
    if (!it.second) {
        it.first->second++;           // word already present: bump its count
    }
    else {
        order.push_back(it.first);    // first occurrence: remember the order
    }
    max_count = std::max(max_count, it.first->second);
}
// Bucket sorting
/* We iterate over the vector and not the map:
   this ensures we visit the words in the order they were inserted */
std::vector<std::vector<std::string>> buckets(max_count);
for (auto o : order) {
    int count = o->second;
    buckets[count - 1].push_back(o->first);
}
std::vector<std::string> res;
for (auto it = buckets.rbegin(); it != buckets.rend(); ++it)
    for (auto& str : *it)
        res.push_back(str);

Parallelism with C++ unordered_map

I have an unordered_map of type std::unordered_map<std::string, int64_t> sMap. This contains a number of strings and a 'weight' associated with each of them. I want to find the strings with the N largest weights.
If I wanted to do this using a single thread, I think I could create a priority queue of pairs like this
std::priority_queue<
    std::pair<std::string, int64_t>,
    std::vector<std::pair<std::string, int64_t>>,
    std::function<bool(std::pair<std::string, int64_t>&,
                       std::pair<std::string, int64_t>&)>> prQ(comparePair);
and just go through the whole unordered_map, inserting elements to prQ while maintaining length N.
I want to achieve the same using multiple threads. I was thinking of assigning each thread to work on a few elements of the unordered_map to create a local priority queue of length N which can be merged into a global one at the end.
The problem I am facing right now is that the iterator I get from unordered_map::begin() does not support the + operator. At least, that is the error I am getting:
error: no match for ‘operator+’ (operand types are ‘std::unordered_map<std::basic_string<char>, long int>::iterator {aka std::__detail::_Node_iterator<std::pair<const std::basic_string<char>, long int>, false, true>}’ and ‘int’)
Thus, I cannot really specify a range of elements to be worked upon by a particular thread. The [] operator would take a key as expected and not an offset.
Essentially, I can't seem to find a way to have a data parallel loop that would work with only a few elements per thread. How can I solve this problem using multiple threads then?
EDIT: @Brian Vandberg asked me to supply a simplified example of the code that generates the error I was talking about.
std::unordered_map<std::string, int64_t> sMap;
// Initialize sMap values
int start = 0, end = 2;
for (auto i = sMap.begin() + start; i != sMap.begin() + end; ++i) {   // fails to compile
    std::cout << i->first << "\t" << i->second << "\n";
}
First, I'm not sure that I'd go with a priority queue for this problem (either single threaded, or as the part performed by a specific thread). The standard library has nth_element, which you can use to find the nth element in linear time. Following that, finding which elements are larger is also linear time.
You might consider this approach if speed is the problem, and yours if size is (nth_element effectively forces you to create a copy of the data). In this solution you iterate over the map (or part of it) and push_back only the weights into a vector, on which you run nth_element. In the second stage, loop over the map again and pick the entries whose weight is at least as large as the nth one found.
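For illustration, a rough sketch of that two-pass idea (the name top_n and the exact tie handling are mine; sMap and N are the names from the question):
#include <algorithm>
#include <cstdint>
#include <functional>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

std::vector<std::pair<std::string, int64_t>>
top_n(const std::unordered_map<std::string, int64_t>& sMap, std::size_t N) {
    if (N == 0 || sMap.empty()) return {};

    std::vector<int64_t> weights;
    weights.reserve(sMap.size());
    for (const auto& kv : sMap) weights.push_back(kv.second);

    if (N > weights.size()) N = weights.size();
    // After this call, the N-th largest weight sits at index N-1.
    std::nth_element(weights.begin(), weights.begin() + (N - 1), weights.end(),
                     std::greater<int64_t>());
    const int64_t threshold = weights[N - 1];

    std::vector<std::pair<std::string, int64_t>> result;
    for (const auto& kv : sMap)
        if (kv.second >= threshold)   // ties can make this slightly more than N
            result.emplace_back(kv.first, kv.second);
    return result;
}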
Suppose you have the loop:
std::size_t j = 0;
for (const auto &e : sMap)
{
    if (++j % k != i)
        continue;
    // Rest of code goes here.
}
Then if you use it for the ith thread out of k, it will partition the elements between the threads. Moreover, while all threads are iterating over the same elements (if only to skip most of them), it's happening in parallel.
Each thread can generate its candidates for the largest m elements, then choose the largest m elements from the km candidates using the method above (with the nth_element) or any other method.
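A minimal sketch of that partitioning, assuming the map is not modified while the threads read it (the function name is mine; trimming each local list to the top m and the final merge are left out):
#include <cstdint>
#include <string>
#include <thread>
#include <unordered_map>
#include <utility>
#include <vector>

using Entry = std::pair<std::string, int64_t>;

std::vector<std::vector<Entry>> scan_in_parallel(
    const std::unordered_map<std::string, int64_t>& sMap, std::size_t k) {
    std::vector<std::vector<Entry>> local(k);
    std::vector<std::thread> workers;
    for (std::size_t i = 0; i < k; ++i) {
        workers.emplace_back([&sMap, &local, i, k] {
            std::size_t j = 0;
            for (const auto& e : sMap) {
                if (j++ % k != i) continue;   // not this thread's element
                local[i].push_back(e);        // candidate owned by thread i
            }
        });
    }
    for (auto& t : workers) t.join();
    return local;   // merge the k lists and pick the overall top m afterwards
}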
It's interesting to ask what size of sMap will generate any speedup in practice.

Is std::sort the best choice to do in-place sort for a huge array with limited integer value?

I want to sort an array with a huge number (millions or even billions) of elements, while the values are integers within a small range (1 to 100 or 1 to 1000). In such a case, are std::sort and the parallelized version __gnu_parallel::sort the best choice for me?
Actually I want to sort a vector of my own class with an integer member representing the processor index.
As there are other members inside the class, even if two objects have the same integer member used for comparison, they might not be regarded as the same data.
Counting sort would be the right choice if you know that your range is so limited. If the range is [0, m), the most efficient way is a vector in which the index represents the element and the value the count. For example:
vector<int> to_sort;
vector<int> counts;
for (int i : to_sort) {
    if ((int)counts.size() <= i) {
        counts.resize(i + 1, 0);   // grow lazily up to the largest value seen
    }
    counts[i]++;
}
Note that the count at i is lazily initialized but you can resize once if you know m.
If you are sorting objects by some field and they are all distinct, you can modify the above as:
vector<T> to_sort;
vector<vector<const T*>> count_sorted;
for (const T& t : to_sort) {
    const int i = t.sort_field();
    if ((int)count_sorted.size() <= i) {
        count_sorted.resize(i + 1, {});
    }
    count_sorted[i].push_back(&t);
}
Now the main difference is that your space requirements grow substantially because you need to store the vectors of pointers: the space complexity goes from O(m) to O(n + m). Time complexity is the same, and note that the algorithm is stable. The code above assumes that to_sort stays in scope for the whole life cycle of count_sorted. If your Ts implement move semantics you can store the objects themselves and move them in; if you need count_sorted to outlive to_sort, you will need to do so or make copies.
If you have a range of type [-l, m), the substance does not change much, but your index now represents the value i + l and you need to know l beforehand.
Finally, it should be trivial to simulate an iteration through the sorted array by iterating through the counts array taking into account the value of the count. If you want stl like iterators you might need a custom data structure that encapsulates that behavior.
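For example, a small sketch of that pass over the counts vector built above (the function name expand is mine):
#include <cstddef>
#include <vector>

std::vector<int> expand(const std::vector<int>& counts, std::size_t total) {
    std::vector<int> sorted;
    sorted.reserve(total);
    for (std::size_t i = 0; i < counts.size(); ++i)
        for (int c = 0; c < counts[i]; ++c)
            sorted.push_back(static_cast<int>(i));   // value i appears counts[i] times
    return sorted;
}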
Note: in the previous version of this answer I mentioned multiset as a way to use a data structure to count sort. This would be efficient in some java implementations (I believe the Guava implementation would be efficient) but not in C++ where the keys in the RB tree are just repeated many times.
You say "in-place", I therefore assume that you don't want to use O(n) extra memory.
First, count the number of objects with each value (as in Giovanni's and ronaldo's answers). You still need to get the objects into the right locations in-place. I think the following works, but I haven't implemented or tested it:
Create a cumulative sum from your counts, so that you know what index each object needs to go to. For example, if the counts are 1: 3, 2: 5, 3: 7, then the cumulative sums are 1: 0, 2: 3, 3: 8, 4: 15, meaning that the first object with value 1 in the final array will be at index 0, the first object with value 2 will be at index 3, and so on.
The basic idea now is to go through the vector, starting from the beginning. Get the element's processor index, and look up the corresponding cumulative sum. This is where you want it to be. If it's already in that location, move on to the next element of the vector and increment the cumulative sum (so that the next object with that value goes in the next position along). If it's not already in the right location, swap it with the correct location, increment the cumulative sum, and then continue the process for the element you swapped into this position in the vector.
There's a potential problem when you reach the start of a block of elements that have already been moved into place. You can solve that by remembering the original cumulative sums, "noticing" when you reach one, and jump ahead to the current cumulative sum for that value, so that you don't revisit any elements that you've already swapped into place. There might be a cleverer way to deal with this, but I don't know it.
Finally, compare the performance (and correctness!) of your code against std::sort. This has better time complexity than std::sort, but that doesn't mean it's necessarily faster for your actual data.
You definitely want to use counting sort. But not the one you're thinking of. Its main selling point is that its time complexity is O(N+X) where X is the maximum value you allow the sorting of.
Regular old counting sort (as seen on some other answers) can only sort integers, or has to be implemented with a multiset or some other data structure (becoming O(Nlog(N))). But a more general version of counting sort can be used to sort (in place) anything that can provide an integer key, which is perfectly suited to your use case.
The algorithm is somewhat different though, and it's also known as American Flag Sort. Just like regular counting sort, it starts off by calculating the counts.
After that, it builds a prefix sums array of the counts. This is so that we can know how many elements should be placed behind a particular item, thus allowing us to index into the right place in constant time.
Since we know the correct final position of the items, we can just swap them into place. Doing just that would work if there weren't any repetitions, but since it's almost certain that there will be repetitions, we have to be more careful.
First: when we put something into its place, we have to increment the value in the prefix sum so that the next element with the same value doesn't displace the element we just placed.
Second, either:
keep track of how many elements of each value have already been put into place, so that we don't keep moving elements whose value has already reached its final region; this requires a second copy of the counts array (prior to calculating the prefix sum), as well as a "move count" array, or
keep a copy of the prefix sums shifted over by one, so that we stop moving elements once the stored position of the latest element reaches the first position of the next value.
Even though the first approach is somewhat more intuitive, I chose the second method (because it's faster and uses less memory).
#include <algorithm>   // std::iter_swap
#include <iterator>    // std::distance

template<class It, class KeyOf>
void countsort(It begin, It end, KeyOf key_of) {
    constexpr int max_value = 1000;
    int final_destination[max_value] = {}; // zero initialized
    int destination[max_value] = {};       // zero initialized
    // Record counts
    for (It it = begin; it != end; ++it)
        final_destination[key_of(*it)]++;
    // Build prefix sum of counts
    for (int i = 1; i < max_value; ++i) {
        final_destination[i] += final_destination[i-1];
        destination[i] = final_destination[i-1];
    }
    for (auto it = begin; it != end; ++it) {
        auto key = key_of(*it);
        // while item is not in the correct position
        while (std::distance(begin, it) != destination[key] &&
               // and not all items of this value have reached their final position
               final_destination[key] != destination[key]) {
            // swap into the right place
            std::iter_swap(it, begin + destination[key]);
            // tidy up for next iteration
            ++destination[key];
            key = key_of(*it);
        }
    }
}
Usage:
vector<Person> records = populateRecords();
countsort(records.begin(), records.end(), [](Person const &p){
    return p.id() - 1; // map [1, 1000] -> [0, 1000)
});
This can be further generalized to become MSD Radix Sort,
here's a talk by Malte Skarupke about it: https://www.youtube.com/watch?v=zqs87a_7zxw
Here's a neat visualization of the algorithm: https://www.youtube.com/watch?v=k1XkZ5ANO64
The answer given by Giovanni Botta is perfect, and Counting Sort is definitely the way to go. However, I personally prefer not to go resizing the vector progressively, but I'd rather do it this way (assuming your range is [0-1000]):
vector<int> to_sort;
vector<int> counts(1001);
int maxvalue = 0;
for (int i : to_sort) {
    if (i > maxvalue) maxvalue = i;
    counts[i]++;
}
counts.resize(maxvalue + 1);
It is essentially the same, but no need to be constantly managing the size of the counts vector. Depending on your memory constraints, you could use one solution or the other.

How to access/iterate over all non-unique keys in an unordered_multimap?

I would like to access/iterate over all non-unique keys in an unordered_multimap.
The hash table is basically a map from a signature <SIG>, which does in practice occur more than once, to identifiers <ID>. I would like to find those entries in the hash table whose key <SIG> occurs more than once.
Currently I use this approach:
// map <SIG> -> <ID>
typedef unordered_multimap<int, int> HashTable;
HashTable& ht = ...;
for (HashTable::iterator it = ht.begin(); it != ht.end(); ++it)
{
    size_t n = 0;
    std::pair<HashTable::iterator, HashTable::iterator> itpair = ht.equal_range(it->first);
    for ( ; itpair.first != itpair.second; ++itpair.first) {
        ++n;
    }
    if (n > 1) { // access those items again as the previous iterators are not valid anymore
        std::pair<HashTable::iterator, HashTable::iterator> itpair = ht.equal_range(it->first);
        for ( ; itpair.first != itpair.second; ++itpair.first) {
            // do something with those items
        }
    }
}
This is certainly not efficient as the outer loop iterates over all elements of the hash table (via ht.begin()) and the inner loop tests if the corresponding key is present more than once.
Is there a more efficient or elegant way to do this?
Note: I know that with an unordered_map instead of an unordered_multimap I wouldn't have this issue, but due to application requirements I must be able to store multiple keys <SIG> pointing to different identifiers <ID>. Also, an unordered_map<SIG, vector<ID>> is not a good choice for me: it uses roughly 150% of the memory, since I have many unique keys and vector<ID> adds quite a bit of overhead for each item.
Use std::unordered_multimap::count() to determine the number of elements with a specific key. This saves you the first inner loop.
You cannot prevent iterating over the whole HashTable. For that, the HashTable would have to maintain a second index that maps cardinality to keys. This would introduce significant runtime and storage overhead and is only useful in a small number of cases.
You can hide the outer loop using std::for_each(), but I don't think it's worth it.
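For illustration, a sketch of the count()-based version using the question's HashTable typedef (the function name is mine; since equivalent keys are adjacent in an unordered_multimap's iteration order, it is also safe to jump past each equal_range group):
#include <unordered_map>

typedef std::unordered_multimap<int, int> HashTable;   // as in the question

void visit_non_unique(HashTable& ht) {
    for (auto it = ht.begin(); it != ht.end(); ) {
        auto range = ht.equal_range(it->first);
        if (ht.count(it->first) > 1) {
            for (auto dup = range.first; dup != range.second; ++dup) {
                // do something with the duplicate entries
            }
        }
        it = range.second;   // skip the whole group for this key
    }
}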
I think that you should change your data model to something like:
std::map<int, std::vector<int> > ht;
Then you could easily iterate over the map and check how many items each element contains with size().
But with this data model, building the structure and then reading it back out is a little bit more work.
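A small sketch of that model (the function name and the entries parameter standing in for the (SIG, ID) pairs are mine):
#include <map>
#include <utility>
#include <vector>

void group_by_signature(const std::vector<std::pair<int, int>>& entries) {
    std::map<int, std::vector<int>> ht;
    for (const auto& e : entries)
        ht[e.first].push_back(e.second);       // group IDs under each signature

    for (const auto& kv : ht)
        if (kv.second.size() > 1) {
            // kv.second holds every ID sharing the non-unique signature kv.first
        }
}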