Why does unordered_set maintain insertion order here - c++

I know unordered_set makes no guarantee about maintaining order.
But I was wondering: if a simple ascending sequence, say N natural numbers, were inserted into a set, would it maintain the order?
It did...
unordered_set<int> nums;

for (int i = 10; i > 0; --i) {
    nums.insert(i);
}

nums.erase(4);

for (auto i : nums) {
    cout << i << endl;
}
This prints:
1
2
3
5
6
7
8
9
10
I ran this several times with consistent results.
Why did it maintain the reverse order of insertion?
Is there a solid reason for this behaviour? (If this works, it can make my code super efficient ;) )

if a simple ascending sequence, say N natural numbers, were inserted into a set, would it maintain the order? It did...
But it didn't; it reversed the order. As you say yourself, in fact:
Why did it maintain the reverse order of insertion?
This is just an implementation detail. How any implementation of std::unordered_set stores its elements is not specified by the standard, except for the fact that it has buckets. Another implementation could very well store these 10 integers in order of insertion (not in the reverse order), or really any order at all.
Is there a solid reason for this behaviour?
Kind of. I will take GCC's implementation as an example.
std::unordered_set stores its elements in buckets. When inserting the first element, it allocates enough space for a small number of buckets, say 13. This space is divided into 13 buckets, and each corresponds to the hash of one of the integers 0 through 12 (the hash of an integer being the integer itself). That way, inserting the next few elements, which are all integers between 0 and 12, causes no rehash or collision, and each element takes its own bucket. Again, the reason they end up in the reverse order is an implementation detail and not relevant to this part.
However, if you insert more than 13 elements, the set needs to reallocate memory and move the elements, and at that point their order can change. In the case of GCC's implementation, it turns out the elements end up in the insertion order after the move, and so inserting the integers from 0 to 13 gives this sequence: 13 0 1 2 3 4 5 6 7 8 9 10 11 12.
You can look at the buckets yourself:
#include <unordered_set>
#include <iostream>

int main() {
    std::unordered_set<int> s;

    auto print_s = [&s]() {
        std::cout << "s = [ ";
        for (auto i : s) {
            std::cout << i << " ";
        }
        std::cout << "]\n";
    };

    auto print_bucket = [&s](std::size_t bucket) {
        for (auto it = s.begin(bucket); it != s.end(bucket); ++it) {
            std::cout << *it << " ";
        }
        std::cout << "\n";
    };

    for (int i = 0; i < 20; ++i) {
        std::cout << "i=" << i << "\n";
        s.insert(i);
        std::cout << "bucket_count=" << s.bucket_count() << "\n";
        for (std::size_t b = 0; b < s.bucket_count(); ++b) {
            if (s.bucket_size(b) != 0) {
                std::cout << "\tb=" << b << ": ";
                print_bucket(b);
            } else {
                std::cout << "\tb=" << b << ": empty\n";
            }
        }
        print_s();
    }
}
Clang (with libc++ - thanks Miles Budnek) and MSVC do something completely different from GCC (libstdc++).

The container keeping the reverse insertion order is probably a side effect.
The primary goal would be to avoid iterating through every empty bucket, which could be inefficient; the way that is implemented, the insertion order happens to be somewhat preserved.
From what it appears, the implementation simply keeps track of the last used bucket and links to it whenever a new bucket comes into use (resulting in reverse-linked buckets).
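To see why that produces reverse iteration order, here is a toy sketch (the node type is hypothetical, not GCC's actual internals): each insert links the new node in front of the current head, so traversal from the head visits elements in the reverse order of insertion.
#include <iostream>

struct Node {
    int value;
    Node* next;
};

int main() {
    Node* head = nullptr;
    for (int i = 1; i <= 5; ++i) {
        head = new Node{i, head}; // head-insert: O(1), no tail pointer needed
    }
    for (Node* p = head; p != nullptr; p = p->next) {
        std::cout << p->value << " "; // prints: 5 4 3 2 1
    }
    std::cout << "\n";
    // (nodes intentionally leaked; this is only a demo)
}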


Vector going out of bounds

I'm attempting to iterate through a list of 6 'Chess' pieces. Each round they move a random amount and if they land on another then they 'kill' it.
The problem is that when the last piece in my vector kills another piece, I get a vector 'out of range' error. I'm guessing it's because I'm iterating through the vector whilst also removing items from it, but I'm not increasing the counter when I erase a piece, so I'm not entirely sure. Any help would be greatly appreciated.
Here is my vector:
vector<Piece*> pieces;
pieces.push_back(&b);
pieces.push_back(&r);
pieces.push_back(&q);
pieces.push_back(&b2);
pieces.push_back(&r2);
pieces.push_back(&q2);
and this is the loop I use to iterate:
while (pieces.size() > 1) {
    cout << "-------------- Round " << round << " --------------" << endl;
    round++;
    cout << pieces.size() << " pieces left" << endl;
    i = 0;
    while (i < pieces.size()) {
        pieces.at(i)->move(board.getMaxLength());
        j = 0;
        while (j < pieces.size()) {
            if (pieces.at(i) != pieces.at(j) && col.detectCollision(pieces.at(i), pieces.at(j))) {
                cout << pieces.at(i)->getName() << " has slain " << pieces.at(j)->getName() << endl << endl;
                pieces.at(i)->setKills(pieces.at(i)->getKills() + 1);
                pieces.erase(pieces.begin() + j);
            }
            else {
                j++;
            }
        }
        i++;
    }
}
Solution
Add a break immediately after the erase, so the inner loop is left before the now-shifted indices are used again:
pieces.erase(pieces.begin() + j);
break;
Your logic needs a little refinement.
The way you coded it, the "turn based" nature of chess seems to have been replaced by a kind of "priority list": the pieces closer to the start of the vector move first and thus get priority in smashing other pieces.
I don't know whether you intend this logic to be right or wrong. Anyway, the trouble seems to come from unconditionally executing the line
i++;
It should not be executed when you remove a piece, for the same reason j++ isn't executed: you would jump over a piece.
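A sketch of one way to apply that fix, keeping the question's index-based style (one assumption on my part: i != j replaces the pointer comparison, which is equivalent as long as the vector holds no duplicate pointers):
i = 0;
while (i < pieces.size()) {
    pieces.at(i)->move(board.getMaxLength());
    j = 0;
    while (j < pieces.size()) {
        if (i != j && col.detectCollision(pieces.at(i), pieces.at(j))) {
            pieces.at(i)->setKills(pieces.at(i)->getKills() + 1);
            pieces.erase(pieces.begin() + j);
            if (j < i) --i; // the mover shifted one slot left; keep i on it
            // do not advance j: the next piece now occupies index j
        } else {
            j++;
        }
    }
    i++;
}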
I have a very strong feeling that this line of code:
i++;
is your culprit: it is missing either a needed break condition or an additional conditional check. This matters because your nested while loops' conditions are based on the current size of your vector, and your indices are not being updated accordingly:
while (pieces.size() > 1) {
    // ...
    while (i < pieces.size()) {
        // ...
        while (j < pieces.size()) {
            // ...
        }
    }
}
This is due to the fact that you are calling this within the innermost nested loop:
pieces.erase(pieces.begin() + j);
You are inside nested while loops, and when a certain condition is met you erase the object at this index location in your vector while still inside the inner loop, which you never break from and where you never re-check that the index is still valid.
Initially you enter the while loop with a vector of 6 entries; you call erase on it within the nested loop, and now your vector has 5 entries.
This can wreak havoc on your loops, because your index counters i and j were set according to the original length of the vector, size 6, but the vector has been reduced to size 5 while you are still inside the innermost loop. On the next iteration these indices may be invalid, as you never break out of the loops to reset them according to the new size of your vector, nor check that they are still valid.
Try running this simple program that will demonstrate what I mean by the indices being invalidated within your nested loops.
#include <iostream>
#include <string>
#include <vector>

int main() {
    std::vector<std::string> words{ "please", "erase", "me" };

    std::cout << "Original Size: " << words.size() << '\n';
    for (auto& s : words)
        std::cout << s << " ";
    std::cout << '\n';

    words.erase(words.begin() + 2);

    std::cout << "New Size: " << words.size() << '\n';
    for (auto& s : words)
        std::cout << s << " ";
    std::cout << '\n';

    return 0;
}
-Output-
Original Size: 3
please erase me
New Size: 2
please erase
You should save pieces.at(i) in a local variable and use that local variable everywhere you currently repeat pieces.at(i).
To avoid both out-of-bounds access and logical problems, you can use std::list.
As an aside, you should use std::vector<Piece*> only if these are non-owning pointers; otherwise use smart pointers, probably std::unique_ptr.
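For instance, a minimal sketch of the owning version (the constructor arguments are hypothetical; std::make_unique needs C++14):
#include <memory>
#include <vector>

std::vector<std::unique_ptr<Piece>> pieces;
pieces.push_back(std::make_unique<Piece>(/* ctor args */));
// ...
pieces.erase(pieces.begin() + j); // erase() now also destroys the owned Piece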

c++ - Solution to 2-sum using unordered_map

Okay, so I am trying to solve the 2-SUM problem in C++. Given a file of 1,000,000 numbers in arbitrary order, I need to determine whether there exist pairs of integers whose sum is t, where t is each value in [-10000, 10000]. So this is basically the 2-SUM problem.
So I coded up my solution in C++, using unordered_map as my hash table. I am ensuring a low load on the hash table. But it still takes around 1 hr 15 mins to finish (successfully). Now I am wondering if it should be that slow. Further reducing the load factor did not give any considerable performance boost.
I have no idea where I can optimise the code. I tried different load factors; it doesn't help. This is a question from a MOOC, and people have been able to get this done in around 30 mins using the same hash table approach. Can anybody help me make this code faster, or at least give a hint as to where the code might be slowing down?
Here is the code -
#include <iostream>
#include <unordered_map>
#include <fstream>

int main(int argc, char *argv[]) {
    if (argc != 2) {
        std::cerr << "Usage: ./2sum <filename>" << std::endl;
        exit(1);
    }
    std::ifstream input(argv[1]);
    std::ofstream output("log.txt");
    std::unordered_map<long, int> data_map;
    data_map.max_load_factor(0.05);
    long tmp;
    while (input >> tmp) {
        data_map[tmp] += 1;
    }
    std::cerr << "input done!" << std::endl;
    std::cerr << "load factor " << data_map.load_factor() << std::endl;
    //debug print.
    for (auto iter = data_map.begin(); iter != data_map.end(); ++iter) {
        output << iter->first << " " << iter->second << std::endl;
    }
    std::cerr << "debug print done!" << std::endl;
    //solve
    long ans = 0;
    for (long i = -10000; i <= 10000; ++i) {
        //try to find a pair whose sum = i.
        //debug print.
        if (i % 100 == 0)
            std::cerr << i << std::endl;
        for (auto iter = data_map.begin(); iter != data_map.end(); ++iter) {
            long x = iter->first;
            long y = i - x;
            if (x == y)
                continue;
            auto search_y = data_map.find(y);
            if (search_y != data_map.end()) {
                ++ans;
                break;
            }
        }
    }
    std::cout << ans << std::endl;
    return 0;
}
On a uniform set with all sums equally probable, the code below will finish in seconds. Otherwise, for any missing sums, it takes about 0.75 secs on my laptop to check each missing sum.
The solution has a minor improvement compared with the OP's code: checking for duplicates and eliminating them.
Then it proceeds with a Monte Carlo heuristic: for about 1% of the total numbers, it randomly picks a number from the set and searches for all the sums in the [minSum, maxSum] range that can be made with the randomly picked number as one term and each of the others as the second term. This pre-populates the sums set with... say... 'sums that can be found trivially'. In my tests, using 1M numbers generated randomly between -10M and 10M, this single step is all that is necessary, and it takes a couple of seconds.
For pathological number distributions, in which some of the sum values are missing (or have not been found by the random heuristic), the second part uses a targeted exhaustive search over the not-yet-found sum values, much along the same lines as the solution in the OP.
Extra explanation of the random/Monte Carlo heuristic (to address @AneeshDandime's comment of):
Though i do not fully understand it at the moment
Well, it's simple. Think of it like this: the naive approach is to take all the input values and add them up in pairs, retaining only the sums in [-10k, 10k]. That is, however, terribly expensive (O(N^2)). An immediate refinement would be: pick a value v0, then determine which other values v1 stand a chance of giving a sum in the [-10k, 10k] range. If the input values are sorted, it's easier: you only need to select the v1-s in [-10k - v0, 10k - v0]. A good improvement, but if you keep this as the only approach, an exhaustive search would still be O(N · log2(N) · |[-10k, 10k]|).
However, this approach still has its value: if the input values are uniformly distributed, it will quickly populate the known-sums set with the most common values (and spend the rest of the time trying to find infrequent or missing sum values).
To capitalise on it, instead of using it until the end, one can run a limited number of steps and hope to populate the majority of the sums. After that, we switch focus and enter the 'targeted search for sum values', but only for the sum values not found in this step.
[Edited: previous bug corrected. The algorithm is now stable with respect to values that occur multiple times or only once in the input.]
#include <algorithm>
#include <iostream>
#include <vector>
#include <random>
#include <unordered_set>
#include <unordered_map>

int main() {
    typedef long long value_type;

    // +++++++++++++++++++++++++++++++++++++++++++++++++++++++
    // substitute this with your input sequence from the file
    std::random_device rd;
    std::mt19937 gen(rd());
    std::uniform_int_distribution<value_type> initRnd(-5500, 10000000);
    std::vector<value_type> sorted_vals;
    for (std::size_t i = 0; i < 1000000; i++) {
        value_type rnd = initRnd(gen);
        sorted_vals.push_back(rnd);
    }
    std::cout << "Initialization end" << std::endl;
    // end of input
    // +++++++++++++++++++++++++++++++++++++++++++++++++++++++

    // use some constants instead of magic values
    const value_type sumMin = -10000, sumMax = 10000;

    // Mapping val -> number of occurrences
    std::unordered_map<value_type, size_t> hashed_vals;
    for (auto val : sorted_vals) {
        ++hashed_vals[val];
    }

    // retain only the unique values and sort them
    sorted_vals.clear();
    for (auto val = hashed_vals.begin(); val != hashed_vals.end(); ++val) {
        sorted_vals.push_back(val->first);
    }
    std::sort(sorted_vals.begin(), sorted_vals.end());

    // Store the encountered sums here
    std::unordered_set<int> sums;

    // some 1% iterations, looking at random for pairs of numbers which will
    // contribute a sum in the [-10000, 10000] range, and we'll collect those sums.
    // We'll use the sorted vector of values for this purpose.
    // If we are lucky, most of the sums (if not all) will already be filled in.
    std::uniform_int_distribution<size_t> rndPick(0, sorted_vals.size() - 1);
    size_t numRandomPicks = size_t(sorted_vals.size() * 0.1);
    if (numRandomPicks > 75000) {
        numRandomPicks = 75000;
    }
    for (size_t i = 0; i < numRandomPicks; i++) {
        // pick a value index at random
        size_t randomIx = rndPick(gen);
        value_type val = sorted_vals[randomIx];
        // now search for the values between sumMin-val and sumMax-val
        auto low = std::lower_bound(sorted_vals.begin(), sorted_vals.end(), sumMin - val);
        if (low == sorted_vals.end()) {
            continue;
        }
        auto high = std::upper_bound(sorted_vals.begin(), sorted_vals.end(), sumMax - val);
        if (high == sorted_vals.begin()) {
            continue;
        }
        for (auto rangeIt = low; rangeIt != high; rangeIt++) {
            if (*rangeIt != val || hashed_vals[val] > 1) {
                // if not the same as the randomly picked value,
                // or if it is the same but that value occurred more than once in the input
                auto sum = val + *rangeIt;
                sums.insert(sum);
            }
        }
        if (sums.size() == size_t(sumMax - sumMin + 1)) {
            // lucky us, we found them all
            break;
        }
    }

    // after which, if some sums are not present, we'll search for them specifically
    if (sums.size() != size_t(sumMax - sumMin + 1)) {
        std::cout << "Number of sums still missing: "
                  << size_t(sumMax - sumMin + 1) - sums.size()
                  << std::endl;
        for (int sum = sumMin; sum <= sumMax; sum++) {
            if (sums.find(sum) == sums.end()) {
                std::cout << "looking for sum: " << sum;
                // we couldn't find the sum, so we'll need to search for it.
                // We'll use the hashed_vals map this time to search for the other value
                bool found = false;
                for (auto i = sorted_vals.begin(); !found && i != sorted_vals.end(); ++i) {
                    value_type v = *i;
                    value_type other_val = sum - v;
                    if ( // v---- either two unequal terms to be summed or...
                        (other_val != v || hashed_vals[v] > 1) // ...the value occurred more than once
                        && hashed_vals.find(other_val) != hashed_vals.end() // and the other term exists
                    ) {
                        // found. Record it as such and break
                        sums.insert(sum);
                        found = true;
                    }
                }
                std::cout << (found ? " found" : " not found") << std::endl;
            }
        }
    }
    std::cout << "Total number of distinct sums found: " << sums.size() << std::endl;
}
You can reserve space for the unordered map up front. It should increase performance a bit.
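For example, a sketch against the question's code (1000000 is the input size stated in the question):
std::unordered_map<long, int> data_map;
data_map.max_load_factor(0.05);
data_map.reserve(1000000); // allocate the buckets once up front instead of rehashing repeatedly while reading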
What about sorting the array first, and then for each element using binary search to find the number which would bring the sum closest to -10000, then keep going "right" until the sum exceeds +10000?
This way you avoid going through the array 20000 times.
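A sketch of that idea (assuming the input has already been read into a std::vector<long> named vals):
#include <algorithm>
#include <unordered_set>
#include <vector>

// Sort once, drop duplicates, then for each x only scan the partners y
// with -10000 <= x + y <= 10000 instead of the whole array.
std::sort(vals.begin(), vals.end());
vals.erase(std::unique(vals.begin(), vals.end()), vals.end());

std::unordered_set<long> sums;
for (long x : vals) {
    auto lo = std::lower_bound(vals.begin(), vals.end(), -10000 - x);
    auto hi = std::upper_bound(vals.begin(), vals.end(), 10000 - x);
    for (auto it = lo; it != hi; ++it) {
        if (*it != x) {            // mirrors the x != y rule in the OP's code
            sums.insert(x + *it);
        }
    }
}
// sums.size() is the number of targets t in [-10000, 10000] hit by some pair
Each x then only scans partners inside a 20001-wide window, rather than the whole table for every target t.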

c++ List Elements Randomly Not Deleted By Loop

I have a function that randomly deletes elements from a map called bunnies (which contains class objects) and a list called names (which contains the keys to the map) once the number of elements exceeds 250. However, at random, the map element will be deleted but the list entry will not (I think this is what is going on, though clearly part of the map element survives). The outcome is that when I use the second section of code to iterate through the list and display the mapped values associated with those keys, I get large negative values like the example at the bottom.
Clearly the list element isn't being deleted, but why?
void cull(std::map<std::string, Bunny> &bunnies, std::list<std::string> &names, int n)
{
    int number = n, position = 0;
    for (number = n; number > 125; number--)
    {
        position = rand() % names.size();
        std::list<std::string>::iterator it = names.begin();
        std::advance(it, position);
        bunnies.erase(*it);
        names.erase(it);
        it = names.begin();
    }
    std::cout << "\n" << n - 125 << " rabbits culled";
}
I use this code to print out the map values.
for (std::list<std::string>::iterator it = names.begin(); it != names.end(); it++)
{
    n++;
    std::cout << n << "\t" << " " << *it << "\t" << bunnies[*it].a() << "\t" << bunnies[*it].s() << "\t" << bunnies[*it].c() << "\t" << bunnies[*it].st() << "\n";
}
This is the output. The top line is what it should display; the bottom line is what happens when the program fails.
165 Tom_n 14 1 0 1
166 Lin_c -842150451 -842150451 -842150451 -842150451
The problem seems to be this:
std::advance(it, number);
This should be position, not number.
The other problem is that a map stores unique keys. What if there is more than one Bunny with the same name? For example, if the list has 3 bunnies named "John", the map can hold only one "John", since the keys in a map must be unique.
Either use a multimap if names can be duplicated, or use a std::set instead of a std::list if bunnies must have unique names.
Maybe, overall, you can just use std::map<std::string, Bunny> and forget about the std::list. The map by itself has all the information you need. Unless there is something I'm missing, I don't see the need for a std::list doing redundant work.
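A sketch of that simplification, keeping the question's Bunny type and rand()-based selection (map iterators are bidirectional, so std::advance still works, just in O(position)):
#include <cstdlib>
#include <iostream>
#include <iterator>
#include <map>
#include <string>

void cull(std::map<std::string, Bunny> &bunnies, int n)
{
    for (int number = n; number > 125; number--) {
        auto it = bunnies.begin();
        std::advance(it, std::rand() % bunnies.size());
        bunnies.erase(it); // a single container: nothing can fall out of sync
    }
    std::cout << "\n" << n - 125 << " rabbits culled";
}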

Searching Multiple Vectors

I have 4 vectors with about 45,000 records each right now. I'm looking for an efficient method to run through these 4 vectors and output how many times they match the user's input. The data needs to match at the same index of each vector.
Multiple for loops? Vector find?
Thanks!
If the elements need to match at the same location, it seems that a std::find() or std::find_if() combined with a check of the other vectors at that position is a reasonable approach:
std::vector<A> a(...);
std::vector<B> b(...);
std::vector<C> c(...);
std::vector<D> d(...);

std::size_t match(0);
for (auto it = a.begin(), end = a.end(); it != end; ) {
    it = std::find_if(it, end, conditionA); // conditionA etc. stand for the actual checks
    if (it != end) {
        if (conditionB[it - a.begin()]
            && conditionC[it - a.begin()]
            && conditionD[it - a.begin()]) {
            ++match;
        }
        ++it;
    }
}
From the description, what I got is that you have 4 vectors and lots of user data, and you need to find out how many times the user data matches the vectors at the same index.
So here goes the code (I am writing it against g++ 4.3.2):
#include <iostream>
#include <vector>
#include <algorithm>
using namespace std;

int main() {
    // typeT is a placeholder for your element type
    vector<typeT> a;
    vector<typeT> b;
    vector<typeT> c;
    vector<typeT> d;
    vector<typeT> matched;
    /* I am assuming you have initialized a, b, c and d;
       now we are going to do a pre-calculation for matching user data and store
       the result in the vector `matched` */
    size_t minsize = min(min(a.size(), b.size()), min(c.size(), d.size()));
    for (size_t i = 0; i < minsize; i++)
    {
        if (a[i] == b[i] && b[i] == c[i] && c[i] == d[i])
            matched.push_back(a[i]);
    }
    return 0;
}
This was the pre-calculation part. What comes next depends on the data type you are using: use binary search with a little extra counting, or use a better data structure which stores pairs (value, occurrence count) and then apply binary search.
The time complexity will be O(n + n·log(n) + m·log(n)), where n is minsize in the code and m is the number of user inputs.
Honestly, I would use a couple of methods to maintain your database (the vectors).
Essentially, do a quicksort to start out with.
Then every so often run an insertion sort (faster than quicksort for partially sorted lists).
Then just run binary search on those vectors.
Edit:
I think a better way to store this, instead of using multiple vectors per entry, is to have one vector of a class that stores all the values (your current vectors):
class entry {
public:
    variable data1; // "variable" is a placeholder for the actual data type
    variable data2;
    variable data3;
    variable data4;
};
Make this into a single vector, then use the method I described above to sort and search it.
You will have to sort by whichever type of data you search on first, then call binary search on that data.
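A sketch of that combined layout (the field names and query value are placeholders; the comparator picks whichever field you search on):
#include <algorithm>
#include <string>
#include <vector>

struct Entry {
    std::string data1, data2, data3, data4; // placeholder fields
};

// Sort by the searched field, then count matches with equal_range in O(log n).
std::size_t countMatches(std::vector<Entry>& entries, const std::string& input)
{
    auto byData1 = [](const Entry& x, const Entry& y) { return x.data1 < y.data1; };
    std::sort(entries.begin(), entries.end(), byData1); // in practice, sort once, not per query
    auto range = std::equal_range(entries.begin(), entries.end(),
                                  Entry{input, "", "", ""}, byData1);
    return range.second - range.first;
}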
You can create a lookup table for the vector with std::unordered_multimap in O(n). Then you can use unordered_multimap::count() to get the number of times the item appears in the vector and unordered_multimap::equal_range() to get the indices of the items inside your vector.
std::vector<std::string> a = {"ab", "ba", "ca", "ab", "bc", "ba"};
std::vector<std::string> b = {"fg", "fg", "ba", "eg", "gf", "ge"};
std::vector<std::string> c = {"pq", "qa", "ba", "fg", "de", "gf"};

std::unordered_multimap<std::string, int> lookup_table;
for (std::size_t i = 0; i < a.size(); i++) {
    lookup_table.insert(std::make_pair(a[i], i));
    lookup_table.insert(std::make_pair(b[i], i));
    lookup_table.insert(std::make_pair(c[i], i));
}

// count
std::string userinput;
std::cin >> userinput;
int count = lookup_table.count(userinput);
std::cout << userinput << " shows up " << count << " times" << std::endl;

// print all the places where the key shows up
auto range = lookup_table.equal_range(userinput);
for (auto it = range.first; it != range.second; it++) {
    int ind = it->second;
    std::cout << " " << it->second << " "
              << a[ind] << " "
              << b[ind] << " "
              << c[ind] << std::endl;
}
This will be the most efficient if you will be searching the lookup table many times. If you only need to search once, then Dietmar Kühl's approach would be the most efficient.

Given a sequence of numbers report which have been repeated and how many times

For example, given the sequence
1 1 1 1 2 3 3 4 5 5 5
the output should be:
1 repeated 3 times,
3 repeated 1 time,
5 repeated 2 times
Here's the code, but it has some problems:
int i, k, m, number, number_prev, e;

cout << "Insert how many numbers: ";
cin >> m;
cout << "insert number";
cin >> number;
number_prev = number;

int num_rep[m];  //array of repeated numbers
int cant_rep[m]; // array of correspondent number of repetitions

e = 0;
for (i = 1; i < m; i++)
{
    cin >> number;
    if (number == number_prev)
    {
        if (number == num_rep[e-1])
            cant_rep[e-1]++;
        else
        {
            num_rep[e] = number;
            cant_rep[e] = e + 1;
            e++;
        }
    }
    else
        e = 0;
    number_prev = number;
}

for (k = 0; k < e; k++)
    cout << "\nnumber " << num_rep[k] << " repeated " << cant_rep[k] << " times.\n";
You should learn algorithms and data structures; this makes your code simpler. Just using an associative container that stores the mapping
a number --> how many times it repeats
can simplify your program significantly:
#include <iostream>
#include <map>

int main()
{
    std::map<int, int> map;
    int v;
    while (std::cin >> v) {
        map[v]++;
    }
    for (auto it = map.cbegin(); it != map.cend(); ++it) {
        if (it->second > 1) {
            std::cout << it->first << " repeats " << it->second - 1 << " times\n";
        }
    }
}
std::map is an associative container.
You can think of it as key --> value storage with unique keys.
A real-world example is a dictionary: there you have a word and its definition. The word is the key and the definition is the value.
std::map<int, int> map;
         ^^^  ^^^
          |    |
   key type    value type
You can refer to values using the [] operator.
This works like a usual array, except that instead of an index you use your key.
You can also examine all the key-value pairs stored in the map using iterators:
it = map.cbegin(); // refers to the first key-value pair in the map
++it;              // moves to the next key-value pair
it != map.cend();  // checks whether we are at the end of the map, i.e. have already examined all elements
As I pointed out, a map saves key-value pairs.
In the C++ standard library, the struct std::pair is used to express a pair.
It has first and second members that represent the first and second values stored in the pair.
In the case of a map, first is the key and second is the value.
Again, we store a number as the key and how many times it repeats as the value.
Then we read user input and increase the value for the given key.
After that, we just examine all the elements stored in the map and print them.
int num_rep[m]; //array of repeated numbers
int cant_rep[m]; // array of correspondent number of repetitions
Here, m is only known at run time, but array sizes must be known at compile time (variable-length arrays are a compiler extension, not standard C++). Use std::vector instead.
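A sketch of that replacement:
#include <vector>

std::vector<int> num_rep(m);  // array of repeated numbers
std::vector<int> cant_rep(m); // array of corresponding repetition counts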
The code reads like a C-style C++ program:
1. You don't need to declare variables at the beginning of a block. Declare them right before use; it's more readable.
2. Using STL types like std::vector can save you a lot of trouble in programs like this.
3. You say "Insert how many numbers", but for (i=1; i<m; i++) will loop m-1 times, which might not be what you want.
4. As a supplementary piece of advice, you should validate input coming from the external world, like cin >> m, since it can be zero or negative.