boost::unordered_map is... ordered? - c++

I have a boost::unordered_map, but it appears to be in order, giving me an overwhelming feeling of "You're Doing It Wrong". Why is the output to this in order? I would've expected the underlying hashing algorithm to have randomized this order:
#include <iostream>
#include <boost/unordered_map.hpp>
int main()
{
boost::unordered_map<int, int> im;
for(int i = 0; i < 50; ++i)
{
im.insert(std::make_pair(i, i));
}
boost::unordered_map<int, int>::const_iterator i;
for(i = im.begin(); i != im.end(); ++i)
{
std::cout << i->first << ", " << i->second << std::endl;
}
return 0;
}
...gives me...
0, 0
1, 1
2, 2
...
47, 47
48, 48
49, 49
Upon examination of boost's source code:
inline std::size_t hash_value(int v)
{
return static_cast<std::size_t>(v);
}
...which would explain it. The answers below hold the higher level thinking, as well, which I found useful.

While I can't speak to the boost internals as I'm not a C++ guy, I can propose a few higher-level questions that may alleviate your concerns:
1) What are the guarantees of an "unordered" map? Say you have an ordered map, and you want to create a map that does not guarantee ordering. An initial implementation may simply use the ordered map. It's almost never a problem to provide stronger guarantees than you advertise.
2) A hash function is something that hashes X -> int. If you already have an integer, you could use the identity function. While it may not be the most efficient in all cases, it could explain the behavior you're seeing.
Basically, seeing behavior like this is not necessarily a problem.

It is probably because your hashes are small integers.
Hash tables usually compute the index of the bucket for an item like this: bucket_index = hash % p, where p is the number of buckets, often a prime chosen large enough to keep collisions infrequent.
For integers, the hash is simply the value of the integer.
You have a lot of data, so the hash table selects a large p.
For any p larger than i, bucket_index = i % p = i.
When iterating, the hash table returns items from its buckets in order of bucket index, which for you is the order of the keys. :)
Try using larger numbers if you want to see some randomness.
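The arithmetic above can be sketched in a few lines. This is only an illustration, assuming an identity hash for integers and a made-up bucket count of 53; bucket_index and buckets_for_small_keys are hypothetical helpers, and real implementations (Boost or the standard library) pick their own bucket counts and may post-process the hash.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: maps a hash value to a bucket the way the answer describes.
std::size_t bucket_index(std::size_t hash, std::size_t p) {
    assert(p != 0);
    return hash % p;
}

// Returns the bucket index of each key 0..n-1, assuming the identity hash.
std::vector<std::size_t> buckets_for_small_keys(std::size_t n, std::size_t p) {
    std::vector<std::size_t> result;
    for (std::size_t key = 0; key < n; ++key) {
        // Whenever key < p, key % p == key, so the key lands in "its own" bucket.
        result.push_back(bucket_index(key, p));
    }
    return result;
}
```

With 50 keys and 53 buckets, every key ends up in the bucket with its own number, so walking the buckets by index visits the keys in ascending order. A larger key such as 1000 wraps around to bucket 46.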

You're doing it right. unordered_map doesn't claim to have random order. In fact, it makes no claims about order whatsoever. You shouldn't expect anything whatsoever in terms of order, and that goes for disorder!

This is because map by default is ordered by 'order of insertion of keys' means if you insert keys 1,2,3,4,5 and print it you will always get 1,2,3,4,5 so it looks ordered. Try to add with random key values and see the result. It will not be same every time, as it should not be.

Related

Find uncommon elements using hashing

I think this is a fairly common question but I didn't find any answer for this using hashing in C++.
I have two arrays, both of the same lengths, which contain some elements, for example:
A={5,3,5,4,2}
B={3,4,1,2,1}
Here, the uncommon elements are: {5,5,1,1}
I have tried this approach- iterating a while loop on both the arrays after sorting:
while(i<n && j<n) {
if(a[i]<b[j])
uncommon[k++]=a[i++];
else if (a[i] > b[j])
uncommon[k++]=b[j++];
else {
i++;
j++;
}
}
while(i<n && a[i]!=b[j-1])
uncommon[k++]=a[i++];
while(j < n && b[j]!=a[i-1])
uncommon[k++]=b[j++];
and I am getting the correct answer with this. However, I want a better approach in terms of time complexity since sorting both arrays every time might be computationally expensive.
I tried to do hashing but couldn't figure it out entirely.
To insert elements from arr1[]:
set<int> uncommon;
for (int i=0;i<n1;i++)
uncommon.insert(arr1[i]);
To compare arr2[] elements:
for (int i = 0; i < n2; i++)
if (uncommon.find(arr2[i]) != uncommon.end())
Now, what I am unable to do is to send only those elements to the uncommon array[] which are uncommon to both of them.
Thank you!
First of all, std::set has nothing to do with hashing. Sets and maps are ordered containers. Implementations may differ, but they are most likely balanced binary search trees. Whatever you do, you won't get faster than O(n log n) with them, which is the same complexity as sorting.
If you're fine with O(n log n) and sorting, I'd strongly advise just using the std::set_symmetric_difference algorithm (https://en.cppreference.com/w/cpp/algorithm/set_symmetric_difference); it requires two sorted ranges.
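A minimal sketch of the sorted approach, assuming the inputs are std::vector<int> and that copies may be sorted; the function name symmetric_difference_sorted is made up for this example.

```cpp
#include <algorithm>
#include <iterator>
#include <vector>

// Sorts copies of both inputs, then collects the elements that have no
// one-to-one partner in the other array (duplicates count separately).
std::vector<int> symmetric_difference_sorted(std::vector<int> a, std::vector<int> b) {
    std::sort(a.begin(), a.end());
    std::sort(b.begin(), b.end());
    std::vector<int> out;
    std::set_symmetric_difference(a.begin(), a.end(), b.begin(), b.end(),
                                  std::back_inserter(out));
    return out;
}
```

For the arrays in the question this yields 1 1 5 5, i.e. the same elements as {5,5,1,1} but in sorted order.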
But if you insist on an implementation relying on hashing, you should use std::unordered_set or std::unordered_map. This way you can be faster than O(n log n): you can get your answer in O(n + m) time on average, where n = a.size() and m = b.size(). You should create two unordered_sets, hashed_a and hashed_b, and in two loops check which elements of hashed_a are not in hashed_b, and which elements of hashed_b are not in hashed_a. Here is the pseudocode:
create hashed_a and hashed_b
create set_result // for the result
for (a_v : hashed_a)
if (a_v not in hashed_b)
set_result.insert(a_v)
for (b_v : hashed_b)
if (b_v not in hashed_a)
set_result.insert(b_v)
return set_result // it holds the symmetric difference, which is what you need
UPDATE: as noted in the comments, my answer doesn't account for duplicates. The easiest way to modify it for duplicates would be to use unordered_map<int, int>, with the elements as keys and the number of occurrences as values.
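One possible way to keep duplicates, sketched under the assumption that "uncommon" means "the value does not occur in the other array at all" (which matches the expected {5,5,1,1} in the question). Instead of the count map mentioned in the update, this variant hashes the values of each array and then walks the original arrays; uncommon_elements is a hypothetical name.

```cpp
#include <unordered_set>
#include <vector>

// Returns every element of a whose value is absent from b, followed by
// every element of b whose value is absent from a. Average cost O(n + m).
std::vector<int> uncommon_elements(const std::vector<int>& a, const std::vector<int>& b) {
    const std::unordered_set<int> in_a(a.begin(), a.end());
    const std::unordered_set<int> in_b(b.begin(), b.end());
    std::vector<int> out;
    for (int v : a) {
        if (in_b.count(v) == 0) out.push_back(v);  // duplicates in a are kept
    }
    for (int v : b) {
        if (in_a.count(v) == 0) out.push_back(v);  // duplicates in b are kept
    }
    return out;
}
```

Note that this differs from a multiset symmetric difference: if 5 occurred twice in A and once in B, this sketch would report no 5 at all, whereas std::set_symmetric_difference would report one.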
First, you need to find a way to distinguish between the same values contained in the same array (for ex. 5 and 5 in the first array, and 1 and 1 in the second array). This is the key to reducing the overall complexity, otherwise you can't do better than O(nlogn). A good possible algorithm for this task is to create a wrapper object to hold your actual values, and put in your arrays pointers to those wrapper objects with actual data, so your pointer addresses will serve as a unique identifier for objects. This wrapping will cost you just O(n1+n2) operations, but also an additional O(n1+n2) space.
Now your problem is that you have in both arrays only elements unique to each of those arrays, and you want to find the uncommon elements. This means the (Union of both array elements) - (Intersection of both array elements). Therefore, all you need to do is to push all the elements of the first array into a hash-map (complexity O(n1)), and then start pushing all the elements of the second array into the same hash-map (complexity O(n2)), by detecting the collisions (equality of an element from first array with an element from the second array). This comparison step will require O(n2) comparisons in the worst case. So for the maximum performance optimization you could have checked the size of the arrays before starting pushing the elements into the hash-map, and swap the arrays so that the first push will take place with the longest array. Your overall algorithm complexity would be O(n1+n2) pushes (hashings) and O(n2) comparisons.
The implementation is the most boring part, so I leave it to you ;)
A solution without sorting (and without hashing, but you seem to care more about complexity than about the hashing itself) is to notice the following: an uncommon element e is an element that is in exactly one multiset.
This means that the multiset of all uncommon elements is the union between 2 multisets:
S1 = the elements in A that are not in B
S2 = the elements in B that are not in A
Using std::set_difference, you get:
#include <set>
#include <vector>
#include <iostream>
#include <algorithm>
#include <iterator> // std::back_inserter
int main() {
std::multiset<int> ms1{5,3,5,4,2};
std::multiset<int> ms2{3,4,1,2,1};
std::vector<int> v;
std::set_difference( ms1.begin(), ms1.end(), ms2.begin(), ms2.end(), std::back_inserter(v));
std::set_difference( ms2.begin(), ms2.end(), ms1.begin(), ms1.end(), std::back_inserter(v));
for(int e : v)
std::cout << e << ' ';
return 0;
}
Output:
5 5 1 1
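Each std::set_difference call performs at most 2*(N1+N2)-1 comparisons, so the two calls together perform at most 4*(N1+N2)-2 comparisons, where N1 and N2 are the sizes of the multisets. Building the multisets themselves costs O(N log N).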
Links:
set_difference: https://en.cppreference.com/w/cpp/algorithm/set_difference
compiler explorer: https://godbolt.org/z/o3KGbf
The question can be solved in O(n log n) time.
ALGORITHM
Sort both arrays with merge sort in O(n log n). You can also use the standard sort function, for example sort(array1.begin(), array1.end()).
Now use the two-pointer method to skip all elements common to both arrays.
Program for the above method:
int i = 0, j = 0;
while (i < array1.size() && j < array2.size()) {
// If not common, print smaller
if (array1[i] < array2[j]) {
cout << array1[i] << " ";
i++;
}
else if (array2[j] < array1[i]) {
cout << array2[j] << " ";
j++;
}
// Skip common element
else {
i++;
j++;
}
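}
// Whatever remains in either array has no counterpart in the other one
while (i < array1.size()) cout << array1[i++] << " ";
while (j < array2.size()) cout << array2[j++] << " ";
The complexity of the above program is O(array1.size() + array2.size()) once both arrays are sorted, i.e. O(2n) = O(n) in the worst case.
The above program prints the uncommon elements. If you want to store them, just create a vector and push them into it instead.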

Create a function that checks whether an array has two opposite elements or not for less than n^2 complexity. (C++)

Create a function that checks whether an array has two opposite elements or not for less than n^2 complexity. Let's work with numbers.
Obviously the easiest way would be:
bool opposite(int* arr, int n) // n - array length
{
for(int i = 0; i < n; ++i)
{
for(int j = i + 1; j < n; ++j) // start after i so an element (e.g. 0) is never paired with itself
{
if(arr[i] == - arr[j])
return true;
}
}
return false;
}
I would like to ask if any of you guys can think of an algorithm that has a complexity less than n^2.
My first idea was the following:
1) sort array ( algorithm with worst case complexity: n.log(n) )
2) create two new arrays, filled with negative and positive numbers from the original array
( so far we've got -> n.log(n) + n + n = n.log(n))
3) ... compare somehow the two new arrays to determine if they have opposite numbers
I'm not sure my ideas are correct, but I'm open to suggestions.
An important alternative solution is as follows. Sort the array. Create two pointers, one initially pointing to the front (smallest), one initially pointing to the back (largest). If the sum of the two pointed-to elements is zero, you're done. If it is larger than zero, then decrement the back pointer. If it is smaller than zero, then increment the front pointer. Continue until the two pointers meet.
This solution is often the one people are looking for; often they'll explicitly rule out hash tables and trees by saying you only have O(1) extra space.
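The two-pointer walk described above could look roughly like this. It is a sketch, assuming the input is a std::vector<int>; has_opposite_pair is a made-up name, and the function sorts a copy, so sort in place instead if you really need O(1) extra space.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Returns true if two distinct positions hold values that sum to zero.
bool has_opposite_pair(std::vector<int> v) {
    if (v.size() < 2) return false;
    std::sort(v.begin(), v.end());
    std::size_t front = 0;             // smallest remaining element
    std::size_t back = v.size() - 1;   // largest remaining element
    while (front < back) {
        // Widen before adding so INT_MIN + INT_MIN cannot overflow.
        const long long sum = static_cast<long long>(v[front]) + v[back];
        if (sum == 0) return true;
        if (sum > 0) {
            --back;    // sum too large: drop the largest element
        } else {
            ++front;   // sum too small: drop the smallest element
        }
    }
    return false;
}
```

Because the pointers never cross, a single 0 is not paired with itself, while two zeros do count as an opposite pair.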
I would use a std::unordered_set and check whether the opposite of the number already exists in the set. If not, insert it into the set and check the next element.
std::vector<int> foo = {-10,12,13,14,10,-20,5,6,7,20,30,1,2,3,4,9,-30};
std::unordered_set<int> res;
for (auto e : foo)
{
    if(res.count(-e) > 0)
        std::cout << "opposite of " << e << " already exists\n";
    else
        res.insert(e);
}
Output:
opposite of 10 already exists
opposite of 20 already exists
opposite of -30 already exists
You can simply add all of the elements to an unordered_set, and when you are adding x, check whether -x is already in the set. The complexity of this solution is O(n) on average (as Hurkyl said, thanks).
UPDATE: A second idea: sort the elements, and then for each element check (using binary search) whether the opposite element exists.
You can do this in O(n log n) with a Red Black tree.
t := empty tree
for each e in A[1..n]
if (-e) is in t:
return true
insert e into t
return false
In C++, you wouldn't implement a Red Black tree for this purpose however. You'd use std::set, because it guarantees O(log n) search and insertion.
std::set<int> s;
for (auto e : A) {
if (s.count(-e) > 0) {
return true;
}
s.insert(e);
}
return false;
As Hurkyl mentioned, you could do better by just using std::unordered_set, which is a hashtable. This gives you O(1) search and insertion in the average case, but O(n) for both operations in the worst case. The total complexity of the solution in the average case would be O(n).

How can I economically store a sparse matrix during the process of element filling?

I know there are quite a few good ways to store a sparse matrix without taking up much memory.
But I'm wondering whether there is a good way to store a sparse matrix during the construction of it? Here is the more detailed scenario: the program constructs a sparse matrix by figuring out where to put a non-zero value on each iteration; and since the coordinates of the non-zero value will not be known until runtime, they are totally random and unpredictable.
I'm programming in C++. So is there a way to implement this in C++? Solutions in other languages are also appreciated.
You could have three parallel lists and store the row indices in one, the column indices in another, and the values in the third. Once you are done with all the entries, you could reorganize as needed, e.g. sort by rows and columns.
What is not described in your question is how you need or want to represent the sparse matrix in the end. What do you need to do with it? This would affect the representation.
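A rough sketch of that idea, with one vector of (row, column, value) records instead of three parallel lists, which is an equivalent layout that is easier to sort. Triplet and TripletBuilder are hypothetical names, and double is only an assumed value type.

```cpp
#include <algorithm>
#include <cstddef>
#include <tuple>
#include <vector>

// One non-zero entry of the matrix.
struct Triplet {
    std::size_t row;
    std::size_t col;
    double value;
};

// Collects entries in arrival order; appending is amortized O(1).
class TripletBuilder {
public:
    void add(std::size_t row, std::size_t col, double value) {
        entries_.push_back(Triplet{row, col, value});
    }

    // Returns a copy of the entries ordered by row, then by column,
    // ready to be converted to a row-oriented format.
    std::vector<Triplet> sorted() const {
        std::vector<Triplet> out = entries_;
        std::sort(out.begin(), out.end(), [](const Triplet& x, const Triplet& y) {
            return std::tie(x.row, x.col) < std::tie(y.row, y.col);
        });
        return out;
    }

private:
    std::vector<Triplet> entries_;
};
```

This sketch does not merge entries that hit the same coordinates twice; whether those should be summed or overwritten depends on how the matrix is used in the end.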
std::map might be what you're looking for, it's a key -> value map type. Combine this with std::set, which is a unique collection of elements. So, you could use a map of std::set, like so:
std::map<int, std::set<int> > sparseMatrix;
// Add some edges.
sparseMatrix[0].insert(1); // Add an edge from vertex 0 to 1.
sparseMatrix[4].insert(2); // Add an edge from vertex 4 to 2.
sparseMatrix[0].insert(1); // Edge already exists, no data added to the set.
This representation lets you represent a directed graph; it's analogous to an adjacency list. The behaviour of a set also prevents you from having two 'identical' edges (a->b and c->d, where a=c and b=d), which is nice, and is the same behaviour you'd get if you used an adjacency matrix. You can iterate over all the edges like so:
for(std::map<int, std::set<int> >::const_iterator i = sparseMatrix.begin();
i != sparseMatrix.end();
++i)
{
for(std::set<int>::const_iterator j = i->second.begin();
j != i->second.end();
++j)
{
std::cout << "An edge exists from " << i->first << " to " << *j << ".";
}
}

Given an array of integers, find the first integer that is unique

Given an array of integers, find the first integer that is unique.
My solution: use std::map.
Put the integers into it one by one (number as key, its index as value) (O(n^2 lg n)); if there is a duplicate, remove the entry from the map (O(lg n)). After putting all the numbers into the map, iterate over the map and find the key with the smallest index (O(n)).
It is O(n^2 lg n) because the map needs to do sorting.
It is not efficient.
Are there other, better solutions?
I believe that the following would be the optimal solution, at least based on time / space complexity:
Step 1:
Store the integers in a hash map, which holds the integer as a key and the count of the number of times it appears as the value. This is generally an O(n) operation and the insertion / updating of elements in the hash table should be constant time, on the average. If an integer is found to appear more than twice, you really don't have to increment the usage count further (if you don't want to).
Step 2:
Perform a second pass over the integers. Look each up in the hash map and the first one with an appearance count of one is the one you were looking for (i.e., the first single appearing integer). This is also O(n), making the entire process O(n).
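The two passes could be sketched like this, assuming the input is a std::vector<int> and that returning an index (or -1 when nothing is unique) is acceptable; first_unique_index is a made-up name.

```cpp
#include <cstddef>
#include <unordered_map>
#include <vector>

// Returns the index of the first value that occurs exactly once, or -1 if none.
long first_unique_index(const std::vector<int>& values) {
    // Pass 1: count how often each value appears (average O(1) per update).
    std::unordered_map<int, std::size_t> counts;
    for (int v : values) {
        ++counts[v];
    }
    // Pass 2: the first element whose count is 1 is the answer.
    for (std::size_t i = 0; i < values.size(); ++i) {
        if (counts[values[i]] == 1) {
            return static_cast<long>(i);
        }
    }
    return -1;
}
```

For [5, 5, 66, 66, 7, 1, 1, 77] this returns index 4, the position of 7.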
Some possible optimizations for special cases:
Optimization A: It may be possible to use a simple array instead of a hash table. This guarantees O(1) even in the worst case for counting the number of occurrences of a particular integer as well as the lookup of its appearance count. Also, this enhances real time performance, since the hash algorithm does not need to be executed. There may be a hit due to potentially poorer locality of reference (i.e., a larger sparse table vs. the hash table implementation with a reasonable load factor). However, this would be for very special cases of integer orderings and may be mitigated by the hash table's hash function producing pseudorandom bucket placements based on the incoming integers (i.e., poor locality of reference to begin with).
Each byte in the array would represent the count (up to 255) for the integer represented by the index of that byte. This would only be possible if the difference between the lowest integer and the highest (i.e., the cardinality of the domain of valid integers) was small enough such that this array would fit into memory. The index in the array of a particular integer would be its value minus the smallest integer present in the data set.
For example on modern hardware with a 64-bit OS, it is quite conceivable that a 4GB array can be allocated which can handle the entire domain of 32-bit integers. Even larger arrays are conceivable with sufficient memory.
The smallest and largest integers would have to be known before processing, or another linear pass through the data using the minmax algorithm to find out this information would be required.
Optimization B: You could optimize Optimization A further, by using at most 2 bits per integer (One bit indicates presence and the other indicates multiplicity). This would allow for the representation of four integers per byte, extending the array implementation to handle a larger domain of integers for a given amount of available memory. More bit games could be played here to compress the representation further, but they would only support special cases of data coming in and therefore cannot be recommended for the still mostly general case.
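A sketch of Optimization A, assuming the spread between the smallest and largest value is small enough for the table to fit in memory. It uses one byte per possible value and stops counting at 2, since "more than once" is all that matters; first_unique_index_small_range is a hypothetical name.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Returns the index of the first value that occurs exactly once, or -1 if none.
// Memory use is proportional to (max - min + 1), not to the number of elements.
long first_unique_index_small_range(const std::vector<int>& values) {
    if (values.empty()) return -1;
    // Extra linear pass to learn the range of the data.
    const auto mm = std::minmax_element(values.begin(), values.end());
    const long long lo = *mm.first;
    const long long hi = *mm.second;
    std::vector<unsigned char> counts(static_cast<std::size_t>(hi - lo + 1), 0);
    for (int v : values) {
        unsigned char& c = counts[static_cast<std::size_t>(v - lo)];
        if (c < 2) ++c;  // saturate: 0 = absent, 1 = seen once, 2 = seen more than once
    }
    for (std::size_t i = 0; i < values.size(); ++i) {
        if (counts[static_cast<std::size_t>(values[i] - lo)] == 1) {
            return static_cast<long>(i);
        }
    }
    return -1;
}
```

Since each slot only ever holds 0, 1 or 2, the same table could be packed into 2 bits per value as Optimization B suggests.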
All this for no reason. Just using 2 for-loops & a variable would give you a simple O(n^2) algo.
If you are taking all the trouble of using a hash map, then it might as well be what #Micheal Goldshteyn suggests
UPDATE: I know this question is 1 year old. But was looking through the questions I answered and came across this. Thought there is a better solution than using a hashtable.
When we say unique, there will be a pattern, provided equal elements sit next to each other. E.g.: [5, 5, 66, 66, 7, 1, 1, 77]. Use a moving window of 3. First consider (5,5,66). We can easily establish that there is a duplicate here, so move the window by one element to get (5,66,66). Same here. Move to the next window, (66,66,7). Again there are dups here. Next is (66,7,1). No dups here! Take the middle element, as it has to be the first unique element in the set: the left element belongs to a duplicate pair, and so could the right one (1). Hence 7 is the first unique element.
space: O(1)
time: O(n) * O(m^2) = O(n) * 9 ≈ O(n)
Inserting into a map is O(log n), not O(n log n), so inserting n keys is O(n log n). Also, it's better to use a set.
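Although it's O(n^2), the following has small coefficients, isn't too bad on the cache, and uses memmem() (a GNU extension), which is fast.
for(int x=0;x<len;x++)
// array[x] is unique if it occurs neither after position x nor before it.
// Caveat: memmem() compares raw bytes, so in principle it can also match at an offset that is not int-aligned.
if(memmem(&array[x+1], sizeof(int)*(len-(x+1)), &array[x], sizeof(int))==NULL &&
memmem(array, sizeof(int)*x, &array[x], sizeof(int))==NULL)
return array[x];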
public static string firstUnique(int[] input)
{
    int size = input.Length;
    bool[] dupIndex = new bool[size];
    for (int i = 0; i < size; ++i)
    {
        if (dupIndex[i])
        {
            continue;
        }
        bool unique = true;
        for (int j = i + 1; j < size; ++j)
        {
            // Mark every later copy of input[i], not just the first one,
            // so none of them is mistaken for a unique element later.
            if (input[i] == input[j])
            {
                dupIndex[j] = true;
                unique = false;
            }
        }
        if (unique)
        {
            return input[i].ToString();
        }
    }
    return "No unique element";
}
#user3612419
The solution you gave is good, at roughly O(N^2), but further optimization of the same code is possible. I just added the two or three lines that you missed.
public static string firstUnique(int[] input)
{
    int size = input.Length;
    bool[] dupIndex = new bool[size];
    for (int i = 0; i < size; ++i)
    {
        if (dupIndex[i])
        {
            continue;
        }
        bool unique = true;
        for (int j = i + 1; j < size; ++j)
        {
            // Already known to duplicate an earlier value, so it cannot equal input[i].
            if (dupIndex[j])
            {
                continue;
            }
            if (input[i] == input[j])
            {
                dupIndex[j] = true;
                dupIndex[i] = true;
                unique = false;
            }
        }
        if (unique)
        {
            return input[i].ToString();
        }
    }
    return "No unique element";
}

how to get median value from sorted map

I am using a std::map. Sometimes I will do an operation like: finding the median value of all items. e.g
if I add
1 "s"
2 "sdf"
3 "sdfb"
4 "njw"
5 "loo"
then the median is 3.
Is there some solution without iterating over half the items in the map?
I think you can solve the problem by using two std::maps: one for the smaller half of the items (mapL) and a second for the other half (mapU). When you perform an insert operation, it will be one of these cases:
add the item to mapU and move the smallest element of mapU to mapL
add the item to mapL and move the greatest element of mapL to mapU
If the maps have different sizes and you insert the element into the one with fewer elements, you skip the move step.
The basic idea is that you keep your maps balanced so the maximum size difference is 1 element.
As far as I know the STL, all these operations should work in O(log n) time. Accessing the smallest and greatest element in a map can be done by using iterators.
When you have a query for the middle position, just check the map sizes and return the greatest element in mapL or the smallest element in mapU.
The above usage scenario is for inserting only, but you can extend it to deleting items as well; you then have to keep track of which map holds the item, or try to delete from both.
Here is my code with sample usage:
#include <iostream>
#include <string>
#include <map>
using namespace std;
typedef pair<int,string> pis;
typedef map<int,string>::iterator itis;
map<int,string>Left;
map<int,string>Right;
itis get_last(map<int,string> &m){
return (--m.end());
}
int add_element(int key, string val){
if (Left.empty()){
Left.insert(make_pair(key,val));
return 1;
}
pis maxl = *get_last(Left);
if (key <= maxl.first){
Left.insert(make_pair(key,val));
if (Left.size() > Right.size() + 1){
itis to_rem = get_last(Left);
pis cpy = *to_rem;
Left.erase(to_rem);
Right.insert(cpy);
}
return 1;
} else {
Right.insert(make_pair(key,val));
if (Right.size() > Left.size()){
itis to_rem = Right.begin();
pis cpy = *to_rem;
Right.erase(to_rem);
Left.insert(cpy); // to_rem is invalid after erase, so insert the saved copy
}
return 2;
}
}
pis get_mid(){
int size = Left.size() + Right.size();
if (Left.size() >= size / 2){
return *(get_last(Left));
}
return *(Right.begin());
}
int main(){
Left.clear();
Right.clear();
int key;
string val;
while (cin >> key >> val){ // stop as soon as a read fails instead of processing stale values
add_element(key,val);
pis mid = get_mid();
cout << "mid " << mid.first << " " << mid.second << endl;
}
}
I think the answer is no. You cannot just jump to the N / 2 item past the beginning because a std::map uses bidirectional iterators. You must iterate through half of the items in the map. If you had access to the underlying Red/Black tree implementation that is typically used for the std::map, you might be able to get close like in Dani's answer. However, you don't have access to that as it is encapsulated as an implementation detail.
Try:
typedef std::map<int,std::string> Data;
Data data;
Data::iterator median = data.begin();
std::advance(median, data.size() / 2); // std::advance modifies its argument and returns void
This works if size() is odd. I'll let you work out how to do it when size() is even.
In a self-balancing binary tree (std::map is one, I think), a good approximation would be the root.
For the exact value, just cache it along with a balance indicator: each time an item is added below the median, decrease the indicator, and increase it when an item is added above. When the indicator reaches 2 or -2, move the median up or down one step and reset the indicator.
If you can switch data structures, store the items in a std::vector and sort it. That will enable accessing the middle item positionally without iterating. (It can be surprising, but a sorted vector often outperforms a map, due to locality. For lookups by the sort key you can use binary search, and it will have much the same performance as a map anyway. See Scott Meyers' Effective STL.)
If you know the map will be sorted, then get the element at floor(length / 2). If you're in a bit twiddly mood, try (length >> 1).
I know no way to get the median from a pure STL map quickly for big maps. If your map is small or you need the median rarely you should use the linear advance to n/2 anyway I think - for the sake of simplicity and being standard.
You can use the map to build a new container that offers the median: Jethro suggested using two maps; building on this, perhaps better would be a single map and a continuously updated median iterator. These methods suffer from the drawback that you have to reimplement every modifying operation and, in Jethro's case, even the reading operations.
A custom-written container will also do what you want, probably most efficiently, but at the price of custom code. You could try, as was suggested, to modify an existing STL map implementation. You can also look for existing implementations.
There is a super efficient C implementation that offers most map functionality and also random access called Judy Arrays. These work for integer, string and byte array keys.
Since it sounds like insert and find are your two common operations while median is rare, the simplest approach is to use the map and advance an iterator from m.begin() by m.size()/2 with std::advance, as originally suggested by David Rodríguez. This is linear time, but easy to understand, so I'd only consider another approach if profiling shows that the median calls are too expensive relative to the work your app is doing.
The std::nth_element() algorithm is there for you for this :) It implements the partition part of quicksort, and you don't need your vector (or array) to be sorted.
The time complexity is also O(n) on average (while for sorting you need to pay O(n log n)).
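A small sketch of that approach, assuming the keys have first been copied out of the map into a std::vector<int> (std::nth_element needs random-access iterators, which std::map does not provide). median_by_nth_element is a made-up name, and for an even number of elements it returns the upper of the two middle values.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Returns the element that would sit at index size()/2 if v were sorted.
// Works on a copy, so the caller's data is left untouched.
int median_by_nth_element(std::vector<int> v) {
    assert(!v.empty());
    const auto mid = v.begin() + static_cast<std::ptrdiff_t>(v.size() / 2);
    // Partially reorders v so that *mid is the value a full sort would put there.
    std::nth_element(v.begin(), mid, v.end());
    return *mid;
}
```

Copying the keys is itself linear, so this does not beat walking the map to its middle; it mainly helps when the data already lives in an unsorted vector.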
For a sorted list, here it is in Java code, but I assume it's very easy to port to C++:
if (input.length % 2 != 0) {
return input[((input.length + 1) / 2 - 1)];
} else {
return 0.5d * (input[(input.length / 2 - 1)] + input[(input.length / 2 + 1) - 1]);
}