Find uncommon elements using hashing

I think this is a fairly common question but I didn't find any answer for this using hashing in C++.
I have two arrays, both of the same lengths, which contain some elements, for example:
A={5,3,5,4,2}
B={3,4,1,2,1}
Here, the uncommon elements are: {5,5,1,1}
I have tried this approach- iterating a while loop on both the arrays after sorting:
while(i<n && j<n) {
if(a[i]<b[j])
uncommon[k++]=a[i++];
else if (a[i] > b[j])
uncommon[k++]=b[j++];
else {
i++;
j++;
}
}
while(i<n && a[i]!=b[j-1])
uncommon[k++]=a[i++];
while(j < n && b[j]!=a[i-1])
uncommon[k++]=b[j++];
and I am getting the correct answer with this. However, I want a better approach in terms of time complexity since sorting both arrays every time might be computationally expensive.
I tried to do hashing but couldn't figure it out entirely.
To insert elements from arr1[]:
set<int> uncommon;
for (int i=0;i<n1;i++)
uncommon.insert(arr1[i]);
To compare arr2[] elements:
for (int i = 0; i < n2; i++)
if (uncommon.find(arr2[i]) != uncommon.end())
Now, what I am unable to do is to send only those elements to the uncommon array[] which are uncommon to both of them.
Thank you!

First of all, std::set has nothing to do with hashing. std::set and std::map are ordered containers; implementations may differ, but they are typically balanced binary search trees. Whatever you do with them, you won't get faster than O(n log n), the same complexity as sorting.
If you're fine with O(n log n) and sorting, I'd strongly advise just using the std::set_symmetric_difference algorithm (https://en.cppreference.com/w/cpp/algorithm/set_symmetric_difference); it requires two sorted ranges.
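As a rough illustration of the sorted approach, here is a minimal sketch. The function name symmetric_difference_sorted is made up for this example, and the inputs are taken by value so the caller's arrays stay unsorted. For sorted ranges with duplicates, std::set_symmetric_difference keeps |m - n| copies of a value that occurs m times in one range and n times in the other, which matches the expected {5,5,1,1} up to ordering.

```cpp
#include <algorithm>
#include <iterator>
#include <vector>

// Hypothetical helper: sorts copies of both inputs, then lets the standard
// algorithm compute the multiset symmetric difference.
std::vector<int> symmetric_difference_sorted(std::vector<int> a, std::vector<int> b) {
    std::sort(a.begin(), a.end());
    std::sort(b.begin(), b.end());
    std::vector<int> out;
    std::set_symmetric_difference(a.begin(), a.end(),
                                  b.begin(), b.end(),
                                  std::back_inserter(out));
    return out;  // sorted ascending
}
```

For A={5,3,5,4,2} and B={3,4,1,2,1} this returns the values 1 1 5 5.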
But if you insist on an implementation relying on hashing, you should use std::unordered_set or std::unordered_map. This way you can be faster than O(n log n): you can get your answer in O(n + m) time on average, where n = a.size() and m = b.size(). Create two unordered_sets, hashed_a and hashed_b, and in two loops check which elements of hashed_a are not in hashed_b, and which elements of hashed_b are not in hashed_a. Here is the pseudocode:
create hashed_a and hashed_b
create set_result // for the result
for (a_v : hashed_a)
if (a_v not in hashed_b)
set_result.insert(a_v)
for (b_v : hashed_b)
if (b_v not in hashed_a)
set_result.insert(b_v)
return set_result // it holds the symmetric difference, which is what you need
UPDATE: as noted in the comments, my answer doesn't account for duplicates. The easiest way to modify it for duplicates would be to use unordered_map<int, int>, with the elements as keys and the number of occurrences as values.
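The duplicate-aware variant could look roughly like the sketch below. It is one possible reading of the update, not the answerer's code: the function name uncommon_elements is hypothetical, and instead of two maps it keeps a single signed counter per value (incremented for A, decremented for B). A second pass over the inputs then emits the surplus copies in their original order.

```cpp
#include <unordered_map>
#include <vector>

// Hypothetical helper: returns the multiset symmetric difference of a and b
// in O(n + m) average time using one hash map of signed counters.
std::vector<int> uncommon_elements(const std::vector<int>& a,
                                   const std::vector<int>& b) {
    std::unordered_map<int, int> balance;  // occurrences in a minus occurrences in b
    for (int x : a) ++balance[x];
    for (int x : b) --balance[x];

    std::vector<int> result;
    for (int x : a) {                      // surplus copies that only a has
        auto it = balance.find(x);
        if (it->second > 0) { result.push_back(x); --it->second; }
    }
    for (int x : b) {                      // surplus copies that only b has
        auto it = balance.find(x);
        if (it->second < 0) { result.push_back(x); ++it->second; }
    }
    return result;
}
```

With A={5,3,5,4,2} and B={3,4,1,2,1} the result is 5 5 1 1, matching the example in the question.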

First, you need a way to distinguish between equal values contained in the same array (e.g. 5 and 5 in the first array, and 1 and 1 in the second array). This is the key to reducing the overall complexity; otherwise you can't do better than O(n log n). One possible approach is to create a wrapper object to hold each actual value and to store pointers to those wrapper objects in your arrays, so that the pointer addresses serve as unique identifiers for the objects. This wrapping costs just O(n1+n2) operations, but also an additional O(n1+n2) space.
Now each array contains only elements that are unique within that array, and you want to find the uncommon elements, i.e. (union of both arrays' elements) - (intersection of both arrays' elements). Therefore, all you need to do is push all the elements of the first array into a hash map (complexity O(n1)), and then push all the elements of the second array into the same hash map (complexity O(n2)), detecting the collisions (an element from the first array being equal to an element from the second array). This comparison step requires O(n2) comparisons in the worst case. So, for maximum performance, you can check the sizes of the arrays before pushing the elements into the hash map, and swap the arrays so that the first push is done with the longer array. The overall algorithm then needs O(n1+n2) pushes (hashings) and O(n2) comparisons.
The implementation is the most boring part, so I leave it to you ;)

A solution without an explicit sort (and without hashing, but you seem to care more about complexity than about the hashing itself) is to notice the following: an uncommon element e is an element that is in exactly one multiset.
This means that the multiset of all uncommon elements is the union of two multisets:
S1 = the elements of A that are not in B
S2 = the elements of B that are not in A
Using std::set_difference, you get:
#include <set>
#include <vector>
#include <iostream>
#include <iterator> // std::back_inserter
#include <algorithm>
int main() {
std::multiset<int> ms1{5,3,5,4,2};
std::multiset<int> ms2{3,4,1,2,1};
std::vector<int> v;
std::set_difference( ms1.begin(), ms1.end(), ms2.begin(), ms2.end(), std::back_inserter(v));
std::set_difference( ms2.begin(), ms2.end(), ms1.begin(), ms1.end(), std::back_inserter(v));
for(int e : v)
std::cout << e << ' ';
return 0;
}
Output:
5 5 1 1
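Each std::set_difference call performs at most 2·(N1+N2)−1 comparisons, so the two calls together perform at most 4·(N1+N2)−2, where N1 and N2 are the sizes of the multisets. Note that std::multiset keeps its elements ordered, so building the two multisets costs an additional O(N1 log N1 + N2 log N2).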
Links:
set_difference: https://en.cppreference.com/w/cpp/algorithm/set_difference
compiler explorer: https://godbolt.org/z/o3KGbf

The question can be solved in O(n log n) time.
ALGORITHM
Sort both arrays in O(n log n), e.g. with merge sort. You can also use the standard sort function, for example sort(array1.begin(), array1.end()).
Now use the two-pointer method to skip all elements common to both arrays.
Program for the above method
int i = 0, j = 0;
while (i < array1.size() && j < array2.size()) {
// If not common, print smaller
if (array1[i] < array2[j]) {
cout << array1[i] << " ";
i++;
}
else if (array2[j] < array1[i]) {
cout << array2[j] << " ";
j++;
}
// Skip common element
else {
i++;
j++;
}
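}
// Whatever is left in either array has no counterpart, so it is uncommon too
while (i < array1.size())
cout << array1[i++] << " ";
while (j < array2.size())
cout << array2[j++] << " ";
The merge loop runs in O(array1.size() + array2.size()), i.e. O(n) for two arrays of length n; the overall complexity is dominated by the O(n log n) sort.
The above program prints the uncommon elements. If you want to store them, just create a vector and push them into it instead.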

Related

Most efficient algorithm for Two-sum problem (involving indices)

The problem statement is given an array and a given sum "T", find all the pairs of indices of the elements in the array which add up to T. Additional requirements/constraints:
Indexing starts from 0
The indices must be displayed with lower index first (Ex: 24, 30 instead of 30, 24)
The indices must be displayed in ascending order (Ex: if we find (1,3), (0,2) and (5,8) the output must be (0,2) (1,3) (5,8)
There can be duplicate elements in the array, which also have to be considered
Here's my code in C++, I used the hash-table approach using unordered_set:
void Twosum(vector <int> res, int T){
int temp; int ti = -1;
unordered_set<int> s;
vector <int> res2 = res; //Just a copy of the input vector
vector <tuple<int, int>> indices; //Result to be output
for (int i = 0; i < (int)res.size(); i++){
temp = T - res[i];
if (s.find(temp) != s.end()){
while(ti < (int)res.size()){ //While loop for finding all the instances of temp in the array,
//not part of the original hash-table algorithm, something I added
ti = find(res2.begin(), res2.end(), temp) - res2.begin();
//Here find() takes O(n) time which is an issue
res2[ti] = lim; //To remove that instance of temp so that new instances
//can be found in the while loop, here lim = 10^9
if(i <= ti) indices.push_back(make_tuple(i, ti));
else indices.push_back(make_tuple(ti, i));
}
}
s.insert(res[i]);
}
if(ti == -1)
{cout<<"-1 -1"; //if no indices were found
return;}
sort(indices.begin(), indices.end()); //sorting since unordered_set stores elements randomly
for(int i=0; i<(int)indices.size(); i++)
cout<<get<0>(indices[i])<<" "<<get<1>(indices[i])<<endl;
}
This has multiple issues:
Firstly, the while loop doesn't work as intended; instead it aborts with SIGABRT (free(): invalid pointer). The ti index also somehow goes beyond the vector bounds, even though I have that check in the while loop.
Secondly, the find() function works in O(n) time, which increases the overall complexity to O(n^2) and causes my program to time out during execution. However, that function is required since we have to output indices.
Lastly this unordered-set implementation doesn't seem to work when there are many duplicate elements in the array (since sets only take unique elements), which is one of the main constraints of the problem. This makes me think we need some sort of hash function or hashmap to deal with the duplicates? I'm not sure...
All the different algorithms I've found for this on the internet have dealt with just printing the elements and not the indices, hence I've had no luck with this problem.
If any of you know an optimal algorithm for this while also satisfying the constraints and running under O(n) time, your help would be highly appreciated. Thank you in advance.

Here is pseudocode answering your question, using hash tables (or maps) and sets. I leave it to you to translate this to C++ using suitable data structures (in this case, standard hash maps and sets will do the job well).
Notation: we denote by A the array, by n its length, and by T the "sum".
// first we build a map element -> {set of indices corresponding to this element}
Let M be an empty map; // or hash map, or hash table, or dictionary
for i from 0 to n-1 do {
Let e = A[i];
if e is not a key of M then {
M[e] = new_set()
}
M[e].add(i)
}
// Now we iterate over the elements
for each key e of M do {
if T-e is a key of M then {
display_combinations(M[e], M[T-e]);
}
}
// The helper function display_combinations
function display_combinations(set1, set2) {
for each element e1 of set1 do {
for each element e2 of set2 do {
if e1 < e2 then {
display "(e1, e2)";
} else if e1 > e2 then {
display "(e2, e1)";
}
}
}
}
As said in the comments, the worst-case complexity of this algorithm is O(n²). One way to see that we cannot go below this complexity is that the size of the output itself may be O(n²), namely when all elements of the array have the value T/2.
Edit: this pseudocode does not output the pairs in the required order. Just store them in an array of pairs and sort this array before displaying it. Note also that each pair is reached twice (once from each of its two elements), so remove the duplicates after sorting. Likewise, I did not treat the case where a pair (i, i) may satisfy the requirement. You may have to consider it (just replace e1 > e2 with e1 >= e2 in the last loop).
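A possible C++ translation of the idea is sketched below. It is not a literal port of the pseudocode: the name two_sum_indices is hypothetical, the index sets are stored as vectors (they come out ascending because indices are appended in order), and only pairs with i < j are emitted, so each pair appears once and a pair (i, i) is never reported. It also assumes that T - A[i] does not overflow int.

```cpp
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical helper: all index pairs (i, j), i < j, with a[i] + a[j] == target.
// Worst case O(n^2), which is unavoidable when the output itself is that large.
std::vector<std::pair<int, int>> two_sum_indices(const std::vector<int>& a, int target) {
    // value -> ascending list of indices holding that value
    std::unordered_map<int, std::vector<int>> positions;
    for (int i = 0; i < static_cast<int>(a.size()); ++i)
        positions[a[i]].push_back(i);

    std::vector<std::pair<int, int>> pairs;
    for (int i = 0; i < static_cast<int>(a.size()); ++i) {
        auto it = positions.find(target - a[i]);
        if (it == positions.end()) continue;
        for (int j : it->second)
            if (i < j) pairs.emplace_back(i, j);  // lower index first, no duplicates
    }
    // i ascends in the outer loop and j ascends in the inner one,
    // so the pairs are already in the required order.
    return pairs;
}
```

For the array {1,3,2,2,3} and T = 5 this yields (1,2) (1,3) (2,4) (3,4).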

Nearest permutation to given array

Question
I have two arrays of integers A[] and B[]. Array B[] is fixed; I need to find the permutation of A[] which is lexicographically smaller than B[] and is nearest to B[]. What I mean is:
for i in (0 <= i < n)
abs(B[i]-A[i]) is minimum and A[] should be smaller than B[] lexicographically.
For Example:
A[]={1,3,5,6,7}
B[]={7,3,2,4,6}
So, a possible nearest permutation of A[] to B[] is
A[]={7,3,1,6,5}
My Approach
Try all permutation of A[] and then compare that with B[]. But the time complexity would be (n! * n)
So is there any way to optimize this?
EDIT
n can be as large as 10^5

First, build an ordered map of the counts of the distinct elements of A.
Then, iterate forward through array indices (0 to n−1), "withdrawing" elements from this map. At each point, there are three possibilities:
If i < n-1, and it's possible to choose A[i] == B[i], do so and continue iterating forward.
Otherwise, if it's possible to choose A[i] < B[i], choose the greatest possible value for A[i] < B[i]. Then proceed by choosing the largest available values for all subsequent array indices. (At this point you no longer need to worry about maintaining A[i] <= B[i], because we're already after an index where A[i] < B[i].) Return the result.
Otherwise, we need to backtrack to the last index where it was possible to choose A[i] < B[i], then use the approach in the previous bullet-point.
Note that, despite the need for backtracking, the very worst case here is three passes: one forward pass using the logic in the first bullet-point, one backward pass in backtracking to find the last index where A[i] < B[i] was possible, and then a final forward pass using the logic in the second bullet-point.
Because of the overhead of maintaining the ordered map, this requires O(n log m) time and O(m) extra space, where n is the total number of elements of A and m is the number of distinct elements. (Since m ≤ n, we can also express this as O(n log n) time and O(n) extra space.)
Note that if there's no solution, then the backtracking step will reach all the way to i == -1. You'll probably want to raise an exception if that happens.
Edited to add (2019-02-01):
In a now-deleted answer, גלעד ברקן summarizes the goal this way:
To be lexicographically smaller, the array must have an initial optional section from left to right where A[i] = B[i] that ends with an element A[j] < B[j]. To be closest to B, we want to maximise the length of that section, and then maximise the remaining part of the array.
So, with that summary in mind, another approach is to do two separate loops, where the first loop determines the length of the initial section, and the second loop actually populates A. This is equivalent to the above approach, but may make for cleaner code. So:
Build an ordered map of the counts of the distinct elements of A.
Initialize initial_section_length := -1.
Iterate through the array indices 0 to n−1, "withdrawing" elements from this map. For each index:
If it's possible to choose an as-yet-unused element of A that's less than the current element of B, set initial_section_length equal to the current array index. (Otherwise, don't.)
If it's not possible to choose an as-yet-unused element of A that's equal to the current element of B, break out of this loop. (Otherwise, continue looping.)
If initial_section_length == -1, then there's no solution; raise an exception.
Repeat step #1: re-build the ordered map.
Iterate through the array indices from 0 to initial_section_length-1, "withdrawing" elements from the map. For each index, choose an as-yet-unused element of A that's equal to the current element of B. (The existence of such an element is ensured by the first loop.)
For array index initial_section_length, choose the greatest as-yet-unused element of A that's less than the current element of B (and "withdraw" it from the map). (The existence of such an element is ensured by the first loop.)
Iterate through the array indices from initial_section_length+1 to n−1, continuing to "withdraw" elements from the map. For each index, choose the greatest element of A that hasn't been used yet.
This approach has the same time and space complexities as the backtracking-based approach.
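The two-loop procedure above can be sketched in C++ roughly as follows. This is a minimal sketch rather than the answerer's code: the function name nearest_smaller_permutation and the helper lambdas are hypothetical, A and B are assumed to have the same length, and std::runtime_error is an arbitrary choice for the "no solution" exception.

```cpp
#include <iterator>
#include <map>
#include <stdexcept>
#include <vector>

std::vector<int> nearest_smaller_permutation(const std::vector<int>& a,
                                             const std::vector<int>& b) {
    using Counts = std::map<int, int>;  // value -> number of unused copies
    const int n = static_cast<int>(a.size());
    auto build = [&a] { Counts m; for (int x : a) ++m[x]; return m; };
    auto withdraw = [](Counts& m, Counts::iterator it) {
        if (--it->second == 0) m.erase(it);
    };

    // First loop: find the last index at which a smaller element can be placed.
    Counts counts = build();
    int initial_section_length = -1;
    for (int i = 0; i < n; ++i) {
        auto lb = counts.lower_bound(b[i]);             // first unused value >= b[i]
        if (lb != counts.begin()) initial_section_length = i;  // a value < b[i] exists
        if (lb == counts.end() || lb->first != b[i]) break;    // cannot match b[i]
        withdraw(counts, lb);
    }
    if (initial_section_length == -1)
        throw std::runtime_error("no permutation of A is smaller than B");

    // Second loop: equal prefix, one strictly smaller element, then largest-first.
    counts = build();
    std::vector<int> result;
    result.reserve(a.size());
    for (int i = 0; i < initial_section_length; ++i) {
        auto it = counts.find(b[i]);
        result.push_back(it->first);
        withdraw(counts, it);
    }
    auto smaller = std::prev(counts.lower_bound(b[initial_section_length]));
    result.push_back(smaller->first);
    withdraw(counts, smaller);
    while (!counts.empty()) {
        auto largest = std::prev(counts.end());
        result.push_back(largest->first);
        withdraw(counts, largest);
    }
    return result;
}
```

For A={1,3,5,6,7} and B={7,3,2,4,6} it returns 7 3 1 6 5, the permutation given in the question.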

There are n! permutations of A[n] (fewer if there are repeating elements).
Use binary search over the range 0..n!-1 to determine the k-th lexicographic permutation of A[] that is the closest one below B[].
Perhaps in C++ you can exploit std::lower_bound.

Based on the discussion in the comment section of your question, you seek an array made up entirely of elements of the vector A that is, in lexicographic ordering, closest to the vector B.
For this scenario, the algorithm becomes quite straightforward. The idea is the same as already mentioned in the answer of @ruakh (although his answer refers to an earlier and more complicated version of your question, which is still displayed in the OP, and is therefore more complicated):
Sort A
Loop over B and select the element of A that is closest to B[i]. Remove that element from the list.
If no element in A is smaller-or-equal than B[i], pick the largest element.
Here is the basic implementation:
#include <algorithm>
#include <functional> // std::greater
#include <iostream>   // std::cout in main
#include <vector>
auto get_closest_array(std::vector<int> A, std::vector<int> const& B)
{
std::sort(std::begin(A), std::end(A), std::greater<>{});
auto select_closest_and_remove = [&](int i)
{
auto it = std::find_if(std::begin(A), std::end(A), [&](auto x) { return x<=i;});
if(it==std::end(A))
{
it = std::max_element(std::begin(A), std::end(A));
}
auto ret = *it;
A.erase(it);
return ret;
};
std::vector<int> ret(B.size());
for(int i=0;i<(int)B.size();++i)
{
ret[i] = select_closest_and_remove(B[i]);
}
return ret;
}
Applied to the problem in the OP one gets:
int main()
{
std::vector<int> A ={1,3,5,6,7};
std::vector<int> B ={7,3,2,4,6};
auto C = get_closest_array(A, B);
for(auto i : C)
{
std::cout<<i<<" ";
}
std::cout<<std::endl;
}
and it displays
7 3 1 6 5
which seems to be the desired result.

Create a function that checks whether an array has two opposite elements or not for less than n^2 complexity. (C++)

Create a function that checks whether an array has two opposite elements or not for less than n^2 complexity. Let's work with numbers.
Obviously the easiest way would be:
bool opposite(int* arr, int n) // n - array length
{
for(int i = 0; i < n; ++i)
{
for(int j = 0; j < n; ++j)
{
if(arr[i] == - arr[j])
return true;
}
}
return false;
}
I would like to ask if any of you guys can think of an algorithm that has a complexity less than n^2.
My first idea was the following:
1) sort array ( algorithm with worst case complexity: n.log(n) )
2) create two new arrays, filled with negative and positive numbers from the original array
( so far we've got -> n.log(n) + n + n = n.log(n))
3) ... compare somehow the two new arrays to determine if they have opposite numbers
I'm not quite sure my ideas are correct, but I'm open to suggestions.

An important alternative solution is as follows. Sort the array. Create two pointers, one initially pointing to the front (smallest), one initially pointing to the back (largest). If the sum of the two pointed-to elements is zero, you're done. If it is larger than zero, then decrement the back pointer. If it is smaller than zero, then increment the front pointer. Continue until the two pointers meet.
This solution is often the one people are looking for; often they'll explicitly rule out hash tables and trees by saying you only have O(1) extra space.
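A minimal sketch of this two-pointer scan is shown below. The function name has_opposite_pair is hypothetical, and the sketch takes the vector by value so the caller's data is left untouched; sort in place instead if O(1) extra space is required. It assumes the two elements must sit at different positions, so a single 0 does not count as its own opposite.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical helper: true if two elements at different positions sum to zero.
bool has_opposite_pair(std::vector<int> v) {
    if (v.size() < 2) return false;
    std::sort(v.begin(), v.end());            // O(n log n)
    std::size_t lo = 0;
    std::size_t hi = v.size() - 1;
    while (lo < hi) {                         // O(n) scan
        // widen to long long so the sum cannot overflow
        const long long sum = static_cast<long long>(v[lo]) + v[hi];
        if (sum == 0) return true;
        if (sum > 0) --hi;                    // largest value is too large
        else ++lo;                            // smallest value is too small
    }
    return false;
}
```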

I would use a std::unordered_set and check whether the opposite of the number already exists in the set. If not, insert it into the set and check the next element.
std::vector<int> foo = {-10,12,13,14,10,-20,5,6,7,20,30,1,2,3,4,9,-30};
std::unordered_set<int> res;
for (auto e : foo)
{
if(res.count(-e) > 0)
std::cout << "opposite of " << e << " already exists\n";
else
res.insert(e);
}
Output:
opposite of 10 already exists
opposite of 20 already exists
opposite of -30 already exists

Note that you can simply add all of the elements to an unordered_set, and when adding x check whether -x is already in the set. The average complexity of this solution is O(n) (as @Hurkyl said, thanks).
UPDATE: A second idea is to sort the elements and then, for each element, check (using binary search) whether the opposite element exists.

You can do this in O(n log n) with a Red Black tree.
t := empty tree
for each e in A[1..n]
if (-e) is in t:
return true
insert e into t
return false
In C++, you wouldn't implement a Red Black tree for this purpose however. You'd use std::set, because it guarantees O(log n) search and insertion.
std::set<int> s;
for (auto e : A) {
if (s.count(-e) > 0) {
return true;
}
s.insert(e);
}
return false;
As Hurkyl mentioned, you could do better by just using std::unordered_set, which is a hashtable. This gives you O(1) search and insertion in the average case, but O(n) for both operations in the worst case. The total complexity of the solution in the average case would be O(n).

How to find the maximum number of pairs having difference less than a particular value?

I am given two arrays of the same length (which can contain duplicates) containing positive integers. I have to find the maximum number of pairs whose absolute difference is less than or equal to a given value, where each number from either array can be used only once.
For example:
arr1 = {1,2,3,4}
arr2 = {8,9,10,11}
diff = 5
Then the candidate pairs are (3,8), (4,8) and (4,9), but since each number can be used only once, at most two pairs can be formed, e.g. (3,8) and (4,9).
Output should be 2.
Also, I can think of an algo for this in O(n^2). But, I need something better. I thought of hash maps (won't work because arrays contain duplicates), thought of sorting the arrays in descending and ascending order, wasn't really able to move forward from there.

The usual idea is to loop over sorted ranges. This way, you can bring the brute-force O(N^2) effort down to O(N log N).
Here is an algorithm for that in pseudo code (maybe I'll update later with real C++ code):
Sort both arrays
Loop over both simultaneously with two iterators:
If a pair is found insert it into your list. Increase both iterators.
Otherwise, increase the iterator pointing to the smaller element.
In total, this is dominated by the sort, which takes O(N log N).
Here is the promised code:
auto find_pairs(std::vector<int>& arr1, std::vector<int>& arr2, int diff)
{
std::vector<std::pair<int,int> > ret;
std::sort(std::begin(arr1), std::end(arr1));
std::sort(std::begin(arr2), std::end(arr2));
auto it1= std::begin(arr1);
auto it2= std::begin(arr2);
while(it1!= std::end(arr1) && it2!= std::end(arr2))
{
if(std::abs(*it1-*it2) <= diff) // "less than or equal to", as the question asks
{
ret.push_back(std::make_pair(*it1,*it2));
++it1;
++it2;
}
else if(*it1<*it2)
{
++it1;
}
else
{
++it2;
}
}
return ret;
}
It returns the matching elements of the two vectors as a vector of std::pairs. For your example, it prints
3 8
4 9

how to get median value from sorted map

I am using a std::map. Sometimes I will do an operation like: finding the median value of all items. e.g
if I add
1 "s"
2 "sdf"
3 "sdfb"
4 "njw"
5 "loo"
then the median is 3.
Is there some solution without iterating over half the items in the map?

I think you can solve the problem by using two std::maps: one for the smaller half of the items (mapL) and a second for the other half (mapU). An insert operation is then one of two cases:
add item to mapU and move smallest element to mapL
add item to mapL and move greatest element to mapU
If the maps have different sizes and you insert the element into the one with fewer elements, you skip the move step.
The basic idea is that you keep your maps balanced so the maximum size difference is 1 element.
As far as I know the STL, all of these operations work in O(log n) time. Accessing the smallest and greatest element in a map can be done using iterators.
When you get a median query, just check the map sizes and return the greatest element in mapL or the smallest element in mapU.
The above usage scenario covers insertion only, but you can extend it to deleting items as well; you then have to keep track of which map holds the item, or try to delete from both.
Here is my code with sample usage:
#include <iostream>
#include <string>
#include <map>
using namespace std;
typedef pair<int,string> pis;
typedef map<int,string>::iterator itis;
map<int,string>Left;
map<int,string>Right;
itis get_last(map<int,string> &m){
return (--m.end());
}
int add_element(int key, string val){
if (Left.empty()){
Left.insert(make_pair(key,val));
return 1;
}
pis maxl = *get_last(Left);
if (key <= maxl.first){
Left.insert(make_pair(key,val));
if (Left.size() > Right.size() + 1){
itis to_rem = get_last(Left);
pis cpy = *to_rem;
Left.erase(to_rem);
Right.insert(cpy);
}
return 1;
} else {
Right.insert(make_pair(key,val));
if (Right.size() > Left.size()){
itis to_rem = Right.begin();
pis cpy = *to_rem;
Right.erase(to_rem);
Left.insert(cpy); // use the copy: to_rem is invalid after erase()
}
return 2;
}
}
pis get_mid(){
int size = Left.size() + Right.size();
if (Left.size() >= size / 2){
return *(get_last(Left));
}
return *(Right.begin());
}
int main(){
Left.clear();
Right.clear();
int key;
string val;
while (cin >> key >> val){ // stops on EOF or bad input instead of reprocessing the last pair
add_element(key,val);
pis mid = get_mid();
cout << "mid " << mid.first << " " << mid.second << endl;
}
}

I think the answer is no. You cannot just jump to the N / 2 item past the beginning because a std::map uses bidirectional iterators. You must iterate through half of the items in the map. If you had access to the underlying Red/Black tree implementation that is typically used for the std::map, you might be able to get close like in Dani's answer. However, you don't have access to that as it is encapsulated as an implementation detail.

Try:
typedef std::map<int,std::string> Data;
Data data;
Data::iterator median = data.begin();
std::advance(median, data.size() / 2); // std::advance modifies its argument and returns void
Works if the size() is odd. I'll let you work out how to do it when size() is even.

In a self-balancing binary tree (std::map is one, I think) a good approximation would be the root.
For the exact value, just cache it together with a balance indicator: each time an item is added below the median, decrease the indicator, and increase it when an item is added above. When the indicator reaches 2 or -2, move the median one step upwards or downwards respectively and reset the indicator.
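One way to realize the cached-median idea is sketched below. It is an assumption-laden variant, not the answerer's exact scheme: the class name MedianMap is hypothetical, it supports insertion only, and instead of a separate balance counter it uses the parity of the map size to decide when the cached iterator must move. The cached iterator always points at the lower median (index (size-1)/2), and std::map guarantees that insertions do not invalidate it.

```cpp
#include <map>
#include <string>
#include <utility>

// Hypothetical wrapper: a std::map plus an iterator kept on the lower median.
class MedianMap {
public:
    MedianMap() = default;
    MedianMap(const MedianMap&) = delete;             // a copy would leave mid_ pointing
    MedianMap& operator=(const MedianMap&) = delete;  // into the original map

    // Inserts (key, value); returns false if the key already exists.
    bool insert(int key, std::string value) {
        auto res = data_.emplace(key, std::move(value));
        if (!res.second) return false;
        if (data_.size() == 1) { mid_ = data_.begin(); return true; }
        const bool even = data_.size() % 2 == 0;
        if (key < mid_->first) {
            if (even) --mid_;    // new item below the median: step down on even sizes
        } else {
            if (!even) ++mid_;   // new item above the median: step up on odd sizes
        }
        return true;
    }

    // Precondition: at least one element has been inserted.
    const std::pair<const int, std::string>& median() const { return *mid_; }

private:
    std::map<int, std::string> data_;
    std::map<int, std::string>::iterator mid_;
};
```

Each insertion costs O(log n) for the map plus O(1) for the iterator step, and median() is O(1).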

If you can switch data structures, store the items in a std::vector and sort it. That will enable accessing the middle item positionally without iterating. (It can be surprising, but a sorted vector often outperforms a map, due to locality. For lookups by the sort key you can use binary search, and it will have much the same performance as a map anyway. See Scott Meyers's Effective STL.)

If you know the map will be sorted, then get the element at floor(length / 2). If you're in a bit twiddly mood, try (length >> 1).

I know of no way to get the median from a pure STL map quickly for big maps. If your map is small or you need the median rarely, I think you should use the linear advance to n/2 anyway, for the sake of simplicity and of staying standard.
You can use the map to build a new container that offers a median: Jethro suggested using two maps; building on this, a single map with a continuously updated median iterator would perhaps be better. These methods suffer from the drawback that you have to reimplement every modifying operation, and in Jethro's case even the reading operations.
A custom-written container will also do what you want, probably most efficiently, but at the price of custom code. You could try, as was suggested, to modify an existing STL map implementation. You can also look for existing implementations.
There is a super efficient C implementation that offers most map functionality and also random access, called Judy arrays. These work for integer, string and byte array keys.

Since it sounds like insert and find are your two common operations while median is rare, the simplest approach is to use the map and advance an iterator from m.begin() by m.size()/2 steps with std::advance (or std::next in C++11), as originally suggested by David Rodríguez. This is linear time, but easy to understand, so I'd only consider another approach if profiling shows that the median calls are too expensive relative to the work your app is doing.
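For reference, the linear-time lookup can be written in one line with std::next (C++11). The helper name median_entry is made up for this sketch; for a map with an even number of entries it returns the upper of the two middle entries, and for an empty map it returns end().

```cpp
#include <iterator>
#include <map>
#include <string>

// Hypothetical helper: walks size()/2 steps from the beginning, O(n).
std::map<int, std::string>::const_iterator
median_entry(const std::map<int, std::string>& m) {
    return std::next(m.begin(), m.size() / 2);
}
```

With the five entries from the question (keys 1 to 5), median_entry(m)->first is 3.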

The std::nth_element() algorithm is there for you for this :) It implements the partition step of quicksort, so your vector (or array) does not need to be sorted. It does require random-access iterators, though, so it cannot be applied to a std::map directly.
Its time complexity is O(n) on average (while sorting costs O(n log n)).
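A short sketch of the std::nth_element approach on an unsorted vector follows. The function name median_of is hypothetical; it works on a copy because std::nth_element reorders its input, and it assumes the vector is non-empty. For an even number of values it returns the upper of the two middle values.

```cpp
#include <algorithm>
#include <vector>

// Hypothetical helper: median of an unsorted, non-empty vector in O(n) on average.
int median_of(std::vector<int> values) {
    auto mid = values.begin() + values.size() / 2;
    // After this call *mid is the element that would be at that position
    // if the whole range were sorted.
    std::nth_element(values.begin(), mid, values.end());
    return *mid;
}
```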

For a sorted list, here it is in Java, but I assume it's very easy to port to C++:
if (input.length % 2 != 0) {
return input[((input.length + 1) / 2 - 1)];
} else {
return 0.5d * (input[(input.length / 2 - 1)] + input[(input.length / 2 + 1) - 1]);
}
