I have 200 vectors, ranging in size from 1 to 4,000,000 elements, stored in vecOfVec. I need to intersect these vectors with a single sorted vector "vecSearched" of 9000+ elements. I tried to do this with the following code, but profiling with the perf tool shows that the intersection is the bottleneck in my code. Is there some way to perform the intersection more efficiently?
#include <cstdlib>
#include <iostream>
#include <vector>
using namespace std;

int main(int argc, char** argv) {
    vector<vector<unsigned> > vecOfVec; // vectors of size ranging from 1 to 4000000 elements; all sorted
    vector<unsigned> vecSearched;       // contains 9000+ elements, sorted

    for(unsigned kbt = 0; kbt < vecOfVec.size(); kbt++)
    {
        // first find values spaced at equidistant places, use these values for performing comparisons
        vector<unsigned> equiSpacedVec;
        if(vecSearched[0] > vecOfVec[kbt][vecOfVec[kbt].size() - 1]) // if beginning of searched vector > last value of this vector, no overlap is possible
        {
            continue;
        }
        unsigned elementIndex = 0; // used for iterating over equiSpacedVec
        unsigned i = 0;            // used for iterating over vecOfVec[kbt]
        bool firstRun = true;
        for(vector<unsigned>::iterator itValPos = vecSearched.begin(); itValPos != vecSearched.end(); ++itValPos)
        {
            // construct a summarized vector out of this vector of vecOfVec
            if(firstRun)
            {
                firstRun = false;
                unsigned elementIndex1 = 0;
                while(elementIndex1 < vecOfVec[kbt].size()) // create a small vector for skipping ahead
                {
                    if((elementIndex1 + 10000) < vecOfVec[kbt].size())
                        elementIndex1 += 10000;
                    else
                        break;
                    equiSpacedVec.push_back(vecOfVec[kbt][elementIndex1]);
                }
            }
            // skip ahead in vecOfVec[kbt] using the summarized vector constructed above
            while(!equiSpacedVec.empty() && equiSpacedVec.size() > (elementIndex + 1)
                  && (*itValPos) > equiSpacedVec[elementIndex + 1])
            {
                elementIndex += 1;
                if((i + 100) < vecOfVec[kbt].size())
                    i += 100;
            }
            unsigned j = i;
            while(j < vecOfVec[kbt].size() && (*itValPos) > vecOfVec[kbt][j]) // bounds check must come before the dereference
            {
                j++;
            }
            if(j > vecOfVec[kbt].size() - 1) // element not found even at last position
            {
                break;
            }
            if((*itValPos) == vecOfVec[kbt][j])
            {
                // store intersection result
            }
        }
    }
    return 0;
}
Your problem is a very popular one. Since you have no data correlating the vectors to intersect, it boils down to speeding up the intersection between two vectors, and there are basically two approaches to it:
1. Without any preprocessing
This is usually addressed by three things:
Reducing the number of comparisons. E.g., for small vectors (sized 1 to 50) you should binary search each element rather than traversing all 9000+ elements of the subject vector.
Improving code quality to reduce branch mispredictions. E.g., observing that the resulting set will usually be smaller than the input sets, you could transform code such as:
while (Apos < Aend && Bpos < Bend) {
if (A[Apos] == B[Bpos]) {
C[Cpos++] = A[Apos];
Apos++; Bpos++;
}
else if (A[Apos] > B[Bpos]) {
Bpos++;
}
else {
Apos++;
}
}
to code that "unrolls" such comparisons, creating branches that are easier to predict (example for block size = 2):
while (1) {
Adat0 = A[Apos]; Adat1 = A[Apos + 1];
Bdat0 = B[Bpos]; Bdat1 = B[Bpos + 1];
if (Adat0 == Bdat0) {
C[Cpos++] = Adat0;
}
else if (Adat0 == Bdat1) {
C[Cpos++] = Adat0;
goto advanceB;
}
else if (Adat1 == Bdat0) {
C[Cpos++] = Adat1;
goto advanceA;
}
if (Adat1 == Bdat1) {
C[Cpos++] = Adat1;
goto advanceAB;
}
else if (Adat1 > Bdat1) goto advanceB;
else goto advanceA;
advanceA:
Apos+=2;
if (Apos >= Aend) { break; } else { continue; }
advanceB:
Bpos+=2;
if (Bpos >= Bend) { break; } else { continue; }
advanceAB:
Apos+=2; Bpos+=2;
if (Apos >= Aend || Bpos >= Bend) { break; }
}
// fall back to naive algorithm for remaining elements
Using SIMD instructions to perform block operations
These techniques are hard to describe in a Q&A context, but you can read about them (plus relevant optimizations like if-conversion) here and here or find implementation elements here
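The first point above (binary searching elements of a small vector against the large subject vector) can be sketched roughly like this; the function name and the resume-from-last-hit optimisation are illustrative, not from the original answer:

```cpp
#include <algorithm>
#include <vector>

// Sketch: intersect a small sorted vector with a large sorted one by
// binary-searching each element of the small vector. Because both inputs
// are sorted, each search can resume where the previous one ended.
std::vector<unsigned> intersect_small_large(const std::vector<unsigned>& small,
                                            const std::vector<unsigned>& large)
{
    std::vector<unsigned> out;
    auto from = large.begin();               // search only past the last hit
    for (unsigned v : small) {
        from = std::lower_bound(from, large.end(), v);
        if (from == large.end()) break;      // nothing larger remains
        if (*from == v) out.push_back(v);
    }
    return out;
}
```

This costs O(s * log L) comparisons for a small vector of size s against a large one of size L, which beats the O(s + L) merge walk when s is tiny.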
2. With preprocessing
This IMHO is a better way for your case because you have a single subject vector of size 9000+ elements. You could make an interval tree out of it or simply find a way to index it, eg make a structure that will speed up the search against it :
vector<unsigned> subject;  // 9000+ elements
vector<range> index;       // roughly 9000 / M ranges, one per block of M elements
where range is a struct like
struct range {
unsigned min, max;
};
thus creating a sequence of ranges like so
[0, 100], [101, 857], ... [33221, 33500]
that will allow you to skip many comparisons when doing the intersection (for example, if an element of the other set is larger than the max of a subrange, you can skip that subrange completely)
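A minimal sketch of that range index, assuming a block size M chosen by the caller; the names build_index and block_may_contain are illustrative:

```cpp
#include <algorithm>
#include <vector>

struct range {
    unsigned min, max;
};

// Summarise a sorted subject vector into blocks of M elements,
// keeping each block's smallest and largest value.
std::vector<range> build_index(const std::vector<unsigned>& subject, std::size_t M)
{
    std::vector<range> index;
    for (std::size_t i = 0; i < subject.size(); i += M) {
        std::size_t last = std::min(i + M, subject.size()) - 1;
        index.push_back({subject[i], subject[last]});
    }
    return index;
}

// During the intersection, a probe value outside [min, max] lets us
// skip a whole block of M subject elements with a single comparison.
bool block_may_contain(const range& r, unsigned value)
{
    return value >= r.min && value <= r.max;
}
```

Only blocks for which block_may_contain returns true need to be searched element by element.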
3. Parallelize
Yep, there's always a third element in a list of two :P. When you have optimized the procedure enough (and only then), break your work into chunks and run them in parallel. The problem fits an embarrassingly parallel pattern, so 200 vectors vs 1 should definitely run as "50 vs 1, four times concurrently".
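A rough sketch of that chunking using std::async; the atomic counter here stands in for the real per-chunk intersection work, which is independent per vector:

```cpp
#include <atomic>
#include <cstddef>
#include <future>
#include <vector>

// Split N independent "vector vs vecSearched" jobs into four chunks and
// run them concurrently. Each chunk handles indices [first, last).
std::size_t run_in_parallel(std::size_t total_vectors)
{
    const std::size_t chunks = 4;
    std::atomic<std::size_t> processed{0};
    std::vector<std::future<void>> tasks;
    for (std::size_t c = 0; c < chunks; ++c) {
        std::size_t first = c * total_vectors / chunks;
        std::size_t last  = (c + 1) * total_vectors / chunks;
        tasks.push_back(std::async(std::launch::async, [&, first, last] {
            for (std::size_t i = first; i < last; ++i)
                processed.fetch_add(1);   // real code: intersect vecOfVec[i]
        }));
    }
    for (auto& t : tasks)
        t.get();                          // join all workers
    return processed.load();
}
```

Since each chunk writes to its own result slots, no locking is needed beyond joining the futures.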
Test, measure, redesign!!
If I understood your code correctly, then if N is the number of vectors and M the number of elements inside each vector, your algorithm is roughly O(N * M^2). Then there is the 'bucket' strategy that improves things a bit, but its effect is difficult to evaluate at first sight.
I would suggest you to work on sorted vectors and make intersections on sorted ones. Something like this:
vector<vector<unsigned> > vecOfVec;
vector<unsigned> vecSearched;
for (vector<unsigned> v : vecOfVec) // yes, a copy
{
    std::sort(v.begin(), v.end()) ;
    if (vecSearched.empty()) // first run detection
        vecSearched = v ;
    else
    { // compute intersection of v and vecSearched
        auto vit = v.begin() ;
        auto vend = v.end() ;
        auto sit = vecSearched.begin() ;
        auto send = vecSearched.end() ;
        vector<unsigned> result ;
        while (vit != vend && sit != send)
        {
            if (*vit < *sit)
                ++vit ;
            else if (*vit == *sit)
            {
                result.push_back(*vit) ;
                ++vit ;
                ++sit ;
            }
            else // *vit > *sit
                ++sit ;
        }
        vecSearched = result ;
    }
}
The code is untested; anyway, the idea behind it is that intersecting sorted vectors is easier, since you can compare the two iterators (vit, sit) and advance the one pointing to the smaller value. So the intersection is linear in M, and the whole complexity is O(N * M * log(M)), where the log(M) is due to sorting.
The simplest way to compare two sorted vectors is to iterate through both of them at the same time, and only increment whichever iterator has the smaller value. This is guaranteed to require the smallest number of comparisons if both vectors are unique. Actually, you can use the same code for all sorts of sorted collections, e.g. linked lists.
@Nikos Athanasiou (answer above) gives lots of useful tips to speed up your code, like using skip lists and SIMD comparisons. However, your dataset is so tiny that even the straightforward naive code here runs blindingly fast...
template<typename CONT1, typename CONT2, typename OP_MARKIDENTICAL>
inline
void set_mark_overlap( const CONT1& cont1,
const CONT2& cont2,
OP_MARKIDENTICAL op_markidentical)
{
auto ii = cont1.cbegin();
auto end1 = cont1.cend();
auto jj = cont2.cbegin();
auto end2 = cont2.cend();
if (cont1.empty() || cont2.empty())
return;
for (;;)
{
// increment iterator to container 1 if it is less
if (*ii < *jj)
{
if (++ii == end1)
break;
}
// increment iterator to container 2 if it is less
else if (*jj < *ii)
{
if (++jj == end2)
break;
}
// same values
// increment both iterators
else
{
op_markidentical(*ii);
++ii;
if (ii == end1)
break;
//
// Comment if container1 can contain duplicates
//
++jj;
if (jj == end2)
break;
}
}
}
Here is how you might use this code:
template<typename TT>
struct op_store
{
vector<TT>& store;
op_store(vector<TT>& store): store(store){}
void operator()(TT val){store.push_back(val);}
};
vector<unsigned> first{1,2,3,4,5,6};
vector<unsigned> second{1,2,5,6, 7,9};
vector<unsigned> overlap;
set_mark_overlap( first, second, op_store<unsigned>(overlap));
for (const auto& ii : overlap)
    std::cout << ii << ",";
std::cout << "\n";
// 1,2,5,6
This code assumes that neither vector contains duplicates. If any of your vecOfVec vectors contains duplicates and you want each duplicate to be printed out, then you need to comment out the indicated code above. If your vecSearched vector contains duplicates, it is not clear what the appropriate response would be...
In your case, the code to store matching values would be just these three lines:
// results
vector<vector<unsigned> > results(120);
for (unsigned ii = 0; ii < vecOfVec.size(); ++ii)
set_mark_overlap(vecSearched, vecOfVec[ii], op_store<unsigned>(results[ii]));
In terms of optimisation, your problem has two characteristics:
1) One list is always much shorter than the other
2) The shorter list is reused while the longer list is new to every comparison.
Up-front costs (of pre-processing, e.g. for the skip lists suggested by @Nikos Athanasiou) are irrelevant for the short list of 9000 (which is used again and again) but not for the longer list.
I imagine most of the skipping is in the longer lists, so skip lists may not be a panacea. How about a sort of dynamic skip, so that you jump by N (j += 4,000,000 / 9000) or by one (++j) when container two is catching up (in the code above). If you have jumped too far, you can use a mini binary search to find the right amount to increment j by.
Because of this asymmetry in list lengths, I can't see recoding with SIMD helping: we need to minimise the number of comparisons to fewer than (N+M) rather than increase the speed of each comparison. However, it depends on your data. Code it up and time things!
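One hedged sketch of that dynamic skip is a galloping (exponential) search: jump forward in doubling steps, then binary search the overshot window with std::lower_bound, which plays the role of the "mini binary search". The function name is illustrative:

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Find the first element >= target in a sorted range by galloping:
// probe at exponentially growing offsets, then binary-search the last
// window [first, it) once we overshoot.
std::vector<unsigned>::const_iterator
gallop_to(std::vector<unsigned>::const_iterator first,
          std::vector<unsigned>::const_iterator last,
          unsigned target)
{
    std::ptrdiff_t step = 1;
    auto it = first;
    while (it != last && *it < target) {
        first = it;                        // target lies beyond 'first'
        if (last - it > step)
            it += step;                    // exponential jump
        else
            it = last;
        step *= 2;
    }
    return std::lower_bound(first, it, target);
}
```

This takes O(log d) comparisons where d is the distance actually advanced, so the cost adapts to how far apart the matches are, instead of paying log(size) per probe.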
Here is the test code to create some vectors of random numbers and check that they are present
#include <iostream>
#include <vector>
#include <unordered_set>
#include <exception>
#include <algorithm>
#include <limits>
#include <random>
using namespace std;
template<typename TT>
struct op_store
{
    std::vector<TT>& store;
    op_store(vector<TT>& store): store(store){}
    void operator()(TT val){store.push_back(val);}
};
void fill_vec_with_unique_random_values(vector<unsigned>& cont, unordered_set<unsigned>& curr_values)
{
static random_device rd;
static mt19937 e1(rd());
static uniform_int_distribution<unsigned> uniform_dist(0, std::numeric_limits<unsigned>::max());
for (auto& jj : cont)
{
for (;;)
{
unsigned new_value = uniform_dist(e1);
// make sure all values are unique
if (curr_values.count(new_value) == 0)
{
curr_values.insert(new_value);
jj = new_value;
break;
}
}
}
}
int main (int argc, char *argv[])
{
static random_device rd;
static mt19937 e1(rd());
// vector searched in contains 9000+ elements. Vectors in vecSearched are sorted
vector<unsigned> vecSearched(9000);
unordered_set<unsigned> unique_values_9000;
fill_vec_with_unique_random_values(vecSearched, unique_values_9000);
//
// Create some vectors (5 here for a quick test) of size ranging from 1 to 2000000 elements. All vectors in vecOfVec are sorted
//
vector<vector<unsigned> > vecOfVec(5);
normal_distribution<> vec_size_normal_dist(1000000U, 500000U);
for (unsigned ii = 0; ii < vecOfVec.size(); ++ii)
{
std::cerr << " Create Random data set" << ii << " ...\n";
auto vec_size = min(2000000U, static_cast<unsigned>(vec_size_normal_dist(e1)));
vecOfVec[ii].resize(vec_size);
// Do NOT share values with the 9000. We will manually add these later
unordered_set<unsigned> unique_values(unique_values_9000);
fill_vec_with_unique_random_values(vecOfVec[ii], unique_values);
}
// insert half of vecSearched in our 120 vectors so that we know what we are going to find
vector<unsigned> correct_results(begin(vecSearched), begin(vecSearched) + 4500);
for (unsigned ii = 0; ii < vecOfVec.size(); ++ii)
vecOfVec[ii].insert(vecOfVec[ii].end(), begin(correct_results), end(correct_results));
// Make sure everything is sorted
std::cerr << " Sort data ...\n";
for (unsigned ii = 0; ii < vecOfVec.size(); ++ii)
sort(begin(vecOfVec[ii]), end(vecOfVec[ii]));
sort(begin(vecSearched), end(vecSearched));
sort(begin(correct_results), end(correct_results));
std::cerr << " Match ...\n";
// results
vector<vector<unsigned> > results(vecOfVec.size());
for (unsigned ii = 0; ii < vecOfVec.size(); ++ii)
{
std::cerr << ii << " done\n";
set_mark_overlap(vecSearched, vecOfVec[ii], op_store<unsigned>(results[ii]));
// check all is well
if (results[ii] != correct_results)
throw runtime_error("Oops");
}
return(0);
}
Using set_intersection could help, but I do not know if it improves the overall speed:
vector<vector<unsigned int> > vecOfVec(200);
vector<unsigned int> vecSearched;
set<unsigned int> intersection;
for(auto it = vecOfVec.begin(); it != vecOfVec.end(); ++it)
{
std::set_intersection(it->begin(), it->end(), vecSearched.begin(), vecSearched.end(), std::inserter(intersection, intersection.begin()));
}
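A variation on the snippet above: writing the matches into a vector with std::back_inserter avoids the tree insertions of std::set. Both input ranges must already be sorted, as std::set_intersection requires; the function name is illustrative:

```cpp
#include <algorithm>
#include <iterator>
#include <vector>

// Intersect two sorted vectors, appending matches to a plain vector.
// The output is produced in sorted order, so no extra sorting is needed.
std::vector<unsigned> intersect_sorted(const std::vector<unsigned>& a,
                                       const std::vector<unsigned>& b)
{
    std::vector<unsigned> out;
    std::set_intersection(a.begin(), a.end(), b.begin(), b.end(),
                          std::back_inserter(out));
    return out;
}
```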
I wrote this code in C++ as part of a uni task where I need to ensure that there are no duplicates within an array:
// Check for duplicate numbers in user inputted data
int i; // Need to declare i here so that it can be accessed by the 'inner' loop
for(i = 0;i < 6; i++) { // Check each other number in the array
for(int j = i; j < 6; j++) { // Check the rest of the numbers
if(j != i) { // Makes sure don't check number against itself
if(userNumbers[i] == userNumbers[j]) {
b = true;
}
}
if(b == true) { // If there is a duplicate, change that particular number
cout << "Please re-enter number " << i + 1 << ". Duplicate numbers are not allowed:" << endl;
cin >> userNumbers[i];
}
} // Comparison loop
b = false; // Reset the boolean after each number entered has been checked
} // Main check loop
It works perfectly, but I'd like to know if there is a more elegant or efficient way to check.
You could sort the array in O(n log(n)), then simply compare each number with the next one. That is substantially faster than your O(n^2) existing algorithm. The code is also a lot cleaner. Your code also doesn't ensure no duplicates were inserted when they were re-entered; you need to prevent duplicates from existing in the first place.
std::sort(userNumbers.begin(), userNumbers.end());
for(int i = 0; i < userNumbers.size() - 1; i++) {
if (userNumbers[i] == userNumbers[i + 1]) {
userNumbers.erase(userNumbers.begin() + i);
i--;
}
}
I also second the recommendation to use a std::set - no duplicates there.
The following solution is based on sorting the numbers and then removing the duplicates:
#include <algorithm>
int main()
{
int userNumbers[6];
// ...
int* end = userNumbers + 6;
std::sort(userNumbers, end);
bool containsDuplicates = (std::unique(userNumbers, end) != end);
}
Indeed, the fastest and, as far as I can see, most elegant method is as advised above:
std::vector<int> tUserNumbers;
// ...
std::set<int> tSet(tUserNumbers.begin(), tUserNumbers.end());
std::vector<int>(tSet.begin(), tSet.end()).swap(tUserNumbers);
It is O(n log n). However, this does not work if the ordering of the numbers in the input array needs to be kept... In that case I did:
std::set<int> tTmp;
std::vector<int>::iterator tNewEnd =
std::remove_if(tUserNumbers.begin(), tUserNumbers.end(),
[&tTmp] (int pNumber) -> bool {
return (!tTmp.insert(pNumber).second);
});
tUserNumbers.erase(tNewEnd, tUserNumbers.end());
which is still O(n log n) and keeps the original ordering of elements in tUserNumbers.
Cheers,
Paul
This is an extension to the answer by @Puppy, which is the current best answer.
PS: I tried to post this as a comment on the current best answer by @Puppy but couldn't, as I don't have 50 reputation points yet. Also, a bit of experimental data is shared here for further help.
Both std::set and std::map are implemented in the STL using balanced binary search trees, so both lead to a complexity of O(n log n) in this case. Better performance can be achieved if a hash table is used: std::unordered_map offers a hash-table-based implementation for faster search. I experimented with all three implementations and found the results using std::unordered_map to be better than std::set and std::map. Results and code are shared below. The images are snapshots of performance measured by LeetCode on the solutions.
bool hasDuplicate(vector<int>& nums) {
size_t count = nums.size();
if (!count)
return false;
std::unordered_map<int, int> tbl;
//std::set<int> tbl;
for (size_t i = 0; i < count; i++) {
if (tbl.find(nums[i]) != tbl.end())
return true;
tbl[nums[i]] = 1;
//tbl.insert(nums[i]);
}
return false;
}
unordered_map Performance (Run time was 52 ms here)
Set/Map Performance
You can add all elements in a set and check when adding if it is already present or not. That would be more elegant and efficient.
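That suggestion can be sketched like this, using the return value of std::set::insert to detect an element that was already present:

```cpp
#include <set>
#include <vector>

// Insert each element into a std::set and stop as soon as an insertion
// is rejected, i.e. the value was already seen. O(n log n) overall.
bool has_duplicates(const std::vector<int>& numbers)
{
    std::set<int> seen;
    for (int n : numbers)
        if (!seen.insert(n).second)  // .second is false if already present
            return true;
    return false;
}
```

Unlike the sort-based approaches, this leaves the input untouched and can bail out early on the first duplicate.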
I'm not sure why this hasn't been suggested, but here is a way, in base 10, to find duplicate digits in O(n). The problem I see with the already-suggested O(n) solution is that it requires the digits to be sorted first. This method is O(n) and does not require the set to be sorted. The cool thing is that checking whether a specific digit has duplicates is O(1). I know this thread is probably dead, but maybe it will help somebody! :)
/*
============================
Foo
============================

Takes in a read-only int. A table is created to store counters
for each digit. If any digit's counter ends up higher than 1, the
function returns false. For example, with 48778584:

     0    1    2    3    4    5    6    7    8    9
    [0]  [0]  [0]  [0]  [2]  [1]  [0]  [2]  [2]  [0]

When we iterate over this table, we find that 4 is duplicated and
return false.
*/
bool Foo(int number)
{
int temp = number;
int digitTable[10]={0};
while(temp > 0)
{
digitTable[temp % 10]++; // Last digit's respective index.
temp /= 10; // Move to next digit
}
for (int i=0; i < 10; i++)
{
if (digitTable [i] > 1)
{
return false;
}
}
return true;
}
It's OK, especially for small array lengths. I'd use more efficient approaches (fewer than n^2/2 comparisons) if the array is much bigger - see DeadMG's answer.
Some small corrections for your code:
Instead of int j = i, write int j = i + 1 and you can omit your if(j != i) test.
You shouldn't need to declare the i variable outside the for statement.
I think @Michael Jaison G's solution is really brilliant; I modify his code a little to avoid sorting. (By using unordered_set, the algorithm may be a little faster.)
template <class Iterator>
bool isDuplicated(Iterator begin, Iterator end) {
using T = typename std::iterator_traits<Iterator>::value_type;
std::unordered_set<T> values(begin, end);
std::size_t size = std::distance(begin,end);
return size != values.size();
}
//std::unique(_copy) requires a sorted container.
std::sort(cont.begin(), cont.end());
//testing if cont has duplicates
std::unique(cont.begin(), cont.end()) != cont.end();
//getting a new container with no duplicates
std::unique_copy(cont.begin(), cont.end(), std::back_inserter(cont2));
#include<iostream>
#include<algorithm>
int main(){
int arr[] = {3, 2, 3, 4, 1, 5, 5, 5};
int len = sizeof(arr) / sizeof(*arr); // Finding length of array
std::sort(arr, arr+len);
int unique_elements = std::unique(arr, arr+len) - arr;
if(unique_elements == len) std::cout << "Duplicate number is not present here\n";
else std::cout << "Duplicate number present in this array\n";
return 0;
}
As mentioned by #underscore_d, an elegant and efficient solution would be,
#include <algorithm>
#include <vector>
template <class Iterator>
bool has_duplicates(Iterator begin, Iterator end) {
using T = typename std::iterator_traits<Iterator>::value_type;
std::vector<T> values(begin, end);
std::sort(values.begin(), values.end());
return (std::adjacent_find(values.begin(), values.end()) != values.end());
}
int main() {
int user_ids[6];
// ...
std::cout << has_duplicates(user_ids, user_ids + 6) << std::endl;
}
Fast O(N) time and space solution; it returns as soon as it hits a duplicate:
template <typename T>
bool containsDuplicate(vector<T>& items) {
return any_of(items.begin(), items.end(), [s = unordered_set<T>{}](const auto& item) mutable {
return !s.insert(item).second;
});
}
Not enough karma to post a comment. Hence a post.
vector <int> numArray = { 1,2,1,4,5 };
unordered_map<int, bool> hasDuplicate;
bool flag = false;
for (auto i : numArray)
{
if (hasDuplicate[i])
{
flag = true;
break;
}
else
hasDuplicate[i] = true;
}
cout << (flag ? "Duplicate" : "No duplicate");
I have two vectors v1 and v2 of type std::vector<std::string>. Both vectors have unique values and should compare equal if values compare equal but independent of the order values appear in the vector.
I assume two sets of type std::unordered_set would have been a better choice, but I take it as it is, so two vectors.
Nevertheless, I thought for the needed order insensitive comparison I'll just use operator== from std::unordered_set by copying to two std::unordered_set. Very much like this:
bool oi_compare1(std::vector<std::string> const&v1,
std::vector<std::string> const&v2)
{
std::unordered_set<std::string> tmp1(v1.begin(),v1.end());
std::unordered_set<std::string> tmp2(v2.begin(),v2.end());
return tmp1 == tmp2;
}
While profiling I noticed this function consuming a lot of time, so I checked the documentation and saw the O(n*n) complexity. I am confused; I was expecting O(n*log(n)), as for, e.g., the following naive solution I came up with:
bool oi_compare2(std::vector<std::string> const&v1,
std::vector<std::string> const&v2)
{
if(v1.size() != v2.size())
return false;
auto tmp = v2;
size_t const size = tmp.size();
for(size_t i = 0; i < size; ++i)
{
bool flag = false;
for(size_t j = i; j < size; ++j)
if(v1[i] == tmp[j]){
flag = true;
std::swap(tmp[i],tmp[j]);
break;
}
if(!flag)
return false;
}
return true;
}
Why the O(n*n) complexity for std::unordered_set, and is there a built-in function I can use for order-insensitive comparison?
EDIT----
BENCHMARK
#include <unordered_set>
#include <chrono>
#include <iostream>
#include <vector>
bool oi_compare1(std::vector<std::string> const&v1,
std::vector<std::string> const&v2)
{
std::unordered_set<std::string> tmp1(v1.begin(),v1.end());
std::unordered_set<std::string> tmp2(v2.begin(),v2.end());
return tmp1 == tmp2;
}
bool oi_compare2(std::vector<std::string> const&v1,
std::vector<std::string> const&v2)
{
if(v1.size() != v2.size())
return false;
auto tmp = v2;
size_t const size = tmp.size();
for(size_t i = 0; i < size; ++i)
{
bool flag = false;
for(size_t j = i; j < size; ++j)
if(v1[i] == tmp[j]){
flag = true;
std::swap(tmp[i],tmp[j]);
break;
}
if(!flag)
return false;
}
return true;
}
int main()
{
std::vector<std::string> s1{"1","2","3"};
std::vector<std::string> s2{"1","3","2"};
std::cout << std::boolalpha;
for(size_t i = 0; i < 15; ++i)
{
auto tmp1 = s1;
for(auto &iter : tmp1)
iter = std::to_string(i)+iter;
s1.insert(s1.end(),tmp1.begin(),tmp1.end());
s2.insert(s2.end(),tmp1.begin(),tmp1.end());
}
std::cout << "size1 " << s1.size() << std::endl;
std::cout << "size2 " << s2.size() << std::endl;
for(auto && c : {oi_compare1,oi_compare2})
{
auto start = std::chrono::steady_clock::now();
bool flag = true;
for(size_t i = 0; i < 10; ++i)
flag = flag && c(s1,s2);
std::cout << "ms=" << std::chrono::duration_cast<std::chrono::milliseconds>(std::chrono::steady_clock::now() - start).count() << " flag=" << flag << std::endl;
}
return 0;
}
gives
size1 98304
size2 98304
ms=844 flag=true
ms=31 flag=true
--> naive approach way faster.
For all the complexity O(N*N) experts here...
Let me go through this naive approach. I have two loops there. The first loop runs from i=0 to size, which is N. The inner loop is called from j=i!!!!!! to N. In spoken language that means I call the inner loop N times. But the complexity of the inner loop is log(n) due to the starting index of j = i!!!! If you still don't believe me, calculate the complexity from the benchmarks and you will see...
EDIT2---
LIVE ON WANDBOX
https://wandbox.org/permlink/v26oxnR2GVDb9M6y
Since unordered_set is built using a hash map, the logic to compare lhs == rhs is:
Check the sizes of lhs and rhs; if not equal, return false.
For each item in lhs, find it in rhs and compare.
For a hash map, the time complexity of a single find of an item in rhs is O(n) in the worst case. So the worst-case time complexity is O(n^2). However, normally you get a time complexity of O(n).
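As an aside, a common O(n log n) alternative that sidesteps the hash worst case entirely is to compare sorted copies; a sketch (the function name is illustrative, and it is valid here because both vectors hold unique values):

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Order-insensitive comparison via sorted copies: O(n log n) regardless
// of hash quality. Parameters are taken by value so the sorts work on copies.
bool oi_compare_sorted(std::vector<std::string> v1,
                       std::vector<std::string> v2)
{
    if (v1.size() != v2.size())
        return false;
    std::sort(v1.begin(), v1.end());
    std::sort(v2.begin(), v2.end());
    return v1 == v2;
}
```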
I'm sorry to tell you, your benchmark of operator== is faulty.
oi_compare1 accepts 2 vectors and needs to build up 2 complete unordered_set instances, then call operator== and destroy the whole bunch again.
oi_compare2 also accepts 2 vectors and immediately uses them for the size comparison. It only copies 1 instance (v2 to tmp), which is much more performant for a vector.
operator==
Looking at the documentation: https://en.cppreference.com/w/cpp/container/unordered_set/operator_cmp we can see the expected complexity:
Proportional to N calls to operator== on value_type, calls to the predicate returned by key_eq, and calls to the hasher returned by hash_function, in the average case, proportional to N2 in the worst case where N is the size of the container.
edit
There is a simple algorithm: you can loop over one unordered_set and do a lookup in the other one. Without hash collisions, it will find each element in its own internal bucket and compare it for equality, as hashing alone isn't sufficient.
Assuming you don't have hash collisions, the elements of an unordered_set have a stable order in which they are stored. One could loop over the internal buckets and compare the elements 2-by-2 (1st of the one with the 1st of the other, 2nd with the 2nd ...). This nicely gives O(N). This doesn't work when the buckets you store the values in have different sizes, or when the assignment of buckets uses a different calculation to deal with collisions.
Assuming you are unlucky and every element results in the same hash (known as hash flooding), you end up with a list of elements without order. To compare, you have to check for each element whether it exists in the other one, causing O(N*N).
This last one is easily reproducible if you rig your hash to always return the same number. Build the one set in the reverse order of the other one.
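That rigged hash can be sketched like this; bad_hash is an illustrative name, and filling two such sets in opposite orders before comparing them with == exercises the O(N*N) path described above:

```cpp
#include <string>
#include <unordered_set>

// A deliberately terrible hash: every key lands in the same bucket,
// so lookups (and therefore operator==) degrade to linear scans.
struct bad_hash {
    std::size_t operator()(const std::string&) const { return 0; }
};

using flooded_set = std::unordered_set<std::string, bad_hash>;
```

Comparing two flooded_set instances still gives the correct answer; it just does so in quadratic time, which is the worst case cppreference documents for unordered_set's operator==.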
I find an interesting phenomenon when I try to optimize my solution for the leetcode two sum problem (https://leetcode.com/problems/two-sum/description/).
Leetcode description for the two-sum problem is:
Given an array of integers, return indices of the two numbers such that they add up to a specific target.
You may assume that each input would have exactly one solution, and you may not use the same element twice.
Initially, I solved this problem using two loops. First I loop through the input array to store each array value and its index as a pair in a map. Then I loop through the input array again to look up each element's complement and check whether it exists in the map. The following is my solution from LeetCode:
class Solution {
public:
vector<int> twoSum(vector<int>& nums, int target)
{
vector<int> res;
map<int, int> store;
for(int i = 0; i < nums.size(); ++i)
{
store[nums[i]] = i;
}
for(int i = 0; i < nums.size(); ++i)
{
auto iter = store.find(target - nums[i]);
if(iter != store.end() && (iter -> second) != i)
{
res.push_back(i);
res.push_back(iter -> second);
break;
}
}
return res;
}
};
This solution takes 4ms in the LeetCode submission. Since I am looping through the same array twice, I was thinking to optimize my code by combining the insert operation and map.find() into a single loop, so I can check for a solution while inserting elements. Then I have the following solution:
class Solution {
public:
vector<int> twoSum(vector<int>& nums, int target)
{
vector<int> res;
map<int, int> store;
for(int i = 0; i < nums.size(); ++i)
{
auto iter = store.find(target - nums[i]);
if(iter != store.end() && (iter -> second) != i)
{
res.push_back(i);
res.push_back(iter -> second);
break;
}
store[nums[i]] = i;
}
return res;
}
};
However, the single-loop version is much slower than the two separate loops: it takes 12ms.
For further research, I made a test case with 100000002 input elements where the solution is [0, 100000001] (first index and last index). The following is my test code:
#include <iostream>
#include <vector>
#include <algorithm>
#include <map>
#include <iterator>
#include <cstdio>
#include <ctime>
using namespace std;
vector<int> twoSum(vector<int>& nums, int target)
{
vector<int> res;
map<int, int> store;
for(int i = 0; i < nums.size(); ++i)
{
auto iter = store.find(target - nums[i]);
if(iter != store.end() && (iter -> second) != i)
{
res.push_back(i);
res.push_back(iter -> second);
break;
}
store[nums[i]] = i;
}
return res;
}
vector<int> twoSum2(vector<int>& nums, int target)
{
vector<int> res;
map<int, int> store;
for(int i = 0; i < nums.size(); ++i)
{
store[nums[i]] = i;
}
for(int i = 0; i < nums.size(); ++i)
{
auto iter = store.find(target - nums[i]);
if(iter != store.end() && (iter -> second) != i)
{
res.push_back(i);
res.push_back(iter -> second);
break;
}
}
return res;
}
int main()
{
std::vector<int> test1;
test1.push_back(4);
for (int i = 0; i < 100000000; ++i)
{
test1.push_back(3);
}
test1.push_back(6);
std::clock_t start;
double duration;
start = std::clock();
auto res1 = twoSum(test1, 10);
duration = ( std::clock() - start ) / (double) CLOCKS_PER_SEC;
std::cout<<"single loop: "<< duration <<'\n';
cout << "result: " << res1[1] << ", " << res1[0] << endl;
start = std::clock();
res1 = twoSum2(test1, 10);
duration = ( std::clock() - start ) / (double) CLOCKS_PER_SEC;
std::cout<<"double loops: "<< duration <<'\n';
cout << "result: " << res1[0] << ", " << res1[1] << endl;
}
I still get a similar result: the single-loop version (7.9s) is slower than the double-loop version (3.0s):
I don't really understand why the single-loop combined version is slower than the double-loop separated version. I think the single-loop version should avoid some redundant looping. Is it because of the STL map implementation that it is better to do the insertion and map.find() operations separately in two loops, rather than alternating insertion and map.find() in one loop?
BTW, I am working on macOS and using Apple LLVM version 10.0.0 (clang-1000.10.44.2).
Let us see what actually happens in both scenarios.
In the two-loop scenario, you do N insertions in the map but then only one find, because once the map is fully fed, you get the expected result on the first iteration.
In the single-loop scenario, you must wait for the last insertion to find the result. So you do N-1 insertions and N-1 finds.
It is no surprise that it takes twice the time in your worst-case test...
For randomized use cases, the two-loop scenario will result in exactly N insertions and statistically N/2 finds; best case, N inserts and 1 find; worst case, N inserts and N-1 finds.
In the single loop you start finding as soon as the map is not empty. The best case is 1 insert and 1 find (far better than two loops), and the worst case is N-1 inserts and N-1 finds. I know it is easy to be misguided by probabilities, but I would expect statistically 3N/4 inserts and N/2 finds, so slightly better than the two-loop scenario.
TL/DR: you get better results for the two-loop scenario than for the single-loop one because your test case is the best case for two loops and the worst case for the single loop.
Given a set of binary vectors S, what is the most efficient way to compare all elements in every vector in S, and return all sets of indices that have the same value across all vectors?
for example:
Here the vectors are displayed horizontally, and each element is labelled x1, x2, x3, etc. The algorithm should return the sets {x1, x8} and {x7, x9} (ignore x4 and x6 in the image, that is related to another problem).
Here is my (very hacky) solution so far:
#include <iostream>
#include <vector>
using namespace std;
int main() {
    // initialise test vectors
    std::vector<std::vector<int> > vecs;
    vecs.push_back(std::vector<int>{0,0,1,1,0,0,1,0,1});
    vecs.push_back(std::vector<int>{1,0,0,1,1,0,1,1,1});
    vecs.push_back(std::vector<int>{1,1,0,1,0,0,0,1,0});
    // vector to keep track if index already in a group
    std::vector<int> in_group (vecs[0].size(), 0);
    // output vector
    std::vector<std::vector<int> > output;
    for (int i = 0; i < vecs[0].size(); ++i){
        // if already in group, skip current index
        if (in_group[i]) continue;
        else in_group[i] = 1;
        // vector to store values in current group
        std::vector<int> curr_group {i};
        for (int j = i+1; j < vecs[0].size(); ++j){
            bool match = true;
            // if already in a group, continue
            if (in_group[j]) continue;
            for (int s = 0; s < vecs.size(); ++s){
                if (vecs[s][i] != vecs[s][j]){
                    match = false;
                    break;
                }
            }
            // if loop completed without breaking, match found
            if (match){
                curr_group.push_back(j);
                in_group[j] = 1;
            }
        }
        // put current group in output vector
        output.push_back(curr_group);
    }
    // display output
    for (int i = 0; i < output.size(); ++i){
        for (int j = 0; j < output[i].size(); ++j){
            std::cout << "x" << output[i][j] << " ";
        }
        std::cout << std::endl;
    }
    return 0;
}
It basically iterates over each index and compares it against every later index across all of the vectors; if the inner loop completes without a mismatch, the index is added to the current group. If no match is found, the group is added with only the single index (this is desired behaviour). The output of this program is:
x0 x7
x1
x2
x3
x4
x5
x6 x8
Which is correct (if you translate the value of each index, +1), so it works. I am just wondering whether there is a better/faster way to do this, maybe using a fancy data structure? The vectors I am comparing are very large (up to a million values per vector), and I am comparing across many vectors (1000 or more), so efficiency is important.
Any help would be greatly appreciated!
Something along these lines, perhaps:
#include <iostream>
#include <vector>
#include <unordered_map>

int main() {
    // initialise test vectors
    std::vector<std::vector<int> > vecs;
    vecs.push_back(std::vector<int>{0,0,1,1,0,0,1,0,1});
    vecs.push_back(std::vector<int>{1,0,0,1,1,0,1,1,1});
    vecs.push_back(std::vector<int>{1,1,0,1,0,0,0,1,0});
    // the key packs one bit per vector, so an unsigned key covers up to
    // 32 vectors; use a wider type (or a string/bitset key) for more
    std::unordered_map<unsigned, std::vector<int>> groups;
    for (int i = 0; i < vecs[0].size(); ++i){
        unsigned key = 0;
        for (int j = 0; j < vecs.size(); ++j) {
            key += vecs[j][i] << j;
        }
        groups[key].push_back(i);
    }
    // display output
    for (const auto& group : groups) {
        for (auto index : group.second) {
            std::cout << "x" << index << " ";
        }
        std::cout << std::endl;
    }
    return 0;
}
First of all, transform each column into an object that can be compared against any other such object. Any "big integer" implementation should suffice.
With these, build a vector of pairs consisting of the column index and the big integer.
Sort this vector by the big integer; now all matching columns are adjacent in the vector.
Finally, iterate once to find each group of identical columns and you are done.
With n columns and m vectors, this runs in O(m·n log n) time, which is magnitudes faster than the O(m·n²) of your current pairwise implementation.
What about creating a set of vectors that record the sets of indices sharing a given value sequence so far? At each stage you split each such vector depending on the next binary value, eliminating any vector that reduces to size 1.
stage 1:
split { 1,2,3,4,5,6,7,8,9 }
<0> -> { 1,2,5,6,8 }
<1> -> { 3,4,7,9 }
stage 2:
split { 1,2,5,6,8 }
<0> -> { 2,6 }
<1> -> { 1,5,8 }
split { 3,4,7,9 }
<0> -> { 3 } <-- eliminate as size is 1
<1> -> { 4,7,9 }
stage 3:
split { 2,6 }
<0> -> { 6 } <-- eliminate as size is 1
<1> -> { 2 } <-- eliminate as size is 1
split { 1,5,8 }
<0> -> { 5 } <-- eliminate as size is 1
<1> -> { 1,8 }
split { 4,7,9 }
<0> -> { 7,9 }
<1> -> { 4 } <-- eliminate as size is 1
Note that you don't need to record the sequence, just split the vectors from the previous stage based on the values in the current binary vector. In the worst case you examine each element of each array once, so the complexity is linear in the total number of elements.
Here is a trick question asked in class today. I was wondering if there is a way to find a unique number in an array. The usual method is to use two for loops and find the number that does not match any of the others. I am using std::vector for my array in C++ and was wondering whether std::find could spot the unique number, as I don't know where the unique number is in the array.
Assuming that we know that the vector has at least three
elements (because otherwise, the question doesn't make sense),
just look for an element different from the first. If it
happens to be the second, of course, we have to check the third
to see whether it was the first or the second which is unique,
which means a little extra code, but roughly:
std::vector<int>::const_iterator
findUniqueEntry( std::vector<int>::const_iterator begin,
                 std::vector<int>::const_iterator end )
{
    std::vector<int>::const_iterator result
        = std::find_if(
            next( begin ), end,
            [begin]( int value ) { return value != *begin; } );
    if ( result == next( begin ) && *result == *next( result ) ) {
        -- result;
    }
    return result;
}
(Not tested, but you get the idea.)
As others have said, sorting is one option. Then your unique value(s) will have a different value on either side.
Here's another option that solves it using std::find in O(n^2) time (one pass over the vector, where each iteration scans the rest of the vector); sorting is not required.
vector<int> findUniques(vector<int> values)
{
    vector<int> uniqueValues;
    vector<int>::iterator begin = values.begin();
    vector<int>::iterator end = values.end();
    vector<int>::iterator current;
    for(current = begin ; current != end ; current++)
    {
        int val = *current;
        bool foundBefore = false;
        bool foundAfter = false;
        if (std::find(begin, current, val) != current)
        {
            foundBefore = true;
        }
        else if (std::find(current + 1, end, val) != end)
        {
            foundAfter = true;
        }
        if(!foundBefore && !foundAfter)
            uniqueValues.push_back(val);
    }
    return uniqueValues;
}
Basically what is happening here is that I run std::find on the elements before the current element, and also on the elements after it. Since the current element already holds the value in 'val' (i.e. it is in the vector once already), finding the value before or after the current position means it is not unique.
This should find all values in the vector that are unique, regardless of how many unique values there are.
Here's some test code to run it and see:
void printUniques(vector<int> uniques)
{
    vector<int>::iterator it;
    for(it = uniques.begin() ; it < uniques.end() ; it++)
    {
        cout << "Unique value: " << *it << endl;
    }
}

void WaitForKey()
{
    system("pause");
}

int main()
{
    vector<int> values;
    for(int i = 0 ; i < 10 ; i++)
    {
        values.push_back(i);
    }
    /*for(int i = 2 ; i < 10 ; i++)
    {
        values.push_back(i);
    }*/
    printUniques(findUniques(values));
    WaitForKey();
    return -13;
}
As an added bonus:
Here's a version that uses a map, does not use std::find, and gets the job done in O(n log n) time: n for the for loop, and log n for map::find(), since std::map is typically implemented as a red-black tree.
map<int,bool> mapValues(vector<int> values)
{
    map<int, bool> uniques;
    for(unsigned int i = 0 ; i < values.size() ; i++)
    {
        uniques[values[i]] = (uniques.find(values[i]) == uniques.end());
    }
    return uniques;
}

void printUniques(map<int, bool> uniques)
{
    cout << endl;
    map<int, bool>::iterator it;
    for(it = uniques.begin() ; it != uniques.end() ; it++)
    {
        if(it->second)
            cout << "Unique value: " << it->first << endl;
    }
}
And an explanation: iterate over all elements in the vector<int>. If the current element is not yet in the map, set its value to true; if it is already in the map, set its value to false. Afterwards, every key whose value is true is unique, and every key whose value is false has one or more duplicates.
If you have more than two values (one of which has to be unique), you can do it in O(n) time and space by iterating once through the array and filling a map whose key is the value and whose mapped value is the number of occurrences of that key.
Then you just iterate through the map looking for a count of 1. That is a unique number.
This example uses a map to count occurrences; a unique number will be seen only once:
#include <iostream>
#include <map>
#include <vector>

int main ()
{
    std::map<int,int> mymap;
    std::map<int,int>::iterator mit;
    std::vector<int> v;
    std::vector<int> myunique;
    v.push_back(10); v.push_back(10);
    v.push_back(20); v.push_back(30);
    v.push_back(40); v.push_back(30);
    std::vector<int>::iterator vit;
    // count occurrence of all numbers
    for(vit=v.begin(); vit!=v.end(); ++vit)
    {
        int number = *vit;
        mit = mymap.find(number);
        if( mit == mymap.end() )
        {
            // there's no record in map for your number yet
            mymap[number]=1; // we have seen it for the first time
        } else {
            mit->second++; // this one will not be unique
        }
    }
    // find the unique ones
    for(mit=mymap.begin(); mit!=mymap.end(); ++mit)
    {
        if( mit->second == 1 ) // this was seen only one time
        {
            myunique.push_back(mit->first);
        }
    }
    // print out unique numbers
    for(vit=myunique.begin(); vit!=myunique.end(); ++vit)
        std::cout << *vit << std::endl;
    return 0;
}
Unique numbers in this example are 20 and 40. There's no need for the list to be ordered for this algorithm.
Do you mean to find a number in a vector which appears only once? The nested loop is the easy solution. I don't think std::find or std::find_if is very useful here. Another option is to sort the vector so that you only need to look for consecutive numbers that differ. It seems like overkill, but it is actually O(n log n) instead of the O(n^2) of the nested loop:
void findUnique(const std::vector<int>& v, std::vector<int>& unique)
{
    if(v.size() <= 1)
    {
        unique = v;
        return;
    }
    unique.clear();
    std::vector<int> w = v;
    std::sort(w.begin(), w.end());
    // after sorting, an element is unique iff it differs from both neighbours
    if(w[0] != w[1]) unique.push_back(w[0]);
    for(size_t i = 1; i + 1 < w.size(); ++i)
        if(w[i] != w[i-1] && w[i] != w[i+1]) unique.push_back(w[i]);
    if(w[w.size()-1] != w[w.size()-2]) unique.push_back(w.back());
    // unique contains the numbers that are not repeated
}
Assuming you are given an array of size >= 3 which contains exactly one instance of value A, while all other values are B, you can do this with a single for loop.
int find_odd(int* array, int length) {
    // Among the first three elements, at least two must be the common value.
    int common = array[0];
    if (array[1] != common && array[2] != common)
        // The second and third elements are the common one, and the one we thought was not.
        return common;
    // Now search for the oddball.
    for (int i = 1; i < length; i++)
        if (array[i] != common) return array[i];
    return common; // unreachable if the input really contains one odd value
}
EDIT:
K what if more than 2 in an array of 5 are different? – super
Ah... that is a different problem. So you have an array of size n which contains the common element c more than once, and all other elements appear exactly once. The goal is to find the set of non-common (i.e. unique) elements, right?
Then you need to look at Sylvain's answer above. I think he was answering a different question, but it would work for this. At the end, you will have a hash map holding the count of each value. Loop through the hash map, and every time you see a count of 1, you know the key is a unique value in the input array.