C++, fast remove elements from vector unique to another vector


There are two unsorted vectors, one of int and one of pairs <int, float>:
std::vector <int> v1;
std::vector <std::pair<int, float> > v2;
containing millions of items.
How can I remove from v1, as fast as possible, the items that also appear in v2.first, so that only the items not included in v2.first remain?
Example:
v1: 5 3 2 4 7 8
v2: {2,8} {7,10} {5,0} {8,9}
----------------------------
v1: 3 4

There are two tricks I would use to do this as quickly as possible:
Use some sort of associative container (probably std::unordered_set) to store all of the integers in the second vector to make it dramatically more efficient to look up whether some integer in the first vector should be removed.
Optimize the way in which you delete elements from the initial vector.
More concretely, I'd do the following. Begin by creating a std::unordered_set and adding all of the integers that are the first integer in the pair from the second vector. This gives (expected) O(1) lookup time to check whether or not a specific int exists in the set.
Now that you've done that, use the std::remove_if algorithm to delete everything from the original vector that exists in the hash table. You can use a lambda to do this:
std::unordered_set<int> toRemove = /* ... */;
v1.erase(std::remove_if(v1.begin(), v1.end(), [&toRemove] (int x) -> bool {
return toRemove.find(x) != toRemove.end();
}), v1.end());
This first step of storing everything in the unordered_set takes expected O(n) time. The second step does a total of expected O(n) work by bunching all the deletes up to the end and making lookups take small time. This gives a total of expected O(n)-time, O(n) space for the entire process.
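For completeness, here is a minimal sketch of the whole procedure, including how the set might be filled from the pairs. The helper name remove_present is hypothetical and only used for illustration.

```cpp
#include <algorithm>
#include <unordered_set>
#include <utility>
#include <vector>

// Removes from v1 every value that occurs as a .first in v2.
// remove_present is a hypothetical helper name, not part of the answer.
void remove_present(std::vector<int>& v1,
                    const std::vector<std::pair<int, float>>& v2)
{
    // Collect the keys of v2 once; lookups are expected O(1) afterwards.
    std::unordered_set<int> toRemove;
    toRemove.reserve(v2.size());
    for (const auto& p : v2)
        toRemove.insert(p.first);

    // remove_if moves the kept elements to the front in a single pass;
    // erase then trims the tail in one go.
    v1.erase(std::remove_if(v1.begin(), v1.end(),
                            [&toRemove](int x) {
                                return toRemove.count(x) != 0;
                            }),
             v1.end());
}
```

With the vectors from the question, calling remove_present(v1, v2) should leave v1 holding 3 and 4.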
If you are allowed to sort the second vector (the pairs), then you could alternatively do this in O(n log n) worst-case time, O(log n) worst-case space by sorting the vector by the key, then using std::binary_search to check whether a particular int from the first vector should be eliminated or not. Each binary search takes O(log n) time, so the total time required is O(n log n) for the sorting, O(log n) time per element in the first vector (for a total of O(n log n)), and O(n) time for the deletion, giving a total of O(n log n).
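A rough sketch of that sorted variant follows. It uses std::lower_bound rather than std::binary_search because lower_bound only needs a comparator of the form (element, key), which is convenient when the elements are pairs and the key is a plain int. The name remove_present_sorted is hypothetical.

```cpp
#include <algorithm>
#include <utility>
#include <vector>

// Sorts v2 by key (this reorders v2), then drops from v1 every value
// found among the keys. remove_present_sorted is a hypothetical name.
void remove_present_sorted(std::vector<int>& v1,
                           std::vector<std::pair<int, float>>& v2)
{
    std::sort(v2.begin(), v2.end(),
              [](const std::pair<int, float>& a,
                 const std::pair<int, float>& b) { return a.first < b.first; });

    auto inV2 = [&v2](int x) {
        // Binary search for the first pair whose key is not less than x.
        auto it = std::lower_bound(
            v2.begin(), v2.end(), x,
            [](const std::pair<int, float>& p, int key) { return p.first < key; });
        return it != v2.end() && it->first == x;
    };

    v1.erase(std::remove_if(v1.begin(), v1.end(), inV2), v1.end());
}
```

This needs no extra container, at the cost of reordering v2 and an O(log n) lookup per element of v1.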
Hope this helps!

Assuming that neither container is sorted and that sorting is actually too expensive or memory is scarce:
v1.erase(std::remove_if(v1.begin(), v1.end(),
[&v2](int i) {
return std::find_if(v2.begin(), v2.end(),
[i](const std::pair<int, float>& p) {
return p.first == i; })
!= v2.end(); }), v1.end());
Alternatively, sort v2 on first and use a binary search instead. If there is enough memory, use an unordered_set to store the first members of v2.
Complete C++03 version:
#include <iostream>
#include <vector>
#include <utility>
#include <algorithm>
struct find_func {
find_func(int i) : i(i) {}
int i;
bool operator()(const std::pair<int, float>& p) {
return p.first == i;
}
};
struct remove_func {
remove_func(std::vector< std::pair<int, float> >* v2)
: v2(v2) {}
std::vector< std::pair<int, float> >* v2;
bool operator()(int i) {
return std::find_if(v2->begin(), v2->end(), find_func(i)) != v2->end();
}
};
int main()
{
// c++11 here
std::vector<int> v1 = {5, 3, 2, 4, 7, 8};
std::vector< std::pair<int, float> > v2 = {{2,8}, {7,10}, {5,0}, {8,9}};
v1.erase(std::remove_if(v1.begin(), v1.end(), remove_func(&v2)), v1.end());
// and here
for(auto x : v1) {
std::cout << x << std::endl;
}
return 0;
}

Related

Find the unique elements of a vector C++

Is there a fast way to find all the single elements (those that appear only once) in a vector of elements? Every element in the vector is either single or dual (appears twice). My approach is to sort all the elements and then remove those that appear twice. Is there a faster way to do it?

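So for small enough n (<= 1e8), the sort-and-remove approach (std::sort() followed by std::unique) is still faster than hash tables. Note that std::unique keeps one copy of every value, so the sample below yields the distinct values (1 2 3 5) rather than only those that appear once.
Sample code, O(n log n):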
vector<int>A = {1,2,3,1,2,5};
sort(A.begin(),A.end());
A.erase(unique(A.begin(),A.end()),A.end());
for(int&x:A)
cout<<x<<" ";
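If only the values that occur exactly once are wanted, as the question asks, a sorted pass over runs of equal values is one option. This is a sketch under that reading of the question; single_elements is a hypothetical name.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Returns the values that occur exactly once, in ascending order.
// The argument is taken by value so the caller's vector is left untouched.
std::vector<int> single_elements(std::vector<int> a)
{
    std::sort(a.begin(), a.end());
    std::vector<int> result;
    for (std::size_t i = 0; i < a.size(); ) {
        // Find the end of the run of values equal to a[i].
        std::size_t j = i + 1;
        while (j < a.size() && a[j] == a[i]) ++j;
        if (j - i == 1) result.push_back(a[i]);   // run of length one
        i = j;
    }
    return result;
}
```

For the input 1,2,3,1,2,5 this returns 3 and 5. The cost is still dominated by the O(n log n) sort.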

If your elements are hashable, you can use a std::unordered_map<T, int> to store the count of each element, which takes linear time on average:
template<typename T>
std::vector<T> uniqueElements(const std::vector<T>& v) {
std::unordered_map<T, int> counts;
for(const auto& elem : v) ++counts[elem];
std::vector<T> result;
for(auto [elem, count] : counts)
if(count == 1)
result.push_back(elem);
return result;
}
For small lists, sorting and then doing a linear pass might still be faster.
Also note that this copies your elements, which might be expensive in some cases.

c++ improve vector sorting by presorting with old vector

I have a vector of pairs with the following typedef
typedef std::pair<double, int> myPairType;
typedef std::vector<myPairType> myVectorType;
myVectorType myVector;
I fill this vector with double values and the int part of the pair is an index.
The vector then looks like this
0.6594 1
0.5434 2
0.5245 3
0.8431 4
...
My program has a number of time steps with slight variations in the double values and every time step I sort this vector with std::sort to something like this.
0.5245 3
0.5434 2
0.6594 1
0.8431 4
The idea is now to somehow use the vector from the last time step (the "old" vector, already sorted) to presort the current vector (the "new" vector, not yet sorted), and then use an insertion sort or Timsort to sort the "rest" of the presorted vector.
Is this somehow possible? I couldn't find a function to order the "new" vector of pairs by one part (the int part).
And if it is possible, could this be faster than sorting the whole unsorted "new" vector?
Thanks for any pointers into the right direction.
tiom
UPDATE
First of all, thanks for all the suggestions and code examples. I will have a look at each of them and do some benchmarking to see whether they speed up the process.
Since there were some questions regarding the vectors, I will try to explain in more detail what I want to accomplish.
As I said, I have a number of time steps 1 to n. For every time step I have a vector of double data values with approximately 260000 elements.
In every time step I add an index to this vector, which results in a vector of pairs <double, int>. See the following code snippet.
typedef typename myVectorType::iterator myVectorTypeIterator; // iterator for myVector
std::vector<double> vectorData; // holds the double data values
myVectorType myVector(vectorData.size()); // vector of pairs <double, int>
myVectorTypeIterator myVectorIter = myVector.begin();
// generating of the index
for (int i = 0; i < vectorData.size(); ++i) {
myVectorIter->first = vectorData[i];
myVectorIter->second = i;
++myVectorIter;
}
std::sort(myVector.begin(), myVector.end() );
(The index is 0 based. Sorry for my initial mistake in the example above)
I do this for every time step and then sort this vector of pairs with std::sort.
The idea was now to use the sorted vector of pairs of time step j-1 (let's call it vectorOld) in time step j as a "presorter" for the "new" myVector, since I assume the ordering of the sorted "new" myVector of time step j will only differ in some cases from the already sorted vectorOld of time step j-1.
By "presorter" I mean rearranging the pairs in the "new" myVector into a vector presortedVector of type myVectorType in the same index order as vectorOld, and then letting Timsort or some similar sorting algorithm that is good on presorted data do the rest of the sorting.
Some data examples:
This is what the beginning of myVector looks like in time step j-1 before the sorting.
0.0688015 0
0.0832928 1
0.0482259 2
0.142874 3
0.314859 4
0.332909 5
...
And after the sorting
0.000102207 23836
0.000107378 256594
0.00010781 51300
0.000109315 95454
0.000109792 102172
...
So in the next time step j this is my vectorOld, and I would like to take the element with index 23836 of the "new" myVector and put it in the first place of presortedVector; the element with index 256594 should be the second element in presortedVector, and so on. But the elements have to keep their original index. So 256594 will not become index 0, but only element 0 in presortedVector, still with index 256594.
I hope this is a better explanation of my plan.

First, scan through the sequence to find the first element that's smaller than the preceding one (either a loop, or C++11's std::is_sorted_until). This is the start of the unsorted portion. Use std::sort on the remainder, then merge the two halves with std::inplace_merge.
template<class RandomIt, class Compare>
void sort_new_elements(RandomIt first, RandomIt last, Compare comp)
{
RandomIt mid = std::is_sorted_until(first, last, comp);
std::sort(mid, last, comp);
std::inplace_merge(first, mid, last, comp);
}
This should be more efficient than sorting the whole sequence indiscriminately, as long as the presorted sequence at the front is significantly larger than the unsorted part.

Using the sorted vector would likely result in more comparisons (just to find a matching item).
What you seem to be looking for is a self-ordering container.
You could use a set (and remove/re-insert on modification).
Alternatively you could use Boost Multi Index which affords a bit more convenience (e.g. use a struct instead of the pair)
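As an illustration of the remove/re-insert idea, here is a small sketch built on std::set. The class OrderedValues and its members are hypothetical names; whether this beats re-sorting depends on how many values change per time step.

```cpp
#include <set>
#include <utility>
#include <vector>

typedef std::pair<double, int> myPairType;

// Keeps (value, index) pairs ordered at all times. A key stored in a
// std::set cannot be modified in place, so an update erases the old pair
// and inserts the new one. OrderedValues is a hypothetical name.
class OrderedValues {
public:
    explicit OrderedValues(const std::vector<double>& data) : values_(data) {
        for (int i = 0; i < static_cast<int>(data.size()); ++i)
            ordered_.emplace(data[i], i);
    }

    // Changes the value stored under index i; O(log n) per call.
    void update(int i, double newValue) {
        ordered_.erase(myPairType(values_[i], i));
        values_[i] = newValue;
        ordered_.emplace(newValue, i);
    }

    const std::set<myPairType>& ordered() const { return ordered_; }

private:
    std::vector<double> values_;     // current value per index
    std::set<myPairType> ordered_;   // pairs sorted by value, then index
};
```

If nearly every value changes in each time step, this amounts to n updates of O(log n) each, so it is unlikely to be faster than a plain std::sort in that case.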

I have no idea if this could be faster than sorting the whole unsorted "new" vector. It will depend on the data.
But this will create a sorted copy of a new vector based on the order of an old vector:
myVectorType getSorted(const myVectorType& unsorted, const myVectorType& old) {
myVectorType sorted(unsorted.size());
auto matching_value
= [&unsorted](const myPairType& value)
// assumes 1-based indices as in the first example; with 0-based indices use value.second
{ return unsorted[value.second - 1]; };
std::transform(old.begin(), old.end(), sorted.begin(), matching_value);
return sorted;
}
You will then need to "finish" sorting this vector. I don't know how much quicker (if at all) this will be than sorting it from scratch.

Well, you can create a new vector with the order of the old one and then use an algorithm that has good complexity for (nearly) sorted inputs to restore the order.
Below I put an example of how it works, with Mark's function as restore_order:
#include <iostream>
#include <algorithm>
#include <vector>
#include <utility>
using namespace std;
typedef std::pair<double, int> myPairType;
typedef std::vector<myPairType> myVectorType;
void outputMV(const myVectorType& vect, std::ostream& out)
{
for(const auto& element : vect)
out << element.first << " " << element.second << '\n';
}
//https://stackoverflow.com/a/28813905/1133179
template<class RandomIt, class Compare>
void restore_order(RandomIt first, RandomIt last, Compare comp)
{
RandomIt mid = std::is_sorted_until(first, last, comp);
std::sort(mid, last, comp);
std::inplace_merge(first, mid, last, comp);
}
int main() {
myVectorType myVector = {{3.5,0},{1.4,1},{2.5,2},{1.0,3}};
myVectorType mv2 = {{3.6,0},{1.35,1},{2.6,2},{1.36,3}};
auto comparer = [] (const auto& lhs, const auto& rhs) { return lhs.first < rhs.first;};
// make sure we didn't mess with the initial indexing
int i = 0;
for(auto& element : myVector) element.second = i++;
i = 0;
for(auto& element : mv2) element.second = i++;
//sort the initial vector
std::sort(myVector.begin(), myVector.end(), comparer);
outputMV(myVector, cout);
// this will replace each element of myVector with a corresponding
// value from mv2 using the old sorted order
std::for_each(myVector.begin(), myVector.end(),
[&mv2] (auto& el) {el = mv2[el.second];} // capture by reference to avoid copying mv2
);
// restore order in case it was different for the new vector
restore_order(myVector.begin(), myVector.end(), comparer);
outputMV(myVector, cout);
return 0;
}
This works in O(n) up to the restore step. The trick is then to use a good function for it: a nice candidate has good complexity for nearly sorted inputs. I used the function that Mark Ransom posted, which works but still isn't perfect.
It could be outperformed by a bubble-sort-inspired method: iterate over the elements and, whenever the current and next elements are in the wrong order, keep swapping the offending element backwards until it is in place. This is a bet on how much the order changes, though: if the order doesn't vary much you stay at about 2n steps, i.e. O(n); if it does, you go up to O(n^2).
I think the best would be an implementation of natural merge sort, which has a best case (sorted input) of O(n) and a worst case of O(n log n).
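The swap-backwards method described above is essentially an insertion sort. A minimal sketch is shown below; note that this is plain insertion sort, not the natural merge sort mentioned last, and that insertion_sort and resort_by_value are hypothetical names.

```cpp
#include <iterator>
#include <utility>
#include <vector>

typedef std::pair<double, int> myPairType;
typedef std::vector<myPairType> myVectorType;

// Insertion sort: takes O(n + d) steps, where d is the number of
// inversions, so it is close to linear on nearly sorted input and
// O(n^2) in the worst case.
template <class RandomIt, class Compare>
void insertion_sort(RandomIt first, RandomIt last, Compare comp)
{
    if (first == last) return;
    for (RandomIt it = std::next(first); it != last; ++it) {
        auto value = std::move(*it);
        RandomIt hole = it;
        // Shift larger elements one slot to the right until value fits.
        while (hole != first && comp(value, *std::prev(hole))) {
            *hole = std::move(*std::prev(hole));
            --hole;
        }
        *hole = std::move(value);
    }
}

// Convenience wrapper that orders by the double part only.
void resort_by_value(myVectorType& v)
{
    insertion_sort(v.begin(), v.end(),
                   [](const myPairType& a, const myPairType& b) {
                       return a.first < b.first;
                   });
}
```

It could be dropped in where restore_order is called in the example, but only benchmarking on the real data can tell which of the two is faster.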

Check for common members in vector c++

What is the best way to verify if there are common members within multiple vectors?
The vectors aren't necessarily of equal size and they may contain custom data (such as structures containing two integers that represent a 2D coordinate).
For example:
vec1 = {(1,2); (3,1); (2,2)};
vec2 = {(3,4); (1,2)};
How to verify that both vectors have a common member?
Note that I am trying to avoid inefficient methods such as going through all elements and checking for equal data.

For non-trivial data sets, the most efficient method is probably to sort both vectors and then use the std::set_intersection function defined in <algorithm>, as follows:
#include <vector>
#include <algorithm>
#include <iterator> // std::back_inserter
#include <utility> // std::pair
using namespace std;
typedef vector<pair<int, int>> tPointVector;
tPointVector vec1 {{1,2}, {3,1}, {2,2}};
tPointVector vec2 {{3,4}, {1,2}};
std::sort(begin(vec1), end(vec1));
std::sort(begin(vec2), end(vec2));
tPointVector vec3;
vec3.reserve(std::min(vec1.size(), vec2.size()));
set_intersection(begin(vec1), end(vec1), begin(vec2), end(vec2), back_inserter(vec3));
You may get better performance with a nonstandard algorithm if you do not need to know which elements are common, but only the number of common elements, because then you can avoid having to create new copies of the common elements.
In any case, it seems to me that starting by sorting both containers will give you the best performance for data sets with more than a few dozen elements.
Here's an attempt at writing an algorithm that just gives you the count of matching elements (untested):
auto it1 = begin(vec1);
auto it2 = begin(vec2);
const auto end1 = end(vec1);
const auto end2 = end(vec2);
sort(it1, end1);
sort(it2, end2);
size_t numCommonElements = 0;
while (it1 != end1 && it2 != end2) {
bool oneIsSmaller = *it1 < *it2;
if (oneIsSmaller) {
it1 = lower_bound(it1, end1, *it2);
} else {
bool twoIsSmaller = *it2 < *it1;
if (twoIsSmaller) {
it2 = lower_bound(it2, end2, *it1);
} else {
// none of the elements is smaller than the other
// so it's a match
++it1;
++it2;
++numCommonElements;
}
}
}
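Since the question only asks whether a common member exists, the same two-iterator walk can stop at the first match. A small sketch, assuming the element type provides operator<; has_common_element is a hypothetical name.

```cpp
#include <algorithm>
#include <utility>
#include <vector>

// Returns true as soon as one common element is found. The vectors are
// taken by value so that the caller's data stays in its original order.
template <class T>
bool has_common_element(std::vector<T> a, std::vector<T> b)
{
    std::sort(a.begin(), a.end());
    std::sort(b.begin(), b.end());
    auto it1 = a.begin();
    auto it2 = b.begin();
    while (it1 != a.end() && it2 != b.end()) {
        if (*it1 < *it2)      ++it1;
        else if (*it2 < *it1) ++it2;
        else                  return true;   // neither is smaller: a match
    }
    return false;
}
```

For the vectors in the question this returns true because (1,2) appears in both.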

"Note that I am trying to avoid inefficient methods such as going through all elements and check for equal data."
You need to go through all elements at least once; I assume you mean that you don't want to check every combination. Indeed, you don't want to do this:
for each element in vec1, go through the entire vec2 to check whether the element is there. This won't be efficient if your vectors have a large number of elements.
If you prefer a linear-time solution and don't mind using extra memory, here is what you can do:
You need a hash function to insert elements into an unordered_map or unordered_set.
See https://stackoverflow.com/a/13486174/2502814
// common elements via unordered_set example
#include <iostream> // std::cout
#include <unordered_set> // std::unordered_set
#include <vector> // std::vector
using namespace std;
namespace std {
template <>
struct hash<pair<int, int>>
{
typedef pair<int, int> argument_type;
typedef std::size_t result_type;
result_type operator()(const pair<int, int> & t) const
{
std::hash<int> int_hash;
return int_hash(t.first + 6495227 * t.second);
}
};
}
int main () {
vector<pair<int, int>> vec1 {{1,2}, {3,1}, {2,2}};
vector<pair<int, int>> vec2 {{3,4}, {1,2}};
// Copy all elements from vec2 into an unordered_set
unordered_set<pair<int, int>> in_vec2;
in_vec2.insert(vec2.begin(),vec2.end());
// Traverse vec1 and check if elements are here
for (auto& e : vec1)
{
if(in_vec2.find(e) != in_vec2.end()) // Searching in an unordered_set is faster than going through all elements of vec2 when vec2 is big.
{
//Here are the elements in common:
cout << "{" << e.first << "," << e.second << "} is in common!" << endl;
}
}
return 0;
}
Output : {1,2} is in common!
You can either do that, or copy all elements of vec1 into an unordered_set, and then traverse vec2.
Depending on the sizes of vec1 and vec2, one solution might be faster than the other.
Keep in mind that picking the smaller vector to insert in the unordered_set also means you will use less extra memory.
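The last two points can be sketched as follows. This variant passes the hash as a template argument instead of specializing std::hash, and it always hashes the smaller vector. The names PointHash and common_elements are hypothetical, and the mixing formula is just one common choice, not a tuned hash.

```cpp
#include <cstddef>
#include <functional>
#include <unordered_set>
#include <utility>
#include <vector>

typedef std::pair<int, int> Point;

// Hash functor handed to unordered_set as a template argument, which
// avoids adding a specialization to namespace std.
struct PointHash {
    std::size_t operator()(const Point& p) const {
        std::size_t h1 = std::hash<int>()(p.first);
        std::size_t h2 = std::hash<int>()(p.second);
        // Combine the two hashes; the constant is arbitrary.
        return h1 ^ (h2 + 0x9e3779b9u + (h1 << 6) + (h1 >> 2));
    }
};

// Hashes the smaller vector and probes it with the larger one.
std::vector<Point> common_elements(const std::vector<Point>& a,
                                   const std::vector<Point>& b)
{
    const std::vector<Point>& smaller = a.size() <= b.size() ? a : b;
    const std::vector<Point>& larger  = a.size() <= b.size() ? b : a;

    std::unordered_set<Point, PointHash> seen(smaller.begin(), smaller.end());
    std::vector<Point> result;
    for (const Point& p : larger)
        if (seen.count(p) != 0)
            result.push_back(p);
    return result;
}
```

Duplicates in the larger vector show up more than once in the result; deduplicate afterwards if that matters.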

I believe you can use a 2D tree to search in two dimensions. An optimal algorithm for the problem you specified would fall under the class of geometric algorithms. Maybe this link is of use to you: http://www.cs.princeton.edu/courses/archive/fall05/cos226/lectures/geosearch.pdf .

Efficient way to get the indizes of the k highest values in vector<float>

How can I create a std::map<int, float> from a vector<float>, so that the map contains the k highest values from the vector, with the keys being the indices of the values in the vector?
A naive approach would be to traverse the vector (O(n)), extract and erase (O(n)) the highest element k times (O(k)), leading to a complexity of O(k*n^2), which is suboptimal, I guess.
Even better would be to just copy (O(n)) and remove the smallest until size is k. Which would lead to O(n^2). Still polynomial...
Any ideas?

Following should do the job:
#include <cstddef>
#include <algorithm>
#include <iostream>
#include <map>
#include <tuple>
#include <vector>
// Compare: greater T2 first.
struct greater_by_second
{
template <typename T1, typename T2>
bool operator () (const std::pair<T1, T2>& lhs, const std::pair<T1, T2>& rhs)
{
return std::tie(lhs.second, lhs.first) > std::tie(rhs.second, rhs.first);
}
};
std::map<std::size_t, float> get_index_pairs(const std::vector<float>& v, int k)
{
std::vector<std::pair<std::size_t, float>> indexed_floats;
indexed_floats.reserve(v.size());
for (std::size_t i = 0, size = v.size(); i != size; ++i) {
indexed_floats.emplace_back(i, v[i]);
}
std::nth_element(indexed_floats.begin(),
indexed_floats.begin() + k,
indexed_floats.end(), greater_by_second());
return std::map<std::size_t, float>(indexed_floats.begin(), indexed_floats.begin() + k);
}
Let's test it:
int main(int argc, char *argv[])
{
const std::vector<float> fs {45.67f, 12.34f, 67.8f, 4.2f, 123.4f};
for (const auto& elem : get_index_pairs(fs, 2)) {
std::cout << elem.first << " " << elem.second << std::endl;
}
return 0;
}
Output:
2 67.8
4 123.4

You can keep a list of the k-highest values so far, and update it for each of the values in your vector, which takes you down to O(n*log k) (assuming log k for each update of the list of highest values) or, for a naive list, O(kn).
You can probably get closer to O(n), but assuming k is probably pretty small, may not be worth the effort.

Your optimal solution will have a complexity of O(n+k*log(k)), since sorting the k elements can be reduced to this, and you will have to look at each of the elements at least once.
Two possible solutions come to mind:
Iterate through the vector while adding all elements to a bounded (size k) priority-queue/heap, also keeping their indices.
Create a copy of your vector with including the original indices, i.e. std::vector<std::pair<float, std::size_t>> and use std::nth_element to move the k highest values to the front using a comparator that compares only the first element. Then insert those elements into your target map. Ironically, that last step adds you the k*log(k) in the overall complexity, while nth_element is linear (but will permute your indices).
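The first of these two ideas might look like the following sketch, which keeps a min-heap of at most k (value, index) entries. The function name top_k_by_heap is hypothetical.

```cpp
#include <cstddef>
#include <functional>
#include <map>
#include <queue>
#include <utility>
#include <vector>

// Returns index -> value for the k largest values of v in O(n log k).
std::map<std::size_t, float> top_k_by_heap(const std::vector<float>& v,
                                           std::size_t k)
{
    typedef std::pair<float, std::size_t> Entry;   // (value, index)
    // Min-heap: the smallest of the kept values sits on top.
    std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry> > heap;

    for (std::size_t i = 0; i < v.size(); ++i) {
        if (heap.size() < k) {
            heap.emplace(v[i], i);
        } else if (k > 0 && heap.top().first < v[i]) {
            heap.pop();              // evict the current minimum
            heap.emplace(v[i], i);
        }
    }

    std::map<std::size_t, float> result;
    while (!heap.empty()) {
        result[heap.top().second] = heap.top().first;
        heap.pop();
    }
    return result;
}
```

The heap never holds more than k entries, so the extra memory is O(k) instead of a full copy of the vector.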

Maybe I did not get it, but in case the incremental approach is not an option, why not use std::partial_sort?
That should be O(n log k), and since k is very likely to be small, that is practically O(n).
Edit: thanks to Mike Seymour for the update.
Edit (bis):
The idea is to use an intermediate vector for sorting, and then put it into the map. Trying to reduce the order of the computation would only be justified for a significant amount of data, so I guess the copy time (O(n)) could be lost in background noise.
Edit (ter):
That's actually what the selected answer does, without the theoretical explanations :).
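A sketch of the intermediate-vector idea with std::partial_sort could look like this. The name top_k_by_partial_sort is hypothetical, and the comparator looks at the value only.

```cpp
#include <algorithm>
#include <cstddef>
#include <map>
#include <utility>
#include <vector>

// Copies v into (value, index) pairs, orders only the k largest, and
// returns them as index -> value.
std::map<std::size_t, float> top_k_by_partial_sort(const std::vector<float>& v,
                                                   std::size_t k)
{
    k = std::min(k, v.size());
    std::vector<std::pair<float, std::size_t> > tmp;
    tmp.reserve(v.size());
    for (std::size_t i = 0; i < v.size(); ++i)
        tmp.emplace_back(v[i], i);

    // Only the first k slots end up ordered, largest value first.
    std::partial_sort(tmp.begin(),
                      tmp.begin() + static_cast<std::ptrdiff_t>(k),
                      tmp.end(),
                      [](const std::pair<float, std::size_t>& a,
                         const std::pair<float, std::size_t>& b) {
                          return a.first > b.first;
                      });

    std::map<std::size_t, float> result;
    for (std::size_t i = 0; i < k; ++i)
        result[tmp[i].second] = tmp[i].first;
    return result;
}
```

For the test vector used in the accepted answer and k = 2, this yields the entries with indices 2 and 4.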

std::sort that also keeps track of number of unique entries at each level

Say I have a std::vector. Say the vectors contain numbers. Let's take this std::vector
1,3,5,4,3,4,5,1,6,3
std::sort<std::less<int>> will sort this into
1,1,3,3,3,4,4,5,5,6,
How would I amend sort so that, while it is sorting, it also computes the quantity of numbers at each level? So in addition to sorting, it would also compile the following dictionary [level is also int]
std::map<level, int>
<1, 2>
<2, 3>
<3, 2>
<4, 2>
<5, 1>
<6, 1>
so there are 2 1's, 3 3's, 2 4's, and so on.
The reason I [think] I need this is because I don't want to sort the vector, THEN once again, compute the number of duplicates at each level. It seems faster to do it both in one pass?
Thank you all! bjskishore123 is the closest thing to what I was asking, but all the responses educated me. Thanks again.

As stated by @bjskishore123, you can use a map to guarantee the correct order of your set. As a bonus, you get a structure that is optimized for searching (the map, of course).
Inserting into or searching a map takes O(log(n)) time, while traversing the vector is O(n). So the algorithm is O(n*log(n)), which is the same complexity as any sorting algorithm that needs to compare elements: merge sort or quicksort, for example.
Here is a sample code for you:
int tmp[] = {5,5,5,5,5,5,2,2,2,2,7,7,7,7,1,1,1,1,6,6,6,2,2,2,8,8,8,5,5};
std::vector<int> values(tmp, tmp + sizeof(tmp) / sizeof(tmp[0]));
std::map<int, int> map_values;
for_each(values.begin(), values.end(), [&](int value)
{
map_values[value]++;
});
for(std::map<int, int>::iterator it = map_values.begin(); it != map_values.end(); it++)
{
std::cout << it->first << ": " << it->second << "times" << std::endl;
}
Output:
1: 4times
2: 7times
5: 8times
6: 3times
7: 4times
8: 3times

I don't think you can do this in one pass. Let's say you provide your own custom comparator for sorting which somehow tries to count the duplicates.
However, the only thing you can capture in the comparator is the value (or maybe a reference, but that doesn't matter) of the two elements currently being compared. You have no other information, because std::sort doesn't pass anything else to the comparator.
Now, the way std::sort works, it keeps swapping elements until they reach their proper location in the sorted vector. That means a single member can be passed to the comparator multiple times, making it impossible to count exactly. You can count how many times a certain element and all other values equal to it have been moved, but not exactly how many of them there are.

Instead of using a vector, use a std::multiset container while storing the numbers one by one.
It stores them internally in sorted order.
While storing each number, use a map to keep track of the number of occurrences of each number.
map<int, int> m;
Each time a number is added, do
m[num]++;
So there is no need for another pass to calculate the number of occurrences, although you need to iterate over the map to get each occurrence count.
=============================================================================
THE FOLLOWING IS AN ALTERNATE SOLUTION WHICH IS NOT RECOMMENDED .
GIVING IT AS YOU ASKED A WAY WHICH USES STD::SORT.
Below code makes use of comparison function to count the occurrences.
#include <iostream>
#include <map>
#include <vector>
#include <algorithm>
using namespace std;
struct Elem
{
int index;
int num;
};
std::map<int, int> countMap; //Count map
std::map<int, bool> visitedMap;
bool compare(Elem a, Elem b)
{
if(visitedMap[a.index] == false)
{
visitedMap[a.index] = true;
countMap[a.num]++;
}
if(visitedMap[b.index] == false)
{
visitedMap[b.index] = true;
countMap[b.num]++;
}
return a.num < b.num;
}
int main()
{
vector<Elem> v;
Elem e[5] = {{0, 10}, {1, 20}, {2, 30}, {3, 10}, {4, 20} };
for(size_t i = 0; i < 5; i++)
v.push_back(e[i]);
std::sort(v.begin(), v.end(), compare);
for(map<int, int>::iterator it = countMap.begin(); it != countMap.end(); it++)
cout<<"Element : "<<it->first<<" occurred "<<it->second<<" times"<<endl;
}
Output:
Element : 10 occurred 2 times
Element : 20 occurred 2 times
Element : 30 occurred 1 times

If you have lots of duplicates, the fastest way to accomplish this task is probably to first count duplicates using a hash map, which is O(n), and then to sort the map, which is O(m log m) where m is the number of unique values.
Something like this (in c++11):
#include <algorithm>
#include <unordered_map>
#include <utility>
#include <vector>
std::vector<std::pair<int, int>> uniqsort(const std::vector<int>& v) {
std::unordered_map<int, int> count;
for (auto& val : v) ++count[val];
std::vector<std::pair<int, int>> result(count.begin(), count.end());
std::sort(result.begin(), result.end());
return result;
}
There are lots of variations on the theme, depending on what you need, precisely. For example, perhaps you don't even need the result to be sorted; maybe it's enough to just have the count map. Or maybe you would prefer the result to be a sorted map from int to int, in which case you could just build a regular std::map, instead. (That would be O(n log m).) Or maybe you know something about the values which make them faster to sort (like the fact that they are small integers in a known range.) And so on.
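As an example of the last variation, when the values are known to be small non-negative integers, a counting array replaces both the hash map and the sort. This is a sketch under that assumption; uniqsort_small_range and its maxValue parameter are hypothetical.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Counts values assumed to lie in [0, maxValue]; values outside that range
// are skipped. Returns (value, count) pairs in ascending order of value.
// Runs in O(n + maxValue).
std::vector<std::pair<int, int> > uniqsort_small_range(const std::vector<int>& v,
                                                       int maxValue)
{
    std::vector<std::pair<int, int> > result;
    if (maxValue < 0) return result;

    std::vector<int> count(static_cast<std::size_t>(maxValue) + 1, 0);
    for (int val : v)
        if (val >= 0 && val <= maxValue)
            ++count[static_cast<std::size_t>(val)];

    // Walking the counts in index order yields the values already sorted.
    for (int val = 0; val <= maxValue; ++val)
        if (count[static_cast<std::size_t>(val)] > 0)
            result.emplace_back(val, count[static_cast<std::size_t>(val)]);
    return result;
}
```

For the vector 1,3,5,4,3,4,5,1,6,3 from the question and maxValue = 6, this returns (1,2), (3,3), (4,2), (5,2), (6,1).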
