Fast union building of multiple vectors in C++ - c++

I’m searching for a fast way to build a union of multiple vectors in C++.
More specifically: I have a collection of vectors (usually 15-20 vectors with several thousand unsigned integers each; always sorted and unique, so they could also be std::sets). For each stage, I choose some of them (usually 5-10) and build a union vector. Then I save the length of the union vector and choose some other vectors. This is repeated several thousand times. In the end I'm only interested in the length of the shortest union vector.
Small example:
V1: {0, 4, 19, 40}
V2: {2, 4, 8, 9, 19}
V3: {0, 1, 2, 4, 40}
V4: {9, 10}
// The Input Vectors V1, V2 … are always sorted and unique (could also be an std::set)
Choose V1 , V3;
Union Vector = {0, 1, 2, 4, 19, 40} -> Size = 6;
Choose V1, V4;
Union Vector = {0,4, 9, 10, 19 ,40} -> Size = 6;
… and so on …
At the moment I use std::set_union but I’m sure there must be a faster way.
vector<vector<uint64_t>> collection;
vector<uint64_t> chosen; // indices into collection
vector<uint64_t> unionVector, unionVectorTmp;
for (size_t i = 0; i < chosen.size(); i++) {
    set_union(collection.at(chosen.at(i)).begin(),
              collection.at(chosen.at(i)).end(),
              unionVector.begin(),
              unionVector.end(),
              back_inserter(unionVectorTmp));
    unionVector.swap(unionVectorTmp);
    unionVectorTmp.clear();
}
I'm grateful for every reference.
EDIT 27.04.2017
A new Idea:
unordered_set<unsigned int> unionSet;
unsigned int counter = 0;
for(const auto &sel : selection){
for(const auto &val : sel){
auto r = unionSet.insert(val);
if(r.second){
counter++;
}
}
}

If they're sorted you can roll your own that's O(N+M) in runtime. Otherwise you can use a hash table with a similar runtime.
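As a minimal sketch of the "roll your own" idea for two inputs: the function below walks both vectors once and only counts, so no union vector is built. The name unionSize is hypothetical, and the code assumes both inputs are sorted and free of duplicates, as stated in the question.

```cpp
#include <cstddef>
#include <vector>

// Counts the size of the union of two sorted, duplicate-free vectors
// without materializing the union. Runs in O(N+M).
std::size_t unionSize(const std::vector<unsigned int>& a,
                      const std::vector<unsigned int>& b) {
    std::size_t i = 0, j = 0, count = 0;
    while (i < a.size() && j < b.size()) {
        if (a[i] < b[j]) {
            ++i;
        } else if (b[j] < a[i]) {
            ++j;
        } else {  // same value in both inputs: count it once
            ++i;
            ++j;
        }
        ++count;
    }
    // Whatever remains in one input cannot occur in the other.
    return count + (a.size() - i) + (b.size() - j);
}
```

With V1 and V3 from the question this yields 6, matching the size given there.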

The de facto way in C++98 is set_intersection, but with C++11 (or TR1) you can go for unordered_set; provided the initial vector is sorted, you will have a nice O(N) algorithm.
Construct an unordered_set out of your first vector
Check if the elements of your 2nd vector are in the set
Something like that will do:
std::unordered_set<int> us(std::begin(v1), std::end(v1));
auto res = std::count_if(std::begin(v2), std::end(v2), [&](int n) {return us.find(n) != std::end(us);});

There's no need to create the entire union vector. You can count the number of unique elements among the selected vectors by keeping a list of iterators and comparing/incrementing them appropriately.
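Here's the idea as code:
#include <cstddef>
#include <vector>
int countUnique(const std::vector<std::vector<unsigned int>>& selection)
{
    std::vector<std::vector<unsigned int>::const_iterator> iters;
    for (const auto& sel : selection) {
        iters.push_back(sel.begin());
    }
    int count = 0;
    while (true) {
        // find the minimum value among the iterators that are not at the end
        bool found = false;
        unsigned int min = 0;
        for (std::size_t i = 0; i < iters.size(); ++i) {
            if (iters[i] != selection[i].end() && (!found || *iters[i] < min)) {
                min = *iters[i];
                found = true;
            }
        }
        if (!found) {
            break; // every iterator has reached the end
        }
        // advance all iterators that currently point at the minimum
        for (std::size_t i = 0; i < iters.size(); ++i) {
            if (iters[i] != selection[i].end() && *iters[i] == min) {
                ++iters[i];
            }
        }
        ++count;
    }
    return count;
}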
This uses the fact that your input vectors are sorted and only contain unique elements.
The idea is to keep an iterator into each selected vector. The minimum value among those iterators is our next unique value in the union vector. Then we increment all iterators whose value is equal to that minimum. We repeat this until all iterators are at the end of the selected vectors.

Related

Is it possible to make a vector of ranges in cpp20

Let's say I have a vector<vector<int>>. I want to use ranges::transform in such a way that I get
vector<vector<int>> original_vectors;
using T = decltype(ranges::views::transform(original_vectors[0], [&](int x){
return x;
}));
vector<int> transformation_coeff;
vector<T> transformed_vectors;
for(int i=0;i<n;i++){
transformed_vectors.push_back(ranges::views::transform(original_vectors[i], [&](int x){
return x * transformation_coeff[i];
}));
}
Is such a transformation, or something similar, currently possible in C++?
I know it's possible to simply store the transformation_coeff, but it's inconvenient to apply it at every step. (This will be repeated multiple times so it needs to be done in O(log n), therefore I can't explicitly apply the transformation).
Yes, you can have a vector of ranges. The problem in your code is that you are using a temporary lambda in your using statement. Because of that, the type of the item you are pushing into the vector later is different from T. You can solve it by assigning the lambda to a variable first:
vector<vector<int>> original_vectors;
auto lambda = [&](int x){return x;};
using T = decltype(ranges::views::transform(original_vectors[0], lambda));
vector<T> transformed_vectors;
transformed_vectors.push_back(ranges::views::transform(original_vectors[0], lambda));
It is not possible in general to store different ranges in a homogeneous collection like std::vector, because different ranges usually have different types, especially if transforms using lambdas are involved. No two lambdas have the same type and the type of the lambda will be part of the range type. If the signatures of the functions you want to pass to the transform are the same, you could wrap the lambdas in std::function as suggested by @IlCapitano (https://godbolt.org/z/zGETzG4xW). Note that this comes at the cost of the additional overhead std::function entails.
A better option might be to create a range of ranges.
If I understand you correctly, you have a vector of n vectors, e.g.
std::vector<std::vector<int>> original_vector = {
{1, 5, 10},
{2, 4, 8},
{5, 10, 15}
};
and a vector of n coefficients, e.g.
std::vector<int> transformation_coeff = {2, 1, 3};
and you want a range of ranges representing the transformed vectors, where the ith range represents the ith vector's elements which have been multiplied by the ith coefficient:
{
{ 2, 10, 20}, // {1, 5, 10} * 2
{ 2, 4, 8}, // {2, 4, 8} * 1
{15, 30, 45} // {5, 10, 15} * 3
}
Did I understand you correctly? If yes, I don't understand what you mean by your complexity requirement of O(log n). What does n refer to in this scenario? How would this calculation be possible in fewer than n steps? Here is a solution that gives you the range of ranges you want. Evaluating this range requires O(n*m) multiplications, where m is an upper bound for the number of elements in each inner vector. I don't think it can be done in fewer steps because you have to multiply each element in original_vector once. Of course, you can always just evaluate part of the range, because the evaluation is lazy.
C++20
The strategy is to first create a range for the transformed i-th vector given the index i. Then you can create a range of ints using std::views::iota and transform it to the inner ranges:
auto transformed_ranges = std::views::iota(0, static_cast<int>(original_vector.size())) | std::views::transform(
[=](int i){
// get a range containing only the ith inner range
auto ith = original_vector | std::views::drop(i) | std::views::take(1) | std::views::join;
// transform the ith inner range
return ith | std::views::transform(
[=](auto const& x){
return x * transformation_coeff[i];
}
);
}
);
You can now do
for (auto const& transformed_range : transformed_ranges){
for (auto const& val : transformed_range){
std::cout << val << " ";
}
std::cout<<"\n";
}
Output:
2 10 20
2 4 8
15 30 45
Full Code on Godbolt Compiler Explorer
C++23
This is the perfect job for C++23's std::views::zip_transform:
auto transformed_ranges = std::views::zip_transform(
[=](auto const& ith, auto const& coeff){
return ith | std::views::transform(
[=](auto const& x){
return x * coeff;
}
);
},
original_vector,
transformation_coeff
);
It's a bit shorter and has the added benefit that transformation_coeff is treated as a range as well:
It is more general, because we are not restricted to std::vectors
In the C++20 solution you get undefined behaviour without additional size checking if transformation_coeff.size() < original_vector.size() because we are indexing into the vector, while the C++23 solution would just return a range with fewer elements.
Full Code on Godbolt Compiler Explorer

c++ Algorithm to Compare various length vectors and isolate "unique", sort of

I have a complex problem and have been trying to identify what needs to be a very, very efficient algorithm. I'm hoping I can get some ideas from you helpful folks. Here is the situation.
I have a vector of vectors. These nested vectors are of various length, all storing integers in a random order, such as (pseudocode):
vector_list = {
{ 1, 4, 2, 3 },
{ 5, 9, 2, 1, 3, 3 },
{ 2, 4, 2 },
...,
100 more,
{ 8, 2, 2, 4 }
}
and so on, up to over 100 different vectors at a time inside vector_list. Note that the same integer can appear in each vector more than once. I need to remove from this vector_list any vectors that are duplicates of another vector. A vector is a duplicate of another vector if:
It has the same integers as the other vector (regardless of order). So if we have
vec1 = { 1, 2, 3 }
vec2 = { 2, 3, 1 }
These are duplicates and I need to remove one of them; it doesn't matter which one.
A vector contains all of the other integers of the other vector. So if we have
vec1 = { 3, 2, 2 }
vec2 = { 4, 2, 3, 2, 5 }
Vec2 has all of the ints of vec1 and is bigger, so I need to delete vec1 in favor of vec2.
The problem is, as I mentioned, that the list of vectors can be very big (over 100), and the algorithm may need to run as many as 1000 times on a button click, each time with a different group of 100+ vectors. Hence the need for efficiency. I have considered the following:
Sorting the vectors may make life easier, but as I said, this has to be efficient, and I'd rather not sort if I didn't have to.
It's more complicated by the fact that the vectors aren't in any order with respect to their size. For example, if the vectors in the list were ordered by size:
vector_list = {
{ },
{ },
{ },
{ },
{ },
...
{ },
{ }
}
It might make life easier, but that seems like it would take a lot of effort and I'm not sure about the gain.
The best effort I've had so far to try and solve this problem is:
// list of vectors, just 4 for illustration, but in reality more like 100, with lengths from 5 to 15 integers long
std::vector<std::vector<int>> vector_list;
vector_list.push_back({9});
vector_list.push_back({3, 4, 2, 8, 1});
vector_list.push_back({4, 2});
vector_list.push_back({1, 3, 2, 4});
std::vector<int>::iterator it;
int i;
int j;
int k;
// to test if a smaller vector is a duplicate of a larger vector, i copy the smaller vector, then
// loop through ints in the larger vector, seeing if i can find them in the copy of the smaller. if i can,
// i remove the item from the smaller copy, and if the size of the smaller copy reaches 0, then the smaller vector
// was a duplicate of the larger vector and can be removed.
std::vector<int> copy;
// flag for breaking a for loop below
bool erased_i;
// loop through vector list
for ( i = 0; i < vector_list.size(); i++ )
{
// loop again, so we can compare every vector to every other vector
for ( j = 0; j < vector_list.size(); j++ )
{
// don't want to compare a vector to itself
if ( i != j )
{
// if the vector in i loop is at least as big as the vector in j loop
if ( vector_list[i].size() >= vector_list[j].size() )
{
// copy the smaller j vector
copy = vector_list[j];
// loop through each item in the larger i vector
for ( k = 0; k < vector_list[i].size(); k++ ) {
// if the item in the larger i vector is in the smaller vector,
// remove it from the smaller vector
it = std::find(copy.begin(), copy.end(), vector_list[i][k]);
if (it != copy.end())
{
// erase
copy.erase(it);
// if the smaller vector has reached size 0, then it must have been a smaller duplicate that
// we can delete
if ( copy.size() == 0 ) {
vector_list.erase(vector_list.begin() + j);
j--;
}
}
}
}
else
{
// otherwise vector j must be bigger than vector i, so we do the same thing
// in reverse, trying to erase vector i
copy = vector_list[i];
erased_i = false;
for ( k = 0; k < vector_list[j].size(); k++ ) {
it = std::find(copy.begin(), copy.end(), vector_list[j][k]);
if (it != copy.end()) {
copy.erase(it);
if ( copy.size() == 0 ) {
vector_list.erase(vector_list.begin() + i);
// put an extra flag so we break out of the j loop as well as the k loop
erased_i = true;
break;
}
}
}
if ( erased_i ) {
// break the j loop because we have to start over with whatever
// vector is now in position i
break;
}
}
}
}
}
std::cout << "ENDING VECTORS\n";
// TERMINAL OUTPUT:
vector_list[0]
[9]
vector_list[1]
[3, 4, 2, 8, 1]
So this function gives me the right results, as these are the 2 unique vectors. It also gives me the correct results if I push the initial 4 vectors in reverse order, so that the smallest one comes last, for example. But it feels so inefficient comparing every vector to every other vector. Plus I have to create these "copies" and try to reduce them to .size() 0 with every comparison I make. Very inefficient.
Anyway, any ideas on how I could make this speedier would be much appreciated. Maybe some kind of organization by vector length, I dunno... It seems wasteful to compare them all to each other.
Thanks!
Loop through the vectors and for each vector, map the count of unique values occurring in it. unordered_map<int, int> would suffice for this, let's call it M.
Also maintain a set<unordered_map<int, int>>, say S, ordered by the size of unordered_map<int, int> in decreasing order.
Now we will have to compare contents of M with the contents of unordered_maps in S. Let's call M', the current unordered_map in S being compared with M. M will be a subset of M' only when the count of all the elements in M is less than or equal to the count of their respective elements in M'. If that's the case then it's a duplicate and we'll not insert. For any other case, we'll insert. Also notice that if the size of M is greater than the size of M', M can't be a subset of M'. That means we can insert M in S. This can be used as a pre-condition to speed things up. Maintain the indices of vectors which weren't inserted in S, these are the duplicates and have to be deleted from vector_list in the end.
Time Complexity: O(N*M) + O(N^2*D) + O(N*log(N)) = O(N^2*D) where N is the number of vectors in vector_list, M is the average size of the vectors in vector_list and D is the average size of unordered_map's in S. This is for the worst case when there aren't any duplicates. For average case, when there are duplicates, the second complexity will come down.
Edit: The above procedure will create a problem. To fix that, we'll need to make unordered_maps of all vectors, store them in a vector V, and sort that vector in decreasing order of the size of unordered_map. Then, we'll start from the biggest in this vector and apply the above procedure on it. This is necessary because, a subset, say M1 of a set M2, can be inserted into S before M2 if the respective vector of M1 comes before the respective vector of M2 in vector_list. So now we don't really need S, we can compare them within V itself. Complexity won't change.
Edit 2: The same problem will occur again if sizes of two unordered_maps are the same in V when sorting V. To fix that, we'll need to keep the contents of unordered_maps in some order too. So just replace unordered_map with map and in the comparator function, if the size of two maps is the same, compare element by element and whenever the keys are not the same for the very first time or are same but the M[key] is not the same, put the bigger element before the other in V.
Edit 3: New Time Complexity: O(N*M*log(D)) + O(N*D*log(N)) + O(N^2*D*log(D)) = O(N^2*D*log(D)). Also you might want to pair the maps with the index of the respective vectors in vector_list so as to know which vector you must delete from vector_list when you find a duplicate in V.
IMPORTANT: In sorted V, we must start checking from the end just to be safe (in case we choose to delete a duplicate from vector_list as well as V whenever we encounter it). So for the last map in V compare it with the rest of the maps before it to check if it is a duplicate.
Example:
vector_list = {
{1, 2, 3},
{2, 3, 1},
{3, 2, 2},
{4, 2, 3, 2, 5},
{1, 2, 3, 4, 6, 2},
{2, 3, 4, 5, 6},
{1, 5}
}
Creating maps of respective vectors:
V = {
{1->1, 2->1, 3->1},
{1->1, 2->1, 3->1},
{2->2, 3->1},
{2->2, 3->1, 4->1, 5->1},
{1->1, 2->2, 3->1, 4->1, 6->1},
{2->1, 3->1, 4->1, 5->1, 6->1},
{1->1, 5->1}
}
After sorting:
V = {
{1->1, 2->2, 3->1, 4->1, 6->1},
{2->1, 3->1, 4->1, 5->1, 6->1},
{2->2, 3->1, 4->1, 5->1},
{1->1, 2->1, 3->1},
{1->1, 2->1, 3->1},
{1->1, 5->1},
{2->2, 3->1}
}
After deleting duplicates:
V = {
{1->1, 2->2, 3->1, 4->1, 6->1},
{2->1, 3->1, 4->1, 5->1, 6->1},
{2->2, 3->1, 4->1, 5->1},
{1->1, 5->1}
}
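The procedure can be sketched roughly as follows. This is a variation under stated assumptions, not the answerer's code: it orders the vectors by their total length (longest first) instead of by map size, which is enough to guarantee that any superset is seen before its subsets, and it uses std::map for the counts. The names Counts, isSubMultiset and removeContained are made up for this illustration.

```cpp
#include <algorithm>
#include <cstddef>
#include <map>
#include <vector>

using Counts = std::map<int, int>;  // value -> number of occurrences

// True when every value in `small` occurs at least as often in `big`.
bool isSubMultiset(const Counts& small, const Counts& big) {
    for (const auto& kv : small) {
        auto it = big.find(kv.first);
        if (it == big.end() || it->second < kv.second) {
            return false;
        }
    }
    return true;
}

// Returns the vectors that are not contained in another (kept) vector.
// The result is ordered longest first, not in input order.
std::vector<std::vector<int>> removeContained(
        const std::vector<std::vector<int>>& input) {
    // Visit the longest vectors first so supersets are kept before
    // their subsets are examined.
    std::vector<std::size_t> order(input.size());
    for (std::size_t i = 0; i < order.size(); ++i) {
        order[i] = i;
    }
    std::stable_sort(order.begin(), order.end(),
                     [&input](std::size_t a, std::size_t b) {
                         return input[a].size() > input[b].size();
                     });

    std::vector<Counts> keptCounts;
    std::vector<std::vector<int>> kept;
    for (std::size_t idx : order) {
        Counts counts;
        for (int value : input[idx]) {
            ++counts[value];
        }
        bool contained = false;
        for (const Counts& other : keptCounts) {
            if (isSubMultiset(counts, other)) {
                contained = true;
                break;
            }
        }
        if (!contained) {
            keptCounts.push_back(counts);
            kept.push_back(input[idx]);
        }
    }
    return kept;
}
```

Run on the seven example vectors above, this keeps the same four vectors that remain in V after deleting duplicates.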
Edit 4: I tried coding it up. Running it 1000 times on a list of 100 vectors, the size of each vector being in range [1-250], the range of the elements of vector being [0-50] and assuming the input is available for all the 1000 times, it takes around 2 minutes on my machine. It goes without saying that there is room for improvement in my code (and my machine).
My approach is to copy the vectors that pass the test to an empty vector.
May be inefficient.
May have bugs.
HTH :)
C++ Fiddle
#include <algorithm>
#include <iostream>
#include <iterator>
#include <vector>
int main(int, char **) {
using namespace std;
using vector_of_integers = vector<int>;
using vector_of_vectors = vector<vector_of_integers>;
vector_of_vectors in = {
{ 1, 4, 2, 3 }, // unique
{ 5, 9, 2, 1, 3, 3 }, // unique
{ 3, 2, 1 }, // exists
{ 2, 4, 2 }, // exists
{ 8, 2, 2, 4 }, // unique
{ 1, 1, 1 }, // exists
{ 1, 2, 2 }, // exists
{ 5, 8, 2 }, // unique
};
vector_of_vectors out;
// doesnt_contain_vector returns true when no entry in out is a superset of the passed vector
auto doesnt_contain_vector = [&out](const vector_of_integers &in_vector) {
// is_subset returns true when out_vector contains all the integers of in_vector
auto is_subset = [&in_vector](const vector_of_integers &out_vector) {
// contained returns true when out_vector contains the passed integer
auto contained = [&out_vector](int i) {
return find(out_vector.cbegin(), out_vector.cend(), i) != out_vector.cend();
};
return all_of(in_vector.cbegin(), in_vector.cend(), contained);
};
return find_if(out.cbegin(), out.cend(), is_subset) == out.cend();
};
copy_if(in.cbegin(), in.cend(), back_insert_iterator<vector_of_vectors>(out), doesnt_contain_vector);
// show results
for (auto &vi: out) {
copy(vi.cbegin(), vi.cend(), std::ostream_iterator<int>(std::cout, ", "));
cout << "\n";
}
}
You could try something like this. I use std::sort and std::includes. Perhaps this is not the most effective solution.
// sort all nested vectors
std::for_each(vlist.begin(), vlist.end(), [](auto& v)
{
std::sort(v.begin(), v.end());
});
// sort vector of vectors by length of items
std::sort(vlist.begin(), vlist.end(), [](const std::vector<int>& a, const std::vector<int>& b)
{
return a.size() < b.size();
});
// exclude all duplicates
auto i = std::begin(vlist);
while (i != std::end(vlist)) {
if (std::any_of(i+1, std::end(vlist), [&](const std::vector<int>& a){
return std::includes(std::begin(a), std::end(a), std::begin(*i), std::end(*i));
}))
i = vlist.erase(i);
else
++i;
}

C++ vectors and pairs

I am implementing a function and here is the RME:
//EFFECTS: returns a summary of the dataset as (value, frequency) pairs
// In the returned vector-of-vectors, the inner vector is a (value,frequency) pair. The outer vector contains many of these pairs. The pairs should be
// sorted by value.
// {
// {1, 2},
// {2, 3},
// {17, 1}
// }
//
// This means that the value 1 occurred twice, the value 2 occurred 3 times,
// and the value 17 occurred once
std::vector<std::vector<double> > summarize(std::vector<double> v);
The above code is the function I am implementing.
How do I approach this?
BY THE WAY, there is a sort function available that I will use to sort the numbers so ignore that part.
I created a new vector for (double value, int frequency) pairs and then used a for loop to put values in it. But when I tried to return it, the compiler said it couldn't convert my vector to the type that the function is supposed to return.
I would suggest to keep your data structure relevant to the data you are trying to represent. You have used the word pair so many times in your question that this is screaming for pairs. You could use a vector of pairs like:
std::vector<std::pair<double,int>> summarize
Or even better, use a map if you have unique values:
std::map<double,int> freqMap
Here's some pseudocode:
sorted_vector = sort(input_vector);
current_value = sorted_vector[0];
count = 0;
for each element in sorted_vector:
if element == current_value:
count = count + 1;
else:
output_vector.push_back(current_value, count);
current_value = element;
count = 1;
// push final pair
output_vector.push_back(current_value, count);
The return type should be a map, mapping each double to its frequency.
Modern:
std::map<double, int>summarize(std::vector<double> v)
{
std::map<double, int> ret;
for (auto& i : v)
++ret[i];
return ret;
}
Pre-C++11:
std::map<double, int>summarize(std::vector<double> v)
{
std::map<double, int> ret;
for (std::vector<double>::iterator it = v.begin();
it != v.end(); ++it)
++ret[*it];
return ret;
}
If you really must return a vector of vectors, take the sensible summarize function above and write a wrapper that bastardizes the return type to the vector of vectors by traversing the map.
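A wrapper along those lines might look like the sketch below. The helper name countFrequencies is hypothetical; summarize keeps the signature from the question, and the frequency is converted to double only because the required return type stores doubles.

```cpp
#include <map>
#include <vector>

// Counts how often each value occurs; std::map keeps the keys sorted.
std::map<double, int> countFrequencies(const std::vector<double>& v) {
    std::map<double, int> freq;
    for (double x : v) {
        ++freq[x];
    }
    return freq;
}

// Converts the map into the required vector of {value, frequency} rows.
std::vector<std::vector<double> > summarize(std::vector<double> v) {
    const std::map<double, int> freq = countFrequencies(v);
    std::vector<std::vector<double> > result;
    result.reserve(freq.size());
    for (const auto& kv : freq) {
        result.push_back({kv.first, static_cast<double>(kv.second)});
    }
    return result;
}
```

Because the map iterates in ascending key order, the rows come out sorted by value without a separate sort.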

Sort multidimensional array and keep index C++

Is it possible to sort a multidimensional array (row by row) using sort in C++ such that I can keep the index?
For example,
13, 14, 5, 16
0, 4, 3, 2
7, 3, 7, 6
9, 1, 11, 12
Becomes:
{ 5, 13, 14, 16 }
{ 0, 2, 3, 4 }
{ 3, 6, 7, 7 }
{ 1, 9, 11, 12 }
And the array with the index would be:
{ 2, 0, 1, 3 }
{ 0, 3, 2, 1 }
{ 1, 3, 0, 2 }
{ 1, 0, 2, 3 }
First create the array of integer indices; here it is for a 1D array (assuming arr is a std::vector<int>):
std::vector<int> ind(arr.size());
for (std::size_t i = 0; i < arr.size(); ++i)
    ind[i] = static_cast<int>(i);
Then create the comparison object. Here is a rough version in C++98 style; in C++11 you can shorten it with a lambda:
struct compare
{
    const std::vector<int>& arr; // the values being compared
    explicit compare(const std::vector<int>& a) : arr(a) {}
    bool operator()( int left, int right ) const {
        return arr[left] < arr[right];
    }
};
Then sort the index array using that functor:
std::sort( ind.begin(), ind.end(), compare(arr) );
Finally, use the sorted index array to order the values array.
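A possible C++11 version of these steps for a single row, using a lambda instead of the functor. The function names sortedIndices and applyIndices are invented for this sketch, and std::stable_sort is my choice so that equal values keep their original relative order.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <vector>

// Returns the original positions of `row` in ascending order of value.
std::vector<std::size_t> sortedIndices(const std::vector<int>& row) {
    std::vector<std::size_t> ind(row.size());
    std::iota(ind.begin(), ind.end(), std::size_t{0});  // 0, 1, 2, ...
    std::stable_sort(ind.begin(), ind.end(),
                     [&row](std::size_t left, std::size_t right) {
                         return row[left] < row[right];
                     });
    return ind;
}

// Uses the sorted index array to produce the values in sorted order.
std::vector<int> applyIndices(const std::vector<int>& row,
                              const std::vector<std::size_t>& ind) {
    std::vector<int> sorted;
    sorted.reserve(row.size());
    for (std::size_t i : ind) {
        sorted.push_back(row[i]);
    }
    return sorted;
}
```

For a 2D array, calling sortedIndices on each row gives the index matrix, and applyIndices gives the corresponding sorted row.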
Yes. To sort row by row, you have to set the appropriate starting and ending point in the sort function. To keep the index part, you can first create pairs of the array elements and their indices using make_pair. After sorting, you can reconstruct the index array.
You will need to do something like this (I haven't tried it out though):
for (i = 0; i < matrix.size(); i++)
{
sort(matrix[i].begin(), matrix[i].end());
}
Remember to add the index as the second element in the pair, because the default comparison operator for pairs checks the first element, followed by the second element.
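The pair-based idea could be filled in roughly like this. It is only a sketch: the function name sortRowsKeepIndex is hypothetical, and the matrix is assumed to be a std::vector<std::vector<int>>.

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

// Sorts every row of `matrix` in place and returns, for each row,
// the original column index of each element in its sorted position.
std::vector<std::vector<int>> sortRowsKeepIndex(
        std::vector<std::vector<int>>& matrix) {
    std::vector<std::vector<int>> indices;
    indices.reserve(matrix.size());
    for (auto& row : matrix) {
        // Pair each value with its column: value first, index second.
        std::vector<std::pair<int, int>> tagged;
        tagged.reserve(row.size());
        for (std::size_t c = 0; c < row.size(); ++c) {
            tagged.push_back(std::make_pair(row[c], static_cast<int>(c)));
        }
        // Pairs compare by value, then by index for equal values.
        std::sort(tagged.begin(), tagged.end());
        std::vector<int> idx;
        idx.reserve(row.size());
        for (std::size_t c = 0; c < row.size(); ++c) {
            row[c] = tagged[c].first;
            idx.push_back(tagged[c].second);
        }
        indices.push_back(idx);
    }
    return indices;
}
```

On the 4x4 example from the question, this produces the sorted rows and the index array shown there.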

Pick a unique random subset from a set of unique values

C++. Visual Studio 2010.
I have a std::vector V of N unique elements (heavy structs). How can I efficiently pick M random, unique elements from it?
E.g. V contains 10 elements: { 0, 1, 2, 3, 4, 5, 6, 7, 8, 9 } and I pick three...
4, 0, 9
0, 7, 8
But NOT this: 0, 5, 5 <--- not unique!
STL is preferred. So, something like this?
std::minstd_rand gen; // linear congruential engine??
std::uniform_int<int> unif(0, v.size() - 1);
gen.seed((unsigned int)time(NULL));
// ...?
// Or is there a good solution using std::random_shuffle for heavy objects?
Create a random permutation of the range 0, 1, ..., N - 1 and pick the first M of them; use those as indices into your original vector.
A random permutation is easily made with the standard library by using std::iota together with std::random_shuffle:
std::vector<Heavy> V; // given
std::vector<unsigned int> indices(V.size());
std::iota(indices.begin(), indices.end(), 0);
std::random_shuffle(indices.begin(), indices.end());
// use V[indices[0]], V[indices[1]], ..., V[indices[M-1]]
You can supply random_shuffle with a random number generator of your choice; check the documentation for details.
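Note that std::random_shuffle was deprecated in C++14 and removed in C++17, so on a newer compiler the same idea would use std::shuffle with an explicit engine. A minimal sketch, where the function name pickIndices and the choice of std::mt19937 are assumptions of this example:

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <random>
#include <vector>

// Returns m distinct indices from [0, n) in random order.
// If m > n, all n indices are returned.
std::vector<std::size_t> pickIndices(std::size_t n, std::size_t m,
                                     std::mt19937& gen) {
    std::vector<std::size_t> indices(n);
    std::iota(indices.begin(), indices.end(), std::size_t{0});
    std::shuffle(indices.begin(), indices.end(), gen);
    indices.resize(std::min(m, n));  // keep the first m of the permutation
    return indices;
}
```

The heavy elements are never moved; only the index vector is shuffled, and the picks are then V[indices[0]], V[indices[1]], and so on.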
Most of the time, the method provided by Kerrek is sufficient. But if N is very large, and M is orders of magnitude smaller, the following method may be preferred.
Create a set of unsigned integers, and add random numbers to it in the range [0,N-1] until the size of the set is M. Then use the elements at those indexes.
std::set<unsigned int> indices;
while (indices.size() < M)
indices.insert(RandInt(0,N-1));
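RandInt above is a placeholder. One way to fill it in with the C++11 <random> facilities is sketched here; the function name pickIndexSet is hypothetical, and the clamp on m is added so the loop cannot run forever when more indices are requested than exist.

```cpp
#include <cstddef>
#include <random>
#include <set>

// Draws random indices from [0, n) until m distinct ones are collected.
// Intended for the case where m is much smaller than n.
std::set<std::size_t> pickIndexSet(std::size_t n, std::size_t m,
                                   std::mt19937& gen) {
    std::set<std::size_t> indices;
    if (n == 0) {
        return indices;
    }
    if (m > n) {
        m = n;  // cannot collect more distinct indices than exist
    }
    std::uniform_int_distribution<std::size_t> dist(0, n - 1);
    while (indices.size() < m) {
        indices.insert(dist(gen));  // duplicates are ignored by the set
    }
    return indices;
}
```

Since std::set iterates in ascending order, the picked indices come out sorted rather than in the order they were drawn.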
Since you wanted it to be efficient, I think you can get an amortised O(M), assuming you have to perform that operation a lot of times. However, this approach is not reentrant.
First of all create a local (i.e. static) vector of std::vector<...>::size_type (i.e. unsigned will do) values.
If you enter your function, resize the vector to match N and fill it with values from the old size to N-1:
static std::vector<unsigned> indices;
if (indices.size() < N) {
indices.reserve(N);
for (unsigned i = indices.size(); i < N; i++) {
indices.push_back(i);
}
}
Then, randomly pick M unique numbers from that vector:
std::vector<unsigned> result;
result.reserve(M);
for (unsigned i = 0; i < M; i++) {
unsigned const r = getRandomNumber(0,N-i); // random number < N-i
result.push_back(indices[r]);
indices[r] = indices[N-i-1];
indices[N-i-1] = r;
}
Now, your result is sitting in the result vector.
However, you still have to repair your changes to indices for the next run, so that indices is monotonic again:
for (unsigned i = N-M; i < N; i++) {
// restore previously changed values
indices[indices[i]] = indices[i];
indices[i] = i;
}
But this approach is only useful if you have to run that algorithm a lot and N doesn't grow so big that you cannot live with indices eating up RAM all the time.