C++ vectors and pairs - c++

I am implementing a function and here is the RME:
//EFFECTS: returns a summary of the dataset as (value, frequency) pairs
// In the returned vector-of-vectors, the inner vector is a (value,frequency) pair. The outer vector contains many of these pairs. The pairs should be
// sorted by value.
// {
// {1, 2},
// {2, 3},
// {17, 1}
// }
//
// This means that the value 1 occurred twice, the value 2 occurred 3 times,
// and the value 17 occurred once
std::vector<std::vector<double> > summarize(std::vector<double> v);
The above code is the function I am implementing.
How do I approach this?
BY THE WAY, there is a sort function available that I will use to sort the numbers so ignore that part.
I created a new vector of (value, frequency) pairs, i.e. (double, int), and filled it in a for loop. But when I tried to return it, the compiler said it couldn't convert my vector to the type that the function is supposed to return.

I would suggest to keep your data structure relevant to the data you are trying to represent. You have used the word pair so many times in your question that this is screaming for pairs. You could use a vector of pairs like:
std::vector<std::pair<double,int>> summarize
Or even better, use a map if you have unique values:
std::map<double,int> freqMap

Here's some pseudocode:
sorted_vector = sort(input_vector);   // assumes input_vector is not empty
current_value = sorted_vector[0];
count = 0;
for each element in sorted_vector:
    if element == current_value:
        count = count + 1;
    else:
        output_vector.push_back(pair(current_value, count));
        current_value = element;
        count = 1;
// push final pair
output_vector.push_back(pair(current_value, count));
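The pseudocode above could be turned into C++ roughly as follows. This is only a sketch: the function name summarizePairs is made up for illustration, it returns a vector of pairs rather than the vector of vectors from the question, and it uses std::sort in place of the sort function the asker already has.

```cpp
#include <algorithm>
#include <utility>
#include <vector>

// Returns (value, frequency) pairs sorted by value.
// v is taken by value so the caller's vector is not reordered by the sort.
std::vector<std::pair<double, int>> summarizePairs(std::vector<double> v) {
    std::vector<std::pair<double, int>> out;
    if (v.empty()) return out;          // nothing to count

    std::sort(v.begin(), v.end());      // equal values become adjacent
    double current = v[0];
    int count = 0;
    for (double x : v) {
        if (x == current) {
            ++count;
        } else {
            out.emplace_back(current, count);  // close the finished run
            current = x;
            count = 1;
        }
    }
    out.emplace_back(current, count);   // push the final run
    return out;
}
```

The early return for an empty input avoids reading v[0] when there is nothing there, which the pseudocode leaves implicit.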

The return type should be a map, mapping each double to its frequency.
Modern:
std::map<double, int>summarize(std::vector<double> v)
{
std::map<double, int> ret;
for (auto& i : v)
++ret[i];
return ret;
}
Pre-C++11:
std::map<double, int>summarize(std::vector<double> v)
{
std::map<double, int> ret;
for (std::vector<double>::iterator it = v.begin();
it != v.end(); ++it)
++ret[*it];
return ret;
}
If you really must return a vector of vectors, take the sensible summarize function above and write a wrapper that bastardizes the return type to the vector of vectors by traversing the map.
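A wrapper of that kind might look like the sketch below. The helper name countValues is an assumption introduced here so that the wrapper can keep the summarize name and signature from the question; the frequency is converted to double because the inner vectors can only hold doubles.

```cpp
#include <map>
#include <vector>

// Counts occurrences; std::map keeps its keys in ascending order.
std::map<double, int> countValues(const std::vector<double>& v) {
    std::map<double, int> freq;
    for (double x : v) ++freq[x];
    return freq;
}

// Wrapper with the signature required by the assignment:
// each inner vector is {value, frequency}.
std::vector<std::vector<double>> summarize(std::vector<double> v) {
    std::vector<std::vector<double>> out;
    for (const auto& entry : countValues(v)) {
        out.push_back({entry.first, static_cast<double>(entry.second)});
    }
    return out;
}
```

Because the map is traversed in key order, the result is already sorted by value and no separate sort is needed.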

Related

c++ Algorithm to Compare various length vectors and isolate "unique", sort of

I have a complex problem and have been trying to work out what needs to be a very, very efficient algorithm. I'm hoping I can get some ideas from you helpful folks. Here is the situation.
I have a vector of vectors. These nested vectors are of various length, all storing integers in a random order, such as (pseudocode):
vector_list = {
{ 1, 4, 2, 3 },
{ 5, 9, 2, 1, 3, 3 },
{ 2, 4, 2 },
...,
100 more,
{ 8, 2, 2, 4 }
}
and so on, up to over 100 different vectors at a time inside vector_list. Note that the same integer can appear in each vector more than once. I need to remove from vector_list any vector that is a duplicate of another vector. A vector is a duplicate of another vector if:
1. It has the same integers as the other vector (regardless of order). So if we have
vec1 = { 1, 2, 3 }
vec2 = { 2, 3, 1 }
these are duplicates and I need to remove one of them; it doesn't matter which one.
2. The other vector contains all of its integers. So if we have
vec1 = { 3, 2, 2 }
vec2 = { 4, 2, 3, 2, 5 }
vec2 has all of the ints of vec1 and is bigger, so I need to delete vec1 in favor of vec2.
The problem is that, as I mentioned, the list of vectors can be very big (over 100), and the algorithm may need to run as many as 1000 times on a button click, each time with a different group of 100+ vectors. Hence the need for efficiency. I have considered the following:
Sorting the vectors may make life easier, but as I said, this has to be efficient, and I'd rather not sort if I didn't have to.
It's more complicated by the fact that the vectors aren't in any order with respect to their size. For example, if the vectors in the list were ordered by size:
vector_list = {
{ },
{ },
{ },
{ },
{ },
...
{ },
{ }
}
It might make life easier, but that seems like it would take a lot of effort and I'm not sure about the gain.
The best effort I've had so far to try and solve this problem is:
// list of vectors, just 4 for illustration, but in reality more like 100, with lengths from 5 to 15 integers long
std::vector<std::vector<int>> vector_list;
vector_list.push_back({9});
vector_list.push_back({3, 4, 2, 8, 1});
vector_list.push_back({4, 2});
vector_list.push_back({1, 3, 2, 4});
std::vector<int>::iterator it;
int i;
int j;
int k;
// to test if a smaller vector is a duplicate of a larger vector, i copy the smaller vector, then
// loop through ints in the larger vector, seeing if i can find them in the copy of the smaller. if i can,
// i remove the item from the smaller copy, and if the size of the smaller copy reaches 0, then the smaller vector
// was a duplicate of the larger vector and can be removed.
std::vector<int> copy;
// flag for breaking a for loop below
bool erased_i;
// loop through vector list
for ( i = 0; i < vector_list.size(); i++ )
{
// loop again, so we can compare every vector to every other vector
for ( j = 0; j < vector_list.size(); j++ )
{
// don't want to compare a vector to itself
if ( i != j )
{
// if the vector in i loop is at least as big as the vector in j loop
if ( vector_list[i].size() >= vector_list[j].size() )
{
// copy the smaller j vector
copy = vector_list[j];
// loop through each item in the larger i vector
for ( k = 0; k < vector_list[i].size(); k++ ) {
// if the item in the larger i vector is in the smaller vector,
// remove it from the smaller vector
it = std::find(copy.begin(), copy.end(), vector_list[i][k]);
if (it != copy.end())
{
// erase
copy.erase(it);
// if the smaller vector has reached size 0, then it must have been a smaller duplicate that
// we can delete
if ( copy.size() == 0 ) {
vector_list.erase(vector_list.begin() + j);
j--;
}
}
}
}
else
{
// otherwise vector j must be bigger than vector i, so we do the same thing
// in reverse, trying to erase vector i
copy = vector_list[i];
erased_i = false;
for ( k = 0; k < vector_list[j].size(); k++ ) {
it = std::find(copy.begin(), copy.end(), vector_list[j][k]);
if (it != copy.end()) {
copy.erase(it);
if ( copy.size() == 0 ) {
vector_list.erase(vector_list.begin() + i);
// put an extra flag so we break out of the j loop as well as the k loop
erased_i = true;
break;
}
}
}
if ( erased_i ) {
// break the j loop because we have to start over with whatever
// vector is now in position i
break;
}
}
}
}
}
std::cout << "ENDING VECTORS\n";
// TERMINAL OUTPUT:
vector_list[0]
[9]
vector_list[1]
[3, 4, 2, 8, 1]
So this function gives me the right results, as these are the 2 unique vectors. It also gives me the correct results if I push the initial 4 vectors in reverse order, so that the smallest one comes last, for example. But it feels so inefficient comparing every vector to every other vector. Plus I have to create these "copies" and try to reduce them to size 0 with every comparison I make. Very inefficient.
Anyway, any ideas on how I could make this speedier would be much appreciated. Maybe some kind of organization by vector length, I dunno... It seems wasteful to compare them all to each other.
Thanks!
Loop through the vectors and for each vector, map the count of unique values occurring in it. unordered_map<int, int> would suffice for this, let's call it M.
Also maintain a set<unordered_map<int, int>>, say S, ordered by the size of unordered_map<int, int> in decreasing order.
Now we will have to compare contents of M with the contents of unordered_maps in S. Let's call M', the current unordered_map in S being compared with M. M will be a subset of M' only when the count of all the elements in M is less than or equal to the count of their respective elements in M'. If that's the case then it's a duplicate and we'll not insert. For any other case, we'll insert. Also notice that if the size of M is greater than the size of M', M can't be a subset of M'. That means we can insert M in S. This can be used as a pre-condition to speed things up. Maintain the indices of vectors which weren't inserted in S, these are the duplicates and have to be deleted from vector_list in the end.
Time Complexity: O(N*M) + O(N^2*D) + O(N*log(N)) = O(N^2*D) where N is the number of vectors in vector_list, M is the average size of the vectors in vector_list and D is the average size of unordered_map's in S. This is for the worst case when there aren't any duplicates. For average case, when there are duplicates, the second complexity will come down.
Edit: The above procedure will create a problem. To fix that, we'll need to make unordered_maps of all vectors, store them in a vector V, and sort that vector in decreasing order of the size of unordered_map. Then, we'll start from the biggest in this vector and apply the above procedure on it. This is necessary because, a subset, say M1 of a set M2, can be inserted into S before M2 if the respective vector of M1 comes before the respective vector of M2 in vector_list. So now we don't really need S, we can compare them within V itself. Complexity won't change.
Edit 2: The same problem will occur again if sizes of two unordered_maps are the same in V when sorting V. To fix that, we'll need to keep the contents of unordered_maps in some order too. So just replace unordered_map with map and in the comparator function, if the size of two maps is the same, compare element by element and whenever the keys are not the same for the very first time or are same but the M[key] is not the same, put the bigger element before the other in V.
Edit 3: New Time Complexity: O(N*M*log(D)) + O(N*D*log(N)) + O(N^2*D*log(D)) = O(N^2*D*log(D)). Also you might want to pair the maps with the index of the respective vectors in vector_list so as to know which vector you must delete from vector_list when you find a duplicate in V.
IMPORTANT: In sorted V, we must start checking from the end just to be safe (in case we choose to delete a duplicate from vector_list as well as V whenever we encounter it). So for the last map in V compare it with the rest of the maps before it to check if it is a duplicate.
Example:
vector_list = {
{1, 2, 3},
{2, 3, 1},
{3, 2, 2},
{4, 2, 3, 2, 5},
{1, 2, 3, 4, 6, 2},
{2, 3, 4, 5, 6},
{1, 5}
}
Creating maps of respective vectors:
V = {
{1->1, 2->1, 3->1},
{1->1, 2->1, 3->1},
{2->2, 3->1},
{2->2, 3->1, 4->1, 5->1},
{1->1, 2->2, 3->1, 4->1, 6->1},
{2->1, 3->1, 4->1, 5->1, 6->1},
{1->1, 5->1}
}
After sorting:
V = {
{1->1, 2->2, 3->1, 4->1, 6->1},
{2->1, 3->1, 4->1, 5->1, 6->1},
{2->2, 3->1, 4->1, 5->1},
{1->1, 2->1, 3->1},
{1->1, 2->1, 3->1},
{1->1, 5->1},
{2->2, 3->1}
}
After deleting duplicates:
V = {
{1->1, 2->2, 3->1, 4->1, 6->1},
{2->1, 3->1, 4->1, 5->1, 6->1},
{2->2, 3->1, 4->1, 5->1},
{1->1, 5->1}
}
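The core comparison in this approach, checking whether one count map is covered by another, could be sketched like this. The names Counts, countOf and isCoveredBy are made up for illustration and do not come from the answer's own code; the sorting and deletion steps are left out.

```cpp
#include <map>
#include <vector>

using Counts = std::map<int, int>;

// Builds value -> number of occurrences for one vector.
Counts countOf(const std::vector<int>& v) {
    Counts c;
    for (int x : v) ++c[x];
    return c;
}

// True when every value in `small` occurs at least as often in `big`.
bool isCoveredBy(const Counts& small, const Counts& big) {
    // Cheap pre-check: more distinct values means it cannot be a subset.
    if (small.size() > big.size()) return false;
    for (const auto& kv : small) {
        auto it = big.find(kv.first);
        if (it == big.end() || it->second < kv.second) return false;
    }
    return true;
}
```

With the example data, isCoveredBy(countOf({3, 2, 2}), countOf({4, 2, 3, 2, 5})) is true, which is why {2->2, 3->1} is dropped, while {1->1, 5->1} is not covered by any other map and survives.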
Edit 4: I tried coding it up. Running it a 1000 times on a list of 100 vectors, the size of each vector being in range [1-250], the range of the elements of vector being [0-50] and assuming the input is available for all the 1000 times, it takes around 2 minutes on my machine. It goes without saying that there is room for improvement in my code (and my machine).
My approach is to copy the vectors that pass the test to an empty vector.
May be inefficient.
May have bugs.
HTH :)
#include <algorithm>
#include <iostream>
#include <iterator>
#include <vector>
int main(int, char **) {
using namespace std;
using vector_of_integers = vector<int>;
using vector_of_vectors = vector<vector_of_integers>;
vector_of_vectors in = {
{ 1, 4, 2, 3 }, // unique
{ 5, 9, 2, 1, 3, 3 }, // unique
{ 3, 2, 1 }, // exists
{ 2, 4, 2 }, // exists
{ 8, 2, 2, 4 }, // unique
{ 1, 1, 1 }, // exists
{ 1, 2, 2 }, // exists
{ 5, 8, 2 }, // unique
};
vector_of_vectors out;
// doesnt_contain_vector returns true when no vector already in out is a superset of the passed vector
auto doesnt_contain_vector = [&out](const vector_of_integers &in_vector) {
// is_subset returns true when the passed vector contains all the integers of in_vector
auto is_subset = [&in_vector](const vector_of_integers &out_vector) {
// contained returns true when the vector contains the passed integer
auto contained = [&out_vector](int i) {
return find(out_vector.cbegin(), out_vector.cend(), i) != out_vector.cend();
};
return all_of(in_vector.cbegin(), in_vector.cend(), contained);
};
return find_if(out.cbegin(), out.cend(), is_subset) == out.cend();
};
copy_if(in.cbegin(), in.cend(), back_insert_iterator<vector_of_vectors>(out), doesnt_contain_vector);
// show results
for (auto &vi: out) {
copy(vi.cbegin(), vi.cend(), std::ostream_iterator<int>(std::cout, ", "));
cout << "\n";
}
}
You could try something like this. I use std::sort and std::includes. Perhaps this is not the most efficient solution.
// sort all nested vectors
std::for_each(vlist.begin(), vlist.end(), [](auto& v)
{
    std::sort(v.begin(), v.end());
});
// sort vector of vectors by length of items
std::sort(vlist.begin(), vlist.end(), [](const std::vector<int>& a, const std::vector<int>& b)
{
    return a.size() < b.size();
});
// exclude all duplicates: erase *i if any later (equal or longer) vector includes it
auto i = std::begin(vlist);
while (i != std::end(vlist)) {
    if (std::any_of(i+1, std::end(vlist), [&](const std::vector<int>& a){
        return std::includes(std::begin(a), std::end(a), std::begin(*i), std::end(*i));
    }))
        i = vlist.erase(i);
    else
        ++i;
}

How can I insert multiple numbers to a particular element of a vector?

I am quite new to C++ and vector. I am calculating two things say 'i' and 'x' and I want to add 'x' that belongs to a particular vector element 'i'. I learned that if I have one 'x' value, I can simply do that by 'vec.at(i) = x'. But what if I want to add several 'x' values to a particular 'i' index of a vector?
Let's try to make it clear: Let's say I am searching for number '5' and '3' over a list of numbers from 1 to 10 (5 and 3 can occur multiple times in the list) and each time I am looking for number 5 or 3 that belong to index '2' of 'vec' I can do 'vec.at(2) = 5' or 'vec.at(2) = 3'. Then what if I have two '5' values and two '3' values so the sum of the index '2' of 'vec' will be '5+5+3+3' = 16?
P.S: using a counter and multiply concept will not solve my problem as the real problem is quite complicated. This query is just an example only. I want a solution within vector concept. I appreciate your help in advance.
If you know how many indices you want ahead of time, then try std::vector<std::vector<int>> (or instead of int use double or whatever).
For instance, if you want a collection of numbers corresponding to each number from 0 to 9, try
//This creates the vector of vectors,
//of length 10 (i.e. indices [0,9])
//with an empty vector for each element.
std::vector<std::vector<int>> vec(10, std::vector<int>());
To insert an element at a given index (assuming that there is something there, so in the above case there is only 'something there' for elements 0 through 9), try
vec.at(1).push_back(5);
vec.at(1).push_back(3);
And then to take the sum of the numbers in the vector at index 1:
int sum = 0;
for (int elem : vec.at(1)) { sum += elem; }
//sum should now be 8
If you want it to work for arbitrary indices, then it should be
std::map<int, std::vector<int>> map;
map[1].push_back(5); //creates an empty vector at index 1, then inserts
map[1].push_back(3); //uses the existing vector at index 1
int sum = 0;
for (int elem : map.at(1)) { sum += elem; }
Note that [] does very different things for std::vector and std::map. Most of the time you want at, which behaves about the same for both, but in this very specific case, [] for std::map is a good choice.
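To make the difference concrete, here is a small sketch. The alias Buckets and the helpers atThrows and append are hypothetical names used only for this illustration.

```cpp
#include <cstddef>
#include <map>
#include <stdexcept>
#include <vector>

using Buckets = std::map<int, std::vector<int>>;

// at() never creates an entry: it throws std::out_of_range for a missing key.
bool atThrows(const Buckets& m, int key) {
    try {
        (void)m.at(key);
        return false;
    } catch (const std::out_of_range&) {
        return true;
    }
}

// operator[] default-constructs an empty vector for a missing key,
// so push_back through it works whether or not the key existed.
void append(Buckets& m, int key, int value) {
    m[key].push_back(value);
}
```

For std::vector, at(i) with an out-of-range i also throws std::out_of_range, whereas vec[i] with an out-of-range i is undefined behavior and never grows the vector.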
EDIT: To sum over every element in every vector in the map, you need an outer loop to go through the vectors in the map (paired with their index) and an inner loop like the one above. For example:
int sum = 0;
for (const std::pair<int, std::vector<int>>& index_vec : map) {
for (int elem : index_vec.second) { sum += elem; }
}

How to make lower bound binary search if we have vector of pairs

I'm trying to use the lower_bound function in my C++ program. It works fine with a plain vector, but it fails when I have to search over a vector of pairs.
I have a vector of pairs and I want to search on the first member of the pair; if multiple pairs have the same first value, I want the one with the smallest second value. For example:
Let's say we have the following vector of pairs
v = {(1,1),(2,1),(2,2),(2,3),(3,4),(5,6)};
Let's say we are searching for value K = 2, now I want to return the position 1 (if the array is 0-indexed) because the second value of the pair is 1 and 1 is smallest.
How can I implement this in easiest way, I tried implementing this but it fails in compiling, here is my code:
vector<pair<int,int> >a,b;
void solve() {
sort(b.begin(), b.end());
sort(a.begin(), a.end());
vector<int>::iterator it;
for(int i=0;i<a.size();i++) {
ll zero=0;
int to_search=max(zero, k-a[i].first);
it=lower_bound(b.begin(), b.end(), to_search);
int position=it-b.begin();
if(position==b.size()) continue;
answer=min(answer, a[i].second+b[position].second);
}
}
In other words I'm searching for the first value, but if there are more of that value return the one with smallest second element.
Thanks in advance.
operator< works on std::pair (it compares first, then second), so you can use std::lower_bound directly:
std::lower_bound(v.begin(), v.end(), std::make_pair(2, std::numeric_limits<int>::min()));
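Wrapped in a function, that call might be used as in the sketch below. The function name firstIndexAtLeast is an assumption made for this example, and the vector must already be sorted with the default pair ordering.

```cpp
#include <algorithm>
#include <cstddef>
#include <limits>
#include <utility>
#include <vector>

// Index of the first pair whose .first is >= key, or v.size() if there is none.
// Precondition: v is sorted with the default operator< on pairs.
std::size_t firstIndexAtLeast(const std::vector<std::pair<int, int>>& v, int key) {
    // The smallest possible .second makes the probe sort before every
    // real pair that has the same .first.
    auto probe = std::make_pair(key, std::numeric_limits<int>::min());
    auto it = std::lower_bound(v.begin(), v.end(), probe);
    return static_cast<std::size_t>(it - v.begin());
}
```

For the vector from the question and key 2 this yields position 1. Note that lower_bound returns the first element not less than the probe, so the caller still has to check that the pair found really has .first equal to the key.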

Fast union building of multiple vectors in C++

I’m searching for a fast way to build a union of multiple vectors in C++.
More specifically: I have a collection of vectors (usually 15-20 vectors with several thousand unsigned integers; always sorted and unique so they could also be an std::set). For each stage, I choose some (usually 5-10) of them and build a union vector. Then I save the length of the union vector and choose some other vectors. This is done several thousand times. In the end I'm only interested in the length of the shortest union vector.
Small example:
V1: {0, 4, 19, 40}
V2: {2, 4, 8, 9, 19}
V3: {0, 1, 2, 4, 40}
V4: {9, 10}
// The Input Vectors V1, V2 … are always sorted and unique (could also be an std::set)
Choose V1 , V3;
Union Vector = {0, 1, 2, 4, 19, 40} -> Size = 6;
Choose V1, V4;
Union Vector = {0,4, 9, 10, 19 ,40} -> Size = 6;
… and so on …
At the moment I use std::set_union but I’m sure there must be a faster way.
vector<vector<uint64_t>> collection;
vector<uint64_t> chosen;  // indices into collection
vector<uint64_t> unionVector, unionVectorTmp;
for (unsigned int i = 0; i < chosen.size(); i++) {
    set_union(collection.at(chosen.at(i)).begin(),
              collection.at(chosen.at(i)).end(),
              unionVector.begin(),
              unionVector.end(),
              back_inserter(unionVectorTmp));
    unionVector.swap(unionVectorTmp);
    unionVectorTmp.clear();
}
I'm grateful for every reference.
EDIT 27.04.2017
A new Idea:
unordered_set<unsigned int> unionSet;
unsigned int counter = 0;
for(const auto &sel : selection){
for(const auto &val : sel){
auto r = unionSet.insert(val);
if(r.second){
counter++;
}
}
}
If they're sorted, you can roll your own merge that's O(N+M) in runtime. Otherwise you can use a hash table with a similar runtime.
The de facto way in C++98 is set_intersection, but with c++11 (or TR1) you can go for unordered_set, provided the initial vector is sorted, you will have a nice O(N) algorithm.
Construct an unordered_set out of your first vector
Check if the elements of your 2nd vector are in the set
Something like that will do:
std::unordered_set<int> us(std::begin(v1), std::end(v1));
auto res = std::count_if(std::begin(v2), std::end(v2), [&](int n) {return us.find(n) != std::end(us);});
There's no need to create the entire union vector. You can count the number of unique elements among the selected vectors by keeping a list of iterators and comparing/incrementing them appropriately.
Here's the pseudo-code:
int countUnique(const std::vector<std::vector<unsigned int>>& selection)
{
std::vector<std::vector<unsigned int>::const_iterator> iters;
for (const auto& sel : selection) {
iters.push_back(sel.begin());
}
auto atEnd = [&]() -> bool {
// check if all iterators equal end
};
int count = 0;
while (!atEnd()) {
const int min = 0; // find minimum value among iterators
for (size_t i = 0; i < iters.size(); ++i) {
if (iters[i] != selection[i].end() && *iters[i] == min) {
++iters[i];
}
}
++count;
}
return count;
}
This uses the fact that your input vectors are sorted and only contain unique elements.
The idea is to keep an iterator into each selected vector. The minimum value among those iterators is our next unique value in the union vector. Then we increment all iterators whose value is equal to that minimum. We repeat this until all iterators are at the end of the selected vectors.
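One way to fill in the two placeholders in the pseudo-code is sketched below. It folds the "all iterators at end" check into the search for the minimum, which is a choice made here rather than part of the original answer, and it returns std::size_t instead of int.

```cpp
#include <cstddef>
#include <vector>

// Size of the union of several sorted, duplicate-free vectors,
// computed without building the union itself.
std::size_t countUnique(const std::vector<std::vector<unsigned int>>& selection) {
    std::vector<std::vector<unsigned int>::const_iterator> iters;
    iters.reserve(selection.size());
    for (const auto& sel : selection) iters.push_back(sel.begin());

    std::size_t count = 0;
    while (true) {
        // Find the smallest value any unfinished iterator points at.
        bool found = false;
        unsigned int smallest = 0;
        for (std::size_t i = 0; i < iters.size(); ++i) {
            if (iters[i] == selection[i].end()) continue;
            if (!found || *iters[i] < smallest) {
                smallest = *iters[i];
                found = true;
            }
        }
        if (!found) break;  // every iterator has reached its end

        // Step past that value in every vector that currently points at it.
        for (std::size_t i = 0; i < iters.size(); ++i) {
            if (iters[i] != selection[i].end() && *iters[i] == smallest) ++iters[i];
        }
        ++count;
    }
    return count;
}
```

For V1 and V3 from the question this counts 6 values, matching the union vector shown there. Each round scans all selected vectors, so with k vectors and a union of size U the cost is about O(k * U).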

Quickest way to compute the number of shared elements between two vectors

Suppose I have two vectors of the same size vector< pair<float, NodeDataID> > v1, v2; I want to compute how many elements from both v1 and v2 have the same NodeDataID. For example if v1 = {<3.7, 22>, <2.22, 64>, <1.9, 29>, <0.8, 7>}, and v2 = {<1.66, 7>, <0.03, 9>, <5.65, 64>, <4.9, 11>}, then I want to return 2 because there are two elements from v1 and v2 that share the same NodeDataIDs: 7 and 64.
What is the quickest way to do that in C++ ?
Just for information, NodeDataID is defined as follows (I use Boost):
typedef adjacency_list<setS, setS, undirectedS, NodeData, EdgeData> myGraph;
typedef myGraph::vertex_descriptor NodeDataID;
But it is not important since we can compare two NodeDataID using the operator == (that is, possible to do v1[i].second == v2[j].second)
Put the elements of the first vector into a hash table. Iterate over the second vector, testing each element whether it is in the hash table.
A hash table has the advantage that inserts and lookups can be done in constant time. This means, finding the intersection can be done in linear time. This is optimal, because regardless of the algorithm, you have to look at each vector element at least once.
Boost has boost::intrusive::hashtable, but it's (as the name suggests), intrusive.
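With the standard library, the hash table idea could be sketched using std::unordered_set. In this sketch NodeDataID is replaced by int purely as a stand-in (an assumption, since the real type is a Boost vertex_descriptor and would need to be hashable), and the names Scored and countSharedIds are invented for the example.

```cpp
#include <cstddef>
#include <unordered_set>
#include <utility>
#include <vector>

using NodeDataID = int;  // assumption: stand-in for the Boost vertex_descriptor
using Scored = std::pair<float, NodeDataID>;

// Number of distinct IDs that occur in both v1 and v2.
std::size_t countSharedIds(const std::vector<Scored>& v1,
                           const std::vector<Scored>& v2) {
    std::unordered_set<NodeDataID> ids;
    ids.reserve(v1.size());
    for (const auto& p : v1) ids.insert(p.second);

    std::size_t shared = 0;
    for (const auto& p : v2) {
        // erase returns 1 when the ID was present, so each shared ID
        // is counted once even if it repeats within v2.
        shared += ids.erase(p.second);
    }
    return shared;
}
```

Erasing on a hit is a design choice made here so that repeated IDs are not counted twice; if the IDs within each vector are known to be unique, a plain lookup with count works equally well.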
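The simplest solution is to put the elements of the first vector in a set, then insert each element of the second vector into the same set (ret = myset.insert(an_id)). If ret.second is false, the element already exists, so we increase a counter. Note that this also adds the IDs of v2 to the set, so an ID that occurs twice within v2 is counted as well; it gives the intended result when the IDs within v2 are unique.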
set<NodeDataID> myset;
int counter = 0;
for(int i = 0; i < v1.size(); ++i)
myset.insert(v1[i].second);
for(int i = 0; i < v2.size(); ++i)
{
pair<set<NodeDataID>::iterator,bool> ret = myset.insert(v2[i].second);
if(ret.second == false)
++counter;
}