How to erase items which are values of a map - C++

I have a map like this:
std::map<unsigned, std::pair<std::string, timestamp>> themap.
But I need to manage the size of the map by retaining only the highest 1000 timestamps in the map. I am wondering what is the most efficient way to handle this?
Should I somehow copy the map above into a
std::map<timestamp, std::pair<std::string, unsigned>>, erase the elements not in the top 1000, then massage that map back into the original form?
Or some other way?
Here is my code.
/* map of callid (a number) to a carid (a hex string) and a timestamp (just using unsigned here)
The map will grow through time and potentially grow to some massive amount which would use up all
computer memory. So max size of map can be kept to 1000 (safe to do so). Want to remove items
based on timestamp
*/
#include <vector>
#include <string>
#include <iostream>
#include <algorithm>
#include <map>
typedef unsigned callid;
typedef unsigned timestamp;
typedef std::string carid;
typedef std::pair<carid,timestamp> caridtime;
typedef std::map<callid,caridtime> callid2caridmap;
int main() {
//map of callid -> (carid,timestamp)
callid2caridmap cmap;
//test data below
const std::string startstring("4559023584c8");
std::vector<carid> caridvec;
caridvec.reserve(1000);
for(int i = 1; i < 2001; ++i) {
char buff[20] = {0};
sprintf(buff, "%04u", i);
std::string s(startstring);
s += buff;
caridvec.push_back(s);
}
//generate some made up callids
std::vector<callid> tsvec;
for(int i = 9999; i < 12000; ++i) {
tsvec.push_back(i);
}
//populate map
for(unsigned i = 0; i < 2000; ++i)
cmap[tsvec[i]] = std::make_pair(caridvec[i], i+1);
//expiry handling
static const int MAXNUMBER = 1000;
// what I want to do is retain top 1000 with highest timestamps and remove all other entries.
// But of course map is ordered by the key
// what is best approach. std::transform??
// just iterate each one. But then I don't know what my criteria for erasing is until I have
// found the largest 1000 items
// std::for_each(cmap.begin(), cmap.end(), cleaner);
//nth_element seems appropriate. Do I reverse the map and have key as timestamp, use nth_element
//to work out what part to erase, then un-reverse the map as before with 1000 elements
//std::nth_element(coll.begin(), coll.begin()+MAXNUMBER, coll.end());
//erase from coll.begin()+MAXNUMBER to coll.end()
return 0;
}
UPDATE:
Here is a solution which I am playing with.
// as map is populated also fill queue with timestamp
std::deque<timestamp> tsq;
for(unsigned i = 0; i < 2000; ++i) {
cmap[tsvec[i]] = std::make_pair(caridvec[i], i+1);
tsq.push_back(tsvec[i]);
}
std::cout << "initial cmap size = " << cmap.size() << std::endl;
// expire old entries
static const int MAXNUMBER = 1000;
while(tsq.size() > MAXNUMBER) {
callid2caridmap::iterator it = cmap.find(tsq.front());
if(it != cmap.end())
cmap.erase(it);
tsq.pop_front();
}
std::cout << "cmap size now = " << cmap.size() << std::endl;
But still interested in any possible alternatives.

Make a heap of timestamp -> iterator to the object in the map.
The heap will hold at most 1000 items.
When you insert, check that either the heap has fewer than 1000 items or the new timestamp is greater than the smallest timestamp currently in the heap; in the latter case pop that smallest item from the heap and erase its entry from the map, if all this makes sense.
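To make the heap idea concrete, here is a minimal sketch. It uses a min-heap over the retained timestamps (std::priority_queue with std::greater), so the smallest retained timestamp sits on top and can be evicted when a larger one arrives. prune_insert is a made-up helper name, and the sketch assumes each callid is inserted at most once.
#include <functional>
#include <map>
#include <queue>
#include <string>
#include <utility>
#include <vector>
typedef unsigned callid;
typedef unsigned timestamp;
typedef std::string carid;
typedef std::pair<carid, timestamp> caridtime;
typedef std::map<callid, caridtime> callid2caridmap;
// (timestamp, callid) pairs, ordered so the smallest timestamp is on top
typedef std::pair<timestamp, callid> tscall;
typedef std::priority_queue<tscall, std::vector<tscall>, std::greater<tscall> > minheap;
static const std::size_t MAXNUMBER = 1000;
// Hypothetical helper: insert an entry, evicting the lowest-timestamp entry
// whenever the map would otherwise exceed MAXNUMBER items.
void prune_insert(callid2caridmap& cmap, minheap& heap, callid id, const carid& car, timestamp ts) {
    if (cmap.size() < MAXNUMBER) {
        cmap[id] = std::make_pair(car, ts);
        heap.push(std::make_pair(ts, id));
    } else if (ts > heap.top().first) {
        // the new timestamp beats the smallest retained one: evict, then insert
        cmap.erase(heap.top().second);
        heap.pop();
        cmap[id] = std::make_pair(car, ts);
        heap.push(std::make_pair(ts, id));
    }
    // otherwise the new entry is older than everything retained, so drop it
}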

Related

finding the longest substring with k different/unique characters using hash c++

I came to the problem of finding the longest substring with k unique characters. For instance, given the following str=abcbbbddcc, the results should be:
k=2 => bcbbb
k=3 => bcbbbddcc
I created a function for this purpose using a hash table. The hash table is going to act as a search window. Whenever there are more than k unique characters inside the current window, I shrink it by moving the current "start" of the window to the right; otherwise, I just expand the size of the window. Unfortunately, there seems to be a bug in my code and I'm still not able to find it. Could anyone please help me find the issue? The output of my function is the start index of the substring together with its length, i.e. substring(start, start+maxSize);. I found some related posts java-sol and python-sol, but still no C++-based solution using a hash table.
#include <iostream>
#include <vector>
#include <string>
#include <unordered_map>
typedef std::vector<int> vector;
typedef std::string string;
typedef std::unordered_map<char, int> unordered_map;
typedef unordered_map::iterator map_iter;
vector longestSubstring(const string & str, int k){
if(str.length() == 0 || k < 0){
return {0};
}
int size = str.length();
int start = 0;
unordered_map map;
int maxSize = 0;
int count = 0;
char c;
for(int i = 0; i < size; i++){
c = str[i];
if(map.find(c)!=map.end()){
map[c]++;
}
else{
map.insert({c, 1});
}
while(map.size()>k){
c = str[start];
count = map[c];
if(count>1){
map[c]--;
}
else{
map.erase(c);
}
start++;
}
maxSize = std::max(maxSize, i-start+1);
}
return {start, maxSize};
}
Before maxSize = std::max(maxSize, i-start+1); you must ensure that the map size is exactly k - the window may not have reached k distinct characters yet, but the current code instantly updates maxSize anyway.
Also remember the start value in your own max code:
if (map.size() == k)
if (i - start + 1 > maxSize) {
maxSize = i - start + 1;
astart = start;
}
...
return {astart, maxSize};
Ideone check
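For reference, here is a sketch of the whole function with that fix folded in; astart remembers where the best window started, and the bounds check on k is tightened to k <= 0. Lightly checked against the example strings in the question, not exhaustively tested.
#include <string>
#include <unordered_map>
#include <vector>
// sliding window: only record a maximum when the window holds exactly k distinct characters
std::vector<int> longestSubstring(const std::string& str, int k) {
    if (str.empty() || k <= 0)
        return {0, 0};
    int size = static_cast<int>(str.length());
    int start = 0, astart = 0, maxSize = 0;
    std::unordered_map<char, int> window;
    for (int i = 0; i < size; i++) {
        window[str[i]]++;
        while (static_cast<int>(window.size()) > k) {   // too many distinct chars: shrink from the left
            char c = str[start];
            if (--window[c] == 0)
                window.erase(c);
            start++;
        }
        if (static_cast<int>(window.size()) == k && i - start + 1 > maxSize) {
            maxSize = i - start + 1;
            astart = start;   // remember where the best window starts
        }
    }
    return {astart, maxSize};
}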

Optimizing time performances unordered_map c++

I'm stuck in an optimization problem. I have a huge database (about 16M entries) which represents ratings given by different users to different items. From this database I have to evaluate a correlation measure between different users (i.e. I have to implement a similarity matrix). Fortunately this correlation matrix is symmetric, so I just have to calculate half of it.
Let me focus for example on the first column of the matrix: there are 135k users in total so I keep one user fixed and I find all the common rated items between this user and all the other ones (with a for loop). The time problem appears also if I compare the single user with 20k other users instead of 135k.
My approach is the following: first I query the DB to obtain, for example, all the data of the first 20k users (this takes time even with indexes, but it doesn't bother me since I only do it once), and I store everything in an unordered_map using the userID as key; the mapped value is another unordered_map which stores all the ratings given by that user, this time using the itemID as key.
Then, in order to find the set of items that both users have rated, I iterate over the user who has rated fewer items, checking whether the other one has also rated the same items. The fastest data structures that I know are hashmaps, but for a single complete column my algorithm takes 30s (just for 20k entries), which translates into WEEKS for the complete matrix.
The code is the following:
void similarity_matrix(sqlite3 *db, sqlite3 *db_avg, sqlite3 *similarity, long int tot_users, long int interval) {
long int n = 1;
double sim;
string temp_s;
vector<string> insert_query;
sqlite3_stmt *stmt;
std::cout << "Starting creating similarity matrix..." << std::endl;
string query_string = "SELECT * from usersratings where usersratings.user <= 20000;";
unordered_map<int, unordered_map<int, int>> users_map = db_query(query_string.c_str(), db);
std::cout << "Query time: " << duration_ << " s." << std::endl;
unordered_map<int, int> u1_map = users_map[1];
string select_avg = "SELECT * from averages;";
unordered_map<int, double> avg_map = avg_value(select_avg.c_str(), db_avg);
for (int i = 2; i <= tot_users; i++)
{
unordered_map<int, int> user;
int compare_id;
if (users_map[i].size() <= u1_map.size()) {
user = users_map[i];
compare_id = 1;
}
else {
user = u1_map;
compare_id = i;
}
int matches = 0;
double newnum = 0;
double newden1 = 0;
double newden2 = 0;
unordered_map<int, int> item_map = users_map[compare_id];
for (unordered_map<int, int>::iterator it = user.begin(); it != user.end(); ++it)
{
if (item_map.size() != 0) {
int rating = item_map[it->first];
if (rating != 0) {
double diff1 = (it->second - avg_map[1]);
double diff2 = (rating - avg_map[i]);
newnum += diff1 * diff2;
newden1 += pow(diff1, 2);
newden2 += pow(diff2, 2);
}
}
}
sim = newnum / (sqrt(newden1) * sqrt(newden2));
}
std::cout << "Execution time for first column: " << duration << " s." << std::endl;
std::cout << "First column finished..." << std::endl;
}
This sticks out to me as an immediate potential performance trap:
unordered_map<int, unordered_map<int, int>> users_map = db_query(query_string.c_str(), db);
If the size of each sub-map for each user is anywhere close to the number of users, then you have a quadratic-complexity algorithm, which is going to become drastically slower the more users you have.
unordered_map does offer constant time search but it's still a search. The amount of instructions required to do it is going to dwarf, say, the cost of indexing an array, especially if there are many collisions which implies inner loops each time you try to search the map. It also isn't necessarily represented in a way that allows for the fastest sequential iteration. So if you can just use std::vector for at least the sub-lists and avg_map like so, that should help a lot for starters:
typedef pair<int, int> ItemRating;
typedef vector<ItemRating> ItemRatings;
unordered_map<int, ItemRatings> users_map = ...;
vector<double> avg_map = ...;
Even the outer users_map could be a vector unless it's sparse and not all indices are used. If it's sparse and the range of user IDs still fits into a reasonable range (not an astronomically large integer), you could potentially construct two vectors -- one which stores the user data and has a size proportional to the number of users, while another is proportional to the valid index range of users and stores nothing but indices into the former vector to translate from a user ID to an index with a simple array lookup if you need to be able to access user data through a user ID.
// User data array.
vector<ItemRatings> user_data(num_users);
// Array that translates sparse user ID integers to indices into the
// above dense array. A value of -1 indicates that a user ID is not used.
// To fetch user data for a particular user ID, we do:
// const ItemRatings& ratings = user_data[user_id_to_index[user_id]];
vector<int> user_id_to_index(biggest_user_index+1, -1);
You're also copying those unordered_maps around needlessly on each iteration of the outer loop. While I don't think that's the source of the biggest bottleneck, it would help to avoid deep-copying data structures you don't even modify by using references or pointers:
// Shallow copy, don't deep copy big stuff needlessly.
const unordered_map<int, int>& user = users_map[i].size() <= u1_map.size() ?
users_map[i]: u1_map;
const int compare_id = users_map[i].size() <= u1_map.size() ? 1: i;
const unordered_map<int, int>& item_map = users_map[compare_id];
...
You also don't need to check whether item_map is empty in the inner loop; that check should be hoisted outside. That's a micro-optimization which is unlikely to help much at all, but it still eliminates blatant waste.
The final code after this first pass would be something like this:
vector<ItemRatings> user_data = ..;
vector<double> avg_map = ...;
// Fill `rating_values` with the values from the first user.
vector<int> rating_values(item_range, 0);
const ItemRatings& ratings1 = user_data[0];
for (auto it = ratings1.begin(); it != ratings1.end(); ++it)
{
const int item = it->first;
const int rating = it->second;
rating_values[item] += rating;
}
// For each user starting from the second user:
for (int i=1; i < tot_users; ++i)
{
double newnum = 0;
double newden1 = 0;
double newden2 = 0;
const ItemRatings& ratings2 = user_data[i];
for (auto it = ratings2.begin(); it != ratings2.end(); ++it)
{
const int item = it->first;
const int rating1 = rating_values[it->first];
if (rating1 != 0) {
const int rating2 = it->second;
double diff1 = rating2 - avg_map[1];
double diff2 = rating1 - avg_map[i];
newnum += diff1 * diff2;
newden1 += pow(diff1, 2);
newden2 += pow(diff2, 2);
}
}
sim = newnum / (sqrt(newden1) * sqrt(newden2));
}
The biggest difference in the above code is that we eliminated all searches through unordered_map and replaced them with simple indexed access of an array. We also eliminated a lot of redundant copying of data structures.

Having trouble creating an array that shows how the indices were moved in another array

This is the gist of the function I'm trying to make. However, whenever I print out the order_of_change array, its values are completely wrong about where the values of tumor were moved to. I changed the i inside the if statement to tumor[i] to make sure that tumor[i] was indeed matching its corresponding value in temp_array, and it does. Can anyone tell me what's going wrong?
double temp_array[20];
for (int i = 0; i < 20; i++)
{
temp_array[i] = tumor[i];
}
//sort tumor in ascending order
sort(tumor, tumor + 20); //tumor is an array of 20 random numbers
int x = 0; //counter
int order_of_change[20]; //array to house the index change done by sort
while (x < 20) //find where each value was moved to and record it in order_of_change
{
for (int i = 0; i < 20; i++)
{
if (temp_array[x] == tumor[i])
{
order_of_change[x] = i;
x += 1;
}
}
}
To sort the data, but only have the indices show the sort order, all you need to do is create an array of indices in ascending order (starting from 0), and then use that as part of the std::sort criteria.
Here is an example:
#include <algorithm>
#include <iostream>
#include <array>
void test()
{
std::array<double, 8> tumor = {{4, 3, 7, 128,18, 45, 1, 90}};
std::array<int, 8> indices = {0,1,2,3,4,5,6,7};
//sort tumor in ascending order
std::sort(indices.begin(), indices.end(), [&](int n1, int n2)
{ return tumor[n1] < tumor[n2]; });
// output the tumor array using the indices that were sorted
for (size_t i = 0; i < tumor.size(); ++i)
std::cout << tumor[indices[i]] << "\n";
// show the indices
std::cout << "\n\nHere are the indices:\n";
for (size_t i = 0; i < tumor.size(); ++i)
std::cout << indices[i] << "\n";
}
int main()
{ test(); }
Live Example
Even though the example uses std::array, the principle is the same. Sort the index array based on the items in the data. The tumor array stays intact without the actual elements being moved.
This technique can also be used when the items in the array (or std::vector) are expensive to copy when moved around, but you still want the ability to produce a sorted view without actually sorting the items.
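If the goal is still to produce the order_of_change array from the original question (the new position of each original element after sorting), the sorted index array can be inverted in one linear pass instead of searching. A minimal sketch, assuming the data is held in a std::vector<double> rather than the raw 20-element array:
#include <algorithm>
#include <numeric>
#include <vector>
std::vector<int> order_of_change(const std::vector<double>& tumor) {
    std::vector<int> indices(tumor.size());
    std::iota(indices.begin(), indices.end(), 0);              // 0, 1, 2, ...
    std::sort(indices.begin(), indices.end(),
              [&](int a, int b) { return tumor[a] < tumor[b]; });
    // indices[i] is the original position of the i-th smallest element,
    // so inverting the permutation gives each original element's new position
    std::vector<int> order(tumor.size());
    for (int i = 0; i < static_cast<int>(indices.size()); ++i)
        order[indices[i]] = i;
    return order;
}
order[x] then tells you where the original element at position x ended up after sorting, which is what the while loop in the question was computing with a quadratic search.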

Performing intersection of vector c++

I have 200 vectors of size ranging from 1 to 4000000 stored in vecOfVec. I need to intersect these vectors with a single vector "vecSearched" of 9000+ elements. I tried to do this using the following code; however, using the perf tool I found that the intersection I am doing is the bottleneck in my code. Is there some way to perform a more efficient intersection?
#include <cstdlib>
#include <iostream>
#include <vector>
using namespace std;
int main(int argc, char** argv) {
vector<vector<unsigned> > vecOfVec; //contains 120 vectors of size ranging from 1 to 2000000 elements. All vectors in vecOfVec are sorted
vector<unsigned> vecSearched; //vector searched in contains 9000+ elements. Vectors in vecSearched are sorted
for(unsigned kbt=0; kbt<vecOfVec.size(); kbt++)
{
//first find first 9 values spaced at equi-distant places, use these 9 values for performing comparisons
vector<unsigned> equiSpacedVec;
if(((vecSearched[0]))>vecOfVec[kbt][(vecOfVec[kbt].size())-1]) //if beginning of searched vector > last value present in individual vectors of vecOfVec then continue
{
continue;
}
unsigned elementIndex=0; //used for iterating over equiSpacedVec
unsigned i=0; //used for iterating over individual buckets vecOfVec[kbt].second
//search for value in bucket and store it in bucketValPos
bool firstRun=true;
for(vector<unsigned>::iterator itValPos=vecSearched.begin();itValPos!=vecSearched.end();++itValPos)
{
//construct a summarized vector out of individual vectors of vecOfVec
if(firstRun)
{
firstRun=false;
unsigned elementIndex1=0; //used for iterating over equiSpacedVec
while(elementIndex1<(vecOfVec[kbt].size())) //create a small vector for skipping over the remaining vectors
{
if((elementIndex1+(10000))<(vecOfVec[kbt].size()))
elementIndex1+=10000;
else
break;
equiSpacedVec.push_back(vecOfVec[kbt][elementIndex1]);
}
}
//skip individual vectors of vecOfVec using summarized vector constructed above
while((!(equiSpacedVec.empty()))&&(equiSpacedVec.size()>(elementIndex+1))&&((*itValPos)>equiSpacedVec[elementIndex+1])){
elementIndex+=1;
if((i+100)<(vecOfVec[kbt].size()))
i+=100;
}
unsigned j=i;
while(((*itValPos)>vecOfVec[kbt][j])&&(j<vecOfVec[kbt].size())){
j++;
}
if(j>(vecOfVec[kbt].size()-1)) //element not found even at last position.
{
break;
}
if((*itValPos)==vecOfVec[kbt][j])
{
//store intersection result
}
}
}
return 0;
}
Your problem is a very popular one. Since you have no data correlating the vectors to intersect, it boils down to speeding up the intersection between two vectors, and there are basically two approaches to it:
1. Without any preprocessing
This is usually addressed by three things:
Reducing the number of comparisons. E.g. for small vectors (sized 1 to 50) you should binary search each element to avoid traversing all the 9000+ elements of the subject vector.
Improving code quality to reduce branch mispredictions; e.g. observing that the resulting set will usually be smaller than the input sets could transform such code:
while (Apos < Aend && Bpos < Bend) {
if (A[Apos] == B[Bpos]) {
C[Cpos++] = A[Apos];
Apos++; Bpos++;
}
else if (A[Apos] > B[Bpos]) {
Bpos++;
}
else {
Apos++;
}
}
to code that "unrolls" such comparisons creating though easier to predict branches (example for block size = 2) :
while (1) {
Adat0 = A[Apos]; Adat1 = A[Apos + 1];
Bdat0 = B[Bpos]; Bdat1 = B[Bpos + 1];
if (Adat0 == Bdat0) {
C[Cpos++] = Adat0;
}
else if (Adat0 == Bdat1) {
C[Cpos++] = Adat0;
goto advanceB;
}
else if (Adat1 == Bdat0) {
C[Cpos++] = Adat1;
goto advanceA;
}
if (Adat1 == Bdat1) {
C[Cpos++] = Adat1;
goto advanceAB;
}
else if (Adat1 > Bdat1) goto advanceB;
else goto advanceA;
advanceA:
Apos+=2;
if (Apos >= Aend) { break; } else { continue; }
advanceB:
Bpos+=2;
if (Bpos >= Bend) { break; } else { continue; }
advanceAB:
Apos+=2; Bpos+=2;
if (Apos >= Aend || Bpos >= Bend) { break; }
}
// fall back to naive algorithm for remaining elements
Using SIMD instructions to perform block operations
These techniques are hard to describe in a QA context but you can read about them (plus relevant optimizations like the if conversion) here and here or find implementation elements here
2. With preprocessing
This IMHO is a better way for your case because you have a single subject vector of 9000+ elements. You could make an interval tree out of it or simply find a way to index it, e.g. make a structure that will speed up the search against it:
vector<unsigned> subject(9000+);
vector<range> index(9000/M);
where range is a struct like
struct range {
unsigned min, max;
};
thus creating a sequence of ranges like so
[0, 100], [101, 857], ... [33221, 33500]
that will allow you to skip many, many comparisons when doing the intersection (for example, if an element of the other set is larger than the max of a subrange, you can skip that subrange completely).
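A rough sketch of such a block index, assuming the subject vector is sorted; M, build_index and contains are illustrative names only, not from the question:
#include <algorithm>
#include <cstddef>
#include <vector>
struct range {
    unsigned min, max;
};
// one range per block of M consecutive values of the sorted subject vector
std::vector<range> build_index(const std::vector<unsigned>& subject, std::size_t M) {
    std::vector<range> index;
    for (std::size_t i = 0; i < subject.size(); i += M) {
        std::size_t last = std::min(i + M, subject.size()) - 1;
        range r = { subject[i], subject[last] };
        index.push_back(r);
    }
    return index;
}
// membership test that skips whole blocks whose max is below the searched value
bool contains(const std::vector<unsigned>& subject, const std::vector<range>& index,
              std::size_t M, unsigned v) {
    for (std::size_t b = 0; b < index.size(); ++b) {
        if (index[b].max < v) continue;      // whole block too small: skip it
        if (index[b].min > v) return false;  // v falls in the gap before this block
        std::size_t begin = b * M, end = std::min(begin + M, subject.size());
        return std::binary_search(subject.begin() + begin, subject.begin() + end, v);
    }
    return false;
}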
3. Parallelize
Yep, there's always a third element in a list of two :P. When you have optimized the procedure enough (and only then), break up your work into chunks and run it in parallel. The problem fits an embarrassingly parallel pattern, so 200 vectors vs 1 should definitely run as "50 vs 1, four times concurrently".
Test, measure, redesign!!
If I understood your code correctly, then if N is the number of vectors and M the number of elements inside each vector, your algorithm is roughly O(N * M^2). The 'bucket' strategy improves things a bit, but its effect is difficult to evaluate at first sight.
I would suggest you work on sorted vectors and compute intersections on sorted ones. Something like this:
vector<vector<unsigned> > vecOfVec;
vector<unsigned> vecSearched ;
for (vector<unsigned> v : vecOfVec) // yes, a copy
{
std::sort(v.begin(), v.end()) ;
if (vecSearched.empty()) // first run detection
vecSearched = v ;
else
{ // compute intersection of v and vecSearch
auto vit = v.begin() ;
auto vend = v.end() ;
auto sit = vecSearched.begin() ;
auto send = vecSearched.end() ;
vector<unsigned> result ;
while (vit != vend && sit != send)
{
if (*vit < *sit)
vit++ ;
else if (*vit == *sit)
{
result.push_back(*vit) ;
++vit ;
++sit ;
}
else // *vit > *sit
++sit ;
}
vecSearched = result ;
}
}
Code is untested; anyway, the idea behind it is that intersection on sorted vectors is easier since you can compare the two iterators (vit, sit) and advance the one pointing to the smaller value. So intersection is linear in M and the whole complexity is O(N * M * log(M)), where the log(M) is due to sorting.
The simplest way to compare two sorted vectors is to iterate through both of them at the same time, and only increment whichever iterator has the smaller value. This is guaranteed to require the smallest number of comparisons if both vectors are unique. Actually, you can use the same code for all sorts of collections, e.g. linked lists.
@Nikos Athanasiou (answer above) gives lots of useful tips to speed up your code, like using skip lists and SIMD comparisons. However, your dataset is so tiny that even the straightforward naive code here runs blindingly fast...
template<typename CONT1, typename CONT2, typename OP_MARKIDENTICAL>
inline
void set_mark_overlap( const CONT1& cont1,
const CONT2& cont2,
OP_MARKIDENTICAL op_markidentical)
{
auto ii = cont1.cbegin();
auto end1 = cont1.cend();
auto jj = cont2.cbegin();
auto end2 = cont2.cend();
if (cont1.empty() || cont2.empty())
return;
for (;;)
{
// increment iterator to container 1 if it is less
if (*ii < *jj)
{
if (++ii == end1)
break;
}
// increment iterator to container 2 if it is less
else if (*jj < *ii)
{
if (++jj == end2)
break;
}
// same values
// increment both iterators
else
{
op_markidentical(*ii);
++ii;
if (ii == end1)
break;
//
// Comment if container1 can contain duplicates
//
++jj;
if (jj == end2)
break;
}
}
}
Here is how you might use this code:
template<typename TT>
struct op_store
{
vector<TT>& store;
op_store(vector<TT>& store): store(store){}
void operator()(TT val){store.push_back(val);}
};
vector<unsigned> first{1,2,3,4,5,6};
vector<unsigned> second{1,2,5,6, 7,9};
vector<unsigned> overlap;
set_mark_overlap( first, second, op_store<unsigned>(overlap));
for (const auto& ii : overlap)
std::cout << ii << ",";
std::cout << "\n";
// 1,2,5,6
This code assumes that neither vector contains duplicates. If any of your vecOfVec contains duplicates, and you want each duplicate to be printed out, then you need to comment the indicated code above. If your vecSearched vector contains duplicates, it is not clear what would be the appropriate response...
In your case, the code to store matching values would be just these three lines:
// results
vector<vector<unsigned> > results(120);
for (unsigned ii = 0; ii < vecOfVec.size(); ++ii)
set_mark_overlap(vecSearched, vecOfVec[ii], op_store<unsigned>(results[ii]));
In terms of optimisation, your problem has two characteristics:
1) One list is always much shorter than the other
2) The shorter list is reused while the longer list is new to every comparison.
Up-front costs (of pre-processing, e.g. for skip lists as suggested by @Nikos Athanasiou) are negligible for the short list of 9000 (which is used again and again) but not for the longer lists.
I imagine most of the skipping is for the longer lists, so skip lists may not be a panacea. How about a sort of dynamic skip, so that you jump by N (j += 4,000,000 / 9000) or by one (++j) when container two is catching up (in the code above)? If you have jumped too far, you can use a mini binary search to find the right amount to increment j by.
Because of this asymmetry in list lengths, I can't see how recoding with SIMD is going to help: we need to reduce the number of comparisons to fewer than (N+M) rather than increase the speed of each comparison. However, it depends on your data. Code up and time things!
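As a rough sketch of the dynamic-jump idea above (gallop forward through the long sorted vector in growing steps, then binary search back inside the last step), with gallop_to as an illustrative name:
#include <algorithm>
#include <cstddef>
#include <vector>
// returns the index of the first element >= value, starting the search at 'from'
std::size_t gallop_to(const std::vector<unsigned>& longvec, std::size_t from, unsigned value) {
    std::size_t step = 1, lo = from, hi = from;
    while (hi < longvec.size() && longvec[hi] < value) {
        lo = hi;
        hi += step;
        step *= 2;                       // exponential (galloping) jump
    }
    hi = std::min(hi, longvec.size());
    // binary search within the last jump for the first element >= value
    return std::lower_bound(longvec.begin() + lo, longvec.begin() + hi, value) - longvec.begin();
}
Intersecting would then mean walking the 9000-element vector once and, for each value, resuming gallop_to from the previously returned index, so the long vector is traversed at most once.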
Here is the test code to create some vectors of random numbers and check that they are present
#include <iostream>
#include <vector>
#include <unordered_set>
#include <exception>
#include <algorithm>
#include <limits>
#include <random>
using namespace std;
template<typename TT>
struct op_store
{
std::vector<TT>& store;
op_store(vector<TT>& store): store(store){}
void operator()(TT val){ store.push_back(val); }
};
void fill_vec_with_unique_random_values(vector<unsigned>& cont, unordered_set<unsigned>& curr_values)
{
static random_device rd;
static mt19937 e1(rd());
static uniform_int_distribution<unsigned> uniform_dist(0, std::numeric_limits<unsigned>::max());
for (auto& jj : cont)
{
for (;;)
{
unsigned new_value = uniform_dist(e1);
// make sure all values are unique
if (curr_values.count(new_value) == 0)
{
curr_values.insert(new_value);
jj = new_value;
break;
}
}
}
}
int main (int argc, char *argv[])
{
static random_device rd;
static mt19937 e1(rd());
// vector searched in contains 9000+ elements. Vectors in vecSearched are sorted
vector<unsigned> vecSearched(9000);
unordered_set<unsigned> unique_values_9000;
fill_vec_with_unique_random_values(vecSearched, unique_values_9000);
//
// Create 120 vectors of size ranging from 1 to 2000000 elements. All vectors in vecOfVec are sorted
//
vector<vector<unsigned> > vecOfVec(5);
normal_distribution<> vec_size_normal_dist(1000000U, 500000U);
for (unsigned ii = 0; ii < vecOfVec.size(); ++ii)
{
std::cerr << " Create Random data set" << ii << " ...\n";
auto vec_size = min(2000000U, static_cast<unsigned>(vec_size_normal_dist(e1)));
vecOfVec[ii].resize(vec_size);
// Do NOT share values with the 9000. We will manually add these later
unordered_set<unsigned> unique_values(unique_values_9000);
fill_vec_with_unique_random_values(vecOfVec[ii], unique_values);
}
// insert half of vecSearched in our 120 vectors so that we know what we are going to find
vector<unsigned> correct_results(begin(vecSearched), begin(vecSearched) + 4500);
for (unsigned ii = 0; ii < vecOfVec.size(); ++ii)
vecOfVec[ii].insert(vecOfVec[ii].end(), begin(correct_results), end(correct_results));
// Make sure everything is sorted
std::cerr << " Sort data ...\n";
for (unsigned ii = 0; ii < vecOfVec.size(); ++ii)
sort(begin(vecOfVec[ii]), end(vecOfVec[ii]));
sort(begin(vecSearched), end(vecSearched));
sort(begin(correct_results), end(correct_results));
std::cerr << " Match ...\n";
// results
vector<vector<unsigned> > results(120);
for (unsigned ii = 0; ii < vecOfVec.size(); ++ii)
{
std::cerr << ii << " done\n";
set_mark_overlap(vecSearched, vecOfVec[ii], op_store<unsigned>(results[ii]));
// check all is well
if (results[ii] != correct_results)
throw runtime_error("Oops");
}
return(0);
}
Using std::set_intersection could help, but I do not know if it improves the overall speed:
vector<vector<unsigned int> > vecOfVec(200);
vector<unsigned int> vecSearched;
set<unsigned int> intersection;
for(auto it = vecOfVec.begin(); it != vecOfVec.end(); ++it)
{
std::set_intersection(it->begin(), it->end(), vecSearched.begin(), vecSearched.end(), std::inserter(intersection, intersection.begin()));
}

How to delete items from a std::vector given a list of indices

I have a vector of items items, and a vector of indices that should be deleted from items:
std::vector<T> items;
std::vector<size_t> indicesToDelete;
items.push_back(a);
items.push_back(b);
items.push_back(c);
items.push_back(d);
items.push_back(e);
indicesToDelete.push_back(3);
indicesToDelete.push_back(0);
indicesToDelete.push_back(1);
// given these 2 data structures, I want to remove items so it contains
// only c and e (deleting indices 3, 0, and 1)
// ???
What's the best way to perform the deletion, knowing that with each deletion, it affects all other indices in indicesToDelete?
A couple ideas would be to:
Copy items to a new vector one item at a time, skipping if the index is in indicesToDelete
Iterate items and for each deletion, decrement all items in indicesToDelete which have a greater index.
Sort indicesToDelete first, then iterate indicesToDelete, and for each deletion increment an indexCorrection which gets subtracted from subsequent indices.
All seem like I'm over-thinking such a seemingly trivial task. Any better ideas?
Edit: Here is the solution, basically a variation of #1 but using iterators to define blocks to copy to the result.
template<typename T>
inline std::vector<T> erase_indices(const std::vector<T>& data, std::vector<size_t>& indicesToDelete/* can't assume copy elision, don't pass-by-value */)
{
if(indicesToDelete.empty())
return data;
std::vector<T> ret;
ret.reserve(data.size() - indicesToDelete.size());
std::sort(indicesToDelete.begin(), indicesToDelete.end());
// now we can assume there is at least 1 element to delete. copy blocks at a time.
typename std::vector<T>::const_iterator itBlockBegin = data.begin();
for(std::vector<size_t>::const_iterator it = indicesToDelete.begin(); it != indicesToDelete.end(); ++ it)
{
typename std::vector<T>::const_iterator itBlockEnd = data.begin() + *it;
if(itBlockBegin != itBlockEnd)
{
std::copy(itBlockBegin, itBlockEnd, std::back_inserter(ret));
}
itBlockBegin = itBlockEnd + 1;
}
// copy last block.
if(itBlockBegin != data.end())
{
std::copy(itBlockBegin, data.end(), std::back_inserter(ret));
}
return ret;
}
I would go for 1/3, that is: order the indices vector, create two iterators into the data vector, one for reading and one for writting. Initialize the writing iterator to the first element to be removed, and the reading iterator to one beyond that one. Then in each step of the loop increment the iterators to the next value (writing) and next value not to be skipped (reading) and copy/move the elements. At the end of the loop call erase to discard the elements beyond the last written to position.
BTW, this is the approach implemented in the remove/remove_if algorithms of the STL with the difference that you maintain the condition in a separate ordered vector.
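A minimal sketch of that read/write compaction, written with indices rather than raw iterators and assuming indicesToDelete has already been sorted ascending and contains unique, valid indices:
#include <cstddef>
#include <utility>
#include <vector>
template <typename T>
void erase_indices_inplace(std::vector<T>& items, const std::vector<std::size_t>& sortedIndices) {
    if (sortedIndices.empty())
        return;
    std::size_t write = sortedIndices[0];    // first hole: nothing before it has to move
    std::size_t nextDel = 0;
    for (std::size_t read = write; read < items.size(); ++read) {
        if (nextDel < sortedIndices.size() && read == sortedIndices[nextDel]) {
            ++nextDel;                       // this element is to be deleted: skip it
        } else {
            items[write++] = std::move(items[read]);
        }
    }
    items.erase(items.begin() + write, items.end());   // discard the tail, as with remove/erase
}
With the data from the question ({a,b,c,d,e} and indices {0,1,3} after sorting) this leaves just c and e, and each surviving element is moved at most once.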
std::sort() the indicesToDelete in descending order and then delete from the items in a normal for loop. No need to adjust indices then.
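A sketch of that descending-order variant, assuming the indices are unique and in range (note that each erase still shifts the tail of the vector, so this is simple rather than fast):
#include <algorithm>
#include <cstddef>
#include <functional>
#include <vector>
template <typename T>
void erase_descending(std::vector<T>& items, std::vector<std::size_t> indicesToDelete) {
    std::sort(indicesToDelete.begin(), indicesToDelete.end(), std::greater<std::size_t>());
    for (std::size_t idx : indicesToDelete)
        items.erase(items.begin() + idx);   // higher indices were already erased, lower ones are unaffected
}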
It might even be option 4:
If you are deleting a few items from a large number, and know that there will never be a high density of deleted items:
Replace each of the items at indices which should be deleted with 'tombstone' values, indicating that there is nothing valid at those indices, and make sure that whenever you access an item, you check for a tombstone.
It depends on the numbers you are deleting.
If you are deleting many items, it may make sense to copy the items that are not deleted to a new vector and then replace the old vector with the new vector (after sorting the indicesToDelete). That way, you will avoid compressing the vector after each delete, which is an O(n) operation, possibly making the entire process O(n^2).
If you are deleting a few items, perhaps do the deletion in reverse index order (assuming the indices are sorted), then you do not need to adjust them as items get deleted.
Since the discussion has somewhat transformed into a performance related question, I've written up the following code. It uses remove_if and vector::erase, which should move the elements a minimal number of times. There's a bit of overhead, but for large cases, this should be good.
However, if you don't care about the relative order of elements, then this will not be all that fast.
#include <algorithm>
#include <iostream>
#include <string>
#include <vector>
#include <set>
using std::vector;
using std::string;
using std::remove_if;
using std::cout;
using std::endl;
using std::set;
struct predicate {
public:
predicate(const vector<string>::iterator & begin, const vector<size_t> & indices) {
m_begin = begin;
m_indices.insert(indices.begin(), indices.end());
}
bool operator()(string & value) {
const int index = distance(&m_begin[0], &value);
set<size_t>::iterator target = m_indices.find(index);
return target != m_indices.end();
}
private:
vector<string>::iterator m_begin;
set<size_t> m_indices;
};
int main() {
vector<string> items;
items.push_back("zeroth");
items.push_back("first");
items.push_back("second");
items.push_back("third");
items.push_back("fourth");
items.push_back("fifth");
vector<size_t> indicesToDelete;
indicesToDelete.push_back(3);
indicesToDelete.push_back(0);
indicesToDelete.push_back(1);
vector<string>::iterator pos = remove_if(items.begin(), items.end(), predicate(items.begin(), indicesToDelete));
items.erase(pos, items.end());
for (int i=0; i< items.size(); ++i)
cout << items[i] << endl;
}
The output for this would be:
second
fourth
fifth
There is a bit of a performance overhead that can still be reduced. In remove_if (at least on gcc), the predicate is copied by value for each element in the vector. This means that we're possibly invoking the copy constructor on the set m_indices each time. If the compiler is not able to get rid of this, then I would recommend passing the indices in as a set, and storing it as a const reference.
We could do that as follows:
struct predicate {
public:
predicate(const vector<string>::iterator & begin, const set<size_t> & indices) : m_begin(begin), m_indices(indices) {
}
bool operator()(string & value) {
const int index = distance(&m_begin[0], &value);
set<size_t>::iterator target = m_indices.find(index);
return target != m_indices.end();
}
private:
const vector<string>::iterator & m_begin;
const set<size_t> & m_indices;
};
int main() {
vector<string> items;
items.push_back("zeroth");
items.push_back("first");
items.push_back("second");
items.push_back("third");
items.push_back("fourth");
items.push_back("fifth");
set<size_t> indicesToDelete;
indicesToDelete.insert(3);
indicesToDelete.insert(0);
indicesToDelete.insert(1);
vector<string>::iterator pos = remove_if(items.begin(), items.end(), predicate(items.begin(), indicesToDelete));
items.erase(pos, items.end());
for (int i=0; i< items.size(); ++i)
cout << items[i] << endl;
}
Basically the key to the problem is remembering that if you delete the object at index i, and don't use a tombstone placeholder, then the vector must make a copy of all of the objects after i. This applies to every possibility you suggested except for #1. Copying to a new list makes one copy no matter how many you delete, making it by far the fastest answer.
And as David Rodríguez said, sorting the list of indices to be deleted allows for some minor optimizations, but it may only be worth it if you're deleting more than 10-20 (please profile first).
Here is my solution for this problem which keeps the order of the original "items":
create a "vector mask" and initialize (fill) it with "false" values.
change the values of mask to "true" for all the indices you want to remove.
loop over all members of "mask" and erase from both vectors "items" and "mask" the elements with "true" values.
Here is the code sample:
#include <iostream>
#include <vector>
using namespace std;
int main()
{
vector<unsigned int> items(12);
vector<unsigned int> indicesToDelete(3);
indicesToDelete[0] = 3;
indicesToDelete[1] = 0;
indicesToDelete[2] = 1;
for(int i=0; i<12; i++) items[i] = i;
for(int i=0; i<items.size(); i++)
cout << "items[" << i << "] = " << items[i] << endl;
// removing indices
vector<bool> mask(items.size());
vector<bool>::iterator mask_it;
vector<unsigned int>::iterator items_it;
for(size_t i = 0; i < mask.size(); i++)
mask[i] = false;
for(size_t i = 0; i < indicesToDelete.size(); i++)
mask[indicesToDelete[i]] = true;
mask_it = mask.begin();
items_it = items.begin();
while(mask_it != mask.end()){
if(*mask_it){
items_it = items.erase(items_it);
mask_it = mask.erase(mask_it);
}
else{
mask_it++;
items_it++;
}
}
for(int i=0; i<items.size(); i++)
cout << "items[" << i << "] = " << items[i] << endl;
return 0;
}
This is not a fast implementation for use with large data sets. The erase() method takes time to rearrange the vector after eliminating an element.