Find equal values in an array in C++ - c++

Is there a faster way to find equal values in an array than comparing every element one by one against all the other elements of the array?
for(int i = 0; i < arrayLenght; i++)
{
    for(int k = i + 1; k < arrayLenght; k++) // k++, and start at i + 1 so an element is not compared with itself
    {
        if(array[i] == array[k])
        {
            sprintf(message, "There is a duplicate of %s", array[i]);
            ShowMessage(message);
            break;
        }
    }
}

Since sorting your container is a possible solution, std::unique is the simplest solution to your problem:
std::vector<int> v {0,1,0,1,2,0,1,2,3};
std::sort(begin(v), end(v));
v.erase(std::unique(begin(v), end(v)), end(v));
First, the vector is sorted. You can use anything, std::sort is just the simplest. After that, std::unique shifts the duplicates to the end of the container and returns an iterator to the first duplicate. This is then eaten by erase and effectively removes those from the vector.
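If the goal is only to detect and report a duplicate (as in the question) rather than remove it, a minimal sketch on the same sorted data could use std::adjacent_find instead of std::unique:
#include <algorithm>
#include <iostream>
#include <vector>

int main()
{
    std::vector<int> v {0,1,0,1,2,0,1,2,3};
    std::sort(v.begin(), v.end());
    // adjacent_find returns an iterator to the first pair of equal neighbours
    auto it = std::adjacent_find(v.begin(), v.end());
    if (it != v.end())
        std::cout << "There is a duplicate of " << *it << '\n';
}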

You could use std::multiset and then count duplicates afterwards like this:
#include <iostream>
#include <set>

int main()
{
    const int arrayLenght = 14;
    int array[arrayLenght] = { 0,2,1,3,1,4,5,5,5,2,2,3,5,5 };
    std::multiset<int> ms(array, array + arrayLenght);
    for (auto it = ms.begin(), end = ms.end(); it != end; it = ms.equal_range(*it).second)
    {
        int cnt = 0;
        if ((cnt = ms.count(*it)) > 1)
            std::cout << "There are " << cnt << " of " << *it << std::endl;
    }
}
https://ideone.com/6ktW89
There are 2 of 1
There are 3 of 2
There are 2 of 3
There are 5 of 5

If the value_type of this array can be ordered by operator< (a strict weak ordering), it's a good choice to do as YSC answered.
If not, maybe you can define a hash function to hash the objects to different values. Then you can do this in O(n) time complexity, like:
struct ValueHash
{
    size_t operator()(const Value& rhs) const {
        //do_something
    }
};

struct ValueCmp
{
    bool operator()(const Value& lhs, const Value& rhs) const {
        //do_something
    }
};

std::unordered_set<Value, ValueHash, ValueCmp> myset;
for(int i = 0; i < arrayLenght; i++)
{
    if(myset.find(array[i]) == myset.end())
        myset.insert(array[i]);
    else
        dosomething();
}

In case you have a large amount of data, you can first sort the array (quicksort gives you a first pass in O(n*log(n))) and then do a second pass comparing each value with the next one (duplicates will now be adjacent) to find them (this is a sequential pass in O(n)). So sorting in a first pass and searching the sorted array for duplicates gives you O(n*log(n) + n), or finally O(n*log(n)).
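A minimal sketch of that sort-then-scan idea, assuming an int array named array of length arrayLenght as in the question:
#include <algorithm>

// first pass: sort in O(n*log(n)); second pass: one sequential scan in O(n)
std::sort(array, array + arrayLenght);
for (int i = 0; i + 1 < arrayLenght; ++i)
{
    if (array[i] == array[i + 1])
    {
        // array[i] is duplicated
    }
}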
EDIT
An alternative has been suggested in the comments, of using a std::set to check for already processed data. The algorithm just goes element by element, checking if the element has been seen before. This can lead to an O(n) algorithm, but only if you take care to use a hash set. If you use a sorted set, you incur an O(log(n)) cost for each set search and finish with the same O(n*log(n)). But because the proposal can be solved with a hash set (you have to be careful to select std::unordered_set, so you don't get the extra access time per search), you get a final O(n). Of course, you have to account for possible automatic hash table growth or a large waste of memory used in the hash table.
Thanks to #freakish, who pointed out the set solution in the comments to the question.

Related

Efficient way of finding if a container contains duplicated values with STL? [duplicate]

I wrote this code in C++ as part of a uni task where I need to ensure that there are no duplicates within an array:
// Check for duplicate numbers in user inputted data
int i; // Need to declare i here so that it can be accessed by the 'inner' loop that starts on line 21
for(i = 0; i < 6; i++) { // Check each other number in the array
    for(int j = i; j < 6; j++) { // Check the rest of the numbers
        if(j != i) { // Makes sure don't check number against itself
            if(userNumbers[i] == userNumbers[j]) {
                b = true;
            }
        }
        if(b == true) { // If there is a duplicate, change that particular number
            cout << "Please re-enter number " << i + 1 << ". Duplicate numbers are not allowed:" << endl;
            cin >> userNumbers[i];
        }
    } // Comparison loop
    b = false; // Reset the boolean after each number entered has been checked
} // Main check loop
It works perfectly, but I'd like to know if there is a more elegant or efficient way to check.
You could sort the array in O(n log(n)) and then simply compare each element with the next one (duplicates will be adjacent). That is substantially faster than your existing O(n^2) algorithm, and the code is also a lot cleaner. Your code also doesn't ensure no duplicates were inserted when they were re-entered; you need to prevent duplicates from existing in the first place.
std::sort(userNumbers.begin(), userNumbers.end());
for(int i = 0; i < userNumbers.size() - 1; i++) {
    if (userNumbers[i] == userNumbers[i + 1]) {
        userNumbers.erase(userNumbers.begin() + i);
        i--;
    }
}
I also second the recommendation to use a std::set - no duplicates there.
The following solution is based on sorting the numbers and then removing the duplicates:
#include <algorithm>

int main()
{
    int userNumbers[6];
    // ...
    int* end = userNumbers + 6;
    std::sort(userNumbers, end);
    bool containsDuplicates = (std::unique(userNumbers, end) != end);
}
Indeed, the fastest and, as far as I can see, most elegant method is as advised above:
std::vector<int> tUserNumbers;
// ...
std::set<int> tSet(tUserNumbers.begin(), tUserNumbers.end());
std::vector<int>(tSet.begin(), tSet.end()).swap(tUserNumbers);
It is O(n log n). This, however, does not work if the ordering of the numbers in the input array needs to be kept. In that case I did:
std::set<int> tTmp;
std::vector<int>::iterator tNewEnd =
    std::remove_if(tUserNumbers.begin(), tUserNumbers.end(),
        [&tTmp] (int pNumber) -> bool {
            return (!tTmp.insert(pNumber).second);
        });
tUserNumbers.erase(tNewEnd, tUserNumbers.end());
which is still O(n log n) and keeps the original ordering of elements in tUserNumbers.
Cheers,
Paul
This is an extension to the answer by #Puppy, which is the current best answer.
PS: I tried to post this as a comment on the current best answer by #Puppy but couldn't, as I don't have 50 points yet. Also, a bit of experimental data is shared here for further help.
Both std::set and std::map are implemented in the STL using a balanced binary search tree, so both lead to a complexity of O(n log n) in this case. Better performance can be achieved if a hash table is used: std::unordered_map offers a hash-table-based implementation for faster search. I experimented with all three implementations and found the results using std::unordered_map to be better than std::set and std::map. Results and code are shared below. The images are snapshots of the performance measured by LeetCode on the solutions.
bool hasDuplicate(vector<int>& nums) {
    size_t count = nums.size();
    if (!count)
        return false;

    std::unordered_map<int, int> tbl;
    //std::set<int> tbl;
    for (size_t i = 0; i < count; i++) {
        if (tbl.find(nums[i]) != tbl.end())
            return true;
        tbl[nums[i]] = 1;
        //tbl.insert(nums[i]);
    }
    return false;
}
unordered_map Performance (Run time was 52 ms here)
Set/Map Performance
You can add all the elements to a set and check, while adding each one, whether it is already present. That would be more elegant and efficient.
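A minimal sketch of that idea, assuming the userNumbers array of six ints from the question; std::set::insert reports through the .second member of its returned pair whether the value was already present:
#include <set>

std::set<int> seen;
for (int i = 0; i < 6; i++) {
    // insert().second is false when the value was already in the set
    while (!seen.insert(userNumbers[i]).second) {
        cout << "Please re-enter number " << i + 1 << ". Duplicate numbers are not allowed:" << endl;
        cin >> userNumbers[i];
    }
}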
I'm not sure why this hasn't been suggested, but here is a way in base 10 to find duplicate digits in O(n). The problem I see with the already-suggested O(n) solution is that it requires the digits to be sorted first. This method is O(n) and does not require the set to be sorted. The cool thing is that checking whether a specific digit has duplicates is O(1). I know this thread is probably dead, but maybe it will help somebody! :)
/*
============================
Foo
============================
*
Takes in a read-only unsigned int. A table is created to store counters
for each digit. If any digit's counter is flipped higher than 1, the function
returns. For example, with 48778584:

  0    1    2    3    4    5    6    7    8    9
 [0]  [0]  [0]  [0]  [2]  [1]  [0]  [2]  [3]  [0]

When we iterate over this array, we find that 4 is duplicated and immediately
return false.
*/
bool Foo(int number)
{
    int temp = number;
    int digitTable[10] = {0};
    while(temp > 0)
    {
        digitTable[temp % 10]++; // Last digit's respective index.
        temp /= 10;              // Move to next digit
    }
    for (int i = 0; i < 10; i++)
    {
        if (digitTable[i] > 1)
        {
            return false;
        }
    }
    return true;
}
It's OK, especially for small array lengths. I'd use more efficient approaches (fewer than n^2/2 comparisons) if the array is much bigger - see DeadMG's answer.
Some small corrections for your code:
Instead of int j = i write int j = i + 1 and you can omit your if(j != i) test.
You shouldn't need to declare the i variable outside the for statement.
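With those two changes, the comparison loops from the question would look roughly like this:
for (int i = 0; i < 6; i++) {
    for (int j = i + 1; j < 6; j++) {   // start at i + 1: no self-comparison needed
        if (userNumbers[i] == userNumbers[j]) {
            b = true;
        }
    }
}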
I think #Michael Jaison G's solution is really brilliant; I modified his code a little to avoid sorting. (By using unordered_set, the algorithm may be a little faster.)
template <class Iterator>
bool isDuplicated(Iterator begin, Iterator end) {
    using T = typename std::iterator_traits<Iterator>::value_type;
    std::unordered_set<T> values(begin, end);
    std::size_t size = std::distance(begin, end);
    return size != values.size();
}
//std::unique(_copy) requires a sorted container.
std::sort(cont.begin(), cont.end());
//testing if cont has duplicates
std::unique(cont.begin(), cont.end()) != cont.end();
//getting a new container with no duplicates
std::unique_copy(cont.begin(), cont.end(), std::back_inserter(cont2));
#include <iostream>
#include <algorithm>

int main(){
    int arr[] = {3, 2, 3, 4, 1, 5, 5, 5};
    int len = sizeof(arr) / sizeof(*arr); // Finding length of array
    std::sort(arr, arr + len);
    int unique_elements = std::unique(arr, arr + len) - arr;
    if(unique_elements == len) std::cout << "Duplicate number is not present here\n";
    else std::cout << "Duplicate number present in this array\n";
    return 0;
}
As mentioned by #underscore_d, an elegant and efficient solution would be,
#include <algorithm>
#include <iostream>
#include <iterator>
#include <vector>

template <class Iterator>
bool has_duplicates(Iterator begin, Iterator end) {
    using T = typename std::iterator_traits<Iterator>::value_type;
    std::vector<T> values(begin, end);
    std::sort(values.begin(), values.end());
    return (std::adjacent_find(values.begin(), values.end()) != values.end());
}

int main() {
    int user_ids[6];
    // ...
    std::cout << has_duplicates(user_ids, user_ids + 6) << std::endl;
}
Fast O(N) time and space solution; it returns as soon as it hits a duplicate.
template <typename T>
bool containsDuplicate(vector<T>& items) {
    return any_of(items.begin(), items.end(), [s = unordered_set<T>{}](const auto& item) mutable {
        return !s.insert(item).second;
    });
}
Not enough karma to post a comment. Hence a post.
vector<int> numArray = { 1,2,1,4,5 };
unordered_map<int, bool> hasDuplicate;
bool flag = false;
for (auto i : numArray)
{
    if (hasDuplicate[i])
    {
        flag = true;
        break;
    }
    else
        hasDuplicate[i] = true;
}
cout << (flag ? "Duplicate" : "No duplicate");

Why is the complexity of std::unordered_set operator==() N^2?

I have two vectors v1 and v2 of type std::vector<std::string>. Both vectors have unique values and should compare equal if their values compare equal, independent of the order the values appear in the vector.
I assume two sets of type std::unordered_set would have been a better choice, but I take it as it is, so two vectors.
Nevertheless, I thought for the needed order-insensitive comparison I'd just use operator== from std::unordered_set by copying into two std::unordered_set instances. Very much like this:
bool oi_compare1(std::vector<std::string> const& v1,
                 std::vector<std::string> const& v2)
{
    std::unordered_set<std::string> tmp1(v1.begin(), v1.end());
    std::unordered_set<std::string> tmp2(v2.begin(), v2.end());
    return tmp1 == tmp2;
}
While profiling I noticed this function consuming a lot of time, so I checked the documentation and saw the O(n*n) complexity there. I am confused; I was expecting O(n*log(n)), as for e.g. the following naive solution I came up with:
bool oi_compare2(std::vector<std::string> const& v1,
                 std::vector<std::string> const& v2)
{
    if(v1.size() != v2.size())
        return false;
    auto tmp = v2;
    size_t const size = tmp.size();
    for(size_t i = 0; i < size; ++i)
    {
        bool flag = false;
        for(size_t j = i; j < size; ++j)
            if(v1[i] == tmp[j]){
                flag = true;
                std::swap(tmp[i], tmp[j]);
                break;
            }
        if(!flag)
            return false;
    }
    return true;
}
Why the O(n*n) complexity for std::unordered_set, and is there a built-in function I can use for order-insensitive comparison?
EDIT----
BENCHMARK
#include <unordered_set>
#include <chrono>
#include <iostream>
#include <string>
#include <vector>

bool oi_compare1(std::vector<std::string> const& v1,
                 std::vector<std::string> const& v2)
{
    std::unordered_set<std::string> tmp1(v1.begin(), v1.end());
    std::unordered_set<std::string> tmp2(v2.begin(), v2.end());
    return tmp1 == tmp2;
}

bool oi_compare2(std::vector<std::string> const& v1,
                 std::vector<std::string> const& v2)
{
    if(v1.size() != v2.size())
        return false;
    auto tmp = v2;
    size_t const size = tmp.size();
    for(size_t i = 0; i < size; ++i)
    {
        bool flag = false;
        for(size_t j = i; j < size; ++j)
            if(v1[i] == tmp[j]){
                flag = true;
                std::swap(tmp[i], tmp[j]);
                break;
            }
        if(!flag)
            return false;
    }
    return true;
}

int main()
{
    std::vector<std::string> s1{"1","2","3"};
    std::vector<std::string> s2{"1","3","2"};
    std::cout << std::boolalpha;
    for(size_t i = 0; i < 15; ++i)
    {
        auto tmp1 = s1;
        for(auto &iter : tmp1)
            iter = std::to_string(i) + iter;
        s1.insert(s1.end(), tmp1.begin(), tmp1.end());
        s2.insert(s2.end(), tmp1.begin(), tmp1.end());
    }
    std::cout << "size1 " << s1.size() << std::endl;
    std::cout << "size2 " << s2.size() << std::endl;
    for(auto && c : {oi_compare1, oi_compare2})
    {
        auto start = std::chrono::steady_clock::now();
        bool flag = true;
        for(size_t i = 0; i < 10; ++i)
            flag = flag && c(s1, s2);
        std::cout << "ms=" << std::chrono::duration_cast<std::chrono::milliseconds>(std::chrono::steady_clock::now() - start).count() << " flag=" << flag << std::endl;
    }
    return 0;
}
gives
size1 98304
size2 98304
ms=844 flag=true
ms=31 flag=true
--> naive approach way faster.
For all the complexity O(N*N) experts here...
Let me go through this naive approach. I have two loops there. The first loop runs from i=0 to size, which is N. The inner loop is called from j=i!!!!!! to N. In spoken language it means I call the inner loop N times. But the complexity of the inner loop is log(n) due to the starting index of j = i!!!!. If you still don't believe me, calculate the complexity from the benchmarks and you will see...
EDIT2---
LIVE ON WANDBOX
https://wandbox.org/permlink/v26oxnR2GVDb9M6y
Since unordered_set is built on a hash map, the logic to compare lhs == rhs will be:
Check the sizes of lhs and rhs; if not equal, return false.
For each item in lhs, find it in rhs and compare.
For a hash map, a single find for an item in rhs is O(n) in the worst case, so the worst-case time complexity is O(n^2). However, normally you get a time complexity of O(n).
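As a rough sketch (not the actual library implementation), that comparison logic could look like this:
// Conceptual sketch of how two unordered_sets are compared (illustrative only)
template <class Set>
bool conceptually_equal(const Set& lhs, const Set& rhs)
{
    if (lhs.size() != rhs.size())
        return false;
    for (const auto& key : lhs)
    {
        // find() is O(1) on average, but O(n) when many keys collide
        if (rhs.find(key) == rhs.end())
            return false;
    }
    return true;
}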
I'm sorry to tell you, your benchmark of operator== is faulty.
oi_compare1 accepts 2 vectors and needs to build up 2 complete unordered_set instances, to then call operator== and destroy the whole bunch again.
oi_compare2 also accepts 2 vectors and immediately uses them for the comparison on size. It only copies 1 instance (v2 to tmp), which is much more performant for a vector.
operator==
Looking at the documentation: https://en.cppreference.com/w/cpp/container/unordered_set/operator_cmp we can see the expected complexity:
Proportional to N calls to operator== on value_type, calls to the predicate returned by key_eq, and calls to the hasher returned by hash_function, in the average case, proportional to N² in the worst case where N is the size of the container.
edit
There is a simple algorithm: you can loop over the one unordered_set and do a simple lookup in the other one. Without hash collisions, it will find each element in its own internal bucket and compare it for equality, since the hash alone isn't sufficient.
Assuming you don't have hash collisions, each element of the unordered_set has a stable order in which it is stored. One could loop over the internal buckets and compare the elements 2-by-2 (the 1st of one set with the 1st of the other, the 2nd with the 2nd ...). This nicely gives O(N). This doesn't work when the buckets you store the values in have different sizes, or when the assignment of buckets uses a different calculation to deal with collisions.
Assuming you are unlucky and every element results in the same hash (known as hash flooding), you end up with a list of elements without order. To compare, you have to check for each element whether it exists in the other one, causing O(N*N).
This last one is easily reproducible if you rig your hash to always return the same number. Build the one set in the reverse order of the other one.
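A hedged sketch of how such a worst case could be rigged, with a hasher that always returns the same value so every element lands in one bucket:
#include <string>
#include <unordered_set>

// Degenerate hasher: every key hashes to 0, so all keys share one bucket
struct BadHash
{
    size_t operator()(const std::string&) const { return 0; }
};

// Comparing two such sets degrades towards O(N*N),
// because each lookup has to walk one long collision chain.
std::unordered_set<std::string, BadHash> a, b;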

Which STL to use to find index by value in O(1) in C++

Say I have an array arr[] = {1, 3, 5, 12124, 24354, 12324, 5}
I want to know the index of the value 5 (i.e., 2) in O(1).
How should I go about this?
P.S.:
1. Throughout my program, I shall be finding only indices and not vice versa (getting the value by index).
2. The array can have duplicates.
If you can guarantee there are no duplicates in the array, your best bet is probably creating an unordered_map where the map key is the array value and the map value is its index.
I wrote a method below that converts an array to an unordered_map.
#include <unordered_map>
#include <iostream>

template <typename T>
void arrayToMap(const T arr[], size_t arrSize, std::unordered_map<T, int>& map)
{
    for(int i = 0; i < arrSize; ++i) {
        map[arr[i]] = i;
    }
}

int main()
{
    int arr[] = { 1, 3, 5, 12124, 24354, 12324, 5 };
    std::unordered_map<int, int> map;
    arrayToMap(arr, sizeof(arr)/sizeof(*arr), map);

    std::cout << "Value" << '\t' << "Index" << std::endl;
    for(auto it = map.begin(), e = map.end(); it != e; ++it) {
        std::cout << it->first << "\t" << it->second << std::endl;
    }
}
However, in your example you use the value 5 twice. This causes strange output in the above code: the resulting map does not have an entry with index 2. Even if you used an array, you would be confronted with a similar problem (i.e. should you use the value at 2 or at 6?).
If you really need both values, you could use unordered_multimap, but the syntax for accessing elements isn't as easy as using operator[] (you have to use unordered_multimap::find(), which returns an iterator).
template <typename T>
void arrayToMap(const T arr[], size_t arrSize, std::unordered_multimap<T, int>& map)
{
    for(int i = 0; i < arrSize; ++i) {
        map.emplace(arr[i], i);
    }
}
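For completeness, a small illustrative sketch of reading all indices stored for one value out of the multimap built above, via equal_range:
// look up every index recorded for the value 5 (illustrative)
auto range = map.equal_range(5);
for (auto it = range.first; it != range.second; ++it) {
    std::cout << "5 is at index " << it->second << std::endl;
}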
Finally, you should consider that unordered_map's fast O(1) look-up time comes with some overhead, so it uses more memory than a simple array. But if you end up using an array (which is comparatively much more memory efficient), searching for a specific value is guaranteed to be O(n), where n is the index of the value.
Edit - If you need the duplicate with the lowest index to be kept instead of the highest, you can just reverse the order of insertion:
template <typename T>
void arrayToMap(const T arr[], size_t arrSize, std::unordered_map<T, int>& map)
{
    for(int i = arrSize - 1; i >= 0; --i) {
        map[arr[i]] = i;
    }
}
Use std::unordered_map from C++11 to map the elements as keys and the indices as values. Then you can get the answer to your query in amortized O(1) complexity. std::unordered_map will work because there are no duplicates, as you said, but it costs you linear extra space.
If your values' range is not too large, you can use a plain array as well, as sketched below. This yields an even better Θ(1) complexity.
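A minimal sketch of that direct-indexing idea, assuming the values are known to fit in a small range such as [0..99] (the names below are illustrative):
#include <algorithm>

int value_to_index[100];
std::fill(value_to_index, value_to_index + 100, -1);  // -1 means "value not present"
for (int i = 0; i < arrSize; ++i)
    value_to_index[arr[i]] = i;                       // keeps the last index seen for duplicates
// value_to_index[5] now answers the query in Theta(1)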
Use unordered_multimap (C++11 only) with the value as the key and the position index as the value.

Time-efficient way to count number of distinct numbers

get_number() returns an integer. I'm going to call it 30 times and count the number of distinct integers returned. My plan is to put these numbers into an std::array<int,30>, sort it and then use std::unique.
Is that a good solution? Is there a better one? This piece of code will be the bottleneck of my program.
I'm thinking there should be a hash-based solution, but maybe its overhead would be too much when I've only got 30 elements?
Edit I changed unique to distinct. Example:
{1,1,1,1} => 1
{1,2,3,4} => 4
{1,3,3,1} => 2
I would use std::set<int> as it's simpler:
std::set<int> s;
for(/*loop 30 times*/)
{
    s.insert(get_number());
}
std::cout << s.size() << std::endl; // You get the count of distinct numbers

If you want to count how many times each number was returned, I'd suggest a map:
std::map<int, int> s;
for(int i = 0; i < 30; i++)
{
    s[get_number()]++;
}
std::cout << s.size() << std::endl; // total count of distinct numbers returned
for (auto it : s)
{
    std::cout << it.first << " " << it.second << std::endl; // each number and its count
}
The simplest solution would be to use a std::map:
std::map<int, size_t> counters;
for (size_t i = 0; i != 30; ++i) {
    counters[getNumber()] += 1;
}

std::vector<int> uniques;
for (auto const& pair: counters) {
    if (pair.second == 1) { uniques.push_back(pair.first); }
}
// uniques now contains the items that only appeared once.
Using a std::map, std::set or the std::sort algorithm will give you O(n*log(n)) complexity. For a small to large number of elements it is perfectly correct. But you use a known integer range, and this opens the door to lots of optimizations.
As you say (in a comment) that the range of your integers is known and short, [0..99], I would recommend implementing a modified counting sort. See: http://en.wikipedia.org/wiki/Counting_sort
You can count the number of distinct items while doing the sort itself, removing the need for the std::unique call. The whole complexity would be O(n). Another advantage is that the memory needed is independent of the number of input items. If you had 30,000,000,000 integers to sort, it would not need a single supplementary byte to count the distinct items.
Even if the range of allowed integer values is large, say [0..10,000,000], the memory consumed would be quite low. Indeed, an optimized version could consume as little as 1 bit per allowed integer value. That is less than 2 MB of memory, or 1/1000th of a laptop's RAM.
Here is a short example program:
#include <cstdlib>
#include <algorithm>
#include <iostream>
#include <vector>

// A function returning an integer between [0..99]
int get_number()
{
    return rand() % 100;
}

int main(int argc, char* argv[])
{
    // reserves one bucket for each possible integer
    // and initialize to 0
    std::vector<int> cnt_buckets(100, 0);
    int nb_distincts = 0;

    // Get 30 numbers and count distincts
    for(int i = 0; i < 30; ++i)
    {
        int number = get_number();
        std::cout << number << std::endl;
        if(0 == cnt_buckets[number])
            ++nb_distincts;
        // We could optimize by doing this only the first time
        ++cnt_buckets[number];
    }

    std::cerr << "Total distincts numbers: " << nb_distincts << std::endl;
}
You can see it working:
$ ./main | sort | uniq | wc -l
Total distincts numbers: 26
26
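The 1-bit-per-value variant mentioned above could be sketched with std::bitset, assuming the same [0..99] range:
#include <bitset>

std::bitset<100> seen;        // one bit per possible value
int nb_distincts = 0;
for (int i = 0; i < 30; ++i)
{
    int number = get_number();
    if (!seen[number])
    {
        seen[number] = true;
        ++nb_distincts;
    }
}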
The simplest way is just to use std::set.
std::set<int> s;
int uniqueCount = 0;
for( int i = 0; i < 30; ++i )
{
    int n = get_number();
    if( s.find(n) != s.end() ) {
        continue;               // already seen, not a new distinct value
    }
    s.insert( n );
    ++uniqueCount;
}
// now s contains unique numbers
// and uniqueCount contains the number of distinct integers returned
Using an array and sort seems good, but unique may be a bit overkill if you just need to count distinct values. The following function should return number of distinct values in a sorted range.
template<typename ForwardIterator>
size_t distinct(ForwardIterator begin, ForwardIterator end) {
    if (begin == end) return 0;
    size_t count = 1;
    ForwardIterator prior = begin;
    while (++begin != end)
    {
        if (*prior != *begin)
            ++count;
        prior = begin;
    }
    return count;
}
In contrast to the set- or map-based approaches, this one does not need any heap allocation and the elements are stored contiguously in memory; therefore it should be much faster. The asymptotic time complexity (including the sort) is O(N log N), which is the same as when using an associative container. I bet that even your original solution of using std::sort followed by std::unique would be much faster than using std::set.
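Used together with the std::array from the question, the call could look roughly like this (a sketch; get_number() is the function from the question):
#include <algorithm>
#include <array>

std::array<int, 30> numbers;
for (auto& n : numbers)
    n = get_number();
std::sort(numbers.begin(), numbers.end());  // distinct() requires a sorted range
size_t distinct_count = distinct(numbers.begin(), numbers.end());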
Try a set, try an unordered set, try sort and unique, try something else that seems fun.
Then MEASURE each one. If you want the fastest implementation, there is no substitute for trying out real code and seeing what it really does.
Your particular platform and compiler and other particulars will surely matter, so test in an environment as close as possible to where it will be running in production.

How to delete items from a std::vector given a list of indices

I have a vector of items items, and a vector of indices that should be deleted from items:
std::vector<T> items;
std::vector<size_t> indicesToDelete;
items.push_back(a);
items.push_back(b);
items.push_back(c);
items.push_back(d);
items.push_back(e);
indicesToDelete.push_back(3);
indicesToDelete.push_back(0);
indicesToDelete.push_back(1);
// given these 2 data structures, I want to remove items so it contains
// only c and e (deleting indices 3, 0, and 1)
// ???
What's the best way to perform the deletion, knowing that with each deletion, it affects all other indices in indicesToDelete?
A couple ideas would be to:
Copy items to a new vector one item at a time, skipping if the index is in indicesToDelete
Iterate items and for each deletion, decrement all items in indicesToDelete which have a greater index.
Sort indicesToDelete first, then iterate indicesToDelete, and for each deletion increment an indexCorrection which gets subtracted from subsequent indices.
All seem like I'm over-thinking such a seemingly trivial task. Any better ideas?
Edit Here is the solution, basically a variation of #1 but using iterators to define blocks to copy to the result.
template<typename T>
inline std::vector<T> erase_indices(const std::vector<T>& data, std::vector<size_t>& indicesToDelete/* can't assume copy elision, don't pass-by-value */)
{
    if(indicesToDelete.empty())
        return data;

    std::vector<T> ret;
    ret.reserve(data.size() - indicesToDelete.size());

    std::sort(indicesToDelete.begin(), indicesToDelete.end());

    // now we can assume there is at least 1 element to delete. copy blocks at a time.
    typename std::vector<T>::const_iterator itBlockBegin = data.begin();
    for(std::vector<size_t>::const_iterator it = indicesToDelete.begin(); it != indicesToDelete.end(); ++it)
    {
        typename std::vector<T>::const_iterator itBlockEnd = data.begin() + *it;
        if(itBlockBegin != itBlockEnd)
        {
            std::copy(itBlockBegin, itBlockEnd, std::back_inserter(ret));
        }
        itBlockBegin = itBlockEnd + 1;
    }

    // copy last block.
    if(itBlockBegin != data.end())
    {
        std::copy(itBlockBegin, data.end(), std::back_inserter(ret));
    }

    return ret;
}
I would go for 1/3, that is: sort the indices vector, then create two iterators into the data vector, one for reading and one for writing. Initialize the writing iterator to the first element to be removed, and the reading iterator to one beyond that one. Then in each step of the loop advance the iterators to the next value (writing) and the next value not to be skipped (reading) and copy/move the elements. At the end of the loop, call erase to discard the elements beyond the last written-to position. A sketch of that pass is shown below.
BTW, this is the approach implemented in the remove/remove_if algorithms of the STL, with the difference that you maintain the condition in a separate ordered vector.
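A hedged sketch of that read/write pass, assuming items and indicesToDelete are the vectors from the question and the indices are valid and unique:
std::sort(indicesToDelete.begin(), indicesToDelete.end());
if (!indicesToDelete.empty())
{
    size_t write = indicesToDelete[0];     // first slot to overwrite
    size_t nextToDelete = 0;
    for (size_t read = write; read < items.size(); ++read)
    {
        if (nextToDelete < indicesToDelete.size() && read == indicesToDelete[nextToDelete])
        {
            ++nextToDelete;                // skip: this index is marked for deletion
            continue;
        }
        items[write++] = std::move(items[read]);   // keep: move the element down
    }
    items.erase(items.begin() + write, items.end());
}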
std::sort() the indicesToDelete in descending order and then delete from the items in a normal for loop. No need to adjust indices then.
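A minimal sketch of that approach, using the vectors from the question:
// delete from the highest index down, so the remaining indices stay valid
std::sort(indicesToDelete.begin(), indicesToDelete.end(), std::greater<size_t>());
for (size_t idx : indicesToDelete)
    items.erase(items.begin() + idx);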
It might even be option 4:
If you are deleting a few items from a large number, and know that there will never be a high density of deleted items:
Replace each of the items at indices which should be deleted with 'tombstone' values, indicating that there is nothing valid at those indices, and make sure that whenever you access an item, you check for a tombstone.
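A rough sketch of the tombstone idea using std::optional (C++17) as the placeholder; purely illustrative:
#include <optional>
#include <vector>

std::vector<std::optional<T>> items;    // std::nullopt plays the role of the tombstone

for (size_t idx : indicesToDelete)
    items[idx] = std::nullopt;          // O(1) per "deletion", nothing is shifted

// every access must check for the tombstone:
// if (items[i]) { use(*items[i]); }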
It depends on how many items you are deleting.
If you are deleting many items, it may make sense to copy the items that are not deleted to a new vector and then replace the old vector with the new one (after sorting indicesToDelete). That way, you avoid compacting the vector after each delete, which is an O(n) operation, possibly making the entire process O(n^2).
If you are deleting a few items, perhaps do the deletion in reverse index order (assuming the indices are sorted); then you do not need to adjust them as items get deleted.
Since the discussion has somewhat transformed into a performance related question, I've written up the following code. It uses remove_if and vector::erase, which should move the elements a minimal number of times. There's a bit of overhead, but for large cases, this should be good.
However, if you don't care about the relative order of elements, then this will not be all that fast.
#include <algorithm>
#include <iostream>
#include <string>
#include <vector>
#include <set>

using std::vector;
using std::string;
using std::remove_if;
using std::cout;
using std::endl;
using std::set;

struct predicate {
public:
    predicate(const vector<string>::iterator & begin, const vector<size_t> & indices) {
        m_begin = begin;
        m_indices.insert(indices.begin(), indices.end());
    }

    bool operator()(string & value) {
        const int index = distance(&m_begin[0], &value);
        set<size_t>::iterator target = m_indices.find(index);
        return target != m_indices.end();
    }

private:
    vector<string>::iterator m_begin;
    set<size_t> m_indices;
};

int main() {
    vector<string> items;
    items.push_back("zeroth");
    items.push_back("first");
    items.push_back("second");
    items.push_back("third");
    items.push_back("fourth");
    items.push_back("fifth");

    vector<size_t> indicesToDelete;
    indicesToDelete.push_back(3);
    indicesToDelete.push_back(0);
    indicesToDelete.push_back(1);

    vector<string>::iterator pos = remove_if(items.begin(), items.end(), predicate(items.begin(), indicesToDelete));
    items.erase(pos, items.end());

    for (int i = 0; i < items.size(); ++i)
        cout << items[i] << endl;
}
The output for this would be:
second
fourth
fifth
There is a bit of performance overhead that can still be reduced. In remove_if (at least on gcc), the predicate is copied by value for each element in the vector. This means that we're possibly invoking the copy constructor on the set m_indices each time. If the compiler is not able to get rid of this, then I would recommend passing the indices in as a set and storing it as a const reference.
We could do that as follows:
struct predicate {
public:
    predicate(const vector<string>::iterator & begin, const set<size_t> & indices) : m_begin(begin), m_indices(indices) {
    }

    bool operator()(string & value) {
        const int index = distance(&m_begin[0], &value);
        set<size_t>::iterator target = m_indices.find(index);
        return target != m_indices.end();
    }

private:
    const vector<string>::iterator & m_begin;
    const set<size_t> & m_indices;
};

int main() {
    vector<string> items;
    items.push_back("zeroth");
    items.push_back("first");
    items.push_back("second");
    items.push_back("third");
    items.push_back("fourth");
    items.push_back("fifth");

    set<size_t> indicesToDelete;
    indicesToDelete.insert(3);
    indicesToDelete.insert(0);
    indicesToDelete.insert(1);

    vector<string>::iterator pos = remove_if(items.begin(), items.end(), predicate(items.begin(), indicesToDelete));
    items.erase(pos, items.end());

    for (int i = 0; i < items.size(); ++i)
        cout << items[i] << endl;
}
Basically the key to the problem is remembering that if you delete the object at index i, and don't use a tombstone placeholder, then the vector must make a copy of all of the objects after i. This applies to every possibility you suggested except #1. Copying to a new list makes one copy no matter how many you delete, making it by far the fastest answer.
And as David Rodríguez said, sorting the list of indexes to be deleted allows for some minor optimizations, but it may only be worth it if you're deleting more than 10-20 (please profile first).
Here is my solution for this problem, which keeps the order of the original "items":
create a "mask" vector and initialize (fill) it with "false" values;
change the values of the mask to "true" for all the indices you want to remove;
loop over all members of "mask" and erase from both vectors, "items" and "mask", the elements with "true" values.
Here is the code sample:
#include <iostream>
#include <vector>
using namespace std;
int main()
{
vector<unsigned int> items(12);
vector<unsigned int> indicesToDelete(3);
indicesToDelete[0] = 3;
indicesToDelete[1] = 0;
indicesToDelete[2] = 1;
for(int i=0; i<12; i++) items[i] = i;
for(int i=0; i<items.size(); i++)
cout << "items[" << i << "] = " << items[i] << endl;
// removing indeces
vector<bool> mask(items.size());
vector<bool>::iterator mask_it;
vector<unsigned int>::iterator items_it;
for(size_t i = 0; i < mask.size(); i++)
mask[i] = false;
for(size_t i = 0; i < indicesToDelete.size(); i++)
mask[indicesToDelete[i]] = true;
mask_it = mask.begin();
items_it = items.begin();
while(mask_it != mask.end()){
if(*mask_it){
items_it = items.erase(items_it);
mask_it = mask.erase(mask_it);
}
else{
mask_it++;
items_it++;
}
}
for(int i=0; i<items.size(); i++)
cout << "items[" << i << "] = " << items[i] << endl;
return 0;
}
This is not a fast implementation for use with large data sets: the erase() method takes time to rearrange the vector after eliminating an element.