What is time complexity for std::find_if() function using std::set in C++?
Consider we have the following example:
auto cmp = [&](const pair<int, set<int>>& a , const pair<int, set<int>>& b) -> bool {
if (a.second.size() == b.second.size()) {
return a.first < b.first;
}
return a.second.size() < b.second.size();
};
set<pair<int, set<int>>, decltype(cmp)> tree(cmp);
...
int value = ...;
auto it = find_if(tree.begin(), tree.end(), [value](const pair<int, set<int>> &p) {
return p.first == value;
});
I know std::find_if takes O(n) time on a std::vector. I can't see any reason why it couldn't run in O(log(n)) time on a std::set.
The complexity for std::find, std::find_if, and std::find_if_not is O(N). It does not matter what type of container you are using, as the function is basically implemented like this:
template<class InputIt, class UnaryPredicate>
constexpr InputIt find_if(InputIt first, InputIt last, UnaryPredicate p)
{
for (; first != last; ++first) {
if (p(*first)) {
return first;
}
}
return last;
}
As you can see, it just does a linear scan from first to last.
If you want to take advantage of std::set being a sorted container, then you can use std::set::find which has O(logN) complexity. You can get a find_if type of behavior by using a comparator that is transparent.
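As a minimal sketch of the transparent-comparator idea (C++14 or later): the set in the question is ordered by the size of `second` first, so a lookup by `first` alone cannot use that ordering. The example below therefore assumes a hypothetical set whose comparator orders entries by `first` only; the names `Entry` and `ByFirst` are illustrative and not from the question.

```cpp
#include <set>
#include <utility>

// Hypothetical element type, ordered by `first` only.
using Entry = std::pair<int, int>;

struct ByFirst {
    // Declaring is_transparent enables the template overloads of
    // std::set::find, lower_bound, etc. (heterogeneous lookup, C++14).
    using is_transparent = void;

    bool operator()(const Entry& a, const Entry& b) const { return a.first < b.first; }
    bool operator()(const Entry& a, int key) const { return a.first < key; }
    bool operator()(int key, const Entry& b) const { return key < b.first; }
};

// Looks up an entry by its key in O(log N) without building an Entry.
inline bool has_key(const std::set<Entry, ByFirst>& tree, int key) {
    return tree.find(key) != tree.end();
}
```

With this comparator, `tree.find(4)` walks the tree using the ordering instead of scanning every element.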
std::find_if is equal opportunity because it can't take advantage of any special magic from the container. It just iterates the container looking for a match, so worst case is always O(N).
See documentation for std::find_if starting at the section on Complexity and pay special attention to the possible implementations.
Note that std::find_if's actual performance will likely be MUCH worse with a set than that of a vector of the same size because of the lack of locality in the data stored.
This code has the Visual Studio error C3892. If I change std::set to std::vector - it works.
std::set<int> a;
a.erase(std::remove_if(a.begin(), a.end(), [](int item)
{
return item == 10;
}), a.end());
What's wrong? Why can't I use std::remove_if with std::set?
You cannot use std::remove_if() with sequences which have const parts. The sequence of std::set<T> elements is made up of T const objects. We actually discussed this question just yesterday at the standard C++ committee and there is some support to create algorithms dealing specifically with erase()ing objects from containers. It would look something like this (see also N4009):
template <class T, class Comp, class Alloc, class Predicate>
void discard_if(std::set<T, Comp, Alloc>& c, Predicate pred) {
for (auto it{c.begin()}, end{c.end()}; it != end; ) {
if (pred(*it)) {
it = c.erase(it);
}
else {
++it;
}
}
}
(It would probably delegate to an algorithm dispatching to the logic above, as the same logic applies to other node-based containers.)
For your specific use, you can use
a.erase(10);
but this only works if you want to remove a key, while the algorithm above works with arbitrary predicates. On the other hand, a.erase(10) can take advantage of std::set<int>'s structure and will be O(log N) while the algorithm is O(N) (with N == a.size()).
Starting with C++20, you can use std::erase_if for containers with an erase() method, just as Kühl explained.
// C++20 example:
std::erase_if(setUserSelection, [](auto& pObject) {
return !pObject->isSelectable();
});
Notice that this also includes std::vector, as it has an erase method. No more chaining a.erase(std::remove_if(... :)
std::remove_if re-orders elements, so it cannot be used with std::set. But you can use std::set::erase:
std::set<int> a;
a.erase(10);
I have looked at find and binary_search, but find doesn't take advantage of the fact that the vector is sorted, and binary_search only returns a true or false, not where it found the value. Is there any function that can give me the best of both worlds?
You can use find to locate a particular element in any container in O(N) time. Because a vector offers random access, a sorted vector can also use the binary-search family of std algorithms (std::lower_bound, std::upper_bound, and std::equal_range), which need only O(log N) comparisons. std::lower_bound will do that for you; it appears in the equivalent-behavior section at the top of the binary_search documentation. binary_search itself is limited to yes-or-no answers (maybe the name should be improved in a future version of C++, e.g. binary_in()).
There is a method, std::equal_range, which will give you a pair containing the lower and upper bound of the subset holding the desired value. If both of those items in the pair are identical, then the value you were looking for doesn't exist.
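A minimal sketch of the std::equal_range approach described above, assuming the vector is already sorted with operator<. The helper name `find_sorted` is hypothetical.

```cpp
#include <algorithm>
#include <vector>

// Returns an iterator to the first element equal to `value` in a sorted
// vector, or v.end() if no such element exists.
template <class T>
typename std::vector<T>::const_iterator
find_sorted(const std::vector<T>& v, const T& value) {
    // equal_range returns [lower_bound, upper_bound) for `value`.
    auto range = std::equal_range(v.begin(), v.end(), value);
    // An empty range (both iterators identical) means the value is absent.
    return range.first != range.second ? range.first : v.end();
}
```

Unlike binary_search, the result tells you where the value is, and `range.second - range.first` would give the number of equal elements.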
template<class T, class U>
bool contains(const std::vector<T>& container, const U& v)
{
auto it = std::lower_bound(
container.begin(),
container.end(),
v,
[](const T& l, const U& r){ return l < r; });
return it != container.end() && *it == v;
}
I have a map:
std::map<TyString, int> myMap;
However, in some cases I want to std::map::find an entry by making the comparison TyString == TyStringRef, i.e.
myMap.find(TyStringRef("MyString"));
The reason is that TyString wraps a const char * that it allocates and deallocates by itself.
However, for only finding an entry I don't like to allocate a new string, instead I want to use only the reference (TyStringRef only wraps a const char * without allocating or deallocating memory).
Of course I can just convert the TyStringRef to a TyString, but then I have the memory overhead described above.
Is there an intelligent way to solve this?
Thanks!
Note that std::map::find uses operator< by default, or a user-defined comparison functor. So unless you overload operator< for TyString and TyStringRef, you can't look up a key in logarithmic time. With operator== overloaded, you can still look up a key in linear time, but not using std::map::find.
For this, you should use a generic algorithm from #include <algorithm>, which is independent from the container classes. It can take any type T and compares it using operator== on the result of operator*() of the iterators you pass in.
std::find(sequence.begin(), sequence.end(), myKey);
However, there is one problem: Since you have a std::map, which uses pairs for the iterators, the key-value-pair will be compared. So you have to use std::find_if, which takes a predicate instead of a value to search for. This predicate should return true for the element you are looking for. You want to have the element (pair) for which first == myKey, so you end up with a code like this:
std::find_if(myMap.begin(), myMap.end(), [](const std::pair<TyString,int> & pair) {
return pair.first == TyStringRef("MyString");
});
This conceptually works, but it won't make use of the binary tree in std::map. So it will take linear time compared to logarithmic time of std::map::find.
There is an alternative, which looks a bit strange at first, but needs only a logarithmic number of comparisons. (Note that std::map iterators are bidirectional rather than random-access, so std::lower_bound still advances the iterator a linear number of times.) It requires you to overload operator<(TyString,TyStringRef). You can use std::lower_bound to find the first element that is not less than (i.e. greater than or equal to) a given element with respect to a given comparison function.
std::lower_bound(myMap.begin(), myMap.end(), TyStringRef("MyString"),
[](const std::pair<TyString,int> & entry, const TyStringRef & stringRef) {
return entry.first < stringRef;
}
);
After the "lower bound" is found, you still have to test whether the keys compare equal. If they don't, the element was not found. It is also possible that all elements compare less than the element you're looking for, in which case the returned iterator is the end iterator, which must not be dereferenced. So the full code becomes this, which is analogous to std::map::find and returns the end iterator if the key wasn't found:
template<class Map, class KeyCompareType,
class Iterator = typename Map::const_iterator>
Iterator findInMap(const Map &map, const KeyCompareType &key)
{
typedef typename Map::value_type value_type;
auto predicate = [](const value_type & entry, const KeyCompareType & key) {
return entry.first < key;
};
Iterator it = std::lower_bound(map.begin(), map.end(), key, predicate);
if (it != map.end()) {
if (!(it->first == key))
it = map.end();
}
return it;
}
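If C++14 is available, another option is heterogeneous lookup: when the map's comparator declares `is_transparent`, std::map::find gains a template overload that accepts any comparable key type and walks the tree in logarithmic time. The sketch below uses std::string and a hypothetical `StringRef` struct as stand-ins for TyString and TyStringRef, whose real interfaces are not shown in the question.

```cpp
#include <map>
#include <string>

// Hypothetical stand-in for TyStringRef: a non-owning view of a C string.
struct StringRef {
    const char* data;
};

// Transparent comparator: orders std::string keys and also compares them
// directly against StringRef, so no temporary std::string is allocated.
struct StringLess {
    using is_transparent = void;

    bool operator()(const std::string& a, const std::string& b) const { return a < b; }
    bool operator()(const std::string& a, StringRef b) const { return a.compare(b.data) < 0; }
    bool operator()(StringRef a, const std::string& b) const { return b.compare(a.data) > 0; }
};

using StringMap = std::map<std::string, int, StringLess>;

// Returns the mapped value for `key`, or `fallback` if the key is absent.
inline int lookup_or(const StringMap& m, StringRef key, int fallback) {
    auto it = m.find(key);  // template overload enabled by is_transparent
    return it != m.end() ? it->second : fallback;
}
```

The mixed overloads must order keys consistently with the string-to-string overload, otherwise the lookup result is unspecified.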
You could use STLport, which already does this on its own. Maybe other standard library implementations do the same? Alternatively, you could use std::find(), but that would cost you the logarithmic lookup.
Profiling my CPU-bound code has suggested that I spend a long time checking whether a container holds only unique elements. Assuming that I have some large container of unsorted elements (with < and == defined), I have two ideas on how this might be done:
The first using a set:
template <class T>
bool is_unique(vector<T> X) {
set<T> Y(X.begin(), X.end());
return X.size() == Y.size();
}
The second looping over the elements:
template <class T>
bool is_unique2(vector<T> X) {
typename vector<T>::iterator i,j;
for(i=X.begin();i!=X.end();++i) {
for(j=i+1;j!=X.end();++j) {
if(*i == *j) return 0;
}
}
return 1;
}
I've tested them the best I can, and from what I can gather from reading the documentation about STL, the answer is (as usual), it depends. I think that in the first case, if all the elements are unique it is very quick, but if there is a large degeneracy the operation seems to take O(N^2) time. For the nested iterator approach the opposite seems to be true: it is lightning fast if X[0]==X[1] but takes (understandably) O(N^2) time if all the elements are unique.
Is there a better way to do this, perhaps an STL algorithm built for this very purpose? If not, are there any suggestions to eke out a bit more efficiency?
Your first example should be O(N log N) as set takes log N time for each insertion. I don't think a faster O is possible.
The second example is obviously O(N^2). The coefficient and memory usage are low, so it might be faster (or even the fastest) in some cases.
It depends what T is, but for generic performance, I'd recommend sorting a vector of pointers to the objects.
template< class T >
bool dereference_less( T const *l, T const *r )
{ return *l < *r; }
template <class T>
bool is_unique(vector<T> const &x) {
vector< T const * > vp;
vp.reserve( x.size() );
for ( size_t i = 0; i < x.size(); ++ i ) vp.push_back( &x[i] );
sort( vp.begin(), vp.end(), ptr_fun( &dereference_less<T> ) ); // O(N log N)
return adjacent_find( vp.begin(), vp.end(),
not2( ptr_fun( &dereference_less<T> ) ) ) // "opposite functor"
== vp.end(); // true if no adjacent pair (vp_n,vp_n+1) has !(*vp_n < *vp_n+1), i.e. no duplicates
}
or in STL style,
template <class I>
bool is_unique(I first, I last) {
typedef typename iterator_traits<I>::value_type T;
…
And if you can reorder the original vector, of course,
template <class T>
bool is_unique(vector<T> &x) {
sort( x.begin(), x.end() ); // O(N log N)
return adjacent_find( x.begin(), x.end() ) == x.end();
}
You must sort the vector if you want to quickly determine if it has only unique elements. Otherwise the best you can do is O(n^2) runtime or O(n log n) runtime with O(n) space. I think it's best to write a function that assumes the input is sorted.
template<class Fwd>
bool is_unique(Fwd first, Fwd last)
{
return adjacent_find(first, last) == last;
}
then have the client sort the vector, or make a sorted copy of the vector. This will open a door for dynamic programming. That is, if the client sorted the vector in the past then they have the option to keep and refer to that sorted vector so they can repeat this operation for O(n) runtime.
The standard library has std::unique, but that would require you to make a copy of the entire container (note that in both of your examples you make a copy of the entire vector as well, since you unnecessarily pass the vector by value).
template <typename T>
bool is_unique(std::vector<T> vec)
{
std::sort(vec.begin(), vec.end());
return std::unique(vec.begin(), vec.end()) == vec.end();
}
Whether this would be faster than using a std::set would, as you know, depend :-).
Is it infeasible to just use a container that provides this "guarantee" from the get-go? Would it be useful to flag a duplicate at the time of insertion rather than at some point in the future? When I've wanted to do something like this, that's the direction I've gone; just using the set as the "primary" container, and maybe building a parallel vector if I needed to maintain the original order, but of course that makes some assumptions about memory and CPU availability...
For one thing you could combine the advantages of both: stop building the set, if you have already discovered a duplicate:
template <class T>
bool is_unique(const std::vector<T>& vec)
{
std::set<T> test;
for (typename std::vector<T>::const_iterator it = vec.begin(); it != vec.end(); ++it) {
if (!test.insert(*it).second) {
return false;
}
}
return true;
}
BTW, Potatoswatter makes a good point that in the generic case you might want to avoid copying T, in which case you might use a std::set<const T*, dereference_less> instead.
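A minimal sketch of that pointer-based variant, assuming the vector is not modified while the set is alive (so the stored addresses stay valid). The functor name `DerefLess` is hypothetical and stands in for the dereference_less mentioned above.

```cpp
#include <set>
#include <vector>

// Orders pointers by the values they point to.
struct DerefLess {
    template <class T>
    bool operator()(const T* l, const T* r) const { return *l < *r; }
};

// Stops at the first duplicate and never copies an element of type T.
template <class T>
bool is_unique_by_pointer(const std::vector<T>& vec) {
    std::set<const T*, DerefLess> seen;
    for (const T& x : vec) {
        // insert().second is false when an equivalent element already exists.
        if (!seen.insert(&x).second) {
            return false;
        }
    }
    return true;
}
```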
You could of course potentially do much better if it wasn't generic. E.g if you had a vector of integers of known range, you could just mark in an array (or even bitset) if an element exists.
You can use std::unique, but it requires the range to be sorted first:
template <class T>
bool is_unique(vector<T> X) {
std::sort(X.begin(), X.end());
return std::unique(X.begin(), X.end()) == X.end();
}
std::unique modifies the sequence and returns an iterator to the end of the unique set, so if that's still the end of the vector then it must be unique.
This runs in O(n log n), the same as your set example. I don't think you can theoretically guarantee to do it faster, although using a C++0x std::unordered_set instead of std::set would do it in expected linear time - but that requires that your elements be hashable as well as having operator == defined, which might not be so easy.
Also, if you're not modifying the vector in your examples, you'd improve performance by passing it by const reference, so you don't make an unnecessary copy of it.
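The std::unordered_set variant mentioned above could be sketched as follows, assuming T has a std::hash specialization and operator==. The function name `is_unique_hashed` is hypothetical.

```cpp
#include <unordered_set>
#include <vector>

// Expected O(n) time: each insertion is O(1) on average.
template <class T>
bool is_unique_hashed(const std::vector<T>& v) {
    std::unordered_set<T> seen;
    seen.reserve(v.size());  // avoid rehashing while inserting
    for (const T& x : v) {
        // insert().second is false if x was already present.
        if (!seen.insert(x).second) {
            return false;
        }
    }
    return true;
}
```

Note that this copies each element into the set, so it suits cheap-to-copy types best; the worst case with many hash collisions is still O(n^2).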
If I may add my own 2 cents.
First of all, as @Potatoswatter remarked, unless your elements are cheap to copy (built-in/small PODs) you'll want to use pointers to the original elements rather than copying them.
Second, there are 2 strategies available:
1. Simply ensure there is no duplicate inserted in the first place. This means, of course, controlling the insertion, which is generally achieved by creating a dedicated class (with the vector as attribute).
2. Whenever the property is needed, check for duplicates.
I must admit I would lean toward the first. Encapsulation, clear separation of responsibilities and all that.
Anyway, there are a number of ways depending on the requirements. The first question is:
do we have to keep the elements in the vector in a particular order or can we "mess" with them ?
If we can mess with them, I would suggest keeping the vector sorted: Loki::AssocVector should get you started.
If not, then we need to keep an index on the structure to ensure this property... wait a minute: Boost.MultiIndex to the rescue ?
Thirdly: as you remarked yourself, a simple nested linear search yields O(N^2) complexity on average, which is no good.
If < is already defined, then sorting is obvious, with its O(N log N) complexity.
It might also be worth it to make T Hashable, because a std::tr1::unordered_set could yield a better time (I know, you need a RandomAccessIterator, but if T is Hashable then it's easy to have T* Hashable too ;) )
But in the end the real issue here is that our advice is necessarily generic because we lack data.
What is T, do you intend the algorithm to be generic ?
What is the number of elements ? 10, 100, 10.000, 1.000.000 ? Because asymptotic complexity is kind of moot when dealing with a few hundreds....
And of course: can you ensure uniqueness at insertion time ? Can you modify the vector itself ?
Well, your first one should only take N log(N), so it clearly has the better worst-case behavior for this application.
However, you should be able to get a better best case if you check as you add things to the set:
template <class T>
bool is_unique3(vector<T> X) {
set<T> Y;
typename vector<T>::const_iterator i;
for(i=X.begin(); i!=X.end(); ++i) {
if (Y.find(*i) != Y.end()) {
return false;
}
Y.insert(*i);
}
return true;
}
This should have O(1) best case, O(N log(N)) worst case, and average case depends on the distribution of the inputs.
If the type T you store in your vector is large and copying it is costly, consider creating a vector of pointers or iterators to your vector elements. Sort it based on the element pointed to and then check for uniqueness.
You can also use the std::set for that. The template looks like this
template <class Key,class Traits=less<Key>,class Allocator=allocator<Key> > class set
I think you can provide an appropriate Traits parameter and insert raw pointers for speed, or implement a simple wrapper class for pointers with a < operator.
Don't use the constructor for inserting into the set. Use the insert method. One of its overloads has the signature
pair <iterator, bool> insert(const value_type& _Val);
By checking the result (second member) you can often detect a duplicate much more quickly than if you inserted all elements.
In the (very) special case of non-negative discrete values with a known, not too big, maximum value N, you should be able to start a bucket sort and simply check that the number of values in each bucket is below 2.
bool is_unique(const vector<int>& X, int N)
{
vector<int> buckets(N + 1, 0); // one bucket per value in [0, N]
vector<int>::const_iterator i;
for(i = X.begin(); i != X.end(); ++i)
if(++buckets[*i] > 1)
return false;
return true;
}
The complexity of this would be O(n).
Using the current C++ standard containers, you have a good solution in your first example. But if you can use a hash container, you might be able to do better, as n insertions into a hash set take expected O(n) time in total instead of O(n log n) for a standard set. Of course everything will depend on the size of n and your particular library implementation.
I have a collection of elements that I need to operate over, calling member functions on the collection:
std::vector<MyType> v;
... // vector is populated
For calling functions with no arguments it's pretty straight-forward:
std::for_each(v.begin(), v.end(), std::mem_fun_ref(&MyType::myfunc));
A similar thing can be done if there's one argument to the function I wish to call.
My problem is that I want to call a function on elements in the vector if it meets some condition. std::find_if returns an iterator to the first element meeting the conditions of the predicate.
std::vector<MyType>::iterator it =
std::find_if(v.begin(), v.end(), MyPred());
I wish to find all elements meeting the predicate and operate over them.
I've been looking at the STL algorithms for a "find_all" or "do_if" equivalent, or a way I can do this with the existing STL (such that I only need to iterate once), rather than rolling my own or simply do a standard iteration using a for loop and comparisons.
Boost Lambda makes this easy.
#include <boost/lambda/lambda.hpp>
#include <boost/lambda/bind.hpp>
#include <boost/lambda/if.hpp>
std::for_each( v.begin(), v.end(),
if_( MyPred() )[ std::mem_fun(&MyType::myfunc) ]
);
You could even do away with defining MyPred(), if it is simple. This is where lambda really shines. E.g., if MyPred meant "is divisible by 2":
std::for_each( v.begin(), v.end(),
if_( _1 % 2 == 0 )[ std::mem_fun( &MyType::myfunc ) ]
);
Update:
Doing this with the C++0x lambda syntax is also very nice (continuing with the predicate as modulo 2):
std::for_each( v.begin(), v.end(),
[](MyType& mt ) mutable
{
if( mt % 2 == 0)
{
mt.myfunc();
}
} );
At first glance this looks like a step backwards from boost::lambda syntax, however, it is better because more complex functor logic is trivial to implement with c++0x syntax... where anything very complicated in boost::lambda gets tricky quickly. Microsoft Visual Studio 2010 beta 2 currently implements this functionality.
I wrote a for_each_if() and a for_each_equal() which do what I think you're looking for.
for_each_if() takes a predicate functor to evaluate equality, and for_each_equal() takes a value of any type and does a direct comparison using operator ==. In both cases, the function you pass in is called on each element that passes the equality test.
/* ---
For each
25.1.1
template< class InputIterator, class Function, class T>
Function for_each_equal(InputIterator first, InputIterator last, const T& value, Function f)
template< class InputIterator, class Function, class Predicate >
Function for_each_if(InputIterator first, InputIterator last, Predicate pred, Function f)
Requires:
T is of type EqualityComparable (20.1.1)
Effects:
Applies f to each dereferenced iterator i in the range [first, last) where one of the following conditions hold:
1: *i == value
2: pred(*i) != false
Returns:
f
Complexity:
At most last - first applications of f
--- */
template< class InputIterator, class Function, class Predicate >
Function for_each_if(InputIterator first,
InputIterator last,
Predicate pred,
Function f)
{
for( ; first != last; ++first)
{
if( pred(*first) )
f(*first);
}
return f;
}
template< class InputIterator, class Function, class T>
Function for_each_equal(InputIterator first,
InputIterator last,
const T& value,
Function f)
{
for( ; first != last; ++first)
{
if( *first == value )
f(*first);
}
return f;
}
Is it ok to change the vector? You may want to look at the partition algorithm.
Another option would be to change your MyType::myfunc to either check the element, or to take a predicate as a parameter and use it to test the element it's operating on.
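A minimal sketch of the partition idea, assuming it is acceptable to reorder the vector. std::partition moves all elements satisfying the predicate to the front and returns the boundary, so the function can then be applied to that front range. The helper name `apply_to_matching` is hypothetical.

```cpp
#include <algorithm>
#include <vector>

// Reorders v so that elements satisfying pred come first, then applies f
// to exactly those elements. The relative order of elements is not kept.
template <class T, class Pred, class Func>
void apply_to_matching(std::vector<T>& v, Pred pred, Func f) {
    auto middle = std::partition(v.begin(), v.end(), pred);
    std::for_each(v.begin(), middle, f);
}
```

If the original order matters, std::stable_partition has the same interface but may cost more.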
std::vector<int> v, matches;
std::vector<int>::iterator i = v.begin();
MyPred my_pred;
while(true) {
i = std::find_if(i, v.end(), my_pred);
if (i == v.end())
break;
matches.push_back(*i);
++i; // advance past the match, otherwise find_if returns it again
}
For the record, while I have seen an implementation where calling end() on a list was O(n), I haven't seen any STL implementations where calling end() on a vector was anything other than O(1) -- mainly because vectors are guaranteed to have random-access iterators.
Even so, if you are worried about an inefficient end(), you can use this code:
std::vector<int> v, matches;
std::vector<int>::iterator i = v.begin(), end = v.end();
MyPred my_pred;
while(true) {
i = std::find_if(i, end, my_pred);
if (i == end)
break;
matches.push_back(*i);
++i; // advance past the match, otherwise find_if returns it again
}
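If C++11 is available, the same collection of matches can be sketched with std::copy_if, which makes a single pass and needs no manual iterator bookkeeping. The helper name `collect_matches` is hypothetical.

```cpp
#include <algorithm>
#include <iterator>
#include <vector>

// Returns copies of all elements of v that satisfy pred, in original order.
template <class T, class Pred>
std::vector<T> collect_matches(const std::vector<T>& v, Pred pred) {
    std::vector<T> matches;
    std::copy_if(v.begin(), v.end(), std::back_inserter(matches), pred);
    return matches;
}
```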
For what it's worth, for_each_if is being considered as an eventual addition to boost. It isn't hard to implement your own.
Lambda functions - the idea is to do something like this
for_each(v.begin(), v.end(), [](MyType& x){ if (Check(x)) DoStuff(x); });
You can use Boost.Foreach:
BOOST_FOREACH (vector<...>& x, v)
{
if (Check(x))
DoStuff(x);
}