std::lower_bound with skipping invalid elements - c++

I have a list of file names, with each representing a point in time. The list typically has thousands of elements. Given a time point, I'd like to convert these files names into time objects (I'm using boost::ptime), and then find the value of std::lower_bound of this time point with respect to the files names.
Example:
Filenames (with date + time, minutes increasing, with a minute for every file):
station01_20170612_030405.hdf5
station01_20170612_030505.hdf5
station01_20170612_030605.hdf5
station01_20170612_030705.hdf5
station01_20170612_030805.hdf5
station01_20170612_030905.hdf5
If I have a time-point 2017-06-12 03:06:00, then it fits here:
station01_20170612_030405.hdf5
station01_20170612_030505.hdf5
<--- The lower bound I am looking for is here
station01_20170612_030605.hdf5
station01_20170612_030705.hdf5
station01_20170612_030805.hdf5
station01_20170612_030905.hdf5
So far, everything is simple. Now the problem is that the list of files may be doped with some invalid file name, which will make the conversion to a time point fail.
Currently, I'm doing this the easy/inefficient way, and I'd like to optimize it, because this program will go on a server and the cost of operation matters. So, the stupid way is: Create a new list with time points, and only push time points that are valid:
vector<ptime> filesListTimePoints;
filesListTimePoints.reserve(filesList.size());
ptime time;
for(long i = 0; i < filesList.size(); i++) {
ErrorCode error = ConvertToTime(filesList[i], time);
if(error.errorCode() == SUCCESS)
filesListTimePoints.push_back(time);
}
//now use std::lower_bound() on filesListTimePoints
You see, the problem is that I'm using a linear solution with a problem that can be solved with O(log(N)) complexity. I don't need to convert all files or even look at all of them!
My question: How can I embed this into std::lower_bound, such that it remains with optimal complexity?
My idea of a possible solution:
On cppreference, there's a basic implementation of std::lower_bound. I'm thinking of modifying that to get a working solution. But I'm not sure what to do when a convesion fails, since this algorithm highly depends on monotonic behavior. Does this problem have a solution, even mathematically speaking?
Here's the version I'm thinking about initially:
template<class ForwardIt, class T>
ForwardIt lower_bound(ForwardIt first, ForwardIt last, const T& value)
{
ForwardIt it;
typename std::iterator_traits<ForwardIt>::difference_type count, step;
count = std::distance(first, last);
while (count > 0) {
it = first;
step = count / 2;
std::advance(it, step);
ErrorCode error = ConvertToTime(*it, time);
if(error.errorCode() == SUCCESS)
{
if (*it < value) {
first = ++it;
count -= step + 1;
}
else
count = step;
}
else {
// skip/ignore this point?
}
}
return first;
}
My ultimate solution (which might sound stupid) is to make this method a mutator of the list, and erase elements that are invalid. Is there a cleaner solution?

You can simply index by optional<ptime>. If you want to cache the converted values, consider making it a multimap<optional<ptime>, File>.
Better yet, make a datatype representing the file, and calculate the timepoint inside its constructor:
struct File {
File(std::string fname) : _fname(std::move(fname)), _time(parse_time(_fname)) { }
boost::optional<boost::posix_time::ptime> _time;
std::string _fname;
static boost::optional<boost::posix_time::ptime> parse_time(std::string const& fname) {
// return ptime or boost::none
}
};
Now, simply define operator< suitably or use e.g. boost::multi_index_container to index by _time
Further notes:
in case it wasn't clear, such a map/set will have it's own lower_bound, upper_bound and equal_range operations, and will obviously also work with std::lower_bound and friends.
there's always filter_iterator adaptor: http://www.boost.org/doc/libs/1_64_0/libs/iterator/doc/filter_iterator.html

Related

Iterating over the result of a member function

I have code that creates several object instances (each instance having a fitness value, among other things) from which I want to sample N unique objects using weighted selection based on their fitness values. All objects not sampled are then discarded (but they need to be initially created to determine their fitness value).
my current code looks something like this:
vector<Item> getItems(..) {
std::vector<Item> items
.. // generate N values for items
int N = items.size();
std::vector<double> fitnessVals;
for(auto it = items.begin(); it != items.end(); ++it)
fitnessVals.push_back(it->getFitness());
std::mt19937& rng = getRng();
for(int i = 0, i < N, ++i) {
std::discrete_distribution<int> dist(fitnessVals.begin() + i, fitnessVals.end());
unsigned int pick = dist(rng);
std::swap(fitnessVals.at(i), fitnessVals.at(pick));
std::swap(items.at(i), items.at(pick));
}
items.erase(items.begin() + N, items.end());
return items;
}
Typically ~10,000 instances are initially created, with N being ~200. The fitness value is non-negative, usually valued at ~70. It could go as high as ~3000, but higher values are increasingly more unlikely.
Is there an elegant way to get rid of the fitnessVals vector? Or perhaps a better way to do this in general? Efficiency is important, but I'm also wondering about good C++ coding practices.
If you're asking whether you can do this just with the items in your items vector, the answer is yes. The following is a rather hideous but none-the-less effective way to do that: I apologize in advance for the density.
This wraps the unsuspecting container iterator in another iterator of our own devices; one that pairs it with a member function of your choice. You may have to dance with const in this to get it to work correctly with your member function choice. That task i leave to you.
template<typename Iter, typename R>
struct memfn_iterator_s :
public std::iterator<std::input_iterator_tag, R>
{
using value_type = typename std::iterator_traits<Iter>::value_type;
memfn_iterator_s(Iter it, R(value_type::*fn)())
: m_it(it), mem_fn(fn) {}
R operator*()
{
return ((*m_it).*mem_fn)();
}
bool operator ==(const memfn_iterator_s& arg) const
{
return m_it == arg.m_it;
}
bool operator !=(const memfn_iterator_s& arg) const
{
return m_it != arg.m_it;
}
memfn_iterator_s& operator ++() { ++m_it; return *this; }
private:
R (value_type::*mem_fn)();
Iter m_it;
};
A generator function follows to create the above monstrosity:
template<typename Iter, typename R>
memfn_iterator_s<Iter,R> memfn_iterator(
Iter it,
R (std::iterator_traits<Iter>::value_type::*fn)())
{
return memfn_iterator_s<Iter,R>(it, fn);
}
What this buys you is the ability to do this:
auto it_end = memfn_iterator(items.end(), &Item::getFitness);
for(unsigned int i = 0; i < N; ++i)
{
auto it_begin = memfn_iterator(items.begin()+i, &Item::getFitness);
std::discrete_distribution<unsigned int> dist(it_begin, it_end);
std::swap(items.at(i), items.at(i+dist(rng)));
}
items.erase(items.begin() + N, items.end());
No temporary array is required. The member function is called for the respective item when required by the discrete distribution (which usually keeps it own vector of weights, and as such replicating that effort would be redundant).
Dunno if you'll get anything helpful or useful out of that, but it was fun to think about.
It's pretty nice that they have a discrete distribution in STL. As far as I know, the most efficient algorithm for sampling from a set of weighted objects (i.e., with probability proportional to weights) is the alias method. There's a Java implementation here: http://www.keithschwarz.com/interesting/code/?dir=alias-method
I suspect that's what the STL discrete_distribution uses anyway. If you're going to be calling your getItems function frequently, you might want to create a "FitnessSet" class or something so that you don't have to build your distribution every time you want to sample from the same set.
EDIT: Another suggestion... If you want to be able to delete items, you could instead store your objects in a binary tree. Each node would contain the sum of the weights in the subtree beneath it, and the objects themselves could be in the leaves. You could select an object through a series of log(N) coin tosses: at a given node, choose a random number between 0 and node.subtreeweight. If it's less than node.left.subtreeweight, go left; otherwise go right. Continue recursively until you reach a leaf.
I would try something like the following (see code comments):
#include <algorithm> // For std::swap and std::transform
#include <functional> // For std::mem_fun_ref
#include <random> // For std::discrete_distribution
#include <vector> // For std::vector
size_t
get_items(std::vector<Item>& results, const std::vector<Item>& items)
{
// Copy the items to the results vector. All operations should be
// done on it, rather than the original items vector.
results.assign(items.begin(), items.end());
// Create the fitness values vector, immediately allocating
// the number of doubles required to match the size of the
// input item vector.
std::vector<double> fitness_vals(results.size());
// Use some STL "magic" ...
// This will iterate over the items vector, calling the
// getFitness() method on each item, and storing the result
// in the fitness_vals vector.
std::transform(results.begin(), results.end(),
fitness_vals.begin(),
std::mem_fun_ref(&Item::getFitness));
//
std::mt19937& rng = getRng();
for (size_t i=0; i < results.size(); ++i) {
std::discrete_distribution<int> dist(fitness_vals.begin() + i, fitness_vals.end());
unsigned int pick = dist(rng);
std::swap(fitness_vals[ii], fitness_vals[pick]);
std::swap(results[i], results[pick]);
}
return (results.size());
}
Instead of returning the results vector, the caller provides a vector into which the results should be added. Also, the original vector (passed as the second parameter) remains unchanged. If this is not something that concerns you, you can always pass just the one vector and work with it directly.
I don't see a way to not have the fitness values vector; the discrete_distribution constructor needs to have the begin and end iterators, so from what I can tell, you will need to have this vector.
The rest of it is basically the same, with the return value being the number of items in the result vector, rather than the vector itself.
This example makes use of a number of STL features (algorithms, containers, functors) which I have found to be useful and part of my day-to-day development.
Edit: the call to items.erase() is superfluous; items.begin() + N where N == items.size() is equivalent to items.end(). The call to items.erase() would equate to a no-op.

Insert multiple values into vector

I have a std::vector<T> variable. I also have two variables of type T, the first of which represents the value in the vector after which I am to insert, while the second represents the value to insert.
So lets say I have this container: 1,2,1,1,2,2
And the two values are 2 and 3 with respect to their definitions above. Then I wish to write a function which will update the container to instead contain:
1,2,3,1,1,2,3,2,3
I am using c++98 and boost. What std or boost functions might I use to implement this function?
Iterating over the vector and using std::insert is one way, but it gets messy when one realizes that you need to remember to hop over the value you just inserted.
This is what I would probably do:
vector<T> copy;
for (vector<T>::iterator i=original.begin(); i!=original.end(); ++i)
{
copy.push_back(*i);
if (*i == first)
copy.push_back(second);
}
original.swap(copy);
Put a call to reserve in there if you want. You know you need room for at least original.size() elements. You could also do an initial iteraton over the vector (or use std::count) to determine the exact amount of elements to reserve, but without testing, I don't know whether that would improve performance.
I propose a solution that works in place and in O(n) in memory and O(2n) time. Instead of O(n^2) in time by the solution proposed by Laethnes and O(2n) in memory by the solution proposed by Benjamin.
// First pass, count elements equal to first.
std::size_t elems = std::count(data.begin(), data.end(), first);
// Resize so we'll add without reallocating the elements.
data.resize(data.size() + elems);
vector<T>::reverse_iterator end = data.rbegin() + elems;
// Iterate from the end. Move elements from the end to the new end (and so elements to insert will have some place).
for(vector<T>::reverse_iterator new_end = data.rbegin(); end != data.rend() && elems > 0; ++new_end,++end)
{
// If the current element is the one we search, insert second first. (We iterate from the end).
if(*end == first)
{
*new_end = second;
++new_end;
--elems;
}
// Copy the data to the end.
*new_end = *end;
}
This algorithm may be buggy but the idea is to copy only once each elements by:
Firstly count how much elements we'll need to insert.
Secondly by going though the data from the end and moving each elements to the new end.
This is what I probably would do:
typedef ::std::vector<int> MyList;
typedef MyList::iterator MyListIter;
MyList data;
// ... fill data ...
const int searchValue = 2;
const int addValue = 3;
// Find first occurence of searched value
MyListIter iter = ::std::find(data.begin(), data.end(), searchValue);
while(iter != data.end())
{
// We want to add our value after searched one
++iter;
// Insert value and return iterator pointing to the inserted position
// (original iterator is invalid now).
iter = data.insert(iter, addValue);
// This is needed only if we want to be sure that out value won't be used
// - for example if searchValue == addValue is true, code would create
// infinite loop.
++iter;
// Search for next value.
iter = ::std::find(iter, data.end(), searchValue);
}
but as you can see, I couldn't avoid the incrementation you mentioned. But I don't think that would be bad thing: I would put this code to separate functions (probably in some kind of "core/utils" module) and - of course - implement this function as template, so I would write it only once - only once worrying about incrementing value is IMHO acceptable. Very acceptable.
template <class ValueType>
void insertAfter(::std::vector<ValueType> &io_data,
const ValueType &i_searchValue,
const ValueType &i_insertAfterValue);
or even better (IMHO)
template <class ListType, class ValueType>
void insertAfter(ListType &io_data,
const ValueType &i_searchValue,
const ValueType &i_insertAfterValue);
EDIT:
well, I would solve problem little different way: first count number of the searched value occurrence (preferably store in some kind of cache which can be kept and used repeatably) so I could prepare array before (only one allocation) and used memcpy to move original values (for types like int only, of course) or memmove (if the vector allocated size is sufficient already).
In place, O(1) additional memory and O(n) time (Live at Coliru):
template <typename T, typename A>
void do_thing(std::vector<T, A>& vec, T target, T inserted) {
using std::swap;
typedef typename std::vector<T, A>::size_type size_t;
const size_t occurrences = std::count(vec.begin(), vec.end(), target);
if (occurrences == 0) return;
const size_t original_size = vec.size();
vec.resize(original_size + occurrences, inserted);
for(size_t i = original_size - 1, end = i + occurrences; i > 0; --i, --end) {
if (vec[i] == target) {
--end;
}
swap(vec[i], vec[end]);
}
}

How to efficiently select a random element from a std::set

How can I efficiently select a random element from a std::set?
A std::set::iterator is not a random access iterator. So I can't directly index a randomly chosen element like I could for a std::deque or std::vector
I could take the iterator returned from std::set::begin() and increment it a random number of times in the range [0,std::set::size()), but that seems to be doing a lot of unnecessary work. For an "index" close to the set's size, I would end up traversing the entire first half of the internal tree structure, even though it's already known the element won't be found there.
Is there a better approach?
In the name of efficiency, I am willing to define "random" as less random than whatever approach I might have used to choose a random index in a vector. Call it "reasonably random".
Edit...
Many insightful answers below.
The short version is that even though you can find a specific element in log(n) time, you can't find an arbitrary element in that time through the std::set interface.
Use boost::container::flat_set instead:
boost::container::flat_set<int> set;
// ...
auto it = set.begin() + rand() % set.size();
Insertions and deletions become O(N) though, I don't know if that's a problem. You still have O(log N) lookups, and the fact that the container is contiguous gives an overall improvement that often outweighs the loss of O(log N) insertions and deletions.
What about a predicate for find (or lower_bound) which causes a random tree traversal? You'd have to tell it the size of the set so it could estimate the height of the tree and sometimes terminate before leaf nodes.
Edit: I realized the problem with this is that std::lower_bound takes a predicate but does not have any tree-like behavior (internally it uses std::advance which is discussed in the comments of another answer). std::set<>::lower_bound uses the predicate of the set, which cannot be random and still have set-like behavior.
Aha, you can't use a different predicate, but you can use a mutable predicate. Since std::set passes the predicate object around by value you must use a predicate & as the predicate so you can reach in and modify it (setting it to "randomize" mode).
Here's a quasi-working example. Unfortunately I can't wrap my brain around the right random predicate so my randomness is not excellent, but I'm sure someone can figure that out:
#include <iostream>
#include <set>
#include <stdlib.h>
#include <time.h>
using namespace std;
template <typename T>
struct RandomPredicate {
RandomPredicate() : size(0), randomize(false) { }
bool operator () (const T& a, const T& b) {
if (!randomize)
return a < b;
int r = rand();
if (size == 0)
return false;
else if (r % size == 0) {
size = 0;
return false;
} else {
size /= 2;
return r & 1;
}
}
size_t size;
bool randomize;
};
int main()
{
srand(time(0));
RandomPredicate<int> pred;
set<int, RandomPredicate<int> & > s(pred);
for (int i = 0; i < 100; ++i)
s.insert(i);
pred.randomize = true;
for (int i = 0; i < 100; ++i) {
pred.size = s.size();
set<int, RandomPredicate<int> >::iterator it = s.lower_bound(0);
cout << *it << endl;
}
}
My half-baked randomness test is ./demo | sort -u | wc -l to see how many unique integers I get out. With a larger sample set try ./demo | sort | uniq -c | sort -n to look for unwanted patterns.
If you could access the underlying red-black tree (assuming that one exists) then you could access a random node in O(log n) choosing L/R as the successive bits of a ceil(log2(n))-bit random integer. However, you can't, as the underlying data structure is not exposed by the standard.
Xeo's solution of placing iterators in a vector is O(n) time and space to set up, but amortized constant overall. This compares favourably to std::next, which is O(n) time.
You can use the std::advance method:
set <int> myset;
//insert some elements into myset
int rnd = rand() % myset.size();
set <int> :: const_iterator it(myset.begin());
advance(it, rnd);
//now 'it' points to your random element
Another way to do this, probably less random:
int mini = *myset().begin(), maxi = *myset().rbegin();
int rnd = rand() % (maxi - mini + 1) + mini;
int rndresult = *myset.lower_bound(rnd);
If either the set doesn't update frequently or you don't need to run this algorithm frequently, keep a mirrored copy of the data in a vector (or just copy the set to a vector on need) and randomly select from that.
Another approach, as seen in a comment, is to keep a vector of iterators into the set (they're only invalidated on element deletion for sets) and randomly select an iterator.
Finally if you don't need a tree-based set, you could use vector or deque as your underlying container and sort/unique-ify when needed.
You can do this by maintaining a normal array of values; when you insert to the set, you append the element to the end of the array (O(1)), then when you want to generate a random number you can grab it from the array in O(1) as well.
The issue comes when you want to remove elements from the array. The most naive method would take O(n), which might be efficient enough for your needs. However, this can be improved to O(log n) using the following method;
Keep, for each index i in the array, prfx[i], which represents the number of non-deleted elements in the range 0...i in the array. Keep a segment tree, where you keep the maximum prfx[i] contained in each range.
Updating the segment tree can be done in O(log n) per deletion. Now, when you want to access the random number, you query the segment tree to find the "real" index of the number (by finding the earliest range in which the maximum prfx is equal to the random index). This makes the random-number generation of complexity O(log n).
Average O(1)/O(log N) (hashable/unhashable) insert/delete/sample with off-the-shelf containers
The idea is simple: use rejection sampling while upper bounding the rejection rate, which is achievable with a amortized O(1) compaction operation.
However, unlike solutions based on augmented trees, this approach cannot be extended to support weighted sampling.
template <typename T>
class UniformSamplingSet {
size_t max_id = 0;
std::unordered_set<size_t> unused_ids;
std::unordered_map<size_t, T> id2value;
std::map<T, size_t> value2id;
void compact() {
size_t id = 0;
std::map<T, size_t> new_value2id;
std::unordered_map<size_t, T> new_id2value;
for (auto [_, value] : id2value) {
new_value2id.emplace(value, id);
new_id2value.emplace(id, value);
++id;
}
max_id = id;
unused_ids.clear();
std::swap(id2value, new_id2value);
std::swap(value2id, new_value2id);
}
public:
size_t size() {
return id2value.size();
}
void insert(const T& value) {
size_t id;
if (!unused_ids.empty()) {
id = *unused_ids.begin();
unused_ids.erase(unused_ids.begin());
} else {
id = max_id++;
}
if (!value2id.emplace(value, id).second) {
unused_ids.insert(id);
} else {
id2value.emplace(id, value);
}
}
void erase(const T& value) {
auto it = value2id.find(value);
if (it == value2id.end()) return;
unused_ids.insert(it->second);
id2value.erase(it->second);
value2id.erase(it);
if (unused_ids.size() * 2 > max_id) {
compact();
};
}
// uniform(n): uniform random in [0, n)
template <typename F>
T sample(F&& uniform) {
size_t i;
do { i = uniform(max_id); } while (unused_ids.find(i) != unused_ids.end());
return id2value.at(i);
}

Determining if an unordered vector<T> has all unique elements

Profiling my cpu-bound code has suggested I that spend a long time checking to see if a container contains completely unique elements. Assuming that I have some large container of unsorted elements (with < and = defined), I have two ideas on how this might be done:
The first using a set:
template <class T>
bool is_unique(vector<T> X) {
set<T> Y(X.begin(), X.end());
return X.size() == Y.size();
}
The second looping over the elements:
template <class T>
bool is_unique2(vector<T> X) {
typename vector<T>::iterator i,j;
for(i=X.begin();i!=X.end();++i) {
for(j=i+1;j!=X.end();++j) {
if(*i == *j) return 0;
}
}
return 1;
}
I've tested them the best I can, and from what I can gather from reading the documentation about STL, the answer is (as usual), it depends. I think that in the first case, if all the elements are unique it is very quick, but if there is a large degeneracy the operation seems to take O(N^2) time. For the nested iterator approach the opposite seems to be true, it is lighting fast if X[0]==X[1] but takes (understandably) O(N^2) time if all the elements are unique.
Is there a better way to do this, perhaps a STL algorithm built for this very purpose? If not, are there any suggestions eek out a bit more efficiency?
Your first example should be O(N log N) as set takes log N time for each insertion. I don't think a faster O is possible.
The second example is obviously O(N^2). The coefficient and memory usage are low, so it might be faster (or even the fastest) in some cases.
It depends what T is, but for generic performance, I'd recommend sorting a vector of pointers to the objects.
template< class T >
bool dereference_less( T const *l, T const *r )
{ return *l < *r; }
template <class T>
bool is_unique(vector<T> const &x) {
vector< T const * > vp;
vp.reserve( x.size() );
for ( size_t i = 0; i < x.size(); ++ i ) vp.push_back( &x[i] );
sort( vp.begin(), vp.end(), ptr_fun( &dereference_less<T> ) ); // O(N log N)
return adjacent_find( vp.begin(), vp.end(),
not2( ptr_fun( &dereference_less<T> ) ) ) // "opposite functor"
== vp.end(); // if no adjacent pair (vp_n,vp_n+1) has *vp_n < *vp_n+1
}
or in STL style,
template <class I>
bool is_unique(I first, I last) {
typedef typename iterator_traits<I>::value_type T;
…
And if you can reorder the original vector, of course,
template <class T>
bool is_unique(vector<T> &x) {
sort( x.begin(), x.end() ); // O(N log N)
return adjacent_find( x.begin(), x.end() ) == x.end();
}
You must sort the vector if you want to quickly determine if it has only unique elements. Otherwise the best you can do is O(n^2) runtime or O(n log n) runtime with O(n) space. I think it's best to write a function that assumes the input is sorted.
template<class Fwd>
bool is_unique(In first, In last)
{
return adjacent_find(first, last) == last;
}
then have the client sort the vector, or a make a sorted copy of the vector. This will open a door for dynamic programming. That is, if the client sorted the vector in the past then they have the option to keep and refer to that sorted vector so they can repeat this operation for O(n) runtime.
The standard library has std::unique, but that would require you to make a copy of the entire container (note that in both of your examples you make a copy of the entire vector as well, since you unnecessarily pass the vector by value).
template <typename T>
bool is_unique(std::vector<T> vec)
{
std::sort(vec.begin(), vec.end());
return std::unique(vec.begin(), vec.end()) == vec.end();
}
Whether this would be faster than using a std::set would, as you know, depend :-).
Is it infeasible to just use a container that provides this "guarantee" from the get-go? Would it be useful to flag a duplicate at the time of insertion rather than at some point in the future? When I've wanted to do something like this, that's the direction I've gone; just using the set as the "primary" container, and maybe building a parallel vector if I needed to maintain the original order, but of course that makes some assumptions about memory and CPU availability...
For one thing you could combine the advantages of both: stop building the set, if you have already discovered a duplicate:
template <class T>
bool is_unique(const std::vector<T>& vec)
{
std::set<T> test;
for (typename std::vector<T>::const_iterator it = vec.begin(); it != vec.end(); ++it) {
if (!test.insert(*it).second) {
return false;
}
}
return true;
}
BTW, Potatoswatter makes a good point that in the generic case you might want to avoid copying T, in which case you might use a std::set<const T*, dereference_less> instead.
You could of course potentially do much better if it wasn't generic. E.g if you had a vector of integers of known range, you could just mark in an array (or even bitset) if an element exists.
You can use std::unique, but it requires the range to be sorted first:
template <class T>
bool is_unique(vector<T> X) {
std::sort(X.begin(), X.end());
return std::unique(X.begin(), X.end()) == X.end();
}
std::unique modifies the sequence and returns an iterator to the end of the unique set, so if that's still the end of the vector then it must be unique.
This runs in nlog(n); the same as your set example. I don't think you can theoretically guarantee to do it faster, although using a C++0x std::unordered_set instead of std::set would do it in expected linear time - but that requires that your elements be hashable as well as having operator == defined, which might not be so easy.
Also, if you're not modifying the vector in your examples, you'd improve performance by passing it by const reference, so you don't make an unnecessary copy of it.
If I may add my own 2 cents.
First of all, as #Potatoswatter remarked, unless your elements are cheap to copy (built-in/small PODs) you'll want to use pointers to the original elements rather than copying them.
Second, there are 2 strategies available.
Simply ensure there is no duplicate inserted in the first place. This means, of course, controlling the insertion, which is generally achieved by creating a dedicated class (with the vector as attribute).
Whenever the property is needed, check for duplicates
I must admit I would lean toward the first. Encapsulation, clear separation of responsibilities and all that.
Anyway, there are a number of ways depending on the requirements. The first question is:
do we have to let the elements in the vector in a particular order or can we "mess" with them ?
If we can mess with them, I would suggest keeping the vector sorted: Loki::AssocVector should get you started.
If not, then we need to keep an index on the structure to ensure this property... wait a minute: Boost.MultiIndex to the rescue ?
Thirdly: as you remarked yourself a simple linear search doubled yield a O(N2) complexity in average which is no good.
If < is already defined, then sorting is obvious, with its O(N log N) complexity.
It might also be worth it to make T Hashable, because a std::tr1::hash_set could yield a better time (I know, you need a RandomAccessIterator, but if T is Hashable then it's easy to have T* Hashable to ;) )
But in the end the real issue here is that our advises are necessary generic because we lack data.
What is T, do you intend the algorithm to be generic ?
What is the number of elements ? 10, 100, 10.000, 1.000.000 ? Because asymptotic complexity is kind of moot when dealing with a few hundreds....
And of course: can you ensure unicity at insertion time ? Can you modify the vector itself ?
Well, your first one should only take N log(N), so it's clearly the better worse case scenario for this application.
However, you should be able to get a better best case if you check as you add things to the set:
template <class T>
bool is_unique3(vector<T> X) {
set<T> Y;
typename vector<T>::const_iterator i;
for(i=X.begin(); i!=X.end(); ++i) {
if (Y.find(*i) != Y.end()) {
return false;
}
Y.insert(*i);
}
return true;
}
This should have O(1) best case, O(N log(N)) worst case, and average case depends on the distribution of the inputs.
If the type T You store in Your vector is large and copying it is costly, consider creating a vector of pointers or iterators to Your vector elements. Sort it based on the element pointed to and then check for uniqueness.
You can also use the std::set for that. The template looks like this
template <class Key,class Traits=less<Key>,class Allocator=allocator<Key> > class set
I think You can provide appropriate Traits parameter and insert raw pointers for speed or implement a simple wrapper class for pointers with < operator.
Don't use the constructor for inserting into the set. Use insert method. The method (one of overloads) has a signature
pair <iterator, bool> insert(const value_type& _Val);
By checking the result (second member) You can often detect the duplicate much quicker, than if You inserted all elements.
In the (very) special case of sorting discrete values with a known, not too big, maximum value N.
You should be able to start a bucket sort and simply check that the number of values in each bucket is below 2.
bool is_unique(const vector<int>& X, int N)
{
vector<int> buckets(N,0);
typename vector<int>::const_iterator i;
for(i = X.begin(); i != X.end(); ++i)
if(++buckets[*i] > 1)
return false;
return true;
}
The complexity of this would be O(n).
Using the current C++ standard containers, you have a good solution in your first example. But if you can use a hash container, you might be able to do better, as the hash set will be nO(1) instead of nO(log n) for a standard set. Of course everything will depend on the size of n and your particular library implementation.

Where can I get a "useful" C++ binary search algorithm?

I need a binary search algorithm that is compatible with the C++ STL containers, something like std::binary_search in the standard library's <algorithm> header, but I need it to return the iterator that points at the result, not a simple boolean telling me if the element exists.
(On a side note, what the hell was the standard committee thinking when they defined the API for binary_search?!)
My main concern here is that I need the speed of a binary search, so although I can find the data with other algorithms, as mentioned below, I want to take advantage of the fact that my data is sorted to get the benefits of a binary search, not a linear search.
so far lower_bound and upper_bound fail if the datum is missing:
//lousy pseudo code
vector(1,2,3,4,6,7,8,9,0) //notice no 5
iter = lower_bound_or_upper_bound(start,end,5)
iter != 5 && iter !=end //not returning end as usual, instead it'll return 4 or 6
Note: I'm also fine using an algorithm that doesn't belong to the std namespace as long as its compatible with containers. Like, say, boost::binary_search.
There is no such functions, but you can write a simple one using std::lower_bound, std::upper_bound or std::equal_range.
A simple implementation could be
template<class Iter, class T>
Iter binary_find(Iter begin, Iter end, T val)
{
// Finds the lower bound in at most log(last - first) + 1 comparisons
Iter i = std::lower_bound(begin, end, val);
if (i != end && !(val < *i))
return i; // found
else
return end; // not found
}
Another solution would be to use a std::set, which guarantees the ordering of the elements and provides a method iterator find(T key) that returns an iterator to the given item. However, your requirements might not be compatible with the use of a set (for example if you need to store the same element multiple times).
You should have a look at std::equal_range. It will return a pair of iterators to the range of all results.
There is a set of them:
http://www.sgi.com/tech/stl/table_of_contents.html
Search for:
lower_bound
upper_bound
equal_range
binary_search
On a separate note:
They were probably thinking that searching containers could term up more than one result. But on the odd occasion where you just need to test for existence an optimized version would also be nice.
If std::lower_bound is too low-level for your liking, you might want to check boost::container::flat_multiset.
It is a drop-in replacement for std::multiset implemented as a sorted vector using binary search.
The shortest implementation, wondering why it's not included in the standard library:
template<class ForwardIt, class T, class Compare=std::less<>>
ForwardIt binary_find(ForwardIt first, ForwardIt last, const T& value, Compare comp={})
{
// Note: BOTH type T and the type after ForwardIt is dereferenced
// must be implicitly convertible to BOTH Type1 and Type2, used in Compare.
// This is stricter than lower_bound requirement (see above)
first = std::lower_bound(first, last, value, comp);
return first != last && !comp(value, *first) ? first : last;
}
From https://en.cppreference.com/w/cpp/algorithm/lower_bound
int BinarySearch(vector<int> array,int var)
{
//array should be sorted in ascending order in this case
int start=0;
int end=array.size()-1;
while(start<=end){
int mid=(start+end)/2;
if(array[mid]==var){
return mid;
}
else if(var<array[mid]){
end=mid-1;
}
else{
start=mid+1;
}
}
return 0;
}
Example: Consider an array, A=[1,2,3,4,5,6,7,8,9]
Suppose you want to search the index of 3
Initially, start=0 and end=9-1=8
Now, since start<=end; mid=4; (array[mid] which is 5) !=3
Now, 3 lies to the left of mid as its smaller than 5. Therefore, we only search the left part of the array
Hence, now start=0 and end=3; mid=2.Since array[mid]==3, hence we got the number we were searching for. Hence, we return its index which is equal to mid.
Check this function, qBinaryFind:
RandomAccessIterator qBinaryFind ( RandomAccessIterator begin, RandomAccessIterator end, const T & value )
Performs a binary search of the range
[begin, end) and returns the position
of an occurrence of value. If there
are no occurrences of value, returns
end.
The items in the range [begin, end)
must be sorted in ascending order; see
qSort().
If there are many occurrences of the
same value, any one of them could be
returned. Use qLowerBound() or
qUpperBound() if you need finer
control.
Example:
QVector<int> vect;
vect << 3 << 3 << 6 << 6 << 6 << 8;
QVector<int>::iterator i =
qBinaryFind(vect.begin(), vect.end(), 6);
// i == vect.begin() + 2 (or 3 or 4)
The function is included in the <QtAlgorithms> header which is a part of the Qt library.
std::lower_bound() :)
A solution returning the position inside the range could be like this, using only operations on iterators (it should work even if iterator does not arithmetic):
template <class InputIterator, typename T>
size_t BinarySearchPos(InputIterator first, InputIterator last, const T& val)
{
const InputIterator beginIt = first;
InputIterator element = first;
size_t p = 0;
size_t shift = 0;
while((first <= last))
{
p = std::distance(beginIt, first);
size_t u = std::distance(beginIt, last);
size_t m = p + (u-p)/2; // overflow safe (p+u)/2
std::advance(element, m - shift);
shift = m;
if(*element == val)
return m; // value found at position m
if(val > *element)
first = element++;
else
last = element--;
}
// if you are here the value is not present in the list,
// however if there are the value should be at position u
// (here p==u)
return p;
}